Deep Statistical Analysis of OCR Errors for Effective Post-OCR Processing

Post-OCR processing is an important step that follows optical character recognition (OCR) and is meant to improve the quality of OCRed documents by detecting and correcting residual errors. This paper describes the results of a statistical analysis of OCR errors on four document collections. Five aspects of general OCR errors are studied and compared with human-generated misspellings: edit operations, length effects, erroneous character positions, real-word vs. non-word errors, and word boundaries. Based on the observations from this analysis, we give several suggestions related to the design and implementation of effective OCR post-processing approaches.


INTRODUCTION
In an effort to preserve and provide easy access to past documents, optical character recognition (OCR) techniques have been developed to transform paper-based documents into digital ones. However, varied layouts and the poor physical quality of degraded documents pose big challenges to OCR engines. Post-OCR processing is crucial for improving the quality of OCRed documents by detecting and correcting errors.
Although OCR errors share some common features with spelling errors, they have their own special characteristics, as they are created by different processes. Naturally, a better understanding of OCR errors can help to create better post-OCR approaches. However, to date, few analyses have been done to uncover common characteristics of OCR errors, and all of them have stayed at a coarse level [13,23]. This paper therefore reports the results of analyzing various characteristics of OCR errors on popular public datasets and compares them with misspellings. In particular, edit operation types and edit distances are considered. In addition, we concentrate not only on word lengths but also on OCR token lengths. Moreover, positions of incorrect characters and real-word vs. non-word errors are analyzed. Problems related to the wrong deletion/insertion of white spaces (word boundaries) are also examined.
For the analysis, we utilize four public English datasets along with their ground truth data. Two of them come from the English part of the Post-OCR text correction competition dataset [4], the largest public aligned dataset of this kind [5]. The two others are the Overproof evaluation datasets [7]. While other datasets contain synthetic data or are private, these collections (and their manual GT) include OCR texts of old documents collected from well-known libraries and are publicly available. Our analysis should be beneficial for researchers and practitioners, helping them better understand the strengths as well as the weaknesses of their approaches. Based on the reported results, we also provide guidelines for building more effective post-processing approaches.
To sum up, we make the following contributions in this paper.
(1) Firstly, we analyze OCR errors and compare them with human-generated misspellings in several aspects. Our analysis forms the basis for better judging the pros and cons of post-OCR approaches and for improving their performance. (2) Secondly, we also provide statistics on some aspects beyond the typical ones used to characterize spelling errors [13], such as string similarities between errors and their ground truth words based on the Longest Common Subsequence (LCS), OCR token lengths, and different erroneous character positions.
(3) To provide clearer views of OCR errors, novel error type classifications are proposed. In particular, we review the challenges of correcting short-word/long-word errors with large/small edit distances by grouping errors according to word length. In addition, real-word/non-word errors are also categorized according to word-boundary problems. (4) Finally, based on our observations, we offer several suggestions for designing OCR post-processing techniques, such as ones related to edit distance thresholds, frequent edit operation types, erroneous character positions, etc.
The remainder of this paper is organized as follows. We introduce the four datasets we work on in Section 2. Then, Section 3 surveys related work. In Section 4, we analyze OCR errors and give many useful statistics. After that, a summary of our major findings is given in Section 5. Finally, conclusions are discussed in Section 6.

DATASETS
The four analyzed datasets are public collections of historical documents obtained from four libraries.
The first two datasets come from the ICDAR2017 Post-OCR text correction competition [4]. The competition data contains OCR-processed text of old English and French documents from two national libraries, the National Library of France (BnF) and the British Library (BL). The corresponding ground truth (GT) was created by different projects (such as Gutenberg and Europeana Newspapers). In this paper, we focus on the English OCR text of this multilingual dataset, which consists of 813 files belonging to two types: monograph and periodical. The competition organizers divided this English OCR text into two datasets, Monograph and Periodical. There is no information about which OCR engines were used to generate the OCR text of the competition dataset.
The two others are the Overproof evaluation datasets [7]. The first one (denoted as OverNLA) consists of 159 medium-length news articles with at least 85% correct lines, extracted from one of the longest-running titles in the National Library of Australia's Trove newspaper archive, The Sydney Morning Herald (1842-1954). Its GT was additionally corrected by Evershed et al. [7] after crowdsourced corrections [8]. The second one (denoted as OverLC) consists of 49 medium-length news articles randomly selected from 5 titles of the Library of Congress Chronicling America newspaper archive; its GT was manually corrected by Evershed et al. [7]. Both Overproof datasets are noisier than the competition ones. Their combined size is 208 articles/files, and they were processed by ABBYY FineReader, a state-of-the-art commercial OCR system.
The four datasets thus contain OCR texts of past documents from well-known libraries (the National Library of France, the British Library, the National Library of Australia, and the Library of Congress). The included documents are characterized by varying levels of degradation under independent conservation and originate from a relatively wide time range, spanning from 1744 to 1954. Altogether, the datasets are therefore representative of historical OCR texts with typical OCR errors. The sources, types, years, word error rates (W.E.R), sizes and file counts of all four datasets are listed in Table 1.

RELATED WORK
This paper studies OCR errors and compares them with human-generated misspellings. Our observations are then used to draw several suggestions for designing OCR post-processing methods. Consequently, in the two following subsections, we review work related to misspellings, OCR errors and post-OCR approaches.

Misspellings and OCR errors
Due to certain shared features between misspellings and OCR errors, an overview of misspelled words can give basic intuitions about OCR errors. Kukich [13] made a coarse-grained survey of spelling error characteristics and automatic spelling correctors. Similar features of misspellings were described in [22,23]. Spelling errors have been studied from the viewpoint of basic edit operation types, word length effects, first-position errors, non-word/real-word errors, and word boundaries. Firstly, depending on edit distance, there are single-error tokens with an edit distance of 1 (e.g. 'school' vs. 'schopl') and multi-error tokens with a higher edit distance (e.g. 'school' vs. 'schopi'). Damerau [6] and Mitton [17] indicated that single-error typos accounted for around 80% and 69% of misspellings, respectively. Thus, the average rate of single-error typos can be taken as 74.5%.
Secondly, word lengths have also been considered from the viewpoint of misspelling tendencies. Errors were examined with respect to whether they appear in short words (defined as words of 2, 3 and 4 characters) or in longer words. Let us call errors involving short words short-word errors. Kukich [12] found that 63% of errors involved short words.
Thirdly, misspellings can occur at the first character (e.g. 'world' vs. 'uorld') or at other characters (e.g. 'world' vs. 'workd', 'world' vs. 'worlh'). Mitton [17] reported that 7% of the misspellings in his dataset appeared at the first character. In the dataset of Kukich [12], that proportion was 15%. The average rate of first-position errors can thus be considered to be around 11% of misspellings.
Next, if a token is not a lexicon entry, it is deemed a non-word error. In this case, the detection of an error depends on the coverage and quality of the particular lexicon used. If a valid word occurs in a wrong context, it is considered a real-word error. For example, in the two phrases 'glow-worm candles' vs. 'glow-wonn candies', 'glow-wonn' is a non-word error and 'candies' is a real-word error. Studies on different datasets reported different rates of real-word errors. Mitton [17] revealed that 40% of misspelled words involved real-word errors. Young et al. [28] showed that the rate of real-word errors in their corpus was 25%. On average, one can thus assume that 67.5% of misspellings are non-word errors.
As to the problem of word boundaries, wrongly deleting or inserting white spaces results in incorrect split errors (e.g. 'depend' vs. 'de pend') and run-on errors (e.g. 'is said' vs. 'issaid'). In the corpus of Kukich [12], the percentage of word boundary spelling errors was 15%, with 13% run-on errors and 2% incorrect split errors. Moreover, Kukich mentioned that OCR text tends to split tokens rather than to join them.
While Kukich mainly focused on spelling errors, Nagy et al. [18] concentrated on examining selected examples of erroneous OCR tokens. Their work pointed out possible causes of OCR errors, including imaging defects, similar symbols, punctuation, and typography, and gave several potential solutions. However, it did not provide any detailed statistics on each source of OCR errors.

OCR post-processing approaches
A typical post-processing approach consists of two steps: detecting and correcting errors. For the detection task, dictionaries and character n-gram models are often used to detect non-word errors. For the correction task, a list of candidates is generated for each OCR error, based on different sources at the character and word levels. The best candidate becomes the correction in an automatic mode, or the top n candidates are suggested for correcting the error in a semi-automatic mode.
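To make the two-step pipeline above concrete, the following is a minimal, isolated-word sketch; it is not the method of any system reviewed here. The tiny lexicon with unigram frequencies, the edit-distance-1 candidate generator, and the frequency-based ranker are all illustrative assumptions.

```python
# Minimal sketch of a detect-then-correct post-OCR pipeline.
# The tiny lexicon (word -> unigram frequency) is a toy placeholder.
LEXICON = {"school": 120, "scholar": 15, "is": 500, "said": 90}

def edits1(token):
    """All strings at edit distance 1 (deletion, insertion, substitution)."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(token[:i], token[i:]) for i in range(len(token) + 1)]
    deletes = {a + b[1:] for a, b in splits if b}
    inserts = {a + c + b for a, b in splits for c in letters}
    subs = {a + c + b[1:] for a, b in splits if b for c in letters}
    return deletes | inserts | subs

def correct(token):
    """Return the token itself if it is a lexicon entry (a possible
    real-word error, which this isolated-word sketch cannot detect);
    otherwise rank edit-distance-1 candidates by unigram frequency."""
    if token in LEXICON:
        return token
    candidates = [c for c in edits1(token) if c in LEXICON]
    if not candidates:
        return token  # no suggestion available
    return max(candidates, key=lambda c: LEXICON[c])

print(correct("schopl"))  # -> school
```

In a semi-automatic mode, the same `candidates` list would be sorted by frequency and its top n entries shown to a human corrector instead of returning only the maximum.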
A wide range of approaches has been devoted to OCR post-processing; they can be classified into two main types: dictionary-based and context-based. The dictionary-based type aims to correct isolated-word errors and does not take the nearby context into consideration [3,20]; hence it cannot deal with real-word errors. The context-based type, which considers the grammatical and semantic contexts of errors, promises to overcome the issues of the first type. Most techniques of this type rely on noisy channel models and language models [1,15,27]. The others explore several machine learning techniques to suggest correct candidates [2,10,16].
Jones et al. [1] and Tong et al. [27] explored several features, including character n-grams, character confusion (or device mapping statistics), and word bigrams, in different ways to detect and correct erroneous OCR tokens. Using similar features, Llobet et al. [15] built an error model and a language model, and then added one more model built from character recognition confidences, called the hypothesis model. The three models were compiled separately into Weighted Finite-State Transducers (WFSTs) and then composed into a final transducer. The best token corresponded to the lowest-cost path of this final transducer. However, character recognition confidences are often missing, at least for the whole competition dataset [4] and the Overproof evaluation datasets [7].
Along with the development of machine translation techniques, some approaches considered OCR post-processing as machine translation (MT), translating OCR text into correct text in the same language. Afli et al. [2] and some approaches from the competition [4] applied machine translation techniques (from statistical MT and neural MT to hybrid MT, at the word and/or character level) to detect and correct OCR errors.
Other approaches [10,16] explored different sources to generate candidates and then ranked them using a regression model. Several features were extracted, such as confusion probability, unigram frequency, context features, term frequency in the OCR text, word confidence, and string similarity. A regression model was then used to predict the best candidate for each erroneous OCR token. Post-processing approaches thus offered different views of OCR errors; however, none of them gave a general hierarchy of OCR errors. Some mainly focused on real-word and non-word errors [27]. Other approaches considered errors with segmentation at the word or character level [10,16]. Lastly, some others [1,2,7,24-26] just gave examples of OCR errors without any detailed statistics.
In contrast to the above-discussed studies, our work focuses on analyzing OCR errors and gives detailed statistics based on four public datasets. Besides the aspects mentioned in the survey [13], we examine additional features like non-standard substitution mappings, different erroneous character positions, and OCR token lengths. Moreover, we give novel classifications and provide several suggestions for the design of post-OCR techniques.

ANALYSIS OF OCR ERRORS
In the following sections, we present five main types of analyses conducted on all the datasets.

Edit operations
In this section, we discuss edit operation types, standard/non-standard substitution mappings (denoted as standard/non-standard mappings), edit distance, and string similarity based on the LCS.

4.1.1 Edit operation types. In order to transform a token A into a token B, four basic edit operation types can be performed: deletion, insertion, substitution, and transposition [6]. Prior works [11,16,27] indicated that transposition is common in misspellings but rarely occurs in OCR errors. We therefore only consider the first three types.
Fig. 1 shows the percentages of single-modification error types (deletion, insertion and substitution, denoted as del, ins and sub, respectively) and those of their possible combinations (del+ins, del+sub, ins+sub, del+ins+sub) in all four datasets. Among the single edit operation types, the average percentage of substitution (51.6%) is much higher than that of the two others. Furthermore, the total percentage of the three single edit operation types is about 77.02%, thus higher than that of their combinations. This leads to the conclusion that post-OCR techniques can correct most errors by concentrating on a single modification type.
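As an illustration of how an error can be attributed to these operation types, the sketch below extracts the set of operation types used by one optimal Levenshtein alignment between an OCR token and its GT word. It is our own minimal implementation, not the alignment tool used for the statistics above.

```python
def edit_op_types(ocr, gt):
    """Return the set of edit operation types ('del', 'ins', 'sub') used
    by one optimal alignment transforming the OCR token into its GT word.
    'del' removes a spurious OCR character; 'ins' adds a missing one."""
    m, n = len(ocr), len(gt)
    # Wagner-Fischer dynamic programme for Levenshtein distance.
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ocr[i - 1] == gt[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,       # delete ocr[i-1]
                          d[i][j - 1] + 1,       # insert gt[j-1]
                          d[i - 1][j - 1] + cost)
    # Backtrace one optimal path and record the operation types it uses.
    ops, i, j = set(), m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i-1][j-1] + (ocr[i-1] != gt[j-1]):
            if ocr[i - 1] != gt[j - 1]:
                ops.add("sub")
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            ops.add("del")
            i -= 1
        else:
            ops.add("ins")
            j -= 1
    return ops

print(edit_op_types("schopl", "school"))   # -> {'sub'}
print(edit_op_types("scho ol", "school"))  # -> {'del'} (an incorrect split)
```

Counting how often each resulting set occurs over all aligned error pairs yields exactly the kind of per-type and per-combination percentages reported in Fig. 1.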
As to the combinations of edit operation types, deletion and insertion rarely occur together. In fact, the combinations of deletion and insertion have very small occurrence rates: 0.24% (del+ins) and 1% (del+ins+sub). Post-OCR approaches could therefore, in our opinion, pay less attention to these combinations during candidate generation.
Moreover, the average rates of OCR errors involving substitution, insertion and deletion stand in a ratio of approximately 5:1:1, which is useful information for some post-OCR approaches [7,15,19] when deciding the number of substitution/insertion character candidates for each OCR character position during candidate generation. If this number is too small, no correct candidates can be suggested. Otherwise, many incorrect candidates are created, negatively affecting the candidate ranking process.
The standard 1:1 mapping of our datasets is illustrated in Table 2. In this table, we compute, for each dataset, the percentage of the appearance frequency of each GT character being recognized as a given OCR character. Let us call this percentage the mapping percentage. In order to keep the table compact, we only show OCR characters whose mapping percentages are more than 0.1%. The other cases, whose mapping percentages are less than 0.1%, are denoted as @. Because one GT character can be recognized as 1 or n OCR characters, these other cases include OCR characters in both 1:1 and 1:n mappings. For example, the percentages of the frequency of the character 'b' in Periodical being recognized as 'b', 'h' and other characters are 96.7%, 1.6% and 1.7%, respectively. Table 2 indicates that the characters with the highest and lowest recognition accuracy are 't' and 'z', with 98.53% and 88%, respectively. Moreover, the statistics also reveal that characters sharing similar shapes are easily confused, such as 'b' vs. 'h'; 'c' vs. {'o', 'e'}; 'e' vs. {'o', 'c'}.
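The mapping percentage defined above can be computed directly from character-aligned OCR/GT pairs. The sketch below assumes such an alignment is already available and uses toy counts; it is an illustration of the computation, not the exact tooling behind Table 2.

```python
from collections import Counter, defaultdict

def mapping_percentages(aligned_pairs):
    """Compute, for each GT character, the percentage of times it was
    recognized as each OCR character (the 'mapping percentage' of
    Table 2). `aligned_pairs` is an iterable of (gt_char, ocr_char)
    tuples obtained from a character-level alignment of GT and OCR
    text, which is assumed to be given."""
    counts = defaultdict(Counter)
    for gt_c, ocr_c in aligned_pairs:
        counts[gt_c][ocr_c] += 1
    table = {}
    for gt_c, row in counts.items():
        total = sum(row.values())
        table[gt_c] = {ocr_c: 100.0 * n / total for ocr_c, n in row.items()}
    return table

# Toy aligned data: 'b' is read correctly 97 times and as 'h' 3 times.
pairs = [("b", "b")] * 97 + [("b", "h")] * 3
print(mapping_percentages(pairs)["b"])  # -> {'b': 97.0, 'h': 3.0}
```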
This standard mapping is used to create a character confusion matrix, one of the most important sources for generating and ranking candidates. Obviously, the more similar the frequent error patterns of the training and testing parts of the used datasets are, the higher the probability that the correct candidates are generated. However, OCR errors can vary with OCR engines, layouts, and the degradation levels of documents. Therefore, some very frequent characters along with their highly probable misrecognitions (e.g. 'e' vs. 'o', 'j' vs. 'i') may not occur in the large training part and only appear in the small testing part. In such cases, it is impossible to generate valid candidates for the unseen error patterns of the testing part.

Non-standard mappings.
Besides the standard 1:1 mapping, OCR errors are also subject to more complex mappings [1,12]. Differently from past related work, our study provides detailed statistics on the four popular datasets instead of only giving examples of non-standard mappings.
The second kind is the n:1 mapping, in which n GT characters are recognized as one OCR character (e.g. 'main' vs. 'mam'). The frequency rates of n GT characters being recognized as one OCR character are computed on the four datasets in Table 4. This table only shows GT character n-grams whose mapping percentages are higher than 0.01% and whose frequency is at least 10% of the maximum frequency of their n-grams. Differently from Tables 2 and 3, in Table 4 we group percentages according to OCR characters, because it would be inefficient to show many GT character n-grams in the first column. For example, in the Monograph dataset, the percentage of the appearance frequency of the GT character bigram 'li' being recognized as 'b' is 0.03%. Based on the statistics of n:1 mappings, some common patterns with their average rates emerge (shown as 1 OCR character: n GT characters), such as 'b'{'si':0.05, 'li':0.04}; 'd'{'il':0.7, 'll':0.12}; 'h'{'li':0.16, 'ly':0.1}.
Our observations on these mappings support the conclusion that some characters, 'b', 'd', 'h', 'm', 'n', are easily recognized as 'li', {'il', 'cl'}, 'li', {'rn', 'in'}, {'ri', 'ii'}, respectively. Conversely, 'li', {'il', 'cl'}, 'li', {'rn', 'in'}, {'ri', 'ii'} can be recognized as 'b', 'd', 'h', 'm', 'n', respectively. These kinds of mappings also play important roles in generating and ranking candidates. It should be noted that the statistics of these non-standard mappings are extracted from aligned OCR text and its corresponding GT. Although we make full use of the OCR text along with its corresponding GT, there is still some unavoidable noise in our statistics due to the lack of character recognition confidences from the OCR engines.

4.1.4 Edit distances. Regarding edit distances, the survey on spelling errors [13] pointed out two main types: single-error tokens (with an edit distance of 1) and multi-error tokens (with higher edit distances). Obviously, the smaller the edit distance of an error is, the easier the correction task becomes.
The percentages of errors by edit distance shown in Fig. 2 indicate that most OCR errors are single-error tokens, with approximately 58.92% occurrence. That rate is smaller than the corresponding rate for misspellings (74.5% on average).

Table 4: Percentages of the non-standard mapping n:1 (n GT characters are substituted by one OCR character). Only OCR characters that result from the misrecognition of n GT characters are listed, and only values higher than 0.01% are shown. Even though this table shows n:1 mappings, the presentation is reversed (1:n) in order to save space.

Word lengths. The rate of OCR errors involving short words (42.1%) is lower than that of misspellings, with 63% on average (Fig. 4). In addition, from the highest percentage at length 3, the percentage of incorrect word recognition decreases gradually as GT token length increases. Furthermore, around 85.27% of all OCR errors occur in words of lengths from 2 to 9.

OCR token length.
In practice, post-OCR approaches have to deal with OCR tokens instead of GT words, and the lengths of OCR tokens can differ from those of GT words; therefore we also consider the lengths of OCR tokens. For example, for the OCR tokens 'scho ol' and their GT word 'school', the two incorrect OCR tokens are 'scho' of length 4 and 'ol' of length 2. Similarly to word lengths, the analysis of incorrect OCR token lengths (see Fig. 5) suggests that incorrect OCR tokens of length 3 are the most common. In addition, about 80.55% of all invalid OCR tokens have lengths between 2 and 9.

Two-dimensional classification based on word lengths and edit distances.
There are arguments that it is more difficult to deal with short-word errors than with errors appearing in longer words, because short-word errors are more likely to yield another lexicon entry when character edit operations are applied [14].
However, we believe that the difficulty does not only result from length but also from the edit distance between an error and its GT word. For example, consider two errors, 'ict' and 'lct', and their GT word 'let'. The first error, 'ict', requires 2 edit operations to be transformed into the GT word, which is more challenging than the second error, 'lct', needing only 1 modification. To give a clear view of this problem, we suggest a novel classification that groups errors according to word lengths and edit distances. For run-on errors (e.g. 'blue sky' vs. 'blucsky'), we take the sum of the lengths of the words involved in the error as its word length.
The two-dimensional classification of the four datasets is shown in Fig. 6. Based on this classification, post-processing approaches can decide on an edit distance threshold for each word length. As mentioned in Sec. 4.1.4, around 81.49% of errors have an edit distance of 1 or 2. In other words, the maximum proportion of errors that post-processing approaches can correct is about 81.49% if the edit distance threshold is set to 2 for all word lengths.
In our opinion, by adjusting the edit distance threshold according to word length, post-OCR techniques can deal with a higher rate of errors. Based on our observations, we suggest setting edit distance thresholds of 2, 3 and 4 for word lengths of less than 4, 10 and 13, respectively. On average, these settings increase the rate of errors that post-OCR techniques can process from 81.49% to 89.15%.
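A hedged sketch of this suggestion follows: the thresholds are the ones proposed above, while the fallback of 4 for word lengths of 13 and more is our own assumption, since that case is not specified in the text.

```python
def levenshtein(a, b):
    """Plain Levenshtein distance (two-row dynamic programme)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def max_edit_distance(word_length):
    """Length-dependent thresholds suggested in the text: 2 for lengths
    below 4, 3 below 10, 4 below 13. The behaviour for lengths of 13 and
    more is not specified there; falling back to 4 is our assumption."""
    if word_length < 4:
        return 2
    if word_length < 10:
        return 3
    return 4

def filter_candidates(ocr_token, candidates):
    """Keep only candidates within the length-dependent threshold."""
    limit = max_edit_distance(len(ocr_token))
    return [c for c in candidates if levenshtein(ocr_token, c) <= limit]

print(filter_candidates("ict", ["let", "it", "interact"]))  # -> ['let', 'it']
```

With the flat threshold of 2, 'ict' would be handled identically here, but a 10-character error at edit distance 3 would lose its correct candidate; the length-dependent limit is what recovers the extra 7.66% of errors quoted above.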

Erroneous character positions
The survey on misspellings [13] has shown that only a few errors occur at the first character. However, there is no research on erroneous character positions in OCR text. Hence, we examine OCR errors at different character positions, including the first/last/middle position (denoted as first, last, nth, respectively), and their possible combinations (denoted as first+last, first+nth, last+nth, first+last+nth, respectively). In the case of run-on errors, because this error type incorrectly removes the white space at the end of the first word, we consider that it always includes one last-position error.

Figure 6: Error rates based on word lengths and edit distances
Details of the erroneous character positions of our datasets are shown in Fig. 7. While 12.46% of OCR errors are first-position errors, spelling errors have a slightly smaller percentage of such errors, with an average of 11% of all errors.
It is noticeable that, on average, 27.37% of all errors are last-position errors, which is comparable with the rate of middle-position errors (28.69%). Moreover, our observations on the four datasets indicate that erroneous characters rarely appear at both the first and the last position of the same error. In fact, the statistics show that less than 10% of errors belong to the (first+last) or (first+last+nth) combinations. Therefore, OCR post-processing can primarily focus on single positions or on some combinations (first+nth, last+nth).
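For illustration, the position labels used above (first, last, nth) can be assigned from a character alignment. The sketch below approximates that alignment with Python's difflib, which may differ from the alignment underlying Fig. 7.

```python
from difflib import SequenceMatcher

def error_positions(ocr, gt):
    """Classify which positions of the GT word are affected: 'first',
    'last', and/or 'nth' (any middle position). The character alignment
    comes from difflib's SequenceMatcher, a simplification of the
    alignment a full analysis would use."""
    bad = set()
    for tag, i1, i2, j1, j2 in SequenceMatcher(None, ocr, gt).get_opcodes():
        if tag != "equal":
            # j1 == j2 marks characters deleted from the OCR side;
            # anchor such a deletion at GT position j1 (clamped).
            bad.update(range(j1, j2) if j1 < j2 else {min(j1, len(gt) - 1)})
    labels = set()
    for p in bad:
        if p == 0:
            labels.add("first")
        elif p == len(gt) - 1:
            labels.add("last")
        else:
            labels.add("nth")
    return labels

print(error_positions("uorld", "world"))  # -> {'first'}
print(error_positions("workd", "world"))  # -> {'nth'}
```

Tallying these label sets over all error pairs gives the single-position and combination percentages of the kind reported in Fig. 7.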

Real-word vs. non-word errors
In the next analysis, we study the rates of real-word and non-word errors in OCR text. Real-word errors are valid in the dictionary but incorrect in context (e.g. 'hear' vs. 'bear'). The amount of real-word errors naturally varies with the size of the lexicon [21]. A too-small lexicon can reject valid tokens, increasing the number of false negatives. In contrast, a too-large dictionary can match invalid tokens to low-frequency lexical entries or special domain terms, potentially raising the number of false positives. In other words, the larger the lexicon is, the more real-word errors can occur.
On the other hand, non-word errors are invalid in the dictionary (e.g. 'hear' vs. 'hcar'). Obviously, non-word errors are easier to detect and correct than real-word errors. In addition, there are words which appear in the GT but are not lexicon entries, known as out-of-vocabulary (OOV) words. Using the word frequencies of the COHA corpus, we found the rate of OOV words in our datasets to be about 1%.
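The OOV rate can be estimated as sketched below; the toy lexicon stands in for the COHA-derived word list, which is an assumption of this sketch.

```python
def oov_rate(tokens, lexicon):
    """Fraction of GT tokens missing from the lexicon (OOV words).
    `lexicon` is a set of known word forms, e.g. derived from a
    word-frequency list such as COHA's (assumed available)."""
    words = [t.lower() for t in tokens if t.isalpha()]
    if not words:
        return 0.0
    missing = sum(1 for w in words if w not in lexicon)
    return missing / len(words)

# Toy lexicon; 'Gutenberg' is the only token missing from it here.
lexicon = {"the", "glow", "worm", "candles"}
print(oov_rate(["The", "glow", "worm", "candles", "Gutenberg"], lexicon))  # -> 0.2
```

Note that any token this function counts as OOV would also be flagged by a lexicon-based detector, which is exactly why OOV words inflate the false positives of non-word error detection.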
The statistics of real-word and non-word errors in Fig. 8 show that approximately 59.21% of OCR errors are real-word errors. The proportion of real-word errors in our four datasets is thus about 1.47 times higher than that of non-word ones. On the contrary, misspellings show the opposite trend, with 67.5% non-word errors.
Our observations on the four datasets also indicate that approximately 13.77% of non-word errors involve digits, and 25.08% of real-word errors relate to punctuation. The high percentage of punctuation errors is one notable feature of OCR text. In fact, the low physical quality of old documents causes misrecognition of punctuation. Therefore, OCR texts tend to contain more incorrect/redundant commas and dots than human-generated texts.

To give clearer views of OCR errors, we propose a hierarchical classification based on incorrect/correct word boundary error types and real-word/non-word error types. We firstly separate OCR errors into incorrect and correct word boundary error types. Secondly, within the incorrect word boundary error type, depending on whether white spaces are inserted or deleted, we distinguish two main sub-types: incorrect split and run-on error types. Within the correct word boundary error type, we divide errors into real-word and non-word error types. Finally, we further group the incorrect split/run-on error types into real-word and non-word error types.
The percentages of incorrect/correct word boundary types of our four datasets are shown in Fig. 9. Clearly, all four datasets show a similar trend. Around 82.85% of errors are correct word boundary errors, which is much higher than the rate of incorrect word boundary ones. Incorrectly joining two or more words creates a run-on error, which is often not in the lexicon. In other words, most run-on errors are non-word errors, and they are easy to detect. Correcting such errors is more complicated, because it easily leads to a combinatorial explosion of the number of possible word combinations.
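The combinatorial explosion mentioned above can be seen in a brute-force segmentation of a run-on token against a lexicon, sketched below; the lexicon is a toy placeholder.

```python
from functools import lru_cache

LEXICON = {"is", "said", "blue", "sky", "a", "as"}

def segmentations(token):
    """All ways to split a run-on token into lexicon words. The search
    space grows exponentially with token length, which is the
    combinatorial explosion mentioned above; memoization only tames
    repeated sub-problems, not the number of results."""
    @lru_cache(maxsize=None)
    def split(rest):
        if not rest:
            return [[]]
        results = []
        for i in range(1, len(rest) + 1):
            head = rest[:i]
            if head in LEXICON:
                results.extend([head] + tail for tail in split(rest[i:]))
        return results
    return [" ".join(words) for words in split(token)]

print(segmentations("issaid"))   # -> ['is said']
print(segmentations("bluesky"))  # -> ['blue sky']
```

With a realistic lexicon full of one- and two-letter entries, a single long run-on token yields many competing segmentations, so a language model or context-based ranker is needed to pick among them.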
Wrongly splitting one word into several strings results in incorrect split errors. Both detecting and correcting such errors is challenging, because some of the split strings are not in the lexicon (non-word errors) while others are lexicon entries (real-word errors).
The percentages of the incorrect word boundary sub-types of the four datasets are shown in Fig. 10, with incorrect split errors denoted as split, run-on errors denoted as run-on, and their combination as (split + run-on). It is notable that the percentage of incorrect split errors is on average 2.36 times higher than that of run-on errors. In contrast, most incorrect word boundary errors in misspellings are run-on errors, with a 6.5 times higher occurrence than incorrect split ones. In addition, incorrect split and run-on errors rarely appear together: the percentage of their combination (split + run-on) is very small. The real-word/non-word errors mentioned in this section are subsets of the real-word and non-word errors pointed out in Fig. 8, and they reveal a similar trend to their supersets. The other non-word and real-word errors are of the incorrect word boundary type, with 28.72% real-word errors and 24.82% non-word ones on average.

SUMMARY OF MAIN FINDINGS
We summarize in this section the key observations from our study. Firstly, we examined OCR errors and compared them with spelling errors in several aspects. Misspellings and OCR errors show similar trends in two cases. In particular, most of them are single-error tokens (74.5% of misspellings, 58.92% of OCR errors), and few of them are first-position errors (11% of misspellings, 12.46% of OCR errors).
However, misspellings and OCR errors differ in three other aspects: real-word vs. non-word errors, incorrect split vs. run-on errors, and short-word errors. We found that most misspellings (67.5%) are non-word errors, while most OCR errors (59.21%) are real-word ones. Regarding the incorrect word boundary error type, the percentage of run-on errors is 6.5 times higher than that of incorrect split ones in the case of spelling errors. In contrast, the proportion of incorrect split errors is on average 2.36 times greater than that of run-on errors in the case of OCR errors. Moreover, while 63% of misspellings appear in short words, only 42.1% of OCR errors are short-word errors.
Secondly, besides aspects similar to those in Kukich's survey, we present novel statistics (non-standard mappings, string similarities based on the LCS, OCR token lengths, and erroneous character positions). For non-standard mappings, our analysis reveals that the characters 'b', 'd', 'h', 'm', 'n' are easily recognized as 'li', {'il', 'cl'}, 'li', {'rn', 'in'}, {'ri', 'ii'}, respectively; conversely, the strings 'li', {'il', 'cl'}, 'li', {'rn', 'in'}, {'ri', 'ii'} can be recognized as 'b', 'd', 'h', 'm', 'n', respectively. In the case of string similarities based on the LCS, around 83.5% of OCR errors achieve a similarity S of no less than 0.125 with their GT words. As to OCR token lengths, they show a trend similar to that of word lengths. In particular, incorrect OCR tokens of length 3 are the most common, and most erroneous OCR tokens have lengths from 2 to 9. For erroneous character positions, around 27.37% of errors are last-position errors, which is comparable with the rate of middle-position errors (28.69%). In addition, we observe that errors rarely have erroneous characters at both the first and the last position (9.75% in total for first+last and first+last+nth).
Finally, based on the analysis of the four datasets, we make some suggestions for designing post-processing approaches. Because last-position errors rarely appear together with first-position errors, post-OCR techniques can ignore their combinations (first+last, first+last+nth).
Our observations show that deletion, insertion and substitution only occasionally appear together in the same word (around 22.98%); candidate generation algorithms can therefore pay more attention to single modification types instead of their combinations. Moreover, the ratio of the numbers of substitution/deletion/insertion character candidates for each character position of an OCR token can be set to 5:1:1 when generating candidates.
Edit distance is considered an important criterion for selecting relevant candidates. Interestingly, 81.49% of OCR errors have an edit distance of 1 or 2, so with an edit distance threshold of 2, post-processing approaches can easily remove many irrelevant candidates. Moreover, edit distance thresholds can be adjusted according to word lengths. With such flexible settings of the edit distance threshold, post-processing techniques would be able to handle about 89.15% of errors.

CONCLUSION
In this paper, we have examined different aspects of OCR errors towards a better understanding of such errors and the related challenges. Based on our observations on four datasets, we have also suggested guidelines for designing post-processing approaches. In addition, we have proposed novel classifications, including a two-dimensional grouping of errors according to word lengths and edit distances, as well as a grouping of real-word/non-word errors following word boundary types. Our work can be viewed as an important initial step towards further analyses and more efficient and robust post-OCR techniques.

Figure 1: Error rates based on edit operation types

Figure 2: Error rates based on edit distances

Figure 3: Error rates based on the LCS similarity S

Figure 4: Rates of correct and incorrect word recognition based on word lengths

Figure 5: Error rates based on OCR token lengths

Figure 7: Error rates of erroneous character positions

Figure 8: Rates of real-word vs. non-word errors

Figure 9: Rates of correct vs. incorrect word boundary errors

4.5.1 Incorrect word boundary errors. In terms of incorrect word boundary errors, we study two popular sub-types: incorrect split and run-on error types.

Figure 10: Error rates of incorrect word boundary sub-types
Figure 11: Rates of real-word vs. non-word errors of the correct word boundary type

4.5.2 Correct word boundary errors. In terms of correct word boundary errors, we directly classify errors into real-word and non-word error types. The percentages of real-word and non-word errors in the correct word boundary type are shown in Fig. 11.

Table 1: Sources, types, years, word error rates (W.E.R), sizes and numbers of files of the four datasets.