Neural Machine Translation with BERT for Post-OCR Error Detection and Correction

The quality of OCR has a direct impact on information access, and an indirect impact on the performance of natural language processing applications, making fine-grained (e.g., semantic) information access even harder. This work proposes a novel post-OCR approach based on a contextual language model and neural machine translation, aiming to improve the quality of OCRed text by detecting and rectifying erroneous tokens. This new technique obtains results comparable to the best-performing approaches on English datasets of the competition on post-OCR text correction in ICDAR 2017/2019.


INTRODUCTION
Historical documents contain valuable knowledge that gets considerable attention from researchers and libraries around the world. Substantial efforts have been devoted to transforming paper-based documents into electronic text in order to preserve them as well as make them fully accessible.
Limitations of modern OCR technologies in handling historical documents lead to difficulties in reading, retrieval, and other processing of digitized collections [13]. In other words, they reduce the benefits of digitization projects by making it difficult for users to acquire knowledge from past documents. Our work attempts to minimize the influence of OCR problems by detecting and correcting errors in digitized texts. Bidirectional encoder representations from transformers (BERT) and neural machine translation (NMT) are employed in our approach with some variations.
JCDL '20, August 1-5, 2020
A shared task is a good chance to compare techniques. Therefore, we use the evaluation metrics and English datasets of the two occurrences of the competition on post-OCR text correction in 2017 [2] and 2019 [11] to evaluate the performance of our proposed methods. Experimental results show that our approach performs slightly better than the winners of the competition on error detection and obtains comparable improvements on error correction.
Our three contributions are as follows. The first is the use of static word embeddings in fine-tuned BERT models, which increases the performance of error detection. The second is a set of character embeddings, created by training NMT on aligned OCRed text and its ground truth (GT), which achieve some positive results in rectifying errors. The last is the use of a length difference to remove irrelevant candidates, which improves the correction output.

RELATED WORK
The literature on OCR post-processing offers a rich family of models, which can be grouped into three types: manual approaches, in which humans review and correct OCRed texts; lexical approaches, based on comparing source words to dictionary entries; and statistical approaches, which utilize error distributions from training data.
The manual type. Manual approaches achieve high accuracy but have some limitations. These methods require the original documents, which are often unavailable for OCRed corpora. In addition, they depend heavily on volunteer work.
The lexical type. The methods of the lexical type typically exploit distance measures between an erroneous word and a lexicon entry to suggest candidates for correcting OCR errors.
Lexical approaches are easy to apply but come with some difficulties. Historical documents do not follow the same spelling rules as modern texts and often lack complete lexicons. Moreover, approaches of this type concentrate only on single words, so they cannot tackle real-word errors (e.g. 'hear' is a real-word error in the phrase 'stay hear in Japan').
The statistical type. Most post-processing approaches are statistical, which enables them to model specific distributions of the target domain from available training data.
In the 2017 and 2019 competitions on post-OCR text correction [2,11], participants implemented various methods to detect and correct OCR errors. The best-performing approach in the detection task was the fine-tuned BERT model of the CCC team. Both winners of the correction task (Char-NMT/SMT, CCC) use character-level machine translation techniques with some additional features.
As a result, our proposals build on BERT and character-level NMT with some extensions, including static embeddings used in BERT, our character embeddings applied in NMT, and a post filter.

ERROR DETECTION
BERT [3] is a multi-layer bidirectional transformer encoder. It is pre-trained on unlabeled data with two tasks: Masked Language Model (MLM) and Next Sentence Prediction (NSP). BERT models can be fine-tuned to handle NLP problems: a downstream model is first initialized with the pre-trained parameters, which are then adjusted using the labelled data of the task.
There are multiple task-specific BERT models [3]; some work at sentence level, others at token level. Error detection can be viewed as a token classification problem that labels OCRed tokens as either valid or invalid. We therefore focus on fine-tuning BERT models at token level.
We adapt the named entity recognition (NER) model to error detection. In particular, instead of tagging tokens with NER tags, we tag tokens with label 1 (invalid token) or 0 (valid token). Our approach is similar to that of the winner of the 2019 competition, but we simplify the model to use only one fully-connected layer on top of the hidden-states output. In addition, pre-trained word embedding models have been shown to increase the performance of NLP tasks. Thus, instead of randomly initializing embeddings as the competition winner CCC does, we employ popular word embeddings (Fasttext, Glove) in our model.
Our approach consists of the following four steps. The OCRed input is first split into OCRed tokens on white-space. Next, we apply WordPiece [14] tokenization to each token to get the corresponding sub-tokens, while maintaining a mapping between each original OCRed token and its sub-tokens. Then, Glove or Fasttext is used to embed the sub-tokens in lieu of assigning random numbers as initial embeddings.
After that, these embeddings are combined with segment and position embeddings as inputs to the BERT token classification model, which is a BERT model with an additional fully-connected layer. This design is simpler than the state of the art, which uses both convolutional and fully-connected layers. The outcome of this stage is labelled sub-tokens, with label 1 for invalid and 0 for valid ones. Finally, an original token is considered invalid if at least one of its sub-tokens is labelled as an error.
Take an OCRed sequence 'we wyll go' with an error 'wyll' as an example to illustrate our approach. The input of the first step is a list of tokens split on white-space, {'we', 'wyll', 'go'}. Applying WordPiece to each OCRed token, we obtain the corresponding sub-tokens and their mappings to the original tokens, {'we': 'we', 'wyll': {'w', '##yl', '##l'}, 'go': 'go'}. Next, the pre-trained word embeddings Glove or Fasttext embed the sub-tokens, which are then used as inputs for BERT token classification. The classifier labels each sub-token as either valid or invalid. The original token ('wyll') is identified as an error since its sub-tokens ('w', '##yl', '##l') are classified as invalid.
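The mapping and aggregation steps of this example can be sketched in a few lines of Python. This is only an illustrative sketch: the lookup table `TOY_WORDPIECE` and the hard-coded `sub_labels` are hypothetical stand-ins for a real WordPiece tokenizer and for the output of the fine-tuned BERT classifier, and the function names are chosen here for illustration.

```python
# Hypothetical stand-in for a WordPiece tokenizer (covers this example only).
TOY_WORDPIECE = {"we": ["we"], "wyll": ["w", "##yl", "##l"], "go": ["go"]}


def flatten(tokens, sub_tokenize):
    """Split every token into sub-tokens and remember which token owns each."""
    subs, owner = [], []
    for index, token in enumerate(tokens):
        for sub in sub_tokenize(token):
            subs.append(sub)
            owner.append(index)
    return subs, owner


def aggregate(num_tokens, owner, sub_labels):
    """A token is invalid (1) if at least one of its sub-tokens is labelled 1."""
    labels = [0] * num_tokens
    for index, label in zip(owner, sub_labels):
        if label == 1:
            labels[index] = 1
    return labels


tokens = "we wyll go".split()                      # step 1: white-space split
subs, owner = flatten(tokens, lambda t: TOY_WORDPIECE[t])  # step 2: sub-tokens
sub_labels = [0, 1, 1, 1, 0]       # assumed classifier output, one per sub-token
token_labels = aggregate(len(tokens), owner, sub_labels)   # final step
print(list(zip(tokens, token_labels)))
```

Because the aggregation uses "at least one", a single sub-token flagged as invalid is enough to mark the whole OCRed token as an error.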
In our experiments, we use uncased BERT-base models with a batch size of 32, a learning rate of 3e-5 for the Adam optimizer, and a maximum sequence length of 75. The model is trained for 20 epochs, more than the recommended number, while the other hyperparameters remain unchanged.

ERROR CORRECTION
As mentioned in Section 2, character-level MT is the state of the art for the error correction task, as it helps tackle the problem of data sparsity. Regarding MT techniques, SMT consists of many small sub-components that are tuned separately. In contrast, NMT aims at building a single neural network which maximizes the translation performance. Its performance is comparable to the existing state-of-the-art phrase-based model [14]. Consequently, we employ NMT at character level to translate OCRed text into its corrected version (in the same language).
Our models are built on an open-source toolkit for neural machine translation (OpenNMT) [6]. We use most of the default values of OpenNMT, except for the embedding, the hidden layer size, and the sequence length. Input and output texts are written in the same language, so we share embeddings between the source and target side, with an embedding size of 160 (tested against 100). The hidden layer size is increased from 500 to 1000. We set the maximum sequence length to 70 (instead of the default, 50) to cover longer sequences of training data.
Most OCRed tokens are in fact correct. If the MT system is trained on a dataset with a large proportion of valid tokens, it might not rectify errors. In order to reduce the negative impact of imbalanced data and to deal with real-word errors, we take erroneous OCRed tokens together with some nearby tokens (which can be correct or incorrect) as input; the corresponding GT texts are provided as output of the NMT models.
In particular, given one error and its four neighbors, we generate five word 5-grams, which are represented at character level and used as input sequences. This also augments the data for training NMT models. In the data representation, space and '#' serve as character delimiter and word boundary marker, respectively. If an error is a run-on error, '$' is used as the word delimiter within its target text. Note that we do not consider an input sequence with all four words on the left side of the error and no word on its right side, because we expect to tackle incorrect split errors, such as 'main tain' vs. GT word 'maintain'.
For example, given an error 'andjust' in OCRed phrase 'twenty in number andjust then published in a', and its corresponding GT 'twenty in number and just then published in a', four input sequences of the error and their output ones are shown in Table 1.
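The window generation and character-level encoding for this example can be sketched as follows. The sketch assumes that OCRed and GT tokens are already aligned one-to-one (so the GT of 'andjust' is the single aligned item 'and just'); the function names `encode` and `windows` are hypothetical, and incorrect split errors are not handled.

```python
def encode(tokens):
    """Character-level form: space between characters, '#' between words,
    '$' for a space inside one aligned item (run-on error on the GT side)."""
    words = [token.replace(" ", "$") for token in tokens]
    return " ".join("#".join(words))


def windows(ocr, gt, err, n=5):
    """Return (source, target) pairs for each word n-gram containing the
    error at index `err`, skipping the one where the error is the last word."""
    pairs = []
    for start in range(err - n + 1, err + 1):
        end = start + n
        if start < 0 or end > len(ocr):
            continue                  # window falls outside the phrase
        if err == end - 1:
            continue                  # no right context: not considered
        pairs.append((encode(ocr[start:end]), encode(gt[start:end])))
    return pairs


ocr = ["twenty", "in", "number", "andjust", "then", "published", "in", "a"]
gt = ["twenty", "in", "number", "and just", "then", "published", "in", "a"]
pairs = windows(ocr, gt, err=3)
for source, target in pairs:
    print(source, "=>", target)
```

For this phrase the sketch produces four source/target pairs, one per window position in which the error has at least one word of right context.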
Table 1: Example of input/output sequences

OCRed text (source side)
  t w e n t y # i n # n u m b e r # a n d j u s t # t h e n
  i n # n u m b e r # a n d j u s t # t h e n # p u b l i s h e d
  n u m b e r # a n d j u s t # t h e n # p u b l i s h e d # i n
  a n d j u s t # t h e n # p u b l i s h e d # i n # a
GT text (target side)
  t w e n t y # i n # n u m b e r # a n d $ j u s t # t h e n
  i n # n u m b e r # a n d $ j u s t # t h e n # p u b l i s h e d
  n u m b e r # a n d $ j u s t # t h e n # p u b l i s h e d # i n
  a n d $ j u s t # t h e n # p u b l i s h e d # i n # a

Furthermore, Sennrich et al. concluded that linguistic features (e.g. POS tags, morphological features, etc.) yield high performance of NMT systems. However, these features are specifically designed for words rather than characters. Amrhein et al. [1] applied two features in their NMT models, including the text types and the written time span. Nevertheless, both of the features are missing from Comp2019 dataset. We think that OCRed texts of Comp2019 dataset might share some common characteristics, thus, our work considers the source of this dataset as its type. In total, there are three text types in the competition datasets (monograph and periodical from Comp2017, and Comp2019), which are exploited as additional input feature (or factor) for MT model.

Table 2: Example of an input sequence with factored representation (Monograph dataset)

OCRed text (source side)
  n|M u|M m|M b|M e|M r|M #|M a|M n|M d|M j|M u|M s|M t|M
GT text (target side)
  n u m b e r # a n d $ j u s t
By applying factored NMT, we have more training data. Moreover, instead of training a different model for each dataset, we only need to train a single model to test on our three datasets. An example of an input sequence from the Monograph dataset with factored representation is shown in Table 2. The factored NMT model is the first version of our approach (denoted as Correction 1).
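As a rough illustration of the factored representation, each symbol of a character-level source sequence can be paired with a text-type factor. The sketch below assumes the `symbol|factor` notation and the factor `M` for Monograph as they appear in Table 2; the function name `add_factor` is hypothetical, and the factor labels of the other two text types are not given in the text.

```python
def add_factor(sequence, factor):
    """Attach the text-type factor to every symbol of a space-delimited
    character sequence, including the '#' word boundary marker."""
    return " ".join(f"{symbol}|{factor}" for symbol in sequence.split(" "))


source = "n u m b e r # a n d j u s t"
factored = add_factor(source, "M")   # 'M' marks the Monograph text type
print(factored)
```

Only the source side carries the factor; the target side stays a plain character sequence.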
MT techniques apply pre-trained word embeddings to improve translation performance. Several word embeddings are freely available, whereas it is not easy to find a character embedding. McCann et al. [7] reported that a pre-trained encoder of an MT model increases the performance of other NLP tasks. Their contextualized word vectors are known as Context Vectors (CoVe). Broadening this idea, we extract embeddings from a character-level NMT model trained with aligned data.
In particular, we align OCRed text with its corresponding GT text, then generate input sequences from each aligned error with its contextual tokens. New character embeddings are extracted from models trained with the aligned data and with embeddings shared between the source and target side. These embeddings (called aligned embeddings) are expected to place characters closer together in the vector space when they have similar contexts and/or shapes. The second version of our approach (called Correction 2) is similar to the first one but uses aligned embeddings.
According to previous work [9], more than 80% of OCR errors have an edit distance less than 3. We apply this feature to remove some irrelevant candidates. Specifically, after getting candidates for each error from the MT models, we only select candidates whose edit distance to the error is lower than 3. Furthermore, the analyses also indicate that the percentages of deletion and insertion errors are much lower than that of substitution errors. While it is expensive to compute the edit distance between two sequences, the difference between the candidate length and the OCRed token length is simple and fast to calculate. We find that by setting the length difference threshold to 4, we obtain a performance comparable to using edit distance. The last version of our approach (denoted as Correction 3) is the same as Correction 2 with the addition of the length difference filter.
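The two candidate filters can be compared with a small sketch. It is written under assumptions: the function names are hypothetical, the edit distance is taken to be the Levenshtein distance, and the length threshold is read as a strict bound (difference lower than 4) by analogy with the edit-distance rule, which the text does not state explicitly.

```python
def edit_distance(a, b):
    """Levenshtein distance, computed one row at a time."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + (char_a != char_b)))
        previous = current
    return previous[-1]


def filter_by_edit_distance(error, candidates, max_distance=3):
    """Keep candidates whose edit distance to the error is lower than 3."""
    return [c for c in candidates if edit_distance(error, c) < max_distance]


def filter_by_length(error, candidates, max_difference=4):
    """Cheaper filter: keep candidates whose length is close to the error's
    (strict bound assumed)."""
    return [c for c in candidates if abs(len(c) - len(error)) < max_difference]


candidates = ["will", "well", "wonderfully"]   # made-up candidates for 'wyll'
kept_by_distance = filter_by_edit_distance("wyll", candidates)
kept_by_length = filter_by_length("wyll", candidates)
print(kept_by_distance, kept_by_length)
```

The length filter needs only two length lookups per candidate, while the edit distance requires time proportional to the product of the two string lengths, which is the trade-off motivating Correction 3.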

Metrics
For the detection task, we use the official metrics of the competition to evaluate our approach: Precision, Recall, and F-score. For the correction task, the measure is the improvement percentage, computed from the difference between the original distance (between GT and OCRed text) and the corrected distance (between GT and the corrected text). When there are several candidates for the same error, the corrected distance takes into account the confidence of each candidate being the correction.
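A minimal sketch of these measures is given below. It is not the official evaluation tool: it assumes that detection is scored over sets of erroneous token positions, that the corrected distance is the confidence-weighted sum of candidate distances, and that the improvement is normalized by the original distance. All function names are hypothetical.

```python
def detection_scores(predicted, gold):
    """Precision, recall and F-score over sets of erroneous token positions."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


def weighted_distance(candidates):
    """Corrected distance for one error: each (distance, confidence) pair
    contributes its distance to GT weighted by the candidate's confidence."""
    return sum(distance * confidence for distance, confidence in candidates)


def improvement(original_distance, corrected_distance):
    """Relative reduction of the distance to GT, in percent (assumed form)."""
    return 100.0 * (original_distance - corrected_distance) / original_distance


scores = detection_scores({1, 2, 3, 4}, {2, 3, 4, 5})
corrected = weighted_distance([(0, 0.75), (2, 0.25)])
gain = improvement(10, 6)
print(scores, corrected, gain)
```

A negative value of `improvement` would indicate that the corrected text is further from the GT than the original OCRed text.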

Datasets
English OCRed texts from both rounds of the competition are used as evaluation data. The dataset of Comp2017 consists of 813 English files that were published either in periodicals or in monographs. They were therefore divided by the competition organizers into two datasets: Monograph and Periodical. The dataset of Comp2019 contains 200 files in English, which come from the IMPACT project. The corresponding GT was created by different projects such as Europeana Newspapers, IMPACT, etc. These datasets are distributed as a training set of 80% and an evaluation set of 20%. The details of the evaluation datasets used are shown in Table 3.

Results
Tables 4 and 5 show the performance of our approaches on the competition datasets of ICDAR 2017 and ICDAR 2019, respectively. In these tables, '-' marks no improvement noted by the competition, and 'x' denotes either a result not reported by the competition or a prior approach that only works on the detection [10] or correction task [8].
Error Detection. Overall, our approach surpasses other approaches on Periodical (with 4% higher F-score) and Comp2019 (with 1% higher F-score) but not on Monograph. These results are partly explained by the rate of real-word and non-word errors in each dataset and the strength of our neural network based approach. In fact, there are more real-word errors in two datasets (Periodical and Comp2019) than in Monograph. BERT is a contextual language model, so it is reasonable that the BERT-based model can detect more real-word errors.
The rate of correctly detected real-word errors supports our assumption. Our approach is able to identify 64% of context-sensitive errors on Monograph, 63% on Periodical, 48% on Comp2019, which is better than the results reported in the prior work [10] (43% on

Table 4: Results on the competition datasets in ICDAR 2017 (F-score: detection, Impr.: correction).
In general, the best model of our approach outperforms some of its counterparts. In terms of the 2017 competition, our single model performs better than most participants, except for the state-of-the-art approach (Char-SMT/NMT), which combines five different models of statistical MT and neural MT. The authors of Char-SMT/NMT claimed that their system is complicated and difficult to apply to new datasets. Therefore, they suggested the most promising single system (denoted as Single Char-SMT/NMT in Table 4) [1], which works across all data sets. However, its performance is significantly lower than that of the ensemble model as well as our proposals. In contrast, our models are easy to implement with available data. Moreover, it should be emphasized that our improvement is much higher than that of the neural MT based approach (CLAM) or the statistical MT based one (MMDT) [12]. Consequently, we think that our model can be considered a reliable solution to reduce OCR errors across various data sets.
In terms of the 2019 competition (i.e., without the provided error list), we have to use our own list of errors obtained from the detection task. Our best model still underperforms some other methods, including RAE2, RAE1 and CCC. In our opinion, the reason is that our models are built on the limited resources of the 2019 competition, whose dataset is small and contains several real-word errors involving wrong line recognition. The RAE1 and RAE2 competitors and the CCC team benefit from using external materials like the Google Books Ngram Corpus. Nonetheless, no clear conclusion can be drawn between the performance of our best model and that of RAE1&2 (called WFST-PostOCR), as the former performed better in the first round.

CONCLUSIONS
This paper presents a novel approach to improve the quality of digitized outputs. Our error detector is able to detect several real-word errors by exploiting word embeddings and pre-trained BERT models. Our correction approach, which applies NMT techniques to contextual input data with some additional features (e.g. our character embeddings, post filter), is promising for reducing OCR errors. Nevertheless, if real-word errors relate to wrong line recognition, the performance of our approach is limited. Future work will focus on employing additional external resources to improve our results.