Conference paper Open Access

Neural Machine Translation with BERT for Post-OCR Error Detection and Correction

Nguyen, Thi-Tuyet-Hai; Jatowt, Adam; Nguyen, Nhu-Van; Doucet, Antoine; Coustaty, Mickael

The quality of OCR has a direct impact on information access, and an indirect impact on the performance of natural language processing applications, making fine-grained (e.g., semantic) information access even harder. This work proposes a novel post-OCR approach based on a contextual language model and neural machine translation, aiming to improve the quality of OCRed text by detecting and rectifying erroneous tokens. This new technique obtains results comparable to the best-performing approaches on English datasets of the competition on post-OCR text correction in ICDAR 2017/2019.

