Survey of Post-OCR Processing Approaches

doi:10.5281/zenodo.4635569

Published March 1, 2021 | Version v1

Journal article Open

1. University of La Rochelle, L3i
2. University of Innsbruck

Optical character recognition (OCR) is one of the most popular techniques for converting printed documents into machine-readable ones. OCR engines can perform well on modern text; unfortunately, their performance is significantly reduced on historical materials. Additionally, many texts have already been processed by various out-of-date digitisation techniques. As a consequence, digitised texts are noisy and need to be post-corrected. This article clarifies the importance of enhancing the quality of OCR results by studying their effects on information retrieval and natural language processing applications. We then define the post-OCR processing problem, illustrate its typical pipeline, and review the state-of-the-art post-OCR processing approaches. Evaluation metrics, accessible datasets, language resources, and useful toolkits are also reported. Furthermore, the work identifies the current trend and outlines some research directions in this field.

Files

Name	Size	Checksum
ACM_template___Survey_of_post_OCR_approaches-1.pdf	1.5 MB	md5:7b32343e8ba4edcc8323924b4eb64f74

Additional details

Funding: European Commission, NewsEye: A Digital Investigator for Historical Newspapers (770299)

	All versions	This version
Views	437	205
Downloads	704	208
Data volume	2.1 GB	338.2 MB
