Conference paper Open Access

Deep Statistical Analysis of OCR Errors for Effective Post-OCR Processing

Nguyen, Thi-Tuyet-Hai; Jatowt, Adam; Coustaty, Mickael; Nguyen, Nhu-Van; Doucet, Antoine

Post-OCR processing is an important step that follows optical character recognition (OCR) and aims to improve the quality of OCR output by detecting and correcting residual errors. This paper describes the results of a statistical analysis of OCR errors in four document collections. Five aspects of general OCR errors are studied and compared with human-generated misspellings: edit operations, length effects, erroneous character positions, real-word vs. non-word errors, and word boundaries. Based on the observations from this analysis, we give several suggestions for the design and implementation of effective OCR post-processing approaches.
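To make two of the aspects named in the abstract concrete (edit operations, and real-word vs. non-word errors), the following is a minimal sketch, not the authors' method. It assumes aligned pairs of ground-truth and OCR tokens are already available. The function names `edit_operations` and `error_type` and the toy lexicon are hypothetical. The alignment comes from Python's `difflib`, which produces a plausible but not necessarily minimal edit script, so the counts are approximate.

```python
from collections import Counter
from difflib import SequenceMatcher


def edit_operations(truth: str, ocr: str) -> Counter:
    """Count character-level substitutions, deletions and insertions
    needed to turn the ground-truth token into the OCR token.

    Assumption: difflib's alignment is used as a stand-in for a true
    minimum-edit-distance alignment.
    """
    counts = Counter(substitution=0, deletion=0, insertion=0)
    matcher = SequenceMatcher(None, truth, ocr, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "replace":
            # A replace block can have unequal lengths: the overlap is
            # counted as substitutions, the remainder as deletions or
            # insertions.
            common = min(i2 - i1, j2 - j1)
            counts["substitution"] += common
            counts["deletion"] += (i2 - i1) - common
            counts["insertion"] += (j2 - j1) - common
        elif tag == "delete":
            counts["deletion"] += i2 - i1
        elif tag == "insert":
            counts["insertion"] += j2 - j1
    return counts


def error_type(ocr: str, lexicon: set) -> str:
    """Label an erroneous OCR token as a real-word error (it is a valid
    word, just not the intended one) or a non-word error."""
    return "real-word" if ocr.lower() in lexicon else "non-word"


if __name__ == "__main__":
    # Hypothetical toy lexicon and token pairs, for illustration only.
    lexicon = {"modern", "modem", "hello"}
    for truth, ocr in [("hello", "hcllo"), ("modern", "modem")]:
        print(truth, ocr, dict(edit_operations(truth, ocr)), error_type(ocr, lexicon))
```

In this sketch, "hcllo" for "hello" is a non-word error with one substitution, while "modem" for "modern" is a real-word error, which a pure dictionary lookup would not detect.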
