Conference paper Open Access

Deep Statistical Analysis of OCR Errors for Effective Post-OCR Processing

Nguyen, Thi-Tuyet-Hai; Jatowt, Adam; Coustaty, Mickael; Nguyen, Nhu-Van; Doucet, Antoine

Post-OCR processing is an important step that follows optical character recognition (OCR) and is meant to improve the quality of OCR-processed documents by detecting and correcting residual errors. This paper describes the results of a statistical analysis of OCR errors on four document collections. Five aspects of general OCR errors are studied and compared with human-generated misspellings: edit operations, length effects, erroneous character positions, real-word vs. non-word errors, and word boundaries. Based on the observations from the analysis, we give several suggestions for the design and implementation of effective OCR post-processing approaches.
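
To make two of the aspects named above more concrete, the following is a minimal sketch, not the authors' method, of how an OCR token might be compared with its ground-truth word. It assumes the standard Levenshtein distance as the measure of edit operations and a small hypothetical lexicon (`LEXICON`) for separating real-word from non-word errors; the function names and example words are illustrative only and do not come from the paper.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions
    and substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


# Hypothetical toy lexicon; a real study would use a full dictionary.
LEXICON = {"modern", "modem", "house"}


def classify_error(ocr_token: str, truth: str, lexicon: set) -> str:
    """Label a token as correct, a real-word error (a wrong but valid
    word), or a non-word error (not in the lexicon)."""
    if ocr_token == truth:
        return "correct"
    return "real-word" if ocr_token.lower() in lexicon else "non-word"


if __name__ == "__main__":
    for ocr, truth in [("modem", "modern"), ("rnodern", "modern")]:
        print(ocr, truth, levenshtein(ocr, truth),
              classify_error(ocr, truth, LEXICON))
```

Here "modem" read in place of "modern" counts as a real-word error because the wrong token is itself a valid word, while "rnodern" is a non-word error. Real-word errors are the harder case for dictionary-based detection, since a lexicon lookup alone does not flag them.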

File: JCDL2019_Deep_Analysis.pdf (700.6 kB, md5:dc1da7ab2572a18b2b1e9c9e14b58e22)