Conference paper Open Access

Deep Statistical Analysis of OCR Errors for Effective Post-OCR Processing

Nguyen, Thi-Tuyet-Hai; Jatowt, Adam; Coustaty, Mickael; Nguyen, Nhu-Van; Doucet, Antoine

Post-OCR processing is an important step that follows optical character recognition (OCR) and aims to improve the quality of OCR documents by detecting and correcting residual errors. This paper describes the results of a statistical analysis of OCR errors on four document collections. Five aspects of general OCR errors are studied and compared with human-generated misspellings: edit operations, length effects, erroneous character positions, real-word vs. non-word errors, and word boundaries. Based on the observations from this analysis, we give several suggestions for the design and implementation of effective OCR post-processing approaches.

Files (700.6 kB)
Name: JCDL2019_Deep_Analysis.pdf
Size: 700.6 kB
md5:dc1da7ab2572a18b2b1e9c9e14b58e22
