Conference paper Open Access

Post-OCR Error Detection by Generating Plausible Candidates

Thi Tuyet Hai Nguyen; Adam Jatowt; Mickaël Coustaty; Nhu Van Nguyen; Antoine Doucet

The accuracy of Optical Character Recognition (OCR) technologies considerably impacts the way digital documents
are indexed, accessed and exploited. Post-processing approaches detect and correct remaining errors to improve the
quality of OCR texts. However, state-of-the-art approaches still need to be improved. Most of the existing post-OCR techniques
use predefined error position lists or apply simple techniques to detect errors. In this paper, we describe a novel error
detector that uses features ranging from the character level (including a character noisy channel and an index of peculiarity) to the word level
(such as frequencies of n-grams and skip-grams, and part-of-speech information). Experimental results show that our approach outperforms the
best performing techniques in the ICDAR 2017 Competition on Post-OCR Text Correction.
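
To give a rough sense of what a character-level signal for error detection can look like, here is a minimal sketch. It is not the detector described in the paper, and it does not implement the noisy channel or the index of peculiarity. It assumes a small clean lexicon and scores a token by the average smoothed negative log-probability of its character trigrams. The function names (`char_ngrams`, `build_counts`, `rarity_score`) and the toy lexicon are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def char_ngrams(word, n=3):
    """Return the character n-grams of a word padded with boundary markers."""
    padded = f"^{word}$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def build_counts(lexicon, n=3):
    """Count character n-grams over a (hypothetical) clean word list."""
    counts = Counter()
    for word in lexicon:
        counts.update(char_ngrams(word.lower(), n))
    return counts


def rarity_score(word, counts, n=3):
    """Average add-one-smoothed negative log-probability of the word's n-grams.

    Higher values mean the character sequence is less typical of the lexicon,
    which makes the token a more plausible OCR error candidate.
    """
    grams = char_ngrams(word.lower(), n)
    if not grams:
        return 0.0
    total = sum(counts.values())
    vocab = len(counts) + 1  # reserve one slot for unseen n-grams
    log_probs = [math.log((counts[g] + 1) / (total + vocab)) for g in grams]
    return -sum(log_probs) / len(log_probs)


# Toy lexicon, assumed for illustration only.
LEXICON = ["the", "then", "there", "other", "document", "text"]
COUNTS = build_counts(LEXICON)
```

A token such as "tbe" (a typical OCR confusion of "h" and "b") receives a higher score than "the" under this toy lexicon. In a feature-based detector, a score like this would be only one input among several, alongside word-level evidence such as n-gram frequencies.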
