Conference paper Open Access

Post-OCR Error Detection by Generating Plausible Candidates

Thi Tuyet Hai Nguyen; Adam Jatowt; Mickaël Coustaty; Nhu Van Nguyen; Antoine Doucet

The accuracy of Optical Character Recognition (OCR) technologies considerably impacts the way digital documents
are indexed, accessed and exploited. Post-processing approaches detect and correct remaining errors to improve the
quality of OCR texts. However, state-of-the-art approaches still need to be improved. Most of the existing post-OCR techniques
use predefined error position lists or apply simple techniques to detect errors. In this paper, we describe a novel error
detector that uses features ranging from the character level (including a character noisy channel and the index of peculiarity) to the word level
(such as frequencies of n-grams and skip-grams, and part-of-speech features). Experimental results show that our approach outperforms the
best-performing techniques in the ICDAR 2017 Competition on Post-OCR Text Correction.

