Published August 29, 2019 | Version v1
Conference paper Open

Post-OCR Error Detection by Generating Plausible Candidates

  • 1. L3i Laboratory, University of La Rochelle

Description

The accuracy of Optical Character Recognition (OCR) technologies considerably impacts the way digital documents
are indexed, accessed and exploited. Post-processing approaches detect and correct remaining errors to improve the
quality of OCR texts. However, state-of-the-art approaches still leave room for improvement. Most existing post-OCR techniques
use predefined error position lists or apply simple techniques to detect errors. In this paper, we describe a novel error
detector that uses features ranging from the character level (including a character noisy channel and an index of peculiarity) to the word level
(such as frequencies of n-grams, skip-grams and part-of-speech tags). Experimental results show that our approach outperforms the
best-performing techniques in the ICDAR 2017 Competition on Post-OCR Text Correction.

Files

Post-OCR Error Detection by Generating Plausible Candidates.pdf (249.3 kB)

Additional details

Funding

NewsEye: A Digital Investigator for Historical Newspapers (grant 770299)
European Commission