Published August 27, 2020 | Version v1
Conference paper Open

Assessing and Minimizing the Impact of OCR Quality on Named Entity Recognition

  • L3i laboratory, University of La Rochelle, France

Description

The accessibility of digitized documents in digital libraries depends heavily on the quality of document indexing. Named entities are among the most relevant information to index and are one of the main entry points used to search and retrieve digital documents. However, most digitized documents are indexed through their OCRed version, and OCR errors hinder their accessibility. This paper aims to quantify the impact of OCR quality on the performance of named entity recognition (NER). We tested state-of-the-art NER techniques on several evaluation benchmarks, varying the level and type of OCR noise to estimate its impact on NER performance. To the best of our knowledge, no other work has systematically studied the impact of OCR on named entity recognition over data sets in multiple languages. The study concludes with an evaluation on historical newspaper data provided by the National Library of Finland, which yields a large improvement over the best previously reported results.

Files

TPDL2020_Assessing_and_Minimizing_the_Impact_of_OCR_Quality_on_NER.pdf (603.2 kB)

Additional details

Funding

NewsEye: A Digital Investigator for Historical Newspapers (grant 770299)
European Commission