Conference paper Open Access

Impact of OCR Quality on Named Entity Linking

Linhares Pontes, Elvys; Hamdi, Ahmed; Sidere, Nicolas; Doucet, Antoine

Digital libraries are online collections of digital objects that can include text, images, audio, or videos. It has long been observed that named entities (NEs) are key to accessing digital library portals, as they are contained in most user queries. Combined with or subsequent to the recognition of NEs, named entity linking (NEL) connects NEs to external knowledge bases. This makes it possible to differentiate ambiguous geographical locations or names (e.g., John Smith), and means that the descriptions from the knowledge bases can be used for semantic enrichment. However, the NEL task is especially challenging for large quantities of documents, as the diversity of NEs increases with the size of the collections. Additionally, digitized documents are indexed through their OCRed versions, which may contain numerous OCR errors. This paper aims to evaluate the performance of named entity linking over digitized documents with different levels of OCR quality. It is the first investigation that we know of to analyze and correlate the impact of document degradation on the performance of NEL. We tested state-of-the-art NEL techniques over several evaluation benchmarks, and experimented with various types of OCR noise. We present the resulting study and subsequent recommendations on the adequate documents and OCR quality levels required to perform reliable named entity linking. We further provide the first evaluation benchmark for NEL over degraded documents.

Files (709.1 kB)
ICADL Impact of OCR Quality on Named Entity Linking.pdf (709.1 kB)
md5:500b0eb031eec759f2ce019a30a2a5a0