4008952
doi
10.5281/zenodo.4008952
oai:zenodo.org:4008952
user-newseye
user-eu
Hamdi, Ahmed
University of La Rochelle, France
Doucet, Antoine
University of La Rochelle, France
When to use OCR post-correction for named entity recognition?
Huynh, Vinh-Nam
University of Science and Technology of Hanoi, ICTLab, Vietnam
info:eu-repo/semantics/openAccess
Creative Commons Attribution 4.0 International
https://creativecommons.org/licenses/by/4.0/legalcode
named entity recognition
optical character recognition
character degradation
spelling correction
<p>In recent decades, a huge number of documents have been digitised and then processed with optical character recognition (OCR) to extract their textual content. This step is crucial for indexing the documents and making the resulting collections accessible. However, indexing documents through their OCRed content poses a number of problems, because the performance of OCR methods has varied over time. Indeed, OCR quality has a considerable impact on the indexing, and therefore the accessibility, of digital documents. Named entities are among the most suitable information for indexing documents, particularly in digital libraries, where log analysis studies have shown that around 80% of user queries include a named entity. Taking full advantage of the computational power of modern natural language processing (NLP) systems, named entity recognition (NER) can be run efficiently over enormous OCR corpora. Despite progress in OCR, the resulting text files still contain misrecognised words (noise, for short), which harm NER performance. In this paper, to address this challenge, we apply a spelling correction method to noisy versions of a corpus with variable OCR error rates in order to quantitatively estimate the contribution of post-OCR correction to NER. Our main finding is that we can indeed consistently improve the performance of NER when the OCR quality is reasonable (error rates between 2% and 10% for characters (CER) and between 10% and 25% for words (WER)). The noise correction algorithm we propose is both language-independent and of low complexity.</p>
Zenodo
2020-08-31
info:eu-repo/semantics/conferencePaper
4008951
user-newseye
user-eu
award_title=NewsEye: A Digital Investigator for Historical Newspapers; award_number=770299; award_identifiers_scheme=url; award_identifiers_identifier=https://cordis.europa.eu/projects/770299; funder_id=00k4n6c32; funder_name=European Commission;
1598878765.734543
546091
md5:8a28f104a5849161e1334c4388092ed6
https://zenodo.org/records/4008952/files/ICADL_2020__ When_to_use_OCR_post-correction_for_ named_entity_recognition.pdf
public
10.5281/zenodo.4008951
isVersionOf
doi