Alleviating Digitization Errors in Named Entity Recognition for Historical Documents

doi:10.5281/zenodo.4475989

Published July 1, 2020 | Version v1

Conference paper (open access)

1. University of La Rochelle, L3i
2. University of La Rochelle, L3i; University of Toulouse, IRIT

This paper tackles named entity recognition (NER) on digitized historical texts obtained by applying optical character recognition (OCR) to digital images of newspapers. We argue that the main challenge for this task is that the OCR process introduces misspellings and linguistic errors into the output text. Moreover, older documents can contain historical spelling variations, which can also degrade NER performance. We propose a model based on a hierarchical stack of Transformers for NER on historical data, and we compare it against previous state-of-the-art models on two historical datasets, one in German and one in French. Our findings show that the proposed model clearly improves the results on both historical datasets and does not degrade the results on modern datasets.

Files

Name	Size
alleviating_digitization_errors_in_named_entity_recognition_for_historical_documents.pdf (md5:d78c27b3f211c879de332ef4dac6b28e)	409.2 kB

Additional details

Funding: European Commission, NewsEye: A Digital Investigator for Historical Newspapers (grant 770299)
