Published August 27, 2020 | Version v1
Dataset Open

Benchmark for the evaluation of named entity recognition over ancient documents

Description

The dataset consists of multilingual noisy corpora for named entity recognition (NER).
The noisy versions are simulated from the CoNLL-02 (Spanish and Dutch) and CoNLL-03 (English) NER corpora.
The original collections are re-OCRed, and four types of noise at two different levels are added to simulate a range of OCR output quality.

More precisely, we first extracted the raw texts and converted them into images. These images were then contaminated with noise that commonly occurs when using a scanner. We then extracted OCRed data using the open-source Tesseract
OCR engine, v3.04.01. As a result of the inserted image noise, the OCRed data contains degradations. Finally, the original and noisy texts were aligned.
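
The final alignment step can be illustrated with a minimal sketch. The description above does not say which alignment method was used, so this example substitutes Python's standard `difflib.SequenceMatcher` at the token level; the function name `align_tokens` and the sample sentences are hypothetical and are not taken from the dataset.

```python
from difflib import SequenceMatcher


def align_tokens(clean, noisy):
    """Pair clean tokens with noisy tokens; None marks a gap on either side.

    Illustrative only: the dataset's own alignment procedure is not
    documented here, so difflib is used as a stand-in.
    """
    matcher = SequenceMatcher(a=clean, b=noisy, autojunk=False)
    pairs = []
    for _tag, i1, i2, j1, j2 in matcher.get_opcodes():
        left = list(clean[i1:i2])
        right = list(noisy[j1:j2])
        # Pad the shorter side so insertions and deletions stay visible.
        width = max(len(left), len(right))
        left += [None] * (width - len(left))
        right += [None] * (width - len(right))
        pairs.extend(zip(left, right))
    return pairs


if __name__ == "__main__":
    clean = "John lives in New York".split()
    noisy = "J0hn lives in New Yorlc".split()  # made-up OCR errors
    for clean_token, noisy_token in align_tokens(clean, noisy):
        print(clean_token, noisy_token, sep="\t")
```

With an alignment of this kind, the entity labels attached to the clean tokens can be carried over to their noisy counterparts, which is what makes a noisy corpus usable for NER evaluation.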

This archive contains three folders, one per language. Each folder contains the degraded images, the noisy texts extracted by the OCR engine, and those noisy texts aligned with the clean data.

These are the supplementary materials for the TPDL 2020 paper "Assessing and Minimizing the Impact of OCR Quality on Named Entity Recognition". If you use all or part of this resource, please cite this paper:

@InProceedings{10.1007/978-3-030-54956-5_7,
author="Hamdi, Ahmed and Jean-Caurant, Axel and Sid{\`e}re, Nicolas and Coustaty, Micka{\"e}l and Doucet, Antoine",
editor="Hall, Mark and Mer{\v{c}}un, Tanja and Risse, Thomas and Duchateau, Fabien",
title="Assessing and Minimizing the Impact of OCR Quality on Named Entity Recognition",
booktitle="Digital Libraries for Open Knowledge",
year="2020",
publisher="Springer International Publishing",
address="Cham",
pages="87--101",
isbn="978-3-030-54956-5"
}

Acknowledgments
This work has been supported by the European Union's Horizon 2020 research and innovation programme under grant 770299 [NewsEye](https://www.newseye.eu/).

Files

ner_dataset-ocr_degradation.zip (979.8 MB)
md5:d5ca5c63be66541ac2365ef58c6dacd9
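
Since the archive is close to 1 GB, it may be worth checking the download against the published MD5 checksum. The following is a minimal sketch using only the Python standard library; the helper name `md5_of_file` and the local file path are assumptions, and the expected digest is the one listed above.

```python
import hashlib

# Checksum as published on this record.
EXPECTED_MD5 = "d5ca5c63be66541ac2365ef58c6dacd9"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Assuming the archive was saved in the current directory under its original name, `md5_of_file("ner_dataset-ocr_degradation.zip") == EXPECTED_MD5` should evaluate to `True` for an intact download.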

Additional details

Funding

NewsEye – NewsEye: A Digital Investigator for Historical Newspapers 770299
European Commission