Benchmark for the evaluation of named entity recognition over ancient documents
- 1. University of La Rochelle
Description
The dataset consists of multilingual noisy corpora for named entity recognition (NER).
The noisy versions are simulated from the CoNLL-02 (Spanish and Dutch) and CoNLL-03 (English) NER corpora.
The original collections are re-OCRed, and four types of noise are added at two different levels to simulate a range of OCR outputs.
More precisely, we first extracted the raw texts and converted them into images. These images were then degraded with noise commonly introduced by scanners. We then extracted the OCRed text using the open-source Tesseract OCR engine v3.04.01. Because of the noise inserted into the images, the OCRed data contains degradations. Finally, the original and noisy texts were aligned.
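The final alignment step can be illustrated with a minimal sketch. It is not the procedure used to build this dataset, which is not detailed here; it assumes whitespace tokenization and uses Python's standard `difflib.SequenceMatcher` as a stand-in aligner. The function name `align_tokens` and the sample sentences are hypothetical.

```python
from difflib import SequenceMatcher


def align_tokens(clean, noisy):
    """Align two token lists and return (clean_span, noisy_span) pairs.

    Identical tokens are paired one to one. Any differing region is kept
    as a single pair of spans, so OCR splits and merges stay grouped.
    """
    matcher = SequenceMatcher(a=clean, b=noisy, autojunk=False)
    pairs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Matching runs have equal length, so zip pairs them exactly.
            pairs.extend(([c], [n]) for c, n in zip(clean[i1:i2], noisy[j1:j2]))
        else:
            # replace / delete / insert: keep the whole differing region.
            pairs.append((clean[i1:i2], noisy[j1:j2]))
    return pairs


# Hypothetical example: one substituted character and one split token.
clean = "The European Union met in Brussels".split()
noisy = "The Europcan Union met in Brus sels".split()

for clean_span, noisy_span in align_tokens(clean, noisy):
    print(" ".join(clean_span) or "-", "->", " ".join(noisy_span) or "-")
```

Grouping each differing region into one pair makes it possible to carry an entity label from a clean token over to the one or more noisy tokens it was recognized as, which is the kind of mapping an aligned NER corpus needs.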
This archive contains three folders (one per language). Each folder contains the degraded images, the noisy texts extracted by the OCR engine, and versions of those texts aligned with the clean data.
These are the supplementary materials for the TPDL 2020 paper "Assessing and Minimizing the Impact of OCR Quality on Named Entity Recognition". If you use all or part of this resource, please cite this paper:
@InProceedings{10.1007/978-3-030-54956-5_7,
author="Hamdi, Ahmed and Jean-Caurant, Axel and Sid{\`e}re, Nicolas and Coustaty, Micka{\"e}l and Doucet, Antoine",
editor="Hall, Mark and Mer{\v{c}}un, Tanja and Risse, Thomas and Duchateau, Fabien",
title="Assessing and Minimizing the Impact of OCR Quality on Named Entity Recognition",
booktitle="Digital Libraries for Open Knowledge",
year="2020",
publisher="Springer International Publishing",
address="Cham",
pages="87--101",
isbn="978-3-030-54956-5"
}
Acknowledgments
This work has been supported by the European Union's Horizon 2020 research and innovation programme under grant 770299 ([NewsEye](https://www.newseye.eu/)).
Files
Name | Size | MD5 |
---|---|---|
ner_dataset-ocr_degradation.zip | 979.8 MB | d5ca5c63be66541ac2365ef58c6dacd9 |
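Since the archive is close to 1 GB, it may be worth verifying the download against the MD5 checksum listed above. The following is a minimal sketch using only Python's standard library; the local path is an assumption (the archive saved under its original name in the current directory), and `md5_of` is a hypothetical helper name.

```python
import hashlib
from pathlib import Path

# Checksum listed for ner_dataset-ocr_degradation.zip on this page.
EXPECTED_MD5 = "d5ca5c63be66541ac2365ef58c6dacd9"


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    # Assumed download location; adjust to wherever the file was saved.
    archive = Path("ner_dataset-ocr_degradation.zip")
    if archive.exists():
        ok = md5_of(archive) == EXPECTED_MD5
        print("checksum OK" if ok else "checksum mismatch")
    else:
        print(f"{archive} not found")
```

Reading in chunks keeps memory use constant regardless of archive size.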