Wikipedia corpus for synthetic data made for Handwritten Text Recognition and Named Entity Recognition
Description
This repository contains the text corpora required for the synthetic data generation of DANIEL, which is available on GitHub and described in the paper "DANIEL: a fast document attention network for information extraction and labelling of handwritten documents" by Thomas Constum, Pierrick Tranouez, and Thierry Paquet (LITIS, University of Rouen Normandie).
The paper has been accepted for publication in the International Journal on Document Analysis and Recognition (IJDAR) and is also accessible on arXiv.
The contents of the archive should be placed in the Datasets/raw directory of the DANIEL codebase.
Contents of the archive:
- wiki_en: An English text corpus stored in the Hugging Face datasets library format. Each entry contains the full text of a Wikipedia article.
- wiki_en_ner: An English text corpus enriched with named entity annotations following the OntoNotes v5 ontology. Named entities are encoded using special symbols. The corpus is stored in the Hugging Face datasets format, and each entry corresponds to a Wikipedia article with annotated entities.
- wiki_fr: A French text corpus for synthetic data generation, also stored in the Hugging Face datasets format. Each entry contains the full text of a French Wikipedia article.
- wiki_de.txt: A German text corpus in plain text format, with one sentence per line. The content originates from the Wortschatz Leipzig repository and has been normalized to match the vocabulary used in DANIEL.
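Since wiki_de.txt is plain text with one sentence per line, it can be read without any special library. The snippet below is a minimal sketch, not part of the DANIEL codebase: the helper name `iter_sentences`, the UTF-8 encoding, and the skipping of blank lines are assumptions made for illustration. The path follows the Datasets/raw location mentioned above.

```python
from pathlib import Path
from typing import Iterable, Iterator


def iter_sentences(lines: Iterable[str]) -> Iterator[str]:
    """Yield one sentence per non-empty line, without surrounding whitespace."""
    for line in lines:
        sentence = line.strip()
        if sentence:  # assumption: blank lines carry no content
            yield sentence


if __name__ == "__main__":
    # Assumed location once the archive has been unpacked into Datasets/raw.
    corpus_path = Path("Datasets/raw/wiki_de.txt")
    if corpus_path.exists():
        # Assumption: the file is UTF-8 encoded.
        with corpus_path.open(encoding="utf-8") as handle:
            for index, sentence in enumerate(iter_sentences(handle)):
                print(sentence)
                if index == 4:  # only preview the first five sentences
                    break
```

Because the function accepts any iterable of lines, the file is streamed rather than loaded into memory at once.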
Data format for corpora in Hugging Face datasets structure:
Each record in the datasets follows the dictionary structure below:
{
"id": "<Wikipedia article ID>",
"url": "<URL of the Wikipedia article>",
"title": "<Title of the Wikipedia article>",
"text": "<Full text of the Wikipedia article>"
}
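The record structure above can be checked with a few lines of Python. This is a minimal sketch under stated assumptions: all four fields are assumed to hold strings, the corpora are assumed to have been written with `Dataset.save_to_disk` (so that `datasets.load_from_disk` can read them), and the names `missing_fields`, `load_corpus`, and the example record are hypothetical, introduced only for illustration.

```python
from typing import Any, List, Mapping

# Field names taken from the record structure described above.
REQUIRED_FIELDS = ("id", "url", "title", "text")


def missing_fields(record: Mapping[str, Any]) -> List[str]:
    """Return the expected fields that are absent or not strings."""
    return [name for name in REQUIRED_FIELDS if not isinstance(record.get(name), str)]


def load_corpus(path: str = "Datasets/raw/wiki_en"):
    """Load a corpus directory.

    Assumption: the directory was produced by Dataset.save_to_disk.
    The third-party `datasets` package is imported lazily so the rest
    of this sketch runs without it.
    """
    from datasets import load_from_disk

    return load_from_disk(path)


# Made-up record, used only to exercise the check.
example = {
    "id": "0",
    "url": "https://example.org/wiki/Example",
    "title": "Example",
    "text": "Example article text.",
}
print(missing_fields(example))
```

If the assumption about the storage format holds, `load_corpus()` returns an object whose records can be passed to `missing_fields` one by one.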
---
Named Entity Encoding (for wiki_en_ner):
Named entities are annotated using special Unicode symbols, in accordance with the following mapping:
{
"DATE": "⭕",
"PERSON": "蘋",
"GPE": "臧",
"LAW": "徠",
"ORG": "诛",
"PERCENT": "疸",
"MONEY": "麾",
"WORK_OF_ART": "頜",
"CARDINAL": "嗖",
"QUANTITY": "頷",
"NORP": "勲",
"LOC": "麂",
"TIME": "掂",
"EVENT": "砧",
"FAC": "👦",
"PRODUCT": "裾",
"ORDINAL": "📖",
"LANGUAGE": "ǫ"
}
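As an illustration of how this mapping might be used, the sketch below counts entity symbols per category and removes them to recover plain text. It does not assume where a symbol sits relative to the entity it marks, since that is not specified here. It does assume that the symbols never occur as ordinary characters in the articles. The function names and the sample sentence, including the placement of symbols in it, are hypothetical.

```python
from typing import Dict

# Mapping copied from the table above.
NER_SYMBOLS: Dict[str, str] = {
    "DATE": "⭕", "PERSON": "蘋", "GPE": "臧", "LAW": "徠", "ORG": "诛",
    "PERCENT": "疸", "MONEY": "麾", "WORK_OF_ART": "頜", "CARDINAL": "嗖",
    "QUANTITY": "頷", "NORP": "勲", "LOC": "麂", "TIME": "掂", "EVENT": "砧",
    "FAC": "👦", "PRODUCT": "裾", "ORDINAL": "📖", "LANGUAGE": "ǫ",
}


def count_entities(text: str) -> Dict[str, int]:
    """Count how often each entity symbol occurs, keyed by category."""
    return {
        label: text.count(symbol)
        for label, symbol in NER_SYMBOLS.items()
        if symbol in text
    }


def strip_entity_symbols(text: str) -> str:
    """Remove every entity symbol, leaving the unannotated text."""
    for symbol in NER_SYMBOLS.values():
        text = text.replace(symbol, "")
    return text


# Hypothetical sample; the real placement of symbols may differ.
sample = (
    f"{NER_SYMBOLS['PERSON']}Ada Lovelace was born in "
    f"{NER_SYMBOLS['GPE']}London."
)
print(count_entities(sample))
print(strip_entity_symbols(sample))
```

Some of these characters (for example "ǫ") can in principle appear in ordinary text, so counts obtained this way should be treated as approximate.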
Citation Request
If you publish material based on this database, please include a reference to the paper:
« Constum, T., Tranouez, P. & Paquet, T., DANIEL: a fast document attention network for information extraction and labelling of handwritten documents. IJDAR (2025). https://doi.org/10.1007/s10032-024-00511-9 »
Additional details
Related works
- Is described by
  - Journal article: https://link.springer.com/article/10.1007/s10032-024-00511-9