Wikipedia corpus for synthetic data made for Handwritten Text Recognition and Named Entity Recognition

CONSTUM, Thomas

doi:10.1007/s10032-024-00511-9

Published June 10, 2025 | Version v1

Dataset Open

CONSTUM, Thomas (Researcher)¹

1. Université de Rouen Normandie

This repository contains the text corpora needed to generate synthetic training data for DANIEL. The DANIEL code is available on GitHub, and the model is described in the paper "DANIEL: a fast document attention network for information extraction and labelling of handwritten documents" by Thomas Constum, Pierrick Tranouez, and Thierry Paquet (LITIS, University of Rouen Normandie).

The paper has been accepted for publication in the International Journal on Document Analysis and Recognition (IJDAR) and is also accessible on arXiv.

The contents of the archive should be placed in the Datasets/raw directory of the DANIEL codebase.
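
As a minimal sketch of that placement step, the following helper unpacks the downloaded zip files and copies wiki_de.txt into Datasets/raw. The function name install_archives and the two directory arguments are hypothetical, and the sketch assumes each zip unpacks into its own top-level folder (for example wiki_en/), which this page does not confirm.

```python
import shutil
import zipfile
from pathlib import Path


def install_archives(download_dir, codebase_root):
    """Unpack every *.zip in download_dir into <codebase_root>/Datasets/raw.

    Assumption: each zip already contains its own top-level folder
    (e.g. wiki_en/). Check the layout after extraction and adjust if not.
    """
    download_dir = Path(download_dir)
    raw_dir = Path(codebase_root) / "Datasets" / "raw"
    raw_dir.mkdir(parents=True, exist_ok=True)

    for archive in sorted(download_dir.glob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(raw_dir)

    # wiki_de.txt is distributed as a plain file, so it is copied as-is.
    german_corpus = download_dir / "wiki_de.txt"
    if german_corpus.exists():
        shutil.copy2(german_corpus, raw_dir / german_corpus.name)

    return raw_dir


# Example call (both paths are placeholders):
# install_archives("~/Downloads/daniel_corpus", "path/to/DANIEL")
```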

Contents of the archive:

wiki_en: An English text corpus stored in the Hugging Face datasets library format. Each entry contains the full text of a Wikipedia article.
wiki_en_ner: An English text corpus enriched with named entity annotations following the OntoNotes v5 ontology. Named entities are encoded using special symbols. The corpus is stored in the Hugging Face datasets format, and each entry corresponds to a Wikipedia article with annotated entities.
wiki_fr: A French text corpus for synthetic data generation, also stored in the Hugging Face datasets format. Each entry contains the full text of a French Wikipedia article.
wiki_de.txt: A German text corpus in plain text format, with one sentence per line. The content originates from the Wortschatz Leipzig repository and has been normalized to match the vocabulary used in DANIEL.
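
Because wiki_de.txt stores one sentence per line, it can be streamed without loading the whole 119.8 MB file into memory. The sketch below is illustrative only: the helper name iter_sentences is hypothetical, and UTF-8 encoding is an assumption that this page does not state.

```python
from pathlib import Path
from typing import Iterator


def iter_sentences(path) -> Iterator[str]:
    """Yield the non-empty lines of a one-sentence-per-line corpus."""
    # Assumption: the file is UTF-8 encoded.
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            sentence = line.strip()
            if sentence:
                yield sentence


# Example call (the path is a placeholder):
# for sentence in iter_sentences("Datasets/raw/wiki_de.txt"):
#     ...
```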

Data format for corpora in Hugging Face datasets structure:

Each record in the datasets follows the dictionary structure below:

{
"id": "<Wikipedia article ID>",
"url": "<URL of the Wikipedia article>",
"title": "<Title of the Wikipedia article>",
"text": "<Full text of the Wikipedia article>"
}
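
To illustrate how such records might be read, the sketch below loads a corpus directory and checks that a record carries the four fields listed above. It assumes the folders were written with the Hugging Face `Dataset.save_to_disk` method, so that `datasets.load_from_disk` can read them; this page does not confirm the exact on-disk layout. The names EXPECTED_FIELDS, missing_fields, and load_corpus are hypothetical.

```python
EXPECTED_FIELDS = ("id", "url", "title", "text")


def missing_fields(record):
    """Return the expected field names that are absent from a record."""
    return [name for name in EXPECTED_FIELDS if name not in record]


def load_corpus(path):
    """Load a corpus directory with the Hugging Face datasets library.

    Assumption: the directory was produced by save_to_disk. The import is
    local so that missing_fields stays usable without the library installed.
    """
    from datasets import load_from_disk  # third-party: pip install datasets

    return load_from_disk(path)


# Example call (the path is a placeholder):
# corpus = load_corpus("Datasets/raw/wiki_en")
# If the result is a DatasetDict, select a split before indexing records.
# print(missing_fields(corpus[0]))
```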

---

Named Entity Encoding (for wiki_en_ner):

Named entities are annotated using special Unicode symbols, in accordance with the following mapping:

{
"DATE": "⭕",
"PERSON": "蘋",
"GPE": "臧",
"LAW": "徠",
"ORG": "诛",
"PERCENT": "疸",
"MONEY": "麾",
"WORK_OF_ART": "頜",
"CARDINAL": "嗖",
"QUANTITY": "頷",
"NORP": "勲",
"LOC": "麂",
"TIME": "掂",
"EVENT": "砧",
"FAC": "👦",
"PRODUCT": "裾",
"ORDINAL": "📖",
"LANGUAGE": "ǫ"
}
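
As a small illustration of working with this mapping, the sketch below counts how often each entity symbol occurs in a text and strips the symbols to recover plain text. It makes no assumption about where a symbol sits relative to the entity it marks, since this page does not specify that. The function names are hypothetical. Note that a symbol such as "ǫ" can also occur in ordinary text, so counts and stripped output are only approximate.

```python
from collections import Counter

# Mapping copied from the dataset description above.
ENTITY_SYMBOLS = {
    "DATE": "⭕", "PERSON": "蘋", "GPE": "臧", "LAW": "徠", "ORG": "诛",
    "PERCENT": "疸", "MONEY": "麾", "WORK_OF_ART": "頜", "CARDINAL": "嗖",
    "QUANTITY": "頷", "NORP": "勲", "LOC": "麂", "TIME": "掂", "EVENT": "砧",
    "FAC": "👦", "PRODUCT": "裾", "ORDINAL": "📖", "LANGUAGE": "ǫ",
}


def count_entity_symbols(text):
    """Count occurrences of each entity symbol, keyed by entity label."""
    counts = Counter()
    for label, symbol in ENTITY_SYMBOLS.items():
        occurrences = text.count(symbol)
        if occurrences:
            counts[label] = occurrences
    return counts


def strip_entity_symbols(text):
    """Remove every entity symbol, leaving the rest of the text untouched."""
    for symbol in ENTITY_SYMBOLS.values():
        text = text.replace(symbol, "")
    return text
```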

Citation Request

If you publish material based on this dataset, we ask that you include a reference to the paper:

« Constum, T., Tranouez, P. & Paquet, T., DANIEL: a fast document attention network for information extraction and labelling of handwritten documents. IJDAR (2025). https://doi.org/10.1007/s10032-024-00511-9 »

Files (11.6 GB)

Name	MD5	Size
wiki_de.txt	063e49861ae2cd3a2b4b545b9f45b5c0	119.8 MB
wiki_en.zip	5ba834bd674ac3c65fc6906574b228b6	6.8 GB
wiki_en_ner.zip	7af7ef7a23122631e5745ed186d02094	2.4 GB
wiki_fr.zip	2ae112917502b74d78c15e2095c55700	2.3 GB
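
The MD5 checksums listed for these files can be used to check that a download completed intact. The sketch below is one possible way to do this with the Python standard library; the helper names md5_of and verify_download are hypothetical. Files are read in chunks so that the multi-gigabyte archives do not need to fit in memory.

```python
import hashlib
from pathlib import Path

# Checksums copied from the file listing on this page.
EXPECTED_MD5 = {
    "wiki_de.txt": "063e49861ae2cd3a2b4b545b9f45b5c0",
    "wiki_en.zip": "5ba834bd674ac3c65fc6906574b228b6",
    "wiki_en_ner.zip": "7af7ef7a23122631e5745ed186d02094",
    "wiki_fr.zip": "2ae112917502b74d78c15e2095c55700",
}


def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path):
    """Return True if the file's MD5 matches the value listed for its name."""
    expected = EXPECTED_MD5[Path(path).name]  # KeyError for unknown names
    return md5_of(path) == expected


# Example call (the path is a placeholder):
# print(verify_download("downloads/wiki_fr.zip"))
```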

Additional details

Is described by: Journal article: https://link.springer.com/article/10.1007/s10032-024-00511-9

