Published January 10, 2023 | Version 0.1
Dataset Open

Dataset and evaluation for HTR models for Latin and French Medieval Documentary Manuscripts

  • 1. University of Luxembourg
  • 2. École nationale des chartes

Description

1. Dataset presentation.

This is the dataset used to produce the HTR models applied to documentary Latin and French manuscripts presented in the paper: Sergio Torres Aguilar, Vincent Jolivet. Handwritten Text Recognition for Documentary Medieval Manuscripts. 2022. https://hal.science/hal-03892163

The dataset contains mostly charters and registers from the late medieval period (12th-15th centuries). Training and evaluation, covering 1855 pages, 120k lines of text and almost 1M tokens, were conducted using three freely available ground-truth corpora:

The Alcar-HOME database: https://zenodo.org/record/5600884

The e-NDP corpus: https://github.com/chartes/e-NDP_HTR

The Himanis project: https://zenodo.org/record/5535306

The final model operates in a multilingual environment (Latin and French) and is able to recognize several Latin script families (mostly Textualis and Cursiva) in documents produced between ca. the 12th and 15th centuries. During the evaluation the model shows an accuracy of 94.01% on the validation set and a CER (character error rate) of about 0.12 to 0.17 on four external unseen datasets. A fine-tuning exercise using 10 ground-truth pages can lower the CER on these datasets to between 0.06 and 0.10.

2. Dataset contents.

a) GT_list: List of the GT file names that make up the training, evaluation and test sets. The images and transcriptions can be downloaded from their original repositories.

b) Training: Contains the training and testing results (evaluation and prediction files) presented in the original paper for the two training phases: Regular (separate training for Textualis and Cursiva) and Quartiles (mixed training by quartiles).

c) Useful_scripts: Scripts to produce the HTR metrics (CER, WER, SER) and plot the model's accuracy.

d) Best_model: Contains the best multilingual and multi-script model.
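
To make the metrics named in item c) concrete, here is a minimal sketch of how CER and WER are commonly computed from reference and predicted lines using Levenshtein edit distance. It is not the code shipped in Useful_scripts; the function names are hypothetical, and reading SER as a line-level (sentence) error rate is an assumption, since the record does not expand the acronym.

```python
def levenshtein(ref, hyp):
    """Edit distance between two sequences (strings or lists of tokens)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(refs, hyps):
    """Character error rate: character edits / reference characters."""
    edits = sum(levenshtein(r, h) for r, h in zip(refs, hyps))
    total = sum(len(r) for r in refs)
    return edits / total if total else 0.0


def wer(refs, hyps):
    """Word error rate: word edits / reference words (whitespace tokens)."""
    edits = sum(levenshtein(r.split(), h.split()) for r, h in zip(refs, hyps))
    total = sum(len(r.split()) for r in refs)
    return edits / total if total else 0.0


def ser(refs, hyps):
    """Assumed meaning of SER: share of lines with at least one error."""
    wrong = sum(1 for r, h in zip(refs, hyps) if r != h)
    return wrong / len(refs) if refs else 0.0


# Illustrative lines only, not taken from the dataset.
refs = ["in nomine domini", "anno domini"]
hyps = ["in nomine domni", "anno domini"]
print(cer(refs, hyps), wer(refs, hyps), ser(refs, hyps))
```

Under this definition a CER of 0.12 corresponds to roughly 12 character edits per 100 reference characters. Aggregating edits over the whole corpus, as done here, weights long lines more heavily than averaging per-line rates would.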

Files

HTR_medieval_documentary_manuscripts.zip

Files (600.1 MB)

Checksum Size
md5:27b9163851c3f5bb08160434697d6b9e 23.7 MB
md5:5c0588a70dae5ca263185c5ef1c7b087 576.4 MB

Additional details

Related works

Is published in
Working paper: https://hal.science/hal-03892163 (URL)
References
Dataset: 10.5281/zenodo.5600884 (DOI)
Dataset: 10.5281/zenodo.5535306 (DOI)
Dataset: https://github.com/chartes/e-NDP_HTR (URL)