Dataset and evaluation for HTR models for Latin and French Medieval Documentary Manuscripts
Creators
- 1. University of Luxembourg
- 2. École nationale des chartes
Description
1. Dataset presentation.
This is the dataset used to produce the HTR models applied to documentary Latin and French manuscripts presented in the paper: Sergio Torres Aguilar, Vincent Jolivet. Handwritten Text Recognition for Documentary Medieval Manuscripts. 2022. https://hal.science/hal-03892163
The dataset contains mostly charters and registers from the late medieval period (12th-15th centuries). Training and evaluation, covering 1855 pages, 120k lines of text and almost 1M tokens, were conducted using three freely available ground-truth corpora:
The Alcar-HOME database: https://zenodo.org/record/5600884
The e-NDP corpus: https://github.com/chartes/e-NDP_HTR
The Himanis project: https://zenodo.org/record/5535306
The final model operates in a multilingual environment (Latin and French) and is able to recognize several Latin script families (mostly Textualis and Cursiva) in documents produced between ca. the 12th and 15th centuries. During evaluation, the model shows an accuracy of 94.01% on the validation set and a CER (character error rate) of about 0.12 to 0.17 on four external unseen datasets. A fine-tuning exercise using 10 ground-truth pages can lower the CER on these datasets to between 0.06 and 0.10.
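For readers unfamiliar with the metric, the CER quoted above is conventionally defined as the character-level edit distance between the predicted and the reference transcription, normalized by the reference length. The formula below is the standard definition and is given as an assumption: the exact normalization used by the authors' scripts is not stated on this page.

```latex
\[
\mathrm{CER} = \frac{S + D + I}{N}
\]
% S, D, I: character substitutions, deletions and insertions needed to turn
%          the prediction into the reference transcription
% N:       number of characters in the reference transcription
```

Under this definition, a CER of 0.12 corresponds to roughly 12 character errors per 100 reference characters.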
2. Dataset contents.
a) GT_list: List of the ground-truth (GT) file names that make up the training, evaluation and test sets. The images and transcriptions can be downloaded from their original repositories.
b) Training: Contains the training and testing results (evaluation and prediction files) presented in the original paper for the two training phases: Regular (separate training for Textualis and Cursiva) and Quartiles (mixed training by quartiles).
c) Useful_scripts: Scripts to produce the HTR metrics (CER, WER, SER) and plot the model's accuracy.
d) Best_model: Contains the best multilingual and multi-script model.
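The three metrics named under Useful_scripts can be sketched as follows. This is a minimal illustration, not the code shipped in the archive: the function names are hypothetical, whitespace tokenization for WER is an assumption, and SER is taken here to mean the share of lines containing at least one error.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(refs, hyps):
    """Character error rate: character edits divided by reference characters."""
    errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    return errors / sum(len(r) for r in refs)


def wer(refs, hyps):
    """Word error rate over whitespace-separated tokens (assumed tokenization)."""
    errors = sum(edit_distance(r.split(), h.split()) for r, h in zip(refs, hyps))
    return errors / sum(len(r.split()) for r in refs)


def ser(refs, hyps):
    """Share of lines whose prediction differs from the reference (assumed definition)."""
    return sum(1 for r, h in zip(refs, hyps) if r != h) / len(refs)


# Illustrative lines, not taken from the dataset.
refs = ["in nomine domini", "amen"]
hyps = ["in nomine domni", "amen"]
print(cer(refs, hyps), wer(refs, hyps), ser(refs, hyps))
```

Aggregating errors over all lines before dividing, as done here, weights long lines more heavily than averaging per-line rates; either convention is plausible, so results should only be compared when computed the same way.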
Files
HTR_medieval_documentary_manuscripts.zip
Total size: 600.1 MB
Checksum | Size |
---|---|
md5:27b9163851c3f5bb08160434697d6b9e | 23.7 MB |
md5:5c0588a70dae5ca263185c5ef1c7b087 | 576.4 MB |
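The MD5 checksums listed for the two files can be used to check a download for corruption. The sketch below uses only Python's standard `hashlib`; the helper names are hypothetical, and because this page does not show which checksum belongs to which file name, the check simply tests whether a file's digest matches either listed value.

```python
import hashlib

# Checksums as listed on this record page.
EXPECTED_MD5 = {
    "27b9163851c3f5bb08160434697d6b9e",
    "5c0588a70dae5ca263185c5ef1c7b087",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_checksum(path):
    """True if the file's MD5 digest equals one of the listed checksums."""
    return md5_of_file(path) in EXPECTED_MD5
```

A typical use would be `matches_listed_checksum("HTR_medieval_documentary_manuscripts.zip")` after downloading; a `False` result suggests an incomplete or altered download.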
Additional details
Related works
- Is published in
- Working paper: https://hal.science/hal-03892163 (URL)
- References
- Dataset: 10.5281/zenodo.5600884 (DOI)
- Dataset: 10.5281/zenodo.5535306 (DOI)
- Dataset: https://github.com/chartes/e-NDP_HTR (URL)