Dataset and evaluation for HTR models for Latin and French Medieval Documentary Manuscripts

Torres Aguilar, Sergio; Jolivet, Vincent

doi:10.5281/zenodo.7401833

Published January 10, 2023 | Version 0.1

Dataset Open

1. University of Luxembourg
2. École nationale des chartes

1. Dataset presentation.

This is the dataset used to produce the HTR models for documentary Latin and French manuscripts presented in the paper: Sergio Torres Aguilar, Vincent Jolivet. Handwritten Text Recognition for Documentary Medieval Manuscripts. 2022. https://hal.science/hal-03892163

The dataset contains mostly charters and registers from the late medieval period (12th-15th centuries). Training and evaluation, covering 1,855 pages, 120k lines of text and almost 1M tokens, were conducted using three freely available ground-truth corpora:

The Alcar-HOME database: https://zenodo.org/record/5600884

The e-NDP corpus: https://github.com/chartes/e-NDP_HTR

The Himanis project: https://zenodo.org/record/5535306

The final model operates in a multilingual setting (Latin and French) and can recognize several Latin script families (mostly Textualis and Cursiva) in documents produced between ca. the 12th and 15th centuries. In evaluation, the model shows an accuracy of 94.01% on the validation set and a CER (character error rate) of about 0.12 to 0.17 on four external unseen datasets. Fine-tuning on 10 ground-truth pages lowers the CER on these datasets to between 0.06 and 0.10.

2. Dataset contents.

a) GT_list: List of the ground-truth (GT) file names that make up the training, evaluation and test sets. The images and transcriptions can be downloaded from their original repositories.

b) Training: Contains the training and testing results (evaluation and prediction files) presented in the original paper for the two training phases: Regular (separate training for Textualis and Cursiva) and Quartiles (mixed training by quartiles).

c) Useful_scripts: Scripts to compute the HTR metrics (CER, WER, SER) and plot the model's accuracy.

d) Best_model: Contains the best multilingual and multi-script model.
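
The CER and WER metrics mentioned above are conventionally defined as the edit (Levenshtein) distance between a reference transcription and the model prediction, divided by the length of the reference in characters or in words. The following is a minimal sketch of that conventional definition, not the code shipped in Useful_scripts, whose normalization and tokenization choices may differ. The function names and the sample strings are illustrative assumptions.

```python
def levenshtein(ref, hyp):
    """Edit distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: character edits divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate: word edits divided by the number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return levenshtein(ref_words, hypothesis.split()) / len(ref_words)


if __name__ == "__main__":
    # Hypothetical line of ground truth and a hypothetical prediction.
    gt = "in nomine domini"
    pred = "in nomine domni"
    print(f"CER = {cer(gt, pred):.3f}, WER = {wer(gt, pred):.3f}")
```

Under this definition, a CER of 0.12 corresponds to roughly 12 character edits per 100 reference characters.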

Files (600.1 MB)

Name	Size	MD5
HTR_medieval_documentary_best.mlmodel	23.7 MB	27b9163851c3f5bb08160434697d6b9e
HTR_medieval_documentary_manuscripts.zip	576.4 MB	5c0588a70dae5ca263185c5ef1c7b087
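
The MD5 checksums listed with the files can be used to confirm that a download completed without corruption. The sketch below assumes the two files were saved under their original names in a local directory. The helper names (`md5_of_chunks`, `md5_of_file`, `check_downloads`) are illustrative and not part of the dataset.

```python
import hashlib
from pathlib import Path

# Checksums copied from the file listing of this record.
EXPECTED_MD5 = {
    "HTR_medieval_documentary_best.mlmodel": "27b9163851c3f5bb08160434697d6b9e",
    "HTR_medieval_documentary_manuscripts.zip": "5c0588a70dae5ca263185c5ef1c7b087",
}


def md5_of_chunks(chunks):
    """Return the hex MD5 digest of an iterable of byte chunks."""
    digest = hashlib.md5()
    for chunk in chunks:
        digest.update(chunk)
    return digest.hexdigest()


def md5_of_file(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so large archives are not loaded at once."""
    with open(path, "rb") as handle:
        return md5_of_chunks(iter(lambda: handle.read(chunk_size), b""))


def check_downloads(directory="."):
    """Map each expected file to True/False (checksum match) or None (missing)."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        results[name] = (md5_of_file(path) == expected) if path.exists() else None
    return results


if __name__ == "__main__":
    labels = {True: "OK", False: "MISMATCH", None: "not found"}
    for name, status in check_downloads().items():
        print(f"{name}: {labels[status]}")
```

A result of MISMATCH usually means the download was interrupted and the file should be fetched again.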

Additional details

Is published in: Working paper: https://hal.science/hal-03892163 (URL)
References: Dataset: 10.5281/zenodo.5600884 (DOI); Dataset: 10.5281/zenodo.5535306 (DOI); Dataset: https://github.com/chartes/e-NDP_HTR (URL)