TRIDIS: HTR model for Multilingual Medieval and Early Modern Documentary Manuscripts (11th-16th)

Torres Aguilar, Sergio

doi:10.5281/zenodo.10788591

Published March 6, 2024 | Version v1

Other Open

Torres Aguilar, Sergio¹

1. University of Luxembourg

TRIDIS (Tria Digita Scribunt) is a Handwritten Text Recognition (HTR) model trained on semi-diplomatic transcriptions of medieval and Early Modern manuscripts. It is suited to documentary manuscripts, that is, manuscripts arising from legal, administrative, and memorial practices, most commonly from the Late Middle Ages (13th century onwards). It can also perform well on documents from other domains, such as literary books, scholarly treatises, and cartularies, providing a versatile tool for historians and philologists who transcribe and analyze historical texts.

A paper presenting the first version of the model is available here: Sergio Torres Aguilar, Vincent Jolivet. Handwritten Text Recognition for Documentary Medieval Manuscripts. Journal of Data Mining and Digital Humanities. 2023. https://hal.science/hal-03892163

Transcription rules:

Since the majority of the training documents come from diplomatic editions, the transcriptions were normalized to contemporary reading standards, and abbreviations were expanded with the aim of facilitating a more fluid reading of the document.

The following rules were applied:

The abbreviations have been expanded, both those by suspension (facimꝰ --> facimus) and those by contraction (dñi --> domini). Likewise, those using conventional signs (⁊ --> et ; ꝓ --> pro) have been resolved.
The named entities (names of persons, places and institutions) have been capitalized. The beginning of a block of text as well as the original capitals used by the scribe are also capitalized.
The consonantal i and u characters have been transcribed as j and v in both French and Latin.
The punctuation marks used in the manuscript (such as . or / or |) have not been systematically transcribed, because the transcription has been standardized with modern punctuation.
Corrections and words that appear cancelled in the manuscript have been transcribed surrounded by the sign $ at the beginning and at the end.
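The last rule has a practical consequence for anyone consuming the model output: cancelled words arrive wrapped in $ signs. The following is a minimal sketch of how such spans could be separated from the running text. It assumes each cancelled span opens and closes on the same line and that $ signs are never nested or unbalanced. The function name split_cancelled is hypothetical and is not part of TRIDIS or Kraken.

```python
import re

# Assumption: a cancelled span is written as $text$ within a single line,
# with no nested or unbalanced $ signs.
CANCELLED = re.compile(r"\$([^$]*)\$")


def split_cancelled(line):
    """Return (line without cancelled spans, list of cancelled spans)."""
    cancelled = CANCELLED.findall(line)
    cleaned = CANCELLED.sub("", line)
    # Collapse the extra whitespace left behind by the removed spans.
    cleaned = re.sub(r"\s{2,}", " ", cleaned).strip()
    return cleaned, cancelled


if __name__ == "__main__":
    text, struck = split_cancelled("In nomine $domini$ Domini amen")
    print(text)    # running text without the cancelled word
    print(struck)  # the cancelled words, kept for the critical apparatus
```

Keeping the cancelled spans in a separate list, rather than discarding them, preserves the scribal corrections for later study.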

Versions:

Version 1 of the model was trained on a dataset of charters and registers from the Late Medieval period (12th-15th centuries). Training and evaluation involved 1855 pages, 120k lines of text, and almost 1M tokens, drawn from three freely available ground-truth corpora:

The Alcar-HOME database: https://zenodo.org/record/5600884
The e-NDP corpus: https://zenodo.org/record/7575693
The Himanis project: https://zenodo.org/record/5535306

Version 2 of the model adds new datasets from feudal books and legal proceedings (14th-16th centuries), contributing an additional 115k lines and more than 1.2M tokens to the previous version. These come from other corpora such as:

Königsfelden Abbey corpus: https://zenodo.org/record/5179361
Monumenta Luxemburgensia.

Accuracy

TRIDIS was trained using a CNN+RNN+CTC architecture within the Kraken suite (https://kraken.re/). The final model operates in a multilingual environment (Latin, Old French, and Old Spanish) and can recognize several Latin script families (mostly Textualis and Cursiva) in documents produced between the 11th and 16th centuries. During evaluation, the model showed an accuracy of 93.1% on the validation set and a CER (Character Error Rate) of about 0.11 to 0.15 on four external unseen datasets. Fine-tuning the model with 10 ground-truth pages can improve these results to a CER of between 0.06 and 0.10.
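For readers who want to measure CER on their own pages, here is a minimal sketch. It assumes the usual definition, the character-level edit distance between prediction and ground truth divided by the length of the ground truth. The record does not describe the exact evaluation script, so this is an illustration and not the author's procedure. The function names levenshtein and cer are hypothetical.

```python
def levenshtein(reference, hypothesis):
    """Minimum number of insertions, deletions and substitutions
    needed to turn `hypothesis` into `reference`."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def cer(reference, hypothesis):
    """Character error rate, assuming CER = edit distance / len(reference)."""
    if not reference:
        raise ValueError("reference must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # One missing character in a six-character reference.
    print(round(cer("domini", "domni"), 3))
```

Under this definition, a CER of 0.11 corresponds to roughly 11 character errors per 100 ground-truth characters.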

Other formats

The ground truth used for version 2 was also employed to train a Transformer HTR model that combines TrOCR as the encoder with a RoBERTa medieval model as the decoder. This model achieves a slightly better CER than the current TRIDIS version and improves the word error rate (WER) by about 25%. The model is available on the Hugging Face Hub: magistermilitum/tridis_HTR

Files

Files (24.9 MB)

Name	Checksum	Size
metadata.json	md5:bcf5f6a23501df61d7fed56a79c808c0	2.0 kB
Tridis_Medieval_EarlyModern.mlmodel	md5:f38d08c1dc6ec5d618861266e1f98c66	24.9 MB

