Published March 9, 2024 | Version v2
Dataset Open

TRIDIS: HTR model for Multilingual Medieval and Early Modern Documentary Manuscripts (11th-16th)

  • 1. University of Luxembourg
  • 2. École nationale des chartes

Description

TRIDIS (Tria Digita Scribunt) is a Handwritten Text Recognition (HTR) model trained on semi-diplomatic transcriptions of medieval and Early Modern manuscripts. It is suited to documentary manuscripts, that is, manuscripts arising from legal, administrative, and memorial practices, most commonly from the Late Middle Ages (13th century onwards). It can also perform well on documents from other domains, such as literary books, scholarly treatises, and cartularies, giving historians and philologists a versatile tool for transcribing and analyzing historical texts.

A paper presenting the first version of the model is available here: Sergio Torres Aguilar, Vincent Jolivet. Handwritten Text Recognition for Documentary Medieval Manuscripts. Journal of Data Mining and Digital Humanities. 2023. https://hal.science/hal-03892163

 

Transcription rules:

Since the majority of the training documents come from diplomatic editions, the transcriptions were normalized to contemporary reading standards, and abbreviations were expanded with the aim of facilitating a more fluid reading of the document.

The following rules were applied:

  • The abbreviations have been expanded, both those by suspension (facimꝰ --> facimus) and those by contraction (dñi --> domini). Likewise, those using conventional signs (⁊ --> et ; ꝓ --> pro) have been resolved.
  • The named entities (names of persons, places and institutions) have been capitalized. The beginning of a block of text as well as the original capitals used by the scribe are also capitalized.
  • The consonantal i and u characters have been transcribed as j and v in both French and Latin.
  • Punctuation marks used in the manuscript, such as . or / or |, have not been systematically transcribed, because the transcription has been standardized with modern punctuation.
  • Corrections and words that appear cancelled in the manuscript have been transcribed enclosed in $ signs, one at the beginning and one at the end.
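
As a rough illustration of two of these rules, the Python sketch below expands abbreviations and separates cancelled text marked with $. It is not part of the TRIDIS tooling: the function names are hypothetical, the abbreviation table contains only the two examples quoted above, and the plain string replacement is a simplifying assumption that ignores word boundaries.

```python
import re

# Illustrative subset only: the two expansions quoted in the rules above.
# A real table would be much larger; this one is an assumption for the demo.
ABBREVIATIONS = {
    "facimꝰ": "facimus",  # abbreviation by suspension
    "dñi": "domini",      # abbreviation by contraction
}

# Cancelled words or corrections are enclosed in a pair of $ signs.
CANCELLED = re.compile(r"\$([^$]*)\$")


def expand_abbreviations(text: str) -> str:
    """Replace each known abbreviated form with its expansion (naive replace)."""
    for abbreviated, expanded in ABBREVIATIONS.items():
        text = text.replace(abbreviated, expanded)
    return text


def split_cancelled(text: str) -> tuple[str, list[str]]:
    """Return the text without $...$ spans, plus the list of cancelled spans."""
    cancelled = [span.strip() for span in CANCELLED.findall(text)]
    cleaned = CANCELLED.sub("", text)
    # Collapse the double spaces left behind by removed spans.
    cleaned = re.sub(r"\s+", " ", cleaned).strip()
    return cleaned, cancelled


if __name__ == "__main__":
    line = "Ego Petrus $comes$ dux Burgundie"
    print(split_cancelled(line))
    print(expand_abbreviations("nos facimꝰ"))
```

Keeping the cancelled spans in a separate list, rather than discarding them, lets a reader choose between a clean reading text and a record of scribal corrections.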

 

Versions:

Version 1 of the model was trained on a dataset of charters and registers from the Late Medieval period (12th-15th centuries). Training and evaluation involved 1855 pages, 120k lines of text, and almost 1M tokens, drawn from three freely available ground-truth corpora.

Version 2 of the model adds new datasets from feudal books and legal proceedings (14th-16th centuries), contributing an additional 115k lines and more than 1.2M tokens to the previous version, drawn from other corpora.

 

Accuracy

TRIDIS was trained using a CNN+RNN+CTC architecture within the Kraken suite (https://kraken.re/). The final model works in a multilingual setting (Latin, Old French, and Old Spanish) and can recognize several Latin script families (mostly Textualis and Cursiva) in documents produced circa the 11th-16th centuries. During evaluation, the model reached an accuracy of 93.1% on the validation set and a CER (Character Error Rate) of about 0.11 to 0.15 on four external unseen datasets. Fine-tuning the model with 10 ground-truth pages can improve these results to a CER between 0.06 and 0.10.
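
For readers unfamiliar with the metric, the sketch below computes CER in the usual way, as the character-level edit distance between the reference and the model output divided by the reference length. This is a generic definition written for illustration, not the evaluation script used for TRIDIS; the function names are hypothetical, and details such as whitespace or Unicode normalization, which can shift the score, are left out.

```python
def levenshtein(reference, hypothesis) -> int:
    """Minimum number of insertions, deletions and substitutions
    needed to turn `hypothesis` into `reference` (any two sequences)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, start=1):
        current = [i]
        for j, hyp_item in enumerate(hypothesis, start=1):
            cost = 0 if ref_item == hyp_item else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance over reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # One missing character in a six-character reference.
    print(cer("domini", "domni"))
```

Under this definition, a CER of 0.11 corresponds to roughly 11 character errors per 100 reference characters.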

Other formats

The ground truth used for version 2 was also used to train a Transformer HTR model that combines TrOCR as the encoder with a RoBERTa medieval model as the decoder. This model performs slightly better than the current TRIDIS version in terms of CER and improves WER by about 25%. The model is available on the Hugging Face Hub: magistermilitum/tridis_HTR
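
To make the WER comparison concrete, the sketch below computes a word error rate over whitespace-separated tokens and the relative reduction between two scores. It assumes that the 25% figure is a relative reduction, which the description does not state; the function names and the example scores are hypothetical and are not measured results.

```python
def word_edit_distance(ref_words: list[str], hyp_words: list[str]) -> int:
    """Edit distance counted in whole words rather than characters."""
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate over whitespace-separated tokens."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return word_edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def relative_reduction(baseline: float, improved: float) -> float:
    """Share of the baseline error removed by the improved model."""
    return (baseline - improved) / baseline


if __name__ == "__main__":
    # One wrong word out of four.
    print(wer("in nomine domini amen", "in nomine domni amen"))
    # Hypothetical scores, chosen only to illustrate a 25% relative drop.
    print(relative_reduction(0.40, 0.30))
```

A single wrong character makes the whole word count as an error, which is why WER is typically much higher than CER for the same output.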

Files

metadata.json

Files (24.9 MB)

Name Size
md5:bcf5f6a23501df61d7fed56a79c808c0 2.0 kB
md5:f38d08c1dc6ec5d618861266e1f98c66 24.9 MB
