TRIDIS: HTR model for Multilingual Medieval and Early Modern Documentary Manuscripts (11th-16th)
Description
TRIDIS (Tria Digita Scribunt) is a Handwritten Text Recognition model trained on semi-diplomatic transcriptions of medieval and Early Modern manuscripts. It is suited to documentary manuscripts, that is, manuscripts arising from legal, administrative, and memorial practices, most commonly from the Late Middle Ages (13th century onwards). It can also perform well on documents from other domains, such as literary books, scholarly treatises and cartularies, providing a versatile tool for historians and philologists who transcribe and analyze historical texts.
A paper presenting the first version of the model is available here: Sergio Torres Aguilar, Vincent Jolivet. Handwritten Text Recognition for Documentary Medieval Manuscripts. Journal of Data Mining and Digital Humanities. 2023. https://hal.science/hal-03892163
Transcription rules:
Since the majority of the training documents come from diplomatic editions, the transcriptions were normalized to contemporary reading standards, and abbreviations were expanded with the aim of facilitating a more fluid reading of the document.
The following rules were applied:
- The abbreviations have been expanded, both those by suspension (facimꝰ --> facimus) and by contraction (dñi --> domini). Likewise, those using conventional signs (⁊ --> et; ꝓ --> pro) have been resolved.
- The named entities (names of persons, places and institutions) have been capitalized. The beginning of a block of text as well as the original capitals used by the scribe are also capitalized.
- The consonantal i and u characters have been transcribed as j and v in both French and Latin.
- The punctuation marks used in the manuscript, like . or / or |, have not been systematically transcribed, as the transcription has been standardized with modern punctuation.
- Corrections and words that appear cancelled in the manuscript have been transcribed surrounded by the sign $ at the beginning and at the end.
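Because cancelled words are kept in the transcription between $ signs, text that follows these rules can be post-processed to list or drop them. The sketch below is only an illustration and is not part of TRIDIS or Kraken: the function names are hypothetical, and it assumes that a cancelled span opens and closes on the same line and that $ is not used for anything else.

```python
import re

# Assumption: a cancelled span is written as $...$ on a single line.
CANCELLED = re.compile(r"\$(.*?)\$")

def cancelled_words(line):
    """Return the text of every span marked as cancelled in a transcribed line."""
    return CANCELLED.findall(line)

def drop_cancelled(line):
    """Remove the cancelled spans and collapse the whitespace left behind."""
    return " ".join(CANCELLED.sub(" ", line).split())
```

Keeping the two steps separate lets an editor review the cancelled readings before deciding whether to discard them from a reading text.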
Versions:
Version 1 of the model was trained on a dataset of charters and registers from the Late Medieval period (12th-15th centuries). Training and evaluation involved 1855 pages, 120k lines of text, and almost 1M tokens, drawn from three freely available ground-truth corpora:
- The Alcar-HOME database: https://zenodo.org/record/5600884
- The e-NDP corpus: https://zenodo.org/record/7575693
- The Himanis project: https://zenodo.org/record/5535306
Version 2 of the model adds new datasets from feudal books and legal proceedings (14th-16th centuries), contributing an additional 115k lines and more than 1.2M tokens to the previous version. These come from other corpora such as:
- Königsfelden Abbey corpus: https://zenodo.org/record/5179361
- Monumenta Luxemburgensia.
Accuracy
TRIDIS was trained using a CNN+RNN+CTC architecture within the Kraken suite (https://kraken.re/). The final model is multilingual (Latin, Old French, and Old Spanish) and can recognize several Latin script families (mostly Textualis and Cursiva) in documents produced between about the 11th and 16th centuries. During evaluation, the model reached an accuracy of 93.1% on the validation set and a CER (Character Error Rate) of about 0.11 to 0.15 on four external unseen datasets. Fine-tuning the model with 10 ground-truth pages can improve these results to a CER of between 0.06 and 0.10.
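For readers who want to compare their own transcriptions against the figures above, the following is a minimal sketch of the usual CER definition: the Levenshtein edit distance between the reference and the predicted text, divided by the length of the reference. It is an assumption that the reported figures follow exactly this definition; Kraken's own evaluation tooling may normalize or count characters differently, and the function names here are illustrative.

```python
def levenshtein(ref, hyp):
    """Edit distance (insertions, deletions, substitutions) between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def cer(reference, hypothesis):
    """Character error rate: edit distance divided by the reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)
```

Under this definition a CER of 0.11 corresponds to about 11 character edits per 100 reference characters. Passing lists of words instead of strings to `levenshtein` gives the word-level count used for WER.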
Other formats
The ground truth used for version 2 was also employed to train a Transformer HTR model that combines TrOCR as the encoder with a RoBERTa medieval model as the decoder. This model performs slightly better than the current TRIDIS version in terms of CER and improves WER by about 25%. The model is available on the Hugging Face Hub: magistermilitum/tridis_HTR
Files
metadata.json
Total size: 24.9 MB
MD5 checksum | Size |
---|---|
bcf5f6a23501df61d7fed56a79c808c0 | 2.0 kB |
f38d08c1dc6ec5d618861266e1f98c66 | 24.9 MB |