Published January 30, 2024 | Version 2024-01-30
Model Open

CATMuS-Print [Large]

  • 1. University of Geneva
  • 2. Institut national de recherche en informatique et en automatique
  • 1. University of Geneva
  • 2. Université de Strasbourg
  • 3. Institut national de recherche en informatique et en automatique

Description

CATMuS-Print (Large) - Diachronic model for French prints and other West European languages

CATMuS (Consistent Approach to Transcribing ManuScript) Print is a Kraken HTR model trained on data produced by several projects, covering different languages (French, Spanish, German, English, Corsican, Catalan, Latin, Italian…) and different centuries (from the first prints of the 16th century to digital documents of the 21st century).

Transcriptions follow graphematic principles and try to be as compatible as possible with guidelines previously published for French: no ligatures (except those that still exist), no allographetic variants (except the long s), and preservation of the historical use of some letters (u/v, i/j). Abbreviations are not resolved. Inconsistencies might be present, because the transcriptions were produced over several years and the norms have evolved slightly.

The model is trained with NFKD Unicode normalization: each diacritic (including superscripts) is transcribed as its own character, separately from the "main" character.
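To illustrate what NFKD normalization means for a transcription, the sketch below decomposes a precomposed string into base letters and combining marks using Python's standard `unicodedata` module. It is only an illustration, not the training pipeline; the helper names `to_nfkd` and `describe` are hypothetical. Something along these lines may be useful when preparing reference text to compare against the model's output.

```python
import unicodedata


def to_nfkd(text: str) -> str:
    """Return the NFKD (compatibility decomposition) form of text."""
    return unicodedata.normalize("NFKD", text)


def describe(text: str) -> list:
    """List (code point, Unicode name) pairs for the NFKD form of text."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for ch in to_nfkd(text)
    ]


# "été" written with precomposed U+00E9: three code points.
sample = "\u00e9t\u00e9"
decomposed = to_nfkd(sample)

# After NFKD each accent is its own combining character.
print(len(sample), len(decomposed))
for code_point, name in describe(sample):
    print(code_point, name)
```

Because the accent becomes a separate code point, character-level comparisons against text in another normalization form (for example NFC) would count mismatches that are not real transcription differences, so both sides should be normalized the same way first.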

This model is the result of a collaboration between researchers from the University of Geneva and Inria Paris and will be consolidated under the CATMuS Medieval Guidelines in an upcoming paper.

Files

metadata.json

Files (22.9 MB)

Name Size
md5:9ed1ed4a6c34e1f4b292b380bf9c5543 22.9 MB
md5:2332232be0b1d0cfb714ef4f13764345 2.7 kB
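The md5 checksums listed above can be used to check that a downloaded file is intact. The following is a minimal sketch using Python's standard `hashlib`; the function names are hypothetical, and the local file path is whatever name the download was saved under, which this record listing does not show.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_md5(path, listed):
    """Compare a file against a checksum written as 'md5:<hex>' or '<hex>'."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected
```

For example, calling `matches_listed_md5(local_path, "md5:9ed1ed4a6c34e1f4b292b380bf9c5543")` on the 22.9 MB download should return `True` if the file was transferred without corruption, where `local_path` stands for the assumed location of the saved file.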

Additional details

Related works

Is documented by
Journal article: https://hal.science/hal-02577236 (URL)

Dates

Available
2024-01-30