CATMuS-Print [Large]
Authors/Creators
Contributors
Data collector (13):
Researcher:
Description
CATMuS-Print (Large) - Diachronic model for French prints and other West European languages
CATMuS (Consistent Approach to Transcribing ManuScript) Print is a Kraken HTR model trained on data produced by several projects covering different languages (French, Spanish, German, English, Corsican, Catalan, Latin, Italian…) and different centuries (from the first prints of the 16th century to digital documents of the 21st century).
Transcriptions follow graphematic principles and try to be as compatible as possible with guidelines previously published for French: no ligatures (except those that still exist), no allographetic variants (except the long s), and preservation of the historical use of some letters (u/v, i/j). Abbreviations are not resolved. Inconsistencies might be present, because the transcriptions were produced over several years and the norms have evolved slightly.
The model is trained with NFKD Unicode normalization: each diacritic (including superscripts) is transcribed as its own character, separately from the "main" character.
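To illustrate what NFKD normalization means for transcribed text, here is a minimal sketch using Python's standard `unicodedata` module. The helper names `to_nfkd` and `codepoints` and the sample strings are illustrative assumptions, not part of the model or of Kraken.

```python
import unicodedata


def to_nfkd(text: str) -> str:
    """Return the NFKD-normalized form of `text` (hypothetical helper)."""
    return unicodedata.normalize("NFKD", text)


def codepoints(text: str) -> list[str]:
    """List the code points of `text`, to make combining marks visible."""
    return [f"U+{ord(ch):04X}" for ch in text]


# "été" written with precomposed characters (U+00E9).
precomposed = "\u00e9t\u00e9"
decomposed = to_nfkd(precomposed)

# Each "é" becomes "e" followed by U+0301 COMBINING ACUTE ACCENT.
print(codepoints(precomposed))  # ['U+00E9', 'U+0074', 'U+00E9']
print(codepoints(decomposed))   # ['U+0065', 'U+0301', 'U+0074', 'U+0065', 'U+0301']

# A combining superscript letter (U+0364 COMBINING LATIN SMALL LETTER E)
# is already a separate character, so NFKD leaves the sequence unchanged.
print(to_nfkd("u\u0364") == "u\u0364")  # True
```

Applying the same normalization to ground truth or to reference transcriptions before comparing them with the model's output should keep character counts comparable. Note that plain NFKD in Python also folds compatibility characters, for instance the long s (ſ, U+017F) becomes s, so text that must keep such characters needs separate handling.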
This model is the result of a collaboration between researchers from the University of Geneva and Inria Paris and will be consolidated under the CATMuS Medieval Guidelines in an upcoming paper.
Files
| Name | Size |
|---|---|
| md5:9ed1ed4a6c34e1f4b292b380bf9c5543 | 22.9 MB |
| metadata.json (md5:2332232be0b1d0cfb714ef4f13764345) | 2.7 kB |
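The MD5 checksums listed above can be used to check that a download is intact. The following is a minimal sketch using only the standard library; the function names and the local file path passed to them are hypothetical, since the record does not preserve the model's file name.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Checksum published for the 22.9 MB file in the table above.
EXPECTED_MODEL_MD5 = "9ed1ed4a6c34e1f4b292b380bf9c5543"


def is_intact(path: str, expected_md5: str = EXPECTED_MODEL_MD5) -> bool:
    """Return True if the file at `path` matches the expected checksum."""
    return md5_of_file(path) == expected_md5.lower()
```

Calling `is_intact("path/to/downloaded-model")` (a placeholder path) returns `False` if the file was truncated or altered during download.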
Additional details
Related works
- Is documented by: Journal article, https://hal.science/hal-02577236 (URL)
Dates
- Available: 2024-01-30