McCATMuS - Transcription model for handwritten, printed and typewritten documents from the 16th century to the 21st century
Description
Built upon datasets from institutions and projects committed to Open Science, McCATMuS provides an interoperable dataset encompassing over 180 manuscripts in 7 different languages (French, Latin, Spanish, English, German, Italian and Occitan). It includes more than 118,000 lines of text and nearly 4 million characters, covering a period from the early 16th century to the present day.
All the datasets were corrected, automatically or, where specified, manually, to conform to the CATMuS transcription guidelines, available here: https://catmus-guidelines.github.io/
The annotations in the dataset (layout extraction, line extraction, typing and transcription) come in most cases from the original creators of each source dataset, and otherwise from automatic or manual corrections by the curator of the CATMuS modern dataset. The curator also aligned the dataset with the CATMuS guidelines.
The curated dataset can be accessed via HuggingFace: https://huggingface.co/datasets/CATMuS/modern
This model was trained on the McCATMuS dataset with Kraken v4.3.13, using NFD Unicode normalization and a batch size of 32, for 157 epochs.
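Because training used NFD normalization, the model's output can be expected to contain decomposed characters (a base letter followed by combining marks). The following is a minimal sketch, using only Python's standard library, of how text could be brought to the same form before comparing it with the model's transcriptions. The helper names `to_nfd` and `same_text` are illustrative and are not part of Kraken or the CATMuS tooling.

```python
import unicodedata


def to_nfd(text: str) -> str:
    """Return text in Unicode Normalization Form D (canonical decomposition)."""
    return unicodedata.normalize("NFD", text)


def same_text(a: str, b: str) -> bool:
    """Compare two strings after bringing both to NFD (hypothetical helper)."""
    return to_nfd(a) == to_nfd(b)


# A precomposed "e with acute" (U+00E9) becomes "e" + combining acute (U+0301).
composed = "\u00e9"
decomposed = to_nfd(composed)
print([f"U+{ord(ch):04X}" for ch in decomposed])  # ['U+0065', 'U+0301']

# The two spellings differ as raw strings but match once normalized.
print(composed == "e\u0301")           # False
print(same_text(composed, "e\u0301"))  # True
```

Normalizing both the reference text and the predictions in this way avoids counting visually identical characters as transcription errors.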
Files
(16.2 MB)
Name | Size |
---|---|
md5:e531463f631303c700750784b4f9ed63 | 16.2 MB |
metadata.json md5:5950cd621117d45d87f91ce8331b2353 | 1.7 kB |
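The MD5 checksums listed with the files can be used to confirm that a download is complete and uncorrupted. The sketch below uses Python's standard `hashlib` module. The function names and the local file path in the usage comment are hypothetical, since the record does not prescribe a verification procedure.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20) -> str:
    """Compute the hexadecimal MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large model files need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected: str) -> bool:
    """Check a file against a checksum written as 'md5:<hex>' or '<hex>'."""
    expected_hex = expected.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected_hex


# Usage (the local path is an assumption; use wherever the file was saved):
# matches_checksum("metadata.json", "md5:5950cd621117d45d87f91ce8331b2353")
```

MD5 is adequate here for detecting accidental corruption during transfer; it should not be relied on as protection against deliberate tampering.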