Published September 19, 2024 | Version v1
Other Open

McCATMuS - Transcription model for handwritten, printed and typewritten documents from the 16th century to the 21st century

  • 1. Institut national de recherche en informatique et en automatique
  • 2. Université de Montréal
  • 3. École Pratique des Hautes Études

Description

Built upon datasets from institutions and projects committed to Open Science, McCATMuS provides an interoperable dataset encompassing over 180 manuscripts in 7 different languages (French, Latin, Spanish, English, German, Italian and Occitan). It includes more than 118,000 lines of text and nearly 4 million characters, covering a period from the early 16th century to the present day.

All the datasets were corrected automatically or, where specified, manually to conform to the CATMuS transcription guidelines, available here: https://catmus-guidelines.github.io/

The annotations in the dataset (layout extraction, line extraction, typing and transcription) come from the original creators of each dataset in most cases, or from automatic or manual corrections by the curator of the CATMuS modern dataset. The curator also aligned the dataset with the CATMuS guidelines.

The curated dataset can be accessed via HuggingFace: https://huggingface.co/datasets/CATMuS/modern

This model was trained on the McCATMuS dataset with Kraken v4.3.13, using NFD Unicode normalization and a batch size of 32, over 157 epochs.
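Because training used NFD normalization, the model presumably emits decomposed characters (a base letter followed by combining marks). The sketch below uses only Python's standard library to show what NFD does to a string; the idea that reference text should be normalized the same way before being compared with the model's output is an assumption here, not something the record states, and `to_nfd` is a hypothetical helper name.

```python
import unicodedata


def to_nfd(text: str) -> str:
    """Return the NFD (canonically decomposed) form of a string."""
    return unicodedata.normalize("NFD", text)


# "école" written with a single precomposed character U+00E9
composed = "\u00e9cole"
decomposed = to_nfd(composed)

# The precomposed é becomes "e" followed by U+0301 (combining acute accent),
# so the decomposed string is one code point longer.
print(len(composed), len(decomposed))
print([f"U+{ord(ch):04X}" for ch in decomposed[:2]])
```

Applying `to_nfd` to both the reference transcription and the prediction keeps visually identical strings from being counted as different when computing an error rate.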

 

Files

metadata.json

Files (16.2 MB)

  • md5:e531463f631303c700750784b4f9ed63 (16.2 MB)
  • md5:5950cd621117d45d87f91ce8331b2353 (1.7 kB)
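The md5 values listed for the files can be used to check a download for corruption. The following is a minimal sketch using Python's standard `hashlib`; the helper names `md5_of_file` and `matches_record` are hypothetical, and the local file name in the usage comment is a placeholder, since the record's file names are not shown here.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read until an empty bytes object signals end of file
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_record(path, listed):
    """Compare a local file against a checksum written as 'md5:<hex>'."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected


# Usage (placeholder path; substitute the file actually downloaded):
# matches_record("downloaded_model_file", "md5:e531463f631303c700750784b4f9ed63")
```

MD5 is adequate for detecting accidental corruption during transfer, though it is not a defense against deliberate tampering.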

Additional details