McCATMuS - Transcription model for handwritten, printed and typewritten documents from the 16th century to the 21st century

Chagué, Alix

doi:10.5281/zenodo.13788177

Published September 19, 2024 | Version v1

Other Open

Chagué, Alix (Editor)^{1, 2, 3}

1. Institut national de recherche en informatique et en automatique
2. Université de Montréal
3. École Pratique des Hautes Études

Contributors

Project member (3):

Built upon datasets from institutions and projects committed to Open Science, McCATMuS provides an interoperable dataset encompassing over 180 manuscripts in 7 different languages (French, Latin, Spanish, English, German, Italian and Occitan). It includes more than 118,000 lines of text and nearly 4 million characters, covering a period from the early 16th century to the present day.

All datasets were corrected to conform to the CATMuS transcription guidelines, automatically or, where specified, manually. The guidelines are available here: https://catmus-guidelines.github.io/

The annotations in the dataset (layout extraction, line extraction, typing and transcription) come in most cases from the original creators of the datasets, and otherwise from automatic or manual corrections by the curator of the CATMuS modern dataset. The curator also aligned the dataset with the CATMuS guidelines.

The curated dataset can be accessed via HuggingFace: https://huggingface.co/datasets/CATMuS/modern

This model was trained on the McCATMuS dataset with Kraken v4.3.13, using NFD Unicode normalization and a batch size of 32, over 157 epochs.
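Because training used NFD normalization, the model's output can be expected to contain decomposed characters (a base letter followed by combining marks). The sketch below illustrates what NFD does, using only Python's standard `unicodedata` module; the helper name `to_nfd` is illustrative and is not part of Kraken.

```python
import unicodedata


def to_nfd(text: str) -> str:
    """Return text in Unicode Normalization Form D (canonical decomposition)."""
    return unicodedata.normalize("NFD", text)


# "été" written with precomposed characters (U+00E9) has 3 code points.
composed = "\u00e9t\u00e9"
decomposed = to_nfd(composed)

# After NFD, each "é" becomes "e" + U+0301 (combining acute accent).
print(len(composed), len(decomposed))  # 3 5
print(decomposed == "e\u0301te\u0301")  # True
```

When comparing predictions with reference text stored in another form (for example NFC), normalizing both sides to the same form first avoids spurious mismatches.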

Files (16.2 MB)

Name	Size	MD5
McCATMuS_nfd_nofix_V1.mlmodel	16.2 MB	e531463f631303c700750784b4f9ed63
metadata.json	1.7 kB	5950cd621117d45d87f91ce8331b2353
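The MD5 checksums listed with the files can be used to check that a download is intact. A minimal sketch with Python's standard `hashlib`, assuming the model file has been saved in the current directory under its original name; the function names `md5_of_file` and `verify_model` are illustrative.

```python
import hashlib
from pathlib import Path

# Checksum of McCATMuS_nfd_nofix_V1.mlmodel as given in the file listing.
EXPECTED_MD5 = "e531463f631303c700750784b4f9ed63"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_model(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest matches the expected value."""
    return md5_of_file(path) == expected


if __name__ == "__main__":
    # Assumed local download location (hypothetical path).
    model_path = Path("McCATMuS_nfd_nofix_V1.mlmodel")
    if model_path.exists():
        print("checksum OK" if verify_model(model_path) else "checksum mismatch")
```

MD5 is adequate for detecting a corrupted or truncated download, but it is not a guarantee of authenticity.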

Additional details

Is continued by:
Dataset: https://huggingface.co/datasets/CATMuS/modern (URL)

Is derived from:
Dataset: 10.5281/zenodo.10813666 (DOI)
Dataset: 10.5281/zenodo.10813111 (DOI)
Dataset: 10.5281/zenodo.10198863 (DOI)
Dataset: 10.5281/zenodo.10198870 (DOI)
Dataset: 10.5281/zenodo.10666988 (DOI)
Dataset: 10.5281/zenodo.8193319 (DOI)
Dataset: 10.5281/zenodo.6126625 (DOI)
Dataset: 10.5281/zenodo.12799158 (DOI)
Dataset: 10.5281/zenodo.10631356 (DOI)
Dataset: 10.5281/zenodo.10632594 (DOI)
Dataset: 10.5281/zenodo.10177570 (DOI)
Dataset: 10.5281/zenodo.7778045 (DOI)
Dataset: 10.5281/zenodo.10668574 (DOI)
Dataset: 10.5281/zenodo.10533865 (DOI)
Dataset: 10.5281/zenodo.7075186 (DOI)
Dataset: 10.5281/zenodo.5417946 (DOI)
Dataset: https://github.com/FoNDUE-HTR/FONDUE-FR-MSS-17 (URL)
Dataset: https://github.com/FoNDUE-HTR/FoNDUE_Wolfflin_Fotosammlung (URL)
Dataset: https://github.com/FoNDUE-HTR/FONDUE-LA-MSS-16-PR (URL)
Dataset: https://github.com/HTR-United/lectaurep-repertoires (URL)
Dataset: https://github.com/Gallicorpora/HTR-imprime-18e-siecle (URL)
Dataset: https://github.com/Gallicorpora/HTR-imprime-17e-siecle (URL)
Dataset: https://github.com/HTR-United/cremma-16-17-print (URL)

	All versions	This version
Views	3,396	3,396
Downloads	2,946	2,946
Data volume	22.3 GB	22.3 GB
