Multilingual Party model for European languages
Description
Party for European Languages
Party stands for "PAge-wise Recognition of Text-y". It is a replacement for conventional text recognizers in automatic text recognition pipelines that use either bounding box or baseline plus bounding polygon segmentation methods for layout analysis.
This is a model for the recognition of print and handwriting in a number of European languages using the most recent party release with language token support:
- Ancient Greek
- Catalan
- Church Slavonic
- Corsican
- Czech
- Dutch
- English
- Finnish
- French
- German
- Irish
- Latin
- Lithuanian
- Middle Dutch
- Middle French
- Norwegian
- Occitan
- Picard
- Polish
- Portuguese
- Romanian
- Russian
- Serbian
- Slovenian
- Spanish
- Ukrainian
Serbian has been trained only on Cyrillic script.
Architecture
The recognizer is a deep fusion multimodal model consisting of a Swin vision encoder and a tiny Llama (100M parameters) decoder trained with octet tokenization. The network is prompted with the line positions through positional embeddings added to the encoder hidden state.
During training, the encoder weights were initialized from an ImageNet-22k pretrained Swin-base from pytorch-image-models; the decoder weights came from a custom Llama 3.2 pretrained on a subset of OSCAR 2301 tokenized with a ByT5-style octet tokenizer.
The pre-initialized model was then pre-trained on a collection of public and private datasets of historical document pages, augmented with born-digital data derived from PubLayNet.
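As an illustration of the octet tokenization mentioned above, the following is a minimal sketch of a ByT5-style byte tokenizer. The special-token layout (three reserved ids, end-of-sequence id 1) follows ByT5 and is an assumption here; the actual party tokenizer, including its language tokens, may be laid out differently. The names `NUM_SPECIAL`, `EOS_ID`, `encode`, and `decode` are hypothetical.

```python
# Hypothetical sketch of ByT5-style octet tokenization.
# Assumption: ids 0..2 are reserved (pad, eos, unk) as in ByT5; the real
# party tokenizer and its language tokens are not modeled here.
NUM_SPECIAL = 3
EOS_ID = 1


def encode(text: str) -> list[int]:
    """Map each UTF-8 code unit to a token id and append an end-of-sequence id."""
    return [b + NUM_SPECIAL for b in text.encode("utf-8")] + [EOS_ID]


def decode(ids: list[int]) -> str:
    """Drop reserved ids, shift back to raw bytes, and decode as UTF-8."""
    raw = bytes(i - NUM_SPECIAL for i in ids if i >= NUM_SPECIAL)
    return raw.decode("utf-8", errors="replace")


ids = encode("Zürich")
print(len(ids))      # number of tokens, including the end-of-sequence id
print(decode(ids))   # round-trips back to the input string
```

Because every token is a single byte, the vocabulary stays tiny regardless of script, at the cost of longer sequences for non-ASCII text: "ü" alone takes two tokens.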
Uses
This model is a recognition foundation model primarily targeted at automatic text recognition for the humanities. While it produces fairly accurate output on a wide range of material, it is intended to be fine-tuned on a target dataset to ensure compliance with the desired transcription guidelines.
Transcription guidelines, Normalization, and Transformations
No attempt has been made to normalize the datasets or to restrict the training data to material that follows common transcription guidelines. While some subsets of the corpus are internally consistent, only a very small number of the languages in the training data are covered by datasets from a single source.
Bias, Risks, and Limitations
The training corpus frequently incorporates datasets of esoteric material transcribed for specific purposes. Machine-printed and born-digital material in particular lacks diversity, so error rates will most likely vary considerably across languages and document types.
Some additional limitations are to be expected:
- Some transcriptions resolved abbreviations while others did not. Inconsistent output is to be expected, in particular for European manuscripts in Latin script.
- As the model predicts 8-bit UTF-8 code units directly, the lack of consistent Unicode normalization in the training data can lead to slightly different code point sequences for visually identical text during prediction.
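The normalization issue in the last point can be made concrete with Python's standard `unicodedata` module. This is a minimal sketch showing that the same visible character yields different UTF-8 octet sequences depending on the normalization form, and that normalizing both sides before comparison removes the difference. The choice of NFC as the target form and the helper name `same_text` are assumptions for illustration, not something the model card prescribes.

```python
import unicodedata

# "é" as one precomposed code point (NFC) versus "e" + combining acute (NFD).
composed = unicodedata.normalize("NFC", "\u00e9")
decomposed = unicodedata.normalize("NFD", "\u00e9")

# The two strings render identically but differ at the octet level,
# which is the level at which the model predicts.
print(list(composed.encode("utf-8")))    # two octets
print(list(decomposed.encode("utf-8")))  # three octets


def same_text(a: str, b: str, form: str = "NFC") -> bool:
    """Compare two strings after bringing both to the same normalization form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


print(composed == decomposed)           # raw comparison
print(same_text(composed, decomposed))  # comparison after normalization
```

Applying one normalization form to both model output and reference text before scoring or post-processing avoids counting such differences as recognition errors.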
How to Get Started with the Model
Install the party package from GitHub and follow the instructions provided there.
Training Details
Training Data
This model has been fine-tuned from a very generic base model with datasets containing writing in European languages, principally in Latin script but also in Cyrillic, Greek, and Glagolitic.
|Language|Pages|Lines|Datasets|
|---|---|---|---|
|Catalan| | |FONDUE-CA-PRINT-20|
|Corsican| | |HN2021-OCR-Poesie-Corse|
|Czech| | |Padeřov-Bible-handwriting-ground-truth|
|Dutch| | |ATR_TrainingSet_NLF_Newseye_GT_SV_M2+<br>4 private manuscript datasets<br>VOC dataset|
|English| | |FONDUE-EN-PRINT-20<br>PubLayNet<br>University of Denver Collections<br>Joseph Hooker HTR<br>CCCC MS 41|
|Finnish| | |NewsEye/READ OCR Finnish Newspapers|
|French| | |NewsEye READ AS French Newspapers<br>Boccace<br>Fabliaux<br>Liber<br>Cremma Medieval<br>DecameronFR<br>FONDUE-FR-MSS-18<br>FONDUE-FR-MSS-19<br>FONDUE-FR-PRINT-16<br>FONDUE-FR-PRINT-17<br>FONDUE-FR-PRINT-20<br>Données imprimés gothiques du 16e siècle<br>Données HTR incunables du 15e siècle<br>Données HTR manuscrits du 15e siècle<br>"Tables Décennales" French Civil Registry<br>Données imprimés du 16e siècle<br>Données imprimés du 17e siècle<br>Données imprimés du 18e siècle<br>Incunable français du 15e siècle<br>HTRomance<br>HTR-SETAF-Jean-Michel<br>HTR-SETAF-LesFaictzJCH<br>HTR-SETAF-Pierre-de-Vingle<br>La Correspondance Jacques Doucet - René Jean<br>OCR17+<br>Tapus Corpus<br>TIMEUS Corpus<br>Recensement Valaisan<br>3 private handwritten and print datasets|
|German| | |Charlottenburger Amtsschrifttum<br>DACH GT<br>DigiTue GT<br>Fibeln<br>FONDUE-DE-MSS-18<br>FoNDUE_Wolfflin_Fotosammlung<br>HKB GT<br>Ground truth for Neue Zürcher Zeitung black letter<br>Reichsanzeiger GT<br>StABS Ratsbücher O10<br>NewsEye / READ OCR Austrian Newspapers<br>Weisthuemer<br>3 private manuscript datasets|
|Greek| | |EPARCHOS<br>HTR CPgr23<br>Handwritten Paleographic Greek Text Recognition<br>ΧΦ114<br>XΦ79<br>ΧΦ53<br>10 small private manuscript datasets|
|Italian| | |episearch-htr<br>FONDUE-IT-PRINT-20<br>HTRomance Italian<br>1 private print dataset|
|Latin| | |Caroline Minuscule<br>CREMMA-Medieval-LAT<br>HTRomance Latin<br>DIVA-HisDB<br>Eutyches<br>FONDUE-LA-MSS-MA<br>FONDUE-LA-PRINT-16<br>Lateinische Gedichte<br>Wien ÖNB Cod 2160<br>2 private manuscript datasets|
|Multilingual| | |FONDUE-MLT-ART<br>[FONDUE-MLT-CAT](https://github.com/FoNDUE-HTR/FONDUE-MLT-CAT)<br>[FONDUE-MLT-PRINT-TEST](https://github.com/FoNDUE-HTR/FONDUE-MLT-PRINT-TEST)<br>gt_structure_text|
|Portuguese| | |Portuguese Handwriting 16th-19th c.|
|Russian| | |1 private manuscript dataset|
|Spanish| | |FONDUE-ES-PRINT-19<br>FoNDUE-Spanish-chapbooks-Dataset<br>HTR Araucania<br>HTRomance Spa<br>3 private manuscript datasets|
For Ancient Greek, Czech, Dutch, Finnish, Irish, Latvian, Lithuanian, Polish, Romanian, Russian, Serbian, and Slovenian, additional synthetic print data generated with the pangoline tool was used.
Training Procedure and Hyperparameters
- Training regime: 6 × A40 GPUs, BF16 mixed precision, Mars-AdamW optimizer with caution, batch size 18 per GPU, gradient accumulation 8, effective batch size 864, 12 epochs with 5000 iterations of warmup followed by cosine decay, max LR 1e-4, min LR 1e-6 at the end of epoch 12, weight decay 1e-5, gradient clipping 1.0, augmentation, random sampling of bbox and curve batches
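The batch size arithmetic and the learning rate schedule listed above can be sketched as follows. The effective batch size of 864 follows directly from 6 GPUs × 18 samples × 8 accumulation steps. The schedule function is only an approximation under stated assumptions: the warmup is taken to be linear, the warmup length is counted in optimizer steps, and `total_steps` is a hypothetical parameter because the number of steps per epoch is not given in this card.

```python
import math

# Values taken from the training regime above.
GPUS, PER_GPU_BATCH, GRAD_ACCUM = 6, 18, 8
EFFECTIVE_BATCH = GPUS * PER_GPU_BATCH * GRAD_ACCUM  # 6 * 18 * 8 = 864

MAX_LR, MIN_LR, WARMUP_STEPS = 1e-4, 1e-6, 5000


def lr_at(step: int, total_steps: int) -> float:
    """Learning rate at a given step: warmup, then cosine decay to MIN_LR.

    Assumptions: linear warmup, and `total_steps` (hypothetical) is the step
    count at the end of the final epoch.
    """
    if step < WARMUP_STEPS:
        return MAX_LR * (step + 1) / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, total_steps - WARMUP_STEPS)
    progress = min(progress, 1.0)
    return MIN_LR + 0.5 * (MAX_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))


print(EFFECTIVE_BATCH)
# Illustrative run length only; the real number of steps is not stated.
for step in (0, 4999, 5000, 52500, 100000):
    print(step, lr_at(step, total_steps=100000))
```

The rate peaks at the end of warmup, passes through the midpoint between the maximum and minimum halfway through the decay, and reaches the minimum at the final step.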
Evaluation
Testing Data, Factors & Metrics
Testing Data
[More Information Needed]
Factors
[More Information Needed]
Metrics
CER: WER:
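For readers unfamiliar with the two metrics, the following is a minimal sketch of how character error rate (CER) and word error rate (WER) are conventionally computed from the Levenshtein edit distance between a reference transcription and the model output. It is not the evaluation code used for this model; details such as Unicode normalization or whitespace handling before scoring are not specified in this card and are left out.

```python
def edit_distance(ref, hyp) -> int:
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion from the reference
                curr[j - 1] + 1,         # insertion into the reference
                prev[j - 1] + (r != h),  # substitution, free if items match
            ))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character edits divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word edits divided by number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


print(cer("abcd", "abed"))
print(wer("the quick fox", "the quick dog"))
```

Both rates are normalized by the reference length, so they can exceed 1.0 when the hypothesis contains many insertions.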
Summary
Citation
BibTeX:
[More Information Needed]
Additional details
Related works
- Is derived from:
  - https://huggingface.co/timm/swin_base_patch4_window12_384.ms_in22k
  - https://github.com/mittagessen/bytellama
Funding
- European Commission
  - MIDRASH: Migrations of Textual and Scribal Traditions via Large-Scale Computational Analysis of Medieval Manuscripts in Hebrew Script (grant 101071829)
- Agence Nationale de la Recherche
  - Biblissima+: Observatoire des cultures écrites anciennes, de l’argile à l’imprimé (grant ANR-21-ESRE-0005)