Published June 28, 2025 | Version v1

Multilingual Party model for European languages

Authors/Creators

  • 1. École Pratique des Hautes Études, PSL University

Description

Party for European Languages

Party is page-wise recognition of text-y. It is a replacement for conventional text recognizers in automatic text recognition pipelines that utilize either bounding box or baseline+bounding polygon segmentation methods for layout analysis.

This is a model for the recognition of print and handwriting in a number of European languages using the most recent party release with language token support:

  • Ancient Greek
  • Catalan
  • Church Slavonic
  • Corsican
  • Czech
  • Dutch
  • English
  • Finnish
  • French
  • German
  • Irish
  • Latin
  • Lithuanian
  • Middle Dutch
  • Middle French
  • Norwegian
  • Occitan
  • Picard
  • Polish
  • Portuguese
  • Romanian
  • Russian
  • Serbian
  • Slovenian
  • Spanish
  • Ukrainian

Serbian has been trained only on Cyrillic script.

Architecture

The recognizer is a deep fusion multimodal model consisting of a Swin vision encoder and a tiny Llama (100M parameters) decoder trained with octet tokenization. The network is prompted with the line positions through positional embeddings added to the encoder hidden state.

During training, the encoder weights were initialized from an ImageNet-22k pretrained Swin-base from pytorch-image-models; the decoder weights came from a custom Llama 3.2 pretrained on a subset of OSCAR 2301 tokenized with a ByT5-style octet tokenizer.
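To make the idea of octet tokenization concrete, the following is a minimal sketch of a ByT5-style byte tokenizer. It is not party's actual tokenizer: the function names are hypothetical, and the layout of the special tokens (ids 0 to 2 reserved, every byte shifted by 3) is an assumption borrowed from ByT5.

```python
# Illustrative ByT5-style octet tokenizer; NOT party's actual implementation.
# Assumption: ids 0, 1, 2 are reserved for pad/eos/unk as in ByT5, so every
# UTF-8 byte value b maps to token id b + 3.
PAD_ID, EOS_ID, UNK_ID = 0, 1, 2
OFFSET = 3


def encode(text: str) -> list[int]:
    """Map text to token ids, one id per UTF-8 byte, followed by EOS."""
    return [b + OFFSET for b in text.encode("utf-8")] + [EOS_ID]


def decode(ids: list[int]) -> str:
    """Invert encode(), skipping the reserved special ids."""
    data = bytes(i - OFFSET for i in ids if i >= OFFSET)
    # errors="replace" keeps decoding robust if a predicted id sequence
    # does not form valid UTF-8.
    return data.decode("utf-8", errors="replace")


ids = encode("Čeština")
print(len("Čeština"), len(ids))  # number of characters vs. number of token ids
print(decode(ids))
```

Because the vocabulary covers all 256 byte values, any script in the training data can be represented without a language-specific vocabulary, at the cost of longer sequences for characters outside ASCII.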

The pre-initialized model was then pre-trained on a collection of public and private datasets of historical document pages, augmented with born-digital data crafted from PubLayNet.

Uses

This model is a recognition foundation model primarily targeted at automatic text recognition for the humanities. While it produces fairly accurate output on a wide range of material, it is intended to be fine-tuned on a target dataset to ensure compliance with the desired transcription guidelines.

Transcription guidelines, Normalization, and Transformations

No attempt has been made to normalize the datasets or to restrict the data to common transcription guidelines. While some subsets of the corpus are internally consistent, only a very small number of the languages in the training data are covered by datasets from a single source.

Bias, Risks, and Limitations

The training corpus frequently incorporates datasets of esoteric material transcribed for specific purposes. Machine-printed and born-digital material in particular lacks diversity, so error rates will most likely vary considerably across languages and document types.

Some additional limitations are to be expected:

  • Some transcriptions resolved abbreviations while others did not. Inconsistent output is to be expected, in particular for European manuscripts in Latin script.
  • As the model predicts 8-bit UTF-8 code units directly, the lack of consistent Unicode normalization can cause slightly different code point streams during prediction.
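The normalization issue can be illustrated with Python's standard `unicodedata` module. The sketch below shows how the same visible character yields different UTF-8 byte sequences depending on its normalization form. The helper `normalize_transcription` and the choice of NFC are assumptions made for illustration, not a documented step of the model's pipeline.

```python
import unicodedata

# The same visible character "é" in two normalization forms.
nfc = unicodedata.normalize("NFC", "\u00e9")  # one code point: U+00E9
nfd = unicodedata.normalize("NFD", "\u00e9")  # two code points: 'e' + U+0301

print(list(nfc.encode("utf-8")))  # [195, 169]
print(list(nfd.encode("utf-8")))  # [101, 204, 129]


def normalize_transcription(text: str, form: str = "NFC") -> str:
    """Hypothetical helper: bring ground truth or model output to one form."""
    return unicodedata.normalize(form, text)


# After normalizing both sides, the two strings compare equal.
print(normalize_transcription(nfd) == normalize_transcription(nfc))  # True
```

Applying one normalization form to both the predictions and the ground truth before comparison avoids counting such encoding differences as recognition errors.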

How to Get Started with the Model

Install the party package from GitHub and follow the instructions.

Training Details

Training Data

This model has been fine-tuned from a very generic base model with datasets containing writing in European languages, principally in Latin script but also in Cyrillic, Greek, and Glagolitic.

|Language|Pages|Lines|Datasets|
|---|---|---|---|
|Catalan| | |FONDUE-CA-PRINT-20|
|Corsican| | |HN2021-OCR-Poesie-Corse|
|Czech| | |Padeřov-Bible-handwriting-ground-truth|
|Dutch| | |ATR_TrainingSet_NLF_Newseye_GT_SV_M2+<br>4 private manuscript datasets<br>VOC dataset|
|English| | |FONDUE-EN-PRINT-20<br>PubLayNet<br>University of Denver Collections<br>Joseph Hooker HTR<br>CCCC MS 41|
|Finnish| | |NewsEye/READ OCR Finnish Newspapers|
|French| | |NewsEye READ AS French Newspapers<br>Boccace<br>Fabliaux<br>Liber<br>Cremma Medieval<br>DecameronFR<br>FONDUE-FR-MSS-18<br>FONDUE-FR-MSS-19<br>FONDUE-FR-PRINT-16<br>FONDUE-FR-PRINT-17<br>FONDUE-FR-PRINT-20<br>Données imprimés gothiques du 16e siècle<br>Données HTR incunables du 15e siècle<br>Données HTR manuscrits du 15e siècle<br>"Tables Décennales" French Civil Registry<br>Données imprimés du 16e siècle<br>Données imprimés du 17e siècle<br>Données imprimés du 18e siècle<br>Incunable français du 15e siècle<br>HTRomance<br>HTR-SETAF-Jean-Michel<br>HTR-SETAF-LesFaictzJCH<br>HTR-SETAF-Pierre-de-Vingle<br>La Correspondance Jacques Doucet - René Jean<br>OCR17+<br>Tapus Corpus<br>TIMEUS Corpus<br>Recensement Valaisan<br>3 private handwritten and print datasets|
|German| | |Charlottenburger Amtsschrifttum<br>DACH GT<br>DigiTue GT<br>Fibeln<br>FONDUE-DE-MSS-18<br>FoNDUE_Wolfflin_Fotosammlung<br>HKB GT<br>Ground truth for Neue Zürcher Zeitung black letter<br>Reichsanzeiger GT<br>StABS Ratsbücher O10<br>NewsEye / READ OCR Austrian Newspapers<br>Weisthuemer<br>3 private manuscript datasets|
|Greek| | |EPARCHOS<br>HTR CPgr23<br>Handwritten Paleographic Greek Text Recognition<br>ΧΦ114<br>XΦ79<br>ΧΦ53<br>10 small private manuscript datasets|
|Italian| | |episearch-htr<br>FONDUE-IT-PRINT-20<br>HTRomance Italian<br>1 private print dataset|
|Latin| | |Caroline Minuscule<br>CREMMA-Medieval-LAT<br>HTRomance Latin<br>DIVA-HisDB<br>Eutyches<br>FONDUE-LA-MSS-MA<br>FONDUE-LA-PRINT-16<br>Lateinische Gedichte<br>Wien ÖNB Cod 2160<br>2 private manuscript datasets|
|Multilingual| | |FONDUE-MLT-ART<br>[FONDUE-MLT-CAT](https://github.com/FoNDUE-HTR/FONDUE-MLT-CAT)<br>[FONDUE-MLT-PRINT-TEST](https://github.com/FoNDUE-HTR/FONDUE-MLT-PRINT-TEST)<br>gt_structure_text|
|Portuguese| | |Portuguese Handwriting 16th-19th c.|
|Russian| | |1 private manuscript dataset|
|Spanish| | |FONDUE-ES-PRINT-19<br>FoNDUE-Spanish-chapbooks-Dataset<br>HTR Araucania<br>HTRomance Spa<br>3 private manuscript datasets|

For Ancient Greek, Czech, Dutch, Finnish, Irish, Latvian, Lithuanian, Polish, Romanian, Russian, Serbian, and Slovenian, additional synthetic print data generated with the pangoline tool was used.

Training Procedure and Hyperparameters

  • Training regime: 6 × A40 GPUs, BF16 mixed precision, Mars-AdamW optimizer with caution, batch size 18, gradient accumulation 8, effective batch size 864, 12 epochs with a 5000-iteration warmup and cosine decay, max LR 1e-4, min LR 1e-6 at the end of epoch 12, weight decay 1e-5, gradient clipping 1.0, augmentation, random sampling of bbox and curve batches
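The effective batch size follows from the other values if the batch size of 18 is read as per GPU: 6 × 18 × 8 = 864. The sketch below reproduces that arithmetic and shows one common way to implement a warmup followed by cosine decay with the stated learning rates. The linear shape of the warmup, the function name `learning_rate`, and the `total_steps` value are assumptions; the source does not state the total number of iterations or the exact scheduler used.

```python
import math

# Values taken from the training regime above.
NUM_GPUS = 6
BATCH_SIZE = 18          # read as per-GPU, which matches the stated total
GRAD_ACCUMULATION = 8
effective_batch_size = NUM_GPUS * BATCH_SIZE * GRAD_ACCUMULATION
print(effective_batch_size)  # 864

MAX_LR, MIN_LR = 1e-4, 1e-6
WARMUP_ITERS = 5000


def learning_rate(step: int, total_steps: int) -> float:
    """Linear warmup to MAX_LR (assumed), then cosine decay to MIN_LR."""
    if step < WARMUP_ITERS:
        return MAX_LR * (step + 1) / WARMUP_ITERS
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - WARMUP_ITERS) / max(1, total_steps - WARMUP_ITERS)
    progress = min(progress, 1.0)
    return MIN_LR + 0.5 * (MAX_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))


# total_steps is a hypothetical placeholder, not a value from the source.
total_steps = 100_000
for step in (0, 4_999, 50_000, 100_000):
    print(step, learning_rate(step, total_steps))
```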

Evaluation

Testing Data, Factors & Metrics

Testing Data

[More Information Needed]

Factors

[More Information Needed]

Metrics

CER: WER:
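For reference, character error rate (CER) and word error rate (WER) are conventionally defined as the edit distance between reference and prediction divided by the reference length. The sketch below implements that conventional definition. It is not necessarily how party computes its metrics: the function names are hypothetical, characters are counted as Unicode code points, and words are split on whitespace.

```python
def edit_distance(ref, hyp) -> int:
    """Levenshtein distance between two sequences (row-by-row DP)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate over Unicode code points."""
    return edit_distance(reference, hypothesis) / max(1, len(reference))


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate over whitespace-separated tokens."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / max(1, len(ref_words))


print(cer("abcd", "abed"))                    # 0.25
print(wer("the quick fox", "the quick dog"))  # one of three words differs
```

Since the model emits UTF-8 code units, both strings should be brought to the same Unicode normalization form before scoring.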

Summary


Citation

BibTeX:

[More Information Needed]

Files

README.md

Files (858.1 MB)

  • md5:474276b4ec62c02bc31d72cbea5f6f71 (858.1 MB)
  • md5:54893634fdb40cb68a1162914c33a20d (11.6 kB)

Additional details

Funding

European Commission
MIDRASH - Migrations of Textual and Scribal Traditions via Large-Scale Computational Analysis of Medieval Manuscripts in Hebrew Script 101071829
Agence Nationale de la Recherche
Biblissima+ - Biblissima+, Observatoire des cultures écrites anciennes, de l’argile à l’imprimé ANR-21-ESRE-0005