Published March 24, 2025 | Version v3

Pretrained multilingual Party model

Authors/Creators

  • École Pratique des Hautes Études, PSL University

Description

Llama Party

Party performs page-wise recognition of text. It is a replacement for conventional text recognizers in automatic text recognition (ATR) pipelines that use either bounding-box or baseline+bounding-polygon segmentation for layout analysis.

Llama Party is a full-page generative text recognizer that has been pretrained on a large corpus of multilingual historical, contemporary, and born-digital document page images, both handwritten and machine-printed.

Architecture

The recognizer is a deep-fusion multimodal model consisting of a Swin vision encoder and a tiny Llama decoder (100M parameters) trained with octet (UTF-8 byte) tokenization. The network is prompted with line positions through positional embeddings added to the encoder hidden state.
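
The fusion scheme can be sketched structurally in a few lines of PyTorch. This is a minimal illustration, not the party implementation: the dimensions, the prompt encoding (a normalized line bounding box passed through an MLP), and the generic transformer decoder standing in for the Swin and Llama modules are all assumptions.

```python
import torch
import torch.nn as nn

class LinePromptFusionSketch(nn.Module):
    """Toy stand-in: patch features + line-position prompt -> octet logits."""

    def __init__(self, enc_dim=1024, dec_dim=512, vocab=259):
        super().__init__()
        # Embeds a line-position prompt (here: a normalized bounding box).
        self.pos_mlp = nn.Sequential(
            nn.Linear(4, enc_dim), nn.GELU(), nn.Linear(enc_dim, enc_dim))
        self.proj = nn.Linear(enc_dim, dec_dim)   # encoder -> decoder width
        layer = nn.TransformerDecoderLayer(dec_dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=4)
        self.tok_emb = nn.Embedding(vocab, dec_dim)
        self.lm_head = nn.Linear(dec_dim, vocab)

    def forward(self, enc_hidden, line_coords, octet_ids):
        # enc_hidden: (B, N, enc_dim) patch features from the vision encoder.
        # line_coords: (B, 4) line prompt; added to every encoder position.
        prompted = enc_hidden + self.pos_mlp(line_coords).unsqueeze(1)
        memory = self.proj(prompted)
        mask = nn.Transformer.generate_square_subsequent_mask(octet_ids.size(1))
        out = self.decoder(self.tok_emb(octet_ids), memory, tgt_mask=mask)
        return self.lm_head(out)                  # (B, T, vocab) octet logits

model = LinePromptFusionSketch()
logits = model(torch.randn(2, 196, 1024), torch.rand(2, 4),
               torch.randint(0, 259, (2, 32)))
print(logits.shape)  # torch.Size([2, 32, 259])
```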

During training, the encoder weights were initialized with an ImageNet-22k-pretrained Swin-base from pytorch-image-models; the decoder weights came from a custom Llama 3.2 model pretrained on a subset of OSCAR 23.01 tokenized with a ByT5-style octet tokenizer.
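
A ByT5-style octet tokenizer is small enough to sketch in full: text becomes its raw UTF-8 bytes, shifted past a few reserved special tokens. The three specials and the offset below follow ByT5's convention; party's exact token inventory is an assumption.

```python
# ByT5 convention: ids 0-2 are reserved for pad/eos/unk, bytes start at 3.
PAD, EOS, UNK = 0, 1, 2
OFFSET = 3

def encode(text: str) -> list[int]:
    return [b + OFFSET for b in text.encode("utf-8")] + [EOS]

def decode(ids: list[int]) -> str:
    data = bytes(i - OFFSET for i in ids if i >= OFFSET)
    return data.decode("utf-8", errors="replace")

print(encode("é"))           # [198, 172, 1] -- two code units plus EOS
print(decode(encode("é")))   # é
```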

The pre-initialized model was then pretrained on a collection of public and private historical document page datasets, augmented with born-digital data derived from PubLayNet.

Uses

Llama Party is a recognition foundation model primarily targeted at automatic text recognition for the humanities. While it produces fairly accurate output on an impressive range of material, it is intended to be fine-tuned on a target dataset to ensure compliance with the desired transcription guidelines.

Transcription guidelines, Normalization, and Transformations

No attempt has been made to normalize the datasets or to use only data adhering to common transcription guidelines. While some subsets of the corpus are internally consistent, only for a very small proportion of the languages does the training data come from a single source.

Bias, Risks, and Limitations

The training corpus is heavily skewed towards a handful of languages (Chinese, English, French, German, and Portuguese) and frequently incorporates datasets of esoteric material transcribed for specific purposes. Machine-printed and born-digital material in particular lacks diversity, so error rates will most likely vary considerably across languages and document types.

Some additional limitations are to be expected:

  • Arabic, Hebrew, and South Indian script recognition is likely to require fine-tuning.
  • Some transcriptions resolved abbreviations while others did not. Inconsistent output is to be expected, in particular for European manuscripts in Latin script.
  • As the model predicts 8-bit UTF-8 code units directly, the lack of consistent Unicode normalization in the training data can cause slightly different code point streams during prediction (see the sketch below).
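
The effect is easy to reproduce: canonically equivalent strings normalize to different code point, and hence octet, sequences.

```python
import unicodedata

nfc = unicodedata.normalize("NFC", "é")  # one code point: U+00E9
nfd = unicodedata.normalize("NFD", "é")  # two code points: U+0065 U+0301
print(nfc.encode("utf-8"))  # b'\xc3\xa9'
print(nfd.encode("utf-8"))  # b'e\xcc\x81'
print(nfc == nfd)           # False, although both render identically
```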

How to Get Started with the Model

Install the party package from GitHub and follow the instructions there.

Training Details

Training Data

The model has been pretrained on the vast majority of publicly available ATR datasets, in addition to a sizable number of restricted datasets. For English only, we converted the PubLayNet layout analysis dataset of born-digital documents into an ATR dataset with PDFMiner and a basic baseline heuristic derived from each line's bounding box (sketched below).
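
A possible form of such a heuristic is sketched below. The exact rule used for the conversion is not documented here, so the straight-line baseline and the descender fraction are assumptions.

```python
def bbox_to_baseline(x0, y0, x1, y1, descender_frac=0.9):
    """Place a horizontal baseline near the bottom of a line bounding box,
    leaving room for descenders (image coordinates, y grows downward)."""
    baseline_y = y0 + descender_frac * (y1 - y0)
    return [(x0, baseline_y), (x1, baseline_y)]

print(bbox_to_baseline(100, 50, 400, 80))  # [(100, 77.0), (400, 77.0)]
```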

| Language | Pages | Lines | Datasets |
| :------- | :---- | :---- | :------- |
| Arabic | | | RASAM 1, TariMa, OpenITI Arabic MS Data, OpenITI Arabic Print Data |
| Catalan | | | FONDUE-CA-PRINT-20 |
| Chinese | | | 1 large private dataset |
| Corsican | | | HN2021-OCR-Poesie-Corse |
| Czech | | | Padeřov-Bible-handwriting-ground-truth |
| Dutch | | | 4 private manuscript datasets, VOC dataset |
| English | | | FONDUE-EN-PRINT-20, PubLayNet, University of Denver Collections, Joseph Hooker HTR, CCCC MS 41 |
| Finnish | | | NewsEye/READ OCR Finnish Newspapers |
| French | | | NewsEye READ AS French Newspapers, Boccace, Fabliaux, Liber, Cremma Medieval, DecameronFR, FONDUE-FR-MSS-18, FONDUE-FR-MSS-19, FONDUE-FR-PRINT-16, FONDUE-FR-PRINT-17, FONDUE-FR-PRINT-20, Données imprimés gothiques du 16e siècle, Données HTR incunables du 15e siècle, Données HTR manuscrits du 15e siècle, "Tables Décennales" French Civil Registry, Données imprimés du 16e siècle, Données imprimés du 17e siècle, Données imprimés du 18e siècle, Incunable français du 15e siècle, HTRomance, HTR-SETAF-Jean-Michel, HTR-SETAF-LesFaictzJCH, HTR-SETAF-Pierre-de-Vingle, La Correspondance Jacques Doucet - René Jean, OCR17+, Tapus Corpus, TIMEUS Corpus, Recensement Valaisan, 3 private handwritten and print datasets |
| Georgian | | | 1 private dataset |
| German | | | Charlottenburger Amtsschrifttum, DACH GT, DigiTue GT, Fibeln, FONDUE-DE-MSS-18, FoNDUE_Wolfflin_Fotosammlung, HKB GT, Ground truth for Neue Zürcher Zeitung black letter, Reichsanzeiger GT, StABS Ratsbücher O10, NewsEye / READ OCR Austrian Newspapers, Weisthuemer, 3 private manuscript datasets |
| Greek | | | EPARCHOS, HTR CPgr23, Handwritten Paleographic Greek Text Recognition, ΧΦ114, XΦ79, ΧΦ53, 10 small private manuscript datasets |
| Hebrew | | | Tikkoun Sofrim, BiblIA |
| Italian | | | episearch-htr, FONDUE-IT-PRINT-20, HTRomance Italian, 1 private print dataset |
| Japanese | | | mm-ocr-dataset-v1 |
| Latin | | | Caroline Minuscule, CREMMA-Medieval-LAT, HTRomance Latin, DIVA-HisDB, Eutyches, FONDUE-LA-MSS-MA, FONDUE-LA-PRINT-16, Lateinische Gedichte, Wien ÖNB Cod 2160, 2 private manuscript datasets |
| Multilingual | | | FONDUE-MLT-ART, [FONDUE-MLT-CAT](https://github.com/FoNDUE-HTR/FONDUE-MLT-CAT), [FONDUE-MLT-PRINT-TEST](https://github.com/FoNDUE-HTR/FONDUE-MLT-PRINT-TEST), gt_structure_text |
| Ottoman Turkish | | | OpenITI Arabic MS Data, OpenITI Arabic Print Data |
| Farsi | | | OpenITI Arabic MS Data, OpenITI Arabic Print Data |
| Portuguese | | | Portuguese Handwriting 16th-19th c. |
| Russian | | | 1 private manuscript dataset |
| Spanish | | | FONDUE-ES-PRINT-19, FoNDUE-Spanish-chapbooks-Dataset, HTR Araucania, HTRomance Spa, 3 private manuscript datasets |
| Swedish | | | ATR_TrainingSet_NLF_Newseye_GT_SV_M2+, Kat -57 |
| Syriac | | | 2 private print and manuscript datasets |
| Urdu | | | OpenITI Arabic MS Data, OpenITI Arabic Print Data |
| Yiddish | | | 1 private print dataset |

Training Procedure and Hyperparameters

  • Hardware: 6 × A40 GPUs
  • Precision: BF16 (AMP)
  • Optimizer: Mars-AdamW with cautious updates
  • Batch size: 32 per GPU with gradient accumulation of 4 (effective batch size 768)
  • Schedule: 5+8 epochs (5 on synthetic+real data, 8 on real data only), 5000-iteration warmup and cosine decay, max LR 5e-4, min LR 5e-6 at the end of epoch 5 (sketched below)
  • Regularization: weight decay 1e-5, gradient clipping at 1.0
  • Data: augmentation, random sampling of bounding-box and curve batches
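
The learning-rate schedule can be written as a function of the training step. This is a sketch of the schedule as stated above; the total decay-step count is a placeholder, and holding the LR flat at the minimum after the decay ends is an assumption.

```python
import math

MAX_LR, MIN_LR, WARMUP = 5e-4, 5e-6, 5000

def lr_at(step: int, decay_steps: int) -> float:
    if step < WARMUP:
        return MAX_LR * step / WARMUP   # linear warmup
    if step >= decay_steps:
        return MIN_LR                   # assumed: flat after the decay ends
    t = (step - WARMUP) / (decay_steps - WARMUP)
    return MIN_LR + 0.5 * (MAX_LR - MIN_LR) * (1 + math.cos(math.pi * t))

print(lr_at(2_500, 50_000))   # mid-warmup: 0.00025
print(lr_at(50_000, 50_000))  # end of decay: 5e-06
```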

Evaluation

The current base model's character accuracies on a validation set of 1000 randomly sampled pages, prompted with both curves and bounding boxes (sorted by ascending curve error rate):

| Script | Code Points | %Right (curves) | %Right (boxes) |
| :----- | :---------- | :-------------- | :------------- |
| Han | 107416 | 98.90% | 98.88% |
| Hiragana | 1868 | 97.11% | 97.11% |
| Cyrillic | 22239 | 92.70% | 92.34% |
| Greek | 1036 | 92.28% | 91.31% |
| Katakana | 390 | 90.00% | 90.00% |
| Latin | 199703 | 88.02% | 86.98% |
| Common | 85863 | 80.24% | 79.28% |
| Arabic | 18061 | 79.22% | 79.64% |
| Hebrew | 40182 | 73.98% | 73.97% |
| Inherited | 2886 | 61.61% | 60.95% |
| Unknown | 202 | 58.42% | 57.43% |

The script types are determined from the Unicode script property of each individual code point.
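
The same bucketing can be reproduced with the third-party regex module, which exposes Unicode script properties. This is an illustration, not the evaluation code; the input literal is assumed to be NFC-normalized.

```python
import collections
import regex  # pip install regex; supports \p{Script=...}

def script_counts(text: str) -> collections.Counter:
    """Tally code points by their Unicode script property."""
    counts = collections.Counter()
    for script in ("Han", "Latin", "Common", "Inherited"):
        counts[script] = len(regex.findall(rf"\p{{Script={script}}}", text))
    return counts

print(script_counts("déjà vu 漢字"))
# Counter({'Latin': 6, 'Han': 2, 'Common': 2, 'Inherited': 0})
```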

The base model has been trained on Georgian, Syriac, Newa, Malayalam, and Devanagari, albeit with fairly small datasets. No pages with these scripts are contained in the validation sample.

Files (857.5 MB)

  • model checkpoint: 857.5 MB, md5:7446bc756d10ce2c754c4cbb8b1aa5b6
  • README.md: 13.7 kB, md5:b510da9cd3d094d8b0bc23395c8b56c5

Additional details

Funding

  • European Commission: MIDRASH - Migrations of Textual and Scribal Traditions via Large-Scale Computational Analysis of Medieval Manuscripts in Hebrew Script (grant 101071829)
  • Agence Nationale de la Recherche: Biblissima+, Observatoire des cultures écrites anciennes, de l'argile à l'imprimé (grant ANR-21-ESRE-0005)