Published April 17, 2026 | Version v1
Poster Open

Text Restoration of Historical Documents

Authors/Creators

Description

This PhD project investigates the application of pre-trained language models (PLMs) to the automated restoration of Latin diplomatic texts, with a focus on medieval notarial documents. The project addresses a significant challenge in historical document studies: the reconstruction of damaged or missing text in low-resource Latin corpora. To this end, the project systematically evaluates a range of PLMs that vary in architecture, training language, and scale, to identify the most effective approach for this specialised restoration task.

The project is structured around the following research questions:

Does adding Ancient Greek and English during pre-training improve performance in Latin text restoration, or is monolingual pre-training exclusively on Latin more effective?

How do smaller, domain-specific models fine-tuned on Latin compare with large commercial language models used with few-shot prompting for Latin text restoration?

The experimental design distinguishes between two key settings, depending on whether the length of the missing text is known or unknown. This distinction leads to the evaluation of encoder-based models as well as encoder-decoder and decoder-only models. Controlled comparisons between model pairs that share an identical architecture but differ in training data allow for a rigorous assessment of the effect of multilingual pre-training on downstream Latin text restoration tasks.
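The difference between the two settings can be illustrated with a minimal sketch. It is not taken from the poster: the word-level tokens, the `[MASK]` token, the `<gap>` sentinel, and both function names are assumptions chosen for illustration, and a real system would more likely work on subword or character units.

```python
# Illustrative sketch only: all names and tokens below are assumptions,
# not the project's actual preprocessing.
MASK = "[MASK]"  # assumed mask token, in the style of BERT-like encoders
GAP = "<gap>"    # hypothetical sentinel marking a gap of unknown length


def mask_known_length(tokens, start, length):
    """Known-length setting: one mask token per missing token."""
    return tokens[:start] + [MASK] * length + tokens[start + length:]


def mask_unknown_length(tokens, start, length):
    """Unknown-length setting: a single sentinel hides the span and its length."""
    return tokens[:start] + [GAP] + tokens[start + length:]


tokens = "in nomine domini nostri amen".split()
known = mask_known_length(tokens, start=2, length=2)
unknown = mask_unknown_length(tokens, start=2, length=2)

print(known)    # the gap is visible as two mask slots
print(unknown)  # the gap is visible as one sentinel of unspecified length
```

In the first case the model fills a fixed number of slots, which suits masked-token prediction with an encoder. In the second case the model must also decide how much text to generate, which is why generative encoder-decoder or decoder-only models are the natural fit.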

 

Files

Text Restoration of Historical Documents.pdf (263.6 kB)
md5:6cddbf61be5c89257e41cbf3842bbddf

Additional details

Funding

European Commission
FutureData4EU - Training Future Big Data Experts for Europe 101126733

Dates

Created
2026-04-17