The REVERINO Collection of Regesta
Description
Overview
The REVERINO Dataset is a collection of 4,533 pairs of Latin regesta (summaries) and their corresponding full-text medieval pontifical documents.
The dataset is derived from two primary collections:
- MGH: Epistolae saeculi XIII e regestis pontificum Romanorum selectae (1216-1268)
- Auvray: Les Registres de Grégoire IX (1227-1241)
The dataset is designed to support research in Latin text summarization and the development of tools for automatic regesta generation using Large Language Models (LLMs).
It serves as a benchmark for evaluating the performance of LLMs in summarizing medieval Latin texts.
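As a rough illustration of what such a benchmark comparison involves, the sketch below scores a generated summary against a reference regestum with a simple unigram-overlap F1. This metric is an assumption chosen for brevity, not necessarily the one used by the dataset authors; the function name `unigram_f1` and the toy strings are invented for the example.

```python
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 between two lowercased, whitespace-tokenized strings."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Invented toy strings, not taken from the dataset.
reference = "papa monasterio privilegia confirmat"
generated = "papa privilegia monasterio concedit"
score = unigram_f1(generated, reference)
print(round(score, 2))
```

Whitespace tokenization is a deliberate simplification; Latin morphology means that a lemma-based or embedding-based comparison may be more informative in practice.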
Dataset Structure
The dataset is organized into nine JSON files, each corresponding to a volume of the collections.
Each JSON file contains an array of objects, where each object represents a single document with the following fields:
- numero: A unique identifier for the document.
- header: The header or title of the document, often including the date and location.
- regesto: An array of strings representing the _regestum_ (summary) of the document.
- testo esteso: An array of strings representing the full text of the document.
- apparato: An array of strings containing the apparatus (metadata or references) for the document.
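The fields above can be read with the standard `json` module. The following is a minimal sketch, assuming each array element is one text line and that joining the lines with spaces is acceptable; the helper name `load_pairs`, the sample record, and the type of `numero` are illustrative assumptions rather than part of the dataset.

```python
import json
import tempfile
from pathlib import Path


def load_pairs(path):
    """Return (numero, regesto, full_text) tuples from one volume file."""
    records = json.loads(Path(path).read_text(encoding="utf-8"))
    pairs = []
    for rec in records:
        # Assumption: each array holds text lines, joined here with spaces.
        regesto = " ".join(rec.get("regesto", []))
        full_text = " ".join(rec.get("testo esteso", []))
        if regesto and full_text:  # skip records missing either side
            pairs.append((rec.get("numero"), regesto, full_text))
    return pairs


# Invented sample record that mirrors the documented field names.
sample = [
    {
        "numero": "1",
        "header": "placeholder header",
        "regesto": ["First summary line", "second summary line"],
        "testo esteso": ["First text line", "second text line"],
        "apparato": [],
    }
]

with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "sample_volume.json"
    sample_path.write_text(json.dumps(sample), encoding="utf-8")
    pairs = load_pairs(sample_path)

print(pairs)
```

To use it on the real data, pass the path of one of the downloaded volume files to `load_pairs` instead of the temporary sample.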
Data Curation Process
The dataset was created through a four-step pipeline:
- Annotation: Manual annotation of selected pages using the eScriptorium platform to train segmentation models.
- Training: Adaptation of segmentation models to the specific layout of the manuscripts.
- Extraction: Automated extraction of text lines from the annotated pages.
- Post-processing: Separation of regesta, full texts, and apparatus using heuristics based on content and position.
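To make the post-processing step more concrete, here is a toy sketch of how content and position cues could be combined to split extracted lines. The rules below (a footnote-marker pattern, a footer zone, and a capitalized opening for the full text) are invented for illustration and are not the heuristics used to build the dataset; `Line`, `split_document`, and all thresholds are hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass
class Line:
    text: str
    y: float  # vertical position as a fraction of page height (0 = top)


# Hypothetical rules, invented for illustration only.
FOOTNOTE = re.compile(r"^(\d+|[a-z])\)\s")  # e.g. "1) ..." or "a) ..."
FOOTER_ZONE = 0.85  # lines at or below this fraction may be apparatus
TEXT_START = re.compile(r"^[A-Z]{2,}")  # assume the full text opens in capitals


def split_document(lines):
    """Split the lines of one document into regesto, full text, and apparatus."""
    parts = {"regesto": [], "testo esteso": [], "apparato": []}
    current = "regesto"  # assume the summary comes first
    for line in sorted(lines, key=lambda item: item.y):
        if line.y >= FOOTER_ZONE and FOOTNOTE.match(line.text):
            current = "apparato"  # position and content both suggest a note
        elif current == "regesto" and TEXT_START.match(line.text):
            current = "testo esteso"  # content cue: start of the full text
        parts[current].append(line.text)
    return parts


# Invented placeholder lines, not taken from the dataset.
lines = [
    Line("Summary of the letter, first line", 0.10),
    Line("and second line.", 0.13),
    Line("FULL text begins here", 0.16),
    Line("and continues on the next line", 0.19),
    Line("1) note on a variant reading", 0.90),
    Line("continued note", 0.93),
]
parts = split_document(lines)
print(parts)
```

A real pipeline would likely need per-collection rules, since the two source editions do not necessarily share a page layout.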
License
This dataset is licensed under the Creative Commons Attribution 4.0 International License (CC BY 4.0).
You are free to share and adapt the material for any purpose, provided you give appropriate credit to the original authors.
References
Puccetti, Giovanni, Laura Righi, Ilaria Sabbatini, and Andrea Esuli. "REVERINO: REgesta generation VERsus latIN summarizatiOn." IRCDL, 2025.
Acknowledgments
This work was supported by the Italian Strengthening of ESFRI RI RESILIENCE (ITSERR) project, funded by the European Union under the NextGenerationEU funding scheme (CUP: B53C22001770006).
Contact
Giovanni Puccetti [giovanni.puccetti@isti.cnr.it]
Files
Total size: 14.2 MB
MD5 checksum | Size |
---|---|
ca4eaf398c18f57c9bb782478b119ad3 | 1.7 MB |
54082f5442a90cca974e1b22aa01fd88 | 1.8 MB |
02499fcc585434decadbd26ade1935db | 1.4 MB |
5d396acbc381076c560200b20726c7a4 | 1.1 MB |
ab676bdc6e12915a5daaae5eaddd0b86 | 792.9 kB |
0c5459941b68b1518f36bdf23c14452b | 695.6 kB |
bc717f8eea8bd5881439adf0ffddcb99 | 2.5 MB |
5a9e52550b7f093cd7cebe6dfa8af15c | 1.7 MB |
0105f0859c22a3d6d61e29b9cedc0b83 | 2.4 MB |
68fb77c2b868ed11e4bbd69b4f924b93 | 2.7 kB |
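The MD5 checksums listed for the files can be used to confirm that a download is intact. The sketch below uses Python's standard `hashlib`; the helper names `md5_of_file` and `matches_checksum` are illustrative, and the demonstration runs on a temporary file rather than on a dataset file.

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file against a checksum given as 'md5:<hex>' or plain hex."""
    expected_hex = expected.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected_hex


# Demonstration on a temporary file with known content.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
    demo_path = tmp.name

demo_expected = "md5:" + hashlib.md5(b"hello").hexdigest()
demo_ok = matches_checksum(demo_path, demo_expected)
os.remove(demo_path)
print(demo_ok)
```

For the real files, call `matches_checksum` with the path of each downloaded file and the checksum shown next to that file on the download page.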