Published March 5, 2025 | Version v1
Dataset Open

The REVERINO Collection of Regesta

  • 1. Istituto di Scienza e Tecnologie dell'Informazione "Alessandro Faedo"
  • 2. University of Modena and Reggio Emilia
  • 3. University of Trento
  • 4. University of Palermo

Description

Overview

The REVERINO Dataset is a collection of 4,533 pairs of Latin regesta (summaries) and their corresponding full-text medieval pontifical documents.
The dataset is derived from two primary collections:

  1. MGH: Epistolae saeculi XIII e regestis pontificum Romanorum selectae (1216-1268)
  2. Auvray: Les Registres de Grégoire IX (1227-1241)

The dataset is designed to support research in Latin text summarization and the development of tools for automatic regesta generation using Large Language Models (LLMs). 
It serves as a benchmark for evaluating the performance of LLMs in summarizing medieval Latin texts.

Dataset Structure

The dataset is organized into nine JSON files, each corresponding to a volume of the collections. 
Each JSON file contains an array of objects, where each object represents a single document with the following fields:

  • numero: A unique identifier for the document.
  • header: The header or title of the document, often including the date and location.
  • regesto: An array of strings representing the _regestum_ (summary) of the document.
  • testo esteso: An array of strings representing the full text of the document.
  • apparato: An array of strings containing the apparatus (metadata or references) for the document.
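As a minimal sketch of how these fields might be read, the Python snippet below loads one volume file and joins the string arrays of a document into a (full text, regestum) pair. The helper names `load_documents` and `to_pair` and the `sample` record are illustrative assumptions, not part of the dataset. Joining lines with a single space is also an assumption and may not suit words hyphenated across lines.

```python
import json


def load_documents(path):
    """Read one volume file and return its array of document objects."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def to_pair(doc):
    """Join the line arrays of one document into (full text, regestum).

    Joining with a single space is an assumption made for this sketch.
    """
    full_text = " ".join(doc.get("testo esteso", []))
    regestum = " ".join(doc.get("regesto", []))
    return full_text, regestum


# Invented record that only mirrors the documented field names;
# the values are placeholders, not real dataset content.
sample = {
    "numero": "1",
    "header": "Hypothetical header",
    "regesto": ["First summary line", "second summary line."],
    "testo esteso": ["First text line", "second text line."],
    "apparato": [],
}

text, summary = to_pair(sample)
print(summary)
```

With a downloaded volume, the same helper could be applied to every object returned by `load_documents`, for example `[to_pair(d) for d in load_documents("escriptorium_auvray_1a.json")]`.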

Data Curation Process

The dataset was created through a four-step pipeline:

  1. Annotation: Manual annotation of selected pages using the eScriptorium platform to train segmentation models.
  2. Training: Adaptation of segmentation models to the specific layout of the manuscripts.
  3. Extraction: Automated extraction of text lines from the annotated pages.
  4. Post-processing: Separation of regesta, full texts, and apparatus using heuristics based on content and position.

License

This dataset is licensed under the Creative Commons Attribution 4.0 International License (CC BY 4.0).
You are free to share and adapt the material for any purpose, provided you give appropriate credit to the original authors.

References

Puccetti, Giovanni, Laura Righi, Ilaria Sabbatini, and Andrea Esuli. "REVERINO: REgesta generation VERsus latIN summarizatiOn." IRCDL, 2025.

Acknowledgments

This work was supported by the Italian Strengthening of ESFRI RI RESILIENCE (ITSERR) project, funded by the European Union under the NextGenerationEU funding scheme (CUP: B53C22001770006).

Contact

Giovanni Puccetti [giovanni.puccetti@isti.cnr.it]

Files

escriptorium_auvray_1a.json

Files (14.2 MB)

Checksum                              Size
md5:ca4eaf398c18f57c9bb782478b119ad3  1.7 MB
md5:54082f5442a90cca974e1b22aa01fd88  1.8 MB
md5:02499fcc585434decadbd26ade1935db  1.4 MB
md5:5d396acbc381076c560200b20726c7a4  1.1 MB
md5:ab676bdc6e12915a5daaae5eaddd0b86  792.9 kB
md5:0c5459941b68b1518f36bdf23c14452b  695.6 kB
md5:bc717f8eea8bd5881439adf0ffddcb99  2.5 MB
md5:5a9e52550b7f093cd7cebe6dfa8af15c  1.7 MB
md5:0105f0859c22a3d6d61e29b9cedc0b83  2.4 MB
md5:68fb77c2b868ed11e4bbd69b4f924b93  2.7 kB
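
A downloaded file can be checked against the MD5 checksums listed above. The Python sketch below uses the standard `hashlib` module. The helper names `md5_label` and `matches` are assumptions introduced here. Because the listing does not show which checksum belongs to which file name, the expected value has to be taken from the record page for the file in question.

```python
import hashlib
import io


def md5_label(stream, chunk_size=1 << 20):
    """Return 'md5:<hex digest>' for a binary stream, read in chunks."""
    digest = hashlib.md5()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return "md5:" + digest.hexdigest()


def matches(path, expected):
    """Compare a local file against a checksum string such as 'md5:...'."""
    with open(path, "rb") as f:
        return md5_label(f) == expected


# Self-check on in-memory bytes, so the sketch runs without any download.
print(md5_label(io.BytesIO(b"abc")))
```

Reading in chunks keeps memory use low, although the files here are only a few megabytes each.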