Hell-Date. EGRAPSA Hellenistic Dated Papyri Dataset

Ferretti, Lavinia; Serbaeva Saraogi, Olga; De Gregorio, Giuseppe; Marthot-Santaniello, Isabelle

doi:10.5281/zenodo.15830980

Published September 1, 2025 | Version 1.2

Dataset Open

1. University of Basel

Introduction

This dataset stems from research carried out within the SNSF-funded Starting Grant EGRAPSA: Retracing the evolutions of handwritings in Graeco-Roman Egypt thanks to digital palaeography (SNSF grant no. 211682). It aims to provide solid ground truth for palaeographic dating and for the detection and recognition of characters in cursive handwritten ancient Greek.

It comprises 194 images of 157 papyri from the Hellenistic period (3rd to 1st c. BCE, more precisely from -310 to -3), each precisely dated (within two years). The chronological coverage is balanced at around 50 papyri per century over this period; only the earliest decades are not covered, and the 250s are overrepresented. Most documents come from Egypt, with a few from the Near East. The dataset also includes character-level annotation of these images wherever the preservation of the original writing and the quality of the image allowed it. It is divided into two subsets, a training set of 176 images and a test set of 18 images, following the division used in De Gregorio et al. 2024.

For each papyrus, the following identifiers are used:

TM numbers: unique identifiers of texts according to the Trismegistos database (https://www.trismegistos.org/about_how_to_cite.php)
Checklist identifiers: short abbreviations indicating the publication of the edition of the papyrus (https://papyri.info/docs/checklist)

Dataset structure

The Hell-Date.zip archive contains the following files:

data.csv lists the 194 images and gives, for each image, a standard name, the location, the collection name, the inventory number, a link to the online file, and the license attached to the image.

Names are standardised across the CSV file as TMnumber_checklistAbbreviation. Some papyri appear in more than one image; in that case, the name contains additional information to distinguish the images (e.g., two fragments of the same papyrus preserved in different collections, or the recto and verso of the same papyrus).
A Python script accompanies the CSV file to automate the download process.
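The naming convention can be unpacked programmatically. The following is a minimal sketch, assuming that a name starts with the TM number (optionally prefixed with "TM"), that the first underscore separates it from the rest, and that cropped images end in _GreekOnly as described under "Specificities in the data collection". The function name and the sample name are hypothetical and not taken from data.csv.

```python
def parse_image_name(name: str) -> dict:
    """Split a standardised image name into its TM number and the remainder.

    Assumption: the name has the form <TM part>_<rest>, where <rest> holds the
    checklist abbreviation plus any extra distinguishing information.
    """
    tm_part, separator, rest = name.partition("_")
    if not separator or not rest:
        raise ValueError(f"no underscore-separated parts in {name!r}")

    # Accept both "TM3563" and "3563" as the TM part (assumed variants).
    digits = tm_part[2:] if tm_part.upper().startswith("TM") else tm_part
    if not digits.isdigit():
        raise ValueError(f"cannot read a TM number from {tm_part!r}")

    return {
        "tm": int(digits),
        "rest": rest,
        "greek_only": rest.endswith("_GreekOnly"),
    }


# Hypothetical name, for illustration only.
example = parse_image_name("TM3563_Hypothetical.Abbr_GreekOnly")
print(example["tm"], example["greek_only"])
```

Splitting only at the first underscore keeps the sketch robust if the checklist abbreviation itself contains underscores.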

downloader.py automatically downloads all the images of the dataset, fetching each from its original archive.
How_to_download_the_dataset.pdf briefly describes the procedure for downloading the images with the downloader.py script.
requirements.txt lists the Python packages required to run downloader.py.
metadata.csv contains metadata for each image. Each column of the file represents the following metadata:

image_name: name of the file for the image of the papyrus;
checklist: checklist identifier of the papyrus (usual way to refer to the papyrus in papyrology);
TM: TM number as unique identifier of the text; as one text can have multiple images, images can share their TM number;
Year post quem: the year before which the papyrus cannot have been written;
Year ante quem: the year after which the papyrus cannot have been written;
Production Nome (supposed): the geographical region (nome) where the papyrus is thought to have been written;
Function: the type of document (e.g. a contract or a letter); this item may be a comma-separated list;
Subset: the subset to which the image belongs, i.e. the training set or the test set;
Annotated: indicates whether the image is at least partially annotated at the character level.
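As an illustration of how these columns might be read, here is a minimal sketch using only the Python standard library. It assumes that the CSV header uses exactly the column names listed above and that years are stored as signed integers (negative for BCE); the sample rows, the Subset values ("train", "test") and the Annotated values are invented for the example and may differ from the real metadata.csv.

```python
import csv
import io
from collections import Counter

# Invented sample standing in for metadata.csv (values are hypothetical).
SAMPLE = '''image_name,checklist,TM,Year post quem,Year ante quem,Production Nome (supposed),Function,Subset,Annotated
TM0001_sample_a,Sample A,1,-250,-249,Nome X,"letter, receipt",train,yes
TM0002_sample_b,Sample B,2,-101,-101,Nome Y,contract,test,no
'''


def load_rows(handle):
    """Read metadata rows, converting years to int and Function to a list."""
    rows = []
    for row in csv.DictReader(handle):
        row["Year post quem"] = int(row["Year post quem"])
        row["Year ante quem"] = int(row["Year ante quem"])
        row["Function"] = [f.strip() for f in row["Function"].split(",") if f.strip()]
        rows.append(row)
    return rows


def date_midpoint(row):
    """Midpoint of the dating interval, usable as a single regression target."""
    return (row["Year post quem"] + row["Year ante quem"]) / 2


rows = load_rows(io.StringIO(SAMPLE))
subset_sizes = Counter(row["Subset"] for row in rows)
print(subset_sizes)
print([date_midpoint(row) for row in rows])
```

To use the real file, the `io.StringIO(SAMPLE)` handle would be replaced with `open("metadata.csv", newline="", encoding="utf-8")`, assuming the file is UTF-8 encoded.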

annotations_training.json and annotations_test.json contain the annotations of the images in the COCO JSON format for the training and test sets, respectively. Each annotation identifies one character on an image. Each file is structured as follows:

Categories, identifying the type of character through their Unicode value; the list includes Greek letters (24 alphabetical signs + 3 letters with numeric value: ϛ “stigma” = 6, ϙ “qoppa” = 90 and ϡ “sampi” = 900), one category “symbol”, and one category “unknown” for unidentified characters;
Licences, repeating the licences already mentioned in data.csv;
Images, identifying the images obtained through the downloader and assigning each a random numerical id;
Annotations, identifying the area on the image in which one specific character appears; the following information is given:
- Bbox, identifying the area of the image where the character appears;
- Category_id, identifying the category to which the character belongs;
- Unique id of the annotation;
- Image_id, identifying the image to which the character belongs through its numerical id;
- BaseType tag, identifying the preservation status of the character;
- Zone, identifying the type of writing area in which the letter appears;
- Rotation, indicating the rotation of the cliplet relative to the original image;
Zones, giving the name for the zone_id of the annotations.
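To show how such an annotation file can be traversed, the sketch below counts annotated characters per category. It assumes the usual COCO key names ("categories", "images", "annotations", with "id", "name", "category_id" and "image_id" fields); the exact keys in the Hell-Date files, including those for BaseType, Zone and Rotation, should be checked against the files themselves. The sample data is invented.

```python
from collections import Counter

# Invented miniature annotation file following common COCO conventions.
SAMPLE_COCO = {
    "categories": [{"id": 1, "name": "α"}, {"id": 2, "name": "β"}],
    "images": [{"id": 10, "file_name": "TM0001_sample_a.jpg"}],
    "annotations": [
        {"id": 100, "image_id": 10, "category_id": 1, "bbox": [5, 5, 20, 30]},
        {"id": 101, "image_id": 10, "category_id": 2, "bbox": [30, 5, 18, 28]},
        {"id": 102, "image_id": 10, "category_id": 1, "bbox": [55, 6, 21, 29]},
    ],
}


def count_characters_per_category(coco: dict) -> Counter:
    """Return how many annotations exist for each category name."""
    names = {category["id"]: category["name"] for category in coco["categories"]}
    return Counter(names[ann["category_id"]] for ann in coco["annotations"])


counts = count_characters_per_category(SAMPLE_COCO)
print(counts.most_common())
```

For a real file, `SAMPLE_COCO` would be replaced with the result of `json.load` on annotations_training.json or annotations_test.json.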

Description.pdf, repeating the present description.

The second version of the dataset also includes the following file:

cliplets_exporter.py automatically exports, based on the COCO JSON annotations, the images of individual letters (so-called cliplets), cropping each from the downloaded images of the dataset. Instructions for running the script are included in How_to_download_the_dataset.pdf. The script must be run after downloader.py has completed successfully. Users familiar with Python can customise the script by filtering the export (by image, BaseType (BT), or letter) and by changing the folder structure of the export (by default, folders are divided by training/test subset and by TM number; these can be merged).
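The filtering and cropping idea can be illustrated independently of the script. The sketch below is not the logic of cliplets_exporter.py; it assumes the standard COCO bounding-box layout [x, y, width, height] and the usual "image_id" and "category_id" keys, and the helper names are hypothetical. The crop box it returns has the (left, upper, right, lower) layout expected by image libraries such as Pillow's `Image.crop`.

```python
import math


def select_annotations(annotations, image_ids=None, category_ids=None):
    """Keep annotations matching the given image ids and category ids.

    A filter set to None is ignored, so calling without filters keeps everything.
    """
    selected = []
    for ann in annotations:
        if image_ids is not None and ann["image_id"] not in image_ids:
            continue
        if category_ids is not None and ann["category_id"] not in category_ids:
            continue
        selected.append(ann)
    return selected


def bbox_to_crop_box(bbox):
    """Convert a COCO [x, y, width, height] box to (left, upper, right, lower).

    Coordinates are rounded outwards so the crop never cuts into the character.
    """
    x, y, width, height = bbox
    return (
        math.floor(x),
        math.floor(y),
        math.ceil(x + width),
        math.ceil(y + height),
    )


# Invented annotations, for illustration only.
annotations = [
    {"id": 1, "image_id": 10, "category_id": 1, "bbox": [10.5, 20, 30, 40]},
    {"id": 2, "image_id": 11, "category_id": 2, "bbox": [0, 0, 15, 15]},
]
for ann in select_annotations(annotations, image_ids={10}):
    print(ann["id"], bbox_to_crop_box(ann["bbox"]))
```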

Specificities in the data collection

The images were downloaded from the web or provided directly by the holding collections (listed under Licenses).

The images are neither pre-processed nor harmonised with respect to resolution, colour scale, or scale. However, we converted them to JPEG format for ease of downloading.
Five images, labelled _GreekOnly, were cropped to remove their large Egyptian Demotic text and focus on the Greek text.
The image of TM3563 was divided into nine individual images, one for each column of text.
For some papyri, one of the two images has very little text on it; users may exclude it from the dataset where appropriate.
We have annotated as many characters as possible; however, not all characters are annotated. We have avoided annotating characters that were barely readable, and small characters on low-resolution images. Moreover, for the very long TM3563, we only annotated the first column.

Licenses

Users of this dataset must comply with the licenses provided by the various websites that give access to the images. Please take note that some of them do not allow reuse, or commercial reuse, of the images, and that credits are mostly required. By using this dataset, you confirm that you have read and understood the following licenses:

Ann Arbor, Michigan University: CC Public Domain 1.0, https://creativecommons.org/publicdomain/mark/1.0/
Berkeley, University of California: Permissions policies at the UC Berkeley Library, https://www.lib.berkeley.edu/about/permissions-policies
Berlin, Staatliche Museen zu Berlin: Berlpap Nutzungshinweise, https://berlpap.smb.museum/nutzungshinweise/
Cairo, Egyptian Museum Cairo: unknown license
Cologne, Universität zu Köln: CC BY 4.0, https://creativecommons.org/licenses/by/4.0/
Durham (NC), Duke University: CC BY-NC 3.0, https://creativecommons.org/licenses/by-nc/3.0/
Florence, Biblioteca Medicea Laurenziana: PSIonline Autorizzazioni e diritti, https://psi-online.it/rightpermission
Genova, Università di Genova: PUG Contatti, http://www.pug.unige.net/Home/Contatti
Hamburg, Staats- und Universitätsbibliothek Hamburg: CC Public Domain 1.0, https://creativecommons.org/publicdomain/mark/1.0/
Heidelberg, Universität Heidelberg: Heidelberg Institute for Papyrology Terms of Use, https://www.uni-heidelberg.de/fakultaeten/philosophie/zaw/papy/images.html
London, British Library: Public Domain, https://creativecommons.org/public-domain/pdm/
Manchester, John Rylands Library: unknown license
New York, Columbia University: CC BY-NC 3.0, https://creativecommons.org/licenses/by-nc/3.0/
New York, Pierpont Morgan Library: unknown license
Oxford, Art Archaeology and Ancient World Library: Rightsstatements in Copyright, https://rightsstatements.org/page/InC/1.0/?language=en
Paris, Musée du Louvre - Antiquités égyptiennes: Louvre Terms of Use, https://collections.louvre.fr/en/page/cgu
Paris, Sorbonne Université - Institut de Papyrologie: CC BY-NC 4.0 Sorbonne, https://papyrologie.sorbonne-universite.fr/la-collection/conditions-dutilisation-des-photos/
St. Louis, Washington University: CC Public Domain 1.0, https://creativecommons.org/publicdomain/mark/1.0/
Turin, Museo Egizio: CC0, https://creativecommons.org/public-domain/cc0/
Vienna, Österreichische Nationalbibliothek: OeNB Nutzung, https://www.onb.ac.at/nutzung
Warsaw, University of Warsaw: University of Warsaw Papyri Copyright, http://www.papyrology.uw.edu.pl/copyright.htm
Zurich, Archäologische Sammlung der Universität Zürich: © Archaeological Collection, University of Zurich, inv. 1904. Photo: Frank Tomio. CC BY-NC 4.0, https://creativecommons.org/licenses/by/4.0/

References

The research behind Hell-Date would not have been possible without the data provided by Papyri.info (https://papyri.info/, CC BY 3.0), Trismegistos (https://www.trismegistos.org/, CC BY-SA 4.0) and PapPal (https://pappal.info/ - many thanks to Rodney Ast for sharing the data).

For more details about Hell-Date and to credit it, please cite, in addition to this repository, the following publication: G. De Gregorio, L. Ferretti, R.C.G. Pena, I. Marthot-Santaniello, M. Konstantinidou, J. Pavlopoulos, “A New Framework for Error Analysis in Computational Paleographic Dating of Greek Papyri”, in Mouchère, H., Zhu, A. (eds) Document Analysis and Recognition – ICDAR 2024 Workshops. ICDAR 2024. Lecture Notes in Computer Science, vol 14936. Springer, Cham, 2024. https://doi.org/10.1007/978-3-031-70642-4_7.

For a related dataset, see I. Marthot-Santaniello, O. Serbaeva, S. White, S. Agolli, M. Seuret, G. Carrière, D. Rodriguez-Salas, V. Christlein, “ICDAR2023 Competition on Detection and Recognition of Greek Letters on Papyri”, Zenodo, 2024. https://doi.org/10.5281/zenodo.13825619.

Modifications in Version 1.2

Addition of the cliplets_exporter.py script.
Correction of two broken download links in data.csv.
Correction of one mistake in the downloader.py script.
Modification of the dataset description and the download procedure description to reflect the modifications.

Files

Hell-Date.zip (2.7 MB), md5:615613dd527b4c283ff55838cc8fdd70
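The MD5 checksum listed for the archive can be used to confirm that a download is complete and unaltered. The following is a minimal sketch with Python's standard hashlib module; the function names are hypothetical, and the local path to Hell-Date.zip is an assumption to be adjusted.

```python
import hashlib

# Checksum as published for Hell-Date.zip in this record.
EXPECTED_MD5 = "615613dd527b4c283ff55838cc8fdd70"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archive(path, expected=EXPECTED_MD5):
    """Return True if the file at `path` matches the expected checksum."""
    return md5_of_file(path) == expected.lower()


if __name__ == "__main__":
    # Assumed location of the downloaded archive.
    print(verify_archive("Hell-Date.zip"))
```

MD5 is suitable here for detecting accidental corruption during download; it is not a safeguard against deliberate tampering.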

Additional details

Continues: Dataset: 10.5281/zenodo.13825619 (DOI)
Is referenced by: Conference paper: 10.1007/978-3-031-70642-4_7 (DOI)

Swiss National Science Foundation
EGRAPSA: Retracing the evolutions of handwritings in Greco-Roman Egypt thanks to digital palaeography 211682

