Published September 1, 2025 | Version 1.2
Dataset Open

Hell-Date. EGRAPSA Hellenistic Dated Papyri Dataset

Description

Introduction

The present dataset stems from research carried out within the SNSF-funded Starting Grant EGRAPSA: Retracing the evolutions of handwritings in Graeco-Roman Egypt thanks to digital palaeography (SNSF grant no. 211682). Its aim is to provide solid ground truth for palaeographic dating and for the detection and recognition of characters in cursive handwritten ancient Greek.

It is composed of 194 images of 157 papyri from the Hellenistic period (3rd to 1st c. BCE, more precisely from -310 to -3), each precisely dated (within two years). The chronological coverage is balanced at around 50 papyri per century over this period; only the earliest decades are not covered, and the 250s are overrepresented. Most documents come from Egypt, but a few come from the Near East. The dataset also includes character-level annotations of these images, wherever the preservation of the original writing and the quality of the image allowed it. It is divided into two subsets, a training set comprising 176 images and a test set comprising 18 images, following the division used in De Gregorio et al. 2024.

For each papyrus, the following identifiers are used:

Dataset structure

The Hell-Date.zip archive contains the following files:

  1. data.csv lists the 194 images with, for each image, a standard name, the location, collection name, inventory number, a link to the file online, and the license attached to the image.
    • Names are standardised across the CSV file as TMnumber_checklistAbbreviation. Some papyri appear in more than one image; in that case the name contains additional information to distinguish the images (e.g., two fragments of the same papyrus preserved in different collections, or the recto and verso of the same papyrus);
    • A Python script accompanies the CSV file to automate the download process.
  2. downloader.py automatically downloads all the images of the dataset, fetching each of them from its original archive.
  3. How_to_download_the_dataset.pdf briefly describes the simple procedure for downloading the images with the downloader.py script.
  4. requirements.txt lists the requirements of the Python environment needed to run downloader.py correctly.
  5. metadata.csv contains metadata for each image. Each column of the file represents the following metadata:
    • image_name: name of the file for the image of the papyrus;
    • checklist: checklist identifier of the papyrus (usual way to refer to the papyrus in papyrology);
    • TM: TM number as unique identifier of the text; as one text can have multiple images, images can share their TM number;
    • Year post quem: i.e. the year before which the papyrus cannot have been written;
    • Year ante quem: i.e. the year after which the papyrus cannot have been written;
    • Production Nome (supposed): the geographical region where the papyrus was written;
    • Function: the type of document (e.g. a contract or a letter); this item may be a comma-separated list;
    • Subset: the subset to which the image belongs, i.e. the training set or the test set;
    • Annotated: indicates whether the image is at least partially annotated at the character level.
  6. annotations_training.json and annotations_test.json contain the annotations of the images in the COCO JSON format, for the training and the test sets respectively. Each annotation identifies one character on an image. Each file is structured as follows:
    • Categories, identifying the type of character through its Unicode value; the list includes Greek letters (24 alphabetical signs + 3 letters with numeric value: ϛ “stigma” = 6, ϙ “qoppa” = 90 and ϡ “sampi” = 900), one category “symbol”, and one category “unknown” for unidentified characters;
    • Licences, repeating the licences already mentioned in data.csv;
    • Images, identifying the images obtained through the downloader and assigning each of them a random numerical id;
    • Annotations, identifying the area of the image in which one specific character appears; the following information is given:
      • Bbox, identifying the area of the image where the character appears;
      • Category_id, identifying the category to which the character belongs;
      • Unique id of the annotation;
      • Image_id, identifying through its numerical id the image to which the character belongs;
      • BaseType tag, identifying the preservation status of the character:
        • Bt1: letter perfectly preserved;
        • Bt2: letter partially preserved but unequivocally identifiable;
        • Bt3: letter poorly preserved, traces ambiguous;
        • Bt4: letter partially preserved, allowing at most two or three identifications; not used in this dataset, but see Marthot-Santaniello et al. 2024;
        • Bt5: letter deformed in its original appearance (e.g. squeezed, enlarged, modified to alter its meaning);
      • Zone, identifying the type of writing area in which the letter appears:
        • 0: Body: the main part of the text;
        • 1: Paratext: additional textual elements around the body;
        • 2: Addition: additional textual elements added by another scribe at another moment;
      • Rotation, indicating the rotation of the cliplet compared to the original image;
    • Zones, giving the name for the zone_id of the annotations.
  7. Description.pdf, repeating the present description.
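
As an illustration of how the annotation files described above might be read, the following Python sketch counts annotated characters per category, keeping only well-preserved letters. It is not part of the dataset's scripts. It assumes the standard COCO keys (categories, images, annotations, with id, name, image_id, category_id); where the BaseType tag is stored and how its values are spelled are assumptions, as are the function names and the hand-made sample data.

```python
import json
from collections import Counter
from pathlib import Path


def load_coco(path):
    """Read a COCO-style JSON file (e.g. annotations_training.json) into a dict."""
    with Path(path).open(encoding="utf-8") as handle:
        return json.load(handle)


def base_type(annotation):
    """Return the preservation tag of an annotation in lower case, or None.

    ASSUMPTION: the tag sits under a "BaseType" key, either directly on the
    annotation or inside a "tags" dict, as a string or a one-item list.
    Check the real files and adapt this helper if the layout differs.
    """
    value = annotation.get("BaseType")
    if value is None:
        value = annotation.get("tags", {}).get("BaseType")
    if isinstance(value, list):
        value = value[0] if value else None
    return value.lower() if isinstance(value, str) else None


def count_characters(coco, keep_base_types=("bt1", "bt2")):
    """Count annotations per category name, keeping only the given tags."""
    names = {cat["id"]: cat["name"] for cat in coco["categories"]}
    counts = Counter()
    for ann in coco["annotations"]:
        if base_type(ann) in keep_base_types:
            counts[names[ann["category_id"]]] += 1
    return counts


# Tiny hand-made sample standing in for a real annotation file (hypothetical values).
sample = {
    "categories": [{"id": 1, "name": "α"}, {"id": 2, "name": "ε"}],
    "images": [{"id": 10, "file_name": "example.jpg"}],
    "annotations": [
        {"id": 1, "image_id": 10, "category_id": 1,
         "bbox": [5, 5, 20, 30], "BaseType": "bt1"},
        {"id": 2, "image_id": 10, "category_id": 2,
         "bbox": [40, 5, 18, 28], "BaseType": "bt3"},
        {"id": 3, "image_id": 10, "category_id": 1,
         "bbox": [70, 5, 22, 31], "tags": {"BaseType": ["bt2"]}},
    ],
}

print(dict(count_characters(sample)))  # {'α': 2}
```

With the real data, one would presumably call count_characters(load_coco("annotations_training.json")) once the actual key names have been checked against the file.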

In addition, the second version of the dataset includes the following file:

  1. cliplets_exporter.py automatically exports, based on the COCO JSON annotations, the images of all individual letters (so-called cliplets), cropping each of them from the downloaded images of the dataset. The instructions for running the script are included in the How_to_download_the_dataset.pdf file. The script must be run after downloader.py has run successfully. Users familiar with Python can customise the script by filtering the export (filters by image, BaseType, and letter are possible) and by modifying the folder structure of the export (merging the folders, which the default configuration divides by training/test set and by TM number).
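
To give an idea of what such an export involves, here is a minimal Python sketch of cropping cliplets from COCO bounding boxes. It is not the cliplets_exporter.py script and does not reproduce its options or folder layout; all function names, parameters, and output file names are hypothetical. It assumes the standard COCO bbox layout [x, y, width, height], requires Pillow for the actual cropping, and ignores the Rotation field.

```python
from pathlib import Path


def bbox_to_crop_box(bbox):
    """Convert a COCO [x, y, width, height] box to (left, upper, right, lower)."""
    x, y, width, height = bbox
    return (round(x), round(y), round(x + width), round(y + height))


def select_annotations(coco, image_ids=None, category_ids=None):
    """Keep annotations matching the optional image and category filters."""
    return [
        ann for ann in coco["annotations"]
        if (image_ids is None or ann["image_id"] in image_ids)
        and (category_ids is None or ann["category_id"] in category_ids)
    ]


def export_cliplets(coco, image_dir, out_dir, image_ids=None, category_ids=None):
    """Crop each selected annotation, save it as a PNG, and return the paths.

    Hypothetical helper: output names are "<image_id>_<annotation_id>.png".
    The image is reopened for every annotation, which is simple but slow.
    """
    from PIL import Image  # Pillow; imported here so the helpers above work without it

    files = {img["id"]: img["file_name"] for img in coco["images"]}
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for ann in select_annotations(coco, image_ids, category_ids):
        with Image.open(Path(image_dir) / files[ann["image_id"]]) as image:
            cliplet = image.crop(bbox_to_crop_box(ann["bbox"]))
            target = out_dir / f"{ann['image_id']}_{ann['id']}.png"
            cliplet.convert("RGB").save(target)
            written.append(target)
    return written
```

A call such as export_cliplets(coco, "images", "cliplets", category_ids={1}) would then write only the cliplets of one letter category, which mirrors the kind of filtering described above.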

Specificities in the data collection

Licenses

Users of this dataset must comply with the licenses provided by the various websites that give access to the images. Please take note that some of them do not allow reuse, or commercial reuse, of the images, and that credits are mostly required. By using this dataset, you confirm that you have read and understood the following licenses:

References

The research behind Hell-Date would not have been possible without the data provided by Papyri.info (https://papyri.info/, CC BY 3.0), Trismegistos (https://www.trismegistos.org/, CC BY-SA 4.0) and PapPal (https://pappal.info/ - many thanks to Rodney Ast for sharing the data).

For more details about Hell-Date and to credit it, please cite, in addition to this repository, the following publication: G. De Gregorio, L. Ferretti, R.C.G. Pena, I. Marthot-Santaniello, M. Konstantinidou, J. Pavlopoulos, “A New Framework for Error Analysis in Computational Paleographic Dating of Greek Papyri”, in Mouchère, H., Zhu, A. (eds) Document Analysis and Recognition – ICDAR 2024 Workshops. ICDAR 2024. Lecture Notes in Computer Science, vol 14936. Springer, Cham, 2024. https://doi.org/10.1007/978-3-031-70642-4_7.

For a related dataset, see I. Marthot-Santaniello, O. Serbaeva, S. White, S. Agolli, M. Seuret, G. Carrière, D. Rodriguez-Salas, V. Christlein, “ICDAR2023 Competition on Detection and Recognition of Greek Letters on Papyri”, Zenodo, 2024. https://doi.org/10.5281/zenodo.13825619.

Modifications in Version 1.2

  • Addition of the cliplets_exporter.py script.
  • Correction of two broken download links in data.csv.
  • Correction of one mistake in the downloader.py script.
  • Modification of the dataset description and the download procedure description to reflect the modifications.

Files

Hell-Date.zip (2.7 MB)
md5:615613dd527b4c283ff55838cc8fdd70

Additional details

Related works

Continues
Dataset: 10.5281/zenodo.13825619 (DOI)
Is referenced by
Conference paper: 10.1007/978-3-031-70642-4_7 (DOI)

Funding

Swiss National Science Foundation
EGRAPSA: Retracing the evolutions of handwritings in Greco-Roman Egypt thanks to digital palaeography 211682