"ICDAR2023 Competition on Detection and Recognition of Greek Letters on Papyri" Dataset
Creators
Description
Dataset description of the “ICDAR2023 Competition on Detection and Recognition of Greek Letters on Papyri”
Prof. Dr. Isabelle Marthot-Santaniello, Dr. Olga Serbaeva
2024.09.16
Introduction
The present dataset stems from the ICDAR2023 Competition on Detection and Recognition of Greek Letters on Papyri (original links to the competition are provided in the file “1b.CompetitionLinks”).
The aim of this competition was to investigate the performance of glyph detection and recognition in a very challenging type of historical document: Greek papyri. The detection and recognition of Greek letters on papyri is a preliminary step for computational analysis of handwriting that can lead to major steps forward in our understanding of this important source of information on Antiquity. Such detection and recognition can be done manually by trained papyrologists. It is, however, a time-consuming task that would need automatising.
We provide here the documents related to two different tasks: localisation and classification. The document images are provided by several institutions and are representative of the diversity of book hands on papyri (a millennium time span, various script styles, provenance, states of preservation, means of digitization and resolution).
How the dataset was constructed
Within the D-Scribes project led by Prof. Dr. Isabelle Marthot-Santaniello (2018-2023), around 150 papyrus fragments containing the Iliad were manually annotated at letter level in READ.
The editions were taken, for the most part, from papyri.info and were simplified, i.e. the accents, editorial marks, and other additional information were removed to be as close as possible to what is found on the papyri. When the text was not available on papyri.info, the relevant passage was extracted from Homer's Iliad on Perseus.
From these 150-plus papyrus fragments, 185 surfaces (sides of fragments) belonging to 136 different manuscripts, identified by their Trismegistos numbers (hereafter TMs), were selected as material for the competition. These 185 surfaces were split into a “training set” and a “test set”, provided for the competition as a set of images and corresponding data in JSON format.
Details on the competition are summarised in "ICDAR 2023 Competition on Detection and Recognition of Greek Letters on Papyri", by Mathias Seuret, Isabelle Marthot-Santaniello, Stephen A. White, Olga Serbaeva Saraogi, Selaudin Agolli, Guillaume Carrière, Dalia Rodriguez-Salas, and Vincent Christlein; in G. A. Fink et al. (Eds.): ICDAR 2023, LNCS 14188, pp. 498–507, 2023. https://doi.org/10.1007/978-3-031-41679-8_29.
After the competition ended, the decision was taken to release the manually annotated data for the “test set” as well. Each included document is described below.
Dataset Structure
“1. CompetitionOverview.xlsx” contains the metadata of the images used, as an Excel file (state as of 2024.09.19). The Excel file is structured as follows:
| Excel column | Name | Content | Notes |
|---|---|---|---|
| A | TM | Trismegistos number, used internationally for papyri identification | With READ item name in (). |
| B | Papyri.info link | link | |
| C | Fragments' Owning Institution (from papyri.info) | Institution’s name | Institution that physically stores the papyri |
| D | Availability (of metadata, papyri.info) | link | Metadata reuse clarification |
| E | text ID (READ) | Number from READ SQL database that was used to link the images and the editions. | Serves to locate the attached images and understand the JSON structure. |
| F | Test/Training | | I.e. the image was originally included in the training or in the test set of the dataset. |
| G | Image Name (for orientation) | | As in READ |
| H | Cedopal link | link | Contains additional metadata and includes the links to all available online images. |
| I | License from the Institution webpage. | Either license or usage summary. | If no precise licence has been given, the summary of the reuse rights is provided with a link to the regulations in column K |
| J | Image URL | link | Not all images are available online. Please contact the owning institution directly if the image is not available. |
| K | Information on the image usage from the institution | link | In case of any doubt, please contact the owning institution directly. |
| L | Notes | | |
For the purpose of an easy overview, the items with special problems, i.e. images not online or missing links, have been marked in red.
2. There are three data subsets:
2a. “Training file”
Contains 150 papyrus images grouped into 108 texts, together with HomerCompTraining.json. The images, in JPG format, are of papyri containing Homer's Iliad. They were processed in READ: each visible letter on a given papyrus was linked to the edition of the Iliad, and through this process each linked letter of the edition was associated with its pixel coordinates on the HTML surface of the image. All of this information is provided in the JSON file.
The JSON file contains “annotations” (bounding boxes, or bboxes, of each letter/sign), “categories” (Greek letters), “images” (image IDs), and “licenses”. The link between an image and its bboxes is defined via the “id” in the “images” part (for example, "id": 6109). The same id is encoded as "image_id": 6109 in “annotations”. Alternatively, the “text_id”, which can be found in the “images” URL and in the names of the folders provided here that contain the images, can be used for data linking.
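The id-based link described above can be sketched in a few lines of Python. This is only an illustration: the helper name `group_annotations_by_image`, the sample values, and the file location in the final comment are assumptions, while the keys “images”, “annotations”, “id”, and “image_id” are taken from the description.

```python
import json
from collections import defaultdict


def group_annotations_by_image(coco):
    """Return {image id: [annotations]} using the "image_id" link."""
    by_image = defaultdict(list)
    for ann in coco["annotations"]:
        by_image[ann["image_id"]].append(ann)
    # Keep every image id as a key, even if it has no annotations.
    return {img["id"]: by_image.get(img["id"], []) for img in coco["images"]}


# Tiny in-memory stand-in for HomerCompTraining.json (all values invented).
sample = {
    "images": [{"id": 6109}, {"id": 6110}],
    "annotations": [
        {"id": 1, "image_id": 6109, "category_id": 7},
        {"id": 2, "image_id": 6109, "category_id": 8},
    ],
}
grouped = group_annotations_by_image(sample)
print({image_id: len(anns) for image_id, anns in grouped.items()})

# With the real file (the path depends on where the archive was unpacked):
# with open("HomerCompTraining.json", encoding="utf-8") as f:
#     grouped = group_annotations_by_image(json.load(f))
```

Grouping once up front avoids rescanning the full annotation list for every image.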
Let us now describe the content of each part of the JSON file:
Each “annotation” contains
“area”, characterised as “bbox” with coordinates;
“category_id”, which identifies the Greek letter in “categories” that the number represents;
“id”, which is a unique number of the cliplet, i.e. the area;
“image_id”, which links the cliplet to the image surface having the same id;
“iscrowd” and “seg_id”, which are useful for tracing the information back in the READ database;
and, finally, “tags”.
In “tags”, “BaseType” was used to annotate quality as described below. “FootMarkType” (ft1, etc.) was used for clustering tests, but played no role in the competition.
“BaseType” or bt tags were assigned to the letters to mark the quality of preservation:
bt-1: Well-preserved letter that should allow easy identification by both the human eye and computer vision.
bt-2: Partially preserved letter that might also have some background damage (holes, additional ink, etc.), but remains readable and has one interpretation.
bt-3: Letter damaged to such an extent that it cannot be identified without reading an edition. These are treated as traces of ink.
bt-4: Letter with damage of such a kind that multiple interpretations are possible. For example, a missing or defaced horizontal stroke makes alpha indistinguishable from a damaged delta or lambda.
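The preservation tags lend themselves to filtering, for example keeping only letters that remain readable (bt-1 and bt-2). The sketch below is hedged: the exact shape of “tags” is not spelled out in this description, so the code assumes it is a dictionary whose “BaseType” entry is either a string or a list of strings, and it accepts both "bt1" and "bt-1" spellings. The helper names are hypothetical.

```python
def base_types(annotation):
    """Return the BaseType tags of one annotation, normalised to "bt1".."bt4".

    Assumption: "tags" is a dict and "BaseType" holds a string or a list of
    strings. Adjust this helper if the real JSON is shaped differently.
    """
    raw = (annotation.get("tags") or {}).get("BaseType", [])
    if isinstance(raw, str):
        raw = [raw]
    return {value.replace("-", "").lower() for value in raw}


def keep_readable(annotations, allowed=("bt1", "bt2")):
    """Keep annotations carrying at least one of the allowed BaseType tags."""
    allowed = set(allowed)
    return [a for a in annotations if base_types(a) & allowed]


# Invented sample annotations covering the assumed tag shapes.
sample_annotations = [
    {"id": 1, "tags": {"BaseType": ["bt1"]}},
    {"id": 2, "tags": {"BaseType": ["bt-3"]}},
    {"id": 3, "tags": {"BaseType": "bt2", "FootMarkType": ["ft1"]}},
    {"id": 4, "tags": {}},
]
readable_ids = [a["id"] for a in keep_readable(sample_annotations)]
print(readable_ids)
```

Annotations without any BaseType tag are dropped by this filter; whether that is desirable depends on the experiment.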
Each “category” contains
“id”, a number that is also referenced in “annotations” and that identifies which Greek letter is in the bbox;
“name”, for example, “χ”;
and “supercategory”, i.e. “Greek”.
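As a small illustration of how “category_id” resolves to a letter, the following sketch builds an id-to-name lookup and counts letters. The function names and the numeric ids in the sample are invented; only the keys “categories”, “id”, “name”, “supercategory”, and “category_id” come from the description.

```python
from collections import Counter


def category_names(coco):
    """Map each category id to its Greek letter."""
    return {cat["id"]: cat["name"] for cat in coco["categories"]}


def letter_counts(coco):
    """Count how many annotated bboxes exist per Greek letter."""
    names = category_names(coco)
    return Counter(names[ann["category_id"]] for ann in coco["annotations"])


# Invented sample: the real category ids may differ.
sample = {
    "categories": [
        {"id": 1, "name": "α", "supercategory": "Greek"},
        {"id": 2, "name": "χ", "supercategory": "Greek"},
    ],
    "annotations": [
        {"id": 10, "category_id": 2},
        {"id": 11, "category_id": 1},
        {"id": 12, "category_id": 2},
    ],
}
counts = letter_counts(sample)
print(counts.most_common())
```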
Each “image” contains the following subfields:
“bln_id” is an internal READ number of the HTML surface;
"date_captured": null is another READ field;
"file_name": "./images/homer2/txt1/P.Corn.Inv.MSS.A.101.XIII.jpg" makes it easy to link image and text, i.e. for the image in question the JPG will be in the folder called “txt1”; it is very similar in structure and function to "img_url": "./images/homer2/txt1/P.Corn.Inv.MSS.A.101.XIII.jpg";
each image has “height” and “width”, expressed in pixels;
each image has an “id”, and this id is referenced in “annotations” under “image_id”;
finally, each image contains a link to a “license”, expressed as a number.
Each “license” entry lists a licence as it was found at the time of the competition, i.e. in February 2023.
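The image fields above can be used to find the text folder of an image and the licence it points to. This is a sketch under assumptions: the description only says that the image's “license” is a number, so the code assumes each licence entry carries a matching “id” (as in COCO-style files). The helper names, the numeric values, and the "name" field are placeholders.

```python
from pathlib import PurePosixPath


def text_folder(image):
    """Return the folder name (e.g. "txt1") that holds the image's JPG."""
    return PurePosixPath(image["file_name"]).parent.name


def licence_for(image, coco):
    """Return the licence entry referenced by the image, or None if absent.

    Assumption: licence entries have an "id" matching the image's "license".
    """
    by_id = {lic["id"]: lic for lic in coco["licenses"]}
    return by_id.get(image["license"])


# Invented sample; only the file_name string is quoted from the description.
sample = {
    "images": [{
        "id": 6109,
        "file_name": "./images/homer2/txt1/P.Corn.Inv.MSS.A.101.XIII.jpg",
        "height": 1000,
        "width": 800,
        "license": 3,
    }],
    "licenses": [{"id": 3, "name": "placeholder licence"}],
}
image = sample["images"][0]
folder = text_folder(image)
licence = licence_for(image, sample)
print(folder, licence["name"])
```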
2b. “Test file”
Contains 34 images of papyrus sides grouped into 31 TMs, together with HomerCompTesting.json. The JSON file here provides only “categories”, “images”, and “licenses”, without the “annotations”. The structure and logic are otherwise the same as in the “Training” JSON.
2c. “Answers file”
Contains the “annotations” and other information for the 34 papyri of the “Testing” dataset. The structure and logic are the same as in the “Training” JSON.
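Since the test JSON carries the images and the answers JSON carries the annotations, a quick consistency check before evaluation is to confirm that every answer refers to a known test image. The sketch below assumes the two files share image ids, which the description implies but does not state explicitly; the function name and sample ids are invented.

```python
def unmatched_image_ids(test_coco, answers_coco):
    """Return image ids used in the answers but missing from the test images."""
    known = {img["id"] for img in test_coco["images"]}
    used = {ann["image_id"] for ann in answers_coco["annotations"]}
    return sorted(used - known)


# Invented sample data standing in for the test and answers JSON files.
test_sample = {"images": [{"id": 1}, {"id": 2}]}
answers_sample = {
    "annotations": [
        {"id": 100, "image_id": 1},
        {"id": 101, "image_id": 2},
        {"id": 102, "image_id": 3},
    ]
}
missing = unmatched_image_ids(test_sample, answers_sample)
print(missing)
```

An empty result means the answers can be attached to the test images directly via “image_id”.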
3. “Additional files”
Contains lists of duplicate segment ids (multiple possible readings or tags): 6 items for “Training”, 17 for “Testing”, and 15 for “Answers”.
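For readers who want to locate such duplicates in a loaded JSON themselves, one possible approach is to count how often each “seg_id” occurs. This is a sketch, not the procedure used to produce the provided lists; the function name and sample values are invented, and the real “seg_id” values may be strings rather than integers.

```python
from collections import Counter


def duplicate_seg_ids(annotations):
    """Return the seg_id values that occur in more than one annotation."""
    counts = Counter(ann["seg_id"] for ann in annotations)
    return sorted(seg for seg, n in counts.items() if n > 1)


# Invented sample: seg_id 501 carries two alternative readings.
sample_annotations = [
    {"id": 1, "seg_id": 500, "category_id": 1},
    {"id": 2, "seg_id": 501, "category_id": 1},
    {"id": 3, "seg_id": 501, "category_id": 4},
    {"id": 4, "seg_id": 502, "category_id": 2},
]
duplicates = duplicate_seg_ids(sample_annotations)
print(duplicates)
```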
4. “Dataset Description”
This same description, included for completeness.
References
The dataset has been reused or mentioned in a number of publications (as of September 2024):
Mohammed, H., Jampour, M. (2024). "From Detection to Modelling: An End-to-End Paleographic System for Analysing Historical Handwriting Styles". In: Sfikas, G., Retsinas, G. (eds) Document Analysis Systems. DAS 2024. Lecture Notes in Computer Science, vol 14994. Springer, Cham, pp. 363–376. https://doi.org/10.1007/978-3-031-70442-0_22
De Gregorio, G., Perrin, S., Pena, R.C.G., Marthot-Santaniello, I., Mouchère, H. (2024). "NeuroPapyri: A Deep Attention Embedding Network for Handwritten Papyri Retrieval". In: Mouchère, H., Zhu, A. (eds) Document Analysis and Recognition – ICDAR 2024 Workshops. ICDAR 2024. Lecture Notes in Computer Science, vol 14936. Springer, Cham, pp. 71–86. https://doi.org/10.1007/978-3-031-70642-4_5
Vu, M. T., Beurton-Aimar, M. (2023). "PapyTwin net: a Twin network for Greek letters detection on ancient Papyri". HIP '23: 7th International Workshop on Historical Document Imaging and Processing, San Jose, CA, USA, August 2023. https://doi.org/10.1145/3604951.3605522 (full text: https://dl.acm.org/doi/fullHtml/10.1145/3604951.3605522)
Turnbull, R., Mannix, E. (2024). "Detecting and recognizing characters in Greek papyri with YOLOv8, DeiT and SimCLR". Preprint, arXiv:2401.12513. https://doi.org/10.48550/arXiv.2401.12513
Files
| Name | Size | MD5 |
|---|---|---|
| ICDAR2023_Competition_on_Detection_and_Recognition_of_Greek_Letters_on_Papyri_Dataset.zip | 423.9 MB | 115645f6864bb922d60309687fcfaa56 |
Additional details
Related works
- Is described by
- Conference paper: 10.1007/978-3-031-41679-8_29 (DOI)
Funding
- Swiss National Science Foundation: Reuniting fragments, identifying scribes and characterizing scripts: the Digital paleography of Greek and Coptic papyri (grant 174149)
- Swiss National Science Foundation: EGRAPSA: Retracing the evolutions of handwritings in Greco-Roman Egypt thanks to digital palaeography (grant 211682)
Dates
- Created: 2023-02-03 (competition files published online)
- Available: 2024-09-22 (dataset available on Zenodo)