Dataset for BnF, fr. 2813 - Grandes Chroniques de France
This repository contains the extended version of the ground truth for the codex Paris, BnF, fr. 2813, used in the experiments for the paper “Leveraging Morphology for Metrological Historical Script Analysis”, accepted to the International Conference on Document Analysis and Recognition (ICDAR 2026, Vienna, Austria).
What’s New Compared to v.1
- 95 newly annotated folios have been added (see the new `btv1b84472995_metadata.csv` for details);
- The ALTO XML annotations now distinguish between `MainZone#1` and `MainZone#2`, corresponding to the column order on each page;
- Two versions of `annotation.json` are provided, one of which includes hyphenation for word breaks at the end of lines.
As in version 1, the repository is organized into two main data folders:
---
📁 `btv1b84472995_GT.zip`
This folder contains the ground truth dataset used for Handwritten Text Recognition (HTR), created from the selected folios of the manuscript Paris, BnF, français 2813.
The identifier `btv1b84472995` is the ARK identifier of this manuscript in Gallica.
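As a minimal sketch, the identifier and the view number embedded in an image name of the form `btv1b84472995_f<number>` can be turned back into a Gallica page link. The URL layout (`https://gallica.bnf.fr/ark:/12148/<id>/f<view>.item`) is an assumption based on Gallica's public ARK links, the optional file extension is also an assumption, and the helper names are hypothetical.

```python
import re

# Assumption: Gallica's public ARK URL layout; 12148 is the BnF's ARK prefix.
GALLICA_ARK_BASE = "https://gallica.bnf.fr/ark:/12148"

# Matches names such as "btv1b84472995_f15", with an optional file extension.
_IMAGE_NAME = re.compile(r"^(?P<ark>[a-z0-9]+)_f(?P<view>\d+)(?:\.\w+)?$")


def parse_image_name(name: str) -> tuple[str, int]:
    """Split an image name into (ARK identifier, Gallica view number)."""
    match = _IMAGE_NAME.match(name)
    if match is None:
        raise ValueError(f"unexpected image name: {name!r}")
    return match.group("ark"), int(match.group("view"))


def gallica_view_url(ark: str, view: int) -> str:
    """Build the (assumed) Gallica URL for one view of a document."""
    return f"{GALLICA_ARK_BASE}/{ark}/f{view}.item"


ark, view = parse_image_name("btv1b84472995_f15")
url = gallica_view_url(ark, view)
```

Note that the view number is Gallica's image index, not the folio number of the manuscript.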
Folder structure:
btv1b84472995_GT
├── images
└── annotations
- `images/`: High-resolution selected images downloaded from Gallica.
  Image names follow the pattern `btv1b84472995_f<number>`, where `<number>` is the Gallica view number.
  ➤ Credit: *Source gallica.bnf.fr / Bibliothèque nationale de France*
- `annotations/`: XML-ALTO annotation files created with eScriptorium.
  Layout: Annotations follow the SegmOnto ontology. Users of the ground truth should note that we use additional custom tags:
  - `RubricLines`: rubricated lines
  - `HalfLines`: partial or incomplete lines
  - `MainZone#1` and `MainZone#2`: column order on the page, used instead of a plain `MainZone`

  Transcription: The dataset is CATMuS-compliant, using a graphemic transcription approach.
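The zone and line tags described above can be read back from an ALTO file with the Python standard library. The following is a sketch under assumptions: it supposes the usual ALTO v4 layout, in which labels are declared as `OtherTag` elements and referenced from `TextBlock` and `TextLine` through `TAGREFS`. The sample document, its IDs, and its text are placeholders, not data from the manuscript, and `lines_by_zone` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Placeholder ALTO fragment (assumed structure, invented content).
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v4#">
  <Tags>
    <OtherTag ID="BT1" LABEL="MainZone#1"/>
    <OtherTag ID="BT2" LABEL="MainZone#2"/>
    <OtherTag ID="LT1" LABEL="DefaultLine"/>
    <OtherTag ID="LT2" LABEL="RubricLines"/>
  </Tags>
  <Layout><Page><PrintSpace>
    <TextBlock ID="b1" TAGREFS="BT1">
      <TextLine ID="l1" TAGREFS="LT2"><String CONTENT="placeholder rubric"/></TextLine>
      <TextLine ID="l2" TAGREFS="LT1"><String CONTENT="placeholder text"/></TextLine>
    </TextBlock>
    <TextBlock ID="b2" TAGREFS="BT2">
      <TextLine ID="l3" TAGREFS="LT1"><String CONTENT="second column"/></TextLine>
    </TextBlock>
  </PrintSpace></Page></Layout>
</alto>"""


def _local(tag: str) -> str:
    """Drop the XML namespace so the code works across ALTO versions."""
    return tag.rsplit("}", 1)[-1]


def _label(element, labels: dict) -> str:
    """Resolve the first TAGREFS entry of an element to its label."""
    refs = (element.get("TAGREFS") or "").split()
    return labels.get(refs[0], "untagged") if refs else "untagged"


def lines_by_zone(root) -> dict:
    """Return {zone label: [(line label, text), ...]} in document order."""
    labels = {el.get("ID"): el.get("LABEL")
              for el in root.iter() if _local(el.tag) == "OtherTag"}
    zones = defaultdict(list)
    for block in root.iter():
        if _local(block.tag) != "TextBlock":
            continue
        for line in block:
            if _local(line.tag) != "TextLine":
                continue
            text = " ".join(s.get("CONTENT", "")
                            for s in line if _local(s.tag) == "String")
            zones[_label(block, labels)].append((_label(line, labels), text))
    return dict(zones)


zones = lines_by_zone(ET.fromstring(SAMPLE))
```

For a file on disk, `ET.parse(path).getroot()` yields the root element to pass to `lines_by_zone`.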
---
📁 `dataset.zip`
This folder contains the dataset used in the experiments described in the paper, which use the DTLR architecture for paleography.
Folder structure:
dataset
├── images
└── annotation.json
- `images/`: Each subfolder contains polygonal line extractions (with alpha transparency) per manuscript page.
- `annotation.json`: Contains the annotation and metadata for each line.
`annotation.json` structure example (the `//` comments are explanatory only; standard JSON does not allow them):
```json
{
  "<image_id>": {                        // corresponds to the image names in the images folders
    "split": "train",
    "label": "A beautiful calico cat.",  // transcription text of the line
    "line": "DefaultLine",               // type of line
    "zone": "MainZone#1",                // type of zone where the line is found
    "script": "RaouletOrleans",          // identifier for the scribal hand
    "folio": "1r",
    "gp": "GP1",                         // identified Graphic Profile
    "doc": "HT1"
  }
}
```
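A minimal sketch of how such a file could be loaded and filtered by its documented fields, using only the standard library. The helper names, the sample IDs, and the `"test"` split value are assumptions for illustration; only the field names and the other values come from the example above.

```python
import json
from collections import Counter
from pathlib import Path


def load_annotations(path):
    """Read annotation.json into a dict keyed by line-image ID."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def select_lines(annotations, **criteria):
    """Keep the entries whose fields equal every given criterion."""
    return {
        image_id: entry
        for image_id, entry in annotations.items()
        if all(entry.get(field) == value for field, value in criteria.items())
    }


# Hypothetical entries mirroring the documented fields (placeholder IDs and text).
sample = {
    "line_0001": {"split": "train", "label": "placeholder text",
                  "line": "DefaultLine", "zone": "MainZone#1",
                  "script": "RaouletOrleans", "folio": "1r",
                  "gp": "GP1", "doc": "HT1"},
    "line_0002": {"split": "test", "label": "placeholder rubric",
                  "line": "RubricLines", "zone": "MainZone#2",
                  "script": "RaouletOrleans", "folio": "1r",
                  "gp": "GP1", "doc": "HT1"},
    "line_0003": {"split": "train", "label": "more placeholder text",
                  "line": "DefaultLine", "zone": "MainZone#2",
                  "script": "RaouletOrleans", "folio": "1v",
                  "gp": "GP1", "doc": "HT1"},
}

train = select_lines(sample, split="train")
first_column_train = select_lines(sample, split="train", zone="MainZone#1")
lines_per_zone = Counter(entry["zone"] for entry in sample.values())
```

With the real data, `sample` would be replaced by `load_annotations("dataset/annotation.json")`, the path being an assumption based on the folder structure shown earlier.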
Papers associated with the data:
- v1: https://malamatenia.github.io/bnf-fr-2813/ (Scriptorium 2026)
- v2: https://malamatenia.github.io/dtlr-for-metrology/ (ICDAR 2026)
This study was supported by the CNRS through MITI and the 80|Prime program (CrEMe: Caractérisation des écritures médiévales), and by the European Research Council (ERC project DISCOVER, number 101076028).