Dataset for BnF, fr. 2813 - Grandes Chroniques de France

Vlachou Efstathiou, Malamatenia

doi:10.5281/zenodo.18745702

Published February 27, 2026 | Version v2

Dataset Open

Vlachou Efstathiou, Malamatenia (Annotator)

This repository contains the extended version of the ground truth for the codex Paris, BnF, fr. 2813, used in the experiments for the paper “Leveraging Morphology for Metrological Historical Script Analysis”, accepted to the International Conference on Document Analysis and Recognition (ICDAR 2026, Vienna, Austria).

What’s New Compared to v1

- 95 newly annotated folios have been added (see the new `btv1b84472995_metadata.csv` for details);
- The ALTO XML annotations now distinguish between `MainZone#1` and `MainZone#2`, corresponding to the column order on each page;
- Two versions of `annotation.json` are provided, one of which includes hyphenation for word breaks at the end of lines.

As in version 1, the repository is organized into two main data folders:

---

📁 `btv1b84472995_GT.zip`

This folder contains the ground truth dataset used for Handwritten Text Recognition (HTR), created from the selected folios of the manuscript Paris, BnF, français 2813.
The identifier `btv1b84472995` is the ARK identifier of this manuscript in Gallica.

Folder structure:

btv1b84472995_GT

├── images

└── annotations

- `images/`: Selected high-resolution images downloaded from Gallica.
Image names follow the pattern `btv1b84472995_f<number>`, corresponding to the Gallica view number.
➤ Credit: *Source gallica.bnf.fr / Bibliothèque nationale de France*

- `annotations/`: ALTO XML annotation files created with eScriptorium.

Layout: Annotations follow the SegmOnto ontology. Users of the ground truth should note that we use additional custom tags:
- `'RubricLines'`: Rubricated lines
- `'HalfLines'`: Partial or incomplete lines
- `'MainZone#1'` and `'MainZone#2'`: Column order on the page, used instead of a plain `MainZone`

Transcription: The dataset is CATMuS-compliant, using a graphemic transcription approach.
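As a rough illustration of how the zone and line tags described above could be read back from an annotation file, here is a minimal sketch using only the Python standard library. It assumes the usual ALTO v4 layout, in which `OtherTag` elements declare labels and `TextBlock` / `TextLine` elements point to them through `TAGREFS`; whether the files in this dataset follow exactly this layout has not been checked here. The sample document, its tag IDs, and its text are invented for the example.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Invented ALTO fragment: IDs and text are placeholders, not dataset content.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v4#">
  <Tags>
    <OtherTag ID="BT1" LABEL="MainZone#1"/>
    <OtherTag ID="BT2" LABEL="MainZone#2"/>
    <OtherTag ID="LT1" LABEL="DefaultLine"/>
    <OtherTag ID="LT2" LABEL="RubricLines"/>
  </Tags>
  <Layout><Page><PrintSpace>
    <TextBlock ID="b1" TAGREFS="BT1">
      <TextLine ID="l1" TAGREFS="LT2"><String CONTENT="placeholder rubric"/></TextLine>
      <TextLine ID="l2" TAGREFS="LT1"><String CONTENT="placeholder line"/></TextLine>
    </TextBlock>
    <TextBlock ID="b2" TAGREFS="BT2">
      <TextLine ID="l3" TAGREFS="LT1"><String CONTENT="second column"/></TextLine>
    </TextBlock>
  </PrintSpace></Page></Layout>
</alto>"""


def local(tag):
    """Drop the XML namespace: '{http://...}TextLine' -> 'TextLine'."""
    return tag.rsplit("}", 1)[-1]


def label_of(element, labels):
    """Return the first known label referenced by TAGREFS, else 'unlabelled'."""
    for ref in (element.get("TAGREFS") or "").split():
        if ref in labels:
            return labels[ref]
    return "unlabelled"


def lines_by_zone(alto_xml):
    """Group lines by zone label; each line is a (line_label, text) pair."""
    root = ET.fromstring(alto_xml)
    # Map tag IDs to their labels, e.g. 'BT1' -> 'MainZone#1'.
    labels = {
        el.get("ID"): el.get("LABEL")
        for el in root.iter()
        if local(el.tag) == "OtherTag"
    }
    grouped = defaultdict(list)
    for block in root.iter():
        if local(block.tag) != "TextBlock":
            continue
        zone = label_of(block, labels)
        for line in block:
            if local(line.tag) != "TextLine":
                continue
            text = " ".join(
                s.get("CONTENT", "") for s in line if local(s.tag) == "String"
            )
            grouped[zone].append((label_of(line, labels), text))
    return dict(grouped)


for zone, lines in lines_by_zone(SAMPLE_ALTO).items():
    print(zone, lines)
```

Matching on local element names keeps the sketch independent of the exact ALTO namespace version used by the export.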

---

📁 `dataset.zip`

This folder contains the dataset used in the experiments described in the paper, which apply the DTLR architecture to paleography.

Folder structure:

dataset

├── images

└── annotation.json

- `images/`: Each subfolder contains polygonal line extractions (with alpha transparency) per manuscript page.
- `annotation.json`: Contains the annotation and metadata for each line.
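Because `annotation.json` stores one record per line image, it can be filtered and summarised with a few lines of Python. The sketch below assumes the file is a single JSON object keyed by image ID, with the field names shown in the structure example in this section. The image IDs, the `"test"` split value, and the `"OtherHand"` script are hypothetical and only serve the illustration, as are the helper names `load_annotations`, `select_split`, and `count_by`.

```python
import json
from collections import Counter

# Hypothetical records: IDs, the "test" split and "OtherHand" are invented.
SAMPLE_JSON = """
{
  "line_a": {"split": "train", "label": "placeholder text one",
             "line": "DefaultLine", "zone": "MainZone#1",
             "script": "RaouletOrleans", "folio": "1r"},
  "line_b": {"split": "train", "label": "placeholder text two",
             "line": "RubricLines", "zone": "MainZone#2",
             "script": "RaouletOrleans", "folio": "1r"},
  "line_c": {"split": "test", "label": "placeholder text three",
             "line": "DefaultLine", "zone": "MainZone#1",
             "script": "OtherHand", "folio": "2v"}
}
"""


def load_annotations(path):
    """Read an annotation file from disk (the path is supplied by the caller)."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def select_split(annotations, split):
    """Keep only the records whose 'split' field equals the requested value."""
    return {
        image_id: record
        for image_id, record in annotations.items()
        if record.get("split") == split
    }


def count_by(annotations, field):
    """Count records per value of a metadata field such as 'zone' or 'script'."""
    return Counter(record.get(field) for record in annotations.values())


annotations = json.loads(SAMPLE_JSON)
print(sorted(select_split(annotations, "train")))
print(count_by(annotations, "zone"))
print(count_by(annotations, "script"))
```

With the real data, `load_annotations` would be pointed at the extracted `annotation.json` in place of the inline sample.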

`annotation.json` structure example:

```json
"<image_id>": { // corresponds to the image names in the images folders
    "split": "train",
    "label": "A beautiful calico cat.", // Transcription text of the line
    "line": "DefaultLine", // Type of line
    "zone": "MainZone#1", // Type of zone where the line is found
    "script": "RaouletOrleans", // Identifier for the scribal hand
    "folio": "1r",
    "gp": "GP1", // Identified Graphic Profile
    "doc": "HT1"
}
```

Papers associated with the data:

v1: https://malamatenia.github.io/bnf-fr-2813/ (Scriptorium 2026)

v2: https://malamatenia.github.io/dtlr-for-metrology/ (ICDAR 2026)

This study was supported by the CNRS through MITI and the 80|Prime program (CrEMe Caractérisation des écritures médiévales), and by the European Research Council (ERC project DISCOVER, number 101076028).

Files

Files (2.9 GB)

| Name | MD5 | Size |
|---|---|---|
| `btv1b84472995_GT.zip` | `c5e8337875b5d2f1c73677a95011dc13` | 334.8 MB |
| `btv1b84472995_metadata.csv` | `88a209fd59be1587e65bb9927765898b` | 7.0 kB |
| `dataset.zip` | `b69e82b36df1938079a2c99419c459e1` | 2.6 GB |
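The MD5 checksums listed for each file can be used to check that a download completed without corruption. The following is a minimal sketch with Python's `hashlib`; the helper names `md5_of` and `verify` and the idea of keeping all files in one local folder are assumptions made for the example.

```python
import hashlib
from pathlib import Path

# Checksums as listed in the file table of this record.
EXPECTED_MD5 = {
    "btv1b84472995_GT.zip": "c5e8337875b5d2f1c73677a95011dc13",
    "btv1b84472995_metadata.csv": "88a209fd59be1587e65bb9927765898b",
    "dataset.zip": "b69e82b36df1938079a2c99419c459e1",
}


def md5_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so that multi-gigabyte archives fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(folder):
    """Return {file name: checksum matches} for the listed files found in folder."""
    folder = Path(folder)
    return {
        name: md5_of(folder / name) == expected
        for name, expected in EXPECTED_MD5.items()
        if (folder / name).exists()
    }


if __name__ == "__main__":
    # Assumes the downloads sit in the current working directory.
    for name, ok in verify(".").items():
        print(name, "OK" if ok else "MISMATCH")
```

Files that are not present in the folder are skipped, so the check can be run after a partial download.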

Additional details

Funding

- European Research Council: DISCOVER (101076028)
- Centre National de la Recherche Scientifique: CreMe
