Published April 25, 2025 | Version v1
Dataset Open

Dataset for BnF, fr. 2813 - Grandes Chroniques de France

Description

This repository is organized into two main data folders:

---

📁 `btv1b84472995_GT.zip`

This folder contains the ground truth dataset used for Handwritten Text Recognition (HTR), created from selected folia of the manuscript Paris, BnF, français 2813.
The identifier `btv1b84472995` is the ARK identifier of this manuscript in Gallica.

Folder structure:

btv1b84472995_GT

├── images

└── annotations

- `images/`: Selected high-resolution images downloaded from Gallica.  
  Image names follow the pattern `btv1b84472995_f<number>`, corresponding to the Gallica view number.  
  ➤ Credit: *Source gallica.bnf.fr / Bibliothèque nationale de France*

- `annotations/`: XML-ALTO annotation files created with eScriptorium.
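
The naming pattern above makes it possible to recover the Gallica view number from an image file name. The following is a minimal sketch: the helper name `gallica_view_number` and the `.jpg` extension in the usage line are assumptions for illustration, since the description does not state the image format.

```python
import re
from pathlib import Path
from typing import Optional

# Pattern stated in the description: btv1b84472995_f<number>
IMAGE_NAME = re.compile(r"^btv1b84472995_f(\d+)$")


def gallica_view_number(filename: str) -> Optional[int]:
    """Return the Gallica view number encoded in an image file name, or None."""
    # Path.stem drops the final extension, whatever it is.
    match = IMAGE_NAME.match(Path(filename).stem)
    return int(match.group(1)) if match else None


# Hypothetical file name; the real extension may differ.
print(gallica_view_number("btv1b84472995_f15.jpg"))  # 15
```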

Layout: Annotations follow the SegmOnto ontology. Users of the ground truth should note that two additional custom tags are used:
  - `'RubricLines'`: Rubricated lines
  - `'HalfLines'`: Partial or incomplete lines

Transcription: The dataset is CATMuS-compliant, using a graphemic transcription approach.
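
As an illustration of how the annotation files might be read, here is a minimal sketch that extracts each line's type and transcription from an ALTO document using only the Python standard library. It assumes the usual ALTO layout (`TextLine` elements carrying a `TAGREFS` attribute, `String` children with a `CONTENT` attribute, and `OtherTag` definitions with `ID` and `LABEL`); the sample document, its tag IDs, and its text are invented, and the files in this dataset may differ in detail.

```python
import xml.etree.ElementTree as ET


def local(tag: str) -> str:
    """Strip the XML namespace so the code does not depend on the ALTO version."""
    return tag.rsplit("}", 1)[-1]


def read_alto_lines(xml_text: str) -> list:
    """Return (line_type, text) pairs for every TextLine in an ALTO document."""
    root = ET.fromstring(xml_text)
    # Map tag IDs (e.g. "LT1") to their human-readable labels.
    labels = {
        el.get("ID"): el.get("LABEL")
        for el in root.iter()
        if local(el.tag) == "OtherTag"
    }
    lines = []
    for el in root.iter():
        if local(el.tag) != "TextLine":
            continue
        text = " ".join(
            s.get("CONTENT", "") for s in el if local(s.tag) == "String"
        )
        # TAGREFS may hold several space-separated IDs; keep the known ones.
        refs = (el.get("TAGREFS") or "").split()
        line_type = " ".join(labels[r] for r in refs if r in labels)
        lines.append((line_type, text))
    return lines


# Invented sample, not taken from the dataset.
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v4#">
  <Tags>
    <OtherTag ID="LT1" LABEL="DefaultLine"/>
    <OtherTag ID="LT2" LABEL="RubricLines"/>
  </Tags>
  <Layout><Page><PrintSpace><TextBlock>
    <TextLine TAGREFS="LT2"><String CONTENT="placeholder rubric"/></TextLine>
    <TextLine TAGREFS="LT1"><String CONTENT="placeholder text"/></TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>"""

for line_type, text in read_alto_lines(SAMPLE):
    print(line_type, "|", text)
```

For a file on disk, `ET.parse(path).getroot()` could replace `ET.fromstring`. Filtering on the `RubricLines` or `HalfLines` labels would then isolate the custom line types.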

---

📁 `dataset.zip`

This folder contains the dataset used in the experiments described in the paper, using the Learnable Typewriter architecture.

Folder structure:

dataset

├── images

└── annotation.json

- `images/`: Each subfolder contains polygonal line extractions (with alpha transparency) per manuscript page.
- `annotation.json`: Contains the annotation and metadata for each line.
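
A minimal sketch of how `annotation.json` might be consumed: it assumes the file is plain JSON mapping each image ID to an entry with `split`, `label`, and `script` keys, as in the structure documented in this description. The helper names, the sample IDs, the `"test"` split value, and the `"HandB"` script value are hypothetical.

```python
import json
from collections import Counter


def lines_for_split(annotations: dict, split: str) -> list:
    """Return (image_id, label) pairs for all lines in the given split."""
    return [
        (image_id, entry["label"])
        for image_id, entry in annotations.items()
        if entry.get("split") == split
    ]


def count_by(annotations: dict, key: str) -> Counter:
    """Count lines per value of a metadata key such as 'script' or 'split'."""
    return Counter(entry.get(key) for entry in annotations.values())


# Invented sample mimicking the documented structure.
SAMPLE = json.loads("""
{
  "line_0001": {"split": "train", "label": "first line", "script": "RaouletOrleans"},
  "line_0002": {"split": "train", "label": "second line", "script": "HandB"},
  "line_0003": {"split": "test", "label": "third line", "script": "RaouletOrleans"}
}
""")

print(lines_for_split(SAMPLE, "train"))
print(count_by(SAMPLE, "script"))

# With the real file (path assumed from the folder structure above):
# with open("dataset/annotation.json", encoding="utf-8") as f:
#     annotations = json.load(f)
```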

`annotation.json` structure example:

```json
"<image_id>": {                        // Corresponds to the image names in the images folders
  "split": "train",
  "label": "A beautiful calico cat.",  // Transcription text of the line
  "line_type": "DefaultLine",          // The type of line
  "script": "RaouletOrleans",          // Identifier for the scribal hand
  "folio": "1r",
  "gp": "GP1",
  "doc": "HT1"
}
```

A project webpage will be added soon.

Code:  https://github.com/malamatenia/palaeographic-variability-analysis-grandes-chroniques-fr-2813

This study was supported by the CNRS through MITI and the 80|Prime program (CrEMe: Caractérisation des écritures médiévales), and by the European Research Council (ERC project DISCOVER, number 101076028).

Files

btv1b84472995_GT.zip

Files (1.1 GB)

md5:19c21ca855cfd39ecaae3f306aa7fcaa, 139.0 MB
md5:624582169ed703b462c9aef06ed92e69, 947.1 MB

Additional details

Funding

European Research Council
DISCOVER 101076028
Centre National de la Recherche Scientifique
CreMe