Dataset for BnF, fr. 2813 - Grandes Chroniques de France
Description
This repository is organized into two main data folders:
---
📁 `btv1b84472995_GT.zip`
This folder contains the ground truth dataset used for Handwritten Text Recognition (HTR), created from selected folia of the manuscript Paris, BnF, français 2813.
The identifier `btv1b84472995` is the ARK identifier of this manuscript in Gallica.
Folder structure:
btv1b84472995_GT
├── images
└── annotations
- `images/`: High-resolution selected images downloaded from Gallica.
Image names follow the pattern `btv1b84472995_f<number>`, corresponding to the Gallica view number.
➤ Credit: *Source gallica.bnf.fr / Bibliothèque nationale de France*
- `annotations/`: XML-ALTO annotation files created with eScriptorium.
Layout: Annotations follow the Segmonto ontology. Users of the ground truth should note that we use two additional custom line tags:
- `'RubricLines'`: Rubricated lines
- `'HalfLines'`: Partial or incomplete lines
Transcription: The dataset is CATMuS-compliant, using a graphemic transcription approach.
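As a rough illustration of how the annotation files might be read, the sketch below collects the line type and transcription of every text line in one ALTO file. It assumes the usual ALTO layout, where each `TextLine` holds `String` elements with a `CONTENT` attribute and points to an `OtherTag` label (such as `RubricLines` or `HalfLines`) through `TAGREFS`. The actual export may differ, so inspect one file before relying on it. The function names are illustrative and not part of the dataset.

```python
import xml.etree.ElementTree as ET


def local(tag):
    """Strip the XML namespace so the code does not depend on the ALTO version."""
    return tag.rsplit("}", 1)[-1]


def read_alto_lines(source):
    """Return (line_type, text) pairs for every TextLine in an ALTO file.

    `source` is a file path or a binary file object.
    """
    root = ET.parse(source).getroot()

    # Map tag IDs to their labels (assumed to be declared as OtherTag elements).
    labels = {
        el.get("ID"): el.get("LABEL")
        for el in root.iter()
        if local(el.tag) == "OtherTag"
    }

    lines = []
    for el in root.iter():
        if local(el.tag) != "TextLine":
            continue
        refs = (el.get("TAGREFS") or "").split()
        # First referenced tag with a known label, or None if the line is untagged.
        line_type = next((labels[r] for r in refs if r in labels), None)
        text = " ".join(
            child.get("CONTENT", "") for child in el if local(child.tag) == "String"
        )
        lines.append((line_type, text))
    return lines
```

Calling `read_alto_lines` on a file from `annotations/` and filtering on `line_type == "RubricLines"` would then isolate the rubricated lines, provided the tags are stored as assumed here.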
---
📁 `dataset.zip`
This folder contains the dataset used in the experiments described in the paper, which rely on the Learnable Typewriter architecture.
Folder structure:
dataset
├── images
└── annotation.json
- `images/`: Each subfolder contains polygonal line extractions (with alpha transparency) per manuscript page.
- `annotation.json`: Contains the annotation and metadata for each line.
`annotation.json` structure example:
```json
{
  "<image_id>": {                        // corresponds to the image names in the images folders
    "split": "train",
    "label": "A beautiful calico cat.",  // transcription text of the line
    "line_type": "DefaultLine",          // the type of line
    "script": "RaouletOrleans",          // identifier for the scribal hand
    "folio": "1r",
    "gp": "GP1",
    "doc": "HT1"
  }
}
```
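A minimal sketch of how these records might be loaded and filtered is given below. It assumes that the real `annotation.json` is strict JSON (without the explanatory comments of the example) and that its top level is an object mapping image ids to records. The helper names `load_split` and `count_by` are hypothetical, and `"train"` is the only split value shown in the example.

```python
import json
from collections import Counter


def load_split(annotation_path, split="train"):
    """Return the records of annotation.json whose "split" field equals `split`."""
    with open(annotation_path, encoding="utf-8") as f:
        annotations = json.load(f)
    return {
        image_id: record
        for image_id, record in annotations.items()
        if record.get("split") == split
    }


def count_by(records, field):
    """Count records per value of a metadata field such as "script" or "line_type"."""
    return Counter(record.get(field) for record in records.values())
```

For instance, `count_by(load_split("dataset/annotation.json"), "script")` would give the number of training lines per scribal hand, assuming the archive is extracted to a `dataset` directory.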
**Project webpage coming soon.**
Code: https://github.com/malamatenia/palaeographic-variability-analysis-grandes-chroniques-fr-2813
This study was supported by the CNRS through MITI and the 80|Prime program (CrEMe Caractérisation des écritures médiévales), and by the European Research Council (ERC project DISCOVER, number 101076028).
Files
Total size: 1.1 GB
Size | Checksum |
---|---|
139.0 MB | md5:19c21ca855cfd39ecaae3f306aa7fcaa |
947.1 MB | md5:624582169ed703b462c9aef06ed92e69 |
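The `md5:` values listed for the archives can be used to check that a download is intact. The sketch below shows one way to do so with Python's standard `hashlib` module. The function names are illustrative, and the file path and expected value must be supplied by the user.

```python
import hashlib


def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path, expected):
    """Compare a file's MD5 with an expected value, with or without the "md5:" prefix."""
    return md5_of(path) == expected.split(":", 1)[-1].lower()
```

A call such as `matches("<downloaded archive>", "md5:<value from the file listing>")` returns `True` when the archive matches its published checksum.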