HORAE_LSv2: Layout Segmentation Dataset for Medieval Books of Hours (Version 2)
Authors/Creators
- 1. Institut de Recherche et d'Histoire des Textes
- 2. Teklia
- 3. TEKLIA
Description
HORAE_LSv2 is an updated and corrected version of the HORAE_LS dataset originally published in 2019 for automatic layout analysis of medieval Books of Hours. This carefully curated dataset contains 555 fully annotated images from 334 different manuscripts and printed books, providing comprehensive page layout segmentation including text zones, decorative elements, and structural features.
Why Version 2?
The original HORAE_LS (2019) referenced images via IIIF URLs, which became obsolete over time due to institutional changes. HORAE_LSv2 addresses this fragility by:
- Downloading all images as JPEG files for long-term preservation
- Correcting annotations: missing elements added, erroneous labels fixed
- Updating coordinates for images that were resized by institutions
- Migrating annotations from XML PAGE to a more accessible CSV format
- Preserving traceability with original IIIF URLs in metadata
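The coordinate update for resized images can be pictured as a simple rescaling. The following is a minimal sketch, assuming the old and new image dimensions are known and the resize was a plain scaling without cropping; the function name `rescale_polygon` is hypothetical and this is not necessarily the procedure the authors used.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def rescale_polygon(
    polygon: List[Point],
    old_size: Tuple[int, int],
    new_size: Tuple[int, int],
) -> List[Point]:
    """Map polygon points from an old image size to a new one.

    Sizes are (width, height) in pixels. Assumes a plain resize
    (no cropping, no rotation), which is an assumption of this sketch.
    """
    old_w, old_h = old_size
    new_w, new_h = new_size
    scale_x = new_w / old_w
    scale_y = new_h / old_h
    return [(x * scale_x, y * scale_y) for x, y in polygon]
```

For example, a point at (100, 200) in a 1000 x 2000 image maps to (50.0, 100.0) once the image is served at 500 x 1000.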
Dataset Overview
- 555 images at maximum IIIF resolution (downloaded)
- 22,964 annotations across 7 element types
- 334 source manuscripts from the 13th to 16th centuries
- Complete layout segmentation: pages, text zones, text lines, decorative elements
- Multiple formats: CSV (primary), YOLO-compatible labels, HTML visualizations
Annotation Types
| Type | Count | Description |
|---|---|---|
| Text lines | 13,879 | Individual lines, often with transcription |
| Initials | 3,542 | Simple (600), decorated (2,917), historiated (25) |
| Decoration | 2,412 | Line fillers, borders, ornaments, music notation |
| Rubrication | 1,224 | Colored text headers (red, blue, gold) |
| Text zones | 896 | Text blocks and marginal text |
| Pages | 852 | Page boundaries, including calendars (52) |
| Miniatures | 159 | Narrative scenes in text field |
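The counts in the table are strongly imbalanced, which matters when training a detector. The sketch below checks that the per-type counts add up to the stated total and derives class shares and inverse-frequency weights. The weighting scheme is a common illustration chosen here, not something prescribed by the dataset authors.

```python
# Annotation counts copied from the table above.
COUNTS = {
    "Text lines": 13879,
    "Initials": 3542,
    "Decoration": 2412,
    "Rubrication": 1224,
    "Text zones": 896,
    "Pages": 852,
    "Miniatures": 159,
}


def class_shares(counts):
    """Return the fraction of all annotations that each type represents."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


def inverse_frequency_weights(counts):
    """Weights proportional to 1/frequency, scaled so that their mean is 1."""
    total = sum(counts.values())
    raw = {name: total / n for name, n in counts.items()}
    mean = sum(raw.values()) / len(raw)
    return {name: w / mean for name, w in raw.items()}


TOTAL = sum(COUNTS.values())
SHARES = class_shares(COUNTS)
WEIGHTS = inverse_frequency_weights(COUNTS)

for name in COUNTS:
    print(f"{name:<12} {COUNTS[name]:>6}  {SHARES[name]:6.1%}  weight={WEIGHTS[name]:.2f}")
```

Text lines make up roughly 60% of all annotations while miniatures are under 1%, so per-class metrics are more informative than a single aggregate score.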
Key Features
Comprehensive layout analysis: Unlike datasets focused on decorative elements, HORAE_LSv2 captures the complete page structure, including text, decoration, and paratextual elements.
Quality control: All annotations were created by three annotators and systematically verified by a fourth person, ensuring high consistency.
IIIF traceability: Each image includes original IIIF URLs, manifest references, and canvas identifiers, enabling users to access source contexts and higher resolution versions when available.
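As an illustration of what the IIIF references make possible, the sketch below builds an IIIF Image API request for the bounding box of an annotated polygon, following the standard `{base}/{region}/{size}/{rotation}/{quality}.{format}` URL pattern. The base URL is a placeholder and the function name is hypothetical. It also assumes the polygon coordinates match the pixel space of the image currently served, which may not hold if the institution resized the image.

```python
def iiif_region_url(image_base, polygon, size="max"):
    """Build an IIIF Image API URL that crops the polygon's bounding box.

    image_base: IIIF image service URL without a trailing request part
                (placeholder value in the example below).
    polygon:    list of (x, y) pixel coordinates.
    size:       "max" for Image API 2.1+/3.0; older 2.0 servers expect "full".
    """
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    x0, y0 = int(min(xs)), int(min(ys))
    width = int(max(xs)) - x0
    height = int(max(ys)) - y0
    region = f"{x0},{y0},{width},{height}"
    return f"{image_base.rstrip('/')}/{region}/{size}/0/default.jpg"


# Example with a placeholder service URL (not a real dataset entry).
example = iiif_region_url(
    "https://example.org/iiif/img1",
    [(10, 20), (110, 20), (110, 220), (10, 220)],
)
```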
Format flexibility: Annotations are provided in both CSV (with polygon coordinates) and YOLO format for immediate use with modern detection frameworks.
Data Sources
Images originate from major IIIF-compliant digital libraries:
- Bibliothèque nationale de France (Gallica)
- Institut de Recherche et d'Histoire des Textes (BVMM/Arca)
- The Walters Art Museum, Cambridge University Library, and 30+ other institutions
Use Cases
Digital Humanities:
- Page layout analysis and document understanding
- Text zone detection for OCR/HTR pipelines
- Structural analysis of manuscript organization
Computer Vision:
- Historical document segmentation
- Multi-class object detection in complex layouts
- Transfer learning for manuscript analysis
Heritage Applications:
- Automated manuscript indexing
- Digital edition preparation
- Comparative codicological studies
Relationship to HORAE_Minit
HORAE_LSv2 represents an earlier, broader annotation approach focused on complete layout segmentation. For projects specifically targeting decorative elements (miniatures, initials, marginal decorations), we recommend the successor dataset HORAE_Minit (14,225 images, refined ontology, larger scale).
A subset of 530 images from HORAE_LSv2 was adapted to the HORAE_Minit ontology and is included in that dataset as HORAE_Minit_E, providing continuity between approaches.
Contents
HORAE_LSv2/
├── images/ # 555 JPEG images (max IIIF resolution)
├── labels/ # YOLO format annotations
├── metadata/ # CSV files with complete annotation data
│ ├── HORAE_LSv2_elements.csv # All 22,964 annotations exported from Arkindex
│ ├── HORAE_LSv2_image_data.csv # Image metadata & IIIF URLs
│ └── train/val/test.csv # Recommended splits
├── extras/ # HTML visualizations
└── README.md
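Given the layout above, images and YOLO labels can be paired by file name. This is a minimal sketch, assuming each label file shares its stem with the image and uses a `.txt` extension (the usual YOLO convention, not confirmed for this dataset) and that images end in `.jpg`. The function names are hypothetical.

```python
from pathlib import Path


def pair_images_with_labels(root):
    """Return a sorted list of (image_path, label_path or None) tuples."""
    root = Path(root)
    pairs = []
    for image_path in sorted((root / "images").glob("*.jpg")):
        label_path = root / "labels" / f"{image_path.stem}.txt"
        pairs.append((image_path, label_path if label_path.exists() else None))
    return pairs


def read_yolo_labels(label_path):
    """Parse a YOLO label file into (class_id, [normalized floats]) tuples.

    Works for both box lines (4 values) and polygon lines (2n values),
    since it only splits the class id from the remaining numbers.
    """
    records = []
    for line in Path(label_path).read_text(encoding="utf-8").splitlines():
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        records.append((int(parts[0]), [float(v) for v in parts[1:]]))
    return records
```

Images that come back with `None` as label path have no label file under this naming assumption and are worth checking by hand.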
Technical Specifications
- Annotation format: Polygon coordinates (CSV primary format)
- Coordinate system: Absolute pixel coordinates
- Image format: JPEG, maximum IIIF resolution
- Secondary format: YOLO-compatible labels (normalized coordinates)
- Annotation tool: Originally created in Transkribus, migrated via Arkindex
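The relationship between the two coordinate systems can be sketched as follows: an absolute-pixel polygon is reduced to its bounding box and normalized by the image size. This assumes bounding-box style YOLO labels (`class x_center y_center width height`); the shipped labels may use a different variant. The JSON polygon encoding handled by `parse_polygon` is also an assumption about the CSV, and all function names are hypothetical.

```python
import json


def parse_polygon(text):
    """Parse a polygon stored as a JSON array of [x, y] pairs.

    The JSON encoding is an assumption about the CSV column content.
    """
    return [(float(x), float(y)) for x, y in json.loads(text)]


def polygon_to_yolo_bbox(polygon, image_width, image_height):
    """Convert absolute pixel points to a normalized (xc, yc, w, h) box."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    return (
        (x_min + x_max) / 2 / image_width,   # normalized box center x
        (y_min + y_max) / 2 / image_height,  # normalized box center y
        (x_max - x_min) / image_width,       # normalized box width
        (y_max - y_min) / image_height,      # normalized box height
    )


def format_yolo_line(class_id, bbox):
    """Format one label line with six decimal places per value."""
    return f"{class_id} " + " ".join(f"{v:.6f}" for v in bbox)
```

For a rectangle spanning (100, 200) to (300, 600) in a 1000 x 2000 image, all four normalized values come out as 0.2.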
Known Limitations
- Heterogeneous application: Annotation ontology evolved during creation; not all elements uniformly applied
- Image variations: Some images differ from 2019 versions due to institutional changes
- Text line quality: Automatically detected then manually corrected; may contain residual errors
- Decorative element classification: Less refined than HORAE_Minit; use successor dataset for fine-grained analysis
Version History
Version 1.0 (June 2019)
- Original HORAE_LS publication on GitHub
- XML PAGE format with IIIF URL references
- Published with Boillet et al. (2019) at HIP'19
Version 2.0 (2024)
- Republished on Zenodo with persistent DOI
- Downloaded images for long-term preservation
- Corrected annotations and updated coordinates
- CSV format for broader accessibility
- Enhanced metadata and documentation
Citation
@dataset{stutzmann2024horae_lsv2,
author = {Stutzmann, Dominique and Bernard-Leterme, Lise and
Boillet, Mélodie and Bonhomme, Marie-Laurence and
Kermorvant, Christopher},
title = {{HORAE\_LSv2: Layout Segmentation Dataset for
Medieval Books of Hours (Version 2)}},
year = 2024,
publisher = {Zenodo},
doi = {10.5281/zenodo.16919911},
url = {https://doi.org/10.5281/zenodo.16919911}
}
Please also cite the original publication:
@inproceedings{boillet2019horae,
title = {{HORAE: an annotated dataset of books of hours}},
author = {Boillet, Mélodie and Bonhomme, Marie-Laurence and
Stutzmann, Dominique and Kermorvant, Christopher},
booktitle = {Proceedings of the 5th International Workshop on
Historical Document Imaging and Processing (HIP'19)},
pages = {7--12},
year = 2019,
doi = {10.1145/3352631.3352633}
}
Related Publications
- HORAE_Minit (successor dataset): https://doi.org/10.5281/zenodo.17279364
- HORAE Detection Models: https://doi.org/10.5281/zenodo.17279775
- Bernard & Stutzmann (2025): "Detection of Miniatures and Initials in Medieval Books of Hours" (submitted)
License
Annotations: Creative Commons Attribution 4.0 International (CC BY 4.0)
Images: Retain original institutional licenses. Most are from public domain manuscripts; verify specific rights via provided IIIF URLs.
Funding & Acknowledgments
Version 2 (2025):
- Biblissima+ (ANR-21-ESRE-0005)
- Institut de Recherche et d'Histoire des Textes (CNRS)
Original Version (2019):
- HORAE project funding
- Teklia (Arkindex platform)
- 3 annotators + 1 validator
Communities (suggested)
- Digital Humanities
- Medieval Studies
- Computer Vision
- Document Analysis
- Cultural Heritage
- IIIF (International Image Interoperability Framework)
- Biblissima+
Related Identifiers
Is new version of:
- HORAE_LS (2019): https://github.com/oriflamms/HORAE
Is cited by:
- HORAE_Minit: https://doi.org/10.5281/zenodo.17279364
- HORAE Detection Models: https://doi.org/10.5281/zenodo.17279775
Is supplement to:
- Boillet et al. (2019): https://doi.org/10.1145/3352631.3352633
Is described by:
- Bernard & Stutzmann (2025): (DOI to be added upon publication)
Grants
ANR-21-ESRE-0005 (Biblissima+)
Contributors
Dominique Stutzmann (Creator, Contact person)
- Affiliation: Institut de Recherche et d'Histoire des Textes, CNRS
- Role: Project direction, Version 2 preparation
Lise Bernard-Leterme (Creator)
- Affiliation: Institut de Recherche et d'Histoire des Textes, CNRS
- Role: Version 2 preparation, corrections, documentation
Mélodie Boillet (Creator - Original version)
- Role: Original annotation and technical implementation (2019)
Marie-Laurence Bonhomme (Creator - Original version)
- Role: Original annotation coordination and quality control (2019)
Christopher Kermorvant (Creator - Original version)
- Affiliation: Teklia
- Role: Original technical supervision (2019)
Notes (Additional Information)
Differences from Original HORAE_LS (2019)
Major improvements:
- All images downloaded and stored as JPEG files (original used IIIF URLs only)
- Format migrated from XML PAGE to CSV for broader accessibility
- Corrected missing annotations and label errors
- Updated coordinates for 10 images affected by institutional resizing
- Added comprehensive metadata and IIIF traceability documentation
- Generated YOLO-compatible labels for modern frameworks
Minor changes:
- Some annotation type renaming during Arkindex migration
- Enhanced documentation and usage examples
- Added visualization files (HTML)
Recommended Use
Best suited for:
- Complete page layout analysis
- Text zone detection for OCR/HTR preparation
- Training general manuscript segmentation models
- Studying manuscript structure and organization
Less suited for:
- Fine-grained decorative element classification → Use HORAE_Minit instead
- Large-scale iconographic analysis → Use HORAE_Minit instead
- Projects requiring only decorative elements → Use HORAE_Minit instead
Integration with Modern Workflows
The dataset includes train/validation/test splits for reproducible experiments. YOLO-format labels enable immediate use with the Ultralytics framework and other modern detection tools. The CSV format allows easy integration with custom pipelines and with annotation tools such as Label Studio or CVAT.
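One way to connect the CSV splits to a YOLO-style training setup is sketched below: each split CSV is turned into a text file listing image paths, and a dataset YAML with `path`, `train`, `val`, `test`, and `names` keys is generated. The CSV column name `image_name`, the output file names, and above all the class order in `CLASS_NAMES` are assumptions made for illustration; they should be checked against the shipped files before training.

```python
import csv
from pathlib import Path

# Hypothetical class names and order; verify against the shipped labels.
CLASS_NAMES = [
    "page", "text_zone", "text_line", "initial",
    "decoration", "rubrication", "miniature",
]


def split_csv_to_list(split_csv, images_dir, out_txt, column="image_name"):
    """Write one image path per line from a split CSV; return the row count.

    `column` is an assumed header name for the image file name.
    """
    with open(split_csv, newline="", encoding="utf-8") as f:
        names = [row[column] for row in csv.DictReader(f)]
    lines = [f"{Path(images_dir) / name}\n" for name in names]
    Path(out_txt).write_text("".join(lines), encoding="utf-8")
    return len(names)


def build_data_yaml(root, class_names):
    """Return dataset YAML text pointing at per-split image list files."""
    lines = [
        f"path: {root}",
        "train: train.txt",
        "val: val.txt",
        "test: test.txt",
        "names:",
    ]
    lines += [f"  {i}: {name}" for i, name in enumerate(class_names)]
    return "\n".join(lines) + "\n"
```

Generating the list files rather than moving images keeps the archive layout untouched, so the same download can serve several experiments.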
Preservation Rationale
This republication was motivated by the fragility of web references in digital humanities. The original HORAE_LS dataset demonstrated that even well-established IIIF infrastructure can change, breaking research reproducibility. By downloading and archiving actual images alongside persistent metadata, HORAE_LSv2 ensures long-term accessibility while maintaining links to source contexts.
Files
Total size: 1.5 GB
| MD5 checksum | Size |
|---|---|
| 8c88a32c19a30ba86958997766d1ffc8 | 2.3 MB |
| 9a6b73e3282040b45a12c17b922f9404 | 1.5 GB |
| 0ce2ed685f87c249c77ce2966f839900 | 464.8 kB |
| a4de7cc226b8aeda32f35b297499dde5 | 3.9 MB |
| 066c951402b61dd5027e63c15d73179a | 17.6 kB |
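The MD5 checksums listed above can be used to confirm that a download is intact. The sketch below computes a file's MD5 in chunks, so the 1.5 GB file does not need to fit in memory. The file names belonging to each checksum are not shown in the listing, so the name in the usage comment is a placeholder.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_md5(path, expected_hex):
    """Compare a file's MD5 digest with an expected hexadecimal string."""
    return md5_of_file(path) == expected_hex.strip().lower()


# Usage (placeholder file name, checksum taken from the listing above):
# verify_md5("downloaded_archive.zip", "9a6b73e3282040b45a12c17b922f9404")
```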
Additional details
Related works
Is continued by:
- Dataset: 10.5281/zenodo.17279364 (DOI)
Is supplemented by:
- Software: 10.5281/zenodo.17279775 (DOI)