Published August 21, 2025 | Version 2.0
Dataset Open

HORAE-LSv2. Layout Segmentation Dataset for Medieval Books of Hours (Version 2)

Description

HORAE_LSv2 is an updated and corrected version of the HORAE_LS dataset originally published in 2019 for automatic layout analysis of medieval Books of Hours. This carefully curated dataset contains 555 fully annotated images from 334 different manuscripts and printed books, providing comprehensive page layout segmentation including text zones, decorative elements, and structural features.

Why Version 2?

The original HORAE_LS (2019) referenced images via IIIF URLs, which became obsolete over time due to institutional changes. HORAE_LSv2 addresses this fragility by:

  • Downloading all images as JPEG files for long-term preservation
  • Correcting annotations: missing elements added, erroneous labels fixed
  • Updating coordinates for images that were resized by institutions
  • Migrating the format from PAGE XML to a more accessible CSV format
  • Preserving traceability with original IIIF URLs in metadata

Dataset Overview

  • 555 images at maximum IIIF resolution (downloaded)
  • 22,964 annotations across 7 element types
  • 334 source manuscripts from 13th-16th centuries
  • Complete layout segmentation: pages, text zones, text lines, decorative elements
  • Multiple formats: CSV (primary), YOLO-compatible labels, HTML visualizations

Annotation Types

Type          Count    Description
Text lines    13,879   Individual lines, often with transcription
Initials       3,542   Simple (600), decorated (2,917), historiated (25)
Decoration     2,412   Line fillers, borders, ornaments, music notation
Text zones       896   Text blocks and marginal text
Pages            852   Page boundaries, including calendars (52)
Rubrication    1,224   Colored text headers (red, blue, gold)
Miniatures       159   Narrative scenes in text field
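
The per-type counts can be cross-checked against the stated total of 22,964 annotations and against the breakdown of initials. A minimal sketch using only the figures listed in this record; the variable names are illustrative:

```python
# Counts copied from the annotation table in this record.
counts = {
    "Text lines": 13_879,
    "Initials": 3_542,
    "Decoration": 2_412,
    "Text zones": 896,
    "Pages": 852,
    "Rubrication": 1_224,
    "Miniatures": 159,
}
# Sub-types of initials as listed in the same table.
initials = {"simple": 600, "decorated": 2_917, "historiated": 25}

total = sum(counts.values())
initials_total = sum(initials.values())

print(total)           # 22964
print(initials_total)  # 3542
```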

Key Features

Comprehensive layout analysis: Unlike focused decorative element datasets, HORAE_LSv2 captures complete page structure including text, decoration, and paratextual elements.

Quality control: All annotations were created by three annotators and systematically verified by a fourth person to improve consistency.

IIIF traceability: Each image includes original IIIF URLs, manifest references, and canvas identifiers, enabling users to access source contexts and higher resolution versions when available.
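
Since each image keeps its IIIF identifiers, a full-size or cropped version can in principle be requested from the source server. The sketch below builds a request URL following the IIIF Image API pattern `{base}/{region}/{size}/{rotation}/{quality}.{format}`. The base URL is hypothetical and the function names are not part of the dataset; older Image API 2.x servers may expect `full` instead of `max` as the size value.

```python
def iiif_image_url(base, region="full", size="max", rotation=0,
                   quality="default", fmt="jpg"):
    """Build a IIIF Image API request URL from an image service base URL."""
    return f"{base.rstrip('/')}/{region}/{size}/{rotation}/{quality}.{fmt}"


def iiif_crop_url(base, x, y, w, h, **kwargs):
    """Request only the pixel region x,y,w,h (for example one annotated element)."""
    return iiif_image_url(base, region=f"{x},{y},{w},{h}", **kwargs)


# Hypothetical service URL, for illustration only.
base = "https://iiif.example.org/iiif/ms123_f001"
print(iiif_image_url(base))
print(iiif_crop_url(base, 100, 200, 300, 400))
```

Whether a given institution still serves the image at the recorded URL has to be checked case by case, which is the fragility that motivated the local JPEG copies.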

Format flexibility: Annotations are provided in both CSV (with polygon coordinates) and YOLO format for immediate use with modern detection frameworks.

Data Sources

Images originate from major IIIF-compliant digital libraries:

  • Bibliothèque nationale de France (Gallica)
  • Institut de Recherche et d'Histoire des Textes (BVMM/Arca)
  • The Walters Art Museum, Cambridge University Library, and 30+ other institutions

Use Cases

Digital Humanities:

  • Page layout analysis and document understanding
  • Text zone detection for OCR/HTR pipelines
  • Structural analysis of manuscript organization

Computer Vision:

  • Historical document segmentation
  • Multi-class object detection in complex layouts
  • Transfer learning for manuscript analysis

Heritage Applications:

  • Automated manuscript indexing
  • Digital edition preparation
  • Comparative codicological studies

Relationship to HORAE_Minit

HORAE_LSv2 represents an earlier, broader annotation approach focused on complete layout segmentation. For projects specifically targeting decorative elements (miniatures, initials, marginal decorations), we recommend the successor dataset HORAE_Minit (14,225 images, refined ontology, larger scale).

A subset of 530 images from HORAE_LSv2 was adapted to the HORAE_Minit ontology and is included in that dataset as HORAE_Minit_E, providing continuity between approaches.

Contents

HORAE_LSv2/
├── images/           # 555 JPEG images (max IIIF resolution)
├── labels/           # YOLO format annotations
├── metadata/         # CSV files with complete annotation data
│   ├── HORAE_LSv2_elements.csv      # All 22,964 annotations exported from Arkindex
│   ├── HORAE_LSv2_image_data.csv    # Image metadata & IIIF URLs
│   └── train/val/test.csv           # Recommended splits
├── extras/           # HTML visualizations
└── README.md
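
A first sanity check after download might be to count annotations per element type in the elements CSV. The sketch below uses only the standard library. The column names (`image`, `type`) are assumptions made for illustration, not the actual header of `HORAE_LSv2_elements.csv`, which should be checked in the file itself.

```python
import csv
import io
from collections import Counter


def count_by_type(csv_file, type_column="type"):
    """Count annotation rows per element type in an open CSV file."""
    reader = csv.DictReader(csv_file)
    return Counter(row[type_column] for row in reader)


# Tiny in-memory stand-in for metadata/HORAE_LSv2_elements.csv.
# The column names here are hypothetical, not the real header.
sample = io.StringIO(
    "image,type\n"
    "a.jpg,text_line\n"
    "a.jpg,text_line\n"
    "a.jpg,initial\n"
)
print(count_by_type(sample))
```

For the real file, the same function could be called on `open(path, newline="", encoding="utf-8")` with the actual name of the type column.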

Technical Specifications

  • Annotation format: Polygon coordinates (CSV primary format)
  • Coordinate system: Absolute pixel coordinates
  • Image format: JPEG, maximum IIIF resolution
  • Secondary format: YOLO-compatible labels (normalized coordinates)
  • Annotation tool: Originally created in Transkribus, migrated via Arkindex
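
The relationship between the two coordinate systems can be illustrated by deriving a normalized YOLO box from an absolute-pixel polygon. This is a sketch under the assumption that a label is the axis-aligned bounding box of the polygon; the record does not state how the shipped labels were generated, and the function name is illustrative.

```python
def polygon_to_yolo(points, img_w, img_h):
    """Convert absolute-pixel polygon points to a normalized YOLO box.

    Returns (x_center, y_center, width, height), each in [0, 1].
    """
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    return (
        (x_min + x_max) / 2 / img_w,
        (y_min + y_max) / 2 / img_h,
        (x_max - x_min) / img_w,
        (y_max - y_min) / img_h,
    )


# Illustrative polygon on a 1000 x 2000 pixel image.
poly = [(100, 200), (300, 200), (300, 600), (100, 600)]
print(polygon_to_yolo(poly, 1000, 2000))  # (0.2, 0.2, 0.2, 0.2)
```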

Known Limitations

  • Heterogeneous application: Annotation ontology evolved during creation; not all elements uniformly applied
  • Image variations: Some images differ from 2019 versions due to institutional changes
  • Text line quality: Automatically detected then manually corrected; may contain residual errors
  • Decorative element classification: Less refined than HORAE_Minit; use successor dataset for fine-grained analysis

Version History

Version 1.0 (June 2019)

  • Original HORAE_LS publication on GitHub
  • PAGE XML format with IIIF URL references
  • Published with Boillet et al. (2019) at HIP'19

Version 2.0 (2024)

  • Republished on Zenodo with persistent DOI
  • Downloaded images for long-term preservation
  • Corrected annotations and updated coordinates
  • CSV format for broader accessibility
  • Enhanced metadata and documentation

Citation

@dataset{stutzmann2024horae_lsv2,
  author       = {Stutzmann, Dominique and Bernard-Leterme, Lise and 
                  Boillet, Mélodie and Bonhomme, Marie-Laurence and 
                  Kermorvant, Christopher},
  title        = {{HORAE\_LSv2: Layout Segmentation Dataset for 
                   Medieval Books of Hours (Version 2)}},
  year         = 2024,
  publisher    = {Zenodo},
  doi          = {10.5281/zenodo.16919911},
  url          = {https://doi.org/10.5281/zenodo.16919911}
}

Please also cite the original publication:

@inproceedings{boillet2019horae,
  title        = {{HORAE: an annotated dataset of books of hours}},
  author       = {Boillet, Mélodie and Bonhomme, Marie-Laurence and 
                  Stutzmann, Dominique and Kermorvant, Christopher},
  booktitle    = {Proceedings of the 5th International Workshop on 
                  Historical Document Imaging and Processing (HIP'19)},
  pages        = {7--12},
  year         = 2019,
  doi          = {10.1145/3352631.3352633}
}

Related Publications

  • HORAE_Minit (successor dataset): https://doi.org/10.5281/zenodo.17279364
  • HORAE Detection Models: https://doi.org/10.5281/zenodo.17279775
  • Bernard & Stutzmann (2025): "Detection of Miniatures and Initials in Medieval Books of Hours" (submitted)

License

Annotations: Creative Commons Attribution 4.0 International (CC BY 4.0)

Images: Retain original institutional licenses. Most are from public domain manuscripts; verify specific rights via provided IIIF URLs.
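
One way to check image rights is to read the rights statement from the IIIF manifest referenced in the metadata. The sketch below assumes the manifest has already been fetched and parsed into a dictionary (for example with `json.load`). IIIF Presentation API 3.0 manifests carry a `rights` property and 2.x manifests a `license` property; a manifest may provide neither, in which case the institution's own terms apply.

```python
def manifest_rights(manifest):
    """Return the rights/license statement of a parsed IIIF manifest, if any.

    Presentation API 3.0 uses "rights"; 2.x uses "license".
    """
    for key in ("rights", "license"):
        value = manifest.get(key)
        if value:
            return value
    return None


# Minimal illustrative manifests, not taken from the dataset.
v3 = {"type": "Manifest", "rights": "http://creativecommons.org/publicdomain/mark/1.0/"}
v2 = {"@type": "sc:Manifest", "license": "https://creativecommons.org/licenses/by/4.0/"}
print(manifest_rights(v3))
print(manifest_rights(v2))
print(manifest_rights({}))
```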

Funding & Acknowledgments

Version 2 (2025):

  • Biblissima+ (ANR-21-ESRE-0005)
  • Institut de Recherche et d'Histoire des Textes (CNRS)

Original Version (2019):

  • HORAE project funding
  • Teklia (Arkindex platform)
  • 3 annotators + 1 validator

Communities (suggested)

  • Digital Humanities
  • Medieval Studies
  • Computer Vision
  • Document Analysis
  • Cultural Heritage
  • IIIF (International Image Interoperability Framework)
  • Biblissima+

Related Identifiers

Is new version of:

  • HORAE_LS (2019): https://github.com/oriflamms/HORAE

Is cited by:

  • HORAE_Minit: https://doi.org/10.5281/zenodo.17279364
  • HORAE Detection Models: https://doi.org/10.5281/zenodo.17279775

Is supplement to:

  • Boillet et al. (2019): https://doi.org/10.1145/3352631.3352633

Is described by:

  • Bernard & Stutzmann (2025): (DOI to be added upon publication)

Grants

ANR-21-ESRE-0005 (Biblissima+)

Contributors

Dominique Stutzmann (Creator, Contact person)

  • Affiliation: Institut de Recherche et d'Histoire des Textes, CNRS
  • Role: Project direction, Version 2 preparation

Lise Bernard-Leterme (Creator)

  • Affiliation: Institut de Recherche et d'Histoire des Textes, CNRS
  • Role: Version 2 preparation, corrections, documentation

Mélodie Boillet (Creator - Original version)

  • Role: Original annotation and technical implementation (2019)

Marie-Laurence Bonhomme (Creator - Original version)

  • Role: Original annotation coordination and quality control (2019)

Christopher Kermorvant (Creator - Original version)

  • Affiliation: Teklia
  • Role: Original technical supervision (2019)

Notes (Additional Information)

Differences from Original HORAE_LS (2019)

Major improvements:

  • All images downloaded and stored as JPEG files (original used IIIF URLs only)
  • Format migrated from PAGE XML to CSV for broader accessibility
  • Corrected missing annotations and label errors
  • Updated coordinates for 10 images affected by institutional resizing
  • Added comprehensive metadata and IIIF traceability documentation
  • Generated YOLO-compatible labels for modern frameworks

Minor changes:

  • Some annotation type renaming during Arkindex migration
  • Enhanced documentation and usage examples
  • Added visualization files (HTML)
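
The coordinate update for resized images listed above can be pictured as a simple rescaling of each polygon. The sketch below assumes a plain resize without cropping; the record does not describe the procedure actually used, and the function name is illustrative.

```python
def rescale_polygon(points, old_size, new_size):
    """Map polygon points from an image of old_size to one of new_size.

    Sizes are (width, height) tuples in pixels.
    """
    sx = new_size[0] / old_size[0]
    sy = new_size[1] / old_size[1]
    return [(round(x * sx), round(y * sy)) for x, y in points]


poly = [(100, 200), (300, 200), (300, 600)]
print(rescale_polygon(poly, old_size=(1000, 2000), new_size=(500, 1000)))
# [(50, 100), (150, 100), (150, 300)]
```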

Recommended Use

Best suited for:

  • Complete page layout analysis
  • Text zone detection for OCR/HTR preparation
  • Training general manuscript segmentation models
  • Studying manuscript structure and organization

Less suited for:

  • Fine-grained decorative element classification → Use HORAE_Minit instead
  • Large-scale iconographic analysis → Use HORAE_Minit instead
  • Projects requiring only decorative elements → Use HORAE_Minit instead

Integration with Modern Workflows

The dataset includes train/validation/test splits for reproducible experiments. The YOLO-format labels enable immediate use with the Ultralytics framework and other modern detection tools. The CSV format allows easy integration with custom pipelines and annotation tools such as Label Studio or CVAT.
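
For custom pipelines, a YOLO label line can be mapped back to pixel coordinates. This is a minimal sketch assuming detection-style labels with five fields per line (`class x_center y_center width height`, all normalized); segmentation-style lines with a list of polygon points would need different handling. The class index and values in the example are made up.

```python
def yolo_line_to_pixels(line, img_w, img_h):
    """Parse one 5-field YOLO label line.

    Returns (class_id, x_min, y_min, x_max, y_max) in pixels.
    """
    fields = line.split()
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}")
    class_id = int(fields[0])
    xc, yc, w, h = (float(v) for v in fields[1:])
    x_min = (xc - w / 2) * img_w
    y_min = (yc - h / 2) * img_h
    x_max = (xc + w / 2) * img_w
    y_max = (yc + h / 2) * img_h
    return class_id, x_min, y_min, x_max, y_max


print(yolo_line_to_pixels("2 0.5 0.5 0.25 0.5", img_w=1000, img_h=2000))
# (2, 375.0, 500.0, 625.0, 1500.0)
```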

Preservation Rationale

This republication was motivated by the fragility of web references in digital humanities. The original HORAE_LS dataset demonstrated that even well-established IIIF infrastructure can change, breaking research reproducibility. By downloading and archiving actual images alongside persistent metadata, HORAE_LSv2 ensures long-term accessibility while maintaining links to source contexts.

Files (1.5 GB)

Checksum                                Size
md5:8c88a32c19a30ba86958997766d1ffc8    2.3 MB
md5:9a6b73e3282040b45a12c17b922f9404    1.5 GB
md5:0ce2ed685f87c249c77ce2966f839900    464.8 kB
md5:a4de7cc226b8aeda32f35b297499dde5    3.9 MB
md5:066c951402b61dd5027e63c15d73179a    17.6 kB
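
A downloaded file can be checked against its listed MD5 checksum with the standard library. The helper names below are illustrative, and any local path passed to them is a placeholder chosen by the user.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file against a checksum written as 'md5:<hex>' or as bare hex."""
    return md5_of_file(path) == expected.removeprefix("md5:").lower()
```

Reading in chunks keeps memory use flat even for the 1.5 GB archive. `str.removeprefix` requires Python 3.9 or later.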

Additional details

Related works

Is continued by:

  • Dataset: https://doi.org/10.5281/zenodo.17279364

Is supplemented by:

  • Software: https://doi.org/10.5281/zenodo.17279775

Funding

Agence Nationale de la Recherche:

  • HORAE (Hours - Recognition, Analysis, Editions): ANR-17-CE38-0008
  • Biblissima+ (Observatoire des cultures écrites anciennes, de l’argile à l’imprimé): ANR-21-ESRE-0005