LaudareProject/LaudareDataset: Scientific Data

Federico Simonetta

doi:10.5281/zenodo.18922615

Published March 9, 2026 | Version 1.0-1

Software Open

Laudare Dataset

This dataset is one of the outcomes of the Laudare ERC AdG project.

It comprises two medieval manuscripts with music and text, ready to be used for Historical Document Understanding (HDU) tasks, including:

Optical Character Recognition (OCR)
Optical Music Recognition (OMR)
Optical Character and Music Recognition (OCMR)
Layout Recognition
Text to music alignment (at the verse level)
Music and text symbolic analysis
...

An in-depth description can be found in the following article.

...

You can use this dataset to compare against our baselines using the accompanying framework. This approach has the benefit of providing proper evaluation measures and a standardized experimental protocol.

You can also use the provided annotations as is, in which case we recommend reading the recommendations below.

Recommendations

This dataset contains 3 different standardized splits, namely 5-fold cross-validation, sequential learning, and cross-manuscript tests.

The sequential learning splits are generated by splitting each manuscript across 10 steps, where 1/10 of the dataset is added at each step, following the order of the pages. To mitigate catastrophic forgetting, at each step we also randomly add 1/10 of the dataset from the already seen images. Thus, each step except the first contains 1/5 of the dataset in the train set. The sequential learning test split can also be used for more advanced strategies (e.g. based on active learning or re-weighting already seen data).
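The schedule described above can be sketched as follows. This is a minimal illustration, not the code used to generate the released splits: the function name `sequential_steps`, the page identifiers, and the random seed are all hypothetical, and leftover pages (when the page count is not divisible by the number of steps) are simply ignored.

```python
import random


def sequential_steps(pages, n_steps=10, seed=0):
    """Return the training pages of each step of a sequential schedule.

    Hypothetical helper: each step takes the next 1/n_steps of the pages
    in manuscript order, plus an equally sized random sample of pages
    already seen in earlier steps.
    """
    rng = random.Random(seed)
    chunk = len(pages) // n_steps  # pages added per step (remainder ignored)
    steps = []
    for i in range(n_steps):
        new = pages[i * chunk:(i + 1) * chunk]  # next pages, in page order
        seen = pages[:i * chunk]                # pages from earlier steps
        replay = rng.sample(seen, min(chunk, len(seen)))  # replayed pages
        steps.append(new + replay)
    return steps


# Made-up page identifiers standing in for the images of one manuscript.
pages = [f"page_{i:03d}" for i in range(100)]
steps = sequential_steps(pages)
print([len(s) for s in steps])
```

With 100 pages, the first step holds 10 pages and every later step holds 20, which matches the 1/5 proportion stated above.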

The 5-fold splits can be adapted as well, especially the train-validation separation. We recommend keeping the test splits constant. Moreover, if 5-fold cross-validation is too expensive, we recommend using fold number 0 to be comparable with our preliminary works on this dataset (TODO: add reference here).

Annotations

We provide two versions of the annotations:

diplomatic: this reflects the graphical signs that can be seen on the page of the manuscript, with the exception of textual abbreviations that are provided already expanded.
editorial: this contains the modifications, corrections, and interpretations made by musicologists and philologists in the Laudare project.

Pre-defined splits

Each of these annotations comes with pre-defined splits:

gt.json: this file contains all the annotations from all the images.
train.json and validation.json splits: these splits are used to train a model on one manuscript and then test it, or fine-tune it, on another.
processed_splits: this directory contains subsets of the previous annotations for different tasks (namely: OCR, OMR, OCMR, and layout recognition):
- <task>_fold_0x.json: these are the annotations for one out of five folds and for a given task.
- train_test_x: these directories contain the same annotations given in <task>_fold_0x, but they have been separated into train, test, and validation splits for convenience.
- random_sample: this directory contains annotations to simulate a sequential learning scenario, where annotations are added incrementally (50 annotations are added at each step). The directory random_sample/sequential_test must be used to test the model performance at each step, while the directories seq_xx contain the annotations to be used at each step for training and validating. These are built by adding 50 new annotations sequentially and randomly picking another 50 annotations from the previous steps.
- pagexml_all_xxxx: these directories contain PageXML versions of the annotations. When using the accompanying framework, the function benchmarking.utils.path_json2pagexml() can be used to get the paths of the PageXML included in a certain json file. In these files, music is encoded as normal text.

COCO-like file format

We provide the annotations in a COCO-like file format, with this simple modification:

the attribute description inside each annotation gives information about the musical object or text line.
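As a minimal sketch of how the `description` attribute can be read, the snippet below groups descriptions by image. It assumes the files otherwise follow the usual COCO layout (top-level `images`, `annotations`, and `categories` lists, with `image_id` linking annotations to images); the helper name `descriptions_by_image`, the file name, the category names, and the sample values are all made up for illustration.

```python
import json
from collections import defaultdict


def descriptions_by_image(coco):
    """Group the `description` of every annotation by image file name."""
    id_to_name = {img["id"]: img["file_name"] for img in coco["images"]}
    grouped = defaultdict(list)
    for ann in coco["annotations"]:
        # `description` is the dataset-specific attribute; default to ""
        # in case an annotation does not carry it.
        grouped[id_to_name[ann["image_id"]]].append(ann.get("description", ""))
    return dict(grouped)


# Tiny hand-made stand-in for a real annotation file (values are made up).
example = json.loads("""
{
  "images": [{"id": 1, "file_name": "page_001.jpg"}],
  "annotations": [
    {"id": 10, "image_id": 1, "category_id": 1, "bbox": [0, 0, 10, 10],
     "description": "(C4 D4 E4)"},
    {"id": 11, "image_id": 1, "category_id": 2, "bbox": [0, 20, 10, 10],
     "description": "hypothetical text line"}
  ],
  "categories": [{"id": 1, "name": "category_a"},
                 {"id": 2, "name": "category_b"}]
}
""")
print(descriptions_by_image(example))
```

For a real file, replace the inline string with `json.load` on an opened annotation file such as gt.json.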

Music in the `description` field

An enriched scientific pitch representation is used to represent the music content of neumes, clefs, and music staffs. The custom scientific pitch representation is as follows:

K indicates a clef, followed by F or C to indicate the shape of the clef and by a number to indicate the line it is placed on (1 is the bottom line), e.g. KC3.
A, B, C, D, E, F, G indicate the pitch of a note, followed by a possible alteration and by the octave number (4 is the default, so C is equivalent to C4)
alterations are indicated via b, bb, #, ##
(...) indicate neumes, i.e. groups of notes sung on the same syllable, e.g. (C4 D4 E4), (Bb3)
notes without (...) should be considered custos
In the music lines (or staffs), slash and dashed bars indicate the end of verses.
In the diplomatic version, pitches are encoded as one would read them, trusting all the clefs notated in the manuscripts. When clefs are lacking at the beginning of the staffs, the C clef on the 2nd line (KC2) is used. In the editorial version, instead, the proper clef is always inserted and wrongly notated clefs are corrected upon musicological inspection.
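The rules above can be turned into a small parser, sketched below. It is not part of the dataset or of the accompanying framework: the names `Note`, `parse_note`, and `parse_music` are hypothetical, and the sketch assumes that clefs, neumes, custos, and delimiters are separated by whitespace and that delimiters appear as `/` or `//`. Any other token is passed through unchanged as `"other"`.

```python
import re
from dataclasses import dataclass

NOTE_RE = re.compile(r"^([A-G])(bb|b|##|#)?(\d)?$")
CLEF_RE = re.compile(r"^K([FC])(\d)$")


@dataclass
class Note:
    step: str        # A-G
    alteration: str  # "", "b", "bb", "#", or "##"
    octave: int      # defaults to 4 when omitted


def parse_note(token):
    """Parse a single pitch such as 'Bb3' or 'C' (octave 4 by default)."""
    m = NOTE_RE.match(token)
    if m is None:
        raise ValueError(f"not a note: {token!r}")
    step, alteration, octave = m.groups()
    return Note(step, alteration or "", int(octave) if octave else 4)


def parse_music(description):
    """Turn a music description into a list of (kind, value) events."""
    events = []
    # Either a parenthesised group (a neume) or a bare token.
    for group, token in re.findall(r"\(([^)]*)\)|(\S+)", description):
        if not token:  # matched "(...)": notes sung on the same syllable
            events.append(("neume", [parse_note(t) for t in group.split()]))
        elif CLEF_RE.match(token):
            shape, line = CLEF_RE.match(token).groups()
            events.append(("clef", (shape, int(line))))
        elif token in ("/", "//"):
            events.append(("delimiter", token))
        elif NOTE_RE.match(token):  # a note outside parentheses
            events.append(("custos", parse_note(token)))
        else:
            events.append(("other", token))
    return events


for event in parse_music("KC3 (C4 D4 E4) (Bb3) C /"):
    print(event)
```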

Alignment

In a typical Gregorian source, neumes correspond to syllables. These manuscripts, instead, do not offer a clear alignment at the syllable level. In fact, these manuscripts were rarely read, which explains the large number of errors in the notation and the lack of clear syllabic alignment. At most, they were used as a guideline for the practice of singing laudas, which was a primarily oral genre.

In the Laudare Project, we have opted not to annotate the syllabic alignment nor to suggest it in any way. We attempted, as much as possible, to annotate neumes as they are written based on orthographical analysis. However, the graphical alignment between music and text guided the musicological transcription in various cases where the orthographical notation was not enough to discern the neume separation.

In general, though, the alignment between text and music is always correct at the verse level. Indeed, both the text and the music transcriptions contain verse and strophe separation symbols (/ and // respectively). In the diplomatic version of the music lines, many MusicDelimiter objects are annotated, as they can be seen in the manuscript. In the editorial version, instead, only those representing the end of verses and strophes are annotated as MusicDelimiter.
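Verse-level alignment can be recovered by splitting both transcriptions on these symbols, as in the minimal sketch below. The helper names `split_verses` and `align_verses` are hypothetical, the sample strings are made up, and the sketch assumes that the text and music of a piece have already been concatenated into one string each.

```python
def split_verses(transcription):
    """Split a transcription into strophes, each a list of verses."""
    strophes = []
    for strophe in transcription.split("//"):      # strophe separator
        verses = [v.strip() for v in strophe.split("/") if v.strip()]
        if verses:
            strophes.append(verses)
    return strophes


def align_verses(text, music):
    """Pair each text verse with the music of the same verse."""
    text_s, music_s = split_verses(text), split_verses(music)
    if [len(s) for s in text_s] != [len(s) for s in music_s]:
        raise ValueError("text and music differ in verse/strophe structure")
    return [list(zip(t, m)) for t, m in zip(text_s, music_s)]


# Made-up example: two strophes, the first with two verses.
text = "a b / c // d"
music = "(C4) (D4) / (E4) // (F4)"
for strophe in align_verses(text, music):
    print(strophe)
```

The check on the number of verses per strophe makes a structural mismatch fail loudly instead of silently pairing the wrong verses.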

Credits

Laudare Project (

Main coordinator of this dataset: Federico Simonetta (https://federicosimonetta.eu.org)

---

Changelog:

Added sequential_sample
Added filtered train/val/test for cross-manuscript tests
Fix typos and update dataset task descriptions
Regenerated splits
Updated and fixed data and annotations

Files

LaudareProject/LaudareDataset-1.0-1.zip

Files (3.0 GB)

Name	Size
LaudareProject/LaudareDataset-1.0-1.zip (md5:2bfa71c2a8d9d0b209ebbb7918266a78)	3.0 GB

Additional details

Is supplement to: Software: https://github.com/LaudareProject/LaudareDataset/tree/1.0-1 (URL)

European Commission
LAUDARE - The Italian Lauda: Disseminating Poetry and Concepts Through Melody (12th-16th centuries) 101054750

Repository URL: https://github.com/LaudareProject/LaudareDataset
