LaudareProject/LaudareDataset: Scientific Data
Description
Laudare Dataset
This dataset is one of the outcomes of the Laudare ERC AdG project. It contains two medieval manuscripts with music and text, ready to be used for Historical Document Understanding (HDU) tasks, including:
- Optical Character Recognition (OCR)
- Optical Music Recognition (OMR)
- Optical Character and Music Recognition (OCMR)
- Layout Recognition
- Text to music alignment (at the verse level)
- Music and text symbolic analysis
- ...
An in-depth description can be found in the following article.
...
You can use this dataset to compare with our baselines using the accompanying framework. This approach has the benefit of computing proper evaluation measures and a standardized experimental protocol.
You can also use the provided annotations as is, in which case we recommend reading the recommendations below.
Recommendations
This dataset contains three different standardized splits, namely 5-fold cross-validation, sequential learning, and cross-manuscript tests. The sequential learning splits are generated by splitting each manuscript across 10 steps, where 1/10 of the dataset is added at each step, following the order of the pages. To prevent catastrophic forgetting, at each step we also randomly add 1/10 of the dataset from the already seen images. Thus, each step except the first contains 1/5 of the dataset in the train set. The sequential learning test split can also be used for more advanced strategies (e.g. based on active learning or re-weighting already seen data).
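The sequential procedure described above can be sketched as follows. This is only an illustration of the idea, not the script used to generate the released splits: the function name `sequential_steps`, the page identifiers, and the fixed random seed are all hypothetical.

```python
import random


def sequential_steps(pages, n_steps=10, seed=0):
    """Build illustrative sequential-learning training sets.

    `pages` must already be sorted in manuscript order. Each step receives
    the next 1/n_steps of the pages plus an equally sized random sample of
    the pages seen in earlier steps.
    """
    rng = random.Random(seed)
    chunk = len(pages) // n_steps  # leftover pages are ignored in this sketch
    steps = []
    for i in range(n_steps):
        new = pages[i * chunk:(i + 1) * chunk]
        seen = pages[:i * chunk]
        # The first step has nothing to replay, so it is half the size.
        replay = rng.sample(seen, min(chunk, len(seen)))
        steps.append(new + replay)
    return steps


# Hypothetical page identifiers, in page order.
pages = [f"page_{i:03d}" for i in range(100)]
steps = sequential_steps(pages)
print([len(s) for s in steps])
```

With 100 pages, the first step holds 10 pages and every later step holds 20, which matches the 1/5 proportion mentioned above.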
The 5-fold splits can be adapted as well, especially the train-validation separation. We recommend keeping the test splits constant. Moreover, if 5-fold cross-validation is too expensive, we recommend using fold number 0 to be comparable with our preliminary works on this dataset (TODO: add reference here).
Annotations
We provide two versions of the annotations:
- `diplomatic`: this reflects the graphical signs that can be seen on the page of the manuscript, with the exception of textual abbreviations that are provided already expanded.
- `editorial`: this contains the modifications, corrections, and interpretations made by musicologists and philologists in the Laudare project.
Pre-defined splits
Each of these annotation versions comes with pre-defined splits:
- `gt.json`: this file contains all the annotations from all the images.
- `train.json` and `validation.json` splits: these splits are used to train a model on a manuscript and test it or fine-tune on another.
- `processed_splits`: this directory contains subsets of the previous annotations for different tasks (namely: OCR, OMR, OCMR, and layout recognition):
  - `<task>_fold_0x.json`: these are the annotations for one out of five folds and for a given task.
  - `train_test_x`: these directories contain the same annotations given in `<task>_fold_0x`, but they have been separated into train, test, and validation splits for convenience.
  - `random_sample`: this directory contains annotations to simulate a sequential learning scenario, where annotations are added incrementally (50 annotations are added at each step). The directory `random_sample/sequential_test` must be used to test the model performance at each step, while the directories `seq_xx` contain the annotations to be used at each step for training and validating. These are built by adding 50 new annotations sequentially and randomly picking another 50 annotations from the previous steps.
- `pagexml_all_xxxx`: these directories contain PageXML versions of the annotations. When using the accompanying framework, the function `benchmarking.utils.path_json2pagexml()` can be used to get the paths of the PageXML files included in a certain json file. In these files, music is encoded as normal text.
COCO-like file format
We provide the annotations in a COCO-like file format, with this simple modification: the attribute `description` inside each annotation gives information about the musical object or text line.
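As a minimal sketch of how the `description` attribute can be read, the snippet below groups descriptions by category name. Only `description` is documented here; the other keys (`images`, `categories`, `annotations`, `category_id`, `bbox`) are assumed to follow the usual COCO layout, and the sample record is invented for illustration rather than taken from the dataset.

```python
import json
from collections import defaultdict

# Tiny in-memory stand-in for a file such as gt.json.
# Key names other than "description" are an assumption (standard COCO layout).
sample = json.loads("""
{
  "images": [{"id": 1, "file_name": "page_001.jpg"}],
  "categories": [{"id": 1, "name": "neume"}, {"id": 6, "name": "line"}],
  "annotations": [
    {"id": 10, "image_id": 1, "category_id": 1,
     "bbox": [120, 80, 30, 25], "description": "(C4 D4 E4)"},
    {"id": 11, "image_id": 1, "category_id": 6,
     "bbox": [100, 120, 600, 40], "description": "hypothetical text line /"}
  ]
}
""")


def descriptions_by_category(coco):
    """Map each category name to the descriptions of its annotations."""
    names = {c["id"]: c["name"] for c in coco["categories"]}
    grouped = defaultdict(list)
    for ann in coco["annotations"]:
        grouped[names[ann["category_id"]]].append(ann.get("description", ""))
    return dict(grouped)


by_cat = descriptions_by_category(sample)
print(by_cat["neume"])
```

To use it on a real file, replace `sample` with `json.load(open(path))`, where `path` points to one of the provided json files.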
Music in the description field
An enriched scientific pitch representation is used to represent the music content of neumes, clefs, and music staffs. The custom scientific pitch representation is as follows:
- `K` indicates a clef, followed by `F` or `C` to indicate the shape of the clef and by a number to indicate the line it is placed on (1 is the bottom line), e.g. `KC3`.
- `A`, `B`, `C`, `D`, `E`, `F`, `G` indicate the pitch of a note, followed by a possible alteration and by the octave number (4 is the default, so `C` is equivalent to `C4`)
- alterations are indicated via `b`, `bb`, `#`, `##`
- `(...)` indicate neumes, i.e. groups of notes sung on the same syllable, e.g. `(C4 D4 E4)`, `(Bb3)`
- notes without `(...)` should be considered custos
- In the music lines (or staffs), slash and dashed bars indicate end of verses.
- In the `diplomatic` version, pitches are encoded as one would read them, trusting all the clefs notated in the manuscripts. When clefs are lacking at the beginning of the staffs, the C clef on the 2nd line (`KC2`) is used. In the `editorial` version, instead, the proper clef is always inserted and wrongly notated clefs are corrected upon musicological inspection.
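The rules above can be turned into a small parser. The following is a sketch under stated assumptions, not the parser of the accompanying framework: it assumes that tokens are separated by whitespace and that verse and strophe ends appear as standalone `/` and `//` tokens. The function names and the example string are hypothetical.

```python
import re

NOTE_RE = re.compile(r"([A-G])(bb|b|##|#)?(\d+)?")
CLEF_RE = re.compile(r"K([FC])(\d)")
# A token is either a parenthesised group or a run of non-space characters.
TOKEN_RE = re.compile(r"\([^)]*\)|\S+")


def parse_note(text):
    """Parse one pitch such as 'Bb3'; the octave defaults to 4."""
    m = NOTE_RE.fullmatch(text)
    if m is None:
        raise ValueError(f"not a note: {text!r}")
    step, alteration, octave = m.groups()
    return {"step": step,
            "alteration": alteration or "",
            "octave": int(octave) if octave else 4}


def parse_description(description):
    """Turn a description string into a list of (kind, value) events."""
    events = []
    for token in TOKEN_RE.findall(description):
        clef = CLEF_RE.fullmatch(token)
        if clef:
            events.append(("clef", {"shape": clef.group(1),
                                    "line": int(clef.group(2))}))
        elif token.startswith("(") and token.endswith(")"):
            notes = [parse_note(n) for n in token[1:-1].split()]
            events.append(("neume", notes))
        elif token in ("/", "//"):
            # Assumption: delimiters are written as standalone tokens.
            events.append(("delimiter", token))
        else:
            # A note outside parentheses is a custos.
            events.append(("custos", parse_note(token)))
    return events


events = parse_description("KC3 (C4 D4 E) / (Bb3) F#4")
for kind, value in events:
    print(kind, value)
```

Unknown tokens raise a `ValueError` instead of being skipped, so unexpected symbols in a description are noticed early.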
Categories
Each annotation has its own category. The categories are listed in each json file, but we report them here for convenience:
- `id:1` `neume`: a music neume; the `description` contains its notes
- `id:2` `clef`: a music clef; the `description` contains its pitch
- `id:3` `custos`: a custos, indicating the first pitch of the next staff; the `description` contains its pitch
- `id:4` `text`: a region of text lines
- `id:5` `staff`: a music line; the `description` contains all the pitches, clefs, verse/strophe lines and custos
- `id:6` `line`: a text line; the `description` contains the text, with `/` and `//` indicating end of verses and strophes
- `id:7` `musicText`: a region of text and music lines
- `id:8` `discard`: objects that have been removed in the editorial edition (but they are indeed drawn on the page image)
- `id:9` `musicDelimiter`: a delimiter in the music line (similar to a bar line of modern notation, often indicates the end of verse, but most of the time should be discarded)
In the OMR tasks, you can exploit the bounding boxes of each neume for the transcription. Note that the neume coordinates are not copied into the provided PageXML.
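A possible way to collect the neume bounding boxes is sketched below. It assumes that `bbox` follows the standard COCO convention `[x, y, width, height]` and that the usual COCO keys are present, which should be checked against the json files. The sample record and the function name `neume_crops` are hypothetical.

```python
NEUME_CATEGORY_ID = 1  # "neume" in the category list above


def neume_crops(coco):
    """Yield (image_id, (left, top, right, bottom), description) per neume."""
    for ann in coco["annotations"]:
        if ann["category_id"] != NEUME_CATEGORY_ID:
            continue
        x, y, w, h = ann["bbox"]  # assumed COCO convention: x, y, width, height
        yield ann["image_id"], (x, y, x + w, y + h), ann["description"]


# Invented sample with one neume and one text line.
sample = {
    "annotations": [
        {"image_id": 1, "category_id": 1,
         "bbox": [120, 80, 30, 25], "description": "(C4 D4 E4)"},
        {"image_id": 1, "category_id": 6,
         "bbox": [100, 120, 600, 40], "description": "hypothetical text line /"},
    ]
}

crops = list(neume_crops(sample))
print(crops)
```

The `(left, top, right, bottom)` tuple is the box format expected by common imaging libraries when cropping, so each neume image can be paired with its `description` as a transcription target.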
Alignment
In a typical Gregorian source, neumes correspond to syllables. These manuscripts, instead, do not offer a clear alignment at the syllable level. In fact, these manuscripts were rarely read, which explains the large number of errors in the notation and the lack of clear syllabic alignment. At most, they were used as a guideline for the practice of singing laudas, which was a primarily oral genre.
In the Laudare Project, we have opted not to annotate the syllabic alignment nor to suggest it in any way. We attempted, as much as possible, to annotate neumes as they are written based on orthographical analysis. However, the graphical alignment between music and text guided the musicological transcription in various cases where the orthographical notation was not enough to discern the neume separation.
In general, though, the alignment between text and music is always correct at the verse level. Indeed, both the text and the music transcriptions contain verse and strophe separation symbols (`/` and `//` respectively). In the diplomatic version of music lines, many `musicDelimiter` objects are annotated, as they can be seen in the manuscript. In the editorial version, instead, only those representing ends of verses and strophes are annotated as `musicDelimiter`.
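Verse-level alignment can be recovered by splitting both transcriptions at the separation symbols. The sketch below assumes editorial-style strings in which text and music carry the same sequence of `/` and `//` symbols. The function names and the example strings are hypothetical placeholders, not data from the manuscripts.

```python
import re


def split_verses(description):
    """Return (content, delimiter) pairs; delimiter is '/', '//' or ''."""
    # With a capturing group, re.split alternates content and delimiter.
    parts = re.split(r"(//|/)", description)
    contents = parts[0::2]
    delimiters = parts[1::2] + [""]  # the last segment has no delimiter
    return [(c.strip(), d) for c, d in zip(contents, delimiters)
            if c.strip() or d]


def align_verses(text, music):
    """Pair text verses with music verses, checking that delimiters agree."""
    text_verses = split_verses(text)
    music_verses = split_verses(music)
    if [d for _, d in text_verses] != [d for _, d in music_verses]:
        raise ValueError("verse/strophe delimiters do not match")
    return [(t, m) for (t, _), (m, _) in zip(text_verses, music_verses)]


pairs = align_verses(
    "first verse / second verse //",
    "(C4 D4) (E4) / (F4) (G4 A4) //",
)
for text_verse, music_verse in pairs:
    print(f"{text_verse!r} <-> {music_verse!r}")
```

The delimiter check is deliberately strict: a mismatch usually means that the two strings do not cover the same verses, for instance when a diplomatic music line contains extra delimiters.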
Credits
- Laudare Project
- Main coordinator of this dataset: Federico Simonetta (https://federicosimonetta.eu.org)
---
Changelog:
- Added sequential_sample
- Added filtered train/val/test for cross-manuscript tests
- Fix typos and update dataset task descriptions
- regenerated splits
- updated and fixed data and annotations
Files
| Name | Size | MD5 |
|---|---|---|
| LaudareProject/LaudareDataset-1.0-1.zip | 3.0 GB | 2bfa71c2a8d9d0b209ebbb7918266a78 |
Additional details
Related works
- Is supplement to: Software, https://github.com/LaudareProject/LaudareDataset/tree/1.0-1
Software
- Repository URL: https://github.com/LaudareProject/LaudareDataset