BMdataset: A Musicologically Curated LilyPond Dataset
Description
Baroque Music Dataset (bmdataset)
A musicologically curated collection of baroque and early classical music scores in LilyPond format. The dataset originates from baroquemusic.it and comprises transcriptions made by musicologists directly from original manuscript sources. Each transcription is annotated with a reference to the original manuscript and its catalogue number. All pieces have complete musical metadata.
The collection covers 71 unique composers. Works span from the Early Baroque (<1650) to the Transitional Classical period (>1750), with a focus on the Late Baroque (1700-1750). It includes 16 musical forms (concertos, sinfonias, suites, arias, cantatas, sonatas, and more) and 25 distinct MIDI instruments.
Dataset Structure
bmdataset_ly/
├── raw/                  # Original score files grouped by piece
│   ├── {piece_name}/
│   │   ├── ly/           # LilyPond source files
│   │   ├── midi/         # MIDI files
│   │   └── pdf/          # PDF scores and parts
│   └── ...
├── preprocessed/         # Merged LilyPond files in Nederlands notation
├── metadata.json         # Musical metadata per piece (391 entries)
├── manifest.json         # File inventory and dataset statistics
├── taxonomy.json         # Label taxonomy and hierarchy
└── README.md
Raw Files
Each piece is stored in its own folder under raw/. A piece folder may contain up to three subfolders:
- ly/: LilyPond (.ly) source files, including the full score, individual instrument parts, movement files, and compilation helpers (format, header, variabili files). These are the original engraving sources from the multi-file LilyPond workspace.
- midi/: MIDI renderings of the score. May include a main score file and per-movement variants (e.g., _score.midi, _score-1.midi, _score-2.midi).
- pdf/: PDF renderings including the full score and individual instrument parts (e.g., _score.pdf, _violino1.pdf, _flauto.pdf).
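The layout described above can be explored with a short script. The following is a minimal sketch, assuming the archive has been extracted locally; the function name `inventory_piece` is hypothetical and not part of the dataset tooling.

```python
from collections import Counter
from pathlib import Path


def inventory_piece(piece_dir: Path) -> Counter:
    """Count the files in each of the ly/, midi/ and pdf/ subfolders of a piece."""
    counts = Counter()
    for sub in ("ly", "midi", "pdf"):
        folder = piece_dir / sub
        if not folder.is_dir():
            continue  # a piece may contain fewer than three subfolders
        # Count regular files only; nested folders are ignored in this sketch.
        counts[sub] = sum(1 for path in folder.iterdir() if path.is_file())
    return counts
```

Calling `inventory_piece(Path("bmdataset_ly/raw") / piece_name)` for each piece folder should give per-piece counts that can be compared against the totals in the Statistics table.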
Preprocessed Files
The preprocessed/ folder contains 2,645 LilyPond files derived from the raw sources. These files have been:
1. Merged: Multi-file LilyPond projects have been combined into single self-contained files per instrument part or score. The parsing pipeline extracts the dependency order from headers, concatenates movement files sequentially, and appends the score file.
2. Translated: Pitch notation has been converted from Italian (do, re, mi, fa, sol, la, si) to Nederlands (c, d, e, f, g, a, b), which is LilyPond's default pitch language.
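The translation step can be illustrated on single pitch names. This is only a sketch of the mapping, not the dataset's actual pipeline: it assumes LilyPond's `italiano` accidental suffixes (`d` for sharp, `b` for flat) and emits the regular `nederlands` spellings (`ees`, `aes`), which LilyPond accepts alongside the contracted `es` and `as`. Octave marks, durations, and full-file parsing are out of scope, and `translate_pitch` is a hypothetical name.

```python
import re

# Base note names: Italian solfege -> Dutch ("nederlands") letters.
BASE = {"do": "c", "re": "d", "mi": "e", "fa": "f", "sol": "g", "la": "a", "si": "b"}
# Accidental suffixes: Italian "d"/"b" -> Dutch "is"/"es" (doubled for double accidentals).
ACCIDENTAL = {"": "", "d": "is", "dd": "isis", "b": "es", "bb": "eses"}

PITCH_RE = re.compile(r"(do|re|mi|fa|sol|la|si)(dd|bb|d|b)?")


def translate_pitch(name: str) -> str:
    """Translate one bare Italian pitch name (no octave or duration) to Dutch."""
    match = PITCH_RE.fullmatch(name)
    if match is None:
        raise ValueError(f"not an Italian pitch name: {name!r}")
    base, accidental = match.group(1), match.group(2) or ""
    return BASE[base] + ACCIDENTAL[accidental]


print(translate_pitch("sol"), translate_pitch("sib"), translate_pitch("fad"))
```

A real converter would also have to leave strings, commands, and lyrics untouched, which is why a token-level function like this is only a starting point.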
File naming follows the pattern: NO_PUB_codifica_{N}__NO_PUB_codifica__{piece_name}_{part}.ly
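Because piece names themselves contain underscores, the pattern cannot be split into `{piece_name}` and `{part}` without knowing the piece name. The sketch below takes the piece name (for example, a key of metadata.json) as input. It assumes `{N}` is an integer and matches the piece name case-insensitively, since the example entry in this README shows different capitalisation in the folder name and the file name. `split_preprocessed_name` is a hypothetical helper.

```python
import re

NAME_RE = re.compile(r"NO_PUB_codifica_(\d+)__NO_PUB_codifica__(.+)\.ly")


def split_preprocessed_name(filename: str, piece_name: str) -> tuple[int, str]:
    """Return (N, part) for a preprocessed file name belonging to piece_name."""
    match = NAME_RE.fullmatch(filename)
    if match is None:
        raise ValueError(f"unexpected file name: {filename!r}")
    index, rest = int(match.group(1)), match.group(2)
    prefix = piece_name + "_"
    # Compare case-insensitively: folder and file names may differ in case.
    if not rest.lower().startswith(prefix.lower()):
        raise ValueError(f"{filename!r} does not belong to {piece_name!r}")
    return index, rest[len(prefix):]


print(split_preprocessed_name(
    "NO_PUB_codifica_1__NO_PUB_codifica__abel_concerto_flauto_Do_Grof_618_score.ly",
    "abel_concerto_flauto_do_grof_618",
))
```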
Metadata
metadata.json contains musical metadata for all 391 pieces. Each entry is keyed by the piece folder name and includes:
| Field | Description |
|---|---|
| composer | Composer name |
| musical_form | List of musical forms (e.g., "concerto", "sonata", ...) |
| midi_instruments | List of instruments in the score |
| period | Historical period |
| movements | Per-movement metadata including key, scale, tempo, and time signature |
| paths.raw | Path to the piece's raw folder |
| paths.preprocessed | List of paths to preprocessed files for this piece |
Movement metadata
Keys are in Italian solfege notation: do (C), re (D), mi (E), fa (F), sol (G), la (A), si (B).
Tempo is given as a note value and BPM (e.g., `"2 = 75"` means half note = 75 BPM).
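The key and tempo conventions above can be decoded as follows. This is a sketch under assumptions: the key table is taken directly from the list above, while support for dotted note values (e.g., `"4. = 60"`) is a guess about what the field may contain. The names `SOLFEGE_TO_LETTER` and `quarter_bpm` are hypothetical.

```python
import re

# Italian solfege key names and their letter equivalents, as listed above.
SOLFEGE_TO_LETTER = {"do": "C", "re": "D", "mi": "E", "fa": "F",
                     "sol": "G", "la": "A", "si": "B"}

TEMPO_RE = re.compile(r"\s*(\d+)(\.*)\s*=\s*(\d+)\s*")


def quarter_bpm(tempo: str) -> float:
    """Convert a tempo string such as "2 = 75" to quarter-note beats per minute."""
    match = TEMPO_RE.fullmatch(tempo)
    if match is None:
        raise ValueError(f"unrecognised tempo: {tempo!r}")
    denominator = int(match.group(1))  # 2 = half note, 4 = quarter note, ...
    dots = len(match.group(2))         # assumed: trailing dots mark dotted values
    bpm = int(match.group(3))
    # Length of the beat unit measured in quarter notes.
    unit_in_quarters = (4 / denominator) * (2 - 0.5 ** dots)
    return bpm * unit_in_quarters


print(SOLFEGE_TO_LETTER["do"], quarter_bpm("2 = 75"))  # C 150.0
```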
Example entry:
{
  "abel_concerto_flauto_do_grof_618": {
    "composer": "Abel",
    "musical_form": ["concerto"],
    "midi_instruments": ["flute", "viola", "violin", "cello"],
    "period": "Transitional Classical",
    "movements": {
      "1: allegro_molto": {
        "key": "do",
        "scale": "major",
        "tempo": "2 = 75",
        "time": "4/4"
      }
    },
    "paths": {
      "raw": "raw/abel_concerto_flauto_do_grof_618/",
      "preprocessed": [
        "preprocessed/NO_PUB_codifica_1__NO_PUB_codifica__abel_concerto_flauto_Do_Grof_618_score.ly"
      ]
    }
  }
}
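An entry of this shape can be queried with the standard `json` module. The sketch below embeds a trimmed copy of the example entry so that it runs on its own; with the real dataset, the dictionary would be loaded from metadata.json instead. `movements_in_key` is a hypothetical helper, not part of the dataset tooling.

```python
import json

# Trimmed copy of the example entry; the real data lives in metadata.json.
SAMPLE = """
{
  "abel_concerto_flauto_do_grof_618": {
    "composer": "Abel",
    "musical_form": ["concerto"],
    "period": "Transitional Classical",
    "movements": {
      "1: allegro_molto": {"key": "do", "scale": "major",
                           "tempo": "2 = 75", "time": "4/4"}
    }
  }
}
"""


def movements_in_key(metadata: dict, key: str, scale: str) -> list[tuple[str, str]]:
    """Return (piece, movement title) pairs whose key and scale both match."""
    hits = []
    for piece, entry in metadata.items():
        for title, movement in entry.get("movements", {}).items():
            if movement.get("key") == key and movement.get("scale") == scale:
                hits.append((piece, title))
    return hits


metadata = json.loads(SAMPLE)
print(movements_in_key(metadata, "do", "major"))
```

Note that the `key` argument uses the Italian solfege names stored in the metadata, so C major is queried as `"do"`, `"major"`.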
Manifest
`manifest.json` provides a complete inventory of all 391 pieces with name, folder path, source information, file counts, and metadata flags.
Taxonomy
`taxonomy.json` defines the label hierarchy used for metadata annotation. It contains the controlled vocabularies for all categorical fields: 25 instruments, 4 historical periods, 17 musical forms, 71 composers, 7 keys (Italian solfege), and a section nomenclature tree that classifies movement titles by speed (slow/mid/fast/very fast), intention, and suite dance types.
File Formats
- .ly (LilyPond): Plain-text music engraving format. Can be compiled to PDF/MIDI using LilyPond.
- .midi: Standard MIDI files for playback and computational analysis.
- .pdf: Rendered sheet music scores and individual instrument parts.
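Compiling a .ly file can be scripted by calling the LilyPond command-line tool. The sketch below assumes a `lilypond` executable is on the PATH and uses its `--output` option to choose the destination folder. The helper names are hypothetical, and MIDI output is only produced when the score contains a `\midi` block.

```python
import subprocess
from pathlib import Path


def lilypond_command(ly_file: Path, out_dir: Path) -> list[str]:
    """Build the command line that writes LilyPond's output into out_dir."""
    return ["lilypond", f"--output={out_dir}", str(ly_file)]


def compile_score(ly_file: Path, out_dir: Path) -> subprocess.CompletedProcess:
    """Run LilyPond on one file; raises CalledProcessError if compilation fails."""
    out_dir.mkdir(parents=True, exist_ok=True)
    return subprocess.run(
        lilypond_command(ly_file, out_dir),
        check=True,
        capture_output=True,
        text=True,
    )
```

The preprocessed files are described as self-contained, so they are the more natural input for this kind of batch compilation than the multi-file raw sources.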
Statistics
| Metric | Count |
|---|---|
| Total pieces | 391 |
| Metadata entries | 391 |
| Raw .ly files | 6038 |
| Raw MIDI files | 1845 |
| Raw PDF files | 2982 |
| Preprocessed files | 2645 |
| Unique composers | 71 |
Files
bmdataset.zip (826.3 MB)
md5:8a4f02c8795ceaacbe86e64152bbaa09
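After downloading, the archive can be checked against the MD5 checksum listed above. This is a minimal sketch using Python's standard `hashlib`; the local file path is an assumption, and `md5_of_file` is a hypothetical helper.

```python
import hashlib
from pathlib import Path

# Checksum published for bmdataset.zip in this record.
EXPECTED_MD5 = "8a4f02c8795ceaacbe86e64152bbaa09"


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

A download is intact if `md5_of_file(Path("bmdataset.zip")) == EXPECTED_MD5`. Reading in chunks keeps memory use low for an archive of this size.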
Additional details
Dates
- Submitted: 2026-03-27
Software
- Repository URL: https://github.com/CSCPadova/lilybert
- Programming language: LilyPond, Python