Published February 21, 2026 | Version 1.0.0
Dataset Open

BMdataset: A Musicologically Curated LilyPond Dataset

  • 1. University of Padua
  • 2. Boston University

Contributors

Data curator:

Description

Baroque Music Dataset (bmdataset)

BMdataset is a musicologically curated collection of baroque and early classical music scores in LilyPond format. The dataset originates from baroquemusic.it and comprises transcriptions made by musicologists directly from original manuscript sources. Each transcription is annotated with a reference to the original manuscript and its catalogue number. All pieces have complete musical metadata.

The collection covers 71 unique composers. Works span from the Early Baroque (before 1650) to the Transitional Classical period (after 1750), with a focus on the Late Baroque (1700-1750). It includes 16 musical forms (concertos, sinfonias, suites, arias, cantatas, sonatas, and more) and 25 distinct MIDI instruments.

Dataset Structure


bmdataset_ly/
├── raw/                   # Original score files grouped by piece
│   ├── {piece_name}/
│   │   ├── ly/            # LilyPond source files
│   │   ├── midi/          # MIDI files
│   │   └── pdf/           # PDF scores and parts
│   └── ...
├── preprocessed/          # Merged LilyPond files in Nederlands notation
├── metadata.json          # Musical metadata per piece (391 entries)
├── manifest.json          # File inventory and dataset statistics
├── taxonomy.json          # Label taxonomy and hierarchy
└── README.md

Raw Files

Each piece is stored in its own folder under raw/. A piece folder may contain up to three subfolders:

  • ly/: LilyPond (.ly) source files, including the full score, individual instrument parts, movement files, and compilation helpers (format, header, variabili files). These are the original engraving sources from the multi-file LilyPond workspace.
  • midi/: MIDI renderings of the score. May include a main score file and per-movement variants (e.g., _score.midi, _score-1.midi, _score-2.midi).
  • pdf/: PDF renderings including the full score and individual instrument parts (e.g., _score.pdf, _violino1.pdf, _flauto.pdf).
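
The folder layout above can be walked with a few lines of Python. This is a minimal sketch, not part of the dataset tooling: the function name `inventory_piece` is hypothetical, and the only assumptions are the three subfolder names and file extensions listed above.

```python
from pathlib import Path

# Subfolder name -> file extension, as described in the Raw Files section.
SUBFOLDERS = {"ly": ".ly", "midi": ".midi", "pdf": ".pdf"}


def inventory_piece(piece_dir):
    """Return {subfolder: sorted file names} for one piece folder.

    A piece may contain fewer than three subfolders, so a missing
    subfolder maps to an empty list instead of raising an error.
    """
    piece_dir = Path(piece_dir)
    inventory = {}
    for sub, ext in SUBFOLDERS.items():
        folder = piece_dir / sub
        if folder.is_dir():
            inventory[sub] = sorted(p.name for p in folder.glob(f"*{ext}"))
        else:
            inventory[sub] = []
    return inventory
```

For example, `inventory_piece("bmdataset_ly/raw/abel_concerto_flauto_do_grof_618")` would list the LilyPond, MIDI, and PDF files of that piece, provided the archive has been extracted to `bmdataset_ly/`.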

Preprocessed Files

The preprocessed/ folder contains 2,645 LilyPond files derived from the raw sources. These files have been:

1. Merged: Multi-file LilyPond projects have been combined into single self-contained files per instrument part or score. The parsing pipeline extracts the dependency order from headers, concatenates movement files sequentially, and appends the score file.
2. Translated: Pitch notation has been converted from Italian (do, re, mi, fa, sol, la, si) to Nederlands (c, d, e, f, g, a, b), which is LilyPond's default pitch language.
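
To illustrate the translation step, the sketch below converts a single Italian pitch token to its nederlands spelling. It is not the dataset's actual pipeline, which is not shown here; the function name `translate_pitch` is hypothetical, and the sketch assumes LilyPond's standard Italian accidental suffixes (`d` for sharp, `b` for flat) and ignores quarter-tone names.

```python
import re

# Italian note names -> nederlands note names.
NOTE_NAMES = {"do": "c", "re": "d", "mi": "e", "fa": "f",
              "sol": "g", "la": "a", "si": "b"}
# Italian accidental suffixes -> nederlands suffixes (d = diesis, b = bemolle).
ACCIDENTALS = {"": "", "d": "is", "dd": "isis", "b": "es", "bb": "eses"}

# A pitch token: note name, optional accidental, then an optional remainder
# (octave marks, duration, ...) that must not start with a letter.
PITCH_TOKEN = re.compile(r"(do|re|mi|fa|sol|la|si)(dd|bb|d|b)?([^a-zA-Z].*)?")


def translate_pitch(token):
    """Translate one Italian pitch token; return any other token unchanged."""
    match = PITCH_TOKEN.fullmatch(token)
    if match is None:
        return token
    name, accidental, rest = match.groups()
    # Produces the uncontracted spellings "ees" and "aes", which LilyPond
    # accepts alongside the contracted "es" and "as".
    return NOTE_NAMES[name] + ACCIDENTALS[accidental or ""] + (rest or "")
```

With this sketch, `translate_pitch("dod'4")` gives `"cis'4"` and `translate_pitch("sib")` gives `"bes"`, while commands such as `\relative` pass through untouched. A full converter would also need to tokenize the file and skip strings, comments, and lyrics, which this example does not attempt.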

File naming follows the pattern:
NO_PUB_codifica_{N}__NO_PUB_codifica__{piece_name}_{part}.ly
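
A preprocessed file name can be split back into its components with a regular expression. The sketch below is illustrative only: it assumes that {N} is an integer and that {part} is the last underscore-delimited token, since piece names themselves contain underscores. The function name `parse_preprocessed_name` is hypothetical.

```python
import re

# Pattern from the naming scheme above. Assumptions: {N} is an integer and
# {part} contains no underscore, so it is the final token before ".ly".
FILENAME = re.compile(
    r"NO_PUB_codifica_(?P<n>\d+)__NO_PUB_codifica__"
    r"(?P<piece>.+)_(?P<part>[^_]+)\.ly"
)


def parse_preprocessed_name(filename):
    """Return the components of a preprocessed file name, or None if it does not match."""
    match = FILENAME.fullmatch(filename)
    if match is None:
        return None
    return {
        "n": int(match["n"]),
        "piece_name": match["piece"],
        "part": match["part"],
    }
```

If part names can contain underscores, the split between piece name and part would need to be resolved against the keys of metadata.json instead.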

Metadata

metadata.json contains musical metadata for all 391 pieces. Each entry is keyed by the piece folder name and includes:

| Field | Description |
| --- | --- |
| composer | Composer name |
| musical_form | List of musical forms (e.g., "concerto", "sonata", ...) |
| midi_instruments | List of instruments in the score |
| period | Historical period |
| movements | Per-movement metadata including key, scale, tempo, and time signature |
| paths.raw | Path to the piece's raw folder |
| paths.preprocessed | List of paths to preprocessed files for this piece |

Movement metadata

Keys are in Italian solfege notation: do (C), re (D), mi (E), fa (F), sol (G), la (A), si (B).
Tempo is given as a note value and BPM (e.g., `"2 = 75"` means half note = 75 BPM).
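
The two conventions above can be decoded as follows. This is a minimal sketch with hypothetical helper names (`parse_tempo`, `describe_key`); support for dotted note values such as "4." is an assumption based on LilyPond's duration syntax, not something the metadata description confirms.

```python
import re
from fractions import Fraction

# Italian solfege key names -> letter names, as listed above.
SOLFEGE_TO_LETTER = {"do": "C", "re": "D", "mi": "E", "fa": "F",
                     "sol": "G", "la": "A", "si": "B"}

# "<note value><optional dots> = <BPM>", e.g. "2 = 75".
TEMPO = re.compile(r"\s*(\d+)(\.*)\s*=\s*(\d+)\s*")


def parse_tempo(tempo):
    """Parse a tempo string and also express it in quarter notes per minute."""
    match = TEMPO.fullmatch(tempo)
    if match is None:
        raise ValueError(f"unrecognized tempo: {tempo!r}")
    denominator, dots, bpm = int(match[1]), len(match[2]), int(match[3])
    # Beat length in quarter notes; each dot adds half of the previous value.
    beat = Fraction(4, denominator) * (2 - Fraction(1, 2 ** dots))
    return {
        "note_value": match[1] + match[2],
        "bpm": bpm,
        "quarter_bpm": float(bpm * beat),
    }


def describe_key(key, scale):
    """Render a key such as ("do", "major") as "C major"."""
    return f"{SOLFEGE_TO_LETTER[key]} {scale}"
```

For the tempo `"2 = 75"`, the half note lasts two quarter notes, so the equivalent quarter-note tempo is 150 BPM.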

Example entry:
{
  "abel_concerto_flauto_do_grof_618": {
    "composer": "Abel",
    "musical_form": ["concerto"],
    "midi_instruments": ["flute", "viola", "violin", "cello"],
    "period": "Transitional Classical",
    "movements": {
      "1: allegro_molto": {
        "key": "do",
        "scale": "major",
        "tempo": "2 = 75",
        "time": "4/4"
      }
    },
    "paths": {
      "raw": "raw/abel_concerto_flauto_do_grof_618/",
      "preprocessed": [
        "preprocessed/NO_PUB_codifica_1__NO_PUB_codifica__abel_concerto_flauto_Do_Grof_618_score.ly"
      ]
    }
  }
}
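
Given entries of this shape, pieces can be selected by composer, form, or period. The sketch below assumes only the fields shown in the example entry; the function names and the default path `bmdataset_ly/metadata.json` are illustrative.

```python
import json


def load_metadata(path="bmdataset_ly/metadata.json"):
    """Load metadata.json into a dict keyed by piece folder name."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def select_pieces(metadata, composer=None, form=None, period=None):
    """Return the sorted names of pieces matching every filter that is given."""
    selected = []
    for name, entry in metadata.items():
        if composer is not None and entry["composer"] != composer:
            continue
        # musical_form is a list, so test membership rather than equality.
        if form is not None and form not in entry["musical_form"]:
            continue
        if period is not None and entry["period"] != period:
            continue
        selected.append(name)
    return sorted(selected)
```

For instance, `select_pieces(load_metadata(), composer="Abel", form="concerto")` would include `abel_concerto_flauto_do_grof_618`.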

Manifest

`manifest.json` provides a complete inventory of all 391 pieces with name, folder path, source information, file counts, and metadata flags.

Taxonomy

`taxonomy.json` defines the label hierarchy used for metadata annotation. It contains the controlled vocabularies for all categorical fields: 25 instruments, 4 historical periods, 17 musical forms, 71 composers, 7 keys (Italian solfege), and a section nomenclature tree that classifies movement titles by speed (slow/mid/fast/very fast), intention, and suite dance types.

File Formats

  • .ly (LilyPond): Plain-text music engraving format. Can be compiled to PDF/MIDI using LilyPond.
  • .midi: Standard MIDI files for playback and computational analysis.
  • .pdf: Rendered sheet music scores and individual instrument parts.
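
Compiling a .ly file can be scripted by calling the `lilypond` command-line program. The sketch below assumes LilyPond is installed and on the PATH; the helper names are hypothetical. MIDI output is only produced when the score contains a `\midi` block.

```python
import subprocess


def lilypond_command(ly_path, out_dir):
    """Build the command line that compiles ly_path into out_dir."""
    # -o directs output to out_dir when a folder with that name exists.
    return ["lilypond", "-o", str(out_dir), str(ly_path)]


def compile_score(ly_path, out_dir):
    """Run LilyPond; raises CalledProcessError if compilation fails."""
    return subprocess.run(lilypond_command(ly_path, out_dir), check=True)
```

Building the command separately from running it makes it possible to log or inspect the invocation before any compilation starts.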

Statistics

| Metric | Count |
| --- | --- |
| Total pieces | 391 |
| Metadata entries | 391 |
| Raw .ly files | 6038 |
| Raw MIDI files | 1845 |
| Raw PDF files | 2982 |
| Preprocessed files | 2645 |
| Unique composers | 71 |

Files

bmdataset.zip (826.3 MB)
md5:8a4f02c8795ceaacbe86e64152bbaa09
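
The MD5 checksum listed for bmdataset.zip can be used to verify a download. This is a minimal sketch using Python's standard `hashlib`; the function name `md5_of_file` and the local file path are assumptions.

```python
import hashlib

# Checksum published for bmdataset.zip on this record.
EXPECTED_MD5 = "8a4f02c8795ceaacbe86e64152bbaa09"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Chunked reads keep memory use low for the 826.3 MB archive.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

After downloading, `md5_of_file("bmdataset.zip") == EXPECTED_MD5` should hold if the archive is intact.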

Additional details

Dates

Submitted
2026-03-27

Software

Repository URL
https://github.com/CSCPadova/lilybert
Programming language
LilyPond, Python