mshoxxDB - a Versioned Dataset for Electronic Music
Description
This dataset was presented as a Late-Breaking Demo at ISMIR 2024 in San Francisco, CA, including the paper (as an extended abstract), poster, and demo video. It was initially studied in the EURASIP Journal article at https://doi.org/10.1186/s13636-025-00398-2. The dataset is also listed in the ISMIR Resources.
mshoxxDB is an open-source dataset for research in Music Information Retrieval (MIR), with a focus on Electronic Music. It was created by Michael Taenzer in the Reason Studios digital audio workstation (DAW). The dataset provides comprehensively annotated music audio data for a genre that has received comparatively limited attention in MIR research. With its combination of diverse synthetic timbres, classical instruments, and multitrack material, it supports tasks such as instrument detection, multi-pitch estimation, source separation, beat detection, and tempo estimation. It is particularly well suited for evaluating instrument-agnostic methods and model generalization. The music covers several sub-genres of Electronic Music, including video game music, 8-bit (chiptune), EDM, pop, house, and chillout/dreamy styles. For more info, please refer to "README.txt" contained in the archive.
Contents
- 18 full-length pieces of music, 61 minutes of audio in total
- mixtures and multitrack stems in FLAC format (44.1 kHz, 16-bit, mono, compression level 6)
- track-level MIDI files
- CSV metadata ("metadata.csv") including, among others, genres, tempo (BPM), time signature, and original composer and artist information
- ms12 and ms14 dataset splits in JSON format, as described in the initial study (see above)
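As a minimal sketch of how the CSV metadata might be queried, the following Python snippet selects pieces whose tempo range overlaps a requested range. The column names "piece_id", "bpm_min", and "bpm_max" are taken from the Version 1.2 changelog. The comma delimiter, the header row, the numeric formatting of the BPM values, the function name, and the sample rows are assumptions made for illustration and are not taken from the archive, so please check "README.txt" before relying on them.

```python
import csv
import io


def pieces_in_tempo_range(csv_file, low, high):
    """Return piece_ids whose [bpm_min, bpm_max] range overlaps [low, high]."""
    reader = csv.DictReader(csv_file)
    selected = []
    for row in reader:
        bpm_min = float(row["bpm_min"])
        bpm_max = float(row["bpm_max"])
        # Two closed intervals overlap when each starts before the other ends.
        if bpm_min <= high and bpm_max >= low:
            selected.append(row["piece_id"])
    return selected


# Made-up rows for illustration only; the real values live in "metadata.csv".
SAMPLE = io.StringIO(
    "piece_id,bpm_min,bpm_max\n"
    "example_a,120,120\n"
    "example_b,90,140\n"
    "example_c,170,174\n"
)

fast_enough = pieces_in_tempo_range(SAMPLE, 118, 130)
print(fast_enough)
```

With the real file, the in-memory sample would be replaced by something like open("metadata.csv", newline="", encoding="utf-8"), where the encoding is again an assumption.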
Technical Properties
Not all mixtures are exact sums of their corresponding multitrack stems. Some mixtures contain additional processing in the form of limiting and compression, e.g. applied to the full mix or as side-chain compression between tracks. No harmonic effects, such as reverb, echo, or delay, were added to the mixtures, as these would introduce additional harmonic content and could lead to mismatches between MIDI and audio.
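To see which mixtures deviate from the plain sum of their stems, one option is to measure the energy of the difference relative to the mixture. The sketch below is one possible check, not part of the dataset tooling: the function name is hypothetical, the samples are assumed to be already decoded to floats (decoding FLAC needs a third-party library, which is not shown), and the toy signals are made up. Even for a mixture without limiting, a small residual may remain from 16-bit quantization, so the threshold for "exact" is left to the user.

```python
import math


def residual_db(mixture, stems):
    """Level of (mixture - sum of stems) relative to the mixture, in dB.

    `mixture` is a sequence of float samples; `stems` is a non-empty list
    of equally long sequences. Returns -inf when the sum is exact.
    """
    if not stems:
        raise ValueError("at least one stem is required")
    if any(len(stem) != len(mixture) for stem in stems):
        raise ValueError("all stems must have the same length as the mixture")
    # Sample-wise sum across all stems.
    stem_sum = [sum(samples) for samples in zip(*stems)]
    residual_energy = sum((m - s) ** 2 for m, s in zip(mixture, stem_sum))
    mixture_energy = sum(m ** 2 for m in mixture)
    if mixture_energy == 0.0:
        raise ValueError("mixture is silent; relative level is undefined")
    if residual_energy == 0.0:
        return -math.inf
    return 10.0 * math.log10(residual_energy / mixture_energy)


# Toy signals (made up): two stems and two candidate mixtures.
stems = [
    [0.25, -0.5, 0.5, 0.0],
    [0.25, 0.25, 0.25, -0.25],
]
exact_mix = [0.5, -0.25, 0.75, -0.25]    # plain sum of the stems
limited_mix = [0.5, -0.25, 0.5, -0.25]   # same sum, peaks limited to 0.5

print(residual_db(exact_mix, stems))
print(residual_db(limited_mix, stems))
```

A residual of -inf (or very low, once quantization is involved) suggests a plain sum, while a clearly audible residual level points to limiting or compression on the mix.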
Demo Page & Repository
A demo page with selected listening examples is available on GitHub Pages: https://mic-tae.github.io/mshoxxdb/. The mshoxxDB repository is located at https://github.com/mic-tae/mshoxxdb. The canonical archived release of mshoxxDB is hosted here on Zenodo. The GitHub repository and demo page provide supplementary documentation, examples, and project-related resources.
License
This dataset is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license (CC BY-NC-SA 4.0). See "LICENSE.txt" for the full license terms.
Citation
If you use this dataset in your work, please cite it using one of the following BibTeX entries (the first option is preferred):
@misc {taenzer:mshoxxDB:2024,
author = {Taenzer, Michael},
title = {{mshoxxDB - a Versioned Dataset for Electronic Music}},
booktitle = {{Late-Breaking and Demo Session of the 25th International Conference on Music Information Retrieval (ISMIR)}},
address = {{San Francisco, CA, USA}},
year = {2024},
}
@dataset{taenzer:mshoxxDB:2024,
author = {Taenzer, Michael},
title = {mshoxxDB - a Versioned Dataset for Electronic Music},
year = {2024},
publisher = {Zenodo},
doi = {10.5281/zenodo.15881577},
url = {https://doi.org/10.5281/zenodo.15881577},
}
For methodological details and the initial study of the dataset, please also refer to the accompanying journal article and ISMIR 2024 late-breaking demo contribution.
Future versions
Future versions of mshoxxDB may include additional music, segmentation annotations for each piece, automation information for synthesizer parameters in the MIDI files, and possibly stereo audio data.
Community
Contributions to this dataset are welcome, for example through additional music, annotations, metadata improvements, or other suggestions that could help improve mshoxxDB.
Changelog
Version 1.2 (15 April 2026)
- added "metadata.csv" back into the archive
- substantially restructured "metadata.csv":
  - new columns: "piece_id", "artist", "bpm_min", "bpm_max"
  - renamed columns: "genre" -> "genres", "length" -> "duration_seconds", "timesig" -> "time_signature"
  - dropped column: "tempo"
- JSON files now use "piece_id" instead of filenames as identifier
- changed all remaining umlauts "ü" -> "ue"
- substantially extended "README.txt"
- small changes to "LICENSE.txt"
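Scripts written against the older column names can be adapted with a small mapping. The sketch below only renames keys according to the list above and drops "tempo"; it does not touch the values, and whether any values changed format between versions is not stated here. The function name, the "title" column, and the sample row are hypothetical.

```python
# Old -> new column names, as listed in the Version 1.2 changelog.
RENAMED_COLUMNS = {
    "genre": "genres",
    "length": "duration_seconds",
    "timesig": "time_signature",
}
# Columns that no longer exist in Version 1.2.
DROPPED_COLUMNS = {"tempo"}


def upgrade_row(old_row):
    """Map a pre-1.2 metadata row (a dict) onto the Version 1.2 column names."""
    new_row = {}
    for name, value in old_row.items():
        if name in DROPPED_COLUMNS:
            continue
        # Unlisted columns keep their name; values are passed through as-is.
        new_row[RENAMED_COLUMNS.get(name, name)] = value
    return new_row


# Made-up row for illustration; "title" is a hypothetical extra column.
old = {"title": "t", "genre": "x", "length": "1", "timesig": "4/4", "tempo": "120"}
print(upgrade_row(old))
```

The new columns "piece_id", "artist", "bpm_min", and "bpm_max" cannot be derived this way and have to be read from the Version 1.2 "metadata.csv".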
Version 1.1 (16 July 2025)
- renamed all files to reflect main DB version number v1
- changed umlaut "ü" from "Güte" -> "Guete"
- added dataset splits "ms12" (1 JSON file) and "ms14" (3 JSON files) as described and used in https://doi.org/10.1186/s13636-025-00398-2
- added LICENSE file
- added README file
Version 1 (14 November 2024)
- Initial release
Additional details
Related works
- Is described by: Journal article, 10.1186/s13636-025-00398-2 (DOI)
- Is published in: Dataset, https://ismir2024program.ismir.net/lbd_423.html#lbd (URL)
Dates
- Available: 2024-08-09 (Initial Release)
- Updated: 2025-07-16 (Version 1.1)
- Updated: 2026-04-15 (Version 1.2)
Software
- Repository URL: https://github.com/mic-tae/mshoxxdb
- Development Status: WIP