mshoxxDB - a Versioned Dataset for Electronic Music
Description
This dataset was presented as a Late-Breaking Demo at ISMIR 2024 in San Francisco, CA, with an accompanying paper (extended abstract), poster, and demo video. It was initially studied in a EURASIP journal article (DOI: 10.1186/s13636-025-00398-2).
mshoxxDB is an open-source dataset for research in the field of Music Information Retrieval (MIR), with a focus on Electronic Music. It was created by Michael Taenzer in the Reason Studios digital audio workstation (DAW). The dataset provides comprehensively annotated music audio data for a genre that has received comparatively limited attention in MIR research. With its combination of diverse synthetic timbres, acoustic and traditional classical instruments, and multitrack material, it supports tasks such as instrument detection, multi-pitch estimation, source separation, beat detection, and tempo estimation. It is particularly well suited for evaluating instrument-agnostic methods and model generalization. The music covers several sub-genres of Electronic Music, e.g. video game, 8-bit (chiptune), EDM, pop, house, and chillout/dreamy styles.
Contents
- 18 full-length pieces of music, 61 minutes of audio in total
- mixtures and multitrack stems in FLAC format (44.1 kHz, 16-bit, mono, compression level 6)
- track-level MIDI files
- CSV metadata including genre, tempo, time signature, and artist information
- ms12 and ms14 dataset splits in JSON format, as described in the initial study (see above)
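For planning memory and storage, the format figures above translate into a rough decoded size for the mixtures. The following is a minimal sketch, assuming the stated 61 minutes refers to the 18 mixtures and is a rounded value; the multitrack stems add to this and are not counted. The function name is illustrative and not part of the dataset.

```python
# Figures taken from the contents list above.
NUM_PIECES = 18
TOTAL_MINUTES = 61          # assumed to be a rounded total for the mixtures
SAMPLE_RATE_HZ = 44_100
BYTES_PER_SAMPLE = 2        # 16-bit PCM
CHANNELS = 1                # mono


def mixture_footprint(total_minutes=TOTAL_MINUTES):
    """Return (total samples, decoded size in bytes) for the mixtures."""
    total_samples = total_minutes * 60 * SAMPLE_RATE_HZ * CHANNELS
    return total_samples, total_samples * BYTES_PER_SAMPLE


samples, pcm_bytes = mixture_footprint()
mean_minutes = TOTAL_MINUTES / NUM_PIECES
print(f"{samples:,} samples, {pcm_bytes / 1e6:.1f} MB as 16-bit PCM")
print(f"mean piece length: {mean_minutes:.2f} min")
```

Decoding to 32-bit floats, as many audio libraries do by default, doubles the byte count.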
Technical Properties
Not all mixtures are exact sums of their corresponding multitrack stems. Some mixtures may contain additional processing in the form of limiters and compression, e.g. applied to the full mix or through side-chain compression between tracks.
No harmonic effects such as reverb, echo, or delay were added to the mixtures, as these would introduce additional harmonic content and cause mismatches between MIDI and audio.
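Because a mixture may differ from the sum of its stems, it can be useful to measure that difference before using a piece for source separation evaluation. The following is a minimal sketch, assuming the mixture and its stems have already been decoded into equally long sequences of samples with an audio reader of your choice. The function `residual_db` is a hypothetical helper, not part of the dataset or its repository.

```python
import math


def residual_db(mixture, stems):
    """Energy of (mixture - sum of stems) relative to the mixture, in dB.

    `mixture` is a sequence of samples; `stems` is a list of equally long
    sequences. Returns -inf when the mixture is an exact sum of its stems.
    """
    if any(len(stem) != len(mixture) for stem in stems):
        raise ValueError("all stems must have the same length as the mixture")
    # Sum the stems sample by sample.
    stem_sum = [sum(frame) for frame in zip(*stems)] if stems else [0.0] * len(mixture)
    residual_energy = sum((m - s) ** 2 for m, s in zip(mixture, stem_sum))
    mixture_energy = sum(m ** 2 for m in mixture)
    if mixture_energy == 0.0:
        raise ValueError("mixture is silent; relative residual is undefined")
    if residual_energy == 0.0:
        return -math.inf
    return 10.0 * math.log10(residual_energy / mixture_energy)


# Toy example: two stems and a mixture that is their exact sum.
stem_a = [0.5, -0.25, 0.0, 0.25]
stem_b = [0.25, 0.25, -0.5, 0.0]
exact_mix = [a + b for a, b in zip(stem_a, stem_b)]
print(residual_db(exact_mix, [stem_a, stem_b]))
```

A strongly negative value indicates a mixture that is close to the plain stem sum, while values nearer to 0 dB suggest substantial processing on the mix. Where to draw the line depends on the task.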
Demo Page & Repository
A demo page with selected listening examples is available on GitHub Pages: https://mic-tae.github.io/mshoxxdb/. The mshoxxDB repository is located at https://github.com/mic-tae/mshoxxdb. The canonical archived release of mshoxxDB is hosted on Zenodo. The GitHub repository and demo page provide supplementary documentation, examples, and project-related resources.
License
All contents are distributed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license (CC BY-NC-SA 4.0). See the LICENSE file for details.
Citation
If you use this dataset in your work, please cite it as follows (BibTeX):
@misc{taenzer:mshoxxDB:2024,
  author = {Taenzer, Michael},
  title = {{mshoxxDB - a Versioned Dataset for Electronic Music}},
  booktitle = {{Late-Breaking and Demo Session of the 25th International Conference on Music Information Retrieval (ISMIR)}},
  address = {{San Francisco, CA, USA}},
  year = {2024},
}
Future versions
Future versions of mshoxxDB may include additional music, segmentation annotations for each piece, and possibly stereo audio data.
Community
Contributions to this dataset are welcome in all forms, e.g. new music, annotations, or other suggestions that could help improve mshoxxDB.
Changelog
Version 1.1 (16 July 2025)
- all files now reflect main dataset version number v1 (previous numbers referred to internal track session numbers)
- removed umlaut from “Güte” --> “Guete”
- added ms12 and ms14 dataset splits (JSON files), a LICENSE file, and a README file
Version 1.0 (9 August 2024)
- initial release
Files
- mshoxxDB_v1.1.zip (629.6 MB), md5:87a3fa9f9c940d66c55a4f0fba43d195
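The MD5 checksum listed for mshoxxDB_v1.1.zip can be used to confirm that a download is complete and unmodified. The following is a minimal sketch using only the Python standard library; the helper names and the local file path are assumptions for illustration.

```python
import hashlib

# Checksum listed for mshoxxDB_v1.1.zip in the file listing above.
EXPECTED_MD5 = "87a3fa9f9c940d66c55a4f0fba43d195"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def archive_is_intact(path, expected=EXPECTED_MD5):
    """True if the file at `path` matches the expected MD5 checksum."""
    return md5_of_file(path) == expected.lower()
```

With the archive in the working directory, `archive_is_intact("mshoxxDB_v1.1.zip")` should return `True` for an intact download.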
Additional details
Related works
- Is described by: Journal article, 10.1186/s13636-025-00398-2 (DOI)
Dates
- Available: 2024-08-09 (Initial Release)
- Updated: 2025-07-16 (Version 1.1)
Software
- Repository URL: https://github.com/mic-tae/mshoxxdb
- Development Status: WIP