PDMX: A Large-Scale Public Domain MusicXML Dataset for Symbolic Music Processing
Description
We introduce PDMX: a Public Domain MusicXML dataset for symbolic music processing. Refer to our paper for more information, and our GitHub repository for any code-related details. Please cite both our paper and our collaborators' paper if you use this dataset (see our GitHub for more information).
Upon further use of the PDMX dataset, we discovered a discrepancy between the public-facing copyright metadata on the MuseScore website and the internal copyright data of the MuseScore files themselves, which affected 31,221 (12.29% of) songs. We have decided to proceed with the former given its public visibility on MuseScore (i.e., this is what the MuseScore website presents to its users). We have flagged files with conflicting internal licenses in the license_conflict column of PDMX. We recommend using the no_license_conflict subset of PDMX (which still includes 222,856 songs) moving forward.
Additionally, for each song in PDMX, we not only provide the MusicRender and metadata JSON files, but we also try to include the associated compressed MusicXML (MXL), sheet music (PDF), and MIDI (MID) files when available. Due to the corruption of 42 of the original MuseScore files, these songs lack those associated files (since they could not be converted to those formats) and only include the MusicRender and metadata JSON files. The all_valid subset of PDMX describes the songs where all associated files are valid.
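As a minimal sketch of how the recommended subsets might be selected, the snippet below filters rows on the subset:no_license_conflict and subset:all_valid columns of PDMX.csv using only the Python standard library. It assumes that boolean columns are written as "True"/"False" text; the helper names is_true and select_rows are hypothetical, and the in-memory CSV is toy data invented for illustration, not real PDMX rows.

```python
import csv
import io


def is_true(value):
    """Interpret a CSV cell as a boolean (assumes 'True'/'False'-style text)."""
    return value.strip().lower() in {"true", "1"}


def select_rows(csv_file, required_flags):
    """Yield the rows for which every column named in required_flags is true."""
    for row in csv.DictReader(csv_file):
        if all(is_true(row[flag]) for flag in required_flags):
            yield row


# Toy stand-in for PDMX.csv; these paths and values are invented.
toy_csv = io.StringIO(
    "path,subset:no_license_conflict,subset:all_valid\n"
    "data/a.json,True,True\n"
    "data/b.json,False,True\n"
    "data/c.json,True,False\n"
)
flags = ["subset:no_license_conflict", "subset:all_valid"]
selected = [row["path"] for row in select_rows(toy_csv, flags)]
print(selected)
```

With the real file, the same function could be called on open("PDMX.csv", newline="") in place of the toy buffer.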
Abstract (English)
The recent explosion of generative AI-Music systems has raised numerous concerns over data copyright, licensing music from musicians, and the conflict between open-source AI and large prestige companies. Such issues highlight the need for publicly available, copyright-free musical data, of which there is a large shortage, particularly for symbolic music data. To alleviate this issue, we present PDMX: a large-scale open-source dataset of over 250K public domain MusicXML scores collected from the score-sharing forum MuseScore, making it the largest available copyright-free symbolic music dataset to our knowledge. PDMX additionally includes a wealth of both tag and user interaction metadata, allowing us to efficiently analyze the dataset and filter for high-quality user-generated scores. Given the additional metadata afforded by our data collection process, we conduct multitrack music generation experiments evaluating how different representative subsets of PDMX lead to different behaviors in downstream models, and how user-rating statistics can be used as an effective measure of data quality. Examples can be found at https://pnlong.github.io/PDMX.demo/.
Technical info (English)
PDMX.csv
This is the main CSV file for PDMX. Each row represents a different song. Columns include various selected attributes from the metadata, features extracted from the music, and different subsets of the dataset. Each column is described here:
- path: Path to the data (MusicRender JSON) file.
- metadata: Path to the associated metadata (JSON) file. The basename of each file matches the basename of the corresponding file in the path column.
- mxl: Path to the associated compressed MusicXML (MXL) file. The basename of each file matches the basename of the corresponding file in the path column. Values may be N/A, since some of the original MuseScore files are corrupted and thus cannot be converted to compressed MusicXML.
- pdf: Path to the associated sheet music (PDF) file. The basename of each file matches the basename of the corresponding file in the path column. Values may be N/A, since some of the original MuseScore files are corrupted and thus cannot be converted to PDF.
- mid: Path to the associated MIDI (MID) file. The basename of each file matches the basename of the corresponding file in the path column. Values may be N/A, since some of the original MuseScore files are corrupted and thus cannot be converted to MID.
- version: Version of the original MuseScore file.
- is_user_pro: Whether the user who posted the original MuseScore file is a "pro" user (pays for a MuseScore subscription).
- is_user_publisher: Whether the user who posted the original MuseScore file is a music publisher.
- is_user_staff: Whether the user who posted the original MuseScore file is part of MuseScore staff.
- has_paywall: Whether the original MuseScore file had a paywall.
- is_rated: Whether the original MuseScore file had any ratings.
- is_official: Whether the original MuseScore file was an "official" score, a title decided by MuseScore.
- is_original: Whether the original MuseScore file was an original work.
- is_draft: Whether the original MuseScore file was marked as a draft by the user who posted it.
- has_custom_audio: Whether the original MuseScore file has an associated custom audio file (must retrieve the actual audio from the metadata).
- has_custom_video: Whether the original MuseScore file has an associated custom video file (must retrieve the actual video from the metadata).
- n_comments: Number of comments on the original MuseScore file.
- n_favorites: Number of users who favorited the original MuseScore file.
- n_views: Number of views on the original MuseScore file.
- n_ratings: Number of ratings on the original MuseScore file.
- rating: Average rating (out of five stars) of the original MuseScore file. A rating of zero indicates that a song is unrated.
- license: Creative Commons license of the original MuseScore file.
- license_url: Link to the Creative Commons license of the original MuseScore file. Directly related to the license column.
- license_conflict: Whether the song's public-facing copyright metadata license disagrees with the internal copyright license data of the original MuseScore file (i.e. publicly, the song holds a public domain license, but internally, the copyright license data of the original MuseScore file says otherwise).
- genres: Genre(s) associated with the original MuseScore file, separated with a "-" if there are multiple.
- groups: MuseScore group(s) associated with the original MuseScore file, separated with a "-" if there are multiple.
- tags: MuseScore tag(s) associated with the original MuseScore file, separated with a "-" if there are multiple.
- song_name: If available, the name of the song.
- title: If available, the title of the song, oftentimes the same as song_name.
- subtitle: If available, the subtitle of the song.
- artist_name: If available, the name of the artist who created the song.
- composer_name: If available, the name of the composer who created the song, oftentimes the same as artist_name.
- publisher: If available, the publisher of the song.
- complexity: The MuseScore complexity score of the original MuseScore file. Ranges from 0-3.
- n_tracks: The number of tracks (parts) in the original MuseScore file.
- tracks: Track(s) from the original MuseScore file, separated with a "-" if there are multiple.
- song_length: Length of the song, in metrical MusPy time steps. There are usually 12 time steps per beat, though the actual resolution of each song is determined as the least common multiple of all divisions. To be practical, this statistic is best calculated by loading the file in path as a MusicRender object, and obtaining the song length there.
- song_length.seconds: Length of the song, in seconds.
- song_length.bars: Length of the song, in bars.
- song_length.beats: Length of the song, in beats.
- n_notes: Number of notes in the song.
- notes_per_bar: Average number of notes per bar in the song.
- n_annotations: Number of annotations in the song.
- has_annotations: Whether the song has any annotations.
- n_lyrics: Number of lyrics in the song.
- has_lyrics: Whether the song has any lyrics.
- n_tokens: Number of tokens (n_notes + n_annotations + n_lyrics) in the song.
- pitch_class_entropy: Pitch Class Entropy of the song, as calculated by the MusPy Package.
- scale_consistency: Scale Consistency of the song, as calculated by the MusPy Package. Ranges from 0-1.
- groove_consistency: Groove Consistency of the song, as calculated by the MusPy Package. Ranges from 0-1.
- best_path: Best filepath in the song's title duplicate grouping (see paper for full description).
- is_best_path: Whether the song is the best_path in the title duplicate grouping.
- best_arrangement: Best filepath in the song's title-instrumentation duplicate grouping (see paper for full description).
- is_best_arrangement: Whether the song is the best_arrangement in the title-instrumentation duplicate grouping.
- best_unique_arrangement: Best filepath in the song's title-instrumentation-arrangement duplicate grouping (see paper for full description).
- is_best_unique_arrangement: Whether the song is the best_unique_arrangement in the title-instrumentation-arrangement duplicate grouping. All songs for which this value is true are part of the Deduplicated subset.
- subset:all: Whether the song is part of the All subset (all True).
- subset:deduplicated: Whether the song is part of the Deduplicated subset (the same as is_best_unique_arrangement).
- subset:rated: Whether the song is part of the Rated subset (the song has a non-zero rating).
- subset:rated_deduplicated: Whether the song is both part of the Rated and Deduplicated subsets.
- subset:no_license_conflict: Whether the song's public-facing copyright metadata license agrees with the internal copyright license data of the original MuseScore file; that is, publicly and internally, the song holds a public domain license (the negation of the license_conflict column).
- subset:all_valid: Whether the song's associated compressed MusicXML (MXL), sheet music (PDF), and MIDI (MID) files are all valid (non-N/A).
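The subset columns above are each defined in terms of other columns, so they can be recomputed from a row as a consistency check. The following is a rough sketch under several assumptions: booleans are stored as "True"/"False" text, rating is stored as numeric text, and missing file paths appear as an empty cell or a marker such as "N/A". The helper names (is_true, is_missing, derive_subsets) are hypothetical, and the example row is invented.

```python
def is_true(value):
    """Interpret a CSV cell as a boolean (assumes 'True'/'False'-style text)."""
    return str(value).strip().lower() in {"true", "1"}


# Assumed spellings of a missing value; the actual marker in PDMX.csv may differ.
MISSING_MARKERS = {"", "n/a", "na", "nan", "none"}


def is_missing(value):
    """Return True if a cell looks like a missing (N/A) value."""
    return value is None or str(value).strip().lower() in MISSING_MARKERS


def derive_subsets(row):
    """Recompute the subset flags of one row from the columns that define them."""
    rated = float(row["rating"]) > 0  # a rating of zero means unrated
    deduplicated = is_true(row["is_best_unique_arrangement"])
    all_valid = not any(is_missing(row[col]) for col in ("mxl", "pdf", "mid"))
    return {
        "subset:all": True,
        "subset:deduplicated": deduplicated,
        "subset:rated": rated,
        "subset:rated_deduplicated": rated and deduplicated,
        "subset:no_license_conflict": not is_true(row["license_conflict"]),
        "subset:all_valid": all_valid,
    }


# Invented example row: rated, not the best unique arrangement, PDF missing.
example_row = {
    "rating": "4.5",
    "is_best_unique_arrangement": "False",
    "license_conflict": "False",
    "mxl": "mxl/x/song.mxl",
    "pdf": "N/A",
    "mid": "mid/x/song.mid",
}
derived = derive_subsets(example_row)
print(derived)
```

Comparing the derived flags against the stored subset columns is one way to spot rows that were parsed incorrectly.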
Directories
Besides the main CSV file, PDMX.csv, PDMX comprises multiple subdirectories:
- data: The main music files, stored as JSONified MusicRender (refer to the paper for a definition) objects. Stored in a tree directory structure similar to that of the MuseScore data corpus from which they were scraped. JSON objects can be loaded back into a Python environment with the load() function, described further in the GitHub repository.
- metadata: Associated JSON metadata files for each music data file (the basename of each metadata file matches the basename of its associated data file). Also stored in a tree directory structure similar to that of the MuseScore metadata corpus from which they were scraped.
- mxl: Associated compressed MusicXML (MXL) files for each music data file (the basename of each file matches the basename of its associated data file). Stored in a tree directory structure identical to that of the data directory. Not every music data file has an associated compressed MusicXML file, since some of the original MuseScore files are corrupted and thus cannot be converted to MXL.
- pdf: Associated sheet music (PDF) files for each music data file (the basename of each file matches the basename of its associated data file). Stored in a tree directory structure identical to that of the data directory. Not every music data file has an associated sheet music file, since some of the original MuseScore files are corrupted and thus cannot be converted to PDF.
- mid: Associated MIDI (MID) files for each music data file (the basename of each file matches the basename of its associated data file). Stored in a tree directory structure identical to that of the data directory. Not every music data file has an associated MIDI file, since some of the original MuseScore files are corrupted and thus cannot be converted to MID.
- subset_paths: Contains the text files all.txt, deduplicated.txt, rated.txt, rated_deduplicated.txt, no_license_conflict.txt, and all_valid.txt, which each contain all of the paths in the given subset. Descriptions of each subset can be found in the column summaries of PDMX.csv, though the first four subsets are also described at length in the paper.
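Because the mxl, pdf, and mid directories mirror the tree structure of the data directory and share basenames with the data files, a sibling path can in principle be derived from a data path. The sketch below illustrates the idea under stated assumptions: paths are relative and contain a "data" component, the sibling files use the extensions .mxl, .pdf, and .mid, and each subset text file holds one path per line. The function names sibling_path and read_subset_paths are hypothetical, and the example path is invented.

```python
from pathlib import PurePosixPath

# Assumed file extensions for each sibling directory.
EXTENSIONS = {"mxl": ".mxl", "pdf": ".pdf", "mid": ".mid"}


def sibling_path(data_path, kind):
    """Map a path under data/ to the mirrored path under mxl/, pdf/, or mid/."""
    parts = list(PurePosixPath(data_path).parts)
    index = parts.index("data")  # raises ValueError if there is no data component
    parts[index] = kind
    return str(PurePosixPath(*parts).with_suffix(EXTENSIONS[kind]))


def read_subset_paths(lines):
    """Collect non-empty, stripped lines from a subset file (or any iterable)."""
    return [line.strip() for line in lines if line.strip()]


# Invented example path; real PDMX paths follow the MuseScore corpus layout.
example = "data/x/y/song.json"
print(sibling_path(example, "mxl"))
print(read_subset_paths(["data/x/y/song.json\n", "\n"]))
```

The mxl, pdf, and mid columns of PDMX.csv remain the reliable source for these paths, since a derived path will not exist for songs whose original MuseScore file was corrupted.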
Files (14.4 GB)
MD5 checksum | Size |
---|---|
md5:f38dfa7b75f95e5a3d8d70459c1f9b72 | 2.2 GB |
md5:5bc79445090dd2fe5e96cffa77a3461c | 159.4 MB |
md5:d920a21b2fcd99a56d9c381b39debbb2 | 214.4 MB |
md5:49ffd75ecf5489c0be6d41182eb11ff7 | 1.9 GB |
md5:2e03ccd072755332bd63a75c57c89b3f | 9.6 GB |
md5:30392ccf38bb63ce70e7afae70f9c88c | 225.4 MB |
md5:092eee416ece8060f77d08575b94a43d | 29.3 MB |
Additional details
Software
- Repository URL: https://github.com/pnlong/PDMX
- Programming language: Python, Shell
- Development Status: Active