PDMX: A Large-Scale Public Domain MusicXML Dataset for Symbolic Music Processing

Long, Phillip; Novack, Zachary; McAuley, Julian; Berg-Kirkpatrick, Taylor

doi:10.5281/zenodo.13763756

Published September 15, 2024 | Version v1

Dataset Open

1. University of California, San Diego
2. UCSD

We introduce PDMX: a Public Domain MusicXML dataset for symbolic music processing. Refer to our paper for more information, and our GitHub repository for any code-related details. Please cite our paper if you use this dataset.

Abstract (English)

The recent explosion of generative AI-Music systems has raised numerous concerns over data copyright, licensing music from musicians, and the conflict between open-source AI and large prestige companies. Such issues highlight the need for publicly available, copyright-free musical data, of which there is a large shortage, particularly for symbolic music data. To alleviate this issue, we present PDMX: a large-scale open-source dataset of over 250K public domain MusicXML scores collected from the score-sharing forum MuseScore, making it, to our knowledge, the largest available copyright-free symbolic music dataset. PDMX additionally includes a wealth of both tag and user interaction metadata, allowing us to efficiently analyze the dataset and filter for high-quality user-generated scores. Given the additional metadata afforded by our data collection process, we conduct multitrack music generation experiments evaluating how different representative subsets of PDMX lead to different behaviors in downstream models, and how user-rating statistics can be used as an effective measure of data quality. Examples can be found at https://pnlong.github.io/PDMX.demo/.

Technical info (English)

PDMX.csv

This is the main CSV file for PDMX. Each row represents a different song. Columns include various selected attributes from the metadata, features extracted from the music, and different subsets of the dataset. Each column is described here:

path: Path to the data (MusicRender JSON) file.
metadata: Path to the associated metadata (JSON) file. The basename of each file matches the basename of the corresponding file in the path column.
has_metadata: Whether a song has an associated metadata file (all True).
version: Version of the original MuseScore file.
is_user_pro: Whether the user who posted the original MuseScore file is a "pro" user (pays for a MuseScore subscription).
is_user_publisher: Whether the user who posted the original MuseScore file is a music publisher.
is_user_staff: Whether the user who posted the original MuseScore file is part of MuseScore staff.
has_paywall: Whether the original MuseScore file had a paywall.
is_rated: Whether the original MuseScore file had any ratings.
is_official: Whether the original MuseScore file was an "official" score, a title decided by MuseScore.
is_original: Whether the original MuseScore file was an original work.
is_draft: Whether the original MuseScore file was marked as a draft by the user who posted it.
has_custom_audio: Whether the original MuseScore file has an associated custom audio file (must retrieve the actual audio from the metadata).
has_custom_video: Whether the original MuseScore file has an associated custom video file (must retrieve the actual video from the metadata).
n_comments: Number of comments on the original MuseScore file.
n_favorites: Number of users who favorited the original MuseScore file.
n_views: Number of views on the original MuseScore file.
n_ratings: Number of ratings on the original MuseScore file.
rating: Average rating (out of five stars) of the original MuseScore file. A rating of zero indicates that a song is unrated.
license: Creative Commons license of the original MuseScore file.
license_url: Link to the Creative Commons license of the original MuseScore file. Directly related to the license column.
genres: Genre(s) associated with the original MuseScore file, separated with a "-" if there are multiple.
groups: MuseScore group(s) associated with the original MuseScore file, separated with a "-" if there are multiple.
tags: MuseScore tag(s) associated with the original MuseScore file, separated with a "-" if there are multiple.
song_name: If available, the name of the song.
title: If available, the title of the song, oftentimes the same as song_name.
subtitle: If available, the subtitle of the song.
artist_name: If available, the name of the artist who created the song.
composer_name: If available, the name of the composer who created the song, oftentimes the same as artist_name.
publisher: If available, the publisher of the song.
complexity: The MuseScore complexity score of the original MuseScore file. Ranges from 0-3.
n_tracks: The number of tracks (parts) in the original MuseScore file.
tracks: Track(s) from the original MuseScore file, separated with a "-" if there are multiple.
song_length: Length of the song, in metrical MusPy time steps. There are usually 12 time steps per beat, though the actual resolution of each song is determined as the least common multiple of all divisions. In practice, this statistic is best computed by loading the file in path as a MusicRender object and reading the song length from it.
song_length.seconds: Length of the song, in seconds.
song_length.bars: Length of the song, in bars.
song_length.beats: Length of the song, in beats.
n_notes: Number of notes in the song.
notes_per_bar: Average number of notes per bar in the song.
n_annotations: Number of annotations in the song.
has_annotations: Whether the song has any annotations.
n_lyrics: Number of lyrics in the song.
has_lyrics: Whether the song has any lyrics.
n_tokens: Number of tokens (n_notes + n_annotations + n_lyrics) in the song.
pitch_class_entropy: Pitch Class Entropy of the song, as calculated by the MusPy Package.
scale_consistency: Scale Consistency of the song, as calculated by the MusPy Package. Ranges from 0-1.
groove_consistency: Groove Consistency of the song, as calculated by the MusPy Package. Ranges from 0-1.
best_path: Best filepath in the song's title duplicate grouping (see paper for full description).
is_best_path: Whether the song is the best_path in the title duplicate grouping.
best_arrangement: Best filepath in the song's title-instrumentation duplicate grouping (see paper for full description).
is_best_arrangement: Whether the song is the best_arrangement in the title-instrumentation duplicate grouping.
best_unique_arrangement: Best filepath in the song's title-instrumentation-arrangement duplicate grouping (see paper for full description).
is_best_unique_arrangement: Whether the song is the best_unique_arrangement in the title-instrumentation-arrangement duplicate grouping. All songs for which this value is true are part of the Deduplicated subset.
subset:all: Whether the song is part of the All subset (all True).
subset:deduplicated: Whether the song is part of the Deduplicated subset (the same as is_best_unique_arrangement).
subset:rated: Whether the song is part of the Rated subset (the song has a non-zero rating).
subset:rated_deduplicated: Whether a song is both part of the Rated and Deduplicated subsets.
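
As a rough illustration of how the columns above might be used, the following sketch reads PDMX.csv with Python's standard csv module, selects one subset, and ranks its songs by rating. It assumes that boolean columns are serialized as the text "True" and "False" and that numeric columns parse as plain numbers; this page does not confirm either. The helper names (load_rows, in_subset, split_multi, top_rated) are hypothetical and are not part of the PDMX code.

```python
import csv

def load_rows(csv_path):
    """Read every row of PDMX.csv into a list of dicts keyed by column name."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))

def in_subset(row, subset="rated_deduplicated"):
    """Check membership through the 'subset:<name>' column.

    Assumption: booleans are stored as the text 'True' / 'False'.
    """
    return row[f"subset:{subset}"].strip().lower() == "true"

def split_multi(value):
    """Split a multi-valued field (genres, groups, tags, tracks) on '-'."""
    return [part for part in value.split("-") if part]

def top_rated(rows, subset="rated_deduplicated", min_ratings=1):
    """Return the rows in a subset that have enough ratings, best rating first."""
    keep = [
        r for r in rows
        if in_subset(r, subset) and int(float(r["n_ratings"])) >= min_ratings
    ]
    return sorted(keep, key=lambda r: float(r["rating"]), reverse=True)
```

Because multi-valued fields are joined with "-", a value that itself contains a hyphen would be split into two entries by split_multi; treat the result as approximate.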

Directories

Besides the main CSV file, PDMX.csv, there are three main directories in PDMX:

data: The main music files, stored as JSONified MusicRender (refer to the paper for a definition) objects. Stored in a similar tree directory structure to the MuseScore data corpus from which it was scraped. JSON objects can be reinstated into a Python environment with the load() function, described further in the GitHub repository.
metadata: Associated JSON metadata files for each music data file (the basename of each metadata file matches the basename of its associated data file). Also stored in a similar tree directory structure as the MuseScore metadata corpus from which it was scraped.
subset_paths: Contains four text files, all.txt, deduplicated.txt, rated.txt, and rated_deduplicated.txt, which contain the paths in each subset (described in the paper).
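
The layout described above can be explored without any PDMX-specific code. The sketch below reads one of the subset listings and opens a data or metadata file as plain JSON to inspect its top-level keys. It assumes each listing holds one path per line; the function names are hypothetical. It deliberately does not use the repository's load() function, whose exact signature is documented in the GitHub repository rather than on this page.

```python
import json
from pathlib import Path

def read_subset_paths(pdmx_dir, subset="rated_deduplicated"):
    """Return the data-file paths listed in subset_paths/<subset>.txt.

    Assumption: one path per line. Relative entries such as
    './data/...' are joined onto pdmx_dir; absolute entries are kept.
    """
    root = Path(pdmx_dir)
    listing = root / "subset_paths" / f"{subset}.txt"
    lines = listing.read_text(encoding="utf-8").splitlines()
    return [root / line.strip() for line in lines if line.strip()]

def peek_json(path):
    """Load a data or metadata file as plain JSON and list its top-level keys."""
    with open(path, encoding="utf-8") as f:
        obj = json.load(f)
    return sorted(obj) if isinstance(obj, dict) else []
```

To obtain an actual MusicRender object rather than raw JSON, use the loading code from the GitHub repository.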

Notes (English)

To uncompress the PDMX directory, use the following commands, where the absolute filepath /path/to/PDMX is the directory into which you want to extract the PDMX files and /path/to/gzip is the directory containing the .tar.gz file downloaded from this Zenodo record.

cd /path/to/gzip
PDMX_dir="/path/to/PDMX"
# tar -C requires the target directory to exist
mkdir -p "${PDMX_dir}"
tar -xzf PDMX.tar.gz -C "${PDMX_dir}"

Every path in PDMX is stored relative to the PDMX directory; for instance, a data file is recorded as ./data/path/to/file.json. As a result, software that uses the dataset must be run from the PDMX directory unless the paths are rewritten. To convert them to absolute paths on your machine, run the following commands, where /path/to/PDMX is the absolute filepath to the PDMX directory (the same filepath used in the previous command):

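PDMX_dir="/path/to/PDMX"
# The dot is escaped so that it matches a literal "." rather than any character.
# "sed -i" as written is GNU sed syntax; BSD/macOS sed needs: sed -i ''
sed -i "s+\./data+${PDMX_dir}/data+g" "${PDMX_dir}/PDMX.csv"
sed -i "s+\./metadata+${PDMX_dir}/metadata+g" "${PDMX_dir}/PDMX.csv"
find "${PDMX_dir}/subset_paths" -type f -exec sed -i "s+\./data+${PDMX_dir}/data+g" {} +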
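
If you prefer to leave the distributed files untouched, the same correction can be applied when the paths are read. The sketch below is one possible approach, not part of the PDMX code: it joins each stored relative path onto the PDMX directory and returns already-absolute paths unchanged, so it is safe to call on rows whether or not the sed commands have been run. The column list PATH_COLUMNS is an assumption based on the column descriptions, and the function names are hypothetical.

```python
from pathlib import Path

# Assumption: these are the PDMX.csv columns that hold file paths.
PATH_COLUMNS = (
    "path",
    "metadata",
    "best_path",
    "best_arrangement",
    "best_unique_arrangement",
)

def resolve(pdmx_dir, stored):
    """Turn a stored './data/...' or './metadata/...' path into an absolute one.

    Paths that are already absolute are returned unchanged.
    """
    p = Path(stored)
    return str(p if p.is_absolute() else Path(pdmx_dir) / p)

def resolve_row(pdmx_dir, row):
    """Return a copy of a CSV row (a dict) with its path columns resolved."""
    out = dict(row)
    for col in PATH_COLUMNS:
        if out.get(col):
            out[col] = resolve(pdmx_dir, out[col])
    return out
```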

Files

Files (1.6 GB)

Name	Size
PDMX.tar.gz (md5:660944735e4545d1e3594f42ba933e42)	1.6 GB

Additional details

Repository URL: https://github.com/pnlong/PDMX
Programming language: Python, Shell
Development Status: Active

Statistic	All versions	This version
Views	4,258	1,918
Downloads	4,785	393
Data volume	8.7 TB	791.1 GB
