Published November 2024 | Version v1
Dataset Open

Discogs-VI: A musical version identification dataset based on public editorial metadata

  • 1. Pompeu Fabra University

Description

Discogs-VI is a dataset of musical version metadata and precomputed audio representations, created for research on version identification (VI), also known as cover song identification (CSI). It was built from editorial metadata in the public Discogs music database: version relationships among millions of tracks were identified by matching artist credits, writer credits, and track titles. The identified versions make up the Discogs-VI dataset. A large portion of them are mapped to official music uploads on YouTube, forming the Discogs-VI-YT subset.

In the VI literature, a set of tracks that are versions of each other is called a clique. Discogs-VI contains about 1.9 million versions belonging to around 348,000 cliques, while Discogs-VI-YT includes 493,000 versions across 98,000 cliques.
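To make the clique terminology concrete, the following is a minimal sketch of grouping version records by clique and of the average clique size implied by the counts above. The record layout and the field names (`clique_id`, `version_id`) are hypothetical and chosen for illustration only; they are not the dataset's actual schema.

```python
from collections import defaultdict

# Hypothetical toy records. The field names are assumptions made for
# illustration and do not reflect the real Discogs-VI file format.
toy_versions = [
    {"clique_id": "C1", "version_id": "V1"},
    {"clique_id": "C1", "version_id": "V2"},
    {"clique_id": "C2", "version_id": "V3"},
    {"clique_id": "C1", "version_id": "V4"},
    {"clique_id": "C2", "version_id": "V5"},
]


def group_into_cliques(records):
    """Map each clique id to the list of version ids that belong to it."""
    cliques = defaultdict(list)
    for record in records:
        cliques[record["clique_id"]].append(record["version_id"])
    return dict(cliques)


def mean_clique_size(n_versions, n_cliques):
    """Average number of versions per clique."""
    return n_versions / n_cliques


toy_cliques = group_into_cliques(toy_versions)

# Approximate counts quoted in the description above.
full_mean = mean_clique_size(1_900_000, 348_000)  # roughly 5.5 versions per clique
yt_mean = mean_clique_size(493_000, 98_000)       # roughly 5.0 versions per clique
```

Because the published counts are rounded, these averages are only indicative of the typical clique size.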

Files

intermediary.zip

Files (10.2 GB)

Checksum (MD5)                          Size
md5:bae0c27e8f9148049ddc1b1057a6d340    8.8 GB
md5:136c671dba2d2f644b882e31c3e289e8    20.9 kB
md5:4bd20bc4163d1e69f5b1cd151e8eb934    1.4 GB
md5:70c18c8b4ee18abed993858bb9060766    7.8 kB
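The MD5 checksums listed above can be used to check that a download completed without corruption. The following is a minimal sketch using Python's standard `hashlib` module. The helper names `md5_of_file` and `matches_checksum` are illustrative, and the mapping between each checksum and a local file name is left to the reader, since this listing does not state it.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks.

    Chunked reading keeps memory use flat for multi-gigabyte archives.
    """
    digest = hashlib.md5()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file against a checksum given as 'md5:<hex>' or bare '<hex>'."""
    expected_hex = expected.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected_hex
```

For example, `matches_checksum(local_path, "md5:bae0c27e8f9148049ddc1b1057a6d340")` would return `True` only if `local_path` (a placeholder for wherever the 8.8 GB file was saved) has that digest.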

Additional details

Software

Repository URL
https://mtg.github.io/discogs-vi-dataset/
Development Status
Active

References

  • R. O. Araz, X. Serra, and D. Bogdanov, "Discogs-VI: A musical version identification dataset based on public editorial metadata," in Proceedings of the 25th International Society for Music Information Retrieval Conference (ISMIR), 2024.