Published October 5, 2015 | Version 1.0
Dataset Open

SAS: Semantic Artist Similarity Dataset

  • 1. Music Technology Group, Universitat Pompeu Fabra, Barcelona, Spain


The Semantic Artist Similarity dataset consists of two datasets of artist entities with their corresponding biography texts, together with the list of the top-10 most similar artists within each dataset, used as ground truth. It comprises a corpus of 268 artists and a larger one of 2,336 artists, both gathered in March 2015. The former is mapped to the MIREX Audio and Music Similarity evaluation dataset, so that its similarity judgments can be used as ground truth. For the latter corpus we use the similarity between artists as provided by the API. For every artist there is a list of the top-10 most related artists. In the MIREX dataset, 188 artists have at least 10 similar artists and the other 80 have fewer than 10. In the API dataset, all artists have a list of 10 similar artists.

There are 4 files in the dataset.

mirex_gold_top10.txt and lastfmapi_gold_top10.txt contain the top-10 lists of similar artists for every artist in the respective dataset. Artists are identified by MusicBrainz ID (mbid). Each file has one line per artist: the artist's mbid, a tab, and then the mbids of its top-10 related artists separated by spaces.

artist_mbid \t artist_mbid_top10_list_separated_by_spaces \n
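
The line format above can be read with a few lines of Python. The following is a minimal sketch, not code shipped with the dataset: the function names parse_top10, load_top10 and count_short_lists are hypothetical, and UTF-8 encoding is an assumption.

```python
from typing import Dict, Iterable, List


def parse_top10(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Map each artist mbid to the list of its related artist mbids."""
    similar: Dict[str, List[str]] = {}
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines
        # Everything before the first tab is the artist; the rest is the list.
        artist, _, related = line.partition("\t")
        similar[artist.strip()] = related.split()
    return similar


def load_top10(path: str) -> Dict[str, List[str]]:
    """Read a gold file from disk (UTF-8 is assumed, not documented)."""
    with open(path, encoding="utf-8") as handle:
        return parse_top10(handle)


def count_short_lists(similar: Dict[str, List[str]], k: int = 10) -> int:
    """Count artists whose list holds fewer than k related artists."""
    return sum(1 for related in similar.values() if len(related) < k)
```

Calling count_short_lists(load_top10("mirex_gold_top10.txt")) would be one way to check how many MIREX artists have fewer than 10 similar artists.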

mb2uri_mirex and mb2uri_lastfmapi.txt contain the list of artists in each dataset. Each line has three tab-separated fields: the MusicBrainz ID, the name of the artist, and the DBpedia URI.

artist_mbid \t lastfm_name \t dbpedia_uri \n
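
A mapping file in this three-field format could be loaded as sketched below. The ArtistRecord type and the parse_mb2uri function are hypothetical names introduced for illustration, and rejecting lines that do not have exactly three fields is an assumption about how malformed input should be handled.

```python
from typing import Dict, Iterable, NamedTuple


class ArtistRecord(NamedTuple):
    mbid: str         # MusicBrainz ID
    name: str         # artist name
    dbpedia_uri: str  # DBpedia URI


def parse_mb2uri(lines: Iterable[str]) -> Dict[str, ArtistRecord]:
    """Index the mapping file by MusicBrainz ID."""
    artists: Dict[str, ArtistRecord] = {}
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines
        fields = [field.strip() for field in line.split("\t")]
        if len(fields) != 3:
            raise ValueError(f"line {number}: expected 3 fields, got {len(fields)}")
        record = ArtistRecord(*fields)
        artists[record.mbid] = record
    return artists
```

An open file handle can be passed directly as lines, since iterating over a text file yields one line at a time.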

The dataset also contains 2 folders with the biography texts, one per dataset. Each .txt file in the biography folders is named after the MusicBrainz ID of the artist it describes. Biographies were gathered from the wiki page of every artist.
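
Because each biography file is named after its artist's MusicBrainz ID, a folder can be read into a dictionary keyed by mbid. This is a minimal sketch under assumptions: load_biographies is a hypothetical helper, the folder path is supplied by the caller (the folder names are not stated here), and the texts are assumed to be UTF-8.

```python
from pathlib import Path
from typing import Dict, Union


def load_biographies(folder: Union[str, Path]) -> Dict[str, str]:
    """Return {mbid: biography text} for every .txt file in folder."""
    biographies: Dict[str, str] = {}
    for path in sorted(Path(folder).glob("*.txt")):
        # The file name without its extension is the artist's mbid.
        biographies[path.stem] = path.read_text(encoding="utf-8")
    return biographies
```

The resulting keys can be matched against the mbids in the top-10 and mapping files to join biographies with similarity lists and DBpedia URIs.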

Using this dataset

We would highly appreciate it if scientific publications of works partly based on the Semantic Artist Similarity dataset cited the following publication:

Oramas, S., Sordo, M., Espinosa-Anke, L., & Serra, X. (In Press). A Semantic-based Approach for Artist Similarity. 16th International Society for Music Information Retrieval Conference.

We are interested in knowing if you find our datasets useful! If you use our dataset, please email us and tell us about your research.


Files (4.2 MB)


Additional details

Related works

10230/26278 (Handle)