HumMusQA: A Human-written Music Understanding QA Benchmark Dataset
Description
HumMusQA is a benchmark dataset for evaluating music understanding in Large Audio-Language Models (LALMs).
It contains 320 human-written multiple-choice questions created and validated by musically trained experts to test perception and interpretation of musical content.
This dataset accompanies the paper:
Benno Weck, Pablo Puentes, Andrea Poltronieri, Satyajeet Prabhu, and Dmitry Bogdanov. 2026. HumMusQA: A Human-written Music Understanding QA Benchmark Dataset. In Proceedings of the 4th Workshop on NLP for Music and Audio (NLP4MusA 2026), pages 58–67, Rabat, Morocco. Association for Computational Linguistics.
Files
HumMusQA.csv
Main dataset containing all questions.
Columns:
- Song link
- start time
- end time
- Question
- True answer
- Distractor 1
- Distractor 2
- Distractor 3
- Main Category
- Secondary Categories
- Difficulty
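As a rough illustration of how these columns could feed an evaluation, the sketch below turns each row into a multiple-choice item with shuffled options. It assumes the CSV header strings match the column names listed above exactly. The helper names (build_item, load_items) and the sample row are made up for this example and are not part of the dataset.

```python
import csv
import io
import random

# Column names as listed above; the exact header spelling in the file is an assumption.
OPTION_COLUMNS = ["True answer", "Distractor 1", "Distractor 2", "Distractor 3"]


def build_item(row, rng):
    """Turn one CSV row into a multiple-choice item with shuffled options."""
    options = [row[column] for column in OPTION_COLUMNS]
    rng.shuffle(options)
    return {
        "question": row["Question"],
        "options": options,
        # Position of the correct answer after shuffling.
        "answer_index": options.index(row["True answer"]),
    }


def load_items(csv_file, seed=0):
    """Read all rows from an open CSV file; the seed keeps the shuffling reproducible."""
    rng = random.Random(seed)
    return [build_item(row, rng) for row in csv.DictReader(csv_file)]


# Tiny in-memory stand-in for HumMusQA.csv (a made-up row, not taken from the dataset).
SAMPLE = (
    "Song link,start time,end time,Question,True answer,"
    "Distractor 1,Distractor 2,Distractor 3,"
    "Main Category,Secondary Categories,Difficulty\n"
    "https://example.org/track/1,0,30,Which instrument enters first?,"
    "Piano,Guitar,Violin,Flute,placeholder,placeholder,placeholder\n"
)

items = load_items(io.StringIO(SAMPLE))
for letter, option in zip("ABCD", items[0]["options"]):
    print(f"{letter}. {option}")
```

With the real file, the same function could be called as load_items(open("HumMusQA.csv", newline="", encoding="utf-8")). Shuffling the options guards against a model exploiting a fixed position for the correct answer.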
metadata.csv
Track metadata and licensing information.
Columns:
- track_id
- song_link
- name
- artist_name
- album_name
- license_ccurl
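The two CSV files appear to share the track URL: "Song link" in HumMusQA.csv and song_link in metadata.csv. Assuming these values match exactly (this page does not confirm it), a question can be paired with its track metadata as sketched below. The helper names and the sample rows are hypothetical.

```python
import csv
import io


def index_metadata(metadata_file):
    """Map each song_link to its metadata row."""
    return {row["song_link"]: row for row in csv.DictReader(metadata_file)}


def attach_metadata(questions_file, metadata_by_link):
    """Pair every question with the artist and license of its track, if found."""
    joined = []
    for row in csv.DictReader(questions_file):
        # Assumption: the link strings are identical in both files.
        meta = metadata_by_link.get(row["Song link"])
        joined.append({
            "question": row["Question"],
            "song_link": row["Song link"],
            "artist": meta["artist_name"] if meta else None,
            "license": meta["license_ccurl"] if meta else None,
        })
    return joined


# Made-up stand-ins for the two files, reduced to the columns used here.
QUESTIONS = (
    "Song link,Question,True answer\n"
    "https://example.org/track/1,Which instrument enters first?,Piano\n"
    "https://example.org/track/2,What is the meter?,3/4\n"
)
METADATA = (
    "track_id,song_link,name,artist_name,album_name,license_ccurl\n"
    "1,https://example.org/track/1,Example Track,Example Artist,Example Album,"
    "http://creativecommons.org/licenses/by-nc-sa/3.0/\n"
)

metadata_by_link = index_metadata(io.StringIO(METADATA))
joined = attach_metadata(io.StringIO(QUESTIONS), metadata_by_link)
```

Rows without a matching metadata entry keep None in the artist and license fields, which makes mismatched links easy to spot.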
audio_excerpts.zip
Trimmed audio excerpts corresponding to each question.
audio_full.zip
Full audio tracks.
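The internal layout and file naming of the two archives are not described here, so it can help to inspect them before extracting. The sketch below lists the members of a zip archive and counts them by extension, using only the Python standard library. The in-memory archive and its file names are hypothetical stand-ins for audio_excerpts.zip.

```python
import io
import zipfile
from collections import Counter


def summarize_archive(zip_source):
    """Return member names and a count per file extension, without extracting."""
    with zipfile.ZipFile(zip_source) as archive:
        names = [info.filename for info in archive.infolist() if not info.is_dir()]
    extensions = Counter(
        name.rsplit(".", 1)[-1].lower() if "." in name else "" for name in names
    )
    return names, extensions


# Hypothetical in-memory archive; the real file names may look different.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("excerpt_a.mp3", b"")
    archive.writestr("excerpt_b.mp3", b"")
buffer.seek(0)

names, extensions = summarize_archive(buffer)
print(names, dict(extensions))
```

For the downloaded files, summarize_archive("audio_excerpts.zip") accepts a path in place of the buffer.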
Licensing
Each track is released under its own Creative Commons license, listed in the license_ccurl column of metadata.csv.
Users must comply with the license associated with each track.
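To see which licenses are in play before using the audio, the license URLs can be tallied. This sketch assumes license_ccurl holds standard Creative Commons URLs of the form .../licenses/<code>/<version>/. The helper names and sample rows are made up for illustration.

```python
import csv
import io
from collections import Counter
from urllib.parse import urlparse


def license_code(url):
    """Extract the license code (e.g. 'by-nc-sa') from a Creative Commons URL.

    Returns None for URLs that do not follow the /licenses/<code>/... pattern.
    """
    parts = [part for part in urlparse(url).path.split("/") if part]
    if len(parts) >= 2 and parts[0] == "licenses":
        return parts[1]
    return None


def is_non_commercial(code):
    """True if the license code contains the NC (NonCommercial) element."""
    return code is not None and "nc" in code.split("-")


def summarize_licenses(metadata_file):
    """Count tracks per license code."""
    return Counter(
        license_code(row["license_ccurl"]) for row in csv.DictReader(metadata_file)
    )


# Made-up metadata rows, reduced to the columns used here.
METADATA = (
    "track_id,song_link,license_ccurl\n"
    "1,https://example.org/track/1,http://creativecommons.org/licenses/by-nc-sa/3.0/\n"
    "2,https://example.org/track/2,http://creativecommons.org/licenses/by/4.0/\n"
)

counts = summarize_licenses(io.StringIO(METADATA))
print(counts)
```

A tally like this only gives an overview. The license text behind each URL remains the authoritative statement of what is permitted.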
Citation
If you use this dataset, please cite:
@inproceedings{weck-etal-2026-hummusqa,
title = "{H}um{M}us{QA}: A Human-written Music Understanding {QA} Benchmark Dataset",
author = "Weck, Benno and
Puentes, Pablo and
Poltronieri, Andrea and
Prabhu, Satyajeet and
Bogdanov, Dmitry",
editor = "Epure, Elena V. and
Oramas, Sergio and
Doh, SeungHeon and
Ramoneda, Pedro and
Kruspe, Anna and
Sordo, Mohamed",
booktitle = "Proceedings of the 4th Workshop on {NLP} for Music and Audio ({NLP}4{M}us{A} 2026)",
month = mar,
year = "2026",
address = "Rabat, Morocco",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2026.nlp4musa-1.9/",
doi = "10.18653/v1/2026.nlp4musa-1.9",
pages = "58--67",
ISBN = "979-8-89176-369-2",
abstract = "The evaluation of music understanding in Large Audio-Language Models (LALMs) requires a rigorously defined benchmark that truly tests whether models can perceive and interpret music, a standard that current data methodologies frequently fail to meet. This paper introduces a meticulously structured approach to music evaluation, proposing a new dataset of 320 hand-written questions curated and validated by experts with musical training, arguing that such focused, manual curation is superior for probing complex audio comprehension. To demonstrate the use of the dataset, we benchmark six state-of-the-art LALMs and additionally test their robustness to uni-modal shortcuts."
}
Additional details
Related works
- Is described by: Conference paper, 10.18653/v1/2026.nlp4musa-1.9 (DOI)