Published February 2, 2026 | Version v1

HumMusQA: A Human-written Music Understanding QA Benchmark Dataset

  • 1. Pompeu Fabra University
  • 2. Universitat Autònoma de Barcelona

Description

HumMusQA is a benchmark dataset for evaluating music understanding in Large Audio-Language Models (LALMs).
It contains 320 human-written multiple-choice questions created and validated by musically trained experts to test perception and interpretation of musical content.

This dataset accompanies the paper:

Benno Weck, Pablo Puentes, Andrea Poltronieri, Satyajeet Prabhu, and Dmitry Bogdanov. 2026. HumMusQA: A Human-written Music Understanding QA Benchmark Dataset. In Proceedings of the 4th Workshop on NLP for Music and Audio (NLP4MusA 2026), pages 58–67, Rabat, Morocco. Association for Computational Linguistics.

Files

HumMusQA.csv
Main dataset containing all questions.

Columns:

  • Song link

  • start time

  • end time

  • Question

  • True answer

  • Distractor 1

  • Distractor 2

  • Distractor 3

  • Main Category

  • Secondary Categories

  • Difficulty
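
As an illustration of how the question columns might be used, the sketch below turns each row of HumMusQA.csv into a shuffled four-option prompt and records which letter holds the true answer. It assumes the CSV header matches the column names listed above exactly (including capitalization). The helper names `build_item` and `load_items`, the prompt layout, and the single sample row are invented for this example and are not part of the dataset.

```python
import csv
import io
import random

LETTERS = "ABCD"

# Invented placeholder row, only so the sketch runs without the real file.
SAMPLE_CSV = (
    "Question,True answer,Distractor 1,Distractor 2,Distractor 3\n"
    "Placeholder question?,right,wrong 1,wrong 2,wrong 3\n"
)


def build_item(row, rng):
    """Return (prompt, answer_letter) for one row, with the options shuffled."""
    options = [row["True answer"]] + [row[f"Distractor {i}"] for i in (1, 2, 3)]
    rng.shuffle(options)
    # Letter of the position where the true answer ended up after shuffling.
    answer = LETTERS[options.index(row["True answer"])]
    option_lines = [f"{letter}. {text}" for letter, text in zip(LETTERS, options)]
    prompt = "\n".join([row["Question"]] + option_lines)
    return prompt, answer


def load_items(csv_file, seed=0):
    """Read question rows from an open CSV file and build one item per row."""
    rng = random.Random(seed)  # fixed seed keeps the option order reproducible
    return [build_item(row, rng) for row in csv.DictReader(csv_file)]


if __name__ == "__main__":
    for prompt, answer in load_items(io.StringIO(SAMPLE_CSV)):
        print(prompt)
        print("Correct option:", answer)
```

With the real file, the same function could be called on `open("HumMusQA.csv", newline="", encoding="utf-8")` instead of the in-memory sample. Shuffling the options guards against a model that simply favors the first position.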

metadata.csv
Track metadata and licensing information.

Columns:

  • track_id

  • song_link

  • name

  • artist_name

  • album_name

  • license_ccurl

audio_excerpts.zip
Trimmed audio excerpts corresponding to each question.

audio_full.zip
Full audio tracks.

Licensing

Each track follows its respective Creative Commons license, specified in metadata.csv.
Users must comply with the license associated with each track.
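
One way to check which licenses apply to the questions is to look up each question's track in metadata.csv, as in the sketch below. It assumes that the "Song link" values in HumMusQA.csv match the "song_link" values in metadata.csv verbatim, which the description does not state explicitly. The function names are hypothetical.

```python
import csv
from collections import Counter


def license_by_link(metadata_file):
    """Map each song_link in metadata.csv to its license_ccurl."""
    return {
        row["song_link"]: row["license_ccurl"]
        for row in csv.DictReader(metadata_file)
    }


def licenses_for_questions(questions_file, metadata_file):
    """Count questions per license URL and collect links with no metadata match."""
    licenses = license_by_link(metadata_file)
    counts = Counter()
    unmatched = []
    for row in csv.DictReader(questions_file):
        link = row["Song link"]  # assumed to equal song_link in metadata.csv
        if link in licenses:
            counts[licenses[link]] += 1
        else:
            unmatched.append(link)
    return counts, unmatched
```

Both arguments are open text files, for example `open("HumMusQA.csv", newline="", encoding="utf-8")` and `open("metadata.csv", newline="", encoding="utf-8")`. A non-empty `unmatched` list would indicate that the join assumption does not hold for those rows.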

Citation

If you use this dataset, please cite:

@inproceedings{weck-etal-2026-hummusqa,
    title = "{H}um{M}us{QA}: A Human-written Music Understanding {QA} Benchmark Dataset",
    author = "Weck, Benno  and
      Puentes, Pablo  and
      Poltronieri, Andrea  and
      Prabhu, Satyajeet  and
      Bogdanov, Dmitry",
    editor = "Epure, Elena V.  and
      Oramas, Sergio  and
      Doh, SeungHeon  and
      Ramoneda, Pedro  and
      Kruspe, Anna  and
      Sordo, Mohamed",
    booktitle = "Proceedings of the 4th Workshop on {NLP} for Music and Audio ({NLP}4{M}us{A} 2026)",
    month = mar,
    year = "2026",
    address = "Rabat, Morocco",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2026.nlp4musa-1.9/",
    doi = "10.18653/v1/2026.nlp4musa-1.9",
    pages = "58--67",
    ISBN = "979-8-89176-369-2",
    abstract = "The evaluation of music understanding in Large Audio-Language Models (LALMs) requires a rigorously defined benchmark that truly tests whether models can perceive and interpret music, a standard that current data methodologies frequently fail to meet. This paper introduces a meticulously structured approach to music evaluation, proposing a new dataset of 320 hand-written questions curated and validated by experts with musical training, arguing that such focused, manual curation is superior for probing complex audio comprehension. To demonstrate the use of the dataset, we benchmark six state-of-the-art LALMs and additionally test their robustness to uni-modal shortcuts."
}

Files (1.1 GB)

Size and checksum of each file:

  • 475.0 MB, md5:ddc558760480cf6048f300c1a91184f4
  • 617.9 MB, md5:f10219538ef414e26ff2cb32e5bb2494
  • 62.9 kB, md5:a5c00b4d5135d403a6c0e22c8be7808c
  • 15.2 kB, md5:55b6d6ac5437b05a727301e5bed48d16
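
A downloaded file can be checked against the md5 checksums listed above with Python's standard `hashlib` module, as sketched below. The listing gives sizes and checksums without file names, so which checksum to pass for a given file is left to the reader. The function names are hypothetical.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the md5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file's md5 with a listed value such as 'md5:ddc5...'."""
    expected_hex = expected.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected_hex
```

Reading in chunks keeps memory use flat for the larger archives, which are several hundred megabytes each.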

Additional details

Related works

Is described by
Conference paper: 10.18653/v1/2026.nlp4musa-1.9 (DOI)