TrustMus Benchmark: The Role of Large Language Models in Musicology: Are We Ready to Trust the Machines?
Description
TrustMus is a rigorously validated benchmark for assessing the accuracy and reliability of large language models (LLMs) in musicology. It comprises 400 human-validated multiple-choice questions spanning four thematic areas: People (Ppl); Instruments and Technology (I&T); Genres, Forms, and Theory (Thr); and Culture and History (C&H).
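For illustration, here is a minimal sketch of how one such multiple-choice item might be represented and scored. The category codes follow the abbreviations above, but the field names and schema are hypothetical assumptions, not the dataset's actual format:

```python
# Minimal sketch of a TrustMus-style multiple-choice record.
# Field names and layout are hypothetical illustrations,
# not the dataset's actual schema.
import json

record = {
    "question": "Which composer wrote the opera 'L'Orfeo' (1607)?",
    "choices": [
        "Claudio Monteverdi",
        "Jacopo Peri",
        "Giulio Caccini",
        "Heinrich Schütz",
    ],
    "answer": 0,        # index of the correct choice
    "category": "Ppl",  # one of: Ppl, I&T, Thr, C&H
}

def is_correct(record: dict, model_choice: int) -> bool:
    """Return True if the model picked the correct choice index."""
    return model_choice == record["answer"]

print(json.dumps(record, indent=2, ensure_ascii=False))
print(is_correct(record, 0))  # True
```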
The questions are derived from The Grove Dictionary Online using a semi-automated methodology: initial questions are generated with a fine-tuned retrieval-augmented generation (RAG) model, filtered through a series of automated checks, and finally validated through expert human annotation. TrustMus provides a critical resource for researchers and developers aiming to evaluate and improve LLM performance in this specialized field of musicology.
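As a rough illustration of that three-stage flow, here is a schematic sketch; every function name and filter rule is a hypothetical placeholder, not the authors' actual implementation:

```python
# Schematic sketch of the semi-automated question pipeline described above:
# RAG-based generation -> automated filtering -> expert human validation.
# All function names and filter rules are hypothetical placeholders.
from typing import Callable

def generate_candidates(source_articles: list[str],
                        rag_generate: Callable[[str], dict]) -> list[dict]:
    """Stage 1: produce candidate MCQs from encyclopedia articles via a RAG model."""
    return [rag_generate(article) for article in source_articles]

def passes_automated_checks(q: dict) -> bool:
    """Stage 2: cheap structural filters, e.g. exactly four distinct choices
    and a valid answer index (illustrative rules only)."""
    choices = q.get("choices", [])
    return (
        len(choices) == 4
        and len(set(choices)) == 4
        and q.get("answer") in range(4)
        and bool(q.get("question", "").strip())
    )

def human_validate(q: dict,
                   annotator_approves: Callable[[dict], bool]) -> bool:
    """Stage 3: keep only questions approved by an expert annotator."""
    return annotator_approves(q)

def build_benchmark(source_articles, rag_generate, annotator_approves):
    """Run all three stages and return the surviving questions."""
    candidates = generate_candidates(source_articles, rag_generate)
    filtered = [q for q in candidates if passes_automated_checks(q)]
    return [q for q in filtered if human_validate(q, annotator_approves)]
```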
TrustMus is introduced in the following paper:
BibTeX Citation:
@inproceedings{ramoneda2024trustmus,
  title     = {The Role of Large Language Models in Musicology: Are We Ready to Trust the Machines?},
  author    = {Ramoneda, Pedro and Parada-Cabaleiro, Emilia and Weck, Benno and Serra, Xavier},
  booktitle = {Proceedings of the 3rd Workshop on NLP for Music and Audio (NLP4MusA)},
  year      = {2024},
  month     = nov,
  address   = {San Francisco, USA},
  note      = {Co-located with ISMIR 2024}
}