Assessing the Alignment of Audio Representations With Timbre Similarity Ratings

Haokun Tian; Stefan Lattner; Charalampos Saitis

doi:10.5281/zenodo.17706569

There is a newer version of the record available.

Published September 21, 2025 | Version v1

Conference paper | Open access

In psychoacoustics, so-called "timbre spaces" map perceptual similarity ratings of instrument sounds onto low-dimensional embeddings via multidimensional scaling, but they scale poorly and cannot generalize. Recent results from audio (music and speech) quality assessment and from image similarity have shown that deep learning yields emergent embeddings that align well with human perception while being largely free from these constraints. Although the existing "timbre space" data is too small to train deep neural networks (only 2,614 pairwise ratings on 334 audio samples), it is sufficient and suitable for evaluating existing audio models. In this paper, we introduce metrics to assess the alignment of diverse audio representations with human judgements of timbre similarity by comparing both the absolute values and the rankings of embedding distances to human dissimilarity ratings. Our evaluation covers 3 signal-processing-based methods, 10 pretrained models, and a novel sound matching model from which three representations (including "style" embeddings inspired by the style transfer task in the vision domain) are extracted and evaluated. Our analysis reveals that CLAP-based models and the style embeddings from our sound matching model achieve marginal gains over alternatives, yet MFCC remains competitive, underscoring gaps in the ability of current deep features to encode timbre similarity.
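The idea of comparing both absolute values and rankings of embedding distances to human ratings can be illustrated with a minimal sketch. This is not the paper's implementation: the abstract does not name the exact metrics, so the choices here (cosine distance, min-max normalization, mean absolute error for absolute values, Spearman rank correlation for rankings) are assumptions, and all function names and the toy data are hypothetical.

```python
import math

def cosine_distance(u, v):
    """Cosine distance between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def min_max(values):
    """Rescale values to [0, 1]; assumes they are not all equal."""
    lo, hi = min(values), max(values)
    return [(x - lo) / (hi - lo) for x in values]

def average_ranks(values):
    """Ranks starting at 1; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def alignment_scores(embeddings, rated_pairs):
    """Return (MAE, Spearman rho) between embedding distances and ratings.

    Assumed metrics, chosen for illustration only:
    MAE on min-max-normalized values (absolute agreement) and
    Spearman correlation, i.e. Pearson on ranks (ranking agreement).
    """
    distances = [cosine_distance(embeddings[a], embeddings[b])
                 for a, b, _ in rated_pairs]
    ratings = [r for _, _, r in rated_pairs]
    d_norm, r_norm = min_max(distances), min_max(ratings)
    mae = sum(abs(d - r) for d, r in zip(d_norm, r_norm)) / len(ratings)
    rho = pearson(average_ranks(distances), average_ranks(ratings))
    return mae, rho

# Hypothetical toy data: three sounds and three rated pairs
# (sound_a, sound_b, human dissimilarity rating).
toy_embeddings = {"a": [1.0, 0.0], "b": [2.0, 1.0], "c": [0.0, 1.0]}
toy_pairs = [("a", "b", 0.1), ("a", "c", 0.9), ("b", "c", 0.5)]
mae, rho = alignment_scores(toy_embeddings, toy_pairs)
```

A lower MAE indicates closer agreement in absolute values, and a rho closer to 1 indicates that the representation orders pairs the way listeners do. The two can disagree, which is one reason to report both.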

Files

000083.pdf (755.4 kB), md5:3a98457f1114e456a83c71c693a83bc1

Resource type

Conference paper

Publisher

ISMIR

Imprint

Proceedings of the 26th International Society for Music Information Retrieval Conference, 724-732. Daejeon, South Korea.

Conference

International Society for Music Information Retrieval Conference (ISMIR 2025), Daejeon, South Korea and Online, September 21-25, 2025

License: Creative Commons Attribution 4.0 International

The Creative Commons Attribution license allows redistribution and reuse of a licensed work on the condition that the creator is appropriately credited.

Technical metadata

Created: November 25, 2025
Modified: November 25, 2025
