There is a newer version of the record available.

Published September 21, 2025 | Version v1
Conference paper Open

Assessing the Alignment of Audio Representations With Timbre Similarity Ratings

Description

In psychoacoustics, "timbre spaces" map perceptual similarity ratings of instrument sounds onto low-dimensional embeddings via multidimensional scaling, but they scale poorly and cannot generalize. Recent results from audio (music and speech) quality assessment as well as image similarity have shown that deep learning provides emergent embeddings that align well with human perception while being largely free from these constraints. Although the existing timbre space data (only 2,614 pairwise ratings on 334 audio samples) is not large enough to train deep neural networks, it is sufficient and suitable for evaluating existing audio models. In this paper, we introduce metrics to assess the alignment of diverse audio representations with human judgements of timbre similarity by comparing both the absolute values and the rankings of embedding distances to human dissimilarity ratings. Our evaluation involves 3 signal-processing-based methods, 10 pretrained models, and a novel sound matching model from which three representations (including "style" embeddings inspired by the style transfer task in the vision domain) are extracted and evaluated. Our analysis reveals that CLAP-based models and the style embeddings from our sound matching model achieve marginal gains over the alternatives, yet MFCC remains competitive, underscoring gaps in the ability of current deep features to encode timbre similarity.
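The comparison described above, embedding distances set against human dissimilarity ratings by absolute value and by ranking, can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not define the metrics, so cosine distance, min-max scaling, mean absolute error, and Spearman rank correlation are all assumptions, as are the function names and the toy data.

```python
import numpy as np


def cosine_distance(a, b):
    """Cosine distance between two non-zero 1-D vectors (assumed distance)."""
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def average_ranks(x):
    """Ranks starting at 1; tied values share their average rank."""
    x = np.asarray(x, dtype=float)
    order = np.argsort(x, kind="stable")
    ranks = np.empty(len(x), dtype=float)
    ranks[order] = np.arange(1, len(x) + 1)
    for value in np.unique(x):
        tied = x == value
        ranks[tied] = ranks[tied].mean()
    return ranks


def min_max(x):
    """Rescale values to [0, 1] so distances and ratings share a range."""
    x = np.asarray(x, dtype=float)
    span = x.max() - x.min()
    return (x - x.min()) / span if span > 0 else np.zeros_like(x)


def alignment_scores(embeddings, pairs, ratings):
    """Return (mae, rho) for rated pairs.

    mae: mean absolute error between min-max scaled distances and ratings
         (an assumed stand-in for an absolute-value comparison).
    rho: Spearman rank correlation, computed as the Pearson correlation
         of the ranks (an assumed stand-in for a ranking comparison).
    """
    distances = np.array(
        [cosine_distance(embeddings[i], embeddings[j]) for i, j in pairs]
    )
    mae = float(np.mean(np.abs(min_max(distances) - min_max(ratings))))
    rho = float(
        np.corrcoef(average_ranks(distances), average_ranks(ratings))[0, 1]
    )
    return mae, rho


# Toy data (hypothetical): four 2-D "embeddings" and three rated pairs.
embeddings = np.array([[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [-1.0, 0.0]])
pairs = [(0, 1), (0, 2), (0, 3)]
ratings = [0.1, 0.5, 0.9]  # hypothetical human dissimilarity ratings

mae, rho = alignment_scores(embeddings, pairs, ratings)
print(f"MAE: {mae:.4f}, Spearman rho: {rho:.4f}")
```

A lower MAE and a higher rank correlation would both indicate closer agreement with the ratings. With SciPy available, `scipy.stats.spearmanr` could replace the hand-written rank correlation.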

Files

000083.pdf (755.4 kB)
md5:3a98457f1114e456a83c71c693a83bc1