There is a newer version of the record available.

Published September 21, 2025 | Version v1
Conference paper Open

Emergent Musical Properties of a Transformer Under Contrastive Self-Supervised Learning

Description

In music information retrieval (MIR), contrastive self-supervised learning (SSL) is effective for global tasks such as automatic tagging. However, for local tasks such as chord estimation, it is widely assumed that a contrastive pretext task is inadequate and that a more sophisticated SSL approach, such as masked modeling, is necessary. Our paper challenges this assumption by revealing the potential of contrastive SSL paired with a transformer in local MIR tasks. We consider a vision transformer with one-dimensional patches in the time-frequency domain (ViT-1D) and train it with simple contrastive SSL through the normalized temperature-scaled cross-entropy loss (NT-Xent). Although NT-Xent operates only on the class token, we observe that informative musical properties emerge in ViT-1D's sequence tokens, potentially thanks to weight sharing. On global tasks, the temporal average of the sequence tokens offers a performance boost compared to the class token. On local tasks, the sequence tokens perform unexpectedly well despite not being specifically trained for them. Furthermore, high-level musical features such as onsets and chord changes emerge from layerwise self-similarity matrices and attention maps. Our paper does not aim to outperform the state of the art; rather, it advances the musical interpretation of transformers and sheds light on some overlooked abilities of contrastive SSL paired with transformers for sequence modeling in MIR.
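To make the quantities named in the abstract concrete, the following is a minimal sketch in plain Python of an NT-Xent loss, the temporal average of sequence tokens, and a cosine self-similarity matrix. It is not the authors' implementation: the function names (`nt_xent`, `temporal_average`, `self_similarity`), the default temperature of 0.1, and the representation of embeddings as lists of floats are all assumptions made for illustration. In the setting described above, `view_a[i]` and `view_b[i]` would stand for the class-token embeddings of two augmented views of the same audio excerpt.

```python
import math


def _normalize(vector):
    """Scale a non-zero vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def nt_xent(view_a, view_b, temperature=0.1):
    """NT-Xent loss for two lists of embeddings (hypothetical helper).

    view_a[i] and view_b[i] form a positive pair; every other embedding
    in the combined batch of size 2N acts as a negative.
    """
    n = len(view_a)
    z = [_normalize(v) for v in view_a + view_b]
    total = 0.0
    for i in range(2 * n):
        positive = (i + n) % (2 * n)  # index of the matching view
        # Temperature-scaled cosine similarities to all other samples.
        logits = [_dot(z[i], z[j]) / temperature
                  for j in range(2 * n) if j != i]
        positive_logit = _dot(z[i], z[positive]) / temperature
        # Numerically stable log-sum-exp over the denominator terms.
        peak = max(logits)
        log_denominator = peak + math.log(
            sum(math.exp(value - peak) for value in logits))
        total += log_denominator - positive_logit
    return total / (2 * n)


def temporal_average(tokens):
    """Mean over time of a sequence of token vectors (one per time step)."""
    steps = len(tokens)
    return [sum(column) / steps for column in zip(*tokens)]


def self_similarity(tokens):
    """Cosine self-similarity matrix of a sequence of token vectors."""
    z = [_normalize(t) for t in tokens]
    return [[_dot(a, b) for b in z] for a in z]
```

Under these assumptions, the loss is small when matching views are close and large when they are not, and block structure in the self-similarity matrix of one layer's sequence tokens is the kind of pattern in which changes such as onsets or chord boundaries could become visible.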

Files

000028.pdf (1.6 MB)
md5:97dba35c0332f53da1d9696a9e4a045c