Published September 21, 2025 | Version v1
Conference paper
Open
Emergent Musical Properties of a Transformer Under Contrastive Self-Supervised Learning
Authors/Creators
Description
In music information retrieval (MIR), contrastive self-supervised learning is effective for global tasks such as automatic tagging.
However, for local tasks such as chord estimation, it is widely assumed that a contrastive pretext task is inadequate and that more sophisticated SSL, such as masked modeling, is necessary.
Our paper challenges this assumption by revealing the potential of contrastive SSL paired with a transformer in local MIR tasks.
We consider a vision transformer with one-dimensional patches in the time-frequency domain (ViT-1D) and train it with simple contrastive SSL through normalized temperature-scaled cross-entropy loss (NT-Xent).
Although NT-Xent operates only over the class token, we observe that, potentially thanks to weight sharing, informative musical properties emerge in ViT-1D's sequence tokens.
On global tasks, their temporal average offers a performance boost compared to the class token.
On local tasks, they perform unexpectedly well, despite not being specifically trained for them.
Furthermore, high-level musical features such as onsets and chord changes emerge from layerwise self-similarity matrices and attention maps.
Our paper does not aim to outperform the state of the art but advances the musical interpretation of transformers and sheds light on some overlooked abilities of contrastive SSL paired with transformers for sequence modeling in MIR.
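For readers unfamiliar with the two quantities named in the abstract, the following is a minimal NumPy sketch of an NT-Xent loss over paired embeddings and of a self-similarity matrix over a token sequence. It is not the paper's implementation: the function names, the default temperature of 0.1, and the use of cosine similarity for the self-similarity matrix are assumptions made for illustration.

```python
import numpy as np

# Illustrative sketch only. Function names, the default temperature, and the
# cosine-similarity choice are assumptions, not taken from the paper.


def l2_normalize(x, eps=1e-12):
    """Scale each row of x to unit Euclidean norm."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)


def nt_xent_loss(z_a, z_b, temperature=0.1):
    """NT-Xent loss for two batches of paired embeddings, each of shape (N, D).

    Row i of z_a and row i of z_b form a positive pair; every other row in
    the combined batch of 2N embeddings serves as a negative.
    """
    n = z_a.shape[0]
    z = l2_normalize(np.concatenate([z_a, z_b], axis=0))  # (2N, D)
    logits = z @ z.T / temperature  # cosine similarity scaled by temperature
    np.fill_diagonal(logits, -np.inf)  # a sample is never its own candidate
    # Index of the positive for each row: i <-> i + N.
    positives = np.concatenate([np.arange(n, 2 * n), np.arange(0, n)])
    # Numerically stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_prob[np.arange(2 * n), positives].mean())


def self_similarity(tokens):
    """Cosine self-similarity matrix of a (T, D) token sequence, shape (T, T)."""
    unit = l2_normalize(tokens)
    return unit @ unit.T
```

In a setup like the one described, `nt_xent_loss` would be applied to the class tokens of two augmented views of the same excerpts, while `self_similarity` would be applied to the sequence tokens of a single layer to inspect block structure along time.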
Files
| Name | Size | Checksum |
|---|---|---|
| 000028.pdf | 1.6 MB | md5:97dba35c0332f53da1d9696a9e4a045c |