Emergent Musical Properties of a Transformer Under Contrastive Self-Supervised Learning

Yuexuan KONG; Gabriel Meseguer-Brocal; Vincent Lostanlen; Mathieu Lagrange; Romain Hennequin

doi:10.5281/zenodo.17706383

There is a newer version of the record available.

Published September 21, 2025 | Version v1

Conference paper Open

In music information retrieval (MIR), contrastive self-supervised learning (SSL) is effective for global tasks such as automatic tagging. However, for local tasks such as chord estimation, it is widely assumed that a contrastive pretext task is inadequate and that more sophisticated SSL, such as masked modeling, is necessary. Our paper challenges this assumption by revealing the potential of contrastive SSL paired with a transformer in local MIR tasks. We consider a vision transformer with one-dimensional patches in the time-frequency domain (ViT-1D) and train it with simple contrastive SSL through the normalized temperature-scaled cross-entropy loss (NT-Xent). Although NT-Xent operates only on the class token, we observe that, potentially thanks to weight sharing, informative musical properties emerge in ViT-1D's sequence tokens. On global tasks, their temporal average offers a performance boost compared to the class token. On local tasks, they perform unexpectedly well despite not being specifically trained for them. Furthermore, high-level musical features such as onsets and chord changes emerge from layerwise self-similarity matrices and attention maps. Our paper does not aim to outperform the state of the art; rather, it advances the musical interpretation of transformers and sheds light on some overlooked abilities of contrastive SSL paired with transformers for sequence modeling in MIR.
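To make two of the ingredients named in the abstract concrete, the following is a minimal NumPy sketch, not the authors' implementation. It shows a standard formulation of NT-Xent applied to one embedding per view (standing in for the class token) and a cosine self-similarity matrix computed over a sequence of tokens. The function names, the temperature value of 0.1, and the array shapes are assumptions made for illustration; the record does not specify them.

```python
import numpy as np


def nt_xent_loss(z_a, z_b, temperature=0.1):
    """Standard NT-Xent over paired embeddings (illustrative sketch).

    z_a, z_b: arrays of shape (N, D). Row i of z_a and row i of z_b are
    two views of the same excerpt (a positive pair); all other rows in
    the batch act as negatives. The temperature is an assumed value.
    """
    n = z_a.shape[0]
    z = np.concatenate([z_a, z_b], axis=0)              # (2N, D)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)    # unit norm -> cosine similarity
    sim = (z @ z.T) / temperature                       # (2N, 2N) scaled similarities
    np.fill_diagonal(sim, -np.inf)                      # a sample is never its own negative

    # Index of the positive partner for each of the 2N rows.
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(n)])

    # Numerically stable log-softmax over each row.
    row_max = sim.max(axis=1, keepdims=True)
    log_prob = sim - row_max - np.log(np.exp(sim - row_max).sum(axis=1, keepdims=True))
    return float(-log_prob[np.arange(2 * n), pos].mean())


def self_similarity(tokens):
    """Cosine self-similarity matrix of sequence tokens: (T, D) -> (T, T)."""
    unit = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    return unit @ unit.T


def temporal_average(tokens):
    """Mean of the sequence tokens over time: (T, D) -> (D,)."""
    return tokens.mean(axis=0)
```

In this sketch the loss only ever sees one vector per view, which mirrors the abstract's point that NT-Xent operates on the class token alone; `self_similarity` and `temporal_average` would be applied afterwards to the sequence tokens of a trained model for analysis.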

Files

Name	Size	Checksum
000028.pdf	1.6 MB	md5:97dba35c0332f53da1d9696a9e4a045c

	All versions	This version
Views	180	110
Downloads	78	56
Data volume	138.1 MB	103.2 MB

Resource type

Conference paper

Publisher

ISMIR

Imprint

Proceedings of the 26th International Society for Music Information Retrieval Conference, 249-257. Daejeon, South Korea.

Conference

International Society for Music Information Retrieval Conference (ISMIR 2025), Daejeon, South Korea and Online, September 21-25, 2025

License: Creative Commons Attribution 4.0 International

The Creative Commons Attribution license allows re-distribution and re-use of a licensed work on the condition that the creator is appropriately credited.

Technical metadata

Created: November 25, 2025
Modified: November 25, 2025
