Uncovering underlying high-level musical content in the time domain

Oriol Colomé Font

doi:10.5281/zenodo.8380670

Published September 26, 2023 | Version v1

Thesis | Open access

Oriol Colomé Font¹

1. Universitat Pompeu Fabra

Contributors

Supervisors:

This thesis posits the existence of invariant high-level musical concepts that persist regardless of changes in sonic qualities, much as in the symbolic domain, where the essence of a work endures across different performances, instruments, styles, and almost countless other variables.
In collaboration with Epidemic Sound AB and the Music Technology Group (MTG) at Universitat Pompeu Fabra (UPF), we used self-supervised contrastive learning to uncover the underlying structure of Western tonal music by learning deep audio features for music boundary detection. We trained deep convolutional neural networks with a triplet loss function to identify abstract, semantic, high-level musical elements without relying on their sonic qualities. In this way, we replaced traditional acoustic features with deep audio embeddings, paving the way for sound-agnostic, content-sensitive music representations for downstream track segmentation tasks.
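To make the training objective concrete, the following is a minimal sketch of a standard hinge-style triplet loss, not the thesis's own code. The abstract does not state the distance function or margin; squared Euclidean distance and `margin=1.0` are assumptions, and the helper names (`squared_distance`, `triplet_loss`, `mean_triplet_loss`) are hypothetical. The loss is zero once the negative embedding is at least `margin` farther from the anchor than the positive one, so only "hard" triplets drive learning.

```python
import math
from typing import Sequence, Tuple

Vector = Sequence[float]


def squared_distance(x: Vector, y: Vector) -> float:
    """Squared Euclidean distance between two embedding vectors."""
    if len(x) != len(y):
        raise ValueError("embeddings must have the same dimension")
    return math.fsum((a - b) ** 2 for a, b in zip(x, y))


def triplet_loss(anchor: Vector, positive: Vector, negative: Vector,
                 margin: float = 1.0) -> float:
    """Hinge triplet loss (assumed form): zero once the negative is at least
    `margin` farther (in squared distance) from the anchor than the positive."""
    d_pos = squared_distance(anchor, positive)
    d_neg = squared_distance(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


def mean_triplet_loss(triplets: Sequence[Tuple[Vector, Vector, Vector]],
                      margin: float = 1.0) -> float:
    """Average loss over a batch of (anchor, positive, negative) triplets."""
    if not triplets:
        raise ValueError("empty batch")
    total = math.fsum(triplet_loss(a, p, n, margin) for a, p, n in triplets)
    return total / len(triplets)


# Usage: an easy triplet contributes nothing; a hard one is penalized.
easy = ([0.0, 0.0], [0.1, 0.0], [2.0, 0.0])
hard = ([0.0, 0.0], [0.1, 0.0], [0.5, 0.0])
print(triplet_loss(*easy))  # 0.0
print(triplet_loss(*hard))  # about 0.76
```

In a real network the vectors would be the outputs of the convolutional encoder and the loss would be minimized by gradient descent; the sketch only shows what is being minimized.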
Our cognitively based approach to learning embeddings uses full-resolution data and preserves high-level musical information that unfolds in the time domain. A key component of our methodology is the triplet network, which learns to preserve the nuanced relationships within musical data by placing an anchor example closer to a musically similar (positive) example than to a dissimilar (negative) one. Drawing on our domain expertise, we developed robust transformations that encode, as invariances, heuristic musical concepts that should remain constant. This novel approach combines music and machine learning with the aim of improving the efficacy of machine listening models.
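Because the method is self-supervised, triplets must be built from unlabeled audio. The sketch below shows one generic way to do this under stated assumptions; it is not the thesis's procedure. The abstract does not list the actual transformations, so `random_gain` and `invert_polarity` are hypothetical placeholders for changes that alter how audio sounds but not its musical content. Treating a distant segment of the same track as the negative is also an assumption, and a heuristic that can fail when a section repeats.

```python
import random
from typing import Callable, List, Sequence, Tuple

Signal = List[float]


# Hypothetical sound-level transformations: they change level or polarity
# but should leave the musical content unchanged. Each takes an rng so that
# all transforms share one signature.
def random_gain(x: Sequence[float], rng: random.Random) -> Signal:
    g = rng.uniform(0.5, 1.5)
    return [g * s for s in x]


def invert_polarity(x: Sequence[float], rng: random.Random) -> Signal:
    return [-s for s in x]  # rng unused; kept for a uniform interface


TRANSFORMS: List[Callable[[Sequence[float], random.Random], Signal]] = [
    random_gain,
    invert_polarity,
]


def sample_triplet(audio: Sequence[float], seg_len: int, min_gap: int,
                   rng: random.Random) -> Tuple[Signal, Signal, Signal]:
    """Return (anchor, positive, negative) segments from one unlabeled track.

    positive: a transformed copy of the anchor (same musical content).
    negative: a segment starting at least seg_len + min_gap samples away,
    assumed (heuristically) to carry different content.
    """
    last = len(audio) - seg_len   # last valid segment start
    gap = seg_len + min_gap       # required distance between segment starts
    if last - gap < 0:
        raise ValueError("track too short for the requested segment and gap")
    # Keep only anchor positions that leave room for at least one negative.
    anchors = [s for s in range(last + 1) if s + gap <= last or s - gap >= 0]
    start_a = rng.choice(anchors)
    negatives = [s for s in range(last + 1) if abs(s - start_a) >= gap]
    start_n = rng.choice(negatives)

    anchor = list(audio[start_a:start_a + seg_len])
    positive = rng.choice(TRANSFORMS)(anchor, rng)
    negative = list(audio[start_n:start_n + seg_len])
    return anchor, positive, negative
```

Enumerating every start position keeps the sketch simple; for full-length tracks one would sample positions on a coarser frame grid instead.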
Preliminary results suggest that, although it does not outperform the state of the art, our musically informed technique has significant potential for boundary detection. The same most likely holds for nearly all downstream sound-agnostic, content-sensitive tasks constrained by data scarcity, since learning only from unlabeled audio files can achieve performance competitive with traditional handcrafted signal-processing methods.
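The abstract does not specify how boundaries are extracted from the learned embeddings. As an illustration only, the sketch below swaps in a common, well-known back-end: a cosine self-similarity matrix over per-frame embeddings, a Foote-style checkerboard novelty curve, and simple peak picking. The unweighted kernel (no Gaussian taper), the `half_width` and `threshold` values, and all function names are assumptions.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine_similarity(x: Vector, y: Vector) -> float:
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return 0.0 if nx == 0.0 or ny == 0.0 else dot / (nx * ny)


def self_similarity(frames: Sequence[Vector]) -> List[List[float]]:
    """Pairwise cosine similarity between per-frame embeddings."""
    return [[cosine_similarity(x, y) for y in frames] for x in frames]


def novelty_curve(ssm: List[List[float]], half_width: int) -> List[float]:
    """Unweighted checkerboard novelty: reward similarity within the past and
    future windows around frame t, penalize similarity across them."""
    n = len(ssm)
    curve = [0.0] * n
    for t in range(n):
        total = 0.0
        for i in range(-half_width, half_width):
            for j in range(-half_width, half_width):
                r, c = t + i, t + j
                if 0 <= r < n and 0 <= c < n:
                    same_side = (i < 0) == (j < 0)
                    total += ssm[r][c] if same_side else -ssm[r][c]
        curve[t] = total
    return curve


def pick_peaks(curve: Sequence[float], threshold: float) -> List[int]:
    """Local maxima above a fixed threshold (endpoints excluded)."""
    return [t for t in range(1, len(curve) - 1)
            if curve[t] > threshold
            and curve[t] >= curve[t - 1] and curve[t] > curve[t + 1]]


# Usage: two "sections" with orthogonal embeddings; the boundary is at frame 6.
frames = [[1.0, 0.0]] * 6 + [[0.0, 1.0]] * 6
curve = novelty_curve(self_similarity(frames), half_width=3)
print(pick_peaks(curve, threshold=10.0))  # [6]
```

The point of learned embeddings in this pipeline is that the self-similarity matrix reflects musical content rather than timbre or loudness, so the same back-end can run on either handcrafted or learned features.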
The question remains whether such a general-purpose audio representation can mimic human hearing.

Files

Name	Size	MD5 checksum
Oriol-Colome-Master-Thesis-2023.pdf	12.3 MB	c10a7cba62d9399e3d75cdce5bfdbf0a

Statistic	All versions	This version
Views	224	201
Downloads	205	186
Data volume	3.2 GB	2.9 GB
