Thesis Open Access
Although L1 and L2 loss functions capture no perceptually related information beyond waveform matching, they achieve remarkable results when used to train music source separation models. Our work extends the existing literature on loss functions for training deep learning audio models, with the aim of better understanding the pros and cons of several loss functions (including L1, L2, and perceptually motivated losses) within a standardized evaluation framework.
In this work we focus on defining an evaluation framework for a fair comparison among losses, because we found it difficult to draw conclusions from the existing body of literature. Generally, loss improvements are presented together with additional model modifications (e.g. different data augmentation or a different model topology), which makes it difficult to assess how much the loss itself contributes to the results. This study focuses on standardizing the evaluation process by employing the same dataset, the same data augmentation strategy, and the same model topology, while varying only the loss. The alternative losses we consider are based on cross-entropy, scale-invariant SDR, multi-resolution STFT, and phase-sensitive losses, among others.
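To make the idea of "varying only the loss" concrete, the following is a minimal sketch, not the implementation used in this work. It shows how waveform losses can be written as interchangeable functions with the same signature, so that an otherwise fixed pipeline can swap one for another. The function names, the `LOSSES` registry, and the `compare_losses` helper are hypothetical. The mean removal and the `eps` stabilizer in the scale-invariant SDR are assumptions borrowed from common practice, not details taken from the thesis.

```python
import math


def l1_loss(estimate, target):
    """Mean absolute error between two equal-length waveforms."""
    return sum(abs(e - t) for e, t in zip(estimate, target)) / len(target)


def l2_loss(estimate, target):
    """Mean squared error between two equal-length waveforms."""
    return sum((e - t) ** 2 for e, t in zip(estimate, target)) / len(target)


def neg_si_sdr_loss(estimate, target, eps=1e-8):
    """Negative scale-invariant SDR in dB, so that lower is better."""
    # Assumption: remove the mean so a constant offset does not affect the score.
    mean_e = sum(estimate) / len(estimate)
    mean_t = sum(target) / len(target)
    e = [x - mean_e for x in estimate]
    t = [x - mean_t for x in target]
    # Project the estimate onto the target to obtain the optimally scaled target.
    scale = sum(a * b for a, b in zip(e, t)) / (sum(b * b for b in t) + eps)
    projection = [scale * b for b in t]
    # Whatever the projection does not explain is treated as distortion.
    noise = [a - p for a, p in zip(e, projection)]
    signal_energy = sum(p * p for p in projection)
    noise_energy = sum(n * n for n in noise)
    return -10.0 * math.log10((signal_energy + eps) / (noise_energy + eps))


# Hypothetical registry: the rest of the pipeline stays fixed, only the key changes.
LOSSES = {"l1": l1_loss, "l2": l2_loss, "neg_si_sdr": neg_si_sdr_loss}


def compare_losses(estimate, target):
    """Evaluate every registered loss on the same estimate/target pair."""
    return {name: fn(estimate, target) for name, fn in LOSSES.items()}
```

Because all three functions take an estimate and a target and return a scalar, a training or evaluation loop can select one by name without any other change. Note that rescaling the estimate changes the L1 and L2 values but leaves the scale-invariant SDR essentially unchanged, which is one of the behavioural differences such a comparison is meant to expose.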