Published September 15, 2021 | Version v1
Thesis Open

PodcastMix: A dataset for separating music and speech in podcasts

Contributors

  • 1. Universitat Pompeu Fabra

Description

Over the last few years, the popularity of podcast shows on streaming services has increased considerably. These shows frequently use licensed music, but the precision of song identification services can be affected by the speakers' voices in the mix. This presents a major problem both for musicians, who do not receive their royalty payments, and for broadcasters, who may be exposed to legal problems for non-compliance with international copyright laws. In this Master's thesis, two state-of-the-art models for music source separation, ConvTasNet and UNet, were benchmarked on a novel podcast-like audio dataset called PodcastMix, with the objective of separating the speakers' voices from the background music in a podcast. In this way, the task of separating background music from foreground speech was formalized. The new dataset is composed of music from the Jamendo free music streaming service mixed with speech from the VCTK dataset. The models were trained on this dataset and evaluated both on its test partition and on a dataset of real podcasts. The results show that UNet performs better than ConvTasNet at separating speakers and music from podcasts. The benchmark was performed using the Asteroid toolkit, and the evaluation metrics were computed with the BSSEval tool to measure the quality of the separations.
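As a rough illustration of the task described above, the following Python sketch overlays background music on foreground speech and scores an estimate with a plain signal-to-distortion ratio. It is not the thesis's actual mixing or evaluation procedure: the helper names `mix_speech_and_music` and `sdr_db`, the `music_gain_db` parameter, and the sine-wave stand-ins for VCTK speech and Jamendo music are all assumptions made for illustration, and `sdr_db` is a simplification of the SDR reported by BSSEval.

```python
import numpy as np


def mix_speech_and_music(speech, music, music_gain_db=-10.0):
    """Overlay attenuated music on speech (hypothetical helper, not from the thesis).

    Returns the mixture together with the two reference sources,
    all trimmed to the length of the shorter input.
    """
    n = min(len(speech), len(music))
    gain = 10.0 ** (music_gain_db / 20.0)  # convert dB to a linear amplitude factor
    speech_ref = speech[:n]
    music_ref = gain * music[:n]
    return speech_ref + music_ref, speech_ref, music_ref


def sdr_db(reference, estimate, eps=1e-12):
    """Plain signal-to-distortion ratio in dB (a simplification of BSSEval's SDR)."""
    distortion = reference - estimate
    return 10.0 * np.log10(
        (np.sum(reference ** 2) + eps) / (np.sum(distortion ** 2) + eps)
    )


# Synthetic stand-ins: one second of audio at an assumed 8 kHz sample rate.
sample_rate = 8000
t = np.arange(sample_rate) / sample_rate
speech = 0.5 * np.sin(2 * np.pi * 220 * t)  # placeholder for a speech clip
music = 0.5 * np.sin(2 * np.pi * 440 * t)   # placeholder for a music clip

mixture, speech_ref, music_ref = mix_speech_and_music(
    speech, music, music_gain_db=-6.0
)

# Baseline: treat the unprocessed mixture as the speech estimate.
# A separation model is useful only if it scores above this value.
baseline_sdr = sdr_db(speech_ref, mixture)
print(f"Speech SDR of the raw mixture: {baseline_sdr:.2f} dB")
```

Scoring the raw mixture against each reference gives a lower bound for the comparison: a separation model adds value only to the extent that its estimates exceed this baseline.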

Files

2021-Nicolas-Schmidt.pdf

Files (1.5 MB)

md5:9c3b61a76612e58369ee19394fce6956