Exploring Detection and Localization of Overlapping Sound Sources with Deep Learning

Ronchini, Francesca

doi:10.5281/zenodo.4091365

Published September 15, 2020 | Version v1

Thesis Open

Ronchini, Francesca¹

1. Universitat Pompeu Fabra

Contributors

Supervisors:

1. Universitat Pompeu Fabra

Sound event localization and detection (SELD) refers to the problem of identifying the presence of independent or temporally overlapping sound sources, correctly determining the sound class to which they belong, and estimating their spatial directions while they are active. Until recently, SELD was studied as two standalone tasks: sound event detection and sound event localization. Only in recent years have the two been considered jointly. Neural networks have become one of the prevailing methods for approaching the SELD task, with convolutional recurrent neural networks among the most widely used systems.

The main aim of this project is to contribute to the SELD field by exploring sound event detection and localization with deep learning. The algorithm presented in this work consists of a convolutional recurrent neural network using rectangular filters, specialized in recognizing significant spectral features related to the task. To further improve the evaluation metrics and to generalize the system performance to unseen data, the training dataset size has been increased using data augmentation. The technique used to create new samples is based on channel rotations and reflection on the xy plane in the First Order Ambisonic domain. This approach makes it possible to augment the Direction of Arrival labels while keeping the physical relationships between channels.
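The channel rotation and xy-plane reflection described above can be illustrated with a minimal sketch. It is not the implementation used in the thesis: it assumes the channel order W, Y, Z, X, unit gain on W, rotations about the z axis in steps of 90°, and angles in degrees. The helper names `encode_foa`, `rotate_foa`, and `transform_label` are hypothetical. In practice the channels would be array types rather than plain lists.

```python
import math


def encode_foa(azimuth_deg, elevation_deg):
    """Encoding gains of a plane wave, in the assumed order W, Y, Z, X."""
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    return [
        1.0,                          # W: omnidirectional (unit gain assumed)
        math.sin(az) * math.cos(el),  # Y
        math.sin(el),                 # Z
        math.cos(az) * math.cos(el),  # X
    ]


def rotate_foa(channels, quarter_turns=1, flip_elevation=False):
    """Rotate a [W, Y, Z, X] recording about the z axis in 90-degree steps.

    Optionally reflect it on the xy plane by negating Z.
    """
    w, y, z, x = channels
    for _ in range(quarter_turns % 4):
        # Azimuth + 90 degrees: the new X is -Y and the new Y is the old X.
        x, y = [-s for s in y], list(x)
    if flip_elevation:
        z = [-s for s in z]
    return [list(w), list(y), list(z), list(x)]


def transform_label(azimuth_deg, elevation_deg, quarter_turns=1, flip_elevation=False):
    """Apply the same transformation to a Direction of Arrival label."""
    new_az = (azimuth_deg + 90 * quarter_turns + 180) % 360 - 180  # wrap to [-180, 180)
    new_el = -elevation_deg if flip_elevation else elevation_deg
    return new_az, new_el


# Hypothetical example: one source at azimuth 30, elevation 20.
signal = [0.5, -1.0, 0.25]
mix = [[g * s for s in signal] for g in encode_foa(30, 20)]
augmented = rotate_foa(mix, quarter_turns=1, flip_elevation=True)
new_label = transform_label(30, 20, quarter_turns=1, flip_elevation=True)
```

Because the transformation only swaps and negates channels, the augmented recording is exactly what the same source would produce from the new direction. This is why the labels can be updated without breaking the physical relationships between channels.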

To arrive at the described method, the study was divided into two main experiments. In the first experiment, different rectangular filter shapes were studied to identify the filter size that gives the best performance and helps the network learn frequency features properly, with the aim of detecting and classifying events accordingly. The second experiment focused on reducing overfitting and further improving the evaluation metrics using data augmentation. To that end, the research concentrated on three data augmentation techniques: time stretching, pitch shifting, and channel rotations. Each technique was explored independently. While time stretching and pitch shifting did not improve the results, channel rotation substantially improved the evaluation metrics.

The system presented in this project has also been submitted to the Detection and Classification of Acoustic Scenes and Events (DCASE) 2020 Challenge, whose main purpose is to encourage the development of computational acoustic scene and event analysis methods by comparing different proposals on a common, publicly available dataset. The 2020 challenge consisted of 6 tasks, each centered on a particular aspect of the detection and classification of acoustic scenes. This project was submitted as a possible solution to Task 3, focused on sound event localization and detection.

The system has been evaluated using the same dataset provided for the DCASE 2020 Challenge Task 3: TAU-NIGENS Spatial Sound Events 2020. The network predictions have been evaluated considering the joint nature of localization and detection, proposed as the evaluation criterion for the corresponding task of the DCASE 2020 Challenge. In particular, sound event detection has been evaluated using location-dependent F-score and Error Rate, counting a prediction as a true positive only if it lies under a distance threshold of 20° from the ground truth. Sound localization has been evaluated using classification-dependent Localization Error and Localization Recall, which are computed only if the prediction has been correctly classified. Evaluation results on the 6-split development dataset show that the proposed system outperforms the baseline results, considerably improving Error Rate and F-score for location-aware detection.
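The location-dependent true-positive criterion described above can be sketched as follows. This is a simplified illustration for a single frame, not the official DCASE 2020 evaluation code. It assumes each event is a tuple of class label, azimuth, and elevation in degrees, and that predictions are matched greedily to references. The function names are hypothetical.

```python
import math


def angular_distance_deg(az1, el1, az2, el2):
    """Great-circle angle in degrees between two directions of arrival."""
    a1, e1, a2, e2 = map(math.radians, (az1, el1, az2, el2))
    cos_d = (math.sin(e1) * math.sin(e2)
             + math.cos(e1) * math.cos(e2) * math.cos(a1 - a2))
    # Clamp to guard against rounding slightly outside [-1, 1].
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_d))))


def location_dependent_f_score(predictions, references, threshold_deg=20.0):
    """F-score where a prediction counts only if class and direction both match.

    Each event is (class_label, azimuth_deg, elevation_deg). Matching is greedy,
    which is an assumption made for this sketch.
    """
    unmatched = list(references)
    true_positives = 0
    for label, az, el in predictions:
        for ref in unmatched:
            same_class = ref[0] == label
            if same_class and angular_distance_deg(az, el, ref[1], ref[2]) < threshold_deg:
                true_positives += 1
                unmatched.remove(ref)
                break
    false_positives = len(predictions) - true_positives
    false_negatives = len(unmatched)
    denominator = 2 * true_positives + false_positives + false_negatives
    return 2 * true_positives / denominator if denominator else 0.0


# Hypothetical frame: the first prediction is about 11 degrees from its
# reference, the second has the right class but is 90 degrees away.
predictions = [("dog", 10, 0), ("alarm", 90, 0)]
references = [("dog", 20, 5), ("alarm", 0, 0)]
score = location_dependent_f_score(predictions, references)
```

In this example the second prediction is counted as both a false positive and a false negative, which illustrates how the detection score penalizes a correctly classified event that is localized outside the threshold.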

Files

Name	Size
2020-Francesca-Ronchini.pdf md5:3f0115463fd6dc64b7e140fdcb08ce7b	4.1 MB

