Published March 27, 2021 | Version V1
Dataset (Open Access)

L3DAS21 Challenge

Description

L3DAS21: MACHINE LEARNING FOR 3D AUDIO SIGNAL PROCESSING

IEEE MLSP Data Challenge 2021

 

SCOPE OF THE CHALLENGE

The L3DAS21 Challenge for IEEE MLSP 2021 aims to encourage and foster research on machine learning for 3D audio signal processing. In multi-speaker scenarios it is very important to properly understand the nature of a sound event and its position within the environment, the content of the sound signal, and how to best leverage it for a specific application (e.g., teleconferencing, assistive listening, or entertainment, among others). To this end, the L3DAS21 Challenge presents two tasks: 3D Speech Enhancement and 3D Sound Event Localization and Detection, both relying on first-order Ambisonics recordings in a reverberant office environment.

Each task involves two separate tracks, 1-mic and 2-mic recordings, containing sounds acquired by one Ambisonics microphone and by an array of two Ambisonics microphones, respectively. The use of two first-order Ambisonics microphones represents one of the main novelties of the L3DAS21 Challenge.

  • Task 1: 3D Speech Enhancement
The objective of this task is the enhancement of speech signals immersed in the spatial sound field of a reverberant office environment. Here the models are expected to extract the monophonic voice signal from the 3D mixture containing various background noises. The evaluation metric for this task is a combination of the short-time objective intelligibility (STOI) and the word error rate (WER); a hedged sketch of one possible combination is shown after this list.
     
  • Task 2: 3D Sound Event Localization and Detection
The aim of this task is to detect the temporal activity of a known set of sound event classes and, in particular, to localize them in space. Here the models must predict a list of the active sound events and their respective locations at regular intervals of 100 milliseconds. Performance on this task is evaluated according to the location-sensitive detection error, which combines the localization and detection errors; see the second sketch after this list.
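For concreteness, below is a minimal sketch of one plausible Task 1 score, assuming the combination is simply the mean of STOI and (1 - WER); the exact formula and the speech recognizer used to produce the transcriptions are defined by the official evaluation code in the challenge repository.

```python
# A sketch of a combined Task 1 score: mean of STOI and (1 - WER).
# The averaging scheme is an assumption; consult the official repo for
# the exact metric and the ASR model used to obtain hyp_text.
from pystoi import stoi  # pip install pystoi
from jiwer import wer    # pip install jiwer

def task1_score(clean, enhanced, fs, ref_text, hyp_text):
    """Both terms lie in [0, 1]; higher is better."""
    stoi_score = stoi(clean, enhanced, fs)         # intelligibility of the enhanced signal
    wer_score = min(wer(ref_text, hyp_text), 1.0)  # clip WER at 1 to keep the score bounded
    return (stoi_score + (1.0 - wer_score)) / 2.0
```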
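Along the same lines, here is a hedged sketch of a location-sensitive detection F-score for Task 2. The 2-metre distance threshold and the greedy per-frame matching are illustrative assumptions, not the official definition.

```python
# A sketch of a location-sensitive detection F-score: within each 100 ms
# frame, a prediction counts as a true positive only if its class matches
# an as-yet-unmatched reference event whose position lies within
# dist_threshold metres. Threshold and matching strategy are assumptions.
import numpy as np

def location_sensitive_f1(pred_frames, ref_frames, dist_threshold=2.0):
    """pred_frames, ref_frames: per-frame lists of (class_id, xyz) pairs."""
    tp = fp = fn = 0
    for preds, refs in zip(pred_frames, ref_frames):
        matched = [False] * len(refs)
        for cls, pos in preds:
            hit = next((i for i, (rcls, rpos) in enumerate(refs)
                        if not matched[i] and rcls == cls
                        and np.linalg.norm(np.asarray(rpos) - np.asarray(pos)) <= dist_threshold),
                       None)
            if hit is None:
                fp += 1          # no reference event close enough
            else:
                tp += 1
                matched[hit] = True
        fn += matched.count(False)  # reference events nobody predicted
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)
```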

 

DATASETS

The L3DAS21 datasets contain multiple-source and multiple-perspective B-format Ambisonics audio recordings. We sampled the acoustic field of a large office room, placing two first-order Ambisonics microphones in the center of the room and moving a speaker reproducing an analytic signal in 252 fixed spatial positions. Relying on the collected Ambisonics impulse responses (IRs), we augmented existing clean monophonic datasets to obtain synthetic tridimensional sound sources by convolving the original sounds with our IRs. We extracted speech signals from the Librispeech dataset and office-like background noises from the FSD50K dataset. We aimed at creating plausible and varied 3D scenarios that reflect real-life situations in which speech and disparate types of background noise coexist in the same 3D reverberant environment. We provide normalized raw waveforms as predictor data; the target data varies according to the task.
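As an illustration of the spatialization procedure described above, the following sketch convolves a clean monophonic signal with a four-channel first-order Ambisonics IR to obtain a B-format version of the source at the IR's position. The file names are hypothetical placeholders.

```python
# A minimal sketch of spatializing a mono source with a B-format IR.
import numpy as np
import soundfile as sf
from scipy.signal import fftconvolve

speech, fs = sf.read("clean_speech.wav")          # mono signal, shape (n,)
ir, fs_ir = sf.read("ambisonics_ir_pos042.wav")   # B-format IR, shape (m, 4)
assert fs == fs_ir

# Convolve the mono source with each Ambisonics channel (W, X, Y, Z).
bformat = np.stack([fftconvolve(speech, ir[:, ch]) for ch in range(ir.shape[1])],
                   axis=1)
bformat /= np.max(np.abs(bformat))                # peak-normalize the waveform
sf.write("spatialized_speech.wav", bformat, fs)
```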

The dataset is divided into two main sections, one dedicated to each challenge task.

The first section is optimized for 3D Speech Enhancement and contains more than 30,000 virtual 3D audio environments with durations of up to 10 seconds. In each sample, a spoken voice is always present alongside other office-like background noises. As target data for this section we provide the clean monophonic voice signals.

The second section is dedicated to the 3D Sound Event Localization and Detection task and contains 900 60-second-long audio files. Each data point contains a simulated 3D office audio environment in which up to 3 simultaneous acoustic events may be active at the same time. In this section, the samples are not forced to contain a spoken voice. As target data for this section we provide a list of the onset and offset timestamps, the class, and the spatial coordinates of each individual sound event present in the data points; see the sketch below for how such lists map onto evaluation frames.
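Since Task 2 is evaluated at 100-millisecond intervals, these event lists typically need to be rasterized into frame-wise labels. Below is a hedged sketch of that conversion; the event field names are illustrative, not the dataset's exact schema.

```python
# A sketch of converting event-list targets into 100 ms frame labels.
# The 'onset'/'offset'/'class_id'/'xyz' field names are assumptions.
import numpy as np

FRAME_S = 0.1  # Task 2 is evaluated at 100 ms intervals

def events_to_frames(events, clip_length_s=60.0):
    """events: list of dicts with 'onset'/'offset' (seconds), 'class_id', 'xyz'.
    Returns one list per 100 ms frame, holding the (class_id, xyz) pairs
    active in that frame."""
    n_frames = int(round(clip_length_s / FRAME_S))
    frames = [[] for _ in range(n_frames)]
    for ev in events:
        first = int(ev["onset"] // FRAME_S)
        last = min(int(np.ceil(ev["offset"] / FRAME_S)), n_frames)
        for f in range(first, last):
            frames[f].append((ev["class_id"], np.asarray(ev["xyz"])))
    return frames
```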

We split both dataset sections into a training set (44 hours for SE and 10 hours for SELD) and a test set (6 hours for SE and 5 hours for SELD), taking care to create similar distributions. The training set of the SE section is divided into two partitions, train360 and train100, which contain speech samples extracted from the corresponding partitions of Librispeech (only samples up to 10 seconds long). All sets of the SELD section are divided into OV1, OV2 and OV3 partitions, which refer to the maximum number of overlapping sounds: 1, 2 or 3, respectively.

The evaluation test datasets can be downloaded from the Files section below.

 

CHALLENGE WEBSITE AND CONTACTS

L3DAS21 Challenge Website: www.l3das.com/mlsp2021

GitHub repository: github.com/l3das/L3DAS21

Paper: arxiv.org/abs/2104.05499

IEEE MLSP 2021: 2021.ieeemlsp.org/

Email contact: l3das@uniroma1.it

Twitter: https://twitter.com/das_l3

Files

L3DAS_Task1_dev.zip

Files (50.0 GB)

Size     MD5 checksum
2.6 GB   09702560ca48997c56e0685a49598c25
7.6 GB   5c4a3cc6bb327333b55bd1a8960c7a88
28.6 GB  479a0594a37aafbbb047240a2b8f45e2
2.2 GB   34573337335245683caee17a08e6acaf
9.0 GB   e7f7fc8589fc6f8b14c2982e830dba87

Additional details

Related works

Is documented by
Preprint: arXiv:2104.05499v1 (arXiv)