Published April 1, 2022 | Version v2
Dataset Open

[DCASE2022 Task 3] Synthetic SELD mixtures for baseline training

  • Tampere University

Description

DESCRIPTION:

This audio dataset serves as supplementary material for the DCASE2022 Challenge Task 3: Sound Event Localization and Detection Evaluated in Real Spatial Sound Scenes. The dataset consists of synthetic spatial audio mixtures of sound events spatialized for two different spatial formats using real room impulse responses (RIRs) measured in various spaces of Tampere University (TAU). The mixtures are generated with the same process used to generate the recordings of the TAU-NIGENS Spatial Sound Scenes 2021 dataset for the DCASE2021 Challenge Task 3.

The SELD task setup in DCASE2022 is based on spatial recordings of real scenes, captured in the STARS22 dataset. Since the task setup allows use of external data, these synthetic mixtures serve as additional training material for the baseline model, and they are shared for reasons of reproducibility. For more details on the task setup, please refer to the task description.

Note that the generator code and the collection of room responses used to spatialize the sound samples will also be made available soon. For more details on the recording of RIRs, spatialization, and generation, see:

  • Archontis Politis, Sharath Adavanne, Daniel Krause, Antoine Deleforge, Prerak Srivastava, Tuomas Virtanen (2021). A Dataset of Dynamic Reverberant Sound Scenes with Directional Interferers for Sound Event Localization and Detection. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2021 Workshop (DCASE2021), Barcelona, Spain.

available here.

SPECIFICATIONS:

  • 13 target sound classes (see task description for details)
  • The sound event samples are sourced from the FSD50K dataset, based on the affinity of the labels in that dataset to the target classes. The selection was done by first determining which FSD50K labels corresponded to the target classes, then selecting samples tagged with only those labels and additionally carrying an annotator rating of Present and Predominant (see FSD50K for more details). The list of the selected files is included here.
  • 1200 1-minute long spatial recordings
  • Sampling rate of 24 kHz
  • Two 4-channel recording formats, first-order Ambisonics (FOA) and tetrahedral microphone array (MIC)
  • Spatial events spatialized in 9 unique rooms, using measured RIRs for the two formats
  • Maximum polyphony of 2 (with possible same-class events overlapping)
  • Even though the whole set is used for training the baseline without distinction between the mixtures, we have included a separation into a training and a testing split, in case one needs to test performance purely on those synthetic conditions (for example, for comparisons with training on mixed synthetic-real data, fine-tuning on real data, or training on real data only).
  • The training split, indicated as fold1 in the dataset, contains 900 recordings spatialized in 6 rooms (150 recordings/room) and is based on samples from the development set of FSD50K.
  • The testing split, indicated as fold2 in the dataset, contains 300 recordings spatialized in 3 rooms (100 recordings/room) and is based on samples from the evaluation set of FSD50K.
  • Common metadata files for both formats are provided. For the file naming and the metadata format, refer to the task setup.

FSD50K SELECTION:

The list of selected sound event recordings is included along the recordings and metadata, as FSD50K_selected.txt. Each line in the text file has the following structure:

[target_label]/[train/test]/[FSD50K_label]/filename.wav

with an example:

domesticSounds/train/Boiling/16584.wav

meaning that the file 16584.wav from FSD50K, carrying the Boiling label in FSD50K, is included in the samples for the training split of these synthetic recordings and is mapped to the target class of domestic sounds. Note that multiple FSD50K labels can be mapped to the same target class. Also note that if these files are downloaded from FSD50K and placed in a folder structure replicating the structure in the list, the resulting folder can be used out-of-the-box with the scene generator to generate new mixtures with the same or different parameters.
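As a sketch, the list format above can be parsed and the matching folder tree recreated with a few lines of Python. The `parse_selection_line` helper and the `FSD50K_selected` destination root are illustrative assumptions, not part of the dataset:

```python
from pathlib import Path


def parse_selection_line(line):
    """Split '[target_label]/[train|test]/[FSD50K_label]/filename.wav'
    into its four components."""
    target_label, split, fsd50k_label, filename = line.strip().split("/")
    return target_label, split, fsd50k_label, filename


# Example line from FSD50K_selected.txt:
target, split, label, fname = parse_selection_line(
    "domesticSounds/train/Boiling/16584.wav"
)
# target: "domesticSounds", split: "train", label: "Boiling", fname: "16584.wav"

# Recreating the folder layout of the list (destination root is assumed)
# yields a tree the scene generator can use out of the box:
dest_dir = Path("FSD50K_selected") / target / split / label
# dest_dir.mkdir(parents=True, exist_ok=True)  # then copy 16584.wav here
```

The same loop over every line of FSD50K_selected.txt reproduces the full structure expected by the generator.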

Note that no sounds from FSD50K have been selected for the Music target class. Background and pop music tracks from the public domain have been cropped and used instead.

DOWNLOAD INSTRUCTIONS:

Download all the zip files and use your preferred compression tool to unzip them. To extract a split zip archive (named .zip, .z01, .z02, ...), you could use, for example, the following commands in a Linux or macOS terminal:

  1. Combine the split archive to a single archive:
    zip -s 0 split.zip --out single.zip
  2. Extract the single archive using unzip:
    unzip single.zip
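After downloading, the checksums listed under Files can be verified before extraction. A minimal Python sketch, assuming only the standard library (any `md5sum`-style tool works equally well; the `md5_of` helper name is illustrative):

```python
import hashlib


def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading in 1 MiB chunks
    so large split archives are not loaded into memory at once."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        while block := f.read(chunk_size):
            digest.update(block)
    return digest.hexdigest()


# Compare the result against the md5 value listed next to each file, e.g.:
# md5_of("DCASE2022_SELD_synth_data.zip") == "<checksum from the Files list>"
```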

Files

DCASE2022_SELD_synth_data.zip

Files (19.3 GB)

  • md5:215399c8ec03397b3957ca611519eb2a (4.3 GB)
  • md5:be29b6319a531481bd572e7945936f4f (4.3 GB)
  • md5:8aabf9af1b9d7d1238023571333275a0 (4.3 GB)
  • md5:9f02366b0cae62d104a13c2c18bd65e9 (4.3 GB)
  • md5:fa45f264a0c272d411a0ceb7e059f512 (2.2 GB)
  • md5:b3abe6d86ece2b9125120a645add5ce3 (184.2 kB)

Additional details

References

  • Archontis Politis, Sharath Adavanne, Daniel Krause, Antoine Deleforge, Prerak Srivastava, Tuomas Virtanen (2021). A Dataset of Dynamic Reverberant Sound Scenes with Directional Interferers for Sound Event Localization and Detection. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2021 Workshop (DCASE2021), Barcelona, Spain.