The **Sony-TAu Realistic Spatial Soundscapes 2022 (STARSS22)** dataset contains multichannel recordings of sound scenes in various rooms and environments, together with temporal and spatial annotations of prominent events belonging to a set of target classes. The dataset was collected in two different countries: in Tampere, Finland, by the Audio Research Group (ARG) of **Tampere University (TAU)**, and in Tokyo, Japan, by **SONY**, using a similar setup and annotation procedure. The dataset is delivered in two 4-channel spatial recording formats: a microphone array format (**MIC**) and a first-order Ambisonics format (**FOA**). These recordings serve as the development dataset for the Sound Event Localization and Detection Task of the DCASE 2022 Challenge.
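As an illustration of working with the 4-channel recordings, the following sketch writes a synthetic 4-channel WAV file standing in for a dataset recording and reads it back with the Python standard library, de-interleaving the samples into one list per channel. The file path, duration, and sample parameters here are placeholders, not dataset facts; consult the README for the actual audio specifications.

```python
import os
import struct
import tempfile
import wave

# Placeholder path; real STARSS22 files live in folders such as foa_dev/ or mic_dev/.
path = os.path.join(tempfile.mkdtemp(), "demo_4ch.wav")
sample_rate = 24000
n_frames = 24000  # one second of synthetic audio

# Write a synthetic 4-channel, 16-bit PCM WAV file.
with wave.open(path, "wb") as w:
    w.setnchannels(4)
    w.setsampwidth(2)  # 16-bit samples
    w.setframerate(sample_rate)
    frames = b"".join(
        struct.pack("<4h", i % 100, -(i % 100), 0, 0) for i in range(n_frames)
    )
    w.writeframes(frames)

# Read it back and de-interleave into per-channel sample lists.
with wave.open(path, "rb") as r:
    n_ch = r.getnchannels()
    raw = r.readframes(r.getnframes())
    samples = struct.unpack("<%dh" % (len(raw) // 2), raw)
    channels = [samples[c::n_ch] for c in range(n_ch)]

print(n_ch, len(channels[0]))  # 4 24000
```

In practice a library such as soundfile or scipy would be used instead, but the stdlib version shows the interleaved multichannel layout explicitly.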
In contrast to the three previous datasets of synthetic spatial sound scenes, TAU Spatial Sound Events 2019 (development/evaluation), TAU-NIGENS Spatial Sound Events 2020, and TAU-NIGENS Spatial Sound Events 2021, associated with the previous iterations of the DCASE Challenge, the STARSS22 dataset contains recordings of real sound scenes and hence avoids some of the pitfalls of synthetic scene generation.
The recordings were collected between September 2021 and January 2022. Collection of data from the TAU side has received funding from Google.
REPORT & REFERENCE:
If you use this dataset, please cite the report on its creation and the related DCASE2022 task setup:
Archontis Politis, Kazuki Shimada, Parthasaarathy Sudarsanam, Sharath Adavanne, Daniel Krause, Yuichiro Koyama, Naoya Takahashi, Shusuke Takahashi, Yuki Mitsufuji, Tuomas Virtanen (2022). STARSS22: A dataset of spatial recordings of real scenes with spatiotemporal annotations of sound events. arXiv preprint arXiv:2206.01948.
The dataset is suitable for training and evaluation of machine-listening models for sound event detection (SED), general sound source localization with diverse sounds or signal-of-interest localization, and joint sound-event-localization-and-detection (SELD). Additionally, the dataset can be used for evaluation of signal processing methods that do not necessarily rely on training, such as acoustic source localization methods and multiple-source acoustic tracking. The dataset allows evaluation of the performance and robustness of the aforementioned applications for diverse types of sounds, and under diverse acoustic conditions.
More detailed information on the dataset can be found in the included README file.
Thirteen target sound event classes are annotated. The classes loosely follow the AudioSet ontology.
0. Female speech, woman speaking
1. Male speech, man speaking
2. Clapping
3. Telephone
4. Laughter
5. Domestic sounds
6. Walk, footsteps
7. Door, open or close
8. Music
9. Musical instrument
10. Water tap, faucet
11. Bell
12. Knock
Some of these classes cover only a limited range of related AudioSet subclasses. For more information see the README file.
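The spatiotemporal annotations can be parsed with a few lines of standard-library Python. The sketch below assumes a DCASE-style label layout of one CSV row per active event frame: frame index at 100 ms hops, class index, source index, and azimuth and elevation in degrees. This column layout is an assumption based on previous DCASE SELD datasets; the README shipped with the dataset is the authoritative reference, and the sample rows below are invented for illustration.

```python
import csv
import io

# Invented sample rows in the assumed format:
# frame, class index, source index, azimuth (deg), elevation (deg)
sample_csv = """\
10,1,0,30,-10
11,1,0,31,-10
11,6,0,-120,0
"""

events = []
for row in csv.reader(io.StringIO(sample_csv)):
    frame, cls, source, azi, ele = (int(v) for v in row)
    events.append({
        "time_s": frame * 0.1,  # 100 ms label resolution (assumed)
        "class": cls,
        "source": source,
        "azimuth": azi,
        "elevation": ele,
    })

print(len(events), events[0]["time_s"])  # 3 1.0
```

Rows sharing a frame index (as in the second and third rows above) describe simultaneously active events.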
An implementation of a trainable convolutional recurrent neural network performing joint SELD, trained and evaluated on this dataset, is provided here. This implementation will serve as the baseline method in the DCASE 2022 Sound Event Localization and Detection Task.
DEVELOPMENT AND EVALUATION:
The current version (Version 1.1) of the dataset includes the 121 development audio recordings and labels, used by the participants of Task 3 of the DCASE2022 Challenge to train and validate their submitted systems, and the 52 evaluation audio recordings without labels, used for the evaluation phase of DCASE2022.
Researchers who wish to compare their systems against the DCASE2022 Challenge submissions can obtain directly comparable results by using the evaluation data as their testing set.
The file foa_dev.zip corresponds to the audio data of the FOA recording format.
The file mic_dev.zip corresponds to the audio data of the MIC recording format.
The file metadata_dev.zip contains the common metadata for both formats.
The file foa_eval.zip corresponds to the audio data of the FOA recording format for the evaluation dataset.
The file mic_eval.zip corresponds to the audio data of the MIC recording format for the evaluation dataset.
Download the zip files corresponding to the format of interest and unzip them with your preferred compression tool.
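Extraction can also be scripted with the Python standard library. The sketch below builds a small throwaway archive standing in for one of the downloaded files (the archive and member names are placeholders, not actual dataset contents) and extracts it, recreating the folder structure:

```python
import os
import tempfile
import zipfile

workdir = tempfile.mkdtemp()

# A throwaway zip standing in for a downloaded archive such as foa_dev.zip;
# the member name below is a placeholder, not a real dataset file.
archive = os.path.join(workdir, "foa_dev.zip")
with zipfile.ZipFile(archive, "w") as z:
    z.writestr("foa_dev/example_recording.wav", b"not real audio")

# Extract the archive, recreating its internal folder structure.
with zipfile.ZipFile(archive) as z:
    z.extractall(workdir)

extracted = os.path.join(workdir, "foa_dev", "example_recording.wav")
print(os.path.exists(extracted))  # True
```

This keeps the per-format folders (e.g. foa_dev/) side by side under one root, which is how the audio and metadata archives are meant to line up.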
Archontis Politis, Sharath Adavanne, Daniel Krause, Antoine Deleforge, Prerak Srivastava, Tuomas Virtanen (2021). A Dataset of Dynamic Reverberant Sound Scenes with Directional Interferers for Sound Event Localization and Detection. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2021 Workshop (DCASE2021), Barcelona, Spain.
Archontis Politis, Sharath Adavanne, and Tuomas Virtanen (2020). A Dataset of Reverberant Spatial Sound Scenes with Moving Sources for Sound Event Localization and Detection. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2020 Workshop (DCASE2020), Tokyo, Japan.
Sharath Adavanne, Archontis Politis, and Tuomas Virtanen (2019). A Multi-room reverberant dataset for sound event localization and detection. Proceedings of the Detection and Classification of Acoustic Scenes and Events 2019 Workshop (DCASE2019), New York, NY, USA.