# TAU-NIGENS Spatial Sound Events 2020 (Development Datasets)

[Audio Research Group / Tampere University](http://arg.cs.tut.fi/)

Authors

- Archontis Politis
- Sharath Adavanne
- Tuomas Virtanen

Development

- Archontis Politis
- Sharath Adavanne

Recording (2019-2020)

- Archontis Politis
- Ali Gohar
- Aapo Hakala

Recording (2017-2018)

- Eemi Fagerlund
- Aino Koskimies
- Aapo Hakala

## Description

The **TAU-NIGENS Spatial Sound Events 2020** dataset contains multiple spatial sound-scene recordings, consisting of sound events of distinct categories integrated into a variety of acoustical spaces, and from multiple source directions and distances as seen from the recording position. The spatialization of all sound events is based on filtering through real spatial room impulse responses (RIRs) captured in multiple rooms of various shapes, sizes, and acoustical absorption properties. Furthermore, each scene recording is delivered in two spatial recording formats: a microphone array format (**MIC**) and a first-order Ambisonics format (**FOA**). The sound events are spatialized as either stationary sound sources in the room or moving sound sources, in which case time-variant RIRs are used. Each sound event in the sound scene is associated with a trajectory of its direction-of-arrival (DoA) relative to the recording point, and a temporal onset and offset time. The isolated sound event recordings used for the synthesis of the sound scenes are obtained from the [NIGENS general sound events database](https://doi.org/10.5281/zenodo.2535878).

These recordings serve as the development dataset for the [DCASE 2020 Sound Event Localization and Detection Task](http://dcase.community/challenge2020/task-sound-event-localization-and-detection) of the [DCASE 2020 Challenge](http://dcase.community/challenge2020/).

The RIRs were collected in Finland by staff of Tampere University between 12/2017 and 06/2018, and between 11/2019 and 01/2020. The older measurements from five rooms were also used for the [development](https://doi.org/10.5281/zenodo.2580091) and [evaluation](https://doi.org/10.5281/zenodo.3066124) datasets of **TAU Spatial Sound Events 2019**, while ten additional rooms were added for this dataset. The data collection received funding from the European Research Council, grant agreement [637422 EVERYSOUND](https://cordis.europa.eu/project/id/637422).

[![ERC](https://erc.europa.eu/sites/default/files/content/erc_banner-horizontal.jpg "ERC")](https://erc.europa.eu/)

## Aim

The dataset includes a large number of mixtures of sound events with realistic spatial properties under different acoustic conditions, and hence it is suitable for training and evaluation of machine-listening models for sound event detection (SED), general sound source localization with diverse sounds or signal-of-interest localization, and joint sound event localization and detection (SELD). Additionally, the dataset can be used for evaluation of signal processing methods that do not necessarily rely on training, such as acoustic source localization and multiple-source acoustic tracking. The dataset allows evaluation of the performance and robustness of the aforementioned applications for diverse types of sounds and under diverse acoustic conditions.

## Recording procedure

To construct a realistic dataset, real-life IR recordings were collected using an [Eigenmike](https://mhacoustics.com/products) spherical microphone array.
A [Genelec G Three loudspeaker](https://www.genelec.com/g-three) was used to play back a maximum length sequence (MLS) around the Eigenmike. The IRs were obtained in the STFT domain using a least-squares regression between the known measurement signal (MLS) and the far-field recording, independently at each frequency. The IRs were recorded at fifteen different indoor locations inside the Tampere University campus at Hervanta, Finland. Apart from the five spaces measured and used for the same task in DCASE2019, we added ten new spaces. Additionally, 30 minutes of ambient noise recordings were collected at the same locations with the IR recording setup unchanged.

In contrast to DCASE2019, the new IRs were not measured on a spherical grid of fixed azimuth and elevation resolution and at fixed distances. Instead, the IR directions and distances differ from space to space. Possible azimuths span the whole range of $\phi\in[-180,180)$ degrees, while the elevations span approximately the range $\theta\in[-45,45]$ degrees. A summary of the measured spaces is as follows:

___

**DCASE2019**

1. Large common area with multiple seating tables and carpet flooring. People chatting and working.
2. Large cafeteria with multiple seating tables and carpet flooring. People chatting and having food.
3. High-ceiling corridor with hard flooring. People walking around and chatting.
4. Corridor with classrooms around and hard flooring. People walking around and chatting.
5. Large corridor with multiple sofas and tables, hard and carpet flooring at different parts. People walking around and chatting.

___

**DCASE2020**

6. (2x) Large lecture halls with inclined floor. Ventilation noise.
7. (2x) Modern classrooms with multiple seating tables and carpet flooring. Ventilation noise.
8. (2x) Meeting rooms with hard floor and partially glass walls. Ventilation noise.
9. (2x) Old-style large classrooms with hard floor and rows of desks. Ventilation noise.
10. Large open space in an underground bomb shelter, with plastic-coated floor and rock walls. Ventilation noise.
11. Large open gym space. People using weights and gym equipment.

## Recording formats

To allow testing of methods with recording formats capturing different spatial features, we extracted two 4-channel formats from the high-resolution original 32-channel recordings. For methods or feature extraction that rely on knowledge of the array response for any direction, we provide this information below. The following theoretical spatial responses (steering vectors) modeling the two formats describe the directional response of each channel to a source incident from a direction-of-arrival (DOA) given by azimuth angle $\phi$ and elevation angle $\theta$.

**For the first-order Ambisonics (FOA):**

\begin{eqnarray}
H_1(\phi, \theta, f) &=& 1 \\
H_2(\phi, \theta, f) &=& \sin(\phi) * \cos(\theta) \\
H_3(\phi, \theta, f) &=& \sin(\theta) \\
H_4(\phi, \theta, f) &=& \cos(\phi) * \cos(\theta)
\end{eqnarray}

The FOA format is obtained by converting the 32-channel microphone array signals by means of encoding filters based on anechoic measurements of the Eigenmike array response. Note that in the formulas above the encoding is assumed frequency-independent, which holds true up to around 9 kHz with this specific microphone array, while the actual encoded responses start to deviate gradually from the ideal ones above at higher frequencies.
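As a concrete illustration of the formulas above, the short Python sketch below evaluates the four ideal FOA channel responses for a given source direction (the function name and the example DOA are ours, not part of the dataset):

```python
import numpy as np

def foa_response(azi_deg, ele_deg):
    """Ideal, frequency-independent FOA channel responses H1..H4
    (as defined above) for a plane wave arriving from the given
    azimuth and elevation, both in degrees."""
    phi = np.radians(azi_deg)
    theta = np.radians(ele_deg)
    return np.array([
        1.0,                          # H1: omnidirectional
        np.sin(phi) * np.cos(theta),  # H2
        np.sin(theta),                # H3
        np.cos(phi) * np.cos(theta),  # H4
    ])

# Example: a source to the left of the array (azimuth 90 deg, elevation 0 deg)
print(foa_response(90.0, 0.0))  # ~ [1, 1, 0, 0] up to floating-point rounding
```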
**For the tetrahedral microphone array (MIC):**

The four microphones have the following positions, in spherical coordinates $(\phi, \theta, r)$:

M1: ( 45°,  35°, 4.2cm)
M2: (-45°, -35°, 4.2cm)
M3: (135°, -35°, 4.2cm)
M4: (-135°, 35°, 4.2cm)

Since the microphones are mounted on an acoustically-hard spherical baffle, an analytical expression for the directional array response is given by the expansion:

\begin{equation}
H_m(\phi_m, \theta_m, \phi, \theta, \omega) = \frac{1}{(\omega R/c)^2}\sum_{n=0}^{30} \frac{i^{n-1}}{h_n'^{(2)}(\omega R/c)}(2n+1)P_n(\cos(\gamma_m))
\end{equation}

where $m$ is the channel number, $(\phi_m, \theta_m)$ are the azimuth and elevation of the specific microphone, $\omega = 2\pi f$ is the angular frequency, $R = 0.042$ m is the array radius, $c = 343$ m/s is the speed of sound, $\cos(\gamma_m)$ is the cosine of the angle between the microphone position and the DOA, $P_n$ is the unnormalized Legendre polynomial of degree $n$, and $h_n'^{(2)}$ is the derivative with respect to the argument of the spherical Hankel function of the second kind. The expansion is limited to order 30, which provides negligible modeling error up to 20 kHz. Example routines that can generate directional frequency and impulse array responses based on the above formula can be found [here](https://github.com/polarch/Array-Response-Simulator).
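For reference, the following SciPy-based sketch evaluates the expansion above for a single microphone, source direction, and frequency (the function name, the summation loop, and the 1 kHz test frequency are ours; for full frequency- and impulse-response generation see the routines linked above):

```python
import numpy as np
from scipy.special import spherical_jn, spherical_yn, eval_legendre

R = 0.042   # array radius in metres, as given above
C = 343.0   # speed of sound in m/s

# Microphone directions (azimuth, elevation) in degrees, as listed above
MIC_DIRS_DEG = [(45.0, 35.0), (-45.0, -35.0), (135.0, -35.0), (-135.0, 35.0)]

def rigid_sphere_response(mic_azi, mic_ele, src_azi, src_ele, freq, n_max=30):
    """Directional response H_m of one microphone on the rigid spherical
    baffle for a plane wave from (src_azi, src_ele), evaluated with the
    order-30 expansion above. Angles in degrees, frequency in Hz."""
    kr = 2.0 * np.pi * freq * R / C   # omega * R / c
    pm, tm = np.radians(mic_azi), np.radians(mic_ele)
    ps, ts = np.radians(src_azi), np.radians(src_ele)
    # Cosine of the angle between the microphone direction and the DOA
    cos_gamma = np.sin(tm) * np.sin(ts) + np.cos(tm) * np.cos(ts) * np.cos(ps - pm)

    resp = 0.0 + 0.0j
    for n in range(n_max + 1):
        # Derivative of the spherical Hankel function of the second kind:
        # h_n^(2)'(x) = j_n'(x) - i * y_n'(x)
        dh2 = spherical_jn(n, kr, derivative=True) - 1j * spherical_yn(n, kr, derivative=True)
        resp += (1j ** (n - 1)) / dh2 * (2 * n + 1) * eval_legendre(n, cos_gamma)
    return resp / kr ** 2

# Example: magnitude response of microphone M1 towards the front, at 1 kHz
print(abs(rigid_sphere_response(*MIC_DIRS_DEG[0], 0.0, 0.0, 1000.0)))
```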
## Dataset specifications

The specifications of the dataset can be summarized as follows:

- 600 one-minute-long sound scene recordings (development dataset).
- 200 one-minute-long sound scene recordings (evaluation dataset).
- Sampling rate of 24 kHz.
- About 700 sound event samples spread over 14 classes (see [here](http://doi.org/10.5281/zenodo.2535878) for more details).
- Two 4-channel 3-dimensional recording formats: first-order Ambisonics (FOA) and tetrahedral microphone array (MIC).
- Realistic spatialization and reverberation through RIRs collected in 15 different enclosures.
- From about 1500 to 3500 possible RIR positions across the different rooms.
- Both static reverberant and moving reverberant sound events.
- Three possible angular speeds for moving sources of about 10, 20, or 40 deg/sec.
- Up to two overlapping sound events allowed, temporally and spatially.
- Realistic spatial ambient noise collected from each room, added to the spatialized sound events at varying signal-to-noise ratios (SNR) ranging from noiseless (30 dB) to noisy (6 dB).

Each recording corresponds to a single room and to a maximum polyphony of either two simultaneous sources or no overlap at all. Each event spatialized in the recording has equal probability of being either static or moving, and is randomly assigned one of the room's RIR positions or motion along one of the predefined trajectories. The moving sound events are synthesized with a slow (10 deg/sec), moderate (20 deg/sec), or fast (40 deg/sec) angular speed. A partitioned time-frequency interpolation scheme of the RIRs, extracted from the measurements at regular intervals, is used to approximate the time-variant room response corresponding to the target motion.

In the development dataset, eleven out of the fifteen rooms, along with the NIGENS event samples, are assigned to 6 disjoint sets, and their combinations form 6 distinct splits of 100 recordings each. The splits permit testing and validation across different acoustic conditions. The evaluation dataset has sound events spatialized in the remaining rooms, resulting in 200 additional recordings.

## Sound event classes

To generate the spatial sound scenes, the measured room IRs are convolved with dry recordings of sound samples belonging to distinct sound classes. The database of sound samples used for that purpose is the recent [NIGENS general sound events database](https://doi.org/10.5281/zenodo.2535878). The 14 sound classes of the spatialized events are:

0. alarm
1. crying baby
2. crash
3. barking dog
4. running engine
5. female scream
6. female speech
7. burning fire
8. footsteps
9. knocking on door
10. male scream
11. male speech
12. ringing phone
13. piano

## Naming Convention (Development dataset)

The recordings in the development dataset follow the naming convention:

`fold[split number]_room[room number per split]_mix[recording number per room per split]_ov[number of overlapping sound events].wav`

Note that the room number only distinguishes the different rooms used inside a split. For example, `room1` in the first split is not the same as `room1` in the second split. The room and overlap information is provided so that users of the dataset can analyze the performance of their method with respect to different conditions.

## Naming Convention (Evaluation dataset)

The recordings in the evaluation dataset carry no additional information and follow the naming convention:

`mix[recording number].wav`

## Reference labels and directions-of-arrival

For each recording in the development dataset, the labels and DoAs are provided in a plain-text CSV file with the same filename as the recording, in the following format:

`[frame number (int)], [active class index (int)], [track number index (int)], [azimuth (int)], [elevation (int)]`

Frame, class, and track enumeration begins at 0. Frames correspond to a temporal resolution of 100 msec. Azimuth and elevation angles are given in degrees, rounded to the closest integer value, with azimuth and elevation being zero at the front, azimuth $\phi \in [-180\deg, 180\deg]$, and elevation $\theta \in [-90\deg, 90\deg]$. Note that the azimuth angle increases counter-clockwise ($\phi = 90\deg$ at the left).

The track index indicates instances of the same class in the recording, overlapping or non-overlapping, and it increases for each newly occurring instance. By instance we mean a sound event that is spatialized with a distinct static position in the room, or with a coherent continuous spatial trajectory in the case of moving events. This information is mostly redundant in recordings with no overlap, but it becomes more important when overlap occurs. For example, when two same-class events occur at the same time and the user wants to resample their positions to a temporal resolution finer than 100 msec, the track index can be used directly to disentangle the DoAs for interpolation, without the user having to solve the association problem themselves.

Overlapping sound events are indicated with duplicate frame numbers, and can belong to a different or the same class. An example sequence could be:

```
10, 1, 0, -50, 30
11, 1, 0, -50, 30
11, 1, 1, 10, -20
12, 1, 1, 10, -20
13, 1, 1, 10, -20
13, 4, 0, -40, 0
```

which describes that in frames 10-11 the first instance (_track 0_) of class _crying baby_ (_class 1_) is active; at frame 11 a second instance (_track 1_) of the same class appears simultaneously at a different direction and remains active until frame 13, while at frame 13 an additional event of class 4 appears. Frames that contain no sound events are not included in the sequence.

Reference labels for the evaluation dataset will be published after the DCASE2020 Challenge has been completed.
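As an illustration of how the track index can be used, the following Python sketch (the file path and dictionary layout are ours) reads one metadata file and groups the frame-wise DoAs per event instance, i.e. per (class, track) pair, which is the grouping one would use when interpolating trajectories of overlapping same-class events:

```python
import csv
from collections import defaultdict

def load_metadata(csv_path):
    """Read one development metadata file and group the frame-wise DoAs
    per event instance. Per the description above, each
    (class index, track index) pair identifies one spatialized instance."""
    events = defaultdict(list)
    with open(csv_path, newline='') as f:
        for row in csv.reader(f):
            frame, cls, track, azi, ele = (int(v) for v in row)
            events[(cls, track)].append((frame, azi, ele))
    return events

# Example (hypothetical path):
# events = load_metadata('metadata_dev/fold1_room1_mix001_ov1.csv')
# for (cls, track), trajectory in sorted(events.items()):
#     print(f'class {cls}, track {track}: {len(trajectory)} frames')
```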
## Task setup

The dataset is associated with the [DCASE 2020 Challenge](http://dcase.community/challenge2020/), and to have consistent reporting of results on the development set between participants we define the following division:

| Training splits | Validation split | Testing split |
|:---------------:|:----------------:|:-------------:|
| 3, 4, 5, 6      | 2                | 1             |

with results required only for the testing split. Hence, the above division is recommended if users would like to compare performance with the methods participating in the challenge.

## File structure

```
dataset root
│   README.md			this file, markdown-format
│
└───foa_dev			Ambisonic format, 600 audio recordings, 24kHz, four channels
│   │   fold1_room1_mix001_ov1.wav
│   │   fold1_room1_mix002_ov1.wav
│   │   ...
│   │   fold1_room2_mix001_ov1.wav
│   │   fold1_room2_mix002_ov1.wav
│   │   ...
│   │   fold5_room1_mix001_ov1.wav
│   │   fold5_room1_mix002_ov1.wav
│   │   ...
│
└───mic_dev			Microphone array format, 600 audio recordings, 24kHz, four channels
│   │   fold1_room1_mix001_ov1.wav
│   │   fold1_room1_mix002_ov1.wav
│   │   ...
│   │   fold1_room2_mix001_ov1.wav
│   │   fold1_room2_mix002_ov1.wav
│   │   ...
│   │   fold5_room1_mix001_ov1.wav
│   │   fold5_room1_mix002_ov1.wav
│   │   ...
│
└───metadata_dev		`csv` format, 600 files
│   │   fold1_room1_mix001_ov1.csv
│   │   fold1_room1_mix002_ov1.csv
│   │   ...
│   │   fold1_room2_mix001_ov1.csv
│   │   fold1_room2_mix002_ov1.csv
│   │   ...
│   │   fold5_room1_mix001_ov1.csv
│   │   fold5_room1_mix002_ov1.csv
│   │   ...
│
└───foa_eval			Ambisonic format, 200 audio recordings, 24kHz, four channels
│   │   mix001.wav
│   │   ...
│   │   mix200.wav
│
└───mic_eval			Microphone array format, 200 audio recordings, 24kHz, four channels
│   │   mix001.wav
│   │   ...
│   │   mix200.wav
```

## Download

The three files `foa_dev.z01`, `foa_dev.z02`, and `foa_dev.zip` correspond to audio data of the **FOA** recording format for the development dataset.
The three files `mic_dev.z01`, `mic_dev.z02`, and `mic_dev.zip` correspond to audio data of the **MIC** recording format for the development dataset.
The file `metadata_dev.zip` contains the common metadata for both formats.
The file `foa_eval.zip` corresponds to audio data of the **FOA** recording format for the evaluation dataset.
The file `mic_eval.zip` corresponds to audio data of the **MIC** recording format for the evaluation dataset.

Download the zip files corresponding to the format of interest and use your favorite compression tool that supports split (multi-part) zip archives to extract them.

## Example application

An implementation of a trainable convolutional recurrent neural network (CRNN) performing joint SELD, trained and evaluated with this dataset, is provided [here](https://github.com/sharathadavanne/seld-dcase2020). This implementation serves as the baseline method in the [DCASE 2020 Sound Event Localization and Detection Task](http://dcase.community/challenge2020/task-sound-event-localization-and-detection).

## License

This dataset is licensed under the Creative Commons Attribution-NonCommercial 4.0 International ([CC BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/)) license.
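Referring back to the Task setup division and the file structure above, the following Python sketch groups the development recordings and their metadata files into the training/validation/testing division based on the fold number encoded in the filenames (the directory defaults and the function name are ours, given only as a convenience):

```python
import glob
import os
import re

# Challenge division of the development splits (see the Task setup table)
SPLITS = {"train": {3, 4, 5, 6}, "val": {2}, "test": {1}}

def collect_dev_files(audio_dir="foa_dev", meta_dir="metadata_dev"):
    """Pair each development recording with its metadata file and group
    the pairs according to the challenge division, using the fold number
    parsed from the filename (see the naming convention above)."""
    division = {name: [] for name in SPLITS}
    pattern = os.path.join(audio_dir, "fold*_room*_mix*_ov*.wav")
    for wav_path in sorted(glob.glob(pattern)):
        base = os.path.basename(wav_path)
        fold = int(re.match(r"fold(\d+)_", base).group(1))
        csv_path = os.path.join(meta_dir, base.replace(".wav", ".csv"))
        for name, folds in SPLITS.items():
            if fold in folds:
                division[name].append((wav_path, csv_path))
    return division

# Example usage:
# division = collect_dev_files()
# print({name: len(files) for name, files in division.items()})
```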