Published May 28, 2026 | Version v1
Dataset Open

DCASE2026Task4EvaluationDataset: The Evaluation Dataset for Spatial Semantic Segmentation of Sound Scenes

Description

Dataset description

This dataset was recorded and designed for the Spatial Semantic Segmentation of Sound Scenes (S5) challenge task of the DCASE2026 Challenge. The development set for this dataset is available here.

This dataset comprises 2,567 soundscapes. The first 1,512 soundscapes are used for ranking in the DCASE 2026 Challenge Task 4, while the remaining 1,055 soundscapes are used for task analysis. Each soundscape is a 4-channel audio mixture in ambisonic FOA B-format (WYZX), 10 seconds long, sampled at 32 kHz with 16-bit depth.
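As a quick sanity check after download, the stated format can be compared against each file's WAV header. The following is a minimal sketch, assuming the files are plain PCM WAV readable by Python's standard `wave` module and that "10 seconds long" means exactly 320,000 frames per file; the function and constant names are illustrative and not part of the dataset.

```python
import wave

# Expected properties, taken from the dataset description.
EXPECTED_CHANNELS = 4      # FOA B-format
EXPECTED_RATE_HZ = 32000   # 32 kHz
EXPECTED_SAMPLE_BYTES = 2  # 16-bit
EXPECTED_SECONDS = 10      # assumption: exactly 10 s per file

# Channel order as stated in the description (WYZX).
FOA_CHANNELS = ("W", "Y", "Z", "X")


def describe_soundscape(path):
    """Read the WAV header and return its basic properties as a dict."""
    with wave.open(path, "rb") as wf:
        return {
            "channels": wf.getnchannels(),
            "rate_hz": wf.getframerate(),
            "sample_bytes": wf.getsampwidth(),
            "frames": wf.getnframes(),
        }


def matches_description(info):
    """Return True if the header agrees with the stated format."""
    return (
        info["channels"] == EXPECTED_CHANNELS
        and info["rate_hz"] == EXPECTED_RATE_HZ
        and info["sample_bytes"] == EXPECTED_SAMPLE_BYTES
        and info["frames"] == EXPECTED_RATE_HZ * EXPECTED_SECONDS
    )
```

For example, `matches_description(describe_soundscape("eval_0000.wav"))` would be expected to return `True` if the file follows the description.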

The dataset includes synthesized soundscapes and real-world recorded soundscapes.
The synthesized soundscapes are generated from newly recorded individual sound events, background noise, and FOA room impulse responses (RIRs). Each sound event (including target events from the 18 classes and non-target events) is first convolved with a randomly selected RIR and then mixed to form a soundscape. Multi-channel background noise is also added to each mixture. Each soundscape contains 0 to 3 target sound events and 0 to 2 interference (non-target) sound events.
In the real-world subset, audio mixtures are recorded directly using an FOA microphone (the same microphone used to record the RIRs).
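The synthesis procedure described above (convolve each event with an FOA RIR, sum the results, add multi-channel noise) can be sketched as follows. This is an illustration only, not the organizers' generation code: it assumes NumPy is available, places every event at time zero, and applies no level scaling, since the description does not specify onsets or gains. All function names are hypothetical.

```python
import numpy as np


def spatialize(event, rir):
    """Convolve a mono event (n,) with a 4-channel RIR (4, m).

    Returns an array of shape (4, n + m - 1).
    """
    return np.stack([np.convolve(event, channel) for channel in rir])


def mix_soundscape(events, rirs, noise, length):
    """Sum spatialized events and multi-channel noise into a (4, length) mix.

    events: list of mono arrays; rirs: list of (4, m) arrays, one per event;
    noise: (4, k) array. Anything beyond `length` samples is truncated.
    """
    mix = np.zeros((4, length))
    n = min(length, noise.shape[1])
    mix[:, :n] += noise[:, :n]
    for event, rir in zip(events, rirs):
        wet = spatialize(event, rir)[:, :length]
        mix[:, : wet.shape[1]] += wet
    return mix
```

With 10-second soundscapes at 32 kHz, `length` would be 320000 samples.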

The dataset is organized as follows:

  • eval_0000.wav - eval_1511.wav (1512): Synthesized from newly recorded data

The soundscapes below are not used to calculate the ranking score, but we request that you submit results for them for analysis purposes:

  • eval_1512.wav - eval_1727.wav (216)  : Synthesized from newly recorded data, except target sound events from the development set
  • eval_1728.wav - eval_1943.wav (216)  : Synthesized from newly recorded data, except background noise from the development set
  • eval_1944.wav - eval_2159.wav (216)  : Synthesized from newly recorded data, except non-target sound events from the development set
  • eval_2160.wav - eval_2411.wav (252)  : Synthesized from newly recorded data, except RIRs from the development set
  • eval_2412.wav - eval_2541.wav (130)  : Recorded in a real-world environment using an FOA microphone. Each file is a segment of a 70-second recording, split with half-overlap.
  • eval_2542.wav - eval_2566.wav (25)   : Generated from a single real-world audio event recorded with an FOA microphone. Includes a moving sound source.
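The file layout above can be expressed as a small lookup, which may help when grouping results per subset. This is a sketch with hypothetical names and labels; the first analysis subset is taken to start at index 1512 so that every range matches its stated count. The second function illustrates the half-overlap split, assuming 10-second windows with a 5-second hop, which yields 13 segments per 70-second recording.

```python
# (first index, last index, label); ranges are inclusive and follow the
# per-subset counts given in the list above. Labels are illustrative.
SUBSETS = [
    (0, 1511, "ranking"),
    (1512, 1727, "analysis: target events from development set"),
    (1728, 1943, "analysis: background noise from development set"),
    (1944, 2159, "analysis: non-target events from development set"),
    (2160, 2411, "analysis: RIRs from development set"),
    (2412, 2541, "analysis: real-world recording"),
    (2542, 2566, "analysis: real-world moving source"),
]


def subset_for_file(name):
    """Map a file name such as 'eval_1512.wav' to its subset label."""
    if not (name.startswith("eval_") and name.endswith(".wav")):
        raise ValueError(f"unexpected file name: {name}")
    index = int(name[len("eval_"):-len(".wav")])
    for first, last, label in SUBSETS:
        if first <= index <= last:
            return label
    raise ValueError(f"index out of range: {index}")


def half_overlap_segments(total_s, window_s):
    """Return (start, end) times of windows that overlap by half."""
    hop_s = window_s / 2
    segments = []
    start = 0.0
    while start + window_s <= total_s:
        segments.append((start, start + window_s))
        start += hop_s
    return segments
```

Under these assumptions, `len(half_overlap_segments(70, 10))` evaluates to 13.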

Note that, since this dataset is designed for evaluation, it only contains synthesized or recorded soundscapes. It does not include the individual sound events, RIRs, or noise components of the sound scenes.

Recording information

The rest of this description briefly summarizes how the sound events, RIRs, and noise were recorded.

Anechoic Sound Event (ASE) for Evaluation

This is an isolated sound event dataset for evaluation purposes, released under the same specifications as ASE1K, which is available in the development set. The recording was made using three cardioid microphones to capture the sound events from the left, front and right, and one omnidirectional microphone to capture the sound from above. In the S5 task, it is assumed that you will simply select a single channel (e.g. ch=3) from these and use it as a monaural sound event. For each class, around 20 events were recorded.
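Selecting a single channel from a multi-channel recording, as suggested above, could look like the sketch below. It uses only the Python standard library and assumes 16-bit PCM WAV input, which the description does not confirm for the ASE recordings. The channel index here is zero-based; whether "ch=3" in the description is zero-based or one-based is not stated, so check against the development set before relying on it. The function name is hypothetical.

```python
import array
import sys
import wave


def read_channel(path, ch):
    """Return one channel of a 16-bit PCM WAV file as an array of ints.

    `ch` is a zero-based channel index (an assumption of this sketch).
    """
    with wave.open(path, "rb") as wf:
        n_channels = wf.getnchannels()
        if wf.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit samples")
        if not 0 <= ch < n_channels:
            raise IndexError(f"channel {ch} not in 0..{n_channels - 1}")
        samples = array.array("h")
        samples.frombytes(wf.readframes(wf.getnframes()))
    if sys.byteorder == "big":
        samples.byteswap()  # WAV data is little-endian
    # Samples are interleaved frame by frame, so stride by channel count.
    return samples[ch::n_channels]
```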

FOA RIR for Evaluation

The RIR dataset is made up of RIRs recorded in six environments for DCASE2026 Task4. These recording environments differ from those of the RIRs included in the development set. All recordings were made using the same FOA microphone (Sennheiser Ambeo VR Mic). RIRs were recorded from multiple locations in each environment and are compiled in SOFA file format.

Noise recordings for Evaluation

This dataset also includes noise recordings in the FOA format. All noise recordings were newly made for this evaluation using the FOA microphone (Sennheiser Ambeo VR Mic).

References

Further information is available in [1], the DCASE 2026 task description, and on GitHub.

[1] B. T. Nguyen, M. Yasuda, N. Harada, R. Serizel, M. Mishra, M. Delcroix, C. Hernandez-Olivan, S. Araki, D. Takeuchi, T. Nakatani, and N. Ono, “Description and Discussion on DCASE 2026 Challenge Task 4: Spatial Semantic Segmentation of Sound Scenes,” arXiv preprint arXiv:2604.00776, 2026.

License: see the file named LICENSE.pdf

Files

LISENCE.pdf

Files (4.8 GB)

Checksum                                Size
md5:e10751d70611fa0f51522fb99d9112b0    1.1 GB
md5:ad4376eeb44aed247abd1a3597e1b8da    1.1 GB
md5:3665c0f2740ad73caed3b0defa1d3d49    1.1 GB
md5:174edab37a78710e3caa26d6ef5e04c3    1.1 GB
md5:fb1795152be02e2f48bf3d1a20379a0b    480.2 MB
md5:1041e693bc98ba3b56f8028d03c9078c    301.6 kB
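The MD5 checksums listed above can be used to confirm that each download completed intact. Below is a minimal sketch using Python's standard `hashlib`; the function names are illustrative, and the file reads in chunks so that the larger archives do not need to fit in memory.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_matches(path, expected):
    """Compare a file against a checksum such as 'md5:e107...'.

    The optional 'md5:' prefix is stripped before comparison.
    """
    expected_hex = expected.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected_hex
```

For example, `checksum_matches(path, "md5:e10751d70611fa0f51522fb99d9112b0")` would be called with `path` pointing at the corresponding downloaded file.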

Additional details

Dates

Available
2026-06-01