STAIRS26: Sony-Tau Acoustic Images of Real-World Scapes 2026
Authors/Creators
- Roman, Iran R. (Researcher)
- Politis, Archontis (Researcher)
- Shimada, Kazuki (Researcher)
- Cheston, Huw (Researcher)
- Sudarsanam, Parthasaarathy (Researcher)
- Díaz-Guerra Aparicio, David (Researcher)
- Sun, Yifu (Researcher)
- Shibuya, Takashi (Researcher)
- Takahashi, Shusuke (Researcher)
- Mitsufuji, Yuki (Researcher)
Description
STAIRS26 (Sony-Tau Acoustic Images of Real-World Scapes) is a spatial audio dataset designed to benchmark Semantic Acoustic Imaging: the task of visualizing sound energy and identifying semantic sounding objects in space. This release serves as the development set for Task 3 of the DCASE 2026 Challenge.
STAIRS26 fundamentally extends the legacy STARSS23 dataset, shifting the paradigm from sparse point-based localization to dense acoustic field estimation. It upgrades the original real-world recordings (captured in Finland and Japan) with two critical features:
- 32-Channel Raw Audio: Full microphone-array signals enabling high-resolution beamforming and acoustic super-resolution.
- Acoustic Radiance Maps: High-definition acoustic energy images that serve as ground-truth labels for training models to visually reconstruct acoustic fields.
(Note: For details on the physical recording setup, hardware specifications, and scene scripting, please refer to the original STARSS23 Dataset).
Aim
The primary goal of STAIRS26 is to train and evaluate models on acoustic super-resolution: reconstructing high-fidelity, class-aware energy maps from standard 4-channel inputs. By providing full 32-channel recordings and ground-truth images, the dataset enables researchers to:
- Develop deep learning architectures that output dense polygon masks encoding event class, spatial location, and acoustic energy intensity.
- Evaluate high-resolution Direction-of-Arrival (DOA) estimation and multi-source tracking algorithms.
- Bridge audio signal processing with computer-vision-based semantic segmentation.
Specifications
Volume and Data Split
- Size: ~7.5 hours of recordings across 168 development clips.
- Scope: This release contains only the development data (audio and labels) used for training and validation.
- Compatibility: File naming and splits are identical to STARSS23. To use the full multimodal suite, pair this dataset with the STARSS23 audio and video files.
Audio Format
- Sampling rate: 24 kHz
- Bit depth: 16-bit
- Format: 32-channel (raw Eigenmike recordings)
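As a sanity check on the format above, here is a minimal sketch (standard-library only; the file path is hypothetical) that verifies a clip's WAV header against the stated specification:

```python
import wave

def check_stairs26_header(path: str) -> dict:
    """Read a WAV header and compare it against the STAIRS26 audio spec."""
    with wave.open(path, "rb") as wf:
        params = {
            "channels": wf.getnchannels(),       # expect 32 (raw Eigenmike)
            "sample_rate": wf.getframerate(),    # expect 24000 Hz
            "bit_depth": wf.getsampwidth() * 8,  # expect 16-bit
        }
    assert params["channels"] == 32, "not a raw 32-channel recording"
    assert params["sample_rate"] == 24000, "unexpected sampling rate"
    assert params["bit_depth"] == 16, "unexpected bit depth"
    return params

# Usage (hypothetical path inside the extracted archive):
# check_stairs26_header("stairs26/dev/fold3_room4_mix001.wav")
```

This catches accidentally pairing the 4-channel STARSS23 audio with code that expects the raw 32-channel release.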
Acoustic Maps (Labels)
High-definition acoustic images, generated via proximal gradient descent from the 32-channel recordings, are provided as individual .json files (one per recording).
- Structure: The annotations key contains a list of dictionaries. Each dictionary represents a single active sound object at a specific frame (10 FPS temporal resolution).
- Multi-source Frames: If a frame contains multiple sources, multiple dictionaries are present. Silent frames have no annotations.
- Metadata: Inherits frame indices and the 13 source classes from the DCASE2023/STARSS23 metadata .csv files.
- Segmentation: Polygon masks are stored as an array of shape (n_pixels, 3). Each row represents [x, y, amplitude]:
  - x and y: Integer spatial coordinates on a 1-pixel-per-degree angular grid (x ∈ [0, 359], y ∈ [0, 179]).
  - amplitude: Standardized acoustic energy intensity in [0.0, 1.0], where 1.0 represents the loudest pixel within the entire training dataset.
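The label layout above can be sketched as a small pure-Python reader that rasterizes each frame's polygon masks into a dense 180 × 360 energy map. Only the top-level annotations key is documented in this release; the per-annotation field names used below (frame, class, mask) are assumptions for illustration:

```python
import json
from collections import defaultdict

GRID_H, GRID_W = 180, 360  # 1 pixel per degree: y in [0, 179], x in [0, 359]

def frame_energy_maps(label_path: str) -> dict:
    """Rasterize per-frame polygon masks into dense 180x360 energy maps.

    Assumes each annotation dict carries 'frame', 'class', and a 'mask'
    list of [x, y, amplitude] rows; these field names are hypothetical
    except for the documented top-level 'annotations' key.
    """
    with open(label_path) as f:
        annotations = json.load(f)["annotations"]

    maps = defaultdict(lambda: [[0.0] * GRID_W for _ in range(GRID_H)])
    for ann in annotations:
        grid = maps[ann["frame"]]          # one map per 10-FPS frame index
        for x, y, amp in ann["mask"]:
            grid[y][x] = max(grid[y][x], amp)  # keep loudest overlapping source
    return dict(maps)
```

Class labels are carried alongside each mask, so a class-aware variant would keep one map per (frame, class) pair instead of taking the per-pixel maximum.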
File Downloads
- 32ch_audio_dev.zip: Development audio in the raw 32-channel Eigenmike format.
- labels_dev_std.zip: Generated acoustic-image labels in .json format.
(Download and extract using standard compression utilities).
Citation
If you use this dataset, please cite the following:
Roman, I. R., Politis, A., Shimada, K., Cheston, H., Sudarsanam, P., Díaz-Guerra, D., Sun, Y., Shibuya, T., Takahashi, S., & Mitsufuji, Y. (2026). STAIRS26: Sony-Tau Acoustic Images of Real-World Scapes [Data set]. Zenodo. https://doi.org/10.5281/zenodo.18171005
Shimada, K., et al. (2023). STARSS23: An Audio-Visual Dataset of Spatial Recordings of Real Scenes with Spatiotemporal Annotations of Sound Events. Advances in Neural Information Processing Systems 36 (NeurIPS 2023).
Files
32ch_audio_dev.zip
Additional details
Related works
- Is new version of
- Dataset: 10.5281/zenodo.7709052 (DOI)
- Dataset: 10.5281/zenodo.6387880 (DOI)
References
- Archontis Politis, Sharath Adavanne, Daniel Krause, Antoine Deleforge, Prerak Srivastava, Tuomas Virtanen (2021). A Dataset of Dynamic Reverberant Sound Scenes with Directional Interferers for Sound Event Localization and Detection. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2021 Workshop (DCASE2021), Barcelona, Spain.
- Archontis Politis, Sharath Adavanne, and Tuomas Virtanen (2020). A Dataset of Reverberant Spatial Sound Scenes with Moving Sources for Sound Event Localization and Detection. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2020 Workshop (DCASE2020), Tokyo, Japan.
- Sharath Adavanne, Archontis Politis, and Tuomas Virtanen (2019). A Multi-room Reverberant Dataset for Sound Event Localization and Detection. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2019 Workshop (DCASE2019), New York, NY, USA.
- Archontis Politis, Kazuki Shimada, Parthasaarathy Sudarsanam, Sharath Adavanne, Daniel Krause, Yuichiro Koyama, Naoya Takahashi, Shusuke Takahashi, Yuki Mitsufuji, Tuomas Virtanen (2022). STARSS22: A dataset of spatial recordings of real scenes with spatiotemporal annotations of sound events. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2022 Workshop (DCASE2022), Nancy, France.
- Kazuki Shimada, Archontis Politis, Parthasaarathy Sudarsanam, Daniel Krause, Kengo Uchida, Sharath Adavanne, Aapo Hakala, Yuichiro Koyama, Naoya Takahashi, Shusuke Takahashi, Tuomas Virtanen, Yuki Mitsufuji (2023). STARSS23: An Audio-Visual Dataset of Spatial Recordings of Real Scenes with Spatiotemporal Annotations of Sound Events. In Advances in Neural Information Processing Systems 36 (NeurIPS 2023), New Orleans, LA.