STAIRS26: Sony-Tau Acoustic Images of Real-World Scapes 2026
Authors/Creators
- Roman, Iran R. (Researcher)
- Politis, Archontis (Researcher)
- Shimada, Kazuki (Researcher)
- Cheston, Huw (Researcher)
- Sudarsanam, Parthasaarathy (Researcher)
- Díaz-Guerra Aparicio, David (Researcher)
- Sun, Yifu (Researcher)
- Shibuya, Takashi (Researcher)
- Takahashi, Shusuke (Researcher)
- Mitsufuji, Yuki (Researcher)
Description
STAIRS26 (Sony-Tau Acoustic Images of Real-World Scapes) is a spatial audio dataset designed to benchmark Semantic Acoustic Imaging: the task of visualizing sound energy and identifying semantic sounding objects in space. This release serves as the development set for Task 3 of the DCASE 2026 Challenge.
STAIRS26 fundamentally extends the legacy STARSS23 dataset, shifting the paradigm from sparse point-based localization to dense acoustic field estimation. It upgrades the original real-world recordings (captured in Finland and Japan) with two critical features:
- 32-Channel Raw Audio: Full microphone-array signals enabling high-resolution beamforming and acoustic super-resolution.
- Acoustic Radiance Maps: High-definition acoustic energy images that serve as ground-truth labels for training models to visually reconstruct acoustic fields.
(Note: For details on the physical recording setup, hardware specifications, and scene scripting, please refer to the original STARSS23 Dataset).
Aim
The primary goal of STAIRS26 is to train and evaluate models on acoustic super-resolution: reconstructing high-fidelity, class-aware energy maps from standard 4-channel inputs. By providing full 32-channel recordings and ground-truth images, the dataset enables researchers to:
- Develop deep learning architectures that output dense polygon masks encoding event class, spatial location, and acoustic energy intensity.
- Evaluate high-resolution Direction-of-Arrival (DOA) estimation and multi-source tracking algorithms.
- Bridge audio signal processing with computer-vision-based semantic segmentation.
Specifications
Volume and Data Split
- Size: ~7.5 hours of recordings across 168 development clips.
- Scope: This release contains only the development data (audio and labels) used for training and validation.
- Compatibility: File naming and splits are identical to STARSS23. To use the full multimodal suite, pair this dataset with the STARSS23 audio and video files.
Audio Format
- Sampling rate: 24 kHz
- Bit depth: 16-bit
- Format: 32-channel (raw Eigenmike recordings)
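As a sanity check on the format above, here is a minimal sketch (standard-library only; the file path is hypothetical) that verifies a clip's WAV header against the stated specification:

```python
import wave

def check_stairs26_header(path: str) -> dict:
    """Read a WAV header and compare it against the STAIRS26 audio spec."""
    with wave.open(path, "rb") as wf:
        params = {
            "channels": wf.getnchannels(),       # expect 32 (raw Eigenmike)
            "sample_rate": wf.getframerate(),    # expect 24000 Hz
            "bit_depth": wf.getsampwidth() * 8,  # expect 16-bit
        }
    assert params["channels"] == 32, "not a raw 32-channel recording"
    assert params["sample_rate"] == 24000, "unexpected sampling rate"
    assert params["bit_depth"] == 16, "unexpected bit depth"
    return params

# Usage (hypothetical path inside the extracted archive):
# check_stairs26_header("stairs26/dev/fold3_room4_mix001.wav")
```

This catches accidentally pairing the 4-channel STARSS23 audio with code that expects the raw 32-channel release.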
Acoustic Maps (Labels)
High-definition acoustic images, generated via proximal gradient descent from the 32-channel recordings, are provided as individual .json files (one per recording).
- Structure: The annotations key contains a list of dictionaries. Each dictionary represents a single active sound object at a specific frame (10 FPS temporal resolution).
- Multi-source Frames: If a frame contains multiple sources, multiple dictionaries are present. Silent frames have no annotations.
- Metadata: Inherits frame indices and the 13 source classes from the DCASE2023/STARSS23 metadata .csv files.
- Segmentation: Polygon masks are stored as an array of shape (n_pixels, 3). Each row represents [x, y, amplitude]:
  - x and y: Integer spatial coordinates on a 1-pixel-per-degree angular grid (x ∈ [0, 359], y ∈ [0, 179]).
  - amplitude: Standardized acoustic energy intensity in [0.0, 1.0], where 1.0 represents the loudest pixel within the entire training dataset.
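The label layout above can be sketched as a small pure-Python reader that rasterizes each frame's polygon masks into a dense 180 × 360 energy map. Only the top-level annotations key is documented in this release; the per-annotation field names used below (frame, class, mask) are assumptions for illustration:

```python
import json
from collections import defaultdict

GRID_H, GRID_W = 180, 360  # 1 pixel per degree: y in [0, 179], x in [0, 359]

def frame_energy_maps(label_path: str) -> dict:
    """Rasterize per-frame polygon masks into dense 180x360 energy maps.

    Assumes each annotation dict carries 'frame', 'class', and a 'mask'
    list of [x, y, amplitude] rows; these field names are hypothetical
    except for the documented top-level 'annotations' key.
    """
    with open(label_path) as f:
        annotations = json.load(f)["annotations"]

    maps = defaultdict(lambda: [[0.0] * GRID_W for _ in range(GRID_H)])
    for ann in annotations:
        grid = maps[ann["frame"]]          # one map per 10-FPS frame index
        for x, y, amp in ann["mask"]:
            grid[y][x] = max(grid[y][x], amp)  # keep loudest overlapping source
    return dict(maps)
```

Class labels are carried alongside each mask, so a class-aware variant would keep one map per (frame, class) pair instead of taking the per-pixel maximum.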
File Downloads
- 32ch_audio_dev.zip: Development audio in the raw 32-channel Eigenmike format.
- labels_dev_std.zip: Generated acoustic-image labels in .json format.
(Download and extract using standard compression utilities).
Citation
If you use this dataset, please cite the following:
Roman, I. R., Politis, A., Shimada, K., Cheston, H., Sudarsanam, P., Díaz-Guerra, D., Sun, Y., Shibuya, T., Takahashi, S., & Mitsufuji, Y. (2026). STAIRS26: Sony-Tau Acoustic Images of Real-World Scapes [Data set]. Zenodo. https://doi.org/10.5281/zenodo.18171005
Shimada, K., et al. (2023). STARSS23: An Audio-Visual Dataset of Spatial Recordings of Real Scenes with Spatiotemporal Annotations of Sound Events. Advances in Neural Information Processing Systems 36 (NeurIPS 2023).
Files
32ch_audio_dev.zip
Additional details
Related works
- Is new version of
- Dataset: 10.5281/zenodo.7709052 (DOI)
- Dataset: 10.5281/zenodo.6387880 (DOI)
References
- Archontis Politis, Sharath Adavanne, Daniel Krause, Antoine Deleforge, Prerak Srivastava, Tuomas Virtanen (2021). A Dataset of Dynamic Reverberant Sound Scenes with Directional Interferers for Sound Event Localization and Detection. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2021 Workshop (DCASE2021), Barcelona, Spain.
- Archontis Politis, Sharath Adavanne, and Tuomas Virtanen (2020). A Dataset of Reverberant Spatial Sound Scenes with Moving Sources for Sound Event Localization and Detection. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2020 Workshop (DCASE2020), Tokyo, Japan.
- Sharath Adavanne, Archontis Politis, and Tuomas Virtanen (2019). A Multi-room Reverberant Dataset for Sound Event Localization and Detection. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2019 Workshop (DCASE2019), New York, NY, USA.
- Archontis Politis, Kazuki Shimada, Parthasaarathy Sudarsanam, Sharath Adavanne, Daniel Krause, Yuichiro Koyama, Naoya Takahashi, Shusuke Takahashi, Yuki Mitsufuji, Tuomas Virtanen (2022). STARSS22: A dataset of spatial recordings of real scenes with spatiotemporal annotations of sound events. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2022 Workshop (DCASE2022), Nancy, France.
- Kazuki Shimada, Archontis Politis, Parthasaarathy Sudarsanam, Daniel Krause, Kengo Uchida, Sharath Adavanne, Aapo Hakala, Yuichiro Koyama, Naoya Takahashi, Shusuke Takahashi, Tuomas Virtanen, Yuki Mitsufuji (2023). STARSS23: An Audio-Visual Dataset of Spatial Recordings of Real Scenes with Spatiotemporal Annotations of Sound Events. In Advances in Neural Information Processing Systems 36 (NeurIPS 2023), New Orleans, LA.