Published September 17, 2025 | Version 1.0.0
Dataset | Open Access

SAVGBench Dataset

Affiliations

  • 1. Sony AI
  • 2. Sony Group Corporation

Description

This work addresses the lack of multimodal generative models capable of producing high-quality videos with spatially aligned audio. While recent advancements in generative models have been successful in video generation, they often overlook the spatial alignment between audio and visuals, which is essential for immersive experiences. To tackle this problem, we establish a new research direction in benchmarking the Spatially Aligned Audio-Video Generation (SAVG) task.

We introduce the SAVGBench dataset, a spatially aligned audio-visual dataset of stereo audio and perspective video, curated according to whether sound events are onscreen or not. It is derived from STARSS23, an audio-visual dataset consisting of first-order Ambisonics (FOA) audio, 360° video, and spatiotemporal annotations of sound events. We convert the STARSS23 data into stereo audio and perspective video, tracking the position of sound events on and off the screen, and keep for SAVGBench only the clips that contain onscreen sound events. The dataset focuses on humans and musical instruments in indoor environments, including speech and instrument sounds.
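
For illustration, here is a minimal numpy sketch of one common way to derive left-right cardioid stereo from FOA via the mid-side (M/S) technique. The ACN channel order (W, Y, Z, X), SN3D normalization, and exact mid/side gains are assumptions of this sketch, not the dataset's documented conversion parameters:

```python
import numpy as np

def foa_to_stereo(foa: np.ndarray, view_deg: float = 0.0) -> np.ndarray:
    """Derive L/R cardioid stereo from FOA audio via mid-side (M/S).

    foa: (4, n_samples) array, assumed ACN channel order (W, Y, Z, X)
         with SN3D normalization.
    view_deg: horizontal viewing angle; the scene is rotated so that
              this direction becomes the stereo front.
    """
    w, y, _, x = foa
    # Rotate the horizontal plane so the viewing angle faces forward.
    phi = np.deg2rad(view_deg)
    x_rot = np.cos(phi) * x + np.sin(phi) * y
    y_rot = np.cos(phi) * y - np.sin(phi) * x
    # Mid: forward-facing cardioid; side: lateral figure-of-eight.
    mid = 0.5 * (w + x_rot)
    side = 0.5 * y_rot
    # M/S decode: L = M + S, R = M - S, yielding left/right cardioid-like patterns.
    return np.stack([mid + side, mid - side])
```

Rotating the horizontal plane first makes the chosen viewing angle the stereo front, matching the per-clip viewing angles described under the naming convention below.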

Specifications

The specifications of the SAVGBench dataset can be summarized as follows:

Volume, duration, and data split:

  • 5,031 clips of 5-sec duration, with a total time of 7.0 hours (development set).
  • 547 clips of 5-sec duration, with a total time of 0.8 hours (evaluation set).
  • In both the development and evaluation sets, the ratio of speech to instrument sounds is maintained at about 2:1.

Audio:

  • Sampling rate: 16 kHz.
  • Stereo format: mid-side (M/S) technique with left-right cardioid stereo patterns.

Video:

  • Video format: perspective.
  • Video resolution: 256×256 (with padding).
  • Video frames per second (fps): 4.
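
As a quick sanity check, a clip's streams can be compared against these specifications via ffprobe's JSON output (a minimal sketch assuming ffprobe is on the PATH; check_clip_specs is our own helper name):

```python
import json
import subprocess

def check_clip_specs(path: str) -> None:
    """Verify a SAVGBench MP4 clip against the specifications above."""
    out = subprocess.run(
        ["ffprobe", "-v", "error", "-show_streams", "-of", "json", path],
        capture_output=True, text=True, check=True,
    ).stdout
    streams = {s["codec_type"]: s for s in json.loads(out)["streams"]}
    audio, video = streams["audio"], streams["video"]
    assert int(audio["sample_rate"]) == 16000, "expected 16 kHz audio"
    assert audio["channels"] == 2, "expected stereo audio"
    assert (video["width"], video["height"]) == (256, 256), "expected 256x256 video"
    # avg_frame_rate is a fraction string such as "4/1".
    num, den = map(int, video["avg_frame_rate"].split("/"))
    assert num / den == 4, "expected 4 fps"
```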

Naming convention

The MP4 files follow the naming convention:

  • fold[fold number]_room[room number]_mix[recording number per room]_deg[viewing angle in degrees]_start[start time in frames].mp4

The fold number is currently used only to distinguish the training (fold3 and fold4) and evaluation (fold5) splits. The room information is provided to help dataset users analyze the performance of their method under different recording conditions. The fold, room, and recording numbers are derived from the STARSS23 recordings. The viewing angle and start time indicate the conversion configuration of the clip. Note that only the horizontal viewing angle is varied; the vertical viewing angle is kept at 0° elevation.
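
A small helper for splitting these file names into their fields might look as follows (the regular expression and helper name are ours, the zero-padding and sign conventions of the numeric fields are assumptions, and the example file name is hypothetical):

```python
import re

# Pattern mirrors the naming convention above; group names are ours.
CLIP_RE = re.compile(
    r"fold(?P<fold>\d+)_room(?P<room>\d+)_mix(?P<mix>\d+)"
    r"_deg(?P<deg>-?\d+)_start(?P<start>\d+)\.mp4"
)

def parse_clip_name(name: str) -> dict:
    """Split a SAVGBench file name into its metadata fields."""
    m = CLIP_RE.fullmatch(name)
    if m is None:
        raise ValueError(f"unexpected file name: {name}")
    return {k: int(v) for k, v in m.groupdict().items()}

# Hypothetical example:
# parse_clip_name("fold3_room4_mix002_deg090_start0120.mp4")
# -> {'fold': 3, 'room': 4, 'mix': 2, 'deg': 90, 'start': 120}
```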

Download instructions

The file SAVGBench_Dataset_Development.zip contains the MP4 data (stereo audio and perspective video) for the development set.
The file SAVGBench_Dataset_Evaluation.zip contains the corresponding MP4 data for the evaluation set.

Download the zip files and use your preferred compression tool to unzip them.
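
For example, the downloads can be verified against the MD5 checksums listed under Files below and extracted with Python's standard library (a minimal sketch; adjust paths as needed):

```python
import hashlib
import zipfile

def md5sum(path: str, chunk: int = 1 << 20) -> str:
    """Compute the MD5 digest of a file, reading it in chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()

# Checksums as listed under Files below.
EXPECTED = {
    "SAVGBench_Dataset_Development.zip": "c83ff8b4a81145c08bda00a1514a469e",
    "SAVGBench_Dataset_Evaluation.zip": "fec9671ad01c39f67e8c0cda71457aa1",
}

for name, digest in EXPECTED.items():
    assert md5sum(name) == digest, f"checksum mismatch for {name}"
    with zipfile.ZipFile(name) as zf:
        zf.extractall()  # unzip into the current directory
```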

Files

Total size: 652.8 MB

  • SAVGBench_Dataset_Development.zip (584.0 MB, md5:c83ff8b4a81145c08bda00a1514a469e)
  • SAVGBench_Dataset_Evaluation.zip (68.9 MB, md5:fec9671ad01c39f67e8c0cda71457aa1)

Additional details

Related works

Is derived from
Dataset: 10.5281/zenodo.7880637 (DOI)