SAVGBench Dataset
Authors/Creators
- Sony AI
- Sony Group Corporation
Description
This work addresses the lack of multimodal generative models capable of producing high-quality videos with spatially aligned audio. While recent generative models have been successful at video generation, they often overlook the spatial alignment between audio and visuals, which is essential for immersive experiences. To tackle this problem, we establish a new research direction by benchmarking the Spatially Aligned Audio-Video Generation (SAVG) task.
We introduce the SAVGBench dataset, a spatially aligned audio-visual dataset of stereo audio and perspective video, curated according to whether sound events are onscreen. It is derived from an audio-visual dataset (STARSS23) consisting of first-order Ambisonics (FOA) audio, 360° video, and spatiotemporal annotations of sound events. We convert the STARSS23 data into stereo audio and perspective video, tracking the position of sound events on and off the screen, and keep only the clips whose sound events are entirely onscreen. The dataset focuses on humans and musical instruments in indoor environments, including speech and instrument sounds.
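To make the on/off-screen curation concrete, the following Python sketch checks whether a sound event direction falls inside the rendered perspective view. The field-of-view values and function names are assumptions for illustration; the dataset's actual conversion parameters are not specified here.

```python
# Minimal sketch of an on/off-screen check, assuming annotated event
# directions in degrees. The FOV values below are assumed, not taken
# from the dataset documentation.
HORIZONTAL_FOV_DEG = 90.0  # assumed horizontal field of view
VERTICAL_FOV_DEG = 60.0    # assumed vertical field of view

def wrap_angle(deg: float) -> float:
    """Wrap an angle in degrees to the range [-180, 180)."""
    return (deg + 180.0) % 360.0 - 180.0

def is_onscreen(event_azimuth: float, event_elevation: float,
                view_azimuth: float) -> bool:
    """Check whether an event direction lies inside the perspective view.

    The vertical viewing angle is fixed at 0 degrees elevation, matching
    the dataset's conversion configuration.
    """
    offset = wrap_angle(event_azimuth - view_azimuth)
    return (abs(offset) <= HORIZONTAL_FOV_DEG / 2.0
            and abs(event_elevation) <= VERTICAL_FOV_DEG / 2.0)
```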
Specifications
The specifications of the SAVGBench dataset can be summarized as follows:
Volume, duration, and data split:
- 5,031 clips of 5-sec duration, with a total time of 7.0 hours (development set).
- 547 clips of 5-sec duration, with a total time of 0.8 hours (evaluation set).
- In both the development and evaluation sets, the ratio of speech to instrument sounds is maintained at about 2:1.
Audio:
- Sampling rate: 16 kHz.
- Stereo format: mid-side (M/S) technique with left-right cardioid stereo patterns (see the conversion sketch after this list).
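The M/S rendering with left-right cardioids can be sketched directly from the FOA channels. The snippet below assumes ACN channel order (W, Y, Z, X) with SN3D normalization, as used by STARSS23; the exact gains of the SAVGBench conversion are not specified here, so this is an illustrative decode only.

```python
import numpy as np

def foa_to_ms_stereo(foa: np.ndarray) -> np.ndarray:
    """Render M/S-style stereo with left-right cardioids from FOA audio.

    Assumes `foa` has shape (4, num_samples) in ACN order (W, Y, Z, X)
    with SN3D normalization; these conventions are assumptions here.
    """
    w, y = foa[0], foa[1]   # mid: omni (W); side: left-right dipole (Y)
    left = 0.5 * (w + y)    # cardioid steered to the left (+90 deg azimuth)
    right = 0.5 * (w - y)   # cardioid steered to the right (-90 deg azimuth)
    return np.stack([left, right])
```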
Video:
- Video format: perspective.
- Video resolution: 256x256 with padding.
- Video frames per second (fps): 4 (a preprocessing sketch follows this list).
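The released clips are already rendered in the padded 256x256 format; the following sketch only illustrates one way such a letterboxed square frame can be produced. The padding color and interpolation mode are assumptions, not dataset specifications.

```python
import cv2
import numpy as np

def letterbox_to_square(frame: np.ndarray, size: int = 256) -> np.ndarray:
    """Pad a frame to a square canvas, then resize it to size x size.

    Black padding and area interpolation are illustrative choices.
    """
    h, w = frame.shape[:2]
    side = max(h, w)
    top = (side - h) // 2
    left = (side - w) // 2
    padded = cv2.copyMakeBorder(
        frame, top, side - h - top, left, side - w - left,
        cv2.BORDER_CONSTANT, value=(0, 0, 0))
    return cv2.resize(padded, (size, size), interpolation=cv2.INTER_AREA)
```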
Naming convention
The MP4 files follow the naming convention:
- fold[fold number]_room[room number]_mix[recording number per room]_deg[viewing angle in degree]_start[start time in frame].mp4
The fold number is currently used only to distinguish between the training (fold3 and fold4) and evaluation (fold5) splits. The room number is provided to help dataset users analyze how their methods perform under different recording conditions. The fold number, room number, and recording number are derived from the STARSS23 recordings. The viewing angle and start time indicate the conversion configuration of the clip. Note that only the horizontal viewing angle is varied; the vertical viewing angle is kept at 0° elevation.
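A small parser for this naming convention may be convenient when indexing the clips. The field names and the example file name below are illustrative, not taken from the dataset.

```python
import re

# Hypothetical parser for the naming convention above.
CLIP_NAME = re.compile(
    r"fold(?P<fold>\d+)_room(?P<room>\d+)_mix(?P<mix>\d+)"
    r"_deg(?P<deg>-?\d+)_start(?P<start>\d+)\.mp4")

def parse_clip_name(filename: str) -> dict:
    """Split a SAVGBench MP4 file name into its metadata fields."""
    match = CLIP_NAME.fullmatch(filename)
    if match is None:
        raise ValueError(f"unexpected file name: {filename}")
    return {key: int(value) for key, value in match.groupdict().items()}

# Example (hypothetical file name):
# parse_clip_name("fold3_room4_mix001_deg090_start000120.mp4")
# -> {'fold': 3, 'room': 4, 'mix': 1, 'deg': 90, 'start': 120}
```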
Download instructions
The file SAVGBench_Dataset_Development.zip contains the MP4 data (stereo audio and video) for the development set.
The file SAVGBench_Dataset_Evaluation.zip contains the MP4 data for the evaluation set.
Download the zip files and unzip them with your preferred compression tool.
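For example, both archives can be extracted with Python's standard library; the output directory name below is a free choice.

```python
import zipfile

# Minimal extraction sketch; adjust paths to wherever the zips were saved.
for archive_name in ("SAVGBench_Dataset_Development.zip",
                     "SAVGBench_Dataset_Evaluation.zip"):
    with zipfile.ZipFile(archive_name) as archive:
        archive.extractall("SAVGBench")
```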
Files (652.8 MB)
| Name | Size | MD5 |
|---|---|---|
| SAVGBench_Dataset_Development.zip | 584.0 MB | c83ff8b4a81145c08bda00a1514a469e |
| SAVGBench_Dataset_Evaluation.zip | 68.9 MB | fec9671ad01c39f67e8c0cda71457aa1 |
Additional details
Related works
- Is derived from: Dataset 10.5281/zenodo.7880637 (DOI)