MULTIVOX: A Spatial Audio-Visual Dataset of Singing Groups
Description
MULTIVOX is a multimodal, spatial audio–visual dataset of a-cappella vocal performances recorded in controlled conditions with both choir and vocal chamber ensembles. The dataset comprises 154 performances (≈3 hours total), captured in two acoustically distinct spaces (auditorium and recording studio).
Each performance includes synchronized 360° video, far-field audio (first-order Ambisonics and ORTF stereo), and per-singer near-field recordings captured on personal devices. Performances feature 6- and 16-singer configurations arranged in a circle around the 360° camera and far-field devices. The repertoire covers 18 short choral pieces, including vocal warm-ups, Latin American songs, and arranged popular music.
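As a starting point for working with the far-field Ambisonics recordings, the sketch below steers a virtual cardioid microphone toward a chosen azimuth from a first-order Ambisonics (FOA) signal. It assumes the common AmbiX convention (ACN channel order W, Y, Z, X with SN3D normalization); check the README for the convention actually used in MULTIVOX.

```python
import numpy as np

def steer_cardioid(foa: np.ndarray, az_deg: float) -> np.ndarray:
    """Steer a virtual cardioid mic toward azimuth az_deg in the horizontal plane.

    foa: (n_samples, 4) first-order Ambisonics, assumed ACN order (W, Y, Z, X)
    with SN3D normalization (AmbiX). Verify against the dataset README.
    """
    w, y, _, x = foa.T
    az = np.deg2rad(az_deg)
    # Cardioid = 0.5*W + 0.5*(cos(az)*X + sin(az)*Y): unity gain on-axis,
    # a null at the opposite direction.
    return 0.5 * w + 0.5 * (np.cos(az) * x + np.sin(az) * y)

# Demo on a synthetic plane wave arriving from 60 degrees azimuth:
t = np.linspace(0, 1, 48000, endpoint=False)
s = np.sin(2 * np.pi * 220 * t)                  # 220 Hz tone
src_az = np.deg2rad(60)
foa = np.stack(
    [s, s * np.sin(src_az), np.zeros_like(s), s * np.cos(src_az)], axis=1
)

on_axis = steer_cardioid(foa, 60)    # beam aimed at the source
off_axis = steer_cardioid(foa, 240)  # beam aimed at the cardioid's null
```

With a real recording, `foa` would come from reading the 4-channel far-field file (e.g. with `soundfile`), and a singer's azimuth could be taken from the facing-direction annotations in metadata.csv.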
Annotations are provided in metadata.csv, including session information, song, duration, tonal center, ensemble composition, condition, per-singer roles, facing direction, gender, height, and near-field file availability. Refer to README.md for full annotation details; MULTIVOX_extended_dataset_description_and_supplement.pdf contains a detailed technical description of the dataset.
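A typical first step is filtering performances by annotation fields, e.g. ensemble size and recording condition. The column names below are illustrative assumptions (the authoritative schema is in README.md), and a small inline stand-in is used in place of the real metadata.csv:

```python
import io
import pandas as pd

# Stand-in for metadata.csv with assumed column names; in practice you
# would load the real file: meta = pd.read_csv("metadata.csv").
csv_text = """song,duration,condition,n_singers,tonal_center
Warmup A,0:45,studio,6,C
Latin song,2:10,auditorium,16,G
"""
meta = pd.read_csv(io.StringIO(csv_text))

# Select small-ensemble takes recorded in the studio:
studio_small = meta[(meta["condition"] == "studio") & (meta["n_singers"] == 6)]
```

The same pattern extends to any of the annotated fields (per-singer roles, facing direction, near-field availability, etc.) once the real column names are substituted.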
The dataset is designed to support research on spatial audio, sound event localization, source separation, ensemble synchronization, and multimodal modeling of group singing. The authors discourage the use of MULTIVOX for GenAI model training.
Citation: Please cite the dataset if you use it in your research.
You can read the AES paper at ccrma.stanford.edu/~iran/papers/Meza_et_al_AESAIMLA_2025.pdf.
@inproceedings{meza2025multivox,
  author    = {Meza, G. and Sepúlveda, M. and Roman, A. S. and Sigal Sefchovich, J. R. and Roman, I. R.},
  title     = {{MULTIVOX: A Spatial Audio-Visual Dataset of Singing Groups}},
  booktitle = {Proceedings of the AES International Conference on Artificial Intelligence and Machine Learning for Audio (AES AIMLA)},
  year      = {2025},
  address   = {London},
  doi       = {10.5281/zenodo.17058101}
}
You can find files C3.zip and C4.zip here: https://doi.org/10.5281/zenodo.17065497
Files (42.6 GB)

| Name | Size | MD5 |
|---|---|---|
| C1.zip | 22.4 GB | 2332b9d1e3c2bbb176e539417a80c21a |
|  | 20.2 GB | fed5ede02695b5a91453276b454fe148 |
|  | 46.7 kB | 81c154d7c6f5974c1ff7076faf32c24d |
|  | 12.9 MB | db218a963ad9546c628660cd92daa488 |
|  | 7.3 kB | 4d3f1de5801965183b0beae99222d4a9 |
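After downloading, the archives can be verified against the MD5 checksums listed above (e.g. C1.zip against md5:2332b9d1e3c2bbb176e539417a80c21a). A minimal sketch using Python's standard library, hashing in chunks so multi-gigabyte files do not need to fit in memory:

```python
import hashlib

def md5sum(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, read in 1 MiB chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Usage (path is hypothetical):
# assert md5sum("C1.zip") == "2332b9d1e3c2bbb176e539417a80c21a"
```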
Additional details
Identifiers
- Other: ccrma.stanford.edu/~iran/papers/Meza_et_al_AESAIMLA_2025.pdf
References
- H. Jers and S. Ternström, "Vocal ensembles," The Oxford Handbook of Music Performance, vol. 2, 2022.
- M. J. Bonshor, "Confidence and choral configuration: The affective impact of situational and acoustic factors in amateur choirs," Psychology of Music, vol. 45, no. 5, 2017.
- S. Ternström, "Choir acoustics: an overview of scientific research published to date," International Journal of Research in Choral Singing, vol. 1, no. 1, 2003.
- A. H. Marshall, D. Gottlob, and H. Alrutz, "Acoustical conditions preferred for ensemble," The Journal of the Acoustical Society of America, vol. 64, no. 5, 1978.
- A. Politis, K. Shimada, P. Sudarsanam, S. Adavanne, D. Krause, Y. Koyama, N. Takahashi, S. Takahashi, Y. Mitsufuji, and T. Virtanen, "STARSS22: A dataset of spatial recordings of real scenes with spatiotemporal annotations of sound events," arXiv preprint arXiv:2206.01948, 2022.
- K. Shimada, A. Politis, P. Sudarsanam, D. A. Krause, K. Uchida, S. Adavanne, A. Hakala, Y. Koyama, N. Takahashi, S. Takahashi et al., "STARSS23: An audio-visual dataset of spatial recordings of real scenes with spatiotemporal annotations of sound events," Advances in Neural Information Processing Systems, vol. 36, 2023.
- K. Grauman, A. Westbury, E. Byrne, Z. Chavis, A. Furnari, R. Girdhar, J. Hamburger, H. Jiang, M. Liu, X. Liu et al., "Ego4D: Around the world in 3,000 hours of egocentric video," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022.
- K. Grauman, A. Westbury, L. Torresani, K. Kitani, J. Malik, T. Afouras, K. Ashutosh, V. Baiyya, S. Bansal, B. Boote et al., "Ego-Exo4D: Understanding skilled human activity from first- and third-person perspectives," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.
- A. S. Roman, B. Balamurugan, and R. Pothuganti, "Enhanced sound event localization and detection in real 360-degree audio-visual soundscapes," arXiv preprint arXiv:2401.17129, 2024.
- H. Pedroza, W. Abreu, R. M. Corey, and I. R. Roman, "Guitar-techs: An electric guitar dataset covering techniques, musical excerpts, chords and scales using a diverse array of hardware," IEEE ICASSP, 2025.
- A. S. Roman, A. Chang, G. Meza, and I. R. Roman, "Generating diverse audio-visual 360 soundscapes for sound event localization and detection," arXiv preprint arXiv:2504.02988, 2025.
- K. Shimada, A. Politis, I. R. Roman, P. Sudarsanam, D. Diaz-Guerra, R. Pandey, K. Uchida, Y. Koyama, N. Takahashi, T. Shibuya et al., "Stereo sound event localization and detection with onscreen/offscreen classification," DCASE 2025, 2025.