MULTIVOX: A Spatial Audio-Visual Dataset of Singing Groups
Description
MULTIVOX is a multimodal, spatial audio–visual dataset of a-cappella vocal performances recorded in controlled conditions with both choir and vocal chamber ensembles. The dataset comprises 154 performances (≈3 hours total), captured in two acoustically distinct spaces (auditorium and recording studio).
Each performance includes synchronized 360° video, far-field audio (first-order Ambisonics and ORTF stereo), and per-singer near-field recordings captured on personal devices. Performances feature 6- and 16-singer configurations arranged in a circle around the 360° camera and far-field devices. The repertoire covers 18 short choral pieces, including vocal warm-ups, Latin American songs, and arranged popular music.
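As a starting point for working with the far-field Ambisonics recordings, the sketch below steers a virtual cardioid microphone toward a chosen azimuth from a first-order Ambisonics (FOA) signal. It assumes the common AmbiX convention (ACN channel order W, Y, Z, X with SN3D normalization); check the README for the convention actually used in MULTIVOX.

```python
import numpy as np

def steer_cardioid(foa: np.ndarray, az_deg: float) -> np.ndarray:
    """Steer a virtual cardioid mic toward azimuth az_deg in the horizontal plane.

    foa: (n_samples, 4) first-order Ambisonics, assumed ACN order (W, Y, Z, X)
    with SN3D normalization (AmbiX). Verify against the dataset README.
    """
    w, y, _, x = foa.T
    az = np.deg2rad(az_deg)
    # Cardioid = 0.5*W + 0.5*(cos(az)*X + sin(az)*Y): unity gain on-axis,
    # a null at the opposite direction.
    return 0.5 * w + 0.5 * (np.cos(az) * x + np.sin(az) * y)

# Demo on a synthetic plane wave arriving from 60 degrees azimuth:
t = np.linspace(0, 1, 48000, endpoint=False)
s = np.sin(2 * np.pi * 220 * t)                  # 220 Hz tone
src_az = np.deg2rad(60)
foa = np.stack(
    [s, s * np.sin(src_az), np.zeros_like(s), s * np.cos(src_az)], axis=1
)

on_axis = steer_cardioid(foa, 60)    # beam aimed at the source
off_axis = steer_cardioid(foa, 240)  # beam aimed at the cardioid's null
```

With a real recording, `foa` would come from reading the 4-channel far-field file (e.g. with `soundfile`), and a singer's azimuth could be taken from the facing-direction annotations in metadata.csv.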
Annotations are provided in metadata.csv, including session information, song, duration, tonal center, ensemble composition, condition, per-singer roles, facing direction, gender, height, and near-field file availability. Refer to README.md for full annotation details; MULTIVOX_extended_dataset_description_and_supplement.pdf contains a detailed technical description of the dataset.
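A typical first step is filtering performances by annotation fields, e.g. ensemble size and recording condition. The column names below are illustrative assumptions (the authoritative schema is in README.md), and a small inline stand-in is used in place of the real metadata.csv:

```python
import io
import pandas as pd

# Stand-in for metadata.csv with assumed column names; in practice you
# would load the real file: meta = pd.read_csv("metadata.csv").
csv_text = """song,duration,condition,n_singers,tonal_center
Warmup A,0:45,studio,6,C
Latin song,2:10,auditorium,16,G
"""
meta = pd.read_csv(io.StringIO(csv_text))

# Select small-ensemble takes recorded in the studio:
studio_small = meta[(meta["condition"] == "studio") & (meta["n_singers"] == 6)]
```

The same pattern extends to any of the annotated fields (per-singer roles, facing direction, near-field availability, etc.) once the real column names are substituted.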
The dataset is designed to support research on spatial audio, sound event localization, source separation, ensemble synchronization, and multimodal modeling of group singing. The authors discourage the use of MULTIVOX for GenAI model training.
Citation: Please cite the dataset if you use it in your research.
You can read the AES paper at ccrma.stanford.edu/~iran/papers/Meza_et_al_AESAIMLA_2025.pdf.
@inproceedings{meza2025multivox,
  author    = {Meza, G. and Sepúlveda, M. and Roman, A. S. and Sigal Sefchovich, J. R. and Roman, I. R.},
  title     = {{MULTIVOX: A Spatial Audio-Visual Dataset of Singing Groups}},
  booktitle = {Proceedings of the AES International Conference on Artificial Intelligence and Machine Learning for Audio (AES AIMLA)},
  year      = {2025},
  address   = {London},
  doi       = {10.5281/zenodo.17058101}
}
You can find files C3.zip and C4.zip here: https://doi.org/10.5281/zenodo.17065497
Files (42.6 GB)

| Name | Size | MD5 |
|---|---|---|
| C1.zip | 22.4 GB | 2332b9d1e3c2bbb176e539417a80c21a |
|  | 20.2 GB | fed5ede02695b5a91453276b454fe148 |
|  | 46.7 kB | 81c154d7c6f5974c1ff7076faf32c24d |
|  | 12.9 MB | db218a963ad9546c628660cd92daa488 |
|  | 7.3 kB | 4d3f1de5801965183b0beae99222d4a9 |
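After downloading, the archives can be verified against the MD5 checksums listed above (e.g. C1.zip against md5:2332b9d1e3c2bbb176e539417a80c21a). A minimal sketch using Python's standard library, hashing in chunks so multi-gigabyte files do not need to fit in memory:

```python
import hashlib

def md5sum(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, read in 1 MiB chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Usage (path is hypothetical):
# assert md5sum("C1.zip") == "2332b9d1e3c2bbb176e539417a80c21a"
```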
Additional details
Identifiers
- Other: ccrma.stanford.edu/~iran/papers/Meza_et_al_AESAIMLA_2025.pdf
References
- H. Jers and S. Ternström, "Vocal ensembles," The Oxford Handbook of Music Performance, vol. 2, 2022.
- M. J. Bonshor, "Confidence and choral configuration: The affective impact of situational and acoustic factors in amateur choirs," Psychology of Music, vol. 45, no. 5, 2017.
- S. Ternström, "Choir acoustics: an overview of scientific research published to date," International Journal of Research in Choral Singing, vol. 1, no. 1, 2003.
- A. H. Marshall, D. Gottlob, and H. Alrutz, "Acoustical conditions preferred for ensemble," The Journal of the Acoustical Society of America, vol. 64, no. 5, 1978.
- A. Politis, K. Shimada, P. Sudarsanam, S. Adavanne, D. Krause, Y. Koyama, N. Takahashi, S. Takahashi, Y. Mitsufuji, and T. Virtanen, "STARSS22: A dataset of spatial recordings of real scenes with spatiotemporal annotations of sound events," arXiv preprint arXiv:2206.01948, 2022.
- K. Shimada, A. Politis, P. Sudarsanam, D. A. Krause, K. Uchida, S. Adavanne, A. Hakala, Y. Koyama, N. Takahashi, S. Takahashi et al., "STARSS23: An audio-visual dataset of spatial recordings of real scenes with spatiotemporal annotations of sound events," Advances in Neural Information Processing Systems, vol. 36, 2023.
- K. Grauman, A. Westbury, E. Byrne, Z. Chavis, A. Furnari, R. Girdhar, J. Hamburger, H. Jiang, M. Liu, X. Liu et al., "Ego4D: Around the world in 3,000 hours of egocentric video," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022.
- K. Grauman, A. Westbury, L. Torresani, K. Kitani, J. Malik, T. Afouras, K. Ashutosh, V. Baiyya, S. Bansal, B. Boote et al., "Ego-Exo4D: Understanding skilled human activity from first- and third-person perspectives," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.
- A. S. Roman, B. Balamurugan, and R. Pothuganti, "Enhanced sound event localization and detection in real 360-degree audio-visual soundscapes," arXiv preprint arXiv:2401.17129, 2024.
- H. Pedroza, W. Abreu, R. M. Corey, and I. R. Roman, "Guitar-techs: An electric guitar dataset covering techniques, musical excerpts, chords and scales using a diverse array of hardware," IEEE ICASSP, 2025.
- A. S. Roman, A. Chang, G. Meza, and I. R. Roman, "Generating diverse audio-visual 360 soundscapes for sound event localization and detection," arXiv preprint arXiv:2504.02988, 2025.
- K. Shimada, A. Politis, I. R. Roman, P. Sudarsanam, D. Diaz-Guerra, R. Pandey, K. Uchida, Y. Koyama, N. Takahashi, T. Shibuya et al., "Stereo sound event localization and detection with onscreen/offscreen classification," DCASE 2025, 2025.