Published 2026 | Version v1
Dataset (Open Access)

The Curious Case of Visual Grounding: Different Effects for Speech- and Text-based Language Encoders - AUDIO AND EMBEDDING FILES

  • University of Amsterdam

Description


This dataset contains the audio files and embeddings for the paper "The Curious Case of Visual Grounding: Different Effects for Speech- and Text-based Language Encoders" by Sauter et al. (2026). The corresponding code is available at https://github.com/adrian-sauter/visual_grounding_speech_analysis.

audio_files.zip contains the audio files from MALD [1] and LibriSpeech [2] that were used in our work.

fast_vgs_plus_librispeech_audioslicing.pkl.zip and w2v2_LibriSpeech_audioslicing.pkl.zip contain the embeddings for words from LibriSpeech (obtained via audio slicing) for the FaST-VGS+ [3] and wav2vec2 [4] models, respectively.

FULL_DF_MALD.pkl.zip contains the embeddings for words from MALD for FaST-VGS+ [3], wav2vec2 [4], GloVe [5], BERT [6], and VG-BERT [7].
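As a minimal sketch, the extracted .pkl files can be loaded with Python's standard pickle module (the exact structure of each table, e.g. a pandas DataFrame indexed by word, is defined by the code in the GitHub repository linked above; the helper name `load_embeddings` is ours, not part of the dataset):

```python
import pickle
from pathlib import Path

def load_embeddings(pkl_path):
    # Load a pickled embedding table, e.g. FULL_DF_MALD.pkl,
    # after extracting it from the corresponding .zip archive.
    with open(pkl_path, "rb") as f:
        return pickle.load(f)

if __name__ == "__main__":
    path = Path("FULL_DF_MALD.pkl")
    if path.exists():
        table = load_embeddings(path)
        print(type(table))
```

If the table was pickled from a pandas DataFrame, loading it via `pandas.read_pickle` with a matching pandas version is the safer route, since pickled DataFrames are not guaranteed to be portable across library versions.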

 

References

[1] Tucker, B. V., Brenner, D., Danielson, D. K., Kelley, M. C., Nenadić, F., & Sims, M. (2019). The massive auditory lexical decision (MALD) database. Behavior Research Methods, 51, 1187-1204.

[2] Panayotov, V., Chen, G., Povey, D., & Khudanpur, S. (2015). Librispeech: an ASR corpus based on public domain audio books. 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 5206-5210.

[3] Peng, P. & Harwath, D. (2022). Self-supervised Representation Learning for Speech Using Visual Grounding and Masked Language Modeling. Proceedings of the AAAI Symposium on AI for Speech and Audio Processing.

[4] Baevski, A., Zhou, Y., Mohamed, A., & Auli, M. (2020). wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in Neural Information Processing Systems, 33, 12449-12460.

[5] Pennington, J., Socher, R., & Manning, C. D. (2014). GloVe: Global vectors for word representation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1532-1543.

[6] Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 4171-4186.

[7] Zhang, Y., Choi, M., Han, K., & Liu, Z. (2021). Explainable semantic space by grounding language to vision with cross-modal contrastive learning. Advances in Neural Information Processing Systems, 34, 18513-18526.

Files (10.4 GB)

Four archives: audio_files.zip, fast_vgs_plus_librispeech_audioslicing.pkl.zip, w2v2_LibriSpeech_audioslicing.pkl.zip, and FULL_DF_MALD.pkl.zip (described above).

MD5 checksums and sizes:

  • md5:b2de8bee376f56ded35e489de90e1041 (1.8 GB)
  • md5:18045a102f77db190c220e3a0dbe718b (2.4 GB)
  • md5:203d88f5f84806150ac9c40ec8839d3b (3.7 GB)
  • md5:3329f871c2b8c4e1e4993e38179b2ac7 (2.4 GB)
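Downloaded archives can be checked against the MD5 checksums listed above. A minimal sketch using Python's standard hashlib (the helper name `md5_of` is ours, not part of the dataset tooling):

```python
import hashlib

def md5_of(path, chunk_size=1 << 20):
    # Stream the file in 1 MiB chunks so multi-gigabyte archives
    # do not need to fit in memory at once.
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Usage: compare the result against the checksum listed above
# for the archive you downloaded, e.g.
#   md5_of("audio_files.zip")
```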

Additional details

Related works

Is derived from
Dataset: 10.3758/s13428-018-1056-1 (DOI)
Dataset: 10.1109/ICASSP.2015.7178964 (DOI)
Dataset: 10.5281/zenodo.2619474 (DOI)