The Curious Case of Visual Grounding: Different Effects for Speech- and Text-based Language Encoders - AUDIO AND EMBEDDING FILES
Description
This dataset contains the audio files and the embeddings for the paper "The Curious Case of Visual Grounding: Different Effects for Speech- and Text-based Language Encoders" by Sauter et al. (2026). The corresponding code can be found here: https://github.com/adrian-sauter/visual_grounding_speech_analysis.
audio_files.zip contains the audio files from MALD [1] and LibriSpeech [2] that were used in our work.
fast_vgs_plus_librispeech_audioslicing.pkl.zip and w2v2_LibriSpeech_audioslicing.pkl.zip contain the embeddings for words from LibriSpeech (obtained via audio slicing) for the FaST-VGS+ [3] and wav2vec2 [4] models.
FULL_DF_MALD.pkl.zip contains the embeddings for words from MALD for FaST-VGS+ [3], wav2vec2 [4], GloVe [5], BERT [6], and VG-BERT [7].
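The embedding archives are zipped Python pickles. A minimal loading sketch is below; note that the member name inside each archive and the structure of the unpickled object (a word-to-embedding mapping is assumed here for illustration) should be verified against the accompanying repository code, as they are not documented on this page:

```python
import io
import pickle
import zipfile

def load_embeddings(zip_source, member):
    """Unpickle a single member from a downloaded .pkl.zip archive.

    zip_source may be a file path or a file-like object; member is the
    name of the pickle file inside the archive.
    """
    with zipfile.ZipFile(zip_source) as zf:
        with zf.open(member) as f:
            return pickle.load(f)

# Self-contained demo with a toy word-to-embedding table standing in
# for the real dataset files (the real schema may differ):
buf = io.BytesIO()
toy = {"cat": [0.1, 0.2], "dog": [0.3, 0.4]}
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("toy.pkl", pickle.dumps(toy))
buf.seek(0)

emb = load_embeddings(buf, "toy.pkl")

# For the actual data, something like (member name assumed):
# emb = load_embeddings("FULL_DF_MALD.pkl.zip", "FULL_DF_MALD.pkl")
```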
References
[1] Tucker, B. V., Brenner, D., Danielson, D. K., Kelley, M. C., Nenadić, F., & Sims, M. (2019). The massive auditory lexical decision (MALD) database. Behavior Research Methods, 51, 1187-1204.
[2] Panayotov, V., Chen, G., Povey, D., & Khudanpur, S. (2015). Librispeech: an ASR corpus based on public domain audio books. 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 5206-5210.
[3] Peng, P. & Harwath, D. (2022). Self-supervised Representation Learning for Speech Using Visual Grounding and Masked Language Modeling. Proceedings of the AAAI Symposium on AI for Speech and Audio Processing.
[4] Baevski, A., Zhou, Y., Mohamed, A., & Auli, M. (2020). wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in Neural Information Processing Systems, 33, 12449-12460.
[5] Pennington, J., Socher, R., & Manning, C. D. (2014). GloVe: Global vectors for word representation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1532-1543.
[6] Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 4171-4186.
[7] Zhang, Y., Choi, M., Han, K., & Liu, Z. (2021). Explainable semantic space by grounding language to vision with cross-modal contrastive learning. Advances in Neural Information Processing Systems, 34, 18513-18526.
Files
audio_files.zip
Additional details
Related works
- Is derived from
- Dataset: 10.3758/s13428-018-1056-1 (DOI)
- Dataset: 10.1109/ICASSP.2015.7178964 (DOI)
- Dataset: 10.5281/zenodo.2619474 (DOI)
Software
- Repository URL
- https://github.com/adrian-sauter/visual_grounding_speech_analysis
- Programming language
- Python