Published August 4, 2025 | Version v1
Dataset Open

Dataset to publication: "Neighbors and relatives: How do speech embeddings reflect linguistic connections across the world?"

  • 1. University of Helsinki
  • 2. University of Helsinki

Description

This dataset contains embeddings from the voxlingua107-xls-r-300m-wav2vec language identification model (Alumäe & Kukk, 2022), extracted from utterances in the Common Voice 16.1 dataset (Ardila et al., 2020).

Used in "Neighbors and relatives: How do speech embeddings reflect linguistic connections across the world?". Preprint available at https://arxiv.org/abs/2506.08564

Code available at https://github.com/TuukkaOT/speech_embedding_analyzer

Files

Files (566.2 MB)

md5:11bc20babdb047819162d4c22c9c1adc
Size: 566.2 MB
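
The MD5 checksum listed above can be used to check that a download completed without corruption. The following is a minimal sketch using only the Python standard library; the function names `md5_of_file` and `verify_download` are illustrative and not part of the dataset or its accompanying code, and the file path is whatever name the archive has on your machine.

```python
import hashlib

# Checksum as listed in the Files section of this record.
EXPECTED_MD5 = "11bc20babdb047819162d4c22c9c1adc"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks
    so that a large archive does not need to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected=EXPECTED_MD5):
    """Return True if the file at `path` matches the expected MD5 digest."""
    return md5_of_file(path) == expected.lower()
```

Calling `verify_download("path/to/downloaded_file")` (a placeholder path) returns `True` only when the local file matches the published checksum. A mismatch usually indicates an incomplete or corrupted download.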

Additional details

Related works

Is supplement to
Preprint: arXiv:2506.08564 (arXiv)

Funding

Research Council of Finland
Predictive Processing Approach to Modelling Prosodic Hierarchy for Speech Synthesis (grant 357262)

References

  • Ardila, R., Branson, M., Davis, K., Kohler, M., Meyer, J., Henretty, M., ... & Weber, G. (2020, May). Common Voice: A Massively-Multilingual Speech Corpus. In Proceedings of the Twelfth Language Resources and Evaluation Conference (pp. 4218-4222).
  • Alumäe, T., & Kukk, K. (2022, June). Pretraining approaches for spoken language recognition: TalTech submission to the OLR 2021 challenge. In Proceedings of The Speaker and Language Recognition Workshop (Odyssey 2022), Beijing (pp. 240-247). doi: 10.21437/Odyssey.2022-34