Dataset to publication: "Neighbors and relatives: How do speech embeddings reflect linguistic connections across the world?"

Törö, Tuukka; Suni, Antti; Simko, Juraj

doi:10.5281/zenodo.16268473

Published August 4, 2025 | Version v1

Dataset Open

1. University of Helsinki
2. University of Helsinki

Embeddings from the voxlingua107-xls-r-300m-wav2vec language identification model (Alumäe & Kukk, 2022), extracted from utterances in the Common Voice 16.1 dataset (Ardila et al., 2020).

Used in "Neighbors and relatives: How do speech embeddings reflect linguistic connections across the world?". Preprint available at https://arxiv.org/abs/2506.08564

Code available at https://github.com/TuukkaOT/speech_embedding_analyzer

Files (566.2 MB)

Name	Size	MD5
embeddings.pkl	566.2 MB	11bc20babdb047819162d4c22c9c1adc
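
The MD5 checksum listed for embeddings.pkl can be used to confirm that a download is intact before unpickling it. The following is a minimal sketch, not part of the published code: the helper names md5_of_file and load_verified_pickle and the local path embeddings.pkl are assumptions made for illustration, and the internal structure of the pickled object is not documented on this page, so the example only prints its type.

```python
import hashlib
import pickle
from pathlib import Path

# Checksum as listed in the file table of this record.
EXPECTED_MD5 = "11bc20babdb047819162d4c22c9c1adc"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def load_verified_pickle(path, expected_md5):
    """Unpickle a file only if its MD5 digest matches the expected value.

    Hypothetical helper: raises ValueError on a checksum mismatch.
    """
    actual = md5_of_file(path)
    if actual.lower() != expected_md5.lower():
        raise ValueError(f"MD5 mismatch: expected {expected_md5}, got {actual}")
    with open(path, "rb") as handle:
        return pickle.load(handle)


if __name__ == "__main__":
    # Assumed download location; adjust to wherever the file was saved.
    local_path = Path("embeddings.pkl")
    if local_path.exists():
        data = load_verified_pickle(local_path, EXPECTED_MD5)
        # The layout of the object is not described here, so inspect it first.
        print(type(data))
```

Unpickling can execute arbitrary code, so checking the digest against the value published with the record before loading is a sensible precaution. A matching checksum only shows that the file is the one that was published.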

Additional details

Is supplement to: Preprint: arXiv:2506.08564 (arXiv)

Funding: Research Council of Finland, "Predictive Processing Approach to Modelling Prosodic Hierarchy for Speech Synthesis", 357262

Repository URL: https://github.com/TuukkaOT/speech_embedding_analyzer
Programming language: Python

Ardila, R., Branson, M., Davis, K., Kohler, M., Meyer, J., Henretty, M., ... & Weber, G. (2020, May). Common Voice: A Massively-Multilingual Speech Corpus. In Proceedings of the Twelfth Language Resources and Evaluation Conference (pp. 4218-4222).
Alumäe, T., & Kukk, K. (2022, June). Pretraining approaches for spoken language recognition: TalTech submission to the OLR 2021 challenge. In Proceedings of The Speaker and Language Recognition Workshop (Odyssey 2022), Beijing, 28 June to 2 July 2022 (pp. 240-247). doi:10.21437/Odyssey.2022-34
