SSL-NL dataset
Description
SSL-NL is a curated dataset of Dutch speech recordings with accompanying forced alignments. It is designed to test how Dutch phonetic and lexical features are encoded in self-supervised learning (SSL) speech representations, and to allow comparisons across different analysis methods.
It consists of two subsets from different domains: audiobooks (MLS) and face-to-face conversations (IFADV). The MLS recordings were extracted from the Dutch part of Multilingual LibriSpeech. The IFADV recordings were extracted from the IFA Dialog Video corpus and split by speaker turn. All audio recordings were downsampled to 16 kHz, and forced alignments were generated from the available transcripts using the WebMAUS API.
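Because all recordings are at 16 kHz, a time-stamped alignment interval maps directly onto a range of sample indices. The following is a minimal sketch of that mapping, not the authors' code: the `Segment` class, its field names, and the helper functions are hypothetical, and the on-disk format of the alignment files is not assumed here.

```python
from dataclasses import dataclass

# 16 kHz is the sampling rate stated in the dataset description.
SAMPLE_RATE = 16_000


@dataclass(frozen=True)
class Segment:
    """Hypothetical container for one aligned unit (e.g. a phone or word)."""
    label: str
    start_s: float  # start time in seconds
    end_s: float    # end time in seconds


def segment_to_sample_range(segment, sample_rate=SAMPLE_RATE):
    """Convert a segment's start/end times to (start, end) sample indices."""
    start = round(segment.start_s * sample_rate)
    end = round(segment.end_s * sample_rate)
    if start < 0 or end < start:
        raise ValueError(f"invalid segment boundaries: {segment}")
    return start, end


def extract_segment(samples, segment, sample_rate=SAMPLE_RATE):
    """Slice the samples belonging to one aligned segment."""
    start, end = segment_to_sample_range(segment, sample_rate)
    return samples[start:end]


# Illustrative values only: two seconds of placeholder samples and one segment.
samples = list(range(2 * SAMPLE_RATE))
segment = Segment(label="example", start_s=0.5, end_s=1.25)
sample_range = segment_to_sample_range(segment)
clip = extract_segment(samples, segment)
```

The same index arithmetic applies to any array-like audio buffer. Only the step that parses alignment files into start and end times depends on the actual file format.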
The SSL-NL evaluation dataset was released as part of:
de Heer Kloots, M., Mohebbi, H., Pouw, C., Shen, G., Zuidema, W., Bentum, M. (2025) What do self-supervised speech models know about Dutch? Analyzing advantages of language-specific pre-training. Proc. Interspeech 2025, 256-260, doi: 10.21437/Interspeech.2025-1526
Analysis code accompanying the SSL-NL dataset (to replicate results from the Interspeech paper) is available at https://github.com/mdhk/SSL-NL-eval.
Files
annotations.zip
Total size: 1.7 GB

MD5 | Size |
---|---|
a7daafb5bdd002e2fca7641117845026 | 116.9 kB |
9ca467a9274fa35cae5c4fcb8c233690 | 1.7 GB |
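The MD5 checksums listed above can be used to confirm that a download completed without corruption. The following is a minimal sketch using only the Python standard library. The function names are illustrative, and the local file path in the usage note is an assumption about where the archive was saved.

```python
import hashlib

# MD5 checksums as listed in the Files section of this record.
LISTED_MD5 = {
    "a7daafb5bdd002e2fca7641117845026",
    "9ca467a9274fa35cae5c4fcb8c233690",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_checksum(path):
    """True if the file's MD5 equals one of the checksums listed for this record."""
    return md5_of_file(path) in LISTED_MD5
```

For example, `matches_listed_checksum("annotations.zip")` should return `True` for an intact copy saved under that (assumed) local name. Reading in chunks keeps memory use low for the 1.7 GB archive.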
Additional details
Related works
- Is published in: Conference paper, 10.21437/Interspeech.2025-1526 (DOI)
Funding
- SURFsara (Netherlands): EINF-8324
Software
- Repository URL: https://github.com/mdhk/SSL-NL-eval
- Programming language: Python
- Development Status: Active
References
- de Heer Kloots, M., Mohebbi, H., Pouw, C., Shen, G., Zuidema, W., Bentum, M. (2025). What do self-supervised speech models know about Dutch? Analyzing advantages of language-specific pre-training. Proc. Interspeech 2025, 256-260. https://doi.org/10.21437/Interspeech.2025-1526