Published May 29, 2025 | Version 1
Dataset Open

SSL-NL dataset

Description

SSL-NL is a curated dataset of Dutch speech recordings with accompanying forced alignments. It is designed to test how self-supervised learning (SSL) speech representations encode Dutch phonetic and lexical features, while allowing comparisons across different analysis methods.

It consists of two subsets from different domains: audiobooks (MLS) and face-to-face conversations (IFADV). The MLS recordings were extracted from the Dutch part of Multilingual LibriSpeech. The IFADV recordings were extracted from the IFA Dialog Video corpus and split by speaker turn. All audio recordings were downsampled to 16 kHz, and forced alignments were generated from the available transcripts using the WebMAUS API.
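The stated 16 kHz sampling rate can be sanity-checked after download. The following is a minimal sketch that assumes the recordings are stored as PCM WAV files (the record does not state the audio container format); the function name `has_expected_rate` is hypothetical and not part of the accompanying analysis code.

```python
import wave

# Sampling rate stated in the dataset description.
EXPECTED_RATE_HZ = 16_000


def has_expected_rate(path, expected_hz=EXPECTED_RATE_HZ):
    """Return True if the WAV file at `path` has the expected sampling rate.

    Assumption: the file is an uncompressed PCM WAV file, which is the
    only format the standard-library `wave` module can read.
    """
    with wave.open(str(path), "rb") as wav_file:
        return wav_file.getframerate() == expected_hz
```

Calling `has_expected_rate(path)` on each extracted recording and collecting the paths that return `False` would flag any file that deviates from the documented rate.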

The SSL-NL evaluation dataset was released as part of:

de Heer Kloots, M., Mohebbi, H., Pouw, C., Shen, G., Zuidema, W., Bentum, M. (2025) What do self-supervised speech models know about Dutch? Analyzing advantages of language-specific pre-training. Proc. Interspeech 2025, 256-260, doi: 10.21437/Interspeech.2025-1526

Analysis code accompanying the SSL-NL dataset (to replicate results from the Interspeech paper) is available at https://github.com/mdhk/SSL-NL-eval.

Files

annotations.zip

Files (1.7 GB)

Checksum Size
md5:a7daafb5bdd002e2fca7641117845026 116.9 kB
md5:9ca467a9274fa35cae5c4fcb8c233690 1.7 GB
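
The MD5 digests listed above can be used to verify the downloaded archives. The following is a minimal sketch using only the Python standard library. The digests are copied from this record and keyed by listed size, since the listing does not show which file name belongs to which digest; the helper names `md5_of_file` and `matches_listed_md5` are hypothetical.

```python
import hashlib

# MD5 digests as listed on this record, keyed by the listed file size
# because the file names are not shown next to the digests.
LISTED_MD5 = {
    "116.9 kB": "a7daafb5bdd002e2fca7641117845026",
    "1.7 GB": "9ca467a9274fa35cae5c4fcb8c233690",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the hex MD5 digest of a file, reading it in chunks so
    that a multi-gigabyte archive does not have to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_md5(path, expected_hex):
    """Return True if the file's MD5 digest equals `expected_hex`."""
    return md5_of_file(path) == expected_hex.lower()
```

For example, `matches_listed_md5(local_path, LISTED_MD5["1.7 GB"])` would check the larger archive, where `local_path` is wherever the download was saved.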

Additional details

Related works

Is published in
Conference paper: 10.21437/Interspeech.2025-1526 (DOI)

Funding

SURFsara (Netherlands)
EINF-8324

Software

Repository URL
https://github.com/mdhk/SSL-NL-eval
Programming language
Python
Development Status
Active

References

  • de Heer Kloots, M., Mohebbi, H., Pouw, C., Shen, G., Zuidema, W., Bentum, M. (2025). What do self-supervised speech models know about Dutch? Analyzing advantages of language-specific pre-training. Proc. Interspeech 2025. http://doi.org/10.21437/Interspeech.2025-1526