SSL-NL dataset

de Heer Kloots, Marianne; Mohebbi, Hosein; Pouw, Charlotte; Shen, Gaofei; Zuidema, Willem; Bentum, Martijn

doi:10.5281/zenodo.15548947

Published May 29, 2025 | Version 1

Dataset Open

1. University of Amsterdam
2. Tilburg University
3. Radboud University Nijmegen

SSL-NL is a curated dataset of Dutch speech recordings and accompanying forced alignments. It is designed to test how self-supervised learning (SSL) speech representations encode Dutch phonetic and lexical features, while allowing comparisons across different analysis methods.

It consists of two subsets from different domains: audiobooks (MLS) and face-to-face conversations (IFADV). The MLS recordings were extracted from the Dutch part of Multilingual LibriSpeech; the IFADV recordings were extracted from the IFA Dialog Video corpus and split by speaker turn. All audio recordings were downsampled to 16 kHz, and forced alignments were generated from the available transcripts using the WebMAUS API.
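Since the description states that all recordings were downsampled to 16 kHz, a quick sanity check after unpacking the audio is to read each file's sample rate. The sketch below is only an illustration: it assumes the recordings are PCM WAV files readable by Python's standard `wave` module, which this record does not confirm, and the helper names `wav_sample_rate` and `is_16khz` are hypothetical.

```python
import wave

# Sample rate stated in the dataset description.
EXPECTED_RATE_HZ = 16_000


def wav_sample_rate(path):
    """Return the sample rate (Hz) stored in a WAV file's header."""
    # Assumption: the file is PCM WAV; other formats will raise wave.Error.
    with wave.open(str(path), "rb") as wav:
        return wav.getframerate()


def is_16khz(path):
    """True if the WAV file at `path` is sampled at 16 kHz."""
    return wav_sample_rate(path) == EXPECTED_RATE_HZ
```

If the files turn out to be in another format (for example FLAC), the same check would need an audio library that can read that format.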

The SSL-NL evaluation dataset was released as part of:

de Heer Kloots, M., Mohebbi, H., Pouw, C., Shen, G., Zuidema, W., Bentum, M. (2025) What do self-supervised speech models know about Dutch? Analyzing advantages of language-specific pre-training. Proc. Interspeech 2025, 256-260, doi: 10.21437/Interspeech.2025-1526

Analysis code accompanying the SSL-NL dataset (to replicate results from the Interspeech paper) is available at https://github.com/mdhk/SSL-NL-eval.

Files (1.7 GB)

Name	Size	MD5
annotations.zip	116.9 kB	a7daafb5bdd002e2fca7641117845026
audio.zip	1.7 GB	9ca467a9274fa35cae5c4fcb8c233690
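The MD5 digests listed for the two archives can be used to confirm that a download completed without corruption. The following is a minimal sketch using only Python's standard library. The digests are copied from the file listing on this record; the function names `md5_of_file` and `verify_downloads`, and the assumption that both archives sit in one local directory, are illustrative choices rather than part of the dataset.

```python
import hashlib
from pathlib import Path

# MD5 digests copied from the file listing on this record.
EXPECTED_MD5 = {
    "annotations.zip": "a7daafb5bdd002e2fca7641117845026",
    "audio.zip": "9ca467a9274fa35cae5c4fcb8c233690",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory="."):
    """Map each archive name to True (match), False (mismatch) or None (missing)."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        results[name] = (md5_of_file(path) == expected) if path.is_file() else None
    return results


if __name__ == "__main__":
    # Assumption: the archives were downloaded into the current directory.
    labels = {True: "OK", False: "MISMATCH", None: "not found"}
    for name, ok in verify_downloads().items():
        print(f"{name}: {labels[ok]}")
```

Reading in chunks keeps memory use flat, which matters for the 1.7 GB audio archive.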

Additional details

Is published in: Conference paper: 10.21437/Interspeech.2025-1526 (DOI)

Funding: SURFsara (Netherlands), EINF-8324

Repository URL: https://github.com/mdhk/SSL-NL-eval
Programming language: Python
Development Status: Active

