Published February 20, 2025 | Version v1
Dataset Open

NISV 81k Dutch TV Speech Data Set

  • 1. Radboud University Nijmegen
  • 2. Netherlands Institute for Sound and Vision
  • 3. University of Twente

Description

This dataset was developed as part of a Dutch HOSAN research program exploring the feasibility of using heritage datasets from the Netherlands to create speech models that represent all Dutch voices.

The dataset contains a large quantity of Dutch audio data from Dutch television broadcasts in the period 1972-2022, stored at the Netherlands Institute for Sound & Vision. The audio files add up to a total of 81k hours of audio, with most audio files having a length of 30 minutes to 1 hour.

An initial selection was made of material from the period 1972-2022 that met the following criteria:

  • TV broadcasts excluding international news
  • Radio broadcasts from the radio station NPO Radio 1
  • Excluding music-related genres
  • Broadcast programme material only (no rushes etc.)
  • Programme duration available in the metadata
  • Digital carrier available

This initial selection contained approximately 184k hours of TV and 128k hours of radio. For training speech models, only the TV data was selected. The set was further reduced by selecting specific genres (see the genres.txt file) and by removing audio files longer than three hours. Because training the speech models required as little duplication of audio fragments as possible, only a single broadcast per day of any given series was selected (e.g. one edition per day of the Dutch public broadcaster's news programme).
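The reduction steps described above can be sketched roughly as follows. This is an illustration only, not the procedure actually used at Sound & Vision: the record keys (`medium`, `genre`, `duration_s`, `series`, `date`) are hypothetical, and the description does not say which broadcast was kept when a series aired more than once on a day, so the sketch simply keeps the first one it encounters.

```python
MAX_DURATION_S = 3 * 60 * 60  # three hours, as stated in the description


def select_broadcasts(records, allowed_genres):
    """Apply the described filters to an iterable of dict records.

    All key names below are hypothetical; the real metadata schema may differ.
    """
    seen = set()  # (series, date) pairs already represented
    selected = []
    for rec in records:
        if rec["medium"] != "tv":
            continue  # only TV data was used for training
        if rec["genre"] not in allowed_genres:
            continue  # genre not in the selected list
        if rec["duration_s"] > MAX_DURATION_S:
            continue  # longer than three hours
        key = (rec["series"], rec["date"])
        if key in seen:
            continue  # assumption: keep the first broadcast per series per day
        seen.add(key)
        selected.append(rec)
    return selected
```

Keying the deduplication on the (series, date) pair keeps at most one broadcast per series per day while leaving different series on the same day untouched.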

Low-resolution versions of the MXF carriers were downloaded, the audio (in AAC format) was extracted, and the resulting dataset was delivered to the researchers under secure conditions, with strict non-disclosure agreements in place covering both the data and the resulting models.

Initial use of the data revealed that eighty-eight audio files contained a virtually flat audio signal. Investigation of a sample at Sound & Vision revealed that these came from videos for which the original analogue carriers contained no audio signal. The carrier IDs of these files are contained in the file 'no_audio.txt'. 
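A user of the metadata may want to exclude these carriers before further processing. The sketch below assumes that 'no_audio.txt' holds one carrier ID per line, which the description does not confirm; the function names are illustrative.

```python
def parse_id_list(text):
    """Return the set of non-empty, whitespace-stripped lines in text.

    Assumption: the ID list file contains one carrier ID per line.
    """
    return {line.strip() for line in text.splitlines() if line.strip()}


def drop_silent_carriers(carrier_ids, silent_ids):
    """Return the carrier IDs that are not listed as silent, order preserved."""
    return [cid for cid in carrier_ids if cid not in silent_ids]
```

In practice, the text passed to `parse_id_list` would be the contents of 'no_audio.txt', and the carrier IDs would come from the published CSV.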

This published version of the dataset contains the following files:

  • filtered_any_genre_cc0.zip
    • filtered_any_genre_cc0.csv - a table containing the IDs of the programmes and their digital carriers, plus non-copyrighted programme metadata such as title and broadcast date
    • segments.txt - the timecodes of the sections of the carriers used in training the speech models
  • genres.txt - a list of the genres selected (in Dutch)
  • no_audio.txt - a list of the carriers without significant audio content
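As a convenience, the CSV can be read directly from the zip archive without unpacking it. The following is a minimal sketch that assumes the CSV sits at the top level of the archive under the name given above, is UTF-8 encoded, and has a header row; none of these details is stated in the description. Because the column names are not documented here, rows are returned as dictionaries keyed by whatever the header contains.

```python
import csv
import io
import zipfile


def iter_csv_from_zip(zip_source, member="filtered_any_genre_cc0.csv"):
    """Yield each CSV row in the archive member as a dict keyed by header name.

    zip_source may be a path or a binary file-like object.
    Assumptions: member path, UTF-8 encoding, and a header row.
    """
    with zipfile.ZipFile(zip_source) as archive:
        with archive.open(member) as raw:
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            yield from csv.DictReader(text)
```

Yielding rows one at a time avoids holding the whole table in memory, which may matter given the size of the archive.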

The audio files themselves are under copyright and are not part of this publication. The published dataset serves as a reference standard that research conducted using the audio can cite to document exactly which material was used.

Files

filtered_any_genre_cc0.zip

Files (203.0 MB)

Checksum Size
md5:c7902bba12030de2f26cd297ee3a192f 203.0 MB
md5:9650c852dc387b201e0e483b6325aac9 556 Bytes
md5:ffdcf5ec21b23e65930158e4688bea94 2.6 kB
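The MD5 checksums listed above can be used to check that a download is intact. A minimal sketch using Python's standard `hashlib` module follows; the function names and the chunk size are illustrative choices.

```python
import hashlib


def md5_of_stream(stream, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a binary stream, read in chunks."""
    digest = hashlib.md5()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def matches_listed_checksum(path, listed):
    """Compare a file's MD5 with a listed value such as 'md5:c790...'."""
    expected = listed.removeprefix("md5:").lower()
    with open(path, "rb") as fh:
        return md5_of_stream(fh) == expected
```

Reading in chunks keeps memory use constant regardless of file size. Note that `str.removeprefix` requires Python 3.9 or later.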

Additional details

Dates

Created
2023-03-22
Date on which the collected audio files were delivered