Published May 15, 2021 | Version 1.0
Dataset Open

Speech recognition alignments for Finnish parliament data

  • 1. Aalto University

Description

This dataset contains speech from the plenary sessions of the Finnish Parliament from 2008 to 2020, segmented and aligned for speech recognition training. In total, the training set has:

  • 1.4 million samples
  • 3100 hours of audio
  • 460 speakers
  • over 19 million word tokens

Additionally, the upload contains a 5-hour development set and a 5-hour evaluation set, described in publication 10.21437/Interspeech.2017-1115. Because the training set (~300 GB) exceeds the Zenodo upload limit (50 GB), only the development and evaluation sets are published on Zenodo. The rest of the data is available at: http://urn.fi/urn:nbn:fi:lb-2021051903

The training set comes in two parts:

  1. A 2008-2016 set, originally described in publication 10.21437/Interspeech.2017-1115. This set includes a list of samples from sessions in 2008-2014 that can be combined with the 2015-2020 set to form the 3100-hour training set.
  2. A new 2015-2020 dataset.

All audio samples are single-channel, 16 kHz, 16-bit wav files. Each wav file has a corresponding transcript in a .trn text file. The data is machine-extracted, so small inaccuracies remain in the training set transcripts, and the set may contain a few Swedish samples. The development and evaluation sets have been corrected by hand.
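The format described above can be checked programmatically. The following is a minimal sketch, not part of the dataset's own tooling: it assumes that each transcript sits next to its wav file with the same base name and a .trn extension, and that transcripts are UTF-8 encoded. Neither detail is stated on this page, and the function name `check_sample` is hypothetical.

```python
import wave
from pathlib import Path


def check_sample(wav_path):
    """Return (format_ok, transcript) for one wav file.

    format_ok is True when the file is single-channel, 16 kHz, 16-bit.
    transcript is the text of the sibling .trn file, or None if missing.
    """
    wav_path = Path(wav_path)
    with wave.open(str(wav_path), "rb") as wf:
        format_ok = (
            wf.getnchannels() == 1          # single-channel
            and wf.getframerate() == 16000  # 16 kHz sampling rate
            and wf.getsampwidth() == 2      # 16-bit = 2 bytes per sample
        )
    # Assumption: the transcript shares the wav file's base name.
    trn_path = wav_path.with_suffix(".trn")
    transcript = None
    if trn_path.exists():
        # Assumption: transcripts are UTF-8 encoded.
        transcript = trn_path.read_text(encoding="utf-8").strip()
    return format_ok, transcript
```

Running such a check over the extracted files before training would flag samples whose audio format or transcript pairing differs from the description.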

The licenses can be viewed at:

The code used in extraction is available at:

Files

Files (1.1 GB)

Name: fi-parl-asr-dev-eval.zip
Size: 1.1 GB
md5:4fa2b5e22b3b106982797e1ac8445f42
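
The MD5 checksum listed above can be used to confirm that a download is intact. The following is a minimal sketch using Python's standard `hashlib` module; the helper names `md5_of_file` and `matches_checksum` are illustrative, and the local file path is whatever location the archive was saved to.

```python
import hashlib

# Checksum listed on this page for fi-parl-asr-dev-eval.zip.
EXPECTED_MD5 = "4fa2b5e22b3b106982797e1ac8445f42"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel stops when read() returns empty bytes.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest equals the expected value."""
    return md5_of_file(path) == expected.lower()
```

Calling `matches_checksum("fi-parl-asr-dev-eval.zip")` on the downloaded archive should return True if the file matches the published checksum. Reading in chunks keeps memory use low for a 1.1 GB file.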

Additional details

Related works

References
Conference paper: 10.21437/Interspeech.2017-1115 (DOI)

Funding

MeMAD – Methods for Managing Audiovisual Data: Combining Automatic Efficiency with Human Accuracy 780069
European Commission