Published September 3, 2001 | Version 1.0

The IFA Spoken Language Corpus

  • 1. The Netherlands Cancer Institute
  • 1. University of Amsterdam
  • 2. Radboud University Nijmegen
  • 3. Dutch Research Council
  • 4. Dutch Language Union

Description

The IFA Spoken Language corpus is a free (GPL) database of hand-segmented Dutch speech. It was constructed with off-the-shelf software using speech from 8 speakers (out of 10) in a variety of speaking styles. For a total of 50,000 words (41 minutes/speaker), speech acquisition and preparation took around 3 person-weeks per speaker. Hand segmentation took 1,000 hours of labeling altogether. The asymptotic segmentation speed was about one word, or four boundaries, per minute. An evaluation showed that the Median Absolute Difference of the segment boundaries was 6 ms between labelers, and 4 ms within labelers. Label differences (substitutions, insertions, and deletions) were found in 8% of the segments between labelers and 5% within labelers. Compiled data are available in relational database format for querying with SQL.
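The boundary agreement measure mentioned above can be illustrated with a minimal sketch. It assumes that two labelers marked the same boundaries in the same order and that times are given in seconds; the function name and the sample values are hypothetical and are not taken from the corpus or its evaluation scripts.

```python
from statistics import median


def median_absolute_difference(boundaries_a, boundaries_b):
    """Median of |a - b| over paired boundary times, returned in milliseconds.

    Assumes both lists hold times in seconds for the same boundaries in
    the same order (an assumption of this sketch, not a corpus format).
    """
    if len(boundaries_a) != len(boundaries_b):
        raise ValueError("boundary lists must be paired one-to-one")
    diffs_ms = [abs(a - b) * 1000.0 for a, b in zip(boundaries_a, boundaries_b)]
    return median(diffs_ms)


# Hypothetical boundary times (seconds) from two labelers.
labeler_1 = [0.100, 0.250, 0.400, 0.620, 0.810]
labeler_2 = [0.104, 0.244, 0.409, 0.620, 0.813]
mad_ms = median_absolute_difference(labeler_1, labeler_2)  # roughly 4 ms here
```

The median is used rather than the mean so that a few grossly misplaced boundaries do not dominate the figure.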

The IFA Spoken Language corpus is currently in version 1.0. This is the "reference" version and the first I consider consistent enough to be useful. However, the annotations (labeling) still contain errors. This means that there are inconsistencies in a few percent of the labels (e.g., wrong word assignment of syllables/phonemes, stress errors, etc.).

Summary information:

Net time in seconds (excluding all pauses)

Gender  Age  ID  Recorded sentences (sec)  Segmented sentences (sec)
F       20   N   3736                      2760
F       28   G   4180                      3978
F       40   L   3112                      2485
F       60   E   4181                      3245
M       15   R   2125                      1439
M       40   K   2720                      1891
M       56   H   2894                      2368
M       66   O   3781                      1696
Total            26733                     19867
                 7:26 hours                5:31 hours
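As a quick check on the totals, the stated net times in seconds can be converted to hours and minutes. The sketch below assumes rounding to the nearest minute, which reproduces the figures given in the table; the function name is hypothetical.

```python
def seconds_to_hours_minutes(total_seconds):
    """Format a duration in seconds as H:MM, rounded to the nearest minute."""
    total_minutes = round(total_seconds / 60)
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours}:{minutes:02d}"


recorded = seconds_to_hours_minutes(26733)   # total recorded net time
segmented = seconds_to_hours_minutes(19867)  # total segmented net time
```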

Speech in tokens (total)

         Recorded           Segmented
Gender   Sentences  Words   Sentences  Words   Syllables  Phonemes
4F / 4M  6128       73067   4492       51782   74702      187544

The IFA Spoken Language corpus was constructed using the Praat speech editing and analysis program. All speech material is accessible with Praat.

The Dutch Language Organization (Nederlandse Taalunie) holds all copyrights (unless explicitly stated otherwise) and makes the complete corpus available under the GNU General Public License (see below).

Methods

Audio files

Audio files are stored in AIFC format (16 bit, 44100 Hz). Recording microphones were coded as hm for head-mounted and fm for fixed microphone. Two-channel recordings were split into chunks ("paragraphs") for storage and processing. Chunks were split into single-channel sentences (fm and hm) for word and phoneme segmentation.
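The split from two-channel chunks into single-channel sentence files can be sketched as a deinterleaving step on 16-bit sample data. This is only an illustration on raw interleaved PCM bytes, not the procedure actually used for the corpus; the function names are hypothetical, and which channel holds hm and which holds fm is not stated here, so the sketch simply returns the first and second channel.

```python
from array import array


def split_stereo_16bit(interleaved: bytes):
    """Split interleaved 16-bit two-channel PCM into two mono byte strings."""
    samples = array("h")  # signed 16-bit integers
    samples.frombytes(interleaved)
    if len(samples) % 2:
        raise ValueError("expected an even number of samples (two channels)")
    first = samples[0::2]   # first channel of each frame
    second = samples[1::2]  # second channel of each frame
    return first.tobytes(), second.tobytes()


def bytes_per_second(sample_rate=44100, bits=16, channels=2):
    """Uncompressed PCM data rate for the format described above."""
    return sample_rate * (bits // 8) * channels
```

At 16 bit and 44100 Hz, a two-channel recording takes 176400 bytes per second before any container overhead, and each single-channel file takes half of that.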

Recording equipment

Speech was recorded in a quiet, sound-treated room. Recording equipment and a cueing computer were in a separate control room. Two-channel recordings were made with a head-mounted dynamic microphone (hm, Shure SM10A) on one channel and a fixed HF condenser microphone (fm, Sennheiser MKH 105) on the other. Recording was done directly to a Philips Audio CD-recorder, i.e., 16 bit linear coding at 44.1 kHz stereo. A standard sound source (white noise and pure 400 Hz tone) of 78 dB was recorded from a fixed position relative to the fixed microphone to be able to mark the recording level. These reference source recordings are stored with the speech as G[12]N and G[12]T. The head-mounted microphone did not allow precise repositioning between sessions, and was even known to move during the sessions (which was noted).
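One way such a reference recording could be used is to express the level of a speech segment relative to the 78 dB source. The sketch below is an assumption about usage, not the authors' documented procedure: it assumes both signals come from the fixed microphone channel, recorded with the same gain, and that amplitude scales linearly. The function names are hypothetical.

```python
import math


def rms(samples):
    """Root-mean-square amplitude of a sequence of sample values."""
    if not samples:
        raise ValueError("need at least one sample")
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def level_db(samples, reference_samples, reference_db=78.0):
    """Level of `samples` in dB, given a reference recording of known level."""
    ratio = rms(samples) / rms(reference_samples)
    return reference_db + 20.0 * math.log10(ratio)


# Hypothetical sample values: a signal at half the reference amplitude.
reference = [1000, -1000] * 100
quieter = [500, -500] * 100
estimated = level_db(quieter, reference)  # about 6 dB below the reference
```

Because the head-mounted microphone could move between and during sessions, a calibration of this kind would only be meaningful for the fixed microphone.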

Speakers

Speakers were selected at the Institute of Phonetic Sciences in Amsterdam (IFA) and consisted mostly of staff and students. Non-staff speakers were paid. In total, 18 speakers (9F/9M) completed both recording sessions. All speakers were mother-tongue speakers and none reported speaking or hearing problems. Recordings of 10 speakers (5F/5M) were selected and split into chunks (paragraphs), based on distribution of sex and age, and the quality of the recordings. Recordings of 4 women and 4 men were selected for phonemic segmentation. The ages of the selected speakers range from 15 to 66 years.

Speaking styles

Eight speaking "styles" were recorded from each speaker.

From informal to formal these were:

  1. Informal story telling face-to-face to an "interviewer" (I)
  2. Retelling a previously read narrative story without sight contact (R)

    And reading aloud:
  3. A narrative story (T)
  4. A random list of all sentences of the narrative stories (S)
  5. "Pseudo-sentences" constructed by replacing all words in a sentence with randomly selected words from the text with the same POS tag (PS)
  6. Lists of selected words from the texts (W)
  7. Lists of all distinct syllables from the word lists (Sy)
  8. A collection of idiomatic (the Alphabet, the numbers 0-12) and "diagnostic" sequences (isolated vowels, /hVd/ and /VCV/ lists) (Pr)

The last style (Pr) was presented in a fixed order; all other lists (S, PS, W, Sy) were (pseudo-)randomized for each speaker before presentation.
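Per-speaker pseudo-randomization of a list can be sketched as a seeded shuffle, so that each speaker gets a different but reproducible order. How the original lists were randomized is not described here; the seeding scheme, function name, and sample items below are hypothetical.

```python
import random


def randomized_list(items, speaker_id, style):
    """Return a reproducible shuffled copy of `items` for one speaker and style."""
    rng = random.Random(f"{speaker_id}-{style}")  # hypothetical seeding scheme
    shuffled = list(items)  # copy, so the input order is left untouched
    rng.shuffle(shuffled)
    return shuffled


# Hypothetical items standing in for the sentences of style S.
sentences = ["sentence 1", "sentence 2", "sentence 3", "sentence 4"]
order_for_n = randomized_list(sentences, "N", "S")
```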

Each speaker read aloud from two separate text collections based on narrative texts. During the first recording session, each speaker read from the same two texts (Fixed text type). These texts were based on the Dutch version of "The north wind and the sun", and on a translation of the fairy tale "Jorinde und Joringel". During the second session, each speaker read from texts based on the informal story told during the first recording session (Variable text type). A non-overlapping selection of words was made from each text type (W). Words were selected to maximize coverage of phonemes and diphones and also included the 50 most frequent words from the texts. The word lists were automatically transcribed into phonemes using a simple CELEX* word list lookup and were split into syllables. The syllables were transcribed back into a pseudo-orthography which was readable for Dutch subjects (Sy). The 70 "pseudo-sentences" (PS) were based on the Fixed texts and corrected for syntactic number and gender. They were "semantically unpredictable" and only marginally grammatical.
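The word selection step can be illustrated with a greedy coverage heuristic. The text only states that words were chosen to maximize coverage of phonemes and diphones; the greedy strategy, the function name, and the toy transcriptions below are assumptions made for this sketch, and the inclusion of the 50 most frequent words is left out.

```python
def greedy_word_selection(transcriptions, max_words):
    """Greedily pick words that add the most unseen phonemes and diphones.

    `transcriptions` maps each word to a list of phoneme symbols.
    """
    def units(phonemes):
        diphones = set(zip(phonemes, phonemes[1:]))  # adjacent phoneme pairs
        return set(phonemes) | diphones

    remaining = {word: units(ph) for word, ph in transcriptions.items()}
    covered, selected = set(), []
    while remaining and len(selected) < max_words:
        # Iterating in sorted order makes ties resolve deterministically.
        best = max(sorted(remaining), key=lambda w: len(remaining[w] - covered))
        if not remaining[best] - covered:
            break  # no remaining word adds new coverage
        selected.append(best)
        covered |= remaining.pop(best)
    return selected, covered


# Toy transcriptions with hypothetical phoneme symbols.
toy = {"kat": ["k", "A", "t"], "tak": ["t", "A", "k"], "at": ["A", "t"]}
chosen, covered_units = greedy_word_selection(toy, max_words=10)
```

In this toy case the word "at" is never chosen, because its phonemes and its single diphone are already covered by "kat".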

* Burnage, G. "CELEX - A Guide for Users." Nijmegen: Centre for Lexical Information, University of Nijmegen. 1990.

Table of contents

Name    MD5    Size

# All documentation, forms etc
Additional Documents.zip    md5:abadd44992ff5ec406ea4060020c56a9        519.1 kB

# Articles describing the IFA corpus
Articles.zip    md5:7ded58a30fb87e0c181266c8f23e9acb        647.3 kB

# Audio data for "Can standard analysis tools be used on decompressed speech?"
COCOSDA 2002 compressed audio.zip       md5:87951aa19784a624c65993722c032f5e        700.2 MB

# All data in the form of .tsv database files (tab separated values) 
DatabaseFiles.zip    md5:ee42e7317d8cc7c504536b01d9a7aecd        402.8 MB

# The protocol files for the labeling
LabelProtocol.zip    md5:38dba02e2551ddaf8f4451c930423e35        325.7 kB

# All the annotation files as Praat TextGrid files: ASPEX, CELEX, Phonemes, POS, SPEX, Transliterations etc.
Labels-chunks.zip    md5:80200e4952786e24c97c540123fb1bf4        626.0 kB
Labels-sentences.zip    md5:c1e052b8b9ccf87a348008acfdd65d96        34.2 MB
Labels-validation.zip    md5:b902ce5c9f76dd0607fa9087c1cc0170        286.5 kB

# All Transcriptions, scripts, and other auxiliary files
SLcorpus.zip    md5:8c730878d4ef986bd092d1eea8234d32        4.1 MB

# Audio files: Chunks
SLspeech-chunks-F20N.zip    md5:a1a9c02278bfcf163f4ba27220a8a761        644.8 MB
SLspeech-chunks-F24I.zip    md5:325c6f84610a706be8b3072d91152979        624.2 MB
SLspeech-chunks-F28G.zip    md5:3cea531286f508267136eb2c87a8b0f8        682.0 MB
SLspeech-chunks-F40L.zip    md5:ea04e9a331dcdb678ee91eca24185247        551.9 MB
SLspeech-chunks-F60E.zip    md5:0274ed0d369ce0ba272f69c469465800        731.1 MB
SLspeech-chunks-M15R.zip    md5:8526497910a6722bce58eed0e520eefb     380.5 MB
SLspeech-chunks-M40K.zip    md5:118da90e364e683ed80cbb3935fa2d06        513.1 MB
SLspeech-chunks-M56H.zip    md5:239c1a50a55cd98c93e1dd5d9d7664f0        531.2 MB
SLspeech-chunks-M58D.zip    md5:69d9825b19769a5bac49131fc8d2f37d        664.6 MB
SLspeech-chunks-M66O.zip    md5:b81312dcc55642adcc566001fe216d84        702.8 MB

# Audio files: Sentences
SLspeech-sentences-fm-F20N.zip    md5:ece88127c0d4e39732a2282f05205a26        358.2 MB
SLspeech-sentences-fm-F28G.zip    md5:0a2b3cf1acf129335e990cf6d114fcdc        331.9 MB
SLspeech-sentences-fm-F40L.zip    md5:c71d3714b4bd4d01dc1bb625e867cce0        258.4 MB
SLspeech-sentences-fm-F60E.zip    md5:de2dafa6d57faf1949c8c4174600f74d        359.4 MB
SLspeech-sentences-fm-M15R.zip    md5:57b0bd8e36eba55c8809122eeb20ef27        178.1 MB
SLspeech-sentences-fm-M40K.zip    md5:0b52a72e7aac1a229c9e14a722f400b7        230.1 MB
SLspeech-sentences-fm-M56H.zip    md5:56a2db59db3132f9f3094ab09bba036f        248.4 MB
SLspeech-sentences-fm-M66O.zip    md5:6248b1c12791fbbb693e567638fe0518        308.7 MB
SLspeech-sentences-hm-F20N.zip    md5:f631c7e6e6dcf81864a05c7f24b51f15        321.1 MB
SLspeech-sentences-hm-F28G.zip    md5:ed29bcd23afe9d6b1bf3687adb5fd743        283.6 MB
SLspeech-sentences-hm-F40L.zip    md5:dea73e749829d52429f82f2e14b1b706        245.6 MB
SLspeech-sentences-hm-F60E.zip    md5:b4a3168e323dc840e13dcd164708d42e        319.8 MB
SLspeech-sentences-hm-M15R.zip    md5:66ca4c3607d1dc460ed67ece3b1e8145        159.2 MB
SLspeech-sentences-hm-M40K.zip    md5:b7fd9bec968f4240a3d412c8d093a24f        215.1 MB
SLspeech-sentences-hm-M56H.zip    md5:7daa99cb65af3978b73f4c0ef1c394c9        212.3 MB
SLspeech-sentences-hm-M66O.zip    md5:2644c73095f5b695b6e5379e7df2cf77        277.7 MB
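A downloaded archive can be checked against the MD5 values listed above. The following is a minimal sketch using Python's standard `hashlib`; the helper names are hypothetical, and the local file path is whatever the archive was saved as.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path, listed):
    """Compare a file's MD5 with a listed value such as 'md5:8c73...'."""
    expected = listed.removeprefix("md5:").strip().lower()
    return md5_of_file(path) == expected
```

For example, `matches_listing("SLcorpus.zip", "md5:8c730878d4ef986bd092d1eea8234d32")` should return `True` for an intact download of that archive. Reading in chunks keeps memory use low for the larger audio archives.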

Files

Files (11.5 GB)

MD5    Size
md5:abadd44992ff5ec406ea4060020c56a9    519.1 kB
md5:9dcd8e32ecdcc93e3fa3fa49f6b0220b    925.2 kB
md5:87951aa19784a624c65993722c032f5e    700.2 MB
md5:ee42e7317d8cc7c504536b01d9a7aecd    402.8 MB
md5:38dba02e2551ddaf8f4451c930423e35    325.7 kB
md5:80200e4952786e24c97c540123fb1bf4    626.0 kB
md5:c1e052b8b9ccf87a348008acfdd65d96    34.2 MB
md5:b902ce5c9f76dd0607fa9087c1cc0170    286.5 kB
md5:8c730878d4ef986bd092d1eea8234d32    4.1 MB
md5:a1a9c02278bfcf163f4ba27220a8a761    644.8 MB
md5:325c6f84610a706be8b3072d91152979    624.2 MB
md5:3cea531286f508267136eb2c87a8b0f8    682.0 MB
md5:ea04e9a331dcdb678ee91eca24185247    551.9 MB
md5:0274ed0d369ce0ba272f69c469465800    731.1 MB
md5:8526497910a6722bce58eed0e520eefb    380.5 MB
md5:118da90e364e683ed80cbb3935fa2d06    513.1 MB
md5:239c1a50a55cd98c93e1dd5d9d7664f0    531.2 MB
md5:69d9825b19769a5bac49131fc8d2f37d    664.6 MB
md5:b81312dcc55642adcc566001fe216d84    702.8 MB
md5:ece88127c0d4e39732a2282f05205a26    358.2 MB
md5:0a2b3cf1acf129335e990cf6d114fcdc    331.9 MB
md5:c71d3714b4bd4d01dc1bb625e867cce0    258.4 MB
md5:de2dafa6d57faf1949c8c4174600f74d    359.4 MB
md5:57b0bd8e36eba55c8809122eeb20ef27    178.1 MB
md5:0b52a72e7aac1a229c9e14a722f400b7    230.1 MB
md5:56a2db59db3132f9f3094ab09bba036f    248.4 MB
md5:6248b1c12791fbbb693e567638fe0518    308.7 MB
md5:f631c7e6e6dcf81864a05c7f24b51f15    321.1 MB
md5:ed29bcd23afe9d6b1bf3687adb5fd743    283.6 MB
md5:dea73e749829d52429f82f2e14b1b706    245.6 MB
md5:b4a3168e323dc840e13dcd164708d42e    319.8 MB
md5:66ca4c3607d1dc460ed67ece3b1e8145    159.2 MB
md5:b7fd9bec968f4240a3d412c8d093a24f    215.1 MB
md5:7daa99cb65af3978b73f4c0ef1c394c9    212.3 MB
md5:2644c73095f5b695b6e5379e7df2cf77    277.7 MB

Additional details

Funding

Dutch Research Council
Hoe efficiënt is spraak 355-75-001

References

  • Van Son, R. J. J. H., Binnenpoorte, D., van den Heuvel, H., & Pols, L. C. (2001). The IFA Corpus: a Phonemically Segmented Dutch "Open Source" Speech Database. Proc. EUROSPEECH 2001, Aalborg, Denmark, Vol. 3, 2051-2054.
  • Van Son, R. J. J. H., & Pols, L. C. (2001). Structure and access of the open source IFA Corpus. In Proceedings of the IRCS workshop on Linguistic Databases, Philadelphia (pp. 245-253).
  • Pols, L. C., & van Son, R. J. J. H. (2002). Accessing the IFA-corpus. Book in honor of the 70-th anniversary of Prof. LV Bondarko, 316-320.
  • Van Son, R. J. J. H. (2002). Can standard analysis tools be used on decompressed speech? In COCOSDA 2002 Workshop of the International Committee for the Co-ordination and Standardisation of Speech Databases and Assessment Techniques.
  • Van Son, R. J. J. H., & Pols, L. C. (2002). Evidence for efficiency in vowel production. In INTERSPEECH (pp. 37-40).
  • Van Son, R. J. (2005). A study of pitch, formant, and spectral estimation errors introduced by three lossy speech compression algorithms. Acta Acustica united with Acustica, 91(4), 771-778.