TUNDRA - A Multilingual Corpus of Found Data for TTS Research Created with Light Supervision

Stan, Adriana; Watts, Oliver; Clark, Rob; Yamagishi, Junichi; King, Simon
doi:10.5281/zenodo.12543428
Published June 26, 2013 | Version v1

1. Technical University of Cluj-Napoca
2. Centre for Speech Technology Research, University of Edinburgh
3. University of Edinburgh
4. National Institute of Informatics
The corpus is described in:
A. Stan, O. Watts, Y. Mamiya, M. Giurgiu, R. A. J. Clark, J. Yamagishi, S. King, TUNDRA: A Multilingual Corpus of Found Data for TTS Research Created with Light Supervision, In Proc. Interspeech, Lyon, France, August 2013
###############################################################
##                                                           ##
##              THE SIMPLE4ALL TUNDRA CORPUS                 ##
##                    version 1.0                            ##
##                                                           ##
###############################################################

Simple4All Tundra (version 1.0) is the first release of a
standardised multilingual corpus designed for text-to-speech 
research with imperfect or found data. The corpus consists of
approximately 60 hours of speech data from audiobooks in 14
languages, as well as utterance-level alignments obtained with
a lightly-supervised process. Most audiobooks are from the public
domain and allow redistribution. However, some have restricted
use, and in those cases the segmented and aligned data cannot 
be downloaded from our website. 


---------------------------------------------------------------
                         LICENCE
---------------------------------------------------------------

This work is licensed under a Creative Commons Attribution 3.0 
Unported License http://creativecommons.org/licenses/by/3.0/
This licence applies to the selection, segmentation and alignment
of the speech and text data. 

The underlying audio and text are licensed under their specific 
datasource terms. Please refer to the links below for a full 
description of them. 

If you use any part of the corpus in your work, please cite the 
following paper:

A. Stan, O. Watts, Y. Mamiya, M. Giurgiu, R. A. J. Clark, J. Yamagishi, 
S. King, TUNDRA: A Multilingual Corpus of Found Data for TTS Research 
Created with Light Supervision, In Proc. Interspeech, Lyon, France, 
August 2013

---------------------------------------------------------------
		   SPEECH AND TEXT SOURCES
---------------------------------------------------------------

1) Bulgarian - "Zhetvariat"  by Yordan Yovkov
audio: http://librivox.org/zhetvariat-by-yordan-yovkov
text: http://slovo.bg/showwork.php3?AuID=95&WorkID=9610&Level=1

2) Danish - "Grimms eventyr I udvalg" by Grimm Brothers
audio: http://librivox.org/grimms-eventyr-i-udvalg-by-br%C3%B8drene-grimm
text: http://www.estrup.org/cms/?mod=text&id=392

3) Dutch - "Anna Karenina" by Leo Tolstoy
audio: http://librivox.org/anna-karenina-by-leo-tolstoy
text: http://www.gutenberg.org/ebooks/13214

4) English - "Living Alone"  by Stella Benson
audio: http://librivox.org/living-alone-by-stella-benson
text: http://www.gutenberg.org/ebooks/14907

5) Finnish - "Rautatie" by Juhani Aho
audio: http://librivox.org/rautatie-by-juhani-aho
text: http://www.gutenberg.org/ebooks/10481

6) French - "Candide" by Voltaire
audio: http://librivox.org/candide-by-voltaire
text: http://www.gutenberg.org/cache/epub/4650/pg4650.txt

7) German - "Das Bildnis des Dorian Gray"  by Oscar Wilde
audio: http://librivox.org/das-bildnis-des-dorian-gray-by-oscar-wilde
text: http://gutenberg.spiegel.de/buch/1836/1

8) Hungarian - "Egri csillagok" by Geza Gardonyi
audio: http://gutenberg.spiegel.de/buch/1836/1
text: http://mek.oszk.hu/00600/00656/index.phtml

9) Italian - "Galatea"  by Anton Giulio Barrili
audio: http://librivox.org/galatea-by-anton-giulio-barrili/
text: http://www.gutenberg.org/ebooks/19427

10) Polish - "Siedem wybranych opowiadan" by Wladyslaw Orkan
audio: http://librivox.org/siedem-wybranych-opowiadan-by-wladyslaw-orkan/
text: http://pl.wikisource.org/wiki/Autor:W%C5%82adys%C5%82aw_Orkan

11) Portuguese - "Senhora"  by Jose de Alencar
audio: http://librivox.org/senhora-by-jose-de-alencar/
text: http://stat.correioweb.com.br/arquivos/educacao/arquivos/JosdeAlencar-Senhora0.pdf

12) Romanian - "Mara" by Ioan Slavici
audio: http://speech.utcluj.ro/corpora/mara.html
text: http://ro.wikisource.org/wiki/Mara

13) Russian - "Ucheniye Khrista" by Leo Tolstoy 
audio: http://librivox.org/teachings-of-christ-rus-by-leo-tolstoy/
text: http://az.lib.ru/t/tolstoj_lew_nikolaewich/text_0520.shtml

14) Spanish - "Don Quijote de la Mancha" by Miguel de Cervantes
audio: http://www.quijote.es/IVCentenario_AudioLibro.php
text: http://www.gutenberg.org/ebooks/5921

---------------------------------------------------------------
                          CONTENTS
---------------------------------------------------------------

For each audiobook you can download the following information.
The exception is the Spanish audiobook, which has restricted use:
its speech data cannot be downloaded from our website.

1) Segmented and aligned data
    -- http://tundra.simple4all.org/download.html
    -- an archive containing the results of the lightly
       supervised segmentation and alignment algorithm;
    -- folders and files (the following are the same for
       both training and test data sets):
       -- wav/ - speech data maintaining the original chapter
          names, but with additional indexes resulting from the
          sentence-level segmentation;
       -- txt/ - raw text files corresponding to each speech
          file from the wav/ folder;
       -- txtWithPunctuation/ - text files for each speech
          segment with punctuation restored from the original
          book text;
       -- speech_transcript.txt - a single file for all
          orthographic transcripts;

    -- a separate handmade test set in the handmadeTest/ folder
       (see below for its description);


2) 1-hour subset of selected data
    -- http://tundra.simple4all.org/ssw8data.html
    -- an archive containing approximately 1 hour of selected
       audio used to train the voices from the Demo section;

3) Synthetic samples
    -- http://tundra.simple4all.org/ssw8data.html
    -- an archive containing the synthetic samples obtained with
       our lightly supervised TTS system for the handmade test set;
        
4) Chapter-level annotation
    -- http://tundra.simple4all.org/download.html
    -- a file with the chapter-level time alignment within the original
       data and the corresponding text for the confident data.
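
The folder layout above suggests a simple way to iterate over the
aligned data. The following is a minimal sketch only, assuming that
each transcript in txt/ carries the same base name as its audio file
in wav/, with .wav and .txt extensions. That naming convention and
the function name pair_utterances are assumptions, not part of the
corpus documentation.

```python
from pathlib import Path


def pair_utterances(root):
    """Return sorted (wav_path, txt_path) pairs that share a base name.

    Assumes root contains wav/ and txt/ subfolders, and that each
    transcript is named like its audio file with a .txt extension.
    """
    root = Path(root)
    pairs = []
    for wav in sorted((root / "wav").glob("*.wav")):
        txt = root / "txt" / (wav.stem + ".txt")
        if txt.is_file():  # skip audio that has no matching transcript
            pairs.append((wav, txt))
    return pairs
```

Replacing "txt" with "txtWithPunctuation" in the sketch would pair the
audio with the punctuated transcripts instead, under the same naming
assumption.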
        
---------------------------------------------------------------
		SEGMENTATION AND ALIGNMENT
---------------------------------------------------------------

Descriptions of the lightly supervised segmentation and alignment 
methods can be found in the following papers:

1) A. Stan, O. Watts, Y. Mamiya, M. Giurgiu, R. A. J. Clark, 
J. Yamagishi, S. King, TUNDRA: A Multilingual Corpus of Found Data 
for TTS Research Created with Light Supervision, In Proc. 
Interspeech, Lyon, France, August 2013

2) A. Stan, P. Bell, S. King, A grapheme-based method 
for automatic alignment of speech and text data, In Proc. IEEE 
Workshop on Spoken Language Technology, Miami, Florida, USA, 
December 2012

3) Y. Mamiya, J. Yamagishi, O. Watts, R. A. J. Clark, S. King, 
A. Stan, Lightly Supervised GMM VAD to use Audiobook for Speech 
Synthesiser, In Proc. ICASSP, May 2013

The synthetic voice building algorithm is described in detail here:

4) O. Watts, A. Stan, R. Clark, Y. Mamiya, M. Giurgiu, J. Yamagishi, 
S. King, Unsupervised and lightly-supervised learning for rapid 
construction of TTS systems in multiple languages from ‘found’ 
data: evaluation and analysis, In Proc. SSW8, Barcelona, Spain, 
August 2013

A similar approach is presented in:

5) O. Watts, A. Stan, A. Suni, M. Burgos, J.M. Montero, The 
Simple4All entry to the Blizzard Challenge 2013, Blizzard 
Challenge 2013


---------------------------------------------------------------
		TRAIN/TEST DIVISION OF DATA
---------------------------------------------------------------


Test material is taken from the ends of books, from enough whole 
chapters or stories to make up at least 10 min of aligned audio 
(NB: more can be harvested from the unaligned utterances in 
these chapters). 

The exceptions are the Hungarian and Portuguese audiobooks, 
in which the following chapters have variable recording 
conditions and are not considered suitable for comparisons:

Hungarian: egricsillagok_[19-49]
Portuguese:  senhora_[14-20] and senhora_[23-41]
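
The excluded chapters can be filtered out programmatically. This is
a rough sketch only: it assumes the bracket notation denotes inclusive
chapter-number ranges and that file names start with the book name, an
underscore and the chapter number. The file names in the usage note
and the helper is_excluded are hypothetical.

```python
import re

# Inclusive chapter ranges copied from the list above.
EXCLUDED = {
    "egricsillagok": [(19, 49)],
    "senhora": [(14, 20), (23, 41)],
}


def is_excluded(filename):
    """True if the file's chapter number lies in an excluded range."""
    match = re.match(r"([A-Za-z]+)_(\d+)", filename)
    if match is None:
        return False  # name does not follow the assumed convention
    book, chapter = match.group(1), int(match.group(2))
    return any(low <= chapter <= high for low, high in EXCLUDED.get(book, []))
```

Under these assumptions, a name such as senhora_14_0003.wav would be
flagged as excluded, while senhora_21_0001.wav would not.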


The following chapters are reserved for testing:

Bulgarian:  zhetvariat_2{3,4,5}*
Danish: eventyr_{08,09,10,11,12}*
Dutch: annakarenina_021*
German: doriangray_17*
English: livingalone_{09,10}*
Finnish:   rautatie_{7,8}*
French: candide_{29,30}*
Hungarian: egricsillagok_{17,18}*
Italian: galatea_{19,20}*
Polish:  siedemwybranchopowiadan_7*
Portuguese: senhora_{12,13}*
Romanian: mara_7{1,2}*
Russian: teachingsofchrist_9*
Spanish: Parte1_35*
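
The patterns above use shell-style brace groups, which Python's
fnmatch module does not expand by itself. The following sketch expands
the braces first and then matches file names against each resulting
pattern. The helper names and the example file names are hypothetical,
and reading the patterns as shell globs is an assumption.

```python
import fnmatch
import re


def expand_braces(pattern):
    """Expand {a,b,c} groups recursively into plain glob patterns."""
    match = re.search(r"\{([^{}]*)\}", pattern)
    if match is None:
        return [pattern]
    head, tail = pattern[:match.start()], pattern[match.end():]
    expanded = []
    for option in match.group(1).split(","):
        expanded.extend(expand_braces(head + option + tail))
    return expanded


def is_test_file(filename, pattern):
    """True if filename matches any expansion of a test-chapter pattern."""
    return any(fnmatch.fnmatchcase(filename, p) for p in expand_braces(pattern))
```

For example, is_test_file("livingalone_09_0001.wav",
"livingalone_{09,10}*") is true, whereas the same call with
"livingalone_08_0001.wav" is false.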


For the evaluations published in the following paper:

O. Watts, A. Stan, R. Clark, Y. Mamiya, M. Giurgiu, J. Yamagishi, 
S. King, Unsupervised and lightly-supervised learning for rapid 
construction of TTS systems in multiple languages from 'found' data: 
evaluation and analysis, In Proc. SSW8, Barcelona, Spain, August 2013

a hand-segmented test set of about 40 utterances in all languages 
was prepared from the test chapters, so that various problems with 
the automatically aligned test utterances (sentence fragments, 
non-matching transcripts, etc.) would not confuse the evaluation 
results. These hand-segmented and aligned utterances are contained 
in the ./handmadeTest folder. Synthesised samples of this handmade 
test set are also available for download. 

---------------------------------------------------------------
			CONTRIBUTORS
---------------------------------------------------------------

Adriana Stan (Communications Department, Technical University of Cluj-Napoca)
Oliver Watts (Centre for Speech Technology Research, University of Edinburgh)
Yoshitaka Mamiya (Centre for Speech Technology Research, University of Edinburgh)
Junichi Yamagishi (National Institute of Informatics, Tokyo)
Mircea Giurgiu (Communications Department, Technical University of Cluj-Napoca)
Rob Clark (Centre for Speech Technology Research, University of Edinburgh)
Simon King (Centre for Speech Technology Research, University of Edinburgh)


---------------------------------------------------------------
			CONTACT
---------------------------------------------------------------
Please send all your enquiries regarding the Tundra Corpus to one
of the following e-mail addresses:

adriana.stan@com.utcluj.ro
owatts@inf.ed.ac.uk


---------------------------------------------------------------
			ACKNOWLEDGEMENTS
---------------------------------------------------------------


The research leading to these results has received funding from the European 
Community's Seventh Framework Programme (FP7/2007-2013) under grant agreement 
No 287678 (the Simple4All project - http://www.simple4all.org)

The research presented here has made use of the resources provided
by the Edinburgh Compute and Data Facility (ECDF: http://www.ecdf.ed.ac.uk). 
The ECDF is partially supported by the eDIKT initiative (http://www.edikt.org.uk).

We would like to thank Mihai Nae from Cartea Sonora for releasing the Romanian data, 
as well as all the volunteers at Librivox and Gutenberg for dedicating their 
time to distribute this wide variety of data.

Files

TUNDRA_CORPUS_August2013.zip (16.0 GB)
md5: 3bca1a7dea502e12fcffbaaac981a36a
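
Given the size of the archive, it may be worth checking the download
against the md5 value listed above. A minimal sketch, reading the file
in chunks so that it does not need to fit in memory (the function name
md5_of_file and the local path are assumptions):

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

If the download is intact, md5_of_file("TUNDRA_CORPUS_August2013.zip")
should equal "3bca1a7dea502e12fcffbaaac981a36a".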