Published June 26, 2013 | Version v1 | Dataset | Open
TUNDRA - A Multilingual Corpus of Found Data for TTS Research Created with Light Supervision
Description
The corpus is described in:
A. Stan, O. Watts, Y. Mamiya, M. Giurgiu, R. A. J. Clark, J. Yamagishi, S. King, TUNDRA: A Multilingual Corpus of Found Data for TTS Research Created with Light Supervision, In Proc. Interspeech, Lyon, France, August 2013
###############################################################
##                                                           ##
##               THE SIMPLE4ALL TUNDRA CORPUS                ##
##                        version 1.0                        ##
##                                                           ##
###############################################################
Simple4All Tundra (version 1.0) is the first release of a
standardised multilingual corpus designed for text-to-speech
research with imperfect or found data. The corpus consists of
approximately 60 hours of speech data from audiobooks in 14
languages, as well as utterance-level alignments obtained with
a lightly-supervised process. Most audiobooks are from the public
domain and allow redistribution. However, some have restricted
use, and in those cases the segmented and aligned data cannot
be downloaded from our website.
---------------------------------------------------------------
LICENCE
---------------------------------------------------------------
This work is licensed under a Creative Commons Attribution 3.0
Unported License http://creativecommons.org/licenses/by/3.0/
This licence applies to the selection, segmentation and alignment
of the speech and text data.
The underlying audio and text are licensed under their specific
datasource terms. Please refer to the links below for a full
description of them.
If you use any part of the corpus in your work, please cite the
following paper:
A. Stan, O. Watts, Y. Mamiya, M. Giurgiu, R. A. J. Clark, J. Yamagishi,
S. King, TUNDRA: A Multilingual Corpus of Found Data for TTS Research
Created with Light Supervision, In Proc. Interspeech, Lyon, France,
August 2013
---------------------------------------------------------------
SPEECH AND TEXT SOURCES
---------------------------------------------------------------
1) Bulgarian - "Zhetvariat" by Yordan Yovkov
audio: http://librivox.org/zhetvariat-by-yordan-yovkov
text: http://slovo.bg/showwork.php3?AuID=95&WorkID=9610&Level=1
2) Danish - "Grimms eventyr i udvalg" by the Brothers Grimm
audio: http://librivox.org/grimms-eventyr-i-udvalg-by-br%C3%B8drene-grimm
text: http://www.estrup.org/cms/?mod=text&id=392
3) Dutch - "Anna Karenina" by Leo Tolstoy
audio: http://librivox.org/anna-karenina-by-leo-tolstoy
text: http://www.gutenberg.org/ebooks/13214
4) English - "Living Alone" by Stella Benson
audio: http://librivox.org/living-alone-by-stella-benson
text: http://www.gutenberg.org/ebooks/14907
5) Finnish - "Rautatie" by Juhani Aho
audio: http://librivox.org/rautatie-by-juhani-aho
text: http://www.gutenberg.org/ebooks/10481
6) French - "Candide" by Voltaire
audio: http://librivox.org/candide-by-voltaire
text: http://www.gutenberg.org/cache/epub/4650/pg4650.txt
7) German - "Das Bildnis des Dorian Gray" by Oscar Wilde
audio: http://librivox.org/das-bildnis-des-dorian-gray-by-oscar-wilde
text: http://gutenberg.spiegel.de/buch/1836/1
8) Hungarian - "Egri csillagok" by Geza Gardonyi
audio: http://gutenberg.spiegel.de/buch/1836/1
text: http://mek.oszk.hu/00600/00656/index.phtml
9) Italian - "Galatea" by Anton Giulio Barrili
audio: http://librivox.org/galatea-by-anton-giulio-barrili/
text: http://www.gutenberg.org/ebooks/19427
10) Polish - "Siedem wybranych opowiadan" by Wladyslaw Orkan
audio: http://librivox.org/siedem-wybranych-opowiadan-by-wladyslaw-orkan/
text: http://pl.wikisource.org/wiki/Autor:W%C5%82adys%C5%82aw_Orkan
11) Portuguese - "Senhora" by Jose de Alencar
audio: http://librivox.org/senhora-by-jose-de-alencar/
text: http://stat.correioweb.com.br/arquivos/educacao/arquivos/JosdeAlencar-Senhora0.pdf
12) Romanian - "Mara" by Ioan Slavici
audio: http://speech.utcluj.ro/corpora/mara.html
text: http://ro.wikisource.org/wiki/Mara
13) Russian - "Ucheniye Khrista" by Leo Tolstoy
audio: http://librivox.org/teachings-of-christ-rus-by-leo-tolstoy/
text: http://az.lib.ru/t/tolstoj_lew_nikolaewich/text_0520.shtml
14) Spanish - "Don Quijote de la Mancha" by Miguel de Cervantes
audio: http://www.quijote.es/IVCentenario_AudioLibro.php
text: http://www.gutenberg.org/ebooks/5921
---------------------------------------------------------------
CONTENTS
---------------------------------------------------------------
For each audiobook you can download the following information.
The exception is the Spanish audiobook, which has restricted use,
so its speech data cannot be downloaded from our website:
1) Segmented and aligned data
-- http://tundra.simple4all.org/download.html
-- an archive containing the results of the lightly
supervised segmentation and alignment algorithm;
-- folders and files (the following are the same for
both training and test data sets):
-- wav/ - speech data maintaining the original chapter
names, but with additional indexes resulting from the
sentence-level segmentation;
-- txt/ - raw text files corresponding to each speech
file from the wav/ folder;
-- txtWithPunctuation/ - text files for each speech
segment with punctuation restored from the original
book text;
-- speech_transcript.txt - a single file for all
orthographic transcripts;
-- a separate handmade test set in the handmadeTest/ folder
(see below for its description);
2) 1-hour subset of selected data
-- http://tundra.simple4all.org/ssw8data.html
-- an archive containing approximately 1 hour of selected
audio used to train the voices from the Demo section;
3) Synthetic samples
-- http://tundra.simple4all.org/ssw8data.html
-- an archive containing the synthetic samples obtained with
our lightly supervised TTS system for the handmade test set;
4) Chapter-level annotation
-- http://tundra.simple4all.org/download.html
-- a file with the chapter-level time alignment within the original
data and the corresponding text for the confident data.
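
The folder layout above suggests that each speech segment in wav/ has
a transcript in txt/. A minimal sketch for pairing them is given below.
It assumes (this is not stated explicitly above) that matching files
share a base name and use the .wav and .txt extensions; the function
name pair_utterances is hypothetical.

```python
from pathlib import Path


def pair_utterances(book_dir):
    """Return (wav_path, txt_path) pairs for utterances found in both folders.

    Assumption: wav/<name>.wav corresponds to txt/<name>.txt.
    """
    book_dir = Path(book_dir)
    # Index the transcripts by base name so each lookup is O(1).
    transcripts = {p.stem: p for p in (book_dir / "txt").glob("*.txt")}
    pairs = []
    for wav in sorted((book_dir / "wav").glob("*.wav")):
        txt = transcripts.get(wav.stem)
        if txt is not None:  # skip audio without a matching transcript
            pairs.append((wav, txt))
    return pairs
```

The same function could be pointed at txtWithPunctuation/ instead of
txt/ by changing the folder name, if the files there follow the same
naming.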
---------------------------------------------------------------
SEGMENTATION AND ALIGNMENT
---------------------------------------------------------------
Descriptions of the lightly supervised segmentation and alignment
methods can be found in the following papers:
1) A. Stan, O. Watts, Y. Mamiya, M. Giurgiu, R. A. J. Clark,
J. Yamagishi, S. King, TUNDRA: A Multilingual Corpus of Found Data
for TTS Research Created with Light Supervision, In Proc.
Interspeech, Lyon, France, August 2013
2) A. Stan, P. Bell, S. King, A grapheme-based method for automatic
alignment of speech and text data, In Proc. IEEE Workshop on Spoken
Language Technology, Miami, Florida, USA, December 2012
3) Y. Mamiya, J. Yamagishi, O. Watts, R. A. J. Clark, S. King,
A. Stan, Lightly Supervised GMM VAD to use Audiobook for Speech
Synthesiser, In Proc. ICASSP, May 2013
The synthetic voice building algorithm is described in detail here:
4) O. Watts, A. Stan, R. Clark, Y. Mamiya, M. Giurgiu, J. Yamagishi,
S. King, Unsupervised and lightly-supervised learning for rapid
construction of TTS systems in multiple languages from ‘found’
data: evaluation and analysis, In Proc. SSW8, Barcelona, Spain,
August 2013
A similar approach is presented in:
5) O. Watts, A. Stan, A. Suni, M. Burgos, J.M. Montero, The
Simple4All entry to the Blizzard Challenge 2013, Blizzard
Challenge 2013
---------------------------------------------------------------
TRAIN/TEST DIVISION OF DATA
---------------------------------------------------------------
Test material is taken from the ends of books, from enough whole
chapters or stories to make up at least 10 minutes of aligned
audio (NB: more can be harvested from the unaligned utterances
in these chapters).
The exceptions are the Hungarian and Portuguese audiobooks,
in which the following chapters have variable recording
conditions and are not considered suitable for comparisons:
Hungarian: egricsillagok_[19-49]
Portuguese: senhora_[14-20] and senhora_[23-41]
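
One way to filter out these chapters when preparing data is sketched
below. It assumes file names of the form <book>_<chapter number>
followed by further characters (for example a hypothetical
"senhora_14_0003.wav"); the exact naming of the segmented files is
not specified here, so the regular expression may need adjusting.

```python
import re

# Inclusive chapter ranges listed above as having variable recording conditions.
EXCLUDED = {
    "egricsillagok": [(19, 49)],
    "senhora": [(14, 20), (23, 41)],
}

# Assumed layout: <book>_<chapter digits>, then a non-digit or the end of the name.
NAME_RE = re.compile(r"^([a-z]+)_(\d+)")


def is_excluded(filename):
    """Return True if the file belongs to one of the excluded chapters."""
    m = NAME_RE.match(filename)
    if m is None:
        return False
    book, chapter = m.group(1), int(m.group(2))
    return any(lo <= chapter <= hi for lo, hi in EXCLUDED.get(book, []))
```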
The following chapters are reserved for testing:
Bulgarian: zhetvariat_2{3,4,5}*
Danish: eventyr_{08,09,10,11,12}*
Dutch: annakarenina_021*
German: doriangray_17*
English: livingalone_{09,10}*
Finnish: rautatie_{7,8}*
French: candide_{29,30}*
Hungarian: egricsillagok_{17,18}*
Italian: galatea_{19,20}*
Polish: siedemwybranchopowiadan_7*
Portuguese: senhora_{12,13}*
Romanian: mara_7{1,2}*
Russian: teachingsofchrist_9*
Spanish: Parte1_35*
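
The patterns above use shell-style brace and wildcard notation. The
sketch below shows one way to apply them in Python, whose fnmatch
module does not expand braces. The helper names are hypothetical, and
the example assumes file names begin with the chapter name as written
in the patterns. Matching is by prefix, exactly as in a shell, so a
pattern such as "rautatie_7*" also matches any name starting with
"rautatie_70".

```python
import re
from fnmatch import fnmatchcase

# Test-chapter patterns copied from the list above.
TEST_PATTERNS = [
    "zhetvariat_2{3,4,5}*", "eventyr_{08,09,10,11,12}*", "annakarenina_021*",
    "doriangray_17*", "livingalone_{09,10}*", "rautatie_{7,8}*",
    "candide_{29,30}*", "egricsillagok_{17,18}*", "galatea_{19,20}*",
    "siedemwybranchopowiadan_7*", "senhora_{12,13}*", "mara_7{1,2}*",
    "teachingsofchrist_9*", "Parte1_35*",
]


def expand_braces(pattern):
    """Expand a single {a,b,c} group into a list of plain glob patterns."""
    m = re.search(r"\{([^{}]*)\}", pattern)
    if m is None:
        return [pattern]
    head, tail = pattern[:m.start()], pattern[m.end():]
    return [head + alt + tail for alt in m.group(1).split(",")]


# Every pattern above has at most one brace group, so one pass is enough.
GLOBS = [g for p in TEST_PATTERNS for g in expand_braces(p)]


def is_test_file(filename):
    """Return True if the file name matches one of the reserved test chapters."""
    return any(fnmatchcase(filename, g) for g in GLOBS)
```

Files for which is_test_file returns False can then be treated as
training material.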
For the evaluations published in the following paper:
O. Watts, A. Stan, R. Clark, Y. Mamiya, M. Giurgiu, J. Yamagishi,
S. King, Unsupervised and lightly-supervised learning for rapid
construction of TTS systems in multiple languages from 'found' data:
evaluation and analysis, In Proc. SSW8, Barcelona, Spain, August 2013
a hand-segmented test set of about 40 utterances in all languages
was prepared from the test chapters, so that various problems with
the automatically aligned test utterances (sentence fragments,
non-matching transcripts etc.) would not confuse the evaluation
results. These hand-segmented and aligned utterances are contained
in the ./handmadeTest folder. Synthesised samples of this handmade
test set are also available for download.
---------------------------------------------------------------
CONTRIBUTORS
---------------------------------------------------------------
Adriana Stan (Communications Department, Technical University of Cluj-Napoca)
Oliver Watts (Centre for Speech Technology Research, University of Edinburgh)
Yoshitaka Mamiya (Centre for Speech Technology Research, University of Edinburgh)
Junichi Yamagishi (National Institute of Informatics, Tokyo)
Mircea Giurgiu (Communications Department, Technical University of Cluj-Napoca)
Rob Clark (Centre for Speech Technology Research, University of Edinburgh)
Simon King (Centre for Speech Technology Research, University of Edinburgh)
---------------------------------------------------------------
CONTACT
---------------------------------------------------------------
Please send all your enquiries regarding the Tundra Corpus to one
of the following e-mail addresses:
adriana.stan@com.utcluj.ro
owatts@inf.ed.ac.uk
---------------------------------------------------------------
ACKNOWLEDGEMENTS
---------------------------------------------------------------
The research leading to these results has received funding from the European
Community's Seventh Framework Programme (FP7/2007-2013) under grant agreement
No 287678 (the Simple4All project - http://www.simple4all.org).
The research presented here has made use of the resources provided
by the Edinburgh Compute and Data Facility (ECDF: http://www.ecdf.ed.ac.uk).
The ECDF is partially supported by the eDIKT initiative (http://www.edikt.org.uk).
We would like to thank Mihai Nae from Cartea Sonora for releasing the Romanian data,
as well as all the volunteers at Librivox and Gutenberg for dedicating their
time to distribute this wide variety of data.
Files (16.0 GB)
| Name | Size |
|---|---|
| TUNDRA_CORPUS_August2013.zip (md5:3bca1a7dea502e12fcffbaaac981a36a) | 16.0 GB |
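
Given the size of the archive, it may be worth checking the download
against the md5 value listed above. A minimal sketch using Python's
standard hashlib module follows; the function names are illustrative,
and the file is read in chunks so that the 16 GB archive is never
loaded into memory at once.

```python
import hashlib

# md5 value copied from the Files listing above.
EXPECTED_MD5 = "3bca1a7dea502e12fcffbaaac981a36a"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the md5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def archive_is_intact(path, expected=EXPECTED_MD5):
    """Return True if the file's md5 digest equals the expected value."""
    return md5_of_file(path) == expected
```

For example, archive_is_intact("TUNDRA_CORPUS_August2013.zip") would
be expected to return True for a complete, uncorrupted download.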