Published June 26, 2013 | Version v1
Dataset
Open
TUNDRA - A Multilingual Corpus of Found Data for TTS Research Created with Light Supervision
Description
The corpus is described in:
A. Stan, O. Watts, Y. Mamiya, M. Giurgiu, R. A. J. Clark, J. Yamagishi, S. King, TUNDRA: A Multilingual Corpus of Found Data for TTS Research Created with Light Supervision, In Proc. Interspeech, Lyon, France, August 2013
###############################################################
##                                                           ##
##               THE SIMPLE4ALL TUNDRA CORPUS                ##
##                        version 1.0                        ##
##                                                           ##
###############################################################

Simple4All Tundra (version 1.0) is the first release of a
standardised multilingual corpus designed for text-to-speech research
with imperfect or found data. The corpus consists of approximately 60
hours of speech data from audiobooks in 14 languages, as well as
utterance-level alignments obtained with a lightly-supervised process.

Most audiobooks are from the public domain and allow redistribution.
However, some have restricted use, and in those cases the segmented
and aligned data cannot be downloaded from our website.

---------------------------------------------------------------
LICENCE
---------------------------------------------------------------

This work is licensed under a Creative Commons Attribution 3.0
Unported License
http://creativecommons.org/licenses/by/3.0/

This licence applies to the selection, segmentation and alignment of
the speech and text data. The underlying audio and text are licensed
under their specific datasource terms. Please refer to the links below
for a full description of them.

If you use any part of the corpus in your work, please cite the
following paper:

A. Stan, O. Watts, Y. Mamiya, M. Giurgiu, R. A. J. Clark,
J. Yamagishi, S. King, TUNDRA: A Multilingual Corpus of Found Data for
TTS Research Created with Light Supervision, In Proc.
Interspeech, Lyon, France, August 2013

---------------------------------------------------------------
SPEECH AND TEXT SOURCES
---------------------------------------------------------------

1) Bulgarian - "Zhetvariat" by Yordan Yovkov
   audio: http://librivox.org/zhetvariat-by-yordan-yovkov
   text: http://slovo.bg/showwork.php3?AuID=95&WorkID=9610&Level=1

2) Danish - "Grimms eventyr I udvalg" by Grimm Brothers
   audio: http://librivox.org/grimms-eventyr-i-udvalg-by-br%C3%B8drene-grimm
   text: http://www.estrup.org/cms/?mod=text&id=392

3) Dutch - "Anna Karenina" by Leo Tolstoy
   audio: http://librivox.org/anna-karenina-by-leo-tolstoy
   text: http://www.gutenberg.org/ebooks/13214

4) English - "Living Alone" by Stella Benson
   audio: http://librivox.org/living-alone-by-stella-benson
   text: http://www.gutenberg.org/ebooks/14907

5) Finnish - "Rautatie" by Juhani Aho
   audio: http://librivox.org/rautatie-by-juhani-aho
   text: http://www.gutenberg.org/ebooks/10481

6) French - "Candide" by Voltaire
   audio: http://librivox.org/candide-by-voltaire
   text: http://www.gutenberg.org/cache/epub/4650/pg4650.txt

7) German - "Das Bildnis des Dorian Gray" by Oscar Wilde
   audio: http://librivox.org/das-bildnis-des-dorian-gray-by-oscar-wilde
   text: http://gutenberg.spiegel.de/buch/1836/1

8) Hungarian - "Egri csillagok" by Geza Gardonyi
   audio: http://gutenberg.spiegel.de/buch/1836/1
   text: http://mek.oszk.hu/00600/00656/index.phtml

9) Italian - "Galatea" by Anton Giulio Barrili
   audio: http://librivox.org/galatea-by-anton-giulio-barrili/
   text: http://www.gutenberg.org/ebooks/19427

10) Polish - "Siedem wybranych opowiadan" by Wladyslaw Orkan
   audio: http://librivox.org/siedem-wybranych-opowiadan-by-wladyslaw-orkan/
   text: http://pl.wikisource.org/wiki/Autor:W%C5%82adys%C5%82aw_Orkan

11) Portuguese - "Senhora" by Jose de Alencar
   audio: http://librivox.org/senhora-by-jose-de-alencar/
   text: http://stat.correioweb.com.br/arquivos/educacao/arquivos/JosdeAlencar-Senhora0.pdf

12) Romanian - "Mara" by Ioan Slavici
   audio: http://speech.utcluj.ro/corpora/mara.html
   text: http://ro.wikisource.org/wiki/Mara

13) Russian - "Ucheniye Khrista" by Leo Tolstoy
   audio: http://librivox.org/teachings-of-christ-rus-by-leo-tolstoy/
   text: http://az.lib.ru/t/tolstoj_lew_nikolaewich/text_0520.shtml

14) Spanish - "Don Quijote de la Mancha" by Miguel de Cervantes
   audio: http://www.quijote.es/IVCentenario_AudioLibro.php
   text: http://www.gutenberg.org/ebooks/5921

---------------------------------------------------------------
CONTENTS
---------------------------------------------------------------

For each audiobook you can download the following information, with
the exception of the Spanish audiobook, which has restricted use, so
its speech data cannot be downloaded from our website:

1) Segmented and aligned data
   -- http://tundra.simple4all.org/download.html
   -- an archive containing the results of the lightly supervised
      segmentation and alignment algorithm;
   -- folders and files (the following are the same for both training
      and test data sets):
      -- wav/ - speech data maintaining the original chapter names,
         but with additional indexes resulting from the sentence-level
         segmentation;
      -- txt/ - raw text files corresponding to each speech file from
         the wav/ folder;
      -- txtWithPunctuation/ - text files for each speech segment with
         punctuation restored from the original book text;
      -- speech_transcript.txt - a single file for all orthographic
         transcripts;
   -- a separate handmade test set in the handmadeTest/ folder
      (see below for its description);

2) 1 hour subset of selected data
   -- http://tundra.simple4all.org/ssw8data.html
   -- an archive containing approximately 1 hour of selected audio
      used to train the voices from the Demo section;

3) Synthetic samples
   -- http://tundra.simple4all.org/ssw8data.html
   -- an archive containing the synthetic samples obtained with our
      lightly supervised TTS system for the handmade test set;

4) Chapter-level annotation
   -- http://tundra.simple4all.org/download.html
   -- a file with
      the chapter-level time alignment within the original data and
      the corresponding text for the confident data.

---------------------------------------------------------------
SEGMENTATION AND ALIGNMENT
---------------------------------------------------------------

Descriptions of the lightly supervised segmentation and alignment
methods can be found in the following papers:

1) A. Stan, O. Watts, Y. Mamiya, M. Giurgiu, R. A. J. Clark,
   J. Yamagishi, S. King, TUNDRA: A Multilingual Corpus of Found Data
   for TTS Research Created with Light Supervision, In Proc.
   Interspeech, Lyon, France, August 2013

2) Adriana STAN, Peter BELL, Simon KING, A grapheme-based method for
   automatic alignment of speech and text data, In Proc. IEEE Workshop
   on Spoken Language Technology, Miami, Florida, USA, December 2012

3) Yoshitaka Mamiya, Junichi Yamagishi, Oliver Watts, Robert A.J.
   Clark, Simon King and Adriana STAN, Lightly Supervised GMM VAD to
   use Audiobook for Speech Synthesiser, In Proc. ICASSP, May 2013

The synthetic voice building algorithm is described in detail here:

4) O. Watts, A. Stan, R. Clark, Y. Mamiya, M. Giurgiu, J. Yamagishi,
   S. King, Unsupervised and lightly-supervised learning for rapid
   construction of TTS systems in multiple languages from ‘found’
   data: evaluation and analysis, In Proc. SSW8, Barcelona, Spain,
   August 2013

With a similar approach being presented in:

5) O. Watts, A. Stan, A. Suni, M. Burgos, J.M. Montero, The Simple4All
   entry to the Blizzard Challenge 2013, Blizzard Challenge 2013

---------------------------------------------------------------
TRAIN/TEST DIVISION OF DATA
---------------------------------------------------------------

Test material is taken from the ends of books, from enough whole
chapters or stories to make up at least 10 min of audio of aligned
data (NB more can be harvested from these chapters from the unaligned
utterances).
The exceptions are the Hungarian and Portuguese audiobooks, in which
the following chapters have variable recording conditions and are not
considered suitable for comparisons:

   Hungarian:  egricsillagok_[19-49]
   Portuguese: senhora_[14-20] and senhora_[23-41]

The following chapters are reserved for testing:

   Bulgarian:  zhetvariat_2{3,4,5}*
   Danish:     eventyr_{08,09,10,11,12}*
   Dutch:      annakarenina_021*
   German:     doriangray_17*
   English:    livingalone_{09,10}*
   Finnish:    rautatie_{7,8}*
   French:     candide_{29,30}*
   Hungarian:  egricsillagok_{17,18}*
   Italian:    galatea_{19,20}*
   Polish:     siedemwybranchopowiadan_7*
   Portuguese: senhora_{12,13}*
   Romanian:   mara_7{1,2}*
   Russian:    teachingsofchrist_9*
   Spanish:    Parte1_35*

For the evaluations published in the following paper:

O. Watts, A. Stan, R. Clark, Y. Mamiya, M. Giurgiu, J. Yamagishi,
S. King, Unsupervised and lightly-supervised learning for rapid
construction of TTS systems in multiple languages from 'found' data:
evaluation and analysis, In Proc. SSW8, Barcelona, Spain, August 2013

a hand-segmented test set of about 40 utterances in all languages was
prepared from the test chapters, so that various problems with the
automatically aligned test utterances (sentence fragments,
non-matching transcripts etc.) would not confuse the evaluation
results. These hand-segmented and aligned utterances are contained in
the ./handmadeTest folder. Synthesised samples of this handmade test
set are also available for download.
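
The reserved-chapter patterns above are written in shell-style brace
notation. As a minimal sketch, assuming that segment file names start
with the chapter name followed by a segment index (the example file
names in the comments are hypothetical), a Python helper along these
lines could separate test files from training files. The function names
are illustrative and are not part of the corpus tools.

```python
import fnmatch
import re

# Test-chapter patterns copied verbatim from the list above.
TEST_PATTERNS = {
    "Bulgarian": "zhetvariat_2{3,4,5}*",
    "Danish": "eventyr_{08,09,10,11,12}*",
    "Dutch": "annakarenina_021*",
    "German": "doriangray_17*",
    "English": "livingalone_{09,10}*",
    "Finnish": "rautatie_{7,8}*",
    "French": "candide_{29,30}*",
    "Hungarian": "egricsillagok_{17,18}*",
    "Italian": "galatea_{19,20}*",
    "Polish": "siedemwybranchopowiadan_7*",
    "Portuguese": "senhora_{12,13}*",
    "Romanian": "mara_7{1,2}*",
    "Russian": "teachingsofchrist_9*",
    "Spanish": "Parte1_35*",
}

_BRACE = re.compile(r"\{([^{}]*)\}")


def expand_braces(pattern):
    """Expand shell-style {a,b,c} alternatives into a list of plain globs."""
    match = _BRACE.search(pattern)
    if match is None:
        return [pattern]
    head, tail = pattern[:match.start()], pattern[match.end():]
    expanded = []
    for option in match.group(1).split(","):
        expanded.extend(expand_braces(head + option + tail))
    return expanded


def is_test_file(language, filename):
    """Return True if filename matches the reserved test pattern for language."""
    globs = expand_braces(TEST_PATTERNS[language])
    return any(fnmatch.fnmatchcase(filename, glob) for glob in globs)


def split_files(language, filenames):
    """Partition file names into (train, test) lists, preserving order."""
    train, test = [], []
    for name in filenames:
        (test if is_test_file(language, name) else train).append(name)
    return train, test


# Hypothetical usage: "zhetvariat_24_0007.wav" is an assumed file name.
# train, test = split_files("Bulgarian", ["zhetvariat_01_0001.wav",
#                                         "zhetvariat_24_0007.wav"])
```

Because the patterns end in a bare `*`, a pattern such as
`rautatie_{7,8}*` would also match a chapter numbered 70 if one
existed, so it is worth checking the result against the actual chapter
list of each audiobook.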

---------------------------------------------------------------
CONTRIBUTORS
---------------------------------------------------------------

Adriana Stan (Communications Department, Technical University of Cluj-Napoca)
Oliver Watts (Centre for Speech Technology Research, University of Edinburgh)
Yoshitaka Mamiya (Centre for Speech Technology Research, University of Edinburgh)
Junichi Yamagishi (National Institute of Informatics, Tokyo)
Mircea Giurgiu (Communications Department, Technical University of Cluj-Napoca)
Rob Clark (Centre for Speech Technology Research, University of Edinburgh)
Simon King (Centre for Speech Technology Research, University of Edinburgh)

---------------------------------------------------------------
CONTACT
---------------------------------------------------------------

Please send all your enquiries regarding the Tundra Corpus to one of
the following e-mail addresses:

adriana.stan@com.utcluj.ro
owatts@inf.ed.ac.uk

---------------------------------------------------------------
ACKNOWLEDGEMENTS
---------------------------------------------------------------

The research leading to these results has received funding from the
European Community's Seventh Framework Programme (FP7/2007-2013) under
grant agreement No 287678 (the Simple4All project -
http://www.simple4all.org).

The research presented here has made use of the resources provided by
the Edinburgh Compute and Data Facility (ECDF:
http://www.ecdf.ed.ac.uk). The ECDF is partially supported by the
eDIKT initiative (http://www.edikt.org.uk).

We would like to thank Mihai Nae from Cartea Sonora for releasing the
Romanian data, as well as all the volunteers at Librivox and Gutenberg
for dedicating their time to distribute this wide variety of data.
Files
Name | Size | MD5 |
---|---|---|
TUNDRA_CORPUS_August2013.zip | 16.0 GB | 3bca1a7dea502e12fcffbaaac981a36a |
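
The MD5 value listed for the archive can be used to check that a 16 GB
download completed without corruption. The following is a minimal
sketch using Python's standard hashlib module. It assumes the archive
was saved locally under its listed name, and the function names are
illustrative only.

```python
import hashlib

# Checksum as listed for TUNDRA_CORPUS_August2013.zip in the file table.
EXPECTED_MD5 = "3bca1a7dea502e12fcffbaaac981a36a"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def archive_is_intact(path, expected=EXPECTED_MD5):
    """Return True if the file at path has the expected MD5 checksum."""
    return md5_of_file(path) == expected.lower()


# Assumed local path; adjust to wherever the archive was downloaded.
# print(archive_is_intact("TUNDRA_CORPUS_August2013.zip"))
```

Reading in chunks keeps memory use constant regardless of archive size.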