LJ Speech English TTS

Taubert, Stefan

doi:10.5281/zenodo.10200955

Published November 10, 2023 | Version 0.0.1

Resource type: Model | Access: Open

Taubert, Stefan (Contact person)¹

1. Chemnitz University of Technology

This upload contains a TTS model trained on the LJ Speech dataset (Ito & Johnson, 2017) using the aligned IPA transcriptions (Taubert, 2023), with the explicit phoneme duration markers removed. The model was trained with tacotron-cli.

The model achieves the following values on the validation set:

MOS naturalness: 3.49 ± 0.28 (GT: 4.17 ± 0.23)
MOS intelligibility: 4.44 ± 0.21 (GT: 4.63 ± 0.19)
mean mel-cepstral distance: 30.96
mean penalty: 0.1341

Files:

101000.pt
- checkpoint after 500 epochs with a batch size of 64
1-setup-env.sh
- script to install all required tools
2-create-dataset.sh
- script to create the base dataset using public resources
3-create-train-val-set.sh
- script to create the training set and validation set
4-start-training.sh
- script to start training using Tacotron
5-convert-english-to-ipa.sh
- script to prepare English texts for synthesis by transcribing them to IPA
6-synthesize.sh
- script to synthesize IPA transcribed text
example-north-wind.zip
- contains an example passage which was synthesized using the model
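
As a rough consistency check on the checkpoint description, the following sketch relates the figures given above. It assumes that the number in the file name 101000.pt is the count of training iterations, which the record does not state, and that every epoch runs over the full training set in batches of 64.

```python
# Assumption: 101000 (from the file name 101000.pt) is the iteration count.
ITERATIONS = 101_000
EPOCHS = 500        # stated in the file description
BATCH_SIZE = 64     # stated in the file description

# Iterations needed for one pass over the training set.
iterations_per_epoch = ITERATIONS // EPOCHS

# If the last batch of an epoch may be partially filled, the training set
# size lies between these two bounds.
max_training_utterances = iterations_per_epoch * BATCH_SIZE
min_training_utterances = (iterations_per_epoch - 1) * BATCH_SIZE + 1

print(f"iterations per epoch: {iterations_per_epoch}")
print(f"training utterances: {min_training_utterances} to {max_training_utterances}")
```

Under these assumptions the training set would hold roughly 12,900 utterances; the exact split is defined in 3-create-train-val-set.sh.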

The model is able to synthesize the following symbols:

vowels: i, u, æ, ɑ, ɔ, ə, ɛ, ɪ, ʊ, ʌ
diphthongs: aɪ, aʊ, eɪ, oʊ, ɔɪ
r-colored vowels: ɔr, ər, ɛr, ɪr, ʊr, ʌr
consonants: b, d, dʒ, f, h, j, k, l, m, n, p, r, s, t, tʃ, v, w, z, ð, ŋ, ɡ, ʃ, θ
breaks: SIL0, SIL1, SIL2, SIL3
special characters: . ? ! , : ; - — " ' ( ) [ ]

Each vowel, diphthong and r-colored vowel can be preceded by a stress mark, either primary (ˈ) or secondary (ˌ), e.g., ˈoʊ.
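
The inventory above can be turned into a simple pre-synthesis check. The following is a minimal sketch and not part of the uploaded scripts; the function name `unsupported_symbols` is hypothetical, and the input is assumed to be a list of already separated symbols.

```python
# Symbol inventory copied from the lists above.
VOWELS = ["i", "u", "æ", "ɑ", "ɔ", "ə", "ɛ", "ɪ", "ʊ", "ʌ"]
DIPHTHONGS = ["aɪ", "aʊ", "eɪ", "oʊ", "ɔɪ"]
R_COLORED = ["ɔr", "ər", "ɛr", "ɪr", "ʊr", "ʌr"]
CONSONANTS = ["b", "d", "dʒ", "f", "h", "j", "k", "l", "m", "n", "p", "r",
              "s", "t", "tʃ", "v", "w", "z", "ð", "ŋ", "ɡ", "ʃ", "θ"]
BREAKS = ["SIL0", "SIL1", "SIL2", "SIL3"]
SPECIAL = [".", "?", "!", ",", ":", ";", "-", "—", '"', "'", "(", ")", "[", "]"]
STRESS_MARKS = ["ˈ", "ˌ"]

# Only vowels, diphthongs and r-colored vowels may carry a stress mark.
STRESSABLE = VOWELS + DIPHTHONGS + R_COLORED
SUPPORTED = set(STRESSABLE + CONSONANTS + BREAKS + SPECIAL)
SUPPORTED.update(mark + s for mark in STRESS_MARKS for s in STRESSABLE)


def unsupported_symbols(symbols):
    """Return the symbols (in input order) that are not in the inventory."""
    return [s for s in symbols if s not in SUPPORTED]


if __name__ == "__main__":
    print(unsupported_symbols(["ð", "ə", "SIL0", "n", "ˈɔr", "θ"]))
    print(unsupported_symbols(["x", "ˈb"]))
```

An empty result means every symbol is covered by the inventory; a stressed consonant such as ˈb is reported because stress marks are only listed for vowel-like symbols.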

Example:

The North Wind and the Sun were disputing which was the stronger, when a traveler came along wrapped in a warm cloak.

ð|ə|SIL0|n|ˈɔr|θ|SIL0|w|ˈɪ|n|d|SIL0|ə|n|d|SIL0|ð|ə|SIL0|s|ˈʌ|n|SIL0|w|ˈʌr|SIL0|d|ɪ|s|p|j|ˈu|t|ɪ|ŋ|SIL0|w|ˈɪ|tʃ|SIL0|w|ˈɑ|z|SIL0|ð|ə|SIL0|s|t|r|ˈɔ|ŋ|ər|,|SIL1|w|ˈɛ|n|SIL0|ə|SIL0|t|r|ˈæ|v|ə|l|ər|SIL0|k|ˈeɪ|m|SIL0|ə|l|ˈɔ|ŋ|SIL0|r|ˈæ|p|t|SIL0|ɪ|n|SIL0|ə|SIL0|w|ˈɔr|m|SIL0|k|l|ˈoʊ|k|.|SIL2
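
Judging from the example, symbols are separated by `|` and words are separated by break symbols, with SIL0 between words and longer breaks after punctuation. That reading is inferred from the example and is not stated explicitly in the record. Under this assumption, a transcription can be regrouped into words as in the following sketch; the helper name `split_into_words` is hypothetical.

```python
def split_into_words(transcription, sep="|"):
    """Group a separator-delimited symbol string into words.

    Assumption: every symbol starting with "SIL" is a break that ends the
    current word. Punctuation stays attached to the preceding word.
    """
    words, current = [], []
    for symbol in transcription.split(sep):
        if symbol.startswith("SIL"):
            if current:
                words.append("".join(current))
                current = []
        else:
            current.append(symbol)
    if current:
        words.append("".join(current))
    return words


if __name__ == "__main__":
    print(split_into_words("ð|ə|SIL0|n|ˈɔr|θ|SIL0|w|ˈɪ|n|d"))
```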

Notes (English)

The authors gratefully acknowledge the GWK support for funding this project by providing computing time through the Center for Information Services and HPC (ZIH) at TU Dresden.

The authors are grateful to the Center for Information Services and High Performance Computing [Zentrum für Informationsdienste und Hochleistungsrechnen (ZIH)] at TU Dresden for providing its facilities for high throughput calculations.

Files (342.7 MB)

Name	MD5	Size
1-setup-env.sh	664cac6dcc130411e4d049e1538c516c	726 Bytes
101000.pt	9c77b2ee8dfdb50ac556588064943846	341.5 MB
2-create-dataset.sh	4370d52bc842f8f762465f66a9559260	673 Bytes
3-create-train-val-set.sh	28497962271dadc210b536d88c66dc83	14.2 kB
4-start-training.sh	ee60232dcf4ee20c843aa9983bf5503c	2.0 kB
5-convert-english-to-ipa.sh	a07f757776a4276b8927eb60e4ed50ad	3.1 kB
6-synthesize.sh	50c58aa494904442f70d509c86edab0c	2.7 kB
example-north-wind.zip	bc6c1271423dd42c181f19ffa036ab8e	1.2 MB
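
The MD5 checksums listed for the files can be used to confirm that a download is intact. The following is a minimal sketch using only the Python standard library; the function names and the idea of a single download directory are assumptions for illustration.

```python
import hashlib
from pathlib import Path

# Checksums copied from the file listing above.
EXPECTED_MD5 = {
    "1-setup-env.sh": "664cac6dcc130411e4d049e1538c516c",
    "101000.pt": "9c77b2ee8dfdb50ac556588064943846",
    "2-create-dataset.sh": "4370d52bc842f8f762465f66a9559260",
    "3-create-train-val-set.sh": "28497962271dadc210b536d88c66dc83",
    "4-start-training.sh": "ee60232dcf4ee20c843aa9983bf5503c",
    "5-convert-english-to-ipa.sh": "a07f757776a4276b8927eb60e4ed50ad",
    "6-synthesize.sh": "50c58aa494904442f70d509c86edab0c",
    "example-north-wind.zip": "bc6c1271423dd42c181f19ffa036ab8e",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory, expected=EXPECTED_MD5):
    """Map each expected file name to True if it exists and its MD5 matches."""
    results = {}
    for name, checksum in expected.items():
        path = Path(directory) / name
        results[name] = path.is_file() and md5_of_file(path) == checksum
    return results


if __name__ == "__main__":
    # Hypothetical usage: check files downloaded into the current directory.
    for name, ok in verify_downloads(".").items():
        print(f"{name}: {'OK' if ok else 'missing or mismatch'}")
```

Reading in chunks keeps memory use low for the 341.5 MB checkpoint.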

Additional details

Funding

Deutsche Forschungsgemeinschaft
SFB 1410 416228727

References

Ito, K., & Johnson, L. (2017). The LJ Speech Dataset. https://keithito.com/LJ-Speech-Dataset
Wang, Y., Skerry-Ryan, R. J., Stanton, D., Wu, Y., Weiss, R. J., Jaitly, N., Yang, Z., Xiao, Y., Chen, Z., Bengio, S., Le, Q., Agiomyrgiannakis, Y., Clark, R., & Saurous, R. A. (2017). Tacotron: Towards End-to-End Speech Synthesis. Interspeech 2017, 4006–4010. https://doi.org/10.21437/Interspeech.2017-1452
Shen, J., Pang, R., Weiss, R. J., Schuster, M., Jaitly, N., Yang, Z., Chen, Z., Zhang, Y., Wang, Y., Skerry-Ryan, R., Saurous, R. A., Agiomyrgiannakis, Y., & Wu, Y. (2018). Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions. 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 4779–4783. https://doi.org/10.1109/ICASSP.2018.8461368
Prenger, R., Valle, R., & Catanzaro, B. (2019). WaveGlow: A Flow-based Generative Network for Speech Synthesis. ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 3617–3621. https://doi.org/10.1109/ICASSP.2019.8683143
Taubert, S. (2023). LJ Speech - Aligned IPA transcriptions (0.0.2) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.7499098
Taubert, S. (2023). tacotron-cli (0.0.4). Zenodo. https://doi.org/10.5281/zenodo.7543638
Taubert, S. (2022). waveglow-cli (0.0.1). Zenodo. https://doi.org/10.5281/zenodo.7044345
