LJ Speech English TTS
Description
This upload contains a TTS model trained on the LJ Speech dataset using the aligned IPA transcriptions (Taubert, 2023), with the explicit phoneme duration markers removed. The model was trained with tacotron-cli.
The model achieves the following values on the validation set:
- MOS naturalness: 3.49 ± 0.28 (ground truth: 4.17 ± 0.23)
- MOS intelligibility: 4.44 ± 0.21 (ground truth: 4.63 ± 0.19)
- mean mel-cepstral distance: 30.96
- mean penalty: 0.1341
Files:
- 101000.pt
  - checkpoint after 500 epochs with a batch size of 64
- 1-setup-env.sh
  - script to install all required tools
- 2-create-dataset.sh
  - script to create the base dataset using public resources
- 3-create-train-val-set.sh
  - script to create the training set and validation set
- 4-start-training.sh
  - script to start training using Tacotron
- 5-convert-english-to-ipa.sh
  - script to prepare English texts for synthesis by transcribing them to IPA
- 6-synthesize.sh
  - script to synthesize IPA-transcribed text
- example-north-wind.zip
  - contains an example passage which was synthesized using the model
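The checkpoint description can be cross-checked with a little arithmetic. The sketch below assumes that the file name 101000.pt encodes the number of training iterations and that the last, possibly partial, batch of each epoch is kept; neither is stated in this record, and the function name is hypothetical.

```python
def training_set_size_bounds(iterations, epochs, batch_size):
    """Estimate iterations per epoch and the implied training set size range.

    Assumption: every epoch runs the same number of iterations and the
    final batch of an epoch may be smaller than batch_size.
    """
    per_epoch, remainder = divmod(iterations, epochs)
    if remainder:
        raise ValueError("iterations is not a multiple of epochs")
    smallest = (per_epoch - 1) * batch_size + 1  # last batch holds one item
    largest = per_epoch * batch_size             # last batch is full
    return per_epoch, smallest, largest


# Assumption: 101000 is the iteration count taken from the checkpoint name.
per_epoch, smallest, largest = training_set_size_bounds(101_000, 500, 64)
print(per_epoch, smallest, largest)
```

Under these assumptions the training set would hold somewhat fewer utterances than the 13,100 clips of LJ Speech, which is consistent with part of the data being held out for validation.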
The model is able to synthesize the following symbols:
- vowels: i, u, æ, ɑ, ɔ, ə, ɛ, ɪ, ʊ, ʌ
- diphthongs: aɪ, aʊ, eɪ, oʊ, ɔɪ
- r-colored vowels: ɔr, ər, ɛr, ɪr, ʊr, ʌr
- consonants: b, d, dʒ, f, h, j, k, l, m, n, p, r, s, t, tʃ, v, w, z, ð, ŋ, ɡ, ʃ, θ
- breaks: SIL0, SIL1, SIL2, SIL3
- special characters: . ? ! , : ; - — " ' ( ) [ ]
Each vowel, diphthong and r-colored vowel can have a leading stress symbol attached (ˈ for primary stress or ˌ for secondary stress), e.g., ˈoʊ.
Example:
The North Wind and the Sun were disputing which was the stronger, when a traveler came along wrapped in a warm cloak.
ð|ə|SIL0|n|ˈɔr|θ|SIL0|w|ˈɪ|n|d|SIL0|ə|n|d|SIL0|ð|ə|SIL0|s|ˈʌ|n|SIL0|w|ˈʌr|SIL0|d|ɪ|s|p|j|ˈu|t|ɪ|ŋ|SIL0|w|ˈɪ|tʃ|SIL0|w|ˈɑ|z|SIL0|ð|ə|SIL0|s|t|r|ˈɔ|ŋ|ər|,|SIL1|w|ˈɛ|n|SIL0|ə|SIL0|t|r|ˈæ|v|ə|l|ər|SIL0|k|ˈeɪ|m|SIL0|ə|l|ˈɔ|ŋ|SIL0|r|ˈæ|p|t|SIL0|ɪ|n|SIL0|ə|SIL0|w|ˈɔr|m|SIL0|k|l|ˈoʊ|k|.|SIL2
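Before synthesis it can help to check that a transcription only uses symbols from the inventory above. The following is a minimal sketch, not part of the uploaded scripts: it assumes symbols are separated by `|` as in the example, and the function names are hypothetical.

```python
# Symbol inventory copied from the lists above.
VOWELS = ["i", "u", "æ", "ɑ", "ɔ", "ə", "ɛ", "ɪ", "ʊ", "ʌ"]
DIPHTHONGS = ["aɪ", "aʊ", "eɪ", "oʊ", "ɔɪ"]
R_COLORED = ["ɔr", "ər", "ɛr", "ɪr", "ʊr", "ʌr"]
CONSONANTS = ["b", "d", "dʒ", "f", "h", "j", "k", "l", "m", "n", "p", "r",
              "s", "t", "tʃ", "v", "w", "z", "ð", "ŋ", "ɡ", "ʃ", "θ"]
BREAKS = ["SIL0", "SIL1", "SIL2", "SIL3"]
SPECIAL = [".", "?", "!", ",", ":", ";", "-", "—", '"', "'", "(", ")", "[", "]"]
STRESS_MARKS = ["ˈ", "ˌ"]


def build_inventory():
    """Return the set of all symbols the model is described as supporting."""
    stressable = VOWELS + DIPHTHONGS + R_COLORED
    symbols = set(stressable + CONSONANTS + BREAKS + SPECIAL)
    # Only vowels, diphthongs and r-colored vowels may carry a stress mark.
    symbols.update(mark + s for mark in STRESS_MARKS for s in stressable)
    return symbols


def find_unknown_symbols(transcription, separator="|"):
    """Return the symbols of a transcription that are not in the inventory."""
    inventory = build_inventory()
    return [s for s in transcription.split(separator) if s not in inventory]


# Opening of the example above: every symbol is known, so the list is empty.
print(find_unknown_symbols("ð|ə|SIL0|n|ˈɔr|θ|SIL0|w|ˈɪ|n|d"))
```

An empty result means every symbol is covered; any returned symbol would need to be replaced or removed before passing the text to the synthesis script.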
Files (342.7 MB)
MD5 checksum | Size |
---|---|
664cac6dcc130411e4d049e1538c516c | 726 Bytes |
9c77b2ee8dfdb50ac556588064943846 | 341.5 MB |
4370d52bc842f8f762465f66a9559260 | 673 Bytes |
28497962271dadc210b536d88c66dc83 | 14.2 kB |
ee60232dcf4ee20c843aa9983bf5503c | 2.0 kB |
a07f757776a4276b8927eb60e4ed50ad | 3.1 kB |
50c58aa494904442f70d509c86edab0c | 2.7 kB |
bc6c1271423dd42c181f19ffa036ab8e | 1.2 MB |
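The MD5 checksums listed above can be used to verify a download. The sketch below relies only on Python's standard `hashlib` module (Python 3.9 or later for `str.removeprefix`); the helper names are hypothetical, and which checksum belongs to which file has to be taken from the record page, since the file names are not part of this listing.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected_md5):
    """Compare a file's digest with a checksum such as 'md5:664c...'."""
    expected = expected_md5.strip().lower().removeprefix("md5:")
    return md5_of_file(path) == expected
```

For example, `matches_checksum("101000.pt", "md5:...")` with the checksum shown for that file on the record page should return `True` for an intact download.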
Additional details
References
- Ito, K., & Johnson, L. (2017). The LJ Speech Dataset. https://keithito.com/LJ-Speech-Dataset
- Wang, Y., Skerry-Ryan, R. J., Stanton, D., Wu, Y., Weiss, R. J., Jaitly, N., Yang, Z., Xiao, Y., Chen, Z., Bengio, S., Le, Q., Agiomyrgiannakis, Y., Clark, R., & Saurous, R. A. (2017). Tacotron: Towards End-to-End Speech Synthesis. Interspeech 2017, 4006–4010. https://doi.org/10.21437/Interspeech.2017-1452
- Shen, J., Pang, R., Weiss, R. J., Schuster, M., Jaitly, N., Yang, Z., Chen, Z., Zhang, Y., Wang, Y., Skerry-Ryan, R. J., Saurous, R. A., Agiomyrgiannakis, Y., & Wu, Y. (2018). Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions. 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 4779–4783. https://doi.org/10.1109/ICASSP.2018.8461368
- Prenger, R., Valle, R., & Catanzaro, B. (2019). WaveGlow: A Flow-based Generative Network for Speech Synthesis. ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 3617–3621. https://doi.org/10.1109/ICASSP.2019.8683143
- Taubert, S. (2023). LJ Speech - Aligned IPA transcriptions (0.0.2) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.7499098
- Taubert, S. (2023). tacotron-cli (0.0.4). Zenodo. https://doi.org/10.5281/zenodo.7543638
- Taubert, S. (2022). waveglow-cli (0.0.1). Zenodo. https://doi.org/10.5281/zenodo.7044345