LJ Speech English TTS with explicit duration markers
Description
This upload contains a TTS model trained on the LJ Speech dataset using IPA transcriptions that contain explicit phoneme duration markers (Taubert, 2023, LJ Speech - Aligned IPA transcriptions). The model was trained with tacotron-cli.
The model achieves the following evaluation results (GT = ground truth):
- MOS naturalness: 3.55 ± 0.28 (GT: 4.17 ± 0.23)
- MOS intelligibility: 4.44 ± 0.24 (GT: 4.63 ± 0.19)
- mean mel-cepstral distance: 29.15
- mean penalty: 0.1018
Files:
- 101000.pt
  - checkpoint after 500 epochs with a batch size of 64
- 1-setup-env.sh
  - script to install all required tools
- 2-create-dataset.sh
  - script to create the base dataset using public resources
- 3-create-train-val-set.sh
  - script to create the training set and validation set
- 4-start-training.sh
  - script to start training using Tacotron
- 5-convert-english-to-ipa.sh
  - script to prepare English texts for synthesis by transcribing them to IPA
- 6-synthesize.sh
  - script to synthesize IPA-transcribed text
- example-north-wind.zip
  - contains an example passage which was synthesized using the model
The model is able to synthesize the following symbols:
- vowels: i, u, æ, ɑ, ɔ, ə, ɛ, ɪ, ʊ, ʌ
- diphthongs: aɪ, aʊ, eɪ, oʊ, ɔɪ
- r-colored vowels: ɔr, ər, ɛr, ɪr, ʊr, ʌr
- consonants: b, d, dʒ, f, h, j, k, l, m, n, p, r, s, t, tʃ, v, w, z, ð, ŋ, ɡ, ʃ, θ
- breaks: SIL0, SIL1, SIL2, SIL3
- special characters: . ? ! , : ; - — " ' ( ) [ ]
Each vowel, diphthong, r-colored vowel, and consonant can carry one of these duration markers: ː, ˑ, or ˘ (e.g., oʊː).
Furthermore, each vowel, diphthong, and r-colored vowel can have a leading stress symbol, ˈ or ˌ, attached (e.g., ˈoʊː).
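The symbol rules above can be checked mechanically. The following is a minimal sketch, not part of the uploaded scripts: `parse_symbol` is a hypothetical helper that splits one symbol into its optional stress mark, base phoneme, and optional duration marker. It assumes at most one stress mark and at most one duration marker per symbol, and that stress is only valid on vowels, diphthongs, and r-colored vowels, as described above.

```python
# Symbol inventory copied from the lists in this description.
VOWELS = ["i", "u", "æ", "ɑ", "ɔ", "ə", "ɛ", "ɪ", "ʊ", "ʌ"]
DIPHTHONGS = ["aɪ", "aʊ", "eɪ", "oʊ", "ɔɪ"]
R_COLORED = ["ɔr", "ər", "ɛr", "ɪr", "ʊr", "ʌr"]
CONSONANTS = ["b", "d", "dʒ", "f", "h", "j", "k", "l", "m", "n", "p", "r",
              "s", "t", "tʃ", "v", "w", "z", "ð", "ŋ", "ɡ", "ʃ", "θ"]
BREAKS = {"SIL0", "SIL1", "SIL2", "SIL3"}
# "\u2014" is the long dash from the special-character list.
SPECIAL = set('.?!,:;-"\'()[]') | {"\u2014"}

STRESS = "ˈˌ"       # leading stress symbols
DURATIONS = "ːˑ˘"   # trailing duration markers

VOCALIC = set(VOWELS + DIPHTHONGS + R_COLORED)
CONSONANT_SET = set(CONSONANTS)


def parse_symbol(symbol):
    """Split a symbol into (stress, base, duration); raise ValueError if invalid."""
    # Breaks and special characters take neither stress nor duration.
    if symbol in BREAKS or symbol in SPECIAL:
        return ("", symbol, "")
    stress, duration, base = "", "", symbol
    if base and base[0] in STRESS:
        stress, base = base[0], base[1:]
    if base and base[-1] in DURATIONS:
        base, duration = base[:-1], base[-1]
    if base in VOCALIC:
        return (stress, base, duration)
    # Consonants may carry a duration marker but no stress symbol.
    if base in CONSONANT_SET and not stress:
        return (stress, base, duration)
    raise ValueError(f"unsupported symbol: {symbol!r}")


print(parse_symbol("ˈoʊː"))
print(parse_symbol("ð˘"))
```

Running such a check over a transcription before synthesis may help catch symbols outside the inventory, such as an ASCII `g` used in place of `ɡ`.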
Example:
The North Wind and the Sun were disputing which was the stronger, when a traveler came along wrapped in a warm cloak.
ð˘|ə|SIL0|n|ˈɔr|θ|SIL0|w|ˈɪ|nː|dː|SIL0|ə|n˘|d|SIL0|ð|ə|SIL0|sː|ˈʌː|nː|SIL0|w|ˈʌr˘|SIL0|d|ɪ|s|p|j|ˈu|t|ɪ|ŋ|SIL0|w|ˈɪ˘|tʃ|SIL0|w|ˈɑ˘|z|SIL0|ð|ə|SIL0|s|t|r|ˈɔ˘|ŋ|ərˑ|,|SIL1|w|ˈɛ˘|n|SIL0|ə|SIL0|t|r|ˈæ|v|ə|l|ər|SIL0|k|ˈeɪ|m|SIL0|ə|l|ˈɔ|ŋ|SIL0|r|ˈæ|p|t˘|SIL0|ɪ|n|SIL0|ə|SIL0|wː|ˈɔrː|mː|SIL0|kː|l|ˈoʊ|k|.|SIL2
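In the transcription above, symbols are separated by `|`, and a `SILn` break follows each word (judging from this example, `SIL0` between words, `SIL1` after the comma, and `SIL2` after the final period). The sketch below groups the symbols back into words on that assumption. `split_words` is a hypothetical helper for illustration and is not part of the uploaded scripts.

```python
def split_words(transcription, sep="|"):
    """Group separator-delimited symbols into words, treating SIL* breaks as boundaries."""
    words, current = [], []
    for symbol in transcription.split(sep):
        if symbol.startswith("SIL"):
            # A break closes the current word; the break itself is dropped.
            if current:
                words.append(current)
                current = []
        else:
            current.append(symbol)
    if current:
        words.append(current)
    return words


# First three words of the example transcription.
example = "ð˘|ə|SIL0|n|ˈɔr|θ|SIL0|w|ˈɪ|nː|dː"
for word in split_words(example):
    print("".join(word))
```

Punctuation symbols stay attached to the preceding word, because in the example they appear before the break that follows them.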
Files (344.4 MB)
Checksum | Size |
---|---|
md5:d0d593c2dec0a53de9ca074609416e65 | 730 Bytes |
md5:975eb63eba5fa239e3c38e9545adc21e | 343.1 MB |
md5:2a897a7d805d1d757d3a10e46cb63956 | 674 Bytes |
md5:653f3aebf2b803d858bfe4515471e209 | 13.9 kB |
md5:ee60232dcf4ee20c843aa9983bf5503c | 2.0 kB |
md5:a07f757776a4276b8927eb60e4ed50ad | 3.1 kB |
md5:a42e8170f89fcffb92cdd631dc8c372b | 2.7 kB |
md5:c27ca7513bc8b63af70e34b1d6e0777f | 1.3 MB |
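The MD5 checksums listed for the files can be used to confirm that a download completed without corruption. The following is a minimal sketch using Python's standard `hashlib` module. `md5_of_file` and `matches_checksum` are hypothetical helper names, and which checksum belongs to which file has to be taken from the record page.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file against a checksum given as 'md5:<hex>' or plain '<hex>'."""
    expected = expected.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected
```

A call such as `matches_checksum("path/to/downloaded-file", "md5:<checksum from the record>")` returns `True` when the file matches. Reading in chunks keeps memory use low for the large checkpoint file.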
Additional details
References
- Ito, K., & Johnson, L. (2017). The LJ Speech Dataset. https://keithito.com/LJ-Speech-Dataset
- Wang, Y., Skerry-Ryan, R. J., Stanton, D., Wu, Y., Weiss, R. J., Jaitly, N., Yang, Z., Xiao, Y., Chen, Z., Bengio, S., Le, Q., Agiomyrgiannakis, Y., Clark, R., & Saurous, R. A. (2017). Tacotron: Towards End-to-End Speech Synthesis. Interspeech 2017, 4006–4010. https://doi.org/10.21437/Interspeech.2017-1452
- Shen, J., Pang, R., Weiss, R. J., Schuster, M., Jaitly, N., Yang, Z., Chen, Z., Zhang, Y., Wang, Y., Skerry-Ryan, R., Saurous, R. A., Agiomyrgiannakis, Y., & Wu, Y. (2018). Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions. 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 4779–4783. https://doi.org/10.1109/ICASSP.2018.8461368
- Prenger, R., Valle, R., & Catanzaro, B. (2019). WaveGlow: A Flow-based Generative Network for Speech Synthesis. ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 3617–3621. https://doi.org/10.1109/ICASSP.2019.8683143
- Taubert, S. (2023). LJ Speech - Aligned IPA transcriptions (0.0.2) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.7499098
- Taubert, S. (2023). tacotron-cli (0.0.4). Zenodo. https://doi.org/10.5281/zenodo.7543638
- Taubert, S. (2022). waveglow-cli (0.0.1). Zenodo. https://doi.org/10.5281/zenodo.7044345