LJ Speech English TTS with explicit duration markers

doi:10.5281/zenodo.10107104

Published November 10, 2023 | Version 0.0.1

Model Open

Taubert, Stefan (Researcher)¹

1. Chemnitz University of Technology

This upload contains a TTS model trained on the LJ Speech dataset using the aligned IPA transcriptions (Taubert, 2023), which contain phoneme duration markers. The model was trained using tacotron-cli.

The model achieves the following values (GT: ground truth):

MOS naturalness: 3.55 ± 0.28 (GT: 4.17 ± 0.23)
MOS intelligibility: 4.44 ± 0.24 (GT: 4.63 ± 0.19)
mean mel-cepstral distance: 29.15
mean penalty: 0.1018

Files:

101000.pt
- checkpoint after 500 epochs with a batch size of 64
1-setup-env.sh
- script to install all required tools
2-create-dataset.sh
- script to create the base dataset using public resources
3-create-train-val-set.sh
- script to create the training set and validation set
4-start-training.sh
- script to start training using Tacotron
5-convert-english-to-ipa.sh
- script to prepare English texts for synthesis by transcribing them to IPA
6-synthesize.sh
- script to synthesize IPA transcribed text
example-north-wind.zip
- contains an example passage which was synthesized using the model

The model is able to synthesize the following symbols:

vowels: i, u, æ, ɑ, ɔ, ə, ɛ, ɪ, ʊ, ʌ
diphthongs: aɪ, aʊ, eɪ, oʊ, ɔɪ
r-colored vowels: ɔr, ər, ɛr, ɪr, ʊr, ʌr
consonants: b, d, dʒ, f, h, j, k, l, m, n, p, r, s, t, tʃ, v, w, z, ð, ŋ, ɡ, ʃ, θ
breaks: SIL0, SIL1, SIL2, SIL3
special characters: . ? ! , : ; - — " ' ( ) [ ]

Each vowel, diphthong, r-colored vowel and consonant can carry one of these trailing duration markers: ː, ˑ or ˘, e.g., oʊː.

Furthermore, each vowel, diphthong and r-colored vowel can carry one of the leading stress symbols ˈ or ˌ, e.g., ˈoʊː.
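The marker rules above can be sketched as a small validator. This is a minimal illustration and not part of the upload: the function name `split_symbol` and the set names are hypothetical, and the inventory is copied from the lists above.

```python
# Symbol inventory copied from the lists above.
VOWELS = {"i", "u", "æ", "ɑ", "ɔ", "ə", "ɛ", "ɪ", "ʊ", "ʌ"}
DIPHTHONGS = {"aɪ", "aʊ", "eɪ", "oʊ", "ɔɪ"}
R_COLORED = {"ɔr", "ər", "ɛr", "ɪr", "ʊr", "ʌr"}
CONSONANTS = {
    "b", "d", "dʒ", "f", "h", "j", "k", "l", "m", "n", "p", "r",
    "s", "t", "tʃ", "v", "w", "z", "ð", "ŋ", "ɡ", "ʃ", "θ",
}
BREAKS = {"SIL0", "SIL1", "SIL2", "SIL3"}
SPECIAL = set('.?!,:;-—"\'()[]')
STRESS = {"ˈ", "ˌ"}
DURATIONS = {"ː", "ˑ", "˘"}

# Only these classes may carry a leading stress symbol.
STRESSABLE = VOWELS | DIPHTHONGS | R_COLORED


def split_symbol(symbol):
    """Split one symbol into (stress, base, duration).

    Raises ValueError if the symbol does not follow the rules above.
    """
    # Breaks and special characters take no markers.
    if symbol in BREAKS or symbol in SPECIAL:
        return "", symbol, ""

    stress, base, duration = "", symbol, ""
    if base and base[0] in STRESS:
        stress, base = base[0], base[1:]
    if base and base[-1] in DURATIONS:
        base, duration = base[:-1], base[-1]

    if base not in STRESSABLE | CONSONANTS:
        raise ValueError(f"unknown symbol: {symbol!r}")
    if stress and base not in STRESSABLE:
        raise ValueError(f"stress not allowed on consonant: {symbol!r}")
    return stress, base, duration


print(split_symbol("ˈoʊː"))
```

Checking input symbols this way before synthesis may help catch characters outside the model's inventory early.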

Example:

The North Wind and the Sun were disputing which was the stronger, when a traveler came along wrapped in a warm cloak.

ð˘|ə|SIL0|n|ˈɔr|θ|SIL0|w|ˈɪ|nː|dː|SIL0|ə|n˘|d|SIL0|ð|ə|SIL0|sː|ˈʌː|nː|SIL0|w|ˈʌr˘|SIL0|d|ɪ|s|p|j|ˈu|t|ɪ|ŋ|SIL0|w|ˈɪ˘|tʃ|SIL0|w|ˈɑ˘|z|SIL0|ð|ə|SIL0|s|t|r|ˈɔ˘|ŋ|ərˑ|,|SIL1|w|ˈɛ˘|n|SIL0|ə|SIL0|t|r|ˈæ|v|ə|l|ər|SIL0|k|ˈeɪ|m|SIL0|ə|l|ˈɔ|ŋ|SIL0|r|ˈæ|p|t˘|SIL0|ɪ|n|SIL0|ə|SIL0|wː|ˈɔrː|mː|SIL0|kː|l|ˈoʊ|k|.|SIL2
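To make the example easier to read, the following sketch splits a transcription on `|` and groups the symbols by the break that follows them. It assumes, based only on the example above, that `|` is the symbol separator and that breaks sit between groups of symbols; the function name `group_by_breaks` is hypothetical and not part of the upload.

```python
import re

# Break symbols as listed in the inventory: SIL0 to SIL3.
BREAK_PATTERN = re.compile(r"SIL[0-3]")


def group_by_breaks(transcription):
    """Return a list of (symbols, break) pairs.

    `symbols` is the list of symbols preceding a break; `break` is the
    break symbol itself, or None for a trailing group without a break.
    """
    chunks = []
    current = []
    for symbol in transcription.split("|"):
        if BREAK_PATTERN.fullmatch(symbol):
            chunks.append((current, symbol))
            current = []
        else:
            current.append(symbol)
    if current:
        chunks.append((current, None))
    return chunks


# First three groups of the example above.
excerpt = "ð˘|ə|SIL0|n|ˈɔr|θ|SIL0|w|ˈɪ|nː|dː|SIL0"
for symbols, brk in group_by_breaks(excerpt):
    print("".join(symbols), brk)
```

In the example, SIL0 appears between words, SIL1 after the comma and SIL2 after the final period; this is an observation from this one passage, not a documented definition of the break levels.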

Notes (English)

The authors gratefully acknowledge the GWK support for funding this project by providing computing time through the Center for Information Services and HPC (ZIH) at TU Dresden.

The authors are grateful to the Center for Information Services and High Performance Computing [Zentrum für Informationsdienste und Hochleistungsrechnen (ZIH)] at TU Dresden for providing its facilities for high throughput calculations.

Files (344.4 MB)

Name	MD5	Size
1-setup-env.sh	d0d593c2dec0a53de9ca074609416e65	730 Bytes
101000.pt	975eb63eba5fa239e3c38e9545adc21e	343.1 MB
2-create-dataset.sh	2a897a7d805d1d757d3a10e46cb63956	674 Bytes
3-create-train-val-set.sh	653f3aebf2b803d858bfe4515471e209	13.9 kB
4-start-training.sh	ee60232dcf4ee20c843aa9983bf5503c	2.0 kB
5-convert-english-to-ipa.sh	a07f757776a4276b8927eb60e4ed50ad	3.1 kB
6-synthesize.sh	a42e8170f89fcffb92cdd631dc8c372b	2.7 kB
example-north-wind.zip	c27ca7513bc8b63af70e34b1d6e0777f	1.3 MB
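The MD5 checksums listed above can be used to check downloaded files. The following is a minimal sketch using Python's standard `hashlib`; the helper names `md5_of` and `verify` and the assumption that all files sit in one local directory are illustrative only.

```python
import hashlib
from pathlib import Path

# Checksums copied from the file listing above.
EXPECTED_MD5 = {
    "1-setup-env.sh": "d0d593c2dec0a53de9ca074609416e65",
    "101000.pt": "975eb63eba5fa239e3c38e9545adc21e",
    "2-create-dataset.sh": "2a897a7d805d1d757d3a10e46cb63956",
    "3-create-train-val-set.sh": "653f3aebf2b803d858bfe4515471e209",
    "4-start-training.sh": "ee60232dcf4ee20c843aa9983bf5503c",
    "5-convert-english-to-ipa.sh": "a07f757776a4276b8927eb60e4ed50ad",
    "6-synthesize.sh": "a42e8170f89fcffb92cdd631dc8c372b",
    "example-north-wind.zip": "c27ca7513bc8b63af70e34b1d6e0777f",
}


def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(directory):
    """Map each listed file name to True (match), False (mismatch),
    or None (file not found in `directory`)."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        results[name] = md5_of(path) == expected if path.is_file() else None
    return results


if __name__ == "__main__":
    # Assumes the files were downloaded into the current directory.
    for name, ok in verify(".").items():
        print(name, ok)
```

Reading in chunks keeps memory use low for the 343.1 MB checkpoint.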

Additional details

Funding

SFB 1410 416228727: Deutsche Forschungsgemeinschaft

References

Ito, K., & Johnson, L. (2017). The LJ Speech Dataset. https://keithito.com/LJ-Speech-Dataset
Wang, Y., Skerry-Ryan, R. J., Stanton, D., Wu, Y., Weiss, R. J., Jaitly, N., Yang, Z., Xiao, Y., Chen, Z., Bengio, S., Le, Q., Agiomyrgiannakis, Y., Clark, R., & Saurous, R. A. (2017). Tacotron: Towards End-to-End Speech Synthesis. Interspeech 2017, 4006–4010. https://doi.org/10.21437/Interspeech.2017-1452
Shen, J., Pang, R., Weiss, R. J., Schuster, M., Jaitly, N., Yang, Z., Chen, Z., Zhang, Y., Wang, Y., Skerry-Ryan, R., Saurous, R. A., Agiomyrgiannakis, Y., & Wu, Y. (2018). Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions. 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 4779–4783. https://doi.org/10.1109/ICASSP.2018.8461368
Prenger, R., Valle, R., & Catanzaro, B. (2019). WaveGlow: A Flow-based Generative Network for Speech Synthesis. ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 3617–3621. https://doi.org/10.1109/ICASSP.2019.8683143
Taubert, S. (2023). LJ Speech - Aligned IPA transcriptions (0.0.2) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.7499098
Taubert, S. (2023). tacotron-cli (0.0.4). Zenodo. https://doi.org/10.5281/zenodo.7543638
Taubert, S. (2022). waveglow-cli (0.0.1). Zenodo. https://doi.org/10.5281/zenodo.7044345
