THCHS-30 Chinese TTS with explicit duration markers
Description
This upload contains a TTS model trained on the THCHS-30 dataset using the aligned IPA transcriptions (Taubert, 2023; see References), which contain explicit phoneme duration markers (see below). The model was trained with tacotron-cli (v0.0.4).
The model achieves the following values on the validation set:
- mean mel-cepstral distance: 24.58
- mean penalty: 0.1072
Files:
- 103500.pt: checkpoint after 500 epochs with a batch size of 64
- 1-setup-env.sh: script to install all required tools
- 2-create-dataset.sh: script to create the base dataset using public resources
- 3-create-train-val-set.sh: script to create the training set and validation set
- 4-start-training.sh: script to start training using Tacotron
- 5-convert-chinese-to-ipa.sh: script to prepare Chinese texts for synthesis by transcribing them to IPA
- 6-synthesize.sh: script to synthesize IPA-transcribed text
- example-north-wind.zip: an example passage synthesized using the model (speaker: D7)
The model is able to synthesize the following symbols:
- vowels: a ɛ e ə ɚ ɤ i o u ʊ y
- diphthongs: ai̯ au̯ ei̯ ou̯
- consonants: f j k kʰ l m n p pʰ ɹ̩ ɻ ɻ̩ s t ts tsʰ tɕ tɕʰ tʰ w x ŋ ɕ ɥ ʂ ʈʂ ʈʂʰ
- breaks:
  - SIL0 -> no break
  - SIL1 -> short break
  - SIL2 -> break
  - SIL3 -> long break
- special characters: 。 ?
Vowels and diphthongs contain one of these tones:
- ˥ -> first tone, e.g., e˥
- ˧˥ -> second tone, e.g., e˧˥
- ˧˩˧ -> third tone, e.g., e˧˩˧
- ˥˩ -> fourth tone, e.g., e˥˩
- nothing -> no tone, e.g., e
Vowels, diphthongs and consonants contain one of these duration markers:
- ˘ -> very short, e.g., ou̯˘
- nothing -> normal, e.g., ou̯
- ˑ -> half long, e.g., ou̯ˑ
- ː -> long, e.g., ou̯ː
Tones and duration markers can be combined, e.g., ə˧˥ː
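The tone and duration conventions above can be illustrated with a small parser. This is a minimal sketch, not part of the uploaded scripts: the function name `parse_symbol` and the returned labels are hypothetical, and it assumes that the tone always precedes the duration marker, as in the example ə˧˥ː.

```python
# Tone letters and duration markers as listed in the description.
TONES = {
    "˥": "first",
    "˧˥": "second",
    "˧˩˧": "third",
    "˥˩": "fourth",
}
DURATIONS = {
    "˘": "very short",
    "ˑ": "half long",
    "ː": "long",
}
TONE_LETTERS = set("".join(TONES))


def parse_symbol(symbol):
    """Split a symbol into (base, tone, duration).

    Assumption: the order is base + tone + duration.
    """
    duration = "normal"
    if symbol and symbol[-1] in DURATIONS:
        duration = DURATIONS[symbol[-1]]
        symbol = symbol[:-1]

    # Walk back over the trailing run of tone letters.
    cut = len(symbol)
    while cut > 0 and symbol[cut - 1] in TONE_LETTERS:
        cut -= 1
    base, tone_marks = symbol[:cut], symbol[cut:]

    if tone_marks and tone_marks not in TONES:
        raise ValueError(f"unknown tone sequence: {tone_marks!r}")
    tone = TONES.get(tone_marks, "none")
    return base, tone, duration


print(parse_symbol("ə˧˥ː"))
```

Stripping from the end keeps combining diacritics such as the one in ou̯ attached to the base symbol, since only the trailing tone letters and the final duration marker are removed.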
Available speakers:
- male: A9, A33, A35, B21, B34, A8, B8, C8, D8
- female: A11, A12, A13, A14, A19, A2, A22, A23, A32, A34, A36, A4, A5, A6, A7, B11, B12, B15, B2, B22, B31, B32, B33, B4, B6, B7, C12, C13, C14, C17, C18, C19, C2, C20, C21, C22, C23, C31, C32, C4, C6, C7, D11, D12, D13, D21, D31, D32, D4, D6, D7
Example sentence:
有一次, 北风 和 太阳 正在 争论 谁 比较 有本事。
j|ou̯˧˩˧|i˥|tsʰ|ɹ̩˥˩|SIL2|p|ei̯˧˩˧|f|ə˥|ŋ|SIL0|x|ɤ˧˥|SIL0|tʰ|ai̯˥˩|j|a˧˥˘|ŋ˘|SIL0|ʈʂ|ə˥˩|ŋ|ts|ai̯˥˩|SIL0|ʈʂ|ə˥|ŋ|l|w|ə˥˩ː|nˑ|SIL0|ʂ|w˘|ei̯˧˥|SIL0|p|i˧˩˧|tɕ˘|j|au̯˥˩˘|SIL0|j|ou̯˧˩˧|p|ə˧˩˧|n|ʂ|ɻ̩˥˩|。
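Judging from the example above, symbols in a transcription are separated by `|`. The following sketch splits such a string and labels each token as a break, a special character, or a phoneme. The function name `tokenize` and the labels are hypothetical, and the exact input format expected by 6-synthesize.sh is not confirmed here.

```python
# Break markers and special characters as listed in the description.
BREAKS = {
    "SIL0": "no break",
    "SIL1": "short break",
    "SIL2": "break",
    "SIL3": "long break",
}
SPECIAL = {"。", "?"}


def tokenize(transcription, sep="|"):
    """Return a list of (kind, token) pairs for a transcription string.

    Assumption: tokens are separated by `sep`, as in the example sentence.
    """
    tokens = []
    for token in transcription.split(sep):
        if token in BREAKS:
            kind = "break"
        elif token in SPECIAL:
            kind = "special"
        else:
            kind = "phoneme"
        tokens.append((kind, token))
    return tokens


# The beginning of the example sentence, followed by the final full stop.
for kind, token in tokenize("j|ou̯˧˩˧|i˥|tsʰ|ɹ̩˥˩|SIL2|。"):
    print(kind, token)
```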
Files (344.7 MB)
MD5 checksum | Size |
---|---|
a3be5e5858d44ac00d543bb7673c8f52 | 732 Bytes |
6c4523fc68b0dcc177db189ec63d9dc8 | 343.6 MB |
3784afec07415a697afac0baeb3952f2 | 3.0 kB |
df3a48296494e451e801974bf7e209a8 | 15.6 kB |
83913682cdda88e3245225111ecb8ffa | 1.9 kB |
d270f948c6ded9bc198332a73ebaf5da | 2.8 kB |
c39bb47023a56e67cc9c92556b5fbed8 | 2.8 kB |
75c5e26cd77df92aa93ca96f77f95c4c | 1.1 MB |
Additional details
References
- Wang, D., Wu, D., & Zhu, X. (2001). TCMSD: A New Chinese Continuous Speech Database. International Conference on Chinese Computing (ICCC'01), 2001.
- Wang, D., Zhang, X., & Zhang, Z. (2015). THCHS-30: A Free Chinese Speech Corpus (arXiv:1512.01882). arXiv. http://arxiv.org/abs/1512.01882
- Wang, Y., Skerry-Ryan, R. J., Stanton, D., Wu, Y., Weiss, R. J., Jaitly, N., Yang, Z., Xiao, Y., Chen, Z., Bengio, S., Le, Q., Agiomyrgiannakis, Y., Clark, R., & Saurous, R. A. (2017). Tacotron: Towards End-to-End Speech Synthesis. Interspeech 2017, 4006–4010. https://doi.org/10.21437/Interspeech.2017-1452
- Shen, J., Pang, R., Weiss, R. J., Schuster, M., Jaitly, N., Yang, Z., Chen, Z., Zhang, Y., Wang, Y., Skerry-Ryan, R., Saurous, R. A., Agiomyrgiannakis, Y., & Wu, Y. (2018). Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions. 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 4779–4783. https://doi.org/10.1109/ICASSP.2018.8461368
- Prenger, R., Valle, R., & Catanzaro, B. (2019). WaveGlow: A Flow-based Generative Network for Speech Synthesis. ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 3617–3621. https://doi.org/10.1109/ICASSP.2019.8683143
- Taubert, S. (2023). THCHS-30 - Aligned IPA transcriptions (0.0.1) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.7528596
- Taubert, S. (2023). tacotron-cli (0.0.4). Zenodo. https://doi.org/10.5281/zenodo.7543638
- Taubert, S. (2022). waveglow-cli (0.0.1). Zenodo. https://doi.org/10.5281/zenodo.7044345