Published November 27, 2023 | Version 0.0.1
Model Open

THCHS-30 Chinese TTS

  • 1. Chemnitz University of Technology

Description

This upload contains a TTS model trained on the THCHS-30 dataset using the aligned IPA transcriptions (Taubert, 2023), with the explicit phoneme duration markers removed. The model was trained with tacotron-cli (v0.0.4).

The model achieves the following values on the validation set:

  • mean mel-cepstral distance: 25.19
  • mean penalty: 0.1456
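For orientation, the following is a minimal sketch of one common textbook definition of mel-cepstral distance between frames that are already time-aligned. It is an assumption that this matches how the values above were computed: the record does not state the alignment method, the number of coefficients, or the scaling, and the penalty is not covered here. The function names `frame_mcd` and `mean_mcd` are hypothetical, so results from this sketch should not be compared directly with the reported 25.19.

```python
import math


def frame_mcd(ref, syn):
    """Mel-cepstral distance between two coefficient vectors of one frame.

    Uses the common definition (10 / ln 10) * sqrt(2 * sum((c - c')^2)).
    Whether the record's metric uses this scaling is an assumption.
    """
    if len(ref) != len(syn):
        raise ValueError("both frames need the same number of coefficients")
    squared_error = sum((r - s) ** 2 for r, s in zip(ref, syn))
    return (10.0 / math.log(10)) * math.sqrt(2.0 * squared_error)


def mean_mcd(ref_frames, syn_frames):
    """Average the frame-wise distance over already aligned frame sequences."""
    if len(ref_frames) != len(syn_frames) or not ref_frames:
        raise ValueError("need two non-empty sequences of equal length")
    total = sum(frame_mcd(r, s) for r, s in zip(ref_frames, syn_frames))
    return total / len(ref_frames)
```

In practice the reference and synthesized utterances differ in length, so an alignment step (for example dynamic time warping) has to run before a function like `mean_mcd` can be applied.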

Files:

  • 103500.pt
    • checkpoint after 500 epochs with a batch size of 64
  • 1-setup-env.sh
    • script to install all required tools
  • 2-create-dataset.sh
    • script to create the base dataset using public resources
  • 3-create-train-val-set.sh
    • script to create the training set and validation set
  • 4-start-training.sh
    • script to start training using Tacotron
  • 5-convert-chinese-to-ipa.sh
    • script to prepare Chinese texts for synthesis by transcribing them to IPA
  • 6-synthesize.sh
    • script to synthesize IPA transcribed text
  • example-north-wind.zip
    • contains an example passage which was synthesized using the model (speaker: D7)

The model is able to synthesize the following symbols:

  • vowels: a ɛ e ə ɚ ɤ i o ɹ̩ ɻ ɻ̩ u ʊ y
  • diphthongs: ai̯ au̯ ei̯ ou̯
  • consonants: f j k kʰ l m n p pʰ s t ts tsʰ tɕ tɕʰ tʰ w x ŋ ɕ ɥ ʂ ʈʂ ʈʂʰ
  • breaks:
    • SIL0 -> no break
    • SIL1 -> short break
    • SIL2 -> break
    • SIL3 -> long break
  • special characters: 。 ?

Vowels and diphthongs contain one of these tones:

  • ˥ -> first tone, e.g., e˥
  • ˧˥ -> second tone, e.g., e˧˥
  • ˧˩˧ -> third tone, e.g., e˧˩˧
  • ˥˩ -> fourth tone, e.g., e˥˩
  • nothing -> no tone, e.g., e

Available speakers:

  • male: A9, A33, A35, B21, B34, A8, B8, C8, D8
  • female: A11, A12, A13, A14, A19, A2, A22, A23, A32, A34, A36, A4, A5, A6, A7, B11, B12, B15, B2, B22, B31, B32, B33, B4, B6, B7, C12, C13, C14, C17, C18, C19, C2, C20, C21, C22, C23, C31, C32, C4, C6, C7, D11, D12, D13, D21, D31, D32, D4, D6, D7

Example sentence:

有一次, 北风 和 太阳 正在 争论 谁 比较 有本事。

j|ou̯˧˩˧|i˥|tsʰ|ɹ̩˥˩|SIL2|p|ei̯˧˩˧|f|ə˥|ŋ|SIL0|x|ɤ˧˥|SIL0|tʰ|ai̯˥˩|j|a˧˥|ŋ|SIL0|ʈʂ|ə˥˩|ŋ|ts|ai̯˥˩|SIL0|ʈʂ|ə˥|ŋ|l|w|ə˥˩|n|SIL0|ʂ|w|ei̯˧˥|SIL0|p|i˧˩˧|tɕ|j|au̯˥˩|SIL0|j|ou̯˧˩˧|p|ə˧˩˧|n|ʂ|ɻ̩˥˩|。
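The symbol inventory, tone marks, and example above suggest a simple pre-synthesis check. The sketch below splits a pipe-separated transcription into symbols and reports any that fall outside the listed inventory. The helper names (`split_tone`, `is_supported`, `unsupported_symbols`) are hypothetical and not part of tacotron-cli, and treating "|" as the separator is inferred from the example line only.

```python
import unicodedata

# Inventory copied from the lists in this record.
VOWELS = "a ɛ e ə ɚ ɤ i o ɹ̩ ɻ ɻ̩ u ʊ y".split()
DIPHTHONGS = "ai̯ au̯ ei̯ ou̯".split()
CONSONANTS = "f j k kʰ l m n p pʰ s t ts tsʰ tɕ tɕʰ tʰ w x ŋ ɕ ɥ ʂ ʈʂ ʈʂʰ".split()
BREAKS = ["SIL0", "SIL1", "SIL2", "SIL3"]
SPECIAL = ["。", "?"]
TONES = ["˥", "˧˥", "˧˩˧", "˥˩"]

EXAMPLE = (
    "j|ou̯˧˩˧|i˥|tsʰ|ɹ̩˥˩|SIL2|p|ei̯˧˩˧|f|ə˥|ŋ|SIL0|x|ɤ˧˥|SIL0|tʰ|ai̯˥˩|j|a˧˥|ŋ|SIL0|"
    "ʈʂ|ə˥˩|ŋ|ts|ai̯˥˩|SIL0|ʈʂ|ə˥|ŋ|l|w|ə˥˩|n|SIL0|ʂ|w|ei̯˧˥|SIL0|p|i˧˩˧|tɕ|j|au̯˥˩|"
    "SIL0|j|ou̯˧˩˧|p|ə˧˩˧|n|ʂ|ɻ̩˥˩|。"
)


def nfd(text):
    """Normalize so that combining marks compare equal however they were typed."""
    return unicodedata.normalize("NFD", text)


TONAL = {nfd(s) for s in VOWELS + DIPHTHONGS}          # may carry a tone
ATONAL = {nfd(s) for s in CONSONANTS + BREAKS + SPECIAL}  # never carry a tone


def split_tone(symbol):
    """Split a symbol into (base, tone); tone is "" when no tone letters trail."""
    symbol = nfd(symbol)
    base = symbol.rstrip("˥˧˩")
    return base, symbol[len(base):]


def is_supported(symbol):
    """Check a single symbol against the inventory and tone list."""
    base, tone = split_tone(symbol)
    if tone:
        return base in TONAL and tone in TONES
    return base in TONAL or base in ATONAL


def unsupported_symbols(transcription, separator="|"):
    """Return all symbols of a transcription that the model does not list."""
    return [s for s in transcription.split(separator) if not is_supported(s)]


if __name__ == "__main__":
    print(unsupported_symbols(EXAMPLE))
```

Running such a check before synthesis catches symbols such as a consonant carrying a tone mark, which the lists above do not allow.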

Notes (English)

The authors gratefully acknowledge the GWK support for funding this project by providing computing time through the Center for Information Services and HPC (ZIH) at TU Dresden.

The authors are grateful to the Center for Information Services and High Performance Computing [Zentrum für Informationsdienste und Hochleistungsrechnen (ZIH)] at TU Dresden for providing its facilities for high throughput calculations.

Files (342.8 MB)

Checksum                                Size
md5:aa8de1324b822cb70447e74a7f514ddb    728 Bytes
md5:54e5b3dd2e3d0a35922c361b0bd82f19    341.6 MB
md5:3784afec07415a697afac0baeb3952f2    3.0 kB
md5:11de1b5ca5e13a7a586cc3e51b044a64    16.2 kB
md5:e9237e3038dcfa9c143fb29aab914c19    1.9 kB
md5:2bef60367cea7cd09ffc35a9bafea20f    2.9 kB
md5:e6db33a699ab6b23ca8938cfe1e26c75    2.7 kB
md5:98fee13d863016a14ea8ac8d7f9cf5c3    1.2 MB
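The MD5 checksums listed above can be used to confirm that a downloaded file arrived intact. The following is a minimal sketch using only the Python standard library. The function names are hypothetical, and the caller supplies both the local path and the expected checksum string.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file against a checksum given as 'md5:<hex>' or plain hex."""
    expected_hex = expected.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected_hex
```

Reading in chunks keeps memory use low, which matters for the checkpoint of several hundred megabytes.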

Additional details

Funding

Deutsche Forschungsgemeinschaft: SFB 1410 (416228727)

References

  • Wang, D., Wu, D., & Zhu, X. (2001). TCMSD: A New Chinese Continuous Speech Database. International Conference on Chinese Computing (ICCC'01), 2001.
  • Wang, D., Zhang, X., & Zhang, Z. (2015). THCHS-30: A Free Chinese Speech Corpus (arXiv:1512.01882). arXiv. http://arxiv.org/abs/1512.01882
  • Wang, Y., Skerry-Ryan, R. J., Stanton, D., Wu, Y., Weiss, R. J., Jaitly, N., Yang, Z., Xiao, Y., Chen, Z., Bengio, S., Le, Q., Agiomyrgiannakis, Y., Clark, R., & Saurous, R. A. (2017). Tacotron: Towards End-to-End Speech Synthesis. Interspeech 2017, 4006–4010. https://doi.org/10.21437/Interspeech.2017-1452
  • Shen, J., Pang, R., Weiss, R. J., Schuster, M., Jaitly, N., Yang, Z., Chen, Z., Zhang, Y., Wang, Y., Skerry-Ryan, R., Saurous, R. A., Agiomyrgiannakis, Y., & Wu, Y. (2018). Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions. 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 4779–4783. https://doi.org/10.1109/ICASSP.2018.8461368
  • Prenger, R., Valle, R., & Catanzaro, B. (2019). WaveGlow: A Flow-based Generative Network for Speech Synthesis. ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 3617–3621. https://doi.org/10.1109/ICASSP.2019.8683143
  • Taubert, S. (2023). THCHS-30 - Aligned IPA transcriptions (0.0.1) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.7528596
  • Taubert, S. (2023). tacotron-cli (0.0.4). Zenodo. https://doi.org/10.5281/zenodo.7543638
  • Taubert, S. (2022). waveglow-cli (0.0.1). Zenodo. https://doi.org/10.5281/zenodo.7044345