IMPROVING SPEECH NATURALNESS IN UZBEK TEXT-TO-SPEECH USING DEEP LEARNING-BASED PROSODY MODELING
Authors/Creators
- 1. Samarkand Branch of Tashkent University of Information Technologies
Description
Naturalness remains one of the most critical challenges in text-to-speech (TTS) systems, especially for low-resource languages such as Uzbek. Recent advances in deep learning have significantly improved the intelligibility of synthesized speech, but achieving natural prosody (appropriate intonation, rhythm, stress, and timing) remains a complex problem. This study focuses on improving speech naturalness in Uzbek TTS systems through deep learning-based prosody modeling. The paper analyzes existing approaches to prosody modeling, discusses the linguistic characteristics of Uzbek that affect prosodic patterns, and proposes integrating neural network-based methods to capture expressive and natural speech features. The findings highlight the potential of deep learning architectures to improve the quality and naturalness of Uzbek speech synthesis and to contribute to more human-like TTS systems.
Files
| Name | Size |
|---|---|
| 465-468.pdf (md5:a43427b0755b61b329e4a903dcb7e54a) | 212.5 kB |