Published February 15, 2026 | Version v1
Dataset Open

IMPROVING SPEECH NATURALNESS IN UZBEK TEXT-TO-SPEECH USING DEEP LEARNING-BASED PROSODY MODELING

  • 1. Samarkand Branch of Tashkent University of Information Technologies

Description

Achieving natural-sounding speech is one of the most critical challenges in text-to-speech (TTS) systems, especially for low-resource languages such as Uzbek. While recent advances in deep learning have significantly improved the intelligibility of synthesized speech, achieving natural prosody (appropriate intonation, rhythm, stress, and timing) remains a complex problem. This study focuses on improving speech naturalness in Uzbek TTS systems through deep learning-based prosody modeling. The paper analyzes existing approaches to prosody modeling, discusses the linguistic characteristics of Uzbek that affect prosodic patterns, and proposes integrating neural network-based methods to capture expressive and natural speech features. The findings highlight the potential of deep learning architectures to improve the naturalness of Uzbek speech synthesis and to contribute to more human-like TTS systems.

Files

465-468.pdf (212.5 kB)
md5:a43427b0755b61b329e4a903dcb7e54a

Additional details

References

  • Taylor, P. (2009). Text-to-Speech Synthesis. Cambridge University Press.
  • Zen, H., Tokuda, K., & Black, A. W. (2009). Statistical parametric speech synthesis. Speech Communication, 51(11), 1039–1064.
  • Wang, Y., et al. (2017). Tacotron: Towards end-to-end speech synthesis. Proceedings of Interspeech.
  • Jumanazar o'g'li, B. J. Socio-psychological characteristics of the formation of social institutions in students.
  • van den Oord, A., et al. (2016). WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499.