Published April 2, 2025 | Version v1
Journal article | Open Access

Comprehensive Survey on Kannada Language Speech to English Language Translation and Voice Cloning System

Description

India is a culturally rich country with 22 official languages and countless dialects spoken across its regions. This linguistic diversity, however, often acts as a communication barrier, hindering interaction between speakers of different languages. To address this challenge, there is growing interest in applying Artificial Intelligence (AI) to language translation. This survey examines AI-based language translation, with a specific focus on converting a local language, Kannada, into a widely spoken one, English. Two models play a central role in this work: VALL-E X and ELLA-V. Both are trained on extensive multilingual speech data and are designed to bridge communication gaps through zero-shot cross-lingual speech synthesis. The proposed approach takes advantage of recent advances in text-to-speech synthesis, where voice cloning techniques have matured and synthesized speech quality now approaches human parity. The survey outlines an approach, built around VALL-E X, for high-quality zero-shot cross-lingual voice synthesis trained on large multilingual speech corpora, with the aim of reducing communication breakdowns and supporting smooth information transfer across linguistic contexts.
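The cascade implied above can be made concrete in a few lines. The sketch below is a hypothetical skeleton, not the paper's implementation: the three stage functions (transcribe_kannada, translate_kn_to_en, synthesize_cloned) are placeholder names standing in for a Kannada ASR model, a Kannada-to-English machine translation model, and a VALL-E X style zero-shot TTS model. Only the data flow is taken from the description, including the detail that the input utterance itself can serve as the voice-cloning reference prompt.

    # Minimal sketch of the cascaded Kannada -> English speech translation
    # pipeline with voice cloning. All component functions are placeholders.

    from dataclasses import dataclass

    @dataclass
    class Audio:
        samples: list[float]   # mono waveform samples
        sample_rate: int       # e.g. 16000 Hz

    def transcribe_kannada(speech: Audio) -> str:
        """ASR stage: Kannada audio -> Kannada text (placeholder)."""
        raise NotImplementedError("plug in a Kannada ASR model here")

    def translate_kn_to_en(text_kn: str) -> str:
        """MT stage: Kannada text -> English text (placeholder)."""
        raise NotImplementedError("plug in a Kannada-to-English MT model here")

    def synthesize_cloned(text_en: str, reference: Audio) -> Audio:
        """Zero-shot TTS stage: speak English text in the reference speaker's
        voice. VALL-E X style codec language models do this by conditioning
        on acoustic tokens extracted from a short reference prompt."""
        raise NotImplementedError("plug in a zero-shot cross-lingual TTS model here")

    def kannada_to_english_cloned(speech_kn: Audio) -> Audio:
        """End-to-end cascade; the input utterance doubles as the voice prompt."""
        text_kn = transcribe_kannada(speech_kn)
        text_en = translate_kn_to_en(text_kn)
        return synthesize_cloned(text_en, reference=speech_kn)

Because the reference prompt is just audio, a deployed system could instead pass a separate enrollment clip of the speaker; the cascade itself is unchanged.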

Files

Comprehensive Survey on Kannada Language.pdf (365.4 kB)


Additional details

References

  • 1. Swamy, M. (2022). Robust automatic speech recognition system for Kannada speech sentences in the presence of noise.
  • 2. Zhang, Z., et al. (2023). Speak foreign languages with your own voice: Cross-lingual neural codec language modeling. arXiv preprint arXiv:2303.03926.
  • 3. Song, Y., Chen, Z., Wang, X., Ma, Z., & Chen, X. (2024). ELLA-V: Stable neural codec language modeling with alignment-guided sequence reordering. arXiv preprint arXiv:2401.07333.
  • 4. Kain, A., & Macon, M. (1998). Personalizing a speech synthesizer by voice adaptation. In The Third ESCA/COCOSDA Workshop (ETRW) on Speech Synthesis.
  • 5. Xin, D., Saito, Y., Takamichi, S., Koriyama, T., & Saruwatari, H. (2021). Cross-lingual speaker adaptation using domain adaptation and speaker consistency loss for text-to-speech synthesis. In Interspeech (pp. 1614-1618).
  • 6. Le, M., Vyas, A., Shi, B., Karrer, B., Sari, L., Moritz, R., ... & Hsu, W. N. (2024). Voicebox: Text-guided multilingual universal speech generation at scale. Advances in Neural Information Processing Systems, 36.
  • 7. Sun, J., Chen, H., Tian, J., & Xie, L. (2022). Speaker embedding for cross-lingual speech synthesis. arXiv preprint arXiv:2204.09042.
  • 8. Chen, Z., Rosenberg, A., Zhang, Y., Wang, G., Ramabhadran, B., & Moreno, P. J. (2020). Improving speech recognition using GAN-based speech synthesis and contrastive unspoken text selection. In Interspeech (pp. 556-560).
  • 9. Baevski, A., Srinivasan, A., Shankar, S., Bengio, Y., & Pineau, J. (2022). Meta-learning for zero-shot cross-lingual speech synthesis. arXiv preprint arXiv:2206.05150.
  • 10. Nguyen, H., Li, K., & Unoki, M. (2022). Automatic mean opinion score estimation with temporal modulation features on gammatone filterbank for speech assessment. In Interspeech (pp. 4526-4530).