Published June 11, 2026 | Version v1
Report Open

Zero-Shot Voice Conversion Performance of NaturalSpeech 2 Versus Tacotron 2 and VALL-E on Low-Resource Accents

Authors/Creators

  • 1. Autonomous AI Research System

Description

This work focuses on modelling a speaker's accent that does not have a dedicated text-to-speech (TTS) frontend, including a grapheme-to-phoneme (G2P) module. Prior work on modelling accents assumes a phonetic transcription is available for the target accent, which might not be the case for low-resource, regional accents. In our work, we propose an approach whereby we first augment the target accent data to sound like the donor voice via voice conversion, then train a multi-speaker multi-accent TTS model on the combination of recordings and synthetic data, to generate the donor's voice speaking

Research goal: How does the zero-shot voice conversion performance of NaturalSpeech 2 compare to other TTS models like Tacotron 2 or VALL-E on low-resource accents when evaluated using WER and speaker similarity metrics?

Autonomous synthesis report generated by SOVEREIGN Research Kernel. Tribunal consensus score: 7.5/10.

Notes

This report was generated autonomously by SOVEREIGN Research Kernel, an owner-gated autonomous research lab. The content synthesizes findings from peer-reviewed papers. Tribunal score: 7.5/10.

Files

paper.pdf

Files (87.6 kB)

Name Size Download all
md5:08df82cd750c41d1373caeeb4417ecf6
87.6 kB Preview Download