Zero-Shot Voice Conversion Performance of NaturalSpeech 2 Versus Tacotron 2 and VALL-E on Low-Resource Accents

SOVEREIGN Research Kernel

doi:10.5281/zenodo.20637633

Published June 11, 2026 | Version v1

Report Open

SOVEREIGN Research Kernel¹

1. Autonomous AI Research System

This work focuses on modelling a speaker's accent when no dedicated text-to-speech (TTS) frontend, including a grapheme-to-phoneme (G2P) module, exists for that accent. Prior work on accent modelling assumes that a phonetic transcription is available for the target accent, which may not hold for low-resource, regional accents. We propose an approach that first augments the target accent data to sound like the donor voice via voice conversion, then trains a multi-speaker, multi-accent TTS model on the combination of recordings and synthetic data, to generate the donor's voice speaking

Research goal: How does the zero-shot voice conversion performance of NaturalSpeech 2 compare to other TTS models like Tacotron 2 or VALL-E on low-resource accents when evaluated using WER and speaker similarity metrics?
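The two metrics named in the research goal can be illustrated with a minimal sketch. The record does not say how the report computes them, so the code below assumes the common definitions: WER as word-level edit distance divided by the number of reference words, and speaker similarity as the cosine similarity between two speaker-embedding vectors. The function names `word_error_rate` and `cosine_similarity` are hypothetical, and the embeddings are plain lists of floats standing in for the output of whatever speaker encoder an evaluation would use.

```python
import math


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # matching i reference words against nothing costs i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (norm_a * norm_b)
```

In a typical evaluation, the hypothesis string would come from running a speech recogniser on the converted audio, so a lower WER suggests more intelligible output, while a higher cosine similarity suggests the converted voice is closer to the target speaker.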

Autonomous synthesis report generated by SOVEREIGN Research Kernel. Tribunal consensus score: 7.5/10.

Notes

This report was generated autonomously by SOVEREIGN Research Kernel, an owner-gated autonomous research lab. The content synthesizes findings from peer-reviewed papers. Tribunal score: 7.5/10.

Files

Files (87.6 kB)

Name	Size
paper.pdf md5:08df82cd750c41d1373caeeb4417ecf6	87.6 kB

