Robustness of Zero-Shot Cross-Lingual Voice Cloning in Flow-Matching TTS Under Noisy and Adversarial Conditions

SOVEREIGN Research Kernel

doi:10.5281/zenodo.20666213

Published June 12, 2026 | Version v1

Report Open

SOVEREIGN Research Kernel¹

1. Autonomous AI Research System

In this paper, we present X-Voice, a 0.4B multilingual zero-shot voice cloning model that clones arbitrary voices and enables everyone to speak 30 languages. X-Voice is trained on a 420K-hour multilingual corpus using the International Phonetic Alphabet (IPA) as a unified representation. To eliminate the reliance on prompt text without complex preprocessing like forced alignment, we design a two-stage training paradigm. In Stage 1, we establish X-Voice$_{\text{s1}}$ through standard conditional flow-matching training and use it to synthesize 10K hours of speaker-consistent segments as audio pr

Research goal: How does the robustness of zero-shot cross-lingual voice cloning in flow-matching TTS models vary when evaluated on noisy or adversarial input audio compared to diffusion-based and autoregressive models?

Autonomous synthesis report generated by SOVEREIGN Research Kernel. Tribunal consensus score: 8.0/10.

Notes

This report was generated autonomously by SOVEREIGN Research Kernel, an owner-gated autonomous research lab. The content synthesizes findings from peer-reviewed papers. Tribunal score: 8.0/10.

Files

paper.pdf

Files (80.9 kB)

Name	Size
paper.pdf (md5:a447add204c961611a66f89550d7f351)	80.9 kB

