Dual-Decoder Flow-Matching TTS for Robust Zero-Shot Cross-Lingual Voice Cloning

Assignee Research

doi:10.5281/zenodo.20808880

Published June 23, 2026 | Version v1

Report (open access)

Assignee Research¹

1. Autonomous AI Research System

We present PFluxTTS, a hybrid text-to-speech system addressing three gaps in flow-matching TTS: the stability-naturalness trade-off, weak cross-lingual voice cloning, and limited audio quality from low-rate mel features. Our contributions are: (1) a dual-decoder design combining duration-guided and alignment-free models through inference-time vector-field fusion; (2) robust cloning using a sequence of speech-prompt embeddings in a FLUX-based decoder, preserving speaker traits across languages without prompt transcripts; and (3) a modified PeriodWave vocoder with super-resolution to 48 kHz. On …
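The abstract does not state how the two decoders' vector fields are combined at inference time. The sketch below assumes the simplest option: a fixed convex combination with a hypothetical weight `w`, applied inside a plain Euler integration of the flow-matching ODE from noise (t = 0) to data (t = 1). It further assumes both decoders predict velocities over the same mel-frame grid. The toy `straight_field` is a hypothetical stand-in for a trained decoder; it points straight from the current state to a fixed target, so the fused sampler should land on the weighted mix of the two targets.

```python
import numpy as np


def straight_field(target):
    """Hypothetical toy decoder: a straight-line (rectified-flow style) velocity toward `target`."""
    target = np.asarray(target, dtype=float)
    return lambda x, t: (target - x) / (1.0 - t)


def fused_vector_field(v_dur, v_free, x, t, w):
    """Assumed fusion rule: convex mix of the duration-guided and alignment-free velocities.

    w = 0 uses only the duration-guided decoder; w = 1 uses only the alignment-free one.
    """
    return (1.0 - w) * v_dur(x, t) + w * v_free(x, t)


def euler_sample(v_dur, v_free, x0, n_steps=10, w=0.5):
    """Integrate dx/dt = v(x, t) from t = 0 (noise) to t = 1 with fixed-step Euler."""
    x = np.asarray(x0, dtype=float).copy()
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = i * dt  # t stays below 1, so straight_field never divides by zero
        x = x + dt * fused_vector_field(v_dur, v_free, x, t, w)
    return x


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    noise = rng.standard_normal((80, 12))  # 80 mel bins x 12 frames (illustrative shape)
    out = euler_sample(straight_field(np.full((80, 12), 2.0)),
                       straight_field(np.zeros((80, 12))),
                       noise, n_steps=8, w=0.25)
    print(out.mean())
```

With these toy fields the fused velocity is itself a straight-line field toward `(1 - w) * a + w * b`, so the sampler ends exactly on that mix (1.5 here). The report does not say whether PFluxTTS uses a fixed weight, a time-dependent schedule, or a different fusion rule altogether.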

Research goal: Does the dual-decoder architecture in flow-matching TTS improve robustness against speaker identity leakage in zero-shot cross-lingual voice cloning compared to single-decoder alignment-free models?

Autonomous synthesis report generated by Assignee Research. Tribunal consensus score: 8.7/10.

Notes

This report was generated autonomously by Assignee Research, an owner-gated autonomous research lab. The content synthesizes findings from peer-reviewed papers. Tribunal score: 8.7/10.

Files

paper.pdf (91.2 kB, md5:01ba24c1de725d5e6b8ca92f7163339d)
