Dual-Decoder Flow-Matching TTS for Robust Zero-Shot Cross-Lingual Voice Cloning
Description
We present PFluxTTS, a hybrid text-to-speech system addressing three gaps in flow-matching TTS: the stability-naturalness trade-off, weak cross-lingual voice cloning, and limited audio quality from low-rate mel features. Our contributions are: (1) a dual-decoder design combining duration-guided and alignment-free models through inference-time vector-field fusion; (2) robust cloning using a sequence of speech-prompt embeddings in a FLUX-based decoder, preserving speaker traits across languages without prompt transcripts; and (3) a modified PeriodWave vocoder with super-resolution to 48 kHz. On
Research goal: Does the dual-decoder architecture in flow-matching TTS improve robustness against speaker identity leakage in zero-shot cross-lingual voice cloning compared to single-decoder alignment-free models?
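The inference-time vector-field fusion named in the description can be illustrated with a minimal sketch. The record does not state the fusion rule, so the convex combination with weight `alpha`, the Euler solver, and the names `fused_euler_sample`, `v_guided`, and `v_free` are all assumptions made for illustration, not the PFluxTTS implementation. The two toy fields stand in for the duration-guided and alignment-free decoders.

```python
import numpy as np


def fused_euler_sample(v_guided, v_free, x0, alpha=0.5, steps=32):
    """Integrate dx/dt = alpha * v_guided + (1 - alpha) * v_free from t=0 to t=1.

    Assumption: the two decoders' vector fields are fused by a fixed convex
    combination; the actual fusion rule is not given in this record.
    """
    x = np.asarray(x0, dtype=float).copy()
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        # Query both (hypothetical) decoders at the same state and time.
        v = alpha * v_guided(x, t) + (1.0 - alpha) * v_free(x, t)
        x = x + dt * v  # explicit Euler step
    return x


# Toy stand-ins for the two decoders: constant velocity fields.
def toy_guided(x, t):
    return np.array([1.0, 0.0])


def toy_free(x, t):
    return np.array([0.0, 1.0])


start = np.zeros(2)
fused = fused_euler_sample(toy_guided, toy_free, start, alpha=0.5)
only_guided = fused_euler_sample(toy_guided, toy_free, start, alpha=1.0)
```

With `alpha=1.0` the sampler reduces to the duration-guided field alone and with `alpha=0.0` to the alignment-free field alone, so a single weight moves between the two behaviors in this sketch.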
Autonomous synthesis report generated by Assignee Research. Tribunal consensus score: 8.7/10.
Notes
Files
| Name | Size |
|---|---|
| paper.pdf (md5:01ba24c1de725d5e6b8ca92f7163339d) | 91.2 kB |