Quantitative Trade-off Between Inference Latency and Speaker Verification Accuracy in Knowledge-Distilled Multi-Speaker Synthesis
Description
Personalizing a speech synthesis system, so that it generates speech in the user's voice from only a few enrolled recordings, is a highly desired application. Recent work takes two main approaches to building such a system: speaker adaptation and speaker encoding. On the one hand, speaker adaptation methods fine-tune a trained multi-speaker text-to-speech (TTS) model with a few enrolled samples. However, they require at least thousands of fine-tuning steps for high-quality adaptation, which makes them hard to apply on devices. On the other hand, speaker encoding methods encode enrollment utter
Research goal: What is the quantitative trade-off between inference latency and speaker verification accuracy when applying knowledge distillation to multi-speaker speech synthesis models?
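The two quantities named in this goal can be made concrete in several ways. The following is a minimal sketch under assumptions that this record does not state: speaker verification accuracy is summarized as an equal error rate (EER) over similarity scores, and inference latency as mean wall-clock time per call. The function names `equal_error_rate` and `mean_latency_ms` are hypothetical and are not taken from the report.

```python
import time
from typing import Callable, Sequence


def equal_error_rate(target_scores: Sequence[float],
                     nontarget_scores: Sequence[float]) -> float:
    """Approximate EER by sweeping every observed score as a threshold.

    Assumption: higher score means "more likely the same speaker".
    Returns the mean of false-accept and false-reject rates at the
    threshold where the two rates are closest.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, best_eer = float("inf"), 1.0
    for t in thresholds:
        # False rejects: genuine trials scored below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False accepts: impostor trials scored at or above the threshold.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, best_eer = gap, (far + frr) / 2
    return best_eer


def mean_latency_ms(fn: Callable[[], object], repeats: int = 20) -> float:
    """Mean wall-clock time of fn() in milliseconds over `repeats` calls."""
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    return (time.perf_counter() - start) * 1000 / repeats


if __name__ == "__main__":
    # Toy scores for illustration only; these are not results from the report.
    genuine = [0.9, 0.8, 0.7, 0.4]
    impostor = [0.1, 0.2, 0.3, 0.6]
    print("EER:", equal_error_rate(genuine, impostor))
    print("latency (ms):", mean_latency_ms(lambda: sum(range(1000))))
```

Computing both numbers for a teacher model and for each distilled student would give one (latency, EER) point per model, which is one possible way to chart the trade-off the goal asks about.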
Autonomous synthesis report generated by SOVEREIGN Research Kernel. Tribunal consensus score: 8.5/10.
Notes
Files
| Name | Size |
|---|---|
| paper.pdf (md5:4f77ccaa9e07b97d711177b4bdda816c) | 87.9 kB |