PHE-Net: Envelope-Guided Speaker Extraction with Unlimited Speaker Scalability via WavLM-Based Discovery
Authors/Creators
Description
We present PHE-Net, a modular voice extraction system that separates individual speakers from single-channel mixtures of 2 to 20 simultaneous talkers. With oracle guidance, the system achieves +18.27 dB SI-SNRi and scales from N=2 to N=20 with zero degradation. In fully blind evaluation, it achieves +8.20 dB SI-SNRi at N=10 speakers with no enrollment audio. Through systematic ablation, we discover that the spectral envelope channel alone determines extraction quality: speaker embeddings are provably ignored (results at cosine 0.50 equal those at cosine 1.00), and F0 pitch contributes nothing when the envelope is sufficient (the zero-F0 ceiling is +16.25 dB at N=10). This finding reduces the research problem to a single well-defined challenge: improving blind spectral envelope estimation from multi-speaker mixtures.
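The results above are reported as SI-SNRi (scale-invariant signal-to-noise ratio improvement). As a minimal sketch of how this metric is commonly computed, and not the authors' evaluation code, the following uses the standard definition: project the zero-mean estimate onto the zero-mean target, treat the residual as noise, and subtract the SI-SNR of the unprocessed mixture. The function names `si_snr` and `si_snr_improvement` and the `eps` stabiliser are assumptions made for illustration.

```python
import math


def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB between two equal-length sample sequences."""
    if len(estimate) != len(target):
        raise ValueError("estimate and target must have the same length")
    n = len(target)
    # Remove the mean so a DC offset does not affect the score.
    mean_e = sum(estimate) / n
    mean_t = sum(target) / n
    e = [x - mean_e for x in estimate]
    t = [x - mean_t for x in target]
    # Project the estimate onto the target direction.
    dot = sum(a * b for a, b in zip(e, t))
    target_energy = sum(b * b for b in t)
    scale = dot / (target_energy + eps)
    projection = [scale * b for b in t]
    # Everything not explained by the target counts as noise.
    residual = [a - p for a, p in zip(e, projection)]
    signal_energy = sum(p * p for p in projection)
    noise_energy = sum(r * r for r in residual)
    return 10.0 * math.log10((signal_energy + eps) / (noise_energy + eps))


def si_snr_improvement(estimate, mixture, target):
    """SI-SNRi: gain of the estimate over the unprocessed mixture, in dB."""
    return si_snr(estimate, target) - si_snr(mixture, target)
```

Because the estimate is projected onto the target before the ratio is taken, rescaling the estimate leaves the score unchanged, which is why the metric suits extraction systems whose output gain is arbitrary.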
Files
blind_N10_mix.wav
Total size: 669.2 kB

| Checksum (MD5) | Size |
|---|---|
| c0698eaca8aec782835cb7c44b0d8b4c | 96.0 kB |
| 46d3534b64f256824a09e12a6066c47c | 96.0 kB |
| fcd490861db6d73798653ce0ef7e468f | 96.0 kB |
| d8714c560b7342ce66678747bee0c58e | 160.0 kB |
| d2802bb0ab178443b8437d3bf77a891c | 160.0 kB |
| 640f9407a9eddae06d62d8b492ba23fe | 60.5 kB |
| 7679b76dc20175d0f844237c98c961b4 | 465 Bytes |
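The MD5 checksums listed above can be used to confirm that a downloaded file arrived intact. The following is a minimal sketch using only Python's standard `hashlib` module; the helper names `md5_of_file` and `matches_checksum` are hypothetical, and the local file path is whatever name the download was saved under.

```python
import hashlib


def md5_of_file(path, chunk_size=65536):
    """Return the hexadecimal MD5 digest of the file at `path`."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so large audio files are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file's MD5 digest with a listed value, with or without an 'md5:' prefix."""
    expected = expected.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected
```

A call such as `matches_checksum("blind_N10_mix.wav", "md5:...")` returns `True` only when the digest of the local file equals the listed value; which checksum belongs to which file has to be taken from the record itself.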