Published April 21, 2026 | Version 1.0
Preprint Open

PHE-Net: Envelope-Guided Speaker Extraction with Unlimited Speaker Scalability via WavLM-Based Discovery

Authors/Creators

Description

We present PHE-Net, a modular voice extraction system that separates individual speakers from single-channel mixtures of 2 to 20 simultaneous talkers. With oracle guidance, the system achieves +18.27 dB SI-SNRi and scales from N=2 to N=20 with zero degradation. In fully blind evaluation with no enrollment audio, it achieves +8.20 dB SI-SNRi at N=10 speakers. Through systematic ablation, we find that the spectral envelope channel alone determines extraction quality: speaker embeddings are provably ignored (cosine 0.50 = cosine 1.00), and F0 pitch contributes nothing when the envelope is sufficient (zero-F0 ceiling = +16.25 dB at N=10). This finding reduces the research problem to a single well-defined challenge: improving blind spectral envelope estimation from multi-speaker mixtures.
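For readers unfamiliar with the metric quoted in the description, the following is a minimal sketch of SI-SNR and SI-SNRi as they are commonly defined in the source separation literature. It is not taken from the PHE-Net code: the function names `si_snr` and `si_snr_improvement`, the `eps` stabilizer, and the synthetic noise signals in the demo are all assumptions made for illustration.

```python
import numpy as np


def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant signal-to-noise ratio in dB (common definition)."""
    # Remove the mean so a DC offset does not affect the score.
    estimate = estimate - estimate.mean()
    target = target - target.mean()
    # Project the estimate onto the target to find the scaled target component.
    scale = np.dot(estimate, target) / (np.dot(target, target) + eps)
    projection = scale * target
    # Everything not explained by the target counts as noise.
    noise = estimate - projection
    ratio = (np.dot(projection, projection) + eps) / (np.dot(noise, noise) + eps)
    return 10.0 * np.log10(ratio)


def si_snr_improvement(estimate, target, mixture, eps=1e-8):
    """SI-SNRi: gain of the estimate over the unprocessed mixture, in dB."""
    return si_snr(estimate, target, eps) - si_snr(mixture, target, eps)


if __name__ == "__main__":
    # Synthetic stand-ins for speech (assumption: white noise, 1 s at 16 kHz).
    rng = np.random.default_rng(0)
    target = rng.standard_normal(16000)
    interferer = rng.standard_normal(16000)
    mixture = target + interferer
    estimate = target + 0.1 * interferer  # a hypothetical extractor output
    print(f"SI-SNRi: {si_snr_improvement(estimate, target, mixture):.2f} dB")
```

Because the estimate is projected onto the target before the ratio is taken, rescaling the estimate leaves the score unchanged, which is why the metric is called scale-invariant.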

Files

blind_N10_mix.wav

Files (669.2 kB)

Checksum (MD5) | Size
md5:c0698eaca8aec782835cb7c44b0d8b4c | 96.0 kB
md5:46d3534b64f256824a09e12a6066c47c | 96.0 kB
md5:fcd490861db6d73798653ce0ef7e468f | 96.0 kB
md5:d8714c560b7342ce66678747bee0c58e | 160.0 kB
md5:d2802bb0ab178443b8437d3bf77a891c | 160.0 kB
md5:640f9407a9eddae06d62d8b492ba23fe | 60.5 kB
md5:7679b76dc20175d0f844237c98c961b4 | 465 Bytes
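
A downloaded file can be checked against the md5 values listed above. The following is a minimal sketch using only the Python standard library. The helper names `md5_of_file` and `matches_listed_checksum` are hypothetical, and the listing does not show which checksum belongs to which file name, so the caller must supply that pairing.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_checksum(path, listed):
    """Compare a local file with a listed value such as 'md5:<hex digest>'."""
    # Accept the value with or without the 'md5:' prefix used in the listing.
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected
```

For example, `matches_listed_checksum(local_path, "md5:7679b76dc20175d0f844237c98c961b4")` returns `True` only if the file at `local_path` (a placeholder for wherever the download was saved) hashes to the last value in the listing.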