Published February 26, 2022 | Version 0.1
Physical object Open

Sudo rm -rf pre-trained audio source separation models

  • 1. University of Illinois at Urbana-Champaign

Description

Efficient pre-trained models for 8 kHz two-speaker source separation (anechoic, and noisy with reverberation). The full code, alongside a basic description of the models' performance and computational requirements, is available in the github-codebase. You can git-clone the repo and download the pre-trained models under: sudo_rm_rf/pretrained_models

We have also prepared an easy-to-use example for the pre-trained Sudo rm -rf models in this python-notebook, so you can take all the models for a spin 🏎️. Simply normalize the input audio and infer!

```python
import torch

# Load a pretrained model (anechoic_model_p is the path to a downloaded checkpoint)
separation_model = torch.load(anechoic_model_p)

# Normalize the input mixture waveform with its own statistics
input_mix_std = input_mix.std(-1, keepdim=True)
input_mix_mean = input_mix.mean(-1, keepdim=True)
input_mix = (input_mix - input_mix_mean) / (input_mix_std + 1e-9)

# Apply the model
rec_sources_wavs = separation_model(input_mix.unsqueeze(1))

# Rescale the estimated sources with the mixture mean and standard deviation
rec_sources_wavs = (rec_sources_wavs * input_mix_std) + input_mix_mean
```
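The normalize-then-rescale pattern above can be checked in isolation. Here is a minimal sketch using NumPy arrays in place of the PyTorch tensors (the model call itself is omitted, and the shape and sample count are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
mix = rng.standard_normal((1, 32000))  # 4 s of audio at 8 kHz

# Normalize with the mixture's own statistics
std = mix.std(-1, keepdims=True)
mean = mix.mean(-1, keepdims=True)
normalized = (mix - mean) / (std + 1e-9)

# Rescaling with the same statistics inverts the normalization
restored = normalized * std + mean
print(np.allclose(restored, mix))  # True
```

The same mixture statistics must be kept around and applied to the model's output, since the separated sources are produced in the normalized domain.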

One of the main points that the Sudo rm -rf models bring forward is that focusing only on reconstruction fidelity, while ignoring all other computational metrics such as execution time and actual memory consumption, is an ideal way of wasting resources for an almost negligible performance improvement. To that end, we show that the Sudo rm -rf models provide a very effective alternative for a range of separation tasks, while also being respectful to users who do not have access to immense computational power and to researchers who prefer not to train their models for weeks on a multitude of GPUs.

Results on WSJ0-2mix   

Results on WHAMR!

Thus, Sudo rm -rf models are able to perform on par with the SOTA, and even surpass it in certain cases, with minimal computational overhead in terms of both time and memory. The importance of reporting all the above metrics when proposing a new model also becomes apparent. We conducted all experiments assuming an 8 kHz sampling rate and 4 seconds of input audio, on a server with an NVIDIA GeForce RTX 2080 Ti (11 GB) and a 12-core Intel(R) Core(TM) i7-5930K CPU @ 3.50GHz. OOM means out of memory for the corresponding configuration. A value of Z ex/sec corresponds to the throughput of each model: for each second that passes, the model is capable of processing (in either a forward or a backward pass) Z audio files of 32,000 samples each. The attention models, which undoubtedly provide the best performance in most cases, are extremely heavy in terms of actual time and memory consumption (even though their parameter counts appear rather small). They also become prohibitively expensive for longer sequences.
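The ex/sec throughput figure can be reproduced with simple wall-clock timing. Below is a minimal sketch (the helper name and the dummy workload are ours, not from the codebase; in practice `fn` would run the model's forward or backward pass on one batch of 32,000-sample inputs):

```python
import time

def measure_throughput(fn, n_batches, batch_size):
    """Return examples processed per second for a callable `fn`
    that handles one batch of `batch_size` examples per call."""
    start = time.perf_counter()
    for _ in range(n_batches):
        fn()
    elapsed = time.perf_counter() - start
    return n_batches * batch_size / elapsed

# Dummy workload standing in for a model pass on a batch of 4 audio files
ex_per_sec = measure_throughput(lambda: sum(range(10_000)),
                                n_batches=25, batch_size=4)
print(f"{ex_per_sec:.1f} ex/sec")
```

Averaging over several batches, as above, smooths out per-call timing noise; for GPU models one would additionally need to synchronize the device before reading the clock.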



Please cite as:
```BibTex
@inproceedings{tzinis2020sudo,
  title={Sudo rm-rf: Efficient networks for universal audio source separation},
  author={Tzinis, Efthymios and Wang, Zhepei and Smaragdis, Paris},
  booktitle={2020 IEEE 30th International Workshop on Machine Learning for Signal Processing (MLSP)},
  pages={1--6},
  year={2020},
  organization={IEEE}
}

@article{tzinis2022compute,
  title={Compute and Memory Efficient Universal Sound Source Separation},
  author={Tzinis, Efthymios and Wang, Zhepei and Jiang, Xilin and Smaragdis, Paris},
  journal={Journal of Signal Processing Systems},
  year={2022},
  volume={94},
  number={2},
  pages={245--259},
  publisher={Springer}
}
```


Files (248.6 MB)

  • 2.2 MB (md5:a8f681bc4156b72d251c91146ae24e96)
  • 25.7 MB (md5:e3207a0ab57f8ee563fc76d56e13f595)
  • 20.3 MB (md5:cfe3109624b4e0937d9dd88009de0559)
  • 93.5 MB (md5:93b9e3c7cbe4c73198d078a6d42c608b)
  • 107.0 MB (md5:b0fe546d4a42594afe99749ccdc89a17)

Additional details

References

  • E. Tzinis, Z. Wang, and P. Smaragdis, "Sudo rm -rf: Efficient networks for universal audio source separation," in 2020 IEEE 30th International Workshop on Machine Learning for Signal Processing (MLSP), 2020, pp. 1–6.
  • E. Tzinis, Z. Wang, X. Jiang, and P. Smaragdis, "Compute and memory efficient universal sound source separation," Journal of Signal Processing Systems, vol. 94, no. 2, pp. 245–259, 2022.