Sudo rm -rf pre-trained audio source separation models
Description
Efficient pre-trained models for 8 kHz 2-speaker source separation (anechoic, and noisy with reverberation). The full code, alongside a basic description of the models' performance and computational requirements, is available here: github-codebase. You can git clone the repo and download the pre-trained models under: sudo_rm_rf/pretrained_models
We have also prepared an easy-to-use example for the pre-trained Sudo rm -rf models here: python-notebook, so you can take all the models for a spin 🏎️. Simply normalize the input audio and infer!
```python
import torch

# Load a pre-trained model
# (anechoic_model_p: path to a downloaded checkpoint under sudo_rm_rf/pretrained_models)
separation_model = torch.load(anechoic_model_p)

# Normalize the input mixture waveform (input_mix: tensor of shape (batch, samples))
input_mix_std = input_mix.std(-1, keepdim=True)
input_mix_mean = input_mix.mean(-1, keepdim=True)
input_mix = (input_mix - input_mix_mean) / (input_mix_std + 1e-9)

# Apply the model
rec_sources_wavs = separation_model(input_mix.unsqueeze(1))

# Rescale the estimated sources with the mixture's mean and standard deviation
rec_sources_wavs = (rec_sources_wavs * input_mix_std) + input_mix_mean
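```
For a fuller picture, here is a minimal end-to-end sketch built around the snippet above. The file names, the torchaudio dependency, and the assumed output shape (batch, n_sources, samples) are illustrative assumptions, not part of the repo:
```python
import torch
import torchaudio

MODEL_SAMPLE_RATE = 8000  # the pre-trained models operate on 8 kHz audio

# Load a mixture from disk and resample it to the model's rate if needed.
mixture, sr = torchaudio.load("mixture.wav")  # hypothetical input file
if sr != MODEL_SAMPLE_RATE:
    mixture = torchaudio.functional.resample(mixture, sr, MODEL_SAMPLE_RATE)
mixture = mixture.mean(0, keepdim=True)  # downmix to mono: (1, samples)

# anechoic_model_p: path to a downloaded checkpoint, as in the snippet above.
separation_model = torch.load(anechoic_model_p)
separation_model.eval()

with torch.no_grad():
    mix_std = mixture.std(-1, keepdim=True)
    mix_mean = mixture.mean(-1, keepdim=True)
    normalized = (mixture - mix_mean) / (mix_std + 1e-9)
    est_sources = separation_model(normalized.unsqueeze(1))  # assumed (1, n_sources, samples)
    est_sources = est_sources * mix_std + mix_mean

# Write each estimated speaker to its own file.
for i, source in enumerate(est_sources[0]):
    torchaudio.save(f"speaker_{i}.wav", source.unsqueeze(0).cpu(), MODEL_SAMPLE_RATE)
```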
One of the main points that the Sudo rm -rf models bring forward is that focusing only on reconstruction fidelity, while ignoring all other computational metrics such as execution time and actual memory consumption, is an ideal way of wasting resources for an almost negligible performance improvement. To that end, we show that the Sudo rm -rf models provide a very effective alternative for a range of separation tasks, while also being respectful to users who do not have access to immense computational power, or to researchers who prefer not to train their models for weeks on a multitude of GPUs.
Thus, Sudo rm -rf models perform on par with the SOTA, and even surpass it in certain cases, with minimal computational overhead in terms of both time and memory. This also makes apparent the importance of reporting all of the above metrics when proposing a new model. We conducted all experiments assuming an 8 kHz sampling rate and 4 seconds of input audio, on a server with an NVIDIA GeForce RTX 2080 Ti (11 GB) and a 12-core Intel(R) Core(TM) i7-5930K CPU @ 3.50GHz. OOM means out of memory for the corresponding configuration. A value of Z ex/sec corresponds to the throughput of each model; in other words, for each second that passes, the model is capable of processing (in either a forward or a backward pass) Z audio files of 32,000 samples each. The attention models, which undoubtedly provide the best performance in most cases, are extremely heavy in terms of actual time and memory consumption (even if their number of parameters appears rather small). They also become prohibitively expensive for longer sequences.
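To make the throughput metric concrete, the following is a rough sketch of how a forward-pass ex/sec figure can be measured on 32,000-sample (4 s at 8 kHz) inputs. The function and its defaults are illustrative; this is not the exact benchmarking script behind the reported numbers:
```python
import time
import torch

def measure_throughput(model, batch_size=4, n_samples=32000, n_iters=20, device="cuda"):
    """Rough forward-pass throughput in examples/sec for 4 s of 8 kHz audio.

    Illustrative only: this shows how a 'Z ex/sec' figure can be obtained,
    not the exact benchmarking code used for the paper.
    """
    model = model.to(device).eval()
    dummy = torch.randn(batch_size, 1, n_samples, device=device)
    with torch.no_grad():
        for _ in range(3):  # warm-up so lazy CUDA initialization is not timed
            model(dummy)
        if device == "cuda":
            torch.cuda.synchronize()
        start = time.time()
        for _ in range(n_iters):
            model(dummy)
        if device == "cuda":
            torch.cuda.synchronize()
    elapsed = time.time() - start
    return batch_size * n_iters / elapsed  # examples processed per second
```
For example, `measure_throughput(separation_model)` after loading a checkpoint gives a rough ex/sec estimate on your own hardware, which will of course differ from the server configuration above.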
Please cite as:
```bibtex
@inproceedings{tzinis2020sudo,
  title={Sudo rm-rf: Efficient networks for universal audio source separation},
  author={Tzinis, Efthymios and Wang, Zhepei and Smaragdis, Paris},
  booktitle={2020 IEEE 30th International Workshop on Machine Learning for Signal Processing (MLSP)},
  pages={1--6},
  year={2020},
  organization={IEEE}
}

@article{tzinis2022compute,
  title={Compute and Memory Efficient Universal Sound Source Separation},
  author={Tzinis, Efthymios and Wang, Zhepei and Jiang, Xilin and Smaragdis, Paris},
  journal={Journal of Signal Processing Systems},
  year={2022},
  volume={94},
  number={2},
  pages={245--259},
  publisher={Springer}
}
```