Conditional Sound Effects Generation with Regularized WGAN
Description
In recent years, generative models built on deep neural networks have demonstrated an outstanding capacity for synthesizing high-quality, plausible human speech and music. The majority of research in neural audio synthesis (NAS) targets speech or music, whereas general sound effects such as environmental sounds or Foley sounds have received less attention. In this work, we study the generative performance of NAS models for sound effects using a conditional Wasserstein GAN (WGAN) model. We train our models conditioned on different classes of sound effects and report on their performance in terms of quality and diversity. Many existing GAN models operate on magnitude spectrograms, which require audio reconstruction via phase estimation after training. The often imperfect reconstruction of the audio signal led us to propose an additional audio reconstruction loss term for the generator. We show that this additional loss term considerably improves the quality of the generated audio, with only a small sacrifice in diversity. The results indicate that a conditional WGAN model trained on log-magnitude spectrograms, paired with an appropriately weighted reconstruction loss, is capable of synthesizing highly plausible sound effects.
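The generator objective described in the abstract, the standard WGAN generator term augmented with a weighted reconstruction loss on log-magnitude spectrograms, can be sketched as follows. This is a minimal illustration assuming an L1 reconstruction term and a hypothetical `recon_weight` hyperparameter; the paper's exact loss formulation and weighting may differ.

```python
import numpy as np

def generator_loss(critic_scores, real_logmag, fake_logmag, recon_weight=10.0):
    """Sketch of a combined generator objective (illustrative, not the
    paper's exact formulation):

      L_G = -E[critic(G(z))] + recon_weight * L1(real_logmag, fake_logmag)

    critic_scores : critic outputs on generated samples (higher = more real)
    real_logmag   : log-magnitude spectrograms of real audio
    fake_logmag   : log-magnitude spectrograms produced by the generator
    """
    # Standard WGAN generator term: maximize the critic score on fakes,
    # i.e. minimize its negation.
    wgan_term = -np.mean(critic_scores)
    # Hypothetical L1 reconstruction penalty on log-magnitude spectrograms.
    recon_term = np.mean(np.abs(real_logmag - fake_logmag))
    return wgan_term + recon_weight * recon_term
```

The `recon_weight` value controls the quality/diversity trade-off the abstract reports: a larger weight pushes generated spectrograms toward the reference audio (better reconstruction quality) at some cost to sample diversity.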
Files
34-40_Liu_et_al_SMC2023_proceedings.pdf (535.5 kB, md5:46e6d4d93b2cacbe6972520d34600bde)