Published June 14, 2023 | Version v1
Conference paper · Open Access

Conditional Sound Effects Generation with Regularized WGAN

  • University of Sydney

Description

Over recent years, generative models built on deep neural networks have demonstrated an outstanding capacity for synthesizing high-quality, plausible human speech and music. The majority of research in neural audio synthesis (NAS) targets speech or music, whereas general sound effects such as environmental sounds or Foley sounds have received less attention. In this work, we study the generative performance of NAS models for sound effects using a conditional Wasserstein GAN (WGAN). We train our models conditioned on different classes of sound effects and report on their performance in terms of quality and diversity. Many existing GAN models use magnitude spectrograms, which require audio reconstruction via phase estimation after training. The often imperfect reconstruction of the audio signal led us to propose an additional audio reconstruction loss term for the generator. We show that this additional loss term improves the quality of the generated audio considerably, with only a small sacrifice in diversity. The results indicate that a conditional WGAN model trained on log-magnitude spectrograms, paired with an appropriately weighted reconstruction loss, is capable of synthesizing highly plausible sound effects.
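The abstract describes a generator objective that combines the standard WGAN adversarial term with a weighted reconstruction loss on log-magnitude spectrograms. The following is a minimal sketch of such a combined loss; the L1 reconstruction distance and the weight `lam` are illustrative assumptions, since the abstract only states that the reconstruction loss is "appropriately weighted".

```python
import numpy as np

def generator_loss(critic_scores, real_spec, fake_spec, lam=10.0):
    """Combined generator loss: WGAN adversarial term plus a weighted
    spectrogram reconstruction term (a sketch, not the paper's exact form).

    critic_scores : critic outputs D(G(z)) for a batch of generated samples
    real_spec, fake_spec : log-magnitude spectrograms of matching shape
    lam : reconstruction weight (hypothetical value)
    """
    adv = -np.mean(critic_scores)                 # standard WGAN generator term
    rec = np.mean(np.abs(real_spec - fake_spec))  # L1 reconstruction (assumed)
    return adv + lam * rec

# Toy usage on random arrays standing in for a batch of spectrograms
rng = np.random.default_rng(0)
scores = rng.normal(size=8)
real = rng.normal(size=(8, 128, 64))
fake = real + 0.1 * rng.normal(size=(8, 128, 64))
loss = generator_loss(scores, real, fake)
print(loss)
```

Raising `lam` pushes the generator toward spectrograms that are easier to invert faithfully, at the cost of some sample diversity, which matches the quality/diversity trade-off reported in the abstract.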

Files

34-40_Liu_et_al_SMC2023_proceedings.pdf

535.5 kB (md5:46e6d4d93b2cacbe6972520d34600bde)