Published June 14, 2023 | Version v1
Conference paper · Open Access

Conditional Sound Effects Generation with Regularized WGAN

  • University of Sydney

Description

Over recent years, generative models built on deep neural networks have demonstrated an outstanding capacity for synthesizing high-quality, plausible human speech and music. The majority of research in neural audio synthesis (NAS) targets speech or music, whereas general sound effects such as environmental sounds or Foley sounds have received less attention. In this work, we study the generative performance of NAS models for sound effects using a conditional Wasserstein GAN (WGAN). We train our models conditioned on different classes of sound effects and report on their performance in terms of quality and diversity. Many existing GAN models use magnitude spectrograms, which require audio reconstruction via phase estimation after training. The often imperfect reconstruction of the audio signal led us to propose an additional audio reconstruction loss term for the generator. We show that this additional loss term improves the quality of the generated audio considerably, with only a small sacrifice in diversity. The results indicate that a conditional WGAN model trained on log-magnitude spectrograms, paired with an appropriately weighted reconstruction loss, is capable of synthesizing highly plausible sound effects.
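The abstract describes a generator objective that combines the standard WGAN adversarial term with a weighted reconstruction loss on log-magnitude spectrograms. The following is a minimal sketch of such a combined loss; the L1 reconstruction distance and the weight `lam` are illustrative assumptions, since the abstract only states that the reconstruction loss is "appropriately weighted".

```python
import numpy as np

def generator_loss(critic_scores, real_spec, fake_spec, lam=10.0):
    """Combined generator loss: WGAN adversarial term plus a weighted
    spectrogram reconstruction term (a sketch, not the paper's exact form).

    critic_scores : critic outputs D(G(z)) for a batch of generated samples
    real_spec, fake_spec : log-magnitude spectrograms of matching shape
    lam : reconstruction weight (hypothetical value)
    """
    adv = -np.mean(critic_scores)                 # standard WGAN generator term
    rec = np.mean(np.abs(real_spec - fake_spec))  # L1 reconstruction (assumed)
    return adv + lam * rec

# Toy usage on random arrays standing in for a batch of spectrograms
rng = np.random.default_rng(0)
scores = rng.normal(size=8)
real = rng.normal(size=(8, 128, 64))
fake = real + 0.1 * rng.normal(size=(8, 128, 64))
loss = generator_loss(scores, real, fake)
print(loss)
```

Raising `lam` pushes the generator toward spectrograms that are easier to invert faithfully, at the cost of some sample diversity, which matches the quality/diversity trade-off reported in the abstract.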

Files

34-40_Liu_et_al_SMC2023_proceedings.pdf

535.5 kB (md5:46e6d4d93b2cacbe6972520d34600bde)