Published August 26, 2021 | Version 1.2.0
Dataset | Open Access

WaveFake: A data set to facilitate audio DeepFake detection

  • Ruhr University Bochum

Description

The main purpose of this data set is to facilitate research into audio DeepFakes. Such generated media files are increasingly used for impersonation attempts and online harassment, and we hope this work helps in finding new detection methods to prevent such attacks. You can find the accompanying code repository on GitHub.

The data set consists of 104,885 generated audio clips (16-bit PCM wav). We examine multiple networks trained on two reference data sets. First, the LJSpeech data set, consisting of 13,100 short audio clips (on average 6 seconds each; roughly 24 hours total) read by a female speaker. It features passages from 7 non-fiction books, and the audio was recorded on a MacBook Pro microphone. Second, we include samples based on the JSUT data set, specifically its basic5000 corpus. This corpus consists of 5,000 sentences covering all basic kanji of the Japanese language (4.8 seconds on average; roughly 6.7 hours total); the recordings were performed by a female native Japanese speaker in an anechoic room. Finally, we include samples from a full text-to-speech pipeline (16,283 phrases; 3.8 seconds on average; roughly 17.5 hours total). In total, our data set consists of approximately 175 hours of generated audio files. Note that we do not redistribute the reference data.
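
To give a sense of the file format, the following is a minimal Python sketch for inspecting a single generated clip. The file path is a placeholder, not a guaranteed name from the archive; substitute any file after unpacking.

    import soundfile as sf

    # Placeholder path -- substitute any generated clip from the archive.
    audio, sample_rate = sf.read("ljspeech_melgan/LJ001-0001_gen.wav")
    print(f"{len(audio) / sample_rate:.2f} s at {sample_rate} Hz")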

We included a range of neural vocoder architectures in our data set, among them MelGAN, Parallel WaveGAN, Multi-band MelGAN, and WaveGlow (see the references below).

Additionally, we examined a larger version of MelGAN and include samples from a full TTS pipeline consisting of a conformer and a Parallel WaveGAN model.
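
As an illustration of such a conformer + Parallel WaveGAN pipeline, the sketch below uses ESPnet2's text-to-speech inference API. The model and vocoder tags are assumptions chosen for illustration, not necessarily the exact checkpoints used to build this data set; the actual setup is documented in the accompanying repository.

    # Minimal ESPnet2 sketch: conformer acoustic model + Parallel WaveGAN vocoder.
    # The model/vocoder tags below are assumptions for illustration only.
    import soundfile as sf
    from espnet2.bin.tts_inference import Text2Speech

    tts = Text2Speech.from_pretrained(
        model_tag="kan-bayashi/ljspeech_conformer_fastspeech2",       # assumed tag
        vocoder_tag="parallel_wavegan/ljspeech_parallel_wavegan.v1",  # assumed tag
    )
    out = tts("The birch canoe slid on the smooth planks.")
    sf.write("generated.wav", out["wav"].numpy(), tts.fs, "PCM_16")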

Collection Process

For WaveGlow, we utilize the official implementation (commit 8afb643) in conjunction with the official pre-trained network available on PyTorch Hub. For the remaining networks, we use a popular implementation available on GitHub (commit 12c677e), which also offers pre-trained models. We used these pre-trained networks to generate samples that are similar to their respective training distributions, LJSpeech and JSUT. When sampling the data set, we first extract Mel spectrograms from the original audio files, using the pre-processing scripts of the corresponding repositories. We then feed these Mel spectrograms to the respective models to obtain the data set. For sampling the full TTS results, we use the ESPnet project. To make sure the generated phrases do not overlap with the training set, we downloaded the Common Voice data set and extracted 16,285 phrases from it.
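
The following is a minimal sketch of this copy-synthesis step: a mel spectrogram is extracted from an original clip and fed to a vocoder network. The hyper-parameters, the input file name, and the untrained DummyVocoder stand-in are assumptions for illustration; the actual pre-processing configurations and pre-trained checkpoints come from the repositories referenced above.

    # Sketch of the mel-spectrogram -> vocoder resynthesis step.
    # Hyper-parameters, file name, and DummyVocoder are assumptions;
    # the real pipeline uses the repositories' pre-processing scripts
    # and pre-trained checkpoints.
    import librosa
    import numpy as np
    import torch

    SAMPLE_RATE = 22050                          # LJSpeech is distributed at 22.05 kHz
    N_FFT, HOP_LENGTH, N_MELS = 1024, 256, 80    # typical vocoder settings (assumed)

    def extract_mel(path: str) -> torch.Tensor:
        """Compute a log-mel spectrogram shaped (1, n_mels, frames)."""
        wav, _ = librosa.load(path, sr=SAMPLE_RATE)
        mel = librosa.feature.melspectrogram(
            y=wav, sr=SAMPLE_RATE, n_fft=N_FFT, hop_length=HOP_LENGTH, n_mels=N_MELS
        )
        log_mel = np.log(np.clip(mel, 1e-5, None))
        return torch.from_numpy(log_mel).unsqueeze(0).float()

    class DummyVocoder(torch.nn.Module):
        """Untrained stand-in for a pre-trained vocoder (MelGAN, Parallel WaveGAN, ...):
        maps mel frames back to a waveform with one transposed convolution."""
        def __init__(self, n_mels: int = N_MELS, hop_length: int = HOP_LENGTH):
            super().__init__()
            self.upsample = torch.nn.ConvTranspose1d(
                n_mels, 1, kernel_size=2 * hop_length,
                stride=hop_length, padding=hop_length // 2,
            )
        def forward(self, mel: torch.Tensor) -> torch.Tensor:
            return torch.tanh(self.upsample(mel))

    vocoder = DummyVocoder().eval()   # in practice: load a pre-trained checkpoint
    with torch.no_grad():
        fake = vocoder(extract_mel("LJ001-0001.wav"))   # placeholder input clip
    print(fake.shape)                 # (1, 1, samples)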

This data set is licensed under a CC BY-SA 4.0 license.

This work was supported by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany's Excellence Strategy -- EXC 2092 CASA -- 390781972.

Files

datasheet.pdf

Files (28.9 GB)

MD5                               Size
489be6c3ce07327397c7f6c2f9a99502  167.5 kB
76b3e62d69f866e57ad6b1debaff434b  28.9 GB
9cc9e1ad97513505bfb75fc148a70005  14.0 kB

Additional details

References

  • Kumar, Kundan, et al. "MelGAN: Generative adversarial networks for conditional waveform synthesis." arXiv preprint arXiv:1910.06711 (2019).
  • Yamamoto, Ryuichi, Eunwoo Song, and Jae-Min Kim. "Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram." ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020.
  • Yang, Geng, et al. "Multi-band MelGAN: Faster waveform generation for high-quality text-to-speech." 2021 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2021.
  • Prenger, Ryan, Rafael Valle, and Bryan Catanzaro. "WaveGlow: A flow-based generative network for speech synthesis." ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019.
  • Sonobe, Ryosuke, Shinnosuke Takamichi, and Hiroshi Saruwatari. "JSUT corpus: free large-scale Japanese speech corpus for end-to-end speech synthesis." arXiv preprint arXiv:1711.00354 (2017).