Published August 26, 2021 | Version 1.2.0
Dataset | Open Access

WaveFake: A data set to facilitate audio DeepFake detection

  • Ruhr University Bochum

Description

The main purpose of this data set is to facilitate research into audio DeepFakes. Such generated media files are increasingly used for impersonation attempts and online harassment, and we hope this work helps in finding new detection methods to prevent such attacks. You can find the accompanying code repository on GitHub.

The data set consists of 104,885 generated audio clips (16-bit PCM wav). We examine multiple networks trained on two reference data sets. First, the LJSpeech data set, consisting of 13,100 short audio clips (on average 6 seconds each; roughly 24 hours total) read by a female speaker. It features passages from 7 non-fiction books, and the audio was recorded on a MacBook Pro microphone. Second, we include samples based on the JSUT data set, specifically its basic5000 corpus. This corpus consists of 5,000 sentences covering all basic kanji of the Japanese language (4.8 seconds on average; roughly 6.7 hours total); the recordings were performed by a female native Japanese speaker in an anechoic room. Finally, we include samples from a full text-to-speech pipeline (16,283 phrases; 3.8 seconds on average; roughly 17.5 hours total). In total, our data set consists of approximately 175 hours of generated audio files. Note that we do not redistribute the reference data.
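
To give a sense of the file format, the following is a minimal Python sketch for inspecting a single generated clip. The file path is a placeholder, not a guaranteed name from the archive; substitute any file after unpacking.

    import soundfile as sf

    # Placeholder path -- substitute any generated clip from the archive.
    audio, sample_rate = sf.read("ljspeech_melgan/LJ001-0001_gen.wav")
    print(f"{len(audio) / sample_rate:.2f} s at {sample_rate} Hz")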

We included a range of neural vocoder architectures in our data set, among them MelGAN, Parallel WaveGAN, Multi-band MelGAN, and WaveGlow (see the references below).

Additionally, we examined a larger version of MelGAN and include samples from a full TTS pipeline consisting of a conformer and a Parallel WaveGAN model.
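
As an illustration of such a conformer + Parallel WaveGAN pipeline, the sketch below uses ESPnet2's text-to-speech inference API. The model and vocoder tags are assumptions chosen for illustration, not necessarily the exact checkpoints used to build this data set; the actual setup is documented in the accompanying repository.

    # Minimal ESPnet2 sketch: conformer acoustic model + Parallel WaveGAN vocoder.
    # The model/vocoder tags below are assumptions for illustration only.
    import soundfile as sf
    from espnet2.bin.tts_inference import Text2Speech

    tts = Text2Speech.from_pretrained(
        model_tag="kan-bayashi/ljspeech_conformer_fastspeech2",       # assumed tag
        vocoder_tag="parallel_wavegan/ljspeech_parallel_wavegan.v1",  # assumed tag
    )
    out = tts("The birch canoe slid on the smooth planks.")
    sf.write("generated.wav", out["wav"].numpy(), tts.fs, "PCM_16")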

Collection Process

For WaveGlow, we utilize the official implementation (commit 8afb643) in conjunction with the official pre-trained network available on PyTorch Hub. For the remaining networks, we use a popular implementation available on GitHub (commit 12c677e), which also offers pre-trained models. We used these pre-trained networks to generate samples that are similar to their respective training distributions, LJSpeech and JSUT. When sampling the data set, we first extract Mel spectrograms from the original audio files, using the pre-processing scripts of the corresponding repositories. We then feed these Mel spectrograms to the respective models to obtain the data set. For sampling the full TTS results, we use the ESPnet project. To make sure the generated phrases do not overlap with the training set, we downloaded the Common Voice data set and extracted 16,285 phrases from it.
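
The following is a minimal sketch of this copy-synthesis step: a mel spectrogram is extracted from an original clip and fed to a vocoder network. The hyper-parameters, the input file name, and the untrained DummyVocoder stand-in are assumptions for illustration; the actual pre-processing configurations and pre-trained checkpoints come from the repositories referenced above.

    # Sketch of the mel-spectrogram -> vocoder resynthesis step.
    # Hyper-parameters, file name, and DummyVocoder are assumptions;
    # the real pipeline uses the repositories' pre-processing scripts
    # and pre-trained checkpoints.
    import librosa
    import numpy as np
    import torch

    SAMPLE_RATE = 22050                          # LJSpeech is distributed at 22.05 kHz
    N_FFT, HOP_LENGTH, N_MELS = 1024, 256, 80    # typical vocoder settings (assumed)

    def extract_mel(path: str) -> torch.Tensor:
        """Compute a log-mel spectrogram shaped (1, n_mels, frames)."""
        wav, _ = librosa.load(path, sr=SAMPLE_RATE)
        mel = librosa.feature.melspectrogram(
            y=wav, sr=SAMPLE_RATE, n_fft=N_FFT, hop_length=HOP_LENGTH, n_mels=N_MELS
        )
        log_mel = np.log(np.clip(mel, 1e-5, None))
        return torch.from_numpy(log_mel).unsqueeze(0).float()

    class DummyVocoder(torch.nn.Module):
        """Untrained stand-in for a pre-trained vocoder (MelGAN, Parallel WaveGAN, ...):
        maps mel frames back to a waveform with one transposed convolution."""
        def __init__(self, n_mels: int = N_MELS, hop_length: int = HOP_LENGTH):
            super().__init__()
            self.upsample = torch.nn.ConvTranspose1d(
                n_mels, 1, kernel_size=2 * hop_length,
                stride=hop_length, padding=hop_length // 2,
            )
        def forward(self, mel: torch.Tensor) -> torch.Tensor:
            return torch.tanh(self.upsample(mel))

    vocoder = DummyVocoder().eval()   # in practice: load a pre-trained checkpoint
    with torch.no_grad():
        fake = vocoder(extract_mel("LJ001-0001.wav"))   # placeholder input clip
    print(fake.shape)                 # (1, 1, samples)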

This data set is licensed under a CC BY-SA 4.0 license.

This work was supported by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany's Excellence Strategy -- EXC 2092 CASA -- 390781972.

Files

datasheet.pdf

Files (28.9 GB)

MD5                               Size
489be6c3ce07327397c7f6c2f9a99502  167.5 kB
76b3e62d69f866e57ad6b1debaff434b  28.9 GB
9cc9e1ad97513505bfb75fc148a70005  14.0 kB

Additional details

References

  • Kumar, Kundan, et al. "MelGAN: Generative adversarial networks for conditional waveform synthesis." arXiv preprint arXiv:1910.06711 (2019).
  • Yamamoto, Ryuichi, Eunwoo Song, and Jae-Min Kim. "Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram." ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020.
  • Yang, Geng, et al. "Multi-band MelGAN: Faster waveform generation for high-quality text-to-speech." 2021 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2021.
  • Prenger, Ryan, Rafael Valle, and Bryan Catanzaro. "WaveGlow: A flow-based generative network for speech synthesis." ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019.
  • Sonobe, Ryosuke, Shinnosuke Takamichi, and Hiroshi Saruwatari. "JSUT corpus: free large-scale Japanese speech corpus for end-to-end speech synthesis." arXiv preprint arXiv:1711.00354 (2017).