Published June 9, 2022 | Version v4
Dataset Open

CFAD: A Chinese Dataset for Fake Audio Detection

  • Institute of Automation, Chinese Academy of Sciences

Description

Fake audio detection is a growing concern, and several relevant datasets have been designed for research. However, there is no standard public Chinese dataset that covers complex, real-world conditions.

In this paper, we aim to fill this gap and design a Chinese fake audio detection dataset (CFAD) for studying more generalized detection methods. Twelve mainstream speech-generation techniques are used to generate the fake audio. To simulate real-life scenarios, three noise datasets are selected for noise addition at five different signal-to-noise ratios, and six codecs are considered for audio transcoding. The CFAD dataset can be used not only for fake audio detection but also for identifying the algorithms behind fake utterances in audio forensics. Baseline results are presented with analysis. The results show that fake audio detection with good generalization remains challenging. The CFAD dataset is publicly available on GitHub: https://github.com/ADDchallenge/CFAD

 

The CFAD dataset considers 12 types of fake audio, 11 of which are generated by different speech synthesis techniques; the remaining one is a partially fake type. Partially fake audio is completely different from synthesized speech and can therefore better evaluate the generalization of a detection model to unknown types. The real audio is collected from 6 different corpora to increase the diversity of the real-category distribution, which makes models less prone to overfitting to the artifacts of a single database. For robustness evaluation, we additionally simulate background noise and media codecs that may occur in real life and provide detailed labels, including fake type, real source, noise type, signal-to-noise ratio (SNR), and media codec. Overall, the CFAD dataset consists of three different versions, named the clean, noisy, and codec versions.
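
For illustration only, the per-utterance labels described above could be organized as a record like the following Python sketch; the field names and types here are hypothetical and do not reflect the official CFAD metadata format.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class CfadLabel:
    """Hypothetical per-utterance label record mirroring the fields described above."""
    utterance_id: str
    is_fake: bool               # binary detection label: real vs. fake
    fake_type: Optional[str]    # one of the 12 fake types; None for real audio
    real_source: Optional[str]  # one of the 6 real corpora; None for fake audio
    noise_type: Optional[str]   # noise database used (noisy version only)
    snr_db: Optional[int]       # 0, 5, 10, 15, or 20 (noisy version only)
    codec: Optional[str]        # e.g. "mp3" or "aac" (codec version only)
```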

Each version of the dataset is divided into disjoint training, development, and test sets in the same way, with no speaker overlap across the three subsets. Each test set is further divided into seen and unseen test sets. The unseen test set evaluates the generalization of detection methods to unknown types; notably, both the real audio and the fake audio in the unseen test set are unknown to the model. A sketch of such a speaker-disjoint split is given below.
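
The following is a generic illustration of a speaker-disjoint split, not the actual CFAD partitioning script; it assumes each utterance carries a speaker ID, and the ratios are placeholders.

```python
import random
from collections import defaultdict

def split_by_speaker(utterances, ratios=(0.7, 0.1, 0.2), seed=0):
    """Partition (utterance_id, speaker_id) pairs into train/dev/test
    with no speaker overlap. Ratios are illustrative placeholders."""
    by_speaker = defaultdict(list)
    for utt_id, spk_id in utterances:
        by_speaker[spk_id].append(utt_id)

    speakers = list(by_speaker)
    random.Random(seed).shuffle(speakers)

    n_train = int(ratios[0] * len(speakers))
    n_dev = int(ratios[1] * len(speakers))
    groups = {
        "train": speakers[:n_train],
        "dev": speakers[n_train:n_train + n_dev],
        "test": speakers[n_train + n_dev:],
    }
    return {name: [u for spk in spks for u in by_speaker[spk]]
            for name, spks in groups.items()}
```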

For the noisy speech part, we select three noise databases for simulation. Additive noise is added to each audio clip in the clean dataset at 5 different SNRs. The additive noise for the unseen test set and that for the remaining subsets come from different noise databases.

For the codec speech part, we select six different codecs. Two of them are reserved for the unseen test set.

In each version (clean, noisy, and codec) of the CFAD dataset, there are 138,400 utterances in the training set, 14,400 in the development set, 42,000 in the seen test set, and 21,000 in the unseen test set.

 

Clean Real Audio Collection

To eliminate the interference of irrelevant factors, we collect clean real audio from two sources: 5 open resources from the OpenSLR platform (http://www.openslr.org/12/) and one self-recorded dataset.

 

Clean Fake Audio Generation

We select 11 representative speech synthesis methods to generate the fake audio, plus one partially fake audio type.

 

Noisy Audio Simulation

The noisy audio aims to quantify the robustness of detection methods under noisy conditions. To simulate real-life scenarios, we sample noise signals and add them to the clean audio at 5 different SNRs: 0 dB, 5 dB, 10 dB, 15 dB, and 20 dB. The additive noise is selected from three noise databases: PNL 100 Nonspeech Sounds, NOISEX-92, and TAU Urban Acoustic Scenes.
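
As a rough illustration of how additive noise at a target SNR can be simulated, a minimal sketch is given below. It is a generic mixing routine, not the exact CFAD simulation script: the noise is length-matched to the speech and scaled so that the speech-to-noise power ratio equals the requested SNR.

```python
import numpy as np

def add_noise_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix a noise signal into clean speech at the requested SNR (in dB)."""
    # Tile or truncate the noise so it matches the speech length.
    if len(noise) < len(speech):
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[:len(speech)]

    # Average power of speech and noise (small epsilon avoids division by zero).
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12

    # Scale the noise so that 10*log10(p_speech / p_scaled_noise) == snr_db.
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise

# Example: mix one clean waveform at each SNR used in the noisy version.
# for snr in (0, 5, 10, 15, 20):
#     noisy = add_noise_at_snr(clean_waveform, noise_waveform, snr)
```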

 

Audio Transcoding

The codec version aims to quantify the robustness of the methods under different format conversions. We select a total of six codecs. For the training, development, and seen test sets of the codec version, mp3, flac, ogg, and m4a are used. For the unseen test set, aac and wma are used.

The audio transcoding operation is applied to the audio of the clean version. Each clean audio clip is randomly transcoded with one of the candidate codecs and then converted back to a WAV file using the ffmpeg toolkit.
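
A minimal sketch of this transcode-and-restore step with the ffmpeg command line is shown below. The file paths, the seen-codec list, and the codec-to-container mapping are illustrative assumptions rather than the exact CFAD pipeline.

```python
import os
import random
import subprocess
import tempfile

# Codecs used for the training/development/seen test sets, as described above;
# the file extensions are the containers ffmpeg infers from the output name.
SEEN_CODECS = ["mp3", "flac", "ogg", "m4a"]

def transcode_and_restore(wav_in: str, wav_out: str, codecs=SEEN_CODECS, seed=None):
    """Randomly pick a codec, transcode the clean WAV to it, then decode back to WAV."""
    ext = random.Random(seed).choice(codecs)
    with tempfile.TemporaryDirectory() as tmp:
        coded = os.path.join(tmp, f"coded.{ext}")
        # Encode the clean audio into the chosen compressed format...
        subprocess.run(["ffmpeg", "-y", "-i", wav_in, coded], check=True)
        # ...then decode it back to WAV so every version shares the same container.
        subprocess.run(["ffmpeg", "-y", "-i", coded, wav_out], check=True)
```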

 

 

This dataset is licensed under a CC BY-NC-ND 4.0 license.


Files (35.6 GB)

CFAD.zip

  • 10.7 GB, md5:49a484ffa0c15f1b4138085a2673c708
  • 10.7 GB, md5:ce3cae04256eef4d66c5d4a5be7b4c68
  • 10.7 GB, md5:3a6f6211413e92b4d71d0f2d8adebbcf
  • 3.4 GB, md5:31b15fc55e68b81d31661b3c56bbc157

Additional details

References

  • Jiangyan Yi, Ye Bai, Jianhua Tao, Zhengkun Tian, Chenglong Wang, Tao Wang, and Ruibo Fu. Half-truth: A partially fake audio detection dataset. Interspeech 2021.
  • Hui Bu, Jiayu Du, Xingyu Na, Bengu Wu, and Hao Zheng. AISHELL-1: An open-source Mandarin speech corpus and a speech recognition baseline. In 2017 20th Conference of the Oriental Chapter of the International Coordinating Committee on Speech Databases and Speech I/O Systems and Assessment (O-COCOSDA), pages 1–5. IEEE, 2017.
  • Yao Shi, Hui Bu, Xin Xu, Shaoji Zhang, and Ming Li. AISHELL-3: A multi-speaker Mandarin TTS corpus and the baselines. 2020.
  • Dong Wang, Xuewei Zhang, and Zhiyong Zhang. THCHS-30: A free Chinese speech corpus. 2015.
  • Zehui Yang, Yifan Chen, Lei Luo, Runyan Yang, Lingxuan Ye, Gaofeng Cheng, Ji Xu, Yaohui Jin, Qingqing Zhang, Pengyuan Zhang, et al. Open Source MagicData-RAMC: A rich annotated Mandarin conversational (RAMC) speech dataset. arXiv preprint arXiv:2203.16844, 2022.
  • Guoning Hu and DeLiang Wang. A tandem algorithm for pitch estimation and voiced speech segregation. IEEE Transactions on Audio, Speech, and Language Processing, 18(8):2067–2079, 2010.
  • Andrew Varga and Herman JM Steeneken. Assessment for automatic speech recognition: II. NOISEX-92: A database and an experiment to study the effect of additive noise on speech recognition systems. Speech Communication, 12(3):247–251, 1993.
  • Annamaria Mesaros, Toni Heittola, and Tuomas Virtanen. A multi-device dataset for urban acoustic scene classification. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2018 Workshop (DCASE2018), pages 9–13, November 2018.