Codecfake dataset

Xie, Yuankun

doi:10.5281/zenodo.11169781

Published May 10, 2024 | Version v1

Dataset Open

Codecfake dataset - test set (part 1 of 2)

This dataset is the test set (part 1 of 2) of the Codecfake dataset, corresponding to the manuscript "The Codecfake Dataset and Countermeasures for Universal Deepfake Audio Detection".

Abstract

With the proliferation of Audio Language Model (ALM) based deepfake audio, there is an urgent need for effective detection methods. Unlike traditional deepfake audio generation, which often involves multi-step processes culminating in vocoder usage, ALMs directly use neural codec methods to decode discrete codes into audio. Moreover, driven by large-scale data, ALMs exhibit remarkable robustness and versatility, posing a significant challenge to current audio deepfake detection (ADD) models. To effectively detect ALM-based deepfake audio, we focus on the mechanism of the ALM-based audio generation method, the conversion from neural codec to waveform. We initially construct the Codecfake dataset, an open-source large-scale dataset, including two languages, millions of audio samples, and various test conditions, tailored for ALM-based audio detection. Additionally, to achieve universal detection of deepfake audio and tackle the domain ascent bias issue of the original SAM, we propose the CSAM strategy to learn a domain-balanced and generalized minima. Experimental results demonstrate that co-training on the Codecfake dataset and a vocoded dataset with the CSAM strategy yields the lowest average Equal Error Rate (EER) of 0.616% across all test conditions compared to baseline models.
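For readers unfamiliar with the metric, the EER is the error rate at the decision threshold where the false acceptance rate (spoofed audio accepted as genuine) equals the false rejection rate (genuine audio rejected). The following is a minimal sketch of that computation, not the authors' evaluation code: the function name `compute_eer` is hypothetical, and it assumes that higher scores mean "more likely bona fide".

```python
def compute_eer(bonafide_scores, spoof_scores):
    """Return (eer, threshold) for two lists of detector scores.

    Assumption: a higher score means the sample is more likely bona fide.
    The threshold is picked from the observed scores, at the point where
    the false acceptance and false rejection rates are closest.
    """
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best = None
    for t in thresholds:
        # False rejection: a bona fide sample scored below the threshold.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        # False acceptance: a spoofed sample scored at or above the threshold.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]


# Toy scores, invented purely for illustration.
eer, threshold = compute_eer([0.9, 0.8, 0.7, 0.4], [0.1, 0.2, 0.3, 0.6])
print(f"EER = {eer:.2%} at threshold {threshold}")
```

With these toy scores, one bona fide sample and one spoofed sample fall on the wrong side of the chosen threshold, so both error rates are 25%.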

Due to platform restrictions on the size of Zenodo repositories, we have divided the Codecfake dataset into several subsets, as shown in the table below:

Codecfake dataset	description	link
training set (part 1 of 3) & label	train_split.zip & train_split.z01 - train_split.z06	https://zenodo.org/records/11171708
training set (part 2 of 3)	train_split.z07 - train_split.z14	https://zenodo.org/records/11171720
training set (part 3 of 3)	train_split.z15 - train_split.z19	https://zenodo.org/records/11171724
development set	dev_split.zip & dev_split.z01 - dev_split.z02	https://zenodo.org/records/11169872
test set (part 1 of 2)	Codec test: C1.zip - C6.zip & ALM test: A1.zip - A3.zip	https://zenodo.org/records/11169781
test set (part 2 of 2)	Codec unseen test: C7.zip	https://zenodo.org/records/11125029
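The training and development sets are distributed as split ZIP archives, so every part has to be downloaded into the same directory before extraction with a tool that supports split ZIP archives. The sketch below checks that no part is missing. The helper names `expected_parts` and `missing_parts` are hypothetical, and the part counts are read off the table above.

```python
from pathlib import Path

# Number of .zNN parts per split archive, taken from the table above.
SPLIT_PART_COUNTS = {"train_split": 19, "dev_split": 2}


def expected_parts(stem, n_parts):
    """List the file names of a split archive: stem.z01 ... stem.zNN, then stem.zip."""
    parts = [f"{stem}.z{i:02d}" for i in range(1, n_parts + 1)]
    parts.append(f"{stem}.zip")
    return parts


def missing_parts(directory, stem, n_parts):
    """Return the expected part names that are not present in `directory`."""
    directory = Path(directory)
    return [
        name
        for name in expected_parts(stem, n_parts)
        if not (directory / name).is_file()
    ]


if __name__ == "__main__":
    for stem, n_parts in SPLIT_PART_COUNTS.items():
        # "." is a placeholder for wherever the parts were downloaded.
        print(stem, "missing:", missing_parts(".", stem, n_parts))
```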

Countermeasure

The source code of the countermeasure and the pre-trained model are available on GitHub: https://github.com/xieyuankun/Codecfake

The Codecfake dataset and pre-trained model are licensed under the CC BY-NC-ND 4.0 license.

Files

Files (39.0 GB)

Name	MD5	Size
A1.zip	f8b4f02ac6be1e3dfd0498e6499f5a4f	1.6 GB
A2.zip	42cab78703e1acb5bd85ab0a922882c5	2.1 GB
A3.zip	066d37f54c16e6acdca1bc5d12795941	7.4 GB
C1.zip	978bdf4344c0ecc7c5a938ded175114c	4.0 GB
C2.zip	cac60db109961ca4746a5da91d495db6	4.2 GB
C3.zip	fb3b4f82a0dff14e2f67d74dc5a21865	4.3 GB
C4.zip	791ac07810facf218c3b90d8fc0463a7	4.7 GB
C5.zip	2f5c205cfa43b2105d63ddc16ddfa79d	6.1 GB
C6.zip	e09e0279f5e44b2396914651dd2ef4a5	4.7 GB
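Because the archives are several gigabytes each, it is worth checking a download against the MD5 checksums listed above before extracting. The following is a minimal sketch using only the Python standard library. The helper names `md5_of_file` and `verify_downloads` are hypothetical, and the directory passed in is assumed to be wherever the files were saved.

```python
import hashlib
from pathlib import Path

# MD5 checksums copied from the file listing above.
EXPECTED_MD5 = {
    "A1.zip": "f8b4f02ac6be1e3dfd0498e6499f5a4f",
    "A2.zip": "42cab78703e1acb5bd85ab0a922882c5",
    "A3.zip": "066d37f54c16e6acdca1bc5d12795941",
    "C1.zip": "978bdf4344c0ecc7c5a938ded175114c",
    "C2.zip": "cac60db109961ca4746a5da91d495db6",
    "C3.zip": "fb3b4f82a0dff14e2f67d74dc5a21865",
    "C4.zip": "791ac07810facf218c3b90d8fc0463a7",
    "C5.zip": "2f5c205cfa43b2105d63ddc16ddfa79d",
    "C6.zip": "e09e0279f5e44b2396914651dd2ef4a5",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so large archives are not loaded into memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory, expected=EXPECTED_MD5):
    """Map each expected file name to True if it exists and its MD5 matches."""
    results = {}
    for name, md5 in expected.items():
        path = Path(directory) / name
        results[name] = path.is_file() and md5_of_file(path) == md5
    return results


if __name__ == "__main__":
    # "." is a placeholder for the download directory.
    for name, ok in verify_downloads(".").items():
        print(f"{name}: {'OK' if ok else 'missing or checksum mismatch'}")
```

A file reported as missing or mismatched should be downloaded again before extraction.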

	All versions	This version
Views	1,265	518
Downloads	3,961	1,336
Data volume	59.6 TB	13.9 TB
