# CodecDeepfakeDetection (CDD) - Dataset README

```
+-- asvspoof5
|   +-- flac_D
|   +-- flac_E
|   `-- flac_T
+-- CoRS_dev
|   +-- DescriptAudioCodec
|   +-- EnCodecVocos
|   `-- Mimi
+-- CoRS_test
|   +-- DescriptAudioCodec
|   +-- EnCodecVocos
|   +-- Mimi
|   `-- XCodec2
+-- CoRS_train
|   +-- DescriptAudioCodec
|   +-- EnCodecVocos
|   `-- Mimi
`-- CoSG
```

## 1) What each folder contains

- `asvspoof5/`
  Bonafide source audio from **ASVspoof 5** in FLAC format, 16 kHz, partitioned as:
  - `flac_T/`: **Train** split
  - `flac_D/`: **Dev** split
  - `flac_E/`: **Eval** split
- `CoSG/`
  **Speech-generation (spoof) set** produced via zero-shot TTS following the ASVspoof 5 contribution protocol (details below). WAV format, 16 kHz.
- `CoRS_train/`, `CoRS_dev/`, `CoRS_test/`
  **Codec ReSynthesized (CoRS)** audio created by passing bonafide utterances through neural audio codecs (NACs). WAV format, 16 kHz. Each split contains one subfolder per codec:
  - `DescriptAudioCodec/`
  - `EnCodecVocos/`
  - `Mimi/`
  - `XCodec2/` (in the directory tree above, listed only under `CoRS_test/`)

  Files inside each codec directory are the reconstructed versions of the corresponding bonafide items selected for that split.

## 2) Construction of **CoSG** (speech-generation spoof)

CoSG is a spoofed set created via zero-shot TTS while mirroring ASVspoof 5’s data-generation design and speaker-overlap constraints.

**Protocol & selection.**
- Start from the official **ASVspoof 5 contribution protocol** lists for each split.
- For each _attacker_ (TTS system):
  - Randomly sample **speakers per split**: **360 (train) / 120 (dev) / 120 (test)**.
  - For each selected speaker, pick **~10 target texts**.
  - Use **one reference utterance** (zero-shot setting): choose the **longest** available reference audio for a robust speaker embedding; its **transcription** is taken from _Multilingual LibriSpeech_.

**Generation & counts.**
- For **each attacker**, synthesize:
  - **3.6k** spoofed utterances for **train** (360 speakers × ~10 texts),
  - **1.2k** for **dev** (120 speakers × ~10 texts),
  - **1.2k** for **test** (120 speakers × ~10 texts).
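The selection steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the script used to build CoSG: `plan_split`, the `speaker_refs` dictionary (speaker ID to a list of `(path, duration)` pairs), and the `texts` list are hypothetical, and "~10 target texts" is treated as exactly 10.

```python
import random

# Speakers sampled per split for each attacker (numbers from the protocol above).
SPEAKERS_PER_SPLIT = {"train": 360, "dev": 120, "test": 120}
TEXTS_PER_SPEAKER = 10  # assumption: "~10" is taken as exactly 10 here


def plan_split(speaker_refs, split, texts, seed=0):
    """Return (speaker, reference_path, text) synthesis jobs for one attacker.

    speaker_refs: hypothetical dict, speaker_id -> list of (path, duration_sec)
    texts:        hypothetical list of candidate target texts
    """
    rng = random.Random(seed)
    # Sort the IDs first so the sample is reproducible for a given seed.
    speakers = rng.sample(sorted(speaker_refs), SPEAKERS_PER_SPLIT[split])
    jobs = []
    for spk in speakers:
        # Zero-shot setting: a single reference, the longest one available.
        ref_path, _ = max(speaker_refs[spk], key=lambda item: item[1])
        for text in rng.sample(texts, TEXTS_PER_SPEAKER):
            jobs.append((spk, ref_path, text))
    return jobs


# Toy data only: 400 made-up speakers with three references each.
toy_refs = {
    f"D_{i:04d}": [(f"D_{i:04d}_{j}.flac", 2.0 + j) for j in range(3)]
    for i in range(400)
}
toy_texts = [f"sentence {k}" for k in range(50)]
toy_jobs = plan_split(toy_refs, "dev", toy_texts)
print(len(toy_jobs))  # 120 speakers x 10 texts = 1200
```

With 360 speakers the same function yields 3600 jobs, which matches the 3.6k train count per attacker.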

**Speaker-overlap rule.**
- To align with ASVspoof 5, within each split **only 50% of speakers overlap** between the **raw bonafide** partition and the **CoSG spoof** partition; the remaining 50% are disjoint.

## 3) Construction of **CoRS** (codec resynthesis)

**Codecs.**
- We use **four NACs**: `DescriptAudioCodec`, `EnCodecVocos`, `Mimi`, and `XCodec2`. Each codec has its own directory under `CoRS_{train,dev,test}/`.

**Sampling & counts (per NAC).**
- For **each NAC**, we resynthesize:
  - **3.6k** utterances for **train**,
  - **1.2k** utterances for **dev**,
  - **1.2k** utterances for **test**.

**Train-time pairing (augmentation).**
- For **training**, **all NACs resynthesize the _same_ pool of bonafide utterances**. This guarantees clean **paired (original ↔ reconstructed)** examples needed by augmentation schemes (e.g., probabilistic replacement).

**Dev/Test anti-bias policy.**
- For **dev** and **test**, each NAC uses **its own set of 1.2k _unique_ bonafide utterances**. This avoids bias from repeated resynthesis of identical content across codecs and ensures fair evaluation.

**File placement.**
- Resynthesized audio is written under:

```
CoRS_train//
CoRS_dev//
CoRS_test//
```

with filenames mirroring or mapping back to the selected bonafide items used for each split.
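
Because every training codec resynthesizes the same bonafide pool, a probabilistic-replacement augmentation can be sketched as below. This is a minimal illustration, not the authors' training code: `maybe_replace`, the probability `p`, and the assumption that a CoRS file is named `<utterance ID>.wav` inside its codec folder are all hypothetical, and the function makes no decision about how a replaced item should be labeled.

```python
import random
from pathlib import Path


def maybe_replace(utt_id, root, codecs, p=0.5, rng=random):
    """Return (audio_path, codec_or_None) for one training utterance.

    With probability p the bonafide FLAC is swapped for a resynthesized
    version from a randomly chosen codec; otherwise the original is kept.
    Assumption: CoRS files are stored as <utt_id>.wav under each codec folder.
    """
    root = Path(root)
    if rng.random() < p:
        codec = rng.choice(codecs)
        return root / "CoRS_train" / codec / f"{utt_id}.wav", codec
    return root / "asvspoof5" / "flac_T" / f"{utt_id}.flac", None


# Usage with a made-up utterance ID and dataset root.
demo_rng = random.Random(0)
demo_codecs = ["DescriptAudioCodec", "EnCodecVocos", "Mimi"]
demo_path, demo_codec = maybe_replace(
    "T_0000000001", "/data/CDD", demo_codecs, p=0.5, rng=demo_rng
)
print(demo_path, demo_codec)
```

Passing the list of codecs as an argument keeps the sketch independent of which codec folders are actually present in a given split.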

## 4) MetaFile

For ease of development, the structure of the metadata is kept consistent with ASVspoof 5, as shown below (quoted from the ASVspoof 5 README.txt):

> *.tsv are space-separated

```
SPEAKER_ID:      T_****, D_****, or E_****    ID of the speaker in the FLAC file
FLAC_FILE_NAME:  T/D/E_**********             name of the FLAC file
SPEAKER_GENDER:  F or M                       gender of the speaker
CODEC:           C** or -                     name of the codec or compressor condition
CODEC_Q:         N or -                       codec quality factor configuration number
CODEC_SEED:      T/D/E_********** or -        if this utterance is coded, name of the original utterance
ATTACK_TAG:      AC* or -                     tag of the attacker adaptation condition
ATTACK_LABEL:    A** or bonafide              name of the attack
KEY:             spoof or bonafide            CM key
TMP:             -                            reserved column
```

Note that the `FLAC_FILE_NAME` of a CoRS audio file is the same as that of the source bonafide audio used to generate it; only the directory differs. When loading the data, you may therefore still need to handle the CoRS files separately.

md5sum

```
0970486b8f4fda320b4f00199dbf47d1  asvspoof5.tar.gz
567ee72806e754629db84264e9b54ba5  CoRS_dev.tar.gz
bfca1af1e7a00b8a5cc91d034f6735f3  CoRS_test.tar.gz
ca0725ff436bf649a61f9ec9447d9164  CoRS_train.tar.gz
8a06eb8e24dbb8ba6824c8bc083ef4c5  CoSG.tar.gz
b9ed403c7b69b912742369f6f15cdc8e  meta.tar.gz
```

---

**References**

**[1]** Xin Wang, Héctor Delgado, Hemlata Tak, Jee-weon Jung, Hye-jin Shim, Massimiliano Todisco, Ivan Kukanov, Xuechen Liu, Md Sahidullah, Tomi Kinnunen, Nicholas Evans, Kong Aik Lee, Junichi Yamagishi, Myeonghun Jeong, Ge Zhu, Yongyi Zang, You Zhang, Soumi Maiti, Florian Lux, Nicolas Müller, Wangyou Zhang, Chengzhe Sun, Shuwei Hou, Siwei Lyu, Sébastien Le Maguer, Cheng Gong, Hanjie Guo, Liping Chen, and Vishwanath Singh. 2024. ASVspoof 5: Design, Collection and Validation of Resources for Spoofing, Deepfake, and Adversarial Attack Detection Using Crowdsourced Speech. In Computer Speech & Language, 2026, vol. 95, p. 101825.
[https://www.sciencedirect.com/science/article/pii/S0885230825000506](https://www.sciencedirect.com/science/article/pii/S0885230825000506)

**[2]** Xin Wang, Héctor Delgado, Hemlata Tak, Jee-weon Jung, Hye-jin Shim, Massimiliano Todisco, Ivan Kukanov, Xuechen Liu, Md Sahidullah, Tomi Kinnunen, Nicholas Evans, Kong Aik Lee, and Junichi Yamagishi. 2024. ASVspoof 5: Crowdsourced speech data, deepfakes, and adversarial attacks at scale. In ASVspoof Workshop 2024, pp. 1-8.
[https://doi.org/10.21437/ASVspoof.2024-1](https://doi.org/10.21437/ASVspoof.2024-1)
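
**Loading sketch for the metadata (section 4)**

The ten space-separated metadata columns from section 4 can be read with a short parser, and the shared `FLAC_FILE_NAME` can then be resolved to either the bonafide file or its CoRS counterpart. This is a sketch under assumptions, not an official loader: `MetaRow`, `parse_meta_line`, `bonafide_path`, and `cors_path` are hypothetical names, the example row is made up, and files are assumed to be stored as `<FLAC_FILE_NAME>.flac` (bonafide) or `<FLAC_FILE_NAME>.wav` (CoRS).

```python
from pathlib import Path
from typing import NamedTuple


class MetaRow(NamedTuple):
    """One metadata row, in the column order listed in section 4."""
    speaker_id: str
    flac_file_name: str
    speaker_gender: str
    codec: str
    codec_q: str
    codec_seed: str
    attack_tag: str
    attack_label: str
    key: str
    tmp: str


# The first letter of FLAC_FILE_NAME selects the ASVspoof 5 partition folder.
SPLIT_DIRS = {"T": "flac_T", "D": "flac_D", "E": "flac_E"}


def parse_meta_line(line):
    """Split one space-separated metadata line into a MetaRow."""
    fields = line.split()
    if len(fields) != len(MetaRow._fields):
        raise ValueError(
            f"expected {len(MetaRow._fields)} columns, got {len(fields)}"
        )
    return MetaRow(*fields)


def bonafide_path(root, row):
    """Path of the source bonafide file (assumed <name>.flac)."""
    split_dir = SPLIT_DIRS[row.flac_file_name[0]]
    return Path(root) / "asvspoof5" / split_dir / f"{row.flac_file_name}.flac"


def cors_path(root, row, split, codec):
    """Path of the resynthesized file with the same name (assumed <name>.wav)."""
    return Path(root) / f"CoRS_{split}" / codec / f"{row.flac_file_name}.wav"


# Made-up example row, for illustration only.
example = parse_meta_line("D_0001 D_0000000001 F - - - - bonafide bonafide -")
print(bonafide_path("/data/CDD", example))
print(cors_path("/data/CDD", example, "dev", "Mimi"))
```

Keeping the two path helpers separate makes the directory difference explicit, since the file name alone cannot distinguish a CoRS item from its bonafide source.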