VoiceWukong: Benchmarking Deepfake Voice Detection (part_aa)
Description
VoiceWukong
VoiceWukong is a comprehensive benchmark for deepfake voice detection, designed to evaluate the performance of various detectors in real-world application scenarios.
Dataset Features
- Large Scale: Contains 265,200 English and 148,200 Chinese deepfake voice samples
- Diverse Sources: Covers voice samples generated by 19 commercial tools and 15 open-source tools
- Real-world Scenarios: Includes 38 data variants covering 6 types of audio manipulations common in practical applications
- Bilingual Support: Supports evaluation in both Chinese and English languages
Evaluation Results
- Evaluated 12 state-of-the-art deepfake voice detectors on the benchmark
- AASIST2 achieved the best performance with an Equal Error Rate (EER) of 13.50%
- All other detectors had EERs exceeding 20%
- Results indicate significant challenges for current detectors in practical applications
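For readers unfamiliar with the metric, the EER is the error rate at the decision threshold where the false positive rate and the false negative rate are equal. The sketch below illustrates that definition only; it is not the benchmark's evaluation code. The function name `equal_error_rate`, the convention that higher scores mean "more likely deepfake", and the label encoding (1 for deepfake, 0 for genuine) are all assumptions made for this example.

```python
def equal_error_rate(scores, labels):
    """Approximate the EER of a detector.

    Assumed conventions (not taken from the benchmark):
    scores: higher means "more likely deepfake".
    labels: 1 for deepfake, 0 for genuine speech.
    """
    fakes = [s for s, y in zip(scores, labels) if y == 1]
    reals = [s for s, y in zip(scores, labels) if y == 0]
    if not fakes or not reals:
        raise ValueError("need at least one sample of each class")

    best_gap, eer = float("inf"), None
    # Candidate thresholds: every observed score, plus one above the maximum.
    for threshold in sorted(set(scores)) + [float("inf")]:
        # Genuine samples wrongly flagged as deepfake.
        fpr = sum(s >= threshold for s in reals) / len(reals)
        # Deepfake samples that slip through as genuine.
        fnr = sum(s < threshold for s in fakes) / len(fakes)
        gap = abs(fpr - fnr)
        if gap < best_gap:
            # With finite data the two rates may never match exactly,
            # so report their midpoint at the closest threshold.
            best_gap, eer = gap, (fpr + fnr) / 2
    return eer


scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 1, 0, 0]
print(f"EER: {equal_error_rate(scores, labels):.2%}")  # prints "EER: 25.00%"
```

Under this definition, an EER of 13.50% means that at the balanced threshold roughly one in seven samples of each class is misclassified.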
Human-Machine Comparison Study
- Conducted user studies with over 300 participants
- Comparative analysis of detection capabilities among humans, detectors, and multimodal large language models (Qwen2-Audio)
- Detectors and humans differed in how well they identified deepfake voices at each deception level
- Multimodal large language models demonstrated no effective detection ability
Dataset
This is the first part of the dataset. Both part_aa and part_ab must be fully downloaded before the archive can be extracted and used, and both files must be in the same folder. For a detailed introduction to the data, please refer to our paper (to be made available).
The second part (part_ab) is at part_ab
Extract command: cat VoiceWukong.part_* | tar -xz
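On systems without `cat` and `tar` (for example, a stock Windows setup), the same concatenate-then-extract step can be sketched in Python. This is a minimal sketch, assuming the parts are named `VoiceWukong.part_aa` and `VoiceWukong.part_ab` as in the command above; the helper names `ConcatReader` and `extract_parts` are hypothetical and not part of the dataset's tooling.

```python
import tarfile
from pathlib import Path


class ConcatReader:
    """Expose several files, read back to back, as one byte stream."""

    def __init__(self, paths):
        self._paths = iter(paths)
        self._current = None

    def read(self, size=-1):
        chunks, remaining = [], size
        while size < 0 or remaining > 0:
            if self._current is None:
                path = next(self._paths, None)
                if path is None:
                    break  # no more parts
                self._current = open(path, "rb")
            data = self._current.read(remaining if size >= 0 else -1)
            if not data:
                self.close()  # this part is exhausted, move to the next
                continue
            chunks.append(data)
            if size >= 0:
                remaining -= len(data)
        return b"".join(chunks)

    def close(self):
        if self._current is not None:
            self._current.close()
            self._current = None


def extract_parts(folder, dest, prefix="VoiceWukong.part_"):
    """Extract the split .tar.gz whose parts live in `folder` into `dest`."""
    parts = sorted(Path(folder).glob(prefix + "*"))
    if len(parts) < 2:
        raise FileNotFoundError("expected both part_aa and part_ab in " + str(folder))
    # Use the safer "data" extraction filter where this Python provides it.
    kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
    reader = ConcatReader(parts)
    try:
        # "r|gz" reads the gzip stream sequentially, so no merged file is written.
        with tarfile.open(fileobj=reader, mode="r|gz") as tar:
            tar.extractall(dest, **kwargs)
    finally:
        reader.close()
```

A call such as `extract_parts(".", "VoiceWukong")` would unpack into a `VoiceWukong` folder; the destination name is only an example. Sorting the part names keeps `part_aa` ahead of `part_ab`, which mirrors the order the shell glob produces.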
Leaderboard
Our leaderboard presents comprehensive evaluation results in three main sections:
- Overall Performance - General evaluation metrics for each detector across the entire dataset, providing a broad view of detection capabilities.
- Manipulation-specific Performance - Detailed results showing how each detector performs under different types of audio manipulations, offering insights into specific strengths and weaknesses.
- User Study-based Evaluation - Performance analysis of detectors on deepfake voices categorized by difficulty levels based on our user study results, demonstrating detector effectiveness across varying deception capabilities.
Visit our leaderboard (github.io) for detailed performance metrics and rankings. Additionally, we provide a copy of the leaderboard code here for permanent storage.
Evaluated Detectors' Model Weights
- The model weights of all evaluated detectors can be obtained from huggingface.co. Additionally, we provide a copy of the weight files here for permanent storage.
User Study Results & Original Outputs
- This code repository (github) stores our user study results and the original outputs of the evaluated detectors. Additionally, we provide a copy of the code repository here for permanent storage.
Note: VoiceWukong may not be used for commercial purposes.
Software
- Repository URL
- https://voicewukong.github.io/