SWAN-DF database of audio-video deepfakes
Description
SWAN-DF is the first high-fidelity, publicly available dataset of realistic audio-visual deepfakes, in which both the face and the voice appear and sound like the target person. The dataset is based on the public SWAN database of real videos recorded in HD on an iPhone and an iPad Pro in 2019. For 30 manually selected pairs of people from SWAN, we swapped faces and voices using several autoencoder-based face-swapping models and blending techniques from the well-known open-source repository DeepFaceLab, and several voice conversion (voice cloning) methods, including zero-shot YourTTS, DiffVC, HiFiVC, and several models from FreeVC.
For each combination of model and blending technique, there are 960 deepfake videos. We used models of three resolutions: 160x160, 256x256, and 320x320 pixels. For each resolution, we took one pre-trained model and tuned it for each of the 30 pairs of subjects (in both directions) for 50K iterations. Then, when generating deepfake videos for a given pair of subjects, we used one of the tuned models together with a method for blending the generated face back into the original frame, which we call a blending technique. SWAN-DF contains 25 different combinations of models and blending techniques, so the total number of deepfake videos is 960*25=24000.
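The video counts above can be checked with a short sketch; the variable names are illustrative and not part of any released tooling:

```python
# Hypothetical bookkeeping for the SWAN-DF video deepfakes.
# All numbers come from the description above.
videos_per_combination = 960       # deepfake videos per model/blending combination
model_blending_combinations = 25   # combinations of model resolution and blending technique
total_videos = videos_per_combination * model_blending_combinations
print(total_videos)  # 24000
```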
We generated speech deepfakes using four voice conversion methods: YourTTS, HiFiVC, DiffVC, and FreeVC. We did not use text-to-speech methods for our video deepfakes, since the speech they produce is not synchronized with the lip movements in the video. For YourTTS, HiFiVC, and DiffVC, we used the pretrained models provided by the authors: HiFiVC was pretrained on VCTK, DiffVC on LibriTTS, and YourTTS on both VCTK and LibriTTS. For FreeVC, we generated audio deepfakes in several variants: using the provided pretrained models as is (for 16 kHz, with and without the pretrained speaker encoder, and for 24 kHz with the pretrained speaker encoder), and by tuning the 16 kHz model, either from scratch or starting from the pretrained version, for different numbers of iterations on a mixture of VCTK and SWAN data. In total, SWAN-DF contains 12 different variants of audio deepfakes: one each for YourTTS, HiFiVC, and DiffVC, and 9 variants of FreeVC.
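The tally of audio-deepfake variants can be sketched the same way; the dictionary below only mirrors the counts stated above, and the names are illustrative:

```python
# Hypothetical tally of SWAN-DF audio-deepfake variants,
# following the description above.
audio_variants = {
    "YourTTS": 1,  # authors' pretrained model (VCTK + LibriTTS)
    "HiFiVC": 1,   # authors' pretrained model (VCTK)
    "DiffVC": 1,   # authors' pretrained model (LibriTTS)
    "FreeVC": 9,   # pretrained 16/24 kHz models plus tuned 16 kHz variants
}
total_audio_variants = sum(audio_variants.values())
print(total_audio_variants)  # 12
```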
Acknowledgements
If you use this database, please cite the following publication:
Pavel Korshunov, Haolin Chen, Philip N. Garner, and Sébastien Marcel, "Vulnerability of Automatic Identity Recognition to Audio-Visual Deepfakes", IEEE International Joint Conference on Biometrics (IJCB), September 2023.
https://publications.idiap.ch/publications/show/5092