Published September 21, 2023 | Version v1
Dataset Restricted

SWAN-DF database of audio-video deepfakes

  • Idiap Research Institute

Description

SWAN-DF is the first high-fidelity, publicly available dataset of realistic audio-visual deepfakes, in which both the face and the voice appear and sound like the target person. It is based on the public SWAN database of real videos recorded in HD on an iPhone and an iPad Pro in 2019. For 30 manually selected pairs of people from SWAN, we swapped faces and voices using several autoencoder-based face-swapping models and blending techniques from the well-known open-source repository DeepFaceLab, and several voice conversion (voice cloning) methods, including zero-shot YourTTS, DiffVC, HiFiVC, and several models from FreeVC.

For each model and each blending technique, there are 960 video deepfakes. We used models of three resolutions: 160x160, 256x256, and 320x320 pixels. We took one pre-trained model for each resolution and fine-tuned it for each of the 30 pairs of subjects (in both directions) for 50K iterations. When generating the deepfake videos for a given pair of subjects, we combined one of the tuned models with a method for blending the generated face back into the original frame, which we call a blending technique. The SWAN-DF dataset contains 25 different combinations of model and blending, so the total number of deepfake videos is 960 × 25 = 24,000.
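The video counts above can be sketched as simple arithmetic; the constant names below are illustrative, not identifiers used by the dataset itself.

```python
# Sketch of the SWAN-DF video deepfake counts (illustrative names,
# not the authors' actual configuration identifiers).
VIDEOS_PER_COMBINATION = 960  # deepfake videos per (model, blending) combination
NUM_COMBINATIONS = 25         # combinations of model resolution and blending technique

total_videos = VIDEOS_PER_COMBINATION * NUM_COMBINATIONS
print(total_videos)  # 24000
```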

We generated speech deepfakes using four voice conversion methods: YourTTS, HiFiVC, DiffVC, and FreeVC. We did not use text-to-speech methods for our video deepfakes, since the speech they produce is not synchronized with the lip movements in the video. For YourTTS, HiFiVC, and DiffVC, we used the pre-trained models provided by the authors: HiFiVC was pre-trained on VCTK, DiffVC on LibriTTS, and YourTTS on both VCTK and LibriTTS. For FreeVC, we generated audio deepfakes in several variants: using the provided pre-trained models (for 16 kHz with and without a pre-trained speaker encoder, and for 24 kHz with a pre-trained speaker encoder) as is, and by tuning the 16 kHz model, either from scratch or starting from the pre-trained version, for different numbers of iterations on a mixture of VCTK and SWAN data. In total, SWAN-DF contains 12 different variants of audio deepfakes: one each for YourTTS, HiFiVC, and DiffVC, and nine variants of FreeVC.
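The audio variant tally above can be sketched as follows; the per-method counts are taken directly from the description, while the dictionary layout is just an illustration.

```python
# Tally of the 12 audio deepfake variants in SWAN-DF, per the description above.
audio_variants = {
    "YourTTS": 1,  # pre-trained on VCTK and LibriTTS
    "HiFiVC": 1,   # pre-trained on VCTK
    "DiffVC": 1,   # pre-trained on LibriTTS
    "FreeVC": 9,   # pre-trained and tuned variants
}

total_audio_variants = sum(audio_variants.values())
print(total_audio_variants)  # 12
```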


Acknowledgements

If you use this database, please cite the following publication:

Pavel Korshunov, Haolin Chen, Philip N. Garner, and Sébastien Marcel, "Vulnerability of Automatic Identity Recognition to Audio-Visual Deepfakes", IEEE International Joint Conference on Biometrics (IJCB), September 2023.
https://publications.idiap.ch/publications/show/5092

Files

Restricted

The record is publicly accessible, but files are restricted to users with access.

Request access


You need to satisfy these conditions in order for this request to be accepted:

Access to the dataset is based on an End-User License Agreement. The use of the dataset is strictly restricted to non-commercial research.

Please provide the following information about the authorized signatory (who MUST hold a permanent position):

  • Full name
  • Name of organization
  • Position / job title
  • Academic / professional email address
  • URL where we can verify the information details

Only academic or professional email addresses from the same organization as the signatory are accepted for the online request. All online requests coming from generic email providers such as Gmail will be rejected.
