Published May 13, 2024 | Version v1
Dataset Open

DeepFakeNews

Creators

Description

The DeepFakeNews dataset is a novel and comprehensive dataset designed for the detection of both deepfakes and fake news. It extends and enhances the existing Fakeddit fake news dataset, with significant modifications aimed at the complexities of modern misinformation. Reading the related Fakeddit paper (here) by its authors is strongly recommended for a better understanding of this dataset.

  • Enhancements: Derived from the Fakeddit fake news dataset, the DeepFakeNews dataset comprises a total of 509,916 images, of which 254,958 are deepfake images generated using three different generative models: Stable Diffusion 2, Dreamlike, and GLIDE.
  • Balance and Composition: The dataset is perfectly balanced, containing an equal number of pristine (authentic) and generated (deepfake) images.
  • Removal of Hand-Modified Content: The original "manipulated content" category from Fakeddit, which consisted of images altered or modified by hand, has been removed. These have been replaced with deepfakes to provide a more relevant and challenging set of synthetic images.
  • Cleaning and Quality Control: The Fakeddit dataset was thoroughly cleaned, removing any images that were not found, contained only logos, or were otherwise unsuitable for deepfake detection. This cleaning process ensures a higher quality and more reliable dataset for training and evaluation.
  • Application: The DeepFakeNews dataset is suitable for both deepfake detection and fake news detection. Its diverse and balanced nature makes it an excellent benchmark for evaluating multimodal detection systems that analyze both visual and textual content.

The dataset comes with three CSV files for training, testing, and validation sets, along with corresponding zip files containing the split images for each set. The deepfake images are named in both the CSV files and the image filenames following a specific format based on the generative model used: "SD_fake_imageid" for Stable Diffusion, "GL_fake_imageid" for GLIDE, and "DL_fake_imageid" for Dreamlike.
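The naming convention above can be turned into a small lookup. The following is a minimal sketch, assuming that pristine images keep an ID without any of the three prefixes and that the ID is passed as a plain string; the function name `generator_from_id` and the sample IDs are hypothetical and not part of the dataset.

```python
from typing import Optional

# Prefixes taken from the naming convention described above.
GENERATOR_PREFIXES = {
    "SD": "Stable Diffusion",
    "GL": "GLIDE",
    "DL": "Dreamlike",
}


def generator_from_id(image_id: str) -> Optional[str]:
    """Return the generative model implied by a deepfake ID.

    Returns None when the ID carries none of the known prefixes, which
    this sketch ASSUMES means the image is pristine.
    """
    for prefix, model in GENERATOR_PREFIXES.items():
        if image_id.startswith(f"{prefix}_fake_"):
            return model
    return None


# Hypothetical IDs, for illustration only.
for example_id in ("SD_fake_abc123", "GL_fake_abc123", "DL_fake_abc123", "abc123"):
    print(example_id, "->", generator_from_id(example_id))
```

A lookup like this makes it easy to break detection results down per generator when evaluating a model.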

The deepfake generation pipeline follows a two-step approach:

  1. First, a caption is generated for a pristine image using a captioning model.
  2. Then, this caption is fed into a generative model to create a new synthetic image.
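The two steps can be sketched as a simple function composition. This is only an illustration of the control flow: the description does not name the captioning model or give generation parameters, so `caption_fn` and `generate_fn` are hypothetical placeholders for whatever models are plugged in, and the stubs below stand in for real images.

```python
from typing import Callable, Tuple, TypeVar

Image = TypeVar("Image")


def make_deepfake(
    pristine_image: Image,
    caption_fn: Callable[[Image], str],
    generate_fn: Callable[[str], Image],
) -> Tuple[str, Image]:
    """Caption a pristine image, then generate a synthetic image from the caption."""
    caption = caption_fn(pristine_image)    # step 1: image -> text
    synthetic_image = generate_fn(caption)  # step 2: text -> image
    return caption, synthetic_image


# Stub models for illustration; real ones would wrap a captioner and a
# text-to-image generator such as those named in the description.
def stub_captioner(image: str) -> str:
    return f"a photo described from {image}"


def stub_generator(caption: str) -> str:
    return f"<synthetic image for: {caption}>"


caption, fake = make_deepfake("pristine.jpg", stub_captioner, stub_generator)
print(caption)
print(fake)
```

Keeping the two models as parameters mirrors the dataset's use of several generators behind the same captioning step.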

By incorporating images from multiple generative technologies, the dataset is designed to prevent bias towards a single generation method when training detection models. This choice aims to improve the generalization of models trained on this dataset, so that they can recognize and flag deepfake content produced by a variety of methods, not just the ones seen during training. The generated images make up half of the dataset; the other half consists of pristine, unaltered images, which keeps the dataset balanced for unbiased training and evaluation of detection models.

The dataset has been structured to maintain backward compatibility with the original Fakeddit dataset. All samples retain their original Fakeddit class labels (6_way_label), allowing fine-grained fake news detection across the five remaining original categories: True, Satire/Parody, False Connection, Imposter Content, and Misleading Content. As a result, the DeepFakeNews dataset can be used not only for multimodal and unimodal deepfake detection but also for traditional fake news detection tasks, which makes it a versatile resource for research on digital misinformation detection.
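As a quick way to inspect the class distribution, the labels can be counted directly from a split file. This is a minimal sketch that assumes the CSV files are comma-delimited with a header row containing a `6_way_label` column; the file name `train.csv` in the comment is also an assumption, and the values are counted as they appear without mapping them to category names.

```python
import csv
from collections import Counter
from typing import Iterable


def count_labels(lines: Iterable[str], column: str = "6_way_label") -> Counter:
    """Count how often each value of `column` occurs in CSV text.

    `lines` can be an open text file or any iterable of CSV lines.
    ASSUMES a header row that contains `column`.
    """
    reader = csv.DictReader(lines)
    return Counter(row[column] for row in reader)


# Tiny in-memory example with made-up rows.
sample = ["id,6_way_label", "a,0", "b,0", "c,2"]
print(count_labels(sample))

# With a downloaded split (file name assumed):
# with open("train.csv", newline="", encoding="utf-8") as f:
#     print(count_labels(f))
```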

For full details about dataset creation, the cleaning pipeline, composition, and the generation process, please refer to my Master's thesis.

Table of contents

+-----------+---------------------+---------------------+---------------------+
|           |      Train Set      |   Validation Set    |      Test Set       |
+-----------+----------+----------+----------+----------+----------+----------+
|           | Truthful |   Fake   | Truthful |   Fake   | Truthful |   Fake   |
+-----------+----------+----------+----------+----------+----------+----------+
| Pristine  |  107,549 |   85,387 |   11,491 |    8,964 |   23,352 |   18,215 |
+-----------+----------+----------+----------+----------+----------+----------+
| Generated |  107,549 |   85,387 |   11,491 |    8,964 |   23,352 |   18,215 |
+-----------+----------+----------+----------+----------+----------+----------+
| Subtotal  |  215,098 |  170,774 |   22,982 |   17,928 |   46,704 |   36,430 |
+-----------+----------+----------+----------+----------+----------+----------+
| Total     |       385,872       |       40,910        |       83,134        |
+-----------+---------------------+---------------------+---------------------+
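The figures in the table can be cross-checked with a few lines of arithmetic. The sketch below copies the per-split counts from the table and recomputes the split totals, the overall image count, and the number of generated images; the variable names are illustrative only.

```python
# (truthful, fake) counts per split, copied from the table above.
PRISTINE = {
    "train": (107_549, 85_387),
    "validation": (11_491, 8_964),
    "test": (23_352, 18_215),
}
# The table lists identical counts for the generated images.
GENERATED = dict(PRISTINE)


def split_total(split: str) -> int:
    """Pristine plus generated images in one split."""
    return sum(PRISTINE[split]) + sum(GENERATED[split])


totals = {split: split_total(split) for split in PRISTINE}
grand_total = sum(totals.values())
generated_total = sum(sum(counts) for counts in GENERATED.values())

print(totals)           # per-split totals
print(grand_total)      # all images
print(generated_total)  # deepfake images only
```

The recomputed values agree with the totals stated in the description: 509,916 images overall, of which 254,958 are generated.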

Files (24.7 GB)

MD5 checksum                            Size
md5:0ec35c33a7a583e18082369550a62d8d    8.5 MB
md5:d12213c62a345981d59fe83bef6cf498    2.2 GB
md5:4a7c167a9a89970ab077574d81b1b1cb    1.8 GB
md5:3630ba9f612278b693cf9c7baa4bceaf    47.2 MB
md5:4c7fdcbc3835d9b0116b85ce757fd431    5.4 GB
md5:19ccb61e7009e56b7438df46d5b4646d    5.4 GB
md5:6c6ad97c1ee7ec5e68d7c00fb465db88    5.4 GB
md5:7f35cfb067b461a19316ea51cc4cd7ed    2.1 GB
md5:367c0738ecba1133dce7086bbee3d851    5.0 MB
md5:95b30bec4e75cf30aa7041fecfe88c48    2.2 GB
md5:7063b7adf6d23fd15cd66af17076c293    275.3 MB
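Since the archives are several gigabytes each, it can be worth verifying a download against its listed MD5 checksum. The following is a minimal sketch using only the Python standard library; the helper names and the file path in the usage comment are hypothetical, and the listing above does not say which checksum belongs to which file name.

```python
import hashlib
from pathlib import Path
from typing import Union


def md5_of_file(path: Union[str, Path], chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path: Union[str, Path], expected: str) -> bool:
    """Compare a file against a checksum given as 'md5:<hex>' or plain hex."""
    expected_hex = expected.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected_hex


# Usage (path is an assumption; use the checksum listed for that file):
# matches_checksum("downloads/archive.zip", "md5:<checksum from the list>")
```

Reading in chunks keeps memory use constant even for the 5.4 GB archives.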

Additional details

Related works

Is derived from
Dataset: 10.48550/arXiv.1911.03854 (DOI)

Software

Programming language
Python