DeepFakeNews

Nello, Enrico

doi:10.5281/zenodo.11186584

Published May 13, 2024 | Version v1

Dataset Open

Nello, Enrico (Producer)

The DeepFakeNews dataset is a novel and comprehensive dataset designed for the detection of both deepfakes and fake news. It extends and enhances the existing Fakeddit fake news dataset, with significant modifications that address the complexities of modern misinformation. I strongly suggest reading the related paper by the Fakeddit authors to better understand this dataset.

Enhancements: Derived from the Fakeddit fake news dataset, the DeepFakeNews dataset comprises a total of 509,916 images, of which 254,958 are deepfake images generated using three different generative models: Stable Diffusion 2, Dreamlike, and GLIDE.
Balance and Composition: The dataset is perfectly balanced, containing an equal number of pristine (authentic) and generated (deepfake) images.
Removal of Hand-Modified Content: The original "manipulated content" category from Fakeddit, which consisted of images altered or modified by hand, has been removed. These have been replaced with deepfakes to provide a more relevant and challenging set of synthetic images.
Cleaning and Quality Control: The Fakeddit dataset was thoroughly cleaned, removing any images that were not found, contained only logos, or were otherwise unsuitable for deepfake detection. This cleaning process ensures a higher quality and more reliable dataset for training and evaluation.
Application: The DeepFakeNews dataset is suitable for both deepfake detection and fake news detection. Its diverse and balanced nature makes it an excellent benchmark for evaluating multimodal detection systems that analyze both visual and textual content.

The dataset comes with three CSV files for the training, testing, and validation sets, along with corresponding zip files containing the images for each split. Deepfake images are identified, in both the CSV files and the image filenames, by a prefix that encodes the generative model used: "SD_fake_imageid" for Stable Diffusion, "GL_fake_imageid" for GLIDE, and "DL_fake_imageid" for Dreamlike.
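The naming convention makes it possible to recover the generative model from an image name alone. The following is a minimal sketch, not part of the dataset's tooling: the function names are hypothetical, and it assumes that pristine images carry none of the three prefixes and that the image id follows the prefix directly.

```python
from typing import Optional

# Prefixes taken from the naming convention described above.
GENERATOR_PREFIXES = {
    "SD_fake_": "Stable Diffusion",
    "GL_fake_": "GLIDE",
    "DL_fake_": "Dreamlike",
}


def generator_from_name(name: str) -> Optional[str]:
    """Return the generative model implied by an image name.

    Returns None when no known prefix matches, which this sketch
    assumes means the image is pristine.
    """
    for prefix, model in GENERATOR_PREFIXES.items():
        if name.startswith(prefix):
            return model
    return None


def strip_generator_prefix(name: str) -> str:
    """Remove the generator prefix, if any, leaving the trailing image id."""
    for prefix in GENERATOR_PREFIXES:
        if name.startswith(prefix):
            return name[len(prefix):]
    return name
```

For example, `generator_from_name("GL_fake_abc123")` yields `"GLIDE"`, and `strip_generator_prefix("GL_fake_abc123")` yields `"abc123"`; the id `abc123` is made up for illustration.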

The deepfake generation pipeline follows a two-step approach:

1. Generate a caption for a pristine image using a captioning model.
2. Feed this caption into a generative model to create a new synthetic image.
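The two steps can be sketched as a simple function composition. This is only an illustration of the control flow, not the code used to build the dataset: `generate_deepfake`, `dummy_captioner`, and `dummy_generator` are hypothetical names, and the stand-in models operate on strings so the example runs without any machine learning dependency. In practice the two callables would wrap a real captioning model and one of the generative models listed above.

```python
from typing import Callable, Tuple, TypeVar

Image = TypeVar("Image")


def generate_deepfake(
    pristine_image: Image,
    captioner: Callable[[Image], str],
    generator: Callable[[str], Image],
) -> Tuple[str, Image]:
    """Caption a pristine image, then synthesize a new image from the caption."""
    caption = captioner(pristine_image)  # step 1: image -> text
    synthetic = generator(caption)       # step 2: text -> image
    return caption, synthetic


# Stand-in models (hypothetical) so the sketch is runnable as written.
def dummy_captioner(image: str) -> str:
    return f"a photo of {image}"


def dummy_generator(prompt: str) -> str:
    return f"<synthetic image for: {prompt}>"


caption, fake = generate_deepfake("a red bus", dummy_captioner, dummy_generator)
```

Note that the synthetic image depends on the pristine image only through its caption, so no pixels are copied from the original.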

By incorporating images from multiple generative models, the dataset is designed to prevent detection models from becoming biased towards a single generation method during training. This choice aims to improve the generalization of models trained on this dataset, so that they can recognize and flag deepfake content produced by a variety of methods, not only those seen during training. The generated images make up half of the dataset; the other half consists of pristine, unaltered images, which keeps the dataset balanced for unbiased training and evaluation of detection models.

The dataset has been structured to maintain backward compatibility with the original Fakeddit dataset. All samples have retained their original Fakeddit class labels (6_way_label), allowing for fine-grained fake news detection across the five remaining original categories: True, Satire/Parody, False Connection, Imposter Content, and Misleading Content (the sixth, manipulated content, has been removed). As a result, the DeepFakeNews dataset can be used not only for multimodal and unimodal deepfake detection but also for traditional fake news detection tasks, making it a versatile resource for research on digital misinformation detection.
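As a quick way to inspect the fine-grained labels, the class distribution of a split can be counted directly from its CSV file. The sketch below assumes only the `6_way_label` column named above; the `id` column, the sample rows, and the label values in them are invented for illustration and do not reflect the real files.

```python
import csv
import io
from collections import Counter


def label_distribution(csv_file, label_column="6_way_label"):
    """Count how many rows carry each value of the label column."""
    reader = csv.DictReader(csv_file)
    return Counter(row[label_column] for row in reader)


# Hypothetical three-row sample standing in for train.csv.
sample = io.StringIO("id,6_way_label\nx1,0\nx2,0\nx3,1\n")
counts = label_distribution(sample)
```

With a downloaded split, the same function can be called on an open file handle, for example `label_distribution(open("train.csv", newline="", encoding="utf-8"))`, assuming the file is UTF-8 encoded.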

For full details on dataset creation, the cleaning pipeline, composition, and the generation process, please refer to my Master's thesis.

+-----------+---------------------+---------------------+---------------------+
|           |      Train Set      |   Validation Set    |      Test Set       |
+-----------+----------+----------+----------+----------+----------+----------+
|           | Truthful |   Fake   | Truthful |   Fake   | Truthful |   Fake   |
+-----------+----------+----------+----------+----------+----------+----------+
| Pristine  |  107,549 |   85,387 |   11,491 |    8,964 |   23,352 |   18,215 |
+-----------+----------+----------+----------+----------+----------+----------+
| Generated |  107,549 |   85,387 |   11,491 |    8,964 |   23,352 |   18,215 |
+-----------+----------+----------+----------+----------+----------+----------+
| Subtotal  |  215,098 |  170,774 |   22,982 |   17,928 |   46,704 |   36,430 |
+-----------+---------------------+---------------------+---------------------+
| Total     |       385,872       |       40,910        |       83,134        |
+-----------+---------------------+---------------------+---------------------+
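The figures in the table are consistent with the totals quoted in the description, which the short check below confirms. The variable names are illustrative; the numbers are copied from the table.

```python
# Pristine counts per split, copied from the table above: (truthful, fake).
PRISTINE = {
    "train": (107_549, 85_387),
    "validation": (11_491, 8_964),
    "test": (23_352, 18_215),
}

# The generated row mirrors the pristine row, so each split total doubles.
split_totals = {
    name: 2 * (truthful + fake) for name, (truthful, fake) in PRISTINE.items()
}
pristine_total = sum(truthful + fake for truthful, fake in PRISTINE.values())
dataset_total = sum(split_totals.values())

assert split_totals == {"train": 385_872, "validation": 40_910, "test": 83_134}
assert pristine_total == 254_958  # equals the number of generated images
assert dataset_total == 509_916   # total images stated in the description
```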

Files (24.7 GB)

Name	MD5	Size
test.csv	0ec35c33a7a583e18082369550a62d8d	8.5 MB
test_1.zip	d12213c62a345981d59fe83bef6cf498	2.2 GB
test_2.zip	4a7c167a9a89970ab077574d81b1b1cb	1.8 GB
train.csv	3630ba9f612278b693cf9c7baa4bceaf	47.2 MB
train_1.zip	4c7fdcbc3835d9b0116b85ce757fd431	5.4 GB
train_2.zip	19ccb61e7009e56b7438df46d5b4646d	5.4 GB
train_3.zip	6c6ad97c1ee7ec5e68d7c00fb465db88	5.4 GB
train_4.zip	7f35cfb067b461a19316ea51cc4cd7ed	2.1 GB
validation.csv	367c0738ecba1133dce7086bbee3d851	5.0 MB
validation_1.zip	95b30bec4e75cf30aa7041fecfe88c48	2.2 GB
validation_2.zip	7063b7adf6d23fd15cd66af17076c293	275.3 MB
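Given the size of the archives, it may be worth checking each download against the MD5 checksums in the file listing above. The following sketch uses only the Python standard library; the helper names are hypothetical, and only the three CSV checksums are included for brevity (the zip checksums can be added to the dictionary in the same way).

```python
import hashlib
from pathlib import Path

# Checksums copied from the file listing above (CSV files only).
EXPECTED_MD5 = {
    "test.csv": "0ec35c33a7a583e18082369550a62d8d",
    "train.csv": "3630ba9f612278b693cf9c7baa4bceaf",
    "validation.csv": "367c0738ecba1133dce7086bbee3d851",
}


def md5_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path: Path) -> bool:
    """Return True if the file's digest matches the expected checksum."""
    return md5_of(path) == EXPECTED_MD5[path.name]
```

Reading in chunks keeps memory use flat even for the multi-gigabyte zip files. A call such as `verify(Path("train.csv"))` returns `False` for a corrupted or incomplete download.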

Additional details

Is derived from: Dataset: 10.48550/arXiv.1911.03854 (DOI)

Programming language: Python
