Generalized Deception Dataset

Zeng, Victor; Liu, Xuting; Verma, Rakesh M.

doi:10.5281/zenodo.8371762

Published May 25, 2022 | Version 1.1

Dataset Open

1. University of Texas at Austin
2. University of California, Berkeley
3. University of Houston

We took labeled datasets from five different deception-detection tasks with no licensing issues and converted them to a standard format. We inspected each dataset for quality and generated new cleaned versions.

Task	# Deceptive	# Truthful
Product Reviews	10493	10481
Phishing	6134	9202
Job Scams	608	13735
Political Statements	5669	7167
Fake News	27486	34615

The data is structured as five JSON Lines files (one per task). Each record holds a text to classify and a Boolean is_deceptive label.

Sample data point:

{
  "text":"the Annies List political group supports third-trimester abortions on demand.",
  "is_deceptive":true
}
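As an illustration of the record format described above, the following is a minimal Python sketch for reading one of the files and counting labels. The helper names load_jsonl and label_counts are hypothetical, the UTF-8 encoding is an assumption, and the file is assumed to have been downloaded into the working directory.

```python
import json
from collections import Counter
from pathlib import Path


def load_jsonl(path):
    """Read a JSON Lines file into a list of dicts, skipping blank lines."""
    records = []
    # Assumption: the files are UTF-8 encoded.
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def label_counts(records):
    """Count records by label: True is deceptive, False is truthful."""
    return Counter(bool(record["is_deceptive"]) for record in records)


if __name__ == "__main__":
    # Assumption: the file sits in the current working directory.
    path = Path("political_statements.jsonl")
    if path.exists():
        counts = label_counts(load_jsonl(path))
        print(f"deceptive={counts[True]} truthful={counts[False]}")
```

Comparing the printed counts with the task table is a quick way to confirm that a download is complete.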

Changelog

1.1

Fixed flipped labels in the job scams dataset.

Notes

This work was completed in part with resources provided by the Research Computing Data Core at the University of Houston and supported in part by NSF grants DGE 1433817 and CCF 1950297, ARO award W911NF-20-1-0254, and ONR award N00014-19-S-F009.

Files

Files (277.7 MB)

Name	MD5	Size
fake_news.jsonl	474f610d8e2a9362a3d8fcccc8599321	212.4 MB
job_scams.jsonl	4f2e75fddc34a838facda63b40cac664	18.5 MB
phishing.jsonl	88641d2c43a2f1b9d6941d3d5ccc8c22	36.7 MB
political_statements.jsonl	0cb00bfa2809f87c3b026f956e55f65b	1.8 MB
product_reviews.jsonl	74a49eba495e17c900be49d02a70ecb3	8.4 MB
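The MD5 digests listed with the files can be used to check downloads for corruption. The sketch below shows one way to do this in Python with the standard hashlib module. The function names md5_of_file and verify are hypothetical, and the files are assumed to keep their original names in a single directory.

```python
import hashlib
from pathlib import Path

# MD5 digests copied from the file listing of this record.
EXPECTED_MD5 = {
    "fake_news.jsonl": "474f610d8e2a9362a3d8fcccc8599321",
    "job_scams.jsonl": "4f2e75fddc34a838facda63b40cac664",
    "phishing.jsonl": "88641d2c43a2f1b9d6941d3d5ccc8c22",
    "political_statements.jsonl": "0cb00bfa2809f87c3b026f956e55f65b",
    "product_reviews.jsonl": "74a49eba495e17c900be49d02a70ecb3",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(directory="."):
    """Map each file name to True (match), False (mismatch), or None (missing)."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        results[name] = (md5_of_file(path) == expected) if path.exists() else None
    return results


if __name__ == "__main__":
    for name, status in verify().items():
        label = {True: "ok", False: "MISMATCH", None: "not found"}[status]
        print(f"{name}: {label}")
```

Reading in chunks keeps memory use low for the larger files, such as fake_news.jsonl.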

Additional details

Is supplement to: Conference paper: 10.1145/3508398.3519358 (DOI)

References

Pawan Kumar Verma, Prateek Agrawal, and Radu Prodan. 2021. WELFake dataset for fake news detection in text data. https://doi.org/10.5281/zenodo.4561253
Sokratis Vidros, Constantinos Kolias, Georgios Kambourakis, and Leman Akoglu. 2017. Automatic Detection of Online Recruitment Frauds: Characteristics, Methods, and a Public Dataset. Future Internet 9, 1 (2017).
Rakesh M. Verma, Victor Zeng, and Houtan Faridi. 2019. Data quality for security challenges: Case studies of phishing, malware, and intrusion detection datasets. In Proc. ACM SIGSAC Conf. on Computer and Communications Security. 2605–2607.
Tariq Alhindi, Savvas Petridis, and Smaranda Muresan. 2018. Where is your Evidence: Improving Fact-checking by Justification Modeling. In Proceedings of the First Workshop on Fact Extraction and VERification (FEVER). 85–90.
