Published May 25, 2022 | Version 1.1
Dataset Open

Generalized Deception Dataset

  • 1. University of Texas at Austin
  • 2. University of California, Berkeley
  • 3. University of Houston

Description

We took labeled datasets, all free of licensing issues, from five different deception-detection tasks and converted them to a standard format. We inspected each dataset for quality and generated new, cleaned versions.

Task                    # Deceptive    # Truthful
Product Reviews               10493         10481
Phishing                       6134          9202
Job Scams                       608         13735
Political Statements           5669          7167
Fake News                     27486         34615
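
The counts above imply very different class balances across tasks. The following is a minimal sketch that derives each task's total size and deceptive share from the table; the names COUNTS, deceptive_fraction, and summarize are illustrative and not part of the dataset.

```python
# Counts copied from the table above: task -> (deceptive, truthful).
COUNTS = {
    "Product Reviews": (10493, 10481),
    "Phishing": (6134, 9202),
    "Job Scams": (608, 13735),
    "Political Statements": (5669, 7167),
    "Fake News": (27486, 34615),
}


def deceptive_fraction(deceptive: int, truthful: int) -> float:
    """Share of examples in a task that are labeled deceptive."""
    return deceptive / (deceptive + truthful)


def summarize(counts: dict) -> dict:
    """Map each task to (total examples, deceptive fraction)."""
    return {
        task: (d + t, deceptive_fraction(d, t))
        for task, (d, t) in counts.items()
    }


if __name__ == "__main__":
    for task, (total, fraction) in summarize(COUNTS).items():
        print(f"{task:<22}{total:>7}  {fraction:.1%} deceptive")
```

Job Scams is the most imbalanced task, with 608 deceptive examples out of 14343 (about 4%), so plain accuracy may be a misleading metric there.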

Our data is structured as five JSON Lines files (one per task). Each line is a JSON object with a text field to classify and a Boolean is_deceptive label.

Sample data point: 

{
  "text":"the Annies List political group supports third-trimester abortions on demand.",
  "is_deceptive":true
}
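
A record in this format could be read as follows. This is a minimal sketch, assuming UTF-8 files with one JSON object per line; the helper names read_jsonl and label_counts are hypothetical, and the record does not list the actual file names, so the demo writes its own temporary file.

```python
import json
import os
import tempfile
from typing import Iterable, Iterator, Tuple


def read_jsonl(path: str) -> Iterator[dict]:
    """Yield one validated record per non-empty line of a JSON Lines file."""
    with open(path, encoding="utf-8") as handle:  # UTF-8 is an assumption
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            valid = (
                isinstance(record, dict)
                and isinstance(record.get("text"), str)
                and isinstance(record.get("is_deceptive"), bool)
            )
            if not valid:
                raise ValueError(f"{path}:{line_number}: unexpected record shape")
            yield record


def label_counts(records: Iterable[dict]) -> Tuple[int, int]:
    """Return (number deceptive, number truthful)."""
    deceptive = truthful = 0
    for record in records:
        if record["is_deceptive"]:
            deceptive += 1
        else:
            truthful += 1
    return deceptive, truthful


if __name__ == "__main__":
    # Demo on a temporary file that holds the sample data point shown above.
    sample = {
        "text": "the Annies List political group supports "
                "third-trimester abortions on demand.",
        "is_deceptive": True,
    }
    with tempfile.TemporaryDirectory() as tmp:
        demo_path = os.path.join(tmp, "demo.jsonl")  # hypothetical file name
        with open(demo_path, "w", encoding="utf-8") as out:
            out.write(json.dumps(sample) + "\n")
        print(label_counts(read_jsonl(demo_path)))
```

Streaming line by line keeps memory use flat, which may matter for the largest file in the record.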

 

Changelog

1.1

  • Fixed flipped labels in the job scams dataset. 

Notes

This work was completed in part with resources provided by the Research Computing Data Core at the University of Houston and supported in part by NSF grants DGE 1433817 and CCF 1950297, ARO award W911NF-20-1-0254, and ONR award N00014-19-S-F009.

Files

Files (277.7 MB)

Checksum                                Size
md5:474f610d8e2a9362a3d8fcccc8599321    212.4 MB
md5:4f2e75fddc34a838facda63b40cac664    18.5 MB
md5:88641d2c43a2f1b9d6941d3d5ccc8c22    36.7 MB
md5:0cb00bfa2809f87c3b026f956e55f65b    1.8 MB
md5:74a49eba495e17c900be49d02a70ecb3    8.4 MB
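
The MD5 checksums listed above can be used to check that a download arrived intact. The sketch below assumes a locally saved copy of a file; the function names md5_of_file and matches_listed_md5 are illustrative, and the file names themselves are not shown in this listing, so any path passed in is the reader's own.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_md5(path: str, listed: str) -> bool:
    """Compare a file against a checksum written as 'md5:<hex>' or '<hex>'."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected
```

For example, matches_listed_md5(local_path, "md5:74a49eba495e17c900be49d02a70ecb3") would be expected to return True for an intact copy of the 8.4 MB file, where local_path is wherever that file was saved.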

Additional details

Related works

Is supplement to
Conference paper: 10.1145/3508398.3519358 (DOI)

References

  • Pawan Kumar Verma, Prateek Agrawal, and Radu Prodan. 2021. WELFake dataset for fake news detection in text data. https://doi.org/10.5281/zenodo.4561253
  • Sokratis Vidros, Constantinos Kolias, Georgios Kambourakis, and Leman Akoglu. 2017. Automatic Detection of Online Recruitment Frauds: Characteristics, Methods, and a Public Dataset. Future Internet 9, 1 (2017).
  • Rakesh M Verma, Victor Zeng, and Houtan Faridi. 2019. Data quality for security challenges: Case studies of phishing, malware, and intrusion detection datasets. In Proc. ACM SIGSAC Conf. on Computer and Communications Security. 2605–2607.
  • Tariq Alhindi, Savvas Petridis, and Smaranda Muresan. 2018. Where is your Evidence: Improving Fact-checking by Justification Modeling. In Proceedings of the First Workshop on Fact Extraction and VERification (FEVER). 85–90.