Published May 25, 2022 | Version 1.1
Open Dataset
Generalized Deception Dataset
Creators
- University of Texas at Austin
- University of California, Berkeley
- University of Houston
Description
We collected labeled datasets for five different deception-detection tasks, all free of licensing restrictions, and converted them to a standard format. We inspected each dataset for quality issues and produced new, cleaned versions.
| Task | # Deceptive | # Truthful |
|---|---|---|
| Product Reviews | 10,493 | 10,481 |
| Phishing | 6,134 | 9,202 |
| Job Scams | 608 | 13,735 |
| Political Statements | 5,669 | 7,167 |
| Fake News | 27,486 | 34,615 |
The data is distributed as five JSON Lines files (one per task); each record contains the text to classify and a Boolean `is_deceptive` label.
Sample data point:
```json
{
  "text": "the Annies List political group supports third-trimester abortions on demand.",
  "is_deceptive": true
}
```
Changelog
1.1
- Fixed flipped labels in the job scams dataset.
Files (277.7 MB)
Additional details
Related works
- Is supplement to: Conference paper, DOI 10.1145/3508398.3519358
References
- Pawan Kumar Verma, Prateek Agrawal, and Radu Prodan. 2021. WELFake Dataset for Fake News Detection in Text Data. Zenodo. https://doi.org/10.5281/zenodo.4561253
- Sokratis Vidros, Constantinos Kolias, Georgios Kambourakis, and Leman Akoglu. 2017. Automatic Detection of Online Recruitment Frauds: Characteristics, Methods, and a Public Dataset. Future Internet 9, 1 (2017).
- Rakesh M. Verma, Victor Zeng, and Houtan Faridi. 2019. Data Quality for Security Challenges: Case Studies of Phishing, Malware, and Intrusion Detection Datasets. In Proceedings of the ACM SIGSAC Conference on Computer and Communications Security. 2605–2607.
- Tariq Alhindi, Savvas Petridis, and Smaranda Muresan. 2018. Where is your Evidence: Improving Fact-checking by Justification Modeling. In Proceedings of the First Workshop on Fact Extraction and VERification (FEVER). 85–90.