Published October 19, 2023 | Version v1
Dataset Open

GPT-3 Curie generated synthetic datasets based on the datasets: Founta, Stormfront, HatEval 2019, Davidson, GermEval 2021, SemEval 2022 Task 4

  • 1. University of Regensburg

Description

This dataset combines six synthetic datasets of toxic or hateful text, each based on the dataset introduced in one of the following papers:

 

"Large scale crowdsourcing and characterization of twitter abusive behavior"

"Hate Speech Dataset from a White Supremacy Forum"

"Automated hate speech detection and the problem of offensive language"

"Semeval-2019 task 5: Multilingual detection of hate speech against immigrants and women in twitter"

"Overview of the GermEval 2021 shared task on the identification of toxic, engaging, and fact-claiming comments"

"Don't patronize me! An annotated dataset with patronizing and condescending language towards vulnerable communities"

 

Each set of samples was generated by a separate GPT-3 Curie model fine-tuned on a single label of the corresponding source dataset. The generated data is unfiltered and will likely need cleaning before it is useful.
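Since the generated text is unfiltered, a first cleaning pass might look like the sketch below. It is only an illustration: the column layout of the CSV files is not documented here, so the function works on plain strings, and the name `clean_generations` and the `min_chars` threshold are assumptions rather than part of the dataset or its authors' pipeline.

```python
import re
from typing import Iterable, List


def clean_generations(texts: Iterable[str], min_chars: int = 10) -> List[str]:
    """Normalize whitespace, drop very short rows, and remove exact duplicates.

    Hypothetical helper: the threshold and the rules are illustrative choices.
    """
    seen = set()
    cleaned = []
    for text in texts:
        # Collapse runs of whitespace (including newlines) into single spaces.
        normalized = re.sub(r"\s+", " ", text).strip()
        if len(normalized) < min_chars:
            continue  # empty or too short to be a usable sample
        key = normalized.casefold()
        if key in seen:
            continue  # case-insensitive exact duplicate of an earlier row
        seen.add(key)
        cleaned.append(normalized)
    return cleaned
```

Further steps, such as removing generations that copy the source data verbatim or checking that a sample still matches its intended label, depend on the downstream task and are not covered by this sketch.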

Files

new_Davidson_hateful_synthetic_data.csv

Files (86.1 MB)

Checksum                                Size
md5:24012e90a17dcf895ccd873b3474075f    8.2 MB
md5:0a9dd05f6eb8fc98243d8d6f4d886aed    8.6 MB
md5:661908b264eb185a0b376f509322eaba    5.0 MB
md5:4134947725625708582cefd0a3ae35fe    4.2 MB
md5:294f6174ab10da4dc090a2ba900c57cd    5.9 MB
md5:a49484e1b314689de61a700041cac723    5.1 MB
md5:68a9ffdbda5e3a691755abb31393dc51    7.1 MB
md5:04152b49e41e27b4d9a2e2779ef43422    7.3 MB
md5:7692bb1d69fa9f52a3d88147a6fbfae5    11.2 MB
md5:bbd44f5720ed8ad4bd09fc114e9b87e2    13.8 MB
md5:5e84a326bb9b2ea027b1c0451f80f431    5.8 MB
md5:2b4af47eb39829777312331f62f8f16f    3.8 MB
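The md5 checksums listed above can be used to confirm that a downloaded file arrived intact. The following is a minimal sketch using Python's standard `hashlib` module; the helper names `md5_of_file` and `matches_listed_md5` are hypothetical, and which checksum belongs to which file has to be taken from the record page itself, since that pairing is not shown in this listing.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_md5(path, listed):
    """Compare a file's digest with a listed value such as 'md5:<hex digest>'."""
    # Accept the value with or without the 'md5:' prefix used in the listing.
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected
```

A mismatch usually points to an incomplete or corrupted download, in which case fetching the file again is the simplest remedy.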

Additional details

Dates

Created
2023-10-19