Dataset Open Access

RP-Mod & RP-Crowd: Moderator- and Crowd-Annotated German News Comment Datasets

Assenmacher, Dennis; Niemann, Marco; Müller, Kilian; Seiler, Moritz V.; Riehle, Dennis M.; Trautmann, Heike

Abuse and hate are pervading social media and the comment sections of many news media companies. These platform providers invest considerable effort in moderating user-generated contributions so that they do not lose readers who are appalled by inappropriate texts. Legislation adds further pressure by making the failure to remove such comments a punishable offence. While (semi-)automated solutions using Natural Language Processing and advanced Machine Learning techniques are becoming increasingly sophisticated, the domain of abusive language detection still struggles because large, well-curated non-English datasets are scarce or not publicly available.

With this work, we publish and analyse the largest annotated German abusive language comment datasets to date. In contrast to existing datasets, we achieve a high labelling standard by conducting a thorough crowd-based annotation study that complements professional moderators' decisions, which are also included in the dataset. We compare and cross-evaluate the performance of baseline algorithms and state-of-the-art transformer-based language models, which are fine-tuned on our datasets and an existing alternative, demonstrating the datasets' usefulness to the community.

The research leading to these results received funding from the federal state of North Rhine-Westphalia and the European Regional Development Fund (EFRE.NRW 2014-2020), Project: MODERAT! (No. CM-2-2-036a).

Files (81.2 MB)

Name                          Size       MD5
CrowdGuru-Demographic.xlsx    27.7 kB    fc0e87dc16071ce1c3f062c44694a4d6
CrowdGuru-Ratings.xlsx        10.9 MB    604ab76a1ce43392be2828868380dc9c
RP-Crowd-1-folds.csv          14.1 MB    00b5eb167a982414198387c193348fef
RP-Crowd-1.csv                14.1 MB    541b7a9521cd8ba04e4ee1a02dd633b4
RP-Crowd-2-folds.csv          4.2 MB     3951b2ca327c8af0875cb0963d53b747
RP-Crowd-2.csv                4.2 MB     5185160d6ee3fcb10d98e8a264a124c7
RP-Crowd-3-folds.csv          1.5 MB     86009f480c3e4a826e140f5cc0ae8407
RP-Crowd-3.csv                1.5 MB     12250a4e4ed17bc80ad724b22df8117a
RP-Crowd-4-folds.csv          454.2 kB   33301b7c98db8a105925aaa4ad75f376
RP-Crowd-4.csv                456.2 kB   cf3150a86a0ed745733cc17c10f2fc7f
RP-Crowd-5-folds.csv          93.1 kB    09fc3b493faaa3bc4aa2a0c5ddeeb6a2
RP-Crowd-5.csv                93.5 kB    15e346bee37f9e38d9c06f79bd5e05ad
RP-Mod-Crowd.csv              23.2 MB    eec963103f9baca53153278b72206fb3
RP-Mod-folds.csv              3.2 MB     a2caffcfa3d7c1e1e2662e41c7f21bb7
RP-Mod.csv                    3.1 MB     2b54e6c10ee1ed27ce4af0afe63208ba
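
The MD5 checksums in the file listing can be used to confirm that a download arrived intact. The following is a minimal sketch, assuming the files were saved under their listed names in a local directory. The directory name `data` and the helper names `md5_of` and `verify` are hypothetical and not part of the dataset; only three of the listed checksums are included, and the rest can be added in the same way.

```python
import hashlib
from pathlib import Path

# Checksums copied from the file listing; extend with the remaining files as needed.
EXPECTED_MD5 = {
    "RP-Mod.csv": "2b54e6c10ee1ed27ce4af0afe63208ba",
    "RP-Mod-folds.csv": "a2caffcfa3d7c1e1e2662e41c7f21bb7",
    "RP-Mod-Crowd.csv": "eec963103f9baca53153278b72206fb3",
}


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(data_dir, expected=EXPECTED_MD5):
    """Map each file name to True (match), False (mismatch), or None (missing)."""
    results = {}
    for name, checksum in expected.items():
        path = Path(data_dir) / name
        results[name] = (md5_of(path) == checksum) if path.is_file() else None
    return results


if __name__ == "__main__":
    # "data" is an assumed download location; adjust to the actual directory.
    labels = {True: "OK", False: "MISMATCH", None: "missing"}
    for name, ok in verify("data").items():
        print(f"{name}: {labels[ok]}")
```

A mismatch usually points to an interrupted or corrupted download, in which case fetching the file again is the simplest remedy.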

Statistics

                    All versions   This version
Views               1,661          1,465
Downloads           2,374          2,364
Data volume         21.9 GB        21.7 GB
Unique views        1,294          1,195
Unique downloads    903            899
