Hybrid Approaches to Detect Comments Violating Macro Norms on Reddit

doi:10.5281/zenodo.2541450

Published January 16, 2019 | Version 1.0

Dataset Open

1. School of Interactive Computing, Georgia Institute of Technology
2. School of Information, University of Michigan

This dataset was generated as an extension of our CSCW 2018 paper:

Eshwar Chandrasekharan, Mattia Samory, Shagun Jhaver, Hunter Charvat, Amy Bruckman, Cliff Lampe, Jacob Eisenstein, and Eric Gilbert. 2018. The Internet’s Hidden Rules: An Empirical Study of Reddit Norm Violations at Micro, Meso, and Macro Scales. Proceedings of the ACM on Human-Computer Interaction 2, CSCW (2018), 32.

Description:

Working with over 2.8M removed comments collected from 100 different communities on Reddit (subreddit names are listed in data/study-subreddits.csv), we identified 8 macro norms, i.e., norms that are widely enforced across most of Reddit. We extracted these macro norms by applying a hybrid approach (classification, topic modeling, and open coding) to comments identified as norm violations within at least 85 of the 100 study subreddits. Finally, we labeled over 40K Reddit comments removed by moderators according to the specific type of macro norm violated, and we make this dataset publicly available (also available on GitHub).

For each of the labeled topics, we identified the top 5000 removed comments that were best fit by the LDA topic model. This yielded over 5000 removed comments that exemplify each type of macro norm violation described in the paper. The removed comments were sorted by topic fit, stored in separate files according to the type of norm violation they represent, and made available in this repository.
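The selection step described above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the per-comment topic weights from an already fitted LDA model are available as a list of lists, and the names `top_k_by_topic_fit`, `comments`, and `doc_topic_weights` are hypothetical.

```python
from typing import List, Sequence


def top_k_by_topic_fit(
    comments: Sequence[str],
    doc_topic_weights: Sequence[Sequence[float]],
    topic_id: int,
    k: int,
) -> List[str]:
    """Return the k comments with the highest weight on one topic.

    Assumption: doc_topic_weights[i][j] is the weight of topic j for
    comment i, as produced by some fitted LDA model (not shown here).
    """
    # Pair each comment with its weight on the topic of interest.
    scored = [
        (weights[topic_id], text)
        for text, weights in zip(comments, doc_topic_weights)
    ]
    # Sort by weight only, best-fitting comments first (stable for ties).
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Toy usage with three comments and two topics.
toy_comments = ["comment a", "comment b", "comment c"]
toy_weights = [[0.1, 0.9], [0.8, 0.2], [0.5, 0.5]]
best_for_topic_0 = top_k_by_topic_fit(toy_comments, toy_weights, topic_id=0, k=2)
```

In the dataset itself, k would be on the order of 5000 per labeled topic.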

10 files covering the 8 macro norms, each containing 5000+ removed comments obtained from Reddit, are stored in data/macro-norm-violations/, split by the macro norm the comments violated. Each line in a file represents one comment that was posted on Reddit between May 2016 and March 2017 and subsequently removed by subreddit moderators for violating community norms. All comments were preprocessed using the script in code/preprocessing-reddit-comments.py, which does the following: 1. removes new lines, 2. converts text to lowercase, and 3. strips numbers and punctuation from comments.
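The three preprocessing steps listed above can be approximated as below. This is a sketch under stated assumptions, not a copy of code/preprocessing-reddit-comments.py: the function name `preprocess_comment` is hypothetical, only ASCII digits and punctuation are stripped, and the final whitespace collapse is an extra step added for tidiness.

```python
import string

# Translation table that deletes ASCII digits and ASCII punctuation.
# Assumption: the original script may treat non-ASCII symbols differently.
_STRIP_TABLE = str.maketrans("", "", string.digits + string.punctuation)


def preprocess_comment(text: str) -> str:
    """Approximate the three documented preprocessing steps."""
    # 1. Remove new lines (replaced by spaces so words do not fuse).
    text = text.replace("\r", " ").replace("\n", " ")
    # 2. Convert text to lowercase.
    text = text.lower()
    # 3. Strip numbers and punctuation.
    text = text.translate(_STRIP_TABLE)
    # Extra step (assumption): collapse runs of whitespace.
    return " ".join(text.split())


cleaned = preprocess_comment("Hello,\nWorld! 123")
```

Applying the same function to new text before comparing it with the released comments keeps both in a similar normalized form.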

Descriptions of the files in data/macro-norm-violations/, each containing 5059 comments that were removed from Reddit for violating macro norms and then preprocessed:

"macro-norm-violations-n10-t0-misogynistic-slurs.csv" - Comments that use misogynistic slurs.
"macro-norm-violations-n15-t2-hatespeech-racist-homophobic.csv" - Comments containing hate speech that is racist or homophobic.
"macro-norm-violations-n10-t3-opposing-political-views-trump.csv", "macro-norm-violations-n15-t10-opposing-political-views-trump.csv" - Comments with opposing political views around Trump (depends on the originating subreddit).
"macro-norm-violations-n10-t4-verbal-attacks-on-Reddit.csv" - Comments containing verbal attacks on Reddit or specific subreddits.
"macro-norm-violations-n10-t5-porno-links.csv" - Comments with pornographic links.
"macro-norm-violations-n10-t8-personal-attacks.csv", "macro-norm-violations-n10-t9-personal-attacks.csv" - Comments containing personal attacks.
"macro-norm-violations-n15-t3-abusing-and-criticisizing-mods.csv" - Comments abusing and criticizing moderators.
"macro-norm-violations-n15-t9-namecalling-claiming-other-too-sensitive.csv" - Comments with name-calling, or claiming that the other person is too sensitive.
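As a rough illustration of how these files might be loaded, the sketch below splits a file name into its parts and reads one comment per line. The helper names (`parse_filename`, `read_comments`, `ViolationFile`) are hypothetical, the `n` and `t` numbers are kept as opaque integers because their meaning is not stated here, and UTF-8 encoding with no header row is an assumption.

```python
import re
from pathlib import Path
from typing import List, NamedTuple, Optional

# Matches names such as "macro-norm-violations-n10-t5-porno-links.csv".
FILENAME_PATTERN = re.compile(r"^macro-norm-violations-n(\d+)-t(\d+)-(.+)\.csv$")


class ViolationFile(NamedTuple):
    n: int      # number after "n" in the file name (meaning not interpreted)
    t: int      # number after "t" in the file name (meaning not interpreted)
    label: str  # remaining slug, e.g. "personal-attacks"


def parse_filename(name: str) -> Optional[ViolationFile]:
    """Split a dataset file name into its parts, or return None if it does not match."""
    match = FILENAME_PATTERN.match(name)
    if match is None:
        return None
    return ViolationFile(int(match.group(1)), int(match.group(2)), match.group(3))


def read_comments(path: Path) -> List[str]:
    """Read one comment per line, skipping blank lines.

    Assumptions: UTF-8 encoding and no header row.
    """
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]
```

For example, `parse_filename("macro-norm-violations-n10-t8-personal-attacks.csv")` yields the label `personal-attacks`, which can serve as a class name when the two personal-attack files are merged.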

More details about the dataset can be found on arXiv: https://arxiv.org/abs/1904.03596

Files

Files (6.0 MB)

Name	MD5	Size
macro-norm-violations-n10-t0-misogynistic-slurs.csv	bd2bbc28b540fc7513b45ae960fc5b3a	378.9 kB
macro-norm-violations-n10-t3-opposing-political-views-trump.csv	c69b30a3687c72fc3917da43bb2948c6	702.8 kB
macro-norm-violations-n10-t4-verbal-attacks-on-Reddit.csv	f3554b4e453d786c8cc89eb2ff5d3d06	595.6 kB
macro-norm-violations-n10-t5-porno-links.csv	fbbafdd28890423ec89aee86437445f1	606.3 kB
macro-norm-violations-n10-t8-personal-attacks.csv	ccfc2c4517d6b5b483bdccaebc3268ab	473.7 kB
macro-norm-violations-n10-t9-personal-attacks.csv	97da0495f9d450aa25647e43f718f5ec	440.8 kB
macro-norm-violations-n15-t10-opposing-political-views-trump.csv	1b916751773a26b142241443eee267e5	696.8 kB
macro-norm-violations-n15-t2-hatespeech-racist-homophobic.csv	97a453236b1d8b37fb26685958d60d9f	741.3 kB
macro-norm-violations-n15-t3-abusing-and-criticisizing-mods.csv	7b34bca007f1624413616c8011ceea16	732.1 kB
macro-norm-violations-n15-t9-namecalling-claiming-other-too-sensitive.csv	3d635e21ddb7b3ab0a16062738a5c430	604.9 kB
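A downloaded file can be checked against the MD5 checksums listed in the table above. The following is a minimal sketch using Python's standard `hashlib`; the helper names `md5_of_file` and `verify_md5` are hypothetical, and the local path in the usage comment is an assumption about where the file was saved.

```python
import hashlib
from pathlib import Path
from typing import Union


def md5_of_file(path: Union[str, Path], chunk_size: int = 1 << 20) -> str:
    """Compute the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until an empty chunk is returned.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_md5(path: Union[str, Path], expected_hex: str) -> bool:
    """Return True if the file's MD5 matches the expected hex string."""
    return md5_of_file(path) == expected_hex.strip().lower()


# Usage (assumes the file was saved to the current directory):
# ok = verify_md5(
#     "macro-norm-violations-n10-t0-misogynistic-slurs.csv",
#     "bd2bbc28b540fc7513b45ae960fc5b3a",
# )
```

MD5 is used here only to detect corrupted or incomplete downloads, which is the purpose of the listed checksums; it is not a safeguard against deliberate tampering.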

Additional details

Funding: National Science Foundation, award 1553376, "CAREER: Machine Learning-Based Approaches Toward Combatting Abusive Behavior in Online Communities"
