Salvaging the Internet Hate Machine: Using the discourse of extremist online subcultures to identify emergent extreme speech

Peeters, Stijn; Hagen, Sal; Das, Partha

doi:10.5281/zenodo.3676483

Published February 20, 2020 | Version 1.0

Dataset Open

1. University of Amsterdam

This dataset accompanies a paper submitted to the WebSci '20 conference. In this paper, we present a lexicon of 'extreme speech' that may be used to detect hate speech and extreme speech on online platforms. We outline a cross-disciplinary research protocol through which this lexicon is initially extracted from a corpus of 3,335,265 posts from 4chan's /pol/ sub-forum. The protocol is a hybrid method: a word2vec model is trained on the corpus, and a small initial seed list of extreme language compiled by experts is then expanded by snowballing through the nearest neighbours of its terms. The choice of corpus is significant, as 4chan is a space of rapid language innovation and obscure extreme vernacular, which complicates generalised approaches. On a corpus from a more mainstream platform (Reddit), our lexicon detects significantly more extreme posts than Hatebase, another popular lexicon, with similar accuracy. Our lexicon and the method of its creation thus contribute to the study of the toxicity of online subcultures similar to 4chan, as well as of more mainstream platforms. As we demonstrate, the lexicon allows for more effective detection of extreme speech in these spaces. The method and the lexicon have also been made available through 4CAT, an open-source web tool for the study of online social platforms. The computational methods and lexicon on offer here can thus be used by a wide academic audience, fostering interdisciplinary approaches to the study of online hate and extreme speech.
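
The snowballing step described above can be illustrated with a minimal sketch. This is not the authors' implementation: the toy two-dimensional vectors stand in for a trained word2vec model, and the function names, the `top_n` and `min_similarity` parameters, and the automatic acceptance of every neighbour above the threshold are all assumptions made for illustration (the paper may well involve expert review at each round).

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def nearest_neighbours(word, vectors, top_n):
    """Return the top_n (word, similarity) pairs closest to `word`."""
    scores = [
        (other, cosine(vectors[word], vec))
        for other, vec in vectors.items()
        if other != word
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:top_n]

def snowball(seeds, vectors, top_n=2, min_similarity=0.9, max_rounds=3):
    """Grow a lexicon from seed words by repeatedly adding close neighbours.

    Hypothetical parameters: the thresholds are illustrative, not taken
    from the paper.
    """
    lexicon = {word for word in seeds if word in vectors}
    frontier = set(lexicon)
    for _ in range(max_rounds):
        candidates = set()
        for word in frontier:
            for neighbour, score in nearest_neighbours(word, vectors, top_n):
                if score >= min_similarity and neighbour not in lexicon:
                    candidates.add(neighbour)
        if not candidates:
            break  # nothing new was found, so the snowball has stopped growing
        lexicon |= candidates
        frontier = candidates  # only expand from newly added words next round
    return lexicon

# Toy embedding space (an assumption): three related terms and two unrelated ones.
TOY_VECTORS = {
    "seed_term": (1.0, 0.0),
    "variant_a": (0.98, 0.05),
    "variant_b": (0.95, 0.15),
    "weather": (0.0, 1.0),
    "football": (0.1, 0.99),
}

expanded = snowball(["seed_term"], TOY_VECTORS)
print(sorted(expanded))
```

With a real model, the `nearest_neighbours` helper would be replaced by the similarity query of whichever word2vec library is used to load `extreme-speech-word2vec.model`.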

The dataset comprises the following items:

- The 4chan corpus from which the extreme speech lexicon was generated (posts from /pol/, 1 October 2019 to 1 November 2019)
- The Reddit corpus used to verify and test the lexicon (posts from the_donald, theredpill, politics and chapotraphouse, 1 October 2019 to 1 November 2019)
- The word2vec model from which the extreme speech lexicon was generated
- The extreme speech lexicon that was generated

Notes

Stijn Peeters received funding from the ODYCCEUS Horizon 2020 project, ERC grant agreement number 732942.

Files

Files (1.4 GB)

Name	MD5	Size
4chan-pol-dataset.csv	e7dd7619b2aac84ba4706edb28d35360	711.1 MB
extreme-speech-word2vec.model	00dd51e0deb2244d04ba1b07ef8207c9	14.9 MB
oilab-extreme-speech-lexicon.csv	a6dd04d904d911e88c085e5b49822e82	8.0 kB
reddit-dataset.csv	6fb74f2539e9ad2679b28eb627668246	659.9 MB
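
The MD5 checksums listed above can be used to confirm that a download completed intact. The following is a minimal sketch using only the Python standard library; the function names and the assumption that all four files sit in a single local directory are illustrative and not part of the dataset's documentation.

```python
import hashlib
from pathlib import Path

# Checksums copied from the file listing above.
EXPECTED_MD5 = {
    "4chan-pol-dataset.csv": "e7dd7619b2aac84ba4706edb28d35360",
    "extreme-speech-word2vec.model": "00dd51e0deb2244d04ba1b07ef8207c9",
    "oilab-extreme-speech-lexicon.csv": "a6dd04d904d911e88c085e5b49822e82",
    "reddit-dataset.csv": "6fb74f2539e9ad2679b28eb627668246",
}

def md5_of_file(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so the large CSV files never need to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_downloads(directory, expected=EXPECTED_MD5):
    """Map each expected file name to True if it exists and its checksum matches."""
    results = {}
    for name, checksum in expected.items():
        path = Path(directory) / name
        results[name] = path.is_file() and md5_of_file(path) == checksum
    return results
```

Calling `verify_downloads("path/to/downloads")` (a hypothetical directory) returns a dictionary in which any `False` value marks a missing or corrupted file.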

Additional details

Funding: European Commission, ODYCCEUS – Opinion Dynamics and Cultural Conflict in European Spaces (grant 732942)
