Published February 20, 2020 | Version 1.0
Dataset Open

Salvaging the Internet Hate Machine: Using the discourse of extremist online subcultures to identify emergent extreme speech

  • 1. University of Amsterdam

Description

This dataset accompanies a paper submitted to the WebSci 20 conference. In this paper, we present a lexicon of 'extreme speech' that may be used to detect hate speech and extreme speech on online platforms. We outline a cross-disciplinary research protocol through which this lexicon is initially extracted from a corpus of 3,335,265 posts from 4chan's /pol/ sub-forum, using a hybrid method that combines word2vec modeling with subsequent snowballing of the nearest neighbours of a small initial expert seed list of extreme language. The choice of corpus is significant, as 4chan is a space of rapid language innovation and obscure extreme vernacular, which complicates generalised approaches. With similar accuracy, our lexicon detects significantly more extreme posts within a corpus from a more mainstream platform (Reddit) than another popular lexicon, Hatebase. Our lexicon and the method of its creation thus contribute to the study of the toxicity of online subcultures similar to 4chan, as well as of more mainstream platforms. As we demonstrate, the lexicon allows for more effective detection of extreme speech in these spaces. The method and the lexicon have also been made available through 4CAT, an open-source web tool for the study of online social platforms. The computational methods and lexicon on offer here can thus be used by a wide academic audience, fostering interdisciplinary approaches to the study of online hate and extreme speech.
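The snowballing step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the toy two-dimensional vectors stand in for the word2vec model, and the function names, the `top_n` neighbour count, the `min_similarity` threshold and the `max_rounds` limit are all assumptions chosen for illustration. Any expert review of candidate terms is not modelled here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_neighbours(vectors, word, top_n):
    """Return the top_n (similarity, word) pairs closest to `word`."""
    scored = [
        (cosine(vectors[word], vec), other)
        for other, vec in vectors.items()
        if other != word
    ]
    scored.sort(reverse=True)
    return scored[:top_n]


def snowball(vectors, seeds, top_n=2, min_similarity=0.9, max_rounds=3):
    """Grow a lexicon from seed terms by repeatedly adding close neighbours.

    All parameter defaults are hypothetical, not values from the paper.
    """
    lexicon = {word for word in seeds if word in vectors}
    frontier = set(lexicon)
    for _ in range(max_rounds):
        candidates = set()
        for word in frontier:
            for similarity, neighbour in nearest_neighbours(vectors, word, top_n):
                if similarity >= min_similarity and neighbour not in lexicon:
                    candidates.add(neighbour)
        if not candidates:
            break  # nothing new was found, so the lexicon has stabilised
        lexicon |= candidates
        frontier = candidates  # only expand from newly added terms
    return lexicon


# Toy embedding space with placeholder words (an assumption, not real data).
toy_vectors = {
    "seed_term": [1.0, 0.0],
    "variant_a": [0.98, 0.1],
    "variant_b": [0.95, 0.2],
    "weather": [0.0, 1.0],
    "sports": [0.1, 0.99],
}

lexicon = snowball(toy_vectors, ["seed_term"])
```

With a trained model, the toy dictionary would be replaced by the model's own vocabulary and vectors; the expansion loop itself would stay the same.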

The dataset comprises the following items:

  • The 4chan corpus from which the extreme speech lexicon was generated (posts from /pol/, 1 October 2019 - 1 November 2019)
  • The Reddit corpus used to verify and test the lexicon (posts from the_donald, theredpill, politics and chapotraphouse, 1 October 2019 - 1 November 2019)
  • The word2vec model from which the extreme speech lexicon was generated
  • The extreme speech lexicon that was generated
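As a rough illustration of how a lexicon like the one listed above might be applied to a corpus, the sketch below flags posts that contain at least one lexicon term. It assumes single-word terms, a simple lower-case tokenizer, and posts already loaded as plain strings; the column layout of the CSV files is not documented on this page, so no file parsing is shown. The names `tokenize` and `flag_posts` and the placeholder terms are hypothetical.

```python
import re

# Lower-case alphanumeric tokens; a deliberately simple, assumed tokenizer.
TOKEN_PATTERN = re.compile(r"[a-z0-9']+")


def tokenize(text):
    """Split a post into lower-case word tokens."""
    return TOKEN_PATTERN.findall(text.lower())


def flag_posts(posts, lexicon):
    """Return one boolean per post: True if any token is a lexicon term."""
    terms = {term.lower() for term in lexicon}
    return [any(token in terms for token in tokenize(post)) for post in posts]


# Placeholder lexicon and posts (assumptions, not taken from the dataset).
example_lexicon = {"termx", "termy"}
example_posts = [
    "An ordinary post about gardening",
    "This one mentions TermX, loudly.",
    "termyard is not a match",
    "Ends with termy.",
]

flags = flag_posts(example_posts, example_lexicon)
flagged_count = sum(flags)
```

Matching on whole tokens rather than substrings avoids flagging longer, unrelated words that merely contain a lexicon term.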

Notes

Stijn Peeters received funding from the ODYCCEUS Horizon 2020 project, ERC grant agreement number 732942.

Files

4chan-pol-dataset.csv

Files (1.4 GB)

md5:e7dd7619b2aac84ba4706edb28d35360 (711.1 MB)
md5:00dd51e0deb2244d04ba1b07ef8207c9 (14.9 MB)
md5:a6dd04d904d911e88c085e5b49822e82 (8.0 kB)
md5:6fb74f2539e9ad2679b28eb627668246 (659.9 MB)
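
The MD5 checksums listed above can be used to confirm that a download completed intact. The following is a minimal sketch using Python's standard `hashlib` module; the helper names and the chunk size are assumptions, and the file path passed in is whatever local path the downloaded file was saved to.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Chunked reads keep memory use low for files of several hundred MB.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file against a checksum given as 'md5:<hex>' or '<hex>'."""
    expected_hex = expected.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected_hex
```

For example, `matches_checksum(local_path, "md5:e7dd7619b2aac84ba4706edb28d35360")` would return `True` only if the file at `local_path` is byte-for-byte identical to the corresponding published file.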

Additional details

Funding

ODYCCEUS – Opinion Dynamics and Cultural Conflict in European Spaces 732942
European Commission