Salvaging the Internet Hate Machine: Using the discourse of extremist online subcultures to identify emergent extreme speech
Description
This dataset accompanies a paper submitted to the WebSci '20 conference. In the paper, we present a lexicon of 'extreme speech' that can be used to detect hate speech and extreme speech on online platforms. We outline a cross-disciplinary research protocol in which the lexicon is first extracted from a corpus of 3,335,265 posts from 4chan's /pol/ sub-forum, using a hybrid method that combines word2vec modelling with subsequent snowballing of the nearest neighbours of a small initial seed list of extreme language compiled by experts. The choice of corpus is significant: 4chan is a space of rapid language innovation and obscure extreme vernacular, which complicates generalised approaches. In a corpus from a more mainstream platform (Reddit), our lexicon detects significantly more extreme posts than another popular lexicon, Hatebase, with similar accuracy. The lexicon and the method of its creation thus contribute to the study of toxicity both in online subcultures similar to 4chan and on more mainstream platforms. As we demonstrate, the lexicon allows for more effective detection of extreme speech in these spaces. The method and the lexicon have also been made available through 4CAT, an open-source web tool for the study of online social platforms. The computational methods and the lexicon offered here can thus be used by a wide academic audience, fostering interdisciplinary approaches to the study of online hate and extreme speech.
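The snowballing step can be illustrated with a minimal Python sketch. This is not the paper's implementation: the toy two-dimensional vectors, the placeholder tokens, the function names, and the `topn`, `min_similarity` and `max_rounds` parameters are all assumptions made for illustration, and any manual review of candidate terms is left out. With a trained gensim word2vec model, the `nearest_neighbours` helper would typically be replaced by `model.wv.most_similar(word, topn=...)`.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_neighbours(vectors, word, topn):
    """Return the topn (token, similarity) pairs closest to `word`."""
    scored = [
        (other, cosine(vectors[word], vec))
        for other, vec in vectors.items()
        if other != word
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


def snowball(vectors, seeds, topn=2, min_similarity=0.9, max_rounds=2):
    """Grow a lexicon from seed terms by repeatedly adding close neighbours."""
    lexicon = {seed for seed in seeds if seed in vectors}
    frontier = set(lexicon)
    for _ in range(max_rounds):
        candidates = set()
        for word in frontier:
            for neighbour, similarity in nearest_neighbours(vectors, word, topn):
                if similarity >= min_similarity and neighbour not in lexicon:
                    candidates.add(neighbour)
        if not candidates:
            break  # nothing new was found, so stop early
        lexicon |= candidates
        frontier = candidates  # only expand newly added terms next round
    return lexicon


# Hypothetical toy embedding space with neutral placeholder tokens.
toy_vectors = {
    "seed_term": (1.0, 0.0),
    "variant_a": (0.95, 0.05),
    "variant_b": (0.9, 0.2),
    "weather": (0.0, 1.0),
    "football": (0.1, 0.9),
}

lexicon = snowball(toy_vectors, ["seed_term"])
print(sorted(lexicon))
```

In this toy space the two variants sit close to the seed and are added in the first round, while the unrelated tokens stay below the similarity threshold. The threshold and the number of rounds control how far the lexicon drifts from the expert seeds.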
The dataset comprises the following items:
- The 4chan corpus from which the extreme speech lexicon was generated (posts from /pol/, 1 October 2019 - 1 November 2019)
- The Reddit corpus used to verify and test the lexicon (posts from the subreddits the_donald, theredpill, politics and chapotraphouse, 1 October 2019 - 1 November 2019)
- The word2vec model from which the extreme speech lexicon was generated
- The extreme speech lexicon that was generated
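As a rough illustration of how a lexicon of this kind might be applied to a corpus of posts, the Python sketch below flags posts that contain at least one lexicon term as a whole word. The file formats of the items above are not assumed here: the posts and terms are hypothetical in-memory placeholders, and the function names and the case-insensitive whole-word matching rule are illustrative choices rather than the procedure used in the paper.

```python
import re


def compile_lexicon(terms):
    """Build one case-insensitive, whole-word pattern from lexicon terms."""
    if not terms:
        raise ValueError("the lexicon must contain at least one term")
    # Longer terms first, so multi-word entries win over their prefixes.
    escaped = sorted((re.escape(term) for term in terms), key=len, reverse=True)
    return re.compile(r"\b(?:" + "|".join(escaped) + r")\b", re.IGNORECASE)


def flag_posts(posts, pattern):
    """Return the posts that contain at least one lexicon term."""
    return [post for post in posts if pattern.search(post)]


# Hypothetical placeholder data; real terms and posts would be loaded
# from the lexicon and corpus files.
terms = ["term_one", "term two"]
posts = [
    "An ordinary post about the weather.",
    "This one uses Term_One in passing.",
    "Mentions term two and nothing else.",
    "A partial match like term_onesie should not count.",
]

pattern = compile_lexicon(terms)
flagged = flag_posts(posts, pattern)
share = len(flagged) / len(posts)
print(f"{len(flagged)} of {len(posts)} posts flagged ({share:.0%})")
```

Whole-word matching keeps substrings of longer words from being counted, at the cost of missing deliberately obfuscated spellings. Running the same function with two different lexicons over one corpus gives a simple basis for comparing how many posts each one flags.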