Published May 1, 2020 | Version v1
Conference paper | Open Access

Toxic, Hateful, Offensive or Abusive? What Are We Really Classifying? An Empirical Analysis of Hate Speech Datasets

  • Universitat Pompeu Fabra

Description

The field of automatic detection of hate speech and related concepts has attracted considerable interest in recent years. Various datasets have been annotated and classified by applying different machine learning algorithms. However, little effort has been made to clarify the categories applied and to homogenize the different datasets. Our study addresses this need. We analyze six publicly available datasets in this field with respect to their similarity and compatibility. We conduct two experiments. First, we attempt to make the datasets compatible by representing the dataset classes as fastText word vectors and analyzing the similarity between classes in an intra- and inter-dataset manner. Second, we submit the chosen datasets to the Perspective API toxicity classifier, which achieves different performance depending on the categories and datasets. One of the main conclusions of these experiments is that many different definitions are being used for equivalent concepts, which makes most of the publicly available datasets incompatible. Grounded in our analysis, we provide guidelines for future dataset collection and annotation.
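The first experiment can be illustrated with a minimal sketch: each annotation class is represented by the average fastText vector of its messages, and classes are then compared with cosine similarity within and across datasets. This is not the authors' exact pipeline; the pretrained model file, dataset layout, and class labels below are assumptions for illustration.

```python
# Sketch of intra-/inter-dataset class similarity with fastText vectors.
# Model file, dataset structure, and labels are illustrative assumptions.
import fasttext
import numpy as np

model = fasttext.load_model("cc.en.300.bin")  # assumed pretrained English model

def class_vector(texts):
    """Average sentence vector over all messages labelled with one class."""
    vecs = [model.get_sentence_vector(t.replace("\n", " ")) for t in texts]
    return np.mean(vecs, axis=0)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical datasets: {dataset_name: {class_label: [messages, ...]}}
datasets = {
    "dataset_A": {"hateful": ["..."], "offensive": ["..."], "neither": ["..."]},
    "dataset_B": {"toxic": ["..."], "abusive": ["..."], "normal": ["..."]},
}

vectors = {
    (ds, label): class_vector(texts)
    for ds, classes in datasets.items()
    for label, texts in classes.items()
}

# Compare every pair of classes, both within and across datasets.
keys = list(vectors)
for i, k1 in enumerate(keys):
    for k2 in keys[i + 1:]:
        print(k1, k2, round(cosine(vectors[k1], vectors[k2]), 3))
```

For the second experiment, messages are scored with the Perspective API's TOXICITY attribute. A request can be sketched roughly as below, following the public API documentation; the API key is a placeholder and error handling is kept minimal.

```python
# Rough sketch of scoring one message with the Perspective API TOXICITY attribute.
import requests

API_KEY = "YOUR_API_KEY"  # placeholder
URL = ("https://commentanalyzer.googleapis.com/v1alpha1/"
       f"comments:analyze?key={API_KEY}")

def toxicity_score(text):
    body = {
        "comment": {"text": text},
        "languages": ["en"],
        "requestedAttributes": {"TOXICITY": {}},
    }
    resp = requests.post(URL, json=body, timeout=30)
    resp.raise_for_status()
    return resp.json()["attributeScores"]["TOXICITY"]["summaryScore"]["value"]

print(toxicity_score("example message"))
```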

Files (243.6 kB)

2020.lrec-1.838.pdf (243.6 kB)
md5:7a16dcf5193ab82f477a0d51b6f8eccf

Additional details

Funding

CONNEXIONs – InterCONnected NEXt-Generation Immersive IoT Platform of Crime and Terrorism DetectiON, PredictiON, InvestigatiON, and PreventiON Services (Grant Agreement No. 786731)
European Commission