HateBR: Large-scale expert annotated dataset of Brazilian Instagram comments for abusive language detection

Francielle Vargas; Isabelle Carvalho; Fabiana Rodrigues de Góes; Thiago A. S. Pardo; Fabricio Benevenuto

doi:10.5281/zenodo.7681303

Published February 27, 2023 | Version v1.0.0

Dataset Open

1. University of São Paulo
2. Federal University of Minas Gerais

The HateBR dataset was collected from the comment sections of Brazilian politicians’ accounts on Instagram and manually annotated by specialists, with high inter-annotator agreement. The corpus consists of 7,000 documents annotated in three layers: a binary classification (offensive versus non-offensive comments), an offensiveness-level classification (highly, moderately, and slightly offensive), and nine hate speech groups (xenophobia, racism, homophobia, sexism, religious intolerance, partyism, apology for the dictatorship, antisemitism, and fatphobia). We also implemented baseline experiments for offensive language and hate speech detection and compared them with a literature baseline. Results show that the baseline experiments on our corpus outperform the current state of the art for the Portuguese language.
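The three annotation layers described above can be modeled as a small data structure. The following is only an illustrative sketch: the class and field names are hypothetical and do not reflect the actual column names in the released files. It also assumes that the offensiveness level and hate speech groups apply only to offensive comments, and that one comment may carry more than one hate speech group. Neither assumption is stated on this page.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet, Optional


class OffensivenessLevel(Enum):
    """Layer 2: the three offensiveness levels named in the description."""
    SLIGHTLY = "slightly offensive"
    MODERATELY = "moderately offensive"
    HIGHLY = "highly offensive"


class HateGroup(Enum):
    """Layer 3: the nine hate speech groups named in the description."""
    XENOPHOBIA = "xenophobia"
    RACISM = "racism"
    HOMOPHOBIA = "homophobia"
    SEXISM = "sexism"
    RELIGIOUS_INTOLERANCE = "religious intolerance"
    PARTYISM = "partyism"
    APOLOGY_FOR_THE_DICTATORSHIP = "apology for the dictatorship"
    ANTISEMITISM = "antisemitism"
    FATPHOBIA = "fatphobia"


@dataclass(frozen=True)
class Annotation:
    """Hypothetical container for one annotated comment (names are illustrative)."""
    offensive: bool                              # layer 1: binary label
    level: Optional[OffensivenessLevel] = None   # layer 2
    hate_groups: FrozenSet[HateGroup] = frozenset()  # layer 3 (assumed multi-label)

    def __post_init__(self):
        # Assumption: layers 2 and 3 are only meaningful for offensive comments.
        if not self.offensive and (self.level is not None or self.hate_groups):
            raise ValueError("non-offensive comments cannot carry a level or hate group")


example = Annotation(
    offensive=True,
    level=OffensivenessLevel.HIGHLY,
    hate_groups=frozenset({HateGroup.SEXISM}),
)
```

Using enumerations keeps the label vocabulary closed, so a typo in a label name fails at construction time rather than silently creating a new class.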

Notes

This resource is available only for academic purposes. The Sinch company retains commercial copyrights.

Files

Files (1.7 MB)

Name	MD5	Size
franciellevargas/HateBR-v1.0.0.zip	d28944fb30d16a1262e1f8cbe934b18f	1.7 MB
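
The MD5 value listed for the archive can be used to check a download for corruption. A minimal sketch using only the Python standard library follows; the local file name is an assumption about where the archive was saved, and the helper names are illustrative.

```python
import hashlib
from pathlib import Path

# MD5 checksum listed on this record for HateBR-v1.0.0.zip.
EXPECTED_MD5 = "d28944fb30d16a1262e1f8cbe934b18f"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_checksum(path, expected=EXPECTED_MD5):
    """True if the file's MD5 digest equals the expected value (case-insensitive)."""
    return md5_of_file(path) == expected.lower()


if __name__ == "__main__":
    archive = Path("HateBR-v1.0.0.zip")  # assumed local download path
    if archive.exists():
        print("checksum OK" if matches_published_checksum(archive) else "checksum MISMATCH")
    else:
        print(f"{archive} not found; download it first")
```

MD5 is adequate for detecting accidental corruption in transit, but it is not a safeguard against deliberate tampering.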

Additional details

Is supplement to: https://github.com/franciellevargas/HateBR/tree/v1.0.0

	All versions	This version
Views	251	71
Downloads	40	10
Data volume	41.2 MB	17.4 MB
