Published June 27, 2024 | Version v0.3.3 -- Purple Butterfly
Dataset Open

Annotated Data in Spanish for Toxicity and Insults in Digital Social Networks

  • 1. Leiden University
  • 2. University of California, Irvine
  • 3. Universidad del Desarrollo
  • 4. Training Data Lab

Description

This repository contains datasets and materials for building a gold standard on toxicity and incivility in the digital sphere. The gold standard is based on human coding and is intended to benchmark algorithmic classification tasks with transformers and LLMs. Labelling is currently 62% complete.

We are labelling samples from two novel datasets of political digital interactions on Twitter (rebranded as X). The first set comprises almost 5 million data points from three Latin American protest events: (a) protests against coronavirus and judicial reform measures in Argentina in August 2020; (b) protests against education budget cuts in Brazil in May 2019; and (c) the social outburst in Chile stemming from protests against the underground fare hike in October 2019. Because we aim to build a gold standard for digital interactions in Spanish, we prioritise the Argentinian and Chilean data. The second set contains more than 31 million messages and more than 9 million interactions between 2010 and 2022, covering the election of members of the first Constitutional Convention in Chile, the drafting process, and the referendum in which the proposal was rejected.

This project is generously funded by the OpenAI Academic Programme and the 2024 FAE-UDP Research Grant, and partially by the St Hilda's College Muriel Wise Fund at the University of Oxford. The Training Data Lab research group also provides logistical support.

Files

training-datalab/gold-standard-toxicity-v0.3.3.zip

Files (1.3 MB)

Additional details

Software

Programming language
R
Development Status
WIP