hateRADAR-es: Annotated Corpus for Anti-Refugee Hate Speech Detection in Spanish (Training and Test Sets)

Pachón Álvarez, Victoria; Mata, Jacinto; Gualda, Estrella; Rebollo-Díaz, Carolina

doi:10.5281/zenodo.17259982

Published October 3, 2025 | Version 1

Dataset Restricted

1. Universidad de Huelva
2. Universidad de Huelva - Escuela Técnica Superior de Ingeniería

The hateRADAR-es (Anti-Hate Refugees Annotated Dataset and Analysis Resource) dataset is a Spanish-language corpus of Twitter messages focused on hate speech and negative discourse directed towards refugees. It was manually annotated by expert sociologists and social workers to ensure quality and reliability in the identification of anti-refugee narratives. The dataset contains 5,000 tweets, divided into training (4,000) and test (1,000) sets (hateRADAR-es_train and hateRADAR-es_test), with binary labels for the detection of hate speech (0 = no hate speech, 1 = hate speech). Tweets were collected between December 2015 and December 2016 using NodeXL, filtered for the keyword “refugiados” (refugees), and curated to remove duplicates and retweets.

hateRADAR-es provides a high-quality benchmark for research in Natural Language Processing (NLP), machine learning, computational social science, and digital humanities. It supports studies on hate speech detection, implicit vs. explicit hostility, and narrative analysis of anti-refugee discourse.

This dataset was developed within the project [NON-CONSPIRA-HATE!] (PID2021-123983OB-I00). hateRADAR-es is available to the scientific community to encourage further research. This data is described in detail in the article:

Mata, J., Gualda, E., Pachón, V., Rebollo-Díaz, C., & Domínguez, J. L. (2025). From data to detection: Developing a corpus and training language models for the identification of anti-refugee narratives in Spanish. Array, 100526. https://doi.org/10.1016/j.array.2025.100526

Technical info

Provenance

This README.txt file was generated on 2025-10-03 by ESTRELLA GUALDA (ESEIS/COIDESO/CISCOA-Lab), JACINTO MATA (I2C/CITES), Universidad de Huelva, Spain, VICTORIA PACHÓN (I2C/CITES), Universidad de Huelva, Spain, and CAROLINA REBOLLO-DÍAZ (ESEIS/COIDESO/CISCOA-Lab)

General Information

Title of Dataset:

hateRADAR-es: Annotated Corpus for Anti-Refugee Hate Speech Detection in Spanish (Training and Test Sets)

Authors:

- VICTORIA PACHÓN, Universidad de Huelva, I2C/CITES, Escuela Técnica Superior de Ingeniería, Campus El Carmen, Avda. de las Fuerzas Armadas, s/n. - 21007 Huelva, Spain, vpachon@uhu.es, https://orcid.org/0000-0001-5329-9622

- JACINTO MATA, Universidad de Huelva, I2C/CITES, Escuela Técnica Superior de Ingeniería, Campus El Carmen, Avda. de las Fuerzas Armadas, s/n. - 21007 Huelva, Spain, mata@uhu.es, https://orcid.org/0000-0001-5329-9622

- ESTRELLA GUALDA, Universidad de Huelva, ESEIS/COIDESO/CISCOA-Lab, Facultad de Trabajo Social, Avda. Tres de Marzo, s/n, 21007-Huelva, estrella@uhu.es, ORCID: https://orcid.org/0000-0003-0220-2135

- CAROLINA REBOLLO-DÍAZ, Universidad de Huelva, ESEIS/COIDESO/CISCOA-Lab, Facultad de Trabajo Social, Avda. Tres de Marzo, s/n, 21007-Huelva, carolina.rebollo@dstso.uhu.es, ORCID: https://orcid.org/0000-0003-1511-656X

Principal Investigator Project PID2021-123983OB-I00 [NON-CONSPIRA-HATE!]:

- ESTRELLA GUALDA, Universidad de Huelva, ESEIS/COIDESO/CISCOA-Lab, Facultad de Trabajo Social, Avda. Tres de Marzo, s/n, 21007-Huelva, estrella@uhu.es, ORCID: http://orcid.org/0000-0003-0220-2135

Corpus Overview

The hateRADAR-es (Anti-Hate Refugees Annotated Dataset and Analysis Resource) corpus is a Spanish-language dataset of Twitter messages designed to support the study and automatic detection of hate speech and xenophobic discourse directed at refugees. The dataset was manually annotated by domain experts (sociologists and social workers) to ensure high-quality labeling, and is composed of two partitions: a training set (hateRADAR-es_train) and a test set (hateRADAR-es_test).

This data served as the fundamental basis for the study presented in the article:

- Mata, J., Gualda, E., Pachón, V., Rebollo-Díaz, C., Domínguez, J. L. (2025). From data to detection: Developing a corpus and training language models for the identification of anti-refugee narratives in Spanish. Array, 100526. https://doi.org/10.1016/j.array.2025.100526

Keywords:

Deep learning, Language models, Transformers, Social media, Twitter, Hate speech, Refugees, Immigrants, Online Hate Speech, Conspiracy Theories, Computational Social Science, Computational Sociology.

Data collection and preprocessing

Thousands of tweets were extracted daily using the search term “refugees” in six different languages: Spanish (“refugiados”), English (“refugees”), German (“fluechtlinge”), French (“réfugiés”), Italian (“rifugiati”) and Portuguese (“refugiados”), during the period from December 2015 to December 2016. The multilingual harvest was performed using NodeXL. Daily extractions were merged by language, and duplicates were removed using the unique tweet ID. For the Spanish language, the initial collection contained 355,810 tweets; after excluding retweets to avoid repetition, the corpus comprised 90,144 original tweets. From this Spanish collection, a representative subset of 5,000 tweets was selected for manual annotation.
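The two cleaning steps described above (deduplication by tweet ID, then exclusion of retweets) can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the field names id_tweet and text are borrowed from the released files, and recognising retweets by the "RT @" prefix is an assumption, since the README does not say how retweets were identified in the NodeXL export.

```python
def dedupe_and_drop_retweets(rows):
    """Keep the first row per tweet ID, then drop rows that look like retweets."""
    seen_ids = set()
    kept = []
    for row in rows:
        tweet_id = row["id_tweet"]
        if tweet_id in seen_ids:
            continue  # same tweet captured by more than one daily extraction
        seen_ids.add(tweet_id)
        # Assumption: a retweet is recognised by the classic "RT @" prefix.
        if row["text"].lstrip().startswith("RT @"):
            continue
        kept.append(row)
    return kept


# Made-up rows, for illustration only.
example_rows = [
    {"id_tweet": "1", "text": "texto de ejemplo sobre refugiados"},
    {"id_tweet": "1", "text": "texto de ejemplo sobre refugiados"},
    {"id_tweet": "2", "text": "RT @user: texto de ejemplo sobre refugiados"},
    {"id_tweet": "3", "text": "otro texto de ejemplo"},
]
cleaned = dedupe_and_drop_retweets(example_rows)
print([row["id_tweet"] for row in cleaned])
```

IDs are compared as strings so that long numeric identifiers are never altered by numeric conversion.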

Corpus Annotation

The dataset was created through a meticulous labelling process involving 5,000 tweets. Each tweet was carefully categorized to form a training dataset suitable for the supervised classification task. In a first stage, the dataset was labelled by two experts in sociology and social work with the initial aim of distinguishing between tweets containing hate speech, racist or xenophobic discourse towards refugees (label “1”), and tweets that did not contain such discourse (label “0”). For the manual labelling, we used the Doccano tool (https://github.com/doccano/doccano), an open-source text annotation platform designed for human annotators. Given the subjective nature and the difficulty of the task, manual labelling of this dataset was supported by an annotation guide specifically designed by domain experts (Gualda and Rebollo-Díaz, 2024):

• Gualda, E., Rebollo-Díaz, C., 2024. Guía de anotación – libro de códigos para la detección del discurso de odio hacia inmigrantes y refugiados, versión 3. URL: https://rabida.uhu.es/dspace/handle/10272/23340. Universidad de Huelva. Proyecto PID2021-123983OB-I00 [NONCONSPIRAHATE!].

Each tweet was labeled with one of two classes:

0: No hate speech

1: Hate speech, racist or xenophobic discourse against refugees

The overall inter-annotator agreement during this phase reached a Cohen’s Kappa coefficient of 0.66, which we consider satisfactory. Out of the 5,000 labelled tweets, the annotators agreed on 85% and disagreed on 15%. In a subsequent stage, the tweets on which the annotators did not agree, or which raised uncertainties, were analysed and evaluated by a third person, also an expert in the sociology of migration and social work, to resolve the disagreements. This review and discussion of the most challenging cases helped to address certain subtleties of tweets in this domain.
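For readers who want to reproduce this kind of agreement figure on their own annotations, the following is a minimal sketch of Cohen's Kappa for two annotators, kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance. The function names and the toy labels are illustrative and are not part of the dataset. Rearranging the same formula, the reported figures (p_o = 0.85, kappa = 0.66) imply a chance agreement of roughly 0.56.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's Kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both annotators.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the annotators' marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when both annotators use a single label")
    return (observed - expected) / (1 - expected)


def implied_chance_agreement(observed, kappa):
    """Solve kappa = (p_o - p_e) / (1 - p_e) for p_e."""
    return (observed - kappa) / (1 - kappa)


# Toy labels for ten items (not taken from the corpus).
annotator_1 = [1, 1, 0, 0, 1, 0, 0, 0, 1, 0]
annotator_2 = [1, 0, 0, 0, 1, 0, 1, 0, 1, 0]
print(round(cohen_kappa(annotator_1, annotator_2), 3))
print(round(implied_chance_agreement(0.85, 0.66), 3))
```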

Files Description

The hateRADAR-es dataset contains a collection of 5,000 tweets, exhaustively labelled for detecting hate speech, racist or xenophobic discourse towards refugees. The dataset is divided into two sets to ensure the reproducibility of experiments: a training dataset and a testing dataset.
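The class counts reported for the two files in this README add up to 3,810 non-hate and 1,190 hate tweets, and both files show the same 76.2% / 23.8% proportions, which is consistent with a stratified 80/20 split. The sketch below shows one way such a split can be produced in Python. It is an illustration under that assumption, not the authors' procedure; the function name, the seed and the 0.2 test fraction are hypothetical choices.

```python
import random


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_indices, test_indices) with the same class ratio in both."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    by_label = {}
    for index, label in enumerate(labels):
        by_label.setdefault(label, []).append(index)
    train_idx, test_idx = [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Labels reconstructed from the counts in this README (3,810 zeros, 1,190 ones).
all_labels = [0] * 3810 + [1] * 1190
train_idx, test_idx = stratified_split(all_labels)
print(len(train_idx), sum(all_labels[i] for i in train_idx))
print(len(test_idx), sum(all_labels[i] for i in test_idx))
```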

Both datasets are provided in CSV format and include the following fields:

id_tweet: The unique identifier for each tweet. You can use this ID to retrieve individual tweets through the Twitter API.

text: Represents the content of the tweet.

label: Binary label where 0 represents "No hate speech" and 1 represents "Hate speech detected".
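As an illustration of how the three fields might be read and checked, here is a minimal loader using only the Python standard library. It assumes a comma-separated file with a header row containing exactly the field names listed above and UTF-8 encoding; neither the delimiter nor the encoding is stated in this README, so adjust them if the files differ. The function name read_hateradar_csv is hypothetical.

```python
import csv
import io

EXPECTED_FIELDS = ("id_tweet", "text", "label")


def read_hateradar_csv(handle):
    """Read rows from an open CSV handle and validate the expected schema."""
    reader = csv.DictReader(handle)
    missing = [f for f in EXPECTED_FIELDS if f not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    rows = []
    for row in reader:
        if row["label"] not in ("0", "1"):
            raise ValueError(f"line {reader.line_num}: unexpected label {row['label']!r}")
        rows.append({
            "id_tweet": row["id_tweet"],  # kept as a string so the ID is preserved exactly
            "text": row["text"],
            "label": int(row["label"]),
        })
    return rows


# In practice (file name from this README; encoding is an assumption):
#   with open("hateRADAR-es_train.csv", newline="", encoding="utf-8") as f:
#       train_rows = read_hateradar_csv(f)

# Self-contained demo with made-up placeholder rows.
sample = io.StringIO(
    "id_tweet,text,label\n"
    '1001,"@user texto de ejemplo, con coma",0\n'
    "1002,@user otro texto de ejemplo,1\n"
)
demo_rows = read_hateradar_csv(sample)
print(demo_rows[0])
```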

Train Dataset

Name: hateRADAR-es_train.csv

Number of Rows: 4000. This set is intended for model training and validation, ensuring comprehensive learning and robustness against overfitting.

Size: 516.24 KB

Class Distribution:

0 (No hate speech): 3048 entries, accounting for 76.2% of the dataset.

1 (Hate speech detected): 952 entries, making up 23.8% of the dataset.
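Because roughly three out of four training tweets belong to class 0, users may want to verify the distribution after loading the file and, optionally, compensate for the imbalance during training. The sketch below recomputes the percentages above from the reported counts and derives inverse-frequency class weights, n_samples / (n_classes * class_count). This weighting is a common heuristic offered here as an assumption; the README does not state whether the authors used class weighting.

```python
from collections import Counter


def class_distribution(labels):
    """Map each label to (count, share of the dataset)."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: (count, count / total) for label, count in sorted(counts.items())}


def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: total / (len(counts) * count) for label, count in sorted(counts.items())}


# Labels reconstructed from the counts reported for hateRADAR-es_train.csv.
train_labels = [0] * 3048 + [1] * 952
distribution = class_distribution(train_labels)
weights = inverse_frequency_weights(train_labels)
for label, (count, share) in distribution.items():
    print(label, count, f"{share:.1%}", round(weights[label], 3))
```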

Test Dataset

Name: hateRADAR-es_test.csv

Number of Rows: 1000. This set is used for model evaluation, providing an unbiased assessment of the model's performance on unseen data.

Class Distribution:

0 (No hate speech): 762 entries, representing 76.2% of the dataset.

1 (Hate speech detected): 238 entries, constituting 23.8% of the dataset.
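Given this distribution, accuracy alone can be misleading on the test set: a classifier that always predicts class 0 would score 76.2% accuracy while detecting no hate speech at all. The following sketch computes accuracy together with precision, recall and F1 for the positive class, and applies them to that majority-class baseline. The function name and the choice of metrics are illustrative and are not taken from the accompanying article.

```python
def binary_report(y_true, y_pred, positive=1):
    """Accuracy plus precision, recall and F1 for the positive class."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Majority-class baseline on labels reconstructed from the reported test counts.
test_labels = [0] * 762 + [1] * 238
baseline = binary_report(test_labels, [0] * len(test_labels))
print(baseline)
```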

Privacy and preprocessing

To protect user privacy and comply with data protection considerations, original usernames were replaced with the placeholder @user. The dataset contains tweet IDs so researchers may rehydrate content following Twitter’s terms and applicable policies; access is provided under the conditions described below.
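The username replacement described above can be approximated with a single regular expression. This is a sketch of one plausible approach, not the authors' actual preprocessing script: the pattern and the function name mask_mentions are assumptions. The lookbehind keeps the "@" inside e-mail addresses untouched.

```python
import re

# An "@" that is not preceded by a word character or another "@",
# followed by one or more word characters (letters, digits, underscore).
MENTION = re.compile(r"(?<![\w@])@\w+")


def mask_mentions(text):
    """Replace every @username mention with the placeholder @user."""
    return MENTION.sub("@user", text)


print(mask_mentions("Hola @Juan_23 y @maria!"))
```

Applying the function to already masked text leaves it unchanged, so it is safe to run more than once.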

Research Opportunities with the hateRADAR-es corpus

We aim to offer the scientific community a valuable resource for advancing automatic detection of anti-refugee hate speech. We are confident that this dataset will be an invaluable resource for researchers who are seeking to understand the impact of hate speech on refugees. By providing access to this data, we hope to contribute to a greater understanding of this issue and to promote positive change in our communities.

The hateRADAR-es dataset is available for research purposes. It comprises both training and test sets for anti-refugee discourse in Spanish and is a valuable resource for cross-disciplinary research. For instance:

Computer Science and AI (NLP):

Hate Speech Detection: Develop and fine-tune algorithms (e.g., Transformer-based models) for the automatic classification of anti-refugee hate content in Spanish.

Implicit vs. Explicit Hate: Research methods to differentiate between explicit hateful language and implicit, subtle toxicity found in online messages.

Model Generalization: Use the corpus to test if models trained on a specialized domain (anti-refugee) can generalize effectively to other types of hate speech.

Cross-lingual Transfer: Leverage this Spanish corpus as a resource for transfer learning to aid in hate speech detection efforts in low-resource languages.

Social Sciences and Digital Humanities:

Narrative Analysis and Framing: Analyze the rhetoric, linguistic patterns, and key narrative frameworks used to justify or spread hostility toward refugees.

Actor and Network Mapping: Identify and study the social and political actors (users, groups, organizations) responsible for the dissemination and amplification of anti-refugee narratives.

Annotation Bias Studies: Investigate the subjectivity and reliability of expert human annotation to improve future data collection methods for sensitive topics.

Contextual Analysis: Study the link between real-world events, policy changes, or crises and the subsequent spikes in online hate speech directed at refugees.

Funding

These data belong to the I+D+i Project titled “Conspiracy Theories and Online Hate Speech: Comparison of Patterns in Narratives and Social Networks about COVID-19, Immigrants, Refugees, and LGBTIQ+ People [NON-CONSPIRA-HATE!]”, PID2021-123983OB-I00, funded by MCIN/AEI/10.13039/501100011033/ and by “ERDF/EU”. Principal Investigator: Estrella Gualda. University of Huelva, ESEIS/COIDESO/CISCOA-Lab, Spain. We are also grateful for the support of our research groups: “Estudios Sociales e Intervención Social” (GrupoESEIS) and “Ingeniería de la Información y el Conocimiento” (I2C), and also the research centers “Pensamiento Contemporáneo e Innovación para el Desarrollo Social” (COIDESO), and the “Centro de Investigación en Tecnología, Energía y Sostenibilidad” (CITES), at the University of Huelva.

Other details on the Annotation Process, Pilot and Construction of the Dataset

Pilot Phase

The annotation process began with a pilot of 500 tweets. These tweets were selected from an original dataset of ~90,000 Spanish tweets (collected via NodeXL) and had been previously coded with Atlas.ti in a doctoral thesis (Rebollo-Díaz, 2021) and other publications. The pilot aimed to test and refine the annotation guide, focusing on anti-refugee hate speech.

- Rebollo-Díaz, C., 2021. Tuiteando sobre refugiados: una comparación internacional de discursos, imaginarios y representaciones sociales. Ph.D. thesis. URL: http://hdl.handle.net/10272/20113.

Test: A first systematic test with 200 tweets involved annotators Laura Cabrera Álvarez, Jonás González Díaz, Carolina Rebollo-Díaz (CRD), Elena Ruiz Ángel, and Francisco Javier Santos Fernández.

Tool setup and training: The pilot helped validate the guide and train annotators on the use of Doccano, installed on University of Huelva servers (http://nonconspirahate.uhu.es:8000/) by Jacinto Mata and Victoria Pachón.

Supervision and training: Estrella Gualda (EG) & Carolina Rebollo-Díaz.

Full Annotation

After refining the guide, 5,000 tweets were annotated following the new protocol.

- Annotators: Jonás González Díaz and Laura Cabrera Álvarez

- Referee: Carolina Rebollo-Díaz

- Supervision and training on the annotation guide: Estrella Gualda & Carolina Rebollo-Díaz

Dataset Construction Process

Data selection:

CRD collected the original 90,000 Spanish tweets.

EG & CRD performed a qualitative selection of tweets coded in Atlas.ti, to balance hate and non-hate tokens.

Random sampling:

EG applied R scripts to filter 2,500 hate-related and 2,500 non-hate tweets for annotation.

Preparation for annotation:

EG formatted the selected tweets for Doccano.

Building of the hateRADAR-es dataset:

Jacinto Mata & Victoria Pachón adapted the data to Doccano and built the hateRADAR-es dataset, which served as the input for model development in the paper: “From data to detection: Developing a corpus and training language models for the identification of anti-refugee narratives in Spanish.”
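The random sampling step above was carried out by the authors with R scripts, which are not included in this record. As a rough Python equivalent, the sketch below draws the same number of items at random from each group of a pre-coded pool. The function name, the group keys, the seed and the toy pool are hypothetical; only the target of 2,500 items per group comes from the description above.

```python
import random


def sample_per_group(items, group_of, n_per_group, seed=0):
    """Draw n_per_group items at random, without replacement, from each group."""
    rng = random.Random(seed)  # fixed seed so the draw can be repeated
    groups = {}
    for item in items:
        groups.setdefault(group_of(item), []).append(item)
    sample = []
    for key in sorted(groups):
        pool = groups[key]
        if len(pool) < n_per_group:
            raise ValueError(f"group {key!r} has only {len(pool)} items")
        sample.extend(rng.sample(pool, n_per_group))
    return sample


# Toy pool of (tweet_id, pre_coded_group) pairs; the real pool is not reproduced here.
pool = [(i, "hate" if i % 4 == 0 else "non-hate") for i in range(20000)]
selected = sample_per_group(pool, group_of=lambda item: item[1], n_per_group=2500)
print(len(selected))
```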

Contact:

If you have any questions, feedback, or need further information about the hateRADAR-es dataset, please feel free to contact the project account: nonconspirahate@uhu.es.

Notes

Sharing Access Information

Data availability: This dataset has Restricted Access. Nevertheless, the data are available upon reasonable request to the authors. Researchers interested in accessing the data must complete a request form and send it through Zenodo or by email to nonconspirahate@uhu.es [Subject: hateRADAR-es Annotated Dataset].

This form includes ethical commitments concerned with Twitter data, and the obligation to properly cite the data source. Access to the data is subject to the approval of the request and compliance with Twitter's data protection and ethical guidelines.

Data Request Form:

Include the following information to have access to the dataset:

Applicant's Name:

Institution:

Email:

Purpose of the Research & Exploitation of Data:

Declare Ethical Commitments:

1. I commit to using the data solely for the purposes specified in this request.

2. I commit not to share the data with third parties without prior consent from the research team.

3. I commit to complying with all data protection and ethical guidelines of Twitter.

4. I commit to properly citing the data source in all publications and presentations derived from its use.

Citation:

- Pachón, V. P., Mata, J., Gualda, E. & Rebollo-Díaz, C. (2025). hateRADAR-es: Annotated Corpus for Anti-Refugee Hate Speech Detection in Spanish (Training and Test Sets) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.17259982

Applicant's Signature:

Date:

Rights and permissions: The “hateRADAR-es annotated dataset” is subject to a specific usage license. Researchers interested in accessing the dataset must adhere to the terms of the chosen license. The dataset is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0

Citation to publications that cite or use the data:

- Mata, J., Gualda, E., Pachón, V., Rebollo-Díaz, C., Domínguez, J. L. (2025). From data to detection: Developing a corpus and training language models for the identification of anti-refugee narratives in Spanish. Array, 100526. https://doi.org/10.1016/j.array.2025.100526

Files

Restricted

The record is publicly accessible, but files are restricted to users with access.

Additional details

Is part of: Journal article: 10.1016/j.array.2025.100526 (DOI)

Agencia Estatal de Investigación
Conspiracy Theories and Online Hate Speech: Comparison of patterns in narratives and social networks about COVID 19, immigrants, refugees and LGBTI people [NON CONSPIRA HATE!] PID 2021-123983OB-I00

Available: 2025-10-04
