Technical info
Provenance
This README.txt file was generated on 2025-10-03 by ESTRELLA GUALDA (ESEIS/COIDESO/CISCOA-Lab, Universidad de Huelva, Spain), JACINTO MATA (I2C/CITES, Universidad de Huelva, Spain), VICTORIA PACHÓN (I2C/CITES, Universidad de Huelva, Spain), and CAROLINA REBOLLO-DÍAZ (ESEIS/COIDESO/CISCOA-Lab, Universidad de Huelva, Spain).
General Information
Title of Dataset:
hateRADAR-es: Annotated Corpus for Anti-Refugee Hate Speech Detection in Spanish (Training and Test Sets)
Authors:
Principal Investigator Project PID2021-123983OB-I00 [NON-CONSPIRA-HATE!]: Estrella Gualda, Universidad de Huelva, Spain
Corpus Overview
The hateRADAR-es (Anti-Hate Refugees Annotated Dataset and Analysis Resource) corpus is a Spanish-language dataset of Twitter messages designed to support the study and automatic detection of hate speech and xenophobic discourse directed at refugees. The dataset was manually annotated by domain experts (sociologists and social workers) to ensure high-quality labeling, and is composed of two partitions: a training set (hateRADAR-es_train) and a test set (hateRADAR-es_test).
These data served as the basis for the study presented in the article:
- Mata, J., Gualda, E., Pachón, V., Rebollo-Díaz, C., Domínguez, J. L. (2025). From data to detection: Developing a corpus and training language models for the identification of anti-refugee narratives in Spanish. Array, 100526.
https://doi.org/10.1016/j.array.2025.100526
Keywords:
Deep learning, Language models, Transformers, Social media, Twitter, Hate speech, Refugees, Immigrants, Online Hate Speech, Conspiracy Theories, Computational Social Science, Computational Sociology.
Data collection and preprocessing
Thousands of tweets were extracted daily using the search term for “refugees” in six languages: Spanish (“refugiados”), English (“refugees”), German (“fluechtlinge”), French (“réfugiés”), Italian (“rifugiati”), and Portuguese (“refugiados”), covering the period from December 2015 to December 2016. The multilingual harvest was performed with NodeXL. Daily extractions were merged by language, and duplicates were removed using the unique tweet ID. For Spanish, the initial collection contained 355,810 tweets; after excluding retweets to avoid repetition, the corpus comprised 90,144 original tweets. From this Spanish collection, a representative subset of 5,000 tweets was selected for manual annotation.
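As a rough illustration of this preprocessing, the following Python/pandas sketch merges daily exports and removes duplicates and retweets; the directory layout and column names (id_tweet, text) are our assumptions, not the original NodeXL pipeline:

    import glob
    import pandas as pd

    # Merge the daily exports for one language (directory name is hypothetical)
    daily_files = glob.glob("exports_es/*.csv")
    tweets = pd.concat((pd.read_csv(f) for f in daily_files), ignore_index=True)

    # Remove duplicates across days using the unique tweet ID
    tweets = tweets.drop_duplicates(subset="id_tweet")

    # Exclude retweets, keeping only original tweets
    tweets = tweets[~tweets["text"].str.startswith("RT @", na=False)]
    print(f"{len(tweets)} original tweets")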
Corpus Annotation
The dataset was created through a meticulous labelling process involving 5,000 tweets. Each tweet was carefully categorized to form a training dataset suitable for the supervised classification task. In a first stage, the dataset was labelled by two experts in sociology and social work, with the initial aim of distinguishing between tweets containing hate speech or racist or xenophobic discourse towards refugees (label “1”) and tweets that did not (label “0”). The manual labelling by human experts was carried out with the Doccano tool (https://github.com/doccano/doccano), an open-source text annotation platform. Given the subjective nature and difficulty of the task, the labelling was supported by an annotation guide specifically designed by domain experts (Gualda and Rebollo-Díaz, 2024):
• Gualda, E., Rebollo-Díaz, C., 2024. Guía de anotación – libro de códigos para la detección del discurso de odio hacia inmigrantes y refugiados, versión 3 [Annotation guide and codebook for the detection of hate speech towards immigrants and refugees, version 3]. Universidad de Huelva. Proyecto PID2021-123983OB-I00 [NON-CONSPIRA-HATE!]. URL: https://rabida.uhu.es/dspace/handle/10272/23340.
Each tweet was labeled with one of two classes:
0: No hate speech
1: Hate speech, racist or xenophobic discourse against refugees
Inter-annotator agreement in this first stage reached a satisfactory Cohen’s kappa of 0.66. Out of the 5,000 labelled tweets, the annotators agreed on 85% and disagreed on 15%. In a subsequent stage, the tweets on which the annotators disagreed, or which raised uncertainties, were meticulously analysed and evaluated by a third person, also an expert in sociology of migrations and social work, to resolve the disagreements. This thorough review and discussion of the most challenging classification cases helped address certain subtleties of tweets posted on Twitter in this domain.
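For reference, the agreement figures above can be reproduced with scikit-learn, as in this sketch with toy labels standing in for the annotators' real decisions:

    from sklearn.metrics import cohen_kappa_score

    # Toy parallel label lists for the two annotators (illustrative only)
    annotator_a = [0, 1, 0, 0, 1, 1, 0, 0, 1, 0]
    annotator_b = [0, 1, 0, 1, 1, 1, 0, 0, 1, 0]

    kappa = cohen_kappa_score(annotator_a, annotator_b)
    agreement = sum(a == b for a, b in zip(annotator_a, annotator_b)) / len(annotator_a)
    print(f"Cohen's kappa: {kappa:.2f}  raw agreement: {agreement:.0%}")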
Files Description
The hateRADAR-es dataset contains a collection of 5,000 tweets, exhaustively labelled for detecting hate speech, racist or xenophobic discourse towards refugees. The dataset is divided into two sets to ensure the reproducibility of experiments: a training dataset and a testing dataset.
Both datasets are provided in CSV format and include the following fields:
id_tweet: The unique identifier for each tweet. This ID can be used to rehydrate individual tweets through the Twitter/X API (see the sketch after this list).
text: The content of the tweet.
label: Binary label where 0 represents "No hate speech" and 1 represents "Hate speech detected".
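A minimal rehydration sketch using the tweet lookup endpoint of the Twitter/X API v2 follows; it assumes a valid bearer token, and access tiers, rate limits, and terms are subject to the platform's current policies:

    import requests

    BEARER_TOKEN = "YOUR_BEARER_TOKEN"  # placeholder credential
    ids = "1234567890123456789,1234567890123456790"  # toy id_tweet values

    # Tweet lookup endpoint of the API v2 (up to 100 comma-separated IDs per request)
    resp = requests.get(
        "https://api.twitter.com/2/tweets",
        params={"ids": ids},
        headers={"Authorization": f"Bearer {BEARER_TOKEN}"},
        timeout=30,
    )
    print(resp.json())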
Train Dataset
Name: hateRADAR-es_train.csv
Number of Rows: 4000. This set is intended for model training and validation, ensuring comprehensive learning and robustness against overfitting.
Size: 516.24 KB
Class Distribution:
0 (No hate speech): 3048 entries, accounting for 76.2% of the dataset.
1 (Hate speech detected): 952 entries, making up 23.8% of the dataset.
Test Dataset
Name: hateRADAR-es_test.csv
Number of Rows: 1000. This set is used for model evaluation, providing an unbiased assessment of the model's performance on unseen data.
Class Distribution:
0 (No hate speech): 762 entries, representing 76.2% of the dataset.
1 (Hate speech detected): 238 entries, constituting 23.8% of the dataset.
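The reported sizes and class distributions can be checked in a few lines of pandas:

    import pandas as pd

    for name in ("hateRADAR-es_train.csv", "hateRADAR-es_test.csv"):
        df = pd.read_csv(name)
        print(name, len(df), "rows")
        print(df["label"].value_counts(normalize=True).round(3))
    # Expected: 4,000 and 1,000 rows, with labels 0/1 at roughly 0.762/0.238 in both splits.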
Privacy and preprocessing
To protect user privacy and comply with data protection requirements, original usernames were replaced with the placeholder @user. The dataset contains tweet IDs so that researchers may rehydrate content in accordance with Twitter’s terms and applicable policies; access is provided under the conditions described below.
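A minimal sketch of the kind of username masking described above (the exact procedure used to build the dataset may differ):

    import re

    def mask_usernames(text: str) -> str:
        # Twitter handles consist of 1-15 word characters after "@"
        return re.sub(r"@\w{1,15}", "@user", text)

    print(mask_usernames("@maria_78 los refugiados merecen respeto"))
    # -> @user los refugiados merecen respeto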
Research Opportunities with the hateRADAR-es corpus
We aim to offer the scientific community a valuable resource for advancing the automatic detection of anti-refugee hate speech, and we hope it will help researchers understand the impact of hate speech on refugees. By providing access to these data, we hope to contribute to a greater understanding of this issue and to promote positive change in our communities.
The hateRADAR-es dataset is available for research purposes. Comprising both training and test sets for anti-refugee discourse in Spanish, it lends itself to cross-disciplinary research. For instance:
Computer Science and AI (NLP):
Hate Speech Detection: Develop and fine-tune algorithms (e.g., Transformer-based models) for the automatic classification of anti-refugee hate content in Spanish (a minimal fine-tuning sketch follows this list).
Implicit vs. Explicit Hate: Research methods to differentiate between explicit hateful language and implicit, subtle toxicity found in online messages.
Model Generalization: Use the corpus to test if models trained on a specialized domain (anti-refugee) can generalize effectively to other types of hate speech.
Cross-lingual Transfer: Leverage this Spanish corpus as a resource for transfer learning to aid in hate speech detection efforts in low-resource languages.
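As an illustration of the hate speech detection use case above, the sketch below fine-tunes a Spanish Transformer on the corpus with the Hugging Face transformers library; the model choice (BETO) and hyperparameters are illustrative assumptions, not the configuration reported in the paper:

    import numpy as np
    import pandas as pd
    from datasets import Dataset
    from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                              Trainer, TrainingArguments)

    MODEL = "dccuchile/bert-base-spanish-wwm-cased"  # BETO; an assumed choice

    tokenizer = AutoTokenizer.from_pretrained(MODEL)
    model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=2)

    def tokenize(batch):
        return tokenizer(batch["text"], truncation=True,
                         padding="max_length", max_length=128)

    train_ds = Dataset.from_pandas(pd.read_csv("hateRADAR-es_train.csv")).map(tokenize, batched=True)
    test_ds = Dataset.from_pandas(pd.read_csv("hateRADAR-es_test.csv")).map(tokenize, batched=True)

    def compute_metrics(eval_pred):
        logits, labels = eval_pred
        return {"accuracy": (np.argmax(logits, axis=-1) == labels).mean()}

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="hateradar_out", num_train_epochs=3,
                               per_device_train_batch_size=16),
        train_dataset=train_ds,
        eval_dataset=test_ds,
        compute_metrics=compute_metrics,
    )
    trainer.train()
    print(trainer.evaluate())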
Social Sciences and Digital Humanities:
Narrative Analysis and Framing: Analyze the rhetoric, linguistic patterns, and key narrative frameworks used to justify or spread hostility toward refugees.
Actor and Network Mapping: Identify and study the social and political actors (users, groups, organizations) responsible for the dissemination and amplification of anti-refugee narratives.
Annotation Bias Studies: Investigate the subjectivity and reliability of expert human annotation to improve future data collection methods for sensitive topics.
Contextual Analysis: Study the link between real-world events, policy changes, or crises and the subsequent spikes in online hate speech directed at refugees.
Funding
These data belong to the R&D&I project titled “Conspiracy Theories and Online Hate Speech: Comparison of Patterns in Narratives and Social Networks about COVID-19, Immigrants, Refugees, and LGBTIQ+ People [NON-CONSPIRA-HATE!]”, PID2021-123983OB-I00, funded by MCIN/AEI/10.13039/501100011033/ and by “ERDF/EU”. Principal Investigator: Estrella Gualda. University of Huelva, ESEIS/COIDESO/CISCOA-Lab, Spain. We are also grateful for the support of our research groups, “Estudios Sociales e Intervención Social” (GrupoESEIS) and “Ingeniería de la Información y el Conocimiento” (I2C), and of the research centres “Pensamiento Contemporáneo e Innovación para el Desarrollo Social” (COIDESO) and “Centro de Investigación en Tecnología, Energía y Sostenibilidad” (CITES), at the University of Huelva.
Other details on the Annotation Process, Pilot and Construction of the Dataset
Pilot Phase
The annotation process began with a pilot of 500 tweets, selected from the original dataset of ~90,000 Spanish tweets collected via NodeXL. These tweets had previously been coded with Atlas.ti in a doctoral thesis (Rebollo-Díaz, 2021) and other publications. The pilot aimed to test and refine the annotation guide, focusing on anti-refugee hate speech.
- Rebollo-Díaz, C., 2021. Tuiteando sobre refugiados: una comparación internacional de discursos, imaginarios y representaciones sociales [Tweeting about refugees: an international comparison of discourses, imaginaries and social representations]. Ph.D. thesis, Universidad de Huelva. URL: http://hdl.handle.net/10272/20113.
Test: A first systematic test with 200 tweets involved annotators Laura Cabrera Álvarez, Jonás González Díaz, Carolina Rebollo-Díaz (CRD), Elena Ruiz Ángel, and Francisco Javier Santos Fernández.
Tool setup and training: The pilot helped validate the guide and train annotators on the use of Doccano, installed on University of Huelva servers (http://nonconspirahate.uhu.es:8000/) by Jacinto Mata and Victoria Pachón.
Supervision and training: Estrella Gualda (EG) & Carolina Rebollo-Díaz.
Full Annotation
After refining the guide, 5,000 tweets were annotated following the new protocol.
- Annotators: Jonás González Díaz and Laura Cabrera Álvarez
- Referee: Carolina Rebollo-Díaz
- Supervision and training on the annotation guide: Estrella Gualda & Carolina Rebollo-Díaz
Dataset Construction Process
Data selection:
CRD collected the original ~90,000 Spanish tweets.
EG & CRD performed a qualitative selection of tweets coded in Atlas.ti, to balance hate and non-hate tokens.
Random sampling:
EG applied R scripts to filter 2,500 hate-related and 2,500 non-hate tweets for annotation (see the sketch after this list).
Preparation for annotation:
EG formatted the selected tweets for Doccano.
Building of the hateRADAR-es dataset:
Jacinto Mata & Victoria Pachón adapted the data to Doccano and built the hateRADAR-es dataset, which served as the input for model development in the paper: “From data to detection: Developing a corpus and training language models for the identification of anti-refugee narratives in Spanish.”
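The selection and formatting steps above used R scripts and internal tooling; as a hedged illustration, the following Python sketch draws a balanced 2,500 + 2,500 sample and exports it as JSONL, a format Doccano can import for classification projects (file and column names are hypothetical):

    import json
    import pandas as pd

    coded = pd.read_csv("spanish_tweets_coded.csv")  # hypothetical coded collection

    # Draw a balanced sample of hate-related and non-hate tweets
    sample = pd.concat([
        coded[coded["code"] == "hate"].sample(2500, random_state=42),
        coded[coded["code"] == "non_hate"].sample(2500, random_state=42),
    ])

    # Export as JSONL for import into Doccano
    with open("doccano_import.jsonl", "w", encoding="utf-8") as f:
        for _, row in sample.iterrows():
            f.write(json.dumps({"text": row["text"]}, ensure_ascii=False) + "\n")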
Contact:
If you have any questions, feedback, or need further information about the hateRADAR-es dataset, please feel free to contact the project account:
nonconspirahate@uhu.es.