There is a newer version of the record available.

Published June 11, 2025 | Version v1
Dataset Open

EsCorpiusBias: Contextual Annotation and Transformer-Based Detection of Racism and Sexism in Spanish Dialogues

  • 1. Universidad de Granada
  • 2. Universidad Politécnica de Madrid - Campus Sur
  • 3. Walmart (United States)
  • 4. Spanish Ministry of Economy

Description

The rise of online communication platforms has increased exposure to harmful discourse, posing ongoing challenges for digital moderation and user well-being. This paper introduces the EsCorpiusBias corpus, designed to improve automated detection of sexism and racism in Spanish-language online dialogues sourced from the Mediavida forum. Using a systematic, context-sensitive annotation protocol, approximately 1,000 three-turn dialogue units per bias category have been annotated, so that the labels capture pragmatic and conversational subtleties. The annotation guidelines cover both explicit and implicit manifestations of sexism and racism. Annotations were performed with the Prodigy tool and reached moderate to substantial inter-annotator agreement (Cohen's kappa: 0.55 for sexism, 0.79 for racism). Logistic Regression, spaCy's baseline n-gram bag-of-words model, and the transformer-based BETO model were trained and evaluated. The results show that contextualized transformer-based approaches significantly outperform baseline and general-purpose models. Lexical overlap analyses also indicated a strong reliance on explicit lexical indicators, which points to limitations in handling implicit bias. This research underscores the importance of contextually grounded, domain-specific fine-tuning for automated toxicity detection, and it provides resources and methodologies to support socially responsible NLP systems for Spanish-speaking online communities.
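For readers unfamiliar with the agreement figures quoted above, the following is a minimal sketch of how Cohen's kappa is computed for two annotators. It is not the authors' evaluation code: the function name `cohen_kappa` and the toy label lists are assumptions made for illustration, and the values shown are not taken from the corpus.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance from each annotator's
    label distribution.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item set")

    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: product of the two marginal label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(counts_a[label] * counts_b[label]
              for label in set(counts_a) | set(counts_b)) / (n * n)

    if p_e == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Hypothetical toy annotations (1 = biased, 0 = not biased).
annotator_1 = [1, 1, 0, 0]
annotator_2 = [1, 0, 0, 0]
kappa = cohen_kappa(annotator_1, annotator_2)
```

In this toy case the annotators agree on three of four items (p_o = 0.75) and chance agreement is 0.5, so kappa is 0.5.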

Notes

Acknowledgments: This dataset is a result of the project CONVERSA (TED2021-132470B-I00) funded by MCIN/AEI/10.13039/501100011033 and by "European Union NextGenerationEU/PRTR".

Files

labels.txt

Files (1.7 MB)

Checksum Size
md5:20057af4203888b26770115a69547f23 39 Bytes
md5:33b85d881661192e09ebf01f72bca371 792.2 kB
md5:b1e0858922f5820deb4c4030283e249b 866.4 kB
md5:c90c0067732507154f50e3277ce8168b 2.4 kB
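The MD5 checksums listed above can be used to check that a downloaded file is intact. The sketch below shows one way to do this with Python's standard `hashlib` module. The helper names (`md5_of_chunks`, `md5_of_file`, `matches`) and the local path in the usage comment are assumptions for illustration; the record does not state which checksum belongs to which file name.

```python
import hashlib


def md5_of_chunks(chunks):
    """Return the hex MD5 digest of an iterable of byte chunks."""
    digest = hashlib.md5()
    for chunk in chunks:
        digest.update(chunk)
    return digest.hexdigest()


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in 1 MiB chunks."""
    with open(path, "rb") as handle:
        # iter() with a sentinel stops at the first empty read (end of file).
        return md5_of_chunks(iter(lambda: handle.read(chunk_size), b""))


def matches(actual_hex, listed_checksum):
    """Compare a computed digest with a listed value such as 'md5:abc...'."""
    expected = listed_checksum.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return actual_hex.lower() == expected


# Usage (hypothetical local path; pair it with the checksum shown for that file):
# ok = matches(md5_of_file("downloads/labels.txt"),
#              "md5:20057af4203888b26770115a69547f23")
```

A mismatch usually means the download was truncated or the file was altered, in which case downloading it again is the simplest remedy.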

Additional details

Related works

Is part of
Dataset: 10.5281/zenodo.15023855 (DOI)

Funding

Ministerio de Ciencia, Innovación y Universidades
Effective and efficient resources and models for transformative conversational AI in Spanish and co-official languages (CONVERSA) TED2021-132470B-I00