Published 2025 | Version 0.1
Journal article Open

Spanish-Language Incel Corpus on X

Description

Introduction

This dataset contains the materials used in a Latent Dirichlet Allocation (LDA) topic modeling analysis of Twitter (X) discussions in the Spanish-speaking incel community. The files include topic-specific tweets, preprocessing artifacts, and the LDA model used for topic extraction. The dataset is intended for researchers analyzing online discourse, misogyny, radicalization, and online communities.

The files document the topic distributions, word transformations, deleted words, and the underlying LDA model, so that the analysis can be reproduced and extended.

Files and Descriptions


1. Topic-Specific Tweet Data

These files contain the top tweets for each topic under LDA models fitted with different numbers of topics (14, 15, 24, and 27). In each file, tweets are assigned to topics according to their probability scores in the corresponding LDA model.

    Top_tweets_by_topic_14_TOPICS(1).xlsx – Tweets classified under a 14-topic LDA model.
    Top_tweets_by_topic_15_TOPICS(1).xlsx – Tweets classified under a 15-topic LDA model.
    Top_tweets_by_topic_24_TOPICS.xlsx – Tweets classified under a 24-topic LDA model.
    Top_tweets_by_topic_27_TOPICS(2).xlsx – Tweets classified under a 27-topic LDA model.

Each file contains:

    Tweet text
    Assigned topic
    Topic probability score (gamma value)

These files allow researchers to analyze the differences in topic structure and distribution depending on the number of topics in the model.
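
As an illustration of how the three fields listed above (tweet text, assigned topic, gamma) might be used, the following minimal Python sketch picks the highest-gamma tweets per topic. The rows are invented stand-ins, and the key names "text", "topic", and "gamma" are assumptions; the actual column headers in the .xlsx files may differ.

```python
from collections import defaultdict

# Hypothetical rows; the real column names in the .xlsx files may differ.
rows = [
    {"text": "tweet a", "topic": 1, "gamma": 0.91},
    {"text": "tweet b", "topic": 1, "gamma": 0.64},
    {"text": "tweet c", "topic": 2, "gamma": 0.78},
    {"text": "tweet d", "topic": 2, "gamma": 0.83},
]


def top_tweets_by_topic(rows, n=1):
    """Return the n highest-gamma rows for each topic, sorted by gamma descending."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["topic"]].append(row)
    return {
        topic: sorted(items, key=lambda r: r["gamma"], reverse=True)[:n]
        for topic, items in grouped.items()
    }


top = top_tweets_by_topic(rows, n=1)
for topic in sorted(top):
    best = top[topic][0]
    print(topic, best["text"], best["gamma"])
```

Running the same function on the 14-, 15-, 24-, and 27-topic files would give a quick way to compare how the most representative tweets shift as the number of topics changes.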


2. Preprocessing Artifacts

These files document preprocessing steps applied to the text data, including word deletions and transformations.

    deleted_words.csv – Words that were removed during preprocessing, typically stopwords or irrelevant tokens.
    word_transformations.csv – Records of word transformations, mapping different variations of words into standardized forms for improved topic modeling accuracy.

These files ensure transparency and reproducibility in text cleaning and preparation.
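
A minimal sketch of how the two artifacts could be re-applied to tokenized text is shown below. It is an assumption-laden illustration, not the authors' pipeline: the column names ("word", "original", "transformed"), the sample entries, and the order of operations (delete first, then transform) are all hypothetical and should be checked against the real files and against LDA_incels.R.

```python
import csv
import io

# Stand-ins for deleted_words.csv and word_transformations.csv.
# Column names and contents are assumptions made for illustration only.
deleted_csv = io.StringIO("word\nde\nla\n")
transform_csv = io.StringIO("original,transformed\nforos,foro\nusuarios,usuario\n")

deleted = {row["word"] for row in csv.DictReader(deleted_csv)}
mapping = {row["original"]: row["transformed"] for row in csv.DictReader(transform_csv)}


def clean_tokens(tokens, deleted, mapping):
    """Drop deleted words, then map the remaining tokens to their standardized form."""
    return [mapping.get(tok, tok) for tok in tokens if tok not in deleted]


tokens = "la lista de foros".split()
print(clean_tokens(tokens, deleted, mapping))
```

With the real files, the two `io.StringIO` objects would be replaced by `open(...)` calls on the CSV paths, using whatever encoding and headers the files actually have.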


3. Twitter User Profiles

    Perfiles_twitter_observados.csv – List of Twitter profiles observed in the dataset, which may include user identifiers or descriptive metadata (e.g., username, activity level). This file is useful for studying user interactions and engagement patterns within the dataset.


4. Hashed User Data

    UCO_hash.csv – Contains hashed identifiers of observed users, anonymizing them while preserving relational structures. This file protects user privacy while still allowing network analysis or tracking of user interactions across topics.
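
The record does not state which hashing scheme was used. The sketch below therefore only illustrates the general idea of why hashed identifiers preserve relational structure: a deterministic hash maps the same user to the same pseudonym every time, so edges between users survive anonymization. The salted SHA-256 scheme, the salt value, and the sample user names are all assumptions.

```python
import hashlib


def pseudonymize(user_id: str, salt: str) -> str:
    """Return a salted SHA-256 hex digest of a user identifier."""
    return hashlib.sha256((salt + user_id).encode("utf-8")).hexdigest()


SALT = "example-salt"  # hypothetical; a real salt would be kept secret

# Hypothetical interaction edges between users.
edges = [("user_a", "user_b"), ("user_b", "user_c")]
hashed_edges = [(pseudonymize(a, SALT), pseudonymize(b, SALT)) for a, b in edges]

# "user_b" receives the same pseudonym in both edges, so the network
# structure can still be reconstructed from the hashed identifiers.
print(hashed_edges[0][1] == hashed_edges[1][0])
```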


5. LDA Topic Modeling Output

    incel_lds_model.rds – The trained LDA model file (RDS format) used for topic modeling. This model can be loaded in R for reproducibility or further refinement.

    LDA_incels.R – R script that performs text mining on the tweets in 'UCO_hash.csv', including data cleaning, word grouping, document-term matrix (DTM) construction, LDA topic modeling, and visualization.
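
For readers unfamiliar with the DTM step mentioned above, here is a minimal Python sketch of what a document-term matrix is. It is not a port of LDA_incels.R (which is written in R); the tokenized documents are invented, and only the general technique of counting terms per document is shown.

```python
from collections import Counter

# Hypothetical tokenized documents (one token list per tweet).
docs = [
    ["incel", "foro", "foro"],
    ["foro", "mujer"],
]

# Vocabulary: every distinct term, in a fixed (sorted) column order.
vocab = sorted({tok for doc in docs for tok in doc})

# One row per document, one column per vocabulary term, cells are counts.
dtm = []
for doc in docs:
    counts = Counter(doc)
    dtm.append([counts[term] for term in vocab])

print(vocab)
for row in dtm:
    print(row)
```

An LDA model is then fitted on a matrix of this shape, with the number of topics (14, 15, 24, or 27 in this dataset) chosen by the analyst.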

Files (6.6 MB)

MD5 checksum  Size
md5:44386210349aa0979e31a68b38377479  7.3 kB
md5:30a6e0319398943565c58b2a8c7e4571  450.3 kB
md5:4d0b3796ab6f1600aefe3ac067448056  29.6 kB
md5:3c09a5d47660fa8ea1549d612b05bd31  3.0 kB
md5:a401af0c302779ff892c93a95637cbba  110.9 kB
md5:f008e2108a32fee1cfa1a7082b9f6eee  121.8 kB
md5:19f00aba454295e59ff4bf697f043b3d  188.0 kB
md5:3e4a54aab5b5a3c1dfb9a30e99d40d20  192.7 kB
md5:081b77d116dbf93355b4cb1abf646c22  5.5 MB
md5:d2dd08f85b6140e32bcc34b763ac66fe  9.7 kB
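
The MD5 checksums listed above can be used to confirm that a downloaded file is intact. The following Python sketch shows one way to do this with the standard library. The listing does not show which checksum belongs to which file, so the pairing has to be taken from the record page itself; the function names here are illustrative.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path, listed):
    """Compare a file's MD5 with a listed value such as 'md5:4438...'."""
    expected = listed.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected
```

A call such as `matches_listing(local_path, listed_checksum)` returns `True` when the local copy matches the published checksum, where `local_path` and `listed_checksum` are supplied by the reader.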

Additional details

Funding

Ministerio de Ciencia, Innovación y Universidades
Lobby y Comunicación en la Unión Europea PID2020-118584RB-100
European Commission
DigiPatch (CHANSE ERA-NET) 101004509
Ministerio de Ciencia, Innovación y Universidades
Las Noticias Falsas en las Redes Sociales. Tres Estudios de Caso: Populismo, Covid y Cambio Climático PID2021-125788OB-I00

Dates

Submitted
2025

Software

Programming language
R, Python
Development Status
Active