Spanish-Language Incel Corpus on X
Description
Introduction
This dataset contains materials used in a Latent Dirichlet Allocation (LDA) topic modeling analysis of Twitter discussions related to the Spanish-language incel community. The files include topic-specific tweets, preprocessing artifacts, and the LDA model used for topic extraction. The dataset is intended for researchers analyzing online discourse, misogyny, radicalization, and online communities.
The dataset is structured to provide key insights into topic distributions, word transformations, deleted words, and the underlying LDA model, facilitating further analysis and reproducibility.
Files and Descriptions
1. Topic-Specific Tweet Data
These files contain the top tweets for each topic under LDA models fitted with different numbers of topics (14, 15, 24, and 27). Each file lists the tweets assigned to each topic based on their probability scores in the corresponding LDA model.
Top_tweets_by_topic_15_TOPICS(1).xlsx – Tweets classified under a 15-topic LDA model.
Top_tweets_by_topic_14_TOPICS(1).xlsx – Tweets classified under a 14-topic LDA model.
Top_tweets_by_topic_24_TOPICS.xlsx – Tweets classified under a 24-topic LDA model.
Top_tweets_by_topic_27_TOPICS(2).xlsx – Tweets classified under a 27-topic LDA model.
Each file contains:
Tweet text
Assigned topic
Topic probability score (gamma value)
These files allow researchers to analyze the differences in topic structure and distribution depending on the number of topics in the model.
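As a rough illustration of how the gamma values can be used, the sketch below groups rows by topic and keeps the highest-scoring tweets for each one. The column names (`tweet`, `topic`, `gamma`) and the sample rows are assumptions made for illustration; the actual spreadsheet headers may differ.

```python
from collections import defaultdict

# Hypothetical rows; in practice they would be read from one of the .xlsx
# files. The keys "tweet", "topic" and "gamma" are assumed column names.
rows = [
    {"tweet": "tweet A", "topic": 1, "gamma": 0.91},
    {"tweet": "tweet B", "topic": 1, "gamma": 0.47},
    {"tweet": "tweet C", "topic": 2, "gamma": 0.78},
    {"tweet": "tweet D", "topic": 1, "gamma": 0.66},
]


def top_tweets_by_topic(rows, n=2):
    """Return {topic: [tweet, ...]} with the n highest-gamma tweets per topic."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["topic"]].append(row)
    return {
        topic: [
            r["tweet"]
            for r in sorted(items, key=lambda r: r["gamma"], reverse=True)[:n]
        ]
        for topic, items in grouped.items()
    }


top = top_tweets_by_topic(rows, n=2)
```

Running the same function on the 14-, 15-, 24- and 27-topic files would give one ranked list per model, which can then be compared side by side.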
2. Preprocessing Artifacts
These files document preprocessing steps applied to the text data, including word deletions and transformations.
deleted_words.csv – Words that were removed during preprocessing, typically stopwords or irrelevant tokens.
word_transformations.csv – Records of word transformations, mapping different variations of words into standardized forms for improved topic modeling accuracy.
These files ensure transparency and reproducibility in text cleaning and preparation.
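A minimal sketch of how the two artifacts could be applied to a tokenized tweet is shown below. It assumes that transformations are applied before deletions and that each file reduces to a simple mapping or word list; the example entries are invented, and the real column layout of the CSV files is not described here.

```python
# Hypothetical contents: word_transformations.csv as a variant -> standard
# form mapping, and deleted_words.csv as a set of removed words.
transformations = {"mujeres": "mujer", "chicas": "mujer"}
deleted = {"de", "la", "que"}


def clean_tokens(tokens, transformations, deleted):
    """Standardize each token, then drop tokens listed as deleted.

    The order (transform first, delete second) is an assumption.
    """
    cleaned = []
    for token in tokens:
        token = transformations.get(token, token)
        if token not in deleted:
            cleaned.append(token)
    return cleaned


tokens = ["la", "opinion", "de", "mujeres", "chicas"]
result = clean_tokens(tokens, transformations, deleted)
```

Keeping the mapping and the deletion list as data, rather than hard-coding them, is what lets another researcher replay the same cleaning on new text.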
3. Twitter User Profiles
Perfiles_twitter_observados.csv – List of Twitter profiles observed in the dataset, which may include user identifiers or descriptive metadata (e.g., username, activity level). This file is useful for studying user interactions and engagement patterns within the dataset.
4. Hashed User Data
UCO_hash.csv – Contains hashed identifiers of observed users for anonymization purposes while preserving relational structures. This file ensures privacy protection while allowing for network analysis or tracking user interactions across topics.
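To illustrate why hashing can protect identities while keeping relational structure, the sketch below replaces user names with salted SHA-256 digests. The hash function, the salt, and the example edges are all assumptions; the record does not state how the identifiers in UCO_hash.csv were actually produced.

```python
import hashlib


def pseudonymize(user_id: str, salt: str) -> str:
    """Return a salted SHA-256 hex digest for a user identifier (assumed scheme)."""
    return hashlib.sha256((salt + user_id).encode("utf-8")).hexdigest()


# Hypothetical interaction edges: (author, mentioned user).
edges = [("user_a", "user_b"), ("user_c", "user_b")]
salt = "example-salt"  # hypothetical value; a real salt would be kept private

hashed_edges = [(pseudonymize(a, salt), pseudonymize(b, salt)) for a, b in edges]
```

Because the same input always yields the same digest, `user_b` maps to one identifier in both edges, so the interaction graph is preserved even though the original names are not recoverable from the file alone.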
5. LDA Topic Modeling Output
incel_lds_model.rds – The trained LDA model file (RDS format) used for topic modeling. This model can be loaded in R for reproducibility or further refinement.
LDA_incels.R – R script that performs text mining on the tweets in 'UCO_hash.csv', including data cleaning, word grouping, document-term matrix (DTM) construction, LDA topic modeling, and visualization.
Files
(6.6 MB)
| MD5 checksum | Size |
|---|---|
| 44386210349aa0979e31a68b38377479 | 7.3 kB |
| 30a6e0319398943565c58b2a8c7e4571 | 450.3 kB |
| 4d0b3796ab6f1600aefe3ac067448056 | 29.6 kB |
| 3c09a5d47660fa8ea1549d612b05bd31 | 3.0 kB |
| a401af0c302779ff892c93a95637cbba | 110.9 kB |
| f008e2108a32fee1cfa1a7082b9f6eee | 121.8 kB |
| 19f00aba454295e59ff4bf697f043b3d | 188.0 kB |
| 3e4a54aab5b5a3c1dfb9a30e99d40d20 | 192.7 kB |
| 081b77d116dbf93355b4cb1abf646c22 | 5.5 MB |
| d2dd08f85b6140e32bcc34b763ac66fe | 9.7 kB |
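The MD5 values listed above can be used to check that a download is intact. The following is a minimal sketch using Python's standard `hashlib`; the helper names `md5_of_file` and `verify` are illustrative, and which checksum belongs to which file has to be taken from the repository page itself.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected_md5):
    """Return True if the file's MD5 matches the expected hex digest."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

For example, `verify(local_path, checksum)` returns `True` only when the bytes on disk match the published digest, where `local_path` points at a downloaded file and `checksum` is the corresponding value from the list.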
Additional details
Funding
- Ministerio de Ciencia, Innovación y Universidades
- Lobby y Comunicación en la Unión Europea PID2020-118584RB-100
- European Commission
- DigiPatch (CHANSE ERA-NET) 101004509
- Ministerio de Ciencia, Innovación y Universidades
- Las Noticias Falsas en las Redes Sociales. Tres Estudios de Caso: Populismo, Covid y Cambio Climático PID2021-125788OB-I00
Dates
- Submitted: 2025