Published March 21, 2025
| Version v1
Dataset
Restricted
Messages from alternative Spanish Telegram channels, 2019-2024
Creators
Description
This dataset contains processed data extracted with pytopicgram from Telegram channels, covering the period from 2019-12-01 to 2024-08-31. It includes anonymized channel information, sampled messages, and topics identified using BERTopic. The data has been anonymized and structured for ease of analysis. The dataset comprises two main CSV files:
1. Topics (topics.csv)
This file contains topics extracted from the full dataset using BERTopic. Each topic is described by a concise text generated by OpenAI o1.
| Column Name | Description |
|---|---|
| Topic | Numeric identifier for each topic. -1 is the generic topic for non-assignable messages. |
| Name | Human-readable name summarizing the topic. |
| Representation | List of representative keywords for the topic. |
| Description | Concise description of the topic generated by OpenAI. |
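As a minimal sketch of how the topics file might be read, the Python snippet below parses a small in-memory CSV that follows the column layout documented above. The sample rows are invented for illustration, the helper names `load_topics` and `assigned_topics` are hypothetical, and the assumption that Representation is stored as a Python-style list literal has not been checked against the actual file.

```python
import ast
import io

import pandas as pd

# Invented rows that follow the documented topics.csv columns.
# ASSUMPTION: Representation is serialized as a Python-style list literal.
SAMPLE_TOPICS_CSV = """Topic,Name,Representation,Description
-1,outliers,"['a', 'b']",Messages that could not be assigned to a topic
0,example_topic_a,"['x', 'y']",Illustrative description A
1,example_topic_b,"['p', 'q']",Illustrative description B
"""


def load_topics(source):
    """Read a topics CSV and parse the keyword lists into Python lists."""
    topics = pd.read_csv(source)
    topics["Representation"] = topics["Representation"].apply(ast.literal_eval)
    return topics


def assigned_topics(topics):
    """Drop the generic -1 topic, which collects non-assignable messages."""
    return topics[topics["Topic"] != -1].reset_index(drop=True)


topics = load_topics(io.StringIO(SAMPLE_TOPICS_CSV))
named = assigned_topics(topics)
print(named[["Topic", "Name", "Representation"]])
```

With the real file, `io.StringIO(SAMPLE_TOPICS_CSV)` would be replaced by the path to topics.csv. If the keyword lists turn out to be stored in another form, the `ast.literal_eval` step would need to change accordingly.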
2. Messages (messages.csv)
This file contains a 25% sample of messages from Telegram channels, stratified on the topics column.
| Column Name | Description |
|---|---|
| channel_id | Anonymized identifier for the Telegram channel. |
| week_year | Week and year when the message was posted (format: week_year). |
| media_type | Type of media included in the message (txt, img, video, audio, doc, web). |
| reach | Number of users reached by the message. |
| virality | Virality score of the message. |
| is_viral | Boolean indicating whether the message is considered viral. |
| topics | Topic identifier associated with the message. |
| probs | Probability scores for topic assignment. |
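The two files can be linked through the topic identifier: topics in messages.csv corresponds to Topic in topics.csv. The Python sketch below illustrates one possible join and a per-topic summary. All rows are invented, the week_year values are placeholders (the exact encoding is not specified here), probs is assumed to hold a single number per message, and `viral_share_by_topic` is a hypothetical helper rather than part of pytopicgram.

```python
import io

import pandas as pd

# Invented rows that follow the documented messages.csv columns.
# ASSUMPTIONS: week_year values are placeholders; probs is one float per row.
SAMPLE_MESSAGES_CSV = """channel_id,week_year,media_type,reach,virality,is_viral,topics,probs
c1,12_2021,txt,100,0.5,False,0,0.9
c1,12_2021,img,400,2.0,True,0,0.8
c2,13_2021,video,50,0.1,False,1,0.7
c2,13_2021,txt,10,0.0,False,-1,0.0
"""

# Invented rows with only the topics.csv columns needed for the join.
SAMPLE_TOPICS_CSV = """Topic,Name
-1,outliers
0,example_topic_a
1,example_topic_b
"""


def viral_share_by_topic(messages, topics):
    """Attach topic names to messages and summarize reach and virality."""
    merged = messages.merge(
        topics[["Topic", "Name"]],
        left_on="topics",
        right_on="Topic",
        how="left",
    )
    # Exclude the generic -1 topic (non-assignable messages).
    merged = merged[merged["topics"] != -1]
    return (
        merged.groupby("Name")
        .agg(
            messages=("is_viral", "size"),
            viral_share=("is_viral", "mean"),
            mean_reach=("reach", "mean"),
        )
        .reset_index()
    )


messages = pd.read_csv(io.StringIO(SAMPLE_MESSAGES_CSV))
topics = pd.read_csv(io.StringIO(SAMPLE_TOPICS_CSV))
summary = viral_share_by_topic(messages, topics)
print(summary)
```

Because messages.csv is a 25% sample stratified on the topic column, the relative sizes of topics in such a summary should approximate those of the full collection, while absolute counts will be roughly a quarter of the originals.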
Files
Additional details
Related works
- Is supplement to
- 10.5281/zenodo.14889387 (DOI)
Funding
- CaixaBank
- U-MIND SR21-00684