Dataset for: "Roll in the Tanks! Measuring Left-wing Extremism on Reddit at Scale"
Description
This repository contains the dataset, along with the source code used to produce the main findings of the paper, "Roll in the Tanks! Measuring Left-wing Extremism on Reddit at Scale."
Dataset Overview
This dataset consists of Reddit submissions and comments collected from various far-left subreddits, spanning from July 2019 to March 2022. To preserve anonymity, we anonymized all post identifiers and author names (except AutoModerator). In addition, any words beginning with u/ have been replaced with u/anonymized_author_name, and any words beginning with @ have been replaced with @anonymized_at_word.
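The mention-replacement rule described above can be sketched as follows. This is an illustration only: the function name `anonymize_mentions` and the regular expressions are assumptions, and the sketch treats a "word" as a whitespace-delimited token, which may differ from the rule the authors actually applied.

```python
import re

# Assumption: a "word" is a whitespace-delimited token that starts with the prefix.
# The authors' exact tokenization is not documented here and may differ.
U_WORD = re.compile(r"(?<!\S)u/\S*")
AT_WORD = re.compile(r"(?<!\S)@\S*")


def anonymize_mentions(text: str) -> str:
    """Replace tokens starting with u/ or @ by the placeholders named in the README."""
    text = U_WORD.sub("u/anonymized_author_name", text)
    return AT_WORD.sub("@anonymized_at_word", text)


example = anonymize_mentions("thanks u/some_user and @some_handle")
print(example)
```

Because the patterns only match at the start of a token, an `@` inside a token (for example in an email address) is left unchanged under this assumption.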
The dataset includes the files described below.
File Structure
├── analysis.ipynb                                  # Jupyter notebook with analysis
└── data/
    ├── far-left_dataset.ndjson                     # Main dataset
    ├── topic_keywords.json                         # Topic keywords dictionary
    ├── topics.jsonl                                # Topic assignments
    └── ideology_user_base_similarity_matrix.json   # User overlap similarity matrix
Dataset Files
1. Main Dataset (`far-left_dataset.ndjson`)
Fields:
- `id`: Unique post identifier
- `author`: Author username
- `subreddit`: Subreddit name
- `created_utc`: Post creation timestamp
- `post`: Post content
- `title`: Post title
- `subreddit_type`: Category used for analyzing related communities
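A minimal sketch of reading the main dataset, assuming each line of the NDJSON file is one JSON object carrying the fields listed above. The helper names `iter_records` and `posts_per_subreddit` are hypothetical and not part of the repository, and the sample values are made up for illustration.

```python
import json
from collections import Counter
from typing import Iterable, Iterator


def iter_records(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per non-empty NDJSON line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def posts_per_subreddit(records: Iterable[dict]) -> Counter:
    """Count records per value of the `subreddit` field."""
    return Counter(record["subreddit"] for record in records)


# In-memory sample with made-up values; real records carry more fields.
sample_lines = [
    '{"id": "a1", "subreddit": "example_sub_a", "post": "text"}',
    '{"id": "a2", "subreddit": "example_sub_a", "post": "text"}',
    '{"id": "a3", "subreddit": "example_sub_b", "post": "text"}',
]
counts = posts_per_subreddit(iter_records(sample_lines))
print(counts.most_common())
```

To run this on the real file, pass an open file handle instead of the sample list, for example `open("data/far-left_dataset.ndjson", encoding="utf-8")`. Streaming line by line avoids loading the whole file into memory.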
2. Topic Keywords (`topic_keywords.json`)
Content: Dictionary mapping topic IDs to lists of representative keywords
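As a sketch of how this dictionary might be queried, the snippet below looks up which topics list a given keyword. The function name `topics_with_keyword` and the sample keywords are illustrative assumptions. Note that JSON object keys are always strings, so topic IDs come back as strings after loading.

```python
import json


def topics_with_keyword(topic_keywords: dict, keyword: str) -> list:
    """Return the topic IDs whose keyword list contains `keyword` (case-insensitive)."""
    needle = keyword.lower()
    return [
        topic_id
        for topic_id, words in topic_keywords.items()
        if needle in (word.lower() for word in words)
    ]


# Made-up sample in the documented shape: topic ID -> list of keywords.
sample = json.loads('{"0": ["Union", "strike"], "1": ["election", "vote"]}')
matches = topics_with_keyword(sample, "union")
print(matches)
```

For the real file, replace the sample with `json.load(open("data/topic_keywords.json", encoding="utf-8"))`.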
3. Topic Assignments (`topics.jsonl`)
Fields:
- `subreddit`: Subreddit name
- `topic`: Topic ID
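The topic assignments can be aggregated per subreddit along these lines. This sketch assumes each line of the JSONL file is one JSON object with the two fields above. The helper names are hypothetical, and topic IDs are cast to strings on the assumption that this makes them comparable with the keys of `topic_keywords.json`.

```python
import json
from collections import Counter, defaultdict


def load_jsonl(lines):
    """Parse an iterable of JSONL lines, skipping blank ones."""
    return [json.loads(line) for line in lines if line.strip()]


def topic_counts_by_subreddit(rows):
    """Map each subreddit to a Counter of its topic IDs (as strings)."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row["subreddit"]][str(row["topic"])] += 1
    return counts


def top_topics(counts, subreddit, k=3):
    """Return the k most frequent (topic_id, count) pairs for one subreddit."""
    return counts.get(subreddit, Counter()).most_common(k)


# Made-up sample rows in the documented shape.
rows = load_jsonl([
    '{"subreddit": "example_sub_a", "topic": 3}',
    '{"subreddit": "example_sub_a", "topic": 3}',
    '{"subreddit": "example_sub_a", "topic": 7}',
    "\n",
])
counts = topic_counts_by_subreddit(rows)
print(top_topics(counts, "example_sub_a", k=1))
```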
4. User Overlap Similarity Matrix (`ideology_user_base_similarity_matrix.json`)
Fields:
- Subreddit pairs as keys
- Similarity scores ranging from 0 (no overlap) to 1 (complete overlap)
- Similarity matrix showing user base similarities between subreddits
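The exact key layout of this file is not spelled out above, so the following is only a sketch under stated assumptions. It handles a nested layout (`{subreddit_a: {subreddit_b: score}}`), and it handles a flat layout with one key per pair if the caller supplies a `split_key` function that knows the separator. All function names and the `|` separator in the example are hypothetical.

```python
def to_triples(matrix, split_key=None):
    """Normalize the loaded JSON into (subreddit_a, subreddit_b, score) triples."""
    triples = []
    for key, value in matrix.items():
        if isinstance(value, dict):
            # Assumed nested layout: {a: {b: score}}
            triples.extend((key, other, float(score)) for other, score in value.items())
        else:
            # Assumed flat layout: {"<pair key>": score}; the separator is unknown here.
            if split_key is None:
                raise ValueError("flat pair keys need a split_key function")
            a, b = split_key(key)
            triples.append((a, b, float(value)))
    return triples


def most_similar(triples, subreddit, k=5):
    """Return up to k (other_subreddit, score) pairs, highest score first."""
    scored = {}
    for a, b, score in triples:
        if a == b:
            continue  # skip self-similarity entries
        if a == subreddit:
            scored[b] = score
        elif b == subreddit:
            scored[a] = score
    return sorted(scored.items(), key=lambda item: item[1], reverse=True)[:k]


# Made-up scores in the assumed nested layout.
nested = {
    "a": {"a": 1.0, "b": 0.4, "c": 0.7},
    "b": {"a": 0.4, "b": 1.0, "c": 0.1},
    "c": {"a": 0.7, "b": 0.1, "c": 1.0},
}
print(most_similar(to_triples(nested), "a", k=2))
```

Inspecting a few keys of the loaded JSON first should show which layout applies before choosing `split_key`.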
If you use this dataset in a publication of any form or kind, please cite it using the following BibTeX entry:

@misc{balcı2025rolltanksmeasuringleftwing,
  title={Roll in the Tanks! Measuring Left-wing Extremism on Reddit at Scale},
  author={Utkucan Balcı and Michael Sirivianos and Jeremy Blackburn},
  year={2025},
  eprint={2307.06981},
  archivePrefix={arXiv},
  primaryClass={cs.SI},
  url={https://arxiv.org/abs/2307.06981},
}
Additional details
Funding
- European Commission
- MedDMO 101083756