Dataset for: "Roll in the Tanks! Measuring Left-wing Extremism on Reddit at Scale"
Description
This repository contains the dataset, along with the source code used to produce the main findings of the paper, "Roll in the Tanks! Measuring Left-wing Extremism on Reddit at Scale."
Dataset Overview
This dataset consists of Reddit submissions and comments collected from various far-left subreddits, spanning from July 2019 to March 2022. To preserve anonymity, we anonymized all post identifiers and author names (except AutoModerator). In addition, any words beginning with u/ have been replaced with u/anonymized_author_name, and any words beginning with @ have been replaced with @anonymized_at_word.
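The mention replacement described above can be approximated with two regular expressions. This is only a sketch of the idea, not the authors' actual preprocessing code: the function name `anonymize_text` is hypothetical, and it assumes a "word" is a run of non-whitespace characters that starts at the beginning of the text or after whitespace.

```python
import re

# Assumption: a "word" is a run of non-whitespace characters; the exact
# tokenization used to build the dataset is not documented in this README.
USER_MENTION = re.compile(r"(?<!\S)u/\S+")
AT_WORD = re.compile(r"(?<!\S)@\S+")


def anonymize_text(text: str) -> str:
    """Replace words starting with u/ or @ by the dataset's placeholders."""
    text = USER_MENTION.sub("u/anonymized_author_name", text)
    text = AT_WORD.sub("@anonymized_at_word", text)
    return text


print(anonymize_text("thanks u/some_user and @someone for the link"))
```

Because the pattern consumes the whole non-whitespace run, trailing punctuation attached to a mention is replaced along with it; the released data may have been processed differently.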
The dataset includes the files described below.
File Structure
├── analysis.ipynb                                  # Jupyter notebook with analysis
└── data/
    ├── far-left_dataset.ndjson                     # Main dataset
    ├── topic_keywords.json                         # Topic keywords dictionary
    ├── topics.jsonl                                # Topic assignments
    └── ideology_user_base_similarity_matrix.json   # User overlap similarity matrix
Dataset Files
1. Main Dataset (`far-left_dataset.ndjson`)
Fields:
- `id`: Unique post identifier
- `author`: Author username
- `subreddit`: Subreddit name
- `created_utc`: Post creation timestamp
- `post`: Post content
- `title`: Post title
- `subreddit_type`: Category used for analyzing related communities
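Since the main dataset is newline-delimited JSON, it can be streamed one record at a time instead of being loaded whole. The sketch below uses made-up toy records shaped like the fields listed above. The helper names are hypothetical, and it assumes `created_utc` holds Unix epoch seconds (as in Reddit's API), which this README does not state.

```python
import json
from collections import Counter
from datetime import datetime, timezone
from typing import Iterable, Iterator


def iter_records(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per non-empty line of newline-delimited JSON."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def to_datetime(created_utc) -> datetime:
    """Convert `created_utc` to a timezone-aware UTC datetime.

    Assumption: the value is Unix epoch seconds; adjust if the file
    stores another format.
    """
    return datetime.fromtimestamp(float(created_utc), tz=timezone.utc)


def posts_per_subreddit(lines: Iterable[str]) -> Counter:
    """Count records per `subreddit` while streaming the input."""
    return Counter(rec["subreddit"] for rec in iter_records(lines))


# Toy records with invented values, for illustration only.
sample = [
    '{"id": "a1", "author": "x", "subreddit": "sub_a", "created_utc": 1561939200, "post": "hello", "title": "t", "subreddit_type": "type_1"}',
    '{"id": "a2", "author": "y", "subreddit": "sub_a", "created_utc": 1561939260, "post": "hi", "title": "", "subreddit_type": "type_1"}',
    '{"id": "b1", "author": "z", "subreddit": "sub_b", "created_utc": 1561939320, "post": "hey", "title": "u", "subreddit_type": "type_2"}',
]
print(posts_per_subreddit(sample))
print(to_datetime(1561939200).isoformat())
```

To run this on the real file, pass an open file handle instead of the toy list, for example `with open("data/far-left_dataset.ndjson", encoding="utf-8") as fh: counts = posts_per_subreddit(fh)`.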
2. Topic Keywords (`topic_keywords.json`)
Content: Dictionary mapping topic IDs to lists of representative keywords
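A dictionary of this shape can be queried directly once parsed. The following sketch uses an invented toy dictionary rather than the real keywords, and the helper names are hypothetical. Note that JSON object keys are always strings, so topic IDs will be strings after loading even if they look numeric.

```python
import json


def load_topic_keywords(text: str) -> dict:
    """Parse the JSON text of topic_keywords.json into {topic_id: [keywords]}."""
    return json.loads(text)


def topics_with_keyword(topic_keywords: dict, keyword: str) -> list:
    """Return the IDs of topics whose keyword list contains `keyword`."""
    needle = keyword.lower()
    return [
        topic_id
        for topic_id, words in topic_keywords.items()
        if any(needle == word.lower() for word in words)
    ]


# Toy content with invented keywords, for illustration only.
toy = load_topic_keywords('{"0": ["alpha", "beta"], "1": ["beta", "gamma"]}')
print(topics_with_keyword(toy, "Beta"))
```

For the real file, read its contents first, for example with `open("data/topic_keywords.json", encoding="utf-8").read()`.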
3. Topic Assignments (`topics.jsonl`)
Fields:
- `subreddit`: Subreddit name
- `topic`: Topic ID
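One way to use the topic assignments is to tally how often each topic occurs per subreddit. The sketch below assumes each line of `topics.jsonl` is a JSON object with the two fields listed above. The function name is hypothetical and the sample rows are invented.

```python
import json
from collections import Counter, defaultdict
from typing import Iterable


def topic_counts_by_subreddit(lines: Iterable[str]) -> dict:
    """Map each subreddit to a Counter of its topic IDs."""
    counts = defaultdict(Counter)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        row = json.loads(line)
        # str() so that IDs compare equal to the string keys of a JSON
        # dictionary such as topic_keywords.json.
        counts[row["subreddit"]][str(row["topic"])] += 1
    return dict(counts)


# Toy rows with invented values, for illustration only.
rows = [
    '{"subreddit": "sub_a", "topic": 3}',
    '{"subreddit": "sub_a", "topic": 3}',
    '{"subreddit": "sub_a", "topic": 7}',
    '{"subreddit": "sub_b", "topic": 7}',
]
print(topic_counts_by_subreddit(rows))
```

Casting the ID to a string is a design choice that makes lookups into the keyword dictionary straightforward; drop it if the real IDs are already strings.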
4. User Overlap Similarity Matrix (`ideology_user_base_similarity_matrix.json`)
Content: Similarity matrix showing user base similarities between subreddits
Fields:
- Subreddit pairs as keys
- Similarity scores ranging from 0 (no overlap) to 1 (complete overlap)
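A typical query against such a matrix is "which subreddits share the most users with a given one". The sketch below assumes a nested layout in which `matrix[a][b]` holds the similarity between subreddits `a` and `b`. That layout is an assumption, as this README does not specify how the pairs are encoded, so inspect the file before relying on it. The function name and toy values are invented.

```python
import json


def most_similar(matrix: dict, subreddit: str, k: int = 5) -> list:
    """Return up to k (name, score) pairs most similar to `subreddit`.

    Assumption: `matrix` is nested as matrix[a][b] -> score in [0, 1].
    """
    others = [
        (name, score)
        for name, score in matrix[subreddit].items()
        if name != subreddit
    ]
    for name, score in others:
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score out of range for {subreddit}/{name}: {score}")
    return sorted(others, key=lambda pair: pair[1], reverse=True)[:k]


# Toy matrix with invented values, for illustration only.
toy_matrix = json.loads(
    '{"sub_a": {"sub_a": 1.0, "sub_b": 0.4, "sub_c": 0.1},'
    ' "sub_b": {"sub_a": 0.4, "sub_b": 1.0, "sub_c": 0.25},'
    ' "sub_c": {"sub_a": 0.1, "sub_b": 0.25, "sub_c": 1.0}}'
)
print(most_similar(toy_matrix, "sub_a", k=2))
```

If the real file instead stores flat keys that encode a pair in a single string, the pairs would need to be split into the nested form first.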
If you use this dataset in a publication of any kind, please cite it using the following entry:
@misc{balcı2025rolltanksmeasuringleftwing,
title={Roll in the Tanks! Measuring Left-wing Extremism on Reddit at Scale},
author={Utkucan Balcı and Michael Sirivianos and Jeremy Blackburn},
year={2025},
eprint={2307.06981},
archivePrefix={arXiv},
primaryClass={cs.SI},
url={https://arxiv.org/abs/2307.06981},
}
Additional details
Funding
- European Commission
- MedDMO 101083756