Published July 28, 2025 | Version v1
Dataset Restricted

Dataset for: "Roll in the Tanks! Measuring Left-wing Extremism on Reddit at Scale"

  • 1. Binghamton University
  • 2. Cyprus University of Technology

Description

This repository contains the dataset, along with the source code used to produce the main findings of the paper, "Roll in the Tanks! Measuring Left-wing Extremism on Reddit at Scale."

Dataset Overview

This dataset consists of Reddit submissions and comments collected from various far-left subreddits, spanning from July 2019 to March 2022. To preserve anonymity, we anonymized all post identifiers and author names (except AutoModerator). In addition, any words beginning with u/ have been replaced with u/anonymized_author_name, and any words beginning with @ have been replaced with @anonymized_at_word.
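The mention replacement described above can be sketched as follows. This is a minimal illustration, not the authors' actual anonymization code: it assumes a "word" is a whitespace-delimited token, and the function name `anonymize_mentions` is hypothetical.

```python
import re

# ASSUMPTION: a "word" is a whitespace-delimited token, so a match must
# start at the beginning of the text or right after whitespace.
USER_MENTION = re.compile(r"(?<!\S)u/\S+")
AT_WORD = re.compile(r"(?<!\S)@\S+")


def anonymize_mentions(text: str) -> str:
    """Replace u/... and @... tokens with the dataset's placeholder strings."""
    text = USER_MENTION.sub("u/anonymized_author_name", text)
    return AT_WORD.sub("@anonymized_at_word", text)
```

Under this assumption, an email-like token such as `name@example.org` is left untouched because the `@` does not start the word, and trailing punctuation attached to a mention is replaced along with it.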


The dataset includes the following files:

File Structure


├── analysis.ipynb                                 # Jupyter notebook with analysis
└── data/
    ├── far-left_dataset.ndjson                    # Main dataset
    ├── topic_keywords.json                        # Topic keywords dictionary
    ├── topics.jsonl                               # Topic assignments
    └── ideology_user_base_similarity_matrix.json  # User overlap similarity matrix

Dataset Files

1. Main Dataset (`far-left_dataset.ndjson`)
Fields:
- `id`: Unique post identifier
- `author`: Author username
- `subreddit`: Subreddit name
- `created_utc`: Post creation timestamp
- `post`: Post content
- `title`: Post title 
- `subreddit_type`: Category used for analyzing related communities
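
For readers with access to the files, one way to read the main dataset is to stream it line by line. This is a sketch that assumes the file holds one JSON object per line (the usual NDJSON layout) with the fields listed above; the helper names `iter_ndjson` and `posts_per_subreddit` are hypothetical and not part of the released code.

```python
import json
from collections import Counter
from pathlib import Path
from typing import Iterator


def iter_ndjson(path) -> Iterator[dict]:
    """Yield one record per non-empty line of an NDJSON file."""
    with Path(path).open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:  # tolerate blank lines
                yield json.loads(line)


def posts_per_subreddit(path) -> Counter:
    """Count records by their `subreddit` field."""
    return Counter(record["subreddit"] for record in iter_ndjson(path))
```

Streaming avoids loading the whole file into memory. A call such as `posts_per_subreddit("data/far-left_dataset.ndjson")` would return the number of submissions and comments per subreddit.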

2. Topic Keywords (`topic_keywords.json`)
Content: Dictionary mapping topic IDs to lists of representative keywords


3. Topic Assignments (`topics.jsonl`)
Fields:
- `subreddit`: Subreddit name
- `topic`: Topic ID
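
The topic assignments can be joined with the keyword dictionary along these lines. This is a sketch under stated assumptions: both files are already parsed into Python objects, and topic IDs may be numbers in `topics.jsonl` while JSON object keys in `topic_keywords.json` are always strings, so IDs are normalized to strings before lookup. The function name `keywords_by_subreddit` is hypothetical.

```python
def keywords_by_subreddit(topic_keywords: dict, assignments: list) -> dict:
    """Map each subreddit to a list of (topic_id, keywords) pairs.

    topic_keywords: parsed topic_keywords.json (topic ID -> keyword list).
    assignments: parsed rows of topics.jsonl, each with `subreddit` and `topic`.
    """
    result = {}
    for row in assignments:
        # ASSUMPTION: IDs may differ in type between the two files,
        # so compare them as strings.
        topic_id = str(row["topic"])
        keywords = topic_keywords.get(topic_id, [])  # empty list if ID is unknown
        result.setdefault(row["subreddit"], []).append((topic_id, keywords))
    return result
```

Unknown topic IDs map to an empty keyword list rather than raising an error, which keeps the join usable if some assignments have no keyword entry.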


4. User Overlap Similarity Matrix (`ideology_user_base_similarity_matrix.json`)
Content: Similarity matrix showing user base similarities between subreddits
- Keys: subreddit pairs
- Values: similarity scores ranging from 0 (no overlap) to 1 (complete overlap)
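
As an illustration of how the similarity scores might be queried, the sketch below ranks the subreddits most similar to a given one. The exact key layout of the JSON file is not specified here, so this assumes a nested mapping where `matrix[a][b]` is the similarity between subreddits `a` and `b`; a flat layout with combined pair keys would need a different lookup. The function name `most_similar` is hypothetical.

```python
def most_similar(matrix: dict, subreddit: str, k: int = 3) -> list:
    """Return the k subreddits with the highest similarity to `subreddit`.

    ASSUMPTION: `matrix` is a nested dict, matrix[a][b] -> score in [0, 1].
    """
    scores = {
        other: score
        for other, score in matrix[subreddit].items()
        if other != subreddit  # skip the self-similarity entry
    }
    for other, score in scores.items():
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score out of range for {subreddit!r}/{other!r}: {score}")
    # Highest similarity first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:k]
```

The range check reflects the documented bounds of 0 (no overlap) and 1 (complete overlap) and flags malformed input early.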

If you use this dataset in a publication of any form or kind, please cite it using the following BibTeX entry:

@misc{balcı2025rolltanksmeasuringleftwing,
      title={Roll in the Tanks! Measuring Left-wing Extremism on Reddit at Scale}, 
      author={Utkucan Balcı and Michael Sirivianos and Jeremy Blackburn},
      year={2025},
      eprint={2307.06981},
      archivePrefix={arXiv},
      primaryClass={cs.SI},
      url={https://arxiv.org/abs/2307.06981}, 
}

Files

Restricted

The record is publicly accessible, but files are restricted to users with access.

Additional details

Funding

European Commission
MedDMO 101083756