iDRAMA-Scored-2024: A Dataset of the Scored Social Media Platform from 2020 to 2023
Creators
Description
ABSTRACT
---------------
Online web communities often face bans for violating platform policies, encouraging their migration to alternative platforms. This migration, however, can result in increased toxicity and unforeseen consequences on the new platform. In recent years, researchers have collected data from many alternative platforms, indicating coordinated efforts leading to offline events, conspiracy movements, hate speech propagation, and harassment. Thus, it becomes crucial to characterize and understand these alternative platforms. To advance research in this direction, we collect and release a large-scale dataset from Scored -- an alternative Reddit platform that sheltered banned fringe communities, for example, c/TheDonald (a prominent right-wing community) and c/GreatAwakening (a conspiratorial community). Over four years, we collected approximately 57M posts from Scored, with at least 58 communities identified as migrating from Reddit and over 950 communities created since the platform's inception. Furthermore, we provide sentence embeddings of all posts in our dataset, generated through a state-of-the-art model, to further advance the field in characterizing the discussions within these communities. We aim to provide these resources so that researchers can conduct their investigations without the need for extensive data collection and processing efforts.
- Scored platform: https://scored.co
 - Link to paper: https://arxiv.org/abs/2405.10233
 - License: CC BY-NC-SA 4.0
 
Repository links
- Zenodo: From Zenodo, researchers can download the `lite` version of this dataset, which includes only the 57M posts from Scored (not the sentence embeddings).
- Github: The main repository of this dataset, where we provide code snippets to get started with this dataset.
- Huggingface: On Huggingface, we provide the complete dataset with sentence embeddings.
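
As a rough illustration of how the sentence embeddings might be used, the sketch below compares two posts by the cosine similarity of their embedding vectors. It assumes the embeddings have already been loaded as plain lists of floats; the vectors shown are made-up toy values, not taken from the dataset, and the function name `cosine_similarity` is our own.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine similarity of two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy 3-dimensional vectors (hypothetical; real sentence embeddings
# have many more dimensions).
post_a = [0.1, 0.3, 0.5]
post_b = [0.2, 0.6, 1.0]      # points in the same direction as post_a
post_c = [0.5, -0.1, -0.04]   # orthogonal to post_a

print(cosine_similarity(post_a, post_b))
print(cosine_similarity(post_a, post_c))
```

Values close to 1 suggest semantically similar posts, while values near 0 suggest unrelated ones.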
 
Dataset Info
| File name | Data points |
| --- | --- |
| comments-2020 | 12,774,203 |
| comments-2021 | 16,097,941 |
| comments-2022 | 12,730,301 |
| comments-2023 | 8,919,159 |
| submissions-2020-to-2023 | 6,293,980 |
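
As a quick consistency check, the per-file counts above can be summed to confirm the figure of approximately 57M posts given in the abstract. The dictionary keys mirror the file names in the table; the variable names are illustrative only.

```python
# Data-point counts copied from the Dataset Info table.
data_points = {
    "comments-2020": 12_774_203,
    "comments-2021": 16_097_941,
    "comments-2022": 12_730_301,
    "comments-2023": 8_919_159,
    "submissions-2020-to-2023": 6_293_980,
}

# Comments are split by year; submissions are in a single file.
comments_total = sum(
    count for name, count in data_points.items() if name.startswith("comments-")
)
total = sum(data_points.values())

print(f"comments:  {comments_total:,}")  # comments:  50,521,604
print(f"all posts: {total:,}")           # all posts: 56,815,584
```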
Authorship
This dataset was published at the International AAAI Conference on Web and Social Media (AAAI ICWSM 2024), held in Buffalo, NY, USA.
- Academic Organization: iDRAMA Lab
 - Affiliation: Binghamton University, Boston University, University of California Riverside
 
Licensing
This dataset is free to use under the terms of the non-commercial CC BY-NC-SA 4.0 license.
Citation
@inproceedings{patel2024idrama,
title={iDRAMA-Scored-2024: A Dataset of the Scored Social Media Platform from 2020 to 2023},
author={Patel, Jay and Paudel, Pujan and De Cristofaro, Emiliano and Stringhini, Gianluca and Blackburn, Jeremy},
booktitle={Proceedings of the International AAAI Conference on Web and Social Media},
volume={18},
pages={2014--2024},
year={2024},
issn = {2334-0770},
doi = {10.1609/icwsm.v18i1.31444},
}
Files (22.0 GB)
Additional details
Funding
- U.S. National Science Foundation
  - Collaborative Research: SaTC: TTP: Medium: iDRAMA.cloud: A Platform for Measuring and Understanding Information Manipulation (award no. 2247868)
- U.S. National Science Foundation
  - Collaborative Research: SaTC: TTP: Medium: iDRAMA.cloud: A Platform for Measuring and Understanding Information Manipulation (award no. 2247867)
              
Software
- Repository URL: https://github.com/idramalab/iDRAMA-scored-2024
- Development Status: Active