Published January 13, 2025 | Version 1.0.0
Dataset Open

MADOC: Multi-Platform Aggregated Dataset of Online Communities

Description

The Multi-Platform Aggregated Dataset of Online Communities (MADOC) supports computational social science research by providing a unified, standardized dataset for cross-platform analysis of online social dynamics. MADOC aggregates and standardizes data from four platforms (Bluesky, Koo, Reddit, and Voat), spanning from 2012 to 2024. The dataset includes 18.9 million posts, 236 million comments, and 23.1 million unique users across all platforms, and was assembled with a particular focus on community dynamics, user migration patterns, and the evolution of toxic behavior across platforms. With standardized data structures and FAIR-compliant access through Zenodo, MADOC enables comparative analyses of user behavior, interaction networks, and content sentiment across diverse social media environments. Its cross-platform scope, standardized structure, and rich metadata make it particularly suitable for studying societal phenomena such as community formation, the propagation of toxic behavior, and user migration in response to platform moderation policies.

Technical info

MADOC aggregates and standardizes data from four platforms, covering the following periods:

  • Reddit (2014-2020)
  • Voat (2013-2020)
  • Bluesky (2023-2024)
  • Koo (2020-2023)

Dataset Scale

  • 18.9 million posts
  • 236 million comments
  • 23.1 million unique users across all platforms

Dataset Structure

The dataset is stored in Apache Parquet format, with separate files for each community. Each row represents a single interaction (post, comment, or repost) with the following standardized fields:
 
Field             Description
post_id           Unique identifier for the interaction, anonymized
publish_date      UNIX timestamp of the interaction (seconds since epoch)
user_id           Anonymized identifier of the content creator
parent_id         Identifier of the parent post for comments/replies, NA for posts
parent_user_id    Identifier of the parent post's creator, NA for posts
content           Textual content of the interaction, with URLs extracted
url               External URLs referenced in the post/comment content
language          Detected language code (currently 'English' for all entries)
interaction_type  Type of interaction: 'POST', 'COMMENT', or 'REPOST'
platform          Source platform: 'reddit', 'voat', 'bluesky', or 'koo'
community         Community identifier (subreddit/subverse name), NA for Bluesky/Koo
sentiment_vader   VADER sentiment score ranging from -1 (negative) to 1 (positive)
strict_filter     Whether content matches strict keyword filtering criteria (1/0)
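As an illustration of how these fields fit together, the following is a minimal Python sketch, not part of the official MADOC tooling. It converts publish_date to a UTC datetime and uses user_id and parent_user_id to count reply edges between users. The helper names (load_rows, is_missing, published_at, reply_edges) are hypothetical, and the way NA values are encoded in the Parquet files is an assumption: the sketch treats None, NaN, and the literal string 'NA' as missing.

```python
import math
from collections import Counter
from datetime import datetime, timezone

# Column names exactly as listed in the field table.
FIELDS = (
    "post_id", "publish_date", "user_id", "parent_id", "parent_user_id",
    "content", "url", "language", "interaction_type", "platform",
    "community", "sentiment_vader", "strict_filter",
)


def load_rows(path, columns=FIELDS):
    """Read one Parquet file into a list of dicts (hypothetical helper).

    Needs pandas plus a Parquet engine such as pyarrow. The import is lazy
    so the helpers below also work on rows obtained some other way.
    """
    import pandas as pd
    return pd.read_parquet(path, columns=list(columns)).to_dict("records")


def is_missing(value):
    """Assumption: 'NA' may surface as None, NaN, or the string 'NA'."""
    if value is None:
        return True
    if isinstance(value, float):
        return math.isnan(value)
    return isinstance(value, str) and value == "NA"


def published_at(row):
    """Convert publish_date (seconds since epoch) to an aware UTC datetime."""
    return datetime.fromtimestamp(int(row["publish_date"]), tz=timezone.utc)


def reply_edges(rows):
    """Count directed edges (commenter -> creator of the parent interaction)."""
    edges = Counter()
    for row in rows:
        if row["interaction_type"] != "COMMENT":
            continue
        if is_missing(row["parent_user_id"]):
            continue
        edges[(row["user_id"], row["parent_user_id"])] += 1
    return edges
```

Loading a whole file into a list of dicts is only reasonable for the smaller files. For the multi-gigabyte ones, reading a subset of columns or processing the file in batches is likely the safer choice.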

Platform Statistics

Reddit
 
Total interactions: 247.6M
Posts: 14.5M
Comments: 233.1M
Unique users: 22.6M
Average posts per user: 2.7
Average comments per user: 11.6
Mean sentiment: 0.063
URLs percentage: 10.3%

Voat
 
Total interactions: 1.2M
Posts: 0.4M
Comments: 0.8M
Unique users: 0.1M
Average posts per user: 8.5
Average comments per user: 7.8
Mean sentiment: 0.011
URLs percentage: 31.8%

Bluesky
 
Total interactions: 2.8M
Posts: 0.9M
Comments: 0.9M
Unique users: 0.2M
Average posts per user: 14.4
Average comments per user: 6.2
Mean sentiment: 0.088
URLs percentage: 4.5%

Koo
 
Total interactions: 4.3M
Posts: 3.1M
Comments: 1.2M
Unique users: 0.2M
Average posts per user: 23.0
Average comments per user: 10.6
Mean sentiment: 0.054
URLs percentage: 11.6%
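
The per-platform figures above can be cross-checked against the totals reported under Dataset Scale. The short Python sketch below uses only the rounded numbers printed in this record (in millions); the names PLATFORMS, column_sum, and unexplained are illustrative. Because the inputs are rounded to one decimal, the check is approximate.

```python
# Figures in millions, copied from the Platform Statistics section.
PLATFORMS = {
    "reddit":  {"total": 247.6, "posts": 14.5, "comments": 233.1, "users": 22.6},
    "voat":    {"total": 1.2,   "posts": 0.4,  "comments": 0.8,   "users": 0.1},
    "bluesky": {"total": 2.8,   "posts": 0.9,  "comments": 0.9,   "users": 0.2},
    "koo":     {"total": 4.3,   "posts": 3.1,  "comments": 1.2,   "users": 0.2},
}


def column_sum(key):
    """Sum one statistic over all four platforms, rounded to 0.1M."""
    return round(sum(stats[key] for stats in PLATFORMS.values()), 1)


def unexplained(platform):
    """Interactions (in millions) that are neither posts nor comments."""
    stats = PLATFORMS[platform]
    return round(stats["total"] - stats["posts"] - stats["comments"], 1)


if __name__ == "__main__":
    for key in ("posts", "comments", "users"):
        print(key, column_sum(key))
    for name in PLATFORMS:
        print(name, "residual:", unexplained(name))
```

The column sums reproduce the 18.9 million posts, 236 million comments, and 23.1 million users stated under Dataset Scale. For Reddit, Voat, and Koo, posts plus comments account for the total interactions. For Bluesky about 1.0 million interactions remain; these are presumably reposts, since 'REPOST' is a listed interaction type, but this record does not say so explicitly.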

Communities Covered

General Interest Communities
 
1. funny
2. gaming
3. pics
4. videos
5. gifs
6. technology

Controversial Communities
 
1. fatpeoplehate
2. GreatAwakening
3. MillionDollarExtreme
4. CringeAnarchy
5. KotakuInAction
6. MensRights

Responsible Usage Policy

By using this dataset, researchers agree to:

1. Research Purpose: Use the dataset solely for legitimate research purposes that aim to understand and improve online social dynamics.

2. Privacy Protection:
- Make no attempts to de-anonymize or re-identify users
- Not link or combine this dataset with other data sources to reveal user identities
- Not contact or attempt to identify individuals mentioned in post content

3. Ethical Research:
- Obtain appropriate institutional ethics approval when required
- Follow established guidelines for ethical social media research
- Consider potential impacts on marginalized or vulnerable communities

4. Prohibited Uses:
- Training harmful AI models or content generation systems
- Analyzing patterns to identify or target vulnerable communities

5. Publication Guidelines:
- Properly cite the dataset in any resulting publications
- Clearly document any filtering or processing applied to the data
- Share research findings responsibly, considering potential misuse

License
 
This dataset is released under the Creative Commons Attribution 4.0 International License.

Right to Erasure (Right to be forgotten)

All users included in the MADOC dataset have the right to opt out and request the removal of their data, in accordance with GDPR provisions. However, the dataset was created for scientific research purposes and therefore falls under the scenarios for which GDPR provides derogations. The data has been thoroughly pseudonymized.

If you wish to have your interactions excluded from this dataset, please submit your request to madoc@ctrust.ac.rs. 

We will process your request as soon as possible.
 
Contact

You can reach us at: madoc@ctrust.ac.rs

Files

README.md

Files (41.9 GB)

Checksum                              Size
md5:8692f5f7d5cc335a6b4780e839421172  449.3 MB
md5:201715cf3c5c8a1e220247b3b45d2e58  774.3 MB
md5:56ebadfb619963f98da2d734430111db  5.4 kB
md5:21b456602826f28232db8ac3b79085fa  951.7 MB
md5:ce77de69fbe8a4ee4d153d87462b022c  214.5 MB
md5:04fce7de8c4dbe8f8d5a77d06c17b3fa  9.1 GB
md5:cc8e73d9710c7e257eb9801f737e6e68  7.2 GB
md5:1d40d06ad275d2b23ac119a3f5817f6f  3.1 GB
md5:e0cfd48a0a2965e84affd3bdcd8fec07  179.3 MB
md5:9e87f8edf1360a7356a66151854e9419  1.5 GB
md5:2c9da3a651a82ab81da614ca5ce22ab0  797.8 MB
md5:c8f6f131785d98dfecb112d156cfe30f  170.2 MB
md5:0869d52aa49e00d1efb8ad6ba7fa64a8  8.3 GB
md5:5e3de767cd95238ee0358c1203d82bc3  2.5 GB
md5:19eb499eced66221be0ec97a95f17d98  6.5 GB
md5:707751d3dfc72b8a083d1288528c9786  464.9 kB
md5:b5fe7f8b3648fb654d24f66795a0d1e2  61.9 MB
md5:86e31c39c69400a80b67b41111ad268c  18.8 MB
md5:9445ede64b75ed8c10ca9a042446f14e  12.7 MB
md5:61eeda5a72ccce80661580ebbb0c5268  2.8 MB
md5:9e46908e983f8bbb098a2e512a986a6d  76.1 MB
md5:63d388fb42eb0499d72c31f5f0b57a86  1.8 MB
md5:61c1c46d53b9334dfe285a1fa4eedb37  773.7 kB
md5:1a4416f4a1d5f56f5e53fca666c36ad6  3.5 MB
md5:8acf3e79f11b8ca10c6a2614e0ee2c9c  5.3 MB
md5:e15002c55915aa0b3d33feedac5a9588  15.2 MB
md5:a7195e14ce7a0abe8ed3eff39a13a984  11.8 MB

Additional details

Related works

Is derived from
Dataset: 10.5281/zenodo.11082879 (DOI)
Dataset: 10.5281/zenodo.10476212 (DOI)
Dataset: 10.5281/zenodo.5841668 (DOI)

Funding

Science Fund of the Republic of Serbia
Topology-derived methods for the analysis of collective trust dynamics (grant no. 7416)

Software

Development Status
Active