Published June 18, 2025 | Version v3
Dataset Open

MADOC: Multi-Platform Aggregated Dataset of Online Communities

Description

The Multi-Platform Aggregated Dataset of Online Communities (MADOC) supports computational social science research by providing a unified, standardized resource for cross-platform analysis of online social dynamics. MADOC aggregates and standardizes data from four platforms (Bluesky, Koo, Reddit, and Voat), spanning 2012 to 2024. It contains 18.9 million posts, 236 million comments, and data from 23.1 million unique users across all platforms, with a particular focus on community dynamics, user migration patterns, and the evolution of toxic behavior across platforms. Through standardized data structures and FAIR-compliant access via Zenodo, MADOC enables comparative analyses of user behavior, interaction networks, and content sentiment across diverse social media environments. Its cross-platform scope, standardized structure, and rich metadata make it particularly suitable for studying community formation, the propagation of toxic behavior, and user migration in response to platform moderation policies.

Technical info

MADOC aggregates and standardizes data from four distinct platforms, each covering the period shown:

  • Reddit (2014-2020)
  • Voat (2013-2020)
  • Bluesky (2023-2024)
  • Koo (2020-2023)

Dataset Scale

  • 18.9 million posts
  • 236 million comments
  • 23.1 million unique users across all platforms

Dataset Structure

The dataset is stored in Apache Parquet format, with separate files for each community. Each row represents a single interaction (post, comment, or repost) with the following standardized fields:
 
Field                  Description
post_id                Unique identifier for the interaction, anonymized
publish_date           UNIX timestamp of the interaction (seconds since epoch)
user_id                Anonymized identifier of the content creator
parent_id              Identifier of the parent post for comments/replies, NA for posts
parent_user_id         Identifier of the parent post's creator, NA for posts
content                Textual content of the interaction, with URLs extracted
url                    External URLs referenced in the post/comment content
language               Detected language code (currently 'English' for all entries)
interaction_type       Type of interaction: 'POST', 'COMMENT', or 'REPOST'
platform               Source platform: 'reddit', 'voat', 'bluesky', or 'koo'
community              Community identifier (subreddit/subverse name), NA for Bluesky/Koo
sentiment_vader        VADER sentiment score ranging from -1 (negative) to 1 (positive)
sentiment_textblob     TextBlob sentiment polarity score ranging from -1 (negative) to 1 (positive)
subjectivity_textblob  TextBlob subjectivity score ranging from 0.0 (objective) to 1.0 (subjective)
toxicity_toxigen       ToxiGen toxicity score ranging from 0.00 (non-toxic) to 1.00 (toxic)
strict_filter          Whether content matches strict keyword filtering criteria (1/0)
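
As a rough illustration of how the fields above might be consumed, the following sketch turns one row into a typed record and derives reply edges from parent_user_id. It is a minimal sketch under stated assumptions, not the authors' tooling: the names Interaction, parse_row, reply_edges, and load_community are hypothetical, the path passed to load_community is a placeholder, missing values are assumed to arrive as None, NaN, or the string 'NA', and reading Parquet assumes pandas with a Parquet engine such as pyarrow is installed.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Iterable, Iterator, Optional, Tuple


@dataclass(frozen=True)
class Interaction:
    """Subset of the standardized fields (hypothetical helper type)."""
    post_id: str
    published: datetime            # publish_date as a timezone-aware UTC datetime
    user_id: str
    parent_user_id: Optional[str]  # None for top-level posts
    interaction_type: str          # 'POST', 'COMMENT', or 'REPOST'
    platform: str
    community: Optional[str]       # None where the source has NA


def _none_if_missing(value):
    """Map None, NaN, and the literal string 'NA' to None (assumed encodings)."""
    if value is None or value == "NA":
        return None
    if isinstance(value, float) and value != value:  # NaN never equals itself
        return None
    return value


def parse_row(row: dict) -> Interaction:
    """Build an Interaction from a dict keyed by the documented field names."""
    return Interaction(
        post_id=str(row["post_id"]),
        published=datetime.fromtimestamp(int(row["publish_date"]), tz=timezone.utc),
        user_id=str(row["user_id"]),
        parent_user_id=_none_if_missing(row.get("parent_user_id")),
        interaction_type=row["interaction_type"],
        platform=row["platform"],
        community=_none_if_missing(row.get("community")),
    )


def reply_edges(items: Iterable[Interaction]) -> Iterator[Tuple[str, str]]:
    """Yield (author, parent_author) pairs for interactions that have a parent."""
    for item in items:
        if item.parent_user_id is not None:
            yield (item.user_id, item.parent_user_id)


def load_community(path: str) -> list:
    """Read one community file; assumes pandas plus a Parquet engine is installed."""
    import pandas as pd  # imported lazily so the helpers above need only the stdlib
    return [parse_row(r) for r in pd.read_parquet(path).to_dict(orient="records")]
```

The edge pairs produced by reply_edges could be passed to a graph library to build a user interaction network; keeping the Parquet import inside load_community lets the parsing helpers be used on any iterable of row dictionaries.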

Platform Statistics

Reddit
 
Total interactions: 247.6M
Posts: 14.5M
Comments: 233.1M
Unique users: 22.6M
Average posts per user: 2.7
Average comments per user: 11.6
Mean sentiment: 0.063
URLs percentage: 10.3%

Voat
 
Total interactions: 1.2M
Posts: 0.4M
Comments: 0.8M
Unique users: 0.1M
Average posts per user: 8.5
Average comments per user: 7.8
Mean sentiment: 0.011
URLs percentage: 31.8%

Bluesky
 
Total interactions: 2.8M
Posts: 0.9M
Comments: 0.9M
Unique users: 0.2M
Average posts per user: 14.4
Average comments per user: 6.2
Mean sentiment: 0.088
URLs percentage: 4.5%

Koo
 
Total interactions: 4.3M
Posts: 3.1M
Comments: 1.2M
Unique users: 0.2M
Average posts per user: 23.0
Average comments per user: 10.6
Mean sentiment: 0.054
URLs percentage: 11.6%
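
The per-platform figures above can be cross-checked against the totals in the Dataset Scale section with a few lines of arithmetic. The sketch below copies the numbers (in millions) from this page; the names PLATFORM_STATS, column_total, and remainder are hypothetical. The remainder function reports how many interactions are not accounted for by posts and comments; reading that remainder as reposts is an assumption, since repost counts are not listed here.

```python
# Figures in millions, copied from the Platform Statistics section.
PLATFORM_STATS = {
    "reddit":  {"total": 247.6, "posts": 14.5, "comments": 233.1, "users": 22.6},
    "voat":    {"total": 1.2,   "posts": 0.4,  "comments": 0.8,   "users": 0.1},
    "bluesky": {"total": 2.8,   "posts": 0.9,  "comments": 0.9,   "users": 0.2},
    "koo":     {"total": 4.3,   "posts": 3.1,  "comments": 1.2,   "users": 0.2},
}


def column_total(key: str) -> float:
    """Sum one statistic over all platforms, rounded to one decimal (millions)."""
    return round(sum(stats[key] for stats in PLATFORM_STATS.values()), 1)


def remainder(platform: str) -> float:
    """Interactions (millions) that are neither posts nor comments."""
    stats = PLATFORM_STATS[platform]
    return round(stats["total"] - stats["posts"] - stats["comments"], 1)


if __name__ == "__main__":
    for key in ("posts", "comments", "users"):
        print(key, column_total(key))
    for name in PLATFORM_STATS:
        print(name, "remainder:", remainder(name))
```

Because the published figures are rounded to one decimal place, this check only confirms agreement at that precision.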

Communities Covered

General Interest Communities
 
1. funny
2. gaming
3. pics
4. videos
5. gifs
6. technology

Controversial Communities
 
1. fatpeoplehate
2. GreatAwakening
3. MillionDollarExtreme
4. CringeAnarchy
5. KotakuInAction
6. MensRights
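
One way to compare the two groups of communities is to label each row by group and average a score field within each group. The sketch below is illustrative only: the names GENERAL_INTEREST, CONTROVERSIAL, community_group, and mean_score_by_group are hypothetical, rows are assumed to be dictionaries keyed by the documented field names, and the community value is assumed to match the names listed above exactly, including case.

```python
from collections import defaultdict
from statistics import mean
from typing import Iterable, Optional

# Community names copied from the two lists above.
GENERAL_INTEREST = frozenset(
    {"funny", "gaming", "pics", "videos", "gifs", "technology"}
)
CONTROVERSIAL = frozenset(
    {"fatpeoplehate", "GreatAwakening", "MillionDollarExtreme",
     "CringeAnarchy", "KotakuInAction", "MensRights"}
)


def community_group(name) -> Optional[str]:
    """Return the group label for a community, or None if it is in neither list."""
    if name in GENERAL_INTEREST:
        return "general_interest"
    if name in CONTROVERSIAL:
        return "controversial"
    return None


def mean_score_by_group(rows: Iterable[dict], field: str = "sentiment_vader") -> dict:
    """Average a numeric field per group, skipping unlisted communities and missing scores."""
    buckets = defaultdict(list)
    for row in rows:
        group = community_group(row.get("community"))
        value = row.get(field)
        if group is not None and value is not None:
            buckets[group].append(float(value))
    return {group: mean(values) for group, values in buckets.items()}
```

Rows from Bluesky and Koo, where community is NA, fall into neither group and are skipped by this sketch.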

Responsible Usage Policy

By using this dataset, researchers agree to:

1. Research Purpose: Use the dataset solely for legitimate research purposes that aim to understand and improve online social dynamics.

2. Privacy Protection:
- Not attempt to de-anonymize or re-identify users
- Not link or combine this dataset with other data sources to reveal user identities
- Not contact or attempt to identify individuals mentioned in post content

3. Ethical Research:
- Obtain appropriate institutional ethics approval when required
- Follow established guidelines for ethical social media research
- Consider potential impacts on marginalized or vulnerable communities

4. Prohibited Uses:
- Training harmful AI models or content generation systems
- Analyzing patterns to identify or target vulnerable communities

5. Publication Guidelines:
- Properly cite the dataset in any resulting publications
- Clearly document any filtering or processing applied to the data
- Share research findings responsibly, considering potential misuse

License
 
This dataset is released under the Creative Commons Attribution 4.0 International License (CC BY 4.0).

Right to Erasure (Right to be forgotten)

All users included in the MADOC dataset have the right to opt out and request the removal of their data, in accordance with GDPR provisions. However, because the dataset was created for scientific research purposes, it falls under the scenarios for which the GDPR provides derogations. The data has been thoroughly pseudonymized.

If you wish to have your interactions excluded from this dataset, please submit your request to madoc@ctrust.ac.rs. 

We will process your request as soon as possible.
 
Contact

You can reach us at: madoc@ctrust.ac.rs

Files

Files (44.6 GB)

Checksum                              Size
md5:f43f42bc57923e22c3f4929e743cc60b  477.4 MB
md5:f017f38b81d9944e94a34ab5f5a709d7  811.8 MB
md5:68a3fc135ba06b79d163e49da81ae1db  1.0 GB
md5:406ac82ad76c2965427ef472d95fc030  229.6 MB
md5:fc1847d79be6066cbe8a5272e83933ce  9.7 GB
md5:e39193f8cff33bc5f12315c882d7e834  7.6 GB
md5:f062b7261d2f5db77313463eb5c3342a  3.3 GB
md5:22ede9020d26e22cb97facf73e2c4c71  189.3 MB
md5:90cb290f0383c2a82ae22c98ba3ccdf2  1.6 GB
md5:a1cb13dec5f57664c0be19a91093c9e0  835.4 MB
md5:7be15783688d1373ba4b5acc8f4ea676  183.3 MB
md5:6de693845d9da284a0687c0f4fe5ac96  8.8 GB
md5:36b2f7e2a65ac08602bdc90e10a97054  2.7 GB
md5:264065c5db5f1614a52dae2f70fd9055  6.9 GB
md5:f03595493866442f66e08158dad0cbbd  535.6 kB
md5:4d6d845ce05bfef07653a4d65a32d217  68.6 MB
md5:e4e5cb4124f55002630509b6fba43ab1  22.0 MB
md5:8d0e1617fd8e41f98840a676db68842a  14.3 MB
md5:b9d71fe50b3e17d1e84c73f9c06e2821  3.2 MB
md5:b6ffd00c0f9efac5e4f62aa51db676d2  81.9 MB
md5:9a978ad18eb8db315af7e947fc8d02c5  2.0 MB
md5:2aec5858f6643b04c182309ff5af44b6  833.3 kB
md5:d33bb5317d2090343730df0505279344  4.0 MB
md5:87c8657e1c83be5065f2146bc08e84c7  6.1 MB
md5:112d3bf8e5261eb694e8d7de12616574  17.1 MB
md5:89d337d134d06dfed7c58e0526d202c8  13.6 MB
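
Since each file is listed with an MD5 checksum, a download can be verified before use. The sketch below uses only Python's standard hashlib module; the function names md5_of_file and matches_checksum are hypothetical, and the path argument is whatever local file name the download was saved under. The file is read in chunks so that the multi-gigabyte files do not need to fit in memory.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks by default."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns empty bytes.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path: str, listed: str) -> bool:
    """Compare a file against a listed checksum such as 'md5:<32 hex digits>'."""
    expected = listed.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected
```

A call such as matches_checksum(local_path, "md5:f43f42bc57923e22c3f4929e743cc60b") would return True only if the local file is byte-identical to the first file listed above.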

Additional details

Related works

Is derived from
Dataset: 10.5281/zenodo.11082879 (DOI)
Dataset: 10.5281/zenodo.10476212 (DOI)
Dataset: 10.5281/zenodo.5841668 (DOI)

Funding

Science Fund of the Republic of Serbia
Topology-derived methods for the analysis of collective trust dynamics (grant no. 7416)

Software

Development Status
Active