Published January 13, 2025 | Version 1.0.0
Dataset
Open
MADOC: Multi-Platform Aggregated Dataset of Online Communities
Creators
- Mitrovic Dankulov, Marija (Researcher) 1
- Tomašević, Aleksandar (Researcher) 2
- Maletic, Slobodan (Researcher) 3
- Andjelkovic, Miroslav (Researcher) 3
- Vranic, Ana (Researcher) 1, 4
- Cvetkovic, Darja (Researcher) 1
- Stupovski, Boris (Researcher) 1
- Vudragovic, Dusan (Researcher) 1
- Major, Sara (Researcher) 2
- Bogojević, Aleksandar (Researcher) 1
Description
The Multi-platform Aggregated Dataset of Online Communities (MADOC) is a comprehensive dataset that facilitates computational social science research by providing a unified, standardized dataset for cross-platform analysis of online social dynamics. MADOC aggregates and standardizes data from four distinct platforms: Bluesky, Koo, Reddit, and Voat, spanning from 2012 to 2024. The dataset includes 18.9 million posts, 236 million comments, and data from 23.1 million unique users across all platforms, with a particular focus on understanding community dynamics, user migration patterns, and the evolution of toxic behavior across platforms. By providing standardized data structures and FAIR-compliant access through Zenodo, MADOC enables researchers to conduct comparative analyses of user behavior, interaction networks, and content sentiment across diverse social media environments. The dataset's unique value lies in its cross-platform scope, standardized structure, and rich metadata, making it particularly suitable for studying societal phenomena such as community formation, toxic behavior propagation, and user migration patterns in response to platform moderation policies.
Technical info
MADOC aggregates and standardizes data from four distinct platforms, covering the following periods:
- Reddit (2014-2020)
- Voat (2013-2020)
- Bluesky (2023-2024)
- Koo (2020-2023)
Dataset Scale
- 18.9 million posts
- 236 million comments
- 23.1 million unique users across all platforms
Dataset Structure
The dataset is stored in Apache Parquet format, with separate files for each community. Each row represents a single interaction (post, comment, or repost) with the following standardized fields:
Field | Description |
---|---|
post_id | Unique identifier for the interaction, anonymized |
publish_date | UNIX timestamp of the interaction (seconds since epoch) |
user_id | Anonymized identifier of the content creator |
parent_id | Identifier of the parent post for comments/replies, NA for posts |
parent_user_id | Identifier of the parent post's creator, NA for posts |
content | Textual content of the interaction, with URLs extracted |
url | External URLs referenced in the post/comment content |
language | Detected language code (currently 'English' for all entries) |
interaction_type | Type of interaction: 'POST', 'COMMENT', or 'REPOST' |
platform | Source platform: 'reddit', 'voat', 'bluesky', or 'koo' |
community | Community identifier (subreddit/subverse name), NA for Bluesky/Koo |
sentiment_vader | VADER sentiment score ranging from -1 (negative) to 1 (positive) |
strict_filter | Whether content matches strict keyword filtering criteria (1/0) |
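As a rough illustration of how these fields fit together, the sketch below converts `publish_date` to a UTC datetime and counts reply edges between users. It works on plain dictionaries so that it does not depend on a particular Parquet reader. The sample rows, the identifiers in them, and the helper names `to_utc` and `reply_edges` are invented for illustration, and the code assumes that NA values appear as `None` after loading, which may differ depending on the reader used.

```python
from collections import Counter
from datetime import datetime, timezone


def to_utc(publish_date):
    """Convert a UNIX timestamp in seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(publish_date, tz=timezone.utc)


def reply_edges(rows):
    """Count directed (replying user, parent author) pairs among comments.

    Assumption: NA is represented as None; rows without a parent author
    are skipped.
    """
    edges = Counter()
    for row in rows:
        is_comment = row["interaction_type"] == "COMMENT"
        if is_comment and row["parent_user_id"] is not None:
            edges[(row["user_id"], row["parent_user_id"])] += 1
    return edges


# Hypothetical sample rows, not taken from the dataset.
rows = [
    {"post_id": "p1", "publish_date": 1420070400, "user_id": "u1",
     "parent_id": None, "parent_user_id": None,
     "interaction_type": "POST", "platform": "reddit", "community": "funny"},
    {"post_id": "c1", "publish_date": 1420070460, "user_id": "u2",
     "parent_id": "p1", "parent_user_id": "u1",
     "interaction_type": "COMMENT", "platform": "reddit", "community": "funny"},
    {"post_id": "c2", "publish_date": 1420070520, "user_id": "u2",
     "parent_id": "p1", "parent_user_id": "u1",
     "interaction_type": "COMMENT", "platform": "reddit", "community": "funny"},
]

edges = reply_edges(rows)
first_post_time = to_utc(rows[0]["publish_date"])
```

With pandas installed, rows in this shape could be obtained from a downloaded file via `pandas.read_parquet(path).to_dict("records")`, although for the larger files a column-wise approach would likely be more practical.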
Platform Statistics
Reddit
Total interactions: 247.6M
Posts: 14.5M
Comments: 233.1M
Unique users: 22.6M
Average posts per user: 2.7
Average comments per user: 11.6
Mean sentiment: 0.063
URLs percentage: 10.3%
Voat
Total interactions: 1.2M
Posts: 0.4M
Comments: 0.8M
Unique users: 0.1M
Average posts per user: 8.5
Average comments per user: 7.8
Mean sentiment: 0.011
URLs percentage: 31.8%
Bluesky
Total interactions: 2.8M
Posts: 0.9M
Comments: 0.9M
Unique users: 0.2M
Average posts per user: 14.4
Average comments per user: 6.2
Mean sentiment: 0.088
URLs percentage: 4.5%
Koo
Total interactions: 4.3M
Posts: 3.1M
Comments: 1.2M
Unique users: 0.2M
Average posts per user: 23.0
Average comments per user: 10.6
Mean sentiment: 0.054
URLs percentage: 11.6%
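The per-platform figures can be checked against the overall totals given under Dataset Scale. The following sketch copies the post, comment, and user counts (in millions) from the lists above and sums them; the dictionary layout and the helper name `total` are illustrative only. Summing the user counts assumes that users are counted separately on each platform.

```python
# Per-platform figures in millions, copied from the Platform Statistics list.
stats = {
    "reddit":  {"posts": 14.5, "comments": 233.1, "users": 22.6},
    "voat":    {"posts": 0.4,  "comments": 0.8,   "users": 0.1},
    "bluesky": {"posts": 0.9,  "comments": 0.9,   "users": 0.2},
    "koo":     {"posts": 3.1,  "comments": 1.2,   "users": 0.2},
}


def total(key):
    """Sum one figure over all platforms, rounded to one decimal place."""
    return round(sum(platform[key] for platform in stats.values()), 1)


totals = {key: total(key) for key in ("posts", "comments", "users")}
print(totals)
```

The sums come to 18.9 million posts, 236.0 million comments, and 23.1 million users, which agrees with the Dataset Scale figures.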
Communities Covered
General Interest Communities
1. funny
2. gaming
3. pics
4. videos
5. gifs
6. technology
Controversial Communities
1. fatpeoplehate
2. GreatAwakening
3. MillionDollarExtreme
4. CringeAnarchy
5. KotakuInAction
6. MensRights
Responsible Usage Policy
By using this dataset, researchers agree to:
1. Research Purpose: Use the dataset solely for legitimate research purposes that aim to understand and improve online social dynamics.
2. Privacy Protection:
- Make no attempts to de-anonymize or re-identify users
- Not link or combine this dataset with other data sources to reveal user identities
- Not contact or attempt to identify individuals mentioned in post content
3. Ethical Research:
- Obtain appropriate institutional ethics approval when required
- Follow established guidelines for ethical social media research
- Consider potential impacts on marginalized or vulnerable communities
4. Prohibited Uses:
- Training harmful AI models or content generation systems
- Analyzing patterns to identify or target vulnerable communities
5. Publication Guidelines:
- Properly cite the dataset in any resulting publications
- Clearly document any filtering or processing applied to the data
- Share research findings responsibly, considering potential misuse
License
This dataset is released under the Creative Commons Attribution 4.0 International License.
Right to Erasure (Right to be forgotten)
All users included in the MADOC dataset have the right to opt out and request the removal of their data, in accordance with GDPR provisions. However, the dataset was created for scientific research purposes, thereby falling under the scenarios for which GDPR provides derogations. The data has been thoroughly pseudonymized.
If you wish to have your interactions excluded from this dataset, please submit your request to madoc@ctrust.ac.rs.
We will process your request as soon as possible.
Files
README.md
Total size: 41.9 GB
MD5 checksum | Size |
---|---|
md5:8692f5f7d5cc335a6b4780e839421172 | 449.3 MB |
md5:201715cf3c5c8a1e220247b3b45d2e58 | 774.3 MB |
md5:56ebadfb619963f98da2d734430111db | 5.4 kB |
md5:21b456602826f28232db8ac3b79085fa | 951.7 MB |
md5:ce77de69fbe8a4ee4d153d87462b022c | 214.5 MB |
md5:04fce7de8c4dbe8f8d5a77d06c17b3fa | 9.1 GB |
md5:cc8e73d9710c7e257eb9801f737e6e68 | 7.2 GB |
md5:1d40d06ad275d2b23ac119a3f5817f6f | 3.1 GB |
md5:e0cfd48a0a2965e84affd3bdcd8fec07 | 179.3 MB |
md5:9e87f8edf1360a7356a66151854e9419 | 1.5 GB |
md5:2c9da3a651a82ab81da614ca5ce22ab0 | 797.8 MB |
md5:c8f6f131785d98dfecb112d156cfe30f | 170.2 MB |
md5:0869d52aa49e00d1efb8ad6ba7fa64a8 | 8.3 GB |
md5:5e3de767cd95238ee0358c1203d82bc3 | 2.5 GB |
md5:19eb499eced66221be0ec97a95f17d98 | 6.5 GB |
md5:707751d3dfc72b8a083d1288528c9786 | 464.9 kB |
md5:b5fe7f8b3648fb654d24f66795a0d1e2 | 61.9 MB |
md5:86e31c39c69400a80b67b41111ad268c | 18.8 MB |
md5:9445ede64b75ed8c10ca9a042446f14e | 12.7 MB |
md5:61eeda5a72ccce80661580ebbb0c5268 | 2.8 MB |
md5:9e46908e983f8bbb098a2e512a986a6d | 76.1 MB |
md5:63d388fb42eb0499d72c31f5f0b57a86 | 1.8 MB |
md5:61c1c46d53b9334dfe285a1fa4eedb37 | 773.7 kB |
md5:1a4416f4a1d5f56f5e53fca666c36ad6 | 3.5 MB |
md5:8acf3e79f11b8ca10c6a2614e0ee2c9c | 5.3 MB |
md5:e15002c55915aa0b3d33feedac5a9588 | 15.2 MB |
md5:a7195e14ce7a0abe8ed3eff39a13a984 | 11.8 MB |
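Since several of the files are multiple gigabytes in size, it may be worth verifying a download against its listed checksum. The sketch below is one way to do this with Python's standard `hashlib` module, reading the file in chunks so that large files do not have to fit in memory. The function names `md5_of_file` and `matches_listing` and the chunk size are illustrative choices, not part of the dataset's tooling.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path, listed):
    """Compare a file's digest with a listing entry such as 'md5:8692...'."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected
```

A call such as `matches_listing(local_path, "md5:8692f5f7d5cc335a6b4780e839421172")` would return `True` only if the file at the hypothetical `local_path` matches that entry.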
Additional details
Related works
- Is derived from
- Dataset: 10.5281/zenodo.11082879 (DOI)
- Dataset: 10.5281/zenodo.10476212 (DOI)
- Dataset: 10.5281/zenodo.5841668 (DOI)
Software
- Development Status: Active