Published June 15, 2025 | Version 1.0.0
Dataset Open

FediData: A Comprehensive Multi-Modal Fediverse Dataset from Mastodon

Creators

Description

FediData

Recently, decentralized online social networks (DOSNs) have developed and gained public attention. Unlike traditional social networks, DOSNs give users more freedom to store and access their data and communications in a distributed manner. Among them, Mastodon is one of the most representative DOSNs, consisting of multiple servers, each operated independently by individuals.

Mastodon’s distributed architecture features disparate instances, varied data formats, and strict rate limits. Its user-generated content (images, text, metadata, social links) is highly heterogeneous, unstandardized, and lacks consistent annotations. Together, these factors make unified data collection and effective analysis extremely challenging.

We present FediData, the first open multi-modal dataset collected from Mastodon. It is intended to provide realistic and reliable data for social behavior modeling, multi-modal learning, and research on user interaction mechanisms.

The corresponding code for data collection and data analysis is available at
https://github.com/FDUDataNET/FediData

Dataset

The dataset consists of several parts. The compressed file anonymous_60k_user_info.tar contains five files:

  • accounts_info_60k_anonymized. This CSV file contains all 64,345 users in our dataset and their account metadata. Anonymized user IDs are consistent between this file and all other files for ease of reference (e.g., User A is anonymized as 1a2b3c4d@instance in every file).
  • 12k_labeled_anonymized. This CSV file contains 12,548 users and the labels we manually assigned to them: 0 represents a real user, 1 represents a bot, 2 represents a user we find difficult to classify, and 3 represents a user who cannot be found via mastodon.social.
  • user_posts_60k_anonymized. This JSON file contains statistics on up to 20 posts from each of the 64,345 users (some users had fewer than 20 posts available for crawling), sorted by posting time from newest to oldest (identified by index), along with the image URLs attached to the posts. In total, we obtained 725,282 user posts and 872,231 image URLs. This content was filtered from the mastodon_ugc data we crawled, which will be described in detail below. To facilitate the use of image URLs, we did not anonymize the URLs.
  • download_status_60k_anonymized. This JSON file records the download status of each URL when downloading tweet images. Some images could not be downloaded successfully due to access restrictions or network issues. A total of 765,019 images were successfully downloaded, with a success rate of 87.7%.
  • edge_60k_anonymized. This CSV file records the interactions between 64,345 users, with a total of 3,044,116 edges, divided into two types of relationships: following (i.e., user A follows user B) and follower (i.e., user A is a fan of user B).
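
As a small illustration of the label scheme described for 12k_labeled_anonymized, the sketch below maps the four numeric codes to readable names and tallies them. The names `LABEL_NAMES`, `decode_label`, and `summarize_labels` are hypothetical helpers, not part of the released code, and the CSV column layout is deliberately not assumed here.

```python
from collections import Counter

# Label codes as documented for 12k_labeled_anonymized.
LABEL_NAMES = {
    0: "real user",
    1: "bot",
    2: "difficult to classify",
    3: "not found via mastodon.social",
}


def decode_label(code):
    """Map a numeric label code (int or numeric string) to its description."""
    return LABEL_NAMES[int(code)]


def summarize_labels(codes):
    """Count how many accounts fall under each label description."""
    return Counter(decode_label(code) for code in codes)
```

For example, `summarize_labels(["0", "0", "1", "3"])` counts two real users, one bot, and one account that could not be found.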

For easier uploading and downloading, we packaged the 765,019 crawled images into 36 compressed files of about 5 GB each, named 60k_jpg_part.7z.001 through 60k_jpg_part.7z.036. The images are named a_b_c.jpg or a_b_c.png, where a is the user's index in accounts_info_60k_anonymized, b is the post index for that user, and c is the index of the image's URL within that post. (Indexes start from 0.)
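
The a_b_c naming scheme can be decoded with a short helper. The following is a minimal sketch, assuming file names contain exactly three non-negative integers separated by underscores and a lowercase jpg or png extension; `ImageRef` and `parse_image_name` are hypothetical names introduced for illustration.

```python
import re
from typing import NamedTuple


class ImageRef(NamedTuple):
    user_index: int   # a: row index in accounts_info_60k_anonymized (0-based)
    post_index: int   # b: index of the post for that user (0-based)
    image_index: int  # c: index of the image URL within that post (0-based)
    extension: str    # "jpg" or "png"


# Assumed pattern: three integers joined by underscores, then the extension.
_IMAGE_NAME = re.compile(r"^(\d+)_(\d+)_(\d+)\.(jpg|png)$")


def parse_image_name(filename: str) -> ImageRef:
    """Split an image file name of the form a_b_c.jpg/png into its indexes."""
    match = _IMAGE_NAME.match(filename)
    if match is None:
        raise ValueError(f"unexpected image file name: {filename!r}")
    a, b, c, ext = match.groups()
    return ImageRef(int(a), int(b), int(c), ext)
```

Under these assumptions, `parse_image_name("12_0_3.jpg")` refers to the fourth image URL of the newest crawled post of the user at index 12.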

 

Notes (English)

We adhere to explicit protocols to ensure ethical data handling throughout data collection and pre-processing. We comply with rate limits, the Robots Exclusion Protocol (robots.txt), and DOSNs' content moderation policies to avoid unauthorized access to private data. FediData protects user privacy by collecting only publicly available information. We applied salted hashing to the crawled user data, generating non-reversible 8-character anonymized IDs while preserving each user's server (also called an instance) to facilitate analysis by researchers. For example, the user A@instance would be anonymized as 1a2b3c4d@instance.
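
The note above does not name the hash function or describe how the salt is handled, so the following is only a sketch of the general idea and not the authors' actual procedure. It assumes SHA-256 over the salt followed by the full handle, truncated to 8 hexadecimal characters; the function name `anonymize_handle` is hypothetical.

```python
import hashlib


def anonymize_handle(handle: str, salt: bytes) -> str:
    """Replace the user part of user@instance with an 8-character salted hash.

    Assumptions (not confirmed by the dataset description): SHA-256 as the
    hash function, the full handle as hash input, hex output cut to 8 chars.
    """
    username, sep, instance = handle.rpartition("@")
    if not sep or not username or not instance:
        raise ValueError(f"expected user@instance, got {handle!r}")
    digest = hashlib.sha256(salt + handle.encode("utf-8")).hexdigest()
    # Keep the instance in clear text so per-server analysis stays possible.
    return f"{digest[:8]}@{instance}"
```

Because the same salt and handle always produce the same ID, a user keeps one anonymized identity across all files, while the secret salt keeps the mapping from being recomputed from public handles.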

Citation

If you find this dataset useful for your research, please cite our paper:

Min Gao, Haoran Du, Wen Wen, Qiang Duan, Xin Wang, and Yang Chen.
2025. FediData: A Comprehensive Multi-Modal Fediverse Dataset from
Mastodon. In Proceedings of the 34th ACM International Conference on
Information and Knowledge Management (CIKM ’25), November 10–14, 2025,
Seoul, Republic of Korea. https://doi.org/10.1145/3746252.3761634.

Files (193.0 GB)

Checksum Size
md5:565fa80af069acdd21498da079becc05 5.4 GB
md5:0a30c44944fb8186070ccda0f9c26498 5.4 GB
md5:c42214ec3d29e226e6a0ef1bece1c589 5.4 GB
md5:90f5cc0ac6cdc0572bc86b19b4b6a8e5 5.4 GB
md5:70a93c7b5ab1494554c72b50aef0a2d3 5.4 GB
md5:91463dcab40983d4c313563c9d707646 5.4 GB
md5:c88a1aea5b32eacd7f730dd840fb59b3 5.4 GB
md5:70b70da57ad5550c1e581d00ab41ddec 5.4 GB
md5:4adc45f9af370b03abd6939075066086 5.4 GB
md5:e1abbf6de47c8aac9f32e4306987d4d1 5.4 GB
md5:81e520a21fc1f6733882490ebcb4400b 5.4 GB
md5:47435913febcc75389c139d4ad5f429b 5.4 GB
md5:57d5e9a24030b0cb57b7bf93be6be938 5.4 GB
md5:db781b3c5b8ea0418ce4ceed1be93c93 5.4 GB
md5:f0f2efca0c765d3eec4c26ce2e91bc29 5.4 GB
md5:0d9513613451edc714371d7f6f4d483a 5.4 GB
md5:33602f280e3658042d2c81f7e2084a3f 5.4 GB
md5:f5111a6b95604fd61cf4b07b0f53cc61 5.4 GB
md5:27fef53f44de6ba82ca65a999a8aa54d 5.4 GB
md5:06445db286ba89f234bedb8cf115b7ef 5.4 GB
md5:a636bb7f4032f3b470f92d0621db8a0e 5.4 GB
md5:e08c9e75190e0a5f621a18e27af4051a 5.4 GB
md5:c18f19d8ca20682357660ec722c10240 5.4 GB
md5:37322d6b22ec2f05386f4d539db84af6 5.4 GB
md5:787d78b094c2f5fa6c8173ab88843a6a 5.4 GB
md5:6e48506bb1c95ec01af048695dfceb34 5.4 GB
md5:a735ce0e63f81bf40180f68ef5bf5337 5.4 GB
md5:fca6d851da422e6bba671560b885c644 5.4 GB
md5:eb435359d48dae7155b7a9f1fc758ece 5.4 GB
md5:885e7a8432b64f687fed834bf6c962e2 5.4 GB
md5:6e62320c906b1a73cdf72da2054a517b 5.4 GB
md5:ecbbb018220b978737582582927bbfde 5.4 GB
md5:6f239dd983d06d85d81b8e5e27c77188 5.4 GB
md5:50ad372ead596a999b528e506e471614 5.4 GB
md5:aa71f5e808f41d80b25f43442afed8fa 5.4 GB
md5:77adfe38348e431abe2873c98cfc6d33 4.9 GB
md5:0d26f0862da062cc36ecef5400593ee3 157.9 MB
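
Given the size of these archives, it may be worth checking each download against its listed md5 value before extraction. The sketch below uses Python's standard hashlib module and reads the file in chunks so that multi-gigabyte files do not need to fit in memory; the helper names `md5_of_file` and `matches_checksum` are hypothetical, and the mapping from checksum to file name has to be taken from the record page itself.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file against a checksum given as 'md5:<hex>' or bare hex."""
    expected = expected.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected
```

A mismatch usually indicates an incomplete or corrupted download, in which case the affected part should be fetched again before the split archive is reassembled.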

Additional details

Related works

Is metadata for
Conference paper: 10.1145/3746252.3761634 (DOI)

Software

Repository URL
https://github.com/FDUDataNET/FediData
Programming language
Python