FediData: A Comprehensive Multi-Modal Fediverse Dataset from Mastodon
Description
FediData
Decentralized online social networks (DOSNs) have recently developed and gained public attention. Unlike traditional social networks, DOSNs give users more freedom to store and access their data and communications in a distributed manner. Mastodon is one of the most representative DOSNs, consisting of multiple servers operated independently by individuals.
Mastodon's distributed architecture (disparate instances, varied data formats, and strict rate limits), combined with highly heterogeneous, unstandardized user-generated content (images, text, metadata, social links) that lacks consistent annotations, makes unified data collection and effective analysis extremely challenging.
We propose FediData, the first open multi-modal dataset collected from Mastodon, which is dedicated to providing realistic and reliable data support for social behavior modeling, multi-modal learning, and research on user interaction mechanisms.
The code for data collection and data analysis is available at
https://github.com/FDUDataNET/FediData
Dataset
The dataset consists of several parts. The compressed file anonymous_60k_user_info.tar contains five files:
- accounts_info_60k_anonymized. This CSV file contains all 64,345 users in our dataset and their account metadata. Anonymized user identifiers are consistent between this file and all other files for ease of reference (e.g., User A is anonymized as 1a2b3c@instance in all files).
- 12k_labeled_anonymized. This CSV file contains 12,548 users and the labels we manually assigned to them: 0 represents a real user, 1 represents a bot, 2 represents a user we found difficult to classify, and 3 represents a user who could not be found via mastodon.social.
- user_posts_60k_anonymized. This JSON file contains statistics on up to 20 posts from each of the 64,345 users (some users had fewer than 20 posts available for crawling), sorted by posting time from newest to oldest (identified by index), along with the image URLs attached to the posts. In total, we obtained 725,282 user posts and 872,231 image URLs. This content was filtered from the mastodon_ugc data we crawled, which will be described in detail below. To facilitate the use of image URLs, we did not anonymize the URLs.
- download_status_60k_anonymized. This JSON file records the download status of each URL when downloading post images. Some images could not be downloaded due to access restrictions or network issues. A total of 765,019 images were successfully downloaded, a success rate of 87.7%.
- edge_60k_anonymized. This CSV file records the interactions between 64,345 users, with a total of 3,044,116 edges, divided into two types of relationships: following (i.e., user A follows user B) and follower (i.e., user A is a fan of user B).
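As an illustration of how the label codes described above could be decoded, here is a minimal sketch. The column names (`user`, `label`) and the sample rows are assumptions made for the example; the actual header of 12k_labeled_anonymized may differ and should be checked before use.

```python
import csv
import io
from collections import Counter

# Label codes as stated in the description of 12k_labeled_anonymized.
LABEL_MEANINGS = {
    0: "real user",
    1: "bot",
    2: "difficult to classify",
    3: "not found via mastodon.social",
}


def count_labels(csv_file, label_column="label"):
    """Count users per label meaning.

    `label_column` is an assumed column name, not taken from the dataset.
    """
    counts = Counter()
    for row in csv.DictReader(csv_file):
        counts[LABEL_MEANINGS[int(row[label_column])]] += 1
    return counts


# Made-up in-memory rows standing in for the real CSV file.
sample = io.StringIO(
    "user,label\n"
    "1a2b3c@instance,0\n"
    "4d5e6f@instance,1\n"
    "7a8b9c@instance,0\n"
)
label_counts = count_labels(sample)
print(label_counts)
```

With the real file, `sample` would be replaced by an open file handle, and `label_column` adjusted to match the actual header.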
To facilitate uploading and downloading, we have packaged the 765,019 crawled images into 36 compressed files of about 5 GB each, named 60k_jpg_part.7z.001 through 60k_jpg_part.7z.036. The images are named a_b_c.jpg or a_b_c.png, where a is the user's index in accounts_info_60k_anonymized, b is the post index for that user, and c is the index of the image's URL within that post. (Indexes start from 0.)
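The naming scheme above can be sketched in code as follows. This is a minimal illustration, not the authors' tooling: the helper names `parse_image_name` and `archive_part_names` are hypothetical, and the pattern assumes that a, b, and c are plain non-negative integers and that the extension is exactly `jpg` or `png`.

```python
import re
from typing import List, NamedTuple


class ImageRef(NamedTuple):
    user_index: int   # a: row index in accounts_info_60k_anonymized (0-based)
    post_index: int   # b: index of the post for that user (0-based)
    image_index: int  # c: index of the image URL within that post (0-based)
    extension: str    # "jpg" or "png"


# Assumed pattern for names of the form a_b_c.jpg or a_b_c.png.
_IMAGE_NAME = re.compile(r"^(\d+)_(\d+)_(\d+)\.(jpg|png)$")


def parse_image_name(name: str) -> ImageRef:
    """Split an image file name into its user, post, and image indexes."""
    match = _IMAGE_NAME.match(name)
    if match is None:
        raise ValueError(f"unexpected image name: {name!r}")
    a, b, c, ext = match.groups()
    return ImageRef(int(a), int(b), int(c), ext)


def archive_part_names(count: int = 36) -> List[str]:
    """List the expected archive part names, .001 through .036 by default."""
    return [f"60k_jpg_part.7z.{i:03d}" for i in range(1, count + 1)]


print(parse_image_name("12_0_3.jpg"))
print(archive_part_names()[0], archive_part_names()[-1])
```

Listing the expected part names is useful for checking that all 36 volumes are present before extraction, since a multi-volume 7z archive cannot be unpacked with parts missing.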
Files
(193.0 GB)
Checksum | Size |
---|---|
md5:565fa80af069acdd21498da079becc05 | 5.4 GB |
md5:0a30c44944fb8186070ccda0f9c26498 | 5.4 GB |
md5:c42214ec3d29e226e6a0ef1bece1c589 | 5.4 GB |
md5:90f5cc0ac6cdc0572bc86b19b4b6a8e5 | 5.4 GB |
md5:70a93c7b5ab1494554c72b50aef0a2d3 | 5.4 GB |
md5:91463dcab40983d4c313563c9d707646 | 5.4 GB |
md5:c88a1aea5b32eacd7f730dd840fb59b3 | 5.4 GB |
md5:70b70da57ad5550c1e581d00ab41ddec | 5.4 GB |
md5:4adc45f9af370b03abd6939075066086 | 5.4 GB |
md5:e1abbf6de47c8aac9f32e4306987d4d1 | 5.4 GB |
md5:81e520a21fc1f6733882490ebcb4400b | 5.4 GB |
md5:47435913febcc75389c139d4ad5f429b | 5.4 GB |
md5:57d5e9a24030b0cb57b7bf93be6be938 | 5.4 GB |
md5:db781b3c5b8ea0418ce4ceed1be93c93 | 5.4 GB |
md5:f0f2efca0c765d3eec4c26ce2e91bc29 | 5.4 GB |
md5:0d9513613451edc714371d7f6f4d483a | 5.4 GB |
md5:33602f280e3658042d2c81f7e2084a3f | 5.4 GB |
md5:f5111a6b95604fd61cf4b07b0f53cc61 | 5.4 GB |
md5:27fef53f44de6ba82ca65a999a8aa54d | 5.4 GB |
md5:06445db286ba89f234bedb8cf115b7ef | 5.4 GB |
md5:a636bb7f4032f3b470f92d0621db8a0e | 5.4 GB |
md5:e08c9e75190e0a5f621a18e27af4051a | 5.4 GB |
md5:c18f19d8ca20682357660ec722c10240 | 5.4 GB |
md5:37322d6b22ec2f05386f4d539db84af6 | 5.4 GB |
md5:787d78b094c2f5fa6c8173ab88843a6a | 5.4 GB |
md5:6e48506bb1c95ec01af048695dfceb34 | 5.4 GB |
md5:a735ce0e63f81bf40180f68ef5bf5337 | 5.4 GB |
md5:fca6d851da422e6bba671560b885c644 | 5.4 GB |
md5:eb435359d48dae7155b7a9f1fc758ece | 5.4 GB |
md5:885e7a8432b64f687fed834bf6c962e2 | 5.4 GB |
md5:6e62320c906b1a73cdf72da2054a517b | 5.4 GB |
md5:ecbbb018220b978737582582927bbfde | 5.4 GB |
md5:6f239dd983d06d85d81b8e5e27c77188 | 5.4 GB |
md5:50ad372ead596a999b528e506e471614 | 5.4 GB |
md5:aa71f5e808f41d80b25f43442afed8fa | 5.4 GB |
md5:77adfe38348e431abe2873c98cfc6d33 | 4.9 GB |
md5:0d26f0862da062cc36ecef5400593ee3 | 157.9 MB |
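Because each file is several gigabytes, it may be worth verifying a download against its listed MD5 checksum before extracting. The sketch below shows one way to do this with Python's standard `hashlib`; the function names and the file path in the usage note are hypothetical, and the listing above does not show which checksum belongs to which file name, so that pairing has to be taken from the record page itself.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks
    so that multi-gigabyte archives are never loaded into memory at once."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_checksum(path, listed):
    """Compare a file against a checksum written as 'md5:<hex>' or '<hex>'."""
    expected = listed[4:] if listed.startswith("md5:") else listed
    return md5_of_file(path) == expected.strip().lower()
```

For example, `matches_listed_checksum("downloaded_file", "md5:0d26f0862da062cc36ecef5400593ee3")` would return `True` only if the local file (a placeholder path here) is byte-identical to the published file with that checksum.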
Additional details
Related works
- Is metadata for: Conference paper, 10.1145/3746252.3761634 (DOI)
Software
- Repository URL: https://github.com/FDUDataNET/FediData
- Programming language: Python