Published February 14, 2025 | Version v1

Dataset of Mastodon Toots (collected by FediLive)

Description

This snapshot of Mastodon was captured using FediLive and covers publicly visible activity across the entire platform over a 13-day period, from Nov. 22 to Dec. 4, 2024 (UTC+0). During this period, FediLive collected 1,361,708 original posts and 2,628,018 interactions, consisting of 65.6% favourites, 31.9% boosts, and 2.5% replies.

The livefeeds.json dataset contains Mastodon toots from all instances over approximately two weeks. The boostersfavourites.json and replies.json files provide data on boosts, favourites, and replies for these toots.

In Mastodon, a "toot" is a post made by a user. A "boost" is similar to a retweet on Twitter, allowing users to share a toot with their followers. A "favourite" is akin to a like, indicating appreciation for a toot. A "reply" is a response to another user's toot, facilitating conversations.

These interactions are essential for analyzing user engagement and content reach on Mastodon. By examining the boostersfavourites.json and replies.json files, researchers can gain insights into how content spreads and how users engage with each other on the platform.

Since each instance in the Fediverse maintains its own independent ID namespace, toots (posts) and user accounts from different instances may share identical local IDs. To ensure global uniqueness:

  • Toots are uniquely identified using their sid (server-generated ID) combined with their instance-specific url.
  • User Accounts are uniquely identified through their canonical profile url, which inherently contains instance information.
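
The identification rule above can be sketched as follows. This is a minimal illustration, assuming each toot record is a JSON object exposing the sid and url fields named in the description; the helper names toot_key and deduplicate are hypothetical and not part of FediLive.

```python
from typing import Any, Dict, Iterable, List, Tuple


def toot_key(toot: Dict[str, Any]) -> Tuple[str, str]:
    """Build a globally unique key from the local sid and the instance-specific url."""
    # Assumption: both fields are present on every toot record.
    return (str(toot["sid"]), str(toot["url"]))


def deduplicate(toots: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep the first occurrence of each (sid, url) pair, preserving order."""
    seen = set()
    unique = []
    for toot in toots:
        key = toot_key(toot)
        if key not in seen:
            seen.add(key)
            unique.append(toot)
    return unique
```

Keying on the pair rather than on sid alone means two toots from different instances that happen to share a local ID are kept as distinct records.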

Notes

The dataset employs the following anonymization measures to ensure user privacy and data security:

1. User Information Anonymization

  • User IDs/Accounts: Irreversible SHA-256 hashing (truncated to 18 characters) of instance domain + original ID, formatted as anon_id_xxxx/anon_name_xxxx.
  • Profile URLs: Automatic replacement of usernames in URLs (e.g., /@username or /users/username) with hashed values.
  • Cross-platform Accounts: acct fields replaced with anon_acct_xxxx hashed identifiers.
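
The user anonymization steps above might look roughly like the sketch below. It is an illustration only: the salt value, the way domain and ID are concatenated, the use of hexadecimal output, and hashing usernames together with the instance domain are all assumptions, and the function names are hypothetical. The actual FediLive implementation may differ.

```python
import hashlib
import re

# Assumption: a secret salt is mixed into every hash; this value is a placeholder.
SALT = b"example-salt"


def anon_hash(instance_domain: str, original: str,
              salt: bytes = SALT, length: int = 18) -> str:
    """Salted SHA-256 of domain + original value, truncated to `length` hex characters."""
    payload = salt + f"{instance_domain}:{original}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:length]


def anon_user_id(instance_domain: str, user_id: str) -> str:
    """Format a hashed user ID in the anon_id_xxxx style."""
    return "anon_id_" + anon_hash(instance_domain, user_id)


# Matches the username after /@ or /users/ in a profile URL.
_PROFILE_RE = re.compile(r"(/@|/users/)([^/?#]+)")


def anonymize_profile_url(url: str, instance_domain: str) -> str:
    """Replace the username in a profile URL with an anon_name_xxxx value."""
    def _replace(match: re.Match) -> str:
        hashed = anon_hash(instance_domain, match.group(2))
        return match.group(1) + "anon_name_" + hashed
    return _PROFILE_RE.sub(_replace, url, count=1)
```

Because the hash is deterministic for a given salt, the same original identifier maps to the same anonymized value wherever it appears.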

2. Toot/Post Anonymization

  • Post IDs: Hashed using instance domain + original ID, formatted as anon_id_xxxx.
  • Post URLs: Original post IDs in URL endpoints replaced with hashed values.
  • Social Interactions: in_reply_to_id and in_reply_to_account_id fields replaced with hashed identifiers.

3. Data Correlation Protection

  • Independent Hashing: Separate hashing algorithms for users and posts to prevent identity inference via correlations.
  • Salted Hashing: All hashing includes cryptographically secure salts to prevent cross-dataset matching.
  • Recursive Processing: Full traversal of nested JSON structures to anonymize all URL/URI fields.
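
The recursive processing step can be illustrated with a short sketch. It assumes that URL-bearing fields are stored under the keys url and uri and that a caller supplies the rewriting function; both the key names and the function anonymize_urls are assumptions made for illustration, not the FediLive code.

```python
from typing import Any, Callable, Tuple


def anonymize_urls(node: Any,
                   rewrite: Callable[[str], str],
                   url_keys: Tuple[str, ...] = ("url", "uri")) -> Any:
    """Return a copy of a nested JSON structure with every URL/URI string rewritten."""
    if isinstance(node, dict):
        result = {}
        for key, value in node.items():
            if key in url_keys and isinstance(value, str):
                # Leaf URL field: apply the caller-supplied anonymizer.
                result[key] = rewrite(value)
            else:
                # Anything else may contain nested URL fields, so recurse.
                result[key] = anonymize_urls(value, rewrite, url_keys)
        return result
    if isinstance(node, list):
        return [anonymize_urls(item, rewrite, url_keys) for item in node]
    # Scalars (str, int, bool, None) outside URL fields are returned unchanged.
    return node
```

Returning a new structure instead of mutating in place keeps the original record available for validation during preprocessing.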

4. Metadata Safeguards

  • Social Engagement: User identities in likes/reblogs replaced with hashed values.
  • Cross-File Consistency: Same original IDs generate identical hashes across all files.
  • Instance Privacy: Subdomains removed from instance names (e.g., mastodon.social → social).

This scheme ensures irreversible transformation of sensitive data while preserving internal relationships for research validity. All fields are processed to prevent reconstruction of original identities, social connections, or platform-specific metadata.

Notes

If you use FediLive or this dataset in your research, please cite our paper:

@inproceedings{Min2025FediLive,
  author    = {Min, Shaojie and Wang, Shaobin and Luo, Yaxiao and Gao, Min and Gong, Qingyuan and Xiao, Yu and Chen, Yang},
  title     = {{FediLive: A Framework for Collecting and Preprocessing Snapshots of Decentralized Online Social Networks}},
  year      = {2025},
  booktitle = {Companion Proceedings of the ACM on Web Conference 2025},
  series    = {WWW '25},
  publisher = {Association for Computing Machinery},
  address   = {New York, NY, USA},
  pages     = {765--768},
  doi       = {10.1145/3701716.3715298},
  url       = {https://doi.org/10.1145/3701716.3715298}
}

Files

241212livefeeds.zip

Files (3.9 GB)

Checksum | Size
md5:5f90e94bf1145494616b27f188bfac67 | 2.0 GB
md5:51c001cd420a6e6a10d811b526ac21ad | 2.0 GB

Additional details

Software

Repository URL: https://github.com/FDUDataNET/FediLive/
Programming language: Python