Bot Into The Fediverse Dataset
Description
This dataset contains anonymized features for bot detection on Mastodon (Fediverse). It was created for the accompanying paper and consists of accounts labeled as bot or non-bot, collected from publicly accessible content via the Mastodon Application Programming Interface (API) during January–February 2025.
To reduce privacy risks and facilitate reuse, the dataset does not include raw usernames, user IDs, or raw text. Instead, we provide (i) engineered account/profile and activity features (e.g., follower/following counts and posting statistics), and (ii) text representations derived from public content. Specifically, the account profile description (“note”) was converted into fixed-length embeddings using bert-base-multilingual-cased. In addition, post-level textual information was converted into embeddings (see twets_emb), enabling downstream modeling without access to the original text.
The dataset is intended for research on bot detection, feature engineering, and multilingual representation learning on decentralized social networks, and supports reproducibility of experiments reported in the paper.
Data collection and processing
- Source platform: Mastodon (public content only).
- Collection period: January–February 2025.
- Access method: Platform API.
- Anonymization: Removal of direct identifiers (e.g., usernames and raw profile text). Only derived numeric features and embeddings are shared.
- Text embeddings: bert-base-multilingual-cased applied to the profile description (“note”); post embeddings provided as twets_emb.
Intended use
- Supervised bot detection and benchmarking on Mastodon-derived features.
- Feature importance/ablation studies on profile and behavioral signals.
- Experiments using multilingual text embeddings without releasing raw text.
Limitations and notes
- Labels reflect the definition and labeling procedure described in the accompanying paper and may contain noise or bias.
- The dataset contains derived representations, so it may not support tasks that require raw text (e.g., linguistic audits, toxicity annotation, qualitative analyses).
- Some features (e.g., averages over interactions) may depend on the observation window and API availability at collection time.
Column dictionary
Below are the dataset columns included in each row (one row per account):
Username-based (derived, no raw username shared)
- username_length: Length of the (anonymized) username string.
- username_num_digits: Count of numeric characters in username.
- username_num_letters: Count of alphabetic characters in username.
- username_num_special: Count of non-alphanumeric characters in username.
- username_starts_with_digit: Binary indicator (1 if username starts with a digit).
- username_ends_with_digit: Binary indicator (1 if username ends with a digit).
- fuzzy_score: Fuzzy string similarity score between username and screen name computed during preprocessing (as defined in the paper/processing scripts).
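As an illustration of how features like these could be derived from a raw username, here is a minimal sketch. It is not the dataset's preprocessing code: the function name `username_features` is hypothetical, and `difflib.SequenceMatcher` is used only as a stand-in for the fuzzy score, whose actual definition and scale are given in the paper and processing scripts.

```python
from difflib import SequenceMatcher


def username_features(username: str, display_name: str) -> dict:
    """Compute username-derived features analogous to the columns above."""
    digits = sum(ch.isdigit() for ch in username)
    letters = sum(ch.isalpha() for ch in username)
    return {
        "username_length": len(username),
        "username_num_digits": digits,
        "username_num_letters": letters,
        # Anything that is neither a letter nor a digit is counted as special.
        "username_num_special": len(username) - digits - letters,
        # Slicing keeps these safe for an empty username.
        "username_starts_with_digit": int(username[:1].isdigit()),
        "username_ends_with_digit": int(username[-1:].isdigit()),
        # Assumption: difflib's ratio (0 to 1) stands in for the fuzzy score;
        # the dataset's real scoring function may differ.
        "fuzzy_score": SequenceMatcher(
            None, username.lower(), display_name.lower()
        ).ratio(),
    }


features = username_features("news_bot42", "News Bot")
```

Because the released file contains only the derived values, a sketch like this is mainly useful for applying a comparable feature scheme to new accounts.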
Network / account metadata
- followers_count: Number of followers at collection time.
- following_count: Number of accounts followed at collection time.
- statuses_count: Total number of statuses/posts at collection time.
- days: Account age in days, measured from account creation or first observation.
Activity and interaction aggregates (computed over the last 40 collected posts in the observation window)
- avg_reply_count: Average replies per post.
- avg_retweet_count: Average boosts/reblogs per post.
- avg_favorite_count: Average favorites/likes per post.
- avg_num_tags: Average number of hashtags per post.
- avg_num_urls: Average number of URLs per post.
- avg_num_mentions: Average number of mentions per post.
- avg_possibly_sensitive: Average fraction/indicator of sensitive content (if available/derived).
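The aggregation over the last 40 collected posts could look roughly like the sketch below. The input keys `replies_count`, `reblogs_count`, `favourites_count`, `tags`, `mentions`, and `sensitive` follow the Mastodon status entity, but their mapping to the dataset columns is an assumption, and `urls` is a hypothetical pre-extracted list, since the URL extraction method is not described here.

```python
from statistics import fmean

WINDOW = 40  # aggregates are described as covering the last 40 collected posts


def activity_aggregates(posts: list) -> dict:
    """Average per-post signals over the most recent WINDOW posts.

    `posts` is assumed to be a list of dicts ordered oldest to newest.
    """
    recent = posts[-WINDOW:]
    if not recent:
        return {}
    return {
        "avg_reply_count": fmean(p["replies_count"] for p in recent),
        "avg_retweet_count": fmean(p["reblogs_count"] for p in recent),
        "avg_favorite_count": fmean(p["favourites_count"] for p in recent),
        "avg_num_tags": fmean(len(p["tags"]) for p in recent),
        # Hypothetical key: a list of URLs already extracted from the post.
        "avg_num_urls": fmean(len(p["urls"]) for p in recent),
        "avg_num_mentions": fmean(len(p["mentions"]) for p in recent),
        # Fraction of posts flagged as sensitive.
        "avg_possibly_sensitive": fmean(
            1.0 if p["sensitive"] else 0.0 for p in recent
        ),
    }


sample_posts = [
    {"replies_count": 2, "reblogs_count": 0, "favourites_count": 4,
     "tags": ["news"], "urls": ["https://example.org"], "mentions": [],
     "sensitive": False},
    {"replies_count": 0, "reblogs_count": 2, "favourites_count": 0,
     "tags": ["a", "b", "c"], "urls": [], "mentions": ["x", "y"],
     "sensitive": True},
]
demo = activity_aggregates(sample_posts)
```

Accounts with fewer than 40 posts are simply averaged over what is available in this sketch; whether the original pipeline did the same is not stated.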
Language and text embeddings
- language: Language code associated with the account/posts (when available).
- note_emb: Embedding vector of the profile description (“note”) computed with bert-base-multilingual-cased.
- twets_emb: Embedding vector(s) derived from the account’s posts (average embedding over recent posts).
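When read from a CSV file, embedding columns such as note_emb and twets_emb arrive as text and need to be converted back into numeric vectors. The sketch below assumes each cell holds a bracketed list of numbers separated by commas or whitespace; the actual serialization is not documented here, so the helper `parse_embedding` is hypothetical and may need adjusting. The optional length check can be used with 768, the hidden size of bert-base-multilingual-cased, if the vectors turn out to have that dimensionality.

```python
def parse_embedding(cell: str, expected_dim=None) -> list:
    """Parse a cell like "[0.1, 0.2]" or "[0.1 0.2]" into a list of floats.

    Assumption: the vector is stored as bracketed text with comma or
    whitespace separators (line breaks inside the cell are tolerated).
    """
    cleaned = cell.strip().strip("[]").replace(",", " ")
    vector = [float(token) for token in cleaned.split()]
    if expected_dim is not None and len(vector) != expected_dim:
        raise ValueError(
            f"expected {expected_dim} values, got {len(vector)}"
        )
    return vector


comma_style = parse_embedding("[0.5, -1.0, 2.0]")
space_style = parse_embedding("[0.5 -1.0\n 2.0]", expected_dim=3)
```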
Label
- bot: Binary label (1 = bot, 0 = non-bot).
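To show how the columns fit together for supervised learning, the following sketch reads rows into a numeric feature matrix and a label vector using only the standard library. The column names come from the dictionary above; the helper `load_rows`, the choice of which columns to use, and the in-memory sample rows are illustrative assumptions, not values from the dataset.

```python
import csv
import io

# A subset of the numeric columns; extend with others as needed.
NUMERIC_COLUMNS = ["followers_count", "following_count", "statuses_count", "days"]


def load_rows(handle):
    """Return (features, labels) from an open CSV file-like object."""
    features, labels = [], []
    for row in csv.DictReader(handle):
        features.append([float(row[col]) for col in NUMERIC_COLUMNS])
        # int(float(...)) tolerates labels written as "1" or "1.0".
        labels.append(int(float(row["bot"])))
    return features, labels


# Made-up sample rows, used only to demonstrate the expected layout.
SAMPLE = """followers_count,following_count,statuses_count,days,bot
120,80,950,400,0
3,0,12000,30,1
"""
X, y = load_rows(io.StringIO(SAMPLE))
```

For the real file, the same function can be called with `open("botsintothefediverse.csv", newline="", encoding="utf-8")`, assuming the file is UTF-8 encoded and has a header row.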
Citation
If you use this dataset, please cite:
- The dataset DOI (10.5281/zenodo.17987595)
- The accompanying paper (DOI 10.1007/s13278-025-01567-z)
Files
| Name | Size | MD5 |
|---|---|---|
| botsintothefediverse.csv | 127.1 MB | de2faa691177bb1f45ad0c76f428a166 |