Bot Into The Fediverse Dataset

MORENO GARCIA, FRANCISCO

doi:10.5281/zenodo.17987595

Published December 19, 2025 | Version v1

Dataset Open

MORENO GARCIA, FRANCISCO (Research group)¹

1. Universidad Politécnica de Madrid

This dataset contains anonymized features for bot detection on Mastodon (Fediverse). It was created for the accompanying paper and consists of accounts labeled as bot or non-bot, collected from publicly accessible content via the Mastodon Application Programming Interface (API) during January–February 2025.

To reduce privacy risks and facilitate reuse, the dataset does not include raw usernames, user IDs, or raw text. Instead, we provide (i) engineered account/profile and activity features (e.g., follower/following counts and posting statistics), and (ii) text representations derived from public content. Specifically, the account profile description (“note”) was converted into fixed-length embeddings using bert-base-multilingual-cased. In addition, post-level textual information was converted into embeddings (see twets_emb), enabling downstream modeling without access to the original text.

The dataset is intended for research on bot detection, feature engineering, and multilingual representation learning on decentralized social networks, and supports reproducibility of experiments reported in the paper.

Data collection and processing

Source platform: Mastodon (public content only).
Collection period: January–February 2025.
Access method: Platform API.
Anonymization: Removal of direct identifiers (e.g., usernames and raw profile text). Only derived numeric features and embeddings are shared.
Text embeddings: bert-base-multilingual-cased applied to the profile description (“note”); post embeddings provided as twets_emb.

Intended use

Supervised bot detection and benchmarking on Mastodon-derived features.
Feature importance/ablation studies on profile and behavioral signals.
Experiments using multilingual text embeddings without releasing raw text.

Limitations and notes

Labels reflect the definition and labeling procedure described in the accompanying paper and may contain noise or bias.
The dataset contains derived representations, so it may not support tasks that require raw text (e.g., linguistic audits, toxicity annotation, qualitative analyses).
Some features (e.g., averages over interactions) may depend on the observation window and API availability at collection time.

Column dictionary

Below are the dataset columns included in each row (one row per account):

Username-based (derived, no raw username shared)

username_length: Length of the (anonymized) username string.
username_num_digits: Count of numeric characters in username.
username_num_letters: Count of alphabetic characters in username.
username_num_special: Count of non-alphanumeric characters in username.
username_starts_with_digit: Binary indicator (1 if username starts with a digit).
username_ends_with_digit: Binary indicator (1 if username ends with a digit).
fuzzy_score: Fuzzy string similarity score between username and screen name computed during preprocessing (as defined in the paper/processing scripts).
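
As an illustration of how the username-derived columns above could be computed, here is a minimal Python sketch. It is not the authors' preprocessing code: the function name username_features is hypothetical, and difflib's SequenceMatcher ratio is only a stand-in for fuzzy_score, whose exact definition is given in the paper/processing scripts.

```python
from difflib import SequenceMatcher


def username_features(username: str, screen_name: str) -> dict:
    """Compute username-derived features analogous to the columns above."""
    return {
        "username_length": len(username),
        "username_num_digits": sum(ch.isdigit() for ch in username),
        "username_num_letters": sum(ch.isalpha() for ch in username),
        "username_num_special": sum(not ch.isalnum() for ch in username),
        # Slicing (instead of indexing) keeps empty usernames from raising.
        "username_starts_with_digit": int(username[:1].isdigit()),
        "username_ends_with_digit": int(username[-1:].isdigit()),
        # Assumption: a case-insensitive difflib ratio in [0, 1] stands in
        # for the fuzzy similarity measure actually used by the authors.
        "fuzzy_score": SequenceMatcher(
            None, username.lower(), screen_name.lower()
        ).ratio(),
    }


features = username_features("news_bot42", "News Bot")
print(features)
```

Because the raw usernames are not part of the release, a sketch like this is only useful for applying the same feature definitions to newly collected accounts.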

Network / account metadata

followers_count: Number of followers at collection time.
following_count: Number of accounts followed at collection time.
statuses_count: Total number of statuses/posts at collection time.
days: Account age in days, measured from account creation or first observation.

Activity and interaction aggregates (computed over last 40 collected posts in the observation window)

avg_reply_count: Average replies per post.
avg_retweet_count: Average boosts/reblogs per post.
avg_favorite_count: Average favorites/likes per post.
avg_num_tags: Average number of hashtags per post.
avg_num_urls: Average number of URLs per post.
avg_num_mentions: Average number of mentions per post.
avg_possibly_sensitive: Average of the per-post sensitive-content indicator, i.e., the fraction of posts flagged as sensitive (if available/derived).
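
The aggregates above are per-post averages over a 40-post window. The following Python sketch shows one way such averages might be computed. The post dictionaries are hypothetical: the keys replies_count, reblogs_count, favourites_count, tags, mentions, and sensitive are modeled on Mastodon status fields, while urls is an assumed, pre-extracted list of links. The authors' actual pipeline may differ.

```python
from statistics import fmean

WINDOW = 40  # number of most recent posts, as stated in the heading above


def activity_aggregates(posts: list) -> dict:
    """Average per-post signals over the most recent WINDOW posts.

    Assumption: `posts` is ordered newest first.
    """
    recent = posts[:WINDOW]
    if not recent:
        return {}
    return {
        "avg_reply_count": fmean([p["replies_count"] for p in recent]),
        "avg_retweet_count": fmean([p["reblogs_count"] for p in recent]),
        "avg_favorite_count": fmean([p["favourites_count"] for p in recent]),
        "avg_num_tags": fmean([len(p["tags"]) for p in recent]),
        "avg_num_urls": fmean([len(p["urls"]) for p in recent]),
        "avg_num_mentions": fmean([len(p["mentions"]) for p in recent]),
        # Mean of a 0/1 flag, i.e. the fraction of sensitive posts.
        "avg_possibly_sensitive": fmean(
            [float(p["sensitive"]) for p in recent]
        ),
    }


posts = [
    {"replies_count": 2, "reblogs_count": 4, "favourites_count": 6,
     "tags": ["a", "b"], "urls": ["https://example.org"], "mentions": [],
     "sensitive": False},
    {"replies_count": 0, "reblogs_count": 0, "favourites_count": 2,
     "tags": [], "urls": [], "mentions": ["someone"], "sensitive": True},
]
aggregates = activity_aggregates(posts)
print(aggregates)
```

Accounts with fewer than 40 posts are averaged over whatever is available in this sketch; how the released dataset handles that case is not stated here.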

Language and text embeddings

language: Language code associated with the account/posts (when available).
note_emb: Embedding vector of the profile description (“note”) computed with bert-base-multilingual-cased.
twets_emb: Embedding vector(s) derived from the account’s posts (average embedding over recent posts).
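
The "average embedding over recent posts" described for twets_emb amounts to an element-wise mean of per-post vectors. A minimal Python sketch with toy 3-dimensional vectors follows. The function name mean_embedding and the toy values are illustrative assumptions, and the dimensionality of the released vectors is not implied.

```python
def mean_embedding(vectors: list) -> list:
    """Element-wise mean of equal-length embedding vectors."""
    if not vectors:
        raise ValueError("need at least one vector")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all vectors must have the same dimension")
    # zip(*vectors) iterates over dimensions; each column is averaged.
    return [sum(column) / len(vectors) for column in zip(*vectors)]


# Toy per-post embeddings for one account (hypothetical values).
post_vectors = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]
account_vector = mean_embedding(post_vectors)
print(account_vector)  # [2.0, 3.0, 4.0]
```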

Label

bot: Binary label (1 = bot, 0 = non-bot).
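
For orientation, the scalar columns listed in this dictionary can be separated from the bot label with the standard library alone. The sketch below is run on a small in-memory sample rather than the real file. It assumes the CSV header uses exactly the column names above and that scalar columns have no missing values. Neither assumption has been checked against the released file, and the embedding columns are left out because their serialization is not described here.

```python
import csv
import io

SCALAR_COLUMNS = [
    "username_length", "username_num_digits", "username_num_letters",
    "username_num_special", "username_starts_with_digit",
    "username_ends_with_digit", "fuzzy_score",
    "followers_count", "following_count", "statuses_count", "days",
    "avg_reply_count", "avg_retweet_count", "avg_favorite_count",
    "avg_num_tags", "avg_num_urls", "avg_num_mentions",
    "avg_possibly_sensitive",
]


def load_scalar_features(handle):
    """Return (X, y): scalar feature rows and 0/1 bot labels."""
    X, y = [], []
    for row in csv.DictReader(handle):
        X.append([float(row[name]) for name in SCALAR_COLUMNS])
        # int(float(...)) tolerates labels stored as "1" or "1.0".
        y.append(int(float(row["bot"])))
    return X, y


# Hypothetical two-row sample standing in for the real file.
sample = io.StringIO()
writer = csv.writer(sample)
writer.writerow(SCALAR_COLUMNS + ["language", "bot"])
writer.writerow([1] * len(SCALAR_COLUMNS) + ["en", 1])
writer.writerow([0] * len(SCALAR_COLUMNS) + ["es", 0])
sample.seek(0)

X, y = load_scalar_features(sample)
print(len(X), len(X[0]), y)
```

With the downloaded file, the same function would be called on a handle opened with open("botsintothefediverse.csv", newline="", encoding="utf-8").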

Citation

If you use this dataset, please cite:

The dataset DOI (10.5281/zenodo.17987595)
The accompanying paper (DOI 10.1007/s13278-025-01567-z)

Files

botsintothefediverse.csv (127.1 MB)
md5:de2faa691177bb1f45ad0c76f428a166
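
The MD5 checksum listed for botsintothefediverse.csv can be used to confirm that a download is intact. A minimal Python sketch, assuming the file was saved under its original name in the working directory (the helper name md5_of_file is illustrative):

```python
import hashlib

EXPECTED_MD5 = "de2faa691177bb1f45ad0c76f428a166"  # checksum listed above


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so a large CSV is never fully loaded in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Usage, once the file has been downloaded:
# if md5_of_file("botsintothefediverse.csv") != EXPECTED_MD5:
#     raise SystemExit("checksum mismatch: re-download the file")
```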

Additional details

Funding

European Commission
AI-CODE - AI services for COntinuous trust in emerging Digital Environments (grant 101135437)

