Published April 29, 2024 | Version v1
Dataset Open

Bluesky Social Dataset

  • 1. University of Pisa
  • 2. Institute of Information Science and Technologies

Description

Bluesky Social Dataset

Pollution of online social spaces caused by rampant dis/misinformation is a growing societal concern. However, recent decisions to reduce access to social media APIs are causing a shortage of publicly available, recent social media data, hindering the advancement of computational social science as a whole. To address this pressing issue, we present a large, high-coverage dataset of social interactions and user-generated content from Bluesky Social.

The dataset contains the complete post history of over 4M users (81% of all registered accounts), totaling 235M posts. We also make available social data covering follow, comment, repost, and quote interactions.

Since Bluesky allows users to create and bookmark feed generators (i.e., content recommendation algorithms), we also release the full output of several popular algorithms available on the platform, along with their timestamped “like” interactions and time of bookmarking.

This dataset allows unprecedented analysis of online behavior and human-machine engagement patterns. Notably, it provides ground-truth data for studying the effects of content exposure and self-selection, and performing content virality and diffusion analysis.

Dataset

Here is a description of the dataset files.

  • followers.csv.gz. This compressed file contains the anonymized follower edge list. Once decompressed, each row consists of two comma-separated integers u, v, representing a directed following relation (i.e., user u follows user v).
  • posts.tar.gz. This compressed folder contains data on the individual posts collected. Decompressing this file results in 100 files, each containing the full posts of up to 50,000 users. Each post is stored as a JSON-formatted line.
  • interactions.csv.gz. This compressed file contains the anonymized interactions edge list. Once decompressed, each row consists of six comma-separated integers and represents a comment, repost, or quote interaction. These integers correspond to the following fields, in this order: user_id, replied_author, thread_root_author, reposted_author, quoted_author, and date.
  • graphs.tar.gz. This compressed folder contains edge list files for the graphs emerging from reposts, quotes, and replies. Each interaction is timestamped. The folder also contains timestamped higher-order interactions emerging from discussion threads, each containing all users participating in a thread.
  • feed_posts.tar.gz. This compressed folder contains posts that appear in 11 thematic feeds. Decompressing this folder results in 11 files, each containing the posts from one feed. Each post is stored as a JSON-formatted line. Fields correspond to those in posts.tar.gz, except for those related to sentiment analysis (sent_label, sent_score) and reposts (repost_from, reposted_author).
  • feed_bookmarks.csv. This file contains users who bookmarked any of the collected feeds. Each record contains three comma-separated values, namely the feed name, the user id, and the timestamp.
  • feed_post_likes.tar.gz. This compressed folder contains data on likes to posts appearing in the feeds, one file per feed. Each record in the files contains the following information, in this order: the id of the “liker”, the id of the post's author, the id of the liked post, and the like timestamp.
  • scripts.tar.gz. A collection of Python scripts, including the ones originally used to crawl the data, and to perform experiments. These scripts are detailed in a document released within the folder.
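The file layouts described above can be read with the Python standard library alone. The following is a minimal sketch, not the authors' loading code (their scripts ship in scripts.tar.gz). It assumes that the files inside posts.tar.gz are plain UTF-8 text, that the CSV files may or may not carry a header row, and that a field which does not apply to an interaction may be left empty. None of these points is stated in the description, and the function names are hypothetical.

```python
import csv
import gzip
import io
import json
import tarfile

# Field order as given in the description of interactions.csv.gz.
INTERACTION_FIELDS = (
    "user_id", "replied_author", "thread_root_author",
    "reposted_author", "quoted_author", "date",
)


def _to_int(value):
    """Return value as an int, or None if it is empty or not numeric."""
    try:
        return int(value)
    except ValueError:
        return None


def iter_follow_edges(path):
    """Yield (u, v) pairs from followers.csv.gz, meaning user u follows user v."""
    with gzip.open(path, "rt", newline="") as handle:
        for row in csv.reader(handle):
            if len(row) != 2:
                continue  # skip blank or malformed rows
            u, v = _to_int(row[0]), _to_int(row[1])
            if u is None or v is None:
                continue  # assumption: a non-numeric row is a header
            yield u, v


def iter_interactions(path):
    """Yield one dict per row of interactions.csv.gz, keyed by field name."""
    with gzip.open(path, "rt", newline="") as handle:
        for row in csv.reader(handle):
            if len(row) != len(INTERACTION_FIELDS):
                continue  # skip blank or malformed rows
            record = dict(zip(INTERACTION_FIELDS, map(_to_int, row)))
            if record["user_id"] is None:
                continue  # assumption: a non-numeric user_id marks a header
            yield record


def iter_posts(tar_path):
    """Yield one decoded JSON object per line of every file in a .tar.gz archive."""
    with tarfile.open(tar_path, "r:gz") as archive:
        for member in archive:
            if not member.isfile():
                continue  # skip directory entries
            raw = archive.extractfile(member)
            # Assumption: members are uncompressed UTF-8 text files.
            for line in io.TextIOWrapper(raw, encoding="utf-8"):
                line = line.strip()
                if line:
                    yield json.loads(line)
```

For example, `sum(1 for _ in iter_follow_edges("followers.csv.gz"))` counts the follow edges without decompressing the file to disk. Since feed_posts.tar.gz also stores one JSON-formatted post per line, `iter_posts` should apply to it under the same assumptions. All three functions stream their input, which matters for the larger archives.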

 

Citation

If used for research purposes, please cite the following paper describing the dataset details:

Andrea Failla and Giulio Rossetti. "I'm in the Bluesky Tonight": Insights from a Year Worth of Social Data. (2024) arXiv:2404.18984

 

Acknowledgments:

This work is supported by:

  • the European Union – Horizon 2020 Program under the scheme “INFRAIA-01-2018-2019 – Integrating Activities for Advanced Communities”,
    Grant Agreement n.871042, “SoBigData++: European Integrated Infrastructure for Social Mining and Big Data Analytics” (http://www.sobigdata.eu); 
  • SoBigData.it which receives funding from the European Union – NextGenerationEU – National Recovery and Resilience Plan (Piano Nazionale di Ripresa e Resilienza, PNRR) – Project: “SoBigData.it – Strengthening the Italian RI for Social Mining and Big Data Analytics” – Prot. IR0000013 – Avviso n. 3264 del 28/12/2021;
  • EU NextGenerationEU programme under the funding schemes PNRR-PE-AI FAIR (Future Artificial Intelligence Research). 

Files

Files (21.3 GB)

Checksum (MD5)                          Size
md5:f679203886cca70d4a0395e8d3840070    552.8 kB
md5:31889c5935d5fc43a8e33f9f1dca2014    16.4 MB
md5:499b99ae97e284240285f0d26e30d8cf    35.4 MB
md5:33eb37998142681ab0a993765c18b5cb    491.3 MB
md5:6dee46e97768bb37f347f072845e8b0f    891.2 MB
md5:81ad8051a45c001445ac796bf57d4dbb    1.0 GB
md5:9583b04f29207dcb8ba51cd47eadb743    18.8 GB
md5:83401fe2329cd80ca0b20ec0dcfd10dd    17.2 kB
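
The MD5 checksums listed above can be used to check that a download completed without corruption. The sketch below is one way to do so with Python's hashlib. The function names are hypothetical, and the file is read in chunks so that the largest archive does not need to fit in memory.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file against a checksum written as 'md5:<hex>' or '<hex>'."""
    expected = expected.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected
```

A call such as `matches_checksum(path, "md5:f679203886cca70d4a0395e8d3840070")` returns True only if the file at `path` is byte-identical to the published one. Which file each checksum belongs to has to be taken from the record's file listing.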

Additional details

Related works

Is described by
Preprint: arXiv:2404.18984 (arXiv)