Published August 7, 2025 | Version v1
Dataset Open

From Birdwatch to Community Notes, from Twitter to X: four years of community-based content moderation

  • 1. University College Dublin
  • 2. Trinity College Dublin
  • 3. Technical University of Berlin
  • 4. SPICED Academy
  • 5. Technological University Dublin

Description

Dataset and Code Description

This repository contains the data and code used to analyse interactions within the Community Notes platform from January 23, 2021, to January 23, 2025. The files are organised as follows:

🧪 Code Notebooks

  • Create_graphs.ipynb: Constructs full interaction networks and separate sub-networks (helpful, somewhat helpful, unhelpful) from the monthly raw rating files.

  • Url_analysis.ipynb: Detects the language of each note and extracts any URLs or domain names mentioned.

  • BERTopic_English_hard_PCA100_UMAP10_MinCluster500.ipynb: Applies BERTopic to English-language notes to extract latent topics. Dimensionality is reduced using PCA (100 components) and UMAP (10 dimensions). Only clusters with at least 500 notes are retained to ensure robustness.
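
As an illustration of the URL and domain extraction step described for Url_analysis.ipynb, here is a minimal sketch using only the Python standard library. The regular expression, the trailing-punctuation trimming, the removal of a leading "www.", and the function names are assumptions made for this example; the notebook itself may use different rules, and language detection is not covered here.

```python
import re
from urllib.parse import urlparse

# Assumed pattern: an http(s) URL runs until the next whitespace character.
URL_PATTERN = re.compile(r"https?://\S+")


def extract_urls(text):
    """Return the URLs found in a note, with trailing punctuation trimmed."""
    return [match.rstrip(".,;:!?)\"'") for match in URL_PATTERN.findall(text)]


def extract_domain(url):
    """Return the lower-cased host of a URL without a leading 'www.'."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


# Made-up note text, used only to exercise the two helpers.
note = "This claim is misleading. See https://www.example.org/report, and http://example.com/a?b=1."
urls = extract_urls(note)
domains = [extract_domain(u) for u in urls]
```

With this input, urls holds the two links without the trailing comma and full stop, and domains holds example.org and example.com.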

📄 Data Files

Notes Data
  • notes_with_lang.csv: All Community Notes written between January 23, 2021, and January 23, 2025, with detected language, extracted URLs, and domain names.

  • english_notes_with_nlp.csv: Subset of English notes with BERTopic topics, topic numbers, and keyword representations.

Each note file contains the following variables:

  • noteId: Unique ID of the note.

  • noteAuthorParticipantId: Unique ID of the note's author.

  • tweetId: ID of the tweet the note addresses.

  • date: Date the note was written (YYYY-MM-DD).

  • Timestamp: Time the note was written (HH:MM:SS).

  • language: Detected language of the note.

  • extracted_urls: List of URLs mentioned in the note.

  • news_source: List of extracted domain names.

  • BERTopic_word (only in English notes file): Main topic name.

  • BERTopic_number (only in English notes file): Numeric topic identifier.

  • BERTopic_representation (only in English notes file): List of keywords representing the topic.
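
The variables above can be read with the standard csv module. The sketch below parses two made-up rows, combines date and Timestamp into one datetime value, and converts the list columns. It assumes that extracted_urls and news_source are stored as Python-style list literals; if the files use another serialisation, the ast.literal_eval calls need to be replaced. The parse_note helper and its output keys are hypothetical names chosen for this example.

```python
import ast
import csv
import io
from datetime import datetime

# Two made-up rows standing in for notes_with_lang.csv.
SAMPLE = '''noteId,noteAuthorParticipantId,tweetId,date,Timestamp,language,extracted_urls,news_source
1,A1,900,2021-01-23,08:15:00,en,"['https://example.org/x']","['example.org']"
2,B2,901,2024-12-31,23:59:59,es,[],[]
'''


def parse_note(row):
    """Convert one CSV row (a dict of strings) into typed values."""
    written_at = datetime.strptime(
        row["date"] + " " + row["Timestamp"], "%Y-%m-%d %H:%M:%S"
    )
    return {
        "noteId": row["noteId"],  # IDs are kept as strings
        "author": row["noteAuthorParticipantId"],
        "tweetId": row["tweetId"],
        "written_at": written_at,
        "language": row["language"],
        # Assumption: list columns are Python-style list literals.
        "urls": ast.literal_eval(row["extracted_urls"]),
        "domains": ast.literal_eval(row["news_source"]),
    }


notes = [parse_note(row) for row in csv.DictReader(io.StringIO(SAMPLE))]
```

To read the real file, replace io.StringIO(SAMPLE) with an open file handle, for example open("notes_with_lang.csv", newline="", encoding="utf-8").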

Rating Data
  • Monthly rating files are stored in the rating monthly files/ directory and named ratings_m_yyyy.csv, where m is the month and yyyy the year.

  • Each file includes:

    • noteId: ID of the rated note.

    • raterParticipantId: ID of the participant giving the rating.

    • helpfulnessLevel: Rating category (HELPFUL, SOMEWHAT_HELPFUL, NOT_HELPFUL).

    • helpful, notHelpful: Deprecated binary flags (use helpfulnessLevel instead).
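
A short sketch of how the rating variables can be used: it counts, for each note, how many ratings fall into each helpfulnessLevel category. The sample rows are made up, and tally_ratings is a hypothetical helper rather than part of the published code.

```python
import csv
import io
from collections import Counter, defaultdict

# Made-up rows standing in for one monthly rating file.
SAMPLE = """noteId,raterParticipantId,helpfulnessLevel
1,R1,HELPFUL
1,R2,NOT_HELPFUL
1,R3,HELPFUL
2,R1,SOMEWHAT_HELPFUL
"""


def tally_ratings(rows):
    """Map each noteId to a Counter of its helpfulnessLevel values."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row["noteId"]][row["helpfulnessLevel"]] += 1
    return counts


counts = tally_ratings(csv.DictReader(io.StringIO(SAMPLE)))
```

Here note 1 receives two HELPFUL ratings and one NOT_HELPFUL rating, and note 2 receives one SOMEWHAT_HELPFUL rating. Categories that never occur for a note read as zero because Counter returns 0 for missing keys.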

🌐 Network Files

Each month’s ratings are used to construct interaction graphs with user-to-user edges based on rating behaviours.

  • Whole Networks (whole_network_<month>_<year>.graphml): Full user interaction networks, with edges annotated by the number of helpful, unhelpful, and somewhat helpful ratings.

    • Each edge contains:

      • source: Rater’s participant ID.

      • target: Note author’s participant ID.

      • helpful, unhelpful, somewhathelpful: Number of ratings of each type given by the rater to notes written by the author.

  • Helpful Networks (network_<month>_<year>_helpful.graphml): Subnetworks based on helpful ratings only.

  • Somewhat Helpful Networks (network_<month>_<year>_somewhat.graphml): Subnetworks based on somewhat helpful ratings.

  • Unhelpful Networks (network_<month>_<year>_unhelpful.graphml): Subnetworks based on unhelpful ratings.
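
The edge structure described above can be sketched in plain Python: each rating is joined to the author of the rated note, and ratings are counted per (rater, author) pair and per type. This is an illustration under stated assumptions, not the code in Create_graphs.ipynb. The function names, the in-memory input format, the handling of notes with no known author, and the rule that a sub-network keeps every edge with at least one rating of the given type are all choices made for this example.

```python
from collections import defaultdict

# Mapping from helpfulnessLevel values to the edge attribute names listed above.
LEVEL_TO_ATTR = {
    "HELPFUL": "helpful",
    "SOMEWHAT_HELPFUL": "somewhathelpful",
    "NOT_HELPFUL": "unhelpful",
}


def build_edges(ratings, note_authors):
    """Count ratings per (rater, author) pair and per rating type.

    ratings: iterable of (noteId, raterParticipantId, helpfulnessLevel).
    note_authors: dict mapping noteId to noteAuthorParticipantId.
    """
    edges = defaultdict(lambda: {"helpful": 0, "unhelpful": 0, "somewhathelpful": 0})
    for note_id, rater, level in ratings:
        author = note_authors.get(note_id)
        if author is None:
            continue  # assumption: skip ratings whose note author is unknown
        edges[(rater, author)][LEVEL_TO_ATTR[level]] += 1
    return dict(edges)


def subnetwork(edges, attr):
    """Keep only edges with at least one rating of the given type (assumed rule)."""
    return {pair: c[attr] for pair, c in edges.items() if c[attr] > 0}


# Made-up data: author A1 wrote notes 1 and 2, author B2 wrote note 3.
note_authors = {"1": "A1", "2": "A1", "3": "B2"}
ratings = [
    ("1", "R1", "HELPFUL"),
    ("2", "R1", "HELPFUL"),
    ("3", "R1", "NOT_HELPFUL"),
    ("1", "R2", "SOMEWHAT_HELPFUL"),
]
edges = build_edges(ratings, note_authors)
helpful_edges = subnetwork(edges, "helpful")
```

Keys of edges are (source, target) pairs, that is (rater, note author). A graph library such as NetworkX could then store these counts as edge attributes and export them with its GraphML writer.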

 

Files

BERTopic_English_hard_PCA100_UMAP10_MinCluster500.ipynb

Files (10.1 GB)

Checksum Size
md5:d7afe96a6f8c53272b35590919a22cc1 127.3 kB
md5:3f2b428543e75cfb1c2cda902debcd56 44.2 kB
md5:78aeff114e0806ff8529ca22437a06ed 3.7 GB
md5:2c74df2140e5742bff992df845d365a5 347.7 MB
md5:7ea83da15ed30d679a6c5f3f767a2438 443.0 MB
md5:17af2074f9831188de7dbc39ad03f955 2.1 GB
md5:388f5193eac220a6284a1a484dc2f47d 12.2 kB
md5:3dfa5f1a119e51fc26c1d0cfaf61e3d1 832.5 kB
md5:7d8bcc1f7f1566a588a683792b6cbb88 3.5 GB
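
The MD5 checksums listed above can be used to check that a download is complete. The following is a minimal sketch with the standard hashlib module; the function names are hypothetical, and the file is read in chunks because several of the files are multiple gigabytes in size.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, listed):
    """Compare a file against a checksum written as 'md5:<hex>' or '<hex>'."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected
```

For example, matches_checksum(path, "md5:...") returns True when the local file at path has the digest listed for it.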

Additional details

Funding

Taighde Éireann - Research Ireland
18/CRT/6049
Taighde Éireann - Research Ireland
IRCLA/2022/3217