From Birdwatch to Community Notes, from Twitter to X: four years of community-based content moderation
Description
Dataset and Code Description
This repository contains the data and code used to analyse interactions within the Community Notes platform from January 23, 2021, to January 23, 2025. The files are organised as follows:
🧪 Code Notebooks
- Create_graphs.ipynb: Constructs full interaction networks and separate sub-networks (helpful, somewhat helpful, unhelpful) from the monthly raw rating files.
- Url_analysis.ipynb: Detects the language of each note and extracts any URLs or domain names mentioned.
- BERTopic_English_hard_PCA100_UMAP10_MinCluster500.ipynb: Applies BERTopic to English-language notes to extract latent topics. Dimensionality is reduced using PCA (100 components) and UMAP (10 dimensions). Only clusters with at least 500 notes are retained to ensure robustness.
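The URL and domain extraction step described for Url_analysis.ipynb can be illustrated with a minimal sketch. This is not the notebook's code: the regular expression, the helper names `extract_urls` and `extract_domain`, and the choice to strip a leading `www.` are all assumptions made for illustration, and language detection is left out.

```python
import re
from urllib.parse import urlparse

# Hypothetical pattern: an http(s) URL runs until whitespace, a quote, or a closing bracket.
URL_PATTERN = re.compile(r"https?://[^\s<>\"')\]]+")


def extract_urls(text):
    """Return all http(s) URLs found in a note, with trailing punctuation stripped."""
    return [match.rstrip(".,;:!?") for match in URL_PATTERN.findall(text)]


def extract_domain(url):
    """Return the lower-cased host of a URL without a leading 'www.' (an assumption)."""
    host = (urlparse(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


# Synthetic note text, not taken from the dataset.
note = "The claim is false, see https://www.example.org/report.pdf and http://news.example.com/a?id=1."
urls = extract_urls(note)
domains = [extract_domain(u) for u in urls]
print(urls)
print(domains)
```

The two lists correspond in spirit to the extracted_urls and news_source columns of the note files, although the notebook may normalise domains differently.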
📄 Data Files
Notes Data
- notes_with_lang.csv: All Community Notes written between January 23, 2021, and January 23, 2025, with detected language, extracted URLs, and domain names.
- english_notes_with_nlp.csv: Subset of English notes with BERTopic topics, topic numbers, and keyword representations.
Each note file contains the following variables:
- noteId: Unique ID of the note.
- noteAuthorParticipantId: Unique ID of the note's author.
- tweetId: ID of the tweet the note addresses.
- date: Date the note was written (YYYY-MM-DD).
- Timestamp: Time the note was written (HH:MM:SS).
- language: Detected language of the note.
- extracted_urls: List of URLs mentioned in the note.
- news_source: List of extracted domain names.
- BERTopic_word (only in English notes file): Main topic name.
- BERTopic_number (only in English notes file): Numeric topic identifier.
- BERTopic_representation (only in English notes file): List of keywords representing the topic.
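As a rough illustration of how these variables might be read, the sketch below parses rows with the column names listed above using only the standard library. It assumes that the list-valued columns (extracted_urls, news_source) are serialised as Python list literals, which the description does not confirm; the helper `parse_note_row` and the two sample rows are made up for the example.

```python
import ast
import csv
import io
from datetime import datetime


def parse_note_row(row):
    """Convert one CSV row (a dict of strings) into typed Python values."""
    return {
        # IDs stay as strings so long numeric identifiers are never coerced to floats.
        "noteId": row["noteId"],
        "language": row["language"],
        # date (YYYY-MM-DD) and Timestamp (HH:MM:SS) are combined into one datetime.
        "written_at": datetime.strptime(
            f"{row['date']} {row['Timestamp']}", "%Y-%m-%d %H:%M:%S"
        ),
        # Assumption: list columns are stored as Python list literals such as "['a', 'b']".
        "extracted_urls": ast.literal_eval(row["extracted_urls"] or "[]"),
        "news_source": ast.literal_eval(row["news_source"] or "[]"),
    }


# Synthetic two-row sample standing in for notes_with_lang.csv.
sample = io.StringIO(
    'noteId,noteAuthorParticipantId,tweetId,date,Timestamp,language,extracted_urls,news_source\n'
    'n1,a1,t1,2021-01-23,14:05:09,en,"[\'https://example.org/x\']","[\'example.org\']"\n'
    'n2,a2,t2,2025-01-23,08:00:00,es,[],[]\n'
)
notes = [parse_note_row(row) for row in csv.DictReader(sample)]
print(notes[0]["written_at"], notes[0]["news_source"])
```

For the real file, `open("notes_with_lang.csv", newline="", encoding="utf-8")` would replace the in-memory sample; if the list columns use a different serialisation, the `ast.literal_eval` calls would need to change accordingly.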
Rating Data
- Monthly rating files are stored in the rating monthly files/ directory with the naming format ratings_m_yyyy.csv.
- Each file includes:
  - noteId: ID of the rated note.
  - raterParticipantId: ID of the participant giving the rating.
  - helpfulnessLevel: Rating category (HELPFUL, SOMEWHAT_HELPFUL, NOT_HELPFUL).
  - helpful, notHelpful: Deprecated binary flags (use helpfulnessLevel instead).
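A short sketch of how the monthly files could be enumerated and tallied follows. It assumes that the m in ratings_m_yyyy.csv is the month number without zero padding, which the naming format suggests but does not state; `monthly_rating_paths`, `count_levels`, and the sample rows are hypothetical.

```python
from collections import Counter
from pathlib import Path

RATINGS_DIR = Path("rating monthly files")


def monthly_rating_paths(start=(2021, 1), end=(2025, 1)):
    """List one file path per month from start to end inclusive, given as (year, month)."""
    paths = []
    year, month = start
    while (year, month) <= end:
        # Assumption: 'm' is the month number without zero padding (1..12).
        paths.append(RATINGS_DIR / f"ratings_{month}_{year}.csv")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return paths


def count_levels(rows):
    """Tally ratings by helpfulnessLevel, ignoring the deprecated binary flags."""
    return Counter(row["helpfulnessLevel"] for row in rows)


paths = monthly_rating_paths()

# Synthetic rows with the documented columns, not taken from the dataset.
sample_rows = [
    {"noteId": "n1", "raterParticipantId": "r1", "helpfulnessLevel": "HELPFUL"},
    {"noteId": "n1", "raterParticipantId": "r2", "helpfulnessLevel": "NOT_HELPFUL"},
    {"noteId": "n2", "raterParticipantId": "r1", "helpfulnessLevel": "HELPFUL"},
]
levels = count_levels(sample_rows)
print(len(paths), paths[0].name, paths[-1].name)
print(levels)
```

In practice each path would be opened with `csv.DictReader` (or a dataframe library) and the rows passed to `count_levels`.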
🌐 Network Files
Each month’s ratings are used to construct interaction graphs with user-to-user edges based on rating behaviours.
- Whole Networks (whole_network_<month>_<year>.graphml): Full user interaction networks, with edges annotated by the number of helpful, unhelpful, and somewhat helpful ratings.
  - Each edge contains:
    - source: Rater’s participant ID.
    - target: Note author’s participant ID.
    - helpful, unhelpful, somewhathelpful: Count of ratings by type from rater to author.
- Helpful Networks (network_<month>_<year>_helpful.graphml): Subnetworks based on helpful ratings only.
- Somewhat Helpful Networks (network_<month>_<year>_somewhat.graphml): Subnetworks based on somewhat helpful ratings.
- Unhelpful Networks (network_<month>_<year>_unhelpful.graphml): Subnetworks based on unhelpful ratings.
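The edge structure described above can be sketched as a plain aggregation of ratings into directed rater-to-author edges. This is an illustrative reconstruction rather than the code in Create_graphs.ipynb: the helper names, the mapping from helpfulnessLevel values to edge attributes, and the rule that a subnetwork keeps every edge with at least one rating of the given type are assumptions.

```python
from collections import defaultdict

# Assumed mapping from helpfulnessLevel values to the documented edge attributes.
LEVEL_TO_ATTR = {
    "HELPFUL": "helpful",
    "SOMEWHAT_HELPFUL": "somewhathelpful",
    "NOT_HELPFUL": "unhelpful",
}


def build_edges(ratings, note_authors):
    """Aggregate ratings into directed (rater, author) edges with per-type counts."""
    edges = defaultdict(lambda: {"helpful": 0, "unhelpful": 0, "somewhathelpful": 0})
    for rating in ratings:
        author = note_authors.get(rating["noteId"])
        attr = LEVEL_TO_ATTR.get(rating["helpfulnessLevel"])
        if author is None or attr is None:
            continue  # skip ratings of unknown notes or with unrecognised levels
        edges[(rating["raterParticipantId"], author)][attr] += 1
    return dict(edges)


def subnetwork(edges, attr):
    """Keep only edges with at least one rating of the given type, weighted by its count."""
    return {pair: counts[attr] for pair, counts in edges.items() if counts[attr] > 0}


# Synthetic data: noteId -> noteAuthorParticipantId, plus four ratings.
note_authors = {"n1": "a1", "n2": "a1"}
ratings = [
    {"noteId": "n1", "raterParticipantId": "r1", "helpfulnessLevel": "HELPFUL"},
    {"noteId": "n2", "raterParticipantId": "r1", "helpfulnessLevel": "NOT_HELPFUL"},
    {"noteId": "n1", "raterParticipantId": "r2", "helpfulnessLevel": "SOMEWHAT_HELPFUL"},
    {"noteId": "n9", "raterParticipantId": "r2", "helpfulnessLevel": "HELPFUL"},
]
edges = build_edges(ratings, note_authors)
helpful_edges = subnetwork(edges, "helpful")
print(edges)
print(helpful_edges)
```

The resulting dictionaries map naturally onto a directed graph with edge attributes, which a graph library could then serialise to GraphML.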
Files
Total size: 10.1 GB
| MD5 checksum | Size |
|---|---|
| d7afe96a6f8c53272b35590919a22cc1 | 127.3 kB |
| 3f2b428543e75cfb1c2cda902debcd56 | 44.2 kB |
| 78aeff114e0806ff8529ca22437a06ed | 3.7 GB |
| 2c74df2140e5742bff992df845d365a5 | 347.7 MB |
| 7ea83da15ed30d679a6c5f3f767a2438 | 443.0 MB |
| 17af2074f9831188de7dbc39ad03f955 | 2.1 GB |
| 388f5193eac220a6284a1a484dc2f47d | 12.2 kB |
| 3dfa5f1a119e51fc26c1d0cfaf61e3d1 | 832.5 kB |
| 7d8bcc1f7f1566a588a683792b6cbb88 | 3.5 GB |
Additional details
Funding
- Taighde Éireann - Research Ireland (18/CRT/6049)
- Taighde Éireann - Research Ireland (IRCLA/2022/3217)