Preprint
Gildersleve, P., Beers, A., Ito, V., Orozco, A., & Tripodi, F. (2025). WikiReddit: Tracing Information and Attention Flows Between Online Platforms. arXiv [cs.CY]. https://doi.org/10.48550/arXiv.2502.04942
Accepted at the International AAAI Conference on Web and Social Media (ICWSM) 2025
Abstract
The World Wide Web is a complex interconnected digital ecosystem, where information and attention flow between platforms and communities throughout the globe. These interactions co-construct how we understand the world, reflecting and shaping public discourse. Unfortunately, researchers often struggle to understand how information circulates and evolves across the web because platform-specific data is often siloed and restricted by linguistic barriers. To address this gap, we present a comprehensive, multilingual dataset capturing all Wikipedia links shared in posts and comments on Reddit from 2020 to 2023, excluding those from private and NSFW subreddits. Each linked Wikipedia article is enriched with revision history, page view data, article ID, redirects, and Wikidata identifiers. Through a research agreement with Reddit, our dataset ensures user privacy while providing a query and ID mechanism that integrates with the Reddit and Wikipedia APIs. This enables extended analyses for researchers studying how information flows across platforms. For example, Reddit discussions use Wikipedia for deliberation and fact-checking, which subsequently influences Wikipedia content by driving traffic to articles or inspiring edits. By analyzing the relationship between information shared and discussed on these platforms, our dataset provides a foundation for examining the interplay between social media discourse and collaborative knowledge consumption and production.
Datasheet
Motivation
The motivations for this dataset stem from the challenges researchers face in studying the flow of information across the web. While the World Wide Web enables global communication and collaboration, data silos, linguistic barriers, and platform-specific restrictions hinder our ability to understand how information circulates, evolves, and impacts public discourse. Wikipedia and Reddit, as major hubs of knowledge sharing and discussion, offer an invaluable lens into these processes. However, without comprehensive data capturing their interactions, researchers are unable to fully examine how platforms co-construct knowledge. This dataset bridges this gap, providing the tools needed to study the interconnectedness of social media and collaborative knowledge systems.
Composition
WikiReddit is a comprehensive dataset capturing all Wikipedia mentions (including links) shared in posts and comments on Reddit from 2020 to 2023, excluding those from private and NSFW (not safe for work) subreddits. The SQL database comprises 336K total posts, 10.2M comments, 1.95M unique links, and 1.26M unique articles spanning 59 languages on Reddit and 276 Wikipedia language subdomains. Each linked Wikipedia article is enriched with its revision history and page view data within a ±10-day window of its posting, as well as article ID, redirects, and Wikidata identifiers. Supplementary anonymous metadata from Reddit posts and comments further contextualizes the links, offering a robust resource for analyzing cross-platform information flows, collective attention dynamics, and the role of Wikipedia in online discourse.
Collection Process
Data was collected from the Reddit4Researchers and Wikipedia APIs. No personally identifiable information is published in the dataset. Reddit data is linked to Wikipedia via the hyperlinks and article titles appearing in Reddit posts and comments.
Preprocessing/cleaning/labeling
Regular expressions and other text-processing tools were applied extensively to the Reddit post and comment text to extract the Wikipedia URLs. Redirects for Wikipedia URLs and article titles were found through the Wikipedia API and mapped to the collected data. Reddit IDs are hashed with SHA-256 to anonymize posts, comments, users, and subreddits.
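The extraction and hashing steps can be illustrated with a minimal sketch. It is not the authors' pipeline: the regular expression, the helper names `extract_wiki_links` and `hash_id`, and the choice of an unsalted hash are all assumptions made for illustration. The output fields mirror the `lang`, `mobile`, and `raw_title` columns of the `linkarticles` table.

```python
import hashlib
import re
from urllib.parse import unquote, urlsplit

# Hypothetical pattern: any Wikipedia language subdomain, optional mobile ".m" part.
WIKI_URL = re.compile(
    r"https?://([a-z\-]+)(\.m)?\.wikipedia\.org/wiki/[^\s\]>\"]+",
    re.IGNORECASE,
)

def _trim(url):
    """Drop trailing sentence punctuation and unbalanced closing parentheses."""
    url = url.rstrip(".,;:!?")
    while url.endswith(")") and url.count(")") > url.count("("):
        url = url[:-1].rstrip(".,;:!?")
    return url

def extract_wiki_links(text):
    """Return (url, lang, mobile, raw_title) tuples found in free text."""
    links = []
    for match in WIKI_URL.finditer(text):
        url = _trim(match.group(0))
        lang = match.group(1).lower()
        mobile = 1 if match.group(2) else 0
        # Percent-decode the path after "/wiki/" to get the raw title.
        raw_title = unquote(urlsplit(url).path[len("/wiki/"):])
        links.append((url, lang, mobile, raw_title))
    return links

def hash_id(reddit_id):
    """SHA-256 hex digest of an ID. Whether the dataset salts IDs is unknown."""
    return hashlib.sha256(reddit_id.encode("utf-8")).hexdigest()

sample = ("See https://en.m.wikipedia.org/wiki/Alan_Turing, and "
          "[this](https://de.wikipedia.org/wiki/K%C3%B6ln).")
links = extract_wiki_links(sample)
```

Titles containing balanced parentheses, such as `Python_(programming_language)`, are kept whole, while the closing parenthesis of a Markdown link is trimmed.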
Uses
We foresee several applications of this dataset and preview four here. First, Reddit linking data can be used to understand how attention is driven from one platform to another. Second, Reddit linking data can shed light on how Wikipedia's archive of knowledge is used in the larger social web. Third, our dataset could provide insights into how external attention is topically distributed across Wikipedia. It can also help extend such analysis to disparities in which types of external communities use Wikipedia, and how they use it. Fourth, and relatedly, a topic analysis of our dataset could reveal how Wikipedia usage on Reddit contributes to societal benefits and harms. Our dataset could help examine whether homogeneity within the Reddit and Wikipedia audiences shapes topic patterns, and assess whether these relationships mitigate or amplify problematic engagement online.
Distribution
The dataset is publicly shared with a Creative Commons Attribution 4.0 International license. The article describing this dataset should be cited: https://doi.org/10.48550/arXiv.2502.04942
Maintenance
Patrick Gildersleve will maintain this dataset, and add further years of content as and when available.
SQL Database Schema
Table: posts
Column Name | Type | Description
subreddit_id | TEXT | The unique identifier for the subreddit.
crosspost_parent_id | TEXT | The ID of the original Reddit post if this post is a crosspost.
post_id | TEXT | Unique identifier for the Reddit post.
created_at | TIMESTAMP | The timestamp when the post was created.
updated_at | TIMESTAMP | The timestamp when the post was last updated.
language_code | TEXT | The language code of the post.
score | INTEGER | The score (upvotes minus downvotes) of the post.
upvote_ratio | REAL | The ratio of upvotes to total votes.
gildings | INTEGER | Number of awards (gildings) received by the post.
num_comments | INTEGER | Number of comments on the post.
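Because `score` is defined as upvotes minus downvotes and `upvote_ratio` as upvotes over total votes, the two columns can in principle be inverted to estimate raw vote counts. With score s and ratio r, the upvotes are u = s·r / (2r − 1) and the downvotes are d = u − s. The sketch below applies this algebra; the function name `estimate_votes` is hypothetical, and the result is only an approximation, since the stored values may be rounded or otherwise adjusted by the platform.

```python
def estimate_votes(score, upvote_ratio):
    """Estimate (upvotes, downvotes) from score and upvote_ratio.

    Returns None when the ratio is 0.5, where the system is underdetermined
    (the score is zero for any equal number of up- and downvotes).
    """
    denom = 2 * upvote_ratio - 1
    if abs(denom) < 1e-9:
        return None
    up = score * upvote_ratio / denom
    down = up - score
    return round(up), round(down)

votes = estimate_votes(60, 0.8)  # 80 upvotes, 20 downvotes under these definitions
```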
Table: comments
Column Name | Type | Description
subreddit_id | TEXT | The unique identifier for the subreddit.
post_id | TEXT | The ID of the Reddit post the comment belongs to.
parent_id | TEXT | The ID of the parent comment (if a reply).
comment_id | TEXT | Unique identifier for the comment.
created_at | TIMESTAMP | The timestamp when the comment was created.
last_modified_at | TIMESTAMP | The timestamp when the comment was last modified.
score | INTEGER | The score (upvotes minus downvotes) of the comment.
upvote_ratio | REAL | The ratio of upvotes to total votes for the comment.
gilded | INTEGER | Number of awards (gildings) received by the comment.
Table: postlinks
Column Name | Type | Description
post_id | TEXT | Unique identifier for the Reddit post.
end_processed_valid | INTEGER | Whether the extracted URL from the post resolves to a valid URL.
end_processed_url | TEXT | The extracted URL from the Reddit post.
final_valid | INTEGER | Whether the final URL from the post resolves to a valid URL after redirections.
final_status | INTEGER | HTTP status code of the final URL.
final_url | TEXT | The final URL after redirections.
redirected | INTEGER | Indicator of whether the posted URL was redirected (1) or not (0).
in_title | INTEGER | Indicator of whether the link appears in the post title (1) or post body (0).
Table: commentlinks
Column Name | Type | Description
comment_id | TEXT | Unique identifier for the Reddit comment.
end_processed_valid | INTEGER | Whether the extracted URL from the comment resolves to a valid URL.
end_processed_url | TEXT | The extracted URL from the comment.
final_valid | INTEGER | Whether the final URL from the comment resolves to a valid URL after redirections.
final_status | INTEGER | HTTP status code of the final URL.
final_url | TEXT | The final URL after redirections.
redirected | INTEGER | Indicator of whether the URL was redirected (1) or not (0).
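As an illustration of how the `postlinks` and `commentlinks` tables might be combined, the sketch below counts how often each `final_url` was shared in posts and in comments. It runs against an in-memory SQLite database holding toy rows, not values from the dataset. The choice of SQLite and the reading of `final_valid = 1` as "valid" are assumptions; the released database may use a different engine or encoding.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE postlinks (post_id TEXT, final_valid INTEGER, final_url TEXT, in_title INTEGER);
CREATE TABLE commentlinks (comment_id TEXT, final_valid INTEGER, final_url TEXT);
""")
# Toy rows for illustration only.
conn.executemany("INSERT INTO postlinks VALUES (?, ?, ?, ?)", [
    ("p1", 1, "https://en.wikipedia.org/wiki/Reddit", 0),
    ("p2", 0, "https://en.wikipedia.org/wiki/Broken", 1),
])
conn.executemany("INSERT INTO commentlinks VALUES (?, ?, ?)", [
    ("c1", 1, "https://en.wikipedia.org/wiki/Reddit"),
    ("c2", 1, "https://en.wikipedia.org/wiki/Reddit"),
    ("c3", 1, "https://fr.wikipedia.org/wiki/Paris"),
])

# Stack valid links from both tables, then count shares per URL and source.
SHARES_PER_URL = """
SELECT final_url,
       SUM(source = 'post')    AS n_posts,
       SUM(source = 'comment') AS n_comments
FROM (
    SELECT final_url, 'post' AS source FROM postlinks WHERE final_valid = 1
    UNION ALL
    SELECT final_url, 'comment' AS source FROM commentlinks WHERE final_valid = 1
)
GROUP BY final_url
ORDER BY COUNT(*) DESC, final_url
"""
shares = conn.execute(SHARES_PER_URL).fetchall()
```

Filtering on `final_valid` rather than `end_processed_valid` keeps only links that still resolve after redirections, which is usually the relevant set when joining onward to article-level tables.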
Table: linkarticles
Column Name | Type | Description
final_url | TEXT | The final URL after redirections.
lang | TEXT | The language code of the page.
mobile | INTEGER | Indicator of whether the link was mobile-specific (1) or not (0).
raw_title | TEXT | The raw, unprocessed title text extracted from the link.
Table: resolved_redirects
Column Name | Type | Description
lang | TEXT | The language code of the Wikipedia page.
raw_title | TEXT | The raw title of the Wikipedia link before redirection.
norm_title | TEXT | The normalized raw title of the page.
canonical_title | TEXT | The canonical title after resolving the redirect.
Table: collected_redirects
Column Name | Type | Description
lang | TEXT | The language code of the Wikipedia page.
canonical_title | TEXT | The canonical title of the page.
other_title | TEXT | Other titles associated with the page that redirect to the canonical title.
Table: wiki_ids
Column Name | Type | Description
lang | TEXT | The language code of the Wikipedia page.
title | TEXT | The title of the Wikipedia page.
pageid | INTEGER | Unique identifier for the page in Wikipedia.
wikidata_id | TEXT | The Wikidata identifier for the page.
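The `linkarticles`, `resolved_redirects`, and `wiki_ids` tables appear designed to be chained, taking a shared URL to a canonical article and its identifiers. The sketch below shows one plausible way to do that in SQLite. The join keys are assumptions inferred from the column names: `(lang, raw_title)` between the first two tables, and `wiki_ids.title` matching `canonical_title`. The helper `resolve` is hypothetical and the rows are toy values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE linkarticles (final_url TEXT, lang TEXT, mobile INTEGER, raw_title TEXT);
CREATE TABLE resolved_redirects (lang TEXT, raw_title TEXT, norm_title TEXT, canonical_title TEXT);
CREATE TABLE wiki_ids (lang TEXT, title TEXT, pageid INTEGER, wikidata_id TEXT);
-- Toy rows for illustration only, not values from the dataset.
INSERT INTO linkarticles VALUES ('https://en.wikipedia.org/wiki/Foo_alias', 'en', 0, 'Foo_alias');
INSERT INTO linkarticles VALUES ('https://en.wikipedia.org/wiki/Unresolved', 'en', 0, 'Unresolved');
INSERT INTO resolved_redirects VALUES ('en', 'Foo_alias', 'Foo alias', 'Foo');
INSERT INTO wiki_ids VALUES ('en', 'Foo', 123, 'Q123');
""")

# LEFT JOINs keep links whose title could not be resolved (NULL identifiers).
RESOLVE = """
SELECT la.final_url, rr.canonical_title, wi.pageid, wi.wikidata_id
FROM linkarticles AS la
LEFT JOIN resolved_redirects AS rr
       ON rr.lang = la.lang AND rr.raw_title = la.raw_title
LEFT JOIN wiki_ids AS wi
       ON wi.lang = rr.lang AND wi.title = rr.canonical_title
WHERE la.final_url = ?
"""

def resolve(conn, final_url):
    """Return (final_url, canonical_title, pageid, wikidata_id), or None if unknown."""
    return conn.execute(RESOLVE, (final_url,)).fetchone()

resolved = resolve(conn, "https://en.wikipedia.org/wiki/Foo_alias")
```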
Table: pageviews
Column Name | Type | Description
lang | TEXT | The language code of the Wikipedia page.
title | TEXT | The title of the Wikipedia page (not strictly the canonical title).
date | TIMESTAMP | The date of the page view count.
pageviews | INTEGER | The number of page views on the given date.
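Since page views are provided within a ±10-day window around each posting, a natural first analysis compares daily views before and after the day a link was shared. The sketch below does this for a single article in SQLite. It assumes `date` values are stored as ISO `YYYY-MM-DD` strings, which may not match the released database, and uses a hypothetical helper `window_views` with synthetic view counts.

```python
import sqlite3
from datetime import date, timedelta

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pageviews (lang TEXT, title TEXT, date TIMESTAMP, pageviews INTEGER)")

# Synthetic series: 100 views per day before 2021-03-15, 150 from that day on.
shared_on = date(2021, 3, 15)
for i in range(30):
    day = date(2021, 3, 1) + timedelta(days=i)
    views = 100 if day < shared_on else 150
    conn.execute("INSERT INTO pageviews VALUES (?, ?, ?, ?)",
                 ("en", "Foo", day.isoformat(), views))

def window_views(conn, lang, title, shared_on, days=10):
    """Return (views_before, views_after) lists within +/- `days` of shared_on."""
    start = (shared_on - timedelta(days=days)).isoformat()
    end = (shared_on + timedelta(days=days)).isoformat()
    rows = conn.execute(
        "SELECT date, pageviews FROM pageviews "
        "WHERE lang = ? AND title = ? AND date BETWEEN ? AND ? ORDER BY date",
        (lang, title, start, end),
    ).fetchall()
    pivot = shared_on.isoformat()  # ISO strings compare in chronological order
    before = [v for d, v in rows if d < pivot]
    after = [v for d, v in rows if d > pivot]
    return before, after

before, after = window_views(conn, "en", "Foo", shared_on)
mean_before = sum(before) / len(before)
mean_after = sum(after) / len(after)
```

Note that `title` in this table is not strictly the canonical title, so a lookup by canonical title may need to go through the redirect tables first.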
Table: revisions
Column Name | Type | Description
lang | TEXT | The language code of the Wikipedia page.
canonical_title | TEXT | The canonical title of the Wikipedia page.
revid | INTEGER | The unique revision identifier.
parentid | INTEGER | The ID of the parent revision.
timestamp | TEXT | The timestamp of the revision.
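Because revision timestamps are stored as TEXT, they need to be parsed before comparing them with Reddit posting times. The sketch below counts the edits made to an article within a number of days after it was shared. The timestamp format (`YYYY-MM-DDTHH:MM:SSZ`, as commonly returned by the MediaWiki API) is an assumption about this dataset, and `parse_ts` and `count_edits_after` are hypothetical helper names; the rows are toy values.

```python
from datetime import datetime, timedelta, timezone

TS_FORMAT = "%Y-%m-%dT%H:%M:%SZ"  # assumed storage format of revisions.timestamp

def parse_ts(ts):
    """Parse a revision timestamp string into a timezone-aware UTC datetime."""
    return datetime.strptime(ts, TS_FORMAT).replace(tzinfo=timezone.utc)

def count_edits_after(revisions, shared_at, days=10):
    """Count revisions with shared_at <= timestamp <= shared_at + days.

    `revisions` is an iterable of (revid, parentid, timestamp) tuples.
    """
    end = shared_at + timedelta(days=days)
    return sum(1 for _, _, ts in revisions if shared_at <= parse_ts(ts) <= end)

# Toy rows for illustration only.
revisions = [
    (1001, 1000, "2021-03-14T08:00:00Z"),  # before the link was shared
    (1002, 1001, "2021-03-15T13:00:00Z"),  # inside the window
    (1003, 1002, "2021-03-20T00:00:00Z"),  # inside the window
    (1004, 1003, "2021-03-26T00:00:00Z"),  # after the window closes
]
shared_at = datetime(2021, 3, 15, 12, 0, tzinfo=timezone.utc)
n_edits = count_edits_after(revisions, shared_at)
```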