Dataset Open Access
Zagovora, Olga;
Ulloa, Roberto;
Weller, Katrin;
Flöck, Fabian
This dataset includes the historical versions of all individual references per article in the English Wikipedia. Each reference object also contains information about its original creating editor, editors implementing changes to it, and timestamps of all actions (creations, modifications, deletions, and reinsertions) that were applied to the reference. Each historical version of a reference is represented as a list of tokens (≈ words), where each token has an individual creator and change history.
The extraction process was validated through crowdsourced evaluations and achieves very high accuracy compared with standard text-difference algorithms. The dataset covers references created with the "<ref>" tag up to June 2019. It contains 55,503,998 references with 164,530,374 actions, found in 4,690,046 Wikipedia articles.
The dataset consists of JSON files where each article's page ID (here: article_id) is used as a file name. Each file is represented as a list of “References”. Each reference is a dictionary with the following keys:
1 WikiWho is a text mining algorithm to extract changes to tokens from Wikipedia revisions. Each token is assigned a unique ID. More information: https://www.wikiwho.net/#technical_details
GitHub Repository with Python example code on how to process data and extract document identifiers: https://github.com/gesiscss/wikipedia_references
To run the code on GESIS Notebooks, follow this link: https://notebooks.gesis.org/binder/v2/gh/gesiscss/wikipedia_references/master
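Independently of the repository code, the file layout described above (one JSON file per article, each holding a list of reference dictionaries) can be explored with a minimal sketch like the following. It is not the authors' processing code. It assumes that each archive member is named after the article's page ID with a `.json` extension, which is not confirmed here, and the helper names `iter_articles` and `key_counts` are hypothetical. Because the key names are not reproduced in this description, the sketch only counts whatever keys it finds rather than relying on specific ones.

```python
import json
import zipfile
from collections import Counter
from pathlib import PurePosixPath


def iter_articles(zip_path):
    """Yield (article_id, references) for every JSON member of one archive."""
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            if not name.endswith(".json"):
                continue  # skip directories and any non-JSON members
            # Assumption: the member's base name is the article's page ID.
            article_id = PurePosixPath(name).stem
            with archive.open(name) as handle:
                references = json.load(handle)  # expected: a list of dicts
            yield article_id, references


def key_counts(references):
    """Count how often each dictionary key occurs across a list of references."""
    counts = Counter()
    for reference in references:
        counts.update(reference.keys())
    return counts
```

Calling `iter_articles` on one of the downloaded archives and passing the first article's list to `key_counts` is a quick way to see which keys the reference dictionaries actually carry before writing any further processing. Reading members directly from the archive avoids unpacking several gigabytes to disk.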
Name | MD5 | Size |
---|---|---|
EN_References_part_1_articleids_10000001-1994503.zip | 4eae2c54ece7a8347b6938a044e5417d | 4.6 GB |
EN_References_part_2_articleids_19945040-30669454.zip | 987083042990a60e4f2c14a1b74df486 | 4.2 GB |
EN_References_part_3_articleids_30669478-41846817.zip | 3d6b99caa987f0b76fcf9fd09d9ea1a3 | 4.0 GB |
EN_References_part_4_articleids_41846818-54127354.zip | 64cba1093121ec25cdcdeaeb0db8096b | 3.6 GB |
EN_References_part_5_articleids_54127356-99999.zip | 8ccbbd6080fcdf652a1ed5304dc39631 | 4.0 GB |
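Since each archive is several gigabytes, it can be worth checking a download against the MD5 checksum listed with it above. The following is a minimal sketch using only the Python standard library. The function names `md5_of_file` and `verify_download` are hypothetical, and the sketch assumes the archives sit in the local directory passed in, under their original file names.

```python
import hashlib
from pathlib import Path

# Checksums copied from the file listing above.
EXPECTED_MD5 = {
    "EN_References_part_1_articleids_10000001-1994503.zip": "4eae2c54ece7a8347b6938a044e5417d",
    "EN_References_part_2_articleids_19945040-30669454.zip": "987083042990a60e4f2c14a1b74df486",
    "EN_References_part_3_articleids_30669478-41846817.zip": "3d6b99caa987f0b76fcf9fd09d9ea1a3",
    "EN_References_part_4_articleids_41846818-54127354.zip": "64cba1093121ec25cdcdeaeb0db8096b",
    "EN_References_part_5_articleids_54127356-99999.zip": "8ccbbd6080fcdf652a1ed5304dc39631",
}


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so multi-gigabyte files never sit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(directory, expected=EXPECTED_MD5):
    """Return {file name: True/False/None}; None means the file is missing."""
    results = {}
    for name, checksum in expected.items():
        path = Path(directory) / name
        results[name] = md5_of_file(path) == checksum if path.exists() else None
    return results
```

Any entry that comes back `False` indicates a corrupted or incomplete download that should be fetched again.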