Dataset Open Access

Individual Edit Histories of All References in the English Wikipedia

Zagovora, Olga; Ulloa, Roberto; Weller, Katrin; Flöck, Fabian

This dataset includes the historical versions of all individual references per article in the English Wikipedia. Each reference object also contains information about its original creating editor, editors implementing changes to it, and timestamps of all actions (creations, modifications, deletions, and reinsertions) that were applied to the reference. Each historical version of a reference is represented as a list of tokens (≈ words), where each token has an individual creator and change history.

The extraction process was vetted through crowdsourced evaluations, which indicated very high accuracy compared with standard text-difference algorithms. The dataset covers references created with the "<ref>" tag up to June 2019. It contains 55,503,998 references with 164,530,374 actions, found in 4,690,046 Wikipedia articles.

The dataset consists of JSON files, one per article, each named after the article's page ID (here: article_id). Each file contains a list of references, and each reference is a dictionary with the following keys:

  • "first_rev_id" type: Integer, first revision where the reference was inserted (the same value appears as the first element of "ins" and as "rev_id" of the first element in "change_sequence"),
  • "first_hash_id" type: String, the hash value of the token_id list (from WikiWho¹, see below) of the first version of the reference (the same value appears as "hash_id" of the first element in "change_sequence"),
  • "first_editor_id" type: String, user_id or IP address of the editor of the first revision where the reference was inserted (the same value appears as "editor_id" of the first element in "change_sequence"),
  • "deleted" type: Boolean, an indicator of whether the reference exists in the last available revision,
  • "ins" type: List of Integers, list of revisions where the reference was inserted (includes the first revision mentioned as "first_rev_id"),
  • "ins_editor" type: List of Strings, list of user_id or IP addresses of editors where the reference was inserted (includes the first user mentioned as "first_editor_id"),
  • "del" type: List of Integers, list of revisions where the reference was deleted from the article or modified so heavily that fewer than 25% of its tokens remained,
  • "del_editor" type: List of Strings, list of user_ids or IP addresses of the editors who deleted the reference or modified it so heavily that fewer than 25% of its tokens remained,
  • "modif" type: List of Integers, list of revisions where the reference was modified, or reinserted with modification,
  • "hashes" type: List of Strings, list of hash values of all versions of the reference,
  • "first_rev_time" type: DateTime, the timestamp when the reference was created (the same value appears as the first element of "ins_time" and as "time" of the first element in "change_sequence"),
  • "ins_time" type: List of DateTime, list of timestamps when the reference was inserted or reinserted,
  • "del_time" type: List of DateTime, list of timestamps when the reference was deleted,
  • "change_sequence" type: List of dictionaries, with information about tokens, editors and revisions where the reference was modified (the first element representing the first revision where the reference was inserted), where:
    • "hash_id" type: String, the hash value of the token_id list (WikiWho¹) of the reference version,
    • "rev_id" type: Integer, the revision number of the particular version of the reference,
    • "editor_id" type: String, user_id or IP address of the revision editor,
    • "time" type: DateTime, the timestamp of this particular version of the reference,
    • "tokens" type: List of Strings, ordered list of tokens (created by WikiWho¹) that represents this particular version of the reference (the list has the same length as "token_editors"),
    • "token_editors" type: List of Strings, ordered list of user_ids or IP addresses of the editors who first added the corresponding token (see "tokens") to the Wikipedia article.
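
The key layout above can be read with a few lines of Python. The following is a minimal sketch, not the authors' code: the helper names (load_references, summarize_reference) and the SAMPLE record are hypothetical, all sample values (including the timestamp format) are invented for illustration, and "change_sequence" is assumed to be stored in chronological order.

```python
import json
from pathlib import Path


def load_references(path):
    """Read one article file: a JSON list of reference dictionaries."""
    with Path(path).open(encoding="utf-8") as handle:
        return json.load(handle)


def summarize_reference(reference):
    """Collect a few per-reference facts from the documented keys."""
    versions = reference["change_sequence"]
    return {
        "first_rev_id": reference["first_rev_id"],
        "created_by": reference["first_editor_id"],
        "n_insertions": len(reference["ins"]),
        "n_deletions": len(reference["del"]),
        "n_versions": len(versions),
        # Joining tokens with spaces only approximates the wikitext;
        # the original whitespace is not restored.
        "latest_text": " ".join(versions[-1]["tokens"]),
    }


# Synthetic record for illustration only; every value here is invented.
SAMPLE = [{
    "first_rev_id": 100,
    "first_hash_id": "h1",
    "first_editor_id": "42",
    "deleted": False,
    "ins": [100],
    "ins_editor": ["42"],
    "del": [],
    "del_editor": [],
    "modif": [105],
    "hashes": ["h1", "h2"],
    "first_rev_time": "2010-01-01T00:00:00Z",
    "ins_time": ["2010-01-01T00:00:00Z"],
    "del_time": [],
    "change_sequence": [
        {"hash_id": "h1", "rev_id": 100, "editor_id": "42",
         "time": "2010-01-01T00:00:00Z",
         "tokens": ["<ref>", "smith", "2009", "</ref>"],
         "token_editors": ["42", "42", "42", "42"]},
        {"hash_id": "h2", "rev_id": 105, "editor_id": "77",
         "time": "2010-02-01T00:00:00Z",
         "tokens": ["<ref>", "smith", "2010", "</ref>"],
         "token_editors": ["42", "42", "77", "42"]},
    ],
}]

if __name__ == "__main__":
    for ref in SAMPLE:
        print(summarize_reference(ref))
```

For a real file, the list returned by load_references on an extracted article file would take the place of SAMPLE.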

¹ WikiWho is a text-mining algorithm that extracts token-level changes from Wikipedia revisions. Each token is assigned a unique ID. More information: https://www.wikiwho.net/#technical_details
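
Because every version in "change_sequence" pairs its "tokens" with a "token_editors" list of equal length, the share of a reference version contributed by each editor can be computed directly. The sketch below is only an illustration: the function name token_share_by_editor and the example version are hypothetical, and the editor IDs are invented.

```python
from collections import Counter


def token_share_by_editor(version):
    """Return the fraction of tokens in one reference version that each editor first added."""
    tokens = version["tokens"]
    editors = version["token_editors"]
    if len(tokens) != len(editors):
        raise ValueError("tokens and token_editors must have the same length")
    if not tokens:
        return {}
    counts = Counter(editors)
    return {editor: count / len(tokens) for editor, count in counts.items()}


# Invented example version: editor "u2" changed one of four tokens.
example_version = {
    "tokens": ["<ref>", "smith", "2010", "</ref>"],
    "token_editors": ["u1", "u1", "u2", "u1"],
}

if __name__ == "__main__":
    print(token_share_by_editor(example_version))
```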

GitHub repository with Python example code showing how to process the data and extract document identifiers: https://github.com/gesiscss/wikipedia_references

To run the code in GESIS Notebooks, follow this link: https://notebooks.gesis.org/binder/v2/gh/gesiscss/wikipedia_references/master

This research was supported by the Deutsche Forschungsgemeinschaft, DFG, project number 314727790.
Files (20.4 GB)

  • EN_References_part_1_articleids_10000001-1994503.zip (4.6 GB), md5: 4eae2c54ece7a8347b6938a044e5417d
  • EN_References_part_2_articleids_19945040-30669454.zip (4.2 GB), md5: 987083042990a60e4f2c14a1b74df486
  • EN_References_part_3_articleids_30669478-41846817.zip (4.0 GB), md5: 3d6b99caa987f0b76fcf9fd09d9ea1a3
  • EN_References_part_4_articleids_41846818-54127354.zip (3.6 GB), md5: 64cba1093121ec25cdcdeaeb0db8096b
  • EN_References_part_5_articleids_54127356-99999.zip (4.0 GB), md5: 8ccbbd6080fcdf652a1ed5304dc39631
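
Since each archive is several gigabytes, it may be worth comparing a download against the md5 value listed with it before unpacking. A minimal sketch using Python's standard hashlib module follows; the helper names md5_of_file and matches_checksum are hypothetical, and the file is read in chunks so that it does not have to fit into memory.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the hexadecimal MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops once read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected_md5):
    """Return True if the file's MD5 digest equals the expected hex string."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

For example, matches_checksum("EN_References_part_2_articleids_19945040-30669454.zip", "987083042990a60e4f2c14a1b74df486") should return True for an intact download of the second archive.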