Wikipedia Edit Event Data 2021 (WikiEvent.2021)
Description
The "Wikipedia Edit Event Data 2021 (WikiEvent.2021)" dataset gives the time, user name, and article title of every edit that a registered and logged-in Wikipedia user made to an article in the English-language edition of Wikipedia from January 15th, 2001 (the launch of Wikipedia) to January 2021. This dataset extends the older version WikiEvent.2018 (https://zenodo.org/record/1626323).
The edit event data were extracted from the file 'enwiki-20210101-stub-meta-history.xml.gz', which was at that time linked from 'https://dumps.wikimedia.org/enwiki/20210101/'. These files are deleted some months after data collection; however, the information is still available in any file 'enwiki-<date>-stub-meta-history.xml.gz' where <date> is 20210101 or later. These data are provided by the Wikimedia Foundation under the GNU Free Documentation License (GFDL) and the Creative Commons Attribution-Share-Alike 3.0 License.
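To illustrate how such a stub file could be turned into edit events, the following Python sketch streams a MediaWiki XML export and applies the selection rules stated in this description (namespace 0, not a redirect, contributor given as <username>). It is not the extraction code used to build the dataset: the names iter_edit_events and local and the inline SAMPLE document are hypothetical, and the element layout (page, ns, redirect, revision, timestamp, contributor) is assumed from the public MediaWiki export format.

```python
import io
import xml.etree.ElementTree as ET
from datetime import datetime, timezone


def local(tag):
    """Drop the XML namespace prefix: '{uri}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def iter_edit_events(xml_stream):
    """Yield (time_ms, user, article) for edits by registered users.

    Sketch only: each <page> is held in memory until its end tag is seen.
    """
    for _, page in ET.iterparse(xml_stream, events=("end",)):
        if local(page.tag) != "page":
            continue
        meta = {local(c.tag): c for c in page if local(c.tag) != "revision"}
        # An article is a namespace-0 page without a <redirect> element.
        if meta["ns"].text == "0" and "redirect" not in meta:
            title = meta["title"].text
            for rev in page:
                if local(rev.tag) != "revision":
                    continue
                parts = {local(c.tag): c for c in rev}
                contributor = parts.get("contributor")
                if contributor is None:
                    continue
                names = [c.text for c in contributor if local(c.tag) == "username"]
                if not names:
                    continue  # anonymous edit (<ip>) or hidden contributor
                stamp = datetime.strptime(parts["timestamp"].text, "%Y-%m-%dT%H:%M:%SZ")
                seconds = int(stamp.replace(tzinfo=timezone.utc).timestamp())
                yield seconds * 1000, names[0], title
        page.clear()  # release the finished page's children


# Hypothetical miniature export: one article, one redirect, one talk page.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
<page><title>Alpha</title><ns>0</ns><id>1</id>
 <revision><id>10</id><timestamp>2001-01-15T00:00:01Z</timestamp>
  <contributor><username>Ann</username><id>7</id></contributor></revision>
 <revision><id>11</id><timestamp>2001-01-15T00:00:02Z</timestamp>
  <contributor><ip>192.0.2.1</ip></contributor></revision>
</page>
<page><title>Beta</title><ns>0</ns><id>2</id><redirect title="Alpha" />
 <revision><id>12</id><timestamp>2001-01-15T00:00:03Z</timestamp>
  <contributor><username>Ann</username><id>7</id></contributor></revision>
</page>
<page><title>Talk:Alpha</title><ns>1</ns><id>3</id>
 <revision><id>13</id><timestamp>2001-01-15T00:00:04Z</timestamp>
  <contributor><username>Bob</username><id>8</id></contributor></revision>
</page>
</mediawiki>"""

# A set drops repeated (time, user, article) triples; sorting orders by time.
events = sorted(set(iter_edit_events(io.BytesIO(SAMPLE))))
print(events)
```

For a real dump one would pass a handle from gzip.open(path, "rb") instead of the in-memory sample. Collecting all events in a set, as done here for brevity, would not scale to the full history and would need an external sort or similar.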
The Wikipedia Edit Event Data 2021 comprises the file 'WikiEvent.2021.csv', a table with 3 columns and more than 450 million rows in CSV format. The cell delimiter is the semicolon (';') and strings are enclosed in double quotes ('"'). The first row is a header that labels the three columns 'time', 'user', and 'article', respectively. The uncompressed size of the file is about 23 GB. The small CSV file 'WikiEvent.2021_excerpt.csv' contains the first few lines of the larger file and serves to illustrate its structure.
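Because the uncompressed file is about 23 GB, it is usually read as a stream rather than loaded at once. The following Python sketch shows one way to do that with the standard csv module, using the delimiter, quote character, and column labels described above. The function name iter_events and the SAMPLE rows are hypothetical; whether the header and the time values are quoted, and how quotes inside strings are escaped, is not stated in this description, so the csv module's defaults are an assumption.

```python
import csv
import io


def iter_events(text_stream):
    """Yield (time_ms, user, article) tuples from semicolon-delimited CSV text."""
    reader = csv.DictReader(text_stream, delimiter=";", quotechar='"')
    for row in reader:
        yield int(row["time"]), row["user"], row["article"]


# Hypothetical rows in the described layout, not copied from the dataset.
SAMPLE = (
    '"time";"user";"article"\n'
    '979516801000;"Ann";"Alpha"\n'
    '979516802000;"Bob";"Semi;colon"\n'
)

for event in iter_events(io.StringIO(SAMPLE)):
    print(event)

# For the real file one would pass an open file handle instead, for example:
# with open("WikiEvent.2021.csv", newline="", encoding="utf-8") as fh:  # encoding assumed
#     for time_ms, user, article in iter_events(fh):
#         ...
```

The generator keeps only one row in memory at a time, so the same loop works for the excerpt file and for the full table.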
Time is given as an integer number of milliseconds from January 1st, 1970 at 0:00 to the time of the edit. Edit times are precise only to the second, however, so the last three digits of every time value are '000'. The edit events are listed in ascending order of time (older edits before newer edits). An article is a page in Namespace 0 (the 'main namespace' or 'article namespace') that is not a redirect. An edit (that is, a new revision of an article) is considered to be performed by a registered and logged-in user if the 'contributor' is given in an XML element <username>...</username> in the referenced stub file. It is considered to be an anonymous edit if the 'contributor' is given in an XML element <ip>...</ip>; anonymous edits are not recorded in the table. Triples (time, user, article) are unique by construction: if the same user edits the same article more than once in the same second, only one of these edits appears in the table.
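The properties stated above (millisecond integers with second precision, ascending order, unique triples) can be checked while streaming the table. The Python sketch below converts a time value to a calendar date and validates a sequence of events. The names edit_time and validate and the SAMPLE_EVENTS list are hypothetical, and reading the time origin as UTC is an assumption, since the description does not name a time zone.

```python
from datetime import datetime, timezone


def edit_time(time_ms):
    """Convert a 'time' value (milliseconds since 1970-01-01 0:00) to a datetime."""
    if time_ms % 1000 != 0:
        raise ValueError(f"expected whole seconds, got {time_ms}")
    return datetime.fromtimestamp(time_ms // 1000, tz=timezone.utc)  # UTC assumed


def validate(events):
    """Check order and uniqueness of (time_ms, user, article); return the row count.

    Since rows are sorted by time, duplicates can only occur among rows with
    the same time value, so only that group is kept in memory.
    """
    prev_time = None
    seen_at_time = set()
    count = 0
    for time_ms, user, article in events:
        if time_ms % 1000 != 0:
            raise ValueError(f"time not in whole seconds: {time_ms}")
        if prev_time is not None and time_ms < prev_time:
            raise ValueError(f"time decreases at row {count + 1}")
        if time_ms != prev_time:
            seen_at_time = set()
            prev_time = time_ms
        if (user, article) in seen_at_time:
            raise ValueError(f"duplicate triple at row {count + 1}")
        seen_at_time.add((user, article))
        count += 1
    return count


# Hypothetical events, not taken from the dataset.
SAMPLE_EVENTS = [
    (979516801000, "Ann", "Alpha"),
    (979516801000, "Bob", "Alpha"),
    (979516802000, "Ann", "Alpha"),
]

print(edit_time(SAMPLE_EVENTS[0][0]).isoformat())
print(validate(SAMPLE_EVENTS))
```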
The related (older) WikiEvent.2018 data was originally used in: Lerner and Lomi (2020). Reliability of relational event model estimates under sampling: how to fit a relational event model to 360 million dyadic events. Network Science, 8(1):97-135. (DOI: https://doi.org/10.1017/nws.2019.57)
How to analyze the WikiEvent Data with relational event models is explained in the eventnet tutorial at: https://github.com/juergenlerner/eventnet/wiki/Large-event-networks-(tutorial).
Files
Name | Size | MD5 |
---|---|---|
WikiEvent.2021.csv.zip | 6.6 GB | f8909b0eadac3824a1e62e14a0cb31f9 |
WikiEvent.2021_excerpt.csv | 2.2 kB | 3f8a30a3aba5447a9086637c9dd30215 |
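The MD5 checksums in the file listing can be used to confirm that a download is complete. A minimal Python sketch follows; the function name md5_of_file and the local path in the commented usage are hypothetical, and pairing the first listed checksum with the 6.6 GB archive is an assumption based on the order of the listing.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Usage (path is hypothetical; checksum taken from the file listing):
# expected = "f8909b0eadac3824a1e62e14a0cb31f9"
# if md5_of_file("WikiEvent.2021.csv.zip") != expected:
#     raise SystemExit("checksum mismatch: download may be incomplete")
```

Reading in chunks keeps memory use constant, which matters for an archive of several gigabytes.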