The Collaborative Organization of Knowledge: Data Set
Description
Wikipedia is an ongoing endeavor to create a free encyclopedia through an open computer-mediated collaborative effort. How does Wikipedia grow and maintain its coverage? This page contains supporting material for a publication that examines this question.
- Diomidis Spinellis and Panagiotis Louridas. The collaborative organization of knowledge. Communications of the ACM, 51(8):68–73, August 2008. (doi:10.1145/1378704.1378720)
In the above paper, a longitudinal study of Wikipedia's evolution shows that although Wikipedia's scope is increasing, its coverage is not deteriorating. This can be explained by the fact that referring to a non-existing entry typically leads to the establishment of an article for it. Wikipedia's evolution also demonstrates the creation of a large real-world scale-free graph through a combination of incremental growth and preferential attachment.
Through this data set you can download the processed results. The file starts with a header giving various attributes of the processed data set.
% Number of bins: 72
% Total revisions: 28247658
% Maximum revisions: 28273 (George W. Bush)
% Maximum reverts: 9218 (George W. Bush)
% Number of moves: 81380
% Total pages: 1898139
% Revisions from IP addresses: 8518913
% Total contributors: 230130
% Maximum different contributors: 2539 (George W. Bush)
% Redirected pages: 631567
% Restricted pages: 2441
% Maximum number of contained references: 17577 (List of all three letter acronyms)
% Pages with at least one revert: 211704
% Total number of reverts across all pages: 1147151
% Total time between reverts: 54524346346
% Moved pages: 80332
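The header can be read into a dictionary before processing the rest of the file. The following is a minimal sketch, assuming each attribute sits on its own line of the form `% key: value` and that the header ends at the first line not starting with `%`. The function name `parse_header` is hypothetical and not part of the data set.

```python
import re

def parse_header(lines):
    """Collect '% key: value' header lines into a dict.

    Assumption: one attribute per line; the header stops at the
    first line that does not start with '%'.
    """
    attrs = {}
    for line in lines:
        if not line.startswith("%"):
            break
        key, _, value = line[1:].strip().partition(":")
        key, value = key.strip(), value.strip()
        # Values are either '123' or '123 (Some page name)'.
        m = re.fullmatch(r"(\d+)(?:\s+\((.*)\))?", value)
        if m is None:
            attrs[key] = value                      # keep unrecognized text as-is
        elif m.group(2) is None:
            attrs[key] = int(m.group(1))
        else:
            attrs[key] = (int(m.group(1)), m.group(2))
    return attrs

sample = [
    "% Number of bins: 72",
    "% Total revisions: 28247658",
    "% Maximum revisions: 28273 (George W. Bush)",
    "% Revisions from IP addresses: 8518913",
]
header = parse_header(sample)
# Share of all revisions that were made from IP addresses.
ip_share = header["Revisions from IP addresses"] / header["Total revisions"]
```

The bin count read here is useful later, because it determines how many per-period reference counts each entry line carries.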
Next comes one line of data for each one of Wikipedia's entries. Here is an example.
A (musical note):1128386876:Mailer diablo:1130566991:MrD9:10:7:18:0:0:0:0:0:0:0: 0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0: 0:0:0:0:0:0:0:0:0:0:0:1:1:1:2:2:2:2:2:2:2:2:2:2:2:2:E
Each line contains the following fields.
- Entry name
- Time of first definition (in seconds since Unix epoch)
- Name of the contributor who first defined the entry
- Time of first reference (in seconds since Unix epoch)
- Name of the contributor who first referenced the entry
- Number of references
- Number of contributors
- Number of revisions
- Number of reverts
- For each one of the time period bins (72 in this file) the number of references to the entry
- The letter "E"
The fields are colon-separated. Colons in the input data are converted to underscores.
Finally come lines summarizing the data set's characteristics for each time period. Here is an example.
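The field list above maps directly onto a small record type. The sketch below splits one entry line on colons, which is safe because colons in names are replaced on output. It assumes every numeric field is present as a decimal integer and that the number of bins is the one given in the header (72 here). The names `Entry` and `parse_entry` are hypothetical.

```python
from typing import List, NamedTuple

class Entry(NamedTuple):
    name: str
    first_defined: int      # seconds since Unix epoch
    definer: str
    first_referenced: int   # seconds since Unix epoch
    referrer: str
    references: int
    contributors: int
    revisions: int
    reverts: int
    bins: List[int]         # references to the entry, one value per time period

def parse_entry(line: str, n_bins: int = 72) -> Entry:
    fields = line.rstrip("\n").split(":")
    # 9 leading fields, n_bins counts, and the trailing "E" marker.
    if len(fields) != 9 + n_bins + 1 or fields[-1] != "E":
        raise ValueError("unexpected entry line layout")
    return Entry(
        name=fields[0],
        first_defined=int(fields[1]),
        definer=fields[2],
        first_referenced=int(fields[3]),
        referrer=fields[4],
        references=int(fields[5]),
        contributors=int(fields[6]),
        revisions=int(fields[7]),
        reverts=int(fields[8]),
        bins=[int(x) for x in fields[9:9 + n_bins]],
    )

# The example entry, rebuilt programmatically to keep the line short.
sample = (
    "A (musical note):1128386876:Mailer diablo:1130566991:MrD9:10:7:18:0:"
    + "0:" * 57 + "1:1:1:" + "2:" * 12 + "E"
)
entry = parse_entry(sample)
```

Checking the field count and the trailing `E` gives a cheap guard against truncated or wrapped lines.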
2001-07-01 4851 0 27106 15129 13458 531
Each line contains the following fields.
- Start date of this period
- Number of entries
- Number of entries that are stubs
- Number of references
- Number of referenced articles
- Number of undefined entries
- Number of active contributors in this period
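The per-period summary lines can be parsed in the same way. This is a minimal sketch, assuming the seven fields are whitespace-separated and the date is in ISO `YYYY-MM-DD` form, as in the example line. The names `PeriodSummary` and `parse_period` are hypothetical.

```python
from datetime import date
from typing import NamedTuple

class PeriodSummary(NamedTuple):
    start: date
    entries: int
    stubs: int
    references: int
    referenced_articles: int
    undefined_entries: int
    active_contributors: int

def parse_period(line: str) -> PeriodSummary:
    parts = line.split()
    if len(parts) != 7:
        raise ValueError("expected 7 whitespace-separated fields")
    # First field is the period start date; the rest are integer counts.
    return PeriodSummary(date.fromisoformat(parts[0]), *map(int, parts[1:]))

summary = parse_period("2001-07-01 4851 0 27106 15129 13458 531")
```

With the lines parsed into records, a series such as `undefined_entries` over `start` can be extracted with a simple list comprehension.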
Files
Name | Size |
---|---|
md5:d7c533b075894084627895aaecc80c37 | 105.6 MB |
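The listed MD5 checksum can be used to confirm that a download is intact. The sketch below hashes a file in chunks so that the 105.6 MB file need not be held in memory at once. The function name `md5_of_file` is hypothetical, and the path of the downloaded file is left to the caller because the file name is not given here.

```python
import hashlib

EXPECTED_MD5 = "d7c533b075894084627895aaecc80c37"  # checksum listed above

def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hexadecimal MD5 digest of the file at `path`."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def is_intact(path: str) -> bool:
    """Compare the file's digest with the checksum from this page."""
    return md5_of_file(path) == EXPECTED_MD5
```

Calling `is_intact` with the local path of the downloaded file returns `True` only when the digest matches.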
Additional details
Related works
- Is supplement to
- 10.1145/1378704.1378720 (DOI)
- 10.5281/zenodo.2526733 (DOI)