# The Collaborative Organization of Knowledge: Data Set

Spinellis, Diomidis; Louridas, Panos

<dc:date>2008-06-03</dc:date>
<dc:description>Wikipedia is an ongoing endeavor to create a free encyclopedia through an open computer-mediated collaborative effort. How does Wikipedia grow and maintain its coverage? This page contains supporting material for a publication that examines this question.

Diomidis Spinellis and Panagiotis Louridas. The collaborative organization of knowledge. Communications of the ACM, 51(8):68–73, August 2008. (doi:10.1145/1378704.1378720)

In the above paper, a longitudinal study of Wikipedia's evolution shows that although Wikipedia's scope is increasing, its coverage is not deteriorating. The explanation is that a reference to a nonexistent entry typically leads to the creation of an article for it. Wikipedia's evolution also demonstrates the creation of a large real-world scale-free graph through a combination of incremental growth and preferential attachment.

Through this data set you can download the processed results. The file starts with a header giving various attributes of the processed data set.

% Number of bins: 72
% Total revisions: 28247658
% Maximum revisions: 28273 (George W. Bush)
% Maximum reverts: 9218 (George W. Bush)
% Number of moves: 81380
% Total pages: 1898139
% Revisions from IP addresses: 8518913
% Total contributors: 230130
% Maximum different contributors: 2539 (George W. Bush)
% Redirected pages: 631567
% Restricted pages: 2441
% Maximum number of contained references: 17577 (List of all three letter acronyms)
% Pages with at least one revert: 211704
% Total number of reverts across all pages: 1147151
% Total time between reverts: 54524346346
% Moved pages: 80332
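The header can be read mechanically. The following is a minimal sketch, not part of the data set's own tooling: it assumes that every header attribute occupies a single line of the form `% name: integer`, optionally followed by a page title in parentheses, as in the listing above. The names `HEADER_RE` and `parse_header_line` are hypothetical.

```python
import re

# Assumed shape of a header line: "% <name>: <integer>", optionally followed
# by " (<page title>)". This pattern is inferred from the listing, not from
# a formal specification of the file.
HEADER_RE = re.compile(
    r"^%\s*(?P<name>[^:]+):\s*(?P<value>\d+)(?:\s*\((?P<page>.*)\))?\s*$"
)


def parse_header_line(line):
    """Return (name, value, page) for a header line, or None if it is not one.

    `page` is None when the line carries no parenthesized page title.
    """
    m = HEADER_RE.match(line.rstrip("\n"))
    if m is None:
        return None
    return m.group("name").strip(), int(m.group("value")), m.group("page")
```

Reading lines until `parse_header_line` first returns `None` is one way to find where the header ends and the per-entry data begins.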

Next comes one line of data for each one of Wikipedia's entries. Here is an example.

A (musical note):1128386876:Mailer diablo:1130566991:MrD9:10:7:18:0:0:0:0:0:0:0:
0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:
0:0:0:0:0:0:0:0:0:0:0:1:1:1:2:2:2:2:2:2:2:2:2:2:2:2:E

Each line contains the following fields.

Entry name
Time of first definition (in seconds since Unix epoch)
Name of the contributor who first defined the entry
Time of first reference (in seconds since Unix epoch)
Name of the contributor who first referenced the entry
Number of references
Number of contributors
Number of revisions
Number of reverts
For each one of the time period bins (72 in this file) the number of references to the entry
The letter "E"

The fields are colon-separated. Colons occurring in the input data are converted to underscores, so a colon in the file always marks a field boundary.
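Because colons never occur inside a field, an entry line can be split directly. The sketch below illustrates the field layout described above; the `Entry` type and `parse_entry` function are hypothetical names, and the code assumes that every numeric field is present as a decimal integer and that the entry occupies a single line in the file.

```python
from typing import List, NamedTuple


class Entry(NamedTuple):
    name: str
    first_defined: int      # seconds since the Unix epoch
    first_definer: str
    first_referenced: int   # seconds since the Unix epoch
    first_referencer: str
    references: int
    contributors: int
    revisions: int
    reverts: int
    bins: List[int]         # references to the entry, one value per time bin


def parse_entry(line, n_bins=72):
    """Parse one colon-separated entry line into an Entry.

    n_bins should match the "Number of bins" value from the file header.
    """
    fields = line.rstrip("\n").split(":")
    # Nine fixed fields, n_bins bin counts, and the closing letter "E".
    if len(fields) != 9 + n_bins + 1 or fields[-1] != "E":
        raise ValueError("unexpected entry layout: %r" % line[:60])
    return Entry(
        name=fields[0],
        first_defined=int(fields[1]),
        first_definer=fields[2],
        first_referenced=int(fields[3]),
        first_referencer=fields[4],
        references=int(fields[5]),
        contributors=int(fields[6]),
        revisions=int(fields[7]),
        reverts=int(fields[8]),
        bins=[int(x) for x in fields[9:9 + n_bins]],
    )
```

Checking for the trailing "E" guards against truncated lines; the two timestamps can be converted with `datetime.fromtimestamp(value, tz=timezone.utc)` if calendar dates are needed.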

Finally come lines summarizing the data set's characteristics, one line for each time period. Here is an example.

2001-07-01 4851 0  27106   15129        13458   531

Each line contains the following fields.

Start date of this period
Number of entries
Number of entries that are stubs
Number of references
Number of referenced articles
Number of undefined entries
Number of active contributors in this period
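A period summary line can be read along the same lines. This sketch assumes, based on the example above, that the seven fields are separated by runs of whitespace and that the start date is in ISO `YYYY-MM-DD` form; `PeriodSummary` and `parse_period` are hypothetical names.

```python
from datetime import date
from typing import NamedTuple


class PeriodSummary(NamedTuple):
    start: date
    entries: int
    stubs: int
    references: int
    referenced_articles: int
    undefined_entries: int
    active_contributors: int


def parse_period(line):
    """Parse one whitespace-separated period summary line."""
    parts = line.split()  # splits on any run of spaces or tabs
    if len(parts) != 7:
        raise ValueError("expected 7 fields, got %d" % len(parts))
    return PeriodSummary(date.fromisoformat(parts[0]), *(int(p) for p in parts[1:]))
```

Applied to the example line, this yields a start date of 1 July 2001 with 4851 entries, none of them stubs.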
