
The Collaborative Organization of Knowledge: Data Set

Spinellis, Diomidis; Louridas, Panos

Wikipedia is an ongoing endeavor to create a free encyclopedia through an open computer-mediated collaborative effort. How does Wikipedia grow and maintain its coverage? This page contains supporting material for a publication that examines this question.

  • Diomidis Spinellis and Panagiotis Louridas. The collaborative organization of knowledge. Communications of the ACM, 51(8):68–73, August 2008. (doi:10.1145/1378704.1378720)

In the above paper, a longitudinal study of Wikipedia's evolution shows that although Wikipedia's scope is increasing, its coverage is not deteriorating. This can be explained by the fact that referring to a nonexistent entry typically leads to the establishment of an article for it. Wikipedia's evolution also demonstrates the creation of a large real-world scale-free graph through a combination of incremental growth and preferential attachment.

Through this data set you can download the processed results. The file starts with a header giving various attributes of the processed data set.

% Number of bins: 72
% Total revisions: 28247658
% Maximum revisions: 28273 (George W. Bush)
% Maximum reverts: 9218 (George W. Bush)
% Number of moves: 81380
% Total pages: 1898139
% Revisions from IP addresses: 8518913
% Total contributors: 230130
% Maximum different contributors: 2539 (George W. Bush)
% Redirected pages: 631567
% Restricted pages: 2441
% Maximum number of contained references: 17577 (List of all three letter acrony
% Pages with at least one revert: 211704
% Total number of reverts across all pages: 1147151
% Total time between reverts: 54524346346
% Moved pages: 80332
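
The header lines shown above can be read with a few lines of Python. This is a minimal sketch, assuming every header line has the form `% name: integer`, optionally followed by a parenthesized note such as a page title; the function name `parse_header_line` and the returned tuple layout are illustrative and not part of the data set.

```python
import re
from typing import Optional, Tuple

# Assumed header shape: "% <name>: <integer>" with an optional "(note" suffix.
# The closing parenthesis is optional because long notes may be cut off.
HEADER_RE = re.compile(
    r"^%\s*(?P<key>[^:]+):\s*(?P<value>\d+)(?:\s+\((?P<note>.*))?$"
)


def parse_header_line(line: str) -> Optional[Tuple[str, int, Optional[str]]]:
    """Return (name, value, note) for a header line, or None if it is not one."""
    match = HEADER_RE.match(line.strip())
    if match is None:
        return None
    note = match.group("note")
    if note is not None and note.endswith(")"):
        note = note[:-1]  # drop the closing parenthesis when it is present
    return match.group("key").strip(), int(match.group("value")), note


if __name__ == "__main__":
    print(parse_header_line("% Number of bins: 72"))
    print(parse_header_line("% Maximum revisions: 28273 (George W. Bush)"))
```

Lines that do not start with `%` yield `None`, so the same function can be used to detect where the header ends.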

Next comes one line of data for each one of Wikipedia's entries. Here is an example.

A (musical note):1128386876:Mailer diablo:1130566991:MrD9:10:7:18:0:0:0:0:0:0:0:

Each line contains the following fields.

  • Entry name
  • Time of first definition (in seconds since Unix epoch)
  • Name of the contributor who first defined the entry
  • Time of first reference (in seconds since Unix epoch)
  • Name of the contributor who first referenced the entry
  • Number of references
  • Number of contributors
  • Number of revisions
  • Number of reverts
  • For each one of the time period bins (72 in this file) the number of references to the entry
  • The letter "E"

The fields are colon-separated. Each colon in the input data is converted to an underscore.
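
The field list above can be turned into a small parser. This is a sketch under stated assumptions: all numeric fields are present and hold integers, the line carries exactly one count per time period bin, and the final field is the letter "E". The names `Entry` and `parse_entry_line` are hypothetical, and the sample line in the usage code is synthetic, using 3 bins instead of 72 for brevity.

```python
from dataclasses import dataclass
from typing import List

FIXED_FIELDS = 9  # fields that precede the per-bin reference counts


@dataclass
class Entry:
    name: str
    first_defined: int      # seconds since the Unix epoch
    defined_by: str
    first_referenced: int   # seconds since the Unix epoch
    referenced_by: str
    references: int
    contributors: int
    revisions: int
    reverts: int
    refs_per_bin: List[int]


def parse_entry_line(line: str, num_bins: int = 72) -> Entry:
    """Parse one colon-separated entry line (assumed layout, see lead-in)."""
    # Splitting on ":" is safe because colons in the data became underscores.
    fields = line.rstrip("\n").split(":")
    if len(fields) != FIXED_FIELDS + num_bins + 1 or fields[-1] != "E":
        raise ValueError(f"unexpected entry line layout: {line!r}")
    bins = [int(f) for f in fields[FIXED_FIELDS:FIXED_FIELDS + num_bins]]
    return Entry(
        name=fields[0],
        first_defined=int(fields[1]),
        defined_by=fields[2],
        first_referenced=int(fields[3]),
        referenced_by=fields[4],
        references=int(fields[5]),
        contributors=int(fields[6]),
        revisions=int(fields[7]),
        reverts=int(fields[8]),
        refs_per_bin=bins,
    )


if __name__ == "__main__":
    # Synthetic line with 3 bins; real lines in this file would carry 72.
    sample = "A (musical note):1128386876:Mailer diablo:1130566991:MrD9:10:7:18:0:0:4:6:E"
    print(parse_entry_line(sample, num_bins=3))
```

The number of bins is a parameter because it is reported in the file header rather than fixed by the format.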

Finally come lines summarizing the data set's characteristics for each time period. Here is an example.

2001-07-01 4851 0  27106   15129        13458   531

Each line contains the following fields.

  • Start date of this period
  • Number of entries
  • Number of entries that are stubs
  • Number of references
  • Number of referenced articles
  • Number of undefined entries
  • Number of active contributors in this period
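
The per-period summary lines can be read in a similar way. This is a minimal sketch, assuming the seven fields are separated by runs of whitespace and that the start date is in ISO `YYYY-MM-DD` form, as in the example line; the names `PeriodSummary` and `parse_summary_line` are illustrative.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class PeriodSummary:
    start: date
    entries: int
    stubs: int
    references: int
    referenced_articles: int
    undefined_entries: int
    active_contributors: int


def parse_summary_line(line: str) -> PeriodSummary:
    """Parse one whitespace-separated summary line (assumed layout)."""
    fields = line.split()  # collapses the variable-width runs of spaces
    if len(fields) != 7:
        raise ValueError(f"expected 7 fields, got {len(fields)}: {line!r}")
    counts = [int(f) for f in fields[1:]]
    return PeriodSummary(date.fromisoformat(fields[0]), *counts)


if __name__ == "__main__":
    example = "2001-07-01 4851 0  27106   15129        13458   531"
    print(parse_summary_line(example))
```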