Dataset Open Access

The Collaborative Organization of Knowledge: Data Set

Spinellis, Diomidis; Louridas, Panos

Dublin Core Export

<?xml version='1.0' encoding='utf-8'?>
<oai_dc:dc xmlns:dc="" xmlns:oai_dc="" xmlns:xsi="" xsi:schemaLocation="">
  <dc:creator>Spinellis, Diomidis</dc:creator>
  <dc:creator>Louridas, Panos</dc:creator>
  <dc:description>Wikipedia is an ongoing endeavor to create a free encyclopedia through an open computer-mediated collaborative effort. How does Wikipedia grow and maintain its coverage? This page contains supporting material relevant to a publication that examines this question.

	Diomidis Spinellis and Panagiotis Louridas. The collaborative organization of knowledge. Communications of the ACM, 51(8):68–73, August 2008. (doi:10.1145/1378704.1378720)

In the above paper, a longitudinal study of Wikipedia's evolution shows that although Wikipedia's scope is increasing, its coverage is not deteriorating. This can be explained by the fact that referring to a non-existent entry typically leads to the establishment of an article for it. Wikipedia's evolution also demonstrates the creation of a large real-world scale-free graph through a combination of incremental growth and preferential attachment.

Through this data set you can download the processed results. The file starts with a header giving various attributes of the processed data set.

% Number of bins: 72
% Total revisions: 28247658
% Maximum revisions: 28273 (George W. Bush)
% Maximum reverts: 9218 (George W. Bush)
% Number of moves: 81380
% Total pages: 1898139
% Revisions from IP addresses: 8518913
% Total contributors: 230130
% Maximum different contributors: 2539 (George W. Bush)
% Redirected pages: 631567
% Restricted pages: 2441
% Maximum number of contained references: 17577 (List of all three letter acrony
% Pages with at least one revert: 211704
% Total number of reverts across all pages: 1147151
% Total time between reverts: 54524346346
% Moved pages: 80332
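
The header lines above can be read with a small parser. The following is a minimal Python sketch, not part of the original data set. It assumes that every header line starts with "%" and separates the attribute name from its value at the first colon. The function names parse_header and leading_int are hypothetical.

```python
def parse_header(lines):
    """Collect '% name: value' header lines into a dict of strings."""
    attributes = {}
    for line in lines:
        if not line.startswith("%"):
            break  # assumption: the header ends at the first line without '%'
        name, _, value = line[1:].partition(":")
        attributes[name.strip()] = value.strip()
    return attributes


def leading_int(value):
    """Return the integer that starts a value such as '28273 (George W. Bush)'."""
    return int(value.split()[0])


# A few of the header lines shown above, used as sample input.
header = parse_header([
    "% Number of bins: 72",
    "% Total revisions: 28247658",
    "% Maximum revisions: 28273 (George W. Bush)",
    "% Revisions from IP addresses: 8518913",
])

# Example of a derived figure: share of revisions made from IP addresses.
ip_share = (leading_int(header["Revisions from IP addresses"])
            / leading_int(header["Total revisions"]))
```

Values are kept as strings because some of them carry a page title in parentheses after the number; leading_int extracts the numeric part when it is needed.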

Next comes one line of data for each one of Wikipedia's entries. Here is an example.

A (musical note):1128386876:Mailer diablo:1130566991:MrD9:10:7:18:0:0:0:0:0:0:0:

Each line contains the following fields.

	Entry name
	Time of first definition (in seconds since Unix epoch)
	Name of the contributor who first defined the entry
	Time of first reference (in seconds since Unix epoch)
	Name of the contributor who first referenced the entry
	Number of references
	Number of contributors
	Number of revisions
	Number of reverts
	For each one of the time period bins (72 in this file) the number of references to the entry
	The letter "E"

The fields are colon-separated. Colons in the input data are converted to underscores.
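
The field list above can be turned into a record type along the following lines. This is a sketch under stated assumptions rather than the authors' own tooling: it assumes that all numeric fields are present and non-empty, and that the letter "E" is the last field on the line. The names Entry and parse_entry are hypothetical, and the sample line is a shortened, made-up variant of the example with only three bins.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    name: str
    defined_at: int          # seconds since Unix epoch
    defined_by: str
    referenced_at: int       # seconds since Unix epoch
    referenced_by: str
    references: int
    contributors: int
    revisions: int
    reverts: int
    references_per_bin: list  # one count per time period bin


def parse_entry(line):
    """Split one colon-separated entry line into an Entry record."""
    fields = line.rstrip("\n").split(":")
    if fields[-1] != "E":
        raise ValueError("entry line does not end with the letter E")
    head, bins = fields[:9], fields[9:-1]
    return Entry(
        head[0], int(head[1]), head[2], int(head[3]), head[4],
        int(head[5]), int(head[6]), int(head[7]), int(head[8]),
        [int(count) for count in bins],
    )


# Hypothetical shortened line: nine leading fields, three bins, then "E".
sample = "A (musical note):1128386876:Mailer diablo:1130566991:MrD9:10:7:18:0:0:0:0:E"
entry = parse_entry(sample)
```

Because colons inside names are replaced in the data, a plain split on ":" is enough here. For a real file the length of references_per_bin can be checked against the "Number of bins" value from the header.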

Finally come lines summarizing the data set's characteristics for each time period. Here is an example.

2001-07-01 4851 0  27106   15129        13458   531

Each line contains the following fields.

	Start date of this period
	Number of entries
	Number of entries that are stubs
	Number of references
	Number of referenced articles
	Number of undefined entries
	Number of active contributors in this period
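
A period summary line can be read in a similar way. The sketch below assumes that the fields are separated by runs of whitespace and that the start date uses the YYYY-MM-DD form seen in the example. The function name parse_period and the dictionary keys are hypothetical labels for the fields listed above.

```python
from datetime import date

PERIOD_KEYS = (
    "entries",
    "stubs",
    "references",
    "referenced_articles",
    "undefined_entries",
    "active_contributors",
)


def parse_period(line):
    """Turn one whitespace-separated summary line into a dict."""
    start, *counts = line.split()
    if len(counts) != len(PERIOD_KEYS):
        raise ValueError("expected a start date followed by six counts")
    record = {"start": date.fromisoformat(start)}
    record.update(zip(PERIOD_KEYS, map(int, counts)))
    return record


# The example line shown above.
period = parse_period("2001-07-01 4851 0  27106   15129        13458   531")
```

Calling split() without arguments collapses the irregular spacing between columns, so the varying widths in the example do not need special handling.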
</dc:description>
  <dc:subject>Replication package</dc:subject>
  <dc:title>The Collaborative Organization of Knowledge: Data Set</dc:title>