Dataset Open Access

Wikipedia time-series graph

Benzi Kirell; Miz Volodymyr; Ricaud Benjamin; Vandergheynst Pierre

Thesis supervisor(s)

Vandergheynst Pierre

Wikipedia temporal graph.

The dataset is based on two Wikipedia SQL dumps: (1) English-language articles and (2) user visit counts per page per hour (also known as pagecounts). The original datasets are publicly available on the Wikimedia website.

The static graph structure is extracted from English-language Wikipedia articles, with redirects removed. Before building the Wikipedia graph, we apply thresholds on the minimum number of visits per hour and on the maximum in-degree. We keep only the pages that reach at least 500 visits per hour at least once during the specified period. In addition, we remove the nodes (pages) with an in-degree higher than 8 000 to build a more meaningful initial graph. After cleaning, the graph contains 116 016 nodes (out of 4 856 639 pages in total) and 6 573 475 edges. The graph can be imported in two ways: (1) using edges.csv and vertices.csv, or (2) using a file that can be opened with the open-source Python library Graph-Tool.
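The two cleaning thresholds can be illustrated with a minimal sketch on toy data. This is not the authors' pipeline: the function name `filter_graph`, its arguments, and the input layout (a list of `(source, target)` pairs plus a dict of hourly counts per page) are hypothetical. It also assumes that the visit threshold keeps a page when it reaches the minimum in at least one hour, and that in-degree is measured before any page is dropped.

```python
from collections import Counter


def filter_graph(edges, hourly_visits, min_visits=500, max_in_degree=8000):
    """Apply both thresholds and return (kept_pages, kept_edges).

    edges: iterable of (source, target) page-id pairs, redirects already removed.
    hourly_visits: dict mapping page id -> list of hourly visit counts.
    Both input layouts are assumptions made for this illustration.
    """
    edges = list(edges)
    # Assumption: in-degree is counted on the full graph, before filtering.
    in_degree = Counter(target for _, target in edges)
    kept = {
        page
        for page, counts in hourly_visits.items()
        # Keep pages that reach min_visits in at least one hour
        # and whose in-degree does not exceed max_in_degree.
        if counts and max(counts) >= min_visits
        and in_degree[page] <= max_in_degree
    }
    # Keep only edges whose two endpoints both survived the filter.
    kept_edges = [(s, t) for s, t in edges if s in kept and t in kept]
    return kept, kept_edges
```

With the defaults, the thresholds match the values quoted above (500 visits per hour, in-degree of 8 000); smaller values are convenient for trying the function on a handful of pages.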

The time-series data contain users' visit counts from 02:00 on 23 September 2014 until 23:00 on 30 April 2015, a total of 5278 hours. The data are stored in two formats: CSV and H5. Each row of the CSV file has the format [page_id :: count_views :: layer], where layer represents an hour. In the H5 file, each layer corresponds to an hour as well.
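The hour count can be checked from the two endpoints, and a layer index can be turned back into a timestamp. The sketch below assumes that layers are numbered from 0, with layer 0 corresponding to 02:00 on 23 September 2014, and that consecutive layers are exactly one hour apart; neither is stated in the description, and the names `START`, `END`, `total_hours` and `layer_to_timestamp` are hypothetical.

```python
from datetime import datetime, timedelta

# Endpoints of the recording period, as given in the description.
START = datetime(2014, 9, 23, 2, 0)
END = datetime(2015, 4, 30, 23, 0)


def total_hours(start=START, end=END):
    """Number of hourly layers, counting both the first and the last hour."""
    return (end - start) // timedelta(hours=1) + 1


def layer_to_timestamp(layer, start=START):
    """Map a layer index to its hour (assumption: layer 0 is the first hour)."""
    return start + timedelta(hours=layer)


print(total_hours())            # 5278
print(layer_to_timestamp(0))    # 2014-09-23 02:00:00
print(layer_to_timestamp(5277)) # 2015-04-30 23:00:00
```

If the files turn out to number layers from 1, subtracting one from the index before calling `layer_to_timestamp` gives the same mapping.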

Files (6.9 GB)
Size
5.1 GB
1.4 GB
109.1 MB
132.8 MB
147.2 MB
  • Benzi, Kirell Maël. "From recommender systems to spatio-temporal dynamics with network science." (2017).


                   All versions   This version
Views              401            402
Downloads          2,096          2,097
Data volume        1.8 TB         1.8 TB
Unique views       340            341
Unique downloads   1,793          1,794

