Published September 6, 2017 | Version v1
Dataset Open

Wikipedia time-series graph

Contributors

  • 1. EPFL

Description

Wikipedia temporal graph.

The dataset is based on two Wikipedia SQL dumps: (1) English-language articles and (2) user visit counts per page per hour (also known as pagecounts). The original datasets are publicly available on the Wikimedia website.

The static graph structure is extracted from English-language Wikipedia articles, with redirects removed. Before building the Wikipedia graph, we apply two thresholds: a minimum number of visits per hour and a maximum in-degree. We keep only the pages that reach at least 500 visits per hour at least once during the specified period. In addition, we remove the nodes (pages) with an in-degree higher than 8 000 to build a more meaningful initial graph. After cleaning, the graph contains 116 016 nodes (out of 4 856 639 pages in total) and 6 573 475 edges. The graph can be imported in two ways: (1) from edges.csv and vertices.csv, or (2) from the enwiki-20150403-graph.gt file, which can be opened with the open-source Python library Graph-Tool.
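The two cleaning thresholds can be illustrated with a minimal sketch. It is not the authors' pipeline: the function name, the in-memory data layout, and the choice to count in-degree on the unfiltered graph are all assumptions, and the visit threshold is read as "a page is kept if it reaches the minimum in at least one hour".

```python
from collections import Counter

MIN_PEAK_VISITS = 500   # visits-per-hour threshold quoted in the description
MAX_IN_DEGREE = 8000    # in-degree threshold quoted in the description


def filter_graph(edges, hourly_visits,
                 min_peak_visits=MIN_PEAK_VISITS,
                 max_in_degree=MAX_IN_DEGREE):
    """Return (kept_nodes, kept_edges) after applying both thresholds.

    edges: iterable of (source, target) page-id pairs (hypothetical layout).
    hourly_visits: dict mapping page id -> list of hourly visit counts.
    """
    edges = list(edges)
    # Assumption: in-degree is counted before any page is removed.
    in_degree = Counter(target for _, target in edges)

    kept_nodes = {
        page
        for page, counts in hourly_visits.items()
        if counts and max(counts) >= min_peak_visits  # busy in at least one hour
        and in_degree[page] <= max_in_degree          # not an over-linked hub
    }
    # Keep only edges whose two endpoints both survived.
    kept_edges = [(s, t) for s, t in edges if s in kept_nodes and t in kept_nodes]
    return kept_nodes, kept_edges


# Toy data with a small in-degree limit so the effect is visible.
toy_edges = [(1, 2), (3, 2), (4, 2), (1, 3), (3, 1), (4, 1)]
toy_visits = {1: [10, 700], 2: [900, 900], 3: [600, 5], 4: [3, 4]}
nodes, kept = filter_graph(toy_edges, toy_visits, max_in_degree=2)
print(sorted(nodes))  # [1, 3]
print(kept)           # [(1, 3), (3, 1)]
```

In the toy data, page 2 is dropped for exceeding the in-degree limit and page 4 for never reaching the visit threshold, so only the edges between pages 1 and 3 remain.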

The time-series data contains users' visit counts from 02:00 on 23 September 2014 until 23:00 on 30 April 2015, a total of 5278 hours. The data is stored in two formats: CSV and H5. The CSV file stores records in the format [page_id :: count_views :: layer], where each layer represents one hour. In the H5 file, each layer likewise corresponds to one hour.
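As a rough sketch of how the hourly layers relate to the stated time span, the snippet below recomputes the number of hours and parses a few records. Several details are assumptions rather than facts about the files: that layers are consecutive and start at 0 for the first hour, that the CSV uses commas with no header row, and the helper names `layer_to_time` and `read_series`.

```python
import csv
import io
from datetime import datetime, timedelta

START = datetime(2014, 9, 23, 2)   # first hour named in the description
END = datetime(2015, 4, 30, 23)    # last hour named in the description

# Counting both endpoint hours reproduces the total given in the description.
n_hours = (END - START) // timedelta(hours=1) + 1
print(n_hours)  # 5278


def layer_to_time(layer, first_layer=0):
    """Map a layer index to its hour, assuming consecutive hourly layers."""
    return START + timedelta(hours=layer - first_layer)


def read_series(lines, delimiter=","):
    """Collect {page_id: {layer: count_views}} from three-column rows."""
    series = {}
    for page_id, count_views, layer in csv.reader(lines, delimiter=delimiter):
        series.setdefault(int(page_id), {})[int(layer)] = int(count_views)
    return series


# Made-up sample rows in the order page_id, count_views, layer.
sample = io.StringIO("12,530,0\n12,610,1\n25,720,5277\n")
series = read_series(sample)
print(series[12])           # {0: 530, 1: 610}
print(layer_to_time(5277))  # 2015-04-30 23:00:00
```

If the real file numbers its layers from 1 or uses another delimiter, `first_layer` and `delimiter` would need to be adjusted accordingly.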

Files

Files (6.9 GB)

Size        Checksum
5.1 GB      md5:6793008c445f956258a0c134e6cdf70d
1.4 GB      md5:2b8da9c5818967128e594b9c35544892
109.1 MB    md5:d82a001e4534dbc2b0223117ea38223f
132.8 MB    md5:b39aae812549771360f9908eec7fa829
147.2 MB    md5:83dcc9763d72bf5178fa6b9ab0550bc4
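A downloaded file can be checked against the MD5 values listed above. The following is a minimal sketch using Python's standard `hashlib` module; the helper names `md5_of` and `matches_listed` are illustrative, and the local path passed to them is whatever name the file was saved under.

```python
import hashlib


def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed(path, listed):
    """Compare a file against a listed value such as 'md5:6793...'."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of(path) == expected
```

Reading in chunks keeps memory use flat, which matters for the multi-gigabyte files in this record.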

Additional details

Funding

MacSeNet – Machine Sensing Training Network 642685
European Commission

References