Wikipedia time-series graph
Description
Wikipedia temporal graph.
The dataset is based on two Wikipedia dumps: (1) the English-language articles (SQL dump) and (2) hourly user visit counts per page (a.k.a. pagecounts). The original datasets are publicly available on the Wikimedia website.
The static graph structure is extracted from the English-language Wikipedia articles, with redirects removed. Before building the Wikipedia graph we apply two thresholds: a minimum number of visits per hour and a maximum in-degree. We remove pages that never reach 500 visits in a single hour during the specified period, and we remove nodes (pages) with in-degree higher than 8 000 to obtain a more meaningful initial graph. After cleaning, the graph contains 116 016 nodes (out of 4 856 639 pages in total) and 6 573 475 edges. The graph can be imported in two ways: (1) from edges.csv and vertices.csv, or (2) from the enwiki-20150403-graph.gt file, which can be opened with the open-source Python library graph-tool.
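A minimal sketch of both import paths, assuming graph-tool is installed and that the first two columns of edges.csv are source and target node indices (the exact CSV column layout should be checked against the files themselves):

```python
import graph_tool.all as gt
import pandas as pd

# Option 1: load the ready-made graph-tool binary file.
g = gt.load_graph("enwiki-20150403-graph.gt")
print(g.num_vertices(), g.num_edges())  # expected: 116 016 nodes, 6 573 475 edges

# Option 2: rebuild the graph from the CSV files.
# Column positions below are assumptions; inspect the CSV headers first.
vertices = pd.read_csv("vertices.csv")
edges = pd.read_csv("edges.csv")

g2 = gt.Graph(directed=True)
g2.add_vertex(len(vertices))
g2.add_edge_list(edges.iloc[:, :2].values)  # first two columns as (source, target)
```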
The time-series data contains users' visit counts from 02:00, 23 September 2014 until 23:00, 30 April 2015, for a total of 5278 hours. The data is stored in two formats: CSV and H5. The CSV file stores records in the format [page_id :: count_views :: layer], where layer is an hour index. In the H5 file, each layer likewise corresponds to one hour.
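A hedged sketch for loading the hourly counts from the CSV, assuming "::" is the literal field separator implied by the format above and that there is no header row (the actual delimiter and file name should be verified against the download):

```python
import pandas as pd

# Assumed record layout: page_id :: count_views :: layer, one record per line.
counts = pd.read_csv(
    "pagecounts.csv",                 # hypothetical file name; use the CSV shipped with the dataset
    sep=r"\s*::\s*",
    engine="python",                  # multi-character separators need the python engine
    names=["page_id", "count_views", "layer"],
    header=None,
)

# Total traffic per hourly layer; the earliest layer corresponds to 02:00, 23 September 2014.
hourly_totals = counts.groupby("layer")["count_views"].sum()
print(hourly_totals.head())
```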
Files

The dataset comprises five files, 6.9 GB in total: edges.csv, vertices.csv, enwiki-20150403-graph.gt, and the CSV and H5 time-series files described above. Their sizes and MD5 checksums are:

Size | MD5 |
---|---|
5.1 GB | 6793008c445f956258a0c134e6cdf70d |
1.4 GB | 2b8da9c5818967128e594b9c35544892 |
109.1 MB | d82a001e4534dbc2b0223117ea38223f |
132.8 MB | b39aae812549771360f9908eec7fa829 |
147.2 MB | 83dcc9763d72bf5178fa6b9ab0550bc4 |
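To verify a download against the checksums above, a small Python sketch (the file name is only an example):

```python
import hashlib

def md5sum(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

print(md5sum("enwiki-20150403-graph.gt"))  # compare with the value listed in the table
```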
Additional details
References
- Benzi, Kirell Maël. "From recommender systems to spatio-temporal dynamics with network science." PhD thesis, EPFL, 2017. https://infoscience.epfl.ch/record/225308?ln=fr