Published March 11, 2022 | Version v1
Dataset Open

Wikipedia Knowledge Graph dataset

  • 1. University of Granada
  • 2. Centre for Science and Technology Studies (CWTS)

Description

Wikipedia is the largest and most widely read free online encyclopedia. It offers a large amount of data on its contents and the interactions around them, as well as several types of open data sources. This makes Wikipedia a unique data source that can be analyzed with quantitative data science techniques. However, the sheer volume of data makes it difficult to get an overview, and many of the analytical possibilities that Wikipedia offers remain unknown. To reduce the complexity of identifying and collecting Wikipedia data and to expand its analytical potential, we collected data from various sources, processed them, and generated a dedicated Wikipedia Knowledge Graph aimed at facilitating the analysis and contextualization of the activity and relations of Wikipedia pages, in this case limited to the English edition. We share this Knowledge Graph dataset openly, aiming for it to be useful to a wide range of researchers, such as informetricians, sociologists, or data scientists.

The dataset comprises 9 files, all in TSV format, built under a relational structure. The core of the dataset is the page file. Four further files hold entities related to the Wikipedia pages (the category, url, pub and page_property files), and four others act as "intermediate tables" that connect pages to those entities and to other pages (the page_category, page_url, page_pub and page_link files).
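As a rough illustration of how an intermediate table links pages to another entity, the sketch below resolves a page_category-style table into category names per page. It uses tiny in-memory stand-ins for the page, category and page_category files. The column names (page_id, title, category_id, name), the sample rows, and the presence of a header row are all assumptions made for this example; the actual schema is documented in the dataset description.

```python
import csv
import io

# Tiny stand-ins for the page, category and page_category files.
# Column names and rows are hypothetical, not taken from the dataset.
PAGE_TSV = "page_id\ttitle\n1\tGranada\n2\tLeiden\n"
CATEGORY_TSV = "category_id\tname\n10\tCities in Spain\n20\tUniversity towns\n"
PAGE_CATEGORY_TSV = "page_id\tcategory_id\n1\t10\n1\t20\n2\t20\n"


def read_tsv(handle):
    """Read a tab-separated file with a header row into a list of dicts."""
    return list(csv.DictReader(handle, delimiter="\t"))


def categories_by_page(pages, categories, page_categories):
    """Resolve the link table into {page title: [category names]}."""
    title_of = {row["page_id"]: row["title"] for row in pages}
    name_of = {row["category_id"]: row["name"] for row in categories}
    result = {title: [] for title in title_of.values()}
    for link in page_categories:
        title = title_of.get(link["page_id"])
        name = name_of.get(link["category_id"])
        # Skip link rows that point at an unknown page or category.
        if title is not None and name is not None:
            result[title].append(name)
    return result


pages = read_tsv(io.StringIO(PAGE_TSV))
categories = read_tsv(io.StringIO(CATEGORY_TSV))
links = read_tsv(io.StringIO(PAGE_CATEGORY_TSV))
page_to_categories = categories_by_page(pages, categories, links)
```

With the real files, each io.StringIO(...) would be replaced by open(path, newline="", encoding="utf-8"). Since several files are multiple gigabytes, loading them into a database or processing them in a streaming fashion is likely more practical than holding whole tables in memory.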

The document Dataset_summary includes a detailed description of the dataset.

Thanks to Nees Jan van Eck and the Centre for Science and Technology Studies (CWTS) for the valuable comments and suggestions.

Files

Dataset_description.pdf

Files (26.9 GB)

Checksum                              Size
md5:af80f50e94b155ba6db0b0a3428fffff  104.9 MB
md5:bb30a6be2696623abd5f0e52a9dfd325  244.0 kB
md5:b2e1bf14defc504c7a2c49e8ef1d4bae  5.9 GB
md5:bee34d51ef697451dd58835cb719fb03  3.8 GB
md5:0358a4ae4feb1de9c5ca7c7c3019d15b  10.1 GB
md5:c099925f71dc8bc8f28c0cf177c2bbf4  1.2 GB
md5:3468f593a0d49cfd47431fc2eea44a6e  61.9 MB
md5:537714c8aad4f286e2cb5880d5512ae8  1.3 GB
md5:75e0bc89d256b040732d25161c442bff  161.0 MB
md5:5cf3858ab1b215b37810db1b5b61ce9a  4.3 GB
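
Because the files are large, it can be worth checking a download against the MD5 checksum listed for it. The following is a minimal sketch using Python's standard hashlib module; the function names md5_of_file and matches_listing are hypothetical helpers introduced here, not part of the dataset or its accompanying software.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops once read() returns empty bytes.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path, listed):
    """Compare a file against a checksum given as 'md5:<hex>' or plain hex."""
    expected = listed.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected
```

Reading in chunks keeps memory use constant even for the multi-gigabyte files. A call would take the local path of a downloaded file and the checksum string shown next to that file on the record page.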

Additional details

Related works

Is compiled by
Software: 10.5281/zenodo.6959429 (DOI)
Is described by
Journal article: 10.1162/qss_a_00226 (DOI)
Preprint: 10.48550/arXiv.2210.13830 (DOI)
Is reviewed by
Software: 10.5281/zenodo.6958973 (DOI)