There is a newer version of the record available.

Published November 13, 2022 | Version 2022-11-10
Dataset Open

Cross-language Wikipedia link graph

Description

Wikipedia articles use Wikidata to list the links to the same article in other language versions. Therefore, each Wikipedia language edition stores the Wikidata Q-id for each article.

This dataset constitutes a Wikipedia link graph in which all article identifiers are normalized to Wikidata Q-ids. It contains the normalized links from all Wikipedia language versions. Detailed link count statistics are attached. Note that articles with neither incoming nor outgoing links are not part of this graph.

The format is as follows:

Q-id of linking page (outgoing) <tab> Q-id of linked page (incoming) <tab> language version - dump date (20221101)
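As an illustration of the format, one line can be split into its parts as in the minimal Python sketch below. The names `Link` and `parse_link` are hypothetical and not part of the dataset or of danker. The Q-ids are kept as strings exactly as they appear in the file; in the example entries they are written as bare numbers without the leading "Q".

```python
from typing import NamedTuple


class Link(NamedTuple):
    source: str     # Q-id of the linking page (outgoing)
    target: str     # Q-id of the linked page (incoming)
    wiki: str       # language version, e.g. "zhwiki"
    dump_date: str  # dump date, e.g. "20221101"


def parse_link(line: str) -> Link:
    """Split one tab-separated line of the link file into its four parts."""
    source, target, origin = line.rstrip("\n").split("\t")
    # Split on the last hyphen only, so that a language version such as
    # "nds_nlwiki" (or one containing a hyphen) is returned unchanged.
    wiki, dump_date = origin.rsplit("-", 1)
    return Link(source, target, wiki, dump_date)


print(parse_link("1\t1001051\tzhwiki-20221101"))
# Link(source='1', target='1001051', wiki='zhwiki', dump_date='20221101')
```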

This dataset was used to compute Wikidata PageRank. More information can be found on the danker repository, where the source code of the link extraction as well as the PageRank computation is hosted.
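To give an idea of what a PageRank computation over such a link list involves, here is a small power-iteration sketch in Python. It is not the danker implementation. The function name `pagerank`, the damping factor of 0.85, the fixed number of iterations, and the handling of pages without outgoing links are all assumptions made for illustration. Identical source/target pairs coming from different language versions are collapsed into one link here, which is a simplification of this sketch.

```python
from collections import defaultdict


def pagerank(edges, damping=0.85, iterations=50):
    """Plain power-iteration PageRank over (source, target) pairs."""
    out_links = defaultdict(set)
    nodes = set()
    for source, target in edges:
        out_links[source].add(target)
        nodes.update((source, target))

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by pages without outgoing links is spread over all pages.
        dangling = sum(rank[node] for node in nodes if node not in out_links)
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        for source, targets in out_links.items():
            share = damping * rank[source] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Toy graph with made-up identifiers: "2" and "3" link to "1", "1" links to "2".
toy_edges = [("1", "2"), ("2", "1"), ("3", "1")]
for node, score in sorted(pagerank(toy_edges).items(), key=lambda item: -item[1]):
    print(node, round(score, 4))
```

A dictionary-based version like this only suits small graphs; the full link file is far too large to hold in memory this way.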

Example entries:

bzcat 2022-11-10.allwiki.links.bz2 | head
1	1001051	zhwiki-20221101
1	1001	azbwiki-20221101
1	10022	nds_nlwiki-20221101
1	1005917	ptwiki-20221101
1	10090	guwiki-20221101
1	10090	tawiki-20221101
1	101038	glwiki-20221101
1	101072	idwiki-20221101
1	101072	lvwiki-20221101
1	101072	ndswiki-20221101
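
The same file can also be read in Python without decompressing it to disk first. The sketch below streams the archive with the standard `bz2` module and counts links per language version. The function name `links_per_wiki` is hypothetical, and the code assumes every entry has the three tab-separated columns described above.

```python
import bz2
from collections import Counter


def links_per_wiki(path):
    """Count links per language version while streaming the compressed file."""
    counts = Counter()
    with bz2.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) != 3:
                continue  # skip anything that is not a three-column entry
            # Drop the trailing "-<dump date>" to get the language version.
            wiki = fields[2].rsplit("-", 1)[0]
            counts[wiki] += 1
    return counts


# Usage, assuming the file has been downloaded to the working directory:
# for wiki, count in links_per_wiki("2022-11-10.allwiki.links.bz2").most_common(10):
#     print(wiki, count, sep="\t")
```

Reading line by line keeps memory use small, since only the per-language counters are held in memory.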

Notes

This dataset is a remix of https://dumps.wikimedia.org and is therefore published under the Creative Commons Attribution-ShareAlike 3.0 License.

Files

Total size: 11.3 GB

Name                                  Size      Checksum
2022-11-10.allwiki.links.bz2          11.3 GB   md5:bba0a7f9ab4ed172c8eb89a9632d1f6f
2022-11-10.allwiki.links.stats.txt    9.7 kB    md5:27d073ee6cc994c0ceed35ccabae380c
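
After downloading, a file can be checked against its listed md5 value. The following Python sketch computes the checksum in chunks so that the large archive does not have to fit in memory. The function name `md5_of_file` and the chunk size are assumptions for illustration.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hexadecimal MD5 digest of a file, read in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops once read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Usage, assuming the file has been downloaded to the working directory:
# print(md5_of_file("2022-11-10.allwiki.links.bz2"))
# The printed value should equal the md5 value listed for that file,
# without the "md5:" prefix.
```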

Additional details

Related works

Is compiled by
Software: 10.5281/zenodo.7163272 (DOI)
Is supplemented by
Conference paper: 10.1007/978-3-319-47602-5_41 (DOI)