Published November 6, 2024 | Version 2024-11-06
Dataset Open

Cross-language Wikipedia link graph

Description

Wikipedia articles use Wikidata to list the links to the same article in other language versions. Therefore, each Wikipedia language edition stores the Wikidata Q-id for each article.

This dataset constitutes a Wikipedia link graph in which all article identifiers are normalized to Wikidata Q-ids. It contains the normalized links from all Wikipedia language versions. Detailed link count statistics are attached. Note that articles that have neither incoming nor outgoing links are not part of this graph.

The format is as follows:

Q-id of linking page (outgoing) <tab> Q-id of linked page (incoming) <tab> language version - dump date (20241101)
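
A minimal Python sketch of parsing one line in this format. The names `Link` and `parse_line` are hypothetical. The sketch assumes, based on the example entries, that Q-ids are stored as bare integers without the `Q` prefix and that the third field ends in a hyphen followed by the dump date.

```python
from typing import NamedTuple


class Link(NamedTuple):
    source: str     # Q-id of the linking page, e.g. "Q1"
    target: str     # Q-id of the linked page, e.g. "Q107"
    wiki: str       # language version, e.g. "ckbwiki"
    dump_date: str  # dump date, e.g. "20241101"


def parse_line(line: str) -> Link:
    """Parse one tab-separated line of the link graph (hypothetical helper)."""
    source, target, version = line.rstrip("\n").split("\t")
    # Split on the last hyphen only: names such as "bat_smgwiki" in the
    # example entries use underscores, so the final hyphen is assumed
    # to separate the language version from the dump date.
    wiki, dump_date = version.rsplit("-", 1)
    # Assumption: the file omits the "Q" prefix, so it is added back here.
    return Link(f"Q{source}", f"Q{target}", wiki, dump_date)


print(parse_line("1\t107\tckbwiki-20241101"))
```

Returning a named tuple keeps the three columns self-describing when the parsed links are passed on to later processing steps.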

This dataset was used to compute Wikidata PageRank. More information is available in the danker repository (https://github.com/athalhammer/danker), which hosts the source code for both the link extraction and the PageRank computation.
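
To illustrate the kind of computation involved, the following is a generic power-iteration PageRank over (source, target) pairs. It is a textbook sketch, not the danker implementation. The function name `pagerank`, the damping factor of 0.85, the fixed iteration count, and the choice to collapse duplicate links between the same pair of Q-ids are all assumptions made for illustration.

```python
from collections import defaultdict


def pagerank(edges, damping=0.85, iterations=40):
    """Textbook power-iteration PageRank on (source, target) pairs."""
    out_links = defaultdict(set)
    nodes = set()
    for source, target in edges:
        # Assumption: a set collapses the same link seen in several
        # language versions into a single edge.
        out_links[source].add(target)
        nodes.update((source, target))

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing links is spread evenly.
        dangling = sum(rank[node] for node in nodes if node not in out_links)
        base = (1.0 - damping) / n + damping * dangling / n if n else 0.0
        new_rank = {node: base for node in nodes}
        for source, targets in out_links.items():
            share = damping * rank[source] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Toy graph: the edges are made up for illustration, not taken from the data.
toy_edges = [(1, 107), (1, 111), (107, 1), (111, 107)]
ranks = pagerank(toy_edges)
for node, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(f"Q{node}\t{score:.4f}")
```

On the full graph an in-memory dictionary like this may not be practical; the sketch only shows the shape of the iteration.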

Example entries:

$ bzcat 2024-11-06.allwiki.links.bz2 | head

1    107    ckbwiki-20241101
1    107    lawiki-20241101
1    107    ltwiki-20241101
1    107    tewiki-20241101
1    107    wuuwiki-20241101
1    111    hywwiki-20241101
1    11379    bat_smgwiki-20241101
1    11471    cdowiki-20241101
1    150    ckbwiki-20241101
1    150    lowiki-20241101
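
The same inspection can be sketched in Python with the standard-library `bz2` module, which streams the archive instead of decompressing it to disk. The helper name `count_links_per_wiki` is hypothetical. The demo writes a tiny stand-in file built from three of the example entries so that it runs without downloading the full archive.

```python
import bz2
import itertools
import os
import tempfile
from collections import Counter


def count_links_per_wiki(path, limit=None):
    """Count links per language version, reading at most `limit` lines."""
    counts = Counter()
    with bz2.open(path, mode="rt", encoding="utf-8") as handle:
        # islice with limit=None reads the whole file.
        for line in itertools.islice(handle, limit):
            _source, _target, version = line.rstrip("\n").split("\t")
            # Drop the trailing "-<dump date>" to get the language version.
            counts[version.rsplit("-", 1)[0]] += 1
    return counts


# Stand-in data: three lines copied from the example entries.
sample = (
    "1\t107\tckbwiki-20241101\n"
    "1\t107\tlawiki-20241101\n"
    "1\t150\tckbwiki-20241101\n"
)
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "sample.links.bz2")
    with bz2.open(demo_path, mode="wt", encoding="utf-8") as handle:
        handle.write(sample)
    demo_counts = count_links_per_wiki(demo_path)

print(demo_counts)
```

For the real file, pass the path to `2024-11-06.allwiki.links.bz2`; setting `limit=10` mirrors the `head` call above.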

Notes

This dataset is a remix of the Wikimedia dumps (https://dumps.wikimedia.org) and is therefore published under the Creative Commons Attribution-ShareAlike 3.0 License.

Files (12.6 GB)

2024-11-06.allwiki.links.bz2 (12.6 GB)
md5:933531f38a62297d38660a66166cf37c
2024-11-06.allwiki.links.stats.txt (10.3 kB)
md5:837a53dc23c17b0587c4d60822bb39a6

Additional details

Related works

Is compiled by: Software, DOI 10.5281/zenodo.7163272
Is supplemented by: Conference paper, DOI 10.1007/978-3-319-47602-5_41

Software

Repository URL: https://github.com/athalhammer/danker
Programming language: Python, Shell
Development Status: Active