Dataset Open Access

Wikipedia Citations: A comprehensive dataset of citations with identifiers extracted from English Wikipedia

Harshdeep Singh; Robert West; Giovanni Colavizza

The dataset is composed of 3 parts:

1.  The full dataset contains 29.276 million citations extracted from 35 different citation templates; 3.92 million of these citations already contained identifiers, and approximately 260,752 citations were equipped with identifiers from Crossref. Filename: citations_from_wikipedia.zip

2.  A minimal dataset containing a subset of the columns of the full citations dataset: 'type_of_citation', 'page_title', 'Title', 'ID_list', 'metadata_file', 'updated_identifier'. Filename: minimal_dataset.zip. The 'metadata_file' column refers to the metadata collected from Crossref. The page title together with the citation title can be used to look up a citation in the 'citations_from_wikipedia.zip' dataset and get more information about it (such as author, periodical, chapter).

3.  Citations classified as a journal, together with the corresponding metadata and identifiers extracted from Crossref to make the dataset more complete. Filename: lookup_data.zip. This zip file contains a file, lookup_table.gzip (a Parquet file containing all citations classified as a journal), and a folder, metadata_extracted (containing the Crossref metadata for all the citations listed in the table).
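The lookup described in part 2 (using the page title and the citation title to find the matching record in the full dataset) can be sketched roughly as follows. This is a minimal, standard-library illustration and not the authors' pipeline: the rows are invented toy data, the 'Authors' and 'Periodical' column names and the 'cite journal' value are assumptions, and only 'page_title', 'Title' and 'type_of_citation' are taken from the column list above.

```python
def build_index(full_rows):
    """Index rows of the full dataset by (page_title, Title)."""
    index = {}
    for row in full_rows:
        key = (row["page_title"], row["Title"])
        # Several citations may share the same key, so keep a list per key.
        index.setdefault(key, []).append(row)
    return index


def lookup(index, minimal_row):
    """Return all full-dataset rows matching a row of the minimal dataset."""
    return index.get((minimal_row["page_title"], minimal_row["Title"]), [])


# Toy rows standing in for the real files (hypothetical values and,
# for 'Authors' and 'Periodical', hypothetical column names).
full_rows = [
    {"page_title": "Example page", "Title": "Example article",
     "Authors": "A. Author", "Periodical": "Example Journal"},
    {"page_title": "Other page", "Title": "Another article",
     "Authors": "B. Writer", "Periodical": "Other Journal"},
]
minimal_row = {"page_title": "Example page", "Title": "Example article",
               "type_of_citation": "cite journal"}

index = build_index(full_rows)
matches = lookup(index, minimal_row)
print(matches[0]["Periodical"])
```

With the real files the same idea would more likely be expressed as a join on the two columns in a dataframe library. The dictionary index is used here only to keep the sketch self-contained.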


The data was parsed from the Wikipedia XML content dumps published in May 2020.

The source code for extracting the data and getting started with the pipeline can be found here: https://github.com/Harshdeep1996/cite-classifications-wiki

The taxonomy of the dataset in (1) can be found here: https://github.com/Harshdeep1996/cite-classifications-wiki/wiki/Taxonomy-of-the-parent-dataset

Files (9.5 GB)

Name                          Size      MD5
citations_from_wikipedia.zip  7.1 GB    30c58ba51f4f6aeba39acfa6d6f954f0
lookup_data.zip               858.5 MB  dd4891237183a311f76cb8f2ec9026e5
minimal_dataset.zip           1.5 GB    33ba744ce3a35eb31843b7090a3e5df1
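
Since each file is published with an MD5 checksum, a download can be checked before unpacking. The following is a minimal sketch using only the Python standard library. The function names `md5_of_file` and `verify` are illustrative, and the local paths are whatever the files were saved as.

```python
import hashlib

# Checksums as listed for the three files of this record.
EXPECTED_MD5 = {
    "citations_from_wikipedia.zip": "30c58ba51f4f6aeba39acfa6d6f954f0",
    "lookup_data.zip": "dd4891237183a311f76cb8f2ec9026e5",
    "minimal_dataset.zip": "33ba744ce3a35eb31843b7090a3e5df1",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so multi-gigabyte archives do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, name):
    """Return True if the file at `path` matches the listed checksum for `name`."""
    return md5_of_file(path) == EXPECTED_MD5[name]
```

For example, `verify("lookup_data.zip", "lookup_data.zip")` would return `True` for an intact download saved under its original name in the current directory.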

Cite as