Wikipedia Citations: A comprehensive dataset of citations with identifiers extracted from English Wikipedia
Description
The dataset is composed of 3 parts:
1. The full dataset of 29.276 million citations extracted from 35 different citation templates, of which 3.92 million already contained identifiers; identifiers for approximately 260,752 further citations were retrieved from Crossref. This is under the filename: citations_from_wikipedia.zip
2. A minimal dataset containing a subset of the columns from (1): 'type_of_citation', 'page_title', 'Title', 'ID_list', 'metadata_file', 'updated_identifier'. This is under the filename: minimal_dataset.zip. The 'metadata_file' column points to the metadata collected from Crossref, and the page title together with the citation title can be used to look up a citation in citations_from_wikipedia.zip and retrieve additional fields (such as author, periodical, chapter).
3. Citations classified as journal citations, together with the corresponding metadata/identifiers extracted from Crossref to make the dataset more complete. This is under the filename: lookup_data.zip. The zip file contains lookup_table.gzip (a Parquet file listing all citations classified as journals) and a folder, metadata_extracted, containing the Crossref metadata for every citation listed in the table.
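The minimal dataset in (2) can be joined back to the full dataset in (1) on the page title and citation title to recover the richer fields. A minimal pandas sketch of that join, using synthetic in-memory rows (the column names come from the description above; the sample values and the exact file layout inside the zips are invented for illustration):

```python
import pandas as pd

# Stand-in for the minimal dataset, with the columns listed above.
# Sample values are placeholders, not real dataset rows.
minimal = pd.DataFrame({
    'type_of_citation': ['cite journal'],
    'page_title': ['Example article'],
    'Title': ['A study of something'],
    'ID_list': ["{'DOI': '10.1000/xyz'}"],
    'metadata_file': ['metadata_extracted/0001.json'],
    'updated_identifier': ['10.1000/xyz'],
})

# Stand-in for the full dataset; in practice this would be loaded
# from citations_from_wikipedia.zip.
full = pd.DataFrame({
    'page_title': ['Example article'],
    'Title': ['A study of something'],
    'Authors': ['Doe, J.'],
    'Periodical': ['Journal of Examples'],
})

# Join on (page_title, Title) to attach author/periodical information
# to each minimal-dataset citation.
enriched = minimal.merge(full, on=['page_title', 'Title'], how='left')
print(enriched[['Title', 'Authors', 'Periodical']])
```

The `how='left'` join keeps every citation from the minimal dataset even when no match is found in the full dataset.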
The data was parsed from the Wikipedia XML content dumps published in May 2020.
The source code for extracting the citations, along with instructions for getting started with the pipeline, can be found here: https://github.com/Harshdeep1996/cite-classifications-wiki
The taxonomy of the dataset in (1) can be found here: https://github.com/Harshdeep1996/cite-classifications-wiki/wiki/Taxonomy-of-the-parent-dataset