A Comprehensive Dataset of Citations with Identifiers from English Wikipedia
Description
The dataset is composed of 3 parts:
1. The main dataset of 23.8 million citations extracted from 35 different citation templates; 3.14 million of these citations already contained identifiers, and approximately 2.15 million more were matched with identifiers from Crossref. This is under the filename: citations_from_wikipedia.zip
2. An example subset with the features for the classifier. This is under the filename: subset_of_citations_features.zip
3. Citations classified as journal citations, together with the corresponding metadata/identifiers extracted from Crossref to make the dataset more complete. This is under the filename: lookup_data.zip. The archive contains lookup_table.gzip (despite the extension, a Parquet file listing all citations classified as journal citations) and a folder metadata_extracted (the Crossref metadata for all citations mentioned in the table).
The data was parsed from the Wikipedia XML content dumps published in October 2018.
The source code for extracting the citations and running the pipeline can be found here: https://github.com/Harshdeep1996/cite-classifications-wiki
The taxonomy of the dataset in (1) can be found here: https://github.com/Harshdeep1996/cite-classifications-wiki/wiki/Taxonomy-of-the-parent-dataset
Files
citations_from_wikipedia.zip (6.1 GB)

| Size | MD5 |
|---|---|
| 5.4 GB | md5:b6d4c1739f64a66f8c39ac24eba11234 |
| 679.1 MB | md5:7d00aa380eb5439af04a629bc943e701 |
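After downloading, the archives can be verified against the MD5 checksums above. A small streaming helper (standard library only, so large files are not loaded into memory at once) might look like this; the filename in the comment is the one from this record:

```python
import hashlib


def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, streaming in 1 MiB chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


# Example check against the table above:
# md5_of("citations_from_wikipedia.zip") == "b6d4c1739f64a66f8c39ac24eba11234"
```

Streaming in chunks keeps memory use constant, which matters for multi-gigabyte archives like these.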