There is a newer version of the record available.

Published January 14, 2020 | Version 0.1
Dataset Open

A Comprehensive Dataset of Citations with Identifiers from English Wikipedia

  • 1. EPFL
  • 2. University of Amsterdam

Description

The dataset is composed of 3 parts:

1.  A dataset of 23.8 million citations from 35 different citation templates, of which 3.14 million citations already contained identifiers and approximately 2.15 million citations were equipped with identifiers from Crossref. This is under the filename: citations_from_wikipedia.zip

2.  An example subset with the features for the classifier. This is under the filename: subset_of_citations_features.zip

3.  Citations classified as a journal, together with their corresponding metadata/identifiers extracted from Crossref to make the dataset more complete. This is under the filename: lookup_data.zip. This zip file contains a file, lookup_table.gzip (a parquet file containing all citations classified as a journal), and a folder, metadata_extracted (containing the metadata from Crossref for all the citations mentioned in the table).
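
As a rough illustration of how the lookup table in (3) might be opened, the sketch below locates lookup_table.gzip inside lookup_data.zip and reads it as parquet with pandas. The helper names `find_member` and `load_lookup_table` are hypothetical, the internal folder layout of the zip is an assumption (the file is searched for by base name for that reason), and pandas with a parquet engine such as pyarrow is assumed to be installed.

```python
import io
import zipfile
from pathlib import PurePosixPath


def find_member(archive: zipfile.ZipFile, basename: str) -> str:
    """Return the path inside the archive of the first entry named `basename`."""
    # The folder layout inside the zip is an assumption, so match on base name.
    for name in archive.namelist():
        if PurePosixPath(name).name == basename:
            return name
    raise FileNotFoundError(f"{basename} not found in archive")


def load_lookup_table(zip_path: str, basename: str = "lookup_table.gzip"):
    """Read the parquet lookup table from the zip into a pandas DataFrame."""
    # Imported lazily; requires pandas plus a parquet engine (e.g. pyarrow).
    import pandas as pd

    with zipfile.ZipFile(zip_path) as archive:
        member = find_member(archive, basename)
        # Parquet readers need a seekable file, so buffer the member in memory.
        buffer = io.BytesIO(archive.read(member))
    return pd.read_parquet(buffer)
```

With the archive downloaded to the working directory, `load_lookup_table("lookup_data.zip")` would return the table; its column names are not listed on this page, so inspecting `df.columns` first is a sensible starting point.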


The data was parsed from the Wikipedia XML content dumps published in October 2018.

The source code for extracting the data, along with instructions for getting started with the pipeline, can be found here: https://github.com/Harshdeep1996/cite-classifications-wiki

The taxonomy of the dataset in (1) can be found here: https://github.com/Harshdeep1996/cite-classifications-wiki/wiki/Taxonomy-of-the-parent-dataset

Files

citations_from_wikipedia.zip

Files (6.1 GB)

Checksum Size
md5:b6d4c1739f64a66f8c39ac24eba11234 5.4 GB
md5:7d00aa380eb5439af04a629bc943e701 679.1 MB
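
The MD5 checksums listed above can be used to check that a download completed intact. The following is a minimal sketch using only the Python standard library; the helper names `md5_of_file` and `verify_md5` are hypothetical, and since this page does not show which checksum belongs to which file, the pairing of a file with its checksum is left to the reader.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Chunked reads keep memory use low for multi-gigabyte archives.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_md5(path: str, expected: str) -> bool:
    """Compare a file's MD5 against a listed value such as 'md5:<hex>'."""
    # Accept the checksum with or without the 'md5:' prefix used on this page.
    expected_hex = expected.lower().removeprefix("md5:")
    return md5_of_file(path) == expected_hex
```

For example, `verify_md5("citations_from_wikipedia.zip", "md5:...")` would return True if the digest of the downloaded file matches the value passed in, where the second argument is the checksum listed for that file.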