Published July 14, 2020 | Version 0.2
Dataset Open

Wikipedia Citations: A comprehensive dataset of citations with identifiers extracted from English Wikipedia

  • 1. EPFL
  • 2. University of Amsterdam

Description

The dataset is composed of 3 parts:

1.  The full dataset of 29.276 million citations drawn from 35 different citation templates. Of these, 3.92 million citations already contained identifiers, and approximately 260,752 citations were equipped with identifiers from Crossref. Filename: citations_from_wikipedia.zip

2.  A minimal dataset containing a subset of the columns of the full citations dataset: 'type_of_citation', 'page_title', 'Title', 'ID_list', 'metadata_file', 'updated_identifier'. Filename: minimal_dataset.zip. The 'metadata_file' column refers to the metadata collected from Crossref. The page title and the citation title ('page_title' and 'Title') can be used to look up a citation in the 'citations_from_wikipedia.zip' dataset and get more information about it (such as author, periodical, chapter).

3.  Citations classified as a journal, together with their corresponding metadata/identifiers extracted from Crossref to make the dataset more complete. Filename: lookup_data.zip. This zip file contains lookup_table.gzip (a parquet file containing all citations classified as a journal) and a folder, metadata_extracted (containing the metadata from Crossref for all the citations listed in the table).
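The lookup described in part 2, from a row of the minimal dataset back to the full citation record, can be sketched as a join on the ('page_title', 'Title') pair. The following is a minimal illustration in plain Python, assuming the rows have already been loaded as dictionaries. The function names, the example rows, and the 'Periodical' field are hypothetical placeholders, not taken from the dataset.

```python
from collections import defaultdict


def index_by_page_and_title(full_rows):
    """Group rows of the full dataset by their (page_title, Title) pair."""
    index = defaultdict(list)
    for row in full_rows:
        index[(row["page_title"], row["Title"])].append(row)
    return index


def enrich(minimal_rows, full_rows):
    """Attach the extra fields of the full dataset to each minimal row.

    A key may match zero, one, or several full rows, so the result can be
    shorter or longer than minimal_rows.
    """
    index = index_by_page_and_title(full_rows)
    enriched = []
    for row in minimal_rows:
        key = (row["page_title"], row["Title"])
        for match in index.get(key, []):
            # Values from the minimal row win when both rows share a field.
            enriched.append({**match, **row})
    return enriched


# Hypothetical placeholder rows, for illustration only.
full_rows = [
    {"page_title": "Example page", "Title": "Example article",
     "Periodical": "Example Journal"},
]
minimal_rows = [
    {"page_title": "Example page", "Title": "Example article",
     "type_of_citation": "cite journal"},
    {"page_title": "Other page", "Title": "Unmatched title",
     "type_of_citation": "cite book"},
]

result = enrich(minimal_rows, full_rows)
```

The pair is used as the key because the citation title alone is not guaranteed to be unique across pages. Whether the pair itself is unique in the real data is not stated here, which is why the sketch keeps every match.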


The data was parsed from the Wikipedia XML content dumps published in May 2020.

The source code for extracting the data and getting started with the pipeline can be found here: https://github.com/Harshdeep1996/cite-classifications-wiki

The taxonomy of the dataset in (1) can be found here: https://github.com/Harshdeep1996/cite-classifications-wiki/wiki/Taxonomy-of-the-parent-dataset

Files

Files (9.5 GB)

Checksum Size
md5:30c58ba51f4f6aeba39acfa6d6f954f0 7.1 GB
md5:dd4891237183a311f76cb8f2ec9026e5 858.5 MB
md5:33ba744ce3a35eb31843b7090a3e5df1 1.5 GB
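The MD5 checksums listed above can be used to confirm that a download completed without corruption. The following is a minimal sketch using only the Python standard library. The function names are hypothetical, and the pairing of each checksum with a file name is not given in this listing, so pass the checksum shown next to the file you actually downloaded.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_checksum(path, listed):
    """Compare a file against a checksum written as 'md5:<hex>' or '<hex>'."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected
```

For example, `matches_listed_checksum("path/to/downloaded.zip", "md5:<hex digest from the listing>")` returns `True` when the file is intact. Reading in chunks matters here because the archives are several gigabytes in size.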