Wikipedia Citations: A comprehensive dataset of citations with identifiers extracted from English Wikipedia
Description
The dataset is composed of 3 parts:
1. The full dataset of 29.276 million citations extracted from 35 different citation templates, of which 3.92 million already contained identifiers; identifiers for approximately 260,752 further citations were retrieved from Crossref. This is under the filename: citations_from_wikipedia.zip
2. A minimal dataset containing a subset of the columns from (1): 'type_of_citation', 'page_title', 'Title', 'ID_list', 'metadata_file', 'updated_identifier'. This is under the filename: minimal_dataset.zip. The 'metadata_file' column points to the metadata collected from Crossref, and the page title together with the citation title can be used to look up a citation in citations_from_wikipedia.zip and retrieve additional fields (such as author, periodical, chapter).
3. Citations classified as journal citations, together with the corresponding metadata/identifiers extracted from Crossref to make the dataset more complete. This is under the filename: lookup_data.zip. The zip file contains lookup_table.gzip (a Parquet file listing all citations classified as journals) and a folder, metadata_extracted, containing the Crossref metadata for every citation listed in the table.
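The minimal dataset in (2) can be joined back to the full dataset in (1) on the page title and citation title to recover the richer fields. A minimal pandas sketch of that join, using synthetic in-memory rows (the column names come from the description above; the sample values and the exact file layout inside the zips are invented for illustration):

```python
import pandas as pd

# Stand-in for the minimal dataset, with the columns listed above.
# Sample values are placeholders, not real dataset rows.
minimal = pd.DataFrame({
    'type_of_citation': ['cite journal'],
    'page_title': ['Example article'],
    'Title': ['A study of something'],
    'ID_list': ["{'DOI': '10.1000/xyz'}"],
    'metadata_file': ['metadata_extracted/0001.json'],
    'updated_identifier': ['10.1000/xyz'],
})

# Stand-in for the full dataset; in practice this would be loaded
# from citations_from_wikipedia.zip.
full = pd.DataFrame({
    'page_title': ['Example article'],
    'Title': ['A study of something'],
    'Authors': ['Doe, J.'],
    'Periodical': ['Journal of Examples'],
})

# Join on (page_title, Title) to attach author/periodical information
# to each minimal-dataset citation.
enriched = minimal.merge(full, on=['page_title', 'Title'], how='left')
print(enriched[['Title', 'Authors', 'Periodical']])
```

The `how='left'` join keeps every citation from the minimal dataset even when no match is found in the full dataset.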
The data was parsed from the Wikipedia XML content dumps published in May 2020.
The source code for extracting the citations, along with instructions for getting started with the pipeline, can be found here: https://github.com/Harshdeep1996/cite-classifications-wiki
The taxonomy of the dataset in (1) can be found here: https://github.com/Harshdeep1996/cite-classifications-wiki/wiki/Taxonomy-of-the-parent-dataset