Published December 25, 2021 | Version v1
Dataset Open

Citation data of arXiv eprints and the associated quantitatively-and-temporally normalised impact metrics

  • 1. University of Tokyo
  • 2. National Institute of Science and Technology Policy

Description

Data collection

This dataset contains information on the eprints posted on arXiv from its launch in 1991 until the end of 2019 (1,589,006 unique eprints), together with their citation data and the associated impact metrics. Here, eprints include preprints, conference proceedings, book chapters, data sets and commentary, i.e. all electronic material that has been posted on arXiv.

The content and metadata of the arXiv eprints were retrieved from the arXiv API (https://arxiv.org/help/api/) as of 21st January 2020; the metadata included each eprint’s title, author, abstract, subject category and arXiv ID (arXiv’s own eprint identifier). The associated citation data were derived from the Semantic Scholar API (https://api.semanticscholar.org/) between 24th January 2020 and 7th February 2020, and contain the citations to and from the arXiv eprints and their published versions (if applicable). Here, whether an eprint has been published in a journal or by other means is assumed to be inferable, albeit indirectly, from the status of its digital object identifier (DOI) assignment. It is also assumed that if an arXiv eprint received c_pre citations before it was assigned a DOI and c_pub citations afterwards, up to the data retrieval date (7th February 2020), then its citation count is recorded in the Semantic Scholar dataset as c_pre + c_pub. Both the arXiv API and the Semantic Scholar datasets contained the arXiv ID as metadata, which served as the key variable for merging the two datasets.
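The merge on the arXiv ID described above can be sketched as follows. This is only an illustration under stated assumptions, not the pipeline used to build the dataset: the function name `merge_on_arxiv_id`, the field `c_total` and the toy records are all hypothetical.

```python
def merge_on_arxiv_id(arxiv_rows, citation_rows):
    """Inner-join two lists of dicts on the shared 'arxiv_id' key."""
    citations_by_id = {row["arxiv_id"]: row for row in citation_rows}
    merged = []
    for row in arxiv_rows:
        match = citations_by_id.get(row["arxiv_id"])
        if match is not None:
            # Combine the metadata fields with the citation fields.
            merged.append({**row, **match})
    return merged


# Made-up toy records standing in for the two sources (hypothetical values).
arxiv_rows = [
    {"arxiv_id": "0000.00001", "title": "Example A"},
    {"arxiv_id": "0000.00002", "title": "Example B"},
]
citation_rows = [
    {"arxiv_id": "0000.00001", "c_pre": 3, "c_pub": 5},
]

merged = merge_on_arxiv_id(arxiv_rows, citation_rows)
for row in merged:
    # Recorded citation count under the assumption stated in the text.
    row["c_total"] = row["c_pre"] + row["c_pub"]
```

Only eprints present in both sources survive this inner join; whether the original merge kept unmatched eprints is not stated here.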

The classification of research disciplines follows that given on the arXiv.org website (https://arxiv.org/help/stats/2020_by_area/). There, the arXiv subject categories are aggregated into several disciplines, of which we restrict our attention to the following six: Astrophysics (‘astro-ph’), Computer Science (‘comp-sci’), Condensed Matter Physics (‘cond-mat’), High Energy Physics (‘hep’), Mathematics (‘math’) and Other Physics (‘oth-phys’), which collectively accounted for 98% of all the eprints. Eprints tagged with multiple arXiv disciplines were counted independently for each discipline. Because of this overlap, the current dataset contains a cumulative total of 2,011,216 eprints.
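The difference between the unique count and the cumulative count can be illustrated with a small sketch. The rows below are made up, and it is an assumption that the ‘category’ column uses the short labels quoted above.

```python
from collections import Counter

# Hypothetical (arxiv_id, category) pairs; the first eprint is tagged
# with two disciplines and therefore appears twice.
rows = [
    ("0000.00001", "math"),
    ("0000.00001", "comp-sci"),
    ("0000.00002", "hep"),
    ("0000.00003", "math"),
]

# Each (eprint, discipline) pair is counted independently.
per_discipline = Counter(category for _, category in rows)
cumulative_total = sum(per_discipline.values())

# Unique eprints are counted once, however many disciplines they carry.
unique_total = len({arxiv_id for arxiv_id, _ in rows})
```

In this toy case the cumulative total exceeds the unique total by one, which mirrors how 1,589,006 unique eprints give rise to 2,011,216 entries.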

Some general statistics and visualisations per research discipline are provided in the original article (Okamura, 2022), where the validity and limitations associated with the dataset are also discussed.

Description of columns (variables)

  • arxiv_id : arXiv ID
  • category : Research discipline
  • pre_year : Year of posting v1 on arXiv
  • pub_year : Year of DOI acquisition
  • c_tot : No. of citations acquired during 1991–2019
  • c_pre : No. of citations acquired before and including the year of DOI acquisition
  • c_pub : No. of citations acquired after the year of DOI acquisition
  • c_yyyy (yyyy = 1991, …, 2019) : No. of citations acquired in the year yyyy
  • gamma : The quantitatively-and-temporally normalised citation index
  • gamma_star : The quantitatively-and-temporally standardised citation index
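As a minimal sketch of how the per-year columns might be checked against ‘c_tot’, the snippet below sums c_1991 to c_2019 for each row. It assumes the CSV header uses exactly the column names listed above and that blank cells mean zero; the helper `yearly_total` and the one-row in-memory CSV are hypothetical stand-ins for the real file.

```python
import csv
import io

YEAR_COLUMNS = [f"c_{year}" for year in range(1991, 2020)]


def yearly_total(row):
    """Sum the per-year citation columns of one CSV row (blank cells count as 0)."""
    total = 0
    for column in YEAR_COLUMNS:
        value = row.get(column) or "0"
        total += int(float(value))
    return total


# Build a one-row toy CSV in memory (made-up values).
header = ["arxiv_id", "category", "c_tot"] + YEAR_COLUMNS
counts = {column: 0 for column in YEAR_COLUMNS}
counts["c_2018"] = 2
counts["c_2019"] = 3
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=header)
writer.writeheader()
writer.writerow({"arxiv_id": "0000.00001", "category": "math", "c_tot": 5, **counts})
buffer.seek(0)

rows = list(csv.DictReader(buffer))
# IDs of rows whose yearly counts do not add up to c_tot.
mismatches = [r["arxiv_id"] for r in rows if yearly_total(r) != int(float(r["c_tot"]))]
```

For the real file, replacing the in-memory buffer with an opened file handle would stream the rows in the same way.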

Note: The definitions of the quantitatively-and-temporally normalised citation index (γ; ‘gamma’) and the standardised citation index (γ*; ‘gamma_star’) are provided in the original article (Okamura, 2022). Both indices can be used to compare the citation impact of papers/eprints published in different research disciplines at different times.
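One possible way to use the indices for a cross-discipline comparison is to summarise ‘gamma_star’ per ‘category’. The sketch below takes the median as the summary statistic, which is an illustrative choice rather than the method of the original article; the helper `median_by_category` and the toy rows are hypothetical.

```python
from collections import defaultdict
from statistics import median


def median_by_category(rows, column="gamma_star"):
    """Return the median of `column` for each research discipline."""
    grouped = defaultdict(list)
    for row in rows:
        value = row.get(column)
        if value in (None, ""):
            continue  # skip rows where the index is missing
        grouped[row["category"]].append(float(value))
    return {category: median(values) for category, values in grouped.items()}


# Made-up rows shaped like csv.DictReader output (all values are strings).
rows = [
    {"category": "math", "gamma_star": "0.5"},
    {"category": "math", "gamma_star": "1.5"},
    {"category": "hep", "gamma_star": "2.0"},
    {"category": "hep", "gamma_star": ""},
]
result = median_by_category(rows)
```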

 

Data files

A comma-separated values file (‘arXiv_impact.csv’) and a Stata file (‘arXiv_impact.dta’) are provided, both containing the same information.

 

Notes (English)

This dataset is released to accompany a paper by Keisuke Okamura, published in Quantitative Science Studies (2022) 3 (1): 122–146; DOI: 10.1162/qss_a_00174.

Files

arxiv_impact.csv

Files (422.7 MB)

  • 215.5 MB (md5:174f2aa59672b665667d656977bbe325)
  • 207.2 MB (md5:da3ec2f4aee30e59d106b0c4d02cbd8b)
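A downloaded file can be checked against the MD5 checksums listed above. The sketch below is a minimal example using Python's standard `hashlib`. Because the listing does not say which checksum belongs to which file, the hypothetical helper `matches_listing` accepts either value.

```python
import hashlib

# The two checksums from the file listing.
LISTED_MD5 = {
    "174f2aa59672b665667d656977bbe325",
    "da3ec2f4aee30e59d106b0c4d02cbd8b",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path):
    """Return True if the file's MD5 equals either listed checksum."""
    return md5_of_file(path) in LISTED_MD5
```

Reading in chunks keeps memory use flat, which matters for files of a few hundred megabytes.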

Additional details

Related works

Is supplement to
Journal article: 10.1162/qss_a_00174 (DOI)
Preprint: arXiv:2106.05027 (arXiv)
Journal article: 10.1103/l2xd-43n9 (DOI)
Preprint: arXiv:2503.03011 (arXiv)

Dates

Created
2021-12-25