SAGE Rejected Article Tracker dataset

The enclosed dataset contains metadata for preprints uploaded to ArXiv in 2012.

For each preprint, there are 2 rows of search data:

  • ArXiv preprint metadata plus the CrossRef API data for the correct search result (which is the metadata for the published version of that preprint).
  • ArXiv preprint metadata plus the metadata for the top incorrect CrossRef API search result for the title and author names associated with the preprint.

ArXiv preprints are referred to as 'query' documents and CrossRef documents are referred to as 'match' documents.
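
As a minimal sketch of how the two rows per preprint might be paired up after loading, the snippet below groups rows by query_id and separates the correct match from the incorrect one. It assumes the column names listed under "How the data is structured" and that correct_yn is stored as 1 or 0. The sample rows and the helper name pair_rows are made up for illustration and are not taken from the dataset.

```python
import csv
import io
from collections import defaultdict


def pair_rows(csv_file):
    """Group rows by query_id into {'correct': row, 'incorrect': row}."""
    pairs = defaultdict(dict)
    for row in csv.DictReader(csv_file):
        # Assumption: correct_yn is written as "1" (published version)
        # or "0" (top incorrect search result).
        label = "correct" if row["correct_yn"].strip() == "1" else "incorrect"
        pairs[row["query_id"]][label] = row
    return dict(pairs)


# Made-up sample rows, for illustration only.
SAMPLE = """query_id,query_title,DOI,match_title,correct_yn
id:1201.0001,A study of widgets,10.1000/aaa,A study of widgets,1
id:1201.0001,A study of widgets,10.1000/bbb,Widgets in practice,0
"""

pairs = pair_rows(io.StringIO(SAMPLE))
for query_id, pair in pairs.items():
    # Strip the 'id:' prefix to recover the plain ArXiv identifier.
    arxiv_id = query_id[3:] if query_id.startswith("id:") else query_id
    print(arxiv_id, pair["correct"]["DOI"], pair["incorrect"]["DOI"])
```

To run this against the real file, an open file handle (opened with newline="" as the csv module recommends) can be passed to pair_rows in place of the in-memory sample.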

This dataset is created using the SAGE Rejected Article Tracker and is supplementary to that project. Similar custom datasets can be created using the SAGE Rejected Article Tracker with different parameters (e.g. different timeframes).

Intended use-cases

The primary intended use case is to train machine-learning models to track rejected articles (specifically, by linking records of rejected articles with CrossRef records for published papers). This process requires searching the CrossRef API and then distinguishing an accurate search result from an inaccurate one. The dataset is intended to provide example data for training a model that makes that distinction.
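
To make the correct-versus-incorrect distinction concrete, here is a deliberately simple baseline sketch: it picks the cut-off on the similarity column that best separates rows with correct_yn equal to 1 from rows with correct_yn equal to 0. This is only an illustration, not necessarily the approach taken by the SAGE Rejected Article Tracker; the function name best_threshold and the sample values are hypothetical.

```python
def best_threshold(rows):
    """Return (cut_off, accuracy) for the similarity cut-off that
    maximises accuracy on rows of (similarity, correct_yn) pairs."""
    best_cut, best_acc = 0, 0.0
    for cut in range(0, 101):
        # Predict "correct" when similarity >= cut, then count agreements.
        hits = sum((sim >= cut) == bool(label) for sim, label in rows)
        acc = hits / len(rows)
        if acc > best_acc:
            best_cut, best_acc = cut, acc
    return best_cut, best_acc


# Made-up (similarity, correct_yn) pairs, for illustration only.
rows = [(100, 1), (96, 1), (88, 1), (71, 0), (64, 0), (90, 0)]
cut, acc = best_threshold(rows)
print(cut, round(acc, 3))
```

A real model would likely combine several columns (for example score, rank, author_match_one and author_match_all) and be evaluated on held-out rows rather than on the rows used to choose the cut-off.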

There are other use-cases for which this dataset might be useful:

  • training a model to connect preprints with their published versions
  • training or testing a model to recognise duplicate articles given the same metadata (title and author names)

How the data is structured

The dataset is one CSV file with the following columns of data:

column name | data type | description
query_id | string | ArXiv identifier prepended with 'id:' (the prefix is added to prevent the id being read as a float)
query_doi | string | DOI associated with the ArXiv preprint. This DOI is manually entered by the author of the preprint.
authors_list | JSON array | List of author names for the query article. This format includes all of the information provided by ArXiv before preprocessing.
query_title | string | Title of the query article
query_authors | string | Author names from the query article, normalised to a simple first_initial+last_name format
query_created | string | Date the query article was uploaded to ArXiv
query_lang | string | Language of the query article according to langdetect
DOI | string | DOI of the match article found in CrossRef
match_title | string | Title associated with the match article
full_title | string | The concatenation of the title of the match article and its subtitle
publisher | string | Name of the publisher of the match article found in CrossRef
author | JSON array | Raw array of author details for the match article
cr_author | string | Match article author list expressed as a string in the same format as the query article. Raw data in the form prior to preprocessing.
container-title | JSON array | Essentially a list containing the name of the journal that published the match article (usually of length 1 unless the journal has multiple names)
normalised_container_title | string | The name of the match journal as a string
earliest_date | string | Earliest date associated with the match article. This is often the publication date, or close to it. Format is YYYY-MM-DD
score | float | CrossRef search score for the match article when we search for the query article
rank | integer | Rank of the match article in the search results
doi_sim | integer | Textual similarity of the query DOI and match DOI. This is used to clean the dataset by removing cases where both of the following hold: the DOI for an incorrect match is highly similar to the DOI for the query, and the title for a correct match is highly different from the query title. This situation is assumed to occur where there is an error in the query DOI.
is-referenced-by-count | integer | Count of citations to the CrossRef article according to CrossRef
author_match_one | integer | 1 or 0, essentially a boolean value showing whether one or more of the author names in the query article match the author names in the match article
author_match_all | integer | 1 or 0, essentially a boolean value showing whether all of the author names in the query article match the author names in the match article
similarity | integer | Textual similarity of query_title and full_title according to the fuzzywuzzy fuzz.ratio function
n_auths_query | integer | Number of authors of the query article
n_auths_match | integer | Number of authors of the match article
correct_yn | integer | 1 or 0, essentially a boolean value showing whether query_doi matches the match DOI (the DOI column)
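
The derived columns above (author_match_one, author_match_all, similarity and correct_yn) can be illustrated with a rough sketch. Everything here is an assumption rather than the dataset's actual preprocessing: the exact first_initial+last_name format is guessed, difflib's SequenceMatcher is swapped in as a stand-in for fuzzywuzzy's fuzz.ratio and may not give identical scores, DOIs are compared case-insensitively, and all function names are hypothetical.

```python
from difflib import SequenceMatcher


def normalise_author(name):
    """Reduce 'First Middle Last' to 'f last' (an assumed reading of
    the first_initial+last_name format)."""
    parts = name.lower().split()
    if not parts:
        return ""
    if len(parts) == 1:
        return parts[0]
    return parts[0][0] + " " + parts[-1]


def author_flags(query_authors, match_authors):
    """Return (author_match_one, author_match_all) as 1/0 integers."""
    q = {normalise_author(a) for a in query_authors}
    m = {normalise_author(a) for a in match_authors}
    match_one = int(bool(q & m))          # at least one shared name
    match_all = int(bool(q) and q <= m)   # every query name is in the match
    return match_one, match_all


def title_similarity(query_title, full_title):
    """Stand-in for fuzz.ratio: difflib ratio scaled to an integer 0-100."""
    return round(100 * SequenceMatcher(None, query_title, full_title).ratio())


def correct_yn(query_doi, match_doi):
    """1 if the two DOIs agree, ignoring case and surrounding whitespace."""
    return int(query_doi.strip().lower() == match_doi.strip().lower())


# Made-up values, for illustration only.
print(author_flags(["Ada Lovelace", "Alan M. Turing"], ["A. Lovelace"]))
print(title_similarity("A study of widgets", "Widgets in practice"))
print(correct_yn("10.1000/ABC", "10.1000/abc"))
```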