Published March 5, 2021 | Version v1
Dataset Open

CORD-19 Software Mentions

  • 1. Chan Zuckerberg Initiative (United States)

Description

To automate the identification and analysis of software use in biomedical research, we developed a SciBERT-based machine learning model that extracts mentions of software from scientific articles. The model takes the full text of a scientific article as input and outputs a list of the software mentioned in it. We applied this model to the CORD-19 full-text articles and stored the output in this dataset, which includes metadata for over 77,000 COVID-19 and coronavirus-related papers and a list of the software tools mentioned in each.

Notes

  1. Not all papers in the CORD-19 dataset mention software. This dataset includes only the articles for which full text was available and in which at least one software mention was detected.
  2. Software names have not been normalized or resolved against any external dictionary. For example, the list of software mentions includes "excel", "microsoft excel", "ms excel", and "office excel".
  3. The dataset contains a DOI for each mentioning paper where available (96% of papers). For the remaining papers, external identifiers (such as PubMed Central IDs, PubMed PMIDs, and arXiv IDs) can often be derived from the paper URLs. For example, the arXiv ID for the paper with the URL "https://arxiv.org/pdf/2011.09270v1.pdf" is "2011.09270".
  4. In some cases, a single software mention is incorrectly split into multiple tokens, e.g. ['scikit', 'learn'].
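
The URL-based recovery described in note 3 can be sketched for the arXiv case as follows. This is only an illustrative sketch: the function name `arxiv_id_from_url` and the regular expression are assumptions introduced here, not part of the dataset, and the pattern covers only new-style arXiv identifiers (YYMM.NNNNN) in `abs` or `pdf` URLs.

```python
import re
from typing import Optional

# Assumed pattern: new-style arXiv IDs (4 digits, a dot, 4 or 5 digits),
# optionally followed by a version suffix such as "v1".
_ARXIV_URL = re.compile(
    r"arxiv\.org/(?:abs|pdf)/(\d{4}\.\d{4,5})(?:v\d+)?", re.IGNORECASE
)


def arxiv_id_from_url(url: str) -> Optional[str]:
    """Return the version-less arXiv ID found in `url`, or None if absent."""
    match = _ARXIV_URL.search(url)
    return match.group(1) if match else None


# The example URL from note 3.
print(arxiv_id_from_url("https://arxiv.org/pdf/2011.09270v1.pdf"))  # 2011.09270
```

URLs that do not match the pattern return `None`, so rows from other sources pass through without raising an error.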

Schema:

| Column | Type | Description |
|---|---|---|
| paper_id | string | ID of paper from CORD-19 dataset. 40-character SHA-1 of the PDF |
| doi | string | Digital Object Identifier of the article, from CORD-19 |
| title | string | Title of the article, from CORD-19 |
| source_x | array | Provenance of the article from CORD-19 dataset, e.g. arXiv, bioRxiv, Elsevier, Medline, PMC, WHO, Wiley |
| license | string | License of the article, from CORD-19 |
| publish_time | date (mm/dd/yyyy) | Publication date of the article, from CORD-19 |
| journal | string | Journal short name, from CORD-19 (e.g. PLoS Comput Biol) |
| url | array | URL(s) of article, from CORD-19 |
| software | array | Software mentions extracted from article full text |
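
As a rough illustration of how a row with this schema might be read, the sketch below parses one CSV record using only the Python standard library. Several things here are assumptions rather than documented facts: that the array columns are serialized as Python-style list literals (suggested by the ['scikit', 'learn'] example in the notes, but not confirmed), the helper name `parse_row`, and the sample row itself, which is made up purely for demonstration and does not come from the dataset.

```python
import ast
import csv
import io
from datetime import datetime

# Columns the schema declares as arrays.
ARRAY_COLUMNS = ("source_x", "url", "software")

# Made-up sample record, NOT taken from the dataset.
# Assumption: arrays are stored as Python-style list literals.
SAMPLE_CSV = """paper_id,doi,title,source_x,license,publish_time,journal,url,software
0000000000000000000000000000000000000000,10.0000/example,Example title,"['PMC']",cc-by,03/05/2021,Example J,"['https://example.org/paper']","['excel', 'ms excel']"
"""


def parse_row(row):
    """Convert array columns to lists and publish_time to a date."""
    parsed = dict(row)
    for col in ARRAY_COLUMNS:
        # literal_eval only evaluates literals, so it is safe on untrusted text.
        parsed[col] = ast.literal_eval(row[col]) if row[col] else []
    raw_date = row["publish_time"].strip()
    # The schema gives the date format as mm/dd/yyyy.
    parsed["publish_time"] = (
        datetime.strptime(raw_date, "%m/%d/%Y").date() if raw_date else None
    )
    return parsed


rows = [parse_row(r) for r in csv.DictReader(io.StringIO(SAMPLE_CSV))]
print(rows[0]["software"], rows[0]["publish_time"])
```

To read the real file, `io.StringIO(SAMPLE_CSV)` would be replaced by an open file handle; if the arrays turn out to use a different serialization, only the `ast.literal_eval` line needs to change.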

Funding provided by: Chan Zuckerberg Initiative
Crossref Funder Registry ID: http://dx.doi.org/10.13039/100014989
Award Number:

Files

CORD19_software_mentions.csv

Size: 31.9 MB
md5:41f7c5dca5abc6fc97e4c54f116c227b
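
The MD5 checksum listed for CORD19_software_mentions.csv can be used to confirm that a download is intact. The following is a minimal sketch using Python's standard `hashlib` module; the helper names `md5_of_file` and `matches_published_checksum` are hypothetical, and the file path passed in is whatever location the file was saved to locally.

```python
import hashlib

# Checksum published with the file listing above.
EXPECTED_MD5 = "41f7c5dca5abc6fc97e4c54f116c227b"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_checksum(path):
    """Return True if the file at `path` has the published MD5 checksum."""
    return md5_of_file(path) == EXPECTED_MD5
```

For example, `matches_published_checksum("CORD19_software_mentions.csv")` would return `True` for an uncorrupted copy saved under that name in the current directory.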