Dataset Open Access


Jake Lever

This describes the output file for the CoronaCentral data. The scripts used to create it are hosted in the corona-ml GitHub repo. The source documents, before processing for CoronaCentral, come from PubMed and CORD-19.

The file is a gzipped JSON document containing one record per document. Each document has at least one of: a PubMed ID, a CORD-19 ID (cord_uid), a DOI or a URL.
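As a rough illustration of reading such a file, the sketch below loads a gzipped JSON file with the Python standard library. It assumes the top level of the JSON is an array of records, which the description above does not state explicitly. The function names `load_documents` and `has_identifier` are hypothetical helpers, not part of the corona-ml scripts.

```python
import gzip
import json


def load_documents(path):
    """Read a gzipped JSON file and return its records as a list of dicts."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        data = json.load(handle)
    # Assumption: the top level is a JSON array with one object per document.
    if not isinstance(data, list):
        raise ValueError("expected a top-level JSON array of records")
    return data


def has_identifier(doc):
    """Return True if the record carries at least one of the four identifiers."""
    return any(doc.get(key) for key in ("pubmed_id", "cord_uid", "doi", "url"))
```

A call might look like `docs = load_documents("coronacentral.json.gz")`, where the file name is a placeholder for wherever the download was saved. Loading the whole file at once needs enough memory for the uncompressed data.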

Each document should have the following fields:

  • pubmed_id: PubMed identifier (optional)
  • pmcid: PubMed Central identifier (optional)
  • doi: Digital object identifier (optional)
  • cord_uid: CORD-19 identifier (optional)
  • url: URL
  • journal: Journal/preprint server
  • publish_year: Year of publication (optional)
  • publish_month: Month of publication (optional)
  • publish_day: Day of publication (optional)
  • title: Title of article
  • abstract: Abstract of article (optional)
  • is_preprint: Whether the article is a preprint
  • topics: Predicted topics for article
  • articletypes: Predicted article types for article
  • entities: Extracted entities (e.g. drugs) with identifiers and locations within text
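Because several of the fields listed above are optional, code that consumes the records should tolerate missing values. The following sketch shows one way to do that, under assumptions the description does not confirm: that publish_year, publish_month and publish_day hold integers (or are absent or null), that topics is a list of strings, and that is_preprint is a boolean. Both function names are hypothetical, and the topic name passed in is whatever label the caller is interested in.

```python
from datetime import date


def publication_date(doc):
    """Build a date from the optional publish_* fields, or return None.

    A missing month or day defaults to 1 (an arbitrary choice for this sketch).
    """
    year = doc.get("publish_year")
    if not year:
        return None
    month = doc.get("publish_month") or 1
    day = doc.get("publish_day") or 1
    try:
        return date(int(year), int(month), int(day))
    except (TypeError, ValueError):
        # Malformed or out-of-range values are treated as "no date".
        return None


def non_preprints_with_topic(docs, topic):
    """Select records that are not preprints and list the given topic."""
    return [
        doc
        for doc in docs
        if not doc.get("is_preprint") and topic in (doc.get("topics") or [])
    ]
```

Defaulting a missing month or day to 1 keeps records sortable by date, at the cost of implying more precision than the record actually has.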

Please report issues to the corona-ml GitHub issues page.

Files (232.4 MB)