There is a newer version of the record available.

Published March 3, 2023 | Version v2
Dataset Open

Data from the paper "The landscape of biomedical research"

  • 1. Hertie Institute for Artificial Intelligence in Brain Health, University of Tübingen
  • 2. Quantitative Data Science Methods, University of Tübingen
  • 3. Nomic AI, New York

Description

Data from the paper "The landscape of biomedical research" (https://www.biorxiv.org/content/10.1101/2023.04.10.536208v1).

The paper used the PubMed 2020 baseline (downloaded 26.01.2021; no longer available), supplemented with additional files from the 2021 baseline (downloaded 27.04.2022; no longer available). Both were originally obtained from https://www.nlm.nih.gov/databases/download/pubmed_medline.html, courtesy of the U.S. National Library of Medicine.

The data provided here includes the following files:

pubmed_landscape_data.zip, which includes:

- from the PubMed database: article title, journal, PMID, and publication year.
- produced by us: t-SNE embedding X and Y coordinates, label, and color.

pubmed_landscape_abstracts.zip, which includes:

- from the PubMed database: PMID and paper abstracts.

PubMedBERT_embeddings_float16.npy, which includes:

- produced by us: PubMedBERT embeddings of the paper abstracts (float16 numpy.ndarray of shape 20,687,150 × 768).
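
Because the embedding matrix is large, it may be more practical to memory-map it than to read it fully into RAM. The following is a minimal sketch, assuming the .npy file has been downloaded to a local path and that its rows follow the order of the records in the other files (an assumption not stated on this page). The helper names `expected_nbytes`, `open_embeddings`, and `load_rows` are hypothetical and used only for illustration.

```python
import numpy as np

# Shape stated in the description above; float16 is taken from the file name.
N_PAPERS = 20_687_150
DIM = 768


def expected_nbytes(n_rows=N_PAPERS, n_cols=DIM, dtype=np.float16):
    """Size in bytes of the raw array data (excluding the small .npy header)."""
    return n_rows * n_cols * np.dtype(dtype).itemsize


def open_embeddings(path, n_cols=DIM):
    """Memory-map the .npy file read-only instead of loading it into RAM."""
    emb = np.load(path, mmap_mode="r")
    if emb.ndim != 2 or emb.shape[1] != n_cols:
        raise ValueError(f"unexpected embedding shape: {emb.shape}")
    return emb


def load_rows(emb, row_indices):
    """Read only the requested rows and upcast them to float32 for computation."""
    idx = np.asarray(row_indices)
    return np.asarray(emb[idx], dtype=np.float32)
```

With these helpers, `expected_nbytes()` evaluates to 31,775,462,400 bytes (about 31.8 GB), which matches the size of the largest file in this record. A call such as `load_rows(open_embeddings("PubMedBERT_embeddings_float16.npy"), [0, 1, 2])` would read only the first three rows from disk.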

Files

Files (42.6 GB)

MD5 checksum                        Size
34d91d57f74903655a6da55d73c1bdb2    9.5 GB
8762d53d1b77ca0477f917d3329da584    1.3 GB
54c321e9bb258932fbd0650ede6b4daa    31.8 GB
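
The MD5 checksums listed above can be used to check that a download completed without corruption. The following is a minimal sketch using only the Python standard library. The helper names `md5_of_file` and `verify_md5` are hypothetical, and the sketch does not assume which checksum belongs to which file, since the listing above pairs checksums only with sizes.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_md5(path, expected_hex):
    """Return True if the file's MD5 digest equals the expected hex string."""
    return md5_of_file(path) == expected_hex.strip().lower()
```

Reading in chunks keeps memory use constant, which matters for files of tens of gigabytes. A downloaded file can then be compared against each listed checksum in turn, for example `verify_md5(local_path, "54c321e9bb258932fbd0650ede6b4daa")`, where `local_path` is wherever the file was saved.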