Published April 18, 2024 | Version v3
Dataset Open

Data from the paper "The landscape of biomedical research"

  • 1. Hertie Institute for Artificial Intelligence in Brain Health, University of Tübingen
  • 2. Quantitative Data Science Methods, University of Tübingen
  • 3. Nomic AI, New York


The paper used the PubMed 2020 baseline (download date: 26.01.2021, no longer available) supplemented with additional files from the 2021 baseline (download date: 27.04.2022, no longer available), both originally obtained courtesy of the U.S. National Library of Medicine. This data can be found in v2 of this repository.

In the latest version of this repository we provide the PubMed 2024 baseline (download date: 06.02.2024), which includes all papers until the end of 2023. This is not the main data we analyzed in the paper but an updated version that includes newer articles. The paper contains two supplementary figures (S9 and S10) showing the updated embedding.

The latest version provided here includes the following files:

A metadata file, which includes:

- from the PubMed database: article title, journal, PMID, and publication year.

- produced by us: t-SNE embedding X and Y coordinates, label, color, whether the paper is retracted or not (combining PubMed and Retraction Watch information), and affiliation country (from the first affiliation of the first author).


An abstracts file, which includes:

- from the PubMed database: PMID and paper abstracts.


PubMedBERT_embeddings_float16_2024.npy, which includes:

- produced by us: PubMedBERT embeddings of the paper abstracts (numpy.ndarray of shape 23,389,083x768).
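
At the stated shape, and assuming 2 bytes per value as the "float16" in the file name suggests, the array takes about 35.9 GB, which is more than many machines can hold in RAM. The sketch below shows one possible way to work with it through memory mapping, so that only the rows you request are read from disk. The helper names `open_embeddings` and `take_rows` are hypothetical and not part of the dataset; the code also assumes the file is a standard `.npy` array.

```python
import numpy as np

EXPECTED_DIM = 768  # embedding width stated in the file description


def open_embeddings(path):
    """Memory-map a .npy file instead of loading it fully into RAM."""
    emb = np.load(path, mmap_mode="r")  # read-only memory map
    if emb.ndim != 2 or emb.shape[1] != EXPECTED_DIM:
        raise ValueError(f"unexpected embedding shape: {emb.shape}")
    return emb


def take_rows(emb, rows):
    """Copy the selected rows into memory, upcast to float32.

    float16 saves disk space but is imprecise for arithmetic,
    so the copy is converted before any numerical work.
    """
    rows = np.asarray(rows, dtype=np.int64)
    return np.asarray(emb[rows], dtype=np.float32)
```

With the file downloaded locally, `emb = open_embeddings("PubMedBERT_embeddings_float16_2024.npy")` followed by `take_rows(emb, [0, 1, 2])` would return the first three embeddings as a small in-memory array. Whether row order matches the order of the other files is not stated here and should be checked before joining on position.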


Files (48.8 GB)

Size
11.0 GB
1.9 GB
35.9 GB
