Published April 18, 2026 | Version v1
Dataset Open

The CPgenes Corpus: Trimodal Datasets for Generative Virtual Cell Modeling

Authors/Creators

Description

This repository contains the CPgenes dataset, a large-scale trimodal corpus curated for generative virtual cell modeling. The dataset bridges two high-throughput screening technologies, Cell Painting (morphology) and L1000 (transcriptomics), across four biological cohorts: BBBC021, CDRP, JUMP, and LINCS.

Key contents of this upload:

  • Paired Trimodal Samples: Matched sets of chemical/genetic perturbation embeddings, high-resolution (512 × 512) cellular morphology images, and corresponding gene expression profiles.
  • Diverse Contexts: The data cover the MCF7, U2OS, and A549 cell lines.

Researchers who need the complete raw datasets can download them directly from the respective official repositories. For more details, please refer to our paper, "MultiVCDiff: Building Generative Virtual Cell by Multimodally Predicting Morphological and Transcriptomic Perturbation Responses".

Files

merged_rgb_images_train_512_5k_images_only.zip

Files (2.2 GB)

Checksum                                Size
md5:1332a8f45f6cc840ad21a58573fe407b    181.5 MB
md5:595f9cebd57cc742f8e0a40011f88d97    2.0 GB
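
Since this record lists an MD5 checksum for each file, a download can be checked for corruption before use. The following is a minimal sketch using only the Python standard library; the helper names `md5_of_file` and `matches_published_checksum` are illustrative and are not part of the dataset or the paper's code.

```python
import hashlib

# MD5 digests copied from the Files section of this record.
PUBLISHED_MD5 = {
    "1332a8f45f6cc840ad21a58573fe407b",  # listed with the 181.5 MB file
    "595f9cebd57cc742f8e0a40011f88d97",  # listed with the 2.0 GB file
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks
    so that a multi-gigabyte archive is never loaded into memory at once."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_checksum(path):
    """Return True if the file's MD5 equals one of the digests listed above."""
    return md5_of_file(path) in PUBLISHED_MD5
```

For example, `matches_published_checksum("merged_rgb_images_train_512_5k_images_only.zip")` should return True for an intact copy, assuming the archive was saved under that name in the current directory and is one of the two files listed.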

Additional details

Dates

Created
2026-04-18