Published April 18, 2026 | Version v1
Dataset
Open
The CPgenes Corpus: Trimodal Datasets for Generative Virtual Cell Modeling
Description
This repository contains the CPgenes dataset, a large-scale trimodal corpus curated for generative virtual cell modeling. The dataset pairs two high-throughput screening technologies, Cell Painting (morphology) and L1000 (transcriptomics), across four biological cohorts: BBBC021, CDRP, JUMP, and LINCS.
Key contents of this upload:
- Paired Trimodal Samples: Matched sets of chemical/genetic perturbation embeddings, high-resolution (512×512 pixel) cellular morphology images, and corresponding gene expression profiles.
- Diverse Contexts: The data cover the MCF7, U2OS, and A549 cell lines.
Researchers who need the complete raw datasets can download them directly from the respective official repositories. For more details, please refer to our paper: "MultiVCDiff: Building Generative Virtual Cell by Multimodally Predicting Morphological and Transcriptomic Perturbation Responses".
Files (2.2 GB)
merged_rgb_images_train_512_5k_images_only.zip
| MD5 checksum | Size |
|---|---|
| md5:1332a8f45f6cc840ad21a58573fe407b | 181.5 MB |
| md5:595f9cebd57cc742f8e0a40011f88d97 | 2.0 GB |
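Because each file in the listing comes with an MD5 checksum, a download can be checked for corruption before it is unpacked. The following is a minimal sketch using only the Python standard library; the function names `md5_of_file` and `matches_checksum` are illustrative and not part of the MultiVCDiff repository, and the local file path is whatever the reader chose when downloading.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks.

    Chunked reading keeps memory use flat even for multi-gigabyte archives.
    """
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file against a checksum string such as 'md5:1332a8f4...'.

    The optional 'md5:' prefix (as shown in the file listing) is stripped,
    and the comparison ignores letter case.
    """
    expected_hex = expected.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected_hex
```

To use it, pass the local path of a downloaded file together with the checksum shown for that file on the record page, for example `matches_checksum(local_path, "md5:...")`. The listing above does not state which checksum belongs to which file name, so that pairing should be confirmed on the record page itself.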
Additional details
Dates
- Created: 2026-04-18
Software
- Repository URL: https://github.com/prsigma/MultiVCDiff