Published July 21, 2025 | Version 2
Dataset Open

MGMG: Cell Morphology-Guided Molecule Generation for Drug Discovery

  • 1. Department of Medicinal Chemistry, Center for Natural Products, Drug Discovery and Development, University of Florida, Gainesville, FL 32610, USA
  • 2. School of Computer, Data & Information Sciences, University of Wisconsin-Madison, Madison, WI 53706, USA
  • 3. Holcombe Department of Electrical and Computer Engineering, Clemson University, Clemson, SC 29634, USA
  • 4. Department of Computer & Information Science & Engineering, University of Florida, Gainesville, FL 32611, USA
  • 5. Program in Cancer and Stem Cell Biology, Duke-NUS Medical School, Singapore 169857, Singapore

Description

This dataset supports MGMG (Morphology-Guided Molecule Generation), a framework for phenotypic drug discovery that integrates cell morphological information from compound treatment with textual molecular descriptions, enabling de novo molecule design without target information. Please cite the following publication when using the dataset:

MGMG: Cell Morphology-Guided Molecule Generation for Drug Discovery
Qiaosi Tang, Daoyun Ding, Xiaoyong Yuan, Gustavo Seabra, Peter A Ramdhan, Chi-Yuan Liu, My T. Thai, Chenglong Li, Hendrik Luesch, Yanjun Li
bioRxiv 2025.07.11.664424; doi: https://doi.org/10.1101/2025.07.11.664424

Overview

The dataset contains the model checkpoint, test input data, and results required to reproduce MGMG’s evaluations and case studies, including the activator design task and the docking task. It is structured to support reproducibility, downstream applications, and benchmarking by the research community. All data are packaged in a single archive file, MGMG_dataset.zip.
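As a minimal sketch of how the archive might be unpacked and inspected, the Python snippet below lists the top-level entries of a zip file and extracts it. The function names `top_level_entries` and `extract_archive` are illustrative and not part of MGMG, and whether the archive wraps its folders in an extra top-level directory is not stated in this record, so the listing step is worth running first.

```python
import zipfile
from pathlib import Path


def top_level_entries(archive_path):
    """Return the sorted top-level names found inside a zip archive."""
    with zipfile.ZipFile(archive_path) as zf:
        return sorted({name.split("/", 1)[0] for name in zf.namelist() if name})


def extract_archive(archive_path, dest_dir):
    """Extract the whole archive into dest_dir and return that directory."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive_path) as zf:
        zf.extractall(dest)
    return dest


if __name__ == "__main__":
    archive = Path("MGMG_dataset.zip")  # assumed to be in the working directory
    if archive.exists():
        print(top_level_entries(archive))
        extract_archive(archive, "MGMG_dataset")  # hypothetical output folder
```

Listing the entries before extracting shows where the folders described under Contents actually sit inside the archive.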

Contents

checkpoint/

  • pytorch_model.bin: Saved checkpoint of the trained MGMG model.

testset_Morphology-BBBC036v1/

  • unique_morph_info.json: Morphological profiles for the test compounds, derived from Cell Painting images in BBBC036v1 (Bray et al. 2017) and processed using PyCytominer.

testset_Mol-Instructions-BBBC036v1/

  • task3_MolIns_text2mol_test_tk.json: Compound textual descriptions sourced from PubChem, aligned with the same compounds in the morphology set.

testset_MolCaptioned-BBBC036v1/

  • task3_MolIns_text2mol_test_tkgendesc_revised.json: Synthetic compound textual descriptions generated by BioT5, aligned with the same compounds in the morphology set.

reference_metrics/

  • test_gth.csv: Chemical properties of the test set reference compounds.

  • train_gth.csv: Chemical properties of the training set reference compounds.

activator_design/

  • task3_GeneOE_text2mol_test_[GENE].json: Textual input descriptions for gene-specific molecule generation. Each file corresponds to one gene overexpression case (e.g., TP53, BRCA1, NFKB1), with text profiles generated by GPT-4.0. Used in the first step of the activator design task (see the preprint Methods section for details).

  • task3_GeneOEbiot5gen_test.json: BioT5-generated molecular descriptions for all gene perturbation cases. Used in the third step of the activator design task (see the preprint Methods section for details).

  • unique_gene_morph_info.json: Morphological profiles of cells with gene overexpression perturbations, used for the activator design case study in the manuscript.

docking_files/

  • [PDB_ID].maegz: Prepared docking files (protein target, reference compound, and generated sample compounds) for selected demo cases (3CMF, 4E2J, 7DFP, 7QZ7), compatible with Maestro/PyMOL.
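The layout above can be checked after extraction with a short script. The following is a sketch under stated assumptions: the relative paths are copied from the listing, `*` stands in for the `[GENE]` and `[PDB_ID]` placeholders, and the helper names `missing_entries` and `describe_json` are hypothetical. The internal schema of the JSON files is not documented here, so the summary only reports the top-level type and size rather than assuming any field names.

```python
import json
from pathlib import Path

# Relative paths taken from the Contents listing; "*" replaces the
# [GENE] and [PDB_ID] placeholders.
EXPECTED_PATTERNS = [
    "checkpoint/pytorch_model.bin",
    "testset_Morphology-BBBC036v1/unique_morph_info.json",
    "testset_Mol-Instructions-BBBC036v1/task3_MolIns_text2mol_test_tk.json",
    "testset_MolCaptioned-BBBC036v1/task3_MolIns_text2mol_test_tkgendesc_revised.json",
    "reference_metrics/test_gth.csv",
    "reference_metrics/train_gth.csv",
    "activator_design/task3_GeneOE_text2mol_test_*.json",
    "activator_design/task3_GeneOEbiot5gen_test.json",
    "activator_design/unique_gene_morph_info.json",
    "docking_files/*.maegz",
]


def missing_entries(root):
    """Return the expected patterns that match no file under root."""
    root = Path(root)
    return [pattern for pattern in EXPECTED_PATTERNS if not any(root.glob(pattern))]


def describe_json(path):
    """Summarise a JSON file without assuming anything about its schema."""
    with open(path, encoding="utf-8") as fh:
        data = json.load(fh)
    summary = {
        "type": type(data).__name__,
        "length": len(data) if isinstance(data, (list, dict)) else None,
    }
    if isinstance(data, dict):
        summary["sample_keys"] = list(data)[:5]  # first few keys only
    return summary
```

Calling `missing_entries` on the extraction folder returns an empty list when every listed file is present. If it reports everything as missing, the archive probably contains an extra wrapper directory and `root` should point one level deeper.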

Files

MGMG_dataset.zip (1.2 GB)
md5:12c3e9f5b3c787c86311d1261e5f9977
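Because the archive is large, it can be worth confirming that a download is intact before extracting it. The sketch below computes an MD5 digest in chunks with Python's standard `hashlib` module and compares it against the checksum listed above. The function names and the local file path are illustrative assumptions.

```python
import hashlib
from pathlib import Path

# Checksum copied from the file listing of this record.
EXPECTED_MD5 = "12c3e9f5b3c787c86311d1261e5f9977"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_MD5):
    """Return True when the file's MD5 digest equals the expected value."""
    return md5_of_file(path) == expected.lower()


if __name__ == "__main__":
    archive = Path("MGMG_dataset.zip")  # assumed download location
    if archive.exists():
        print("checksum OK" if matches_expected(archive) else "checksum mismatch")
```

Reading in chunks keeps memory use flat regardless of file size, which matters for a 1.2 GB archive.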

Additional details

Related works

Is supplement to
Preprint: 10.1101/2025.07.11.664424 (DOI)