Published March 16, 2023 | Version 1.0
Dataset Open

Chemical structures, Cell Painting and transcriptional profiles for compound bioactivity prediction.

  • 1. Broad Institute of MIT and Harvard; Biological Research Centre, Szeged, Hungary
  • 2. Broad Institute of MIT and Harvard
  • 3. University of California, Berkeley
  • 4. Biological Research Centre, Szeged, Hungary

Description

This record contains the data related to the paper "Predicting compound activity from phenotypic profiles and chemical structures", both the input data and the data produced in the study.

This data can be merged with the paper's GitHub repository to reproduce the results.

Folders and files are described below:

├── assay_data
   ├── assay_matrix_discrete_270_assays.csv Assay matrix with hits for assays (270) and compounds (16170). Note that this is the final file that we used to produce splits.
   ├── assay_metadata.csv Assay metadata
   ├── broad_ids.txt List of Broad IDs used in this study. This is the unfiltered list of compounds required by some analysis scripts.
   ├── smiles.txt Same as broad_ids.txt, but with SMILES strings.
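
As a quick sanity check on the assay matrix, hits per assay can be counted with the standard library alone. This is a minimal sketch and not part of the archive. It assumes that the first column holds the compound ID, that every other column is one assay, and that cells are 1 for a hit, 0 for a non-hit and empty (or "nan") for untested pairs. The record does not document this encoding, so check the header and a few rows first. The IDs and assay names in the toy data are hypothetical.

```python
import csv
import io


def count_hits_per_assay(csv_file):
    """Return {assay_column: (n_hits, n_tested)} for a hit-matrix CSV."""
    reader = csv.reader(csv_file)
    header = next(reader)
    assay_names = header[1:]  # assumed: first column is the compound ID
    hits = {name: 0 for name in assay_names}
    tested = {name: 0 for name in assay_names}
    for row in reader:
        for name, cell in zip(assay_names, row[1:]):
            cell = cell.strip()
            if cell == "" or cell.lower() == "nan":
                continue  # assumed: empty cell means "not tested"
            tested[name] += 1
            if float(cell) == 1.0:
                hits[name] += 1
    return {name: (hits[name], tested[name]) for name in assay_names}


# Toy stand-in for the real file (hypothetical IDs and assay names).
toy_csv = io.StringIO(
    "broad_id,assay_a,assay_b\n"
    "cpd-1,1,0\n"
    "cpd-2,0,\n"
    "cpd-3,1,1\n"
)
summary = count_hits_per_assay(toy_csv)
print(summary)
```

For the real file, pass open("assay_data/assay_matrix_discrete_270_assays.csv", newline="") instead of the toy buffer. The path is relative to the unpacked archive.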

├── feature_data  (for 16978 compounds; can be reduced to the final 16170 compounds with the mask ./misc/compounds16978to16170.npy)
   ├── cp.npz Classical chemical features
   ├── ge.npz Gene expression features
   ├── ge_scale.npz Gene expression scaled features
   ├── mo.npz Morphology features (not batch corrected)
   ├── mobc.npz Morphology features (batch corrected)
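
The reduction from 16978 to 16170 compounds might look like the sketch below. The record does not say whether compounds16978to16170.npy stores a boolean mask or an array of row indices, nor which keys the .npz files use, so the helper accepts either mask form and the demonstration runs on toy arrays. The function name is hypothetical.

```python
import numpy as np


def restrict_to_final_compounds(features, mask):
    """Keep only the rows of `features` selected by `mask`.

    `mask` may be a boolean array with one entry per row, or an
    integer array of row indices (both forms are assumptions).
    """
    features = np.asarray(features)
    mask = np.asarray(mask)
    if mask.dtype == bool and mask.shape[0] != features.shape[0]:
        raise ValueError("boolean mask must have one entry per feature row")
    return features[mask]


# Toy data: 6 "compounds" with 2 features each, 4 of them kept.
toy_features = np.arange(12).reshape(6, 2)
toy_mask = np.array([True, False, True, True, False, True])
filtered = restrict_to_final_compounds(toy_features, toy_mask)
print(filtered.shape)

# The same selection expressed as row indices.
filtered_by_index = restrict_to_final_compounds(toy_features, np.flatnonzero(toy_mask))
```

With the real data, np.load("misc/compounds16978to16170.npy") would supply the mask. For the .npz files, the .files attribute of the object returned by np.load lists the stored array names.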

├── misc
   ├── compound_analysis.npz Compounds in the dataset identified as PAINS
   ├── compounds16978to16170.npy Used to filter features from the bigger set of compounds to the final one
   ├── fingerprints.npz Calculated fingerprints of compounds, which were then used to calculate similarity
   ├── similarity_fingerprints.npz Similarity matrix for compounds (16978)
   ├── population_normalized.csv.gz Well-level morphological profiles that were used for batch-correction 
   ├── Table for PUMA Excel file with additional data and plots
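
To illustrate how a similarity matrix can be derived from fingerprints, the sketch below uses the Tanimoto coefficient on binary fingerprints. Both choices are assumptions made for illustration. The record does not state which fingerprint type or similarity measure produced similarity_fingerprints.npz. Fingerprints are represented here as sets of "on" bit positions, and the toy values are made up.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bits."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / union


def similarity_matrix(fingerprints):
    """Symmetric matrix of pairwise Tanimoto coefficients."""
    n = len(fingerprints)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            value = tanimoto(fingerprints[i], fingerprints[j])
            matrix[i][j] = value
            matrix[j][i] = value
    return matrix


# Toy fingerprints: compounds 0 and 2 are identical, 1 shares two bits with them.
toy_fingerprints = [{1, 2, 3, 4}, {3, 4, 5, 6}, {1, 2, 3, 4}]
sim = similarity_matrix(toy_fingerprints)
print(sim[0][1], sim[0][2])
```

For 16978 compounds a vectorized implementation would be needed in practice. The nested loop is kept only for readability.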

├── predictions
   ├── scaffold_median(mean)_AUC.csv Aggregated median(mean) AUC scores over scaffold-based cross-validation splits. In the paper, median results were reported. 
   ├── scaffold_median(mean)_EF.csv Aggregated median(mean) enrichment factor (EF) over scaffold-based cross-validation splits. In the paper, median results were reported. 
   ├── toprank_chemical_cv{}_hitsnorm.csv These files are needed to create enrichment plots and contain the hit rate and the top-rank hit rate.
   ├── Each folder here corresponds to an experiment type; the number in the folder name is the split number. Each folder contains the following elements:
      ├── predictions Folder with predictions for each assay-compound pair for each modality
      ├── 2022_01_evaluation_all_data.csv File with AUC scores for each assay for the test set in the split
      ├── 2022_01_evaluation_all_data_EF.csv File with enrichment factor (EF) values for each assay for the test set in the split. Those files exist only for *chemical* folders.
      ├── assay_matrix_discrete_train(test)_old_scaff.csv Training and test subsets of data for the split. The first column contains broad_id.
      ├── assay_matrix_discrete_train(test)_old_scaff.csv Same, but SMILES strings in the first column. Those files are used as input to ChemProp!

      Experiments in this folder are the following: 
      - chemical Scaffold-based 5-fold cross-validation splits, the main results in the paper are reported with this series of experiments.
      - chemical_bal Same splits as in chemical, but training was run with ChemProp's built-in data balancing. 
      - chemical_st Same splits as in chemical, but separate models were trained for each assay.
      - CV Random 5-fold cross-validation splits.
      - GE 5-fold cross-validation splits based on same-size clustering of gene expression features.
      - MOBC 5-fold cross-validation splits based on same-size clustering of batch-corrected morphology features.
      - random 10 random splits, with ~80% of compounds in the training set and the rest in the test set. 
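
The two kinds of aggregate in this folder, the enrichment factor per assay and the median over splits, can be sketched as follows. The enrichment factor is written in a commonly used form: the hit rate among the top-ranked fraction of compounds divided by the overall hit rate. The exact definition and the cutoff used in the paper are not given in this record, so top_fraction and all numbers below are illustrative assumptions.

```python
import math
from statistics import median


def enrichment_factor(scores, labels, top_fraction=0.01):
    """Hit rate in the top-ranked fraction divided by the overall hit rate."""
    n = len(scores)
    total_hits = sum(labels)
    if n == 0 or total_hits == 0:
        return float("nan")  # undefined without compounds or without hits
    k = max(1, math.ceil(top_fraction * n))
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    top_hits = sum(label for _, label in ranked[:k])
    return (top_hits / k) / (total_hits / n)


# Toy predictions for one assay: 8 compounds, 3 true hits.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
toy_labels = [1, 1, 0, 0, 0, 1, 0, 0]
ef = enrichment_factor(toy_scores, toy_labels, top_fraction=0.25)

# Aggregating one assay's AUC over five cross-validation splits (made-up values).
auc_per_split = [0.70, 0.65, 0.80, 0.75, 0.60]
median_auc = median(auc_per_split)
print(ef, median_auc)
```

The median is less sensitive to a single unusually easy or hard split than the mean.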

├── splitting This folder contains NumPy files that help match compounds and features when creating the training and test sets for a split. They can be reused in the analysis notebook for data preparation. 
   ├── scaffold_based_split.npz Splitting for scaffold-based splits.
   ├── random_split_{}.npz Random split indices of test set compounds (10 files).
   ├── cross_validation_indicies.npz Indices for random cross-validation splits
   ├── GE_clusters_size_constrained.npz Indices of clusters of same-size clustering for gene-expression features.
   ├── MOBC_clusters_size_constrained.npz Indices of clusters of same-size clustering for batch-corrected morphology features.
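
Turning stored indices into training and test sets could look like the sketch below. It covers two layouts that are assumptions about these files: an array of test-set indices (as described for the random splits) and a per-compound fold or cluster label (one plausible layout for the cross-validation and cluster files). The function names are hypothetical and the demonstration uses toy arrays, so the real keys and shapes should be inspected before reuse.

```python
import numpy as np


def train_test_from_test_indices(n_compounds, test_indices):
    """Derive training indices as the complement of the test indices."""
    test_indices = np.unique(np.asarray(test_indices, dtype=int))
    train_indices = np.setdiff1d(np.arange(n_compounds), test_indices)
    return train_indices, test_indices


def folds_from_assignments(fold_ids):
    """Yield (fold, train_indices, test_indices) from per-compound fold labels."""
    fold_ids = np.asarray(fold_ids)
    for fold in np.unique(fold_ids):
        test = np.flatnonzero(fold_ids == fold)
        train = np.flatnonzero(fold_ids != fold)
        yield fold, train, test


# Toy random split: 10 compounds, 3 held out for testing.
train_idx, test_idx = train_test_from_test_indices(10, [7, 2, 9])

# Toy cross-validation: 6 compounds assigned to 3 folds.
folds = list(folds_from_assignments([0, 1, 0, 2, 1, 2]))
print(train_idx, test_idx, len(folds))
```

The resulting index arrays can then select rows from a feature matrix and from the assay matrix, provided both are in the same compound order.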

 

Notes

This study was supported by a grant from the National Institutes of Health (R35 GM122547 to AEC), by the Broad Institute Schmidt Fellowship program (JCC) and by the National Science Foundation (NSF-DBI award 2134695 to JCC). NM and PH acknowledge support from the LENDULET BIOMAG Grant (2018–342), from TKP2021-EGA09, SYMMETRY-ERAPerMed, from CZI Deep Visual Proteomics, H2020-Fair-CHARM, from the ELKH-Excellence grant, from OTKA-SNN 139455/ARRS N2-0136.

Files

PUMA.zip

Files (6.2 GB)

Name: PUMA.zip
Size: 6.2 GB
md5:ea0bc22b7e9affba75dd9ef51cc627fa
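
Because the archive is large, it may be worth verifying the download against the MD5 checksum listed above before unpacking. The sketch below reads the file in chunks so it does not need to fit in memory. The helper names are hypothetical, and the demonstration hashes a small temporary file rather than the real archive.

```python
import hashlib
import os
import tempfile

EXPECTED_MD5 = "ea0bc22b7e9affba75dd9ef51cc627fa"  # checksum from the file listing


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archive(path, expected=EXPECTED_MD5):
    """Return True if the file at `path` matches the expected checksum."""
    return md5_of_file(path) == expected.lower()


# Demonstration on a tiny temporary file instead of the 6.2 GB archive.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
    demo_path = tmp.name
demo_digest = md5_of_file(demo_path)
demo_matches_archive = verify_archive(demo_path)
os.remove(demo_path)
print(demo_digest, demo_matches_archive)
```

For the real download, verify_archive("PUMA.zip") should return True if the file arrived intact.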

Additional details

Related works

Cites
Journal article: 10.1073/pnas.1410933111 (DOI)
Dataset: 10.5524/100351 (DOI)
Is supplement to
Preprint: 10.1101/2020.12.15.422887 (DOI)

Funding

National Institutes of Health
Extracting rich information from biological images 1R35GM122547-01
U.S. National Science Foundation
Collaborative Research: Image-based Readouts of Cellular State using Universal Morphology Embeddings 2134695

References

  • Wawer, M. J. et al. Toward performance-diverse small-molecule libraries for cell-based phenotypic screening using multiplexed high-dimensional profiling. Proceedings of the National Academy of Sciences 111, 10911–10916 (2014).
  • Bray, M.-A. et al. A dataset of images and morphological profiles of 30 000 small-molecule treatments using the Cell Painting assay. Gigascience 6, 1–5 (2017).