Data used in "comboFM: leveraging multi-way interactions for systematic prediction of drug combination effects"

Julkunen, Heli; Cichonska, Anna; Gautam, Prson; Szedmak, Sandor; Douat, Jane; Pahikkala, Tapio; Aittokallio, Tero

doi:10.5281/zenodo.4135059

Published May 2, 2020 | Version 0.0.1

Working paper Open

1. Department of Computer Science, Helsinki Institute for Information Technology HIIT, Aalto University, Espoo, Finland
2. Department of Computer Science, Helsinki Institute for Information Technology HIIT, Aalto University, Espoo, Finland; Department of Future Technologies, University of Turku, Turku, Finland; Institute for Molecular Medicine Finland FIMM, University of Helsinki, Helsinki, Finland
3. Institute for Molecular Medicine Finland FIMM, University of Helsinki, Helsinki, Finland
4. Department of Future Technologies, University of Turku, Turku, Finland
5. Department of Computer Science, Helsinki Institute for Information Technology HIIT, Aalto University, Espoo, Finland; Institute for Molecular Medicine Finland FIMM, University of Helsinki, Helsinki, Finland; Department of Mathematics and Statistics, University of Turku, Turku, Finland; Department of Cancer Genetics, Institute for Cancer Research, Oslo University Hospital, Oslo, Norway; Oslo Centre for Biostatistics and Epidemiology, University of Oslo, Oslo, Norway

This repository contains the data used in [1] for predicting the responses of drug combinations in cancer cell lines. The data come from the NCI-ALMANAC dataset generated by the US National Cancer Institute (https://dtp.cancer.gov/ncialmanac/). The subset used in method development consists of 50 unique FDA-approved drugs, randomly sampled from the original set of drugs and tested in 617 combinations at various concentration pairs against all 60 cell lines of the NCI-60 panel. This subset contains 333 180 drug combination response measurements and 222 120 monotherapy response measurements of single drugs, all expressed as percentage growth of the cell lines.
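
The counts quoted above can be cross-checked with a little arithmetic. The sketch below confirms that the two measurement counts sum to 555 300 rows. It also shows that both counts divide evenly by 617 × 60. The resulting figures of 9 combination and 6 monotherapy measurements per combination and cell line are an inference from the arithmetic, not something stated in this description. Variable names are illustrative.

```python
# Counts taken from the dataset description above.
n_combinations = 617
n_cell_lines = 60
n_combo_measurements = 333_180
n_mono_measurements = 222_120

# Total number of response measurements in the subset.
total = n_combo_measurements + n_mono_measurements

# Number of (drug combination, cell line) pairs.
pairs = n_combinations * n_cell_lines

# Measurements per pair, with remainders to confirm even division.
combo_per_pair, combo_rem = divmod(n_combo_measurements, pairs)
mono_per_pair, mono_rem = divmod(n_mono_measurements, pairs)

print(total)                      # 555300
print(combo_per_pair, combo_rem)  # 9 0
print(mono_per_pair, mono_rem)    # 6 0
```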

The implementation of the method proposed in [1] is available at https://github.com/aalto-ics-kepaco/comboFM.

File descriptions:

NCI-ALMANAC_subset_555300.csv: Drug combination response dataset used in the method development (subset of NCI-ALMANAC_combinations_measured_across_all_cellines.csv). Each row represents one response measurement. There are 555 300 measurements in total: 333 180 drug combination response measurements and 222 120 monotherapy response measurements of single drugs. The file has six columns: 'Conc1' and 'Conc2' give the concentrations of the two drugs, 'Drug1' and 'Drug2' contain the drug names, and 'CellLine' contains the ID of the cell line against which the drug combination was screened. The last column, 'PercentageGrowth', contains the measured responses as percentage growth of the cell line.
NCI-ALMANAC_full_data.csv: Full processed NCI-ALMANAC set. The dataset available on the website was preprocessed by taking the median across studies (experiment IDs) and selecting combinations with measurements across all cell lines.
NCI-ALMANAC_combinations_measured_across_all_cellines.csv: Subset of the full dataset that includes the combinations measured across all cell lines.
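
As an illustration of the six-column layout described for NCI-ALMANAC_subset_555300.csv, the following minimal sketch parses such a file with Python's standard csv module. It assumes a comma-delimited file with a header row containing the six column names and numeric concentration values; none of this is confirmed by the description above. Columns are looked up by name, so their order does not matter. The function name read_responses and the two sample rows are made up for the example.

```python
import csv
import io

# Column names as listed in the file description.
EXPECTED_COLUMNS = {
    "Conc1", "Conc2", "Drug1", "Drug2", "CellLine", "PercentageGrowth",
}


def read_responses(handle):
    """Parse response rows from an open text handle into typed dicts.

    Assumes a comma-delimited file with a header row (an assumption,
    not something the dataset description guarantees).
    """
    reader = csv.DictReader(handle)
    missing = EXPECTED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    rows = []
    for record in reader:
        rows.append({
            "Drug1": record["Drug1"],
            "Drug2": record["Drug2"],
            "CellLine": record["CellLine"],
            "Conc1": float(record["Conc1"]),
            "Conc2": float(record["Conc2"]),
            "PercentageGrowth": float(record["PercentageGrowth"]),
        })
    return rows


# Made-up two-row sample standing in for the real file.
sample = io.StringIO(
    "Drug1,Drug2,Conc1,Conc2,CellLine,PercentageGrowth\n"
    "DrugA,DrugB,1e-06,1e-07,CellX,42.5\n"
    "DrugA,DrugB,1e-05,1e-07,CellX,-10.0\n"
)
rows = read_responses(sample)
print(len(rows), rows[0]["PercentageGrowth"])
```

With the real file, the same function could be called as read_responses(open("NCI-ALMANAC_subset_555300.csv", newline="")), provided the assumptions about delimiter and header hold.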

The directory data contains the drug combination responses and the feature matrices needed to run the experiments; each row corresponds to a row in NCI-ALMANAC_subset_555300.csv. It contains the following files:

cell_lines__one-hot_encoding.csv: One-hot encodings for the cell lines (555 300 rows, 60 columns).
drug1__one-hot_encoding.csv: One-hot encodings for the first set of drugs (555 300 rows, 50 columns).
drug2__one-hot_encoding.csv: One-hot encodings for the second set of drugs (555 300 rows, 50 columns).
drug1_concentration__one-hot_encoding.csv: One-hot encodings for the concentrations of the first set of drugs (555 300 rows, 46 columns).
drug2_concentration__one-hot_encoding.csv: One-hot encodings for the concentrations of the second set of drugs (555 300 rows, 46 columns).
cell_lines__gene_expression.csv: Gene expression data for the cell lines, restricted to the 0.05% of genes with the highest variance (555 300 rows, 78 columns).
drug1__estate_fingerprints.csv: EState fingerprints for the first set of drugs, with zero-variance bits removed (555 300 rows, 34 columns).
drug2__estate_fingerprints.csv: EState fingerprints for the second set of drugs, with zero-variance bits removed (555 300 rows, 34 columns).
drug1_drug2_concentration__values.csv: Concentration values for the first and second set of drugs, both in the same file (555 300 rows, 2 columns).
drug2_drug1_concentration__values.csv: Same as above, but with the order of the two drugs swapped (555 300 rows, 2 columns).
responses.csv: Drug combination responses (555 300 rows).
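
Because every feature file has one row per measurement, the blocks can be joined column-wise into a single design matrix. The sketch below shows one way to build one-hot blocks and concatenate them, using plain Python lists. The helper names one_hot and hstack, the alphabetical category ordering, and the shared drug vocabulary for both drug positions are assumptions made for illustration; the column order in the actual files is not documented here.

```python
def one_hot(values, categories=None):
    """Return (categories, rows), where each row is a list of 0/1 flags."""
    if categories is None:
        # Alphabetical ordering is an assumption for this example.
        categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1
        rows.append(row)
    return categories, rows


def hstack(*blocks):
    """Concatenate row-aligned feature blocks column-wise."""
    n_rows = len(blocks[0])
    if any(len(b) != n_rows for b in blocks):
        raise ValueError("all blocks must have the same number of rows")
    return [sum((list(b[i]) for b in blocks), []) for i in range(n_rows)]


# Made-up measurements: three rows, two cell lines, three drugs.
cell_lines = ["CellX", "CellY", "CellX"]
drug1 = ["DrugA", "DrugA", "DrugB"]
drug2 = ["DrugB", "DrugC", "DrugC"]

_, cell_block = one_hot(cell_lines)
drugs = sorted(set(drug1) | set(drug2))  # shared vocabulary (assumed)
_, d1_block = one_hot(drug1, drugs)
_, d2_block = one_hot(drug2, drugs)

features = hstack(cell_block, d1_block, d2_block)
print(features[0])  # [1, 0, 1, 0, 0, 0, 1, 0]
```

Applied to the five one-hot files listed above, the same concatenation would yield 60 + 50 + 50 + 46 + 46 = 252 columns per row.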

The subdirectory additional_data contains the additional data files from which the features were constructed:

NCI-60__gene_expression.txt: The full gene expression dataset obtained from the cellmineR package.
drugs__SMILES.csv: SMILES for the drug compounds for computing the fingerprints.
drugs__estate_fingerprints.csv: Fingerprints (EState) for the drug compounds.
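
The feature files described earlier were derived from these raw files by variance-based filtering: zero-variance fingerprint bits were removed, and only the highest-variance genes were kept. The following minimal sketch shows both filters on small made-up matrices, with rows as samples and columns as features. The function names, the use of population variance, and the row/column orientation are assumptions for illustration and may differ from the authors' preprocessing.

```python
from statistics import pvariance


def column_variances(matrix):
    """Population variance of each column (rows are samples)."""
    return [pvariance(col) for col in zip(*matrix)]


def drop_zero_variance(matrix):
    """Keep only the columns whose values are not all identical."""
    variances = column_variances(matrix)
    keep = [j for j, v in enumerate(variances) if v > 0]
    return keep, [[row[j] for j in keep] for row in matrix]


def top_variance_columns(matrix, fraction):
    """Keep the given fraction of columns with the highest variance."""
    variances = column_variances(matrix)
    n_keep = max(1, round(fraction * len(variances)))
    ranked = sorted(range(len(variances)),
                    key=lambda j: variances[j], reverse=True)
    keep = sorted(ranked[:n_keep])  # restore original column order
    return keep, [[row[j] for j in keep] for row in matrix]


# Made-up fingerprint counts: columns 0 and 3 are constant.
fingerprints = [
    [0, 1, 0, 2],
    [0, 0, 0, 2],
    [0, 1, 1, 2],
]
kept_bits, filtered = drop_zero_variance(fingerprints)
print(kept_bits)  # [1, 2]

# Made-up expression values: columns 1 and 3 vary the most.
expression = [
    [1.0, 5.0, 2.0, 0.0],
    [1.1, 9.0, 2.0, 4.0],
    [0.9, 1.0, 2.1, 8.0],
]
kept_genes, selected = top_variance_columns(expression, fraction=0.5)
print(kept_genes)  # [1, 3]
```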

Similar files are also available for a validation set not used in method development; they follow the same structure as described above. NCI-ALMANAC_subset_2225137__validation_train.csv contains the training set used in the validation experiment, consisting of the full development dataset and the monotherapy responses in the validation set, and NCI-ALMANAC_subset_2225137__validation.csv contains the validation dataset. The corresponding features are included in validation_data_train and validation_data, respectively.

The directory experimental_validation_data contains the results of the experimental validation described in [1].

The directory source_data contains the data underlying the figures and display items in [1], with a separate file for each prediction setting and compared method. These files contain information on each drug combination and cell line, along with the measured and predicted percentage growth responses and the resulting NCI ComboScores.

References:

[1] Julkunen, H.; Cichonska, A.; Gautam, P.; Szedmak, S.; Douat, J.; Pahikkala, T.; Aittokallio, T.; Rousu, J. comboFM: leveraging multi-way interactions for systematic prediction of drug combination effects.

Files

comboFM_data.zip (420.1 MB), md5:f59e6cd594bf952c62ca1ea83b7d3bb7