ChemBioSim: Biological Assay and in Vivo Toxicity Models
Creators
- 1. BASF SE; Department of Pharmaceutical Sciences, Faculty of Life Sciences, University of Vienna
- 2. In Silico Toxicology and Structural Bioinformatics, Institute of Physiology, Charité Universitätsmedizin Berlin
- 3. MTM Research Centre, School of Science and Technology, Örebro University
- 4. BASF SE
- 5. Department of Pharmaceutical Sciences, Faculty of Life Sciences, University of Vienna
Description
ChemBioSim: Enhancing Conformal Prediction of in vivo Toxicity by Use of Predicted Bioactivities
Project description
Computational methods such as machine learning approaches have a strong track record of success in predicting the outcomes of in vitro assays. In contrast, their ability to predict in vivo endpoints is more limited due to the high number of parameters and processes that may influence the outcome. Recent studies have shown that the combination of chemical and biological data can yield better models for in vivo endpoints.
The ChemBioSim approach presented in this work aims to enhance the performance of conformal prediction models for in vivo endpoints by combining chemical information with (predicted) bioactivity assay outcomes. Three in vivo toxicological endpoints, capturing genotoxic (MNT), hepatic (DILI) and cardiological (DICC) issues, were selected for this study due to their high relevance for the registration and authorization of new compounds. Since the sparsity of available biological assay data is challenging for predictive modelling, predicted bioactivity descriptors were introduced instead. Thus, a machine learning model for each of the 373 collected biological assays was trained and applied on the compounds of the in vivo toxicity data sets. Besides the chemical descriptors (molecular fingerprints and physicochemical properties), these predicted bioactivities served as descriptors for the models of the three in vivo endpoints. For this study, a workflow based on a conformal prediction framework (a method for confidence estimation) built on random forest models was developed. Furthermore, the most relevant chemical and bioactivity descriptors for each in vivo endpoint were preselected with lasso models.
The incorporation of bioactivity descriptors increased the mean F1 score of the MNT model from 0.61 to 0.70 and that of the DICC model from 0.72 to 0.82, while the mean efficiencies increased by roughly 0.10 for both endpoints. In contrast, no significant improvement in model performance was observed for the DILI endpoint. Beyond the performance improvements, an analysis of the most important bioactivity features revealed novel and less intuitive relationships between the predicted biological assay outcomes used as descriptors and the in vivo endpoints.
This study shows how the prediction of in vivo toxicity endpoints can be improved by incorporating biological information, which is not necessarily captured by chemical descriptors, in an automated workflow. Because predicted outcomes of bioactivity assays were used as bioactivity descriptors, no additional experimental work is needed to generate them.
Models
The models provided here are (1) the 373 models used for deriving the bioactivity descriptors and (2) those including the bioactivity descriptors as input for predicting in vivo toxicity (for the MNT, DILI and DICC endpoints).
1. Biological assays
These files contain conformal prediction (CP) models trained on biological assays and used for deriving the p-values that serve as bioactivity descriptors. The inputs to these models are the chemical descriptors described in the ChemBioSim paper, which can be calculated with the KNIME workflow provided in the supplementary information.
These models can be loaded e.g. with cloudpickle and used to make new predictions. The output values are the (CP) p-values for the inactive and active classes.
The p-values used in the manuscript are calculated as the mean p-value obtained from the five cross-validation models.
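The averaging over the five cross-validation models can be sketched as follows. This is a minimal illustration, not the authors' code: the helper names `load_pickled` and `mean_p_values` are hypothetical, and the sketch assumes that the p-values of each model have already been collected as one (p0, p1) pair per compound. How a loaded CP model is called to produce these p-values is not described here and is therefore left out.

```python
from statistics import mean


def load_pickled(path):
    """Load one pickled object (a CP model or a scaler) from disk."""
    # Imported lazily so the averaging helper below works without cloudpickle.
    import cloudpickle

    with open(path, "rb") as handle:
        return cloudpickle.load(handle)


def mean_p_values(per_model_p_values):
    """Average (p0, p1) rows over several cross-validation models.

    per_model_p_values: one entry per model; each entry holds one
    (p0, p1) pair per compound, in the same compound order.
    """
    n_compounds = {len(model_rows) for model_rows in per_model_p_values}
    if len(n_compounds) != 1:
        raise ValueError("all models must score the same compounds")

    averaged = []
    # zip(*...) groups the rows of one compound across all models.
    for rows_for_compound in zip(*per_model_p_values):
        p0 = mean(row[0] for row in rows_for_compound)
        p1 = mean(row[1] for row in rows_for_compound)
        averaged.append((p0, p1))
    return averaged
```

With the outputs of five models collected in a list, `mean_p_values(outputs)` returns one averaged (p0, p1) pair per compound.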
2. In vivo endpoints
This file contains three types of conformal prediction (CP) models: (a) CHEM models trained exclusively on chemical descriptors, (b) BIO models based exclusively on predicted bioactivity descriptors and (c) CHEMBIO models based on the combination of both types of descriptors. The models were trained on a subset of features selected with a lasso model. The selected features for each of the cross-validation models are provided in the "*_columns.csv" files. The input for each model is the selection of CHEM/BIO/CHEMBIO descriptors indicated in the corresponding CSV file. The chemical descriptors used here are described in the ChemBioSim paper and can be calculated with the KNIME workflow in the supplementary information (same descriptors as for the biological assays). The biological descriptors are the p-values of the inactive (p0) and active (p1) classes predicted with the biological assay CP models.
The models can be loaded e.g. with cloudpickle and used to make new predictions. The output values are the (CP) p-values for the inactive and active classes.
Usage
For the biological assays:
- Download the models.
- Calculate the chemical descriptors for the input molecules with the KNIME workflow provided with the ChemBioSim publication.
- Use the respective data scaler model for each endpoint to scale the descriptors before applying the models (as was done for the training data). These are StandardScaler models from scikit-learn and can be loaded e.g. with cloudpickle. The scalers for all biological assays can be found under "data_scaler_biological_assays".
- Load the models and use the calculated, scaled descriptors as input to obtain the predictions. Note that Python 3.6 is required to load the CP models.
- Calculate the mean p-value for each class and endpoint. These values can be used for class labeling or as input for a further model (as bioactivity descriptor).
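Class labeling from the mean p-values can be sketched as below. The rule shown is the one commonly used in conformal prediction (a class enters the prediction set when its p-value exceeds the chosen significance level); whether it matches the published workflow in every detail is an assumption. The function names are hypothetical, and the efficiency helper uses a common definition (the fraction of single-class predictions).

```python
def prediction_set(p0, p1, significance):
    """Return the conformal prediction set for one compound.

    A class is kept when its p-value exceeds the significance level,
    so the result can be empty, a single class, or both classes.
    """
    if not 0.0 < significance < 1.0:
        raise ValueError("significance must lie strictly between 0 and 1")
    labels = set()
    if p0 > significance:
        labels.add("inactive")
    if p1 > significance:
        labels.add("active")
    return labels


def efficiency(prediction_sets):
    """Fraction of compounds that receive exactly one class label."""
    if not prediction_sets:
        raise ValueError("at least one prediction set is required")
    single = sum(1 for labels in prediction_sets if len(labels) == 1)
    return single / len(prediction_sets)
```

For example, `prediction_set(0.6, 0.05, 0.2)` keeps only the inactive class, whereas a compound with both p-values above 0.2 is assigned both classes and one with both below 0.2 receives an empty set.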
For the in vivo endpoints:
- Follow steps (1) and (2) as described for the biological assays.
- Select the input descriptors:
  - For the CHEM descriptor set: filter out the descriptors not included in the "*_columns.csv" files of each cross-validation run.
  - For the BIO and CHEMBIO descriptor sets: calculate the p-values of the biological assays and append these as descriptors for the input data. In the case of CHEMBIO, the CHEM descriptor set has to be included as well. The "*_columns.csv" files indicate which p-values (and CHEM descriptors) are needed in each cross-validation run, as well as the descriptor order to use as input. The remaining descriptors not included in the respective file should be filtered out.
- Follow steps (3), (4) and (5) as described for the biological assays. The scaler models for the in vivo endpoints are in the "in_vivo_endpoints" file. As above, Python 3.6 is required to load the CP models. The mean p-values from the cross-validation can be used for class labeling at the desired significance level.
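The descriptor selection step can be sketched as follows. This is an illustrative helper only: `select_descriptors` and the descriptor names in the usage note are hypothetical, and the sketch assumes the column names have already been read from the relevant "*_columns.csv" file into a list, in file order (the exact layout of these files is not described here).

```python
def select_descriptors(rows, ordered_columns):
    """Keep only the listed descriptors, in the listed order.

    rows: one dict per compound mapping descriptor name -> value.
    ordered_columns: names taken from a "*_columns.csv" file.
    """
    matrix = []
    for index, row in enumerate(rows):
        # Fail early if a required descriptor was never calculated.
        missing = [name for name in ordered_columns if name not in row]
        if missing:
            raise KeyError(f"compound {index} lacks descriptors: {missing}")
        # Descriptors absent from ordered_columns are dropped here.
        matrix.append([row[name] for name in ordered_columns])
    return matrix
```

Calling `select_descriptors([{"desc_a": 1.0, "desc_b": 2.0, "assay_1_p0": 0.3}], ["assay_1_p0", "desc_a"])` drops `desc_b` and returns the remaining values in the requested order. Raising an error on missing descriptors, rather than filling in a default, avoids silently feeding a model an input that differs from what it was trained on.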
Citation
These models are part of the ChemBioSim publication:
Garcia de Lomana, M.; Morger, A.; Norinder, U.; Buesen, R.; Landsiedel, R.; Volkamer, A.; Kirchmair, J.; Mathea, M., ChemBioSim: Enhancing Conformal Prediction of In Vivo Toxicity by Use of Predicted Bioactivities. J. Chem. Inf. Model. 2021, 61, 3255-3272. doi: 10.1021/acs.jcim.1c00451
Files
(186.8 GB)
File (MD5 checksum) | Size |
---|---|
md5:239aa9269a07cf4fb930cdf5d492f2e1 | 48.8 GB |
md5:37619af3d2910c759db74b2c1c89e595 | 43.6 GB |
md5:6b18569a29f0bbf9507c139ea907caca | 47.8 GB |
md5:04e8702f6092700503227149c9451ff0 | 44.3 GB |
md5:eca8b58cc945353e55f04b0c4c2ee357 | 9.7 MB |
md5:de8b96cf8621f922d356975c8f27fc67 | 2.3 GB |