==============================================================================================
-----------------------------------------------------------------------------------------------
#Supporting data and code for: Machine Learning analysis of protein-ligand interaction fingerprints (IF) extracted from tau-RAMD dissociation trajectories for inhibitors of HSP90
This is a set of Jupyter Notebook scripts for analysis of the protein-ligand binding contacts in
protein-ligand dissociation trajectories generated in RAMD (random acceleration molecular dynamics) simulations.
The aim is to derive a regression model for estimating the relative residence time
and to decipher the molecular features leading to longer residence times.
The present script is written for HSP90 and requires some adaptation if applied to another system.
--------------------------------------------------------------------------------------------
This script has been used to generate the results described in:
Kokh DB, Kaufmann T, Kister B, Wade RC.
Machine learning analysis of tauRAMD trajectories to decipher molecular determinants of drug-target residence times, (2019) submitted.
----------------------------------------------------------------------------------------------
#### Authors:
Daria Kokh, Tom Kaufmann
#### Version:
v2.1 05.08.2019
#### Project Manager:
Dr. Rebecca Wade
Heidelberg Institute for Theoretical Studies (HITS)
Schloss-Wolfsbrunnenweg 35
D-69118 Heidelberg, Germany
#### Contact:
E-Mail: mcmsoft(at)h-its.org
-------------------------------------------------------------------------------------------
### 1. Prerequisites:
Python 3 (the script was tested with version 3.6.4)
Python libraries required:
numpy, pandas v0.24, scikit-learn, seaborn, matplotlib, xlrd
--------------------------------------------------------------------------------------------
### 2. Running the scripts:
Run the Jupyter notebook jupyter_notebooks/clustering.ipynb
---------------------------------------------------------------------------------------------
### 3. Method
The script constructs:
- Regression models (RMs) for predicting the residence time:
linear regression with ridge regularization (LR)
and support vector regression (SVR).
The models are trained on the experimental unbinding rates, koff, on a logarithmic scale.
- Clusters of the compounds, using Gaussian Mixture Models
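The two steps listed above can be sketched with scikit-learn as follows. This is a minimal illustration on synthetic data, not the notebook's actual code: the feature matrix, the log10(koff) values, the fixed hyperparameters, the linear SVR kernel, and the diagonal covariance type are all assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

rng = np.random.default_rng(0)

# Synthetic stand-in for the real data: 94 compounds x 12 IF occupancies in [0, 1].
n_compounds, n_features = 94, 12
X = rng.random((n_compounds, n_features))
# Hypothetical log10(koff) values: a linear signal plus noise.
log_koff = X @ rng.normal(size=n_features) + rng.normal(scale=0.1, size=n_compounds)

X_train, X_test, y_train, y_test = train_test_split(
    X, log_koff, test_size=0.2, random_state=0
)

# Regression on the logarithmic scale; hyperparameter values are placeholders.
models = {
    "LR": Ridge(alpha=1.0),
    "SVR": SVR(kernel="linear", C=1.0, epsilon=0.1),
}
mae = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    mae[name] = float(np.mean(np.abs(model.predict(X_test) - y_test)))

# Clustering of compounds by IF occupancy; choose the component count by AIC.
aic = {}
for k in range(1, 5):
    gmm = GaussianMixture(n_components=k, covariance_type="diag", random_state=0)
    aic[k] = gmm.fit(X).aic(X)
best_k = min(aic, key=aic.get)
cluster_labels = GaussianMixture(
    n_components=best_k, covariance_type="diag", random_state=0
).fit_predict(X)
```

In practice the hyperparameters would be tuned for each training/test split rather than fixed, and the split would be repeated many times to obtain distributions of the scores.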
--------------------------------------------------------------------------------------------
### 4. Input Data
#### Metadata:
../data/metadata1.xlsx
This file contains a list of all compounds with their measured kinetic rate constants and computed molecular properties.
The experimental data are from:
Amaral, M., et al. (2017). Nat. Commun. 8, 2276
Kokh, D., et al. (2018). J. Chem. Theory Comput 14, 3859-3869
Schuetz, D. A., et al. (2018). J. Med. Chem. 61, 4397-4411
#### There are three different data sets of interaction fingerprints (IF) for 94 inhibitors of HSP90:
(These differ by the length of the starting part of the trajectory that is discarded)
1. ../data/full_data_all_ligands_new-2-94 - Model A:
all snapshots are discarded in which fewer than 2 protein-ligand contacts in the bound state structure are lost
2. ../data/full_data_all_ligands_new-20-94 - Model B:
all snapshots are discarded in which less than 20% of the protein-ligand contacts in the bound state structure are lost
3. ../data/full_data_all_ligands_new-60-94 - Model C:
all snapshots are discarded in which less than 60% of the protein-ligand contacts in the bound state structure are lost
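The percentage-based filters of Models B and C can be illustrated with a short sketch. It assumes that each snapshot is represented as a set of (contact type, residue number) pairs and that "lost" means a bound-state contact is absent from the snapshot; the function names and toy contacts are hypothetical. Model A uses an absolute count of lost contacts instead of a fraction and would need a count-based variant of the same test.

```python
def fraction_of_bound_contacts_lost(bound_contacts, frame_contacts):
    """Fraction of the bound-state contacts that are absent in a frame."""
    bound = set(bound_contacts)
    if not bound:
        raise ValueError("bound-state contact set is empty")
    return len(bound - set(frame_contacts)) / len(bound)


def discard_bound_like_frames(bound_contacts, frames, min_fraction_lost):
    """Keep only frames that lost at least `min_fraction_lost` of the bound-state contacts."""
    return [
        frame
        for frame in frames
        if fraction_of_bound_contacts_lost(bound_contacts, frame) >= min_fraction_lost
    ]


# Toy data: contacts are (contact type, residue number) pairs (hypothetical values).
bound_state = {("HB", 93), ("APO", 107), ("ARO", 138), ("APO", 150), ("ION", 58)}
trajectory = [
    {("HB", 93), ("APO", 107), ("ARO", 138), ("APO", 150), ("ION", 58)},  # 0% lost
    {("HB", 93), ("APO", 107), ("ARO", 138), ("APO", 150)},               # 20% lost
    {("HB", 93), ("APO", 107)},                                           # 60% lost
    set(),                                                                # 100% lost
]

kept_model_b = discard_bound_like_frames(bound_state, trajectory, 0.20)
kept_model_c = discard_bound_like_frames(bound_state, trajectory, 0.60)
```

With this toy trajectory, the 20% threshold keeps the last three snapshots and the 60% threshold keeps the last two.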
These IFs were obtained from analysis of the ligand dissociation trajectories using the following protocol:
- Dissociation trajectories were generated using the tauRAMD approach (Kokh, D., et al. (2018). J. Chem. Theory Comput. 14, 3859–3869);
software is available at: https://www.h-its.org/downloads/ramd/
- The coordinates extracted from these trajectories were used to generate interaction fingerprints for each frame using an OpenEye script (OpenEye, 2016).
- The interaction fingerprints were then grouped into four categories of protein-ligand contacts:
hydrogen-bond, aromatic, ionic, and apolar interactions. Each category was assigned a value of 1 or 0 according to whether the contact type was, respectively, present or absent.
#### Each directory contains:
- DataFrame.p - Database of IFs (binary file); this is the only file needed for building the ML models.
Each database can be generated from the raw data using the Python script execute_full_data_loading.py (in the directory python_files).
The interaction fingerprints, IF, are provided in the input database file.
For each trajectory and snapshot, IF is defined by a vector containing in each element the protein residue number and contact type:
APO - apolar (or van der Waals),
ARO - in-plane or edge aromatic interaction,
HB - hydrogen-bond donor or acceptor,
ION - ionic interaction
The IFs are averaged over each extracted dissociation trajectory to obtain their occurrence and then these values are averaged once again over all trajectories, so that at the end, for each compound, we have just one number for each IF feature.
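The two-stage averaging described above can be sketched with NumPy. The example assumes that each trajectory is stored as a binary array of shape (n_frames, n_features); the function name, feature labels, and toy values are hypothetical.

```python
import numpy as np


def compound_if_occupancy(trajectories):
    """Average binary IF vectors over frames, then over trajectories.

    `trajectories` is a list of 2-D arrays of shape (n_frames, n_features);
    the number of frames may differ between trajectories.
    """
    per_trajectory = [np.asarray(t, dtype=float).mean(axis=0) for t in trajectories]
    return np.mean(per_trajectory, axis=0)


feature_names = ["HB_93", "APO_107", "ARO_138"]  # hypothetical feature labels
traj_1 = [[1, 1, 0],
          [1, 0, 0],
          [0, 0, 0],
          [0, 0, 0]]
traj_2 = [[1, 1, 1],
          [1, 1, 0]]

# One occupancy value per IF feature for this compound.
occupancy = compound_if_occupancy([traj_1, traj_2])
```

Averaging within each trajectory first gives every trajectory equal weight regardless of its length, which differs from pooling all frames into a single average.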
-------------------------------------------------------------------------------------------------
### 6. Output Data
- Cluster_traj.png - Clustering of the compounds from analysis of the IF occupancies in the trajectories
- Cluster_times.png - Mean residence times of the compounds in each cluster
- Cluster_IF.png - Weights of the IFs in the clusters
- Cluster-GMM_Akaike.png - Akaike information criterion (AIC) plot for the molecular features in the clusters
- Cluster_IF-MF.png - Weights for molecular features versus interaction fingerprints
- LR_coef.png - Coefficients of the IFs in the linear regression model (averaged over all training/test set splits)
- LR_main_terms.png - Occurrence of the IFs with the largest coefficients (top 40%) averaged over all training/test set splits
- RM_hyperparam.png - Distributions of the hyperparameters over all of the training/test set splits (normally 200 rounds)
- RM_represetative_plot.png - Representative plots for regression models (RM) of computed vs experimental residence time on a logarithmic scale
- RM_MEA_in_intervals.png - Mean absolute error (MAE) of the test data split into four intervals of log(residence time) and averaged in each interval
- RM_evaluation.png - Summary of the assessment (according to MAE and Q2) of the Regression Models (RM)
- RM_evaluation.txt - Mean and SD values of MAE and R2 and Q2 scores
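As an illustration of the interval-wise MAE reported in RM_MEA_in_intervals.png, the sketch below splits the test data into four intervals of the experimental value and averages the absolute error in each. The use of equal-width intervals spanning the range of the experimental values is an assumption, as are the function name and the toy numbers.

```python
import numpy as np


def mae_in_intervals(y_true, y_pred, n_intervals=4):
    """MAE of the predictions within equal-width intervals of the experimental value."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    edges = np.linspace(y_true.min(), y_true.max(), n_intervals + 1)
    # Binning on the inner edges yields interval indices 0 .. n_intervals - 1.
    interval_index = np.digitize(y_true, edges[1:-1])
    errors = np.abs(y_pred - y_true)
    mae = [
        errors[interval_index == i].mean() if np.any(interval_index == i) else float("nan")
        for i in range(n_intervals)
    ]
    return edges, mae


# Hypothetical log(residence time) values and prediction errors.
exp_log_tau = np.array([0.0, 0.5, 1.5, 2.5, 3.5, 4.0])
comp_log_tau = exp_log_tau + np.array([0.1, -0.3, 0.2, 0.4, -0.5, 0.1])

edges, mae_per_interval = mae_in_intervals(exp_log_tau, comp_log_tau)
```

Intervals that contain no test compounds are reported as NaN rather than zero, so that an empty interval is not mistaken for a perfect prediction.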
### 7. Changes
1. V2.1
- The only change in version 2.1 relative to 2.0 is a bug fix in the Q2 function:
v2.0: 1- mean_squared_error(exp_test,comp_test)/np.sum(pow(exp_train-np.mean(comp_train),2)/len(exp_train))
v2.1: 1- mean_squared_error(exp_test,comp_test)/np.sum(pow(exp_train-np.mean(exp_train),2)/len(exp_train))
The bug did not have any notable effect on the Q2 distribution for HSP90; see the plot AUXI/ramd-ml-bug-fixed.jpg
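For reference, the corrected v2.1 expression can be written as a small standalone function. This is a sketch, not the notebook's own code: the function name and toy values are hypothetical, and the mean squared error is computed directly with NumPy instead of scikit-learn's mean_squared_error, which gives the same value for one-dimensional arrays.

```python
import numpy as np


def q2_score(exp_train, exp_test, comp_test):
    """Q2 as in v2.1: 1 - MSE(test) / variance of the experimental training values."""
    exp_train = np.asarray(exp_train, dtype=float)
    exp_test = np.asarray(exp_test, dtype=float)
    comp_test = np.asarray(comp_test, dtype=float)
    mse_test = np.mean((exp_test - comp_test) ** 2)
    # The reference is the mean of the EXPERIMENTAL training values (the v2.1 fix).
    train_variance = np.sum((exp_train - np.mean(exp_train)) ** 2) / len(exp_train)
    return 1.0 - mse_test / train_variance


# Toy values (hypothetical).
exp_train = [1.0, 2.0, 3.0, 4.0]
exp_test = [2.0, 3.0]
comp_test = [2.5, 2.5]

q2 = q2_score(exp_train, exp_test, comp_test)
q2_perfect = q2_score(exp_train, exp_test, exp_test)
```

Here the training variance is 1.25 and the test MSE is 0.25, so Q2 is 0.8; a perfect prediction of the test set gives Q2 = 1.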