SWAAT Bioinformatics Workflow for Protein Structure-Based Annotation of ADME Gene Variants


	2. Methods



	2.1. Obtaining 3D Structures of Proteins

The 32 core ADME genes according to PharmaADME [5,25] (Supplementary Materials: data file 1) and the 23 genes from PharmVar (www.pharmvar.org, accessed on the 23 January 2020) were screened to identify available 3D structures in the Protein Data Bank (PDB). PharmVar aims to centralize the information regarding actionable pharmacogenes by integrating data from the Pharmacogenomic KnowledgeBase database (PharmGKB) and the Clinical Pharmacogenetic Implementation Consortium (CPIC). The criteria of selection among different structures of one protein included resolution, coverage, and completeness. All the structures were stripped from any heteroatoms including ligands, cofactors, and ions. Then, we used MODELLER [26] homology modeling software to predict the 3D coordinates of missing segments and atoms. In addition, ADME core proteins with no experimental 3D structure were processed with homology modeling. Proteins with poor alignment quality between the template and the target are filtered. We generated 20 conformations starting from different random seeds of which we selected the structure with the best DOPE score [27]. Then, stereochemical quality was verified by establishing the Ramachandran plot [28].


	2.2. Constructing and Evaluating A Predictive Model for Variant Effect Prediction

We compiled a dataset of mutations with experimentally determined ΔΔG values (difference between the folding energy levels of the reference structure and the mutated structure). The dataset is a combination of the Capriotti [29], Khan [30], and Guerios [31] datasets. A total of 2614 data points were collected covering a range of ΔΔG values between −12 and 12.7 kcal/moL and consisting of mutations belonging to 96 unique protein structures. High-impact variants are defined as changes in the amino acid sequence that affect the variation of the Gibbs energy by increasing or decreasing it beyond the threshold value of ±0.5 kcal/moL. We conducted a filtering process to remove outliers based on the calculated ΔΔG and the vibrational entropy difference between the reference structure and the variant (ΔΔSvib), which leads to retaining 2015 data points from an initial number of 2614 to use for training and validating the model. For each mutation, we computed 12 features including ΔΔG, ΔΔSvib, the solvent-accessible surface area (SASA) ratio between the reference and the variant amino acids, the Position-Specific Scoring Matrix (PSSM) score for substituting the reference amino acid by the variant and by itself, respectively, Sneath’s index, Grantham’s index, and BLOSUM62 substitution score, amino acid volume descriptor (BIGC670101) from AAindex, hydrophobicity descriptor (JOND750101) from AAindex [32], and the difference of the total protein solvent-accessible surface area.
The dataset was split into training (75%) and test (25%) datasets. The training dataset was used to build the model and fit the parameters, while the test dataset was used to evaluate the performance of a trained model and detect overfitting that renders the classifier unusable for prediction purposes. Mutations showing values of −0.5<ΔΔG<0.5 kcal/moL were labeled as neutral, while mutations with ΔΔG outside that range are regarded to cause high-impact changes on the protein function. The predictive model was built using the Python library scikit-learn [33]. We tested several algorithms to classify the variants as neutral or high-impact variants. Metrics of performance were calculated over a 10-fold cross-validation process and were used for additional criteria of optimization to build the predictor. Cross-validation helps to determine the stability of the performance of a model by sampling different portions of the data iteratively.
The random forest algorithm performed better than any other tested approach. Hyperparameter optimization resulted in an adjustment of the number of trees to 1000, a minimum number of samples used to split the nodes to 2, and the minimum samples required to be at a leaf node of 42. The maximum number of features for the best tree split is defined by the square root of the total number of features. The maximum depth of the tree is set to 60, and bootstrap samples were used to build the model.


	2.3. Evaluation of the Classifier’s Performance Using a Benchmarking Dataset

A list of variants, whose functional effects on the protein are known, were collected as the benchmarking dataset. These include 64 variants of DPYD gene characterized by the study of Shrestha et al. [34]. ’High-impact’ and ’neutral’ variants were defined by a threshold activity change of 70% as suggested by the authors in their original paper. We also collected a set of 55 variants belonging to different ADME genes from the CYP P450 superfamily including CYP2D6, CYP2B6, CYP2A6, CYP2C9, CYP2C19, CYP2E1, and CYP3A4 from the PharmVar database. For these variants, we used the CPIC functional annotation to assign the classes. In addition, we included 39 more variants belonging to TP53 and collected from 5 different studies [35,36,37,38,39]. Neutral variants were the ones that present ΔΔG values between −0.5 and 0.5 kcal/mol. The other variants were assigned to the ‘high-impact’ class. Each of the benchmarking variants was submitted to the Variant Effect Predictor [40] webserver to predict the functional binary classification using SIFT, PolyPhen-2, PROVEAN, MetaSVM, CADD, and FATHMM. The prediction tools run their default settings within the web server, and their returned binary classification was assigned to the ‘high-impact’ and ‘neutral’ classes.
The performance of the classification was evaluated for each tool as well as the SWAAT classifier by calculating the accuracy, sensitivity (true positive rate or TPR), and specificity (true negative rate or TNR) according to the equations, accuracy=TP+TNTP+TN+FP+FNTPR=TPTP+FNTNR=TNTN+FP
with TN, TP, FN, and FP corresponding to the counts of true negatives, true positives, false negatives, and false positives, respectively.
In addition, we established the Receiver Operating Characteristic (ROC) plot using the scikit-lean [33] Python library to calculate the true positive rates, the false positive rates, and the Area Under the Curve (AUC) from the raw scores of the different variant predictors. For SWAAT, we used the class probability score to perform the same calculation.


	2.4. Dependencies for Working with SWAAT

SWAAT was built and tested on version 20.10.0 of Nextflow [41]. To run the main workflow, certain requirements should be satisfied. The user must have version 3 of Python with installed modules Pandas, Numpy Matplotlib, Bokeh, and biopython [42]. Several other software must be installed and added to the path, including freesasa [43], FoldX [44], EnCoM [45], and stride [46]. These dependencies are required to run the annotation workflow. PRODRES pipline (https://github.com/ElofssonLab/PRODRES, accessed on the 2 February 2022) is required to calculate the PSSMs of the protein sequences to annotate during the preparation process used by the auxiliary workflow. SWAAT uses pre-mapped coordinates data of the genetic positions with their corresponding amino acid positions. The mapping was performed using the Transvar tool [47] running within the auxiliary workflow. The latter screens the amino acids of the canonical protein reference sequence and extracts the genomic positions of their corresponding DNA codons. We have implemented a series of quality check routines in the Python code ( prot2genCoor.py) that runs this process to ensure reliability. SWAAT uses GRCh37 assembly to report on the genomic coordinates.


	2.5. Overall Description of SWAAT

SWAAT is a workflow tool that aims to annotate missense variants of ADME genes. It uses VCF files as input, extracts the information to map the genetic coordinates to protein coordinates, structurally models the missense protein variants, calculates their biophysical and structural properties, and finally reports all this information to assess the status and the pharmacological relevance of the variant. The information provided by SWAAT considers different scenarios by which the variant can lead to a significant impact on the protein function. First, we considered the effect of the variant on the protein stability by reporting the difference of the folding energies between the wild type and the variant ( ΔΔG). For such an end, the FoldX software package [44] is used for its calculation accuracy and efficiency. Moreover, SWAAT integrates many qualitative criteria to help to assess the effect of the variant on protein stability. These involve eleven molecular events that include disulfide breakage, introducing a buried proline, replacing a buried glycine, introducing a buried hydrophilic residue, introducing a buried charged residue, switching the formal charge of a buried residue, changing the secondary structure, replacing a buried charged amino acid, switching the exposure state, replacing an exposed hydrophilic residue with a hydrophobic residue, salt bridge breakage, and the induction of large helical penalty in an α-helix structure caused by a substitution to a glycine or a proline. These criteria are part of the assessment approach used by missense3D [48] that were found to be disease-associated features. SWAAT also identifies the residues that are part of a hotspot patch defined as a cluster of amino acids in the 3D space, distant to each other by at most 6 Å and showing a folding energy difference of at least 2 kcal/moL when substituted to alanine.
Second, a variant can induce a perturbation in the conformational space of a protein which can cause a population shift in the free energy landscape [18]. The perception of biomolecular systems exerting a biological function as rigid entities has been discarded since long ago [49]. Proteins can populate various states at different levels of the energy landscape. This confers the plasticity that is required to undergo complex functional mechanisms such as allostery, domain–domain movements, and structural arrangement to accommodate the ligand in the interaction site. The conformational landscape can be assessed using simulation techniques such as molecular dynamics [50] and Monte Carlo methods [51]. However, these methods are laborious, computationally intensive, and unsuitable for screening variants even for a limited number of genes. To account for the large-scale conformational movements, SWAAT integrates the calculation of the protein normal modes using ENCoM [45]. Normal modes approximate the conformational space to a quadratic potential where the protein oscillates around an equilibrium conformation at low frequency [52]. Unlike other methods, ENCoM accounts for the diversity of amino acids thanks to a specific set of coarse grain parameters, thus allowing to study the effect of mutations. The eigenvectors calculated by ENCoM are used to compute ΔΔSvib.
Finally, the user will be able to annotate the genetic variants of ADME genes based on their putative role in binding drugs. We integrated the FTMap [53] data that provide information about the drug interaction hotspots in each of the ADME proteins. These hotspots consist of a set of amino acids evaluated for their capacity to bind 16 probe molecules. SWAAT reports the Z-score and Percentile score statistics as well as the number of hits per hotspot to allow for the quantitative differentiation of residues that are potential drug-binders.


	2.6. Implementation

SWAAT consists of a set of Python scripts and Bash instructions whose series of execution is managed within Nextflow’s DSL [41]. An auxiliary workflow was developed to build a database that contains information retrieved during the annotation process (Figure 1A) (https://github.com/hothman/SWAAT/tree/master/auxiliary_wf, accessed on the 2 February 2022). A database for the 36 ADME genes was prepared and made available in the main repository of SWAAT (https://github.com/hothman/SWAAT/tree/master/database, accessed on the 2 February 2022). Therefore, users who need to annotate these ADME genes can use the built-in database without running the auxiliary workflow. The auxiliary workflow was made available for versatility purposes in the case where users desire to annotate another set of genes. For example, one of the potential applications is the annotation of a set of functionally related genes implicated in a signaling pathway or a generic biological function. In such a case, the user needs to provide a list of the Uniprot identifiers and the PDB structures corresponding to the genes to annotate. Then, the auxiliary workflow can be run to generate and aggregate the information required by the main workflow for the annotation. These include protein to gene mapping, protein to structure mapping, and other sources of data. One of the limitations of this process is the necessity of large database files required to run the PRODRES pipeline in order to generate the PSSMs. However, the user may skip this process and can provide these files manually by submitting the sequences of the proteins to the webserver version of PRODRES (https://prodres.bioinfo.se, accessed on the 2 February 2022). To get the information about the drug interaction hotspots, the user needs to submit the PDB files to the FTMap server before running the auxiliary workflow [53].
The core functionality of SWAAT includes the main workflow (Figure 1B) that can process a list of variants, annotate them according to sequence-based and biophysical-based properties, and return a detailed report in HTML and CSV format. The user needs only to provide the list of variants in separated VCF files relative to each gene to annotate. The user can restrict the analysis to limited genes by providing a list of their corresponding Uniprot identifiers. The most important parameter used by SWAAT is the path to the database files generated by the auxiliary workflow. These include data about the amino acid sequence, the protein to genome mapping, the PDB to protein sequence mapping, the hotspot patch members, the PSSM files, precalculated BLOSUM62, Grantham, and Sneath scores, and the normal modes of the reference structures.
