QSAR rationales for the dipeptidyl peptidase-4 (DPP-4) inhibitors: The imidazolopyrimidine amides

The DPP4 inhibition activity of imidazolopyrimidine amides has been quantitatively analyzed in terms of chemometric descriptors. The statistically validated QSAR models provide rationales to explain the inhibition activity of these congeners. The descriptors identified through CP-MLR analysis highlight the role of the mean electrotopological state (Ms), the number of double bonds in the molecular structure (nDB), the 2D Petitjean shape index (PJI2), the Moran autocorrelation of lag-2 weighted by atomic polarizabilities (MATS2p), the Moran autocorrelation of lag-6 and the lowest eigenvalue n.5 of the Burden matrix, both weighted by atomic Sanderson electronegativities (MATS6e and BELe5), and the lowest eigenvalue n.3 and highest eigenvalue n.1 of the Burden matrix, both weighted by atomic van der Waals volumes (BELv3 and BEHv1). In addition, the second-order mean Galvez topological charge index (JGI2), the number of ring tertiary C(sp3) atoms (nCrHR) and R--CR--X type structural fragments (C-028) also proved prevalent in modeling the inhibitory activity. In the statistically validated models, the positive contributions of the descriptors Ms, PJI2, JGI2, MATS2p, BELe5, BELv3 and BEHv1 suggest that higher values of these are conducive to improving the DPP4 inhibition activity. On the other hand, the negative contributions of the descriptors nDB, C-028, nCrHR and MATS6e suggest that the absence of double bonds (nDB), of R--CR--X type structural fragments (C-028) and of ring tertiary C(sp3) atoms (nCrHR), together with a lower value of MATS6e, would be advantageous. PLS analysis has confirmed the dominance of the CP-MLR identified descriptors, and applicability domain analysis revealed acceptable predictability of the suggested models. All the compounds are within the applicability domain of the proposed models and were evaluated correctly.


Introduction
Therapeutics based on glucagon-like peptide-1 (GLP-1) are among the novel and promising approaches to the treatment of type 2 diabetes [1][2][3]. The active and natural form of the incretin hormone GLP-1 is secreted from intestinal L-cells after the intake of meals. The stimulation of insulin secretion, inhibition of glucagon release, delay in gastric emptying and promotion of β-cell trophism in intestinal L-cells are advantageous to glucose homeostasis in both animal models and humans [4,5]. Studies have revealed that GLP-1 levels are noticeably reduced in type 2 diabetics and that exogenous infusion of GLP-1 may restore a normal insulin response to glucose [6][7][8]; this is the basis for using GLP-1 and its analogues as novel treatments for type 2 diabetes. One example of a GLP-1 analogue is exenatide [9,10]. GLP-1, the most potent incretin hormone, requires a half-maximal effective concentration of 10 pM to show its effects on pancreatic β-cells [11]. The biological functions of GLP-1 are exerted through circulation and binding to the GLP-1 receptor, which is highly expressed in pancreatic β-cells. After secretion, GLP-1 is rapidly degraded by DPP4 (EC 3.4.14.5) to afford inactive GLP-1 under normal physiological conditions. The apparent half-life of GLP-1 in this rapid inactivation process is 60-90 s. Because of this natural degradation mechanism, less than 50% of the released active GLP-1 reaches the circulation [12]. A DPP4 inhibitor can therefore prevent the degradation of GLP-1, potentiate its action and further improve glucose and insulin homeostasis [13,14]. DPP4, ubiquitously expressed throughout the body, is a nonclassical, sequence-specific serine protease. Membrane-bound DPP4 is highly expressed in the endothelium of the capillary bed in close proximity to the intestinal L-cells where GLP-1 is secreted. The other, soluble form of DPP4 circulates in plasma and plays little role in the cleavage of GLP-1 [15,16].
Vildagliptin [17], sitagliptin [18], saxagliptin [19] and alogliptin [20] are examples of small molecule DPP4 inhibitors which have demonstrated ability to lower blood glucose and HbA1c levels and to improve glucose tolerance in type 2 diabetic patients [21]. Several novel series of azolopyrimidine amines, containing an aromatic or heteroaromatic group on the azolo ring, as potent and selective DPP4 inhibitors were reported in view of medicinal chemistry efforts to discover novel scaffolds [22]. The substitution of aromatic or heteroaromatic group on the azolo ring in these compounds showed enhancement in the binding affinity to DPP4 but displayed high levels of the human ether-à-go-go related gene (hERG) and sodium channel inhibition.
In an attempt to minimize the undesired hERG and sodium channel activities, a novel series of imidazolopyrimidine amides has been reported by Meng et al. [23] as a highly potent and selective class of DPP4 inhibitors. The general structure of these analogues is shown in Figure 1 and the structural variations are given in Table 1. In the present communication, a 2D quantitative SAR (2D-QSAR) study has been conducted to provide a rationale for drug design and to explore the possible mechanism of action. In a congeneric series, where a relative study is being carried out, 2D descriptors may play an important role in deriving significant correlations with the biological activities of the compounds. The value of a 2D-QSAR study lies in the simplicity with which the different descriptors can be calculated and interpreted (in a physical sense) to explain the inhibitory action of the compounds at the molecular level.

Data-set
For the present work, the imidazolopyrimidine amides (Table 1), along with their in vitro human DPP4 inhibition activity, have been taken from the literature [23]. The inhibition activity, reported in terms of Ki, is expressed as pKi on a molar basis and is considered the dependent variable for the present quantitative analysis. For modeling purposes, the complete data-set was divided into a training-set and a test-set. The training-set was used to derive statistically significant models, while the test-set, comprising nearly 20% of the total compounds, was employed to validate those models. The test-set compounds were selected through SYSTAT [24] using the single linkage hierarchical cluster procedure on the Euclidean distances between the pKi values. They were chosen from the generated cluster tree so as to keep them at the maximum possible distance from each other. By default, SYSTAT computes normalized Euclidean distances, which are root mean-squared distances, to join the objects of a cluster. Single linkage uses the distance between the two closest members when clustering. It generates long clusters and so provides scope to choose objects at intervals, which is why the single linkage procedure was applied.
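The selection step can be illustrated with a minimal sketch. It does not reproduce the SYSTAT procedure: it relies only on the fact that, for a single variable such as pKi, cutting a single linkage tree into a given number of clusters amounts to splitting the sorted values at the widest gaps (rescaling one variable does not change the order of merges). The function names, the rule of taking the middle member of each cluster and the pKi values are all hypothetical.

```python
def single_linkage_1d(values, n_clusters):
    """Group 1D values into n_clusters by single linkage.

    For points on a line, single linkage joins nearest neighbours first, so
    cutting the tree into n_clusters is the same as splitting the sorted
    values at the (n_clusters - 1) widest gaps.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    gaps = [(values[order[j + 1]] - values[order[j]], j)
            for j in range(len(order) - 1)]
    widest = sorted(gaps, reverse=True)[: n_clusters - 1]
    cut_after = sorted(j for _, j in widest)
    clusters, start = [], 0
    for j in cut_after:
        clusters.append(order[start:j + 1])
        start = j + 1
    clusters.append(order[start:])
    return clusters


def pick_test_set(values, n_test):
    """Take the middle member of each cluster as a test compound (assumption)."""
    return [c[len(c) // 2] for c in single_linkage_1d(values, n_test)]


# Hypothetical pKi values, not taken from Table 1.
pki = [5.0, 5.1, 5.2, 7.0, 7.1, 9.0]
clusters = single_linkage_1d(pki, 3)   # lists of compound indices per cluster
test_idx = pick_test_set(pki, 3)       # one compound index per cluster
```

Picking one compound per cluster spreads the test-set across the activity range, which is the stated aim of the selection; the authors made their choice by inspecting the cluster tree.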

Molecular descriptors
The structures of the compounds under study (Table 1) have been drawn in 2D ChemDraw [25] and converted into 3D objects using the default conversion procedure implemented in CS Chem3D Ultra. The generated 3D structures were subjected to energy minimization in the MOPAC module, using the AM1 procedure for closed shell systems, as implemented in CS Chem3D Ultra. This ensures a well-defined conformer relationship across the compounds of the study. The energy-minimized structures have been ported to DRAGON software [26] for computing the descriptors corresponding to the 0D-, 1D- and 2D-classes. The combinatorial protocol in multiple linear regression (CP-MLR) [27] analysis and partial least-squares (PLS) [28][29][30] procedures have been used in the present work for developing QSAR models. A brief description of the computational procedure is given below.

Model development
CP-MLR is a 'filter'-based variable selection procedure for model development in QSAR studies. Its procedural aspects and implementation are discussed in some of our publications [31][32][33][34][35][36]. The thrust of this procedure lies in its four embedded 'filters'. Briefly, they are as follows: filter-1 seeds the variables by limiting inter-parameter correlations to a predefined level (upper limit ≤ 0.79); filter-2 controls the entry of variables to a regression equation through the t-values of the coefficients (threshold value ≥ 2.0); filter-3 provides comparability of equations with different numbers of variables in terms of the square root of the adjusted multiple correlation coefficient of the regression equation, r-bar; filter-4 estimates the consistency of the equation in terms of the cross-validated r² or q², with leave-one-out (LOO) cross-validation as the default option (threshold value 0.3 ≤ q² ≤ 1.0). Together, these filters make the variable selection process efficient and lead to a unique solution. In order to collect the descriptors with higher information content and explanatory power, the threshold of filter-3 was successively incremented with increasing number of descriptors (per equation) by taking the r-bar value of the preceding optimum model as the new threshold for the next generation. Furthermore, to detect any chance correlations associated with the models recognized in CP-MLR, each cross-validated model has been put to a randomization test [37,38] by repeated randomization of the activity. For this, every model has been subjected to 100 simulation runs with scrambled activity. The scrambled-activity models with regression statistics better than or equal to those of the original-activity model have been counted to express the percent chance correlation of the model under scrutiny.
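Two of these checks, the LOO cross-validated q² of filter-4 and the randomization test, can be sketched as follows. This is an illustration under stated assumptions, not the CP-MLR implementation: q² is taken as 1 - PRESS/SS with SS computed about the mean of all activities, r² is assumed to be the regression statistic compared in the scrambling runs, and the data are synthetic.

```python
import numpy as np


def fit_ols(X, y):
    """Least-squares coefficients, intercept first."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def r2(X, y):
    """Squared multiple correlation coefficient of the fitted model."""
    A = np.column_stack([np.ones(len(y)), X])
    resid = y - A @ fit_ols(X, y)
    return 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)


def q2_loo(X, y):
    """Leave-one-out q2: each compound is predicted by a model fitted without it."""
    press = 0.0
    for i in range(len(y)):
        keep = np.arange(len(y)) != i
        coef = fit_ols(X[keep], y[keep])
        press += (y[i] - (coef[0] + X[i] @ coef[1:])) ** 2
    return 1.0 - press / np.sum((y - y.mean()) ** 2)


def percent_chance(X, y, runs=100, seed=0):
    """Percent of scrambled-activity models with r2 >= that of the real model."""
    rng = np.random.default_rng(seed)
    reference = r2(X, y)
    hits = sum(r2(X, rng.permutation(y)) >= reference for _ in range(runs))
    return 100.0 * hits / runs


# Synthetic example: 27 "compounds", 2 descriptors (values are made up).
rng = np.random.default_rng(1)
X = rng.normal(size=(27, 2))
y = 6.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=27)
fit_r2, loo_q2, chance = r2(X, y), q2_loo(X, y), percent_chance(X, y)
```

With this definition q² can never exceed r² for the same model, so a large gap between the two is a warning that the fit depends on individual compounds.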
To support the findings, a partial least squares (PLS) analysis has been carried out on the descriptors identified through CP-MLR. The study facilitates the development of a 'single window' structure-activity model and helps to rank the identified descriptors by their ability to explain the DPP4 inhibition activity profiles of the compounds. It also gives an opportunity to compare the relative significance of the descriptors. The fraction contributions obtainable from the normalized regression coefficients of the descriptors allow this comparison within the modeled activity.
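One plausible reading of the fraction contribution, used here only as an assumption, is each normalized coefficient divided by the sum of the absolute values of all coefficients, with the sign retained. The descriptor names and coefficient values below are hypothetical.

```python
def fraction_contributions(coefficients):
    """Signed share of each normalized coefficient in the total absolute weight."""
    total = sum(abs(b) for b in coefficients.values())
    return {name: b / total for name, b in coefficients.items()}


# Hypothetical normalized regression coefficients, not taken from Table 3.
coefs = {"d1": 0.5, "d2": -0.25, "d3": 0.25}
fractions = fraction_contributions(coefs)
```

The absolute fractions sum to one, so the descriptors can be ranked on a common scale while the sign still shows the direction of influence.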

Applicability domain
The utility of a QSAR model rests on its ability to predict accurately for new compounds. A model is valid only within its training domain, and new compounds must be assessed as belonging to that domain before the model is applied. The applicability domain is assessed by the leverage values for each compound [39,40]. The Williams plot (the plot of standardized residuals versus leverage values, h) can then be used for an immediate and simple graphical detection of both the response outliers (Y outliers) and the structurally influential chemicals (X outliers) in the model. In this plot, the applicability domain is established inside a squared area within ±x standard deviations (s.d.) and a leverage threshold h*. The threshold h* is generally fixed at 3(k + 1)/n (n is the number of training-set compounds and k is the number of model parameters), whereas x = 2 or 3. Predictions must be considered unreliable for compounds with a high leverage value (h > h*). On the other hand, when the leverage value of a compound is lower than the threshold value, the probability of accordance between predicted and observed values is as high as that for the training-set compounds.
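The leverage calculation can be sketched as below, assuming the usual definition h = x(X'X)^-1 x' with an intercept column included in X, which is consistent with the k + 1 term in the threshold. The function names and the descriptor matrix are hypothetical. With n = 34 and k = 5 assumed, the threshold evaluates to the h* = 0.529 quoted for the Williams plot.

```python
import numpy as np


def leverages(X_train, X_query):
    """Leverage of each query compound with respect to the training descriptors."""
    A = np.column_stack([np.ones(len(X_train)), X_train])   # intercept + descriptors
    Q = np.column_stack([np.ones(len(X_query)), X_query])
    core = np.linalg.inv(A.T @ A)
    # h_i = q_i (A'A)^-1 q_i' for every row q_i of Q
    return np.einsum("ij,jk,ik->i", Q, core, Q)


def leverage_threshold(n, k):
    """Warning leverage h* = 3(k + 1)/n."""
    return 3 * (k + 1) / n


def inside_domain(h, std_resid, h_star, x=3.0):
    """True where a compound lies inside the square of the Williams plot."""
    return (np.asarray(h) <= h_star) & (np.abs(std_resid) <= x)


# Hypothetical descriptor matrix: 34 compounds, 5 descriptors.
rng = np.random.default_rng(0)
X_all = rng.normal(size=(34, 5))
h = leverages(X_all, X_all)
h_star = leverage_threshold(n=34, k=5)   # 18/34, about 0.529
```

When the query set is the training set itself, the leverages sum to k + 1, so their mean is (k + 1)/n and the threshold is three times that mean.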

QSAR results
For the compounds in Table 1, a total of 479 descriptors belonging to the 0D- to 2D-classes of DRAGON have been computed and subjected to CP-MLR analysis. The 34 compounds of the data-set were divided into a training-set and a test-set. Seven compounds (nearly 20% of the total population) were selected for the test-set through SYSTAT. The identified test-set was then used for external validation of the models derived from the remaining twenty-seven compounds in the training-set. The squared correlation coefficient between the observed and predicted values of the test-set compounds, r²_Test, was calculated to measure the fraction of variance explained in the test-set, which takes no part in deriving the regression model. It is a measure of the goodness of the derived model equation. A higher r²_Test value is better, but given the stringency of test-set procedures, models with r²_Test values in the range of 0.5 to 0.6 are often regarded as acceptable. Following the strategy of exploring only predictive models, CP-MLR yielded one model in three descriptors, five models in four descriptors and sixteen models in five descriptors at the upper limit of filter-1. The most significant of them, in the statistical sense, are given as Equations (1)-(10), where n and F represent respectively the number of data points and the F-ratio between the variances of calculated and observed activities. The data within the parentheses are the standard errors associated with the regression coefficients. In all of these equations, the F-values remained significant at the 99% level. The indices q²_LOO and q²_L5O (> 0.5) account for their internal robustness. For all of these models except Equation (1), the r²_Test values exceeded 0.5, indicating that the selected test-set supports their external validation.
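As defined above, r²_Test is the squared Pearson correlation between observed and predicted test-set activities. A minimal sketch follows; the seven activity values are hypothetical and are not taken from Table 1.

```python
import numpy as np


def r2_test(observed, predicted):
    """Squared Pearson correlation between observed and predicted activities."""
    r = np.corrcoef(observed, predicted)[0, 1]
    return float(r ** 2)


# Hypothetical observed and predicted pKi values for seven test compounds.
obs = np.array([6.2, 7.1, 7.9, 8.4, 6.8, 7.5, 8.8])
pred = np.array([6.5, 7.0, 7.6, 8.1, 7.1, 7.7, 8.5])
value = r2_test(obs, pred)
```

Because it is a correlation, this statistic is unchanged by a constant offset or rescaling of the predictions, so it is best read alongside the residuals themselves.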
The descriptors in all of these models have been scaled to the interval 0 to 1 [41] to ensure that no descriptor dominates simply because its pre-scaled values are larger or smaller than those of the other descriptors. In this way, the scaled descriptors have equal potential to influence the QSAR models. The signs of the regression coefficients indicate the direction of influence of the explanatory variables in these models. A positive regression coefficient associated with a descriptor augments the activity profile of a compound, while a negative coefficient is detrimental to it.
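The scaling can be illustrated with a column-wise min-max transform, which is assumed here to be the form of scaling meant. The function name, the handling of constant columns and the values are hypothetical.

```python
import numpy as np


def minmax_scale(X):
    """Scale each descriptor column to the interval 0 to 1."""
    X = np.asarray(X, dtype=float)
    lo = X.min(axis=0)
    span = X.max(axis=0) - lo
    span[span == 0] = 1.0          # a constant column maps to 0 instead of dividing by zero
    return (X - lo) / span


# Two hypothetical descriptors on very different raw scales.
raw = np.array([[1.0, 200.0],
                [2.0, 400.0],
                [4.0, 300.0]])
scaled = minmax_scale(raw)
```

After scaling, a regression coefficient is the change in predicted activity across the full observed range of its descriptor, which makes the coefficients directly comparable.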
Although Equations (1)-(10) all emerged as significant predictive models, Equations (7)-(10) remained statistically more efficient. The latter four models, each involving five descriptors, could account for up to 81.22 percent of the variance in the observed activity of the compounds. In fact, a total of sixteen such models, sharing 19 descriptors among them, were obtained through CP-MLR, and the four most significant of them are documented as Equations (7)-(10). The 19 shared descriptors, along with their brief descriptions, average regression coefficients and total incidences, are given in Table 2. Besides the descriptors listed in Table 2, the other identified descriptors are Mv, from the constitutional class, and MAXDN, from the topological class. Mv represents the mean atomic van der Waals volume (scaled on carbon atom) (Equation 3) and MAXDN is the maximal electrotopological negative variation (Equation 6). The further discussion is, however, based on the most significant Equations (7)-(10). The derived statistical parameters show that these four models are significant. They were, therefore, used to calculate the activity profiles of all the compounds, which are included in Table 1 for comparison with the observed ones. A close agreement between them has been observed. Additionally, a plot of observed versus calculated activities is given in Figure 2 to show the goodness of fit of each of these four models.
The descriptors Ms (mean electrotopological state) and nDB (number of double bonds in the molecular structure) belong to the constitutional class. From the signs of the regression coefficients it is evident that a higher value of the mean electrotopological state (Ms) and a lower number of double bonds (nDB) help to augment the activity. The descriptor PJI2, which participates in these models, is a topological class descriptor and represents the 2D Petitjean shape index. The positive sign of its regression coefficient suggests that a higher value of PJI2 is beneficial to the DPP4 inhibition activity. The descriptors MATS2p (Moran autocorrelation of lag-2/weighted by atomic polarizabilities) and MATS6e (Moran autocorrelation of lag-6/weighted by atomic Sanderson electronegativities) are 2D autocorrelation descriptors. In the models mentioned above, MATS2p contributes positively and MATS6e negatively to the activity. Thus a higher value of MATS2p and a lower value of MATS6e will support enhanced inhibition activity. The participating descriptors BELe5 (lowest eigenvalue n.5 of Burden matrix/weighted by atomic Sanderson electronegativities), BELv3 (lowest eigenvalue n.3 of Burden matrix/weighted by atomic van der Waals volumes) and BEHv1 (highest eigenvalue n.1 of Burden matrix/weighted by atomic van der Waals volumes) belong to the BCUT class. All of these descriptors contribute positively to the activity, suggesting that higher values of them will augment it.
From Equations (7)-(10), it appears that two descriptors make a negative contribution to the activity: nCrHR, a functional group count descriptor representing the number of ring tertiary C(sp3) atoms in a structure, and C-028, an atom-centered fragment descriptor representing the R--CR--X type fragment in a molecular structure. JGI2, the mean Galvez topological charge index of order 2, shows a positive correlation with the activity. Thus, the absence of ring tertiary C(sp3) atoms and of R--CR--X type fragments in a molecular structure, together with a higher value of the mean Galvez topological charge index of order 2, would be advantageous in improving the DPP4 inhibition activity of a compound.
To corroborate the study further, a PLS analysis has also been carried out on the 19 descriptors identified through CP-MLR, and the results are given in Table 3. For this purpose, the descriptors have been autoscaled (zero mean and unit s.d.) to give each of them equal weight in the analysis. In the PLS cross-validation, three components were found to be the optimum for these 19 descriptors, and they explained 89.7% of the variance in the activity (r² = 0.897). The MLR-like PLS coefficients of these 19 descriptors are given in Table 3. The calculated activity values of the training-set and test-set compounds are in close agreement with the observed ones and are listed in Table 1. For the sake of comparison, the plot of observed versus calculated activities (through PLS analysis) for the training-set and test-set compounds is given in Figure 2. Figure 3 shows a plot of the fraction contribution of the normalized regression coefficients of these descriptors to the activity (Table 3).
The PLS analysis of the 19 identified descriptors revealed three components (Table 3) as the optimum to explain the DPP4 inhibition activity. The top ten descriptors in decreasing order of significance are BEHv1, Ms, BELv3, BELe5, nCrHR, C-032, C-028, nNR2, PJI2 and HNar (Table 3, Figure 3). Among these, BEHv1, Ms, BELv3, BELe5, nCrHR, C-028 and PJI2 are part of the equations discussed above and convey the same inferences in the PLS analysis. The negative contributions of the atom-centered fragment descriptor C-032 (X--CX--X type fragment), the functional group count descriptor nNR2 (number of tertiary aliphatic amine functionalities in a molecule) and the topological descriptor HNar (Narumi harmonic topological index) suggest that lower values of these help to improve the activity profile. It is also observed that a PLS model built from the data-set devoid of the 19 descriptors (Table 3) remained inferior in explaining the activity of the analogues.

Applicability domain
On analyzing the applicability domain (AD) in the Williams plot (Figure 4) of the model based on the whole data-set (Table 4), no compound was identified as an obvious 'outlier' for the DPP4 inhibitory activity when the limit of normal values for the Y outliers (response outliers) was set at 3×(standard deviation) units. One compound (2; Table 1) was found to have a leverage (h) value greater than the threshold leverage (h*), suggesting that it is a chemically influential compound.
Figure 4. Williams plot for the training-set and test-set for inhibition activity of DPP4 for the compounds in Table 1. The horizontal dotted line refers to the residual limit (±3×standard deviation) and the vertical dotted line represents the threshold leverage h* (= 0.529).
For both the training-set and test-set, the suggested model matches the high quality parameters with good fitting power and the capability of assessing external data. Furthermore, all of the compounds were within the applicability domain of the proposed model and were evaluated correctly.

Conclusion
The DPP4 inhibition activity of imidazolopyrimidine amides has been quantitatively analyzed in terms of chemometric descriptors. The statistically validated quantitative structure-activity relationship (QSAR) models provide rationales to explain the inhibition activity of these congeners. The descriptors identified through combinatorial protocol in multiple linear regression (CP-MLR) analysis highlight the role of the mean electrotopological state (Ms), the number of double bonds in the molecular structure (nDB), the 2D Petitjean shape index (PJI2), the Moran autocorrelation of lag-2/weighted by atomic polarizabilities (MATS2p), the Moran autocorrelation of lag-6/weighted by atomic Sanderson electronegativities (MATS6e), the lowest eigenvalue n.5 of Burden matrix/weighted by atomic Sanderson electronegativities (BELe5), the lowest eigenvalue n.3 of Burden matrix/weighted by atomic van der Waals volumes (BELv3) and the highest eigenvalue n.1 of Burden matrix/weighted by atomic van der Waals volumes (BEHv1). In addition, the second-order mean Galvez topological charge index (JGI2), the number of ring tertiary C(sp3) atoms (nCrHR) and R--CR--X type structural fragments (C-028) also proved prevalent in modeling the inhibitory activity.
From the statistically validated models, it appears that the descriptors Ms, PJI2, JGI2, MATS2p, BELe5, BELv3 and BEHv1 make a positive contribution to the activity, and their higher values are conducive to improving the DPP4 inhibition activity of a compound. On the other hand, the descriptors nDB, C-028, nCrHR and MATS6e have a detrimental effect on the activity. Therefore, the absence or a lower number of double bonds (nDB), R--CR--X type structural fragments (C-028) and ring tertiary C(sp3) atoms (nCrHR), together with a lower value of the descriptor MATS6e, would be advantageous. Such guidelines may be helpful in exploring more potent analogues of the series. The statistics obtained from the test-set have validated the identified significant models. PLS analysis has further confirmed the dominance of the CP-MLR identified descriptors. Applicability domain analysis revealed that the suggested models have acceptable predictability. All the compounds are within the applicability domain of the proposed models and were evaluated correctly.