COSMO sar3D : Molecular Field Analysis based on local COSMO  -profiles

The COSMO surface polarization charge density  resulting from quantum chemical calculations combined with a virtual conductor embedding has been widely proven to be a very suitable descriptor for the quantification of interactions of molecules in liquids. In a preceding paper, grid-based local histograms of  have been introduced in the COSMO sim3D method, resulting in a novel 3D-molecular similarity measure and going along with a novel property-based molecular alignment method. In this paper we introduce under the name COSMO sar3D the usage of the resulting array of local  -profiles as a novel set of molecular interaction fields for 3D-QSAR, containing all information required for quantifying the virtual ligand-receptor interactions, including desolvation. In contrast to currently used molecular interaction fields, we provide a theoretical rationale that the logarithmic binding constants of ligands should be a linear function of the array of local  -profiles. This makes them especially suitable for linear regression analysis methods such as PLS. We demonstrate that the usage of local  -profiles in molecular field analysis inverts the role of ligands and receptor: while conventional 3D-QSAR considers the virtual receptor in potential energy fields provided by the ligands, our COSMO sar3D approach corresponds to the calculation of the free energy of the ligands in a virtual free energy field provided by the receptor. First applications of the COSMO sar3D method are presented which demonstrate its ability to yield robust and predictive models, which seem to be superior to the models generated based on conventionally used molecular fields.


Introduction
Since the development and introduction of the comparative molecular field analysis technique (CoMFA) by Cramer et al. in the years 1979-1988, the molecular interaction field (MIF)-based statistical analysis approaches (MFA) have become an important branch of computational drug design. [1][2][3] Even though due to the increasing availability of good quality crystallographic structures of protein targets in the past decade structure-based methods gained more attention, ligand-based statistical approaches are still of considerable importance in lead optimization, taking advantage of binding affinities determined by in vitro assays. The chances of success of a MFA project strongly depend on various factors, among them the amount and quality of the binding information used as input, the statistical analysis method employed, and the quality, completeness and balance of the MIFs used for the quantification of the interactions of the ligands with the virtual receptor. Conventional MIFs are based on forcefields and take into account steric and electrostatic interactions, usually described by the Lennard-Jones potential and the molecular electrostatic potential (MEP), as resulting from the Coulombic field generated by atom-centered point charges. Besides, MIFs have also been derived from more heuristic descriptions of hydrophobic interactions and hydrogen bonding, often based on probe atom or probe group potentials. Good reviews of the CoMFA technique and its variations have been given by Kubinyi 3 and more recently by Verma et al. 4 In this paper we suggest a novel set of MIFs, the local gridbased COSMO -profiles (LSPs), as a promising alternative to force-field based MIFs. Our starting point is the quantum chemistry-based COSMO-RS method, 5,6 which during the past decade has become an important method for the prediction of fluid phase equilibrium constants such as partition coefficients, solubilities, and vapor pressures in many areas of chemistry, biochemistry, and chemical engineering. 7 In COSMO-RS all molecular interactions are quantified as local contact interactions of the surface polarization charge densities  which arise on a molecular surface, if it is virtually embedded in a conducting medium. These polarization charge densities can nowadays be calculated at moderate costs by a combination of quantum chemical methods with the continuum solvation model COSMO. 8 The COSMO-RS method initially and mostly has been applied to the prediction of free energies of molecules in homogeneous bulk liquids, which are calculated as surface integrals of solvent specific -potentials over the -surfaces of the solute molecules. The -potentials express the affinity of a solvent for molecular surfaces of certain polarity , which originates from electrostatic interactions, hydrogen bonding and hydrophobic effects, if one likes to consider the latter as a separate class of interactions, as often done in medicinal chemistry. More recently, straightforward extensions of COSMO-RS to inhomogeneous situations such as interfaces, micelles, and bio-membranes have been reported, 9 which are based on inhomogeneous -potentials.
In a preceding paper 10 we have introduced local, grid-based -profiles (LSPs), which are four-dimensional histograms describing the amount of ligand surface area within a certain -interval and space interval. These LSPs have been shown to be valuable descriptors for the assessment of molecular similarity and alignment. If we now consider a protein receptor together with its aqueous embedding as a locally slightly flexible, and thus locally pseudo-liquid, matrix with locally varying preference for certain surface polarity, i.e. with locally varying -potential, then the free energy of a ligand in this receptor matrix should be a surface integral of the locally varying receptor -potential over the -surface of the ligand. Approximating the surface integration by a grid summation, the free energy of the ligand in the receptor matrix turns into a linear function of the LSPs. Since also the free energy of the ligand in water is of such form, the same must be true for the difference of the free energy of the ligand between the receptor-bound state and its state in This is a postprint of J. Chem. Inf. Model., 2012, 52 (8), pp 2157-2164. The original article can be found under http://pubs.acs.org/doi/abs/10.1021/ci300231t aqueous solution. Hence, by these minimal assumptions we have come to the result that the free energies of binding, and thus the pKi values of ligands to a receptor, should be describable as a linear model with respect to the LSPs of the ligands. Therefore, the LSPs should provide an optimally suited set of descriptors for a linear regression analysis, e.g. with partial least squares (PLS), 11 of pKi values, according to the 3D-QSAR paradigm. To the best of our knowledge, no other set of molecular fields used so far in MFA can claim such a sound theoretical justification for the expectation of a linear pKi model.
Our approach has additional attractive features: -As shown in a recent paper, 12 the polarization charge densities  used in our approach are better suited for the description of hydrogen bonding than the electrostatic potential, which is usually employed in MFA.
-The polarization charge densities  of neutral and charged species are in the same range, which enables the inclusion of compounds of varying charge states in the same model.
-For the first time a histogram of a property is used as input for MFA, while in all previous approaches the properties are used directly as fields. The usage of smooth spatial histograms introduces an increased robustness of the models against small geometrical shifts of the ligands relative to the grid even at larger grid spacing. If a property has a local hotspot, the latter may be initially located close to a grid point and thus be well represented, but after a small shift it may fall in between the grid points and loose importance. In our histograms, such hotspot would be well described in any case, and its representation would just be smoothly partitioned over the neighboring grid points.
-The recently published COSMOsim3D method 10 enables ligand alignment based on the same fields as used for the MFA, i.e. the LSPs.
-These theoretical considerations gave us sufficient confidence to expect that LSP-based MFA might be superior to currently employed MFA approaches. 13 This confidence meanwhile has been supported by the assessment of COSMOsar3D performance on a number of publicly available test cases 14 for 3D-QSAR.
In the next paragraphs we provide a formal derivation of the underlying theory, followed by a methods section describing the datasets and the statistical tools employed. Next, we report the performance of different variants of the COSMOsar3D method compared to publicly available results of other MFA approaches for the same datasets. Finally, we report the results of several robustness tests.

Theory
Within the COSMO-RS theory 5-7 the free energy of a solute X in a homogeneous liquid solvent S is given as a surface integral of a homogeneous solvent-specific -potential  () Using such weights in all four dimensions we get for the Inserting this into eq. 2 we find where LSP L (ix,iy,iz,i) is the local -profile of ligand L, i.e. the local histogram of the COSMO surface area of ligand L with respect to polarity  in the vicinity of grid point (ix,iy,iz), as it was recently introduced within the COSMOsim3D method. 10 Since the same derivation holds for the free energy of ligand L in water (W), even with a position-independent -potential, we find for the logarithmic binding constant pKi of the ligand L from water to receptor R i.e., the pKi is a linear functional of the LSPs. This means that we have exactly the kind of relation between the pKi values and the descriptors as it is assumed by most statistical methods applied in MFA, especially by the most widely employed PLS technique. This is an absolutely unique result for MFA, since for no other set of descriptors such a proof for the existence of a linear relationship between the set of descriptors and the pKi values has ever been derived. We even have a reasonable guess of the highest achievable accuracy: a model based on the last part of eq. 6 should be ideally able to describe logarithmic binding constants with about the same accuracy as standard COSMO-RS does for logarithmic partition coefficients, i.e. with an accuracy of about 0.35 log-units. Obviously, such high precision will hardly be achievable within MFA.
In conventional MFA a number of potential energy fields are evaluated on a regular grid, large enough to embed an aligned set of ligands. For each ligand, each of the fields describes the potential energy of a probe particle of certain specification, e.g. a positive probe atom, a hydrogen bond donor/acceptor, a hydrophobic group, etc., in the field provided by the ligand. This energy is commonly evaluated by force field methods. Then, a relationship (usually linear) between the logarithmic binding constant pKi(L,R) of the ligand L in the virtual receptor R is assumed, as where the index ip shall denote the various potential energy fields (PEFs). By that, the coefficients ˆ( , , , ) R c ix iy iz if , apart from a factor -RTln10, represent a population probability of the grid points with respect to receptor atoms of the various probe types; alternatively, taking into account the desolvation from the bulk water, they need to describe a differential population probability between receptor and water embedding of the ligands. In other words, the coefficients, which finally have to be regressed by a statistical method such as PLS, represent a differential 3D composition histogram. The existence and the right choices of proper potential energy fields allowing for a linear description of the pKi values are much less obvious than the existence of an effective position-dependent -potential as it is assumed in the COSMOsar3D methodology. It is worthwhile to note that there is an inversion of the roles of ligands and receptor as well as of molecular fields and composition variables between conventional 3D-QSAR (eq. 7) and the novel COSMOsar3D approach (eq. 6). In fact, as illustrated in Figure 1, in conventional MFA the ligands provide potential energy fields, and the differential receptor/water composition coefficients are regressed as coefficients, i.e. the virtual receptor is considered in the field of the ligands. Conversely, in COSMOsar3D composition histograms of the ligands are given as descriptors, and the coefficients of a differential receptor/water free energy functional are regressed, which means that the ligands are considered in the field of the receptor. Therefore, COSMOsar3D may literally be considered as a "re-volution" of 3D-QSAR, which may shed new light on the old problem from a very different perspective.

Methods
Coordinates for the datasets used in this work were taken from literature; 14 the same partitioning between training and test set as chosen by the original authors was maintained. PLS and statistical analysis were performed with the Open3DQSAR software, 15 which was adapted in order to read the LSP file format. O3QMFA descriptors (see below) were computed with Open3DQSAR, while LSP-based descriptors where computed with the COSMOsim3D program. 10 DFT-level COSMO -surfaces were obtained by single point calculations with the quantum chemical software suite TURBOMOLE. 16,17 The BP-SVP-COSMO 18-22 level of theory was used, applying infinity for the value of the dielectric constant. Approximate CF-COSMO -surfaces were generated with the COSMOfrag program. 23,24

Results and discussion
In order to demonstrate the performance of our new approach we took the eight datasets collected by Sutherland and coworkers, 14 and we applied the above described procedure using the same conformations and alignments as in the original literature. As a starting point we calculated the LSP descriptor matrix based on -surfaces resulting from BP-SVP-COSMO calculations, using the same grid spacing of 2.0 Å as used by Sutherland to generate 3D-QSAR models, and our default -interval (DELSIG) of 0.1 e/nm 2 . The latter leads to partioning of the  range in 61 bins, and therefore results in 61 MIFs per grid node. We soon realized that such a high resolution is not required for 3D-QSAR purposes and that DELSIG may be safely increased up to 0.6 e/nm 2 without impacting on model performance, while reducing the size of the descriptor arrays by roughly a factor 6; therefore, a -grid spacing of 0.6 e/nm 2 has been set as the default value and used for all further analyses. As usual, the optimal number of PLS components was determined as the one yielding the minimum standard deviation of the error of prediction during leave-one-out (LOO) crossvalidation. 2 For the assessment of the performance of the COSMOsar3D method relative to the seven conventional 2D/3D-QSAR methods considered by Sutherland, we will consider as indicators the average standard deviations of the error of internal ( cv,LOO s ) and external ( test s ) predictions over the eight datasets, and the respective average correlation coefficients ( 2 test r and 2 LOO q ). Since external prediction is the ultimate purpose of QSAR models, we consider test s as the most important among the four indicators.  For comparison purposes, we also report the results obtained from a reimplementation of CoMFA within the Open3DQSAR software (O3QMFA). Some minor differences with respect to original CoMFA occur due to the usage of the Merck force-field parameters and charges, 25 while the Tripos force field parameters 26 coupled with scaled MNDO ESP-fit (Gasteiger-Marsili for the THER dataset) charges were used by Sutherland; nevertheless, it overall nicely reflects the same behavior as CoMFA on the eight datasets. All individual and average statistics for the models are collected in Table 1, together with the corresponding data for O3QMFA and the seven QSAR methods considered by Sutherland. Based on test s the COSMOsar3D models are better than all non-LSP models. The average advantage over the best two models obtained by Sutherland, i.e. over CoMSIA-extra and standard CoMFA, is 0.10 and 0.12 log-units, respectively, and larger than the advantage of the latter over the other QSAR models. Hence, we may conclude that the LSP-based models instanter have a significantly higher predictivity than standard 3D-QSAR models. Although we consider test s as the most relevant indicator for the quality of the models, it is worth noting that the LSP based models also are best with respect to the other three overall quality measures, i.e. with respect to the external 2 test r and the internal cv,LOO s and 2 LOO q . Furthermore, for six of the eight datasets the standard deviation of test set residuals achieved by the LSP-based models is smaller than those of all QSAR models considered by Sutherland. Only in two cases, i.e. for the BZR and DHFR datasets, HQSAR has the smallest standard deviation on the test set residuals.
After having confirmed the predictive ability of LSP-based COSMOsar3D, we have investigated its robustness with respect to systematic variation of the grid resolution from 1.0 to 5.0 Å in 0.5 Å increments. For comparison, all experiments have been carried out with O3QMFA as well; results are reported in Figure  2. It can clearly be seen that O3QMFA models, in addition to being less predictive according to both internal and external indicators, show much larger fluctuations in test s ; i.e., the COSMOsar3D models are much more robust with respect to the variation of the grid step size.  14 Then, we analyzed the sensitivity to grid positioning, while making sure that the grid consistently exceeded the largest compound by at least 5.0 Å in all directions; although this is not required for COSMOsar3D models (see below), it is necessary for a fair comparison of O3QMFA results. In Figure 3 we see the dependence of test s on the position of the grid center; while the CoMFA-type method O3QMFA displays the well-known strong fluctuations depending on the grid position, 27,28 LSP-based COSMOsar3D appears to be essentially invariant.
As the next experiment, we applied random translations of increasing amplitude to each molecule, thus assessing the robustness of the models with respect to misalignment. The results are shown in Figure 4. As expected, the standard deviations of the errors of prediction increase with increasing noise amplitude for both COSMOsar3D and O3QMFA, but the latter appears to be much more sensitive to misalignment than the LSP-based COSMOsar3D.
Subsequently, we have tested the sensitivity of LSP-based COSMOsar3D models upon the quality of the underlying COSMO -sufaces. Firstly, in the DFT calculations we replaced the standard BP-SVP level of theory with the higher level BP-TZVP method, which generally yields more accurate results than BP-SVP in COSMO-RS applications. We found a negligible performance increase of 0.003 for test s , which clearly does not justify the larger computational demands compared to the BP-SVP level. Next, we explored the replacement of COSMO -surfaces explicitly computed by DFT/COSMO calculations for each of the considered molecules by approximate CF-COSMO -surfaces, which are generated within ~ 0.5 s per compound with the COSMOfrag program. 23 This replacement of the explicit quantum chemical calculations by COSMOfrag has been shown to be very efficient and almost equivalently accurate within COSMOsim3D similarity and alignment applications. 10 Since unfortunately this approach currently is not applicable to compounds with nonzero net charge, we restricted this test to the four datasets BZR, COX2, DHFR and GPB, which involve mostly neutral compounds; the few ionic compounds were represented by the BP-SVP-COSMO files. The performance of the BP-SVP LSP models on these four datasets was test s = 0.91, while the models based on CF-COSMO files yield test s = 0.94. Hence we may conclude that, at least for screening purposes, the costly DFT/COSMO calculations can be safely replaced by quickly generated CF-COSMO files with negligible loss of accuracy. By that the computational demands of the descriptor generation for COSMOsar3D become on par with force-field based methods. It might be worth noting that also the sizes of the descriptor matrix and hence the PLS demands are not far from standard MFA approaches, because the larger number of ~ 11  bins compared to the two CoMFA fields (electrostatic and steric) is partially compensated by the fact that the spatial grid of CoMFA type approaches must extend ~ 5 Å outside the molecules, while the LSP grid just needs to include all COSMO cavities, i.e. extends 1.9 Å at the most for organic molecules.
As a final experiment, we tested COSMOsar3D with a consistent field-based alignment generated with COSMOsim3D. For that purpose, we applied random translations and rotations to the molecular geometries of all compounds of the eight datasets. Subsequently, we used COSMOsim3D to re-align each dataset, using the same superposition templates as used by Sutherland.
The alignment was done in a completely automated fashion, using the super-self-consistent alignment mode of COSMOsim3D. 10 The resulting alignments and detailed COSMOsar3D results are shown in the Supporting Information. The overall performance of these COSMOsim3D-aligned models is almost identical to the performance of the models based on the alignment given by Sutherland. The overall standard deviation of test set residuals only increases by 0.02, while the other 3 indicators differ by 0.01 from those achieved on the original alignment; hence, there is no significant loss of performance if a fully automated COSMOsim3D alignment is used instead of the supervised alignment reported by Sutherland. Nevertheless, it must be acknowledged that our alignment is based on the conformational selection made by Sutherland and hence is not completely independent of Sutherland's work.

Conclusions
Starting from the framework of COSMO-RS theory, the local -profiles have been introduced as a natural and essentially complete set of descriptors for 3D-QSAR and molecular field analysis. Application of this COSMOsar3D concept to the eight reference MFA datasets published by Sutherland et al. demonstrates a significant increase of the predictive accuracy of the resulting models compared to standard 3D-QSAR methods. Furthermore, the COSMOsar3D models turn out to be exceptionally robust with respect to grid step size, grid positioning and random misalignment. Furthermore, no cutoff parameters are required, neither with respect to the descriptors, nor with respect to the field extension outside the molecules. Although being originally based on quantum chemically calculated and hence computationally more expensive COSMO -surfaces, the COSMOsar3D methodology is shown to perform almost equally well on approximate -surfaces quickly generated from a database of precalculated COSMO files. Finally, it is shown that COSMOsar3D also performs well using a consistent -surface alignment with COSMOsim3D. Summarizing, the COSMOsar3D method introduced herein appears to be a promising novel tool for 3D-QSAR, overcoming many of the problems inherent in commonly used methods.
As a next step, we are planning to develop an iterative workflow for the selection of the relevant ligand conformations, making use of the ability of the COSMO-RS method to quantify the free energies in aqueous solution, and of the capability of COSMOsar3D to predict the free energy differences of ligands between the aqueous state and the receptor bound state.

Supporting Information
The following material is available as supporting information: Table S1: Statistics of the PLS models obtained from COSMOsim3D alignments; Figure S1: Original Sutherland alignment compared to the unsupervised COSMOsim3D alignment for ACE, AChE, BZR and COX2 datasets; and Figure  S2. Original Sutherland alignment compared to the unsupervised COSMOsim3D alignment for DHFR, GPB, THERM and THR datasets.