Solubility prediction, solvate and cocrystal screening as tools for rational crystal engineering

The fact that novel drug candidates are becoming increasingly insoluble is a major problem of current drug development. Computational tools may address this issue by screening for suitable solvents or by identifying potential novel cocrystal formers that increase bioavailability. In contrast to other more specialized methods, the fluid phase thermodynamics approach COSMO‐RS (conductor‐like screening model for real solvents) allows for a comprehensive treatment of drug solubility, solvate and cocrystal formation and many other thermodynamics properties in liquids. This article gives an overview of recent COSMO‐RS developments that are of interest for drug development and contains several new application examples for solubility prediction and solvate/cocrystal screening.


Introduction
The fact that new drug candidates as generated by highthroughput screening methods are of increasingly poor aqueous solubility poses a major problem for modern drug research. [1] Thus, an important task during drug development is to find ways to solubilize such compounds. For computational methods to be supportive in this regard, they need to be of reliable predictivity and at the same time be efficient enough to save real lab work. Although many in-silico models and prediction methods have been developed targeting different individual aspects in drug development, only a few approaches allow for a comprehensive treatment of a broader range of thermodynamic properties. Since its development in the early 1990s, the liquid phase thermodynamics theory COSMO-RS (conductor-like screening model for real solvents) has been applied in variety of fields such as solubility prediction, [2] solvent screening, [3] excipient ranking, [4] computation of micellar systems [5] and ionic liquids, [6] prediction of pKa [7] and redox potentials, [8] and calculation of partitioning coefficients. [9] Recently, crystal engineers rediscovered the technique of cocrystallizing an excipient compound with an active pharmaceutical ingredient (API) as an alternative route to address the problem of poor drug solubility. [10] In this context, the COSMO-RS method has been demonstrated to be able to guide the choice of promising cocrystal formers (coformers) by the prediction of the relative probability that such a coformer forms a cocrystal with a given active. [11] The same approach could also be applied to estimate the tendency of solvents to form crystalline solvates.
In addition and also with direct relation to crystal engineering, the effect of milling on the surface energetics of molecular crystals was analysed with COSMO-RS. [12] A few other quite specialized methods have been developed to screen for potential coformers. [13,14] However, apart from the fact that the COSMO-RS approach performs at least comparably, [15] they have the disadvantage of not being embedded in a general theory and thus are not transferable to other thermodynamic properties.
Overall, it has been shown in many cases that COSMO-RS theory offers a rich choice of methods to aid drug developers finding a suitable solid form for a new drug. This article presents a general overview of recent COSMO-RS developments, which may be of particular interest for crystal engineering and solid form selection based on many novel application examples. If not indicated otherwise, all σ-profiles have been obtained with the COSMOquick approach, [16] that is, without the need for a costly quantum-chemical COSMO calculation. The article is sectioned as follows. First, a short overall introduction into COSMO-RS theory is given. Then, recent examples for solubility prediction of some typical drug and coformers in different solvents and solvent mixtures are evaluated. This is followed by a case study concerning solvate screening, and finally results of several cocrystal screenings are presented.

COSMO-RS Theory
COSMO-RS theory was developed by Klamt on top of the quantum chemical COSMO solvation model (conductorlike screening model) to overcome some of the limitations inherent to all dielectric continuum solvation models. One shortcoming of the continuum models is, for example, to account for differences between molecules with identical dielectric constants but quite different solvent properties like benzene and cyclohexane. [17] This finding led to the novel concept of COSMO-RS, a combination of COSMO with the statistical thermodynamics treatment of interacting surface segments. [18] COSMO-RS starts from the polarization charges on the COSMO surfaces, as obtained by individual COSMO calculations for solute and solvent molecules. Those surfaces are defined by the molecular cavity separating the solute volume from the embedding dielectric continuum. Inter-molecular interactions in COSMO-RS are calculated as local contacts of COSMO surfaces and quantified by the polarization charge densities σ and σ' of the two surface segments. For this purpose, histograms of polarization charge densities, the so-called σ-profile (Figure 1), are considered.
Instead of deriving macroscopic properties from an ensemble of interacting molecules like in a molecular dynamics simulation, for example, COSMO-RS uses an ensemble of interacting 'surface segments' . This approximation leads to a significant improvement of the overall efficiency: once the σ-profile is determined from a quantum-chemical calculation (usually density functional theory (DFT) is employed), the intermolecular thermodynamics is evaluated within fractions of a second. Nevertheless, because of the rigorous statistical thermodynamics applied, one obtains a good and thermodynamically consistent approximation for the free energy and the chemical potential of the compound of interest and hence has access to all related physico-chemical properties in the liquid phase.
COSMO-RS theory takes into account the most important modes of molecular interactions, electrostatics, hydrogen bonding and slightly more empirically also Van der Waals interactions. The so-called hydrophobic interactions are well represented as a result of electrostatics and hydrogen bonding. For more details of the approach, we refer to a recent review. [17] Although nowadays the necessary quantum-chemical results are readily available for small to medium sized molecules, the generation of the σ-profiles is still the  rate-determining step of the overall COSMO-RS workflow. However, this issue can be mitigated by deploying precomputed quantum-chemical COSMO calculation in a large database. Currently, a database with more than 100 000 molecules is available, which allows the instantaneous generation of σ-profiles via the composition out of molecular fragments (COSMOquick approach). [16] This comes at the cost of a somewhat reduced accuracy, but further improves the efficiency and the ease of use of the approach. It allows for thermodynamic property prediction using as input just the molecular topology as generated from a usual two-dimensional molecule sketching program.

Computational details
All COSMO-RS calculations presented in this article have been carried out with COSMOquick. COSMOquick is a JAVA-based software tool and graphical user interface, which internally calls FORTRAN routines for the generation of σ-profiles and the COSMOtherm code [19] for the subsequent thermodynamic computations. [20] Approximate σ-profiles are generated by accessing a database containing about 100 000 diverse chemical compounds, which avoids costly quantum-chemical calculations. Structures for those database constituents have been obtained by AM1/ COSMO geometry optimization followed by a single point DFT/COSMO calculation (BP-SVP level of theory) with TURBOMOLE [21] to obtain the σ-surface.
The final σ-profile is then generated by matching one or several database fragments to the topology of the molecule under scrutiny. The chemical potentials in the liquid phase are then calculated according to standard COSMO-RS theory.

Solubility prediction of drugs and coformers
The solubility of liquid compounds can be readily obtained from the chemical potential in the liquid phase. For solids, that is, for almost all drug compounds, additional information about the transfer from the liquid to the solid state is necessary. This information is contained in the free energy of fusion ΔGfus, which is related to the melting point Tm and the melting enthalpy (enthalpy of fusion) ΔHfus, under the assumption that the change in heat capacity may be neglected: The mole fraction solubility is then calculated from the free energy difference: Here, μ X and μS are the chemical potentials of the pure drug and the drug in solution, that is, in some solvent, respectively; R is the gas constant and T the temperature. The chemical potential of the drug in solution is of course concentration dependent, and thus the last equation has to be solved iteratively. However, working with a chemical potential μS at infinite dilution and solving for xs is a sensible approximation as long as the drug solubility is sufficiently low as it is the case for most current drugs anyway.
The liquid-phase chemical potentials μX and μS can be computed with a high accuracy by COSMO-RS theory. ΔGfus may be attained either via experimental melting point and fusion enthalpies, via the combination of a reference solubility measurement and a COSMO-RS computation (i.e. solving the last equation for ΔGfus), or via a quantitative structure property relationship (QSPR) model. The use of experimental data is certainly to be preferred and delivers more accurate results. Using a QSPR model for the fusion data should be applied only with care as it is accompanied usually by a large error bar.
Recently, a new solubility prediction algorithm based on COSMO-RS theory was introduced, using experimental reference solubilities in several different solvents to get the free energy of fusion and additionally a correction term for the chemical potential of the drug in solution for each specific solvent. [20] Using the information from three or four reference solubility measurements, this approach allows for a very accurate prediction of solubility in additional solvents or solvent mixtures.
There exists plenty of literature data on solubility measurements in different solvents. An excellent collection of solubility datasets has been compiled by the open notebook science project [22] and may be used to get access to the original literature sources. Table 1 shows some results of drug and coformer solubility predictions using three to four reference solvents as compared with experimental data. Reference solvent selection was guided by the objective to include a balanced set consisting of hydrogen bond acceptors, hydrogen bond donors and non-polar solvents. Apart from that, the selection was carried out arbitrarily. As an error measure, the root mean squared error (RMSE) in relation to the experiment is specified. The average RMSE is of similar magnitude for drugs and for coformers and amounts to about ∼0.5 log units. This is compares quite favourable to other state-of-the-art methods, which rarely exceed this threshold. [16] Details of the solubility predictions are listed in the supplement (Table S1).
The large RMSE of cinchonidine (0.87) is mostly due to its solubility in triethylamine, showing a strong deviation between experiment (log10(x) = −2.84) and prediction (log10(x) = 0.0, i.e. complete solubility). This strong deviation is in part due to shortcomings of the prediction method, which at the employed level of theory (BP-SVP) is known to be less accurate for tertiary amines. Increasing the theory level by using COSMOtherm at the TZVP-FINE level [19] reduce the deviation significantly for triethylamine (log10(x) = −3.0), but the overall deviation over all solvents still amounts to RMSE = 0.78. Finally, omitting this single value from the error analysis yields a significantly reduced RMSE for this dataset of 0.67.
An extension to solvent mixtures is straightforward, and in the following a few representative cases are presented. For solvent mixtures, the compositions with mole fraction x = 0.0 and x = 1.0 are fed as reference into the model to estimate the free energy of fusion and the corresponding correction to the chemical potential. Other more empirical methods like NRTL-SAC [23] need more than just two reference points for a reliable parameter fit, and thus additional measurements in other solvents are necessary before a specific solvent mixture can be addressed. Figures 2 and 3 present solubility data for sulfadiazine, salicylic acid, prednisolone and paracetamol in different binary solvent mixtures. As it is evident from Figures 2  and 3, the shape of the solubility curves in binary mixtures can be qualitatively very different. The system sulfadiazine in water/DMF [24] is most easily handled as its curve is monotonically decreasing (Figure 2a). The systems salicylic acid [25,26] and prednisolone [27] in water/ethanol show a shallow maximum in the experiments, which cannot be reproduced as such by the prediction method; however, the overall shape of the curve is met quite well (Figure 2c and 2d). The systems paracetamol in water/dioxane [28][29][30] and sulfadiazine in water/dioxane [31] exhibit pronounced solubility maxima, which are predicted qualitatively correct (Figures 2b and 3), although the location of the maxima is not met exactly. Water-rich phases are predicted not as well, which is mostly due to the intricacies of the strong hydrogen bonding network of water, which is not described perfectly at the employed level of theory. The case of paracetamol as shown in Figure 3 is in particular interesting, as three sets of experiments from different groups have been reported and the difference among those is already quite large. This shows that concerning the evaluation of predictions, one has to consider the fact that experiments may be subject to a some background noise, for example due to not well-specified experimental conditions.
An interesting application of this pure and mixed solvent prediction approach is the computational screening over a large set of solvents to identify the best possible solvent mixture for a given drug. This is combinatorial problem and even for binary solvents can become extremely costly. However, as solubility predictions can be done within fractions of a second, such vast screenings become feasible at a reasonable amount of time and resources.

Solvate and cocrystal screening
Solvate and cocrystal formation are closely related phenomena. In both cases, crystals are formed from two or more organic moieties with a defined stoichiometry that mostly have modified solubility and dissolution properties. For solvates one of the constituents is however liquid at room temperature (ΔGfus(298K) = 0), whereas for cocrystal both components are solid (ΔGfus(298K) > 0). The formation of solvates and cocrystals can be described via liquid phase thermodynamics according to the assumption that the interactions in the crystal are similar to a virtual supercooled liquid. Thus, the strength of the interactions in the cocrystal as compared with the pure reactants can be estimated via the mixing enthalpy (or equivalently the excess enthalpy) ΔHmix of the API-coformer or API-solvent mixture with given stoichiometry. In other words, the mixing enthalpy is a rough approximation to the free energy of cocrystal formation ΔGcc: Deviation from experiment is given as root mean squared error (RMSE). The RMSE is given relative to solubilities measured in logarithmic mole fractions (log10(x)). Number of reference solvents (#no references) and total solvents (#no solvents) are also specified.
ΔSmix is the entropy of mixing and ΔΔGfus is the difference between the free energy of fusion of the cocrystal and the reactants. The overall entropy change between reactant and cocrystal and also ΔΔGfus are assumed to be close to zero.
For the overall entropy change between related crystalline states (of reactants and cocrystal) this seems to be reasonable assumption at least concerning the accuracy of the approach. For the remaining fusion enthalpy part this is a far more severe approximation because this corresponds to neglecting any enthalpic change due to the solid state order. The mixing enthalpy (excess enthalpy) is given according to: With HAB being the enthalpy of the supercooled stoichiometric mixture of API A and coformer B, and HA (HB) the enthalpy of the pure (supercooled) liquid A (B), and x being the respective mole fractions. Although this approach neglects the solid state order, it performs surprisingly well and usually an excellent enrichment is obtained as compared with a random trial of manually selected coformers. [11] Figure 4 shows computed excess enthalpies of the experimentally identified solvates of the sulfonamide antibacterial sulfamerazine (marked red) together with FDA class 1, 2  Christoph Loschen and Andreas Klamt Solubility prediction, solvate and cocrystal screening and 3 solvents. [32] The reported solvates cyclopentanone, 1,4-dioxane, dimethylacetamide, dimethylformamide and 3-picoline all show a strongly negative Hmix because of their strong enthalpic interaction with sulfamerazine. Thus, in this hypothetical screening, a significant enrichment is obtained, and, for example, just by trying out the five top solvents having the most negative excess enthalpy one would have found already three solvates. The fact that dimethyl sulfoxide and N-methylpyrrolidone are not known to form solvates in spite of their large negative excess enthalpies is probably due to an unfavourable crystal packing that is not taken into account by the excess enthalpy screening. As a quantitative measure to assess the performance of a screening, a receiver operating characteristics (ROC) diagram is quite useful. An ROC curve plots the true posi-tive rate (number of true positive predictions/total number of positive observations) versus false-positive rate (number of false-positive predictions/total number of negative observations) for a binary classifier system (cocrystal screening results) as its discrimination threshold (ΔHmix cutoff) is varied from small to higher values. The area under the curve (AUC) measures the overall performance of the model. Predictions with higher AUCs are generally better with an AUC = 1.0 being a perfect prediction and should always be higher than 0.5, indicating that the model is better than random selection. An example on how to evaluate the statistics for the case of piracetam is given in the supplement (Table S2).
In addition to the surprisingly good results of just using the excess enthalpy/mixing enthalpy for screening, we empirically found out that the number of rotatable bonds  Christoph Loschen and Andreas Klamt Solubility prediction, solvate and cocrystal screening of API and coformers n is a further significant descriptor concerning the formation of cocrystals. We currently incorporate n by fitting a single parameter a to a set of experimental cocrystal screening results to obtain an effective screening function: m a x , For the sake of simplicity the parameter a also takes into account stoichiometry of the system, which for an unknown cocrystal is assumed to be always 1 : 1. The training set consists mostly out of experimental data we have used in one of our previous studies. [11] The coformer ranking according to ffit usually gives somewhat improved results in particular if the coformers vary with respect to the number of rotatable bonds. Additionally, the components of the screening function may be fit via a logistic regression to the experimental data, which will yield the estimated probability of an API-coformer pair to form a cocrystal. This will not change the ranking itself and thus give the same ROC statistics as the original screening function. Figure 5 shows some recent examples of COSMO-RSbased cocrystal screenings using data from the validation study of Musumeci and co-workers. [15] Table 2 shows some additional details about the calculations, in particular a comparison between COSMO-RS screenings using quantum-chemical derived σ-profiles (AUC, CRS) and instantaneously, i.e. database generated σ-profiles (COSMOquick approach, AUC, CQ).
Please note that the number of tested coformers for acetazolamide is only 34 in Table 2 because from the original set comprising 36 coformers as given in Grecu et al. [15] , two duplicates have been removed (nicotinic acid = 3-pyridinecarboxylic acid and isonicotinic acid = 4-pyridinecarboxylic acid).
A significant enrichment is obtained in all cases (the AUC value is larger than 0.5), except for acetazolamide using COSMOquick, where a virtual screening would not have been better than a random trial. This is partially due to the fact that the thiadiazole group of the acetazolamide is not properly represented in the COSMOquick database used to generate the σ-profile. In most cases, the software allows to identify such an underrepresented molecule/functional group by a score related to the quality of the as generated σ-profile before property calculation. The missing group/molecule can then eventually be added to the database. For the case of acetazolamide, this gives a slight improvement (AUC = 0.56). The remaining deviation from the experiment is due to the underlying approximations made, that is, neglect of the fusion enthalpy differences between drug and coformers.

Conclusion
COSMO-RS as a thermodynamic theory provides a number of useful tools to support solid form selection and crystal engineering. With the ability to compute the chemical potential of a drug in its pure liquid state, in a solvent or a solvent mixture, a broad variety of thermodynamic properties are accessible. The prediction of solubility needs additional information about the solid state of a drug. Usually Receiver operating characteristics plot of COSMO-RS (conductor-like screening model for real solvents) cocrystal screenings using the mixing enthalpy and the number of rotatable bonds for some drugs. Experimental data as used in Grecu et al. [15]  Experimental data as used in Grecu et al. [15] Screening scores have been measured using the area under the ROC curve (AUC) value. Results for the method as used in Grecu et al. [15] (SSIP), the full COSMO-RS level using quantum-chemical computed σ-profiles (CRS) and the COSMOquick level using database generated σ-profiles are given (CQ). CRS and CQ take into account the number of rotatable bonds of API and coformers. The final two columns give the number of coformers in the test set and the number of experimentally confirmed cocrystals, respectively some experimental reference data as solubilities in one or several solvents or as the fusion enthalpy/melting point of the drug is sufficient to obtain quite accurate quantitative predictions.
Applications to solvate and cocrystal screening using the excess/mixing enthalpy of a supercooled API-coformer mixture have been proven to be surprisingly predictive, yielding good enrichments in most cases as compared to manual coformer selection.
The capability to screen for suitable solvents, solvates or cocrystals, accurately and efficiently, within a single theoretical framework and software, is particularly well-suited to second rational drug development.

Supporting information
Additional Supporting Information may be found in the online version of this article at the publisher's web-site:

Table S1
Experimental (exp) and COSMOquick predicted (CQ) logarithmic solubilities (in mole fraction) for the potential cocrystal formers of Table 1. For results of the remaining compounds of Table 1, please refer to the supplement of Loschen and Klamt. [16] Table S2 Exemplary results of the computational cocrystal screening for piracetam. Experimental data as used in reference 15. The excess enthalpy Hex (in kcal/mol) is computed with COSMOquick. The screening function f_fit takes into account the number of rotatable bonds of API and coformers, drug-coformer pairs are ordered according to increasing f_fit. Systems reported to cocrystallize are labelled with '1' in the column cocrystal, '0' otherwise. The final two columns give the true positive rate (TPR) and the falsepositive rate (FPR), which are used to plot Figure 5 and to compute the area under the curve (AUC).