Comprehensive evaluation of end-point free energy techniques in carboxylated-pillar[6]arene host–guest binding: I. Standard procedure

Despite the massive application of end-point free energy methods in protein–ligand and protein–protein interactions, computational understandings about their performance in relatively simple and prototypical host–guest systems are limited. In this work, we present a comprehensive benchmark calculation with standard end-point free energy techniques in a recent host–guest dataset containing 13 host–guest pairs involving the carboxylated-pillar[6]arene host. We first assess the charge schemes for solutes by comparing the charge-produced electrostatics with many ab initio references, in order to obtain a preliminary albeit detailed view of the charge quality. Then, we focus on four modelling details of end-point free energy calculations, including the docking procedure for the generation of initial condition, the charge scheme for host and guest molecules, the water model used in explicit-solvent sampling, and the end-point methods for free energy estimation. The binding thermodynamics obtained with different modelling schemes are compared with experimental references, and some practical guidelines on maximizing the performance of end-point methods in practical host–guest systems are summarized. Further, we compare our simulation outcome with predictions in the grand challenge and discuss further developments to improve the prediction quality of end-point free energy methods. Overall, unlike the widely acknowledged applicability in protein–ligand binding, the standard end-point calculations cannot produce useful outcomes in host–guest binding and thus are not recommended unless alterations are performed.


Introduction
Pillar[n]arenes as a family of cylindrical macrocycles with symmetrical rims have exhibited great potentials in various laboratory and industrial applications, such as drug delivery, molecular machinery, catalyzed synthesis and fluorescence sensing [1][2][3][4][5][6][7][8]. Despite the guest-binding ability of pillar [n] arenes, the native form suffers from low solubility in aqueous environment and is unsuitable for biomedical applications. Chemical modifications are widely recognized an effective solution to this problem [9][10][11][12]. Among modified pillar[n]arenes derivatives, carboxylated-pillar[n]arenes with n = 6 or 7 are promising candidates in drug delivery due to their enhanced water-solubility, strong drug-binding ability, easy functionalization, and intermediate levels of cavity volume and entrance size [13][14][15][16]. An example of practical applications of functionalized pillar[n]arenes is the dramatic increase of the solubility and the significant pKa shift of 1 3 sanguinarine upon the carboxylated-pillar [6]arene (WP6) encapsulation [17].
Molecular simulations of protein-ligand and more generally host-guest systems accumulate time series data of atomic positions and energetics. These time-resolved information is then used to approximate ensemble averages under the ergodicity assumption [18][19][20][21][22]. However, due to the significant gap between the simulation-accessible time scale and that of the process of interest (e.g., host-guest binding/ unbinding), it is generally impossible to observe multiple transitions between states of interest during the course of brute-force simulations [23][24][25][26][27]. Therefore, in simulation investigations of the binding/unbinding processes, designed modelling schemes are required. If direct observations of binding/unbinding events are pursued, free energy simulations in the physical configurational space or along the artificial alchemical pathway could be performed [28][29][30][31]. Although these enhanced sampling techniques sample the gradual change of the whole system during the binding/ unbinding events and are considered rigorous free energy techniques, their high computational costs hinder large-scale applications in complex condensed matter systems [32,33]. The dramatic increase of computing power in recent years relieves the situation to some extent, but the size of dataset manageable for these rigorous free energy methods is still limited. Relatively cheap but less accurate alternatives are end-point free energy methods including MM/PBSA and MM/GBSA [34][35][36][37], the single-trajectory form of which only samples the well-defined bound state and thus is easier to converge. Currently, end-point free energy techniques are widely employed in drug discovery as a highly efficient tool for screening a large set of potential hits [38][39][40][41][42]. A computational technique cruder than end-point methods is molecular docking, which uses simpler energy functions (scoring function) and sampling techniques to ensure fast screening of an even larger set of molecules [43][44][45][46][47]. In the hierarchical screening with computational techniques, the docking procedure is often applied before end-point free energy calculation, and the latter is considered as a re-scoring procedure processing the docking outcome [48]. Aside from the sampling techniques, the model used in computation or Hamiltonian is also critical and determines the ultimate accuracy level that a comprehensive simulation can achieve [49][50][51][52]. In complex systems such as protein-ligand and also host-guest complexes, it is often desirable to employ fixedcharge force fields due to their balanced efficiency and accuracy. A huge number of benchmark free energy calculations have revealed that different force-field parameters could lead to different conformational preferences, binding thermodynamics and also interaction networks [53,54].
Despite the importance of macrocyclic hosts and their interactions with drug-like guest molecules, the number of scientific reports on practical experiences of end-point free energy techniques in novel host-guest complexes is rather limited, not to mention bias-free evaluations covering a variety of critical modelling procedures. In this work, using an interesting dataset containing 13 WP6 host-guest systems provided in the recent SAMPL9 challenge [55], we investigate four key modelling details of end-point free energy calculations extensively, aiming at providing a thorough evaluation of end-point methods in similar host-guest systems. The benchmarked four details include the method for the generation of initial condition, the charge scheme for host and guest molecules (i.e., solutes), the model for water molecules (solvents), and the parameters used in end-point free energy analysis. The accumulated comprehensive numerical experiences are summarized and interpreted to provide some general guidelines of maximizing the performance of standard end-point free energy calculations in host-guest systems similar to the current WP6 cases. Finally, we discuss about further directions (altered end-point schemes) to pursue improved performance.

Computational setup
The 3D chemical structures of all molecules under investigation are obtained from the GitHub site of the SAMPL9 challenge [55]. The determination of the protonation states of host and guest molecules follows a recent computational investigation [56]. The resulting net charges of all molecules under consideration can be interpreted from the chemical structures shown in Fig. 1. The aim of the current work is to provide a detailed and extensive assessment of end-point free energy techniques for WP6 host-guest binding. Thus, a set of parameter/force-field combinations are modelled in our simulations.

Charge schemes
First, for atomic charges, we consider two popular fixedcharge models. The first one is AM1-BCC [57], which optimizes the molecule at the AM1 level and combines the Mulliken charges with a pre-fitted correction term named bond-charge correction. The aim of this charge model is to substitute computationally demanding ab initio calculations with computationally cheaper corrected semiempirical charges, which is achieved mainly by training the bond-charge correction with the HF [58-60]/6-31G* electrostatic potential (ESP) for a set of small molecules. The second charge model is the restrained electrostatic potential (RESP) [61] scheme. The structure of each molecule is first optimized at B3LYP [62-64]/6-31G*, which produces a high-quality low-energy configuration. This representative geometry is then used to generate ESP data with the Merz-Kollmann scheme. We consider two ab initio levels including the gas-phase HF/6-31G* and the implicit-solvent (IEFPCM for water) B3LYP/def2-TZVPP to scan the molecular ESP. The former ab initio level, the gas-phase HF/6-31G*, follows the tradition of AMBERlike force fields, while the latter B3LYP/def2-TZVPP with IEFPCM solvation serves as a representative and popular method employing a higher-level electronic structure treatment and taking solvent effects into consideration. The obtained ESP data are then used to fit atom-centered charges with a two-step regularization procedure, producing two sets of RESP charges named RESP-1 (HF/6-31G*) and RESP-2 (B3LYP/def2-TZVPP IEFPCM), respectively. All the other missing parameters are obtained from the transferable GAFF2 parameter set [65]. An exception for this GAFF2 extraction is the G4 guest due to the absence of Si parameters in the pre-fitted GAFF2. For this molecule, we use bonded and vdW parameters reported in a recent publication, where the parametrization of Siinvolved species is the similar to the GAFF2 set [66]. It is worth noting that the Si parametrization in the reference is using AM1-BCC charges instead of the more rigorous HF/6-31G* RESP charges, but according to our detailed analysis of electrostatic data shown later, this corrected semi-empirical charge scheme works reasonably well for this molecule. Thus, the parametrization reported in the reference is generally usable and compatible with the charge schemes used in our modelling. For readers interested in reproducing our simulations, the atomic-charge files (.mol2) of all molecules for the three charge sets (AM1-BCC, RESP-1 and RESP-2) are shared online at https:// github. com/ prosz xppp/ WP6-host-guest-bindi ng.

Water models
Second, although the force-field parameters for host and guest molecules would have a significant impact on the simulation outcome, the parameters of the other components involved in the simulation box, e.g., water molecules, would also influence the modelling results [67][68][69]. Therefore, in our extensive end-point free energy calculations, we consider two water models including the widely applied TIP3P [70,71] and SPC/E models [72]. Although both water models have three interaction sites, all parameters (e.g., force constants and HOH angle) of them are different. As a result, various bulk properties of the pure solvent (water) obtained with the two water models differ, triggering differences in the solvation environments surrounding the host-guest solutes. This, ultimately, would alter the conformational/energetic preference of the complex structure and the resulting binding thermodynamics, leading to the solvent-modeldependent simulation outcome.

Initial bound conformation from molecular docking
As only the structures of host and guest are given in the online server, a question that needs to be solved in model construction is securing the initial bound conformation. In our modelling, the initial guess of the bound structure is obtained from docking calculations with the Autodock Vina program and the Autodock4 (AD4) scoring function [73,74]. The WP6 host with a D6 symmetry from the online server is used as the docking target, on which each guest is docked. The top-1 (i.e., most stable) structure is extracted as the starting configuration for each host-guest complex. The simulation box is then constructed by parametrizing and solvating the host-guest complex with different parameter sets (e.g., solute charges and water) and adding nonpolarizable monovalent spherical counter ions [75,76] of Na + or Cl − for neutralization. Periodic boundary conditions are always employed in our simulations. As the simulation is performed in an unbiased way in end-point free energy calculation, the whole sampling trajectory is often fluctuating in the neighborhood of the initial configuration. Thus, it is widely believed that the initial condition could have a significant impact on the simulation outcome. To investigate the impact of docking details on the simulation results, various parameters can be changed. Among these docking details, the scoring function itself is often considered the core of the problem. Therefore, this factor is investigated in our benchmark test of the docking details. Specifically, we repeat the docking calculation with another scoring function named Vina [73], which differs from the AD4 scoring function in both the definition of scoring terms and computational complexity. These two scoring functions show different behaviors in many benchmark calculations [73,77,78], which makes the current scoring-function variation indeed a meaningful test for the bound-conformation generation. Similar to the AD4 case, the box construction is repeated with the Vina-given top-1 bound structure.

End-point free energy calculation
With the constructed simulation box, we then turn to the molecular dynamics section. The SHAKE constraints on bonds involving hydrogen atoms are turned on [79,80]. Langevin dynamics [81] with the collision frequency of 2 ps −1 are implemented for temperature regulation and the time step for integrating the equations of motion is set to 2 fs. The real-space cutoff for non-bonded interactions is set to 10 Å, and the PME method is used to treat long-range electrostatics [82]. Unbiased sampling is performed with the GPU version of the pmemd engine with the hybrid precision (SPFP) in the AMBER [83] suite.
To avoid perturbation of the docking-produced binding pose, for each host-guest complex, we first energy-minimize the system and perform 300 ps constant-volume heating with weak harmonic restraints on solutes. After this initial equilibration step, we turn to NPT equilibration for 1 ns to reach the 1 atm pressure. Then, we perform a long production run lasting 100 ns for data accumulation. The sampling interval of host-guest configurations is set to 10 ps, which is typical in end-point free energy analysis. The aggregated sampling time for 12 parameter combinations (3 charge sets, 2 water models and 2 initial conditions for each of the 13 host-guest pairs) reaches ~ 16 μs. After the sampling part, we process the trajectories with different end-point free energy methods to estimate binding affinities. The free energy of binding in end-point free energy techniques can be generally expressed as the following equation, Here, the first two components are gas-phase electrostatic and vdW contributions to the binding enthalpy, the third term is the free energy contribution from solvation effects, and the last term is the gas-phase entropic contribution. The first two terms can be calculated directly with all-atom force fields, the third solvation term can be further divided into the polar and non-polar solvation parts that can be estimated with PBSA or GBSA implicit solvent models [84][85][86][87], and the last term can be calculated with various methods such as normal mode analysis (NMA) [88], quasi-harmonic approximation [89] and so on [90]. In our calculations, the GBSA solvation is performed with the GB OBC model [91,92], which is quite popular in modern end-point free energy calculations. The normal mode approximation has been widely used in various cases such as static quantum mechanics calculations and protein-protein binding [93][94][95][96][97][98][99][100][101]. Due to the popularity of NMA in end-point free energy analysis of protein-ligand and host-guest systems, in the current benchmark calculation, we select this 'standard' method to estimate the entropic contribution to the binding free energy. As the computational cost of NMA calculations (specifically the calculation of the Hessian matrix) is quite high in complex systems and the entropic changes from NMA are similar for different configurations, often only several or tens of snapshots are used to compute this term [102][103][104]. Thus, in the current WP6 host-guest calculations, 50 frames equally spaced in each simulation trajectory (i.e., 2 ns per sample) are used for NMA calculations.

Results and discussions
In the above paragraphs of the previous section, four key influencing factors to be benchmarked in the current work have been discussed. The initial condition can be obtained with the AD4 or Vina scoring function; the charge scheme (1) ΔG binding = ΔE elec + ΔE vdW + ΔG solv − TΔS gas 1 3 for solute molecules (i.e., host and guest) could be AM1-BCC or RESP (RESP-1 or RESP-2); the water model used for solvation could be TIP3P and SPC/E; and many details of the end-point free energy method could be varied. In the following discussion of the simulation outcome, we use the details of these four factors to indicate the modelling setup, e.g., Vina + AM1-BCC + TIP3P + MM/PBSA for the combination of modelling scheme using Vina-docked pose, AM1-BCC charges for solutes, TIP3P water, and MM/PBSA for free energy estimation. Note that the experimental binding affinities of the considered WP6 host-guest complexes are obtained from reference [105].

Charge quality
Direct comparison of atomic charges produced by different charge schemes provides a pre-simulation evaluation of the charge quality, providing hints on possible behaviors of the binding thermodynamics produced by these charge schemes. In RESP fitting, ESP deviations and hyperbolic regularizations are included in the loss function, and the quality of the regularized least-square fitting is often assessed by ESP relative root-mean-squared error (RRMSE). Although we have two RESP charge sets (RESP-1 and RESP-2), here we only consider the first one RESP-1 obtained from traditional AMBER-like RESP fitting, as it shares a goal similar to AM1-BCC (i.e., HF/6-31G* ESP) and a numerical procedure similar to RESP-2. From the percentage errors in Fig. 2a, we observe that the RESP charge scheme provides a much better fitting of the Coulombic ESP around each molecule than AM1-BCC, and the ESP RRMSE of AM1-BCC is still within an acceptable range. Thus, both charge schemes do not have significant problems in reproducing the high-level electrostatics for WP6 host-guest systems. An observable excluded in the loss function of RESP fitting, the dipole moment, is also checked in this charge quality evaluation in Fig. 2b. The absolute deviation of the dipole moment is negligible for RESP-1 charges and is a bit larger for AM1-BCC, which suggests the superiority of the RESP charge scheme. A worth noting observation is the wide range of dipole moments of the guest molecules, suggesting the diversity of the investigated dataset. Thus, the computational insights accumulated in this work could be generalizable to host-guest binding involving pillar[n]arene derivatives.
The fitting target, the ESP around each molecule, is influenced by the level of theory [106]. Thus, another interesting and valuable test performed in charge assessment is on the ab initio level. Although it is straightforward to change the level of theory used in RESP fitting, as AM1-BCC includes a pre-fitted term trained with HF/6-31G* data, its target level cannot be easily varied unless refitting the correction term. Thus, instead of directly varying the fitting target, we simply change the level of theory to generate the ESP data for evaluation. The deviations of the charge-produced ESP from different ab initio results are computed to assess the quality of atomic charges. In this numerical test, aside from Preliminary of charge quality for the two charge schemes (AM1-BCC and RESP-1 charge sets). a The ESP RRMSE used to assess the percentage deviation of the charge-produced ESP from the ab initio reference. b The dipole moments obtained from the two charge schemes and the electronic structure calculations. For both the Coulombic ESP and the dipole moments, RESP-1 charges reproduce the HF/6-31G* results obviously better than the AM1-BCC charge scheme. However, it is fair to note that the difference between the AM1-BCC results and the ab initio reference is also insignifi-cant. This phenomenon is in stark contrast to the Cucurbit [8]uril and β-cyclodextrin host-guest systems investigated in our previous works, where the AM1-BCC charge scheme triggers significant problems and RESP charges are required to obtain accurate ESP and binding information. According to the satisfactory behavior of AM1-BCC, the current WP6 host-guest systems can be considered similar to the training set of the AM1-BCC charge scheme and thus the electrostatics behavior would be relatively easy to model compared with other host-guest systems the AM1-BCC and RESP-1 target HF/6-31G*, we select four higher-level methods including B3LYP/def2-TZVPP, M06-2X [107]/def2-TZVPP, MN15 [108]/def2-TZVPP and wB97X-D [109]/def2-TZVPP. These density functionals are more accurate than HF, and the basis sets are upgraded from the crude Pople series 6-31G* to the polarized triplezeta def2-TZVPP. Aside from gas-phase calculations, we also include solvent effects with the IEFPCM implicit solvent model. The percentage errors of molecular ESP are shown in Fig. 3. The AM1-BCC ESP RRMSE using the gas-phase ab initio references is presented in Fig. 3a, where the error shows obvious system-dependence. In most cases, the deviation from HF/6-31G* is marginally smaller than those from the other higher-level results, which possibly arises from the fact that this level of theory (HF/6-31G*) is employed in the training of this semi-empirical charge model. The ESP percentage errors of AM1-BCC from the other ab initio levels are similar, which possibly suggests the similarity of ESP data generated with these levels of theory. The RESP-1 deviations from the gas-phase ab initio results are shown in Fig. 3b, where a similar behavior is observed. Namely, the magnitude of the deviation of charge-produced ESP from ab initio results shows system-dependence, the deviation from the HF/6-31G* reference is slightly smaller than the other, and the ESP RRMSEs computed with higherlevel methods are similar. By comparing the AM1-BCC and RESP results in Fig. 3a and b, interestingly, the ESP deviations for RESP-1 charges are obviously smaller than AM1-BCC, regardless of the level of theory used to generate the reference ESP data. This phenomenon, along with the similar ESP deviations from different ab initio references, suggests the similarity of ESP data generated at different ab initio levels. As a result, the lowest level tested here and also the widely employed reference in AMBER-like force fields, Fig. 3 The percentage deviation of the charge-produced ESP from different ab initio references. The ESP RRMSE of a AM1-BCC and b RESP-1 charges fitted targeting the HF/6-31G* ESP when the gasphase data are used as the reference. The ESP RRMSE of c AM1-BCC and d RESP-1 when using implicit-solvent data as reference. The implicit solvent model of IEFPCM is used to mimic the solvent environment HF/6-31G*, serves as a good yet efficient option to generate the reference ESP data. Consideration of higher-level techniques could be helpful, but the merits are of limited magnitudes and exhibit system-dependence. When incorporating solvent effects into the calculation, we reach the results presented in Fig. 3c, d. By comparing the ESP RRMSEs under the AM1-BCC charge scheme computed with gas-phase and IEFPCM ab initio references (i.e., Fig. 3a, c), interestingly, the ESP RRMSEs for many molecules (e.g., WP6, G1 and G8) decrease upon the inclusion of implicit solvent. However, the ESP RRMSE for many other molecules such as G3 increases. Thus, it is difficult to say whether the AM1-BCC charge scheme reproduces the solvated situation or the traditional gas-phase scan more accurately. The comparison between RESP-1-produced errors with gas-phase and implicit-solvent references gives a somehow similar behavior (c.f., Fig. 3b, d), but this time the ESP RRMSE increases for most molecules such as WP6 and G1 and only a small portion of molecules (e.g., G12) exhibit a decreasing behavior. This phenomenon is rather not unexpected, as the gas-phase ESP data are actually the fitting target of RESP-1 charges. When we compare the ESP RRMSEs under the two charge schemes computed with IEFPCM references (i.e., Fig. 3c, d), interestingly, the RESP-1 errors are larger than AM1-BCC in many cases, which is also caused by the overfitting of the gas-phase ESP in the RESP-1 charge generation. Overall, with all levels of theory to generate the evaluation data, the AM1-BCC and RESP-1 ESP RRMSEs are not unreasonably huge for all molecules. When the gas-phase ESP data are targeted, RESP-1 outperforms AM1-BCC. When the aim is to reproduce the implicit-solvent ESP, RESP-1 is slightly worse than AM1-BCC. Although the ESP RRMSE of the RESP-2 charge set is not computed, according to the above observations about the RESP-1 charge set, we expect the RESP-2 charge set to reproduce the implicit-solvent ESP in a better way but exhibit slightly larger deviations from the gas-phase data. Due to the difference between fitting targets of the RESP-1 and RESP-2 charge sets, we consider both of them as realizations of the RESP charge scheme in molecular simulations of end-point free energy calculations. As the ESP errors of the two charge schemes (i.e., AM1-BCC and RESP) are within the reasonable range and are similar in magnitude, both of them ensure accurate calculation of electrostatic interactions with the Coulomb equation. Thus, their performances are similar.
The similar ESP reproductions of the two charge schemes are quite surprising in host-guest systems. In our previous investigation of other challenging host-guest coordinations involving the Cucurbit [8]uril (CB8) host, the ESP reproductions of these two charge schemes differ significantly [110]. Specifically, for the CB8 host, the RESP charge scheme produces an ESP RRMSE ~ 5%, which is similar to the current WP6 situation. However, the AM1-BCC charge scheme fails to reproduce the ab initio ESP with a very huge ESP RRMSE ~ 33%. The major difference in ESP arises from the dissimilarity of the training set of AM1-BCC and the heterocyclic CB8 host [110]. The application of AM1-BCC to CB8 host-guest binding thus seems unsuitable due to its failure to reproduce the ab initio electrostatics. A similar observation is also obtained for β-cyclodextrin derivatives [111]. The binding affinities obtained through extensive sampling of binding/unbinding events under AM1-BCC deviate significantly from experimental values, which is in agreement with the detailed evaluation of electrostatic properties (e.g., ESP) [110]. However, in the current WP6 situation, the AM1-BCC charge scheme performs quite well in reproducing the ab initio ESP, which suggests that the AM1-BCC training set already covers the features of WP6. Namely, the macrocyclic host WP6 shares similar structural and chemical features with many drug-like molecules in the AM1-BCC training set. Another possible consequence of the similarity of WP6 and the AM1-BCC training set composed of drug-like molecules is the perfect suitability of GAFF derivatives, as these general-purpose force fields are also fitted with a large set of drug-like molecules. Although we do not perform calculations to evaluate the suitability of GAFF derivatives in the current WP6 host-guest systems like our previous works [111][112][113][114], this speculation about the GAFF(2) suitability is highly probable. All these phenomena possibly suggest that the current SAMPL9 WP6 host-guest systems are 'easier' to model compared with the previous CB8 cases, as the broad range of highly efficient and accurate computational tools developed for drug-like molecules could be employed. However, it should be noted that these force-field similarities do not guarantee the satisfactory performance of end-point free energy methods, as various other factors including the intrinsic limitations of end-point methods (e.g., the suitability of normal mode approximation, the neglect of conformational change upon host-guest binding, and the accuracy of the implicit solvent model) and sampling problems still play a role.

Sampling length
We then turn to the simulation part and check the free energy estimates from end-point techniques. Before checking the consistency between computational results and experimental references, we first validate the convergence of the free energy calculation by monitoring the time-dependence of the free energy estimates in Supplementary Figs. S1 and S2. The whole 100 ns sampling length in each simulation is divided into 10 groups, each of which lasts 10 ns. Starting from the first 0-10 ns point, we gradually add new trajectories to the existing one, reaching the 0-20 ns, 0-30 ns, and finally 0-100 ns results. When convergence is reached, the endpoint estimates should not change with further sampling.
Although the convergence can be reached within tens of ns (e.g., 40 ns) for many host-guest pairs under many simulation protocols, there are still cases where convergence cannot be reached even at the end of the long 100 ns unbiased sampling. This non-satisfactory convergence behavior is rather unexpected, as the sampling length of 100 ns employed in our simulation is already much longer than the commonly used ones such as 10 ns. To investigate the reason causing this non-satisfactory convergence behavior, we compute the structural root-mean-squared deviation (RMSD) of heavy atoms for the host and the guest. The first frame of the 100 ns production run is selected as reference in alignment and computation. The numerical results of the guests G1, G3, G8, G9 and G5 under some parameter sets are presented in Supplementary Fig. S3 as illustrative examples. The first four guests are examples with convergence problems, while for the last guest G5 the convergence behavior is quite satisfactory and thus this guest is used for comparison. For the guest G1 under the Vina-RESP-2-SPC/E parameter set, a conformational change happens for the guest molecule at ~ 10 ns (a dramatic increase of the guest RMSD), after which the guest RMSD exhibits an equilibrium fluctuating behavior around 2.4 Å. When looking at the time series of the free energy estimate for this system (i.e., Vina-PBSA-RESP-2-SPC/E and Vina-GBSA-RESP-2-SPC/E in Supplementary Fig. S2), some similar time-dependent behavior could be observed. Specifically, starting from the 10 ns estimate (about − 10 kcal/mol) corresponding to the free energy of binding for the first guest conformation with ~ 1.7 Å RMSD, the free energy of binding keeps decreasing monotonically due to the inclusion of data from the conformational state with ~ 2.4 Å RMSD. If we assume the binding free energy for the second state to be − 22 kcal/mol and plot ΔG = −10−22(n−1) n with n = 1,…,10 to approximate the time-evolution of free energy difference, this analytical curve agrees well with the numerical result in Supplementary Fig. S2. Thus, the reason for the nonsatisfactory convergence behavior of G1 is highly related to the conformational change of the guest. It should be noted that the significant difference between the binding free energies in different conformational states is also an influencing factor triggering this convergence problem. The 10% data from the first conformational state (− 10 kcal/mol) could contaminate the whole dataset, and the inclusion of the other 90% − 22 kcal/mol results, although almost eliminating the influence of the − 10 kcal/mol points, still have trouble making the convergence plot seemingly well. If the binding free energies for the two conformational states are close in value, the convergence behavior would look much better. The reasons for the convergence problem for the guests G3, G8 and G9 shown in Supplementary Fig. S3 are somehow similar and thus would not be discussed further. For the guest G5, the convergence behavior in Supplementary Figs. S1 and S2 is quite satisfactory, and the structural RMSD in Supplementary Fig. S3 is also very stable, which is provided as the example of systems without convergence problem.
The reasons causing this convergence problem could include the quality of the initial condition, the accuracy of Hamiltonian used in configurational sampling, and the compatibility of these two components (Hamiltonian and end-point free energy method). Specifically, if the docking procedure produces a good initial guess and the fixed-charge force field is good enough to stabilize this bound structure, the simulation employing the fixed-charge force field would achieve a good level of convergence. If the initial guess produced by docking is non-satisfactory, then the simulation would have trouble stabilizing the bound structure and the system keeps changing during the course of the simulation, leading to a not-so-good convergence behavior of the endpoint calculation. In the case that the force field is not good enough but the initial guess is satisfactorily good, the simulation still could have trouble stabilizing the initial guess and thus lead to non-convergence of the end-point calculation.
The above sampling-length comparison suggests that converged end-point free energy estimates generally require tens of ns or longer sampling time, and the 100 ns sampling is already very long in modern end-point calculations. Thus, in the following parts of the manuscript, we use the free energy results obtained with this sampling length in further investigation of other influencing factors. The 100 ns endpoint free energy estimates obtained with all combinations of modelling parameters (i.e., charge schemes, water models, scoring functions, and end-point calculation schemes) are summarized in Supplementary Tables S1-S4. Aside from RMSE mentioned previously, we also use the mean signed error (MSE), Kendall τ rank coefficient [115] and the Pearlman's predictive index (PI) [116] to evaluate the quality of computation for this WP6 host-guest dataset.

Changing the parameter set
The parameter set or Hamiltonian used in explicit-water sampling is critical to the simulation outcome. As mentioned in the computational setup section, two charge schemes are employed for host and guest molecules (i.e., solutes), while for water we also test two models. In the following paragraphs, we would investigate the impact of the parameter set on the end-point free energy estimates.
The force-field/parameter set involves two components, the charge scheme for solutes and the water models. In Fig. 4, the quality metrics are grouped according to the water model and charge scheme. When comparing the error metrics of the three charge sets including AM1-BCC, RESP-1 and RESP-2, in most cases the AM1-BCC performs worst, the RESP-2 charge set exhibits an intermediate accuracy, and the RESP-1 set performs best (smallest RMSE), as shown in Fig. 4a. When checking the ranking coefficient in Fig. 4b, similar observations about charge sets could be obtained. Namely, the AM1-BCC charge set still performs worst, the RESP-2 set achieves an intermediate accuracy, and RESP-1 performs best in most cases. To illustrate the relative performance of different charge sets in a clearer way, we present the number of times that a given charge set yields the best/worst prediction in Fig. 4c. The values of the error and ranking metrics are stacked together as both of them are prediction targets. It is clearly shown that the RESP-1 charge set always provides the best prediction quality among the three charge sets, confirming its superiority. By contrast, the AM1-BCC charge set sometimes achieves the best prediction quality, but also provides the worst results in most cases. The resulting wide range of its number of times achieving the best/worst prediction quality suggests that the performance of this charge set is highly related to the other parameters and thus is not very robust in end-point calculation. As for the RESP-2 charge set, it never achieves the best prediction quality but sometimes produces the worst results, making it a selection of intermediate accuracy and robustness among the three charge sets. The consistent good performance of RESP-1 (traditional HF/6-31G* RESP fitting) in the reproduction of absolute binding affinities (RMSE) and the experimental rank (τ) suggests the superiority of this charge scheme in end-point free energy calculations. The AM1-BCC and RESP-2 charge sets are relatively less accurate. However, it should be fair to note that the apparent accuracies of these charge schemes rely heavily on error cancellation. Namely, other factors such as implicit solvent and the approximated calculation procedure of end-point methods also play a role here. Then we check the solvent-model dependence of endpoint results. When using the TIP3P water model, the RMSEs on the left part of Fig. 4a are somehow larger than the SPC/E ones on the right, regardless of the charge scheme and the docking procedure (scoring function). This observation suggests the superiority of the SPC/E model in reproducing the absolute values of binding affinities in end-point free energy calculations. When comparing the ranking coefficients in Fig. 4b, we observe somehow different behaviors. When the solutes are described with the AM1-BCC charge model, TIP3P outperforms SPC/E. By contrast, when using either of the two RESP charge sets, SPC/E prevails. As the RESP charge scheme already shows higher accuracy in reproducing the absolute binding affinities (Fig. 4a) and achieves better ESP reproduction (c.f., Figs. 2, 3), this charge scheme is considered more accurate and is recommended in the charge-scheme comparison presented in the previous paragraph. Thus, using this recommended charge scheme, SPC/E achieves better performance in end-point free energy calculations. The superiority of the SPC/E water model can also be probed by the analysis of top-performing times in Fig. 4c, where the SPC/E model provides the best predictions for both RMSE and τ in most situations.
Overall, the above force-field comparison suggests that in end-point free energy calculations, the RESP-1 charge scheme performs best among charge schemes for solutes and the SPC/E water model achieves a higher accuracy than the widely applied TIP3P.

Initial condition and end-point methods
The initial condition can be pivotal in free energy calculations with the unbiased sampling technique, especially for systems with elevated (free-)energy barriers between relevant basins. If the initial bound structure is wrongly assigned, ns-length end-point free energy calculation could never really explore the true or optimal binding pose, introducing systematic bias of unknown magnitude. Thus, a detailed evaluation of this docking procedure for the generation of the initial condition seems indispensable in a thorough benchmark in the current WP6 host-guest binding.
As discussed in the computational setup section, the most stable (top-1) binding pose obtained with either of the two scoring functions including AD4 and Vina is selected as the starting bound configuration. A direct comparison between binding poses produced by the two scoring functions is thus on the structural RMSD of non-hydrogen atoms. The guest conformation is obviously of interest, but it gives little information of other aspects of the inter-molecular packing, e.g., the degree of insertion into the host cavity, which is also the property of interest. Thus, aside from the backbone RMSD of the guest itself, we also compute the RMSD of the whole complex. Although the host is fixed in docking calculations, this overall RMSD provides the structural drift of the whole complex and thus reflects the overall deviation of inter-molecular packing. The results of the AD4-Vina RMSD are depicted in Supplementary Fig. S4. For some guests such as G1 and G7, the close-to-zero guest RMSD indicates that the guest conformations obtained with the two scoring functions are the same, but the large RMSD of the whole complex suggests the difference in inter-molecular packing. For the other systems, the remarkable larger-thanzero behavior of the guest RMSD suggests the difference in the guest conformation, which along with the large RMSD of the host-guest complex indicates the difference in intermolecular coordination. Thus, the initial bound configurations produced by the two scoring functions differ. We then turn to the free energy estimates extracted with the two endpoint methods MM/PBSA and MM/GBSA to investigate the impact of initial conditions on the free energy of binding. In Fig. 5, the quality metrics are grouped according to these two criteria. We first compare the generation procedure of the initial condition. As shown in Fig. 5a, the RMSEs of AD4-and Vina-produced initial conditions are somehow similar, and we cannot conclude which scoring function would lead to better reproduction of absolute binding affinities. When checking the ranking coefficient in Fig. 5b, still the relative performance of the two scoring functions shows combination-dependence, i.e., depending on the other modelling details such as the charge scheme. The number of times that each scoring function produces the best results is depicted in Fig. 5c, where AD4 achieves better statistics than Vina and the contributions come both from RMSE and τ. Thus, statistically, the AD4-better-than-Vina phenomenon is more probable to be observed for both error and ranking metrics, but it should be fair to note that the RMSE values with AD4 and Vina are quite similar.
We then turn to the comparison of the two end-point methods. For RMSE in Fig. 5a, the MM/PBSA results are larger than MM/GBSA in most cases, which suggests that the MM/GBSA method achieves a better performance than MM/PBSA in reproducing the absolute binding affinities. When checking the ranking coefficient in Fig. 5b, the qualities of predictions of the two end-point methods show parameter-dependence, i.e., depending on other factors such as charge scheme and water model, which makes it difficult to conclude which method prevails. Similar observations could be obtained from the number of times that a method achieves the best performance depicted in Fig. 5c. For RMSE, MM/GBSA obviously outperforms MM/PBSA in most cases, but the top-performing times of them for the ranking coefficient are similar.
As the combination of the RESP-1 charge set and the SPC/E water model achieves the best performance in the previous force-field comparison, we fix the force-field parameters to this recommended parameter set and re-visit the data in Fig. 5. Interestingly, when using this RESP-1 + SPC/E parameter set, the free energy calculation achieves the highest performance among different force-field combinations for both the error and ranking metrics, regardless of the initial condition (scoring function) and end-point methods. Under this RESP-1 + SPC/E parameter set, the AD4 scoring function achieves a better performance than Vina for both error and ranking metrics, which suggests the superiority of this scoring function. As for the end-point methods, the MM/GBSA method outperforms MM/PBSA for both error and ranking metrics. Therefore, when using the recommended RESP-1 + SPC/E force-field combination, the AD4 scoring function and the MM/GBSA end-point method perform best. Overall, the above comparison of scoring functions used in molecular docking, charge schemes for solutes, water models, and end-point methods leads to the top-performing AD4 + RESP-1 + SPC/E + MM/GBSA combination in end-point free energy calculations in the current WP6 host-guest binding. D e s p i t e t h e t o p p e r f o r m a n c e o f t h e AD4 + RESP-1 + SPC/E + MM/GBSA end-point free energy calculation scheme, the deviations of the computed values from the experimental reference are still very huge. The RMSE is as large as 12.2 kcal/mol, which is far from acceptable in free energy calculation. The MSE of this set is 9.4 kcal/mol, which suggests a significant level of systematic overestimation (more negative) of binding affinities. The Kendall τ and PI are only marginally larger than zero, which suggests the failure of the reproduction of the experimental rank. This is already the top-level accuracy of standard endpoint free energy calculations, but the prediction quality is still non-satisfactory. To make the end-point free energy calculation really usable in host-guest binding, it is necessary to alter various details and pursue accuracy improvements.

Comparison with other methods
Five ranked submissions exist for the current WP6 host-guest dataset. We access the results from the SAMPL9 GitHub website [55] and compute the quality metrics (specifically RMSE and τ) in Fig. 6. When using the RMSE error metrics as the ranking criterion, the machine learning (ML) technique performs best (the smallest RMSE ~ 2 kcal/mol). The second top-performing method is the extended linear interaction energy (ELIE) with ~ 2.5 kcal/mol RMSE. This ELIE technique is a trained end-point method, which requires system-specific parametrization of the weights of different energy terms in free energy estimation [117]. The other three computational results are obtained with the alchemical method, although their detailed modelling protocols differ [56]. The average RMSE of these rigorous free energy techniques is approximately 3 kcal/mol, which is a bit larger than the end-point ELIE result. When considering the ranking coefficient τ as the criterion, interestingly, the performance of alchemical free energy calculations improves significantly, and the average τ of the three alchemical calculations is similar to the end-point one. By contrast, the ML technique performs poorly in reproducing the experimental rank of binding affinities. Overall, the ML method seems powerful in predicting the absolute values similar to the experimental reference without many outliers, but cannot properly assign the relative binding free energies of different systems. Free energy calculations based on atomistic simulation (ELIE and alchemical calculation) can properly differentiate the relative binding affinities of different host-guest complexes, but the absolute values cannot be accurately computed.
The trained MM/PBSA method, ELIE [117], is very similar to the end-point calculations performed in the current work. The difference lies in the system-specific training step that gives the relative weights of different energy terms defined in Eq. (1). Considering this parameterization step, this ELIE method also shares some similarities with ML and thus is not fully physics-based and is not directly transferable to other systems, especially for targets without known binding affinities. It should be noted that there are some other end-point methods sharing a similar parametrization procedure, e.g., the PBSA_E method [118]. Compared with the results reported in the current work that end-point free energy methods are straightforwardly applied without any system-specific parameter tuning, the trained MM/PBSA method exhibits a much better performance in both the computation of absolute values (RMSE) and the ranking of binding stabilities (τ). This improvement can be solely attributed to the system-specific parameterization step, and indeed can be considered a way (aside from the four key details benchmarked in the current work) to improve the end-point free energy results. However, it is ELIE technique involves an end-point free energy calculation procedure similar to the current work, and the difference is that the ELIE method uses pre-trained weighting factors for different components in the end-point free energy estimation. Interestingly, the dockingproduced affinity estimates (scores) with both AD4 and Vina scoring functions perform quite well in reproducing the experimental values, despite the low computational costs of the docking procedures fair to note that the involvement of the training step could hinder its application to newly encountered systems.
Inspired by a recent work reporting a surprisingly good estimation of binding affinities for host-guest systems with docking scores [119], we also extract the docking results under the two scoring functions and compare them with the experimental affinities. The binding affinity of each host-guest pair is estimated with the exponential sum of docking scores for different binding poses ΔG i , namely . The experimentcomputation correlogram is shown in Fig. 7, where no systematic deviation is observed. The RMSE values of the two scoring functions are 1.9 kcal/mol and 1.6 kcal/mol, which are plotted along with the quality metrics of SAMPL9 submissions in Fig. 6a. Interestingly, the docking-produced free energy estimates agree well with experimental values and perform better than computationally intense alchemical simulations and are comparable with ML-based techniques, which agrees with previous observations in other host-guest systems [119]. As for the ranking coefficient τ presented in Fig. 6b, the docking results are significantly better than ML but a bit worse than other simulation-based methods. Notably, for both error and ranking metrics, molecular docking outperforms the end-point results reported in the current work, regardless of the scoring function. Considering the widely acknowledged fact that post-processing docking results with atomistic simulations (normally with end-point free energy calculations) could lead to improved affinity ranks of compounds in protein-ligand binding [120][121][122][123][124][125][126][127], this underperformance of the current end-point calculations in WP6 host-guest binding is somehow counterintuitive. However, this is actually not unexpected, as this hierarchical workflow of computational screening has not been wellvalidated in host-guest modelling. In these prototypical host-guest binding cases, error cancellations become less successful and the problems of various modelling details become exposed. Our calculations reveal these problematic issues, degrade the popularity of the end-point re-scoring procedure, and deepen the understanding of the hierarchical screening in host-guest binding.

Directions to improve the performance of end-point methods
The comprehensive benchmark presented above follows the widely applied standard workflow of end-point free energy calculations. We first obtain the initial condition with molecular docking and then perform molecular simulations to relax the structure and grab an ensemble of configurations in the bound state.  [118], ELIE [117] and so on. Aside from numerical fitting, jumping out of the fixedcharge model could also be attempted. Another direction is altering the dielectric constant, which has been proven useful in many protein-ligand and protein-protein complexes [103,128]. Alteration of the implicit solvent model and the estimation of the entropic contribution could also be considered [129][130][131]. A further critical factor influencing the outcome is the exhaustiveness of the conformational sampling. As the end-point calculation is using configurations sampled with the unbiased technique, there could be several host-guest coordination patterns unexplored and thus not included in the ensemble average in Eq. (1), which could introduce systematic bias of unknown magnitude. This factor could Fig. 7 Correlation between the docking scores obtained with the AD4 and Vina scoring functions and the experimental reference be of great importance especially for host-guest binding, as such multi-modal binding behavior has been observed in several important host-guest coordinations such as CB8 and β-cyclodextrin [110,111,113,[132][133][134]. On this aspect, enhanced sampling techniques could be employed to remove the sampling bias.

Concluding remarks
Molecular modelling of complex donor-acceptor systems such as protein-ligand complexes is of great importance in modern computational biophysics. Due to the large size and the conformational flexibility of macromolecular systems, fixed-charge force fields are often incorporated with molecular simulation to probe and scope the spatial and temporal motions of different parts of the systems. Host-guest complexes are considered prototypical models for biomacromolecular systems due to the small size of the host and the relatively simple interaction network. However, despite the assumed simplicity of host-guest coordination, recent computational studies have revealed significant problems in both the force-field accuracy and the coverage of conformational space through configurational sampling. Endpoint free energy techniques as a choice with balanced accuracy and efficiency are widely used in drug discovery. A huge number of scientific reports on applications of endpoint methods in protein-ligand and protein-protein systems could be accessed easily, but numerical experiences in host-guest coordination are relatively limited. As these prototypical host-guest systems are simple in structure and small in size, fortuitous error cancellations can be avoided and potential issues in force-field parameters and sampling techniques can be relatively easy to be identified. Thus, in the current work, using a dataset containing pillar[n]arenebased host-guest systems, we benchmark end-point free energy calculations extensively with a special focus on four modelling details, including the docking procedure to generate the initial condition, the charge scheme for solutes, the water model for solvation, and the end-point method for free energy estimation. The preliminary analysis of charge quality suggests the suitability of the AM1-BCC charge scheme for WP6 host-guest systems. The AM1-BCC and two RESP charge sets (RESP-1 targeting HF/6-31G* and RESP-2 targeting B3LYP/def2-TZVPP IEFPCM) produce molecular ESP similar to ab initio references. However, this does not guarantee the satisfactory performance of end-point free energy calculations employing these charge schemes, as other factors (e.g., the intrinsic limitation of end-point free energy techniques) also influence the simulation outcome.
In free energy calculation, we first investigate the impact of sampling length on free energy outcomes. Tens of ns are generally sufficient, although there are cases where convergence cannot be reached within the length of 100 ns. The reasons could include the quality of the initial condition produced by docking, the accuracy of the fixed-charge force field, and the compatibility of these two components. For instance, if the docking-produced bound structure is also a stable pose under the force field, the simulation would maintain this conformation and the convergence behavior would be satisfactory.
Then, we benchmark two charge schemes (specifically, three charge sets including AM1-BCC, RESP-1 and RESP-2) and two water models (TIP3P and SPC/E) for binding thermodynamics. The RESP-1 charge set performs best among charge models, and the SPC/E water model achieves the highest level of accuracy. Thus, in the endpoint free energy calculations of WP6 host-guest binding, the RESP-1 charge set and the SPC/E water model are recommended. As for the investigation of the docking procedure to generate the initial condition/configuration, we perform docking calculations with two scoring functions (AD4 and Vina) and select the top-1 (most stable) binding mode for later simulation. AD4 is found to produce smaller errors compared with Vina, which suggests that the former scoring function could be more compatible with the end-point calculation procedure. As for the relative performance of the two end-point methods, we observe that MM/ GBSA outperforms MM/PBSA in the reproduction of the absolute binding affinity and the experimental rank. Overall, the extensive benchmark test of end-point free energy calculations in WP6 host-guest interactions recommends the AD4 + RESP-1 + SPC/E + MM/GBSA combination of modelling parameters. The docking scores are also used to estimate the binding affinities, and the docking-produced estimates are found to be more successful in reproducing both the absolute binding affinities and their rank.
Even with this best combination, the deviations from experiment are still very large (RMSE ~ 12 kcal/ mol and τ ~ 0.08), suggesting the unsuitability of standard end-point calculations in host-guest binding. The underperformance of standard end-point methods is not unexpected, as various issues could be identified in the calculation scheme. For instance, the suitability of normal mode approximation in the calculation of the entropic contribution, the accuracy of the implicit solvent model, the sampling convergence, the conformational change upon host-guest binding and so on can all influence the performance of end-point free energy estimates. To make endpoint calculations practically useful in host-guest modelling, it is necessary to consider further developments and modifications, which would be addressed in a series of following works.

Supporting information
Time series of binding affinities estimated with all possible combinations of the four modelling details, the structural RMSD of the host and the guest from the first frame during the 100 ns production run, the RMSD between AD4-and Vina-produced binding modes, and the 100 ns MM/PBSA and MM/GBSA binding free energies obtained with three charge schemes, two water models, initial bound structures from AD4 and Vina docking are given in the supporting information.