Introduction

The ability to accurately predict intermolecular associations is important for understanding the thermodynamic and structural aspects governing molecular recognition in biological systems. It is also critical to the success of important practical applications such as structure-based drug design. Hence, in the past two decades, the development of theoretical methods for predicting binding affinities has been fuelled by a perceived benefit to drug discovery. Binding affinity prediction methods span several levels of theory, with a corresponding trade-off between prediction accuracy and computational demand. On the one hand are the relatively slow but thermodynamically rigorous pathway approaches such as free energy perturbation (FEP) and thermodynamic integration (TI) [1, 2]. On the other hand is a large and ever-increasing number of faster approaches relying on binding affinity scoring functions that can be classified into three main categories: force-field-based, knowledge-based, and empirical [3–9].

An emergent group of end-point force-field-based scoring functions that represent a reasonable compromise between time, computational resources, and accuracy combine molecular mechanics (MM) force-fields with a continuum treatment of solvation. A representative method in this group is MM-PB(GB)/SA [10–14], which combines MM-based terms with electrostatic solvation terms from generalized Born (GB) or Poisson–Boltzmann (PB) continuum models, and a surface area (SA)-based nonpolar solvation contribution. Solvated interaction energy (SIE) [15–17] is a similar end-point force-field-based scoring function that approximates the protein–ligand binding affinity by an interaction energy contribution and a desolvation free energy contribution, each of them further made up of electrostatic and nonpolar components. Electrostatic solvation effects are calculated with the boundary element solution to the Poisson equation, while non-polar solvation is based on molecular SA. Calibration of several physical parameters, including the dielectric constant, Born radii, surface tension coefficient, and enthalpy-entropy compensation scaling factor, was based on a diverse dataset of 99 protein–ligand complexes [15]. The SIE scoring function parametrized in this manner achieves a reasonable transferability across a wide variety of protein–ligand systems, consistently returning absolute binding affinities within the experimental range, as demonstrated by test cases published in the literature [18–31]. External testing of the standard SIE parametrization in the CSAR-2010 scoring challenge, consisting of a curated dataset of 343 protein–ligand complexes diverse with respect to ligands and targets [32, 33], afforded binding affinity predictions with a mean-unsigned-error (MUE) of about 2 kcal/mol [34].

In this paper, we continue prospective testing of the SIE function. The first blind test was carried out in the first edition of SAMPL (Statistical Assessment of the Modeling of Proteins and Ligands) organized by OpenEye Scientific Software and showed a reasonable performance of SIE in binding affinity predictions for the SAMPL-1 set of kinase inhibitors for which available cognate crystal structures were provided [35]. However, the SAMPL-3 blind data sets propose significantly different challenges that test the limits of the applicability domain of the SIE function. First, the trypsin-binding fragments data set includes low-molecular-weight ligands that are typically of weak binding affinity (high-μM to mM) [36], a noisy range for most scoring functions. Secondly, the host–guest dataset challenges with systems in which the target is also of low-molecular-weight, but surprisingly, these systems have been considered notorious exceptions in the binding affinity landscape by having affinities unexpectedly high for their size [37, 38]. However, perhaps the most important test for the SIE function is the challenge in SAMPL-3 to work with non-experimental ligand poses for predicting binding affinities in both trypsin-fragment and host–guest systems. Clearly, the added challenge of scoring computationally derived binding modes is highly relevant for most of the real-life applications of SIE.

Therefore, a docking procedure was required in order to test SIE performance in SAMPL-3. We have recently developed Wilma (manuscript in preparation), an exhaustive docking program that has the required speed for large-scale in silico docking-scoring (aka virtual screening) [39] of small-molecule libraries. Owing to its exhaustive nature as well as to its fast empirical pose-ranking function calibrated on crystal structures of protein–ligand complexes, the top-ranked pose produced by Wilma has been proven to be consistently close to the experimental pose for drug-like ligands. In SAMPL-3, the top-ranked Wilma pose(s) was (were) selected for post-scoring with SIE. In effect, here we test the performance of the Wilma-SIE docking-scoring platform for both virtual screening and binding affinity predictions.

Unquestionably, the success (or failure) of virtual screening (VS) relies mostly on the quality of the underlying docking and scoring function(s). The challenge in virtual screening is exacerbated by the fact that in order to be relevant in a drug discovery pipeline accurate docking-scoring has to be achieved under the constraint of fast computing. Because intermolecular binding is typically accompanied by the dehydration of the interacting surfaces and reorganization of the solvent water around the ensuing complex, a fast yet accurate solvation model is of paramount importance. This is afforded by the next-generation of solvation models that will retain the efficiency of the current continuum approximation but will be able to capture aspects of the physics of hydration that are dependent on the discrete properties of water. Such continuum models are the semi-explicit assembly (SEA) [40, 41], and first shell of hydration (FiSH) [42, 43]. Hence, we also used the SAMPL-3 data sets to test both prospectively and retrospectively the FiSH model, which we incorporated into the SIE function.

Methods

Wilma docking

Docking was carried out using an exhaustive docking software called Wilma (manuscript in preparation). Wilma uses a brute-force search in which the interaction with the rigid protein is examined for all the discrete rotational and translational states of every ligand conformation generated by OMEGA [44] (OpenEye Scientific Software, Santa Fe, NM). Using an efficient filtering method, the program exhaustively enumerates, scores and ranks all the ligand poses that do not overlap with the protein. The weighted 5-term scoring function used for docking was trained to recover the most native states using 343 protein–ligand complexes from the curated CSAR dataset [32]. The scoring function includes a van der Waals 6–12 Lennard–Jones potential, a Coulomb interaction term, an explicit H-bond term, which considers donor and acceptor orientations, and two surface and polar-surface complementarity terms. Docking is done within a predefined rectangular volume with a translation step size of 0.5 Å. The discrete rotation of the ligand is adjusted to ensure that the maximum movement of any atom between adjacent orientations is less than 1 Å. The ligand conformations generated by OMEGA are controlled by an internal energy cutoff of 20 kcal/mol and a minimal RMSD value that keeps the total number of conformations below 3,000 for the trypsin compounds or 10,000 for the larger host–guest ligands.
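The sampling resolution described above can be illustrated with a short geometric sketch. It is not Wilma's implementation, which is not described in detail here. It only assumes that a rotation by an angle θ displaces an atom at distance r from the rotation centre by the chord length 2r·sin(θ/2), and that the translational grid includes both box edges. The function names and the 5 Å example radius are hypothetical.

```python
import math


def max_rotation_step(radius, max_move=1.0):
    """Largest rotation angle (radians) that moves an atom located `radius` Å
    from the rotation centre by at most `max_move` Å (chord = 2 r sin(theta/2))."""
    if 2.0 * radius <= max_move:
        # Even a half-turn cannot move the atom farther than max_move.
        return math.pi
    return 2.0 * math.asin(max_move / (2.0 * radius))


def translation_grid_points(box, step=0.5):
    """Number of grid points in a rectangular box (edge lengths in Å)
    sampled every `step` Å, counting both ends of each edge (assumption)."""
    nx, ny, nz = (int(math.floor(edge / step)) + 1 for edge in box)
    return nx * ny * nz


# Hypothetical ligand whose outermost atom lies 5 Å from the rotation centre.
theta = max_rotation_step(5.0)
print(f"rotation step: {math.degrees(theta):.1f} degrees")

# Box dimensions reported for the trypsin docking region in this paper.
print(translation_grid_points((23.7, 18.0, 29.0)))
```

The sketch shows why larger ligands need finer angular sampling: the admissible rotation step shrinks roughly as 1/r for a fixed 1 Å displacement limit.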

Solvated interaction energy (SIE) calculations

Scoring of binding affinities was carried out using the solvated interaction energy (SIE) end-point force-field-based method [15–17, 34]. In SIE, the binding free energy in aqueous solution is approximated from the electrostatic and non-polar components of the interaction energy and the desolvation free energy (Eq. 1). The free state of the system is obtained from rigid separation of the interacting molecules from the bound state.

$$ {\text{SIE}}(\rho ,D_{\text{in}} ,\alpha ,\gamma ,C) = \alpha \left[ {E_{\text{inter}}^{\text{Coul}} (D_{\text{in}} ) + \Updelta G_{\text{desolv}}^{\text{R}} (\rho ,D_{\text{in}} ) + E_{\text{inter}}^{\text{vdW}} + \gamma (\rho ,D_{\text{in}} ) \cdot \Updelta {\text{MSA}}(\rho )} \right] + C $$
(1)

Intermolecular Coulomb and van der Waals interaction energies in the bound state, \( E_{\text{inter}}^{\text{Coul}} \) and \( E_{\text{inter}}^{\text{vdW}} \), were calculated with the biomacromolecular force field AMBER [45, 46], and its extension to small molecules, GAFF [47]. Partial atomic charges for protein atoms were taken from the AMBER force field, which are calculated with the two-stage RESP fitting method to the electrostatic potential at ab initio level [48, 49], whereas ligands were assigned AM1-BCC partial charges [50, 51]. For electrostatic desolvation, the change in the reaction field energy between the bound and free states, \( \Updelta G_{\text{desolv}}^{\text{R}} \), was calculated with a continuum model based on a boundary element solution to the Poisson equation using the BRI BEM program [52, 53]. The molecular surface required for boundary element electrostatic calculations was generated with a marching tetrahedra tessellation algorithm [54, 55], and a variable-radius solvent probe that adjusts with respect to the polarity of each atom being surfaced [56]. The generated molecular surface is also used to calculate the change in molecular surface area upon binding, ΔMSA, leading to a nonpolar desolvation contribution upon multiplication with a surface tension coefficient, γ, which is based on a linear relationship between experimental hydration free energies of alkanes and their MSAs. ρ is a factor applied to derive atomic Born radii by linear scaling of AMBER van der Waals radii (R*). \( D_{\text{in}} \) is the solute interior dielectric constant. α is a global scaling factor of the total raw solvated interaction energy relating to the scaling of the binding free energy due to configurational entropy effects [57, 58].

Our main interest in SAMPL-3 was to test prospectively the default values of ρ = 1.1, \( D_{\text{in}} \) = 2.25, γ = 12.894 cal/(mol Å²), α = 0.104758, and C = −2.89 kcal/mol, which represent the standard SIE parameters originally obtained by calibration against a protein–ligand training dataset of 99 complexes refined by restrained energy minimization [15]. We also explored prospectively rescaled SIE functions where the α and C parameters were retrained on published data for SAMPL-3 systems. For trypsin affinity prediction, we rescaled the SIE function using a subset of 16 trypsin-ligand complexes from the original SIE training data set [15]. This resulted in values of α = 0.1609 and C = 2.16 kcal/mol, specifically tuned for trypsin. For host–guest affinity predictions, SIE rescaling was based on free energy data available for 26 guests binding to host 1 and 7 guests binding to host 2 [59]. We note that SIE rescaling affects absolute predictions (e.g., MUEs) but not the level of correlation between experimental and SIE-predicted binding affinities. We shall refer to the rescaled SIE function as rSIE.
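As an illustration of how Eq. 1 combines its terms, the following minimal sketch evaluates SIE from precomputed energy components using the default α, γ and C quoted above. The component values in the example are hypothetical, the function name `sie` is ours, and γ is assumed to be expressed per Å² and is converted from cal to kcal. The ρ and \( D_{\text{in}} \) parameters do not appear explicitly because they act inside the component calculations.

```python
# Default SIE parameters quoted in the text (gamma converted from cal to kcal).
ALPHA = 0.104758      # global scaling factor
GAMMA = 12.894e-3     # kcal/(mol Å^2), assumed area units
C = -2.89             # kcal/mol


def sie(e_coul, dg_rf, e_vdw, d_msa, alpha=ALPHA, gamma=GAMMA, c=C):
    """Eq. 1: alpha * [E_Coul + dG_desolv^R + E_vdW + gamma * dMSA] + C.

    Energies are in kcal/mol; d_msa is the change in molecular surface
    area upon binding in Å^2 (negative when surface is buried)."""
    return alpha * (e_coul + dg_rf + e_vdw + gamma * d_msa) + c


# Hypothetical components for one complex (not taken from the paper).
components = dict(e_coul=-30.0, dg_rf=25.0, e_vdw=-40.0, d_msa=-500.0)

default_score = sie(**components)
# Trypsin-specific rescaling (rSIE) only changes alpha and C.
rescaled_score = sie(**components, alpha=0.1609, c=2.16)
print(round(default_score, 2), round(rescaled_score, 2))
```

Because rescaling is an affine transformation of the same raw energy with a positive α, it preserves the rank order of ligands, which is why correlation measures are unaffected.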

Single-conformation-based SIE calculations were performed on complexes refined by constrained energy minimization [15]. In the case of trypsin-ligand complexes, we applied our current refinement protocol for protein–ligand complexes [34, 35], which includes energy minimization of the ligand and only the protein residues within 4 Å of the ligand, with harmonic restraints applied to the heavy atoms in this region using force constants of 3 kcal/(mol Å²) for the ligand and 20 kcal/(mol Å²) for the protein. For the host–guest systems, the harmonic restraints were applied on all heavy atoms in the system, 3 kcal/(mol Å²) for the guest and 20 kcal/(mol Å²) for the host. Energy minimization was carried out down to a gradient of 0.01 kcal/(mol Å), with AMBER/GAFF force-field parameters and two-stage-RESP/AM1-BCC partial charges (as employed in SIE calculations), and a distance-dependent dielectric constant (4r) to crudely mimic solvent screening.
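The two energy terms specific to this refinement step can be sketched as follows. This is an illustrative reading rather than the actual minimizer: the restraint is assumed to take the form k·d² (some codes use ½·k·d² instead), the Coulomb conversion factor of 332.06 kcal Å/(mol e²) is an approximate conventional value, and the function names are hypothetical.

```python
COULOMB_FACTOR = 332.06  # kcal Å / (mol e^2), approximate conversion factor


def coulomb_distance_dependent(q1, q2, r, scale=4.0):
    """Coulomb energy (kcal/mol) with dielectric eps(r) = scale * r,
    so the energy decays as 1/r^2 instead of 1/r."""
    return COULOMB_FACTOR * q1 * q2 / (scale * r * r)


def harmonic_restraint(displacement, force_constant):
    """Restraint energy (kcal/mol) for a heavy atom displaced by
    `displacement` Å from its reference position (assumed form k * d^2)."""
    return force_constant * displacement ** 2


# A ligand atom and a protein atom, each moved 0.5 Å from the starting pose.
print(harmonic_restraint(0.5, 3.0))    # ligand, k = 3 kcal/(mol Å^2)
print(harmonic_restraint(0.5, 20.0))   # protein, k = 20 kcal/(mol Å^2)

# Opposite unit charges 4 Å apart under the 4r dielectric.
print(round(coulomb_distance_dependent(1.0, -1.0, 4.0), 2))
```

The stiffer protein restraint keeps the receptor close to its crystallographic geometry while leaving the ligand freer to relax.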

FiSH solvation model

The FiSH solvation model was designed to capture some of the discrete nature of hydration within a completely continuum framework [42]. By using Born radii that depend on the induced surface charge density it reproduces the charge asymmetry of hydration observed in discrete water simulations [60, 61]. Unlike the default solvation model within SIE, which uses a solute dielectric of 2.25, the FiSH model uses a solute dielectric of 1.0. Furthermore, the non-electrostatic component of solvation is split into a cavity term and a solute–solvent van der Waals term. The non-bulk nature of the first hydration shell is represented by a two-region continuum van der Waals model. Water in the first hydration shell is modeled as a uniform distribution along the solvent-accessible surface (SAS) constructed using atom-dependent probe radii. The van der Waals interaction of the solute with the first hydration shell is calculated by integrating the Lennard–Jones potential along the SAS [42] using AMBER [45, 46] or GAFF [47] parameters for the solute and TIP3P [62] parameters for water. The van der Waals contribution of the second solvation shell outwards is obtained by integrating the contribution of a uniform bulk solvent from the SAS + 2.8 Å out to infinity using standard continuum van der Waals methods [42, 63].

Structural preparation

Trypsin data set

Three high-resolution crystal structures of bovine trypsin were prepared for virtual screening, PDB entries 1HJ9, 1S0R and 3MI4 refined at resolutions of 0.95, 1.02, and 0.8 Å, respectively. A superposition of these structures reveals only minor structural deviations around the active site; however, the Gln192 side chain located at the opening of the S1 pocket adopts different rotameric states. Structural preparation was done in SYBYL 8.1.1 (Tripos, Inc., St. Louis, MO). Bound ligands and buffer ions were removed. Hydrogen atoms were added, with the ionizable groups protonated at neutral pH. Tautomeric and protonation states of His residues were manually assigned after visual inspection in order to maximize the H-bonding network. A Ca2+ ion distant from the catalytic site was retained. With respect to the treatment of crystallographic water molecules, we prepared two versions for each structure. In one version, all explicit solvent molecules were removed. In another version we retained 22 water molecules conserved among the three crystal structures used, 14 of which are buried in the protein core, 3 are proximal to Ca2+, 4 are buried in the back of the Asp189 side chain at the bottom of the S1 pocket, and 1 bridges the main-chain atoms of residues Ser217, Gln221, and Lys224 in the wall of the S1 pocket. Polar hydrogen atoms were manually oriented to maximize H-bonding. All prepared trypsin structures were then subjected to energy minimization with the AMBER force-field, in which all hydrogen atoms were allowed to move with heavy atoms fixed at their crystallographic positions.

In order to prepare the fragments database for virtual screening, we first assigned the protonation states of the 544 ligands in the database at neutral pH using FILTER (OpenEye Inc., Santa Fe, NM). Manual changes were made in the protonation states produced by FILTER for 11 ligands. These included migration of the proton from the more buried amine to the more exposed amine in the piperazine moieties of ligands ID 113, 114, 215, 216, 217, 245, 304, 330, 356, as well as protonation at the exposed N atom of the hydrazine moiety in ligand ID 178, and protonation of the tertiary aliphatic amine in the ligand ID 488. Partial charges were calculated with the AM1-BCC method [50, 51], as implemented in MOLCHARGE (OpenEye, Inc.), using as input the lowest-energy conformation generated by OMEGA (OpenEye, Inc.). The same preparation procedure of the target and ligands was used for trypsin binding affinity prediction.

Host–guest data set

Host-1, an acyclic cucurbit[n]uril (CB[n]) congener containing 4 carboxylate side chains, was prepared in two conformations starting from the high and low occupancy states observed crystallographically in the bound state with a linear aliphatic tetramine guest [59]. For the prospective study, two protonation states were considered for host-1, with the carboxylate groups ionized and neutral. Each host-1 structural variant was energy minimized with the GAFF force field [47], AM1-BCC partial charges, a 4r distance-dependent dielectric and harmonic restraints on the heavy atoms with a force constant of 20 kcal/(mol Å²). Provided structures for host-2 and host-3, the neutral cyclic CB[7] and CB[8] hosts respectively, were energy minimized with the same settings as host-1, except that no restraints were imposed. The structures of the 7 guests binding to host-1, and the 2 guests binding to each of host-2 and host-3, were protonated at neutral pH and partial charges calculated as described earlier for trypsin ligands. A training set of guests with measured binding affinities, comprising 26 guests binding to host-1 and 7 guests binding to host-2 [59], was also prepared in the same manner.

Results

Trypsin virtual screening

We submitted two prospective predictions for trypsin virtual screening. The two submissions differed in the way the predicted pose was selected for scoring. In one set, the pose with the best Wilma docking score for each ligand was used and subjected to restrained energy minimization (see “Methods” section) and rescored using the SIE energy function with default parameters. We will refer to that pose as the Top-Wilma pose. In the second approach, the poses with the top 100 Wilma docking scores for each ligand were clustered and representatives of each cluster were subjected to restrained energy minimization followed by SIE rescoring. The best SIE score among them was selected as the virtual screening score for the ligand. We will refer to the associated pose as the Top-SIE pose.

We used three crystal structures of trypsin (PDB codes 1HJ9, 1S0R and 3MI4) as targets for docking. The submitted predictions were based on trypsin structures with the crystallographic water molecules removed. We also carried out the calculations with selected conserved water molecules retained. However, the results were highly correlated with those for the dry trypsin structures and we opted to base all our submissions on the dry trypsin targets. For each of these targets, a rectangular box (23.7 Å × 18.0 Å × 29.0 Å) enclosing the substrate-binding groove of trypsin (Fig. 1) was defined as the relevant region for exhaustive virtual docking using Wilma. In general, the top-scoring poses were docked at the S1 specificity pocket of trypsin. Each ligand was assigned the best score obtained across the three trypsin structures.
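The two pose-selection schemes and the final per-ligand score can be summarized in a simplified sketch. It assumes that lower (more negative) values are better for both the docking score and SIE, and it omits the clustering and restrained minimization steps that precede SIE rescoring in the actual protocol. The dictionary layout, function names and numerical scores are hypothetical.

```python
def top_wilma_pose(poses):
    """Pose with the best docking score (assumed: lower is better)."""
    return min(poses, key=lambda pose: pose["wilma"])


def top_sie_pose(poses, n_top=100):
    """Best SIE score among the n_top best-docked poses.
    Clustering and restrained minimization are omitted in this sketch."""
    shortlist = sorted(poses, key=lambda pose: pose["wilma"])[:n_top]
    return min(shortlist, key=lambda pose: pose["sie"])


def ligand_score(poses_by_structure, select=top_sie_pose):
    """Best SIE score of the selected poses across receptor structures."""
    return min(select(poses)["sie"] for poses in poses_by_structure.values())


# Hypothetical scores for one ligand docked to two of the trypsin structures.
poses_by_structure = {
    "1HJ9": [{"wilma": -30.0, "sie": -6.1}, {"wilma": -28.0, "sie": -6.8}],
    "1S0R": [{"wilma": -31.0, "sie": -5.9}],
}
print(ligand_score(poses_by_structure, select=top_sie_pose))
print(ligand_score(poses_by_structure, select=top_wilma_pose))
```

In this toy case the Top-SIE selection finds a pose that ranks second by docking score but scores better with SIE, which is the situation the second submission was designed to capture.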

Fig. 1

Box defining the region used for virtual docking. Trypsin is represented as a molecular surface. The box includes the S1 pocket as well as a large part of the binding groove around it

The SAMPL-3 virtual screening set consisted of 544 compounds of which 20 were true binders. Figure 2 shows the receiver operating characteristic (ROC) curve for these two sets of predictions. The performance of the two methods is very similar with the Top-SIE poses giving a somewhat better area under the curve (AUC). The bootstrapped AUCs are 0.70 and 0.68 for the Top-SIE and Top-Wilma poses, respectively. (The perfect AUC would have a value of 1, indicating all true binders are ranked at the top of the list; a random ranking would give an AUC of 0.5.) AUCs are sensitive to false negatives that are detected only late in the screening process. We have three true binders that are ranked near the bottom of the list. These three false negatives alone result in about a 10% reduction in the AUC. It should be noted that the early enrichment performance is quite good, with 50% of the true binders obtained with a 15.6% false positive rate for the Top-SIE set.
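The sensitivity of the AUC to late-ranked binders can be illustrated with a rank-based calculation. The sketch below computes the AUC as the fraction of binder/non-binder pairs in which the binder has the better (more negative) score, counting ties as one half. It does not reproduce the bootstrapping used for the reported values, and the score lists are idealized, hypothetical data.

```python
def roc_auc(scores, is_binder):
    """AUC as the probability that a randomly chosen binder scores better
    (more negative) than a randomly chosen non-binder; ties count 0.5."""
    binders = [s for s, b in zip(scores, is_binder) if b]
    decoys = [s for s, b in zip(scores, is_binder) if not b]
    wins = 0.0
    for b_score in binders:
        for d_score in decoys:
            if b_score < d_score:
                wins += 1.0
            elif b_score == d_score:
                wins += 0.5
    return wins / (len(binders) * len(decoys))


# Idealized screen with the same composition as the SAMPL-3 set:
# 20 binders and 524 non-binders. Seventeen binders rank above every
# non-binder and three rank below every non-binder.
scores = [-10.0] * 17 + [-5.0] * 524 + [0.0] * 3
labels = [True] * 17 + [False] * 524 + [True] * 3
print(roc_auc(scores, labels))
```

With 20 binders, each one contributes at most 1/20 of the AUC, so a few binders ranked near the bottom cap the attainable value even when the rest of the ranking is excellent.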

Fig. 2

ROC curves for virtual screening on trypsin. a Using Top-SIE pose. b Using Top-Wilma pose

Trypsin affinity prediction

We submitted several prospective models of trypsin affinity predictions. These are summarized in Table 1 along with the statistical measures of their performance. Aside from testing the effect of which docked pose (Top-Wilma or Top-SIE) to score for affinity, we also tested three scoring functions. These were (a) the SIE function with default parameters, (b) the rSIE (rescaled SIE) function with parameters α = 0.1609 and C = 2.16 in Eq. 1 that were optimized for trypsin and (c) SIE + FiSH, an SIE function with the solvation model replaced by the FiSH solvation model. The calculations carried out for affinity prediction were exactly the same as those used for the virtual screening exercise except for the additional scoring functions tested. As in the virtual screening case, there was not much difference between using Top-Wilma poses versus Top-SIE poses, although the latter performed slightly better. For the discussion that follows, we will focus on the results using the Top-SIE poses. Figure 3 shows scatter plots comparing the predicted and experimental binding affinities for each of the three scoring functions using the Top-SIE poses. The 34-compound set was composed of 17 binders and 17 non-binders. For the purpose of analyzing the results, the non-binders have been arbitrarily given an “experimental” value of −4.09 kcal/mol. Compared to the default SIE, the use of rSIE improved the agreement between the predicted and experimental affinities but does not in any way alter the relative ranking of affinities. With rSIE, most of the predicted affinities for true binders are within 2 kcal/mol of the experimental values. The mean unsigned error (MUE) and median unsigned error (MdUE) are 0.64 and 0.30 kcal/mol, respectively. However, the correlation coefficients are rather poor, r² = 0.00 and Kendall τ = 0.12. For the purpose of the statistical analysis, binding affinities of non-binders that are predicted to be more positive than −4.09 kcal/mol have been capped at that value to equal the “experimental” value assigned to non-binders. Restricted to the true binders, the MUE and MdUE are 1.10 and 0.76 kcal/mol, respectively.
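The capping rule for non-binders and the two error measures can be sketched as below. The example values are hypothetical and the function names are ours. As a side observation, −4.09 kcal/mol is numerically consistent with a dissociation constant of about 1 mM at 298 K via ΔG = RT ln K_d, although the text does not state how the value was chosen.

```python
import math
import statistics

NON_BINDER_DG = -4.09  # kcal/mol, "experimental" value assigned to non-binders


def dg_from_kd(kd_molar, temperature=298.15):
    """Binding free energy (kcal/mol) from a dissociation constant (mol/L)."""
    gas_constant = 1.987204e-3  # kcal/(mol K)
    return gas_constant * temperature * math.log(kd_molar)


def unsigned_errors(predicted, experimental, is_binder, cap=NON_BINDER_DG):
    """Absolute errors with the non-binder convention described in the text:
    non-binders get the experimental value `cap`, and predictions weaker
    (more positive) than `cap` are set to `cap`, giving zero error."""
    errors = []
    for pred, exp, binder in zip(predicted, experimental, is_binder):
        if not binder:
            exp = cap
            pred = min(pred, cap)
        errors.append(abs(pred - exp))
    return errors


# Hypothetical data: one binder and two non-binders.
errors = unsigned_errors(
    predicted=[-6.0, -3.0, -5.0],
    experimental=[-5.0, None, None],
    is_binder=[True, False, False],
)
print("MUE:", round(statistics.mean(errors), 2))
print("MdUE:", round(statistics.median(errors), 2))
print("dG at 1 mM:", round(dg_from_kd(1e-3), 2))
```

Under this convention, correctly predicted non-binders contribute zero error, which lowers the MUE and MdUE of the full set relative to the values restricted to true binders.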

Table 1 Statistical performance of trypsin fragment affinity prediction models
Fig. 3

Scatter plot of predicted versus experimental binding affinities of trypsin ligands. a Rescaled SIE parameters. b Default SIE parameters. c SIE + FiSH. The “experimental” value for non-binders (red diamonds) has been arbitrarily set to −4.09 kcal/mol. The dashed line represents perfect correlation. Points between the dotted lines have predicted affinities within ±2 kcal/mol from the experimental value

The FiSH solvation model [42, 43] is more sophisticated than the default solvation model of SIE. Instead of a single surface-area-based term for the non-electrostatic component of solvation, it includes additional terms for a continuum van der Waals representation of solute–solvent interactions. The modified SIE + FiSH scoring function then has the form

$$ {\text{SIE + FiSH}} = \alpha \cdot \left( {E_{\text{coul}} + E_{\text{RF}} + E_{\text{vdW}} + E_{\text{cvdW}} + E_{\text{cav}} } \right) + C $$
(2)

where α = 0.1232 and C = 1.46. The α and C parameters were obtained by training against the same 99 protein–ligand data set used for the original SIE function [15]. As with the rSIE case, most of the predicted affinities for true binders are within 2 kcal/mol of the experimental values. The MUE and MdUE are 0.81 and 0.25 kcal/mol, respectively. However, the correlation coefficients are rather poor, r² = 0.00 and Kendall τ = 0.14. Restricted to the true binders, the MUE and MdUE are 1.57 and 0.77 kcal/mol, respectively.

The overall performance of SIE + FiSH seems to be similar to that of rescaled SIE. However, compared to rescaled SIE, SIE + FiSH appears to discriminate better between true binders and non-binders (Fig. 3a, c). We see that for rescaled SIE, the range and spread of values for the non-binders is similar to that of the true binders. With SIE + FiSH, the true binders tend to be more negative than the non-binders. Given this observation, we applied the SIE + FiSH scoring function retrospectively to the VS data set. The result is a dramatic improvement in the early enrichment performance (Fig. 4). For SIE, 50% of the true positives were obtained with a 15% false positive rate (Fig. 2). With SIE + FiSH, 50% of true positives were obtained with a 3% false positive rate. However, the AUC is only slightly increased due to the large penalty for the three false negatives that are ranked close to the bottom.

Fig. 4

ROC curve for virtual screening on trypsin using SIE + FiSH

Host–guest affinity prediction

We submitted several prospective models of host–guest affinity predictions. These are listed in Table 2 along with their statistical performance on the combined data set of 11 host–guest complexes (7 guests for host-1, and 2 guests for each of host-2 and host-3). The affinities predicted for these complexes with the models listed in Table 2 are provided in Table S1. Host-1 is an acyclic cucurbituril (CB) analog that is ionizable due to its four carboxylate side chains, whereas host-2 and host-3 are the cyclic CB[7] and CB[8] analogs, which are neutral. All these hosts have a circular geometry with a central hole where certain guests are recognized with surprisingly high affinity given the relatively small size of these systems [59]. We used our exhaustive docking program Wilma to arrive at bound conformations for host–guest complexes. The search space was defined large enough to allow docking of the guest at any contact position around the host. In general, the top-scored pose for all guests was found to bind fully or partially through the central hole region of the hosts (Fig. 5), irrespective of the structural setup (neutral/ionized, high/low occupancy conformation) or pose-scoring function (Wilma, SIE, SIE + FiSH). We also docked a set of 26 guests to host-1 and 7 guests to host-2 with published binding affinities [59], with the intention of rescaling the SIE function specifically for host–guest systems. The training set of guests also docked in the central region of the hosts, and the top-scored pose was found to be similar to the binding modes previously determined experimentally for two of these guests (Figure S1) [59].

Table 2 Statistical performance of host–guest binding affinity prediction models
Fig. 5

Global views of Wilma docking poses for the host–guest systems. The examples shown are from the SIE-based selection of the top-ranked pose and the ionized host-1. All guests are shown in a given host with different colors of the C atoms. Two orthogonal views are provided

All models submitted to the SAMPL-3 host–guest affinity prediction challenge are based on the high-occupancy conformation of host-1, because prospectively very similar results were obtained when the low-occupancy conformation was used (r² > 0.98, mean-unsigned-deviation < 0.2 kcal/mol). An exception is guest-3, which in some models was predicted to bind more weakly to the low-occupancy conformation than to the high-occupancy one. Guest-3 is a branched and relatively larger guest in this series and Wilma mainly employs a rigid docking algorithm. Also, on previously published data for 26 guests binding to host-1 [59], we obtained better correlations with experiment using the high-occupancy conformation (data not shown). Together, these observations prompted us to discard the data generated on the low-occupancy conformation of host-1.

The prediction model #94 is based on the Top-SIE poses. Predictions for host-1 were generated with this host in the fully ionized form (net charge of −4e). This model returned a reasonable correlation with experiment (Kτ of 0.49; r² of 0.51) and also a good prediction of absolute binding affinities as shown by the MUE of 1.21 kcal/mol and RMSE of 1.54 kcal/mol, which are better than the null model for this data set (MUE = 1.44 kcal/mol and RMSE = 1.79 kcal/mol). We note, however, the relatively small correlation slope (0.441) and significant correlation intercept (−3.22), which underscore the narrower range of predicted absolute binding affinities than the experimental range, also apparent in the scatter plot in Fig. 6. For most complexes, the standard SIE function slightly underestimated absolute binding affinities, leading to a positive MSE value of 0.88 kcal/mol.
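The statistical measures quoted for the host–guest models can be computed as in the sketch below. It assumes that errors are defined as predicted minus experimental (so that underestimated affinities give a positive MSE, as in the text), that the regression treats the predicted value as the dependent variable, and that Kendall's τ is the simple tau-a variant without tie correction. The paper does not specify these details, and the four data points are hypothetical.

```python
import math


def error_stats(predicted, experimental):
    """Mean signed error, mean unsigned error and RMSE (predicted - experimental)."""
    diffs = [p - e for p, e in zip(predicted, experimental)]
    n = len(diffs)
    mse = sum(diffs) / n
    mue = sum(abs(d) for d in diffs) / n
    rmse = math.sqrt(sum(d * d for d in diffs) / n)
    return mse, mue, rmse


def linear_fit(experimental, predicted):
    """Least-squares slope, intercept and r^2 of predicted versus experimental."""
    n = len(experimental)
    mean_x = sum(experimental) / n
    mean_y = sum(predicted) / n
    sxx = sum((x - mean_x) ** 2 for x in experimental)
    syy = sum((y - mean_y) ** 2 for y in predicted)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(experimental, predicted))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x, sxy * sxy / (sxx * syy)


def kendall_tau_a(x, y):
    """Kendall tau-a: (concordant - discordant) pairs over all pairs."""
    n = len(x)
    total = 0
    for i in range(n):
        for j in range(i + 1, n):
            product = (x[i] - x[j]) * (y[i] - y[j])
            total += (product > 0) - (product < 0)
    return total / (n * (n - 1) / 2)


# Hypothetical affinities (kcal/mol) with a compressed predicted range.
experimental = [-10.0, -8.0, -6.0, -4.0]
predicted = [-7.0, -6.5, -6.0, -5.0]
print(error_stats(predicted, experimental))
print(linear_fit(experimental, predicted))
print(kendall_tau_a(experimental, predicted))
```

The toy data reproduce the qualitative pattern discussed above: a perfect rank correlation can coexist with a slope well below 1 and a positive mean signed error when the predicted range is narrower than the experimental one.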

Fig. 6

Scatter plots of calculated versus actual binding affinities for the host–guest systems and various prediction models. Please refer to Table 2 for a description of the models. Host-1 data are shown with blue circles, host-2 data with red triangles and host-3 data with yellow squares. The diagonal line indicates a perfect correlation. Select outliers discussed in the text are labeled

The prediction model #96 is similar to model #94, with the only difference being that it uses the Top-Wilma poses. Note that final scoring in both models is based on the standard SIE function. Significant prediction differences were observed for only 2 complexes of host-1, a 1.5 kcal/mol more positive SIE value (weakened predicted binding) for guest-3, and a 0.5 kcal/mol weakening in the case of guest-7. The difference in the selected poses for these two complexes by the two methods is shown in Fig. 7. In both cases, the SIE values based on Wilma pose selection (model #96) are farther from experimental values than predictions based on SIE pose selection (model #94), which is reflected in the slightly larger MUE and RMSE values. However, correlation parameters (Kτ, r², slope, intercept) for model #96 improve marginally relative to model #94 (Table 2).

Fig. 7

Differences in the Top-Wilma and Top-SIE pose selection. The complexes shown are the only ones that have significantly different predicted binding affinities and different selected poses by the two scoring functions; see Table S1 and the text for the difference in predicted binding affinities. The C atoms of the guests (ball-and-stick models) are colored in green and cyan for the SIE-selected and Wilma-selected poses, respectively. H-bonds are indicated as black dashed lines. Only polar H atoms are shown for clarity

We also tested prospectively the protocol from model #94 against a protonated (neutral) version of host-1 (prospective prediction #101, Table 2). Somewhat to our surprise, this model was our best submission in terms of absolute predictions, with the MUE as low as 1.16 kcal/mol and RMSE of 1.49 kcal/mol. Correlation indices deteriorate slightly relative to model #94, but still provide a similar r² of 0.50. For most guests, SIE values are slightly more negative (stronger predicted binding) for the neutral host-1 (model #101) than for the ionized host-1 (model #94), by as much as 1.25 kcal/mol in the case of guest-7, with the only exception being guest-6 having weakened predicted binding by 0.45 kcal/mol. To provide a qualitative view of differences in the interactions, in Fig. 8 we display the top-ranked poses of several guests with host-1 in the ionized and neutral states. We note that the correlation slope for this model has decreased to a low value of 0.290. As seen in the scatter plot in Fig. 6, the predicted binding affinities span 2.5 kcal/mol whereas the experimental values range over 6.5 kcal/mol.

Fig. 8

Differences in the top-pose interactions with the ionized vs neutral states of host-1. See Table S1 and the text for the difference in predicted binding affinities for each example shown. Top-pose selection is based on the SIE function in both cases. The C atoms of the guests (ball-and-stick models) are colored in grey and cyan for the poses bound to the ionized and neutral host-1 states, respectively. H-bonds are indicated as black dashed lines. Only polar H atoms are shown for clarity. Note that the ionized and neutral host structures overlay almost perfectly

One way to modulate the correlation slope is to rescale the SIE function, through the enthalpy-entropy compensation factor α in Eq. 1, specifically for the system being investigated. This is justified since it has been previously shown that the CB[7] host, for example, requires a higher energy efficiency factor (the degree to which attractive forces are effective in generating binding free energy rather than being cancelled by entropy losses) than the β-cyclodextrin (βCD) host [37, 38, 58]. This points towards a larger value for the α scaling factor in the SIE formulation. Hence, we explored this possibility prospectively by deriving a rescaled SIE function based on previously published data for guests binding to host-1 (26 complexes) and host-2 (7 complexes) [59]. Amongst the many system and method variants tested in the prospective analysis (neutral/ionized host, Wilma/SIE pose selection, high/low occupancy conformations), the best fit was obtained for the Wilma-based selection of the pose and the neutral host-1. This training model achieved an MUE of 1.56 kcal/mol over all 26 guests of host-1 and 7 guests of host-2, and led to an α scaling factor of 0.2568 (the constant C was forced to zero), hence a larger scaling than that of the standard SIE function (0.1048), in agreement with previous observations [37, 38]. Application of the rescaled SIE function to the SAMPL-3 host–guest data set led to the submitted prediction model #99, with an increased correlation slope (0.705) and a similar correlation with experiment relative to the other prospective models based on the standard SIE function. Two aspects of the absolute affinity predictions are noteworthy: rescaling overshot, moving the predictions from underestimating to overestimating the actual affinities (negative MSE, see also Fig. 6), and the MUE increased to above 2 kcal/mol (Table 2).
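As an illustration of system-specific rescaling with the constant C forced to zero, the sketch below fits a single scaling factor α by least squares through the origin. It assumes, hypothetically, that the unscaled sum of the interaction and desolvation terms of Eq. 1 is available for each training complex. Least-squares fitting is itself an assumption and not necessarily the procedure used for the models above, and the function names and numbers are illustrative only.

```python
def fit_alpha_through_origin(raw_energies, experimental):
    """Least-squares alpha for the model dG ~ alpha * raw, with C = 0.

    raw_energies: unscaled interaction + desolvation sums (kcal/mol),
                  a hypothetical input taken from Eq. 1 before scaling.
    experimental: measured binding free energies (kcal/mol).
    """
    sxy = sum(x * y for x, y in zip(raw_energies, experimental))
    sxx = sum(x * x for x in raw_energies)
    return sxy / sxx


def rescaled_sie(raw_energy, alpha, constant=0.0):
    """Apply the linear scaling of Eq. 1 to one unscaled energy sum."""
    return alpha * raw_energy + constant


# Hypothetical training data (not taken from the paper).
raw = [-20.0, -40.0]
measured = [-5.0, -10.0]
alpha = fit_alpha_through_origin(raw, measured)
print(alpha)                       # 0.25 for this made-up data
print(rescaled_sie(-30.0, alpha))  # prediction for a new hypothetical complex
```

Because C is fixed at zero, α alone sets both the spread of the predictions and their absolute level, so a fit that improves the correlation slope can also shift every prediction towards stronger binding.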

Therefore, we retrospectively reanalyzed our unsubmitted prospective models of the rescaled SIE function derived on the training set of 33 complexes of host-1 and host-2 [59]. A particularly interesting model turns out to be the one employing Top-SIE pose selection and the ionized host-1. This model was not submitted prospectively because it performed more poorly than model #99 in the training stage, with a training MUE of 2.13 kcal/mol (versus 1.56 kcal/mol for model #99). Its α scaling factor of 0.2097 is larger than that of the standard SIE function used in model #94, which underestimated the actual data, but smaller than that of the SIE function rescaled on the training dataset with the neutral host-1 used in model #99, which overestimated the actual affinities. As seen in Table 2 and Fig. 6, this retrospective model (rescaled SIE, ionized host-1) has a much improved correlation slope (0.880) relative to model #94 (standard SIE, ionized host-1) and an improved MUE (1.27 kcal/mol) relative to model #99 (rescaled SIE, neutral host-1). The correlation coefficient is similar to that of the other models, but the correlation intercept and the MSE are much improved, being close to zero.

We also tested SIE + FiSH. The SAMPL-3 prospective prediction model #98, which is based on SIE + FiSH final scoring and the ionized host-1, did not perform well, showing much deteriorated correlation and absolute affinity prediction relative to the standard SIE function (Table 2; Fig. 6). The severe underestimation of binding affinities by this model (MSE = MUE of 5.59 kcal/mol) is corrected in the prospective model #100, which uses a rescaled SIE + FiSH model derived on the training set of 33 complexes of host-1 and host-2 [59]; this rescaling, however, does not improve the correlation with experiment. Clearly, more work is needed for a consistent incorporation of the FiSH solvation model into the SIE function, at least for the host–guest systems examined here.

Discussion

Outliers of trypsin virtual screening and affinity prediction

Figure 9 shows the chemical structures of three serious outliers in our affinity prediction results. These same outliers also adversely affected the AUC in the virtual screening results. Figure 10 shows the predicted binding mode of one of the outliers, frag.aff.15. In this pose, the imidazo nitrogen on the ligand points away from Asp189 and towards the backbone NH of Ser214. However, at a distance of 3.4 Å from the amide hydrogen, it is too far to form a good hydrogen bond. This pose suggests that if the imidazo nitrogen were protonated the ligand could flip and form a stabilizing ion pair with Asp189 of trypsin. By analogy with other imidazo compounds, it is plausible that the imidazo nitrogen in frag.aff.15 is protonated near neutral pH. Figure 10 (right panel) shows the predicted pose for the protonated version of frag.aff.15. After protonation, the predicted binding affinity using the rSIE scoring function goes from −4.45 to −6.44 kcal/mol with VS ranking going from 522 to 178. For the SIE + FiSH scoring function, the predicted binding affinity goes from 2.63 to −2.78 kcal/mol with VS ranking going from 534 to 124. The second outlier, frag.aff.16, also has an imidazole nitrogen that was not protonated. For rSIE, protonation changes the predicted binding affinity from −2.64 to −3.77 kcal/mol and raises the VS rank from 262 to 147. For SIE + FiSH, protonation changes the predicted binding affinity from −1.44 to −3.37 kcal/mol and raises the VS rank from 203 to 90. The docked conformation of the third outlier, frag.aff.27 (Fig. 9), had the aniline and pyrrole rings nearly perpendicular to each other. By docking a conformation in which the two rings are nearly planar, the predicted binding affinity and rank improved only marginally. The rSIE affinity went from −5.02 to −5.96 kcal/mol with rank rising from 302 to 285. For SIE + FiSH, the predicted affinity went from −0.69 to −0.79 kcal/mol with rank rising from 343 to 330.
The corrected outliers improved the AUC for SIE + FiSH from 0.73 to 0.78 (Fig. 11). However, the correlation coefficients for affinity prediction are not much improved. For rSIE, r² goes from 0.00 to 0.04 after correcting for the outliers. For SIE + FiSH, r² goes from 0.00 to 0.02. The MUE is 0.60 and 0.51 kcal/mol for rSIE and SIE + FiSH, respectively, after correcting for outliers.

Fig. 9

Outliers in fragment affinity prediction. In clockwise direction from the top left, the fragments are frag.aff.15, frag.aff.16 and frag.aff.27 of the SAMPL-3 set

Fig. 10

Predicted binding mode of an outlier, frag.aff.15. Left panel Unprotonated imidazo nitrogen. Right panel Protonated imidazo nitrogen. Trypsin is depicted as thin sticks with the Ser 214 NH and Asp 189 side chain as ball and sticks. Protonation of the imidazo group allows an ion–pair interaction to be formed with the Asp 189 side chain

Fig. 11

ROC for trypsin virtual screening after correcting for outliers. The SIE + FiSH scoring function was used

Outliers of host–guest affinity prediction

Given the good predictions obtained, there are no major outliers for most of the host–guest affinity models. Some of the outliers seen in the scatter plots in Fig. 6 depend on the prediction model as well as on whether the outlier analysis refers to absolute or only correlating binding affinities. For example, in the case of host-1, the standard SIE model #94 indicates two outliers, guest-6 and guest-7, with absolute binding affinities underestimated by more than 2 kcal/mol. However, the rescaled SIE model (retrospective), which changes the correlation slope and the spread of the predictions but not the degree of correlation, shows that guest-6 is well predicted. Although the guest-7 pose docked into the ionized host-1 interacts well, traversing the entire central hole and engaging both the amine and amide protons in H-bonds with the host (Fig. 8), its binding affinity is still underestimated. Interestingly, guest-7 was one of the few ligands affected by the protonation state of host-1: the neutral host-1 led to an improved prediction (model #101, Fig. 6; Table S1) even though the docked pose neither forms any direct H-bonds nor fully crosses from one face of the neutral host-1 to the other (Fig. 8). This serves to remind us about the delicate balance between interaction and desolvation, and the extent to which scoring functions can accurately account for that balance.

The outlier analysis also points out that system-specific rescaling of scoring functions for predicting absolute binding affinities has to be done carefully: it should not rely solely on the mathematical global optimum of the parameters but should also consider the physical relevance of the system. In this study, the global optimum during training was found to correspond to the neutral host-1, with the resulting rescaled scoring function overshooting the predicted absolute binding affinities of the blind set towards overestimation, with guest-1 and guest-3 as significant outliers (model #99, Fig. 6). The more physically sound ionized state of host-1 (for the experimental pH of 7.4) produced a rescaled SIE function that was slightly suboptimal in the training phase but performed better on the test set (retrospective model in Fig. 6).

Predictions for host-2 and host-3 were reasonable (see the red triangles and yellow squares, respectively, in the scatter plots in Fig. 6). Specifically, the more branched h23-guest-1 (with the n-propyl substituent at the N atom of the imidazolyl ring) was predicted to fit only partially into the smaller host-2, and hence it has a weaker binding affinity to host-2 (both predicted and experimental) than the more linear h23-guest-2 (with the n-propyl substituent at the C atom of the imidazolyl ring), which traverses the host-2 central hole (middle panels in Fig. 5). However, the same h23-guest-1 and h23-guest-2 fitted well in the larger host-3 (lower panels in Fig. 5), to which they bind with similar affinities. Although absolute binding affinities to host-3 are slightly underestimated by the standard SIE function (e.g., model #94), by 1–2 kcal/mol, they are well predicted by the rescaled SIE function (retrospective model in Fig. 6).

As mentioned earlier, the SIE + FiSH scoring function underestimated the absolute binding affinities of all host–guest complexes, with guest-7 of host-1 as the major outlier, underestimated by 10 kcal/mol (model #98). Rescaling of the SIE + FiSH function did not change the low correlation with experimental data, but it significantly reduced the absolute errors of most outliers and provided more balanced absolute predictions (model #100, Fig. 6). The largest outlier with the rescaled SIE function is guest-1 for host-1, which was predicted as a racemic mixture from the binding poses of the enantiomers (Fig. 12).

Fig. 12

Binding of the two enantiomers of guest-1 to host-1. The example shown is based on the SIE-selected top-ranked pose and an ionized state of host-1. The C atoms of the guest (ball-and-stick models) are colored in green and cyan for the R and S enantiomers, respectively. H-bonds are indicated as black dashed lines. Only polar H atoms are shown for clarity

Assessment of general performance of SIE

It is informative to position the current prospective SIE predictions of binding affinity at SAMPL-3 in the context of the general performance of the SIE function during training and various tests available thus far.

Training

The standard SIE parametrization (see “Methods” section) was previously derived on a protein–ligand dataset consisting of 99 complexes from 11 diverse protein targets, each comprising a short congeneric series of ligands with known binding affinities curated from the literature and co-crystal structures solved at high resolution [15]. A training performance characterized by an MUE of 1.34 kcal/mol and an r² of 0.65 (Fig. 13a) was obtained while maintaining the physical meaning and interpretability of the optimal parameters. In particular, the fitted optimal solute dielectric falls within the range of 2–4, in agreement with refractive index measurements of protein powders, and the potential energy plus solvation is scaled down by about 90%, likely reflecting the compensation exerted by the configurational entropy loss arising from narrowing of the energy wells in the complex versus the free state [57, 58].

Fig. 13

Past performance of the SIE function on various datasets. a Calibration dataset consisting of 99 protein–ligand complexes [15]. HIV protease (filled circle); trypsin (open circle); lysozyme (filled square); thrombin (open square); neuraminidase (filled diamond); elastase (open diamond); triosephosphate isomerase (filled triangle); l-arabinose binding protein (open triangle); protein tyrosine phosphatase 1B (grey-shaded circle); glutathione transferase (grey-shaded square); streptavidin (grey-shaded diamond). b Testing on the CSAR-NRC-HiQ data set (N = 343) of the CSAR-2010 scoring exercise [34]. c Published applications to date including both SIE predictions and actual binding affinities. References: [18] N = 7 (filled circle); [23] N = 12 with 5 limiting values (open circle); [20] N = 2 (filled square); [24] N = 8 with 4 limiting values and IC50 data used (open square); [21] N = 2 (filled diamond); [26] N = 4 (open diamond); [22] N = 4 (filled triangle); [19] N = 7 (open triangle); [25] N = 11 (plus sign); [27] N = 3 (grey-shaded circle); [28] N = 6 (grey-shaded square) with IC50 data used; [29] N = 5 (multiplication sign); [30] N = 1 (grey-shaded triangle); [31] N = 1 (grey-shaded diamond) with IC50 data used (multiplication sign). d Blind testing on the SAMPL-1 affinity prediction JNK3 dataset of 59 complexes [35]. Closed symbols (filled circle) correspond to active inhibitors with co-crystal structures, and open symbols (open circle) correspond to weak binders computationally docked into the kinase active site

CSAR

The most extensive testing of the SIE function was done recently in the Community Structure-Activity Resource (CSAR) scoring challenge, consisting of high-resolution co-crystal structures for 343 protein–ligand complexes with high-quality binding affinity data and high diversity with respect to protein targets [32–34]. While the dataset resembles the SIE calibration dataset of 99 protein–ligand complexes in terms of target diversity and curation quality, no entry in the CSAR-2010 dataset was present in the SIE calibration dataset, although some protein targets were represented in both data sets. The previously calibrated standard SIE parametrization predicted absolute binding affinities for the highly curated CSAR-NRC-HiQ data set well within the range of the experimental values, with an MUE of 1.98 kcal/mol and an r² of 0.38 (Fig. 13b). SIE predictions were found to be sensitive to the assignment of protonation and tautomeric states in the complex, and to the treatment of metal ions near the protein–ligand interface. These structural preparation steps were critical for accurate testing of the SIE performance. Retraining and testing of SIE parameters on two predefined halves of CSAR-NRC-HiQ led to only marginal further improvements, to an MUE of 1.83 kcal/mol and an r² of 0.43, with modest changes in the optimal values of the SIE parameters.

Published studies

The SIE function has also been applied retrospectively as well as prospectively in several other independent laboratories that have reported SIE predictions versus actual binding affinities [18–31]. Collectively, these data indicate an MUE of 1.30 kcal/mol and an r² of 0.47 between the predicted and actual absolute binding affinities (Fig. 13c). As in the case of the CSAR-NRC-HiQ data set, these applications reiterate that the SIE approach returns predicted protein–ligand binding affinities well within the range of experimental measurements. The degree of scatter is comparable to that observed in the original calibration and in the CSAR testing, suggesting that the SIE parameters were not over-fitted to the training set.

SAMPL-1

Since the most objective way to evaluate computational methods is via blind tests, SIE was a participating method in the SAMPL-1 experiment organized by OpenEye, Inc. in early 2008 [35]. In SAMPL-1, we tested prospectively the standard SIE parametrization for protein–ligand binding affinity prediction on the Jun kinase 3 (JNK3) data set, a target class not used for SIE calibration. This data set consisted of 49 diverse JNK3 inhibitors from 12 classes, each with its own co-crystal structure with the kinase, plus 10 models of known “inactive” ligands (in fact weakly active ligands) docked in duplicated enzyme structures. The SIE function achieved reasonable prospective predictions for the JNK3 dataset of 49 actives, with an MUE of 0.92 kcal/mol and an r² of 0.36. Again, it became apparent that SIE can estimate absolute binding affinities, with predicted values spanning the same range as the actual ones (Fig. 13d). The 10 measured inactives were separated reasonably well from the actives, leading to an increase in r² to 0.54 over all 59 ligands.

SAMPL-3

As described in this paper, the SIE function returned encouraging prospective predictions when tested on the trypsin-fragment and host–guest blind data sets from the SAMPL-3 experiment (Fig. 6 for the host–guest and Fig. 3 for the trypsin-fragment data sets). SAMPL-3 was a useful experiment because it tested the applicability domain of the method with challenging systems like fragment-sized weak-affinity ligands binding to an enzyme, and small host–guest systems exhibiting appreciable binding affinities. Additionally, these predictions had to be made not on solid, experimentally determined binding modes as in CSAR and SAMPL-1, but on computationally docked binding modes. We also tested more extensively SIE rescaled specifically for the system under investigation, and we tested for the first time SIE + FiSH, a scoring function that incorporates our latest solvation model, FiSH [42], into SIE. We found that for the trypsin-fragment data set, rescaling of the SIE parameters was necessary to improve prediction of absolute binding affinities (MUE of 0.81 kcal/mol), which were systematically overestimated by the standard SIE parametrization (MUE of 2.24 kcal/mol). We note that even in the original SIE training set, the trypsin-ligand subset was also overestimated (Fig. 13). For the host–guest system, the MUE of about 1.16 kcal/mol and the r² of 0.52 were hardly affected by SIE rescaling, but the correlation slope became closer to 1 after rescaling due to a larger entropy-related factor, in agreement with other studies [37, 38]. This suggests that in certain cases, possibly for fragment-sized ligands and other small molecular systems, the SIE function may need to be retrained for the system under investigation if data are available. A recent study also reports improved predictions with rescaled SIE parameters for protein–ligand systems, although the system-specific sets of SIE parameters were not validated on external sets [64].
Encouragingly, in the case of the trypsin-fragment data set, the SIE + FiSH scoring function outperformed the standard SIE scoring function in terms of absolute predictions (MUE of 2.24 kcal/mol for standard SIE versus 0.98 kcal/mol for standard SIE + FiSH). However, testing of the SIE + FiSH scoring function on more systems is required in order to confirm its general advantage.

Virtual screening

The compromise between speed and accuracy makes SIE a suitable scoring function for ranking compound libraries in virtual screening (VS) applications. Previously, SIE was tested for VS enrichment against estrogen receptor (ER) and thymidine kinase (TK), showing the ability of SIE to recover true hits in a collection of decoys [15]. While the ER set is considered an easier test, the TK set is more challenging, partly due to weaker binding affinities for the true binders. The SIE function was able to recover all true positives within the top 10% of the ranked dataset, and half of them within the top 1%. SIE was clearly superior to simpler functions, e.g., buried surface area, which describes only non-polar effects and ranked all TK true binders near the bottom of the list. In the blinded VS experiment of SAMPL-3, SIE showed a strong performance on the trypsin-fragment dataset of over 500 ligands, significantly enriching in the 20 true-active fragments with an AUC value of 0.70. A promising retrospective result is that the SIE + FiSH function improves the enrichment (AUC of 0.73) in this VS data set.
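For readers less familiar with the enrichment measure, the AUC of a ROC curve equals the probability that a randomly chosen true active is ranked ahead of a randomly chosen inactive compound. The Python sketch below computes it directly from that pairwise definition. The function name and the scores are hypothetical, and the sketch assumes that more negative scores indicate stronger predicted binding, as with the SIE values reported here.

```python
def roc_auc(active_scores, decoy_scores):
    """ROC AUC from the pairwise (Mann-Whitney) definition.

    Assumes lower (more negative) scores mean stronger predicted binding.
    Ties between an active and a decoy count as half a correct ordering.
    """
    correct = 0.0
    for a in active_scores:
        for d in decoy_scores:
            if a < d:
                correct += 1.0
            elif a == d:
                correct += 0.5
    return correct / (len(active_scores) * len(decoy_scores))


# Hypothetical predicted affinities in kcal/mol (not from the paper).
actives = [-8.0, -6.0]
decoys = [-7.0, -5.0, -4.0]
print(roc_auc(actives, decoys))  # 5 of the 6 active/decoy pairs are ordered correctly
```

An AUC of 0.5 corresponds to random ranking and 1.0 to all actives ranked ahead of all inactives. The double loop is quadratic in the library size, which remains inexpensive for a set of a few hundred ligands.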

Docking

Although SAMPL-3 did not explicitly test docking methods, it is clear that success or failure in virtual screening or affinity prediction is highly dependent on the quality of the predicted poses that are scored. For this purpose we opted for an exhaustive docker, Wilma, that thoroughly samples bound conformations, rather than a stochastic docker with uncertain convergence properties. The rather fine search grid used (0.5 Å), combined with thorough sampling of ligand rotamers using OMEGA, gives us some confidence that the native pose was visited during the search procedure. The scoring function used for docking is physics-based and mimics the major components of a typical force-field calculation, albeit with empirically modified weights for the various terms. The net effect is that the top poses selected by Wilma will most likely be low-energy poses as well when rescored with our SIE function. In fact, this is what we observed: the affinity predictions using the Top-Wilma pose were comparable to those using the Top-SIE pose. This is no mean feat given that the Wilma docking function is several orders of magnitude faster to compute than the SIE function.

Given the speed of Wilma scoring, it is tempting to use Wilma scoring alone for virtual screening applications. However, in our experience virtual screening ranking based on Wilma scores alone is not as reliable as that obtained after rescoring with the SIE function (data not shown). This is probably because, in docking a given molecule, many terms cancel out when comparing the energetics of one pose versus another. There is much less cancellation when comparing the affinity of one molecule versus another with a different molecular structure. Hence, a function optimized for docking may not properly capture key components necessary for accurate affinity prediction across different molecules. We find that the use of Wilma for docking and SIE for scoring achieves a cost-effective balance between speed and accuracy for both virtual screening and affinity prediction.

Conclusions

The performance of the SIE scoring function in this blind test is consistent with past experience in its application to a number of targets. The combination of SIE with an exhaustive docker such as Wilma affords a rapid, cost-effective virtual screening platform that can provide not just a ranking of compounds but estimates of binding affinity as well. The sampling thoroughness afforded by Wilma may in fact be instrumental for the relatively good results obtained in virtual screening and affinity prediction. Nevertheless, the goal of consistently achieving sub-2 kcal/mol accuracy in relative binding free energies remains a challenge, as seen in the poor correlations obtained when the dynamic range of the actual binding affinities is narrow. However, it is encouraging that the inclusion of more physics in the model, e.g., SIE + FiSH, can improve the quality of the predictions.