Introduction

Accurate prediction of the structural and energetic aspects of binding in aqueous solution is critical for successful structure-based drug design and for understanding molecular recognition in biological systems. Binding affinity prediction methods range from the relatively slow but thermodynamically rigorous pathway approaches, such as free energy perturbation (FEP) and thermodynamic integration (TI) [1, 2], to the faster end-point approaches relying on binding affinity scoring functions, which can be classified into three main categories: force-field-based, knowledge-based, and empirical [3–9]. Although many end-point methods are based on implicit solvent descriptions, the solvent potential of mean force theory ensures that, given adequate configurational sampling, these methods can be as rigorous as alchemical pathway methods based on an explicit solvent description [10, 11]. A popular method in the force-field-based group is MM-PB(GB)/SA [12–14], which combines molecular mechanics-based terms with continuum solvation terms.

Solvated interaction energy (SIE) [15–17] is another end-point force-field-based scoring function. It approximates binding affinity by an interaction energy contribution and a desolvation free energy contribution, each further made up of electrostatic and nonpolar components. Calibrated on a diverse dataset of 99 protein–ligand complexes [15], SIE achieves reasonable transferability across a wide variety of protein–ligand systems, for which it predicts absolute binding affinities within the experimental range, as shown by various test cases reported in the literature [17, 18]. External testing of the standard SIE parametrization on the CSAR-2010 dataset of 343 protein–ligand complexes, diverse with respect to ligands and targets, gave absolute binding affinities with a mean unsigned error (MUE) of about 2 kcal/mol [18].

A docking procedure is required in order to apply SIE in the absence of crystallographic ligand poses. To this end we developed Wilma, an exhaustive docking program that has the required efficiency for large-scale virtual screening of small-molecule libraries. Owing to its exhaustive nature as well as to its fast empirical pose-ranking function calibrated on crystal structures of protein–ligand complexes, the top-ranked pose produced by Wilma has been shown to be consistently close to the experimental pose for drug-like ligands.

Both SIE and Wilma have been employed for blind testing in previous editions of Statistical Assessment of the Modeling of Proteins and Ligands (SAMPL) organized by OpenEye, Inc. A reasonable performance of SIE in binding affinity prediction for the SAMPL1 set of kinase inhibitors with available cognate crystal structures had been noted [19]. In SAMPL3, the Wilma–SIE virtual screening platform achieved good enrichment of true positives from a dataset of fragment-size ligands against trypsin, with an AUC of about 0.7 for a receiver-operating-characteristic (ROC) curve characterized by an excellent early enrichment performance [20]. Binding affinity predictions for trypsin–ligand and host–guest complexes in SAMPL3 were generally within 2 kcal/mol of the experimental values but rank ordering of affinities within 2 kcal/mol was not well predicted.

In this paper, we continue prospective testing of the Wilma–SIE docking–scoring platform for both virtual screening and binding affinity predictions. We tested our methods on both molecular systems proposed in SAMPL4. On the one hand were the two relatively small hosts with their surprisingly high-affinity guest ligands. On the other was the much larger homodimeric HIV-1 integrase, which can bind various inhibitors at several sites with much weaker affinities than one would expect based on the shape of the enzyme pockets and the size of the ligands. Several methodological and system-dependent properties are explored in this study. These include: (1) a new virtual screening scoring function replacing the surface-based terms with a penalty term for non-complementary polar and non-polar interactions, (2) the contribution of vibrational entropy change and symmetry corrections to the absolute magnitude and correlation of binding affinity predictions, respectively, studied on the host–guest systems, (3) the effect of the size of the sampled docking space on docking and enrichment of actives, studied on the HIV-integrase pose prediction and virtual screening datasets, and (4) a more advanced continuum solvation model and the use of a common structure of the target for binding free energy predictions, studied on the HIV-integrase affinity dataset.

Methods

Wilma docking

The docking software Wilma uses a brute-force search in which the interaction with the rigid protein is examined for all the discrete rotational and translational states of every ligand conformation generated by Omega (OpenEye Scientific Software, New Mexico). Using an efficient filtering method, the program exhaustively enumerates, scores and ranks all the ligand poses that do not overlap with the protein. Docking is done within one or several predefined rectangular volumes with a translation step size of 0.5 Å. The discrete rotation of the ligand is adjusted to ensure that the maximum movement of any atom between adjacent orientations is less than 1 Å. The ligand conformations generated by Omega are controlled by setting the internal energy cutoff to 20 kcal/mol and adjusting the RMSD clustering parameter to produce at most 5,000 conformations.
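The two discretization rules above can be illustrated with a short sketch. This is not Wilma's implementation: the function names are hypothetical, rotations are assumed to be about axes through the ligand centroid, and the angular step is derived from the chord length travelled by the outermost atom, which gives a conservative upper bound on the step.

```python
import itertools
import math

import numpy as np


def max_rotation_step(coords, max_move=1.0):
    """Largest rotation angle (radians) about the centroid for which no atom
    moves more than `max_move` angstroms between adjacent orientations."""
    centered = np.asarray(coords, dtype=float)
    centered = centered - centered.mean(axis=0)
    r_max = np.linalg.norm(centered, axis=1).max()
    if r_max <= max_move / 2.0:
        # Even a half-turn displaces the outermost atom by at most 2 * r_max.
        return math.pi
    # An atom at distance r from the axis moves along a chord of 2*r*sin(theta/2).
    return 2.0 * math.asin(max_move / (2.0 * r_max))


def translation_grid(box_min, box_max, step=0.5):
    """Yield the grid points of a rectangular docking box (angstroms)."""
    axes = [np.arange(lo, hi + 1e-9, step) for lo, hi in zip(box_min, box_max)]
    for point in itertools.product(*axes):
        yield np.array(point)
```

For a ligand whose outermost atom lies 5 Å from the centroid, the step comes out at about 0.2 rad (roughly 11.5°), which shows why larger ligands require a finer rotational grid.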

The original weighted 5-term scoring function used for Wilma docking was trained to recover the most native states using 320 protein–ligand complexes from the curated CSAR dataset [21]:

$${\text{WilmaScore}}1 \, = w_{1} E_{\text{coul}} + w_{2} E_{\text{vdw}} + w_{3} E_{\text{HB}} + w_{4} E_{\text{psc}} + w_{5} E_{\text{npsc}}$$
(1)

This scoring function includes a coulombic interaction term, E coul, a van der Waals 6-12 Lennard-Jones potential, E vdw, an explicit H-bond term, E HB, which considers donor and acceptor orientations, and two surface (polar and non-polar) complementarity terms, E psc and E npsc. For this study we calibrated a different version of the scoring function for Wilma docking that replaces the two surface complementarity terms by a single term, E flaws, which introduces an energetic penalty for flaws present in the docked pose in terms of protein–ligand polar complementarity.

$${\text{WilmaScore}}2 = w_{1} E_{\text{coul}} + w_{2} E_{\text{vdw}} + w_{3} E_{\text{HB}} + w_{4} E_{\text{flaws}}$$
(2)

These flaws account for the obstruction of polar groups by non-polar or like-charged polar groups. The E flaws model was introduced in an attempt to reduce the occasional top-ranked poses and false-positive ligands that are “flawed” because they bury partially charged atoms without forming electrostatically complementary interactions in the bound state; such cases were still observed when using the surface complementarity terms. Compared with the former surface-based model, this empirical geometrical model imposes a more stringent electrostatic desolvation penalty on such unfavorable interactions (flaws) and also takes into account the charge sign of polar interactions. Further description and implementation details of the E flaws term are provided in the Supplementary Material. The Wilma scoring function was used exclusively for structure prediction, i.e., to select the top-ranked docked pose.

Solvated interaction energy (SIE) calculations

Scoring of binding affinities was carried out using the SIE end-point force-field-based method [15–18], which approximates the binding free energy from the electrostatic and non-polar components of the interaction energy and the desolvation free energy:

$${\text{SIE}} = \alpha (E_{\text{coul}} + E_{\text{vdw}} + E_{\text{RF(BEM)}} + E_{\text{npsolv}} ) \, + C$$
(3)

where E coul and E vdw describe solute–solute interactions by intermolecular coulombic and van der Waals interaction energies in the bound state calculated with the AMBER and GAFF molecular mechanics force fields [22–24]. Desolvation effects are described by E RF(BEM), the change in the reaction field energy between the bound and free states calculated with a continuum model based on a boundary element solution to the Poisson equation using the BRI BEM program [25, 26] and a solute dielectric constant D in = 2.25, and by E npsolv, the non-polar desolvation approximated from a linear proportionality with the change in solute molecular surface area [27–29]. The free state of the system is obtained by rigid separation of the interacting molecules from the bound state. Partial atomic charges for protein atoms are taken from the AMBER force field, whereas organic solutes are assigned AM1-BCC partial charges [30, 31]. α is a global scaling factor of the total raw solvated interaction energy relating to the scaling of the binding free energy due to configurational entropy effects [32, 33]. The standard parameters of the SIE function in Eq. (3) are α = 0.1048 and C = –2.89 kcal/mol, calibrated against a protein–ligand training dataset of 99 complexes refined by restrained energy minimization [15].
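To make the arithmetic of Eq. (3) concrete, the following minimal sketch combines the four energy components with the standard parameters quoted above. The function name and the component values in the usage note are hypothetical; in practice the components come from the force field and the continuum solvation calculations, which are not reproduced here.

```python
ALPHA = 0.1048  # global scaling factor of Eq. (3)
C = -2.89       # additive constant of Eq. (3), kcal/mol


def sie(e_coul, e_vdw, e_rf, e_npsolv, alpha=ALPHA, c=C):
    """Solvated interaction energy (kcal/mol) from its four components.

    e_coul, e_vdw : intermolecular coulombic and van der Waals energies
    e_rf          : change in reaction field energy upon binding
    e_npsolv      : non-polar desolvation term
    """
    raw = e_coul + e_vdw + e_rf + e_npsolv
    return alpha * raw + c
```

With illustrative components of −20, −45, +25 and −10 kcal/mol, the raw sum is −50 kcal/mol and the scaled estimate is 0.1048 × (−50) − 2.89 = −8.13 kcal/mol, which shows how strongly the scaling factor compresses the raw interaction energy.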

We also explored prospectively a different SIE function in which the solvation terms are replaced by our latest continuum solvation model, FiSH, which captures some of the properties of the first shell of hydration [34, 35]. For example, the electrostatic desolvation in the FiSH model, E RF(FISH), can account for charge asymmetry effects. Also, instead of a single surface-area-based term for all non-electrostatic components of solvation, FiSH includes an additional continuum van der Waals term, E cvdw, to more accurately describe the solute–solvent non-polar interactions, and a separate surface-area-based cavity term, E cav. Unlike the default solvation model within SIE, which uses a solute dielectric of 2.25, the FiSH model uses a solute dielectric of 1.0. The modified SIE + FiSH scoring function then has the form

$${\text{SIE}} + {\text{FiSH}} = \alpha (E_{\text{coul}} + E_{\text{vdw}} + E_{\text{RF(FISH)}} + E_{\text{cvdw}} + E_{\text{cav}} ) \, + C$$
(4)

where the parameters α = 0.1232 and C = 1.46 were obtained by training against the same 99 protein–ligand data set used for the original SIE function [15].

Finally, another SIE variant that implements two terms from the Wilma docking program, the explicit hydrogen bonding term, E HB, and the energetic penalty term for flaws of protein–ligand complementarity, E flaws,

$${\text{SIE}} + {\text{HB}} + {\text{FLAW}} = \alpha (E_{\text{coul}} + E_{\text{vdw}} + E_{\text{RF(BEM)}} + E_{\text{npsolv}} ) + \beta E_{\text{HB}} + \delta E_{\text{flaws}} + C$$
(5)

was calibrated against the 320 protein–ligand complexes from the curated CSAR dataset [21], leading to weighting factors β = −0.4 and δ = 1.2, while keeping α and C at the default values in Eq. (3).

Prior to SIE, SIE + FiSH and SIE + HB + FLAW calculations, all complexes were refined by constrained energy minimization as described previously [18, 20].

The average CPU time for a Wilma–SIE calculation was of the order of 10 min for a typical protein–ligand complex in the HIV integrase virtual screening exercise. It generally took Wilma about 0.1 s to exhaustively dock one conformation of a ligand. For each ligand up to 5,000 Omega-generated conformations were docked. The docked poses were then clustered and representatives from each cluster were rescored with SIE. Each protein–ligand representative took about 20 s to rescore.
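The timings quoted above can be combined into a rough per-ligand budget. The sketch below assumes that the docking and rescoring stages simply add up, and it uses a hypothetical count of five cluster representatives per ligand, a number that is not given in the text and is chosen here only because it reproduces the quoted order of 10 min.

```python
def wilma_sie_cpu_seconds(n_conformers, n_representatives,
                          dock_s_per_conformer=0.1, rescore_s_per_pose=20.0):
    """Approximate CPU time (s) for docking plus SIE rescoring of one ligand."""
    docking = n_conformers * dock_s_per_conformer
    rescoring = n_representatives * rescore_s_per_pose
    return docking + rescoring


# Upper-bound case: 5,000 conformers, 5 representatives (assumed value).
budget_minutes = wilma_sie_cpu_seconds(5000, 5) / 60.0
```

Under these assumptions the exhaustive docking stage (about 500 s) dominates the cost, and each additional rescored representative adds 20 s.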

Structural preparation

HIV-integrase data set for virtual screening

The 1.9-Å-resolution crystal structure of the homodimeric HIV-1 integrase catalytic core domain prepared for virtual screening was taken from the PDB entry 3NF8, as suggested by the SAMPL4 organizers. Structural preparation of the dimeric structure was done in SYBYL 8.1.1 (Tripos, Inc., St. Louis, MO). The crystallographic water molecules, acetate and sulfate ions, and co-solvent and ligand molecules were removed. Chain termini of the dimeric structure, including those arising from the disordered loop between residues Lys188 and Gly193, were capped with acetyl and methylaminyl groups. Hydrogen atoms were added, with the protonation states of most ionizable side chains assigned for neutral pH. Exceptions include the side chains of Asp64 and Glu92 in both monomers, which were protonated. Tautomeric and protonation states of His residues were manually assigned after visual inspection in order to maximize the H-bonding network; note that protonated forms were assigned to His72 and His183 in both monomers. Polar hydrogen atoms were oriented to maximize H-bonding, and then the structure was refined by energy minimization with the AMBER force field using harmonic constraints of 3 and 20 kcal/(mol·Å²) for the non-hydrogen side-chain and backbone atoms, respectively.

In order to prepare the database of ligands for virtual screening, we first verified the provided protonation states at neutral pH and introduced alternate states for 13 ligands. These include deprotonation of pyridine N atoms in ligands AVX101125_0, AVX17228_0 and AVX17231_0, deprotonation of aromatic amine N atoms in AVX17264_0, AVX17264_1, AVX38752_0, AVX38752_1, AVX40869_0, AVX40872_0 and AVX62526_0, deprotonation of the imidazole ring in ligand AVX62778_0, and tautomerism between piperidine N atoms in ligands AVX40989_0 and AVX40989_1. Partial charges were calculated with the AM1-BCC method [30, 31], as implemented in Molcharge (OpenEye, Inc.), using as input the lowest-energy conformation generated by Omega (OpenEye, Inc.).

HIV-integrase data set for affinity prediction

Two sets of structures were prepared. In one set, cognate protein structures were used for each ligand, as inferred from the corresponding crystal structures. In the other set, a common protein structure was used for all ligands. In the cognate set, the eight crystal structures of complexes provided for prospective predictions, as well as the suggested control structure 3ZSQ of a complex with previously measured binding affinity, were prepared in a similar way as described earlier for the virtual screening data set, followed by constrained energy minimization of the complexes around the ligands as required for SIE calculations [18, 20]. The number of protein atoms was kept the same in all these complexes, which required deletion of the C-terminal Ala residue in one of the structures (the AVX17557 complex). In the common set, the cognate protein structures in all these complexes were replaced by the 3NF8 structure prepared for virtual screening, after root-mean-square fitting of backbone atoms, and then refined by the same energy minimization protocol.

Host–guest data sets

The provided structures of the cyclic cucurbit[7]uril (CB7) and OctaAcid hosts used for Wilma docking and SIE affinity scoring were first energy-minimized with the GAFF force field, AM1-BCC partial charges and a distance-dependent dielectric constant. The cyclic OctaAcid host, containing eight carboxylate side chains, was used in the state corresponding to the formal net charge of −8e. The rotameric states of its four aliphatic carboxylate side chains were manually changed into a symmetrical geometry prior to energy minimization. The structures of all the guests (14 amines as CB7 guests and 9 carboxylic acids as OctaAcid guests) were docked in the most probable protonation states at the corresponding experimental pHs (as provided, with the exception of the CB7 guest #10, for which an alternate state corresponding to a mono-protonated piperidine ring was also docked). A training set of seven guests with measured binding affinities for the CB7 host [36] was prepared as described previously [20].

Vibrational entropy calculations

The relatively small size of the host–guest systems allows the direct application of a normal mode analysis (NMA) to compute the vibrational entropy change upon binding [37]. Here, the AMBER force field with a distance-dependent dielectric was used for the minimization and construction of the mass-weighted Hessian matrix.
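Once the normal-mode frequencies are available, the vibrational entropy follows from the standard quantum harmonic-oscillator expression. The sketch below assumes that the frequencies have already been obtained by diagonalizing the mass-weighted Hessian and converted to wavenumbers, and that the six near-zero translational and rotational modes have been removed. The function names and the default temperature of 298.15 K are assumptions made for illustration.

```python
import numpy as np

H = 6.62607015e-34    # Planck constant, J s
KB = 1.380649e-23     # Boltzmann constant, J/K
C_CM = 2.99792458e10  # speed of light, cm/s
R_KCAL = 1.987204e-3  # gas constant, kcal/(mol K)


def vibrational_entropy(wavenumbers_cm1, temperature=298.15):
    """Harmonic vibrational entropy in kcal/(mol K) from modes in cm^-1."""
    nu = np.asarray(wavenumbers_cm1, dtype=float)
    x = H * C_CM * nu / (KB * temperature)  # h*nu / (kB*T), dimensionless
    per_mode = x / np.expm1(x) - np.log1p(-np.exp(-x))
    return R_KCAL * per_mode.sum()


def minus_t_delta_s_vib(freq_complex, freq_host, freq_guest, temperature=298.15):
    """-T * dS_vib (kcal/mol) for host + guest -> complex."""
    delta_s = (vibrational_entropy(freq_complex, temperature)
               - vibrational_entropy(freq_host, temperature)
               - vibrational_entropy(freq_guest, temperature))
    return -temperature * delta_s
```

Low-frequency modes dominate the sum, so the result is sensitive to how soft modes are treated; this sketch applies the harmonic formula to every mode without any cutoff.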

Results and discussion

Host–guest affinity prediction

We submitted two prospective models for each of the CB7 and OctaAcid host–guest affinity prediction data sets, one based on the standard SIE scoring function in Eq. (3) and the other one on the SIE + HB + FLAW function in Eq. (5). We used our exhaustive docking program Wilma to arrive at bound conformations for host–guest complexes. The search space was defined large enough to allow docking of the guest at any contact position around the host. In general, the top-scored pose for all guests was found to bind through the central hole-region of the hosts. Both hosts are macrocycles having a circular geometry with a central hole where certain guests are recognized with surprisingly high affinity given the relatively small size of these systems [36]. Whereas CB7 is a neutral host, OctaAcid is negatively charged due to eight carboxylate side chains disposed peripherally and away from the macrocycle [38, 39]. Their guests are depicted in Figures S1 and S2 from the Supplementary Material.

The statistical performances of the models are listed in Table 1 (see Table S1 for the values of the predicted binding affinities). Since the results with the two scoring functions were similar, we will discuss only those based on the SIE function. There is good correlation with SIE for both hosts, but the slopes are small; that is, the predicted range is much smaller than the experimental one. One way to modulate the correlation slope is to rescale the SIE function in terms of the enthalpy–entropy compensation factor α in Eq. (3) specifically for the system being investigated. This is justified since it has been previously shown that the CB7 host, for example, requires a higher energy efficiency factor than the β-cyclodextrin host [33, 40, 41], where the energy efficiency factor is the degree to which attractive forces are effective in generating binding free energy rather than being cancelled by entropy losses. This points towards a larger value for the α scaling factor in the SIE formulation. Hence, we explored this possibility retrospectively by deriving a rescaled SIE function based on previously published data for guests binding to CB7 (7 complexes) [36]. This training model leads to an α scaling factor of 0.3572 (with a positive constant C = 2.24), hence a significantly larger scaling than that for the standard SIE function (0.1048), in agreement with previous observations [40, 41]. Application of the rescaled SIE function to the SAMPL4 CB7 data set led to a retrospective model with a much-improved correlation slope (Table 1; Fig. 1a, b).
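Rescaling SIE for a given system amounts to refitting the two parameters α and C of Eq. (3) against measured affinities. A minimal sketch of such a fit is given below, assuming ordinary least squares on the raw (unscaled) solvated interaction energies; the fitting procedure actually used is not detailed in the text, and the function name is hypothetical.

```python
import numpy as np


def fit_alpha_c(raw_energies, dg_experimental):
    """Least-squares fit of dG_exp ~ alpha * E_raw + C.

    raw_energies    : raw solvated interaction energies (the bracketed sum in Eq. 3)
    dg_experimental : measured binding free energies, kcal/mol
    """
    raw = np.asarray(raw_energies, dtype=float)
    dg = np.asarray(dg_experimental, dtype=float)
    design = np.column_stack([raw, np.ones_like(raw)])
    (alpha, c), *_ = np.linalg.lstsq(design, dg, rcond=None)
    return float(alpha), float(c)
```

With only seven training complexes the two fitted parameters are weakly constrained, so a refit of this kind is best validated on external data, as was done here with the SAMPL4 set.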

Table 1 Performance of host–guest and HIV-integrase binding affinity predictions

Fig. 1 Scatter plots of calculated versus actual binding affinities for the host–guest systems and various prediction models. a CB7 host, standard SIE, submission #187; b CB7 host, refitted SIE (based on external published data for 7 complexes of CB7), retrospective; c CB7 host, refitted SIE, symmetry correction, retrospective; d OctaAcid host, standard SIE, submission #185; e OctaAcid host, refitted SIE (based on external published data for 7 complexes of CB7), retrospective; f OctaAcid host, refitted SIE, symmetry correction, retrospective. The diagonal dashed line indicates a perfect correlation

The weaker entropy compensation in CB7 binding as compared to proteins is likely due to the rigidity of the CB7 host to begin with, resulting in a reduced loss of entropy upon complex formation. In support of this explanation, we noted that the slope for the cyclic CB7 host is not only larger than that for proteins but also larger than that obtained for the acyclic analog of CB7 that we examined in SAMPL3 [20]. In order to assess qualitatively the reduced entropic costs of binding to the cyclic CB7 host, we designed a computational experiment comparing the cyclic CB7 and its acyclic cucurbituril analog studied (as host-1) in SAMPL3 [20] (Fig. 2). The loss of vibrational entropy of the target host upon guest binding (a cyclohexyl diamine) was calculated by normal mode analysis in each system. After adding the loss of rotational and translational entropy of the ligand upon binding, we calculated entropic losses –TΔS binding of only +9.7 kcal/mol in the case of the cyclic CB7 host compared with the larger loss of +13.3 kcal/mol for the acyclic host. This corroborates the smaller enthalpy–entropy compensation and hence the larger scaling factor of binding free energy for the cyclic analog relative to the acyclic one. Furthermore, since OctaAcid is also a cyclic host, we applied the scaling factor derived for CB7 and significantly improved the correlation slope for this system as well (Fig. 1d, e). All these data suggest that entropic scaling of free energy is system-dependent and can be calibrated if data are available for the system under investigation. If not enough data with a good dynamic range are available, vibrational entropy calculations by normal mode analysis may provide an alternative for comparison between various systems and appropriate adjustment of the scaling coefficient.

Fig. 2 Role of vibrational entropy of the target to free energy scaling studied by comparing the cyclic host CB7 with an acyclic host analog. The ligand bound to each host is shown in ball-and-stick representation

The symmetry of the hosts and some of the guests can have consequences on binding free energies [42, 43]. Since the host symmetry affects all ligands equally, only the guest (ligand) symmetry corrections need to be considered for relative binding free energy calculations. These corrections (~0.4 kcal/mol for a twofold symmetry) applied to the retrained SIE scoring function for both CB7 and OctaAcid systems have a marginal effect (Fig. 1c, f).
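As a quick check on the magnitude quoted above, the symmetry term can be written out explicitly. This is a worked estimate assuming room temperature (T = 298.15 K, which is our assumption) and a symmetry number σ = 2; the sign with which the term enters the binding free energy depends on the bookkeeping convention and is not specified here.

$$RT\ln \sigma = (1.987 \times 10^{-3}\;{\text{kcal}}\,{\text{mol}}^{-1}\,{\text{K}}^{-1})(298.15\;{\text{K}})\ln 2 \approx 0.41\;{\text{kcal/mol}}$$

Because the term grows only logarithmically with σ, even a threefold symmetry would contribute only about 0.65 kcal/mol, which is consistent with the marginal effect observed.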

HIV integrase affinity prediction

We submitted SIE and SIE + FiSH prospective predictions for the HIV-integrase binding affinity data set (Table 1, Table S1), which consists of eight inhibitors (depicted in Figure S3) against the binding site for the LEDGF/p75 cellular cofactor of HIV-1 integrase (termed the LEDGF site throughout the rest of the paper). Previously unreleased crystal structures of these enzyme-inhibitor complexes were made available for this blind challenge. We first used these cognate protein structures for generating SIE predictions of binding free energies (submission #182). The correlation between these predictions and the actual values is quite poor (Table 1). It is worth noting that the dynamic range of binding affinities in this data set is extremely narrow (1.2 kcal/mol), so from this viewpoint the SIE blind prediction was successful in that the dynamic range of predicted binding affinities was similarly narrow (1.4 kcal/mol). However, SIE was trained and externally tested to achieve a performance of about 2 kcal/mol mean-unsigned error [15, 18, 20] and hence it is not capable of reliably ranking binding affinities within smaller dynamic ranges. The absolute magnitude of binding affinities was also overestimated by SIE in this data set (Fig. 3a). This may relate to the fact that SIE suffers from a certain mass bias and the ligands in this data set are relatively large (with molecular weights between 364 and 574 Da) for their measured binding affinities (0.2–1.5 mM dissociation constants).
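The narrow dynamic range quoted above follows directly from the measured dissociation constants. The sketch below converts a dissociation constant into a binding free energy through ΔG = RT ln K d, assuming a temperature of 298.15 K and a 1 M standard state; both are assumptions, since the experimental conditions are not restated here, and the function name is hypothetical.

```python
import math

R_KCAL = 1.987204e-3  # gas constant, kcal/(mol K)


def kd_to_dg(kd_molar, temperature=298.15):
    """Binding free energy (kcal/mol) from a dissociation constant in mol/L."""
    return R_KCAL * temperature * math.log(kd_molar)


# Extremes of the affinity data set: 0.2 mM and 1.5 mM.
dg_strongest = kd_to_dg(0.2e-3)
dg_weakest = kd_to_dg(1.5e-3)
dynamic_range = dg_weakest - dg_strongest
```

Under these assumptions the two extremes correspond to roughly −5.0 and −3.9 kcal/mol, a spread of about 1.2 kcal/mol, which is smaller than the approximately 2 kcal/mol mean unsigned error expected from SIE.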

Fig. 3 Scatter plot of predicted versus experimental binding affinities of HIV-integrase ligands. a SIE function on cognate structures (submission #182); b SIE function on single structure (submission #183); c SIE + FiSH function on cognate structures (submission #184); d SIE + FiSH function on single structure (retrospective). The diagonal dashed line indicates a perfect correlation

We then wanted to test whether the small structural changes in the protein target seen across the various cognate crystal structures contribute favorably or not to SIE predictions. We refer here to changes that are distributed all around the protein molecule and involve main-chain and side-chain fluctuations that are not necessarily limited to transitions over large torsional barriers. We also noted that several exposed side chains close to moieties that are common to all these ligands, for example Gln95 and His171, adopt significantly different rotameric states in different crystal structures. Therefore, we replaced the cognate protein structures with a single external structure (taken from the available PDB structure 3NF8). Using this common target structure (submission #183) did not worsen the SIE prospective predictions, which are still within a narrow (1.1 kcal/mol) dynamic range, and we actually noticed a slight improvement in the magnitude of absolute predictions (Fig. 3b). This indicates that using a common protein structure is a good strategy for noise reduction in predictions with SIE and related methods, which are based on a single conformation of the complex. It also more faithfully represents the routine application of SIE in most virtual screening campaigns. If small conformational movements are needed close to the ligand, then those can be introduced on the common scaffold, thus eliminating the noise introduced by distant movements.

Application of the SIE + FiSH variant of the scoring function improved the absolute magnitudes of predictions for half of the complexes with the cognate (multiple) target structure approach (submission #184, Fig. 3c) and for all but one complex in the common (single) target structure approach (retrospective prediction, Fig. 3d). This indicates that SIE + FiSH may be less sensitive to size bias than SIE, reinforcing some of the earlier findings based on our experience in SAMPL3 [20]. It is also apparent from the current results that the spread of the predictions with SIE + FiSH is larger than the SIE-based spread and the experiment, which indicates that this model is more sensitive to protein structural changes. This further reinforces the value of using the common structure approach.

HIV-integrase pose prediction and virtual screening

The HIV-integrase virtual screening challenge consisted of identifying a set of binders from a final full set of 305 molecules, some of which are stereoisomers of the same compounds. Retrospectively, there were 56 distinct binders in this data set, the rest consisting of proven non-binding decoy molecules that are structurally similar analogs of the binders. One peculiarity of this virtual screening challenge was that the target, HIV-integrase, can bind ligands at three distinct sites (actually six considering the dimer): the LEDGF site, the Y3 site, and the fragment site, although most binders included in this set bind to the LEDGF site [44, 45]. We directed virtual screening of the full set to all three sites and selected the best scoring pose overall using the standard SIE scoring function in Eq. (3) (submission #146) and the SIE variant that includes hydrogen bonding and flaws terms as in Eq. (5) (submission #147). In a third submission (#148) we also used the newer SIE + HB + FLAW function but ranked compounds based on the scores calculated at the LEDGF site only.

Obviously, the success or failure of the virtual screening experiment hinges greatly on the docking step. Hence, before discussing the virtual screening results, we wanted to assess the docking accuracy based on the SAMPL4 pose prediction challenge, which consisted of the HIV-integrase binders together with their known actual poses. Our pose prediction submissions (#154, #155 and #156) were essentially those from our corresponding virtual screening runs mentioned above (submissions #146, #147 and #148, respectively). An overview of our pose prediction results over all binders is shown in Table 2. About a third of the ligands were docked well (up to 2 Å RMSD from the actual pose) by Wilma when pose selection was done with the standard SIE function over all three sites. Slightly fewer ligands were docked well when scored by SIE + HB + FLAW over all three sites and also when docking was directed only around the LEDGF site. Half of the ligands were docked closer than 4.52 Å RMSD from the actual pose with standard SIE scoring over all three sites, which is a reasonable performance.
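The pose statistics discussed here (fraction docked within 2 Å, mean and median RMSD) can be computed as in the following sketch. It assumes that predicted and reference poses share the same atom ordering and the same receptor frame, so no superposition is applied, and it ignores symmetry-equivalent atoms; whether the SAMPL4 evaluation handled symmetry differently is not stated here. The function names are hypothetical.

```python
import numpy as np


def pose_rmsd(predicted, reference):
    """RMSD (angstroms) between two poses given as (n_atoms, 3) arrays.

    No fitting is done: docking poses are compared in the receptor frame.
    """
    diff = np.asarray(predicted, dtype=float) - np.asarray(reference, dtype=float)
    return float(np.sqrt((diff ** 2).sum(axis=1).mean()))


def summarize_rmsds(rmsds, cutoff=2.0):
    """Fraction of ligands within `cutoff`, plus mean and median RMSD."""
    values = np.asarray(rmsds, dtype=float)
    return {
        "fraction_within_cutoff": float((values <= cutoff).mean()),
        "mean": float(values.mean()),
        "median": float(np.median(values)),
    }
```

Reporting the median alongside the mean is useful in this setting because a few ligands docked at the wrong site produce very large RMSD values that inflate the mean.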

Table 2 Performance of pose predictions for binders of HIV-integrase

The interpretation of the pose prediction challenge in SAMPL4 is complicated by the existence of several binding sites for various ligands as well as by the binding of some of the compounds at more than one site. At the same time, this also represents a very stringent test of the docking method. Half of the ligands are docked at the correct site using the standard SIE scoring, with an average RMSD over this fraction of ligands of only 2.36 Å and with half of this fraction of ligands docked closer than 1.60 Å to the actual pose (Table 2). When docking was constrained only around the LEDGF site, 85 % of the ligands were docked in the correct site, because the rest of the compounds actually bind to other sites. The important metrics to remember in this case are the average and median RMSD values of 6.96 and 5.20 Å calculated for this fraction of compounds. These relatively large RMSD values indicate a certain level of misdocking within the relatively wide docking region that we set up around the LEDGF site.

To gain more insight into the performance of our docking program in this system, we focused on the subset of eight compounds from the HIV-integrase affinity prediction challenge presented earlier, for which we had access to the actual poses. These compounds bind to the same pocket in the LEDGF site. However, although these compounds were docked in the box around the LEDGF site, Wilma docking positioned correctly (RMSD < 2 Å) only one of the eight compounds. There are several compelling reasons for this poor docking result.

First, the box defining the docking space around the LEDGF site was much larger than the actual binding site of the ligands. We purposely defined a larger region because of the presence of a deep pocket adjacent to the actual binding site of these ligands (Fig. 4a). We found that most compounds docked into the deeper pocket rather than in the much shallower actual pocket. This was irrespective of whether SIE or SIE + HB + FLAW scoring functions were used to rank the poses generated by the Wilma docking program. Retrospectively repeating the docking on a smaller box focusing strictly around the actual binding location resulted in correct docking of five out of the eight ligands. Overall, these results may indicate that favorable van der Waals interactions in the deep pocket are overwhelming the cost of displacing water and ions from this location, leading to incorrect pose ranking. Hence, some further improvement of our scoring functions is warranted.

Fig. 4 Box defining the region used for virtual docking. HIV-integrase is represented as a molecular surface. a The box extends beyond the actual binding location of the ligand and includes an adjacent deep pocket filled with structured water molecules (red spheres) and an acetate ion. b Low-energy docked poses are found mostly in the adjacent deep pocket and less in the shallower actual binding site

Secondly, as already mentioned, these HIV-integrase inhibitors are weak binders (>100 μM dissociation constant). They are also highly flexible, having more than eight rotatable bonds. Weak and flexible ligands are prone to bind promiscuously at several locations on the protein surface. This is reflected in our docking results, where we find that poses in the wrong pocket outscored the correct poses by just 0.2 kcal/mol (Fig. 4b). Docking of flexible weak-binding ligands is highly prone to generate false positives, i.e., good-scoring incorrect poses. Handling highly flexible ligands is also difficult because the bioactive conformation may not be readily generated before docking (although Omega failed to generate the bioactive conformation for only one of the eight ligands). Hence, this challenge seems to fall outside the applicability domain of our Wilma–SIE docking–scoring virtual screening platform, which is designed to reliably differentiate not-too-flexible (fewer than eight rotatable bonds) strong binders (at least sub-μM) from non-binders.

Despite the difficulties, our virtual screening results were better than random, as shown by the enrichment factors (EF) and by the ROC curves and their area-under-curve (AUC) values (Table 3). Interestingly, the early enrichment obtained with the standard SIE function (EF of 1.25 at 10 % of the ranked library, submission #146) was slightly improved by applying the SIE + HB + FLAW scoring function (EF of 1.79 at 10 % of the ranked library, submission #147; see Fig. 5) over all three sites. This contrasts with the slightly weaker performance of SIE + HB + FLAW versus SIE for pose prediction in the case of binders (Table 2), and hence indicates a role of the SIE + HB + FLAW function in filtering out (via the E flaws penalty) some of the false-positive non-binders. Encouraged by the docking results with the LEDGF box confined around the actual binding site, we retrospectively repeated virtual screening on this smaller box. While the performance improved (Table 3; Fig. 5), this required prior knowledge of the system, which we purposely excluded from the blind evaluation of our methods, as it will not always be available in real-life applications of our tools.
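For readers unfamiliar with these metrics, the following is a minimal sketch of how an enrichment factor at a given fraction of the ranked library and a ROC AUC can be computed from a list of scores and binder labels. It is not the evaluation code used in the challenge; the function names, the rounding of the top fraction, the tie handling, and the toy scores are illustrative assumptions. Scores are assumed to be SIE-like, i.e., more negative means better predicted affinity.

```python
def enrichment_factor(scores, is_binder, fraction=0.10):
    """EF = (binder rate in the top fraction) / (binder rate in the whole library).

    Assumes lower (more negative) scores rank first. The size of the top
    fraction is rounded to the nearest integer, with a minimum of one compound.
    """
    n = len(scores)
    n_top = max(1, int(round(fraction * n)))
    order = sorted(range(n), key=lambda i: scores[i])  # best score first
    hits_top = sum(is_binder[i] for i in order[:n_top])
    total_binders = sum(is_binder)
    return (hits_top / n_top) / (total_binders / n)


def roc_auc(scores, is_binder):
    """AUC as the probability that a randomly chosen binder scores better
    (lower) than a randomly chosen non-binder; ties count as 0.5."""
    binders = [s for s, b in zip(scores, is_binder) if b]
    others = [s for s, b in zip(scores, is_binder) if not b]
    wins = 0.0
    for p in binders:
        for q in others:
            if p < q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(binders) * len(others))


# Hypothetical toy library of ten compounds (values are not from the study).
toy_scores = [-9.0, -8.5, -8.0, -7.5, -7.0, -6.5, -6.0, -5.5, -5.0, -4.5]
toy_labels = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0]

ef10 = enrichment_factor(toy_scores, toy_labels, fraction=0.10)
auc = roc_auc(toy_scores, toy_labels)
```

An EF of 1.0 and an AUC of 0.5 correspond to random ranking, which is the baseline against which the values in Table 3 should be read.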

Table 3 Performance of virtual screening against HIV-integrase
Fig. 5

ROC curves for virtual screening on HIV-integrase. The curves correspond to the models listed in Table 3

In the case of alternate protonation and tautomeric states of a ligand, Wilma–SIE selects the state with the lowest SIE value. Feasible alternate states were included prospectively for 13 compounds in the HIV-integrase virtual screening. Retrospectively, it turns out that all these compounds are non-binders, and they were correctly ranked low by Wilma–SIE. However, the RMS variation in SIE score between alternate forms is 0.96 kcal/mol, which is not negligible and underscores earlier reports of SIE sensitivity to protonation states [18]. The impact is larger in the fragment site (1.28 kcal/mol) than in the LEDGF site (0.93 kcal/mol) or the Y3 site (0.51 kcal/mol). On the same subject, the selected alternate protonation state of guest #10 of CB7 gave an SIE improved by −1.56 kcal/mol over the other protonation state. This selection improves the correlation with experimental data: without it, the correlation coefficient of submission #187 would have decreased from 0.74 to 0.71 (Table 1).
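The state-selection rule and the spread between alternate forms can be illustrated with a short sketch. The text does not specify how the RMS variation was computed, so the definition below (pooled RMS deviation of each state's SIE from the mean over that compound's states) is an assumption, as are the function names, compound labels, and numerical values.

```python
import math


def select_state(sie_by_state):
    """Return (state_name, sie) for the state with the lowest, i.e. most
    favorable, SIE value."""
    return min(sie_by_state.items(), key=lambda kv: kv[1])


def rms_variation(compounds):
    """Pooled RMS deviation of each state's SIE from its compound's mean SIE.

    This definition is an assumption made for illustration only.
    """
    squared = []
    for states in compounds.values():
        values = list(states.values())
        mean = sum(values) / len(values)
        squared.extend((v - mean) ** 2 for v in values)
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical SIE values in kcal/mol (not taken from the study).
toy_compounds = {
    "cpd_A": {"neutral": -6.0, "protonated": -7.0},
    "cpd_B": {"tautomer_1": -5.0, "tautomer_2": -5.0},
}

best_state = select_state(toy_compounds["cpd_A"])
spread = rms_variation(toy_compounds)
```

Because only the minimum is retained for ranking, any error in the SIE of a single alternate state propagates directly into the compound's final score, which is why a spread approaching 1 kcal/mol matters.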

Conclusions

The SAMPL4 blind challenge provided a stringent test of the Wilma–SIE docking–scoring platform, whose performance remains consistent with past experience on various systems. The strength of Wilma–SIE is in providing good correlations with binding affinities over dynamic ranges of 3 kcal/mol or wider. Using a common protein structure for all ligands can reduce the noise, while incorporating the more sophisticated solvation treatment of the FiSH model improves absolute predictions. Although consistently achieving sub-2 kcal/mol accuracy in relative binding free energies remains a challenge even when using the actual binding modes, the predictions correctly detect such narrow dynamic ranges. Estimating the change in the target's vibrational entropy may represent a way to improve absolute predictions. The present study further delineates the applicability domain of the Wilma–SIE platform for virtual screening. The formidable task of filtering out false positives may be made more tractable by strengthening the penalty on non-complementary polar contacts. Wilma–SIE is not intended for detection of promiscuous weak binders with relatively high flexibility, although even in such difficult cases it can lead to better-than-random virtual screening results.