Introduction

Computer-Aided Drug Discovery (CADD) methods are constantly under development and a wide spectrum of options is available to the scientific community to address each specific situation at every stage of the drug discovery process [1–3].

Independent validation experiments are extremely useful to test the different methods, try them out under different circumstances and validate them for a specific goal. For instance, there have been experiments to help the development of protein structure modeling software [4], the prediction of protein–protein interactions [5] or certain physico-chemical properties of small molecules [6].

In this direction, the D3R Grand Challenge 2015 provides an independent exercise to assess and validate CADD tools related to protein–ligand interactions. Two proteins (Hsp90 and MAP4K4) with datasets comprising different ligands with measured affinities and crystal structures were provided as blind sets. Different measures for each of the datasets were used to evaluate the performance of different methods in two situations that are common in drug discovery projects: ligand ranking and binding mode prediction.

Docking, scoring and free energy methods have been widely applied in structure-based drug discovery [7–12] as they provide excellent assistance, particularly in the early stages of the development of new drugs. Docking is a very common method that can be used both for predicting the binding mode of a protein–ligand complex and for virtually assaying thousands to millions of drug-like molecules in a relatively short amount of time, speeding up the finding of promising candidates and dramatically decreasing the cost in comparison with the experimental alternative [13]. However, the scoring functions employed in docking have been trained to reproduce specific data sets and are qualitative in nature. As such, they are not expected to correlate with binding free energies [14]. Further limitations include receptor flexibility and the presence of water molecules that can be either trapped or displaced by the ligand.

In our particular approach, docking is a central component in tackling the D3R Grand Challenge 2015, but we aim to overcome some of its limitations with complementary tools and, whenever possible, by guiding the calculations with previous knowledge about the systems. Specifically, we have used the rDock software [15] as the docking engine, using pharmacophoric restraints to ensure that the predicted ligand poses fulfil certain key interaction points [16–18]. In the case of Hsp90, they correspond to a hydrogen bond with the carboxylate of Asp93 and, in the case of MAP4K4, a hydrogen bond with the nitrogen atom of Cys108 in the hinge region (a short linear sequence that acts as a hinge between the N-terminal and C-terminal domains in kinases). These interaction points can be identified merely by superimposing all the available crystal structures of protein–ligand complexes for each system in the PDB and obtaining a pharmacophore definition as detailed in the Methods, which can be supplied to rDock in order to increase its efficiency as shown in previous studies [15]. Hsp90 presents at least one water molecule that can be displaced by certain ligand classes. By excluding this water molecule, we make the receptor definition valid for all chemotypes [19]. Then, to address the protein flexibility, we took a knowledge-based approach. We investigated the effect of protein flexibility on docking performance using Hsp90 as a test set, so we are familiar with the different conformations the protein can adopt upon ligand binding. We selected the most common conformation amongst all known Hsp90 protein–ligand complexes (namely, closed lid) for running docking and reviewed the quality of the predictions knowing that certain chemotypes can induce a conformational change of the lid to the open or helical states [20]. In contrast, MAP4K4 is a much less well characterized protein and we took a best guess based on our previous knowledge about other kinases. As we will discuss below, the different degree of previous knowledge for each system has had a major effect on the outcome and highlights the importance of the human factor, which remains essential even as the computational tools improve.

Finally, we have introduced the use of Dynamic Undocking (DUck), a new tool used to assess the structural stability of protein–ligand complexes [21]. Here we have experimentally adopted a consensus approach, where the docking poses are re-evaluated and re-ranked based on their resistance to breaking the key hydrogen bonding interaction. This approach allowed us to detect not only false positives but also false negatives from docking results. DUck has been shown to be orthogonal to docking, as it evaluates structural stability as opposed to binding affinity [21]. For some ligands, re-scoring by DUck has allowed us to identify good binding poses which would a priori be discarded due to poor docking scores. In other cases, docking and DUck selected the same pose, increasing our confidence in predicted binding modes that would be deemed doubtful if they had been backed up only by docking.

In the next sections, we will discuss in detail the methodology and the results obtained in the D3R Grand Challenge, drawing some conclusions to explain the failures and successes, as well as some recommendations for future editions of this challenge.

Methods

Selection of cavity

The D3R Grand Challenge 2015 has two differentiated objectives: predict the crystallographic poses and the affinities or rankings for a series of ligands. Both of these objectives rely on a good definition of the system, for which a reliable characterization of the ligand-receptor interaction is crucial. For Hsp90, 4 receptor structures from the PDB were proposed by the organizers (2JJC, 2XDX, 4YKR and 4YKY). All of them were in the so-called closed conformation of the lid with the exception of 2XDX, which had the lid in the open conformation.

As most of the known ligand-Hsp90 complexes have the lid in the closed conformation, 2XDX was discarded. 2JJC was also discarded because, unlike the ligands in the test set, its ligand is very small and may be unable to modulate the cavity for better docking performance [22, 23]. Structures 4YKY and 4YKR are very similar in all respects (both bind a ligand of the resorcinol family) and were considered equivalent. The former was selected as reference structure. In a previous study [15] we demonstrated the improvement in virtual screening applications when guiding the docking process by adding previous knowledge, with a specific example for Hsp90. Additionally, it is known that three interfacial water molecules have an important role mediating the protein–ligand contacts. For this reason, they have been included in all docking runs as structural waters in the binding site. Some ligand types (e.g. adenine) interact with a fourth interfacial water molecule, but it is displaced by other ligands (e.g. resorcinol) and cannot be kept as part of the receptor [24]. Hence, the protocol used in all the docking calculations for Hsp90 includes a pharmacophore definition of two hydrogen bonds, with Asp93:OD2 and with one of the water molecules (included in all the runs as non-displaceable), as previously defined in [15]. For undocking, the water molecule is added explicitly to the initial structure.

In the case of MAP4K4, 2 receptor structures from the PDB were supplied by the organizers (4OBO and 4U44). The main difference between the conformations of the two crystals is a loop folding towards the hinge region in 4OBO, thus decreasing the size and the solvent exposure of the binding site. Because of these restrictions we decided to use 4U44, which has a bigger and more accessible binding site, as reference for all MAP4K4 applications. In order to guide docking, we performed a pharmacophore search (more details in the next section) using all crystal structures of MAP4K4-ligand complexes in the PDB. We then supplied all docking calculations with a pharmacophore defined by a hydrogen bond with Cys108:N, located in the hinge region.

Pharmacophore search

To get a reliable pharmacophore definition for the MAP4K4 system, a set of known protein–ligand 3D structures was necessary. We selected all MAP4K4 protein–ligand complexes from the PDB (4OBO, 4OBP, 4OBQ, 4RVT, 4U40, 4U41, 4U42, 4U43, 4U44, 4U45, 4ZK5 and 5DI1) and aligned them to the reference 4U44. The “Pharmacophore Search” tool of MOE was run and a hydrogen bond with Cys108:N in the hinge region was selected as pharmacophore. It was fulfilled by all 12 ligands in the PDB subset. Moreover, it was consistent with other protein–ligand interactions in the kinase family [25, 26].

Molecular docking

For all molecular docking simulations we used rDock [15, 27], a fast and reliable docking program that we released as open source several years ago. To run rDock, only a correctly prepared 3D structure of the receptor and a definition of the binding site are needed. In this work, we defined the cavity using the crystallized ligand found in the reference PDB structure of each system (4YKY for Hsp90 and 4U44 for MAP4K4). Two rDock rbcavity parameters were decreased with respect to the default values in order to optimize the binding site definition: radius (changed from 10.0 to 6.0), which defines the region around the reference ligand that will be used to define the docking binding site, and max_cavities (from 99 to 1), as we only want to run docking in one cavity. The pharmacophoric restraints were defined as mandatory and all the ligands unable to fulfill the definition were discarded. For the docking protocol, no modifications were made to the standard as previously published [15]: 50 individual docking runs per ligand, which is considered exhaustive sampling, in order to ensure that the lowest-energy binding mode is found.

Receptor preparation

The 3D structure of the receptor has to be provided to rDock in standard Tripos MOL2 format and atom types [28]. However, as rDock relies on the user-supplied structure, we need to provide it with correct protonation states and charges, as well as correct orientations of flexible side chains (the only receptor atoms that rDock treats as flexible are the hydrogen atoms of terminal OH and NH3+ groups within 3 Å of the binding site cavity). The “Structure Preparation” tool from MOE [29] was used to protonate at pH 7.0 and correct all the issues found for the Hsp90 and MAP4K4 receptors, such as chain breaks, missing loops or disulfide bonds, incorrect residue labeling or alternate conformations. The prepared structures were then saved in MOL2 format and used as input for rDock.

Ligand structure

As all ligands provided by the organization were in 2D format, LigPrep from Schrödinger [30] was used to calculate the 3D structure with correct topology, bond orders and geometry of bonds, angles, dihedrals and rings. The ionisable groups were protonated at pH = 7 with a tolerance of ±1. All ligands were saved in MDL SDF format and used as input for docking.

Dynamic undocking

We used Dynamic Undocking, or DUck, as a complementary tool to molecular docking in order to improve the overall performance of docking-based virtual screening [21]. DUck is a methodology developed in our group based on Steered Molecular Dynamics (SMD). The interaction of the ligand and the receptor with the key interaction point (specified when defining the cavity and protocols for docking) is monitored with SMD. In particular, DUck simulations consist of unbiased molecular dynamics (MD) simulations of the complex and repeated SMD simulations launched at 1 ns intervals of the MD to simulate the rupture of the ligand-receptor interaction and measure the force needed to achieve a state where the interaction has just been broken or, as we named it, a Quasi-Bound state. The work profiles obtained from the SMD simulations are processed to obtain the work to achieve the Quasi-Bound state (WQB), which will be used to score and rank the ligands. Moreover, in order to increase throughput, reduce the influence of peripheral interactions and focus on the desired interaction, we use a model receptor that includes only a small part of the protein of interest. This portion is created around the defined key interaction point and preserves its local environment, simplifying also the dissociation pathway and avoiding artifactual results (more details about DUck can be found in Ref. [21] and http://www.ub.edu/bl/undocking/).

For Hsp90 and MAP4K4, the following protocol was set: protein models were created containing the residues with any atom within 6 Å around the key interaction points (as detailed in the “Selection of cavity” section) and manually refined to include other important residues for the binding site environment (Figure S1; Table S2). The best-scored docking poses for each ligand were subjected to an in-house script that automatically parameterized each ligand and prepared the necessary files for running the MD and SMD simulations of DUck. Each protein–ligand complex system was placed in a cuboid box with a minimum distance between each atom and the edge of the box of 12 Å in every dimension and solvated with TIP3P water molecules. Na+ or Cl− ions were added to the solvation box depending on the charge of each of the protein–ligand complexes in order to ensure the electroneutrality of the simulated systems.

Due to the artificiality of the protein models, MD simulations were run with harmonic restraints (1 kcal/mol Å²) on all heavy atoms of the receptor to prevent big conformational changes. In order to preserve the key hydrogen bond interaction during the equilibration part of the simulations, distances beyond 3 Å are penalized (parabolic restraint with k = 1 kcal/mol Å² between 3 and 4 Å; linear restraint with k = 10 kcal/mol Å beyond 4 Å). All unbiased MD steps were run using a Langevin thermostat with the cutoff for non-bonded interactions set to 9 Å and the collision frequency to 4 ps⁻¹. The equilibration consisted of 1000 cycles of minimization, gradual warming from 100 K to 300 K for 400 ps in the NVT ensemble and equilibration of the system for 1 ns in the NPT ensemble. At intervals of 1 ns (starting right after the equilibration), two SMD runs are executed from the same restart file (at 300 and 325 K, as described in Ref. [21]) for 500 ps. During this time, the distance of the key hydrogen bond is steered from 2.5 to 5.0 Å with a spring constant of 50 kcal/mol Å². More unbiased MD steps (1 ns each) were run to create more starting points for SMD runs, repeating the process as many times as desired. All simulations were run with AMBER 14 [31] using in-house NVIDIA GeForce TITAN X GPUs or at the Barcelona Supercomputing Center using NVIDIA Tesla M2090 GPUs. The AMBER 99SB force field was used for the protein and parm@Frosst [32] for the ligands.
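The distance penalty used to preserve the key hydrogen bond during equilibration can be illustrated with a short sketch. It is not the AMBER restraint input used in this work. The function name and the parameter names are hypothetical, the parabolic branch is assumed to have the form k(r − r0)² without a factor of 1/2, and the linear branch is assumed to start from the value of the parabola at 4 Å so that the energy is continuous.

```python
def hbond_restraint_energy(distance, r_flat=3.0, r_linear=4.0,
                           k_parabolic=1.0, k_linear=10.0):
    """Penalty (kcal/mol) applied to the key hydrogen-bond distance (in Å).

    No penalty up to r_flat, parabolic between r_flat and r_linear,
    linear beyond r_linear.
    """
    if distance <= r_flat:
        return 0.0
    if distance <= r_linear:
        # Assumed functional form: k * (r - r0)^2, with no 1/2 prefactor.
        return k_parabolic * (distance - r_flat) ** 2
    # Assumption: the linear branch continues from the parabola's end value.
    energy_at_switch = k_parabolic * (r_linear - r_flat) ** 2
    return energy_at_switch + k_linear * (distance - r_linear)


# Toy distances (Å) spanning the three regimes of the restraint.
for r in (2.8, 3.5, 4.0, 4.5):
    print(f"{r:.1f} A -> {hbond_restraint_energy(r):.2f} kcal/mol")
```

Under these assumptions the penalty stays small while the hydrogen bond is only slightly stretched and grows steeply once the distance exceeds 4 Å.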

Binding mode prediction

For all of the ligands where a binding mode was to be predicted, the protocol was the following: (1) run docking as described in the “Molecular docking” section above; (2) from the docking results, select a set of poses with an RMSD between them higher than 1 Å using the sdrmsd script from the rDock package; (3) run DUck to calculate the WQB for all the selected poses of each ligand; (4) select the pose with the highest WQB as the correct binding mode; and (5) visually inspect the results to check that the selected poses fulfilled the defined interaction and the receptor conformation (more details in the following sections).
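Steps 2 and 4 of this protocol can be sketched in a few lines. This is an illustration under stated assumptions, not the sdrmsd script itself: poses are assumed to share the same atom order and reference frame, the RMSD is computed without symmetry correction, the greedy selection order (best docking score first) is our assumption, and the dictionary keys `score`, `coords` and `wqb` are hypothetical names.

```python
import math


def rmsd(coords_a, coords_b):
    """Plain coordinate RMSD (Å) between two poses with identical atom order."""
    if len(coords_a) != len(coords_b):
        raise ValueError("poses must have the same number of atoms")
    squared = sum((p - q) ** 2
                  for atom_a, atom_b in zip(coords_a, coords_b)
                  for p, q in zip(atom_a, atom_b))
    return math.sqrt(squared / len(coords_a))


def select_diverse_poses(poses, threshold=1.0):
    """Greedily keep poses that differ by more than `threshold` Å from all kept ones.

    Poses are visited from best (most negative) to worst docking score.
    """
    kept = []
    for pose in sorted(poses, key=lambda p: p["score"]):
        if all(rmsd(pose["coords"], k["coords"]) > threshold for k in kept):
            kept.append(pose)
    return kept


def pick_pose_by_wqb(poses):
    """Return the pose with the highest WQB (kcal/mol)."""
    return max(poses, key=lambda p: p["wqb"])


# Toy two-atom poses, for illustration only (not challenge data).
poses = [
    {"name": "A", "score": -20.0, "coords": [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]},
    {"name": "B", "score": -19.0, "coords": [(0.1, 0.0, 0.0), (1.1, 0.0, 0.0)]},
    {"name": "C", "score": -10.0, "coords": [(3.0, 0.0, 0.0), (4.0, 0.0, 0.0)]},
]
diverse = select_diverse_poses(poses)
```

In this toy case pose B is dropped because it lies within 1 Å of the better-scored pose A, so only A and C would be passed on to DUck.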

Ligand ranking

A few differences from the protocol for binding mode prediction were introduced in the case of ligand ranking: (1) run docking as described in the “Molecular docking” section above; (2) from the docking results, select the top scored pose for each ligand; (3) run DUck to calculate the WQB for the selected poses; (4) for each of the ligands in the sets, calculate the similarity to all known PDB ligands with measured affinity for the corresponding receptor (Hsp90 or MAP4K4) and take it into account to check the rankings and possible docking errors; (5) normalize the docking score and the WQB from DUck for each of the sets. All ligands were ranked according to the sum of the two corresponding normalized scores. In the cases where docking was not able to find a good binding mode (i.e. the key interaction was not fulfilled), the similarity of each ligand with respect to other ligands in the challenge set and other ligands in the PDB was used to assign a corrected ranking. Finally, a step of visual analysis was carried out to check all ligands and re-rank some of them taking into account our previous experience.
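The automated part of this consensus ranking (step 5) can be sketched as follows. The normalization scheme is not specified in the text, so a z-score is assumed here; the docking score is also assumed to be sign-flipped before summing, since more negative docking scores are better whereas higher WQB values are better. The function names are hypothetical, and the similarity-based corrections and visual inspection are deliberately left out.

```python
from statistics import mean, pstdev


def zscores(values):
    """Standardize values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def consensus_rank(ligands):
    """Rank ligands by the sum of normalized docking score and normalized WQB.

    `ligands` is a list of (name, docking_score, wqb) tuples.
    Returns ligand names, best first.
    """
    names = [name for name, _, _ in ligands]
    # Assumption: flip the docking score so that higher means better.
    docking_z = zscores([-score for _, score, _ in ligands])
    wqb_z = zscores([wqb for _, _, wqb in ligands])
    combined = {n: d + w for n, d, w in zip(names, docking_z, wqb_z)}
    return sorted(names, key=lambda n: combined[n], reverse=True)


# Toy values for illustration only (not challenge data).
toy = [("a", -25.0, 12.0), ("b", -15.0, 2.0), ("c", -20.0, 14.0)]
ranking = consensus_rank(toy)
```

Summing standardized scores gives both terms equal weight; a different normalization (for example min-max scaling) could change the order of closely ranked ligands.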

Results and discussion

Following our primary hypothesis, we designed a docking protocol that would emphasize the most important binding hot spot. This was done through the introduction of pharmacophoric restraints that forced the presence of hydrogen bonding groups at specific locations (Fig. 1). The protein conformation was chosen to be as general as possible, thus for MAP4K4 we selected 4U44 as it has a bigger cavity than other structures available. For Hsp90, the biggest cavities present a ligand-induced hydrophobic sub-pocket (the PU3 cavity), but the associated protein conformation (helical) is energetically penalized and tends to downgrade the docking results [19]. For this reason, we chose a non-helical conformation (4YKY) taking care that the binding site was not blocked by any side-chain.

Fig. 1

a Hsp90 receptor definition. Asp93 and two surrounding water molecules (shown in sticks) define the key interaction element. The pharmacophoric points (transparent blue spheres) force the presence of a H-bond donor next to Asp93:OD2 and a H-bond acceptor next to the interstitial water molecule. b MAP4K4 receptor definition. The hinge region is a characteristic binding hot spot of protein kinases. A pharmacophoric restraint forced the presence of a hydrogen bond acceptor next to Cys108:N (transparent blue sphere)

Binding mode prediction

We ran rDock to generate 50 poses per ligand. Poses with restraint penalties higher than 1 kJ/mol (indicating that the pharmacophore is not fulfilled) were discarded. After that, we selected a diverse set of the remaining poses, sorted by docking score, to be re-evaluated by Dynamic Undocking (DUck). On average, 10 poses per ligand were selected for the next step. DUck measures the work needed to break a given hydrogen bond (WQB). We have found that true ligands in their correct binding mode form hydrogen bonds that are much harder to break than decoys [21]. Here we employ this method to compare various binding modes of the same ligand. In the majority of cases, the binding pose with the best docking score also presented the highest WQB value and was proposed as the correct solution. But often DUck provides a much clearer distinction between poses, removing uncertainty from the decision. This is illustrated with the Hsp90 ligand 40, which presented two alternative binding modes (Fig. 2). In the first binding mode, the ligand interacts with Asp93 through the resorcinol, whereas the cyclic urea plays this role in the second binding mode. Though their docking scores are relatively similar (−23.4 and −18.9 kJ/mol, respectively), the hydrogen bond formed by the second binding mode is extremely labile (WQB = 0.5 kcal/mol), which makes this binding mode very unlikely. By contrast, the first binding mode presented a very strong hydrogen bond (WQB = 17.7 kcal/mol) and was selected with full confidence. For Hsp90, in several cases a lower ranking pose was selected based on the DUck calculation (Table S1). This is shown in Fig. 3, where the Hsp90 ligand 73 presents a relatively similar binding mode with two different orientations. The first one (green) is the one preferred by docking (score = −20.3 kJ/mol), whereas the second one (pink) is heavily penalized due to a steric clash of the 1-chloro-3-nitrobenzene moiety (score = 1.3 kJ/mol). Dynamic undocking indicated that the latter binding mode was actually preferred (WQB = 11.6 kcal/mol vs. 10.9 kcal/mol), which prompted us to seek a protein conformation where the second binding mode would fit without clashing. In this particular case, the ligand binds to the helical conformation (e.g. 2WI6), where a hydrophobic pocket (the PU3 pocket) emerges [20].

Fig. 2

Two binding modes proposed by docking for ligand 40 in the Hsp90 set (green and pink sticks; RMSD = 5.5 Å). Dashed lines represent the hydrogen bond between each ligand and Asp93. The crystal structure of ligand 40 is represented in white sticks for comparison

Fig. 3

Two different binding modes (RMSD = 2.5 Å) for ligand 73 in the Hsp90 set proposed by docking, represented in green and pink sticks. The green one is the preferred conformation according to docking, whereas the pink one has a very poor score due to a clash penalty. With DUck, we could detect that the correct binding mode was the pink one. The crystal structure of ligand 73 is represented in white sticks for comparison; the RMSD with respect to the pink binding mode is 0.61 Å

The results submitted to Stage 1 of the D3R Grand Challenge are summarized in Tables 1 and 2 for Hsp90 and MAP4K4, respectively. The accuracy of binding mode prediction is generally measured in terms of root mean squared deviation (RMSD) from the crystallographic pose. It is also common to convert this value to a binary decision (correct/incorrect) based on a fixed threshold (usually 2.0 Å). This is a debated topic, and several alternative solutions have been put forward [33, 34].
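The two ways of summarizing pose accuracy mentioned here, an average RMSD and a binary correct/incorrect count at a fixed threshold, can be computed as in the sketch below. The function name is hypothetical, the spread is assumed to be a sample standard deviation, and the values in the example are toy numbers rather than the per-ligand results of the challenge.

```python
from statistics import mean, stdev


def summarize_rmsd(rmsd_values, cutoff=2.0):
    """Summarize pose-prediction accuracy from per-ligand RMSD values (Å).

    Returns the mean, the sample standard deviation, and the number and
    fraction of ligands whose RMSD is below `cutoff`.
    """
    n_correct = sum(1 for value in rmsd_values if value < cutoff)
    return {
        "mean": mean(rmsd_values),
        "sd": stdev(rmsd_values),
        "n_correct": n_correct,
        "fraction_correct": n_correct / len(rmsd_values),
    }


# Toy RMSD values (Å), for illustration only.
summary = summarize_rmsd([0.5, 1.5, 3.0, 1.0])
```

The binary count discards how far an incorrect pose is from the crystal structure, which is one reason the choice of measure remains debated.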

Table 1 Summary of the results for the 6 ligands in the Hsp90 system Stage 1
Table 2 Summary of the results for the 30 ligands in the MAP4K4 system Stage 1, and simulation of Stage 1 results for MAP4K4 taking into account additional conformations of Tyr36 (ligands with bad prediction only)

In practice, the best measure may depend on the particular problem that one is facing. For instance, a prediction that captures the main interactions is valid when dealing with a new chemotype, but inadequate at the lead optimization stage. Since our lab focuses on the hit identification stages of drug discovery, we are particularly interested in predicting the position of the central scaffold, i.e. the part of the ligand that forms the main interactions and defines the vectors of growth in the hit to lead stage. Thus, we have complemented the objective RMSD measure with a subjective binary classification telling if the prediction is sufficiently accurate to be used in the hit progression. In terms of RMSD, our average results were 1.6 ± 0.9 Å for Hsp90 (8th position among the participants of the D3R Grand Challenge 2015) and 3.7 ± 2.8 Å for MAP4K4 (3rd position). On the former set, we predict all but one ligand within 2.0 Å of RMSD. The only exception is ligand 44 (RMSD = 3.0 Å), but even then the position of the scaffold is correct and the deviation is due to the different orientation of a part of the ligand that does not engage in interactions with the protein (Figure S2).

The MAP4K4 results are much worse, but we still fared better than most participants, which highlights the difficulty of this set. Using the 2.0 Å RMSD cutoff, we only predicted 11 ligands correctly (37 %). In our subjective assessment, we predicted the position of the scaffold correctly for 18 ligands (60 %). The poor performance is due almost exclusively to the flexibility of the protein. As this is a key issue in molecular docking, it will be discussed in detail. On the positive side, our protocol was still capable of predicting the main interaction correctly for a majority of ligands. Worthy of note, the structure of ligand 32 was originally inverted (Fig. 4). Docking, but particularly Dynamic Undocking, argued strongly against this binding mode. After consultation with the crystallographers, our predicted binding mode was accepted as the proper binding mode. This is a reminder of the necessary dialogue between crystallographers and modelers, particularly where various binding modes are consistent with the observed electron density (e.g. due to tautomerism) [35].

Fig. 4

Structure of MAP4K4 ligand MAP32 in the disclosed crystallographic structure (a) and the alternative mode we proposed (b). Note the different tautomers with inverted methyl and hydroxyl groups: whereas in the crystallographic pose there is a clash between the methyl group and Glu106, in our proposed mode we found a well-structured hydrogen bond between the hydroxyl group and Glu106

Protein flexibility: the greatest docking challenge?

Reviewing the MAP4K4 cases where we failed to make a good prediction, we found that using a single receptor conformation was by far the most important factor. There is a large body of literature indicating the importance of protein flexibility [36–38], but back in 2005 we demonstrated that using multiple protein conformations could actually downgrade the results, particularly in virtual screening applications [19]. Since then, other authors have suggested that judicious selection of two or three structures can produce a small but systematic improvement over the best single structure [39–41]. However, as we did not have any previous knowledge on this system, we adopted the simple approach of using the biggest cavity (4U44), hoping that it would be valid for a larger proportion of ligands [42].

Once the experimental structures were disclosed, we observed that a large proportion of the ligands actually bind to a conformation where the cavity is partly occluded by the side-chain of Tyr36 in the P-loop (Fig. 5). In order to measure the impact of these effects, we ran the exact same experiments using as receptor structure 4OBO (Tyr36-IN), which has this alternative conformation. As shown in Table 2, most of the recalculated poses have an RMSD lower than the one we submitted to the D3R Grand Challenge 2015. Taking the best RMSD of the different binding modes, we obtain an average RMSD improvement of 1.1 Å (2.6 vs. 3.7 Å) with 18 ligands (60 %) below the 2.0 Å threshold and 23 ligands (77 %) with a correctly placed scaffold. While the results are still imperfect, one must consider that three structures are still insufficient to represent the whole array of conformational possibilities. In fact, we deem that there are only 2 ligands (7 %) for which the failure cannot be attributed to the conformation of the protein: ligands 4 and 17 do not form a hydrogen bond with the backbone of Cys108 (the hinge region) and are thus incompatible with our docking and dynamic undocking protocol. On the other hand, if the relative energies of the conformational states are not properly considered, using multiple structures may cause more problems than it solves [43]. In our opinion, except for direct experimental observation of the conformational states [44], empirical knowledge gained from detailed analysis of multiple crystallographic structures is, at present, the only practical solution to this problem.

Fig. 5

Comparison between the two MAP4K4 supplied starting structures 4OBO (blue) and 4U44 (orange). In the former structure the side-chain of Tyr36 in the P-Loop is facing inwards, reducing the cavity space available for ligand binding

This is indeed the case for Hsp90, a system that we have studied thoroughly. Here, we were able to predict not only the structure of the ligands, but also which conformation the protein would adopt upon ligand binding. This aspect was not evaluated in the D3R Grand Challenge 2015. Considering the importance of this issue, we suggest that it should be included as a measurement of success in future editions. As shown in Table 3, the RMSD of the residues lining the binding site was below 0.4 Å in all cases, and the change in backbone conformation induced by ligand 73 could be predicted based on the DUck calculations (vide supra).

Table 3 RMSD (Å) between binding site residues of submitted Hsp90 receptor structures and crystal structure

Virtual screening

For Stage 2 of the D3R Grand Challenge 2015, we were asked to predict the affinities or affinity rankings for 180 ligands in the Hsp90 system and 18 ligands in the MAP4K4 system. The tools developed and used in our group are geared towards virtual screening, where we aim to identify true ligands from huge libraries of chemical compounds. As such, our predictions are fast and qualitative and not well suited to predict binding affinities; instead, our goal was to produce a ranked list enriched with potent ligands in the top positions. For this reason, we only discuss the results in terms of virtual screening performance: area under the curve (AUC) of the Receiver Operating Characteristic (ROC) curve and Enrichment Factors (EF). This type of analysis could not be performed on the MAP4K4 set because 15 out of the 18 ligands were considered as active (IC50 < 1 µM) and the other three were in a close range (1.74, 2.25 and 10 µM). The Hsp90 set presented more dispersion: 40.6 % of ligands (73 out of 180) had an IC50 lower than 1 µM and are considered active; the remaining ligands are considered inactive even though 21.7 % (39 out of 180) have an IC50 between 1 and 10 µM. The fact that the inactive set contains molecules that are (a) true binders and (b) structurally very similar to the active ones makes this a very unusual and challenging test set. We encourage the organizers to include more standard virtual screening test sets in future editions of the challenge.
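For readers less familiar with these two metrics, the sketch below shows one common way to compute them from a ranked list of active/inactive labels. It is not the evaluation code used by the challenge organizers. The AUC is obtained by pair counting (the fraction of active/inactive pairs in which the active is ranked higher), the list is assumed to contain no ties, and the function names are hypothetical.

```python
def roc_auc(ranked_labels):
    """ROC AUC from labels sorted best-ranked first (True = active)."""
    n_active = sum(ranked_labels)
    n_inactive = len(ranked_labels) - n_active
    if n_active == 0 or n_inactive == 0:
        raise ValueError("need at least one active and one inactive")
    inactives_seen = 0
    misordered_pairs = 0
    for is_active in ranked_labels:
        if is_active:
            # Every inactive already seen is ranked above this active.
            misordered_pairs += inactives_seen
        else:
            inactives_seen += 1
    return 1.0 - misordered_pairs / (n_active * n_inactive)


def enrichment_factor(ranked_labels, fraction):
    """Hit rate in the top `fraction` of the list divided by the overall hit rate."""
    n_top = max(1, round(len(ranked_labels) * fraction))
    hit_rate_top = sum(ranked_labels[:n_top]) / n_top
    hit_rate_all = sum(ranked_labels) / len(ranked_labels)
    return hit_rate_top / hit_rate_all


# Toy ranked list, for illustration only.
labels = [True, True, False, True, False, False]
auc = roc_auc(labels)
ef_half = enrichment_factor(labels, 0.5)
```

Because the enrichment factor is bounded by the inverse of the fraction of actives, a set with 73 actives among 180 ligands cannot exceed an EF of about 2.5 (180/73), whatever the method.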

Our ranking protocol was based on an initial docking stage followed by DUck simulations of the top scoring pose. We combined the scores obtained from docking and DUck and, following visual inspection to check all the ligands and the corresponding rankings, the final position of 49 ligands (27 %) in the ranked list was manually modified. Visual inspection introduces a subjective step that is difficult to control, but is essential in real applications to correct some of the limitations of docking. In our case, we used it mostly to rescue compounds that were predicted as inactive because they had an incorrect binding mode (e.g. ligands binding to the helical conformation that could not fit in the docking cavity). Considering the qualitative nature of our approach, the ROC curve (Fig. 6) demonstrates a very good performance, as do the corresponding enrichment factors (Table 4). To assess the effect of consensus ranking and visual inspection, we also plotted the ROC curve that would be obtained after the first stage (docking) and without the visual inspection (Figure S3). The AUC was much better for the combined ranking (0.71 vs. 0.55) and the enrichments were also higher for the combined ranking. This was the best performance across participants in this metric. Unexpectedly, we also ranked well in terms of Spearman correlation (0.39). This was surprising because both docking and Dynamic Undocking are designed to discriminate between active and inactive compounds, rather than to obtain a quantitative assessment of their (relative) binding free energies. In part, this reflects our knowledge about this particular system, where we can anticipate from previous experience the conformational changes that take place in the protein and the ligand features that contribute to binding affinity. However, this correlation should not be considered a success, as it is likely insufficient to drive drug design. Instead, it indicates that ranking ligands using structure-based methods is particularly challenging. In fact, many ligands in the test set have analogues with published binding affinity and we anticipate that a purely ligand-based strategy might have provided very good results. We suggest that the performance of one such knowledge-based approach would be useful as a benchmark of the performance of all participants in the contest.

Fig. 6

ROC curves of the 180 ligands in the Hsp90 Stage 2 set. Ligands with an IC50 lower than 1 µM were considered as active. a Ranking according to rDock docking scores (AUC = 0.55). b Consensus ranking as submitted to the challenge (AUC = 0.71)

Table 4 Summary of statistics for Hsp90 system stage 2 results

Conclusions

Through our participation in the D3R Grand Challenge 2015, we have been able to validate the methods developed and used in our lab. We must emphasize that our main focus is virtual screening, an application that was not considered explicitly in the challenge. Binding mode prediction is an essential first step for any subsequent prediction, so we had a particular interest in this part of the challenge. Binding affinity prediction (or ligand ranking) is much more demanding than virtual screening, and we participated in this part of the challenge somewhat reluctantly, expecting a clear underperformance compared to free energy methods.

We used a combination of qualitative techniques that, together, have worked much better than any of them separately. Namely, we used rDock for molecular docking with pharmacophoric restraints and DUck, a new technique based on molecular dynamics. For Stage 1, we were able to correctly predict how the ligands bind, particularly the position of the central scaffold forming the main interactions with the protein: for Hsp90, 5 out of 6 ligands had an RMSD lower than 2 Å and 100 % of the scaffolds were correctly predicted; for MAP4K4, 11 out of 30 ligands had an RMSD lower than 2 Å and 60 % of the scaffolds were correctly predicted. These figures would have increased to 18 out of 30 ligands and 77 % of the scaffolds if a single additional conformation (Tyr46-IN) had been taken into account.

Retrospectively, we performed additional experiments to understand the failures, finding that protein flexibility was the major factor limiting the quality of the results. Predicting protein conformations is feasible, but increasing the number of conformations generally leads to decreased docking performance [19], and even when few conformations are considered, their relative energies must be taken into account to avoid artifacts [44]. This is a tall order that we have bypassed by employing previous knowledge about the system, which enabled us to predict the most likely receptor conformation for each Hsp90 ligand purely based on chemical structure. The fact that we did not have this information for MAP4K4 explains the difference in performance between the two systems. It should be possible to extract this type of knowledge automatically from existing crystal structures deposited in the PDB, but we are not aware of any tool capable of doing this task. Forcing certain interactions during the docking process is equally important because it corrects some of the limitations of the scoring functions. Fortunately, in this case, the main pharmacophoric points can be extracted easily and automatically with existing tools. In the absence of known ligands, binding hot spots can be identified from molecular simulations [45].
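
The 2 Å success criterion quoted above for binding mode prediction can be illustrated with a short Python sketch. It is a simplified example under stated assumptions and not the challenge's evaluation protocol: the predicted and crystallographic poses are assumed to share the same receptor frame (no superposition is performed), atoms are assumed to be listed in the same order, and no correction for ligand symmetry is applied. The function names and coordinates are hypothetical.

```python
import math
from typing import Sequence, Tuple

Coord = Tuple[float, float, float]


def pose_rmsd(predicted: Sequence[Coord], reference: Sequence[Coord]) -> float:
    """Root-mean-square deviation (in the units of the input, e.g. Å)
    between two poses with identical atom ordering. No fitting and no
    symmetry correction are performed (assumption of this sketch)."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("poses must be non-empty and have the same atom count")
    squared = 0.0
    for p, r in zip(predicted, reference):
        squared += sum((p[k] - r[k]) ** 2 for k in range(3))
    return math.sqrt(squared / len(predicted))


def success_rate(rmsds: Sequence[float], cutoff: float = 2.0) -> float:
    """Fraction of ligands whose RMSD is strictly below `cutoff`."""
    if not rmsds:
        raise ValueError("need at least one RMSD value")
    return sum(1 for r in rmsds if r < cutoff) / len(rmsds)


# Hypothetical two-atom ligand: the predicted pose is shifted by 1 Å along z.
crystal = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]
docked = [(0.0, 0.0, 1.0), (1.5, 0.0, 1.0)]
example_rmsd = pose_rmsd(docked, crystal)
```

With this definition, a result such as "11 out of 30 ligands below 2 Å" corresponds to `success_rate` returning 11/30 for the list of per-ligand RMSD values.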

In Stage 2, for Hsp90 we performed much better than expected considering the qualitative nature of our methods. The results were biased by our previous knowledge of this system, which had an important effect on the final performance, but this reflects the typical situation in drug discovery, where expert users combine tools and previous knowledge whenever possible. Our relative success highlights the challenges that free energy methods are still facing, but also indicates that there is a lot of potential in combining relatively simple structure-based tools with knowledge-based approaches. No doubt, machine learning will play an increasingly important role in the future, driven both by the growing body of public data [46, 47] and by major advances in the field [48, 49].

Finally, we have several suggestions to improve future editions of the challenge: the use of protein conformation prediction as a measure of success in binding mode prediction, the inclusion of a virtual screening prediction set, and the introduction of an automated ligand-based approach as a baseline for measuring the success of ligand ranking applications. We consider that all these aspects may improve what is already an extremely useful and necessary exercise.