Introduction

Major challenges in virtual screening are the inadequacy of scoring functions for evaluating the affinity of docked poses and the difficulty of predicting the ligand-induced flexibility observed in many important therapeutic targets [1–5]. To evaluate improvements in this area, the Drug Design Data Resource (D3R) developed the 2015 Grand Challenge, a community-wide experiment in which researchers around the world prospectively test docking and scoring methodologies against blinded data from two targets: heat shock protein 90 (HSP90), a protein that binds ligands through an induced-fit mechanism [6], i.e., the unbound or apo structure undergoes significant structural rearrangements upon ligand binding; and mitogen-activated protein kinase kinase kinase kinase 4 (MAP4K4), a kinase with a large pocket that includes sizable flexible loops [7].

The most commonly used scoring functions can be classified into three types, as summarized by Kitchen et al. [1]: force-field-based scoring (e.g., D-Score [8], G-Score [8], GOLD [9], AutoDock [10], DOCK [11], Glide [12], SIE [13]), empirical scoring (e.g., LUDI [14, 15], F-Score [16], ChemScore [17], SCORE [18], Fresno [19], X-SCORE [20], AutoDock Vina [21]), and knowledge-based scoring (e.g., DrugScore [22], SMoG [23]). In the 2010 Community Structure–Activity Resource (CSAR) Exercise, Carlson and collaborators analyzed the performance of different scoring functions on the CSAR-NRC data set [5, 24]. The results indicated that most of the scoring functions had comparable performance (R2 = 0.3–0.4) and that the best R2 was achieved by AutoDock and AutoDock Vina (R2 = 0.55) [5]. Despite the poor scoring performance, many docking methods did well in predicting poses within 2.0 Å of the crystal conformation [3, 5].

Over the last few years, the Camacho lab has steadily built novel platforms for drug discovery, from predictions of druggable sites [25] to pharmacophore-based interactive virtual screening technologies that search billion-compound libraries in seconds [26]. We also developed Smina [27], a version of AutoDock Vina specially optimized to support high-throughput minimization and scoring. Based on our current implementation in AnchorQuery [28], Smina can minimize 10,000 compounds into a fixed receptor in about 10 s (details will be published elsewhere), the same time scale required for docking a single compound to a flexible receptor [1]. More recently, we have shifted our attention to improving our virtual screening pipeline [26–28]. We participated in the 2013/14 CSAR challenge, which involved rank-ordering compounds against homology models of receptors built from a given protein primary sequence, identifying close-to-native bound conformations out of a set of decoy poses, and rank-ordering the affinity of sets of congeneric compounds to a given protein. Our predictions were among the best in the field [29, 30]. We showed that the most significant contribution to a meaningful enrichment of native-like models was the identification of the best receptor structure for docking and scoring. In particular, we showed that ranking a set of 31 congeneric compounds cross-docked to the tRNA (m1G37) methyltransferase (TRMD) structure with the largest pocket resulted in an impressive R2 = 0.67, whereas other receptor structures yielded R2 ≈ 0.

Here, we report our participation in the 2015 D3R Grand Challenge, where we performed a comprehensive analysis of different strategies for predicting docking poses and ranking affinities for two highly flexible targets: HSP90 and MAP4K4. These strategies included methods that utilize all available receptor/ligand co-crystals (“close”), all available ligands and a single holo-receptor structure (“min-cross”) and only a single receptor/ligand co-crystal (“cross”). As in the 2013/14 CSAR competition [29], we found that the method that predicted the best docking poses was not the same as the one that predicted the best ranking of active compounds. Similarly, different methods predicted the optimal ranking of active compounds for HSP90 and MAP4K4, i.e., “close” and “cross”, respectively. Inspection of the type of flexibility exhibited by each target, i.e., induced fit versus large flexible pocket, suggests guiding principles for selecting the optimal virtual screening strategy for flexible targets. We note that these findings are strongly supported by the fact that our prospective pose predictions and affinity rankings for HSP90 in the 2015 D3R Grand Challenge were the best in the community-wide experiment.

Methods

We tested the performance of five major methods (Fig. 1) on both pose and affinity predictions. Several variants of the methods were also applied to special cases, which will be discussed later in the specific challenges.

Fig. 1

“Close” and “cross” methods for affinity ranking and pose prediction. Align-close and dock-close methods minimize and dock to the “closest” receptor for each compound. Min-cross, align-cross and dock-cross methods minimize and dock to all available receptors and select “optimal” receptor based on available experimental data (see “Methods” section). This is shown in the figure by the greyed-out shapes in the “cross” methods that ultimately select one optimal receptor. Red blocks and arrows correspond to alignment, blue blocks and arrows correspond to minimization, green blocks and arrows correspond to docking

The methods used the following applications, which are freely available for academic research. Structure preparation: All receptor structures were superimposed using the “align” command in PyMOL 1.7 [31]. Conformer generation: For structural alignment, 20 conformers were generated using Omega2 [32] with default settings. Chemical similarity: Babel 2.3.2 [33] was used with fingerprint 3 (FP3) to identify the most similar or “closest” compound among known ligands. The co-crystal receptor corresponding to the “closest” compound is referred to as the “closest” receptor. Conformer alignment: Structural alignments were performed using Open3DALIGN 2.282 [34]. Minimization: Aligned conformers were minimized to a given receptor using Smina [27] with default settings. Docking: Compounds were docked with Smina using default parameters and the AutoDock Vina [21] scoring function. A reference compound was used to define the docking box. The Vina-predicted energy was used to select the best ranked docked pose.
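
The “closest” compound selection can be illustrated with a minimal sketch. It assumes each fingerprint is available as a set of on-bit indices and uses the Tanimoto coefficient as the similarity measure; the function names, ligand labels, and toy bit sets below are hypothetical and are not output of Babel or FP3.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient between two fingerprints given as sets of on-bit indices."""
    if not fp_a and not fp_b:
        return 0.0
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)


def closest_ligand(query_fp, known_fps):
    """Return (ligand_id, similarity) for the known ligand most similar to the query."""
    return max(
        ((lig_id, tanimoto(query_fp, fp)) for lig_id, fp in known_fps.items()),
        key=lambda pair: pair[1],
    )


# Toy fingerprints (hypothetical bit sets, for illustration only).
known = {"lig_A": {1, 4, 7, 9}, "lig_B": {2, 4, 8}, "lig_C": {3, 5, 8, 11}}
query = {2, 4, 8, 9}
best_id, best_sim = closest_ligand(query, known)
print(best_id, round(best_sim, 2))
```

In this toy case the query shares three of four distinct bits with lig_B, so the receptor co-crystallized with lig_B would play the role of the “closest” receptor.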

Align-close method

(a) Conformers were generated for each compound in the test set. (b) The “closest” compound among known bound ligands was identified. (c) Conformers were aligned to the “closest” compound. (d) Aligned conformers were minimized to the “closest” receptor. (e) The best Vina score was used to predict affinity for the compound.

Dock-close method

(a) The “closest” compound among known bound ligands was identified. (b) Compounds were docked to the “closest” receptor using “closest” ligand as reference to define docking box. (c) The best Vina score was used to predict affinity for the compound.
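
As an illustration of defining a docking box from a reference ligand, the sketch below builds an axis-aligned box around the reference atom coordinates. The 4 Å padding, the function name, and the toy coordinates are assumptions made for this example; the actual box definition used with Smina is not specified here.

```python
def docking_box(coords, padding=4.0):
    """Axis-aligned box around reference ligand atoms, enlarged by `padding` Å per side."""
    xs, ys, zs = zip(*coords)
    mins = (min(xs), min(ys), min(zs))
    maxs = (max(xs), max(ys), max(zs))
    # Box center is the midpoint of the ligand's extent along each axis.
    center = tuple((lo + hi) / 2 for lo, hi in zip(mins, maxs))
    # Box size is the ligand extent plus padding on both sides.
    size = tuple((hi - lo) + 2 * padding for lo, hi in zip(mins, maxs))
    return center, size


# Toy reference ligand coordinates in Å (hypothetical).
ref_atoms = [(0.0, 0.0, 0.0), (2.0, 1.0, -1.0), (4.0, 3.0, 1.0)]
center, size = docking_box(ref_atoms)
print(center, size)
```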

Min-cross method

(a) Conformers were generated for each compound. (b) The “closest” compound was identified. (c) Conformers were aligned to the “closest” compound. (d) The aligned conformers were minimized to all known bound receptors. (e) The best Vina score to each receptor was used to predict affinity. (f) The optimal receptor for virtual screening was selected (see below).

Align-cross method

(a) Conformers were generated for each compound. (b) Conformers were aligned to every known bound ligand. (c) Aligned conformers were minimized to the corresponding bound receptor. (d) The best Vina score among conformers was used to predict affinity to each receptor structure. (e) The optimal receptor for virtual screening was selected (see below).

Dock-cross method

(a) Compounds were docked to every known bound receptor using its bound ligand as reference. (b) The best Vina score to each receptor was used to predict affinity. (c) The optimal receptor for virtual screening was selected (see below).

These five methods can be grouped by receptor selection. The optimal receptor for “cross” methods (min-cross, align-cross and dock-cross) was chosen by comparing the Vina scores for each receptor with experimental data (IC50, see Supplementary Tables 1 and 2). We calculated Spearman’s rank correlation coefficient (Spearman ρ) and the coefficient of determination (R2) to select the optimal receptor, i.e., the one that performs best for affinity ranking in our training set. Similarly, we compared the best-scored poses for each receptor with the crystal poses to obtain the ligand root-mean-square deviation (RMSD), and computed the percentage of poses with an RMSD of less than 2 Å to select the optimal receptor for pose prediction. For the testing data, we used the best-performing receptor in the training data set to rank affinity and predict poses. For “close” methods (align-close and dock-close), there is no optimal receptor; instead, multiple receptor/ligand co-crystals are used for predictions.
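
The receptor selection for “cross” methods can be sketched as follows. This is a minimal illustration rather than the pipeline actually used: it computes Spearman ρ (with average ranks for ties) between the Vina scores obtained with each receptor and the experimental IC50 values, and picks the receptor with the highest ρ. Receptor names, scores, and IC50 values are invented for the example; R2 could be obtained analogously from the squared Pearson correlation.

```python
def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rho: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


def select_optimal_receptor(scores_by_receptor, ic50):
    """Pick the receptor whose Vina scores best rank-correlate with IC50.

    Lower (more negative) Vina scores and lower IC50 both mean stronger
    binding, so a good receptor gives a positive correlation.
    """
    rho = {rec: spearman(s, ic50) for rec, s in scores_by_receptor.items()}
    return max(rho, key=rho.get), rho


# Toy training data (hypothetical): IC50 in nM and Vina scores in kcal/mol.
ic50 = [10, 50, 200, 1000, 5000]
scores = {
    "rec_A": [-9.5, -9.0, -8.2, -7.5, -7.0],
    "rec_B": [-8.0, -9.1, -7.2, -8.8, -7.9],
}
best, rho = select_optimal_receptor(scores, ic50)
print(best, {r: round(v, 2) for r, v in rho.items()})
```

Because Spearman ρ depends only on ranks, it is unaffected by whether IC50 or its logarithm is used; this would not hold for R2.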

Results

HSP90 challenge

Challenge

(1) Predict the binding modes of six HSP90 compounds. (2) Predict the affinity ranking of P = 180 HSP90 compounds; within this set, 33 unidentified compounds were said to show no inhibition. (3) Predict the relative/absolute free energy of three small sets of compounds. Analyses of the 180 compounds show that they all fall into three chemical scaffolds (aminopyrimidines, benzimidazolones and benzophenone-like compounds; Fig. 2c–e, where upper panels show scaffolds and lower panels show examples). Two unpublished structures, 4YKR and 4YKY, were provided as examples of benzimidazolone and benzophenone-like compound binding, respectively.

Fig. 2

HSP90 adopts at least four ligand-induced binding modes. a Four conformations of the HSP90 ligand-induced binding pocket based on the nearby adaptive loop (L2, between H4 and H5 [43]): close (2WI5), helix (4EFU), open (3RLR), half-close (3B28) (white cartoon: HSP90, red cartoon: flexible loop, orange sticks: small molecules). b Four waters in the binding pocket labeled from 1 to 4 (white cartoon: HSP90, red sphere: water molecules). c Aminopyrimidine scaffold and compound (2XDX). d Benzimidazolone scaffold and compound (4YKR). e Benzophenone-like scaffold and compound (4YKY). f Histogram of binding modes among the N = 181 known co-crystal structures and the I = 69 structures with IC50 data. (N number of co-crystals, I number of co-crystals with IC50 data). g Histogram of the conservation frequency of the water molecules in Fig. 2b shows that three crystal waters are 100 % conserved

Binding pocket analysis

There are N = 179 PDB structures plus two unpublished structures of HSP90 bound to small molecules, with I = 69 of them having known IC50 values (from BindingDB [35], BindingMOAD [36] and PDBBind [37], Supplemental Table 1). We superimposed all the known receptors onto the receptor structure in 4YKR. Interestingly, a distal loop (L2 between H4 and H5, Fig. 2a) adapts its conformation to the bound ligand. All co-crystal structures can be grouped into four distinct conformations based on this adaptive loop (red cartoon in Fig. 2a): close, helix, open and half-close (a conformation between open and close). The histograms of these binding modes in the whole dataset and in the sub-dataset with IC50 data are shown in Fig. 2f. The core binding pocket is quite rigid and stable, and four crystal water molecules are observed to participate in ligand binding (Fig. 2b). Three waters are highly conserved despite the different adaptive loop conformations (Fig. 2g). These analyses suggest that the ligand-binding pocket of HSP90 consists of a rigid core with a conserved water-mediated interaction network and a ligand-dependent adaptive loop. Therefore, when preparing models for docking and alignment/minimization, we kept the conserved water molecules as part of the receptors.

Methods

We applied the five methods listed in Fig. 1 (i.e., align-close, dock-close, min-cross, align-cross, and dock-cross) for both pose prediction and affinity ranking. For affinity ranking, we also devised several variations of these methods as potential improvements. (a) min-cross-scaffold and align-close-scaffold methods: Given the limited set of scaffolds that presumably capture the core ligand interactions, for the min-cross and align-close methods we aligned the test compounds to the three scaffolds shown in Fig. 2c–e (see, e.g., Fig. 3b) instead of the chemically “closest” compounds as in Fig. 3a. (b) min-cross-pose and align-close-pose methods: Instead of using ligand structures from co-crystals as templates, we used the poses predicted by the “close” methods as templates for alignment in the min-cross and align-close methods (see, e.g., Fig. 3c). (c) dock-close-filter and align-close-filter: We also used the aforementioned predicted poses to manually select inactive compounds in the testing set. We then overruled the Vina score and moved this set of compounds to the bottom of the affinity ranking for the two methods that performed best in the training set. (d) HSP90 score 1–4: We used machine learning and forward selection methodologies to develop four HSP90-specific scoring functions from the set of energy terms available in Smina [27] (see Supplemental Table 2 for the selected parameters and weights). A training dataset was constructed by cross-docking the I = 69 compounds with published IC50 data to crystal structure 4EFU (optimal receptor for the dock-cross method) with the default Smina settings. HSP90 score 1 and 2 were trained on active compounds (measured by Spearman ρ), while HSP90 score 3 and 4 were trained to maximize the discrimination of active versus decoy compounds, which were obtained from the HSP90 dataset in the DUD-E database [38] (measured by AUC). (e) 3DQSAR-align-pose and 3DQSAR-dock-pose: The relatively large amount of binding data made quantitative structure–activity relationship (QSAR) modeling possible. Using Open3DQSAR 2.3 [39], we trained 3DQSAR models with the 69 HSP90 structures with IC50 data. We applied the trained models to the poses predicted for the testing set by the “close” methods.
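
The “-filter” variants in (c) amount to a simple re-ranking step, sketched below under the assumption that more negative Vina scores rank higher. The function name, compound labels, and scores are hypothetical.

```python
def rank_with_filter(vina_scores, flagged_inactive):
    """Rank compounds by Vina score, then move flagged compounds to the bottom.

    `vina_scores` maps compound id -> score (more negative = predicted stronger).
    `flagged_inactive` is the set of compounds judged inactive by inspection.
    """
    ranked = sorted(vina_scores, key=vina_scores.get)  # most negative first
    kept = [c for c in ranked if c not in flagged_inactive]
    demoted = [c for c in ranked if c in flagged_inactive]
    return kept + demoted


# Toy scores in kcal/mol (hypothetical).
scores = {"cmpd_1": -9.2, "cmpd_2": -10.1, "cmpd_3": -8.4, "cmpd_4": -9.8}
ranking = rank_with_filter(scores, flagged_inactive={"cmpd_2"})
print(ranking)
```

Here cmpd_2 has the best Vina score but is overruled and placed last, while the relative order of the remaining compounds is preserved.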

Fig. 3

“Close” methods predicted high-accuracy poses for six HSP90 compounds. a–c Examples of different alignment methods in the HSP90 challenge. a A compound is aligned to the “closest” compound. b A compound is aligned to one of the three scaffolds. c A compound is aligned to the predicted pose. d Performance of pose prediction using different methods in the training set. (N number of co-crystals) e Lysine 58 from another HSP90 stabilizes the conformation of the extending functional group of HSP90-44. (white and red stand for two different HSP90 monomers. sticks: HSP90-44; meshes: HSP90; lines: lysine 58 of HSP90; black dash: hydrogen bonds). f–h Best predicted poses are aligned to the co-crystal structures. f HSP90-40 is an example of aminopyrimidines (RMSD 0.14 Å). g HSP90-73 is an example of benzimidazolones (RMSD 0.28 Å). h HSP90-175 is an example of benzophenone-like compounds (RMSD 0.27 Å). (white sticks: crystal pose of the compounds; magenta sticks: best pose prediction of the compounds; white meshes: HSP90 surface; grey lines: HSP90 residues close to the binding groove)

Phase 1: Pose prediction results

A retrospective study of known ligands demonstrated that the dock-close and align-close methods predicted the most accurate poses. For the analysis shown in Fig. 3d, the co-crystal of each ligand was first removed from the dataset, and poses were then predicted based on the remaining co-crystal structures in the training set. Given the large dataset of available co-crystal structures, our results reflect the empirical observation that crystallographic information is superior to any computational model. Hence, we were able to predict high-accuracy poses for all six testing compounds. We took the top five poses predicted by “close” methods (sorted by Vina score), and submitted the best models. The mean RMSDs for the first ranked and best poses were 0.46 and 0.32 Å, respectively. Figure 3f–h shows an example of the best-predicted poses for each scaffold. The predicted pose for HSP90-44 had a flexible group sticking out of the binding site. We used molecular dynamics to predict the most likely conformation, yet the co-crystal shows that this group is stabilized by Lys58 from the second HSP90 monomer in the dimer structure (Fig. 3e). Our results demonstrate that, when structural data are available, “close” methods are significantly better in pose prediction than “cross” methods, while dock-cross has an upper limit of about 50 % success rate using a single receptor structure.
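
The pose-accuracy measures used above can be sketched as follows. This minimal example assumes that predicted and crystal poses are already in the same receptor frame and list atoms in the same order; it does not handle symmetry-equivalent atoms, and the coordinates and RMSD values are toy data.

```python
import math


def ligand_rmsd(pred, ref):
    """RMSD in Å between two poses with identical atom ordering (no refitting)."""
    if len(pred) != len(ref):
        raise ValueError("poses must have the same number of atoms")
    sq = sum((p - r) ** 2 for a, b in zip(pred, ref) for p, r in zip(a, b))
    return math.sqrt(sq / len(pred))


def success_rate(rmsds, cutoff=2.0):
    """Fraction of poses with RMSD strictly below `cutoff` Å."""
    return sum(r < cutoff for r in rmsds) / len(rmsds)


# Toy poses (hypothetical): every atom is displaced by 1 Å along z.
crystal = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
predicted = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
rmsd = ligand_rmsd(predicted, crystal)
rate = success_rate([0.5, 1.9, 2.0, 3.5])
print(rmsd, rate)
```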

Phase 2: Affinity prediction results

The results of our predictions are summarized in Table 1. The dock-close (Spearman ρ = 0.42, R2 = 0.26) and align-close (Spearman ρ = 0.45, R2 = 0.24) methods have the best performance. The relative performance of the five methods is consistent between the training set and our submitted predictions (Fig. 4a). An interesting question is whether, for the “cross” methods, we were able to predict the optimal receptors. Our R2 analysis correctly predicted an open structure (Fig. 2a) as the optimal receptor. However, in retrospect, we found that other open structures were marginally better; see Testing (best) in Fig. 4a. Thus, a receptor can only be assumed to be “optimal” with respect to the data available. Overall, the relatively similar outcomes of “close” and “cross” methods suggest that our scoring function cannot account for the change in free energy associated with different receptor structures, and therefore ranking ligands to induced-fit targets is still limited.

Table 1 Affinity ranking prediction results of HSP90 challenge
Fig. 4

“Close” methods have better performance than “cross” methods for affinity ranking in the HSP90 challenge. a Prediction rates on: training set, testing set submitted prospectively, and testing (best) set reassessed retrospectively. Optimal receptors for align-cross, min-cross and dock-cross were (prospectively) 3OWD, 4BQJ, 3K98 and (retrospectively) 3T10, 3RLP, 3OWD, respectively. N number of co-crystals, I number of co-crystals with IC50 data, P number of compounds for prediction. b Results of variant methods: aligning to scaffold, to predicted pose, and using human expertise to eliminate non-binders. c Distinguishing active from 33 inactive compounds using general methods, human discrimination, 3DQSAR, and special purpose scoring functions to discriminate HSP90 ligands. The lower panel shows binding/non-binding AUC performances, and the upper panel shows the corresponding affinity ranking. d–f Examples of binding poses of inactive compounds. d Co-crystal of inactive compound 176 (4YKY). e Co-crystal from PDB 3B26 (unknown IC50). f Predicted pose for compound 110 (inactive)

Alignment is an area that can be improved particularly for large and/or low similarity compounds. Thus, we developed two variants of the above methods to test different structural alignments. First, we surmised that aligning to the scaffold would lead to better core interactions (Fig. 2c–e). Although this was the case in our training data set, the opposite was observed for “-scaffold” methods in the testing set (Fig. 4b). In retrospect, we found that our method was able to improve some bad alignments, but it also eliminated some good ones. The latter was particularly true for benzophenone-like compounds whose structures are quite diverse. Second, using our models for the testing set as “predicted closest” templates also failed to improve the affinity ranking, observing only a minor “-pose” improvement for min-cross in the training set (Fig. 4b). The failure may come from the inadequacy of the force field to smoothly remove clashes upon minimization. In summary, neither aligning to scaffolds nor to predicted poses improved affinity ranking relative to aligning to the “closest” compound.

As a control of blind versus human predictions, we visually inspected all dock-close and align-close poses and predicted whether they were binders/active or non-binders/inactive (“-filter” methods in Table 1). Somewhat humblingly, the blind methods performed better than the subjective human-filtered scores. In hindsight, one problem is that a compound may bind and still be deemed inactive. For instance, HSP90-176 and HSP90-110 are both inactive compounds (Fig. 4d–f), yet compound 176 binds HSP90 (4YKY). The same applies to our binding model for HSP90-110 (Fig. 4f), which is based on a highly similar co-crystal (3B26 in Fig. 4e).

The 3DQSAR models performed poorly in affinity ranking (Table 1; Fig. 4c). The major reason seems to be that none of the 69 compounds with IC50 data contained the aminopyrimidine scaffold. Therefore, when the models and functions were applied to the testing set, they did poorly at scoring aminopyrimidine compounds.

HSP90 score 1 and 2 were trained to better rank active compounds, and their predicted ranking was similar to that of other “cross” methods. However, these scoring functions showed a meaningful improvement in the discrimination between actives and inactives (Fig. 4c). HSP90 score 3 and 4, on the other hand, were designed solely to distinguish actives from inactives. As expected, these methods performed poorly in affinity ranking. However, training on inactive compounds from the DUD-E database did not improve the discrimination of active compounds. In hindsight, we realized that the inactive compounds in the testing set had different scaffolds than the DUD-E decoy compounds. Thus, in all likelihood the observed discrimination might be close to random. These results show how dangerous it is to evaluate machine learning scoring functions without rigorous benchmarking. Overall, these results indicate that target-specific scoring functions and 3DQSAR models can do better at distinguishing active from inactive compounds than the default Vina scoring function used in the methods in Fig. 1.
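
For reference, the AUC used to measure active/inactive discrimination can be computed directly from its rank-based definition: the probability that a randomly chosen active receives a better score than a randomly chosen inactive. The sketch below assumes that lower scores are better, as with Vina energies; the scores are toy values.

```python
def roc_auc(active_scores, inactive_scores):
    """Probability that a random active scores better (lower) than a random inactive.

    Ties count as half a win, which matches the area under the ROC curve.
    """
    wins = 0.0
    for a in active_scores:
        for d in inactive_scores:
            if a < d:
                wins += 1.0
            elif a == d:
                wins += 0.5
    return wins / (len(active_scores) * len(inactive_scores))


# Toy scores in kcal/mol (hypothetical).
actives = [-9.5, -8.7, -8.1]
inactives = [-8.9, -7.4]
auc = roc_auc(actives, inactives)
print(round(auc, 3))
```

An AUC of 0.5 corresponds to random discrimination, which is the baseline against which the values in Fig. 4c should be read.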

MAP4K4 challenge

Challenge

(1) Predict the binding modes of P = 30 MAP4K4 compounds. (2) Predict the affinity ranking of P = 18 MAP4K4 compounds. The structural data available for MAP4K4 were fairly limited. Only N = 8 published co-crystal structures of MAP4K4 bound to small molecules were found in the PDB, and all of them (I = 8) had IC50 data (from BindingDB [35] and the literature [40, 41]). MAP4K4 is a kinase with a large ATP binding pocket surrounded by several flexible regions (Fig. 5a): residues 171–190, residues 30–39 (P loop) and residues 60–75 (αC). Residues 171–190 form a disordered region that is not resolved in most co-crystals [42]. The P loop links two β-sheets, β1 and β2, in the binding groove, and acts as a lid that adjusts its conformation upon ligand binding. Helix αC is also flexible, and the loop between αC and β3 is sometimes not resolved.

Fig. 5

Align-close predicted the best models for 30 MAP4K4 compounds with a mean RMSD of 1.6 Å. a Flexible regions around the MAP4K4 binding pocket adapt to different conformations upon ligand binding. (left panel: 4OBO, right panel: 4U44, white cartoon: MAP4K4, red cartoon: flexible loop/helix, orange sticks: small molecules). b Pose prediction performance across different methods in the training set. c, d Alignment of our best-predicted pose with the co-crystal structure. MAP-14 is an example of aligning to a compound from MAP4K4 (RMSD 0.67 Å). MAP-02 is an example of aligning to a compound from another kinase (CHK1, 4QYH) (RMSD 0.79 Å). e, f Two cases with poor pose predictions: MAP-17 and MAP-20. (white sticks: crystal; magenta sticks: predicted; white meshes: MAP4K4 surface; grey lines: MAP4K4 residues close to the binding groove)

Methods

We first applied the five methods described in Fig. 1 for pose prediction and affinity ranking. The scaffolds of the testing compounds were quite different from the eight published ligands. Thus, we enriched our structural sampling by collecting an additional eight co-crystal structures of related kinases (Supplemental Table 1), which we only used to improve the alignment of compounds in the MAP4K4 binding pocket. Otherwise, the methods are the same as for HSP90. For affinity ranking, we also tested: (1) Min-cross-pose and align-close-pose (see description in HSP90 challenge methods); (2) Co-crystal-min: Since for Phase 2 we learned the co-crystal structures of all the testing compounds, we ranked these ligands using Vina scores after minimization; (3) Interaction-based ranking: Since the Vina scoring function does not have a directional hydrogen-bond term, we generated an alternative ranking by counting the number of intermolecular hydrogen bonds between the compound and the receptor.
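
A hydrogen-bond count of the kind used for interaction-based ranking could look like the sketch below. The criteria actually used are not stated here, so this example makes explicit assumptions: polar atoms are pre-labeled as “donor” or “acceptor”, a bond is counted for any complementary pair within 3.5 Å, and no angular (directional) check is applied. All names and coordinates are hypothetical.

```python
import math


def count_hbonds(ligand_polar, receptor_polar, max_dist=3.5):
    """Count ligand/receptor polar-atom pairs that could form a hydrogen bond.

    Each atom is a (role, (x, y, z)) tuple with role "donor" or "acceptor".
    A pair counts when the roles are complementary and the atoms lie within
    `max_dist` Å. This is a distance-only criterion (an assumption).
    """
    count = 0
    for l_role, l_xyz in ligand_polar:
        for r_role, r_xyz in receptor_polar:
            complementary = {l_role, r_role} == {"donor", "acceptor"}
            if complementary and math.dist(l_xyz, r_xyz) <= max_dist:
                count += 1
    return count


# Toy polar atoms in Å (hypothetical).
ligand = [("donor", (0.0, 0.0, 0.0)), ("acceptor", (5.0, 0.0, 0.0))]
receptor = [
    ("acceptor", (2.9, 0.0, 0.0)),   # 2.9 Å from the ligand donor
    ("donor", (5.0, 3.0, 0.0)),      # 3.0 Å from the ligand acceptor
    ("donor", (10.0, 0.0, 0.0)),     # too far from either ligand atom
]
n_hbonds = count_hbonds(ligand, receptor)
print(n_hbonds)
```

Compounds would then be ranked by decreasing count; a stricter variant would add a donor-hydrogen-acceptor angle cutoff.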

Phase 1: Pose prediction results

Analysis of the training set showed that only the align-close method predicted top quality models (Fig. 5b). Contrary to HSP90, where dock-close performed very well, docking to the large kinase pocket performed very poorly. However, expanding the set of ligands to those bound to homologous kinases allowed us to find related compounds for 27 of the 30 testing compounds. Using align-close for pose prediction then resulted in a mean RMSD of 2.6 Å for the first predicted poses and 1.6 Å for the best poses. Figure 5c, d shows successful examples of pose predictions with references in the MAP4K4 training set or in other kinases (CHK1). Two particularly bad predictions were MAP-17 and MAP-20. For MAP-17, we predicted a binding mode similar to MAP-12 and MAP-13; however, the co-crystal showed that the compound is not as deeply buried (Fig. 5e). For MAP-20, the reference structure we used was 3FV8 from JNK3. Here, the co-crystal revealed a pose that is rotated 180° from our predicted pose (Fig. 5f). Without these two incorrect predictions, our mean RMSD was 1.1 Å. The above notwithstanding, our results produced the best overall models of D3R, and our method proved to be a robust approach for pose prediction of kinases even when the available data are limited.

Phase 2: Affinity prediction results

The ranking predictions of the 18 MAP4K4 compounds are summarized in Table 2. Our best submitted predictions were obtained using the min-cross and min-cross-pose methods (Spearman ρ = 0.41, R2 = 0.28). Based on the IC50 data, we predicted 4OBP as the optimal receptor for all “cross” methods, achieving a remarkable Spearman ρ = 0.8 in the training set. Of course, this prediction lacked statistical significance due to the small number of IC50 values. For Phase 2, we were given the co-crystal structures of all the compounds in the testing set, but we still selected the same optimal receptor as in Phase 1. Hence, as shown in Table 2, results from Phase 1 and Phase 2 did not show significant differences for most of the methods. The two methods that improved were the min-cross-pose and interaction-based methods, in which predicted poses were replaced by co-crystal poses. However, it is important to note that our retrospective analysis shows that with more affinity data our approach could have improved the affinity ranking (see Testing best in Fig. 6a). In fact, just choosing MAP23 as our optimal receptor would have resulted in a Spearman ρ = 0.57, comparable to the best prediction for this target in the D3R challenge.

Table 2 Affinity ranking prediction results of MAP4K4 challenge
Fig. 6

“Cross” methods perform better than “close” methods for affinity ranking of MAP4K4 ligands. a Five methods in training set, submitted testing set, and retrospective best predictions for testing set. Optimal receptors for align-cross, min-cross and dock-cross were (prospectively) 4OBP, 4OBP, MAP03 and (retrospectively) MAP29, MAP16, 4U45, respectively. Overall, min-cross and align-cross performed better in our submitted predictions. N number of co-crystals, I number of co-crystal with IC50 data, P number of compounds for prediction. b Comparison of several pose related methods. (Spearman ρ and R2 are generated by comparing the Vina scores from different methods and experimental IC50 data)

Several variations of the main methods resemble “close” methods, which overall did not perform as well as some “cross” methods. Figure 6b shows that min-cross-pose and align-close-pose, which in Phase 2 aligned compounds to the crystal poses as opposed to the “closest” poses, had almost the same Spearman outcomes. Consistent with our training data predicting “cross” methods to be superior to “close” methods, the co-crystal-min method, which combined scores from different receptors, performed poorly. The interaction-based method (Table 2; Fig. 6b) showed that hydrogen bonding is important in determining the affinity of the compounds, highlighting a limitation of the Vina scoring function.

Discussion

The Drug Design Data Resource (D3R) 2015 Grand Challenge provided a unique platform for the prospective validation of new methods addressing two of the fundamental challenges in computer-aided drug discovery: predicting the pose of compounds bound to a non-cognate receptor (the cross-docking problem) and ranking series of congeneric compounds based on their affinity (the scoring problem). For the last several decades, the field has worked to develop solutions to these problems and many advances have been made. However, most evaluations of docking and scoring are retrospective wherein the correct answers (poses or affinities) are already known [3]. This retrospective analysis can lead to an unintentional bias to overestimate the performance of the method as the persons developing the methods may notice problems that can lead them to the circumstantial corrections of flaws in the workflow. Unfortunately, this does not reflect the real world use-case of docking and scoring methods where the correct answers are unknown until tested experimentally. This highlights the importance of prospective validation of methods as the true test of their performance since after the submission of the predictions, no further tweaking of parameters or workflows can take place.

The 2015 Grand Challenge featured two challenging and therapeutically relevant targets: HSP90 and MAP4K4. Here we presented five methods for pose prediction and affinity ranking that we employed in the challenge: align-close, dock-close, min-cross, align-cross and dock-cross. As reported, our approach of using “close” methods for pose prediction, in which all available crystallographic information (mostly co-crystals) is used, yielded the most accurate poses in the community-wide experiment for both targets. Consistent with our earlier work [29], align-close was particularly robust for both flexible targets. Although dock-close predicted better-docked poses in the rather tight pocket of HSP90, the same method failed to predict good poses in the much larger binding site of MAP4K4. The reason is that docking to a large pocket naturally relies more on scoring to select among several “reasonable” poses.

Affinity ranking is still a major challenge in drug discovery, where sometimes not even the co-crystal structure helps much. Part of the problem is that scoring functions are often optimized for virtual screening, whereas calculating the true binding free energy is a multidimensional problem that involves different interaction energies, water molecules, polarization effects, conformational changes and dynamics in the two components. The methods that we present here are aimed at optimizing virtual screening technologies. As such, we keep the receptor structure fixed for docking and/or minimization. With this constraint, we find that “cross” methods had more robust predictions than “close” methods, stressing the limitations of scoring across different receptor structures. Interestingly, for HSP90, dock-close had slightly better predictions than dock-cross. Analyses of the HSP90 bound structures suggest that dock-close performed better because the induced-fit binding process modifies the distal helix of the binding pocket but keeps the core motif almost intact (Fig. 2a), providing a better normalization between receptors. This is very different from MAP4K4, where most of the binding pocket is malleable (Fig. 5a), and the binding energy associated with those differences varies greatly.

There is still significant room for improvement. For instance, our methods use the Vina scoring function, which represents the state of the art in open source scoring functions but is still far from perfect [5]. Additionally, it is well known that experimental data coming from different assays are not well normalized. The above notwithstanding, compared with 3D-QSAR and machine learning algorithms, the structure-based methods sketched in Fig. 1 were significantly more reliable with limited data (Table 1). From a methodology point of view, we have shown that “close” methods show great power to predict poses. For affinity ranking, the choice of method seems to be more dependent on the binding pocket features. Consistent with MAP4K4, another kinase from the CSAR 2013/14 competition, SYK, had “cross” methods as the best ranking method [29], whereas for TRMD, a target with a rigid and buried pocket perhaps resembling HSP90, dock-cross performed the best [29]. These results suggest a preference of method selection for different pocket types, but the specific features that dominate the selection of the optimal method remain to be determined. In the 2013/14 CSAR competitions, we suggested that the structure with the largest binding pocket should have the best performance [29]. Similar analyses have been applied to HSP90 and MAP4K4, but no strong correlations were observed. Further research is needed to understand the conditions under which an optimal receptor, or set of receptors, can be selected to predict the best ranking of compounds.

In presenting the results of our participation in the D3R 2015 Grand Challenge, we have validated five major methods for pose prediction, docking and scoring. These methods prospectively predicted the overall best poses for both flexible targets and best affinity ranking (Spearman ρ) for HSP90. Furthermore, these general methods can be adapted and applied in a number of different situations by modifying the scoring functions or docking protocols, improving the outcome of virtual screening experiments.