Introduction

Molecular docking is generally cast as the problem of finding the low-energy binding modes of a ligand, based on the “lock and key” mechanism [1], within the active site of a receptor whose structure is known [2]. It plays an important role in drug design, as demonstrated by the vast literature [3–10] devoted to optimization methods for molecular docking since the pioneering work of Kuntz et al. [11]. Protein–ligand docking (PLD) for drug molecular optimization design is an ideal approach to virtual screening, i.e., searching large sets of compounds for new lead structures. A fundamental problem in molecular docking is that the design space is very large and grows combinatorially with the number of degrees of freedom of the interacting molecules. The ligand–receptor interaction energy cannot be computed at all possible docking configurations in a reasonable amount of computing time. For example, a typical protein receptor might occupy a cube of some 60 Å per side; even with a moderate translational resolution of 1 Å, this leaves 216,000 translations to search. For a rotational resolution of 20° about each axis, a ligand with 35 atoms, and a protein receptor with 3,500 atoms, 1.5 × 10^14 pairwise nonbonding evaluations are needed to scan the complete range of possible docking configurations. Even if 99% of the points can be eliminated by employing various assumptions, 1.5 × 10^12 evaluations still remain. If billions of compounds are to be screened in this way, the required computational power becomes the limiting factor. Therefore, simpler and more efficient methods are continuously being sought.

In this paper, an entropy-based optimization model is constructed that yields the narrowing coefficients of the search space for multi-population evolution very easily. A new iteration scheme, combining a multi-population genetic strategy with an entropy-based searching technique, is then developed to search for the optimal molecular orientation and conformation. An elitist maintaining strategy and an efficient convergence rule are used to close in on the global solution, and the contracted search space is employed as the convergence criterion instead of the number of generations used in most genetic algorithms, so that docking time is dramatically decreased. Furthermore, a novel adaptive strategy is employed: the probabilities of the crossover and mutation operators are optimized as additional design variables in the evolution process. These strategies speed up the optimization process and ensure rapid, steady convergence.

In order to evaluate the new docking method, we conducted numerical experiments on 134 protein–ligand complexes from the publicly available GOLD test set [12]. Comparisons with Glide [13], GOLD [12], FlexX [14], and Surflex [15] indicate that the docking accuracy of our method is comparable to these methods, while its computational efficiency is significantly improved. Docking time is approximately proportional to the number of rotatable bonds of the ligand.

Methodology

Optimization design model of molecular docking

The optimization problem for molecular docking can be written as follows

$$ \begin{array}{ll} \min & f({\mathbf{d}}) \\ {\text{s.t.}} &\quad g_{j} ({\mathbf{d}}) \le 0, \quad j = 1,2, \ldots, q \end{array} $$
(1)

where \( {\mathbf{d}} = \left\{ {T_{x} ,T_{y} ,T_{z} ,R_{x} ,R_{y} ,R_{z} ,T_{b1} , \ldots ,T_{bn} } \right\}^{T} \) is a vector of q (q = n + 6) design variables; \( T_{b1} , \ldots ,T_{bn} \) are the torsion angles of the rotatable bonds for flexible-ligand docking, and \( T_{x} ,T_{y} ,T_{z} ,R_{x} ,R_{y} ,R_{z} \) are the position coordinates and rotational angles of the anchor for the matching-based orientation search. The objective function \( f({\mathbf{d}}) \) is the intermolecular interaction energy

$$ f\left( {\mathbf{d}} \right) = \sum\limits_{i = 1}^{{n_{lig} }} {\sum\limits_{j = 1}^{{n_{rec} }} {\left( {\frac{{A_{ij} }}{{r_{ij}^{a} }} - \frac{{B_{ij} }}{{r_{ij}^{b} }} + 332.0\tfrac{{q_{i} q_{j} }}{{Dr_{ij} }}} \right)} } $$
(2)

where each term is a double sum over the ligand atoms i and the receptor atoms j; \( r_{ij} \) is the distance between atom i in the ligand and atom j in the receptor; \( A_{ij} ,B_{ij} \) are the van der Waals repulsion and attraction parameters; a, b are the van der Waals repulsion and attraction exponents; \( q_{i} ,q_{j} \) are the point charges on atoms i and j; D is the dielectric function; and 332.0 is a factor converting the electrostatic energy into kilocalories per mole. The constraints \( g_{j} ({\mathbf{d}}) \) represent the size limits of the design variables, together with certain behavior constraints of the molecule, as shown below:

$$ \left\{ \begin{array}{l} \underline{T_{x}} \le T_{x} \le \overline{T_{x}} \\ \underline{T_{y}} \le T_{y} \le \overline{T_{y}} \\ \underline{T_{z}} \le T_{z} \le \overline{T_{z}} \\ - \pi \le angle \le \pi , \quad angle = R_{x} ,R_{y} ,R_{z} ,T_{b1} , \ldots ,T_{bn} \end{array} \right. $$
(3)

In the protein–ligand docking process, the binding free energy is a function of the Cartesian coordinates of the ligand atoms only. The Cartesian coordinates of all ligand atoms can be determined by solving the optimization problem (1). This indicates that the optimal conformation of a flexible ligand is determined by the translational (\( T_{x} ,T_{y} ,T_{z} \)), rotational (\( R_{x} ,R_{y} ,R_{z} \)) and torsional (\( T_{bi} \), i = 1, 2, …, n, where n is the number of torsion bonds) motions. The former variables, which account for the six degrees of freedom of a rigid body, can also be interpreted as the orientation of the ligand; \( T_{bi} \) is the angle of the ith flexible bond. Since the movement of the ligand should be limited to a pocket confined to the active site of the receptor, the design subspace of (\( T_{x} ,T_{y} ,T_{z} \)) is defined as a cuboid circumscribed about the pocket. \( (\underline{T_{x}} ,\underline{T_{y}} ,\underline{T_{z}} ) \) and \( (\overline{T_{x}} ,\overline{T_{y}} ,\overline{T_{z}} ) \) are the minimum and maximum Cartesian coordinates of the circumscribed cuboid. The defined design subspace is larger than the pocket; this not only ensures that the ligand can move freely within the binding pocket, but also cuts down on computational cost by avoiding the complexity of resolving the actual pocket boundary. The remaining variables are allowed to vary between −π and π rad.
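As a concrete illustration, the energy of Eq. 2 can be evaluated directly once coordinates, van der Waals parameters and charges are given. The sketch below uses invented toy values (geometry, \( A_{ij} \), \( B_{ij} \), charges, dielectric), not parameters of a real force field:

```python
import numpy as np

# Sketch of Eq. 2: 12-6 Lennard-Jones plus Coulomb interaction energy.
# All numbers below (coordinates, A_ij, B_ij, charges, D) are invented
# illustrative values, not a real force field parameterization.
def interaction_energy(lig_xyz, rec_xyz, A, B, q_lig, q_rec, a=12, b=6, D=4.0):
    diff = lig_xyz[:, None, :] - rec_xyz[None, :, :]   # ligand x receptor pairs
    r = np.sqrt((diff ** 2).sum(axis=-1))              # distances r_ij
    vdw = A / r ** a - B / r ** b                      # van der Waals terms
    elec = 332.0 * np.outer(q_lig, q_rec) / (D * r)    # electrostatic term
    return float((vdw + elec).sum())

lig = np.array([[0.0, 0.0, 0.0], [1.5, 0.0, 0.0]])     # two ligand atoms
rec = np.array([[3.0, 0.0, 0.0], [0.0, 3.5, 0.0], [0.0, 0.0, 4.0]])
A = np.full((2, 3), 1.0e5)
B = np.full((2, 3), 1.0e2)
E = interaction_energy(lig, rec, A, B,
                       np.array([0.3, -0.3]), np.array([-0.2, 0.1, 0.1]))
```

Here the second ligand atom sits only 1.5 Å from a receptor atom, so the repulsive \( A_{ij}/r^{12} \) term dominates and the total energy is large and positive, as a steric clash should be.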

Transformation of optimization model

As mentioned above, problem (1) involves many constraints and is therefore difficult to solve directly. In order to solve problem (1) efficiently, we first introduce the following definitions and theorems.

Definition 1

If ψ is a positive real variable, and \( G = \left\{ {g_{j} ({\mathbf{d}})} \right\},\,j = 1, \ldots ,q, \) is a set of constraint functions, then

$$ E(G) = (1/\psi )\ln \sum\limits_{j = 1}^{q} {\exp (\psi g_{j} ({\mathbf{d}}))} $$
(4)

is a parametric constraint evaluation (PCE) function. The optimization problem (1) is transformed into the following model by means of PCE function:

$$ \begin{array}{*{20}c} {\min } \hfill & {f({\mathbf{d}})} \hfill \\ {{\text{s}} . {\text{t}} .} \hfill & {g_{\psi } ({\mathbf{d}}) = (1/\psi )\ln \sum\limits_{i = 1}^{q} {\exp (\psi g_{i} ({\mathbf{d}}))} \le 0} \hfill \\ \end{array} $$
(5)

Definition 2

If \( F({\mathbf{d}}) = \left\{ {f_{1} ({\mathbf{d}}),f_{2} ({\mathbf{d}}), \ldots ,f_{q} ({\mathbf{d}})} \right\} \) and \( \overline{F} ({\mathbf{d}}) = \left\{ {\overline{f}_{1} ({\mathbf{d}}),\overline{f}_{2} ({\mathbf{d}}), \ldots ,\overline{f}_{q} ({\mathbf{d}})} \right\} \), \( F({\mathbf{d}}),\overline{F} ({\mathbf{d}}) \in E^{q} \), satisfy \( f_{j} ({\mathbf{d}}) \le \overline{f}_{j} ({\mathbf{d}}), j = 1,2, \ldots ,q, \) and there exists at least one \( j_{0} \) \( (1 \le j_{0} \le q) \) such that \( f_{{j_{0} }} ({\mathbf{d}}) < \overline{f}_{{j_{0} }} ({\mathbf{d}}) \), then we write \( F({\mathbf{d}}) \le \overline{F} ({\mathbf{d}}) \) or, simply, \( F \le \overline{F} . \)

Definition 3

If, for any \( F,\overline{F} \in E^{q} \) with \( F \le \overline{F} \), \( E(F) < E(\overline{F} ) \) holds, then E(F) is called a strictly monotone increasing function of F.

Lemma

The PCE function E(G) is a strictly monotone increasing function of G, and if ψ→∞ then

$$ (1/\psi )\ln \sum\limits_{j = 1}^{q} {\exp (\psi g_{j} ({\mathbf{d}}))} = \max g_{j} ({\mathbf{d}}) \quad j = 1,2, \ldots ,q $$
(6)

Proof

Let

$$ G = \left\{ {g_{j} ({\mathbf{d}})} \right\} \le \overline{G} = \left\{ {\overline{g}_{j} ({\mathbf{d}})} \right\}, \quad j = 1,2, \ldots ,q $$
(7)

By Definition 2

$$ g_{j} ({\mathbf{d}}) \le \overline{g}_{j} ({\mathbf{d}}), \quad j = 1,2, \ldots ,q $$
(8)

and there exists at least one \( j_{0} (1 \le j_{0} \le q) \) such that

$$ g_{{j_{0} }} ({\mathbf{d}}) < \overline{g}_{{j_{0} }} ({\mathbf{d}}) $$
(9)

Then for ψ > 0,

$$ \psi g_{{j_{0} }} ({\mathbf{d}}) < \psi \overline{{\mathbf{g}}}_{{j_{0} }} ({\mathbf{d}}) $$
(10)
$$ \exp (\psi g_{{j_{0} }} ({\mathbf{d}})) < \exp (\psi \overline{g}_{{j_{0} }} ({\mathbf{d}})) $$
(11)

Hence

$$ \sum\limits_{j = 1}^{q} {\exp (\psi g_{j} ({\mathbf{d}})) < \sum\limits_{j = 1}^{q} {\exp (\psi \overline{g}_{j} ({\mathbf{d}}))} } $$
(12)

Taking logarithms on both sides and dividing by \( \psi \) gives

$$ E(G) = (1/\psi )\ln \sum\limits_{j = 1}^{q} {\exp (\psi g_{j} ({\mathbf{d}}))} < (1/\psi )\ln \sum\limits_{j = 1}^{q} {\exp (\psi \overline{g}_{j} ({\mathbf{d}}))} $$
(13)

i.e., E(G) is a strictly monotone increasing function of G. The \( \psi \) norm of the q-dimensional vector

$$ E_{G} = \left\{ {e^{{g_{1} ({\mathbf{d}})}} ,e^{{g_{2} ({\mathbf{d}})}} , \ldots ,e^{{g_{q} ({\mathbf{d}})}} } \right\}^{T} $$
(14)

is given by

$$ N_{\psi } (E_{G} ) = \left( {\sum\limits_{j = 1}^{q} {e^{{\psi g_{j} ({\mathbf{d}})}} } } \right)^{(1/\psi )} $$
(15)

The uniform norm, also called the maximum norm, is defined by

$$ N_{\infty } (E_{G} ) = \mathop {\lim }\limits_{\psi \to \infty } N_{\psi } \left( {E_{G} } \right) $$
(16)

Since \( e^{{g_{j} \left( {\mathbf{d}} \right)}} > 0 \) by Jensen’s inequality, the norm is a strictly monotone decreasing function of its order, i.e.

$$ N_{s} < N_{r} \quad {\text{ for }}r < s $$
(17)

The importance of this inequality is that it holds also in the limit as s→∞. Thus, Eq. 16 may be written as

$$ N_{\infty } \left( {E_{G} } \right) = \max \left( {{\text{e}}^{{g_{j} \left( {\mathbf{d}} \right)}} } \right) < N_{r} \left( {E_{G} } \right) $$
(18)

Taking logarithms on both sides of Eq. 18 and substituting from Eqs. 15 and 16 gives

$$ \mathop {\lim }\limits_{\psi \to \infty } (1/\psi )\ln \sum\limits_{j = 1}^{q} {\exp (\psi g_{j} ({\mathbf{d}}))} = \max (g_{j} ({\mathbf{d}})) $$
(19)

and the proof is completed.

The PCE function plays an important role in the proposed method. By means of the PCE function, we can state the following theorem and reduce problem (1) with multiple constraints to an optimization problem with a single constraint.
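The limiting behavior stated in the Lemma is easy to verify numerically. The sketch below (with arbitrary constraint values \( g_{j} \)) evaluates the PCE function of Eq. 4 in a numerically stable log-sum-exp form and shows it approaching \( \max_{j} g_{j} \) as ψ grows:

```python
import math

# PCE function of Eq. 4, computed stably by factoring out max(g):
# (1/psi) ln sum_j exp(psi g_j) = max(g) + (1/psi) ln sum_j exp(psi (g_j - max(g)))
def pce(g, psi):
    m = max(g)
    return m + math.log(sum(math.exp(psi * (gj - m)) for gj in g)) / psi

g = [-1.2, 0.3, -0.5]            # arbitrary example constraint values g_j(d)
for psi in (1.0, 10.0, 1000.0):
    print(f"psi = {psi:6.1f}:  E(G) = {pce(g, psi):.6f}")
# E(G) bounds max(g) = 0.3 from above and converges to it as psi grows
```

The gap between E(G) and max g is at most (ln q)/ψ, which is why ψ of order 10^3–10^5 already makes the single PCE constraint a faithful substitute for the q original constraints.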

Theorem 1

If ψ→∞, then the optimization problem (1) and

$$ \left\{ \begin{array}{ll} \min &f({\mathbf{d}}) \\ {\text{s.t.}} &g_{\psi } ({\mathbf{d}}) = (1/\psi )\ln \left\{ {\sum\limits_{k = 1}^{q} {\exp [\psi g_{k} ({\mathbf{d}})]} } \right\} \le 0 \end{array} \right. $$
(20)

have the same Kuhn–Tucker points.

Proof

The Lagrange augmented function of problem (20) is

$$ L({\mathbf{d}},\gamma ) = f({\mathbf{d}}) + (\gamma /\psi )\ln \sum\limits_{j = 1}^{q} {\exp (\psi g_{j} ({\mathbf{d}}))} $$
(21)

where γ ≥ 0 is the Lagrange multiplier of the corresponding constraint. The Kuhn–Tucker conditions for problem (20) are given as

$$ \partial f({\mathbf{d}})/\partial d_{i} + (\gamma /\psi )\left\{ {\sum\limits_{j = 1}^{q} {\exp (\psi g_{j} ({\mathbf{d}})) \cdot \partial g_{j} ({\mathbf{d}})/\partial d_{i} } } \right\}/\sum\limits_{j = 1}^{q} {\exp \left[ {\psi g_{j} \left( {\mathbf{d}} \right)} \right]} = 0 $$
(22)
$$ (1/\psi )\ln \sum\limits_{j = 1}^{q} {\exp \left( {\psi g_{j} \left( {\mathbf{d}} \right)} \right)} \le 0 $$
(23)
$$ (\gamma /\psi )\ln \sum\limits_{j = 1}^{q} {\exp \left( {\psi g_{j} \left( {\mathbf{d}} \right)} \right)} = 0,\quad \gamma \ge 0 $$
(24)

By means of the Lemma and Eq. 23, if ψ → ∞, then

$$ (1/\psi )\ln \sum\limits_{j = 1}^{q} {\exp (\psi g_{j} ({\mathbf{d}}))} = \max g_{j} ({\mathbf{d}}) \le 0 , \quad j = 1,2, \ldots ,q $$
(25)

i.e.

$$ g_{j} ({\mathbf{d}}) \le 0 $$
(26)

Substituting

$$ \gamma = \psi , \mu_{j} = \frac{{\exp \left[ {\psi g_{j} ({\mathbf{d}})} \right]}}{{\sum\limits_{j = 1}^{q} {\exp \left[ {\psi g_{j} ({\mathbf{d}})} \right]} }} $$
(27)

into Eq. 22 gives

$$ \partial f({\mathbf{d}})/\partial d_{i} + \sum\limits_{j = 1}^{q} {\mu_{j} \frac{{\partial g_{j} \left( {\mathbf{d}} \right)}}{{\partial d_{i} }}} = 0 $$
(28)

Combining Eqs. 24 and 27, if ψ→∞, then

$$ \left\{ \begin{array}{ll} g_{j} ({\mathbf{d}}) = 0&\quad {\text{if }}\mu_{j} > 0\\ g_{j} ({\mathbf{d}}) < 0&\quad {\text{if }}\mu_{j} = 0 \end{array} \right. $$
(29)

Equations 26, 28 and 29 are identical to the Kuhn–Tucker conditions of problem (1). Hence problems (20) and (1) have the same Kuhn–Tucker points, and vice versa. Theorem 1 is thus proved.

Kuhn–Tucker points are obtained by solving the Kuhn–Tucker conditions, which are necessary conditions for the optimum of a nonlinear program with equality and inequality constraints [16]. Theorem 1 shows that solving problem (1) with multiple constraints can be replaced by solving the simpler problem (20) with a single constraint.

In order to solve problem (20) using a genetic algorithm, we transform it into the following unconstrained optimization problem by means of a quasi-exact penalty function:

$$ \min \varphi_{\psi } ({\mathbf{d}}) = f({\mathbf{d}}) + \left( {\alpha /\psi } \right)\ln \left\{ {1 + \sum\limits_{i = 1}^{q} {\exp (\psi g_{i} ({\mathbf{d}}))} } \right\} $$
(30)

The parameter ψ can be chosen in the range 10^3–10^5, and α > 0 is the penalty factor.
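A small sketch of Eq. 30 (with made-up f and g values) illustrates the quasi-exact behavior: for large ψ the penalty term is essentially zero at a feasible point and grows roughly as α·max g once a constraint is violated:

```python
import math

# Sketch of the quasi-exact penalty of Eq. 30 with toy f and g values.
# The log term is evaluated in a shifted, overflow-safe form:
# ln(1 + sum exp(psi g_i)) = psi*m + ln(exp(-psi m) + sum exp(psi (g_i - m)))
def penalized(f_val, g_vals, psi=1.0e3, alpha=10.0):
    m = max(0.0, max(g_vals))
    s = math.exp(-psi * m) + sum(math.exp(psi * (g - m)) for g in g_vals)
    return f_val + (alpha / psi) * (psi * m + math.log(s))

# feasible point (all g_i < 0): the penalty is essentially zero
print(penalized(1.0, [-0.5, -1.0]))      # ~1.0
# violated constraint: the penalty grows roughly as alpha * max(g)
print(penalized(1.0, [0.2]))             # ~1.0 + 10 * 0.2 = 3.0
```

This is why the resulting unconstrained problem can be handed directly to a genetic algorithm: no gradient of the penalty is ever needed.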

Adaptive entropy-based genetic algorithm

The objective function of problem (30) is nonlinear and the design space is non-convex, so sensitivity analysis is very difficult. There is a critical need for alternative optimal-design strategies that are not susceptible to the pitfalls of nonlinear programming methods. Genetic algorithms provide such a capability, as shown by their successful adaptation and implementation in a series of optimal design problems. However, the genetic search process is time-consuming, which has hindered its application to molecular docking optimization, especially to massive virtual screening of libraries of billions of small molecules for compounds that can bind to known protein binding sites. In these circumstances, a novel adaptive genetic algorithm (GA) is proposed here, in which an entropy-based searching technique with multiple populations and the quasi-exact penalty function are developed to ensure rapid and steady convergence.

By means of Eq. 30, the fitness function of genetic algorithm may be written as:

$$ \max F({\mathbf{d}}) = C - \varphi_{\psi } ({\mathbf{d}}) $$
(31)

Problem (31) can be solved as an evolutionary design model, in which F(d) is the fitness function and C is a large positive number that ensures F(d) > 0. The quasi-exact penalty function \( \varphi_{\psi } ({\mathbf{d}}) \) is developed to solve nonlinear programming (NLP) problems with equality and inequality constraints.

The traditional genetic algorithm involves five basic operators: the coding of the string, the fitness function, reproduction, crossover and mutation. The probabilities of the crossover and mutation operators, p_c and p_m, must be provided to the GA, and are generally supplied as initial data. However, these genetic parameters can make the convergence of the algorithm slow and unsteady if they are not appropriately chosen. Here the probabilities p_c and p_m are treated as additional design variables to overcome the difficulty of choosing the genetic parameters. The lower and upper limits of p_c and p_m can be defined in a reasonable region (here \( 0.6 \le p_{c} \le 1.0, 0.0 \le p_{m} \le 0.1 \)).
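One simple way to realize this idea (a sketch, not necessarily the paper's exact encoding) is to append p_c and p_m to the design vector d of Eq. 1, so they evolve with the docking variables. The ligand with 5 torsions and the 60 Å translation box below are invented examples:

```python
import math
import random

# Sketch: the adaptive parameters p_c and p_m are appended to the design
# vector as two extra genes with the limits quoted in the text.
# n_torsions and the translation box are hypothetical illustrative values.
n_torsions = 5
bounds = ([(0.0, 60.0)] * 3                            # T_x, T_y, T_z
          + [(-math.pi, math.pi)] * (3 + n_torsions)   # R_x..R_z, T_b1..T_bn
          + [(0.6, 1.0), (0.0, 0.1)])                  # p_c, p_m limits

def random_individual():
    return [random.uniform(lo, hi) for lo, hi in bounds]

ind = random_individual()
p_c, p_m = ind[-2], ind[-1]    # adaptive genetic parameters of this member
```

With this encoding, crossover and mutation are applied to each member using the rates stored in that member, so the algorithm tunes its own operator probabilities as the search proceeds.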

In the multi-population genetic strategy, the genetic algorithm begins by arbitrarily generating m populations, all with the same searching space, i.e. the design space. If \( F_{j} ({\mathbf{d}}) \, (j = 1, \ldots ,m) \) represents the best value of the fitness function occurring in the jth population, then we need to maximize \( F_{j} ({\mathbf{d}}) \, (j = 1, \ldots ,m) \) by means of genetic operations, i.e. to solve the following optimization problem:

$$ \min - F_{j} ({\mathbf{d}}), \quad j = 1,2, \ldots ,m $$
(32)

Problem (32) is a multi-objective optimization problem, which is very difficult to solve completely. For the improved genetic algorithm with narrowing of the search space, we only need efficient narrowing coefficients for the searched space.

Shannon’s theorem [17] has wide-ranging applications in both communications and data storage. It is of foundational importance to the modern field of information theory [18], and there are similarities between the process of optimization and the communication process of information theory. The information entropy, or Shannon entropy, H of a discrete set of probabilities \( p_{1} , \ldots ,p_{n} \) is defined by

$$ \begin{array}{ll} H = &-\sum {p_{i} \ln p_{i} } \\{\text{s.t.}}&\sum {p_{i} = 1,} \quad p_{i} \in [0,1] \end{array} $$
(33)

Shannon entropy can be used to measure the uncertainty about the realization of a random variable. If \( p_{j} \) is here defined as the probability that the optimal solution of problem (32) occurs in population j, then the Shannon entropy will decrease during the optimization process of problem (32), and an entropy-based optimization model can be constructed as follows:

$$ \left\{ \begin{array}{ll} \min &- \sum\limits_{j = 1}^{m} {p_{j} F_{j} ({\mathbf{d}})}\\ \min &H = - \sum\limits_{j = 1}^{m} {p_{j} \ln (p_{j} )} \\ {\text{s.t.}} &\sum\limits_{j = 1}^{m} {p_{j} } = 1, \quad p_{j} \in [0,1] \end{array} \right. $$
(34)

where H is the information entropy.

Theorem 2

Optimization problems (34) and (32) have the same optimal solution.

Proof

Suppose that \( {\mathbf{d}}^{ * } \) and \( {\mathbf{p}}^{ * } = \left\{ {p_{1}^{*} ,p_{2}^{*} , \ldots ,p_{m}^{*} } \right\} \) are the optimal solution of problem (34), so that

$$ \min H^{ * } = - \sum\limits_{j = 1}^{m} {p_{j}^{ * } \ln (p_{j}^{ * } )} = - p_{l}^{ * } \ln (p_{l}^{ * } ) = 0 $$
(35)

where \( p_{l}^{ * } = 1, p_{i}^{ * } = 0{\text{ for }}i \ne l, \) i.e. the optimal solution of the problem (34) occurs in the population l. Hence

$$ \min\,\, - \sum\limits_{j = 1}^{m} {p_{j} F_{j} ({\mathbf{d}}^{ * } )} = \min - F_{l} ({\mathbf{d}}^{ * } ) $$
(36)

Obviously, \( {\mathbf{d}}^{ * } \) is also the optimal solution of problem (32). It can similarly be proved that the optimal solution of problem (32) is also that of problem (34), and the proof is completed.

The solution \( p_{j} \) of Eq. 34 can be obtained easily and explicitly:

$$ p_{j}^{*} = \exp (\nu F_{j} ({\mathbf{d}}))/\sum\limits_{j = 1}^{m} {\exp (\nu F_{j} ({\mathbf{d}}))} $$
(37)

in which

$$ \nu = (\beta - 1)/\beta $$
(38)

where ν is called the quasi-weight coefficient (here β = 0.5).

The quantities \( (1 - p_{j} ) \) can be used as the narrowing coefficients of the search space in the modified genetic algorithm. When the optimal solution occurs in the lth population, then \( (1 - p_{l}^{*} ) = 0 \) and its searching space is not narrowed.

The design space is defined as the initial searching space D(0), and M populations with N members are generated in this space. After a new generation is independently evolved in each population, the searching space of each population is narrowed according to the following equations:

$$ \begin{array}{ll} D_{j} (K) = (1 - p_{j} )D_{j} (K - 1) \\ \underline{{d_{ji} }} (K) = \max \left\{ {\left[ {d_{ji}^{*} (K) - 0.5(1 - p_{j} )D_{j} (K)} \right],\underline{{d_{ji} }} (0)} \right\} \\ \overline{{d_{ji} }} (K) = \min \left\{ {\left[ {d_{ji}^{*} (K) + 0.5(1 - p_{j} )D_{j} (K)} \right],\overline{{d_{ji} }} (0)} \right\} \end{array} $$
(39)

where \( D_{j} (K) \) is the searching space of population j at the Kth iteration, \( \underline{d_{ji}} (K) \) and \( \overline{d_{ji}} (K) \) are the modified lower and upper limits of the ith design variable in population j at the Kth iteration, respectively, and \( d_{ji}^{*} (K) \) is the value of design variable i of the best member in population j.

Equation 39 is employed to control the narrowing of the searching space of each population. If \( (1 - p_{l}^{*} ) = 0 \), the optimal solution occurs in the lth population and its searching space is not narrowed. The convergence criterion of the proposed method can then be defined as follows: when the searching space of the best population has been reduced to a very small region (a given tolerance), the global optimal solution is obtained approximately. Using the narrowed space as the convergence criterion controls the convergence of the algorithm effectively.
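Equations 37–39 can be sketched as follows; the fitness values and bounds are invented, and the clamping keeps the narrowed interval inside the initial design space:

```python
import math

# Sketch of Eqs. 37-39 with invented fitness values and bounds.
# occurrence_probs() gives the softmax-style probabilities of Eq. 37;
# narrow() applies Eq. 39 to one design variable of one population.
def occurrence_probs(F, beta=0.5):
    nu = (beta - 1.0) / beta                         # Eq. 38
    w = [math.exp(nu * f) for f in F]
    return [wi / sum(w) for wi in w]                 # Eq. 37

def narrow(lo, hi, best, p, lo0, hi0):
    D = (1.0 - p) * (hi - lo)                        # narrowed span D_j(K)
    new_lo = max(best - 0.5 * (1.0 - p) * D, lo0)    # lower limit of Eq. 39
    new_hi = min(best + 0.5 * (1.0 - p) * D, hi0)    # upper limit, clamped
    return new_lo, new_hi

F = [2.0, 5.0, 3.0]              # best fitness found in each population
p = occurrence_probs(F)          # probabilities sum to 1 as required
lo1, hi1 = narrow(-5.0, 5.0, 1.0, p[1], -10.0, 10.0)
```

Each population thus keeps its own shrinking interval centered on its best member, and the run terminates once the best population's interval falls below the tolerance.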

The algorithm consists of the following steps:

  • Step 1. Generate the initial populations and implement the reproduction (duplication) operator.

  • Step 2. Perform the crossover and mutation operators in each population.

  • Step 3. Narrow the design space of each population and find the best individual; preserve it according to the elitist strategy. Then check convergence: if the searching space of the best population has been reduced to the given tolerance, go to Step 4; otherwise, return to Step 2.

  • Step 4. Output the optimization results and stop the process.
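The four steps can be sketched end-to-end on a toy one-dimensional problem (maximize F(d) = −(d − 1)²). The population sizes, operator rates and the fixed narrowing factor below are simplified stand-ins for the entropy-based coefficients of Eq. 39:

```python
import random

random.seed(0)

def fitness(d):
    return -(d - 1.0) ** 2          # toy objective, maximum at d = 1

lo, hi = -10.0, 10.0                # initial search space D(0)
# Step 1: generate the initial populations
pops = [[random.uniform(lo, hi) for _ in range(20)] for _ in range(3)]

for generation in range(50):
    for i, pop in enumerate(pops):
        # Step 2: blend crossover plus Gaussian mutation
        children = []
        for _ in range(len(pop)):
            a, b = random.sample(pop, 2)
            c = 0.5 * (a + b) + random.gauss(0.0, 0.1)
            children.append(min(max(c, lo), hi))
        # elitist strategy: keep the best members of parents + children
        pops[i] = sorted(pop + children, key=fitness)[-len(pop):]
    # Step 3: narrow the search space around the current best member
    best = max((max(p, key=fitness) for p in pops), key=fitness)
    span = 0.9 * (hi - lo)          # fixed toy narrowing coefficient
    lo, hi = best - span / 2, best + span / 2
    if hi - lo < 1e-6:              # Step 4: space reduced to tolerance
        break
```

Because the elitist step never discards the best member, the best fitness is monotone non-decreasing while the interval contracts, which mirrors how the full method uses the contracted space, rather than a generation count, as its stopping rule.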

Results and discussion

Test data set

The GOLD test data set, originally proposed by Jones et al. [12], was chosen for our studies. Each complex was separated into a probe molecule and a docking ligand according to the biologically interacting pairs. Each protein molecule was obtained by excluding the ligand, all structural water molecules, cofactors, and metal ions from the receptor PDB file. Next, a mol2 file was generated by adding hydrogen atoms and Kollman charges using Sybyl 6.8. Residues around the bound ligand within a radius of 6.5 Å were isolated from the protein to define the active site. The ligands were then prepared by adding hydrogen atoms and Gasteiger–Marsili atomic charges as adopted in Sybyl 6.8. The number of heavy atoms in the ligands ranged from 6 to 55, with 83.6% of the ligands possessing fewer than 30 such atoms. The number of rotatable bonds of the ligands ranged widely from 0 to 22, with more than 88.8% of the ligands possessing fewer than 15 such bonds.

Docking accuracy

Generally, the primary criteria for evaluating docking methods are docking accuracy, scoring veracity, screening efficiency, and computational speed [15]. Docking accuracy and speed are the most important of these. Docking accuracy is based on the root-mean-square deviation (RMSD) of the locations of all heavy atoms in the model from those in the crystal structure. In general, docking accuracy is acceptable if the RMSD between the docked pose and the X-ray crystal structure is less than 2.0 Å. Depending on the RMSD values, the results were assigned to five categories. The first, excellent, was for predictions in which the top-scoring pose was within 0.5 Å RMSD of the experimental result. If the RMSD was between 0.5 and 2.0 Å, the result was assigned to the good category. A third category, close, was used for predictions with RMSD between 2.0 and 2.5 Å, and a fourth category, errors, for RMSD between 2.5 and 3.0 Å. Finally, the fifth category, wrong, was used for completely incorrect predictions with RMSD larger than 3.0 Å. The energy of a conformer was computed by Eq. 2 with the nonbonding 12-6 Lennard-Jones and electrostatic energy terms. The docking accuracy and speed of our program were evaluated on an SGI Fuel workstation.

Table 1 summarizes the performance of our docking method. As shown, the docking program yielded 30 docking solutions with RMSD values below 0.5 Å. Nearly 46% of the results were excellent. If we consider acceptable results to be those in the excellent and good categories, then our program achieved a 70% prediction rate.

Table 1 Docking accuracy of the improved method

Figure 1 shows a representative example of excellent docking. The three-dimensional structure of the FKBP12 complex 1FKG has been determined by X-ray crystallography at a resolution of 2.0 Å [19]. The flexibility of the ligand was described using 11 rotatable bonds. The active site was determined by flood-filling to a radius of 6.5 Å (roughly corresponding to all solvent-accessible cavity atoms), and the movement of the ligand was limited to a cuboid circumscribed about the active site; the defined cuboid surrounds the bound ligand within a radius of 6.5 Å, as shown in Fig. 1a. Only 61 genetic iterations were needed to obtain the highest fitness score. All the atoms of the ligand of 1FKG were correctly placed, with an RMSD of 0.27 Å (see Fig. 1b). The algorithm run took 14.32 s.

Fig. 1

Example of docking: 1FKG (0.27 Å RMSD, 11 rotatable bonds). (a) The defined cuboid around the bound ligand within a radius of 6.5 Å. (b) The docked structure fitted to the protein-bound X-ray structure, merged into the reference protein coordinates

Table 2 gives the relationship among RMSD values, docking time and ligand flexibility. RMSD values and docking time increase with ligand flexibility, and the docking time is approximately proportional to the number of rotatable bonds, i.e. the docking time increases by approximately 1 s per additional rotatable bond (see Fig. 2). Furthermore, as ligand size increases, docking accuracy and efficiency also decrease (see Table 3).

Table 2 RMSD values and computational time for ligands with different numbers of rotatable bonds
Fig. 2

The relation between computational time and the number of rotatable bonds of ligands

Table 3 RMSD values and docking time for ligands with different numbers of heavy atoms

Table 4 provides comparisons with other programs [12–15]. A comparison of scoring results would not be meaningful because these programs adopt different scoring functions, so Table 4 contains only the RMSD values of the top-scoring poses of the ligands optimized using each docking score. To avoid biases from different data sets, we compare results only within the subsets of the GOLD data set, which contain 99, 120, 81, 122 and 96 complexes for GOLD, FlexX, Surflex, Glide and DOCK6, respectively. The results show that our program has good docking accuracy. Further comparisons of docking accuracy with the above programs are given in Tables 5–7.

Table 4 Comparisons with Glide, GOLD, Surflex, FlexX and DOCK6 for docking accuracy
Table 5 Comparisons with GOLD for RMSD values of ligands with different numbers of rotatable bonds

Table 5 gives the comparison with GOLD: our method gives an average RMSD of 2.12 Å, whereas GOLD gives 3.00 Å. However, the success rate of GOLD was slightly better than that of our program: 66 GOLD solutions have RMSD values below 2.0 Å, versus 64 in our results. Tables 6 and 7 give comparisons with FlexX and Surflex; the average RMSD of our program is considerably better than theirs. Table 8 compares RMSD values with Glide for ligands with different numbers of rotatable bonds. The average RMSD of our program is 1.99 Å, whereas that of Glide is 1.91 Å. However, this comparison may not be completely fair to our program, because Glide reports the RMSD of the predicted poses relative to the energy-minimized ligand, not to the native ligand [13]. We also compare with DOCK6 (the new version of DOCK). As shown in Table 9, our program yielded a 66.7% success rate with RMSD values below 2 Å, while DOCK6 achieved a 60.4% success rate. Most of these 96 complexes have 0–9 rotatable bonds; the results show that our program is superior for molecular docking at this level of flexibility.

Table 6 Comparisons with FlexX for RMSD values of ligands with different numbers of rotatable bonds
Table 7 Comparisons with Surflex for RMSD values and computational time of ligands with different numbers of rotatable bonds
Table 8 Comparison with Glide for RMSD values of ligands with different numbers of rotatable bonds
Table 9 Comparison with DOCK6 for RMSD values of ligands with different numbers of rotatable bonds

Docking speed

An advantage of the method is its significant docking speed. Current docking programs show decreasing performance with an increasing number of conformational degrees of freedom [20]. Direct comparison of docking speed is somewhat problematic because of differences in hardware; however, we can still offer a comparison based on several recent works [12, 15, 21]. According to these reports, docking one molecule needs 50–100 s for FlexX, DOCK and GOLD on an SGI Indigo2 R10K processor [15]. In this paper, the average time to dock a ligand from the above data set is about 5.36 s, and the maximum time is 42 s on an SGI Fuel workstation.

Docking speed is a critical issue in the application of a docking method, especially in virtual screening [15]. Using the same data set, Fig. 2 plots the mean docking time of the 134 ligands against the number of rotatable bonds. It shows that the docking time is approximately proportional to the number of rotatable bonds of the ligand. This indicates that our program is fast enough for virtual screening of large-scale chemical databases.

Conclusions

This paper presents a rapid adaptive genetic algorithm for flexible molecular docking. The results on the test data set show that the docking time is approximately proportional to the number of rotatable bonds of the ligand, and that the method can dock a ligand to a protein target in a few seconds on an SGI Fuel workstation. The proposed method is thus suitable for virtual screening. However, several aspects remain to be improved, such as developing better docking strategies, improving the scoring function and considering the flexibility of the protein target; this further work is in progress.