1 Introduction

The molecular distance geometry problem (MDGP) basically consists of obtaining all feasible three-dimensional structures for a molecule when some of its interatomic distances are given [2, 4, 6, 8]. For the case when all interatomic distances are known, the problem can be solved in linear time [3]. Otherwise, the problem is classified as NP-hard [11].

Formally, the MDGP can be described as follows. Given an atomic sequence \(1,2,\ldots ,n\) and a set \(S\) of all pairs of atoms \((i, j)\) such that the distance \({d}_{ij}\) is known, find feasible Cartesian coordinates \({x}_{1},\ldots ,{x}_{n} \in {\mathbb{R}}^{3}\) of the atomic sequence (which can be seen as a linear sequence of bonded atoms) so that

$$\Vert {x}_{i} - {x}_{j}\Vert = {d}_{ij},\quad \forall (i,j) \in S.$$
(9.1)

If the set of known interatomic distances satisfies certain properties (usually compatible with proteins, a very important class of macromolecules), the problem has a discrete search space and is called the discretizable molecular distance geometry problem (DMDGP) [5].

The following assumptions turn the MDGP into a combinatorial problem (DMDGP), for a given atomic ordering:

  1. \({d}_{ij}\) is known for all pairs of atoms \((i, j)\) with \(1 \leq j - i \leq 3\).

  2. Angles between vectors \(\left ({x}_{i+2} - {x}_{i+1}\right )\) and \(({x}_{i+1} - {x}_{i})\), where \(1 \leq i \leq n - 2\), are never a multiple of \(\pi \).

The set \(S\) of pairs of atoms with known distances may be partitioned into two subsets: the set \(E\), corresponding to all pairs of atoms \((i, j)\) where \(1 \leq j - i \leq 3\), and the set \(F\) of all pairs of atoms \((i, j)\) where \(j - i > 3\) (see Fig. 9.1).

In [7], a branch-and-prune (BP) algorithm has been proposed for solving the DMDGP. In this chapter, we provide an alternative way for solving this problem, making use of the BP algorithm.

Fig. 9.1 F-type distances

The main idea in the BP algorithm is to explore the search space by using the torsion matrices of each atom and to eliminate infeasible positions as soon as possible. The torsion matrix \({B}_{i}\) related to atom \(i\) can be calculated as follows:

$${B}_{i} = \left [\begin{array}{rrrr} -\cos {\theta }_{i-2,i}& -\sin {\theta }_{i-2,i}& 0& - {d}_{i-1,i}\cos {\theta }_{i-2,i} \\ \sin {\theta }_{i-2,i}\cos {\omega }_{i-3,i}& -\cos {\theta }_{i-2,i}\cos {\omega }_{i-3,i}& -\sin {\omega }_{i-3,i}&{d}_{i-1,i}\sin {\theta }_{i-2,i}\cos {\omega }_{i-3,i} \\ \sin {\theta }_{i-2,i}\sin {\omega }_{i-3,i}& -\cos {\theta }_{i-2,i}\sin {\omega }_{i-3,i}& \cos {\omega }_{i-3,i}&{d}_{i-1,i}\sin {\theta }_{i-2,i}\sin {\omega }_{i-3,i} \\ 0& 0& 0& 1 \end{array} \right ],$$

for \(i \geq 4\), where \({\omega }_{i-3,i}\) is a torsion angle, \({\theta }_{i-2,i}\) is a bond angle, and \({d}_{i-1,i}\) is a bond length. Matrices \({B}_{1}\), \({B}_{2}\), and \({B}_{3}\) are given by

$$\begin{array}{rcl}{ B}_{1}& =& \left [\begin{array}{cccc} 1&0&0&0\\ 0 &1 &0 &0 \\ 0&0&1&0\\ 0 &0 &0 &1 \end{array} \right ],\quad {B}_{2} = \left [\begin{array}{cccc} - 1&0& 0 & - {d}_{1,2} \\ 0 &1& 0 & 0\\ 0 &0 & - 1 & 0 \\ 0 &0& 0 & 1 \end{array} \right ], \\ {B}_{3}& =& \left [\begin{array}{cccc} -\cos {\theta }_{3} & -\sin {\theta }_{3} & 0& - {d}_{2,3}\cos {\theta }_{3} \\ \sin {\theta }_{3} & -\cos {\theta }_{3} & 0& {d}_{2,3}\sin {\theta }_{3} \\ 0 & 0 &1& 0\\ 0 & 0 &0 & 1 \end{array} \right ]. \end{array}$$
(9.2)

With the product \({B}_{1}{B}_{2}\cdots {B}_{i}\), one can easily obtain a position for atom \(i\) that is consistent with all E-type distances. Each atom \(i \geq 4\) has two possible torsion matrices: \({B}_{i}\) (calculated by using \(\sin {\omega }_{i-3,i} = +\sqrt{1 -\cos ^{2}{\omega }_{i-3,i}}\)) and \({B}_{i}^{{\prime}}\) (calculated by using \(\sin {\omega }_{i-3,i} = -\sqrt{1 -\cos ^{2}{\omega }_{i-3,i}}\)). The first three atoms have only one torsion matrix each, which implies that there are, a priori, \({2}^{i-3}\) feasible coordinates for each atom \(i > 3\). The BP algorithm behaves like a tree search algorithm (such as depth-first or breadth-first search): at each level \(i\), the previous matrix product \({B}_{1}\cdots {B}_{i-1}\) is multiplied by each of the torsion matrices \({B}_{i}\) and \({B}_{i}^{{\prime}}\), thus providing two positions \({x}_{i}\) and \({x}_{i}^{{\prime}}\) (branching), and positions that are not consistent (according to a constant error tolerance \(\epsilon \)) with the related F-type distances are discarded (pruning).
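The branching step can be illustrated with a minimal Python sketch. It assumes that the bond angle and \(\cos \omega\) are supplied directly (in BP they are derived from the E-type distances), and it builds \(B_3\) as the general matrix with \(\omega = 0\), which is an assumption of this sketch. The function names and the numerical values are illustrative only.

```python
import math

def torsion_matrix(d, theta, omega):
    """B_i for bond length d, bond angle theta and torsion angle omega."""
    ct, st = math.cos(theta), math.sin(theta)
    co, so = math.cos(omega), math.sin(omega)
    return [
        [-ct,     -st,      0.0, -d * ct],
        [st * co, -ct * co, -so,  d * st * co],
        [st * so, -ct * so,  co,  d * st * so],
        [0.0,      0.0,     0.0,  1.0],
    ]

def matmul(a, b):
    """Product of two 4x4 matrices stored as nested lists."""
    return [[sum(a[r][k] * b[k][c] for k in range(4)) for c in range(4)]
            for r in range(4)]

def position(product):
    """Atom coordinates: the product B_1...B_i applied to (0, 0, 0, 1)."""
    return tuple(product[r][3] for r in range(3))

def first_three(d12, d23, theta3):
    """Cumulative products B_1, B_1 B_2, B_1 B_2 B_3 for the fixed atoms."""
    b1 = [[float(r == c) for c in range(4)] for r in range(4)]
    b2 = [[-1.0, 0.0, 0.0, -d12], [0.0, 1.0, 0.0, 0.0],
          [0.0, 0.0, -1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
    b3 = torsion_matrix(d23, theta3, 0.0)  # assumption: B_3 = B_i with omega = 0
    p2 = matmul(b1, b2)
    return [b1, p2, matmul(p2, b3)]

def branch(prefix, d, theta, cos_omega):
    """Two cumulative products for atom i: one from B_i, one from B_i'."""
    sin_omega = math.sqrt(max(0.0, 1.0 - cos_omega ** 2))
    return [matmul(prefix, torsion_matrix(d, theta, math.atan2(s, cos_omega)))
            for s in (sin_omega, -sin_omega)]

def dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Illustrative values only (not taken from any instance in the text).
products = first_three(d12=1.5, d23=1.5, theta3=1.91)
x1, x2, x3 = (position(p) for p in products)
x4, x4_alt = (position(p) for p in branch(products[2], 1.5, 1.91, 0.5))
print(x4, x4_alt)
print(dist(x3, x4), dist(x3, x4_alt))
```

Both candidates for atom 4 keep the bond length to atom 3 and differ only by a reflection through the plane of atoms 1, 2, and 3, which is why E-type distances alone cannot distinguish them.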

We now introduce some definitions. Let \(M\) be a molecule defined by a sequence of \(n\) atoms \(1, 2, \ldots , n\). An interval \([a, b]\) of \(M\) is any subsequence \(\{a,\ldots ,b\}\) of atoms of \(M\), with \(1 \leq a \leq b \leq n\). The size of \([a, b]\) is defined by \(b - a\). A realization \({R}_{a,b}\) is a function \({R}_{a,b} : [a,b]\mapsto {\mathbb{R}}^{3}\) that associates each atom of an interval to a point in \({\mathbb{R}}^{3}\). We say that \({R}_{a,b}\) is infeasible for a given instance of the DMDGP when \(\exists (i,j) \in S\) such that \(i,j \in [a,b]\) and \({d}_{ij}\neq \Vert {R}_{a,b}(j) - {R}_{a,b}(i)\Vert \); otherwise, it is feasible. \({R}_{a,b}\) is complete if \([a, b] = [1, n]\); otherwise, it is partial. The idea of working with partial realizations was previously exploited in [10], for investigating a conjecture on the DMDGP, and in [9], for the development of a parallel version of the BP algorithm, even though no formal definitions were given in the latter.
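As a small illustration of these definitions, the following Python sketch tests feasibility and completeness of a realization stored as a dictionary. The function names, the data layout, and the use of a tolerance `eps` in place of exact equality (following the error tolerance used by BP) are assumptions of this sketch.

```python
import math

def is_feasible(realization, known, eps=1e-6):
    """True when every known distance inside the interval is matched within eps.

    realization: dict atom -> (x, y, z) for the atoms of an interval [a, b].
    known: dict (i, j) -> d_ij for the pairs in S.
    """
    for (i, j), d in known.items():
        # Pairs with an atom outside the interval are ignored.
        if i in realization and j in realization:
            if abs(math.dist(realization[i], realization[j]) - d) > eps:
                return False
    return True

def is_complete(realization, n):
    """True when the realization covers the whole interval [1, n]."""
    return set(realization) == set(range(1, n + 1))

# Hypothetical partial realization of the interval [2, 4].
R = {2: (0.0, 0.0, 0.0), 3: (1.0, 0.0, 0.0), 4: (1.0, 1.0, 0.0)}
S = {(1, 2): 5.0, (2, 3): 1.0, (3, 4): 1.0, (2, 4): math.sqrt(2.0)}
print(is_feasible(R, S), is_complete(R, 4))
```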

A realization tree \({T}_{a,b}\) is a rooted tree with two properties:

  1. Each level \(k\) of \({T}_{a,b}\) corresponds to one atom of the interval \([a, b]\), given by atom(\(k\)).

  2. Each node at level \(k\) contains a coordinate vector for atom(\(k\)) corresponding to the atom of the root node.

We use \(\left \Vert {T}_{a,b}\right \Vert\) to denote the number of nodes in \({T}_{a,b}\). We also use \({T}_{a,b}^{+}\) and \({T}_{a,b}^{-}\) to denote, respectively, the tree growth direction from left to right and from right to left. As can be seen, the BP algorithm (given in Algorithm 2) yields a realization tree containing all nodes visited by the algorithm.

In the next sections, we will use more than one realization tree to solve the DMDGP, showing that this strategy can improve the performance of the BP algorithm. The rest of this work is structured as follows. In Sect. 9.2, we motivate alternative uses of the BP algorithm through one simple theoretical example. In Sects. 9.3 and 9.4, we present a technique for merging realization trees, which is the main contribution of this work. Section 9.5 provides a heuristic method that controls the growth of the realization trees. In Sect. 9.6, we describe a methodology and show some computational experiments where the presented techniques are considered. Section 9.7 provides our conclusions.

2 BP May Be Used in Different Ways

Note that the BP algorithm may be used to explore the search space in ways other than the original procedure, in which the initial atoms of the sequence are fixed and branching on the tree is performed until the last atom is positioned. We are going to show that the BP algorithm presented in [7] actually provides a framework for solving DMDGP instances in various ways. Although the DMDGP has been proven to be NP-hard [5], it is still important to know how to solve its instances as fast as possible, even if the asymptotic behavior does not change.

Let us first use a simple example as motivation, showing that we can use BP in two ways with different performances for the same DMDGP instance. Let us consider a DMDGP instance where \(n = 6\) and \(F =\{ (2, 6)\}\). When we execute BP as described in [7], the algorithm starts by placing atoms 1, 2, and 3. Then, it branches into two possible positions for atom 4, and so on, until it reaches the last atom. At this level, it must branch into all eight possibilities in order to check the feasibility of each one through the information provided by \({d}_{2,6}\). Only at the last level is the algorithm able to reject some atomic positions.

However, we could solve the considered instance in the opposite direction along the sequence of atoms (for implementation purposes, without loss of generality, this could also be seen as executing the same BP algorithm for an “inverted” instance, where each atom label \(i\) is replaced by \(7 - i\)). In this alternative approach, BP starts by fixing atoms 6, 5, and 4, and then branches into two positions for atom 3. When atom 2 is reached, four positions are computed, and the known distance \({d}_{2,6}\) can be tested for each of them. If a node is pruned, it has no child nodes on the subsequent level. In this way, BP explores fewer nodes than in the classical approach. In other words, the knowledge of \({d}_{2,6}\) allows the second approach to restrict its search space one level before the first approach, thus making the search faster.
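The example can be reproduced with a compact Python sketch. Note that it is not the torsion-matrix formulation of [7]: the two candidate positions of each atom are obtained here by intersecting three spheres, which yields the same pair of mirror-image points. The coordinates are made up solely to generate a consistent set of distances, and all function names are hypothetical.

```python
import math

def sub(a, b): return tuple(x - y for x, y in zip(a, b))
def dot(a, b): return sum(x * y for x, y in zip(a, b))
def norm(a): return math.sqrt(dot(a, a))
def cross(a, b):
    return (a[1]*b[2] - a[2]*b[1], a[2]*b[0] - a[0]*b[2], a[0]*b[1] - a[1]*b[0])

def two_candidates(p1, p2, p3, r1, r2, r3):
    """Both points at distances r1, r2, r3 from p1, p2, p3 (sphere intersection)."""
    v12, v13 = sub(p2, p1), sub(p3, p1)
    d = norm(v12)
    ex = tuple(c / d for c in v12)
    i = dot(ex, v13)
    t = tuple(v - i * e for v, e in zip(v13, ex))
    j = norm(t)
    ey = tuple(c / j for c in t)
    ez = cross(ex, ey)
    x = (r1*r1 - r2*r2 + d*d) / (2*d)
    y = (r1*r1 - r3*r3 + i*i + j*j) / (2*j) - i*x/j
    z = math.sqrt(max(0.0, r1*r1 - x*x - y*y))
    return [tuple(p1[k] + x*ex[k] + y*ey[k] + s*z*ez[k] for k in range(3))
            for s in (1.0, -1.0)]

def branch_and_prune(n, dist, eps=1e-6):
    """Return (generated nodes, solutions); dist maps (i, j), i < j, to d_ij."""
    d12, d13, d23 = dist[(1, 2)], dist[(1, 3)], dist[(2, 3)]
    x3 = (d12*d12 + d13*d13 - d23*d23) / (2*d12)
    first = [(0.0, 0.0, 0.0), (d12, 0.0, 0.0),
             (x3, math.sqrt(d13*d13 - x3*x3), 0.0)]
    nodes, solutions, stack = 3, 0, [first]
    while stack:                        # depth-first traversal
        xs = stack.pop()
        i = len(xs) + 1                 # next atom to place
        if i > n:
            solutions += 1
            continue
        for c in two_candidates(xs[i-4], xs[i-3], xs[i-2],
                                dist[(i-3, i)], dist[(i-2, i)], dist[(i-1, i)]):
            nodes += 1                  # count every generated position
            if all(abs(norm(sub(c, xs[j-1])) - d) <= eps
                   for (j, k), d in dist.items() if k == i and i - j > 3):
                stack.append(xs + [c])  # keep only unpruned positions
    return nodes, solutions

def reverse_instance(n, dist):
    """Swap each label i with n + 1 - i (that is, 7 - i when n = 6)."""
    return {(n + 1 - j, n + 1 - i): d for (i, j), d in dist.items()}

# Made-up coordinates, used only to generate consistent distances.
atoms = {1: (0.0, 0.0, 0.0), 2: (1.5, 0.0, 0.0), 3: (2.0, 1.4, 0.0),
         4: (3.4, 1.6, 0.6), 5: (3.9, 2.9, 1.3), 6: (5.3, 3.0, 0.7)}
pairs = [(i, j) for i in atoms for j in atoms if 1 <= j - i <= 3] + [(2, 6)]
dist = {(i, j): norm(sub(atoms[i], atoms[j])) for i, j in pairs}

forward = branch_and_prune(6, dist)
backward = branch_and_prune(6, reverse_instance(6, dist))
print(forward, backward)
```

Counting every generated position (pruned ones included), the forward run generates 3 + 2 + 4 + 8 = 17 nodes, whereas the inverted run prunes at the fifth level and generates 3 + 2 + 4 + 4 = 13, with the same number of solutions in both cases.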

The presented example makes it clear that solving an instance with BP in the usual way is not always the best approach; it mainly depends on the \(F\) distance set. The same analysis may be applied to the concept of interval, introduced earlier, since an interval can be seen as a DMDGP instance as well. Each interval may be solved separately by the BP algorithm, yielding multiple partial realization trees to be combined later, forming complete realizations. Therefore, it is important to study the different ways of solving DMDGP instances with BP by dividing instances into intervals and then solving the intervals in different directions.

3 Merging Two Partial Realizations

In order to solve DMDGP instances by using many intervals, we need to be able to combine the solutions associated with each interval. If \({R}_{a,x}\) and \({R}_{b,y}\) (\(a < b < x < y\)) are two feasible realizations sharing three non-collinear atoms (the existence of these atoms implies that \(x - b \geq 3\)), then we can combine them in order to obtain a single realization \({R}_{a,y}\).

We arbitrarily choose \({R}_{a,x}\) as the basis for constructing \({R}_{a,y}\). Thus, both realizations will have the same reference system, and \({R}_{a,y}\) will inherit all coordinates of \({R}_{a,x}\), that is, \(\forall k \in [a,x],\,\,{R}_{a,y}(k) = {R}_{a,x}(k)\). In order to complete the coordinate sequence of \({R}_{a,y}\), we still need to fill the remaining interval \([x + 1, y]\), which will be done by applying Euclidean transformations to the coordinates of \({R}_{b,y}\).

Let \(i\), \(j\), and \(k\) be three atoms that belong to both realizations \({R}_{a,x}\) and \({R}_{b,y}\). In order to make the coordinates of the interval \([x + 1, y]\) satisfy all E-type distances, we must align \({R}_{b,y}\) to \({R}_{a,x}\), that is, find \({R}_{b,y}^{{\prime}}\) such that

$$\left \{\begin{array}{l} {R}_{b,y}^{{\prime}}(i) = {R}_{a,x}(i), \\ {R}_{b,y}^{{\prime}}(j) = {R}_{a,x}(j), \\ {R}_{b,y}^{{\prime}}(k) = {R}_{a,x}(k). \end{array} \right .$$
(9.3)

Initially, we consider \({R}_{b,y}^{{\prime}}\) as a copy of \({R}_{b,y}\). Then, the first equation in Eq. (9.3) is satisfied by applying a simple translation to \({R}_{b,y}^{{\prime}}\) (Fig. 9.2), whose translation vector \(v\) is given by

$$v = {R}_{a,x}(i) - {R}_{b,y}(i).$$

Fig. 9.2 Translation of \({R}_{b,y}\) for aligning atom \(i\)

In order to satisfy the second equation in Eq. (9.3), after the translation, we need to apply a rotation around the axis perpendicular to the two vectors connecting the atoms \(i\) and \(j\) in each realization (\({R}_{a,x}\) and \({R}_{b,y}^{{\prime}}\)) (Fig. 9.3). These vectors are:

$${L}_{j} = {R}_{a,x}(j) - {R}_{a,x}(i)$$

and

$${L}_{j}^{{\prime}} = {R}_{ b,y}^{{\prime}}(j) - {R}_{ b,y}^{{\prime}}(i).$$

The rotation axis can be obtained through the cross product \({L}_{j} \times {L}_{j}^{{\prime}}\). The rotation angle is the one between the two vectors, and can be obtained by using the cosine law:

$${\phi }_{1} {=\cos }^{-1}\left (\frac{{L}_{j}^{2} + {L}_{ j}^{{\prime}2} -{\left \vert {L}_{ j} - {L}_{j}^{{\prime}}\right \vert }^{2}} {2{L}_{j}{L}_{j}^{{\prime}}} \right ).$$

Fig. 9.3 Rotation of \({R}_{b,y}\) for aligning atom \(j\)

For aligning the atom \(k\) (satisfying the last equation in Eq. (9.3)), we need another rotation. Atoms \(i\) and \(j\), already aligned, determine the only possible rotation axis for \({R}_{b,y}^{{\prime}}\) in order to continue satisfying the first two equations. The rotation angle around this axis is calculated by using the two vectors connecting the atoms \(j\) and \(k\) in each realization, as follows:

$${L}_{k} = {R}_{a,x}(k) - {R}_{a,x}(j)$$

and

$${L}_{k}^{{\prime}} = {R}_{ b,y}^{{\prime}}(k) - {R}_{ b,y}^{{\prime}}(j).$$

However, we are no longer interested in the angle formed by these two vectors, as in the previous case. Now, what matters is the angle between their projections onto the plane perpendicular to the rotation axis (Fig. 9.4). For calculating these projections, we use the projection matrix \(M\), oriented by vector \({L}_{j}\):

$$M = I -\frac{{L}_{j}{L}_{j}^{T}} {{L}_{j}^{T}{L}_{j}}.$$

Fig. 9.4 Rotation of \({R}_{b,y}\) for aligning atom \(k\)

The projected vectors are given by:

$${P}_{k} = M{L}_{k}$$

and

$${P}_{k}^{{\prime}} = M{L}_{ k}^{{\prime}},$$

and the angle between them may also be calculated by the cosine law:

$${\phi }_{2} {=\cos }^{-1}\left (\frac{{P}_{k}^{2} + {P}_{ k}^{{\prime}2} -{\left \vert {P}_{ k} - {P}_{k}^{{\prime}}\right \vert }^{2}} {2{P}_{k}{P}_{k}^{{\prime}}} \right ).$$
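The three alignment steps of this section (one translation and two rotations) can be sketched in Python as follows. This is only an illustration under some assumptions: rotations are applied with Rodrigues' formula, the angles are computed with `atan2` so that their sign is determined (the cosine law alone gives an unsigned angle), and the projection onto the plane perpendicular to the axis is computed by subtracting the component along \(L_j\). All function names and coordinates are hypothetical.

```python
import math

def add(a, b): return tuple(x + y for x, y in zip(a, b))
def sub(a, b): return tuple(x - y for x, y in zip(a, b))
def scale(a, s): return tuple(x * s for x in a)
def dot(a, b): return sum(x * y for x, y in zip(a, b))
def norm(a): return math.sqrt(dot(a, a))
def cross(a, b):
    return (a[1]*b[2] - a[2]*b[1], a[2]*b[0] - a[0]*b[2], a[0]*b[1] - a[1]*b[0])

def rotate(p, origin, axis, angle):
    """Rotate p by angle about the line through origin along axis (Rodrigues)."""
    u = scale(axis, 1.0 / norm(axis))
    v = sub(p, origin)
    c, s = math.cos(angle), math.sin(angle)
    turned = add(add(scale(v, c), scale(cross(u, v), s)),
                 scale(u, dot(u, v) * (1.0 - c)))
    return add(origin, turned)

def signed_angle(a, b, axis):
    """Right-handed angle about axis that turns a onto b (both normal to axis)."""
    return math.atan2(dot(cross(a, b), axis) / norm(axis), dot(a, b))

def reject(v, axis):
    """Component of v in the plane perpendicular to axis."""
    return sub(v, scale(axis, dot(v, axis) / dot(axis, axis)))

def align(ra, rb, i, j, k):
    """Copy of rb moved rigidly so that atoms i, j, k coincide with ra."""
    shift = sub(ra[i], rb[i])                        # translation for atom i
    out = {a: add(p, shift) for a, p in rb.items()}
    lj, lj_alt = sub(ra[j], ra[i]), sub(out[j], out[i])
    axis = cross(lj_alt, lj)                         # first rotation, for atom j
    phi1 = math.atan2(norm(axis), dot(lj_alt, lj))
    if norm(axis) < 1e-12:                           # parallel or antiparallel
        axis = cross(lj, (1.0, 0.0, 0.0))
        if norm(axis) < 1e-12:
            axis = cross(lj, (0.0, 1.0, 0.0))
    out = {a: rotate(p, ra[i], axis, phi1) for a, p in out.items()}
    pk = reject(sub(ra[k], ra[j]), lj)               # second rotation, for atom k
    pk_alt = reject(sub(out[k], out[j]), lj)
    phi2 = signed_angle(pk_alt, pk, lj)
    return {a: rotate(p, ra[i], lj, phi2) for a, p in out.items()}

def merge(ra, rb, i, j, k):
    """R_{a,y}: all of ra, plus the aligned coordinates of rb's other atoms."""
    merged = dict(align(ra, rb, i, j, k))
    merged.update(ra)
    return merged

# Made-up structure; rb is the interval [3, 7] in a different reference system.
reference = {1: (0.0, 0.0, 0.0), 2: (1.5, 0.0, 0.0), 3: (2.0, 1.4, 0.0),
             4: (3.4, 1.6, 0.6), 5: (3.9, 2.9, 1.3), 6: (5.3, 3.0, 0.7),
             7: (6.0, 4.3, 1.2)}
ra = {a: reference[a] for a in range(1, 6)}
rb = {a: add(rotate(reference[a], (0.0, 0.0, 0.0), (1.0, 2.0, 3.0), 0.7),
             (5.0, -2.0, 1.0)) for a in range(3, 8)}
merged = merge(ra, rb, 3, 4, 5)
worst = max(norm(sub(merged[a], reference[a])) for a in reference)
print(worst)
```

In this test the second realization differs from the first by a proper rigid motion, so aligning the three shared atoms also recovers atoms 6 and 7. The sketch assumes that atoms \(i\), \(j\), and \(k\) are not collinear.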

4 Merging Two Realization Trees

Once we know how to combine two feasible realizations sharing three non-collinear atoms, we can combine two realization trees sharing three atoms, which cannot be collinear by definition of the DMDGP. If \({T}_{a,x}\) and \({T}_{b,y}\) are two realization trees sharing at least three atoms, the realizations generated by their combination will fill the interval \([a, y]\).

According to the growth direction of the trees, three kinds of merging may occur, described below. Algorithm 3, presented next, provides a way of merging trees \({T}_{a,x}\) and \({T}_{b,y}\), thus yielding realizations for the interval \([a, y]\), and is applicable to all three kinds of merging. Root–root merging occurs when two trees \({T}_{a,x}^{+}\) and \({T}_{b,y}^{-}\) grow in opposite directions and overlay their roots, satisfying \(x - b \geq 3\); they can be combined from their initialization (Fig. 9.5c).

Fig. 9.5 Realization trees merging

Root–leaf merging occurs between trees \({T}_{1}\) and \({T}_{2}\) growing in the same direction, when \({T}_{1}\) reaches the atom related to the root of \({T}_{2}\) (Fig. 9.5a). In order to be able to merge them, it is necessary to expand \({T}_{1}\) by two more levels, so that we have \({T}_{1} = {T}_{a,x}\) and \({T}_{2} = {T}_{b,y}\), with \(x - b \geq 3\).

Leaf–leaf merging happens when two trees grow toward each other (Fig. 9.5b). Let us suppose that, at some point, two trees (one negative, the other positive) reach the same atom \(i\). In order to share three atoms, they need to grow at most two more levels. This can be done in three ways: the positive tree growing two levels, the negative tree growing two levels, or both growing one level. Thus, we will have \({T}_{a,x}^{-}\) and \({T}_{b,y}^{+}\), such that \(x - b \geq 3\). From the performance point of view, the tree that is about to undergo more prunings should have higher growth priority. Even though two levels may not seem significant, it is worth emphasizing that, for large numbers of leaves, growing one level may be very expensive (on the order of the total number of nodes in the tree up to the previous level).

In Algorithm 3, we initially combine each realization in \({T}_{a,x}\) with each realization in \({T}_{b,y}\). Clearly, the total number of combinations is the product of the numbers of leaves in the two trees. Considering that both trees share exactly three atoms (this condition is enough for merging them), and letting \(n = y - a\), the numbers of leaves of \({T}_{a,x}\) and \({T}_{b,y}\) are \(O({2}^{x-a})\) and \(O({2}^{n-(x-a)+3})\), respectively. Thus, the total number of combined realizations is \(O({2}^{n})\). Then, each combined realization is verified against the \(F\) distance set, whose size is \(O({n}^{2})\). Finally, the complexity of the algorithm for combining \({T}_{a,x}\) and \({T}_{b,y}\) is \(O({n}^{2}{2}^{n})\), which is greater than the exponential complexity of the original BP algorithm. In Sect. 9.6, we will see that, according to our computational results, this worst-case analysis does not seem to have practical significance.
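The count in the preceding paragraph can be written out explicitly. With \(n = y - a\) and the leaf bounds stated there, the number of combined realizations is bounded by

$${2}^{x-a} \cdot {2}^{n-(x-a)+3} = {2}^{n+3} = O({2}^{n}),$$

and checking each combined realization against at most \(O({n}^{2})\) F-type pairs gives

$$O({n}^{2}) \cdot O({2}^{n}) = O({n}^{2}{2}^{n}).$$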

5 Growth Control

When we solve one instance by using multiple realization trees \({T}_{1},{T}_{2},\ldots ,{T}_{x}\), each pair of subsequent trees will undergo one of the three kinds of merging described in Sect. 9.4. In case of pairs \(({T}_{i},{T}_{i+1})\)that undergo root–root merging, there is only one way for performing the merging. The same happens in the case of pairs undergoing root–leaf merging, since only the tree that undergoes the merging on its leaves can grow inside the interval delimited by the roots (the tree which undergoes merging on its leaves has to grow until it reaches the root of the other tree). However, in leaf–leaf merging cases, the atom in which the merging will occur is not previously known, and it depends on the growth of both trees.

Aiming at minimizing the algorithm’s execution time and the total amount of nodes (of both trees), we may consider the following heuristic: we give growth priority to the tree with fewer leaves. In other words, at each step, we verify which tree has fewer leaves, and we let it grow by one level (changing its amount of leaves for the next step). This procedure is repeated until their leaves reach the same atom. However, this is a greedy method, which does not consider the possibility of allowing the other tree to grow first. For example, it might be more convenient to let a tree grow if it is about to apply a large pruning in a few steps. Algorithm 4 summarizes this approach.
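The greedy rule described above can be sketched in Python with a toy model of the two trees. The model is an assumption made only for illustration: each tree is reduced to the atom reached by its leaves and its number of leaves, growing one level doubles the leaves, and the hypothetical dictionaries `pos_div` and `neg_div` state by which factor pruning divides the leaves when a given atom is reached. Ties are resolved in favor of the positive tree, a choice not specified in the text. The sketch stops when both trees reach the same atom.

```python
def greedy_growth(n, pos_div=None, neg_div=None):
    """Grow the tree with fewer leaves until both reach the same atom.

    Returns (meeting atom, leaves of positive tree, leaves of negative tree,
    growth order), where the order is a string of '+' and '-' steps.
    """
    pos_div, neg_div = pos_div or {}, neg_div or {}
    pos_front, pos_leaves = 3, 1      # positive tree fixes atoms 1, 2, 3
    neg_front, neg_leaves = n - 2, 1  # negative tree fixes atoms n, n-1, n-2
    order = ""
    while pos_front < neg_front:
        if pos_leaves <= neg_leaves:  # toy tie-break: positive tree first
            pos_front += 1
            pos_leaves = max(1, 2 * pos_leaves // pos_div.get(pos_front, 1))
            order += "+"
        else:
            neg_front -= 1
            neg_leaves = max(1, 2 * neg_leaves // neg_div.get(neg_front, 1))
            order += "-"
    return pos_front, pos_leaves, neg_leaves, order

# Toy instance: the negative tree loses three quarters of its candidates
# when it reaches atom 8, so it is allowed to traverse more atoms.
result = greedy_growth(12, neg_div={8: 4})
print(result)
```

In this toy run the negative tree, which prunes at atom 8, ends up covering atoms 10 down to 6, while the positive tree covers only atoms 3 to 6.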

6 Computational Experiments

In this section, we consider some artificial instances, automatically generated by computer programs, and real instances, produced from protein structures obtained from the Protein Data Bank (PDB) [1]. The PDB is a public database where three-dimensional conformations of proteins and nucleic acids are stored. We have selected only structures generated by NMR. Our goal is to study in which cases the use of multiple realization trees is more efficient than the original BP algorithm. To accomplish this, we have implemented two methods that use two realization trees in a primitive way (without any previous analysis of the \(F\) set for determining at which atom the trees start and in which directions they grow). Then, we have compared both methods to the original BP algorithm, in positive and negative directions. Our analysis did not consider the quality of the solutions found, since all methods fully explore the search space of the instances, thus reaching the same set of solutions.

All algorithms were implemented in C++, using the standard template library (STL). Experiments were executed on a 2.2 GHz Intel Core 2 Duo with 2 GB of RAM.

We introduce a graphical representation that shows how the \(F\)-set pairs are distributed along the molecule. We do this through the plot of \(x \times P(x)\), where \(x\) is an atom of the molecule and \(P(x)\) is the function expressing the sum of the interval lengths related to the \(F\)-set pairs whose last atom to be reached by BP is \(x\). Considering \({F}_{x}^{+} =\{ (i,x) \in F\}\) and \({F}_{x}^{-} =\{ (x,j) \in F\}\), \(P(x)\) is defined, for each direction, by the following formulae:

$${P}^{+}(x) ={ \sum \nolimits }_{(i,x)\in {F}_{x}^{+}}(x - i)$$

and

$${P}^{-}(x) ={ \sum \nolimits }_{(x,j)\in {F}_{x}^{-}}(j - x).$$
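The two formulae translate directly into code. The following Python sketch assumes that \(F\) is given as a collection of pairs \((i, j)\) with \(i < j\); the function names and the sample set are illustrative.

```python
def p_plus(F, n):
    """P+(x) for x = 1..n: sum of (x - i) over the pairs (i, x) in F."""
    p = {x: 0 for x in range(1, n + 1)}
    for i, j in F:
        p[j] += j - i   # in the positive direction, j is reached last
    return p

def p_minus(F, n):
    """P-(x) for x = 1..n: sum of (j - x) over the pairs (x, j) in F."""
    p = {x: 0 for x in range(1, n + 1)}
    for i, j in F:
        p[i] += j - i   # in the negative direction, i is reached last
    return p

# Hypothetical F set for a molecule with six atoms.
F = {(1, 5), (2, 6), (1, 6)}
print(p_plus(F, 6))
print(p_minus(F, 6))
```

For this sample set, the pairs ending at atom 6 contribute \(4 + 5 = 9\) to \(P^{+}(6)\), and the pairs starting at atom 1 contribute \(4 + 5 = 9\) to \(P^{-}(1)\).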

Figure 9.6 shows the \(F\)-set of the tested artificial instances, described through an arc representation (each pair \((i, j) \in F\) is represented by an arc that connects atoms \(i\) and \(j\)), their respective plots of \({P}^{+}(x)\) and \({P}^{-}(x)\), and the execution time for the following methods (also listed in Table 9.1):

  1. Method 1

    One positive tree \({T}_{1,n}^{+}\) (original BP), implemented with breadth-first tree search;

  2. Method 2

    One positive tree \({T}_{1,n}^{+}\) (original BP), implemented with depth-first tree search;

  3. Method 3

    One negative tree \({T}_{1,n}^{-}\) (original BP), implemented with breadth-first tree search;

  4. Method 4

    One negative tree \({T}_{1,n}^{-}\) (original BP), implemented with depth-first tree search;

  5. Method 5

    Two trees in opposite directions, growing from their extremities towards the center, with growth control (leaf–leaf merging);

  6. Method 6

    Two trees in opposite directions, growing from the center towards their extremities (root–root merging).

Fig. 9.6 Tests with artificial instances

Table 9.1 Tests with artificial instances

From the test results with artificial instances, we can observe some facts. The variability of the instances has shown that different cases require different approaches, where the direction and the number of trees play an important role (see Fig. 9.6). The poor performance of the two-tree methods on instances (b) and (c) is due not to the growth of the trees but to the merging process, since both trees have many leaves at their merging point. Method 5, which uses the growth control heuristic for two trees, behaves in a versatile manner, allowing the tree with fewer leaves to traverse a greater part of the molecule, and is less sensitive than the BP algorithm to non-uniform distributions of F-type distances, thus performing well on instances such as (f) and (a).

The same methods have been tested on instances produced from real protein data. For this, we created a DMDGP instance from a PDB file that contains a known protein structure, by taking all atomic coordinates from the main chain (protein backbone) and determining those interatomic distances that are inside a cut-off radius of 5 Å. Figure 9.7 and Table 9.2 provide further details about the instances used and the execution time of our methods (for instances generated from real data, the arc representation is not clear, due to the large number of atoms and F-type distances, so it was not used).
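A minimal Python sketch of this instance generation is given below. It assumes that the backbone coordinates have already been read into a list in chain order (parsing the PDB file is not shown), and that, following the DMDGP assumptions, all pairs with \(j - i \leq 3\) are kept in \(E\) regardless of the cut-off. The function name and the sample coordinates are hypothetical.

```python
import math

def make_instance(coords, cutoff=5.0):
    """Split the known distances into the sets E and F.

    coords: list of (x, y, z) in chain order; atom labels start at 1.
    Returns two dicts mapping (i, j), i < j, to the distance d_ij.
    """
    n = len(coords)
    E, F = {}, {}
    for i in range(1, n + 1):
        for j in range(i + 1, n + 1):
            d = math.dist(coords[i - 1], coords[j - 1])
            if j - i <= 3:
                E[(i, j)] = d      # assumed known by the DMDGP definition
            elif d <= cutoff:
                F[(i, j)] = d      # extra distance inside the cut-off radius
    return E, F

# Made-up zigzag chain of six atoms (coordinates in angstroms).
coords = [(1.2 * t, 0.8 * (t % 2), 0.0) for t in range(6)]
E, F = make_instance(coords)
print(sorted(F))
```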

Fig. 9.7 Tests with instances generated from real protein data

Table 9.2 Tests with instances generated from real protein data

In these tests, as in the tests with artificial instances, the use of two trees has been efficient in certain cases and has shown some advantages over the original BP. However, because of its computational cost, the use of more than one tree is justified only if the trees are placed in strategic positions along the molecule (as happened with method 5 for instance 1SFV), so that F-type distances can be used as soon as possible, causing prunings and yielding few leaves at the moment of merging. As has been shown for artificial instances, the cases where molecules have some interval with few F-type distances to be traversed (such as instances (1DFN-a) and (1DFN-b)) give greater growth priority to one of the trees, which cannot be foreseen by the original BP. Method 5, whose trees have controlled growth, has shown that it can deal better with this kind of \(F\)-set topology.

7 Conclusions

We have presented some initial studies that enable solving the DMDGP with multiple realization trees. Using Euclidean transformations such as translations and rotations, we discussed possible ways of combining realizations of distinct intervals that share at least three atoms. We studied the three possible cases for merging realization trees, and, by using this technique, we presented an algorithm that deals with these three cases. This algorithm is what actually allows the use of multiple trees for solving the DMDGP. Moreover, we presented a heuristic for the multiple-tree strategy, consisting of regulating the growth of trees that will undergo leaf–leaf merging.

We have run tests with artificial instances and with instances generated from real protein data, comparing the BP algorithm, which produces one realization tree, with two primitive methods using two realization trees. For both artificial instances and instances generated from proteins, the use of two trees (despite the simplicity of the implemented methods) has shown good performance in comparison to the original BP algorithm. For each instance, depending on the topology of the \(F\) set, the methods that use only one tree had a high performance variability according to the direction of growth along the molecule. However, the heuristic method that considers two trees was less sensitive to the \(F\)-set topology, having, most of the time, a performance similar to that of the one-tree method in its most efficient direction (with no need to detect which direction that is).

The results presented here reinforce the interest in studying alternative solution approaches for the DMDGP. Our intention here was twofold: (1) to show that the BP algorithm by itself does not ensure the best performance, which depends on the direction of its growth along the molecule, and (2) to show that it may be used within more complex strategies, such as solving the molecule with multiple realization trees, which provide advantages over the traditional approach. The multiple-tree strategy lets us think about performance (not only completeness and correctness) when solving the DMDGP, stimulating the investigation of heuristics for the DMDGP.

Acknowledgments

The authors would like to thank the Brazilian research agencies FAPESP, FAPERJ, and CNPq for their financial support.