Keywords

1 Introduction

In this work we propose and solve some computational experiments involving instances of the molecular distance geometry problem [11]. We employ a continuous optimization software to find numerical solutions to the problem. Our objective is to reconstruct three-dimensional structures of proteins using only the distances between their atoms. To attain this goal, we need to determine a set of n points \(\{{x}^{1},{x}^{2},\ldots ,{x}^{n}\} \subset{\mathbb{R}}^{3}\) such that \(\|{x}^{i} - {x}^{j}\| =\hat{ {d}}_{ij}\), where \(\hat{{d}}_{ij}\) is the Euclidean distance between the atoms i and j. We can formulate this task as a continuous optimization problem as follows:

$$\begin{array}{rcl} \mbox{ minimize}\ \,& & \sum\limits_{i,j}\ {(\|{x}^{i} - {x}^{j}\| -\hat{ {d}}_{ ij})}^{2}, \\ \mbox{ subject to}\ \,& & {x}^{i} \in{\mathbb{R}}^{3},i = 1,2,\ldots ,n.\end{array}$$
(12.1)

The variables in Eq. (12.1) are the coordinates of points \({x}^{i} \in{\mathbb{R}}^{3}\), and the objective function is not differentiable when \({x}^{i} = {x}^{j}\), for some i, j. However, as the distances between atoms are always positive real numbers, we can apply a minimization algorithm that uses first derivatives to solve the problem (12.1). Then, if \(\hat{{d}}_{ij} > 0\) for all i, j, the local minimizers of Eq. (12.1) are configurations that do not contain coincident points. This result was proved by Jan de Leeuw in [4]. In the computational experiments, we solve some instances of the molecular distance geometry problem using GENCAN [2]. This routine, available atwww.ime.usp.br/~egbirgin/tangois able to find approximate solutions to minimization problems with box constraints. For each considered instance, the numerical solutions obtained by GENCAN were compared to the true structure of the analyzed protein. The comparison was carried out as follows: given the true configuration of the protein and a numerical solution, we determine a transformation that superimposes both structures in some optimal manner. This problem is known as the Procrustes problem [6, 8].

The Procrustes problem consists in finding an orthogonal matrix \(Q \in{\mathbb{R}}^{3\times 3}\) that minimizes the function

$$g(Q) =\| {M}_{0} - {M}_{1}{Q\|}_{F},$$
(12.2)

where M 0 and M 1 are matrices in \({\mathbb{R}}^{n\times 3}\) and \(\|{.\|}_{F}\) is the Frobenius norm. The orthogonal matrix Q that minimizes Eq. (12.2) has a closed-form expression. In the book of Golub and Van Loan [5], a singular value decomposition is employed to determine Q. This way of solving Eq. (12.2) does not ensure that the orthogonal matrix Q is a rotation matrix. There are cases where Q is the composition of a rotation and of a reflection. Kearsley in [9] uses unitary quaternions to find Q, and, as a result, he always obtains a rotation matrix. More references about quaternions and rotations can be found in [3, 7, 10, 12]. We desire to investigate if the numerical solutions obtained by GENCAN to the problem (12.1) differ from the original configurations by transformations involving a pure rotation or a rotation followed by a reflection. For this, we solve the Procrustes problem applying both proposed techniques: singular value decomposition and unitary quaternions.

2 Numerical Experiments

We selected some proteins from protein data bank [1] and we considered only the alpha carbon coordinates (C α) of each structure. The selected proteins and the number of atoms (\({n}_{{C}_{\alpha }}\)) are indicated in Table 12.1. We initially propose three sets of tests with these proteins; the details are discussed below. All experiments in this work have been carried out on a single core of an Intel Core 2 CPU 2.4GHz with 2GB RAM, running MAC OS X 10.5.

Table 12.1 Proteins used in the computational experiments

2.1 Solving Problems Using All Distances Between Atoms

In this set of experiments we suppose that all the distances \(\hat{{d}}_{ij}\) between the atoms are known. With the first five proteins of Table 12.1, we solve the problem (12.1) using GENCAN twenty times whereas the problem with 1lmO protein was solved two hundred times, where each run corresponds to a different starting point. Table 12.2 shows the results of runs for which GENCAN reached the lowest objective function value. The columns in this table have the following meaning: ndist is the total number of distances between pairs of atoms, nvar is the number of variables, iter and evalf are, respectively, the total number of iterations and evaluations of the objective function, f(x  ∗ ) is the final objective function value, and t(s) is the CPU time in seconds. The column pure rot. shows the quantity of runs in which the optimization routine obtained a solution that differs from the true structure by a transformation involving a pure rotation. The column rot. + ref. indicates the total of rounds in which the numerical solution differs from the true structure by a transformation involving a rotation followed by a reflection. The stopping criterion of GENCAN in all tests required that the gradient norm had to be smaller than 10 − 4.

Table 12.2 First set of experiments: all the distances are known

The final values of the objective function show that GENCAN finds configurations of points in \({\mathbb{R}}^{3}\) that fit all the distances. However, we noted in six tests related to the protein 6PAX that the routine obtains configurations with \(f({x}^{{_\ast}}) \approx1{0}^{3}\) and gradient norm smaller than 10 − 4. These configurations are certainly local minimizers.

We chose the protein 1OOH to illustrate a test where GENCAN obtains a solution that differs of the true configuration by a transformation involving reflection. Figure 12.1a shows the true structure of 1OOH with 126 alpha carbons. Each atom is represented by a point in \({\mathbb{R}}^{3}\) and consecutive points are joined by lines. Figure 12.1b compares the original distances between pairs of atoms (\(\hat{{d}}_{ij}\) axis) to the distances obtained numerically (axis d ij ).

Fig. 12.1
figure 1

Experiments with 1OOH protein

The analysis with the Procrustes problem is shown in Fig. 12.2. To construct this figure we use only ten first consecutive C α atoms of 1OOH. We solve the Procrustes problem (12.2) using the two formulations discussed above. Figure 12.2a shows the optimal superimposition of the true and numerical structures. The numerical solution (red points) does not appear in this image because it is superimposed by the true structure (green points). The transformation matrix obtained in this case involves a reflection and was obtained solving Eq. (12.2) with the strategy proposed by G. Golub and C. Van Loan. Figure 12.2b shows the superimposition obtained by applying the strategy proposed by Kearsley. In this case, the orthogonal matrix describes a pure rotation. We note in Fig. 12.2b that the numerical configuration (red points) is the reflected image of the original one (green points). In this case, it is not possible to determine a rotation that superimposes both structures. The results obtained for the Procrustes problem are reported in Table 12.3.

Fig. 12.2
figure 2

1OOH protein: solving procrustes problem to compare structures

Table 12.3 Results of procrustes problem with 1OOH protein

2.2 Simulating Errors

In this set of experiments, we consider the first five proteins of Table 12.1 and we suppose that all distances between pairs of atoms were obtained with errors. To simulate this situation, we add to each value \(\hat{{d}}_{ij}\) a random number created in the interval \([-\rho ,\rho ]\), with \(\vert \rho \vert \leq1\). Then, for each protein and each fixed value ρ, we solve Eq. (12.1) twenty times with a different starting point in each run. All tables below show only the results corresponding to the tests in which GENCAN reached the lowest final value of objective function. Table 12.4 indicates the final values of objective function attained by GENCAN and Table 12.5 shows the performance of the routine for solving each instance in terms of iterations, evaluations of the objective function, and CPU time. According to Table 12.4, when we increase the parameter ρ, the final values of the objective function also increase by a factor of 102. The performance of GENCAN in these problems was quite similar to the results obtained in correspondence with problems considered in the first set of experiments.

Table 12.4 Final values of objective function reached by GENCAN
Table 12.5 Performance of GENCAN in tests with errors

We built some figures for an experiment related to the protein 3RAT, where we fixed ρ = 1 (fifth line and last column of Table 12.4). Figure 12.3a shows the true structure of 3RAT with 124 alpha carbons, and Fig. 12.3b indicates the result of superimposing both structures (true and numerical) by a transformation involving a pure rotation. For building this figure, we use only the first ten atoms of structures: true (green points) and numerical (red points). In this case, we obtained the same orthogonal matrix by solving the Procrustes problem with the two approaches described above:

$$Q = \left (\begin{array}{rrr} - 0.728719& - 0.651336& 0.211497\\ 0.141985 & - 0.44583 & - 0.883785 \\ 0.669932& - 0.614001& 0.417364 \end{array} \right ),\quad g(Q) = 2.02778.$$
Fig. 12.3
figure 3

Test with 3RAT protein

Figure 12.4 shows a graph where the x-and y-axes represent, respectively, the perturbed distances and the distances obtained by solving Eq. (12.1) with GENCAN. Although the final value of objective function is not small, the points are close to the line y = x.

Fig. 12.4
figure 4

Simulating 7,626 distances with errors

2.3 Solving Problems Using a Subset of Interatomic Distances

In these experiments, we use the same proteins reported in Table 12.1. However, we try here to recover the true structure using only distances not greater than a fixed parameter d fix. Assuming that the distances are known exactly, we varied the parameter value d fix and we analyzed the obtained results using the Procrustes technique. To each protein and each value d fix, we solve the problem (12.1) fifty times with a multistart strategy: a different initial point was used in each run. In all the cases, GENCAN stopped when the gradient norm was lower than 10 − 4.

Tables 12.6 and 12.7 show only information corresponding to the tests with the lowest value of the objective function attained by GENCAN. The total number of distances between pairs of atoms (ndist), the number of distances used to solve the problem (12.1) (nd), and the final value of the objective function (f(x  ∗ )) are reported in Table 12.6. The performance of routine is indicated in Table 12.7. These results show that GENCAN can find configurations of points that fit all distances between atoms using less than 36 % of the known distances.

Table 12.6 Final values of objective function attained by GENCAN
Table 12.7 Performance of GENCAN in the resolution of problems

We illustrate a test with the protein 6PAX where we attempt to recover the true structure considering only distances not greater than 10 Å. Figure 12.5a shows the true structure with 133 alpha carbons and Fig. 12.5b compares the original distances between atoms (\(\hat{{d}}_{ij}\)) with the distances in the numerical solution (d ij ). We can see in the graph that the points are concentrated around the line y = x. To create Fig. 12.6a, we solved the Procrustes problem using a singular value decomposition and we obtained a transformation involving a reflection. In the case of Fig. 12.6b, we applied the quaternion approach and, as a result, we obtained a pure rotation matrix. These results are shown in Table 12.8.

Fig. 12.5
figure 5

Experiments with 6PAX protein

Fig. 12.6
figure 6

Comparison of structures via procrustes

Table 12.8 Results of procrustes problem with 6PAX protein

According to the experiments, we can conclude that it is possible to use a continuous optimization routine to recover a 3D structure of a protein using only a subset of known distances between pairs of atoms. In particular, if we provide to GENCAN a reasonable starting point, the routine solves the problem very quickly.

2.4 Comparing GENCAN and MD-jeep

To finish this work, we compare the performances of GENCAN to the ones of a software tool named MD-jeep. MD-jeep was developed specifically to solve molecular distance geometry problems using combinatorial optimization techniques [13]. This software was written in C by Mucherino et al., and it is freely distributed atwww.antoniomucherino.it/en/mdjeep.php

In order to run the experiments, we used eight instances obtained from protein conformations downloaded from Protein Data Bank. We extracted the coordinates of atoms N, C α, and C from each structure. For each protein, only distances not greater than d fix = 6 Åwere considered as input for the routines. To solve the problems with GENCAN, we employ a multistart strategy: we perform runs until the routine provides a solution that differs from the original structure by a transformation involving a pure rotation. GENCAN stopped in all tests with the gradient norm smaller than 10 − 4. The solutions of MD-jeep listed in Table 12.9 differ from original structure by a linear transformation involving a rotation matrix. The columns of Table 12.9 have the following meaning: nat is the number of atoms N, C α, and C present in each protein, ndist is the total number of distances between pairs of atoms, nd is the number of distances not greater than d fix = 6 Å, and t(s) is the CPU time, in seconds.

Table 12.9 Comparing GENCAN and MDJEEP

After solving the problems with both packages, we analyze the quality of the solutions obtained using the error formula

$${E}_{\mbox{ sol}} = \frac{1} {{n}_{d}}\sum\limits_{i,j}\ \frac{\vert \hat{{d}}_{ij} - {d}_{ij}\vert } {{d}_{ij}} ,$$
(12.3)

where \(\hat{{d}}_{ij}\) is the original distance between the atoms i, j, d ij is the final distance between the points \({x}^{i},{x}^{j} \in{\mathbb{R}}^{3}\), and n d is the number of distances used in each test. We employed a Fortran procedure to evaluate the numerical solutions using the formula (12.3). The results are shown in the columns E sol of Table 12.9. According to Table 12.9, we can see that both routines attain good solutions to the problems. GENCAN obtains smaller values to the error (12.3), but MD-jeep is much faster.