1 Introduction

Distance geometry (DG) started when Menger characterized geometric concepts using the idea of distance [35]. In the majority of applications of DG, the input data consists of an incomplete list of distance values and the output is a set of positions in some Euclidean space realizing those given distances [6]. Other applications are given in [36, 39].

When distances are pre-assigned to pairs of objects, we have the assigned Distance Geometry Problem (aDGP), also called just DGP [28, 37], defined as follows:

Definition 1

Given a simple undirected graph \(G=(V,E,\delta )\), whose edges are weighted by \( \delta :E\rightarrow (0,\infty )\), and an integer \(K>0\), find a function \(x:V\rightarrow {\mathbb {R}}^{K}\) such that

$$\begin{aligned} \forall \{u,v\}\in E,\text { }||x_{i}-x_{j}||_{2}=\delta _{i,j}, \end{aligned}$$
(1)

where \(x_{i}=x(i)\), \(x_{j}=x(j)\), and \(\delta _{i,j}=\delta (\{i,j\})\).

Depending on the application, the embedding space can be very general, but for problems related to molecular geometry, we will fix it to \({\mathbb {R}}^{3} \). For example, 3D protein structures and nanostructures can be obtained by exploiting distance information between atom pairs provided by experimental techniques, such as nuclear magnetic resonance (NMR) [28] and the pair distribution function (PDF) method [15], respectively.

In general, in the context of molecular conformations, it is considered that the graph G is known a priori, but the information that is actually given by NMR experiments and PDF methods consists of a list of distance values that are only subsequently assigned to atom pairs [7].

In other words, while the distance is given, we do not know the two vertices having such a distance. That is, the associated graph is actually unknown and the only input is a vertex set and a list of distance values. This is the unassigned Distance Geometry Problem (uDGP), which has received much less attention than the aDGP [4, 6].

The formal definition of the uDGP is the following (since \(\delta _{i,j}=\delta _{j,i}\), we will write \(\{i,j\}\) instead of (ij)):

Definition 2

Given a set of vertices V and a list of associated distance values \( d_{1},\ldots ,d_{m}\), find an injective function \(g:\{1,\ldots ,m\}\rightarrow V\times V\) and a function \(x:V\rightarrow {\mathbb {R}}^{3}\) such that, \( \forall \{i,j\}\in g(\{1,\ldots ,m\})\),

$$\begin{aligned} \delta _{i,j}=d_{g^{-1}(\{i,j\})}\text { } \end{aligned}$$
(2)

and

$$\begin{aligned} ||x_{i}-x_{j}||_{2}=\delta _{i,j}. \end{aligned}$$

Note that g is an assignment function that defines a set \(E\subset \) \( V\times V\), the edges of the graph associated to the uDGP.

For historical notes and surveys on methods to solve DGPs, see [6, 7, 28, 29], respectively. Also see the recent books [21, 22, 30].

In 1979, Saxe proved that the aDGP is NP-hard [40]. The uDGP is even more challenging in practice, because the graph itself and the graph realization must both be determined at the same time.

Although the uDGP is not new to mathematics [41], the literature focuses predominantly on one-dimensional problems motivated by DNA sequencing, often called partial digest problems [12].

For nanostructure calculations, there are two heuristics that have been proposed: TRIBOND [11, 14] and LIGA [15]. Both methods are based on build-up approaches and suppose that sufficient distance constraints are available to ensure a unique solution at each step of the procedure.

In the context of molecular geometry, we propose mathematical programming formulations for the uDGP, one of the open problems in DG mentioned in [31]. In addition to theoretical results related to these formulations, we also propose a new heuristic for the problem (Sect. 2). Section 3 presents computational experiments and Sect. 4 concludes the paper with new research directions.

2 New formulations for the uDGP

In this section, we present mathematical programming models for the uDGP, with associated theoretical results, and a heuristic to solve it.

2.1 Mathematical programming models

To take into account the assignment function g (Definition 2), we introduce binary variables \(a_{i,j}^{k}\) such that

$$\begin{aligned} a_{i,j}^{k}=1\Leftrightarrow \text {distance }d_{k}\text { is assigned to the pair }(i,j)\in V\times V\text {.} \end{aligned}$$

Considering vertices \(v_{1},\ldots ,v_{n}\in V\) and distance values \( d_{1},\ldots ,d_{m}\) related to a uDGP instance, we propose the following model to the uDGP:

$$\begin{aligned} \begin{array}{c} \begin{array}{c} \begin{array}{rc} \underset{x_{1},\ldots ,x_{n},a_{i,j}^{k}}{\min } &{} \overset{n-1}{\underset{i=1}{ \sum }}\overset{n}{\underset{j=i+1}{\sum }}\left( \overset{m}{\underset{k=1}{ \sum }}\left( a_{i,j}^{k}\left( ||x_{i}-x_{j}||_{2}^{2}-d_{k}^{2}\right) ^{2}\right) \right) \\ \text {s.t.} &{} \begin{array}{c} \overset{n-1}{\underset{i=1}{\sum }}\overset{n}{\underset{j=i+1}{\sum }} a_{i,j}^{k}=1,\text { }k=1,\ldots ,m, \\ \overset{m}{\underset{k=1}{\sum }}a_{i,j}^{k}\le 1,\text { }i=1,\ldots ,n-1, \text { }j=i+1,\ldots ,n,\ \\ x_{i}\in {\mathbb {R}}^{3},\text { }a_{i,j}^{k}\in \{0,1\},\text { } \\ k=1,\ldots ,m,\text { }i=1,\ldots ,n-1,\text { }j=i+1,\ldots ,n. \end{array} \end{array} \end{array} \end{array} \end{aligned}$$
(3)

Our first result states a relationship between a uDGP solution and a solution to model (3).

Theorem 1

A pair (gx) is a solution for a uDGP instance associated to a graph \( G=(V,E)\), with \(|V|=n\), \(|E|=m\), \(g:\{1,\ldots ,m\}\rightarrow V\times V\), and \( x:V\rightarrow {\mathbb {R}}^{3}\), if and only if (xa) is a global optimal solution to (3).

Proof

If (gx) is a solution for a given uDGP associated to a graph \(G=(V,E)\), with \(|V|=n\), \(|E|=m\), \(g:\{1,\ldots ,m\}\rightarrow V\times V\), \(x:V\rightarrow {\mathbb {R}}^{3}\), and distance values \(d_{1},\ldots ,d_{m}\) related to binary variables \(a_{i,j}^{k}\), such that

$$\begin{aligned} ||x_{i}-x_{j}||_2=\delta _{i,j}, \end{aligned}$$

where \(\delta _{i,j}=d_{g^{-1}(\{i,j\})}\), then it is easy to see that (xa) is a global optimal solution to (3).

Considering now that \((x_{1},\ldots ,x_{n},a_{i,j}^{k})\) is a global optimal solution to (3), for \(i=1,\ldots ,n-1\), \(j=i+1,\ldots ,n\), and \(k=1,\ldots ,m \), we have that

$$\begin{aligned} \overset{m}{\underset{k=1}{\sum }}\left( a_{i,j}^{k}\left( ||x_{i}-x_{j}||_{2}^{2}-d_{k}^{2}\right) ^{2}\right) =0 \end{aligned}$$

and, by constraints of problem (3), the values \(a_{i,j}^{k}\) assign pairs \((i,j)\in V\times V\) such that

$$\begin{aligned} ||x_{i}-x_{j}||_2=d_{k}\text {, }k=1,\ldots ,m\text {.} \end{aligned}$$
(4)

This implicitly defines a weighted graph \(G=(V,E,d)\), \(d:E\rightarrow {\mathbb {R}}\), with vertices and edges related to \(x_{1},\ldots ,x_{n}\) and pairs (ij), respectively, an injective function \(g:\{1,\ldots ,m\}\rightarrow V\times V\), such that

$$\begin{aligned} \delta _{i,j}=d_{g^{-1}(\{i,j\})}, \end{aligned}$$

and a realization of G, \(x:V\rightarrow {\mathbb {R}}^{3}\), that satisfies ( 4). Thus, (gx) is a solution of the uDGP associated to the distance values \(d_{1},\ldots ,d_{m}\) and the graph \(G=(V,E,d)\). \(\square \)

In order to avoid the huge number of binary variables of the model (3) and inspired by the Solid Isotropic Material with Penalization (SIMP) method [5] and the ideas proposed in [34], we introduce a new formulation with only continuous variables:

$$\begin{aligned} \begin{array}{c} \begin{array}{c} \begin{array}{rc} \underset{t,x_{1},\ldots ,x_{n},a_{i,j}^{k}}{\min } &{} t-\overset{n-1}{\underset{ i=1}{\sum }}\overset{n}{\underset{j=i+1}{\sum }}\left( \overset{m}{\underset{ k=1}{\sum }}\left( a_{i,j}^{k}\right) ^{2}\right) \\ \text {s.t.} &{} \begin{array}{c} \overset{n-1}{\underset{i=1}{\sum }}\overset{n}{\underset{j=i+1}{\sum }} \left( \overset{m}{\underset{k=1}{\sum }}\left( a_{i,j}^{k}\left( ||x_{i}-x_{j}||_2^{2}-d_{k}^{2}\right) ^{2}\right) \right) =t, \\ \overset{n-1}{\underset{i=1}{\sum }}\overset{n}{\underset{j=i+1}{\sum }} a_{i,j}^{k}=1,\text { }k=1,\ldots ,m, \\ \overset{m}{\underset{k=1}{\sum }}a_{i,j}^{k}\le 1,\text { }i=1,\ldots ,n-1, \text { }j=i+1,\ldots ,n,\text { } \\ t\ge 0,\text { }x_{i}\in {\mathbb {R}}^{3},\text { }0\le a_{i,j}^{k}\le 1, \text { } \\ k=1,\ldots ,m,\text { }i=1,\ldots ,n-1,\text { }j=i+1,\ldots ,n. \end{array} \end{array} \end{array} \end{array} \end{aligned}$$
(5)

The next result gives a relationship between a uDGP solution and a solution to model (5).

Theorem 2

A pair (gx) is a solution for a feasible uDGP instance associated to a graph \(G=(V,E)\), with \(|V|=n\), \(|E|=m\), \(g:\{1,\ldots ,m\}\rightarrow V\times V\) , and \(x:V\rightarrow {\mathbb {R}}^{3}\), if and only if (txa) is a global optimal solution to (5) with globally optimal objective function value equal to \(-m\).

Proof

If (gx) is a solution for a given uDGP associated to a graph \(G=(V,E)\), with \(|V|=n\), \(|E|=m\), \(g:\{1,\ldots ,m\}\rightarrow V\times V\), \(x:V\rightarrow {\mathbb {R}}^{3}\), and distance values \(d_{1},\ldots ,d_{m}\) related to binary variables \(a_{i,j}^{k}\), such that

$$\begin{aligned} ||x_{i}-x_{j}||=\delta _{i,j}, \end{aligned}$$

where \(\delta _{i,j}=d_{g^{-1}(\{i,j\})}\), we obtain, from Theorem 3, that (xa) is a global optimal solution to (3). Considering this solution in model (5), we have \(a_{i,j}^{k}\in \{0,1\}\) and

$$\begin{aligned} \overset{n-1}{\underset{i=1}{\sum }}\overset{n}{\underset{j=i+1}{\sum }} \left( \overset{m}{\underset{k=1}{\sum }}\left( a_{i,j}^{k}\left( ||x_{i}-x_{j}||^{2}-d_{k}^{2}\right) ^{2}\right) \right) =0, \end{aligned}$$

which implies that \(t=0\) and that (0, xa) is also a global optimum solution to model (5), with globally optimal objective function value equal to \(-m\).

Let us consider the other direction of the theorem. For a global optimal solution (txa) of the model (5), if there exist positive integers \(l_{1},l_{2},l_{3}\) with \(l_{1}\le n-1\), \(l_{2}\le n\), \(l_{3}\le m\), such that

$$\begin{aligned} 0<a_{l_{1},l_{2}}^{l_{3}}<1, \end{aligned}$$

then

$$\begin{aligned} \overset{n-1}{\underset{i=1}{\sum }}\overset{n}{\underset{j=i+1}{\sum }} \left( \overset{m}{\underset{k=1}{\sum }}\left( a_{i,j}^{k}\right) ^{2}\right) <\overset{n-1}{\underset{i=1}{\sum }}\overset{n}{\underset{j=i+1}{\sum }}\left( \overset{m}{\underset{k=1}{\sum }}a_{i,j}^{k}\right) . \end{aligned}$$
(6)

Since we are considering a feasible uDGP, for \(k=1,\ldots ,m\),

$$\begin{aligned} \overset{n-1}{\underset{i=1}{\sum }}\overset{n}{\underset{j=i+1}{\sum }} a_{i,j}^{k}=1\Rightarrow \overset{n-1}{\underset{i=1}{\sum }}\overset{n}{ \underset{j=i+1}{\sum }}\left( \overset{m}{\underset{k=1}{\sum }} a_{i,j}^{k}\right) =m, \end{aligned}$$

implying that, from (6) and \(t\ge 0\),

$$\begin{aligned} \overset{n-1}{\underset{i=1}{\sum }}\overset{n}{\underset{j=i+1}{\sum }} \left( \overset{m}{\underset{k=1}{\sum }}\left( a_{i,j}^{k}\right) ^{2}\right) <m\Rightarrow t-\overset{n-1}{\underset{i=1}{\sum }}\overset{n}{ \underset{j=i+1}{\sum }}\left( \overset{m}{\underset{k=1}{\sum }}\left( a_{i,j}^{k}\right) ^{2}\right) >t-m, \end{aligned}$$

which is a contradiction, because we already know that \(-m\) is the optimal value for model (5). Thus,

$$\begin{aligned} a_{i,j}^{k}\in \{0,1\}, \end{aligned}$$

for \(i=1,\ldots ,n-1\), \(j=i+1,\ldots ,n\), and \(k=1,\ldots ,m\). From the constraints of problem (5), the values \(a_{i,j}^{k}\) assign pairs \((i,j)\in V\times V\) such that

$$\begin{aligned} ||x_{i}-x_{j}||=d_{k}\text {, }k=1,\ldots ,m\text {.} \end{aligned}$$
(7)

This implicitly defines a weighted graph \(G=(V,E,\delta )\), \(\delta :E\rightarrow {\mathbb {R}}\), with vertices and edges related to \( x_{1},\ldots ,x_{n}\) and pairs (ij), respectively, an injective function \( g:\{1,\ldots ,m\}\rightarrow V\times V\), such that

$$\begin{aligned} \delta _{i,j}=d_{g^{-1}(\{i,j\})}, \end{aligned}$$

and a realization of G, \(x:V\rightarrow {\mathbb {R}}^{3}\), that satisfies ( 7). Thus, (gx) is a solution of the uDGP associated to the distance values \(d_{1},\ldots ,d_{m}\) and the graph \(G=(V,E,d)\). \(\square \)

From the proof above, note that model (5) also provides a “certificate” of infeasibility of the uDGP instance if the globally optimal objective function value is strictly greater than \(-m\).

2.2 A heuristic approach

Model (5) can solve larger instances, compared to model (3), but to solve instances with hundreds of atoms, we propose a new heuristic inspired by the TRIBOND method [14] and model (5).

First, we need to find a “core” (positions in \({\mathbb {R}}^{3}\) for five vertices with ten associated distances provided from the list of distance values), solving model (5) considering just five points, and then increase its size by adding one vertex position at a time solving a modification of model (5), where four random points (already fixed) are used in order to find the next position:

  1. 1.

    Find a core \(x_{1},\ldots ,x_{5}\in {\mathbb {R}}^{3}\) solving the problem

    $$\begin{aligned} \begin{array}{c} \begin{array}{c} \begin{array}{rc} \underset{t,x_{1},\ldots ,x_{5},a_{i,j}^{k}}{\min } &{} t-\overset{4}{\underset{i=1 }{\sum }}\overset{5}{\underset{j=i+1}{\sum }}\left( \overset{m}{\underset{k=1 }{\sum }}\left( a_{i,j}^{k}\right) ^{2}\right) \\ \text {s.t.} &{} \begin{array}{c} \overset{4}{\underset{i=1}{\sum }}\overset{5}{\underset{j=i+1}{\sum }}\left( \overset{m}{\underset{k=1}{\sum }}\left( a_{i,j}^{k}\left( ||x_{i}-x_{j}||_2^{2}-d_{k}^{2}\right) ^{2}\right) \right) =t, \\ \overset{4}{\underset{i=1}{\sum }}\overset{5}{\underset{j=i+1}{\sum }} a_{i,j}^{k}=1,\text { }k=1,\ldots ,m, \\ \overset{m}{\underset{k=1}{\sum }}a_{i,j}^{k}\le 1,\text { }i=1,\ldots ,4,\text { }j=i+1,\ldots ,5,\ \ \\ t\ge 0,\text { }x_{1},\ldots ,x_{5}\in {\mathbb {R}}^{3},\text { }0\le a_{i,j}^{k}\le 1,\text { }k=1,\ldots ,m. \end{array} \end{array} \end{array} \end{array} \end{aligned}$$
    (8)
  2. 2.

    For \(i=6,\ldots ,n\), solve the problem

    $$\begin{aligned} \begin{array}{c} \begin{array}{c} \begin{array}{rc} \underset{t,x_{i},a_{i,j}^{k}}{\min } &{} t-\overset{m_{i}}{\underset{k=1}{ \sum }}\left( \underset{j\in J}{\sum }\left( a_{i,j}^{k}\right) ^{2}\right) \\ \text {s.t.} &{} \begin{array}{c} \underset{j\in J}{\sum }\overset{m_{i}}{\underset{k=1}{\sum }}\left( a_{i,j}^{k}\left( ||x_{i}-y_{j}||_2^{2}-d_{k}^{2}\right) ^{2}\right) =t \\ \underset{j\in J}{\sum }a_{i,j}^{k}=1, \ k=1,\ldots ,m_{i}, \\ \overset{m_{i}}{\underset{k=1}{\sum }}a_{i,j}^{k}\le 1,\text { }j\in J, \\ t\ge 0,\text { }x_{i}\in {\mathbb {R}}^{3},\text { }0\le a_{i,j}^{k}\le 1, \text { }k=1,\ldots ,m_{i},\text { }j\in J, \end{array} \end{array} \end{array} \end{array} \end{aligned}$$
    (9)

    where \(x_{i}\in {\mathbb {R}}^{3}\) is the position to be determined, J is a random set with four indices related to already fixed points \(y_{j}\in {\mathbb {R}}^{3},\) \(j\in J\subset \{1,\ldots ,i-1\}\), and \(m_{i}\) is the number of available distances.

  3. 3.

    If a set of compatible distances cannot be found for some \(i=6,\ldots ,n\), find a new core (go to Step 1) and restart.

The importance of a core in Step 1 is to allow, with high probability [14], to start correctly the reconstruction of the molecular structure. After finding a core, the geometric idea of Step 2 is to intersect fours spheres [32] (centered at points \(y_{j}\)), which gives one point if there are consistent distance values (radii of the spheres) from the list of distances.

3 Computational results

We generate uDGP instances in the following way. We consider a sequence of covalently connected atoms indexed by \(1,\ldots ,n.\) The 3D structure of the instance can be defined in terms of the lengths of the covalent bonds \(d_{1,2},\ldots ,d_{n-1,n}\), covalent angles \(\theta _{1,3},\ldots ,\theta _{n-2,n}\) (formed by three consecutive atoms), and torsion angles \( \omega _{1,4},\ldots ,\omega _{n-3,n}\) (formed by four consecutive atoms).

By fixing the lengths of the covalent bonds (\(d_{i-1,i}=1.0\)) and the values of the covalent angles (\(\theta _{i-2,i}=2.0\) radians), a 3D molecular structure is determined by the torsion angles \(\omega _{1,4},\ldots ,\omega _{n-3,n}\in [0,2\pi ]\), randomly chosen from the set \(\{\frac{\pi }{3},\pi ,\frac{5\pi }{3}\}\). More details about instance generation are given in [16, 20].

Differently from TRIBOND and LIGA methods, instances with incomplete list of distances, i.e. \(m<\frac{n(n-1)}{2}\), can be easier to solve by the proposed approach, since there will be many global optimum solutions for the model (5). Thus, in order to guarantee a unique solution, we consider instances with all the distances \(d_{1},\ldots ,d_{m}\), where \(m=\frac{ n(n-1)}{2}\).

We used the software AMPL with the solver Baron 17.4.1 on a Lenovo notebook, with 6 MB RAM and intel celeron 1.6 GHz, to solve the three models proposed: M1 (3), M2 (5), and M3 (9).

For all values of n, we generated 5 random instances according to the procedure described above.

For models M1 and M2, we stopped with \(n=5\) and \(n=10\), respectively, because no solution was found considering 3000 minutes as the limit time (see Table 1).

Table 1 Number of solutions found with time limit = 3,000 min
Table 2 Computational time in minutes

For each n, Table 2 shows the average of the computational time, in minutes, necessary to solve all the 5 instances randomly generated.

From Tables 1 and 2, we notice that the proposed heuristic for solving problem (5) finds the global optimum solutions, in all the cases, in a reasonable time. This means that, during the execution of the heuristic, for a core found in Step 1, problem (9) was solved in Step 2, for all \(i=6,\ldots ,n\).

4 Conclusions

TRIBOND and LIGA are the first generation methods to solve the uDGP applied to molecular conformation problems and computational results presented in this paper point to different approaches that could be starting points for new research directions.

We are particularly interested in such kind of problems related to protein structure calculations using distance information given by NMR experiments, called the Molecular DGP (MDGP) [28]. Many methods applied to the MDGP suppose that the available distances are pre-assigned to the pairs of atoms. However, as we mentioned in the Introduction, the data that is actually provided by NMR consists of just a list of distances.

The geometry of proteins allows us to define vertex orders \(v_{1},\ldots ,v_{n}\) on the associated graph \(G=(V,E)\) [8, 25] such that

  1. 1.

    The first four vertices can be fixed in \({\mathbb {R}}^{3}\), since they define a clique;

  2. 2.

    Each vertex with rank greater than four is adjacent to at least two contiguous predecessors, i.e.

    $$\begin{aligned} \forall i>4,\{v_{i-2},v_{i}\},\{v_{i-1},v_{i}\}\in E. \end{aligned}$$

Property 1 can help to define the core (instead of solving problem (8)) and property 2 can be useful in the definition of the set J in the model (9).

In [23], the authors propose a new vertex order for protein molecules where pairs \(\{v_{i-3},v_{i}\}\) can also be included in the set of edges of the associated graph. When distances \(d_{i-3,i}\) are also known, the search space of the problem can be discretized and if the assignment function g is defined in advance, we have the so-called Discretizable MDGP (DMDGP) [17, 18], allowing the application of a combinatorial method, called Branch-and-Prune (BP) [27].

As mentioned in the recent survey on DGP’s [7], experiences with TRIBOND and LIGA, together with recent results on BP methods for DGP’s [13, 26, 33], emphasize the importance of vertex orders in molecular reconstruction from distance information.

Our main research direction now is to consider protein vertex orders in the models proposed in this work to deal with uncertainties in the NMR distance information, already discussed in many approaches to the aDGP [1,2,3, 9, 10, 19, 24, 38, 42].

The proposed models can also incorporate uncertainties representing distance values as interval distances \([{\underline{d}}_{k},{\overline{d}}_{k}]\), \(0< {\underline{d}}_{k}\le {\overline{d}}_{k}\), where precise distance values \(d_{k} \) are replaced by \({\underline{d}}_{k}+\lambda _{k}w_{k}\), with \(0\le \lambda _{k}\le 1\) and \(w_{k}={\overline{d}}_{k}-{\underline{d}}_{k}\) [38], implying that \(\lambda _{k}\) would be new variables and \(w_{k}\) new input data.