1 Introduction

We discuss an interesting hybrid of two problems: the Maximum Feasible Subsystem (MaxFS) [17] and the Distance Geometry Problem (DGP) [25], and its application to the problem of determining the spatial conformation of proteins from distance data derived from Nuclear Magnetic Resonance (NMR) experiments.

The MaxFS is as follows: given a set of constraints, generally of the form

$$\begin{aligned} \forall i \in I \qquad g^L_i \le g_i(x) \le g^U_i, \end{aligned}$$
(1)

determine a subset \(S \subseteq I\) of maximum cardinality such that the set of constraints of Eq. (1) indexed by \(S\) is feasible. This problem is \(\mathbf {NP}\)-hard to solve, and does not admit a polynomial-time approximation scheme unless \({\mathbf {P}}=\mathbf {NP}\) [5].

The (Euclidean) DGP is as follows: given an integer \(K>0\) and a simple connected edge-weighted graph \(G=(V,E,d)\), where \(d:E\rightarrow {\mathbb {R}}_+\), determine whether there exists a realization \(x:V\rightarrow {\mathbb {R}}^K\) such that:

$$\begin{aligned} \forall \{i,j\}\in E\qquad \Vert x_i-x_j\Vert _2=d_{ij}. \end{aligned}$$
(2)

There are many applications of the DGP [25] and even more variants. The one we are especially interested in is the interval DGP (iDGP), which replaces \(d:E\rightarrow {\mathbb {R}}_+\) with the interval weight function \(d:E\rightarrow \mathbb {IR}_+\) such that \(d(\{i,j\})=[L_{ij},U_{ij}]\) [15, 21]. Specifically, Eq. (2) becomes

$$\begin{aligned} \forall \{i,j\}\in E\qquad L_{ij}\le \Vert x_i-x_j\Vert _2\le U_{ij}. \end{aligned}$$
(3)

The iDGP is \(\mathbf {NP}\)-hard by inclusion of the DGP: the inclusion follows from the case \(L_{ij}=U_{ij}\) for all \(\{i,j\}\in E\). We note that if \(L_{ij}=0\) and \(U_{ij}\) is infinite (for all \(\{i,j\}\in E\)) the problem is clearly tractable since any realization is feasible. The existence of a “phase transition” threshold between tractability and hardness of this problem is an open question [1]. From now on, we shall assume all norms are Euclidean (a.k.a. \(\ell _2\)) unless stated otherwise. We also denote \(n=|V|\) and \(m=|E|\).
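As a concrete illustration, the following minimal sketch (in Python; the dict-of-intervals edge encoding and all names are our illustrative assumptions) checks which constraints of Eq. (3) a candidate realization satisfies:

```python
# Hedged sketch: verify the iDGP constraints of Eq. (3) for a candidate
# realization. The dict-of-intervals edge encoding is an illustrative choice.
import numpy as np

def satisfied_edges(x, edges, tol=1e-9):
    """x: (n, K) array of positions; edges: {(i, j): (L_ij, U_ij)}.
    Returns the edges whose constraint L_ij <= ||x_i - x_j||_2 <= U_ij holds."""
    ok = set()
    for (i, j), (L, U) in edges.items():
        dij = np.linalg.norm(x[i] - x[j])   # Euclidean (l2) distance
        if L - tol <= dij <= U + tol:       # tolerance absorbs rounding error
            ok.add((i, j))
    return ok
```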

We are now in the position of stating the main problem discussed in this paper.

Max Feasible Subsystem of Distance Geometry constraints (MaxFS\({}_{\text{ DGP }}\)). Given an integer \(K>0\) and a simple connected edge-weighted graph \(G=(V,E,d)\) with \(d:E\rightarrow \mathbb {IR}_+\), determine a maximum cardinality subset \(S\subseteq E\) such that there exists a realization \(x:V[S]\rightarrow {\mathbb {R}}^K\) of the subgraph of G induced by S satisfying

$$\begin{aligned} \forall \{i,j\}\in S\qquad L_{ij}\le \Vert x_i-x_j\Vert \le U_{ij}. \end{aligned}$$
(4)

We recall that \(d(\{i,j\})=[L_{ij},U_{ij}]\) for each \(\{i,j\}\in E\), as explained above. The connectedness assumption, as in many problems on graphs, is without loss of generality: for a disconnected graph it suffices to find the connected components in polynomial time, and solve the problem on each connected component independently.

The MaxFS\({}_{\text{ DGP }}\) is motivated by a specific application of the DGP, namely the determination of the shape of proteins given some of their inter-atomic distances. In principle, NMR can determine all inter-atomic distances in a given protein up to a certain length threshold (somewhere between 5Å and 6Å). In practice, reality is fuzzier than this. First, proteins rarely crystallize (so X-ray crystallography does not help), but usually live in solution. Secondly, proteins vibrate, but we assume that they do not (this is called the “molecular rigidity assumption”) [28].

NMR experiments yield a probability distribution over triplets (atom label, atom label, distance value); this distribution is used to imperfectly reconstruct the weighted graph G that is the actual input to the DGP. It is not easy to understand the methods used by NMR machinery to perform this reconstruction, other than that they are mostly based on simulated annealing [31]; attempts to construct the conformations starting directly from NMR output are under way [29]. According to [8], this process induces two types of errors: experimental errors (due to the rigidity assumption), and systematic errors (due to the imperfect reconstruction). Specifically, the experimental errors are accommodated by the interval bounds on the iDGP. The systematic errors are described in [8] as consisting of a certain proportion of completely wrong distances. This induces sets of constraints in (3) that are likely to be infeasible. We propose the MaxFS\({}_{\text{ DGP }}\) in order to address this issue and find solutions subject to such limitations.

The focus of this paper is to solve MaxFS\({}_{\text{ DGP }}\) instances using formulations and solution methods from Mathematical Programming (MP). In particular, we present exact formulations and several reformulations thereof. We construct a practically viable solution method based on solving an approximate matrix formulation of MaxFS\({}_{\text{ DGP }}\) followed by rank reduction and a refinement phase. Since Polynomial Programs (PP) offer more reformulation opportunities than general Nonlinear Programs (NLP), \(\ell _2\) norm terms are always squared.

The rest of this paper is organized as follows. In Sect. 2 we introduce an exact formulation of the MaxFS\({}_{\text{ DGP }}\) problem. In Sect. 3 we construct a relaxation in the same primal variables as the exact formulation. In Sect. 4 we construct some matrix relaxations, and propose methodologies for reducing solution rank and improving the quality of the low-rank solutions. In Sect. 5 we describe two computational approaches to solve MaxFS\({}_{\text{ DGP }}\) instances, based on the formulations of Sects. 3 and 4, and discuss computational results obtained from protein instances of small and medium sizes.

2 Exact formulation

In this section we present a Mathematical Programming (MP) formulation of the MaxFS\({}_{\text{ DGP }}\) problem.

2.1 Experimental errors

Experimental errors are addressed by minimizing the infeasibilities w.r.t. Eq. (2) or Eq. (3). A well-known box-constrained formulation targeting the DGP is:

$$\begin{aligned} \min \limits _{x\in [x^L,x^U]} \sum \limits _{\{i,j\}\in E} (\Vert x_i-x_j\Vert ^2 - d_{ij}^2)^2, \end{aligned}$$
(5)

where \(x^L,x^U\) are given lower and upper bounds for the decision variable \(n\times K\) matrix \(x=(x_1,\ldots ,x_n)^T\). Equation (5) was tested computationally in e.g. [20]. A corresponding reformulation for the iDGP can be obtained replacing each term \(\Vert x_i-x_j\Vert ^2 - d_{ij}^2\) of Eq. (5) with

$$\begin{aligned} \max (0,L_{ij}^2-\Vert x_i-x_j\Vert ^2) + \max (0,\Vert x_i-x_j\Vert ^2-U_{ij}^2), \end{aligned}$$

so that setting the objective to zero corresponds to the (nonconvex) pure feasibility problem:

$$\begin{aligned} \left. \begin{array}{rrcl} \forall \{i,j\}\in E &{} \Vert x_i-x_j\Vert ^2 &{}\ge &{} L_{ij}^2 \\ \forall \{i,j\}\in E &{} \Vert x_i-x_j\Vert ^2 &{}\le &{} U_{ij}^2 \\ &{} x^L \le x &{}\le &{} x^U. \end{array}\right\} \end{aligned}$$
(6)

We remark that \(x^L,x^U\) may or may not be provided by the specific DGP application being tackled; but, even when they are not explicitly provided, Proposition 2.2 below can be used to derive them. We also note that Eq. (5) and Eq. (6) can be solved without bound constraints. In the latter case, it may be advisable to impose a zero centroid constraint instead, which removes translation invariance:

$$\begin{aligned} \frac{1}{n}\sum \limits _{i\in V} x_i = 0. \end{aligned}$$
(7)

Henceforth, we shall refrain (for brevity) from mentioning bound or zero centroid constraints on realization variables in the MP formulations below, unless the context requires it.
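A minimal sketch of the resulting interval penalty follows (illustrative names; the function returns zero exactly when the pure feasibility system of Eq. (6) holds, bound constraints aside):

```python
import numpy as np

def interval_penalty(x, edges):
    """Sum over E of max(0, L^2 - ||x_i-x_j||^2) + max(0, ||x_i-x_j||^2 - U^2),
    i.e. the iDGP variant of the objective in Eq. (5)."""
    total = 0.0
    for (i, j), (L, U) in edges.items():
        sq = float(np.dot(x[i] - x[j], x[i] - x[j]))   # squared l2 norm
        total += max(0.0, L ** 2 - sq) + max(0.0, sq - U ** 2)
    return total
```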

2.2 Systematic errors

The MaxFS\({}_{\text{ DGP }}\) can be formulated in a natural way, using so-called “big-M” techniques and binary decision variables to linearly represent disjunctions as follows [4]:

$$\begin{aligned} \left. \begin{array}{rrcl} \max &{} \sum \limits _{\{i,j\}\in E} y_{ij} &{} &{} \\ \forall \{i,j\}\in E &{} d_{ij}^2 - M(1-y_{ij}) \le &{}\Vert x_i-x_j\Vert ^2 &{}\le d_{ij}^2 + M(1-y_{ij})\\ &{} {y} &{}{\in \{0,1\}^{m}.}&{} \end{array}\right\} \end{aligned}$$
(8)

In the above, it is clear that \(y_{ij}=1\) enforces the constraint on edge \(\{ i,j \}\) in the DGP, i.e. Eq. (2), whereas \(y_{ij}=0\) relaxes it.

Note that, since distances are always non-negative, the LHS of the distance constraints in Eq. (8) can be tightened to \(d_{ij}^2y_{ij}\). This yields:

$$\begin{aligned} \left. \begin{array}{rrcl} \max &{} \sum \limits _{\{i,j\}\in E} y_{ij} &{} &{} \\ \forall \{i,j\}\in E &{} d_{ij}^2y_{ij}\le &{}\Vert x_i-x_j\Vert ^2 &{}\le d_{ij}^2 + M(1-y_{ij}) \\ &{} y &{}\in &{} \{0,1\}^{m}. \end{array}\right\} \end{aligned}$$
(9)

Lemma 2.1

The y component of every optimal solution for the MaxFS\({}_{\text{ DGP }}\)  defines a connected graph.

Proof

Suppose, by contradiction, that an optimal solution consists of at least two connected components \(C_1\) and \(C_2\). Since the MaxFS\({}_{\text{ DGP }}\) is defined on a connected graph, there exists a path given by a sequence of vertices \(v_1\), \(v_2\),...,\(v_k\) (with \(k\ge 2\)) connecting \(C_1\) with \(C_2\), where \(v_1\) belongs to \(C_1\), \(v_k\) belongs to \(C_2\), and \(v_2\),...,\(v_{k-1}\) belong to neither. Finding a realization for the vertices of the path is trivial, and it is always possible to move \(C_1\) and \(C_2\) rigidly so that their internal distance constraints remain satisfied together with the positions of \(v_1\) and \(v_k\). Therefore, adding the path to the starting solution yields a feasible solution for the MaxFS\({}_{\text{ DGP }}\) satisfying the distances of \(k-1\) (\(\ge 1\)) additional edges, which contradicts the optimality of the starting solution. \(\square \)

Proposition 2.2

If \(M=(\sum _{\{ i,j \} \in E} d_{ij})^2\), then the optimal solution of Eq. (8) solves the MaxFS\({}_{\text{ DGP }}\).

Proof

First, we claim that any feasible DGP instance can be realized in a sphere of radius \(R=\frac{1}{2}\sum \limits _{\{i,j\}\in E} d_{ij}\). A cycle graph C on \(V=\{1,2,\ldots ,n\}\) with \(E=\{\{1,2\},\{2,3\},\ldots ,\{n-1,n\},\{1,n\}\}\) with \(d_{1n}=\sum _{\{i,j\}\in E\setminus \{\{1,n\}\}} d_{ij}\) can be realized on a straight segment of length \(r=d_{1n}\) embedded in any Euclidean space [33]; if this segment is centered about the origin it belongs by construction to the sphere \(R{\mathbb {S}}^{K-1}\). Any other biconnected graph on n vertices will have more cycles than C, and hence will induce realizations in \({\mathbb {R}}^K\) having segments shorter than 2R when projected on any coordinate axis. Connected but non-biconnected graphs are the same as trees: the tree yielding a realization with longest segment projection on any coordinate axis is the path on n vertices realized as a segment of length 2R; again, by centering the segment it is easy to see that the path can be realized in a sphere of radius R. Lastly, we note that the above claim also shows that the maximum possible distance between two vertices \(i,j\) in a realization of an optimal solution for the MaxFS\({}_{\text{ DGP }}\) is 2R, since it must induce a connected subgraph according to Lemma 2.1. This shows that if a MaxFS\({}_{\text{ DGP }}\) instance has a solution with a certain support vector \(y^*\) for the maximum cardinality set of feasible constraints, then setting \(y=y^*\) in Eq. (8) will induce a valid realization \(x^*\) of the subgraph consisting of the edges \(\{i,j\}\) for which \(y^*_{ij}=1\), and vice versa. \(\square \)

In practice, segment realizations are extremely rare, and therefore M can be tightened w.r.t. Proposition 2.2. We remark that bounds on M can also be inferred from \(x^L,x^U\), if they are given; and, conversely, that \([x^L,x^U]\) can be set to \([-\sqrt{M},\sqrt{M}]\) if the application field does not explicitly provide them.
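For illustration, a small helper in the spirit of Proposition 2.2 might read as follows; summing the upper endpoints \(U_{ij}\) in place of the exact \(d_{ij}\) is our (conservative) assumption for interval data:

```python
def big_m_and_box(edges):
    """M = (sum of edge distances)^2 as in Proposition 2.2, plus the derived
    box bounds [-sqrt(M), sqrt(M)] mentioned in the text. For interval data
    we conservatively sum the upper endpoints U_ij (an assumption)."""
    M = sum(U for (_L, U) in edges.values()) ** 2
    return M, -M ** 0.5, M ** 0.5
```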

2.3 Systematic and experimental errors together

We consider Eq. (6) and employ the y binary variables as in Eq. (9) to activate/deactivate the distance constraints. This yields a valid MP formulation for the MaxFS\({}_{\text{ DGP }}\), as follows:

$$\begin{aligned} \left. \begin{array}{rrcll} \max &{} \sum \limits _{\{i,j\}\in E} y_{ij} &{}&{} &{} \\ \forall \{i,j\}\in E &{} \Vert x_i-x_j\Vert ^2 &{}\ge &{} L_{ij}^2y_{ij} &{} (\text{ eL})\\ \forall \{i,j\}\in E &{} \Vert x_i-x_j\Vert ^2 &{}\le &{} U_{ij}^2 + M(1-y_{ij}) &{} (\text{ eU}) \\ &{} y &{}\in &{} \{0,1\}^m &{} \\ &{} x^L \le &{}x &{}\le x^U. &{} \end{array}\right\} \end{aligned}$$
(10)

The correctness of Eq. (10) is easy to establish: constraints (eL) and (eU) ensure that \(\Vert x_i-x_j\Vert \) is in the desired interval \([L_{ij},U_{ij}]\) as long as \(y_{ij}=1\), i.e. the \(\{i,j\}\)-th constraint is imposed. Otherwise \(\Vert x_i-x_j\Vert \in [0,\sqrt{U_{ij}^2+M}]\), meaning that the constraint is relaxed. The objective function ensures that as many constraints as possible are imposed.

Moreover, according to [8], the fraction p of wrong distances can be estimated statistically a priori. We can encode this knowledge by means of the additional cardinality constraint:

$$\begin{aligned} \sum \limits _{\{i,j\}\in E} y_{ij} \ge (1-p)\,m, \end{aligned}$$
(11)

which makes Eq. (10) infeasible whenever Eq. (11) is not satisfied.

Equation (10) is a nonconvex Mixed-Integer Nonlinear Program (MINLP), which represents one of the hardest classes in MP. MINLP is an uncomputable class, in general [26]. For bounded decision variables (such as in Eq. (10)) it is computable, but \(\mathbf {NP}\)-hard [22].

The state of the art in MINLP solution is not as advanced as for Mixed-Integer Linear Programming (MILP), for which, despite the hardness, relatively large scale instances can be solved either to optimality or at least to feasibility. Preliminary tests on Eqs. (10)–(11) showed that no feasible solutions could be found within a given maximum CPU time limit. Removing Eq. (11) simply yielded the trivial solution with \(y_{ij}=0\) for all \(\{i,j\}\in E\), again within the same time limit.

In the rest of the paper we shall investigate reformulations of many types in order to obtain “solver-friendlier” MP formulations for the MaxFS\({}_{\text{ DGP }}\). This investigation involves MINLP relaxations in the original nK-dimensional space of realizations (Sect. 3), as well as MILP approximations in a larger \(\frac{n(n+1)}{2}\)-dimensional space of symmetric matrices (Sect. 4).

The order of presentation of these formulations is dictated by a “mathematical programmer’s common sense” regulated by computational experience: we believe that a certain reformulation might solve the issues of the preceding formulation; then, during preliminary testing, we discover new issues. Thus, while the whole reformulation sequence explains how we came to the one that actually works in practice, we do not systematically test every formulation we present. Specifically, in the computational results Sect. 5 we only solve Eq. (16) and test a solution method based on the pair (Eqs. 23, 26).

3 Primal relaxations and approximations

The fact that solving Eq. (10) with a CPU time limit yields a solution where all y variables are set to zero is a witness to the empirical observation that feasibility is harder to achieve than optimality. Setting \(y_{ij}=0\) and \(x_i=x_j\) for all \(\{i,j\}\in E\) yields a feasible solution with the worst possible objective function value. Obviously, the major contributors to the feasibility issue are the constraints (eL) and (eU) in Eq. (10). We note that, when the integrality of the y variables is relaxed, (eL) is nonconvex and (eU) is convex. We shall discuss relaxations and approximations of (eL) in Sect. 4.1. In this section, we introduce a method for approximating (eU).

First, we replace \(U_{ij}^2\) by \(U_{ij}^2y_{ij}\), and \(M(1-y_{ij})\) by an additional variable \(r_{ij}\ge 0\), for \(\{i,j\}\in E\). We then make sure that \(r_{ij}=0\) whenever \(y_{ij}=1\). This yields:

$$\begin{aligned} \forall \{i,j\}\in E \quad \Vert x_i-x_j\Vert ^2 \le U_{ij}^2y_{ij} + r_{ij} \end{aligned}$$
(12)
$$\begin{aligned} \forall \{i,j\}\in E \quad 0 \le r_{ij} \le M(1-y_{ij}). \end{aligned}$$
(13)

This is an exact reformulation whose only difference with Eq. (10) is that the systematic error w.r.t. \(U_{ij}\) is represented by \(r_{ij}\) for each edge \(\{i,j\}\in E\).

3.1 Bi-objective relaxation

Some preliminary experiments on the exact formulation based on Eqs. (12)–(13) above presented us with the following unusual issue: our Branch-and-Bound (BB) solver of choice failed to find any feasible solution using the value of M given by Proposition 2.2, which meant that the BB algorithm could never prune by bound, which then resulted in a rapid exponential growth of the BB tree. Setting M to unreasonably large values allowed the solver to find a feasible solution rapidly. While, as is well known, a large M yields a slacker bound, it is obviously better to prune by bound ineffectively than not at all. The issue, however, is now that of finding good values of M, which — in our preliminary experiments at least — appeared to vary considerably by instance.

This motivated us to construct a formulation where Eq. (13) is turned from a hard constraint into a soft one, making sure that \(r_{ij}\) remains small. We achieve this by introducing a second objective function \(\min \sum _{ij} r_{ij}\). This yields a bi-objective MINLP relaxation

$$\begin{aligned} \max \sum \limits _{\{i,j\}\in E} y_{ij} \end{aligned}$$
(14)
$$\begin{aligned} \min \sum \limits _{\{i,j\}\in E} r_{ij} \end{aligned}$$
(15)

subject to Eq. (12), (eL), and the other constraints in Eq. (10) aside from (eU). Note that Eq. (13) is relaxed. Moreover, Eq. (14) addresses the systematic error and Eq. (15) addresses the experimental error. We denote this formulation by \(({\mathsf {B}})\).

While bi-objective programs can hardly be considered exact reformulations of single-objective ones, in the following we give some limited sufficient and necessary conditions linking optimality in Eq. (10) and Pareto optimality in \(({\mathsf {B}})\).

Note that any solution \((x^*,y^*)\) feasible in Eq. (10) can be extended to a solution of \(({\mathsf {B}})\) by setting \(r^*_{ij}=0\) iff \(y^*_{ij}=1\) and \(r^*_{ij}=\Vert x^*_i-x^*_j\Vert ^2\) otherwise. Whenever \((x^*,y^*)\) is feasible in Eq. (10) and the existence of \(r^*\) is implicitly referred to in the text, we assume it was computed from \(x^*,y^*\) as above. For a binary vector \(y\in \{0,1\}^m\) we denote by \(\mathsf {supp}(y)\) the set of edges \(\{i,j\}\in E\) for which \(y_{ij} = 1\).

3.1.1 Sufficient optimality conditions

We first prove that Pareto optimal solutions of \(({\mathsf {B}})\) are good candidate optima for Eq. (10), as long as one knows the optimal objective value of Eq. (10).

Lemma 3.1

Suppose \((x',y',r')\) is such that: (i) it is feasible in \(({\mathsf {B}})\); (ii) it has \(r'_{ij}=0\) for all \(\{i,j\}\in E\) with \(y'_{ij}=1\). Then \((x',y')\) is feasible in Eq. (10).

Proof

This follows by inspection of Eq. (13), the counterpart of constraint (eU) in Eq. (10), which is the only constraint relaxed in \(({\mathsf {B}})\). \(\square \)

Proposition 3.2

Suppose \((x',y',r')\) is such that: (i) it is Pareto optimal in \(({\mathsf {B}})\); (ii) the value of Eq. (14) is optimal for Eq. (10). Then \((x',y')\) is optimal in Eq. (10).

Proof

Consider any edge \(\{h,\ell \}\in E\) with \(y'_{h\ell }=1\), and suppose \(r'_{h\ell }>0\). Define \({\hat{r}}\) s.t. \({\hat{r}}_{ij}=r'_{ij}\) for all \(\{i,j\}\not =\{h,\ell \}\), and \({\hat{r}}_{h\ell }=0\). By Eq. (12) the solution \((x',y',{\hat{r}})\) is feasible in \(({\mathsf {B}})\) and dominates \((x',y',r')\) since

$$\begin{aligned} \sum _{\{i,j\}\in E}{\hat{r}}_{ij}=\sum _{\begin{array}{c} \{i,j\}\in E \\ \{i,j\}\not =\{h,\ell \} \end{array}} r'_{ij}<\sum _{\{i,j\}\in E} r'_{ij}, \end{aligned}$$

against the assumption. Therefore \(r'_{h\ell }=0\), which, by Lemma 3.1, implies that \((x',y')\) is feasible in Eq. (10). Optimality follows by (ii). \(\square \)

3.1.2 Necessary optimality conditions

Next, we give a support-dependent characterization of Pareto dominance in \(({\mathsf {B}})\) w.r.t. optimality in Eq. (10).

Theorem 3.3

No optimal solution \( (x^*,y^*)\) of Eq. (10) can be dominated in \(({\mathsf {B}})\) by a Pareto solution \((x',y',r')\) of \(({\mathsf {B}})\) where \(\mathsf {supp}(y^*)\subseteq \mathsf {supp}(y')\).

Proof

Suppose first that \((x^*,y^*)\) is dominated by \((x',y',r')\) w.r.t. the objective function Eq. (15), i.e. \(\sum _{ij}r'_{ij}<\sum _{ij}r^*_{ij}\). For all \(\{i,j\}\) with \(y^*_{ij}=1\) we have \(r^*_{ij}=0\) by Eq. (13) (and also \(y'_{ij}=1\) since \(\mathsf {supp}(y^*)\subseteq \mathsf {supp}(y')\)). So, no decrease in Eq. (15) can be achieved over the edges \(\{i,j\}\) for which \(y^*_{ij}=y'_{ij}=1\). For all \(\{i,j\}\) with \(y^*_{ij}=y'_{ij}=0\) we have \(r^*_{ij}=\Vert x^*_i-x^*_j\Vert ^2\) by definition. Note that the relaxation of Eq. (13) does not change the feasible region restricted to \(y_{ij}=0\), so we can assume without loss of generality that \(x^*=x'\) over edges \(\{i,j\}\in E\) for which \(y^*_{ij}=y'_{ij}=0\), whence \(r'_{ij}\ge \Vert x'_i-x'_j\Vert ^2=\Vert x^*_i-x^*_j\Vert ^2=r^*_{ij}\) by Eq. (12). Because of the optimization direction of Eq. (15), we have \(r'_{ij}=r^*_{ij}\) for all \(\{i,j\}\) s.t. \(y^*_{ij}=y'_{ij}=0\). So, again, no decrease in Eq. (15) can be achieved over these edges. The only possible decrease in the value of Eq. (15) must therefore occur over edges \(\{i,j\}\in E\) where \(0=y^*_{ij}<y'_{ij}=1\). But then this means that \((x^*,y^*)\) is dominated w.r.t. the objective Eq. (14).

We therefore assume that \((x^*,y^*)\) is dominated by \((x',y',r')\) w.r.t. Eq. (14). Then \(\sum _{ij}y'_{ij}>\sum _{ij}y^*_{ij}\). By optimality of \((x^*,y^*)\) in Eq. (10), \((x',y')\) cannot be feasible in Eq. (10), which means that \((x',y',r')\) does not satisfy Eq. (13), i.e. \(r'_{h\ell }>0\) for some edge \(\{h,\ell \}\in E\) where \(y'_{h\ell }=1\). Since \(\mathsf {supp}(y^*)\subseteq \mathsf {supp}(y')\), and \(r^*_{ij}=0\) whenever \(y^*_{ij}=1\), we have

$$\begin{aligned} \sum _{\begin{array}{c} \{i,j\}\in E \\ \{i,j\}\not =\{h,\ell \} \end{array}}r^*_{ij} \le \sum _{\begin{array}{c} \{i,j\}\in E \\ \{i,j\}\not =\{h,\ell \} \end{array}} r'_{ij} \quad \Rightarrow \quad \sum _{\{i,j\}\in E}r^*_{ij} < \sum _{\{i,j\}\in E}r'_{ij}. \end{aligned}$$

Thus, \((x',y',r')\) does not dominate \((x^*,y^*)\), as claimed. We therefore must have \(\sum _{ij}y^*_{ij}=\sum _{ij}y'_{ij}\). Moreover, since \(\mathsf {supp}(y^*)\subseteq \mathsf {supp}(y')\), we also have \(y'=y^*\), yielding \(\sum _{ij}r^*_{ij}=\sum _{ij}r'_{ij}\). \(\square \)

3.2 Scalarized approximation

Finally, we derive a weighted scalarization of Eqs. (14)–(15), with the aim of weighing Eq. (14) more than Eq. (15). In summary, we obtain the formulation:

$$\begin{aligned} \left. \begin{array}{rrcl} \max &{} \sum \limits _{\{i,j\}\in E} y_{ij} &{}-&{} \alpha \sum \limits _{\{i,j\}\in E}r_{ij}\\ \forall \{i,j\}\in E &{} \Vert x_i-x_j\Vert ^2 &{}\ge &{} L_{ij}^2y_{ij} \\ \forall \{i,j\}\in E &{} \Vert x_i-x_j\Vert ^2 &{}\le &{} U_{ij}^2y_{ij} + r_{ij} \\ &{} y &{}\in &{} \{0,1\}^m \\ &{} r &{}\in &{} {\mathbb {R}}_{\ge 0}^m \\ [0.2em] &{} x^L \le &{}x &{}\le x^U, \end{array}\right\} \end{aligned}$$
(16)

where \(\alpha \ge 0\). Equation (16) is still a nonconvex MINLP, and thus remains very challenging to solve. However, it has no “big-M” constraint.
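For a fixed candidate \((x,y)\) satisfying (eL), the scalarized objective of Eq. (16) is maximized by setting each \(r_{ij}\) to its smallest feasible value; a minimal evaluation sketch (illustrative names):

```python
import numpy as np

def scalarized_value(x, y, edges, alpha=0.15):
    """Objective of Eq. (16) for fixed (x, y), assuming (eL) holds, with each
    r_ij set to its smallest feasible value max(0, ||x_i-x_j||^2 - U^2 y_ij)."""
    val = 0.0
    for (i, j), (_L, U) in edges.items():
        sq = float(np.dot(x[i] - x[j], x[i] - x[j]))
        r = max(0.0, sq - U ** 2 * y[i, j])   # smallest r feasible in Eq. (16)
        val += y[i, j] - alpha * r
    return val
```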

4 Matrix relaxations and rank reduction

In this section we look at a Mixed-Integer Semidefinite Programming (MISDP) relaxation, followed by a Mixed-Integer Diagonally Dominant Programming (MIDDP) restriction thereof, which provides a Mixed-Integer Linear Programming (MILP) approximation of Eq. (16). We then show how to reduce the rank of the \(n\times n\) matrix solution of these formulations to an \(n\times K\) matrix representation of the realization we look for.

4.1 MISDP relaxation

The standard derivation of a Semidefinite Programming (SDP) relaxation from MPs involving Euclidean distance terms consists in writing them as

$$\begin{aligned} \Vert x_i-x_j\Vert ^2 = \Vert x_i\Vert ^2 + \Vert x_j\Vert ^2 - 2 x_i\cdot x_j, \end{aligned}$$

and then linearizing the nonlinear terms \(\Vert x_i\Vert ^2\) and \(x_i\cdot x_j\) by added variables \(X_{ii}\), \(X_{ij}\) respectively, resulting in \(\Vert x_i-x_j\Vert ^2\) being replaced by

$$\begin{aligned} X_{ii}+X_{jj}-2X_{ij}, \end{aligned}$$

which is linear in X. The effect of this replacement yields an exact reformulation as long as the constraints

$$\begin{aligned} X=x{x}^{\top } \end{aligned}$$
(17)

are satisfied. Note that Eq. (17) are nonconvex constraints. We relax them by

$$\begin{aligned} X\succeq x{x}^{\top }, \end{aligned}$$

which define a convex set. We remark that, if the x variables appear nowhere else in the formulation, it suffices to enforce \(X\succeq 0\) (see e.g. [13]). In order to eliminate x from the formulation, we can relax the bounds \(x^L,x^U\) in Eq. (16), which are inessential by translation invariance.

The application of the above reformulation to Eq. (16) yields

$$\begin{aligned} \left. \begin{array}{rrcl} \max &{} \sum \limits _{\{i,j\}\in E} y_{ij} &{}- &{}\alpha \sum \limits _{\{i,j\}\in E} r_{ij} \\ \forall \{i,j\}\in E &{} X_{ii}+X_{jj}-2X_{ij} &{}\ge &{} L_{ij}^2 y_{ij} \\ \forall \{i,j\}\in E &{} X_{ii}+X_{jj}-2X_{ij} &{}\le &{} U_{ij}^2 y_{ij} + r_{ij} \\ &{} y &{}\in &{} \{0,1\}^m \\ &{} r &{}\in &{} {\mathbb {R}}_{\ge 0}^m \\ &{} X &{}\succeq &{} 0, \end{array}\right\} \end{aligned}$$
(18)

which is a MISDP formulation.

4.2 MIDDP approximations

Diagonally Dominant Programming (DDP) was proposed in [2, 3] as an MP-based approximation technique for the positive semidefinite (PSD) cone, yielding both inner and outer approximating formulations in the Linear Programming (LP) or Second-Order Cone Programming (SOCP) classes. Since MILP solvers are more technologically advanced than their conic counterparts, in this section we only discuss the LP approximation.

A matrix \(A=(A_{ij})\) is diagonally dominant (DD) whenever

$$\begin{aligned} \forall i \quad A_{ii}\ge \sum _{j\not =i} |A_{ij}|. \end{aligned}$$
(19)

DDP rests on the observation that all DD matrices are PSD, a fact which follows easily from Gershgorin’s circle theorem [14]. Since Eq. (19) can be represented by a set of linear inequalities, optimizing over the DD cone \({\mathcal {D}}\) instead of the PSD cone \({\mathcal {S}}_+\) yields a PSD matrix solution at the cost of solving an LP.
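Membership in the DD cone is a trivial row-wise test; a minimal numpy sketch of Eq. (19):

```python
import numpy as np

def is_diagonally_dominant(A, tol=1e-9):
    """Check Eq. (19): A_ii >= sum_{j != i} |A_ij| for every row i; by
    Gershgorin's circle theorem such a (symmetric) matrix is PSD."""
    A = np.asarray(A, dtype=float)
    off = np.abs(A).sum(axis=1) - np.abs(np.diag(A))   # off-diagonal row sums
    return bool(np.all(np.diag(A) >= off - tol))
```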

On the other hand, not all PSD matrices are DD, i.e. \({\mathcal {D}}\subsetneq {\mathcal {S}}_+\): the feasible region of an SDP problem might be non-empty while the corresponding (inner) DDP problem is infeasible. In such cases, one may resort to the outer DDP problem, which is derived using the dual DD cone \({\mathcal {D}}^*\). However, since \({\mathcal {S}}_+\subsetneq {\mathcal {D}}^*\), the matrix solution of the outer DDP problem may not be PSD. More details about applying DDP to the DGP are given in [13, 23].

4.2.1 Inner restriction

We focus on the inner approximation first, since, if it is feasible, it provides a PSD matrix as a solution, which is crucial to further processing. For the MISDP problem in Eq. (18), we simply replace \(X\succeq 0\) with “X is DD”, i.e.

$$\begin{aligned} \forall i\in V \quad X_{ii} \ge \sum _{j\not =i} |X_{ij}|. \end{aligned}$$
(20)

This is easily reformulated to a linear form by introducing a linearizing \(n\times n\) symmetric matrix T for the nonlinear term |X|. We then obtain:

$$\begin{aligned} \forall i\in V \quad X_{ii} \ge \sum _{j\not =i} T_{ij} \end{aligned}$$
(21)
$$\begin{aligned} -T \le X \le T. \end{aligned}$$
(22)

The proof that Eqs. (21)–(22) are an exact reformulation of Eq. (20) is hinted at in [2], and proceeds by projection of the T variables. A more direct argument (by contradiction of optimality) can be obtained by considering the objective function \(\min \sum _{ij} T_{ij}\) (a corresponding term \(-\sum _{ij} T_{ij}\) may optionally be added to the objective of Eq. (23)). In summary, we have the following MILP:

$$\begin{aligned} \left. \begin{array}{rrcl} \max &{} \sum \limits _{\{i,j\}\in E} y_{ij} &{}-&{} \alpha \sum \limits _{\{i,j\}\in E} r_{ij} \\ \forall \{i,j\}\in E &{} X_{ii}+X_{jj}-2X_{ij} &{}\ge &{} L_{ij}^2 y_{ij} \\ \forall \{i,j\}\in E &{} X_{ii}+X_{jj}-2X_{ij} &{}\le &{} U_{ij}^2 y_{ij} + r_{ij} \\ \forall i<j\le n &{} X_{ij} &{}=&{} X_{ji} \\ \forall i \in V&{} \sum \limits _{j\not =i} T_{ij} &{}\le &{} X_{ii} \\ &{} -T \ \le \ X &{}\le &{} T \\ &{} T,X &{}\in &{} {\mathbb {R}}^{n\times n}\\ &{} y &{}\in &{} \{0,1\}^m \\ &{} r &{}\in &{} {\mathbb {R}}_{\ge 0}^m. \end{array}\right\} \end{aligned}$$
(23)
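Eq. (23) is a plain MILP, so it can be stated directly in any MILP modeling layer. The following sketch builds it in PuLP over the open-source CBC solver (our experiments in Sect. 5.3 use CPLEX instead); the edge data structure and all names are illustrative assumptions:

```python
# Hedged sketch of the inner MIDDP MILP of Eq. (23) in PuLP/CBC.
import pulp

def middp_inner(n, edges, alpha=0.15):
    """edges: dict {(i, j): (L, U)} with 0 <= i < j < n."""
    prob = pulp.LpProblem("MIDDP_inner", pulp.LpMaximize)
    y = {e: pulp.LpVariable(f"y_{e[0]}_{e[1]}", cat="Binary") for e in edges}
    r = {e: pulp.LpVariable(f"r_{e[0]}_{e[1]}", lowBound=0) for e in edges}
    # store only the upper triangles of X and T: symmetry (X_ij = X_ji)
    # is then implicit, so the explicit symmetry constraints are not needed
    X = {(i, j): pulp.LpVariable(f"X_{i}_{j}")
         for i in range(n) for j in range(i, n)}
    T = {(i, j): pulp.LpVariable(f"T_{i}_{j}", lowBound=0)
         for i in range(n) for j in range(i + 1, n)}
    prob += pulp.lpSum(y.values()) - alpha * pulp.lpSum(r.values())
    for (i, j), (L, U) in edges.items():
        sqdist = X[i, i] + X[j, j] - 2 * X[i, j]  # linearized ||x_i - x_j||^2
        prob += sqdist >= L ** 2 * y[i, j]                 # (eL)
        prob += sqdist <= U ** 2 * y[i, j] + r[i, j]       # (eU), soft via r
    for i in range(n):  # diagonal dominance, Eq. (21)
        prob += pulp.lpSum(T[min(i, j), max(i, j)]
                           for j in range(n) if j != i) <= X[i, i]
    for (i, j), t in T.items():  # -T <= X <= T, Eq. (22)
        prob += X[i, j] <= t
        prob += X[i, j] >= -t
    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return prob, X, y
```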

4.2.2 Outer relaxation

An outer relaxation of the MISDP formulation Eq. (18) can be obtained using DDP techniques and the dual DD cone \({\mathcal {D}}^*\).

The formulation replaces the implicit constraint \(X\in {\mathcal {D}}\) by \(X\in {\mathcal {D}}^*\). We remark that the PSD cone \({\mathcal {S}}_+\) is contained in \({\mathcal {D}}^*\), and that the latter is a polyhedral relaxation of the former. We also recall that \({\mathcal {S}}_+\) can be characterized by means of the uncountably infinite set of constraints \({v}^{\top }Xv\ge 0\) for all \(v\in {\mathbb {R}}^n\). Moreover, by [6], \({\mathcal {D}}\) is finitely generated by the set \({\mathcal {V}}\) of its extreme rays, which consists of the matrices \(e_i{e}^{\top }_i\), \((e_i+e_j){(e_i+e_j)}^{\top }\), \((e_i-e_j){(e_i-e_j)}^{\top }\) for all \(i<j\le n\).

It is well known that if a cone is finitely generated, its dual cone is also finitely generated. By [23, Thm. 3], the polyhedral representation of \({\mathcal {D}}^*\) is

$$\begin{aligned} \forall v\in {\mathcal {V}}\quad \mathsf {trace}({v}^{\top }Xv) \ge 0, \end{aligned}$$
(24)

which also shows that \({\mathcal {S}}_+\subseteq {\mathcal {D}}^*\). Thus, an explicit formulation of \(X\in {\mathcal {D}}^*\) is as follows:

$$\begin{aligned} \begin{array}{lrl} \forall i\le n \quad &{} X_{ii} &{}\ge 0 \\ \forall i<j\le n \quad &{} X_{ii} + X_{jj} - 2X_{ij} &{}\ge 0 \\ \forall i<j\le n \quad &{} X_{ii} + X_{jj} + 2X_{ij} &{}\ge 0. \end{array} \end{aligned}$$

We can therefore derive a dual DDP formulation for the outer relaxation of Eq. (18):

$$\begin{aligned} \left. \begin{array}{rrcl} \max &{} \sum \limits _{\{i,j\}\in E} y_{ij} - \alpha \sum \limits _{\{i,j\}\in E} r_{ij} &{}&{} \\ \forall \{i,j\}\in E &{} X_{ii}+X_{jj}-2X_{ij} &{}\ge &{} L_{ij}^2 y_{ij} \\ \forall \{i,j\}\in E &{} X_{ii}+X_{jj}-2X_{ij} &{}\le &{} U_{ij}^2 y_{ij} + r_{ij} \\ \forall i<j\le n &{} X_{ij} &{}=&{} X_{ji} \\ \forall \{i,j\}\not \in E &{} X_{ii} + X_{jj} - 2X_{ij} &{}\ge &{} 0 \\ \forall i<j\le n &{} X_{ii} + X_{jj} + 2X_{ij} &{}\ge &{} 0 \\ \forall i\le n &{} X_{ii} &{}\ge &{} 0 \\ &{} X &{}\in &{} {\mathbb {R}}^{n\times n}\\ &{} y &{}\in &{} \{0,1\}^m \\ &{} r &{}\in &{} {\mathbb {R}}_{\ge 0}^m. \end{array}\right\} \end{aligned}$$
(25)

Since \({\mathcal {S}}_+\subsetneq {\mathcal {D}}^*\), the optimal matrix solution \(X^*\) of Eq. (25) may (and in practice usually does) have negative eigenvalues, which makes it a poor candidate for the further rank reduction processing discussed below. On the other hand, it provides a guaranteed bound to the optimal objective function value of Eq. (18) in the optimization direction.
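In the same PuLP idiom as the sketch after Eq. (23), the polyhedral description of \(X\in {\mathcal {D}}^*\) can be appended as follows; for simplicity this sketch adds the \(-2X_{ij}\) inequalities for all pairs \(i<j\), although Eq. (25) omits them for edge pairs, where they are implied by (eL):

```python
def add_dual_dd(prob, X, n):
    """Append the D* constraints of Eqs. (24)-(25) to a PuLP model; X is the
    upper-triangular dict of matrix variables from the sketch above."""
    for i in range(n):
        prob += X[i, i] >= 0
        for j in range(i + 1, n):
            prob += X[i, i] + X[j, j] - 2 * X[i, j] >= 0
            prob += X[i, i] + X[j, j] + 2 * X[i, j] >= 0
```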

4.3 Rank reduction

If Eq. (23) is feasible, solving it yields a symmetric \(n\times n\) PSD matrix X which has the property that \(X_{ii}+X_{jj}-2X_{ij}\) is the squared Euclidean distance between two points \({\bar{x}}_i,{\bar{x}}_j\), in some Euclidean space, such that \({\bar{x}}_i\cdot {\bar{x}}_i=X_{ii}\), \({\bar{x}}_j\cdot {\bar{x}}_j=X_{jj}\) and \({\bar{x}}_i\cdot {\bar{x}}_j=X_{ij}\). In other words, X is the Gram matrix of a realization \({\bar{x}}\) of the given graph G.

Since X is square symmetric, we can use spectral decomposition to write X as \(X = P\Lambda {P}^{\top }\), where P is a matrix of eigenvectors, and \(\Lambda \) a diagonal matrix of eigenvalues \(\lambda _1,\ldots ,\lambda _n\) which we shall assume ordered largest to smallest. Since X is PSD, we have \(\lambda _n\ge 0\) which implies that \(\sqrt{\Lambda }\) is a real matrix. Hence we can decompose X as \((P\sqrt{\Lambda }){(P\sqrt{\Lambda })}^{\top }\), which means that we can take \({\bar{x}}=P\sqrt{\Lambda }\) as the realization of G.

We now recall that the DGP asks for a realization in \({\mathbb {R}}^K\) for a given integer K. The realization \({\bar{x}}\) is an \(n\times n\) matrix, so it can be taken as a realization in \({\mathbb {R}}^n\). The intrinsic dimension of \({\bar{x}}\) is actually given by \(\mathsf {rank}({\bar{x}})\). In practice, \(\mathsf {rank}({\bar{x}})\) is usually n or very close to n, whereas K is usually much smaller than n. Thus, given \({\bar{x}}\in {\mathbb {R}}^{n\times n}\), we would like to find a reduced rank realization \(x'\in {\mathbb {R}}^{n\times K}\).

We consider two rank reduction methods: the first is Principal Component Analysis (PCA) [18]. The second is Barvinok’s naive algorithm [7], extended to consider arbitrary ranks [27].

4.3.1 Principal Component Analysis

With the notation of the previous section, we define

$$\begin{aligned} \Lambda ^{(K)} = \mathsf {diag}(\lambda _1,\ldots ,\lambda _K,0,\ldots ,0), \end{aligned}$$

and then we let \(x'=P\sqrt{\Lambda ^{(K)}}\). Although \(x'\) is still technically speaking an \(n\times n\) matrix, all of the columns from the \(K+1\)-st to the n-th are zero vectors. This means that they can be ignored, and \(x'\) can be considered an \(n\times K\) realization matrix, such that its i-th row \(x_i\) is the position of vertex i.

Since \(\lambda _1\ge \cdots \ge \lambda _K\) are the K largest eigenvalues of X, the approximate realization \(x'\) is the “closest” to \({\bar{x}}\) (with respect to the Schatten norm [10] considering a subset of K eigenvalues in \(\Lambda \)) which minimizes

$$\begin{aligned} \sum \limits _{i\not =j}|\,\Vert {\bar{x}}_i-{\bar{x}}_j\Vert ^2-\Vert x'_i-x'_j\Vert ^2\,|, \end{aligned}$$

for otherwise a contradiction would ensue with \(\lambda _1,\ldots ,\lambda _K\) being the K largest eigenvalues.
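A minimal numpy sketch of this truncation (numpy's eigh returns eigenvalues in ascending order, hence the reordering):

```python
import numpy as np

def pca_rank_reduce(X, K):
    """Given a PSD Gram matrix X, return the n x K realization
    P * sqrt(Lambda^(K)) keeping only the K largest eigenvalues."""
    lam, P = np.linalg.eigh(X)            # ascending eigenvalues
    lam, P = lam[::-1], P[:, ::-1]        # reorder largest-to-smallest
    lamK = np.clip(lam[:K], 0.0, None)    # guard against round-off negatives
    return P[:, :K] * np.sqrt(lamK)       # columns scaled by sqrt(eigenvalues)
```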

4.3.2 Barvinok’s naive algorithm

The “naive algorithm” published by Barvinok in [7] is a lesser known, but effective, dimensionality reduction technique applicable to solutions of SDPs. Consider the quadratic feasibility problem of determining whether the set \(\{x\in {\mathbb {R}}^n \;|\;\forall i\le m\ {x}^{\top }Q^ix=a_i\}\) is non-empty. Let \({\bar{X}}\) be a solution of the corresponding SDP set \(\{X\in {\mathbb {R}}^{n\times n} \;|\;\forall i\le m\ \mathsf {trace}(Q^iX)=a_i\}\). Barvinok’s naive algorithm performs the following steps:

  1. let T be a factor of \({\bar{X}}\), so \({\bar{X}}=T{T}^{\top }\);

  2. sample each component of a vector \(w\in {\mathbb {R}}^n\) from the normal distribution \({\mathcal {N}}(0,1)\);

  3. let \(x'=Tw\).

A concentration of measure argument shows that

$$\begin{aligned} \mathsf {Prob}\bigg (\forall i\le m\;\mathsf {dist}(x',\{x\;|\;{x}^{\top }Q^ix=a_i\})\le C\sqrt{\Vert {\bar{X}}\Vert \ln (n)} \bigg ) \ge 0.9, \end{aligned}$$

where C is a positive universal constant, and \(\mathsf {dist}(p,S)\) is the Euclidean distance between a point p and a set S. In other words, Barvinok’s naive algorithm ensures that \(x'\) is “not too far” from the feasible set of the quadratic equations.

An extension of Barvinok’s naive algorithm to the iDGP was proposed in [27]. Starting from a solution \({\bar{X}}\) of the SDP relaxation of the iDGP, it is as follows:

  1. let T be a factor of \({\bar{X}}\), so \({\bar{X}}=T{T}^{\top }\);

  2. sample each component of an \(n\times K\) matrix w from the normal distribution \({\mathcal {N}}(0,1/\sqrt{K})\);

  3. let \(x'=Tw\).

Then

$$\begin{aligned} \mathsf {Prob}\bigg (\forall \{i,j\}\in E\;\mathsf {dist}(x',\{x\;|\;L_{ij}\le \Vert x_i-x_j\Vert \le U_{ij}\})\le C\sqrt{\Vert {\bar{X}}\Vert \ln (|E|)} \bigg ) \ge 0.9. \end{aligned}$$

Again, \(x'\) is close to being an iDGP realization of the graph G with high probability.
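A minimal sketch of this rank-K variant, assuming the PSD input \({\bar{X}}\) is factored by eigendecomposition and reading \({\mathcal {N}}(0,1/\sqrt{K})\) as a Gaussian with standard deviation \(1/\sqrt{K}\):

```python
import numpy as np

def barvinok_rank_reduce(Xbar, K, rng=None):
    """Barvinok's naive algorithm in the rank-K variant of [27]:
    x' = T w with Xbar = T T^T and w an n x K Gaussian matrix."""
    rng = rng or np.random.default_rng()
    lam, P = np.linalg.eigh(Xbar)
    T = P * np.sqrt(np.clip(lam, 0.0, None))   # factor with Xbar = T T^T
    w = rng.normal(0.0, 1.0 / np.sqrt(K), size=(Xbar.shape[0], K))
    return T @ w                               # n x K approximate realization
```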

4.4 Refinement

Both of the rank reduction methods sketched above produce an approximate realization \(x'\) which is close to being feasible for the given iDGP instance. We therefore propose a “refinement step” where \(x'\) is given as a starting point for a local descent consisting in locally solving the following variant of Eq. (6):

$$\begin{aligned} \left. \begin{array}{rrcl} \min &{} \sum \limits _{\{i,j\}\in E'} s_{ij} &{} &{} \\ \forall \{i,j\}\in E' &{} \Vert x_i-x_j\Vert ^2 &{}\ge &{} L_{ij}^2 - s_{ij} \\ \forall \{i,j\}\in E' &{} \Vert x_i-x_j\Vert ^2 &{}\le &{} U_{ij}^2 \\ \forall \{i,j\}\in E' &{} s_{ij} &{}\ge &{} 0 \\ &{} x^L \le x &{}\le &{} x^U, \end{array}\right\} \end{aligned}$$
(26)

where \(E'=\{\{i,j\}\in E\;|\;{\bar{y}}_{ij}=1\}\), with \({\bar{y}}\in \{0,1\}^m\) given by any of the mixed-integer formulations in this section.

We remark that the problematic reverse-convex constraint was relaxed in Eq. (26) by means of a slack variable to be minimized. Solving Eq. (26) yields a solution \(x^*\in {\mathbb {R}}^{n\times K}\) improved w.r.t. \(x'\).
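Our experiments solve Eq. (26) with a local NLP solver (IPOPT, Sect. 5.3). As a self-contained illustration, the following sketch minimizes a penalized surrogate with scipy, moving the hard upper bounds into the objective with a large weight; it is an approximation of Eq. (26), not the formulation we actually solve:

```python
import numpy as np
from scipy.optimize import minimize

def refine(x0, edges_on, rho=100.0):
    """Local refinement from the starting point x0 (n x K) over the active
    edge set E' (edges with y_ij = 1). Lower bounds are softened like the
    slacks s of Eq. (26); upper bounds get weight rho to mimic hard bounds."""
    n, K = x0.shape

    def obj(z):
        x = z.reshape(n, K)
        val = 0.0
        for (i, j), (L, U) in edges_on.items():
            sq = float(np.dot(x[i] - x[j], x[i] - x[j]))
            val += max(0.0, L ** 2 - sq) + rho * max(0.0, sq - U ** 2)
        return val

    # L-BFGS-B copes with the mild nonsmoothness of the max terms in practice
    res = minimize(obj, x0.ravel(), method="L-BFGS-B")
    return res.x.reshape(n, K)
```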

5 Computational results

In this section we present some computational results concerning the formulations presented in the previous sections. More precisely, we test the following solution methods:

  1. Algorithm sBB is a global spatial Branch-and-Bound (sBB) based solver deployed on the formulation in Eq. (16) for a given CPU time;

  2. Algorithm DDP consists in: (i) solving an inner MIDDP restriction Eq. (23) (Sect. 4.2.1) to yield a solution \(({\bar{X}},{\bar{y}})\); (ii) reducing the rank of \({\bar{X}}\) to K (Sect. 4.3) to yield a realization \({\bar{x}}\in {\mathbb {R}}^{n\times K}\); (iii) using \({\bar{x}}\) as a starting point for a local NLP solver deployed on Eq. (26), to obtain a realization \(x^*\in {\mathbb {R}}^{n\times K}\).

The reason why we do not consider solving the exact MINLP formulation in Eq. (10) is that preliminary tests showed that sBB solvers cannot even find a locally optimal solution of our smallest instance. We do not consider the MISDP relaxation in Eq. (18) because we could not find an off-the-shelf MISDP solver that we could deploy on our computational platform.

5.1 Solution quality measures

Let \(x^*\) be a realization of G, and \(y^*\in \{0,1\}^m\) describe the activation of the m constraints of the iDGP. We consider the following quality measures:

  • the support cardinality \(|\mathsf {supp}(y^*)|=\sum _{\{i,j\}\in E} y^*_{ij}\) of \(y^*\), which is equal to the number of satisfied distance constraints, and evaluates the systematic errors;

  • the mean and largest distance error (MDE, LDE) measures computed on the support (i.e., for \(\{i,j\}\in E\) with \(y_{ij} = 1\)):

    $$\begin{aligned} \mathsf {MDE}(x,y,G) = \frac{1}{m} \sum \limits _{\{i,j\}\in E} \max (L_{ij}-\Vert x_i-x_j\Vert _2,\Vert x_i-x_j\Vert _2-U_{ij})\,y_{ij} \end{aligned}$$
    (27)
    $$\begin{aligned} \mathsf {LDE}(x,y,G) = \max \limits _{\{i,j\}\in E} \max (L_{ij}-\Vert x_i-x_j\Vert _2,\Vert x_i-x_j\Vert _2-U_{ij})\,y_{ij}, \end{aligned}$$
    (28)

    which evaluate the realization error w.r.t. the given (interval) distances;

  • the CPU time taken to solve the instance.

We also employ a comparative measure between two realizations \(x,z\in {\mathbb {R}}^{n\times K}\) called Root Mean Square Deviation (RMSD), defined as \(\sqrt{\sum _{i\le n}\Vert x_i-z_i\Vert ^2/n}\). According to [12], the RMSD is not overly meaningful on protein realizations unless one also normalizes w.r.t. partial reflections, which, however, appears hard. Along with RMSD scores between the realizations found by our method and reference realizations with zero systematic and experimental errors, we also show plots with realizations aligned according to Procrustes analysis [16] (but not w.r.t. partial reflections, see [12]).
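A minimal sketch of these measures (y is encoded as a dict of 0/1 values over edges; Procrustes alignment, e.g. via scipy.spatial.procrustes, is assumed to have been applied before computing RMSD):

```python
import numpy as np

def mde_lde(x, y, edges):
    """MDE and LDE of Eqs. (27)-(28); errors are weighted by y_ij, so only
    edges in the support of y contribute."""
    errs = [max(L - np.linalg.norm(x[i] - x[j]),
                np.linalg.norm(x[i] - x[j]) - U) * y[i, j]
            for (i, j), (L, U) in edges.items()]
    return sum(errs) / len(edges), max(errs)

def rmsd(x, z):
    """Root Mean Square Deviation between two aligned n x K realizations."""
    return float(np.sqrt(((x - z) ** 2).sum() / len(x)))
```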

5.2 Test set

We perform our tests on a set of small to medium-sized protein instances. We note that K is fixed to the constant 3. For a given protein with known realization \({\hat{x}}\in {\mathbb {R}}^{n\times 3}\), the corresponding instance was generated as follows:

  1. the \(n\times n\) Euclidean distance matrix \(D=(d_{ij})\) of \({\hat{x}}\) was computed;

  2. all distances between atoms i and \(j\in \{i+1,i+2\}\) on the backbone were kept as exact distances, namely \([L_{ij},U_{ij}]=[d_{ij},d_{ij}]\), for each \(i\le n-2\) (these distances are known as discretization distances);

  3. all distances between atoms i and \(i+3\) on the backbone were kept as interval distances, namely \([L_{ij},U_{ij}]=[d_{ij}-\eta d_{ij},d_{ij}+\eta d_{ij}]\), for each \(i\le n-3\) (these distances are also known as discretization distances [30]);

  4. all other distances shorter than 5Å in D were kept as interval distances, namely \([L_{ij},U_{ij}]=[d_{ij}-\eta d_{ij},d_{ij}+\eta d_{ij}]\) (these distances are known as pruning distances [30]);

  5. a given fraction \(\sigma \) of the pruning distances, chosen randomly, were reassigned randomly to a different pair of atoms;

  6. any other distance was discarded from D.

In our experiments, we set \(\eta =\sigma =0.1\).
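A hedged sketch of this generation procedure (names are illustrative; atoms are assumed indexed in backbone order):

```python
import numpy as np

def make_instance(xhat, eta=0.1, sigma=0.1, rng=None):
    """Steps 1-6 above: xhat is the known n x 3 realization; returns a dict
    {(i, j): (L_ij, U_ij)} of exact/interval distances, with a fraction sigma
    of the pruning distances reassigned to random wrong pairs."""
    rng = rng or np.random.default_rng()
    n = len(xhat)
    D = np.linalg.norm(xhat[:, None, :] - xhat[None, :, :], axis=2)
    edges, pruning = {}, []
    for i in range(n):
        for j in range(i + 1, n):
            d = D[i, j]
            if j - i <= 2:                       # exact discretization distances
                edges[i, j] = (d, d)
            elif j - i == 3:                     # interval discretization distances
                edges[i, j] = ((1 - eta) * d, (1 + eta) * d)
            elif d < 5.0:                        # pruning distances (< 5 angstrom)
                edges[i, j] = ((1 - eta) * d, (1 + eta) * d)
                pruning.append((i, j))
    # systematic errors: move a fraction sigma of pruning intervals elsewhere
    for k in rng.choice(len(pruning), int(sigma * len(pruning)), replace=False):
        interval = edges.pop(pruning[k])
        while True:                              # reassign to a wrong pair
            a, b = sorted(map(int, rng.choice(n, 2, replace=False)))
            if (a, b) not in edges:
                edges[a, b] = interval
                break
    return edges
```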

Most of the instances in Table 1 were extracted from the Protein Data Bank (PDB) [9]; a few are modifications of instances from the PDB.

Table 1 The protein instances in the test set, with the corresponding number of edges and vertices in their graphs

5.3 Computational set-up

The sBB algorithm was implemented by the Baron solver [32, 34]. As for the DDP algorithm, solutions of Eq. (23) were obtained using CPLEX 12.10 [19]. Solutions of the refinement step subproblems (Sect. 4.4) were obtained using IPOPT 3.11 [11].

In practice, the \(\alpha \) coefficient, appearing in the scalarized formulation Eq. (16) and in the derived MIDDP formulation Eq. (23), depends on the instance. In general, we found that values of \(\alpha \) over 0.2 often yielded trivial solutions where \(y_{ij}=0\) for every \(\{i,j\}\in E\). After some preliminary experimentation, we set \(\alpha =0.15\) for both sBB and DDP.

Our implementation uses a mixture of Python3 and AMPL (using the AmplPy Python interface). The experiments were carried out on a server with four Intel Xeon E5-2620 v4 CPUs with eight cores per CPU at 2.1GHz with 64GB RAM running CentOS Linux.

5.4 Experiments

In this section we discuss the results obtained from the experiments.

5.4.1 The sBB algorithm

The sBB algorithm was unable to solve any of the instances in Table 1 to guaranteed global optimality within one hour of CPU time. Moreover, it failed to find feasible solutions for any instance other than tiny.

On the tiny instance, sBB found, within one hour of CPU time, a realization \(x^*\) with \(|\mathsf {supp}(y)|=335\), \(\mathsf {MDE}=0.056\), \(\mathsf {LDE}=3.352\). Since \(m=335=|\mathsf {supp}(y)|\), this solution neglects all systematic errors, considering them as experimental instead. This is noticeable in Fig. 1, which shows \(x^*\) optimally aligned to a reference realization without any error at all. The RMSD value for this pair is 0.084.

Fig. 1: Realization of tiny obtained using sBB compared with a correct reference realization (left); the systematic error (center); error derived from a wrong partial reflection (right). The scaling on the axes is arbitrary

Each picture in Fig. 1 shows two realizations of tiny aligned optimally using Procrustes analysis. One of them, labelled \(x^*\) in the left picture, was found by sBB; the other, labelled “Reference”, is a realization of tiny without errors (experimental or systematic). We remark that the two clusters on the left appear to be well aligned. In the center picture we emphasize an edge \(\{i,j\}\) in \(x^*\) which should clearly add to the systematic error rather than to the experimental one: it shows that \(y_{ij}\) should have been zero rather than one (we recall that, instead, sBB found a solution with \(y_{ij}=1\)). In the right picture, we emphasize a flipped partial reflection, which contributes to the overall RMSD error, but is not an actual error, as explained in [12].

5.4.2 The DDP algorithm

The DDP algorithm was configured with a maximum CPU time of 3600s for the MILP solver deployed on the inner MIDDP formulation of Eq. (23), while the local NLP solver in the refinement phase was allowed to terminate naturally. The DDP algorithm was able to find reasonable realizations for all of the instances. The results are reported in Table 2.

Table 2 Results from the DDP algorithm

The columns report: the instance name, the rank reduction algorithm (PCA, denoted “pca”, or Barvinok’s naive algorithm, denoted “bvk”), the number m of edges in the instance, the number \(|\mathsf {supp}(y)|\) of edges with experimental error only, the MDE and LDE measures, the RMSD value between the realization found and a reference realization without any error, the upper bound UB found by the outer MIDDP relaxation in Eq. (25), the ratio NegEv of negative to total eigenvalue weight of the corresponding outer MIDDP solution, and the CPU time.

Table 2 allows us to make a few observations.

  1. Overall, the methodology we propose is able to derive approximate solutions to small and medium-scaled instances of the MaxFS\({}_{\text{ DGP }}\) problem in acceptable times.

  2. The MDE quality measures are acceptable with respect to results obtained in the DGP literature on proteins with solution methods taken from MP (see e.g. [12, 13, 20, 24, 27]). The LDE measures, on the other hand, appear excessive, and may be a sign that the balance \(\alpha \) between systematic and experimental error needs further tuning.

  3. The time spent on the refinement phase can vary greatly. While this is not necessarily troublesome when finding the shape of proteins (which is rarely a real-time affair), it may help to set a time limit on the refinement phase too.

  4. It is unclear whether PCA or Barvinok’s naive algorithm is the better rank reduction method. They clearly yield different approximate realizations, which is important in view of the choice of starting point for the refinement phase. A possible idea would be to deploy them in parallel until termination of the fastest refinement phase.

  5. The upper bound UB obtained by the outer MIDDP relaxation is always equal to the number of edges of the instance. Further analysis shows that every outer MIDDP was always solved by the presolver, which further strengthens the possibility that this upper bound may always be trivial. We do not know whether setting \(y_{ij}=1\) for each \(\{i,j\}\in E\) always yields a feasible solution in Eq. (25) with \(\sum _{ij} r_{ij}=0\); neither do we know whether such a property might be established for specific values of \(\alpha \).

A further methodological inquiry can be made by computing the RMSD between solutions of the MIDDP formulation Eq. (23) after projection (by PCA or by Barvinok’s algorithm), and the solution after the refinement phase using the local NLP solver. This statistic gives us an idea of whether the refinement phase is effective. As can be seen from Table 3, this is indeed the case.

Table 3 RMSD values between partial solutions obtained by MIDDP after dimensionality reduction, and final solutions after the refinement phase

6 Conclusion

This paper is about the problem of finding a maximum feasible subsystem of distance geometry constraints, which models well the dual nature of the errors arising from finding protein conformations from NMR distance data. We discussed mathematical programming formulations of many different types for this problem: exact, approximate, relaxations, restrictions. We proposed a solution methodology based on a mixed integer diagonally dominant programming restriction of the original mixed-integer nonlinear programming problem. We tested our methodology on a set of protein instances of small and medium sizes obtaining reasonable solutions in acceptable CPU times.