1 Introduction

Within biology, it is well-established that the biochemical function of a molecule is strongly related to its three-dimensional structure. As a direct consequence, substantial effort is devoted within the field of structural biology to the problem of molecular structure determination: given the chemical composition and topology of a molecule, we seek its conformation in \(\mathbb {R}^3\). For the specific problem of protein structure determination, the composition and topology of a protein molecule are completely specified by its amino acid sequence,Footnote 1 and we again seek its conformation(s) in three dimensions.

Proteins are highly flexible polymer chains of amino acids, and consequently may have multiple conformations for a given amino acid sequence. Therefore, additional geometric measurements are required in order to obtain a reasonable number of solutions to the protein structure determination problem. Nuclear magnetic resonance (NMR) experiments are frequently employed to obtain these measurements. In an NMR experiment, the interaction between the nucleus of each atom in a molecule and a strong magnetic field is measured using precisely timed pulses of radio-frequency radiation [15]. The interaction energy of a nucleus as a fraction of the total magnetic field energy is known as its chemical shift. Chemical shifts are highly sensitive reporters of the local electronic environments of their nuclei, making them excellent tools for structure determination. For proteins, whose polymer backbones are formed by repeated N–C\(_\alpha \)–C\('\) units, chemical shifts are dominated by local backbone geometry (Fig. 1), in particular the backbone dihedral angles \(\phi \), \(\psi \) and \(\omega \). The \(\phi \) and \(\psi \) angles are historically referred to as the Ramachandran angles [19]. For amino acid i of a protein, the \(\phi _i\) angle is defined by the atoms C\('^{(i-1)}\)–N\(^{(i)}\)–C\(_\alpha ^{(i)}\)–C\('^{(i)}\), the \(\psi _i\) angle is defined by N\(^{(i)}\)–C\(_\alpha ^{(i)}\)–C\('^{(i)}\)–N\(^{(i+1)}\), and the \(\omega _i\) angle is defined by C\(_\alpha ^{(i)}\)–C\('^{(i)}\)–N\(^{(i+1)}\)–C\(_\alpha ^{(i+1)}\). The \(\omega _i\) dihedral angle is usually fixed to 180\(^\circ \) due to the known near-planarity of the peptide bond [1]. When all \(\omega \) dihedrals are fixed to 180\(^\circ \), and the local geometry is assumed to be fixed [10], the conformation of a protein’s backbone may be completely specified by supplying the Ramachandran angles \((\phi ,\psi )\) for all of its amino acid units. These angles may be predicted from the chemical shifts of the N, C\(_\alpha \), C\(_\beta \), C\('\), H\(_1\) and H\(_\alpha \) atoms using TALOS-N, an artificial neural network trained on a large database of previously determined NMR protein structures [23]. Multiple predictions from the neural network are used to estimate a mean and standard deviation for each backbone dihedral angle, resulting in fairly reliable Ramachandran angle intervals. However, when some chemical shift measurements are missing, the accuracy of the predicted dihedral angles can suffer. A set of multidimensional NMR experiments involving different nuclei [11] are required to determine the chemical shift of each atom, using a process known as sequential assignment (cf. §4). Figure 2 illustrates one such experimental result in protein NMR, a two-dimensional spectrum that is used to correlate the chemical shifts of backbone H\(_1^{(i)}\) and N\(^{(i)}\) atoms.

Fig. 1
figure 1

Depiction of a single amino acid unit i of a protein backbone, including any atoms of flanking units \(i-1\) and \(i+1\) necessary for defining the backbone dihedral angles \(\phi _i\), \(\psi _i\) and \(\omega _i\). While the repetition vertex orders introduced in Sect. 2.3 do not include C\(_\beta \) atoms, the NMR chemical shifts of these atoms are important indicators of the local dihedral angles (cf. 1)

Fig. 2
figure 2

NMR spectrum obtained from the \(^1\)H–\(^{15}\)N heteronuclear single quantum coherence (HSQC) experiment for the HHD2 protein sample described in Sect. 4. Peaks in the spectrum labeled with a number correspond to pairs of bonded H\(_1^{(i)}\) and N\(^{(i)}\) atoms in the protein, where the number corresponds to the index of the amino acid unit (i). The protein studied here is a short sub-sequence of a much longer amino acid sequence, so the values of i indicate the position of the synthesized amino acids within that longer sequence

In addition to the local geometic information provided by chemical shifts, NMR may also be used to obtain global geometric information in the form of distances between pairs of atoms. In such experiments, the intensity of the NMR signal at the chemical shifts of two atoms, e.g. H\(^{(i)}\) and H\(^{(j)}\), is approximately proportional to the inverse sixth power of their distance, \(d(\mathrm{H}^{(i)}, \mathrm{H}^{(j)})^{-6}\). Because the signal intensity decays rapidly with increasing distance, this information is usually only available for atoms that are within 5–6 Å in the structure. Additional effects, including inevitable molecular motion during the measurement, can also result in a change of signal between nearby atoms. Finally, these signal intensity measurements are often perturbed by strong systematic measurement errors, so they are generally converted into interval distances using “rules of thumb” developed by NMR spectroscopists [27].

Conventional NMR protein structure determination protocols combine known distances and angles between bonded atoms with predicted Ramachandran angles and interval distance measurements into a non-convex objective function that smoothly penalizes any deviations of the structure from the target geometry [4, 18]. As this objective contains many local optima, metaheuristics such as simulated annealing with multiple random initializations are employed. Nevertheless, there is no guarantee that the resulting structures are within the feasible set specified by the distance, angle and dihedral constraints.

The protein structure determination problem may be recast into a distance geometry problem (DGP) by converting all aforementioned geometric constraints into distance constraints between pairs of atoms. Formally, the protein is represented as a graph \(G = (V, E, d)\), where V represents the atoms and E holds the atom pairs which have a known distance. The distances are either exact values or intervals, so there exists a partitioning of the edge set \(E = E_D \cup E_I\), where \(E_D\) and \(E_I\) hold exact and interval distances, respectively. We denote the set of positive real valued intervals as \(\mathbb {IR}_+\). Exact distances in \(E_D\) are given by known bond lengths, angles, and dihedrals in protein structures. Three-atom angles and four-atom dihedral angles that are known from protein chemistry shall be collected into the sets \({\varTheta }\) and \({\varOmega }\), respectively. Given this information, we may formally re-introduce the interval Discretizable Molecular Distance Geometry Problem [16] as the following,

interval DMDGP: given a simple weighted undirected graph \(G = (V, E, d)\) where \(E = E_D \cup E_I\) and,

$$\begin{aligned} d : {\left\{ \begin{array}{ll} E_D \rightarrow \mathbb {R}_+ \\ E_I \rightarrow \mathbb {IR}_+ \end{array}\right. } \end{aligned}$$

and a vertex ordering \(R = (v_1, \dots , v_n)\) on V satisfying the following requirements:

  • The subgraph of G induced by \(V_0 = \{v_1, v_2, v_3\}\) is a clique with all edges in \(E_D\)

  • For all \(j \in R {\setminus } V_0\) we have

    1. 1.

      \(d_{j-1,j}, \; d_{j-2,j} \in E_D\)

    2. 2.

      \(d_{j-3,j} \in E\)

  • For all \(j \in R {\setminus } \{v_1,v_2\}\) the strict triangular inequality is satisfied by \(d_{j-1,j}\), \(d_{j-2,j}\) and \(d_{j-2,j-1}\)

is there an embedding \(\mathbf {x} : V \rightarrow \mathbb {R}^3\) such that \(\Vert \mathbf {x}_u-\mathbf {x}_v\Vert = d_{uv} \;\;\forall \{u,v\} \in E_D\) and \(\Vert \mathbf {x}_u-\mathbf {x}_v\Vert \in [\underline{d}_{uv},\overline{d}_{uv}] \;\;\forall \{u,v\} \in E_I\)?

In general, the vertex ordering R may visit each vertex in the graph one or more times. We define such a repetition order [6] as a sequence \(R : \mathbb {N} \rightarrow V \cup 0\) with length \(|R| \in \mathbb {N}\) (such that \(R_i = 0\) for all \(i > |R|\)) when it satisfies the requirements of the iDMDGP. A repetition order entry \(r_i\) is referred to as repeated when there exist one or more elements \(v_j\) with \(j < i-2\), such that \(v_j = v_i\). As shown in [6, §4], there exist iDMDGP instances whose vertex orders must contain repetitions.

The interval Branch-and-Prune (iBP) algorithm is a combinatorial method for finding solutions to the iDMDGP. The iDMDGP solution space is a search tree, within which each path from the root node to a leaf node represents a distinct protein conformation. The iBP algorithm, explained in detail previously [5, 12, 13], recursively traverses the search tree in order to enumerate solutions to a given problem instance. If at any point in the search, the partial solution becomes infeasible with respect to the constraints, then the leaves of the current search tree node are “pruned” and iBP backtracks until a solution is identified. By searching the entire tree, iBP is capable of systematically exploring the feasible set of any given problem instance.

This work introduces a variant of iBP that uses an iterative tree traversal algorithm based on embedding equations derived from Clifford algebra [12]. This iterative variant is ideal for iDMDGP instances that have highly repetitive vertex orders [6], as it requires no extra matrix computations for repeated atoms in the order. We then pair our iterative iBP algorithm with such a highly repetitive discretization vertex order that directly uses information on dihedral angles during tree traversal. When many backbone dihedral angles are known to high precision in a given problem instance, this method and vertex order produce significantly smaller search trees than prior methods. For the purposes of this work, the proposed algorithm and vertex order enable iBP to obtain solutions to problem instances containing strong backbone dihedral information. For such problem instances, it is challenging for iBP to obtain any solutions using previously introduced vertex orders that do not branch on backbone dihedrals. We show that our new implementation solves this family of instances efficiently, even when previous iBP implementations fail to obtain any solutions.

The remainder of this paper is organized as follows. In Sect. 2, we introduce the iterative iBP algorithm with its embedding equations, along with new repetition vertex orders for protein structures. In Sect. 3, we describe features of the new iterative iBP implementation that simplify its use in routine applications while retaining the remarkable flexibility and generality of iBP. In Sect. 4 we present some computational results, specifically focusing on benchmarking the ability of iBP to find a single solution. Finally, we conclude the paper in Sect. 5.

Fig. 3
figure 3

Embedding relations for computing the position of a vertex \(\mathbf {x}_4\) from its predecessors \((\mathbf {x}_1, \mathbf {x}_2, \mathbf {x}_3)\), via the parameters \((\delta _4, \theta _4, \tau _4, \sigma _4)\)

2 Algorithmic considerations

2.1 Iterative embedding relations

iBP is most naturally defined as a recursive tree traversal algorithm [7, 13] that uses recursively defined affine transformation matrices [24] for computing embedded coordinates. However, recent use of Clifford algebra has yielded embedding equations that are non-recursive [12], the key results of which are recalled here.

For any vertex j in a repetition order R, we seek the embedded coordinate \(\mathbf {x}_j \in \mathbb {R}^3\), given distances to the three preceding vertices \(j-1\), \(j-2\) and \(j-3\). From the properties of the iDMDGP, the distances \(d_{j-1,j}\), \(d_{j-2,j}\) and \(d_{j-3,j}\) are known, where \(d_{j,j-3}\) is potentially an interval. The remaining distances in the clique formed by the four vertices \((j-3,j-2,j-1,j)\) are calculable from the coordinates \(\mathbf {x}_{j-1}\), \(\mathbf {x}_{j-2}\) and \(\mathbf {x}_{j-3}\). We shall introduce the quantities \(\delta _j\), \(\theta _j\), \(\tau _j\) and \(\sigma _j\) for embedding vertex j, where \(\delta _j \triangleq d_{j-1,j}\) (cf. Fig. 3). The angle \(\theta _j\) is uniquely defined, and is obtained from the cosine law using the relevant distances,

$$\begin{aligned} \theta _j \triangleq \cos ^{-1} \left( \frac{ d_{j-1,j}^2 + d_{j-2,j-1}^2 - d_{j-2,j}^2 }{ 2 d_{j-1,j} \, d_{j-2,j-1} } \right) \end{aligned}$$
(1)

From the requirement that the vertex order satisfies the strict triangular inequality, \(\theta _j\) is strictly within \((0,\pi )\), ensuring the vertices \((j-3,j-2,j-1)\) are non-collinear. The variable \(\tau _j \triangleq \cos \omega _j\) is related to the dihedral angle \(\omega _j\) formed by the vertices \((j-3,j-2,j-1,j)\). When \(\omega _j\) is known from either protein chemistry or measurement, such that \(\omega _j \in {\varOmega }\), the set of experimentally known dihedral angles, we may directly compute \(\tau _j\), as well as the sign variable \(\sigma _j \in \{-1, +1\}\):

$$\begin{aligned} \sigma _j \triangleq {\left\{ \begin{array}{ll} \frac{\sin \omega _j}{|\sin \omega _j|} &{} \text {if } \sin \omega _j \ne 0 \\ 1 &{} \text {if } \sin \omega _j = 0 \end{array}\right. } \end{aligned}$$

In cases where \(\omega _j \notin {\varOmega }\), the iDMDGP nevertheless guarantees that the distance \(d_{j-3,j}\) is available, and we may compute \(\tau _j\) from the cosine law for a trihedron [12]:

$$\begin{aligned} \tau _j = \frac{ 2 d_{j-2,j-1}^2 \left( d_{j-3,j-2}^2 + d_{j-2,j}^2 - d_{j-3,j}^2 \right) - d_{j-3,j-2,j-1} d_{j-2,j-1,j} }{ \sqrt{ 4 d_{j-3,j-2}^2 d_{j-2,j-1}^2 - d_{j-3,j-2,j-1}^2 } \sqrt{ 4 d_{j-2,j-1}^2 d_{j-2,j}^2 - d_{j-2,j-1,j}^2 }} \end{aligned}$$
(2)

where

$$\begin{aligned} d_{j-3,j-2,j-1}&\triangleq d_{j-3,j-2}^2 + d_{j-2,j-1}^2 - d_{j-3,j-1}^2\\ d_{j-2,j-1,j}&\triangleq d_{j-2,j-1}^2 + d_{j-2,j}^2 - d_{j,j-1}^2 \end{aligned}$$

Given \(\delta _j\), \(\theta _j\), \(\tau _j\), and \(\sigma _j\), the embedded coordinates of vertex j are given by the following equation:

$$\begin{aligned} \mathbf {x}_j = \mathbf {p}_1 + \tau _j \,\mathbf {p}_2 + \sigma _j \sqrt{1 - \tau _j^2} \,\mathbf {p}_3 \end{aligned}$$

where \(\mathbf {p}_1,\mathbf {p}_2,\mathbf {p}_3 \in \mathbb {R}^3\) depend only on \(\mathbf {x}_{j-1}\), \(\mathbf {x}_{j-2}\), \(\mathbf {x}_{j-3}\), \(\delta _j\) and \(\theta _j\),

$$\begin{aligned} \mathbf {p}_1&= -\left( \frac{\delta _j}{\Vert \mathbf {r}_{12}\Vert } \right) \left( \left( \cos (\theta _j) - \frac{\Vert \mathbf {r}_{12}\Vert }{\delta _j} \right) \mathbf {x}_{j-1} - \cos (\theta _j) \,\mathbf {x}_{j-2} \right) \\ \mathbf {p}_2&= -\left( \frac{\delta _j}{\Vert \mathbf {r}_{12}\Vert } \right) \left( \frac{\sin (\theta _j)}{\Vert \mathbf {r}_{12} \times \mathbf {r}_{23}\Vert } \right) \left( \Vert \mathbf {r}_{12}\Vert ^2 \,\mathbf {r}_{23} - (\mathbf {r}_{12} \cdot \mathbf {r}_{23}) \,\mathbf {r}_{12} \right) \\ \mathbf {p}_3&= -\left( \frac{\delta _j}{\Vert \mathbf {r}_{12}\Vert } \right) \left( \frac{\sin (\theta _j)}{\Vert \mathbf {r}_{12} \times \mathbf {r}_{23}\Vert } \right) \Vert \mathbf {r}_{12}\Vert \left( \mathbf {r}_{12} \times \mathbf {r}_{23} \right) \end{aligned}$$

and we have introduced \(\mathbf {r}_{12},\mathbf {r}_{23} \in \mathbb {R}^3\) for notational simplicity,

$$\begin{aligned} \mathbf {r}_{12}&= \mathbf {x}_{j-1} - \mathbf {x}_{j-2} \\ \mathbf {r}_{23}&= \mathbf {x}_{j-2} - \mathbf {x}_{j-3} \end{aligned}$$

By explicitly parameterizing each embedding with the sign \(\sigma _j\), we obtain an unambiguous solution for the coordinate \(\mathbf {x}_j\). An iBP search tree that uses these equations therefore stores a set of values \((\delta ,\theta ,\tau ,\sigma ,\mathbf {x})\) at each of its nodes. Thus, whenever \(\omega _j\)—and consequently \(\sigma _j\) and \(\tau _j\)—is known for vertex j, this yields a single branch at that level in the tree.

In the present iBP implementation [5] that uses recursively defined affine transformation matrices, iBP must compute and store a matrix for each repeated vertex in a repetition order. To compute the matrix of a repeated vertex, the sign parameter \(\sigma _j\) must be determined by systematic search during branching: the value of \(\sigma _j\) that yields an embedded coordinate nearest to the originally determined coordinate is used to build the transformation matrix. For vertex orders containing a large degree of repetition, this introduces unnecessary computations. Using the above equations, embedding a new vertex \(\mathbf {x}_j\) requires only \((\delta _j,\theta _j,\omega _j,\sigma _j)\) and the embedded coordinates of the last three vertices in the order, so repeating a vertex in a discretization order introduces a copy of the original vertex coordinate, but requires no additional computations.

Like the iBP implementation using affine transformation matrices, our implementation requires the calculation of \(\tau _j\) from distances by Eq. (2) when \(\omega _j \notin {\varOmega }\). While this can lead to numerical instability in large instances with imperfect distance data, the fact that our implementation directly computes \(\tau _j\) when \(\omega _j \in {\varOmega }\) effectively mitigates this instability when dihedral constraints are available.

Thus far, the discussion of the Clifford algebraic embedding equations has focused on situations when either \(\omega _j\) or \(d_{j,j-3}\) are known exactly, resulting in a single value of \(\tau _j\) and either one or two values of \(\sigma _j\), respectively. When \(d_{j-3,j}\) is an interval distance, such that \(\{j-3,j\} \in E_I\), we shall denote its lower and upper bounds as \(\underline{d}_{j-3,j}\) and \(\overline{d}_{j-3,j}\), respectively. Similarly, when \(\omega _j\) is an interval dihedral, i.e. \(\omega _j \in {\varOmega }_I\), we denote its bounds as \(\underline{\omega }_j\) and \(\overline{\omega }_j\). In a process that is described in more detail below, these intervals are then discretized through linear interpolation in order to obtain values that may be used in the embedding equations.

figure a

2.2 Iterative tree traversal

In order to obtain solutions to the iDMDGP, iBP discretizes the search space by breaking each interval into a finite set of values, resulting in a search tree. We introduce the R-index \(\mathbf {N} \in \mathbb {N}^{|R|}\), referred to as the tree size, which holds the number of branches at each level of this tree. Herein R is the repetition order associated with the iDMDGP instance. The number of branches at level j of the tree is given by \(N_j\), and depends on the geometric information available for vertex j. When j is a repetition of a previously embedded vertex, we have \(N_j = 1\). When \(d_{j-3,j}\) is an exact distance, there are two possible embeddings of vertex j, so \(N_j = 2\). Finally, when \(d_{j-3,j}\) is an interval distance, \(N_j = 2 B\) where B is a user-specified discretization factor. In cases where \(\omega _j\) is known—as opposed to knowing only \(\cos \omega _j\) via \(d_{j-3,j}\)—the branch count \(N_j\) is reduced by a factor of two, as \(\sigma _j\) takes a single value.

By default, interval discretization in iBP is done uniformly, with each of the B discrete points equally spaced upon its arc formed by the three-sphere intersection sub-problem. Due to discretization, iBP is a heuristic method. For any iDMDGP instance having a non-empty feasible set, there is no guarantee that iBP will select discretization points which satisfy the original iDMDGP [9]. However, when sufficiently large values of B are used in concert with a small tolerance during distance feasibility pruning, the chances of obtaining solutions is increased. Thus, in the absence of alternative methods for discretizing the interval \(d_j\) and \(\omega _j\) values, we defer to uniform subdivision.

In the following, we shall employ multi-index notation to describe the iterative tree traversal routine, with multi-indices denoted by capitalized boldface letters. Given a tree size \(\mathbf {N} \in \mathbb {N}^{|R|}\), we introduce the partial order operator for all |R|-indices \(\mathbf {I} \in \mathbb {N}^{|R|}\) as follows:

$$\begin{aligned} \mathbf {I} \le \mathbf {N} \rightarrow {\left\{ \begin{array}{ll} { \textit{true}} &{} \text {if } I_j \le N_j \quad \forall j \in \{1,\dots ,|R|\} \\ {\textit{false}} &{} \text {otherwise} \end{array}\right. } \end{aligned}$$

An index \(\mathbf {I}\) describes a valid—though potentially infeasible—path through the tree if it satisfies \(\mathbf {1} \le \mathbf {I} \le \mathbf {N}\), where \(\mathbf {1}\) is a vector of ones.

In the iterative formulation of iBP described in this paper, paths through the search tree are enumerated by advancing an index \(\mathbf {I}\) (Algorithm 1). Advancing an index \(\mathbf {I}\) at its last element \(I_{|R|}\) yields a sequence of indices—and thus paths through a tree—that implicitly effects a depth-first search in that tree. On the other hand, incrementing an index \(\mathbf {I}\) at some element \(I_j\), with \(j < |R|\), is equivalent to pruning the tree at that level, as it skips all indices that would have been produced by incrementing at |R|. The term index is used to describe \(\mathbf {I}\) due to the similarity it shares to the indices of multidimensional arrays of shape \(\mathbf {N}\). In this analogy, each leaf of the search tree is a given element in a multidimensional array, which has |R| dimensions and \(N_j\) elements along dimension j. Alternatively, the act of advancing an index may be considered equivalent to a set of combined operations on a stack S.

Proposition 1

(Stack equivalence) An index \(\mathbf {I} \in \mathbb {N}^{|R|}\), combined with a level \(j \le |R|\), represents a stack S, such that \(S = (I_1, I_2, I_3, \dots , I_j)\). By introducing the following equivalence between index operations and stack operations,

  1. i.

    \(I_k \leftarrow z, \; k > j\): no operation.

  2. ii.

    \(I_k \leftarrow z, \; k \le j\):

    1. a.

      pop S until \(|S| = k-1\),

    2. b.

      push z onto S.

  3. iii.

    \(j \leftarrow k, \; k > j\): push 1 onto S until \(|S| = k\).

we see that the advance method maintains a stack within the first j elements of its index \(\mathbf {I}\).

Proof

(Stack equivalence)

  1. i.

    As S only contains the first j elements of \(\mathbf {I}\), the modification of an index element \(I_k\) for \(k > j\) does not modify S.

  2. ii.

    This applies to lines 5 and 8 of Algorithm 1, and ensures that \(S_k = I_k\) for \(k \le j\). Note that this changes the size of the stack, which is reflected in the algorithm by the least modified index \(j^*\)

  3. iii.

    This ensures that the initialization (\(j \leftarrow 1\)) and the act of moving to the next tree level (\(j \leftarrow j + 1\)) are equivalent to pushing the next available tree node onto the stack S.

\(\square \)

Using these indices, we may construct an iterative variant of iBP that implicitly traverses and prunes a search tree by advancing an index (Algorithm 2). The calcAngle and calcTorsion methods in Algorithm 2 calculate the bond angle \(\theta _i\) and the cosine of the dihedral angle, \(\tau _i\), using Eqs. 1 and 2 above, using the provided vertex coordinates to compute the required distances.

figure b

2.3 Ramachandran-defined vertex orders

In order to leverage known backbone dihedral angles when enumerating solutions to the iDMDGP, a new set of vertex orders was introduced. We define an initial order \(R_1\), an inner order \(R_i\) and a final order \(R_n\) for the first, inner, and last amino acid units of protein graphs, respectively. The vertex orders are as follows:

$$\begin{aligned} R_1&= \left\{ \mathrm{N}^{(1)},\; \mathrm{H}_1^{(1)},\; \mathrm{H}_2^{(1)},\; \mathrm{C}_\alpha ^{(1)},\; \mathrm{N}^{(1)},\, \mathrm{H}_\alpha ^{(1)},\; \mathrm{C}_\alpha ^{(1)},\; \mathrm{C}'^{(1)} \right\} \\ R_i&= \left\{ \mathrm{N}^{(i)},\; \mathrm{O}^{(i-1)},\; \mathrm{C}_\alpha ^{(i-1)},\; \mathrm{C}'^{(i-1)},\; \mathrm{N}^{(i)},\; \mathrm{C}_\alpha ^{(i)},\; \mathrm{C}'^{(i)},\; \mathrm{N}^{(i+1)}, \right. \\&\left. \qquad \mathrm{C}^{(i-1)},\; \mathrm{N}^{(i)},\; \mathrm{C}_\alpha ^{(i)},\; \mathrm{H}_1^{(i)},\; \mathrm{N}^{(i)},\; \mathrm{C}_\alpha ^{(i)},\; \mathrm{C}'^{(i)},\; \mathrm{H}_\alpha ^{(i)},\; \mathrm{C}'^{(i)},\; \mathrm{C}_\alpha ^{(i)} \right\} \\ R_n&=\left\{ \mathrm{N}^{(n)},\; \mathrm{O}^{(n-1)},\; \mathrm{C}_\alpha ^{(n-1)},\; \mathrm{C}'^{(n-1)},\; \mathrm{N}^{(n)},\; \mathrm{C}_\alpha ^{(n)},\; \right. \\&\qquad \mathrm{C}'^{(n)},\; \mathrm{C}'^{(n-1)},\; \mathrm{N}^{(n)},\; \mathrm{C}_\alpha ^{(n)},\; \mathrm{H}_1^{(n)},\; \mathrm{N}^{(n)},\; \mathrm{C}_\alpha ^{(n)},\; \mathrm{C}'^{(n)},\; \\&\left. \qquad \mathrm{H}_\alpha ^{(n)},\; \mathrm{C}'^{(n)},\; \mathrm{C}_\alpha ^{(n)},\; \mathrm{O}_1^{(n)},\; \mathrm{C}'^{(n)},\; \mathrm{O}_2^{(n)} \right\} \end{aligned}$$

where i denotes the amino acid unit index and n denotes the total number of amino acid units within this context. These orders make extensive use of repeated vertices in order to achieve direct branching on the backbone dihedrals \(\phi \), \(\psi \) and \(\omega \).

Fig. 4
figure 4

Reorder-dependent differences in sampling of Ramachandran space, using identical branching factors (\(B = 64\)). Black crosses and red dots represent conformations sampled with and without application of dihedral restraints to \((\phi ,\psi )\), respectively. a Old repetition orders. b New Ramachandran-defined repetition orders. (Color figure online)

Figure 4 illustrates how the new orders more directly and uniformly sample Ramachandran space than previously published vertex orders [7]. An iBP calculation was performed, using both vertex orders, to find solutions to the iDMDGP instance of a backbone-only tetrapeptide, with the direct distance feasibility (DDF, [5]) and van der Waals (VDW) pruning devices enabled. Because the previous orders do not branch on the \(\phi \) or \(\omega \) backbone dihedral angles of proteins, they do not enable iBP to cover this space in a regular way, and often sample nearby points in \((\phi ,\psi )\)-space more than once during tree traversal (Fig. 4a). An even stronger difference in behaviour is observed when constraints are placed on an instance’s \(\phi \) and \(\psi \) angles (Fig. 4, red dots). Using previously published orders, iBP treats the dihedral constraints as extra pruning edges, thus maintaining the same irregular sampling pattern and pruning any partial conformations that fall outside the constraints. Using the new orders that directly branch on protein backbone dihedral angles, iBP adaptively changes its sampling pattern (Fig. 4b, red dots) to better cover the region specified by the constraints. The red rectangle in the second panel of Fig. 4 is indeed a regular grid of sampled conformations. The central regions of \((\phi ,\psi )\)-space were not sampled by either vertex order, as they were pruned by the VDW device.

3 Implementation-specific technical innovations

3.1 CHARMM-syntax force fields

The basic input to protein structure determination procedures is the amino acid sequence of the protein of interest. From the sequence, software packages draw information from “force field” libraries in order to construct the topology (graph structure) and parameters (graph edge weighting) of the target molecule. Routinely employed software packages [2, 3, 22, 26] employ a common force field syntax derived from the Chemistry at HARvard Molecular Mechanics (CHARMM) software package [2]. To ensure extensibility, flexibility and interoperability with these packages, our implementation of iBP also uses CHARMM-syntax force fields to construct iDMDGP instances. In addition, iBP accepts a superset of the CHARMM parameter file syntax that allows users to specify interval distances, angles and dihedrals within force fields. The force field used by iBP was derived from the protein based force field of the Crystallography and NMR System (CNS, [3]) with minor additions. For example, the topology information of a protein backbone is defined in iBP using the following notation:

figure c

Vertex orders in iBP are also specified using a custom syntax that retains the flexibility of CHARMM-style force fields. Each amino acid unit defined in the force field topology files is given a corresponding order. As an example, the Ramachandran-defined vertex order of a protein backbone corresponding to the above topology entry is specified as follows:

figure d

where -O denotes atom O of the previous amino acid unit, and +N denotes atom N of the next. At runtime, iBP constructs the iDMDGP instance of a protein from its sequence using this combined topology, parameter, and vertex order information. The validity of the resulting problem instance is then checked, and tree traversal is initiated via Algorithm 2. Therefore, it is now straightforward to use iBP within routine structure determination efforts. In addition, this flexibility enables iBP to handle structure determination involving post-translational modifications and non-natural amino acids, provided a repetition order can be crafted.

3.2 Pruning and timing statistics

Our implementation of iBP admits the inclusion of new pruning devices via a system of callback functions. On tree initialization, each pruning device registers its callbacks at each level of the tree. During tree traversal, iBP executes each registered callback function to determine the feasibility of newly embedded vertices. Finally, when traversal is completed or terminated prematurely by the user, each pruning device outputs vertex-specific pruning results. This is useful for identifying any distance, angle or torsion constraints that are geometrically inconsistent, for example.

Furthermore, the index-based traversal algorithm enables iBP to roughly estimate its runtime. Given an index \(\mathbf {I}\) in a tree of size \(\mathbf {N}\), we define the “width” of the sub-tree from \(\mathbf {1}\) to \(\mathbf {I}\) as the number of leaves in that sub-tree,

$$\begin{aligned} w(\mathbf {I}, \mathbf {N}) = \sum _{j=1}^{|R|} (I_j - 1) \prod _{k=j+1}^{|R|} N_k \end{aligned}$$

Therefore, the width of the entire tree is \(w(\mathbf {N}, \mathbf {N})\)+1, and the width of the yet-untraversed portion of the tree is given by \(w(\mathbf {N} - \mathbf {I} + \mathbf {1}, \mathbf {N})\). Assuming the rate of tree traversal—defined as the number of leaf nodes traversed per unit time—is relatively constant on average, we estimate the remaining runtime of an iBP tree traversal as follows:

$$\begin{aligned} t_r = \frac{ w(\mathbf {N} - \mathbf {I} + \mathbf {1}, \mathbf {N}) }{ w(\mathbf {I}, \mathbf {N}) } t_e \end{aligned}$$

where \(t_r\) and \(t_e\) are the remaining and elapsed runtimes, respectively. Using these equations, iBP may output tree size and periodic timing information to the user. This assumption of a constant traversal—and thus pruning—rate of iBP is ensured by measuring \(w(\mathbf {N} - \mathbf {I} + \mathbf {1}, \mathbf {N})\) at sufficiently spaced time intervals.

Fig. 5
figure 5

The first solution returned by iBP for the HHD2 iDMDGP instance, using only dihedral constraints predicted from NMR chemical shifts

4 Computational results

4.1 Experimental data

The vertex orders in this work are ideally suited to pairing with backbone dihedral angles, such as those predicted from measured NMR chemical shifts (e.g. by TALOS-N, [23]). To illustrate this, the structure of an HHD2 protein domain (residues G420-R499 of Whirlin isoform-4) was solved by constraining all backbone dihedral angles to the intervals predicted by TALOS-N (Fig. 5). NMR chemical shifts were measured from a uniformly \(^{15}\)N,\(^{13}\)C-labeled protein sample that was prepared at a concentration of 300 \(\upmu \)M in 250 \(\upmu \)L of a buffer solution (150 mM NaCl, 50 mM Tris-HCl, 5% D2O, pH 7.5). All data were collected at 25 \(^\circ \)C on a Bruker Avance III 900 MHz spectrometer with a three-channel cryogenically cooled probe. The following experiments were performed, yielding chemical shifts for the following types of atomic nuclei:

  • BT-HSQC [14]: H\(_1\), N. (cf. Fig. 2)

  • HNHA [25]: H\(_1\), N, H\(_\alpha \).

  • BT-HNCO [14], BT-HNCO+ [8]: H\(_1\), N, C\('\).

  • BT-HNCACB, BT-HNCOCACB [14]: H\(_1\), N, C\(_\alpha \), C\(_\beta \).

From these experiments, about 99% of the expected backbone chemical shifts were assigned (99% of H\(_1\), N, C\(_\alpha \), C\(_\beta \) and C\('\); 91% of H\(_\alpha \)). The assigned chemical shifts, along with the amino acid sequence of HHD2, were then used to obtain predicted intervals for \(\phi \) and \(\psi \) backbone dihedral angles using TALOS-N. The \(\omega \) backbone dihedral angles were fixed to 180\(^\circ \), following standard practice in the structural bioinformatics field. The resulting iDMDGP instance contained 464 vertices, and had a vertex order length \(|R| = 1378\).

Fig. 6
figure 6

Comparison of the first solution returned by iBP for the HHD2 iDMDGP instance (Fig. 5, shown here in dark grey) with the ten lowest-energy structures from ARIA/CNS (colored, thin traces), using the large initial structure set and slow annealing rate

4.2 Structure calculations

To obtain the structure illustrated in Fig. 5, the iterative iBP variant described in Algorithm 2 was run using the Ramachandran-defined vertex orders described in Sect. 2.3. A branching factor of \(B = 16\) and a branch \(\varepsilon \) of 0.01 Å were used, resulting in an effective branching factor \(B_{\mathrm {eff},j}\) that varies at each level of the tree according to the following equation:

$$\begin{aligned} B_{\mathrm {eff},j} \triangleq \min \left\{ B, \left\lfloor \frac{\overline{d}_{j,j-3} - \underline{d}_{j,j-3}}{\varepsilon } \right\rceil \right\} \end{aligned}$$

The direct distance feasibility (DDF, [5]) and van der Waals (VDW) pruning devices were enabled during iBP tree traversal, as well as a new pruning device that tested for direct dihedral angle feasibility. In short, dihedral feasibility pruning ensures that all quartets of atoms with corresponding dihedral angles in \({\varOmega }\) are consistent with their respective dihedral constraints. This dihedral feasibility pruning device effectively generalizes previously developed iBP pruning devices related to proteins, including the \(\alpha \)-helix and chirality pruning devices introduced in [5].

To provide a basis for comparison, conventional NMR structure calculations were performed by molecular dynamics simulated annealing (MDSA) using ARIA/CNS [3, 20], which was recently evaluated as one of the most effective software tools for NMR structure determination [21]. In MDSA, the motion of each protein structure is simulated by numerically solving Newton’s equations of motion from various initial velocities, during which the temperature of the system is reduced from \(\sim 10^4\) to \(\sim 50\) K. The structures from simulated annealing are then subjected to a round of local descent. For both the dynamics and descent stages, the objective function is an approximation of the molecule’s potential energy, which includes terms for the known local geometry and the NMR-derived geometric constraints [18].

Within ARIA/CNS, default parameters were used to randomly generate two sets of 20 and 100 initial structures, based solely on the predicted \((\phi ,\psi )\) angles from TALOS-N. The smaller set of structures was subjected to local descent using a short MDSA calculation, whereas the larger set was subjected to a longer MDSA calculation with a ten-fold slower annealing rate.

While complete traversal of the search tree, which contained \(10^{147}\) leaves, was estimated to require \(10^{140}\) min, the solution illustrated in Fig. 5 was obtained by iBP within a few seconds, thanks to the strong constraints supplied from TALOS-N. In contrast, average times of the short and long MDSA computations were 29 and 186 s, respectively. However, none of the structures produced by short MDSA were in the dihedral angle feasible set, and only 71% of structures from long MDSA were feasible. A comparison of the overall folds produced by iBP and the long MDSA run of ARIA/CNS is given in Fig. 6; while the secondary structures and general fold are the same among all structures, only the iBP structure is guaranteed to be feasible in the iDMDGP.

Fig. 7
figure 7

Summary of root mean square deviations (RMSD) between the first solution produced by iBP from the TALOS instance and the target structure, as a function of dihedral uncertainty (\({\varDelta }\)) and distance threshold \(d_{min}\). Full results are listed in Tables 1, 2, and 3

Table 1 Results of solving the TALOS instance under low dihedral angle uncertainties (\({\varDelta }= 0^\circ , 1^\circ , 2^\circ \))
Table 2 Results of solving the TALOS instance under moderate dihedral angle uncertainties (\({\varDelta }= 5^\circ , 10^\circ , 15^\circ \))
Table 3 Results of solving the TALOS instance under high dihedral angle uncertainties (\({\varDelta }= 20^\circ , 25^\circ , 30^\circ \))

4.3 Sensitivity to uncertainty

In general graphs, iBP exhibits exponential worst-case complexity [17]. Fortunately, the fact that pruning edges of similar length are distributed sufficiently uniformly over the vertex order ensures that branches do not occur frequently, resulting in tractable instances. However, when intervals are employed for branching in iBP, discretization can potentially spoil the favorable properties of the method [9]. Lower discretization factors B increase the probability that no path through the search tree remains feasible after applying the discretization. Conversely, increasing the discretization factor grows the search tree exponentially, resulting in dramatically longer runtimes.

To analyze the sensitivity of the proposed iBP algorithm and vertex order to interval uncertainty, a set of experiments was performed using the first solution from the TALOS instance (the target structure) as a basis. First, dihedral restraints were obtained by computing the backbone \(\phi \) and \(\psi \) dihedral angles from the target structure, and varying degrees of uncertainty (\({\varDelta }\)) were added to the dihedrals of residues 11–13, 30–34, 46–49, and 61–65. These residues were found in “loop” regions of the target structure, which are generally more flexible in protein structures. In addition, long-range distance restraints were obtained by collecting the following set of distances from the target structure:

$$\begin{aligned} \mathscr {D} = \{ \Vert \mathbf {x}_i - \mathbf {x}_j \Vert < d_{\textit{max}} ; \; |i - j| \ge 5 \} \end{aligned}$$

where \(d_{\textit{max}}\) specifies the maximum admissible distance in the set. The final distance restraints were obtained by forming intervals 0.5Å wide around each distance in \(\mathscr {D}\). We shall refer to the size of the set \(\mathscr {D}\) as \(m_{\textit{dist}}\) in all further discussions. For the problem instances resulting from each pair of values \(({\varDelta },d_{\textit{max}})\), iBP was run using identical parameters to those used to solve the original TALOS instance, with the exception that the DDF tolerance was expanded to 0.1Å. The time required for iBP to obtain a single solution, subject to a time limit of 5 h, was recorded as \(t_1\) for each instance, and the root mean square deviation of each obtained solution to the target structure was also recorded (Fig. 7). The complete set of results of this analysis is given in Tables 1, 2, and 3.

This analysis illustrates several general behaviors of iBP for solving protein instances. When dihedrals are specified with high precision (Table 1), the conformational space is small, and iBP rapidly obtains a solution. As \({\varDelta }\) is increased, iBP still rapidly obtains a solution when only a few long-range distances are supplied. However, their deviations from the target structure increase with \({\varDelta }\), and supplying more distances decreases the deviations at the expense of increased computation time. Finally, the inclusion of a large number of distance restraints tends to increase runtime past the 5 h limit, especially when the conformational space is large due to large \({\varDelta }\). While one may expect runtimes to decrease as more distances are added, due to the graph becoming more complete [17], the opposite effect is observed. This is a result of the negative interplay between interval discretization and distance pruning in iBP, and might be alleviated using error-tolerant pruning devices such as mean distance error (MDE) [9].

Finally, it is important to note that none of the artificial instances produced from the original TALOS structure generated solutions when running iBP with previously defined vertex orders [7], due to the irregular sampling patterns of Ramachandran space that those orders produce.

5 Conclusions

This paper introduces a new implementation of the interval Branch-and-Prune algorithm for molecular structure determination [13], and describes several modifications that provide enhanced performance for the specific task of protein structure elucidation. Comparison against an existing state of the art method in the field demonstrates the advantages of iBP in practical structure determination problems. Our new iBP implementation is completely open source, and is available under the MIT license at http://github.com/geekysuavo/ibp-ng.