Introduction

While genome sequencing technologies are constantly evolving, they are still unable to read complete genomic sequences from organisms of interest at once. Instead, they produce a large number of rather short genomic fragments, called reads, originating from unknown locations and strands of the genome. The problem then becomes to assemble the reads into the complete genome. Existing genome assemblers usually assemble reads based on their overlap patterns and produce longer genomic fragments, called contigs, which are typically interspersed with highly polymorphic and/or repetitive regions in the genome. Contigs are further assembled into scaffolds, i.e., sequences of contigs interspaced with gaps (Footnote 1). Assembling scaffolds into larger scaffolds (ideally representing complete chromosomes) is called the scaffold assembly problem.

The scaffold assembly problem is known to be \(\textsf {NP}\)-hard [14, 17, 23, 29, 35], but there still exist a number of methods that use heuristic and/or exact algorithmic approaches to address it. The scaffold assembly problem consists of two subproblems:

  1. determine the order of scaffolds (scaffold order problem); and

  2. determine the orientation (i.e., strand of origin) of scaffolds (scaffold orientation problem).

Some methods attempt to solve these subproblems jointly by using various types of additional data including jumping libraries [11, 15, 20, 21, 25, 27, 32], long error-prone reads [6, 7, 12, 26, 34], homology relationships between genomes [1, 3, 4, 5, 24], etc. Other methods (typically based on wet-lab experiments [13, 22, 28, 30, 31, 33]) can often reliably reconstruct the order of scaffolds, but may fail to impose their orientation.

The scaffold orientation problem is also known to be \(\textsf {NP}\)-hard [10, 23]. Since the scaffold order problem can often be reliably solved with wet-lab based methods, this inspires us to consider the special case of the scaffold orientation problem with the given order of scaffolds, which we refer to as the orientation of ordered scaffolds (OOS) problem. We formulate the OOS as an optimization problem based on given weighted orientations of scaffolds and their pairs (e.g., coming from paired-end sequencing reads, long reads, or homologous relations). We prove that the OOS is \(\textsf {NP}\)-hard both in the case of linear genomes and in the case of circular genomes. We present a polynomial-time algorithm for solving the special case of the OOS, where the orientation of each scaffold is imposed relative to at most two other scaffolds, and further generalize it to an \(\textsf {FPT}\) algorithm for the general OOS problem. The proposed algorithms are implemented in the CAMSA [2] software, which has been developed for comparative analysis and merging of scaffold assemblies.

Background

We start with a brief description of the notation that has been used in the CAMSA framework. The notation provides a unifying way to represent scaffold assemblies obtained by different methods.

Let \({\mathbb {S}}= \{s_i\}_{i=1}^n\) be the set of scaffolds. We represent an assembly of scaffolds as a set of assembly points. Each assembly point is formed by an adjacency between two scaffolds. Namely, an assembly point \(p = (s_i, s_j)\) tells that the scaffolds \(s_i\) and \(s_j\) are adjacent in the assembly, where \(s_i, s_j \in {\mathbb {S}}\). Additionally, we may know the orientation of either or both of the scaffolds and thus distinguish between three types of assembly points:

  1. p is oriented if the orientation of both scaffolds \(s_i\) and \(s_j\) is known;

  2. p is semi-oriented if the orientation of only one scaffold among \(s_i\) and \(s_j\) is known;

  3. p is unoriented if the orientation of neither of \(s_i\) and \(s_j\) is known.

We denote the known orientation of scaffolds in assembly points by overhead arrows. While the right arrow corresponds to the original genomic sequence, the left arrow corresponds to the reverse complement of this sequence. For example, \((\overrightarrow{s_i}, \overleftarrow{s_j})\), \((\overrightarrow{s_i}, s_j)\), and \((s_i, s_j)\) are oriented, semi-oriented, and unoriented assembly points, respectively. We remark that assembly points \((\overrightarrow{s_i}, \overrightarrow{s_j})\) and \((\overleftarrow{s_j}, \overleftarrow{s_i})\) represent the same adjacency between oriented scaffolds; to make this representation unique we will require that in all assembly points \((s_i, s_j)\) we have \(i<j\). Another way to represent the orientation of the scaffolds in an assembly point is by using superscripts h and t denoting the head and tail extremities of the scaffold’s genomic sequence, e.g., \((\overrightarrow{s_i}, \overrightarrow{s_j})\) can also be written as \((s_i^h, s_j^t)\).

We will need an auxiliary function \({{\,\mathrm{\text {sn}}\,}}(p,i)\) defined on an assembly point p and an index \(i\in \{1, 2\}\) that returns the scaffold corresponding to the component i of p (e.g., \({{\,\mathrm{\text {sn}}\,}}((\overrightarrow{s_i}, \overrightarrow{s_j}), 2) = s_j\)). We define a realization of an assembly point p as any oriented assembly point that can be obtained from p by orienting the unoriented scaffolds. We denote the set of realizations of p by \({{\,\mathrm{\text {R}}\,}}(p)\). When p is oriented, it has a single realization equal to p itself (i.e., \({{\,\mathrm{\text {R}}\,}}(p)=\{p\}\)); when p is semi-oriented, it has two realizations (i.e., \(|{{\,\mathrm{\text {R}}\,}}(p)| = 2\)); and when p is unoriented, it has four realizations (i.e., \(|{{\,\mathrm{\text {R}}\,}}(p)| = 4\)). For example,

$$\begin{aligned} {{\,\mathrm{\text {R}}\,}}((s_i, s_j)) = \left\{ (\overrightarrow{s_i}, \overrightarrow{s_j}), (\overrightarrow{s_i}, \overleftarrow{s_j}), (\overleftarrow{s_i}, \overrightarrow{s_j}), (\overleftarrow{s_i}, \overleftarrow{s_j})\right\} . \end{aligned}$$
(1)
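The enumeration in (1) can be sketched in Python as follows. This is a minimal illustration, not CAMSA code: the tuple encoding `(s_i, o_i, s_j, o_j)` with orientations `'+'`, `'-'`, or `None` (unknown) and the function name `realizations` are assumptions made here for the example.

```python
from itertools import product


def realizations(point):
    """Return all oriented realizations of an assembly point.

    A point is encoded (hypothetically) as (s_i, o_i, s_j, o_j), where
    o_i and o_j are '+' (forward), '-' (reverse), or None (unknown).
    """
    s_i, o_i, s_j, o_j = point
    # A known orientation is kept; an unknown one ranges over both strands.
    choices_i = [o_i] if o_i is not None else ['+', '-']
    choices_j = [o_j] if o_j is not None else ['+', '-']
    return [(s_i, a, s_j, b) for a, b in product(choices_i, choices_j)]
```

With this encoding, an oriented point yields one realization, a semi-oriented point two, and an unoriented point four, matching the counts stated above.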

An assembly point p is called a refinement of an assembly point q if \({{\,\mathrm{\text {R}}\,}}(p)\subset {{\,\mathrm{\text {R}}\,}}(q)\). From now on, we assume that no assembly point in a given assembly is a refinement of another assembly point (otherwise we simply discard the latter assembly point as less informative). We further assume that in a given assembly there are no two assembly points \((\overrightarrow{s_i}, s_j)\) and \((s_i, \overrightarrow{s_j})\) such that \(s_i\) or \(s_j\) belongs to yet another assembly point (otherwise (Footnote 2) we simply replace \((\overrightarrow{s_i}, s_j)\) and \((s_i, \overrightarrow{s_j})\) with \((\overrightarrow{s_i}, \overrightarrow{s_j})\)). Similarly, we assume that no assembly points \((\overrightarrow{s_i}, \overleftarrow{s_j}), (\overleftarrow{s_i}, \overrightarrow{s_j}), (\overleftarrow{s_i}, \overleftarrow{s_j})\) can be present in a given assembly at the same time. We refer to an assembly satisfying these assumptions as a proper assembly.

For a given assembly \({\mathbb {A}}\) we will use subscripts u/s/o to denote the sets of unoriented/semi-oriented/oriented assembly points in \({\mathbb {A}}\) (e.g., \({\mathbb {A}}_u\subset {\mathbb {A}}\) is the set of all unoriented assembly points from \({\mathbb {A}}\)). We also denote by \({\mathbb {S}}({\mathbb {A}})\) the set of scaffolds appearing in the assembly points from \({\mathbb {A}}\).

We call two assembly points overlapping if they involve the same scaffold, and further call them conflicting if they involve the same extremity of this scaffold. We generalize this notion for semi-oriented and unoriented assembly points: two assembly points p and q are conflicting if all pairs of their realizations \((p', q')\in {{\,\mathrm{\text {R}}\,}}(p)\times {{\,\mathrm{\text {R}}\,}}(q)\) are conflicting. If some, but not all, pairs of the realizations are conflicting, p and q are called semi-conflicting. Otherwise, p and q are called non-conflicting.

We extend the notion of non-/semi- conflictedness to entire assemblies as follows. A scaffold assembly \({\mathbb {A}}\) is non-conflicting if all pairs of assembly points in it are non-conflicting, and \({\mathbb {A}}\) is semi-conflicting if all pairs of assembly points are non-conflicting or semi-conflicting with at least one pair being semi-conflicting.
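As an illustration, the classification of a pair of assembly points can be sketched in Python by brute force over all pairs of their realizations. The tuple encoding `(s_i, o_i, s_j, o_j)`, the function names, and the convention that two identical oriented points denote the same adjacency rather than a conflict are assumptions of this sketch. The extremity mapping follows the head/tail notation introduced above, where \((\overrightarrow{s_i}, \overrightarrow{s_j})\) is written as \((s_i^h, s_j^t)\).

```python
from itertools import product


def realizations(point):
    """All oriented versions of a point (s_i, o_i, s_j, o_j); None = unknown."""
    s_i, o_i, s_j, o_j = point
    choices_i = [o_i] if o_i is not None else ['+', '-']
    choices_j = [o_j] if o_j is not None else ['+', '-']
    return [(s_i, a, s_j, b) for a, b in product(choices_i, choices_j)]


def extremities(oriented):
    """Scaffold extremities used by an oriented point.

    In the first position a forward scaffold contributes its head, in the
    second position a forward scaffold contributes its tail.
    """
    s_i, o_i, s_j, o_j = oriented
    first = (s_i, 'h' if o_i == '+' else 't')
    second = (s_j, 't' if o_j == '+' else 'h')
    return {first, second}


def classify(p, q):
    """Label a pair of points as conflicting, semi-conflicting, or non-conflicting."""
    flags = [
        # Assumption: identical realizations are the same adjacency, not a conflict.
        rp != rq and bool(extremities(rp) & extremities(rq))
        for rp, rq in product(realizations(p), realizations(q))
    ]
    if all(flags):
        return 'conflicting'
    if any(flags):
        return 'semi-conflicting'
    return 'non-conflicting'
```

The enumeration inspects at most 16 pairs of realizations, so each call takes constant time.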

Methods

Assembly Realizations

For an assembly \({\mathbb {A}}= \{p_i\}_{i=1}^k\), an assembly \({\mathbb {A}}' = \{q_i\}_{i=1}^k\) is called a realization (Footnote 3) of \({\mathbb {A}}\) if there exists a permutation \(\pi\) of order k such that \(q_{\pi _i}\in {{\,\mathrm{\text {R}}\,}}(p_i)\) for all \(i=1,2,\dots ,k\). We denote by \({{\,\mathrm{\text {R}}\,}}({\mathbb {A}})\) the set of realizations of assembly \({\mathbb {A}}\), and by \({{\,\mathrm{\text {NR}}\,}}({\mathbb {A}})\) the set of non-conflicting realizations among them.

We define the scaffold assembly graph \(\mathsf {SAG} ({\mathbb {A}})\) on the set of vertices \(\{s^h, s^t\ :\ s\in {\mathbb {S}}({\mathbb {A}})\}\) and edges of two types: directed edges \((s^t, s^h)\) that encode scaffolds from \({\mathbb {S}}({\mathbb {A}})\), and undirected edges that encode all possible realizations of all assembly points in \({\mathbb {A}}\) (Fig. 1a). We further define the order (multi)graph \({{\,\mathrm{\mathsf {OG}}\,}}({\mathbb {A}})\) formed by the set of vertices \({\mathbb {S}}({\mathbb {A}})\) and the set of undirected edges \(\{\{{{\,\mathrm{\text {sn}}\,}}(p,1), {{\,\mathrm{\text {sn}}\,}}(p,2)\}\ :\ p\in {\mathbb {A}}\}\) (Fig. 1b). The order graph can also be obtained from \(\mathsf {SAG} ({\mathbb {A}})\) by first contracting the directed edges, and then by substituting all edges that encode realizations of the same assembly point with a single edge (Fig. 1b). We define the contracted order graph \({{\,\mathrm{\mathsf {COG}}\,}}({\mathbb {A}})\) obtained from \({{\,\mathrm{\mathsf {OG}}\,}}({\mathbb {A}})\) by replacing all multi-edges with single edges (Fig. 1c).

Let \(\deg (v)\) be the degree of a vertex v in \({{\,\mathrm{\mathsf {OG}}\,}}({\mathbb {A}})\), i.e., the number of edges (counted with multiplicity) incident to v. We call the order graph \({{\,\mathrm{\mathsf {OG}}\,}}({\mathbb {A}})\) non-branching if \(\deg (v)\le 2\) for all vertices v of \({{\,\mathrm{\mathsf {OG}}\,}}({\mathbb {A}})\).
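The non-branching condition depends only on which scaffolds each assembly point involves, so it can be checked by counting incidences. The sketch below assumes a hypothetical tuple encoding `(s_i, o_i, s_j, o_j)` for assembly points; the function name is likewise illustrative.

```python
from collections import Counter


def is_non_branching(assembly):
    """Check that every vertex of the order (multi)graph has degree at most 2.

    Each assembly point contributes one edge between its two scaffolds, and
    parallel edges are counted with multiplicity.
    """
    degree = Counter()
    for s_i, _o_i, s_j, _o_j in assembly:
        degree[s_i] += 1
        degree[s_j] += 1
    return all(d <= 2 for d in degree.values())
```

Because edges are counted with multiplicity, two assembly points on the same pair of scaffolds together with a third point on one of them already make the order graph branching.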

Fig. 1

For an assembly \(A = \{(s_1, \overrightarrow{s_2}), (\overrightarrow{s_1}, \overrightarrow{s_2}), (\overrightarrow{s_2}, \overrightarrow{s_3}), (\overrightarrow{s_3}, s_4), (\overleftarrow{s_1}, \overleftarrow{s_4}), (\overrightarrow{s_5}, s_6)\), \((\overleftarrow{s_6}, \overrightarrow{s_7}), (\overrightarrow{s_6}, s_7)\}\), (a) the scaffold assembly graph \(\mathsf {SAG} (A)\), where semi-oriented assembly points, oriented assembly points, and scaffolds are represented by dashed red edges, solid red edges, and directed black edges, respectively. (b) The order graph \({{\,\mathrm{\mathsf {OG}}\,}}(A)\). (c) The contracted order graph \({{\,\mathrm{\mathsf {COG}}\,}}(A)\)

Lemma 1

For a non-conflicting realization \({\mathbb {A}}'\) of an assembly \({\mathbb {A}}\), \({{\,\mathrm{\mathsf {OG}}\,}}({\mathbb {A}}')\) is non-branching.

Proof

Each vertex v in \({{\,\mathrm{\mathsf {OG}}\,}}({\mathbb {A}}')\) represents a scaffold, which has two extremities and thus can participate in at most two non-conflicting assembly points in \({\mathbb {A}}'\). Hence, \(\deg (v)\le 2\). \(\square\)

We notice that any non-conflicting realization \({\mathbb {A}}'\) of an assembly \({\mathbb {A}}\) provides orientation for all scaffolds involved in each connected component of \(\mathsf {SAG} ({\mathbb {A}}')\) (as well as of \({{\,\mathrm{\mathsf {OG}}\,}}({\mathbb {A}}')\) and \({{\,\mathrm{\mathsf {COG}}\,}}({\mathbb {A}}')\)) relatively to each other.

Theorem 1

An assembly \({\mathbb {A}}\) has at least one non-conflicting realization (i.e., \(|{{\,\mathrm{\text {NR}}\,}}({\mathbb {A}})|\ge 1\)) if and only if \({\mathbb {A}}\) is non-conflicting or semi-conflicting and \({{\,\mathrm{\mathsf {OG}}\,}}({\mathbb {A}})\) is non-branching.

Proof

Suppose that \(|{{\,\mathrm{\text {NR}}\,}}({\mathbb {A}})|\ge 1\) and pick any \({\mathbb {A}}'\in {{\,\mathrm{\text {NR}}\,}}({\mathbb {A}})\). Then for every pair of assembly points \(p,q\in {\mathbb {A}}\), their realizations in \({\mathbb {A}}'\) are non-conflicting, implying that p and q are either non-conflicting or semi-conflicting. Hence, \({\mathbb {A}}\) is non-conflicting or semi-conflicting. Since \({\mathbb {A}}\) is a proper assembly, we have \({{\,\mathrm{\mathsf {OG}}\,}}({\mathbb {A}})={{\,\mathrm{\mathsf {OG}}\,}}({\mathbb {A}}')\). Taking into account that \({\mathbb {A}}'\) is non-conflicting, Lemma 1 implies that \({{\,\mathrm{\mathsf {OG}}\,}}({\mathbb {A}})\) is non-branching.

Vice versa, suppose that \({\mathbb {A}}\) is non-conflicting or semi-conflicting and \({{\,\mathrm{\mathsf {OG}}\,}}({\mathbb {A}})\) is non-branching. To prove that \(|{{\,\mathrm{\text {NR}}\,}}({\mathbb {A}})|\ge 1\), we will orient unoriented scaffolds in all assembly points in \({\mathbb {A}}\) without creating conflicts. Every scaffold s corresponds to a vertex v in \({{\,\mathrm{\mathsf {OG}}\,}}({\mathbb {A}})\) of degree at most 2. If \(\deg (v)=1\), then s participates in one assembly point p, and s is either already oriented in p or we pick an arbitrary orientation for it. If \(\deg (v)=2\), then s participates in two overlapping assembly points p and q. If s is not oriented in either of p, q, we pick an arbitrary orientation for it consistently across p and q (i.e., keeping them non-conflicting). If s is oriented in exactly one assembly point, we orient the unoriented instance of s consistently with its orientation in the other assembly point. Since conflicts may appear only between assembly points that share a vertex in \({{\,\mathrm{\mathsf {OG}}\,}}({\mathbb {A}})\), the constructed orientations produce no new conflicts. On the other hand, the scaffolds that are already oriented in \({\mathbb {A}}\) impose no conflicts since \({\mathbb {A}}\) is non-conflicting or semi-conflicting. Hence, the resulting oriented assembly points form a non-conflicting assembly from \({{\,\mathrm{\text {NR}}\,}}({\mathbb {A}})\), i.e., \(|{{\,\mathrm{\text {NR}}\,}}({\mathbb {A}})|\ge 1\). \(\square\)

We remark that if \({{\,\mathrm{\mathsf {OG}}\,}}({\mathbb {A}})\) is branching, the assembly \({\mathbb {A}}\) may be semi-conflicting but have \(|{{\,\mathrm{\text {NR}}\,}}({\mathbb {A}})|=0\). An example is given by \({\mathbb {A}}=\{(s_1,s_{i+1})\}_{i=1}^k\) with \(k>2\), which contains no conflicting pairs of assembly points (in fact, all pairs of assembly points in \({\mathbb {A}}\) are semi-conflicting), but \(|{{\,\mathrm{\text {NR}}\,}}({\mathbb {A}})|=0\).

From now on, we will always assume that assembly \({\mathbb {A}}\) has at least one non-conflicting realization (i.e., \(|{{\,\mathrm{\text {NR}}\,}}({\mathbb {A}})|\ge 1\)). For an assembly \({\mathbb {A}}\), the orientation of some scaffolds from \({\mathbb {S}}({\mathbb {A}})\) does not depend on the choice of a realization from \({{\,\mathrm{\text {NR}}\,}}({\mathbb {A}})\) (we denote the set of such scaffolds by \({\mathbb {S}}_o({\mathbb {A}})\)), while the orientation of other scaffolds within some assembly points varies across realizations from \({{\,\mathrm{\text {NR}}\,}}({\mathbb {A}})\) (we denote the set of such scaffolds by \({\mathbb {S}}_u({\mathbb {A}})\)). Trivially, we have \({\mathbb {S}}_u({\mathbb {A}})\cup {\mathbb {S}}_o({\mathbb {A}})={\mathbb {S}}({\mathbb {A}})\). It can be easily seen that the set \({\mathbb {S}}_u({\mathbb {A}})\) is formed by the scaffolds for which the orientation in the proof of Theorem 1 was chosen arbitrarily, implying the following statement.

Corollary 1

For a given assembly \({\mathbb {A}}\) with \(|{{\,\mathrm{\text {NR}}\,}}({\mathbb {A}})|\ge 1\), we have \(|{{\,\mathrm{\text {NR}}\,}}({\mathbb {A}})|=2^{|{\mathbb {S}}_u({\mathbb {A}})|}\).

We label scaffolds from \({\mathbb {S}}({\mathbb {A}})\) with integers \(\{1,\dots ,|{\mathbb {S}}({\mathbb {A}})|\}\). From a computational perspective, we assume that we can get a scaffold from its name and vice versa in \({\mathcal {O}}\left( 1\right)\) time.

Lemma 2

Testing whether a given assembly \({\mathbb {A}}\) has a non-conflicting realization can be done in \({\mathcal {O}}\left( k\right)\) time, where \(k=|{\mathbb {S}}({\mathbb {A}})|\).

Proof

To test whether \({\mathbb {A}}\) has a non-conflicting realization, we first create a hash table indexed by \({\mathbb {S}}({\mathbb {A}})\) that for every scaffold \(s\in {\mathbb {S}}({\mathbb {A}})\) will contain a list of assembly points that involve s. We iterate over all assembly points \(p\in {\mathbb {A}}\) and add p to two lists in the hash table indexed by the scaffolds participating in p. If the length of some list becomes greater than 2, then \({\mathbb {A}}\) is conflicting and we stop. If we successfully complete the iterations, then every scaffold from \({\mathbb {S}}({\mathbb {A}})\) participates in at most two assembly points in \({\mathbb {A}}\), and thus we made \({\mathcal {O}}\left( k\right)\) steps of \({\mathcal {O}}\left( 1\right)\) time each.

Next, for every scaffold whose list in the hash table has length 2, we check whether the corresponding assembly points are either non-conflicting or semi-conflicting. If not, then \({\mathbb {A}}\) is conflicting and we stop. If the check completes successfully, then \({\mathbb {A}}\) has a non-conflicting realization by Theorem 1. The check takes \({\mathcal {O}}\left( k\right)\) steps of \({\mathcal {O}}\left( 1\right)\) time each, and thus the total running time comes to \({\mathcal {O}}\left( k\right)\). \(\square\)

Pseudocode for the test described in the proof of Lemma 2 is given in Algorithm 3 in the Appendix.
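The two passes of this test can also be sketched in Python. This is an illustration under stated assumptions rather than the authors' Algorithm 3: assembly points are encoded as hypothetical tuples `(s_i, o_i, s_j, o_j)`, and the pairwise check is delegated to a caller-supplied function `classify(p, q)` that is assumed to return `'conflicting'`, `'semi-conflicting'`, or `'non-conflicting'` in constant time.

```python
from collections import defaultdict


def has_non_conflicting_realization(assembly, classify):
    """Two-pass test following the proof of Lemma 2.

    classify(p, q) is a hypothetical caller-supplied pairwise check.
    """
    # Pass 1: index assembly points by scaffold, stopping as soon as some
    # scaffold participates in more than two assembly points.
    by_scaffold = defaultdict(list)
    for point in assembly:
        for scaffold in (point[0], point[2]):
            by_scaffold[scaffold].append(point)
            if len(by_scaffold[scaffold]) > 2:
                return False
    # Pass 2: for every scaffold shared by exactly two assembly points,
    # reject the assembly if that pair is conflicting.
    for points in by_scaffold.values():
        if len(points) == 2 and classify(points[0], points[1]) == 'conflicting':
            return False
    return True
```

The early return in the first pass is what bounds the number of iterations: once every list has length at most 2, the number of assembly points is at most the number of scaffolds.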

Lemma 3

For a given assembly \({\mathbb {A}}\) with \(|{{\,\mathrm{\text {NR}}\,}}({\mathbb {A}})|\ge 1\), the set \({\mathbb {S}}_u({\mathbb {A}})\) can be computed in \({\mathcal {O}}\left( k\right)\) time, where \(k=|{\mathbb {S}}({\mathbb {A}})|\).

Proof

We will construct the set \(S = {\mathbb {S}}_u({\mathbb {A}})\) iteratively. Initially we let \(S=\emptyset\). Following the algorithm described in the proof for Lemma 2, we construct a hash table that for every scaffold \(i\in {\mathbb {S}}({\mathbb {A}})\) contains a list of assembly points that involve i (which takes \({\mathcal {O}}\left( k\right)\) time). Then for every \(i\in {\mathbb {S}}({\mathbb {A}})\), we check if either of the corresponding assembly points provides an orientation for i; if not, we add i to S. This check for each scaffold takes \({\mathcal {O}}\left( 1\right)\) time, bringing the total running time to \({\mathcal {O}}\left( k\right) .\) \(\square\)

Pseudocode for the computation of \({\mathbb {S}}_u({\mathbb {A}})\) described in the proof of Lemma 3 is given in Algorithm 4 in the Appendix.
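A minimal Python sketch of the same computation is shown below. It is not the authors' Algorithm 4: it assumes the hypothetical tuple encoding `(s_i, o_i, s_j, o_j)` with `None` for an unknown orientation, and it assumes the input already has a non-conflicting realization.

```python
from collections import defaultdict


def unoriented_scaffolds(assembly):
    """Return the scaffolds whose orientation is given by no assembly point.

    Assumes each scaffold occurs in at most two assembly points, so every
    per-scaffold list has constant length.
    """
    seen = defaultdict(list)
    for s_i, o_i, s_j, o_j in assembly:
        seen[s_i].append(o_i)
        seen[s_j].append(o_j)
    return {s for s, orientations in seen.items()
            if all(o is None for o in orientations)}
```

By Corollary 1, the number of non-conflicting realizations is then `2 ** len(unoriented_scaffolds(assembly))`.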

Problem Formulations

For a non-conflicting assembly \({\mathbb {A}}\) composed only of oriented assembly points, an assembly point p on scaffolds \(s_i, s_j\in {\mathbb {S}}({\mathbb {A}})\) has a consistent orientation with \({\mathbb {A}}\) if for some \(p'\in {{\,\mathrm{\text {R}}\,}}(p)\) there exists a path connecting edges \(s_i\) and \(s_j\) in \(\mathsf {SAG} ({\mathbb {A}})\) such that the direction of edges \(s_i\) and \(s_j\) at the path ends is consistent with \(p'\) (e.g., in Fig. 1a, the assembly point \((\overrightarrow{s_1}, \overrightarrow{s_3})\) has a consistent orientation with the assembly \({\mathbb {A}}\)). Furthermore, for a non-conflicting assembly \({\mathbb {A}}\) that has at least one non-conflicting realization, an assembly point p has a consistent orientation with \({\mathbb {A}}\) if \(p'\) has a consistent orientation with \({\mathbb {A}}'\) for some \(p'\in {{\,\mathrm{\text {R}}\,}}(p)\) and \({\mathbb {A}}' \in {{\,\mathrm{\text {NR}}\,}}({\mathbb {A}})\).

We formulate the orientation of ordered scaffolds problem as follows.

Orientation of Ordered Scaffolds

(OOS) Let \({\mathbb {A}}\) be an assembly and \({\mathbb {O}}\) be a set (Footnote 4) of assembly points such that \(|{{\,\mathrm{\text {NR}}\,}}({\mathbb {A}})|\ge 1\) and \({\mathbb {S}}({\mathbb {O}})\subset {\mathbb {S}}({\mathbb {A}})\). Find a non-conflicting realization \({\mathbb {A}}'\in {{\,\mathrm{\text {NR}}\,}}({\mathbb {A}})\) that maximizes the number (total weight) of assembly points from \({\mathbb {O}}\) having consistent orientations with \({\mathbb {A}}'\).

From the biological perspective, the OOS can be viewed as a formalization of the case where (sub)orders of scaffolds have been determined (which defines \({\mathbb {A}}\)), while there exists some information (possibly coming from different sources and conflicting) about their relative orientation (which defines \({\mathbb {O}}\)). The OOS asks to orient unoriented scaffolds in the given scaffold orders in a way that is most consistent with the given orientation information.

We also remark that the OOS can be viewed as a fine-grained variant of the scaffold orientation problem studied in [10]. In our terminology, the latter problem concerns an artificial circular genome \({\mathbb {A}}\) formed by the given scaffolds in an arbitrary order (so that there is a path connecting any scaffold or its reverse complement to any other scaffold in \({{\,\mathrm{\mathsf {OG}}\,}}({\mathbb {A}})\)), and \({\mathbb {O}}\) formed by unordered pairs of scaffolds supplemented with the binary information on whether each such pair comes from the same or different strands of the genome. In contrast, in the OOS, the assembly \({\mathbb {A}}\) is given and \({{\,\mathrm{\mathsf {OG}}\,}}({\mathbb {A}})\) does not have to be connected or non-branching, while \({\mathbb {O}}\) may provide a pair of scaffolds with up to four options (as in (1)) of their relative orientation.

At the latest stages of genome assembly, the constructed scaffolds are usually of significant length. If (sub)orders for these scaffolds are known, it is rather rare to have orientation-imposing information that would involve non-neighboring scaffolds. Or, more generally, it is rather rare to have orientation-imposing information for one scaffold with respect to more than two other scaffolds. This inspires us to consider a special case of the OOS problem:

Non-branching Orientation of Ordered Scaffolds

(NOOS) Given an OOS instance \(({\mathbb {A}},{\mathbb {O}})\) such that the graph \({{\,\mathrm{\mathsf {COG}}\,}}({\mathbb {O}})\) is non-branching, find \({\mathbb {A}}'\in {{\,\mathrm{\text {NR}}\,}}({\mathbb {A}})\) that maximizes the number of assembly points from \({\mathbb {O}}\) having consistent orientations with \({\mathbb {A}}'\).

\(\textsf {NP}\)-Hardness of the OOS

We consider two important partial cases of the OOS, where the assembly \({\mathbb {A}}\) represents a linear or circular genome up to unknown orientations of the scaffolds. In these cases, the graph \({{\,\mathrm{\mathsf {OG}}\,}}({\mathbb {A}})\) forms a collection of paths or cycles, respectively. Below we prove that the OOS in both these cases is \(\textsf {NP}\)-hard.

Lemma 4

The OOS for linear genomes is \(\textsf {NP}\)-hard.

Proof

We will construct a polynomial-time reduction from the \(\mathrm {MAX}\ 2\)-\(\mathrm {DNF}\) problem, which is known to be \(\textsf {NP}\)-hard [8, 16]. An instance of \(\mathrm {MAX}\ 2\)-\(\mathrm {DNF}\) consists of clauses \(C = \{c_i\}_{i=1}^k\) each formed by either a single variable or a conjunction of two variables from \(X = \{x_i\}_{i=1}^n\), each of which may or may not be negated. The goal is to determine the maximum number of clauses that can be simultaneously satisfied by a 0/1 assignment to the variables from X. For a given instance \(I=(C,X)\) of \(\mathrm {MAX}\ 2\)-\(\mathrm {DNF}\), we construct an assembly

$$\begin{aligned} {\mathbb {A}}= \{ (0,x_1) \}\ \cup \ \{ (x_i, x_{i+1})\ :\ i=1,2,\dots ,n-1\}. \end{aligned}$$

Then we construct a set of assembly points \({\mathbb {O}}\) from the clauses in C as follows. For each conjunction \(c\in C\) of variables \(x_i\) and \(x_j\) (\(i<j\)), we add an oriented assembly point on scaffolds \(x_i, x_j\) to \({\mathbb {O}}\) with the orientation depending on the presence of negation of these variables in c (e.g., a conjunction \(x_i\wedge \overline{x_j}\) is translated into an assembly point \((\overrightarrow{x_i}, \overleftarrow{x_j})\)). For each clause \(c\in C\) with a single variable x, we add an assembly point \((\overrightarrow{0}, \overrightarrow{x})\) or \((\overrightarrow{0}, \overleftarrow{x})\) depending on whether x is negated in c.

It is easy to see that the constructed assembly \({\mathbb {A}}\) is semi-conflicting and \({{\,\mathrm{\mathsf {OG}}\,}}({\mathbb {A}})\) is a path, and thus by Theorem 1, \({\mathbb {A}}\) has a non-conflicting realization. Hence, \({\mathbb {A}}\) and \({\mathbb {O}}\) form an instance of the OOS for linear genomes. A solution \({\mathbb {A}}'\) to this OOS provides an orientation for each \(x\in {\mathbb {S}}\) that maximizes the number of assembly points from \({\mathbb {O}}\) having consistent orientations with \({\mathbb {A}}'\). A solution to I is obtained from \({\mathbb {A}}'\) as the assignment of 0 or 1 to each variable x depending on whether the orientation of scaffold x in \({\mathbb {A}}'\) is forward or reverse. Indeed, since each assembly point in \({\mathbb {O}}\) having consistent orientation with \({\mathbb {A}}'\) corresponds to a truthful clause in I, the number of such clauses is maximized.

Since the OOS instance and the solution to I can be computed in polynomial time, the above construction represents a polynomial-time reduction from the \(\mathrm {MAX}\ 2\)-\(\mathrm {DNF}\) to the OOS for linear genomes. \(\square\)
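The instance construction used in this reduction can be sketched in Python. The encoding is an assumption of this example: variables are labeled \(1,\dots ,n\), a literal is a pair `(variable, negated)`, a clause is a tuple of one or two literals on distinct variables, and assembly points are tuples `(s_i, o_i, s_j, o_j)` with `None` for an unknown orientation. Scaffold 0 plays the role of the auxiliary scaffold in the construction above.

```python
def reduce_max_2dnf(n, clauses):
    """Build the pair (assembly, targets) from a MAX 2-DNF instance.

    n       -- number of variables, labeled 1..n
    clauses -- iterable of tuples of literals (variable, negated)
    """
    def orient(negated):
        # A negated literal maps to the reverse strand.
        return '-' if negated else '+'

    # Unoriented chain 0 - x_1 - x_2 - ... - x_n.
    assembly = [(0, None, 1, None)]
    assembly += [(i, None, i + 1, None) for i in range(1, n)]

    targets = []
    for clause in clauses:
        literals = sorted(clause)  # smaller variable index first
        if len(literals) == 1:
            (x, negated), = literals
            targets.append((0, '+', x, orient(negated)))
        else:
            (x_i, neg_i), (x_j, neg_j) = literals
            targets.append((x_i, orient(neg_i), x_j, orient(neg_j)))
    return assembly, targets
```

Both lists have size linear in the input, which is consistent with the polynomial-time bound claimed for the reduction.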

Lemma 5

The OOS for circular genomes is \(\textsf {NP}\)-hard.

Proof

We construct a polynomial-time reduction from the \(\mathrm {MAX}\)-\(\mathrm {CUT}\) problem, which is known to be \(\textsf {NP}\)-hard [18, 19]. An instance I of \(\mathrm {MAX}\)-\(\mathrm {CUT}\) for a given graph \((V, E)\) asks to partition the set of vertices \(V = \{v_i\}_{i=1}^n\) into two disjoint subsets \(V_1\) and \(V_2\) such that the number of edges \(\{u, v\}\in E\) with \(u\in V_1\) and \(v\in V_2\) is maximized. For a given instance I of \(\mathrm {MAX}\)-\(\mathrm {CUT}\) problem, we define the assembly

$$\begin{aligned} {\mathbb {A}}= \left\{ (v_i,v_{i+1})\ :\ i=1,2,\dots ,n-1\right\} \cup \left\{ (v_1,v_n) \right\} \end{aligned}$$

and the set of assembly points

$$\begin{aligned} {\mathbb {O}}= \left\{ (\overrightarrow{v_i}, \overleftarrow{v_j})\ :\ \{v_i, v_j\}\in E \right\} . \end{aligned}$$

It is easy to see that \({\mathbb {A}}\) has a non-conflicting realization and \({{\,\mathrm{\mathsf {OG}}\,}}({\mathbb {A}})\) is a cycle, i.e., \({\mathbb {A}}\) and \({\mathbb {O}}\) form an instance of the OOS for circular genomes. A solution \({\mathbb {A}}'\) to this OOS instance provides orientations for all elements of \({\mathbb {S}}({\mathbb {A}})=V\) that maximize the number of assembly points from \({\mathbb {O}}\) having consistent orientations with \({\mathbb {A}}'\). A solution to I is obtained as the partition of V into two disjoint subsets, depending on the orientation of scaffolds in \({\mathbb {A}}'\) (forward vs reverse). Indeed, since each assembly point in \({\mathbb {O}}\) having a consistent orientation with \({\mathbb {A}}'\) corresponds to an edge from E whose endpoints belong to distinct subsets in the partition, the number of such edges is maximized.

It is easy to see that the OOS instance and the solution to I can be computed in polynomial time, thus we constructed a polynomial-time reduction from the \(\mathrm {MAX}\)-\(\mathrm {CUT}\) to the OOS for circular genomes. \(\square\)
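For illustration, the construction of \({\mathbb {A}}\) and \({\mathbb {O}}\) from a graph can be sketched in Python. The encoding is assumed for this example only: vertices are labeled \(1,\dots ,n\), edges are pairs of distinct labels, and assembly points are tuples `(s_i, o_i, s_j, o_j)` with `None` for an unknown orientation.

```python
def reduce_max_cut(n, edges):
    """Build the pair (assembly, targets) from a MAX-CUT instance.

    n     -- number of vertices, labeled 1..n
    edges -- iterable of pairs (u, v) of distinct vertex labels
    """
    # Unoriented cycle v_1 - v_2 - ... - v_n - v_1.
    assembly = [(i, None, i + 1, None) for i in range(1, n)]
    assembly.append((1, None, n, None))
    # Every graph edge asks its endpoints to lie on opposite strands;
    # the smaller label is placed first to keep the representation unique.
    targets = [(min(u, v), '+', max(u, v), '-') for u, v in edges]
    return assembly, targets
```

A cut is then read off from a solution by placing forward-oriented scaffolds in one subset and reverse-oriented scaffolds in the other.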

As a trivial consequence of Lemmas 4 and 5, we obtain that the general OOS problem is \(\textsf {NP}\)-hard.

Theorem 2

The OOS is \(\textsf {NP}\)-hard.

Properties of the OOS

In this subsection, we formulate and prove some important properties of the OOS.

Connected Components of \({{\,\mathrm{\mathsf {OG}}\,}}({\mathbb {A}})\)

Below we show that an OOS instance can be solved independently for each connected component of \({{\,\mathrm{\mathsf {OG}}\,}}({\mathbb {A}})\). We start with the following lemma that trivially follows from the definition of consistent orientation.

Lemma 6

Let \({\mathbb {A}}\) be an assembly such that \(|{{\,\mathrm{\text {NR}}\,}}({\mathbb {A}})|\ge 1\). An assembly point on scaffolds \(s_i, s_j\in {\mathbb {S}}({\mathbb {A}})\) may have a consistent orientation with \({\mathbb {A}}\) only if both \(s_i\) and \(s_j\) belong to the same connected component in \({{\,\mathrm{\mathsf {OG}}\,}}({\mathbb {A}})\).

Theorem 3

Let \(({\mathbb {A}},{\mathbb {O}})\) be an OOS instance, and \({\mathbb {A}}= {\mathbb {A}}_1 \cup \dots \cup {\mathbb {A}}_k\) be the partition such that \({{\,\mathrm{\mathsf {OG}}\,}}({\mathbb {A}}_1),\dots ,{{\,\mathrm{\mathsf {OG}}\,}}({\mathbb {A}}_k)\) represent the connected components of \({{\,\mathrm{\mathsf {OG}}\,}}({\mathbb {A}})\). For each \(i=1,2,\dots ,k\), define \({\mathbb {O}}_i = \{ p\in {\mathbb {O}}\ :\ {{\,\mathrm{\text {sn}}\,}}(p,1),{{\,\mathrm{\text {sn}}\,}}(p,2)\in {\mathbb {S}}({\mathbb {A}}_i)\}\) and let \({\mathbb {A}}'_i\) be a solution to the OOS instance \(({\mathbb {A}}_i,{\mathbb {O}}_i)\). Then \({\mathbb {A}}'_1\cup \dots \cup {\mathbb {A}}'_k\) is a solution to the OOS instance \(({\mathbb {A}},{\mathbb {O}})\).

Proof

Lemma 6 implies that we can discard from \({\mathbb {O}}\) all assembly points that are formed by scaffolds from different connected components in \({{\,\mathrm{\mathsf {OG}}\,}}({\mathbb {A}})\). Hence, we may assume that \({\mathbb {O}}= {\mathbb {O}}_1\cup \dots \cup {\mathbb {O}}_k\).

Lemma 6 further implies that an assembly point from \({\mathbb {O}}_i\) may have a consistent orientation with \({\mathbb {A}}_j\) only if \(i=j\). Therefore, any solution to the OOS instance \(({\mathbb {A}},{\mathbb {O}})\) is formed by the union of solutions to the OOS instances \(({\mathbb {A}}_i,{\mathbb {O}}_i)\). \(\square\)
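The decomposition in Theorem 3 can be sketched in Python with a small union-find structure over scaffolds. The tuple encoding `(s_i, o_i, s_j, o_j)` and the function name are assumptions of this example; the sketch only splits the instance and does not solve the resulting subproblems.

```python
def split_by_components(assembly, targets):
    """Split an OOS instance into one sub-instance per component of the order graph.

    Returns a list of pairs (assembly_i, targets_i). Target points whose
    scaffolds lie in different components are discarded.
    """
    parent = {}

    def find(s):
        # Union-find lookup with path halving.
        parent.setdefault(s, s)
        while parent[s] != s:
            parent[s] = parent[parent[s]]
            s = parent[s]
        return s

    # Merge the two scaffolds of every assembly point.
    for s_i, _o_i, s_j, _o_j in assembly:
        parent[find(s_i)] = find(s_j)

    parts = {}
    for point in assembly:
        parts.setdefault(find(point[0]), ([], []))[0].append(point)
    for point in targets:
        root_i, root_j = find(point[0]), find(point[2])
        if root_i == root_j and root_i in parts:
            parts[root_i][1].append(point)
    return list(parts.values())
```

Each returned pair can then be handed to any OOS solver separately, and the union of the per-component solutions forms a solution to the full instance.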

Theorem 3 allows us to focus on instances of the OOS, where \({{\,\mathrm{\mathsf {OG}}\,}}({\mathbb {A}})\) is connected and thus forms a path or a cycle (by Theorem 1).

Connected Components of \({{\,\mathrm{\mathsf {OG}}\,}}({\mathbb {O}})\)

Below we show that an OOS instance can also be solved independently for each connected component of \({{\,\mathrm{\mathsf {OG}}\,}}({\mathbb {O}})\). We need the following lemma that trivially holds.

Lemma 7

Let \({\mathbb {A}}\) be an assembly such that \(|{{\,\mathrm{\text {NR}}\,}}({\mathbb {A}})|\ge 1\), and \(s_i,s_j\) be scaffolds from the same connected component C in \({{\,\mathrm{\mathsf {OG}}\,}}({\mathbb {A}})\). Then an unoriented assembly point \((s_i,s_j)\) has a consistent orientation with \({\mathbb {A}}\). Furthermore, if C is a cycle, then any semi-oriented assembly point on \(s_i,s_j\) has a consistent orientation with \({\mathbb {A}}\).

By Lemma 7, we can assume that \({\mathbb {O}}\) does not contain any unoriented assembly points (i.e., \({\mathbb {O}}= {\mathbb {O}}_o \cup {\mathbb {O}}_s\)). Furthermore, if \({{\,\mathrm{\mathsf {OG}}\,}}({\mathbb {A}})\) is a cycle, we can assume that \({\mathbb {O}}={\mathbb {O}}_o\) (i.e., \({\mathbb {O}}\) consists of oriented assembly points only). We consider two cases depending on whether \({{\,\mathrm{\mathsf {OG}}\,}}({\mathbb {A}})\) forms a path or a cycle.

\({{\,\mathrm{\mathsf {OG}}\,}}({\mathbb {A}})\) is a path.  Suppose that \({{\,\mathrm{\mathsf {OG}}\,}}({\mathbb {A}}) = (s_1,s_2,\dots ,s_n)\) is a path and \({\mathbb {O}}= {\mathbb {O}}_o \cup {\mathbb {O}}_s\). Let \({{{\mathcal {C}}}}\) be the set of connected components of \({{\,\mathrm{\mathsf {OG}}\,}}({\mathbb {O}})\).

Consider any \(C \in {{{\mathcal {C}}}}\). Let \((s_{j_1},\dots ,s_{j_{m}})\) be a vertex sequence of C such that \(j_{1}<j_{2}<\dots <j_{m}\), where m is the number of vertices in C. We define an assembly \({\mathbb {A}}_{C}\) such that \({{\,\mathrm{\mathsf {OG}}\,}}({\mathbb {A}}_{C})\) is the path \((x,s_{j_1},\dots ,s_{j_{m}},y)\), where x and y are artificial vertices, and the assembly points in \({\mathbb {A}}_{C}\) (corresponding to the edges in \({{\,\mathrm{\mathsf {OG}}\,}}({\mathbb {A}}_{C})\)) are oriented or semi-oriented as follows.

  • The edges \(\{x,s_{j_{1}}\}\) and \(\{s_{j_{m}},y\}\) correspond to semi-oriented assembly points \((\overrightarrow{x},s_{j_{1}})\) and \((s_{j_{m}},\overrightarrow{y})\), respectively;

  • For each \(l \in \{1,\dots ,m-1\}\), orientation of the assembly point corresponding to the edge \(\{s_{j_{l}},s_{j_{l+1}}\}\) is imposed from the orientations of \(s_{j_{l}}\) and \(s_{j_{l+1}}\) in the assembly points in \({\mathbb {A}}\) corresponding to the edges \(\{s_{j_{l}},s_{j_{l}+1}\}\) and \(\{s_{j_{l+1}-1},s_{j_{l+1}}\}\) at the ends of a path connecting \(s_{j_{l}}\) and \(s_{j_{l+1}}\) in \(\mathsf {SAG} ({\mathbb {A}})\). For example, assembly points \((\overrightarrow{s}_{j_{l}},\overrightarrow{s}_{j_{l}+1})\) and \((\overrightarrow{s}_{j_{l+1}-1},\overleftarrow{s}_{j_{l+1}})\) in \({\mathbb {A}}\) impose the assembly point \((\overrightarrow{s}_{j_{l}},\overleftarrow{s}_{j_{l+1}})\) in \({\mathbb {A}}_{C}\).
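The first step of this construction, grouping scaffolds into the connected components of \({{\,\mathrm{\mathsf {OG}}\,}}({\mathbb {O}})\) and wrapping each sorted vertex sequence with the artificial vertices x and y, can be sketched as follows. This is a minimal illustration under stated assumptions: scaffolds are identified by their positions \(1,\dots ,n\) in the path, each assembly point in \({\mathbb {O}}\) is given as an index pair, and the name `component_paths` is hypothetical. The sketch does not compute the imposed orientations.

```python
def component_paths(points):
    """Return one vertex sequence (x, s_j1, ..., s_jm, y) per component.

    `points` is an iterable of (i, j) scaffold-index pairs, one per assembly
    point in O (an assumed encoding, not the authors' data structure).
    """
    parent = {}

    def find(a):
        # Union-find lookup with path halving.
        parent.setdefault(a, a)
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    # Each assembly point is an edge of OG(O): merge its two endpoints.
    for i, j in points:
        parent[find(i)] = find(j)

    groups = {}
    for v in list(parent):
        groups.setdefault(find(v), []).append(v)

    # Sort vertices inside a component (j1 < ... < jm), add artificial ends,
    # and order components by their smallest scaffold index.
    paths = [["x", *sorted(g), "y"] for g in groups.values()]
    return sorted(paths, key=lambda p: p[1])
```

For example, assembly points on the index pairs (1, 3), (3, 5), and (2, 4) yield two components whose scaffolds interleave along the path.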

We further define \({\mathbb {O}}_C\) as a set formed by the assembly points from C and the following assembly points. For each semi-oriented assembly point \(p\in {\mathbb {O}}\) formed by scaffolds \(s_i\) and \(s_j\) (\(i<j\)), \({\mathbb {O}}_C\) contains:

  • an oriented point \(p'\) formed by \(s_i\) and \(\overrightarrow{y}\) whenever \(s_i\) is oriented in p and belongs to C (and its orientation in \(p'\) is inherited from p);

  • an oriented point \(p''\) formed by \(\overrightarrow{x}\) and \(s_j\) whenever \(s_j\) is oriented in p and belongs to C (and its orientation in \(p''\) is inherited from p) (Fig. 2).

Now, for each \(C \in {{{\mathcal {C}}}}\), we assume that \({\mathbb {A}}_C\) and \({\mathbb {O}}_C\) are defined as above and let \({\mathbb {A}}'_C\) be a solution to the OOS instance \(({\mathbb {A}}_C,{\mathbb {O}}_C)\). We construct a non-conflicting realization \({\mathbb {A}}'\in {{\,\mathrm{\text {NR}}\,}}({\mathbb {A}})\) as follows:

  • for a scaffold s present in some \({\mathbb {A}}'_C\), \({\mathbb {A}}'\) inherits the orientation of s from \({\mathbb {A}}'_C\);

  • for a scaffold s not present in any \({\mathbb {A}}'_C\), if s is oriented in any assembly point of \({\mathbb {A}}\), then \({\mathbb {A}}'\) inherits that orientation of s; otherwise s is arbitrarily oriented in \({\mathbb {A}}'\).
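The two rules above amount to a simple merge of per-component orientations. The sketch below assumes that orientations are encoded as +1 and -1, that each component solution is a dictionary from scaffold index to orientation (possibly also containing the artificial vertices "x" and "y"), and that `assembly_orientation` lists the scaffolds already oriented in \({\mathbb {A}}\); all names are hypothetical.

```python
def merge_orientations(n, component_solutions, assembly_orientation, default=+1):
    """Build one orientation per scaffold 1..n from component solutions."""
    merged = {}
    # Rule 1: scaffolds present in some component solution inherit from it.
    for solution in component_solutions:
        for scaffold, orientation in solution.items():
            if scaffold in ("x", "y"):  # artificial vertices are dropped
                continue
            merged[scaffold] = orientation
    # Rule 2: remaining scaffolds keep their orientation from the assembly,
    # or receive an arbitrary one (here: `default`).
    for scaffold in range(1, n + 1):
        if scaffold not in merged:
            merged[scaffold] = assembly_orientation.get(scaffold, default)
    return merged
```

Since the components are vertex-disjoint, no scaffold receives two different orientations in the first loop.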

The following theorem shows that the constructed \({\mathbb {A}}'\) is a solution to the OOS instance \(({\mathbb {A}},{\mathbb {O}})\).

Fig. 2

Decomposition of an OOS problem instance \(({\mathbb {A}}, {\mathbb {O}})\) based on the connected components of \({{\,\mathrm{\mathsf {OG}}\,}}({\mathbb {O}}_o)\). (a) The superposition of \({{\,\mathrm{\mathsf {OG}}\,}}({\mathbb {A}})\) (red edges) and \({{\,\mathrm{\mathsf {OG}}\,}}({\mathbb {O}})\) (green edges), where arrows (if present) at the ends of green edges encode the orientation of the scaffolds in the corresponding assembly points. (b) The superposition of five graphs \({{\,\mathrm{\mathsf {OG}}\,}}({\mathbb {A}}_i)\) (red edges) and three graphs \({{\,\mathrm{\mathsf {OG}}\,}}({\mathbb {O}}_j)\) (green edges) constructed based on the connected components of \({{\,\mathrm{\mathsf {OG}}\,}}({\mathbb {O}}_o)\). Unless \({{\,\mathrm{\mathsf {OG}}\,}}({\mathbb {A}}_i)\) is formed by an isolated vertex, it contains artificial vertices \(x_i\) and \(y_i\), which coincide if \({{\,\mathrm{\mathsf {OG}}\,}}({\mathbb {A}}_i)\) is a cycle

Theorem 4

Let \(({\mathbb {A}},{\mathbb {O}})\) be an OOS instance, and \({\mathbb {A}}'\in {{\,\mathrm{\text {NR}}\,}}({\mathbb {A}})\) be defined as above. Then \({\mathbb {A}}'\) is a solution to the OOS instance \(({\mathbb {A}},{\mathbb {O}})\).

Proof

The graph \(\mathsf {SAG} ({\mathbb {A}}')\) can be viewed as an ordered sequence of directed scaffold edges (interweaved with undirected edges encoding assembly points). Then each \(\mathsf {SAG} ({\mathbb {A}}'_i)\), with the exception of scaffold edges \(x_i\) and \(y_i\), corresponds to a subsequence of this sequence.

Each oriented assembly point \(p\in {\mathbb {O}}\) is formed by scaffolds u and v from \(C_i\) for some \(i\in \{1,\dots ,k\}\). Then \(p\in {\mathbb {O}}\cap {\mathbb {O}}_i\) and there exist a unique path in \(\mathsf {SAG} ({\mathbb {A}}'_i)\) and a unique path in \(\mathsf {SAG} ({\mathbb {A}}')\) having the same directed edges u and v at the ends. Hence, if p has a consistent orientation with one of the assemblies \({\mathbb {A}}'\) or \({\mathbb {A}}'_i\), then it has a consistent orientation with the other.

Each semi-oriented assembly point \(p\in {\mathbb {O}}\) formed by scaffolds u and v corresponds to an oriented assembly point \(q\in {\mathbb {O}}_i\) (for some i) formed by u and \(y_i\) (in which case \(u\in C_i\) and u is oriented in p), or by \(x_i\) and v (in which case \(v\in C_i\) and v is oriented in p). Without loss of generality, we assume the former case. Then there exists a unique path Q in \(\mathsf {SAG} ({\mathbb {A}}'_i)\) connecting directed edges u and \(y_i\), and there exists a unique path P in \(\mathsf {SAG} ({\mathbb {A}}')\) connecting directed edges u and v, where the orientation of u is the same in the two paths. By construction, the orientation of \(y_i\) in q matches that in Q. Hence, q has a consistent orientation with \({\mathbb {A}}'_i\) if and only if the orientation of u in q matches that in Q, which happens if and only if the orientation of u in p matches its orientation in P, i.e., p has a consistent orientation with \({\mathbb {A}}'\). We have thus proved that the number of assembly points from \({\mathbb {O}}\) having consistent orientation with \({\mathbb {A}}'\) equals the total number of assembly points from \({\mathbb {O}}_i\) having consistent orientation with \({\mathbb {A}}'_i\) for all \(i=1,2,\dots ,k\). It remains to notice that this number is the maximum possible, i.e., \({\mathbb {A}}'\) is indeed a solution to the OOS instance \(({\mathbb {A}},{\mathbb {O}})\) (if it is not, then the sets \({\mathbb {A}}_i\) constructed from \({\mathbb {A}}\) being an actual solution to the OOS will give a better solution to at least one of the subproblems). \(\square\)

\({{\,\mathrm{\mathsf {OG}}\,}}({\mathbb {A}})\) is a cycle.  In this case, we can construct subproblems based on the connected components of \({{\,\mathrm{\mathsf {OG}}\,}}({\mathbb {O}})\) similarly to Case 1, with the following differences. First, by Lemma 7, we assume that \({\mathbb {O}}={\mathbb {O}}_o\) (discarding all unoriented and semi-oriented assembly points from \({\mathbb {O}}\)). Second, we assume that \(x_i=y_i\) and thus \({{\,\mathrm{\mathsf {OG}}\,}}({\mathbb {A}}_i)\) forms a cycle. Theorem 4 still holds in this case.

Articulation Vertices in \({{\,\mathrm{\mathsf {OG}}\,}}({\mathbb {O}})\)

While Theorem 4 allows us to divide an OOS problem into subproblems based on the connected components of \({{\,\mathrm{\mathsf {OG}}\,}}({\mathbb {O}})\), we show below that a similar division is possible when \({{\,\mathrm{\mathsf {OG}}\,}}({\mathbb {O}})\) is connected but contains an articulation vertex.Footnote 5

A vertex v in \({{\,\mathrm{\mathsf {OG}}\,}}({\mathbb {O}})\) (or in \({{\,\mathrm{\mathsf {COG}}\,}}({\mathbb {O}})\)) is called oriented if \(v\in {\mathbb {S}}_o({\mathbb {A}})\). Otherwise, v is called unoriented. Let \(({\mathbb {A}}, {\mathbb {O}})\) be an instance of the OOS problem such that both \({{\,\mathrm{\mathsf {OG}}\,}}({\mathbb {A}})\) and \({{\,\mathrm{\mathsf {OG}}\,}}({\mathbb {O}})\) are connected. Let v be an oriented articulation vertex in \({{\,\mathrm{\mathsf {OG}}\,}}({\mathbb {O}})\), defining a partition of \({\mathbb {S}}({\mathbb {O}})\) into disjoint subsets:

$$\begin{aligned} {\mathbb {S}}({\mathbb {O}}) = \{v\} \cup V_1 \cup V_2 \cup \dots \cup V_k, \end{aligned}$$
(2)

where \(k>1\) and the \(V_i\) represent the vertex sets of the connected components resulting from the removal of v from \({{\,\mathrm{\mathsf {OG}}\,}}({\mathbb {O}})\). To divide the OOS instance \(({\mathbb {A}}, {\mathbb {O}})\) into subinstances, we construct a new OOS instance \(({\hat{{\mathbb {A}}}}, {\hat{{\mathbb {O}}}})\) as follows.

We introduce copies \(v_1, \dots , v_k\) of v, and construct \({\hat{{\mathbb {A}}}}\) from \({\mathbb {A}}\) by replacing a path \((u,v,w)\) in \({{\,\mathrm{\mathsf {OG}}\,}}({\mathbb {A}})\) with a path \((u,v_1,v_2,\dots ,v_k,w)\) where all \(v_i\) inherit the orientation from v. Then we construct \({\hat{{\mathbb {O}}}}\) from \({\mathbb {O}}\) by replacing each assembly point p formed by v and \(u\in V_i\) (for some \(i\in \{1,2,\dots ,k\})\) with an assembly point formed by \(v_i\) and u (keeping their orientations intact).

The OOS instance \(({\hat{{\mathbb {A}}}}, {\hat{{\mathbb {O}}}})\) enables application of Theorem 4. Indeed, by construction, the vertex sets of the connected components of \({{\,\mathrm{\mathsf {OG}}\,}}({\hat{{\mathbb {O}}}})\) are \(\{v_i\} \cup V_i\), where \(i \in \{1,2,\dots ,k\}\). Hence, by Theorem 4 the OOS instance \(({\hat{{\mathbb {A}}}}, {\hat{{\mathbb {O}}}})\) can be solved by dividing it into OOS subinstances corresponding to the connected components of \({{\,\mathrm{\mathsf {OG}}\,}}({\hat{{\mathbb {O}}}})\).
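The vertex-splitting step on \({{\,\mathrm{\mathsf {OG}}\,}}({\mathbb {O}})\) can be sketched as follows. The sketch assumes that the graph is given as a list of edges over sortable vertex labels (e.g., scaffold indices), models the copy \(v_i\) as the tuple `(v, i)`, and ignores orientations, which are kept intact by the construction; the function name is hypothetical.

```python
def split_articulation_vertex(edges, v):
    """Replace v by copies (v, 1), ..., (v, k), one per component of the graph
    obtained by removing v. Returns the rewritten edge list and k."""
    # Adjacency of the graph with v removed.
    adj = {}
    for a, b in edges:
        for u in (a, b):
            if u != v:
                adj.setdefault(u, set())
        if v not in (a, b):
            adj[a].add(b)
            adj[b].add(a)

    # Label the connected components V_1, ..., V_k by depth-first search.
    comp, k = {}, 0
    for start in sorted(adj):
        if start in comp:
            continue
        k += 1
        comp[start] = k
        stack = [start]
        while stack:
            a = stack.pop()
            for b in adj[a]:
                if b not in comp:
                    comp[b] = k
                    stack.append(b)

    # Redirect every edge {v, u} with u in V_i to the copy v_i.
    new_edges = []
    for a, b in edges:
        if a == v:
            a = (v, comp[b])
        elif b == v:
            b = (v, comp[a])
        new_edges.append((a, b))
    return new_edges, k
```

After the split, each copy `(v, i)` is adjacent only to vertices of \(V_i\), so the rewritten graph has exactly k connected components.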

Now, we assume that we have a solution to the OOS instance \(({\hat{{\mathbb {A}}}}, {\hat{{\mathbb {O}}}})\). We construct a non-conflicting realization \({\mathbb {A}}'\in {{\,\mathrm{\text {NR}}\,}}({\mathbb {A}})\) from a solution to the OOS instance \(({\hat{{\mathbb {A}}}}, {\hat{{\mathbb {O}}}})\) by replacing every scaffold \(v_i\) with v.

The following theorem shows that the constructed \({\mathbb {A}}'\) is a solution to the OOS instance \(({\mathbb {A}},{\mathbb {O}})\).

Theorem 5

Let \(({\mathbb {A}}, {\mathbb {O}})\) be an OOS instance such that both \({{\,\mathrm{\mathsf {OG}}\,}}({\mathbb {A}})\) and \({{\,\mathrm{\mathsf {OG}}\,}}({\mathbb {O}})\) are connected, and \({\mathbb {A}}'\) be defined as above. Then \({\mathbb {A}}'\) is a solution to the OOS instance \(({\mathbb {A}},{\mathbb {O}})\).

Proof

Let \({\hat{{\mathbb {A}}}}'\) be a solution to the OOS instance \(({\hat{{\mathbb {A}}}}, {\hat{{\mathbb {O}}}})\), and \({\mathbb {A}}'\) be obtained from \({\hat{{\mathbb {A}}}}'\) by replacing every \(v_i\) with v. We remark that \({\mathbb {O}}\) can be obtained from \({\hat{{\mathbb {O}}}}\) by similar replacement.

This establishes a one-to-one correspondence between the assembly points in \({\hat{{\mathbb {A}}}}'\) and \({\mathbb {A}}'\), as well as between the assembly points in \({\hat{{\mathbb {O}}}}\) and \({\mathbb {O}}\). It remains to show that consistent orientations are invariant under this correspondence.

We remark that \(\mathsf {SAG} ({\mathbb {A}}')\) can be obtained from \(\mathsf {SAG} ({\hat{{\mathbb {A}}}}')\) by replacing a sequence of edges \((r_1,v_1,r_2,v_2,\dots ,r_k,v_k,r_{k+1})\), where \(r_i\) are assembly edges, with a sequence of edges \((r_1,v,r_{k+1})\). Therefore, if there exists a path in one graph proving existence of consistent orientation for some assembly point, then there exists a corresponding path in the other graph (having the same orientations of the end edges). \(\square\)

Algorithms for the NOOS and the OOS

In this section, by Theorems 3 and 4, we can assume that both \({{\,\mathrm{\mathsf {OG}}\,}}({\mathbb {A}})\) and \({{\,\mathrm{\mathsf {OG}}\,}}({\mathbb {O}})\) are connected.

A Polynomial-Time Algorithm for NOOS

Theorem 6

The NOOS is in \(\textsf{P}\).

Proof

Since \({{\,\mathrm{\mathsf {COG}}\,}}({\mathbb {O}})\) is non-branching, we consider two cases depending on whether it is a path or a cycle.

If \({{\,\mathrm{\mathsf {COG}}\,}}({\mathbb {O}})\) is a path, then every vertex in it is an articulation vertex in both \({{\,\mathrm{\mathsf {COG}}\,}}({\mathbb {O}})\) and \({{\,\mathrm{\mathsf {OG}}\,}}({\mathbb {O}})\). Our algorithm will process this path in a divide-and-conquer manner. Namely, for a path of length greater than 2, we pick a vertex v closest to the path middle. If v is oriented, we proceed as in Theorem 5. If v is unoriented, we fix each of the two possible orientations, proceed as in Theorem 5 to obtain two candidate solutions, from which we pick one with the larger number of assembly points with consistent orientations.

A path of length at most 2 can be solved in \({\mathcal {O}}\left( |{\mathbb {O}}|\right)\) time by brute-forcing all possible orientations of the scaffolds in the path and counting how many assembly points in \({\mathbb {O}}\) get consistent orientations.
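The brute-force base case can be illustrated with the following sketch. It uses a simplified encoding that is an assumption of this example: orientations are +1 or -1, and each assembly point is a tuple `(i, oi, j, oj)` listing two scaffolds with their required orientations, where `None` marks a scaffold that is unoriented in that point. A point is counted as consistent when every orientation it specifies matches the assignment.

```python
from itertools import product


def brute_force(scaffolds, points, fixed=None):
    """Try all orientations of the non-fixed scaffolds and return the best
    (score, assignment) pair, where score counts consistent assembly points."""
    fixed = fixed or {}
    free = [s for s in scaffolds if s not in fixed]
    best_score, best_assign = -1, None
    for bits in product((+1, -1), repeat=len(free)):
        assign = {**fixed, **dict(zip(free, bits))}
        # Count the points whose specified orientations all agree with `assign`.
        score = sum(
            all(o is None or assign[s] == o for s, o in ((i, oi), (j, oj)))
            for i, oi, j, oj in points
        )
        if score > best_score:
            best_score, best_assign = score, assign
    return best_score, best_assign
```

With a constant number of scaffolds, the outer loop runs a constant number of times, so the cost is dominated by the single pass over the assembly points.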

The running time T(l) of the recursive part of the algorithm on a path of length l satisfies the recurrence:

$$\begin{aligned} T(l) = {\left\{ \begin{array}{ll} 4\cdot T\left( \frac{l}{2}\right) + {\mathcal {O}}\left( 1\right) , &{} \text {if}\ l > 2;\\ {\mathcal {O}}\left( |{\mathbb {O}}|\right) , &{} \text {if}\ l \le 2. \end{array}\right. } \end{aligned}$$

From the Master theorem [9], we conclude that the total running time for the proposed recursive algorithm is \({\mathcal {O}}\left( |{\mathbb {O}}|^2\right)\) (or \({\mathcal {O}}\left( |{\mathbb {S}}({\mathbb {A}})|^2\right)\) since \({{\,\mathrm{\mathsf {COG}}\,}}({\mathbb {O}})\) is a path).
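For completeness, the Master theorem step can be spelled out as follows, reading the recurrence in terms of the path length l, with four subproblems of half the length (two orientations of the middle vertex, each producing two subpaths) and constant splitting cost. This worked step is an illustration and treats the base-case cost as a constant per subproblem:

$$\begin{aligned} a = 4,\quad b = 2,\quad \log _b a = 2,\quad f(l) = {\mathcal {O}}\left( 1\right) = {\mathcal {O}}\left( l^{\log _b a - \varepsilon }\right) \ \text {for}\ \varepsilon = 2 \quad \Longrightarrow \quad T(l) = \Theta \left( l^{\log _b a}\right) = \Theta \left( l^{2}\right) . \end{aligned}$$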

If \({{\,\mathrm{\mathsf {COG}}\,}}({\mathbb {O}})\) is a cycle, we can reduce the corresponding NOOS instance to the case of a path as follows. First, we pick an arbitrary vertex w in \({{\,\mathrm{\mathsf {COG}}\,}}({\mathbb {O}})\) and replace it with new vertices \(w_1\) and \(w_2\) such that the edges \(\{u, w\}\), \(\{w, v\}\) in \({{\,\mathrm{\mathsf {COG}}\,}}({\mathbb {O}})\) are replaced with \(\{u, w_1\}\), \(\{w_2, v\}\). Then we solve the NOOS for the resulting path one or two times (depending on whether \(w\in {\mathbb {S}}_o({\mathbb {A}})\)): once for each of the possible orientations of scaffold w (inherited by \(w_1\) and \(w_2\)), and then select the orientation for w that produces the largest number of assembly points having consistent orientations with the input assembly. \(\square\)

Pseudocode for the algorithm described in the proof of Theorem 6 is given in Algorithm 1 in the Appendix.

An Exact Algorithm for the OOS

Below we show how to solve an OOS instance \(({\mathbb {A}}, {\mathbb {O}})\) in the general case, i.e., when \({{\,\mathrm{\mathsf {COG}}\,}}({\mathbb {O}})\) is neither a path nor a cycle.

First we assume that there are no articulation vertices in \({{\,\mathrm{\mathsf {OG}}\,}}({\mathbb {O}})\), while the case when articulation vertices are present is addressed in the next section. Let \({{\,\mathrm{\text {BV}}\,}}({\mathbb {O}})\) be the set of unoriented branching vertices (i.e., unoriented vertices of degree greater than 2) in \({{\,\mathrm{\mathsf {COG}}\,}}({\mathbb {O}})\). We define a non-branching path as a path for which the endpoints are in \({{\,\mathrm{\text {BV}}\,}}({\mathbb {O}})\), and all internal vertices have degree 2 (e.g., \(\{s_{18}, s_{23}, s_{24}, s_{25}\}\) is a non-branching path in Fig. 3a). Similarly, we define a non-branching cycle as a cycle in which all vertices have degree 2, except for one vertex (called endpoint) that belongs to \({{\,\mathrm{\text {BV}}\,}}({\mathbb {O}})\) and thus has degree greater than 2 (e.g., \(\{s_7, s_4, s_3, s_1, s_2, s_6, s_5, s_7\}\) is a non-branching cycle in Fig. 3a).
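The set \({{\,\mathrm{\text {BV}}\,}}({\mathbb {O}})\) can be computed with a single pass over the edges. The sketch below assumes that \({{\,\mathrm{\mathsf {COG}}\,}}({\mathbb {O}})\) is given as a list of distinct edges and that the oriented scaffolds are given as a set; both names are hypothetical.

```python
def unoriented_branching_vertices(edges, oriented):
    """Return the unoriented vertices of degree greater than 2."""
    degree = {}
    for a, b in edges:
        degree[a] = degree.get(a, 0) + 1
        degree[b] = degree.get(b, 0) + 1
    return {v for v, d in degree.items() if d > 2 and v not in oriented}
```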

Each OOS instance induced by a non-branching path or a non-branching cycle in \({{\,\mathrm{\mathsf {COG}}\,}}({\mathbb {O}})\) represents an NOOS instance, and thus can be solved in polynomial time. We iterate over all possible orientations for the endpoints of the underlying paths/cycles in the corresponding NOOS instances and solve them. A solution to the OOS instance is obtained by iterating over all possible orientations of the scaffolds represented by branching vertices in \({{\,\mathrm{\mathsf {COG}}\,}}({\mathbb {O}})\) (i.e., \({{\,\mathrm{\text {BV}}\,}}({\mathbb {O}})\)), merging the solutions to the corresponding NOOS instances, and picking the best result. Then, the following lemma trivially holds:

Lemma 8

The running time for the proposed algorithm is bounded by \({\mathcal {O}}\left( 2^{|{{\,\mathrm{\text {BV}}\,}}({\mathbb {O}})|}\cdot |{\mathbb {S}}({\mathbb {A}})|^2\right) .\)
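The source of the factor \(2^{|{{\,\mathrm{\text {BV}}\,}}({\mathbb {O}})|}\) is the outer enumeration, which can be sketched as follows. Here `solve_noos` is a hypothetical callback standing in for the polynomial-time NOOS solver: given fixed orientations (+1 or -1) of the branching vertices, it is assumed to return the total number of consistent assembly points over all non-branching paths and cycles.

```python
from itertools import product


def solve_by_enumeration(branching, solve_noos):
    """Try all 2^|BV| orientations of the branching vertices and keep the best.

    Returns a pair (score, orientation of the branching vertices).
    """
    best = None
    for bits in product((+1, -1), repeat=len(branching)):
        fixed = dict(zip(branching, bits))
        score = solve_noos(fixed)  # polynomial-time work per orientation
        if best is None or score > best[0]:
            best = (score, fixed)
    return best
```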

An \(\textsf {FPT}\) Algorithm for the OOS

Thanks to Theorem 5, we can partition a given OOS instance \(({\mathbb {A}}, {\mathbb {O}})\) into subinstances using the oriented articulation vertices. By Theorem 6, we also know how to efficiently orient scaffolds that correspond to unoriented articulation vertices of degree 2. In this section, we address the remaining type of articulation vertices, namely unoriented articulation vertices of degree at least 3.

Let \({{\,\mathrm{\text {AV}}\,}}({\mathbb {O}})\subseteq {{\,\mathrm{\text {BV}}\,}}({\mathbb {O}})\) be the set of unoriented articulation vertices of degree at least 3. A straightforward solution to this problem is to iterate over all possible \(2^{|{{\,\mathrm{\text {AV}}\,}}({\mathbb {O}})|}\) orientations of the scaffolds in \({{\,\mathrm{\text {AV}}\,}}({\mathbb {O}})\), and then use Theorem 5 to partition the OOS instance \(({\mathbb {A}}, {\mathbb {O}})\) into subinstances. Each such subinstance, in turn, can be solved using Theorem 6 or Lemma 8. Below we show how one can orient the scaffolds in \({{\,\mathrm{\text {AV}}\,}}({\mathbb {O}})\) more efficiently based on the dependencies between the connected subgraphs flanked by the corresponding vertices.

The set \({{\,\mathrm{\text {AV}}\,}}({\mathbb {O}})\) defines a set \({{\,\mathrm{\text {C}}\,}}({\mathbb {O}})\) of connected subgraphs (components) of \({{\,\mathrm{\mathsf {COG}}\,}}({\mathbb {O}})\) by breaking it at the vertices from \({{\,\mathrm{\text {AV}}\,}}({\mathbb {O}})\), introducing copies of each articulation vertex in the resulting components (Fig. 3a). We distinguish between two types of components in \({{\,\mathrm{\text {C}}\,}}({\mathbb {O}})\):

  • path bridges forming the set \({{\,\mathrm{\text {PB}}\,}}({\mathbb {O}})\subseteq {{\,\mathrm{\text {C}}\,}}({\mathbb {O}})\), i.e., components that do not contain cycles (e.g., \(pb_1\) in Fig. 3a);

  • complex components forming the set \({{\,\mathrm{\text {CC}}\,}}({\mathbb {O}})\subseteq {{\,\mathrm{\text {C}}\,}}({\mathbb {O}})\), i.e., components that contain at least one cycle (e.g., \(cc_2\) in Fig. 3a).
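Distinguishing the two types of components reduces to an acyclicity test. The sketch below assumes that each component is connected and given by its vertex set and its list of distinct edges, and uses the fact that a connected graph is acyclic exactly when it has one edge fewer than vertices; the function name and the labels "PB" and "CC" are illustrative.

```python
def classify_component(vertices, edges):
    """Return "PB" for an acyclic (path bridge) component, "CC" otherwise.

    Assumes the component is connected and `edges` has no duplicates.
    """
    return "PB" if len(edges) == len(vertices) - 1 else "CC"
```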

Trivially we have \({{\,\mathrm{\text {CC}}\,}}({\mathbb {O}})\cup {{\,\mathrm{\text {PB}}\,}}({\mathbb {O}})={{\,\mathrm{\text {C}}\,}}({\mathbb {O}})\). We denote by V(c) the set of vertices in a component \(c \in {{\,\mathrm{\text {C}}\,}}({\mathbb {O}})\). Now, we show how to solve the OOS instances induced by elements of \({{\,\mathrm{\text {C}}\,}}({\mathbb {O}})\):

Case \(c\in {{\,\mathrm{\text {PB}}\,}}({\mathbb {O}})\).  The OOS instance induced by c can be solved as follows. We iterate over all possible orientations of the unoriented articulation vertices in c (i.e., we need to solve the OOS instance induced by c at most 4 times). For each fixed orientation, since c is non-branching, the OOS instance induced by c is an instance of NOOS and can be solved as in Theorem 6.

Case \(c\in {{\,\mathrm{\text {CC}}\,}}({\mathbb {O}})\).  The OOS instance induced by c can be solved as follows. We iterate over all possible orientations of the vertices in \({{\,\mathrm{\text {AV}}\,}}({\mathbb {O}})\cap V(c)\). For each fixed orientation, a solution to the OOS instance induced by c can be obtained as in Lemma 8 by iterating over all possible orientations of the scaffolds represented by the unoriented branching vertices in c (i.e., \(({{\,\mathrm{\text {BV}}\,}}({\mathbb {O}}) \setminus {{\,\mathrm{\text {AV}}\,}}({\mathbb {O}})) \cap V(c)\)).

Fig. 3

(a) Contracted ordered graph \({{\,\mathrm{\mathsf {COG}}\,}}({\mathbb {O}})\) of a set of assembly points \({\mathbb {O}}\). Branching articulation vertices \({{\,\mathrm{\text {AV}}\,}}({\mathbb {O}}) = \{s_7, s_{10}, s_{12}, s_{14}, s_{21}, s_{40}\}\) are shown as filled with gray. Branching vertices that are not articulation vertices \({{\,\mathrm{\text {BV}}\,}}({\mathbb {O}})\setminus {{\,\mathrm{\text {AV}}\,}}({\mathbb {O}}) = \{s_{33}, s_{26}, s_{27}\}\) are shown as filled with line pattern. Yellow areas highlight elements of \({{\,\mathrm{\text {CC}}\,}}({\mathbb {O}}) = \{cc_1, cc_2, cc_3, cc_4, cc_5\}\). Blue areas highlight elements of \({{\,\mathrm{\text {PB}}\,}}({\mathbb {O}}) = \{pb_1, pb_2, pb_3, pb_4\}\). (b) The subproblem tree \({{\,\mathrm{\text {ST}}\,}}({\mathbb {O}})\)

Now, we outline how we iterate over the orientations of scaffolds in \({{\,\mathrm{\text {AV}}\,}}({\mathbb {O}})\). Our algorithm constructs a subproblem tree \({{\,\mathrm{\text {ST}}\,}}({\mathbb {O}}) = (V, E)\) (Fig. 3b), where \(V={{\,\mathrm{\text {C}}\,}}({\mathbb {O}})\) is the set of vertices corresponding to the set of components induced by \({{\,\mathrm{\text {AV}}\,}}({\mathbb {O}})\), and E is the set of edges constructed iteratively. We start with \(E=\emptyset\) and populate E as follows: for each vertex \(v\in V\) and all vertices \(u\in V\), add an edge \(\{ v, u \}\) if the following two conditions hold:

  1. v and u share an articulation vertex in \({{\,\mathrm{\mathsf {COG}}\,}}({\mathbb {O}})\) (e.g., \(cc_2\) and \(pb_1\) in Fig. 3a); and

  2. u is not an endpoint of any edge in E.
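One way to realize this edge-population rule is a breadth-first traversal, sketched below. The encoding is an assumption of the example: `components` maps a component name to its vertex set, `av` is the set of articulation vertices, and the first component is taken as the root. A component is attached to the tree at most once, which corresponds to the second condition above.

```python
from collections import deque


def build_subproblem_tree(components, av):
    """Return (root, edges) of a tree over the components.

    An edge (v, u) is added when v and u share an articulation vertex and
    u has not been attached to the tree yet.
    """
    names = list(components)
    root = names[0]
    attached = {root}
    edges = []
    queue = deque([root])
    while queue:
        v = queue.popleft()
        for u in names:
            shares = components[v] & components[u] & av
            if u not in attached and shares:
                edges.append((v, u))
                attached.add(u)
                queue.append(u)
    return root, edges
```

When the components come from a connected graph, every component is reached and the result has one edge fewer than components.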

A subproblem tree \({{\,\mathrm{\text {ST}}\,}}({\mathbb {O}})\) allows us to solve the original OOS instance in a bottom-up fashion. Indeed, the OOS instances corresponding to any disjoint subtrees of \({{\,\mathrm{\text {ST}}\,}}({\mathbb {O}})\) can be solved independently. We start with solving OOS instances that correspond to the leaves, producing solutions corresponding to different orientations of the scaffolds corresponding to articulation vertices. When the OOS instances for all children of an internal vertex c in \({{\,\mathrm{\text {ST}}\,}}({\mathbb {O}})\) are solved, we iterate over the orientations for the scaffolds that correspond to articulation vertices in c (i.e., \({{\,\mathrm{\text {AV}}\,}}({\mathbb {O}})\cap V(c)\)) and merge the OOS solutions for c with the corresponding solutions for its children. Eventually, we obtain the OOS solution for the root of \({{\,\mathrm{\text {ST}}\,}}({\mathbb {O}})\) and thus for the original OOS problem.
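The bottom-up merging can be sketched as a dynamic program over the tree. The sketch makes several simplifying assumptions that are not taken from the paper: orientations are +1 or -1; `children` maps a component to its children; `link[c]` is the articulation vertex that a non-root component c shares with its parent; `av[c]` lists the articulation vertices in c; and `local_score(c, assign)` is a hypothetical callback returning the best number of consistent assembly points inside c for the given orientations of its articulation vertices. Only the optimal score is returned.

```python
from itertools import product


def solve_tree(root, children, link, av, local_score):
    """Best total score over the whole subproblem tree."""
    table = {}  # table[c][o]: best score of c's subtree when link[c] has orientation o

    def best(c, fixed):
        # Enumerate orientations of the articulation vertices of c not yet fixed.
        free = [v for v in av[c] if v not in fixed]
        top = None
        for bits in product((+1, -1), repeat=len(free)):
            assign = {**fixed, **dict(zip(free, bits))}
            total = local_score(c, assign)
            # Add the precomputed optimum of each child for the shared vertex.
            for child in children.get(c, []):
                total += table[child][assign[link[child]]]
            top = total if top is None else max(top, total)
        return top

    # Pre-order traversal; its reverse visits children before parents.
    order, stack = [], [root]
    while stack:
        c = stack.pop()
        order.append(c)
        stack.extend(children.get(c, []))

    for c in reversed(order):
        if c != root:
            table[c] = {o: best(c, {link[c]: o}) for o in (+1, -1)}
    return best(root, {})
```

Each component is processed twice (once per orientation of its link vertex), and each call enumerates only the articulation vertices of that component, which is what keeps the exponent local to a component rather than global.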

The following theorem states the running time of the proposed algorithm.

Theorem 7

The running time of the proposed algorithm for solving an OOS instance \(({\mathbb {A}},{\mathbb {O}})\) is bounded by

$$\begin{aligned} {\mathcal {O}}\left( 2^{\alpha } \cdot |{\mathbb {S}}({\mathbb {A}})|^2 \cdot |{{\,\mathrm{\text {CC}}\,}}({\mathbb {O}})|\right) , \end{aligned}$$
(3)

where \(\alpha =\max _{c\in {{\,\mathrm{\text {CC}}\,}}({\mathbb {O}})} |{{\,\mathrm{\text {BV}}\,}}({\mathbb {O}})\cap V(c)|\).

Proof

The construction time of \(\mathsf {SAG} ({\mathbb {A}})\), \({{\,\mathrm{\text {AV}}\,}}({\mathbb {O}})\), \({{\,\mathrm{\text {BV}}\,}}({\mathbb {O}})\), \({{\,\mathrm{\mathsf {OG}}\,}}({\mathbb {O}})\), \({{\,\mathrm{\mathsf {COG}}\,}}({\mathbb {O}})\), \({{\,\mathrm{\text {ST}}\,}}({\mathbb {O}})\), and \({{\,\mathrm{\text {C}}\,}}({\mathbb {O}})\) is bounded by \({\mathcal {O}}\left( |{\mathbb {S}}({\mathbb {A}})|^2\right)\).

The OOS instances induced by each non-branching path or cycle in \({{\,\mathrm{\mathsf {COG}}\,}}({\mathbb {O}})\) are solved at most 4 times for different orientations of the endpoints. By Theorem 6, the total running time for processing all non-branching paths/cycles in \({{\,\mathrm{\mathsf {COG}}\,}}({\mathbb {O}})\) is bounded by \({\mathcal {O}}\left( |{\mathbb {S}}({\mathbb {A}})|^2\right)\).

By Lemma 8, each OOS instance induced by a complex component \(c\in {{\,\mathrm{\text {CC}}\,}}({\mathbb {O}})\) can be solved in \({\mathcal {O}}\left( 2^m\cdot |{\mathbb {S}}({\mathbb {A}})|^2\right)\) time, where \(m=|({{\,\mathrm{\text {BV}}\,}}({\mathbb {O}}) \setminus {{\,\mathrm{\text {AV}}\,}}({\mathbb {O}}))\cap V(c)|\). The running time of the bottom-up algorithm is bounded by \(|{{\,\mathrm{\text {C}}\,}}({\mathbb {O}})|\) (i.e., the number of vertices in \({{\,\mathrm{\text {ST}}\,}}({\mathbb {O}})\)) times the running time of the merging procedure bounded by \({\mathcal {O}}\left( 2^{|{{\,\mathrm{\text {AV}}\,}}({\mathbb {O}})\cap V(c)|}\cdot \deg (c)\right)\), where \(\deg (c)\) is the degree of c in \({{\,\mathrm{\text {ST}}\,}}({\mathbb {O}})\).

Thus, the running time of the proposed algorithm is bounded by \({\mathcal {O}}\left( 2^{\alpha } \cdot |{\mathbb {S}}({\mathbb {A}})|^2 \cdot |{{\,\mathrm{\text {CC}}\,}}({\mathbb {O}})|\right)\), where \(\alpha =\max _{c\in {{\,\mathrm{\text {CC}}\,}}({\mathbb {O}})} |{{\,\mathrm{\text {BV}}\,}}({\mathbb {O}})\cap V(c)|\). \(\square\)

The proposed algorithm is an \(\textsf {FPT}\) algorithm. Indeed, instead of finding the best orientation by iterating over all possible orientations of the scaffolds in \({\mathbb {S}}_u({\mathbb {A}})\), we iterate over all possible orientations of the scaffolds that correspond to branching vertices in \({{\,\mathrm{\mathsf {COG}}\,}}({\mathbb {O}})\). Furthermore, we reduce the running time of the \(\textsf {FPT}\) algorithm by partitioning the problem into connected components and solving them independently.

The exponential term in (3) accounts for the number of articulation vertices in the complex components of \({{\,\mathrm{\mathsf {COG}}\,}}({\mathbb {O}})\). For real data, the exponent can become large only if many scaffolds have relative orientation with respect to three or more other scaffolds, which we expect to be a rare situation, especially when the scaffolds are long (e.g., produced by scaffolders combining paired-end and long-read data, a popular approach to genome assembly).

Conclusions

In the present study, we posed the orientation of ordered scaffolds (OOS) problem as an optimization problem based on given weighted orientations of scaffolds and their pairs. We further addressed it within the earlier introduced CAMSA framework [2], taking advantage of the simple yet powerful concept of assembly points describing (semi-/un-) oriented adjacencies between scaffolds. This approach allows one to uniformly represent both orders of oriented and/or unoriented scaffolds and orientation-imposing data.

We proved that the OOS problem is \(\textsf {NP}\)-hard when the given scaffold order represents a linear or circular genome. We also described a polynomial-time algorithm for the special case of non-branching OOS (NOOS), where the orientation of each scaffold is imposed relatively to at most two other scaffolds. Our algorithm for the NOOS problem and Theorems 3, 4, and 5 further enabled us to develop an \(\textsf {FPT}\) algorithm for the general OOS problem. The proposed algorithms are implemented in the CAMSA software version 2 (https://github.com/compbiol/CAMSA).