1 Introduction

A path cover of a DAG \(G = (V, E)\) is a set of paths such that every node of G belongs to some path. A minimum path cover (MPC) is one having the minimum number of paths. The size of an MPC is also called the width of G. Many DAGs commonly used in genome research, such as graphs encoding human mutations [8] and graphs modeling gene transcripts [15], can consist, in the former case, of millions of nodes and, in the latter case, of thousands of nodes. However, they generally have a small width on average; for example, splicing graphs for most genes in human chromosome 2 have width at most 10 [35, Fig. 7]. To the best of our knowledge, among the many MPC algorithms [6, 7, 12, 16, 27, 31], there are only three whose complexity depends on the width of the DAG. Say the width of G is k. The first algorithm runs in time \(O(|V||E| + k|V|^2)\) and can be obtained by slightly modifying an algorithm for finding a minimum chain cover in partial orders from [11]. The other two algorithms are due to Chen and Chen: the first one works in time \(O(|V|^2 + k\sqrt{k}|V|)\) [6], and the second one works in time \(O(\max (\sqrt{|V|}|E|, k\sqrt{k}|V|))\) [7].

In this paper we present an MPC algorithm running in time \(O(k|E|\log |V|)\). For example, for \(k = o(\sqrt{|V|} / \log |V|)\) and \(|E| = O(|V|^{3/2})\), this is better than all previous algorithms. Our algorithm is based on the following standard reduction of a minimum flow problem to a maximum flow problem (see e.g. [2]): (i) find a feasible flow/path cover satisfying all demands, and (ii) solve a maximum flow problem in a graph encoding how much flow can be removed from every edge. Our main insight is to solve step (i) by finding an approximate solution that is greater than the optimal one only by an \(O(\log |V|)\) factor. Then, if we solve step (ii) with the Ford-Fulkerson algorithm, the number of iterations can be bounded by \(O(k\log |V|)\).

We then proceed to show that some problems (like pattern matching) that admit efficient sparse dynamic programming solutions on sequences [10] can be extended to DAGs, so that their complexity increases only by a factor of the minimum path cover size k. Extending pattern matching to DAGs has been studied before [3, 24, 28]. For such edit-distance-based formulations our approach does not yield an improvement, but on formulations involving a sparse set of matching anchors [10] we can boost the naive solutions of their DAG extensions by exploiting a path cover. Namely, our improvement applies to many cases where a data structure over previously computed solutions is maintained and queried for computing the next value. Our new MPC algorithm enables this, as its complexity is generally of the same form as that of solving the extended problems. Given a path cover, our technique then computes so-called forward propagation links indicating how the partial solutions in each path in the cover must be synchronized.

To best illustrate the versatility of the technique itself, in the full version of this paper [19] we show how to compute a longest increasing subsequence (LIS) in a labeled DAG, in time \(O(k |E| \log |V|)\). This matches the optimal solution to the classical problem on a single sequence when, e.g., this is modeled as a path (where \(k=1\)). In Sect. 4, we also illustrate our technique with the longest common subsequence (LCS) problem between a labeled DAG \(G = (V,E)\) and a sequence S.

Finally, we consider the main problem of this paper—co-linear chaining (CLC)—first introduced in [23]. It has been proposed as a model of the sequence alignment problem that scales to massive inputs, and has been a subject of recent interest (see e.g. [22, 29, 32, 36, 38,39,40]). In the CLC problem, the input is directly assumed to be a set of N pairs of intervals in the two sequences that match (either exactly or approximately). The CLC alignment solution asks for a subset of these plausible pairs that maximizes the coverage in one of the sequences, and whose elements appear in increasing order in both sequences. The fastest algorithm for this problem runs in the optimal \(O(N \log N)\) time [1].

We define a generalization of the CLC problem between a sequence and a labeled DAG. As motivation, we mention the problem of aligning a long sequence, or even an entire chromosome, inside a DAG storing all known mutations of a population with respect to a reference genome (such as the above-mentioned [8], or more specifically a linearized version of it [14]). Here, the N input pairs match intervals in the sequence with paths (also called anchors) in the DAG. This problem is not straightforward, as the topological order of the DAG might not follow the reachability order between the anchors. Existing tools for aligning DNA sequences to DAGs (BGREAT [20], vg [25]) rely on anchors but do not explicitly consider solving CLC optimally on the DAG.

The algorithm we propose uses the general framework mentioned above. Since it is more involved, we will develop it in stages. We first give a simple approach to solve a relaxed co-linear chaining problem using \(O((|V|+|E|) N)\) time. Then, we introduce the MPC approach that requires \(O(k|E| \log |V| + kN \log N)\) time. As above, if the DAG is a labeled path representing a sequence, the running time of our algorithm is reduced to the best current solution for the co-linear chaining problem on sequences, \(O(N \log N)\) [1]. In the full version of this paper [19], we use a Burrows-Wheeler technique to efficiently handle a special case that we omitted in this relaxed variant. We remark that one can reduce the LIS and LCS problems to the CLC problem to obtain the same running time bounds as mentioned earlier; these are given for the sake of completeness.

In the last section we discuss the anchor-finding preprocessing step. We implemented the new MPC-based co-linear chaining algorithm and conducted experiments on splicing graphs to show that the approach is practical, once anchors are given. Some future directions on how to incorporate practical anchors, and how to apply the techniques to transcript prediction, are discussed.

Notation. To simplify notation, for any DAG \(G = (V,E)\) we will assume that V is always \(\{1,\dots ,|V|\}\) and that \(1,\dots ,|V|\) is a topological order on V (so that for every edge \((u,v)\) we have \(u < v\)). We will also assume that \(|E| \ge |V| - 1\). A labeled DAG is a tuple \((V,E,\ell ,\varSigma )\) where \((V,E)\) is a DAG and \(\ell : V \rightarrow \varSigma \) assigns labels from \(\varSigma \) to the nodes, \(\varSigma \) being an ordered alphabet.

For a node \(v \in V\), we denote by \(N^-(v)\) the set of in-neighbors of v and by \(N^+(v)\) the set of out-neighbors of v. If there is a (possibly empty) path from node u to node v we say that u reaches v. We denote by \(R^-(v)\) the set of nodes that reach v. We denote a set of consecutive integers with interval notation [i..j], meaning \(\{i,i+1,\ldots ,j\}\). For a pair of intervals \(m=([x..y],[c..d])\), we use m.x, m.y, m.c, and m.d to denote the four respective endpoints. We also consider pairs of the form \(m=(P,[c..d])\) where P is a path, and use m.P to access P. The first node of P will be called its startpoint, and its last node will be called its endpoint. For a set M we may fix an order, to access an element as M[i].

2 The MPC Algorithm

In this section we assume basic familiarity with network flow concepts; see [2] for further details. In the minimum flow problem, we are given a directed graph \(G = (V,E)\) with a single source and a single sink, with a demand \(d : E \rightarrow \mathbb {Z}\) for every edge. The task is to find a flow of minimum value (the value is the sum of the flow on the edges exiting the source) that satisfies all demands (to be called feasible). The standard reduction from the minimum path cover problem to a minimum flow one (see, e.g. [26]) creates a new DAG \(G^*\) by replacing each node v with two nodes \(v^-,v^+\), adds the edge \((v^-,v^+)\) and adds all in-neighbors of v as in-neighbors of \(v^-\), and all out-neighbors of v as out-neighbors of \(v^+\). Finally, the reduction adds a global source with an outgoing edge to every node, and a global sink with an incoming edge from every node. Edges of type \((v^-,v^+)\) get demand 1, and all other edges get demand 0. The value of the minimum flow equals k, the width of G, and any decomposition of it into source-to-sink paths induces a minimum path cover in G.
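
The node-splitting reduction above can be materialized in linear time. The following sketch (our own naming and edge-list representation, not taken from the paper) maps node v of G to the pair \(2v-1, 2v\) in \(G^*\):

```python
def flow_reduction(n, edges):
    """Build the flow network G* of the MPC-to-minimum-flow reduction.

    Nodes of G are 1..n; edges is a list of pairs (u, v). In G*, node v
    splits into v_in = 2*v - 1 and v_out = 2*v, and the split edge
    (v_in, v_out) gets demand 1. Node 0 is the global source and
    2*n + 1 the global sink; all remaining edges get demand 0.
    Returns (source, sink, list of (u, v, demand)).
    """
    source, sink = 0, 2 * n + 1
    star = []
    for v in range(1, n + 1):
        star.append((2 * v - 1, 2 * v, 1))   # split edge of v, demand 1
        star.append((source, 2 * v - 1, 0))  # a path may start at v
        star.append((2 * v, sink, 0))        # a path may end at v
    for (u, v) in edges:
        star.append((2 * u, 2 * v - 1, 0))   # original edge u -> v
    return source, sink, star
```

Any feasible flow in this network decomposes into source-to-sink paths whose projections back to G cover every node, since each split edge carries at least one unit of flow.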

Our MPC algorithm is based on the following simple reduction of a minimum flow problem to a maximum flow one (see e.g. [2]): (i) find a feasible flow \(f : E \rightarrow \mathbb {Z}\); (ii) transform this into a minimum feasible flow, by finding a maximum flow \(f'\) in G in which every \(e \in E\) now has capacity \(f(e) - d(e)\). The final minimum flow solution is obtained as \(f(e) - f'(e)\), for every \(e \in E\). Observe that the path cover computed in step (i) induces a feasible flow of value \(O(k\log |V|)\). Thus, in step (ii) we need to shrink this flow into a flow of value k. If we run the Ford-Fulkerson algorithm, this means that there are \(O(k\log |V|)\) successive augmenting paths, each of which can be found in time \(O(|E|)\). This gives a time bound for step (ii) of \(O(k|E|\log |V|)\).

We solve step (i) in time \(O(k|E|\log |V|)\) by finding a path cover in \(G^*\) whose size is larger than k only by a multiplicative factor \(O(\log |V|)\). This is based on the classical greedy set cover algorithm, see e.g. [37, Chapter 2]: at each step, select a path covering most of the remaining uncovered nodes.

Such an approximation approach has also been applied to other covering problems on graphs, such as a 2-hop cover [9]. More importantly, the approximation-and-refinement approach is similar to the one from [11] for finding the minimum number k of chains to cover a partial order of size n. A chain is a set of pairwise comparable elements. The algorithm from [11] runs in time \(O(kn^2)\), and it has the same feature as ours: it first finds a set of \(O(k\log n)\) chains in the same way as we do (longest chains covering the most uncovered elements), and then in a second step reduces these to k. However, if we were to apply this algorithm to DAGs, it would run in time \(O(|V||E| + k|V|^2)\), which is slower than our algorithm for small k. This is because it uses the classical reduction given by Fulkerson [12] to a bipartite graph, where each edge of the graph encodes a pair of elements in the relation. Since DAGs are not transitive in general, to use this reduction one needs first to compute the transitive closure of the DAG, in time O(|V||E|).

We now show how to solve step (i) within the claimed running time, by dynamic programming.

Lemma 1

Let \(G = (V,E)\) be a DAG, and let k be the width of G. In time \(O(k|E|\log |V|)\), we can compute a path cover \(P_1,\dots ,P_K\) of G, such that \(K = O(k\log |V|)\).

Proof

The algorithm works by choosing, at each step, a path that covers the most uncovered nodes. For every node \(v \in V\), we store \(\mathtt {m}[v] = 1\), if v is not covered by any path, and \(\mathtt {m}[v] = 0\) otherwise. We also store \(\mathtt {u}[v]\), the largest number of uncovered nodes on a path starting at v. The values \(\mathtt {u}[\cdot ]\) are computed by dynamic programming, by traversing the nodes in inverse topological order and setting \(\mathtt {u}[v] = \mathtt {m}[v] + \max _{w \in N^+(v)} \mathtt {u}[w]\). Initially we have \(\mathtt {m}[v] = 1\) for all v. We then compute \(\mathtt {u}[v]\) for all v, in time O(|E|). By taking the node v with the maximum \(\mathtt {u}[v]\), and tracing back along the optimal path starting at v, we obtain our first path in time O(|E|). We then update \(\mathtt {m}[v] = 0\) for all nodes on this path, and iterate this process until all nodes are covered. This takes overall time O(K|E|), where K is the number of paths found.

The analysis of this algorithm is identical to that of the classical greedy set cover algorithm [37, Chapter 2], because the universe to be covered is V and every possible path in G is a possible covering set; this implies that \(K = O(k\log |V|)\).

   \(\square \)
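
The greedy procedure of Lemma 1 can be sketched as follows, assuming nodes \(1,\dots ,n\) are numbered in topological order and adjacency lists are given; the naming is ours.

```python
def greedy_path_cover(n, adj):
    """Greedy approximate path cover of a DAG (Lemma 1).

    Nodes are 1..n in topological order; adj[v] lists out-neighbors.
    Repeatedly extracts a path covering the most uncovered nodes;
    returns K = O(k log n) paths covering every node.
    """
    covered = [False] * (n + 1)
    paths = []
    while not all(covered[1:]):
        u = [0] * (n + 1)            # u[v] = max uncovered nodes on a path from v
        best_next = [None] * (n + 1)
        for v in range(n, 0, -1):    # inverse topological order
            best = 0
            for w in adj[v]:
                if u[w] > best:
                    best, best_next[v] = u[w], w
            u[v] = (0 if covered[v] else 1) + best
        v = max(range(1, n + 1), key=lambda x: u[x])  # best start node
        path = []
        while v is not None:         # trace back the optimal path
            path.append(v)
            covered[v] = True
            v = best_next[v]
        paths.append(path)
    return paths
```

Each iteration of the outer loop costs O(|E|), matching the O(K|E|) bound of the proof.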

Combining Lemma 1 with the above-mentioned application of the Ford-Fulkerson algorithm, we obtain our first result:

Theorem 1

Given a DAG \(G = (V,E)\) of width k, the MPC problem on G can be solved in time \(O(k|E|\log |V|)\).

3 The Dynamic Programming Framework

In this section we give an overview of the main ideas of our approach.

Suppose we have a problem involving DAGs that is solvable, for example by dynamic programming, by traversing the nodes in topological order. Assume also that a partial solution at each node v is obtainable from all (and only) the nodes of the DAG that can reach v, plus some other independent objects, such as another sequence. Furthermore, suppose that at each node v we need to query (and maintain) a data structure \(\mathcal {T}\) that depends on \(R^-(v)\) and such that the answer \(\mathsf {Query}(R^-(v))\) at v is decomposable as:

$$\begin{aligned} \mathsf {Query}(R^-(v)) = \bigoplus _i \mathsf {Query}(R^-_i(v)). \end{aligned}$$
(1)

In the above, the sets \(R^-_i(v)\) are such that \(R^-(v) = \bigcup _i R^-_i(v)\), they are not necessarily disjoint, and \(\bigoplus \) is some operation on the queries, such as min or max, that does not assume disjointness. It is understood that after the computation at v, we need to update \(\mathcal {T}\). It is also understood that once we have updated \(\mathcal {T}\) at v, we cannot query \(\mathcal {T}\) for a node before v in topological order, because it would give an incorrect answer.

The first idea is to decompose the graph into a path cover \(P_1,\dots ,P_K\). As such, we decompose the computation only along these paths, in light of (1). We replace a single data structure \(\mathcal {T}\) with K data structures \(\mathcal {T}_1,\dots ,\mathcal {T}_K\), and perform the operation from (1) on the results of the queries to these K data structures.

Our second idea concerns the order in which the nodes on these K paths are processed. Because the answer at v depends on \(R^-(v)\), we cannot process the nodes on the K paths (and update the corresponding \(\mathcal {T}_i\)’s) in an arbitrary order. As such, for every path i and every node v, we distinguish the last node on path i that reaches v (if it exists). We will call this node \(\mathtt {last2reach}[v,i]\). See Fig. 1 for an example. We note that this insight is the same as in [17], which symmetrically identified the first node on a chain i that can be reached from v (a chain is a subsequence of a path). The following observation is the first ingredient for using the decomposition (1).

Fig. 1.

A path cover \(P_1,P_2,P_3\) of a DAG. The forward links entering v from \(\mathtt {last2reach}[v,i]\) are shown with dotted black lines, for \(i\in \{1,2,3\}\). We mark in gray the set \(R^-(v)\) of nodes that reach v.

Observation 1

Let \(P_1,\dots ,P_K\) be a path cover of a DAG G, and let \(v \in V(G)\). Let \(R_i\) denote the set of nodes of \(P_i\) from its beginning until \(\mathtt {last2reach}[v,i]\), inclusive (or the empty set, if \(\mathtt {last2reach}[v,i]\) does not exist). Then \(R^-(v) = \bigcup _{i=1}^{K}R_i\).

Proof

It is clear that \(\bigcup _{i=1}^{K} R_i \subseteq R^-(v)\). To show the reverse inclusion, consider a node \(u \in R^-(v)\). Since \(P_1,\dots ,P_K\) is a path cover, u appears on some \(P_i\). Since u reaches v, u appears on \(P_i\) before \(\mathtt {last2reach}[v,i]\), or \(u = \mathtt {last2reach}[v,i]\). Therefore u appears in \(R_i\), as desired.    \(\square \)

This allows us to identify, for every node u, a set of forward propagation links \(\mathtt {forward}[u]\), where \((v,i) \in \mathtt {forward}[u]\) holds for any node v and index i with \(\mathtt {last2reach}[v,i] = u\). These propagation links are the second ingredient in the correctness of the decomposition. Once we have computed the correct value at u, we update the corresponding data structures \(\mathcal {T}_i\) for all paths i to which u belongs. We also propagate the query value of \(\mathcal {T}_i\) in the decomposition (1) for all nodes v with \((v,i) \in \mathtt {forward}[u]\). This means that when we come to process v, we have already correctly computed all terms in the decomposition (1) and it suffices to apply the operation \(\bigoplus \) to these terms.

The next lemma shows how to compute the values \(\mathtt {last2reach}\) (and, as a consequence, all forward propagation links), also by dynamic programming.

Lemma 2

Let \(G = (V,E)\) be a DAG, and let \(P_1,\dots ,P_K\) be a path cover of G. For every \(v \in V\) and every \(i \in [1..K]\), we can compute \(\mathtt {last2reach}[v,i]\) in overall time O(K|E|).

Proof

For each \(P_i\) and every node v on \(P_i\), let \(\mathtt {index}[v,i]\) denote the position of v in \(P_i\) (say, starting from 1). Our algorithm actually computes \(\mathtt {last2reach}[v,i]\) as the index of this node in \(P_i\). Initially, we set \(\mathtt {last2reach}[v,i] = -1\) for all v and i. At the end of the algorithm, \(\mathtt {last2reach}[v,i] = -1\) will hold precisely for those nodes v that cannot be reached by any node of \(P_i\). We traverse the nodes in topological order. For every \(i \in [1..K]\), we do as follows: if v is on \(P_i\), then we set \(\mathtt {last2reach}[v,i] = \mathtt {index}[v,i]\). Otherwise, we compute by dynamic programming \(\mathtt {last2reach}[v,i]\) as \(\max _{u \in N^-(v)} \mathtt {last2reach}[u,i]\).    \(\square \)
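
The dynamic programming of Lemma 2, together with the inversion that yields the forward propagation links, can be sketched as follows (our own naming; in-neighbor lists are assumed, and positions are 1-based as in the proof):

```python
def last2reach_and_forward(n, radj, paths):
    """Compute last2reach (Lemma 2) and the forward propagation links.

    Nodes are 1..n in topological order; radj[v] lists in-neighbors of v.
    paths is the path cover P_1..P_K (0-indexed here). last2reach[v][i]
    holds the 1-based position on path i of the last node of P_i that
    reaches v, or -1 if no node of P_i reaches v. forward[u] lists the
    links (v, i) with last2reach[v][i] = u.
    """
    K = len(paths)
    index = {}                      # index[(v, i)] = position of v on path i
    for i, p in enumerate(paths):
        for pos, v in enumerate(p, start=1):
            index[(v, i)] = pos
    l2r = [[-1] * K for _ in range(n + 1)]
    for v in range(1, n + 1):       # topological order
        for i in range(K):
            if (v, i) in index:
                l2r[v][i] = index[(v, i)]
            else:
                for u in radj[v]:
                    l2r[v][i] = max(l2r[v][i], l2r[u][i])
    forward = [[] for _ in range(n + 1)]
    for v in range(1, n + 1):
        for i in range(K):
            if l2r[v][i] != -1:
                u = paths[i][l2r[v][i] - 1]   # the node at that position
                forward[u].append((v, i))
    return l2r, forward
```

The double loop runs in O(K|E|) time overall, as claimed in the lemma.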

An immediate application of Theorem 1 and of the values \(\mathtt {last2reach}[v,i]\) is for solving reachability queries. Another simple application is an extension of the longest increasing subsequence (LIS) problem to labeled DAGs. (Both are given in the full version of this paper [19]).

The LIS problem, the LCS problem of Sect. 4, as well as co-linear chaining (CLC) of Sect. 5 make use of the following standard data structure (see e.g. [21, p.20]).

Lemma 3

The following two operations can be supported with a balanced binary search tree \(\mathcal {T}\) in time \(O(\log n)\), where n is the number of leaves in the tree.

  • \(\mathsf {update}(k,\mathtt {val})\): For the leaf w with \(\mathtt {key}(w)=k\), update \(\mathtt {value}(w)=\mathtt {val}\).

  • \(\mathsf {RMaxQ}(l,r)\): Return \(\max _{w \,:\,l\le \mathtt {key}(w) \le r} \mathtt {value}(w)\) (Range Maximum Query).

Moreover, the balanced binary search tree can be built in O(n) time, given the n pairs \((\mathtt {key},\mathtt {value})\) sorted by component \(\mathtt {key}\).
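
One standard way to realize the structure of Lemma 3 is a segment tree built over the sorted keys; the sketch below (our own class and method names, with \(\mathsf{RMaxQ}\) written as rmaxq) supports both operations in \(O(\log n)\):

```python
import bisect

class RMaxTree:
    """Range-maximum search tree in the spirit of Lemma 3.

    Built in O(n) from (key, value) pairs sorted by key; update and
    rmaxq both run in O(log n). Keys passed to update must be among
    the keys given at construction time.
    """
    def __init__(self, pairs):
        self.keys = [k for k, _ in pairs]
        self.n = len(pairs)
        self.t = [float('-inf')] * (2 * self.n)
        for j, (_, val) in enumerate(pairs):
            self.t[self.n + j] = val
        for j in range(self.n - 1, 0, -1):       # O(n) bottom-up build
            self.t[j] = max(self.t[2 * j], self.t[2 * j + 1])

    def update(self, key, val):
        j = bisect.bisect_left(self.keys, key) + self.n
        self.t[j] = val
        while j > 1:                             # refresh ancestors
            j //= 2
            self.t[j] = max(self.t[2 * j], self.t[2 * j + 1])

    def rmaxq(self, l, r):
        # max of value(w) over leaves w with l <= key(w) <= r
        lo = bisect.bisect_left(self.keys, l) + self.n
        hi = bisect.bisect_right(self.keys, r) + self.n   # exclusive
        best = float('-inf')
        while lo < hi:
            if lo & 1:
                best = max(best, self.t[lo]); lo += 1
            if hi & 1:
                hi -= 1; best = max(best, self.t[hi])
            lo //= 2; hi //= 2
        return best
```

The iterative layout avoids pointer-based balancing while giving the same asymptotic guarantees.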

4 The LCS Problem

Consider a labeled DAG \(G=(V,E,\ell ,\varSigma )\) and a sequence \(S \in \varSigma ^*\), where \(\varSigma \) is an ordered alphabet. We say that the longest common subsequence (LCS) between G and S is a longest subsequence C of any path label in G such that C is also a subsequence of S.

We will modify the LIS algorithm (see the full version of this paper [19]) minimally to find an LCS between a DAG G and a sequence S. The description is self-contained yet, in the interest of space, denser than the LIS algorithm derivation. The purpose is to give an example of the general MPC framework with fewer technical details than required in the main result of this paper concerning co-linear chaining.

For any \(c \in \varSigma \), let S(c) denote the set \(\{j \mid S[j]=c\}\). For each node v and each \(j\in S(\ell (v))\), we aim to store in \(\mathtt {LLCS}[v,j]\) the length of the longest common subsequence between S[1..j] and the label of any path ending at v, among all such common subsequences having \(\ell (v)=S[j]\) as the last symbol.

Assume we have a path cover of size K and \(\mathtt {forward}[u]\) computed for all \(u\in V\). Assume also we have mapped \(\varSigma \) to \(\{0,1,2,\ldots ,|S|+1\}\) in \(O((|V|+|S|) \log |S|)\) time (e.g. by sorting the symbols of S, binary searching labels of V, and then relabeling by ranks, with the exception that, if a node label does not appear in S, it is replaced by \(|S|+1\)).

Let \(\mathcal {T}_i\) be a search tree of Lemma 3 initialized with key-value pairs (0, 0), \((1,-\infty )\), \((2,-\infty )\), ..., \((|S|,-\infty )\), for each \(i \in [1..K]\). The algorithm proceeds in a fixed topological order on G. At a node u, for every \((v,i) \in \mathtt {forward}[u]\) we now update an array \(\mathtt {LLCS}[v,j]\) for all \(j \in S(\ell (v))\) as follows: \(\mathtt {LLCS}[v,j]=\max (\mathtt {LLCS}[v,j],\mathcal {T}_i.\mathsf {RMaxQ}(0,j-1)+1)\). The update step of \(\mathcal {T}_i\) when the algorithm reaches a node v, for each covering path i containing v, is done as \(\mathcal {T}_{i}.\mathsf {update}(j',\mathtt {LLCS}[v,j'])\) for all \(j' \in S(\ell (v))\). Initialization is handled by the (0, 0) key-value pair so that any \((v,j)\) with \(\ell (v)=S[j]\) can start a new common subsequence.

The final answer to the problem is \(\max _{v\in V, j\in S(\ell (v))} \mathtt {LLCS}[v,j]\), with the actual LCS to be found with a standard traceback. The algorithm runs in \(O((|V|+|S|)\log |S|+K|M| \log |S|)\) time, where \(M=\{(v,j) \mid v \in V, j \in [1..|S|], \ell (v)=S[j]\}\), and assuming a cover of K paths is given. Notice that |M| can be \(\varOmega (|V||S|)\). With Theorem 1 plugged in, the total running time becomes \(O(k|E| \log |V| + (|V|+|S|)\log |S|+k|M| \log |S|)\). Since the queries on the data structures are semi-open, one can use the more efficient data structure from [13] to improve the bound to \(O(k|E| \log |V| + (|V|+|S|)\log |S|+k|M| \log \log |S|)\). The following theorem summarizes this result.
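
Putting the pieces together, the LCS computation can be sketched as follows. This is our own illustrative code, not the authors' implementation: it recomputes the forward links inline and replaces the search trees of Lemma 3 with plain prefix-maximum scans, so it is simpler but slower than the stated bound.

```python
NEG = float('-inf')

def lcs_dag(n, adj, label, S, paths):
    """Sketch of the LCS algorithm between a DAG and a sequence S.

    Nodes are 1..n in topological order, adj[v] lists out-neighbors,
    label[v] is the node label (index 0 unused), paths is a path cover.
    """
    K = len(paths)
    occ = {}                                   # occ[c] = S(c), 1-based positions
    for j, c in enumerate(S, start=1):
        occ.setdefault(c, []).append(j)
    # last2reach and non-self forward links (Lemma 2)
    index, on_path = {}, [[] for _ in range(n + 1)]
    for i, p in enumerate(paths):
        for pos, v in enumerate(p, start=1):
            index[(v, i)] = pos
            on_path[v].append(i)
    radj = [[] for _ in range(n + 1)]
    for u in range(1, n + 1):
        for w in adj[u]:
            radj[w].append(u)
    forward = [[] for _ in range(n + 1)]
    l2r = [[-1] * K for _ in range(n + 1)]
    for v in range(1, n + 1):
        for i in range(K):
            if (v, i) in index:
                l2r[v][i] = index[(v, i)]
            else:
                for u in radj[v]:
                    l2r[v][i] = max(l2r[v][i], l2r[u][i])
                if l2r[v][i] != -1:
                    forward[paths[i][l2r[v][i] - 1]].append((v, i))
    # main loop: T[i][j] plays the role of the search tree T_i
    T = [[0] + [NEG] * len(S) for _ in range(K)]   # key 0 allows fresh starts
    LLCS = [dict() for _ in range(n + 1)]
    for u in range(1, n + 1):
        for i in on_path[u]:          # u itself: query T_i before updating it
            for j in occ.get(label[u], []):
                LLCS[u][j] = max(LLCS[u].get(j, NEG), max(T[i][:j]) + 1)
        for i in on_path[u]:          # register u's values on its paths
            for j in occ.get(label[u], []):
                T[i][j] = max(T[i][j], LLCS[u][j])  # max is a safe update here
        for v, i in forward[u]:       # propagate to later nodes u reaches
            for j in occ.get(label[v], []):
                LLCS[v][j] = max(LLCS[v].get(j, NEG), max(T[i][:j]) + 1)
    return max((x for d in LLCS for x in d.values()), default=0)
```

Replacing the prefix scans `max(T[i][:j])` with the \(\mathsf{RMaxQ}\) of Lemma 3 yields the stated time bound.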

Theorem 2

Let \(G = (V,E,\ell ,\varSigma )\) be a labeled DAG of width k, and let \(S \in \varSigma ^*\), where \(\varSigma \) is an ordered alphabet. We can find a longest common subsequence between G and S in time \(O(k|E| \log |V| + (|V|+|S|)\log |S|+k|M| \log \log |S|)\).

When G is a path, the bound improves to \(O((|V|+|S|)\log |S|+|M|\log \log |S|)\), which nearly matches the fastest sparse dynamic programming algorithm for the LCS on two sequences [10] (with a difference in \(\log \log \)-factor due to a different data structure, which does not work for this order of computation).

5 Co-linear Chaining

We start with a formal definition of the co-linear chaining problem (see Fig. 2 for an illustration), following the notions introduced in [21, Sect. 15.4].

Fig. 2.

In the co-linear chaining problem between two sequences T and R, we need to find a subset of pairs of intervals (i.e., anchors) so that (i) the selected intervals in each sequence appear in increasing order; and (ii) the selected intervals cover in R the maximum amount of positions. The figure shows an input for the problem, and highlights in gray an optimal subset of anchors. Figure taken from [21].

Problem 1

(Co-linear chaining (CLC)). Let T and R be two sequences over an alphabet \(\varSigma \), and let M be a set of N pairs \(([x ..y],[c ..d])\). Find an ordered subset \(S=s_1 s_2 \cdots s_p\) of pairs from M such that

  • \(s_{j-1}.y<s_{j}.y\) and \(s_{j-1}.d<s_{j}.d\), for all \(2\le j \le p\), and

  • S maximizes the ordered coverage of R, defined as

    $$\begin{aligned} \mathtt {coverage}(R,S)=|\{i \in [1..|R|] \,|\,i \in [s_{j}.c ..s_{j}.d]\,\text {for some}\,1\le j \le p\}|. \end{aligned}$$

The definition of ordered coverage between two sequences is symmetric, as we can simply exchange the roles of T and R. But when solving the CLC problem between a DAG and a sequence, we must choose whether we want to maximize the ordered coverage on the sequence R or on the DAG G. We will consider the former variant.

First, we define the following precedence relation:

Definition 1

Given two paths \(P_1\) and \(P_2\) in a DAG G, we say that \(P_1\) precedes \(P_2\), and write \(P_1 \prec P_2\), if one of the following conditions holds:

  • \(P_1\) and \(P_2\) do not share nodes and there is a path in G from the endpoint of \(P_1\) to the startpoint of \(P_2\), or

  • \(P_1\) and \(P_2\) have a suffix-prefix overlap and \(P_2\) is not fully contained in \(P_1\); that is, if \(P_1 = (a_1,\dots ,a_i)\) and \(P_2 = (b_1,\dots ,b_j)\) then there exists a \(k \in \{\max (1,2+i-j),\dots ,i\}\) such that \(a_k = b_1\), \(a_{k+1} = b_2\), ..., \(a_{i} = b_{1+i-k}\).
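
A direct check of Definition 1, assuming a reachability oracle (which could be implemented, e.g., with the \(\mathtt{last2reach}\) values of Sect. 3), might look as follows; the function names are ours.

```python
def precedes(P1, P2, reaches):
    """Check the precedence relation P1 < P2 of Definition 1.

    P1, P2 are node lists; reaches(u, v) must report whether there is a
    (possibly empty) path from u to v in the DAG.
    """
    i, j = len(P1), len(P2)
    if not set(P1) & set(P2):
        # disjoint: need a path from the endpoint of P1 to the startpoint of P2
        return reaches(P1[-1], P2[0])
    # otherwise: a suffix-prefix overlap with P2 not fully contained in P1
    for k in range(max(1, 2 + i - j), i + 1):   # 1-based k as in Definition 1
        if P1[k - 1:] == P2[:i - k + 1]:
            return True
    return False
```

The lower bound on k guarantees that the overlap has length at most \(j-1\), i.e., that \(P_2\) is not fully contained in \(P_1\).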

We then extend the formulation of Problem 1 to handle a sequence and a DAG.

Problem 2

(CLC between a sequence and a DAG). Let R be a sequence, let G be a labeled DAG, and let M be a set of N pairs \((P,[c ..d])\), where P is a path in G and \(c \le d\) are non-negative integers. Find an ordered subset \(S=s_1 s_2 \cdots s_p\) of pairs from M such that

  • for all \(2 \le j \le p\), it holds that \(s_{j-1}.P \prec s_{j}.P\) and \(s_{j-1}.d < s_{j}.d\), and

  • S maximizes the ordered coverage of R, analogously defined as \(\mathtt {coverage}(R,S)=|\{i \in [1..|R|] \,|\,i \in [s_{j}.c ..s_{j}.d]\,\text {for some}\,1\le j \le p\}|\).

To illustrate the main technique of this paper, let us for now only seek solutions where paths in consecutive pairs in a solution do not overlap in the DAG. Suffix-prefix overlaps between paths turn out to be challenging; we prove this case in the full version of this paper [19].

Problem 3

(Overlap-limited CLC between a sequence and a DAG). Let R be a sequence, let G be a labeled DAG, and let M be a set of N pairs \((P,[c ..d])\), where P is a path in G and \(c \le d\) are non-negative integers (with the interpretation that \(\ell (P)\) matches \(R[c ..d]\)). Find an ordered subset \(S=s_1 s_2 \cdots s_p\) of pairs from M such that

  • for all \(2 \le j \le p\), it holds that there is a non-empty path from the last node of \(s_{j-1}.P\) to the first node of \(s_{j}.P\) and \(s_{j-1}.d < s_{j}.d\), and

  • S maximizes \(\mathtt {coverage}(R,S)\).

First, let us consider a trivial approach to solve Problem 3. Assume we have ordered in \(O(|E| + N)\) time the N input pairs as \(M[1],M[2],\dots , M[N]\), so that the endpoints of \(M[1].P, M[2].P,\dots ,M[N].P\) are in topological order, breaking ties arbitrarily. We denote by C[j] the maximum ordered coverage of \(R[1 ..M[j].d]\) using the pair M[j] and any subset of pairs from \(\{M[1],M[2],\dots , M[j-1]\}\).

Theorem 3

Overlap-limited co-linear chaining between a sequence and a labeled DAG \(G=(V,E,\ell ,\varSigma )\) (Problem 3) on N input pairs can be solved in \(O((|V| + |E|) N)\) time.

Proof

First, we reverse the edges of G. Then we mark the nodes that correspond to the path endpoints for every pair. After this preprocessing we can compute the maximum ordered coverage for the pairs as follows: for every pair M[j], with \(j \in \{1,\dots ,N\}\) taken in topological order of the path endpoints, we do a depth-first traversal starting at the startpoint of path M[j].P. Note that since the edges are reversed, the depth-first traversal checks only pairs whose paths are predecessors of M[j].P.

Whenever we encounter a node that corresponds to the path endpoint of a pair \(M[j']\), we first examine whether it fulfills the criterion \(M[j'].d < M[j].c\) (call this case (a)). The best ordered coverage using pair M[j] after all such \(M[j']\) is then

$$\begin{aligned} C^\text {a}[j]=\max _{j' \,:\,M[j'].d<M[j].c} \{C[j']+(M[j].d-M[j].c+1) \}, \end{aligned}$$
(2)

where \(C[j']\) is the best ordered coverage when using pair \(M[j']\) last.

If pair \(M[j']\) does not fulfill the criterion for case (a), we then check whether \(M[j].c \le M[j'].d \le M[j].d\) (call this case (b)). The best ordered coverage using pair M[j] after all such \(M[j']\) with \(M[j'].c < M[j].c\) is then

$$\begin{aligned} C^\text {b}[j]=\max _{j' \,:\,M[j].c\le M[j'].d\le M[j].d} \{C[j']+(M[j].d-M[j'].d)\}. \end{aligned}$$
(3)

Inclusions, i.e. pairs with \(M[j].c \le M[j'].c\), may be computed incorrectly in \(C^\text {b}[j]\); this does not affect correctness, since a better or equally good solution that does not use them is computed in \(C^\text {a}[j]\) or \(C^\text {b}[j]\) [1].

Finally, we take \(C[j]=\max (C^\text {a}[j],C^\text {b}[j])\). Depth-first traversal takes \(O(|V|+|E|)\) time and is executed N times, for \(O((|V| + |E|) N)\) total time.    \(\square \)
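
The proof above translates into code directly. The following sketch (our own naming; pairs are given as (P, c, d) triples) performs one reverse depth-first traversal per pair:

```python
def naive_clc(n, adj, M):
    """Sketch of the trivial O((|V|+|E|)N) algorithm of Theorem 3.

    Nodes are 1..n in topological order; adj[v] lists out-neighbors.
    M is a list of pairs (P, c, d) with P a node list (a path in G) and
    [c..d] the matched interval of R. Returns the max ordered coverage.
    """
    radj = [[] for _ in range(n + 1)]        # reversed edges
    for u in range(1, n + 1):
        for w in adj[u]:
            radj[w].append(u)
    ends_at = [[] for _ in range(n + 1)]     # pairs whose path ends at a node
    for j, (P, c, d) in enumerate(M):
        ends_at[P[-1]].append(j)
    order = sorted(range(len(M)), key=lambda j: M[j][0][-1])  # by endpoint
    C = [0] * len(M)
    for j in order:
        P, c, d = M[j]
        Ca = Cb = d - c + 1                  # the chain may start with M[j]
        # DFS on reversed edges from the proper predecessors of M[j].P's start
        stack, seen = list(radj[P[0]]), set(radj[P[0]])
        while stack:
            v = stack.pop()
            for jp in ends_at[v]:            # predecessor pair M[j']
                Pp, cp, dp = M[jp]
                if dp < c:                   # case (a): no overlap in R
                    Ca = max(Ca, C[jp] + (d - c + 1))
                elif dp <= d and cp < c:     # case (b): proper overlap in R
                    Cb = max(Cb, C[jp] + (d - dp))
            for u in radj[v]:
                if u not in seen:
                    seen.add(u)
                    stack.append(u)
        C[j] = max(Ca, Cb)
    return max(C, default=0)
```

Starting the traversal at the in-neighbors of M[j].P's startpoint enforces the non-empty-path requirement of Problem 3.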

However, we can do significantly better than \(O((|V| + |E|) N)\) time. In the next sections we will describe how to apply the framework from Sect. 3 here.

5.1 Co-linear Chaining on Sequences Revisited

We now describe the dynamic programming algorithm from [1] for the case of two sequences, as we will then reuse this same algorithm in our MPC approach.

First, sort input pairs in M by the coordinate y into the sequence M[1], M[2], ..., M[N], so that \(M[i].y\le M[j].y\) holds for all \(i<j\). This will ensure that we consider the overlapping ranges in sequence T in the correct order. Then, we fill a table \(C[1..N]\) analogous to that of Theorem 3 so that C[j] gives the maximum ordered coverage of \(R[1 ..M[j].d]\) using the pair M[j] and any subset of pairs from \(\{M[1],M[2],\dots , M[j-1]\}\). Hence, \(\max _j C[j]\) gives the total maximum ordered coverage of R.

Consider Eq. (2) and (3). Now we can use an invariant technique to convert these recurrence relations so that we can exploit the range maximum queries of Lemma 3:

$$\begin{aligned} C^\mathtt {a}[j]= & {} (M[j].d-M[j].c+1) +\max _{j' \,:\,M[j'].d<M[j].c} C[j']\\= & {} (M[j].d-M[j].c+1)+\mathcal {T}.\mathsf {RMaxQ}(0,M[j].c-1),\\ C^\mathtt {b}[j]= & {} M[j].d +\max _{j' \,:\,M[j].c\le M[j'].d\le M[j].d} \{C[j']-M[j'].d\} \\= & {} M[j].d+\mathcal {I}.\mathsf {RMaxQ}(M[j].c,M[j].d), \\ C[j]= & {} \max (C^\mathtt {a}[j],C^\mathtt {b}[j]). \end{aligned}$$

For these to work correctly, we need to have properly updated the trees \(\mathcal {T}\) and \(\mathcal {I}\) for all \(j' \in [1..j-1]\). That is, we need to call \(\mathcal {T}. \mathsf {update}(M[j'].d,C[j'])\) and \(\mathcal {I}.\mathsf {update}(M[j'].d,C[j']-M[j'].d)\) after computing each \(C[j']\). The running time is \(O(N \log N)\).
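
The recurrences above can be sketched as follows. For clarity this sketch (our own naming; pairs are (x, y, c, d) tuples) replaces the two search trees \(\mathcal{T}\) and \(\mathcal{I}\) of Lemma 3 with dictionaries scanned linearly, so it runs in \(O(N^2)\) rather than \(O(N \log N)\):

```python
NEG = float('-inf')

def colinear_chaining(M):
    """Sketch of the chaining recurrences of Sect. 5.1 on two sequences.

    M is a list of pairs (x, y, c, d): [x..y] an interval of T and
    [c..d] an interval of R. Returns the maximum ordered coverage of R.
    """
    order = sorted(range(len(M)), key=lambda j: M[j][1])   # sort by y
    T = {0: 0}          # key M[j'].d -> C[j']             (case (a))
    I = {}              # key M[j'].d -> C[j'] - M[j'].d   (case (b))
    C = [0] * len(M)
    for j in order:
        x, y, c, d = M[j]
        # C^a[j]: predecessors ending strictly before c in R
        Ca = (d - c + 1) + max((v for k, v in T.items() if k <= c - 1),
                               default=NEG)
        # C^b[j]: predecessors overlapping [c..d] in R
        Cb = d + max((v for k, v in I.items() if c <= k <= d), default=NEG)
        C[j] = max(Ca, Cb)
        T[d] = max(T.get(d, NEG), C[j])
        I[d] = max(I.get(d, NEG), C[j] - d)
    return max(C, default=0)
```

The key 0 with value 0 in \(\mathcal{T}\) lets a chain start fresh with M[j]; swapping the dictionaries for the trees of Lemma 3 recovers the \(O(N \log N)\) bound.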

Figure 2 illustrates the optimal chain on our schematic example. This chain can be extracted by modifying the algorithm to store traceback pointers.

Theorem 4

([1, 32]). Problem 1 on N input pairs can be solved in the optimal \(O(N \log N)\) time.

5.2 Co-linear Chaining on DAGs Using a Minimum Path Cover

Let us now modify the above algorithm to work with DAGs, using the main technique of this paper.

Theorem 5

Problem 3 on a labeled DAG \(G=(V,E,\ell ,\varSigma )\) of width k and a set of N input pairs can be solved in \(O(k|E| \log |V|+ kN \log N)\) time.

Proof

Assume we have a path cover of size K and \(\mathtt {forward}[u]\) computed for all \(u\in V\). For each path \(i\in [1..K]\), we create two binary search trees \(\mathcal {T}_i\) and \(\mathcal {I}_i\). As a reminder, these trees store the coverages of pairs that do not overlap, and that do overlap, respectively, the current pair on the sequence. Moreover, recall that in Problem 3 we do not consider solutions where consecutive paths in the graph overlap.

As keys, we use M[j].d, for every pair M[j], and additionally the key 0. The value of every key is initialized to \(-\infty \).

After these preprocessing steps, we process the nodes in topological order, as detailed in Algorithm 1. If node v corresponds to the endpoint of some M[j].P, we update the trees \(\mathcal {T}_i\) and \(\mathcal {I}_i\) for all covering paths i containing node v. Then we follow all forward propagation links \((w,i) \in \mathtt {forward}[v]\) and update C[j] for each path M[j].P starting at w, taking into account all pairs whose path endpoints are in covering path i. Before the main loop visits w, we have processed all forward propagation links to w, and the computation of C[j] has taken all previous pairs into account, as in the naive algorithm, but now indirectly through the K search trees. Exceptions are the pairs overlapping in the graph, which we omit in this problem statement. The forward propagation ensures that the search tree query results are indeed taking only reachable pairs into account. While C[j] is already computed when visiting w, the startpoint of M[j].P, the added coverage with the pair is updated to the search trees only when visiting the endpoint.

There are NK forward propagation links, and both search trees are queried in \(O(\log N)\) time per link. Every search tree whose path contains the endpoint of a pair is updated; since each endpoint is contained in at most K paths, the number of updates is likewise bounded by 2NK. With Theorem 1 plugged in, we have \(K = k\) and the total running time becomes \(O(k|E| \log |V|+k N \log N)\).    \(\square \)

Algorithm 1.

6 Discussion and Experiments

For applying our solutions to Problem 2 in practice, one first needs to find the alignment anchors. As explained in the problem formulation, alignment anchors are such pairs \((P,[c ..d])\) where P is a path in G and \(\ell (P)\) matches \(R[c ..d]\). With sequence inputs, such pairs are usually taken to be maximal exact matches (MEMs) and can be retrieved in small space in linear time [4, 5]. It is largely an open problem how to retrieve MEMs between a sequence and a DAG efficiently: The case of length-limited MEMs is studied in [33], based on an extension of [34] with features such as suffix tree functionality. On the practical side, anchor finding has already been incorporated into tools for conducting alignment of a sequence to a DAG [20, 25].

For the purpose of demonstrating the efficiency of our MPC-approach applied to co-linear chaining, we implemented a MEM-finding routine based on simple dynamic programming. We leave it for future work to incorporate a practical procedure (e.g. like those in [20, 25]). We tested the time improvement of our MPC-approach (Theorem 5) over the trivial algorithm (Theorem 3) on the sequence graphs of annotated human genes. Out of all the 62219 genes in the HG38 annotation for all human chromosomes, we singled out 8628 genes such that their sequence graph had at least 5000 nodes. Out of these, we picked 500 genes at random.

The size of the graphs for these 500 genes varied between \(|V|=5023\) and \(|V|=30959\) vertices. Their width, i.e., the number of paths in the MPC, varied between \(k=1\) and \(k=15\). (The number of graphs for each value of k is listed in the column #graphs of the top table of Fig. 3.) The number of anchors, N, for patterns of length 1000 varied between \(10^1\) and \(10^5\). As shown in Fig. 3, with small values of N, our MPC-based co-linear chaining algorithm was twice as fast as the trivial algorithm. When values of N were increased from \(10^1\) to \(10^5\), the difference increased to two orders of magnitude.

Fig. 3.

The average running times and their standard deviations (in milliseconds) of the two approaches for co-linear chaining between a sequence and a DAG (Problem 2), for all inputs of a certain width k (top), and with N belonging to a certain interval (bottom). Both approaches are given the same anchors; the time for finding them is not included.

The improved efficiency when compared to the naive approach gives reason to believe a practical sequence-to-DAG aligner can be engineered along the algorithmic foundations given here. Future work includes the incorporation of a practical anchor-finding method, and testing whether the complete scheme improves transcript prediction through improved finding of exon chains [18, 30].

On the theoretical side, it remains open whether the MPC algorithm could benefit from a better initial approximation and/or one that is faster to compute. More generally, it remains open whether the overall bound \(O(k|E|\log |V|)\) for the MPC problem can be improved.