1 Introduction

A path cover of a DAG \(G = (V, E)\) is a set of paths such that every node of G belongs to some path. A minimum path cover (MPC) is one having the minimum number of paths. The size of an MPC is also called the width of G. Many DAGs commonly used in genome research, such as graphs encoding human mutations [8] and graphs modeling gene transcripts [15], can consist, in the former case, of millions of nodes and, in the latter case, of thousands of nodes. However, they generally have a small width on average; for example, splicing graphs for most genes in human chromosome 2 have width at most 10 [35, Fig. 7]. To the best of our knowledge, among the many MPC algorithms [6, 7, 12, 16, 27, 31], there are only three whose complexity depends on the width of the DAG. Say the width of G is k. The first algorithm runs in time \(O(|V||E| + k|V|^2)\) and can be obtained by slightly modifying an algorithm for finding a minimum chain cover in partial orders from [11]. The other two algorithms are due to Chen and Chen: the first one works in time \(O(|V|^2 + k\sqrt{k}|V|)\) [6], and the second one works in time \(O(\max (\sqrt{|V|}|E|, k\sqrt{k}|V|))\) [7].

In this paper we present an MPC algorithm running in time \(O(k|E|\log |V|)\). For example, for \(k = o(\sqrt{|V|} / \log |V|)\) and \(|E| = O(|V|^{3/2})\), this is better than all previous algorithms. Our algorithm is based on the following standard reduction of a minimum flow problem to a maximum flow problem (see e.g. [2]): (i) find a feasible flow/path cover satisfying all demands, and (ii) solve a maximum flow problem in a graph encoding how much flow can be removed from every edge. Our main insight is to solve step (i) by finding an approximate solution that is greater than the optimal one only by an \(O(\log |V|)\) factor. Then, if we solve step (ii) with the Ford-Fulkerson algorithm, the number of iterations can be bounded by \(O(k\log |V|)\).

We then proceed to show that some problems (like pattern matching) that admit efficient sparse dynamic programming solutions on sequences [10] can be extended to DAGs, so that their complexity increases only by a factor of the minimum path cover size k. Extending pattern matching to DAGs has been studied before [3, 24, 28]. For such edit-distance-based formulations our approach does not yield an improvement, but on formulations involving a sparse set of matching anchors [10] we can boost the naive solutions of their DAG extensions by exploiting a path cover. Namely, our improvement applies to many cases where a data structure over previously computed solutions is maintained and queried for computing the next value. Our new MPC algorithm enables this, as its complexity is generally of the same form as that of solving the extended problems. Given a path cover, our technique then computes so-called forward propagation links indicating how the partial solutions in each path in the cover must be synchronized.

To best illustrate the versatility of the technique itself, in the full version of this paper [19] we show how to compute a longest increasing subsequence (LIS) in a labeled DAG, in time \(O(k |E| \log |V|)\). This matches the optimal solution to the classical problem on a single sequence when, e.g., this is modeled as a path (where \(k=1\)). In Sect. 4, we also illustrate our technique with the longest common subsequence (LCS) problem between a labeled DAG \(G = (V,E)\) and a sequence S.

Finally, we consider the main problem of this paper—co-linear chaining (CLC)—first introduced in [23]. It has been proposed as a model of the sequence alignment problem that scales to massive inputs, and has been a subject of recent interest (see e.g. [22, 29, 32, 36, 38,39,40]). In the CLC problem, the input is directly assumed to be a set of N pairs of intervals in the two sequences that match (either exactly or approximately). The CLC alignment solution asks for a subset of these plausible pairs that maximizes the coverage in one of the sequences, and whose elements appear in increasing order in both sequences. The fastest algorithm for this problem runs in the optimal \(O(N \log N)\) time [1].

We define a generalization of the CLC problem between a sequence and a labeled DAG. As motivation, we mention the problem of aligning a long sequence, or even an entire chromosome, inside a DAG storing all known mutations of a population with respect to a reference genome (such as the above-mentioned [8], or more specifically a linearized version of it [14]). Here, the N input pairs match intervals in the sequence with paths (also called anchors) in the DAG. This problem is not straightforward, as the topological order of the DAG might not follow the reachability order between the anchors. Existing tools for aligning DNA sequences to DAGs (BGREAT [20], vg [25]) rely on anchors but do not explicitly consider solving CLC optimally on the DAG.

The algorithm we propose uses the general framework mentioned above. Since it is more involved, we will develop it in stages. We first give a simple approach to solve a relaxed co-linear chaining problem using \(O((|V|+|E|) N)\) time. Then, we introduce the MPC approach that requires \(O(k|E| \log |V| + kN \log N)\) time. As above, if the DAG is a labeled path representing a sequence, the running time of our algorithm is reduced to the best current solution for the co-linear chaining problem on sequences, \(O(N \log N)\) [1]. In the full version of this paper [19], we use a Burrows-Wheeler technique to efficiently handle a special case that we omitted in this relaxed variant. We remark that one can reduce the LIS and LCS problems to the CLC problem to obtain the same running time bounds as mentioned earlier; these are given for the sake of completeness.

In the last section we discuss the anchor-finding preprocessing step. We implemented the new MPC-based co-linear chaining algorithm and conducted experiments on splicing graphs to show that the approach is practical, once anchors are given. Some future directions on how to incorporate practical anchors, and how to apply the techniques to transcript prediction, are discussed.

Notation. To simplify notation, for any DAG \(G = (V,E)\) we will assume that V is always \(\{1,\dots ,|V|\}\) and that \(1,\dots ,|V|\) is a topological order on V (so that for every edge \((u,v)\) we have \(u < v\)). We will also assume that \(|E| \ge |V| - 1\). A labeled DAG is a tuple \((V,E,\ell ,\varSigma )\) where \((V,E)\) is a DAG and \(\ell : V \rightarrow \varSigma \) assigns labels from \(\varSigma \) to the nodes, \(\varSigma \) being an ordered alphabet.

For a node \(v \in V\), we denote by \(N^-(v)\) the set of in-neighbors of v and by \(N^+(v)\) the set of out-neighbors of v. If there is a (possibly empty) path from node u to node v we say that u reaches v. We denote by \(R^-(v)\) the set of nodes that reach v. We denote a set of consecutive integers with interval notation [i..j], meaning \(\{i,i+1,\ldots ,j\}\). For a pair of intervals \(m=([x..y],[c..d])\), we use m.x, m.y, m.c, and m.d to denote the four respective endpoints. We also consider pairs of the form \(m=(P,[c..d])\) where P is a path, and use m.P to access P. The first node of P will be called its startpoint, and its last node will be called its endpoint. For a set M we may fix an order, to access an element as M[i].

2 The MPC Algorithm

In this section we assume basic familiarity with network flow concepts; see [2] for further details. In the minimum flow problem, we are given a directed graph \(G = (V,E)\) with a single source and a single sink, with a demand \(d : E \rightarrow \mathbb {Z}\) for every edge. The task is to find a flow of minimum value (the value is the sum of the flow on the edges exiting the source) that satisfies all demands (to be called feasible). The standard reduction from the minimum path cover problem to a minimum flow one (see, e.g. [26]) creates a new DAG \(G^*\) by replacing each node v with two nodes \(v^-,v^+\), adds the edge \((v^-,v^+)\) and adds all in-neighbors of v as in-neighbors of \(v^-\), and all out-neighbors of v as out-neighbors of \(v^+\). Finally, the reduction adds a global source with an outgoing edge to every node, and a global sink with an incoming edge from every node. Edges of type \((v^-,v^+)\) get demand 1, and all other edges get demand 0. The value of the minimum flow equals k, the width of G, and any decomposition of it into source-to-sink paths induces a minimum path cover in G.
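
The node-splitting reduction above can be materialized in linear time. The following sketch (our own naming and edge-list representation, not taken from the paper) maps node v of G to the pair \(2v-1, 2v\) in \(G^*\):

```python
def flow_reduction(n, edges):
    """Build the flow network G* of the MPC-to-minimum-flow reduction.

    Nodes of G are 1..n; edges is a list of pairs (u, v). In G*, node v
    splits into v_in = 2*v - 1 and v_out = 2*v, and the split edge
    (v_in, v_out) gets demand 1. Node 0 is the global source and
    2*n + 1 the global sink; all remaining edges get demand 0.
    Returns (source, sink, list of (u, v, demand)).
    """
    source, sink = 0, 2 * n + 1
    star = []
    for v in range(1, n + 1):
        star.append((2 * v - 1, 2 * v, 1))   # split edge of v, demand 1
        star.append((source, 2 * v - 1, 0))  # a path may start at v
        star.append((2 * v, sink, 0))        # a path may end at v
    for (u, v) in edges:
        star.append((2 * u, 2 * v - 1, 0))   # original edge u -> v
    return source, sink, star
```

Any feasible flow in this network decomposes into source-to-sink paths whose projections back to G cover every node, since each split edge carries at least one unit of flow.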

Our MPC algorithm is based on the following simple reduction of a minimum flow problem to a maximum flow one (see e.g. [2]): (i) find a feasible flow \(f : E \rightarrow \mathbb {Z}\); (ii) transform this into a minimum feasible flow, by finding a maximum flow \(f'\) in G in which every \(e \in E\) now has capacity \(f(e) - d(e)\). The final minimum flow solution is obtained as \(f(e) - f'(e)\), for every \(e \in E\). Observe that the path cover computed in step (i) induces a feasible flow of value \(O(k\log |V|)\). Thus, in step (ii) we need to shrink this flow into a flow of value k. If we run the Ford-Fulkerson algorithm, this means that there are \(O(k\log |V|)\) successive augmenting paths, each of which can be found in time \(O(|E|)\). This gives a time bound for step (ii) of \(O(k|E|\log |V|)\).

We solve step (i) in time \(O(k|E|\log |V|)\) by finding a path cover in \(G^*\) whose size is larger than k only by a multiplicative factor \(O(\log |V|)\). This is based on the classical greedy set cover algorithm, see e.g. [37, Chapter 2]: at each step, select a path covering most of the remaining uncovered nodes.

Such an approximation approach has also been applied to other covering problems on graphs, such as a 2-hop cover [9]. More importantly, the approximation-and-refinement approach is similar to the one from [11] for finding the minimum number k of chains to cover a partial order of size n. A chain is a set of pairwise comparable elements. The algorithm from [11] runs in time \(O(kn^2)\), and it has the same feature as ours: it first finds a set of \(O(k\log n)\) chains in the same way as we do (longest chains covering the most uncovered elements), and then in a second step reduces these to k. However, if we were to apply this algorithm to DAGs, it would run in time \(O(|V||E| + k|V|^2)\), which is slower than our algorithm for small k. This is because it uses the classical reduction given by Fulkerson [12] to a bipartite graph, where each edge of the graph encodes a pair of elements in the relation. Since DAGs are not transitive in general, to use this reduction one needs first to compute the transitive closure of the DAG, in time O(|V||E|).

We now show how to solve step (i) within the claimed running time, by dynamic programming.

Lemma 1

Let \(G = (V,E)\) be a DAG, and let k be the width of G. In time \(O(k|E|\log |V|)\), we can compute a path cover \(P_1,\dots ,P_K\) of G, such that \(K = O(k\log |V|)\).

Proof

The algorithm works by choosing, at each step, a path that covers the most uncovered nodes. For every node \(v \in V\), we store \(\mathtt {m}[v] = 1\), if v is not covered by any path, and \(\mathtt {m}[v] = 0\) otherwise. We also store \(\mathtt {u}[v]\), the largest number of uncovered nodes on a path starting at v. The values \(\mathtt {u}[\cdot ]\) are computed by dynamic programming, by traversing the nodes in inverse topological order and setting \(\mathtt {u}[v] = \mathtt {m}[v] + \max _{w \in N^+(v)} \mathtt {u}[w]\). Initially we have \(\mathtt {m}[v] = 1\) for all v. We then compute \(\mathtt {u}[v]\) for all v, in time O(|E|). By taking the node v with the maximum \(\mathtt {u}[v]\), and tracing back along the optimal path starting at v, we obtain our first path in time O(|E|). We then update \(\mathtt {m}[v] = 0\) for all nodes on this path, and iterate this process until all nodes are covered. This takes overall time O(K|E|), where K is the number of paths found.

The analysis of this algorithm is identical to that of the classical greedy set cover algorithm [37, Chapter 2], because the universe to be covered is V and every possible path in G is a possible covering set; this implies that \(K = O(k\log |V|)\).

   \(\square \)
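
The greedy procedure of Lemma 1 can be sketched as follows, assuming nodes \(1,\dots ,n\) are numbered in topological order and adjacency lists are given; the naming is ours.

```python
def greedy_path_cover(n, adj):
    """Greedy approximate path cover of a DAG (Lemma 1).

    Nodes are 1..n in topological order; adj[v] lists out-neighbors.
    Repeatedly extracts a path covering the most uncovered nodes;
    returns K = O(k log n) paths covering every node.
    """
    covered = [False] * (n + 1)
    paths = []
    while not all(covered[1:]):
        u = [0] * (n + 1)            # u[v] = max uncovered nodes on a path from v
        best_next = [None] * (n + 1)
        for v in range(n, 0, -1):    # inverse topological order
            best = 0
            for w in adj[v]:
                if u[w] > best:
                    best, best_next[v] = u[w], w
            u[v] = (0 if covered[v] else 1) + best
        v = max(range(1, n + 1), key=lambda x: u[x])  # best start node
        path = []
        while v is not None:         # trace back the optimal path
            path.append(v)
            covered[v] = True
            v = best_next[v]
        paths.append(path)
    return paths
```

Each iteration of the outer loop costs O(|E|), matching the O(K|E|) bound of the proof.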

Combining Lemma 1 with the above-mentioned application of the Ford-Fulkerson algorithm, we obtain our first result:

Theorem 1

Given a DAG \(G = (V,E)\) of width k, the MPC problem on G can be solved in time \(O(k|E|\log |V|)\).

3 The Dynamic Programming Framework

In this section we give an overview of the main ideas of our approach.

Suppose we have a problem involving DAGs that is solvable, for example by dynamic programming, by traversing the nodes in topological order. Assume also that a partial solution at each node v is obtainable from all (and only) the nodes of the DAG that can reach v, plus some other independent objects, such as another sequence. Furthermore, suppose that at each node v we need to query (and maintain) a data structure \(\mathcal {T}\) that depends on \(R^-(v)\) and such that the answer \(\mathsf {Query}(R^-(v))\) at v is decomposable as:

$$\begin{aligned} \mathsf {Query}(R^-(v)) = \bigoplus _i \mathsf {Query}(R^-_i(v)). \end{aligned}$$
(1)

In the above, the sets \(R^-_i(v)\) are such that \(R^-(v) = \bigcup _i R^-_i(v)\), they are not necessarily disjoint, and \(\bigoplus \) is some operation on the queries, such as min or max, that does not assume disjointness. It is understood that after the computation at v, we need to update \(\mathcal {T}\). It is also understood that once we have updated \(\mathcal {T}\) at v, we cannot query \(\mathcal {T}\) for a node before v in topological order, because it would give an incorrect answer.

The first idea is to decompose the graph into a path cover \(P_1,\dots ,P_K\). As such, we decompose the computation only along these paths, in light of (1). We replace a single data structure \(\mathcal {T}\) with K data structures \(\mathcal {T}_1,\dots ,\mathcal {T}_K\), and perform the operation from (1) on the results of the queries to these K data structures.

Our second idea concerns the order in which the nodes on these K paths are processed. Because the answer at v depends on \(R^-(v)\), we cannot process the nodes on the K paths (and update the corresponding \(\mathcal {T}_i\)’s) in an arbitrary order. As such, for every path i and every node v, we distinguish the last node on path i that reaches v (if it exists). We will call this node \(\mathtt {last2reach}[v,i]\). See Fig. 1 for an example. We note that this insight is the same as in [17], which symmetrically identified the first node on a chain i that can be reached from v (a chain is a subsequence of a path). The following observation is the first ingredient for using the decomposition (1).

Fig. 1.

A path cover \(P_1,P_2,P_3\) of a DAG. The forward links entering v from \(\mathtt {last2reach}[v,i]\) are shown with dotted black lines, for \(i\in \{1,2,3\}\). We mark in gray the set \(R^-(v)\) of nodes that reach v.

Observation 1

Let \(P_1,\dots ,P_K\) be a path cover of a DAG G, and let \(v \in V(G)\). Let \(R_i\) denote the set of nodes of \(P_i\) from its beginning until \(\mathtt {last2reach}[v,i]\), inclusive (or the empty set, if \(\mathtt {last2reach}[v,i]\) does not exist). Then \(R^-(v) = \bigcup _{i=1}^{K}R_i\).

Proof

It is clear that \(\bigcup _{i=1}^{K} R_i \subseteq R^-(v)\). To show the reverse inclusion, consider a node \(u \in R^-(v)\). Since \(P_1,\dots ,P_K\) is a path cover, u appears on some \(P_i\). Since u reaches v, u appears on \(P_i\) before \(\mathtt {last2reach}[v,i]\), or \(u = \mathtt {last2reach}[v,i]\). Therefore u appears in \(R_i\), as desired.    \(\square \)

This allows us to identify, for every node u, a set of forward propagation links \(\mathtt {forward}[u]\), where \((v,i) \in \mathtt {forward}[u]\) holds for any node v and index i with \(\mathtt {last2reach}[v,i] = u\). These propagation links are the second ingredient in the correctness of the decomposition. Once we have computed the correct value at u, we update the corresponding data structures \(\mathcal {T}_i\) for all paths i to which u belongs. We also propagate the query value of \(\mathcal {T}_i\) in the decomposition (1) for all nodes v with \((v,i) \in \mathtt {forward}[u]\). This means that when we come to process v, we have already correctly computed all terms in the decomposition (1) and it suffices to apply the operation \(\bigoplus \) to these terms.

The next lemma shows how to compute the values \(\mathtt {last2reach}\) (and, as a consequence, all forward propagation links), also by dynamic programming.

Lemma 2

Let \(G = (V,E)\) be a DAG, and let \(P_1,\dots ,P_K\) be a path cover of G. For every \(v \in V\) and every \(i \in [1..K]\), we can compute \(\mathtt {last2reach}[v,i]\) in overall time O(K|E|).

Proof

For each \(P_i\) and every node v on \(P_i\), let \(\mathtt {index}[v,i]\) denote the position of v in \(P_i\) (say, starting from 1). Our algorithm actually computes \(\mathtt {last2reach}[v,i]\) as the index of this node in \(P_i\). Initially, we set \(\mathtt {last2reach}[v,i] = -1\) for all v and i. At the end of the algorithm, \(\mathtt {last2reach}[v,i] = -1\) will hold precisely for those nodes v that cannot be reached by any node of \(P_i\). We traverse the nodes in topological order. For every \(i \in [1..K]\), we do as follows: if v is on \(P_i\), then we set \(\mathtt {last2reach}[v,i] = \mathtt {index}[v,i]\). Otherwise, we compute by dynamic programming \(\mathtt {last2reach}[v,i]\) as \(\max _{u \in N^-(v)} \mathtt {last2reach}[u,i]\).    \(\square \)
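
The dynamic programming of Lemma 2, together with the inversion that yields the forward propagation links, can be sketched as follows (our own naming; in-neighbor lists are assumed, and positions are 1-based as in the proof):

```python
def last2reach_and_forward(n, radj, paths):
    """Compute last2reach (Lemma 2) and the forward propagation links.

    Nodes are 1..n in topological order; radj[v] lists in-neighbors of v.
    paths is the path cover P_1..P_K (0-indexed here). last2reach[v][i]
    holds the 1-based position on path i of the last node of P_i that
    reaches v, or -1 if no node of P_i reaches v. forward[u] lists the
    links (v, i) with last2reach[v][i] = u.
    """
    K = len(paths)
    index = {}                      # index[(v, i)] = position of v on path i
    for i, p in enumerate(paths):
        for pos, v in enumerate(p, start=1):
            index[(v, i)] = pos
    l2r = [[-1] * K for _ in range(n + 1)]
    for v in range(1, n + 1):       # topological order
        for i in range(K):
            if (v, i) in index:
                l2r[v][i] = index[(v, i)]
            else:
                for u in radj[v]:
                    l2r[v][i] = max(l2r[v][i], l2r[u][i])
    forward = [[] for _ in range(n + 1)]
    for v in range(1, n + 1):
        for i in range(K):
            if l2r[v][i] != -1:
                u = paths[i][l2r[v][i] - 1]   # the node at that position
                forward[u].append((v, i))
    return l2r, forward
```

The double loop runs in O(K|E|) time overall, as claimed in the lemma.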

An immediate application of Theorem 1 and of the values \(\mathtt {last2reach}[v,i]\) is for solving reachability queries. Another simple application is an extension of the longest increasing subsequence (LIS) problem to labeled DAGs. (Both are given in the full version of this paper [19]).

The LIS problem, the LCS problem of Sect. 4, as well as co-linear chaining (CLC) of Sect. 5 make use of the following standard data structure (see e.g. [21, p.20]).

Lemma 3

The following two operations can be supported with a balanced binary search tree \(\mathcal {T}\) in time \(O(\log n)\), where n is the number of leaves in the tree.

  • \(\mathsf {update}(k,\mathtt {val})\): For the leaf w with \(\mathtt {key}(w)=k\), update \(\mathtt {value}(w)=\mathtt {val}\).

  • \(\mathsf {RMaxQ}(l,r)\): Return \(\max _{w \,:\,l\le \mathtt {key}(w) \le r} \mathtt {value}(w)\) (Range Maximum Query).

Moreover, the balanced binary search tree can be built in O(n) time, given the n pairs \((\mathtt {key},\mathtt {value})\) sorted by component \(\mathtt {key}\).
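
One standard way to realize the structure of Lemma 3 is a segment tree built over the sorted keys; the sketch below (our own class and method names, with \(\mathsf{RMaxQ}\) written as rmaxq) supports both operations in \(O(\log n)\):

```python
import bisect

class RMaxTree:
    """Range-maximum search tree in the spirit of Lemma 3.

    Built in O(n) from (key, value) pairs sorted by key; update and
    rmaxq both run in O(log n). Keys passed to update must be among
    the keys given at construction time.
    """
    def __init__(self, pairs):
        self.keys = [k for k, _ in pairs]
        self.n = len(pairs)
        self.t = [float('-inf')] * (2 * self.n)
        for j, (_, val) in enumerate(pairs):
            self.t[self.n + j] = val
        for j in range(self.n - 1, 0, -1):       # O(n) bottom-up build
            self.t[j] = max(self.t[2 * j], self.t[2 * j + 1])

    def update(self, key, val):
        j = bisect.bisect_left(self.keys, key) + self.n
        self.t[j] = val
        while j > 1:                             # refresh ancestors
            j //= 2
            self.t[j] = max(self.t[2 * j], self.t[2 * j + 1])

    def rmaxq(self, l, r):
        # max of value(w) over leaves w with l <= key(w) <= r
        lo = bisect.bisect_left(self.keys, l) + self.n
        hi = bisect.bisect_right(self.keys, r) + self.n   # exclusive
        best = float('-inf')
        while lo < hi:
            if lo & 1:
                best = max(best, self.t[lo]); lo += 1
            if hi & 1:
                hi -= 1; best = max(best, self.t[hi])
            lo //= 2; hi //= 2
        return best
```

The iterative layout avoids pointer-based balancing while giving the same asymptotic guarantees.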

4 The LCS Problem

Consider a labeled DAG \(G=(V,E,\ell ,\varSigma )\) and a sequence \(S \in \varSigma ^*\), where \(\varSigma \) is an ordered alphabet. We say that the longest common subsequence (LCS) between G and S is a longest subsequence C of any path label in G such that C is also a subsequence of S.

We will modify the LIS algorithm (see the full version of this paper [19]) minimally to find an LCS between a DAG G and a sequence S. The description is self-contained yet, in the interest of space, denser than the LIS algorithm derivation. The purpose is to give an example of the general MPC framework with fewer technical details than required in the main result of this paper concerning co-linear chaining.

For any \(c \in \varSigma \), let S(c) denote the set \(\{j \mid S[j]=c\}\). For each node v and each \(j\in S(\ell (v))\), we aim to store in \(\mathtt {LLCS}[v,j]\) the length of the longest common subsequence between S[1..j] and the label of any path ending at v, among all such common subsequences having \(\ell (v)=S[j]\) as the last symbol.

Assume we have a path cover of size K and \(\mathtt {forward}[u]\) computed for all \(u\in V\). Assume also we have mapped \(\varSigma \) to \(\{0,1,2,\ldots ,|S|+1\}\) in \(O((|V|+|S|) \log |S|)\) time (e.g. by sorting the symbols of S, binary searching labels of V, and then relabeling by ranks, with the exception that, if a node label does not appear in S, it is replaced by \(|S|+1\)).

Let \(\mathcal {T}_i\) be a search tree of Lemma 3 initialized with key-value pairs (0, 0), \((1,-\infty )\), \((2,-\infty )\), ..., \((|S|,-\infty )\), for each \(i \in [1..K]\). The algorithm proceeds in a fixed topological order on G. At a node u, for every \((v,i) \in \mathtt {forward}[u]\) we now update an array \(\mathtt {LLCS}[v,j]\) for all \(j \in S(\ell (v))\) as follows: \(\mathtt {LLCS}[v,j]=\max (\mathtt {LLCS}[v,j],\mathcal {T}_i.\mathsf {RMaxQ}(0,j-1)+1)\). The update step of \(\mathcal {T}_i\) when the algorithm reaches a node v, for each covering path i containing v, is done as \(\mathcal {T}_{i}.\mathsf {update}(j',\mathtt {LLCS}[v,j'])\) for all \(j' \in S(\ell (v))\). Initialization is handled by the (0, 0) key-value pair so that any \((v,j)\) with \(\ell (v)=S[j]\) can start a new common subsequence.

The final answer to the problem is \(\max _{v\in V, j\in S(\ell (v))} \mathtt {LLCS}[v,j]\), with the actual LCS to be found with a standard traceback. The algorithm runs in \(O((|V|+|S|)\log |S|+K|M| \log |S|)\) time, where \(M=\{(v,j) \mid v \in V, j \in [1..|S|], \ell (v)=S[j]\}\), and assuming a cover of K paths is given. Notice that |M| can be \(\varOmega (|V||S|)\). With Theorem 1 plugged in, the total running time becomes \(O(k|E| \log |V| + (|V|+|S|)\log |S|+k|M| \log |S|)\). Since the queries on the data structures are semi-open, one can use the more efficient data structure from [13] to improve the bound to \(O(k|E| \log |V| + (|V|+|S|)\log |S|+k|M| \log \log |S|)\). The following theorem summarizes this result.
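
Putting the pieces together, the LCS computation can be sketched as follows. This is our own illustrative code, not the authors' implementation: it recomputes the forward links inline and replaces the search trees of Lemma 3 with plain prefix-maximum scans, so it is simpler but slower than the stated bound.

```python
NEG = float('-inf')

def lcs_dag(n, adj, label, S, paths):
    """Sketch of the LCS algorithm between a DAG and a sequence S.

    Nodes are 1..n in topological order, adj[v] lists out-neighbors,
    label[v] is the node label (index 0 unused), paths is a path cover.
    """
    K = len(paths)
    occ = {}                                   # occ[c] = S(c), 1-based positions
    for j, c in enumerate(S, start=1):
        occ.setdefault(c, []).append(j)
    # last2reach and non-self forward links (Lemma 2)
    index, on_path = {}, [[] for _ in range(n + 1)]
    for i, p in enumerate(paths):
        for pos, v in enumerate(p, start=1):
            index[(v, i)] = pos
            on_path[v].append(i)
    radj = [[] for _ in range(n + 1)]
    for u in range(1, n + 1):
        for w in adj[u]:
            radj[w].append(u)
    forward = [[] for _ in range(n + 1)]
    l2r = [[-1] * K for _ in range(n + 1)]
    for v in range(1, n + 1):
        for i in range(K):
            if (v, i) in index:
                l2r[v][i] = index[(v, i)]
            else:
                for u in radj[v]:
                    l2r[v][i] = max(l2r[v][i], l2r[u][i])
                if l2r[v][i] != -1:
                    forward[paths[i][l2r[v][i] - 1]].append((v, i))
    # main loop: T[i][j] plays the role of the search tree T_i
    T = [[0] + [NEG] * len(S) for _ in range(K)]   # key 0 allows fresh starts
    LLCS = [dict() for _ in range(n + 1)]
    for u in range(1, n + 1):
        for i in on_path[u]:          # u itself: query T_i before updating it
            for j in occ.get(label[u], []):
                LLCS[u][j] = max(LLCS[u].get(j, NEG), max(T[i][:j]) + 1)
        for i in on_path[u]:          # register u's values on its paths
            for j in occ.get(label[u], []):
                T[i][j] = max(T[i][j], LLCS[u][j])  # max is a safe update here
        for v, i in forward[u]:       # propagate to later nodes u reaches
            for j in occ.get(label[v], []):
                LLCS[v][j] = max(LLCS[v].get(j, NEG), max(T[i][:j]) + 1)
    return max((x for d in LLCS for x in d.values()), default=0)
```

Replacing the prefix scans `max(T[i][:j])` with the \(\mathsf{RMaxQ}\) of Lemma 3 yields the stated time bound.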

Theorem 2

Let \(G = (V,E,\ell ,\varSigma )\) be a labeled DAG of width k, and let \(S \in \varSigma ^*\), where \(\varSigma \) is an ordered alphabet. We can find a longest common subsequence between G and S in time \(O(k|E| \log |V| + (|V|+|S|)\log |S|+k|M| \log \log |S|)\).

When G is a path, the bound improves to \(O((|V|+|S|)\log |S|+|M|\log \log |S|)\), which nearly matches the fastest sparse dynamic programming algorithm for the LCS on two sequences [10] (with a difference in \(\log \log \)-factor due to a different data structure, which does not work for this order of computation).

5 Co-linear Chaining

We start with a formal definition of the co-linear chaining problem (see Fig. 2 for an illustration), following the notions introduced in [21, Sect. 15.4].

Fig. 2.

In the co-linear chaining problem between two sequences T and R, we need to find a subset of pairs of intervals (i.e., anchors) so that (i) the selected intervals in each sequence appear in increasing order; and (ii) the selected intervals cover in R the maximum amount of positions. The figure shows an input for the problem, and highlights in gray an optimal subset of anchors. Figure taken from [21].

Problem 1

(Co-linear chaining (CLC)). Let T and R be two sequences over an alphabet \(\varSigma \), and let M be a set of N pairs \(([x ..y],[c ..d])\). Find an ordered subset \(S=s_1 s_2 \cdots s_p\) of pairs from M such that

  • \(s_{j-1}.y<s_{j}.y\) and \(s_{j-1}.d<s_{j}.d\), for all \(2\le j \le p\), and

  • S maximizes the ordered coverage of R, defined as

    $$\begin{aligned} \mathtt {coverage}(R,S)=|\{i \in [1..|R|] \,|\,i \in [s_{j}.c ..s_{j}.d]\,\text {for some}\,1\le j \le p\}|. \end{aligned}$$

The definition of ordered coverage between two sequences is symmetric, as we can simply exchange the roles of T and R. But when solving the CLC problem between a DAG and a sequence, we must choose whether we want to maximize the ordered coverage on the sequence R or on the DAG G. We will consider the former variant.

First, we define the following precedence relation:

Definition 1

Given two paths \(P_1\) and \(P_2\) in a DAG G, we say that \(P_1\) precedes \(P_2\), and write \(P_1 \prec P_2\), if one of the following conditions holds:

  • \(P_1\) and \(P_2\) do not share nodes and there is a path in G from the endpoint of \(P_1\) to the startpoint of \(P_2\), or

  • \(P_1\) and \(P_2\) have a suffix-prefix overlap and \(P_2\) is not fully contained in \(P_1\); that is, if \(P_1 = (a_1,\dots ,a_i)\) and \(P_2 = (b_1,\dots ,b_j)\) then there exists a \(k \in \{\max (1,2+i-j),\dots ,i\}\) such that \(a_k = b_1\), \(a_{k+1} = b_2\), ..., \(a_{i} = b_{1+i-k}\).
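
A direct check of Definition 1, assuming a reachability oracle (which could be implemented, e.g., with the \(\mathtt{last2reach}\) values of Sect. 3), might look as follows; the function names are ours.

```python
def precedes(P1, P2, reaches):
    """Check the precedence relation P1 < P2 of Definition 1.

    P1, P2 are node lists; reaches(u, v) must report whether there is a
    (possibly empty) path from u to v in the DAG.
    """
    i, j = len(P1), len(P2)
    if not set(P1) & set(P2):
        # disjoint: need a path from the endpoint of P1 to the startpoint of P2
        return reaches(P1[-1], P2[0])
    # otherwise: a suffix-prefix overlap with P2 not fully contained in P1
    for k in range(max(1, 2 + i - j), i + 1):   # 1-based k as in Definition 1
        if P1[k - 1:] == P2[:i - k + 1]:
            return True
    return False
```

The lower bound on k guarantees that the overlap has length at most \(j-1\), i.e., that \(P_2\) is not fully contained in \(P_1\).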

We then extend the formulation of Problem 1 to handle a sequence and a DAG.

Problem 2

(CLC between a sequence and a DAG). Let R be a sequence, let G be a labeled DAG, and let M be a set of N pairs \((P,[c ..d])\), where P is a path in G and \(c \le d\) are non-negative integers. Find an ordered subset \(S=s_1 s_2 \cdots s_p\) of pairs from M such that

  • for all \(2 \le j \le p\), it holds that \(s_{j-1}.P \prec s_{j}.P\) and \(s_{j-1}.d < s_{j}.d\), and

  • S maximizes the ordered coverage of R, analogously defined as \(\mathtt {coverage}(R,S)=|\{i \in [1..|R|] \,|\,i \in [s_{j}.c ..s_{j}.d]\,\text {for some}\,1\le j \le p\}|\).

To illustrate the main technique of this paper, let us for now only seek solutions where paths in consecutive pairs in a solution do not overlap in the DAG. Suffix-prefix overlaps between paths turn out to be challenging; we prove this case in the full version of this paper [19].

Problem 3

(Overlap-limited CLC between a sequence and a DAG). Let R be a sequence, let G be a labeled DAG, and let M be a set of N pairs \((P,[c ..d])\), where P is a path in G and \(c \le d\) are non-negative integers (with the interpretation that \(\ell (P)\) matches \(R[c ..d]\)). Find an ordered subset \(S=s_1 s_2 \cdots s_p\) of pairs from M such that

  • for all \(2 \le j \le p\), it holds that there is a non-empty path from the last node of \(s_{j-1}.P\) to the first node of \(s_{j}.P\) and \(s_{j-1}.d < s_{j}.d\), and

  • S maximizes \(\mathtt {coverage}(R,S)\).

First, let us consider a trivial approach to solve Problem 3. Assume we have ordered in \(O(|E| + N)\) time the N input pairs as \(M[1],M[2],\dots , M[N]\), so that the endpoints of \(M[1].P, M[2].P,\dots ,M[N].P\) are in topological order, breaking ties arbitrarily. We denote by C[j] the maximum ordered coverage of \(R[1 ..M[j].d]\) using the pair M[j] and any subset of pairs from \(\{M[1],M[2],\dots , M[j-1]\}\).

Theorem 3

Overlap-limited co-linear chaining between a sequence and a labeled DAG \(G=(V,E,\ell ,\varSigma )\) (Problem 3) on N input pairs can be solved in \(O((|V| + |E|) N)\) time.

Proof

First, we reverse the edges of G. Then we mark the nodes that correspond to the path endpoints for every pair. After this preprocessing we can compute the maximum ordered coverage for the pairs as follows: for every pair M[j], with \(j \in \{1,\dots ,N\}\) taken in topological order of the path endpoints, we do a depth-first traversal starting at the startpoint of path M[j].P. Note that since the edges are reversed, the depth-first traversal checks only pairs whose paths are predecessors of M[j].P.

Whenever we encounter a node that corresponds to the path endpoint of a pair \(M[j']\), we first examine whether it fulfills the criterion \(M[j'].d < M[j].c\) (call this case (a)). The best ordered coverage using pair M[j] after all such \(M[j']\) is then

$$\begin{aligned} C^\text {a}[j]=\max _{j' \,:\,M[j'].d<M[j].c} \{C[j']+(M[j].d-M[j].c+1) \}, \end{aligned}$$
(2)

where \(C[j']\) is the best ordered coverage when using pair \(M[j']\) last.

If pair \(M[j']\) does not fulfill the criterion for case (a), we then check whether \(M[j].c \le M[j'].d \le M[j].d\) (call this case (b)). The best ordered coverage using pair M[j] after all such \(M[j']\) with \(M[j'].c < M[j].c\) is then

$$\begin{aligned} C^\text {b}[j]=\max _{j' \,:\,M[j].c\le M[j'].d\le M[j].d} \{C[j']+(M[j].d-M[j'].d)\}. \end{aligned}$$
(3)

Inclusions, i.e. pairs with \(M[j].c \le M[j'].c\), may be computed incorrectly in \(C^\text {b}[j]\); this does not affect correctness, since a better or equally good solution that does not use them is computed in \(C^\text {a}[j]\) or \(C^\text {b}[j]\) [1].

Finally, we take \(C[j]=\max (C^\text {a}[j],C^\text {b}[j])\). Depth-first traversal takes \(O(|V|+|E|)\) time and is executed N times, for \(O((|V| + |E|) N)\) total time.    \(\square \)
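
The proof above translates into code directly. The following sketch (our own naming; pairs are given as (P, c, d) triples) performs one reverse depth-first traversal per pair:

```python
def naive_clc(n, adj, M):
    """Sketch of the trivial O((|V|+|E|)N) algorithm of Theorem 3.

    Nodes are 1..n in topological order; adj[v] lists out-neighbors.
    M is a list of pairs (P, c, d) with P a node list (a path in G) and
    [c..d] the matched interval of R. Returns the max ordered coverage.
    """
    radj = [[] for _ in range(n + 1)]        # reversed edges
    for u in range(1, n + 1):
        for w in adj[u]:
            radj[w].append(u)
    ends_at = [[] for _ in range(n + 1)]     # pairs whose path ends at a node
    for j, (P, c, d) in enumerate(M):
        ends_at[P[-1]].append(j)
    order = sorted(range(len(M)), key=lambda j: M[j][0][-1])  # by endpoint
    C = [0] * len(M)
    for j in order:
        P, c, d = M[j]
        Ca = Cb = d - c + 1                  # the chain may start with M[j]
        # DFS on reversed edges from the proper predecessors of M[j].P's start
        stack, seen = list(radj[P[0]]), set(radj[P[0]])
        while stack:
            v = stack.pop()
            for jp in ends_at[v]:            # predecessor pair M[j']
                Pp, cp, dp = M[jp]
                if dp < c:                   # case (a): no overlap in R
                    Ca = max(Ca, C[jp] + (d - c + 1))
                elif dp <= d and cp < c:     # case (b): proper overlap in R
                    Cb = max(Cb, C[jp] + (d - dp))
            for u in radj[v]:
                if u not in seen:
                    seen.add(u)
                    stack.append(u)
        C[j] = max(Ca, Cb)
    return max(C, default=0)
```

Starting the traversal at the in-neighbors of M[j].P's startpoint enforces the non-empty-path requirement of Problem 3.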

However, we can do significantly better than \(O((|V| + |E|) N)\) time. In the next sections we will describe how to apply the framework from Sect. 3 here.

5.1 Co-linear Chaining on Sequences Revisited

We now describe the dynamic programming algorithm from [1] for the case of two sequences, as we will then reuse this same algorithm in our MPC approach.

First, sort input pairs in M by the coordinate y into the sequence M[1], M[2], ..., M[N], so that \(M[i].y\le M[j].y\) holds for all \(i<j\). This will ensure that we consider the overlapping ranges in sequence T in the correct order. Then, we fill a table \(C[1..N]\) analogous to that of Theorem 3 so that C[j] gives the maximum ordered coverage of \(R[1 ..M[j].d]\) using the pair M[j] and any subset of pairs from \(\{M[1],M[2],\dots , M[j-1]\}\). Hence, \(\max _j C[j]\) gives the total maximum ordered coverage of R.

Consider Eq. (2) and (3). Now we can use an invariant technique to convert these recurrence relations so that we can exploit the range maximum queries of Lemma 3:

$$\begin{aligned} C^\mathtt {a}[j]= & {} (M[j].d-M[j].c+1) +\max _{j' \,:\,M[j'].d<M[j].c} C[j']\\= & {} (M[j].d-M[j].c+1)+\mathcal {T}.\mathsf {RMaxQ}(0,M[j].c-1),\\ C^\mathtt {b}[j]= & {} M[j].d +\max _{j' \,:\,M[j].c\le M[j'].d\le M[j].d} \{C[j']-M[j'].d\} \\= & {} M[j].d+\mathcal {I}.\mathsf {RMaxQ}(M[j].c,M[j].d), \\ C[j]= & {} \max (C^\mathtt {a}[j],C^\mathtt {b}[j]). \end{aligned}$$

For these to work correctly, we need to have properly updated the trees \(\mathcal {T}\) and \(\mathcal {I}\) for all \(j' \in [1..j-1]\). That is, we need to call \(\mathcal {T}. \mathsf {update}(M[j'].d,C[j'])\) and \(\mathcal {I}.\mathsf {update}(M[j'].d,C[j']-M[j'].d)\) after computing each \(C[j']\). The running time is \(O(N \log N)\).
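
The recurrences above can be sketched as follows. For clarity this sketch (our own naming; pairs are (x, y, c, d) tuples) replaces the two search trees \(\mathcal{T}\) and \(\mathcal{I}\) of Lemma 3 with dictionaries scanned linearly, so it runs in \(O(N^2)\) rather than \(O(N \log N)\):

```python
NEG = float('-inf')

def colinear_chaining(M):
    """Sketch of the chaining recurrences of Sect. 5.1 on two sequences.

    M is a list of pairs (x, y, c, d): [x..y] an interval of T and
    [c..d] an interval of R. Returns the maximum ordered coverage of R.
    """
    order = sorted(range(len(M)), key=lambda j: M[j][1])   # sort by y
    T = {0: 0}          # key M[j'].d -> C[j']             (case (a))
    I = {}              # key M[j'].d -> C[j'] - M[j'].d   (case (b))
    C = [0] * len(M)
    for j in order:
        x, y, c, d = M[j]
        # C^a[j]: predecessors ending strictly before c in R
        Ca = (d - c + 1) + max((v for k, v in T.items() if k <= c - 1),
                               default=NEG)
        # C^b[j]: predecessors overlapping [c..d] in R
        Cb = d + max((v for k, v in I.items() if c <= k <= d), default=NEG)
        C[j] = max(Ca, Cb)
        T[d] = max(T.get(d, NEG), C[j])
        I[d] = max(I.get(d, NEG), C[j] - d)
    return max(C, default=0)
```

The key 0 with value 0 in \(\mathcal{T}\) lets a chain start fresh with M[j]; swapping the dictionaries for the trees of Lemma 3 recovers the \(O(N \log N)\) bound.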

Figure 2 illustrates the optimal chain on our schematic example. This chain can be extracted by modifying the algorithm to store traceback pointers.

Theorem 4

([1, 32]). Problem 1 on N input pairs can be solved in the optimal \(O(N \log N)\) time.

5.2 Co-linear Chaining on DAGs Using a Minimum Path Cover

Let us now modify the above algorithm to work with DAGs, using the main technique of this paper.

Theorem 5

Problem 3 on a labeled DAG \(G=(V,E,\ell ,\varSigma )\) of width k and a set of N input pairs can be solved in \(O(k|E| \log |V|+ kN \log N)\) time.

Proof

Assume we have a path cover of size K and \(\mathtt {forward}[u]\) computed for all \(u\in V\). For each path \(i\in [1..K]\), we create two binary search trees \(\mathcal {T}_i\) and \(\mathcal {I}_i\). As a reminder, these trees store the coverages of pairs that do not overlap, and that do overlap, respectively, the current pair on the sequence. Moreover, recall that in Problem 3 we do not consider solutions where consecutive paths in the graph overlap.

As keys, we use M[j].d, for every pair M[j], and additionally the key 0. The value of every key is initialized to \(-\infty \).

After these preprocessing steps, we process the nodes in topological order, as detailed in Algorithm 1. If node v corresponds to the endpoint of some M[j].P, we update the trees \(\mathcal {T}_i\) and \(\mathcal {I}_i\) for all covering paths i containing node v. Then we follow all forward propagation links \((w,i) \in \mathtt {forward}[v]\) and update C[j] for each path M[j].P starting at w, taking into account all pairs whose path endpoints are in covering path i. Before the main loop visits w, we have processed all forward propagation links to w, and the computation of C[j] has taken all previous pairs into account, as in the naive algorithm, but now indirectly through the K search trees. Exceptions are the pairs overlapping in the graph, which we omit in this problem statement. The forward propagation ensures that the search tree query results are indeed taking only reachable pairs into account. While C[j] is already computed when visiting w, the startpoint of M[j].P, the added coverage with the pair is updated to the search trees only when visiting the endpoint.

There are NK forward propagation links, and both search trees are queried in \(O(\log N)\) time per link. Every search tree whose path contains the endpoint of a pair is updated; since each endpoint is contained in at most K paths, the number of updates is likewise bounded by 2NK. With Theorem 1 plugged in, we have \(K = k\) and the total running time becomes \(O(k|E| \log |V|+k N \log N)\).    \(\square \)

Algorithm 1.

6 Discussion and Experiments

For applying our solutions to Problem 2 in practice, one first needs to find the alignment anchors. As explained in the problem formulation, alignment anchors are such pairs \((P,[c ..d])\) where P is a path in G and \(\ell (P)\) matches \(R[c ..d]\). With sequence inputs, such pairs are usually taken to be maximal exact matches (MEMs) and can be retrieved in small space in linear time [4, 5]. It is largely an open problem how to retrieve MEMs between a sequence and a DAG efficiently: The case of length-limited MEMs is studied in [33], based on an extension of [34] with features such as suffix tree functionality. On the practical side, anchor finding has already been incorporated into tools for conducting alignment of a sequence to a DAG [20, 25].

For the purpose of demonstrating the efficiency of our MPC-approach applied to co-linear chaining, we implemented a MEM-finding routine based on simple dynamic programming. We leave it for future work to incorporate a practical procedure (e.g. like those in [20, 25]). We tested the time improvement of our MPC-approach (Theorem 5) over the trivial algorithm (Theorem 3) on the sequence graphs of annotated human genes. Out of all the 62219 genes in the HG38 annotation for all human chromosomes, we singled out 8628 genes such that their sequence graph had at least 5000 nodes. Out of these, we picked 500 genes at random.

The size of the graphs for these 500 genes varied between \(|V|=5023\) and \(|V|=30959\) vertices. Their width, i.e., the number of paths in the MPC, varied between \(k=1\) and \(k=15\). (The number of graphs for each value of k is listed in the column #graphs of the top table of Fig. 3.) The number of anchors, N, for patterns of length 1000 varied between \(10^1\) and \(10^5\). As shown in Fig. 3, with small values of N, our MPC-based co-linear chaining algorithm was twice as fast as the trivial algorithm. When values of N were increased from \(10^1\) to \(10^5\), the difference increased to two orders of magnitude.

Fig. 3.

The average running times and their standard deviations (in milliseconds) of the two approaches for co-linear chaining between a sequence and a DAG (Problem 2), for all inputs of a certain width k (top), and with N belonging to a certain interval (bottom). Both approaches are given the same anchors; the time for finding them is not included.

The improved efficiency when compared to the naive approach gives reason to believe a practical sequence-to-DAG aligner can be engineered along the algorithmic foundations given here. Future work includes the incorporation of a practical anchor-finding method, and testing whether the complete scheme improves transcript prediction through improved finding of exon chains [18, 30].

On the theoretical side, it remains open whether the MPC algorithm could benefit from a better initial approximation and/or one that is faster to compute. More generally, it remains open whether the overall bound \(O(k|E|\log |V|)\) for the MPC problem can be improved.