1 Introduction

The Shortest Superstring Problem (SSP) consists, for a set of strings \(S = \{s_1,\ldots ,s_n\}\) (with no \(s_i\) a substring of \(s_j\)), in constructing a string s of minimal length such that every element of S is a substring of s. For an arbitrary number of strings n, the problem is known to be NP-complete [10, 11] and APX-hard [2]. Lower bounds on the achievable approximation ratios over a binary alphabet have been given [16, 20], and the best approximation ratio so far for the general case is \(2 \frac{11}{30} \approx 2.3667\) [21], reached after a long series of improvements [1,2,3,4, 7, 15, 17,18,19, 22, 23, 25] leading to increasingly involved algorithms. A greedy algorithm for SSP is known to perform well in practice, but its approximation ratio has only been proved to be 3.5 [15] and is conjectured to be 2.
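As a concrete illustration of the problem statement (not one of the cited algorithms), the sketch below solves tiny SSP instances exactly by trying every ordering of the strings and merging consecutive strings with their maximum overlap; the names `overlap` and `brute_force_superstring` are ours.

```python
from itertools import permutations

def overlap(u, v):
    """Length of the longest suffix of u that is also a prefix of v."""
    for k in range(min(len(u), len(v)), 0, -1):
        if u[-k:] == v[:k]:
            return k
    return 0

def brute_force_superstring(strings):
    """Exact SSP by exhaustive search over orderings; exponential in n,
    usable only on tiny instances (assumes no string is a substring of another)."""
    best = None
    for order in permutations(strings):
        s = order[0]
        for t in order[1:]:
            s += t[overlap(s, t):]  # append only the non-overlapping part of t
        if best is None or len(s) < len(best):
            best = s
    return best
```

For example, on \(\{\text{CAT}, \text{ATG}, \text{TGA}\}\) the search returns a superstring of length 5 (such as CATGA), whereas the strings sum to 9 characters.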

The SSP has also been tackled from the perspective of compression, and several compression algorithms for SSP have been designed. The idea is to guarantee a fixed compression ratio between the sum of the lengths of the strings in the set and the length of an optimal superstring of this set. In this context, the above-mentioned greedy algorithm is proved to achieve a compression ratio of at least \(\frac{1}{2}\), while the best compression algorithm achieves a ratio of \(\frac{38}{63}\) [17].

In this article, we focus on guaranteed algorithms for practical applications of SSP, such as assembling the reads produced by Next Generation sequencers. Over the past decade, the landscape of sequencing and assembly has deeply changed with the increasing development of Next Generation Sequencing (NGS) devices. These relatively cheap devices produce, in a single run, millions of relatively short, equal-length DNA sequences read at random positions. Such sequences are called reads; each read is typically 32 to 1000 bases long, with a low and still decreasing cost per base.

Considering the specificities of read sequences (i.e., equal length and a specific period distribution), the question that arises is whether better approximation algorithms can be designed for this type of data. Note that our results easily extend to the case where the lengths of the input strings are not equal but very close, by considering the maximum length; for simplicity, however, we restrict our presentation to equal-length strings.

This research, similar to work targeting better algorithms for the small-world graphs arising in social networks, aims at algorithms better suited to the actual data. The greedy SSP algorithm was already studied on general DNA sequences (not necessarily of equal length) in [8] and proved to be efficient.

In this article, we refine the approximation bound of the classical (and one of the simplest) SSP algorithms, given in [2], by taking into account (a) the fact that the strings have equal length and (b) the period distribution of these strings. Then, given any set of equal-length strings, we propose a linear-time algorithm to compute the approximation ratio the classical algorithm can reach on it. If the period distribution fits a specific model (see Sect. 4), this ratio can be better than the currently best ratio of \(2\frac{11}{30}.\)

Our experiments show that on real NGS datasets, the ratios we obtain are most of the time better than \(2\frac{11}{30},\) allowing the use of a simpler algorithm with a better guarantee for assembling such data. For instance, on the NGS dataset SRR069579, we reach a 2.0738 approximation ratio (see Table 1). In the appendix we present the results of the ratio computation on 100 sets of reads.

To our knowledge, the only related work where the sequences have the same length is [12]. For strings of up to 7 bases, the authors propose a better approximation ratio using a De Bruijn graph approach. However, such sequences are much shorter than real-world reads.

Note that some theoretical variants of SSP have also been studied [5, 27]. We neither dwell on these studies, since their focus is far from ours, nor detail the greedy approximation conjecture, which is a subject in itself [9, 15, 24].

2 The Classical 3-Approximation Algorithm for SSP

We consider the Shortest Superstring Problem (SSP) on n strings over a finite alphabet \(\varSigma \), in a set \(S=\{s_1,s_2,\ldots ,s_n\}\), with no \(s_i\) substring of \(s_j\).

For two strings u and v, we define the maximum overlap of u and v, denoted \(\text {ov}(u,v)\), as the longest suffix of u that is also a prefix of v. We also define the prefix of u relative to v, denoted \(\text {pref}(u,v)\), as the string x such that \(u=x\cdot \text {ov}(u,v)\), i.e., the maximal prefix of u that does not overlap v.

The prefix graph (also called the distance graph) built on S is a complete directed graph with vertex set \(V=S\) and edge set \(E=\{e_{i,j}=(s_i,s_j) \mid s_i, s_j \in V\}\), where edge \(e_{i,j}\) has label \(l(e_{i,j})=\text {pref}(s_i,s_j)\) and weight \(w(e_{i,j})=|\text {pref}(s_i,s_j)|\). Figure 1 shows the prefix graph built on the set of strings \({S}=\{ \text {ACGACG},\) \(\text {CGACGA},\) \(\text {ATGTAG},\text {TAGATG},\) \(\text {GACGAT},\) \(\text {ATGATG}\}\). For the sake of simplicity, the prefix graph depicted in Fig. 1 is not complete: edges corresponding to pairs of non-overlapping strings are not represented, as they do not impact our example.

Fig. 1.

The prefix graph of \( S =\{ \text {ACGACG},\) \(\text {CGACGA},\) \(\text {ATGTAG},\) \(\text {TAGATG},\) \(\text {GACGAT},\text {ATGATG} \}\). Vertices in the graph are numbered from \(\text {[0]}\) to \(\text {[5]}\). We highlight a cycle cover made of two cycles, whose edges are in bold: (1) \(\text {[0]}\) - \(\text {[1]}\) and (2) \(\text {[2]}\) - \(\text {[3]}\) - \(\text {[4]}\) - \(\text {[5]}\).
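The definitions above can be sketched directly in code (a naive quadratic overlap computation; the function names are ours, and suffix-tree based implementations would be preferred on large inputs):

```python
def ov(u, v):
    """Maximum overlap of u and v: longest suffix of u that is a prefix of v."""
    for k in range(min(len(u), len(v)), 0, -1):
        if u[-k:] == v[:k]:
            return v[:k]
    return ""

def pref(u, v):
    """Prefix of u relative to v, so that u == pref(u, v) + ov(u, v)."""
    return u[:len(u) - len(ov(u, v))]

def prefix_graph(strings):
    """Weights of the complete prefix graph: w[(i, j)] = |pref(s_i, s_j)|."""
    return {(i, j): len(pref(si, sj))
            for i, si in enumerate(strings)
            for j, sj in enumerate(strings) if i != j}
```

On the edge from [0] to [1] in Fig. 1, pref(ACGACG, CGACGA) is A (weight 1), and on the reverse edge pref(CGACGA, ACGACG) is CG (weight 2).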

Let c be a cycle in the prefix graph and let w(c) denote its weight, that is, the sum of the weights of its edges. Now let r be a string corresponding to one of the vertices in this cycle. Then r can be expressed by turning around the cycle a certain number of times and concatenating the labels of the edges. For instance, on cycle (1) in Fig. 1, \(\text {ACGACG}\) can be expressed as \(\text {A}\) from node [0] to [1], followed by \(\text {CG}\) from node [1] to [0], then again by \(\text {A}\) from [0] to [1] and finally by \(\text {CG}\) from [1] to [0]. In this case, turning 2 times around the cycle allows expressing \(\text {ACGACG}\). Note that the number of cycle tours is not necessarily an integer, as it is the ratio between the length of the string and the weight of the cycle.
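This can be checked mechanically for cycle (1), as a small sanity check (assuming we start spelling at node [0]):

```python
# Edge labels around cycle (1) of Fig. 1, starting at node [0]:
labels = ["A", "CG"]                  # pref([0],[1]) and pref([1],[0])
weight = sum(len(l) for l in labels)  # 3, the weight of the cycle
m = 6                                 # common length of the strings
tours = m / weight                    # number of cycle tours: 2.0
spelled = ("".join(labels) * m)[:m]   # turn around the cycle, cut at length m
assert spelled == "ACGACG" and tours == 2.0
```

Starting the same walk at node [1] instead (labels CG then A) spells CGACGA, the string of vertex [1].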

A cycle cover of a graph is a collection of disjoint cycles that cover all the vertices. Let OPT denote the length of a solution for the SSP on the strings in the set S. It was shown that OPT can be lower-bounded by the minimum weight of a cycle cover of the prefix graph of S [26]. Based on these considerations, a classical algorithm described in [2, 26] was proposed, whose general framework is given in Algorithm 1 (for a complete description of the algorithm see Chapter 7 in [26]).
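A minimum-weight cycle cover is equivalent to a minimum-weight assignment: a permutation of the vertices with no fixed point, each vertex keeping exactly one outgoing and one incoming edge. The sketch below (function names are ours) finds it by exhaustive search, which is only feasible for a handful of strings; in practice this step is solved in polynomial time, e.g., with the Hungarian algorithm.

```python
from itertools import permutations

def pref_weight(u, v):
    """|pref(u, v)| = |u| minus the length of the maximum overlap of u and v."""
    for k in range(min(len(u), len(v)), 0, -1):
        if u[-k:] == v[:k]:
            return len(u) - k
    return len(u)

def min_cycle_cover(strings):
    """Exhaustive minimum-weight cycle cover of the prefix graph.
    perm[i] = j means the cover uses edge (s_i, s_j)."""
    n = len(strings)
    best_w, best_perm = None, None
    for perm in permutations(range(n)):
        if any(perm[i] == i for i in range(n)):
            continue  # self-loops are not allowed in a cycle cover
        w = sum(pref_weight(strings[i], strings[perm[i]]) for i in range(n))
        if best_w is None or w < best_w:
            best_w, best_perm = w, perm
    return best_w, best_perm
```

On the set of Fig. 1, the cover highlighted in the figure has weight \(1+2+3+5+4+3=18\) (the prefix weights read off the two cycles), so the minimum cover weighs at most 18.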

For the example in Fig. 1, if we consider vertex [0] (\(\text {ACGACG}\)) as the representative for cycle (1) then \(\sigma _{1} = \text {A} \cdot \text {CG} \cdot \text {ACGACG} = \text {ACGACGACG}.\) For cycle (2), let us choose arbitrarily vertex [2] (\(\text {ATGTAG}\)) as the representative. Then \(\sigma _2 = \text {ATG}\cdot \text {TAGAT} \cdot \text {GACG} \cdot \text {ATG} \cdot \text {ATGTAG}=\text {ATGTAGATGAC}\text {GATGATGTAG}\). Thus, in Algorithm 1 we obtain \(S_{\sigma } = \{\text {ATGTAGATGAC}\text {GATGATGTAG}, \text {ACGACGACG}\}\).

Algorithm 1 (outline reconstructed from the description in this section; the original pseudocode figure is omitted). (1) Build the prefix graph of S and compute a minimum weight cycle cover \(\mathcal {C}=\{c_1,\ldots ,c_{nc}\}\). (2) For each cycle \(c_i\), choose a representative string \(r_i\) and let \(\sigma _i\) be the concatenation of the labels of the edges around \(c_i\), starting from \(r_i\), followed by \(r_i\) itself; set \(S_{\sigma }=\{\sigma _1,\ldots ,\sigma _{nc}\}\). (3) Compress the strings of \(S_{\sigma }\) into a single superstring using a compression algorithm.

Algorithm 1 is proved to be a 3-approximation when using, in its last step, the greedy compression algorithm of ratio \(\frac{1}{2}\). Below, we show that when applied to data having specific characteristics, the approximation factor of Algorithm 1 is in fact better.

3 Algorithm 1 on a Particular Instance of the SSP

In this section we consider a particular instance of the Shortest Superstring Problem, namely when the n strings in S have the same length m with \(0< m\ll n\). We begin this section with several notations, lemmas and corollaries, followed by an analysis of Algorithm 1 on this particular instance of the SSP.

For a string s of length m, an integer \(1 \le p \le m\) is a period of s if \(s[i] = s[i+p]\) for all \(1 \le i \le m-p\). Note that s has at least one period, namely its length m. The smallest period of s is simply called the period of s and is denoted \(\text {period}(s)\). Let n(i) (\(1 \le i \le m\)) denote the number of strings of period i.
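A direct transcription of this definition (a naive \(O(m^2)\) check per string; a linear-time alternative computes the KMP failure function f and uses the identity \(\text{period}(s) = m - f(m)\)):

```python
from collections import Counter

def period(s):
    """Smallest p such that s[i] == s[i + p] for all valid i (naive check)."""
    m = len(s)
    for p in range(1, m):
        if all(s[i] == s[i + p] for i in range(m - p)):
            return p
    return m  # the length itself is always a period

reads = ["ACGACG", "CGACGA", "ATGTAG", "TAGATG", "GACGAT", "ATGATG"]
n = Counter(period(r) for r in reads)  # n[i] = number of reads of period i
```

On the set of Fig. 1, n(3) = 3 (the two strings of cycle (1) plus ATGATG) and n(6) = 3 (the three remaining strings, whose only period is their length).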

We define the period of a cycle to be its weight. Note that the period of a cycle is at least 2, as a cycle is composed of at least two vertices, thus of at least two edges, and the weight of an edge is at least 1 (otherwise a string would be a substring of another, contradicting the definition of SSP).

Lemma 1

Let \(c \in \mathcal {C}\) be a cycle and \(s_1,\ldots ,s_k\) the strings in c. Then \(\text {period}(c)\ge \max _{i=1}^{k}\{\text {period}(s_i)\}.\)

Proof

Each string in the cycle can be expressed by turning around the cycle. If \(\text {period}(c) < \text {period}(s)\) for some \(s \in \{s_1,\ldots ,s_k\}\), then \(\text {period}(c)\) would also be a period of s, smaller than the smallest period of s, a contradiction. Thus \(\text {period}(c)\ge \text {period}(s_i)\) for all \(1 \le i \le k.\)

Corollary 1

Let \(1 \le i \le m\). The number of cycles of period less than or equal to i is at most \(\frac{1}{2}\sum _{k=1}^i n(k).\)

Proof

By Lemma 1, the strings in a cycle of period at most i must themselves have period at most i. There are \(\sum _{k=1}^i n(k)\) strings available to compose such cycles. Since a cycle contains at least two strings, there are at most \(\frac{1}{2}\sum _{k=1}^i n(k)\) cycles of period \(\le i\).

A cycle cover is composed of cycles of different periods. For example, in Fig. 1, the period of cycle (1) is 3, while that of cycle (2) is 15. The period of a cycle determines the number of turns needed to express a string on that cycle: a string s that is part of a cycle c is expressed by turning around the cycle \(\frac{m}{\text{period}(c)}\) times. Since all strings have the same length m, the smaller the period of a cycle, the more turns around the cycle are needed to express a string (see Fig. 2).

Fig. 2.

Expressing a representative r on a small cycle (with period smaller than \(m\alpha \)) in (1) versus on a large cycle (with period larger than \(m\alpha \)) in (2), where \(\alpha = 0.8\).

We split the set of cycles into two subsets with respect to their periods: small cycles, with periods less than or equal to \(m\alpha \), and large cycles, with periods greater than \(m\alpha \), where \(0 < \alpha \le 1\) is a parameter discussed later in the paper. For instance, for \(\alpha =0.8\), it is straightforward to see in Fig. 2 that cycle (1) is small while cycle (2) is large.

Corollary 2

Let \(c \in \mathcal {C}\) be a cycle and \(s_1,\ldots ,s_k\) the strings in c. If \(\text {period}(c)\le m \alpha \), then \(\text {period}(s_i) \le m \alpha \) for all \(1\le i \le k\).

Proof

By Lemma 1, the periods of the strings in a cycle are smaller than or equal to the period of the cycle.

Based on this corollary, the partition of the set of cycles with respect to \(m\alpha \) induces a partition of the set of reads: reads with period at most \(m\alpha \), which may be part of small cycles, and reads with period between \(m\alpha \) and m, which can only be part of large cycles.

Below we show that, for equal-length strings, \(||S_{\sigma }||\) (see Step 2 of Algorithm 1) gets closer to OPT as the proportion of cycles with large periods increases.

Lemma 2

$$||S_{\sigma }|| \le {\textit{wt}}(\mathcal {C}) + {\textit{wt}}(\mathcal {C}) \frac{1}{\alpha }+ \frac{1}{2} \sum _{k=1}^{m\alpha } n(k) (m - \frac{k}{\alpha }) \le (1 + \frac{1}{\alpha })OPT + \frac{1}{2} \sum _{k=1}^{m\alpha } n(k) (m - \frac{k}{\alpha })$$

Proof

It is straightforward to see that the total length of \(S_{\sigma }\) equals \(\sum _{i=1}^{nc} \left| \sigma _i \right| \). The \(\sigma _i\) of a cycle \(c_i\) is \(\text {pref}(r_i,s_{i_2}) \cdots \text {pref}(s_{i_{l-1}},s_{i_l}) \cdot \text {pref}(s_{i_l},r_{i}) \cdot r_i\). As \(|\text {pref}(r_i,s_{i_2}) \cdots \text {pref}(s_{i_{l-1}},s_{i_l}) \cdot \text {pref}(s_{i_l},r_{i})| = \text {wt}(c_i)\), summing these prefix parts over all \(\sigma _i\) yields the first \(\text {wt}(\mathcal {C})\) term in Lemma 2.

It remains to bound \(\sum _{i=1}^{nc} \left| r_i \right| \). As all \(r_i\) have the same length m, \(\sum _{i=1}^{nc} \left| r_i \right| = m \cdot nc\). However, this bound is not directly useful, since the number nc of cycles in the cycle cover cannot be finely bounded. We therefore pursue a different approach: decompose this sum over large and small period cycles, and exploit the fact that expressing an \(r_i\) on a cycle \(c_i\) requires \(\frac{m}{\text{period}(c_i)}\) cycle tours.

We first consider the total length of the \(r_i\) of the large cycles (periods \(>m\alpha \)). From the observation above, the expression of such an \(r_i\) requires at most \(\frac{1}{\alpha }\) cycle tours, so its length is at most \(\frac{1}{\alpha }\cdot \text {wt}(c_i)\). Thus the total length of these \(r_i\) is bounded by \(\frac{1}{\alpha }\cdot \text {wt}(\mathcal {C})\). Note that \(\text {wt}(\mathcal {C})\) is the sum of the weights of all cycles, not only of the large ones.

We now address the total length of the \(r_i\) of the small cycles (periods \( \le m\alpha \)). By Corollary 1, there are at most \(\frac{1}{2}\sum _{k=1}^{m\alpha } n(k)\) such cycles, so the sum of the lengths of the corresponding \(r_i\) is upper bounded by \(\frac{m}{2}\sum _{k=1}^{m\alpha } n(k)\). This can be refined: \(\frac{1}{\alpha }\) cycle tours per small cycle are already accounted for in the computation for the large cycles, which removes \(\frac{k}{\alpha }\) from this upper bound. The worst case, when counting the small cycles of period k for k from 2 to \(m\alpha \), occurs when the number of small cycles is maximal at each step k. By Corollary 1, the number of cycles from step \(k-1\) to step k can increase by at most \(\frac{n(k)}{2}\). Expressing the representatives of these \(\frac{n(k)}{2}\) additional cycles of period k requires \(\frac{n(k)}{2} (\frac{m}{k} - \frac{1}{\alpha })\) extra tours, and this number of tours has to be multiplied by k to obtain their total length. Summing over k, we get an upper bound of \(\frac{1}{2} \sum _{k=1}^{m\alpha } n(k) (m - \frac{k}{\alpha })\) for the total length of the \(r_i\) of the small cycles. Finally, as \(\text {wt}(\mathcal {C}) \le OPT\), the result follows.
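In compact form, the counting argument for the small cycles reads (our reformulation):

```latex
\sum_{k=1}^{m\alpha}
\underbrace{\frac{n(k)}{2}}_{\substack{\text{new cycles}\\ \text{at step } k}}
\cdot
\underbrace{\left(\frac{m}{k}-\frac{1}{\alpha}\right)}_{\substack{\text{tours not yet}\\ \text{accounted for}}}
\cdot
\underbrace{k\vphantom{\frac{m}{k}}}_{\substack{\text{length}\\ \text{per tour}}}
\;=\; \frac{1}{2}\sum_{k=1}^{m\alpha} n(k)\left(m-\frac{k}{\alpha}\right).
```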

Instead of using the greedy algorithm in the last step of Algorithm 1, we compress \(S_{\sigma }\) using the compression algorithm of guaranteed ratio \(\frac{38}{63}\) [17], similarly to the classical approaches to superstring approximation. We define \(\text {OPT}_{\sigma }\) as the length of an optimal superstring on \(S_{\sigma }\) and \(\tau \) as the result of the \(\frac{38}{63}\) compression algorithm on \(S_{\sigma }.\) The next lemma (Lemma 7.7 in [26]) links \(\textit{OPT}_{\sigma }\) and \(\textit{OPT}\).

Lemma 3

\(\textit{OPT}_{\sigma }\le \textit{OPT}+{\textit{wt}}(\mathcal {C})\)

By applying the \(\frac{38}{63}\) compression algorithm on \(S_{\sigma }\), we thus derive the following result:

Lemma 4

$$|\tau | \le 2\textit{OPT}+ \frac{25}{63} \left( \frac{1-\alpha }{\alpha } \right) \textit{OPT}+ \frac{25}{126} \sum _{k=1}^{m\alpha } n(k) (m - \frac{k}{\alpha }) $$

Proof

Let us denote \(\varDelta = \left( \frac{1-\alpha }{\alpha } \right) \textit{OPT}+ \frac{1}{2} \sum _{k=1}^{m\alpha } n(k) (m - \frac{k}{\alpha })\) (see Fig. 3). Since \(1 + \frac{1}{\alpha } = 2 + \frac{1-\alpha }{\alpha }\), Lemma 2 gives \(||S_{\sigma }|| \le 2{OPT} + \varDelta .\)

Lemma 3 gives \(\textit{OPT}_{\sigma }\le \textit{OPT}+\text {wt}(\mathcal {C}) \le 2\textit{OPT}.\) As \(||S_{\sigma }|| - |\tau | \ge \frac{38}{63} (||S_{\sigma }|| - \textit{OPT}_{\sigma })\), the following inequality also holds: \(||S_{\sigma }|| - |\tau | \ge \frac{38}{63} (||S_{\sigma }|| - 2 {OPT}).\)

Let us define \(|\tau _0|\) as the value of \(|\tau |\) obtained in the worst case, when \(||S_{\sigma }|| - 2 {OPT} = \varDelta \). Then \(|\tau _0| \le 2{OPT}+\frac{25}{63} \varDelta .\) Assume now that \(||S_{\sigma }||\) is smaller than in the worst case, that is, \(||S_{\sigma }|| - 2 {OPT} < \varDelta \). Then the \(|\tau |\) obtained after compressing \(S_{\sigma }\) is less than \(|\tau _0|\) (Fig. 3). Thus \(|\tau | \le |\tau _0| \le 2{OPT}+\frac{25}{63} \varDelta \) \(=\) \(2{OPT}+ \frac{25}{63} \left( \frac{1-\alpha }{\alpha } \right) \textit{OPT}+ \frac{25}{126} \sum _{k=1}^{m\alpha } n(k) (m - \frac{k}{\alpha }).\)
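The same bound can also be checked by direct substitution (our verification, equivalent to the geometric argument of Fig. 3):

```latex
|\tau| \;\le\; \|S_{\sigma}\| - \tfrac{38}{63}\left(\|S_{\sigma}\| - 2\,\mathit{OPT}\right)
\;=\; \tfrac{25}{63}\,\|S_{\sigma}\| + \tfrac{76}{63}\,\mathit{OPT}
\;\le\; \tfrac{25}{63}\,(2\,\mathit{OPT} + \varDelta) + \tfrac{76}{63}\,\mathit{OPT}
\;=\; 2\,\mathit{OPT} + \tfrac{25}{63}\,\varDelta.
```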

Fig. 3.

Compressing \(S_{\sigma }\) using the \(\frac{38}{63}\) algorithm [17].

An important point is that \(OPT \ge n\): since no string is a substring of another, the n strings must start at pairwise distinct positions in any superstring. Combined with Lemma 4, this leads to:

Theorem 1

$$\frac{|\tau |}{OPT} \le 2+ \frac{25}{63} \left( \frac{1-\alpha }{\alpha } \right) + \frac{\frac{25}{126} \sum _{k=1}^{m\alpha } n(k) (m - \frac{k}{\alpha })}{OPT} \le 2+ \frac{25}{63} \left( \frac{1-\alpha }{\alpha } \right) + \frac{25}{126 n} \sum _{k=1}^{m\alpha } n(k) (m - \frac{k}{\alpha })$$

Proof

The approximation ratio of \(|\tau |\) with respect to OPT is computed by dividing \(|\tau |\) by OPT. As \(OPT \ge n\), \(\frac{\frac{25}{126} \sum _{k=1}^{m\alpha } n(k) (m - \frac{k}{\alpha })}{OPT}\) \(\le \) \(\frac{25}{126 n} \sum _{k=1}^{m\alpha } n(k) {(m - \frac{k}{\alpha })}.\)

4 The Shortest Superstring Problem on NGS Reads

In the previous section we have shown that the classical SSP algorithm (Algorithm 1), when applied on strings of equal length, might give a better approximation factor than in the general case, depending on the period distribution of the sequences.

This approximation factor is parametrized by a parameter \(\alpha \), \(0 < \alpha \le 1\), which is inferred from the set of strings given as input. This is the novelty of our approach: the approximation factor is ad hoc, computed on the data through a linear-time procedure.

The parameter \(\alpha \) allows us to partition the set of strings into two subsets: strings with period \(\le m\alpha \), and the others, with period between \(m\alpha \) and m; thanks to Corollary 2, it also roughly partitions the set of cycles (see Sect. 3). The choice of \(\alpha \) thus determines the numbers of small and large cycles and, as seen above (Lemma 2), the lower the number of small cycles with respect to the number of large cycles, the better the approximation factor. In this section we describe a procedure that computes the optimal value of the parameter \(\alpha \) for a given dataset, and we apply this procedure to real NGS datasets composed of equal-length reads. We observe that for a large panel of these NGS datasets the optimal value of \(\alpha \) is relatively high (between 0.8 and 1), which means that the number of small cycles becomes negligible with respect to the number of reads.

As our approach is data-driven, we performed extensive tests (see Appendix, Table 5) on 100 sets of reads from different species, taking several parameters into account. For instance, in Fig. 4, we show four such sets of reads, with read lengths 32, 36, 98 and 200. The x-axis represents the periods, and the y-axis the number of reads on a \(\log_{10}\) scale. We plotted three types of curves:

  • n(x) (red circles) of the corresponding set of reads;

  • the values of n(x) for a random dataset (oblique dotted curve), experimentally generated with the same total number of sequences as the set of reads;

  • the parameter \(m \alpha \) (vertical dotted line) computed on the set of reads.

This value of \(m\alpha \) roughly corresponds to the area where the curve n(x) joins the random values. A natural approach would be to fit our experimental curves to theoretical random distributions under some probabilistic source assumption (Bernoulli or Markov, for instance). However, obtaining these theoretical distributions is an open problem [13, 14], which is out of the scope of this article.

Fig. 4.

Four sets of reads, from left to right and top to bottom: SRR069579 (human), ERR000009 (yeast), SRR211279 (human), SRR959239 (human). The x-axis is the period and the y-axis is the number of reads on a \(\log _{10}\) scale. The circles represent n(x) and the curve represents a distribution of periods on a random dataset. The dashed vertical line corresponds to the final \(m\alpha \) (computed in Sect. 5). (Color figure online)

For a set of equal-length strings, the computation of the value of \(\alpha \) consists in (a) computing the minimal period of every input sequence, (b) computing n(i) for all \(1 \le i \le m\), and (c) computing, for each \(1 \le i \le m\), the parameters \(\alpha = i/m\), \(\gamma \) \(=\) \(\frac{25}{126 n} \sum _{k=1}^{m\alpha } n(k) (m - \frac{k}{\alpha })\) and \(\beta \) \(=\) \(2 + \frac{25}{63} \left( \frac{1-\alpha }{\alpha } \right) + \gamma \). Finally, we keep the lowest \(\beta \) as the approximation ratio if it is below \(2 \frac{11}{30}\approx 2.3667\). Given a string of size m over a fixed alphabet \(\varSigma \), its smallest period can be computed in O(m) worst-case time, or even in optimal sublinear expected time \(O(\sqrt{m\log _{|\varSigma |}m})\) using a more involved algorithm [6]. Performing n such computations followed by a bucket sort with m buckets solves steps (a) and (b) in \(O(n\sqrt{m\log _{|\varSigma |}m})\) expected time. Step (c) is then computed in O(n) time.
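Steps (b) and (c) can be sketched as follows (the function name `best_ratio` is ours; running sums of n(k) and k·n(k) make step (c) cost constant time per value of i):

```python
def best_ratio(periods, m):
    """Minimize the bound of Theorem 1 over alpha = i/m, i = 1..m.

    `periods` lists the smallest period of each read (all reads of length m).
    Returns the best (beta, alpha) pair."""
    n = len(periods)
    cnt = [0] * (m + 1)        # step (b): n(i), via bucket counting
    for p in periods:
        cnt[p] += 1
    best_beta, best_alpha = float("inf"), None
    s1 = s2 = 0                # s1 = sum of n(k), s2 = sum of k*n(k), k <= i
    for i in range(1, m + 1):  # step (c)
        s1 += cnt[i]
        s2 += i * cnt[i]
        alpha = i / m
        # gamma = 25/(126 n) * sum n(k) (m - k/alpha) = 25/(126 n) (m*s1 - s2/alpha)
        gamma = 25 / (126 * n) * (m * s1 - s2 / alpha)
        beta = 2 + 25 / 63 * (1 - alpha) / alpha + gamma
        if beta < best_beta:
            best_beta, best_alpha = beta, alpha
    return best_beta, best_alpha
```

For instance, if every read has period m (no read can enter a small cycle at any threshold), the minimum is reached at \(\alpha = 1\) with \(\beta = 2\), the lower bound of the approach.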

Table 1. SRR069579 read set, 3702308 reads of size 36, \(\gamma \) \(=\) \(\frac{25}{126 n} \sum _{k=1}^{m\alpha } n(k) (m - \frac{k}{\alpha })\) and \(\beta = 2 + \frac{25}{63} \left( \frac{1-\alpha }{\alpha } \right) + \gamma .\)
Table 2. ERR000009 read set, 4049489 reads of size 32
Table 3. SRR211279 read set, 25103766 reads of size 200
Table 4. SRR959239 read set, 4143243 reads of size 98

5 Experimental Results

We first present and explain experimental results for the sets of reads SRR069579 (Table 1), ERR000009 (Table 2), SRR211279 (Table 3), and SRR959239 (Table 4).

In each table, for each period i from 1 to m, we show: (a) n(i); (b) the cumulative number of sequences; (c) the value of \(\alpha \), corresponding to i / m; (d) the value of \(1+\frac{1}{\alpha }\); (e) \(2 + \frac{25}{63} \left( \frac{1-\alpha }{\alpha } \right) \), which corresponds to the term of Theorem 1 due to the large cycles; (f) \(\gamma \), the part of the final ratio brought by the small cycles; and finally (g) \(\beta \), the final ratio that can be reached by using the value of \(\alpha \) from the previous line in the table.

The resulting approximation ratios on the four read sets above are respectively 2.0475, 2.0618, 2.0109 and 2.0163, all much lower than 2.3667, and obtained with a much simpler algorithm.

We then tested our approach on 100 sets of reads from different organisms (see Appendix, Table 5), computing the ratios with the previous method. Only one set (ERR1041316, Tupaia belangeri) is above the currently best bound of 2.3667; all the others are lower, and many are very close to 2.0, the lower bound of our approach.