1 Introduction

The Shortest Superstring Problem (SSP) consists, for a set of strings \(S = \{s_1,\ldots ,s_n\}\) (with no \(s_i\) a substring of \(s_j\)), in constructing a string s of minimal length such that every element of S is a substring of s. For an arbitrary number of strings n, the problem is known to be NP-complete [10, 11] and APX-hard [2]. Lower bounds on the achievable approximation ratios over a binary alphabet have been given [16, 20], and the best approximation ratio so far for the general case is \(2 \frac{11}{30} \approx 2.3667\) [21], reached after a long series of improvements [1,2,3,4, 7, 15, 17,18,19, 22, 23, 25] leading to increasingly involved algorithms. A greedy algorithm for SSP is known to perform well in practice, but its approximation ratio has only been proved to be 3.5 [15] and is conjectured to be 2.
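As a concrete illustration of the problem statement (not one of the cited algorithms), the sketch below solves tiny SSP instances exactly by trying every ordering of the strings and merging consecutive strings with their maximum overlap; the names `overlap` and `brute_force_superstring` are ours.

```python
from itertools import permutations

def overlap(u, v):
    """Length of the longest suffix of u that is also a prefix of v."""
    for k in range(min(len(u), len(v)), 0, -1):
        if u[-k:] == v[:k]:
            return k
    return 0

def brute_force_superstring(strings):
    """Exact SSP by exhaustive search over orderings; exponential in n,
    usable only on tiny instances (assumes no string is a substring of another)."""
    best = None
    for order in permutations(strings):
        s = order[0]
        for t in order[1:]:
            s += t[overlap(s, t):]  # append only the non-overlapping part of t
        if best is None or len(s) < len(best):
            best = s
    return best
```

For example, on \(\{\text{CAT}, \text{ATG}, \text{TGA}\}\) the search returns a superstring of length 5 (such as CATGA), whereas the strings sum to 9 characters.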

The SSP has also been tackled from the perspective of compression, and several compression algorithms for SSP have been designed. The idea is to guarantee a fixed compression ratio between the sum of the lengths of the strings in the set and the length of an optimal superstring of this set. In this context, the above-mentioned greedy algorithm is proved to achieve a compression ratio of at least \(\frac{1}{2}\), while the best compression algorithm achieves a ratio of \(\frac{38}{63}\) [17].

In this article, we focus on guaranteed algorithms for practical applications of SSP, such as assembling the reads produced by Next Generation sequencers. Over the past decade, the landscape of sequencing and assembly has deeply changed with the increasing development of Next Generation Sequencing (NGS) devices. These relatively cheap devices produce, in a single run, millions of relatively short, equal-length DNA sequences read at random positions. Such sequences are called reads; each read is typically 32 to 1000 bases long, with a low and still decreasing cost per base.

Considering the specificities of read sequences (i.e., equal length and a specific period distribution), the question that arises is whether better approximation algorithms can be designed for this type of data. Note that our results easily extend to the case where the lengths of the input strings are not equal but very close, by considering the maximum length; for simplicity, however, we restrict our presentation to equal-length strings.

This research, similar to work targeting better algorithms for the small-world graphs arising in social networks, aims at algorithms better suited to the actual data. The greedy SSP algorithm was already studied on general DNA sequences (not necessarily of equal length) in [8] and proved to be efficient.

In this article, we refine the approximation bound of the classical (and one of the simplest) SSP algorithms, given in [2], by taking into account (a) the fact that the strings have equal length and (b) the period distribution of these strings. Then, given any set of equal-length strings, we propose a linear-time algorithm to compute the approximation ratio the classical algorithm can reach on it. If the period distribution fits a specific model (see Sect. 4), this ratio can be better than the currently best ratio of \(2\frac{11}{30}.\)

Our experiments show that on real NGS datasets, the ratios we obtain are most of the time better than \(2\frac{11}{30},\) allowing the use of a simpler algorithm with a better guarantee for assembling such data. For instance, on the NGS dataset SRR069579, we reach a 2.0738 approximation ratio (see Table 1). In the appendix we present the results of the ratio computation on 100 sets of reads.

To our knowledge, the only related work where the sequences have the same length is [12]. For strings of up to 7 bases, the authors propose a better approximation ratio using a De Bruijn graph approach. However, such sequences are much shorter than real-world reads.

Note that some theoretical variants of SSP have also been studied [5, 27]. We neither dwell on these studies, since their focus is far from ours, nor detail the greedy approximation conjecture, which is a subject in itself [9, 15, 24].

2 The Classical 3-Approximation Algorithm for SSP

We consider the Shortest Superstring Problem (SSP) on n strings over a finite alphabet \(\varSigma \), in a set \(S=\{s_1,s_2,\ldots ,s_n\}\), with no \(s_i\) substring of \(s_j\).

For two strings u and v, we define the maximum overlap of u and v, denoted \(\text {ov}(u,v)\), as the longest suffix of u that is also a prefix of v. We also define the prefix of u relative to v, denoted \(\text {pref}(u,v)\), as the string x such that \(u=x\cdot \text {ov}(u,v)\), i.e., the maximal prefix of u that does not overlap v.

The prefix graph (also called the distance graph) built on S is a complete directed graph with vertex set \(V=S\) and edge set \(E=\{e_{i,j}=(s_i,s_j) \mid s_i, s_j \in V\}\), where edge \(e_{i,j}\) has label \(l(e_{i,j})=\text {pref}(s_i,s_j)\) and weight \(w(e_{i,j})=|\text {pref}(s_i,s_j)|\). Figure 1 shows the prefix graph built on the set of strings \({S}=\{ \text {ACGACG},\) \(\text {CGACGA},\) \(\text {ATGTAG},\text {TAGATG},\) \(\text {GACGAT},\) \(\text {ATGATG}\}\). For the sake of simplicity, the prefix graph depicted in Fig. 1 is not complete: edges corresponding to pairs of non-overlapping strings are not represented, as they do not impact our example.

Fig. 1.

The prefix graph of \( S =\{ \text {ACGACG},\) \(\text {CGACGA},\) \(\text {ATGTAG},\) \(\text {TAGATG},\) \(\text {GACGAT},\text {ATGATG} \}\). Vertices in the graph are numbered from \(\text {[0]}\) to \(\text {[5]}\). We highlight a cycle cover made of two cycles, whose edges are in bold: (1) \(\text {[0]}\) - \(\text {[1]}\) and (2) \(\text {[2]}\) - \(\text {[3]}\) - \(\text {[4]}\) - \(\text {[5]}\).
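The definitions above can be sketched directly in code (a naive quadratic overlap computation; the function names are ours, and suffix-tree based implementations would be preferred on large inputs):

```python
def ov(u, v):
    """Maximum overlap of u and v: longest suffix of u that is a prefix of v."""
    for k in range(min(len(u), len(v)), 0, -1):
        if u[-k:] == v[:k]:
            return v[:k]
    return ""

def pref(u, v):
    """Prefix of u relative to v, so that u == pref(u, v) + ov(u, v)."""
    return u[:len(u) - len(ov(u, v))]

def prefix_graph(strings):
    """Weights of the complete prefix graph: w[(i, j)] = |pref(s_i, s_j)|."""
    return {(i, j): len(pref(si, sj))
            for i, si in enumerate(strings)
            for j, sj in enumerate(strings) if i != j}
```

On the edge from [0] to [1] in Fig. 1, pref(ACGACG, CGACGA) is A (weight 1), and on the reverse edge pref(CGACGA, ACGACG) is CG (weight 2).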

Let c be a cycle in the prefix graph and let w(c) denote its weight, that is, the sum of the weights of its edges. Now let r be a string corresponding to one of the vertices in this cycle. Then r can be expressed by turning around the cycle a certain number of times and concatenating the labels of the edges. For instance, on cycle (1) in Fig. 1, \(\text {ACGACG}\) can be expressed as \(\text {A}\) from node [0] to [1], followed by \(\text {CG}\) from node [1] to [0], then again by \(\text {A}\) from [0] to [1] and finally by \(\text {CG}\) from [1] to [0]. In this case, turning 2 times around the cycle allows expressing \(\text {ACGACG}\). Note that the number of cycle tours is not necessarily an integer, as it is the ratio between the length of the string and the weight of the cycle.
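This can be checked mechanically for cycle (1), as a small sanity check (assuming we start spelling at node [0]):

```python
# Edge labels around cycle (1) of Fig. 1, starting at node [0]:
labels = ["A", "CG"]                  # pref([0],[1]) and pref([1],[0])
weight = sum(len(l) for l in labels)  # 3, the weight of the cycle
m = 6                                 # common length of the strings
tours = m / weight                    # number of cycle tours: 2.0
spelled = ("".join(labels) * m)[:m]   # turn around the cycle, cut at length m
assert spelled == "ACGACG" and tours == 2.0
```

Starting the same walk at node [1] instead (labels CG then A) spells CGACGA, the string of vertex [1].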

A cycle cover of a graph is a collection of disjoint cycles that cover all the vertices. Let OPT denote the length of a solution for the SSP on the strings in the set S. It was shown that OPT can be lower-bounded by the minimum weight of a cycle cover of the prefix graph of S [26]. Based on these considerations, a classical algorithm described in [2, 26] was proposed, whose general framework is given in Algorithm 1 (for a complete description of the algorithm see Chapter 7 in [26]).
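A minimum-weight cycle cover is equivalent to a minimum-weight assignment: a permutation of the vertices with no fixed point, each vertex keeping exactly one outgoing and one incoming edge. The sketch below (function names are ours) finds it by exhaustive search, which is only feasible for a handful of strings; in practice this step is solved in polynomial time, e.g., with the Hungarian algorithm.

```python
from itertools import permutations

def pref_weight(u, v):
    """|pref(u, v)| = |u| minus the length of the maximum overlap of u and v."""
    for k in range(min(len(u), len(v)), 0, -1):
        if u[-k:] == v[:k]:
            return len(u) - k
    return len(u)

def min_cycle_cover(strings):
    """Exhaustive minimum-weight cycle cover of the prefix graph.
    perm[i] = j means the cover uses edge (s_i, s_j)."""
    n = len(strings)
    best_w, best_perm = None, None
    for perm in permutations(range(n)):
        if any(perm[i] == i for i in range(n)):
            continue  # self-loops are not allowed in a cycle cover
        w = sum(pref_weight(strings[i], strings[perm[i]]) for i in range(n))
        if best_w is None or w < best_w:
            best_w, best_perm = w, perm
    return best_w, best_perm
```

On the set of Fig. 1, the cover highlighted in the figure has weight \(1+2+3+5+4+3=18\) (the prefix weights read off the two cycles), so the minimum cover weighs at most 18.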

For the example in Fig. 1, if we consider vertex [0] (\(\text {ACGACG}\)) as the representative for cycle (1) then \(\sigma _{1} = \text {A} \cdot \text {CG} \cdot \text {ACGACG} = \text {ACGACGACG}.\) For cycle (2), let us choose arbitrarily vertex [2] (\(\text {ATGTAG}\)) as the representative. Then \(\sigma _2 = \text {ATG}\cdot \text {TAGAT} \cdot \text {GACG} \cdot \text {ATG} \cdot \text {ATGTAG}=\text {ATGTAGATGAC}\text {GATGATGTAG}\). Thus, in Algorithm 1 we obtain \(S_{\sigma } = \{\text {ATGTAGATGAC}\text {GATGATGTAG}, \text {ACGACGACG}\}\).

Algorithm 1 (outline reconstructed from the description in this section; the original pseudocode figure is omitted). (1) Build the prefix graph of S and compute a minimum weight cycle cover \(\mathcal {C}=\{c_1,\ldots ,c_{nc}\}\). (2) For each cycle \(c_i\), choose a representative string \(r_i\) and let \(\sigma _i\) be the concatenation of the labels of the edges around \(c_i\), starting from \(r_i\), followed by \(r_i\) itself; set \(S_{\sigma }=\{\sigma _1,\ldots ,\sigma _{nc}\}\). (3) Compress the strings of \(S_{\sigma }\) into a single superstring using a compression algorithm.

Algorithm 1 is proved to be a 3-approximation when using, in its last step, the greedy compression algorithm of ratio \(\frac{1}{2}\). Below, we show that when applied to data having specific characteristics, the approximation factor of Algorithm 1 is in fact better.

3 Algorithm 1 on a Particular Instance of the SSP

In this section we consider a particular instance of the Shortest Superstring Problem, namely when the n strings in S have the same length m with \(0< m\ll n\). We begin this section with several notations, lemmas and corollaries, followed by an analysis of Algorithm 1 on this particular instance of the SSP.

For a string s of length m, an integer \(1 \le p \le m\) is a period of s if \(s[i] = s[i+p]\) for all \(1 \le i \le m-p\). Note that s has at least one period, namely its length m. The smallest period of s is simply called the period of s and is denoted \(\text {period}(s)\). Let n(i) (\(1 \le i \le m\)) denote the number of strings of period i.
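A direct transcription of this definition (a naive \(O(m^2)\) check per string; a linear-time alternative computes the KMP failure function f and uses the identity \(\text{period}(s) = m - f(m)\)):

```python
from collections import Counter

def period(s):
    """Smallest p such that s[i] == s[i + p] for all valid i (naive check)."""
    m = len(s)
    for p in range(1, m):
        if all(s[i] == s[i + p] for i in range(m - p)):
            return p
    return m  # the length itself is always a period

reads = ["ACGACG", "CGACGA", "ATGTAG", "TAGATG", "GACGAT", "ATGATG"]
n = Counter(period(r) for r in reads)  # n[i] = number of reads of period i
```

On the set of Fig. 1, n(3) = 3 (the two strings of cycle (1) plus ATGATG) and n(6) = 3 (the three remaining strings, whose only period is their length).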

We define the period of a cycle to be its weight. Note that the period of a cycle is at least 2, as a cycle is composed of at least two vertices, thus of at least two edges, and the weight of an edge is at least 1 (otherwise a string would be a substring of another, contradicting the definition of SSP).

Lemma 1

Let \(c \in \mathcal {C}\) be a cycle and \(s_1,\ldots ,s_k\) the strings in c. Then \(\text {period}(c)\ge \max _{i=1}^{k}\{\text {period}(s_i)\}.\)

Proof

Each string in the cycle can be expressed by turning around the cycle. If \(\text {period}(c) < \text {period}(s)\) for some \(s \in \{s_1,\ldots ,s_k\}\), then \(\text {period}(c)\) would also be a period of s, smaller than the smallest period of s, a contradiction. Thus \(\text {period}(c)\ge \text {period}(s_i)\) for all \(1 \le i \le k.\)

Corollary 1

Let \(1 \le i \le m\). The number of cycles of period less than or equal to i is at most \(\frac{1}{2}\sum _{k=1}^i n(k).\)

Proof

By Lemma 1, the strings in a cycle of period at most i must themselves have period at most i. There are \(\sum _{k=1}^i n(k)\) strings available to compose such cycles. Since a cycle contains at least two strings, there are at most \(\frac{1}{2}\sum _{k=1}^i n(k)\) cycles of period \(\le i\).

A cycle cover is composed of cycles of different periods. For example, in Fig. 1, the period of cycle (1) is 3, while that of cycle (2) is 15. The period of a cycle determines the number of turns needed to express a string on that cycle: a string s that is part of a cycle c is expressed by turning around the cycle \(\frac{m}{\text{period}(c)}\) times. Since all strings have the same length m, the smaller the period of a cycle, the more turns around the cycle are needed to express a string (see Fig. 2).

Fig. 2.

Expressing a representative r on a small cycle (with period smaller than \(m\alpha \)) in (1) versus on a large cycle (with period larger than \(m\alpha \)) in (2), where \(\alpha = 0.8\).

We split the set of cycles into two subsets with respect to their periods: small cycles, with periods less than or equal to \(m\alpha \), and large cycles, with periods greater than \(m\alpha \), where \(0 < \alpha \le 1\) is a parameter discussed later in the paper. For instance, for \(\alpha =0.8\), it is straightforward to see in Fig. 2 that cycle (1) is small while cycle (2) is large.

Corollary 2

Let \(c \in \mathcal {C}\) be a cycle and \(s_1,\ldots ,s_k\) the strings in c. If \(\text {period}(c)\le m \alpha \), then \(\text {period}(s_i) \le m \alpha \) for all \(1\le i \le k\).

Proof

By Lemma 1, the periods of the strings in a cycle are smaller than or equal to the period of the cycle.

Based on this corollary, the partition of the set of cycles with respect to \(m\alpha \) induces a partition of the set of reads: reads with period at most \(m\alpha \), which may be part of small cycles, and reads with period between \(m\alpha \) and m, which can only be part of large cycles.

Below we show that, for equal-length strings, \(||S_{\sigma }||\) (see Step 2 of Algorithm 1) gets closer to OPT as the proportion of cycles with large periods increases.

Lemma 2

$$||S_{\sigma }|| \le {\textit{wt}}(\mathcal {C}) + {\textit{wt}}(\mathcal {C}) \frac{1}{\alpha }+ \frac{1}{2} \sum _{k=1}^{m\alpha } n(k) (m - \frac{k}{\alpha }) \le (1 + \frac{1}{\alpha })OPT + \frac{1}{2} \sum _{k=1}^{m\alpha } n(k) (m - \frac{k}{\alpha })$$

Proof

It is straightforward to see that the total length of \(S_{\sigma }\) equals \(\sum _{i=1}^{nc} \left| \sigma _i \right| \). The \(\sigma _i\) of a cycle \(c_i\) is \(\text {pref}(r_i,s_{i_2}) \cdots \text {pref}(s_{i_{l-1}},s_{i_l}) \cdot \text {pref}(s_{i_l},r_{i}) \cdot r_i\). As \(|\text {pref}(r_i,s_{i_2}) \cdots \text {pref}(s_{i_{l-1}},s_{i_l}) \cdot \text {pref}(s_{i_l},r_{i})| = \text {wt}(c_i)\), summing these prefix parts over all \(\sigma _i\) yields the first \(\text {wt}(\mathcal {C})\) term in Lemma 2.

It remains to bound \(\sum _{i=1}^{nc} \left| r_i \right| \). As all \(r_i\) have the same length m, \(\sum _{i=1}^{nc} \left| r_i \right| = m \cdot nc\). However, this bound is not directly useful, since the number nc of cycles in the cycle cover cannot be finely bounded. We therefore pursue a different approach: decompose this sum over large and small period cycles, and exploit the fact that expressing an \(r_i\) on a cycle \(c_i\) requires \(\frac{m}{\text{period}(c_i)}\) cycle tours.

We first consider the total length of the \(r_i\) of the large cycles (periods \(>m\alpha \)). From the observation above, the expression of such an \(r_i\) requires at most \(\frac{1}{\alpha }\) cycle tours, so its length is at most \(\frac{1}{\alpha }\cdot \text {wt}(c_i)\). Thus the total length of these \(r_i\) is bounded by \(\frac{1}{\alpha }\cdot \text {wt}(\mathcal {C})\). Note that \(\text {wt}(\mathcal {C})\) is the sum of the weights of all cycles, not only of the large ones.

We now address the total length of the \(r_i\) of the small cycles (periods \( \le m\alpha \)). By Corollary 1, there are at most \(\frac{1}{2}\sum _{k=1}^{m\alpha } n(k)\) such cycles, so the sum of the lengths of the corresponding \(r_i\) is upper bounded by \(\frac{m}{2}\sum _{k=1}^{m\alpha } n(k)\). This can be refined: \(\frac{1}{\alpha }\) cycle tours per small cycle are already accounted for in the computation for the large cycles, which removes \(\frac{k}{\alpha }\) from this upper bound. The worst case, when counting the small cycles of period k for k from 2 to \(m\alpha \), occurs when the number of small cycles is maximal at each step k. By Corollary 1, the number of cycles from step \(k-1\) to step k can increase by at most \(\frac{n(k)}{2}\). Expressing the representatives of these \(\frac{n(k)}{2}\) additional cycles of period k requires \(\frac{n(k)}{2} (\frac{m}{k} - \frac{1}{\alpha })\) extra tours, and this number of tours has to be multiplied by k to obtain their total length. Summing over k, we get an upper bound of \(\frac{1}{2} \sum _{k=1}^{m\alpha } n(k) (m - \frac{k}{\alpha })\) for the total length of the \(r_i\) of the small cycles. Finally, as \(\text {wt}(\mathcal {C}) \le OPT\), the result follows.
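In compact form, the counting argument for the small cycles reads (our reformulation):

```latex
\sum_{k=1}^{m\alpha}
\underbrace{\frac{n(k)}{2}}_{\substack{\text{new cycles}\\ \text{at step } k}}
\cdot
\underbrace{\left(\frac{m}{k}-\frac{1}{\alpha}\right)}_{\substack{\text{tours not yet}\\ \text{accounted for}}}
\cdot
\underbrace{k\vphantom{\frac{m}{k}}}_{\substack{\text{length}\\ \text{per tour}}}
\;=\; \frac{1}{2}\sum_{k=1}^{m\alpha} n(k)\left(m-\frac{k}{\alpha}\right).
```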

Instead of using the greedy algorithm in the last step of Algorithm 1, we compress \(S_{\sigma }\) using the compression algorithm of guaranteed ratio \(\frac{38}{63}\) [17], similarly to the classical approaches to superstring approximation. We define \(\text {OPT}_{\sigma }\) as the length of an optimal superstring on \(S_{\sigma }\) and \(\tau \) as the result of the \(\frac{38}{63}\) compression algorithm on \(S_{\sigma }.\) The next lemma (Lemma 7.7 in [26]) links \(\textit{OPT}_{\sigma }\) and \(\textit{OPT}\).

Lemma 3

\(\textit{OPT}_{\sigma }\le \textit{OPT}+{\textit{wt}}(\mathcal {C})\)

By applying the \(\frac{38}{63}\) compression algorithm on \(S_{\sigma }\), we thus derive the following result:

Lemma 4

$$|\tau | \le 2\textit{OPT}+ \frac{25}{63} \left( \frac{1-\alpha }{\alpha } \right) \textit{OPT}+ \frac{25}{126} \sum _{k=1}^{m\alpha } n(k) (m - \frac{k}{\alpha }) $$

Proof

Let us denote \(\varDelta = \left( \frac{1-\alpha }{\alpha } \right) \textit{OPT}+ \frac{1}{2} \sum _{k=1}^{m\alpha } n(k) (m - \frac{k}{\alpha })\) (see Fig. 3). Since \(1 + \frac{1}{\alpha } = 2 + \frac{1-\alpha }{\alpha }\), Lemma 2 gives \(||S_{\sigma }|| \le 2{OPT} + \varDelta .\)

Lemma 3 gives \(\textit{OPT}_{\sigma }\le \textit{OPT}+\text {wt}(\mathcal {C}) \le 2\textit{OPT}.\) As \(||S_{\sigma }|| - |\tau | \ge \frac{38}{63} (||S_{\sigma }|| - \textit{OPT}_{\sigma })\), the following inequality also holds: \(||S_{\sigma }|| - |\tau | \ge \frac{38}{63} (||S_{\sigma }|| - 2 {OPT}).\)

Let us define \(|\tau _0|\) as the value of \(|\tau |\) obtained in the worst case, when \(||S_{\sigma }|| - 2 {OPT} = \varDelta \). Then \(|\tau _0| \le 2{OPT}+\frac{25}{63} \varDelta .\) Assume now that \(||S_{\sigma }||\) is smaller than in the worst case, that is, \(||S_{\sigma }|| - 2 {OPT} < \varDelta \). Then the \(|\tau |\) obtained after compressing \(S_{\sigma }\) is less than \(|\tau _0|\) (Fig. 3). Thus \(|\tau | \le |\tau _0| \le 2{OPT}+\frac{25}{63} \varDelta \) \(=\) \(2{OPT}+ \frac{25}{63} \left( \frac{1-\alpha }{\alpha } \right) \textit{OPT}+ \frac{25}{126} \sum _{k=1}^{m\alpha } n(k) (m - \frac{k}{\alpha }).\)
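The same bound can also be checked by direct substitution (our verification, equivalent to the geometric argument of Fig. 3):

```latex
|\tau| \;\le\; \|S_{\sigma}\| - \tfrac{38}{63}\left(\|S_{\sigma}\| - 2\,\mathit{OPT}\right)
\;=\; \tfrac{25}{63}\,\|S_{\sigma}\| + \tfrac{76}{63}\,\mathit{OPT}
\;\le\; \tfrac{25}{63}\,(2\,\mathit{OPT} + \varDelta) + \tfrac{76}{63}\,\mathit{OPT}
\;=\; 2\,\mathit{OPT} + \tfrac{25}{63}\,\varDelta.
```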

Fig. 3.

Compressing \(S_{\sigma }\) using the \(\frac{38}{63}\) algorithm [17].

An important point is that \(OPT \ge n\): since no string is a substring of another, the n strings must start at pairwise distinct positions in any superstring. Combined with Lemma 4, this leads to:

Theorem 1

$$\frac{|\tau |}{OPT} \le 2+ \frac{25}{63} \left( \frac{1-\alpha }{\alpha } \right) + \frac{\frac{25}{126} \sum _{k=1}^{m\alpha } n(k) (m - \frac{k}{\alpha })}{OPT} \le 2+ \frac{25}{63} \left( \frac{1-\alpha }{\alpha } \right) + \frac{25}{126 n} \sum _{k=1}^{m\alpha } n(k) (m - \frac{k}{\alpha })$$

Proof

The approximation ratio of \(|\tau |\) with respect to OPT is computed by dividing \(|\tau |\) by OPT. As \(OPT \ge n\), \(\frac{\frac{25}{126} \sum _{k=1}^{m\alpha } n(k) (m - \frac{k}{\alpha })}{OPT}\) \(\le \) \(\frac{25}{126 n} \sum _{k=1}^{m\alpha } n(k) {(m - \frac{k}{\alpha })}.\)

4 The Shortest Superstring Problem on NGS Reads

In the previous section we have shown that the classical SSP algorithm (Algorithm 1), when applied on strings of equal length, might give a better approximation factor than in the general case, depending on the period distribution of the sequences.

This approximation factor is parametrized by a parameter \(\alpha \), \(0 < \alpha \le 1\), which is inferred from the set of strings given as input. This is the novelty of our approach: the approximation factor is ad hoc, computed on the data through a linear-time procedure.

The parameter \(\alpha \) allows us to partition the set of strings into two subsets: strings with period \(\le m\alpha \), and the others, with period between \(m\alpha \) and m; thanks to Corollary 2, it also roughly partitions the set of cycles (see Sect. 3). The choice of \(\alpha \) thus determines the numbers of small and large cycles and, as seen above (Lemma 2), the lower the number of small cycles with respect to the number of large cycles, the better the approximation factor. In this section we describe a procedure that computes the optimal value of the parameter \(\alpha \) for a given dataset, and we apply this procedure to real NGS datasets composed of equal-length reads. We observe that for a large panel of these NGS datasets the optimal value of \(\alpha \) is relatively high (between 0.8 and 1), which means that the number of small cycles becomes negligible with respect to the number of reads.

As our approach is data-driven, we performed extensive tests (see Appendix, Table 5) on 100 sets of reads from different species, taking several parameters into account. For instance, in Fig. 4, we show four such sets of reads, with read lengths 32, 36, 98 and 200. The x-axis represents the periods, and the y-axis the number of reads on a \(\log_{10}\) scale. We plotted three types of curves:

  • n(x) (red circles) of the corresponding set of reads;

  • the values of n(x) for a random dataset (oblique dotted curve), experimentally generated with the same total number of sequences as the set of reads;

  • the parameter \(m \alpha \) (vertical dotted line) computed on the set of reads.

This value of \(m\alpha \) roughly corresponds to the area where the curve n(x) joins the random values. A natural approach would be to fit our experimental curves to theoretical random distributions under some probabilistic source assumption (Bernoulli or Markov, for instance). However, obtaining these theoretical distributions is an open problem [13, 14], which is out of the scope of this article.

Fig. 4.

Four sets of reads, from left to right and top to bottom: SRR069579 (human), ERR000009 (yeast), SRR211279 (human), SRR959239 (human). The x-axis is the period and the y-axis is the number of reads on a \(\log _{10}\) scale. The circles represent n(x) and the curve represents a distribution of periods on a random dataset. The dashed vertical line corresponds to the final \(m\alpha \) (computed in Sect. 5). (Color figure online)

For a set of equal-length strings, the computation of the value of \(\alpha \) consists in (a) computing the minimal period of every input sequence, (b) computing n(i) for all \(1 \le i \le m\), and (c) computing, for each \(1 \le i \le m\), the parameters \(\alpha = i/m\), \(\gamma \) \(=\) \(\frac{25}{126 n} \sum _{k=1}^{m\alpha } n(k) (m - \frac{k}{\alpha })\) and \(\beta \) \(=\) \(2 + \frac{25}{63} \left( \frac{1-\alpha }{\alpha } \right) + \gamma \). Finally, we keep the lowest \(\beta \) as the approximation ratio if it is below \(2 \frac{11}{30}\approx 2.3667\). Given a string of size m over a fixed alphabet \(\varSigma \), its smallest period can be computed in O(m) worst-case time, or even in optimal sublinear expected time \(O(\sqrt{m\log _{|\varSigma |}m})\) using a more involved algorithm [6]. Performing n such computations followed by a bucket sort with m buckets solves steps (a) and (b) in \(O(n\sqrt{m\log _{|\varSigma |}m})\) expected time. Step (c) is then computed in O(n) time.
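Steps (b) and (c) can be sketched as follows (the function name `best_ratio` is ours; running sums of n(k) and k·n(k) make step (c) cost constant time per value of i):

```python
def best_ratio(periods, m):
    """Minimize the bound of Theorem 1 over alpha = i/m, i = 1..m.

    `periods` lists the smallest period of each read (all reads of length m).
    Returns the best (beta, alpha) pair."""
    n = len(periods)
    cnt = [0] * (m + 1)        # step (b): n(i), via bucket counting
    for p in periods:
        cnt[p] += 1
    best_beta, best_alpha = float("inf"), None
    s1 = s2 = 0                # s1 = sum of n(k), s2 = sum of k*n(k), k <= i
    for i in range(1, m + 1):  # step (c)
        s1 += cnt[i]
        s2 += i * cnt[i]
        alpha = i / m
        # gamma = 25/(126 n) * sum n(k) (m - k/alpha) = 25/(126 n) (m*s1 - s2/alpha)
        gamma = 25 / (126 * n) * (m * s1 - s2 / alpha)
        beta = 2 + 25 / 63 * (1 - alpha) / alpha + gamma
        if beta < best_beta:
            best_beta, best_alpha = beta, alpha
    return best_beta, best_alpha
```

For instance, if every read has period m (no read can enter a small cycle at any threshold), the minimum is reached at \(\alpha = 1\) with \(\beta = 2\), the lower bound of the approach.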

Table 1. SRR069579 read set, 3702308 reads of size 36, \(\gamma \) \(=\) \(\frac{25}{126 n} \sum _{k=1}^{m\alpha } n(k) (m - \frac{k}{\alpha })\) and \(\beta = 2 + \frac{25}{63} \left( \frac{1-\alpha }{\alpha } \right) + \gamma .\)
Table 2. ERR000009 read set, 4049489 reads of size 32
Table 3. SRR211279 read set, 25103766 reads of size 200
Table 4. SRR959239 read set, 4143243 reads of size 98

5 Experimental Results

We first present and explain experimental results for the sets of reads SRR069579 (Table 1), ERR000009 (Table 2), SRR211279 (Table 3), and SRR959239 (Table 4).

In each table, for each period i from 1 to m, we show: (a) n(i); (b) the cumulative number of sequences; (c) the value of \(\alpha \), corresponding to i / m; (d) the value of \(1+\frac{1}{\alpha }\); (e) \(2 + \frac{25}{63} \left( \frac{1-\alpha }{\alpha } \right) \), which corresponds to the term of Theorem 1 due to the large cycles; (f) \(\gamma \), the part of the final ratio brought by the small cycles; and finally (g) \(\beta \), the final ratio that can be reached by using the value of \(\alpha \) from the previous line in the table.

The resulting approximation ratios on the four read sets above are respectively 2.0475, 2.0618, 2.0109 and 2.0163, all much lower than 2.3667, and obtained with a much simpler algorithm.

We then tested our approach on 100 sets of reads from different organisms (see Appendix, Table 5), computing the ratios with the previous method. Only one set (ERR1041316, Tupaia belangeri) is above the currently best bound of 2.3667; all the others are lower, and many are very close to 2.0, the lower bound of our approach.