
An alphabet is a finite non-empty set whose elements are called letters . A string is a sequence of letters. Given two strings s i and s j , the second is a substring of the first if s i contains consecutive letters that match s j exactly; in this case, we say that s i is a superstring of s j . The Shortest (common) Superstring Problem (SSP) is a combinatorial optimization problem that consists in finding a shortest string which contains as substrings all of the strings in a given set. The strings of the set may overlap inside the superstring, exploiting their common data.

1 Applications

The SSP has several important applications in various scientific domains, which is the reason why it has attracted the interest of many researchers. In computational molecular biology, the DNA sequencing procedure via fragment assembly can be formulated as SSP. In virology, the SSP models the compression of viral genomes. In information technology, the SSP can be used to achieve data compression. In scheduling, SSP solutions can be used to schedule operations on machines with coordinated starting times. In the field of data structures, efficient storage can be achieved in specific cases using the solutions of the SSP.

1.1 DNA Sequencing

The DNA molecule encodes the genetic information used in the development and functioning of living beings. DNA is a double-stranded sequence of four types of nucleotides: adenine, cytosine, guanine and thymine, and thereby it can be viewed as a string over the alphabet {a, c, g, t}. In the field of molecular biology, the DNA sequencing procedure determines the sequence of a DNA molecule, that is, the precise order of the nucleotides within it. DNA sequencing greatly accelerates biological and medical research.

Due to laboratory equipment constraints, only parts of DNA up to a few hundred nucleotides long can be read reliably, while the DNA molecules of many species are much longer. To recognize a long DNA sequence, many copies of the DNA molecule are made and cut into smaller overlapping pieces, named fragments , that can be read at once. Each fragment is chosen from an unknown location of the molecule. To reconstruct the initial DNA molecule, these fragments must be reassembled in their initial order, a procedure known as the DNA assembly problem . Due to the huge amount of data generated by the fragment sequencing methods, an automated procedure supported by computer software is necessary for the assembly process. Intuitively, shortest superstrings of the sequenced fragments preserve important biological structures [33, 46, 53], and in practice they have proved to be good representations of the original DNA molecule [27, 34]. Therefore, the SSP can be considered as an abstraction of the assembly problem, and consequently many researchers have developed assembly methods based on it [18, 45, 51]. The most widely used of them, shotgun sequencing , is essentially the natural greedy algorithm for the SSP. Similar assembly problems arise during the reconstruction of RNA molecules or proteins from sequenced fragments.

1.2 Data Compression

In the fields of computer science, information technology and data transmission, a crucial issue is the size of the stored or transferred data. Data compression is the process of encoding data using fewer bits than their original representation. According to whether the original data can be reconstructed exactly from the compressed data or not, we distinguish between lossless compression and lossy compression , respectively (see [50]).

Considering data as text over an alphabet, an intuitive method of lossless compression is based on the idea of dividing the text into strings and representing it by a superstring of these strings with pointers to their original positions. Based on this principle, several macro schemes concerning the nature of the pointers are considered in [55, 56], leading to general applications of the SSP in the field of textual substitution. In programming languages, each alphanumerical string in the code may be represented as a pointer to a common string stored in the memory. Therefore, the goal of the compiler is to arrange the alphanumerical strings in such a way that they overlap as much as possible [15, 37]. Other general applications of the SSP to data compression are discussed in [14, 54].

1.3 Modelling the Viral Genome Compression

Viruses are forced to reduce their genome size by environmental factors such as the need for quick replication and the small amount of nucleic acid that can be incorporated in them. One way to compress their genome is by overlapping their genes. Genes are the parts of the DNA that specify all proteins in living beings. Between genes there are generally long sequences of nucleotides that are not coded into proteins. On the other hand, overlapping genes are common in viruses. Therefore, in most virus species, two or more proteins are coded by the same nucleotide sequence, allowing viruses to increase their repertoire of proteins without increasing their genome length, as indicated in [10].

In [24, 25], the SSP is used to model the viral genome compression. The genes are considered as strings and the purpose is to find a shortest superstring that contains them all. The computational results show that the amount of compression achieved by the viruses in the real world is equal or very close to that obtained by the algorithms in all the examples considered in [24, 25]. Another conclusion from these computations is that the average compression ratio of viruses is remarkably high, considering the fact that DNA molecules are in general very difficult to compress. Finally, by modelling the viral genome compression as SSP, any exact solution or lower bound of the corresponding SSP instance provides a bound on the real size of a viral genome with a given set of genes.

1.4 Scheduling with Coordinated Starting Times

The Flow Shop Problem (FSP) and the Open Shop Problem (OSP) concern the scheduling of operations on machines and have particular applications in scheduling and planning of experiments. Given a set of k machines M 1, M 2, …, M k , the problem is to schedule a set of jobs on them, where each job consists of k operations and the i-th operation has to be assigned to machine M i . A machine can process at most one operation at a time, and no two operations of a job can be processed simultaneously. In the FSP, for each job the operation on M i has to be finished before the operation on M i+1 can start, whereas in the OSP there is no such constraint. In the no-wait versions of these problems, the operations of a job are required to be processed directly one after the other. The optional constraint of coordinated starting times requires that an operation may start on one machine only when each of the other machines is either idle or also starting an operation. In all these cases, the task is to find a schedule such that the overall processing time is minimized.

The FSP and the OSP on two machines, and their corresponding no-wait versions, are polynomially solvable in general, but this is not always true when the machines have to coordinate the starting times of operations. In [39], this additional constraint is considered. Each instance of the no-wait version of these problems under the additional constraint of coordinated starting times can be transformed into an SSP instance where all strings are of a special form. The NP-completeness of these problem versions is proved using this transformation. Apart from the computational complexity, this transformation can be applied for solving the constrained FSP and OSP: any exact or heuristic algorithm for the SSP can be applied to these problems too, and this special case of the SSP can be used to derive approximation algorithms for the constrained shop problems.

1.5 Data Structure Storage

In [15], a special case of the SSP is considered, where all the strings are of length at most two. It is proved that this version of the SSP is solvable in polynomial time. This SSP case has applications to the storage of data structures, and specifically to the Huffman trees [23] that encode pairs of letters, which are used in an entropy encoding algorithm for lossless data compression, as well as to the efficient representation of directed graphs in memory.

2 Definitions and Notations

Let \(\mathbb{N}\) be the set of natural numbers including 0. All the numbers in this chapter are natural, unless otherwise stated. For a real x, \(\lceil x\rceil \) denotes the smallest integer greater than or equal to x. For a letter l, the notation l ∈ Σ means that l belongs to alphabet Σ, while for a string s, if all letters of s belong to Σ, we say that s is over the alphabet Σ. If s is a string, then \(\vert s\vert\) denotes its length, that is the number of its letters, while if S is a set, then \(\vert S\vert\) denotes its cardinality. For a string s and \(i,j \in \mathbb{N}\) such that 1 ≤ i ≤ j ≤ | s | , the substring of s from the i-th to the j-th letter is denoted by s [i, j]. Any substring s [1, j] is a prefix of s, and if j <  | s | , then it is called a proper prefix . Similarly, any substring s [i, | s | ] is a suffix of s, and if i > 1, then it is called a proper suffix .

The placement of two or more strings one next to the other denotes their concatenation , e.g. s i s j is the concatenation of s i and s j . A coverage string between strings s i and s j , in this specific order, is a string v such that s i  = uv and s j  = vw, for some non-empty strings u, w. In other words, v is a string that is a proper suffix of s i and a proper prefix of s j ; note that v may be empty, in which case the corresponding join string defined next is simply the concatenation s i s j . The length of the coverage string is called coverage between the corresponding strings and is a non-negative integer. A join string of s i and s j is the concatenation of these two strings with a coverage string appearing only once, that is uvw. We use \(J_{\{s_{i},s_{j}\}}\) to denote the set of all join strings of s i and s j regardless of their order.

The overlap string between s i and s j is their longest coverage string, and is denoted by o(s i , s j ). Its length | o(s i , s j ) | is called overlap . The overlap of a string with itself is called self-overlap , and notice that it is not limited to half the total string length. The merge string of s i and s j is the concatenation of these two strings with the overlap string appearing only once, that is the shortest join string between them. It is denoted by m(s i , s j ). We have \(\vert m(s_{i},s_{j})\vert = \vert s_{i}\vert + \vert s_{j}\vert -\vert o(s_{i},s_{j})\vert \). The length of the prefix of s i before the overlap string with s j is called distance from s i to s j and is denoted by d(s i , s j ).

Example 1.

Suppose that we have the strings s 1 = bbacb and s 2 = bcabbcabb over the alphabet {a, b, c}, so | s 1 |  = 5 and | s 2 |  = 9. The one-letter string b is a proper suffix of s 1 and a proper prefix of s 2. Moreover, it is the longest such string, and thus the overlap string between them, o(s 1, s 2) = b, with overlap 1. The coverage strings between s 2 and s 1 are b and bb, and so o(s 2, s 1) = bb with overlap 2. The self-overlap of the first string is | o(s 1, s 1) |  = 1, while | o(s 2, s 2) |  = 5. The corresponding merge strings are m(s 1, s 2) = bbacbcabbcabb, m(s 2, s 1) = bcabbcabbacb, m(s 1, s 1) = bbacbbacb, and m(s 2, s 2) = bcabbcabbcabb. The distance from s 1 to s 2 is d(s 1, s 2) = 4, while d(s 2, s 1) = 7, d(s 1, s 1) = 4, and d(s 2, s 2) = 4. Finally, the set of all join strings is \(J_{\{s_{1},s_{2}\}} =\{ bbacbbcabbcabb,bcabbcabbbbacb,bbacbcabbcabb,bcabbcabbbacb,bcabbcabbacb\}\).
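These definitions translate directly into code. The following minimal Python sketch (the helper names overlap, merge, and distance are ours, not from the literature) computes the overlap, merge string, and distance of two strings by testing every candidate coverage length; the assertions reproduce the values of Example 1:

```python
def overlap(a: str, b: str) -> int:
    """Length |o(a, b)| of the longest proper suffix of a
    that is also a proper prefix of b."""
    for k in range(min(len(a), len(b)) - 1, 0, -1):
        if a[-k:] == b[:k]:
            return k
    return 0

def merge(a: str, b: str) -> str:
    """Merge string m(a, b): a and b concatenated, with the
    overlap string appearing only once."""
    return a + b[overlap(a, b):]

def distance(a: str, b: str) -> int:
    """Distance d(a, b): length of the prefix of a preceding o(a, b)."""
    return len(a) - overlap(a, b)

s1, s2 = "bbacb", "bcabbcabb"
assert overlap(s1, s2) == 1 and overlap(s2, s1) == 2
assert overlap(s2, s2) == 5                # self-overlap of s2
assert merge(s1, s2) == "bbacbcabbcabb"
assert merge(s2, s2) == "bcabbcabbcabb"
assert distance(s2, s1) == 7
```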

Given a finite set S of strings over an alphabet Σ, the sum of lengths of the strings in S is defined as \(\vert \vert S\vert \vert =\sum _{s\in S}\vert s\vert \). The orbit size of a letter l ∈ Σ is the number of its occurrences in the strings of S.

An instance of the SSP is specified by a finite set S = { s 1, s 2, …, s n } of strings. A string s is a superstring of S, if it is a superstring of all s i  ∈ S. A multiset is a generalization of the notion of the set where elements are allowed to appear more than once. Without loss of generality, S is defined to be a set, since if S is a multiset, then S has exactly the same superstrings as the set {s: s ∈ S}. Also, it is assumed that S is a substring-free set , i.e., no string s i  ∈ S is a substring of any other string s j  ∈ S. This assumption can also be made without loss of generality, since for any set of strings there exists a unique substring-free set that has the same superstrings, obtained by removing every string that is a substring of another.

Given a set \(S =\{ s_{1},s_{2},\ldots,s_{n}\}\) of strings over an alphabet Σ, the SSP is the problem of finding a minimum length superstring of S. Note that such a string may not be unique. The length of a shortest superstring of S is denoted by opt l (S), while the corresponding achieved compression is defined as \(\mathrm{opt}_{c}(S) = \vert \vert S\vert \vert -\mathrm{ opt}_{l}(S)\). The decision version of the SSP is described as follows. Given a set S of strings and a \(k \in \mathbb{N}\), is there a superstring s of S such that | s |  = k?

Example 2.

Suppose that we have the multiset S′ = { s 1, s 2, s 3, s 4, s 5, s 6} of strings over the alphabet {a, b, c}, where s 1 = bababbc, s 2 = bbccaac, s 3 = bbcaaabb, s 4 = acabb, s 5 = bcaaab, and s 6 = acabb. The corresponding substring-free set is S = { s 1, s 2, s 3, s 4} with | S |  = 4 and | | S | |  = 27. The orbit size of the letter a in S is 9, of the letter b is 12, and of the letter c is 6. These are the two shortest superstrings of S: s = bababbccaacabbcaaabb and s′ = bababbcaaabbccaacabb, with \(\mathrm{opt}_{l}(S) = \vert s\vert = \vert s^{\prime}\vert = 20\) and opt c (S) = 7.

Let I n be the finite set {1, 2, …, n}, and Π n be the set of all permutations of the set I n . Any solution for the SSP of n strings can be represented as a permutation p ∈ Π n , indicating the order in which strings must be merged to get the superstring. It is implied here that shortest superstrings are derived only by string merges: if this were not the case, either there would be parts of the superstring that do not correspond to any string, or some consecutive strings would not exploit their longest coverage string and could be joined by a larger coverage; in both cases there would be a shorter superstring. The elements of a permutation p ∈ Π n are denoted by p(i), i ∈ I n , where i indicates the order of each element in p such that p = (p(1), p(2), …, p(n)).

Given an order of strings (s 1, s 2, …, s n ) the superstring \(s =\langle s_{1},\ldots,s_{n}\rangle\) is defined to be the string m(s 1, m(s 2, … m(s n−1, s n ))). In such an order, the first string s 1 is denoted by first(s) and the last string s n is denoted by last(s). Notice that s is the shortest string such that s 1, s 2, …, s n appear in this order as substrings.

For a set \(S =\{ s_{1},s_{2},\ldots,s_{n}\}\) of strings and a permutation p ∈ Π n , the corresponding superstring is defined as \(strSp(S,p) =\langle s_{p(1)},s_{p(2)},\ldots,s_{p(n)}\rangle\). For any SSP instance \(S =\{ s_{1},s_{2},\ldots,s_{n}\}\), there exists a permutation p ∈ Π n such that strSp(S, p) is an optimal solution. For any p ∈ Π n , the length of the superstring strSp(S, p) is given by \(\vert strSp(S,p)\vert =\sum \nolimits_{ i=1}^{n}\vert s_{i}\vert -\sum\nolimits_{i=1}^{n-1}\vert o(s_{p(i)},s_{p(i+1)})\vert \). Therefore, the SSP can be formulated as

$$\displaystyle{ \min _{p\in \varPi _{n}}\sum _{i=1}^{n}\vert s_{ i}\vert -\sum _{i=1}^{n-1}\vert o(s_{ p(i)},s_{p(i+1)})\vert. }$$
(1)

The shortest superstrings that correspond to the permutation p of the optimal solution have length equal to opt l (S). A superstring of the minimum length is achieved when the sum of the overlaps between consecutive strings, in the order defined by p, is maximized.
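As a small illustration of formula (1), the following Python sketch (the function names are ours) computes |strSp(S, p)| for a given permutation; the two calls use the set S of Example 2, and the second permutation is the one examined in Example 3 below:

```python
def overlap(a, b):
    # longest proper suffix of a that is also a proper prefix of b
    return next((k for k in range(min(len(a), len(b)) - 1, 0, -1)
                 if a[-k:] == b[:k]), 0)

def strsp_length(strings, p):
    """|strSp(S, p)| by formula (1): total length minus the
    overlaps of consecutive strings in the order given by p."""
    total = sum(len(s) for s in strings)
    gained = sum(overlap(strings[p[i]], strings[p[i + 1]])
                 for i in range(len(p) - 1))
    return total - gained

S = ["bababbc", "bbccaac", "bbcaaabb", "acabb"]   # Example 2, 0-indexed
print(strsp_length(S, [0, 1, 3, 2]))   # order (s1,s2,s4,s3): 20, optimal
print(strsp_length(S, [2, 3, 1, 0]))   # order (s3,s4,s2,s1): 25, cf. Example 3
```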

There are two ways to assess the solution quality of a non-exact algorithm for the SSP: the length measure and the overlap or compression measure . According to the first measure, a superstring is better when its length is shorter. In this case, the SSP is described as a minimization problem. According to the second measure, a superstring is better when the achieved compression is greater. In this case, the problem is described as a maximization problem. The two measures are equivalent when applied to exact solutions, but they give different results when they measure the relative quality of non-exact solutions obtained by approximation or heuristic algorithms. A good algorithm with respect to one of the above measures is not necessarily a good algorithm with respect to the other measure.

Example 3.

For the substring-free set S = { s 1, s 2, s 3, s 4} of Example 2 and the two shortest superstrings of it, \(s =\langle s_{1},s_{2},s_{4},s_{3}\rangle\) and \(s^{\prime} =\langle s_{1},s_{3},s_{2},s_{4}\rangle\), we have \(\mathrm{first}(s) =\mathrm{ first}(s^{\prime}) = s_{1}\), last(s) = s 3, and last(s′) = s 4.

Let s″ = strSp(S, p) for the permutation p = (3, 4, 2, 1), which is a superstring of length | s″ |  = 25. According to the length measure the solution s″ is \((\vert s^{\prime\prime}\vert -\mathrm{ opt}_{l}(S))/\mathrm{opt}_{l}(S) = 25\,\%\) far from the optimal length, while according to the compression measure it is \((\mathrm{opt}_{c}(S) - (\vert \vert S\vert \vert -\vert s^{\prime\prime}\vert ))/\mathrm{opt}_{c}(S) = 71.4\,\%\) far from the optimal compression.

A directed graph G, denoted by G = (V, E), is defined by a vertex set V (G) and an arc set E(G) which contains ordered pairs of vertices. For an arc e = (u, v), u is called the tail of e, and v the head of e. We say that e is incident to both vertices, while for v the arc e is an incoming arc , and for u it is an outgoing arc . An arc with the same tail and head is called a loop . For a vertex v ∈ V, the number of incoming arcs of v is denoted by \(\deg ^{-}(v)\), and the number of outgoing arcs of v is denoted by \(\deg ^{+}(v)\). The overall number of the arcs incident to a vertex v regardless of their direction is the degree of v. The degree of a graph is the maximum degree among its vertices. Graph G is complete if there is an arc (u, v) for any vertex pair u, v ∈ V, u ≠ v. For a weight function \(w: E \rightarrow \mathbb{N}\), we denote by G = (V, E, w) a weighted directed graph. When there is no confusion we denote by w ij the weight of arc e = (v i , v j ) ∈ E. If the elements of set E have no direction, then they are called edges and the corresponding graph is called undirected . An undirected graph is called bipartite if its vertex set can be partitioned into two subsets, V 1 and V 2, such that every edge is incident to a vertex of V 1 and to a vertex of V 2. If the arc set contains ordered tuples instead of pairs of vertices then we have a multigraph .

Given a set S = { s 1, s 2, …, s n } of strings, the complete directed weighted graph G = (V, E, w) with

  • vertex set V = { s 1, s 2, …, s n },

  • arc set E = { (s i , s j ): s i , s j  ∈ V, i ≠ j}, and

  • weight function \(w: E \rightarrow \mathbb{N}\), with w ij  =  | o(s i , s j ) | ,

is called the overlap graph of S and is denoted by G o (S). If the arc weight function depends on the distance instead of the overlap between the string pairs, that is w ij  = d(s i , s j ), then the corresponding graph is called the distance graph of S, and is denoted by G d (S). Notice that all weights on both graphs are non-negative integers. In the following, it is assumed that the overlap and distance graphs have no loops, unless otherwise stated. For any set A ⊆ E of arcs on both graphs, we denote by o(A) the sum of weights of the arcs on G o (S), that is their total overlap, and by d(A) the sum of weights of the arcs on G d (S), that is their total distance. For each arc e = (s, s′) on both graphs, we have

$$\displaystyle{ \vert s\vert = o(\{e\}) + d(\{e\}). }$$
(2)

Example 4.

For the substring-free set S = { s 1, s 2, s 3, s 4} of Example 2 the associated overlap and distance graphs are depicted in the next figure.

For the arc set A = { (s 1, s 2), (s 3, s 4)}, we have o(A) = 3 and d(A) = 12.

A walk on a directed graph is a sequence of arcs where the head of each arc except the last one is the tail of the next arc. A walk can be specified either by its vertices or its arcs in the order of appearance in it. A path on a directed graph is a walk with no repeating vertices. On an undirected graph, a path is a sequence of consecutive edges that connect no repeating vertices. A walk is called Eulerian if it contains all the arcs of the graph, while a path is called Hamiltonian if it contains all the vertices of the graph. A cycle is a path where the first and the last vertices are the same. A cycle with k arcs is called a k-cycle. For a string set S and the associated overlap and distance graphs, consider a cycle c on them, let s ∈ S be a string corresponding to a vertex of c, and let s′ be the unique previous string of s in c. The superstring \(\langle s,\ldots,s^{\prime}\rangle\) where the strings are in the order around c is called the cycle superstring of c with respect to s and is denoted by strC(c, s). The superstring \(\langle s,\ldots,s^{\prime},s\rangle\) where the strings are in the order around c is called the extended cycle superstring of c with respect to s and is denoted by strC +(c, s). Notice that \(strC^{+}(c,s) = m(strC(c,s),s)\).

Example 5.

For the substring-free set S = { s 1, s 2, s 3, s 4} of Example 2 and the associated overlap and distance graphs presented in Example 4, consider the cycle c = (s 1, s 2, s 4, s 1). We have strC(c, s 1) = bababbccaacabb, and \(strC^{+}(c,s_{1}) = bababbccaacabbababbc\).
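A short Python sketch (our own helper names, with the merge chain implemented exactly as the right-nested definition given earlier in this section) that reproduces the cycle superstrings of Example 5:

```python
def overlap(a, b):
    # longest proper suffix of a that is also a proper prefix of b
    return next((k for k in range(min(len(a), len(b)) - 1, 0, -1)
                 if a[-k:] == b[:k]), 0)

def merge(a, b):
    return a + b[overlap(a, b):]           # m(a, b)

def chain(order):
    """<s_1, ..., s_n> = m(s_1, m(s_2, ... m(s_{n-1}, s_n)))."""
    s = order[-1]
    for t in reversed(order[:-1]):
        s = merge(t, s)
    return s

def str_c(cycle, s):
    """strC(c, s) and strC+(c, s) for the representative s of cycle c."""
    i = cycle.index(s)
    order = cycle[i:] + cycle[:i]          # strings in order around c
    base = chain(order)
    return base, merge(base, s)            # strC+(c, s) = m(strC(c, s), s)

s1, s2, s4 = "bababbc", "bbccaac", "acabb"
print(str_c([s1, s2, s4], s1))
# ('bababbccaacabb', 'bababbccaacabbababbc'), as in Example 5
```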

Some combinatorial optimization problems are closely related to the SSP due to their nature and are used in the establishment of many results of the SSP. A matching on a directed graph is a set of arcs, no two of which are incident to the same vertex. A maximum matching on a weighted graph is a matching with the largest total weight, while the Matching Problem (MP) looks for a maximum matching on a weighted directed graph. The MP is defined similarly on undirected weighted graphs. A directed matching is a set of arcs, no two of which have the same tail or the same head. In other words, it is a set of disjoint paths and cycles on a graph. The Directed Matching Problem (DMP) looks for a maximum directed matching. Both MP and DMP can be solved in polynomial time (see, e.g., [42, 59]). A cycle cover on a directed graph is a set of cycles such that each vertex of the graph is in exactly one cycle. The Cycle Cover Problem (CCP) on a weighted directed graph consists in finding a cycle cover with maximum total weight. The CCP is solvable in polynomial time by reduction to the MP on bipartite graphs (see, e.g., [42]).
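The reduction of the CCP to a matching problem is easy to sketch in code: on a complete directed graph with loops, a cycle cover is exactly an assignment of a successor to every vertex, i.e., a permutation, whose cycles form the cover. The following sketch (ours; it assumes the weights are given as a matrix and uses SciPy's assignment routine) finds a maximum-weight cycle cover:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def max_cycle_cover(w):
    """Maximum-weight cycle cover of a complete digraph with loops.
    w[i][j] is the weight of arc (i, j); the diagonal holds the loops.
    Solving the assignment problem picks a successor sigma(i) for every
    vertex, and the cycles of the permutation sigma form the cover."""
    rows, cols = linear_sum_assignment(np.asarray(w), maximize=True)
    succ = dict(zip(rows.tolist(), cols.tolist()))
    cycles, seen = [], set()
    for v in succ:
        if v not in seen:
            cycle, u = [], v
            while u not in seen:
                seen.add(u)
                cycle.append(u)
                u = succ[u]
            cycles.append(cycle)
    return cycles, sum(w[i][succ[i]] for i in succ)

# e.g. for an overlap graph, w[i][j] = |o(s_i, s_j)| with self-overlaps
# on the diagonal; minimum covers on G_d(S) follow with maximize=False.
```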

The Hamiltonian Path Problem (HPP) on a weighted directed graph consists in finding an optimal Hamiltonian path according to its total weight. If the objective is to minimize the total weight, then the Min-HPP is considered, while if the objective is to maximize the total weight, then the Max-HPP is considered. The decision HPP on a directed graph G asks for the existence of a Hamiltonian path on G. Similarly, we have the maximization and the minimization Hamiltonian Cycle Problem , which are also known as Traveling Salesman Problems (Min-TSP and Max-TSP). Both HPP and TSP are NP-hard problems [30]. There is a simple relation between these problems. The HPP on a graph G can be transformed to the TSP on a graph G′ obtained from G by adding a new vertex u and zero-weighted arcs from u to each vertex of G and from each vertex of G to u.

3 Computational Complexity

The results described in this section concern the computational complexity of the SSP, and justify the fact that there are only a few exact algorithms, but so many approximation algorithms for it. The SSP cannot be solved to optimality in polynomial time, unless P = NP. It can be approximated within a constant ratio, but there is a lower bound on the constant ratios that are achievable.

3.1 Complexity of Exact Solution

Given a string set S and a string s, there is a polynomial time algorithm for checking if s is a superstring of S, and therefore the decision SSP belongs to the class NP.

A string is primitive if no letter appears more than once in it. The next theorem establishes the NP-completeness of the decision SSP.

Theorem 1 ([15]).

The decision SSP is NP-complete. Furthermore, this problem is NP-complete even if for any integer m ≥ 3 the restriction is made that all strings in set S are primitive and of length m.

The proof is based on a polynomial time transformation from the decision HPP on directed graphs with the following additional restrictions:

  • there is a designated start vertex s with \(\deg ^{-}(s) = 0\) and a designated end vertex t with \(\deg ^{+}(t) = 0,\)

  • for each v ≠ t, we have deg+(v) > 1.

A set S of specific strings of length 3 is constructed, where each string corresponds to a vertex of a directed graph G = (V, E) that satisfies the above restrictions. Graph G has a Hamiltonian path if and only if set S has a superstring of length 2 | E | + 3 | V | . Therefore, there is no efficient algorithm for solving the SSP, unless P = NP.

Due to the nature of the SSP, several parameters can be considered fixed in order to define restricted cases of the problem. Besides the length of the strings and the primitiveness that were mentioned previously, the cardinality of the alphabet, the orbit size of the letters, and the form of the strings were also examined with respect to whether NP-completeness is preserved.

The decision SSP remains NP-complete when it is restricted to an alphabet of cardinality 2 as proved in [15]. A restricted version of the SSP concerning both the alphabet cardinality and the string length is also studied and the result is stated in the next theorem. Let bits(n) denote the number of bits that are necessary to represent n in binary, for any \(n \in \mathbb{N}\).

Theorem 2 ([15]).

The decision SSP is NP-complete even if for any real h > 1, the strings in set S are written over the alphabet {0,1} and have length \(\lceil h\;bits(\vert \vert S\vert \vert )\rceil \).

The proof is based on Theorem 1 and on the encoding of each letter of the initial alphabet with letters of the alphabet {0, 1}, such that the new encoding yields no relative changes to the overlaps between the strings.

In [38, 39], NP-completeness results are proved for some special cases of the decision SSP. For a set S of strings over an alphabet Σ, these complexity results can be briefly presented as follows. The decision SSP is NP-complete even if

  • all strings in S are of length 3 and the maximum orbit size of each letter in Σ is 8.

  • all strings in S are of length 4 and the maximum orbit size of each letter in Σ is 6.

  • Σ = { 0, 1} and each string in S is of the form \(0^{p}10^{q}10^{r}1\) or \(10^{p}10^{q}10^{r}\), where \(p,q,r \in \mathbb{N}\).

  • Σ = { 0, 1} and all strings in S are of the form \(10^{p}10^{q}\), where \(p,q \in \mathbb{N}\).

  • Σ = { 0, 1, 2} and each string contains a fixed number of each letter.

3.2 Complexity of Approximation

Since the SSP is hard to solve to optimality, a huge amount of effort has been devoted to the development of approximation algorithms. The theoretical framework for the complexity of this aspect establishes that although the SSP is easy to approximate within some constant ratio, it is hard to approximate within every constant ratio. The notion of the linear reduction (L-reduction) is necessary for what follows.

Definition 1 ([43]).

Let A and B be two optimization problems. Problem A L-reduces to B if there are two polynomial time algorithms F and G and real constants α, β > 0 such that

  • given an instance a of A, algorithm F produces an instance b = F(a) of B such that opt(b) is at most α × opt(a), where opt(a) and opt(b) are the costs of the optimal solution of instances a and b respectively, and

  • given any solution of b with cost c′, algorithm G produces in polynomial time a solution of a with cost c such that \(\vert c -\mathrm{ opt}(\mathsf{a})\vert \leq \beta \vert c^{\prime} -\mathrm{ opt}(\mathsf{b})\vert \).

For two optimization problems A and B and the constants α and β of Definition 1, the following theorem establishes the basic usage of the L-reduction.

Theorem 3 ([43]).

If problem A L-reduces to problem B and there is a polynomial time approximation algorithm for B with worst-case error \(\varepsilon\), then there is a polynomial time approximation algorithm for A with worst-case error \(\alpha \beta \varepsilon\).

Therefore, if problem B has a polynomial time approximation scheme (PTAS), then so does problem A.

The class Max-SNP is a class of optimization problems defined syntactically in [43]. Every problem in Max-SNP can be approximated in polynomial time within some constant ratio. A problem is Max-SNP-hard if any other problem in Max-SNP L-reduces to it.

Theorem 4 ([8]).

The SSP is Max-SNP-hard.

The proof is based on an L-reduction from the Min-HPP where the degree of the associated directed graph is bounded and all the weights are either 1 or 2, a problem which is Max-SNP-hard [44]. The reduction from this problem to the SSP is similar to the one used to show the NP-completeness of the decision SSP in Theorem 1, with the additional verification that it is an L-reduction. The strings that are considered for the above L-reduction have bounded lengths, and so the same reduction can be applied to the maximization version of the superstring problem with respect to the compression measure, leading to the same hardness result.

Corollary 1 ([8]).

Maximizing the total compression of a string set is Max-SNP-hard.

In [5], it is proved that if a Max-SNP-hard problem has a PTAS, then P = NP. Therefore, there is no PTAS for the SSP, unless P = NP, which means that there exists an \(\varepsilon > 0\) such that it is NP-hard to approximate the SSP within a ratio of \(1+\varepsilon\).

The L-reduction described in [8] for the proof of the Max-SNP-hardness of the SSP produces instances with arbitrarily large alphabets. More precisely, each instance of the special Min-HPP with n vertices is transformed to an SSP instance over an alphabet with 2n + 1 letters. However, the SSP is APX-hard even if the alphabet contains just two letters as stated in the next theorem.

Theorem 5 ([41]).

The SSP is APX-hard both with respect to the length measure and the compression measure, even if the alphabet has cardinality 2 and every string is of the form \(10^{m}1^{n}01^{m}0^{n+4}10\) or \(01^{m}0^{n}10^{p}1^{q}01^{m}0^{n}10^{r}1^{s}01\), where m,n,p,q,r,s ≥ 2.

4 Polynomially Solvable Cases

Since the SSP is NP-hard, special cases of the problem that can be solved in polynomial time constitute an interesting aspect. Various additional restrictions on the problem’s parameters, similar to those described in Sect. 3, lead to polynomial algorithms, revealing the boundaries between hard and easily solvable cases of the problem.

Obviously, if the cardinality of the alphabet is equal to 1 or all the strings in the given set are of length 1, then the SSP is trivial. Also, if the number of the strings in the set is fixed, then the SSP is polynomially solvable by enumerating all the different string orders. However, there are more interesting and complicated polynomial cases of the SSP.

Since Theorem 1 establishes the NP-completeness of the SSP for string lengths greater than 2, the question is what happens in the remaining cases. The answer is given by the next theorem.

Theorem 6 ([37]).

For a string set S ={ s 1 ,s 2 ,…,s n } and an integer k, if |s i |≤ 2,i ∈ I n , then there is a linear time and space algorithm to decide if S has a superstring of length k.

A path decomposition of a directed graph G is a partition of E(G) into arc-disjoint paths. Such a decomposition is minimum if it contains the minimum number of paths. The linear algorithm in Theorem 6 is based on a minimum path decomposition of a graph associated with the string set S. Besides the algorithm for the decision problem mentioned in Theorem 6, there is also a linear algorithm that finds a shortest superstring for strings of length at most 2.

A fixed maximum orbit size for the letters in the alphabet leads to a special case of the SSP that is also solvable in polynomial time. Assume a set S of strings over alphabet Σ and let m = max{ | s | : s ∈ S}.

Theorem 7 ([61]).

If the orbit size of each letter in Σ is at most 2 in S, then a shortest superstring for S is found in polynomial time \(O(\vert \Sigma \vert ^{2}m)\).

Another special case of the SSP concerns the fixed difference between the sum of string lengths and the cardinality of the alphabet as cited in [61]. Given a set S of strings over an alphabet Σ, for a fixed difference | | S | | − | Σ | , the SSP is solvable in polynomial time by a special exhaustive enumeration. The difference | | S | | − | Σ | is mentioned as a measure of dissimilarity of the strings in S.

In [38], restricted cases of the SSP are studied, and a string form that induces polynomial cases is found.

Theorem 8 ([38]).

The SSP over the alphabet {0,1} is polynomial time solvable if each given string contains at most one 1.

As cited in [61], a particular case of the SSP in which S is the set of all three-letter strings over an alphabet Σ is known as the Code Lock Problem . In this case, the possible overlaps between the strings are 1 and 2. This problem is reducible to the Eulerian Walk Problem , where the existence of a walk that contains all the arcs of a directed graph is sought, and hence, according to [16] it is solvable in polynomial time.
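To illustrate the reduction, the following Python sketch (our own construction, not the algorithm of [16]) builds the graph whose vertices are the two-letter strings and whose arcs are the given three-letter strings, and extracts an Eulerian circuit with Hierholzer's algorithm; for the full set of three-letter strings every vertex has equal in- and out-degree, so a circuit exists:

```python
from itertools import product

def code_lock_superstring(sigma):
    """Shortest superstring of all three-letter strings over sigma:
    an Eulerian walk on the graph with two-letter vertices, where
    the arc for string uvw goes from vertex uv to vertex vw."""
    adj = {"".join(p): [] for p in product(sigma, repeat=2)}
    for t in product(sigma, repeat=3):
        s = "".join(t)
        adj[s[:2]].append(s[1:])
    # Hierholzer's algorithm for an Eulerian circuit
    stack, circuit = [sigma[0] * 2], []
    while stack:
        v = stack[-1]
        if adj[v]:
            stack.append(adj[v].pop())
        else:
            circuit.append(stack.pop())
    circuit.reverse()
    # consecutive vertices overlap in one letter: append last letters
    return circuit[0] + "".join(v[1] for v in circuit[1:])

s = code_lock_superstring("ab")
assert len(s) == 2 ** 3 + 2                       # |Sigma|^3 + 2 letters
assert all("".join(t) in s for t in product("ab", repeat=3))
```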

5 Exact Solutions

There are only a few exact algorithms in the literature for the SSP. This is due to the computational complexity of the problem, and to the lack of necessity for optimal solutions in its main applications in computational molecular biology. In the DNA sequencing practice, the biological properties of a genome molecule can usually be expressed also by a superstring of its fragments that is not the shortest one, but whose length is close to the optimum.

5.1 Exhaustive Enumeration

The SSP can be trivially solved by exhaustive enumeration of all possible arrangements of the strings: the merge of the strings in some of these orders corresponds to a shortest superstring. Given a set S of n strings, the examination of the superstrings of S that correspond to all permutations in Π n is enough to find a shortest one. The exhaustive examination of all permutations can be executed in time \(O(n!\,\vert \vert S\vert \vert )\), or by a different implementation that also exhaustively enumerates the possible solutions, in time \(O(n\vert \vert S\vert \vert ^{n+1})\) as mentioned in [61]. Optimal solutions for small SSP instances obtained by the exhaustive algorithm are used in [24, 25] to compare the compression achieved by the viruses on their genomes with the largest possible compression of their genes.
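A brute-force Python sketch of this permutation-based enumeration (practical only for very small n, since the running time grows with n!):

```python
from itertools import permutations

def overlap(a, b):
    # longest proper suffix of a that is also a proper prefix of b
    return next((k for k in range(min(len(a), len(b)) - 1, 0, -1)
                 if a[-k:] == b[:k]), 0)

def shortest_superstring(strings):
    """Exhaustive enumeration over all n! string orders."""
    best = None
    for p in permutations(strings):
        s = p[0]
        for t in p[1:]:
            s = s + t[overlap(s, t):]     # merge the next string
        if best is None or len(s) < len(best):
            best = s
    return best

S = ["bababbc", "bbccaac", "bbcaaabb", "acabb"]   # set S of Example 2
print(shortest_superstring(S))   # 'bababbccaacabbcaaabb', length 20
```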

5.2 Integer Programming Formulation

Given an SSP instance specified by a set S of n strings, consider the associated overlap graph G o (S). An optimal solution to the SSP instance can be obtained by an optimal solution to the Max-HPP on G o (S), since a maximum Hamiltonian path contains all the vertices (strings) ordered in a single path such that it has the maximum total overlap. Due to the relation between the HPP and the TSP described in Sect. 2, these solutions can be obtained by an optimal solution to the Max-TSP. According to these transformations, optimal solutions for the SSP can be derived by any integer programming formulation for the Max-TSP using branch and bound or cutting plane algorithms. In [17], a benchmark set of instances with known optimal solutions was constructed using the integer program of [40] for the Max-TSP and used to compare the solutions of a heuristic for the SSP with the optimal ones.

6 Approximation Algorithms

The fact that the SSP is Max-SNP-hard motivates many researchers to develop approximation algorithms for it. As mentioned in Sect. 2, there are two ways to assess the solution of an approximation algorithm: the length measure, considering the SSP as a minimization problem, and the compression measure, considering the SSP as a maximization problem.

For a string set S, and any algorithm Alg for the SSP, we use the notation Alg l (S) to denote the length of the superstring of S obtained by Alg, and Alg c (S) to denote the corresponding achieved compression. An approximation ratio \(\varepsilon = \frac{ALG_{l}(S)} {\mathrm{opt}_{l}(S)} \geq 1\) with respect to the length measure means that Alg \(_{l}(S) \leq \varepsilon \times \mathrm{opt}_{l}(S)\) for all instances, while an approximation ratio \(\varepsilon = \frac{ALG_{c}(S)} {\mathrm{opt}_{c}(S)} \leq 1\) with respect to the compression measure means that Alg \(_{c}(S) \geq \varepsilon \times \mathrm{opt}_{c}(S)\) for all instances. Although the two measures are equivalent regarding the optimal solution, they differ regarding the approximate solutions of the problem. The existence of an algorithm with a constant approximation ratio for one measure provides in general no approximation performance guarantee for the other measure.

In this section, the approximation algorithms for the SSP both with respect to the length and the compression measure are presented, revealing the special features of the superstrings in each case.

6.1 Approximation of Compression

The compression measure counts the number of letters gained in comparison with the simple concatenation of all strings. Algorithms that approximate this gain are presented here.

6.1.1 The Natural Greedy Algorithm

A very well known, simple to implement, and widely used algorithm for the SSP is the natural greedy algorithm. It is routinely used in DNA sequencing practice. It starts with the string set S and repeatedly merges a pair of distinct strings with the maximum possible overlap until only one string remains in S.
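A minimal Python rendering of this strategy (ours, since the original pseudo-code listing is not reproduced here; ties between pairs of equal overlap are broken arbitrarily, cf. Example 6 below):

```python
def overlap(a, b):
    # longest proper suffix of a that is also a proper prefix of b
    return next((k for k in range(min(len(a), len(b)) - 1, 0, -1)
                 if a[-k:] == b[:k]), 0)

def greedy_superstring(strings):
    """Natural greedy: repeatedly merge a pair of distinct strings
    with maximum overlap until a single string remains."""
    S = list(strings)
    while len(S) > 1:
        _, i, j = max((overlap(S[a], S[b]), a, b)
                      for a in range(len(S)) for b in range(len(S)) if a != b)
        merged = S[i] + S[j][overlap(S[i], S[j]):]      # m(S[i], S[j])
        S = [s for k, s in enumerate(S) if k not in (i, j)]
        S.append(merged)
    return S[0]

print(greedy_superstring(["bababbc", "bbccaac", "bbcaaabb", "acabb"]))
# 'bababbcaaabbccaacabb': on the set S of Example 2, this particular
# tie-breaking happens to return an optimal superstring (length 20)
```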

The operation of the Greedy algorithm on the string set S is equivalent to the creation of a Hamiltonian path on the overlap graph G o (S). In general directed weighted graphs, the total weight of the Hamiltonian path obtained by the greedy approach is at least one third the weight of a maximum path [26]. In the case of the overlap graphs, a stronger result can be obtained by exploiting their properties. A basic lemma that concerns the form of these graphs is restated here in terms of strings.

Lemma 1 ([58]).

Let s 1 ,s 2 ,s 3, and s 4 be strings, not necessarily distinct, such that |o(s 3 ,s 2 )|≥|o(s 1 ,s 2 )| and |o(s 3 ,s 2 )|≥|o(s 3 ,s 4 )|. Then \(\vert o(s_{1},s_{4})\vert \geq \vert o(s_{1},s_{2})\vert + \vert o(s_{3},s_{4})\vert -\vert o(s_{3},s_{2})\vert \).

The proof can be derived directly from the next figure, where the alignment of the four strings according to their overlaps is presented.

Notice that, if s 1 and s 4 are not distinct, then the result of Lemma 1 concerns the self-overlap of the string s 1.

The following theorem establishes the approximation performance of the Greedy algorithm based on the corresponding analysis of the greedy approach for the Max-HPP and on Lemma 1.

Theorem 9 ([58]).

For a string set, the compression achieved by the Greedy algorithm is at least half the compression achieved by a shortest superstring.

The next example, presented in [58], shows that the result of Theorem 9 is the best possible.

Example 6.

For the string set \(\{ab^{k},b^{k+1},b^{k}a\}\), k ≥ 1, over the alphabet {a, b}, Greedy may produce the superstring \(ab^{k}ab^{k+1}\) or the superstring \(b^{k+1}ab^{k}a\) that achieves compression k, whereas the shortest superstring is \(ab^{k+1}a\) and achieves compression 2k. Notice that Greedy can also give the shortest superstring depending on how it breaks ties.

6.1.2 Approximation Based on Matchings

Apart from Greedy, two other \(\frac{1} {2}\)-approximation algorithms for the compression of a superstring based on the MP and the DMP are presented in [62]. For an SSP instance S, consider the associated overlap graph G o (S). In both algorithms, a matching algorithm is repeatedly applied to G o (S), to produce a Hamiltonian path.

For the description of the first algorithm the notion of arc contraction is necessary. Given a weighted directed graph G and an arc e = (u, v) ∈ E(G), the contraction of e is denoted by G∕e and gives a new graph obtained from G where the vertices u, v and their incident arcs are replaced by a new vertex w which has as incoming arcs the incoming arcs of u, and as outgoing arcs the outgoing arcs of v, with the same weights as on G. The Match algorithm initially finds a maximum matching on G o (S), and then contracts the arcs of the matching. This process is repeated on the new graph until a graph with no arcs remains. G o (S) = (V, E, w) is the initial overlap graph which remains unchanged, whereas G denotes the graph obtained in each iteration after the arc contractions. Initially, G = G o (S). Let maxm(G) be a maximum matching on graph G.

The approximation performance of the Match algorithm is based on the observation that any matching on an overlap graph can be extended to a Hamiltonian path on it, since overlap graphs are complete. Moreover, a maximum matching has total weight at least half the weight of a maximum Hamiltonian path. This can easily be shown by considering each Hamiltonian path as two matchings with distinct arcs, constructed by taking alternate arcs from the path. These results imply that the compression achieved by the Match algorithm is at least half the optimal compression.

The second algorithm with the same approximation ratio for the SSP is based on the slightly different DMP. Recall that a directed matching on a graph is a set of disjoint paths and cycles. For the description of this algorithm, the notion of arc contraction is extended naturally to paths. Given a weighted directed graph G and a path p = (v 1, v 2, …, v r ) on it, defined by its vertices, the contraction of p gives a new graph obtained from G where the vertices v 1, …, v r and their incident arcs are replaced by a new vertex w which has as incoming arcs the incoming arcs of v 1, and as outgoing arcs the outgoing arcs of v r , with the same weights as on G. The DiMatch algorithm, described also in [62], operates exactly as Match except that it finds a directed matching at each step, opens each cycle of it by deleting an arc with the smallest weight, and finally contracts the paths into vertices. The compression achieved by the DiMatch algorithm is at least half the optimal compression.

6.1.3 Approximation Based on the TSP

Any approximation algorithm for the Max-TSP is also an approximation algorithm for the SSP with respect to the compression measure, or equivalently for the Max-HPP, with the same ratio, due to the transformation from the TSP to the HPP. For both problems, it is implied that they are asymmetric , which means that they are applied to directed graphs, and that the weight of an arc (u, v) is not necessarily equal to the weight of the arc (v, u).

In [7, 31], two approximation algorithms for the Max-TSP are presented. In both cases, procedures with complementary worst cases run on directed graphs with an even number of vertices. The best result among them is a Hamiltonian cycle whose weight is at least \(\frac{38} {63}\) times the weight of a maximum weight Hamiltonian cycle for the first algorithm, and \(\frac{8} {13}\) for the second. Both algorithms achieve their approximation performance without utilizing any special structure of the strings, but both require that the complete input graph G has an even number of vertices. In general, for an SSP instance of n strings, the above algorithms achieve approximation ratios \(\frac{38} {63}(1 - \frac{1} {n})\) and \(\frac{8} {13}(1 - \frac{1} {n})\), respectively. Finally, an approximation algorithm for the Max-TSP is also designed in [29], achieving the best ratio so far, namely \(\frac{2} {3}\). It operates by decomposing a special form of directed multigraphs, where the elements of the arc set are ordered triples of the vertex set.

6.2 Approximation of Length

A plethora of approximation algorithms with respect to the length measure have been developed for the SSP using different variations of the greedy strategy. The best one among them finds a string whose length is at most \(2\frac{1} {2}\) times the length of the optimal string.

6.2.1 Naive Approximation Algorithm

A naive algorithm for the SSP is used in [8] as a baseline for comparing the relative performance of other approximation algorithms. Its approximation performance is not remarkable, but the idea is quite simple, showing that it is easy to develop an algorithm for the SSP, but not so easy to achieve a good approximation ratio. For a string set S, the algorithm arbitrarily chooses a string from S as the initial current string, and then repeatedly updates the current string by merging it with a remaining string from S that yields the maximum overlap. The performance of this algorithm highly depends on the random choice of the initial string, and it is possible to produce superstrings whose length grows quadratically in the optimal length.

6.2.2 Approximation Algorithm Used in a Learning Process

The first attempt to approximate the shortest superstring of a set was made in [34], where the DNA sequencing procedure is modelled as the process of learning a string from randomly drawn substrings of it. Under certain restrictions, this may be viewed as a string learning process in Valiant’s distribution-free learning model [63]. The efficiency of the learning method depends on the solution of an algorithm which approximates the length of a superstring, and seeks at each step an appropriate join string among the candidate ones.

Given a string set S = { s 1, s 2, …, s n } and a string s, we denote by subSs(S, s) the set of the strings in S that are substrings of s.

Example 7.

Suppose that we have the string set S = { s 1, s 2, s 3}, where s 1 = caabaa, s 2 = abaaca, and s 3 = baacaa are strings over the alphabet {a, b, c}. The set of all join strings of s 1 and s 2 regardless of their order is \(J_{\{s_{1},s_{2}\}} =\{ caabaaabaaca,caabaabaaca,caabaaca,abaacacaabaa,abaacaabaa\}\), while the only join string \(s \in J_{\{s_{1},s_{2}\}}\) for which subSs(S, s) = S is the string abaacaabaa.
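A small Python sketch (our helper names) that enumerates coverage strings, builds the join-string set \(J_{\{s_{i},s_{j}\}}\), and computes subSs, reproducing Example 7:

```python
def coverage_lengths(a, b):
    """Lengths of all coverage strings between a and b (0 = empty)."""
    return [k for k in range(min(len(a), len(b)))
            if k == 0 or a[-k:] == b[:k]]

def join_strings(a, b):
    """J_{a,b}: every join string of a and b, regardless of order."""
    return ({a + b[k:] for k in coverage_lengths(a, b)}
            | {b + a[k:] for k in coverage_lengths(b, a)})

def subss(S, s):
    """subSs(S, s): the strings of S that are substrings of s."""
    return {t for t in S if t in s}

S = {"caabaa", "abaaca", "baacaa"}                 # Example 7
J = join_strings("caabaa", "abaaca")
assert len(J) == 5                                 # the five join strings
print([s for s in J if subss(S, s) == S])          # ['abaacaabaa']
```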

Given a string set S, the Group-Combine algorithm constructs a superstring of S by an iterative process. The algorithm begins with a string set and combines the strings into groups such that all strings in a group are substrings of a join string of two of them, trying to form groups as large as possible.

The next theorem establishes the approximation ratio of the algorithm.

Theorem 10 ([34]).

Given a string set, if the length of the optimal superstring is m, then Group-Combine produces a superstring of length O(m log m).

6.2.3 4-Approximation Algorithms

The first approximation algorithm with a constant ratio for the length of a superstring is described in [8], settling a notorious open question about the existence of such an algorithm. The algorithm utilizes a minimum cycle cover on the distance graph of a string set to derive a superstring with preferable properties that bound its length. Given an SSP instance S, the Cycle-Concatenation algorithm finds a minimum cycle cover on the graph G d (S) with loops in polynomial time. Then it opens each cycle of the cover by removing an arc chosen randomly, constructs the superstring that corresponds to the obtained path, and concatenates these strings.
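A compact Python sketch of this procedure (ours; the minimum cycle cover is computed as an assignment problem with SciPy, and each cycle is opened at a fixed rather than random position):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def overlap(a, b):
    # longest proper suffix of a that is also a proper prefix of b
    return next((k for k in range(min(len(a), len(b)) - 1, 0, -1)
                 if a[-k:] == b[:k]), 0)

def cycle_concatenation(S):
    """Minimum cycle cover on G_d(S) with loops, open every cycle,
    and concatenate the resulting cycle superstrings."""
    n = len(S)
    d = np.array([[len(S[i]) - overlap(S[i], S[j]) for j in range(n)]
                  for i in range(n)])               # distance weights
    rows, cols = linear_sum_assignment(d)           # min-weight cover
    succ = dict(zip(rows.tolist(), cols.tolist()))
    parts, seen = [], set()
    for v in range(n):
        if v in seen:
            continue
        cycle, u = [], v                            # cycle opened at v
        while u not in seen:
            seen.add(u)
            cycle.append(u)
            u = succ[u]
        s = S[cycle[-1]]                            # merge along the path
        for i in reversed(cycle[:-1]):
            s = S[i] + s[overlap(S[i], s):]
        parts.append(s)
    return "".join(parts)
```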

The next theorem demonstrates the approximation performance of the algorithm, establishing the first constant approximation ratio for the SSP.

Theorem 11 ([8]).

For a string set S, Cycle-Concatenation produces a superstring of length at most 4 × opt l (S).

Another algorithm for the SSP with the same constant approximation ratio is MGreedy which is presented in [8].
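The original pseudo-code of MGreedy is not reproduced here; the following Python sketch is our reconstruction from the description in [8]. The key difference from Greedy is that the two chosen strings may coincide, and choosing a string together with itself closes the current path into a cycle, whose string is set aside in a set T:

```python
def overlap(a, b):
    # longest proper suffix of a that is also a proper prefix of b;
    # with a == b this is the self-overlap
    return next((k for k in range(min(len(a), len(b)) - 1, 0, -1)
                 if a[-k:] == b[:k]), 0)

def mgreedy(strings):
    """Sketch of MGreedy: pick the pair (not necessarily distinct)
    with maximum overlap; merge distinct pairs, move a string chosen
    with itself to T (its cycle is closed); concatenate T at the end."""
    S, T = list(strings), []
    while S:
        _, i, j = max((overlap(S[a], S[b]), a, b)
                      for a in range(len(S)) for b in range(len(S)))
        if i == j:                                  # cycle closed
            T.append(S.pop(i))
        else:                                       # merge m(S[i], S[j])
            merged = S[i] + S[j][overlap(S[i], S[j]):]
            S = [s for k, s in enumerate(S) if k not in (i, j)]
            S.append(merged)
    return "".join(T)
```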

Notice that the two strings of each chosen pair are not necessarily distinct, which allows self-overlaps. Since the choices are made according to the overlaps in S, MGreedy can be thought of as choosing arcs from the graph G o (S) with loops: the choice of the pair (s i , s j ) corresponds to the choice of the arc (last(s i ), first(s j )) on G o (S) in each step. Therefore, the algorithm constructs paths, and closes them into cycles when the distinctness condition is not satisfied. Thus, MGreedy ends up with a set of disjoint cycles that cover the vertices of G o (S), which is a cycle cover. The same cycle cover can be thought of on the graph G d (S) with loops. For a cycle cover C, by Eq. (2), we have \(o(C) + d(C) = \vert \vert S\vert \vert \), and so a cycle cover has minimum total weight on G d (S) if and only if it has maximum total weight on G o (S); in both cases it is called optimal.

Theorem 12 ([8]).

The cover created by MGreedy is an optimal cycle cover.

Notice that the presence of the loops is a necessary assumption for this result. Since MGreedy finds an optimal cycle cover, the superstring that is produced by it is no longer than the string produced by algorithm Cycle-Concatenation. Therefore, the approximation ratio with respect to the length measure of MGreedy for the SSP is also equal to 4. Actually, the superstring of MGreedy could be shorter than the one obtained by Cycle-Concatenation since MGreedy simulates the breaking of each cycle in the optimal position, that is between the strings with the minimum overlap in the cycle.

6.2.4 Greedy Is a \(3\frac{1} {2}\)-Approximation Algorithm

The Greedy algorithm has already been presented as an approximation for the compression. A notorious open question is how well Greedy approximates the length of a shortest superstring, while a common conjecture states that Greedy produces a superstring of length at most two times the length of the optimum [54, 58, 62]. In fact, Greedy may give a superstring almost twice as long as the optimal one, as shown in the next example from [8].

Example 8.

For the string set \(\{c(ab)^{k},(ba)^{k},(ab)^{k}c\}\), k ≥ 1, over the alphabet {a, b, c}, Greedy may produce the superstring \(c(ab)^{k}c(ba)^{k}\) or the superstring \((ba)^{k}c(ab)^{k}c\) of length 4k + 2, whereas the shortest superstring is \(c(ab)^{k+1}c\) of length 2k + 4.

In [8], it is proved that Greedy is a 4-approximation algorithm for the SSP. The next theorem improves this approximation ratio, based on a more careful analysis of specially formed strings.

Theorem 13 ([28]).

The Greedy algorithm is a \(3\frac{1} {2}\) -approximation algorithm with respect to the length measure.

6.2.5 A 3-Approximation Algorithm

The algorithm TGreedy described in [8] operates in the same way as MGreedy except that in the last step it merges the strings in set T by running Greedy on them, instead of simply concatenating them. The next theorem establishes its approximation performance.

Theorem 14 ([8]).

For a string set S, algorithm TGreedy produces a superstring of length at most 3 × opt l (S).

In [8], a relative performance comparison between the Greedy, MGreedy, and TGreedy algorithms is presented. TGreedy always produces better solutions than MGreedy, since in the last step it greedily merges the strings, whereas MGreedy just concatenates them. The approximation performance of TGreedy is better than that of Greedy, but the superiority of one of these algorithms over the other is not guaranteed, as shown in the next example.

Example 9.

For the string set \(\{c(ab)^{k},(ab)^{k+1}a,(ba)^{k}c\}\), k ≥ 1, over the alphabet {a, b, c}, Greedy produces the shortest superstring \(c(ab)^{k+1}ac\) of length 2k + 5, whereas TGreedy produces the superstring \(c(ab)^{k}ac(ab)^{k+1}a\) or the superstring \((ab)^{k+1}ac(ab)^{k}ac\) of length 4k + 6, since the initial maximum overlap is the self-overlap of the second string.

On the other hand, for the string set \(\{cab^{k},ab^{k}ab^{k}a,b^{k}dab^{k-1}\}\), k ≥ 1, over the alphabet {a, b, c, d}, TGreedy produces the shortest superstring \(cab^{k}dab^{k}ab^{k}a\) of length 3k + 6, since the initial maximum overlap is the self-overlap of the second string, whereas Greedy produces the superstring \(cab^{k}ab^{k}ab^{k}dab^{k-1}\) or the superstring \(b^{k}dab^{k-1}cab^{k}ab^{k}a\) of length 4k + 5.

6.2.6 Generic Approximation Based on Cycle Covers

Algorithms MGreedy and TGreedy implicitly construct optimal cycle covers on the associated overlap and distance graphs of a string set, while Cycle-Concatenation explicitly takes advantage of this construction. A generic algorithm that explains this basic idea is presented in [11].

For a string set S, let C = { c 1, c 2, …, c p } be a cycle cover on the graph G d (S). Suppose that an arbitrary string r i is picked from each cycle c i  ∈ C, and these strings form the representative set R = { r 1, r 2, …, r p }. Let \(r =\langle r_{1},r_{2},\ldots,r_{p}\rangle\) be a superstring of R. By replacing each r i , i ∈ I p , in r with the string strC +(c i , r i ), we get the string

$$\displaystyle{ \langle strC^{+}(c_{ 1},r_{1}),strC^{+}(c_{ 2},r_{2}),\ldots,strC^{+}(c_{ p},r_{p})\rangle, }$$

which is called the extension string of r with respect to C and is denoted by ext(r, C). Observe that ext(r, C) is a superstring of S.

For a string set S, the Generic-Cover algorithm constructs a minimum cycle cover C on the graph G d (S), and chooses a random string from each cycle of this cover to form a set R of representatives. Then, it finds a new minimum cycle cover on G d (R), opens each cycle of this cover at a random position, and concatenates the resulting cycle superstrings to create a superstring of R. Finally, it returns the extension string of this superstring with respect to the cycle cover C to obtain a superstring of S.
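A loose Python sketch of this pipeline (ours; the cycle covers are computed as assignment problems with SciPy, cycles are opened at fixed rather than random positions, and for simplicity the extended cycle superstrings are merged directly, which still yields a superstring of S):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def overlap(a, b):
    return next((k for k in range(min(len(a), len(b)) - 1, 0, -1)
                 if a[-k:] == b[:k]), 0)

def merge(a, b):
    return a + b[overlap(a, b):]               # m(a, b)

def min_cycle_cover(S):
    """Minimum cycle cover on G_d(S) with loops, as an assignment."""
    n = len(S)
    d = np.array([[len(S[i]) - overlap(S[i], S[j]) for j in range(n)]
                  for i in range(n)])
    _, cols = linear_sum_assignment(d)
    cycles, seen = [], set()
    for v in range(n):
        if v not in seen:
            cyc, u = [], v
            while u not in seen:
                seen.add(u)
                cyc.append(u)
                u = int(cols[u])
            cycles.append([S[i] for i in cyc])
    return cycles

def chain(order):
    """<s_1, ..., s_n>: right-nested merges, as in Sect. 2."""
    s = order[-1]
    for t in reversed(order[:-1]):
        s = merge(t, s)
    return s

def generic_cover(S):
    C = min_cycle_cover(S)                      # cover on G_d(S)
    reps = {c[0]: c for c in C}                 # first string represents c
    ext = {r: merge(chain(c), r) for r, c in reps.items()}  # strC+(c, r)
    CR = min_cycle_cover(list(reps))            # cover on G_d(R)
    # open each cycle of CR and merge the extended cycle superstrings
    return "".join(chain([ext[r] for r in cyc]) for cyc in CR)
```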

The Generic-Cover algorithm has an approximation ratio equal to 3. It constitutes the basis for the design of better approximation algorithms for the SSP, as described below.

6.2.7 Handling 2-Cycles and 3-Cycles Separately

For an SSP instance specified by a string set S, opt c (S) may in general grow quadratically in opt l (S). Thus, to take advantage of a compression approximation in designing length approximation algorithms with constant ratio based on the Generic-Cover framework, a key is to construct suitable subproblems for which opt c (S) is linear in opt l (S). The main difficulty in determining such subproblems, and so in improving the length approximation performance of the Generic-Cover algorithm, appears in handling k-cycles with small k in the cycle cover C R . In [60], the compression achieved by Greedy is utilized to design a length approximation algorithm for the SSP. The algorithm is based on the scheme of Generic-Cover, handling separately the 2-cycles in the minimum cycle cover C R . In this way, the algorithm achieves an approximation ratio of \(2\frac{8} {9}\). In [11], an approximation algorithm that handles separately the 2-cycles and the 3-cycles is developed and gives a superstring of length at most \(2\frac{5} {6}\) times the length of a shortest superstring.

6.2.8 Approximation Algorithms Based on the TSP

As cited in [31], a relationship between the SSP and the Max-TSP with respect to their approximability is given by the following lemma.

Lemma 2 ([8]).

If the Max-TSP has a \((\frac{1} {2}+\varepsilon )\) -approximation, then the SSP has a \((3 - 2\varepsilon )\) -approximation with respect to the length measure.

Utilizing this relation, the approximation algorithms for the Max-TSP mentioned in Sect. 6.1.3 can be used to derive approximation ratios for the SSP with respect to the length measure. Consider an SSP instance S of n strings and the corresponding Max-TSP instance G o (S). For an even number of strings, the algorithm described in [31], which achieves an approximation ratio of \(\frac{38} {63}\) for the Max-TSP, gives a \(2\frac{50} {63}\)-approximation ratio for the SSP, while the algorithm described in [7], which achieves an approximation ratio of \(\frac{8} {13}\) for the Max-TSP, gives a \(2\frac{10} {13}\)-approximation ratio for the SSP. For an odd number of strings the algorithms achieve ratios of \(2(\frac{50} {63} + \frac{1} {n})\) and \(2(\frac{10} {13} + \frac{1} {n})\) for the SSP, respectively. The algorithm described in [29], which achieves an approximation ratio of \(\frac{2} {3}\) for the Max-TSP, gives a \(2\frac{2} {3}\)-approximation ratio for the SSP.

6.2.9 Exploiting the Superstring Structures

The approximation algorithms presented above are largely graph-theoretical, meaning that they exploit the structure of the overlap and distance graphs, but they do not take advantage of the structure inside the strings or, in general, of properties not evident in the graph representation. In this sense, they solve a more general problem than the one at hand.

An algorithm that captures a great deal of the structure of the SSP instances is presented in [3]. It takes advantage of the structure of strings with large overlaps, proving several key properties of such strings. It follows the framework of the Generic-Cover algorithm, using a more sophisticated way to choose the representatives at line 5 and to open each cycle at line 12. After finding a cycle cover on the associated distance graph, the key is to exploit the periodic structure of the cycle superstrings that arise. In this way, the algorithm bounds either the total overlap of the arcs rejected at line 12 or the total additional length of extending each cycle at line 15. The result is a superstring whose length is no more than \(2\frac{3} {4}\) times the length of an optimal superstring.

This algorithm and the Max-TSP-based \(2\frac{50} {63}\)-approximation algorithm mentioned in Sect. 6.2.8 have complementary worst cases, and so a better ratio can be achieved by their combination. When the worst case of the first algorithm occurs, the Max-TSP algorithm runs as a subroutine on the set of representatives to obtain a better result. Balancing the two algorithms, an approximation ratio of \(2\frac{50} {69}\) for the SSP can be achieved [2].

In [4], the study of these key properties is extended to strings that exhibit a more relaxed form of the periodic structure considered before. Algorithmically, the new approach is also based on the framework of the Generic-Cover algorithm and is a generalization of the previous one. On the other hand, the analysis is very different and involves a special structure of 2-cycles. Let c be a 2-cycle in the cycle cover C R of the Generic-Cover algorithm, consisting of the vertices s i and s j , which are the representatives of the cycles c i and c j in the cycle cover C. Without loss of generality assume that d(c i ) ≥ d(c j ). The cycle c is a g -HO2-cycle if

$$\displaystyle{ \min \{\vert o(s_{i},s_{j})\vert,\vert o(s_{j},s_{i})\vert \}\geq g(d(c_{i}) + d(c_{j})). }$$

In the new algorithm, during the selection of the representatives, a technique is used to anticipate the potential of each string to participate in a \(\frac{2} {3}\)-HO2-cycle. Such strings have a very specific structure, and if a cycle contains a string without this structure, that string is chosen as the representative. Otherwise, the knowledge of the structure of the entire cycle can be used to trade the amount of lost overlap against the additional length of extending the representative to include the rest of the cycle. In this way, a \(2\frac{2} {3}\)-approximation algorithm for the SSP is designed.
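Stated as code, the g-HO2-cycle condition is a one-line predicate on the two overlap lengths of the 2-cycle and the weights of the underlying cycles; the function name and the numeric example below are ours.

```python
def is_g_ho2_cycle(o_ij, o_ji, d_ci, d_cj, g):
    """A 2-cycle on representatives s_i, s_j of cycles c_i, c_j is a
    g-HO2-cycle if both overlap lengths reach g * (d(c_i) + d(c_j))."""
    return min(o_ij, o_ji) >= g * (d_ci + d_cj)

# The algorithm of [4] watches for 2/3-HO2-cycles:
print(is_g_ho2_cycle(o_ij=8, o_ji=7, d_ci=5, d_cj=4, g=2/3))  # True: 7 >= 6
```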

6.2.10 Rotations of Periodic Strings

Two approximation algorithms for the SSP that are also based on the inner structure of the strings and their periodic properties are presented in [9]. They use the same framework as the Generic-Cover algorithm, but they make use of new bounds on the overlap between two strings.

Both algorithms pay special attention to the selection of the representatives, but without concentrating on k-cycles with small k. Instead of choosing a string obtained by opening each cycle, the new idea is to look for superstrings of the strings in a cycle that are not too long and are guaranteed not to overlap each other by too much. Each chosen superstring does not even have to be one of the cycle superstrings obtained by opening the cycle. Given a cycle c = (s 1, s 2, …, s p , s 1) of the cycle cover C of the Generic-Cover algorithm, a string r c is a candidate to be a representative of c if for some j

  • r c is a superstring of strC(s j+1) and

  • r c is a substring of strC +(s j ).

A sophisticated procedure is used to choose the representatives such that they satisfy these two conditions and also have an appropriate property to lead to the improved ratio.

After this step, the two approximation algorithms diverge. The first algorithm, after finding the second cycle cover, opens each cycle and concatenates the cycle superstrings, achieving an approximation ratio of \(2\frac{2} {3}\). The second algorithm constructs a superstring of the representatives using as a subroutine an approximation algorithm for the SSP with respect to the compression measure. As subroutines, the approximation algorithms cited in Sect. 6.1.3 can be used. Using the \(\frac{38} {63}\)-approximation algorithm described in [31], a length ratio of \(2\frac{25} {42}\) is achieved, while using the \(\frac{8} {13}\)-approximation algorithm described in [7], a length ratio of \(2\frac{15} {26}\) is achieved.

6.2.11 \(2\frac{1} {2}\)-Approximation Algorithms

The best approximation ratio achieved so far for the SSP with respect to the length measure is \(2\frac{1} {2}\). It can be achieved by two different methods, one from the field of superstrings and the other from the field of the TSP.

The first algorithm is described in [57]. Given a string set S, the algorithm begins by constructing a minimum cycle cover C on graph G d (S). Then, instead of choosing representatives, it combines the cycles of C to produce a new cycle cover C′, and finally opens each cycle in C′ to produce a set of cycle superstrings. The concatenation of these superstrings yields a superstring of S. The algorithm exploits the properties of cycles and cycle covers on a special multigraph to achieve the \(2\frac{1} {2}\)-approximation ratio.

The second approach that achieves the same length ratio for the SSP is via an approximation algorithm for the Max-TSP described in [29]. It finds a Hamiltonian cycle whose weight is at least \(\frac{2} {3}\) of the weight of a maximum Hamiltonian cycle. Using this procedure as a subroutine in the algorithm cited in Sect. 6.2.10, a length ratio of \(2\frac{1} {2}\) for the SSP can be achieved.

7 Parallelizing the Solving Process

In complexity theory, the class NC consists of the decision problems (languages) decidable in polylogarithmic parallel time \(O(\log ^{O(1)}n)\) on a parallel computer with a polynomial number \(O(n^{O(1)})\) of processors. In this definition, a parallel random access machine (PRAM) is assumed, that is, a parallel computer with a central pool of memory, where any processor can access any bit of memory in constant time. The class RNC, which stands for random NC, extends NC with access to randomness. The class RNC consists of the decision problems (languages) solvable by a randomized algorithm in polylogarithmic parallel time on polynomially many processors, with probability of producing a correct solution at least \(\frac{1} {2}\).

It is conjectured that some tractable problems are inherently sequential and cannot be significantly sped up by using parallelism. For an algorithm, a common method to show that it is hard to parallelize is to prove that the problem of computing its output is P-complete. The Greedy algorithm belongs to this case, since the problem of finding the superstring chosen by the Greedy algorithm is P-complete [11]. This means that Greedy is difficult to parallelize effectively. In the following, parallel approximation algorithms for the SSP are presented.

7.1 NC Algorithm with Logarithmic Length Ratio

Given a ground set X of elements and a family Y of subsets of X, a set cover of X with respect to Y is a subfamily Y ′ ⊆ Y of sets whose union equals X. Assigning a weight w(x) to each element x ∈ X, the total weight of each set and family is naturally defined. The Set Cover Problem (SCP) is to find a set cover of the ground set of minimum weight. The SCP can be approximated within a logarithmic ratio by a parallelizable algorithm [6].
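For intuition, the classical sequential greedy algorithm for the weighted SCP also achieves a logarithmic ratio; a minimal sketch follows, with weights attached directly to the sets for simplicity (the parallelizable algorithm of [6] is more involved and is not reproduced here).

```python
def greedy_set_cover(universe, family, weight):
    """Weighted greedy set cover: repeatedly pick the set that minimizes
    weight per newly covered element (logarithmic approximation ratio)."""
    uncovered = set(universe)
    cover = []
    while uncovered:
        # Only sets that still cover something new are candidates.
        best = min((s for s in family if uncovered & set(s)),
                   key=lambda s: weight[s] / len(uncovered & set(s)))
        cover.append(best)
        uncovered -= set(best)
    return cover

# Toy instance: cover {1, 2, 3, 4} with weighted subsets.
family = [(1, 2), (2, 3), (3, 4), (1, 2, 3, 4)]
weight = {(1, 2): 1, (2, 3): 1, (3, 4): 1, (1, 2, 3, 4): 5}
print(greedy_set_cover({1, 2, 3, 4}, family, weight))  # [(1, 2), (3, 4)]
```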

In [11], an approach similar to the one presented in Sect. 6.2.2 for grouping is applied to the SCP. Given a set S of n strings, we define

$$\displaystyle{ F =\{ subSs(S,s): s \in J_{\{s_{i},s_{j}\}},s_{i},s_{j} \in S\}, }$$

that is, the family of the sets of substrings of all possible pairwise join strings from S. Considering S as the ground set and F as a family of its subsets, they specify an instance of the SCP. From each set cover C ⊆ F of S, a string s C can be constructed by merging the join strings that correspond to the sets of C. Observe that s C is a superstring of S. Let the weight of each set of F be the length of the corresponding join string, and w(C) be the total weight of the set cover C. Because of the merging of the join strings, | s C  | ≤ w(C). Also, it is proved that the length of the superstring corresponding to a minimum set cover C ∗ is at most twice the length of an optimal superstring, that is, \(\vert s_{C^{{\ast}}}\vert \leq 2 \times \mathrm{ opt}_{l}(S)\). These results, combined with the parallelization of the SCP, imply an NC algorithm with logarithmic approximation ratio for the SSP.

Theorem 15 ([11]).

For a string set S of n strings and any \(\varepsilon > 0\), there is an NC algorithm that finds a superstring whose length is at most \((2+\varepsilon )\log n\) times the length of a shortest superstring.

Observe that each group of strings selected by the Group-Combine algorithm is a set of the family F as described previously, and so this algorithm implicitly constructs a set cover of S with respect to F. Theorem 15 proves that this result can also be obtained by a parallelizable procedure, leaving as an open problem the design of an NC algorithm with a constant approximation ratio with respect to the length measure for the SSP.

7.2 RNC Algorithm with Constant Length Ratio

An RNC algorithm for the SSP is based on a parallelizable implementation of the sequential \(2\frac{5} {6}\)-approximation algorithm mentioned in Sect. 6.2.7. The only non-trivially parallelizable steps of this algorithm are the computations of the minimum cycle covers. Recall that the problem of finding an optimal cycle cover is equivalent to the problem of finding a maximum matching on a bipartite graph. In general, it is not known whether this can be done in either NC or RNC. However, when the weights of the graph are given in unary notation, a condition that can be satisfied in the case of this algorithm, a maximum matching can be found in RNC (see e.g. [49]), giving the next theorem for the SSP.

Theorem 16 ([11]).

For a string set S, there is an RNC algorithm that finds a superstring of length at most \(2\frac{5} {6} \times \mathrm{ opt}_{l}(S)\).

7.3 NC Algorithm with Compression Ratio \(\frac{1} {4+\varepsilon }\)

Given a weighted directed graph, a natural greedy approach for finding a maximum cycle cover is the following. Scan the arcs in non-increasing order of weight, and select an arc if it does not share a head or a tail with a previously selected arc. Repeat until the selected arcs form a cycle cover. This approach finds a cycle cover of weight at least half the weight of a maximum cycle cover [11]. As mentioned in Sect. 6.2.3, if the graph is an overlap graph with loops, then this greedy approach always finds a maximum weight cycle cover.
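A minimal sketch of this natural greedy procedure, assuming the input is a complete weighted digraph with loops given as a dictionary of arc weights (the representation and names are ours):

```python
def greedy_cycle_cover(weight):
    """Scan arcs in non-increasing weight order; keep an arc only if its
    tail has no outgoing and its head no incoming selected arc. On a
    complete digraph with loops this terminates with a full cycle cover."""
    out_used, in_used, cover = set(), set(), []
    for (u, v), w in sorted(weight.items(), key=lambda item: -item[1]):
        if u not in out_used and v not in in_used:
            cover.append((u, v))
            out_used.add(u)
            in_used.add(v)
    return cover

# Toy instance on two nodes, loops included.
w = {(0, 0): 1, (0, 1): 5, (1, 0): 4, (1, 1): 2}
print(greedy_cycle_cover(w))  # [(0, 1), (1, 0)] - a single 2-cycle
```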

For the development of an NC compression approximation algorithm with a constant ratio for the SSP, an algorithm slightly different from the natural greedy for the CCP is designed. This algorithm achieves a worse approximation ratio, but can be parallelized. It is based on the idea that the natural greedy algorithm would be only a little worse if in each step it chose, instead of the maximum weight arc, one with a similar weight. The arcs of the graph are partitioned into levels, such that the weights of all arcs in a level are within a constant factor of each other. Given a graph G and a real c > 1, an arc e ∈ E(G) has c-level equal to k if \(c^{k-1} < w(e) \leq c^{k}\), and c-level equal to 0 if w(e) ≤ 1. The algorithm operates like the natural greedy algorithm assuming that all arcs in each level have the same weight. The usage of this algorithm on overlap graphs for finding superstrings leads to the next theorem.

Theorem 17 ([11]).

For a set S of n strings, there is an NC algorithm for the SSP that achieves a compression ratio \(\frac{1} {4+\varepsilon }\) . It runs either in time \(O(\log ^{2}n\log _{1+\varepsilon }\vert \vert S\vert \vert )\) on a PRAM with ||S|| + n 4 processors or in time \(O(\log ^{3}n\log _{1+\varepsilon }\vert \vert S\vert \vert )\) on a PRAM using n 2 + ||S|| processors.
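The c-level of an arc follows directly from the definition above; a small helper of ours, using a loop rather than logarithms to avoid floating-point edge cases:

```python
def c_level(w, c):
    """c-level of an arc of weight w: 0 if w <= 1, otherwise the
    unique k with c**(k-1) < w <= c**k."""
    k, bound = 0, 1.0
    while w > bound:
        k += 1
        bound *= c
    return k

# With c = 2, weights 1, 3, 8 fall into levels 0, 2, 3.
print([c_level(w, 2.0) for w in (1, 3, 8)])
```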

8 Inapproximability Bounds

Both the minimization and the maximization versions of the superstring problem are Max-SNP-hard, which means that there exists an \(\varepsilon > 0\) such that it is NP-hard to approximate the SSP within a ratio of \(1+\varepsilon\) with respect to the length measure, or within a ratio of \(1-\varepsilon\) with respect to the compression measure. The practical side of this theoretical result is expressed by explicit bounds on the approximation ratio in both cases.

The first work in this direction appears in [41], where inapproximability bounds are given for a special case of the SSP. Specifically, the result concerns SSP instances where the alphabet is {0, 1} and every string is of the form \(10^{m}1^{n}01^{m}0^{n+4}10\) or \(01^{m}0^{n}10^{p}1^{q}01^{m}0^{n}10^{r}1^{s}01\), where m, n, p, q, r, s ≥ 2. This special case is also used in Theorem 5, which concerns APX-hardness results. Let us refer to this special case as SSP 2 for short. The next two theorems establish the inapproximability results.

Theorem 18 ([41]).

The SSP 2 is not approximable within \(1 \frac{1} {17245}\) with respect to the length measure, unless P = NP.

Theorem 19 ([41]).

For every \(\varepsilon > 0\), the SSP 2 is not approximable within \(1 \frac{1} {11216}-\varepsilon\) with respect to the compression measure, unless P = NP.
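For concreteness, strings of the first form above can be generated mechanically; a tiny generator of ours, for illustration only (it assumes the reconstruction of the exponents given before Theorem 18):

```python
def ssp2_first_form(m, n):
    """The string 1 0^m 1^n 0 1^m 0^(n+4) 1 0 over the alphabet {0, 1}."""
    assert m >= 2 and n >= 2
    return "1" + "0" * m + "1" * n + "0" + "1" * m + "0" * (n + 4) + "10"

print(ssp2_first_form(2, 2))  # 1001101100000010
```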

In [64], inapproximability bounds are given for the SSP restricted to instances with equal length strings. Moreover, these bounds are extended to instances over alphabets of cardinality 2, improving the previous ones.

Theorem 20 ([64]).

For any \(\varepsilon > 0\), unless P = NP, the SSP on instances with equal length strings is not approximable in polynomial time within ratio

  • \(1 \frac{1} {1216}-\varepsilon\) with respect to the length measure, and

  • \(\frac{1070} {1071}+\varepsilon\) with respect to the compression measure.

A very important result about the relation between the inapproximability of the SSP over an alphabet of cardinality 2 and over any alphabet is established in the next theorem. It implies that the alphabet cardinality does not affect the approximability of the SSP.

Theorem 21 ([64]).

Suppose that the SSP can be approximated by a ratio \(\varepsilon\) on instances over an alphabet of cardinality 2. Then the SSP can be approximated by a ratio \(\varepsilon\) on instances over any alphabet.

This result holds for both measures, length and compression. Therefore, the bounds established in Theorem 20 hold also for alphabets of cardinality 2.

The computation of the inapproximability bounds for the SSP reveals the large gap between them and the best known approximation ratios for the problem, both for the length measure and for the compression measure.

9 Heuristics

The design of approximation algorithms is oriented towards the achievement of a provable approximation ratio and not towards the best possible result. On the other hand, real-world applications usually need practically good results, not theoretically good ratios. A heuristic algorithm can satisfy this requirement by giving solutions to SSP instances that carry no approximation performance guarantee but are experimentally close to the optimum. The greedy strategies seem to perform much better than their proven approximation ratios, both on average and in real-world cases. In this section, heuristic algorithms for the SSP are described.

9.1 A Variant of the Natural Greedy

A problem with the Greedy algorithm is that it makes choices that may exclude good overlaps from future selection. In an attempt to eliminate this behaviour, a heuristic that imitates Greedy but chooses the string pair differently in each step is described in [58]. Here, the modification is given in terms of strings instead of arcs in the associated overlap graph, as done in the original work. The selection criterion in each step is not just the overlap but the overall influence of the choice of each string pair. Given a string set S and two strings s i and s j in it, let

$$\displaystyle{ \mathrm{oi}(s_{i},s_{j}) = a\vert o(s_{i},s_{j})\vert -\max \{\vert o(s_{i^{\prime}},s_{j})\vert: s_{i^{\prime}} \in S,i^{\prime}\neq i\} -\max \{\vert o(s_{i},s_{j^{\prime}})\vert: s_{j^{\prime}} \in S,j^{\prime}\neq j\}, }$$

where a is a parameter that tunes the method. The idea is to take into consideration also the overlaps that would be eliminated if the pair (s i , s j ) were selected. The pseudo-code of this heuristic algorithm is exactly that of Greedy, except that line 3 changes to k = max{oi(s i , s j ): (s i , s j ) ∈ L}. In the experiments with this heuristic algorithm cited in [58], the best results were obtained with values of the parameter a from 2 to 2.5. In this case, the modified algorithm gives superstrings whose average additional length over the optimum is about \(\frac{1} {5}\) of the corresponding average additional length of Greedy.
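A minimal sketch of this selection criterion, assuming a naive quadratic-time overlap routine (both function names are ours):

```python
def overlap(s, t):
    """Length of the longest proper suffix of s that is a prefix of t."""
    for k in range(min(len(s), len(t)) - 1, 0, -1):
        if s.endswith(t[:k]):
            return k
    return 0

def overlap_influence(strings, i, j, a=2.0):
    """oi(s_i, s_j): weigh the pair's own overlap against the best
    overlaps into s_j and out of s_i that its selection would discard."""
    lost_into_j = max((overlap(strings[k], strings[j])
                       for k in range(len(strings)) if k != i), default=0)
    lost_from_i = max((overlap(strings[i], strings[k])
                       for k in range(len(strings)) if k != j), default=0)
    return a * overlap(strings[i], strings[j]) - lost_into_j - lost_from_i
```

In each step, the pair maximizing this value is merged, exactly as Greedy merges the pair of maximum overlap.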

9.2 A Heuristic Parametrized by a Learning Process

A three-stage heuristic algorithm for the SSP, named Assembly, is presented in [20]. It is based on the observation that the set of the remaining strings in the Greedy algorithm after a number of merges is very likely to contain only string pairs with small overlaps. The Assembly algorithm, in an attempt to avoid mistakes, terminates the greedy strategy when false merges are expected to occur, a decision based on the number of remaining strings.

The first stage of the algorithm is similar to the Greedy algorithm, except that it terminates when the remaining string set has cardinality c. The second stage of the Assembly algorithm is also based on greedy choices, although these are made not among all possible overlaps, but only among those that pass a certification procedure. Given two strings s i and s j in the set of the remaining strings with | o(s i , s j ) |  > 0, a third string s k is a certificate if its overlap with both s i and s j is greater than 0. It is experimentally determined that for two strings s i and s j with | o(s i , s j ) |  > 0, the existence of a certificate increases the probability that their merge string participates in the shortest superstring. The second stage of the Assembly algorithm takes as input the output string set of the first stage, and utilizes the idea of certification to restrict the greedy choices to string pairs that are also certified. It terminates when the cardinality of the remaining string set becomes equal to a parameter b. The third stage of the Assembly algorithm is a restricted backtracking procedure. Its input is the output string set of the second stage. It excludes some solutions based on a learning process, and then performs an exhaustive search through the rest of the solution space. In this way, it tries to balance time efficiency against accuracy.

The Assembly algorithm is tested on domains of both random and real-world SSP instances. The first is produced by random string generators over specific distribution specifications, and the second is drawn from DNA sequence databases. A number of instances of each domain are used as input to a learning procedure to specify the parameters b, c and the excluded solutions of the third stage, and the rest are used to test the Assembly algorithm. Every version of the Assembly algorithm was tested both on the domain that was used for its training and on the other domain. The results show that Assembly performs significantly better when trained on the same domain it is tested on, whereas the randomly trained version has poor performance on the real-world instances. The version of the algorithm trained on a successfully sequenced DNA molecule achieves very high accuracy and effectiveness on instances of the same domain. This indicates that a successfully sequenced part of a DNA molecule can be used to significantly speed up the sequencing of the whole DNA molecule. The sequenced part can act as input to the learning procedure to determine suitable parameter values, and the whole molecule can then be obtained by the Assembly algorithm with high accuracy and a significant speed-up. The running time of the algorithm mainly depends on the running time of its third stage, which may be exponential. The tested instances suggest a sub-exponential growth of the search space for this stage, but experiments on larger SSP instances are needed to conjecture a polynomial growth.

9.3 Genetic Algorithm

Some heuristic algorithms are inspired by evolutionary processes in nature. Genetic algorithms [22] belong to this class of heuristics. They are search methods that simulate the evolutionary process of natural selection, and they are used in many scientific fields to solve optimization problems. In a genetic algorithm for an optimization problem, a population , that is, a collection of candidate solutions called individuals , is evolved to reach better solutions. The evolution happens in generations that reflect the alterations to the population. During each generation, the fitness of each individual in the population is evaluated proportionally to the suitability of its value for the objective function of the optimization problem. The most suitable individuals are selected to perpetuate their kind by recombining their genomes, i.e., their solutions, at specific points, possibly with random mutations. In this way, a new population is formed and the procedure is repeated for the next generation. Commonly, the algorithm terminates when either a maximum number of generations has been produced or a satisfactory fitness level has been reached.

A genetic algorithm for the SSP is described in [66]. The input of the algorithm is a set S of strings specifying the SSP instance. The genome of each individual in the population is represented as a collection of strings from S in a specific order, such that it is a candidate solution to the SSP instance. A crucial point of the algorithm is that an individual may not contain all the strings from S, or may contain duplicate copies of the same string. This choice means the output is not necessarily a permutation of the strings in S, giving the algorithm new potential. The algorithm was tested on SSP instances over an alphabet of cardinality 2 using specific values for the parameters of population size, number of generations, and mutation rates. The input instances were generated randomly following the DNA sequencing procedure. The experimental results show that when the number of strings is 50 the genetic algorithm performs better than Greedy, while its dominance is lost when the number of strings becomes 80.
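Whatever the evolutionary mechanics, an individual must be mapped to the superstring it encodes before it can be evaluated. A minimal sketch of such an evaluation for an ordered collection of strings follows (helper names are ours; the actual fitness in [66] must additionally handle individuals with missing or duplicated strings):

```python
def overlap(s, t):
    """Length of the longest proper suffix of s that is a prefix of t."""
    for k in range(min(len(s), len(t)) - 1, 0, -1):
        if s.endswith(t[:k]):
            return k
    return 0

def induced_superstring(order):
    """Merge an individual's strings left to right, absorbing overlaps;
    a shorter result means a fitter individual."""
    result = order[0]
    for s in order[1:]:
        result += s[overlap(result, s):]
    return result

print(induced_superstring(["abcd", "cdef", "defg"]))  # abcdefg
```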

9.4 Coevolutionary Algorithm

Coevolutionary algorithms also belong to the class of biologically inspired evolutionary procedures. They generalize the idea of genetic algorithms by involving individuals from more than one species. Coevolution in nature refers to the simultaneous evolution of two or more species with coupled fitness. There are two different kinds of coevolution: the competitive one, where the purpose is to obtain exclusivity over a limited resource, and the cooperative one, where the purpose is to gain access to some hard-to-attain resource. In cooperative coevolutionary algorithms there is a number of independently evolving species representing components of potential solutions, which together form complex structures to solve an optimization problem. Complete solutions are obtained by assembling representative members of each species. The fitness of each individual depends on the quality of the complete solutions it participates in. Therefore, the fitness function measures how well an individual cooperates with individuals from other species to solve the optimization problem.

A cooperative coevolutionary algorithm adjusted to the SSP is presented in [66]. It is based on populations of two species that evolve simultaneously. The first population contains prefixes of candidate solutions of the SSP instance, and the second population contains candidate suffixes. Each species population evolves separately, and the only interaction between the two populations is through the fitness function. Computational experiments similar to those for the genetic algorithm show that this algorithm performs at least as well as the genetic algorithm and requires less computation time, since the populations involved are smaller and the convergence is faster. Compared with Greedy, it reaches better solutions after a number of generations in experiments with both 50 and 80 input strings.

An attempt to combine the cooperative coevolutionary approach with natural greediness leads to the design of an improved method, which incorporates both parallelism and greed, as described in [66]. The method consists of three stages. In the first stage, three parallel and independent runs of the cooperative coevolutionary algorithm operate, returning as output the populations of prefixes and suffixes, instead of the merge string of the best representatives. The Greedy algorithm also runs, and its solution is split into a prefix and a suffix. In the second stage, two new collections of prefixes and suffixes are generated. The first contains the best \(\frac{1} {3}\) of the individuals of the prefix population of each cooperative coevolutionary run, plus the prefix of the greedy solution. The second is constructed similarly from the corresponding suffixes. In the third stage, the cooperative coevolutionary algorithm runs with the two collections constructed in the second stage as initial populations, instead of random populations. The experimental results show that this algorithm performs better than the simple cooperative coevolutionary algorithm, even though the cardinality of its populations and the number of its generations in each stage are considerably smaller.

9.5 Preserving Favoured Subsolutions

An extension of the genetic algorithm, motivated by the desire to address the failure of this algorithm in specific domains, is the Puzzle algorithm described in [67]. It is designed to improve the performance of the genetic algorithm on relative ordering problems, i.e., problems where the order between genes is crucial rather than their global locus in the genome. Mapping genes to strings and the genome to the superstring, the SSP is exactly a problem of this kind. The main idea behind the Puzzle algorithm is to preserve good subsolutions found by the genetic algorithm by carefully choosing the combination points between two solutions. In this way, it promotes the assembly of increasingly larger good building blocks from different individuals, a result that also explains the name of this algorithm.

Two different populations are evolved in the Puzzle algorithm: a population of solutions (s-population) and a population of building subsolutions (b-population). Accordingly, we have the s-individuals and the b-individuals. Notice that this situation is completely different from the one described for the cooperative coevolutionary algorithm, since here the two populations are not complementary components of a complete solution. The interaction between these two populations is performed differently in each direction. The fitness of a b-individual depends on the fitness of the s-individuals that contain it, while the choice of the combination points in s-individuals is affected by the b-individuals that contain these points.

The Puzzle algorithm was compared with the genetic algorithm, being its extension, and with Greedy. Experimental results with SSP instances over an alphabet of cardinality 2 show that the Puzzle algorithm outperforms both Greedy and the genetic algorithm, producing shorter superstrings on average. This result is obtained on instances with 50 and 80 strings. Compared with the cooperative coevolutionary algorithm, Puzzle is better on instances with 50 strings, whereas it is worse on instances with 80 strings.

In [67], two expansions of the Puzzle algorithm are discussed. The first one is a direct combination of Puzzle with cooperative coevolution. The two ideas, of complementary components in different populations and of solutions and subsolutions also in different populations, are combined to derive a new algorithm. During this algorithm four populations are evolved:

  1. population of prefixes,

  2. population of suffixes,

  3. population of building sub-prefixes, and

  4. population of building sub-suffixes,

where the interaction between 1 and 2 operates according to the cooperative coevolutionary algorithm, and the interactions between 1 and 3 and between 2 and 4 operate according to the Puzzle algorithm. The second expansion of Puzzle involves ideas from messy genetic algorithms [21]. These are iterative optimization algorithms that use local search techniques, adaptive representation of the genomes, and decision sampling strategies.

9.6 Discrete Neural Network

In computer science, neural networks are learning structures that simulate the function of biological neural networks, such as the one constituting the human brain. They are composed of artificial neurons and connections between them called synapses . Neural networks are used for solving artificial intelligence problems as well as combinatorial optimization problems.

A discrete neural network used for solving the SSP is described in [35]. The discreteness concerns the values that the neurons can handle. In general, the network is formed by n neurons, where the state of each neuron i ∈ I n is defined by its output v i . The vector V = (v 1, v 2, …, v n ) whose components are the corresponding neuron outputs is called the state vector . The energy of each state vector is given by the energy function of the network. The aim of the network is to minimize the energy function via its learning operation, which happens in iterations. The energy function usually coincides with the objective function of the optimization problem to be solved, such that a local minimum of the former is also a local, and possibly global, optimum of the latter. In the case of the SSP, given a string set S, any feasible vector of the neural network represents an ordering of the strings in S, utilizing the permutation expression of the SSP solutions. So, the feasible state vectors are those that correspond to permutations, and v i  = k means that string s k is placed in the i-th position in the superstring. Notice that there is a one-to-one correspondence between neurons and strings in S. In each learning iteration, the neural network searches different solutions using neuron updating schemes. Given a vector V = (v 1, v 2, …, v n ) corresponding to the current state, and two neurons i and j, 1 ≤ i < j ≤ n, the network considers updates to the following states:

  • \((v_{1},\ldots,v_{i},v_{i+1},\ldots,v_{j},v_{j+1},\ldots,v_{n}),\)

  • \((v_{1},\ldots,v_{i},v_{j+1},\ldots,v_{n},v_{i+1},\ldots,v_{j}),\)

  • \((v_{i+1},\ldots,v_{j},v_{1},\ldots,v_{i},v_{j+1},\ldots,v_{n}),\)

  • \((v_{i+1},\ldots,v_{j},v_{j+1},\ldots,v_{n},v_{1},\ldots,v_{i}),\)

  • \((v_{j+1},\ldots,v_{n},v_{1},\ldots,v_{i},v_{i+1},\ldots,v_{j})\), and

  • \((v_{j+1},\ldots,v_{n},v_{i+1},\ldots,v_{j},v_{1},\ldots,v_{i}),\)

which correspond to the orderings of the three parts into which the state vector is separated according to the two specified neurons. Among these candidate solutions, the one that decreases the energy function value the most is selected as the next network state. This procedure is repeated until convergence is detected, that is, until a state vector is found where the updates for all pairs of neurons do not cause any change. Due to the update scheme used, the network remains in a feasible state throughout all iterations. Once the network converges, the stable state represents a local minimum of the energy function, which is equivalent to a local maximum of the total overlap between the strings in S.
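The six candidate states are exactly the orderings of the three segments into which the two chosen neurons split the state vector; a minimal 0-indexed sketch (the function name is ours):

```python
from itertools import permutations

def candidate_states(v, i, j):
    """All 6 orderings of the segments v[:i], v[i:j], v[j:], for
    0 < i < j < len(v); the identity ordering is the current state."""
    segments = (v[:i], v[i:j], v[j:])
    return [a + b + c for a, b, c in permutations(segments)]

for state in candidate_states((1, 2, 3, 4, 5), 2, 4):
    print(state)
```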

Experiments are performed with SSP instances with strings of fixed and variable lengths. The neural network algorithm runs 100 times for each instance and its results are compared with those of Greedy. In the experiments with fixed string length, the neural network outperforms Greedy in most cases on average, and always with respect to best results. In the experiments with variable string lengths, the neural network outperforms Greedy both on average and on best results.

9.7 GRASP with Path Relinking

A Greedy Randomized Adaptive Search Procedure (GRASP) is an iterative meta-heuristic for combinatorial optimization, implemented as a multi-start procedure where each iteration consists of a construction phase and a local search phase. The first phase constructs a randomized greedy solution, while the second phase starts from this solution and applies repeated improvement until a locally optimal solution is found. The procedure continues until a termination condition is satisfied, such as reaching a maximum number of iterations. The best solution over all iterations is kept as the final result. GRASP has been found to produce good quality solutions for a wide variety of combinatorial optimization problems. A survey on GRASP can be found in [47], and an annotated bibliography in [12]. Path Relinking (PR) [19] is an approach to integrating intensification and diversification strategies in the search for optimal solutions. PR in the context of GRASP is introduced in [32] as a memory mechanism for utilizing information on previously found good solutions.
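A minimal GRASP skeleton specialized to the permutation representation of the SSP: this is a sketch under our own design choices (randomized greedy extension over a restricted candidate list, adjacent-transposition local search), not the implementation of [17], and it omits PR.

```python
import random

def overlap(s, t):
    """Length of the longest proper suffix of s that is a prefix of t."""
    for k in range(min(len(s), len(t)) - 1, 0, -1):
        if s.endswith(t[:k]):
            return k
    return 0

def merged_length(order):
    # Standard permutation objective: total length minus adjacent overlaps.
    return (sum(len(s) for s in order)
            - sum(overlap(order[i], order[i + 1]) for i in range(len(order) - 1)))

def grasp_ssp(strings, iterations=50, rcl_size=3, seed=0):
    rng = random.Random(seed)
    best = list(strings)
    for _ in range(iterations):
        # Construction: extend by a random pick among the best few overlaps.
        remaining = list(strings)
        order = [remaining.pop(rng.randrange(len(remaining)))]
        while remaining:
            ranked = sorted(remaining, key=lambda s: -overlap(order[-1], s))
            pick = rng.choice(ranked[:rcl_size])
            remaining.remove(pick)
            order.append(pick)
        # Local search: adjacent transpositions until no improvement.
        improved = True
        while improved:
            improved = False
            for i in range(len(order) - 1):
                trial = order[:i] + [order[i + 1], order[i]] + order[i + 2:]
                if merged_length(trial) < merged_length(order):
                    order, improved = trial, True
        if merged_length(order) < merged_length(best):
            best = order
    return best

print(grasp_ssp(["abcd", "cdef", "defg", "fgab"]))
```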

In [17], an implementation of GRASP with PR for solving the SSP is presented. It solves large-scale SSP instances of more than 1,000 strings and outperforms the Greedy algorithm on the majority of the tested instances. The proposed method is able to provide multiple near-optimum solutions, which is of practical importance for DNA sequencing, and it admits a natural parallel implementation. Extended computational experiments on a set of SSP instances with known optimal solutions, produced by using the integer programming formulation presented in Sect. 5.2, indicate that the new method finds the optimum in most of the cases, and its average error relative to the optimum is close to zero.

10 Asymptotic Behaviour

A discrepancy can be observed between the theoretical results of worst-case analysis and the experimental observations of the approximation and heuristic algorithms for the SSP. A possible explanation for this fact is given by the average-case analysis of the problem.

The asymptotic behaviour of the compression achieved by an optimal superstring is analysed in [1] under a certain probability model for the lengths of the strings and the distribution of letters in them. The average optimal compression of n strings tends to \(\frac{n\log n} {H_{\mu }}\), where \(H_{\mu } = -\sum _{i=1}^{m}p(a_{i})\log p(a_{i})\) is the Shannon entropy of the law μ by which letters are chosen from the alphabet to construct the strings.
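As a worked instance of this formula (our numerical example): under the uniform law over a four-letter alphabet such as {a, c, g, t}, \(H_{\mu } =\log 4\), so the average optimal compression of n strings approaches \(\frac{n\log n} {\log 4}\).

```python
from math import log

def shannon_entropy(probs):
    """H_mu = -sum p(a_i) * log p(a_i) (natural logarithm)."""
    return -sum(p * log(p) for p in probs if p > 0)

def average_optimal_compression(n, probs):
    """Asymptotic value n * log(n) / H_mu from [1]."""
    return n * log(n) / shannon_entropy(probs)

# Uniform law over a 4-letter alphabet: H = log 4 ~ 1.386.
print(round(average_optimal_compression(1000, [0.25] * 4)))  # 4983
```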

The asymptotic behaviour of some algorithms for the SSP is based on the above result and explains the good performance of the greedy strategies. In [13], the algorithms Greedy, MGreedy, and Naive are analysed in a probabilistic framework, and it is proved that they are asymptotically optimal. In [65], these asymptotic results are extended to the TGreedy and DiMatch algorithms, after the observations that the performance of TGreedy is never worse than that of MGreedy, and that the intermediate result of the maximum directed matching in DiMatch actually coincides with the result of MGreedy (see Theorem 12). The steps of DiMatch up to the construction of the maximum directed matching are analysed probabilistically under the additional assumption that all strings have the same length, and the asymptotic optimality of these algorithms is established.

By the complexity results in Sect. 3.2, we know that there is no PTAS for the SSP under either performance measure unless P = NP. In [48], a probabilistic PTAS for the SSP that achieves a \((1+\varepsilon )\)-approximation in expected polynomial time, for every \(\varepsilon > 0\), is presented. This algorithm

  1. either returns a possibly non-optimal solution, the solution of Greedy, in polynomial time,

  2. or returns an optimal solution, via a maximum Hamiltonian path on the associated overlap graph, in non-polynomial time.

Under certain conditions on the data of the SSP instance, in the first case Greedy has asymptotic approximation ratio \(1+\varepsilon\) with respect to the length measure, and in the second case the expected running time of finding the maximum Hamiltonian path can be polynomial, since it depends on the time spent when it is executed and on the probability of its execution. Analysing these situations, for a random input the algorithm has approximation ratio \(1+\varepsilon\) with respect to the length measure and polynomial expected running time.

11 Smoothed Analysis

The classical complexity analysis implies that the SSP is a hard problem in the worst case. The average-case analysis explains the effectiveness of greedy strategies under suitable probability models, which are, however, far from reality. In addition to these two frameworks, the more recently developed smoothed analysis explains why greed works so well for the SSP on real-world instances from DNA sequencing practice. Smoothed analysis was introduced in [52] to explain the fact that some algorithms, like the simplex algorithm, run in exponential time in the worst case but are very efficient in practice.

In [36], the smoothed analysis of the Greedy algorithm is carried out, starting from the observation that the asymptotically optimal behaviour of the greedy techniques is due to the fact that random strings do not have large overlaps, so the concatenation of the strings is not much longer than the shortest common superstring. However, the practical instances arising from DNA assembly are not random, and the input strings have significantly large overlaps. By defining small and natural perturbations that represent the mutations of DNA sequences during evolution, it is proved that for any given instance S of the SSP, the average approximation ratio of the Greedy algorithm on a small random perturbation of S is 1 + o(1). This result points out that the approximation hardness of SSP instances indicated by the Max-SNP-hardness result can be destroyed by a very small perturbation. As aptly noted, if there had been a hard instance for the DNA assembly problem in history, the hardness would likely have been destroyed by the random mutations of the DNA sequences during evolution. This result makes the SSP a characteristic case where the complexity differs between worst-case analysis and smoothed analysis.