1 Introduction

In the shortest common superstring (SCS) problem one is given a set \(\mathcal{S}=\{s_1, \ldots , s_n\}\) of \(n\) strings and the goal is to find a shortest string \(s\) such that each \(s_i\) is a substring of \(s\). This is a well-known problem having applications in such areas as genome assembly and data compression.

The problem is known to be NP-hard [10] (even if the input strings have length \(3\) or if the alphabet is binary [2]) and APX-hard [12]. The fastest known exact algorithms simply reduce the problem to the travelling salesman problem and have running time \((\sum _{i=1}^n|s_i|)^{O(1)}2^n\) [1, 6–8]. The currently best known approximation ratio is \(2\frac{11}{23}\) [11]. Better upper bounds are known for special cases when input strings have bounded length [4, 5]. A recent survey of known results (both practical and theoretical) is given in [3].

The well-known greedy conjecture states that the following extremely simple greedy algorithm has approximation ratio \(2\) [14]: find two strings with the longest mutual overlap and merge them into one string; repeat this process until only one string is left. This intriguing conjecture has been open for more than \(25\) years. There is partial progress, however: the conjecture is known to hold for some orders in which the greedy algorithm merges the input strings [9, 15].
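The greedy algorithm described above can be sketched in a few lines of Python. This is only an illustrative sketch: the names `overlap_len` and `greedy_superstring` are hypothetical, ties are broken by Python's tuple ordering (an arbitrary choice that the conjecture does not prescribe), and the input is assumed to be a non-empty list of pairwise different strings, none of which is a substring of another.

```python
def overlap_len(a, b):
    """Length of the longest proper suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)) - 1, 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_superstring(strings):
    """Repeatedly merge an ordered pair with the longest overlap."""
    pool = list(strings)
    while len(pool) > 1:
        # Choose the ordered pair (i, j), i != j, with the longest overlap.
        # Ties are resolved by tuple comparison, which is an arbitrary rule.
        k, i, j = max(
            (overlap_len(pool[i], pool[j]), i, j)
            for i in range(len(pool))
            for j in range(len(pool))
            if i != j
        )
        merged = pool[i] + pool[j][k:]
        pool = [s for idx, s in enumerate(pool) if idx not in (i, j)] + [merged]
    return pool[0]
```

For instance, on the three strings `abcd`, `bcde`, `cdef` the sketch first merges `bcde` and `cdef` (overlap 3) and then prepends `abcd`, giving `abcdef`.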

In this short note, we consider another special case: we prove that the greedy conjecture is true if the input strings have length \(4\). (For strings of length \(3\), the conjecture follows from the fact that the greedy algorithm achieves a \(2\)-approximation of the compression measure [13].) We do this by a careful analysis of the possible overlaps produced by the greedy algorithm.

2 Preliminaries

An overlap \({\text {ov}}(a,b)\) of two strings \(a\) and \(b\) is defined as the longest suffix of \(a\) which is also a prefix of \(b\).
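As a small illustration of this definition, the following sketch returns the overlap itself rather than its length. The function name `overlap` is hypothetical, and the restriction to proper overlaps (shorter than both strings) is an assumption that matches the setting of pairwise different strings of equal length.

```python
def overlap(a, b):
    """Longest proper suffix of a that is also a prefix of b ('' if none)."""
    for k in range(min(len(a), len(b)) - 1, 0, -1):
        if a[-k:] == b[:k]:
            return a[-k:]
    return ""
```

For example, `overlap("abab", "baba")` is `"bab"`, while two strings sharing no suffix and prefix have the empty overlap.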

Let \(\mathcal{S}=\{s_1, \ldots , s_n\}\) be a set of pairwise different \(4\)-strings where by an \(r\)-string we denote just a string of length exactly \(r\). Denote by \(s^{{\text {opt}}}\) and \(s^{{\text {gr}}}\) an optimal solution and a greedy solution for \(\mathcal S\), respectively. Our goal is thus to show that

$$\begin{aligned} |s^{{\text {gr}}}| \le 2\cdot |s^{{\text {opt}}}| \, . \end{aligned}$$
(1)

For technical reasons, we assume in this paper that in case of ties the greedy algorithm prefers strings of the form aaaa for \(\mathtt{a} \in \Sigma \).

Let \(\pi =(\pi _1, \ldots , \pi _n)\) be a permutation of \(\{1, \ldots , n\}\). By overlapping \(n\) input strings in this particular order one gets a superstring of length

$$\begin{aligned} \sum \limits _{i=1}^{n}|s_i| - \sum _{i=1}^{n-1}|{\text {ov}}(s_{\pi _i}, s_{\pi _{i+1}})| \, . \end{aligned}$$
(2)

The second term in the expression above is called a compression of \(\mathcal S\) with respect to \(\pi \). Thus, an equivalent reformulation of SCS is the following: find an order of \(n\) input strings that maximizes the compression. By \(c^{{\text {opt}}}\) and \(c^{{\text {gr}}}\) we denote the compression of the optimal solution \(s^{{\text {opt}}}\) and the greedy solution \(s^{{\text {gr}}}\), respectively. By combining (1) with (2) we get an equivalent reformulation of what we need to prove:

$$\begin{aligned} 4n-c^{{\text {gr}}}\le 2\cdot (4n-c^{{\text {opt}}}) \, . \end{aligned}$$
(3)
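The correspondence between orders, compression, and superstring length in (2) can be checked on small examples. The sketch below uses hypothetical helper names (`ov`, `compression`, `superstring`, `optimal_compression`) and a brute-force search over all \(n!\) orders, so it is only meant for very small inputs of pairwise different strings of equal length.

```python
from itertools import permutations


def ov(a, b):
    """Length of the longest proper suffix of a that is a prefix of b."""
    for k in range(min(len(a), len(b)) - 1, 0, -1):
        if a[-k:] == b[:k]:
            return k
    return 0


def compression(strings, order):
    """Sum of overlaps of consecutive strings: the second term of (2)."""
    return sum(ov(strings[i], strings[j]) for i, j in zip(order, order[1:]))


def superstring(strings, order):
    """Overlap the strings in the given order."""
    result = strings[order[0]]
    for i, j in zip(order, order[1:]):
        result += strings[j][ov(strings[i], strings[j]):]
    return result


def optimal_compression(strings):
    """Maximum compression over all orders (brute force, tiny n only)."""
    return max(compression(strings, p) for p in permutations(range(len(strings))))
```

With the strings `abab`, `baba`, `abac` taken in this order, the compression is \(3 + 3 = 6\) and the resulting superstring `ababac` has length \(12 - 6 = 6\), in agreement with (2).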

For a string \(t\) of length at most \(3\), let \(\#^{{\text {opt}}}(t)\) and \(\#^{{\text {gr}}}(t)\) be the number of overlaps that are equal to \(t\) in \(s^{{\text {opt}}}\) and \(s^{{\text {gr}}}\), respectively. Similarly, let \(\#^{{\text {opt}}}_i\) and \(\#^{{\text {gr}}}_i\) be the number of overlaps of length exactly \(i\). Then (3) is equivalent to

$$\begin{aligned} 4n - \#^{{\text {gr}}}_1 - 2\#^{{\text {gr}}}_2 - 3\#^{{\text {gr}}}_3 \le 2 \cdot (4n - \#^{{\text {opt}}}_1 - 2\#^{{\text {opt}}}_2 - 3\#^{{\text {opt}}}_3) \end{aligned}$$
(4)

or

$$\begin{aligned} 2\#^{{\text {opt}}}_1 + 4\#^{{\text {opt}}}_2 + 6\#^{{\text {opt}}}_3 \le 4n + \#^{{\text {gr}}}_1 + 2\#^{{\text {gr}}}_2 + 3\#^{{\text {gr}}}_3 \, . \end{aligned}$$
(5)

Since \(\#^{{\text {opt}}}_1 + \#^{{\text {opt}}}_2 + \#^{{\text {opt}}}_3 \le n\) it suffices to prove that

$$\begin{aligned} 2 \#^{{\text {opt}}}_3 \le 3 \#^{{\text {gr}}}_3 + 2 \#^{{\text {gr}}}_2 + \#^{{\text {gr}}}_1 \, . \end{aligned}$$
(6)
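For completeness, the step from (5) to (6) can be spelled out as follows; this only restates the argument above. Assuming (6) and using \(\#^{{\text {opt}}}_1 + \#^{{\text {opt}}}_2 + \#^{{\text {opt}}}_3 \le n\), we get

$$\begin{aligned} 2\#^{{\text {opt}}}_1 + 4\#^{{\text {opt}}}_2 + 6\#^{{\text {opt}}}_3&\le 4(\#^{{\text {opt}}}_1 + \#^{{\text {opt}}}_2 + \#^{{\text {opt}}}_3) + 2\#^{{\text {opt}}}_3 \\&\le 4n + 2\#^{{\text {opt}}}_3 \\&\le 4n + 3 \#^{{\text {gr}}}_3 + 2 \#^{{\text {gr}}}_2 + \#^{{\text {gr}}}_1 \, , \end{aligned}$$

which is exactly (5).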

Let \(\mathcal{S}_3^{{\text {gr}}}\) be the set of strings at the point in time when the greedy algorithm has already merged all pairs of strings whose overlap has length \(3\) and no overlaps of length \(3\) are left. In the following lemma we show that the number of overlaps equal to a \(3\)-string \(t\) in the greedy solution cannot be much smaller than that in the optimal solution.

Lemma 1

For any \(3\)-string \(t\), \(\#^{{\text {gr}}}(t) \ge \#^{{\text {opt}}}(t) - 1\). Moreover, if \(\#^{{\text {gr}}}(t) = \#^{{\text {opt}}}(t) - 1\) then \(\mathcal{S}_3^{{\text {gr}}}\) contains a string with prefix \(t\) and suffix \(t\).

Proof

Assume, for the sake of contradiction, that \(\#^{{\text {gr}}}(t) \le \#^{{\text {opt}}}(t) - 2\). The optimal solution contains \(\#^{{\text {opt}}}(t)\) overlaps equal to \(t\) and hence among the \(n\) input strings there are at least \(\#^{{\text {opt}}}(t)\) strings whose prefix is \(t\) and at least \(\#^{{\text {opt}}}(t)\) strings whose suffix is \(t\). Now consider the set \(\mathcal{S}_3^{{\text {gr}}}\). Since \(\#^{{\text {gr}}}(t) \le \#^{{\text {opt}}}(t) - 2\), we conclude that \(\mathcal{S}_3^{{\text {gr}}}\) contains at least two strings whose suffix is \(t\) and at least two strings whose prefix is \(t\). Hence there are two different strings in this set whose overlap is \(t\), which contradicts the fact that there are no more \(3\)-overlaps.   \(\square \)

In the following, the strings from \(\mathcal{S}_3^{{\text {gr}}}\) are called blocks. For a \(3\)-string \(t\), we say that a block is \(t\)-bad if its suffix and its prefix are equal to \(t\) and moreover \(\#^{{\text {gr}}}(t) = \#^{{\text {opt}}}(t)-1\). We call a block bad if it is \(t\)-bad for a \(3\)-string \(t\) and good otherwise. Let \(\#_{{\text {bad}}}\) and \(\#_{{\text {good}}}\) be the number of overlaps in all bad and good blocks, respectively. Then clearly \(\#_{{\text {bad}}}+ \#_{{\text {good}}}= \#^{{\text {gr}}}_3\) (recall that all the overlaps inside the blocks have length \(3\)).
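The blocks can be computed by running only the first phase of the greedy algorithm. The sketch below is an illustration under stated assumptions: the name `blocks_after_3_overlaps` is hypothetical, the input is a list of pairwise different \(4\)-strings, and the tie rule is implemented as "among all pairs with a \(3\)-overlap, pick first a pair that involves a string of the form aaaa", which is one possible reading of the tie rule stated earlier.

```python
def blocks_after_3_overlaps(strings):
    """Merge pairs with a 3-overlap until none is left; return the blocks."""
    pool = list(strings)
    while True:
        # All ordered pairs (i, j) whose overlap has length 3.
        pairs = [
            (i, j)
            for i in range(len(pool))
            for j in range(len(pool))
            if i != j and pool[i][-3:] == pool[j][:3]
        ]
        if not pairs:
            return pool

        def involves_aaaa(pair):
            return any(len(pool[q]) == 4 and len(set(pool[q])) == 1 for q in pair)

        # Assumed tie rule: pairs involving a string aaaa come first.
        pairs.sort(key=lambda p: not involves_aaaa(p))
        i, j = pairs[0]
        merged = pool[i] + pool[j][3:]
        pool = [s for idx, s in enumerate(pool) if idx not in (i, j)] + [merged]
```

On the input `aaaa`, `aaab`, `baaa` the sketch produces the single block `baaaab`: the string `aaaa` is merged immediately and therefore never survives as a block of length \(4\).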

Note that if there are no bad blocks then already Lemma 1 is sufficient to prove (6): in this case, \(\#^{{\text {gr}}}(t) \ge \#^{{\text {opt}}}(t)\) and therefore \(\#^{{\text {gr}}}_3 \ge \#^{{\text {opt}}}_3\).

Next, we consider bad blocks of fixed length: for a \(3\)-string \(t\), let

$$\begin{aligned} \chi _{=i}(t) = [\mathcal{S}_3^{{\text {gr}}} \text { contains a } t\text {-bad block of length exactly } i]. \end{aligned}$$

(throughout the paper, we use the standard Iverson brackets: \([P]\) is equal to \(1\) if \(P\) is true and is equal to \(0\) otherwise). Further, let

$$ \chi _{=i} = \sum \limits _{|t|=3}\chi _{=i}(t) \, .$$

Functions \(\chi _{>i}(t)\), \(\chi _{\ge i} (t)\), \(\chi _{>i}\), and \(\chi _{\ge i}\) are defined in a similar fashion.

Note that \(\chi _{=4} = 0\). Indeed, a bad block of length \(4\) must have the form \(\mathtt{aaaa}\). Moreover, \(\#^{{\text {opt}}}(\mathtt{aaa})>0\) (since \(\#^{{\text {gr}}}(\mathtt{aaa}) = \#^{{\text {opt}}}(\mathtt{aaa}) - 1 \ge 0\) for an \(\mathtt{aaa}\)-bad block) and hence \(\mathcal S\) contains another string starting or ending with \(\mathtt{aaa}\). But then the greedy algorithm must merge these two strings (as it prefers strings of the form \(\mathtt{aaaa}\)). Hence for any \(3\)-string \(t\), \(\chi _{\ge 5}(t)\) is exactly the number of \(t\)-bad blocks.

Lemma 2

For any \(3\)-string \(t\),

$$\begin{aligned} \min \{\#^{{\text {gr}}}(t), \#^{{\text {opt}}}(t) \} + \chi _{\ge 5}(t) = \#^{{\text {opt}}}(t) \, . \end{aligned}$$

Proof

Consider the following two cases:

  1. \(\#^{{\text {gr}}}(t) \ge \#^{{\text {opt}}}(t)\). Then \(\min \{\#^{{\text {gr}}}(t), \#^{{\text {opt}}}(t)\} = \#^{{\text {opt}}}(t)\) and \(\chi _{\ge 5} (t) = 0\).

  2. \(\#^{{\text {gr}}}(t) < \#^{{\text {opt}}}(t)\). Then by Lemma 1, \(\min \{\#^{{\text {gr}}}(t), \#^{{\text {opt}}}(t)\} = \#^{{\text {opt}}}(t) - 1\). There is at least one block starting with \(t\) and ending with \(t\). Moreover, there cannot be two different such blocks as otherwise the greedy algorithm would merge them. Therefore, there is exactly one \(t\)-bad block, i.e., \(\chi _{\ge 5} (t) = 1\).    \(\square \)

By summing up the equality from Lemma 2 over all strings \(t\) of length \(3\) we get the following corollary.

Corollary 1

$$ \sum \limits _{|t|=3} \min \{\#^{{\text {gr}}}(t), \#^{{\text {opt}}}(t) \} + \chi _{\ge 5} = \#^{{\text {opt}}}_3 . $$

Assume now that \(\chi _{=5}= 0\). Then due to the fact that a bad block of length exactly \(i\) contains \(i-4\) overlaps we have that \(2 \chi _{>5}\le \#^{{\text {gr}}}_3\). By adding twice the \(\sum \limits _{|t|=3} \min \{\#^{{\text {gr}}}(t), \#^{{\text {opt}}}(t) \}\) to both sides of this inequality and applying Corollary 1 we get

$$\begin{aligned} 2 \#^{{\text {opt}}}_3 \le \#^{{\text {gr}}}_3 + 2 \sum \limits _{|t|=3} \min \{\#^{{\text {gr}}}(t), \#^{{\text {opt}}}(t) \} \le 3 \#^{{\text {gr}}}_3 , \end{aligned}$$

which implies (6).

Hence the trickiest case is when there are bad blocks of length \(5\). The rest of the paper is devoted to the analysis of this case. Note that such blocks have the form \(\mathtt{ababa}\) (for different letters \(\mathtt{a}, \mathtt{b} \in \Sigma \)) and therefore these are \(\mathtt{aba}\)-bad blocks. To analyze such blocks carefully we introduce the following definitions. For a \(3\)-string \(t\) and \(1 \le i \le 5\), \(B_i(t)=0\) if either \(t\) is not of the form \(\mathtt{aba}\), or \(t\) is of the form \(\mathtt{aba}\) and there is no block \(\mathtt{ababa}\). In the remaining case (i.e., \(t\) is of the form \(\mathtt{aba}\) and there is a block \(\mathtt{ababa}\)) the \(B_i\)’s are defined as follows:

$$\begin{aligned} B_1(\mathsf{aba })&= [\#^{{\text {gr}}}(\mathsf{bab }) > \#^{{\text {opt}}}(\mathsf{bab })] , \\ \nonumber B_2(\mathsf{aba })&= [\text {there exists a block with prefix}\ \mathsf{ba }\ \text {or suffix}\ \mathsf{ab }] , \\ \nonumber B_3(\mathsf{aba })&= [\text {there exists a block except}\ \mathsf{ababa }\ \text {with prefix}\ \mathsf{ab }\ \text {or suffix}\ \mathsf{ba }] , \\ \nonumber B_4(\mathsf{aba })&= [\text {there exists a good block of length at least 5} \\ \nonumber&\quad \; \; \;\text {containing}\ \mathsf{aba }\ \text {or}\ \mathsf{bab }\ \text {as a proper substring}] ,\\ \nonumber B_5(\mathsf{aba })&= [B_2(\mathsf{aba }) = 0\, \text {and}\, B_3(\mathsf{aba }) = 0\ \text {and there exists a bad block of}\\ \nonumber&\quad \; \; \text {length at least 7 containing}\ \mathsf{aba }\ \text {or}\ \mathsf{bab }\ \text {as substring} ]. \end{aligned}$$

Further, let for \(1 \le i \le 5\), \(B_i = \sum \limits _{|t| = 3}B_i(t)\).

Now we show that the \(B_i\)’s provide an upper bound on the number of bad blocks of length exactly \(5\).

Lemma 3

\(\chi _{=5} \le \sum \limits _{i=1}^5 B_i\).

Proof

Note that if \(3\)-string \(t\) is not of the form aba then \(\chi _{=5}(t)=0\) so the string \(t\) contributes nothing to the left-hand side of the inequality. We now focus on \(3\)-strings \(t\) of the form aba. It is sufficient to prove the following inequality:

$$\begin{aligned} \chi _{=5}(\mathtt{aba}) \le \sum \limits _{i=1}^{5} B_i(\mathtt{aba}) \end{aligned}$$
(7)

Assume that \(\chi _{=5}(\mathtt{aba}) = 1\) and \(B_1(\mathtt{aba}) = 0\) as otherwise the inequality holds for trivial reasons. From \(B_1(\mathtt{aba}) = 0\) and Lemma 1 we have that \(\#^{{\text {opt}}}(\mathtt{bab}) - 1 \le \#^{{\text {gr}}}(\mathtt{bab}) \le \#^{{\text {opt}}}(\mathtt{bab})\). Since \(\#^{{\text {gr}}}(\mathtt{bab}) > 0\) (because \(\mathcal{S}_3^{{\text {gr}}}\) contains the block \(\mathtt{ababa}\) by definition of \(\chi _{=5}(\mathtt{aba})\)) we have that \(\#^{{\text {opt}}}(\mathtt{bab}) > 0\), i.e., the optimal solution has at least one overlap of the form \(\mathtt{bab}\). Depending on the location of this overlap in the optimal string we consider the following cases:

  1. The overlap \(\mathtt{bab}\) in the optimal solution is contained as a substring of \(\mathtt{ababa}\). Since \(\#^{{\text {opt}}}(\mathtt{aba}) > 0\), \(\mathcal S\) contains at least one string other than \(\mathtt{abab}\) and \(\mathtt{baba}\) containing \(\mathtt{aba}\) as a substring.

  2. The overlap \(\mathtt{bab}\) in the optimal solution is not in \(\mathtt{ababa}\). Hence \(\mathcal S\) contains at least one string other than \(\mathtt{abab}\) and \(\mathtt{baba}\) containing \(\mathtt{bab}\).

So in both cases there exists a string in \(\mathcal S\) other than \(\mathtt{abab}\) and \(\mathtt{baba}\) that contains \(t' = \mathtt{aba}\) or \(t' = \mathtt{bab}\). This string is contained in some block \(r \in {\mathcal{S}_3^{{\text {gr}}}}\), and moreover \(r \ne \mathtt{ababa}\) and \(r \ne \mathtt{babab}\). Consider the following cases:

  1. \(r\) is a good block. Then \(B_4(\mathtt{aba}) > 0\) if \(t'\) is a proper substring of \(r\) and \(B_2(\mathtt{aba}) + B_3(\mathtt{aba}) > 0\) otherwise. Therefore (7) holds.

  2. \(r\) is a bad block of length \(5\). Then this block has the form \(\mathtt{ababa}\) or \(\mathtt{babab}\), a contradiction.

  3. \(r\) is a bad block of length \(6\). If \(t'\) is a prefix or a suffix of \(r\) then \(B_2(\mathtt{aba}) + B_3(\mathtt{aba}) > 0\). Otherwise either \(r = \mathtt{r_1 t'_1 t'_2 t'_1 r_5 r_6}\) or \(r = \mathtt{r_1 r_2 t'_1 t'_2 t'_1 r_6}\), where \(t' = \mathtt{t'_1 t'_2 t'_3}\) with \(\mathtt{t'_3} = \mathtt{t'_1}\). Since \(r\) is a bad block, its \(3\)-prefix equals its \(3\)-suffix, so either \(r = \mathtt{t'_1 t'_1 t'_2 t'_1 t'_1 t'_2}\) or \(r = \mathtt{t'_2 t'_1 t'_1 t'_2 t'_1 t'_1}\). Finally, since either \(t' = \mathtt{aba}\) or \(t' = \mathtt{bab}\), in both these cases \(r\) has a prefix or a suffix \(\mathtt{ab}\) or \(\mathtt{ba}\). Then \(B_2(\mathtt{aba}) + B_3(\mathtt{aba}) > 0\) and (7) holds.

  4. \(r\) is a bad block of length at least \(7\). Then \(B_5(\mathtt{aba}) > 0\) and (7) holds.   \(\square \)
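The claim used for bad blocks of length \(6\) can be verified mechanically on a small alphabet. The following sketch (the function name `length6_case_holds` is hypothetical) enumerates all strings \(r\) of length \(6\) whose \(3\)-prefix equals their \(3\)-suffix and checks that whenever \(\mathtt{aba}\) or \(\mathtt{bab}\) occurs strictly inside \(r\), but not as its prefix or suffix, the string \(r\) has a \(2\)-prefix or a \(2\)-suffix equal to \(\mathtt{ab}\) or \(\mathtt{ba}\). It is a sanity check of this single step, not a replacement for the argument.

```python
from itertools import product


def length6_case_holds(alphabet="abc"):
    """Check the length-6 step of the case analysis by exhaustive search."""
    for a, b in product(alphabet, repeat=2):
        if a == b:
            continue
        ends = (a + b, b + a)
        for x, y, z in product(alphabet, repeat=3):
            r = x + y + z + x + y + z  # 3-prefix equals 3-suffix
            for t in (a + b + a, b + a + b):
                if r[:3] == t:
                    continue  # t is a prefix (and suffix): handled separately
                if t in (r[1:4], r[2:5]):
                    if r[:2] not in ends and r[-2:] not in ends:
                        return False
    return True
```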

3 The Proof of the Main Theorem

In this section we prove the main result of this note: we first state auxiliary lemmas providing upper bounds on \(B_i\)’s, then show how these lemmas imply the main result of the paper, and then provide the proofs of all the lemmas.

Lemma 4

\(B_1 + \sum \limits _{|t|=3} \min \{\#^{{\text {gr}}}(t), \#^{{\text {opt}}}(t) \} \le \#^{{\text {gr}}}_3\).

Lemma 5

\(B_2 \le \#^{{\text {gr}}}_2\).

Lemma 6

\(B_3 \le \#^{{\text {gr}}}_1 + \#^{{\text {gr}}}_2\).

Lemma 7

\(B_4 \le \#_{{\text {good}}}\).

Lemma 8

\(B_5 + 2 \chi _{>5}+ \chi _{=5}\le \#_{{\text {bad}}}\).

Theorem 1

The greedy algorithm for strings of length \(4\) that prefers strings of the form aaaa in case of ties is \(2\)-approximate.

Proof

By adding the inequalities from Lemmas 5–8 to twice the inequality from Lemma 4 and applying the equality \(\#_{{\text {bad}}}+ \#_{{\text {good}}}= \#^{{\text {gr}}}_3\) one gets

$$\begin{aligned} 2 B_1 + B_2 + B_3 + B_4 + B_5 + 2 \chi _{>5}+ \chi _{=5}+ 2 \sum \limits _{|t|=3} \min \{&\#^{{\text {gr}}}(t), \#^{{\text {opt}}}(t) \} \le \\ \nonumber&3 \#^{{\text {gr}}}_3 + 2 \#^{{\text {gr}}}_2 + \#^{{\text {gr}}}_1 \, . \end{aligned}$$

By further adding the inequality from Lemma 3 we get

$$\begin{aligned} 2 \sum \limits _{|t|=3} \min \{\#^{{\text {gr}}}(t), \#^{{\text {opt}}}(t) \} + 2 \chi _{>5}+ 2 \chi _{=5}+ B_1 \le 3 \#^{{\text {gr}}}_3 + 2 \#^{{\text {gr}}}_2 + \#^{{\text {gr}}}_1 \, . \end{aligned}$$

Finally, applying Corollary 1 we get

$$\begin{aligned} 2 \#^{{\text {opt}}}_3 + B_1 \le 3 \#^{{\text {gr}}}_3 + 2 \#^{{\text {gr}}}_2 + \#^{{\text {gr}}}_1 \end{aligned}$$

which implies (6).    \(\square \)
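The statement of Theorem 1 can be sanity-checked on small random instances. The sketch below is an experiment under stated assumptions, not part of the proof: all function names are hypothetical, the tie rule is modelled by preferring, among pairs with the longest overlap, a pair that involves a string of the form aaaa, and the optimal length is found by brute force over all orders, so only very small instances over the binary alphabet are tried.

```python
import random
from itertools import permutations, product


def ov(a, b):
    """Length of the longest proper suffix of a that is a prefix of b."""
    for k in range(min(len(a), len(b)) - 1, 0, -1):
        if a[-k:] == b[:k]:
            return k
    return 0


def is_uniform(s):
    """True for 4-strings of the form aaaa."""
    return len(s) == 4 and len(set(s)) == 1


def greedy_length(strings):
    """Length of a greedy superstring with the assumed aaaa tie rule."""
    pool = list(strings)
    while len(pool) > 1:
        k, _, i, j = max(
            (ov(pool[i], pool[j]), is_uniform(pool[i]) or is_uniform(pool[j]), i, j)
            for i in range(len(pool))
            for j in range(len(pool))
            if i != j
        )
        merged = pool[i] + pool[j][k:]
        pool = [s for idx, s in enumerate(pool) if idx not in (i, j)] + [merged]
    return len(pool[0])


def optimal_length(strings):
    """Shortest superstring length via formula (2), brute force over orders."""
    total = sum(len(s) for s in strings)
    return min(
        total - sum(ov(x, y) for x, y in zip(p, p[1:]))
        for p in permutations(strings)
    )


def find_counterexample(trials=200, seed=0):
    """Return an instance violating (1), or None if none is found."""
    rng = random.Random(seed)
    universe = ["".join(p) for p in product("ab", repeat=4)]
    for _ in range(trials):
        strings = rng.sample(universe, rng.randint(2, 6))
        if greedy_length(strings) > 2 * optimal_length(strings):
            return strings
    return None
```

In line with the theorem, `find_counterexample()` is expected to return `None`.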

Proof

(of Lemma 4). We have

$$\begin{aligned} B_1 + \sum \limits _{|t|=3} \min \{\#^{{\text {gr}}}(t), \#^{{\text {opt}}}(t) \}&= \sum \limits _{|t| = 3} (B_1(t) + \min \{\#^{{\text {gr}}}(t), \#^{{\text {opt}}}(t)\})\\&= \sum \limits _{t \ne \mathtt{aba}} (B_1(t) + \min \{\#^{{\text {gr}}}(t), \#^{{\text {opt}}}(t)\}) + \sum \limits _{\mathtt{a}, \mathtt{b} \in \Sigma }(B_1(\mathtt{aba})+ B_1(\mathtt{bab}) \\&\qquad + \min \{\#^{{\text {gr}}}(\mathtt{aba}), \#^{{\text {opt}}}(\mathtt{aba})\} + \min \{\#^{{\text {gr}}}(\mathtt{bab}), \#^{{\text {opt}}}(\mathtt{bab})\}) \, . \end{aligned}$$

To prove this lemma, we consider the following cases:

Case 1

If \(t \ne \mathtt{aba}\) then \(B_1(t) = 0\) and hence

$$\begin{aligned} B_1(t) + \min \{ \#^{{\text {gr}}}(t), \#^{{\text {opt}}}(t)\} \le \#^{{\text {gr}}}(t) \, . \end{aligned}$$

Case 2

If \(t = \mathtt{aba}\) and \(B_1(\mathtt{aba}) + B_1(\mathtt{bab}) = 0\) then

$$\begin{aligned} B_1(\mathtt{aba}) + B_1(\mathtt{bab})&+ \min \{\#^{{\text {gr}}}(\mathtt{aba}), \#^{{\text {opt}}}(\mathtt{aba})\}\\&+ \min \{\#^{{\text {gr}}}(\mathtt{bab}), \#^{{\text {opt}}}(\mathtt{bab})\} \le \#^{{\text {gr}}}(\mathtt{aba}) + \#^{{\text {gr}}}(\mathtt{bab}) \, . \end{aligned}$$

Case 3

If \(t = \mathtt{aba}\) and \(B_1(\mathtt{aba}) = 1\) then \(B_1(\mathtt{bab}) = 0\) and, by definition of \(B_1\), \(\#^{{\text {gr}}}(\mathtt{bab}) > \#^{{\text {opt}}}(\mathtt{bab})\). Hence

$$\begin{aligned}&B_1(\mathtt{aba}) + B_1(\mathtt{bab}) + \min \{\#^{{\text {gr}}}(\mathtt{aba}), \#^{{\text {opt}}}(\mathtt{aba})\} + \min \{\#^{{\text {gr}}}(\mathtt{bab}), \#^{{\text {opt}}}(\mathtt{bab})\} \\&\quad = 1 + \min \{\#^{{\text {gr}}}(\mathtt{aba}), \#^{{\text {opt}}}(\mathtt{aba})\} + \min \{\#^{{\text {gr}}}(\mathtt{bab}), \#^{{\text {opt}}}(\mathtt{bab})\}\\&\quad \le 1 + \#^{{\text {gr}}}(\mathtt{aba}) + \#^{{\text {opt}}}(\mathtt{bab}) \\&\quad \le 1 + \#^{{\text {gr}}}(\mathtt{aba}) + \#^{{\text {gr}}}(\mathtt{bab}) - 1 = \#^{{\text {gr}}}(\mathtt{aba}) + \#^{{\text {gr}}}(\mathtt{bab}) \, . \end{aligned}$$

Case 4

If \(t = \mathtt{aba}\) and \(B_1(\mathtt{bab}) = 1\), then the argument is symmetric to Case 3 (with the roles of \(\mathtt{aba}\) and \(\mathtt{bab}\) exchanged).   \(\square \)

Proof

(of Lemma 5). We show that \(B_2 \le \#^{{\text {gr}}}_2\). If \(B_2(t)>0\) then \(t\) is of the form \(\mathtt{aba}\) and there exists a pair of blocks: \(\mathtt{ababa}\) and a block with a \(2\)-prefix \(\mathtt{ba}\) or a \(2\)-suffix \(\mathtt{ab}\). Note that for different strings \(t\) these pairs of blocks do not intersect and cannot be merged with \(2\)-overlaps because the sets \(\{\mathtt{a}, \mathtt{b}\}\) are different. Note also that at least one block in each pair must be merged with a \(2\)-overlap with some block, as otherwise the two blocks of the pair would be merged with each other by the greedy algorithm. Thus \(\sum \limits _{t} B_2(t) \le \#^{{\text {gr}}}_2\).    \(\square \)

For Lemma 6 we need the following auxiliary definitions. Let \({\text {Pref}}(\mathtt{a}, \mathtt{b}) = \emptyset \) if there is no block \(\mathtt{ababa}\) and the set of blocks with prefix \(\mathtt{ab}\) otherwise. Similarly, let \({\text {Suff}}(\mathtt{a}, \mathtt{b}) = \emptyset \) if there is no block \(\mathtt{ababa}\) and the set of blocks with suffix \(\mathtt{ba}\) otherwise. Then it is easy to see that:

$$\begin{aligned} (\mathtt{a} \ne \mathtt{a'} \vee \mathtt{b} \ne \mathtt{b'}) \Rightarrow ({\text {Pref}}(\mathtt{a}, \mathtt{b}) \cap {\text {Pref}}(\mathtt{a'}, \mathtt{b'}) = \emptyset \wedge {\text {Suff}}(\mathtt{a}, \mathtt{b}) \cap {\text {Suff}}(\mathtt{a'}, \mathtt{b'}) = \emptyset ) \, \end{aligned}$$

Let

$${\text {Pref}}(\mathtt{a}) = \bigcup \limits _{\mathtt{b} \in \Sigma } {\text {Pref}}(\mathtt{a}, \mathtt{b}) \text { and } {\text {Suff}}(\mathtt{a}) = \bigcup \limits _{\mathtt{b} \in \Sigma } {\text {Suff}}(\mathtt{a}, \mathtt{b}) \, .$$

Lemma 9

If \(\mathtt{a} \ne \mathtt{c}\) then the set of \(1\)- and \(2\)-suffixes of strings from \({\text {Suff}}(\mathtt{a})\) does not intersect the set of \(1\)- and \(2\)-prefixes of strings from \({\text {Pref}}(\mathtt{c})\).

Proof

All \(1\)-suffixes of strings from \({\text {Suff}}(\mathtt{a})\) are equal to \(\mathtt{a}\) while all \(1\)-prefixes of strings from \({\text {Pref}}(\mathtt{c})\) are equal to \(\mathtt{c}\), hence they do not intersect.

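Assume that the \(2\)-suffix of a block \(b_1 \in {\text {Suff}}(\mathtt{a})\) is equal to the \(2\)-prefix of a block \(b_2 \in {\text {Pref}}(\mathtt{c})\). The \(2\)-suffix of \(b_1\) has the form \(\mathtt{xa}\) and the \(2\)-prefix of \(b_2\) has the form \(\mathtt{cy}\), so \(\mathtt{x} = \mathtt{c}\) and \(\mathtt{y} = \mathtt{a}\). Hence \(b_1 \in {\text {Suff}}(\mathtt{a}, \mathtt{c})\) and \(b_2 \in {\text {Pref}}(\mathtt{c}, \mathtt{a})\), which requires both blocks \(\mathtt{acaca}\) and \(\mathtt{cacac}\) to exist. This is a contradiction, since the input strings are pairwise different.    \(\square \)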

Proof

(of Lemma 6). \(B_3(t) > 0\) only for \(t = \mathtt{aba}\): \(B_3 = \sum \limits _{t} B_3(t) = \sum \limits _{\mathtt{a}} \sum \limits _{\mathtt{b}} B_3(\mathtt{aba})\).

By Lemma 9 one can form sets \(X_1^\mathtt{a}\) of \(1\)-overlaps of strings from \({\text {Suff}}(\mathtt{a})\) and \({\text {Pref}}(\mathtt{a})\) counted in \(\#^{{\text {gr}}}_1\). The lemma guarantees that these sets are disjoint for different \(\mathtt{a}\). Similarly we can form sets \(X_2^\mathtt{a}\) from \(2\)-overlaps of strings from \({\text {Suff}}(\mathtt{a})\) and \({\text {Pref}}(\mathtt{a})\). Hence

$$\begin{aligned} \sum \limits _{\mathtt{a}} |X_1^\mathtt{a}| \le \#^{{\text {gr}}}_1 \text { and } \sum \limits _{\mathtt{a}} |X_2^\mathtt{a}| \le \#^{{\text {gr}}}_2 \, . \end{aligned}$$
(8)

Since for each nonzero \(\chi _{=5}(\mathtt{aba})\) there exists a block \(\mathtt{ababa}\) we have, for each \(\mathtt{a}\),

$$\begin{aligned} \sum \limits _{\mathtt{b}} B_3(\mathtt{aba}) \le \min \{|{\text {Pref}}(\mathtt{a})|, |{\text {Suff}}(\mathtt{a})|\} \, . \end{aligned}$$
(9)

Since for each block \(\mathtt{ababa}\) with \(B_3(\mathtt{aba}) > 0\) there exists by definition a string with prefix \(\mathtt{ab}\) or suffix \(\mathtt{ba}\), we have:

$$\begin{aligned} \sum \limits _{\mathtt{b}} B_3(\mathtt{aba}) < \max \{|{\text {Pref}}(\mathtt{a})|, |{\text {Suff}}(\mathtt{a})|\} \, . \end{aligned}$$
(10)

Assume that \(|{\text {Pref}}(\mathtt{a})| \le |{\text {Suff}}(\mathtt{a})|\) (the opposite case is symmetric). Let us show that

$$\begin{aligned} |X_1^\mathtt{a}| + |X_2^\mathtt{a}| \ge \sum \limits _{\mathtt{b}} B_3(\mathtt{aba}) \,. \end{aligned}$$
(11)

For this, assume the contrary. It follows from (9) and (10) that

$$ |X_1^\mathtt{a}| + |X_2^\mathtt{a}| \le |{\text {Pref}}(\mathtt{a})| - 1 \text { and } |X_1^\mathtt{a}| + |X_2^\mathtt{a}| \le |{\text {Suff}}(\mathtt{a})| - 2 \, .$$

Hence there exists at least one block from \({\text {Pref}}(\mathtt{a})\) whose prefix is not used in overlaps and there exist at least two blocks from \({\text {Suff}}(\mathtt{a})\) whose suffixes are not used in overlaps. But this prefix can be merged with one of these suffixes, a contradiction establishing (11).

Finally, by summing (11) for all \(\mathtt{a}\) and applying (8) we get the required inequality:

$$ \sum \limits _{\mathtt{a} \in \Sigma } \sum \limits _{\mathtt{b} \in \Sigma } B_3(\mathtt{aba}) \le \sum _{\mathtt{a}} (|X_1^\mathtt{a}| + |X_2^\mathtt{a}|) \le \#^{{\text {gr}}}_1 + \#^{{\text {gr}}}_2 \, .$$

   \(\square \)

Proof

(of Lemma 7). If for some \(\mathtt{a}, \mathtt{b} \in \Sigma \), \(B_4(\mathtt{aba}) +B_4(\mathtt{bab}) = 1\), then either \(\mathtt{aba}\) or \(\mathtt{bab}\) is contained in a good block of length at least \(5\) as a proper substring, so there exists at least one overlap in this good block. Hence

$$B_4 = \sum \limits _{|t|=3} B_4(t) \le \#_{{\text {good}}}\, .$$

   \(\square \)

Proof

(of Lemma 8). Let \(\#_{{\text {bad}}}^i\) be the number of overlaps in bad blocks of length \(i\).

Let \(B^i_5(t) = [ i \ge 7 \wedge t = \mathtt{aba} \wedge B_2(t) = B_3(t) = 0 \wedge \text {there exist a block } \mathtt{ababa} \text { and a bad block of length } i \text { which contains } \mathtt{aba} \text { or } \mathtt{bab} \text { as a proper substring}]\).

By definition,

$$\begin{aligned} B_5(t) \le \sum _{i \ge 7} B_5^i (t) \, \end{aligned}$$

Since there are two \(3\)-overlaps in bad blocks of length \(6\), \(2 \chi _{=6} = \#_{{\text {bad}}}^6\).

Consider bad blocks of length \(i \ge 7\). Each such block contains \(i-4\) \(3\)-overlaps. Note that overlaps \(\mathtt{aba}\) or \(\mathtt{bab}\) that are counted in \(B_5^i (\mathtt{aba})\) cannot be neighbouring as otherwise \(B_5^i\) would contain blocks \(\mathtt{ababa}\) and \(\mathtt{babab}\) (while this is only possible if the initial set \(\mathcal S\) contains equal strings).

Let \(\mathtt{aba}\) be the first overlap in a block. Then this block has prefix \(\mathtt{cab}\) for \(\mathtt{c} \in \Sigma \). Its suffix also equals \(\mathtt{cab}\) since this is a bad block. But in this case \(B_2 (\mathtt{aba}) > 0\) and then \(B_5 (\mathtt{aba}) = 0\), a contradiction. A similar contradiction arises if \(\mathtt{aba}\) is the last overlap in a block. Thus, for \(i \ge 7\) we have:

$$\begin{aligned} B_5^i = \sum _{s} B_5^i(s) \le \chi _{=i}\cdot \left\lceil \frac{i - 6}{2} \right\rceil \le \chi _{=i}\cdot (i - 6) \, . \end{aligned}$$

Then

$$\begin{aligned} 2 \chi _{=i} + B_5^i \le 2 \chi _{=i}+ \chi _{=i}\cdot (i - 6) = \chi _{=i}\cdot (i - 4) \le \#_{{\text {bad}}}^i \, . \end{aligned}$$

Finally, we have:

$$\begin{aligned} 2 \chi _{>5}+ \chi _{=5}+ B_5&\le \chi _{=5}+ 2 \chi _{=6} + \sum _{i \ge 7} (2 \chi _{=i} + B_5^i) \\&\le \#_{{\text {bad}}}^5 + \#_{{\text {bad}}}^6 + \sum _{i \ge 7} \#_{{\text {bad}}}^i = \#_{{\text {bad}}}\, . \end{aligned}$$

   \(\square \)

4 Conclusion

We have proved that the greedy conjecture for the shortest common superstring problem is true for strings of length \(4\). Extending the proof to the case of \(5\)-strings seems to be even more tedious. At the same time resolving such special cases does not seem to help to resolve the general case.