1 Introduction

Given a set \(S = \{s_1, s_2, \ldots , s_n\}\) of strings of equal length \(L\) and an integer \(d\) (called the radius), the closest string problem (CSP) asks for a string \(s\) of length \(L\) such that \(d(s, s_i) \le d\) for every \(s_i \in S\), where \(d(s, s_i)\) denotes the Hamming distance between \(s\) and \(s_i\); such a string \(s\) is called a center string of radius \(d\).

The closest string problem has attracted great attention in recent years due to its important applications in bioinformatics [21]. For example, one needs to solve numerous CSP instances over a binary alphabet in order to find approximate gene clusters under the Center Gene Cluster model [1, 18]. Degenerate primer design [34] also involves solving CSP instances over the DNA alphabet. Other applications include universal PCR primer design [10, 16, 20, 22, 30, 34], genetic probe design [20], antisense drug design [9, 20], finding unbiased consensus of a protein family [3], and gene regulatory motif finding [8, 12, 16, 20, 32]. Consequently, CSP has been extensively studied in computational biology [9, 11, 15–17, 19–21, 24, 27–29, 31, 32]. In particular, CSP has been proved to be NP-hard [14, 20].

One approach to CSP is to design approximation algorithms. Along this line, Lanctot et al. [20] presented the first non-trivial approximation algorithm for CSP, which achieves a ratio of \(\frac{4}{3}\). Li et al. [21] designed the first polynomial-time approximation scheme (PTAS) for CSP. Subsequently, the time complexity of the PTAS was improved in [23, 24]. However, even the best-known PTAS in [23] has time complexity \(\mathcal {O}(Ln^{\mathcal {O}( \epsilon ^{-2} )} )\), which is prohibitive even for a moderately small \(\epsilon >0\).

A more practical approach to CSP is via fixed-parameter algorithms. Fixed-parameter algorithms for CSP are based on the observation that the radius \(d\) in a practical instance of CSP is usually reasonably small, and hence an algorithm with time complexity \(\mathcal {O}( f(d) \times poly(n) )\) for a polynomial function \(poly(n)\) and an exponential function \(f(d)\) may still be acceptable. Along this line, Stojanovic et al. [31] designed a linear-time algorithm for the special case of CSP where \(d\) is fixed to 1. Gramm et al. [17] proposed the first fixed-parameter algorithm for CSP, which runs in \(\mathcal {O}(nL + nd \cdot (d + 1)^{d})\) time. Ma and Sun [23] designed an algorithm that runs in \(\mathcal {O}(nL + nd \cdot (16(|\Sigma | - 1))^d)\) time. This algorithm is the first polynomial-time algorithm for the special case of CSP where \(d\) is logarithmic in the input size and the alphabet size \(|\Sigma |\) is a constant. Improved algorithms for CSP along this line were given in [6, 7, 33, 35]. Among them, the algorithm with the best theoretical time complexity for general alphabets is given in [7]. For small alphabets, the best time complexity is achieved by the algorithm in [6]. In particular, this algorithm runs in \(\mathcal {O}(nL + nd^3 \cdot 6.731^d)\) time for binary strings, while it runs in \(\mathcal {O}(nL+nd\cdot 13.183^d)\) time for DNA strings. Notably, in order to achieve better time complexity, these best-performing algorithms combine multiple techniques, which makes the algorithms rather complicated.

Randomization has been widely employed to design fixed-parameter algorithms for many NP-hard problems [4, 5, 13, 25, 26]. However, randomization has not previously been used to design fixed-parameter algorithms for CSP, and it was unclear whether randomization would be of any benefit at all to solving CSP exactly. The only randomized algorithm that we are aware of is a randomized heuristic algorithm for the binary case of CSP proposed by Boucher and Brown [2]. With large synthetic as well as real genomic data, they demonstrated that the heuristic algorithm could detect motifs efficiently. However, no theoretical bounds on the running time or the success probability were provided.

In this paper, we demonstrate that randomization indeed helps us design much simpler and more efficient fixed-parameter algorithms for CSP. Several randomized algorithms are proposed. The first algorithm is presented in Sect. 3 and is for the binary case of CSP. The algorithm is as simple as the following: It starts with a string \(t\) that is initialized to \(s_1\). At each iteration it selects an \(s_i\) with \(d(t, s_i)>d\) and randomly flips one bit of \(t\) where \(s_i\) disagrees with \(t\). If a center string is not found within \(d\) iterations, the algorithm starts over again. This algorithm for the binary case uses a heuristic very similar to that in [2]. However, the procedure for applying the heuristic, as well as the starting and ending conditions, is changed in order to achieve the theoretical bounds proved in this paper. Through rigorous analysis, we show that for any given binary CSP instance, this surprisingly simple algorithm solves the problem in \(\mathcal {O}(nL + n\sqrt{d}\cdot (2e)^d) \approx \mathcal {O}(nL + n\sqrt{d}\cdot 5.437^d)\) time with arbitrarily small constant one-sided error, where \(e\) is the base of the natural logarithm. This time bound is significantly better than the bound \(\mathcal {O}(nL + nd^3\cdot 6.731^d)\) achieved by the previously best known (deterministic) algorithm for CSP [6].

The algorithm is then extended in the rest of the paper to solve the general-alphabet case and to provide better time complexity. More specifically, the algorithm is slightly changed to solve the general-alphabet case in Sect. 4: Instead of flipping the bit at a randomly selected position of \(t\), the letter at that position is changed to a letter randomly selected from the alphabet according to a carefully designed probability distribution. As a result, we show that CSP can be solved in \(\mathcal {O}(nL + n\sqrt{d}\cdot (e \sigma )^d)\) time with arbitrarily small constant one-sided error, where \(\sigma \) is the size of the given alphabet. For DNA strings where \(\sigma =4\), this new time bound is \(\mathcal {O}(nL + n\sqrt{d}\cdot (10.874)^d)\), which is significantly better than \(\mathcal {O}(nL + nd\cdot 13.183^d)\) achieved by the best known (deterministic) algorithm [6].

In Sect. 5, the algorithm is further improved by a simple strategy that avoids repeated changes at the same position of the candidate center string \(t\). Also, in each iteration the selection of \(s_i\) maximizes \(d(t,s_i)\). We show that with these small changes, the time complexity of the algorithm is reduced to \(\mathcal {O}(nL + n\sqrt{d}\cdot (2.5 \sigma )^d)\). Therefore, for binary and DNA strings, the bounds are \(\mathcal {O}(nL + n\sqrt{d}\cdot 5^d)\) and \(\mathcal {O}(nL + n\sqrt{d}\cdot 10^d)\), respectively.

Noticing that the time complexity \(\mathcal {O}(nL + n\sqrt{d}\cdot (2.5 \sigma )^d)\) is not better than that of the algorithm in [7] for large \(\sigma \), two different strategies are introduced in Sects. 6 and 7 to specifically deal with large alphabets. The algorithm in Sect. 6 runs in \(\mathcal {O}(nL + n\sqrt{d}\cdot (2\sigma + 4)^d)\) time. This provides better time complexity than the previously best algorithm in [7] for large \(\sigma \). For example, for protein strings (\(\sigma =20\)), the new algorithm runs in \(\mathcal {O}(nL + n\sqrt{d}\cdot 44^d)\) time while the algorithm in [7] runs in \(\mathcal {O}(nL + nd\cdot 47.21^d)\) time. The algorithm in Sect. 7 has even better time complexity for large \(\sigma \). However, the resulting time bound of our analysis is not a closed-form expression. Via numerical computation, we show that the algorithm runs in \(\mathcal {O}(nL + nd^{2}\cdot 9.81^d)\) and \(\mathcal {O}(nL + nd^{2}\cdot 40.1^d)\) time for DNA and protein strings, respectively.

Table 1 in Sect. 7 compares the time complexities of the algorithms in this paper against the previously best-known algorithms for CSP.

Table 1 The time complexities of the previously best-known algorithms and the algorithms developed in this paper.

2 Notations

In this paper, a string is over an alphabet with a fixed size \(\sigma \ge 2\). For each positive integer \(k\), \([1{\ldots }k]\) denotes the set \(\{1, 2, \ldots , k\}\). For a string \(s\), \(|s|\) denotes the length of \(s\). For each \(i \in [1{\ldots }|s|]\), \(s[i]\) denotes the letter of \(s\) at its \(i\)-th position. Thus, \(s=s[1] s[2] \ldots s[|s|]\). A position set of a string \(s\) is a subset of \([1{\ldots }|s|]\). For two strings \(s\) and \(t\) of the same length, \(d(s,t)\) denotes their Hamming distance.

Two strings \(s\) and \(t\) of the same length \(L\) agree (respectively, differ) at a position \(i \in [1{\ldots }L]\) if \(s[i] = t[i]\) (respectively, \(s[i] \ne t[i]\)). The position set where \(s\) and \(t\) agree (respectively, differ) is the set of all positions \(i \in [1{\ldots }L]\) where \(s\) and \(t\) agree (respectively, differ). The following special notations will be very useful. For two strings \(s_1\), \(s_2\) of the same length, \(\{s_1 \equiv s_2\}\) (respectively, \(\{s_1 \not \equiv s_2\}\)) denotes the set of all positions where \(s_1\) and \(s_2\) agree (respectively, differ). Moreover, for three strings \(s_1\), \(s_2\), \(s_3\) of the same length, \(\{s_1 \not \equiv s_2 \not \equiv s_3\}\) denotes the position set \( \{s_1 \not \equiv s_2 \} \cap \{s_1 \not \equiv s_3\}\cap \{s_2 \not \equiv s_3\}\), while \(\{s_1 \equiv s_2 \not \equiv s_3\}\) denotes the position set \( \{s_1 \equiv s_2 \} \cap \{s_2 \not \equiv s_3\}\).
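
To make the notation concrete, here is a minimal Python sketch (the helper names are ours, not the paper's) of the Hamming distance and the position sets defined above; positions are 0-based in the code, while the text uses 1-based positions.

```python
def hamming(s, t):
    """d(s, t): the Hamming distance between two strings of the same length."""
    return sum(a != b for a, b in zip(s, t))

def agree(s1, s2):
    """The position set {s1 = s2}: all positions where s1 and s2 agree."""
    return {i for i in range(len(s1)) if s1[i] == s2[i]}

def differ(s1, s2):
    """The position set {s1 != s2}: all positions where s1 and s2 differ."""
    return {i for i in range(len(s1)) if s1[i] != s2[i]}

def differ3(s1, s2, s3):
    """{s1 != s2 != s3} = {s1 != s2} & {s1 != s3} & {s2 != s3}."""
    return differ(s1, s2) & differ(s1, s3) & differ(s2, s3)

# Example: hamming("ACGT", "AGGA") == 2 and differ("ACGT", "AGGA") == {1, 3}.
```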

3 Randomized Algorithm for Binary Alphabets

To get familiar with the techniques used in this paper, we start with the binary case of the problem in this section.

Most fixed-parameter algorithms for CSP are based on the bounded search tree method. These algorithms start with a suboptimal solution \(t\), and gradually change \(t\) into the center string by altering the letters at certain positions of \(t\). At each step, if \(d(t,s_i) >d\) for an input string \(s_i\), then at least one of the positions in \(\{t \not \equiv s_i\}\) needs to be changed. Different strategies for choosing the position (or positions) to change lead to very different time complexities. Gramm et al. [17] exhaustively tried every position in a size-\((d+1)\) subset of \(\{t \not \equiv s_i\}\), resulting in the first fixed-parameter algorithm for CSP with time complexity \(\mathcal {O}\left( nL + nd\cdot (d+1)^{d}\right) \). Ma and Sun [23] instead tried every subset of \(\{t \not \equiv s_i\}\) with bounded size and changed the positions in a subset all at once, which led to an \(\mathcal {O}\left( nL + nd\cdot (16 (\sigma -1))^d\right) \) algorithm with a very nontrivial proof. Boucher and Brown [2] proposed a seemingly simple strategy: choose a position from \(\{t \not \equiv s_i\}\) uniformly at random. The resulting heuristic algorithm in their paper deviates from the bounded search tree scheme since it can alter the original \(t\) more than \(d\) times. Although no theoretical performance guarantees were proved for the heuristic algorithm, it worked efficiently on simulated data. Here, we apply the same strategy under the bounded search tree method in Algorithm 1, and show that such a randomized strategy is not only simpler, but also yields a cleaner proof and improves the time complexity.

Algorithm 1: BoundedGuess-Binary
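
Since the pseudocode of Algorithm 1 is not reproduced here, the following Python sketch gives one plausible rendering of BoundedGuess-Binary based on the description in Sect. 1 and the analysis below; the function names, the final feasibility check, and the way the number of repetitions is chosen are our own, with the repetition count taken from Lemma 3.1 and Theorem 3.2.

```python
import math
import random

def hamming(s, t):
    return sum(a != b for a, b in zip(s, t))

def bounded_guess_binary(S, d):
    """One trial of the bounded random search (a sketch): start from s_1 and
    perform at most d random repairs."""
    t = list(S[0])                                       # Step 1: t = s_1
    for _ in range(d):
        bad = next((s for s in S if hamming(t, s) > d), None)
        if bad is None:
            return "".join(t)                            # every s_i is within radius d
        # choose a position where t and the violating s_i disagree, uniformly at random
        p = random.choice([i for i in range(len(t)) if t[i] != bad[i]])
        t[p] = bad[p]                                    # flip that bit of t
    return "".join(t) if all(hamming(t, s) <= d for s in S) else None

def closest_string_binary(S, d, c=3):
    """Repeat the trial.  By Lemma 3.1, one trial succeeds with probability at least
    d!/(2d)^d, so about c * (2d)^d / d! trials (roughly c * (2e)^d / sqrt(d)) leave a
    one-sided error of at most e^(-c)."""
    trials = max(1, math.ceil(c * (2 * d) ** d / math.factorial(d)))
    for _ in range(trials):
        result = bounded_guess_binary(S, d)
        if result is not None:
            return result
    return "no"
```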

Lemma 3.1

If \(\{s_1, \ldots , s_n\}\) has no center string of radius \(d\), then BoundedGuess-Binary always outputs “no”. On the other hand, if \(\{s_1, \ldots , s_n\}\) has a center string of radius \(d\), then with probability at least \( \frac{d!}{(2d)^d}\), BoundedGuess-Binary outputs a center string of \(\{s_1, \ldots , s_n\}\) with radius \(d\).

Proof

The first assertion in the lemma is obvious. So, suppose that \(s\) is a center string of \(\{s_1, \ldots , s_n\}\) with radius \(d\) such that \(d(s, s_1)\) is minimized among all center strings of \(\{s_1, \ldots , s_n\}\) with radius \(d\). For convenience, let \(d_j = d(s, s_1) - (j - 1)\) for all \(j \ge 1\).

For each \(j \ge 1\), let \(E_j\) be the event that the position \(p\) selected in Step 2.3 during the \(j\)th iteration of the while-loop belongs to \(\{ s_1 \not \equiv s \}\), and \(p\) has not been selected in Step 2.3 during the previous \(j-1\) iterations of the while-loop. For each \(j \ge 1\), we want to obtain a lower bound on the conditional probability \(\mathrm{Pr}[E_j~|~E_1, \ldots , E_{j-1}]\).

So, fix an arbitrary \(j \ge 1\) and consider the time point immediately before Step 2.3 of the \(j\)th iteration of the while-loop. Let \(a = |\{t \equiv s \not \equiv s_i\}|\), \(b = |\{s_i \equiv s \not \equiv t\}|\), and \(c = |\{s_i \equiv t \not \equiv s\}|\) (cf. Fig. 1). Clearly, \(d(s_i,s) = a + c \le d\), \(d(t,s) = b + c\), and \(d(t,s_i) = a + b > d\). Let \(\ell = d(t,s_i) - d\). Then, \(d + \ell + 2c = d(t, s_i)+2c = a+b+2c= d(s_i,s) + d(t,s) \le d + d(t,s)\). If events \(E_1\), ..., \(E_{j-1}\) have occurred, then \(d(t,s) = d_j\) and the last inequality becomes \(\ell + 2c \le d_j\). This implies that \(2(d_j-c) \ge d_j+\ell \). Therefore,

$$\begin{aligned} b = d(t,s)-c = d_j - c \ge \frac{d_j + \ell }{2}. \end{aligned}$$
(1)

Notice that \(b\) is precisely the number of bits in \(\{t \not \equiv s_i\}\) where \(t\) needs to be flipped in order to reach \(s\). Moreover, if events \(E_1\), ..., \(E_{j-1}\) have occurred, then none of the \(b\) bits has been flipped during the previous \(j-1\) iterations of the while-loop. Thus, \(\mathrm{Pr}[E_j~|~E_1, \ldots , E_{j-1}]\) is at least

$$\begin{aligned} \frac{b}{d+\ell } \ge \frac{1}{2} \cdot \frac{d_j+\ell }{d+\ell } \ge \frac{1}{2} \cdot \frac{d_j }{d} = \frac{d_j }{2d}, \end{aligned}$$
(2)

where the second inequality is correct because \(d_j \le d\) and \(\ell > 0\).

So, the probability that all of \(E_1\), ..., \(E_{d_1}\) occur is at least

$$\begin{aligned} \prod _{\Delta =1}^{d_1} \frac{\Delta }{2d} \ge \prod _{\Delta =1}^{d} \frac{\Delta }{2d} = \frac{d!}{(2d)^d}, \end{aligned}$$

where the inequality holds because \(d_1 \le d\).\(\square \)

Theorem 3.2

The binary case of the closest string problem can be solved in \(\mathcal {O}\left( nL + n\sqrt{d}\cdot (2e)^d\right) = \mathcal {O}\left( nL + n\sqrt{d}\cdot 5.437^d\right) \) time with arbitrarily small constant one-sided error.

Proof

By Stirling’s formula, \(\frac{d!}{(2d)^d}\) is at least a positive constant times the following:

$$\begin{aligned} \frac{\sqrt{ d}\cdot (\frac{d}{e})^d}{2^d d^d} = \sqrt{ d} \left( \frac{1}{2e}\right) ^d \end{aligned}$$

By Lemma 3.1, if we repeat BoundedGuess-Binary \( \mathcal {O}\left( \frac{(2e)^d}{\sqrt{d}}\right) = \mathcal {O}\left( \frac{5.437^d}{\sqrt{d}}\right) \) times, then we will obtain a center string with arbitrarily large constant probability if there is any.

Obviously, each iteration of the while-loop takes \(\mathcal {O}(nL)\) time. As observed in previous works (e.g., [17]), we can improve this time bound by carefully maintaining the previously computed distances and only updating them after a single position is flipped. Consequently, with an \(\mathcal {O}(nL)\)-time preprocessing, each iteration of the while-loop takes only \(\mathcal {O}(nd)\) time.\(\square \)
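
One way to realize this bookkeeping (a sketch with our own naming; maintaining the sets of disagreeing positions, which the \(\mathcal {O}(nd)\) bound also accounts for, is omitted) is to keep all \(n\) distances to the candidate string and update them in \(\mathcal {O}(n)\) time whenever a single position of \(t\) changes:

```python
def init_distances(S, t):
    """O(nL) preprocessing: Hamming distances from the candidate t to every s_i."""
    return [sum(a != b for a, b in zip(t, s)) for s in S]

def change_position(dist, S, t, p, new_letter):
    """O(n) update of all distances after t[p] is changed to new_letter.
    Assumes new_letter differs from the current letter t[p]."""
    old_letter = t[p]
    for i, s in enumerate(S):
        if s[p] == old_letter:
            dist[i] += 1      # t agreed with s_i at p before the change, disagrees now
        elif s[p] == new_letter:
            dist[i] -= 1      # t disagreed with s_i at p before, agrees now
    t[p] = new_letter
```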

The time bound in Theorem 3.2 is better than \(\mathcal {O}\left( nL + nd^3 \cdot 6.731^d\right) \), which is the best time bound achieved by the fastest deterministic algorithm for the binary case [6].

Fig. 1 Strings \(s\), \(t\), and \(s_{i}\) immediately before Step 2.3 of the \(j\)th iteration of the while-loop in BoundedGuess-Binary, where for each position \(p\in [1{\ldots }L]\), two of the strings have the same letter at position \(p\) if and only if they are illustrated in the same color or pattern at position \(p\).

4 Randomized Algorithm for General Alphabets

In this section, we extend the algorithm in Sect. 3 to the general case. In Step 2.3 of Algorithm 1, a randomly selected \(p \in \{ t \not \equiv s_i \}\) has a good chance of being a position where \(t[p]\) differs from the center string. Consequently, in the binary case, changing \(t[p]\) to \(s_i[p]\) will make \(t\) one step closer to the center string. However, when the alphabet size is greater than 2, this is no longer the best strategy, because \(s_i[p]\) can be either equal to or different from the center string at position \(p\). The algorithm has to carefully bet on one of the two cases during the search. Thus, we modify Step 2.3 of Algorithm 1 as follows:

Algorithm 2: BoundedGuess
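
The modified Step 2.3 appears in Algorithm 2; since the pseudocode is not reproduced here, the following sketch gives a letter-selection rule that matches the probabilities used in the proof of Lemma 4.1 (change to \(s_i[p]\) with probability \(2/\sigma \), and to each letter different from both \(t[p]\) and \(s_i[p]\) with probability \(1/\sigma \)); the function name is ours.

```python
import random

def guess_letter(alphabet, t_letter, si_letter):
    """New letter for a chosen position p with t[p] != s_i[p]: return s_i[p] with
    probability 2/sigma, and each letter different from both t[p] and s_i[p] with
    probability 1/sigma.  For sigma = 2 this always returns s_i[p], as in Sect. 3."""
    sigma = len(alphabet)
    if random.random() < 2.0 / sigma:
        return si_letter
    return random.choice([a for a in alphabet if a != t_letter and a != si_letter])
```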

Lemma 4.1

If \(\{s_1, \ldots , s_n\}\) has no center string of radius \(d\), then BoundedGuess always outputs “no”. On the other hand, if \(\{s_1, \ldots , s_n\}\) has a center string of radius \(d\), then with probability at least \( \frac{d!}{(\sigma d)^d}\), BoundedGuess outputs a center string of \(\{s_1, \ldots , s_n\}\) with radius \(d\).

Proof

The first assertion in the lemma is obvious. To prove the second assertion, we define \(s\) and \(d_j\) as in the proof of Lemma 3.1. For each \(j \ge 1\), let \(E_j\) be the event that the position \(p\) selected in Step 2.3 during the \(j\)th iteration of the while-loop belongs to \(\{ s_1 \not \equiv s \}\), \(p\) has not been selected in Step 2.3 during the previous \(j-1\) iterations of the while-loop, and the letter of \(t\) at position \(p\) is changed to that of \(s\) at position \(p\). The goal is to bound \(\mathrm{Pr}[E_j~|~E_1, \ldots , E_{j-1}]\) from below.

So, fix an arbitrary \(j \ge 1\) and consider the time point immediately before Step 2.3 of the \(j\)th iteration of the while-loop. Let \(a = |\{t \equiv s \not \equiv s_i\}|\), \(b = |\{s_i \equiv s \not \equiv t\}|\), \(c = |\{s_i \equiv t \not \equiv s\}|\), and \(r = |\{s_i \not \equiv s \not \equiv t\}|\) (cf. Fig. 2). Clearly, \(d(s_i,s) = a + c + r \le d\), \(d(t,s) = b + c + r\), and \(d(t,s_i) = a + b + r> d\). Let \(\ell = d(t,s_i) - d\). Then, \(d + \ell + 2c + r = d(t,s_i) + 2c +r = a+b+2c + 2r = d(s_i,s) + d(t,s) \le d + d(t,s)\). If events \(E_1\), ..., \(E_{j-1}\) have occurred, then \(d(t,s) = d_j\) and the above inequality becomes \(\ell + 2c + r \le d_j\). Thus,

$$\begin{aligned} 2b+r = 2(d(t,s)-c-r)+r = 2 d_j -2 c - r = d_j + (d_j -2c - r) \ge d_j +\ell . \end{aligned}$$
(3)

For convenience, let \( q_1 = \frac{b}{d+\ell }\) and \( q_2 = \frac{r}{d+\ell }\). Then, we have

$$\begin{aligned} 2q_1 + q_2 = \frac{2b+r}{d+\ell } \ge \frac{d_j+\ell }{d+\ell } \ge \frac{d_j }{d}, \end{aligned}$$
(4)

where the first inequality follows from Eq. (3) and the second inequality is correct because \(d_j \le d\) and \(\ell > 0\).

Notice that \(b + r\) is precisely the number of positions in \(\{t \not \equiv s_i\}\) where \(t\) needs to be modified in order to reach \(s\). Among these \(b + r\) positions, the \(r\) positions need to be changed to one of the \(\sigma - 2\) letters that differ from both \(t\) and \(s_i\), while the \(b\) positions need to be changed to the corresponding letters of \(s_i\). Moreover, if events \(E_1\), ..., \(E_{j-1}\) have occurred, then none of the \(b+r\) positions has been selected in Step 2.3 during the previous \(j-1\) iterations of the while-loop. Thus, by the law of total probability, \(\mathrm{Pr}[E_j~|~E_1, \ldots , E_{j-1}]\) is at least

$$\begin{aligned} \frac{2}{\sigma } \cdot q_1 + \frac{1}{\sigma } \cdot q_2 = \frac{1}{\sigma } \cdot (2 q_1 + q_2) \ge \frac{1}{\sigma } \cdot \frac{d_j}{d}, \end{aligned}$$
(5)

where the last inequality follows from Eq. (4).

So, the probability that all of \(E_1\), ..., \(E_{d_1}\) occur is at least

$$\begin{aligned} \prod _{\Delta =1}^{d_1} \frac{\Delta }{\sigma d} \ge \prod _{\Delta =1}^{d} \frac{\Delta }{\sigma d} = \frac{d!}{(\sigma d)^d}, \end{aligned}$$

where the inequality holds because \(d_1 \le d\).\(\square \)

Theorem 4.2

The closest string problem can be solved in \(\mathcal {O}\left( nL + n\sqrt{d}\cdot (\sigma e)^d\right) \) time with arbitrarily small constant one-sided error.

Proof

Similar to the proof of Theorem 3.2. The only difference is to use Lemma 4.1 instead of Lemma 3.1. \(\square \)

For DNA strings (\(\sigma = 4\)), the time bound \(\mathcal {O}\left( nL + n\sqrt{d}\cdot (\sigma e)^d\right) = \mathcal {O}\left( nL + n\sqrt{d}\cdot 10.874^d\right) \) is better than \(\mathcal {O}\left( nL + nd\cdot 13.183^d\right) \), which is the best time bound achieved by the fastest deterministic algorithm for the problem [6]. However, for large \(\sigma \) (say, \(\sigma =20\)), the time bound in [7] is better.

Fig. 2 Strings \(s\), \(t\), and \(s_{i}\) immediately before Step 2.3 of the \(j\)th iteration of the while-loop in BoundedGuess, where for each position \(p\in [1{\ldots }L]\), two of the strings have the same letter at position \(p\) if and only if they are illustrated in the same color or pattern at position \(p\).

5 An \(\mathcal {O}(nL+n\sqrt{d} \cdot (2.5 \sigma )^d)\) Time Algorithm

In this section, we obtain a more efficient algorithm by refining BoundedGuess in Sect. 4.

First, a close inspection of the proof of Lemma 4.1 reveals that we do not need to modify a position of \(t\) again once it has been modified. Thus, if we record the already modified positions with a set \(F\), and avoid those positions in later steps, the algorithm’s time complexity may be reduced. More specifically, we can augment Step 1 by also initializing \(F = \emptyset \), modify Step 2.3 by selecting \(p\) from \(\{t\not \equiv s_i\}\setminus F\) instead of from \(\{t\not \equiv s_i\}\), and augment Step 2.4 by also adding \(p\) to \(F\). A crucial observation is that if we find an \(s_i\) in Step 2.2 so that \(d(t, s_i)\) is maximized, then we can prove that \(\{t\not \equiv s_i\}\setminus F\) is significantly smaller than \(\{t\not \equiv s_i\}\). Consequently, the probability of selecting a correct position in Step 2.3 is increased. This will be proved in Lemma 5.1.

Secondly, if \(d(t, s_i) - d \ge 2\) for the string \(s_i\) found in Step 2.2, then we still have \(d(t, s_i) > d\) after modifying only one position of \(t\) in Step 2.3 and hence the same \(s_i\) can be used in Step 2.3 during the next iteration of the while-loop. More generally, if \(d(t, s_i) - d = \ell \) for the string \(s_i\) found in Step 2.2, then we can use \(s_i\) in Step 2.3 during \(\ell \) consecutive iterations of the while-loop.

Based on the above observations, we now obtain a new algorithm as follows:

Algorithm 3: NonRedundantGuess
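
The full pseudocode is Algorithm 3; the sketch below is our own rendering of a single while-loop iteration, with the restart condition and the exact handling of the budget \(\Delta \) simplified. It shows the two modifications discussed above: \(s_i\) is chosen to maximize \(d(t, s_i)\), and the \(\ell \) positions are drawn from \(\{t\not \equiv s_i\}\setminus F\) and then frozen.

```python
import random

def one_iteration(S, t, F, d, alphabet):
    """One iteration of the refined search (a sketch).  t is a list of characters,
    F the set of already modified (frozen) positions.  Returns ("center", string),
    ("fail", None), or ("continue", None)."""
    dist = lambda s: sum(a != b for a, b in zip(t, s))
    si = max(S, key=dist)                            # Step 2.2: farthest input string
    if dist(si) <= d:
        return "center", "".join(t)                  # Step 2.1: t is a center string
    ell = dist(si) - d                               # Step 2.3
    candidates = [p for p in range(len(t)) if t[p] != si[p] and p not in F]
    if len(candidates) < ell:
        return "fail", None                          # this trial can no longer succeed
    R = random.sample(candidates, ell)               # Step 2.4: ell fresh positions
    for p in R:                                      # Step 2.5: re-guess the letters
        if random.random() < 2.0 / len(alphabet):
            t[p] = si[p]
        else:
            t[p] = random.choice([a for a in alphabet if a not in (t[p], si[p])])
    F |= set(R)                                      # Step 2.6: freeze the new positions
    return "continue", None
```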

To analyze NonRedundantGuess, we first define several notations. For each integer \(j \ge 1\), we let \(s_{i_{j}}\) (respectively, \(\ell _j\) or \(R_j\)) denote the string \(s_i\) (respectively, the integer \(\ell \) or the set \(R\)) obtained in Step 2.2 (respectively, Step 2.3 or 2.4) during the \(j\)th iteration of the while-loop. Moreover, for each \(j\ge 1\), we let \(t_j\) (respectively, \(F_j\)) denote the string \(t\) (respectively, the set \(F\)) immediately before the \(j\)th iteration of the while-loop.

The next lemma is the key to analyzing NonRedundantGuess.

Lemma 5.1

For every integer \(j \ge 2\), \( |F_j \cap \{t_j \not \equiv s_{i_j}\} | \ge \frac{ d(t_j, s_1) - \ell _1 + \ell _j }{ 2 }\).

Proof

First, recall that for any two sets \(A\) and \(B\),

$$\begin{aligned} | (A \setminus B) \cup (B \setminus A) | = |A|+|B|- 2 |A\cap B|. \end{aligned}$$
(6)

By the algorithm, \(F_j = \{t_1 \not \equiv t_j\}\). So, \(t_1\) and \(s_{i_j}\) disagree at each position in \(\left( F_j \setminus \{t_j \not \equiv s_{i_j}\} \right) \cup \left( \{t_j \not \equiv s_{i_j}\} \setminus F_j \right) \), and in turn

$$\begin{aligned} \left| \left( F_j \setminus \{t_j \not \equiv s_{i_j}\} \right) \cup \left( \{t_j \not \equiv s_{i_j}\} \setminus F_j \right) \right| \le d(t_1, s_{i_j}). \end{aligned}$$

Moreover, \(s_{i_1}\) maximizes \(d(t_1, s_{i_1})\) and so \(d(t_1, s_{i_j}) \le d(t_1, s_{i_1}) = d + \ell _1\). Thus, we have

$$\begin{aligned} \left| \left( F_j \setminus \{t_j \not \equiv s_{i_j}\} \right) \cup \left( \{t_j \not \equiv s_{i_j}\} \setminus F_j \right) \right| \le d+ \ell _1. \end{aligned}$$
(7)

Applying Eq. (6) to the left side of Eq. (7), we obtain

$$\begin{aligned} |F_j| + | \{ t_j \not \equiv s_{i_j}\}| - 2 | F_j \cap \{ t_j \not \equiv s_{i_j}\}| \le d+\ell _1. \end{aligned}$$
(8)

Obviously, \(|F_j| = d(t_j, s_1)\) and \(|\{ t_j \not \equiv s_{i_j}\}| = d+ \ell _j\). Combining these two equalities with Eq. (8), we finally obtain the inequality in the lemma.\(\square \)

Lemma 5.2

If \(\{s_1, \ldots , s_n\}\) has no center string of radius \(d\), then NonRedundantGuess always outputs “no”. On the other hand, if \(\{s_1, \ldots , s_n\}\) has a center string of radius \(d\), then with probability \( \Omega ( {\sqrt{d}}\cdot ( \frac{(1+\epsilon )^{\frac{1+\epsilon }{2}} (1-\epsilon )^{\frac{1-\epsilon }{2}}}{2^{1+\epsilon }\sigma } )^d)\), NonRedundantGuess outputs a center string of \(\{s_1, \ldots , s_n\}\) with radius \(d\), where \(\epsilon = \ell _1 / d\).

Proof

The first assertion in the lemma is obvious. To prove the second assertion, we define \(s\) as in the proof of Lemma 3.1. For each \(j\ge 1\), let \( d_j = d(s, t_j)\). Moreover, for each \(j \ge 1\), let \(E_j\) be the event that \(R_j\) is a subset of \(\{ s_1 \not \equiv s \}\) and the letter of \(t\) at each position \(p \in R_j\) is changed to that of \(s\) at position \(p\) in Step 2.5 during the \(j\)th iteration of the while-loop. The goal is to bound \(\mathrm{Pr}[E_j~|~E_1, \ldots , E_{j-1}]\) from below.

For convenience, let \(R_j = \{p_{j,1}, \ldots , p_{j,\ell _j}\}\) for all \(j\ge 1\). Without loss of generality, we can assume that \(R_j\) is obtained by the following procedure: For each \(j\ge 1\) and \(1\le h\le \ell _j\), select \(p_{j,h}\) from \(\{t_j \not \equiv s_{i_j}\}\setminus \left( F_j \cup \{p_{j,1}, \ldots , p_{j,h-1}\}\right) \) uniformly at random.

For each \(j\ge 1\) and \(1\le h\le \ell _j\), let \(E_{j,h}\) denote the event that \(p_{j,h}\) belongs to \(\{t_j \not \equiv s\}\setminus \left( F_j \cup \{p_{j,1}, \ldots , p_{j,h-1}\}\right) \) and the letter of \(t_j\) at position \(p_{j,h}\) is changed to that of \(s\) at position \(p_{j,h}\) in Step 2.5 during the \(j\)th iteration of the while-loop.

By the first inequality in Eq. (4) and the equality in Eq. (5) (cf. the proof of Lemma 4.1), we have \( \mathrm{Pr}[E_{1,h}~|~E_{1,1}, \ldots , E_{1,h-1}] \ge \frac{\left( d_1-(h-1)\right) +\left( \ell _1-(h-1)\right) }{\left( d+\ell _1-(h-1)\right) \sigma }\) and in turn

$$\begin{aligned} \mathrm{Pr}[E_1] \ge \prod _{h=0}^{\ell _1-1}\frac{d_1+\ell _1-2h}{(d+\ell _1-h)\sigma }. \end{aligned}$$
(9)

Now, consider an arbitrary integer \(j\ge 2\). Since we assume that \(E_1\), ..., \(E_{j-1}\) occur, \( d(t_j, s_1) = d_1 - d_j\). By Lemma 5.1, \(|F_j \cap \{ t_j \not \equiv s_{i_j}\}| \ge \frac{d_1-d_j - \ell _1+\ell _j}{2}\). So, by the first inequality in Eq. (4), and the equality in Eq. (5) (cf. the proof of Lemma 4.1), we have

$$\begin{aligned}&\mathrm{Pr}[E_{j,h}~|~E_1, \ldots , E_{j-1}, E_{j,1}, \ldots , E_{j,h-1}] \!\ge \! \frac{\left( d_j-(h-1)\right) +\left( \ell _j-(h-1)\right) }{\left( d\!+\!\ell _j\!-\!\frac{d_1-d_j-\ell _1+\ell _j}{2}-(h-1)\right) \sigma } \\&\quad = \frac{2\left( d_j+\ell _j-2(h-1)\right) }{\left( 2d-d_1+\ell _1+d_j+\ell _j-2(h-1)\right) \sigma } \ge \frac{2\left( d_j-(h-1)\right) }{\left( 2d-d_1+\ell _1+d_j-(h-1)\right) \sigma }, \end{aligned}$$

where the last inequality is because \(\ell _j - (h-1) \ge 0\). In turn,

$$\begin{aligned} \mathrm{Pr}[E_j~|~E_1,\ldots ,E_{j-1}]&\ge \prod _{h=0}^{\ell _j-1} \frac{2\left( d_j-h\right) }{\left( 2d-d_1+\ell _1+d_j-h\right) \sigma } \nonumber \\&= \prod _{i=d_j-\ell _j+1}^{d_j} \frac{2i}{(2d-d_1+\ell _1+i) \sigma }. \end{aligned}$$
(10)

Therefore, \(\mathrm{Pr}[E_1\wedge E_2\wedge \cdots ]\) is at least

$$\begin{aligned}&\prod _{i=0}^{\ell _1-1}\frac{d_1+\ell _1-2i}{(d+\ell _1-i)\sigma } \cdot \prod _{i=1}^{d_1-\ell _1}\frac{2i}{(2d-d_1+\ell _1+i)\sigma } \nonumber \\&\quad = \left( \frac{2}{\sigma }\right) ^{d_1}\cdot \prod _{i=0}^{\ell _1-1}\frac{\frac{d_1+\ell _1}{2}-i}{(d+\ell _1-i)} \cdot \prod _{i=1}^{d_1-\ell _1}\frac{i}{(2d-d_1+\ell _1+i)} \nonumber \\&\quad = \left( \frac{2}{\sigma }\right) ^{d_1}\cdot \frac{ \left( \frac{d_1+\ell _1}{2} \right) ! \, d! \, \left( d_1-\ell _1 \right) ! \, (2d-d_1+\ell _1)!}{ \left( \frac{d_1-\ell _1}{2} \right) ! \, \left( d+\ell _1\right) ! \, (2d)!} \end{aligned}$$
(11)

For convenience, let \(\delta = \frac{d_1}{d}\). By the triangle inequality, \(d(s_1, s_{i_1}) \le d(s, s_1) + d(s, s_{i_1}) = d_1 + d(s, s_{i_1}) \le d_1 + d\) and in turn \(\ell _1 = d(s_1, s_{i_1}) - d \le d_1\). So, \(\delta \ge \epsilon \).

Now, by Eq. (11) and Stirling’s formula, the probability is at least a positive constant times the following:

$$\begin{aligned} \sqrt{d}\cdot \left( \frac{2^{\delta -\epsilon }}{4\sigma ^\delta } \cdot \frac{ (\delta +\epsilon )^{\frac{\delta +\epsilon }{2}} (\delta -\epsilon )^{\frac{\delta -\epsilon }{2}} (2-\delta +\epsilon )^{2-\delta +\epsilon }}{(1+\epsilon )^{1+\epsilon }} \right) ^d \end{aligned}$$
(12)

By elementary calculus, one can verify that Eq. (12) is a decreasing function of \(\delta \) for \(\delta \ge \epsilon \). Therefore, since \(\epsilon \le \delta \le 1\), Eq. (12) reaches its minimum value when \(\delta =1\). Consequently, the probability is at least a positive constant times the following:

$$\begin{aligned} \sqrt{d}\cdot \left( \frac{(1+\epsilon )^{\frac{1+\epsilon }{2}} (1-\epsilon )^{\frac{1-\epsilon }{2}}}{2^{1+\epsilon }\sigma } \right) ^d \end{aligned}$$
(13)

This completes the proof.\(\square \)

Theorem 5.3

The closest string problem can be solved in

$$\begin{aligned} \mathcal {O}\left( nL + n\sqrt{d}\cdot \left( \frac{2^{1+\epsilon }\sigma }{(1+\epsilon )^{\frac{1+\epsilon }{2}}(1-\epsilon )^{\frac{1-\epsilon }{2}}}\right) ^d\right) \end{aligned}$$

time with arbitrarily small constant one-sided error, where \(\epsilon = \ell _1 / d\).

Proof

Similar to the proof of Theorem 3.2. The only difference is to use Lemma 5.2 instead of Lemma 3.1.\(\square \)

Corollary 5.4

The closest string problem can be solved in \( \mathcal {O}\left( nL + n\sqrt{d}\cdot \left( 2.5\sigma \right) ^d\right) \) time with arbitrarily small constant one-sided error.

Proof

By elementary calculus, one can verify that for a fixed \(\sigma \), Eq. (13) achieves the minimum value \( \sqrt{d}\cdot \left( \frac{2}{5\sigma } \right) ^d\) at \(\epsilon = 0.6\). This completes the proof.\(\square \)
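
The calculus claim can also be checked numerically; the following sketch (our own code) evaluates the base of Eq. (13) on a grid and confirms that its minimum over \([0,1]\) is \(2/5\), attained at \(\epsilon = 0.6\), which is exactly what turns the bound into \((2.5\sigma )^d\).

```python
import numpy as np

# g(eps) = (1+eps)^((1+eps)/2) * (1-eps)^((1-eps)/2) / 2^(1+eps), the sigma-free part
# of the base of Eq. (13); the time bound's exponential base is sigma / g(eps).
eps = np.linspace(0.0, 1.0, 100001)
g = (1 + eps) ** ((1 + eps) / 2) * (1 - eps) ** ((1 - eps) / 2) / 2 ** (1 + eps)
i = g.argmin()
print(round(float(eps[i]), 3), round(float(g[i]), 4))   # expected output: 0.6 0.4
```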

The time bound in Corollary 5.4 is better than that in Theorem 4.2.

6 More Efficient Algorithm for Large Alphabets

When \(\sigma < 16\), the time complexity in Corollary 5.4 is better than those of the previously best known algorithms [6, 7]. However, for large \(\sigma \) (such as \(\sigma =20\) for protein sequences), the algorithm in [7] is still better. In this section, we refine NonRedundantGuess to obtain a new algorithm (named LargeAlphabet1) with time complexity \(\mathcal {O}(nL + n\sqrt{d}\cdot ( 2\sigma +4)^d)\). This time complexity is better than that of Corollary 5.4 for \(\sigma \ge 9\). Moreover, the new time complexity is smaller than that of the algorithm in [7] for reasonably large alphabet sizes (such as \(\sigma =20\)).

When the alphabet size is large, the most expensive factor in the time complexity of Corollary 5.4 is \(\sigma ^d\). This factor arises from the fact that for each position that needs to be changed in Step 2.5 of NonRedundantGuess, we need to guess the letter from \(\sigma -1\) possibilities, and in total there can be as many as \(d\) such guessing events. However, in the first iteration of the while-loop of NonRedundantGuess, there can be a large number of positions \(p \in \{t \not \equiv s_i\}\) such that \(t[p]\) needs to be changed to \(s_i[p]\). Denote the set of these positions by \(B\). For the positions in \(B\), the algorithm actually does not need to guess the letters. Moreover, by Lemma 3.1 in [7], \(|B| \ge \ell _1\). Based on this fact, we obtain LargeAlphabet1 by modifying Step 2.5 of NonRedundantGuess as follows:

Algorithm 4: LargeAlphabet1
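
Since the pseudocode of Algorithm 4 is not reproduced here, the following sketch (our own rendering) shows only the modified first-iteration step: a set of \(\ell _1\) positions is drawn uniformly at random from \(\{t \not \equiv s_i\}\) and \(s_i\)'s letters are copied there, with no letter guessing; all later iterations proceed exactly as in NonRedundantGuess.

```python
import random

def first_iteration_guess(t, si, ell):
    """Modified Step 2.5 for the first iteration (a sketch): copy s_i's letters at
    ell positions chosen uniformly at random from {t != s_i}; return the chosen set."""
    mismatches = [p for p in range(len(t)) if t[p] != si[p]]   # |{t != s_i}| = d + ell_1
    R = random.sample(mismatches, ell)
    for p in R:
        t[p] = si[p]
    return set(R)
```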

Lemma 6.1

If \(\{s_1, \ldots , s_n\}\) has no center string of radius \(d\), then LargeAlphabet1 always outputs “no”. On the other hand, if \(\{s_1, \ldots , s_n\}\) has a center string of radius \(d\), then with probability \( \Omega \left( {d}\cdot \sqrt{\epsilon (1-\epsilon )}\cdot \left( \frac{\epsilon ^\epsilon (1-\epsilon )^{1-\epsilon }}{2^{1+\epsilon }\sigma ^{1-\epsilon }} \right) ^d\right) \), LargeAlphabet1 outputs a center string of \(\{s_1, \ldots , s_n\}\) with radius \(d\), where \(\epsilon = \ell _1 / d\).

Proof

We inherit the notations from Sect. 5. The first assertion in the lemma is obvious. To prove the second assertion, we first observe that \(\{ s_1 \not \equiv s_{i_1} \}\) contains at least \(\ell _1\) positions \(p\) such that \(s_1[p]\) should be changed to \(s_{i_1}[p]\). This follows from Lemma 3.1 in [7]. Consequently, we have

$$\begin{aligned} \mathrm{Pr}[E_1] \ge \frac{1}{\binom{d+\ell _1}{\ell _1}} = \frac{\ell _1! \, d!}{(d+\ell _1)!}. \end{aligned}$$
(14)

Obviously, for \(j \ge 2\), Eq. (10) in the proof of Lemma 5.2 still holds. Therefore, \(\mathrm{Pr}[E_1\wedge E_2\wedge \cdots ]\) is at least

$$\begin{aligned} \frac{\ell _1! \, d!}{(d+\ell _1)!} \cdot \prod _{i=1}^{d_1-\ell _1}\frac{2i}{(2d-d_1+\ell _1+i)\sigma }&\ge \frac{\ell _1! \, d!}{(d+\ell _1)!} \cdot \prod _{i=1}^{d-\ell _1}\frac{2i}{(d+\ell _1+i)\sigma } \nonumber \\&= \frac{ d! \, {\ell _1}! \, (d-\ell _1)! }{ (2d)!} \cdot \left( \frac{2}{\sigma } \right) ^{d-\ell _1}, \end{aligned}$$
(15)

where the inequality can be easily verified because the inequality \(d_1 \le d\) implies that \((d_1-\ell _1)!\cdot (2d-d_1+\ell _1)! \ge (d-\ell _1)!\cdot (d+\ell _1)!\). Therefore, by Stirling’s formula, \(\mathrm{Pr}[E_1\wedge E_2\wedge \cdots ]\) is \( \Omega \left( {d} \cdot \sqrt{\epsilon (1-\epsilon )}\cdot \left( \frac{\epsilon ^{\epsilon } (1-\epsilon )^{1-\epsilon }}{2^{1+\epsilon } \sigma ^{1-\epsilon }} \right) ^d\right) \).\(\square \)

Theorem 6.2

The closest string problem can be solved in

$$\begin{aligned} \mathcal {O}\left( nL + n \cdot \frac{1}{\sqrt{\epsilon (1-\epsilon )}}\cdot \left( \frac{2^{1+\epsilon } \sigma ^{1-\epsilon }}{\epsilon ^{\epsilon } (1-\epsilon )^{1-\epsilon }} \right) ^d\right) \end{aligned}$$

time with arbitrarily small constant one-sided error, if \(0 < \epsilon = \ell _1 / d < 1\).

Proof

Similar to the proof of Theorem 3.2. The only difference is to use Lemma 6.1 instead of Lemma 3.1.\(\square \)

Corollary 6.3

The closest string problem can be solved in \( \mathcal {O}\left( nL + n\sqrt{d}\cdot \left( 2\sigma + 4 \right) ^d\right) \) time with arbitrarily small constant one-sided error.

Proof

If \(\epsilon = 0\), then \(\ell _1 = 0\) and so LargeAlphabet1 takes \(\mathcal {O}(nL)\) time because \(s_1\) itself is a center string of \(\{s_1, \ldots , s_n\}\) with radius \(d\). Moreover, if \(\epsilon =1\), then \(\ell _1=d\) and so the equality in Eq. (15) implies that LargeAlphabet1 takes \(\mathcal {O}\left( nL + n\sqrt{d}\cdot 4^d\right) \) time. So, assume that \(0 < \epsilon < 1\). Then, \(\frac{1}{\epsilon (1-\epsilon )} = \mathcal {O}\left( d\right) \). Consider the function \( f(\epsilon ) = \frac{2^{1+\epsilon } \sigma ^{1-\epsilon }}{\epsilon ^{\epsilon } (1-\epsilon )^{1-\epsilon }}\). By elementary calculus, one can verify that \(f\) takes the maximum value \(2\sigma +4\) at \( \epsilon = \frac{2}{\sigma +2}\). Now, the corollary follows from Theorem 6.2.\(\square \)
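
The claim about \(f\) can be checked numerically as well; the sketch below (our own code) evaluates \(f\) on a grid for \(\sigma = 20\) and confirms that the maximum is \(2\sigma +4 = 44\), attained at \(\epsilon = 2/(\sigma +2) \approx 0.0909\).

```python
import numpy as np

sigma = 20
eps = np.linspace(1e-6, 1 - 1e-6, 200001)
f = 2 ** (1 + eps) * sigma ** (1 - eps) / (eps ** eps * (1 - eps) ** (1 - eps))
i = f.argmax()
print(round(float(eps[i]), 4), round(float(f[i]), 2))   # expected: ~0.0909 and ~44.0
```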

Obviously, when \(\sigma \ge 9\), the time bound in Corollary 6.3 is better than that in Corollary 5.4. In particular, for protein strings (i.e., when \(\sigma = 20\)), the time bound in Corollary 6.3 becomes \(\mathcal {O}\left( nL + n\sqrt{d}\cdot 44^d\right) \), which is better than the best time bound \(\mathcal {O}\left( nL + nd\cdot 47.21^d\right) \) achieved by a deterministic algorithm for the problem [7].

Since the value of \(\epsilon = \ell _1 /d\) in Theorems 5.3 and 6.2 is known at the very beginning of NonRedundantGuess and LargeAlphabet1, one can choose between the two algorithms depending on which of the two bounds in Theorems 5.3 and 6.2 is smaller. For \(\sigma \ge 5\), numerical computation shows that this combined algorithm has a better time complexity than the bounds proved in Corollaries 5.4 and 6.3. For example, when \(\sigma = 20\), we can show that the combined algorithm has time complexity \( \mathcal {O}\left( nL + n\sqrt{d}\cdot 43.54^d\right) \).
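
The following sketch reproduces this kind of numerical computation for \(\sigma = 20\) (our own code, ignoring the polynomial factors): for each value of \(\epsilon \) it takes the smaller of the exponential bases from Theorems 5.3 and 6.2 and then maximizes over \(\epsilon \).

```python
import numpy as np

sigma = 20
eps = np.linspace(1e-6, 1 - 1e-6, 200001)
# exponential bases of Theorem 5.3 and Theorem 6.2, as functions of eps = ell_1 / d
B1 = 2 ** (1 + eps) * sigma / ((1 + eps) ** ((1 + eps) / 2) * (1 - eps) ** ((1 - eps) / 2))
B2 = 2 ** (1 + eps) * sigma ** (1 - eps) / (eps ** eps * (1 - eps) ** (1 - eps))
print(round(float(np.minimum(B1, B2).max()), 2))   # ~43.5, consistent with the 43.54^d bound
```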

7 More Efficient Algorithm for Nonbinary Alphabets

In this section, we refine NonRedundantGuess in a different way to obtain a new algorithm (named LargeAlphabet2). The improvement has two consequences. First, LargeAlphabet2 runs faster than the algorithm in [7] for every \(\sigma \ge 2\), whereas LargeAlphabet1 developed in the previous section has a higher complexity than the algorithm in [7] for very large alphabets (\(\sigma > 36\)). Secondly, we will show that for \(\sigma \ge 3\), the combination of NonRedundantGuess and LargeAlphabet2 has a better time bound than all other bounds obtained in this paper and the algorithm in [7]. However, the only drawback of LargeAlphabet2 is that we are unable to obtain a closed-form expression for its time complexity.

We inherit the notations from Sect. 5. Recall that the idea behind LargeAlphabet1 is as follows: In the first iteration of the while-loop, we first randomly guess \(\ell _1\) positions from \(\{t\not \equiv s_i\}\) and then change the letter of \(t\) at each guessed position \(p\) to \(s_i[p]\). The idea behind LargeAlphabet2 is to randomly guess more in the first iteration. More specifically, in the first iteration, we guess \(B = \{s_i \equiv s \not \equiv t \}\), \(H = \{t \not \equiv s_i \not \equiv s\}\), and \(s[p]\) for each \(p \in H\). The crucial point is that after we change \(t[p]\) to \(s[p]\) for each \(p \in B\cup H\) in the first iteration, we can decrease \(\Delta \) down to \(\min \{d - |B| - |H|, |B| - \ell _1 \}\) (instead of down to \(d - |B| - |H|\)). This follows from Lemma 3.1 in [7].

Based on the discussion in the last paragraph, we obtain LargeAlphabet2 from NonRedundantGuess by replacing Steps 2.4 through 2.6 with the next step:

Algorithm 5: LargeAlphabet2

2.4 If \(\Delta < d\), then perform Steps 2.4, 2.5, and 2.6 of NonRedundantGuess. Otherwise, perform the following three steps:

  2.4.1 Select an integer \(h \in \{0, 1, \ldots , d-\ell \}\), a subset \(H\) of \(\{t \not \equiv s_i\}\) with \(|H|=h\), and a subset \(B\) of \(\{t \not \equiv s_i\} \setminus H\) with \(\ell \le |B| \le d - h\), all uniformly at random.

  2.4.2 Let \(R = B \cup H\). For each \(p \in R\), change \(t[p]\) to \(s_i[p]\) if \(p \in B\), and change \(t[p]\) to one of the \(\sigma -2\) letters equal to neither \(t[p]\) nor \(s_i[p]\), chosen uniformly at random, if \(p\in H\).

  2.4.3 Set \(\Delta =\min \{d-|R|, |B|-\ell \}\) and \(F=R\).\(^{1}\)

\(^{1}\) Instead of setting \(F=R\), we can set \(F=\{t\not \equiv s_i\}\). This change can only speed up the algorithm; we set \(F=R\) for the clarity of the proof, which uses Lemma 5.1.
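
To make Steps 2.4.1–2.4.3 concrete, here is a minimal Python sketch of the first-iteration guess (our own rendering; the three random choices are drawn independently and uniformly, which is one reading of “all uniformly at random”, and for \(\sigma = 2\) the sketch forces \(H\) to be empty).

```python
import random

def large_alphabet2_first_guess(t, si, ell, d, alphabet):
    """Steps 2.4.1-2.4.3 for the first iteration (Delta == d); t is a list of
    characters, si the selected input string, ell = d(t, s_i) - d."""
    mismatches = [p for p in range(len(t)) if t[p] != si[p]]     # the set {t != s_i}
    # 2.4.1: guess h, a subset H of size h, and a subset B with ell <= |B| <= d - h
    h = random.randint(0, d - ell) if len(alphabet) > 2 else 0
    H = set(random.sample(mismatches, h))
    rest = [p for p in mismatches if p not in H]
    B = set(random.sample(rest, random.randint(ell, d - h)))
    # 2.4.2: copy s_i's letters on B; guess a letter different from both t and s_i on H
    for p in B:
        t[p] = si[p]
    for p in H:
        t[p] = random.choice([a for a in alphabet if a not in (t[p], si[p])])
    # 2.4.3: the new budget Delta and the set F of frozen positions
    return min(d - len(B) - len(H), len(B) - ell), B | H
```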

Lemma 7.1

Let \(\epsilon \) be as in Lemma 5.2. If \(\{s_1, \ldots , s_n\}\) has no center string of radius \(d\), then LargeAlphabet2 always outputs “no”. On the other hand, if \(\{s_1, \ldots , s_n\}\) has a center string of radius \(d\), then with probability \( \Omega \left( \frac{\gamma ^{d}}{d}\right) \), LargeAlphabet2 outputs a center string of \(\{s_1, \ldots , s_n\}\) with radius \(d\), where

$$\begin{aligned} \gamma = \left\{ \begin{array}{ll} \frac{(1-\epsilon )^{\frac{1-\epsilon }{2}} (1+\epsilon )^{1+\epsilon }}{(3+\epsilon )^{\frac{3+\epsilon }{2}}} &{} \hbox { if } \sigma =2 \\ {\displaystyle \min _{0\le \alpha \le 1-\epsilon }} \frac{1}{(\sigma -2)^{\alpha }} \left( \frac{1-\epsilon -\alpha }{\sigma }\right) ^{\frac{1- \epsilon - \alpha }{2}} \frac{ 2^{\frac{1-\epsilon +\alpha }{2}} \alpha ^{\alpha } (1+\epsilon -\alpha )^{(1 +\epsilon -\alpha )} }{ \left( 3 +\epsilon -\alpha \right) ^{\frac{3 +\epsilon -\alpha }{2}} } &{} \hbox { otherwise.} \\ \end{array}\right. \end{aligned}$$
(16)

Proof

We inherit the notations from Sect. 5. The first assertion in the lemma is obvious. To prove the second assertion, we first calculate \(\mathrm{Pr}[E_0 \wedge E_1]\), where \(E_0\) is the event \(R_1 = \{s_1\not \equiv s\} \cap \{s_1\not \equiv s_{i_1}\}\). Clearly, both events \(E_0\) and \(E_1\) occur if and only if

1. The set \(B\) selected in Step 2.4.1 is \(\{s_{i_1} \equiv s \not \equiv s_1\}\),

2. The set \(H\) selected in Step 2.4.1 is \(\{s_1 \not \equiv s_{i_1} \not \equiv s\}\), and

3. \(t[p]\) is changed to \(s[p]\) in Step 2.4.2 for each \(p \in H\).

For convenience, let \(b_1 = |\{s_{i_1} \equiv s \not \equiv s_1\}|\) and \(h_1 = |\{s_1 \not \equiv s_{i_1} \not \equiv s\}|\). Note that the number of different combinations for \(B\), \(H\), and the letters in \(H\) is bounded from above by

$$\begin{aligned} d \cdot \binom{d + \ell _1}{h_1} \cdot (\sigma -2)^{h_1} \cdot \sum _{b=\ell _1}^{d - h_1} \binom{d + \ell _1 - h_1}{b} \le d \cdot \binom{d+\ell _1}{h_1} (\sigma -2)^{h_1} 2^{d+\ell _1-h_1}. \end{aligned}$$

Thus, \( \mathrm{Pr}[E_0 \wedge E_1] \ge \frac{1}{d(\sigma -2)^{h_1} 2^{d+\ell _1-h_1}} \cdot \frac{h_1! (d+\ell _1-h_1)!}{(d+\ell _1)!}\). So, if \(h_1 = 0\), then

$$\begin{aligned} \mathrm{Pr}[E_0 \wedge E_1] \ge \frac{1}{d\cdot 2^{d+\ell _1}} \end{aligned}$$
(17)

Otherwise, by Stirling’s formula, \(\mathrm{Pr}[E_0 \wedge E_1]\) is at least a positive constant times the following:

$$\begin{aligned} \sqrt{\frac{ h_1(d+\ell _1-h_1)}{d+\ell _1}}\cdot \frac{ h_1^{h_1}(d+\ell _1-h_1)^{d+\ell _1-h_1}}{d(\sigma -2)^{h_1} 2^{d+\ell _1-h_1}(d+\ell _1)^{d+\ell _1}} \end{aligned}$$
(18)

Let \(d_1 = d(s_1, s) \le d\) be the precise number of changes required to convert \(s_1\) to \(s\). Suppose that both \(E_0\) and \(E_1\) have occurred. Then, Lemma 3.1 in [7] guarantees that after the first iteration of the while-loop, we need to further modify at most \(\min ( d_1-b_1-h_1, b_1-\ell _1) \le \frac{d_1-h_1-\ell _1}{2}\) positions of \(t\). Thus, after the first iteration, \(\Delta \) is an upper bound on the remaining necessary changes to \(t\), and \(F\) consists of the already fixed positions of \(t\) that will be excluded from future changes. Hence, after the first iteration, the behavior of LargeAlphabet2 will be the same as that of NonRedundantGuess. Therefore, by Lemma 5.1 and Eq. (10), we know that for each \(j\ge 2\),

$$\begin{aligned} \mathrm{Pr}[E_j~|~E_0,E_1,\ldots ,E_{j-1}] \ge \prod _{i=d_j-\ell _j+1}^{d_j} \frac{2i}{(2d-d_1+\ell _1+i) \sigma }. \end{aligned}$$
(19)

Let \(m = \lfloor \frac{d_1-h_1-\ell _1}{2} \rfloor \). Then, \(\mathrm{Pr}[E_2\wedge E_3\wedge \cdots ~|~E_0,E_1]\) is bounded from below by

$$\begin{aligned} \prod _{i=1}^{ \lfloor \frac{d_1-h_1-\ell _1}{2} \rfloor } \frac{2i}{(2d-d_1+\ell _1 +i) \sigma } =\frac{2^mm!(2d-d_1+\ell _1)!}{\sigma ^m(2d-d_1+\ell _1+m)!}. \end{aligned}$$
(20)

Note that if \(m=0\), then the right side of Eq. (20) is 1. So, we may assume that \(m \ge 1\) and hence \(d_1-h_1-\ell _1\ge 2\). Moreover, by Stirling’s formula, Eq. (20) is at least a positive constant times the following:

$$\begin{aligned} \frac{ 2^mm^m(2d-d_1+\ell _1)^{2d-d_1+\ell _1}}{ \sigma ^m(2d-d_1+\ell _1+m)^{2d-d_1+\ell _1+m}}\cdot \frac{\sqrt{2\pi m(2d-d_1+\ell _1)}}{\sqrt{2d-d_1+\ell _1+m}} \end{aligned}$$
(21)

By elementary calculus, one can easily verify that the first factor in Eq. (21) is a decreasing function in \(m\), and the second factor is at least 1. Thus, Eq. (21) is at least

$$\begin{aligned} \frac{ 2^{\tilde{m}}{\tilde{m}}^{\tilde{m}}(2d-d_1+\ell _1)^{2d-d_1+\ell _1} }{\sigma ^{\tilde{m}} (2d-d_1+\ell _1+{\tilde{m}})^{2d-d_1+\ell _1+{\tilde{m}}}} \end{aligned}$$
(22)

where \(\tilde{m} = \frac{d_1-h_1-\ell _1}{2}\).

We claim that Eq. (22) is a decreasing function in \(d_1\). To see this claim, let \(f(d_1)\) denote the logarithm of Eq. (22). It suffices to show that \(f(d_1)\) is a decreasing function in \(d_1\). We have

$$\begin{aligned} f'(d_1) = \frac{1}{2}\ln \frac{2}{\sigma } + \ln \frac{\sqrt{\tilde{m}(2d-d_1+\ell _1+\tilde{m})}}{2d-d_1+\ell _1} \le \frac{1}{2}\ln \frac{2}{\sigma } + \ln \frac{2d-d_1+\ell _1+2\tilde{m}}{2(2d-d_1+\ell _1)}. \end{aligned}$$

Obviously, \(2\le \sigma \). Moreover, \(2d-d_1+\ell _1+2\tilde{m} = 2d - h_1 \le 2(2d-d_1+\ell _1)\) for \(d_1\le d\). So, \(f'(d_1) \le 0\) and in turn \(f(d_1)\) is a decreasing function in \(d_1\). This completes the proof of the claim.

By the claim, Eq. (22) is at least

$$\begin{aligned} \frac{ (d-h_1-\ell _1)^{\frac{d-h_1-\ell _1}{2}}2^{\frac{3d+\ell _1-h_1}{2}} (d+\ell _1)^{d+\ell _1}}{\sigma ^{\frac{d-h_1-\ell _1}{2}} (3d+\ell _1-h_1)^{\frac{3d+\ell _1-h_1}{2}}}. \end{aligned}$$

Hence, by Eqs. (17) and (18), \(\mathrm{Pr}[E_0\wedge E_1\wedge E_2\wedge \cdots ]\) is

$$\begin{aligned} \Omega \left( \frac{ (d-h_1-\ell _1)^{\frac{d-h_1-\ell _1}{2}}h_1^{h_1}(d+\ell _1-h_1)^{d+\ell _1-h_1} 2^{\frac{d-\ell _1+h_1}{2}}}{d\cdot \sigma ^{\frac{d-h_1-\ell _1}{2}}(\sigma -2)^{h_1} (3d+\ell _1-h_1)^{\frac{3d+\ell _1-h_1}{2}}}\right) \end{aligned}$$
(23)

Therefore, if we let \(\alpha = \frac{h_1}{d}\) and \(\epsilon = \frac{\ell _1}{d}\), then

$$\begin{aligned} \mathrm{Pr}[E_0\wedge E_1\wedge E_2\wedge \cdots ] = \Omega \left( \frac{\gamma ^d}{d}\right) , \end{aligned}$$

where \(\gamma \) is as defined in Eq. (16).\(\square \)

Thus, we have the following theorem.

Theorem 7.2

The closest string problem can be solved in \( \mathcal {O}(nL + nd^{2} \cdot ( \frac{1}{\gamma } )^d)\) time with arbitrarily small constant one-sided error, where \(\gamma \) is as defined in Eq. (16).

Proof

Similar to the proof of Theorem 3.2. The only difference is to use Lemma 7.1 instead of Lemma 3.1.\(\square \)

Although it is difficult to find a closed-form expression for the value of \(\frac{1}{\gamma }\), we can perform numerical computation to obtain an approximate upper bound on \(\frac{1}{\gamma }\) for any given \(\sigma \). Row “LargeAlphabet2” in Table 1 shows the results for several specific \(\sigma \). Note that the time bound in Theorem 7.2 is not as good as those in Corollaries 5.4 and 6.3 for small \(\sigma \) (such as \(\sigma =4\)) but is significantly better for large \(\sigma \) (say, \(\sigma =20\) or 50). Furthermore, numerical computation also shows that for every \(\sigma \ge 2\), the bound in Theorem 7.2 is better than the time bound in [7].

For \(\sigma \ge 3\), numerical computation shows that the smaller of the two bounds in Theorems 5.3 and 7.2 is better than the bounds in Corollaries 5.4 and 6.3 and than the bound in Theorem 7.2 alone. For example, when \(\sigma = 4\) (respectively, \(\sigma =20\)), we can show that the smaller of the two bounds in Theorems 5.3 and 7.2 is \( \mathcal {O}\left( nL + nd^{2}\cdot 9.81^d\right) \) (respectively, \( \mathcal {O}\left( nL + nd^{2}\cdot 40.1^d\right) \)).
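
The numerical computation behind these figures can be sketched as follows (our own code, ignoring the polynomial factors \(\sqrt{d}\) and \(d^{2}\)): for each \(\epsilon \), take the smaller of the exponential base of Theorem 5.3 and the base \(1/\gamma \) of Theorem 7.2, with \(\gamma \) from the \(\sigma \ge 3\) branch of Eq. (16), and then maximize over \(\epsilon \).

```python
import numpy as np

def base_thm53(eps, sigma):
    return 2 ** (1 + eps) * sigma / ((1 + eps) ** ((1 + eps) / 2) * (1 - eps) ** ((1 - eps) / 2))

def base_thm72(eps, sigma):
    """1/gamma for the sigma >= 3 branch of Eq. (16), with the inner minimum over
    alpha taken by a grid search."""
    alpha = np.linspace(0.0, 1 - eps, 2001)
    g = (((1 - eps - alpha) / sigma) ** ((1 - eps - alpha) / 2)
         * 2 ** ((1 - eps + alpha) / 2) * alpha ** alpha
         * (1 + eps - alpha) ** (1 + eps - alpha)
         / ((sigma - 2) ** alpha * (3 + eps - alpha) ** ((3 + eps - alpha) / 2)))
    return 1.0 / g.min()

for sigma in (4, 20):
    eps_grid = np.linspace(1e-4, 1 - 1e-4, 2000)
    worst = max(min(base_thm53(e, sigma), base_thm72(e, sigma)) for e in eps_grid)
    print(sigma, round(float(worst), 2))   # approximately 9.81 for sigma=4 and 40.1 for sigma=20
```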