1 Introduction

Consider two strings X and Y. A common supersequence of X and Y is a string S such that X and Y are both subsequences of S. A shortest common supersequence (SCS) of X and Y is a common supersequence of X and Y of minimum length. The Shortest Common Supersequence problem (the SCS problem, in short) is to compute an SCS of X and Y. The SCS problem is a classic problem in theoretical computer science [18, 23, 25]. It is solvable in quadratic time using a standard dynamic-programming approach [13], which also allows computing a shortest common supersequence of any constant number of strings (rather than just two) in polynomial time. For an arbitrary number of input strings, the problem becomes NP-hard [23], even when the strings are binary [25].

A weighted string of length n over some alphabet \(\varSigma \) is a type of uncertain sequence. The uncertainty at any position of the sequence is modeled using a subset of the alphabet (instead of a single letter), with every element of this subset being associated with an occurrence probability; the probabilities are often represented in an \(n \times |\varSigma |\) matrix. These kinds of data are common in various applications where: (i) imprecise data measurements are recorded; (ii) flexible sequence modeling, such as binding profiles of molecular sequences, is required; (iii) observations are private and thus sequences of observations may have artificial uncertainty introduced deliberately [2]. For instance, in computational biology they are known as position weight matrices or position probability matrices [26].

In this paper, we study the Weighted Shortest Common Supersequence problem (the WSCS problem, in short) introduced by Amir et al. [5], which generalizes the SCS problem to weighted strings. In the WSCS problem, we are given two weighted strings \(W_1\) and \(W_2\) and a probability threshold \(\tfrac{1}{z} \), and the task is to compute a shortest (standard) string S such that both \(W_1\) and \(W_2\) match subsequences of S (not necessarily the same one) with probability at least \(\tfrac{1}{z} \). In this work, we present the first efficient algorithm for the WSCS problem.

A related problem is the Weighted Longest Common Subsequence problem (the WLCS problem, in short). It was introduced by Amir et al. [4] and further studied in [14] and, very recently, in [20]. In the WLCS problem, we are also given two weighted strings \(W_1\) and \(W_2\) and a threshold \(\tfrac{1}{z} \) on probability, but the task is to compute the longest (standard) string S such that S matches a subsequence of \(W_1\) with probability at least \(\tfrac{1}{z} \) and S matches a subsequence of \(W_2\) with probability at least \(\tfrac{1}{z} \). For standard strings \(S_1\) and \(S_2\), the length of their shortest common supersequence \(|\textsc {SCS} (S_1,S_2)|\) and the length of their longest common subsequence \(|\textsc {LCS} (S_1,S_2)|\) satisfy the following folklore relation:

$$\begin{aligned} |\textsc {LCS} (S_1,S_2)| + |\textsc {SCS} (S_1,S_2)| = |S_1| + |S_2|. \end{aligned}$$
(1)
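This identity can be verified directly against the textbook quadratic dynamic programs. The following Python sketch (the function names are ours) computes both lengths and checks relation (1) on a few inputs:

```python
def lcs_len(s, t):
    """Length of a longest common subsequence, by the standard DP."""
    n, m = len(s), len(t)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if s[i - 1] == t[j - 1]:
                D[i][j] = D[i - 1][j - 1] + 1
            else:
                D[i][j] = max(D[i - 1][j], D[i][j - 1])
    return D[n][m]

def scs_len(s, t):
    """Length of a shortest common supersequence, by the standard DP."""
    n, m = len(s), len(t)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 or j == 0:
                D[i][j] = i + j
            elif s[i - 1] == t[j - 1]:
                D[i][j] = D[i - 1][j - 1] + 1
            else:
                D[i][j] = 1 + min(D[i - 1][j], D[i][j - 1])
    return D[n][m]

# Relation (1): |LCS| + |SCS| = |S1| + |S2|.
for s, t in [("abcbda", "bdcaba"), ("aaa", "bbb"), ("ab", "ab")]:
    assert lcs_len(s, t) + scs_len(s, t) == len(s) + len(t)
```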

However, no analogous relation connects the WLCS and WSCS problems, even though both problems are NP-complete via similar reductions, which remain valid even when both weighted strings have the same length [4, 5]. In this work, we discover an important difference between the two problems.

Kociumaka et al. [21] introduced a problem called Weighted Consensus, which is a special case of the WSCS problem asking whether the WSCS of two weighted strings of length n has length n, and they showed that the Weighted Consensus problem is NP-complete yet admits an algorithm running in pseudo-polynomial time \(\mathcal {O}(n+\sqrt{z}\log z)\) for constant-sized alphabets. Furthermore, it was shown in [21] that the Weighted Consensus problem cannot be solved in \(\mathcal {O}^*(z^{0.5-\varepsilon })\) time for any \(\varepsilon >0\) unless there is an \(\mathcal {O}^*(2^{(0.5-\varepsilon )n})\)-time algorithm for the Subset Sum problem. Let us recall that the Subset Sum problem, for a set of n integers, asks whether some subset sums up to a given integer. Moreover, the \(\mathcal {O}^*(2^{n/2})\) running time for the Subset Sum problem, achieved by the classic meet-in-the-middle approach of Horowitz and Sahni [15], has not been improved yet despite much effort; see e.g. [6].

Abboud et al. [1] showed that the Longest Common Subsequence problem over constant-sized alphabets cannot be solved in \(\mathcal {O}(n^{2-\varepsilon })\) time for \(\varepsilon >0\) unless the Strong Exponential Time Hypothesis [16, 17, 22] fails. By (1), the same conditional lower bound applies to the SCS problem, and since standard strings are a special case of weighted strings (having one letter occurring with probability equal to 1 at each position), it also applies to the WSCS problem.

The following theorem summarizes the above conditional lower bounds on the WSCS problem.

Theorem 1

(Conditional hardness of the WSCS problem; see [1, 21]). Even in the case of constant-sized alphabets, the Weighted Shortest Common Supersequence problem is NP-complete, and for any \(\varepsilon >0\) it cannot be solved:

  1.

    in \(\mathcal {O}(n^{2-\varepsilon })\) time unless the Strong Exponential Time Hypothesis fails;

  2.

    in \(\mathcal {O}^*(z^{0.5-\varepsilon })\) time unless there is an \(\mathcal {O}^*(2^{(0.5-\varepsilon )n})\)-time algorithm for the Subset Sum problem.

Our Results. We give an algorithm for the WSCS problem with pseudo-polynomial running time that depends polynomially on n and z. Note that such algorithms have already been proposed for several problems on weighted strings: pattern matching [9, 12, 21, 24], indexing [3, 7, 8, 11], and finding regularities [10]. In contrast, we show that no such algorithm is likely to exist for the WLCS problem.

Specifically, we develop an \(\mathcal {O}(n^2 \sqrt{z} \log {z})\)-time algorithm for the WSCS problem in the case of a constant-sized alphabet. This upper bound matches the conditional lower bounds of Theorem 1. We then show that, unless \(P=NP\), the WLCS problem cannot be solved in \(\mathcal {O}(n^{f(z)})\) time for any function \(f(\cdot )\).

Model of Computation. We assume the word RAM model with word size \(w = \varOmega (\log n + \log z)\). We consider the log-probability representation of weighted sequences, that is, we assume that the non-zero probabilities in the weighted sequences and the threshold probability \(\tfrac{1}{z}\) are all of the form \(c^{\frac{p}{2^{dw}}}\), where c and d are constants and p is an integer that fits in \(\mathcal {O}(1)\) machine words.

2 Preliminaries

A weighted string \(W=W[1] \cdots W[n]\) of length \(|W|=n\) over alphabet \(\varSigma \) is a sequence of sets of the form

$$\begin{aligned} W[i] = \{(c,\ \pi ^{(W)}_i(c))\ :\ c \in \varSigma \}. \end{aligned}$$

Here, \(\pi _i^{(W)}(c)\) is the occurrence probability of the letter c at position \(i \in [1\mathinner {.\,.}n]\). These values are non-negative and sum up to 1 for each index i.

By \(W[i \mathinner {.\,.}j]\) we denote the weighted substring \(W[i]\cdots W[j]\); it is called a prefix if \(i=1\) and a suffix if \(j=|W|\).

The matching probability of a string S and a weighted string W with \(|S|=|W|=n\) is

$$\begin{aligned} \mathcal {P}(S,W) \, =\, \prod _{i=1}^n \pi ^{(W)}_i(S[i])\,=\, \prod _{i=1}^n\, \mathcal {P}(S[i]=W[i]). \end{aligned}$$

We say that a (standard) string S matches a weighted string W with probability at least \(\tfrac{1}{z} \), denoted by \(S \approx _{z} W\), if \(\mathcal {P}(S,W) \ge \tfrac{1}{z} \). We also denote

$$\begin{aligned} \mathsf {Matched}_z(W)=\{S \in \varSigma ^n : \mathcal {P}(S,W)\ge \tfrac{1}{z} \}. \end{aligned}$$

For a string S we write \(W\subseteq _{z}S\) if \(S'\approx _{z}W\) for some subsequence \(S'\) of S. Similarly we write \(S\subseteq _{z}W\) if \(S\approx _{z}W'\) for some subsequence \(W'\) of W.
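Concretely, these definitions can be read off the following Python sketch (the list-of-dictionaries representation and the function names are ours): a weighted string is a list of letter-to-probability dictionaries, and \(\mathcal {P}(S,W)\) is a product over positions.

```python
from fractions import Fraction

# A weighted string: one dict per position, mapping each letter of the
# alphabet to its occurrence probability (the values sum to 1).
def match_prob(S, W):
    """P(S, W): the probability that the standard string S matches W."""
    assert len(S) == len(W)
    p = Fraction(1)
    for c, pos in zip(S, W):
        p *= pos.get(c, Fraction(0))
    return p

def matches(S, W, z):
    """S ~_z W: does S match W with probability at least 1/z?"""
    return match_prob(S, W) >= Fraction(1) / Fraction(z)

W = [{"a": Fraction(1)}, {"a": Fraction(1, 5), "b": Fraction(4, 5)}]
assert match_prob("ab", W) == Fraction(4, 5)
assert matches("ab", W, 2) and not matches("aa", W, 2)
```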

Our main problem can be stated as follows.

WSCS Problem

Input: Two weighted strings \(W_1\) and \(W_2\) and a threshold probability \(\tfrac{1}{z} \).

Output: \(\textsc {WSCS} (W_1,W_2,z)\): a shortest (standard) string S such that \(W_1 \subseteq _{z} S\) and \(W_2 \subseteq _{z} S\).

Example 2

If the alphabet is \(\varSigma =\{\mathtt {a},\mathtt {b}\}\), then we write the weighted string as \(W=[p_1,p_2,\ldots ,p_n]\), where \(p_i=\pi ^{(W)}_i(\mathtt {a})\); in other words, \(p_i\) is the probability that the ith letter W[i] is \(\mathtt {a}\). For

$$\begin{aligned} W_1=[1,\, 0.2,\, 0.5], \; W_2=[0.2,\, 0.5,\, 1],\text { and }z=\tfrac{5}{2}, \end{aligned}$$

we have \(\textsc {WSCS} (W_1,\, W_2,\,z)=\mathtt {baba}\) since \(W_1\subseteq _{z} \mathtt {b}{{\underline{\mathtt{{aba}}}}},\, W_2\subseteq _{z} {{{\underline{\mathtt{{ba}}}}}\mathtt {b}{{\underline{\mathtt{{a}}}}}}\) (the witness subsequences are underlined), and \(\mathtt {baba}\) is a shortest string with this property.
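The example can be verified mechanically. The sketch below (our code, reusing the list-of-dictionaries representation with exact Fractions) computes, for a standard string S and a weighted string W, the best matching probability over all subsequences of S of length |W|, and confirms that \(\mathtt {baba}\) works while no string of length 3 does:

```python
from fractions import Fraction
from itertools import product

def best_subseq_prob(S, W):
    """max P(S', W) over subsequences S' of S with |S'| = |W|."""
    m = len(W)
    D = [Fraction(1)] + [Fraction(0)] * m  # D[j]: best for prefix W[1..j]
    for c in S:
        for j in range(m, 0, -1):  # descending j: each letter used once
            D[j] = max(D[j], D[j - 1] * W[j - 1].get(c, Fraction(0)))
    return D[m]

def binary_ws(ps):
    """Binary weighted string from the probabilities of letter 'a'."""
    return [{"a": Fraction(p), "b": 1 - Fraction(p)} for p in ps]

W1 = binary_ws(["1", "1/5", "1/2"])
W2 = binary_ws(["1/5", "1/2", "1"])
thr = Fraction(2, 5)  # 1/z for z = 5/2
assert best_subseq_prob("baba", W1) >= thr   # W1 embeds into baba
assert best_subseq_prob("baba", W2) >= thr   # W2 embeds into baba
# No length-3 candidate satisfies both constraints, so |WSCS| = 4.
assert all(best_subseq_prob("".join(s), W1) < thr
           or best_subseq_prob("".join(s), W2) < thr
           for s in product("ab", repeat=3))
```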

We first show a simple solution to WSCS based on the following facts.

Observation 3

(Amir et al. [3]). Every weighted string W matches at most z standard strings with probability at least \(\tfrac{1}{z}\), i.e., \(|\mathsf {Matched}_z(W)| \le z\).

Lemma 4

The set \(\mathsf {Matched}_z(W)\) can be computed in \(\mathcal {O}(nz)\) time if \(|\varSigma |=\mathcal {O}(1)\).

Proof

If \(S \in \mathsf {Matched}_z(W)\), then \(S[1\mathinner {.\,.}i] \in \mathsf {Matched}_z(W[1 \mathinner {.\,.}i])\) for every index i. Hence, the algorithm computes the sets \(\mathsf {Matched}_z\) for successive prefixes of W. Each string \(S\in \mathsf {Matched}_z(W[1 \mathinner {.\,.}i])\) is represented as a triple \((c,p,S')\), where \(c=S[i]\) is the last letter of S, \(p = \mathcal {P}(S,W[1 \mathinner {.\,.}i])\), and \(S'=S[1\mathinner {.\,.}i-1]\) points to an element of \(\mathsf {Matched}_z(W[1 \mathinner {.\,.}i-1])\). Each such triple occupies \(\mathcal {O}(1)\) space.

Assume that \(\mathsf {Matched}_z(W[1 \mathinner {.\,.}i-1])\) has already been computed. Then, for every \(S'=(c',p',S'') \in \mathsf {Matched}_z(W[1 \mathinner {.\,.}i-1])\) and every \(c \in \varSigma \), if \(p :=p'\cdot \pi ^{(W)}_i(c) \ge \tfrac{1}{z} \), then the algorithm adds \((c,p,S')\) to \(\mathsf {Matched}_z(W[1 \mathinner {.\,.}i])\).

By Observation 3, \(|\mathsf {Matched}_z(W[1 \mathinner {.\,.}i-1])| \le z\) and \(|\mathsf {Matched}_z(W[1 \mathinner {.\,.}i])| \le z\). Hence, the \(\mathcal {O}(nz)\) time complexity follows.    \(\square \)
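The construction from the proof can be sketched as follows (our code; for brevity, each string is stored explicitly rather than as the constant-space triples used in the proof):

```python
from fractions import Fraction

def matched(W, z):
    """Matched_z(W): all standard strings S with P(S, W) >= 1/z."""
    thr = Fraction(1) / Fraction(z)
    cur = {"": Fraction(1)}  # strings matching the empty prefix
    for pos in W:
        nxt = {}
        for s, p in cur.items():
            for c, pc in pos.items():
                if p * pc >= thr:  # prune: probabilities never increase
                    nxt[s + c] = p * pc
        cur = nxt
    return cur

W = [{"a": Fraction(1, 2), "b": Fraction(1, 2)},
     {"a": Fraction(3, 4), "b": Fraction(1, 4)}]
assert matched(W, 4) == {"aa": Fraction(3, 8), "ba": Fraction(3, 8)}
```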

Proposition 5

The WSCS problem can be solved in \(\mathcal {O}(n^2z^2)\) time if \(|\varSigma |=\mathcal {O}(1)\).

Proof

The algorithm builds \(\mathsf {Matched}_z(W_1)\) and \(\mathsf {Matched}_z(W_2)\) using Lemma 4. These sets have size at most z by Observation 3. The result is the shortest string in

$$\begin{aligned} \{\textsc {SCS} (S_1,S_2)\,:\,S_1 \in \mathsf {Matched}_z(W_1),\,S_2\in \mathsf {Matched}_z(W_2)\}. \end{aligned}$$

Recall that the SCS of two strings can be computed in \(\mathcal {O}(n^2)\) time using a standard dynamic programming algorithm [13].    \(\square \)

We substantially improve upon this upper bound in Sects. 3 and 4.

2.1 Meet-in-the-Middle Technique

In the decision version of the Knapsack problem, we are given n items with weights \(w_i\) and values \(v_i\), and we seek a subset of items with total weight at most W and total value at least V. In the classic meet-in-the-middle solution to the Knapsack problem by Horowitz and Sahni [15], the items are divided into two sets \(S_1\) and \(S_2\) of sizes roughly \(\frac{1}{2}n\). Initially, the total value and the total weight are computed for every subset of elements of each set \(S_i\). This results in two sets A and B, each with \(\mathcal {O}(2^{n/2})\) pairs of numbers. The algorithm needs to pick a pair from each set such that the first components of the pairs sum up to at most W and the second components sum up to at least V. This problem can be solved in linear time w.r.t. the set sizes provided that the pairs in both sets A and B are sorted by the first component.
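A compact sketch of this scheme for the Knapsack decision problem (our code; it uses binary search over the weight-sorted half rather than a strictly linear merge):

```python
from itertools import combinations
import bisect

def knapsack_decision(items, W, V):
    """Is there a subset of items (pairs (weight, value)) with total
    weight <= W and total value >= V?  Meet-in-the-middle enumeration."""
    def subset_sums(part):
        out = []
        for r in range(len(part) + 1):
            for comb in combinations(part, r):
                out.append((sum(w for w, _ in comb),
                            sum(v for _, v in comb)))
        return sorted(out)  # sorted by total weight

    half = len(items) // 2
    A, B = subset_sums(items[:half]), subset_sums(items[half:])
    best = []  # best[k]: maximum value among B[0..k]
    m = 0
    for _, v in B:
        m = max(m, v)
        best.append(m)
    for w, v in A:
        if w > W:
            break  # A is sorted by weight
        k = bisect.bisect_right(B, (W - w, float("inf"))) - 1
        if k >= 0 and v + best[k] >= V:
            return True
    return False

items = [(3, 4), (4, 5), (2, 3), (5, 6)]
assert knapsack_decision(items, 7, 9)        # pick (3, 4) and (4, 5)
assert not knapsack_decision(items, 7, 10)
```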

Let us introduce a modified version of this problem.

Merge Problem

Input: Two sets A and B of pairs of numbers and a threshold w.

Output: Are there pairs \((x,y) \in A\) and \((x',y') \in B\) such that \(x \cdot x' \ge w\) and \(y \cdot y' \ge w\)?

A linear-time solution to this problem is essentially the same as for the merge step in the meet-in-the-middle solution for Knapsack. However, for completeness we prove the following lemma (see also [21, Lemma 5.6]):

Lemma 6

(Horowitz and Sahni [15]). The \(\textsc {Merge} \) problem can be solved in linear time assuming that the points in A and B are sorted by the first component.

Proof

A pair \((x,y)\) is irrelevant if there is another pair \((x',y')\) in the same set such that \(x' \ge x\) and \(y' \ge y\). Observe that removing an irrelevant pair from A or B leads to an equivalent instance of the \(\textsc {Merge} \) problem.

Since the pairs in A and B are sorted by the first component, a single scan through these pairs suffices to remove all irrelevant elements. Next, for each \((x,y)\in A\), the algorithm computes \((x',y')\in B\) such that \(x'\ge w / x\) and additionally \(x'\) is smallest possible. As the irrelevant elements have been removed from B, this pair also maximizes \(y'\) among all pairs satisfying \(x'\ge w / x\). If the elements \((x,y)\) are processed by non-decreasing values x, the values \(x'\) do not increase, and thus the pairs \((x',y')\) can be computed in \(\mathcal {O}(|A|+|B|)\) time in total.    \(\square \)
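The scan from the proof can be sketched as follows (our code; the multiplicative-threshold interface, with both products compared against w, matches how Merge is invoked later):

```python
def merge(A, B, w):
    """Decide whether some (x, y) in A and (x2, y2) in B satisfy
    x * x2 >= w and y * y2 >= w; A and B are sorted by first component."""
    def prune(P):
        out = []  # Pareto front: x increasing, y strictly decreasing
        for x, y in P:
            while out and out[-1][1] <= y:  # out[-1] is dominated
                out.pop()
            out.append((x, y))
        return out

    A, B = prune(A), prune(B)
    if not A or not B:
        return False
    j = len(B) - 1
    for x, y in A:
        # Move to the smallest x2 with x * x2 >= w; after pruning it has
        # the largest y2 among valid candidates.  j only decreases as x grows.
        while j > 0 and x * B[j - 1][0] >= w:
            j -= 1
        if x * B[j][0] >= w and y * B[j][1] >= w:
            return True
    return False

A = [(0.5, 0.5), (0.8, 0.3)]
B = [(0.4, 0.9), (0.9, 0.2)]
assert merge(A, B, 0.25)       # (0.8, 0.3) with (0.4, 0.9)
assert not merge(A, B, 0.4)
```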

3 Dynamic Programming Algorithm for WSCS

Our algorithm is based on dynamic programming. We start with a less efficient procedure and then improve it in the next section. Henceforth, we only consider computing the length of the WSCS; an actual common supersequence of this length can be recovered from the dynamic programming using a standard approach (storing the parent of each state).

For a weighted string W, we introduce a data structure that stores, for every index i, the set \(\{\mathcal {P}(S,W[1 \mathinner {.\,.}i])\,:\,S \in \mathsf {Matched}_z(W[1 \mathinner {.\,.}i])\}\) represented as an array of size at most z (by Observation 3) with entries in increasing order. This data structure is further denoted as \( Freq _i(W,z)\). Moreover, for each element \(p \in Freq _{i+1}(W,z)\) and each letter \(c \in \varSigma \), a pointer to \(p'=p{/} \pi ^{(W)}_{i+1}(c)\) in \( Freq _{i}(W,z)\) is stored provided that \(p' \in Freq _{i}(W,z)\). The proof of the next lemma is essentially the same as that of Lemma 4.

Lemma 7

For a weighted string W of length n, the arrays \( Freq _i(W,z)\), with \(i\in [1\mathinner {.\,.}n]\), can be constructed in \(\mathcal {O}(nz)\) total time if \(|\varSigma |=\mathcal {O}(1)\).

Proof

Assume that \( Freq _i(W,z)\) is computed. For every \(c \in \varSigma \), we create a list

$$\begin{aligned} L_c=\{p \cdot \pi ^{(W)}_{i+1}(c)\,:\,p \in Freq _i(W,z),\, p\cdot \pi ^{(W)}_{i+1}(c) \ge \tfrac{1}{z} \}. \end{aligned}$$

The lists are sorted since \( Freq _i(W,z)\) was sorted. Then \( Freq _{i+1}(W,z)\) can be computed by merging all the lists \(L_c\) (removing duplicates). This can be done in \(\mathcal {O}(z)\) time since \(|\varSigma |=\mathcal {O}(1)\). The desired pointers can be computed within the same time complexity.    \(\square \)
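A sketch of the construction (our code; for brevity it deduplicates and re-sorts, which costs an extra logarithmic factor over the linear-time merge of the pre-sorted lists described in the proof):

```python
from fractions import Fraction

def freq_arrays(W, z):
    """Freq_i(W, z) for i = 0..n: the sorted array of probabilities
    P(S, W[1..i]) over S in Matched_z(W[1..i])."""
    thr = Fraction(1) / Fraction(z)
    freq = [[Fraction(1)]]  # i = 0: the empty prefix
    for pos in W:
        # Each list {p * pi_c : p in freq[-1]} is already sorted; here we
        # simply dedupe and re-sort instead of merging in linear time.
        freq.append(sorted({p * pc for pc in pos.values()
                            for p in freq[-1] if p * pc >= thr}))
    return freq

W = [{"a": Fraction(1, 2), "b": Fraction(1, 2)},
     {"a": Fraction(3, 4), "b": Fraction(1, 4)}]
F = freq_arrays(W, 8)
assert F[1] == [Fraction(1, 2)]
assert F[2] == [Fraction(1, 8), Fraction(3, 8)]
```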

Let us extend the \(\textsc {WSCS} \) problem in the following way:

WSCS' Problem

Input: Weighted strings \(W_1\) and \(W_2\), an integer \(\ell \), and probabilities p and q.

Output: \(\textsc {WSCS} '(W_1,W_2,\ell ,p,q)\): is there a (standard) string S of length \(\ell \) with subsequences \(S_1\) and \(S_2\) such that \(\mathcal {P}(S_1,W_1)=p\) and \(\mathcal {P}(S_2,W_2)=q\)?

In the following, a state in the dynamic programming denotes a quadruple \((i,j,\ell ,p)\), where \(i\in [0\mathinner {.\,.}|W_1|]\), \(j\in [0\mathinner {.\,.}|W_2|]\), \(\ell \in [0\mathinner {.\,.}|W_1|+|W_2|]\), and \(p\in Freq _i(W_1,z)\).

Observation 8

There are \(\mathcal {O}(n^3z)\) states.

In the dynamic programming, for all states \((i,j,\ell ,p)\), we compute

$$\begin{aligned} \mathbf {DP}[i,j,\ell ,p] = \max \{q\,:\,\textsc {WSCS} '(W_1[1 \mathinner {.\,.}i],W_2[1 \mathinner {.\,.}j],\ell ,p,q)=\mathbf {true}\}. \end{aligned}$$
(2)

Let us denote \(\pi ^k_i(c)=\pi ^{(W_k)}_i(c)\). Initially, the array \(\mathbf {DP}\) is filled with zeroes, except that the values \(\mathbf {DP}[0,0,\ell ,1]\) for \(\ell \in [0\mathinner {.\,.}|W_1|+|W_2|]\) are set to 1. In order to cover corner cases, we assume that \(\pi _0^{1}(c)=\pi _0^{2}(c)=1\) for any \(c \in \varSigma \) and that \(\mathbf {DP}[i,j,\ell ,p]=0\) if \((i,j,\ell ,p)\) is not a state. The procedure \(\mathsf {Compute}\) implementing the dynamic-programming algorithm is shown as Algorithm 1.

Algorithm 1. The procedure \(\mathsf {Compute}(W_1,W_2,z)\) (pseudocode not reproduced here).
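A direct Python prototype of the procedure (our code, under the list-of-dictionaries representation with exact Fractions; to keep it short, it returns only the length \(|\textsc {WSCS} (W_1,W_2,z)|\) and groups the states of each length \(\ell \) into one layer keyed by (i, j, p)):

```python
from fractions import Fraction

def wscs_length(W1, W2, z):
    """|WSCS(W1, W2, z)| via DP over states (i, j, l, p) storing max q."""
    thr = Fraction(1) / Fraction(z)
    n1, n2 = len(W1), len(W2)
    sigma = {c for pos in W1 + W2 for c in pos}
    # layer: {(i, j): {p: max q}} for supersequences of the current length
    layer = {(0, 0): {Fraction(1): Fraction(1)}}
    for l in range(1, n1 + n2 + 1):
        nxt = {}

        def relax(i, j, p, q):
            if p >= thr and q >= thr:  # prune: prefix probabilities only shrink
                cell = nxt.setdefault((i, j), {})
                if q > cell.get(p, Fraction(0)):
                    cell[p] = q

        for (i, j), cell in layer.items():
            for p, q in cell.items():
                for c in sigma:  # append letter c to the supersequence
                    x = W1[i].get(c, Fraction(0)) if i < n1 else None
                    y = W2[j].get(c, Fraction(0)) if j < n2 else None
                    if x is not None:
                        relax(i + 1, j, p * x, q)          # c consumed by S1
                    if y is not None:
                        relax(i, j + 1, p, q * y)          # c consumed by S2
                    if x is not None and y is not None:
                        relax(i + 1, j + 1, p * x, q * y)  # consumed by both
        layer = nxt
        if (n1, n2) in layer:  # some p survives together with q >= 1/z
            return l
    return None  # no common supersequence reaches the threshold
```

On Example 2 this returns 4, matching \(\textsc {WSCS} (W_1,W_2,\tfrac{5}{2})=\mathtt {baba}\).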

The correctness of the algorithm is implied by the following lemma:

Lemma 9

(Correctness of Algorithm 1). The array \(\mathbf {DP}\) satisfies (2). In particular, \(\mathsf {Compute}(W_1,W_2,z)=\textsc {WSCS} (W_1,W_2,z)\).

Proof

The proof that \(\mathbf {DP}\) satisfies (2) goes by induction on \(i+j\). The base case of \(i+j=0\) holds trivially, and the cases where \(i=0\) or \(j=0\) are easy to verify. Let us henceforth assume that \(i>0\) and \(j>0\).

We first show that

$$\begin{aligned} \mathbf {DP}[i,j,\ell ,p] \le \max \{q\,:\,\textsc {WSCS} '(W_1[1 \mathinner {.\,.}i],W_2[1 \mathinner {.\,.}j],\ell ,p,q)=\mathbf {true}\}. \end{aligned}$$

The value \(q=\mathbf {DP}[i,j,\ell ,p]\) was derived from \(\mathbf {DP}[i-1,j,\ell -1,p{/}x]=q\), or \(\mathbf {DP}[i,j-1,\ell -1,p]=q{/}y\), or \(\mathbf {DP}[i-1,j-1,\ell -1,p{/}x]=q{/}y\), where \(x = \pi _i^1(c)\) and \(y = \pi _j^2(c)\) for some \(c\in \varSigma \). In the first case, by the inductive hypothesis, there exists a string T that is a solution to \(\textsc {WSCS} '(W_1[1 \mathinner {.\,.}i-1],W_2[1 \mathinner {.\,.}j],\ell -1,p{/}x,q)\). That is, T has subsequences \(T_1\) and \(T_2\) such that

$$\begin{aligned} \mathcal {P}(T_1,W_1[1 \mathinner {.\,.}i-1])=p{/}x\quad \text {and}\quad \mathcal {P}(T_2,W_2[1 \mathinner {.\,.}j])=q. \end{aligned}$$

Then, for \(S=T c\), \(S_1=T_1c\), and \(S_2=T_2\), we indeed have

$$\begin{aligned} \mathcal {P}(S_1,W_1[1 \mathinner {.\,.}i])=p\quad \text {and}\quad \mathcal {P}(S_2,W_2[1 \mathinner {.\,.}j])=q. \end{aligned}$$

The two remaining cases are analogous.

Let us now show that

$$\begin{aligned} \mathbf {DP}[i,j,\ell ,p] \ge \max \{q\,:\,\textsc {WSCS} '(W_1[1 \mathinner {.\,.}i],W_2[1 \mathinner {.\,.}j],\ell ,p,q)=\mathbf {true}\}. \end{aligned}$$

Assume that a string S is a solution to \(\textsc {WSCS} '(W_1[1 \mathinner {.\,.}i],W_2[1 \mathinner {.\,.}j],\ell ,p,q)\). Let \(S_1\) and \(S_2\) be the subsequences of S such that \(\mathcal {P}(S_1,W_1[1 \mathinner {.\,.}i])=p\) and \(\mathcal {P}(S_2,W_2[1 \mathinner {.\,.}j])=q\).

Let us first consider the case that \(S_1[i] = S[\ell ] \ne S_2[j]\). Then \(T_1 = S_1[1\mathinner {.\,.}i-1]\) and \(T_2 = S_2\) are subsequences of \(T = S[1\mathinner {.\,.}\ell -1]\). We then have

$$\begin{aligned} p' := \mathcal {P}(T_1,W_1[1 \mathinner {.\,.}i-1]) = p/\pi _{i}^1(S_1[i]). \end{aligned}$$

By the inductive hypothesis, \(\mathbf {DP}[i-1,j,\ell -1,p'] \ge q\). Hence, \(\mathbf {DP}[i,j,\ell ,p] \ge q\) because \(\mathbf {DP}[i-1,j,\ell -1,p']\) is present as the second argument of the maximum in the dynamic programming algorithm for \(c=S[\ell ]\).

The cases that \(S_1[i] \ne S[\ell ] = S_2[j]\) and that \(S_1[i] = S[\ell ] = S_2[j]\) rely on the values \(\mathbf {DP}[i,j-1,\ell -1,p] \ge q{/}y\) and \(\mathbf {DP}[i-1,j-1,\ell -1,p{/}x] \ge q{/}y\), respectively.

Finally, the case that \(S_1[i] \ne S[\ell ] \ne S_2[j]\) is reduced to one of the previous cases by changing \(S[\ell ]\) to \(S_1[i]\) so that S is still a supersequence of \(S_1\) and \(S_2\) and a solution to \(\textsc {WSCS} '(W_1[1 \mathinner {.\,.}i],W_2[1 \mathinner {.\,.}j],\ell ,p,q)\).    \(\square \)

Proposition 10

The WSCS problem can be solved in \(\mathcal {O}(n^3z)\) time if \(|\varSigma |=\mathcal {O}(1)\).

Proof

The correctness follows from Lemma 9. As noted in Observation 8, the dynamic programming has \(\mathcal {O}(n^3 z)\) states. The number of transitions from a single state is constant provided that \(|\varSigma | = \mathcal {O}(1)\).

Before running the dynamic-programming algorithm, we construct the data structures \( Freq _i(W_1,z)\) for all \(i\in [1\mathinner {.\,.}n]\) using Lemma 7. The last dimension in the \(\mathbf {DP}[i,j,\ell ,p]\) array can then be stored as a position in \( Freq _i(W_1,z)\). The pointers in the arrays \( Freq _i\) are used to follow transitions.    \(\square \)

4 Improvements

4.1 First Improvement: Bounds on \(\ell \)

Our approach here is to reduce the number of states \((i,j,\ell ,p)\) in Algorithm 1 from \(\mathcal {O}(n^3z)\) to \(\mathcal {O}(n^2z\log z)\). This is done by limiting the number of values of \(\ell \) considered for each pair of indices i, j from \(\mathcal {O}(n)\) to \(\mathcal {O}(\log z)\).

For a weighted string W, we define \(\mathcal {H}(W)\) as a standard string generated by taking the most probable letter at each position, breaking ties arbitrarily. The string \(\mathcal {H}(W)\) is also called the heavy string of W. By \(d_H(S,T)\) we denote the Hamming distance of strings S and T. Let us recall an observation from [21].

Observation 11

([21, Observation 4.3]). If \(S \approx _{z} W\) for a string S and a weighted string W, then \(d_H(S,\mathcal {H}(W))\le \log _2 z\).

The lemma below follows from Observation 11.

Lemma 12

If strings \(S_1\) and \(S_2\) satisfy \(S_1 \approx _{z} W_1\) and \(S_2 \approx _{z} W_2\), then

$$\begin{aligned} |\textsc {SCS} (S_1,S_2) - \textsc {SCS} (\mathcal {H}(W_1),\mathcal {H}(W_2))| \le 2\log _2 z. \end{aligned}$$

Proof

By Observation 11,

$$\begin{aligned} d_H(S_1,\mathcal {H}(W_1)) \le \log _2 z \quad \text {and}\quad d_H(S_2,\mathcal {H}(W_2)) \le \log _2 z. \end{aligned}$$

Due to the relation (1) between \(\textsc {LCS} \) and \(\textsc {SCS} \), it suffices to show the following.

Claim

Let \(S_1,H_1,S_2,H_2\) be strings such that \(|S_1|=|H_1|\) and \(|S_2|=|H_2|\). If \(d_H(S_1,H_1)\le d\) and \(d_H(S_2,H_2) \le d\), then \(|\textsc {LCS} (S_1,S_2) - \textsc {LCS} (H_1,H_2)| \le 2d\).

Proof

Notice that if \(S_1',S_2'\) are strings resulting from \(S_1,S_2\) by removing up to d letters from each of them, then \(\textsc {LCS} (S_1',S_2')\ge \textsc {LCS} (S_1,S_2)-2d\).

We now create strings \(S_k'\) for \(k=1,2\), by removing from \(S_k\) letters at positions i such that \(S_k[i]\ne H_k[i]\). Then, according to the observation above, we have

$$\begin{aligned} \textsc {LCS} (S_1',S_2')\ge \textsc {LCS} (S_1,S_2)-2d. \end{aligned}$$

Any common subsequence of \(S_1'\) and \(S_2'\) is also a common subsequence of \(H_1\) and \(H_2\) since \(S_1'\) and \(S_2'\) are subsequences of \(H_1\) and \(H_2\), respectively. Consequently,

$$\begin{aligned} \textsc {LCS} (H_1,H_2)\ge \textsc {LCS} (S_1,S_2)-2d. \end{aligned}$$

In a symmetric way, we can show that \(\textsc {LCS} (S_1,S_2)\ge \textsc {LCS} (H_1,H_2) - 2d\). This completes the proof of the claim.    \(\square \)

We apply the claim for \(H_1=\mathcal {H}(W_1)\), \(H_2=\mathcal {H}(W_2)\), and \(d=\log _2 z\).    \(\square \)

Let us make the following simple observation.

Observation 13

If \(S=\textsc {WSCS} (W_1,W_2,z)\), then \(S=\textsc {SCS} (S_1,S_2)\) for some strings \(S_1\) and \(S_2\) such that \(W_1 \subseteq _{z} S_1\) and \(W_2 \subseteq _{z} S_2\).

Using Lemma 12, we refine the previous algorithm as shown in Algorithm 2.

Algorithm 2. The procedure \(\mathsf {Improved1}(W_1,W_2,z)\) (pseudocode not reproduced here).
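The new ingredient of Algorithm 2, the table of candidate length ranges, can be sketched as follows (our code and naming; T[i][j] holds the SCS length of the heavy prefixes and each window has radius \(\lfloor 2\log _2 z\rfloor \) from Lemma 12):

```python
import math

def heavy(W):
    """H(W): a most probable letter at each position (ties arbitrary)."""
    return "".join(max(pos, key=pos.get) for pos in W)

def length_windows(W1, W2, z):
    """L[i][j]: candidate supersequence lengths for prefixes of lengths
    i and j, centered at the SCS length of the heavy prefixes."""
    H1, H2 = heavy(W1), heavy(W2)
    n1, n2 = len(H1), len(H2)
    T = [[0] * (n2 + 1) for _ in range(n1 + 1)]  # SCS lengths of prefixes
    for i in range(n1 + 1):
        for j in range(n2 + 1):
            if i == 0 or j == 0:
                T[i][j] = i + j
            elif H1[i - 1] == H2[j - 1]:
                T[i][j] = T[i - 1][j - 1] + 1
            else:
                T[i][j] = 1 + min(T[i - 1][j], T[i][j - 1])
    d = math.floor(2 * math.log2(z))  # window radius from Lemma 12
    return [[range(max(0, T[i][j] - d), T[i][j] + d + 1)
             for j in range(n2 + 1)] for i in range(n1 + 1)]

W1 = [{"a": 0.9, "b": 0.1}, {"a": 0.2, "b": 0.8}]  # H(W1) = "ab"
W2 = [{"a": 0.3, "b": 0.7}, {"a": 0.6, "b": 0.4}]  # H(W2) = "ba"
L = length_windows(W1, W2, 4)
assert list(L[2][2]) == list(range(0, 8))  # |SCS("ab","ba")| = 3, radius 4
```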

Lemma 14

(Correctness of Algorithm 2). For every state \((i,j,\ell ,p)\), an inequality \(\mathbf {DP}'[i,j,\ell ,p] \le \mathbf {DP}[i,j,\ell ,p]\) holds. Moreover, if \(S=\textsc {SCS} (S_1,S_2)\), \(|S|=\ell \), \(\mathcal {P}(S_1,W_1[1 \mathinner {.\,.}i])=p \ge \tfrac{1}{z} \) and \(\mathcal {P}(S_2,W_2[1 \mathinner {.\,.}j])=q \ge \tfrac{1}{z} \), then \(\mathbf {DP}'[i,j,\ell ,p] \ge q\). Consequently, \(\mathsf {Improved1}(W_1,W_2,z)=\textsc {WSCS} (W_1,W_2,z)\).

Proof

A simple induction on \(i+j\) shows that the array \(\mathbf {DP}'\) is bounded from above by \(\mathbf {DP}\). This is because Algorithm 2 is restricted to a subset of the states considered by Algorithm 1, and because \(\mathbf {DP}'[i,j,\ell ,p]\) is assumed to be 0 while \(\mathbf {DP}[i,j,\ell ,p]\ge 0\) for states \((i,j,\ell ,p)\) ignored in Algorithm 2.

We prove the second part of the statement also by induction on \(i+j\). The base cases satisfying \(i=0\) or \(j=0\) can be verified easily, so let us henceforth assume that \(i>0\) and \(j>0\).

First, consider the case that \(S_1[i] = S[\ell ] \ne S_2[j]\). Let \(T=S[1\mathinner {.\,.}\ell -1]\) and \(T_1=S_1[1\mathinner {.\,.}i-1]\). We then have

$$\begin{aligned} p' := \mathcal {P}(T_1,W_1[1 \mathinner {.\,.}i-1]) = p/\pi _{i}^1(S_1[i]). \end{aligned}$$

Claim

If \(S_1[i] = S[\ell ] \ne S_2[j]\), then \(T=\textsc {SCS} (T_1,S_2)\).

Proof

Let us first show that T is a common supersequence of \(T_1\) and \(S_2\). Indeed, if \(T_1\) was not a subsequence of T, then \(T_1 S_1[i] = S_1\) would not be a subsequence of \(T S_1[i] = S\), and if \(S_2\) was not a subsequence of T, then it would not be a subsequence of \(T S_1[i] = S\) since \(S_1[i] \ne S_2[j]\).

Finally, if \(T_1\) and \(S_2\) had a common supersequence \(T'\) shorter than T, then \(T' S_1[i]\) would be a common supersequence of \(S_1\) and \(S_2\) shorter than S.    \(\square \)

By the claim and the inductive hypothesis, \(\mathbf {DP}'[i-1,j,\ell -1,p'] \ge q\). Hence, \(\mathbf {DP}'[i,j,\ell ,p] \ge q\) due to the presence of the second argument of the maximum in the dynamic programming algorithm for \(c=S[\ell ]\). Note that \((i,j,\ell ,p)\) is a state in Algorithm 2 since \(\ell \in L[i,j]\) follows from Lemma 12.

The cases that \(S_1[i] \ne S[\ell ] = S_2[j]\) and that \(S_1[i] = S[\ell ] = S_2[j]\) use the values \(\mathbf {DP}'[i,j-1,\ell -1,p] \ge q{/}y\) and \(\mathbf {DP}'[i-1,j-1,\ell -1,p{/}x] \ge q{/}y\), respectively. Finally, the case that \(S_1[i] \ne S[\ell ] \ne S_2[j]\) is impossible as \(S=\textsc {SCS} (S_1,S_2)\).    \(\square \)

Example 15

Let \(W_1=[1,0]\), \(W_2=[0]\) (using the notation from Example 2), and \(z \ge 1\). The only strings that match \(W_1\) and \(W_2\) are \(S_1=\mathtt {ab}\) and \(S_2=\mathtt {b}\), respectively. We have \(\mathbf {DP}[2,1,3,1]=1\) which corresponds, in particular, to a solution \(S=\mathtt {abb}\) which is not an SCS of \(S_1\) and \(S_2\). However, \(\mathbf {DP}[2,1,2,1]=\mathbf {DP}'[2,1,2,1]=1\) which corresponds to \(S=\mathtt {ab}=\textsc {SCS} (S_1,S_2)\).

Proposition 16

The WSCS problem can be solved in \(\mathcal {O}(n^2z \log z)\) time if \(|\varSigma |=\mathcal {O}(1)\).

Proof

The correctness of the algorithm follows from Lemma 14. The number of states is now \(\mathcal {O}(n^2 z\log z)\) and thus so is the number of considered transitions.    \(\square \)

4.2 Second Improvement: Meet in the Middle

The second improvement is to apply a meet-in-the-middle approach, which is possible due to the following observation resembling Observation 6.6 in [21].

Observation 17

If \(S \approx _{z} W\) for a string S and weighted string W of length n, then there exists a position \(i \in [1\mathinner {.\,.}n]\) such that

$$\begin{aligned} S[1 \mathinner {.\,.}i-1] \approx _{\sqrt{z}} W[1 \mathinner {.\,.}i-1]\quad \text {and}\quad S[i+1 \mathinner {.\,.}n] \approx _{\sqrt{z}} W[i+1 \mathinner {.\,.}n]. \end{aligned}$$

Proof

Select i as the maximum index with \(S[1 \mathinner {.\,.}i-1] \approx _{\sqrt{z}} W[1 \mathinner {.\,.}i-1]\).    \(\square \)

We first use dynamic programming to compute two arrays, \(\overrightarrow{\mathbf {DP}}\) and \(\overleftarrow{\mathbf {DP}}\). The array \(\overrightarrow{\mathbf {DP}}\) contains a subset of states from \(\mathbf {DP}'\); namely the ones that satisfy \(p \ge \frac{1}{\sqrt{z}}\). The array \(\overleftarrow{\mathbf {DP}}\) is an analogous array defined for suffixes of \(W_1\) and \(W_2\). Formally, we compute \(\overrightarrow{\mathbf {DP}}\) for the reversals of \(W_1\) and \(W_2\), denoted as \(\overrightarrow{\mathbf {DP}}^R\), and set \(\overleftarrow{\mathbf {DP}}[i,j,\ell ,p] = \overrightarrow{\mathbf {DP}}^R[|W_1|+1-i,|W_2|+1-j,\ell ,p]\). Proposition 16 yields

Observation 18

Arrays \(\overrightarrow{\mathbf {DP}}\) and \(\overleftarrow{\mathbf {DP}}\) can be computed in \(\mathcal {O}(n^2 \sqrt{z} \log z)\) time.

Henceforth, we consider only the simpler case in which there exists a solution S to \(\textsc {WSCS} (W_1,W_2,z)\) with a decomposition \(S=S_L\cdot S_R\) such that

$$\begin{aligned} W_1[1 \mathinner {.\,.}i] \subseteq _{\sqrt{z}} S_L \quad \text {and}\quad W_1[i+1 \mathinner {.\,.}|W_1|] \subseteq _{\sqrt{z}} S_R \end{aligned}$$
(3)

holds for some \(i\in [0\mathinner {.\,.}|W_1|]\).

In the pseudocode, we use the array L[i, j] from the first improvement, denoted here as \(\overrightarrow{L}[i,j]\), and a symmetric array \(\overleftarrow{L}[i,j]\) defined from right to left, i.e.:

$$\begin{aligned} \overleftarrow{T}[i,j]&= \textsc {SCS} (\mathcal {H}(W_1)[i \mathinner {.\,.}|W_1|],\mathcal {H}(W_2)[j \mathinner {.\,.}|W_2|]),\\ \overleftarrow{L}[i,j]&=[\overleftarrow{T}[i,j] - \left\lfloor 2\log _2 z \right\rfloor \mathinner {.\,.}\overleftarrow{T}[i,j] + \left\lfloor 2\log _2 z \right\rfloor ]. \end{aligned}$$

Algorithm 3 is applied for every \(i\in [0\mathinner {.\,.}|W_1|]\) and \(j\in [0\mathinner {.\,.}|W_2|]\).

Algorithm 3. The procedure \(\mathsf {Improved2}(W_1,W_2,z,i,j)\) (pseudocode not reproduced here).

Lemma 19

(Correctness of Algorithm 3). Assuming that there is a solution S to \(\textsc {WSCS} (W_1,W_2,z)\) that satisfies (3), we have

$$\begin{aligned} \textsc {WSCS} (W_1,W_2,z)=\min _{i,j}(\mathsf {Improved2}(W_1,W_2,z,i,j)). \end{aligned}$$

Proof

Assume that \(\textsc {WSCS} (W_1,W_2,z)\) has a solution \(S=S_L\cdot S_R\) that satisfies (3) for some \(i\in [0\mathinner {.\,.}|W_1|]\) and denote \(\ell _L=|S_L|\), \(\ell _R=|S_R|\). Let \(S'_L\) and \(S'_R\) be subsequences of \(S_L\) and \(S_R\) such that

$$\begin{aligned} p_L:=\mathcal {P}(S'_L,W_1[1 \mathinner {.\,.}i]) \ge \tfrac{1}{\sqrt{z}} \quad \text {and}\quad p_R:=\mathcal {P}(S'_R,W_1[i+1 \mathinner {.\,.}|W_1|]) \ge \tfrac{1}{\sqrt{z}}. \end{aligned}$$

Let \(S''_L\) and \(S''_R\) be subsequences of \(S_L\) and \(S_R\) such that

$$\begin{aligned} \mathcal {P}(S''_L,W_2[1 \mathinner {.\,.}j])=q_L \quad \text {and}\quad \mathcal {P}(S''_R,W_2[j+1 \mathinner {.\,.}|W_2|])=q_R \end{aligned}$$

for some j and \(q_Lq_R \ge \tfrac{1}{z} \).

By Lemma 14, \(\overrightarrow{\mathbf {DP}}[i,j,\ell _L,p_L] \ge q_L\) and \(\overleftarrow{\mathbf {DP}}[i+1,j+1,\ell _R,p_R] \ge q_R\). Hence, the set A will contain a pair \((p_L,q'_L)\) such that \(q'_L \ge q_L\) and the set B will contain a pair \((p_R,q'_R)\) such that \(q'_R \ge q_R\). Consequently, \(\textsc {Merge} (A,B,z)\) will return a positive answer.

Similarly, if \(\textsc {Merge} (A,B,z)\) returns a positive answer for given i, j, \(\ell _L\) and \(\ell _R\), then

$$\begin{aligned} \overrightarrow{\mathbf {DP}}[i,j,\ell _L,p_L] \ge q_L\quad \text {and}\quad \overleftarrow{\mathbf {DP}}[i+1,j+1,\ell _R,p_R] \ge q_R \end{aligned}$$

for some \(p_Lp_R,q_Lq_R \ge \tfrac{1}{z} \). By Lemma 14, this implies that

$$\begin{aligned} \textsc {WSCS} '(W_1[1 \mathinner {.\,.}i],W_2[1 \mathinner {.\,.}j],\ell _L,p_L,q_L) \end{aligned}$$

and

$$\begin{aligned} \textsc {WSCS} '(W_1[i+1 \mathinner {.\,.}|W_1|],W_2[j+1 \mathinner {.\,.}|W_2|],\ell _R,p_R,q_R) \end{aligned}$$

have a positive answer, so

$$\begin{aligned} \textsc {WSCS} '(W_1,W_2,\ell _L+\ell _R,p_Lp_R,q_Lq_R) \end{aligned}$$

has a positive answer too. Due to \(p_Lp_R,q_Lq_R \ge \tfrac{1}{z} \), this completes the proof.    \(\square \)

Proposition 20

The WSCS problem can be solved in \(\mathcal {O}(n^2\sqrt{z} \log ^2 z)\) time if \(|\varSigma |=\mathcal {O}(1)\).

Proof

We use the algorithm \(\mathsf {Improved2}\), whose correctness follows from Lemma 19 in case (3) is satisfied. The general case of Observation 17 requires only a minor technical change to the algorithm. Namely, the computation of \(\overrightarrow{\mathbf {DP}}\) then additionally includes all states \((i,j,\ell ,p)\) such that \(\ell \in \overrightarrow{L}[i,j]\), \(p\ge \tfrac{1}{z} \), and \(p=\pi ^1_i(c)p'\) for some \(c\in \varSigma \) and \(p'\in Freq _{i-1}(W_1,\sqrt{z})\). Due to \(|\varSigma | = \mathcal {O}(1)\), the number of such states is still \(\mathcal {O}(n^2 \sqrt{z}\log z)\).

For every i and j, the algorithm solves \(\mathcal {O}(\log ^2 z)\) instances of Merge, each of size \(\mathcal {O}(\sqrt{z})\). This results in the total running time of \(\mathcal {O}(n^2\sqrt{z} \log ^2 z)\).    \(\square \)
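The Merge procedure itself is not reproduced in this excerpt; the following is a minimal sketch of one plausible implementation, assuming that A and B are sets of pairs \((p,q)\) and that the question is whether some \((p_L,q_L)\in A\) and \((p_R,q_R)\in B\) satisfy \(p_Lp_R\ge \tfrac{1}{z}\) and \(q_Lq_R\ge \tfrac{1}{z}\) (the function name and interface are assumptions):

```python
import bisect

def merge(A, B, z):
    # Decide whether some (pL, qL) in A and (pR, qR) in B satisfy
    # pL * pR >= 1/z and qL * qR >= 1/z.
    A = sorted(A)                        # ascending by pL
    ps = [p for p, _ in A]
    # suffix_max[k] = max qL among the pairs A[k:], i.e. among the
    # pairs whose pL is at least ps[k]
    suffix_max = [0.0] * (len(A) + 1)
    for k in range(len(A) - 1, -1, -1):
        suffix_max[k] = max(suffix_max[k + 1], A[k][1])
    for pR, qR in B:
        # pL must be at least 1/(z*pR); among such pairs, the best
        # chance to satisfy the q-condition is the maximum qL
        lo = bisect.bisect_left(ps, 1.0 / (z * pR))
        if lo < len(A) and suffix_max[lo] * qR >= 1.0 / z:
            return True
    return False
```

Sorting A by p and precomputing suffix maxima of q reduces each query for a pair of B to a binary search, giving \(\mathcal {O}((|A|+|B|)\log |A|)\) time per call, which is consistent with the stated budget for instances of size \(\mathcal {O}(\sqrt{z})\).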

4.3 Third Improvement: Removing One log z Factor

The final improvement is obtained by a structural transformation after which we only need to consider \(\mathcal {O}(\log z)\) pairs \((\ell _L,\ell _R)\).

For this to be possible, we compute prefix maxima on the \(\ell \)-dimension of the \(\overrightarrow{\mathbf {DP}}\) and \(\overleftarrow{\mathbf {DP}}\) arrays in order to guarantee monotonicity. That is, if \(\textsc {Merge} (A,B,z)\) returns true for \(\ell _L\) and \(\ell _R\), then we make sure that it would also return true if any of these two lengths increased (within the corresponding intervals).

This lets us compute, for every \(\ell _L\in \overrightarrow{L}[i,j]\), the smallest \(\ell _R\in \overleftarrow{L}[i,j]\) for which \(\textsc {Merge} (A,B,z)\) returns true using only \(\mathcal {O}(\log z)\) iterations in total, because the sought \(\ell _R\) may only decrease as \(\ell _L\) increases. The pseudocode is given in Algorithm 4.

[Algorithm 4: pseudocode of \(\mathsf {Improved3}\)]
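A minimal sketch of this two-pointer scan, with the hypothetical helper merge_ok standing in for the Merge call on the prefix-maxima-processed arrays (assumed monotone, as guaranteed above):

```python
def smallest_feasible(L_left, L_right, merge_ok):
    # For every lL in L_left (sorted increasingly), find the smallest
    # lR in L_right (sorted increasingly) with merge_ok(lL, lR) True.
    # merge_ok is assumed monotone: once True, it stays True when
    # either length increases.  Because the pointer r only moves down,
    # the total number of predicate calls is O(|L_left| + |L_right|).
    result = {}
    r = len(L_right)          # len(L_right) means "nothing feasible yet"
    for lL in L_left:
        if r == len(L_right) and L_right and merge_ok(lL, L_right[-1]):
            r = len(L_right) - 1
        if r < len(L_right):
            while r > 0 and merge_ok(lL, L_right[r - 1]):
                r -= 1        # slide down while still feasible
            result[lL] = L_right[r]
    return result
```

With \(|L_{left}|,|L_{right}| = \mathcal {O}(\log z)\), this is where the saved \(\log z\) factor comes from: instead of trying all \(\mathcal {O}(\log ^2 z)\) pairs \((\ell _L,\ell _R)\), only \(\mathcal {O}(\log z)\) Merge instances are created.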

Theorem 21

The WSCS problem can be solved in \(\mathcal {O}(n^2\sqrt{z} \log z)\) time if \(|\varSigma |=\mathcal {O}(1)\).

Proof

Let us fix indices i and j. Let us denote \( Freq _i(W,z)\) by \(\overrightarrow{ Freq }_i(W,z)\) and introduce a symmetric array

$$\begin{aligned} \overleftarrow{ Freq }_i(W,z)=\{\mathcal {P}(S,W[i \mathinner {.\,.}|W|])\,:\,S \in \mathsf {Matched}_z(W[i \mathinner {.\,.}|W|])\}. \end{aligned}$$

In the first loop of prefix maxima computation, we consider all \(\ell \in \overrightarrow{L}[i,j]\) and \(p \in \overrightarrow{ Freq }_i(W_1,\sqrt{z})\), and in the second loop, all \(\ell \in \overleftarrow{L}[i,j]\) and \(p \in \overleftarrow{ Freq }_{i+1}(W_1,\sqrt{z})\). Hence, prefix maxima take \(\mathcal {O}(\sqrt{z}\log {z})\) time to compute.

Each step of the while-loop in \(\mathsf {Improved3}\) increases \(\ell _L\) or decreases \(\ell _R\). Hence, the algorithm produces only \(\mathcal {O}(\log z)\) instances of Merge, each of size \(\mathcal {O}(\sqrt{z})\). The time complexity follows.    \(\square \)

5 Lower Bound for WLCS

Let us first define the WLCS problem as it was stated in [4, 14].

[Problem definition: WLCS]

We consider the following well-known NP-complete problem [19]:

[Problem definition: Subset Sum]

Theorem 22

The WLCS problem cannot be solved in \(\mathcal {O}(n^{f(z)})\) time, for any function f, if \(\mathrm {P}\ne \mathrm {NP}\).

Proof

We show the hardness result by reducing the NP-complete Subset Sum problem to the WLCS problem with a constant value of z.

For a set \(S=\{s_1,s_2,\ldots ,s_n\}\) of n positive integers, a positive integer t, and an additional parameter \(p\in [2\mathinner {.\,.}n]\), we construct two weighted strings \(W_1\) and \(W_2\) over the alphabet \(\varSigma =\{\mathtt {a},\mathtt {b}\}\), each of length \(n^2\).

Let \(q_i=\frac{s_i}{t}\). At positions \(i\cdot n\), for all \(i\in [1\mathinner {.\,.}n]\), the weighted string \(W_1\) contains letter \(\mathtt {a}\) with probability \(2^{-q_i}\) and \(\mathtt {b}\) otherwise, while \(W_2\) contains \(\mathtt {a}\) with probability \(2^{\frac{1}{p-1}(q_i-1)}\) and \(\mathtt {b}\) otherwise. All the other positions contain letter \(\mathtt {b}\) with probability 1. We set \(z = 2\).
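The reduction's probability assignment can be sketched as follows (build_wlcs_instance is a hypothetical name; each probability \(2^{e}\) is represented by its exact base-2 exponent e, stored as a fraction):

```python
from fractions import Fraction

def build_wlcs_instance(S, t, p):
    # For each i, the exponent e such that 2**e is the probability of
    # letter 'a' at position i*n of W1 and W2, respectively; all other
    # positions hold 'b' with probability 1.
    exp_W1 = [Fraction(-s, t) for s in S]               # W1: 2^{-s_i/t}
    exp_W2 = [Fraction(s - t, t * (p - 1)) for s in S]  # W2: 2^{(s_i/t-1)/(p-1)}
    return exp_W1, exp_W2
```

When every element of S is smaller than t, all these exponents lie in \((-1,0)\), matching the claim below that every weight of \(\mathtt {a}\) lies in \((\frac{1}{2},1)\).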

We assume that S contains only elements smaller than t (larger elements can be ignored, and if some element equals t, then no reduction is needed). All the weights of \(\mathtt {a}\) are then in the interval \((\frac{1}{2},1)\) since \(-q_i\in (-1,0)\) and \(\frac{1}{p-1}(q_i-1) \in (-1,0)\). Thus, since \(z=2\), letter \(\mathtt {b}\) originating from a position \(i\cdot n\) has occurrence probability below \(\tfrac{1}{2}\) and can never occur in a subsequence of \(W_1\) or in a subsequence of \(W_2\). Hence, every common subsequence of \(W_1\) and \(W_2\) is a subsequence of \((\mathtt {b}^{n-1}\mathtt {a})^n\).

For \(I\subseteq [1\mathinner {.\,.}n]\), we have

$$\begin{aligned} \prod _{i\in I}\pi ^{(W_1)}_{i\cdot n}(\mathtt {a}) = \prod _{i \in I} 2^{-s_i/t} \ge 2^{-1} =\tfrac{1}{z}\ \Longleftrightarrow&\ \sum _{i\in I}s_i\le t \end{aligned}$$

and

$$\begin{aligned} \prod _{i\in I}\pi ^{(W_2)}_{i\cdot n}(\mathtt {a}) = \prod _{i \in I}2^{\frac{1}{p-1}(s_i/t-1)}\,\ge \,2^{-1}=\tfrac{1}{z}\ &\Longleftrightarrow \ \tfrac{1}{t(p-1)}\left( \sum _{i\in I}s_i\right) -\tfrac{|I|}{p-1}\,\ge -1\\ &\Longleftrightarrow \ \sum _{i\in I}s_i\ge t(1-p+|I|). \end{aligned}$$

If I is a solution to the instance of the Subset Sum problem, then for \(p=|I|\) there is a weighted common subsequence of length \(n(n-1)+p\) obtained by choosing all the letters \(\mathtt {b}\) and the letters \(\mathtt {a}\) that correspond to the elements of I.

Conversely, suppose that the constructed WLCS instance with a parameter \(p\in [2\mathinner {.\,.}n]\) has a solution of length at least \(n(n-1)+p\). Notice that \(\mathtt {a}\) at position \(i\cdot n\) in \(W_1\) may be matched against \(\mathtt {a}\) at position \(i'\cdot n\) in \(W_2\) only if \(i=i'\); otherwise, the length of the subsequence would be at most \((n-|i-i'|)n\le (n-1)n<n(n-1)+p\). Consequently, the solution yields a subset \(I\subseteq [1\mathinner {.\,.}n]\) of at least p indices i such that \(\mathtt {a}\) at position \(i\cdot n\) in \(W_1\) is matched against \(\mathtt {a}\) at position \(i\cdot n\) in \(W_2\). Thus we have (a) \(|I| \ge p\) by the choice of I, and, by the relations above, (b) \(\sum _{i\in I}s_i\le t\) and (c) \(\sum _{i\in I}s_i\ge t(1-p+|I|)\). Combining these three inequalities, we obtain \(\sum _{i\in I}s_i= t\) and conclude that the Subset Sum instance has a solution.
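Explicitly, inequalities (a)–(c) chain together as follows, with the last step using \(|I|\ge p\):

$$\begin{aligned} t \;\overset{\text {(b)}}{\ge }\; \sum _{i\in I}s_i \;\overset{\text {(c)}}{\ge }\; t(1-p+|I|) \;\overset{\text {(a)}}{\ge }\; t(1-p+p) \;=\; t, \end{aligned}$$

so equality holds throughout, forcing \(\sum _{i\in I}s_i=t\) and \(|I|=p\).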

Hence, the Subset Sum instance has a solution if and only if there exists \(p\in [2\mathinner {.\,.}n]\) such that the constructed WLCS instance with p has a solution of length at least \(n(n-1)+p\). Consequently, an \(\mathcal {O}(n^{f(z)})\)-time algorithm for the WLCS problem, run for every \(p\in [2\mathinner {.\,.}n]\) on instances of length \(n^2\), would yield an \(\mathcal {O}(n^{2f(2)+1})=n^{\mathcal {O}(1)}\)-time algorithm for the Subset Sum problem, and thus \(\mathrm {P}=\mathrm {NP}\).    \(\square \)

Example 23

For \(S=\{3,7,11,15,21\}\) and \(t=25=3+7+15\), both weighted strings \(W_1\) and \(W_2\) are of the form:

$$\begin{aligned} \mathtt {b^4\,*\,b^4\,*\,b^4\,*\,b^4\,*\,b^4\,*}\,, \end{aligned}$$

where each \(\mathtt {*}\) is equal to either \(\mathtt {a}\) or \(\mathtt {b}\) with different probabilities.

The probabilities of choosing \(\mathtt {a}\)’s for \(W_1\) are equal respectively to

$$\begin{aligned} \big (2^{-\frac{3}{25}},2^{-\frac{7}{25}},2^{-\frac{11}{25}},2^{-\frac{15}{25}},2^{-\frac{21}{25}}\big ), \end{aligned}$$

while for \(W_2\) they depend on the value of p, and are equal respectively to

$$\begin{aligned} \big (2^{-\frac{22}{25(p-1)}},2^{-\frac{18}{25(p-1)}},2^{-\frac{14}{25(p-1)}},2^{-\frac{10}{25(p-1)}},2^{-\frac{4}{25(p-1)}}\big ). \end{aligned}$$

For \(p=3\), we have \(\textsc {WLCS} (W_1,W_2,2)\,=\,\mathtt {b^4\,a\,b^4\,a\,b^4\,b^4\,a\,b^4}\), which corresponds to taking the first, the second, and the fourth \(\mathtt {a}\). The length of this string is equal to \(23=n(n-1)+p\), and its probability of matching in \(W_2\) is \(\frac{1}{2} = 2^{-\frac{22}{50}} \cdot 2^{-\frac{18}{50}} \cdot 2^{-\frac{10}{50}}\); in \(W_1\), it is likewise \(2^{-\frac{3}{25}} \cdot 2^{-\frac{7}{25}} \cdot 2^{-\frac{15}{25}} = \frac{1}{2}\). Thus, the subset \(\{3,7,15\}\) of S, consisting of its first, second, and fourth elements, is a solution to the Subset Sum problem.
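The arithmetic of this example can be checked mechanically; a small sketch (variable names are illustrative):

```python
from fractions import Fraction

S, t, p = [3, 7, 11, 15, 21], 25, 3
I = [0, 1, 3]   # indices of the chosen a's (elements 3, 7, 15)

# the chosen subset sums exactly to t
subset_sum = sum(S[i] for i in I)

# base-2 exponent of the matching probability in W1: sum of -s_i/t
exp_W1 = sum(Fraction(-S[i], t) for i in I)
# base-2 exponent of the matching probability in W2: sum of (s_i-t)/(t(p-1))
exp_W2 = sum(Fraction(S[i] - t, t * (p - 1)) for i in I)
# both exponents equal -1, i.e. both matching probabilities are 2^{-1} = 1/z
```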