1 Introduction

The longest common subsequence (LCS) problem is a classic problem in computer science, widely applied in areas such as file comparison, pattern matching and computational biology [3, 4, 8, 9]. Given two sequences X and Y, the LCS problem is to find a common subsequence of X and Y whose length is the longest among all common subsequences of the two given sequences. It differs from the problem of finding common substrings: unlike substrings, subsequences are not required to occupy consecutive positions within the original sequences. The most cited algorithm, proposed by Wagner and Fischer [29], solves the LCS problem by dynamic programming in quadratic time. More advanced algorithms have been proposed in the past decades [24, 16, 17, 19, 21]. If the number of input sequences is not fixed, the problem of finding the LCS of multiple sequences has been proved to be NP-hard [23]. Some approximate and heuristic algorithms have been proposed for this problem [6, 25].
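As a point of reference for the quadratic-time dynamic programming mentioned above, the following is a minimal sketch of the textbook LCS length computation. The function name `lcs_length` and the two-row table layout are illustrative choices, not taken from [29].

```python
def lcs_length(x: str, y: str) -> int:
    """Length of a longest common subsequence of x and y (O(len(x) * len(y)) time)."""
    prev = [0] * (len(y) + 1)          # row i-1 of the DP table
    for i in range(1, len(x) + 1):
        cur = [0] * (len(y) + 1)       # row i of the DP table
        for j in range(1, len(y) + 1):
            if x[i - 1] == y[j - 1]:
                cur[j] = prev[j - 1] + 1            # extend a common subsequence
            else:
                cur[j] = max(prev[j], cur[j - 1])   # drop x_i or y_j
        prev = cur
    return prev[len(y)]
```

Only two rows are kept because each entry depends on the previous row and the current one; recovering an actual subsequence would require the full table.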

Some biological applications require constraints to be imposed on the LCS problem. Such variants of the LCS problem are called constrained LCS (CLCS) problems. One recent variant, the constrained longest common subsequence (CLCS) problem, first addressed by Tsai [27], has received much attention. It generalizes the LCS measure by introducing a third sequence, which makes it possible to require that the obtained CLCS has some special properties [26]. For two given input sequences X and Y of lengths m and n, respectively, and a constraint sequence P of length r, the CLCS problem is to find a common subsequence Z of X and Y such that P is a subsequence of Z and the length of Z is maximum. The most cited algorithms were proposed independently in [5, 8]; they solve the CLCS problem in O(mnr) time and space by dynamic programming. Some improved algorithms have also been proposed [11, 18]. The LCS and CLCS problems on indeterminate strings were discussed in [20]. Moreover, the problem was extended to a more general one with weighted constraints [24].

Recently, a new variant of the CLCS problem, the restricted LCS problem, was proposed [14], which excludes the given constraint as a subsequence of the answer. The restricted LCS problem becomes NP-hard when the number of constraints is not fixed. Some more generalized forms of the CLCS problem, the generalized constrained longest common subsequence (GC-LCS) problems, were addressed by Chen and Chao [7]. For two input sequences X and Y of lengths n and m, respectively, and a constraint string P of length r, the GC-LCS problem is a set of four problems, which are to find the LCS of X and Y including/excluding P as a subsequence/substring, respectively. The four generalized constrained LCS problems [7] are summarized in Table 1.

Table 1. The GC-LCS problems

For the four problems in Table 1, O(mnr) time algorithms were proposed in [7]. For all four variants in Table 1, \(O(r(m + n) + (m + n) \log (m+n))\) time algorithms using finite automata were proposed in [12]. Recently, a quadratic algorithm for the STR-IC-LCS problem was proposed [10], and the time complexity claimed in [12] was pointed out to be incorrect.

The four GC-LCS problems can be generalized further to the cases of multiple constraints. In these generalized cases, the single constrained pattern P will be generalized to a set of d constraints \(P=\{P_1,\cdots ,P_d\}\) of total length r, as shown in Table 2.

Table 2. The Multiple-GC-LCS problems

The problem M-SEQ-IC-LCS has been proved to be NP-hard in [13]. The problem M-SEQ-EC-LCS has also been proved to be NP-hard in [14, 28]. In addition, the problems M-STR-IC-LCS and M-STR-EC-LCS were also declared to be NP-hard in [7], but without a proof. The exponential-time algorithms for solving these two problems were also presented in [7].

We will discuss the problem M-STR-EC-LCS in this paper. The failure functions in the Knuth-Morris-Pratt algorithm [22] for solving the string matching problem have proved very helpful for solving the STR-EC-LCS problem. Aho and Corasick [1] found that the failure functions can be generalized to the case of a keyword tree to speed up the exact string matching of multiple patterns. This is the principal idea of our new dynamic programming algorithm.

The organization of the paper is as follows.

In the following three sections, we describe our dynamic programming algorithm for the M-STR-EC-LCS problem.

In Sect. 2 we discuss the preliminary knowledge needed to present our algorithm for the M-STR-EC-LCS problem. In Sect. 3 we give a new dynamic programming solution for the M-STR-EC-LCS problem with time complexity O(nmr), where n and m are the lengths of the two given input strings, and r is the total length of the d constraint strings. In Sect. 4, we discuss how to implement the algorithm efficiently.

2 Preliminaries

A sequence is a string of characters over an alphabet \(\Sigma \). A subsequence of a sequence X is obtained by deleting zero or more (not necessarily contiguous) characters from X. A substring of a sequence X is a subsequence of successive characters within X.

For a given sequence \(X=x_1x_2\cdots x_n\) of length n, the ith character of X is denoted as \(x_i \in \Sigma \) for any \(i=1,\cdots ,n\). A substring of X from position i to j can be denoted as \(X[i:j]=x_ix_{i+1}\cdots x_j\). If \(i\ne 1\) or \(j\ne n\), then the substring \(X[i:j]=x_ix_{i+1}\cdots x_j\) is called a proper substring of X. A substring \(X[i:j]=x_ix_{i+1}\cdots x_j\) is called a prefix or a suffix of X if \(i=1\) or \(j=n\), respectively.
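The difference between subsequences and substrings, and the 1-based inclusive notation \(X[i:j]\), can be made concrete with a few helper functions. This is only an illustrative sketch; the function names are hypothetical and Python's 0-based slicing is adapted to the notation used here.

```python
def is_subsequence(z: str, x: str) -> bool:
    """True if z is obtained from x by deleting zero or more characters."""
    it = iter(x)
    # `ch in it` advances the iterator, so matches must occur in order.
    return all(ch in it for ch in z)


def is_substring(z: str, x: str) -> bool:
    """True if z occupies consecutive positions of x."""
    return z in x


def substring(x: str, i: int, j: int) -> str:
    """X[i:j] in the 1-based, inclusive notation of the text."""
    return x[i - 1:j]


def is_prefix(z: str, x: str) -> bool:
    return x.startswith(z)


def is_suffix(z: str, x: str) -> bool:
    return x.endswith(z)
```

For instance, `ace` is a subsequence of `abcde` but not a substring of it, while `substring("abcde", 2, 4)` gives `bcd`.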

For the two input sequences \(X=x_1x_2\cdots x_n\) and \(Y=y_1y_2\cdots y_m\) of lengths n and m, respectively, and a set of d constraints \(P=\{P_1,\cdots ,P_d\}\) of total length r, the problem M-STR-EC-LCS is to find an LCS of X and Y that excludes every constraint \(P_i\in P\) as a substring.
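To make the problem statement concrete, the following brute-force sketch enumerates subsequences of X from longest to shortest and returns the first one that is also a subsequence of Y and contains no constraint as a substring. It takes exponential time and is meant only as a reference for tiny inputs; the function names are hypothetical, and all constraints are assumed to be non-empty strings.

```python
from itertools import combinations


def is_subsequence(z: str, y: str) -> bool:
    it = iter(y)
    return all(ch in it for ch in z)


def brute_force_m_str_ec_lcs(x: str, y: str, patterns: list[str]) -> str:
    """Exhaustive reference solution of M-STR-EC-LCS (exponential time)."""
    for length in range(len(x), -1, -1):
        # Every subsequence of x of this length, given by its index set.
        for idx in combinations(range(len(x)), length):
            z = "".join(x[i] for i in idx)
            if is_subsequence(z, y) and not any(p in z for p in patterns):
                return z
    return ""
```

With X = `abab`, Y = `baba` and the single constraint `ab`, both common subsequences of length 3 (`aba` and `bab`) contain `ab`, so the answer has length 2.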

The keyword tree (Aho-Corasick automaton) [1, 9, 15] is the main data structure used in our dynamic programming algorithm to process the constraint set P of the M-STR-EC-LCS problem.

Definition 1

The keyword tree for set P is a rooted directed tree T satisfying 3 conditions: 1. each edge is labeled with exactly one character; 2. any two edges out of the same node have distinct labels; and 3. every string \(P_i\) in P maps to some node v of T such that the characters on the path from the root of T to v exactly spell out \(P_i\), and every leaf of T is mapped to some string in P.

Fig. 1. Keyword trees

In order to identify the nodes of T, we assign numbers \(0,1,\cdots ,t-1\) to all t nodes of T in their preorder numbering. Then, each node will be assigned an integer \(i,0\le i<t\), as shown in Fig. 1. For each node numbered i of a keyword tree T, the concatenation of characters on the path from the root to the node i spells out a string denoted as L(i). The string L(i) is also called the label of the node i in the keyword tree T. For example, Fig. 1 shows the keyword tree T for the constraint set \(P=\{aab,aba,ba\}\), where \(P_1=aab,P_2=aba,P_3=ba\), and \(d=3,r=8\). Clearly, every node in the keyword tree corresponds to a prefix of one of the strings in set P, and every prefix of a string \(P_i\) in P maps to a distinct node in the keyword tree T. The keyword tree for set P of total length r of all strings can be easily constructed in O(r) time for a constant alphabet size.
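The construction and numbering can be sketched as follows. The sketch relies on one implementation choice that is an assumption, not something stated in the text: the strings are inserted in lexicographic order, which creates the nodes in preorder with children ordered alphabetically, as in the example of Fig. 1. The names `goto`, `label` and `ends` are hypothetical.

```python
def build_keyword_tree(patterns: list[str]):
    """Return (goto, label, ends) for the keyword tree of `patterns`.

    goto[v]  maps a character to the child of node v along that edge,
    label[v] is the string L(v) spelled from the root to v,
    ends[v]  lists the indices j (1-based) with P_j = L(v).
    """
    goto, label, ends = [{}], [""], [[]]          # node 0 is the root
    # Lexicographic insertion order makes node creation order a preorder.
    for idx, p in sorted(enumerate(patterns, start=1), key=lambda t: t[1]):
        node = 0
        for ch in p:
            if ch not in goto[node]:
                goto[node][ch] = len(goto)        # next free preorder number
                goto.append({})
                label.append(label[node] + ch)    # kept only for illustration
                ends.append([])
            node = goto[node][ch]
        ends[node].append(idx)
    return goto, label, ends


goto, label, ends = build_keyword_tree(["aab", "aba", "ba"])
print(label)   # node labels L(0), ..., L(7)
```

Storing the labels explicitly costs more than O(r) space in general; a linear-time construction would keep only `goto` and `ends`.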

The keyword tree can be extended into an automaton, the Aho-Corasick automaton, which is composed of three functions: a goto function, an output function and a failure function. The goto function is represented by the solid edges of the keyword tree, and the output function indicates when matches occur and which strings are output. For each node i, its output function is denoted as \(O_i\), a set of indices such that, when the node i is reached, the string \(P_j\) is matched for each index \(j\in O_i\). For example, the output sets of nodes 3, 5 and 7 are \(O_3=\{1\}, O_5=\{2,3\}\) and \(O_7=\{3\}\), which means that the outputs of nodes 3, 5 and 7 are \(\{P_1=aab\}, \{P_2=aba,P_3=ba\}\) and \(\{P_3=ba\}\), respectively.

The failure function indicates which node to go to if there is no character to be further matched. It is a generalization of the failure functions in the Knuth-Morris-Pratt algorithm for solving the string matching problem. It is represented by the dashed edges in Fig. 1.

For any node i of T, define lp(i) to be the length of the longest proper suffix of string L(i) that is a prefix of some string in P. It can be verified readily that for each node i of T, if A is the lp(i)-length suffix of string L(i), then there must be a unique node pre(i) in T such that \(L(pre(i))=A\). If \(lp(i)=0\) then \(pre(i)=0\) is the root of T.

The ordered pair \((i,pre(i))\) is called a failure link. The failure link is a direct generalization of the failure functions in the KMP algorithm. For example, in Fig. 1, failure links are shown as pointers from every node i to node pre(i) where \(lp(i)>0\). The other failure links point to the root and are not shown. The failure links of T actually define a failure function pre for the constraint set P. As stated in [1, 9], for a constant alphabet size, the failure function pre can be computed in O(r) time in the worst case.

The failure list of a given node is the ordered list of the nodes that lie on the path from this node to the root via dashed edges. For example, for the nodes \(i=1,2,3,4,5,6,7\), the corresponding values of the failure function are \(pre(i)=0,1,4,6,7,0,1\). The failure list of node 5 is \(\{7\rightarrow 1\rightarrow 0\}\), and the failure list of node 6 is \(\{0\}\), as shown in Fig. 1.
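The failure function, the output sets and the failure lists can be computed by a breadth-first traversal of the keyword tree, in the usual Aho-Corasick manner. The sketch below is an illustration under the assumption that nodes are numbered by inserting the strings in lexicographic order (which yields the preorder numbering of the example); all function names are hypothetical.

```python
from collections import deque


def build_trie(patterns: list[str]):
    goto, ends = [{}], [set()]
    for idx, p in sorted(enumerate(patterns, start=1), key=lambda t: t[1]):
        node = 0
        for ch in p:
            if ch not in goto[node]:
                goto[node][ch] = len(goto)
                goto.append({})
                ends.append(set())
            node = goto[node][ch]
        ends[node].add(idx)
    return goto, ends


def failure_links(goto, ends):
    """Return (pre, out): failure function and output sets O_i."""
    pre = [0] * len(goto)
    out = [set(s) for s in ends]
    queue = deque(goto[0].values())        # depth-1 nodes fail to the root
    while queue:
        u = queue.popleft()
        out[u] |= out[pre[u]]              # inherit matches ending at pre(u)
        for ch, v in goto[u].items():
            w = pre[u]
            while w and ch not in goto[w]: # climb failure links
                w = pre[w]
            pre[v] = goto[w].get(ch, 0)
            queue.append(v)
    return pre, out


def failure_list(pre, i):
    """Nodes visited from i to the root along failure links (i excluded)."""
    path = []
    while i:
        i = pre[i]
        path.append(i)
    return path


goto, ends = build_trie(["aab", "aba", "ba"])
pre, out = failure_links(goto, ends)
print(pre[1:], out[5], failure_list(pre, 5))
```

On the example constraint set this reproduces the values quoted above: `pre[1:]` is `[0, 1, 4, 6, 7, 0, 1]`, the output set of node 5 is `{2, 3}`, and the failure list of node 5 is `[7, 1, 0]`.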

The failure function pre is used to speed up the search for all occurrences in a text Z of strings from P. For each node i of T and each character \(c\in \Sigma \), if no edge out of the node i is labeled c, then the failure link of node i directs the search to the node pre(i). This is equivalent to adding the edge \((i,pre(i))\) labeled c to the node i. This set matching method generalizes the next function of the KMP algorithm to the Aho-Corasick-next function as follows.

Definition 2

Given a keyword tree T and its failure function, for each node i of T and each character \(c\in \Sigma \), the Aho-Corasick-next function \(\delta (i,c)\) denotes the destination of the edge labeled c that leaves the first node, among i itself followed by the nodes in i’s failure list, having an outgoing edge labeled c. If no such node exists, the function returns the root.

Table 3 shows the Aho-Corasick-next function \(\delta \) corresponding to the example in Fig. 1.

Table 3. Aho-Corasick-next function

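We take node 4 as an example. It can be seen from Fig. 1 that \(\delta (4,a)=5\), since node 4 has an outgoing edge labeled a to node 5, and \(\delta (4,b)=6\), since neither node 4 nor node \(pre(4)=6\) has an outgoing edge labeled b, while the root has one leading to node 6. The whole Aho-Corasick-next function can be precomputed in \(O(r|\varSigma |)\) time, after which each value \(\delta (i,c)\) is available in constant time.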
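One common way to obtain all values of \(\delta \) is to fill a table in breadth-first order, reusing the row of pre(i) whenever node i has no edge labeled c. The sketch below follows that standard construction; it is an illustration, not necessarily the procedure intended by the authors. The parameter `alphabet` and the lexicographic insertion order (used to reproduce the preorder numbering) are assumptions of the sketch.

```python
from collections import deque


def aho_corasick_next(patterns: list[str], alphabet) -> list[dict]:
    """Return delta, where delta[i][c] is the Aho-Corasick-next value."""
    goto = [{}]
    for p in sorted(patterns):                 # lexicographic -> preorder ids
        node = 0
        for ch in p:
            if ch not in goto[node]:
                goto[node][ch] = len(goto)
                goto.append({})
            node = goto[node][ch]
    # Make sure every character of the patterns is part of the alphabet.
    alphabet = sorted(set(alphabet) | set("".join(patterns)))
    pre = [0] * len(goto)
    delta = [{} for _ in goto]
    for c in alphabet:
        delta[0][c] = goto[0].get(c, 0)        # the root loops on mismatches
    queue = deque(goto[0].values())
    while queue:
        u = queue.popleft()
        for c in alphabet:
            if c in goto[u]:
                v = goto[u][c]
                pre[v] = delta[pre[u]][c]      # failure link of the child
                delta[u][c] = v
                queue.append(v)
            else:
                delta[u][c] = delta[pre[u]][c] # fall back along failure list
    return delta


delta = aho_corasick_next(["aab", "aba", "ba"], "ab")
print(delta[4]["a"])   # node reached from node 4 on character a
```

Because the row of pre(u) is complete before u is processed, each entry is filled in constant time; for the keyword tree of Fig. 1 the entry for node 4 and character a is 5.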

The symbol \(\oplus \) is also used to denote the string concatenation. For example, if \(S_1=aaa\) and \(S_2=bbb\), then it is readily seen that \(S_1\oplus S_2=aaabbb\).

3 Our Main Result: A Dynamic Programming Algorithm

Let T be a keyword tree for the given constraint set P, and \(Z[1:l]=z_1z_2\cdots z_l\) be any common subsequence of X and Y. If we run the set matching of Z from the root of T, following the Aho-Corasick-next function \(\delta \) of T, then the search stops at some node k of T, \(0\le k<t\). We say that Z has state k, and all common subsequences of X and Y with the same state k are classified into group k. These t groups are used to distinguish the different states in our dynamic programming algorithm. For each integer k, \(0\le k<t\), the state k represents the set of common subsequences of X and Y in group k.

Definition 3

Let \(Z(i,j,k)\) denote the set of all LCSs of X[1 : i] and Y[1 : j] with state k, where \(1\le i\le n, 1\le j\le m\), and \(0\le k<t\). The length of an LCS in \(Z(i,j,k)\) is denoted as \(f(i,j,k)\).

If we can compute \(f(i,j,k)\) for any \(1\le i\le n, 1\le j\le m\), and \(0\le k<t\) efficiently, then the length of an LCS of X and Y excluding P must be \(\mathop {\text {max}}\limits _{0\le k<t}\left\{ f(n,m,k)| O_k=\emptyset \right\} \).

By using the keyword tree data structure described in the last section, we can give a recursive formula for computing \(f(i,j,k)\) by the following theorem.

Theorem 1

For the two input sequences \(X=x_1x_2\cdots x_n\) and \(Y=y_1y_2\cdots y_m\) of lengths n and m, respectively, and a set of d constraints \(P=\{P_1,\cdots ,P_d\}\) of total length r, let \(Z(i,j,k)\) and \(f(i,j,k)\) be defined as in Definition 3. Suppose a keyword tree T for the constraint set P has been built, and the t nodes of T are numbered in their preorder numbering. Then, for any \(1\le i\le n, 1\le j\le m\), and \(0\le k<t\), \(f(i,j,k)\) can be computed by the following recursive formula.

$$\begin{aligned} f(i,j,k)=\left\{ \begin{array}{ll} \max \left\{ f(i-1,j,k),f(i,j-1,k) \right\} &{} \texttt {if } x_i\ne y_j,\\ \max \left\{ f(i-1,j-1,k),1+\mathop {\max }\limits _{\bar{k}\in S(k,x_i)}\left\{ f(i-1,j-1,\bar{k})\right\} \right\} &{} \texttt {if } x_i= y_j. \end{array} \right. \end{aligned}$$
(1)

where,

$$\begin{aligned} S(k,x_i)=\{\bar{k}|0\le \bar{k}<t,\delta (\bar{k},x_i)=k\} \end{aligned}$$
(2)

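The boundary conditions of this recursive formula are \(f(i,0,0) = f(0,j,0) = 0\) for any \(0\le i\le n, 0\le j\le m\). For \(k\ne 0\), no common subsequence of an empty prefix has state k, so \(f(i,0,k)\) and \(f(0,j,k)\) are taken to be \(-\infty \).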
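The sets \(S(k,x_i)\) of formula (2) are simply the inverse images of the Aho-Corasick-next function. A minimal sketch of their computation is given below. To keep it self-contained, it uses a small hypothetical automaton for the single constraint `ab` (nodes 0, 1, 2 with labels the empty string, `a` and `ab`), written out by hand rather than taken from the paper.

```python
def inverse_transitions(delta: list[dict]) -> dict:
    """Map (k, c) to the sorted list S(k, c) = [kb : delta[kb][c] == k]."""
    S = {}
    for kb, row in enumerate(delta):
        for c, k in row.items():
            S.setdefault((k, c), []).append(kb)
    return S


# Hypothetical transition table for the single constraint "ab".
delta = [
    {"a": 1, "b": 0},   # node 0 (root)
    {"a": 1, "b": 2},   # node 1, label "a"
    {"a": 1, "b": 0},   # node 2, label "ab"
]
S = inverse_transitions(delta)
print(S.get((1, "a"), []), S.get((2, "b"), []), S.get((0, "a"), []))
```

A pair (k, c) that never occurs as a transition target has an empty set, in which case the inner maximum in formula (1) ranges over nothing and contributes \(-\infty \).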

Proof

For any \(0\le i\le n, 0\le j\le m\), and \(0\le k<t\), suppose \(f(i,j,k)=l\) and \(z=z_1 \cdots z_l\in Z(i,j,k)\).

First of all, we notice that for each pair \((i',j'), 1\le i'\le n, 1\le j'\le m\), such that \(i'\le i\) and \(j'\le j\), we have \(f(i',j',k) \le f(i,j,k)\), since a common subsequence z of \(X[1:i']\) and \(Y[1:j']\) with state k is also a common subsequence of X[1 : i] and Y[1 : j] with state k.

  (1) In the case of \(x_i\ne y_j\), we have \(x_i\ne z_l\) or \(y_j\ne z_l\).

    (1.1) If \(x_i\ne z_l\), then \(z=z_1 \cdots z_l\) is a common subsequence of \(X[1:i-1]\) and Y[1 : j] with state k, and so \(f(i-1,j,k) \ge l\). On the other hand, \(f(i-1,j,k)\le f(i,j,k) = l\). Therefore, in this case we have \(f(i,j,k) = f(i-1,j,k)\).

    (1.2) If \(y_j\ne z_l\), then we can prove similarly that in this case, \(f(i,j,k) = f(i,j-1,k)\).

    Combining the two subcases we conclude that in the case of \(x_i\ne y_j\), we have

    $$\begin{aligned} f(i,j,k)=\max \left\{ f(i-1,j,k),f(i,j-1,k) \right\} . \end{aligned}$$

  (2) In the case of \(x_i=y_j\), there are also two cases to be distinguished.

    (2.1) If \(x_i=y_j\ne z_l\), then \(z=z_1 \cdots z_l\) is also a common subsequence of \(X[1:i-1]\) and \(Y[1:j-1]\) with state k, and so \(f(i-1,j-1,k) \ge l\). On the other hand, \(f(i-1,j-1,k)\le f(i,j,k) = l\). Therefore, in this case we have \(f(i,j,k) = f(i-1,j-1,k)\).

    (2.2) If \(x_i=y_j=z_l\), then \(f(i,j,k) = l>0\) and \(z=z_1 \cdots z_l\) is an LCS of X[1 : i] and Y[1 : j] with state k.

Let the state of \((z_1,\cdots , z_{l-1})\) be \(\bar{k}\), then we have \(\bar{k}\in S(k,x_i)\), since \(z_l=x_i\). It follows that \(z_1 \cdots z_{l-1}\) is a common subsequence of \(X[1:i-1]\) and \(Y[1:j-1]\) with state \(\bar{k}\). Therefore, we have

$$\begin{aligned} f(i-1,j-1,\bar{k})\ge l-1 \end{aligned}$$

Furthermore, we have

$$\begin{aligned} \mathop {\max }\limits _{\bar{k}\in S(k,x_i)}\left\{ f(i-1,j-1,\bar{k})\right\} \ge l-1 \end{aligned}$$

In other words,

$$\begin{aligned} f(i,j,k)\le 1+\mathop {\max }\limits _{\bar{k}\in S(k,x_i)}\left\{ f(i-1,j-1,\bar{k})\right\} \end{aligned}$$
(3)

On the other hand, for any \(\bar{k}\in S(k,x_i)\), and \(v=v_1 \cdots v_h\in Z(i-1,j-1,\bar{k})\), \(v\oplus x_i\) is a common subsequence of X[1 : i] and Y[1 : j] with state k. Therefore, \(f(i,j,k)=l\ge 1+h=1+f(i-1,j-1,\bar{k})\), and so we conclude that,

$$\begin{aligned} f(i,j,k)\ge 1+\mathop {\max }\limits _{\bar{k}\in S(k,x_i)}\left\{ f(i-1,j-1,\bar{k})\right\} \end{aligned}$$
(4)

Combining (3) and (4) we have, in this case,

$$\begin{aligned} f(i,j,k)= 1+\mathop {\max }\limits _{\bar{k}\in S(k,x_i)}\left\{ f(i-1,j-1,\bar{k})\right\} \end{aligned}$$
(5)

Combining the two subcases in the case of \(x_i=y_j\), we conclude that the recursive formula (1) is correct for the case \(x_i=y_j\).

The proof is complete.    \(\blacksquare \)

4 The Implementation of the Algorithm

According to Theorem 1, our algorithm for computing \(f(i,j,k)\) is a standard 3-dimensional dynamic programming algorithm. By the recursive formula (1), the dynamic programming algorithm for computing \(f(i,j,k)\) can be implemented as the following Algorithm 1.

Algorithm 1 (figure a)

In Algorithm 1, T is the keyword tree for set P. The root of the keyword tree is numbered 0, and the other nodes are numbered \(1,2,\cdots ,t-1\) in their preorder numbering. \(\delta (\alpha ,c)\) is the Aho-Corasick-next function defined in Definition 2, which can be computed in O(1) time. \(O_k\) is the output set of node k in T. The variable S is used to record the states created so far. When a node is visited for the first time, a new state may be created. Therefore, in Algorithm 1, the current state set S is extended gradually as the for loops proceed. In the worst case, the set S has size O(r), where r is the total length of the constraint strings. The body of the triple for loops can be computed in O(1) time in the worst case. Therefore, the total time of Algorithm 1 is O(nmr). The space used by Algorithm 1 is also O(nmr).
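Since Algorithm 1 is only available as a figure, the following is a minimal sketch of a dynamic programming procedure based on formula (1), not a transcription of Algorithm 1. It makes several assumptions that should be read as such: states with a non-empty output set are never entered, so that no prefix of a counted subsequence contains a constraint; unreachable table entries are marked by the sentinel -1 instead of \(-\infty \); all constraints are non-empty; and all function names are hypothetical.

```python
from collections import deque


def build_automaton(patterns, alphabet):
    """Return (delta, bad): next-function table and the flag O_k != {}."""
    goto, bad = [{}], [False]
    for p in sorted(patterns):                     # preorder node numbers
        node = 0
        for ch in p:
            if ch not in goto[node]:
                goto[node][ch] = len(goto)
                goto.append({})
                bad.append(False)
            node = goto[node][ch]
        bad[node] = True
    alphabet = sorted(set(alphabet) | set("".join(patterns)))
    pre = [0] * len(goto)
    delta = [{c: 0 for c in alphabet} for _ in goto]
    delta[0].update(goto[0])
    queue = deque(goto[0].values())
    while queue:
        u = queue.popleft()
        bad[u] = bad[u] or bad[pre[u]]             # output via failure links
        for c in alphabet:
            if c in goto[u]:
                v = goto[u][c]
                pre[v] = delta[pre[u]][c]
                delta[u][c] = v
                queue.append(v)
            else:
                delta[u][c] = delta[pre[u]][c]
    return delta, bad


def m_str_ec_lcs_length(x, y, patterns):
    delta, bad = build_automaton(patterns, set(x) | set(y))
    t, n, m, NEG = len(delta), len(x), len(y), -1  # NEG marks "no sequence"
    f = [[[NEG] * t for _ in range(m + 1)] for _ in range(n + 1)]
    for i in range(n + 1):
        f[i][0][0] = 0                             # boundary: empty sequence
    for j in range(m + 1):
        f[0][j][0] = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if x[i - 1] != y[j - 1]:
                f[i][j] = [max(a, b) for a, b in zip(f[i - 1][j], f[i][j - 1])]
            else:
                cur = list(f[i - 1][j - 1])        # case z_l != x_i
                for kb, val in enumerate(f[i - 1][j - 1]):
                    k = delta[kb][x[i - 1]]        # kb belongs to S(k, x_i)
                    if val != NEG and not bad[k]:
                        cur[k] = max(cur[k], val + 1)
                f[i][j] = cur
    return max(f[n][m][k] for k in range(t) if not bad[k])


print(m_str_ec_lcs_length("abab", "baba", ["ab"]))
```

The sketch iterates over source states and pushes values along \(\delta \), which visits the same pairs as taking the maximum over \(S(k,x_i)\) in formula (1). It uses O(nmt) time and space, where t is the number of nodes of the keyword tree.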

The number of constraints is an influential factor in the time and space complexities of our new algorithm. If a string \(P_i\) in the constraint set P is a proper substring of another string \(P_j\) in P, then an LCS of X and Y excluding \(P_i\) must also exclude \(P_j\). For this reason, the constraint string \(P_j\) can be removed from the constraint set P without changing the solution of the problem. Without loss of generality, we can put forward the following two assumptions on the constraint set P.

Assumption 1

There are no duplicate strings in the constraint set P.

Assumption 2

No string in the constraint set P is a proper substring of any other string in P.

If Assumption 1 is violated, then there must be some duplicated strings in the constraint set P. In this case, we can first sort the strings in the constraint set P, then duplicated strings can be removed from P easily and then Assumption 1 on the constraint set P is satisfied. It is clear that removed strings will not change the solution of the problem.

For Assumption 2, we first notice that a string A in the constraint set P is a proper substring of string B in P, if and only if in the keyword tree T of P, there is a directed path of failure links from a node v on the path from the root to the leaf node corresponding to string B to the leaf node corresponding to string A [1, 9]. For instance, in Fig. 1, there is a directed path of failure links from node 5 to node 7 and thus we know the string ba corresponding to node 7 is a proper substring of string aba corresponding to node 5.

With this fact, if Assumption 2 is violated, we can remove all proper superstrings from the constraint set P as follows. We first build a keyword tree T for the constraint set P. Then, by a depth first traversal of T, we mark every string B in P for which some node on the path from the root to the leaf corresponding to B has a directed path of failure links to the leaf corresponding to another string A in P. All the marked strings are proper superstrings and can be removed from P. Assumption 2 is now satisfied on the new constraint set, and the keyword tree T for the new constraint set is then rebuilt. It is not difficult to do this preprocessing in O(r) time. As noted above, removing proper superstrings does not change the solution of the problem.
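The preprocessing that enforces Assumptions 1 and 2 can be sketched as follows. The sketch removes duplicates and every constraint that properly contains another constraint, which is the reduction justified at the beginning of this discussion. It uses a breadth-first traversal and a per-node flag instead of the depth first traversal described in the text, and it also treats the case where one constraint is a proper prefix of another; both are choices of this sketch, and the function name is hypothetical.

```python
from collections import deque


def remove_redundant_constraints(patterns: list[str]) -> list[str]:
    pats = sorted(set(patterns))               # Assumption 1: no duplicates
    goto, term = [{}], [False]
    for p in pats:
        node = 0
        for ch in p:
            if ch not in goto[node]:
                goto[node][ch] = len(goto)
                goto.append({})
                term.append(False)
            node = goto[node][ch]
        term[node] = True
    pre = [0] * len(goto)
    hit = list(term)       # hit[v]: some constraint is a suffix of L(v)
    queue = deque(goto[0].values())
    while queue:
        u = queue.popleft()
        hit[u] = hit[u] or hit[pre[u]]
        for ch, v in goto[u].items():
            w = pre[u]
            while w and ch not in goto[w]:
                w = pre[w]
            pre[v] = goto[w].get(ch, 0)
            queue.append(v)
    kept = []
    for p in pats:
        node, redundant = 0, False
        for pos, ch in enumerate(p, start=1):
            node = goto[node][ch]
            # Before the last position any match is a shorter constraint;
            # at the last position only proper suffixes of p count.
            inner = hit[node] if pos < len(p) else hit[pre[node]]
            redundant = redundant or inner
        if not redundant:                      # Assumption 2: keep minimal ones
            kept.append(p)
    return kept


print(remove_redundant_constraints(["aab", "aba", "ba", "ba"]))
```

For the constraint set of Fig. 1 with `ba` listed twice, the duplicate is dropped and `aba` is removed because it contains `ba`, leaving `aab` and `ba`. Apart from the initial sort, the work is linear in r for a constant alphabet size.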

If we want to compute a longest common subsequence of X and Y excluding P itself, rather than just its length, we can use a simple recursive backtracking algorithm, given as Algorithm 2 below.

At the end of our new algorithm, we find an index k such that \(f(n,m,k)\) gives the length of an LCS of X and Y excluding P. Then, a function call \(back(n,m,k)\) will produce the answer LCS accordingly.

Algorithm 2 (figure b)

Since the cost of \(\delta (k,x_i)\) is O(1) in the worst case, the time complexity of the algorithm \(back(i,j,k)\) is \(O(n+m)\).
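Since Algorithm 2 is only available as a figure, the following is an iterative sketch of the backtracking idea, not a transcription of it. It rebuilds the table of formula (1) and then walks back from (n, m, k). The assumptions are the same as for any table-based sketch of formula (1): states with a non-empty output set are never entered, -1 marks unreachable entries, constraints are non-empty, and all names are hypothetical. In addition, the predecessor state is found by scanning all t states, so this sketch spends O(t) per step rather than O(1); storing the predecessor during the forward pass would avoid the scan.

```python
from collections import deque


def build_automaton(patterns, alphabet):
    goto, bad = [{}], [False]
    for p in sorted(patterns):
        node = 0
        for ch in p:
            if ch not in goto[node]:
                goto[node][ch] = len(goto)
                goto.append({})
                bad.append(False)
            node = goto[node][ch]
        bad[node] = True
    alphabet = sorted(set(alphabet) | set("".join(patterns)))
    pre = [0] * len(goto)
    delta = [{c: 0 for c in alphabet} for _ in goto]
    delta[0].update(goto[0])
    queue = deque(goto[0].values())
    while queue:
        u = queue.popleft()
        bad[u] = bad[u] or bad[pre[u]]
        for c in alphabet:
            if c in goto[u]:
                v = goto[u][c]
                pre[v], delta[u][c] = delta[pre[u]][c], v
                queue.append(v)
            else:
                delta[u][c] = delta[pre[u]][c]
    return delta, bad


def m_str_ec_lcs(x, y, patterns):
    delta, bad = build_automaton(patterns, set(x) | set(y))
    t, n, m, NEG = len(delta), len(x), len(y), -1
    f = [[[NEG] * t for _ in range(m + 1)] for _ in range(n + 1)]
    for i in range(n + 1):
        f[i][0][0] = 0
    for j in range(m + 1):
        f[0][j][0] = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if x[i - 1] != y[j - 1]:
                f[i][j] = [max(a, b) for a, b in zip(f[i - 1][j], f[i][j - 1])]
            else:
                cur = list(f[i - 1][j - 1])
                for kb, val in enumerate(f[i - 1][j - 1]):
                    k = delta[kb][x[i - 1]]
                    if val != NEG and not bad[k]:
                        cur[k] = max(cur[k], val + 1)
                f[i][j] = cur
    # Backtracking from the best admissible final state.
    k = max((s for s in range(t) if not bad[s]), key=lambda s: f[n][m][s])
    i, j, out = n, m, []
    while f[i][j][k] > 0:
        if x[i - 1] != y[j - 1]:
            if f[i - 1][j][k] == f[i][j][k]:
                i -= 1
            else:
                j -= 1
        elif f[i - 1][j - 1][k] == f[i][j][k]:
            i, j = i - 1, j - 1                    # x_i is not used
        else:
            c = x[i - 1]                           # x_i = y_j ends the answer
            k = next(kb for kb in range(t)
                     if f[i - 1][j - 1][kb] + 1 == f[i][j][k]
                     and delta[kb][c] == k)
            out.append(c)
            i, j = i - 1, j - 1
    return "".join(reversed(out))


print(m_str_ec_lcs("abab", "baba", ["ab"]))
```

For X = `abab`, Y = `baba` and the constraint `ab`, the walk returns `bb`, a common subsequence of length 2 that does not contain `ab`.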

Finally we summarize our results in the following theorem.

Theorem 2

For the two input sequences \(X=x_1x_2\cdots x_n\) and \(Y=y_1y_2\cdots y_m\) of lengths n and m, respectively, and a set of d constraints \(P=\{P_1,\cdots ,P_d\}\) of total length r, Algorithms 1 and 2 solve the M-STR-EC-LCS problem correctly in O(nmr) time and O(nmr) space, with preprocessing time \(O(r|\varSigma |)\).