1 Introduction

Pattern matching (also called string matching) is one of the essential problems in computer science, with broad applications [1, 2]. The most classical pattern matching algorithm is KMP, proposed by Knuth et al. [3]. After that, Fischer and Paterson [4] first studied pattern matching with wildcards (or 'don't care' symbols), in which the number of wildcards between two consecutive letters is a constant. Manber et al. [5] studied pattern matching with a gap constraint, where the number of wildcards is a range, but the pattern has only one gap constraint. In recent years, researchers have paid more attention to pattern matching with multiple gap constraints; the pattern in this kind of problem can be described as P = p_0[a_0, b_0]p_1…[a_j, b_j]p_{j+1}…[a_{m−2}, b_{m−2}]p_{m−1}, where a_j and b_j are the minimal and maximal numbers of wildcards between p_j and p_{j+1}. Pattern matching with gaps has been applied in many domains. For instance, Navarro and Raffinot [6] proposed two algorithms that can be used for protein searching. Cole et al. [7] used an approximate pattern matching approach to judge whether a pattern string is in a specified text or dictionary. Crochemore et al. [8] investigated (\({C_{m}^{T}} ,\alpha )\)-approximate matching, which can be used in the music retrieval field. Cantone et al. [9] focused on a parallel-by-bit approach that can be applied in music information retrieval and analysis. In sequential pattern mining, Ji et al. [10] proposed the ConSGapMine algorithm, which mines minimal distinguishing sequences. Ferreira and Azevedo [11] proposed the gIL algorithm to mine protein sequences. Zhang et al. [12] proposed the MPP algorithm to mine sequential patterns with periodic gap constraints. Zhu and Wu [13] and Wu et al. [14] proposed state-of-the-art algorithms with better performance than MPP. All the studies mentioned above employ a pattern matching strategy.
An illustrative example is given as follows to show all occurrences of a pattern with gaps in a sequence.

Example 1

Given pattern P = p 0[a 0, b 0]p 1[a 1, b 1]p 2 = C[0, 2]A[0, 2]G and sequence S = s 0 s 1 s 2 s 3 s 4 s 5 = CACAGG.

We know that CA..G is an occurrence which satisfies pattern P, and there are 5 such occurrences. For convenience, we use the subscripts of the matched characters to represent an occurrence; hence CA..G can be denoted by < 0, 1, 4 >. Thus, the 5 occurrences of this problem are < 0, 1, 4 >, < 0, 3, 4 >, < 0, 3, 5 >, < 2, 3, 4 >, and < 2, 3, 5 >. Nevertheless, < 0, 1, 5 > is not an occurrence, because the gap between positions 1 and 5 is 3, which fails to satisfy the gap constraint [0,2]. We show all occurrences in Fig. 1. So in the sequential pattern mining task, we say the support of P in S is 5, while in the pattern matching problem, we say the number of occurrences of P in S is 5. Therefore, pattern matching plays an important role in sequential pattern mining with gaps, since one of the essential tasks of sequential pattern mining is to calculate the support of a pattern.
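The enumeration in Example 1 can be reproduced with a small brute-force sketch (the function name and argument layout below are ours, purely for illustration; the exhaustive search over all index tuples is only suitable for tiny inputs and non-negative gaps):

```python
from itertools import product

def exact_occurrences(seq, pat, gaps):
    """Enumerate occurrences of a pattern with non-negative gap constraints.

    pat  -- pattern characters, e.g. "CAG"
    gaps -- (min, max) wildcard counts between consecutive pattern
            characters, e.g. [(0, 2), (0, 2)] for C[0,2]A[0,2]G
    """
    n, m = len(seq), len(pat)
    result = []
    for idx in product(range(n), repeat=m):
        # every selected character must match the pattern exactly
        if any(seq[i] != c for i, c in zip(idx, pat)):
            continue
        # gap constraint: a_j <= i_{j+1} - i_j - 1 <= b_j
        if all(a <= idx[j + 1] - idx[j] - 1 <= b
               for j, (a, b) in enumerate(gaps)):
            result.append(idx)
    return result

occs = exact_occurrences("CACAGG", "CAG", [(0, 2), (0, 2)])
```

On Example 1 this returns exactly the five occurrences listed above, and rejects < 0, 1, 5 >.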

Fig. 1
figure 1

Strict pattern matching with gaps

Sequential pattern mining has very important applications to real problems. For instance, miners can discover common sequential purchasing behaviours of customers from a transactional database [15]. A pattern can state that most customers purchased item A, after a while bought item B, and finally purchased item C. However, this kind of mining assumes non-negative gaps, which fixes the purchasing order of the consumers. Pattern ABC with non-negative gaps fails to be detected in a sequence containing BAC, but pattern ABC with general gaps does occur in that sequence. For example, < 2, 1, 4 > is an occurrence of pattern C[−1,2]A[0,2]G in sequence S = CACAGG. Hence sequential pattern mining with general gaps is more useful. We know that pattern matching is one of the essential tasks in pattern mining. Exact pattern matching is an idealization, since it does not tolerate noise, whereas approximate pattern matching does. In conclusion, approximate pattern matching with general gaps is a more challenging and more general issue. The contributions of this paper are as follows:

  1. (1)

    We propose the problem of Strict Approximate Pattern matching with general gaps and length constraints (SAP). When the Hamming distance threshold is 0, the problem reduces to exact pattern matching, which is called the SPANGLO problem [16]. We prove that a SAP instance can be transformed into exponentially many SPANGLO instances; therefore, we cannot use SETS [16] to solve a SAP problem.

  2. (2)

    To solve a SAP problem effectively, we propose an effective online algorithm, named SETA, which applies pruning strategies. In addition, we prove the correctness and completeness of SETA and analyse its time and space complexities, which are O(Maxlen × W × m^2 × n × T) and O(m × Maxlen × W × T) respectively, where m, n, Maxlen, W, and T are the length of pattern P, the length of sequence S, the maximal length constraint, the maximal span of pattern P, and the approximate threshold, respectively.

  3. (3)

    Extensive experimental results on real biological data show the correctness of transforming a SAP instance into an exponential number of SPANGLO instances, and also validate the correctness and effectiveness of SETA.

The rest of the paper is organized as follows. Section 2 summarizes the related work. Section 3 presents the definition of the SAP problem and analyses the method of transforming a SAP problem into SPANGLO problems. Section 4 shows how to create and calculate a subnettree for SAP and proves the correctness of the calculation method. After this, we propose the SETA algorithm, analyse its time and space complexities, and illustrate how SETA works. Section 5 validates the correctness of SETA on extensive real biological data. We conclude the paper in Section 6.

2 Related work

Example 1 is a kind of strict pattern matching. Another kind, called loose pattern matching, considers only the position of the last pattern substring in the sequence. Since the last pattern substring in Example 1 is p_2 = G, which can match positions 4 and 5 in sequence S, there are only 2 occurrences under loose pattern matching.

Because all the gaps of the pattern in Example 1 are no less than 0, p_{j+1} must appear to the right of p_j under the gap constraints. More general gap constraints can be negative; thus p_{j+1} can appear either to the right or to the left of p_j. If there is a negative gap in the pattern, the pattern is called a pattern with general gaps; when all the gaps are no less than 0, it is called a pattern with non-negative gaps. Pattern matching with general gaps has been applied not only to loose pattern matching [17, 18] but also to exact pattern matching [16].

The length constraint, which is composed of the minimal length constraint and the maximal length constraint, restrains the span of an occurrence, i.e. the distance between the minimal value and the maximal value of the occurrence. For instance, the minimal value and the maximal value of < 2, 3, 5 > are 2 and 5, respectively, so the span of < 2, 3, 5 > is 5 − 2 + 1 = 4. There are 3 occurrences of pattern P1 = C[−1,2]A[0,2]G in sequence S when the minimal and maximal length constraints are 3 and 4 respectively, i.e. < 2, 3, 4 >, < 2, 1, 4 >, and < 2, 3, 5 >. As the spans of < 0, 1, 4 >, < 0, 3, 4 >, and < 0, 3, 5 > are 5, 5, and 6, none of them satisfies the length constraints.
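The span arithmetic above can be checked with a one-line helper (`span` is our own name; the occurrence list is the six candidate tuples discussed in the text):

```python
def span(occ):
    # span = i_max - i_min + 1
    return max(occ) - min(occ) + 1

# five occurrences of Example 1 plus <2,1,4> from the general-gap pattern
candidates = [(0, 1, 4), (0, 3, 4), (0, 3, 5), (2, 3, 4), (2, 1, 4), (2, 3, 5)]
kept = [o for o in candidates if 3 <= span(o) <= 4]   # Minlen = 3, Maxlen = 4
```

Only the three occurrences with spans 3 and 4 survive the filter, matching the count in the text.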

What we discussed above are all examples of exact pattern matching, i.e. p_j must equal s_i. In practice, however, most situations call for approximate pattern matching. Approximate pattern matching includes matching based on edit distance and matching based on Hamming distance. In this paper, we focus on approximate pattern matching with Hamming distance; namely, the distance between the pattern and the substring corresponding to the occurrence must be no greater than a given threshold. For instance, if the given threshold is 1 in Example 1, then < 0, 1, 3 > is an approximate occurrence with Hamming distance 1, because s_3 = A ≠ p_2 = G. Nevertheless, < 0, 2, 3 > is still not an approximate occurrence under this threshold, because the Hamming distance between < 0, 2, 3 > and the pattern is 2. Of course, < 0, 1, 4 > is also an occurrence, because its distance to the pattern is 0, which is no greater than the threshold. Clearly, if the approximate threshold is 0, approximate pattern matching reduces to exact pattern matching. Therefore, compared with exact pattern matching, approximate pattern matching is more general.
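The distance checks in this paragraph can be verified directly (a minimal sketch; `hamming` and `distance_of` are our own helpers, not from the paper):

```python
def hamming(p, q):
    """Hamming distance between two equal-length strings."""
    assert len(p) == len(q)
    return sum(a != b for a, b in zip(p, q))

S, P = "CACAGG", "CAG"

def distance_of(occ):
    # distance between the pattern and the characters the occurrence selects
    return hamming(P, "".join(S[i] for i in occ))
```

Here < 0, 1, 3 > selects CAA (distance 1), < 0, 2, 3 > selects CCA (distance 2), and < 0, 1, 4 > selects CAG (distance 0).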

In pattern matching with gaps, there are some special conditions. One is the one-off condition, which means that any position in the sequence can be used at most once. Guo et al. [19] investigated pattern matching with the one-off condition. The one-off condition has many different names in sequential pattern mining; Ferreira and Azevedo [11], Huang et al. [20] and Lam et al. [21] all focused on pattern mining under it. Similar to the one-off condition, Ding et al. [22] researched sequential pattern mining with the non-overlapping condition, which means that a position in the sequence cannot be reused by other pattern substrings. In contrast to the one-off and non-overlapping conditions, there is also the no-special-condition setting, in which any position in the sequence can be used more than once. Min et al. [23] focused on pattern matching with length constraints. Zhang et al. [12] investigated sequential pattern mining with periodic gap constraints. All these studies assume no special condition. In this article, we also focus on the no-special-condition setting. Table 1 shows the related work in pattern matching.

Table 1 Comparison of related work

Table 1 shows that the main difference between [16] and our work is that [16] investigated exact pattern matching, while this study addresses approximate pattern matching, a more general issue. Since a SAP instance can be transformed into exponentially many SPANGLO instances, which were handled in [16], we propose an effective algorithm, named SETA, which employs effective pruning strategies. An illustrative example of effective pruning is given in Section 5.4. The following section defines the problem formally.

3 Problem definition and analysis

3.1 Problem definition

Definition 1

A sequence can be denoted as S = s_0 s_1 ⋯ s_i ⋯ s_{n−1}, where n is the length of S. \({\Sigma}\) represents the set of characters; for example, in a DNA sequence, \({\Sigma}\) is {A, T, G, C}.

Definition 2

A pattern with general gaps can be denoted as P = p_0[a_0, b_0]p_1…[a_{j−1}, b_{j−1}]p_j…[a_{m−2}, b_{m−2}]p_{m−1}, where m denotes the length of P, and a_j and b_j are given integers representing the minimal and maximal numbers of wildcards between p_j and p_{j+1}, with a_j ≤ b_j; in addition, a_j and b_j can be negative.

Definition 3

Given two sequences Q = q_0 q_1 ⋯ q_{m−1} and R = r_0 r_1 ⋯ r_{m−1}, if there are k positions at which the corresponding characters differ, i.e. q_i ≠ r_i (0 ≤ i < m), then the Hamming distance between the two strings is k (0 ≤ k ≤ m). D(Q, R) denotes the Hamming distance between Q and R.

Definition 4

Given a threshold d, if a group of position indices I = < i_0, …, i_j, …, i_{m−1} > satisfies the following equations

$$\begin{array}{@{}rcl@{}} && D(p_{0} p_{1} {\cdots} p_{m-1} ,s_{i_{0}} s_{i_{1}} {\cdots} s_{i_{m-1}} )\leq d \end{array} $$
(1)
$$\begin{array}{@{}rcl@{}} &&{\kern8pc}i_{j-1} \ne i_{j} \end{array} $$
(2)
$$\begin{array}{@{}rcl@{}} &&\left\{ \begin{array}{ll} {a_{j-1} \leq i_{j} -i_{j-1} -1 \leq b_{j-1}} &\text{, if } i_{j-1} <i_{j} \\ {a_{j-1} \leq i_{j} -i_{j-1} \leq b_{j-1}} &\text{, if } i_{j-1} >i_{j} \\ \end{array} \right. \end{array} $$
(3)

where 0 ≤ j ≤ m−1 and 0 ≤ i_j ≤ n−1, then I is an approximate occurrence of P in S.

Definition 5

An approximate occurrence I satisfies the length constraint if it is subject to the following equation

$$ \textit{Minlen}\leq i_{max}-i_{min}+1\leq \textit{Maxlen} $$
(4)

In addition, i_{max} − i_{min} + 1 is the span of occurrence I, where i_{max} = max(i_0, …, i_j, …, i_{m−1}), i_{min} = min(i_0, …, i_j, …, i_{m−1}), and Minlen and Maxlen are two given integers, the minimal and maximal length constraints, respectively.
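Definitions 4 and 5 can be turned into a direct membership test (an illustrative sketch only; the function and argument names are ours, not from the paper):

```python
def is_occurrence(seq, pat, gaps, occ, d, minlen, maxlen):
    """Check Eqs. (1)-(4): distance, position, gap and length constraints."""
    # Eq. (1): Hamming distance between pattern and selected characters <= d
    if sum(seq[i] != c for i, c in zip(occ, pat)) > d:
        return False
    for j in range(1, len(pat)):
        a, b = gaps[j - 1]
        if occ[j] == occ[j - 1]:          # Eq. (2): indices must differ
            return False
        if occ[j] > occ[j - 1]:           # Eq. (3), forward case
            if not a <= occ[j] - occ[j - 1] - 1 <= b:
                return False
        elif not a <= occ[j] - occ[j - 1] <= b:   # Eq. (3), backward case
            return False
    return minlen <= max(occ) - min(occ) + 1 <= maxlen   # Eq. (4)
```

For example, < 2, 1, 4 > passes for C[−1,2]A[0,2]G in CACAGG with Minlen = 3 and Maxlen = 4, while < 0, 1, 4 > fails on its span of 5.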

Definition 6

Let the set T(S, P, d) denote all the approximate occurrences and |T(S, P, d)| denote the size of T(S, P, d). SAP is to calculate |T(S, P, d)|.

3.2 Theoretical analysis

From Table 1, we know that SPANGLO [16] handles the exact matching problem, while SAP deals with the approximate matching problem. Given sequence S, pattern P, Minlen and Maxlen, apparently, the SPANGLO problem can be denoted by |T(S, P, 0)| .

Theorem 1

A SAP instance can be transformed into exponentially many SPANGLO instances.

Proof

Let f(S, P, k) = |T(S, P, k)| − |T(S, P, k−1)|. Then f(S, P, k) denotes the number of occurrences whose Hamming distance to the pattern is exactly k. Such an occurrence arises by arbitrarily choosing k different positions in pattern P and making the corresponding characters differ from the pattern; there are \({C_{m}^{k}} =\frac {m!}{k!\ast (m-k)!}\) choices of positions, and \(| {\Sigma} |-1\) different substitute characters at each chosen position. Therefore, f(S, P, k) can be transformed into \({C_{m}^{k}} \ast (| {\Sigma} | -1)^{k}\) SPANGLO instances. Since |T(S, P, d)| is \({\sum }_{i=0}^{d} {f(S,P,i)}\), |T(S, P, d)| can be transformed into \(1+{\sum }_{i=1}^{d} {{C_{m}^{i}} \ast (|{\Sigma} | -1)^{i}} \) SPANGLO instances. Hence Theorem 1 is proved.

Wu et al. [16] proposed an effective algorithm, named SETS, to solve SPANGLO. From Theorem 1, we know that SETS cannot solve SAP efficiently, since a SAP instance would have to be transformed into exponentially many SPANGLO instances. Therefore, we propose a new algorithm to solve SAP. □
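The instance count from the proof of Theorem 1 can be sketched numerically (assuming, per the combinatorial argument, that each of the i substituted positions may independently take any of |Σ| − 1 characters; the function name is ours):

```python
from math import comb

def spanglo_instances(m, sigma, d):
    """Number of SPANGLO instances a SAP instance expands into:
    1 + sum_{i=1}^{d} C(m, i) * (sigma - 1)**i,
    where m is the pattern length and sigma = |Sigma|."""
    return 1 + sum(comb(m, i) * (sigma - 1) ** i for i in range(1, d + 1))
```

Even for a short DNA pattern (m = 3, |Σ| = 4), the count grows quickly with the threshold d, which is why applying SETS instance by instance is impractical.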

4 Subnettree for SAP

4.1 Subnettree

Definition 7

A Nettree [28] is an extension of a tree: it shares many concepts with a tree, such as root, leaf, level, parent, and child, but it has four features that clearly distinguish it from a tree.

  1. (1)

    A Nettree may have n roots, where n ≥ 1;

  2. (2)

    Some nodes other than roots in a Nettree may have many parents;

  3. (3)

    There may be more than 1 path from a node to its ancestor node in a Nettree;

  4. (4)

    The same node label can appear in different levels in a Nettree. \(n_{j}^{i}\) denotes node i in the j-th level.

To solve SAP, a subnettree is employed, since the maximal node label in a subnettree is fixed, which lets us handle the length constraint. More importantly, this approach leads to an online algorithm.

Definition 8

A subnettree [16] is composed of three parts: a central node \({n^{i}_{j}}\), its ancestor nodes \( A({n^{i}_{j}})\), and its descendant nodes \(D(n_{j}^{i})\). An ancestor node \(n_{b}^{c} \) is a node on a path from node \(n_{j}^{i}\) to a root, with c ≤ i and 1 ≤ b < j. Similarly, a descendant node \({n^{e}_{f}}\) is a node on a path from node \(n_{j}^{i}\) to a leaf, with e < i. Subnettree \(n_{j}^{i}\) denotes a subnettree with central node \(n_{j}^{i}\).

From Definition 8, we see that there is only one node \(n_{j}^{i}\) in the j-th level and that i is the maximal node label in the subnettree: the maximal ancestor node label of \(n_{j}^{i}\) can be i, while the maximal descendant node label can only be i − 1.

Lemma 1

When we create subnettree \(n_{j }^{i} \) according to pattern P, the subnettree can be omitted in the following three cases.

  1. Case 1.

    j is equal to m, and gap b_{m−2} is less than 0.

  2. Case 2.

    j is equal to 1, and gap a 0 is no less than 0.

  3. Case 3.

    gap b j−2 is less than 0 or gap a j−1 is no less than 0, where 1 < j < m.

Proof

  1. Case 1.

    When j is equal to m, p_{m−1} matches s_i. Since a_{m−2} is no greater than b_{m−2}, if b_{m−2} is less than 0, then a_{m−2} is also less than 0. Suppose character s_h matches p_{m−2}; then h is greater than i according to gap [a_{m−2}, b_{m−2}], which contradicts the definition of the subnettree.

  2. Case 2.

    When j is equal to 1, p_0 matches s_i. If a_0 is no less than 0, then b_0 is also no less than 0. Suppose s_h matches p_1; then h is greater than i according to gap [a_0, b_0], which contradicts the definition of the subnettree.

  3. Case 3.

    Similarly, if gap b_{j−2} is less than 0 or gap a_{j−1} is no less than 0 (1 < j < m), then position h is greater than i according to the gap, where h is the position of the character s_h that would match p_{j−2} or p_j. This contradicts the definition of the subnettree. Therefore the lemma is proved. □
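Lemma 1 amounts to a constant-time test per level; a minimal sketch follows (the naming is ours; gaps are indexed as in Definition 2, so the gap before p_{m−1} is b[m−2]):

```python
def can_omit(j, m, a, b):
    """Lemma 1: subnettree n_j^i can be skipped outright.
    a, b -- gap bounds a_0..a_{m-2} and b_0..b_{m-2} of the pattern."""
    if j == m:                              # Case 1: central node at leaf level
        return b[m - 2] < 0
    if j == 1:                              # Case 2: central node at root level
        return a[0] >= 0
    return b[j - 2] < 0 or a[j - 1] >= 0    # Case 3: interior level

a, b = [-1, 0], [2, 2]    # gaps of P1 = C[-1,2]A[0,2]G
```

For the non-negative pattern of Example 1 (gaps [0,2][0,2]), every level except j = m is omitted, so only leaf-level subnettrees need to be built.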

To determine the range of the node labels in the j-th level of the subnettree, we propose the definitions of the maximal sibling and the minimal sibling of the j-th level.

Definition 9

The minimal sibling and the maximal sibling are the minimal and maximal node labels in the k-th level of the subnettree, denoted by c_k and e_k, respectively.

Lemma 2

In the process of creating the ancestor nodes of subnettree \(n_{j}^{i}\), we create the nodes in the k-th level according to the nodes in the (k+1)-th level, where 1 ≤ k < j. In this process, c_k and e_k can be calculated by (5) and (6), respectively.

Similarly, in the process of creating the descendant nodes of subnettree \(n_{j}^{i}\), we create the nodes in the k-th level according to the nodes in the (k−1)-th level, where j < k ≤ m. In this process, c_k and e_k can be calculated by (7) and (8), respectively.

$$ c_{k}\,=\,\left\{ \begin{array}{ll} {\max (0,i\,-\,\textit{Maxlen}\,+\,1,c_{k+1} \,-\,b_{k-1} -1)} &\quad{b_{k-1} \geq 0} \\ {\max (0,i\,-\,\textit{Maxlen}\,+\,1,c_{k+1} \,-\,b_{k-1} )} &\quad{b_{k-1} <0}\\ \end{array} \right. $$
(5)
$$ e_{k}=\left\{ \begin{array}{ll} {\min (i,e_{k+1} -a_{k-1} -1)} & \quad{a_{k-1} \geq 0} \\ {\min (i,e_{k+1} -a_{k-1} )} & \quad{a_{k-1} <0} \\ \end{array} \right. $$
(6)
$$ c_{k}\,=\,\left\{ \begin{array}{ll} {\max (0,i\,-\,\textit{Maxlen}\,+\,1,c_{k-1} \,+\,a_{k-2} +1)} & \quad{a_{k-2} \ge 0} \\ {\max (0,i\,-\,\textit{Maxlen}\,+\,1,c_{k-1} \,+\,a_{k-2} )} & \quad{a_{k-2} <0} \\ \end{array} \right. $$
(7)
$$ e_{k}=\left\{ \begin{array}{ll} {\min (i-1,e_{k-1} +b_{k-2} +1)} & \quad{b_{k-2} \ge 0} \\ {\min (i-1,e_{k-1} +b_{k-2} )} & \quad{b_{k-2} <0} \\ \end{array} \right. $$
(8)

Proof

First, we prove the method of calculating c_k and e_k in the ancestor set. Obviously, the minimal sibling c_k is no less than 0. Since c_k needs to satisfy the length constraint, c_k is also no less than i − Maxlen + 1. Besides, since the nodes in the k-th level and the (k+1)-th level correspond to p_{k−1} and p_k respectively, c_k also needs to satisfy gap b_{k−1}. According to (3), when b_{k−1} is less than 0, c_k is no less than c_{k+1} − b_{k−1}, while when b_{k−1} is no less than 0, c_k is no less than c_{k+1} − b_{k−1} − 1. Therefore, c_k can be calculated by (5). Similarly, according to Definition 8, the maximal value of e_k in subnettree \(n_{j}^{i}\) is i, and e_k also needs to satisfy gap a_{k−1}. Hence, e_k can be calculated by (6).

In the descendant set, the method of calculating c_k and e_k is similar to that in the ancestor set. The difference is that in the descendant set, the nodes in the k-th level are created according to the nodes in the (k−1)-th level, so the values of c_k and e_k are calculated from c_{k−1} and e_{k−1}, which need to satisfy gaps a_{k−2} and b_{k−2}, respectively. Therefore, c_k can be calculated by (7). Since the maximal label of the descendant nodes in the subnettree is i − 1, e_k can be calculated by (8). Hence Lemma 2 is proved. □
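The sibling-range recurrence translates directly into code; the sketch below covers the ancestor case of Eqs. (5) and (6), with a > 0 and b > 0 handled via the derivation in the proof (argument names are ours; the descendant case is symmetric):

```python
def ancestor_siblings(i, maxlen, c_next, e_next, a, b):
    """[c_k, e_k] for level k of the ancestor part, computed from the
    sibling range [c_{k+1}, e_{k+1}] of level k+1.
    a, b -- gap bounds a_{k-1}, b_{k-1} between p_{k-1} and p_k."""
    # Eq. (5): lower bound from 0, the length constraint, and gap b_{k-1}
    c = max(0, i - maxlen + 1, c_next - b - 1 if b >= 0 else c_next - b)
    # Eq. (6): upper bound from the central label i and gap a_{k-1}
    e = min(i, e_next - a - 1 if a >= 0 else e_next - a)
    return c, e
```

For instance, starting from the central level (c = e = i = 5) with Maxlen = 6 and gap [0, 2], the level below spans labels 2 through 4.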

Since some paths satisfy the length constraint and others fail to do so, we propose several concepts to distinguish the two kinds of path.

Definition 10

Let M be a path from node \(n_{j}^{i}\) to node \(n_{b}^{c}\), where 0 ≤ i < n, 1 ≤ j, 0 ≤ c < n, and 1 ≤ b. Let e be the minimal node label on this path, i.e. e = min(M). If path M satisfies the length constraint, i.e. Minlen ≤ i − e + 1 ≤ Maxlen, then we say that M is a path with length constraint; otherwise, M is a complement path with length constraint.

Definition 11

NAPS (Number of Ancestor Paths with Similarity constraint) is the number of paths from an ancestor node \(n_{k }^{l} \) to its central node \({n^{i}_{j}}\) with Hamming distance d, denoted by \(N_{A}(n_{j}^{i},n_{k}^{l},d)\). Among these paths, the number that satisfy the length constraint is called NAPLC (Number of Ancestor Paths with Length Constraints), denoted by \(N_{A}^{C}(n_{j}^{i},n_{k}^{l},d)\), and the number that do not satisfy the length constraint is called NCAPLC (Number of Complement of Ancestor Paths with Length Constraints), denoted by \(N^{\sim } _{A}(n_{j}^{i},n_{k}^{l},d)\). The initial value of \(N_{A}(n_{j}^{i},n_{j}^{i},d)\) is set as follows: the distance between s_i and p_{j−1} is 0 if s_i = p_{j−1}, and 1 otherwise. Therefore, if s_i = p_{j−1}, then \(N_{A}(n_{j}^{i},n_{j}^{i},0)\) = 1 and \(N_{A}(n_{j}^{i},n_{j}^{i},d)\) = 0 for any d > 0. Otherwise, \(N_{A}(n_{j}^{i},n_{j}^{i},1) = \)1 and \(N_{A}(n_{j}^{i},n_{j}^{i},d)\) = 0 for any d ≠ 1.

Obviously, \(N_{A}(n_{j}^{i},n_{k}^{l},d)\) is the sum of \(N_{A}^{C}(n_{j}^{i},n_{k}^{l},d)\) and \(N_{A}^{\sim } (n_{j}^{i},n_{k}^{l},d)\), i.e. \(N_{A}(n_{j}^{i},n_{k}^{l},d) = N_{A}^{C}(n_{j}^{i},n_{k}^{l},d)+ N_{A}^{\sim } (n_{j}^{i},n_{k}^{l},d)\). Next, we show how to calculate \(N_{A}(n_{j}^{i},n_{k}^{l},d)\), \(N_{A}^{C}(n_{j}^{i},n_{k}^{l},d)\), and \(N_{A}^{\sim } (n_{j}^{i},n_{k}^{l},d)\).

Lemma 3

\(N_{A}(n_{j}^{i},n_{k}^{l},d)\) is calculated according to (9).

Proof

The distance between s_l and p_{k−1} is 0 if s_l = p_{k−1}, and 1 otherwise. We know that, among the ancestor nodes of the subnettree, the nodes in the k-th level are created according to the nodes in the (k+1)-th level. Therefore, after adding node \(n_{k}^{l}\), if s_l = p_{k−1}, the distance is unchanged and \(N_{A}(n_{j}^{i},n_{k}^{l},d)\) is the sum of \(N_{A}(n_{j}^{i},n_{k+1}^{r_{q}} ,d)\); otherwise, the distance increases by 1, hence \(N_{A}(n_{j}^{i},n_{k}^{l},0) = 0\) and \(N_{A}(n_{j}^{i},n_{k}^{l},d)\) is the sum of \(N_{A}(n_{j}^{i},n_{k+1}^{r_{q}} ,d-1)\), where \(n_{k+1}^{r_{q}} \) is the q-th child node of \({n^{l}_{k}}\). Therefore Lemma 3 is proved. □

Lemma 4

\(N_{A}^{C}(n_{j}^{i},n_{k}^{l},d)\) and \(N_{A}^{\sim } (n_{j}^{i},n_{k}^{l},d)\) can be calculated according to (10) and (11), respectively.

Proof

If the distance between l and i satisfies the length constraint, then \(N_{A}^{C}(n_{j}^{i},n_{k}^{l},d) = N_{A}(n_{j}^{i},n_{k}^{l},d)\) and \(N_{A}^{\sim } (n_{j}^{i},n_{k}^{l},d) = 0\). Otherwise, we need to consider whether s_l is the same as p_{k−1} or not. If s_l and p_{k−1} are the same, \(N_{A}^{C}(n_{j}^{i},n_{k}^{l},d)\) and \(N_{A}^{\sim }(n_{j}^{i},n_{k}^{l},d)\) are the sums of \(N_{A}^{C} (n_{j}^{i} ,n_{k+1}^{r_{q}} ,d)\) and \(N_{A}^{\sim } (n_{j}^{i} ,n_{k+1}^{r_{q}} ,d)\), respectively. If s_l and p_{k−1} are different, then \(N_{A}^{C}(n_{j}^{i},n_{k}^{l},0)\) and \(N_{A}^{\sim } (n_{j}^{i},n_{k}^{l},0)\) are 0, and \(N_{A}^{C}(n_{j}^{i},n_{k}^{l},d)\) and \(N_{A}^{\sim } (n_{j}^{i},n_{k}^{l},d)\) are the sums of \(N_{A}^{C} (n_{j}^{i} ,n_{k+1}^{r_{q}} ,d-1)\) and \(N_{A}^{\sim } (n_{j}^{i} ,n_{k+1}^{r_{q}} ,d-1)\), respectively. Hence Lemma 4 is proved. □

$$\begin{array}{@{}rcl@{}} N_{A}(n_{j}^{i},n_{k}^{l},d) = \left\{ \begin{array}{ll} {\sum}_{q=1}^{t} {N_{A} (n_{j}^{i} ,n_{k+1}^{r_{q}} ,d)} & {s_{l} =p_{k-1}} \\ {\sum}_{q=1}^{t} {N_{A} (n_{j}^{i} ,n_{k+1}^{r_{q}} ,d-1)} & {s_{l} \ne p_{k-1} \text{ and } d>0}\\ 0 & {s_{l} \ne p_{k-1} \text{ and } d=0}\\ \end{array} \right. \end{array} $$
(9)

where \(n_{k+1}^{r_{q}} \) and t represent the q-th child node of \(n_{k }^{l} \) and the number of child nodes of the node \(n_{k }^{l} \) in the subnettree, respectively.
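Equation (9) can be evaluated as a memoized recursion over an explicitly stored subnettree. Below we hand-build the subnettree with central node \(n_{3}^{4}\) for S = CACAGG and P = C[0,2]A[0,2]G from Example 1 (the `children` map and all names are ours, purely for illustration; level k is taken to correspond to p_{k−1}):

```python
from functools import lru_cache

S, P = "CACAGG", "CAG"
CENTRAL = (3, 4)           # n_3^4: level 3, position 4 (s_4 = G)

# children[(k, l)]: level-(k+1) nodes reachable from n_k^l under
# the gap constraints [0,2][0,2] of the pattern
children = {
    (1, 0): [(2, 1), (2, 2), (2, 3)],
    (2, 1): [(3, 4)],
    (2, 2): [(3, 4)],
    (2, 3): [(3, 4)],
}

@lru_cache(maxsize=None)
def n_a(k, l, d):
    """N_A(n_3^4, n_k^l, d): number of paths from n_k^l to the
    central node whose Hamming distance is exactly d."""
    mismatch = int(S[l] != P[k - 1])       # level k corresponds to p_{k-1}
    if (k, l) == CENTRAL:                  # base case from Definition 11
        return 1 if d == mismatch else 0
    if d < mismatch:                       # distance budget exhausted
        return 0
    return sum(n_a(kk, ll, d - mismatch) for kk, ll in children[(k, l)])
```

Here n_a(1, 0, 0) = 2 counts the two distance-0 paths < 0, 1, 4 > and < 0, 3, 4 >, and n_a(1, 0, 1) = 1 counts the distance-1 path through position 2.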

$$ N_{A}^{C}(n_{j}^{i},n_{k}^{l},d)=\left\{ \begin{array}{ll} {N_{A} \left(n_{j}^{i},n_{k}^{l} ,d\right)} & {\textit{Minlen}\le i-l+1\leq \textit{Maxlen}} \\ {\sum}_{q=1}^{t} N_{A}^{C} \left(n_{j}^{i} ,n_{k+1}^{r_{q}} ,d\right) & {s_{l} =p_{k-1} \text{ and } (i-l+1>\textit{Maxlen}\text{ or } i-l+1<\textit{Minlen})}\\ {\sum}_{q=1}^{t} N_{A}^{C} \left(n_{j}^{i} ,n_{k+1}^{r_{q}} ,d-1\right) & {s_{l} \neq p_{k-1} \text{ and } (i-l+1>\textit{Maxlen}\text{ or } i-l+1<\textit{Minlen}) \text{ and } d>0}\\ 0 & {s_{l} \neq p_{k-1} \text{ and } (i-l+1>\textit{Maxlen}\text{ or } i-l+1<\textit{Minlen}) \text{ and } d=0} \end{array} \right. $$
(10)

where \(n_{k+1}^{r_{q}} \) and t represent the q-th child node of \(n^{l}_{k } \) and the number of child nodes of the node \(n^{l}_{k } \) in the subnettree respectively.

$$ N_{A}^{\sim} ({n^{i}_{j}},n_{k}^{l},d)=\left\{ \begin{array}{ll} 0 & {\textit{Minlen}\le i-l+1\le \textit{Maxlen}}\\ {\sum}_{q=1}^{t} N_{A}^{\sim} \left(n_{j}^{i} ,n_{k+1}^{r_{q}} ,d\right) & {s_{l} =p_{k-1} \text{ and } (i-l+1>\textit{Maxlen}\text{ or } i-l+1<\textit{Minlen})}\\ {\sum}_{q=1}^{t} N_{A}^{\sim} \left(n_{j}^{i} ,n_{k+1}^{r_{q}} ,d-1\right) & {s_{l} \ne p_{k-1} \text{ and } (i-l+1>\textit{Maxlen}\text{ or } i-l+1<\textit{Minlen}) \text{ and } d>0} \\ 0 & {s_{l} \neq p_{k-1} \text{ and } (i-l+1>\textit{Maxlen}\text{ or } i-l+1<\textit{Minlen}) \text{ and } d=0}\\ \end{array} \right. $$
(11)

where \(n_{k+1}^{r_{q}} \) and t represent the q-th child node of \(n^{l}_{k } \) and the number of child nodes of the node \(n^{l}_{k } \) in the subnettree respectively.

Definition 12

NRPLC (Number of Root Paths with Length Constraint) is the number of paths from the roots to the central node \(n_{j}^{i}\) with Hamming distance d that satisfy the length constraint, denoted by \({N^{C}_{R}}({n^{i}_{j}},d)\). Similarly, the number of such paths that do not satisfy the length constraint is called NCRPLC (Number of Complement of Root Paths with Length Constraint), denoted by \(N_{R}^{\sim } (n_{j}^{i},d)\). \(N_{R}^{C}({n^{i}_{j}},d)\) and \(N_{R}^{\sim }(n_{j}^{i},d)\) can be calculated by the following equations.

$$ N_{R}^{C}(n_{j}^{i},d) = {\sum}_{q=1}^{t} N_{A}^{C} (n_{j}^{i} ,n_{1}^{r_{q}} ,d) $$
(12)
$$ N_{R}^{\sim}(n_{j}^{i},d) = {\sum}_{q=1}^{t} N_{A}^{\sim} (n_{j}^{i} ,n_{1}^{r_{q}} ,d) $$
(13)

where \(n_{1}^{r_{q}} \) and t represent the q-th root and the number of roots in the subnettree, respectively.

Now, we will introduce how to create the descendant nodes of central node \({n_{j}^{i}}\).

Definition 13

NDPS (Number of Descent Paths with Similarity constraint) is the number of paths from the central node \({n^{i}_{j}}\) to its descendant node \({n^{l}_{k}}\) with Hamming distance d, denoted by \(N_{D}({n^{i}_{j}},{n^{l}_{k}},d)\). Among these paths, the number that satisfy the length constraint is called NDPLC (Number of Descent Paths with Length Constraints), denoted by \({N_{D}^{C}}({n_{j}^{i}},{n_{k}^{l}},d)\), and the number that do not satisfy the length constraint is called NCDPLC (Number of Complement of Descent Paths with Length Constraints), denoted by \(N_{D}^{\sim } ({n_{j}^{i}},{n_{k}^{l}},d)\). NDPS is initialized differently from NAPS: no matter whether s_i equals p_{j−1} or not, \(N_{D}({n_{j}^{i}},{n_{j}^{i}},0) = 1\) and \(N_{D}({n_{j}^{i}},{n_{j}^{i}},d) = 0\) for any d > 0. The reason is that when s_i and p_{j−1} differ, setting \(N_{D}({n_{j}^{i}},{n_{j}^{i}},1) = 1\) would cause the mismatch between s_i and p_{j−1} to be counted twice, once in the ancestor paths and once in the descendant paths.

Lemma 5

\(N_{D}({n_{j}^{i}},{n_{k}^{l}},d)\), \({N_{D}^{C}}({n_{j}^{i}},{n_{k}^{l}},d)\), and \(N_{D}^{\sim } ({n_{j}^{i}},{n_{k}^{l}},d)\) can be calculated by (14), (15), and (16), respectively.

Proof

Similar to the calculation of NAPS, NAPLC and NCAPLC for the ancestor nodes, we know that, among the descendant nodes of the subnettree, the nodes in the k-th level are created according to the nodes in the (k−1)-th level. After adding node \({n_{k}^{l}}\), if s_l = p_{k−1}, the distance is unchanged and \(N_{D}({n_{j}^{i}},{n_{k}^{l}},d)\) is the sum of \(N_{D} ({n_{j}^{i}} ,n_{k-1}^{r_{q}} ,d)\); otherwise, the distance increases by 1, \(N_{D}({n_{j}^{i}},{n_{k}^{l}},0) = 0\), and \(N_{D}({n_{j}^{i}},{n_{k}^{l}},d)\) is the sum of \(N_{D} ({n_{j}^{i}} ,n_{k-1}^{r_{q}} ,d-1)\), where \(n_{k-1}^{r_{q}} \) is the q-th parent node of node \({n_{k}^{l}}\). Hence, \(N_{D}({n_{j}^{i}},{n_{k}^{l}},d)\) can be calculated according to (14).

If l and i satisfy the length constraint, the values of NDPLC and NCDPLC of \({n_{k}^{l}}\) are the value of NDPS of \({n_{k}^{l}}\) and 0, respectively. Otherwise, we need to judge whether s_l and p_{k−1} are the same. If they are the same, \({N_{D}^{C}}({n_{j}^{i}},{n_{k}^{l}},d)\) and \(N_{D}^{\sim } ({n_{j}^{i}},{n_{k}^{l}},d)\) are the sums of \({N_{D}^{C}} ({n_{j}^{i}} ,n_{k-1}^{r_{q}} ,d)\) and \(N_{D}^{\sim } ({n_{j}^{i}} ,n_{k-1}^{r_{q}} ,d)\), respectively; otherwise, both \({N_{D}^{C}}({n_{j}^{i}},{n_{k}^{l}},0)\) and \(N_{D}^{\sim }({n_{j}^{i}},{n_{k}^{l}},0)\) are 0, and \({N_{D}^{C}}({n_{j}^{i}},{n_{k}^{l}},d)\) and \(N_{D}^{\sim } ({n_{j}^{i}},{n_{k}^{l}},d)\) are the sums of \({N_{D}^{C}} ({n_{j}^{i}} ,n_{k-1}^{r_{q}} ,d-1)\) and \(N_{D}^{\sim } ({n_{j}^{i}} ,n_{k-1}^{r_{q}} ,d-1)\), respectively. Therefore Lemma 5 is proved. □

$$\begin{array}{@{}rcl@{}}N_{D}({n_{j}^{i}},{n_{k}^{l}},d) = \left\{ \begin{array}{ll} {\sum}_{q=1}^{t} {N_{D} ({n_{j}^{i}}, n_{k-1}^{r_{q}} ,d)} & {s_{l} =p_{k-1}}\\ {\sum}_{q=1}^{t} {N_{D} ({n_{j}^{i}} ,n_{k-1}^{r_{q}} ,d-1)} & {s_{l} \neq p_{k-1} \text{ and } d>0}\\ 0 & {s_{l} \neq p_{k-1} \text{ and } d=0}\\ \end{array}\right. \end{array} $$
(14)

where \(n_{k-1}^{r_{q}} \) and t represent the q-th parent node of node \({n_{k}^{l}}\) and the number of parent nodes of \({n^{l}_{k}}\) in the subnettree, respectively.

$$ {N_{D}^{C}}({n_{j}^{i}},{n_{k}^{l}},d)=\left\{ \begin{array}{ll} {N_{D} \left({n_{j}^{i}} ,{n_{k}^{l}} ,d\right)} & {\textit{Minlen} \leq i-l+1 \leq \textit{Maxlen}}\\ {\sum}_{q=1}^{t} {{N_{D}^{C}} \left({n_{j}^{i}} ,n_{k-1}^{r_{q}} ,d\right)} & {s_{l} =p_{k-1} \text{ and } (i-l+1>\textit{Maxlen}\text{ or } i-l+1<\textit{Minlen})}\\ {\sum}_{q=1}^{t} {{N_{D}^{C}} \left({n_{j}^{i}} ,n_{k-1}^{r_{q}} ,d-1\right)} & {s_{l} \neq p_{k-1} \text{ and } (i-l+1>\textit{Maxlen}\text{ or } i-l+1<\textit{Minlen}) \text{ and } d>0}\\ 0 & {s_{l} \neq p_{k-1} \text{ and } (i-l+1>\textit{Maxlen}\text{ or } i-l+1<\textit{Minlen}) \text{ and } d=0} \\ \end{array} \right. $$
(15)

where \(n_{k-1}^{r_{q}} \) and t represent the q-th parent node of node \({n_{k}^{l}}\) and the number of parent nodes of \({n_{k}^{l}}\) in the subnettree, respectively.

$$ N_{D}^{\sim}({n_{j}^{i}}, {n_{k}^{l}},d)=\left\{ \begin{array}{ll} 0 & {\textit{Minlen}\leq i-l+1\leq \textit{Maxlen}}\\ {\sum}_{q=1}^{t} {N_{D}^{\sim} ({n_{j}^{i}} ,n_{k-1}^{r_{q}} ,d)} & {s_{l} =p_{k-1} \text{ and } (i-l+1>\textit{Maxlen}\text{ or } i-l+1<\textit{Minlen})}\\ {\sum}_{q=1}^{t} {N_{D}^{\sim} ({n_{j}^{i}} ,n_{k-1}^{r_{q}} ,d-1)} & {s_{l} \neq p_{k-1} \text{ and } (i-l+1>\textit{Maxlen}\text{ or } i-l+1<\textit{Minlen}) \text{ and } d>0}\\ 0 & {s_{l} \neq p_{k-1} \text{ and } (i-l+1>\textit{Maxlen}\text{ or } i-l+1<\textit{Minlen}) \text{ and } d=0}\\ \end{array} \right. $$
(16)

where \(n_{k-1}^{r_{q}} \) and t represent the q-th parent node of node \({n_{k}^{l}}\) and the number of parent nodes of \({n_{k}^{l}}\) in the subnettree, respectively.

Lemma 6

If \({\sum }_{e=0}^{d} N_{A}({n_{j}^{i}},{n_{k}^{l}},e) =\) 0, then node \({n_{k}^{l}}\) has no parent. Similarly, if \({\sum }_{e=0}^{d} {N_{D}({n_{j}^{i}},{n_{k}^{l}},e)} =\) 0, then node \({n_{k}^{l}}\) has no child.

Proof

Suppose \({n_{g}^{f}}\) is an ancestor node of \({n_{k}^{l}}\). The Hamming distance of each path from \({n_{g}^{f}}\) to \({n_{j}^{i}}\) is no less than that of the corresponding path from \({n_{k}^{l}}\) to \({n_{j}^{i}}\). If \({\sum }_{e=0}^{d} {N_{A}({n_{j}^{i}},{n_{k}^{l}},e)} =0\), then every path from \({n_{k}^{l}}\) to \({n_{j}^{i}}\) in the subnettree has Hamming distance greater than d, and therefore so does every path from \({n_{g}^{f}}\) to \({n_{j}^{i}}\). Hence, we do not need to create any ancestor of node \({n^{l}_{k}}\). Similarly, in the process of creating descendant nodes, if \({\sum }_{e=0}^{d} {N_{D}({n_{j}^{i}},{n_{k}^{l}},e)} =0\), we can safely prune the sub-tree whose root is node \({n_{k}^{l}}\). Hence, Lemma 6 is proved. □

Definition 14

Similar to the concepts of NRPLC and NCRPLC, NLPLC (Number of Leaf Paths with Length Constraint) is the number of paths from central node \({n_{j}^{i}}\) to the leaves in the m-th level with Hamming distance d that satisfy the length constraint, denoted by \({N_{L}^{C}}({n_{j}^{i}},d)\), while NCLPLC (Number of Complement of Leaf Paths with Length Constraint) is the number of such paths that do not satisfy the length constraint, denoted by \(N_{L}^{\sim }({n_{j}^{i}},d)\). \({N_{L}^{C}}({n_{j}^{i}},d)\) and \(N_{L}^{\sim } ({n_{j}^{i}},d)\) can be calculated by the following equations.

$$ {N_{L}^{C}}({n_{j}^{i}},d) = {\sum}_{q=1}^{t} {{N_{D}^{C}} ({n_{j}^{i}} ,n_{m}^{r_{q}} ,d)} $$
(17)
$$ N_{L}^{\sim}({n_{j}^{i}},d) = {\sum}_{q=1}^{t} {N_{D}^{\sim} ({n_{j}^{i}} ,n_{m}^{r_{q}} ,d)} $$
(18)

where \(n_{m}^{r_{q}} \) and t represent the q-th leaf and the number of leaves in the subnettree, respectively.

Definition 15

Suppose M is a path from a root to a leaf that passes through node \({n^{i}_{j}}\), and the Hamming distance between the corresponding string of this path in sequence S and p 0 p 1…p m−1 is no greater than the approximate threshold d. Then M is a root-leaf path with length and similarity constraints, or a root-leaf path for short. We use \({N_{T}^{C}}({n_{j}^{i}},d)\) to denote the number of root-leaf paths.

Lemma 7

\({N_{T}^{C}}({n_{j}^{i}},d)\) can be calculated as follows.

$$ {N_{T}^{C}}({n_{j}^{i}},d) = {\sum}_{k=0}^{d} {\sum}_{e=0}^{d-k} \left({N_{R}^{C}}({n_{j}^{i}},k)^{\ast}N_{L}^{\sim}({n_{j}^{i}},e) + N_{R}^{\sim}({n_{j}^{i}},k)^{\ast}{N_{L}^{C}}({n_{j}^{i}},e) + {N_{R}^{C}}({n_{j}^{i}},k)^{\ast}{N_{L}^{C}}({n_{j}^{i}},e)\right) $$
(19)

Proof

A root-leaf path in subnettree \({n_{j}^{i}}\) satisfies one of the following three cases.

  1. Case 1.

    The sub-path from a root to central node \({n^{i}_{j}}\) satisfies the length constraint while the sub-path from \({n_{j}^{i}}\) to a leaf does not. There are \({N_{R}^{C}}({n_{j}^{i}},k)^{\ast}N_{L}^{\sim}({n^{i}_{j}},e)\) root-leaf paths in this case.

  2. Case 2.

    The sub-path from a root to \({n_{j}^{i}}\) does not satisfy the length constraint, but the sub-path from \({n_{j}^{i}}\) to a leaf does. Similarly, there are \(N_{R}^{\sim}({n_{j}^{i}},k)^{\ast}{N^{C}_{L}}({n^{i}_{j}},e)\) root-leaf paths in this case.

  3. Case 3.

    Both sub-paths satisfy the length constraint. There are \({N_{R}^{C}}({n_{j}^{i}},k)^{\ast}{N_{L}^{C}}({n_{j}^{i}},e)\) root-leaf paths in this case.

Since the sum of k and e is no greater than the approximate threshold d, \({N_{T}^{C}}({n_{j}^{i}},d)\) can be calculated according to (19). □
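The case analysis above can be sketched directly from (19). The four arrays below are hypothetical precomputed tables of the NRPLC, NCRPLC, NLPLC, and NCLPLC values, indexed by Hamming distance; this is an illustrative Python sketch, not the authors' code.

```python
def n_t(nrc, nrt, nlc, nlt, d):
    """Combine root-side and leaf-side path counts as in (19).
    nrc[k] = N_R^C(n_j^i, k), nrt[k] = N_R^~(n_j^i, k); nlc/nlt are the
    analogous leaf-path counts. All four are assumed precomputed lists
    indexed by Hamming distance 0..d."""
    total = 0
    for k in range(d + 1):
        for e in range(d - k + 1):          # k + e never exceeds d
            total += (nrc[k] * nlt[e]       # Case 1: root side satisfies, leaf side fails
                      + nrt[k] * nlc[e]     # Case 2: root side fails, leaf side satisfies
                      + nrc[k] * nlc[e])    # Case 3: both satisfy the length constraint
    return total
```

The three products mirror Cases 1-3 of the proof; the missing fourth combination (both sub-paths failing the length constraint) is correctly never counted.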

Theorem 1

|T(S,P,d)| can be calculated as follows.

$$ |T(S,P,d)| = {\sum}_{i=\textit{Minlen}-1}^{n-1} {\sum}_{j=1}^{m} {N_{T}^{C}}({n_{j}^{i}},d) $$
(20)

where n, m, d, and Minlen are the length of the sequence, the length of the pattern, the approximation threshold, and the minimal length constraint, respectively.

Proof

Since all the subnettrees have m levels, j varies from 1 to m. Since i is the maximal position in subnettree \({n_{j}^{i}}\), i must be no less than Minlen−1 according to Definition 5, and clearly no greater than n−1. Each subnettree contributes \({N_{T}^{C}}({n_{j}^{i}},d)\) approximate occurrences. Hence, |T(S, P, d)| can be calculated according to (20), and Theorem 1 is proved. □
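Equation (20) then reduces to a double sum over subnettrees. A minimal sketch, with `ntc` standing in for a hypothetical routine that returns \({N_{T}^{C}}({n_{j}^{i}},d)\):

```python
def total_occurrences(ntc, n, m, minlen):
    """|T(S, P, d)| per (20): sum N_T^C over every subnettree n_j^i,
    with i ranging over Minlen-1 .. n-1 and j over 1 .. m.
    `ntc(i, j)` is an assumed callback returning N_T^C(n_j^i, d)."""
    return sum(ntc(i, j)
               for i in range(minlen - 1, n)
               for j in range(1, m + 1))
```

The loop bounds encode the proof's argument: i is at least Minlen−1 and at most n−1, and j runs over all m levels.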

4.2 SETA

We give the SETA algorithm as follows:

(Algorithm SETA: pseudocode figure)

4.3 Analysis

According to Theorem 1, SETA is a complete algorithm. Next, we analyse the space and time complexities of SETA.

Theorem 2

The space and time complexities of SETA are O(m × Maxlen × W × d) and O(Maxlen × W × m^2 × n × d), respectively, where m, Maxlen, W, and d are the length of pattern P, the maximal length constraint, the maximal gap of pattern P, and the approximation threshold, respectively.

Proof

It is easy to see that the space complexity of SETA is O(m × Maxlen × W × d), since there are at most m levels in a subnettree, each level has no more than Maxlen nodes, and each node has at most W parent (or child) nodes, i.e. W = max(max j − min j + 1) (0 ≤ j ≤ m−2). Besides, each node stores d + 1 groups of values, which hold NAPS, NAPLC, and NCAPLC (or NDPS, NDPLC, and NCDPLC) for the distances from 0 to d, respectively.

The time complexities of line 7 and line 15 are both O(1). Since each node has no more than W parent (or child) nodes and d + 1 groups of NAPS, NAPLC, and NCAPLC (or NDPS, NDPLC, and NCDPLC), the time complexities of line 10 and line 18 are both O(W × d). There are at most Maxlen nodes in each level, so the loops in line 8 and line 16 iterate at most O(Maxlen) times. Similarly, the time complexities of line 13 and line 21 are both O(Maxlen). The time complexity of line 22 is O(d^2); since in general d ≤ m, the d^2 term can be neglected. In conclusion, the time complexity from line 4 to line 23 is O(Maxlen × W × m × d). Therefore, the time complexity of SETA is O(Maxlen × W × m^2 × n × d). □

4.4 A running example

An illustrative instance is used to show how the SETA algorithm works.

Example 2

Given sequence S = ATGGAGAGA, pattern P = A[-2,1]G[0,1]A[-2,1]G[-2,1]A, Minlen = 4 and Maxlen = 5, and threshold d = 1, we calculate |T(S, P, d, Minlen, Maxlen)|.

Suppose i and j are 5 and 3 in lines 2 and 3, respectively. Since b 2 = 1 is greater than 0 and a 3 = -2 is less than 0, according to line 4, we will create subnettree \(n_{4}^{5}\) shown in Fig. 2(A). The left-top, right-top, left-bottom, and right-bottom values are the node’s NAPLC, NCAPLC, NDPLC, and NCDPLC, respectively. Since threshold d is 1, each group has d + 1 = 2 integers. The first one is the corresponding value with distance 0, while the last one is the value with distance 1.

Fig. 2
figure 2

Some subnettrees

Since there are many nodes in the subnettree and the calculating equations for each node are the same, we choose only one node in each level to illustrate how to calculate its NAPLC, NCAPLC, NDPLC, and NCDPLC, and the other nodes are ignored. Since s 5 = p 3 = “G”, according to SETA, we know that \({N_{A}^{C}}(n_{4}^{5},n_{4}^{5},0,\textit {LEN}) = 1, {N_{A}^{C}}(n_{4}^{5},n_{4}^{5} ,1,\textit {LEN}) = 0, N_{A}^{\sim } (n_{4}^{5},n_{4}^{5},0,\textit {LEN}) = N_{A}^{\sim }(n_{4}^{5},n_{4}^{5},1,\textit {LEN}) = 0\), and \({N_{D}^{C}}(n_{4}^{5},n_{4}^{5},0,\textit {LEN}) = 1, {N_{D}^{C}}(n_{4}^{5},n_{4}^{5},1,\textit {LEN}) = 0, N_{D}^{\sim } (n_{4}^{5},n_{4}^{5},0,\textit {LEN}) = N_{D}^{\sim } (n_{4}^{5},n_{4}^{5},1,\textit {LEN}) = \)0. Now we will create the ancestor nodes of \(n_{4}^{5}\). Since s 4 = p 2 = “A” and 5-4 +1 = 2 is less than Minlen, according to line 10, \({N_{A}^{C}}(n_{4}^{5},n_{3}^{4},0,\textit {LEN}) = {N_{A}^{C}}(n_{4}^{5},n_{3}^{4},1,\textit {LEN}) = 0, N_{A}^{\sim }(n_{4}^{5},n_{3}^{4},0, \textit {LEN}) = \)1, and \(N_{A}^{\sim }(n_{4}^{5},n_{3}^{4},1,\textit {LEN}) = \)0. Because node \(n_{2}^{2}\) has 2 child nodes, \(n_{3}^{3}\) and \(n_{3}^{4}\), and Minlen = 4 ≤ 5 − 2 + 1 = 4 ≤ Maxlen = 5, therefore, \({N_{A}^{C}}(n_{4}^{5},n^{2}_{2},0,\textit {LEN}) = 1, {N_{A}^{C}}(n_{4}^{5},n_{2}^{2},1,\textit {LEN}) = 1, N_{A}^{\sim }(n_{4}^{5},n_{2}^{2},0,\textit {LEN}) = N_{A}^{\sim } (n_{4}^{5},n_{2}^{2},1,\textit {LEN}) = \)0. Similarly, \({N_{A}^{C}}(n_{4}^{5},n_{1}^{3},0,\textit {LEN}) = 0, {N_{A}^{C}}(n_{4}^{5},n_{1}^{3},1,\textit {LEN}) = 1, N_{A}^{\sim }(n_{4}^{5},n_{1}^{3},0,\textit {LEN}) =N_{A}^{\sim }(n_{4}^{5},n_{1}^{3},1,\textit {LEN}) = \)0. Now we will create the descendant nodes of \(n_{4}^{5}\). 
According to lines 14 to 20, we know that \({N_{D}^{C}}(n_{4}^{5},n_{5}^{4},0,\textit {LEN}) = 1, {N_{D}^{C}}(n_{4}^{5},n_{5}^{4},1,\textit {LEN}) = \)0, and \(N_{D}^{\sim } (n_{4}^{5},n_{5}^{4},0,\textit {LEN}) = N_{D}^{\sim }(n_{4}^{5},n_{5}^{4},1,\textit {LEN}) = \)0. Hence, according to line 22, we know that \({N_{T}^{C}}\) of subnettree \(n_{4}^{5}\) is 2 + 1 + 1 + 2 + 1 = 7.

Subnettree \(n_{5}^{5}\), shown in Fig. 2(B), illustrates that SETA is an effective algorithm, since some nodes, such as \(n^{1}_{2}\), \(n^{4}_{2}\), and \(n_{2}^{5}\), are pruned according to lines 9 and 17. Besides, since sub-pattern “G[0,1]A” satisfies Lemma 1, we do not create subnettree \(n_{2}^{5}\) at all.

Similarly, we can compute the number of occurrences in the remaining subnettrees; we omit the details. From this example, we can see that SETA is a very effective algorithm.
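For readers who want to cross-check such examples, the following brute-force counter enumerates occurrences directly from the definitions (position tuples satisfying the gap, length, and Hamming-distance constraints). It is a naive reference sketch under our reading of the SAP definition, not SETA, and exponentially slower; the function name and argument layout are our own.

```python
from itertools import product

def sap_count(s, letters, gaps, d, minlen, maxlen):
    """Brute-force count of approximate occurrences of the pattern
    p0 [a0,b0] p1 ... [a_{m-2},b_{m-2}] p_{m-1} in sequence s.
    An occurrence is a position tuple (i0, ..., i_{m-1}) such that
    a_j <= i_{j+1} - i_j - 1 <= b_j (gaps may be negative),
    minlen <= i_{m-1} - i_0 + 1 <= maxlen, and at most d of the
    positions mismatch their pattern letters (Hamming distance)."""
    n, m = len(s), len(letters)
    count = 0
    for pos in product(range(n), repeat=m):
        if any(not (a <= pos[j + 1] - pos[j] - 1 <= b)
               for j, (a, b) in enumerate(gaps)):
            continue                          # some gap constraint violated
        if not (minlen <= pos[-1] - pos[0] + 1 <= maxlen):
            continue                          # overall length constraint violated
        if sum(s[i] != c for i, c in zip(pos, letters)) <= d:
            count += 1
    return count
```

For Example 2 this would be called as `sap_count("ATGGAGAGA", "AGAGA", [(-2, 1), (0, 1), (-2, 1), (-2, 1)], 1, 4, 5)`; we do not assert its value here, since the exact tallying of occurrences follows the paper's definitions.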

5 Experimental results and analysis

5.1 Experimental environment and data

The data used in this paper are real biological sequences provided by the National Center for Biotechnology Information website. Homo sapiens AX829174, AL158070, and AB038490 are chosen as our test data and can be downloaded from http://www.ncbi.nlm.nih.gov/nuccore/AX829174, http://www.ncbi.nlm.nih.gov/nuccore/AL158070.11, and http://www.ncbi.nlm.nih.gov/nuccore/AB038490, respectively. To show how different sequence lengths affect the running time, we divide each sequence into 3 segments of about 25 %, 50 %, and 100 % of the original length, respectively. Therefore, there are 9 sequences, shown in Table 2. The source code of SETA can be obtained from http://wuc.scse.hebut.edu.cn/nettree/sap/. All experiments are run on a laptop with an Intel(R) Core(TM)2 Duo T6670 @ 2.20 GHz CPU and 3.0 GB of RAM, running Windows 7.

Table 2 Real biological sequences

5.2 Correctness validation

To validate the correctness of SETA, we choose pattern P1 = A[−1,2]C[−2,3]G[−3,4]T, Minlen = 4, Maxlen = 10, and threshold d = 1, and calculate the number of approximate occurrences in sequences S1–S9. According to Theorem 1, this instance can be transformed into the following 13 SPANGLO instances: P1, Q1 = C[−1, 2]C[−2, 3]G[−3, 4]T, Q2 = G[−1, 2]C[−2, 3]G[−3, 4]T, Q3 = T[−1, 2]C[−2, 3]G[−3, 4]T, Q4 = A[−1, 2]A[−2, 3]G[−3, 4]T, Q5 = A[−1, 2]G[−2, 3]G[−3, 4]T, Q6 = A[−1, 2]T[−2, 3]G[−3, 4]T, Q7 = A[−1, 2]C[−2, 3]A[−3, 4]T, Q8 = A[−1, 2]C[−2, 3]C[−3, 4]T, Q9 = A[−1, 2]C[−2, 3]T[−3, 4]T, Q10 = A[−1, 2]C[−2, 3]G[−3, 4]A, Q11 = A[−1, 2]C[−2, 3]G[−3, 4]C, and Q12 = A[−1, 2]C[−2, 3]G[−3, 4]G. We calculate the SPANGLO results of the 13 patterns in the 9 sequences using SETS and the SAP result of pattern P1 using SETA. The results are shown in Table 3. To validate the effectiveness of SETA, Table 4 shows the running time.
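Under our reading, the 13 instances above are exactly the patterns whose letter sequence is within Hamming distance 1 of ACGT, with the gap constraints left unchanged. A short sketch of this enumeration (illustrative, not the authors' code):

```python
def spanglo_instances(letters, alphabet="ACGT"):
    """All letter sequences within Hamming distance 1 of `letters`:
    the original pattern plus every single-letter substitution.
    The gap constraints are unchanged, so only the letters are listed."""
    variants = [letters]
    for j, c in enumerate(letters):
        for x in alphabet:
            if x != c:
                variants.append(letters[:j] + x + letters[j + 1:])
    return variants
```

For a pattern of length m over an alphabet of size 4, this yields m(4−1) substitutions plus the original, i.e. 13 patterns when m = 4.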

Table 3 Results of SETS and SETA
Table 4 Running time of SETS and SETA (ms)

As shown in Table 3, the results validate the correctness of both Theorem 1 and SETA. Taking the longest sequence S6 as an example, the sum of the 13 SPANGLO instances using SETS is 1127988, and the result of the SAP instance using SETA is also 1127988. We obtain the same agreement on the other 8 sequences. Therefore, both Theorem 1 and SETA are correct. On the other hand, when the length of the pattern, the size of the alphabet, and the threshold are 4, 4, and 1, respectively, a SAP instance is transformed into 1+4(4−1) = 13 SPANGLO instances. Hence, when any of d, m, or \(|\sum |\) increases, the number of corresponding SPANGLO instances of a SAP instance grows rapidly; in particular, when d increases, the number of instances grows exponentially. Therefore, it is essential to propose an algorithm that deals with SAP directly.

Table 4 demonstrates the effectiveness of SETA. Again taking the longest sequence S6 as an example, if we adopt SETS, the sum of the running times of the 13 patterns is 20792 ms, while SETA takes only 3987 ms, which is about 5 times faster. As noted above, when any of d, m, or \(|\sum |\) increases, the advantage of SETA over SETS grows significantly.

5.3 Effectiveness

We know that Lemmas 1 and 6 are two pruning strategies. To show how the two strategies affect the running time, we also implement three algorithms named SETA1, SETA2, and SETA3. Neither Lemma 1 nor Lemma 6 is employed in SETA1, while Lemma 1 and Lemma 6 are employed in SETA2 and SETA3, respectively. We select 5 patterns with different numbers of negative gaps: P2 = G[0, 2]A[0, 1]C[1, 2]G[0, 2]T[0, 1]C[0, 1]C[0, 2]A[0, 1]C, P3 = G[−2, 2]A[−2, 1]C[1, 2]G[0, 2]T[0, 1]C[0, 1]C[0, 2]A[0, 1]C, P4 = G[−2, 2]A[−2, 1]C[−2, 2]G[−2, 2]T[0, 1]C[0, 1]C[0, 2]A[0, 1]C, P5 = G[−2, 2]A[−2, 1]C[−2, 2]G[−2, 2]T[−2, 1]C[−2, 1]C[0, 2]A[0, 1]C, and P6 = G[−2, 2]A[−2, 1]C[−2, 2]G[−2, 2]T[−2, 1]C[−2, 1]C[−2, 2]A[−2, 1]C. The length constraints and threshold are Minlen = 11, Maxlen = 16, and d = 1, respectively. Figure 3 shows the sum of the running times of these patterns on S3, S6, and S9.

Fig. 3
figure 3

Comparison of the running time

We know that P2, P3, P4, P5, and P6 have 0, 2, 4, 6, and 8 negative gaps, respectively. From Fig. 3, we see that the running time tends to increase with the number of negative gaps. Since Lemma 1 prunes the subnettree according to the pattern, the fewer negative gaps there are, the more effective this strategy is. Therefore, SETA2 is more effective on P2 but less effective on P6. The most important conclusion is that SETA is the most effective algorithm, because it employs both pruning strategies.

5.4 Evaluation

In this section, we generally neglect Minlen and Maxlen, and the threshold is d = 1 unless otherwise stated.

5.4.1 Lengths of pattern and sequence evaluation

To show how the length of pattern affects the results and running time, we choose pattern P7 = G[−1, 3]T[−1, 3]A[−1, 3]G[−1, 3]T[−1, 3]A[−1, 3]G[−1, 3]T[−1, 3]A[−1, 3]G[−1, 3]T whose length is 11 and its sub-patterns; for instance, if m = 4, it indicates that the length of the prefix pattern of P7 is 4, i.e., G[−1, 3]T[−1, 3]A[−1, 3]G. We show the running time and results in Tables 5 and 6, respectively.

Table 5 Running time for different lengths of patterns(ms)
Table 6 Results for different lengths of patterns

From Table 5, we can clearly see that the running time of SETA grows linearly with the length of the sequence. For example, when m = 4, the running times on sequences S7, S8, and S9 are 612, 1454, and 2766 ms, respectively. We also notice that the running time grows quadratically with m. Hence, these experimental results validate the time complexity of SETA.

We can see from Table 6 that, as the length of the pattern increases, the solution of SAP increases rapidly. In particular, we notice an interesting phenomenon: every time m increases by 2, the solution of SAP roughly doubles. Taking sequence S1 as an example, when m = 4 the result is 13890, while when m = 6 the result increases to 30401, which is 2.2 times the previous value; most instances in this table show the same phenomenon. Besides, a more notable phenomenon is that the solution of SAP grows linearly with the length of the sequence. We can see from Table 2 that the length of S2 is twice that of S1, and the result of SAP on S2 is also about twice that on S1. All other instances in Table 6 present the same phenomenon. Hence the solution of SAP is approximately proportional to n × W^(m−1).

5.4.2 Threshold evaluation

In order to show how the threshold affects the running time and results, we use pattern P8 = G[−1, 3]T[−1, 3]A[−1, 3]G[−1, 3]T[−1, 3] A[−1, 3]G[−1, 3]T[−1, 3]A[−1, 3]G[−1, 3]T, Minlen = 11, Maxlen = 16, and d = 0, d = 1, d = 2, d = 3, d = 4, and d = 5, respectively. The running time and results are shown in Tables 7 and 8, respectively.

Table 7 Running time for different thresholds (ms)
Table 8 Results for different thresholds

We can see from Table 7 that the running time of SETA grows linearly with d. Taking S3 as an example, when d = 2 the running time is 3948 ms, which is about twice that of d = 1. Similarly, the running times for d = 3, d = 4, and d = 5 are about 3, 4, and 5 times that of d = 1, respectively. Besides, we notice that the running time changes significantly from d = 0 to d = 1. The reason is that d = 0 corresponds to exact matching, where SETA prunes according to Lemma 6, which improves the speed. In summary, the experiments validate that the running time grows linearly with d.

From Table 8, we can see that the result of SAP increases rapidly with d. Taking S3 as an example, when d varies from 2 to 3, the result increases about 8.7 times. The reason is that a SAP instance can be transformed into \({C_{m}^{d}}\) SPANGLO instances. Hence, the result of SAP grows exponentially with d.
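The growth in the number of transformed instances can be made concrete. Assuming a threshold-d SAP instance expands into every pattern within Hamming distance d of the original (the rule that yields the 13 instances of Section 5.2 for m = 4, \(|\sum |\) = 4, and d = 1; this expansion rule is our reading, not a quotation of the paper), the count is \({\sum }_{k=0}^{d} {C_{m}^{k}}(|\sum |-1)^{k}\):

```python
from math import comb

def num_exact_instances(m, sigma, d):
    """Number of patterns within Hamming distance d of an m-letter
    pattern over an alphabet of size sigma, under the assumed
    expansion rule: choose k mismatch positions, then a substitute
    letter for each, summed over k = 0..d."""
    return sum(comb(m, k) * (sigma - 1) ** k for k in range(d + 1))
```

For m = 4 and sigma = 4 this gives 1, 13, 67, ... for d = 0, 1, 2, matching the 13 instances reported for d = 1 and confirming the exponential growth in d.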

6 Conclusion

In this paper, we propose the SAP problem, which is strict approximate pattern matching with general gaps and length constraints. We prove that a SAP instance can be converted into an exponential number of exact matching instances and design an effective online algorithm, named SETA, which employs the subnettree structure and adopts several efficient pruning strategies to deal with SAP online. We analyse the time and space complexities of SETA, which are O(Maxlen × W × m^2 × n × d) and O(m × Maxlen × W × d), respectively, where m, Maxlen, W, and d are the length of pattern P, the maximal length constraint, the maximal gap of pattern P, and the approximation threshold, respectively. Besides, extensive experimental results validate the correctness and completeness of SETA, and the contrast experiments validate its effectiveness. Finally, we also illustrate how m, n, and d affect the results and running time.

In the future, we will focus on mining approximate sequential patterns with general gaps, especially for larger sequences [29]. Besides, while this paper focuses on strict pattern matching without special conditions, there are also types of strict pattern matching with the non-overlapping condition or the one-off condition, which are worth exploring.