
1 Introduction

The problem of searching for a given string in a text has a wide range of applications, such as text-editing programs, search engines and searching for patterns in a DNA sequence. There are non-indexed and indexed versions of this problem. In the indexed version, the string (pattern or text) may be preprocessed before searching for the pattern in the text; the motivation for preprocessing is to improve the efficiency of the search. The standard variations of string matching problems are exact string matching [7, 14], parameterized string matching [3,4,5], approximate string matching [23] and approximate parameterized string matching [8, 26].

The Approximate Parameterized String Matching (APSM) problem is a well-studied problem [1, 2, 13, 19, 21, 25]. In [19], Hazay et al. gave a reduction between the maximum weight bipartite matching problem [12, 15,16,17, 20] and the APSM problem for two equal-length strings. They used the maximum weight bipartite matching (decomposition) algorithm, originally proposed by Kao et al. [20], to solve the APSM problem between two equal-length strings \(p \in \varSigma _P^*\) and \(t \in \varSigma _T^*\) in time \(O(m^{1.5})\), where \(|p|=|t|=m\).

In this paper, we investigate the All Pairs (best) Approximate Parameterized String Matching (APAPSM) problem with error threshold k (with respect to the Hamming distance error model) between two sets of equal-length strings. Let \(P=\{p_1, ~ p_2, \ldots , p_{n_P}\}\) \( \subseteq \varSigma _P^m\) and \(T=\{t_1, ~ t_2, \ldots , t_{n_T}\}\) \(\subseteq \varSigma _T^m\) be two sets of strings, where \(|\varSigma _P|=|\varSigma _T|=\sigma \). The APAPSM problem is to find, for each \(p_i \in P\) (\(1 \le i \le n_P\)), a string \(t_j \in T\) (\(1 \le j \le n_T\)) which is approximately parameterized closest to \(p_i\) under the threshold k.

Section 2 describes the preliminaries required to understand the APAPSM problem, which is explained in detail in the next section. In Sect. 3, we discuss a solution to the APAPSM problem with worst-case complexity \(O(n_P \, n_T \, m)\), assuming a constant-size alphabet. Next, we design a filtering technique using the Parikh vector [24] in order to preprocess the given strings and reduce the number of pair comparisons needed for solving APSM between pairs of strings with error threshold k. We call it the PV-filter. Even though the filter does not improve the asymptotic bound, the experimental results in Sect. 4 show that it performs well for small error thresholds. Finally, Sect. 5 summarizes the results.

2 Preliminaries and Related Results

We use some basic notions throughout the paper. Let \(\mathbb {N}_0\) be the set of non-negative integers. An alphabet is a non-empty finite set of symbols. A string over a given alphabet is a finite sequence of symbols. We denote by \(\varSigma ^*\) the set of all finite-length strings over the alphabet \(\varSigma \). The empty string is denoted by \(\varepsilon \). The length of a string w is the total number of symbols in w and is denoted by |w|; so \(|\varepsilon |=0\). Let \(\varSigma ^+ = \varSigma ^* \setminus \{ \varepsilon \}\) and, for a given \(m \in \mathbb {N}_0\), let \(\varSigma ^m\) be the set of all strings of length m over the alphabet \(\varSigma \) [26].

Let \(w=xyz\) be a string where \(x, y, z \in \varSigma ^*\). We call y a substring of w. If \(x=\varepsilon \) then y is a prefix of w, and if \(z=\varepsilon \) then y is a suffix of w. The i-th symbol of a string w is denoted by w[i] for \(1 \le i \le |w|\). We denote by w[i..j] the substring of w that starts at position i and ends at position j, for \(1 \le i \le j \le |w|\); \(w[i..j]=\varepsilon \) if \(i>j\) [26].

Approximate String Matching (ASM): The ASM problem is the string matching problem with errors. It is an important problem in many branches of computer science, with applications to text searching, computational biology, pattern recognition, signal processing, etc. [9, 23, 26].

Let \(d{:\;} \varSigma ^* \times \varSigma ^* \rightarrow \mathbb {N}_0\) be a distance function. The distance d(x, y) between two strings \(x=x[1 .. n] \in \varSigma ^*\) and \(y=y[1 .. m] \in \varSigma ^*\) is the minimal cost of a sequence of operations that transforms x into y (and \(\infty \) if no such sequence exists). The cost of a sequence of operations is the sum of the costs of the individual operations. In general, the possible operations are insertion, deletion, substitution (replacement) and transposition [23]. Under such a distance measure, the ASM problem is to find the text positions where the pattern occurs with a total transformation cost that is low enough, i.e., to minimize the total cost of transforming the pattern and its occurrence in the text to make them equal. Some of the most classical distance metrics are the Levenshtein distance [22], the Damerau distance [10] and the Hamming distance [18]. The Hamming Distance (HD), denoted \(d_H\), allows only replacements and is restricted to equal-length strings. In the literature, the corresponding search problem is often called “string matching with mismatches” [23, 26].
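For concreteness, a minimal Python sketch of the Hamming distance computation (the function name hamming is ours):

```python
def hamming(x, y):
    """Hamming distance d_H(x, y): number of positions at which x and y differ."""
    assert len(x) == len(y), "Hamming distance is defined only for equal-length strings"
    return sum(a != b for a, b in zip(x, y))

# hamming("abab", "abba") == 2
```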

Since the APAPSM problem is considered under HD in this paper, the following definitions use HD and are applied to equal-length strings. From now on we write \(d=d_H\) for notational simplicity. In the definitions below, let \(k \in {\mathbb {N}_0}\) be an error threshold and let \(u \in \varSigma ^*_u\) and \(v \in \varSigma ^*_v\) be a pair of strings with \(m=|u|=|v|\). Without loss of generality, we assume that both alphabets have the same size when dealing with a bijection between them.

Parameterized String Matching (PSM): String \(u=u[1 .. m]\) is said to be a parameterized match or p-match with v (denoted as \(u ~\widehat{=}~ v\)) if there exists a bijection \(\pi {:\;} \varSigma _u \rightarrow \varSigma _{v}\) such that \(\pi (u)=\pi (u[1])\pi (u[2]) \ldots \pi (u[m])=v\) [3].
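A p-match can be tested in one left-to-right scan by building the bijection greedily. A minimal Python sketch (the name is_p_match is ours; it maps only the symbols that actually occur, which suffices when the two alphabets have equal size):

```python
def is_p_match(u, v):
    """Check whether u p-matches v, i.e. some bijection pi maps u onto v symbol-wise."""
    if len(u) != len(v):
        return False
    forward, backward = {}, {}      # pi and its inverse, built incrementally
    for a, b in zip(u, v):
        if forward.setdefault(a, b) != b or backward.setdefault(b, a) != a:
            return False            # the required mapping would not be a bijection
    return True

# is_p_match("abab", "dcdc") -> True;  is_p_match("abab", "ccdd") -> False
```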

Approximate Parameterized String Matching (Without Error Threshold): Given a bijection \(\pi {:\;} \varSigma _u \rightarrow \varSigma _v\), the \(\pi \)-mismatch between u and v is the HD between the image of u under \(\pi \) and v, i.e., \(d(\pi (u),v)\) [19]. We denote this by \(\pi \)-mismatch(u, v). Note that there is an exponential number of possible bijections from \(\varSigma _u\) to \(\varSigma _v\). Also, a \(\pi \) for which \(d(\pi (u),v)\) is minimum may not be unique.

The Approximate Parameterized String Matching (APSM) between u and v is to find a \(\pi \) such that \(\pi \)-mismatch(u, v) is minimized over all bijections. We denote this by \(\textit{APSM}(u,v)\). Formally, \(\textit{APSM}(u,v)=\{\pi ~|~d(\pi (u),v)\) is minimum over all \(\pi \}\). We define the \(\textit{cost of }\textit{APSM}(u,v)\) as \(\textit{cost}({\textit{APSM}}(u,v))=d(\pi (u),v)\) where \(\pi \in \textit{APSM}(u,v)\).
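For illustration, a brute-force Python sketch of \(\textit{APSM}(u,v)\) that enumerates all bijections; this is exponential in \(\sigma \) and is not the paper's method (which uses the MWBM reduction of [19]). All names are ours:

```python
from itertools import permutations

def apsm(u, v, sigma_p, sigma_t):
    """Return (best_pis, best_cost): all bijections minimizing the pi-mismatch, and that cost."""
    best_cost, best_pis = len(u) + 1, []
    for perm in permutations(sigma_t):
        pi = dict(zip(sigma_p, perm))                  # one candidate bijection pi
        cost = sum(pi[a] != b for a, b in zip(u, v))   # pi-mismatch(u, v) = d(pi(u), v)
        if cost < best_cost:
            best_cost, best_pis = cost, [pi]
        elif cost == best_cost:
            best_pis.append(pi)
    return best_pis, best_cost

# apsm("abab", "cccd", "ab", "cd") -> ([{'a': 'c', 'b': 'd'}], 1)
```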

Parameterized String Matching (PSM) with k Mismatches: PSM with k mismatches seeks a bijection \(\pi {:\;} \varSigma _u \rightarrow \varSigma _v\) such that \(\pi \text{-mismatch }(u,v)\le k\). We then say that u parameterized matches v with threshold k. In the literature, this problem is also known as the string comparison problem with threshold k [19]. Note that any \(\pi \) with \(\pi \)-mismatch\((u,v) \le k\) is satisfactory in this case (i.e., \(\pi \)-mismatch(u, v) need not be the minimum over all \(\pi {:\;} \varSigma _u \rightarrow \varSigma _v\)).

Both of the above problems were solved in \(O(m^{1.5})\) time [19] by reducing them to the maximum weight bipartite matching problem and using Kao et al.'s algorithm [20]. We now define the APSM problem with error threshold k.

Approximate Parameterized String Matching with k Error Threshold: APSM with error threshold k, denoted \(\textit{APSM}(u,v,k)\), seeks a \(\pi {:\;} \varSigma _u \rightarrow \varSigma _v\) (over all bijections) such that \(d(\pi (u),v)\) is minimum but not greater than k [19]. More formally, \(\textit{APSM}(u,v,k)=\{\pi ~|~ \pi \in \textit{APSM}(u,v) ~ \wedge ~ d(\pi (u),v) \le k\}\). We define the cost of \(\textit{APSM}(u,v,k)\) as \(\textit{cost}(\textit{APSM}(u,v,k))=d(\pi (u),v)\), where \(\pi \in \textit{APSM}(u,v,k)\). If \(\textit{APSM}(u,v,k)=\emptyset \), then \(\textit{cost}(\textit{APSM}(u,v,k))= \infty \).
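Continuing the brute-force illustration, \(\textit{cost}(\textit{APSM}(u,v,k))\) can be sketched as follows (names are ours):

```python
from itertools import permutations
import math

def apsm_cost_k(u, v, k, sigma_p, sigma_t):
    """cost(APSM(u, v, k)): minimum pi-mismatch over all bijections if it is <= k, else infinity."""
    best = min(
        sum(pi[a] != b for a, b in zip(u, v))
        for pi in (dict(zip(sigma_p, perm)) for perm in permutations(sigma_t))
    )
    return best if best <= k else math.inf

# apsm_cost_k("abab", "ccdd", 1, "ab", "cd") == math.inf   (the minimum cost is 2 > k)
# apsm_cost_k("abab", "cccd", 1, "ab", "cd") == 1
```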

Example 1 in the next section illustrates the difference between the above definitions.

3 All Pairs Approximate Parameterized String Matching

In this section, we investigate the all pairs (best) approximate parameterized string matching (APAPSM) problem with error threshold k (with respect to the Hamming distance error model) between the two sets P and T of equal-length strings. The problem definition is given below, along with the other required definitions.

Definition 1

(Pair Approximate Parameterized String Matching (PAPSM) with Error Threshold k). Given a string \(p \in \varSigma _P^m\) and a set \(T=\{t_1, t_2, \ldots , t_{n_T}\} \subseteq \varSigma _T^m\), where \(|\varSigma _P|=|\varSigma _T|=\sigma \) and \(0 \le k \le m\), the PAPSM problem with error threshold k is to find an index j, \(1 \le j \le n_T\), such that \(\textit{APSM}(p, t_j,k)\) yields a bijection \(\pi _j\) and \(d(\pi _j(p),t_j)\) is minimum over all such j.

We denote this problem by \(\textit{PAPSM}(p,T,k)\). In more formal notation, \(\textit{PAPSM}(p,T,k)\) \(= \) \(\{j ~|~ \pi _j \in \textit{APSM}(p,t_j,k) ~\wedge ~ d(\pi _j(p),t_j) = \min _{1 \le i \le n_T}\{\textit{cost}(\textit{APSM}(p,t_i,k))\}\}\). In other words, the problem is to find a \(t_j \in T\) which is approximately parameterized closest to p with error threshold k. We call \(d(\pi _j(p),t_j)\) the \(\textit{cost of }{} \textit{PAPSM}(p,T,k)\) and denote it by \(\textit{cost}(\textit{PAPSM}(p,T,k))\). If \(\textit{PAPSM}(p,T,k)\) \(=\emptyset \), then \(\textit{cost}(\textit{PAPSM}(p,T,k))=\infty \).
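A self-contained brute-force Python sketch of \(\textit{PAPSM}(p,T,k)\), returning all best indices together with their common cost (names are ours):

```python
from itertools import permutations
import math

def papsm(p, T, k, sigma_p, sigma_t):
    """Return (indices, cost): 1-based indices of the strings in T approximately parameterized
    closest to p under threshold k, and that cost; ([], inf) if no string qualifies."""
    def cost(u, v):
        best = min(sum(pi[a] != b for a, b in zip(u, v))
                   for pi in (dict(zip(sigma_p, perm)) for perm in permutations(sigma_t)))
        return best if best <= k else math.inf

    costs = [cost(p, t) for t in T]
    best = min(costs)
    indices = [] if best == math.inf else [j + 1 for j, c in enumerate(costs) if c == best]
    return indices, best

# Example 1: papsm("abab", ["cdcd", "dcdc", "ccdd", "cccd"], 1, "ab", "cd") -> ([1, 2], 0)
```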

Example 1

Given \(p=abab \in \varSigma _P^4 = \{a,b\}^4\), \(T=\{t_1=cdcd, ~t_2=dcdc,~ t_3= ccdd,~ t_4=cccd\} \subseteq \varSigma _T^4=\{c,d\}^4\) and \(k=1\). Now,

$$\begin{aligned}&\textit{APSM}(p,t_1,k)=\{\pi _1=\{a \rightarrow c, b \rightarrow d\} \}, ~~~d(\pi _1(p),t_1) = 0 ;\\&\textit{APSM}(p,t_2,k)=\{\pi _2=\{a \rightarrow d, b \rightarrow c\} \}, ~~~d(\pi _2(p),t_2) = 0 ;\\&\textit{APSM}(p,t_3,k) =\emptyset ;\\&\textit{APSM}(p,t_4,k)=\{\pi _4=\{a \rightarrow c, b \rightarrow d\} \}, ~~~d(\pi _4(p),t_4) = 1.\\ \end{aligned}$$

Observe that \(\pi _3=\{a \rightarrow c,~ b \rightarrow d\} \in \textit{APSM}(p,t_3)\) but \(d(\pi _3(p),t_3)\) \(=2 > k\). So \(\textit{APSM}(p,t_3,1)= \emptyset \). Hence \(\textit{PAPSM}(p,T,1)=\{1,2\}\). Also note that if \(k=3\), then for \(\pi _4'=\{a \rightarrow d, b \rightarrow c\}\), \(\pi _4'\text{-mismatch }(p,t_4)\le k\). Hence just finding \(\pi _4'\) suffices to say that p parameterized matches \(t_4\) with error threshold \(k=3\); whereas \(\textit{APSM}(p,t_4,3)=\{\pi _4\}\) and \(\pi _4' \notin \textit{APSM}(p,t_4,3)\).    \(\square \)

Note that it is sufficient to report one string from T that is closest to p under a given error threshold k; it is also possible to enumerate all \(t_i \in T\) that are closest to p. Observe that if \(\textit{PAPSM}(p,T,k)=\{i,j\}\), corresponding to the strings \(t_i\) and \(t_j\), then \(\textit{cost}(\textit{PAPSM}(p,T,k)) \le k\) and, more importantly, \(\textit{cost}(\textit{PAPSM}(p,T,k)) = d(\pi _i(p),t_i) = d(\pi _j(p),t_j)\).

Theorem 1

Given \(p \in \varSigma _P^m\) and \(T=\{t_1, t_2\} \subseteq \varSigma _T^m\). If p is an approximate parameterized match with \(t_1\) and \(t_1 ~\widehat{=}~ t_2\), then p is also an approximate parameterized match with \(t_2\), with cost equal to \(\textit{cost}({\textit{APSM}}(p,t_1))\).

Proof

The proof consists of two parts. Since p is an approximate parameterized match with \(t_1\) (without any error threshold), let \(\pi _1 \in \textit{APSM}(p,t_1)\). As a consequence, \(\textit{cost}(\textit{APSM}(p,t_1)) = d(\pi _1(p),t_1)\), and this value is minimum over all bijections from \(\varSigma _P\) to \(\varSigma _T\). Also, since \(t_1 ~\widehat{=}~ t_2\), there exists a bijection \(\pi {:\;} \varSigma _T \rightarrow \varSigma _T\) such that \(\pi (t_1) = t_2 \), and so \(\textit{cost}(\textit{APSM}(t_1,t_2)) = d(\pi (t_1),t_2) = 0\).

Let \(\pi _2 = \pi \circ \pi _1{:\;} \varSigma _P \rightarrow \varSigma _T\), defined by \(\pi \circ \pi _1(u) = \pi (\pi _1 (u))\) for \(u \in \varSigma _P^m\). The distance \(d(\pi _2(p),t_2)\) is minimum over all bijections: if some bijection \(\pi '{:\;} \varSigma _P \rightarrow \varSigma _T\) satisfied \(d(\pi '(p),t_2) < d(\pi _2(p),t_2)\), then, since renaming by a bijection preserves HD, \(\pi ^{-1} \circ \pi '\) would give \(d(\pi ^{-1}(\pi '(p)),t_1) = d(\pi '(p),t_2) < d(\pi _1(p),t_1)\), contradicting the minimality of \(\pi _1\).

Now, \(\textit{cost}(\textit{APSM}(p,t_2))\) \(= d(\pi _2(p),t_2)\) \(= d(\pi (\pi _1(p)),t_2)\) \(= d(\pi (\pi _1(p)), \pi (t_1))\) \(= d(\pi _1(p),t_1)\) \(=\textit{cost}(\textit{APSM}(p,t_1))\). Therefore \(\pi _2 =\pi \circ \pi _1 \in \textit{APSM}(p,t_2)\) and its cost equals \(\textit{cost}({\textit{APSM}}(p,t_1))\).    \(\square \)

The above theorem extends to the APSM problem with error threshold k.

Theorem 2

Given \(p \in \varSigma _P^m\), \(T=\{t_1, t_2\} \subseteq \varSigma ^m_T\) and \(0 \le k \le m\). If p is an approximate parameterized match with \(t_1\) under the error threshold k and \(t_1 ~\widehat{=}~ t_2\), then p is also an approximate parameterized match with \(t_2\) under the error threshold k, with cost equal to \(\textit{cost}(\textit{APSM}(p,t_1,k))\).

Definition 2

(All Pairs Approximate Parameterized String Matching (APAPSM) with Threshold k). Let \(P=\{p_1, ~ p_2, \ldots , p_{n_P}\} \subseteq \varSigma _P^m\) and \(T=\{t_1, ~ t_2, \ldots , t_{n_T}\}\) \(\subseteq \varSigma _T^m\). The problem is to find a mapping \(\eta {:\;} [1,n_P] \rightarrow [1,n_T]\) such that the sum of \(\textit{cost}(\textit{APSM}(p_i,t_{\eta (i)},k))\) over all \(i~(1 \le i \le n_P)\) is minimum.

We denote this problem by APAPSM(P, T, k). The problem is to find, for each \(p_i \in P ~(1 \le i \le n_P)\), a \(t_j \in T ~(1 \le j \le n_T)\) which is approximately parameterized closest to \(p_i\) under the error threshold k. More formally, \(\textit{APAPSM}(P,T,k)= \{(\textit{PAPSM}(p_1,T,k), \textit{PAPSM}(p_2,T,k), \ldots , \textit{PAPSM}(p_{n_P},T,k))\}\).
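Assuming the papsm helper sketched after Definition 1 (an assumption of this illustration, not part of the paper), \(\textit{APAPSM}(P,T,k)\) is then just the tuple of per-string results:

```python
def apapsm(P, T, k, sigma_p, sigma_t):
    """APAPSM(P, T, k): for each p_i in P, the PAPSM result (indices and cost) against T.
    Reuses the papsm helper from the sketch after Definition 1."""
    return [papsm(p_i, T, k, sigma_p, sigma_t) for p_i in P]
```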

Theorem 3

The above problem can be solved in \(O(n_P\, n_T \, m^{1.5})\) time.

Proof

This follows directly from the solution to the APSM problem proposed by Hazay et al. [19], applied to all possible pairs between P and T.

Definition 3

(\(\gamma (k)\)-match of strings). Let \(k \in \mathbb {N}_0\). For two given strings \(u=u[1 .. m],\, v = v[1 .. m] \in \varSigma ^*\) over the alphabet \(\varSigma =\{a_1, a_2, \ldots , a_\sigma \}\), where each \(a_i \in \mathbb {N}_0\), u is said to \(\gamma (k)\)-match v if and only if \(\, \sum _{i=1}^{m} |u[i] - v[i]| \le k\).

The term \(\gamma (k)\)-match is a renamed version of the term \(\gamma \)-approximate, which was introduced in [6] and defined on strings. Similarly, we define the \(\gamma (k)\)-match on two number vectors of equal dimension.

Definition 4

(\(\gamma \)-distance, \(\gamma (k)\)-match of vectors). Given two vectors \(u=(u_1, u_2, \ldots , u_m)\) and \(v = (v_1, v_2, \ldots , v_m)\), where \(u_i,v_j \in {\mathbb {N}_0}\) for \(1 \le i,j \le m\), and \(l,k \in \mathbb {N}_0\), the \(\gamma \)-distance between u and v is l (denoted \(\gamma (u,v) = l\)) if and only if \(l=\sum _{i=1}^{m} |u_i - v_i|\). We say that u \(\gamma (k)\)-matches v if and only if \(\gamma (u,v)\) \(= \sum _{i=1}^{m} |u_i - v_i| \le k\).
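A minimal Python sketch of the \(\gamma \)-distance and the \(\gamma (k)\)-match test on equal-length integer vectors (names are ours):

```python
def gamma(u, v):
    """gamma-distance between two equal-length integer vectors: sum of |u_i - v_i|."""
    assert len(u) == len(v)
    return sum(abs(a - b) for a, b in zip(u, v))

def gamma_k_match(u, v, k):
    """True iff u gamma(k)-matches v, i.e. gamma(u, v) <= k."""
    return gamma(u, v) <= k

# gamma((3, 2), (2, 3)) == 2;  gamma_k_match((3, 2), (2, 3), 1) == False
```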

The notion of the Parikh mapping, or Parikh vector, was introduced by R. J. Parikh in [24]. It captures numerical properties of a string as a vector by counting the number of occurrences of each symbol in the string. The Parikh vector of a string w is denoted by \(\psi (w)\).

Definition 5

(Parikh Vector (PV)). Let \(\varSigma = \{a_1, a_2, \ldots , a_\sigma \}\). Given \(w \in \varSigma ^*,\) \(\psi (w)=(f(a_1,w), f(a_2,w), \ldots , f(a_\sigma ,w))\) where \(f(a_i,w)\) gives the frequency of the symbol \(a_i \in \varSigma \) (\(1 \le i \le \sigma \)) in the string w.

For example, if \(\varSigma =\{c,d\}\) then \(\psi (cddcc)=(3,2)\). However, much information is lost in the transition from a string to its PV: the Parikh mapping is not injective, as many strings over an alphabet may have the same PV. For example, the strings cccdddd and dcdcdcd have the same Parikh vector (3, 4).

Definition 6

(Normalized Parikh Vector (NPV)). NPV of a string \(w \in \varSigma ^*\) is  \(\widehat{\psi }(w)=(g_1,~g_2, \ldots , g_\sigma )\) such that \(~\forall i, ~ 1 \le i < \sigma , ~ g_i \ge g_{i+1}\) and there exists a bijective mapping \(\rho {:\;} \{1 ..\sigma \} \rightarrow \{1 ..\sigma \}\) such that \(g_i=f(a_{\rho (i)},w)\).

In other words, we sort the elements of \(\psi (w)\) in non-increasing order to get the \(\widehat{\psi }(w)\) of string w. For example, \(\psi (dcdcdcd)=(3,4)\) and \(\widehat{\psi }(dcdcdcd)=(4,3)\).
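A small Python sketch computing \(\psi (w)\) and \(\widehat{\psi }(w)\) for a fixed ordering of \(\varSigma \) (function names are ours):

```python
from collections import Counter

def parikh_vector(w, sigma):
    """psi(w): frequency of each symbol of sigma in w, in the fixed order a_1, ..., a_sigma."""
    counts = Counter(w)
    return tuple(counts.get(a, 0) for a in sigma)

def normalized_parikh_vector(w, sigma):
    """psi-hat(w): the Parikh vector of w sorted in non-increasing order."""
    return tuple(sorted(parikh_vector(w, sigma), reverse=True))

# parikh_vector("dcdcdcd", "cd") == (3, 4)
# normalized_parikh_vector("dcdcdcd", "cd") == (4, 3)
```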

Theorem 4

Given a pair of equal length strings \(u\in \varSigma _P^*\) and \(v\in \varSigma _T^*\), if \(u ~\widehat{=}~ v\) then \(\gamma (\widehat{\psi }(u),\widehat{\psi }(v)) =0\).

Proof

Since \(u ~\widehat{=}~ v\), by definition there exists a bijection \(\pi {:\;} \varSigma _P \rightarrow \varSigma _T\) such that \(\pi (u)=v\), i.e., \(\pi (u)\) is obtained by renaming each symbol of u using \(\pi \). Although the symbols of \(\varSigma _P\) are renamed by \(\pi \), the frequency of each symbol \(a \in \varSigma _P\) in u is the same as the frequency of \(\pi (a) \in \varSigma _T\) in \(v=\pi (u)\). As a consequence, \(\widehat{\psi }(v) = \widehat{\psi }(u)\), even though it may be the case that \(\psi (u) \ne \psi (v)\).    \(\square \)

However, the converse is not always true, as the following example shows.

Example 2

Given \(p=ababa \in \varSigma _P^* = \{a,b\}^*\) and \(T=\{t_1=cdcdd, ~t_2=dcdcd\} \subseteq \varSigma _T^*=\{c,d\}^*\). Now,

$$\begin{aligned}&\psi (p)=(3,2) \text { and } \widehat{\psi }(p)=(3,2);\\&\psi (t_1)=(2,3) \text { and } \widehat{\psi }(t_1)=(3,2);\\&\psi (t_2)=(2,3) \text { and } \widehat{\psi }(t_2)=(3,2). \end{aligned}$$

As stated in Theorem 4, \(p ~\widehat{=}~ t_2\) and so \(\gamma (\widehat{\psi }(p),\widehat{\psi }(t_2)) =0\), even though \(\psi (p) \ne \psi (t_2)\). Conversely, \(\widehat{\psi }(t_1) = \widehat{\psi }(p) = \widehat{\psi }(t_2) = (3,2)\), but \(p ~\widehat{\ne }~ t_1\) while \(p ~\widehat{=}~ t_2\). Hence, when \(\gamma (\widehat{\psi }(u),\widehat{\psi }(v)) =0\), it is still necessary to check whether \(u ~\widehat{=}~ v\).    \(\square \)

In general, a filter is a device or subroutine that processes the feasible inputs and removes some undesirable components. We design a filtering technique using the Parikh vector in order to preprocess the given strings of P and T and to reduce the number of pair comparisons needed for solving approximate parameterized string matching between pairs of strings under the error threshold k. We call the filter stated in Theorem 7 the PV-filter, and the process of filtering the input data with it PV-filtering.

The following theorems help minimize the number of pair comparisons for the APAPSM problem and thereby improve the solution in practice. Theorem 5 applies to the ASM problem; it is extended in Theorems 6 and 7 to the APSM problem without and with the error threshold k, respectively.

Theorem 5

Let \(u, v\in \varSigma ^*\) be a pair of equal-length strings and let \(k= d(u,v)\) be their Hamming distance. Then \(\gamma (\widehat{\psi }(u),\widehat{\psi }(v)) \le 2 k\) and \(\gamma (\psi (u),\psi (v)) \le 2 k\).

Proof

We prove it by the principle of mathematical induction on k.

 

Base case: For \(u=v\), \(k= d(u,v)=0\) and \(\gamma (\widehat{\psi }(u),\widehat{\psi }(v)) = 0\).

Hypothesis: Assume that for any k with \(0 \le k=d(u,v) \le i\), \(\gamma (\widehat{\psi }(u),\widehat{\psi }(v)) \le 2k\).

Inductive step: Suppose that by introducing one more error through a replacement operation (a symbol \(a \in \varSigma \) is replaced by \(b \in \varSigma \) at some position of u) we obtain \(u'\) such that \(d(u',u)=1\) and \(k=d(u',v) = i+1\). (The only other possibility when changing u to \(u'\) with \(d(u',u)=1\) is \(k=d(u',v) = i-1\), for which the inequality already holds by the hypothesis, so we only argue the case \(k=i+1\).) A single replacement decreases the frequency of a by one and increases the frequency of b by one, so \(\gamma (\widehat{\psi }(u'),\widehat{\psi }(u)) \le 2\). Hence, by the triangle inequality for \(\gamma \), \(\gamma (\widehat{\psi }(u'),\widehat{\psi }(v)) \le \gamma (\widehat{\psi }(u'),\widehat{\psi }(u)) + \gamma (\widehat{\psi }(u),\widehat{\psi }(v)) \le 2 + 2i = 2(i+1)\) when \(k=d(u',v)=i+1\).

 

Hence the first inequality follows by the principle of mathematical induction.

The proof of the second inequality is analogous.    \(\square \)

Theorem 6

Given a pair of strings \(u \in \varSigma ^m_P\), \(v \in \varSigma ^m_T\), let \(k=\textit{cost}(\textit{APSM}(u,v))\). Then \(\gamma (\widehat{\psi }(u),\widehat{\psi }(v)) \le 2{k}\).

Proof

Let \( \pi \in \textit{APSM}(u,v)\). By definition, \(k = \textit{cost}(\textit{APSM}(u,v))=d(\pi (u),v)\) is minimum over all bijections. Let \(\pi (u)=u' \in \varSigma _T^m\). Since \(u ~\widehat{=}~u'\) under \(\pi \), \(\widehat{\psi }(u)=\widehat{\psi }(u')\) by Theorem 4, and hence \(\gamma (\widehat{\psi }(u),\widehat{\psi }(v))=\gamma (\widehat{\psi }(u'),\widehat{\psi }(v))\). Since \(d(u',v)=k\), Theorem 5 gives \(\gamma (\widehat{\psi }(u),\widehat{\psi }(v)) = \gamma (\widehat{\psi }(u'),\widehat{\psi }(v)) \le 2 {k}\).    \(\square \)

Theorem 7

Given \(u \in \varSigma ^m_P\) and \(v \in \varSigma ^m_T\), let \(\widehat{k}=\) \(\textit{cost}(\textit{APSM}(u,v,k))\). Then \(\gamma (\widehat{\psi }(u),\widehat{\psi }(v)) \le 2\widehat{k}\) (we call this condition the PV-filter).

Proof

The proof is very similar to that of Theorem 6. Let \( \pi \in \textit{APSM}(u,v,k)\). Then \(\pi {:\;} \varSigma _P \rightarrow \varSigma _T\) is a bijection for which \(\widehat{k} = \textit{cost}(\textit{APSM}(u,v,k))=d(\pi (u),v)\) is minimum and not greater than k. Let \(u' =\pi (u) \in \varSigma _T^m\). By the same argument as above, \(\gamma (\widehat{\psi }(u),\widehat{\psi }(v)) = \gamma (\widehat{\psi }(u'),\widehat{\psi }(v)) \le 2\widehat{k}\).    \(\square \)
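Theorem 7 translates directly into a filter predicate: a pair (u, v) requires a full \(\textit{APSM}(u,v,k)\) computation only if the \(\gamma \)-distance between its normalized Parikh vectors is at most 2k. A self-contained Python sketch (names are ours; frequencies of unused symbols are padded with zeros):

```python
from collections import Counter

def pv_filter_passes(u, v, k):
    """PV-filter (Theorem 7): if gamma(psi_hat(u), psi_hat(v)) > 2k then cost(APSM(u, v, k))
    cannot be <= k, so the pair (u, v) may be skipped.  Returns True when a full check is needed."""
    npv_u = sorted(Counter(u).values(), reverse=True)
    npv_v = sorted(Counter(v).values(), reverse=True)
    n = max(len(npv_u), len(npv_v))                 # pad with zeros for symbols that do not occur
    npv_u += [0] * (n - len(npv_u))
    npv_v += [0] * (n - len(npv_v))
    return sum(abs(a - b) for a, b in zip(npv_u, npv_v)) <= 2 * k

# pv_filter_passes("abab", "cccd", 1) -> True   (NPVs (2, 2) and (3, 1); gamma = 2 <= 2k)
# pv_filter_passes("aaaa", "cdcd", 0) -> False  (NPVs (4, 0) and (2, 2); gamma = 4 > 2k)
```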

We use this PV-filter as a subroutine in a simple algorithm that solves the APAPSM problem with error threshold k between two sets P and T of equal-length strings. In the worst case (i.e., when none of the pairs is filtered out by the PV-filter), it takes \( O( n_P \, n_T \, m)\) time.

Computing APAPSM Under an Error Threshold: Let \(P=\{p_1, ~ p_2, \ldots , p_{n_P}\} \subseteq \varSigma _P^m\) and \(T=\{t_1, ~ t_2, \ldots , t_{n_T}\}\) \(\subseteq \varSigma _T^m\) be two sets of strings where \(|\varSigma _P|=|\varSigma _T|=\sigma \). Algorithm 1 computes \(\textit{APAPSM}(P,T,k)\) with error threshold \(k \in \mathbb {N}_0\) between the two sets P and T of equal-length strings. In Step 3, clustering is recommended precisely when it is known in advance that there are many exact and parameterized repetitions of strings in P and T. To create the equivalence classes in P and T separately with respect to parameterization, clustering is done based on the converse of Theorem 4: for any two strings \(u,v \in P\) (and T, respectively), check whether \(u ~\widehat{=}~ v\) if and only if \(\gamma (\widehat{\psi }(u),\widehat{\psi }(v)) = 0\); if \(u ~\widehat{=}~ v\) holds, then put u and v into the same cluster.

Algorithm 1. Computing \(\textit{APAPSM}(P,T,k)\) with PV-filtering.
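The Algorithm 1 figure is not reproduced here. For illustration, the following self-contained Python sketch mirrors its overall flow under simplifying assumptions: the optional clustering step is omitted and the APSM cost is computed by brute force over bijections rather than by the MWBM reduction actually used in the paper; all function and variable names are ours.

```python
from collections import Counter
from itertools import permutations
import math

def npv(w):
    """Normalized Parikh vector: symbol frequencies of w, sorted in non-increasing order."""
    return sorted(Counter(w).values(), reverse=True)

def gamma(u, v):
    """L1 (gamma) distance between two frequency vectors, padding the shorter with zeros."""
    n = max(len(u), len(v))
    u = list(u) + [0] * (n - len(u))
    v = list(v) + [0] * (n - len(v))
    return sum(abs(a - b) for a, b in zip(u, v))

def apsm_cost_k(u, v, k, sigma_p, sigma_t):
    """Brute-force cost(APSM(u, v, k)); the paper instead uses the MWBM reduction of [19]."""
    best = min(sum(pi[a] != b for a, b in zip(u, v))
               for pi in (dict(zip(sigma_p, perm)) for perm in permutations(sigma_t)))
    return best if best <= k else math.inf

def apapsm_with_pv_filter(P, T, k, sigma_p, sigma_t):
    """For each p_i in P, report the 1-based indices of the strings in T that are approximately
    parameterized closest to p_i under threshold k, together with that cost.  Pairs whose
    normalized Parikh vectors are more than 2k apart are skipped by the PV-filter (Theorem 7)."""
    npv_P = [npv(p) for p in P]
    npv_T = [npv(t) for t in T]
    result = []
    for i, p in enumerate(P):
        costs = []
        for j, t in enumerate(T):
            if gamma(npv_P[i], npv_T[j]) > 2 * k:      # filtered out: no need to run APSM
                costs.append(math.inf)
            else:
                costs.append(apsm_cost_k(p, t, k, sigma_p, sigma_t))
        best = min(costs)
        matches = [] if best == math.inf else [j + 1 for j, c in enumerate(costs) if c == best]
        result.append((matches, best))
    return result

# Example 1: apapsm_with_pv_filter(["abab"], ["cdcd", "dcdc", "ccdd", "cccd"], 1, "ab", "cd")
#            == [([1, 2], 0)]
```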

Complexity Analysis: Steps 1–3 of Algorithm 1 are the preprocessing steps for computing \(\textit{APAPSM}(P,T,k)\); Steps 1–2 take \(O(m( n_P + n_T))\) time and Step 3 takes \(O(m( n_P^2 + n_T^2))\) time, assuming a constant-size alphabet. As mentioned earlier, clustering is optional and may be skipped depending on the circumstances.

In Step 4, for any pair \((p_i,t_j)\in P \times T\), \(\textit{APSM}(p_i,t_j,k)\) can be computed by reducing the problem to the maximum weight bipartite matching (MWBM) problem [19]. Let \(G = (V, E, W)\) be an undirected, weighted (non-negative integer weights) bipartite graph, where V, E and W are the vertex set, the edge set and the total weight of G, respectively. The MWBM problem can be solved in \(O(\sqrt{|V|}\,W')\) time, where \(|E| \le W' \le W\) [12]; this is a fine-tuned version of the existing decomposition solution [20]. Using the fine-tuned decomposition solution for MWBM, \(\textit{APSM}(p_i,t_j,k)\) can be solved in \(O(m \sqrt{\sigma })\) time, since \(W'=O(W)=O(m)\) and \(|V|=O(\sigma )\) [12, 13]. In the worst case, each cluster contains just a single string from P or T and the PV-filter in Step 4 does not filter out any pair \((p_i,t_j) \in P \times T\). Therefore the worst-case running time of Algorithm 1 is \(O(n_P \, n_T \, m \sqrt{\sigma })\), which is \(O(n_P \, n_T \, m)\) if we assume a constant-size alphabet.

4 Experimental Results

To test the efficiency of the PV-filter, we performed several experimental studies; only a few are reported in this section because of the page limitation. Algorithm 1, which solves APAPSM(P, T, k), is implemented in MATLAB Version 7.8.0.347 (R2009a). All experiments were conducted on a PC laptop with an Intel\(^{\textregistered }\) Core\(^{TM}\) 2 Duo (T6570 @ 2.10 GHz) processor, 3.00 GB RAM and a 500 GB hard disk, running Microsoft Windows 7 Ultimate (32-bit operating system).

Data Description: We generate the input data sets P and T using the predefined randi function, which generates uniformly distributed pseudorandom integers. The call randi(imax,m,n) returns an \(\mathtt m\)-by-\(\mathtt n\) matrix containing pseudorandom integer values drawn from the discrete uniform distribution on 1:imax.

Efficiency of the PV-Filter: The experimental results show that the PV-filter is effective, especially for small error thresholds k, at avoiding unnecessary pair comparisons (u, v) for \(\textit{APSM}(u,v,k)\), where \(u \in P\) and \(v \in T\). In our random experiments, if the error threshold is \(k \le \frac{m}{3}\), then nearly one-third or more of the total pair comparisons can be skipped, and much smaller thresholds give much better filtering, as the experiments below show.

Experiment 1

Consider the alphabet sets \( \varSigma _P=\{\mathtt{a,b,c,d,e,f,g,h,i,j}\}\) and \(\varSigma _T=\{\mathtt{a',b',c',d',e',f',g',h',i',j'}\}\), with \(P \subseteq \varSigma _P^*\) and \(T \subseteq \varSigma _T^*\); the cardinality of each of the sets P and T is \(|P|=|T|=100\); and \(|p_i|=|t_j|=6\) for \(1 \le i,j \le |P|=|T|\).

According to the data set generated in Experiment 1, a total of 10,000 pair comparisons (u, v) for \(\textit{APSM}(u,v,k)\), where \(u \in P\) and \(v \in T\), would be required without PV-filtering. Figure 1 shows the efficiency graph of the filter on this input data set. Each blue point in the graph indicates the number of pair comparisons eliminated for a given error threshold after using the PV-filter; that is, a point (i, j) in Fig. 1 means that for error threshold \(k=i\), j pairs (u, v) of strings skipped the comparison for \(\textit{APSM}(u,v,i)\).

Fig. 1. Elimination graph of pairs of strings after using the PV-filter, for the input data set with \(| \varSigma _P|= |\varSigma _T|=10 \); \(|P|=|T|=100\); \(|p_i|=|t_j|=6\) for \(1 \le i,j \le |P|=|T|\), as described in Experiment 1.

Table 1. PV-filtering for the data set in Experiment 1.

Table 1 sheds more light on Experiment 1. The second row shows, for a given k, the total number of (u, v) pairs to be checked for \(\textit{APSM}(u,v,k)\) before using the PV-filter; the third row gives, for the respective k, the number of string pairs eliminated by the PV-filter; the fourth row shows how many string pairs pass the filter; and the last row reports for how many of the passed pairs (u, v) we actually have \(\textit{cost}(\textit{APSM}(u,v)) \le k\).

Experiment 2

Consider the alphabet sets \( \varSigma _P=\{\mathtt{a,b,c,\ldots , x,y,z}\}\) and \(\varSigma _T=\{\mathtt{a',b', c',\ldots , x',y',z'}\}\) with \(| \varSigma _P|= |\varSigma _T|=26 \); \(P \subseteq \varSigma _P^*\), \(T \subseteq \varSigma _T^*\); the cardinality of the sets P and T is \(|P|=|T|=100\); and \(|p_i|=|t_j|=2000\) for \(1 \le i,j \le |P|=|T|\). Figure 2 gives the elimination graph. The corresponding table is omitted due to space limitations.

Fig. 2. Elimination graph of pairs of strings after using the PV-filter, for the input data set with \(| \varSigma _P|= |\varSigma _T|=26 \); \(|P|=|T|=100\); \(|p_i|=|t_j|=2000\) for \(1 \le i,j \le |P|=|T|\), as described in Experiment 2.

5 Conclusions

In this paper, we have explored the all pairs approximate parameterized string matching problem with Hamming distance error threshold k between two sets of equal-length strings. We have presented a solution with worst-case complexity \(O({n_P} \, {n_T} \, m)\), assuming a constant alphabet size. In order to minimize the number of pair comparisons for solving APSM between pairs of strings with an error threshold, we have proposed the PV-filtering technique based on the Parikh vector. Although the filter does not improve the worst-case asymptotic bound, using it as a subroutine allows us to avoid some of the unnecessary pair comparisons for APSM. Experimental results show that the PV-filter is efficient for small error thresholds.