
1 Introduction

The parameterized pattern matching problem was proposed by Baker [4] about a quarter of a century ago. A problem instance consists of two strings, a pattern and a text, which are sequences of two types of symbols called constants and variables. The problem is to find all occurrences of substrings of a given text that a given pattern matches by substituting a variable of the text for each variable of the pattern, where the important constraint is that the substitution must be an injective map. She presented an algorithm for this problem that runs in \(O(n\log {n})\) time using parameterized suffix trees, where n is the length of the text.

By removing the injectivity constraint from the parameterized pattern matching problem, Amir et al. [1] proposed the function matching problem, where the same variable may be substituted for different variables. Another, though inessential, difference between parameterized pattern matching and function matching lies in the alphabets. The function matching problem is defined to be constant-free in the sense that patterns and texts are strings over variables. This simplification is inessential, since the problem with variables and constants is known to be linear-time reducible to the constant-free case [2]; this reduction technique works for parameterized pattern matching as well. The deterministic algorithm of Amir et al. solves the function matching problem in \(O(|\varPi |n\log {m})\) time, where n and m are the lengths of the text and pattern, respectively, and \(|\varPi |\) is the number of different symbols in the pattern. Later, Amir and Nor [3] introduced the generalized function matching problem, where one can substitute a string of arbitrary length for a variable; in addition, both a pattern and a text may contain “don’t care” symbols, which match arbitrary strings.

Table 1. The time complexity of our proposed algorithms

  • FVC-matching: convolution-based method, \(O(|\varSigma _{P}|n\log {m})\) time; KMP-based method, \(O(|\varPi _{P}|(|\varSigma _{P}|+|\varPi _{P}|)m^2)\) preprocessing time and \(O(|\varPi _{P}|^2 \lceil \frac{m}{w} \rceil n)\) query time.

  • PVC-matching: convolution-based method, \(O(|\varSigma _{P}|n\log {m})\) time; KMP-based method, \(O(|\varSigma _{P}||\varPi _{P}|m^2)\) preprocessing time and \(O(|\varPi _{P}| \lceil \frac{m}{w} \rceil n)\) query time.

The parameterized pattern matching problem and its extensions have been of great interest not only to the pattern matching community [13] but also to the database community. Du Mouza et al. [7] proposed a variant of the function matching problem in which texts consist solely of constants and a substitution maps variables to constants, not necessarily injectively. Let us call their problem function matching with variables-to-constants mapping, or FVC-matching for short. The function matching problem is linear-time reducible to this problem by simply regarding the variables in a text as constants. Therefore, this problem can be seen as a generalization of the function matching problem. Unfortunately, as we will discuss in this paper, their algorithm is in error.

In this paper, we introduce a new variant of the problem by du Mouza et al. with the injectivity constraint, which we call parameterized pattern matching with variables-to-constants mapping (PVC-matching). For each of the FVC-matching and PVC-matching problems, we propose two kinds of algorithms: a convolution-based method and an extended KMP-based method. The convolution-based methods are inspired by the algorithm of Amir et al. [1] for the function matching problem, and the extended KMP-based methods by the algorithm of du Mouza et al. [7] for the FVC-matching problem. As a result, we fix the flaw in the algorithm by du Mouza et al. The convolution-based methods for both problems run in \(O(|\varSigma _{P}|n\log {m})\) time, where \(\varSigma _P\) is the set of constant symbols that occur in the pattern P. Our KMP-based methods solve the PVC-matching and FVC-matching problems with \(O(|\varSigma _{P}||\varPi _{P}|m^2)\) and \(O(|\varPi _{P}|(|\varSigma _{P}|+|\varPi _{P}|)m^2)\) preprocessing time and \(O(|\varPi _{P}| \lceil \frac{m}{w} \rceil n)\) and \(O(|\varPi _{P}|^2 \lceil \frac{m}{w} \rceil n)\) query time, respectively, where \(\varPi _P\) is the set of variables occurring in P and w is the word size of a machine (Table 1). The convolution-based methods and KMP-based methods are more efficient than the trivial O(mn) algorithm if the pattern contains few different constants and few different variables, respectively.

A full version of this paper [10] includes pseudocode and experimental results for these algorithms.

2 Preliminaries

For any set Z, the cardinality of Z is denoted by |Z|. Let \(\varSigma \) be an alphabet. We denote by \(\varSigma ^*\) the set of strings over \(\varSigma \). The empty string is denoted by \(\epsilon \). The concatenation of two strings \(X,Y \in \varSigma ^*\) is denoted by \(XY\). For a string X, the length of \(X = X[1]X[2]\cdots X[n]\) is denoted by \(|X| = n\). The substring of X beginning at i and ending at j is denoted by \(X[i:j] = X[i]X[i+1]\cdots X[j-1]X[j]\). Substrings of the form X[1 : j] and X[i : n] are called prefixes and suffixes of X, respectively. For any number k, we define \(X[k:k-1]=\epsilon \). The set of symbols from a subset \(\varDelta \) of \(\varSigma \) occurring in X is denoted by \(\varDelta _X = \{\, X[i] \in \varDelta \mid 1 \le i \le n \,\}\).

This paper is concerned with matching problems, where strings consist of two kinds of symbols, called constants and variables. Throughout this paper, the sets of constants and variables are denoted by \(\varSigma \) and \(\varPi \), respectively. Variables are supposed to be replaced by another symbol, while constants are not.

Definition 1

For a function \(\pi : \varPi \rightarrow (\varSigma \cup \varPi )\), we extend it to \(\hat{\pi }: (\varPi \cup \varSigma )^* \rightarrow (\varPi \cup \varSigma )^*\) by

$$\begin{aligned} \hat{\pi }(X) = \hat{\pi }(X[1])\hat{\pi }(X[2]) \cdots \hat{\pi }(X[n]) , \text {where } \hat{\pi }(X[i]) = {\left\{ \begin{array}{ll} \pi (X[i]) &{} (X[i] \in \varPi ) \\ X[i] &{} \mathrm{(otherwise)} \end{array}\right. } \end{aligned}$$

Parameterized match [4] and function match [1] are defined as follows.

Definition 2

Let P and Q be strings over \(\varSigma \cup \varPi \) of the same length. String P is said to parameterized match (resp. function match) string Q if there exists an injection (resp. function) \(\pi : \varPi \rightarrow \varPi \), such that \(\hat{\pi }(P) = Q\).

The parameterized pattern matching problem (resp. function matching problem) is to find all occurrences of substrings of a given text that a given pattern parameterized matches (resp. function matches).

The problems we discuss in this paper allow variables to be mapped not only to variables but also to constants.

Definition 3

Let P and Q be strings over \(\varSigma \,\cup \,\varPi \) of the same length. String P is said to parameterized match with variables-to-constants mapping (resp. function match with variables-to-constants mapping), shortly PVC-match (resp. FVC-match), string Q if there exists an injection (resp. function) \(\pi : \varPi \rightarrow (\varSigma \cup \varPi )\), such that \(\hat{\pi }(P) = Q\).

Problem 1

Let P and T be strings over \(\varSigma \cup \varPi \) of length m and n, respectively. The parameterized pattern matching problem with variables-to-constants mapping (resp. function matching problem with variables-to-constants mapping), shortly PVC-matching (resp. FVC-matching), asks for all indices i such that pattern P PVC-matches (resp. FVC-matches) the substring \(T[i: i+m-1]\) of text T.

Table 2 summarizes those four problems.

Table 2. Definition of problems

  • Parameterized pattern matching: injection \(\pi : \varPi \rightarrow \varPi \)

  • Function matching: function \(\pi : \varPi \rightarrow \varPi \)

  • PVC-matching: injection \(\pi : \varPi \rightarrow (\varSigma \cup \varPi )\)

  • FVC-matching: function \(\pi : \varPi \rightarrow (\varSigma \cup \varPi )\)

We can assume without loss of generality that the text T consists solely of constants. This restriction is inessential, since one can regard variables occurring in T as constants. Under this assumption, the FVC-matching problem coincides exactly with the parameterized pattern queries of [7].

Example 1

Let \(\varSigma = \{\texttt {a}, \texttt {b}\}\) and \(\varPi = \{\texttt {A}, \texttt {B}\}\). Consider pattern \(P = \texttt {ABAb}\) and text \(T = \texttt {ababbbb}\). Then, the answer to the PVC-matching problem is \(\{1, 2\}\), since P PVC-matches \(T[1:4] = \texttt {abab}\) and \(T[2:5] = \texttt {babb}\). On the other hand, the answer to the FVC-matching problem is \(\{1, 2, 4\}\), since P FVC-matches \(T[1:4] = \texttt {abab}\), \(T[2:5] = \texttt {babb}\), and \(T[4:7] = \texttt {bbbb}\). Note that we have \(\hat{\pi }(P)=T[4:7]\) for \(\pi \) with \(\pi (\mathtt {A})=\pi (\mathtt {B})=\mathtt {b}\), which is not injective.
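To make Definition 3 and Example 1 concrete, the following Python sketch checks FVC- and PVC-matches naively in \(O(mn|\varPi |)\) time; it only illustrates the definitions and is not one of the proposed algorithms. The function names and the encoding (uppercase letters as variables, lowercase letters as constants) are our own conventions for the sketch.

```python
# Naive check of FVC-/PVC-matching (Definition 3, Problem 1).
# Convention of this sketch: uppercase letters are variables, lowercase are constants.

def is_variable(c: str) -> bool:
    return c.isupper()

def vc_match(pattern: str, window: str, injective: bool) -> bool:
    """True iff `pattern` FVC-matches `window`; with injective=True, PVC-matches."""
    assert len(pattern) == len(window)
    pi = {}                        # partial map from variables to window symbols
    for p, t in zip(pattern, window):
        if is_variable(p):
            if p in pi:
                if pi[p] != t:
                    return False
            else:
                pi[p] = t
        elif p != t:               # constants must coincide literally
            return False
    if injective:                  # distinct variables must receive distinct symbols
        return len(set(pi.values())) == len(pi)
    return True

def vc_occurrences(pattern: str, text: str, injective: bool) -> list[int]:
    m, n = len(pattern), len(text)
    return [i + 1 for i in range(n - m + 1)        # 1-based positions as in the paper
            if vc_match(pattern, text[i:i + m], injective)]

print(vc_occurrences("ABAb", "ababbbb", injective=True))   # PVC: [1, 2]
print(vc_occurrences("ABAb", "ababbbb", injective=False))  # FVC: [1, 2, 4]
```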

Throughout this paper, we arbitrarily fix a pattern \(P \in (\varSigma \cup \varPi )^*\) of length m and a text \(T \in \varSigma ^*\) of length n.

3 Convolution-Based Methods

In this section, we show that the FVC-matching problem can be solved in \(O(|\varSigma _{P}|n\log {m})\) time by reducing the problem to the function matching problem and the wildcard matching problem, for which several efficient algorithms are known. The PVC-matching problem can also be solved using the same reduction technique with a slight modification.

For strings P of length m over \(\varSigma \cup \varPi \) and T of length n over \(\varSigma \), we define \(\varPi ' = \varPi _{P} \cup \varSigma _{T}\). Let \(P_{\!\!\mathtt{*}}\in (\varSigma \cup \{\mathtt{*}\})^*\) be the string obtained from P by replacing every variable symbol in \(\varPi \) with the don’t care symbol \(\mathtt{*}\). Let \(P_{\!\varPi }\in \varPi '^*\) be the string obtained from P by removing all constant symbols in \(\varSigma \). Moreover, for \(1 \le i \le n-m+1\), let \(T'_{i}\) be the string defined by \(T'_{i} = v(1) v(2) \cdots v(m)\), where \(v(j) = T[i+j-1]\) if \(P[j] \in \varPi \) and \(v(j) = \epsilon \) otherwise. Note that the lengths of both \(T'_{i}\) and \(P_{\!\varPi }\) equal the total number of variable occurrences in P.

Example 2

For \(T = \mathtt{aabcbc}\) and \(P={\mathtt{A}\mathtt{a}\mathtt{B}\mathtt{B}\mathtt{b}}\) over \(\varPi = \{\mathtt{A},\mathtt{B}\}\) and \(\varSigma = \{\mathtt{a},\mathtt{b},\mathtt{c}\}\), we have \(P_{\!\!\mathtt{*}}= \mathtt{*}\mathtt{a}\mathtt{*}\mathtt{*}\mathtt{b}\), \(P_{\!\varPi }=\mathtt{A}\mathtt{B}\mathtt{B}\), \(T'_{1} = \mathtt{a}\mathtt{b}\mathtt{c}\), and \(T'_{2}= \mathtt{a}\mathtt{c}\mathtt{b}\).
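For illustration, the constructions \(P_{\!\!\mathtt{*}}\), \(P_{\!\varPi }\) and \(T'_{i}\) can be written down directly as follows (a minimal Python sketch under the same uppercase/lowercase convention as before; the helper names are ours). On Example 2 it reproduces the strings listed above.

```python
# Constructions of Sect. 3: P_* (variables -> '*'), P_Pi (constants removed),
# and T'_i (text symbols aligned with the variable positions of P).

def p_star(P: str) -> str:
    return "".join("*" if c.isupper() else c for c in P)

def p_pi(P: str) -> str:
    return "".join(c for c in P if c.isupper())

def t_prime(P: str, T: str, i: int) -> str:
    """T'_i for a 1-based window position i."""
    return "".join(T[i - 1 + j] for j in range(len(P)) if P[j].isupper())

P, T = "AaBBb", "aabcbc"
print(p_star(P), p_pi(P))                   # *a**b ABB
print(t_prime(P, T, 1), t_prime(P, T, 2))   # abc acb
```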

For both FVC-matching and PVC-matching problems, the following lemma is useful to develop algorithms to solve them.

Lemma 1

P FVC-matches (resp. PVC-matches) \(T[i:i+m-1]\) if and only if

  1. \(P_{\!\!\mathtt{*}}\) wildcard matches \(T[i:i+m-1]\), and

  2. \(P_{\!\varPi }\) function matches (resp. parameterized matches) \(T'_{i}\).

Lemma 1 suggests that the FVC-matching problem can be reduced to a combination of the wildcard matching problem and the function matching problem.

The wildcard matching problem (a.k.a. pattern matching with don’t care symbols) [8] is one of the fundamental problems in pattern matching. There are many algorithms for solving it [5, 8, 11]. For example, Cole and Hariharan [5] gave an algorithm that runs in \(O(n\log {m})\) time using convolution.

However, Lemma 1 does not imply the existence of a single string \(T'\) such that P FVC-matches \(T[i:i+m-1]\) if and only if \(P_{\!\!\mathtt{*}}\) wildcard matches \(T[i:i+m-1]\) and \(P_{\!\varPi }\) function matches \(T'[i:i+m-1]\). A naive application of Lemma 1 that computes \(T'_{i}\) explicitly for each i requires O(mn) time in total.

We will present an algorithm to check whether \(P_{\!\varPi }\) function matches (resp. parameterized matches) \(T'_{i}\) for all \(1 \le i \le n-m+1\) in \(O(\log {|\varSigma |}\,n\log {m})\) time in total. Without loss of generality, we assume in this section that \(\varSigma \) and \(\varPi \) are disjoint finite sets of positive integers, and for integers a and b, the notation \(a \cdot b\) denotes the multiplication of a and b, not concatenation.

Definition 4

For integer arrays A of length n and B of length m, we define an integer array R by \( R[j] = \sum _{i=1}^{m}{A[i+j-1]\cdot B[i]} \) for \(1 \le j \le n-m+1\). We denote R by \(A\otimes B\).

In a computational model with word size \(O(\log {m})\), the discrete convolution can be computed in time \(O(n\log {n})\) by using the Fast Fourier Transform (FFT) [6]. The array R defined in Definition 4 can also be computed in the same time complexity by just reversing array B.
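For concreteness, the array \(A\otimes B\) of Definition 4 is a sliding-window cross-correlation, so in a prototype it can be computed either directly or via FFT as sketched below (a NumPy-based illustration of Definition 4 only, not the paper's implementation).

```python
import numpy as np

def cross_corr(A, B):
    """R[j] = sum_i A[i+j-1] * B[i] (Definition 4); 0-based output of length n-m+1."""
    return np.correlate(np.asarray(A), np.asarray(B), mode="valid")

def cross_corr_fft(A, B):
    """Same array via FFT: convolve A with the reversal of B and trim."""
    n, m = len(A), len(B)
    size = 1 << (n + m - 1).bit_length()                 # FFT length (power of two)
    fa = np.fft.rfft(np.asarray(A, dtype=float), size)
    fb = np.fft.rfft(np.asarray(B, dtype=float)[::-1], size)
    full = np.fft.irfft(fa * fb, size)                   # linear convolution of A and rev(B)
    return np.rint(full[m - 1:n]).astype(int)            # entries corresponding to R

A, B = [1, 2, 3, 4, 5], [1, 0, 2]
print(cross_corr(A, B), cross_corr_fft(A, B))            # both give [ 7 10 13]
```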

Amir et al. [1] proved the next lemma for function matching.

Lemma 2

[1]. For any natural numbers \(a_{1},\ldots ,a_{k}\), the equation \(k\!\cdot \!\sum _{i = 1}^{k}{(a_i)^2} = ( \sum _{i=1}^{k}{a_i} )^2\) holds if and only if \(a_i = a_j\) for all \(1 \le i, j \le k\).

Let \({{{\varvec{T}}}}\) be the string of length n such that \({{{\varvec{T}}}}[i] = (T[i])^2\) for every \(1 \le i \le n\). For a variable \(x \in \varPi _{P}\), let \(c_{x}\) denote the number of occurrences of x in P, and let \(P_x\) be the string of length m such that \(P_x[j] = 1\) if \(P[j] = x\) and \(P_x[j] = 0\) otherwise, for every \(1 \le j \le m\). By Lemma 2, we can prove the following lemma.

Lemma 3

All the symbols (values) of \(T'_{i}\) at positions j satisfying \(P_{\!\varPi }[j] = x\) are the same if and only if the equation \(c_{x} \!\cdot \! (({{{\varvec{T}}}}\otimes P_x)[i]) = ((T \otimes P_x)[i])^2\) holds.

Thus, \(P_{\!\varPi }\) function matches \(T'_{i}\) if and only if the equation in Lemma 3 holds for all \(x \in \varPi _{P}\). Both convolutions \({{{\varvec{T}}}}\otimes P_x\) and \(T \otimes P_x\) can be calculated in \(O(n\log {m})\) time by dividing T into \(O(\frac{n}{m})\) overlapping substrings of length 2m. For the parameterized matching (PVC) case, we only have to check additionally that the values \({(T \otimes P_x)[i]}/c_{x}\) are pairwise distinct over all \(x \in \varPi _{P}\).

Theorem 1

The FVC-matching problem and PVC-matching problem can be solved in \(O(|\varSigma _{P}|\,n\log {m})\) time.
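To tie Lemmas 1–3 together, the following NumPy sketch implements a convolution-based matcher under the conventions of the earlier sketches (uppercase letters as variables, lowercase letters as constants, text symbols encoded as small positive integers). It performs one pair of correlations per variable of \(\varPi _P\) for Lemma 3 and one correlation per constant of \(\varSigma _P\) for the wildcard-matching condition; it is only an illustration of the reduction, not the authors' implementation (pseudo code of the actual algorithms is given in the full version [10]).

```python
import numpy as np

def vc_occurrences_conv(P: str, T: str, injective: bool) -> list[int]:
    """Convolution-based FVC-matching (injective=False) or PVC-matching (injective=True).
    Sketch conventions: uppercase = variables, lowercase = constants."""
    m, n = len(P), len(T)
    codes = {c: k + 1 for k, c in enumerate(sorted(set(T)))}   # positive integers (Lemma 2)
    t = np.array([codes[c] for c in T], dtype=np.int64)
    t2 = t * t                                                 # the squared text
    ok = np.ones(n - m + 1, dtype=bool)

    # Condition 1 of Lemma 1: P_* wildcard-matches the window
    # (one correlation per constant of Sigma_P, counting aligned exact matches).
    const_pos = sum(1 for c in P if c.islower())
    hits = np.zeros(n - m + 1, dtype=np.int64)
    for c in set(filter(str.islower, P)):
        mask_p = np.array([1 if y == c else 0 for y in P], dtype=np.int64)
        mask_t = np.array([1 if y == c else 0 for y in T], dtype=np.int64)
        hits += np.correlate(mask_t, mask_p, mode="valid")
    ok &= (hits == const_pos)

    # Condition 2 of Lemma 1 via Lemma 3: per variable x, aligned text symbols are equal.
    sums = {}
    for x in set(filter(str.isupper, P)):
        p_x = np.array([1 if y == x else 0 for y in P], dtype=np.int64)
        c_x = int(p_x.sum())
        s1 = np.correlate(t, p_x, mode="valid")                # T (x) P_x
        s2 = np.correlate(t2, p_x, mode="valid")               # squared-T (x) P_x
        ok &= (c_x * s2 == s1 * s1)
        sums[x] = (s1, c_x)

    occ = []
    for i in range(n - m + 1):
        if not ok[i]:
            continue
        if injective:                                          # PVC: values pairwise distinct
            vals = [int(s1[i]) // c_x for (s1, c_x) in sums.values()]
            if len(set(vals)) != len(vals):
                continue
        occ.append(i + 1)                                      # 1-based, as in the paper
    return occ

print(vc_occurrences_conv("ABAb", "ababbbb", injective=True))   # [1, 2]
print(vc_occurrences_conv("ABAb", "ababbbb", injective=False))  # [1, 2, 4]
```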

4 KMP-Based Methods

Du Mouza et al. [7] proposed a KMP-based algorithm for the FVC-matching problem, which, however, is in error. In Sect. 4.1, we propose a correction of their algorithm, which runs in \(O(|\varPi |^2 \lceil \frac{m}{w} \rceil n)\) query time with \(O(|\varPi |(|\varSigma _{P}|+|\varPi |)m^2)\) preprocessing time, where w denotes the word size of a machine. This algorithm will be modified in Sect. 4.2 so that it solves the PVC-matching problem in \(O(|\varPi | \lceil \frac{m}{w} \rceil n)\) query time with \(O(|\varPi ||\varSigma _{P}|m^2)\) preprocessing time.

The KMP algorithm [12] solves the standard pattern matching problem in O(n) time with O(m) preprocessing time. We say that a string Y is a border of X if Y is simultaneously a prefix and a suffix of X. A border Y is nontrivial if Y is not X itself. For the preprocessing of the KMP algorithm, we calculate the longest nontrivial border \(b_{k}\) for each prefix P[1 : k] of pattern P, and store them as border array \(B[k] = |b_{k}|\) for each \(0 \le k \le m\). Note that \(b_0 = b_1 = \epsilon \). In the matching phase, the KMP algorithm compares symbols T[i] and P[k] from \(i = k = 1\). We increment i and k if \(T[i] = P[k]\). Otherwise we reset the index for P to be \(k' = B[k-1]+1\) and resume comparison from T[i] and \(P[k']\).
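For reference, here is a compact Python sketch of the classical KMP preprocessing and matching phases just described, with the border array B indexed as in the text; it is standard textbook code included only as the baseline that the extensions below build on.

```python
def border_array(P: str) -> list[int]:
    """B[k] = length of the longest nontrivial border of P[1:k] (1-based k; B[0] = 0)."""
    m = len(P)
    B = [0] * (m + 1)
    b = 0
    for k in range(2, m + 1):
        while b > 0 and P[k - 1] != P[b]:
            b = B[b]
        if P[k - 1] == P[b]:
            b += 1
        B[k] = b
    return B

def kmp_occurrences(P: str, T: str) -> list[int]:
    B = border_array(P)
    occ, k = [], 1                       # k: 1-based index into P
    for i in range(1, len(T) + 1):       # i: 1-based index into T
        while k > 1 and T[i - 1] != P[k - 1]:
            k = B[k - 1] + 1             # on a mismatch, resume at P[B[k-1]+1]
        if T[i - 1] == P[k - 1]:
            k += 1
        if k == len(P) + 1:              # full match ending at position i
            occ.append(i - len(P) + 1)
            k = B[len(P)] + 1
    return occ

print(kmp_occurrences("aba", "ababa"))   # [1, 3]
```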

4.1 Extended KMP Algorithm

This subsection discusses an algorithm for the FVC-matching problem. In the matching phase, our extended KMP algorithm compares the pattern and a substring of the text in the same manner as the classical KMP algorithm, except that we must maintain a function by which a prefix of the pattern matches a substring of the text. That is, our extended KMP algorithm compares symbols T[i] and P[k] from \(i = k = 1\) with the empty function \(\pi \). If P[k] is a variable not in the domain \(\mathrm {dom}(\pi )\) of \(\pi \), we extend \(\pi \) by letting \(\pi (P[k]) = T[i]\) and increment i and k. If \(\hat{\pi }(P[k]) = T[i]\), we likewise increment i and k. Otherwise, we say that a mismatch occurs at position k with the function \(\pi \). Note that the mismatch position refers to a position of P rather than of T. When we find a mismatch, we must calculate the appropriate position j of P and function \(\pi '\) with which we resume comparison. If instances are variable-free, the resuming position is solely determined by the length of the longest border of the matched prefix \(P[1:k-1]\), and no function is involved. In the case of FVC-matching, the resuming position depends on the function \(\pi \) in addition to k.

Example 3

Let us consider the pattern \(P=\mathtt {AABaaCbC}\), where \(\varPi = \{\mathtt{A},\mathtt{B},\mathtt{C}\}\) and \(\varSigma = \{\mathtt{a},\mathtt{b}\}\), in Fig. 1. If the concerned substring of the text is \(T'=\mathtt {bbbaaabb}\), a mismatch occurs at \(k=8\) with a function \(\pi \) such that \(\pi (\mathtt {A})=\pi (\mathtt {B})=\mathtt {b}\) and \(\pi (\mathtt {C})=\mathtt {a}\). In this case, we can resume comparison with P[7] and \(T'[8]\), since we have \(\hat{\pi }'(P[1:6])=T'[2:7]\) for \(\pi '\) such that \(\pi '(\mathtt {A})=\pi '(\mathtt {C})=\mathtt {b}\) and \(\pi '(\mathtt {B})=\mathtt {a}\). On the other hand, for \(T''= \mathtt {bbaaaabb}\), the first mismatch again occurs at \(k=8\), with a function \(\rho \) such that \(\rho (\mathtt {A})=\mathtt {b}\) and \(\rho (\mathtt {B})=\rho (\mathtt {C})=\mathtt {a}\). In this case, one cannot resume comparison with P[7] and \(T''[8]\): there is no \(\rho '\) such that \(\hat{\rho }'(P[1:6])=T''[2:7]\), because \(P[1] = P[2]\) but \(T''[2] \ne T''[3]\). We should instead resume comparison between P[4] and \(T''[8]\) with \(\rho '\) such that \(\rho '(\mathtt {A})=\mathtt {a}\) and \(\rho '(\mathtt {B})=\mathtt {b}\), for which we have \(\hat{\rho }'(P[1:3])=T''[5:7]\). Note that \(\rho '(\mathtt {C})\) is undefined.

Fig. 1. Examples of possible shifts in the extended KMP algorithm

The goal of the preprocessing phase is to prepare a data structure by which one can efficiently compute the failure function in the matching phase:

  • Input: the position \(k + 1\) (where a mismatch occurs) and a function \(\pi \) whose domain is \(\varPi _{P[1:k]}\),

  • Output: the largest position \(j+1 < k+1\) (at which we will resume comparison) and the function \(\pi '\) with domain \(\varPi _{P[1:j]}\) such that \(\hat{\pi }'(P[1:j]) = \hat{\pi }(P[k-j+1:k])\).

We call such \(\pi \) a preceding function, \(\pi '\) a succeeding function, and the pair \((\pi ,\pi ')\) a (k, j)-shifting function pair. The substrings P[1 : j] and \(P[k-j+1:k]\) may not form a border of P[1 : k], but under the preceding and succeeding functions they play the same role as a border plays in the classical KMP algorithm. The succeeding function \(\pi '\) is uniquely determined by a preceding function \(\pi \) and the positions k and j. The condition for functions \(\pi \) and \(\pi '\) to form a (k, j)-shifting function pair can be expressed using the (k, j)-shifting graph (on P), defined as follows.

Definition 5

Let \(\varPi '\) be a copy of \(\varPi \) and \(P'\) be obtained from P by replacing every variable in \(\varPi \) with its copy in \(\varPi '\). For two numbers k and j such that \(0 \le j < k \le m\), the (k, j)-shifting graph \(G_{k,j} = (V_{k,j},E_{k,j})\) is defined by

$$\begin{aligned} V_{k,j}&=\varSigma _P \cup \varPi _{P[k-j+1:k]} \cup \varPi '_{P'[1:j]}, \\ E_{k,j}&=\{\, (P[k-j+i], P'[i]) \mid 1 \le i \le j < k \text { and } P[k-j+i] \ne P'[i] \,\} \,. \end{aligned}$$

We say that \(G_{k,j}\) is invalid if there are distinct \(p,q \in \varSigma _P\) that belong to the same connected component. Otherwise, it is valid.

Note that \(G_{k,0} = (\varSigma _{P},\emptyset )\) is valid for any k. Figure 2 shows the (7, 6)-shifting and (7, 3)-shifting graphs for \(P=\mathtt {AABaaCbC}\) in Example 3. Using functions \(\pi \) and \(\pi '\) whose domains are \(\text {dom}(\pi ) = \varPi _{P[k-j+1:k]}\) and \(\text {dom}(\pi ') = \varPi _{P[1:j]}\), respectively, let us label each node \(p \in \varSigma \), \(x \in \varPi \), and \(x' \in \varPi '\) of \(G_{k,j}\) with \(p\), \(\pi (x)\), and \(\pi '(x)\), respectively. Then \((\pi ,\pi ')\) is a (k, j)-shifting function pair if and only if every node in each connected component has the same label. Obviously, \(G_{k,j}\) is valid if and only if it admits a (k, j)-shifting function pair.
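As an illustration of Definition 5, the naive sketch below builds \(G_{k,j}\) from scratch and checks its validity with a small union-find; the efficient algorithm described later updates \(G_{k-1,j-1}\) incrementally instead, so this code is only for clarity. The node encoding (pairs tagged 'o' for original symbols and 'c' for primed copies) and the uppercase/lowercase convention are ours.

```python
# Naive construction of the (k, j)-shifting graph of Definition 5 and its validity check.
# Conventions of this sketch: uppercase = variables, lowercase = constants;
# ('o', x) denotes an original symbol and ('c', x) the primed copy of variable x.

def find(parent, u):
    while parent[u] != u:
        parent[u] = parent[parent[u]]      # path halving
        u = parent[u]
    return u

def union(parent, u, v):
    parent.setdefault(u, u)
    parent.setdefault(v, v)
    ru, rv = find(parent, u), find(parent, v)
    if ru != rv:
        parent[ru] = rv

def shifting_graph_valid(P: str, k: int, j: int) -> bool:
    """True iff G_{k,j} is valid: no two distinct constants share a connected component."""
    parent = {}
    for i in range(1, j + 1):              # edge (P[k-j+i], P'[i]) whenever the nodes differ
        a, b = P[k - j + i - 1], P[i - 1]
        u = ('o', a)
        v = ('c', b) if b.isupper() else ('o', b)
        if u != v:
            union(parent, u, v)
    seen_const = {}                        # component representative -> constant found in it
    for u in list(parent):
        if u[0] == 'o' and u[1].islower():
            r = find(parent, u)
            if r in seen_const and seen_const[r] != u[1]:
                return False               # two distinct constants are connected
            seen_const[r] = u[1]
    return True

P = "AABaaCbC"                             # the pattern of Example 3
print(shifting_graph_valid(P, 7, 6))       # True  (G_{7,6} is valid)
print(shifting_graph_valid(P, 7, 3))       # True  (G_{7,3} is valid)
print(shifting_graph_valid("abAA", 4, 2))  # False (A would link constants a and b)
```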

Fig. 2. The (7, 6)-shifting graph (a) and the (7, 3)-shifting graph (b) on \(P = \texttt {AABaaCbC}\), corresponding to Fig. 1(i) and (ii).

Thus, the resuming position should be \(j+1\) for a mismatch at \(k+1\) with a preceding function \(\pi \) if and only if j is the largest such that \(G_{k,j}\) is valid and

  (a) if \(x \in \varPi \) and \(p \in \varSigma \) are connected in \(G_{k,j}\), then \({\pi }(x)=p\),

  (b) if \(x \in \varPi \) and \(y \in \varPi \) are connected in \(G_{k,j}\), then \({\pi }(x)=\pi (y)\).

In that case, we have \(\hat{\pi }'(P[1:j])=\hat{\pi }(P[k-j+1:k])\) for \(\pi '\) determined by

  (c) \(\pi '(x)=\hat{\pi }(y)\) if \(x' \in \varPi '_{P[1:j]}\) and \(y \in \varPi \cup \varSigma \) are connected.

We call the conditions (a) and (b) the (k, j)-preconditions and (c) the (k, j)-postcondition. Note that every element in \(\varPi '_{P'[1:j]}\) is connected to some element in \(\varPi _{P[k-j+1:k]} \cup \varSigma _P\) in \(G_{k,j}\) and thus \(\pi '\) is well-defined.

Remark 1

The algorithms EdgesConstruction (preprocessing) and Match (matching) by du Mouza et al. [7] do not correctly treat the conditions induced by two nodes at distance more than 1. For example, let us consider the pattern \(P=\mathtt {AABaaCbC}\) from Example 3. For the text \(T = \mathtt {bbaaaabbb}\), the first mismatch occurs at \(k=8\), where \(\hat{\rho }(P[1:7]) = \mathtt {bbaaaab}\) for \(\rho (\mathtt {A})=\mathtt {b}\) and \(\rho (\mathtt {B})=\rho (\mathtt {C})=\mathtt {a}\). For \((\rho ,\rho ')\) to be a (7, 6)-shifting function pair for some \(\rho '\), it must hold that \(\rho (\mathtt {A})=\rho (\mathtt {B})\). That is, one can resume the comparison at position 7 only when the preceding function assigns the same constant to \(\mathtt {A}\) and \(\mathtt {B}\). The preceding function \(\rho \) in this case does not satisfy this constraint. However, their algorithm performs this shift and reports that P matches T at position 2.

To efficiently compute the failure function, our algorithm constructs another data structure instead of shifting graphs. The shifting condition table is a collection of functions \(A_{k,j}: \varPi _{P[k-j+1:k]} \rightarrow \varPi _{P[k-j+1:k]} \cup \varSigma _{P}\) and \(A'_{k,j}:\varPi '_{P'[1:j]} \rightarrow \varPi _{P[k-j+1:k]} \cup \varSigma _{P}\) for \(1 \le j < k \le m\) such that \(G_{k,j}\) is valid. The functions \(A_{k,j}\) can be used to quickly check the (k, j)-preconditions (a) and (b), and \(A'_{k,j}\) is for the (k, j)-postcondition (c). Those functions satisfy the following property: for each connected component \(\alpha \subseteq V_{k,j}\), there is a representative \(u_\alpha \in \alpha \) such that

  • if \(\alpha \cap \varSigma \ne \emptyset \), then \(u_\alpha \in \varSigma \),

  • if \(\alpha \cap \varSigma = \emptyset \), then \(u_\alpha \in \varPi \),

  • for all \(x \in \alpha \cap \varPi \), \(A_{k,j}(x)=u_\alpha \),

  • for all \(x' \in \alpha \cap \varPi '\), \(A'_{k,j}(x') \in \alpha \cap (\varPi \cup \varSigma )\).

Note that \(G_{k-1,j-1}\) is a subgraph of \(G_{k,j}\), where the difference is at most two nodes and one edge. Hence, we can compute \(A_{k,j}\) and \(A'_{k,j}\) from \(A_{k-1,j-1}\) and \(A'_{k-1,j-1}\) in \(O(\log {|\varPi |})\) worst-case time and \(O(\mathcal {A}(|\varPi |))\) amortized time, where \(\mathcal {A}(n)\) is the inverse-Ackermann function, by using Union-Find data structure [14]. Moreover, when computing \(A_{k,j}\) and \(A'_{k,j}\), we can verify the validity of \(G_{k,j}\).

Lemma 4

The shifting condition table can be calculated in \(O(\log {|\varPi |} m^2)\) time.

Suppose that we have a mismatch at position \(k+1\) with a preceding function \(\pi \). By using the shifting condition table, a naive algorithm can compute the failure function in \(O(k|\varPi |^2)\) time by finding the largest j such that \(\pi \) satisfies the (k, j)-preconditions and then computing a function \(\pi '\) satisfying the (k, j)-postcondition, with which we resume comparison at \(j+1\). The calculation of \(\pi '\) can be done in \(O(|\varPi |)\) time just by referring to the array \(A'_{k,j}\). We next discuss how to reduce the computational cost of finding j by preparing an elaborate data structure in the preprocessing phase.

Du Mouza et al. [7] introduced a bitmap data structure concerning the precondition (a), which can be constructed using \(A_{k,j}\) in the shifting condition table as follows. Here we extend the domain of \(A_{k,j}\) to \(\varPi \) by defining \(A_{k,j}(x)=x\) for each \(x \in \varPi \setminus \varPi _{P[k-j+1:k]}\).

Definition 6

[7]. For every \(0 \le j < k \le m\), \(x \in \varPi \) and \(p \in \varSigma _P\), define

$$\begin{aligned} r_{x, p}^{k}[j]= {\left\{ \begin{array}{ll} 0 &{} (G_{k,j} \text { is invalid or } A_{k,j}(x) \in \varSigma \setminus \{p\}) \\ 1 &{} (\text {otherwise}) \end{array}\right. } \end{aligned}$$

Lemma 5

[7]. A preceding function \(\pi \) satisfies the (k, j)-precondition (a) if and only if \( \bigwedge _{x \in \varPi }{r_{x, \pi (x)}^{k}[j]} = 1\).

We define a data structure corresponding to the (k, j)-precondition (b) as follows.

Definition 7

For every \(0 \le j < k \le m\) and \(x,y \in \varPi \), define

$$\begin{aligned} s_{x, y}^{k}[j]= {\left\{ \begin{array}{ll} 0 &{} (G_{k,j} \text { is invalid or } A_{k,j}(x) = y) \\ 1 &{} (\text {otherwise}) \end{array}\right. } \end{aligned}$$

Lemma 6

A preceding function \(\pi \) satisfies the (k, j)-precondition (b) if and only if \( \bigwedge _{x,y \in \varPi ,\, \pi (x) \ne \pi (y)} s_{x, y}^{k}[j] = 1\).

Therefore, we should resume comparison at \(j+1\) for the largest j that satisfies the conditions of Lemmas 5 and 6. To calculate such j quickly, the preprocessing phase computes the following bit sequences. For every \(x \in \varPi \), \(p \in \varSigma _P\) and \(1 \le k \le m\), let \(r_{x, p}^{k}\) be the concatenation of \(r_{x, p}^{k}[j]\) in ascending order of j:

$$\begin{aligned} r_{x, p}^{k} = r_{x, p}^{k}[0]r_{x, p}^{k}[1]\cdots r_{x, p}^{k}[k-1] \,,\end{aligned}$$

and for every \(x,y \in \varPi \) and \(1 \le k \le m\), let

$$\begin{aligned} s_{x, y}^{k} = s_{x, y}^{k}[0]s_{x, y}^{k}[1]\cdots s_{x, y}^{k}[k-1] \,.\end{aligned}$$

Calculating \(r_{x, p}^{k}\) and \(s_{x, y}^{k}\) for all \(x,y \in \varPi \), \(p \in \varSigma _P\) and \(1 \le k \le m\) in the preprocessing phase requires \(O(|\varPi |(|\varSigma _{P}|+|\varPi |)m^2)\) time in total. When a mismatch occurs at \(k+1\) with a preceding function \(\pi \), we compute

$$ J = \bigwedge _{x \in \varPi } r_{x, \pi (x)}^{k} \wedge \bigwedge _{\begin{array}{c} x,y \in \varPi \\ \pi (x) \ne \pi (y) \end{array}} s_{x, y}^{k} \,.$$

Then the desired j is the right-most position of 1 in J. This operation can be done in \(O(\lceil \frac{m}{w} \rceil |\varPi |^2)\) time, where w denotes the word size of a machine. That is, with \(O(|\varPi |(|\varSigma _{P}|+|\varPi |)m^2)\) preprocessing time, the failure function can be computed in \(O(|\varPi |^2 \lceil \frac{m}{w} \rceil )\) time. For most applications, we can assume that m is smaller than the word size w, i.e. \(\lceil \frac{m}{w} \rceil = 1\).
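To illustrate just this word-level step, the sketch below computes J and the resuming position when the bit sequences are stored as machine words. Storing the bit for index j as the j-th least significant bit (so that the "right-most 1" above becomes the highest set bit) is our own convention for the sketch, and the dictionaries r and s below are hypothetical example data, not values produced by the preprocessing described in the paper.

```python
# Word-level step of the failure-function computation (Sect. 4.1).
# Sketch convention: bit j of each word is the j-th least significant bit.

def failure_shift(k, pi, r, s, variables):
    """Given a mismatch at k+1 with preceding function pi (variables = dom(pi)),
    return the largest j satisfying Lemmas 5 and 6, or None if no shift exists."""
    J = (1 << k) - 1                           # candidate positions j = 0, ..., k-1
    for x in variables:                        # Lemma 5: precondition (a)
        J &= r[(x, pi[x], k)]
    for x in variables:                        # Lemma 6: precondition (b)
        for y in variables:
            if pi[x] != pi[y]:
                J &= s[(x, y, k)]
    return J.bit_length() - 1 if J else None   # largest j whose bit is 1

# Hypothetical data for one variable A and k = 3:
# positions j = 0 and j = 2 are compatible with pi(A) = 'a'.
r = {('A', 'a', 3): 0b101}
s = {('A', 'A', 3): 0b111}
print(failure_shift(3, {'A': 'a'}, r, s, ['A']))   # 2, so comparison resumes at j + 1 = 3
```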

Theorem 2

The FVC-matching problem can be solved in \(O(|\varPi |^2 \lceil \frac{m}{w} \rceil n)\) time with \(O(|\varPi |(|\varSigma _{P}|+|\varPi |)m^2)\) preprocessing time.

4.2 Extended KMP Algorithm for PVC-Match

In this subsection, we consider the PVC-matching problem. We redefine the (mis)match and the failure function in the same manner as in the previous subsection, except that all the functions are restricted to be injective. We define \(G_{k,j}\) exactly in the same manner as in the previous subsection. However, the condition represented by that graph should be strengthened in accordance with the injectivity constraint on matching functions. We say that \(G_{k,j}\) is injectively valid if, for each \(\varDelta \in \{\varSigma , \varPi ,\varPi '\}\), any distinct nodes from \(\varDelta \) are disconnected. Otherwise, it is injectively invalid. There is a (k, j)-shifting injection pair if and only if \(G_{k,j}\) is injectively valid.

For \(P=\mathtt {AABaaCbC}\) in Example 3 (see Fig. 2), the (7, 6)-shifting graph \(G_{7,6}\) is valid but injectively invalid, since \(\mathtt {A}\) and \(\mathtt {B}\) are connected. On the other hand, \(G_{7,3}\) is injectively valid.
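In the same naive style as the earlier sketch of Definition 5, injective validity can be checked by verifying that every connected component contains at most one constant, at most one original variable, and at most one copied variable. The self-contained sketch below is for illustration only; the algorithm of this subsection maintains \(F_{k,j}\) incrementally instead.

```python
# Naive injective-validity check for G_{k,j} (Sect. 4.2), for illustration only.
# Conventions as before: uppercase = variables, lowercase = constants;
# ('o', x) is an original symbol, ('c', x) the primed copy of variable x.

def injectively_valid(P: str, k: int, j: int) -> bool:
    adj = {}
    for i in range(1, j + 1):                  # edges of Definition 5
        a, b = P[k - j + i - 1], P[i - 1]
        u = ('o', a)
        v = ('c', b) if b.isupper() else ('o', b)
        if u != v:
            adj.setdefault(u, set()).add(v)
            adj.setdefault(v, set()).add(u)
    seen = set()
    for start in adj:                          # explore each connected component
        if start in seen:
            continue
        stack, comp = [start], []
        seen.add(start)
        while stack:
            u = stack.pop()
            comp.append(u)
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        consts = sum(1 for tag, x in comp if tag == 'o' and x.islower())
        origs = sum(1 for tag, x in comp if tag == 'o' and x.isupper())
        copies = sum(1 for tag, x in comp if tag == 'c')
        if consts > 1 or origs > 1 or copies > 1:
            return False                       # two nodes of the same class are connected
    return True

P = "AABaaCbC"                                 # the pattern of Example 3
print(injectively_valid(P, 7, 6), injectively_valid(P, 7, 3))   # False True
```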

In PVC-matching, the condition for an injection pair \((\pi ,\pi ')\) to be (k, j)-shifting is described using the graph labeling by \((\pi ,\pi ')\) as follows:

  • two nodes are assigned the same label if and only if they are connected.

Under the assumption that \(G_{k,j}\) is injectively valid, the (k, j)-precondition on a preceding function \(\pi \) is given as

  (a) if \(x \in \varPi \) and \(p \in \varSigma \) are connected, then \({\pi }(x)=p\),

  (b’) if \(x \in \varPi \) and \(x' \in \varPi '\) are connected and \(y' \in \varPi ' \setminus \{x'\}\) and \(p \in \varSigma \) are connected, then \({\pi }(x)\ne p\).

Since each connected component of an injectively valid shifting graph \(G_{k,j}\) has at most 3 nodes, it is cheap to compute the function \(F_{k,j}:V_{k,j} \rightarrow 2^{V_{k,j}}\) such that \(F_{k,j}(u) = \{\, v \in V_{k,j} \mid u \text { and } v \text { are connected in } G_{k,j}\,\}\). For technical convenience, we assume \(F_{k,j}(u) = \emptyset \) for \(u \in \varPi \setminus V_{k,j}\). Using P[k], P[j], and \(F_{k-1,j-1}\), one can decide whether \(G_{k,j}\) is injectively valid and can compute \(F_{k,j}\) (if \(G_{k,j}\) is injectively valid) in constant time.

Suppose that we have a preceding function \(\pi \) at position k. By using the function \(F_{k,j}\), a naive algorithm can compute the failure function in \(O(k|\varPi |)\) time. We define a bitmap \(t_{x, p}^{k}[j]\) to check if \(\pi \) satisfies preconditions (a) and (b’).

Definition 8

For every \(0 \le j < k \le m\), \(x \in \varPi \) and \(p \in \varSigma _P\), define

$$\begin{aligned} t_{x, p}^{k}[j]= {\left\{ \begin{array}{ll} 0 &{} (G_{k,j} \text { is injectively invalid, or } F_{k,j}(x) \cap \varSigma \nsubseteq \{p\}, \\ &{} \text { or } |(F_{k,j}(x) \cup F_{k,j}(p)) \cap \varPi '| = 2 ) \\ 1 &{} (\text {otherwise}) \end{array}\right. } \end{aligned}$$

The conditions \(F_{k,j}(x) \cap \varSigma \nsubseteq \{p\}\) and \( |(F_{k,j}(x) \cup F_{k,j}(p)) \cap \varPi '| = 2\) in Definition 8 for \(p=\pi (x)\) correspond to the (k, j)-preconditions (a) and (b’), respectively.

Lemma 7

Suppose that \(G_{k,j}\) is injectively valid. The preceding function \(\pi \) satisfies the (k, j)-preconditions (a) and (b’) if and only if \( \bigwedge _{x \in \varPi }{t_{x, \pi (x)}^{k}[j]} = 1\).

Proof

Suppose that \(\pi \) violates the (k, j)-precondition (a). Then there are \(x \in \varPi \) and \(q \in \varSigma \) that are connected in \(G_{k,j}\) such that \(\pi (x) \ne q\). Hence \(q \in F_{k,j}(x) \cap \varSigma \), so \(F_{k,j}(x) \cap \varSigma \nsubseteq \{\pi (x)\}\) and \(t_{x,\pi (x)}^k[j] = 0\). Thus \(\bigwedge _{y \in \varPi }{t_{y,\pi (y)}^{k}[j]} = 0\). Suppose that \(\pi \) violates the (k, j)-precondition (b’). Then there are \(x \in \varPi \), \(x' \in \varPi '\), and \(y' \in \varPi ' \setminus \{x'\}\) such that \(x' \in F_{k,j}(x)\) and \(y' \in F_{k,j}({\pi }(x))\). Then \((F_{k,j}(x) \cup F_{k,j}(\pi (x))) \cap \varPi ' = \{x',y'\}\), and thus \(\bigwedge _{y \in \varPi }{t_{y,\pi (y)}^{k}[j]} = t_{x,\pi (x)}^k[j] = 0\).

Suppose that \(\bigwedge _{y \in \varPi }{t_{y, \pi (y)}^{k}[j]} = 0\). Then there is \(x \in \varPi \) for which \(t_{x, \pi (x)}^{k}[j] = 0\), and either \(F_{k,j}(x) \cap \varSigma \nsubseteq \{\pi (x)\} \) or \(|(F_{k,j}(x) \cup F_{k,j}(\pi (x))) \cap \varPi '| = 2\). In the former case, there is \(p \in (F_{k,j}(x) \cap \varSigma ) \setminus \{\pi (x)\}\), which means that x and p are connected but \(\pi (x) \ne p\). This violates the (k, j)-precondition (a). In the latter case, there are distinct \(x',y' \in \varPi '\) such that \(x' \in F_{k,j}(x)\) and \(y' \in F_{k,j}(\pi (x))\). That is, x and \(x'\) are connected, and \(y'\) and \(\pi (x)\) are connected, which violates the (k, j)-precondition (b’).    \(\square \)

In the preprocessing phase, we calculate

$$\begin{aligned} t_{x, p}^{k} = t_{x, p}^{k}[0]t_{x, p}^{k}[1]\cdots t_{x, p}^{k}[k-1] \end{aligned}$$

for all \(x \in \varPi \), \(p \in \varSigma _P\) and \(1 \le k \le m\), which requires \(O(|\varPi ||\varSigma _P|m^2)\) time. When a mismatch occurs at \(k+1\) with a function \(\pi \), we compute

$$ J = \bigwedge _{x \in \varPi } t_{x, \pi (x)}^{k} \,$$

where the desired j is the right-most position of 1 in J. We resume comparison at \(j+1\). The calculation of the failure function can be done in \(O(|\varPi |\lceil \frac{m}{w} \rceil )\) time, where w denotes the word size of a machine.

Theorem 3

The PVC-matching problem can be solved in \(O(|\varPi |\lceil \frac{m}{w} \rceil n)\) time with \(O(|\varPi ||\varSigma _{P}|m^2)\) preprocessing time.

5 Concluding Remarks

In this paper, we proposed efficient algorithms for the FVC-matching and PVC-matching problems. The FVC-matching problem was discussed by du Mouza et al. [7] as a generalization of the function matching problem, while the PVC-matching problem is newly introduced in this paper and can be seen as a generalization of the parameterized pattern matching problem. We have fixed a flaw in the algorithm by du Mouza et al. for the FVC-matching problem. Moreover, the experimental results [10] show that our algorithms run more efficiently than the trivial O(mn) algorithm.

There can be further variants of matching problems. For example, one may think of a pattern with don’t care symbols in addition to variables and constants. This is not interesting for function matching when don’t care symbols appear only in the pattern, since the don’t care symbols can then be regarded as distinct variables. However, when imposing the injectivity condition on a matching function, don’t care symbols play a different role from variables. This generalization was tackled in [9]. We can consider an even more general problem by allowing texts to have variables, where two strings P and S are said to match if there is a function \(\pi \) such that \(\hat{\pi }(P)=\hat{\pi }(S)\). This is a special case of the word equation problem, in which a string rather than a single symbol may be substituted for a variable; word equations are very difficult to solve in general. Another interesting restriction of word equations may allow different substitutions to be used on the compared strings, i.e., P and S match if there are functions \(\pi \) and \(\rho \) such that \(\hat{\pi }(P)=\hat{\rho }(S)\). These are interesting directions for future work.