Efficient Parameterized Pattern Matching in Sublinear Space

Ideguchi, Haruki; Hendrian, Diptarama; Yoshinaka, Ryo; Shinohara, Ayumi

doi:10.1007/978-3-031-43980-3_22

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 14240))

Included in the following conference series:

International Symposium on String Processing and Information Retrieval

471 Accesses

Abstract

The parameterized matching problem is a variant of string matching, which is to search for all parameterized occurrences of a pattern P in a text T. In considering matching algorithms, the combinatorial natures of strings, especially periodicity, play an important role. In this paper, we analyze the properties of periods of parameterized strings and propose a generalization of Galil and Seiferas’s exact matching algorithm (1980) into parameterized matching, which runs in $O(\pi |T|+|P|)$ time and $O(\log {|P|}+|\mathrm{\Pi }|)$ space in addition to the input space, where $\mathrm{\Pi }$ is the parameter alphabet and $\pi $ is the number of parameter characters appearing in P plus one.

Access provided by Autonomous University of Puebla. Download conference paper PDF

On the Practical Power of Automata in Pattern Matching

Article Open access 06 April 2024

A Parameterized Study of Maximum Generalized Pattern Matching Problems

On the Space and Circuit Complexity of Parameterized Problems: Classes and Completeness

Article 16 October 2014

Keywords

1 Introduction

String matching is a problem to search for all occurrences of a pattern P in a text T. Since it is one of the most important computer applications, many efficient algorithms for the problem have been proposed. Let us denote the length of T and P by n and m, respectively. While a naive algorithm takes O(nm) time to solve the problem, Knuth, Morris, and Pratt [13] gave an algorithm which runs in only $O(n+m)$ time by constructing auxiliary arrays called border arrays. After that, various algorithms to solve the problem in linear time have been proposed, which use auxiliary data structures, such as suffix trees [19], suffix arrays [15], LCP arrays [15]. All of those algorithms outperform the naive algorithm in terms of time complexity. They require additional space to store their auxiliary data, whose sizes are typically $\varTheta (n)$ or $\varTheta (m)$. On the other hand, studies for reducing such extra space were conducted. Firstly, Galil and Seiferas reduced extra space usage to $O(\log {m})$ [11], and later several time-space-optimal, $O(n+m)$ time and O(1) extra-space algorithms were devised [5, 6, 12].

In this paper, we consider a variant of string matching: parameterized matching. It is a pattern matching paradigm in which two strings are considered a match if we can map some characters (parameter characters) in one string to characters in another string. This paradigm was first introduced by Baker [4] for use in software maintenance by the ability to detect ‘identical’ computer programs renaming their variables. For solving the parameterized matching problem, a number of linear-time algorithms have been proposed that extend algorithms for exact matching [2, 7, 8, 10, 14, 17, 18]. See also [16] for a survey. However, we know of no previous attempt to reduce extra space usage to sublinear for time-efficient parameterized matching algorithms, although one can solve the problem in constant extra space if the time efficiency does not matter.

The main contribution of this paper is to give a sublinear-extra-space algorithm for the parameterized matching problem by extending Galil and Seiferas’s exact matching algorithm [11]. It runs in $O(|{\mathrm{\Pi }_{P}}|n+m)$ time and $O(\log {m}+|\mathrm{\Pi }|)$ space in addition to the input space, where $\mathrm{\Pi }$ is the set of parameter characters and ${\mathrm{\Pi }_{P}}$ is the non-empty^{Footnote 1} set of parameter characters appearing in P.

In order to provide the basis for our algorithm, we also investigate the properties of periodicity of parameterized strings in this paper. It is widely known that periods of strings are useful for exact matching algorithms [5, 6, 11,12,13], which is also the case for parameterized matching [2]. We extend previous work on parameterized periods by Apostolico and Giancarlo [3] and derive several properties for our algorithm. In particular, we focus on ‘sufficiently short’ periods of parameterized strings having properties useful for matching algorithms. Those results contain a parameterized version of Fine and Wilf’s periodicity lemma [9].

Remark 1

The time and space complexities of our algorithm stated above are based on a computing model in which functions $\mathrm{\Pi }\rightarrow \mathbb {N}$ can be stored as arrays. If not, one can use AVL trees [1] instead of arrays to store such functions. Then, our algorithm runs in $O((|{\mathrm{\Pi }_{P}}|n+m)\log |{\mathrm{\Pi }_{P}}|)$ time and $O(\log {m}+|{\mathrm{\Pi }_{P}}|)$ extra space.

2 Preliminaries

Let $\mathbb {N}$ and $\mathbb {N}^{+}$ be the set of natural numbers including and excluding 0, respectively. For $x,y\in \mathbb {N}$, we denote by $x\mid y$ that y is a multiple of x.

For $n\in \mathbb {N}$ and a function f whose domain and codomain are the same, we denote by $f^n$ the composite of the function n times.

2.1 Parameterized Matching Problem

In parameterized matching, we consider two disjoint alphabets: the constant alphabet $\mathrm{\Sigma }$ and the parameter alphabet $\mathrm{\Pi }$. A string over ${\mathrm{\Sigma }\cup \mathrm{\Pi }}$ is called a parameterized string or a p-string. Consider a p-string $w\in {({\mathrm{\Sigma }\cup \mathrm{\Pi }})^*}$. We denote the length of w by |w|. For $0 \le i < |w|$, let us denote i-th letter of w by w[i], where the index i is 0-based. For $0 \le i \le j \le |w|$, we denote the substring $w[i]w[i+1]\cdots w[j-1]$ by ${w[i:j]}$. (Note that ${w[i:j]}$ does not contain w[j].)

We denote the set of permutations of $\mathrm{\Pi }$ by $S_\mathrm{\Pi }$. Throughout this paper, for a permutation $f\in S_\mathrm{\Pi }$ and a constant character $c\in \mathrm{\Sigma }$, let $f(c)=c$. Then, the map f is naturally expanded as a bijection over p-strings: ${({\mathrm{\Sigma }\cup \mathrm{\Pi }})^*}\rightarrow {({\mathrm{\Sigma }\cup \mathrm{\Pi }})^*}$.

Definition 1

(Baker [4]). Two p-strings x and y are called a parameterized-match or a p-match if and only if there exists a permutation $f\in S_\mathrm{\Pi }$ such that $f(x)=y$. Denote this relation by $x\equiv y$.

Example 1

Let $\mathrm{\Sigma }= \{ \texttt{a}, \texttt{b}, \texttt{c} \}$ and $\mathrm{\Pi }= \{ \texttt{A}, \texttt{B}, \texttt{C} \}$. We have $\texttt{ABaCBCa} \equiv \texttt{BCaACAa}$ with a permutation f such that $f(\texttt{A})=\texttt{B}$, $f(\texttt{B})=\texttt{C}$, and $f(\texttt{C})=\texttt{A}$.

Clearly, the relation $\equiv $ is an equivalence relation over ${({\mathrm{\Sigma }\cup \mathrm{\Pi }})^*}$. Note that if $x\equiv y$, we have $|x|=|y|$ and ${x[i:j]}\equiv {y[i:j]}$ for any $0\le i\le j\le |x|$. By this relation, the problem we consider in this paper, the parameterized matching problem, is defined as follows.

Problem 1

([4]). Given two p-strings T (text) and P (pattern), find all $0\le i\le |T|-|P|$ such that ${T[i:i+|P|]}\equiv P$.

Remark 2

For Problem 1, we can assume that P contains at least one parameter character without loss of generality. If $P\in \mathrm{\Sigma }^*$, choose any $c\in \mathrm{\Sigma }$ appearing in P and let constant and parameter alphabets be ${\mathrm{\Sigma }\cup \mathrm{\Pi }}\setminus \{c\}$ and $\{c\}$, respectively. Our algorithm presented in Sect. 4 is based on this assumption.

2.2 Periodicity of Parameterized Strings

Periodicity is one of the most fundamental concepts in combinatorics of strings and a wealth of applications. In exact matching, the Knuth-Morris-Pratt algorithm and various algorithms based on it rely on the properties of periods [5, 6, 11,12,13]. It is also the case for parameterized matching [2], where periods of parameterized strings are defined as follows:

Definition 2

(Apostolico and Giancarlo [3]). Consider $w\in {({\mathrm{\Sigma }\cup \mathrm{\Pi }})^*}$ and $p\in \mathbb {N}^{+}$ with $p\le |w|$. Then, p is called a period of w if and only if ${w[0:|w|-p]}\equiv {w[p:|w|]}$.

If p is a period of w, there exists $f\in S_\mathrm{\Pi }$ satisfying $f({w[0:|w|-p]})={w[p:|w|]}$ by definition. We denote this relation by ${p \parallel _{f} w}$ or simply by ${p \parallel _{} w}$ when f is not specified.

In general, a p-string w can have multiple periods. We denote the shortest period of w as ${ period (w)}$. It is clear that a period p of a p-string w is also a period of any substring $w'$ of w such that $|w'|\ge p$.

Example 2

Let $\mathrm{\Sigma }= \{ \texttt{a}, \texttt{b}, \texttt{c} \}$ and $\mathrm{\Pi }= \{ \texttt{A}, \texttt{B}, \texttt{C} \}$. For $w := \texttt{ABaCBCaACAa}$, we have ${4 \parallel _{f} w}$ as $\texttt{ABaCBCa} \equiv \texttt{BCaACAa}$ with $f(\texttt{A})=\texttt{B}$, $f(\texttt{B})=\texttt{C}$, and $f(\texttt{C})=\texttt{A}$.

Instead of Definition 2, one can use the following equivalent definition for periods, which is a more intuitive representation of the repetitive structure of strings:

Lemma 1

([3]). Consider $w\in {({\mathrm{\Sigma }\cup \mathrm{\Pi }})^*}$, $p\in \mathbb {N}^{+}$, and $f\in S_\mathrm{\Pi }$. Then, ${p \parallel _{f} w}$ holds if and only if w can be written as

$$\begin{aligned} w=f^0(v) \cdot f^1(v) \cdot f^2(v) \cdots f^{{\lfloor \rho \rfloor }-1}(v) \cdot f^{{\lfloor \rho \rfloor }}(v'), \end{aligned}$$

where $\rho =\frac{|w|}{p}$, $v={w[0:p]}$ and $v'$ is a prefix of v (allowing the case $v'$ is empty).

The following lemma has important applications for various matching algorithms. Particularly, it is used to shift the pattern string safely in the Knuth-Morris-Pratt algorithm and variants [2, 13].

Lemma 2

Consider $x,y\in {({\mathrm{\Sigma }\cup \mathrm{\Pi }})^*}$ with $x \equiv y$. For any $0< \delta < { period (y)}$, we have ${x[\delta :|x|]} \not \equiv {y[0:|y|-\delta ]}$.

Proof

We give a proof by contraposition. Suppose ${x[\delta :|x|]} \equiv {y[0:|y|-\delta ]}$. Then we have ${y[0:|y|-\delta ]} \equiv {x[\delta :|x|]} \equiv {y[\delta :|y|]}$, which means ${\delta \parallel _{} y}$. Hence, $\delta \ge { period (y)}$ holds. $\square $

One of the main interest regarding string periodicity is what holds when a string w has two different periods p and q. For ordinary strings, Fine and Wilf’s periodicity lemma [9] gives an answer: $\gcd (p,q)$ is also a period when $|w|\ge p+q-\gcd (p,q)$, where $\gcd (p,q)$ is the greatest common divisor of p and q. Apostolico and Giancarlo showed a similar property for parameterized strings.

Lemma 3

([3]). For $w\in {({\mathrm{\Sigma }\cup \mathrm{\Pi }})^*}$, $p,q\in \mathbb {N}^{+}$, and $f,g\in S_\mathrm{\Pi }$, assume that ${p \parallel _{f} w}$ and ${q \parallel _{g} w}$. If $|w|\ge p+q$ and $fg=gf$, we have ${\gcd (p,q) \parallel _{} w}$.

It is known that the length $|w|=p+q-\gcd (p,q)$ is not sufficient for this lemma unlike in the case of ordinary strings [3].

3 Properties of Parameterized Periods

In this section, we show some properties of periods of parameterized strings. They play an important role in our algorithm presented in Sect. 4.

3.1 Alternative Periodicity Lemma

The requirements of Lemma 3 are slightly different from Fine and Wilf’s lemma for ordinary strings. Particularly, the commutativity of f and g is essential (Lemma 5 in [3]). In this section, we show a new periodicity lemma for parameterized strings which does not assume the commutativity.

Firstly, we focus on parameter characters contained in a given p-string and its substrings. For $w\in {({\mathrm{\Sigma }\cup \mathrm{\Pi }})^*}$, we denote by ${\mathrm{\Pi }_{w}}$ the set of parameter characters appearing in w.

Example 3

Let $\mathrm{\Sigma }= \{ \texttt{a}, \texttt{b}, \texttt{c} \}$ and $\mathrm{\Pi }= \{ \texttt{A}, \texttt{B}, \texttt{C} \}$. For $w:=\texttt{ABabAca}$, we have ${\mathrm{\Pi }_{w}}=\{ \texttt{A}, \texttt{B} \}$.

Lemma 4

Consider $w\in {({\mathrm{\Sigma }\cup \mathrm{\Pi }})^*}$ and any of its substrings $w'$ and $w''$. Then, the following hold:

If $|w'|\ge { period (w)}\cdot (|{\mathrm{\Pi }_{w}}|-1)$, we have $|{\mathrm{\Pi }_{w'}}|\ge |{\mathrm{\Pi }_{w}}|-1$.
If $|w''|\ge { period (w)}\cdot |{\mathrm{\Pi }_{w}}|$, we have ${\mathrm{\Pi }_{w''}}={\mathrm{\Pi }_{w}}$.

Proof

The case ${\mathrm{\Pi }_{w}} = \emptyset $ is trivial. Suppose ${\mathrm{\Pi }_{w}} \ne \emptyset $. Let $p := { period (w)}$ and f be a permutation of $\mathrm{\Pi }$ such that ${p \parallel _{f} w}$. It suffices to show the lemma for the cases $|w'|=p\cdot (|{\mathrm{\Pi }_{w}}|-1)$ and $|w''|=p\cdot |{\mathrm{\Pi }_{w}}|$. By Lemma 1, $w'$ and $w''$ can be written as $w'=v'\cdot f(v') \cdots f^{|{\mathrm{\Pi }_{w}}|-2}(v')$ and $w''=v''\cdot f(v'') \cdots f^{|{\mathrm{\Pi }_{w}}|-1}(v'')$, where $v'$ and $v''$ are the prefixes of $w'$ and $w''$ of length p, respectively. Now, we consider the cyclic decomposition of f.

Suppose the characters in ${\mathrm{\Pi }_{w}}$ make one cyclic permutation in f. Let a be any parameter character contained in $v'$. Note that $a,f(a),\cdots ,f^{|{\mathrm{\Pi }_{w}}|-2}(a)$ are all different characters and all appear in $w'$. Therefore, we have $|{\mathrm{\Pi }_{w'}}|\ge |{\mathrm{\Pi }_{w}}|-1$. The analogous argument shows $|{\mathrm{\Pi }_{w''}}| = |{\mathrm{\Pi }_{w}}|$.

Suppose the characters in ${\mathrm{\Pi }_{w}}$ make two or more cyclic permutations in f. Then, those cyclic permutations are all of length $|{\mathrm{\Pi }_{w}}|-1$ or less. For $0\le i<|w|$, there exists an integer k such that $w[i+kp], w[i+(k+1)p], \cdots , w[i+(k+|{\mathrm{\Pi }_{w}}|-2)p]$ are all contained in $w'$. Then, those characters can be represented as $f^{k}(w[i]),f^{k+1}(w[i]),\cdots ,f^{k+|{\mathrm{\Pi }_{w}}|-2}(w[i])$, and by the assumption about f, at least one of them is equal to w[i]. Therefore, we have $w[i]\in {\mathrm{\Pi }_{w'}}$. Since i is arbitrary, we end up with ${\mathrm{\Pi }_{w}}\subseteq {\mathrm{\Pi }_{w'}}$, as required. $\square $

Now, we show a variant of Lemma 3. It does not require any assumption on the permutations, in exchange of a stricter requirement for the length of strings.

Lemma 5

Suppose $w\in {({\mathrm{\Sigma }\cup \mathrm{\Pi }})^*}$ with ${\mathrm{\Pi }_{w}}\ne \emptyset $ has periods p and q. If $|w|\ge p+q+\min (p,q)\cdot (|{\mathrm{\Pi }_{w}}|-1)$, we have ${\gcd (p,q) \parallel _{} w}$.

Proof

Let f and g be permutations of $\mathrm{\Pi }$ such that ${p \parallel _{f} w}$ and ${q \parallel _{g} w}$. Without loss of generality, we suppose $f(a)=a$ and $g(a)=a$ for any $a\in \mathrm{\Pi }\setminus {\mathrm{\Pi }_{w}}$. By Lemma 3, it suffices to show that $fg=gf$. Let $w':={w[0:|w|-p-q]}$. Then, notice that $fg(w') = f({w[q:|w|-p]}) = {w[p+q:|w|]} = g({w[p:|w|-q]}) = gf(w')$, which claims $fg(a)=gf(a)$ for any $a\in {\mathrm{\Pi }_{w'}}$. Moreover, given $|w'|=|w|-p-q \ge \min (p,q)\cdot (|{\mathrm{\Pi }_{w}}|-1) \ge { period (w)}\cdot (|{\mathrm{\Pi }_{w}}-1|)$, we have $|{\mathrm{\Pi }_{w'}}|\ge |{\mathrm{\Pi }_{w}}|-1$ by Lemma 4. Hence, the permutations fg and gf behave the same for at least $|\mathrm{\Pi }|-1$ parameter characters. This implies $fg=gf$. $\square $

Corollary 1

Suppose $w\in {({\mathrm{\Sigma }\cup \mathrm{\Pi }})^*}$ with ${\mathrm{\Pi }_{w}}\ne \emptyset $ has a period q. If $q\le \frac{|w|}{|{\mathrm{\Pi }_{w}}|+1}$, then ${ period (w)} \mid q$.

Proof

Let $p:={ period (w)}$. By $p\le q\le \frac{|w|}{|{\mathrm{\Pi }_{w}}|+1}$, we have $p\cdot |{\mathrm{\Pi }_{w}}|+q \le q\cdot (|{\mathrm{\Pi }_{w}}|+1) \le \frac{|w|}{|{\mathrm{\Pi }_{w}}|+1}(|{\mathrm{\Pi }_{w}}|+1)=|w|$. Hence, we can use Lemma 5 to obtain ${\gcd (p,q) \parallel _{} w}$. Then, since p is the smallest period of w, we have $\gcd (p,q)\ge p$, which means $\gcd (p,q)=p$ i.e. $p \mid q$, as required. $\square $

3.2 Prefix Periods

Galil and Seiferas’s exact matching algorithm [11] can be regarded as an extension of the Knuth-Morris-Pratt algorithm [13]. The main idea of their algorithm is to deal with only periods of pattern prefixes which are ‘short enough.’ They pointed out that periods shorter than $\frac{1}{k}$ times the length of the string have useful properties for saving space usage in exact string matching for an arbitrarily fixed $k \ge 3$. We show, in this section, that similar properties hold for parameterized strings as well when k is set to be $|{\mathrm{\Pi }_{w}}|+2$. Most of those properties come from Lemma 5 we proved in the previous section.

Lemma 6

Suppose $w\in {({\mathrm{\Sigma }\cup \mathrm{\Pi }})^*}$ has a period p. If $p\le \frac{|w|}{|{\mathrm{\Pi }_{w}}|+1}$, there exists only one character $a\in {\mathrm{\Sigma }\cup \mathrm{\Pi }}$ such that ${p \parallel _{} wa}$.

Proof

Consider the prefix $w':={w[0:|w|-p]}$. By $p\le \frac{|w|}{|{\mathrm{\Pi }_{w}}|+1}$, we have $|w|-p \ge p|{\mathrm{\Pi }_{w}}| \ge { period (w)}|{\mathrm{\Pi }_{w}}|$. By Lemma 4, ${\mathrm{\Pi }_{w'}} = {\mathrm{\Pi }_{w}}$. Therefore, $w[|w|-p]$ already appears in $w'$ as $w[i]=w[|w|-p]$ for some $i < |w|-p$. Hence, for any f such that ${p \parallel _{f} w}$, it holds that ${p \parallel _{f} wa}$ if and only if $a = w[i+p]$. $\square $

Corollary 2

Suppose $w\in {({\mathrm{\Sigma }\cup \mathrm{\Pi }})^*}$ has a period p. For any $\ell \in \mathbb {N}^{+}$ such that $\ell p\le \frac{|w|}{|{\mathrm{\Pi }_{w}}|+1}$, we have ${p \parallel _{} wa} \iff {\ell p \parallel _{} wa}$ for any $a\in {\mathrm{\Sigma }\cup \mathrm{\Pi }}$.

Proof

By Lemma 6, the characters $a_1$ and $a_2$ such that ${p \parallel _{} wa_1}$ and ${\ell p \parallel _{} wa_2}$ are unique respectively. Then, since ${p \parallel _{} wa_1} \implies {\ell p \parallel _{} wa_1}$ (shown immediately by Lemma 1), we get $a_1=a_2$, as required. $\square $

Now, we introduce the key concept for our algorithm: prefix periods. This is a natural extension of the one introduced in [12] for parameterized strings. Hereafter in this section, we consider a fixed p-string $w\in {({\mathrm{\Sigma }\cup \mathrm{\Pi }})^*}$ with ${\mathrm{\Pi }_{w}}\ne \emptyset $ and let $k:=|{\mathrm{\Pi }_{w}}|+2$.

Definition 3

A positive integer $p\in \mathbb {N}^{+}$ is called a prefix period of w if and only if there exists a prefix $w'$ of w such that $period(w')=p$ and $p\le \frac{|w'|}{k}$.

Table 1. Let $\mathrm{\Pi }=\{\texttt{A},\texttt{B}\}$. A p-string $w:=\texttt {ABABBABAABABBABAABBA}$ has prefix periods 1 and 4. Circled numbers in the table below are prefix periods of w with ${w[0:i+1]}$ as witnesses. For instance, 4 is a prefix period of w with ${w[0:18]}$ as a witness because ${ period ({w[0:18]})}=4$ and $4 \le \frac{|{w[0:18]}|}{k}$. (Note that $k = |{\mathrm{\Pi }_{w}}|+2 = 4$.)

Full size table

We give an example of prefix periods in Table 1. For a fixed p, only prefixes $w'$ of w satisfying $|w'|\ge kp$ can be a witness for p being a prefix period. We show in the following lemmas that it suffices to consider only one prefix $w'={w[0:kp]}$ for checking whether p is a prefix period.

Lemma 7

For any $a\in {\mathrm{\Sigma }\cup \mathrm{\Pi }}$, if ${ period (wa)}\ne { period (w)}$, we have ${ period (wa)} > \frac{|w|}{|{\mathrm{\Pi }_{w}}|+1}$.

Proof

We show the lemma by contraposition. Suppose ${ period (wa)} \le \frac{|w|}{|{\mathrm{\Pi }_{w}}|+1}$. Since ${ period (wa)}$ is also a period of w, we can use Corollary 1 to obtain ${ period (w)} \mid { period (wa)}$. Therefore, we get ${{ period (w)} \parallel _{} wa}$ by Corollary 2, which implies ${ period (w)}\ge { period (wa)}$. On the other hand, we have ${ period (w)}\le { period (wa)}$ by definition. Thus ${ period (w)}={ period (wa)}$ holds. $\square $

Lemma 8

Consider any $0<p\le \frac{|w|}{k}$. Then, p is a prefix period of w if and only if ${ period (w')}=p$ where $w':={w[0:kp]}$.

Proof

Immediate by the definition of prefix periods.

$(\!\implies \!)\;$ Let v be a prefix of w that witnesses p being a prefix period, i.e., $|v| \ge kp$ and ${ period (v)} = p$. If $|v| = kp$, we are done. Suppose $|v|>kp$ and let $u:={v[0:|v|-1]}$. Then, $ { period (v)} = p< \frac{|v|}{k} \le \frac{|v|}{|\Pi _u|+2} < \frac{|u|}{|\Pi _u|+1} $. By Lemma 7, we have ${ period (u)}={ period (v)}=p$. By repeatedly applying this discussion, we can shorten the witness up to length kp. $\square $

Next, we introduce an auxiliary function ${ reach _{w}}$.

Definition 4

For any $0<p\le |w|$, let

$$\begin{aligned} { reach _{w}}(p) := \max \{r\in \mathbb {N}: r\le |w| \text { and } {p \parallel _{} {w[0:r]}}\}. \end{aligned}$$

Note that ${p \parallel _{} {w[0:r]}} \iff { reach _{w}}(p)\ge r$ holds by definition. Using ${ reach _{w}}$, we get an equivalent definition of prefix periods as follows, which is directly used in our searching algorithm.

Lemma 9

Consider any $0<p\le \frac{|w|}{k}$. Then, p is a prefix period of w if and only if all the following hold:

(1)
${ reach _{w}}(p) \ge kp$,
(2)
${ reach _{w}}(q) < { reach _{w}}(p)$ for any $0<q<p$.

Proof

$(\!\implies \!)\;$ (1) is by definition. We show (2). By Lemma 8, ${ period ({w[0:kp]})}=p$. Thus, $q < p$ is not a period of ${w[0:kp]}$, i.e., ${ reach _{w}}(q) < kp \le { reach _{w}}(p)$ by (1).

Let $w':={w[0:{ reach _{w}}(p)]}$. (2) implies ${ period (w')}=p$ since any q satisfying $0<q<p$ is not a period of $w'$. Additionally, we have $p\le \frac{|w'|}{k}$ by (1). Thus p is a prefix period of w with $w'$ as a witness. $\square $

Galil and Seiferas [11] in Corollary 1 pointed out that the number of prefix periods of a word w is $O(\log |w|)$. We show in the following lemma that it is the case for parameterized strings. It contributes directly to reducing the space complexity of our algorithm.

Lemma 10

Suppose w has prefix periods p and q. If $p<q$, then $2p\le q$.

Proof

We prove the lemma by contradiction. Suppose $p<q<2p$. By definition, ${p \parallel _{} {w[0:kp]}}$ and ${q \parallel _{} {w[0:kq]}}$ hold. Let $w' := {w[0:kp]}$. By Lemma 8, p is the shortest period of $w'$. Since both p and q are periods of $w'$ and $p\cdot |{\mathrm{\Pi }_{w'}}|+q < p\cdot |{\mathrm{\Pi }_{w}}|+2p = kp = |w'|$, we get ${\gcd (p,q) \parallel _{} w'}$ by Lemma 5. Hence, we have $\gcd (p,q)\ge { period (w')} =p$, which claims $\gcd (p,q)=p$ i.e. $p \mid q$. However, this contradicts to the assumption $p< q < 2p$. $\square $

Corollary 3

The number of prefix periods of $w\in {({\mathrm{\Sigma }\cup \mathrm{\Pi }})^*}$ is at most $\log _2{|w|}$.

4 Proposed Algorithm

In this section, we propose a sublinear-extra-space algorithm for the parameterized matching problem. Throughout this section, let T and P be p-strings whose lengths are n and m respectively, and let $k:=|{\mathrm{\Pi }_{P}}|+2$. Besides, we suppose ${\mathrm{\Pi }_{P}}\ne \emptyset $. Our algorithm is an extension of Galil and Seiferas’s exact string matching algorithm [11] and runs in $O(|{\mathrm{\Pi }_{P}}|n+m)$ time and $O(\log {m}+|\mathrm{\Pi }|)$ extra space. When $|\mathrm{\Pi }|=|{\mathrm{\Pi }_{P}}|=1$, our algorithm behaves exactly as theirs.

Firstly, we introduce a method for testing whether two p-strings match. While it is common to use the prev-encoding [4] for this purpose, it is not suitable for our goal since it requires additional space proportional to the input size. Thus we use an alternative method as follows, which requires only $O(|\mathrm{\Pi }|)$ extra space.

Lemma 11

Consider a prefix x of P and $y\in {({\mathrm{\Sigma }\cup \mathrm{\Pi }})^*}$ with $x\equiv y$ and any $a,b\in {\mathrm{\Sigma }\cup \mathrm{\Pi }}$. We have $xa \equiv yb$ if and only if one of the following holds:

1.
$a\in \mathrm{\Sigma }$ and $a=b$,
2.
$a\in \mathrm{\Pi }$ and ${ first _{P}}(a)\ge |x|$ and $b\in \mathrm{\Pi }$ and ${ count _{y}}(b)=0$,
3.
$a\in \mathrm{\Pi }$ and ${ first _{P}}(a)<|x|$ and $y[{ first _{P}}(a)]=b$,

where ${ first _{P}} : \mathrm{\Pi }\rightarrow \mathbb {N}$ and ${ count _{y}} : \mathrm{\Pi }\rightarrow \mathbb {N}$ are defined as follows:

$$\begin{aligned} { first _{P}}(c)&= {\left\{ \begin{array}{ll} \min \{i\in \mathbb {N}: i<|P| \text { and } P[i]=c\} &{} \text { if } c\in {\mathrm{\Pi }_{P}} \text { ,} \\ |P| &{} \text { if } c\in \mathrm{\Pi }\setminus {\mathrm{\Pi }_{P}}\text { ,} \end{array}\right. } \\ { count _{y}}(c)&= |\{i\in \mathbb {N}: i<|y| \text { and } y[i]=c\}| \end{aligned}$$

Proof

By definition, we have $xa \equiv yb$ if and only if $b=f(a)$, where f satisfies $y=f(x)$. If a is a constant character or appears in x, the value f(a) is determined (Cases 1 and 3). Otherwise, b must be a parameter character not appearing in y (Case 2). $\square $

Let $\texttt{MATCH}(x,y,a,b,{ first _{P}},{ count _{y}})$ be the function which returns whether $xa \equiv yb$ under the condition $x \equiv y$ using Lemma 11. Clearly, one can compute it in constant time if ${ first _{P}}$ and ${ count _{y}}$ are given as arrays. Note that ${ first _{P}}$ can be computed in O(m) time and $O(|\mathrm{\Pi }|)$ space.

4.1 Pattern Preprocessing

In this section, we show the preprocessing for the pattern P for our matching algorithm. The output of the preprocessing is the list of pairs of a prefix period of P (in ascending order) and its reach, just like Galil and Seiferas [11] introduced for exact string matching. The list plays a similar role to the border array in the parameterized Knuth-Morris-Pratt algorithm [2]. While border array uses $\varTheta (m)$ space to memorize the shortest periods of all prefixes of P, the prefix period list requires only $O(\log {m})$ space by Corollary 3.

We present the preprocess in Algorithm 1. The algorithm finds prefix periods and their reaches in order from the smallest to the largest and put them into the list $ PP $. By $ PP [ idx ]. val $ and $ PP [ idx ]. reach $, we denote the $ idx $-th prefix period and its reach in $ PP $, respectively. Starting with $p=1$, it monotonically increases p and checks whether an integer p is a prefix period based on Lemma 9. Throughout the algorithm run, we maintain the invariant

We calculate ${ reach _{P}}(p)$ by increasing r as long as ${P[0:r-p]} \equiv {P[p:r]}$ holds (Lines 10–13). To let the function $\texttt{MATCH}$ decide ${P[0:r-p]} \equiv {P[p:r]}$, we use two auxiliary arrays ${ first _{}}$ and ${ count _{}}$ that satisfy ${ first _{}}[a]={ first _{P}}(a)$ and ${ count _{}}[a] = { count _{P[p:r]}}(a)$, defined in Lemma 11. Moreover, we maintain the variable $ max\_reach $ to be the largest reach calculated so far. By Lemma 9, the condition of Line 14 is satisfied if and only if p is a prefix period. One can construct the list $ PP $ by incrementing p one by one, but it takes too much time. Instead, we use a more efficient way explained later to make the algorithm run in linear time.

The following lemmas justify the behavior of our algorithm.

Lemma 12

Throughout Algorithm 1, the value of the variable $ idx $ is always the upper bound that satisfies $ PP [ idx ]. val \le \frac{r-p}{k}$. If there exists no such index, we have $ idx =-1$.

Proof

The variable $ idx $ is updated in conjunction with p and r to preserve the condition. See Lines 13 and 24. $\square $

Lemma 13

Let $\spadesuit $ hold at Line 17 in Algorithm 1. If ${ period ({P[0:r-p]})} \le \frac{r-p}{k}$, we have $ PP [ idx ]. val = { period ({P[0:r-p]})}$.

Proof

Let $w' := {P[0:r-p]}$, $p' := { period (w')}$, $p'' := PP [ idx ]. val $, and $w'' = {P[0:kp'']}$. By the assumption, $p'$ is a prefix period of P. Additionally, we have $p' \le p$ since ${p \parallel _{} w'}$. Thus $p'$ is in the list $ PP $, and thus we have $p' \le p''$ by Lemma 12. On the other hand, we have ${ period (w'')}=p''$ by Lemma 8. Since $|w''|=kp'' \le r-p=|w'|$, we have ${ period (w'')} \le { period (w')}$, i.e. $p'' \le p'$. Hence we get $p' = p''$. $\square $

Lemma 14

Let $\spadesuit $ hold at Line 17 in Algorithm 1. We have $ PP [ idx ]. reach \ge r-p \iff { period ({P[0:r-p]})} \le \frac{r-p}{k}$.

Proof

Let $w' := {P[0:r-p]}$ and $p' := PP [ idx ]. val $.

$(\!\implies \!)\;$ We have ${p' \parallel _{} w'}$ by the assumption. Then ${ period (w')} \le p' \le \frac{r-p}{k}$ holds by Lemma 12.

By Lemma 13, we have $p' = { period (w')}$. Then $ PP [ idx ]. reach = { reach _{P}}(p') = { reach _{P}}({ period (w')}) \ge |w'| = r-p$. $\square $

Now, we show that the invariant $\spadesuit $ always holds.

Lemma 15

Throughout Algorithm 1, we have ${P[0:r-p]}\equiv {P[p:r]}$.

Proof

One must see the condition preserved at the lines in which p or r is updated. The update at Lines 22–23 is trivial. Line 12 preserves the condition, ensured by the condition of Line 10. For Line 19, let $q:= PP [ idx ]. val $. Since $q={ period ({P[0:r-p]})}$ by Lemma 13, we have ${P[0:r-(p+q)]} \equiv {P[q:r-p]} \equiv {P[p+q:r]}$. Note that Lemma 13 requires $\spadesuit $ only at Line 17, so the argument does not circulate. $\square $

The following lemma plays a key role to avoid incrementing p one by one.

Lemma 16

Consider $P\in {({\mathrm{\Sigma }\cup \mathrm{\Pi }})^*}$, $p\in \mathbb {N}^{+}$ and let $r:={ reach _{P}}(p)$. Then, no prefix period q of P exists such that $p< q < p+{ period ({P[0:r-p]})}$.

Proof

We use Lemma 2 for $x:={P[p:r]}$, $y:={P[0:r-p]}$, $\delta :=q-p$ to obtain ${P[q:r]} \not \equiv {P[0:r-q]}$, which means ${q \nparallel _{} {P[0:r]}}$. Thus we have ${ reach _{P}}(q) < r = { reach _{P}}(p)$, which implies that q is not a prefix period of P by Lemma 9. $\square $

We now present the way to compute the list of prefix periods efficiently, in which we skip calculating ${ reach _{P}}(p)$ if we are sure that p is not a prefix period. For realizing an efficient shift, we maintain a variable $ idx $ so that it points at the largest index of $ PP $ such that $ PP [ idx ]. val \le \frac{r-p}{k}$ (Lemma 12). The shift amount is determined in the following manner. If $ PP [ idx ]. reach \ge r-p > 0$ at Line 17, Lemmas 14 and 13 imply $ PP [ idx ]. val = { period ({P[0:r-p]})}$. Hence, Lemma 16 justifies the shift amount $ PP [ idx ]. val $ of p at Line 19. On the other hand, if $ PP [ idx ]. reach < r-p$, by Lemma 14, we have ${ period ({P[0:r-p]})} > \frac{r-p}{k}$. This justifies the shift ${\lfloor \frac{r-p}{k} \rfloor }+1$ of p at Line 22 again by Lemma 16. If $r-p=0$, then p is incremented by just one.

Now, we show that the algorithm runs in O(m) time. Firstly, notice that the while loops at Line 9 and 10 are repeated only O(m) times in total, since the quantity $kp+r$ keeps increasing and $kp+r \le k\cdot \frac{m}{k}+m = O(m)$. Hence, the fact we must show is that decrementing $ count $ and $ idx $ at Line 18, 21, and 24 takes O(m) time in total. As their values are always greater than or equal to their initial values, the number of decrements does not exceed the number of increments, which is O(m) since they are in Line 11–13.

Theorem 1

All prefix periods of P and their reaches can be calculated in O(m) time and $O(\log {m}+|\mathrm{\Pi }|)$ extra space.

4.2 Searching for Parameterized Matches

Our matching algorithm is shown in Algorithm 2. As it is the case for the Galil-Seiferas algorithm, it resembles the preprocess. Now, the invariants in Algorithm 2 are obtained by replacing p, r, and ${P[p:r]}$ in Lemma 12–15 with i, j, and ${T[i:j]}$, respectively. Particularly, by the invariant that ${P[0:j-i]} \equiv {T[i:j]}$, one can find matching positions i when $j=i+|P|$ (Line 13). The shift amounts are also justified by using Lemma 2 for $x:={T[i:j]}$ and $y:={P[0:j-i]}$, whose conclusion ${T[i+\delta :j]} \not \equiv {P[0:j-i-\delta ]}$ implies ${T[i+\delta :i+\delta +|P|]} \not \equiv P$ for any $\delta $ smaller than the shift by the algorithm. We can show that the searching phase (Line 8–22) runs in $O(|{\mathrm{\Pi }_{P}}|n)$ time in the same way as for the preprocess with the increasing quantity $ki+j$.

Theorem 2

The parameterized matching problem can be solved in $O(|{\mathrm{\Pi }_{P}}|n+m)$ time and $O(\log {m}+|\mathrm{\Pi }|)$ extra space.

5 Conclusion and Future Work

We studied the periodicity of parameterized strings and extended the Galil-Seiferas algorithm [11] for parameterized matching. The proposed algorithm requires only sublinear extra space. The properties of periods of parameterized strings we presented in this paper may be used to design more space-efficient algorithms for parameterized matching, as Galil and Seiferas [12] used prefix periods to design a constant-extra-space algorithm for exact matching.

Notes

1.
We can assume ${\mathrm{\Pi }_{P}}\ne \emptyset $ without loss of generality. See Remark 2.

References

AdelsonVelskii, M., Landis, E.M.: An algorithm for the organization of information. Joint Publications Research Service Washington DC, Technical report (1963)
Google Scholar
Amir, A., Farach, M., Muthukrishnan, S.: Alphabet dependence in parameterized matching. Inf. Process. Lett. 49(3), 111–115 (1994)
Article MATH Google Scholar
Apostolico, A., Giancarlo, R.: Periodicity and repetitions in parameterized strings. Discret. Appl. Math. 156(9), 1389–1398 (2008)
Article MathSciNet MATH Google Scholar
Baker, B.S.: Parameterized pattern matching: algorithms and applications. J. Comput. Syst. Sci. 52(1), 28–42 (1996)
Article MathSciNet MATH Google Scholar
Crochemore, M.: String-matching on ordered alphabets. Theor. Comput. Sci. 92(1), 33–47 (1992)
Article MathSciNet MATH Google Scholar
Crochemore, M., Perrin, D.: Two-way string-matching. J. ACM 38(3), 650–674 (1991)
Article MathSciNet MATH Google Scholar
Deguchi, S., Higashijima, F., Bannai, H., Inenaga, S., Takeda, M.: Parameterized suffix arrays for binary strings. In: Proceedings of the Prague Stringology Conference 2008, pp. 84–94 (2008)
Google Scholar
Diptarama, Katsura, T., Otomo, Y., Narisawa, K., Shinohara, A.: Position heaps for parameterized strings. In: Proceedings of the 28th Annual Symposium on Combinatorial Pattern Matching (CPM 2017), pp. 8:1–8:13 (2017)
Google Scholar
Fine, N.J., Wilf, H.S.: Uniqueness theorems for periodic functions. Proc. Am. Math. Soc. 16(1), 109–114 (1965)
Article MathSciNet MATH Google Scholar
Fujisato, N., Nakashima, Y., Inenaga, S., Bannai, H., Takeda, M.: Right-to-left online construction of parameterized position heaps. In: Proceedings of the Prague Stringology Conference 2018 (PSC 2018), pp. 91–102 (2018)
Google Scholar
Galil, Z., Seiferas, J.: Saving space in fast string-matching. SIAM J. Comput. 9(2), 417–438 (1980)
Article MathSciNet MATH Google Scholar
Galil, Z., Seiferas, J.: Time-space-optimal string matching. J. Comput. Syst. Sci. 26(3), 280–294 (1983)
Article MathSciNet MATH Google Scholar
Knuth, D.E., Morris, J.H., Jr., Pratt, V.R.: Fast pattern matching in strings. SIAM J. Comput. 6(2), 323–350 (1977)
Article MathSciNet MATH Google Scholar
Kosaraju, S.R.: Faster algorithms for the construction of parameterized suffix trees. In: Proceedings of the 36th Annual Symposium on Foundations of Computer Science, pp. 631–638 (1995)
Google Scholar
Manber, U., Myers, G.: Suffix arrays: a new method for on-line string searches. SIAM J. Comput. 22(5), 935–948 (1993)
Article MathSciNet MATH Google Scholar
Mendivelso, J., Thankachan, S.V., Pinzón, Y.: A brief history of parameterized matching problems. Discret. Appl. Math. 274, 103–115 (2020)
Article MathSciNet MATH Google Scholar
Nakashima, K., et al.: Parameterized DAWGs: efficient constructions and bidirectional pattern searches. Theor. Comput. Sci. 933, 21–42 (2022)
Article MathSciNet MATH Google Scholar
Nakashima, K., Hendrian, D., Yoshinaka, R., Shinohara, A.: An extension of linear-size suffix tries for parameterized strings. In: SOFSEM 2020 Student Research Forum, pp. 97–108 (2020)
Google Scholar
Weiner, P.: Linear pattern matching algorithms. In: Proceedings of the 14th Annual Symposium on Switching and Automata Theory, pp. 1–11 (1973)
Google Scholar

Download references

Acknowledgements

The authors deeply appreciate the anonymous reviewers helpful comments. This work was supported by JSPS KAKENHI Grant Numbers JP19K20208 (DH), JP18K11150 (RY), JP20H05703 (RY), JP23K11325 (RY), and JP21K11745 (AS).

Author information

Authors and Affiliations

Graduate School of Information Sciences, Tohoku University, Sendai, Japan
Haruki Ideguchi, Diptarama Hendrian, Ryo Yoshinaka & Ayumi Shinohara

Authors

Haruki Ideguchi
View author publications
You can also search for this author in PubMed Google Scholar
Diptarama Hendrian
View author publications
You can also search for this author in PubMed Google Scholar
Ryo Yoshinaka
View author publications
You can also search for this author in PubMed Google Scholar
Ayumi Shinohara
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Haruki Ideguchi .

Editor information

Editors and Affiliations

ISTI-CNR, Pisa, Italy
Franco Maria Nardini
University of Pisa, Pisa, Italy
Nadia Pisanti
University of Pisa, Pisa, Italy
Rossano Venturini

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Ideguchi, H., Hendrian, D., Yoshinaka, R., Shinohara, A. (2023). Efficient Parameterized Pattern Matching in Sublinear Space. In: Nardini, F.M., Pisanti, N., Venturini, R. (eds) String Processing and Information Retrieval. SPIRE 2023. Lecture Notes in Computer Science, vol 14240. Springer, Cham. https://doi.org/10.1007/978-3-031-43980-3_22

Download citation

DOI: https://doi.org/10.1007/978-3-031-43980-3_22
Published: 20 September 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-43979-7
Online ISBN: 978-3-031-43980-3
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics

Efficient Parameterized Pattern Matching in Sublinear Space

Abstract

Similar content being viewed by others

On the Practical Power of Automata in Pattern Matching

A Parameterized Study of Maximum Generalized Pattern Matching Problems

On the Space and Circuit Complexity of Parameterized Problems: Classes and Completeness

Keywords

1 Introduction

Remark 1

2 Preliminaries

2.1 Parameterized Matching Problem

Definition 1

Example 1

Problem 1

Remark 2

2.2 Periodicity of Parameterized Strings

Definition 2

Example 2

Lemma 1

Lemma 2

Proof

Lemma 3

3 Properties of Parameterized Periods

3.1 Alternative Periodicity Lemma

Example 3

Lemma 4

Proof

Lemma 5

Proof

Corollary 1

Proof

3.2 Prefix Periods

Lemma 6

Proof

Corollary 2

Proof

Definition 3

Lemma 7

Proof

Lemma 8

Proof

Definition 4

Lemma 9

Proof

Lemma 10

Proof

Corollary 3

4 Proposed Algorithm

Lemma 11

Proof

4.1 Pattern Preprocessing

Lemma 12

Proof

Lemma 13

Proof

Lemma 14

Proof

Lemma 15

Proof

Lemma 16

Proof

Theorem 1

4.2 Searching for Parameterized Matches

Theorem 2

5 Conclusion and Future Work

Notes

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Editor information

Editors and Affiliations

Rights and permissions

Copyright information

About this paper

Cite this paper

Download citation

Share this paper

Publish with us