1 Introduction

A pattern with variables, called simply pattern in the context of this work, is a string that consists of terminal symbols (e. g., \(\mathtt {a}, \mathtt {b}, \mathtt {c}\)) and variables (e. g., \(x_1, x_2, x_3\)). The terminal symbols are treated as constants, while the variables are to be uniformly replaced by strings over the set of terminals (i. e., different occurrences of the same variable are replaced by the same string); thus, a pattern is mapped to a terminal word. For example, \(x_1 \mathtt {a}\mathtt {b}x_1 x_2 \mathtt {c}x_2 x_1\) can be mapped to \(\mathtt {a}\mathtt {c}\mathtt {a}\mathtt {b}\mathtt {a}\mathtt {c}\mathtt {c}\mathtt {a}\mathtt {a}\mathtt {c}\mathtt {c}\mathtt {a}\mathtt {a}\mathtt {a}\mathtt {c}\) and \(\mathtt {b}\mathtt {a}\mathtt {b}\mathtt {b}\mathtt {a}\mathtt {c}\mathtt {a}\mathtt {b}\) by the replacements \((x_1 \rightarrow \mathtt {a}\mathtt {c}, x_2 \rightarrow \mathtt {c}\mathtt {a}\mathtt {a})\) and \((x_1 \rightarrow \mathtt {b}, x_2 \rightarrow \mathtt {a})\), respectively.

Patterns with variables appear in various areas of theoretical computer science, such as language theory (pattern languages [2]), learning theory (inductive inference [2, 24, 71, 77], PAC-learning [55]), combinatorics on words (word equations [53, 69], unavoidable patterns [63]), pattern matching (generalised function matching [1, 72]), database theory (extended conjunctive regular path queries [5]), and we can also find them in practice in the form of extended regular expressions with backreferences [12, 35, 39], used in programming languages like Perl, Java, Python, etc.

Generally, in all these contexts, patterns with variables are used to model various combinatorial pattern matching questions. For instance, searching for a word w in a text t can be expressed as testing whether the pattern xwy can be mapped to t, and testing whether a word w contains a cube is equivalent to testing whether the pattern \(xy^3z\) can be mapped to w such that y is not mapped to the empty word. Not only problems of testing whether a given word contains a regularity or a motif of a certain form can be expressed by patterns; problems asking whether a word can be factorised in a specifically restricted manner can also be modelled in this way. For instance, asking whether \(x_1^2x_2^2\ldots x_k^2\) can be mapped to w, such that none of the variables \(x_i\) is mapped to the empty word, is equivalent to asking whether the word w can be factorised into k non-empty squares.

Unfortunately, deciding whether a given arbitrary pattern can be mapped to a given word, the matching problem, is \(\mathsf {NP}\)-complete [2], regardless of whether we require the variables to be mapped to non-empty words or not. This intractability result severely limits the practical application of patterns. Indeed, in many tasks related to applications of patterns, the matching problem is a necessary step, so these tasks become intractable as well. For instance, this is the case for the task of computing so-called descriptive patterns for finite sets of words (see [2, 37, 38] for more information on descriptive patterns): one cannot solve this problem without solving a series of (general) pattern matching tasks [27]. A more detailed analysis of the hardness of the matching problem will be presented in Sect. 3.

On the other hand, some strong restrictions on the structure of patterns yield subclasses for which the matching problem is tractable (i.e., can be solved in polynomial time). This is clearly the case for patterns where the number of different variables is bounded by a constant, but more sophisticated and general such subclasses can be defined. We will discuss a series of results related to this topic in Sects. 4.2, 5.1 and 5.2. In our analysis, the most general class of patterns which allows for a polynomial-time matching problem is obtained by establishing a deep connection between strings/patterns and graphs, and by considering only patterns which correspond to graphs with bounded structural parameters; a prime example is the subclass of patterns with bounded treewidth. The question of finding classes of patterns which can be matched in polynomial time but do not have bounded treewidth seemed interesting to us; we show a natural construction of such patterns in Sect. 6.

We continue this survey with a result showing that considering some of the structural parameters that lead to efficient pattern matching algorithms as general structural parameters of strings may lead to remarkable results in other, apparently unrelated, domains. We show in Sect. 7 how our results for strings can be used to obtain a state-of-the-art approximation algorithm for computing the cutwidth of graphs.

We conclude the survey with a series of extensions. We discuss the problem of injective pattern matching as well as the satisfiability problem for word equations with restricted form.

2 Basic Definitions

For detailed definitions regarding combinatorics on words we refer to [62].

We denote our alphabet by \(\varSigma \), the empty word by \(\varepsilon \), the set of all non-empty words over \(\varSigma \) by \(\varSigma ^+\), the set of all words over \(\varSigma \) by \(\varSigma ^*\), and the length of a word w by |w|. \((\varSigma ^*,\cdot ,\varepsilon )\) is the free monoid over \(\varSigma \) with concatenation as its binary operation, written \(\cdot \). For \(w\in \varSigma ^*\) and all integers i, j with \(1 \le i\le j \le |w|\), let \(w[i..j] = w[i] \cdots w[j]\), where w[k] denotes the letter at position k, for \(1\le k\le |w|\). A period of w is any positive integer p for which \(w[i]=w[i+p]\), for all defined positions. Moreover, in this case, w is said to be p-periodic. The minimal period of w, denoted by per(w), is the smallest period of w. For example, \(w = \mathtt {a}\mathtt {b}\mathtt {a}\mathtt {c}\mathtt {a}\mathtt {b}\mathtt {a}\mathtt {c}\mathtt {a}\mathtt {b}\mathtt {a}\mathtt {c}\mathtt {a}\mathtt {b}\mathtt {a}\mathtt {c}\mathtt {a}\mathtt {b}\) has periods 8 and 4; in particular, \(per(w) = 4\). A word w is called periodic if \(per(w)\le \frac{|w|}{2}\).
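For concreteness, per(w) can be computed in linear time via the classical border (failure-function) array, using the fact that per(w) equals |w| minus the length of the longest border of w; the following Python sketch (our own illustration) does exactly this.

```python
# Sketch: per(w) = |w| - (length of the longest border of w).
def min_period(w):
    n = len(w)
    border = [0] * (n + 1)  # border[i] = longest border length of w[:i]
    k = 0
    for i in range(1, n):
        while k > 0 and w[i] != w[k]:
            k = border[k]
        if w[i] == w[k]:
            k += 1
        border[i + 1] = k
    return n - border[n]

w = "abacabacabacabacab"
print(min_period(w))                 # 4
print(min_period(w) <= len(w) // 2)  # True, i.e., w is periodic
```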

The concatenation of k words \(w_1, w_2, \ldots ,w_k\) is written \(\varPi _{i=1,k}w_i\). If \(w=w_i\) for all integers i with \(1\le i\le k\), this represents the kth power of w, denoted by \(w^k\); here, w is a root of \(w^k\). We can further extend the notion of a power of a word by saying that \(w=w[1..per(w)]^{\frac{|w|}{per(w)}}\). We say that w is primitive if it cannot be expressed as a power of exponent \(\ell \) of any root, where \(\ell \) is an integer with \(\ell >1\). Conversely, if \(w=v^\ell \) for some integer \(\ell >1\), then w is also called a repetition. The infinite repetition \(vvv\cdots \) of some word v is denoted \(v^\omega \).

For any word \(w \in \varSigma ^+\) with \(w=xyz\), we say that y is a factor of w. If x is empty, then y is also a prefix of w, while when z is empty, then y is also a suffix. Whenever we have a factor both as a prefix and as a suffix, the factor is said to be a border of the word. Furthermore, every word \(u=yzx\in \varSigma ^+\) is a conjugate of w. Note that, if w is primitive, so is every conjugate of it. If \(w=vu\), then \(v^{-1}w=u\).

Let \(X = \{x_1, x_2, x_3, \ldots \}\) and call every \(x \in X\) a variable. For a finite alphabet \(\varSigma \) of terminals with \(\varSigma \cap X = \emptyset \), we define \(\mathsf {Pat}_{\varSigma } = (X \cup \varSigma )^+\) and \(\mathsf {Pat} = \bigcup _{\varSigma } \mathsf {Pat}_{\varSigma }\). Every \(\alpha \in \mathsf {Pat}_{\varSigma }\) is a pattern and every \(w \in \varSigma ^*\) is a (terminal) word. Given a word or a pattern v, for the smallest sets \(B\subseteq \varSigma \) and \(Y\subseteq X\) with \(v \in (B\cup Y)^*\), we denote \(\mathrm {alph}(v) = B\) and \(\mathrm {var}(v) = Y\). For any \(x\in \varSigma \cup X\) and \(\alpha \in \mathsf {Pat}_{\varSigma }\), \(|\alpha |_x\) denotes the number of occurrences of x in \(\alpha \); for the sake of convenience, we set \(|\alpha |_x = 0\) for every symbol x not in \(\varSigma \cup X\). For a pattern \(\alpha \), we say that \(w=\alpha [i..i+|w|-1]\) is a maximal terminal factor of \(\alpha \) if \(w \in \varSigma ^+\) and \(\alpha [i-1]\) and \(\alpha [i+|w|]\) are either not defined, or are variables.

A substitution (for \(\alpha \)) is a mapping \(h:\mathrm {var}(\alpha ) \rightarrow \varSigma ^*\). For every \(x \in \mathrm {var}(\alpha )\), we say that x is substituted by h(x) and \(h(\alpha )\) denotes the word obtained by substituting every occurrence of a variable x in \(\alpha \) by h(x) and leaving the terminals unchanged. We say that the pattern \(\alpha \) matches \(w\in \varSigma ^+\) if \(h(\alpha )=w\) for some substitution h. Substitutions of the form \(h:\mathrm {var}(\alpha ) \rightarrow \varSigma ^+\), i. e., where the empty word is excluded from the range of the substitution, are also called non-erasing; in order to emphasize that the substitution by the empty word is allowed, we also use the term erasing substitution.

Example 1

Let \(\beta = x_1 \mathtt {a}x_2 \mathtt {b}x_2 x_1 x_2\) be a pattern and let \(u = \mathtt {b}\mathtt {a}\mathtt {c}\mathtt {b}\mathtt {a}\mathtt {b}\mathtt {b}\mathtt {b}\mathtt {b}\mathtt {a}\mathtt {c}\mathtt {b}\mathtt {b}\) and \(v = \mathtt {a}\mathtt {b}\mathtt {a}\mathtt {a}\mathtt {b}\mathtt {b}\mathtt {a}\mathtt {b}\mathtt {a}\mathtt {b}\mathtt {a}\mathtt {b}\) be terminal words. The pattern \(\beta \) matches both u and v, witnessed by the substitutions h with \(h(x_1) = \mathtt {b}\mathtt {a}\mathtt {c}\mathtt {b}\), \(h(x_2) = \mathtt {b}\) and g with \(g(x_1) = g(x_2) = \mathtt {a}\mathtt {b}\), respectively. Moreover, \(\beta \) also matches the word \(w = \mathtt {a}\mathtt {c}\mathtt {b}\mathtt {b}\mathtt {c}\mathtt {b}\mathtt {c}\mathtt {b}\) by the erasing substitution h with \(h(x_1) = \varepsilon \), \(h(x_2) = \mathtt {c}\mathtt {b}\); it can be easily verified that there is no non-erasing substitution that maps \(\beta \) to w.
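For illustration, substitutions are straightforward to apply mechanically. The following minimal Python sketch (our own illustration; it assumes that patterns are represented as lists whose entries are integers for variables and one-character strings for terminals) applies a substitution and verifies the three matches of Example 1.

```python
# Minimal sketch: variables are ints, terminals are one-char strings.
def apply_substitution(pattern, h):
    """Replace every variable x by h[x]; terminals stay unchanged."""
    return "".join(h[s] if isinstance(s, int) else s for s in pattern)

beta = [1, "a", 2, "b", 2, 1, 2]  # x1 a x2 b x2 x1 x2
assert apply_substitution(beta, {1: "bacb", 2: "b"}) == "bacbabbbbacbb"
assert apply_substitution(beta, {1: "ab", 2: "ab"}) == "abaabbababab"
assert apply_substitution(beta, {1: "", 2: "cb"}) == "acbbcbcb"  # erasing
```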

The matching problem, denoted by \(\mathtt {Match}\), is to decide for a given pattern \(\alpha \) and word w whether there exists a substitution h with \(h(\alpha ) = w\). The variant where we are only concerned with non-erasing substitutions is called the non-erasing case of the matching problem; we also use the term erasing case in order to emphasize that substitution by the empty word is allowed. Another special variant is the terminal-free case of the matching problem, where the input patterns are terminal-free, i. e., they do not contain any occurrences of terminal symbols. We shall briefly discuss some particularities of these different special cases of the matching problem in Sect. 3. Note that in the sections on efficient algorithms, namely Sects. 5.1, 5.2, and 6, we only consider the non-erasing case (with terminal symbols) of the matching problem. The presented results can easily be generalised to the general setting, but we prefer the respective framework for the ease of the presentation.

For any \(P \subseteq \mathsf {Pat}\), the matching problem for P (or \(\mathtt {Match}(P)\), for short) is the matching problem where the input patterns are from P. In the following sections we will introduce and discuss several interesting families of patterns.

As we discuss efficient algorithms, it is important to describe the computational model we use in this work. This is the standard unit-cost RAM with logarithmic word size. Also, all logarithms appearing in our time complexity evaluations are in base 2. For the sake of generality, we assume that whenever we are given as input a word \(w\in \varSigma ^*\) of length n, the symbols of w are in fact integers from \(\{1,2,\ldots ,n\}\) (i.e., \(\varSigma \subseteq \{1,2,\ldots ,n\}\)), and w is seen as a sequence of integers. This is a common assumption in the area of algorithmics on words (see, e.g., the discussion in [54]). Clearly, our algorithmic results hold canonically for constant alphabets, as well.

3 The Hardness of the Matching Problem

First, we recall that there are several different variants of the matching problem: the most general case (substitution by the empty word and occurrences of terminals in the patterns are possible), the non-erasing case (with terminal symbols), the terminal-free (erasing) case, and finally the terminal-free non-erasing case. As we shall see, these differences do not matter too much if we are only concerned with the matching problem of patterns. However, in other contexts of patterns with variables (e. g., other decision problems, learning theory), these differences are most crucial and we therefore briefly provide some background.

For the class of the so-called pattern languages, i. e., the sets of all words that match a pattern, the difference between the erasing and the non-erasing case is important, since these classes of formal languages differ quite substantially with respect to basic decision problems. For example, in the non-erasing case, two patterns describe the same language if and only if the patterns are identical (up to a renaming of variables), while it is open whether the equivalence problem is even decidable in the erasing-case (see, e. g., Sect. 6 in [70], or [76]). Moreover, the inclusion problem, which is undecidable for both the erasing and the non-erasing case (see [36, 52]), can be decided for terminal-free patterns in the erasing case, while for terminal-free non-erasing patterns the decidability status is open (intuitively speaking, this has to do with the fact that avoidability questions of the form “does pattern \(\beta \) necessarily occur in long enough words over a k-letter alphabet?” can be expressed as inclusion for two languages given by terminal-free non-erasing patterns). Finally, also whether patterns (or descriptive patterns) can be inferred from positive data strongly depends on whether the erasing or non-erasing case is considered, or whether or not terminal symbols in the patterns are allowed (see [37, 38, 75, 77]).

For the matching problem (note that this corresponds to the membership problem for pattern languages), whether we consider erasing or non-erasing substitutions, or whether or not we disallow terminal symbols in the patterns, has little impact on its computational hardness. In fact, that the matching problem for patterns with variables is \(\mathsf {NP}\)-complete has been independently discovered in different communities and for slightly different problem variants (see, e. g., the introductions of [28, 30] for some remarks on the history of the investigation of the matching problem).

If we consider the most general case, i. e., erasing substitutions and terminals in the patterns, then a hardness-reduction is rather simple. For example, the Boolean formula

$$((v_1, v_2, v_3), (v_2, v_4, v_5), (v_3, v_1, v_3), (v_4, v_1, v_2))$$

in 3-CNF (without negated variables) is 1-in-3 satisfiable (i. e., satisfiable with exactly one literal per clause set to true) if and only if the following word w is matched by the pattern \(\alpha \):

$$\alpha = \mathtt {b}\,x_1 x_2 x_3\,\mathtt {b}\,x_2 x_4 x_5\,\mathtt {b}\,x_3 x_1 x_3\,\mathtt {b}\,x_4 x_1 x_2\,\mathtt {b}\,, \qquad w = \mathtt {b}\,\mathtt {a}\,\mathtt {b}\,\mathtt {a}\,\mathtt {b}\,\mathtt {a}\,\mathtt {b}\,\mathtt {a}\,\mathtt {b}\,.$$

Here, every Boolean variable \(v_i\) of the formula corresponds to the pattern variable \(x_i\), and the occurrences of \(\mathtt {b}\) force every clause segment of \(\alpha \) to match exactly the single \(\mathtt {a}\) between two consecutive occurrences of \(\mathtt {b}\) in w; thus, a matching substitution must map, for every clause, exactly one of its variables to \(\mathtt {a}\) and the remaining ones to \(\varepsilon \), which corresponds precisely to a 1-in-3 satisfying assignment.

We can further observe that this simple reduction also shows that the matching problem is hard even for binary terminal alphabets and under the restriction that variables are substituted by single symbols (or the empty word) only. This directly raises the question under which restrictions the matching problem remains hard. For example, a problem instance has a large number of natural parameters (length of the pattern, length of the word, number of variables, number of occurrences per variable, alphabet size, length of the words substituted for variables) and, in addition to that, it comes in four natural variants resulting from whether we consider the erasing or non-erasing case, and whether or not we allow terminals in the pattern. In the above reduction, the number of variables, the number of occurrences per variable and the word length are unbounded.

All these numerous restricted problem variants have been thoroughly investigated in [29] and it turns out that the matching problem remains \(\mathsf {NP}\)-hard under rather strong restrictions. We cite the following result as an example and refer to [29] for further details.

Theorem 1

([29]). The erasing case of the matching problem for patterns with variables is \(\mathsf {NP}\)-complete, even if \(\varSigma = \{\mathtt {a}, \mathtt {b}\}\), every variable has at most 2 occurrences and every variable can only be substituted by a single symbol or the empty word.

This result also holds as stated for terminal-free patterns. In the non-erasing case, however, it holds when the bound on the length of the substitution words is 3 instead of 1, and in the non-erasing and terminal-free case the result holds when additionally the bounds on the number of occurrences per variable and on the alphabet size are 3 and 4, respectively.

The only polynomial-time solvable cases of the matching problem obtained by restricting the numerical parameters mentioned above are trivial ones. More precisely, the matching problem can be easily solved for unary alphabets (in this case, we only have to solve a linear equation over the integers, with coefficients given in unary encoding), or if every variable has only one occurrence (the patterns are then regular, see Sect. 4), or if the number of variables or the length of the input word is bounded by a constant (the former is obvious, while the latter, in the erasing case, requires a slightly more careful argument [45]).

In particular, this also points out that Theorem 1 describes some kind of dichotomy: if we further restricted the alphabet size or the maximum number of occurrences per variable to 1, then we would obtain a polynomial-time solvable variant (even if all other parameters are unrestricted); similarly, if we allow variables to be substituted by single symbols only, but not by the empty word, then the matching problem becomes efficiently solvable as well (regardless of the alphabet size).

Generally, by brute-force algorithms, the matching problem can be solved in time \(O(|w|^{|\mathrm {var}(\alpha )|}\,(|\alpha | + |w|))\) (e. g., by enumerating all possible combinations of lengths for the variable images and verifying the induced substitution), making it polynomial-time solvable provided that there is a constant upper bound on |w| or \(|\alpha |\) (in fact, a bound on \(|\mathrm {var}(\alpha )|\) is sufficient). However, this constant upper bound occurs in the exponent, which means that even for rather low such bounds, say 7, the corresponding polynomial-time algorithms are most likely impractical for larger problem instances. This leads to the question whether exponential-time algorithms are possible whose running-times are such that the exponential part exclusively depends on, say, \(|\mathrm {var}(\alpha )|\), but not on |w|, i. e., running-times of the form \(f(|\mathrm {var}(\alpha )|) \cdot g(|\alpha |, |w|)\), where g is a polynomial and f is some computable function (exponential, or even double-exponential etc.). Such a running-time is polynomial for upper bounded \(|\mathrm {var}(\alpha )|\), but the degree of the polynomial is always the same independent from the actual upper bound. If a problem has an algorithm with such a running-time, then it is called fixed-parameter tractable (with respect to the bounded parameter); see the textbooks [23, 32] for more information on parameterised complexity. Whether the matching problem for patterns with variables allows fixed-parameter tractability for some parameters has been thoroughly investigated in [31]. Although there are some more or less trivial cases of fixed-parameter tractability, the main insight provided by [31] is of a negative nature and can be summarised in the following way.

Theorem 2

([31]). All variants of the matching problem parameterised by \(|\alpha |\) are \(\mathsf {W[1]}\)-hard. The erasing case of the matching problem parameterised by |w| is \(\mathsf {W[1]}\)-hard.

Note that since \(|\alpha |\) and |w| are rather general parameters, this result covers other parameters as well, e. g., \(|\varSigma |\) or \(|\mathrm {var}(\alpha )|\). In the non-erasing case, |w| is an upper bound for \(|\alpha |\); thus, treating |w| as a parameter means that \(|\alpha |\) is also a parameter and therefore the matching problem is fixed-parameter tractable by the obvious brute-force algorithm. We refer to [31] for further such simple fixed-parameter tractable cases.
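To make the obvious brute-force algorithm mentioned above concrete: it suffices to enumerate all combinations of lengths for the variable images, since fixing these lengths aligns the pattern with the word and thereby determines a unique candidate substitution. The following naive Python sketch (our own illustration, reusing the pattern representation from Sect. 2) runs in time exponential only in the number of variables.

```python
# Naive brute force: guess a length for every variable image, then verify
# the alignment that these lengths induce.
from itertools import product

def match(pattern, w, nonerasing=False):
    variables = sorted({s for s in pattern if isinstance(s, int)})
    lo = 1 if nonerasing else 0
    for lengths in product(range(lo, len(w) + 1), repeat=len(variables)):
        ell = dict(zip(variables, lengths))
        img, pos, ok = {}, 0, True
        for s in pattern:
            piece = w[pos:pos + ell[s]] if isinstance(s, int) else s
            if isinstance(s, int) and img.setdefault(s, piece) != piece:
                ok = False
                break
            if not w.startswith(piece, pos):
                ok = False
                break
            pos += len(piece)
        if ok and pos == len(w):
            return img  # a witnessing substitution
    return None

# beta from Example 1 matches u in the non-erasing case:
print(match([1, "a", 2, "b", 2, 1, 2], "bacbabbbbacbb", nonerasing=True))
```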

Consequently, even strong restrictions of the obvious numerical parameters of instances of the matching problem, i. e., number of variables, alphabet size, occurrences per variable etc., do not yield interesting efficiently matchable subclasses of patterns with variables. However, as discussed in the next section, looking deeper into the structure of patterns will help.

4 Structural Restrictions for Patterns

From an intuitive point of view it is clear that not only the mere length of a pattern or the number of its variables should have an impact on the matching complexity, but also the actual order of the variables. For example, it has been observed rather early in [81] that if the variable occurrences in the patterns are sorted, e. g., as in \(x_1 \mathtt {a}x_1 x_2 x_2 \mathtt {a}\mathtt {b}x_2 x_2 \mathtt {a}x_3 x_4 \mathtt {c}x_4\), then they can be matched efficiently “from-left-to-right” (more precisely, it is observed in [81] that matching such patterns can be done in logarithmic space).

A systematic investigation of such structural restrictions has been done in the last decade and numerous efficiently matchable subclasses of patterns have been found. In the following, we first present a unifying approach based on graph morphisms and the concept of treewidth. Then, we define and summarise several structural parameters for patterns and respective subclasses of patterns.

4.1 Pattern Matching by Graph Morphisms

The following general framework for matching patterns with variables has been developed in [78]. For a pattern \(\alpha \in \mathsf {Pat}_{\varSigma }\), the standard graph representation of \(\alpha \) is the undirected graph \(G_{\alpha } = (V_{\alpha }, E_{\alpha })\), where \(V_{\alpha } = \{1, 2, \ldots , |\alpha |\}\) and \(E_{\alpha } = E^{\mathrm {n}}_{\alpha } \cup E^{\mathrm {e}}_{\alpha }\), with \(E^{\mathrm {n}}_{\alpha } = \{\{i, i+1\} \mid 1 \le i \le |\alpha |-1\}\) being the set of neighbour edges and \(E^{\mathrm {e}}_{\alpha } = \{\{i, j\} \mid i < j,\ \alpha [i] = \alpha [j] \in X,\ |\alpha [i+1..j-1]|_{\alpha [i]} = 0\}\) being the set of equality edges (i. e., the equality edges link the consecutive occurrences of each variable; see Fig. 1 for an illustration).

Fig. 1. The standard graph representation for \(\alpha = x_1 x_2 x_3 \mathtt {b}\mathtt {b}x_2 x_1 \mathtt {a}x_2 x_3 x_2 \mathtt {c}x_1\); the dashed, straight and dotted equality edges correspond to occurrences of \(x_1\), \(x_2\) and \(x_3\), respectively; the grey vertices correspond to occurrences of terminal symbols.
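The construction of the standard graph representation is easy to make explicit; the following Python sketch (our own illustration, same pattern representation as before) builds the vertex set and both edge sets, with equality edges linking consecutive occurrences of each variable.

```python
# Sketch: neighbour edges chain the positions 1, ..., |alpha|; equality
# edges link consecutive occurrences of the same variable.
def standard_graph(pattern):
    V = list(range(1, len(pattern) + 1))
    neighbour = [(i, i + 1) for i in V[:-1]]
    last, equality = {}, []
    for i, s in enumerate(pattern, start=1):
        if isinstance(s, int):              # a variable occurrence
            if s in last:
                equality.append((last[s], i))
            last[s] = i
    return V, neighbour, equality

# alpha = x1 x2 x3 b b x2 x1 a x2 x3 x2 c x1, as in Fig. 1
alpha = [1, 2, 3, "b", "b", 2, 1, "a", 2, 3, 2, "c", 1]
print(standard_graph(alpha))
```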

In a similar way, we can also encode words \(w \in \varSigma ^*\) as graph structures \(G_w\), where every factor w[i..j] of w is represented by a vertex (i, j), equality edges are drawn between (i, j) and \((i', j')\) if \(w[i..j] = w[i'..j']\), and neighbour edges if \(j + 1 = i'\). It has been shown in [78] that \(\alpha \) matches w if and only if there is a graph morphism from \(G_{\alpha }\) to \(G_w\). Moreover, the concept of the treewidth for graphs now also applies to patterns (i. e., the treewidth of a pattern is the treewidth of its standard graph representation), which is of relevance since the graph morphism problem can be solved in polynomial time provided that the source graphs have bounded treewidth. Consequently, we can conclude the following algorithmic meta-theorem.

Theorem 3

([78]). If a class P of patterns has bounded treewidth, then the matching problem for P can be solved in polynomial-time.

Due to the generality of the statement of Theorem 3, the polynomial-time matching algorithm that it implies is of little practical value, even for rather simple classes of patterns. On the other hand, its theoretical relevance is demonstrated by the fact that it covers almost all known classes of patterns with a polynomial-time matching problem. After an additional remark regarding [78], we shall briefly define and compare those efficiently matchable classes of patterns in the next subsection.

Remark 1

Technically, the matching problem reduces to the morphism problem for (simple) relational structures instead of undirected graphs. However, since we are here only interested in the treewidth of these structures, we can as well only talk about the underlying undirected graphs.

Moreover, the actual meta-theorem of [78] is stronger in the sense that there the treewidth of patterns is not defined with respect to the standard graph representation, but with respect to a slightly more general graph representation (i. e., we allow any way of drawing the equality edges as long as all vertices corresponding to the same variable form a connected component).

4.2 Efficiently Matchable Classes of Patterns

The most obvious way to restrict patterns is to limit their number of (repeated) variables or the number of occurrences per variable. In this regard, let \(\mathsf {Var}(k)\) and \(\mathsf {RVar}(k)\) be the class of patterns with at most k variables and with at most k repeated variables, respectively (a variable is repeated if it has at least two occurrences). Due to Theorem 1, we already know that bounding the number of occurrences per variable does not in general yield polynomial-time matchable classes. The only exception are patterns with at most one occurrence per variable, which are called regular patterns and are denoted by \(\mathsf {Reg}\); e. g., \(x_1 \mathtt {a}x_2 \mathtt {b}\mathtt {a}\mathtt {c}x_3 \mathtt {a}\) is a regular pattern. Regular patterns have first been considered in [81] and their name is motivated by the fact that the corresponding pattern languages are regular languages.

Next, we define the so-called scope coincidence degree (see [78]). For every \(y \in \mathrm {var}(\alpha )\), the scope of y in \(\alpha \) is defined by \(\mathrm {sc}_{\alpha }(y) = \{i, i+1, \ldots , j\}\), where i is the leftmost and j the rightmost occurrence of y in \(\alpha \). The scopes of variables \(y_1, y_2, \ldots , y_{\ell }\) coincide in \(\alpha \) if \(\bigcap _{i=1}^{\ell } \mathrm {sc}_{\alpha }(y_i) \ne \emptyset \). By \(\mathrm {scd}(\alpha )\), we denote the scope coincidence degree of \(\alpha \), which is the maximum number of variables in \(\alpha \) such that their scopes coincide, and by \(\mathsf {SCD}(k)\), we denote the class of patterns with scope coincidence degree of at most k. See Fig. 2 for an example of the scope coincidence degree. An important special class is \(\mathsf {SCD}(1)\), which has first been introduced in [81] as the class of non-cross patterns (denoted by \(\mathsf {NonCross}\)). Intuitively speaking, the variables in non-cross patterns are sorted, e. g., \(x_1 \mathtt {a}x_1 x_2 \mathtt {b}\mathtt {a}x_2 \mathtt {c}x_3 \mathtt {a}x_3 x_3\).

Fig. 2. Two patterns \(\alpha _1\) and \(\alpha _2\) and their scope coincidence degrees; the scopes of the variables \(x_1\) (dashed line), \(x_2\) (straight line) and \(x_3\) (dotted line) are highlighted.
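Since scopes are intervals of positions, the scope coincidence degree is simply the maximum number of scopes that share a common position, which yields the following direct Python sketch (our own illustration).

```python
# Sketch: scd(alpha) = max over positions of the number of variable
# scopes (intervals between leftmost and rightmost occurrence) covering it.
def scd(pattern):
    first, last = {}, {}
    for i, s in enumerate(pattern):
        if isinstance(s, int):
            first.setdefault(s, i)
            last[s] = i
    if not first:
        return 0
    return max(sum(1 for x in first if first[x] <= i <= last[x])
               for i in range(len(pattern)))

print(scd([1, "a", 1, 2, "b", "a", 2, "c", 3, "a", 3, 3]))  # 1: non-cross
print(scd([1, 2, 1, 2, 3, 3]))                              # 2
```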

Next, we define the locality number, which is a general string-parameter, and which has been first introduced in [15]. A word is k-local if there exists an order of its symbols such that, if we mark the symbols in the respective order (which is called a marking sequence), at each stage there are at most k contiguous blocks of marked symbols in the word. This k is called the marking number of that marking sequence. The locality number of a word is the smallest k for which that word is k-local, or, in other words, the minimum marking number over all marking sequences. For example, the marking sequence \(\sigma = (\mathtt {a}, \mathtt {g}, \mathtt {c})\) marks \(w = \mathtt {a}\mathtt {g}\mathtt {a}\mathtt {g}\mathtt {c}\mathtt {a}\mathtt {c}\) as follows (marked blocks are illustrated by overlines): \(\overline{\mathtt {a}} \mathtt {g}\overline{\mathtt {a}} \mathtt {g}\mathtt {c}\overline{\mathtt {a}} \mathtt {c}\), \(\overline{\mathtt {a}\mathtt {g}\mathtt {a}\mathtt {g}} \mathtt {c}\overline{\mathtt {a}} \mathtt {c}\), \(\overline{\mathtt {a}\mathtt {g}\mathtt {a}\mathtt {g}\mathtt {c}\mathtt {a}\mathtt {c}}\); thus, the marking number of \(\sigma \) is 3. In fact, all marking sequences for w have a marking number of 3, except \((\mathtt {g}, \mathtt {a}, \mathtt {c})\), for which it is 2: \(\mathtt {a}\overline{\mathtt {g}} \mathtt {a}\overline{\mathtt {g}} \mathtt {c}\mathtt {a}\mathtt {c}\), \(\overline{\mathtt {a}\mathtt {g}\mathtt {a}\mathtt {g}} \mathtt {c}\overline{\mathtt {a}} \mathtt {c}\), \(\overline{\mathtt {a}\mathtt {g}\mathtt {a}\mathtt {g}\mathtt {c}\mathtt {a}\mathtt {c}}\). Thus, the locality number of w, denoted by \(\mathrm {loc}(w)\), is 2. When we measure the locality number for patterns, we simply ignore all terminal symbols, e. g., \(\mathrm {loc}(x_1 \mathtt {a}\mathtt {b}x_2 x_1 \mathtt {c}x_2) = \mathrm {loc}(x_1 x_2 x_1 x_2) = 2\). The class of patterns with locality number at most k is denoted by \(\mathsf {Loc}(k)\).
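A brute-force computation of the locality number follows the definition directly: try every marking sequence and record the maximum number of marked blocks it creates. The following Python sketch (our own illustration; exponential in the alphabet size, in line with the hardness results discussed in Sect. 4.3) does this.

```python
# Brute force over all marking sequences; blocks() counts the maximal
# contiguous blocks of marked symbols.
from itertools import permutations

def blocks(word, marked):
    return sum(1 for i, s in enumerate(word)
               if s in marked and (i == 0 or word[i - 1] not in marked))

def locality(word):
    best = len(word)
    for seq in permutations(set(word)):      # all marking sequences
        marked, worst = set(), 0
        for s in seq:
            marked.add(s)
            worst = max(worst, blocks(word, marked))
        best = min(best, worst)
    return best

print(locality("agagcac"))     # 2, witnessed by the sequence (g, a, c)
print(locality([1, 2, 1, 2]))  # 2, for the pattern x1 x2 x1 x2
```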

The next classes have first been considered in [78] and are based on possible nesting structures of variables. For a pattern \(\alpha \), we call two variables x and y entwined if \(\alpha \) contains xyxy or yxyx as a subsequence. A pattern \(\alpha \) is nested if no two variables in \(\alpha \) are entwined; the class of nested patterns is denoted by \(\mathsf {Nest}\). A proper subclass of \(\mathsf {Nest}\), considered in [15], are the so-called strongly nested patterns (denoted by \(\mathsf {SNest}\)), which are inductively defined as follows: any pattern \(\beta \in (\{x\} \cup \varSigma )^*\), where x is a variable, is strongly nested; if \(\alpha _1\) and \(\alpha _2\) are strongly nested and variable-disjoint patterns, x is a variable not in \(\mathrm {var}(\alpha _1) \cup \mathrm {var}(\alpha _2)\) and \(\beta _1, \beta _2 \in (\{x\} \cup \varSigma )^*\), then \(\alpha _1 \alpha _2\) and \(\beta _1 \alpha _1 \beta _2\) are strongly nested patterns. For example, the pattern \(\alpha = x_1 x_2 \mathtt {a}x_2 x_1 \mathtt {b}x_3 x_4 \mathtt {a}x_3\) is strongly nested, whereas \(\alpha x_1\) is nested, but not strongly nested anymore.

If, for every \(x, y \in \mathrm {var}(\alpha )\), \(\alpha = \beta x \gamma _1 y \gamma _2 x \gamma _3 y \delta \) implies \(\gamma _2 = \varepsilon \), then \(\alpha \) is called closely entwined, and a pattern \(\alpha \) is mildly entwined if it is closely entwined and, for every \(x \in \mathrm {var}(\alpha )\), if \(\alpha = \beta x \gamma x \delta \) with \(|\gamma |_x = 0\), then \(\gamma \) is nested. We denote the class of mildly entwined patterns by \(\mathsf {MildEntw}\). The main motivation for the somewhat peculiar class of mildly entwined patterns is that mildly entwined patterns are exactly those patterns that have a standard graph representation that is outer-planar (see [78]). It is known that outer-planar graphs have a rather low treewidth of at most 2. Since the concept of outer-planarity generalises to k-outer-planarity and k-outer-planar graphs have a treewidth of at most \(3k-1\), we can also define the classes of k-outer-planar patterns (i. e., their standard graph representation is k-outer-planar). In this regard, note that \(\mathsf {MildEntw}\) coincides with the class of 1-outer-planar patterns. See Fig. 3 for an example of a mildly entwined pattern.

Fig. 3. The standard graph representation for \(\alpha = x_1 x_3 x_4 x_3 x_1 x_2 x_3 x_5 \mathtt {b}x_5 x_2 x_5 x_6 \mathtt {a}x_6 x_2\). By definition, \(\alpha \) is mildly entwined. Furthermore, since no vertex is completely “surrounded” by edges, the shown embedding is outer-planar.

It can be easily verified that all of the pattern classes defined above have bounded treewidth; thus, by application of Theorem 3, they can be matched efficiently. For some of them this upper bound on the treewidth is rather low (e. g., \(\mathsf {Reg}\), \(\mathsf {NonCross}\), \(\mathsf {MildEntw}\)), while for those classes obtained by bounding a structural parameter, e. g., \(\mathsf {Var}(k)\), \(\mathsf {SCD}(k)\), \(\mathsf {Loc}(k)\), the bound on the treewidth also grows with this parameter. Figure 4 shows how these pattern classes relate to each other and how they form infinite hierarchies within the class of all patterns (denoted by \(\mathsf {Pat}\)).

Fig. 4. An overview of efficiently matchable classes of patterns. By \(A \rightarrow B\), we denote \(A \subset B\); pairs without arrow are incomparable. Note that \(\mathsf {NonCross} = \mathsf {SCD}(1)\).

In a sense, Fig. 4 is a “tractability map” for the matching problem of patterns with variables. For the classes that have low treewidth, we can expect matching algorithms that are rather efficient. On the other hand, these classes are quite restricted (compared to the full class of patterns) and are most likely only applicable for very special pattern matching tasks. An obvious approach to matching general patterns would be to first perform a preprocessing that identifies a “low class” of the tractability map that contains the input pattern, and then to use the most efficient algorithm for matching it. In this regard, it is even an asset that most of the different efficiently matchable classes and hierarchies of classes are incomparable: it is possible that an input pattern has a very large locality number of, say, 100, but can nevertheless be matched efficiently because its standard graph representation is 2-outer-planar; on the other hand, a pattern could have a large scope coincidence degree and a large number of variables, but at the same time a very low locality number. It might even be a worthwhile research task to experimentally analyse a large corpus of (random) patterns with respect to the classes of the tractability map in which they are contained.

Remark 2

Bounding the structural parameters defined above yields polynomial-time matchable classes of patterns; thus, the question arises whether the matching problem is also fixed-parameter tractable with respect to those parameters. However, Theorem 2 already states that this is most likely not the case for the parameter \(|\alpha |\), and since \(|\alpha |\) is an upper bound for the number of repeated variables, the scope coincidence degree, the outer-planarity and the locality number of \(\alpha \), it is also highly unlikely that we can achieve fixed-parameter tractability with respect to those parameters.

4.3 Computing Structural Parameters for Patterns

Since the structural restrictions of patterns surveyed above are all meant to be exploited algorithmically, the task of checking them (or computing the respective parameters) is an important issue. In this regard, note that in general computing the treewidth of a graph is an \(\mathsf {NP}\)-hard problem and it is also not known whether it can be computed efficiently for standard graph representations of patterns. This also emphasises the importance of easily computable parameters that bound the treewidth of a pattern, and it points out that the value of Theorem 3 is mainly of a theoretical nature: it provides guidance in finding such restrictions with higher practical relevance. Restrictions like regularity, the non-cross condition, the number of (repeated) variables and the different nesting properties can be easily checked for. Moreover, also the scopes of a pattern and therefore its scope coincidence degree can be efficiently computed, and the smallest k for which a graph is k-outer-planar can also be computed in polynomial time (for more details see [79]). On the other hand, computing the locality number seems more difficult and it was left open in [15] whether or not it is hard to compute. This gap was closed in [13], where it was shown that computing the locality number is \(\mathsf {NP}\)-hard, but fixed-parameter tractable (if the locality number or \(|\varSigma |\) is considered a parameter); in addition, approximation of the locality number has also been investigated in [13] (note that these results will be discussed in more detail in Sect. 7).

5 Faster Pattern Matching

In this section we will overview some efficient matching algorithms developed for various classes of patterns, some already defined in the previous sections, and some defined via other natural structural restrictions. Most of the results presented here were shown in [15, 16, 26].

5.1 Patterns with Low Scope Coincidence Degree

We start with several definitions. The one-variable blocks in a pattern are maximal contiguous blocks of occurrences of the same variable. A pattern \(\alpha \) with m one-variable blocks can be written as \(\alpha =w_0\varPi _{i=1,m}(z_i^{k_i}w_i)\), where \(w_0, w_1, \ldots , w_m \in \varSigma ^*\), \(z_i \in \mathrm {var}(\alpha )\) and \(k_i \ge 1\) for \(i\in \{1,2,\ldots ,m\}\), and \(z_i\ne z_{i+1}\) whenever \(w_i=\varepsilon \), for \(i\in \{1,2,\ldots ,m-1\}\). The number of one-variable blocks is a natural complexity measure that we will consider.

Example 2

The pattern \(\alpha =x_1x_2x_2 \mathtt {a}x_2x_2x_2x_3 \mathtt {a}x_3x_2x_2x_3x_3\) has 7 one-variable blocks: \(x_1, x_2x_2, x_2x_2x_2, x_3, x_3, x_2 x_2, x_3 x_3\).
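The decomposition into one-variable blocks is easy to compute by grouping consecutive equal variables; a small Python sketch (our own illustration) reproducing Example 2:

```python
# Sketch: group consecutive equal symbols; keep only the variable groups.
from itertools import groupby

def one_variable_blocks(pattern):
    return ["".join(f"x{s}" for s in g)
            for s, g in groupby(pattern) if isinstance(s, int)]

# alpha = x1 x2 x2 a x2 x2 x2 x3 a x3 x2 x2 x3 x3, as in Example 2
alpha = [1, 2, 2, "a", 2, 2, 2, 3, "a", 3, 2, 2, 3, 3]
print(one_variable_blocks(alpha))  # 7 one-variable blocks
```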

As discussed in the previous sections, prominent subclasses of patterns for which the matching problem can be solved in polynomial time are the classes of patterns with a bounded number of (repeated) variables (\(\mathsf {Var}(k)\) and \(\mathsf {RVar}(k)\)), of regular patterns (\(\mathsf {Reg}\)), of non-cross patterns (\(\mathsf {NonCross}\)), and of patterns with a bounded scope coincidence degree (\(\mathsf {SCD}(k)\)). However, the known respective algorithms are rather poor considering their running times. For example, for \(\mathsf {Var}(k)\), the matching problem can be solved in time polynomial in m and n, but with k appearing in the exponent, where m and n are the lengths of the pattern and the word, respectively (see [47]). For patterns with a scope coincidence degree of at most k, a polynomial-time algorithm, with an exponent that grows linearly in k, can be derived using the general matching technique described by Theorem 3, and the proof that the matching problem for non-cross patterns is in \(\mathsf {P}\) (see [81]) leads to an algorithm of a rather high polynomial degree. Hence, for all these classes, we consider the following refinement of the problem of showing that the matching problem for a class of patterns is in \(\mathsf {P}\).

Problem 1

Let K be a class of patterns for which the matching problem can be solved in polynomial time. Find an efficient algorithm that solves the matching problem for K.

The main class of patterns considered in the following is that of patterns with a bounded scope coincidence degree, together with its subclasses.

If the scope coincidence degree is bounded by 1, i. e., for non-cross patterns, we can decide whether a pattern \(\alpha \) having m one-variable blocks matches a word w of length n in \(O(mn\log n)\) time. This result can be achieved via a general dynamic programming approach, which tries to match prefixes of the pattern \(\alpha \) to the prefixes of the word w. This general approach is rather standard, but the big gain is that it can be implemented efficiently by a detailed combinatorial analysis of the possible matches between the one-variable blocks occurring in \(\alpha \) and factors of w. For instance, if the shortest factor of \(\alpha \) containing all occurrences of a variable x starts with a one-variable block containing at least two occurrences of the variable x, we can efficiently find the matches of this factor by exploiting a major result from [14], which states that the primitively rooted squares contained in a word of length n can be listed optimally in \(O(n\log n)\) time. As each match for a factor starting with two occurrences of a variable starts with a primitively rooted square, the respective matches can be found efficiently. The result regarding primitively rooted squares can be extended to show that, given a word w of length n and a word v with length shorter than n, the word w contains \(O(n\log n)\) factors of the form uvu with uv primitive, and all these factors can be found optimally in \(O(n\log n)\) time. This allows us to efficiently find the matches in the case of variables x for which the shortest factor of \(\alpha \) containing all occurrences of x starts with xvx, for some non-empty terminal string v.

Theorem 4

([26]). The matching problem for \(\mathsf {NonCross}\) is solvable in \(O(mn\log n)\) time, where w is the input word of length n and m is the number of one-variable blocks occurring in the pattern.

Two particular subclasses of non-cross patterns are of interest: the regular patterns and the one-variable patterns (see also Fig. 4). It is not hard to show that regular patterns can be matched in linear time, by iteratively using the Knuth-Morris-Pratt algorithm to greedily identify the terminal factors occurring in the pattern, in their order of occurrence. All factors of a word w that match a given regular pattern \(\alpha \) can be detected in linear time, too.
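The greedy strategy for regular patterns can be sketched as follows (our own illustration; str.find stands in for the Knuth-Morris-Pratt routine, and the sketch assumes a non-erasing setting, at least one terminal factor and at least one variable in the pattern).

```python
# Greedy left-to-right matching of a regular pattern, given by its maximal
# terminal factors in order; lead/trail say whether the pattern starts/ends
# with a variable. E.g. x1 a x2 bac x3 a -> (["a", "bac", "a"], True, False).
def match_regular(factors, w, lead, trail):
    pos = 1 if lead else 0                   # a variable consumes >= 1 letter
    for k, f in enumerate(factors):
        gap = 1 if k > 0 else 0              # variable between two factors
        if not trail and k == len(factors) - 1:
            # a pattern-final terminal factor must be a suffix of w
            return w.endswith(f) and len(w) - len(f) >= pos + gap
        i = w.find(f, pos + gap)             # earliest occurrence is optimal
        if i < 0 or (k == 0 and not lead and i != 0):
            return False
        pos = i + len(f)
    return pos + 1 <= len(w)                 # trailing variable needs a letter

print(match_regular(["a", "bac", "a"], "xyazbaczza", True, False))  # True
```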

More interesting is the case of one-variable patterns. The simplest examples of one-variable patterns are the repetitions, i.e., patterns of the form \(x^k\). Checking whether a word is a match for a pattern \(x^k\) can be done in linear time. Moreover, a compact representation of all periodic factors of a word w can also be obtained in linear time by identifying the (at most |w|) so-called runs inside w [4]. With this, a compact representation of the occurrences of \(x^k\) in w can also be obtained in linear time. More complex one-variable patterns are the pseudo-repetitions (see [41, 43, 44] and the references therein). These are patterns from \(\{x,x^R\}^*\), where \(x^R\) is a variable that is always substituted by the reverse image of the string substituting x. Checking whether a string matches a given pseudo-repetition can be done in linear time [44]. The following general result can be shown for one-variable patterns, see [59]. Given a pattern \(\alpha = v_1xv_2x\cdots v_{r-1}xv_r\) such that x is a variable and \(v_1,v_2,\ldots ,v_r\) are terminal strings, a compact representation of all the P instances of \(\alpha \) in the input string w of length n can be computed in \(O(|\alpha |n)\) time, so that one can report those occurrences in O(P) time. The same result holds also for the case when some of the occurrences of x in such a pattern are replaced by \(x^R\). It is worth noting that using this algorithm inside our approach for matching non-cross patterns, namely to find the factors of the input word that match the shortest factor of \(\alpha \) containing all occurrences of a variable x, does not lead to a faster matching algorithm in that case.

When considering general patterns with a bounded scope coincidence degree, one can show, using a similar dynamic programming approach as in the case of non-cross patterns, that the matching problem for \(\mathsf {SCD}(k)\) is solvable in \(O(mn^{2k})\) time, where n is the length of the input word and m is, again, the number of one-variable blocks occurring in the pattern. One should note that in this case it seems hard to use the combinatorial insights used for non-cross patterns (thus, the \(\log n\) factor is replaced by an n factor in the evaluation of the time complexity), but, still, this algorithm is significantly faster than the previously known solution.

Theorem 5

([26]). The matching problem for \(\mathsf {SCD}(k)\) is solvable in \(O(mn^{2k})\) time, where w is the input word of length n and m is the number of one-variable blocks occurring in the pattern.

Next we consider the classes \(\mathsf {RVar}(k)\) of patterns with at most k repeated variables. For the basic case of \(k=1\), the matching problem can be solved in \(O(n^2)\) time, where n is the length of the input word. The idea of this algorithm is to guess the length \(\ell \) of the image of the repeated variable x, and then to partition the suffix array of the input word into clusters, such that all suffixes in a cluster start with the same factor of length \(\ell \). Essentially, in a match between the pattern and the word, where x is mapped to a factor of length \(\ell \), the positions where the factors matching x occur in the input word belong to the same cluster. Using this idea, the desired complexity is then reached, again via dynamic programming.

Theorem 6

([26]). The matching problem for \(\mathsf {RVar}(1)\) is solvable in quadratic time.

Further, one can use this result to show that the matching problem for the general class \(\mathsf {RVar}(k)\) is solvable in \(O(n^{2k})\) time. This algorithm is better than the one that could have been obtained by using the fact that patterns with at most k repeated variables have a scope coincidence degree bounded by \(k+1\), and then directly applying our previous algorithm solving the matching problem for \(\mathsf {SCD}(k+1)\).

Theorem 7

([26]). The matching problem for \(\mathsf {RVar}(k)\) is solvable in \(O(n^{2k})\) time, where n is the length of the input word.

Note that the classes of non-cross patterns and of patterns with a bounded scope coincidence degree or with a bounded number of repeated variables are of special interest, since for them we can compute so-called descriptive patterns (see [2, 81]) in polynomial time. A pattern \(\alpha \) is descriptive (with respect to, say, non-cross patterns) for a finite set S of words if it can generate all words in S and there exists no other non-cross pattern that describes the elements of S in a better way. Computing a descriptive pattern, which is \(\mathsf {NP}\)-complete in general, means to infer a pattern common to a finite set of words, with applications for inductive inference of pattern languages (see [71]). For example, our algorithm for matching non-cross patterns can be used in order to obtain an algorithm that computes a descriptive non-cross pattern in polynomial time, where the dominating factor of the running time depends on the length m of a shortest word of S (see [27] for details).

The algorithms, except the ones for the basic cases of regular and non-cross patterns and patterns with only one repeated variable, still have an exponential dependency on the number of repeated variables or the scope coincidence degree. Therefore, only for very low constant bounds on these parameters can these algorithms be considered efficient. Naturally, finding a polynomial time algorithm for which the degree of the polynomial does not depend on the number of repeated variables or on the scope coincidence degree would be desirable. However, by Remark 2 such algorithms are very unlikely.

Finally, we recall a result regarding gapped repeats and palindromes. A gapped repeat (palindrome) is an instance of the terminal-free pattern xyx (respectively, \(xyx^R\)). For \(\alpha \ge 1\), an \(\alpha \)-gapped repeat in a word w is a factor uvu of w such that \(|uv|\le \alpha |u|\); the two factors u in such a repeat are called arms, while the factor v is called gap. Such a repeat is called maximal if its arms cannot be extended simultaneously with the same symbol to the right or, respectively, to the left. In a sense, \(\alpha \)-gapped repeats are instances of the pattern xyx where length constraints are imposed on the strings that substitute x and y. In [42] it was shown that the number of maximal \(\alpha \)-gapped repeats that may occur in a word is upper bounded by \(18\alpha n\). Using this, an algorithm finding all the maximal \(\alpha \)-gapped repeats of a word in \(O(\alpha n)\) time was defined; this result is optimal, in the worst case, as there are words that have \(\varTheta (\alpha n)\) maximal \(\alpha \)-gapped repeats. Comparable results were developed for the case of \(\alpha \)-gapped palindromes, i.e., factors \(uvu^R\) with \(|uv|\le \alpha |u|\). On the one hand, these results were relevant as they provided optimal algorithms for the identification of \(\alpha \)-gapped repeats and palindromes, and closed an open problem from [57, 58] (see also [42] and the references therein for more on gapped repeats and palindromes). On the other hand, they point towards the study of the matching problem for patterns with (linear) length constraints on the images of the variables.

5.2 Patterns with Low Locality Number

Intuitively, the notion of k-locality (already introduced in Sect. 4.2) involves marking the variables in the pattern in some arbitrary order until all the variables are marked. The pattern is k-local if this marking can be done while never creating more than k marked blocks. Variables which only occur adjacent to those which are already marked can be marked “for free” – without creating any new blocks, and thus a valid marking sequence allows a sort-of parsing of the pattern whilst maintaining a degree of closeness (locality) to the parts already parsed. The notion of k-locality was introduced and further analysed in [15]. With respect to pattern matching, the main result proven in that paper is the following:

Theorem 8

([15]). \(\mathtt {Match}\) for \(\mathsf {Loc}(k)\) can be decided in \(O(mkn^{\max {(3k-1,2k+1)}})\) time, where m is the length of the input pattern and n is the length of the input word.

To solve the matching problem for \(\mathsf {Loc}(k)\) we use the following idea. Using a simple dynamic programming approach we can show that, given a pattern \(\beta \in (X\cup \varSigma )^*\) of length m, we can decide in \(O(m^{2k}k)\) time whether \(\beta \) is k-local, and if the answer is positive, we can produce in the same time a marking sequence witnessing that \(\beta \) is k-local. As such, we can keep track of the marked factors in the pattern, while executing the marking according to the computed marking sequence. We also need to keep track of the factors of the input word to which the marked factors correspond. Then we try to assign every new variable so that it fits nicely around the already matched factors. This is done efficiently using a data structure from [59], mentioned also above: given a word w and a one-variable pattern \(\gamma \) (so, \(|\mathrm {var}(\gamma )| = 1\)), one can produce a compact representation of all the g factors of w matching \(\gamma \) in \(O(|\gamma ||w|)\) time; moreover, we can obtain all the g factors of w matching \(\gamma \) in O(g) time. This allows us to test efficiently which factors of w match any of the one-variable blocks of \(\beta \), and, ultimately, to assign a value to each variable. In comparison to the algorithm from [78] for patterns of bounded treewidth, which firstly constructs relational structures from \(\alpha \) and w, and then solves the homomorphism problem on these relational structures (see Sect. 4.1), the above algorithm exploits directly the locality structure present in the patterns. The advantage of this more focussed approach is that it allows for a considerable improvement in the required time, reducing the exponent of n from \(4k+4\) to \(3k-1\).

6 Efficient Pattern Matching Beyond Bounded Treewidth

In [16] the authors tried to identify classes of patterns that do not have bounded treewidth but can still be matched in polynomial time. The idea behind defining such classes was relatively simple: consider generalised repetitions of patterns.

One simple observation is that, if we can match patterns from a class \(\mathcal {C}\) in polynomial time, then we can also match repetitions of these patterns in polynomial time: if we wish to check whether \(\alpha ^k\) matches a word w, where \(\alpha \) is chosen from the class \(\mathcal {C}\) for which the matching problem can be solved efficiently, then we can firstly check whether \( w =v^k\) for some word v, and then check whether \(\alpha \) matches v; so we can also match \(\alpha ^k\) efficiently. Moreover, it can be observed that most parameters that lead to efficiently matchable classes, e. g., the scope coincidence degree or the locality number, are defined independently from the terminal symbols, i. e., via the word obtained after removing all terminals, which shall be called skeleton in the following (e. g., the skeleton of \(x_1 \mathtt {a}x_2 \mathtt {b}\mathtt {a}x_1 x_2 \mathtt {b}\) is \(x_1 x_2 x_1 x_2\)). As a result, it is possible that a pattern that is not a repetition of any \(\alpha \in \mathcal {C}\) nevertheless has a skeleton that is a repetition of a skeleton from \(\mathcal {C}\). For example, \(\mathtt {a}x_1 (x_2)^3 x_3 \mathtt {b}x_3 x_1 (x_2)^2 \mathtt {b}x_2 \mathtt {a}(x_3)^2\) is not a repetition of a non-cross pattern, but its skeleton \((x_1 (x_2)^3 (x_3)^2)^2\) is. In [16] it is shown that, for some important classes \({\mathcal C}\) of patterns, including \(\mathsf {Reg}\) and \(\mathsf {SCD}(k)\), for constant k, the polynomial time solvability of the matching problem does not only extend from \(\mathcal {C}\) to exact repetitions, but also to such skeleton-repetitions, called \(\mathcal {C}\)-repetitions.
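Checking whether a pattern is such a skeleton-repetition is itself straightforward; a small Python sketch (our own illustration):

```python
# Sketch: the skeleton keeps only the variables; a C-repetition has a
# skeleton of the form gamma^k for the skeleton gamma of some pattern in C.
def skeleton(pattern):
    return tuple(s for s in pattern if isinstance(s, int))

def is_skeleton_power(pattern, k):
    """True iff skeleton(pattern) = gamma^k for some gamma."""
    sk = skeleton(pattern)
    if len(sk) == 0 or len(sk) % k != 0:
        return False
    p = len(sk) // k
    return all(sk[i] == sk[i % p] for i in range(len(sk)))

# a x1 (x2)^3 x3 b x3 x1 (x2)^2 b x2 a (x3)^2: skeleton (x1 (x2)^3 (x3)^2)^2
alpha = ["a", 1, 2, 2, 2, 3, "b", 3, 1, 2, 2, "b", 2, "a", 3, 3]
print(is_skeleton_power(alpha, 2))  # True
```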

Theorem 9

([16]). For \(\mathcal {C} \in \{\mathsf {Reg}, \mathsf {SCD}(k)\}\), where k is a constant, solving the matching problem for the class of \(\mathcal {C}\)-repetitions can be done in polynomial time.

It is interesting to note that the general treewidth-based framework of polynomial time matching of patterns does not seem to cover a very simple and natural aspect: repetitions of the same pattern. More precisely, if \(\mathcal {C}\) is one of the known efficiently matchable classes of patterns, then a repetition \(\alpha ^k\) for some \(\alpha \in \mathcal {C}\) is usually not in \(\mathcal {C}\) anymore. In fact, it can be shown that even for patterns \(\alpha \) with bounded and very low treewidth, the treewidth of repetitions \(\alpha ^k\) can be unbounded.

Theorem 10

([16]). Let \({\mathcal C}\) be a class of patterns that contains patterns with arbitrarily many different variables. Then the class of \({\mathcal C}\)-repetitions contains patterns with arbitrarily large treewidth.

In particular, the previous theorem holds for the class of regular patterns, arguably the simplest class allowing an unbounded number of variables (note that patterns with a constant number of variables can trivially be matched in polynomial time). In the same paper it is shown that if the notion of repetition is relaxed further, by considering a setting where the order in which the variables appear is no longer constrained at all (i.e., considering abelian repetitions instead of repetitions), then the matching problem is \(\mathsf {NP}\)-complete. This holds even in the minimal case when the number of repetitions is restricted to two and the repeated pattern is regular.

7 From Locality to Graph Parameters

Following the ideas of Sect. 3, we explore further the connection between string and graph parameters. The main idea behind such a connection is to reach it by “flattening” a graph into a sequential form, or by “inflating” a string into a graph, so that algorithmic techniques available for each one of these become applicable for the other one as well. In this section, following [13], we are concerned with certain structural parameters (and the problems of computing them) for graphs and strings: the cutwidth of a graph G (i. e., the maximum number of “stacked” edges if the vertices of a graph are drawn on a straight line), the pathwidth of a graph G (i. e., the minimum width of a tree decomposition the tree structure of which is a path), and the locality number of a string \(\alpha \) (explained in more detail in Sect. 4.2). By \(\textsc {Cutwidth}\), \(\textsc {Pathwidth}\) and \(\textsc {Loc}\), we denote the corresponding natural decision problems (i. e., decide whether a given graph has a pathwidth/cutwidth, or a given string has a locality number, of at most k, for a given k) and with the prefix \(\textsc {Min}\), we refer to the minimisation variants. The two former graph parameters are very classical. Pathwidth is a simple (yet still hard to compute) subvariant of treewidth, which measures how much a graph resembles a path. The problems \(\textsc {Pathwidth}\) and \(\textsc {MinPathwidth}\) are intensively studied (in terms of exact, parameterised and approximation algorithms) and have numerous applications (see the surveys and textbook [7, 9, 56]). \(\textsc {Cutwidth}\) is the best known example of a whole class of so-called graph layout problems (see the surveys [20, 73] for detailed information), which have been studied since the 1970s and were originally motivated by questions of circuit layouts.
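To make the cutwidth definition concrete: for a fixed layout of the vertices on a line, one counts, for every gap between consecutive vertices, the edges whose endpoints lie on opposite sides; the cutwidth of the graph is the minimum over all layouts of the maximum such count. A small Python sketch (our own illustration) computing the width of one given layout:

```python
# Sketch: the width of a linear arrangement is the maximum number of
# edges crossing a gap of the layout; the cutwidth of the graph minimises
# this quantity over all orderings of the vertices.
def width_of_arrangement(order, edges):
    pos = {v: i for i, v in enumerate(order)}
    return max(sum(1 for u, v in edges
                   if min(pos[u], pos[v]) < gap <= max(pos[u], pos[v]))
               for gap in range(1, len(order)))

# a 4-cycle laid out as 1 2 3 4: every gap is crossed by exactly 2 edges
print(width_of_arrangement([1, 2, 3, 4], [(1, 2), (2, 3), (3, 4), (4, 1)]))
```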

In comparison, the locality number seems a rather simple parameter, directly defined on strings; however, it bounds the treewidth of the string (in the sense defined in Sect. 4.1), and the corresponding marking sequences can be seen as instructions for a dynamic programming algorithm for matching the pattern. In this way, it resembles the way the pathwidth and treewidth of graphs are used in algorithmic settings. Moreover, compared to other “tractability-parameters” of strings, it seems to capture the treewidth of a string best, but, unlike the other, simpler parameters, it cannot be computed efficiently.

Going more into detail, for \(\textsc {Loc}\), exact exponential-time algorithms are not hard to devise [15], but whether it can be solved in polynomial time, or whether it is at least fixed-parameter tractable, was left open in the paper where this measure was introduced. On the other hand, \(\textsc {Pathwidth}\) and \(\textsc {Cutwidth}\) are known \(\mathsf {NP}\)-complete problems, fixed-parameter tractable with respect to the parameter pathwidth or cutwidth, respectively (even with “linear” fpt-algorithms, i. e., with running-time \(f(k) \cdot n\) for a computable function f [8, 10, 82]). With respect to approximation, their minimisation variants have received a lot of attention, mainly because they yield (like many other graph parameters) general algorithmic approaches for numerous graph problems, i. e., a good linear arrangement or path-decomposition can often be used to design a dynamic programming (or even divide and conquer) algorithm for other problems. The best known approximation algorithms for the problems \(\textsc {MinPathwidth}\) and \(\textsc {MinCutwidth}\) follow from approximations of vertex separators (see [25]) and edge separators (see [60]), respectively.

There are two natural approaches to represent a word \(\alpha \) over alphabet \(\varSigma \) as a graph \(G_{\alpha } = (V_{\alpha }, E_{\alpha })\): (1) \(V_{\alpha } = \{1, 2, \ldots , |\alpha |\}\) and the edges are somehow used to represent the actual symbols (note that this is the case for the standard graph representation of patterns defined in Sect. 4.1), or (2) \(V_{\alpha } = \varSigma \) and the edges are somehow used to represent the positions of \(\alpha \). A reduction of type (2) can be defined such that the locality number of \(\alpha \) and the cutwidth of \(G_{\alpha }\) are within a constant factor of each other, and a reduction of type (1) can be defined such that the locality number of \(\alpha \) and the pathwidth of \(G_{\alpha }\) are within a constant factor of each other. Since these reductions are parameterised reductions and also allow to transfer approximation results, one may conclude that \(\textsc {Loc}\) is fixed-parameter tractable if parameterised by \(|\varSigma |\) (note that for parameter \(|\varSigma |\) a simple, but less efficient fpt-algorithm is trivially obtained by simply enumerating all marking sequences) or by the locality number, and also that there is a polynomial-time approximation algorithm for \(\textsc {MinLoc}\) with the same ratio as the one known for \(\textsc {MinPathwidth}\).

In addition, one can represent an arbitrary multi-graph \(G = (V, E)\) by a word \(\alpha _G\) over alphabet V with \(|\alpha _G|=|E|\), such that the locality number of \(\alpha _G\) and the cutwidth of G are within a constant factor of each other. This describes a Turing-reduction from \(\textsc {Cutwidth}\) to \(\textsc {Loc}\) which also allows to transfer approximation results between the minimisation variants. As a result, \(\textsc {Loc}\) is \(\mathsf {NP}\)-complete. Finally, by plugging together the reductions from \(\textsc {MinCutwidth}\) to \(\textsc {MinLoc}\) and from \(\textsc {MinLoc}\) to \(\textsc {MinPathwidth}\), one obtains a reduction which transfers approximation results from \(\textsc {MinPathwidth}\) to \(\textsc {MinCutwidth}\), which yields an approximation algorithm for \(\textsc {MinCutwidth}\) with the ratio known for \(\textsc {MinPathwidth}\). This result from [13] improved, for the first time since 1999, the best approximation for \(\textsc {Cutwidth}\) from [60]. Interestingly, this improvement appeared as a side-product of relating string-parameters with graph-parameters.

Theorem 11

([13]). There is an \(\mathcal {O}(\sqrt{\log (\mathrm {opt})}\,\log h)\)-approximation algorithm (running in polynomial time) for \(\textsc {MinCutwidth}\) on multigraphs with h edges. In particular, this yields an \(\mathcal {O}(\sqrt{\log (\mathrm {opt})}\,\log n)\)-approximation algorithm for \(\textsc {MinCutwidth}\) for graphs.

Moreover, this approach also allows for establishing a direct connection between cutwidth and pathwidth, which preserves the good algorithmic properties and had not been reported in the literature before. This is rather surprising, since \(\textsc {Cutwidth}\) and \(\textsc {Pathwidth}\) have been jointly investigated in the context of exact and approximation algorithms, especially in terms of balanced vertex and edge separators. We think that a reason for overlooking this connection might be that it is less obvious on the graph level and becomes more apparent only when linked via the string parameter of locality, emphasising, as such, the value of such mixed approaches.

8 Extensions

8.1 Injectivity

In our setting, the substitutions that map variables to words are not required to be injective, i. e., different variables can be mapped to the same word. However, the requirement of injectivity is natural in some contexts. For example, in the pattern matching community, the first mention of pattern matching with variables concerns the case where variables have to be substituted by single symbols and in an injective way. More precisely, this parameterised pattern matching was introduced in [3] to formalise the problem of detecting code clones, i. e., finding code segments that are created by copying some code block and renaming program variables (this renaming will be injective, since otherwise the semantics of the code might change). More generally speaking, the injectivity condition is appropriate whenever we know a priori that different variables should always refer to different words (e. g., when matching the pattern

$$x_1\,\textsf {name:}\,y\,;\,\textsf {address:}\,z\,;\,x_2\,\textsf {name:}\,y\,;\,\textsf {address:}\,z\,x_3$$

in order to check whether there is a repetition of some name-address data tuple, it is likely that we can assume injectivity).

Depending on the actual variant, the injectivity condition can make the matching problem harder or easier. In [26], it is shown that it is \(\mathsf {NP}\)-hard to decide for a given word w and an integer k whether w can be factorised into at least k pairwise different factors. This immediately implies that the injectivity condition makes the matching problem \(\mathsf {NP}\)-hard even for the “trivial” pattern class \(\{x_1 x_2 \ldots x_n \mid n \ge 1\}\) (note that this is even a subset of the class of regular patterns). On the other hand, if we have an upper bound on \(|\varSigma |\) and \(\max \{|h(x)| \mid x \in X\}\) (recall that this case is still \(\mathsf {NP}\)-hard even for bounds 2 and 1, respectively; see Theorem 1), then the total number of possible substitution words is also bounded; thus, the injectivity condition bounds the total number of variables and therefore the matching problem becomes tractable (see [29]). A similar observation can be made with respect to fixed-parameter tractability if we parameterise by \(|\varSigma |\) and \(\max \{|h(x)| \mid x \in X\}\) (see [31]).
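To illustrate the combinatorial explosion behind this hardness, the following naive backtracking check (a sketch of ours, exponential in the worst case, and of course not the reduction from [26] itself) decides whether a word splits into at least k pairwise different factors:

def distinct_factorisation(w, k):
    # Try all cut positions; `used` collects the factors chosen so far,
    # which must be pairwise different.
    def go(i, used):
        if i == len(w):
            return len(used) >= k
        return any(w[i:j] not in used and go(j, used | {w[i:j]})
                   for j in range(i + 1, len(w) + 1))
    return go(0, frozenset())

print(distinct_factorisation("aabbab", 3))  # e.g. a|ab|bab -> True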

8.2 Word Equations

A word equation is an equality \(\alpha = \beta \), where \(\alpha \) and \(\beta \) are patterns with variables, e. g., \(\alpha =x_1 \mathtt {a}\mathtt {b}x_2\) and \(\beta = \mathtt {a}x_1 x_2 \mathtt {b}\) define the equation \(x_1 \mathtt {a}\mathtt {b}x_2 = \mathtt {a}x_1 x_2 \mathtt {b}\). A solution to an equation \(\alpha = \beta \) is a substitution h (in the sense defined in Sect. 2) that satisfies \(h(\alpha ) = h(\beta )\). For the example equation from above, the solutions are the substitutions h with \(h(x_1) = \mathtt {a}^k\), for \(k\ge 0\), and \(h(x_2) = \mathtt {b}^\ell \), for \(\ell \ge 0\).
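Checking that a given substitution is a solution is straightforward; the following sketch (with patterns encoded as lists of variables and terminal symbols, a representation chosen by us for illustration) verifies the solutions claimed above:

def subst(pattern, h):
    # Apply the substitution h (variables -> words); terminal symbols
    # are left unchanged.
    return "".join(h.get(t, t) for t in pattern)

alpha = ["x1", "a", "b", "x2"]   # x1 a b x2
beta = ["a", "x1", "x2", "b"]    # a x1 x2 b
for k in range(4):
    for l in range(4):
        h = {"x1": "a" * k, "x2": "b" * l}
        assert subst(alpha, h) == subst(beta, h)  # both sides a^{k+1} b^{l+1}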

The study of word equations (or the existential theory of equations over free monoids) is an important topic found at the intersection of algebra and computer science, with significant connections to, e.g., combinatorial group or monoid theory [21, 65, 66], unification [48, 49, 80], and, more recently, database theory [33, 34].

The central computational problem for word equations is the satisfiability problem, i. e., the problem of deciding whether a given word equation \(\alpha =\beta \) has a solution or not. In this regard, the matching problem for patterns with variables describes just the special case of the satisfiability problem for word equations where one side of the equation is a terminal word, e. g., \(x_1 \mathtt {a}\mathtt {b}x_1 x_2 \mathtt {c}x_2 x_1 = \mathtt {b}\mathtt {a}\mathtt {b}\mathtt {b}\mathtt {a}\mathtt {c}\mathtt {a}\mathtt {b}\) is an instance of the matching problem already mentioned in the introduction, phrased as a word equation. Consequently, the satisfiability problem is intractable, even for very strongly restricted cases (see Theorems 1 and 2). Also note that it has been shown in [22] that the satisfiability problem remains \(\mathsf {NP}\)-hard if every variable has at most two occurrences in \(\alpha \beta \) (so-called quadratic equations), although the proof of [22] actually concerns the matching problem for patterns with at most two occurrences per variable.

While the matching problem for patterns with variables is trivially decidable, it is not at all obvious how to solve the satisfiability problem for word equations. In fact, the question whether it is decidable was initially approached with the expectation that it would be answered in the negative. It was, however, shown to be decidable by Makanin [67] (see Chap. 12 of [64] for a survey). Later it was shown by Plandowski that the satisfiability problem is in \(\mathsf {PSPACE}\) [74]; a new proof of this result was obtained in [51], based on a new simple technique called recompression. There are also cases where the satisfiability problem is tractable. For instance, word equations with only one variable can be solved in time linear in the size of the equation, see [50]; equations with two variables can also be solved in polynomial time, see [19].
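One way to see why bounding solutions is the crucial issue: once an upper bound on the length of a shortest solution is available (and computable upper bounds on shortest solutions are known), satisfiability reduces to a finite search. The following naive sketch (ours; max_len is an explicit parameter, not the actual bound from the literature) illustrates this:

from itertools import product

def subst(pattern, h):
    return "".join(h.get(t, t) for t in pattern)

def satisfiable_bounded(alpha, beta, variables, sigma, max_len):
    # Enumerate all substitutions mapping each variable to a word of
    # length at most max_len over sigma; correct only relative to the
    # chosen bound, and exponential in max_len and len(variables).
    words = [""]
    for n in range(1, max_len + 1):
        words.extend("".join(t) for t in product(sigma, repeat=n))
    for choice in product(words, repeat=len(variables)):
        h = dict(zip(variables, choice))
        if subst(alpha, h) == subst(beta, h):
            return h
    return None

print(satisfiable_bounded(["x1", "a", "b", "x2"],
                          ["a", "x1", "x2", "b"],
                          ["x1", "x2"], "ab", 2))  # {'x1': '', 'x2': ''}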

Given the fact that there are many structural restrictions of patterns that yield tractability (with respect to the matching problem, see Sect. 4), the question naturally arises how the complexity of the satisfiability problem for word equations (which are essentially equations of patterns) behaves if these restrictions are applied to word equations. More precisely, while each class of patterns with \(\mathsf {NP}\)-hard matching problem yields a class of word equations with \(\mathsf {NP}\)-hard satisfiability problem, the hardness of the satisfiability problem for equations with sides in some efficiently matchable class of patterns is no longer immediate. An investigation of this question was initiated in [68], where the following results were obtained. Firstly, the satisfiability problem for non-cross word equations (i. e., word equations for which both sides are non-cross) remains \(\mathsf {NP}\)-hard. In particular, solving non-cross equations \(\alpha =\beta \) where each variable occurs at most three times, at most twice in \(\alpha \) and exactly once in \(\beta \), is \(\mathsf {NP}\)-hard (note that this constitutes the first \(\mathsf {NP}\)-hardness result for word equations that is not a direct conclusion from a hardness result for the matching problem). Secondly, the satisfiability of one-repeated variable equations (i. e., at most one variable occurs more than once in \(\alpha \beta \), but arbitrarily many other variables occur only once) having at least one non-repeated variable on each side was shown to be trivially in \(\mathsf {P}\).

In [18], it is shown that it is (still) \(\mathsf {NP}\)-hard to solve regular ordered word equations. More precisely, these are word equations where each side is a regular pattern and the order of the variables is the same on both sides (it is, however, possible that some variables occur on only one side of the equation), e. g., \(x_1 \mathtt {a}x_2 \mathtt {b}\mathtt {a}x_3 x_4 = \mathtt {b}x_1 x_3 \mathtt {a}\mathtt {a}x_4\) is a regular ordered word equation. They are particular cases of both quadratic equations and non-cross equations, so the reductions showing the hardness of solving these more general equations do not carry over. In particular, note that the class of regular patterns is arguably the simplest class of patterns in terms of their matching complexity (see Sect. 5.1).

The respective hardness reduction relied on some deep ideas from combinatorics on words. As a first step, a reachability problem for a certain type of (regulated) string rewriting systems was introduced and shown to be \(\mathsf {NP}\)-complete. This was achieved via a reduction from the problem 3-Partition [40], which is strongly \(\mathsf {NP}\)-complete. Then it was shown that this reachability problem can be reduced to the satisfiability of regular ordered word equations; in this reduction, the applications of the rewriting rules of the system were encoded into the periods of the words assigned to the variables in a solution of the equation. The main technicality was to make sure to use only one occurrence of each variable per side, and moreover to have the variables in the same order on both sides. This result exhibits the arguably structurally simplest class of word equations for which the satisfiability problem is \(\mathsf {NP}\)-hard.

The main open problem in the area of word equations remains, even for simple subclasses such as regular equations or quadratic equations, to show that the satisfiability problem of word equations of the respective types is in \(\mathsf {NP}\) (note that this was already explicitly posed as an open question for the class of quadratic word equations in [22]).

9 Conclusions

In this work we tried to survey several results related to the problem of matching patterns with variables that seem important to us. While this survey is clearly not exhaustive, it aims to offer a basic understanding of the problems and the state of the art in this area.

From an algorithmic point of view, the results we covered provide a wide variety of classes of patterns with variables for which the matching problem can be efficiently solved. Moreover, as explained in Sect. 4.3, it is usually easy to check whether a pattern belongs to one of these classes. So, putting it all together, one could use the following approach when trying to match a pattern, rather than just using an exponential-time algorithm (based, e.g., on general SMT-solvers, or on the theory of string solving [6, 83]). First, check if the pattern belongs to one of the classes for which efficient matching algorithms are known and, if so, use the respective algorithm; only use a general algorithm when no customised one can be applied. Identifying more natural pattern classes for which the matching problem can be solved efficiently appears, as such, to be a rather useful task. Following the practically motivated challenges that arise from the area of string solving, one could also try to find efficient matching algorithms for various classes of patterns enhanced with various constraints: regular constraints, length constraints, etc.

As an important part of this survey deals with polynomial-time algorithms, it is natural to also ask whether they are optimal or not. Such questions are the focus of the area of fine-grained complexity (see, e.g., the survey [11] and the citations therein). It would be interesting to see, using tools from this area, whether one can show lower bounds for the matching problem for different classes of patterns.

In the light of the results from [13], it seems that exploring the connections between string parameters and parameters for other classes of objects could lead to interesting results in both worlds. So it also seems an interesting challenge to explore what the structural parameters of strings discussed here (and maybe some new ones) mean when various other types of data are represented as strings, and what consequences can be derived from such a representation.

Finally, the area of word equations abounds with open problems. As mentioned, it is not even clear whether the satisfiability of regular or quadratic equations is in \(\mathsf {NP}\). So even if we restrict ourselves to equations with structurally simple left- and right-hand sides, the complexity of solving equations is not known. Such problems become even more involved when we consider equations with various types of constraints (e.g., length or regular constraints). For instance, the decidability of general word equations with length constraints is a long-standing open problem, but it is already an interesting open question for simpler cases (once again: regular or quadratic equations); see, e.g., [17, 46, 61], and the references therein. It seems interesting to us whether some of the ideas used in matching patterns can be transferred to solving (simplified) word equations, with or without constraints.