
1 Introduction

Compressed text indexes have become the standard tool for maintaining highly-repetitive texts when full-text search queries like locating all occurrences of a pattern are of importance. When working on indexes for highly-repetitive data, a desired property is to have a self-index, i.e., a data structure that supports queries on the underlying text without storing the text in its plain form. One type of such self-indexes is the grammar index, which is an augmentation of the admissible grammar [20] produced by a grammar compressor. Grammar indexes exhibit strong compression ratios for semi-automatically generated or highly-repetitive texts. Unlike other indexes that perform pattern matching stepwise, character by character, some grammar indexes have locality sensitive parsing properties, which allow them to match certain non-terminals of the admissible grammar built upon the pattern with the non-terminals of the text. Such a property helps us to perform fewer comparisons and thus speeds up pattern matching for particularly long patterns, such as large gene sequences in a genomic database or source code files in a database maintaining source code. Here, our focus is set on indexes that support locate queries, which retrieve the starting positions of all occurrences of a given pattern P in a given text.

1.1 Our Contribution

Our main contribution is the discovery of a locality sensitive parsing property in the grammar produced by the grammar compression by induced sorting (GCIS) [29], which helps us to answer locate queries with an index built upon GCIS within the following bounds:

Theorem 1

Given a text T of length n, we can compute an indexing data structure on T in \(\mathop {}\mathopen {}{\mathcal {O}}\mathopen {}(n)\) time, which can locate all \(\text {occ}\) occurrences of a given pattern of length m in \(\mathop {}\mathopen {}{\mathcal {O}}\mathopen {}(m \lg |{\mathcal {S}}| + \text {occ}_C\lg n \lg |{\mathcal {S}}| + \text {occ})\) time, where \({\mathcal {S}}\) is the set of characters and non-terminals of the GCIS grammar, and \(\text {occ}_C\) is the number of occurrences of a selected core of the pattern in the right hand sides of the production rules of the GCIS grammar; a core is a string of symbols of the grammar of P, defined in Sect. 4.1. Our index uses \(\mathop {}\mathopen {}{\mathcal {O}}\mathopen {}(g)\) words of working space, where g is the sum of the lengths of the right hand sides of all production rules.

Similar properties hold for other grammars such as the signature encoding [25], ESP [7], HSP [16], the Rsync parse [17], or the grammar of [3, Sect. 4.2]. A brief review of these and other self-indexes follows.

1.2 Related Work

With respect to indexing a grammar for answering locate queries, the first work we are aware of is due to [5], who studied indexes built upon so-called straight-line programs (SLPs). An SLP is a context-free grammar in Chomsky normal form representing a single string.

Other research focused on particular types of grammars, such as the ESP-index [24, 31, 32], an index [4] combining Re-Pair [22] with the Lempel–Ziv-77 parsing [34], a dynamic index [26] based on signature encoding [25], the Lyndon SLP [33], or the grammar index of [3]. For the experiments in Sect. 5, we will additionally have a look at other self-indexes capable of locate queries. There, we analyze Burrows–Wheeler transform (BWT) [2] based approaches, namely the FM-index [15] and the r-index [18].

Finally, the grammar GCIS has other interesting properties besides being locality sensitive. [28] showed how to compute the suffix array and the longest-common-prefix array from GCIS during a decompression step restoring the original text. Recently, [8] showed how to compute the BWT directly from the GCIS grammar.

2 Preliminaries

With \(\lg \) we denote the logarithm to base two (i.e., \(\lg = \log _2\)). Given two integers i and j, we denote the interval \([i..j] = \{i,i+1,\ldots ,j-1,j\}\), with \([i..j] = \{\}\) if \(i > j\). Our computational model is the standard word RAM with machine word size \(\varOmega (\lg n)\), where n denotes the length of a given input string T[1..n], which we call the text, whose characters are drawn from an integer alphabet \(\varSigma \) of size \(n^{\mathop {}\mathopen {}{\mathcal {O}}\mathopen {}(1)}\). We call the elements of \(\varSigma \) characters. For a string \(S \in \varSigma ^*\), we denote with S[i..] its i-th suffix, and with |S| its length. The order < on the alphabet \(\varSigma \) induces a lexicographic order on \(\varSigma ^*\), which we denote by \(\prec \).

2.1 Induced Suffix Sorting

SAIS [27] is a linear-time algorithm for computing the suffix array [23]. We briefly review the parts of SAIS important for constructing the GCIS grammar. SAIS assigns each suffix a type, which is either \(\texttt {L}\) or \(\texttt {S}\): T[i..] is an L suffix if \(T[i..] \succ T[i+1..]\), and an S suffix otherwise, i.e., if \(T[i..] \prec T[i+1..]\), where we stipulate that T[|T|] is always type \(\texttt {S}\). Since it is not possible that \(T[i..] = T[i+1..]\), SAIS assigns each suffix a type. An S suffix T[i..] is additionally an \(\texttt {S}^*\) suffix (also called an LMS suffix in [27]) if \(T[i-1..]\) is an L suffix. The substring between two consecutive \(\texttt {S}^*\) suffixes is called an LMS substring. In other words, a substring T[i..j] with \(i < j\) is an LMS substring if and only if T[i..] and T[j..] are \(\texttt {S}^*\) suffixes and there is no \(k \in [i+1..j-1]\) such that T[k..] is an \(\texttt {S}^*\) suffix. Regarding the defined types, we make no distinction between suffixes and their starting positions (e.g., the statements that (a) T[i] is type L and (b) T[i..] is an L suffix are equivalent). In fact, we can determine L and S positions solely based on their succeeding positions with the equivalent definition: if \(T[i] > T[i+1]\), then T[i] is L; if \(T[i] < T[i+1]\), then T[i] is S; finally, if \(T[i] = T[i+1]\), then T[i] has the same type as \(T[i+1]\).
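
To make the type assignment concrete, the following C++ sketch (our illustration, not the implementation of [27] or [29]) classifies every position of a string according to the rules above.

```cpp
#include <string>
#include <vector>

// Classify each position of T as L, S, or S*: T[i] is L if T[i] > T[i+1],
// S if T[i] < T[i+1], and inherits the type of T[i+1] on equality; the last
// position is S by convention, and an S position is S* if its preceding
// position is L.
enum class Type { L, S, Sstar };

std::vector<Type> classify(const std::string& T) {
    const int n = static_cast<int>(T.size());
    std::vector<Type> type(n, Type::S);          // T[n-1] is S by definition
    for (int i = n - 2; i >= 0; --i) {
        if (T[i] > T[i + 1]) type[i] = Type::L;
        else if (T[i] < T[i + 1]) type[i] = Type::S;
        else type[i] = type[i + 1];              // equal characters: copy the type
    }
    for (int i = 1; i < n; ++i)                  // an S position preceded by L is S*
        if (type[i] != Type::L && type[i - 1] == Type::L) type[i] = Type::Sstar;
    return type;
}
```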

The LMS substrings of \(\texttt {\#} T\), where # is a special character smaller than all characters appearing in T such that \(\texttt {\#}T\) starts with an \(\texttt {S}^*\) position, induce a factorization \(T = F_1 \cdots F_z\), where each factor except possibly the first starts at an \(\texttt {S}^*\) position. We call this factorization the LMS factorization. By replacing each factor \(F_i\) by the lexicographic rank of its respective LMS substring, we obtain a string \({T}^{(1)}\) of these ranks. We recurse on \({T}^{(1)}\) until we obtain a string \({T}^{(\tau _T-1)}\) whose rank-characters are all unique or whose LMS factorization consists of at most two factors.
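
Building on the previous sketch, the LMS factorization can be computed by starting a new factor at every \(\texttt {S}^*\) position, with the first position acting as a factor start in place of the sentinel #; the helper classify and the enum Type are taken from the listing above.

```cpp
#include <string>
#include <vector>

// Factorize T into F_1 ... F_z: a new factor starts at every S* position,
// and the first position of T acts as a factor start (the role of #).
// Sketch only; `classify` and `Type` are the helpers from the previous listing.
std::vector<std::string> lms_factorize(const std::string& T) {
    std::vector<Type> type = classify(T);
    std::vector<std::string> factors;
    std::size_t start = 0;
    for (std::size_t i = 1; i < T.size(); ++i) {
        if (type[i] == Type::Sstar) {            // factor boundary
            factors.push_back(T.substr(start, i - start));
            start = i;
        }
    }
    factors.push_back(T.substr(start));          // last factor runs to the end
    return factors;
}
```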

2.2 Constructing the Grammar

We assign each computed factor \({F}^{(h)}_j\) a non-terminal \({X}^{(h)}_j\) such that \({X}^{(h)}_j \rightarrow {F}^{(h)}_j\), but omit the delimiter #. The order of the non-terminals \({X}^{(h)}_j\) is induced by the lexicographic order of their respective LMS substrings. We now use the non-terminals instead of the lexicographic ranks in the recursive steps. If we set \({X}^{(\tau _T)} \rightarrow {T}^{(\tau _T-1)}\) as the start symbol, we obtain a context-free grammar \({\mathcal {G}}_T:= (\varSigma , \varGamma , \pi , {X}^{(\tau _T)})\), where \(\varGamma \) is the set of non-terminals and \(\pi : \varGamma \rightarrow (\varSigma \cup \varGamma )^+\) is the function mapping each non-terminal to the right hand side of its (production) rule. For simplicity, we stipulate that \(\pi (c) = c\) for \(c \in \varSigma \). Let g denote the sum of the lengths of the right hand sides of all grammar rules. We say that a non-terminal (\(\in \varGamma \)) or a character (\(\in \varSigma \)) is a symbol, and denote the set of characters and non-terminals by \({\mathcal {S}}:= \varSigma \cup \varGamma \). We understand \(\pi \) also as a string morphism \(\pi : {\mathcal {S}}^* \rightarrow {\mathcal {S}}^*\) by applying \(\pi \) on each symbol of the input string. This allows us to define the expansion \(\pi ^*(X)\) of a symbol X, which is the iterative application of \(\pi \) until obtaining a string of characters, i.e., \(\pi ^*(X) \in \varSigma ^*\) and \(\pi ^*({X}^{(\tau _T)}) = T\). Since \(\pi (X)\) is deterministically defined, we also refer to \(\pi (X)\) as the right hand side of X.
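
The following sketch illustrates how \(\pi \) and the expansion \(\pi ^*\) can be represented; the 32-bit symbol ids, the threshold sigma, and the hash map are our own illustrative choices and not the representation used by the index.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// A toy representation of the rule function pi: symbols are 32-bit ids,
// characters are ids below `sigma`, and pi(c) = c for characters.
// pi_star recursively expands a symbol into its string of characters.
using Symbol = std::uint32_t;
using Rhs = std::vector<Symbol>;

struct Grammar {
    Symbol sigma;                                 // ids < sigma are characters
    std::unordered_map<Symbol, Rhs> pi;           // right hand sides of the rules

    void expand(Symbol x, std::string& out) const {
        if (x < sigma) { out.push_back(static_cast<char>(x)); return; }
        for (Symbol y : pi.at(x)) expand(y, out); // apply pi recursively
    }
    std::string pi_star(Symbol x) const {         // expansion of a symbol
        std::string out;
        expand(x, out);
        return out;
    }
};
```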

Lemma 1

([29]). The GCIS grammar \({\mathcal {G}}_T\) can be constructed in \(\mathop {}\mathopen {}{\mathcal {O}}\mathopen {}(n)\) time. \({\mathcal {G}}_T\) is reduced, meaning that we can reach all non-terminals of \(\varGamma \) from \({X}^{(\tau _T)}\).

\({\mathcal {G}}_T\) can be visualized by its derivation tree \({\mathcal {T}}_T\), which has \({X}^{(\tau _T)}\) as its root. Each rule \({X}^{(h)}_k \rightarrow {X}^{(h-1)}_{i} \cdots {X}^{(h-1)}_{j}\) defines a node \({X}^{(h)}_k\) having \({X}^{(h-1)}_{i}, \ldots , {X}^{(h-1)}_{j}\) as its children. The height of \({\mathcal {T}}_T\) is \(\tau _T = \mathop {}\mathopen {}{\mathcal {O}}\mathopen {}(\lg n)\) because the number of LMS substrings of \({T}^{(h)}\) is at most half of the length of \({T}^{(h)}\) for each recursion level h. The leaves of \({\mathcal {T}}_T\) are the terminals at height 0 that constitute the characters of the text T. Reading the nodes on height \(h \in [0..\tau _T-1]\) from left to right gives \({T}^{(h)}\) with \({T}^{(0)} = T\). Note that we use \({\mathcal {T}}_T\) only as a conceptual construct since it would take \(\mathop {}\mathopen {}{\mathcal {O}}\mathopen {}(n)\) words of space. Instead, we merge identical subtrees of the same non-terminal together to form a directed acyclic graph (DAG), which is implicitly represented by \(\pi \) as follows:

By construction, each non-terminal appears on exactly one height of \({\mathcal {T}}_T\). We can therefore separate the non-terminals into the sets \({\varGamma }^{(1)}, \ldots , {\varGamma }^{(\tau _T)}\) such that a non-terminal of height h belongs to \({\varGamma }^{(h)}\). More precisely, \(\pi \) maps a non-terminal on height \(h \ge 1\) to a string of symbols on height \(h-1\). Hence, the grammar is acyclic.

3 GCIS Index

In what follows, we want to show that we can augment \({\mathcal {G}}_T\) with auxiliary data structures for answering locate queries. Our idea stems from the classic pattern matching algorithm with the suffix tree [19, APL1]. The key difference is that we search for the core of a pattern in the right hand sides of the rules. For that, we make use of the generalized suffix tree \(\mathsf {GST}\) built upon the right hand sides of all rules, each followed by a special delimiter symbol \(\texttt {\$}\) that is smaller than all symbols. Specifically, we rank the rules such that \(\{X_1, \ldots , X_{|\varGamma |}\} = \varGamma \) (this ranking will be fixed later), and set \(R := \pi (X_1) \texttt {\$} \pi (X_2) \texttt {\$} \cdots \pi (X_{|\varGamma |}) \texttt {\$}\). Since we have a budget of \(\mathop {}\mathopen {}{\mathcal {O}}\mathopen {}(g)\) words, we can afford to use a plain pointer-based tree topology. Each leaf \(\lambda \) stores a pointer to the non-terminal \({X}^{(h)}\) and an offset o such that \(\pi ({X}^{(h)})[o..]\) is a prefix of \(\lambda \)’s string label. Next, we need the following operations on \(\mathsf {GST}\): First, an LCA query gives the lowest common ancestor (LCA) of two nodes u and v. We can augment \(\mathsf {GST}\) with the data structure of [1] in time and space linear in the number of nodes of \(\mathsf {GST}\); this data structure answers LCA queries in constant time. Next, a child query gives the child of the node u connected to u by an edge whose label starts with \(c \in \varGamma \). Our \(\mathsf {GST}\) implementation answers child queries in \(\mathop {}\mathopen {}{\mathcal {O}}\mathopen {}(\lg |{\mathcal {S}}|)\) time. For that, each node stores the pointers to its children in a binary search tree with the first symbol of each connecting edge as key. Finally, a string depth query returns the string depth of a node v, i.e., the length of its string label, which is the string read from the edge labels on the path from the root to v. We can compute and store the string depth of each node during the construction of \(\mathsf {GST}\). The child operation allows us to compute the locus of a string S, i.e., the highest \(\mathsf {GST}\) node u whose string label has S as a prefix, in \(\mathop {}\mathopen {}{\mathcal {O}}\mathopen {}(|S| \lg |{\mathcal {S}}|)\) time. For each \(\pi (X)\), we augment the locus u of \(\pi (X)\texttt {\$}\) with a pointer to X such that we can perform a lookup returning the non-terminal X with \(\pi (X) = S\) or an invalid symbol \(\bot \) if such an X does not exist; the lookup time is dominated by the time for computing the locus of S. Finally, all leaves are stored in a linked list in suffix order such that we can traverse the leaves in lexicographic order with respect to their corresponding suffixes.
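
As an illustration of the child queries and the locus computation, the sketch below walks down a compacted trie whose children are kept in an ordered map (standing in for the binary search trees mentioned above); the node layout is hypothetical and not the paper's implementation.

```cpp
#include <cstdint>
#include <map>
#include <memory>
#include <vector>

// Each node keeps its children in an ordered map keyed by the first symbol of
// the connecting edge, so a child query costs O(lg |S|) and the locus of a
// string S is found in O(|S| lg |S|) time by walking down symbol by symbol.
using Symbol = std::uint32_t;

struct Node {
    std::vector<Symbol> edge;                 // label of the edge entering this node
    std::map<Symbol, std::unique_ptr<Node>> children;
    std::size_t string_depth = 0;             // length of the node's string label
};

// Returns the highest node whose string label has `s` as a prefix,
// or nullptr if no such node exists.
const Node* locus(const Node* root, const std::vector<Symbol>& s) {
    const Node* v = root;
    std::size_t i = 0;
    while (i < s.size()) {
        auto it = v->children.find(s[i]);     // child query
        if (it == v->children.end()) return nullptr;
        const Node* child = it->second.get();
        for (Symbol c : child->edge) {        // match along the edge label
            if (i == s.size()) break;
            if (c != s[i]) return nullptr;
            ++i;
        }
        v = child;
    }
    return v;
}
```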

Linkage to the Grammar. Each rule \(X \in \varGamma \) stores an array X.P of \(|\pi (X)|\) pointers to the leaves in \(\mathsf {GST}\) such that X.P[i] points to the leaf that points back to X and has offset i (its string label has \(\pi (X)[i..]\) as a prefix). Additionally, each rule X stores the length of \(\pi (X)\), an array X.L of the expansion lengths of all its prefixes, i.e., \(X.L[i] := \sum _{j=1}^i |\pi ^*(\pi (X)[j])|\), and an array X.R of the lengths of the right hand sides of all its prefixes, i.e., \(X.R[i] := \sum _{j=1}^i |\pi (\pi (X)[j])|\).
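
The arrays X.L and X.R are plain prefix sums; a sketch of their computation follows, where expansion_len and rhs_len are assumed per-symbol lookups (hypothetical helpers, e.g., backed by the stored lengths of the rules).

```cpp
#include <cstdint>
#include <vector>

// Precompute, for a rule X -> pi(X), the prefix sums X.L and X.R described
// above: L[i] is the total expansion length of pi(X)[1..i+1] and R[i] the
// total length of the right hand sides of pi(X)[1..i+1] (0-based arrays).
struct RulePrefixSums {
    std::vector<std::uint64_t> L, R;
};

template <class ExpLen, class RhsLen>
RulePrefixSums build_prefix_sums(const std::vector<std::uint32_t>& rhs,
                                 ExpLen expansion_len, RhsLen rhs_len) {
    RulePrefixSums p;
    p.L.resize(rhs.size());
    p.R.resize(rhs.size());
    std::uint64_t sumL = 0, sumR = 0;
    for (std::size_t i = 0; i < rhs.size(); ++i) {
        sumL += expansion_len(rhs[i]);   // |pi*(pi(X)[i])|
        sumR += rhs_len(rhs[i]);         // |pi(pi(X)[i])|
        p.L[i] = sumL;
        p.R[i] = sumR;
    }
    return p;
}
```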

LCE Queries. Each internal node v stores a pointer to the leftmost leaf in the subtree rooted at v. With that, we can support LCE queries returning the longest common extension (LCE) of \(\pi (X)[i..]\) and \(\pi (Y)[j..]\) for \(X,Y \in \varGamma \) and \(i \in [1..|\pi (X)|], j \in [1..|\pi (Y)|]\). We can answer such a query by selecting the leaves X.P[i] and Y.P[j], retrieving the LCA of both leaves, and taking its string depth, all in constant time. More strictly speaking, we return the minimum of \(|\pi (X)[i..]|\), \(|\pi (Y)[j..]|\), and this string depth, since the delimiter \(\texttt {\$}\) is not a unique character, but appears at the end of each right hand side in the underlying string R of \(\mathsf {GST}\).
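
A minimal sketch of such an LCE query is given below; the leaf pointers, the constant-time LCA structure of [1], and the stored string depths are passed in as assumptions of this sketch rather than taken from the paper's code.

```cpp
#include <algorithm>
#include <cstddef>

// lce(X, i, Y, j): take the string depth of the LCA of the leaves X.P[i] and
// Y.P[j], clamped by the remaining lengths of pi(X)[i..] and pi(Y)[j..],
// because the delimiter $ is not unique in R.
template <class Leaf, class Lca, class Depth>
std::size_t lce(Leaf* leaf_x_i, std::size_t rem_x,   // leaf X.P[i], |pi(X)[i..]|
                Leaf* leaf_y_j, std::size_t rem_y,   // leaf Y.P[j], |pi(Y)[j..]|
                Lca lca, Depth string_depth) {
    std::size_t d = string_depth(lca(leaf_x_i, leaf_y_j));
    return std::min({rem_x, rem_y, d});
}
```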

Complexity Bounds. \(\mathsf {GST}\) can be computed in \(\mathop {}\mathopen {}{\mathcal {O}}\mathopen {}(g)\) time [13]. The grammar index consists of the GCIS grammar and \(\mathsf {GST}\) built upon \(|R| = g+|\varGamma |\) symbols and augmented with the LCA data structure of [1]. This all takes \(\mathop {}\mathopen {}{\mathcal {O}}\mathopen {}(g)\) space. Each non-terminal X is augmented with an array X.P of pointers to leaves, and with the arrays X.L and X.R storing the expansion lengths and the right hand side lengths of all prefixes of \(\pi (X)\), which again take \(\mathop {}\mathopen {}{\mathcal {O}}\mathopen {}(g)\) space when summing over all non-terminals.

4 Pattern Matching Algorithm

Like [30, Sect. 2], our idea is to first fix a core C of a given pattern P, find the occurrences of C in the text, and then try to extend all these occurrences to occurrences of P.

4.1 Cores

A core C is a string of symbols of the GCIS grammar \({\mathcal {G}}_P\) built on the pattern P with the following property: if an occurrence of C in \({\mathcal {T}}_T\) consists of consecutive nodes on height \(h \ge 0\) that do not all share the same parent node on height \(h+1\), then the expansion of this occurrence of C is not part of an occurrence of P. Hence, every occurrence of C in \({\mathcal {T}}_T\) whose expansion is contained in an occurrence of P is a (not necessarily proper) substring of the right hand side of a rule of \({\mathcal {G}}_T\).

We measure the quality of a core by the difference between the number of occurrences of C and of P in \({\mathcal {T}}_T\). On the one hand, although a single character P[i] always qualifies as a core, an appearance of P[i] in T is unlikely to be evidence of an occurrence of P.

On the other hand, the non-terminal covering most of the characters of P might not be a core. Hence, we aim for the highest possible non-terminal for which we can be sure that it exhibits the core property.

Finding a Core. We determine a core C of P during the computation of the GCIS grammar \({\mathcal {G}}_P\) of P. During this computation, we want to assure that we only create a new non-terminal for a factor F whenever the lookup of F in \(\mathsf {GST}\) returns \(\bot \); if the lookup returns a non-terminal X, we borrow X from \({\mathcal {G}}_T\). By doing so, we ensure that non-terminals of \({\mathcal {G}}_P\) and \({\mathcal {G}}_T\) are identical whenever the right hand sides of their productions are equal. In detail, if we create the factors \({P}^{(h)} = {F}^{(h)}_1 \cdots {F}^{(h)}_{z_h}\), we first retrieve, for each \(i \in [2..{z_h}-1]\), the non-terminal \({Y}^{(h)}_i\) whose right hand side is \({F}^{(h)}_i\) via a lookup. If one of these lookups returns \(\bot \), we abort since we can be sure that the pattern does not occur in T. That is because all non-terminals \({Y}^{(h)}_2, \ldots , {Y}^{(h)}_{{z_h}-1}\) qualify as cores. To see this, we observe that prepending or appending symbols to \({P}^{(h)}\) does not change the factors \({F}^{(h)}_2, \ldots , {F}^{(h)}_{z_h-1} =: {C}^{(h)}\).

Correctness. We show that prepending or appending symbols to \({F}^{(h)}_1\) \({C}^{(h)}\) \({F}^{(h)}_{z_h}\) does not modify the computed factorization of \({C}^{(h)} = {F}^{(h)}_2 \cdots {F}^{(h)}_{z_h-1}\). In particular, we show that we cannot change the type of any position in \({C}^{(h)}\) to \(\texttt {S}^*\): Firstly, the type of a position (S or L) depends only on its succeeding positions, and hence prepending cannot change the type of a position in \({C}^{(h)}\). Secondly, appending symbols can either prolong \({F}^{(h)}_{z_h}\) or create a new factor \({F}^{(h)}_{z_h+1}\) since \({F}^{(h)}_{z_h}\) starts with \(\texttt {S}^*\), and therefore appending cannot change \({C}^{(h)}\) either. An additional insight is that, on the one side, prepending symbols can only introduce a new factor or extend \({F}^{(h)}_1\). On the other side, appending symbols can introduce at most one new \(\texttt {S}^*\) position in \({F}^{(h)}_{z_h}\), which can make it split into two factors. We will need this observation later for extending the core to the pattern.

The construction of \({\mathcal {G}}_P\) iterates the LMS factorization until we are left with a string of symbols \({P}^{(\tau _P)}\) whose LMS factorization consists of at most two factors. In that case, we partition \({P}^{(\tau _P)}\) into three substrings \(C_p C C_s\) with \(C_p\) and \(C_s\) possibly empty, defined by the following mutually exclusive conditions: (1) If the LMS factorization consists of two non-empty factors \(F_1 \cdot F_2\), then \(C_p\) is \(F_1\). (2) Given that \({P}^{(\tau _P)} = ({P}^{(\tau _P)}[j_1])^{c_1} \cdots ({P}^{(\tau _P)}[j_k])^{c_k}\) is the run-length-encoded representation of \({P}^{(\tau _P)}\) with \(1 = j_1< \ldots < j_k \le |{P}^{(\tau _P)}|\), \(c_{i} \ge 1\) for \(i \in [1..k]\), and \({P}^{(\tau _P)}[j_i] \not = {P}^{(\tau _P)}[j_{i+1}]\) for \(i \in [1..k-1]\), we set \(C_s \leftarrow ({P}^{(\tau _P)}[j_k])^{c_k}\) if \({P}^{(\tau _P)}[j_k] < {P}^{(\tau _P)}[j_{k-1}]\). (3) In all other cases, \(C_p\) and/or \(C_s\) are empty.
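
As an illustration of condition (2), the following sketch extracts \(C_s\) from the run-length encoding of the top-level symbol string; conditions (1) and (3) are omitted for brevity, and the symbol type is our own illustrative choice.

```cpp
#include <cstdint>
#include <vector>

// Condition (2): C_s is the last run of the top-level string if the symbol of
// that run is smaller than the symbol of the preceding run; otherwise C_s is
// empty.
using Symbol = std::uint32_t;

std::vector<Symbol> compute_Cs(const std::vector<Symbol>& top) {
    if (top.size() < 2) return {};
    std::size_t k = top.size() - 1;
    while (k > 0 && top[k - 1] == top.back()) --k;   // k = start of the last run
    if (k == 0) return {};                           // the whole string is one run
    if (top[k - 1] > top.back())                     // previous run symbol is larger
        return std::vector<Symbol>(top.begin() + k, top.end());
    return {};                                       // C_s stays empty
}
```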

To see why C is a core, we only have to check the case where \(C_s\) is empty; the other cases have already been covered by the aforementioned analysis of the cores on the lower heights. If \(C_s\) is empty, then C ends where P ends, and as a border case, the last position of C is \(\texttt {S}^*\). In that case, appending a symbol smaller than P[m] to \({F}^{(h)}_1 C\) changes the type of the last position of C to L. If we append a symbol larger than P[m], then the last position of C becomes S, but does not become \(\texttt {S}^*\) since \({P}^{(\tau _P)}[j_k] > {P}^{(\tau _P)}[j_{k-1}]\) by construction (otherwise \(C_s\) would not be empty).

In total, there are symbols \({A}^{(1)}, \ldots , {A}^{(\tau _P-1)}\) and \({S}^{(1)}, \ldots , {S}^{(\tau _P-1)}\) such that

$$\begin{aligned} P = \pi ( {F}^{(1)}_1 \cdots {F}^{(\tau _P-2)}_1 {A}^{(1)} \cdots {A}^{(\tau _P-2)} C {S}^{(\tau _P-2)} \cdots {S}^{(1)} {F}^{(\tau _P-2)}_{z_{\tau _P-2}} \cdots {F}^{(1)}_{z_1} ), \end{aligned}$$
(1)

and \({A}^{(h)}\), \({S}^{(h)} \in {\varGamma }^{(h)}\) are cores of P, while \({F}^{(h)}_1\), \({F}^{(h)}_{z_h} \in ({\varGamma }^{(h-1)})^*\) are factors.

4.2 Matching with \(\mathsf {GST}\)

Having C, we now switch to \(\mathsf {GST}\) and use it to find all parents of C, whose number we denote by \(\text {occ}_C\in \mathop {}\mathopen {}{\mathcal {O}}\mathopen {}(g)\); this number is also the number of occurrences of C in the right hand sides of all rules of \({\mathcal {G}}_T\). Having these parents, we want to find all lowest ancestors of C whose expansions are large enough to cover not only C but also P when extending C to its left and right side—see Fig. 1 for a sketch. We proceed as follows: We first compute the locus v of C in \(\mathsf {GST}\) in \(\mathop {}\mathopen {}{\mathcal {O}}\mathopen {}(|C| \lg |{\mathcal {S}}|)\) time via repeated child queries. Subsequently, we take the pointer to the leftmost leaf in the subtree rooted at v, and then process all leaves in this subtree by using the linked list of leaves. For each such leaf \(\lambda \), we compute a path in the form of a list \(\lambda _L\) from the non-terminal containing C on its right hand side up to an ancestor of it whose expansion is large enough to cover P if we extended the contained occurrence of C to P. We do so as follows: Each of these leaves stores a pointer to a non-terminal X and a starting position i such that we know that \(\pi (X)[i..]\) starts with C. By knowing the expansion lengths \(X.L[|\pi (X)|]\), \(X.L[i-1]\), and \(|\pi ^*(C)|\), we can judge whether the expansion of X has enough characters to extend its occurrence of C to P. If it has enough characters, we put (X, i) onto \(\lambda _L\) such that we know that \(\pi ^*(X)[X.L[i-1] + 1..]\) has \(\pi ^*(C)\) as a prefix. If X does not have enough characters, we exchange C with X and recurse on finding a non-terminal with a larger expansion. By doing so, we visit at most \(\tau _T = \mathop {}\mathopen {}{\mathcal {O}}\mathopen {}(\lg n)\) non-terminals per occurrence of C in the right hand sides of \({\mathcal {G}}_T\). We perform all operations in \(\mathop {}\mathopen {}{\mathcal {O}}\mathopen {}(\text {occ}_C\tau _T \lg |{\mathcal {S}}|)\) time because we compute a locus of a single symbol, taking \(\mathop {}\mathopen {}{\mathcal {O}}\mathopen {}(\lg |{\mathcal {S}}|)\) time, in every recursion step.

Fig. 1.

Deriving a non-terminal \({Y}^{(h)}\) with \(\pi ^*({Y}^{(h)})\) containing P from a non-terminal \({Y}^{(\tau _P)}\) with \(\pi ^*({Y}^{(\tau _P)})\) containing \(\pi ^*(C)\). The expansion of none of the descendants of \({Y}^{(h)}\) towards \({Y}^{(\tau _P)}\) is large enough for extending its contained occurrence of \(\pi ^*(C)\) to an occurrence of P. We can check the expansion lengths of the substrings in \(\pi ({Y}^{(h)})\) via the array \({Y}^{(h)}.L\).

The previous step computes, for each accessed leaf \(\lambda \), a list \(\lambda _L\) containing a path \(({Y}^{(h)},\ldots ,{Y}^{(\tau _P)})\) of length \(\mathop {}\mathopen {}{\mathcal {O}}\mathopen {}(\tau _T)\) and an offset \({o}^{(\tau _P)}\) such that \(\pi ({Y}^{(\tau _P)})[{o}^{(\tau _P)}..]\) starts with C. By construction, these paths cover all occurrences of C in \({\mathcal {T}}_T\). Note that we process the node \({Y}^{(\tau _P)}\) (but for different offsets \({o}^{(\tau _P)}\)) as many times as C occurs in \(\pi ({Y}^{(\tau _P)})\). In what follows, we try to expand the occurrence of C captured by \({Y}^{(\tau _P)}\) and \({o}^{(\tau _P)}\) to an occurrence of P.

Naively, we would walk down from the occurrence of C in \(\pi ({Y}^{(\tau _P)})\) at offset \({o}^{(\tau _P)}\) to the character level and extend the substring \(\pi ^*(C)\) in both directions by character-wise comparison with P. However, this would take \(\mathop {}\mathopen {}{\mathcal {O}}\mathopen {}(\text {occ}_Cm \tau _T)\) time since such a non-terminal \({Y}^{(h)}\) has height \(\mathop {}\mathopen {}{\mathcal {O}}\mathopen {}(\tau _T)\). Our claim is that we can perform the computation in \(\mathop {}\mathopen {}{\mathcal {O}}\mathopen {}(m + \text {occ}_C\tau _T)\) time with the aid of LCE queries and an amortization argument.

For that, we use Eq. (1), which allows us to use LCE queries in the sense that we can try to extend an occurrence of C with an already extended occurrence (which may not match P completely). For the explanation, we only focus on extending all occurrences of C to the right to \(C C_s\) (the left side is handled symmetrically). We maintain an array D of length \(\tau _P\) storing pairs \(({X}^{(h)}, \ell _h)\) for each height \(h \in [1..\tau _P-1]\) such that \(\pi ({X}^{(h)})\) has the currently longest extension of length \(\ell _h\) in common with the core \({S}^{(h-1)}\) of P (cf. Eq. (1)). By maintaining D, we can first issue an LCE query with the specific non-terminal stored in D, and then resort to plain symbol comparison. We descend to the child where the mismatch happens and recurse until reaching the character level of \({\mathcal {T}}_T\). This works because, by the core property, a mismatch at a child means that there is a mismatch in the expansion of this child. Since a plain symbol comparison with matching symbols lets us exchange the currently used non-terminal in D with a longer one, we can bound (a) the total number of naive symbol matches by \(\mathop {}\mathopen {}{\mathcal {O}}\mathopen {}(m)\) and (b) the total number of naive symbol mismatches and LCE queries by \(\mathop {}\mathopen {}{\mathcal {O}}\mathopen {}(\text {occ}_C\tau _T)\).

Finding the Starting Positions. It is left to compute the starting position in T of each occurrence captured by an element in W, where W denotes the set of pairs \((X,\ell )\) collected in the previous step, each capturing an occurrence of P starting at position \(\ell \) in \(\pi ^*(X)\). We can do this similarly to computing pre-order ranks in a tree: For each pair \((X,\ell ) \in W\), climb up from X to the root while accumulating the expansion lengths of all left siblings of the nodes we visit (we can make use of X.L for that). If this accumulated length is s, then \(\ell +s\) is the starting position of the occurrence captured by \((X,\ell )\). However, this approach would cost \(\mathop {}\mathopen {}{\mathcal {O}}\mathopen {}(\tau _T)\) time per element of W. Here, we use the amortization argument of [6, Sect. 5.2], which works if we augment, in a pre-computation step, each non-terminal X in \(\varGamma \) with (a) a pointer to the lowest ancestor \(Y_X\) on every path from X to the root that has X at least twice as a descendant, and (b) the accumulated expansion lengths of the left siblings of the child of \(Y_X\) that is an ancestor of X or X itself. By doing so, when taking a pointer of a non-terminal X to its ancestor \(Y_X\), we know that X has another occurrence below \(Y_X\) (and thus there is another occurrence of P). Therefore, we can charge the cost of climbing up the tree to the number of occurrences \(\text {occ}\) of the pattern.
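
The sketch below shows the naive variant of this computation, climbing to the root and summing the expansion lengths of left siblings; the node layout is hypothetical, and the index replaces this walk by the amortized scheme of [6, Sect. 5.2].

```cpp
#include <cstddef>
#include <cstdint>

// Naive starting-position computation for a pair (X, l) in W: climb from the
// node of X up to the root and accumulate the expansion lengths of all left
// siblings; the occurrence then starts at l + s in T.
struct TreeNode {
    const TreeNode* parent = nullptr;
    std::size_t child_index = 0;                // 0-based position among the parent's children
    const std::uint64_t* prefix_exp = nullptr;  // parent's X.L: prefix_exp[j] = total
                                                // expansion length of its first j+1 children
};

std::uint64_t starting_position(const TreeNode* node, std::uint64_t l) {
    std::uint64_t s = 0;
    for (const TreeNode* v = node; v->parent != nullptr; v = v->parent) {
        if (v->child_index > 0)                                  // left siblings of v
            s += v->parent->prefix_exp[v->child_index - 1];
    }
    return l + s;
}
```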

Total Time. To sum up, we spend \(\mathop {}\mathopen {}{\mathcal {O}}\mathopen {}(m \lg |{\mathcal {S}}|)\) time for finding C, \(\mathop {}\mathopen {}{\mathcal {O}}\mathopen {}(\text {occ}_C\tau _T \lg |{\mathcal {S}}|)\) time for computing the non-terminals covering C, \(\mathop {}\mathopen {}{\mathcal {O}}\mathopen {}(m + \text {occ}_C\tau _T)\) time for reducing these non-terminals to W, and \(\mathop {}\mathopen {}{\mathcal {O}}\mathopen {}(\text {occ})\) time for retrieving the starting positions of the occurrences of P in T from W. To stay within our \(\mathop {}\mathopen {}{\mathcal {O}}\mathopen {}(g)\) space bounds, we can process each parent of C individually, and keep only D globally stored during the whole process. The total additional space is therefore \(\mathop {}\mathopen {}{\mathcal {O}}\mathopen {}(\tau _T) \subset \mathop {}\mathopen {}{\mathcal {O}}\mathopen {}(g)\) for maintaining D and a path for each occurrence of C.

5 Implementation and Experiments

The implementation deviates from the theory with respect to the rather large constant factor hidden in the \(\mathop {}\mathopen {}{\mathcal {O}}\mathopen {}(g)\) words of space. We drop \(\mathsf {GST}\) and represent the grammar with multiple arrays. For that, we first enumerate the non-terminals as follows: The height and the lexicographic order induce a natural order on the non-terminals in \(\varGamma \), which are ranked first by their height and second by the lexicographic order of their right hand sides, such that we can write \(\varGamma = \{X_1, \ldots , X_{|\varGamma |}\}\). By stipulating that all characters are lexicographically smaller than all non-terminals, we obtain the property that \(\pi (X_i) \prec \pi (X_{i+1})\) for all \(i \in [1..|\varGamma |-1]\). In the following, we first present a plain representation of the grammar, called GCIS-\(\mathsf {nep}\), then give our modified algorithm, and subsequently present a compressed version using universal coding, called GCIS-\(\mathsf {uni}\). Finally, we evaluate both implementations at the end of this section.

Our first implementation, called GCIS-\(\mathsf {nep}\), represents each symbol with a 32-bit integer. We use \(R := \prod _{i=1}^{|\varGamma |} \pi (X_i)\) again, but omit the delimiters \(\texttt {\$}\) separating the right hand sides. To find the right hand side of a non-terminal \(X_i\), we create an array of positions \(Q[1..|\varGamma |]\) such that Q[i] points to the starting position of \(\pi (X_i)\) in R. Finally, we create an array \(L[1..|\varGamma |]\) storing the length of the expansion \(|\pi ^*(X_i)|\) in L[i], for each non-terminal \(X_i\). Due to the stipulated order of the symbols, the strings \(R[Q[i]..Q[i+1]-1]\) are sorted in ascending order. Hence, we can evaluate a lookup for a string S, returning the non-terminal \(X_i\) with \(\pi (X_i) = S\) if it exists, in \(\mathop {}\mathopen {}{\mathcal {O}}\mathopen {}(|S| \lg |{\mathcal {S}}|)\) time by a binary search on Q with \(i \mapsto R[Q[i]..Q[i+1]-1]\) as keys.
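
A sketch of this lookup by binary search follows; it uses 0-based indexing and returns the 1-based rank of the matching non-terminal, or 0 if there is none.

```cpp
#include <cstdint>
#include <vector>

// The right hand sides are stored consecutively in R, Q[i] is the starting
// position of pi(X_{i+1}) in R (0-based), and the strings R[Q[i]..Q[i+1]) are
// sorted, so a binary search over Q finds the non-terminal whose right hand
// side equals s.
using Symbol = std::uint32_t;

std::size_t lookup(const std::vector<Symbol>& R, const std::vector<std::size_t>& Q,
                   const std::vector<Symbol>& s) {
    std::size_t lo = 0, hi = Q.size();                 // search range over ranks
    while (lo < hi) {
        std::size_t mid = lo + (hi - lo) / 2;
        std::size_t beg = Q[mid];
        std::size_t end = (mid + 1 < Q.size()) ? Q[mid + 1] : R.size();
        // lexicographic comparison of R[beg..end) with s
        int cmp = 0;
        std::size_t i = 0;
        for (; beg + i < end && i < s.size(); ++i) {
            if (R[beg + i] != s[i]) { cmp = R[beg + i] < s[i] ? -1 : 1; break; }
        }
        if (cmp == 0)
            cmp = (end - beg < s.size()) ? -1 : (end - beg > s.size() ? 1 : 0);
        if (cmp == 0) return mid + 1;                  // exact match: 1-based rank
        if (cmp < 0) lo = mid + 1; else hi = mid;
    }
    return 0;                                          // not found
}
```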

Locate. Our implementation follows the theory for computing \({\mathcal {G}}_P\) and C (cf. Sect. 4.1) within the same time bounds, but deviates after computing the core C: To find all non-terminals whose right hand sides contain C, we linearly scan the right hand sides of all non-terminals on height \(\tau _P\). We can do this cache-friendly since the right hand sides in R are sorted by the height of their respective non-terminals. This takes \(\mathop {}\mathopen {}{\mathcal {O}}\mathopen {}(g + |C|)\) time in total with a linear-time pattern matching algorithm [21].
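
The scan can be sketched as follows; std::search stands in for the linear-time matcher [21], and matches straddling two consecutive right hand sides (possible because the delimiters are omitted in R) still have to be filtered out with Q.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Report all positions of the core C inside R[beg..end), the contiguous range
// of R holding the right hand sides of the non-terminals on height tau_P.
using Symbol = std::uint32_t;

std::vector<std::size_t> occurrences_of_core(const std::vector<Symbol>& R,
                                             std::size_t beg, std::size_t end,
                                             const std::vector<Symbol>& C) {
    std::vector<std::size_t> hits;
    if (C.empty()) return hits;
    auto first = R.begin() + beg;
    const auto last = R.begin() + end;
    while (true) {
        auto it = std::search(first, last, C.begin(), C.end());
        if (it == last) break;
        hits.push_back(static_cast<std::size_t>(it - R.begin()));
        first = it + 1;                           // continue after this match
    }
    return hits;                                  // filter rule-crossing hits via Q
}
```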

Finally, for extending a found occurrence of the core C to an occurrence of P, we follow the naive approach to descend to the character level and compare the expansion with P character-wise, which results in \(\mathop {}\mathopen {}{\mathcal {O}}\mathopen {}(\text {occ}_C{} |P| \tau _T)\) time. The total time cost is \(\mathop {}\mathopen {}{\mathcal {O}}\mathopen {}(g + |P| (\text {occ}_C{} \tau _T + \lg |{\mathcal {S}}|))\).

Table 1. Sizes of the used datasets and the indexes stored on disk. Sizes are in megabytes [MB].

GCIS-uni. To save space, we can leverage universal codes to compress the right hand sides of the productions. First, we observe that Q and the first symbols \(F := \pi (X_1)[1], \ldots , \pi (X_{|\varGamma |})[1]\) form ascending sequences, such that we can represent both Q and F in Elias–Fano coding [11]. Next, we observe that each right hand side \(\pi (X_i)\) forms a bitonic sequence: the ranks of the first \(\ell _i\) symbols are non-decreasing, while the rest of the ranks are non-increasing. Our idea is to store \(\ell _i\) and the differences of \(\pi (X_i)[2..]\) in delta coding, i.e., \(\varDelta [i][k] := |\pi (X_i)[k] - \pi (X_i)[k-1]|\) for \(k \in [2..|\pi (X_i)|]\), which is stored in Elias-\(\gamma \) code [12]. Although \(\pi (X_i)[k] - \pi (X_i)[k-1] < 0\) is possible for \(k > \ell _i\), we can decode \(\pi (X_i)[k]\) by subtracting instead of adding the difference to \(\pi (X_i)[k-1]\) as usual in delta coding. Hence, we can replace R with \(\varDelta \), but need to adjust Q such that Q[i] points to the first bit of \(\varDelta [i]\). Finally, like in the first variant, we store the expansion lengths of all non-terminals in L. Here, we split L into a first part using 8 bits per entry, then 16 bits per entry, and finally 32 bits per entry. To this end, we represent L by three arrays, start with filling the first array, and continue with filling the next array whenever we process a value whose bit representation cannot be stored in a single entry of the current array. Since Elias–Fano coding supports constant-time random access and Elias-\(\gamma \) supports constant-time linear access, we can decode \(\pi (X_i)\) by accessing F[i] and then sequentially decoding \(\varDelta [i]\). Hence, we can simulate GCIS-\(\mathsf {nep}\) with this compressed version without sacrificing the theoretical bounds. We call the resulting index GCIS-\(\mathsf {uni}\).
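
The bitonic delta decoding of a single right hand side can be sketched as follows, assuming the first symbol, the differences, and \(\ell _i\) have already been decoded from their Elias–Fano and Elias-\(\gamma \) representations.

```cpp
#include <cstdint>
#include <vector>

// Reconstruct pi(X_i) from its first symbol, the absolute differences
// Delta[i][k] = |pi(X_i)[k] - pi(X_i)[k-1]|, and the length l of the
// non-decreasing part: differences are added up to position l and
// subtracted afterwards, since the sequence is bitonic.
using Symbol = std::uint32_t;

std::vector<Symbol> decode_rhs(Symbol first, const std::vector<Symbol>& deltas,
                               std::size_t l /* length of the non-decreasing part */) {
    std::vector<Symbol> rhs;
    rhs.push_back(first);
    for (std::size_t k = 0; k < deltas.size(); ++k) {
        Symbol prev = rhs.back();
        // the next symbol sits at 1-based position rhs.size() + 1
        rhs.push_back(rhs.size() + 1 <= l ? prev + deltas[k] : prev - deltas[k]);
    }
    return rhs;
}
```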

Fig. 2.

Maximum memory consumption (top) and time (bottom) during the construction of the indexes.

Experiments. In the following, we present an evaluation of our implementation and of different self-indexes for comparison, namely the FM-index [14], the ESP-index [32], and the r-index [18]. All code has been compiled with gcc-10.2.0 with the highest optimization flag -O3. We ran all our experiments on an Intel Xeon X5670 CPU clocked at 2.93 GHz running Arch Linux.

Our datasets, shown in Table 1, are from the Pizza&Chili and the tudocomp [9] corpora. With respect to the index sizes, we obtain the empirical ranking GCIS-\(\mathsf {uni}\) < ESP-index < GCIS-\(\mathsf {nep}\), followed by one of the BWT-based indexes. While the r-index needs less space than the FM-index on highly-compressible datasets, it is the least favorable option of all indexes for less-compressible datasets. Figure 2 gives the time and memory needed for constructing the indexes.

Fig. 3.

Time for locate queries while scaling the pattern length on the datasets english.001.2 (left) and fib41 (right). The plots are in logscale. The right figure does not feature the FM-index, which takes considerably more time than the other approaches. For the same reason, there is no data shown for the ESP-index for small pattern lengths, where it needs 170 s on average for \(|P| = 10\).

Fig. 4.

Left: The average height \(\tau _P\) of \({\mathcal {G}}_P\) for a pattern of a certain length. Right: Percentage of the computation of \({\mathcal {G}}_P\) in relation to the whole running time for answering locate queries with GCIS-\(\mathsf {nep}\).

We can observe in Fig. 3 that our indexes answer locate queries fast when P is sufficiently long or has many occurrences \(\text {occ}\) in T. GCIS-\(\mathsf {uni}\) is always slower than GCIS-\(\mathsf {nep}\) due to the extra cost of decoding. In particular for english.001.2, GCIS-\(\mathsf {nep}\) is the fastest index once the pattern length reaches 10000 characters or more. At this length, the pattern grammar reaches a height \(\tau _P\) of almost six, which is the height \(\tau _T\). The algorithm can then extend an occurrence of a core to a pattern occurrence by checking only 80–100 characters. However, when the pattern surpasses 5000 characters, the computation of \({\mathcal {G}}_P\) becomes the time bottleneck; in that respect, the ESP-index shares the same characteristic. The universal coding of GCIS-\(\mathsf {uni}\) slows down the locate time by a factor of roughly 2 to 10. Let us have a look at the dataset fib41, which is linearly recurrent [10], a property from which we can derive that a pattern occurring at least once in T actually has a huge number of occurrences in T. There are almost 3,000,000 occurrences of patterns with a length of 100. Here, we observe that our indexes are faster than the ESP-index. The ESP-index needs more time for locating than GCIS because GCIS can form a core that covers a higher percentage of the pattern than the core selected by ESP. The FM-index, and the ESP-index for \(|P| = 10\), take 100 s or more on average; we omitted them from the graph to keep the visualization clear.

In Fig. 4, we study the maximum height \(\tau _P = \mathop {}\mathopen {}{\mathcal {O}}\mathopen {}(\lg |P|)\) that we achieved for the patterns with \(|P| = 100\) in each dataset. For this experiment, we randomly selected a position j in T and extracted \(P=T[j..j+99]\). For every dataset, we could observe that \(\tau _P\) is logarithmic in the pattern length; in particular for the artificial datasets fib41, tm29, and rs.13, \(\tau _P\) is empirically larger than measured in the other datasets. In dna and commoncrawl, \(\tau _P\) is at most 3, but this is because \(\tau _T = 3\) for these datasets.