1 Introduction

The classic string indexing problem is to preprocess a string S into a compact data structure that supports efficient pattern matching queries. Typical queries include existential queries (decide if the pattern occurs in S), reporting queries (return all positions where the pattern occurs), and counting queries (return the number of occurrences of the pattern). An important variant of this problem is the gapped string indexing problem [7, 9, 11, 14, 18, 31, 32, 36]. Here, the goal is to compactly represent the string such that given two patterns \(P_1\) and \(P_2\) and a gap range \([\alpha , \beta ]\) we can quickly find occurrences of \(P_1\) and \(P_2\) with distance in \([\alpha , \beta ]\). Searching and indexing with gaps is frequently used in computational biology applications [9, 15, 17, 18, 23, 25, 26, 37, 40, 43].

Another variant is string indexing for consecutive occurrences [6, 8, 12, 45]. Navarro and Thankachan [45] study the problem of compactly representing the string such that given a pattern P and a gap range \([\alpha , \beta ]\) we can quickly find consecutive occurrences of P with distance in \([\alpha , \beta ]\), i.e., pairs of subsequent occurrences with distance within the range.

In this paper, we consider the natural combination of these variants that we call gapped indexing for consecutive occurrences. Here, the goal is to compactly represent the string such that given two patterns \(P_1\) and \(P_2\) and a gap range \([\alpha , \beta ]\) we can quickly find the consecutive occurrences of \(P_1\) and \(P_2\) with distance in \([\alpha , \beta ]\).

We can apply standard techniques to obtain several simple solutions to the problem. To state the bounds, let n be the size of S. If we store the suffix tree for S, we can answer queries by searching for both query strings, merging the results, and removing all non-consecutive occurrences. This leads to a solution using O(n) space and \({\widetilde{O}}(|P_1| + |P_2| + \text {occ}_{P_1} + \text {occ}_{P_2})\) query time, where \(\text {occ}_{P_1}\) and \(\text {occ}_{P_2}\) denote the number of occurrences of \(P_1\) and \(P_2\), respectively. However, \(\text {occ}_{P_1} + \text {occ}_{P_2}\) may be as large as \(\Omega (n)\) and much larger than the size of the output. Alternatively, we can obtain constant query time for counting queries by precomputing the answer for each pair of suffix tree nodes and each possible distance, in \(O(n^3)\) space.

In this paper, we introduce new solutions that significantly improve the above time-space trade-offs. Specifically, we present data structures that use linear space and query time \({\widetilde{O}}(|P_1|+|P_2|+n^{2/3})\) for existence and counting and \({\widetilde{O}}(|P_1|+|P_2|+n^{2/3}\text {occ}^{1/3})\) for reporting. We complement this with a conditional lower bound based on the set intersection problem showing that any solution using \({\widetilde{O}}(n)\) space must use \(\widetilde{\Omega }(|P_1| + |P_2| + \sqrt{n})\) query time. To obtain our results we develop new techniques and ideas of independent interest including a new suffix tree decomposition and hardness of a variant of the set intersection problem.

1.1 Setup and Results

Throughout the paper, let S be a string of length n. Given two patterns \(P_1\) and \(P_2\), a consecutive occurrence in S is a pair of occurrences \((i, j)\), \(0 \le i< j < |S|\), where i is an occurrence of \(P_1\) and j an occurrence of \(P_2\), such that no other occurrence of either \(P_1\) or \(P_2\) occurs in between. The distance of a consecutive occurrence \((i, j)\) is \(j-i\). Our goal is to preprocess S into a compact data structure that, given pattern strings \(P_1\) and \(P_2\) and a gap range \([\alpha , \beta ]\), supports the following queries:

  • \(\mathsf {Exists}(P_1, P_2, \alpha , \beta )\): determine if there is a consecutive occurrence of \(P_1\) and \(P_2\) with distance within the range \([\alpha , \beta ]\).

  • \(\mathsf {Count}(P_1, P_2, \alpha , \beta )\): return the number of consecutive occurrences of \(P_1\) and \(P_2\) with distance within the range \([\alpha , \beta ]\).

  • \(\mathsf {Report}(P_1, P_2, \alpha , \beta )\): report all consecutive occurrences of \(P_1\) and \(P_2\) with distance within the range \([\alpha , \beta ]\).
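For intuition, the semantics of the three queries can be captured by a brute-force reference implementation operating directly on the sorted occurrence lists of the two patterns. This is a sketch of the problem definition only, not the data structure developed below; a pair is consecutive exactly when its two positions are adjacent in the merged occurrence sequence, with the first from \(P_1\) and the second from \(P_2\).

```python
def consecutive_occurrences(occ1, occ2):
    # occ1, occ2: sorted, disjoint text positions where P1 and P2 occur.
    # (i, j) is consecutive iff i and j are adjacent in the merged
    # occurrence sequence, with i an occurrence of P1 and j of P2.
    events = sorted([(p, 1) for p in occ1] + [(p, 2) for p in occ2])
    return [(p, q) for (p, a), (q, b) in zip(events, events[1:])
            if a == 1 and b == 2]

def report(occ1, occ2, alpha, beta):
    # Report: consecutive occurrences with distance in [alpha, beta].
    return [(i, j) for (i, j) in consecutive_occurrences(occ1, occ2)
            if alpha <= j - i <= beta]

def count(occ1, occ2, alpha, beta):
    return len(report(occ1, occ2, alpha, beta))

def exists(occ1, occ2, alpha, beta):
    return count(occ1, occ2, alpha, beta) > 0
```

For example, in S = "abaab" with \(P_1 = \texttt{a}\) (occurrences 0, 2, 3) and \(P_2 = \texttt{b}\) (occurrences 1, 4), the consecutive occurrences are (0, 1) and (3, 4); the pair (2, 4) is not consecutive because the occurrence at position 3 lies in between.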

We present new data structures with the following bounds:

Theorem 1

Given a string of length n and constant \(\epsilon >0\), we can construct an O(n) space data structure that supports \(\mathsf {Exists}(P_1, P_2, \alpha , \beta )\) and \(\mathsf {Count}(P_1, P_2, \alpha , \beta )\) queries in \(O(|P_1| + |P_2| + n^{2/3}\log ^{\epsilon }n)\) time and \(\mathsf {Report}(P_1, P_2, \alpha , \beta )\) queries in \({O}(|P_1| + |P_2| + n^{2/3} \text {occ}^{1/3}\log ^{4/3 + \epsilon } n)\) time, where \(\text {occ}\) is the size of the output.

Hence, Theorem 1 achieves linear space and query time \({\widetilde{O}}(|P_1| + |P_2|+n^{2/3})\) for existence and counting and \({\widetilde{O}}(|P_1| + |P_2|+n^{2/3}\text {occ}^{1/3})\) for reporting. Compared to the above-mentioned simple suffix tree approach that finds all occurrences of the query strings and merges them, we match the linear space bound while reducing the dependency on n in the query time from worst-case \(\Omega (|P_1| + |P_2|+n)\) to \({\widetilde{O}}(|P_1| + |P_2|+n^{2/3})\) for \(\mathsf {Exists}\) and \(\mathsf {Count}\) queries and \({\widetilde{O}}(|P_1| + |P_2|+n^{2/3}\text {occ}^{1/3})\) for \(\mathsf {Report}\) queries.

We complement Theorem 1 with a conditional lower bound based on the set intersection problem. Specifically, we use the Strong SetDisjointness Conjecture from [24] to obtain the following result:

Theorem 2

Assuming the Strong SetDisjointness Conjecture, any data structure on a string S of length n that supports Exists queries in \(O(n^{\delta }+|P_1|+|P_2|)\) time, for \(\delta \in [0,1/2]\), requires \({\Omega }\left( n^{2-2\delta - o(1)}\right) \) space. This bound also holds if we limit the queries to only support ranges of the form \([0,\beta ]\), and even if the bound \(\beta \) is known at preprocessing time.

With \(\delta = 1/2\), Theorem 2 implies that any near linear space solution must have query time \(\widetilde{\Omega }(|P_1| + |P_2| + \sqrt{n})\). Thus, Theorem 1 is optimal within a factor roughly \(n^{1/6}\). On the other hand, with \(\delta = 0\), Theorem 2 implies that any solution with optimal \({\widetilde{O}}(|P_1| + |P_2|)\) query time must use \(\widetilde{\Omega }(n^{2-o(1)})\) space.

Finally, note that Theorem 2 holds even when the gap range is of the form \([0, \beta ]\). As a simple extension of our techniques, we show how to improve our solution from Theorem 1 to match Theorem 2 in this special case.

1.2 Techniques

To obtain our results we develop new techniques and show new interesting properties of consecutive occurrences. We first consider \(\mathsf {Exists}\) and \(\mathsf {Count}\) queries. The key idea is to split gap ranges into large and small distances. For large distances there can only be a limited number of consecutive occurrences and we show how these can be efficiently handled using a segmentation of the string. For small distances, we cluster the suffix tree and store precomputed answers for selected pairs of nodes. Since the number of distinct distances is small we obtain an efficient bound on the space.

We extend our solution for \(\mathsf {Exists}\) and \(\mathsf {Count}\) queries to handle \(\mathsf {Report}\) queries. To do so we develop a new decomposition of suffix trees, called the induced suffix tree decomposition that recursively divides the suffix tree in half by index in the string. Hence, the decomposition is a balanced binary tree, where every node stores the suffix tree of a substring of S. We show how to traverse this structure to efficiently recover the consecutive occurrences.

For our conditional lower bound we show a reduction based on the set intersection problem. Along the way we show that set intersection remains hard even if all elements in the instance have the same frequency.

1.3 Related Work

As mentioned, string indexing for gaps and consecutive occurrences are the most closely related lines of work to this paper. Another related area is document indexing, where the goal is to preprocess a collection of strings, called documents, to report those documents that contain patterns subject to various constraints. For a comprehensive overview of this area see the survey by Navarro [41].

A well-studied line of work within document indexing is document indexing for top-k queries [16, 27–30, 38, 39, 42, 44, 47, 48]. The goal is to efficiently report the top-k documents of smallest weight, where the weight is a function of the query. Specifically, the weight can be the distance of a pair of occurrences of the same or two different query patterns [30, 38, 42, 47]. The techniques for top-k indexing (see e.g. Hon et al. [30]) can be adapted to efficiently solve gapped indexing for consecutive occurrences in the special case when the gap range is of the form \([0, \beta ]\). However, since these techniques heavily exploit that the goal is to find the top-k closest occurrences, they do not generalize to general gap ranges.

There are several results on conditional lower bounds for pattern matching and string indexing [4, 5, 24, 33–35]. Notably, Kopelowitz and Krauthgamer [33] consider the snippets problem, where the goal is to preprocess the text to enable fast reporting of the closest pair of occurrences of query patterns \(P_1\) and \(P_2\). They prove a lower bound for that problem based on SetDisjointness, which is closely related to our lower bound in Theorem 2. Our result uses a similar reduction but introduces an intermediate step of potential independent interest, where we prove hardness for instances of SetDisjointness in which every element has the same frequency. Ultimately, this leads to a clean proof of the final lower bound.

Other results also establish a link between indexing for two patterns and set intersection. Ferragina et al. [20] and Cohen and Porat [19] reduce the two dimensional substring indexing problem to set intersection (though the goal was to prove an upper, not a lower bound). In the two dimensional substring indexing problem the goal is to preprocess pairs of strings such that given two patterns we can output the pairs that contain a pattern each. Larsen et al. [35] prove a conditional lower bound for the document version of indexing for two patterns, i.e., finding all documents containing both of the two query patterns. Goldstein et al. [24] show that similar lower bounds can be achieved via conjectured hardness of set intersection. Our reduction is still quite different from these, since we need a translation from intersection to distance.

1.4 Outline

The paper is organized as follows. In Sect. 2 we define notation and recall some useful results. In Sect. 3 we show how to answer \(\mathsf {Exists}\) and \(\mathsf {Count}\) queries. In Sect. 4 we show how to answer \(\mathsf {Report}\) queries. In combination this proves Theorem 1. In Sect. 5 we prove the lower bound, proving Theorem 2. Finally, in Sect. 6 we apply our techniques to solve the variant where \(\alpha = 0\).

2 Preliminaries

Strings. A string S of length n is a sequence \(S[0]S[1]\dots S[n-1]\) of characters from an alphabet \(\Sigma \). A contiguous subsequence \(S[i,j]=S[i]S[i+1]\dots S[j]\) is a substring of S. The substrings of the form \(S[i,n-1]\) are the suffixes of S. The suffix tree [49] is a compact trie of all suffixes of \(S\$\), where $ is a symbol not in the alphabet that is lexicographically smaller than any letter in the alphabet. Each leaf is labelled with the index i of the suffix \(S[i,n-1]\) it corresponds to. Using perfect hashing [22], the suffix tree can be stored in O(n) space and supports string indexing queries (i.e., finding and reporting all occurrences of a pattern P) in \(O(m+\text {occ})\) time, where m is the length of P and \(\text {occ}\) is the number of times P occurs in S.

For any node v in the suffix tree, we define \(\text {str}(v)\) to be the string found by concatenating all labels on the path from the root to v. The locus of a string P, denoted \(\text {locus}(P)\), is the minimum depth node v such that P is a prefix of \(\text {str}(v)\). The suffix array stores the suffix indices of \(S\$\) in lexicographic order. We identify each leaf in the suffix tree with the suffix index it represents. The suffix tree has the property that the leaves below any node represent suffixes that appear in consecutive order in the suffix array. For any node v in the suffix tree, \(\text {range}(v)\) denotes the range that v spans in the suffix array. The inverse suffix array is the inverse permutation of the suffix array, that is, an array where the ith element is the index of suffix i in the suffix array.
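As a concrete illustration of these definitions, a naive construction of the suffix array and its inverse (sorting the suffixes directly; real constructions run in O(n) time) might look as follows. This is a sketch for small inputs only.

```python
def suffix_array(s):
    # Append the sentinel '$'; we assume '$' compares smaller than every
    # character of s (true for ASCII letters and digits).
    s += "$"
    return sorted(range(len(s)), key=lambda i: s[i:])

def inverse_suffix_array(sa):
    # inv[i] is the position of suffix i in the suffix array.
    inv = [0] * len(sa)
    for rank, suffix in enumerate(sa):
        inv[suffix] = rank
    return inv
```

For example, the suffix array of "banana$" is [6, 5, 3, 1, 0, 4, 2].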

Orthogonal range successor. The orthogonal range successor problem is to preprocess an array \(A[0,\dots ,n-1]\) into a data structure that efficiently supports the following queries:

  • \(\textsf {RangeSuccessor}(a,b,x)\): return the successor of x in \(A[a,\dots , b]\), that is, the minimum \(y> x\) such that there is an \(i\in [a,b]\) with \(A[i]=y\).

  • \(\textsf {RangePredecessor}(a,b,x)\): return the predecessor of x in \(A[a,\dots , b]\), that is, the maximum \(y< x\) such that there is an \(i\in [a,b]\) with \(A[i]=y\).
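A minimal linear-scan stand-in for these two queries is sketched below. The actual data structure used later (Nekrich and Navarro [46]) answers each query in \(O(\log ^{\epsilon } n)\) time rather than \(O(b-a)\); the sketch only fixes the query semantics.

```python
def range_successor(A, a, b, x):
    # Minimum y > x with A[i] = y for some i in [a, b]; None if no such y.
    candidates = [v for v in A[a:b + 1] if v > x]
    return min(candidates) if candidates else None

def range_predecessor(A, a, b, x):
    # Maximum y < x with A[i] = y for some i in [a, b]; None if no such y.
    candidates = [v for v in A[a:b + 1] if v < x]
    return max(candidates) if candidates else None
```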

3 Existence and Counting

In this section we give a data structure that can answer \(\mathsf {Exists}\) and \(\mathsf {Count}\) queries. The main idea is to split the query interval into “large” and “small” distances. For large distances we exploit that there can only be a small number of consecutive occurrences and we check them with a simple segmentation of S. For small distances we cluster the suffix tree and precompute answers for selected pairs of nodes.

We first show how to use orthogonal range successor queries to find consecutive occurrences. Then we define the clustering scheme used for the suffix tree and give the complete data structure.

3.1 Using Orthogonal Range Successor to Find Consecutive Occurrences

Assume we have found the loci of \(P_1\) and \(P_2\) in the suffix tree. Then we can answer the following queries in a constant number of orthogonal range successor queries on the suffix array:

  • \(\textsf {FindConsecutive}_{P_2}(i)\): given an occurrence i of \(P_1\), return the consecutive occurrence \((i, j)\) of \(P_1\) and \(P_2\), if it exists, and \(\textsc {No}\) otherwise.

  • \(\textsf {FindConsecutive}_{P_1}(j)\): given an occurrence j of \(P_2\), return the consecutive occurrence \((i, j)\) of \(P_1\) and \(P_2\), if it exists, and \(\textsc {No}\) otherwise.

Given a query \(\textsf {FindConsecutive}_{P_2}(i)\), we answer as follows. First, compute \(j = \textsf {RangeSuccessor}(\text {range}(\text {locus}(P_2)), i)\) to get the closest occurrence of \(P_2\) after i. Then, compute \(i' = \textsf {RangePredecessor}(\text {range}(\text {locus}(P_1)), j)\) to get the closest occurrence of \(P_1\) before j. If \(i = i'\), then no other occurrence of \(P_1\) exists between i and j, and they are consecutive. In that case we return \((i, j)\). Otherwise, we return \(\textsc {No}\).

Similarly, we can answer \(\textsf {FindConsecutive}_{P_1}(j)\) by first doing a \(\textsf {RangePredecessor}\) and then a \(\textsf {RangeSuccessor}\) query. Thus, given the loci of both patterns and a specific occurrence of either \(P_1\) or \(P_2\), we can in a constant number of \(\textsf {RangeSuccessor}\) and \(\textsf {RangePredecessor}\) queries find the corresponding consecutive occurrence, if it exists.
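The two-query procedure can be sketched as follows, with naive scans over the suffix array standing in for the range-successor structure. Here sa is the suffix array and r1, r2 are the (inclusive) suffix-array ranges of \(\text {locus}(P_1)\) and \(\text {locus}(P_2)\).

```python
def find_consecutive_p2(sa, r1, r2, i):
    # FindConsecutive_{P2}(i): i is an occurrence of P1. Returns the
    # consecutive occurrence (i, j) if it exists, else None.
    def succ(r, x):  # closest text position > x among leaves in range r
        c = [sa[k] for k in range(r[0], r[1] + 1) if sa[k] > x]
        return min(c) if c else None

    def pred(r, x):  # closest text position < x among leaves in range r
        c = [sa[k] for k in range(r[0], r[1] + 1) if sa[k] < x]
        return max(c) if c else None

    j = succ(r2, i)          # closest occurrence of P2 after i
    if j is None:
        return None
    # (i, j) is consecutive iff i is the closest occurrence of P1 before j
    return (i, j) if pred(r1, j) == i else None
```

For S = "abaab" the suffix array of "abaab$" is [5, 2, 3, 0, 4, 1]; with \(P_1 = \texttt{a}\) (range (1, 3)) and \(P_2 = \texttt{b}\) (range (4, 5)), the call for i = 0 returns (0, 1), while the call for i = 2 returns None since the occurrence at position 3 intervenes.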

3.2 Data Structure

To build the data structure we will use a cluster decomposition of the suffix tree.

Cluster Decomposition A cluster decomposition of a tree T is defined as follows. For a connected subgraph \(C\subseteq T\), a boundary node is a node \(v\in C\) such that either v is the root of T, or v has an edge leaving C, that is, there exists an edge \((v, u)\) in T such that \(u\in T {\setminus } C\). A cluster is a connected subgraph C of T with at most two boundary nodes. A cluster with one boundary node is called a leaf cluster. A cluster with two boundary nodes is called a path cluster. For a path cluster C, the two boundary nodes are connected by a unique path. We call this path the spine of C. A cluster partition is a partition of T into clusters, i.e., a set CP of clusters such that \(\bigcup _{C\in CP}V(C)=V(T)\), \(\bigcup _{C\in CP}E(C)=E(T)\), and no two clusters in CP share any edges. Here, E(G) and V(G) denote the edge and vertex set of a (sub)graph G, respectively. We need the next lemma, which follows from well-known tree decompositions [1, 2, 3, 21] (see Bille and Gørtz [10] for a direct proof).

Lemma 1

Given a tree T with n nodes and a parameter \(\tau \), there exists a cluster partition CP such that \(|CP|=O(n/\tau )\) and every \(C\in CP\) has at most \(\tau \) nodes. Furthermore, such a partition can be computed in O(n) time.

Data Structure We build a clustering of the suffix tree of S as in Lemma 1, with cluster size at most \(\tau \), where \(\tau \) is some parameter satisfying \(0 < \tau \le n\). Then the counting data structure consists of:

  • The suffix tree of S, with some additional information for each node. For each node v we store:

    • The range v spans in the suffix array, i.e., \(\text {range}(v)\).

    • A bit that indicates if v is on a spine.

    • If v is on a spine, a pointer to the lower boundary node of the spine.

    • If v is a leaf, the local rank of v. That is, the rank of the suffix corresponding to v in the text order of the suffixes corresponding to leaves in the cluster that contains v. Note that this is at most \(\tau \).

  • The inverse suffix array of S.

  • A range successor data structure on the suffix array of S.

  • An array \(M(u,v)\) of length \(\lfloor \frac{n}{\tau }\rfloor + 1\) for every pair of boundary nodes \((u, v)\). For \(1 \le x \le \lfloor \frac{n}{\tau }\rfloor \), \(M(u,v)[x]\) is the number of consecutive occurrences \((i, j)\) of \(\text {str}(u)\) and \(\text {str}(v)\) with distance at most x. We set \(M(u,v)[0] = 0\).

Denote \(M(u,v)[\alpha ,\beta ]=M(u,v)[\beta ]-M(u,v)[\alpha -1]\); that is, \(M(u,v)[\alpha ,\beta ]\) is the number of consecutive occurrences of \(\text {str}(u)\) and \(\text {str}(v)\) with a distance in \([\alpha ,\beta ]\).
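Assuming the distances of all consecutive occurrences of \(\text {str}(u)\) and \(\text {str}(v)\) are available from a (hypothetical) preprocessing step, building one such array and answering the range query is a prefix-sum exercise:

```python
def build_M(distances, n, tau):
    # M[x] = number of consecutive occurrences of str(u) and str(v)
    # with distance <= x, for 0 <= x <= n // tau; M[0] = 0.
    m = n // tau
    M = [0] * (m + 1)
    for d in distances:
        if 1 <= d <= m:
            M[d] += 1
    for x in range(1, m + 1):
        M[x] += M[x - 1]  # prefix sums turn bucket counts into cumulative counts
    return M

def M_range(M, alpha, beta):
    # Number of consecutive occurrences with distance in [alpha, beta],
    # assuming 1 <= alpha <= beta <= len(M) - 1.
    return M[beta] - M[alpha - 1]
```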

Space Analysis We store a constant number of words per node in the suffix tree. The suffix tree and inverse suffix array occupy O(n) space. For the orthogonal range successor data structure we use the data structure of Nekrich and Navarro [46], which uses O(n) space and answers queries in \(O(\log ^\epsilon n)\) time, for constant \(\epsilon > 0\). There are \(O\left( n^2/\tau ^2\right) \) pairs of boundary nodes and for each pair we store an array of length \(O\left( n/\tau \right) \). Therefore the total space consumption is \(O\left( n + n^3/\tau ^3\right) \).

3.3 Query Algorithm

We now show how to count the consecutive occurrences \((i, j)\) with a distance in the query interval, i.e., \(\alpha \le j - i \le \beta \). We call each such pair a valid occurrence.

To answer a query we split the query interval \([\alpha , \beta ]\) into two: \([\alpha ,\lfloor \frac{n}{\tau }\rfloor ]\) and \([\lfloor \frac{n}{\tau }\rfloor +1, \beta ]\), and handle these separately.

3.3.1 Handling Distances \(> \frac{n}{\tau }\)

We start by finding the loci of \(P_1\) and \(P_2\) in the suffix tree. As shown in Sect. 3.1, this allows us to find the consecutive occurrence containing a given occurrence of either \(P_1\) or \(P_2\). We implicitly partition the string S into segments of (at most) \(\lfloor n / \tau \rfloor \) characters by calculating \(\tau \) segment boundaries. Segment i, for \(0 \le i < \tau \), contains the characters \(S[i \cdot \lfloor \frac{n}{\tau }\rfloor , (i+1) \cdot \lfloor \frac{n}{\tau }\rfloor - 1]\), and segment \(\tau \) (if it exists) contains the characters \(S[\tau \cdot \lfloor \frac{n}{\tau }\rfloor ,n-1]\). We find the last occurrence of \(P_1\) in each segment by performing a series of \(\textsf {RangePredecessor}\) queries, starting from the beginning of the last segment. Each time an occurrence i is found, we perform the next query from the segment boundary to the left of i, continuing until the start of the string is reached. For each occurrence i of \(P_1\) found in this way, we use \(\textsf {FindConsecutive}_{P_2}(i)\) to find the consecutive occurrence \((i, j)\), if it exists. We check each of these pairs, discard any with distance \(\le \frac{n}{\tau }\), and count how many are valid.
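The segment scan can be sketched as follows. Here occ1 and occ2 are the sorted occurrence lists of \(P_1\) and \(P_2\); in the actual data structure these lists are never materialized, and the binary searches below stand in for the \(\textsf {RangePredecessor}\) and \(\textsf {FindConsecutive}\) queries on the suffix array. We assume \(\tau \le n\), so segments are nonempty.

```python
from bisect import bisect_left, bisect_right

def count_large(occ1, occ2, n, tau, alpha, beta):
    # Count consecutive occurrences (i, j) with j - i > n // tau and
    # distance in [alpha, beta], scanning the segments right to left.
    seg = n // tau
    total = 0
    pos = n  # the next predecessor query is over positions < pos
    while pos > 0:
        k = bisect_left(occ1, pos) - 1   # last occurrence of P1 before pos
        if k < 0:
            break
        i = occ1[k]
        pos = (i // seg) * seg           # segment boundary to the left of i
        # FindConsecutive_{P2}(i): closest P2 after i, no P1 in between
        t = bisect_right(occ2, i)
        if t == len(occ2):
            continue
        j = occ2[t]
        if bisect_left(occ1, j) - 1 == k and j - i > seg and alpha <= j - i <= beta:
            total += 1
    return total
```

For example, with occ1 = [0, 2, 3, 10], occ2 = [1, 4, 15], n = 16 and \(\tau = 4\), the only consecutive occurrence of distance greater than \(n/\tau = 4\) is (10, 15).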

3.3.2 Handling Distances \(\le \frac{n}{\tau }\)

In this part, we only count valid occurrences with distance \(\le \frac{n}{\tau }\). Consider the loci of \(P_1\) and \(P_2\) in the suffix tree. Let \(C_i\) denote the cluster that contains \(\text {locus}(P_i)\) for \(i=1,2\). There are two main cases.

At least one locus is not on a spine If either locus is in a small subtree hanging off a spine in a cluster or in a leaf cluster, we directly find all consecutive occurrences as follows. If \(\text {locus}(P_1)\) is in a small subtree, then we use \(\textsf {FindConsecutive}_{P_2}(i)\) on each leaf i below \(\text {locus}(P_1)\) to find all consecutive occurrences, count the valid occurrences, and terminate. If only \(\text {locus}(P_2)\) is in a small subtree, then we use \(\textsf {FindConsecutive}_{P_1}(j)\) for each leaf j below \(\text {locus}(P_2)\), count the valid occurrences, and terminate.

Both loci are on a spine If neither locus is in a small subtree, then both are on spines. Let \(b_1\) and \(b_2\) denote the lower boundary nodes of the clusters \(C_1\) and \(C_2\), respectively. There are two types of consecutive occurrences \((i, j)\):

  (i) Occurrences where the leaf corresponding to i is inside \(C_1\) or the leaf corresponding to j is inside \(C_2\).

  (ii) Occurrences below the boundary nodes, that is, i is below \(b_1\) and j is below \(b_2\).

See Fig. 1a. We describe how to count the different types of occurrences next.

Type (i) occurrences To find the valid occurrences \((i, j)\) where either \(i\in C_1\) or \(j\in C_2\), we do as follows. First, we find all the consecutive occurrences \((i, j)\) where i is a leaf in \(C_1\) by computing \(\textsf {FindConsecutive}_{P_2}(i)\) for all leaves i below \(\text {locus}(P_1)\) in \(C_1\). We count all valid occurrences found in this way. Then, we find all remaining consecutive occurrences \((i, j)\) where j is a leaf in \(C_2\) by computing \(\textsf {FindConsecutive}_{P_1}(j)\) for all leaves j below \(\text {locus}(P_2)\) in \(C_2\). If \(\textsf {FindConsecutive}_{P_1}(j)\) returns a valid occurrence \((i, j)\), we use the inverse suffix array to check whether the leaf i is below \(b_1\). This can be done by checking whether i’s position in the suffix array is in \(\text {range}(b_1)\). If i is below \(b_1\), we count the occurrence; otherwise, we discard it.

Type (ii) occurrences Next, we count the consecutive occurrences \((i, j)\) where i and j are below \(b_1\) and \(b_2\), respectively. We will use the precomputed table, but we have to be careful not to overcount. By construction, \(M(b_1,b_2)[\alpha ,\min (\lfloor \frac{n}{\tau }\rfloor , \beta )]\) is the number of consecutive occurrences \((i',j')\) of \(\text {str}(b_1)\) and \(\text {str}(b_2)\) with \(\alpha \le j' - i' \le \min (\lfloor \frac{n}{\tau }\rfloor , \beta )\). However, not all of these occurrences \((i',j')\) are necessarily consecutive occurrences of \(P_1\) and \(P_2\), as there could be an occurrence of \(P_1\) in \(C_1\) or of \(P_2\) in \(C_2\) between \(i'\) and \(j'\). We call such a pair \((i',j')\) a false occurrence. See Fig. 1b. We proceed as follows.

Fig. 1

a Any consecutive occurrence \((i, j)\) of \(P_1\) and \(P_2\) is either also a consecutive occurrence of \(\text {str}(b_1)\) and \(\text {str}(b_2)\), or i or j is within the respective cluster. The suffix array is shown at the bottom with the corresponding ranges marked. b Example of a false occurrence. Here \((i',j')\) is a consecutive occurrence of \(\text {str}(b_1)\) and \(\text {str}(b_2)\), but not a consecutive occurrence of \(P_1\) and \(P_2\), due to i. The string S is shown at the bottom with the positions of the occurrences marked

  1. Set \(c = M(b_1,b_2)[\alpha ,\min (\lfloor \frac{n}{\tau }\rfloor , \beta )]\).

  2. Construct the lists \(L_i\) containing the leaves in \(C_i\) that are below \(\text {locus}(P_i)\), sorted by text order, for \(i = 1,2\). We can obtain the lists as follows. Let [a, b] be the range of \(\text {locus}(P_i)\) and \([a',b']=\text {range}(b_i)\). Sort the leaves in \([a,a'-1] \cup [b'+1,b]\) using their local rank.

  3. Until both lists are empty, iteratively pick and remove the smallest element e from the start of either list. There are two cases.

  • e is an element of \(L_1\):

    • Compute \(j' = \textsf {RangeSuccessor}(\text {range}(b_2), e)\) to get the closest occurrence of \(\text {str}(b_2)\) after e.

    • Compute \(i' = \textsf {RangePredecessor}(\text {range}(b_1), j')\) to get the closest occurrence of \(\text {str}(b_1)\) before \(j'\).

  • e is an element of \(L_2\):

    • Compute \(i' = \textsf {RangePredecessor}(\text {range}(b_1), e)\) to get the closest occurrence of \(\text {str}(b_1)\) before e.

    • Compute \(j' = \textsf {RangeSuccessor}(\text {range}(b_2), i')\) to get the closest occurrence of \(\text {str}(b_2)\) after \(i'\).

If \(\alpha \le j'-i' \le \min (\lfloor \frac{n}{\tau }\rfloor , \beta )\) and \(i'< e< j'\), we decrement c by one. We skip any subsequent occurrences that are also inside \((i',j')\). As the lists are sorted by text order, all occurrences that are within the same consecutive occurrence \((i',j')\) are handled in sequence.
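Step 3 can be sketched as follows. Sorted position lists B1, B2 of the occurrences of \(\text {str}(b_1)\) and \(\text {str}(b_2)\), queried by binary search, stand in for the range-successor queries on \(\text {range}(b_1)\) and \(\text {range}(b_2)\); the caller passes \(\beta \) already capped at \(\lfloor n/\tau \rfloor \).

```python
from bisect import bisect_left, bisect_right
from heapq import merge

def correct_count(c, L1, L2, B1, B2, alpha, beta):
    # c: the precomputed M(b1, b2)[alpha, beta]. L1, L2: text positions of
    # the cluster leaves below locus(P1) / locus(P2), sorted by text order.
    # Decrements c exactly once per false occurrence (i', j').
    last = None  # last false occurrence handled; duplicates arrive in sequence
    for e, which in merge(((p, 1) for p in L1), ((p, 2) for p in L2)):
        if which == 1:
            t = bisect_right(B2, e)            # closest str(b2) after e
            if t == len(B2):
                continue
            j2 = B2[t]
            s = bisect_left(B1, j2) - 1        # closest str(b1) before j'
            if s < 0:
                continue
            i2 = B1[s]
        else:
            s = bisect_left(B1, e) - 1         # closest str(b1) before e
            if s < 0:
                continue
            i2 = B1[s]
            t = bisect_right(B2, i2)           # closest str(b2) after i'
            if t == len(B2):
                continue
            j2 = B2[t]
        if alpha <= j2 - i2 <= beta and i2 < e < j2 and (i2, j2) != last:
            c -= 1                             # (i', j') is a false occurrence
            last = (i2, j2)
    return c
```

For example, if \(\text {str}(b_1)\) occurs at positions 0 and 10 and \(\text {str}(b_2)\) at 5 and 20, the consecutive pairs are (0, 5) and (10, 20); a cluster leaf of \(P_1\) at position 2 makes (0, 5) a false occurrence and decrements c once, even if a second leaf at position 3 lies inside the same pair.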

Finally, we add the counts of the different type of occurrences.

Correctness Consider a consecutive occurrence \((i, j)\) with \(j-i > \frac{n}{\tau }\). Such a pair must span a segment boundary, i.e., i and j cannot be in the same segment. As \((i, j)\) is a consecutive occurrence, i is the last occurrence of \(P_1\) in its segment and j is the first occurrence of \(P_2\) in its segment. With the \(\textsf {RangePredecessor}\) queries we find all occurrences of \(P_1\) that are the last in their segment. We thus check and count all valid occurrences of large distance in the initial pass over the segments.

If either locus is in a small subtree, then we use \(\textsf {FindConsecutive}_{P_2}(.)\) or \(\textsf {FindConsecutive}_{P_1}(.)\) on the leaves below that locus, which by the arguments in Sect. 3.1 will find all consecutive occurrences.

If neither locus is in a small subtree, then both loci are on a spine. To count type (i) occurrences we use \(\textsf {FindConsecutive}_{P_2}(i)\) for all leaves i below \(\text {locus}(P_1)\) in \(C_1\) and \(\textsf {FindConsecutive}_{P_1}(j)\) for all leaves j below \(\text {locus}(P_2)\) in \(C_2\). However, any valid occurrence \((i, j)\) with both \(i \in C_1\) and \(j \in C_2\) is found by both operations. Therefore, whenever we find a valid occurrence \((i, j)\) via \(\textsf {FindConsecutive}_{P_1}(j)\) for \(j \in C_2\), we only count the occurrence if i is below \(b_1\). Thus we count all type (i) occurrences exactly once.

To count type (ii) occurrences we start with \(c=M(b_1,b_2)[\alpha ,\min (\lfloor \frac{n}{\tau }\rfloor , \beta )]\), which is the number of consecutive occurrences \((i',j')\) of \(\text {str}(b_1)\) and \(\text {str}(b_2)\) with \(\alpha \le j' - i' \le \min (\lfloor \frac{n}{\tau }\rfloor , \beta )\). Each \((i',j')\) is either also a consecutive occurrence of \(P_1\) and \(P_2\), or there exists an occurrence of \(P_1\) or \(P_2\) between \(i'\) and \(j'\). Let \((i',j')\) be a false occurrence and let, w.l.o.g., i be an occurrence of \(P_1\) with \(i'<i<j'\). Then i is a leaf in \(C_1\), since \((i',j')\) is a consecutive occurrence of \(\text {str}(b_1)\) and \(\text {str}(b_2)\). In step 3 we check, for each leaf inside the clusters below the loci, whether it lies between a consecutive occurrence \((i',j')\) of \(\text {str}(b_1)\) and \(\text {str}(b_2)\) with \(\alpha \le j' - i' \le \min (\lfloor \frac{n}{\tau }\rfloor ,\beta )\). In that case \((i',j')\) is a false occurrence and we adjust the count c. As \((i',j')\) can have multiple occurrences of \(P_1\) and \(P_2\) inside it, we skip subsequent occurrences inside \((i',j')\). After adjusting for false occurrences, c is the number of type (ii) occurrences.

Time Analysis We find the loci in \(O(|P_1| + |P_2|)\) time. Then we perform a number of range successor and FindConsecutive queries. The time for a FindConsecutive query is bounded by the time to do a constant number of range successor queries. To count the large distances we check at most \(\tau \) segment boundaries and thus perform \(O(\tau )\) range successor and FindConsecutive queries.

For small distances, if either locus is not on a spine we check the leaves below that locus. There are at most \(\tau \) such leaves due to the clustering. To count type (i) occurrences we check the leaves below the loci that are inside the clusters. There are at most \(2\tau \) such leaves in total. To count type (ii) occurrences we check two lists constructed from the leaves inside the clusters below the loci. There are again at most \(2\tau \) such leaves in total. For each of these \(O(\tau )\) leaves we use a constant number of range successor and FindConsecutive queries. Thus the time for this part is bounded by the time to perform \(O(\tau )\) range successor queries.

Using the data structure of Nekrich and Navarro [46], each range successor query takes \(O(\log ^{\epsilon } n)\) time so the total time for these queries is \(O(\tau \log ^{\epsilon } n)\). For type (ii) occurrences we sort two lists of size at most \(\tau \) from a universe of size \(\tau \), which we can do in \(O(\tau )\) time. Thus, the total query time is \(O(|P_1| + |P_2| + \tau \log ^{\epsilon } n)\).

Setting \(\tau =\Theta (n^{2/3})\) we get a data structure that uses \(O\left( n + n^3/\tau ^3\right) = O(n)\) space and has query time \(O(|P_1| + |P_2| + \tau \log ^{\epsilon } n) = O(|P_1| + |P_2| + n^{2/3} \log ^{\epsilon } n)\), for constant \(\epsilon > 0\). We answer an \(\mathsf {Exists}\) query with a \(\mathsf {Count}\) query. In summary, we have the following result:

Lemma 2

Given a string of length n and constant \(\epsilon >0\), we can construct an O(n) space data structure that supports \(\mathsf {Exists}(P_1, P_2, \alpha , \beta )\) and \(\mathsf {Count}(P_1, P_2, \alpha , \beta )\) queries in \(O(|P_1| + |P_2| + n^{2/3}\log ^{\epsilon }n)\) time.

4 Reporting

In this section, we describe our data structure for reporting queries. Note that in Sect. 3, we explicitly find all valid occurrences except for type (ii) occurrences, where we use the precomputed values. In this section, we describe how we can use a recursive scheme to report these.

The main idea, inspired by fast set intersection by Cohen and Porat [19], is to build a recursive binary structure which allows us to recursively divide the problem into subproblems of half the size. Intuitively, the subdivision is a binary tree where every node contains the suffix tree of a substring of S. We use this structure to find type (ii) occurrences by recursing on smaller trees. We define the binary decomposition of the suffix tree next. The details of the full solution follow after that.

4.1 Induced Suffix Tree Decomposition

Let T be the suffix tree of a string S of length n. For an interval [a, b] of text positions, we define T[a, b] to be the subtree of T induced by the leaves in [a, b]: That is, we consider the subtree consisting of the leaves in [a, b] together with their ancestors. We then delete each node that has only one child in the subtree and contract its ingoing and outgoing edge. See Fig. 2.

Fig. 2

The suffix tree of NANANANABATMAN$ together with its children trees T[0, 7] and T[8, 14]. The red crosses show a node v in the parent tree and the two loci of \(\text {str}(v)\) in the two children trees

The induced suffix tree decomposition of T now consists of a higher level binary tree structure, the decomposition tree, where each node corresponds to an induced subtree of the suffix tree. The root corresponds to \(T[0,n-1]\), and whenever we move down in the decomposition tree, the interval splits in half. We also associate a level with each of the induced subtrees, which is their depth in the decomposition tree. In more detail, the decomposition tree is a binary tree such that:

  • The root of the decomposition tree corresponds to \(T[0,n-1]\) and has level 0.

  • For each T[a, b] of level i in the decomposition, if \(b-a>1\), its two children in the decomposition tree are T[a, c] and \(T[c+1,b]\), where \(c=\lfloor \frac{a+b}{2}\rfloor \); we will sometimes refer to these as “children trees” to differentiate them from children in the suffix tree.

The decomposition tree is a balanced binary tree and the total size of the induced subtrees in the decomposition is \(O(n\log n)\): There are at most \(2^i\) decomposition tree nodes on level i, each of which corresponds to an induced subtree of size \(O\left( \frac{n}{2^i}\right) \), and thus the total size of the trees on each of the \(O(\log n)\) levels is O(n).
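The interval structure and the size bound can be checked directly with a small sketch (the `decomposition_intervals` helper is ours, not part of the construction):

```python
def decomposition_intervals(a, b, level=0, out=None):
    """All (level, a, b) triples of the decomposition tree rooted at
    T[a, b]; each node with b - a > 1 splits at c = floor((a + b) / 2)."""
    if out is None:
        out = []
    out.append((level, a, b))
    if b - a > 1:
        c = (a + b) // 2
        decomposition_intervals(a, c, level + 1, out)
        decomposition_intervals(c + 1, b, level + 1, out)
    return out

n = 1024
ivs = decomposition_intervals(0, n - 1)
# level i holds at most 2^i disjoint intervals covering [0, n-1], so the
# total size over all O(log n) levels is O(n log n)
total = sum(b - a + 1 for _, a, b in ivs)
```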

However, as we show next, we can implicitly store the induced suffix tree decomposition via a 2D-range reporting data structure.

Lemma 3

Let s(n) be the space and t(n) the per-occurrence reporting time of a 2D-range reporting data structure. Then, there exists an \(O(n+s(n))\) space data structure which allows reporting all occurrences of P in any T[a, b] in \(O(|P|+t(n)\cdot \text {occ})\) time, where \(\text {occ}\) is the number of occurrences of P in T[a, b]. Further, subsequent queries for the same pattern P and a different \(T[a',b']\) can be answered in \(O(t(n)\cdot \text {occ})\) time.

Proof

We store the suffix tree and a 2D-range reporting data structure over the suffix array, i.e., there is a point for every (i, j) such that \(SA(i)=j\). To answer a query for P and T[a, b], we find the suffix array range [x, y] of P by matching in the suffix tree. Then we query the range reporting data structure for \([x,y]\times [a,b]\). \(\square \)
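A minimal sketch of the proof idea, with a naive suffix array and a linear scan standing in for the 2D-range reporting structure (all helper names are ours):

```python
from bisect import bisect_left

def suffix_array(s):
    """Naive suffix array; the implicit 2D point set is {(i, SA(i))}."""
    return sorted(range(len(s)), key=lambda i: s[i:])

def occurrences_in(s, sa, p, a, b):
    """Occurrences of p in T[a, b]: compute the suffix array range [x, y]
    of p (here by binary search over the sorted suffixes; the real
    structure matches in the suffix tree), then answer the 2D query
    [x, y] x [a, b] -- the scan stands in for the range structure."""
    suf = [s[i:] for i in sa]
    x = bisect_left(suf, p)
    y = x
    while y < len(suf) and suf[y].startswith(p):
        y += 1
    return sorted(sa[i] for i in range(x, y) if a <= sa[i] <= b)
```

Subsequent queries for the same p reuse [x, y] and only repeat the 2D query, matching the second claim of the lemma.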

Let \(v_{a,b}\) be the locus of \(\text {str}(v)\) in T[a, b], if it exists. Lemma 3 shows that we do not need to explicitly find \(v_{a,b}\) for reporting; however, the notion will be useful for describing the next part. In the following, we build a counting data structure as in Sect. 3 for each induced subtree in the suffix tree decomposition. For this, we need to store a cluster decomposition as in Lemma 1 for each induced subtree. We show how to implicitly store such a decomposition next.

Lemma 4

Assume we cluster each induced subtree of level i with cluster size \(\tau _i\). We can store an additional \(O\left( n+\sum _{i=1}^{\lceil \log n\rceil } \frac{n}{\tau _i}\right) \) words which allow, for any node v in the suffix tree and any T[a, b], answering the following query in \(O(\log \log n)\) time: decide whether \(v_{a,b}\) lies on a spine in the cluster decomposition of T[a, b] and, if so, output the lower boundary node of that cluster.

Proof

We can identify each node in the suffix tree with its pre-order number. Then the subtree rooted at any node is defined by a continuous interval. We store this interval at every node in the suffix tree. For the cluster decomposition of any T[a, b], we store the boundary nodes in a successor data structure (i.e., we store the pre-order numbers of the nodes with respect to the full suffix tree). Using e.g. y-fast tries [50], this uses space linear in the number of boundary nodes and allows successor queries in \(O(\log \log n)\) time. Since there are at most \(2^i\) induced subtrees on level i, each of which has \(O(\frac{n/2^i}{\tau _i})\) boundary nodes, the space bound follows. To locate the boundary node, we find the successor of v’s pre-order number in the structure for T[a, b], within the interval defined by v’s descendants. Note that since \(v_{a,b}\) is the unique closest descendant of v present in T[a, b], the successor of v and the successor of \(v_{a,b}\) are the same. By the properties of the cluster decomposition, the lower boundary node of the cluster containing \(v_{a,b}\) is an ancestor of all boundary nodes in the subtree rooted at \(v_{a,b}\); therefore, the successor query returns the correct boundary node, if it exists. If no boundary node exists in the range, then it is either because \(\text {str}(v)\) is not present in T[a, b] (in which case none of v’s descendants are in T[a, b] either), or \(v_{a,b}\) is in a leaf cluster or in a small subtree hanging off a spine. \(\square \)
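The successor query in the proof can be sketched as follows; a sorted list with binary search stands in for the y-fast trie, and the class and method names are ours:

```python
from bisect import bisect_left

class BoundarySuccessor:
    """Pre-order numbers of the boundary nodes of one T[a, b]'s cluster
    decomposition, stored for successor queries."""
    def __init__(self, boundary_preorders):
        self.pre = sorted(boundary_preorders)

    def lower_boundary(self, pre_v, subtree_end):
        """Successor of v's pre-order number, restricted to the interval
        [pre_v, subtree_end] spanned by v's subtree; None if no boundary
        node lies below v in this induced subtree."""
        i = bisect_left(self.pre, pre_v)
        if i < len(self.pre) and self.pre[i] <= subtree_end:
            return self.pre[i]
        return None
```

A y-fast trie gives the same interface with \(O(\log \log n)\) queries and space linear in the number of stored keys.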

Lemma 3 shows how to implicitly maintain the induced suffix tree decomposition, and Lemma 4 shows how to implicitly maintain cluster decompositions on the subtrees in the decomposition. In the reporting data structure described in the rest of the section, we will run \(\mathsf {Exists}\) queries on subtrees in the decomposition as a subroutine. For this, we need some additional functionality, i.e., deciding if the locus of a pattern P is on the spine of a cluster, locating the next boundary node if so, and outputting all leaves below the locus and within the cluster in text order. We now describe how to do this using the data structures from Lemmas 3 and 4. We store the data structure from Lemma 3, using a range successor data structure as the 2D-range reporting data structure, and the data structure from Lemma 4. To support the above-mentioned functionalities, we first find the locus v of P and its suffix array range [x, y]. Note that if v is the locus of P in \(T[0,n-1]\), then by definition \(v_{a,b}\) is the locus of P in T[a, b]. Thus, we can find the lower boundary node using Lemma 4, if it exists. If the suffix array range of this node is \([x',y']\), then all the occurrences within the cluster in T[a, b] can be found by two 2D-reporting queries for \([x,x']\times [a,b]\) and \([y',y]\times [a,b]\). Since we use an orthogonal range successor data structure for the 2D-range reporting, we report the occurrences in each range in text order, which we can merge to obtain the full list of occurrences in text order. If the lower boundary node does not exist, then either there are no more occurrences of P in T[a, b], or the locus is not on the spine of a cluster. In either case, the occurrences are given by querying for \([x,y]\times [a,b]\). Next, we describe the data structure and algorithm for reporting in detail.

4.2 Data Structure

The data structure consists of:

  • The suffix tree of S, with some additional information for each node. For each node v we store:

    • the range v spans in the suffix array, i.e., \(\text {range}(v)\);

    • its pre-order number;

    • the interval of pre-order numbers spanned by the nodes in v’s subtree.

  • The orthogonal range successor data structure from Nekrich and Navarro [46] on the suffix array.

  • For each induced subtree, the following cluster decomposition, implemented as in Lemma 4: Denote \(n_i=n/2^i\) and let \(\tau _i=n_i^{2/3}\log ^{1/3}n\). We store the cluster decomposition from Lemma 1 with parameter \(\tau _i\) for each induced subtree of level i as long as \(n_i>\tau _i\), or equivalently, \(n_i>\log n\).

  • For each pair of boundary nodes (u, v) in a cluster decomposition in T[a, b], the counts for the \(n_i/\tau _i\) smallest distances, as in Sect. 3.

Space Analysis The total space for the structure is linear: The structure from Lemma 4 uses space \(O\left( \sum _{i=1}^{\lceil \log n\rceil } \frac{n}{\tau _i}\right) =O\left( \sum _{i=1}^{\lceil \log n\rceil } 2^i\frac{n_i}{\tau _i}\right) =O\left( \sum _{i=1}^{\lceil \log n\rceil } 2^i\left( \frac{n_i}{\log n}\right) ^{1/3}\right) =O\left( \sum _{i=1}^{\lceil \log n\rceil } 2^i\left( \frac{n_i}{\log n}\right) \right) =O(n)\), where the penultimate step uses that \(n_i>\log n\) for every stored level. The vectors storing the small distances for one T[a, b] of level i use space \(O(n_i^3/\tau _i^3)=O(n_i/\log n)\). Summing up over all levels yields linear space.

4.3 Query Algorithm

The main idea behind the algorithm is the following: For large distances, as in Sect. 3, we implicitly segment S to find all consecutive occurrences of at least a certain distance. For small distances, we are going to use the cluster decomposition and counting arrays to decide whether valid occurrences exist. That is, if one of the loci is in a small subtree, we use \(\textsf {FindConsecutive}_{P_2}(.)\) resp. \(\textsf {FindConsecutive}_{P_1}(.)\) to find all consecutive occurrences. Else, we perform a query as in Sect. 3 to decide whether any valid occurrences exist, and if yes, we recurse on smaller subtrees.

The idea here is that, in the induced suffix tree decomposition, the trees are divided in half by text position; therefore, a consecutive occurrence will either be fully contained in the left child tree, fully contained in the right child tree, or have the property that the occurrence of \(P_1\) is the rightmost occurrence in the left child tree and the occurrence of \(P_2\) is the leftmost occurrence in the right child tree. We check this border case each time we recurse.

In detail, we do the following: We find the loci of \(P_1\) and \(P_2\) in the suffix tree. As in the previous section, we check \(\tau _0\) segment boundaries with \(\tau _0=\Theta (n^{2/3}\log ^{1/3} n)\) to find all consecutive occurrences with distance within \([\max (\alpha ,\lfloor n^{1/3}/\log ^{1/3}n\rfloor ),\beta ]\). Then, we only have to find consecutive occurrences of distance within \([\alpha , \min (\beta , \lfloor n^{1/3}/\log ^{1/3}n\rfloor )]\) in \(T=T[0,n-1]\). In general, let \(n_i=\lfloor \frac{n}{2^i}\rfloor \) and \(\beta _i=\min (\beta , \lfloor n_i^{1/3}/\log ^{1/3} n\rfloor )\) and let T[a, b] be an induced subtree of level i.

To find all consecutive occurrences with distance within \([\alpha , \beta _i]\) in T[a, b] of level i, we recursively do the following:

  • First, we use Lemma 4 to decide for the loci of \(P_1\) and \(P_2\) in T[a, b] whether they are on the spine of a cluster, and to find the lower boundary nodes if they are. If any of the loci is not on a spine of a cluster, we find all occurrences of the corresponding pattern in T[a, b] using Lemma 3, and all consecutive occurrences as in Sect. 3 using \(\textsf {FindConsecutive}_{P_2}(.)\) resp. \(\textsf {FindConsecutive}_{P_1}(.)\). For each of them, we check whether it is valid; we report all valid ones and terminate.

  • Else, we use the query algorithm for small distances from Sect. 3 to decide whether a valid occurrence with distance within \([\alpha ,\beta _i]\) exists in T[a, b]. By the discussion at the end of Sect. 4.1 we have all the functionality we need to repeat the algorithm from Sect. 3: we can find all occurrences of \(P_1\) or \(P_2\) that are within a cluster in text order in \(O(\log ^{\epsilon } n)\) time per occurrence; we can check if an occurrence is below a boundary node by using the pre-order identification; and finally, we can use \(\textsf {FindConsecutive}_{P_2}(.)\), \(\textsf {FindConsecutive}_{P_1}(.)\) and range successor queries with the same functionality as before.

If such a valid occurrence exists, we recurse; that is, set \(c=\lfloor \frac{a+b}{2}\rfloor \). We use \(\textsf {RangePredecessor}\) to find the last occurrence of \(P_1\) before and including c, and \(\textsf {RangeSuccessor}\) to find the first occurrence of \(P_2\) after c. Then we check if they are consecutive (again using \(\textsf {RangePredecessor}\) and \(\textsf {RangeSuccessor}\)), and whether the pair is valid. If yes, we add it to the output. Then, for both S[a, c] and \(S[c+1,b]\), we implicitly partition them into segments of size \(\lfloor n_{i+1}^{1/3}/\log ^{1/3}n\rfloor \) and find and output all valid occurrences of distance \(>n_{i+1}^{1/3}/\log ^{1/3}n\). We recurse on the children trees T[a, c] and \(T[c+1,b]\) to find all consecutive occurrences of distance within \([\alpha , \beta _{i+1}]\).
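The border-case check at the split point c can be sketched on sorted occurrence lists, with `bisect` standing in for the RangePredecessor/RangeSuccessor queries (function and variable names are ours):

```python
from bisect import bisect_left, bisect_right

def border_pair(occ1, occ2, c, alpha, beta):
    """Last occurrence of P1 at or before c and first occurrence of P2
    after c; returned if they form a consecutive occurrence with
    distance in [alpha, beta], else None. occ1/occ2 are the sorted
    text positions of P1 resp. P2."""
    i = bisect_right(occ1, c) - 1
    j = bisect_left(occ2, c + 1)
    if i < 0 or j == len(occ2):
        return None
    p1, p2 = occ1[i], occ2[j]
    k = bisect_left(occ1, p1 + 1)           # next P1 occurrence after p1
    if k < len(occ1) and occ1[k] < p2:      # a P1 occurrence in between
        return None
    if bisect_left(occ2, p1 + 1) != j:      # a P2 occurrence in between
        return None
    return (p1, p2) if alpha <= p2 - p1 <= beta else None
```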

Correctness At any point before we recurse on level i, we check all consecutive occurrences of distance \(>n_{i+1}^{1/3}/\log ^{1/3}n\) by segmenting the current substring of S. By the arguments of the previous section, we will find all such valid occurrences. Thus, on the subtrees of level \(i+1\), we need only care about consecutive occurrences with distance in \([\alpha ,\beta _{i+1}]\).

By the properties of the induced suffix tree decomposition, a consecutive occurrence of \(P_1\) and \(P_2\) that is present in T[a, b] will either be fully contained in T[a, c], or in \(T[c+1,b]\), or the occurrence of \(P_1\) is the last occurrence before and including c and the occurrence of \(P_2\) is the first occurrence after c. We check the border case each time we recurse. Thus, no consecutive occurrences are lost when we recurse. If we stop the recursion, it is either because one of the loci is in a small subtree or because no valid occurrence with distance within \([\alpha , \beta _i]\) exists in T[a, b]. In the first case we find all valid occurrences with distance within \([\alpha , \beta _i]\) in T[a, b] by the same arguments as in Sect. 3. Thus, we find all valid occurrences of \(P_1\) and \(P_2\).

Time Analysis For finding the loci of \(P_1\) and \(P_2\), we spend \(O(|P_1|+|P_2|)\) time in the initial suffix tree. We can then navigate the induced subtrees in the decomposition using Lemmas 3 and 4. The orthogonal range successor data structure functions as the 2D-range reporting data structure for Lemma 3. Thus, the time consumption is dominated by the number of queries to the orthogonal range successor data structure, which we will count next. Consider the recursion part of the algorithm as a traversal of the decomposition tree, and consider the subtree of the decomposition tree we traverse. Each leaf of that subtree is a node where we stop recursing. Since we only recurse if we know there is an occurrence to be found, there are at most \(O(\text {occ})\) leaves. Thus, we traverse at most \(O(\text {occ} \log n)\) nodes.

Each time we recurse, we spend a constant number of RangeSuccessor and RangePredecessor queries to check the border cases. Additionally, we spend \(O(n_i^{2/3}\log ^{1/3}n)\) such queries on each node of level i that we visit in the decomposition tree: For finding the “large” occurrences, and additionally either for reporting everything within a small subtree or doing an existence query. For finding large occurrences, there are \(O(n_i^{2/3}\log ^{1/3}n)\) segments to check. The number of orthogonal range successor queries used for existence queries or reporting within a small subtree is bounded by the number of leaves within a cluster, which is also \(O(n_i^{2/3}\log ^{1/3}n)\).

Now, let x be the number of decomposition tree nodes we traverse and let \(l_i\), \(i=1,\dots ,x\), be the level of each such node. The goal is to bound \(\sum _{i=1}^x \left( \frac{n}{2^{l_i}}\right) ^{2/3}\log ^{1/3}n\). By the argument above, \(x=O(\text {occ} \log n)\). Note that because the decomposition tree is binary we have that \(\sum _{i=1}^x \frac{1}{2^{l_i}}\le \log n\). The number of queries to the orthogonal range successor data structure is thus asymptotically bounded by:

$$\begin{aligned} \sum _{i=1}^x \left( \frac{n}{2^{l_i}}\right) ^{2/3}\log ^{1/3}n&=n^{2/3}\log ^{1/3}n\sum _{i=1}^x \left( \frac{1}{2^{l_i}}\right) ^{2/3}\cdot 1\\&\le n^{2/3}\log ^{1/3}n\left( \sum _{i=1}^x \left( \frac{1}{2^{l_i}}\right) ^{\frac{2}{3}\cdot \frac{3}{2}}\right) ^{2/3}\left( \sum _{i=1}^x 1^3\right) ^{1/3}\\&= n^{2/3}\log ^{1/3}n\left( \sum _{i=1}^x \frac{1}{2^{l_i}}\right) ^{2/3}x^{1/3}\\&=O(n^{2/3}\text {occ}^{1/3}\log ^{4/3} n) \end{aligned}$$

For the inequality, we use Hölder’s inequality, which holds for all \((x_1,\dots ,x_k)\in {\mathbb {R}}^k\) and \((y_1,\dots , y_k)\in {\mathbb {R}}^k\) and p and q both in \((1,\infty )\) such that \(1/p+1/q = 1\):

$$\begin{aligned} \sum _{i=1}^{k}|x_iy_i|\le \left( \sum _{i=1}^k |x_i|^p\right) ^{1/p}\left( \sum _{i=1}^k |y_i|^q\right) ^{1/q} \end{aligned}$$
(1)
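As a quick numeric sanity check of (1) with the parameters used in this proof (\(p=3/2\), \(q=3\), \(x_i=(1/2^{l_i})^{2/3}\), \(y_i=1\); the helper is ours):

```python
import random

def holder_sides(xs, ys, p, q):
    """Left- and right-hand side of Hoelder's inequality (1)."""
    lhs = sum(abs(x * y) for x, y in zip(xs, ys))
    rhs = (sum(abs(x) ** p for x in xs) ** (1 / p)
           * sum(abs(y) ** q for y in ys) ** (1 / q))
    return lhs, rhs

random.seed(0)
ls = [random.randint(0, 10) for _ in range(50)]   # levels l_i
xs = [(1 / 2 ** l) ** (2 / 3) for l in ls]
lhs, rhs = holder_sides(xs, [1.0] * len(xs), 3 / 2, 3)
```

Here the right-hand side evaluates to \(\left( \sum _i 1/2^{l_i}\right) ^{2/3} x^{1/3}\), exactly the bound used in the derivation.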

We apply (1) with \(p=3/2\) and \(q=3\). Since the data structure of Nekrich and Navarro [46] uses \(O(\log ^{\epsilon } n)\) time per query, the total running time of the algorithm is \(O(|P_1|+|P_2|+n^{2/3}\text {occ}^{1/3}\log ^{4/3+\epsilon } n)\). In summary, we have the following result:

Lemma 5

Given a string of length n and constant \(\epsilon >0\), we can construct an O(n) space data structure that supports \(\mathsf {Report}(P_1, P_2, \alpha , \beta )\) queries in \({O}(|P_1| + |P_2| + n^{2/3} \text {occ}^{1/3}\log ^{4/3 + \epsilon } n)\) time, where \(\text {occ}\) is the size of the output.

Combining Lemmas 2 and 5 we have shown Theorem 1.

We note that by adjusting \(\tau _i\) and choosing different range reporting data structures, other tradeoffs are possible. For example, removing the \(\log ^{1/3} n\) factors from \(\tau _i\) and using the range reporting data structure by Zhou et al. [51], we get an \(O(n \log n)\) space data structure with query time \(O(|P_1|+|P_2|+n^{2/3}\text {occ}^{1/3} \log n \log \log n)\) (which is the tradeoff stated in the conference version of this paper [13]).

5 Lower Bound

We now prove the conditional lower bound from Theorem 2 based on set intersection. We use the framework and conjectures as stated in Goldstein et al. [24]. Throughout the section, let \({\mathcal {I}} = S_1,\dots , S_m\) be a collection of m sets of total size N from a universe U. The SetDisjointness problem is to preprocess \({\mathcal {I}}\) into a compact data structure, such that given any pair of sets \(S_i\) and \(S_j\), we can quickly determine if \(S_i \cap S_j = \emptyset \). We use the following conjecture.

Conjecture 1

Any data structure that can answer SetDisjointness queries in t query time must use \(\widetilde{\Omega }\left( \frac{N^{2}}{t^2}\right) \) space.

5.1 SetDisjointness with Fixed Frequency

We define the following weaker variant of the SetDisjointness problem: the f-FrequencySetDisjointness problem is the SetDisjointness problem where every element occurs in precisely f sets. We now show that any solution to the f-FrequencySetDisjointness problem implies a solution to SetDisjointness, matching the complexities up to polylogarithmic factors.

Lemma 6

Assuming the Strong SetDisjointness Conjecture, every data structure that can answer f-FrequencySetDisjointness queries in time \(O(N^{\delta })\), for \(\delta \in [0,1/2]\), must use \(\Omega \left( N^{2-2\delta -o(1)}\right) \) space.

Proof

Assume there exists a data structure D which can solve the f-FrequencySetDisjointness problem in time \(O(N^{\delta })\) and space \(O\left( N^{2-2\delta -\epsilon }\right) \) for a constant \(\epsilon \) with \(0<\epsilon <1\). Let \({\mathcal {I}}=S_1,\dots , S_m\) be a given instance of SetDisjointness, where each \(S_i\) is a set of elements from universe U, and assume w.l.o.g. that m is a power of two.

Define the frequency of an element, \(f_e\), as the number of sets in \({\mathcal {I}}\) that contain the element e. We construct \(\log m\) instances \({\mathcal {I}}_1,\dots , {\mathcal {I}}_{\log m}\) of the f-FrequencySetDisjointness problem. The instances are sorted such that \({\mathcal {I}}_1\) handles the least frequent elements in \({\mathcal {I}}\), while \({\mathcal {I}}_{\log m}\) handles the most frequent elements. More precisely, for each j, \(1\le j\le \log m\), the instance \({\mathcal {I}}_j\) contains the following sets:

  • For each \(i\in [1,m]\) a set \(S_i^j\) containing all \(e\in S_i\) that satisfy \(2^{j-1}\le f_e < 2^j\);

  • \(2^{j-1}\) “dummy sets”, which contain extra copies of elements to make sure that all elements have the same frequency. That is, we add every element with \(2^{j-1}\le f_e < 2^j\) to the first \(2^j-f_e\) dummy sets. These sets will not be queried in the reduction.

Clearly, \(S_i=\bigcup _j S^j_i\). Instance \({\mathcal {I}}_j\) has O(m) sets and every element occurs exactly \(2^j\) times. Further, the total number of elements in all the instances is at most 2N. We now build f-FrequencySetDisjointness data structures \(D_j=D({\mathcal {I}}_j)\) for each of the \(\log m\) instances.

To answer a SetDisjointness query for two sets \(S_{i_1}\) and \(S_{i_2}\), we query \(D_j\) for the sets \(S_{i_1}^j\) and \(S_{i_2}^j\), for each \(1\le j\le \log m \). If there exists a j such that \(S_{i_1}^j\) and \(S_{i_2}^j\) are not disjoint, we output that \(S_{i_1}\) and \(S_{i_2}\) are not disjoint. Else, we output that they are disjoint.

If there exists \(e\in S_{i_1}\cap S_{i_2}\), let j be such that \(2^{j-1}\le f_e < 2^j\). Then \(e\in S^j_{i_1}\cap S^j_{i_2}\), and we will correctly output that the sets are not disjoint. If \(S_{i_1}\) and \(S_{i_2}\) are disjoint, then, since \(S_{i_1}^j\) is a subset of \(S_{i_1}\) and \(S_{i_2}^j\) is a subset of \(S_{i_2}\), the queried sets are disjoint in every instance. Thus we also answer correctly in this case.

Let \(N_j\) denote the total number of elements in \({\mathcal {I}}_j\). For each j, we have \(N_j\le 2N\) and thus \(N_j^{2-2\delta -\epsilon }\le (2N)^{2-2\delta -\epsilon }\). Thus, the space complexity is asymptotically bounded by

$$\begin{aligned} \sum _{j=1}^{\lceil \log m\rceil }N_j^{2-2\delta -\epsilon } = O(N^{2-2\delta -\epsilon }\log m ). \end{aligned}$$

Similarly, we have \(N_j^{\delta }=O(N^{\delta })\) and so the time complexity is asymptotically bounded by

$$\begin{aligned} \sum _{j=1}^{\lceil \log m\rceil }N_j^{\delta } = O(N^{\delta }\log m ). \end{aligned}$$

Since for query time \(t = O(N^{\delta }\log m)\) the space bound \(O(N^{2-2\delta -\epsilon }\log m)\) is a polynomial factor \(N^{\epsilon }\) below \(\widetilde{\Omega }\left( \frac{N^2}{t^2}\right) \), this contradicts Conjecture 1. \(\square \)
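The splitting step in the proof of Lemma 6 can be sketched as follows (helper names are ours; elements of frequency exactly m are not handled in this sketch):

```python
from collections import defaultdict

def build_instances(sets):
    """Split a SetDisjointness instance into log m fixed-frequency
    instances: instance j keeps the elements e with 2^(j-1) <= f_e < 2^j
    and pads with 2^(j-1) dummy sets so each such element occurs exactly
    2^j times. The number of sets m is assumed to be a power of two."""
    m = len(sets)
    freq = defaultdict(int)
    for s in sets:
        for e in s:
            freq[e] += 1
    instances = {}
    for j in range(1, m.bit_length()):          # j = 1, ..., log m
        inst = [{e for e in s if 2 ** (j - 1) <= freq[e] < 2 ** j}
                for s in sets]
        dummies = [set() for _ in range(2 ** (j - 1))]
        for e, f in freq.items():
            if 2 ** (j - 1) <= f < 2 ** j:
                for d in range(2 ** j - f):     # first 2^j - f_e dummies
                    dummies[d].add(e)
        instances[j] = inst + dummies
    return instances

def disjoint(instances, i1, i2):
    """Answer a SetDisjointness query via the fixed-frequency instances."""
    return all(not (inst[i1] & inst[i2]) for inst in instances.values())
```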

5.2 Reduction to Gapped Indexing

We can reduce the f-FrequencySetDisjointness problem to \(\mathsf {Exists}\) queries of the gapped indexing problem: Assume we are given an instance of the f-FrequencySetDisjointness problem with a total of N elements. Each distinct element occurs f times. Assume again w.l.o.g. that the number of sets m is a power of two. Assign to each set \(S_i\) in the instance a unique binary string \(w_i\) of length \(\log m\). Build a string S as follows: Consider an arbitrary ordering \(e_1, e_2, \dots \) of the distinct elements present in the f-FrequencySetDisjointness instance. Let \(\$\) be an extra letter not in the alphabet. The first \(B=f\cdot \log m+f\) letters of S are the concatenation of \(w_i\$\) over all sets \(S_i\) containing \(e_1\), sorted by i. This block is followed by B copies of \(\$\). Then, we have B symbols consisting of the strings for each set containing \(e_2\), again followed by B copies of \(\$\), and so on. See Fig. 3 for an example.

Fig. 3
figure 3

Instance of f-FrequencySetDisjointness problem reduced to \(\mathsf {Exists}\). Alphabet \(\Sigma = \{0,1\}\) and fixed frequency \(f = 2\), resulting in block size \(B = 2 \cdot 2 + 2 = 6\)

To answer a query for two sets \(S_i\) and \(S_j\), where \(i<j\), we set \(P_1=w_i\) and \(P_2=w_j\), \(\alpha =0\), and \(\beta = B\). If the sets are disjoint, then occurrences of \(w_i\) and \(w_j\) always lie in different blocks, which are separated by B copies of \(\$\), so there are no occurrences which are at most B apart. Otherwise, \(w_i\) and \(w_j\) occur in the same block, and \(w_j\) comes after \(w_i\). The length of the string S is \(2N\log m+2N\): In the block for each element, we have \(\log m+1\) letters for each of its occurrences, and it is followed by a \(\$\) block of the same length.
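A sketch of the reduction string and the resulting query, with naive scanning standing in for the gapped index (helper names are ours):

```python
def build_reduction_string(sets, m_bits, f):
    """One block of length B = f * (m_bits + 1) per distinct element e,
    holding w_i $ for each set S_i containing e (sorted by i), followed
    by B copies of $; w_i is the m_bits-bit binary code of i."""
    B = f * (m_bits + 1)
    words = [format(i, "0{}b".format(m_bits)) for i in range(len(sets))]
    blocks = []
    for e in sorted({x for s in sets for x in s}):
        blk = "".join(words[i] + "$" for i, s in enumerate(sets) if e in s)
        blocks.append(blk + "$" * B)
    return "".join(blocks), B, words

def sets_intersect(S, B, words, i, j):
    """Exists(w_i, w_j, 0, B) answered by brute force: the sets share an
    element iff some occurrences of w_i and w_j are at most B apart."""
    occ1 = [k for k in range(len(S)) if S.startswith(words[i], k)]
    occ2 = [k for k in range(len(S)) if S.startswith(words[j], k)]
    return any(0 <= o2 - o1 <= B for o1 in occ1 for o2 in occ2)
```

The $ separators guarantee that a query pattern of length m_bits only matches aligned codewords, so occurrences within distance B must come from the same block.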

This means that if we can solve \(\mathsf {Exists}\) queries in s(n) space and \(t(n)+O(|P_1|+|P_2|)\) time, where n is the length of the string, we can solve the f-FrequencySetDisjointness problem in \(s(2N\log m+2N)\) space and \(t(2N\log m+2N)+O(\log m)\) time. Together with Lemma 6, Theorem 2 follows.

6 Gapped Indexing for \([0, \beta ]\) Gaps

In this section, we consider the special case where the queries are one-sided intervals of the form \([0,\beta ]\).

We give a data structure supporting the following tradeoffs:

Theorem 3

Given a string of length n and \(\epsilon >0\), we can construct an O(n) space data structure that supports \(\mathsf {Exists}(P_1,P_2,0,\beta )\) queries in \(O(|P_1| + |P_2| + \sqrt{n}\log ^{\epsilon } n)\) time and \(\mathsf {Count}(P_1,P_2,0,\beta )\) and \(\mathsf {Report}(P_1,P_2,0,\beta )\) queries in \(O(|P_1| + |P_2| + (\sqrt{n\cdot \text {occ}})\log ^{1/2+\epsilon } n)\) time, where \(\text {occ}\) is the size of the output.

Note that since the results match (up to \(\log \) factors) the best known results for set intersection, this is about as good as we can hope for. We mention here that for this specific problem, a similar tradeoff follows from the strategies used by Hon et al. [30]. The results from that paper include (among others) a data structure for documents such that given a query of two patterns \(P_1\) and \(P_2\) and a number k, one can output the k documents with the closest occurrences of \(P_1\) and \(P_2\). Thus, the problem is slightly different; however, with some adjustments, the results from Theorem 3 follow (up to a \(\log \) factor). We show a simple, direct solution.

The data structure is a simpler version of the data structure considered in the previous sections. The main idea is that for each pair of boundary nodes u and v, we do not have to store an array of distances, but only one number that carries all the information: the smallest distance of a consecutive occurrence of \(\text {str}(u)\) and \(\text {str}(v)\). Thus, for existence, we can cluster with \(\tau =\sqrt{n}\) to achieve linear space, and we do not need to check large distances separately. For the reporting solution, we store the decomposition from Sect. 4.1, and use the matrix M to decide where to recurse. In the following we will describe the details.

Existence data structure For solving Exists queries in this setting, we cluster the suffix tree with parameter \(\tau =\sqrt{n}\). Again, we store the linear space orthogonal range successor data structure by Nekrich and Navarro [46] on the suffix array. For each pair of boundary nodes (u, v), we store at M(u, v) the minimum distance of a consecutive occurrence of \(\text {str}(u)\) and \(\text {str}(v)\). The total space is linear. To query, we proceed similarly to Sect. 3 for the “small distances”: We find the loci of \(P_1\) and \(P_2\). If either of the loci is not on a spine, we check all consecutive occurrences using \(\textsf {FindConsecutive}_{P_2}(.)\) resp. \(\textsf {FindConsecutive}_{P_1}(.)\). If both loci are on a spine, let \(b_1\) and \(b_2\) denote the lower boundary nodes of the respective clusters and look up \(M(b_1,b_2)\). If \(M(b_1,b_2)\le \beta \), we can immediately return \(\textsc {Yes}\): If a valid occurrence \((i',j')\) of \(\text {str}(b_1)\) and \(\text {str}(b_2)\) exists, then either \((i',j')\) is a consecutive occurrence of \(P_1\) and \(P_2\), or there exists a consecutive occurrence of smaller distance. Otherwise, i.e., if \(M(b_1,b_2)>\beta \), all valid occurrences (i, j) have the property that either i is in the cluster of \(\text {locus}(P_1)\) or j is in the cluster of \(\text {locus}(P_2)\); in that case we check all such pairs using \(\textsf {FindConsecutive}_{P_2}(.)\) resp. \(\textsf {FindConsecutive}_{P_1}(.)\). The running time is \({O}(|P_1| + |P_2| + \sqrt{n}\log ^\epsilon n)\).

Reporting data structure The data structure consists of:

  • The suffix tree with the same auxiliary information per node as in Sect. 4.2.

  • The orthogonal range successor data structure from Nekrich and Navarro [46] on the suffix array.

  • For each induced subtree, the following cluster decomposition, implemented as in Lemma 4: Denote \(n_i=n/2^i\) and let \(\tau _i=\sqrt{n_i\log n}\). We store the cluster decomposition from Lemma 1 with parameter \(\tau _i\) for each induced subtree of level i as long as \(n_i>\tau _i\), or equivalently, \(n_i>\log n\).

  • For each pair of boundary nodes (u, v) in a cluster decomposition in T[a, b], the minimum distance of a consecutive occurrence of \(\text {str}(u)\) and \(\text {str}(v)\) in T[a, b].

Space analysis The total space for the structure is linear: The structure from Lemma 4 uses space \(O\left( \sum _{i=1}^{\lceil \log n\rceil } \frac{n}{\tau _i}\right) =O\left( \sum _{i=1}^{\lceil \log n\rceil } 2^i\frac{n_i}{\tau _i}\right) =O\left( \sum _{i=1}^{\lceil \log n\rceil } 2^i\left( \frac{n_i}{\log n}\right) ^{1/2}\right) =O\left( \sum _{i=1}^{\lceil \log n\rceil } 2^i\left( \frac{n_i}{\log n}\right) \right) =O(n)\), where the penultimate step uses that \(n_i>\log n\) for every stored level. The vectors storing the minimum distances for a T[a, b] of level i use space \(O(n_i^2/\tau _i^2)=O(n_i/\log n)\). Summing up over all levels yields linear space.

Reporting algorithm The algorithm follows a similar, but simpler, recursive structure to that of Sect. 4. We begin by finding the loci of \(P_1\) and \(P_2\). If either of the loci is not on a spine, we find all consecutive occurrences using \(\textsf {FindConsecutive}_{P_2}(.)\) resp. \(\textsf {FindConsecutive}_{P_1}(.)\), check if they are valid, report those, and terminate. If both loci are on a spine, we check \(M(b_1,b_2)\) for the lower boundary nodes \(b_1\) and \(b_2\). If \(M(b_1,b_2)>\beta \), all valid occurrences (i, j) have the property that either i is in the cluster of \(\text {locus}(P_1)\) or j is in the cluster of \(\text {locus}(P_2)\). We check all such pairs using \(\textsf {FindConsecutive}_{P_2}(.)\) resp. \(\textsf {FindConsecutive}_{P_1}(.)\), report all the valid occurrences, and terminate. If \(M(b_1,b_2)\le \beta \), we recurse on the children trees. That is, we check the border case and repeat the algorithm on the children trees, using Lemmas 3 and 4 for reporting and locating boundary nodes.

Time analysis We spend \(O(\tau _i)=O(\sqrt{\frac{n\log n}{2^{l_i}}})\) orthogonal range successor queries on the nodes of level \(l_i\) in the decomposition tree where we stop the recursion. For all other nodes we visit in the tree traversal, we only spend a constant number of successor and range successor queries. In total, we visit \(O(\text {occ}\log (n/\text {occ})+\text {occ})\) decomposition tree nodes (by following the analysis in [19]), and we spend \(O(\sqrt{\frac{n\log n}{2^{l_i}}})\) orthogonal range successor queries on \(O(\text {occ})\) many such nodes.

We use the same notation as in Sect. 4. By \(x=O(\text {occ})\) we now denote the number of nodes where we stop the algorithm and output. Since each such node can be seen as a leaf in a binary tree, \(\sum _{i=1}^x \frac{1}{2^{l_i}}\le 1\). We use the Cauchy–Schwarz inequality (the special case of Hölder's inequality with \(p=q=2\)). We get as an asymptotic bound for the number of orthogonal range successor queries:

$$\begin{aligned} \sum _{i=1}^x \sqrt{\frac{n\log n}{2^{l_i}}}&=\sqrt{n\log n}\sum _{i=1}^x \sqrt{\frac{1}{2^{l_i}}}\cdot 1 \\&\le \sqrt{n\log n} \sqrt{\sum _{i=1}^x \frac{1}{2^{l_i}}} \sqrt{\sum _{i=1}^x 1}\\&\le \sqrt{n\log n\cdot x} = O(\sqrt{n\log n\cdot \text {occ}}). \end{aligned}$$

Note that since \(\text {occ}\log (n/\text {occ})=O(\text {occ}\sqrt{n/\text {occ}})=O(\sqrt{n\cdot \text {occ}})\), this brings the total number of orthogonal range successor queries to \(O(\text {occ} + \sqrt{n\log n\cdot \text {occ}})\).

Using the data structure by Nekrich and Navarro [46], the time bound from Theorem 3 follows.

7 Conclusion

We have considered the problem of gapped indexing for consecutive occurrences. We have given a linear space data structure that can support counting and reporting queries.

The running time for both includes an \(O(n^{2/3})\) term, leaving a gap of \(n^{1/6}\) to the conditional lower bound of \(\widetilde{\Omega }(\sqrt{n})\). Thus, the most obvious open question is whether we can close this gap, either by improving the data structure or finding a stronger lower bound.

Further, we have used the property that there can only be few consecutive occurrences of large distances. Thus, our solution cannot be easily extended to finding all pairs of occurrences with distance within the query interval. An open question is if it is possible to get similar results for that problem. Lastly, document versions of similar problems have concerned themselves with finding all documents that contain \(P_1\) and \(P_2\) or the top-k of smallest distance; conditional lower bounds for these problems are also known. It would be interesting to see if any of these results can be extended to finding all documents that contain a (consecutive) occurrence of \(P_1\) and \(P_2\) that has a distance within a query interval.