Suffix Trees, DAWGs and CDAWGs for Forward and Backward Tries

Inenaga, Shunsuke

doi:10.1007/978-3-030-61792-9_16

Shunsuke Inenaga^10,11

Part of the book series: Lecture Notes in Computer Science ((LNTCS,volume 12118))

Included in the following conference series:

Latin American Symposium on Theoretical Informatics

637 Accesses
1 Citations

Abstract

The suffix tree, DAWG, and CDAWG are fundamental indexing structures of a string, with a number of applications in bioinformatics, information retrieval, data mining, etc. An edge-labeled rooted tree (trie) is a natural generalization of a string, which can also be seen as a compact representation of a set of strings. Kosaraju [FOCS 1989] proposed the suffix tree for a backward trie, where the strings in the trie are read in the leaf-to-root direction. In contrast to a backward trie, we call a usual trie as a forward trie. Despite a few follow-up works after Kosaraju’s paper, indexing forward/backward tries is not well understood yet. In this paper, we show a full perspective on the sizes of indexing structures such as suffix trees, DAWGs, and CDAWGs for forward and backward tries. In particular, we show that the size of the DAWG for a forward trie with n nodes is $\varOmega (\sigma n)$, where $\sigma $ is the number of distinct characters in the trie. This becomes $\varOmega (n^2)$ for an alphabet of size $\sigma = \varTheta (n)$. Still, we show that there is a compact O(n)-space implicit representation of the DAWG for a forward trie, whose space requirement is independent of the alphabet size. This compact representation allows for simulating each DAWG edge traversal in $O(\log \sigma )$ time, and can be constructed in O(n) time and space over any integer alphabet of size O(n).

Access provided by Autonomous University of Puebla. Download conference paper PDF

On Suffix Tree Breadth

Suffix Tree of Alignment: An Efficient Index for Similar Data

Relative Lempel-Ziv Compression of Suffix Arrays

1 Introduction

Text indexing is a fundamental problem in theoretical computer science that dates back to 1970’s when suffix trees were invented [26]. Here the task is to preprocess a given text string S so that subsequent patten matching queries on S can be answered efficiently. Suffix trees have numerous other applications e.g. sequence comparisons [26], lossless data compression [2], data mining [23], and bioinformatics [15, 21].

A trie is a rooted tree where each edge is labeled with a single character. A backward trie is an edge-reversed trie. Kosaraju [19] was the first to consider the trie indexing problem, and he proposed the suffix tree of a backward trie that takes O(n) space, where n is the number of nodes in the backward trie. Kosaraju also claimed an $O(n \log n)$-time construction. Breslauer [7] showed how to build the suffix tree of a backward trie in $O(\sigma n)$ time and space, where $\sigma $ is the alphabet size. Shibuya [25] presented an O(n)-time and space construction for the suffix tree of a backward trie over an integer alphabet of size O(n). This line of research has been followed by the invention of XBWTs [11], suffix arrays [11], enhanced suffix arrays [18], and position heaps [24] for backward tries.

This paper considers the suffix trees, the directed acyclic word graphs (DAWGs) [5, 9], and the compact DAWGs (CDAWGs) [6] built on a backward trie or on a forward (ordinary) trie. While all these indexing structures support linear-time pattern matching queries on tries, their sizes can significantly differ. We present tight lower and upper bounds on the sizes of all these indexing structures, as summarized in Table 1. Probably the most interesting result in our size bounds is the $\varOmega (n^2)$ lower bound for the size of the DAWG for a forward trie with n nodes over an alphabet of size $\varTheta (n)$ (Theorem 6), since this reveals that Mohri et al.’s algorithm [22] that constructs the DAWG for a forward trie with n nodes must take at least $\varOmega (n^2)$ time and space in the worst case. We show that, somewhat surprisingly, there exists an implicit compact representation of the DAWG for a forward trie that occupies only O(n) space independently of the alphabet size, and allows for simulating traversal of each DAWG edge in $O(\log \sigma )$ time. We also present an algorithm that builds this implicit representation of the DAWG for a forward trie in O(n) time and space for any integer alphabet of size O(n).

Table 1. Summary of the numbers of nodes and edges of the suffix tree, DAWG, and CDAWG for a forward/backward trie with n nodes over an alphabet of size $\sigma $. The new bounds obtained in Sect. 5 of this paper are highlighted in bold. All the bounds here are valid with any alphabet size $\sigma $ ranging from $\varTheta (1)$ to $\varTheta (n)$. Also, all these upper bounds are tight in the sense that there are matching lower bounds (see Sect. 5).

Full size table

DAWGs for strings have important applications to pattern matching with don’t cares [20], online Lempel-Ziv factorization in compact space [27], finding minimal absent words [13], etc. CDAWGs for strings can be regarded as grammar compression of input strings and can be stored in space linear in the number of right-extensions of maximal repeats [3]. It is known that the number of maximal repeats can be much smaller than the string length, particularly in highly repetitive strings. Hence, studying and understanding DAWGs/CDAWGs for tries are very important and are expected to lead to further research on efficient processing of tries.

Omitted proofs and supplemental figures can be found in a full version [16].

2 Preliminaries

Let $\varSigma $ be an ordered alphabet. Any element of $\varSigma ^*$ is called a string. For any string S, let |S| denote its length. Let $\varepsilon $ be the empty string, namely, $|\varepsilon | = 0$. Let $\varSigma ^+ = \varSigma ^* \setminus \{\varepsilon \}$. If $S = XYZ$, then X, Y, and Z are called a prefix, a substring, and a suffix of S, respectively. For any $1 \le i \le j \le |S|$, let S[i..j] denote the substring of S that begins at position i and ends at position j in S. For convenience, let $S[i..j] = \varepsilon $ if $i > j$. For any $1 \le i \le |S|$, let S[i] denote the ith character of S. For any string S, let $\overline{S}$ denote the reversed string of S, i.e., $\overline{S} = S[|S|] \cdots S[1]$. Also, for any set $\mathbf {S}$ of strings, let $\overline{\mathbf {S}}$ denote the set of the reversed strings of $\mathbf {S}$, namely, $\overline{\mathbf {S}} = \{\overline{S} \mid S \in \mathbf {S}\}$.

A trie $\mathsf {T}$ is a rooted tree $(\mathsf {V}, \mathsf {E})$ such that (1) each edge in $\mathsf {E}$ is labeled by a single character from $\varSigma $ and (2) the character labels of the out-going edges of each node begin with mutually distinct characters. In this paper, a forward trie refers to an (ordinary) trie as defined above. On the other hand, a backward trie refers to an edge-reversed trie where each path label is read in the leaf-to-root direction. We will denote by $\mathsf {T}_{\mathsf {f}}= (\mathsf {V}_{\mathsf {f}}, \mathsf {E}_{\mathsf {f}})$ a forward trie and by $\mathsf {T}_{\mathsf {b}}= (\mathsf {V}_{\mathsf {b}}, \mathsf {E}_{\mathsf {b}})$ the backward trie that is obtained by reversing the edges of $\mathsf {T}_{\mathsf {f}}$. We denote by a triple $(u, a, v)_{\mathsf {f}}$ an edge in a forward trie $\mathsf {T}_{\mathsf {f}}$, where $u, v \in \mathsf {V}$ and $a \in \varSigma $. Each reversed edge in $\mathsf {T}_{\mathsf {b}}$ is denoted by a triple $(v, a, u)_{\mathsf {b}}$. Namely, there is a directed labeled edge $(u, a, v)_{\mathsf {f}} \in \mathsf {E}_{\mathsf {f}}$ iff there is a reversed directed labeled edge $(v, a, u)_{\mathsf {b}} \in \mathsf {E}_{\mathsf {b}}$.

For a node u of a forward trie $\mathsf {T}_{\mathsf {f}}$, let $\textit{anc}(u, j)$ denote the jth ancestor of u in $\mathsf {T}_{\mathsf {f}}$ if it exists. Alternatively, for a node v of a backward $\mathsf {T}_{\mathsf {b}}$, let $\textit{des}(v, j)$ denote the jth descendant of v in $\mathsf {T}_{\mathsf {b}}$ if it exists. We use a level ancestor data structure [4] on $\mathsf {T}_{\mathsf {f}}$ (resp. $\mathsf {T}_{\mathsf {b}}$) so that $\textit{anc}(u, j)$ (resp. $\textit{des}(v, j)$) can be found in O(1) time for any node and integer j, with linear space.

For nodes u, v in a forward trie $\mathsf {T}_{\mathsf {f}}$ s.t. u is an ancestor of v, let $\textit{str}_{\mathsf {f}}(u, v)$ denote the string spelled out by the path from u to v in $\mathsf {T}_{\mathsf {f}}$. Let r denote the root of $\mathsf {T}_{\mathsf {f}}$ and $\mathsf {L_f}$ the set of leaves in $\mathsf {T}_{\mathsf {f}}$. The sets of substrings and suffixes of the forward trie $\mathsf {T}_{\mathsf {f}}$ are respectively defined by $\textit{Substr}(\mathsf {T}_{\mathsf {f}}) = \{\textit{str}_{\mathsf {f}}(u, v) \mid u, v \in \mathsf {V}_{\mathsf {f}}, u{ isanancestorof}v\}$ and $\textit{Suffix}(\mathsf {T}_{\mathsf {f}}) = \{\textit{str}_{\mathsf {f}}(u, l) \mid u \in \mathsf {V}_{\mathsf {f}}, l \in \mathsf {L_f}{} \}$.

For nodes v, u in a backward trie $\mathsf {T}_{\mathsf {b}}$ s.t. v is a descendant of u, let $\textit{str}_{\mathsf {b}}( v, u )$ denote the string spelled out by the reversed path from v to u in $\mathsf {T}_{\mathsf {b}}$. The sets of substrings and suffixes of the backward trie $\mathsf {T}_{\mathsf {b}}$ are respectively defined by $\textit{Substr}(\mathsf {T}_{\mathsf {b}}) = \{\textit{str}_{\mathsf {b}}( v, u ) \mid v, u \in \mathsf {V}_{\mathsf {b}}, \ v{ isadescendantof}u\}$ and $\textit{Suffix}(\mathsf {T}_{\mathsf {b}}) = \{\textit{str}_{\mathsf {b}}( v, r ) \mid v \in \mathsf {V}_{\mathsf {b}}, \ r{ istherootof}\mathsf {T}_{\mathsf {b}}\}$.

In what follows, let n be the number of nodes in $\mathsf {T}_{\mathsf {f}}$ (or equivalently in $\mathsf {T}_{\mathsf {b}}$).

Fact 1

(a) For any $\mathsf {T}_{\mathsf {f}}$ and $\mathsf {T}_{\mathsf {b}}$, $\textit{Substr}(\mathsf {T}_{\mathsf {f}}) = \overline{\textit{Substr}(\mathsf {T}_{\mathsf {b}})}$. (b) For any forward trie $\mathsf {T}_{\mathsf {f}}$, $|\textit{Suffix}(\mathsf {T}_{\mathsf {f}})| = O(n^2)$. For some forward trie $\mathsf {T}_{\mathsf {f}}$, $|\textit{Suffix}(\mathsf {T}_{\mathsf {f}})| = \varOmega (n^2)$. (c) $|\textit{Suffix}(\mathsf {T}_{\mathsf {b}})| \le n-1$ for any backward trie $\mathsf {T}_{\mathsf {b}}$.

Fact 1-(a), Fact 1-(c) and the upper bound of Fact 1-(b) should be clear from the definitions. To see the lower bound of Fact 1-(b) in detail, consider a forward trie $\mathsf {T}_{\mathsf {f}}$ with root r such that there is a single path of length k from r to a node v, and there is a complete binary tree rooted at v with k leaves. Then, for all nodes u in the path from r to v, the total number of strings in the set $\{\textit{str}_{\mathsf {f}}(u, l) \mid l \in \mathsf {L_f}{} \} \subset \textit{Suffix}(\mathsf {T}_{\mathsf {f}})$ is at least $k(k+1)$, since each $\textit{str}_{\mathsf {f}}(u, l)$ is distinct for each path (u, l). By setting $k \approx n/3$ so that the number $|\mathsf {V}_{\mathsf {f}}|$ of nodes in $\mathsf {T}_{\mathsf {f}}$ equals n, we obtain Fact 1-(b). The lower bound is valid for alphabets of size $\sigma $ ranging from 2 to $\varTheta (k) = \varTheta (n)$.

3 Maximal Substrings in Forward/Backward Tries

Blumer et al. [6] introduced the notions of right-maximal, left-maximal, and maximal substrings in a set $\mathbf {S}$ of strings, and presented clean relationships between the right-maximal/left-maximal/maximal substrings and the suffix trees/DAWGs/CDAWGs for $\mathbf {S}$. Here we give natural extensions of these notions to substrings in our forward and backward tries $\mathsf {T}_{\mathsf {f}}$ and $\mathsf {T}_{\mathsf {b}}$, which will be the basis of our indexing structures for $\mathsf {T}_{\mathsf {f}}$ and $\mathsf {T}_{\mathsf {b}}$.

Maximal Substrings on Forward Tries: For any substring X in a forward trie $\mathsf {T}_{\mathsf {f}}$, X is said to be right-maximal on $\mathsf {T}_{\mathsf {f}}$ if (i) there are at least two distinct characters $a, b \in \varSigma $ such that $Xa, Xb \in \textit{Substr}(\mathsf {T}_{\mathsf {f}})$, or (ii) X has an occurrence ending at a leaf of $\mathsf {T}_{\mathsf {f}}$. Also, X is said to be left-maximal on $\mathsf {T}_{\mathsf {f}}$ if (i) there are at least two distinct characters $a, b \in \varSigma $ such that $aX, bX \in \textit{Substr}(\mathsf {T}_{\mathsf {f}})$, or (ii) X has an occurrence beginning at the root of $\mathsf {T}_{\mathsf {f}}$. Finally, X is said to be maximal on $\mathsf {T}_{\mathsf {f}}$ if X is both right-maximal and left-maximal in $\mathsf {T}_{\mathsf {f}}$. For any $X \in \textit{Substr}(\mathsf {T}_{\mathsf {f}})$, let $\textit{r-mxml}_{\mathsf {f}}(X)$, $\textit{l-mxml}_{\mathsf {f}}(X)$, and $\textit{mxml}_{\mathsf {f}}(X)$ respectively denote the functions that map X to the shortest right-maximal substring $X \beta $, the shortest left-maximal substring $\alpha X$, and the shortest maximal substring $\alpha X \beta $ that contain X in $\mathsf {T}_{\mathsf {f}}$, where $\alpha , \beta \in \varSigma ^*$.

Maximal Substrings on Backward Tries: For any substring Y in a backward trie $\mathsf {T}_{\mathsf {b}}$, Y is said to be left-maximal on $\mathsf {T}_{\mathsf {b}}$ if (i) there are at least two distinct characters $a, b \in \varSigma $ such that $aY, bY \in \textit{Substr}(\mathsf {T}_{\mathsf {b}})$, or (ii) Y has an occurrence beginning at a leaf of $\mathsf {T}_{\mathsf {b}}$. Also, Y is said to be right-maximal on $\mathsf {T}_{\mathsf {b}}$ if (i) there are at least two distinct characters $a, b \in \varSigma $ such that $Ya, Yb \in \textit{Substr}(\mathsf {T}_{\mathsf {b}})$, or (ii) Y has an occurrence ending at the root of $\mathsf {T}_{\mathsf {b}}$. Finally, Y is said to be maximal on $\mathsf {T}_{\mathsf {b}}$ if Y is both right-maximal and left-maximal in $\mathsf {T}_{\mathsf {b}}$. For any $Y \in \textit{Substr}(\mathsf {T}_{\mathsf {b}})$, let $\textit{l-mxml}_{\mathsf {b}}(Y)$, $\textit{r-mxml}_{\mathsf {b}}(Y)$, and $\textit{mxml}_{\mathsf {b}}(Y)$ respectively denote the functions that map Y to the shortest left-maximal substring $\gamma Y$, the shortest right-maximal substring $Y \delta $, and the shortest maximal substring $\gamma Y \delta $ that contain Y in $\mathsf {T}_{\mathsf {b}}$, where $\gamma , \delta \in \varSigma ^*$.

Clearly, the afore-mentioned notions are symmetric over $\mathsf {T}_{\mathsf {f}}$ and $\mathsf {T}_{\mathsf {b}}$, namely:

Fact 2

String X is right-maximal (resp. left-maximal) on $\mathsf {T}_{\mathsf {f}}$ iff $\overline{X}$ is left-maximal (resp. right-maximal) on $\mathsf {T}_{\mathsf {b}}$. Also, X is maximal on $\mathsf {T}_{\mathsf {f}}$ iff $\overline{X}$ is maximal on $\mathsf {T}_{\mathsf {b}}$.

4 Indexing Forward/Backward Tries and Known Bounds

A compact tree for a set $\mathbf {S}$ of strings is a rooted tree such that (1) each edge is labeled by a non-empty substring of a string in $\mathbf {S}$, (2) each internal node is branching, (3) the string labels of the out-going edges of each node begin with mutually distinct characters, and (4) there is a path from the root that spells out each string in $\mathbf {S}$, which may end on an edge. Each edge of a compact tree is denoted by a triple $(u, \alpha , v)$ with $\alpha \in \varSigma ^+$. We call internal nodes that are branching as explicit nodes, and we call loci that are on edges as implicit nodes. We will sometimes identify nodes with the substrings that the nodes represent.

In what follows, we will consider DAG or tree data structures built on a forward trie or backward trie. For any DAG or tree data structure $\mathsf {D}$, let $|\mathsf {D}|_{\#{\textit{Node}}}$ and $|\mathsf {D}|_{\#{\textit{Edge}}}$ denote the numbers of nodes and edges in $\mathsf {D}$, respectively.

4.1 Suffix Trees for Forward Tries

The suffix tree of a forward trie $\mathsf {T}_{\mathsf {f}}$, denoted $\mathsf {STree}(\mathsf {T}_{\mathsf {f}})$, is a compact tree which represents $\textit{Suffix}(\mathsf {T}_{\mathsf {f}})$. All non-root nodes in $\mathsf {STree}(\mathsf {T}_{\mathsf {f}})$ represent right-maximal substrings on $\mathsf {T}_{\mathsf {f}}$. Since now all internal nodes are branching, and since there are at most $|\textit{Suffix}(\mathsf {T}_{\mathsf {f}})|$ leaves, the numbers of nodes and edges in $\mathsf {STree}(\mathsf {T}_{\mathsf {f}})$ are proportional to the number of suffixes in $\textit{Suffix}(\mathsf {T}_{\mathsf {f}})$. The following (folklore) quadratic bounds hold due to Fact 1-(b).

Theorem 1

For any forward trie $\mathsf {T}_{\mathsf {f}}$ with n nodes, $|\mathsf {STree}(\mathsf {T}_{\mathsf {f}})|_{\#{\textit{Node}}} = O(n^2)$ and $|\mathsf {STree}(\mathsf {T}_{\mathsf {f}})|_{\#{\textit{Edge}}} = O(n^2)$. These upper bounds hold for any alphabet. For some forward trie $\mathsf {T}_{\mathsf {f}}$ with n nodes, $|\mathsf {STree}(\mathsf {T}_{\mathsf {f}})|_{\#{\textit{Node}}} \! = \! \varOmega (n^2)$ and $|\mathsf {STree}(\mathsf {T}_{\mathsf {f}})|_{\#{\textit{Edge}}}$ $= \varOmega (n^2)$. These lower bounds hold for a constant-size or larger alphabet.

4.2 Suffix Trees for Backward Tries

The suffix tree of a backward trie $\mathsf {T}_{\mathsf {b}}$, denoted $\mathsf {STree}(\mathsf {T}_{\mathsf {b}})$, is a compact tree which represents $\textit{Suffix}(\mathsf {T}_{\mathsf {b}})$. Since $\mathsf {STree}(\mathsf {T}_{\mathsf {b}})$ contains at most $n-1$ leaves by Fact 1-(c) and all internal nodes of $\textit{Suffix}(\mathsf {T}_{\mathsf {b}})$ are branching, the following precise bounds follow from Fact 1-(c), which were implicit in the literature [7, 19].

Theorem 2

For any backward trie $\mathsf {T}_{\mathsf {b}}$ with $n \ge 3$ nodes, $|\mathsf {STree}(\mathsf {T}_{\mathsf {b}})|_{\#{\textit{Node}}} \le 2n-3$ and $|\mathsf {STree}(\mathsf {T}_{\mathsf {b}})|_{\#{\textit{Edge}}} \le 2n - 4$, independently of the alphabet size.

The above bounds are tight since the theorem translates to the suffix tree with $2m-1$ nodes and $2m-2$ edges for a string of length m (e.g., $a^{m-1}b$), which can be represented as a path tree with $n = m+1$ nodes. By representing each edge label $\alpha $ by a pair $\langle v, u \rangle $ of nodes in $\mathsf {T}_{\mathsf {b}}$ such that $\alpha = \textit{str}_{\mathsf {b}}(u, v)$, $\mathsf {STree}(\mathsf {T}_{\mathsf {b}})$ can be stored with O(n) space.

4.2.1 Suffix Links and Weiner Links:

For each explicit node aU of the suffix tree $\mathsf {STree}(\mathsf {T}_{\mathsf {b}})$ of a backward trie $\mathsf {T}_{\mathsf {b}}$ with $a \in \varSigma $ and $U \in \varSigma ^*$, let $\textit{slink}(aU) = U$. This is called the suffix link of node aU. For each explicit node V and $a \in \varSigma $, we also define the reversed suffix link $\mathcal{W}_{a}({V}) = aVX$ where $X \in \varSigma ^*$ is the shortest string such that aVX is an explicit node of $\mathsf {STree}(\mathsf {T}_{\mathsf {b}})$. $\mathcal{W}_{a}({V})$ is undefined if $aV \notin \textit{Substr}(\mathsf {T}_{\mathsf {b}})$. These reversed suffix links are also called as Weiner links (or W-link in short) [8]. A W-link $\mathcal{W}_{a}({V}) = aVX$ is said to be hard if $X = \varepsilon $, and soft if $X \in \varSigma ^+$. The suffix links, hard and soft W-links of nodes in the suffix tree $\mathsf {STree}(\mathsf {T}_{\mathsf {f}})$ of a forward trie $\mathsf {T}_{\mathsf {f}}$ are defined analogously.

4.3 DAWGs for Forward Tries

The directed acyclic word graph (DAWG) of a forward trie $\mathsf {T}_{\mathsf {f}}$ is a (partial) DFA that recognizes all substrings in $\textit{Substr}(\mathsf {T}_{\mathsf {f}})$. Hence, the label of every edge of $\mathsf {DAWG}(\mathsf {T}_{\mathsf {f}})$ is a single character from $\varSigma $. $\mathsf {DAWG}(\mathsf {T}_{\mathsf {f}})$ is formally defined as follows: For any substring X from $\textit{Substr}(\mathsf {T}_{\mathsf {f}})$, let $[X]_{E, \mathsf {f}}$ denote the equivalence class w.r.t. $\textit{l-mxml}_{\mathsf {f}}(X)$. There is a one-to-one correspondence between the nodes of $\mathsf {DAWG}(\mathsf {T}_{\mathsf {f}})$ and the equivalence classes $[\cdot ]_{E, \mathsf {f}}$, and hence we will identify the nodes of $\mathsf {DAWG}(\mathsf {T}_{\mathsf {f}})$ with their corresponding equivalence classes $[\cdot ]_{E, \mathsf {f}}$. By the definition of equivalence classes, every member of $[X]_{E, \mathsf {f}}$ is a suffix of $\textit{l-mxml}_{\mathsf {f}}(X)$. If X, Xa are substrings in $\textit{Substr}(\mathsf {T}_{\mathsf {f}})$ and $a \in \varSigma $, then there exists an edge labeled with character $a \in \varSigma $ from node $[X]_{E, \mathsf {f}}$ to node $[Xa]_{E, \mathsf {f}}$ in $\mathsf {DAWG}(\mathsf {T}_{\mathsf {f}})$. This edge is called primary if $|\textit{l-mxml}_{\mathsf {f}}(X)| + 1 = |\textit{l-mxml}_{\mathsf {f}}(Xa)|$, and is called secondary otherwise. For each node $[X]_{E, \mathsf {f}}$ of $\mathsf {DAWG}(\mathsf {T}_{\mathsf {f}})$ with $|X| \ge 1$, let $\textit{slink}([X]_{E, \mathsf {f}}) = Z$, where Z is the longest suffix of $\textit{l-mxml}_{\mathsf {f}}(X)$ not belonging to $[X]_{E, \mathsf {f}}$. This is the suffix link of this node $[X]_{E, \mathsf {f}}$.

Mohri et al. [22] introduced the suffix automaton for an acyclic DFA $\mathsf {G}$, which is a small DFA that represents all suffixes of strings accepted by $\mathsf {G}$. They considered equivalence relation $\equiv $ of substrings X and Y in an acyclic DFA $\mathsf {G}$ such that $X \equiv Y$ iff the following paths of the occurrences of X and Y in $\mathsf {G}$ are equal. Mohri et al.’s equivalence class is identical to our equivalence class $[X]_{E, \mathsf {f}}$ when $\mathsf {G} = \mathsf {T}_{\mathsf {f}}$. To see why, recall that $\textit{l-mxml}_{\mathsf {f}}(X) = \alpha X$ is the shortest substring of $\mathsf {T}_{\mathsf {f}}$ such that $\alpha X$ is left-maximal, where $\alpha \in \varSigma ^*$. Therefore, X is a suffix of $\textit{l-mxml}_{\mathsf {f}}(X)$ and the following paths of the occurrences of X in $\mathsf {T}_{\mathsf {f}}$ are identical to the following paths of the occurrences of $\textit{l-mxml}_{\mathsf {f}}(X)$ in $\mathsf {T}_{\mathsf {f}}$. Hence, in case where the input DFA $\mathsf {G}$ is in form of a forward trie $\mathsf {T}_{\mathsf {f}}$ such that its leaves are the accepting states, then Mohri et al.’s suffix automaton is identical to our DAWG for $\mathsf {T}_{\mathsf {f}}$. Mohri et al. [22] showed the following:

Theorem 3

(Corollary 2 of [22]). For any forward trie $\mathsf {T}_{\mathsf {f}}$ with $n \ge 3$ nodes, $|\mathsf {DAWG}(\mathsf {T}_{\mathsf {f}})|_{\#{\textit{Node}}} \le 2n - 3$, independently of the alphabet size.

We remark that Theorem 3 is immediate from Theorem 2 and Fact 2. This is because there is a one-to-one correspondence between the nodes of $\mathsf {DAWG}(\mathsf {T}_{\mathsf {f}})$ and the nodes of $\mathsf {STree}(\mathsf {T}_{\mathsf {b}})$, which means that $|\mathsf {DAWG}(\mathsf {T}_{\mathsf {f}})|_{\#{\textit{Node}}} = |\mathsf {STree}(\mathsf {T}_{\mathsf {b}})|_{\#{\textit{Node}}}$. Recall that the bound in Theorem 3 is only on the number of nodes in $\mathsf {DAWG}(\mathsf {T}_{\mathsf {f}})$. We shall show later that the number of edges in $\mathsf {DAWG}(\mathsf {T}_{\mathsf {f}})$ is $\varOmega (\sigma n)$ in the worst case, which can be $\varOmega (n^2)$ for a large alphabet.

4.4 DAWGs for Backward Tries

The DAWG of a backward trie $\mathsf {T}_{\mathsf {b}}$, denoted $\mathsf {DAWG}(\mathsf {T}_{\mathsf {b}})$, is a (partial) DFA that recognizes all strings in $\textit{Substr}(\mathsf {T}_{\mathsf {b}})$. The label of every edge of $\mathsf {DAWG}(\mathsf {T}_{\mathsf {b}})$ is a single character from $\varSigma $. $\mathsf {DAWG}(\mathsf {T}_{\mathsf {b}})$ is formally defined as follows: For any substring Y from $\textit{Substr}(\mathsf {T}_{\mathsf {b}})$, let $[Y]_{E, \mathsf {b}}$ denote the equivalence class w.r.t. $\textit{l-mxml}_{\mathsf {b}}(Y)$. There is a one-to-one correspondence between the nodes of $\mathsf {DAWG}(\mathsf {T}_{\mathsf {b}})$ and the equivalence classes $[\cdot ]_{E, \mathsf {b}}$, and hence we will identify the nodes of $\mathsf {DAWG}(\mathsf {T}_{\mathsf {b}})$ with their corresponding equivalence classes $[\cdot ]_{E, \mathsf {b}}$. The notions of primary edges, secondary edges, and the suffix links of $\mathsf {DAWG}(\mathsf {T}_{\mathsf {b}})$ are defined in similar manners to $\mathsf {DAWG}(\mathsf {T}_{\mathsf {f}})$, but using the equivalence classes $[Y]_{E, \mathsf {b}}$ for substrings Y in the backward trie $\mathsf {T}_{\mathsf {b}}$.

4.4.1 Symmetries Between Suffix Trees and DAWGs:

The well-known symmetry between the suffix trees and the DAWGs (refer to [5, 6, 10]) also holds in our case of forward and backward tries. Namely, the suffix links of $\mathsf {DAWG}(\mathsf {T}_{\mathsf {f}})$ (resp. $\mathsf {DAWG}(\mathsf {T}_{\mathsf {b}})$) are the (reversed) edges of $\mathsf {STree}(\mathsf {T}_{\mathsf {b}})$ (resp. $\mathsf {STree}(\mathsf {T}_{\mathsf {f}})$). Also, the hard W-links of $\mathsf {STree}(\mathsf {T}_{\mathsf {f}})$ (resp. $\mathsf {STree}(\mathsf {T}_{\mathsf {b}})$) are the primary edges of $\mathsf {DAWG}(\mathsf {T}_{\mathsf {b}})$ (resp. $\mathsf {DAWG}(\mathsf {T}_{\mathsf {f}})$), and the soft W-links of $\mathsf {STree}(\mathsf {T}_{\mathsf {f}})$ (resp. $\mathsf {STree}(\mathsf {T}_{\mathsf {b}})$) are the secondary edges of $\mathsf {DAWG}(\mathsf {T}_{\mathsf {b}})$ (resp. $\mathsf {DAWG}(\mathsf {T}_{\mathsf {f}})$).

4.5 CDAWGs for Forward Tries

The compact directed acyclic word graph (CDAWG) of a forward trie $\mathsf {T}_{\mathsf {f}}$, denoted $\mathsf {CDAWG}(\mathsf {T}_{\mathsf {f}})$, is the edge-labeled DAG where the nodes correspond to the equivalence class of $\textit{Substr}(\mathsf {T}_{\mathsf {f}})$ w.r.t. $\textit{mxml}_{\mathsf {f}}(\cdot )$. In other words, $\mathsf {CDAWG}(\mathsf {T}_{\mathsf {f}})$ can be obtained by merging isomorphic subtrees of $\mathsf {STree}(\mathsf {T}_{\mathsf {f}})$ rooted at internal nodes and merging leaves that are equivalent under $\textit{mxml}_{\mathsf {f}}(\cdot )$, or by contracting non-branching paths of $\mathsf {DAWG}(\mathsf {T}_{\mathsf {f}})$.

Theorem 4

( [17]). For any forward trie $\mathsf {T}_{\mathsf {f}}$ with n nodes over a constant-size alphabet, $|\mathsf {CDAWG}(\mathsf {T}_{\mathsf {f}})|_{\#{\textit{Node}}} = O(n)$ and $|\mathsf {CDAWG}(\mathsf {T}_{\mathsf {f}})|_{\#{\textit{Edge}}} = O(n)$.

We emphasize that the above result by Inenaga et al. [17] states size bounds of $\mathsf {CDAWG}(\mathsf {T}_{\mathsf {f}})$ only in the case where $\sigma = O(1)$. We will later show that this bound does not hold for the number of edges, in the case of a large alphabet.

4.6 CDAWGs for Backward Tries

The compact directed acyclic word graph (CDAWG) of a backward trie $\mathsf {T}_{\mathsf {b}}$, denoted $\mathsf {CDAWG}(\mathsf {T}_{\mathsf {b}})$, is the edge-labeled DAG where the nodes correspond to the equivalence class of $\textit{Substr}(\mathsf {T}_{\mathsf {b}})$ w.r.t. $\textit{mxml}_{\mathsf {b}}(\cdot )$. Similarly to its forward trie counterpart, $\mathsf {CDAWG}(\mathsf {T}_{\mathsf {b}})$ can be obtained by merging isomorphic subtrees of $\mathsf {STree}(\mathsf {T}_{\mathsf {b}})$ rooted at internal nodes and merging leaves that are equivalent under $\textit{mxml}_{\mathsf {f}}(\cdot )$, or by contracting non-branching paths of $\mathsf {DAWG}(\mathsf {T}_{\mathsf {b}})$.

5 New Size Bounds on Indexing Forward/Backward Tries

To make the analysis simpler, we assume each of the roots, the one of $\mathsf {T}_{\mathsf {f}}$ and the corresponding one of $\mathsf {T}_{\mathsf {b}}$, is connected to an auxiliary node $\bot $ with an edge labeled by a unique character $\$$ that does not appear elsewhere in $\mathsf {T}_{\mathsf {f}}$ or in $\mathsf {T}_{\mathsf {b}}$.

5.1 Size Bounds for DAWGs for Forward/Backward Tries

Theorem 5

For any backward trie $\mathsf {T}_{\mathsf {b}}$ with n nodes, $|\mathsf {DAWG}(\mathsf {T}_{\mathsf {b}})|_{\#{\textit{Node}}} = O(n^2)$ and $|\mathsf {DAWG}(\mathsf {T}_{\mathsf {b}})|_{\#{\textit{Edge}}} = O(n^2)$. These upper bounds hold for any alphabet. For some backward trie $\mathsf {T}_{\mathsf {b}}$ with n nodes, $|\mathsf {DAWG}(\mathsf {T}_{\mathsf {b}})|_{\#{\textit{Node}}} = \varOmega (n^2)$ and $|\mathsf {DAWG}(\mathsf {T}_{\mathsf {b}})|_{\#{\textit{Edge}}}$ $= \varOmega (n^2)$. These lower bounds hold for a constant-size or larger alphabet.

Theorem 6

For any forward trie $\mathsf {T}_{\mathsf {f}}$ with n nodes, $|\mathsf {DAWG}(\mathsf {T}_{\mathsf {f}})|_{\#{\textit{Edge}}} = O(\sigma n)$. For some forward trie $\mathsf {T}_{\mathsf {f}}$ with n nodes, $|\mathsf {DAWG}(\mathsf {T}_{\mathsf {f}})|_{\#{\textit{Edge}}} = \varOmega (\sigma n)$ which is $\varOmega (n^2)$ for a large alphabet of size $\sigma = \varTheta (n)$.

Proof

Since each node of $\mathsf {DAWG}(\mathsf {T}_{\mathsf {f}})$ can have at most $\sigma $ out-going edges, the upper bound $|\mathsf {DAWG}(\mathsf {T}_{\mathsf {f}})|_{\#{\textit{Edge}}} = O(\sigma n)$ follows from Theorem 3.

To obtain the lower bound $|\mathsf {DAWG}(\mathsf {T}_{\mathsf {f}})|_{\#{\textit{Edge}}} = \varOmega (\sigma n)$, we consider $\mathsf {T}_{\mathsf {f}}$ which has a broom-like shape such that there is a single path of length $n - \sigma - 1$ from the root to a node v which has out-going edges with $\sigma $ distinct characters $b_1, \ldots , b_\sigma $. Since the root of $\mathsf {T}_{\mathsf {f}}$ is connected with the auxiliary node $\bot $ with an edge labeled $\$$, each root-to-leaf path in $\mathsf {T}_{\mathsf {f}}$ represents $\$ a^{n - \sigma + 1} b_i$ for $1 \le i \le \sigma $. Now $a^k$ for each $1 \le k \le n - \sigma -2$ is left-maximal since it is immediately preceded by a and $\$$. Thus $\mathsf {DAWG}(\mathsf {T}_{\mathsf {f}})$ has at least $n - \sigma - 2$ internal nodes, each representing $a^k$ for $1 \le k \le n - \sigma - 2$. On the other hand, each $a^k \in \textit{Substr}(\mathsf {T}_{\mathsf {f}})$ is immediately followed by $b_i$ with all $1 \le i \le \sigma $. Hence, $\mathsf {DAWG}(\mathsf {T}_{\mathsf {f}})$ contains $\sigma (n-\sigma -2) = \varOmega (\sigma n)$ edges when $n - \sigma - 2 = \varOmega (n)$. By choosing e.g. $\sigma \approx n/2$, we obtain $\mathsf {DAWG}(\mathsf {T}_{\mathsf {f}})$ that contains $\varOmega (n^2)$ edges. $\square $

Mohri et al. (Proposition 4 of [22]) claimed that one can construct $\mathsf {DAWG}(\mathsf {T}_{\mathsf {f}})$ in time proportional to its size. The following corollary is immediate from Theorem 6:

Corollary 1

The DAWG construction algorithm of [22] applied to a forward trie with n nodes must take at least $\varOmega (n^2)$ time in the worst case for an alphabet of size $\sigma = \varTheta (n)$.

5.2 Size Bounds for CDAWGs for Forward/Backward Tries

Theorem 7

For any backward trie $\mathsf {T}_{\mathsf {b}}$ with n nodes, $|\mathsf {CDAWG}(\mathsf {T}_{\mathsf {b}})|_{\#{\textit{Node}}} \le 2n - 3$ and $|\mathsf {CDAWG}(\mathsf {T}_{\mathsf {b}})|_{\#{\textit{Edge}}} \le 2n -4$. These bounds are independent of the alphabet size.

Proof

Since any maximal substring in $\textit{Substr}(\mathsf {T}_{\mathsf {b}})$ is right-maximal in $\textit{Substr}(\mathsf {T}_{\mathsf {b}})$, by Theorem 2 we have $|\mathsf {CDAWG}(\mathsf {T}_{\mathsf {b}})|_{\#{\textit{Node}}} \le |\mathsf {STree}(\mathsf {T}_{\mathsf {b}})|_{\#{\textit{Node}}} \le 2n - 3$ and $|\mathsf {CDAWG}(\mathsf {T}_{\mathsf {b}})|_{\#{\textit{Edge}}} \le |\mathsf {STree}(\mathsf {T}_{\mathsf {b}})|_{\#{\textit{Edge}}} \le 2n - 4$. $\square $

The bounds in Theorem 7 are tight: Consider an alphabet $\{a_1, \ldots , a_{\lceil \log _2 n \rceil },$ $b_1, \ldots ,$ $b_{\lceil \log _2 n \rceil }, \$\}$ of size $2 \lceil \log _2 n \rceil + 1$ and a binary backward trie $\mathsf {T}_{\mathsf {b}}$ with n nodes where the binary edges at each depth $d \ge 2$ are labeled by the sub-alphabet $\{a_d, b_d\}$ of size 2. Because every suffix $S \in \textit{Suffix}(\mathsf {T}_{\mathsf {b}})$ is maximal in $\mathsf {T}_{\mathsf {b}}$, $\mathsf {CDAWG}(\mathsf {T}_{\mathsf {b}})$ for this $\mathsf {T}_{\mathsf {b}}$ contains $n-1$ sinks. Also, since for each suffix S in $\mathsf {T}_{\mathsf {b}}$ there is a unique suffix $S' \ne S$ that shares the longest common prefix with S, $\mathsf {CDAWG}(\mathsf {T}_{\mathsf {b}})$ for this $\mathsf {T}_{\mathsf {b}}$ contains $n-2$ internal nodes (including the source). This also means $\mathsf {CDAWG}(\mathsf {T}_{\mathsf {b}})$ is identical to $\mathsf {STree}(\mathsf {T}_{\mathsf {b}})$ for this backward trie $\mathsf {T}_{\mathsf {b}}$.

Theorem 8

For any forward trie $\mathsf {T}_{\mathsf {f}}$ with n nodes, $|\mathsf {CDAWG}(\mathsf {T}_{\mathsf {f}})|_{\#{\textit{Node}}} \le 2n - 3$ and $|\mathsf {CDAWG}(\mathsf {T}_{\mathsf {f}})|_{\#{\textit{Edge}}} = O(\sigma n)$. For some forward trie $\mathsf {T}_{\mathsf {f}}$ with n nodes, $|\mathsf {CDAWG}(\mathsf {T}_{\mathsf {f}})|_{\#{\textit{Edge}}} = \varOmega (\sigma n)$ which is $\varOmega (n^2)$ for a large alphabet of size $\sigma = \varTheta (n)$.

Proof

It immediately follows from Fact 1-(a), Fact 2, and Theorem 7 that $|\mathsf {CDAWG}(\mathsf {T}_{\mathsf {f}})|_{\#{\textit{Node}}}$ $= |\mathsf {CDAWG}(\mathsf {T}_{\mathsf {b}})|_{\#{\textit{Node}}} \le 2n - 3$. Since a node in $\mathsf {CDAWG}(\mathsf {T}_{\mathsf {f}})$ can have at most $\sigma $ out-going edges, the upper bound $|\mathsf {CDAWG}(\mathsf {T}_{\mathsf {f}})|_{\#{\textit{Edge}}} = O(\sigma n)$ of the number of edges trivially holds. To obtain the lower bound, we consider the same broom-like forward trie $\mathsf {T}_{\mathsf {f}}$ as in Theorem 6. In this $\mathsf {T}_{\mathsf {f}}$, $a^k$ for each $1 \le k \le n - \sigma -2$ is maximal and thus $\mathsf {CDAWG}(\mathsf {T}_{\mathsf {f}})$ has at least $n - \sigma - 2$ internal nodes each representing $a^k$ for $1 \le k \le n - \sigma - 2$. By the same argument to Theorem 6, $\mathsf {CDAWG}(\mathsf {T}_{\mathsf {f}})$ for this $\mathsf {T}_{\mathsf {f}}$ contains at least $\sigma (n-\sigma -2) = \varOmega (\sigma n)$ edges, which accounts to $\varOmega (n^2)$ for a large alphabet of size e.g. $\sigma \approx n/2$. $\square $

The upper bound of Theorem 8 generalizes the bound of Theorem 4 for constant-size alphabets. Remark that $\mathsf {CDAWG}(\mathsf {T}_{\mathsf {f}})$ for the broom-like $\mathsf {T}_{\mathsf {f}}$ is almost identical to $\mathsf {DAWG}(\mathsf {T}_{\mathsf {f}})$, except for the unary path $\$a$ that is compacted in $\mathsf {CDAWG}(\mathsf {T}_{\mathsf {f}})$.

6 Constructing O(n)-size Representation of $\mathsf {DAWG}(\mathsf {T}_{\mathsf {f}})$ in O(n) Time

We have seen that $\mathsf {DAWG}(\mathsf {T}_{\mathsf {f}})$ for any forward trie $\mathsf {T}_{\mathsf {f}}$ with n nodes contains only O(n) nodes, but can have $\varOmega (\sigma n)$ edges for some $\mathsf {T}_{\mathsf {f}}$ over an alphabet of size $\sigma $ ranging from $\varTheta (1)$ to $\varTheta (n)$. Thus some $\mathsf {DAWG}(\mathsf {T}_{\mathsf {f}})$ can have $\varTheta (n^2)$ edges for $\sigma = \varTheta (n)$ (Theorem 3 and Theorem 6). Hence, in general it is impossible to build an explicit representation of $\mathsf {DAWG}(\mathsf {T}_{\mathsf {f}})$ within linear O(n)-space. By an explicit representation we mean an implementation of $\mathsf {DAWG}(\mathsf {T}_{\mathsf {f}})$ where each edge is represented by a pointer between two nodes.

We show that there exists an O(n)-space implicit representation of $\mathsf {DAWG}(\mathsf {T}_{\mathsf {f}})$ for any alphabet of size $\sigma $ ranging from $\varTheta (1)$ to $\varTheta (n)$, that allows us $O(\log \sigma )$-time access to each edge of $\mathsf {DAWG}(\mathsf {T}_{\mathsf {f}})$. This is trivial in case $\sigma = O(1)$, and hence in what follows we consider an alphabet of size $\sigma $ such that $\sigma $ ranges from $\omega (1)$ to $\varTheta (n)$. Also, we suppose that our alphabet is an integer alphabet $\varSigma = [1..\sigma ]$ of size $\sigma $. Then, we show that such an implicit representation of $\mathsf {DAWG}(\mathsf {T}_{\mathsf {f}})$ can be built in O(n) time and working space.

Based on the property stated in Sect. 4, constructing $\mathsf {DAWG}(\mathsf {T}_{\mathsf {f}})$ reduces to maintaining hard and soft W-links over $\mathsf {STree}(\mathsf {T}_{\mathsf {b}})$. Our data structure explicitly stores all O(n) hard W-links, while it only stores carefully selected O(n) soft W-links. The other soft W-links can be simulated by these explicitly stored W-links, in $O(\log \sigma )$ time each.

Our algorithm is built upon the following facts which are adapted from [12]:

Fact 3

Let a be any character from $\varSigma $.

(a)
If there is a (hard or soft) W-link $\mathcal{W}_{a}({V})$ for a node V in $\mathsf {STree}(\mathsf {T}_{\mathsf {b}})$, then there is a (hard or soft) W-link $\mathcal{W}_{a}({U})$ for any ancestor U of V in $\mathsf {STree}(\mathsf {T}_{\mathsf {b}})$.
(b)
If two nodes U and V have hard W-links $\mathcal{W}_{a}({U})$ and $\mathcal{W}_{a}({V})$, then the LCA Z of U and V also has a hard W-link $\mathcal{W}_{a}({Z})$.

In the following statements (c), (d), and (e), let V be any node of $\mathsf {STree}(\mathsf {T}_{\mathsf {b}})$ such that V has a soft W-link $\mathcal{W}_{a}({V})$ for $a \in \varSigma $.

(c)
There is a descendant U of V s.t. $U \ne V$ and U has a hard W-link $\mathcal{W}_{a}({V})$.
(d)
The highest descendant of V that has a hard W-link for character a is unique. This fact follows from (b).
(e)
Let U be the unique highest descendant of V that has a hard W-link $\mathcal{W}_{a}({U})$. For every node Z in the path from V to U, $\mathcal{W}_{a}({Z}) = \mathcal{W}_{a}({U})$, i.e. the W-links of all nodes in this path for character a point to the same node in $\mathsf {STree}(\mathsf {T}_{\mathsf {b}})$.

We construct a micro-macro tree decomposition [1] of $\mathsf {STree}(\mathsf {T}_{\mathsf {b}})$ in a similar manner to [14], such that the nodes of $\mathsf {STree}(\mathsf {T}_{\mathsf {b}})$ are partitioned into $O(n / \sigma )$ connected components (called micro-trees), each of which contains $O(\sigma )$ nodes. Such a decomposition always exists and can be computed in O(n) time. The macro tree is the induced tree from the roots of the micro trees, and thus the macro tree contains $O(n / \sigma )$ nodes.

In every node V of the macro tree, we explicitly store all soft and hard W-links from V. Since there can be at most $\sigma $ W-links from V, this requires O(n) total space for all nodes in the macro tree. Let $\mathsf {mt}$ denote any micro tree. We compute the ranks of all nodes in a pre-order traversal in $\mathsf {mt}$. Let $a \in \varSigma $ be any character such that there is a node V in $\mathsf {mt}$ that has a hard W-link $\mathcal{W}_{a}({V})$. Let $\mathsf {P}_a^{\mathsf {mt}}$ denote an array that stores a sorted list of pre-order ranks of nodes V in $\mathsf {mt}$ that have hard W-links for character a. Hence the size of $\mathsf {P}_a^{\mathsf {mt}}$ is equal to the number of nodes in $\mathsf {mt}$ that have hard W-links for character a. For all such characters a, we store $\mathsf {P}_a^{\mathsf {mt}}$ in $\mathsf {mt}$. The total size of these arrays for all the micro trees is O(n).

Let $a \in \varSigma $ be any character, and V any node in $\mathsf {STree}(\mathsf {T}_{\mathsf {b}})$ which does not have a hard W-link for a. We wish to know if V has a soft W-link for a, and if so, we want to retrieve the target node of this link. Let $\mathsf {mt}$ denote the micro-tree that V belongs to. Consider the case where V is not the root R of $\mathsf {mt}$, since otherwise $\mathcal{W}_{a}({V})$ is explicitly stored. If $\mathcal{W}_{a}({R})$ is nil, then by Fact 3-(a) no nodes in the micro tree has W-links for character a. Otherwise (if $\mathcal{W}_{a}({R})$ exists), then we can find $\mathcal{W}_{a}({W})$ as follows:

(A)
If the predecessor P of V exists in $\mathsf {P}_a^{\mathsf {mt}}$ and P is an ancestor of V, then we follow the hard W-link $\mathcal{W}_{a}({P})$ from P. Let $Q = \mathcal{W}_{a}({P})$, and c be the first character in the path from P to V.
1. (i)
  If Q has an out-going edge whose label begins with c, the child of Q below this edge is the destination of the soft W-link $\mathcal{W}_{a}({V})$ from V for a.
2. (ii)
  Otherwise, then there is no W-link from V for a.
(B)
Otherwise, $\mathcal{W}_{a}({R})$ from the root R of $\mathsf {mt}$ is a soft W-link, which is explicitly stored. We follow it and let $U = \mathcal{W}_{a}({R})$.
1. (i)
  If $Z = \textit{slink}(U)$ is a descendant of V, then U is the destination of the soft W-link $\mathcal{W}_{a}({V})$ from V for a.
2. (ii)
  Otherwise, then there is no W-link from V for a.

The correctness of this algorithm follows from Fact 3-(e). Since each micro-tree contains $O(\sigma )$ nodes, the size of $\mathsf {P}_a^{\mathsf {mt}}$ is $O(\sigma )$ and thus the predecessor P of V in $\mathsf {P}_a^{\mathsf {mt}}$ can be found in $O(\log \sigma )$ time by binary search. We can check if one node is an ancestor of the other node (or vice versa) in O(1) time, after standard O(n)-time preprocessing over the whole suffix tree. Hence, this algorithm simulates soft W-link $\mathcal{W}_{a}({V})$ in $O(\log \sigma )$ time.

Lemma 1

Given a backward trie $\mathsf {T}_{\mathsf {b}}$ with n nodes, we can compute $\mathsf {STree}(\mathsf {T}_{\mathsf {b}})$ with all hard W-links in O(n) time and space.

Lemma 2

We can compute, in O(n) time and space, all W-links of the macro tree nodes and the arrays $\mathsf {P}_a^{\mathsf {mt}}$ for all the micro trees $\mathsf {mt}$ and characters $a \in \varSigma $.

Proof

We perform a pre-order traversal on each micro tree $\mathsf {mt}$. At each node V visited during the traversal, we append the pre-order rank of V to array $\mathsf {P}_a^{\mathsf {mt}}$ iff V has a hard W-link $\mathcal{W}_{a}({V})$ for character a. Since the size of $\mathsf {mt}$ is $O(\sigma )$ and since we have assumed an integer alphabet $[1..\sigma ]$, we can compute $\mathsf {P}_a^{\mathsf {mt}}$ for all characters a in $O(\sigma )$ time. It takes $O(\frac{n}{\sigma } \cdot \sigma ) = O(n)$ time for all micro trees.

The preprocessing for the macro tree consists of two steps. Firstly, we need to compute soft W-links from the macro tree nodes (recall that we have already computed hard W-links from the macro tree nodes by Lemma 1). For this sake, in the above preprocessing for micro trees, we additionally pre-compute the successor of the root R of each micro tree $\mathsf {mt}$ in each non-empty array $\mathsf {P}_a^{\mathsf {mt}}$. By Fact 3-(d), this successor corresponds to the unique descendant of R that has a hard W-link for character a. As above, this preprocessing also takes $O(\sigma )$ time for each micro tree, resulting in O(n) total time. Secondly, we perform a bottom-up traversal on the macro tree. Our basic strategy is to “propagate” the soft W-links in a bottom up fashion from lower nodes to upper nodes in the macro tree (recall that these macro tree nodes are the roots of micro trees). In so doing, we first compute the soft W-links of the macro tree leaves. By Fact 3-(c) and -(e), this can be done in $O(\sigma )$ time for each leaf using the successors computed above. Then we propagate the soft W-links to the macro tree internal nodes. The existence of soft W-links of internal nodes computed in this way is justified by Fact 3-(a), however, the destinations of some soft W-links of some macro tree internal nodes may not be correct. This can happen when the corresponding micro trees contain hard W-links (due to Fact 3-(e)). These destinations can be modified by using the successors of the roots computed in the first step, again due to Fact 3-(e). Both of our propagation and modification steps take $O(\sigma )$ time for each macro tree node of size $O(\sigma )$, and hence, it takes a total of O(n) time. $\square $

Theorem 9

Given a forward trie $\mathsf {T}_{\mathsf {f}}$ of size n over an integer alphabet $\varSigma = [1..\sigma ]$ with $\sigma = O(n)$, we can construct an O(n)-space representation of $\mathsf {DAWG}(\mathsf {T}_{\mathsf {f}})$ in O(n) time and working space.

References

Alstrup, S., Secher, J.P., Spork, M.: Optimal on-line decremental connectivity in trees. Inf. Process. Lett. 64(4), 161–164 (1997)
Article MathSciNet Google Scholar
Apostolico, A., Lonardi, S.: Off-line compression by greedy textual substitution. Proc. IEEE 88(11), 1733–1744 (2000)
Article Google Scholar
Belazzougui, D., Cunial, F.: Fast label extraction in the CDAWG. In: Fici, G., Sciortino, M., Venturini, R. (eds.) SPIRE 2017. LNCS, vol. 10508, pp. 161–175. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-67428-5_14
Chapter MATH Google Scholar
Bender, M.A., Farach-Colton, M.: The level ancestor problem simplified. Theor. Comput. Sci. 321(1), 5–12 (2004)
Article MathSciNet Google Scholar
Blumer, A., et al.: The smallest automaton recognizing the subwords of a text. Theor. Comput. Sci. 40, 31–55 (1985)
Article MathSciNet Google Scholar
Blumer, A., Blumer, J., Haussler, D., Mcconnell, R., Ehrenfeucht, A.: Complete inverted files for efficient text retrieval and analysis. J. ACM 34(3), 578–595 (1987)
Article MathSciNet Google Scholar
Breslauer, D.: The suffix tree of a tree and minimizing sequential transducers. Theor. Comput. Sci. 191(1–2), 131–144 (1998)
Article MathSciNet Google Scholar
Breslauer, D., Italiano, G.F.: Near real-time suffix tree construction via the fringe marked ancestor problem. J. Discrete Algorithms 18, 32–48 (2013)
Article MathSciNet Google Scholar
Crochemore, M.: Transducers and repetitions. Theor. Comput. Sci. 45(1), 63–86 (1986)
Article MathSciNet Google Scholar
Crochemore, M., Rytter, W.: Text Algorithms. Oxford University Press, Cambridge (1994)
Google Scholar
Ferragina, P., Luccio, F., Manzini, G., Muthukrishnan, S.: Compressing and indexing labeled trees, with applications. J. ACM 57(1), 4:1–4:33 (2009)
Google Scholar
Fischer, J., Gawrychowski, P.: Alphabet-dependent string searching with wexponential search trees. In: Proceedings of the 26th Annual Symposium on Combinatorial Pattern Matching, CPM, pp. 160–171 (2015). Full version: https://arxiv.org/abs/1302.3347
Fujishige, Y., Tsujimaru, Y., Inenaga, S., Bannai, H., Takeda, M.: Computing DAWGs and minimal absent words in linear time for integer alphabets. In: Proceedings of the 41st International Symposium on Mathematical Foundations of Computer Science, MFCS, pp. 38:1–38:14 (2016)
Google Scholar
Gawrychowski, P.: Simple and efficient LZW-compressed multiple pattern matching. J. Discrete Algorithms 25, 34–41 (2014)
Article MathSciNet Google Scholar
Gusfield, D.: Algorithms on Strings, Trees, and Sequences. Cambridge University Press (1997)
Google Scholar
Inenaga, S.: Suffix trees, DAWGs and CDAWGs for forward and backward tries. arXiv e-prints p. 1904.04513 (2019). http://arxiv.org/abs/1904.04513
Inenaga, S., Hoshino, H., Shinohara, A., Takeda, M., Arikawa, S.: Construction of the CDAWG for a trie. In: Proceedings of the Prague Stringology Conference, PSC 2001, pp. 37–48 (2001)
Google Scholar
Kimura, D., Kashima, H.: Fast computation of subpath kernel for trees. In: Proceedings of the 29th International Conference on Machine Learning, ICML (2012)
Google Scholar
Kosaraju, S.R.: Efficient tree pattern matching (preliminary version). In: Proceedings of the 30th Annual Symposium on Foundations of Computer Science, FOCS, pp. 178–183 (1989)
Google Scholar
Kucherov, G., Rusinowitch, M.: Matching a set of strings with variable length don’t cares. Theor. Comput. Sci. 178(1–2), 129–154 (1997)
Article MathSciNet Google Scholar
Mäkinen, V., Belazzougui, D., Cunial, F., Tomescu, A.I.: Genome-Scale Algorithm Design: Biological Sequence Analysis in the Era of High-Throughput Sequencing. Cambridge University Press, Cambridge (2015)
Google Scholar
Mohri, M., Moreno, P.J., Weinstein, E.: General suffix automaton construction algorithm and space bounds. Theor. Comput. Sci. 410(37), 3553–3562 (2009)
Article MathSciNet Google Scholar
Muthukrishnan, S.: Efficient algorithms for document retrieval problems. In: Proceedings of the 13th Annual ACM-SIAM Symposium on Discrete Algorithms, SODA, pp. 657–666 (2002)
Google Scholar
Nakashima, Y., I, T., Inenaga, S., Bannai, H., Takeda, M.: The position heap of a Trie. In: Calderón-Benavides, L., González-Caro, C., Chávez, E., Ziviani, N. (eds.) SPIRE 2012. LNCS, vol. 7608, pp. 360–371. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-34109-0_38
Chapter Google Scholar
Shibuya, T.: Constructing the suffix tree of a tree with a large alphabet. IEICE Trans. E86-A(5), 1061–1066 (2003)
Google Scholar
Weiner, P.: Linear pattern-matching algorithms. In: Proceedings of the 14th IEEE Annual Symposium on Switching and Automata Theory, pp. 1–11 (1973)
Google Scholar
Yamamoto, J., Tomohiro, I., Bannai, H., Inenaga, S., Takeda, M.: Faster compact on-line Lempel-Ziv factorization. In: Proceedings of the 31st International Symposium on Theoretical Aspects of Computer Science, STACS, pp. 675–686 (2014)
Google Scholar

Download references

Author information

Authors and Affiliations

Department of Informatics, Kyushu University, Fukuoka, Japan
Shunsuke Inenaga
PRESTO, Japan Science and Technology Agency, Kawaguchi, Japan
Shunsuke Inenaga

Authors

Shunsuke Inenaga
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Shunsuke Inenaga .

Editor information

Editors and Affiliations

University of São Paulo, São Paulo, Brazil
Yoshiharu Kohayakawa
University of Campinas, Campinas, Brazil
Flávio Keidi Miyazawa

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Inenaga, S. (2020). Suffix Trees, DAWGs and CDAWGs for Forward and Backward Tries. In: Kohayakawa, Y., Miyazawa, F.K. (eds) LATIN 2020: Theoretical Informatics. LATIN 2021. Lecture Notes in Computer Science(), vol 12118. Springer, Cham. https://doi.org/10.1007/978-3-030-61792-9_16

Download citation

DOI: https://doi.org/10.1007/978-3-030-61792-9_16
Published: 03 December 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-61791-2
Online ISBN: 978-3-030-61792-9
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics

Suffix Trees, DAWGs and CDAWGs for Forward and Backward Tries

Abstract

Similar content being viewed by others

On Suffix Tree Breadth

Suffix Tree of Alignment: An Efficient Index for Similar Data

Relative Lempel-Ziv Compression of Suffix Arrays

1 Introduction

2 Preliminaries

Fact 1

3 Maximal Substrings in Forward/Backward Tries

Fact 2

4 Indexing Forward/Backward Tries and Known Bounds

4.1 Suffix Trees for Forward Tries

Theorem 1

4.2 Suffix Trees for Backward Tries

Theorem 2

4.2.1 Suffix Links and Weiner Links:

4.3 DAWGs for Forward Tries

Theorem 3

4.4 DAWGs for Backward Tries

4.4.1 Symmetries Between Suffix Trees and DAWGs:

4.5 CDAWGs for Forward Tries

Theorem 4

4.6 CDAWGs for Backward Tries

5 New Size Bounds on Indexing Forward/Backward Tries

5.1 Size Bounds for DAWGs for Forward/Backward Tries

Theorem 5

Theorem 6

Proof

Corollary 1

5.2 Size Bounds for CDAWGs for Forward/Backward Tries

Theorem 7

Proof

Theorem 8

Proof

6 Constructing O(n)-size Representation of \(\mathsf {DAWG}(\mathsf {T}_{\mathsf {f}})\) in O(n) Time

Fact 3

Lemma 1

Lemma 2

Proof

Theorem 9

References

Author information

Authors and Affiliations

Corresponding author

Editor information

Editors and Affiliations

Rights and permissions

Copyright information

About this paper

Cite this paper

Download citation

Share this paper

Publish with us

Search

Navigation