1 Introduction

The eXtended Burrows-Wheeler Transform (XBWT) [8, 9] can be used to compactly represent a trie by a character array L and a bit array \(\mathsf {Last}\); see Fig. 1 for an example. A recent empirical comparison [19] of string dictionary implementations shows that the XBWT achieves the best compression of all techniques under consideration. Moreover, in contrast to most other methods, the XBWT supports substring searches. The compact representation of the XBWT can be computed as follows. Each internal node v of the trie T is associated with a string that is obtained by concatenating the characters at the edges in the upward path from v to the root of T (the root itself is associated with the empty string \(\varepsilon \)). If T has n internal nodes, then there are n associated strings and the (virtual) array \(\varPi [1..n]\) stores them in lexicographical order. We (conceptually) number the internal nodes of T according to \(\varPi \): If node v is associated with the string \(\varPi [i]\), it gets the number i. Let \(L_i\) be the set of characters at outgoing edges of node i (in no particular order) and let the character array L contain the concatenation of \(L_1,L_2,\dots ,L_n\). Furthermore, the bit array \(\mathsf {Last}\) stores the borders of \(L_i\): we initialize \(\mathsf {Last}\) with zeros and for all \(i \in \{1,\dots ,n\}\) we set \(\mathsf {Last}[j_i] =1\), where \(j_i = \sum _{\ell =1}^i |L_\ell |\). As already mentioned, the XBWT representation of the trie consists of the arrays L and \(\mathsf {Last}\). These arrays can be calculated with the help of an array \(\mathsf {MR}\), which we will define next.
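
To make these definitions concrete, the following Python sketch (our illustration, not code from the paper) computes \(\varPi \), L, and \(\mathsf {Last}\) directly from the input strings. Word ends are modelled as outgoing \(\$ \)-edges, as in Fig. 1, and each \(L_i\) is emitted in sorted order purely to make the output deterministic (the definition leaves the order within \(L_i\) open).

```python
def xbwt(strings):
    """Compute Pi, L, and Last of the trie of `strings` (naive, for illustration)."""
    # The internal nodes of the trie correspond to the distinct
    # prefixes of the input strings (the empty prefix is the root).
    prefixes = {x[:i] for x in strings for i in range(len(x) + 1)}
    # Pi[i] is the reversed root-to-node path; Pi is sorted lexicographically.
    Pi = sorted(p[::-1] for p in prefixes)
    L, Last = [], []
    for z in Pi:
        p = z[::-1]  # downward path of the node associated with z
        # Characters on outgoing edges: continuations of p in some input string.
        out = {x[len(p)] for x in strings if x.startswith(p) and len(x) > len(p)}
        if p in strings:
            out.add("$")  # word end, modelled as a $-edge
        L.extend(sorted(out))
        Last.extend([0] * (len(out) - 1) + [1])
    return Pi, L, Last
```

For the strings of Fig. 1 this yields \(n = 8\) internal nodes and an array L of length 11.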

Suppose the trie T is constructed from the pairwise distinct strings \(x_1,\dots ,x_k\). Let \(y_i = x^R_i\), where \(x^R_i\) denotes the string that is obtained by reversing \(x_i\). Furthermore, let \(S=y_1\$y_2\$ \dots y_k\$ \) be the concatenation of the \(y_i\), separated by a special character \(\$ \), which is assumed to be smaller than any other character. In the following, let m be the length of S (note that \(m = k+ \sum _{i=1}^k |x_i|\)). The suffix array \(\mathsf {SA}\) and the Burrows-Wheeler Transform \(\mathsf {BWT}\) of S are obtained by sorting the suffixes of S lexicographically (this can be done in linear time): If S[j..m] is the i-th lexicographically smallest suffix of S, then \(\mathsf {SA}[i]=j\) and \(\mathsf {BWT}[i]=S[j-1]\) is the character preceding that suffix (if \(j=1\), then \(\mathsf {BWT}[i]=\$ \)); see Fig. 1 for an example. Fast implementations of (semi-) external suffix sorting algorithms exist [6, 16], but multi-string \(\mathsf {BWT}\) construction algorithms may be competitive in the context of this paper; see [3, 17]. The array \(\mathsf {MR}\) is defined with the help of suffix array intervals. If \(\omega \) is a substring of S, then the \(\omega \)-interval is the largest interval [i..j] such that \(\omega \) is a prefix of all the suffixes in the interval [i..j]. Now \(\mathsf {MR}[lb]=1\) if and only if lb is the left boundary of a z-interval, where z is a suffix of some \(y_i\) (i.e., \(z^R\) is a prefix of some \(x_i\)). Note that \(\mathsf {MR}[1]=1\) because \(\varepsilon \) is a suffix of all \(y_i\) and [1..m] is the \(\varepsilon \)-interval. To avoid case distinctions, we set \(\mathsf {MR}[m+1]=1\). Let \(j_1=1< j_2< \dots< j_n < j_{n+1} = m+1\) be the indices with \(\mathsf {MR}[j_\ell ]=1\). For each i with \(1\le i \le n\), the interval \([j_i..j_{i+1}-1]\) is the \(\varPi [i]\)-interval. 
Thus \(L_i\) is the set of characters in \(\mathsf {BWT}[j_i..j_{i+1}-1]\) and \(\mathsf {Last}[p_i] =1\), where \(p_i = \sum _{\ell =1}^i |L_\ell |\). It is readily verified that the arrays L and \(\mathsf {Last}\) can be computed in O(m) time by simultaneously scanning the arrays \(\mathsf {MR}\) and \(\mathsf {BWT}\) from left to right.
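
These definitions can be checked with a small Python sketch (ours; quadratic suffix sorting, for illustration only) that builds S, \(\mathsf {SA}\), \(\mathsf {BWT}\), and \(\mathsf {MR}\) for the example of Fig. 1. Note that ASCII already orders \(\$ \) before all letters.

```python
def sa_bwt_mr(strings):
    """Naive construction of S, SA, BWT, and MR (1-indexed results)."""
    S = "".join(x[::-1] + "$" for x in strings)  # concatenated reversed strings
    m = len(S)
    # Suffix array: starting positions of the suffixes in lexicographic order.
    sa = sorted(range(1, m + 1), key=lambda j: S[j - 1:])
    # BWT[i] is the character preceding the i-th smallest suffix ($ if none).
    bwt = "".join(S[j - 2] if j > 1 else "$" for j in sa)
    # MR[lb] = 1 iff lb is the left boundary of a z-interval, where z is a
    # suffix of some y_i, i.e., a reversed prefix of some input string.
    zs = {x[:i][::-1] for x in strings for i in range(len(x) + 1)}
    mr = [0] * (m + 1)  # mr[1..m]; mr[0] is unused
    for z in zs:
        mr[next(i for i, j in enumerate(sa, 1) if S[j - 1:].startswith(z))] = 1
    return S, sa, bwt, mr
```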

Fig. 1. Example for the input strings \(\texttt {ab},\texttt {ac},\texttt {bac},\texttt {aba}\). Left: XBWT consisting of the arrays \(\mathsf {Last}\) and L (the array \(\varPi \) is not stored). Center: Trie of the strings, where failure links of internal nodes are indicated by dashed arrows. Right: \(\mathsf {MR}\), \(C_c\), and \(\mathsf {BWT}\) for the concatenation of the reversed strings (i.e., \(S=\mathtt {ba\$ca\$cab\$aba\$}\)).

Recall that node i in the trie T is associated with the string \(\varPi [i]\). In this context, the failure link of i points to the node j such that \(\varPi [j]\) is the longest proper prefix of \(\varPi [i]\) that occurs in \(\varPi \). Failure links are not supported by the XBWT representation of T, but Manzini [18] showed that a balanced parentheses sequence P can be used to support them in constant time with only \(2n+o(n)\) bits of space. P can be defined by means of \(\varPi \): for \(i=1,\dots ,n\), repeat the following two steps: (1) for each \(\ell < i\) whose closing parenthesis has not been written yet and for which \(\varPi [\ell ]\) is not a prefix of \(\varPi [i]\), write a closing parenthesis; (2) write the opening parenthesis for i. After termination of this for-loop, write a closing parenthesis for each \(\ell \) whose closing parenthesis has not been written yet. In the example of Fig. 1, we have \(P= (((()))(())(()))\). P can be preprocessed in linear time, using only o(n) bits, so that the operations rank, select, and enclose can be supported in constant time [7, 15, 20]. Using these operations on P, failure links can be supported in constant time; see [18, Lemma 4] for details. Manzini [18] devised two different algorithms that construct P. In the next section, we suggest an alternative way of constructing P that outperforms his algorithms.
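
This definition of P translates directly into a short program. The following Python sketch (our illustration) keeps a stack of nodes whose closing parenthesis is still pending; since \(\varPi \) is sorted, the open nodes always form a chain of prefixes, so popping from the stack suffices.

```python
def build_p(strings):
    """Build the balanced parentheses sequence P from the (virtual) array Pi."""
    prefixes = {x[:i] for x in strings for i in range(len(x) + 1)}
    Pi = sorted(p[::-1] for p in prefixes)
    p_seq, pending = [], []  # pending: node indices whose ')' is not yet written
    for i, z in enumerate(Pi):
        # (1) close every earlier node whose string is not a prefix of Pi[i]
        while pending and not z.startswith(Pi[pending[-1]]):
            p_seq.append(")")
            pending.pop()
        # (2) write the opening parenthesis for node i
        p_seq.append("(")
        pending.append(i)
    p_seq.extend(")" * len(pending))  # close all remaining nodes
    return "".join(p_seq)
```

For the strings of Fig. 1 this returns exactly the sequence \((((()))(())(()))\) given above.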

2 The New Algorithm

Our new construction algorithm uses an idea of Belazzougui [4], who devised a rather simple method to build the balanced parenthesis representation of a suffix tree topology. He writes: “Our key observation is that we can easily build a balanced parenthesis representation by enumerating the suffix array intervals. More precisely for every position in [1..n], we associate two counters, one for open and the other for close parentheses implemented through two arrays of counters \(C_o[1..n]\) and \(C_c[1..n]\). Then given a suffix array interval [i..j] we will simply increment the counters \(C_o[i]\) and \(C_c[j]\). Then we scan the counters \(C_c\) and \(C_o\) in parallel and for each i from 1 to n, write \(C_o[i]\) opening parentheses followed by \(C_c[i]\) closing parentheses. It is easy to see that the constructed sequence is that of the balanced parentheses of the suffix tree.” Since we do not want to represent a suffix tree topology, we cannot enumerate all suffix array intervals. Instead, we must enumerate all z-intervals for which z is a suffix of some \(y_i\) (for then \(z^R\) is a prefix of some \(x_i\)). Recall that \(\mathsf {MR}[lb]=1\) if and only if lb is the left boundary of such a z-interval. Consequently, the array \(\mathsf {MR}[1..m]\) coincides with the array \(C_o[1..m]\). Moreover, observe that if z is a suffix of some \(y_i\), then the left boundary \(b_{z}\) of the z-interval in the suffix array of S coincides with the left boundary \(b_{z\$}\) of the \(z\$ \)-interval because \(z\$ \) is a substring of S and \(\$ \) is the smallest character.
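
Belazzougui's counting scheme is easy to state in code. The sketch below (ours, for illustration) takes a family of intervals, fills the two counter arrays, and emits the parentheses; the example data are the eight z-intervals of the running example (for \(z = \varepsilon, \texttt {a}, \texttt {ab}, \texttt {aba}, \texttt {b}, \texttt {ba}, \texttt {ca}, \texttt {cab}\)), read off the suffix array of \(S=\mathtt {ba\$ca\$cab\$aba\$}\).

```python
def bp_from_intervals(intervals, m):
    """Build a balanced parentheses sequence from a laminar family of intervals."""
    c_open = [0] * (m + 1)   # 1-indexed counters for opening parentheses
    c_close = [0] * (m + 1)  # ... and for closing parentheses
    for i, j in intervals:
        c_open[i] += 1
        c_close[j] += 1
    # Scan left to right: opening before closing parentheses at each position.
    return "".join("(" * c_open[i] + ")" * c_close[i] for i in range(1, m + 1))

# z-intervals of the running example, ordered as eps, a, ab, aba, b, ba, ca, cab:
example_intervals = [(1, 14), (5, 9), (8, 9), (9, 9),
                     (10, 12), (11, 12), (13, 14), (14, 14)]
```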

For the explanation of the pseudo-code of our new construction algorithm (Algorithm 1), we need a few preliminaries. For each character c, C[c] is the overall number of occurrences of characters in \(\mathsf {BWT}[1..m]\) that are strictly smaller than c. Given the \(\omega \)-interval [lb..rb] and a character c, the \(c\omega \)-interval [i..j] can be computed by \(i= C[c] + rank_c(\mathsf {BWT},lb-1)+1\) and \(j= C[c] + rank_c(\mathsf {BWT},rb)\), where \(rank_c(\mathsf {BWT},lb-1)\) returns the number of occurrences of character c in the prefix \(\mathsf {BWT}[1..lb-1]\) (we have \(i\le j\) if \(c\omega \) is a substring of S; otherwise \(i > j\)); see [10] for details. The (balanced) wavelet tree [14] of the \(\mathsf {BWT}\) supports such a backward search step in \(O(\log \sigma )\) time, where \(\sigma \) is the size of the alphabet. Backward search can be generalized on the wavelet tree as follows: Given an \(\omega \)-interval [lb..rb], a slight modification of the procedure getIntervals([lb..rb]) described in [5] returns the list \([(c,[i..j]) \mid c\omega \text { is a substring of } S \text { and } [i..j] \text { is the } c\omega \text {-interval}]\), where the first component of an element (c, [i..j]) is a character. The worst-case time complexity of the procedure getIntervals is \(O(occ + occ \cdot \log (\sigma /occ))\), where occ is the number of elements in the output list; see [12, Lemma 3].
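
The following Python sketch (ours) mimics a backward search step and the procedure getIntervals, with a naive scan of the \(\mathsf {BWT}\) standing in for the wavelet tree; the example values are taken from Fig. 1.

```python
def rank(bwt, c, i):
    """Number of occurrences of c in bwt[1..i] (naive; a wavelet tree
    would answer this in O(log sigma) time)."""
    return bwt[:i].count(c)

def occ_table(bwt):
    """C[c] = number of characters in the BWT that are strictly smaller than c."""
    return {c: sum(1 for d in bwt if d < c) for c in set(bwt)}

def step(C, bwt, c, lb, rb):
    """One backward search step: from the w-interval [lb..rb] to the
    cw-interval [i..j] (empty if i > j)."""
    i = C[c] + rank(bwt, c, lb - 1) + 1
    j = C[c] + rank(bwt, c, rb)
    return i, j

def get_intervals(C, bwt, lb, rb):
    """All pairs (c, cw-interval) for the w-interval [lb..rb], sorted by c."""
    return [(c, step(C, bwt, c, lb, rb)) for c in sorted(set(bwt[lb - 1:rb]))]

# Example from Fig. 1: BWT of S = ba$ca$cab$aba$.
bwt_example = "abaabbcc$aa$$$"
C_example = occ_table(bwt_example)
```

Starting from the \(\$ \)-interval [1..4], get_intervals reports the \(\texttt {a}\$ \)-interval [5..7] and the \(\texttt {b}\$ \)-interval [10..10].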

[Algorithm 1: pseudo-code of the function visit (figure)]

If \(z = \varepsilon \) is the empty string, then the z-interval is \([b_{z}..e_{z}] = [1..m]\) and the \(z\$ \)-interval is \([b_{z\$}..e_{z\$}] = [1..k]\). The function call visit(1, k, m) computes the arrays \(\mathsf {MR}= C_o\) and \(C_c\); the pseudo-code of this function can be found in Algorithm 1. The function first counts an opening parenthesis at position \(b_{z} = b_{z\$}\) and a closing parenthesis at position \(e_{z}\). With the help of the procedure getIntervals it then computes all non-empty \(cz\$ \)-intervals, where \(c\in \varSigma \) and \(c\ne \$ \). The fact that a \(cz\$ \)-interval \([b_{cz\$}..e_{cz\$}]\) is not empty means that cz is a suffix of some \(y_i\). Consequently, the cz-interval \([b_{cz}..e_{cz}]\) is also not empty. Again, \(b_{cz} = b_{cz\$}\) holds true, but the right boundary \(e_{cz}\) of the cz-interval is not known yet. Now there are two cases. If the right boundaries of the z-interval and the \(z\$ \)-interval coincide, then so do the right boundaries \(e_{cz}\) and \(e_{cz\$}\) of the cz-interval and the \(cz\$ \)-interval (in this case every occurrence of z in S is followed by \(\$ \), hence so is every occurrence of cz). If they do not coincide, \(e_{cz}\) must be computed by evaluating \(C[c]+rank_c(\mathsf {BWT},e_{z})\) as in backward search. Finally, the function recursively calls itself with the new parameters \(b_{cz\$},e_{cz\$},e_{cz}\).
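
Putting the pieces together, here is a runnable Python sketch of Algorithm 1 (ours; naive suffix sorting and a naive rank replace the linear-time suffix array construction and the wavelet tree, so it illustrates the control flow, not the complexity bounds).

```python
def xbwt_parentheses(strings):
    """Compute P by enumerating all z$-intervals with backward search."""
    S = "".join(x[::-1] + "$" for x in strings)
    m = len(S)
    sa = sorted(range(1, m + 1), key=lambda j: S[j - 1:])
    bwt = "".join(S[j - 2] if j > 1 else "$" for j in sa)
    C = {c: sum(1 for d in bwt if d < c) for c in set(bwt)}
    rank = lambda c, i: bwt[:i].count(c)
    c_open = [0] * (m + 1)   # 1-indexed; c_open coincides with MR
    c_close = [0] * (m + 1)

    def get_intervals(lb, rb):
        return [(c, C[c] + rank(c, lb - 1) + 1, C[c] + rank(c, rb))
                for c in sorted(set(bwt[lb - 1:rb]))]

    def visit(b_zd, e_zd, e_z):
        # [b_zd..e_zd] is the z$-interval, e_z the right end of the z-interval;
        # the left ends coincide because $ is the smallest character.
        c_open[b_zd] += 1
        c_close[e_z] += 1
        for c, lb, rb in get_intervals(b_zd, e_zd):
            if c == "$":
                continue
            # The right ends coincide iff every occurrence of z is followed by $.
            e_cz = rb if e_z == e_zd else C[c] + rank(c, e_z)
            visit(lb, rb, e_cz)

    visit(1, len(strings), m)  # z = eps: z-interval [1..m], z$-interval [1..k]
    return "".join("(" * c_open[i] + ")" * c_close[i] for i in range(1, m + 1))
```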

The overall time complexity of the construction of P is \(O(m \log \sigma )\) because the \(\mathsf {BWT}\) can be built in O(m) time, the wavelet tree of the \(\mathsf {BWT}\) can be constructed in \(O(m \log \sigma )\) time, initialization and computation of the arrays \(\mathsf {MR}\) and \(C_c\) take \(O(m + n \log \sigma )\) time (n is the number of internal nodes of the trie and satisfies \(n \le m\)), and the computation of P based on \(\mathsf {MR}\) and \(C_c\) requires O(m) time.

Let us consider the working space of Algorithm 1. By the definition of P, there are at most \(\max _i |x_i|\) consecutive closing parentheses, thus the array \(C_c\) requires \(m \log (\max _i |x_i|)\) bits. The array \(\mathsf {MR}\) occupies only m bits and the wavelet tree of the \(\mathsf {BWT}\) essentially uses \(m \lceil \log \sigma \rceil +o(m \log \sigma )\) bits of space; see e.g. [21]. The stack for the recursion contains (at any point in time) at most \(\max _i |x_i|\) elements. Each stack element stores a list returned by the procedure getIntervals; this list contains at most \(\sigma \) elements of the form (c, [lb..rb]). Since every list element requires O(1) space, the whole stack uses \(O(\sigma \cdot \max _i |x_i|)\) space.

3 Saving Space

[Algorithm 2: pseudo-code of the space-saving variant (function visit2, figure)]

As already observed by Manzini [18], the number n of ones in the bit array \(\mathsf {MR}\) equals the number of internal nodes of the trie. If one computes \(\mathsf {MR}\) in a first phase (for instance, by the algorithm in [18, Fig. 4]), then n is known and more space-efficient algorithms for computing P can be derived. Manzini suggests using two arrays \(\mathsf {RCP}'\) and \(\mathsf {LEN}'\) of length n that use \(O(n \log (\max _i |x_i|))\) bits of memory and store only values for which the corresponding entry in the array \(\mathsf {MR}\) equals 1. His algorithm [18, Fig. 5] calculates P based on these arrays. We would like to follow this approach, but the example from Fig. 1 shows that there are non-zero entries \(C_c[i]\) for which \(\mathsf {MR}[i]=0\) (\(i=12\) in Fig. 1). We next derive a version of Algorithm 1 that increments only counters at indices i for which \(\mathsf {MR}[i]=1\). To distinguish the new version from Algorithm 1, we use \(\hat{C}_c\) to denote the array of counters (which is still of size m). Recall that Algorithm 1 increments \(C_c[e_{z}]\) by one, where \(e_{z}\) is the right boundary of a z-interval. The new version increments \(\hat{C}_c[j]\) instead, where \(j = \max \{ i \mid i\le e_z \text { and } \mathsf {MR}[i]=1\}\). In other words, if \(\mathsf {MR}[e_{z}]=1\), it increments \(\hat{C}_c[e_{z}]\); if \(\mathsf {MR}[e_{z}]=0\), it increments the counter at the position of the preceding 1 in \(\mathsf {MR}\). In the example from Fig. 1 it would increment the counter at index 11 instead of that at \(i=12\). To see that this preserves correctness, consider two indices i and j such that \(\mathsf {MR}[i]=1\), \(\mathsf {MR}[j]=1\), and \(\mathsf {MR}[k]=0\) for all k with \(i< k < j\) (the case in which i is the last index with \(\mathsf {MR}[i]=1\) follows similarly).
On the one hand, if we use the array \(C_c\), an opening parenthesis will be written for \(\mathsf {MR}[i]=1\), followed by \(\sum _{k=i}^{j-1} C_c[k]\) closing parentheses, and then an opening parenthesis will be written for \(\mathsf {MR}[j]=1\). On the other hand, if we use the array \(\hat{C}_c\), an opening parenthesis will be written for \(\mathsf {MR}[i]=1\), followed by \(\hat{C}_c[i]\) closing parentheses and an opening parenthesis for \(\mathsf {MR}[j]=1\). Since \(\hat{C}_c[i] = \sum _{k=i}^{j-1} C_c[k]\), both algorithms compute the same sequence of parentheses. Algorithm 2 implements this new version; however, it uses an array \(C_c'\) of length n and size \(n \log (\max _i |x_i|)\) bits instead of the array \(\hat{C}_c\) of length m. First, it computes the bit array \(\mathsf {MR}\) and then preprocesses it so that rank-queries can be answered in constant time. Then it calls the function visit2 with parameters 1, k, and m. Function visit2 can be obtained from function visit by deleting line 2 in Algorithm 1 and replacing the assignment in line 3 by \(C'_c[rank_1(\mathsf {MR},e_{z})] \leftarrow C'_c[rank_1(\mathsf {MR},e_{z})]+1\). That is, for a z-interval \([b_{z}..e_{z}]\), function visit2 increments \(C'_c[rank_1(\mathsf {MR},e_{z})]\) by one. This simulates the new version of Algorithm 1, in which \(\hat{C}_c[j]\) is incremented, where \(j = \max \{ i \mid i\le e_z \text { and } \mathsf {MR}[i]=1\}\).
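
A Python sketch of this variant (ours; \(\mathsf {MR}\) is computed naively in a first phase, and naive rank queries stand in for the constant-time rank structure) illustrates the two passes.

```python
def xbwt_parentheses2(strings):
    """Space-saving variant: close parentheses are charged, via rank queries
    on MR, to a counter array of length n instead of length m."""
    S = "".join(x[::-1] + "$" for x in strings)
    m = len(S)
    sa = sorted(range(1, m + 1), key=lambda j: S[j - 1:])
    bwt = "".join(S[j - 2] if j > 1 else "$" for j in sa)
    C = {c: sum(1 for d in bwt if d < c) for c in set(bwt)}
    rank = lambda c, i: bwt[:i].count(c)

    # First phase: compute MR (here naively from the reversed prefixes).
    zs = {x[:i][::-1] for x in strings for i in range(len(x) + 1)}
    mr = [0] * (m + 1)
    for z in zs:
        mr[next(i for i, j in enumerate(sa, 1) if S[j - 1:].startswith(z))] = 1
    rank1 = lambda i: sum(mr[1:i + 1])  # rank_1(MR, i), naive
    n = rank1(m)
    cc = [0] * (n + 1)  # C'_c of length n, 1-indexed

    # Second phase: like visit, but without opening counters and with the
    # closing counter charged to the preceding 1 in MR.
    def visit2(b_zd, e_zd, e_z):
        cc[rank1(e_z)] += 1
        for c in sorted(set(bwt[b_zd - 1:e_zd])):
            if c == "$":
                continue
            lb = C[c] + rank(c, b_zd - 1) + 1
            rb = C[c] + rank(c, e_zd)
            visit2(lb, rb, rb if e_z == e_zd else C[c] + rank(c, e_z))

    visit2(1, len(strings), m)
    # Emit P: one '(' per 1 in MR, followed by its charged ')' characters.
    return "".join("(" + ")" * cc[rank1(i)] for i in range(1, m + 1) if mr[i])
```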

In contrast to Algorithm 1, Algorithm 2 uses two passes to compute \(\mathsf {MR}\) and \(C'_c\) separately. That is, it saves space by using \(C'_c\) instead of \(C_c\), but the run-time doubles in practice (its time complexity is also \(O(m \log \sigma )\)).

4 Experimental Results

We experimentally compared our new XBWT construction algorithms with the ones presented in [18]. More precisely, we implemented the following algorithms, as we could not find an implementation of Manzini’s algorithms:

[Figure: the four implementations compared in the experiments — MAN, MAN-LW, OSB, and OSB-LW]

Our test data—the files dblp.xml, dna, proteins, english, and sources—originate from the Pizza & Chili corpus. In our experiments, we constructed tries for each of the files using the above-mentioned algorithms, where the distinct lines of a file were used as input strings for trie construction.

Table 1. Trie construction results. The left column lists test data along with its size and the length of its longest string. The other columns show, for each test case, the construction time in seconds and the memory peak during construction, excluding suffix array and BWT construction.

The experiments were conducted on a 64-bit Ubuntu 16.04.4 LTS system equipped with two 16-core Intel Xeon E5-2698v3 processors and 256 GB of RAM. All programs were compiled with the -O3 option using g++ (version 5.4.1). Our programs and the benchmark are publicly available. Table 1 shows the results of the experiments. Among all tested algorithms, OSB-LW has the lowest memory peak. Surprisingly, if the trie is built from long strings (dna), Algorithm MAN-LW requires a lot of memory, probably because of a stack that stores items consisting of several components. Algorithms MAN and OSB are the fastest construction methods, but despite its lower memory peak, OSB often outperforms MAN (we think this is caused by cache misses in MAN, which occur during accesses to the suffix array and the test data).

Summing up, our new algorithms OSB and OSB-LW outperform the algorithms MAN and MAN-LW in terms of memory consumption, and are similarly fast or even faster. As OSB requires only a little more memory than OSB-LW, but runs about as fast as MAN, algorithm OSB has a good space-time tradeoff and is therefore our method of choice for XBWT construction.

Our implementation is based on the sdsl-lite library [13] and we further tried to reduce the memory peak of our algorithms by using compressed wavelet trees supported by the sdsl-lite library. With Huffman-shaped wavelet trees that use rrr-bitvectors [13], it is possible to obtain a 25% reduction of the memory peak on average, but the construction time increases by a factor of 2.5 on average. It might be worth trying other compressed wavelet trees such as the one described in [2], but unfortunately its implementation contained in the sdsl-lite library lacks support for the procedure getIntervals.

[Algorithm 3: pseudo-code of the function dft (figure)]

5 Concluding Remark

Some readers may prefer to construct the balanced parentheses sequence P by means of a suffix tree, and of course this is possible. To this end, build the generalized suffix tree ST of the reversed input strings \(y_1, y_2, \dots , y_k\). In such a generalized suffix tree, all strings are either terminated by the same symbol \(\$ \) or by pairwise distinct symbols \(\$_1, \$_2, \dots , \$_k\); here, we use \(\$ \). Then traverse ST in a depth-first fashion, i.e., call function dft of Algorithm 3 with the root of ST as parameter. If an internal node v is visited during the traversal, the algorithm writes parentheses for that node only if v has an outgoing edge with label \(\$ \) because in this case the path from the root to v corresponds to a suffix of some \(y_i\). Moreover, leaves whose incoming edge has label \(\$ \) are ignored.
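
The following Python sketch (ours) makes this concrete with a naively built generalized suffix tree. One detail is our assumption rather than part of the description above: a leaf whose incoming edge is longer than \(\$ \) (its path spells \(z\$ \) for a suffix z of some \(y_i\) that occurs only once) contributes a matched pair of parentheses.

```python
def suffix_tree(strings):
    """Naive generalized suffix tree of `strings`, each suffix terminated by $.
    A node is a dict mapping edge labels to child nodes (None marks a leaf)."""
    suffixes = sorted({y[i:] + "$" for y in strings for i in range(len(y) + 1)})

    def build(strs):  # strs: sorted, pairwise distinct, none a prefix of another
        node, i = {}, 0
        while i < len(strs):
            j = i
            while j < len(strs) and strs[j][0] == strs[i][0]:
                j += 1  # group suffixes sharing the first character
            group = strs[i:j]
            if len(group) == 1:
                node[group[0]] = None  # leaf edge
            else:
                lcp = group[0]  # longest common prefix of the group
                for s in group[1:]:
                    while not s.startswith(lcp):
                        lcp = lcp[:-1]
                node[lcp] = build([s[len(lcp):] for s in group])
            i = j
        return node

    return build(suffixes)

def dft(node):
    """Depth-first traversal writing parentheses only for nodes that have an
    outgoing $-edge (leaf handling as assumed in the text above)."""
    has_dollar = "$" in node
    out = "(" if has_dollar else ""
    for label in sorted(node):
        child = node[label]
        if child is None:       # leaf
            if label != "$":    # leaves with incoming label $ are ignored
                out += "()"
        else:
            out += dft(child)
    return out + (")" if has_dollar else "")
```

Called on the reversed strings of Fig. 1, dft(suffix_tree(["ba", "ca", "cab", "aba"])) again yields \((((()))(())(()))\).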