
1 Introduction

The Burrows-Wheeler transform (BWT) was proposed by Burrows and Wheeler in 1994 [2] for data compression. Given an input string, its BWT is a permutation of the characters of this string: lexicographically sort all the suffixes, then the list of the characters preceding the sorted suffixes is the BWT. Equal characters tend to cluster into consecutive runs in a BWT, so it is easier to compress than the original string [6]; bzip2, for example, was designed around this property [1]. Besides data compression, the BWT was later applied to other problems such as genome alignment and the construction of compressed full-text indexes [7], and it has become an important data structure in modern bioinformatics. Algorithms for computing the BWT are either direct or indirect: the former compute the BWT from the input string itself, while the latter first compute an intermediate data structure such as the suffix array (SA) [5] and then derive the BWT from it. The SA can serve as a succinct replacement for suffix trees in many applications [10, 11]. This paper presents a lightweight external memory algorithm to compute the BWT from the SA, where "lightweight" means a small required workspace.

The input T is a size-n string over a constant alphabet Σ of size O(1). Let A and B be the suffix array and the BWT of T, respectively, where A is a size-n integer array containing the starting positions of all the suffixes of T in increasing lexicographical order, and B[i] = T[A[i] − 1] if A[i] > 0, or else B[i] = T[n − 1]. Using internal memory, B can be easily computed from A and T in O(n) time and O(1) workspace, where the workspace excludes the input and the output. In external memory, B can be computed from A in a workspace of 2n integers by the following straightforward method: (1) compute the inverse suffix array A^{−1} of T, a size-n integer array satisfying A^{−1}[i] = j if and only if A[j] = i; (2) scan A^{−1} to get the preceding character of each suffix; (3) sort the preceding characters by the ranks of their suffixes given by A^{−1}. This method is rather space consuming and becomes the space bottleneck in some existing applications such as a suffix array checker.
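
For concreteness, the in-memory computation of B from A and T can be sketched as follows. This is only an illustrative reference implementation of the definition above (function and variable names are ours), not the external memory algorithm proposed in this paper.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Direct application of the definition: B[i] = T[A[i] - 1] if A[i] > 0,
// otherwise B[i] = T[n - 1]. Runs in O(n) time with O(1) workspace
// besides the input and the output.
std::string bwt_from_sa(const std::string& T, const std::vector<std::int64_t>& A) {
    const std::size_t n = T.size();
    std::string B(n, '\0');
    for (std::size_t i = 0; i < n; ++i) {
        B[i] = (A[i] > 0) ? T[A[i] - 1] : T[n - 1];
    }
    return B;
}
```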

In addition to being constructed deterministically, a suffix array can also be constructed by a probabilistic method [4]. In this case, the resulting suffix array might be incorrect with a very small probability, so it has to be checked to ensure correctness. A suffix array is also usually checked after its construction to catch errors caused by program bugs, and a suffix array checker is commonly provided in today's open source software for SA construction. A key step in an SA checker is computing the BWT from the given SA and T, which uses the aforementioned straightforward method and hence constitutes the space bottleneck of the whole checking process. A lightweight approach with lower space consumption is needed to overcome this issue, which motivates the work reported here.

To the best of our knowledge, no study in the existing literature has been devoted to designing an external memory algorithm for computing the BWT from the SA. In this paper, we propose a lightweight external memory algorithm with linear I/O complexity, called SA2BWT, for computing the BWT from the SA in a workspace of at most n/2 integers. SA2BWT employs a divide-and-conquer method: it first splits the original problem into a number of sub-problems, each of which can be solved in internal memory, then solves each sub-problem in internal memory using induced sorting [9], and finally merges the sub-solutions in external memory to obtain the final result.

The rest of this paper is organized as follows. Section 2 presents the algorithm together with a complexity analysis. Section 3 illustrates the algorithm with a running example. Section 4 reports experiments evaluating the algorithm's time and space performance and discusses a further optimization. Section 5 concludes.

2 Our Solution

2.1 Basic Notations

The problem to be solved is: given a size-n input string T over an alphabet Σ with ||Σ|| = O(1) and its suffix array, compute the BWT of T in external memory with O(n) I/O complexity and a workspace of at most n/2 integers. Some basic notations are recapitulated from [9] for presentation convenience.

Let T[0, n) = T[0]T[1]…T[n − 1], where T[n − 1] is a sentinel $, the unique lexicographically smallest character in T, which does not belong to Σ. The suffix suf(T, i) runs from T[i] to the sentinel. It is classified as S-type if (1) i = n − 1, or (2) T[i] < T[i + 1], or (3) T[i] = T[i + 1] and suf(T, i + 1) is S-type; otherwise it is L-type. T[i] is S-type if suf(T, i) is S-type, and L-type otherwise. Moreover, if T[i] is S-type and T[i − 1] is L-type, T[i] is an LMS (leftmost S-type) character; if T[i] is L-type and T[i + 1] is S-type, T[i] is an RML (rightmost L-type) character. A substring of T is an LMS-substring (RML-substring) if it starts at an LMS (RML) character and ends at the first LMS (RML) character to its right.
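
To make the classification concrete, the following sketch scans T once from right to left and flags the type of every character as well as the RML positions (an LMS flag would be symmetric). It is an illustrative snippet with names of our own choosing, not the paper's code.

```cpp
#include <string>
#include <vector>

// is_s[i] is true iff T[i] is S-type; is_rml[i] is true iff T[i] is an RML
// character (T[i] is L-type and T[i+1] is S-type). T must end with the sentinel.
void classify_types(const std::string& T,
                    std::vector<bool>& is_s, std::vector<bool>& is_rml) {
    const std::size_t n = T.size();
    is_s.assign(n, false);
    is_rml.assign(n, false);
    is_s[n - 1] = true;                      // the sentinel suffix is S-type by rule (1)
    for (std::size_t i = n - 1; i-- > 0; ) { // rules (2) and (3), right to left
        is_s[i] = T[i] < T[i + 1] || (T[i] == T[i + 1] && is_s[i + 1]);
    }
    for (std::size_t i = 0; i + 1 < n; ++i) {
        is_rml[i] = !is_s[i] && is_s[i + 1];
        // symmetrically, T[i+1] would be an LMS character if is_s[i+1] && !is_s[i]
    }
}
```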

In the suffix array A of T, the suffixes sharing the same head character ch occupy a consecutive range called a bucket, denoted by bkt(ch). Within a bucket, the L- and S-type suffixes are clustered at the left and right ends, forming two sub-buckets called the L-bucket and S-bucket, denoted by bkt_l(ch) and bkt_s(ch), respectively.

2.2 Algorithm Framework

The algorithm’s underlying idea is sketched first. Let n_L and n_S be the numbers of L- and S-type characters in T, respectively. Without loss of generality, unless specified otherwise, suppose n_L < n_S in the rest of this article.

  1. Scan T to divide T and its suffixes into logical blocks.

  2. Scan A from left to right; for each scanned suffix, detect its type and sequentially append the L-type suffixes to their block buffers on disk.

  3. Compute in internal memory the B of each block by induced sorting.

  4. Scan A to merge the B of each block into the final B.

In step 1, T is scanned to divide its characters and suffixes into blocks, in a way such that the B of each block can be computed in internal memory.

In step 2, we need to detect the type of each scanned suffix. Since ||Σ|| = O(1), we can scan T once to collect the size of each L- and S-bucket in A. Given these sub-bucket sizes, we can determine on the fly the sub-bucket in which a scanned suffix resides, and hence its type.
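
One way to realize this on-the-fly detection is sketched below: a single right-to-left scan of T counts, for every character, how many L- and S-type suffixes start with it, and the resulting sub-bucket boundaries give the type of A[i] from the position i alone. This is an illustrative in-memory sketch for a byte alphabet; the class name and layout are ours.

```cpp
#include <algorithm>
#include <array>
#include <string>
#include <vector>

// Sub-bucket boundaries of A, built from one right-to-left scan of T with O(1)
// extra state. is_s_type(i) returns the type of the suffix stored at A[i].
struct BucketTypeOracle {
    std::vector<std::size_t> boundary;   // exclusive end position of each sub-bucket in A
    std::vector<bool>        s_part;     // whether that sub-bucket is the S-sub-bucket

    explicit BucketTypeOracle(const std::string& T) {
        const std::size_t n = T.size();
        std::array<std::size_t, 256> l_cnt{}, s_cnt{};
        s_cnt[(unsigned char)T[n - 1]]++;            // the sentinel suffix is S-type
        bool next_s = true;                          // type of T[i + 1]
        for (std::size_t i = n - 1; i-- > 0; ) {
            bool s = T[i] < T[i + 1] || (T[i] == T[i + 1] && next_s);
            (s ? s_cnt : l_cnt)[(unsigned char)T[i]]++;
            next_s = s;
        }
        std::size_t pos = 0;                         // per character: L-sub-bucket, then S-sub-bucket
        for (int c = 0; c < 256; ++c) {
            if (l_cnt[c]) { pos += l_cnt[c]; boundary.push_back(pos); s_part.push_back(false); }
            if (s_cnt[c]) { pos += s_cnt[c]; boundary.push_back(pos); s_part.push_back(true);  }
        }
    }
    // Binary search is used here for brevity; during a left-to-right scan of A a
    // moving cursor over the boundaries gives the same answer in O(1) amortized time.
    bool is_s_type(std::size_t i) const {
        std::size_t k = std::upper_bound(boundary.begin(), boundary.end(), i) - boundary.begin();
        return s_part[k];
    }
};
```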

In step 3, given all the sorted L-type suffixes of a block, we can induce the order of all the suffixes in the block, from which the B of the block is obtained.

In step 4, given the B of each block, we scan A and sequentially retrieve the preceding character of each scanned suffix from the B of the block that the suffix belongs to. Notice that for any two suffixes in a block, their relative order computed by step 3 is identical to their relative order in the suffix array. As a result, this step produces the final B.

Following this idea, we proceed to design the detailed algorithm SA2BWT in the rest of this section. Figure 1 outlines the algorithm framework. Lines 1–6 detect the type of each character to divide all the suffixes of T into blocks. Lines 7–8 scan A to sequentially put each suffix into its block, and compute in internal memory the B of each block. Line 9 merges in external memory the B of each block to get the final B. The I/O complexity of each step is O(n), while the peak disk usage occurs in Line 7.

Fig. 1. The algorithm framework for computing BWT from SA.

2.3 Dividing T into Blocks

Given n_L < n_S, RML characters are used to split T; otherwise, LMS characters are used instead. The rules for dividing T into n_1 = n/m blocks {T_b_i | i ∈ [0, n_1 − 1]} are as follows [8], where m is the block size:

  1. Each block starts with T[0] or an RML character and ends with another RML character or T[n − 1].

  2. Any pair of neighboring blocks overlaps on a common RML character.

  3. A block T[g, h], 0 ≤ g < h ≤ n − 1, can have more than m characters only if there is no RML character in T[g + 1, h − 1].

Initialize i = n_1 − 1 and T_b_i as empty. While scanning T from right to left, each RML-substring is added to T_b_i if the addition does not make ||T_b_i|| > m. Otherwise, i is decreased by 1 and the substring is added to a new block. If an RML-substring is itself longer than m, it forms a block on its own. The lengths of the blocks are recorded so that the position of each block in T can be determined.
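
The splitting procedure can be sketched as follows, using the is_rml flags from the classification snippet in Sect. 2.1; blocks are reported as [start, end] index pairs, and the function name is only illustrative.

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// Right-to-left greedy packing of RML-substrings into blocks of at most m
// characters (a single RML-substring longer than m forms its own block);
// neighboring blocks share the RML character at which they meet.
std::vector<std::pair<std::size_t, std::size_t>>
split_into_blocks(const std::vector<bool>& is_rml, std::size_t m) {
    const std::size_t n = is_rml.size();
    std::vector<std::size_t> cut;                    // cut points: 0, RML positions, n - 1
    cut.push_back(0);
    for (std::size_t i = 1; i + 1 < n; ++i)
        if (is_rml[i]) cut.push_back(i);
    cut.push_back(n - 1);

    std::vector<std::pair<std::size_t, std::size_t>> blocks;   // built right to left
    std::size_t blk_start = n - 1, blk_end = n - 1;
    for (std::size_t k = cut.size() - 1; k-- > 0; ) {           // add piece [cut[k], cut[k+1]]
        if (blk_start == blk_end || blk_end - cut[k] + 1 <= m) {
            blk_start = cut[k];                                 // the piece still fits (or the
        } else {                                                // block is empty): extend it
            blocks.push_back({blk_start, blk_end});             // close the block; the new one
            blk_end = cut[k + 1];                               // shares the RML character cut[k+1]
            blk_start = cut[k];
        }
    }
    blocks.push_back({blk_start, blk_end});
    std::reverse(blocks.begin(), blocks.end());                 // report the blocks left to right
    return blocks;
}
```

On the sample string of Sect. 3 (RML characters at positions 2, 6, 10 and 13, m = 8) this sketch yields the blocks [0, 6], [6, 10] and [10, 14], i.e., sizes 7, 5 and 5 as in the running example.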

Since T is split this way, the blocks may have unequal lengths, so we determine the location of each block in T using boundary characters as follows [8]. For T[j] with i ∗ m ≤ j ≤ (i + 1) ∗ m and i ∈ [0, n_1 − 1], T[j] must belong to (1) the block containing T[i ∗ m], (2) the block containing T[(i + 1) ∗ m], or (3) a block between them. In this way, given a suffix, its block can be located in O(1) time using O(n_1) space.
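
One possible realization of this lookup is sketched below: for every boundary position i ∗ m we keep the index of the block containing it, and a query starts from that block and moves right past at most the few blocks lying between two boundaries. The structure name and layout are ours.

```cpp
#include <utility>
#include <vector>

// Locates the block containing a text position, using O(n/m) integers:
// at_boundary[i] is the index of the block containing T[i * m].
struct BlockLocator {
    std::vector<std::pair<std::size_t, std::size_t>> blocks;  // [start, end], left to right
    std::vector<std::size_t> at_boundary;
    std::size_t m;

    BlockLocator(std::vector<std::pair<std::size_t, std::size_t>> blks,
                 std::size_t n, std::size_t m_)
        : blocks(std::move(blks)), m(m_) {
        std::size_t b = 0;
        for (std::size_t pos = 0; pos < n; pos += m) {
            while (blocks[b].second < pos) ++b;        // first block whose end covers pos
            at_boundary.push_back(b);
        }
    }
    // Leftmost block containing T[j]; a shared RML position is reported as
    // belonging to the left one of the two blocks that contain it.
    std::size_t block_of(std::size_t j) const {
        std::size_t b = at_boundary[j / m];
        while (blocks[b].second < j) ++b;
        return b;
    }
};
```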

2.4 Computing and Merging the BWT of Each Block

After splitting T into blocks, the sorted L-type suffixes are selected for inducing the suffix array of each block, and the type of each suffix is determined on the fly while scanning A. Since the alphabet size is O(1), we can scan T to obtain the location and size of each L- and S-bucket in A, then scan A to determine the type of each suffix from this sub-bucket information, and sequentially append each selected suffix to its block in external memory, where the buffer of block i is denoted by TS_b_i.
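
Combining the type oracle of Sect. 2.2 with the block locator of Sect. 2.3, this distribution pass can be sketched as follows (for the n_L < n_S case). Here the TS_b buffers are plain in-memory vectors, whereas the actual implementation writes them sequentially to disk, e.g. as STXXL vectors; the treatment of a suffix at a shared RML position follows the running example in Sect. 3, where it appears in both neighboring blocks.

```cpp
#include <vector>

// One left-to-right scan of A: every L-type suffix is appended to the TS_b
// buffer of its block, and a suffix starting at a shared RML position is
// appended to both blocks that contain it.
std::vector<std::vector<std::size_t>>
distribute_l_suffixes(const std::vector<std::size_t>& A,
                      const BucketTypeOracle& types,
                      const BlockLocator& loc) {
    std::vector<std::vector<std::size_t>> TS_b(loc.blocks.size());
    for (std::size_t i = 0; i < A.size(); ++i) {
        if (types.is_s_type(i)) continue;              // keep only the L-type suffixes
        std::size_t b = loc.block_of(A[i]);            // leftmost block containing A[i]
        TS_b[b].push_back(A[i]);
        if (b + 1 < TS_b.size() && loc.blocks[b + 1].first == A[i])
            TS_b[b + 1].push_back(A[i]);               // shared RML character
    }
    return TS_b;
}
```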

The key idea of SA2BWT is to compute the A and B of each block in internal memory by induced sorting, and then obtain the whole B of T by merging the B of each block. To compute the B of each block in internal memory, we use a block data structure BLOCKBUF that provides the information below:

  • SA_b: an array storing all suffixes belonging to this block.

  • ch: an array consisting of all characters belonging to this block.

  • type: an array storing the type of each character in ch.

  • start_pos: the start position of this block.

  • end_pos: the end position of this block.
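
A possible in-memory layout of BLOCKBUF is shown below; the field names follow the list above, while the concrete types are only illustrative.

```cpp
#include <cstdint>
#include <string>
#include <vector>

struct BLOCKBUF {
    std::vector<std::int64_t> SA_b;       // suffixes of this block, sorted once induced
    std::string               ch;         // the characters T[start_pos .. end_pos]
    std::vector<bool>         type;       // type[i] is true iff ch[i] is S-type
    std::size_t               start_pos;  // position in T of the block's first character
    std::size_t               end_pos;    // position in T of the block's last character
};
```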

With this data structure, the procedure for inducing the A of a block from its L-type suffixes is given below:

  1. Find the head of each L-type bucket in SA_b, and scan TS_b_i from head to end. For each suffix suf(T, i), put it at the current head of the L-type bucket for ch[suf(T, i) − start_pos] and shift the current head one step to the right.

  2. Find the end of each S-type bucket in SA_b, and scan SA_b once from end to head. For each item SA_b[i], if ch[SA_b[i] − 1 − start_pos] is S-type, put SA_b[i] − 1 at the current end of the S-type bucket for ch[SA_b[i] − 1 − start_pos] and shift the current end one step to the left.
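
The two passes can be sketched as below for one block (the n_L < n_S case). TS_b holds the block's L-type suffixes in the order in which they occur in A, and the ch, type and start_pos fields of the BLOCKBUF are already filled; a byte alphabet and −1 as the empty marker are assumed. How the sentinel suffix of the last block is seeded is not spelled out in the paper; placing it at the front directly, as done here, is one way.

```cpp
#include <array>
#include <cstdint>
#include <vector>

// Induces SA_b of one block from its already sorted L-type suffixes (pass 1:
// left to right into the bucket heads; pass 2: right to left, moving each
// S-type predecessor into the tail of its bucket).
void induce_block(BLOCKBUF& blk, const std::vector<std::int64_t>& TS_b, bool last_block) {
    const std::size_t len = blk.ch.size();
    std::array<std::int64_t, 256> cnt{}, head{}, tail{};
    for (unsigned char c : blk.ch) cnt[c]++;
    std::int64_t pos = 0;
    for (int c = 0; c < 256; ++c) {                  // bucket boundaries inside SA_b
        head[c] = pos; pos += cnt[c]; tail[c] = pos - 1;
    }
    blk.SA_b.assign(len, -1);
    if (last_block)                                  // the sentinel suffix is induced by nothing;
        blk.SA_b[0] = (std::int64_t)blk.end_pos;     // it is trivially the smallest, so seed it up front

    // Pass 1: place the given L-type suffixes at the current heads of their buckets.
    for (std::int64_t s : TS_b) {
        unsigned char c = blk.ch[s - blk.start_pos];
        blk.SA_b[head[c]++] = s;
    }
    // Pass 2: whenever the preceding character lies in the block and is S-type,
    // place the preceding suffix at the current tail of its bucket.
    for (std::size_t i = len; i-- > 0; ) {
        std::int64_t s = blk.SA_b[i];
        if (s <= (std::int64_t)blk.start_pos) continue;      // empty slot, or predecessor outside the block
        std::size_t p = (std::size_t)s - 1 - blk.start_pos;  // index of the preceding character in ch
        if (blk.type[p])                                     // S-type predecessor
            blk.SA_b[tail[(unsigned char)blk.ch[p]]--] = s - 1;
    }
}
```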

Analogously, if n_S is smaller, the algorithm for inducing the A from the S-type suffixes is modified as follows:

  1. Find the end of each S-type bucket in SA_b, and scan TS_b_i from head to end. For each suffix suf(T, i), put it at the current end of the S-type bucket for ch[suf(T, i) − start_pos] and shift the current end one step to the left.

  2. Find the head of each L-type bucket in SA_b, and scan SA_b once from head to end. For each item SA_b[i], if ch[SA_b[i] − 1 − start_pos] is L-type, put SA_b[i] − 1 at the current head of the L-type bucket for ch[SA_b[i] − 1 − start_pos] and shift the current head one step to the right.

The correctness of step 2 is established by Lemma 3.9 in [9]:

Lemma 1

[9]. Given all the sorted L-type (or S-type) suffixes of T, the order of all the suffixes of T can be induced in O(n) time.

After all the suffixes in a block are sorted, the B of this block can be computed immediately from the sorted suffixes and the block's input characters. Then A is scanned once to merge the B of each block, sequentially retrieving characters from the B of each block to produce the overall B.
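
A sketch of the merge is given below. Each block's buffer is consumed strictly left to right through its own cursor, which is what keeps the disk accesses sequential in the real algorithm; for brevity the sketch tests every block at each step instead of locating the block of A[i] in O(1), and it reads the output character directly from ch instead of from a precomputed per-block B, which is equivalent. The wrap-around case A[i] = 0 follows the definition in Sect. 1.

```cpp
#include <string>
#include <vector>

// Merges the per-block results into the final B by one more scan of A.
// Every block whose next unconsumed SA_b entry equals A[i] advances its cursor;
// the output character is taken from a block that also contains T[A[i] - 1].
std::string merge_blocks(const std::vector<std::size_t>& A,
                         const std::vector<BLOCKBUF>& blocks) {
    const std::size_t n = A.size();
    std::vector<std::size_t> cur(blocks.size(), 0);    // per-block read cursors
    std::string B(n, '\0');
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t b = 0; b < blocks.size(); ++b) {
            const BLOCKBUF& blk = blocks[b];
            if (cur[b] < blk.SA_b.size() &&
                (std::size_t)blk.SA_b[cur[b]] == A[i]) {
                if (A[i] > blk.start_pos)               // the preceding character is in this block
                    B[i] = blk.ch[A[i] - 1 - blk.start_pos];
                ++cur[b];                               // consume this copy of A[i]
            }
        }
        if (A[i] == 0)                                  // wrap-around case of the definition in Sect. 1:
            B[i] = blocks.back().ch.back();             // B[i] = T[n - 1], the sentinel
    }
    return B;
}
```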

2.5 Complexity Analysis

Theorem 1

(I/O and Workspace Complexities). Given a size-n input string T over a constant alphabet, the I/O and workspace complexities of SA2BWT are O(n) and at most n/2 integers, respectively, where an integer takes log n bits.

Proof.

Each step of SA2BWT scans A and/or T only O(1) times, hence the total I/O complexity is O(n). Because min(n_L, n_S)/n ≤ 1/2, the workspace, which is dominated by storing the selected L- or S-type suffixes of each block, is at most n/2 integers.

3 A Running Example

In Fig. 2, a running example of SA2BWT is illustrated on the sample string T = immiissiissiip$, where $ is the sentinel. Assuming a block size of 8 bytes, the algorithm starts by scanning T from right to left to detect the type of each character. Line 2 lists the type of each character; the L-type characters are fewer. Then the algorithm continues with the following steps:

Fig. 2. A running example of SA2BWT.

Step 1: Scan T from right to left to split it according to the RML characters, which are marked by ‘∗’ at Line 3. T is thus split into three blocks, “immiiss”, “siiss” and “siip$”, of sizes 7, 5 and 5, respectively.

Step 2: Determine the positions of the boundary characters. As the block size is set to 8 bytes, the boundary characters in T are T[0] and T[8], belonging to block_0 and block_1, respectively.

Step 3: Compute the B of each block in internal memory, then merge them into the final B. Line 8 lists all the suffixes in A by their start positions in T. As the L-type suffixes are fewer, we scan A to find the L-type suffixes and put them into their blocks. For example, the first scanned suffix starts at A[0] = 14 = n − 1; it is S-type and belongs to the last block. The second one starts at A[1] = 11, which is also S-type since bkt_l(i) = 0. When A[8] = 2 is scanned, it is detected as an L-type suffix; furthermore, position 2 lies between the start and end positions of block_0, so this suffix belongs to block_0. After scanning A, Lines 10–12 show the final TS_b, which are TS_b_0 = {2, 1, 6, 5}, TS_b_1 = {10, 6, 9} and TS_b_2 = {13, 10}.

Now the algorithm induced-sorts the suffixes in each block; take block_0 as an example. There are 3 buckets, ‘i’, ‘m’ and ‘s’, in SA_b. Firstly, all entries of SA_b are initialized to −1. As the L-type suffixes are fewer, Line 14 finds the head of each bucket and marks it by ‘^’. Then Lines 14–23 scan TS_b_0 rightwards. The first suffix is TS_b_0[0] = 2 with initial character ‘m’, so it is put at the head position of bucket ‘m’, and the current head of the bucket is advanced by one. Notice that the initial characters are retrieved from ch in BLOCKBUF, so as to avoid random access to disk. The next suffix is TS_b_0[1] = 1 with the same initial character ‘m’, and it is put into bucket ‘m’ at the position now pointed to by ‘^’. Similarly, the next two suffixes are appended to bucket ‘s’. The result is shown in Line 22, with the pointers at the positions just after the ends of buckets ‘m’ and ‘s’.

Afterwards, Line 24 finds and marks the end of each bucket with ‘^’ and scans SA_b leftwards, where the item currently being visited is marked by ‘@’. When SA_b[6] = 5 is visited in Line 24, its preceding character T[4] is not L-type according to BLOCKBUF, so the preceding suffix suf(T, 5 − 1 = 4) is added to bucket ‘i’ and the current end of that bucket is moved left by one. The next item visited is SA_b[5] = 6, whose preceding character is L-type, so it is skipped and the symbol ‘@’ moves leftwards by one. Scanning SA_b in this way, all the suffixes in block_0 are finally sorted as shown in Line 36. The same procedure is applied to the other blocks, producing the B of each block, denoted TB_b_i, shown in Lines 38–40.

Finally, A is scanned from left to right to determine the block of each A[i] and retrieve B[i] from its TB_b, where the entry with A[i] = n − 1 is skipped as it corresponds to the sentinel. The final result B[0 … n − 2] = {s, s, m, p, i, i, i, m, i, i, s, s, i, i} is shown in Line 41.

4 Experiments

For performance evaluation, we implemented SA2BWT using the STXXL library [3], an implementation of the C++ standard template library (STL) for external memory computation. The experiments were performed on Linux (kernel 3.13.0-86-generic) on a machine with an Intel Core i3 3.20 GHz CPU and 4.00 GiB RAM. The program was written in C/C++ and compiled with gcc/g++ 4.8.4. The datasets chosen for the experiments, of varying sizes, alphabets and redundancy, are listed in Table 1.

Table 1. Datasets for experiments, n in Gi, one byte per character.

The performance measures are the runtime, the peak disk usage and the I/O volume. To smooth out performance fluctuations, each result is the mean of three runs. The peak disk usage and I/O volume are reported by the STXXL library; the time (μs/char) and peak disk usage (bytes/char) are normalized by n.

The runtimes, peak disk usages and I/O volumes for the different datasets are shown in Figs. 3, 4 and 5, where enwiki_1g, enwiki_2g and enwiki_4g are the 1, 2 and 4 GiB prefixes of enwiki, and guten_4g and guten_8g are the 4 and 8 GiB prefixes of guten, respectively. In addition, Fig. 6 shows λ = min(n_L, n_S)/n for each dataset, which affects the algorithm’s performance. The horizontal axis of each figure lists the datasets in non-decreasing order of n.

Fig. 3. Runtime for different datasets.

Fig. 4. Peak disk usages for different datasets.

Fig. 5. I/O volumes for different datasets.

Fig. 6. λ for different datasets.

The time bottleneck of an external memory algorithm is mostly due to its I/O volume. To test the machine’s I/O throughput, we used the shell command “time cp” to measure the time for copying enwiki_1g. The mean time over three runs is 22.54 s. The I/O volume of this copy is 2 GiB, hence the machine reads or writes a byte in about 22.54 s/2 GiB ≈ 0.011 μs. Our program spends about 0.55/29 = 0.0189 μs per byte of I/O, where 0.55 is the average runtime per character in Fig. 3 and 29 is the average I/O volume in bytes per character observed from Fig. 5. Hence our program achieves 0.011/0.0189 = 58% of the machine’s saturated I/O throughput.

The peak disk usage of the algorithm is at most n/2 integers, i.e., 2.5n bytes for the 40-bit integers used in this experiment. As shown in Fig. 4, the peak disk usage on each dataset is within this theoretical bound; moreover, a bigger λ means more space. In particular, the peak disk usages of the four datasets uniprot, guten_4g, guten_8g and android are 2.1–2.3 bytes/char, less than those of the others; in Fig. 6, their λ values are 42–45%, while the λ of the other datasets is about 50%. The peak disk usages of some datasets slightly exceed 2.5 bytes/char, which is due to small auxiliary arrays.

The average I/O volume of 29 bytes/char is slightly more than the theoretical I/O volume of 27 bytes/char. To save space, stxxl::uint40 is used on disk, while unsigned long long, which occupies 8 bytes on a 64-bit machine, is used in memory. In the STXXL library, when the value of a stxxl::uint40 is assigned to an unsigned long long, an implicit conversion occurs, i.e., the stxxl::uint40 is first converted into a stxxl::uint64, which results in 2.5n bytes of I/O volume. Similar to the peak disk usage, the I/O volumes of the four datasets uniprot, guten_4g, guten_8g and android are also less than those of the others, due to the difference in λ.

The current design of SA2BWT requires a workspace of n/2 integers, which can be further reduced as follows. Given all the sorted LMS suffixes of T, all the L-type suffixes can be sorted in O(n) time [9], so it is feasible to induced-sort the SA from the LMS suffixes, i.e., the suffixes starting at LMS characters. Also from [9], if each character is S- or L-type with probability 1/2, the mean length of a non-sentinel LMS-substring is 4, i.e., the number of LMS-substrings is at most n/3, so the expected workspace of the algorithm improved to use LMS suffixes is n/3 integers. The key is to detect the LMS suffixes: currently we only need to judge the type of a scanned suffix, but with LMS suffixes, not only must the character T[A[i]] be S-type, the preceding character T[A[i] − 1] must also be L-type. An approach for detecting, while scanning A, each LMS suffix starting at A[i] without retrieving T[A[i] − 1] is required and will be addressed in an extended version of this paper.

5 Summary

A lightweight external memory algorithm has been proposed for computing the BWT from the suffix array and the input string. The algorithm requires O(n) I/O complexity and a workspace of at most n/2 integers. Currently, computing the BWT constitutes the space bottleneck in an external memory suffix array checker; our algorithm can be employed for this computation and hence reduces the peak disk usage for checking a big suffix array in external memory. As a next step, we are developing efficient methods to further improve the time and space performance of the algorithm.