Abstract
The suffix array of a string T is basically a sorted list of all the suffixes of T. Suffix arrays have been fundamental index data structures in computational biology. If we are to search a DNA sequence in a genome sequence, we construct the suffix array for the genome sequence and then search the DNA sequence in the suffix array. In this paper, we consider the construction of the suffix array of T of length n where the size of the alphabet is fixed. It has been well-known that one can construct the suffix array of T in O(n) time by constructing suffix tree of T and traversing the suffix tree. Although this approach takes O(n) time, it is not appropriate for practical use because it uses a lot of spaces and it is complicated to implement. Recently, almost at the same time, several algorithms have been developed to directly construct suffix arrays in O(n) time. However, these algorithms are developed for integer alphabets and thus do not exploit the properties given when the size of the alphabet is fixed. We present a fast algorithm for constructing suffix arrays for the fixed-size alphabet. Our algorithm constructs suffix arrays faster than any other algorithms developed for integer or general alphabets when the size of the alphabet is fixed. For example, we reduced the time required for constructing suffix arrays for DNA sequences by 25%-38%. In addition, we do not sacrifice the space to improve the running time. The space required by our algorithm is almost equal to or even less than those required by previous fast algorithms.
This work is supported by Korea Research Foundation grant KRF-2003-03-D00343.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Preview
Unable to display preview. Download preview PDF.
Similar content being viewed by others
References
Burrows, M., Wheeler, D.: A block sorting lossless data compression algorithm, Technical Report 124 (1994), Digital Equipment Corporation (1994)
Farach, M.: Optimal suffix tree construction with large alphabets. In: IEEE Symp. Found. Computer Science, pp. 137–143 (1997)
Farach-Colton, M., Ferragina, P., Muthukrishnan, S.: On the sorting-complexity of suffix tree construction. J. Assoc. Comput. Mach. 47, 987–1011 (2000)
Ferragina, P., Manzini, G.: Opportunistic data structures with applications. In: IEEE Symp. Found. Computer Science, pp. 390–398 (2001)
Ferragina, P., Manzini, G.: An experimental study of an opportunistic index. In: ACM-SIAM Symp. on Discrete Algorithms, pp. 269–278 (2001)
Gonnet, G., Baeza-Yates, R., Snider, T.: New indices for text: Pat trees and pat arrays. In: Frakes, W.B., Baeza-Yates, R.A. (eds.) Information Retrieval: Data Structures & Algorithms, pp. 66–82. Prentice-Hall, Englewood Cliffs (1992)
Gusfield, D.: Algorithms on Strings, Trees, and Sequences. Cambridge Univ. Press, Cambridge (1997)
Gusfield, D.: An “Increment-by-one” approach to suffix arrays and trees (1990) (manuscript)
Grossi, R., Gupta, A., Vitter, J.S.: When indexing equals compression: Experiments with compressing suffix arrays and applications. In: ACM-SIAM Symp. on Discrete Algorithms (2004)
Grossi, R., Vitter, J.S.: Compressed suffix arrays and suffix trees with applications to text indexing and string matching. In: ACM Symp. Theory of Computing, pp. 397–406 (2000)
Hon, W.K., Sadakane, K., Sung, W.K.: Breaking a time-and-space barrier in constructing full-text indices. In: IEEE Symp. Found. Computer Science, pp. 251–260 (2003)
Kasai, T., Lee, G., Arimura, H., Arikawa, S., Park, K.: Linear-time longestcommon- prefix computation in suffix arrays and its applications. In: Symp. Combinatorial Pattern Matching, pp. 181–192 (2001)
Kärkkäinen, J., Sanders, P.: Simpler linear work suffix array construction. In: Int. Colloq. Automata Languages and Programming, pp. 943–955 (2003)
Kim, D.K., Sim, J.S., Park, H., Park, K.: Linear-time construction of suffix arrays. In: Symp. Combinatorial Pattern Matching, pp. 186–199 (2003)
Ko, P., Aluru, S.: Space-efficient linear time construction of suffix arrays. In: Symp. Combinatorial Pattern Matching, pp. 200–210 (2003)
Manber, U., Myers, G.: Suffix arrays: A new method for on-line string searches. SIAM J. Comput. 22, 935–938 (1993)
McCreight, E.M.: A space-economical suffix tree construction algorithm, J. Assoc. Comput. Mach. 23, 262–272 (1976)
Sadakane, K.: Compressed text databases with efficient query algorithms based on the compressed suffix array. In: Int. Symp. Algorithms and Computation, pp. 410–421 (2000)
Sadakane, K.: Succinct representations of lcp Information and improvements in the compressed suffix arrays. In: ACM-SIAM Symp. on Discrete Algorithms, pp. 225–232 (2002)
Sim, J.S., Kim, D.K., Park, H., Park, K.: Linear-time search in suffix arrays. In: Australasian Workshop on Combinatorial Algorithms, pp. 139–146 (2003)
Ukkonen, E.: On-line construction of suffix trees. Algorithmica 14, 249–260 (1995)
Weiner, P.: Linear pattern matching algorithms. In: Proc. 14th IEEE Symp. Switching and Automata Theory, pp. 1–11 (1973)
Author information
Authors and Affiliations
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2004 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Kim, D.K., Jo, J., Park, H. (2004). A Fast Algorithm for Constructing Suffix Arrays for Fixed-Size Alphabets. In: Ribeiro, C.C., Martins, S.L. (eds) Experimental and Efficient Algorithms. WEA 2004. Lecture Notes in Computer Science, vol 3059. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-24838-5_23
Download citation
DOI: https://doi.org/10.1007/978-3-540-24838-5_23
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-22067-1
Online ISBN: 978-3-540-24838-5
eBook Packages: Springer Book Archive