Abstract
The field of compressed data structures seeks to achieve fast search time, but using a compressed representation, ideally requiring less space than that occupied by the original input data. The challenge is to construct a compressed representation that provides the same functionality and speed as traditional data structures. In this invited presentation, we discuss some breakthroughs in compressed data structures over the course of the last decade that have significantly reduced the space requirements for fast text and document indexing. One interesting consequence is that, for the first time, we can construct data structures for text indexing that are competitive in time and space with the well-known technique of inverted indexes, but that provide more general search capabilities. Several challenges remain, and we focus in this presentation on two in particular: building I/O-efficient search structures when the input data are so massive that external memory must be used, and incorporating notions of relevance in the reporting of query answers.
Supported in part by Taiwan NSC grant 96–2221–E–007–082–MY3 (W. Hon) and USA National Science Foundation grant CCF–0621457 (R. Shah and J. S. Vitter).
Access provided by Autonomous University of Puebla. Download to read the full chapter text
Chapter PDF
Similar content being viewed by others
Keywords
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.
References
Aggarwal, A., Vitter, J.S.: The Input/Output complexity of sorting and related problems. Communications of the ACM 31(9), 1116–1127 (1988)
Arroyuelo, D., Navarro, G.: A Lempel-Ziv text index on secondary storage. In: Ma, B., Zhang, K. (eds.) CPM 2007. LNCS, vol. 4580, pp. 83–94. Springer, Heidelberg (2007)
Barbay, J., He, M., Munro, J.I., Rao, S.S.: Succinct indexes for strings, binary relations and multi-labeled trees. In: Proc. ACM-SIAM Symp. on Discrete Algorithms, pp. 680–689 (2007)
Bayer, R., Unterauer, K.: Prefix B-trees. ACM Transactions on Database Systems 2(1), 11–26 (1977)
Belazzougui, D.: Succinct dictionary matching with no slowdown. In: Proc. Symp. on Combinatorial Pattern Matching (June 2010)
Bialynicka-Birula, I., Grossi, R.: Rank-sensitive data structures. In: Consens, M.P., Navarro, G. (eds.) SPIRE 2005. LNCS, vol. 3772, pp. 79–90. Springer, Heidelberg (2005)
Burrows, M., Wheeler, D.: A block sorting data compression algorithm. Technical report, Digital Systems Research Center (1994)
Chan, H.L., Hon, W.K., Lam, T.W., Sadakane, K.: Compressed indexes for dynamic text collections. ACM Transactions on Algorithms 3(2) (2007)
Chien, Y.-F., Hon, W.-K., Shah, R., Vitter, J.S.: Geometric Burrows-Wheeler transform: Linking range searching and text indexing. In: Proc. IEEE Data Compression Conf., pp. 252–261 (2008)
Chiu, S.-Y., Hon, W.-K., Shah, R., Vitter, J.S.: I/O-efficient compressed text indexes: From theory to practice. In: Proc. IEEE Data Compression Conf., pp. 426–434 (2010)
Dean, J., Ghemawat, S.: MapReduce: Simplified data processing on large clusters. In: Proc. Symp. on Operating Systems Design and Implementation. December 2004, pp. 137–150, USENIX (2004)
Elias, P.: Universal codeword sets and representations of the integers. IEEE Transactions on Information Theory IT-21, 194–203 (1975)
Ferragina, P., Giancarlo, R., Manzini, G.: The myriad virtues of wavelet trees. Information and Computation 207(8), 849–866 (2009)
Ferragina, P., Giancarlo, R., Manzini, G., Sciortino, M.: Boosting textual compression in optimal linear time. Journal of the ACM 52(4), 688–713 (2005)
Ferragina, P., González, R., Navarro, G., Venturini, R.: Compressed text indexes: From theory to practice. ACM Journal of Experimental Algorithmics 12, article 1.12 (2008)
Ferragina, P., Grossi, R.: The String B-tree: A new data structure for string search in external memory and its applications. Journal of the ACM 46(2), 236–280 (1999)
Ferragina, P., Grossi, R., Gupta, A., Shah, R., Vitter, J.S.: On searching compressed string collections cache-obliviously. In: Proc. ACM Conf. on Principles of Database Systems, Vancouver, June 2008, pp. 181–190 (2008)
Ferragina, P., Luccio, F., Manzini, G., Muthukrishnan, S.: Structuring labeled trees for optimal succinctness, and beyond. In: Proc. IEEE Symp. on Foundations of Computer Science, pp. 184–196 (2005)
Ferragina, P., Manzini, G.: Opportunistic data structures with applications. In: Proc. IEEE Symp. on Foundations of Computer Science, November 2000, vol. 41, pp. 390–398 (2000)
Ferragina, P., Manzini, G.: Indexing compressed texts. Journal of the ACM 52(4), 552–581 (2005)
Ferragina, P., Manzini, G., Mäkinen, V., Navarro, G.: Compressed representations of sequences and full-text indexes. ACM Transactions on Algorithms 3(2) (May 2007) Conference version in SPIRE 2004
Ferragina, P., Venturini, R.: Compressed permuterm index. In: Proc. ACM SIGIR Conf. on Res. and Dev. in Information Retrieval, pp. 535–542 (2007)
Fischer, J., Mäkinen, V., Navarro, G.: Faster entropy-bounded compressed suffix trees. Theoretical Computer Science 410(51), 5354–5364 (2009)
Foschini, L., Grossi, R., Gupta, A., Vitter, J.S.: When indexing equals compression: Experiments on suffix arrays and trees. ACM Transactions on Algorithms 2(4), 611–639 (2004); Conference versions in SODA 2004 and DCC 2004
Frigo, M., Leiserson, C.E., Prokop, H., Ramachandran, S.: Cache-oblivious algorithms. In: Proc. IEEE Symp. on Foundations of Computer Science, vol. 40, pp. 285–298 (1999)
Gonnet, G.H., Baeza-Yates, R.A., Snider, T.: New indices for text: PAT trees and PAT arrays. In: Information Retrieval: Data Structures And Algorithms, ch. 5, pp. 66–82. Prentice-Hall, Englewood Cliffs (1992)
González, R., Navarro, G.: A compressed text index on secondary memory. In: Proc. Intl. Work. Combinatorial Algorithms, Newcastle, Australia, pp. 80–91. College Publications (2007)
Grossi, R., Gupta, A., Vitter, J.S.: High-order entropy-compressed text indexes. In: Proc. ACM-SIAM Symp. on Discrete Algorithms (January 2003)
Grossi, R., Gupta, A., Vitter, J.S.: Nearly tight bounds on the encoding length of the Burrows-Wheeler transform. In: Proc. Work. on Analytical Algorithmics and Combinatorics (January 2008)
Grossi, R., Vitter, J.S.: Compressed suffix arrays and suffix trees with applications to text indexing and string matching. In: Proc. ACM Symp. on Theory of Computing, May 2000, vol. 32, pp. 397–406 (2000)
Grossi, R., Vitter, J.S.: Compressed suffix arrays and suffix trees with applications to text indexing and string matching. SIAM Journal on Computing 35(32), 378–407 (2005)
Hon, W.-K., Lam, T.-W., Shah, R., Tam, S.-L., Vitter, J.S.: Compressed index for dictionary matching. In: Proc. IEEE Data Compression Conf., pp. 23–32 (2008)
Hon, W.-K., Lam, T.-W., Shah, R., Tam, S.-L., Vitter, J.S.: Succinct index for dynamic dictionary matching. In: Dong, Y., Du, D.-Z., Ibarra, O. (eds.) ISAAC 2009. LNCS, vol. 5878. Springer, Heidelberg (2009)
Hon, W.-K., Shah, R., Thankachan, S.V., Vitter, J.S.: On entropy-compressed text indexing in external memory. In: Hyyro, H. (ed.) SPIRE 2009. LNCS, vol. 5721, pp. 75–89. Springer, Heidelberg (2009)
Hon, W.-K., Shah, R., Vitter, J.S.: Ordered pattern matching: Towards full-text retrieval. In: Purdue University Tech. Rept. (2006)
Hon, W.-K., Shah, R., Vitter, J.S.: Space-efficient framework for top-k string retrieval problems. In: Proc. IEEE Symp. on Foundations of Computer Science, Atlanta (October 2009)
Kärkkäinen, J.: Repetition-Based Text Indexes. Ph.d., University of Helsinki (1999)
Kärkkäinen, J., Rao, S.S.: Full-text indexes in external memory. In: Meyer, U., Sanders, P., Sibeyn, J. (eds.) Algorithms for Memory Hierarchies, ch. 7, pp. 149–170. Springer, Berlin (2003)
Külekci, M.O., Hon, W.-K., Shah, R., Vitter, J.S., Xu, B.: A parallel sparse index for read alignment on genomes (2010)
Lam, T.-W., Sung, W.-K., Wong, S.-S.: Improved approximate string matching using compressed suffix data structures. Algorithmica 51(3), 298–314 (2008)
Langmead, B., Trapnell, C., Pop, M., Salzberg, S.: Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology 10(3), article R25 (2009)
Li, R., Yu, C., Li, Y., Lam, T.-W., Yiu, S.-M., Kristiansen, K., Wang, J.: SOAP2: An improved ultrafast tool for short read alignment. Bioinformatics 25(15), 1966–1967 (2009)
Lin, H., Zhang, Z., Zhang, M.Q., Ma, B., Li, M.: ZOOM: Zillions of oligos mapped. Bioinformatics 24(21), 2431–2437 (2008)
Mäkinen, V., Navarro, G.: Succinct suffix arrays based on run-length encoding. Nordic Journal of Computing 12(1), 40–66 (2005)
Mäkinen, V., Navarro, G.: Position-restricted substring searching. In: Proc. Latin American Theoretical Informatics Symp., pp. 703–714 (2006)
Mäkinen, V., Navarro, G.: Implicit compression boosting with applications to self-indexing. In: Ziviani, N., Baeza-Yates, R. (eds.) SPIRE 2007. LNCS, vol. 4726, pp. 229–241. Springer, Heidelberg (2007)
Mäkinen, V., Navarro, G.: Dynamic entropy-compressed sequences and full-text indexes. ACM Transactions on Algorithms 4(3), article 12 (June 2008)
Mäkinen, V., Navarro, G., Sadakane, K.: Advantages of backward searching—efficient secondary memory and distributed implementation of compressed suffix arrays. In: Fleischer, R., Trippen, G. (eds.) ISAAC 2004. LNCS, vol. 3341, pp. 681–692. Springer, Heidelberg (2004)
Manber, U., Myers, G.: Suffix arrays: A new method for on-line string searches. SIAM Journal on Computing 22(5), 935–948 (1993)
Manzini, G.: An analysis of the Burrows-Wheeler transform. Journal of the ACM 48(3) (2001); Conference version in SODA 1999
McCreight, E.M.: A space-economical suffix tree construction algorithm. Journal of the ACM 23(2), 262–272 (1976)
Moffat, A., Zobel, J.: Self-indexing inverted files for fast text retrieval. ACM Transactions on Information Systems 14(4), 349–379 (1996)
Muthukrishnan, S.: Efficient Algorithms for Document Retrieval Problems. In: Proc. ACM-SIAM Symp. on Discrete Algorithms, pp. 657–666 (2002)
Muthukrishnan, S.: Data Streams: Algorithms and Applications. Foundations and Trends in Theoretical Computer Science. now Publishers, Hanover (2005)
Navarro, G., Mäkinen, V.: Compressed full-text indexes. ACM Computing Surveys 39(1), article 2 (2007)
NCBI short read archive SRR001115, http://www.ncbi.nlm.nih.gov/
Patrascu, M.: Succincter. In: Proc. IEEE Symp. on Foundations of Computer Science, pp. 305–313 (2008)
Puglisi, S.J., Smyth, W.F., Turpin, A.: Inverted files versus suffix arrays for locating patterns in primary memory. In: Crestani, F., Ferragina, P., Sanderson, M. (eds.) SPIRE 2006. LNCS, vol. 4209, pp. 122–133. Springer, Heidelberg (2006)
Raman, R., Raman, V., Rao, S.S.: Succinct indexable dictionaries with applications to encoding k-ary trees, prefix sums and multisets. ACM Transactions on Algorithms 3(4), article 43 (2007)
Russo, L., Navarro, G., Oliveira, A.: Fully-compressed suffix trees. In: Laber, E.S., Bornstein, C., Nogueira, L.T., Faria, L. (eds.) LATIN 2008. LNCS, vol. 4957, pp. 362–373. Springer, Heidelberg (2008)
Sadakane, K.: Compressed text databases with efficient query algorithms based on the compressed suffix array. In: Lee, D.T., Teng, S.-H. (eds.) ISAAC 2000. LNCS, vol. 1969, pp. 410–421. Springer, Heidelberg (December 2000)
Sadakane, K.: New text indexing functiionalities of the compressed suffix arrays. Journal of Algorithms 48(2), 294–313 (2003)
Sadakane, K.: Compressed suffix trees with full functionality. Theory of Computing Systems 41(4), 589–607 (2007)
Sadakane, K.: Succinct Data Structures for Flexible Text Retrieval Systems. Journal of Discrete Algorithms 5(1), 12–22 (2007)
Sodan, A.C., Machina, J., Deshmeh, A., Macnaughton, K., Esbaugh, B.: Parallelism via multithreaded and multicore CPUs. IEEE Computer 43(3), 24–32 (2010)
Tam, A., Wu, E., Lam, T.W., Yiu, S.-M.: Succinct text indexing with wildcards. In: Proc. Intl. Symp. on String Processing Information Retrieval, August 2009, pp. 39–50 (2009)
Thankachan, S.V., Hon, W.-K., Shah, R., Vitter, J.S.: String retrieval for multi-pattern queries (2010)
Ukkonen, E.: On-line construction of suffix trees. Algorithmica 14(3), 249–260 (1995)
Välimäki, N., Mäkinen, V.: Space-Efficient Algorithms for Document Retrieval. In: Ma, B., Zhang, K. (eds.) CPM 2007. LNCS, vol. 4580, pp. 205–215. Springer, Heidelberg (2007)
Vitter, J.S.: Algorithms and Data Structures for External Memory. Foundations and Trends in Theoretical Computer Science. now Publishers, Hanover (2008)
Vitter, J.S., Shriver, E.A.M.: Algorithms for parallel memory I: Two-level memories. Algorithmica 12(2–3), 110–147 (1994)
Weiner, P.: Linear pattern matching algorithm. In: Proc. IEEE Symp. on Switching and Automata Theory, Washington, DC, vol. 14, pp. 1–11 (1973)
Witten, I.H., Moffat, A., Bell, T.C.: Managing Gigabytes: Compressing and Indexing Documents and Images, 2nd edn. Morgan Kaufmann, Los Altos (1999)
Zobel, J., Moffat, A.: Inverted files for text search engines. ACM Computing Surveys 38(2) (2006)
Author information
Authors and Affiliations
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2010 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Hon, WK., Shah, R., Vitter, J.S. (2010). Compression, Indexing, and Retrieval for Massive String Data. In: Amir, A., Parida, L. (eds) Combinatorial Pattern Matching. CPM 2010. Lecture Notes in Computer Science, vol 6129. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-13509-5_24
Download citation
DOI: https://doi.org/10.1007/978-3-642-13509-5_24
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-13508-8
Online ISBN: 978-3-642-13509-5
eBook Packages: Computer ScienceComputer Science (R0)