Abstract
Burstsort is a trie-based string sorting algorithm that distributes strings into small buckets whose contents are then sorted in cache. This approach has previously been shown to be efficient on modern cache-based processors [Sinha & Zobel, JEA 2004]. In this paper, we introduce improvements that significantly reduce the memory requirements of burstsort. Excess memory has been reduced by an order of magnitude, so that burstsort now uses less than 1% more memory than an in-place algorithm. These techniques can be applied to existing variants of burstsort, as well as to other string algorithms.
We redesigned the buckets, introducing sub-buckets and an index structure for them, which resulted in an order-of-magnitude space reduction. We also show the practicality of moving some fields from the trie nodes to the insertion point (for the next string pointer) in the bucket; this technique reduces the memory usage of the trie nodes by one-third. Significantly, combining these memory improvements does not adversely affect the speed of burstsort on real-world string collections. In addition, during the bucket-sorting phase, string suffixes are copied to a small buffer to improve their spatial locality, lowering the running time of burstsort by up to 30%.
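For readers unfamiliar with the basic algorithm, the following is a minimal sketch of plain burstsort in Python, not the engineered variant described in this paper. It shows only the core idea: strings are distributed through a trie into buckets, a bucket that grows past a threshold is "burst" into a new trie node, and each remaining bucket is sorted on the suffixes its strings do not share. The names `Node`, `BURST_THRESHOLD`, `_insert`, `_collect` and `burstsort` are assumptions made for illustration, the threshold is far smaller than a practical one, and sub-buckets, the bucket index structure and the suffix buffer are not modelled.

```python
# Illustrative sketch of basic burstsort; all names are hypothetical.
BURST_THRESHOLD = 4  # assumed tiny value so that bursts occur on small inputs


class Node:
    """Trie node: maps a character to a child Node or to a bucket (list)."""

    def __init__(self):
        self.children = {}   # char -> Node or list of strings
        self.exhausted = []  # strings that end exactly at this node


def _insert(node, s, depth):
    """Insert s below node, where s[:depth] has already been consumed."""
    while True:
        if depth == len(s):
            node.exhausted.append(s)
            return
        c = s[depth]
        child = node.children.get(c)
        if isinstance(child, Node):
            node, depth = child, depth + 1
            continue
        if child is None:
            child = node.children[c] = []
        child.append(s)
        if len(child) > BURST_THRESHOLD:
            # Burst: replace the bucket with a node and redistribute
            # its strings on their next character.
            burst = Node()
            node.children[c] = burst
            for t in child:
                _insert(burst, t, depth + 1)
        return


def _collect(node, depth, out):
    """In-order traversal; buckets are sorted on their unshared suffixes."""
    out.extend(node.exhausted)
    for c in sorted(node.children):
        child = node.children[c]
        if isinstance(child, Node):
            _collect(child, depth + 1, out)
        else:
            # All strings in this bucket share their first depth + 1 characters.
            child.sort(key=lambda t: t[depth + 1:])
            out.extend(child)


def burstsort(strings):
    root = Node()
    for s in strings:
        _insert(root, s, 0)
    out = []
    _collect(root, 0, out)
    return out


if __name__ == "__main__":
    print(burstsort(["bat", "ba", "apple", "band"]))
```

In this sketch the sort key `t[depth + 1:]` builds a fresh copy of each suffix before comparison. This is only loosely analogous to the suffix-copying step mentioned above, which copies suffixes into a small contiguous buffer to improve locality.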
References
Aho, A., Hopcroft, J.E., Ullman, J.D.: The Design and Analysis of Computer Algorithms. Addison-Wesley, Reading (1974)
Andersson, A., Nilsson, S.: Implementing radixsort. ACM Jour. of Experimental Algorithmics 3(7) (1998)
Arge, L., Ferragina, P., Grossi, R., Vitter, J.S.: On sorting strings in external memory. In: Leighton, F.T., Shor, P. (eds.) Proc. ACM Symp. on Theory of Computation, El Paso, pp. 540–548. ACM Press, New York (1997)
Bender, M.A., Farach-Colton, M., Kuszmaul, B.C.: Cache-oblivious string B-trees. In: PODS 2006: Proceedings of the twenty-fifth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, New York, NY, USA, pp. 233–242. ACM Press, New York (2006)
Benson, D.A., Karsch-Mizrachi, I., Lipman, D.J., Ostell, J., Wheeler, D.L.: GenBank. Nucleic Acids Research 31(1), 23–27 (2003)
Bentley, J., Sedgewick, R.: Fast algorithms for sorting and searching strings. In: Saks, M. (ed.) Proc. Annual ACM-SIAM Symp. on Discrete Algorithms, New Orleans, LA, USA. Society for Industrial and Applied Mathematics, pp. 360–369 (1997)
Bentley, J.L., McIlroy, M.D.: Engineering a sort function. Software—Practice and Experience 23(11), 1249–1265 (1993)
Brodal, G.S., Fagerberg, R., Vinther, K.: Engineering a cache-oblivious sorting algorithm. ACM Jour. of Experimental Algorithmics 12(2.2), 23 (2007)
Demaine, E.D.: Cache-oblivious algorithms and data structures. In: Lecture Notes from the EEF Summer School on Massive Data Sets, BRICS, University of Aarhus, Denmark, June 2002. LNCS (2002)
Frigo, M., Leiserson, C.E., Prokop, H., Ramachandran, S.: Cache-oblivious algorithms. In: Beame, P. (ed.) FOCS 1999: Proceedings of the 40th Annual Symposium on Foundations of Computer Science, Washington, DC, USA, pp. 285–298. IEEE Computer Society Press, Los Alamitos (1999)
Graefe, G.: Implementing sorting in database systems. Computing Surveys 38(3), 1–37 (2006)
Harman, D.: Overview of the second text retrieval conference (TREC-2). Information Processing and Management 31(3), 271–289 (1995)
Heinz, S., Zobel, J., Williams, H.E.: Burst tries: A fast, efficient data structure for string keys. ACM Transactions on Information Systems 20(2), 192–223 (2002)
Knuth, D.E.: The Art of Computer Programming: Sorting and Searching, 2nd edn., vol. 3. Addison-Wesley, Reading (1998)
Levitin, A.V.: Introduction to the Design and Analysis of Algorithms, 2nd edn. Pearson, London (2007)
McIlroy, P.M., Bostic, K., McIlroy, M.D.: Engineering radix sort. Computing Systems 6(1), 5–27 (1993)
Moffat, A., Eddy, G., Petersson, O.: Splaysort: Fast, versatile, practical. Software—Practice and Experience 26(7), 781–797 (1996)
Sedgewick, R.: Algorithms in C, 3rd edn. Addison-Wesley Longman Publishing Co., Inc., Boston (1998)
Seward, J.: Valgrind—memory and cache profiler (2001), http://developer.kde.org/~sewardj/docs-1.9.5/cg_techdocs.html
Sinha, R., Ring, D., Zobel, J.: Cache-efficient string sorting using copying. ACM Jour. of Experimental Algorithmics 11(1.2) (2006)
Sinha, R., Zobel, J.: Cache-conscious sorting of large sets of strings with dynamic tries. ACM Jour. of Experimental Algorithmics 9(1.5) (2004)
Sinha, R., Zobel, J.: Using random sampling to build approximate tries for efficient string sorting. ACM Jour. of Experimental Algorithmics 10 (2005)
Copyright information
© 2008 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Sinha, R., Wirth, A. (2008). Engineering Burstsort: Towards Fast In-Place String Sorting. In: McGeoch, C.C. (eds) Experimental Algorithms. WEA 2008. Lecture Notes in Computer Science, vol 5038. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-68552-4_2
DOI: https://doi.org/10.1007/978-3-540-68552-4_2
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-68548-7
Online ISBN: 978-3-540-68552-4
eBook Packages: Computer Science, Computer Science (R0)