Abstract
In this article, we present a novel word-based lossless compression algorithm for text files using a semi-static model. We named this method the ‘Multi-stream word-based compression algorithm (MWCA)’ because it stores the compressed forms of the words in three individual streams depending on their frequencies in the text and stores two dictionaries and a bit vector as side information. In our experiments, MWCA produces a compression ratio of 3.23 bpc on average and 2.88 bpc for files greater than 50 MB; if a variable length encoder such as Huffman coding is used after MWCA, the given ratios are reduced to 2.65 and 2.44 bpc, respectively. MWCA supports exact word matching without decompression, and its multi-stream approach reduces the search time with respect to single-stream algorithms. Additionally, the MWCA multi-stream structure supplies the reduction in network load by requesting only the necessary streams from the database. With the advantage of its fast compressed search feature and multi-stream structure, we believe that MWCA is a good solution, especially for storing and searching big text data.
Article PDF
Similar content being viewed by others
Avoid common mistakes on your manuscript.
References
Bell, T.C.; Cleary, J.G.; Witten, I.H.: Text compression. vol. 348. Englewood Cliffs: Prentice Hall, Inc., Upper Saddle River, NJ, USA (1990)
Brisaboa, N.R.; Farina, A.; Navarro, G.; Param, J.R.: Improving semistatic compression via pair-based coding. In: International Andrei Ershov Memorial Conference on Perspectives of System Informatics (pp. 124–134) (2006)
Brisaboa, N.R.; Farina, A.; Navarro, G.; Esteller, M.F.: (S, C)-dense coding: an optimized compression code for natural language text databases. In: International Symposium on String Processing and Information Retrieval (pp. 122–136) (2003)
Navarro, G.; Tarhio, J.: Boyer–Moore string matching over Ziv–Lempel compressed text. In: Annual Symposium on Combinatorial Pattern Matching (pp. 166–180) (2000)
Moffat, A.: Word-based text compression. Softw. Pract. Exp. 19(2), 185–198 (1989)
Ziv, J.; Lempel, A.: A universal algorithm for sequential data compression. IEEE Trans. Inf. Theory 23(3), 337–343 (1977)
Lempel, A.: Compression of individual sequences via variable-rate coding. IEEE Trans. Inf. Theory 24(5), 530–536 (1978)
Welch, T.A.: A technique for high-performance data compression. IEEE Comput. 17(6), 8–19 (1984)
Deutsch, L.P.: DEFLATE compressed data format specification version 1.3 (1996)
Cleary, J.; Witten, I.: Data compression using adaptive coding and partial string matching. IEEE Trans. Commun. 32(4), 396–402 (1984)
Bentley, J.L.; Sleator, D.D.; Tarjan, R.E.; Wei, V.K.: A locally adaptive data compression scheme. Commun. ACM 29(4), 320–330 (1986)
Turpin, A.; Moffat, A.: Fast file search using text compression. Australian Comput. Sci. Commun. 19, 18 (1997)
de Moura, E.; Navarro, G.; Ziviani, N.; Baeza-Yates, R.: Fast and flexible word searching on compressed text. ACM Trans. Inf. Syst. (TOIS) 18(2), 113–139 (2000)
Brisaboa, N.R.; Faria, A.; Navarro, G.; Param, J.R.: Lightweight natural language text compression. Inf. Retr. 10(1), 133 (2007)
Brisaboa, N.R.; Farina, A.; Navarro, G.; Parama, J.R.: New adaptive compressors for natural language text. Softw. Pract. Exp. 38(13), 1429–1450 (2008)
Brisaboa, N.; Farina, A.; Navarro, G.; Param, J.: Dynamic lightweight text compression. ACM Trans. Inf. Syst. (TOIS) 28(3), 10 (2010)
Carus, A.; Mesut, A.: A new compression algorithm for fast text search. Turk. J. Electr. Eng. Comput. Sci. 24(5), 4355–4367 (2016)
Farina, A.; Navarro, G.; Param, J.R.: Word-based statistical compressors as natural language compression boosters. In: Data Compression Conference, 2008. DCC 2008 (pp. 162–171) (2008)
Farina, A.; Navarro, G.; Param, J.R.: Boosting text compression with word-based statistical encoding. Comput. J. bxr096 (2011)
Burrows, M.; Wheeler, D.J.: A Block-Sorting Lossless Data Compression Algorithm. Digital Systems Research Center Research Reports, Palo Alto, California, USA (1994)
Awan, F.S.; Mukherjee, A.: LIPT: a lossless text transform to improve compression. In: Information Technology: Coding and Computing, 2001. Proceedings. International Conference on (pp. 452–460) (2001)
Sun, W.; Zhang, N.; Mukherjee, A.: Dictionary-based fast transform for text compression. In: Information Technology: Coding and Computing [Computers and Communications], 2003. Proceedings. ITCC 2003. International Conference on (pp. 176–182) (2003)
Skibiski, P.; Grabowski, S.; Deorowicz, S.: Revisiting dictionary-based compression. Softw. Pract. Exp. 35(15), 1455–1476 (2005)
Amir, A.; Benson, C.: Efficient two-dimensional compressed matching. In: Data Compression Conference, 1992. DCC’92. (pp. 279–288) (1992)
Wu, S.; Manber, U.: Fast Text Searching with Errors. University of Arizona, Department of Computer Science, Tucson (1991)
Boyer, R.S.; Moore, J.S.: A fast string searching algorithm. Commun. ACM 20(10), 762–772 (1977)
Berry, T.; Ravindran, S.: A fast string matching algorithm and experimental results. In: Stringology (pp. 16–28) (1999)
Horspool, R.N.: Practical fast searching in strings. Softw. Pract. Exp. 10(6), 501–506 (1980)
Knuth, D.E.; Morris Jr., J.H.; Pratt, V.R.: Fast pattern matching in strings. SIAM J. Comput. 6(2), 323–350 (1977)
Navarro, G.; Tarhio, J.: LZgrep: a Boyer–Moore string matching tool for Ziv–Lempel compressed text. Softw. Pract. Exp. 35(12), 1107–1130 (2005)
Ziviani, N.; De Moura, E.S.; Navarro, G.; Baeza-Yates, R.: Compression: a key for next-generation text retrieval systems. Computer 33(11), 37–44 (2000)
Heaps, H.S.: Information retrieval: Computational and theoretical aspects. Academic Press, Inc., Orlando, FL, USA (1978)
Huffman, D.A.: A method for the construction of minimum-redundancy codes. Proc. IRE 40(9), 1098–1101 (1952)
Author information
Authors and Affiliations
Corresponding author
Rights and permissions
About this article
Cite this article
Öztürk, E., Mesut, A. & Diri, B. Multi-Stream Word-Based Compression Algorithm for Compressed Text Search. Arab J Sci Eng 43, 8209–8221 (2018). https://doi.org/10.1007/s13369-018-3378-9
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s13369-018-3378-9