Abstract
We describe a simple and efficient scheme which allows words to be managed in PPM modelling when a natural language text file is being compressed. The main idea for managing words is to assign them codes to make them easier to manipulate. A general technique is used to obtain this objective: a dictionary mapping on PPM modelling. In order to test our idea, we are implementing three prototypes: one implements the basic dictionary mapping on PPM, another implements the dictionary mapping with the separate alphabets model and the last one implements the dictionary with the spaceless words model. This technique can be applied directly or it can be combined with some word compression model. The results for files of 1 Mb. and over are better than those achieved by the character PPM which was taken as a base. The comparison between different prototypes shows that the best option is to use a word based PPM in conjunction with the spaceless word concept.
This work was partially supported by the TIC2003-09268 project from MCyT, Spain.
Access provided by Autonomous University of Puebla. Download to read the full chapter text
Chapter PDF
Similar content being viewed by others
References
Adiego, J., de la Fuente, P., Navarro, G.: Merging prediction by partial matching with structural contexts model. In: Proceedings of 14th Data Compression Conference (DCC 2004), p. 522 (2004)
Arnold, R., Bell, T.C.: A corpus for the evaluation of lossless compression algorithms. In: Proceedings of 7th Data Compression Conference (DCC 1997), pp. 201–210 (1997)
Baeza-Yates, R., Ribeiro-Neto, B.: Modern Information Retrieval, Addison-Wesley-Longman (May 1999)
Bell, T.C., Cleary, J.G., Witten, I.H.: Text Compression. Prentice Hall, Englewood Cliffs (1990)
Bell, T.C., Moffat, A., Nevill-Manning, C., Witten, I.H., Zobel, J.: Data compression in full-text retrieval systems. Journal of the American Society for Information Science 44, 508–531 (1993)
Bentley, J., Sleator, D., Tarjan, R., Wei, V.: A locally adaptive data compression scheme. Communications of the ACM 29, 320–330 (1986)
Burrows, M., Wheeler, D.: A block sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corporation (1994)
Cheney, J.: Compressing XML with multiplexed hierarchical PPM models. In: Proceedings of 11th Data Compression Conference (DCC 2001), pp. 163–172 (2001)
Clearly, J.G., Witten, I.H.: Data compression using adaptive coding and partial string matching. IEEE Transactions on Communications COM-32(4), 396–402 (1984)
Dvorský, J., Pokorný, J., Snášel, V.: Word-based compression methods and indexing for text retrieval systems. In: Eder, J., Rozman, I., Welzer, T. (eds.) ADBIS 1999. LNCS, vol. 1691, pp. 75–84. Springer, Heidelberg (1999)
Harman, D.: Overview of the Third Text REtrieval Conference (NIST Special Publication 500-207). In: Proc. Third Text REtrieval Conference (TREC-3), pp. 1–19. NIST Special Publication 207-500 (1995)
Heaps, H.S.: Information Retrieval - Computational and Theoretical Aspects. Academic Press, London (1978)
Horspool, R.N., Cormack, G.V.: Constructing word-based text compression algorithms. In: Proceedings of 2nd Data Compression Conference (DCC 1992), pp. 62–71 (1992)
Liefke, H., Suciu, D.: XMill: an efficient compressor for XML data. In: Proc. ACM SIGMOD 2000, pp. 153–164 (2000)
Moffat, A.: Word-based text compression. Software - Practice and Experience 19(2), 185–198 (1989)
Moffat, A., Isal, R.Y.K.: Word-based text compression using the Burrows–Wheeler transform. Information Processing & Management 41(5), 1175–1192 (2005)
Moura, E., Navarro, G., Ziviani, N.: Indexing compressed text. In: Proceedings of the Fourth South American Workshop on String Processing, pp. 95–111 (1997)
Shkarin, D.: PPM: One step to practicality. In: Proceedings of 12th Data Compression Conference (DCC 2002), pp. 202–211 (2002)
Skibinski, P., Grabowski, S., Deorowicz, S.: Revisiting dictionary-based compression. Software–Practice and Experience 35(15), 1455–1476 (2005)
Zipf, G.: Human Behaviour and the Principle of Least Effort. Addison–Wesley (1949)
Ziviani, N., Moura, E., Navarro, G., Baeza-Yates, R.: Compression: A key for next-generation text retrieval systems. IEEE Computer 33(11), 37–44 (2000)
Author information
Authors and Affiliations
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Adiego, J., de la Fuente, P. (2006). Mapping Words into Codewords on PPM. In: Crestani, F., Ferragina, P., Sanderson, M. (eds) String Processing and Information Retrieval. SPIRE 2006. Lecture Notes in Computer Science, vol 4209. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11880561_15
Download citation
DOI: https://doi.org/10.1007/11880561_15
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-45774-9
Online ISBN: 978-3-540-45775-6
eBook Packages: Computer ScienceComputer Science (R0)