Abstract
Sharing of common subtrees has been reported to be useful not only for XML compression but also for main-memory XML query processing. This method compresses subtrees only when they exhibit identical structure. Even slight irregularities among subtrees dramatically reduce the performance of compression algorithms of this kind. Furthermore, when XML documents are large, the chance of having a large number of identical subtrees is inherently low. In this paper, we propose a method of decomposing XML documents for better compression. We propose a heuristic method of locating minor irregularities in XML documents. The irregularities are then projected out from the original XML document. We refer to this process as document decomposition. We demonstrate that better compression can be achieved by compressing the decomposed documents separately. Experimental results demonstrate that the compressed skeletons, for all the real-world datasets we are aware of, fit comfortably into the main memory of today's commodity computers. Preliminary results on querying compressed skeletons validate the effectiveness of our approach.
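The effect described in the abstract can be illustrated with a minimal sketch. It is not the paper's algorithm: it assumes subtree sharing is modelled by hash-consing element tags only (text and attributes ignored), and that the irregular tag to project out is already known, whereas the paper locates irregularities heuristically. The function name `count_unique_subtrees`, its `drop` parameter, and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET


def count_unique_subtrees(root, drop=frozenset()):
    """Count distinct element subtrees (tags only) under root.

    Identical subtrees are stored once, as in subtree sharing.
    Elements whose tag is in `drop` are skipped together with their
    subtrees, a crude stand-in for projecting out irregularities.
    """
    table = {}  # (tag, tuple of child ids) -> id of the shared node

    def visit(elem):
        children = tuple(visit(c) for c in elem if c.tag not in drop)
        key = (elem.tag, children)
        # A new key receives the next free id; a known key reuses its id.
        return table.setdefault(key, len(table))

    visit(root)
    return len(table)


# Three <book> records; the second carries one irregular <note/> child.
doc = ET.fromstring(
    "<lib>"
    "<book><title/><author/></book>"
    "<book><title/><author/><note/></book>"
    "<book><title/><author/></book>"
    "</lib>"
)

full = count_unique_subtrees(doc)
projected = count_unique_subtrees(doc, drop=frozenset({"note"}))
print(full, projected)  # 6 4
```

In this toy document the single `<note/>` element prevents the second `<book>` from being shared with the other two, so 6 distinct subtrees are stored for 11 elements. With `note` projected out, all three `<book>` subtrees collapse into one and only 4 distinct subtrees remain; the projected-out elements would be stored and compressed separately.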
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Choi, B. (2006). Document Decomposition for XML Compression: A Heuristic Approach. In: Li Lee, M., Tan, KL., Wuwongse, V. (eds) Database Systems for Advanced Applications. DASFAA 2006. Lecture Notes in Computer Science, vol 3882. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11733836_16
DOI: https://doi.org/10.1007/11733836_16
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-33337-1
Online ISBN: 978-3-540-33338-8