Abstract
This paper proposes exploiting content and usage information to rearrange the inverted index of a full-text IR system. The idea is to merge the entries of two terms that frequently co-occur, either in the collection or in the answered queries, into a single paired entry. Since postings common to the paired terms are not replicated, the resulting index is more compact. In addition, queries containing paired terms are answered faster, since the pre-computed posting intersection can be exploited. To choose which terms to pair, we formulate term pairing as a Maximum-Weight Matching problem on a graph, and we evaluate the efficiency and efficacy of both an exact and a heuristic solution in our scenario. We apply our technique (i) to compact a compressed inverted file built on an actual Web collection of documents, and (ii) to increase the capacity of an in-memory posting list. Experiments show that in the first case our approach improves the compression ratio by up to 7.7%, while in the second we measured a saving of 12% to 18% in the size of the posting cache.
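To make the pairing idea concrete, the following Python sketch shows one way a paired entry and a term-matching step might look. It is an illustration under stated assumptions, not the authors' implementation: the three-part entry layout (`only_a`, `both`, `only_b`), the function names, the toy index, and the choice of edge weight (number of shared postings) are all hypothetical. The matching step uses a simple greedy heaviest-edge-first heuristic, which is not necessarily the heuristic evaluated in the paper and does not guarantee a maximum-weight matching.

```python
from itertools import combinations


def pair_entry(postings_a, postings_b):
    """Merge two posting lists into one paired entry (assumed layout).

    Shared postings are stored once in "both", so they are not replicated.
    """
    a, b = set(postings_a), set(postings_b)
    return {
        "only_a": sorted(a - b),
        "both": sorted(a & b),  # pre-computed intersection of the two terms
        "only_b": sorted(b - a),
    }


def postings_of(entry, which):
    """Rebuild the original posting list of term "a" or "b" from a paired entry."""
    own = entry["only_a"] if which == "a" else entry["only_b"]
    return sorted(own + entry["both"])


def greedy_matching(index):
    """Greedy heuristic for term pairing (assumption: weight = shared postings).

    Builds a graph whose nodes are terms and whose edge weights count the
    postings two terms share, then picks the heaviest edges first while
    keeping every term in at most one pair.
    """
    edges = []
    for t, u in combinations(sorted(index), 2):
        weight = len(set(index[t]) & set(index[u]))
        if weight > 0:
            edges.append((weight, t, u))
    # Heaviest edge first; ties broken alphabetically for determinism.
    edges.sort(key=lambda e: (-e[0], e[1], e[2]))

    matched, pairs = set(), []
    for weight, t, u in edges:
        if t not in matched and u not in matched:
            pairs.append((t, u, weight))
            matched.update((t, u))
    return pairs


# Toy index (hypothetical data): term -> sorted list of document identifiers.
index = {
    "new": [1, 2, 3, 5, 8],
    "york": [2, 3, 5, 9],
    "search": [1, 4, 7],
    "engine": [4, 7, 8],
}

pairs = greedy_matching(index)
entry = pair_entry(index["new"], index["york"])
saved = sum(weight for _, _, weight in pairs)  # postings no longer stored twice
```

In this toy index, a conjunctive query on "new" and "york" could be answered directly from `entry["both"]` without intersecting two lists at query time, and `saved` counts the postings that are no longer stored twice. An exact solution would replace `greedy_matching` with a maximum-weight matching algorithm on the same graph.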
Copyright information
© 2009 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Lam, H.T., Perego, R., Quan, N.T.M., Silvestri, F. (2009). Entry Pairing in Inverted File. In: Vossen, G., Long, D.D.E., Yu, J.X. (eds) Web Information Systems Engineering - WISE 2009. WISE 2009. Lecture Notes in Computer Science, vol 5802. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-04409-0_50
DOI: https://doi.org/10.1007/978-3-642-04409-0_50
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-04408-3
Online ISBN: 978-3-642-04409-0
eBook Packages: Computer Science, Computer Science (R0)