Abstract
Large web search engines have to answer thousands of queries per second with interactive response times. Due to the sizes of the data sets involved, often in the range of multiple terabytes, a single query may require the processing of hundreds of megabytes or more of index data. To keep up with this immense workload, large search engines employ clusters of hundreds or thousands of machines, and a number of techniques such as caching, index compression, and index and query pruning are used to improve scalability. In particular, two-level caching techniques cache results of repeated identical queries at the frontend, while index data for frequently used query terms are cached in each node at a lower level. We propose and evaluate a three-level caching scheme that adds an intermediate level of caching for additional performance gains. This intermediate level attempts to exploit frequently occurring pairs of terms by caching intersections or projections of the corresponding inverted lists. We propose and study several offline and online algorithms for the resulting weighted caching problem, which turns out to be surprisingly rich in structure. Our experimental evaluation based on a large web crawl and real search engine query log shows significant performance gains for the best schemes, both in isolation and in combination with the other caching levels. We also observe that a careful selection of cache admission and eviction policies is crucial for best overall performance.
Article PDF
Similar content being viewed by others
Avoid common mistakes on your manuscript.
References
Anh, V., Kretser, O., Moffat, A.: Vector-space ranking with effective early termination. In: Proceedings of the 24th Annual SIGIR Conference on Research and Development in Information Retrieval, pp. 35–42, September 2001
Anh, V., Moffat, A.: Compressed inverted files with reduced decoding overheads. In: Proceedings of the 21st Annual SIGIR Conference on Research and Development in Information Retrieval, pp. 290–297, 1998
Arasu, A., Cho, J., Garcia-Molina, H., Raghavan, S.: Searching the web. ACM Trans. Internet Technol. 1(1), June (2001)
Badue, C., Baeza-Yates, R., Ribeiro-Neto, B., Ziviani, N.: Distributed query processing using partitioned inverted files. In: Proceedings of the 9th String Processing and Information Retrieval Symposium (SPIRE), September 2002
Baeza-Yates, R., Ribeiro-Neto, B.: Modern Information Retrieval. Addision Wesley (1999)
Bahle, D., Williams, H., Zobel, J.: Efficient phrase querying with an auxiliary index. In: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 215–221, 2002
Bhattacharjee, B., Chawathe, S., Gopalakrishnan, V., Keleher, P., Silaghi, B.: Efficient peer-to-peer searches using result-caching. In: Proceedings of the 2nd International Workshop on Peer-to-Peer Systems, 2003
Brewer, E.: Lessons from giant scale services. IEEE Internet Comput., pp. 46–55, August (2001)
Brin, S., Page, L.: The anatomy of a large-scale hypertextual web search engine. In: Proceedings of the Seventh World Wide Web Conference, 1998
Broder, A.: On the resemblance and containment of documents. In: Compression and Complexity of Sequences, pp. 21–29, IEEE Comput. Soc. (1997)
Cao, P., Irani, S.: Cost-aware WWW proxy caching algorithms. In: USENIX Symposium on Internet Technologies and Systems (USITS), 1997
Chaudhuri, S., Gravano, L.: Optimizing queries over multimedia repositories. Data Eng. Bull. 19(4), 45–52 (1996)
Demaine, E., Lopez-Ortiz, A., Munro, J.: Adaptive set intersections, unions, and differences. In: Proceedings of the 11th Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 743–752, 2000
Fagin, R.: Combining fuzzy information from multiple systems. In: Proceedings of ACM Symposium on Principles of Database Systems, 1996
Fagin, R., Carmel, D., Cohen, D., Farchi, E., Herscovici, M., Maarek, Y., Soffer, A.: Static index pruning for information retrieval systems. In: Proceedings of the 24th Annual SIGIR Conference on Research and Development in Information Retrieval, pp. 43–50, September 2001
Fagin, R., Lotem, A., Naor, M.: Optimal aggregation algorithms for middleware. In: Proceedings of ACM Symposium on Principles of Database Systems, 2001
Garcia-Molina, H., Ullman, J., Widom, J.: Database System Implementation. Prentice Hall (2000)
Garey, M., Johnson, D.: Computers and Intractability: A Guide to the Theory of NP Completeness. WH Freeman and Company (1979)
Haveliwala, T.: Topic-sensitive pagerank. In: Proceedings of the 11th International World Wide Web Conference, May 2002
Jonsson, B.T., Franklin, M.J., Srivastava, D.: Interaction of query evaluation and buffer management for information retrieval. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 118–129, June 1998
Kaszkiel, M., Zobel, J., Sacks-Davis, R.: Efficient passage ranking for document databases. ACM Trans. Inf. Sys. (TOIS) 17(4), 406–439, October (1999)
Lempel , R., Moran, S.: Optimizing result prefetching in web search engines with segmented indices. In: Proceedings of the 28th International Conference on Very Large Data Bases, August 2002
Lempel, R., Moran, S.: Predictive caching and prefetching of query results in search engines. In: Proceedings of the 12th International World-Wide Web Conference, 2003
Li, J., Loo, B., Hellerstein, J., Kaashoek, F., Karger, D., Morris, R.: On the feasibility of peer-to-peer web indexing. In: Proceedings of the 2nd International Workshop on Peer-to-Peer Systems, 2003
Long, X., Suel, T.: Optimized query execution in large search engines with global page ordering. In: Proceedings of the 29th International Conference on Very Large Data Bases, September 2003
Markatos, E.: On caching search engine query results. In: 5th International Web Caching and Content Delivery Workshop, May 2000
Megiddo, N., Modha, D.: Outperforming LRU with an adaptive replacement cache. IEEE Comput., pp. 58–65, April (2004)
Melnik, S., Raghavan, S., Yang, B., Garcia-Molina, H.: Building a distributed full-text index for the web. In: Proceedings of the 10th International World Wide Web Conference, May 2000
Persin, M., Zobel, J., Sacks-Davis, R.: Filtered document retrieval with frequency-sorted indexes. J. Am. Soc. Inf. Sci., 47(10), 749–764, May (1996)
Richardson, M., Domingos, P.: The intelligent surfer: Probabilistic combination of link and content information in pagerank. In: Advances in Neural Information Processing Systems, 2002
Risvik, K., Aasheim, Y., Lidal, M.: Multi-tier architecture for web search engines. In: First Latin American Web Congress, pp. 132–143, 2003
Risvik, K., Michelsen, R.: Search engines and web dynamics. Comput. Netw. 39, 289–302 (2002)
Saraiva, P., de Moura, E., Ziviani, N., Meira, W., Fonseca, R., Ribeiro-Neto, B.: Rank-preserving two-level caching for scalable search engines. In: Proceedings of the 24th Annual SIGIR Conference on Research and Development in Information Retrieval, pp. 51–58, September 2001
Scholer, F., Williams, H., Yiannis, J., Zobel, J.: Compression of inverted indexes for fast query evaluation. In: Proceedings of the 25th Annual SIGIR Conference on Research and Development in Information Retrieval, pp. 222–229, 2002
Shkapenyuk, V., Suel, T.: Design and implementation of a high-performance distributed web crawler. In: Proceedings of the International Conference on Data Engineering, 2002
Suel, T., Mathur, C., Wu, J., Zhang, J., Delis, A., Kharrazi, M., Long, X., Shanmugasundaram, K.: ODISSEA: A peer-to-peer architecture for scalable web search and information retrieval. In: International Workshop on the Web and Databases (WebDB), June 2003
Tomasic, A., Garcia-Molina, H.: Performance of inverted indices in distributed text document retrieval systems. In: Proceedings of the 2nd International Conference on Parallel and Distributed Information Systems (PDIS), 1993
Witten, I.H., Moffat, A., Bell, T.C.: Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann, second edition (1999)
Xie, Y., O’Hallaron, D.: Locality in search engine queries and its implications for caching. In: IEEE Infocom 2002, pp. 1238–1247, 2002
Young, N.: On-line file caching. In: Proceedings of the 9th Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 82–86, 1998
Author information
Authors and Affiliations
Corresponding author
Additional information
Work supported by NSF CAREER Award CCR-0093400 and the New York State Center for Advanced Technology in Telecommunications (CATT) at Polytechnic University.
Rights and permissions
About this article
Cite this article
Long, X., Suel, T. Three-Level Caching for Efficient Query Processing in Large Web Search Engines. World Wide Web 9, 369–395 (2006). https://doi.org/10.1007/s11280-006-0221-0
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s11280-006-0221-0