Abstract
Collection selection is one of the key problems in distributed information retrieval. Because of resource constraints, it is usually not feasible to search all collections in response to a query. Therefore, the central component (broker) selects a limited number of collections to be searched for each submitted query. During the past decade, several collection selection algorithms have been introduced. However, their performance varies across testbeds. We propose a new collection selection method based on the ranking of downloaded sample documents. We test our method on six testbeds and show that our technique can significantly outperform other state-of-the-art algorithms in most cases. We also introduce a new testbed based on the TREC GOV2 documents.
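As an illustration of the general idea of selecting collections from the ranking of downloaded sample documents, the following is a minimal sketch and not the method as defined in the paper. It assumes the broker has already run the query against an index of all sampled documents and obtained a ranked list. The function name `select_collections`, the `top_n` cutoff, and the linear rank weighting are assumptions made for this example; the sketch also omits any normalisation by collection or sample size.

```python
from collections import defaultdict


def select_collections(ranked_sample_docs, doc_to_collection, k, top_n=50):
    """Rank collections by where their sampled documents fall in a central ranking.

    ranked_sample_docs: sampled document ids, best first, for one query.
    doc_to_collection: maps each sampled document id to its source collection.
    k: number of collections the broker is willing to search.
    top_n: how many top-ranked sample documents contribute (assumed cutoff).
    """
    scores = defaultdict(float)
    for rank, doc_id in enumerate(ranked_sample_docs[:top_n]):
        # Assumed weighting: the document at rank r contributes top_n - r,
        # so higher-ranked samples count for more.
        scores[doc_to_collection[doc_id]] += top_n - rank
    # Highest score first; ties broken by collection name for determinism.
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [name for name, _ in ordered[:k]]


# Hypothetical data: four sampled documents drawn from three collections.
ranking = ["d3", "d1", "d7", "d2"]
owner = {"d1": "A", "d2": "B", "d3": "A", "d7": "C"}
selected = select_collections(ranking, owner, k=2, top_n=4)
```

With this toy input, collection A holds the two top-ranked samples and is selected first. Collections with no sampled document in the top `top_n` receive no score and are never selected.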
© 2007 Springer Berlin Heidelberg
About this paper
Cite this paper
Shokouhi, M. (2007). Central-Rank-Based Collection Selection in Uncooperative Distributed Information Retrieval. In: Amati, G., Carpineto, C., Romano, G. (eds) Advances in Information Retrieval. ECIR 2007. Lecture Notes in Computer Science, vol 4425. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-71496-5_17
DOI: https://doi.org/10.1007/978-3-540-71496-5_17
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-71494-1
Online ISBN: 978-3-540-71496-5
eBook Packages: Computer Science (R0)