Abstract
Uniform Resource Locator (URL) ordering algorithms are used by Web crawlers to determine the order in which to download pages from the Web. Current approaches to URL ordering based on link structure are expensive, miss many good pages, or both, particularly in social network environments. In this paper, we present a novel URL ordering system that relies on cooperation between crawlers and Web servers, drawing on file system and Web log information. In particular, we develop algorithms based on file timestamps and on internal and external counts taken from the Web logs. By using this change and popularity information for URL ordering, we are able to retrieve high quality pages earlier in the crawl while avoiding requests for pages that are unchanged or no longer available. We perform our experiments on two data sets, using the Web logs from a university website and from the CiteSeer website. On these data sets, we achieve statistically significant improvements of 57.2% and 65.7% in the ordering of high quality pages (as indicated by Google’s PageRank) over a breadth-first search crawl, while increasing the number of unique pages gathered by skipping unchanged or deleted pages.
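The general idea of ordering by change and popularity information can be illustrated with a minimal sketch. This is not the authors' algorithm: the `PageInfo` record, the `order_urls` function, the linear scoring rule, and the `external_weight` parameter are all hypothetical, and the sketch assumes that "internal" and "external" counts mean log requests referred from inside and outside the site. It only shows how server-side timestamps could be used to skip unchanged or deleted pages while log counts rank the rest.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PageInfo:
    """Hypothetical per-URL record a cooperating server might expose."""
    url: str
    # File timestamp in epoch seconds; None means the file no longer exists.
    last_modified: Optional[float]
    # Assumed meaning: log requests referred from within the same site.
    internal_count: int
    # Assumed meaning: log requests referred from other sites.
    external_count: int


def order_urls(pages: List[PageInfo],
               last_crawl_time: float,
               external_weight: float = 2.0) -> List[str]:
    """Return URLs worth fetching, most popular first.

    Pages that were deleted or not modified since the last crawl are skipped.
    The score (internal + weight * external) is an illustrative assumption.
    """
    scored = []
    for page in pages:
        if page.last_modified is None:
            continue  # deleted: no request needed
        if page.last_modified <= last_crawl_time:
            continue  # unchanged since the last crawl: no request needed
        score = page.internal_count + external_weight * page.external_count
        scored.append((score, page.url))
    # Highest score first; ties broken by URL for a deterministic order.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [url for _, url in scored]


if __name__ == "__main__":
    demo = [
        PageInfo("/a.html", 200.0, 5, 1),
        PageInfo("/b.html", 50.0, 100, 100),  # unchanged, skipped
        PageInfo("/c.html", None, 9, 9),      # deleted, skipped
        PageInfo("/d.html", 300.0, 1, 4),
    ]
    print(order_urls(demo, last_crawl_time=100.0))
```

In this sketch the skip decisions are made before any request is issued, which is where a cooperative scheme would save bandwidth relative to a crawler that must fetch a page to learn it has not changed.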
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Chandramouli, A., Gauch, S., Eno, J. (2012). A Cooperative Approach to Web Crawler URL Ordering. In: Hippe, Z.S., Kulikowski, J.L., Mroczek, T. (eds) Human – Computer Systems Interaction: Backgrounds and Applications 2. Advances in Intelligent and Soft Computing, vol 98. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-23187-2_22
DOI: https://doi.org/10.1007/978-3-642-23187-2_22
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-23186-5
Online ISBN: 978-3-642-23187-2
eBook Packages: Engineering, Engineering (R0)