A Cooperative Approach to Web Crawler URL Ordering

Chandramouli, A.; Gauch, S.; Eno, J.

doi:10.1007/978-3-642-23187-2_22


Part of the book series: Advances in Intelligent and Soft Computing (AINSC, volume 98)


Abstract

Uniform Resource Locator (URL) ordering algorithms are used by Web crawlers to determine the order in which to download pages from the Web. The current approaches for URL ordering based on link structure are expensive and/or miss many good pages, particularly in social network environments. In this paper, we present a novel URL ordering system that relies on a cooperative approach between crawlers and Web servers based on file system and Web log information. In particular, we develop algorithms based on file timestamps and Web log internal and external counts. By using this change and popularity information for URL ordering, we are able to retrieve high-quality pages earlier in the crawl while avoiding requests for pages that are unchanged or no longer available. We perform our experiments on two data sets using the Web logs from university and CiteSeer websites. On these data sets, we achieve a statistically significant improvement in the ordering of the high-quality pages (as indicated by Google’s PageRank) of 57.2% and 65.7% over that of a breadth-first search crawl while increasing the number of unique pages gathered by skipping unchanged or deleted pages.
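
The abstract does not give the algorithms themselves, so the following is only a minimal sketch of the general idea: a cooperating server publishes, per URL, a file timestamp and Web log counts, and the crawler uses them to skip unchanged or deleted pages and to fetch popular pages first. The `PageInfo` record, the `order_urls` function, the reading of "internal" and "external" counts as requests referred from inside and outside the site, and the `external_weight` parameter are all assumptions made for illustration, not the authors' method.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional


@dataclass(frozen=True)
class PageInfo:
    """Hypothetical per-URL record that a cooperating server might publish."""
    url: str
    last_modified: Optional[float]  # file timestamp in epoch seconds; None if the file is gone
    internal_hits: int              # assumed: log requests referred from the same site
    external_hits: int              # assumed: log requests referred from other sites


def order_urls(
    pages: Iterable[PageInfo],
    last_crawled: Dict[str, float],
    external_weight: float = 2.0,   # hypothetical weighting, not taken from the chapter
) -> List[str]:
    """Return URLs worth fetching, most popular first.

    Pages that were deleted, or that have not changed since the crawler
    last fetched them, are left out so that no request is spent on them.
    """
    scored = []
    for page in pages:
        if page.last_modified is None:
            continue  # page no longer available on the server
        seen_at = last_crawled.get(page.url)
        if seen_at is not None and page.last_modified <= seen_at:
            continue  # unchanged since the previous crawl
        score = page.internal_hits + external_weight * page.external_hits
        scored.append((score, page.url))
    # Highest score first; ties are broken by URL to keep the order deterministic.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [url for _, url in scored]


if __name__ == "__main__":
    pages = [
        PageInfo("/a", 100.0, 5, 1),    # changed since last crawl, modest popularity
        PageInfo("/b", 200.0, 1, 10),   # never crawled, popular with external visitors
        PageInfo("/c", None, 50, 50),   # deleted
        PageInfo("/d", 50.0, 99, 99),   # unchanged since last crawl
    ]
    print(order_urls(pages, last_crawled={"/a": 90.0, "/d": 60.0}))  # ['/b', '/a']
```

In this sketch the change information (timestamps) acts purely as a filter and the popularity information (log counts) acts purely as a ranking signal; a real system could combine them differently.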



Author information

Authors and Affiliations

Department of Computer Science, University of Kansas, Lawrence, KS, USA
A. Chandramouli
Department of Computer Science, University of Arkansas, Fayetteville, AR, USA
S. Gauch & J. Eno


Editor information

Editors and Affiliations

Department of Expert Systems and Artificial Intelligence, University of Information Technology and Management, 35-225, Rzeszów, Poland
Zdzisław S. Hippe & Teresa Mroczek
M. Nalecz Institute of Biocybernetics and Biomedical Engineering, Polish Academy of Sciences, 4 Ks. Trojdena Str., 02-109, Warsaw, Poland
Juliusz L. Kulikowski


About this chapter

Cite this chapter

Chandramouli, A., Gauch, S., Eno, J. (2012). A Cooperative Approach to Web Crawler URL Ordering. In: Hippe, Z.S., Kulikowski, J.L., Mroczek, T. (eds) Human – Computer Systems Interaction: Backgrounds and Applications 2. Advances in Intelligent and Soft Computing, vol 98. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-23187-2_22


DOI: https://doi.org/10.1007/978-3-642-23187-2_22
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-23186-5
Online ISBN: 978-3-642-23187-2
eBook Packages: Engineering, Engineering (R0)
