Focused Crawls, Tunneling, and Digital Libraries

Bergmark, Donna; Lagoze, Carl; Sbityakov, Alex

doi:10.1007/3-540-45747-X_7

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2458)

Included in the following conference series:

International Conference on Theory and Practice of Digital Libraries

Abstract

Crawling the Web to build collections of documents related to pre-specified topics became an active area of research during the late 1990s, after crawler technology had been developed for use by search engines. Web crawling is now being seriously considered as a strategy for building large-scale digital libraries. This paper covers some of the crawl technologies that might be exploited for collection building. For example, focused crawling was developed to make such collection-building crawls more effective; its goal is a “best-first” crawl of the Web. We use powerful crawler software to implement a focused crawl, and use tunneling to overcome some of the limitations of a pure best-first approach. Others have described tunneling as prioritizing links not only by the relevance score of the page they come from, but also by an estimate of the value of each individual link. We add to this mix a tunneling focused-crawl strategy that evaluates the current crawl direction on the fly to determine when to terminate a tunneling activity. Results indicate that a combination of focused crawling and tunneling could be an effective tool for building digital libraries.
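
The idea of a best-first crawl with tunneling can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not give their scoring function or their rule for ending a tunnel. The sketch assumes a toy in-memory link graph, a hypothetical `score` function that returns a relevance value in [0, 1] for a URL, and a simple cutoff (`max_tunnel`) on the number of consecutive off-topic pages a path may pass through before it is abandoned. All names and parameter values are illustrative.

```python
import heapq
from typing import Callable, Dict, List, Set

# Toy data (assumed for illustration): "target" is on topic but can only
# be reached through two off-topic pages, "x" and "y".
GRAPH: Dict[str, List[str]] = {
    "seed": ["a", "x"],
    "a": [],
    "x": ["y"],
    "y": ["target"],
    "target": [],
}
SCORES: Dict[str, float] = {
    "seed": 1.0, "a": 0.9, "x": 0.1, "y": 0.1, "target": 0.8,
}


def focused_crawl(
    seeds: List[str],
    fetch_links: Callable[[str], List[str]],
    score: Callable[[str], float],
    threshold: float = 0.5,
    max_tunnel: int = 2,
    max_pages: int = 100,
) -> List[str]:
    """Return the on-topic URLs in the order they were visited."""
    # Heap entries are (-priority, insertion counter, url, off-topic run).
    # heapq is a min-heap, so priorities are negated to get best-first order.
    frontier: list = []
    counter = 0
    for url in seeds:
        heapq.heappush(frontier, (-1.0, counter, url, 0))
        counter += 1
    seen: Set[str] = set(seeds)
    collected: List[str] = []
    visited = 0
    while frontier and visited < max_pages:
        _, _, url, run = heapq.heappop(frontier)
        visited += 1
        relevance = score(url)
        if relevance >= threshold:
            collected.append(url)
            run = 0  # back on topic: the tunnel (if any) has paid off
        else:
            run += 1  # one more consecutive off-topic page on this path
            if run > max_tunnel:
                continue  # tunnel is too long: stop expanding this path
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                # Links found deeper inside a tunnel get a lower priority.
                priority = relevance / (1 + run)
                heapq.heappush(frontier, (-priority, counter, link, run))
                counter += 1
    return collected


def links(url: str) -> List[str]:
    return GRAPH.get(url, [])


with_tunneling = focused_crawl(["seed"], links, SCORES.__getitem__, max_tunnel=2)
without_tunneling = focused_crawl(["seed"], links, SCORES.__getitem__, max_tunnel=0)
print(with_tunneling)     # ['seed', 'a', 'target']
print(without_tunneling)  # ['seed', 'a']
```

With `max_tunnel=0` the crawl never expands an off-topic page, so `target` is missed; allowing a tunnel of length 2 reaches it. A fixed cutoff is only the simplest stand-in for evaluating the crawl direction on the fly.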

Author information

Authors and Affiliations

Cornell Digital Library Research Group, USA
Donna Bergmark, Carl Lagoze & Alex Sbityakov

Editor information

Editors and Affiliations

Department of Information Engineering, University of Padua, Via Gradenigo 6/a, 35131, Padova, Italy
Maristella Agosti
Istituto di Scienza e Tecnologie dell’ Informazione (ISTI-CNR), Area della Ricerca CNR di Pisa, Via G. Moruzzi 1, 56124, Pisa, Italy
Costantino Thanos

About this paper

Cite this paper

Bergmark, D., Lagoze, C., Sbityakov, A. (2002). Focused Crawls, Tunneling, and Digital Libraries. In: Agosti, M., Thanos, C. (eds) Research and Advanced Technology for Digital Libraries. ECDL 2002. Lecture Notes in Computer Science, vol 2458. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45747-X_7

DOI: https://doi.org/10.1007/3-540-45747-X_7
Published: 13 September 2002
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-44178-6
Online ISBN: 978-3-540-45747-3
eBook Packages: Springer Book Archive
