Adaptive Web Crawling Through Structure-Based Link Classification

Faheem, Muhammad; Senellart, Pierre

doi:10.1007/978-3-319-27974-9_5

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9469)

Included in the following conference series:

International Conference on Asian Digital Libraries

Abstract

Generic web crawling approaches cannot distinguish among various page types and cannot target content-rich areas of a website. We study the problem of efficient unsupervised web crawling of content-rich webpages. We propose ACEBot (Adaptive Crawler Bot for data Extraction), a structure-driven crawler that uses the inner structure of the pages and guides the crawling process based on the importance of their content. ACEBot works in two phases: in the learning phase, it constructs a dynamic site map (limiting the number of URLs retrieved) and learns a traversal strategy based on the importance of navigation patterns (selecting those leading to valuable content); in the intensive crawling phase, ACEBot performs massive downloading following the chosen navigation patterns. Experiments over a large dataset illustrate the effectiveness of our system.
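
The two phases described above can be illustrated with a minimal sketch. It is not the authors' implementation: the toy in-memory site, the function names `learn_patterns` and `intensive_crawl`, the representation of a navigation pattern as a short string such as `list/a`, and the use of average text length as the importance score are all assumptions made for illustration.

```python
from collections import defaultdict, deque

# Stand-in for the body of a content-rich page (assumption for this sketch).
ARTICLE = "lorem ipsum " * 50

# Toy site: url -> (page text, [(navigation pattern, target url), ...]).
# The pattern string is a hypothetical structural signature of the link.
SITE = {
    "/":       ("home",  [("nav/a", "/about"), ("list/a", "/post/1"), ("list/a", "/post/2")]),
    "/about":  ("about", [("nav/a", "/")]),
    "/post/1": (ARTICLE, [("nav/a", "/"), ("list/a", "/post/3")]),
    "/post/2": (ARTICLE, [("nav/a", "/about")]),
    "/post/3": (ARTICLE, [("list/a", "/post/4")]),
    "/post/4": (ARTICLE, []),
}


def learn_patterns(site, seed, budget, top_k=1):
    """Learning phase: fetch at most `budget` pages breadth-first and
    rank navigation patterns by the mean text length of the pages they led to."""
    seen = {seed}
    queue = deque([(seed, None)])      # (url, pattern through which it was found)
    scores = defaultdict(list)
    fetched = 0
    while queue and fetched < budget:
        url, via = queue.popleft()
        text, links = site[url]        # simulated download
        fetched += 1
        if via is not None:
            scores[via].append(len(text))
        for pattern, target in links:
            if target not in seen:
                seen.add(target)
                queue.append((target, pattern))
    ranked = sorted(scores, key=lambda p: sum(scores[p]) / len(scores[p]), reverse=True)
    return ranked[:top_k]


def intensive_crawl(site, seed, patterns):
    """Intensive phase: download every page reachable from `seed`
    by following only links whose pattern was selected."""
    seen = {seed}
    queue = deque([seed])
    downloaded = []
    while queue:
        url = queue.popleft()
        downloaded.append(url)
        for pattern, target in site[url][1]:
            if pattern in patterns and target not in seen:
                seen.add(target)
                queue.append(target)
    return downloaded


if __name__ == "__main__":
    chosen = learn_patterns(SITE, "/", budget=4)
    print(chosen)
    print(intensive_crawl(SITE, "/", set(chosen)))
```

In this toy setting the learning phase sees only four pages, yet the pattern it selects lets the intensive phase reach article pages that were never visited during learning while skipping the low-content `/about` page.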

Author information

Authors and Affiliations

LTCI, CNRS, Télécom ParisTech, Université Paris-Saclay, Paris, France
Muhammad Faheem & Pierre Senellart
University of Ottawa, Ottawa, Canada
Muhammad Faheem
IPAL, CNRS, National University of Singapore, Singapore, Singapore
Pierre Senellart

Corresponding author

Correspondence to Pierre Senellart.

Editor information

Editors and Affiliations

Yonsei University, Seoul, Korea (Republic of)
Robert B. Allen
School of ITEE, University of Queensland, St. Lucia, Queensland, Australia
Jane Hunter
School of Library & Info Sci, Kent State University, Kent, Ohio, USA
Marcia L. Zeng

About this paper

Cite this paper

Faheem, M., Senellart, P. (2015). Adaptive Web Crawling Through Structure-Based Link Classification. In: Allen, R., Hunter, J., Zeng, M. (eds) Digital Libraries: Providing Quality Information. ICADL 2015. Lecture Notes in Computer Science, vol 9469. Springer, Cham. https://doi.org/10.1007/978-3-319-27974-9_5

DOI: https://doi.org/10.1007/978-3-319-27974-9_5
Published: 18 December 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-27973-2
Online ISBN: 978-3-319-27974-9
eBook Packages: Computer Science, Computer Science (R0)
