Turn the Page: Automated Traversal of Paginated Websites

Furche, Tim; Grasso, Giovanni; Kravchenko, Andrey; Schallhart, Christian

doi:10.1007/978-3-642-31753-8_27

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 7387)

Included in the following conference series:

International Conference on Web Engineering

Abstract

Content-intensive web sites, such as Google or Amazon, paginate their results to accommodate limited screen sizes. Thus, human users and automatic tools alike have to traverse the pagination links when they crawl the site, extract data, or automate common tasks, where these applications require access to the entire result set. Previous approaches, as well as existing crawlers and automation tools, rely on simple heuristics (e.g., considering only the link text), falling back to an exhaustive exploration of the site where those heuristics fail. In particular, focused crawlers and data extraction systems target only fractions of the individual pages of a given site, rendering a highly accurate identification of pagination links essential to avoid the exhaustive exploration of irrelevant pages.

We identify pagination links in a wide range of domains and sites with near-perfect accuracy (99%). We obtain these results with a novel framework for web block classification, BERyL, that combines rule-based reasoning for feature extraction and machine learning for feature selection and classification. Through this combination, BERyL is applicable in a wide range of settings and can be adjusted to maximise either precision, recall, or speed. We illustrate how BERyL minimises the effort for feature extraction and evaluate the impact of a broad range of features (content, structural, and visual).
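For orientation, the link-text heuristic that the abstract attributes to previous approaches can be sketched as follows. This is a minimal illustration and not the method of the paper: the pattern `NEXT_TEXT` and the helper names `looks_like_next_link` and `numbered_page` are assumptions introduced here, and the list of accepted labels is arbitrary.

```python
import re
from typing import Optional

# Hypothetical pattern (an assumption for illustration): anchor texts that
# commonly label a "next page" link. Real sites use many more variants.
NEXT_TEXT = re.compile(r"^\s*(next(\s+page)?|more|>|>>|»|›)\s*$", re.IGNORECASE)


def looks_like_next_link(link_text: str) -> bool:
    """Return True if the anchor text alone suggests a 'next page' link."""
    return bool(NEXT_TEXT.match(link_text))


def numbered_page(link_text: str) -> Optional[int]:
    """Return the page number if the anchor text is a bare ASCII number."""
    text = link_text.strip()
    if re.fullmatch(r"[0-9]+", text):
        return int(text)
    return None
```

A heuristic of this kind sees only the anchor text, so it cannot recognise a pagination link rendered as an image with no text, and it cannot tell a numbered pagination link from any other link whose text happens to be a number. These are the cases in which, according to the abstract, such tools fall back to exhaustive exploration.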

The research leading to these results has received funding from the European Research Council under the European Community’s Seventh Framework Programme (FP7/2007–2013) / ERC grant agreement DIADEM, no. 246858.

Author information

Authors and Affiliations

Department of Computer Science, Oxford University, Wolfson Building, Parks Road, Oxford, UK, OX1 3QD
Tim Furche, Giovanni Grasso, Andrey Kravchenko & Christian Schallhart

Editor information

Editors and Affiliations

Dipartimento di Elettronica e Informazione, Politecnico di Milano, Via Ponzio 34/5, 20133, Milano, Italy
Marco Brambilla
Department of Computer Science, Tokyo Institute of Technology, 2-12-1 Ookayama, 152-8552, Tokyo, Japan
Takehiro Tokuda
Institut für Informatik, Freie Universität Berlin, Königin-Luise-Strasse 24-26, 14195, Berlin, Germany
Robert Tolksdorf

About this paper

Cite this paper

Furche, T., Grasso, G., Kravchenko, A., Schallhart, C. (2012). Turn the Page: Automated Traversal of Paginated Websites. In: Brambilla, M., Tokuda, T., Tolksdorf, R. (eds) Web Engineering. ICWE 2012. Lecture Notes in Computer Science, vol 7387. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-31753-8_27

DOI: https://doi.org/10.1007/978-3-642-31753-8_27
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-31752-1
Online ISBN: 978-3-642-31753-8
eBook Packages: Computer Science (R0)
