Abstract
The main goal of this study is to present a scale that classifies crawling systems according to their effectiveness in traversing the “client-side” Hidden Web. To that end, we carry out four tasks. First, we perform a thorough analysis of the different client-side technologies and the main features of Web 2.0 pages in order to determine the initial levels of the scale. Second, we present a Web site designed to check which crawlers are capable of dealing with those technologies and features. Third, we propose several methods to evaluate the performance of crawlers on that Web site and to classify them according to the levels of the scale. Fourth, we show the results of applying those methods to several open-source and commercial crawlers, as well as to the robots of the main Web search engines.
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Prieto, V.M., Álvarez, M., López-García, R., Cacheda, F. (2012). Analysing the Effectiveness of Crawlers on the Client-Side Hidden Web. In: Rodríguez, J., Pérez, J., Golinska, P., Giroux, S., Corchuelo, R. (eds) Trends in Practical Applications of Agents and Multiagent Systems. Advances in Intelligent and Soft Computing, vol 157. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-28795-4_17
DOI: https://doi.org/10.1007/978-3-642-28795-4_17
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-28794-7
Online ISBN: 978-3-642-28795-4
eBook Packages: Engineering (R0)