Abstract
Web archives preserve the history of born-digital content and offer great potential for sociologists, business analysts, and legal experts on intellectual property and compliance issues. Data quality is crucial for these purposes. Ideally, crawlers should gather coherent captures of entire Web sites, but the politeness etiquette and completeness requirement mandate very slow, long-duration crawling while Web sites undergo changes. This paper presents the SHARC framework for assessing the data quality in Web archives and for tuning capturing strategies toward better quality with given resources. We define data quality measures, characterize their properties, and develop a suite of quality-conscious scheduling strategies for archive crawling. Our framework includes single-visit and visit–revisit crawls. Single-visit crawls download every page of a site exactly once in an order that aims to minimize the “blur” in capturing the site. Visit–revisit strategies revisit pages after their initial downloads to check for intermediate changes. The revisiting order aims to maximize the “coherence” of the site capture (the number of pages that did not change during the capture). The quality notions of blur and coherence are formalized in the paper. Blur is a stochastic notion that reflects the expected number of page changes that a time-travel access to a site capture would accidentally see, instead of the ideal view of an instantaneously captured, “sharp” site. Coherence is a deterministic quality measure that counts the number of unchanged and thus coherently captured pages in a site snapshot. Strategies that aim to either minimize blur or maximize coherence are based on prior knowledge of or predictions for the change rates of individual pages. Our framework includes fairly accurate classifiers for change predictions.
All strategies are fully implemented in a testbed and shown to be effective by experiments with both synthetically generated sites and a periodic crawl series for different Web sites.
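The two quality notions can be illustrated with a small sketch. It is not the paper's formalization: it assumes that each page changes at a constant average rate (changes per time unit), that a reader accesses the capture at a time drawn uniformly from the capture interval, and that an unchanged page is detected by comparing content fingerprints from the visit and the revisit. The function names `expected_blur` and `coherence` and all parameters are hypothetical.

```python
from typing import Sequence


def expected_blur(rates: Sequence[float],
                  download_times: Sequence[float],
                  capture_end: float) -> float:
    """Expected number of page changes separating a reader's access time
    from the pages' download times (simplified model, see assumptions).

    Assumptions (not taken from the abstract): page i changes at a constant
    average rate rates[i], and the access time is uniform on [0, capture_end].
    """
    total = 0.0
    for rate, t in zip(rates, download_times):
        # Mean of |s - t| for s uniform on [0, T] is (t^2 + (T - t)^2) / (2T).
        mean_gap = (t ** 2 + (capture_end - t) ** 2) / (2 * capture_end)
        total += rate * mean_gap
    return total


def coherence(visit_hashes: Sequence[str],
              revisit_hashes: Sequence[str]) -> int:
    """Number of pages whose fingerprint is identical at visit and revisit."""
    return sum(1 for v, r in zip(visit_hashes, revisit_hashes) if v == r)


# Three download slots at times 0, 1, 2; one frequently changing page.
slots = [0.0, 1.0, 2.0]
hot_in_middle = expected_blur([0.1, 2.0, 0.1], slots, capture_end=2.0)
hot_first = expected_blur([2.0, 0.1, 0.1], slots, capture_end=2.0)
print(hot_in_middle < hot_first)

# Two of three pages are unchanged between visit and revisit.
print(coherence(["a1", "b1", "c1"], ["a1", "b2", "c1"]))
```

Under these assumptions, the download order alone changes the expected blur: placing the frequently changing page in the middle slot gives a lower value than downloading it first, which is the kind of effect a quality-conscious schedule can exploit.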
Cite this article
Denev, D., Mazeika, A., Spaniol, M. et al. The SHARC framework for data quality in Web archiving. The VLDB Journal 20, 183–207 (2011). https://doi.org/10.1007/s00778-011-0219-9
DOI: https://doi.org/10.1007/s00778-011-0219-9