Abstract
Many news pages with high freshness requirements are published on the internet every day. Instant crawlers must download them promptly, or they soon become outdated. In the past, instant crawlers downloaded pages only from a manually generated list of news websites. Because news websites do not publish news pages exclusively, bandwidth is wasted on downloading non-news pages. This paper proposes a novel approach to discovering news pages. The approach comprises seed selection and news URL prediction based on user behavior analysis. Empirical studies on a two-month user access log show that the approach outperforms the traditional approach in both precision and recall.
Copyright information
© 2008 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Wang, Y., Liu, Y., Zhang, M., Ma, S. (2008). News Page Discovery Policy for Instant Crawlers. In: Li, H., Liu, T., Ma, WY., Sakai, T., Wong, KF., Zhou, G. (eds) Information Retrieval Technology. AIRS 2008. Lecture Notes in Computer Science, vol 4993. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-68636-1_58
DOI: https://doi.org/10.1007/978-3-540-68636-1_58
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-68633-0
Online ISBN: 978-3-540-68636-1
eBook Packages: Computer Science, Computer Science (R0)