News Page Discovery Policy for Instant Crawlers

Wang, Yong; Liu, Yiqun; Zhang, Min; Ma, Shaoping

doi:10.1007/978-3-540-68636-1_58

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 4993)

Included in the following conference series:

Asia Information Retrieval Symposium

Abstract

Many news pages with high freshness requirements are published on the internet every day. Instant crawlers should download them immediately; otherwise, they soon become outdated. In the past, instant crawlers downloaded pages only from a manually generated list of news websites. This wastes bandwidth on non-news pages, because news websites do not publish news pages exclusively. This paper proposes a novel approach to discovering news pages, consisting of seed selection and news URL prediction based on user behavior analysis. Empirical studies on a two-month user access log show that the approach outperforms the traditional approach in both precision and recall.

Author information

Authors and Affiliations

State Key Lab of Intelligent Tech. & Sys., Tsinghua University,
Yong Wang, Yiqun Liu, Min Zhang & Shaoping Ma

Editor information

Hang Li, Ting Liu, Wei-Ying Ma, Tetsuya Sakai, Kam-Fai Wong, Guodong Zhou

About this paper

Cite this paper

Wang, Y., Liu, Y., Zhang, M., Ma, S. (2008). News Page Discovery Policy for Instant Crawlers. In: Li, H., Liu, T., Ma, WY., Sakai, T., Wong, KF., Zhou, G. (eds) Information Retrieval Technology. AIRS 2008. Lecture Notes in Computer Science, vol 4993. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-68636-1_58

DOI: https://doi.org/10.1007/978-3-540-68636-1_58
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-68633-0
Online ISBN: 978-3-540-68636-1
eBook Packages: Computer Science, Computer Science (R0)
