Abstract
The Web holds a vast amount of information, and even the leading commercial search engines cannot download and index all of it. Consequently, recent years have seen several research efforts on the design and implementation of focused crawlers, both topic-focused and geographically scoped.
Unlike other areas of information retrieval, research on Web crawling has not used temporal information extracted from Web pages as a crawling criterion. Our research challenge is therefore to use temporal data extracted from Web pages as the main crawling criterion for satisfying a given temporal focus. The time dimension becomes considerably more important when combined with topic or geography, but here we study it in isolation. Our approach is based on temporal segmentation of the text of Web pages: the crawler follows only the links that occur within segments tagged with dates that fall inside the temporal restriction. Preliminary experiments achieved a precision of around 75%.
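The link-selection rule described above can be illustrated with a minimal sketch. It is not the authors' implementation: the `Segment` type, the `links_in_focus` function, and the year-only regular expression are assumptions made for illustration, and the sketch takes the page as already split into text segments with their outgoing links.

```python
import re
from dataclasses import dataclass, field
from typing import List, Set

# Assumption: dates are approximated by four-digit years (1500-2099).
# Real temporal tagging would handle full dates and relative expressions.
YEAR_RE = re.compile(r"\b(1[5-9]\d{2}|20\d{2})\b")


@dataclass
class Segment:
    """Hypothetical text segment of a Web page with its outgoing links."""
    text: str
    links: List[str] = field(default_factory=list)


def years_in(text: str) -> Set[int]:
    """Return the set of years mentioned in the text."""
    return {int(y) for y in YEAR_RE.findall(text)}


def links_in_focus(segments: List[Segment], start_year: int, end_year: int) -> List[str]:
    """Keep only links from segments tagged with a year inside the focus."""
    selected: List[str] = []
    for seg in segments:
        if any(start_year <= y <= end_year for y in years_in(seg.text)):
            selected.extend(seg.links)
    return selected


if __name__ == "__main__":
    page = [
        Segment("The company was founded in 1998.", ["http://a.example/history"]),
        Segment("The 2012 annual report is available.", ["http://a.example/report"]),
        Segment("Contact us for details.", ["http://a.example/contact"]),
    ]
    print(links_in_focus(page, 1990, 2000))
```

In this sketch, links in segments without any date are never followed, which matches the strict reading of the rule stated in the abstract; a more permissive crawler could treat undated segments differently.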
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Pereira, P., Macedo, J., Craveiro, O., Madeira, H. (2014). Time-Aware Focused Web Crawling. In: de Rijke, M., et al. Advances in Information Retrieval. ECIR 2014. Lecture Notes in Computer Science, vol 8416. Springer, Cham. https://doi.org/10.1007/978-3-319-06028-6_53
DOI: https://doi.org/10.1007/978-3-319-06028-6_53
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-06027-9
Online ISBN: 978-3-319-06028-6
eBook Packages: Computer Science (R0)