Abstract
National web archives have been successfully made available through domain- and language-specific web crawlers for years. Here we propose another focused web crawler for collecting foreign-language web pages that are also related to a nation. Rather than finding the most relevant individual web pages, an ensemble machine learning model is trained with selected features to find relevant clusters of unvisited web pages, called website segments. During consecutive crawling cycles, the model is retrained with features extracted from newly found website segments. Preliminary experiments in the real web space on Thai tourism-related topics show that this approach can take advantage of recent crawling experience to produce more promising harvest rates than traditional breadth-first and best-first baselines.
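To make the terms in the abstract concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes that a website segment can be approximated by a URL's host plus its first path component (the abstract does not define segments at this level of detail), and that harvest rate is the fraction of crawled pages judged relevant. The names `segment_key`, `harvest_rate`, and `SegmentFrontier` are hypothetical, and the segment scores stand in for the output of the ensemble classifier, which is not shown.

```python
import heapq
from urllib.parse import urlparse


def segment_key(url: str) -> str:
    """Assumed segment definition: host plus first path component."""
    parts = urlparse(url)
    path = parts.path.strip("/")
    first_component = path.split("/")[0] if path else ""
    return f"{parts.netloc}/{first_component}"


def harvest_rate(relevance_flags) -> float:
    """Fraction of crawled pages judged relevant (0.0 if nothing was crawled)."""
    flags = list(relevance_flags)
    return sum(flags) / len(flags) if flags else 0.0


class SegmentFrontier:
    """Best-first frontier: URLs from higher-scored segments are popped first."""

    def __init__(self):
        self._heap = []
        self._counter = 0  # tie-breaker: equal scores pop in insertion order

    def push(self, url: str, segment_score: float) -> None:
        # heapq is a min-heap, so the score is negated to pop the highest first.
        heapq.heappush(self._heap, (-segment_score, self._counter, url))
        self._counter += 1

    def pop(self) -> str:
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)
```

In a crawl loop under these assumptions, each newly discovered URL would be pushed with the current score of its segment, and the scores would be refreshed whenever the classifier is retrained at the end of a crawling cycle.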
Acknowledgments
The first author thanks JSTP-NSTDA Thailand for funding support.
Copyright information
© 2014 Springer Science+Business Media Singapore
About this paper
Cite this paper
Suebchua, T., Manaskasemsak, B., Rungsawang, A. (2014). Thai Related Foreign Language Specific Web Crawling Approach. In: Herawan, T., Deris, M., Abawajy, J. (eds) Proceedings of the First International Conference on Advanced Data and Information Engineering (DaEng-2013). Lecture Notes in Electrical Engineering, vol 285. Springer, Singapore. https://doi.org/10.1007/978-981-4585-18-7_72
DOI: https://doi.org/10.1007/978-981-4585-18-7_72
Publisher Name: Springer, Singapore
Print ISBN: 978-981-4585-17-0
Online ISBN: 978-981-4585-18-7
eBook Packages: Engineering, Engineering (R0)