Abstract
Context: Reinforcement learning (RL) can help address several challenges of deep web crawling. Deep web content is reached by filling in search forms rather than by following hyperlinks, so understanding the search form and selecting suitable queries are necessary steps for retrieving it. This makes deep web crawling a challenging task. RL-based techniques help with filling in the search form and retrieving the deep web content: the RL agent selects an action based on the current state, and the environment assigns a reward or penalty to the selected action. Objective: This study surveys RL-based techniques applied to deep web crawling. Method: The survey is based on 31 articles selected from 77 articles published in reputed journals, conferences, and workshops. Results: Challenges related to the individual steps of deep web crawling are presented. RL-based techniques that address these challenges are reported in multiple research papers. The RL techniques used in deep web crawling are compared on the basis of their strengths, metrics, datasets, and research gaps. Conclusion: Various RL-based techniques that could be applied to deep web crawling have not yet been explored. Open challenges and research directions are also recommended.
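The state, action, and reward loop described in the abstract can be illustrated with a minimal sketch. It is not taken from any of the surveyed papers: it assumes tabular Q-learning with epsilon-greedy exploration, a hypothetical in-memory TOY_DATABASE standing in for a form-backed deep web source, a state defined as the set of queries already submitted, and a reward equal to the number of newly retrieved records (with a penalty of -1 for a query that returns nothing new). The names train and greedy_plan and all hyperparameter values are illustrative assumptions.

```python
import random
from collections import defaultdict

# Hypothetical stand-in for a deep web source: query term -> record ids returned.
TOY_DATABASE = {
    "python": {1, 2, 3, 4},
    "crawler": {3, 4, 5},
    "survey": {6},
    "form": {1, 2},
}


def train(episodes=2000, steps=3, alpha=0.5, gamma=0.5, epsilon=0.3, seed=0):
    """Tabular Q-learning: the state is the set of queries already submitted."""
    rng = random.Random(seed)
    queries = sorted(TOY_DATABASE)
    q_values = defaultdict(float)  # (state, action) -> estimated return

    for _ in range(episodes):
        state = frozenset()
        retrieved = set()  # records harvested so far in this episode
        for step in range(steps):
            available = [q for q in queries if q not in state]
            # Agent: choose an action for the current state (epsilon-greedy).
            if rng.random() < epsilon:
                action = rng.choice(available)
            else:
                action = max(available, key=lambda a: q_values[(state, a)])
            # Environment: reward new records, penalise a wasted query.
            new_records = TOY_DATABASE[action] - retrieved
            reward = len(new_records) if new_records else -1
            retrieved |= new_records
            next_state = state | {action}
            # Q-learning update; no bootstrapping after the final step.
            if step == steps - 1:
                best_next = 0.0
            else:
                best_next = max(
                    q_values[(next_state, a)]
                    for a in queries
                    if a not in next_state
                )
            target = reward + gamma * best_next
            q_values[(state, action)] += alpha * (target - q_values[(state, action)])
            state = next_state
    return q_values, queries


def greedy_plan(q_values, queries, steps=3):
    """Read off the query sequence the learned values prefer."""
    state, plan = frozenset(), []
    for _ in range(steps):
        available = [q for q in queries if q not in state]
        action = max(available, key=lambda a: q_values[(state, a)])
        plan.append(action)
        state = state | {action}
    return plan


if __name__ == "__main__":
    learned_values, candidate_queries = train()
    print(greedy_plan(learned_values, candidate_queries))
```

In this toy setting the learned plan favours queries that surface the most unseen records early and avoids queries whose results are already covered. A real crawler would replace TOY_DATABASE with actual form submissions and would likely need a richer state representation than the set of issued queries.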
© 2022 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Madan, K., Bhatia, R. (2022). Reinforcement Learning in Deep Web Crawling: Survey. In: Gupta, D., Khanna, A., Kansal, V., Fortino, G., Hassanien, A.E. (eds) Proceedings of Second Doctoral Symposium on Computational Intelligence. Advances in Intelligent Systems and Computing, vol 1374. Springer, Singapore. https://doi.org/10.1007/978-981-16-3346-1_24
DOI: https://doi.org/10.1007/978-981-16-3346-1_24
Publisher Name: Springer, Singapore
Print ISBN: 978-981-16-3345-4
Online ISBN: 978-981-16-3346-1
eBook Packages: Intelligent Technologies and Robotics (R0)