Reinforcement Learning in Deep Web Crawling: Survey

Madan, Kapil; Bhatia, Rajesh

doi:10.1007/978-981-16-3346-1_24

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 1374)

Abstract

Context: Reinforcement learning (RL) can help address several challenges in deep web crawling. Deep web content is reached by filling in search forms rather than by following hyperlinks, so understanding the search form and selecting suitable queries are necessary steps for retrieving it. This makes deep web crawling a challenging task. RL-based techniques help fill in the search form and retrieve the deep web content: the agent selects an action based on the current state, and the environment assigns a reward or penalty to the selected action. Objective: This study surveys RL-based techniques applied to deep web crawling. Method: The survey is based on 31 articles selected from 77 articles published in reputed journals, conferences, and workshops. Results: The challenges associated with the individual steps of deep web crawling are presented. Several research papers use RL-based techniques to address these challenges. The RL techniques used in deep web crawling are compared in terms of strengths, metrics, datasets, and research gaps. Conclusion: Various RL-based techniques that have not yet been explored could be applied to deep web crawling. Open challenges and research directions are also recommended.
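
The action/reward loop described in the abstract can be illustrated with a minimal sketch. It is not a method from the chapter: the toy database, the keywords, the penalty of -1, and the function names are all hypothetical, and the problem is reduced to a single state (an empty result set) so that each query's reward is fixed. The agent picks a query (the action), the simulated source returns records, and the number of new records serves as the reward.

```python
import random

# Toy "deep web" source: each query keyword returns a set of record IDs.
# The keywords and record IDs are made up for illustration.
DATABASE = {
    "python": {1, 2, 3, 4},
    "crawler": {3, 4, 5},
    "survey": {6},
    "zzz": set(),  # a query that matches nothing
}


def reward(query, seen):
    """Environment feedback: new records as reward, fixed penalty if none."""
    new_records = DATABASE[query] - seen
    return len(new_records) if new_records else -1.0


def learn_query_values(episodes=50, epsilon=0.2, seed=0):
    """Estimate the value of each query with epsilon-greedy action selection."""
    rng = random.Random(seed)
    queries = sorted(DATABASE)
    value = {q: 0.0 for q in queries}
    count = {q: 0 for q in queries}
    for episode in range(episodes):
        if episode < len(queries):
            query = queries[episode]             # try every query once first
        elif rng.random() < epsilon:
            query = rng.choice(queries)          # explore
        else:
            query = max(queries, key=value.get)  # exploit the best estimate
        # Single-state simplification: each episode starts with nothing retrieved.
        r = reward(query, seen=set())
        count[query] += 1
        value[query] += (r - value[query]) / count[query]  # incremental mean
    return value


values = learn_query_values()
best_query = max(values, key=values.get)
```

Under these assumptions the agent ends up preferring the query that returns the most records. A realistic crawler would also need the state to track which records have already been retrieved, since a query's reward shrinks as its results are exhausted.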

Author information

Authors and Affiliations

Punjab Engineering College (Deemed to be University), Chandigarh, India
Kapil Madan & Rajesh Bhatia

Editor information

Editors and Affiliations

Department of Computer Science Engineering, Maharaja Agrasen Institute of Technology, Rohini, Delhi, India
Deepak Gupta
Maharaja Agrasen Institute of Technology, Rohini, Delhi, India
Ashish Khanna
Institute of Engineering and Technology, Lucknow, Uttar Pradesh, India
Vineet Kansal
University of Calabria, Rende, Cosenza, Italy
Giancarlo Fortino
Department of Information Technology, Cairo University, Giza, Egypt
Aboul Ella Hassanien

Cite this paper

Madan, K., Bhatia, R. (2022). Reinforcement Learning in Deep Web Crawling: Survey. In: Gupta, D., Khanna, A., Kansal, V., Fortino, G., Hassanien, A.E. (eds) Proceedings of Second Doctoral Symposium on Computational Intelligence. Advances in Intelligent Systems and Computing, vol 1374. Springer, Singapore. https://doi.org/10.1007/978-981-16-3346-1_24

DOI: https://doi.org/10.1007/978-981-16-3346-1_24
Published: 20 September 2021
Publisher Name: Springer, Singapore
Print ISBN: 978-981-16-3345-4
Online ISBN: 978-981-16-3346-1
eBook Packages: Intelligent Technologies and Robotics (R0)
