Reinforcement Learning in Deep Web Crawling: Survey

  • Conference paper
Proceedings of Second Doctoral Symposium on Computational Intelligence

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 1374)

Abstract

Context: Reinforcement learning (RL) can help solve several challenges of deep web crawling. Deep web content is accessed by filling in search forms rather than by following hyperlinks. Understanding the search form and selecting suitable queries are both necessary to retrieve deep web content successfully, which makes crawling the deep web a challenging task. RL-based techniques help in filling the search form and retrieving the deep web content. In RL, the agent selects an action based on the given state, and the environment assigns a reward or penalty to the selected action. Objective: This study surveys RL-based techniques applied in the domain of deep web crawling. Method: The literature survey is based on 31 articles selected from 77 articles published in reputed journals, conferences, and workshops. Results: Challenges related to the various steps of deep web crawling are presented. Multiple research papers use RL-based techniques to address these challenges. A comparative analysis of the RL techniques used in deep web crawling is carried out based on strengths, metrics, datasets, and research gaps. Conclusion: Various RL-based techniques that have not yet been explored can be applied to deep web crawling. Open challenges and research directions are also recommended.
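
The state, action, and reward loop described in the abstract can be illustrated with a minimal sketch. This is not a method taken from the survey: it assumes a toy in-memory source (`TOY_DB`, hypothetical) in which each query keyword returns a fixed set of record ids, treats the set of queries issued so far as the state, and uses tabular Q-learning with epsilon-greedy exploration. The reward is the number of new records a query retrieves, with a penalty of -1 for a query that returns nothing new. All names and parameter values are illustrative assumptions.

```python
import random
from collections import defaultdict

# Hypothetical toy "deep web" source: each query keyword returns record ids.
TOY_DB = {
    "python": {1, 2, 3, 4},
    "java": {3, 4, 5, 6, 7},
    "rust": {8},
    "go": {1},
}
ACTIONS = list(TOY_DB)
BUDGET = 2  # number of form submissions allowed per crawl episode


def step(issued, harvested, query):
    """Environment: submit a query, return (next_state, harvested, reward)."""
    new_records = TOY_DB[query] - harvested
    # Reward is the count of new records; a wasted query is penalised.
    reward = len(new_records) if new_records else -1
    return issued | {query}, harvested | new_records, reward


def train(episodes=2000, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Tabular Q-learning over (set of issued queries, next query)."""
    rng = random.Random(seed)
    q = defaultdict(float)  # (state, action) -> estimated value
    for _ in range(episodes):
        state, harvested = frozenset(), set()
        for t in range(BUDGET):
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)  # explore
            else:
                action = max(ACTIONS, key=lambda a: q[(state, a)])  # exploit
            next_state, harvested, reward = step(state, harvested, action)
            done = t == BUDGET - 1
            best_next = 0.0 if done else max(q[(next_state, a)] for a in ACTIONS)
            q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
            state = next_state
    return q


def greedy_crawl(q):
    """Follow the learned values without exploration."""
    state, harvested, plan = frozenset(), set(), []
    for _ in range(BUDGET):
        action = max(ACTIONS, key=lambda a: q[(state, a)])
        state, harvested, _reward = step(state, harvested, action)
        plan.append(action)
    return plan, harvested


if __name__ == "__main__":
    plan, harvested = greedy_crawl(train())
    print(plan, sorted(harvested))
```

In this toy setting the learned policy spends its two-query budget on the pair of keywords with the largest combined coverage. A real crawler would have to derive states and candidate queries from the search form and the pages retrieved so far, which is where the challenges discussed in the survey arise.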

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

Cite this paper

Madan, K., Bhatia, R. (2022). Reinforcement Learning in Deep Web Crawling: Survey. In: Gupta, D., Khanna, A., Kansal, V., Fortino, G., Hassanien, A.E. (eds) Proceedings of Second Doctoral Symposium on Computational Intelligence. Advances in Intelligent Systems and Computing, vol 1374. Springer, Singapore. https://doi.org/10.1007/978-981-16-3346-1_24
