URL Classification on Extracted Feature Using Deep Learning

  • Conference paper
Computer Vision and Machine Intelligence

Abstract

The widespread adoption of the World Wide Web (WWW) has driven a large-scale shift toward e-commerce, online banking, and social media. This popularity has given attackers new opportunities to scam unsuspecting users, and malicious URLs are among the most common forms of attack. These URLs host unsolicited content and are used to perpetrate cybercrimes. Distinguishing malicious URLs from benign ones is therefore crucial to a secure browsing experience. Blacklists have traditionally been used to classify URLs; however, they are not exhaustive and perform poorly against previously unseen URLs. This motivates the use of machine learning and deep learning, which generalize better to unknown URLs. In this paper, we employ a novel feature extraction algorithm that uses the ‘urllib.parse’, ‘tld’, and ‘re’ libraries to extract static and dynamic lexical features from the URL text. IPv4 and IPv6 address groups and the use of shortening services are detected and used as features. Static features, such as whether the http or https protocol is used, show a high correlation with the target variable. Various machine learning and deep learning algorithms were implemented and evaluated for the binary classification of URLs. Experimentation and evaluation were based on 450,176 unique URLs, where MLP and Conv1D gave the best overall results, with accuracies of 99.73% and 99.72% and F1 scores of 0.9981 and 0.9983, respectively.
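
To make the kind of lexical feature extraction described in the abstract concrete, the following is a minimal sketch and not the authors' algorithm. It uses only ‘urllib.parse’ and ‘re’ from the abstract, plus the standard-library ‘ipaddress’ module for IPv4/IPv6 detection; the third-party ‘tld’ library is left out to keep the example self-contained. The function names, the feature names, and the short `SHORTENERS` list are all assumptions made for illustration.

```python
import ipaddress
import re
from urllib.parse import urlparse

# Hypothetical, deliberately short list of shortening services
# (an assumption for illustration, not the list used in the paper).
SHORTENERS = {"bit.ly", "tinyurl.com", "t.co", "goo.gl"}


def host_is_ip(host: str) -> bool:
    """Return True if the host is a literal IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def extract_features(url: str) -> dict:
    """Compute a few illustrative lexical features from the URL text."""
    parsed = urlparse(url)
    # .hostname is lowercased and strips the port and IPv6 brackets.
    host = parsed.hostname or ""
    return {
        "url_length": len(url),
        "uses_https": int(parsed.scheme == "https"),
        "host_is_ip": int(host_is_ip(host)),
        "uses_shortener": int(host in SHORTENERS),
        "digit_count": len(re.findall(r"\d", url)),
        "dot_count": url.count("."),
        # Number of non-empty path segments.
        "path_depth": len([seg for seg in parsed.path.split("/") if seg]),
    }
```

A call such as `extract_features("https://bit.ly/abc123")` returns a flat dictionary of integers, which could be stacked into a feature matrix for any of the classifiers the abstract mentions. Which features the paper actually uses, and how they are encoded, is not stated in the abstract.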

© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

Cite this paper

Sahoo, V.K., Singh, V., Gourisaria, M.K., Acharya, A.K. (2023). URL Classification on Extracted Feature Using Deep Learning. In: Tistarelli, M., Dubey, S.R., Singh, S.K., Jiang, X. (eds) Computer Vision and Machine Intelligence. Lecture Notes in Networks and Systems, vol 586. Springer, Singapore. https://doi.org/10.1007/978-981-19-7867-8_33
