Abstract
The widespread adoption of the World Wide Web (WWW) has brought about a large-scale shift toward e-commerce, online banking, and social media. This popularity has given attackers new opportunities to scam unsuspecting users, and malicious URLs are among the most common forms of attack. These URLs host unsolicited content and are used to perpetrate cybercrimes. Distinguishing malicious URLs from benign ones is therefore crucial for a secure browsing experience. Blacklists have traditionally been used to classify URLs; however, they are not exhaustive and do not perform well against unknown URLs. This motivates the use of machine learning and deep learning, which generalize better to unseen URLs. In this paper, we employ a novel feature extraction algorithm using the ‘urllib.parse’, ‘tld’, and ‘re’ libraries to extract static and dynamic lexical features from the URL text. IPv4 and IPv6 address groups and the use of shortening services are detected and used as features. Static features such as the protocol used (https or http) show a high correlation with the target variable. Various machine learning and deep learning algorithms were implemented and evaluated for the binary classification of URLs. Experimentation and evaluation were based on 450,176 unique URLs, where MLP and Conv1D gave the best overall results, with accuracies of 99.73% and 99.72% and F1 scores of 0.9981 and 0.9983, respectively.
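To make the kind of lexical feature extraction described in the abstract concrete, the following is a minimal sketch, not the authors' algorithm. It uses only the standard-library modules ‘urllib.parse’ and ‘re’ named above, plus ‘ipaddress’ as a stand-in for IPv4/IPv6 detection; the third-party ‘tld’ library is left out so the sketch runs without extra installs. The feature names, the character set counted as "special", and the short SHORTENERS list are all assumptions made for illustration.

```python
import ipaddress
import re
from urllib.parse import urlparse

# Hypothetical, deliberately short list of shortening services (an assumption,
# not the list used in the paper).
SHORTENERS = {"bit.ly", "goo.gl", "tinyurl.com", "t.co", "ow.ly"}


def host_is_ip(host: str) -> bool:
    """Return True if the host is a literal IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def extract_features(url: str) -> dict:
    """Compute a few illustrative lexical features from the URL text alone."""
    parsed = urlparse(url)
    # .hostname lowercases the host and strips the port and IPv6 brackets.
    host = parsed.hostname or ""
    return {
        "uses_https": int(parsed.scheme == "https"),
        "url_length": len(url),
        "host_length": len(host),
        "path_length": len(parsed.path),
        "digit_count": len(re.findall(r"\d", url)),
        "special_char_count": len(re.findall(r"[@\-_?=&%]", url)),
        "host_is_ip": int(host_is_ip(host)),
        "uses_shortener": int(host in SHORTENERS),
    }
```

Calling extract_features on each URL yields one dictionary per sample; a list of such dictionaries can then be turned into a numeric feature matrix for whichever classifier is being evaluated.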
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Sahoo, V.K., Singh, V., Gourisaria, M.K., Acharya, A.K. (2023). URL Classification on Extracted Feature Using Deep Learning. In: Tistarelli, M., Dubey, S.R., Singh, S.K., Jiang, X. (eds) Computer Vision and Machine Intelligence. Lecture Notes in Networks and Systems, vol 586. Springer, Singapore. https://doi.org/10.1007/978-981-19-7867-8_33
DOI: https://doi.org/10.1007/978-981-19-7867-8_33
Publisher Name: Springer, Singapore
Print ISBN: 978-981-19-7866-1
Online ISBN: 978-981-19-7867-8
eBook Packages: Intelligent Technologies and Robotics (R0)