Keywords

1 Introduction

Phishing is a technique adversaries used to gain personal and financial information such as login credentials and payment card details by impersonating the user or by tricking them into trusting fake websites or emails. It is a social engineering act in which an attacker uses a specially crafted message to random people in the hopes of obtaining sensitive information or utilizing the vulnerability of the user system for deploying and executing malicious software on the victim’s infrastructure, such as viruses, Trojans, and ransomware [1]. Fake emails have increased where the scammer pretends to be a reliable and legitimate government or bank official, and facts are added to support their claim [2]. The scammer may notify the user that their account has been used illegally or questionably. The email contains details to convince the user that a significant purchase was made in another region and ask if the user has approved the payment. The fraudsters are prompt to confirm the credit card or bank account information so the ‘bank’ can examine the whole scenario. This way, they steal sensitive information from users.

Every day, cybercriminals conduct thousands of phishing campaigns like this, and most of them are successful. Attackers often modify their techniques, but specific characteristics might help the user identify a malicious email, text message, or website. User education to follow best practices while browsing the Internet and following safety guidelines can assist users to avoid getting trapped in attempts to steal sensitive information [3]. Advanced attack methods and assault tactics, which cannot be diagnosed by primary education and training, necessitate automated detection and prevention approaches. Software-based defense mechanisms help in detecting malicious emails, fake text messages, and phishing websites.

Phishing Tactics: A typical phishing assault might use various tactics, such as exploiting browser vulnerabilities or executing man-in-the-middle attacks. However, the most basic and often used technique is to create a webpage that looks identical to the one that the user is familiar with or craft emails or text messages in a way that appears to be genuine and helps them gain the trust of the user [4]. They hide URLs of fake external websites which replicate the appearance and user interface of the original website to trick the user into entering their details and credentials. The most popular phishing attacks are shown in Fig. 1.

Fig. 1
figure 1

Common phishing attacks used by adversaries for stealing sensitive information from the users

Phishing remains a severe security issue, and many Internet users are still victims of this deception. Furthermore, such attacks cause severe problems for Internet users and organizations that offer financial services over the Internet.

Filter-based phishing detection [5] techniques are incorporated these days into most email service providers to transfer suspicious emails to a ‘Junk’ or ‘Spam’ folder. When email filtering is enabled, incoming emails are scanned independently for features that indicate malicious content and transfer those emails to a different folder. Browser extension [3] and webpage content analysis [6]-based phishing detection and prevention solutions provide security from phishing websites if they match the suspicious criteria. Despite numerous anti-phishing solutions available these days for prevention from phishing attacks, the adversaries always stay a step ahead which makes all anti-phishing solutions incapable of preventing zero-day attacks. Hence, the need to deeply understand the correlation among different features present in a phishing webpage arises. The appropriate models should be selected while drafting a new adaptable anti-phishing solution making it more suitable for zero-day attack prevention.

This paper will analyze the effectiveness of phishing detection techniques based on URL classifications. With the advancements in machine learning algorithms and their accurate predictions in less time, the phishing domain research in the last decade has shifted to the machine learning, ensemble, and deep learning-based solutions. Section 2 presents the research methodology followed during this study and analysis. Section 3 discusses the details for feature analysis from URLs and the correlation among them. Section 4 covers popular classification models for phishing detection, and their performance is comparatively analyzed when used on the same phishing URL dataset. It analyzes the performance metrics and evaluates the results obtained for classification models on the URL dataset to find the best classifier. Section 5 summarizes the paper.

2 Research Methodology

Various research studies have been conducted by authors earlier to analyze the phishing detection techniques, and the approaches followed are as follows. As discussed in [7], the authors compared software-based phishing detection techniques like blacklists, heuristic detection techniques, visual similarity detection techniques, and data mining detection techniques available in the literature. A detailed survey of literature available for phishing detection approaches is also presented in [8]. Most of the surveys or analyses are focused on techniques mentioned in the literature, but there is no comparative analysis available that should guide toward selecting the best optimal model while designing an anti-phishing solution. This paper’s study will aid academics and industry in determining the optimum algorithm that can be used for anti-phishing solutions based on requirements and resources.

This section briefs the steps followed in this study for URL feature analysis and performance comparison of latest machine learning, ensemble, and deep learning-based classification models on phishing dataset provided by PhishTank [9], dmoztools.net [10]. Figure 2 presents the sequence of steps followed, and detailed explanation is given in following sections.

Fig. 2
figure 2

Research methodology followed in the paper for URL feature analysis and performance comparison of latest classification models on phishing dataset

3 Analysis of Features Extracted from Web URLs

The URL and webpage content for fake websites are usually replicated to appear similar to the original website [11]. This research focuses on phishing website detection in real time by examining the features of the URL of the webpage. The malicious websites can be efficiently detected by thoroughly analyzing their URL. The attackers cannot utilize the exact URL of an original site, and they frequently misspell URL elements such as ‘PrimaryDomain,’ ‘SubDomain,’ and ‘PathDomain’ [12]. Identifying these phishing URL alteration tactics will undoubtedly assist in educating individuals and organizations about phishing attacks and ensuring prevention from them.

Features are extracted from the collected dataset of URLs and classified based on categories defined in [3] and are shown in Fig. 3. The analysis is done to understand the importance of each feature in classification for phishing or legitimate URL and the correlation among different features. Lexical characteristics of the URL [13] are analyzed in this study, and the efficacy for phishing prediction is studied. The observations from examining features of web URLs in the collected dataset are:

Fig. 3
figure 3

Classification of website features based on address bar, abnormal, HTML and JavaScript and domain features

  • Each data sample contains 30 features and a class label ‘result’ that indicates whether or not it is a phishing website (1 or \(-1\)).

  • Size of URL: Long URLs are frequently associated with concealing the suspicious section of a fake website URL in the address bar to mislead the user.

  • Number of dots: In comparison with legitimate websites, phishing pages frequently have more than 5–6 dots in their URLs.

  • IP in the domain: Using an IP address instead of a domain name in a URL indicates an effort to steal personal information.

  • Special characters in URL: The phishing link will perplex the user by adding a special character in the URL. “@” hides the phishing URL by commenting out the domain name that comes before it. The presence of the ‘hyphen’ and ‘@’ symbol in the URL dominate in malicious URLs, whereas legitimate URLs avoid using them [12].

  • Double slash (‘//’): The presence of a double slash in a URL route indicates that the visitor will be redirected to a different website.

  • Multiple sub-domains and a domain name mismatch: Phishers employ this type of technique to persuade victims that the message or email they received originated from a well-known organization. They use the genuine organization’s domain name and append multiple sub-domains as a prefix to deceive users into thinking the crafted fake URL is genuine.

  • Age of a URL: Phishing websites have been found to only exist for a short time, whereas trustworthy websites are registered and paid for several years in advance.

The distribution of domain length, URL length, and the number of dots present in legitimate and phishing web URLs is shown in Fig. 4a–c, respectively.

Fig. 4
figure 4

Visual representation of distribution of a length of domain, URL length (b), and c number of dots present in legitimate and phishing URLs in the collected dataset

Correlation between URL features: The statistical measure of a linear relationship between two variables is known as correlation, and the visual representation is called correlation heatmap [14]. It is the measure of interdependence between two variables. The matrix data format is utilized when there are numerous variables. Figure 5 shows a custom diverging colormap in the form of a matrix generated for the collected dataset and is drawn with the mask and correct aspect ratio. The correlation coefficient might have any value between \(-1\) and 1 [15].

  • Value = 1: The correlation between two variables is considered to be positive and indicates that while one variable rises, the other increases as well.

  • Value = \(-1\): Negative correlation between two variables is defined as a value of \(-1\) and indicates that as one variable goes up, the other goes down.

  • Value = 0: There is no connection between two variables if the value is 0 and indicates that the variables vary at random in relation to one another.

Identifying the characteristics of phishing URL alteration methods and their correlations can aid users in recognizing phishing attempts just by looking at them.

Fig. 5
figure 5

Correlation heatmap representing the interdependence between URL features of the collected dataset

4 Classification Models for Phishing Detection Based on Web URLs

Phishing is a binary classification problem that classifies the given sample into two classes: legitimate or phishing. This research is focused on analyzing the performance of the latest classification algorithms provided by machine learning, ensemble techniques, and deep learning models that are best suited for binary classification.

Machine Learning (ML) Techniques: Data analysis is made easier and more efficient using machine learning. The capacity to construct adaptable models for specific tasks like phishing detection is a fundamental feature of machine learning. ML models might swiftly adapt to changes to identify patterns, which would aid in developing a learning-based identification system [3]. The following algorithms have been chosen because of their accurate prediction results for binary problems: decision tree, random forest, K-nearest neighbor, logistic regression, support vector machine.

Ensemble Classification Techniques: An ensemble is made up of several hypotheses that are created from training data using a primary learning method. Most ensemble methods generate homogeneous ensembles using a single base learning algorithm; however, other approaches employ several learning algorithms to produce heterogeneous ensembles. Several high-performance and advanced frameworks like AdaBoost, XGBoost, and a family of gradient boosting techniques [16] that focus on both speed and accuracy have recently been analyzed on the collected dataset for their performance.

Deep Learning (DL) Techniques: Artificial neural networks are used in deep learning models to conduct complex computations on large datasets [17]. It is a form of ML based on the human brain’s structure and function. Algorithms extract features, organize objects, and find valuable data patterns during the training phase by using unknown elements in the input distribution. Machines are trained using examples. DL algorithms require high-end infrastructure to train in an acceptable amount of time. When there is a dearth of domain expertise for feature introspection, DL approaches shine since feature engineering is less of a concern [18].

4.1 Performance Comparison of Classification Models

The malicious and legitimate URLs are collected from PhishTank and dmoztools.net, and a dataset is formed. The collected dataset consists of a total of 65428 web URLs which have 29182 phishing URLs and 36246 legitimate URLs. The dataset is preprocessed, and features are extracted and chosen for analysis. The collected features from URLs are concatenated after random shuffling during the feature extraction step to prevent the overfitting [19] problem during model training. This also helps balance the distribution while splitting the data into training (75%) and testing (25%) sets. The implementation is done in Python with the help of machine learning libraries, and the pseudo-code is presented in Fig. 6.

Fig. 6
figure 6

Pseudo-code used in experiments on the collected dataset for comparative analysis

Table 1 gives a comparison summary of performance metrics obtained as a result of applying classification algorithms for ML, ensemble, and DL techniques on the training set and validation set. Figure 7 shows visual representation of accuracies obtained from machine learning, ensemble, and deep learning techniques applied on collected dataset.

Table 1 Performance comparison of machine learning, ensemble, and deep learning techniques applied on collected dataset
Fig. 7
figure 7

Visual representation of accuracies obtained from machine learning, ensemble, and deep learning techniques applied on collected dataset

4.2 Discussion and Results Analysis

This research evaluated the time taken for training different classification models (in seconds), validation accuracy obtained, recall, precision, and F1 score. Table 1 lists the results of the experiments mentioned above, and the following are the observations.

ML models get quickly trained, allowing them to make predictions and self-improve algorithms.

  • Compared to other ML algorithms, the decision tree classifier (C4.5) predicts the phishing website accurately and fastest in training. It uses the ‘Gini measure of impurity,’ which helps the tree create ‘pure nodes’ with only one class label that does not need to be further divided, hence the fast execution [3].

  • SVM has been tested with different kernels, but RBF kernel gave the best results compared to linear, poly, and sigmoid kernels.

  • For testing KNN, there is no ideal number for setting k suitable for all types of datasets. Various experiments were conducted by altering the value of k to find the best-suited value for the collected dataset. A perfect balance needs to be found as noise has a greater influence on the outcome when the number of neighbors is small; moreover, a large number of neighbors makes obtaining the result computationally expensive [20].

Ensemble Models show better results as compared to standard ML models. Random forest (RF) and gradient boosting with XGBoost are the best in terms of accuracy (97.3%) and even take almost the same time for getting trained.

  • RF adds randomness to the training and validation dataset and uses more trees, reducing variance, ultimately making the predictions fast and noise prune.

  • The significant benefit of XGBoost over other algorithms is its rapid speed, as well as the ‘regularization parameter,’ which successfully lowers ‘variance.’ Usage of learning rate and subsamples from features like RF allows it to generalize even further. Hyperparameter tuning increases performance, and hybrid with gradient boosting algorithms reduces the training time and results in higher accurate predictions.

DL models take maximum time to get train due to the large number of hyperparameters. They incrementally learn high-level characteristics from data through the hidden layer architecture. The performance of DL models has been observed to increase with the amount of data [18].

  • CNN combined with LSTM gave the best prediction results as compared to separately testing CNN or LSTM. The grid pattern analysis of CNN, when combined with feedback connections of LSTM, takes less time to train, and the performance increases.

  • Although DL models take more time to train, once trained, their accuracy keeps on increasing with time as they can learn from past observations and utilize them for future predictions.

  • DL models work best with a large amount of data, and the dataset used in this study was not that huge. With small datasets, the results are less accurate than ensemble models.

This study can be utilized before finalizing any model before drafting any anti-phishing solution based on URL characteristics. The comparative analysis highlights the best classifier in each section which can be chosen based on requirements and resources for the research. These results were retrieved using classifiers on a live dataset; thus, they are not theoretical, which enhances the study’s reliability and dependability.

5 Conclusion

Recently with an increased emergence of phishing attacks through emails, fake websites, text messages, and phone calls, the need for awareness among Internet users has risen to identify a fake email or webpage. This research covered the basic phishing attack scenarios that deceive users into trusting scammers or adversaries. The URL feature analysis presents the correlation between web URL features which help any user scan the links provided in email before clicking them. A thorough understanding of these interdependencies among features prevents users from falling prey to fake websites that look similar to the original webpages but steal sensitive information. This study also analyzed the performance of the latest ML, ensemble, and DL algorithms. Their speed and accuracy are compared for the same dataset. Random forest and gradient boosting with XGBoost are found to be the best optimal solution for classifying URL-based phishing detection in terms of accuracy (97.3%). The limitation of this research is that comparative analysis was limited only to feature analysis of the web URLs. The classification techniques were tested on collected web URL datasets which can be extended to cover website content in the future.