1 Introduction

Phishing is a type of cybercrime in which an attacker establishes a fake website that looks like a real one in order to collect vital or private information from users. One existing detection approach compares a suspect page against a picture captured from the reputable website it imitates. Image comparison, however, takes more time and requires more storage space; it also produces a high percentage of false negatives and fails to detect minor changes in visual appearance. A classifier-based phishing detection method works well with huge datasets, eliminates the disadvantages of the current technique, and allows zero-day attacks to be detected. As a result, the suggested method focuses on detecting phishing websites using tree-based classifiers [1].

Hackers continually improve their phony websites to gain personal information. Nevertheless, there are signs and aspects that can help to judge the difference between a real and a fake website.

We can avoid phishing websites by typing the URL directly rather than following links or pop-up windows that imitate real websites. If a warning message indicates a non-secure site that may harm the computer, the URL should be abandoned; if the address lacks HTTPS, close attention should be paid to the URL, as the connection is insecure. Content and design that fall below the standard of the genuine site also indicate a phishing website. Finally, community members already provide credibility scores, so a site can be judged on the basis of online reviews.

Table 1 presents the total number of unique phishing reports (campaigns) received, according to the Anti-Phishing Working Group (APWG). On July 15, 2020, Twitter suffered a major breach that combined elements of security compromise and phishing. Previous studies have targeted the identification of malicious URLs within massive sets of URLs [2]. The main objective of this study is to examine the phishing dataset from every angle using various machine learning feature selection and feature elimination methods. Sections 1, 2, 3, 4, and 5 present the introduction, research methodology, results, discussion, and conclusion, respectively.

Table 1 Total number of unique phishing reports (campaigns) received, according to the Anti-Phishing Working Group

Jain and Gupta considered Naïve Bayes and support vector machine classifiers for malicious websites. They found that neither learner stores previous results in memory, and that the efficiency of the URL detector may therefore be reduced [3].

Purbay and Kumar [4] examined multiple classifiers on URL website data. The authors measured the performance of the classifiers but did not address the retrieval capacity of the algorithms.

Gandotra and Gupta [5] used multiple predictors for analyzing malicious URLs. After their examination they found the system's performance was better compared to other classifiers, but a drawback remained: the organized classifier did not support large-volume datasets.

Le et al. [6] organized a deep learning URL detector that applies lexical features to examine phishing websites. They found that the deep learning system required more time to produce an output.

Hong et al. [7] organized a system for identifying lexical features of URLs in phishing websites. They evaluated a crawler-based dataset and found no assurance that the URL detector would work in real time.

Kumar et al. [8] examined a URL detector on a blacklist dataset. Their system used lexical features to classify malicious and legitimate websites. In their examination, the authors found that the detector's performance degraded over time.

Abutair and Belghith [9] discussed classifying websites and predicting phishing websites. They used genetic algorithm (GA) techniques to measure the run-time performance on huge and complex datasets.

Rao and Pais [10] experimented with the logo, favicon, script, and style attributes of a page. They noted that updates to page attributes reduce the performance of the detection system.

Aljofy et al. [11] discussed identifying phishing pages using a CNN algorithm. They found the organized system retrieves images more easily than text. Finally, the authors found that the CNN results were better compared to other classifiers.

AlEroud and Karabatis [12] organized a neural network system for observing adversarial attacks. The system identifies the impact of adversarial examples more readily than other algorithms.

Althobaiti et al. [13] discussed the complete set of URL features in six categories: lexical, host, rank, redirection, certificate, and search engines together with black/white lists. These six categories make up the 89 features of the UCI machine learning phishing website dataset.

Gupta et al. [14] applied feature selection by choosing lexical features only and obtained the highest accuracy, 99.57%, with random forest. Because the authors chose only a small number of features, such high accuracy is hard to justify.

Sahoo et al. [15] presented a review paper discussing the full set of phishing website features in five categories: blacklist, lexical, host, content-based, and other features.

This study takes an ensemble classification approach to detecting phishing websites. Training, feature optimization, and testing are the three primary steps in this process. The classifiers (DT, RF, and Gradient Boosting) were first trained on the training portion of the websites dataset, with no optimization strategy applied at this stage. In the second stage, a hybrid feature selection approach is utilized to optimize these classifiers and improve their overall accuracy: features were ranked with the chi-square, extra tree, and recursive feature elimination techniques, and the classifiers were retrained on the selected features. The results obtained by the proposed model show a clear improvement in accuracy over the research studied in the literature review.

2 Methods

In this study, we applied three different feature selection techniques, Extra Tree, Chi-Square, and Recursive Feature Elimination, to the phishing website dataset obtained from the UCI machine learning repository. The phishing website dataset consists of 89 variables; by applying these three feature selection techniques we obtained the 29 most important features (attributes) and thus new optimum subsets of the phishing website dataset. We then applied three machine learning techniques, Decision Tree, Random Forest, and Gradient Boosting, to train on the optimum subsets of the phishing website dataset.
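The experimental setup above can be sketched as follows. This is a minimal illustration, not the authors' code: synthetic data stands in for the UCI phishing table, so the column counts and scores are illustrative only.

```python
# A minimal sketch of the 70/30 experiment with the three learners.
# Synthetic data stands in for the UCI phishing dataset.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=30, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=42)

models = {
    "DT": DecisionTreeClassifier(random_state=42),
    "RF": RandomForestClassifier(random_state=42),
    "GB": GradientBoostingClassifier(random_state=42),
}
# Test-set accuracy for each classifier.
scores = {name: m.fit(X_tr, y_tr).score(X_te, y_te)
          for name, m in models.items()}
```

In the full pipeline, a feature selection step would reduce the data to 29 columns before the split, as described in Sect. 3.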

The predictions obtained with the three feature selection methods are compared to choose the best feature selection technique and the best prediction accuracy. The whole methodology proposed in this research paper is described in Fig. 1.

Fig. 1 Represents the proposed method for the phishing dataset

The following classifiers and feature selection techniques are used to evaluate the performance of the proposed model.

2.1 Gradient boosting

Gradient boosting (GB) is a non-parametric supervised machine learning technique [16]. Boosting is a method for converting weak learners into a strong learner: in gradient boosting, each new tree is fit on a modified version of the original dataset. Regularization strategies that penalize various parts of the algorithm can further enhance its performance by decreasing over-fitting. The gradient boosting algorithm is most easily explained by first introducing the AdaBoost algorithm, which begins by training a decision tree in which each observation is assigned an equal weight.
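A toy illustration of the boosting idea on synthetic data: each added tree fits what the previous trees got wrong, so training accuracy after all stages is at least as high as after the first stage.

```python
# Boosting sketch: compare training accuracy after 1 stage vs. 50 stages.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
gb = GradientBoostingClassifier(n_estimators=50, random_state=0).fit(X, y)

# Training accuracy after 1, 2, ..., 50 boosting stages.
stage_acc = [accuracy_score(y, pred) for pred in gb.staged_predict(X)]
```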

2.2 Random forest

A random forest classifier is a supervised learning technique that can be used for classification and regression analysis. The algorithm is simple and flexible to use. A forest is a collection of trees, and the more trees it contains, the more robust it is. Random forests randomly sample the data to create decision trees, obtain a prediction from each tree, and choose the best solution by voting. They also provide an attractive display of feature importance [17].
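The voting idea can be made concrete by querying each tree of a fitted forest individually. Note that scikit-learn itself averages class probabilities rather than counting hard votes, though the two usually agree; the data below is synthetic.

```python
# Sketch of majority voting: ask every tree for a class and tally votes.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=1)
rf = RandomForestClassifier(n_estimators=25, random_state=1).fit(X, y)

sample = X[:1]
# One predicted class label per tree in the forest.
votes = np.array([tree.predict(sample)[0] for tree in rf.estimators_])
majority = int(np.bincount(votes.astype(int)).argmax())
```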

2.3 Decision tree

A decision tree is a supervised-learning-based predictive modeling tool [18]. It works on the principle of multivariate analysis and can help in predicting, explaining, describing, and classifying an outcome. It splits the dataset on multiple conditions, which helps describe cases with more than one cause and conditions shaped by multiple influences. Quinlan created the Iterative Dichotomiser 3 (ID3) algorithm for generating decision trees. A decision tree is generated from the root following a top-down approach that involves partitioning the data; entropy is used to measure the homogeneity of a data sample. If the sample is completely homogeneous, the entropy value is 0; if the sample is equally divided between two classes, the entropy value is 1. Entropy can be calculated using Eq. (1).

$$E\left(S\right)=\sum_{i=1}^{c}-{p}_{i}{log}_{2}{p}_{i}$$
(1)
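Eq. (1) can be checked on a small label sample: a pure sample has entropy 0 and a 50/50 binary split has entropy 1 bit.

```python
# Entropy of a label sample per Eq. (1).
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

print(entropy(["phish"] * 4))            # homogeneous sample -> 0.0
print(entropy(["phish", "legit"] * 2))   # 50/50 split -> 1.0
```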

2.4 Dataset analysis

We used the phishing website dataset collected from the UCI machine learning repository, which consists of 89 features as shown in Table 2. There are 11,430 instances in total, of which 5715 are legitimate and 5715 are phishing. The categorical labels "Legitimate" and "Phishing" in the gathered dataset were converted to the numerical values "1" and "− 1", respectively.

Table 2 Phishing website data attributes
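The label recoding described above can be done in one line with pandas; the column name "status" is an assumption about the CSV layout, not taken from the source.

```python
# Recode the class column to the numeric labels used in the study.
import pandas as pd

df = pd.DataFrame({"status": ["legitimate", "phishing", "legitimate"]})
df["status"] = df["status"].map({"legitimate": 1, "phishing": -1})
print(df["status"].tolist())  # [1, -1, 1]
```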

3 Results

Feature selection techniques are very important for improving the performance of a developed model. We applied three techniques, extra tree, chi-square, and recursive feature elimination, to find the 29 most relevant features, which play an important role in improving the results of the developed model.

3.1 Extra trees

Extra Trees is an ensemble machine learning approach that aggregates the predictions of many decision trees (see Fig. 2). The Extra Trees ensemble is similar to a random forest. It is a model-based technique for picking features that uses tree-based supervised models to judge feature relevance. Instead of using a bootstrap replica, it fits each decision tree to the whole dataset and splits the nodes at random: Random Forest selects the best split, whereas Extra Trees chooses a split at random [19]. The extra tree reports the greatest and lowest feature significance levels; once the split points are chosen, the two algorithms determine which subset of features is best.

$$\left[\begin{array}{ccccccc} 0.00370036 & 0.0048668 & 0.0056781 & 0.0022648 & 0.00534266 & 0.00406504 & 0.00465011 \\ 0.00287578 & 0.01078235 & 0.00704151 & 0.01275825 & 0.00806429 & 0.00399562 & 0.01562412 \\ 0.00212852 & 0.01431833 & 0.00257516 & 0.24535652 & 0.05775128 & 0.5861604 \end{array}\right]$$
Fig. 2 Represents the Extra Tree feature selection method for the phishing dataset

3.2 Chi-square

When selecting features, as in Table 3, we want to pick features that are heavily dependent on the response. The chi-square test is based on frequencies rather than parameters such as the mean and standard deviation (it is a non-parametric test). The test is useful only for hypothesis testing, not for estimation, and, as previously stated, it has the additive property [20].

Table 3 Represents the Chi-Square feature selection method for the phishing dataset
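Chi-square scoring requires non-negative features, which suits count-style attributes such as URL length or number of dots. In this hypothetical sketch, feature 0 depends on the class and feature 1 is noise, so `SelectKBest` with `k=1` should keep feature 0.

```python
# Chi-square feature selection on synthetic count-like data.
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)
X = np.column_stack([y * 5 + rng.integers(0, 3, size=200),  # class-linked
                     rng.integers(0, 8, size=200)])         # pure noise

selector = SelectKBest(chi2, k=1).fit(X, y)
kept = selector.get_support()  # boolean mask over the two features
```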

3.3 Recursive feature elimination

Recursive Feature Elimination is popular because it is easy to configure and use, and because it is effective at selecting the features (columns) in a training dataset that are most relevant to predicting the target variable. There are two important configuration options when using RFE: the number of features to select and the algorithm used to help choose them. Both of these hyperparameters can be explored, although the performance of the method does not depend strongly on them being configured well.
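The two RFE knobs named above, the ranking estimator and the number of features to keep (29 in this study), appear directly in the scikit-learn API. Synthetic data stands in for the phishing table here.

```python
# RFE configured with a decision tree ranker and 29 surviving features.
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=40,
                           n_informative=10, random_state=3)
rfe = RFE(estimator=DecisionTreeClassifier(random_state=3),
          n_features_to_select=29).fit(X, y)
X_reduced = rfe.transform(X)   # keeps only the 29 surviving columns
```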

The resultant features are shown in Fig. 3.

Fig. 3 Represents the RFE feature extraction method for the phishing dataset

After applying the three base classifiers (decision tree, gradient boosting, and random forest) to the data subsets obtained from the feature selection techniques, the results are as shown in Tables 4 and 5.

Table 4 Represents computational table of training model (70%) phishing websites dataset
Table 5 Represents computational table of test model (30%) phishing websites dataset

Table 4 presents the computational results of the training model (70% of the phishing websites dataset) for the DT, GB, and RF algorithms. In these experiments, Random Forest obtained the highest sensitivity and accuracy, 0.9761 and 0.9655 respectively.

Table 5 presents the corresponding test results (30% of the phishing websites dataset) for the DT, GB, and RF algorithms. Here, Random Forest again obtained the highest sensitivity and accuracy, 0.9905 and 0.9862 respectively.

Recursive Feature Elimination is a feature selection algorithm. Like an Excel spreadsheet, a machine learning dataset for classification or regression is made up of rows and columns, and feature selection refers to methods for selecting a subset of the most important characteristics (columns). Using the model's feature importance property, shown in Fig. 4, we can extract the importance of each feature in the dataset. The feature significance score assigns a value to each feature; the higher the score, the more essential or relevant the feature is to the output variable [21].

Fig. 4 Represents the correlation analysis for the phishing dataset

Table 6 presents the analysis (training set = 70%) of the phishing dataset using the classifiers. The results indicate that the random forest classifier achieved the highest correlation coefficient, 0.9317, when compared to Decision Tree and Gradient Boosting [22].

Table 6 Represents analysis (Training Set = 70%) for phishing dataset using classifiers

Table 7 presents the analysis of the test set (30%) of the phishing dataset using the classifiers. The results indicate that the random forest classifier achieved the highest correlation coefficient, 0.9816, and the lowest error when compared to Decision Tree and Gradient Boosting. Random forest thus performs better than the other selected classifiers on phishing websites, and Table 7 also shows the effectiveness of the feature selection methods on the phishing dataset.

Table 7 Represents analysis (Test Set = 30%) for phishing dataset using classifiers

4 Discussion

Correlation coefficients are used to determine the strength of the link between two variables [23]. In the experiment (training set = 70%), Decision Tree, Random Forest, and Gradient Boosting obtained correlation coefficients of 0.8593, 0.9317, and 0.7311; on the test set (30%), they obtained 0.9281, 0.9816, and 0.8014, as shown in Fig. 4.

MAE is calculated [24] as:

$$MAE = \frac{{\mathop \sum \nolimits_{i = 1}^{n} \left| {y_{i} - x_{i} } \right|}}{n} = \frac{{\mathop \sum \nolimits_{i = 1}^{n} \left| {e_{i} } \right|}}{n}$$
(2)

In the experiment (training set = 70%), Decision Tree, Random Forest, and Gradient Boosting obtained mean absolute errors of 0.0703, 0.0822, and 0.2327; on the test set (30%) they obtained 0.064, 0.0751, and 0.1625, as shown in Fig. 5.

Fig. 5 Represents the MAE analysis for the phishing dataset

The relative absolute error [25] is calculated as:

$$E_{i} = \frac{{\mathop \sum \nolimits_{j = 1}^{n} \left| {P_{{\left( {ij} \right)}} - T_{j} } \right|}}{{\mathop \sum \nolimits_{j = 1}^{n} \left| {T_{j} - \overline{T}} \right|}}$$
(3)

where P(ij) is the predicted value and Tj is the target value, with

$$\overline{T }=\frac{1}{n}\sum_{j=1}^{n}{T}_{j}$$
(4)

In the experiment (training set = 70%), Decision Tree, Random Forest, and Gradient Boosting obtained relative absolute errors of 14.0673%, 16.4381%, and 46.5433%; on the test set (30%) they obtained 12.1931%, 14.4138%, and 42.499%, respectively, as shown in Fig. 6.

Fig. 6 Represents the RAE analysis for the phishing dataset

In the experiment (training set = 70%), Decision Tree, Random Forest, and Gradient Boosting obtained root relative squared errors of 53.0401%, 36.4876%, and 68.2275%; on the test set (30%) they obtained 46.2762%, 29.5203%, and 61.1923%, respectively, as shown in Fig. 7.

Fig. 7 Represents the RRSE analysis for the phishing dataset

RMSE [26] is formulated as:

$$RMSE=\sqrt{\frac{{\sum }_{i=1}^{n}{({y}_{i}-{\widehat{y}}_{i})}^{2}}{n}}$$
(5)
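Eqs. (2)–(5) can be verified on a tiny hand-checkable example using the study's 1 / − 1 label coding; the numbers below are illustrative only, not taken from the experiments.

```python
# MAE, RAE, and RMSE from Eqs. (2)-(5) on a four-sample example.
import math

y_true = [1, -1, 1, -1]
y_pred = [1, -1, -1, -1]
n = len(y_true)

mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n          # Eq. (2)
t_bar = sum(y_true) / n                                            # Eq. (4)
rae = (sum(abs(p - t) for t, p in zip(y_true, y_pred))
       / sum(abs(t - t_bar) for t in y_true))                      # Eq. (3)
rmse = math.sqrt(
    sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)         # Eq. (5)

print(mae, rae, rmse)  # 0.5 0.5 1.0
```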

With these results (training set = 70%), Decision Tree, Random Forest, and Gradient Boosting obtained root mean squared errors of 0.2652, 0.1825, and 0.3412; on the test set (30%) they obtained 0.1964, 0.1126, and 0.271, respectively.

Because the sample dataset has labels (phishing and legitimate), this study uses supervised machine learning. Furthermore, supervised machine learning produces good outcomes by reducing mistakes. In this research paper we used three classifiers, RF, DT, and GB, and evaluated the correlation coefficient, root mean squared error, and mean absolute error. Tables 6 and 7 show that the Random Forest algorithm performs best compared to the decision tree and gradient boosting classifiers in both the training and testing phases for the phishing dataset.

5 Conclusion

In this research paper, we used the Chi-Square and Extra Tree feature selection techniques for organizing the complex dataset and extracted important features with Recursive Feature Elimination as a pipeline model. We then trained three different machine learning methods, Random Forest, Decision Tree, and Gradient Boosting, on 70% of the phishing dataset and tested on the remaining 30%. Across the experiments, on the training set Random Forest obtained a correlation coefficient of 0.9317, mean absolute error of 0.0822, root mean squared error of 0.1825, relative absolute error of 16.4381%, and root relative squared error of 36.4876%. On the test set (30%), Random Forest obtained a correlation coefficient of 0.9816, mean absolute error of 0.0751, root mean squared error of 0.1126, relative absolute error of 14.4138%, and root relative squared error of 29.5203%. Finally, we conclude that the Random Forest classifier performs better than the other classifiers. In the future, we plan to extend this work using real online datasets and various ensemble models to predict results beneficial to users.