1 Introduction

The revolution in communication technologies and the Internet has dramatically changed our daily lives. Also, the advancements in Artificial Intelligence and computing have led to an increasing number of distributed intelligent systems. The scale of networks becomes larger and the network environment becomes more complex day by day [1]. As a result, the amount and categories of data flow in networks are constantly expanding. Users now deal with a huge amount of data that is transferred in cyberspace because of the spread of the Internet of Things (IoT) and cloud services with distributed heterogeneous devices [2]. However, protecting these devices from attacks is extremely important in order to protect users’ data and their physical devices, and obtain the most benefit from these cloud services. In this context, cyber security is crucial in order to make cloud services efficient and successful. Firewalls and traditional techniques such as encryption and user authentication are unable to protect devices in cyberspace due to the rapid development of new intrusion techniques [3, 4]. Intrusion Detection Systems (IDS) are security systems that are able to detect and prevent attacks in a particular network environment, such as Denial of Service (DoS) attacks, phishing, malware etc. Moreover, these systems should be able to intelligently identify and classify any abnormal behaviors within a network.

Currently, the demand for intelligent intrusion detection approaches using Machine Learning (ML) techniques is significantly increasing. ML can play a vital role in building IDS that are able to classify and predict attacks in cyberspace. Traditional ML-based methods such as k-nearest neighbor (k-NN) algorithms, Support Vector Machines (SVM), Logistic Regression (LR), Naive Bayes (NB) models and Decision Trees have a significant role in detecting anomalies and attacks in cyber security [5,6,7]. Among these techniques, Decision Trees are one of the most popular supervised predictive models for building classification-based intrusion detection systems [8]. In a tree-based classification model, a classifier is constructed to predict the categorical class. For instance, a tree-based classification model can predict whether a particular network activity is “normal” or an “attack”. Decisions are made at each node of the tree until a leaf node is reached, and the class of the data point (i.e. normal or attack) is determined in the leaf node. In other words, each internal node represents a feature, each edge or branch represents a decision made depending on the information gained for each feature, and each leaf represents a class [8]. Nevertheless, the massive volume of network traffic data and the large number of dimensions or features (i.e. security features) can affect the accuracy of prediction. In addition, they can increase the computational complexity of the tree-based predictive model (e.g. overfitting and processing time). The need for reliable and efficient intrusion detection systems has become a significant requirement to make cloud services successful and beneficial. The design of an IDS that achieves maximum accuracy with minimum false predictions is a challenging task [9]. 
In addition, since most AI techniques require learning from big data sets and reasoning using a multitude of classification patterns, it is necessary to create new simplified and collaborative solutions [10]. In this paper, an intrusion detection approach based on the concept of decision trees is proposed. Our approach involves considering the ranking of security features before building the predictive model. The model aims to increase the prediction accuracy and reduce the complexity of computation compared with other traditional ML techniques. The main contributions of this research are summarized as follows:

  1. An intrusion detection model is developed based on the concept of Decision Trees to efficiently predict and detect attacks in cyberspace.

  2. An approach to security features selection and ranking is developed in order to select the security features with the most importance, which should be processed by the proposed tree-based predictive model.

  3. The proposed model is applied to a real dataset with 175,341 records for network intrusion detection systems to evaluate it based on predefined performance evaluation metrics, namely accuracy, precision, recall and Fscore, compared with other traditional ML techniques.

The remainder of this paper is structured as follows. Section 2 investigates related work on intrusion detection models. Section 3 presents and discusses our tree-based intrusion detection model taking into consideration the ranking of security features. Section 4 presents our experiments and evaluation of the proposed model. The last section concludes the paper and highlights our future work.

2 Related Work

Cyber security has attracted the interest of many researchers due to the increasing demand for reliable and efficient intrusion detection methods. Intrusion Detection Systems (IDS) are developed to detect abnormal activities and attacks in cyberspace. Machine Learning and its applications can play a significant role in building intelligent and efficient IDS. Much research work has focused on implementing a variety of ML techniques in building IDS while seeking efficiency and effectiveness [11,12,13,14]. The work presented in [15,16,17,18,19,20,21,22,23] introduced different intrusion detection methods using Deep Learning, Decision Trees and other techniques. Aloqaily et al. [24] introduced an intrusion detection system against security attacks for connected vehicles in smart cities. The system is based on deep learning and decision trees mechanisms. Currently, the Tree-based technique is one of the common Machine Learning techniques and predictive models that is used by researchers for building IDS to predict and detect attacks in communication networks [12]. In the literature, there are a considerable number of studies that propose tree-based intrusion detection models taking into consideration the ranking and selection of security features. This process can enhance the prediction accuracy and minimize the complexity of computation [25,26,27]. Ingre et al. [28] proposed a decision tree based intrusion system using the Feature Correlation Selection (FCR) method in order to increase the prediction accuracy of the model. Moon et al. [29] presented an IDS based on the concept of decision trees using behavior analysis to prevent Advanced Persistent Threats (APT) attacks especially in social media networks. Sarker et al. [30] proposed a behavioral decision tree model, which predicts users’ diverse behaviors considering multi-dimensional contexts. 
A number of research studies have also proposed enhanced prediction algorithms to detect attacks efficiently in a particular network. For example, Puthran and Shah [31] highlighted the poor performance of the ID3 algorithm for Probe, R2L and U2R attacks. They proposed an Improved Decision Tree algorithm using Binary Split (IDTBS) and an Improved Decision Tree algorithm using Quad Split (IDTQS) to improve the detection rate of Probe, U2R and R2L attacks. Rai et al. [32] developed a decision tree algorithm based on the C4.5 decision tree approach taking into consideration feature selection and split value. A machine-learning-based security model called IntruDtree was proposed, taking into consideration the ranking of the security features [33]. This model was developed to increase the prediction accuracy and reduce the complexity of computation (e.g. overfitting and processing time). Decision Trees can play a significant role in building intrusion detection systems. However, it is vital that such systems are able to deal with the huge volume of network traffic data, with many dimensions and security features, so that the detection process is reliable and efficient with high accuracy and reduced complexity of computation. Nevertheless, high variance due to over-fitting, high complexity and low prediction accuracy are common limitations of tree-based models when building intrusion detection systems, especially when the predictive model processes a large number of security features.

The evaluation of such methods depends on many factors such as the volume of a given dataset, data consistency, the number of security features and the parameters used in the experiments. As a result, it is difficult to conclude that a particular ML technique is better than other techniques unless these factors are considered. Unlike the models mentioned above, the tree-based intrusion detection model proposed in this paper considers the ranking of security features before building the prediction decision tree in order to overcome the shortcomings of tree-based models discussed above. Besides, the model is applied to a real dataset with 175,341 records and follows the main steps required in ML, especially at the early stages of building such models. In the following section, our tree-based intrusion detection model is proposed and discussed in detail.

3 Tree-Based Intrusion Detection Model

In this section, the Tree-based Intrusion Detection Model is introduced and discussed in detail.

3.1 Model

The proposed intrusion detection model is composed of three main modules. The first module consists of three processes, namely data exploration, data preprocessing and standardization, and features ranking and selection. These processes are crucial in order to build our tree-based intrusion detection approach based on feature ranking and selection. The last two modules are concerned with model training and testing in order to build a classification model that is able to detect attacks in cyberspace. Figure 1 illustrates our proposed model, and each step in the model is discussed in detail in the following sections.

Fig. 1 Tree-based intrusion detection model

3.2 Data Exploration

In Data Mining (DM) and ML techniques, the quality of the data is considered one of the most crucial assets that can radically affect the prediction accuracy of any proposed prediction model. Therefore, the data exploration process in our model examines the data in order to understand its features, identify any integrity issues and apply the data cleansing process. In addition, data types (i.e. feature types) are reviewed in order to determine whether a particular feature is numerical or categorical. This process is important to correctly apply any statistical or prediction measurements and accordingly draw conclusions regarding certain assumptions about the data. In this research, a dataset with 175,341 records for network intrusion detection systems taken from the comprehensive dataset “UNSW-NB 15” is used; this dataset is available on the Kaggle website [34]. The dataset was created in the Cyber Range Lab of the Australian Centre for Cyber Security and consists of 42 features excluding the class label (i.e. 0 for normal records, 1 for attack records). The class feature in the dataset is used to determine whether a particular activity is Normal or an Attack. Moreover, the type of attack, which is one of the dataset features, is excluded from our work, as it is outside the scope of this research. After completing the Data Exploration process, the 42 features are selected for further processing as shown in Table 1.

Table 1 Security features of the selected dataset

Table 1 shows that all of the features are quantitative except the proto, service and state features, which are nominal. As a result, these features (i.e. independent variables) must be subjected to Feature Encoding (a form of Feature Engineering) in order to fit our ML-based intrusion detection model. Feature Encoding transforms nominal values into numerical values. Another aspect that should be taken into consideration is Data Standardization. It involves rescaling the distribution of feature values so that the mean of the values is 0 and the standard deviation is 1. This process is important when the feature values are in different ranges. In the following section, the feature encoding and standardization are discussed in detail.

3.3 Data Preprocessing and Standardization

This process is considered one of the most vital steps in machine learning. In this process, the Security Feature Encoding and Security Feature Standardization take place as discussed in the following points.

3.3.1 Security Feature Encoding

In the previous section, the nominal security features that must be encoded were identified, namely proto, service, and state, as shown previously in Table 1. Two methods can be used in this context, namely One Hot Encoding and Label Encoding. In this study, Label Encoding is used to encode all of the nominal security features, as the One Hot Encoding method can significantly increase the feature dimensions by creating additional features based on the number of unique values in each nominal feature [35]. The Label Encoding method makes all of the feature values numeric. For example, if the security feature state has the values [ACC, CLO, CON, CLO, INT, INT], then these values can be converted to the vector V = [0,1,2,1,3,3]. This process is implemented in Python using the LabelEncoder class of the sklearn library for all of the nominal security features mentioned above.
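The encoding example above can be reproduced with a minimal sketch in plain Python. It is an illustration rather than the implementation used in this study: the helper name label_encode is hypothetical, and integer codes are assigned in sorted order of the distinct values, which is also the ordering that sklearn's LabelEncoder applies.

```python
def label_encode(values):
    """Map each distinct nominal value to an integer code.

    Codes follow the sorted order of the distinct values (an assumption
    chosen to match how sklearn's LabelEncoder orders its classes).
    """
    classes = sorted(set(values))
    code_of = {value: code for code, value in enumerate(classes)}
    return [code_of[value] for value in values], classes


# Values of the "state" feature taken from the example in the text.
state = ["ACC", "CLO", "CON", "CLO", "INT", "INT"]
codes, classes = label_encode(state)
print(codes)    # [0, 1, 2, 1, 3, 3]
print(classes)  # ['ACC', 'CLO', 'CON', 'INT']
```

Note that the resulting integers impose an arbitrary order on the nominal values; this is generally less of a concern for tree-based models, which split on thresholds, than for distance-based models.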

3.3.2 Security Feature Standardization

The next step is concerned with features that have different value distributions or different scales. This process is considered vital in data preprocessing and it must be completed before the data is processed by our tree-based intrusion model. In the dataset, all features of the data that have a significant difference in data scales are rescaled so that the values for each feature have a zero-mean and unit-variance. The calculation method is shown in formula (1).

$$X_{scaled}=\frac{X_{original}-\overline{X}}{\sigma}$$
(1)

where Xscaled denotes the new-scaled value of the feature, Xoriginal denotes the original value of the feature, \({\overline{X}}\) denotes the mean of the feature values and σ is the standard deviation.
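As a hedged illustration of formula (1), the following sketch standardizes a single feature column in plain Python. The function name standardize and the sample values are hypothetical, and the population standard deviation is assumed for σ (the text does not state which estimator is used).

```python
from statistics import fmean, pstdev


def standardize(values):
    """Rescale values to zero mean and unit variance (z-score)."""
    mean = fmean(values)
    std = pstdev(values)  # population standard deviation (assumption)
    if std == 0:
        # A constant feature carries no spread; map every value to 0.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


# Made-up values standing in for one feature column, e.g. dur.
dur = [0.5, 1.5, 2.5, 3.5]
scaled = standardize(dur)
print(scaled)
```

After this transformation the column has a mean of 0 and a standard deviation of 1, so features measured on very different scales become comparable.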

The sklearn library in Python is used to rescale the values of all of the features that have different value distributions. For instance, the security features dur, sload, sinpkt and rate have different value distributions and must be scaled in order to fit in our tree-based intrusion detection model. The density plot is used in order to understand the spread of values for each feature. Figure 2 shows the different density plots for each of the features mentioned above.

Fig. 2 Different density plots for the dur, sload, sinpkt and rate features

As can be seen from the figure above, the density plots for each of these features indicate that they have different distributions. At this point, all of the features in the dataset are scaled (i.e. standardized) and encoded so that the data is ready for the feature ranking and selection process, as discussed in the following section.

3.4 Features Ranking and Selection

In supervised machine learning methods such as decision trees, it is important to choose a suitable method to identify the features that significantly influence the decision making process. There are two common methods in this context, namely Information Gain and the Gini Index. The former implies that the feature with the highest information gain is used as the root to start building a particular decision tree. The latter implies that the feature with the lowest Gini index should be chosen for a binary split (i.e. two decisions for each node) [8]. The Gini index (i.e. Gini impurity) is used by Classification and Regression Trees (CART) algorithms and is easy to implement, especially for bigger distributions. Therefore, to achieve our goal, a feature ranking approach is proposed using the Gini index method in order to measure the impurity associated with the features and then rank them based on the Gini impurity before building our decision tree. By achieving this goal, we can then build our tree-based intrusion detection approach with the features that have the lowest Gini index. The Gini index is calculated by subtracting the sum of the squared probabilities of each class from one. The more a feature decreases the impurity, the more important the feature is. According to [8, 36], the Gini index for a node n is calculated as shown in formula (2).

$$G_{I}(n)=1-\sum_{i=1}^{c}(P_{i})^{2}$$
(2)

where c denotes the number of security classes and Pi denotes the probability that a tuple in n belongs to security class i. The Gini index is calculated for all features in the dataset in Python and the feature importance scores are ordered for the features as shown in Fig. 3.
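To make formula (2) concrete, the sketch below computes the Gini index of a node from the class labels of the tuples it contains. The function name gini_index and the three example nodes are illustrative assumptions, not part of the original implementation.

```python
from collections import Counter


def gini_index(labels):
    """Gini index of a node: 1 minus the sum of squared class probabilities."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((count / n) ** 2 for count in counts.values())


# Class labels follow the dataset convention: 0 = normal, 1 = attack.
pure_node = [1, 1, 1, 1]    # only attacks
mixed_node = [0, 0, 1, 1]   # evenly mixed
skewed_node = [0, 1, 1, 1]  # mostly attacks

print(gini_index(pure_node))    # 0.0
print(gini_index(mixed_node))   # 0.5
print(gini_index(skewed_node))  # 0.375
```

For two classes the index ranges from 0 (a pure node) to 0.5 (an evenly mixed node), which is why splits that drive it towards 0 are preferred.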

Fig. 3 Security feature importance score

In this research, a threshold value of 0.02 (i.e. t = 0.02) is chosen to select the most important features that should be processed in the proposed tree-based model. It is worth mentioning here that this value can be changed depending on the dataset used. As a result, the number of features is reduced from 42 to 19 based on the score for each feature. Figure 4 illustrates the features that will be used to build our tree-based intrusion detection model.
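The threshold-based selection step can be sketched as follows. The importance scores are invented for illustration only (they are not the values reported in Fig. 3), the helper name select_features is hypothetical, and keeping features whose score is greater than or equal to t is an assumption, since the text does not say whether the comparison is strict.

```python
def select_features(importance, threshold=0.02):
    """Rank features by importance and keep those at or above the threshold."""
    ranked = sorted(importance.items(), key=lambda item: item[1], reverse=True)
    return [name for name, score in ranked if score >= threshold]


# Hypothetical importance scores for a handful of features.
importance = {
    "sttl": 0.40,
    "sload": 0.08,
    "dur": 0.03,
    "rate": 0.015,
    "proto": 0.004,
}

selected = select_features(importance, threshold=0.02)
print(selected)  # ['sttl', 'sload', 'dur']
```

Raising the threshold yields a smaller feature set and a simpler tree, while lowering it retains more features at the cost of additional computation.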

Fig. 4 Selected features based on threshold value of 0.02

At this point, the data is ready to be processed by our proposed tree-based model, taking into consideration 19 features instead of 42 features. This reduction aims to decrease the computational complexity of building our tree-based intrusion detection model and to improve its accuracy in regard to attack prediction, as the selected features have a significant influence on the decision-making process. In the following section, the building of the tree-based intrusion detection model is outlined.

3.5 Tree-Based Intrusion Detection

At this level, our tree-based intrusion detection model can be built after all of the previous steps have been completed. Our model is constructed based on reduced dimensions of security features, which can reduce the complexity of the model computation. Besides, it is built using the highest ranked security features that can significantly improve its prediction accuracy. To start our tree-based model, we should identify the root node that will break down the dataset into smaller subsets and then create the branches of the tree. This process is achieved using the Gini Index discussed earlier in this paper. The leaf node is labelled with our target class, which determines whether a particular activity is classified as normal or an attack. The tree-based model is implemented in Python and a sample of our intrusion decision tree is illustrated in Fig. 5.
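The selection of the root node can be illustrated with a simplified sketch that searches for the binary split with the lowest weighted Gini index, in the spirit of CART. It is not the implementation used in this study: the function names and the toy records are assumptions, and only numeric threshold splits on two features are considered.

```python
from collections import Counter


def gini(labels):
    """Gini index of a set of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, labels, feature_names):
    """Return (feature, threshold, weighted_gini) of the purest binary split."""
    best = None
    n = len(rows)
    for j, name in enumerate(feature_names):
        values = sorted({row[j] for row in rows})
        # Candidate thresholds: midpoints between consecutive distinct values.
        for low, high in zip(values, values[1:]):
            threshold = (low + high) / 2
            left = [y for row, y in zip(rows, labels) if row[j] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[j] > threshold]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[2]:
                best = (name, threshold, score)
    return best


# Toy records (made up): columns are sttl and dur; 0 = normal, 1 = attack.
feature_names = ["sttl", "dur"]
rows = [(31, 0.2), (31, 1.4), (62, 0.3), (254, 0.1), (254, 1.5), (254, 0.4)]
labels = [0, 0, 0, 1, 1, 1]

root = best_split(rows, labels, feature_names)
print(root)  # ('sttl', 158.0, 0.0)
```

In this toy example the split on sttl separates the two classes perfectly, so it would become the root node; applying the same search recursively to each resulting subset grows the rest of the tree.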

Fig. 5 A snapshot of our intrusion detection tree

In Fig. 5 above, a depth value of 3 (i.e. d = 3) is selected to illustrate part of the intrusion detection tree based on the selected features mentioned earlier. For example, the sttl feature was chosen based on the Gini index as the root node and then the branches of the tree were expanded. Each decision node shows the feature name, Gini index, samples, values captured and class name. The class name indicates whether a particular activity is Normal or an Attack. At this point, our tree-based intrusion detection model is built and implemented using the selected features. The following section summarizes our experiments to evaluate the proposed model and compare the results with other models.

4 Experiments

In this section, the experiments are summarized using the cyber security dataset mentioned earlier in this paper. In addition, the proposed model is evaluated based on Accuracy, Precision, Recall and Fscore and the results are compared with other traditional ML models.

4.1 Evaluation Metrics

The values of Accuracy, Precision, Recall and Fscore are significant metrics for evaluating the efficiency of IDS. These values are calculated based on the following terms [8]:

  • True Positives (TP): The number of tuples that are intrusions and are correctly detected as intrusions at the end of the detection process.

  • True Negatives (TN): The number of tuples that are normal and are correctly detected as normal at the end of the detection process.

  • False Positives (FP): The number of tuples that are normal but are detected as intrusions at the end of the detection process.

  • False Negatives (FN): The number of tuples that are intrusions but are detected as normal at the end of the detection process.

The Accuracy metric is the total number of correct predictions divided by the total number of predictions made for a dataset. It is calculated as shown in formula (3).

$$Accuracy= \frac{TP+TN}{TP+TN+FP+FN}$$
(3)

The Precision metric is the fraction of positive class predictions that actually belong to the positive class. It is calculated as shown in formula (4).

$$Precision= \frac{TP}{TP+FP}$$
(4)

The Recall metric is the fraction of all positive examples in the dataset that are correctly predicted as positive. It is calculated as shown in formula (5).

$$Recall= \frac{TP}{TP+FN}$$
(5)

The Fscore metric provides a single score that balances both precision and recall in one number. It is calculated as shown in formula (6).

$$Fscore=2 \times \frac{Recall \times Precision}{Recall+Precision}$$
(6)
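Formulas (3) to (6) can be combined into a short sketch that derives all four metrics from the confusion-matrix counts. The function name classification_metrics and the counts in the example are hypothetical and are not results from our experiments.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute Accuracy, Precision, Recall and Fscore from raw counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # Guard against division by zero when a class is never predicted or absent.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    fscore = 2 * precision * recall / denominator if denominator else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "fscore": fscore,
    }


# Made-up counts for illustration only.
metrics = classification_metrics(tp=80, tn=90, fp=10, fn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With these counts the accuracy is 0.85 and the recall is 0.80, which shows how a model can appear accurate overall while still missing one in five intrusions.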

4.2 Dataset

In this study, a real dataset with 175,341 records taken from the comprehensive dataset “UNSW-NB 15” is used. It was created by the IXIA PerfectStorm tool in the Cyber Range Lab of the Australian Centre for Cyber Security (ACCS). The aim was to generate a hybrid of real modern normal activities and synthetic contemporary attack behaviors [34]. The dataset is publicly available on Kaggle website [34]. It is stored in a “.csv” file that can be processed in Jupyter Notebook using Python.

As stated earlier in this research, the dataset consists of 42 features excluding the class label (see Table 1). The class label is used to determine whether a particular activity is Normal or an Attack. Besides, the type of attack, which is one of the dataset features, is excluded from this study, as it is outside the scope of this research.

4.3 Experiment Design

The first step in our experiments was to complete the processes discussed in Sect. 3. Then, we split the dataset into two sets, namely the training and test sets. The training set comprised 80% of the total records in the dataset, selected at random, and was used to train our proposed model. The test set comprised the remaining 20% of the records and was used to test and validate the proposed model. Two experiments were conducted. The first experiment involved applying our proposed model to the selected dataset taking into consideration the ranking of security features discussed in Sect. 3. The second experiment involved applying traditional ML models, namely the k-nearest neighbor (k-NN) algorithm, Support Vector Machines (SVM), Logistic Regression (LR) and Naive Bayes (NB), as common baseline methods. All experiments were implemented in Python using a personal computer with 1.8 GHz processor speed and 4 GB RAM. Table 2 outlines the implementation environment of the experiments.
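A random 80/20 split of this kind can be sketched in plain Python as shown below. The helper name train_test_split, the fixed seed and the placeholder records are assumptions made for reproducibility of the illustration; they do not describe the exact procedure or library call used in our experiments.

```python
import random


def train_test_split(records, test_fraction=0.2, seed=42):
    """Shuffle record indices and split them into training and test sets."""
    indices = list(range(len(records)))
    random.Random(seed).shuffle(indices)  # seeded for a repeatable split
    n_test = round(len(records) * test_fraction)
    test_idx = indices[:n_test]
    train_idx = indices[n_test:]
    train = [records[i] for i in train_idx]
    test = [records[i] for i in test_idx]
    return train, test


# Placeholder records standing in for rows of the dataset.
records = list(range(1000))
train, test = train_test_split(records, test_fraction=0.2)
print(len(train), len(test))  # 800 200
```

Every record ends up in exactly one of the two sets, so the test set provides an estimate of performance on data that the model has not seen during training.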

Table 2 Implementation environment

5 Results and Evaluation

5.1 Experiment 1

In the first experiment, the proposed model is applied to the selected dataset and the intrusion decision tree is built based on the ranking of the selected security features. As mentioned earlier in this research, the performance evaluation metrics were used, namely accuracy, precision, recall and Fscore to validate our proposed model. The accuracy metric is one of the most popular performance metrics that can be used in classification algorithms; it can be simply defined as the percentage of correct predictions. Table 3 shows the results with respect to each of these metrics.

Table 3 Results of experiment 1

In Table 3 above, the metrics for each class are presented. As stated earlier in Sect. 4.1, the Accuracy metric is the percentage of test samples that are correctly classified by the model. The Precision metric is the ratio of true positives to the total of the true positives and false positives. The Recall metric is the ratio of true positives to the total of the true positives and false negatives. The Fscore metric provides a single score that balances both precision and recall values. The number of samples of the true response that lie in each class (i.e. Normal or Attack) can be presented in the full classification report, as shown in Table 4.

Table 4 Full classification report of experiment 1

In this context, another performance metric that can be used to evaluate our proposed model is the Receiver Operating Characteristic (ROC) curve. It provides an indication of the capability of our predictive model to distinguish between the security classes. It is created by plotting the True Positive Rate (TPR) (i.e. the same value as Recall) versus the False Positive Rate (FPR). The FPR is the number of false positives divided by the sum of the number of false positives and the number of true negatives. The higher the Area Under the Curve (AUC), the better the predictive model. Figure 6 shows the ROC curve of our proposed model with AUC = 0.97.
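The construction of a ROC curve and its AUC can be sketched as follows. The labels and scores are made up for illustration and are unrelated to the result in Fig. 6; the function names are hypothetical, and the sketch assumes that both classes are present in y_true.

```python
def roc_points(y_true, scores):
    """Sweep the decision threshold and return the (FPR, TPR) points."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    # Rank samples from the highest to the lowest predicted attack score.
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Process all samples tied at the current score together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve using the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical labels (1 = attack) and predicted attack scores.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]

points = roc_points(y_true, scores)
print(points)
print(round(auc(points), 4))  # 0.8889
```

An AUC of 1.0 would mean that every attack receives a higher score than every normal record, whereas 0.5 corresponds to random ranking.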

Fig. 6 The ROC curve of our proposed model

5.2 Experiment 2

In this experiment, the traditional ML models mentioned earlier in this research are applied to the same dataset. However, our approach to ranking the security features was excluded from this experiment so that the baseline models could be compared with our proposed model. Other steps such as data encoding and scaling remained the same as in the first experiment. The k-nearest neighbor (k-NN) algorithm, Support Vector Machines (SVM), Logistic Regression (LR) and Naive Bayes (NB) models were used. Figure 7 shows a summary of the results of experiment 2 for each of these methods compared with the proposed model.

Fig. 7 Results of experiment 2 for the selected baseline traditional methods

The results in Fig. 7 above show the performance metrics for each of the baseline methods compared with our proposed model. All of the models were applied to the selected dataset in the same environment. The proposed model provides better performance than the other models. In addition, our approach of selecting the highly ranked security features reduced the complexity of computation in terms of processing time and over-fitting, as fewer security features are processed.

6 Conclusion and Future Work

In this paper, we present an intelligent tree-based model that is capable of efficiently and effectively predicting and detecting attacks in cyberspace. Within the model, the main steps in machine learning are followed such as data rescaling and encoding. Moreover, an approach was developed to select the security features that should be processed based on the ranking of each security feature before building our tree-based intrusion model. The Gini Index was used to measure the impurity of the security features. Specifically, for efficient and accurate results, the highly ranked features were used to train and test the proposed model instead of using all of the security features. In our experiments, we presented the efficiency and effectiveness of our model compared with other popular ML methods.

Meanwhile, our future work will involve working out how to predict what types of attacks will occur in cyberspace using our model and assessing its effectiveness with more dimensions of security features. Besides, for the features selection and ranking process, we intend to apply combined methods such as feature filtering and wrapping into our model in order to improve its performance.