1 Introduction

Intrusion is a serious security problem and a primary cause of security breaches, because a single intrusion can steal or delete data from computers and network systems in almost no time. An intrusion can also damage the system and its associated hardware. Moreover, intrusions can cause huge financial losses and compromise critical information technology infrastructure, thereby leading to information inferiority in cyberwar [1]. An intrusion detection system is therefore essential for preventing intrusions. An Intrusion Detection System (IDS) monitors network traffic for suspicious activity. Some IDSs are capable of taking action when anomalous traffic or malicious activity is detected, including blocking traffic sent from a suspicious IP address, although anomaly detection and reporting is their primary function. Although IDSs monitor networks for potentially malicious activity, they are also prone to false alarms (false positives). Over the last decade, the number of network attacks has increased significantly, and these attacks have become more severe and complex in nature [2]. Many hackers probe and attack computers. To defend against these various cyber-attacks and computer viruses, many computer security techniques have been studied in the last decade, including cryptography, firewalls and intrusion detection systems [3].

Recently, various Machine Learning (ML) methods and techniques have been proposed to improve the performance of IDSs [4, 5]. ML is a branch of artificial intelligence based on empirical data such as sensor data or databases. These methods are well known for their ability to detect anomalies based on pattern analysis and to find solutions [6]. Some of the ML methods that have been examined in IDS tasks are Support Vector Machine (SVM) [4], Random Forest (RF) [5], software agent [6] and Decision Jungle (DJ) [7]. ML has a wide range of applications, including search engines, medical diagnosis, text and handwriting recognition, image screening, load forecasting, and marketing and sales diagnosis [8,9,10,11].

There are many existing mechanisms for intrusion detection. The main issues in the problem statement are the security and accuracy of the system [12]. An intrusion detection system was built to address the accuracy and efficiency of the system; for each conventional classification approach, three algorithms are used [6]. This research aims to determine which algorithm is best at reducing the types of attack. The resulting rules can determine intrusion characteristics, which can then be implemented in firewall policy rules as prevention. The combination of an IDS and a firewall is called an Intrusion Prevention System (IPS): besides detecting the presence of an intrusion, it can also act to block the intrusion as prevention [1]. The aim of this work is to present a KDD dataset technique that reduces IDS alerts and assesses their risk [13]. To achieve this aim, the following objectives are considered: to apply the information gain ratio algorithm to extract the best features of IDS alerts for assessing the alerts; to build an IDS alert aggregation method based on three decision tree-based algorithms that reduces the number of false positive alerts and the alert redundancy; and to evaluate the accuracy and precision of the three algorithms using a selected standard dataset [14,15,16,17].

Various techniques and methods have been proposed, developed, and evaluated to safeguard internet users against attacks. Research studies on IDS include the work of Li et al. [17], who proposed an intrusion detection system based on the Online Sequence Extreme Learning Machine (OS-ELM), which is used to detect attacks in AMI, and carried out a comparative analysis with other algorithms. Simulation results show that, compared with other intrusion detection methods, the OS-ELM-based method is superior in detection speed and accuracy. Shakya and Kaphle [18] propose a learning approach for developing a novel IDS using backpropagation neural networks (BPN) and self-organizing maps (SOM), and compare the performance of the two. The main function of an IDS is to protect resources from threats. It analyses and predicts the behaviours of users, and these behaviours are then classified as either an attack or normal behaviour. The proposed method can significantly reduce the training time required.

This research focuses on intrusion detection classification using three popular ML methods: Decision Tree (DT), Decision Jungle (DJ), and Decision Forest (DF). The Kaggle intrusion detection dataset is multivariate and medium-sized (126000 rows and 42 columns), and it has some missing values. Among the most important factors to consider is identifying the categories of illegal activities that lead to intrusions. The ML methods are applied to the intrusion problem using the same dataset. The rest of this paper is organized as follows. Section 2 describes the methods and materials. Section 3 presents the testing results. Section 4 concludes the work and proposes future research.

2 Methods and Materials

This research uses Knowledge Discovery in Database (KDD). KDD is the process of discovering useful knowledge from a collection of data [12]. The experiments were carried out using the Azure Machine Learning tool with the 10-fold cross-validation method for training and testing [19]. This method is used because the data is obtained from a dataset. The KDD methodology involves seven steps: (1) data cleaning to remove noisy and irrelevant data; (2) data integration to combine heterogeneous data from multiple sources; (3) data selection to retrieve relevant data from the data collection; (4) data transformation to prepare the data in the appropriate form; (5) data mining to extract potentially useful patterns; (6) pattern evaluation to identify relevant patterns based on given measures; and (7) knowledge representation to represent and visualize the results. Figure 1 shows the KDD methodology.

Fig. 1. Knowledge Discovery in Database (KDD) [12]
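The 10-fold scheme used for training and testing can be illustrated with a minimal sketch. It is not the Azure Machine Learning implementation; the function name `k_fold_indices`, the shuffling seed, and the toy sample count are assumptions made only for illustration.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation.

    Hypothetical helper: every sample lands in exactly one test fold,
    and the remaining k-1 folds form the training set for that round.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)          # shuffle once, reproducibly
    folds = [indices[i::k] for i in range(k)]     # k roughly equal folds
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

# Toy usage with 50 samples instead of the full dataset.
splits = list(k_fold_indices(n_samples=50, k=10))
for train, test in splits:
    assert not set(train) & set(test)             # no leakage between the two
```

Each model would be trained on `train` and scored on `test` in every round, and the ten scores would then be averaged.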

2.1 Testing Dataset

The data used in this research is the intrusion detection dataset taken from the Kaggle website [20]. The dataset has 42 attributes and 126000 instances. This data was selected by using Placement.

2.2 Machine Learning Methods

Three methods are used in this research: Decision Tree (DT), Decision Jungle (DJ), and Decision Forest (DF), which is also referred to as Random Forest (RF) in this paper. The DT is one of the most powerful and simple data mining methods that have been employed in IDS. A decision tree consists of branch nodes, each representing a choice among a number of alternatives, and leaf nodes, each representing a class of data [1]. The architecture of the DT is illustrated in Fig. 2, in which T1, T2, T3, and T4 are branch nodes that assign a class number to an input pattern by filtering the pattern down through the tests in the tree. Any input pattern is thus assigned to class 1, 2, or 3 when it reaches a leaf node [3]. The DT is therefore useful for categorizing data from large datasets.

Fig. 2. Decision Tree Architecture [1, 3]
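The way a pattern is filtered down through the tests T1 to T4 until it reaches a leaf can be sketched as follows. The tree shape, the feature names and the thresholds are hypothetical and are not taken from Fig. 2 or from the dataset; only the traversal logic is the point.

```python
# A branch node is (feature_name, threshold, left_subtree, right_subtree);
# a leaf is an integer class label. All values here are illustrative.
TREE = (
    "duration", 10.0,                      # T1
    ("src_bytes", 500.0, 1, 2),            # T2: leaves are class 1 / class 2
    ("failed_logins", 3.0,                 # T3
        ("dst_bytes", 1000.0, 2, 3),       # T4: leaves are class 2 / class 3
        3),
)

def classify(tree, pattern):
    """Follow one test per branch node until a leaf (class label) is reached."""
    node = tree
    while not isinstance(node, int):
        feature, threshold, left, right = node
        node = left if pattern[feature] <= threshold else right
    return node

short_flow = {"duration": 2.0, "src_bytes": 900.0, "failed_logins": 0, "dst_bytes": 0.0}
long_flow = {"duration": 50.0, "src_bytes": 0.0, "failed_logins": 1, "dst_bytes": 5000.0}
print(classify(TREE, short_flow))  # passes T1 and T2 only
print(classify(TREE, long_flow))   # passes T1, T3 and T4
```

Only the tests along one root-to-leaf path are evaluated per pattern, which is why a single tree is cheap to apply to large datasets.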

The Decision Jungle (DJ) algorithm is an ensemble learning method for classification. The algorithm works by building multiple decision trees and then voting on the most popular output class. Trees with high prediction confidence have a greater weight in the final decision of the ensemble. Decision jungles are an extension of decision forests [13]. Both build and then aggregate decision trees, but decision jungles additionally allow branches to merge, resulting in a much-reduced memory footprint. A decision jungle consists of an ensemble of decision directed acyclic graphs (DAGs) [1]. Decision jungles are highly flexible, non-parametric models that can represent non-linear decision boundaries. They perform integrated feature selection and classification and are resilient in the presence of noisy features.
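The effect of merging branches can be shown with a small hand-built decision DAG. This is only a sketch of the data structure, not the decision jungle training algorithm; the node identifiers, features and thresholds are hypothetical.

```python
# Nodes are stored by id so that two parents can share one child.
# Branch node: (feature, threshold, left_id, right_id). Leaf: ("leaf", label).
DAG = {
    "root":        ("f0", 0.5, "a", "b"),
    "a":           ("f1", 0.5, "leaf_normal", "merged"),
    "b":           ("f1", 0.5, "merged", "leaf_attack"),
    "merged":      ("f2", 0.5, "leaf_normal", "leaf_attack"),  # shared by a and b
    "leaf_normal": ("leaf", "normal"),
    "leaf_attack": ("leaf", "attack"),
}

def classify_dag(dag, pattern, start="root"):
    """Walk the DAG from the root until a leaf node is reached."""
    node = dag[start]
    while node[0] != "leaf":
        feature, threshold, left_id, right_id = node
        node = dag[left_id if pattern[feature] <= threshold else right_id]
    return node[1]

via_a = classify_dag(DAG, {"f0": 0.0, "f1": 1.0, "f2": 0.0})  # root -> a -> merged
via_b = classify_dag(DAG, {"f0": 1.0, "f1": 0.0, "f2": 1.0})  # root -> b -> merged
```

In a plain tree the `merged` test and its two leaves would have to be duplicated under both `a` and `b`; sharing them is what keeps the stored model smaller.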

The Decision Forest (DF) algorithm is an ensemble learning method for classification. The algorithm works by building multiple decision trees and then voting on the most popular output class, as shown in Fig. 3. Trees with high prediction confidence have a greater weight in the final decision of the ensemble. DFs are ensemble classifiers, which are used for classification and regression analysis on the intrusion detection data. The DF works by creating multiple decision trees in the training phase and outputting the class label that has the majority vote [13]. The DF achieves high classification accuracy and can handle outliers and noise in the data. The DF is used in this work because it is less susceptible to over-fitting and has previously shown good classification results.

Fig. 3. The architecture of DF for IDS [13]

Figure 3 shows the implementation of the random forest classification model for data classification in the proposed system. A pre-processed sample of n samples is fed to the decision forest classifier. The DF creates n different trees by using a number of feature subsets. Each tree produces a classification result, and the result of the classification model depends on majority voting [14]. The sample is assigned to the class that obtains the highest voting score. Previously obtained classification results indicate that the DF is reasonably suitable for classifying such data because, in some cases, it has obtained better results than other classifiers. Other advantages of the RF include its higher accuracy than AdaBoost and a lower chance of overfitting.
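The voting step can be sketched in a few lines. The function names and the confidence values are assumptions made for illustration; the sketch covers only how per-tree outputs are combined, not how the trees are trained.

```python
from collections import defaultdict

def weighted_vote(tree_outputs):
    """Combine (predicted_class, confidence) pairs, one pair per tree.

    The class with the largest summed confidence wins. On an exact tie,
    the class that was seen first is returned.
    """
    scores = defaultdict(float)
    for label, confidence in tree_outputs:
        scores[label] += confidence
    return max(scores, key=scores.get)

def majority_vote(labels):
    """Plain majority voting is the special case where every tree has weight 1."""
    return weighted_vote([(label, 1.0) for label in labels])

plain = majority_vote(["attack", "normal", "attack"])
weighted = weighted_vote([("normal", 0.9), ("attack", 0.4), ("attack", 0.4)])
```

In the second call, two low-confidence trees vote for `attack`, yet the single confident tree decides the outcome, which is the behaviour described for trees with high prediction confidence.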

The DT, DJ and DF involve several steps in the training and testing phases, as shown in Fig. 4. The first step is importing the dataset and then obtaining the labels. The labels are then checked one by one against the original dataset features. In the traffic analysis step, a setting function is employed to analyse and monitor the incoming traffic and to set the threshold. The DT, DF, and DJ then analyse the features of the incoming traffic, and the IDS forwards it to the decision function to determine whether the incoming traffic is attack traffic or not. If the incoming traffic has anomalies, the IDS saves the IP address that sent the attack traffic for a permanent block. If the incoming traffic has no anomalies, it is identified as normal traffic and passed to the web server.

Fig. 4. The architecture of the ML-IDS
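The decision flow described for Fig. 4 can be sketched as below. This is a simplified illustration under stated assumptions: `threshold_classifier` is a hypothetical stand-in for the trained DT, DJ or DF model, the flow fields and the threshold value are invented, and dropping later traffic from an already blocked address is an assumed reading of the permanent block.

```python
def threshold_classifier(features, max_packets_per_second=1000.0):
    """Hypothetical stand-in for a trained model: flags very high packet rates."""
    return "attack" if features["packets_per_second"] > max_packets_per_second else "normal"

def handle_traffic(flows, classifier, blocked_ips):
    """Return the flows that may be passed on to the web server.

    Flows classified as attacks are not forwarded and their source IP is
    added to blocked_ips; flows from blocked IPs are dropped outright.
    """
    forwarded = []
    for flow in flows:
        if flow["src_ip"] in blocked_ips:
            continue                                  # already blocked
        if classifier(flow["features"]) == "attack":
            blocked_ips.add(flow["src_ip"])           # save IP for the block list
        else:
            forwarded.append(flow)                    # normal traffic
    return forwarded

flows = [
    {"src_ip": "10.0.0.1", "features": {"packets_per_second": 20.0}},
    {"src_ip": "10.0.0.2", "features": {"packets_per_second": 5000.0}},
    {"src_ip": "10.0.0.2", "features": {"packets_per_second": 10.0}},
]
blocked = set()
passed = handle_traffic(flows, threshold_classifier, blocked)
```

Passing the classifier in as an argument keeps the flow logic identical for all three models, so only the decision function changes between experiments.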

2.3 Evaluation Metrics

The evaluation metric includes the following:

  • Micro-average method: In the micro-average method, the individual true positives (TP), false positives (FP), and false negatives (FN) of the system for the different sets are summed and then used to compute the statistics [3, 21].

$$ \text{Micro-average precision} = \frac{TP_{1} + TP_{2}}{TP_{1} + TP_{2} + FP_{1} + FP_{2}} $$
(1)

and,

$$ \text{Micro-average recall} = \frac{TP_{1} + TP_{2}}{TP_{1} + TP_{2} + FN_{1} + FN_{2}} $$
(2)
  • Macro-average method: This method is straightforward: take the average of the precision and recall of the system over the different sets [22, 23].

$$ \text{Macro-average precision} = \frac{P_{1} + P_{2}}{2} $$
(3)

and,

$$ \text{Macro-average recall} = \frac{R_{1} + R_{2}}{2} $$
(4)
  • Overall accuracy: Overall accuracy is the proportion of all reference instances that were classified correctly. It is usually expressed as a percentage, with 100% accuracy being a perfect classification in which all reference instances were classified correctly [19, 24]. In Eq. (5), P and N denote the total numbers of positive and negative instances.

$$ {\text{Overall Accuracy }} = \frac{TP + TN}{P + N} $$
(5)
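Equations (1) to (5) can be checked on a small example. The counts below are invented for illustration and are not results from this study; the function names are likewise hypothetical.

```python
def micro_macro(counts):
    """Compute micro and macro averages of precision and recall.

    counts: one dict per set (or class) with keys "tp", "fp" and "fn".
    Returns (micro_precision, micro_recall, macro_precision, macro_recall).
    """
    tp = sum(c["tp"] for c in counts)
    fp = sum(c["fp"] for c in counts)
    fn = sum(c["fn"] for c in counts)
    micro_p = tp / (tp + fp)                                   # Eq. (1)
    micro_r = tp / (tp + fn)                                   # Eq. (2)
    macro_p = sum(c["tp"] / (c["tp"] + c["fp"]) for c in counts) / len(counts)  # Eq. (3)
    macro_r = sum(c["tp"] / (c["tp"] + c["fn"]) for c in counts) / len(counts)  # Eq. (4)
    return micro_p, micro_r, macro_p, macro_r

def overall_accuracy(tp, tn, p, n):
    """Eq. (5): correctly classified instances over all instances."""
    return (tp + tn) / (p + n)

# Two illustrative sets: one large and accurate, one small and poor.
counts = [{"tp": 8, "fp": 2, "fn": 2}, {"tp": 1, "fp": 1, "fn": 3}]
micro_p, micro_r, macro_p, macro_r = micro_macro(counts)
accuracy = overall_accuracy(tp=90, tn=5, p=92, n=8)
```

With these counts the micro-average precision is 9/12 = 0.75, while the macro-average precision is (0.8 + 0.5)/2 = 0.65: the micro average is dominated by the larger set, whereas the macro average weights both sets equally.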

3 Results

The IDS protects systems against hackers and secures networks against threats. The traffic classes considered include Benign, DDoS, DoS GoldenEye, Heartbleed, DoS Hulk, DoS Slowhttp, DoS slowloris, SSH-Patator, FTP-Patator, Web Attack, Infiltration, Bot and PortScan [1, 24]. The DT, DJ and DF algorithms that are integrated into the IDS help to detect the threats that attack computer or network systems. This research identifies the best of the three ML algorithms by comparing their results. Intrusion detection performance depends on accuracy as well as on decreasing the false alarm rate and increasing the true alarm rate. The evaluation metrics of accuracy, precision and recall are calculated to measure the performance of the algorithms. The testing experiments were carried out on Windows 7 using the Azure ML tool and 10-fold cross-validation. The hardware used for implementation and testing is an Intel(R) Core(TM) i7-5500U processor at 2.40 GHz with 16 GB of RAM. Figure 5 shows the actual and predicted classes in the multiclass confusion matrix of the DJ test.

Fig. 5. The confusion matrix of the DJ

Initially, data cleaning and multiple test runs are performed to ensure that the dataset and the algorithms are ready for the training, testing and evaluation phases. 10-fold cross-validation is then performed to obtain reliable results. Table 1 shows the test results for the DT, DJ and DF algorithms in terms of accuracy, precision and recall over the range of dataset splits. The table shows that all three algorithms perform well.

Table 1. The accuracy, precision and recall results of the DT, DJ and DF

The results show that the DF achieved the highest overall accuracy of 99.83%, the DJ came second with 99.74%, and the DT had the lowest accuracy of 95.59%. Moreover, the DF has a higher recall than the DT and DJ, whereas the DJ has a higher precision than the DT and DF. Overall, the DF outperforms the DT and DJ, as Fig. 6 shows.

Fig. 6. The overall accuracy, precision and recall of the algorithms

4 Conclusion

This research investigates which technique gives the best performance for detecting intrusions in an IDS. It presents an analysis of intrusion detection using ML-based classification algorithms: Decision Tree (DT), Decision Jungle (DJ), and Decision Forest (DF). The performance of the IDS models is assessed based on accuracy, precision and recall. The models are implemented with the Azure ML tool. The test results show that the DF has the highest overall accuracy of 99.83%, the DJ comes second with 99.74%, and the DT has the lowest accuracy of 95.59%. In future research, we plan to explore more attributes along with other data mining classification tasks and platforms.