
1 Introduction

Computer networks play a major role in the modern world, and they are developing and spreading rapidly. At the same time, ensuring their security, maintenance and stability comes at a high cost. Since the main purpose of attacks is to reach large amounts of information, intrusion detection techniques have attracted researchers' attention. Researchers attempt to find a technique that is efficient in terms of both time and detection ability, and that can also be implemented in network security devices. Network attacks are a group of destructive activities known for the fragmentation, denial and destruction of information and services in computer networks. Examples of network attacks are viruses attached to e-mails, probing of a system to collect information, internet worms, unauthorized use of a system, and denial of service through abuse of a system's attributes or exploitation of a software bug in order to change the system's information.

An intrusion detection system (IDS) can be either a device or a software application that monitors a network or a system for malicious activity or policy violations. Any malicious activity or violation is typically either reported to an administrator or collected centrally using a security information and event management (SIEM) system. A SIEM system combines outputs from multiple sources and then uses alarm filtering. Intrusion detection typically refers to tools for detecting attempts to gain unauthorized access to a system or to degrade its performance. In other words, by checking the saved records of user logins, these systems do not permit unauthorized logins, and they also monitor users' activities on the system in order to inform the system administrator of any unauthorized activity by a user. A simple model of a network intrusion detection system is shown in Fig. 1:

Fig. 1. A simple model of exposure IDS in computer networks

2 Network Intrusion Detection Systems

Network intrusion detection systems (NIDS), like other network equipment, are developing in their ability to detect attacks. For a long time, intrusion detection systems have focused on anomaly detection and misuse detection. Commercial manufacturers concentrate mainly on misuse detection because of its high detection ability and precision, while anomaly detection is being developed in academic research because of its strong theoretical background. As a general analysis, anomaly detection examines features such as CPU consumption, input and output, network card traffic, number of file accesses, user identity, the machines that a user wants to access, all opened files, pages read and page faults. When these values deviate far enough from a threshold, as judged by statistical or intelligent techniques, the activity is detected as an anomaly [1]. In misuse detection methods, patterns that appear in the data stream are first encoded and then matched against intrusive procedures such as specific signatures [2]. In anomaly detection, by contrast, a model of the data flow is monitored by statistical analysis to decide whether, relative to normal conditions, an intrusive procedure, abnormal traffic or unusual activity constitutes an intrusion [1]. In addition, it is difficult to define signatures that cover all the different types of possible attacks. Every mistake in detecting these signatures increases the false alarm rate and decreases the efficiency of the detection technique. Therefore, rule-based techniques can be used: the security expert expresses the policies as rules, which are then matched against the data flow model. It is imperative that the rules used for pattern matching be kept up to date by security experts [2].
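The threshold idea behind statistical anomaly detection can be illustrated with a minimal sketch. It is not the method of [1]: the choice of a z-score, the threshold of three standard deviations, and the names `fit_baseline` and `is_anomalous` are assumptions made only for illustration.

```python
import statistics


def fit_baseline(normal_values):
    """Summarize observations taken under normal conditions (mean, population std)."""
    return statistics.mean(normal_values), statistics.pstdev(normal_values)


def is_anomalous(value, baseline, threshold=3.0):
    """Flag a value lying more than `threshold` standard deviations from the mean.

    The threshold of 3.0 is an assumed default, not a value from the text.
    """
    mean, stdev = baseline
    if stdev == 0:
        # With no spread in the baseline, any different value counts as a deviation.
        return value != mean
    return abs(value - mean) / stdev > threshold


# Hypothetical CPU consumption samples (percent) observed during normal operation.
cpu_baseline = fit_baseline([20, 22, 19, 21, 20, 23, 18, 21])
suspicious = is_anomalous(95, cpu_baseline)
ordinary = is_anomalous(22, cpu_baseline)
```

In practice one such baseline would be kept per monitored feature, and the threshold would be tuned against the false alarm rate.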

3 KDDCUP99 and NSL-KDD Datasets

Different data sets with various classifications of attacks have been presented up to now, but the classification in [3] seems to be more elaborate and more complete than the others, and it also includes the qualities and capabilities of the other classifications. The better the description of attacks, the more easily they can be detected by machine learning techniques. Since 1999, the KDDCUP99 data set has been widely used to evaluate anomaly detection methods. This data set was prepared at MIT Lincoln Laboratory by Stolfo et al. over 7 weeks, with approximately 5 million records and a size of 4 GB, in which each record is about 100 bytes; each record consists of 41 features [4].

According to a comprehensive analysis of recent work in anomaly detection and to the previously reported research mentioned above, a detection rate as high as 98% with a false detection rate of 2% can be obtained [5]. Despite these high attack detection rates in academic research, machine learning methods are not found in commercially produced devices, because cyber security equipment manufacturers do not trust the efficiency of the recently introduced technologies. To find the reason for this contradiction, A.A. Ghorbani et al. [6] investigated the details of studies in the anomaly detection domain and their different aspects, including training, learning, testing and evaluation of data sets with a variety of methods. Their study reveals that there are intrinsic problems in the KDDCUP99 data set. Nevertheless, most researchers have used this data set, one of the most prevalent for anomaly detection, for years and have obtained unreliable results. The first shortcoming of the KDDCUP99 data set is its large amount of data redundancy.

An analysis of the training and testing data sets shows that nearly 78% and 75% of their records, respectively, are duplicated [6]. This large amount of redundancy in the training set prevents machine learning algorithms from performing well. Moreover, duplicated records in both the testing and training sets account for the high detection percentages reported by previous researchers in this area. When different machine learning algorithms are studied on instances randomly selected from these data sets, as mentioned before, a detection rate as high as 98% can be obtained, which drops to approximately 86% in the worst conditions. By presenting the problems of KDDCUP99, A.A. Ghorbani et al. [6] acknowledged that the evaluation results in this area are unreliable. Furthermore, the existence of redundant, duplicated and repeated records in both the testing and training tables is harmful, and in the reported papers the detection rates of some attacks are lower than those of others; there are only a few records of such attacks in both tables, and they do not follow a normal distribution. Thus, as a first step, the redundant records within the training and testing data sets are eliminated, and then the training records that are repeated in the test table are eliminated. The resulting data set is presented as NSL-KDD in [8]. Although this new data set does not have the above-mentioned problems, it still has the problems asserted by McHugh [7].
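The two elimination steps described above can be sketched as follows. This follows the wording of this section only and is not claimed to reproduce the exact procedure behind [8]; the function name `build_deduplicated_sets` and the toy records are hypothetical.

```python
def build_deduplicated_sets(train_rows, test_rows):
    """Drop duplicates inside each set, then drop training rows that also occur in the test set."""

    def unique(rows):
        seen = set()
        result = []
        for row in rows:
            key = tuple(row)
            if key not in seen:  # keep only the first occurrence, preserving order
                seen.add(key)
                result.append(key)
        return result

    test_unique = unique(test_rows)
    test_keys = set(test_unique)
    train_unique = [row for row in unique(train_rows) if row not in test_keys]
    return train_unique, test_unique


# Toy records: (protocol, service, flag, label). Real records carry 41 features plus a label.
train = [
    ("tcp", "http", "SF", "normal"),
    ("tcp", "http", "SF", "normal"),   # duplicate inside the training set
    ("udp", "domain_u", "SF", "normal"),
    ("icmp", "ecr_i", "SF", "smurf"),  # also present in the test set
]
test = [
    ("icmp", "ecr_i", "SF", "smurf"),
    ("icmp", "ecr_i", "SF", "smurf"),  # duplicate inside the test set
    ("tcp", "private", "S0", "neptune"),
]
clean_train, clean_test = build_deduplicated_sets(train, test)
```

Comparing whole records as tuples treats two connections as duplicates only when every field, including the label, is identical.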

4 Related Works

With the extensive development of computer networks and the rapid increase of specialized applications running on them, the security of these networks has become increasingly important. During the last decade, misuse and anomaly detection have received growing attention. Researchers have worked on overcoming the weakness of misuse detection against novel attacks, and the KDDCUP99 data set is widely used for evaluating such systems. For a long time, research on intrusion detection has concentrated on anomaly and misuse detection. Since commercial manufacturers concentrate on misuse detection because of its high detection ability and precision, anomaly detection is being developed in academic research, where it has a strong theoretical background.

4.1 Naïve Bayes Method in Anomaly Detection

The conditional probability P(H|E) is the probability of hypothesis H given evidence E. The input sample can be represented as a column feature vector X = (x1, x2, …). We calculate P(X|class = Normal)·P(Normal) and P(X|class = Attack)·P(Attack); whichever is larger indicates whether the input data is Normal or Attack, respectively. Adebayo et al. [9] eliminated features 0, 1, 8, 14, 15, 16, 17, 18, 19, 20, 21 and 36 from their tests using fuzzy methods, although they did not give a clear explanation of how this was done. They carried out their evaluation based on only 22 features, using the Naïve Bayes method with 5924 training records and 12130 test records, and obtained the same result as with the whole feature set, 96.67%. Ben Amor et al. [10], using the Naïve Bayes method, obtained accuracies of 96.38%, 11.84%, 7.11% and 78.18% for DoS, U2R, R2L and Probe attacks and 96.64% for normal input packets, respectively. They also reported precisions of 98.48% and 89.75% for normal and abnormal detection, respectively.
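The decision rule above (pick the class that maximizes the likelihood times the prior) can be sketched for categorical features as follows. The add-one (Laplace) smoothing, the use of log probabilities, and the names `train_naive_bayes` and `classify` are assumptions of this sketch and are not taken from [9] or [10].

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """Count class frequencies and per-class feature value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)   # (class, feature index) -> value -> count
    distinct_values = defaultdict(set)    # feature index -> set of observed values
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(label, i)][value] += 1
            distinct_values[i].add(value)
    return class_counts, value_counts, distinct_values


def classify(row, model):
    """Return the class maximizing log P(class) + sum_i log P(x_i | class)."""
    class_counts, value_counts, distinct_values = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total)
        for i, value in enumerate(row):
            # Add-one smoothing (assumed) avoids zero probabilities for unseen values.
            numerator = value_counts[(label, i)][value] + 1
            denominator = count + len(distinct_values[i])
            score += math.log(numerator / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy training data with two features: (protocol, flag).
rows = [("tcp", "SF"), ("tcp", "SF"), ("udp", "SF"),
        ("tcp", "S0"), ("icmp", "S0"), ("tcp", "S0")]
labels = ["Normal", "Normal", "Normal", "Attack", "Attack", "Attack"]
model = train_naive_bayes(rows, labels)
first = classify(("udp", "SF"), model)
second = classify(("icmp", "S0"), model)
```

Continuous features such as byte counts would need either discretization or a density estimate per class, which this sketch leaves out.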

4.2 Decision Trees Method in Anomaly Detection

In artificial intelligence, trees are used to represent different concepts such as sentence structures, equations, game states, and so on. Decision tree learning is a method for approximating discrete-valued target functions. This method, which is robust to noisy data, is able to learn disjunctions of conjunctive predicates. Pachghare et al. [11] detected packet normality with about 99% accuracy without any preprocessing, using only decision trees and 1000 instances. In [13], an accuracy rate of 95% was obtained using feature selection with the InfoGain method.

4.3 Support Vector Machine Method in Anomaly Detection

The main idea of support vector machines [12, 13] is to increase the dimensionality of the samples so that they can be separated. Hence, although reducing the dimensions is a common process, in support vector machines the dimensions in fact increase. The aim is to find a separation in a space with very many dimensions, even though that number of dimensions may seem excessive. Teng et al. [15], using fuzzy and SVM methods and dividing the test and training data sets into three groups, performed their tests based on the TCP, UDP and ICMP protocols and obtained an accuracy rate of 82.5% for a single SVM and 91.2% for a multi-SVM. In [15], Rung-Ching et al. obtained an accuracy of 89.13% using SVM and rough set methods.

4.4 Artificial Neural Networks Method in Anomaly Detection

The multilayer perceptron (MLP) [12] is one of the most common algorithms used for neural network classification. Researchers have used the multilayer perceptron to detect the attacks in the KDDCUP99 data set [16]. Their structure is a feed-forward, three-layer neural network with an input layer, a hidden layer and an output layer. Unipolar sigmoid transfer functions are used for each neuron in both the hidden and output layers, with a bias value of 1. The applied learning algorithm is stochastic gradient descent with the mean square error function. In total, there are 41 neurons in the input layer (one per input feature) and 5 neurons (one for each class) in the output layer. The reported detection rates are 88.7% for Probe, 97.2% for DoS, 13.2% for U2R and 5.6% for R2L attacks [16]. In [17], Abdulkader et al., using a neural network with 24 neurons and one hidden layer for some specific DoS attacks, obtained a 91.42% detection rate with an 8.57% false detection rate. Their tests revealed that these ratios did not change even when the number of neurons was increased. Mukhopadhyay et al. used a back propagation neural network [18] with all 41 features and used the corrected data set for both training and testing. From the 311030 records of this data set, they used 217720 records for training and 46655 records for testing, and reported a 95.6% detection rate with a 4.4% false detection rate.
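The three-layer structure described for [16] can be sketched as a forward pass with a mean square error. The hidden layer size (`N_HIDDEN = 10`), the weight range and the random input are assumptions, since the text does not give them; the constant input of 1 stands in for the bias term, which is how this sketch reads the value of 1 mentioned above. Training by gradient descent is omitted.

```python
import math
import random

N_INPUTS, N_HIDDEN, N_OUTPUTS = 41, 10, 5  # N_HIDDEN is an assumed value


def sigmoid(z):
    """Unipolar sigmoid, with outputs in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def init_layer(n_in, n_out, rng):
    # One weight row per neuron; the last weight multiplies the constant bias input 1.
    return [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_out)]


def layer_forward(weights, inputs):
    extended = list(inputs) + [1.0]  # append the bias input
    return [sigmoid(sum(w * x for w, x in zip(row, extended))) for row in weights]


def forward(hidden_weights, output_weights, features):
    return layer_forward(output_weights, layer_forward(hidden_weights, features))


def mean_squared_error(outputs, targets):
    return sum((o - t) ** 2 for o, t in zip(outputs, targets)) / len(outputs)


rng = random.Random(0)
hidden_weights = init_layer(N_INPUTS, N_HIDDEN, rng)
output_weights = init_layer(N_HIDDEN, N_OUTPUTS, rng)
sample = [rng.random() for _ in range(N_INPUTS)]  # stand-in for 41 scaled features
outputs = forward(hidden_weights, output_weights, sample)
target = [1.0, 0.0, 0.0, 0.0, 0.0]                # one-hot target for the first class
error = mean_squared_error(outputs, target)
```

The predicted class is the index of the largest of the five outputs; with untrained random weights that prediction carries no meaning.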

5 Evaluation Made by Intelligence Algorithms on KDDCUP99 and NSL-KDD Datasets

As already mentioned, different tables have been extracted from KDDCUP99. In published papers, random samples from the kddcup.data_10_percent table are generally used for both training and testing, which gives unreliable results. In this research, the tables for evaluation are first selected from the KDDCUP99 data set and the results are compared with similar related works. In the next step, evaluations are carried out on the NSL-KDD data set, and finally the results are compared.

5.1 Preprocessing and Analysis of Various Methods on KDDCUP99 Data Set

First, 10% of the corrected table of the KDDCUP99 data set, which includes 17 novel attacks, is selected randomly as testing data, and 10% of the kddcup.data_10_percent table is selected as training data. Analyzing the tables with SQL Server facilities (see Table 1) shows clearly that the num_outbound_cmds feature has the value zero in all rows. Therefore, this feature is not used in our computations with machine learning techniques, and the following results are obtained:

Table 1. Random sample selection from KDDCUP99
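The check that led to dropping num_outbound_cmds was done with SQL Server facilities that are not shown in the text. As a stand-in, the sketch below finds features that take a single value in every row and removes them; the function names and the toy rows are hypothetical.

```python
def constant_features(rows, feature_names):
    """Return the names of features that take a single value across all rows."""
    return [
        name
        for i, name in enumerate(feature_names)
        if len({row[i] for row in rows}) <= 1
    ]


def drop_features(rows, feature_names, to_drop):
    """Remove the named features from every row and from the list of names."""
    keep = [i for i, name in enumerate(feature_names) if name not in to_drop]
    reduced_rows = [tuple(row[i] for i in keep) for row in rows]
    return reduced_rows, [feature_names[i] for i in keep]


# Toy table with three of the 41 features; the values are made up.
feature_names = ["duration", "num_outbound_cmds", "src_bytes"]
rows = [(0, 0, 181), (2, 0, 239), (0, 0, 235)]
constant = constant_features(rows, feature_names)
reduced_rows, reduced_names = drop_features(rows, feature_names, constant)
```

A feature with one value in every record cannot separate classes, so removing it does not lose information for any of the classifiers compared here.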

We evaluated various methods on KDDCUP99 and compared them with [13,14,15,16]; the results are shown in Table 2 and Table 3.

Table 2. Comparison of the accuracy of various methods on KDDCUP99
Table 3. Analysis of various methods on KDDCUP99
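The comparisons in this paper rest on accuracy, detection rate and false detection rate. The text does not define them, so the sketch below assumes the usual confusion-matrix definitions for a binary Normal/Attack decision; the function name `binary_metrics` and the example labels are hypothetical.

```python
def binary_metrics(true_labels, predicted_labels, positive="Attack"):
    """Compute accuracy, detection rate and false detection rate (assumed definitions)."""
    tp = fp = tn = fn = 0
    for truth, predicted in zip(true_labels, predicted_labels):
        if truth == positive and predicted == positive:
            tp += 1   # attack correctly detected
        elif truth == positive:
            fn += 1   # attack missed
        elif predicted == positive:
            fp += 1   # normal traffic raised as an alarm
        else:
            tn += 1   # normal traffic correctly passed
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "detection_rate": tp / (tp + fn) if tp + fn else 0.0,
        "false_detection_rate": fp / (fp + tn) if fp + tn else 0.0,
    }


truth = ["Attack"] * 4 + ["Normal"] * 6
predicted = ["Attack", "Attack", "Attack", "Normal",
             "Normal", "Normal", "Normal", "Normal", "Normal", "Attack"]
metrics = binary_metrics(truth, predicted)
```

Per-class rates for DoS, Probe, U2R and R2L can be obtained the same way by treating one class at a time as the positive label.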

5.2 Preprocessing and Analysis of Various Methods on NSL-KDD Data Set

Because of the invalid results mentioned before, the NSL-KDD data set is used in this research in order to obtain reliable and acceptable results. In studies using this data set, often only the training table is used in order to obtain high percentages, and unreliable results are obtained. For this reason, in this research 50% of the records are extracted randomly from each of the two tables, NSL-Train and NSL-Test, with an appropriate distribution of the Protocol, Service and Flag features, using a simple SQL command; the results of the learning machines are then compared with related works. A review of the different studies shows that the only valid and reliable research that corroborates our method of study is that of A.A. Ghorbani et al. [6]. Analysis of the tables in SQL Server reveals that the num_outbound_cmds feature has the value zero for all rows in both tables. This field is by nature related to ftp and has nothing to do with IDS. Accordingly, this feature is not used in our computations with machine learning methods. The results are shown in Table 4 and Table 5:

Table 4. Analysis of various methods on NSL-KDD
Table 5. Comparison with 40 features and Ref [6]
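The 50% random extraction with a preserved distribution of Protocol, Service and Flag is performed in the text with a SQL command that is not shown. The sketch below is a Python stand-in under stated assumptions: records are grouped by the three features, the same fraction is drawn from each group, and at least one record per group is kept. The function name, the seed and the keep-at-least-one policy are assumptions.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(rows, key_indices, fraction, seed=0):
    """Draw `fraction` of the rows from every group defined by the key columns."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[i] for i in key_indices)].append(row)
    sample = []
    for members in groups.values():
        # Keep at least one record per group so rare combinations survive (assumed policy).
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample


# Toy records: (protocol, service, flag, record id).
rows = (
    [("tcp", "http", "SF", i) for i in range(4)]
    + [("udp", "domain_u", "SF", i) for i in range(2)]
    + [("icmp", "ecr_i", "SF", 0)]
)
sample = stratified_sample(rows, key_indices=(0, 1, 2), fraction=0.5)
group_sizes = Counter(row[:3] for row in sample)
```

Sampling within groups keeps the proportions of protocol, service and flag combinations close to those of the full table, which a plain random draw does not guarantee for rare combinations.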

It can be concluded from Tables 2, 3, 4 and 5 that:

  1. The Naïve Bayes classification method is better than the other approaches for the detection of U2R, R2L and Probe attacks.

  2. The Decision Trees classification method is better than the other approaches for the detection of DoS and Probe attacks.

  3. The Neural Networks classification method is better than the other approaches for the detection of DoS attacks.

  4. The Support Vector Machine classification method is better than the other approaches for the detection of Normal packets.

  5. The accuracy of Neural Networks in distinguishing Normal from Attack is better than that of the other approaches.

6 Feature Selection

Some studies on the KDDCUP99 and NSL-KDD data sets show that researchers use feature selection techniques to select the features that matter most for accuracy and for false positive and false negative detection. Moreover, they select the features most relevant to each other. Unnecessary features that decrease accuracy are ignored. These techniques increase performance and reduce computation time compared to the normal situation (without feature selection). The InfoGain method is used here for the selection of features. This method causes problems for some attacks, which will be discussed later. In this research, the InfoGain method is used to select the most relevant features, and classification is then performed with Naïve Bayes.

6.1 InfoGain

Suppose S is the set of training samples with their corresponding labels, there are m classes, \(s_i\) is the number of samples from class i, and s is the total number of samples in the training set. The expected information needed to classify a given sample is calculated according to the following formula [19]:

$$ \text{I}\left( s_1, s_2, \ldots, s_m \right) = -\sum_{i=1}^{m} \frac{s_i}{s} \log_2 \frac{s_i}{s} $$
(1)

An attribute F with values \(\{f_1, f_2, \ldots, f_v\}\) divides the training set into v subsets \(\{S_1, S_2, \ldots, S_v\}\), where \(S_j\) is the subset that has the value \(f_j\) for the attribute F. Furthermore, \(S_j\) includes \(s_{ij}\) samples of class i. The entropy of the attribute F is obtained by the following formula:

$$ \text{E}\left( \text{F} \right) = \sum_{j=1}^{v} \frac{s_{1j} + \ldots + s_{mj}}{s} \, \text{I}\left( s_{1j}, \ldots, s_{mj} \right) $$
(2)

Therefore:

$$ \text{InfoGain}\left( \text{F} \right) = \text{I}\left( s_1, s_2, \ldots, s_m \right) - \text{E}\left( \text{F} \right) $$
(3)
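Equations (1) to (3) can be checked on a small example with the sketch below. It handles one categorical attribute at a time; inside each subset the class proportions are taken relative to the subset size, which is the usual reading of the term I in Eq. (2). The function names and the toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def expected_information(class_counts):
    """I(s_1, ..., s_m) of Eq. (1), given the number of samples in each class."""
    total = sum(class_counts)
    return -sum((c / total) * math.log2(c / total) for c in class_counts if c > 0)


def info_gain(values, labels):
    """InfoGain(F) of Eq. (3) for an attribute F given as one value per sample."""
    total = len(labels)
    base = expected_information(list(Counter(labels).values()))
    labels_by_value = defaultdict(list)
    for value, label in zip(values, labels):
        labels_by_value[value].append(label)
    # E(F) of Eq. (2): subset weight times the information within that subset.
    entropy = sum(
        (len(subset) / total) * expected_information(list(Counter(subset).values()))
        for subset in labels_by_value.values()
    )
    return base - entropy


labels = ["Normal", "Normal", "Attack", "Attack"]
flag = ["SF", "SF", "S0", "S0"]            # separates the classes perfectly
protocol = ["tcp", "udp", "tcp", "udp"]    # tells nothing about the class
gain_flag = info_gain(flag, labels)
gain_protocol = info_gain(protocol, labels)
```

Ranking all attributes by this value and keeping the highest ones is the selection step; continuous attributes would first have to be discretized, which is not covered here.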

If we apply the InfoGain algorithm to the NSL-KDD data set, we obtain these features: Duration, service, src-bytes, dst-bytes, land, hot, num-failed-login, logged-in, num-compromised, Root-shell, su-attempted, num-root, num-file-creation, num-shells, num-access-files, is-host-login, is-guest-login. When these features are used with Naïve Bayes, we obtain the results presented in Table 6:

Table 6. Analysis of InfoGain + Naïve Bayes on NSL-KDD

In these experiments, various tests using different feature selection methods were carried out to select the best features. However, when the evaluation is based on “SVM”, “Decision Trees” or “Neural Networks”, no good results are obtained for the detection of Probe and DoS attacks.

7 Conclusion

Regardless of the defects of the KDDCUP99 data set, such as data redundancy and duplicated records, reported by A.A. Ghorbani et al. [6] and McHugh [7], the investigation conducted on KDDCUP99 leads to the conclusion that decision trees and SVM outperform the other methods for detecting the normality of input packets. Neural Networks perform better than the other methods for the detection of DoS attacks. Similarly, for Probe attacks, “Decision Trees” are much better than the other methods, while “Naïve Bayes” is the most effective method for detecting U2R and R2L attacks. The evaluations conducted on the NSL-KDD data set show that feature selection techniques cause problems in the detection of Probe attacks on this data set. It can be concluded that, among the mentioned techniques, decision trees give better results than the others for detecting the normality of input packets and DoS attacks. For Probe attacks, the “Naïve Bayes” technique is better than the others, and for U2R and R2L attacks, “InfoGain” combined with “Naïve Bayes” gives better results. For detecting DoS attacks, Probe attacks and the normality of input packets, all features except num_outbound_cmds should be used. This summary is shown in Fig. 2:

Fig. 2. Comparison of some machine learning techniques by category of attack