1 Introduction

Today a wide range use of network-based services and applications in almost public and private organizations require good and adoptive security measures against network and computer intrusions. Intrusions or attacks on computers and networks are activities or attempts to jeopardize main system security objectives which called as confidentiality, integrity and availability. An intrusion detection system (IDS) monitors events occurring in a computer system or a network an analysis them for sign of intrusions (Kent and Mell 2006) . Network-based intrusion detection systems are generally rule-based or anomaly based. Rule-based (misuse-based) detection systems try to detect previously known patterns. The main flaw of the rule-based IDS is their weakness to detect novel attacks. But the anomaly-based approach builds a model based on behavior of normal systems after capturing network traffic and tries to detect patterns that deviate from normal behavior, which called anomaly activities, and alerts the user from these activities. Main superiority of this approach is its functionality against novel and unseen malicious activities. Anomaly detection fall into two different categories (Dua and Du 2011): supervised and unsupervised. In the supervised anomaly detection methods the normal behavior model of system or networks is established by training with a labeled dataset. These behavior models used to classify new network connection and distinguish malign or anomaly behaviors from normal ones. Unsupervised anomaly detection approaches work without any labeled training data and most of them detect malign activities by clustering or outliers-detections techniques. Based on benchmark datasets such as KDD99 and its refined version NSL-KDD which described specifically in Section 4.1, malicious activities (attacks) in network-based systems are divided into four categories:

DoS: Denial of Services, an attacker tries to prevent legitimate users from using service. (e.g. SYN flood),

Probe: an attacker tries to gain information about target host like ports scanning,

R2L: Remote to Local, attackers try to gain access remotely to victim machine like brute force password guessing,

U2R: User to Root, an attacker has local access to the target machine and tries to gain super user privilege like privilege escalation.Most of proposed techniques are tried to gain overall detection rate (Classification Accuracy) without considering the importance of the attacks. As it is defined U2R and R2L attacks can be dangerous in comparison to the other types because they are relatively rare in the field to sample and analysis and also can causes serious damages. Although many models and methods introduced for dealing with anomalous behavior had introduced by researchers, most of them are suffer from addressing dangerous and rare attacks which belong to R2L and U2R categories. For stance Li et al. proposed an intrusion detection system based on support vector machines in Li et al. (2012) which has good detection rate against frequent attacks in training and test set like DoS and also normal behavior but its efficiency against U2R and R2L attack not very desirable. In this paper a novel supervised two-tier classification model is proposed which uses Naïve Bayes, a customized version of k-nearest neighbor (KNN) classifiers and as well as a supervised dimension reduction module to detect anomalies. The main contribution of this work against other existing methods can defined as follow:

  1. 1.

    Multi attack detection by using two classifiers.

  2. 2.

    Lower computational complexity due to optimal dimension reduction.

  3. 3.

    Higher detection rate against rare and dangerous attack types like U2R and R2L categories by applying Certainty-Factor (CF) for similarity measure.

NSL-KDD (Tavallaee et al. 2009) dataset is used to evaluate the proposed model. The experimental results show a desirable detection rate against rare and complex attacks such as U2R and R2L categories. This paper is organized as follows: Related work is given in Section 2, Section 3 covers the proposed model, Section 4 gives detailed of the experiment as well as result and Section 5 covers conclusion and future research issues.

2 Related work

Many models have been proposed for anomaly detection based on artificial intelligence concepts. Most of the proposed models for detecting anomalous activities use statistical approaches such as cluster analysis, Bayes theory and dimension reduction (e.g. Principal Component Analysis (PCA) and fuzzy induction). Leung and Leckie (2005) for finding anomaly activities proposed an unsupervised anomaly detection model which uses density based and grid-base clustering based on subspace algorithm. They did not mention how to deal with specific attack types. Chan et al. (2005) proposed a model based on distance and density of clusters to find out that attacks were often in outlying clusters with statistically low or high densities. Zhang and Zulkernine (2006) proposed a model which combines misuse detection and anomaly detection components using the random forests algorithm, they also used high sampling by random forest to reduce dependency to previous knowledge. Toosi and Kahani (2007) combined a neuro-fuzzy network, the fuzzy inference approach and genetic algorithms to design an intrusion detection system.Their model obtained high detection rate on major attacks but still suffers from low detection rate on rare attacks. Lu and Xu (2009) proposed a three level supervised classification model using decision and Naïve Bayes and also Bayesian clustering to detect anomaly. Since their model exploit multi-level classification approach, it gains good results on different type of attacks. Panda et al. (2010) employed PCA for dimension reduction and also Naïve Bayes for classify anomalous behaviors. They applied several combination feature set to obtained result, their evaluation did not consist unseen attacks. Horng et al. (2011) proposed a model based on support vector machines (SVM) and also using BRICH (Zhang et al. 1996) clustering algorithm to extract prominent features from dataset. Their model also has high detection rate on normal and DoS classes because of their frequent pattern in both training and test set. Kromer et al. (2011) proposed a model which uses fuzzy classification and Evolutionary Algorithms for evolving fuzzy classifiers to detect anomalous activities. Kim and Kim (2014) proposed logistic regression-based anomaly detection system which exploited hierarchical feature reduction to distinguish anomalous behaviors from normal ones. Although this model proposed to address increasing detection rate of rare and dangerous attacks (U2R and R2L), increasing false alarm rate is one of its disadvantages.

3 Proposed model and methodology

To overcome the deficiencies in previous works, a two-tier classification model is proposed (Fig. 1). First tier consists data preprocessing and dimension reduction which has better result for decision making and first stage of classification using Naïve Bayes. At the second tier of the proposed model for better separation between normal and anomalous activities, specific classification using KNN-CF will be performed. A detailed explanation of the proposed model will be explained at the following sections.

Fig. 1
figure 1

Two-tier classification schema which defined in the proposed model overview

3.1 Dimension reduction

For reducing computation time complexity and better classification multi-class Linear Discriminant analysis (LDA) will be performed (Li et al. 2006). LDA is one of the dimension reduction techniques that introduced in Izenman (2008) and it is widely used in signal processing, image processing, bankruptcy and market analysis problems. Although Principal Component Analysis (PCA) can extract features that are the most efficient for representation, it is not useful for discrimination. LDA selects an optimal projection matrix to transform a higher dimensional feature domain to a lower dimensional space while preserving the significant information for data classification (Tan et al. 2010). In the LDA technique two scatter matrices should be define, the first one is S B which defines as between-class scatter matrix and the second one is S W which defines as within-class scatter matrix. In the proposed model LDA scatters dataset from high dimension to lower dimension. Assume there is a set of n d-dimensional samples x i ,..., x n are assigned to k different classes. Each class C i , where i = 1,2,3,, k has n i instances (in the proposed model k = 5 e.g. normal, DoS, Probe, U2R, L2R). Projection matrix W is computed to maximize the between-class scatter matrix (1) and minimize the within-class scatter matrix (2)

$$ S_{B} = \sum\limits_{c=1}^{k} (\mu_{c} - \bar{x} ) (\mu_{c} - \bar{x} )^{T} $$
(1)
$$ S_{W} = \sum\limits_{c=1}^{k} \sum\limits_{i=1}^{n} (x_{i} - \mu_{c} ) (x_{i} - \mu_{c} )^{T} $$
(2)

Where \(\bar {x}\) in (1) is mean of the whole data set denoted by:

$$ \bar{x} = \frac{1}{n} \sum\limits_{j=1}^{n} x_{j} $$
(3)

And μ c is the sample mean for class C i given by

$$ \mu_{c} = \frac{1}{n_{k}} \sum\limits_{x \in C_{k}} x_{i} $$
(4)

Thus, the ratio J is the between-class scatter matrix S B and the within-class scatter matrix S W and can be easily maximized by the projection matrix W r

$$ J = \frac{{W^{T}_{r}} S_{B} W_{r}}{{W^{T}_{r}} S_{W} W_{r}} $$
(5)

After solving the optimization problem, we can easily perform classification on low dimension feature space by projecting the original feature space onto the optimal projection matrix W r . The new obtained feature space has four dimensions called {l d a 1,..., l d a 4}. Figure 2 depicts two first dimension of mapped feature space which have optimum separation in class label instances. As it can see overlaps between classes of attack are still exist and they should be addressed.

Fig. 2
figure 2

Two dimension of train_20 % dataset in new mapped feature space

3.2 Naïve Bayes classifier

Naïve Bayes is an efficient and effective classification algorithm since it assumes all attributes of each instance are independent in given class (conditional independence assumption). Despite the fact that assumption are violated in most of the time, the generated result are so promising and desirable.

In Naïve Bayes, an instance (object) is defined by a feature vector with n attributes, X = (x 1, x 2,..., x n ). Suppose there are m class labels C 1, C 2..., C m , next calculate P(C i |X) for (k = 1,2,..., m) and select the maximum of P(C i |X). Then, the object X is classified into category C i , where P(C i |X) is posterior probability and defined by:

$$ P(C_{i} | X) = \frac{P(C_{i} | X) P(X)}{P(X)} $$
(6)

Since P(X) is a constant, in the proposed model only P(C i |X)P(C i ) will be calculated and then the maximum value will be selected. Since in Naïve Bayes, attributes are independent, therefore it uses (7) for computing probability multiplication:

$$ P(C_{i} | X) = \prod\limits_{k=1}^{n} P(X_{k} | C_{i}) $$
(7)

The probability of P(X 1|C i ), P(X 2|C i ),..., P(X n |C i ) can be evaluated from the training set. The reason for choosing Naïve Bayes is its good performance on low amount of training class instance for classification. After this phase of classification, for better purity in detection the output which labeled as normal behavior will be chosen again by using KNN-CF (explained next) classifier to classify them. For showing the level of dependency among input attributes of Naïve Bayes classifier, four new mapped attributes are assessed by correlation coefficient measure which applied to show degree of dependency among random variables in statistic. This measure giving a value between +1 and 1 inclusive, where 1 is total positive correlation, 0 there is no correlation, and 1 is total negative correlation. Thus, this measure for new mapped attributes of both training set are obtained as Tables 1 and 2 respectively. The obtained value for each pair of attribute in both training shows no strength dependency among them.

Table 1 Assessment of new mapped attributes dependency by correlation coefficient measure in Train_20 %
Table 2 Assessment of new mapped attributes dependency by correlation coefficient measure in T r a i n +

3.3 KNN-CF classification module

As Table 3 shows, most of rare and dangerous malicious instances feature vector had completely overlap with normal ones in distinguished training set. That is why most of classifiers like Naïve Bayes make a wrong decision to gain good separation boundary between these classes. To obtain better separation between anomalous and normal objects the outputs of last classifier which are labeled as normal will be considered as suspected input to k-Nearest Neighbor (KNN) classifier. The KNN approach provides k data points in a given dataset most relevant to a query in a data analysis application. When no further information is available, the k-nearest neighbors of the query in the dataset are k most relevant data distinguished by their distance to objects. Any query can predict the class label of unlabeled object by assigning the most frequent class label occurring in the neighbors. This is a typical method based on the majority rule. Since in most of anomaly detection problems normal class are the major one and anomalous objects are rare, we are dealing with imbalance data problem. To address this issue, certainty factor (CF) associated with Euclidean distance is adopted for similarity measure in the new feature space (Zhang 2010).

Table 3 NSL-KDD dataset normal and malicious feature vector similarity (Bouzida and Cuppens 2006)

CF measures incorporated in KNN classification are as follows.

  • Let N(Q, k) be k nearest neighbor of Q

  • p(C = c i |D) be the ratio of c i in training set D

  • p(C = c i |N(Q, k)) be the ratio of c i in the query result.

Now CF can be computed using (8) and (9) as follows:

if (p(C = c i |N(Q, k))≥p(C = c i |D))

$$ CF (C= c_{i} ,N (Q,k )) = \frac{(p(C= c_{i} | N(Q,k)) - p(C= c_{i} |D) )} {(1- p(C= c_{i} |D) ) } $$
(8)

else

$$ CF (C= c_{i} ,N (Q,k )) = \frac{(p(C= c_{i} | N(Q,k)) - p(C= c_{i} |D) )} {(p(C= c_{i} |D) ) } $$
(9)

Values of C F(C = c i , N(Q, k)) are in the range of [−1,1]. The CF strategy for KNN classification is defined as follows.

$$ SCF = argmax CF \{(C= c_{i} ,N (Q,k ))\} $$
(10)

Before performing classification, for better separation between normal and anomalous classes feature selection will be performed. As mentioned, new feature space consists of four dimensions, in this phase PCA feature selection will be used which selects two effective features (l d a 1, l d a 2) out of four and then classification will be applied. To improve classifier performance in KNN classifier, a bucketing technique using KD-tree data structure is applied to store the reduced training set (Friedman et al. 1977). At this stage a detailed analysis of the proposed model will be explained. The proposed model converts a high dimension data set into lower one and performs its classification by two machine learning classifier. The experimental results will be discussed in the next section.

4 Evaluation and experimental result

At this section first a detailed analysis of the applied dataset will be discussed, then IDS performance metrics will be defined and finally evaluation of the proposed model will be argued.

4.1 NSL-KDD dataset

To evaluate the proposed model NSL-KDD benchmark dataset is used. NSL-KDD (Tavallaee et al. 2009) dataset is new version of KDD99 (KDD Cup 1999) dataset this data set introduce for network intrusion detection systems competition. Each NSL-KDD record consists of a host-to-host connection which has 41 distinguished features (e.g., protocol type, service and flag) and are labeled as normal, anomaly or one of the specific attack names as it presented in Table 4. All attacks fall into four major group: DoS, probe, U2R and R2L. The feature vector consists of three categorical values; five symbolic values and the rest of them are continuous values.

Table 4 Classification of attacks in NSL-KDD data set

Since there were some flaws in original KDD dataset (Panda et al. 2010) and (Tavallaee et al. 2009), in order to evaluate the proposed model NSL-KDD dataset applied. The dataset came with two training set and one test set which contains DoS, Probe, U2R, R2L attack classes beside normal label. Table 5 shows the distribution of class labels in both training set and test set. Test set also containing 17 attack types which did not observe in training set, according to this issue, we can evaluate the proposed model by unseen attacks to show its effectiveness. Although the dataset is refined and does not have redundancy, it still suffers from some problems (McHugh 2000).

Table 5 NSL-KDD data set classes distribution

4.2 Data preprocessing

In the data preprocessing phase for better decision making and mitigating the computational overhead, the original dataset will be converted to a normal form (Han and Kamber 2006), it will be as follows:

  • Each categorical feature values will be assigned to a unique integer number like (TCP = 1, UDP = 2, ICMP = 3).

  • Continuous-valued features will be discretized using logarithm to the base 2 and then casting the result value to integer for avoiding any bias. This step uses (11) for each Continuous-valued z.

    $$ if (z \ge 2) z = \int \log_{2} (z + 1) {\vspace*{-2.5pt}} $$
    (11)

After the normalization for better classification, attack labels will be grouped into four major classes and a normal class.

4.3 Performance metrics

Performance indicators (Gu et al. 2006) for the intrusion detection systems are: True Positive (TP), False Positive (FP), True Negative (TN), False Negative (FN), detection rate and false alarm rate, where:

TP represents that the normal behavior which is correctly predicted as normal,

TN represents the anomaly behavior which is detected correctly,

FP shows that the anomalous behavior which is predicted as normal,

FN means that the normal behavior which is wrongly thought as anomalous behavior,

Detection rate: \( DR = \frac {TP}{FN+TP} \)

False alarm rate: \( FAR = \frac {FP}{FP+TN} \)

Classification Accuracy: \( CA = \frac {TP+TN}{P+N} \)

4.4 Complexity analysis

As it said in the contributions section, model provided “lower computational complexity due to optimal dimension reduction”. With regard to this reduction, in the first tier Naïve Bayes classifier applied which its computational complexity is defined as O(e ×f), where e is number of objects in dataset and f indicates number of attributes. Therefore at this stage, according to LDA transformation the classifier will be fed with only four attributes instance of 41, the computation complexity reduced by ten fold. On other hand at the second tier where KNN classifier is applied, model just needed to remember only two dimensions of training set. As a result of that it would takes up less space than indigenous dataset. In addition the bucketing technique (k-d tree) is used for searching nearest neighbors and due to the preprocessing phase, which all features have an integer value, finding nearest neighbor will be more convenient and in such (two dimensions) k-d tree points takes O( \(\log n\) ) time on average.

4.5 Testing environment and results

The experiment was processed within a MATLAB R2013a environment, which was running on a PC powered by AMD Phenom II X6 3.8 GHz CPU and 12 GB RAM.

The proposed model was trained by both training set (T r a i n_20 %, T r a i n +) and then evaluated by given test set (T e s t +) provided by NSL-KDD which contains 22544 instances. So all the given results in this research are evaluated by this test set. After normalizing test set, the projection matrix (W r ) which obtained from training test applied to test set. As Fig. 3 illustrates the scattering rates of mapped test set not much variant from training set. Another important issue which implies from this figure is the rare and dangerous attack like R2L are so involved with normal behaviors. But the proposed model can almost solve this issue by using k nearest neighbor as it second classifier. Figure 4 shows the result of KNN classification detection rates for various k values, at this step 40 iterations is experimented. According to detection rates of this experiment three k values nominated to applied in the proposed model, k = 3 is chosen because of obtaining better detection rate on rare class of attacks in comparison with other nominated values. Figure 5 depicts effects of this experiment on false positive rate (FPR) in both given training set, these results show there is no significant changes in FPR for different k values. Figure 6 shows comparison between detection rates and false alarms on the all possible combinations of new mapped features in T r a i n_20 % training set. According to this experiment the highest detection rate belongs to combination of the two first attribute (l d a 1, l d a 2). To show usefulness of the proposed model concept for using two level of classification, in Table 6 Detection rates of the first level which belongs to anomalous instances is compared to final decision on incoming objects from test set.

Fig. 3
figure 3

Two first dimension of the test set which mapped by projection matrix of NSL-KDD training set

Fig. 4
figure 4

Detection rate experiment over different k values by NSL-KDD data set

Fig. 5
figure 5

False Positive rate experiment over different k values by NSL-KDD data set

Fig. 6
figure 6

Comparison between detection rates and false alarms using different new mapped feature subsets of NSL-KDD data set

Table 6 Comparison between anomalous instances detection rates (%) of the first level and the refined level of classification

We also gained 4.83 % false alarm using Train_20 % dataset and 5.44 % in using Train+ as training set and using Test+ respectively for testing the proposed model. The comparison results in Table 7 shows that the proposed model obtained better detection rate in normal and the rare attacks (U2R, R2L) and also a close detection rates to other types of attacks against one of the recent works. In comparison to the two classification models, the proposed model also obtained a desirable results.

Table 7 Multi-class classification Detection Rates (%) comparison to existing models

it worth noting that this model is proposed to tackle with deficiency of other existing models in detecting the rare class attacks which is located in the data set and also gaining a promising detection rates of the other types of attack. In addition, the model must compare with multi-class classification ones like HFR-MLR (Kim and Kim 2014) which presented a solution for the same issue. As it can be seen in Table 7. The proposed model outperformed U2R attacks detection rate by two fold increase and also made progress in R2L attacks, in addition the model also caused lower false alarm comparing with one of the latest methods (Kim and Kim 2014). Moreover it should mention that in HFR-MLR authors present their results by experimenting multi-set of attributes which still shows an uncertainty for choosing right attributes set.

As a downside, the proposed model cannot provide an impressive detection rate compared with existing models which had better results confronted the routine attack types like DoS and Probe. In one of the latest approaches (Pervez and Md Farid 2014) which uses SVM classifier to battle anomaly in network, average of classification accuracy is about 81.4 percent and the best obtained accuracy based on figure is about 82.68, in the meanwhile our presented approach obtained 83.24 with a distinction separation among attack types and improve detection rate in the rare ones. Further more the proposed model gained much lower false alarm rate, 5 percent against 15 percent. The undeniable issue is that, it is virtually impossible to have a significant detection rate against the rare attacks and also having impressive low false alarm. As in Table 3 mentioned In the training set we have two feature vectors with the same values but with different class labels. Thus if it is focused on the comparison of false alarm versus these types of attack it be can seen that they have distinct issues. Let’s take a glance at other existing models which had impressive low false alarm and their detection rate against the rare attacks (Table 8), this discrepancy will be revealed.

Table 8 Confusion matrix of existing models which had low false alarm and undesirable detection rate (%) against the rare attacks versus proposed model

In this work two-class (normal or anomaly) of anomaly detection classification problem is also considered, each arriving object which gave one of attack label called as anomaly and other named as normal behavior. Table 9 also provides a binary classification comparison between one-tier approaches and the proposed model which exploited two classifiers. As it can be seen the two-tier model outperformed the others in both detection and false alarm rates.

Table 9 Single-layer Vs. multi-layer binary classification comparison (%) result evaluated by Test+

For evaluating the second tier similarity measure (CF) SMOTE (Chawla et al. 2011) technique was used to generate two balanced training sets from Train_20 % dataset. In the first training set rare instances distribution which belong to U2R and R2L classes have been increased five times and in the second training set tried to make dataset balance by increasing rare attacks class label distribution. Table 10 shows original training set class labels distribution and the generated ones. For evaluating the similarity measure, the proposed model run with and without CF measure. The results show no significant improvement in detection rate of rare classes and also higher false alarm in compared with CF measure. Table 11 shows the obtained detection rate. Figure 7 depicts the proposed model false alarm rate when it used CF similarity measure with original training set and when it did not. As it can be seen CF similarity measure false alarm rate using imbalance training set (original) is lower from the generated training sets.

Table 10 Generated training set by SMOTE technique
Table 11 Detection Rates (%) comparison to Generated training sets using SMOTE technique
Fig. 7
figure 7

comparison between false alarm rates of proposed model on different training set within and without CF similarity measure

5 Conclusion and future works

This paper is proposed a network anomaly detection model which used a data preprocessing, LDA feature reduction module and also two level classifier. The proposed model works with only four mapped feature out of 41 distinguished attributes of NSL-KDD Datasets. Applying two level of classification by Naïve Bayes and CF-KNN which led to gain higher detection rate on the rare and dangerous types of attacks in comparison to existing models.

In summary, if we want to weigh up the pros and cons of the proposed model on the positive side, the two-tier anomaly detection does not need to remember high dimensional and heavy training set for the model consumption due to dimension reduction by LDA and also a feature selection in the second tier and also by such reduction the computation in both tier is reduced. In addition the proposed model relatively relieve the problem of insufficient dealing with the rare attacks which have same behaviors to normal ones due to their feature vectors, which located in training set, by certainty factor similarity measure. But in other hand the model still is incapable to gain the appropriate detection rate against routine and less dangerous kind of attack yet.

For extending the proposed model, we are investigating other dimension reduction technique such as non-parametric techniques for obtaining more useful features and also we are working on fuzzy clustering techniques for better separating normal instances from the anomalous ones to increasing detection rate.