Keywords

1 Introduction

Intrusion Detection is very much essential these days to protect information systems security, especially in the view of worldwide increasing incidents of cyber attacks. Identification of unauthorized use, misuse and attacks on information system is defined as intrusion detection. It is needed because traditional firewalls can’t provide full protection against security breaches. An Intrusion Detection System (IDS) doesn’t prevent an intrusion, it only detects it and informs the operator. It detects a hacker breaking into the system or a genuine user exploiting the system resources.

Primary measurement criteria for an IDS are as follows

  • False Positives i.e. an event being incorrectly identified as an intrusion when none has occurred.

  • False Negatives i.e. an event which IDS fails to identify as an intrusion when it really occurs.

  • True positive i.e. an event being correctly classified as an intrusion when one has occurred.

  • True Negative i.e. an event being not classified as an intrusion when none has occured.

  • Accuracy i.e. how efficient the IDS is in detecting intrusions when it has really occurred.

It is essential to analyze the audit data (generated by the operating systems and networks) in order to estimate the extent of damage occurred, specially in attack trace and listing the attack pattern for future prevention. This makes an IDS a real time detection and prevention tool as well as forensic analysis tool [1].

Artificial intelligence techniques have been widely used by the IDS researchers worldwide due to their generalization capability that help in identifying known intrusions as well as unknown intrusions. Neural Networks have been used to identify both misuse and anomalous patterns [24]. Support vector machines (SVMs) have been emerged as a new and powerful technique for learning from data and in particular for solving classification and regression problems. It has been also proved that SVM yields very good result in different domains for the purpose of classification.

SVMs and their variants have been proposed and used by a number of studies for detection of intrusion. Xiao et al. [5] proposed a combined technique of Ad hoc technology and SVM for effective detection of intrusion. Yendrapalli et al. [6] also applied a biased SVM (BSVM) as an IDS tool on DARPA dataset. Recently, Aburomman and Reaz [7] used an ensemble of SVM with other optimization (Particle Swarm Optimization) and clustering method (k-nn) to detect intrusion. A brief discussion about application of SVM as IDS is provided in Sect. 2.

In this paper our aim is to use a cost-based SVM for intrusion detection. Even though the previous studies have evaluated performance of SVM in detecting intrusion, no study has yet explored the efficacy of cost-based SVM to detect intrusion. The available bench mark data in intrusion detection is highly imbalanced in the sense that number of intrusion samples highly outnumber the number of normal samples indicating imbalance in available data. This motivates the research in this paper to use different cost for different class justifying the use of cost-based SVM as an IDS.

The paper is organized as follows. Section 2, a brief literature review is provided. The learning theory of SVM and cost based SVM is written in Sect. 3. Experimental setup, results and analysis is discussed in Sect. 4. Finally, Sect. 5 concludes the paper.

2 Literature Review

A literature review on application of SVM for intrusion detection has been provided by Kausar et al. [8]. Following the literature review it is found that, Xiao et al. [5] proposed a combined technique of Ad hoc technology and SVM for effective detection of intrusion. They identified an enhanced performance of IDS through selecting feature subset and then optimizing the SVM parameters. A Gaussian Kernel SVM could produce a better performance compared with other kernels. They evaluated their technique on DARPA 1998 dataset which has four different attacks namely DOS, R2L, U2R and probe.

Yendrapalli et al. [6] applied a biased SVM (BSVM) as an IDS tool on DARPA dataset. The experimental result using leave-one-out validation technique showed that the performance of IDS for differentiating between normal and u2r attacks is better achieved through using a standard SVM while that of between probe and R2L is better for BSVM. They also concluded that, the performance of SVMs as IDS depends on suitable choice of the SVM parameters.

Yuancheng et al. [9] used SVM followed by selection of features using Kernel Independent Component Analysis. They validated their approach on KDDCUP99 dataset and the experimental results showed a very promising performance of intrusion detection. The resultant SVM also could classify new types of attack which was not included in the training dataset. The ultimate aim of that study was to decrease false alarm with a penalty of overall classification performance.

Gao et al. [10] optimized the SVM parameters and applied the SVM as an IDS. The found SVM very time efficient in detecting intrusion types and also the generalization ability of it. The experimental result proved a better performance by SVM compared with that of RBFNN (Radial Basis Function Neural Network).

Rung-Ching et al. [11] used rough set theory (RST) to deselect less influential features from the dataset and then applied SVM to classify the type of attacks. They evaluated the approach on KDDCUP99 dataset and achieved a higher accuracy in terms of false positive rate and attack detection rate compared with that of an entropy based feature selections.

Yuan et al. [12] proposed the application of hypothesis test theory to SVM classifier (HTSVM) as an IDS. The hypothesis test theory was adopted to decrease the impact of penalty factor in SVM and thereby the overall performance of SVM was improved. The experimental results of using HTSVM on KDDCUP99 dataset showed a better intrusion detection performance compared with that of C-SVM. The results also showed a very good generalization ability of HTSVM classifier.

Guan et al. [13] used the concept of agent along with SVM as an IDS. Each agent involved one SVM to classify two different type of attacks. Finally, a majority voting approach was applied to decide about the type of attack. The experimental results on KDDCUP99 showed a better detection accuracy when compared with the performance of artificial neural networks.

Xiaomei et al. [14] proposed an adaptive genetic algorithm (AGA) to obtain an optimal penalty factor for SVM. Then the SVM was used to analyze audit and detect attack type. Their experimental results on KDDCUP99 dataset revealed the applicability of SVM as IDS.

3 Support Vector Machine

Support Vector Machines introduced by Vapnik [15] have been widely used for applications in many classification and regression problems. The underlying theory of SVM finds a hyperplane which is optimal in separating data in either of two classes. We refer to this hyperplane as optimal separating hyperplane (OSH). Further to this, the kernel trick of transforming data into a higher dimensional space makes SVM an efficient classification tool. Thus, an OSH generation becomes easier in this transformed feature space. The data vectors that lie closed to the OSH are called as support vectors (SV). Since these SVs determine the OSH, they are very useful in classifying data [16].

Let us, consider a training set \( D = \mathop {\{ (\mathop {\mathbf{x}}\nolimits_{i\,} ,\mathop y\nolimits_{i} )\} }\nolimits_{i = 1}^{L} \), where \( x_{i} \) represents ith input x i  ∈ ℜn and the associated class label is y i  ∈ {−1, +1}. In order to search for an OSH in SVM, every input pattern x is first mapped into a higher dimension feature space ℱ by z = ϕ(x); where, \( \phi (x) \) is a non-linear mapping function as ϕ: ℜn → ℱ. In this case the assumption is that, data are not separable using a linear boundary in real feature space and data in the transformed feature space is linearly separable. Thus, there exists a vector w ∈ ℱ and a scalar value b such that the separating hyperplane is:

$$ \begin{aligned} & {\mathbf{w}} \cdot {\mathbf{z}} + b = 0\;{\text{and}} \\ & y_{i} \left( {{\mathbf{w}} \cdot {\mathbf{z}}_{i} + b} \right) \ge 1- \xi_{i} ,\quad \forall i \\ \end{aligned} $$
(1)

where ξ i (≥ 0) are referred as slack variables. The significance of \( \xi_{i} \) is that, it yields to the misclassified data patterns. Thereby, the term \( \sum\nolimits_{i = 1}^{L} {\mathop \xi \nolimits_{i} } \) is the measure of misclassification during OSH generation. The ultimate aim of OSH generation is to achieve a maximum classification accuracy and minimum training error. The condition of such optimal hyperplane generation considering data ℱ is:

$$ \begin{aligned} & minimize :\quad \quad \frac{1}{2}{\mathbf{w}} \cdot {\mathbf{w}} + C\sum\limits_{i = 1}^{L} {\mathop \xi \nolimits_{i} } \\ & subject\,to :\quad \quad y_{i} \left( {{\mathbf{w}} \cdot {\mathbf{z}}_{i} + b} \right) \ge 1-\upxi_{i} ,{\text{and}}\quad\upxi_{i} \ge 0,\forall i \\ \end{aligned} $$
(2)

here C is a constant parameter known as regularization parameter. This parameter represents a trade off measurement between the maximum margin and minimum classification error.

The solution of Eq. (2) is a Quadratic Programming (QP) problem. In the process of solving Eq. (2) first a primal Lagrangian transformation is formed and then the primal Lagrangian is transformed into a dual.

From the primal and dual form of Lagrangian the following optimal hyperplane is obtained:

$$ \begin{aligned} maximize :\quad \quad & W(\alpha ) = \sum\limits_{i = 1}^{L} {\alpha_{i} - \frac{1}{2}\sum\limits_{j = 1}^{L} {\sum\limits_{j = 1}^{L} {\alpha_{i} \alpha_{j} y_{i} y_{j} K({\mathbf{x}}_{i} ,{\mathbf{x}}_{j} )} } } \hfill \\ subject\,to :\quad \quad & \sum\limits_{i = 1}^{L} {y_{i} \alpha_{i} = 0\quad {\text{and}}\quad 0 \le \alpha_{i} \le C,\quad \forall i} . \hfill \\ \end{aligned} $$
(3)

where α1, α2, …, α L are the non-negative Lagrangian multipliers. The data points x i corresponding to α i  > 0 lie along the margins of decision boundary and are the SVs. The kernel function K(.,.) is an inner product (K(x i , x j ) = ϕ(x i ) ∙ ϕ(x j ) = z i  ∙ z j ) among pairwise input patterns. One condition of kernel function is that it must satisfy the Mercer’s condition [13]. Through determination of the optimum Lagrangian multipliers, optimum solution for weight vector w is obtained as

$$ {\mathbf{w}} = \sum\limits_{i \in SVs} {\alpha_{i} y_{i} {\mathbf{z}}_{i} } $$
(4)

where SVs are the support vectors. For any unknown data vector x ∈ ℜn, the classification is done by

$$ y = f({\mathbf{x}}) = sign({\mathbf{w}} \cdot {\mathbf{z}} + b) = sign\left( {\sum\limits_{i \in SVs} {\alpha_{i} y_{i} K({\mathbf{x}}_{i} ,{\mathbf{x}}) + b} } \right) $$
(5)

In the process of SVM classifier training, one must tune C and choose a suitable kernel function with its parameters. Since, no theory is available about how to choose the best C and kernels, the performance of SVMs for a problem may vary with this choice. Table 1 lists few different types of kernels used in SVMs.

Table 1 Types of kernel functions in SVM

When the number of samples in two classes (positive and negative classes) are grossly unequal, the classification data is regarded as imbalanced. A technique has been proposed by Morik et al. [17] to deal with imbalanced data for SVM learning using different costs (C+ and C instead of single C in Eq. (2)) for each class.

Lacking data for designing a more refined cost model, the cost-factors are chosen so that the potential total cost of the false positives equals the potential total cost of the false negatives. This means that the parameters C+ and C_ of the SVM are chosen to follow the ratio in Eq. (6).

$$ \frac{{C_{ + } }}{{C_{ - } }} = \frac{number\,of\,positive\,training\,examples}{number\,of\,negative\,examples} $$
(6)

The quantity \( \frac{{C_{ + } }}{{C_{ - } }} \) is expressed by a quantity j. Considering the above stated equation, Eq. (2) is reformulated as

$$ \begin{aligned} minimize :\quad \quad & \frac{1}{2}\varvec{w} \cdot \varvec{w} + C_{ + } \mathop \sum \limits_{i = positive\,class} \xi_{i} + C\_\mathop \sum \limits_{{\left\{ {j = negative\,class} \right\}}} \xi_{j} \\ subject \, to :\quad \quad & y_{i} \left( {\varvec{w}.\varvec{z}_{\varvec{i}} + b} \right) \ge 1 - \xi_{i} \;{\text{and}}\;\xi_{i} \ge 0, \forall i \\ & y_{i} \left( {\varvec{w}.\varvec{z}_{\varvec{j}} + b} \right) \ge 1 - \xi_{j} \;{\text{and}}\;\xi_{j} \ge 0,\forall j \\ \end{aligned} $$
(7)

4 Experimental Setup and Results

4.1 Dataset

In this paper we used DARPA99 dataset to evaluate the cost based SVM as an IDS. To generate the dataset an environment was set up to acquire raw TCP/IP dump data for a network by simulating a typical US Air Force LAN. The LAN was operated like a real environment, but being blasted with multiple attacks. A connection is a sequence of TCP packets starting and ending at some well defined times, between which data flows to and from a source IP address to a target IP address under some well defined protocol. Each connection is labeled as either normal, or as an attack, with exactly one specific attack type. Each connection record consists of about 100 bytes. For each TCP/IP connection, 41 various quantitative and qualitative features were extracted. Of this database a subset of 494,021 data were used, of which 20 % represent normal patterns and the rest 80 % are attack data.

4.2 Data Preparation

In this experiment, 15,000 random samples have been chosen from the subset of 494,021 samples. The experiment aims to identify intrusion differentiating it from the normal pattern while the second part of the experiment aims to identify different types of intrusion along with the normal pattern. So the first part is essentially a binary classification problem while the second part is a multiclass problem. For the first part we have defined intrusion as +1 (positive symbolizes attack or intrusion) and normal pattern as −1. For the second part we have defined normal pattern as 1, DoS as 2, Probe as 3, R2L as 4, and U2Su as 5. So it symbolizes a multiclass problem.

The following 10 features (urgent, root_shell, su_attempted, num_root, num_file_creations, num_shells, num_access_files, num_outbound_cmds, is_host_login, is_guest_login) have been deleted from the 15,000 random samples because the value of each feature is zero and hence these features doesn’t contribute anything to the variation factor. The rest 31 features were finally used for the intrusion detection in the experiment. Out of 15,000 random samples—3000 were normal pattern, 11,850 were DoS, 120 were probe, 29 were R2L and 1 was U2Su. This was done in accordance to their proportion in the original subset of 494,021 samples. In the last phase of data preparation, the data set was divided into 5 equal sets each consisting of 3000 samples for a fivefold cross validation scheme.

4.3 Performance Measures

The following three performance measures (accuracy, sensitivity and specificity) were used to assess the performance of the SVM as an IDS.

$$ Accuracy = \frac{TP + TN}{TP + FP + TN + FN} \times 100\,\% $$
(8)
$$ Sensitivity = \frac{TP}{TP + FN} \times 100\,\% $$
(9)
$$ Specificity = \frac{TN}{TN + FP} \times 100\,\% $$
(10)

where TP is the number of true positives, i.e. the classifier identifies an intrusion that was labeled as intrusion; TN is the number of true negatives, i.e. classifier identifies a normal pattern that was labeled as normal; FP is the number of false intrusion identification; and FN is the number of false normal identification. Accuracy indicates overall detection accuracy for both normal and intrusion patterns, sensitivity is defined as the ability of the classifier to accurately recognize a intrusion pattern whereas specificity would indicate the classifier’s ability not to generate a false detection.

In addition to the above measures, classifier’s performance was also evaluated in terms of receiver operating characteristic (ROC). ROC curve plots sensitivity against (1-specificity) as the threshold level of the classifier is varied and depicts the performance of a classifier without regard to class distribution. The area under the ROC curve (AUC) summarizes the quality of classification and is used as a single measure of accuracy.

4.4 Experimental Results and Analysis

4.4.1 Intrusion Detection Between Two Classes: Normal Versus Attack

This section provides the results of two-class (intrusion and normal) detection problem. Two kernel functions—Linear and RBF were used.

For Linear kernel, different values of j (j is the cost factor as defined in Eq. (6)) were used to get different results. For each value of j, fivefold cross validation was performed and then the final value was given by averaging the results from different folds. Table 2 shows the IDS performance for linear kernel cost based SVM with varying values of j.

Table 2 IDS performance using linear kernel cost based SVM

Table 2 clearly depicts that both accuracy and ROC increase with the increase in value of j till j = 2.0. After that they start to drop, within the region studied. When j = 1.0 i.e. equal emphasis is given both to negative class and positive class (may be referred as standard SVM) the accuracy and ROC are 97.94 % and 0.995 respectively. When the value of j is increased to 2.0 i.e. number of negative training examples (i.e. intrusion pattern) is given double the emphasis as compared to number of positive training examples (i.e. not intrusion), both accuracy and ROC improve.

We also used an RBF kernel to the cost based SVM as an IDS. In this case, for each value of j, fivefold cross validation was performed for a varying values of σ and then the final value was given by averaging the results from different folds. Table 3 summarizes the performance IDS while the classifier used is an RBF kernel with cost based SVM. the table clearly depict that higher value of j and lower value of σ (i.e. when the width of the kernel decreases and it becomes more non-linear) give a better measure of accuracy, ROC and sensitivity.

Table 3 IDS performance summary for RBF kernel cost based SVM

It is evident from Table 3 that accuracy, sensitivity and ROC increase with the increase in value of j within the region studied. When j = 1.0 (standard SVM) i.e. equal emphasis is given both to negative class and positive class the accuracy and ROC are 95.80 % and 0.975 respectively. When the value of j is increased to 5.0 i.e. number of negative training examples (i.e. intrusion pattern) is given five times emphasis as compared to number of positive training examples (i.e. not intrusion), both accuracy and ROC improve by a good extent. The improvement in accuracy is almost by 2.5 %, as compared to when equal emphasis is given to both positive and negative classes.

As compared to Linear kernel, there is an improvement in both accuracy and ROC by using RBF kernel. However specificity i.e. the ability of the classifier to identify a normal pattern as normal only increases significantly to 100 % as compared to Linear kernel.

4.4.2 Intrusion Detection Among Multiple Classes

We applied the RBF kernel cost based SVM (as mentioned above RBF kernel provided better performance compared to a linear kernel) to classify the dataset into four attack types: Normal, DOS, Probe, R2L. Figure 1 summarizes the intrusion detection accuracy for j = 2 and varying values of σ for the considered data samples.

Fig. 1
figure 1

Overall accuracy RBF kernel with varying σ

As shown in the graph, we notice that for a small value of σ and j = 2, the performance of RBF kernel with cost based SVM reaches up to 99 %. However, for an increased value of σ, the performance of intrusion detection is not as good as 99 %. This result evidently suggest the importance of choosing SVM parameters for a successful application of SVM as an IDS. Nonetheless, the high detection accuracy and ROC area also encourages the application of cost based SVM for detecting attacks in network environment.

5 Conclusion

In this paper, we proposed the application of a cost based SVM to detect intrusion in a network environment. The experimental results also revealed that, a cost based SVM (i.e. cost factor j ≥ 2) can outperform the standard SVM (i.e. j = 1) while attempting to differentiate whether an event is attack or non-attack (normal). Having proven the efficacy of cost based SVM with RBF kernel, the same method was applied to detect the type of attacks (i.e. Normal, DOS, Probe, R2L) and the experimental results showed an overall accuracy is about 99 %. These high accuracies in attack detection certainly establishes the applicability of cost based SVM as an IDS.