1 Introduction

Nowadays, computer networks play an important role in people's work and life. Network security has become crucial for protecting the confidentiality, integrity, and availability of computer systems [1]. Intrusion detection systems (IDSs), which detect attack activities by analyzing network traffic, are an effective security countermeasure for securing networks.

In recent years, a growing number of machine learning methods have been applied in the area of network intrusion detection [2]. One reason is that the huge amount of network traffic data poses a big challenge for IDSs, so a data reduction method is needed when building an intrusion detection model.

Feature selection is a pre-processing step that reduces the number of features in a machine learning task by removing redundant and irrelevant features [3]. Generally speaking, a feature is relevant when it is highly correlated with the class labels, redundant when it provides no more information than some relevant feature, and irrelevant when it is uncorrelated with the class labels. The most relevant features are selected with a feature selection method before a model is trained. In general, feature selection methods are divided into three kinds: wrapper-based, filter-based, and embedded methods [4]. Wrapper-based methods use classifiers to score given feature subsets according to their predictive power. Filter-based methods do not use a classification model, but rely on the general characteristics of the training data to choose feature subsets with high ranking scores. Embedded methods select features during the training process and are usually specific to given classifiers. When the number of features is large, filter-based methods are computationally more efficient [4].
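To make the distinction concrete, the following minimal sketch contrasts a filter-based selector with a wrapper-based one on a synthetic dataset. It assumes scikit-learn is available; the selector choices (mutual information, recursive feature elimination) are illustrative, not the methods used later in this paper.

```python
# Minimal sketch: filter-based vs. wrapper-based feature selection.
# Assumes scikit-learn; dataset and selector choices are illustrative.
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectKBest, mutual_info_classif
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=500, n_features=40, n_informative=8,
                           random_state=0)

# Filter: score each feature from the data alone, without any classifier.
filter_sel = SelectKBest(score_func=mutual_info_classif, k=10).fit(X, y)
print("filter picks:", sorted(filter_sel.get_support(indices=True)))

# Wrapper: repeatedly drop the weakest features as judged by a classifier.
wrapper_sel = RFE(LinearSVC(dual=False), n_features_to_select=10).fit(X, y)
print("wrapper picks:", sorted(wrapper_sel.get_support(indices=True)))
```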

In practice, different feature selection methods may generate different feature subsets. Hence, how to choose an appropriate feature selection method remains a problem when building an intrusion detection model. Inspired by the idea of ensemble learning, we propose a new feature selection algorithm in this paper, in which several different feature selection methods are employed and the significant features are selected by a majority voting algorithm.

In this paper, five feature selection methods, i.e., CFS [5], LS [6], ReliefF [7, 8], \( \chi^{2} \) [9] and UDFS [10], are integrated into our algorithm. These methods are described in Sect. 2. We then present a new intrusion detection model that combines the SVM (support vector machine) algorithm [11] with our feature selection approach; this model is denoted SVM+Our. To verify the performance of our approach, we also build five other detection models by combining SVM with each of the above five methods, namely SVM+CFS, SVM+LS, SVM+ReliefF, SVM+\( \chi^{2} \) and SVM+UDFS. We conduct experiments on the NSL-KDD dataset [12]. The results show that the SVM+Our model achieves better performance than the others.

The remainder of this paper is organized as follows. In Sect. 2, the feature selection methods used in this work are introduced. Section 3 describes the proposed feature selection algorithm and Sect. 4 experimentally evaluates the performance of the proposed algorithm. The conclusion of this work is given in Sect. 5.

2 Feature Selection Methods

In many practical applications, the number of features is high. A high-dimensional feature space decreases the efficiency and accuracy of a predictive model. Feature selection eliminates irrelevant or redundant features, which improves the classification accuracy of a model and reduces its running time. Therefore, feature selection is an important step before building a model for a classification problem.

Feature selection can be understood as finding the subset of features that leads to the best possible generalization. Among the common feature selection strategies, ReliefF [7, 8] is a feature weighting algorithm, which estimates the quality of each feature according to the correlation between features and classes.
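As a rough illustration, the sketch below implements the basic binary-class Relief weight update that ReliefF generalizes (ReliefF additionally averages over k nearest hits and misses and handles multiple classes); the function and parameter names are ours, not from [7, 8].

```python
import numpy as np

def relief_weights(X, y, n_iter=100, seed=0):
    """Simplified binary Relief (ReliefF generalizes this).
    Assumes features in X are scaled to [0, 1] and y has two classes."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    w = np.zeros(m)
    for _ in range(n_iter):
        i = rng.integers(n)
        dists = np.abs(X - X[i]).sum(axis=1)
        dists[i] = np.inf                                  # skip the sample itself
        hit = np.argmin(np.where(y == y[i], dists, np.inf))   # nearest same class
        miss = np.argmin(np.where(y != y[i], dists, np.inf))  # nearest other class
        # Reward features that differ across classes, penalize the rest.
        w += np.abs(X[i] - X[miss]) - np.abs(X[i] - X[hit])
    return w / n_iter
```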

Selecting relevant features in unsupervised learning settings is hard due to the absence of class labels. The Laplacian Score (LS) is an unsupervised method in which the importance of a feature is evaluated by its locality preserving power [6]. The LS value is computed based on the assumption that two samples are likely to belong to the same class if they are close to each other.
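The following sketch shows one way to compute Laplacian Scores with a heat-kernel kNN graph, following He et al. [6]; the graph parameters are illustrative, and features with smaller scores are ranked as more important.

```python
import numpy as np

def laplacian_score(X, n_neighbors=5, t=1.0):
    """Laplacian Score sketch: smaller score = stronger locality
    preserving power. kNN graph and heat-kernel weights are assumed."""
    n, m = X.shape
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # squared distances
    S = np.exp(-d2 / t)                                   # heat-kernel weights
    knn = np.argsort(d2, axis=1)[:, 1:n_neighbors + 1]    # skip self (col 0)
    mask = np.zeros((n, n), dtype=bool)
    mask[np.repeat(np.arange(n), n_neighbors), knn.ravel()] = True
    S = np.where(mask | mask.T, S, 0.0)                   # symmetric kNN graph
    D = np.diag(S.sum(axis=1))
    L = D - S                                             # graph Laplacian
    one = np.ones(n)
    scores = np.empty(m)
    for r in range(m):
        f = X[:, r]
        f_t = f - (f @ D @ one) / (one @ D @ one)         # remove weighted mean
        scores[r] = (f_t @ L @ f_t) / (f_t @ D @ f_t)
    return scores   # rank features by ascending score
```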

Nguyen [5] focused on the Correlation-based Feature Selection (CFS) method, which calculates the merit of a feature subset according to a correlation formula. The CFS method was developed by Hall in 2000; its goal is to choose an optimal feature subset that is highly associated with the class, based on a heuristic estimate [13].
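Concretely, Hall's heuristic scores a candidate subset S of k features as (in the usual notation, where \( \bar{r}_{cf} \) is the mean feature-class correlation and \( \bar{r}_{ff} \) the mean feature-feature correlation):

$$ Merit_{S} = \frac{k\,\bar{r}_{cf}}{\sqrt{k + k(k - 1)\,\bar{r}_{ff}}}, $$

so a subset is rewarded for features that correlate with the class and penalized for features that correlate with each other.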

The chi-square (\( \chi^{2} \)) method was proposed based on statistical theory [9] and is used to measure the statistical correlation between features and classes. It describes the degree of independence, or deviation, between the actual values and the expected values. The chi-square values between attributes and class labels are calculated and sorted, and the optimal feature subset is selected from the sorted chi-square values.
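A minimal sketch of chi-square feature ranking with scikit-learn is given below. Note that the \( \chi^{2} \) statistic requires non-negative feature values, so the data are min-max scaled first; keeping the top 10 features is an illustrative choice.

```python
# Minimal sketch: rank features by chi-square statistic (scikit-learn).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import chi2
from sklearn.preprocessing import MinMaxScaler

X, y = make_classification(n_samples=500, n_features=40, random_state=0)
X = MinMaxScaler().fit_transform(X)        # chi2 needs non-negative inputs

scores, p_values = chi2(X, y)              # one statistic per feature
top = np.argsort(scores)[::-1][:10]        # 10 largest chi-square values
print("top features by chi-square:", sorted(top.tolist()))
```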

The Unsupervised Discriminative Feature Selection (UDFS) algorithm aims to select the most discriminative features. It exploits the manifold structure of the data, which distinguishes it from existing unsupervised feature selection algorithms [10].

3 The Proposed Algorithm

For large-scale data, a feature filtering method is usually adopted to reduce the dimensionality, speed up execution, and improve the detection accuracy. However, different feature selection methods assign inconsistent importance to the same feature, so the important features selected by different methods also differ. So far, there is no unified standard for judging the merit of each feature selection method, which makes the selection of important features difficult. In this paper, we propose a novel feature selection approach to solve this problem.

The proposed approach combines several feature selection methods based on the majority voting algorithm, which is one of the simplest and most effective data fusion strategies. In this paper, the majority voting algorithm is employed as the basic selection strategy to form a new feature selection algorithm.

Given a dataset D with samples \( X_{1}, X_{2}, \cdots, X_{n} \), each sample has m features: \( F_{1}, F_{2}, \cdots, F_{m} \). Suppose g feature selection methods are used in our approach and k important features are expected. The execution process of the proposed approach is described as follows:

Step 1. Evaluate the importance of each feature with each of the g feature selection methods, and sort the features in descending order of importance. The sequence of features generated by the i-th feature selection method is denoted as \( a_{i1}, a_{i2}, \ldots, a_{ik}, \ldots, a_{im} \). Then, a matrix of the feature sequences from the g feature selection methods is defined as:

$$ A = \left[ \begin{array}{cccc} a_{11} & a_{12} & \cdots & a_{1m} \\ a_{21} & a_{22} & \cdots & a_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ a_{g1} & a_{g2} & \cdots & a_{gm} \end{array} \right]. $$

Step 2. Let \( B_{h} \) be the matrix composed of the first h columns of matrix A, i.e.,

$$ B_{h} = \left[ \begin{array}{cccc} a_{11} & a_{12} & \cdots & a_{1h} \\ a_{21} & a_{22} & \cdots & a_{2h} \\ \vdots & \vdots & \ddots & \vdots \\ a_{g1} & a_{g2} & \cdots & a_{gh} \end{array} \right]. $$

Let \( \varphi (B_{h} ) \) denote the number of features selected from matrix \( B_{h} \) by the majority voting algorithm, i.e., the number of features that appear in more than half of the rows of \( B_{h} \).

Step 3. Find an integer l such that

$$ l = \mathop{\arg\min}\limits_{h} \left( \varphi(B_{h}) = k \right), $$

i.e., the smallest h for which exactly k features receive a majority vote. Then, select the features from matrix \( B_{l} \) by the majority voting algorithm. The selected features are the k important features.
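A minimal sketch of Steps 1-3 is given below. It assumes each base method is supplied as a ranking (a row of matrix A, i.e., feature indices sorted by descending importance) and reads "majority voting" as requiring a feature to appear in more than half of the rows of \( B_{h} \); the function name and that threshold are our assumptions.

```python
import numpy as np

def vote_select(rankings, k):
    """Majority-voting feature selection (Steps 1-3).
    rankings: g rows, each a permutation of the m feature indices sorted
    by descending importance (matrix A). Returns the k selected features.
    Assumes a feature needs votes from more than half of the g methods."""
    A = np.asarray(rankings)
    g, m = A.shape
    for h in range(1, m + 1):              # grow B_h one column at a time
        feats, votes = np.unique(A[:, :h], return_counts=True)
        selected = feats[votes > g / 2]    # features with a majority vote
        if len(selected) == k:             # phi(B_h) = k, so l = h
            return selected
    # The paper assumes such an l exists; phi may skip k if several
    # features cross the majority threshold at the same h.
    raise ValueError("no h with exactly k majority-voted features")
```

In the setting of this paper, the rankings would come from CFS, LS, ReliefF, \( \chi^{2} \) and UDFS, so g = 5 and a feature needs at least three votes to be selected.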

In this paper, we integrate five feature selection methods into the proposed approach: CFS, LS, ReliefF, \( \chi^{2} \) and UDFS, which are introduced in Sect. 2.

4 Experiments

4.1 Dataset

In this paper, we use the NSL-KDD dataset as a benchmark to verify the performance of the proposed approach. The NSL-KDD dataset is a modified version of the well-known KDDCup99 dataset and resolves some of the inherent problems of the KDDCup99 dataset [12].

In our experiments, we used one training set (KDDTrain+20%) and two testing sets (KDDTest+ and KDDTest-21). KDDTrain+20% is a 20% subset of the full NSL-KDD training set. KDDTest+ is the full NSL-KDD testing set, and KDDTest-21 is a subset of KDDTest+ that excludes the records correctly classified by all 21 learners used in [12]. The features and attack types are described in Tables 1 and 2, respectively.

Table 1. Description of features
Table 2. Description of attack types

4.2 Evaluation Criteria

The confusion matrix is a useful tool for comparing predicted outcomes with the ground truth (see Table 3). In Table 3, TP is the number of attacks that are correctly predicted; TN is the number of normal events that are correctly identified; FP is the number of normal connections that are wrongly predicted as attacks; and FN is the number of attacks that are wrongly classified as normal.

Table 3. Confusion matrix

Four evaluation criteria are used in our experiments: Accuracy (Acc), Precision, False Alarm Rate (FAR), and Recall [14, 15]. These four criteria are defined as follows.

$$ Acc = \frac{TP + TN}{TP + TN + FP + FN} $$
(1)
$$ Precision = \frac{TP}{TP + FP} $$
(2)
$$ FAR = \frac{FP}{TN + FP} $$
(3)
$$ Recall = \frac{TP}{TP + FN} $$
(4)
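For reference, the four criteria can be computed directly from the confusion-matrix counts; the counts in the sketch below are placeholders, not results from the paper.

```python
# Compute Eqs. (1)-(4) from confusion-matrix counts (placeholder values).
def metrics(tp, tn, fp, fn):
    acc = (tp + tn) / (tp + tn + fp + fn)      # Eq. (1): Accuracy
    precision = tp / (tp + fp)                 # Eq. (2): Precision
    far = fp / (tn + fp)                       # Eq. (3): False alarm rate
    recall = tp / (tp + fn)                    # Eq. (4): Recall
    return acc, precision, far, recall

print(metrics(tp=900, tn=800, fp=50, fn=100))
```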

4.3 Experimental Results

In this section, we conduct two experiments to verify the performance of the proposed approach. The effectiveness of our feature selection approach is measured by combining it with SVM; the corresponding detection models are listed in Sect. 1.

In the first experiment, we estimate the influence of the number of important features k on the accuracy of each detection model. We test six values of k: 10, 15, 20, 25, 30, and 35. The results are shown in Table 4 (the best result in each column is highlighted in boldface). From Table 4, we observe that the SVM+Our model achieves the best accuracy in five of the six columns on KDDTest+; only for k = 10 does it obtain the second-best result. Similar results can be found on KDDTest-21. From the results in Table 4, we conclude that our feature selection algorithm chooses more useful features than any single feature selection method. In addition, SVM+Our achieves the best accuracy on both testing sets when the number of important features is 25. Thus, for our feature selection algorithm, we suggest choosing 25 important features.

Table 4. The accuracy of each model with different number of features (k)

The second experiment compares the performance of each model in terms of the four evaluation criteria described in Sect. 4.2. The results are shown in Table 5; in this experiment, the number of important features is 25. From Table 5, we can see that the overall performance of plain SVM on both testing sets is the worst, which indicates that feature selection is necessary in intrusion detection. For Accuracy and Precision, SVM+Our achieves the best results, and for Recall, its results are close to the best. Additionally, SVM+Our obtains the best and second-best false alarm rates on KDDTest-21 and KDDTest+, respectively. The results in Table 5 indicate that our algorithm is more effective than any individual feature selection method.

Table 5. Comparison of different models in terms of four criteria

5 Conclusion

In this paper, a novel feature selection algorithm is proposed, which combines several feature selection methods and chooses important features using the majority voting rule. In this work, our algorithm integrates five individual feature selection methods. To verify the effectiveness of the proposed approach, we build intrusion detection models by combining the SVM classification method with the proposed approach and with each individual feature selection method. Experiments on the NSL-KDD dataset show that our feature selection algorithm is more effective than any single method.