1 Introduction

As modern technology advances rapidly and networks continue to expand worldwide, cyberspace has come to occupy an important position as the fifth dimension [1]. However, more and more user devices are accessing the network, which creates great difficulties for cyberspace security. The emergence of complex and volatile cyber attacks keeps the network security situation critical. Therefore, how to effectively detect and differentiate various intrusions has become a key topic of network security research.

Intrusion detection systems (IDS), as an active form of security protection, can monitor the network in real time. They can better detect network attacks and provide relevant personnel with important information for response decisions [2]. In addition, they can actively defend against network attacks, take corresponding actions before the network is compromised, and effectively maintain network security. In recent years, many new algorithms have been applied in the field of IDS to build relatively stable and high-accuracy intrusion detection systems. However, with the dramatic increase in network data, bandwidth has grown significantly. In addition, complex features and diverse information have given rise to an increasing number of new forms of attack. Shallow machine learning (ML), which relies too heavily on feature learning, is no longer able to handle massive network intrusion data [3]. Deep learning (DL) builds a nonlinear network containing many hidden layers to learn the underlying patterns of the sample information. It not only handles high-dimensional data but is also highly efficient. Therefore, deep learning has important value for processing network traffic.

In the face of increasingly sophisticated network attacks and massive amounts of high-dimensional data, traditional security techniques are gradually failing, and deep learning algorithms are increasingly being applied to network intrusion detection [4]. DL techniques can overcome the detection problems of shallow learning algorithms, effectively extract deep features from network data, and speed up model convergence. In recent years, DL-based intrusion detection methods have made progress, but a number of unresolved issues remain. First, in real network environments the traffic of normal behavior is often significantly larger than the traffic of abnormal behavior. In addition, the numbers of samples of the different attack types are unbalanced, so the model detects rare classes poorly. Second, the complexity and diversity of network attacks mean that existing detection techniques are not well adapted to the constantly changing types of attack. New types of attacks also pose a major threat to network security, which has become a major problem in applying deep learning to intrusion detection [5, 6].

To address the low detection accuracy caused by new types of attacks and data imbalance in the network, this paper proposes an effective network anomaly detection model, CNN-CGAN, which combines an improved GAN with a CNN. CNN-CGAN addresses the class distribution imbalance problem through data augmentation and extracts deep features from the network data with a CNN, thus effectively improving the accuracy of the model in detecting attack data. The contributions of the CNN-CGAN model proposed in this article are as follows:

  1. Chi-square test performs feature screening on network data, extracts effective features in the network, and reduces confusion among various attack categories. In addition, feature selection can also accelerate model convergence and improve model training efficiency.

  2. CGAN takes the real data and preset labels in the network as input, and completes data augmentation by generating specific types of attack data, thereby reducing the imbalance of the training set. In addition, synthetic data can be used to simulate unknown cyberattacks, helping to improve the performance of detection models in identifying minority and unknown attacks.

The rest of the paper is structured as follows. Section 2 presents related work on intrusion detection. Section 3 presents an anomaly detection framework combining GAN and CNN. Section 4 analyzes the experimental results and the performance of the CNN-CGAN model. Section 5 concludes the paper.

2 Related Works

With the rapid development of machine learning, its classification techniques have been adopted as detection models for various network security monitoring tasks. The authors in [7] reported the detection performance and experimental results of various machine learning models, including decision trees, in intrusion detection. Reference [8] compared traditional ML and DL in intrusion detection, and showed that traditional methods are no longer effective in big data environments, while deep learning can analyze large volumes of data and classifies well. In [9], the authors proposed a recurrent neural network-based method for abnormal traffic identification, and investigated the effect of different learning rates on accuracy and training time in binary and multi-class classification. Literature [10] built an intrusion detection model based on the gated recurrent unit (GRU) by analyzing the characteristics of intrusion detection data; it achieved detection accuracy comparable to long short-term memory (LSTM) with shorter training and detection times.

However, network data are unbalanced, with significantly more normal traffic than abnormal traffic. This means that the detection model focuses more on normal behavior and less on correctly identifying the small number of minority samples. To address this issue, existing research generally increases the number of minority samples through oversampling techniques such as random oversampling [11], SMOTE [12] and adaptive synthesis [13]. Reference [14] proposed a small-sample oversampling technique based on Radius-SMOTE, which uses a safe radius distance to create synthetic data, reducing the overlap problem of synthetic data. In addition, the authors of [15] studied an anomaly detection method based on GAN and RF, which successfully solved the overfitting problem of traditional oversampling techniques and exhibited good detection performance.

In conclusion, this research shows that obtaining good training data is vital to solving the sample imbalance problem, which remains very challenging in current network traffic anomaly identification.

3 Methods

The intrusion detection method proposed in this paper, which combines an improved generative adversarial network with a CNN, consists of two main parts: data processing and model training. The overall block diagram is shown in Fig. 1.

3.1 Data Processing

This module consists of three parts: data normalization, feature selection, and sample balancing through data augmentation. Its main role is to perform feature selection and data balancing on the dataset used to train the model.

Fig. 1. Overall block diagram of intrusion detection

Data Normalization. This step covers data cleaning, one-hot encoding, and data normalization, laying the foundation for later training. First, we remove missing and invalid values from the dataset. Then, we convert text-type features to numeric types, since datasets often contain both numeric and string-type features and machine learning models cannot be trained on string-type data. Finally, some numerical features differ greatly in magnitude, which may cause problems such as slow network convergence. Therefore, to reduce the influence of differing feature scales on the analysis results, the data need to be normalized. The feature values are restricted to the range [0, 1] using min-max normalization, which can be described as follows:

$$\begin{aligned} {D}'=\frac{D-{{D}_{\min }}}{{{D}_{\max }}-{{D}_{\min }}} \end{aligned}$$
(1)

where D is the original data to be normalized, and \({{D}_{\min }}\) and \({{D}_{\max }}\) are the minimum and maximum values of the feature in the dataset, respectively.
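
As an illustration of Eq. (1), the following minimal Python sketch scales one numeric feature column to [0, 1]. The function name, the sample values, and the handling of a constant column (where Eq. (1) would divide by zero) are assumptions made for this example and are not taken from the paper's implementation.

```python
def min_max_normalize(column):
    """Scale a list of numeric feature values to [0, 1] following Eq. (1)."""
    d_min, d_max = min(column), max(column)
    if d_max == d_min:
        # Constant feature: Eq. (1) is undefined (0/0).
        # Mapping every value to 0.0 is an assumption of this sketch.
        return [0.0 for _ in column]
    return [(d - d_min) / (d_max - d_min) for d in column]


# Hypothetical values of one numeric feature (for example, a byte count).
feature_values = [0, 50, 100, 200]
print(min_max_normalize(feature_values))  # [0.0, 0.25, 0.5, 1.0]
```

In practice the minimum and maximum would be computed on the training set and reused for the test set, so that both are scaled consistently.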

Feature Selection. The second step of data processing is feature selection. In the NSL-KDD dataset used in this paper, the attack data are divided into four types: denial-of-service attacks (DoS), port monitoring or scanning (Probe), unauthorized access from a remote host (R2L), and unauthorized access to local superuser privileges (U2R). Each record in the dataset consists of 41 features and 1 label.

Feature selection can improve model efficiency and detection accuracy because it removes features that are unrelated to the label and reduces the computational burden. The method chosen in this paper is the chi-square test, which measures the deviation between the observed values of a statistical sample and the expected values, and thereby reflects the degree of correlation between a feature and the label, so that the features to be retained can be identified. In Eq. (2), M denotes the observed value and E denotes the expected value. The observed values are the counts obtained from the feature values within each label category, and the expected values are the counts that would arise if the feature were independent of the label. The larger the deviation between the observed and expected values, the stronger the dependence between the feature and the label, and the more important the feature is likely to be. Therefore, feature selection can be performed by calculating the chi-square value between each feature and the label and ranking the features by this value.

$$\begin{aligned} \chi _c^2 = \sum \limits _i {\frac{{{{({M_i} - {E_i})}^2}}}{{{E_i}}}} \end{aligned}$$
(2)
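
The ranking procedure can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: each feature is assumed to be categorical (or already binned), the observed counts M_i are the cells of the feature-versus-label contingency table, the expected counts E_i are those implied by independence, and the toy feature values are hypothetical.

```python
from collections import Counter


def chi_square_score(feature, labels):
    """Chi-square statistic of Eq. (2) for one categorical feature against the labels."""
    n = len(labels)
    observed = Counter(zip(feature, labels))  # M_i: count per (value, label) cell
    feature_totals = Counter(feature)
    label_totals = Counter(labels)
    score = 0.0
    for value, f_count in feature_totals.items():
        for label, l_count in label_totals.items():
            expected = f_count * l_count / n  # E_i if feature and label were independent
            score += (observed[(value, label)] - expected) ** 2 / expected
    return score


# Hypothetical toy data: "flag" follows the label exactly, "protocol_type" does not.
labels = ["normal", "normal", "dos", "dos"]
features = {
    "flag": ["SF", "SF", "S0", "S0"],
    "protocol_type": ["tcp", "udp", "tcp", "udp"],
}

scores = {name: chi_square_score(col, labels) for name, col in features.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(scores)  # {'flag': 4.0, 'protocol_type': 0.0}
print(ranked)  # ['flag', 'protocol_type']
```

A feature whose values are distributed identically across the label categories scores 0 and would be dropped first; how many top-ranked features to keep is a separate choice that this sketch does not make.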

Data Imbalance. Handling data imbalance is the third step of data processing; it addresses the problem that normal traffic far outnumbers abnormal traffic in network data. In this paper, CGAN is used to generate data for the rare classes, and the synthetic data are mixed with the real data and shuffled for subsequent model training.

Generative Adversarial Network (GAN) [16] draws on the idea of a zero-sum game and consists of a generator and a discriminator; its structure is shown in Fig. 2. The discriminator is a binary classifier built from a neural network that judges whether an input sample is real or fake. The generator, in turn, learns the latent distribution of the real samples and generates fake samples that approximate that distribution.

Fig. 2. The network structure of GAN

The generator G and discriminator D of the GAN are trained by minimizing the Jensen-Shannon divergence between the generated and real distributions, with the following objective function.

$$\begin{aligned} \mathop {\min }\limits _G \mathop {\max }\limits _D V(D,G) = {E_{x \sim {p_r}}}[\log D(x)] + {E_{\tilde{x} \sim {p_g}}}[\log (1 - D(\tilde{x}))] \end{aligned}$$
(3)

where \(\tilde{x} = G(z)\) is the synthetic data generated by the generator, and z denotes random noise sampled from p(z). The real data distribution and the distribution of the generated samples are denoted by \({p_r}\) and \({p_g}\), respectively. The two neural networks compete with each other and are optimized iteratively. First, the discriminator D is trained to improve its accuracy in distinguishing real from fake data; then the generator G is optimized to produce more realistic synthetic data and deceive the discriminator D.
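
To make the roles of the two terms in Eq. (3) concrete, the sketch below estimates V(D, G) from a batch of discriminator outputs by replacing the expectations with sample means. The function name and the probability values are hypothetical; no actual networks are trained here.

```python
import math


def gan_value(d_real, d_fake):
    """Sample-mean estimate of V(D, G) in Eq. (3).

    d_real: discriminator outputs D(x) on real samples, each in (0, 1)
    d_fake: discriminator outputs D(G(z)) on generated samples, each in (0, 1)
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term


# Hypothetical discriminator outputs.
undecided = gan_value([0.5, 0.5], [0.5, 0.5])  # D cannot tell real from fake
confident = gan_value([0.9, 0.9], [0.1, 0.1])  # D separates the two well

print(undecided)              # equals -2 * log(2)
print(confident > undecided)  # True: a better discriminator yields a larger V
```

The discriminator's updates push V upward (towards the `confident` case), while the generator's updates push it back down by making D(G(z)) larger.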

The generator of a GAN can synthesize data from random noise, but it cannot generate data of a specific type. Therefore, we add conditional constraints to the generator and discriminator of the GAN so that both the data and the corresponding labels are used for training. As shown in Fig. 3, the generator and discriminator of CGAN take a label as an additional input, which can be specified to generate samples of the desired type. The loss function of CGAN can be described as:

$$\begin{aligned} \mathop {\min }\limits _G \mathop {\max }\limits _D V(D,G) = {E_{x \sim {p_r}}}[\log D(x|y)] + {E_{z \sim {p_z}}}[\log (1 - D(G(z|y)))] \end{aligned}$$
(4)

where y is the one-hot encoded label of each attack category in the data.

Fig. 3. The network structure of CGAN

The CGAN model is built with Keras. First, a generator for synthesizing data and a discriminator for judging the authenticity of the data are constructed, and then the two models are connected and trained. The details are as follows:

Fig. 4. The network structure of generator and discriminator

  1. Construction of generator and discriminator: The generator and discriminator in CGAN are fully connected networks that mainly consist of Dense, LeakyReLU and BatchNormalization layers. Fig. 4(a) shows the network structure of the generator, whose output layer uses the tanh function, and Fig. 4(b) shows the structure of the discriminator, whose output layer uses the sigmoid function. The generator takes random noise and a label as input and outputs a synthetic sample. Specifically, an Embedding layer converts the label into a dense vector with the same length as the random noise vector, and a Multiply layer then multiplies the label vector and the noise vector element-wise to form the generator's input. The discriminator takes a sample-label pair as input and outputs the probability that the pair is real. It uses an Embedding layer to turn the label into a dense vector with the same length as the sample, and then uses a Multiply layer to multiply the label vector and the sample to form its input.

  2. Model connection and training: The constructed generator and discriminator are trained against each other to build a complete CGAN model. The specific training process is as follows. First, we randomly initialize the model parameters, keep the generator G fixed, and train the discriminator D, using the Adam optimizer to update the network parameters. Then, the training of the discriminator is suspended and the generator updates its parameters using the loss fed back to it. Finally, these two steps are repeated, with gradients updated by the Adam optimizer, until the loss value or the number of training cycles reaches its preset threshold.
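
The label conditioning described in item 1 can be illustrated without a deep learning framework. The following sketch is a plain-Python stand-in for the Embedding and Multiply layers on the generator side. The noise length, the random embedding table (which would be trainable in a real model), and the integer label codes are all assumptions of this example and do not reproduce the paper's Keras model.

```python
import random

NOISE_DIM = 8    # assumed length of the noise vector
NUM_CLASSES = 5  # normal traffic plus the four attack categories

rng = random.Random(0)

# Stand-in for a trainable Embedding layer: one dense vector per class label,
# with the same length as the noise vector.
embedding = [[rng.uniform(-1.0, 1.0) for _ in range(NOISE_DIM)]
             for _ in range(NUM_CLASSES)]


def conditioned_input(noise, label):
    """Element-wise product of the noise vector and the label's embedding."""
    label_vector = embedding[label]                      # Embedding lookup
    return [z * e for z, e in zip(noise, label_vector)]  # Multiply layer


noise = [rng.gauss(0.0, 1.0) for _ in range(NOISE_DIM)]
# Label codes 3 and 4 are hypothetical integer codes for two minority classes.
input_a = conditioned_input(noise, label=3)
input_b = conditioned_input(noise, label=4)

print(len(input_a))        # 8: same length as the noise vector
print(input_a != input_b)  # True: the same noise yields a different input per label
```

Because the same noise vector is mapped to a different generator input for each label, the label chosen at generation time determines which class of sample the generator is asked to produce. The discriminator side works the same way, with the embedding length matching the sample length instead of the noise length.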

3.2 Model Training

This part uses a CNN to train binary and multi-class models. The binary model separates normal traffic from abnormal traffic. The multi-class model identifies the various attack types and is used to compare the detection accuracy of small-sample classes before and after data balancing. In addition, we continuously tune and optimize the model for better training results, improving its accuracy and generalization ability.

4 Experiments

4.1 Metrics of Performance Evaluation

To study the impact of different optimization methods on the classifier and to evaluate its performance effectively, we select four widely used performance indicators: accuracy, precision, detection rate (DR) and F1 score. The confusion matrix, which defines the true positive (TP), true negative (TN), false positive (FP) and false negative (FN) counts used below, is shown in Table 1.

Accuracy is the most commonly used and most intuitive performance indicator. It is the ratio of the samples correctly predicted by the detection model to all samples. The higher the classification accuracy of the model, the better its classification performance. It is defined in Eq. (5):

$$\begin{aligned} A\mathrm{{cc}} = \frac{{TN + TP}}{{TN + TP + FN + FP}} \end{aligned}$$
(5)

DR, or recall, is the proportion of all actual positive samples that are predicted to be positive. It is defined in Eq. (6):

$$\begin{aligned} DR = \frac{{TP}}{{FN + TP}} \end{aligned}$$
(6)

Precision is the proportion of all samples classified as positive that are truly positive. The higher the precision, the better the performance of the classification model. It is defined in Eq. (7):

$$\begin{aligned} Precision = \frac{{TP}}{{FP + TP}} \end{aligned}$$
(7)

The F1 score, also known as the F-Score, is the harmonic mean of precision and recall. On an imbalanced dataset, F1 measures the performance of the model better than accuracy alone (\(F1 \in [0,1]\)). F1 is defined in Eq. (8):

$$\begin{aligned} F1 = \frac{{2 \times TP}}{{2 \times TP + FN + FP}} \end{aligned}$$
(8)

Table 1. Confusion matrix
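
The four indicators in Eqs. (5) to (8) can be computed directly from the confusion-matrix counts, as in the following minimal sketch. The function name and the counts are hypothetical and are not results from the paper.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, detection rate, precision and F1 as defined in Eqs. (5)-(8)."""
    accuracy = (tn + tp) / (tn + tp + fn + fp)  # Eq. (5)
    detection_rate = tp / (fn + tp)             # Eq. (6), also called recall
    precision = tp / (fp + tp)                  # Eq. (7)
    f1 = 2 * tp / (2 * tp + fn + fp)            # Eq. (8)
    return {
        "accuracy": accuracy,
        "detection_rate": detection_rate,
        "precision": precision,
        "f1": f1,
    }


# Hypothetical counts for a binary normal-versus-attack classifier.
metrics = classification_metrics(tp=80, tn=90, fp=10, fn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

For these counts, accuracy is 0.85 and the detection rate is 0.80, while F1 (160/190) lies between the precision (80/90) and the detection rate, as expected of a harmonic mean. For multi-class evaluation, the same function can be applied per class by treating that class as positive and all others as negative.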

4.2 Dataset

The dataset selected for the experiment is the NSL-KDD dataset, which consists of five categories; the data distribution of each category is shown in Table 2. The original training and test sets are highly unbalanced: the smallest category in the training set accounts for only 0.04%, and the test set contains much more R2L and U2R data than the training set. At the same time, many attack categories in the test set do not appear in the training set, which places demands on the generalization ability of the model. These problems further highlight the role of sample balancing and feature selection.

Table 2. Various types of data distribution

4.3 Model Performance Comparison

We split the NSL-KDD dataset into a training set and a test set for training and evaluating the model, respectively. The evaluation metrics of the different classifier models on the dataset are shown in Fig. 5. The CNN-CGAN model proposed in this paper achieves higher accuracy, recall and F1 score than the other intrusion detection systems. The proposed detection method can therefore effectively classify anomalous traffic in the network, and the classification accuracy is improved by 3%.

Fig. 5. Performance comparison under different classification

Figure 6 shows how different data enhancement models improve detection performance. All the data enhancement techniques improve detection accuracy and the other metrics, and the detection performance of the model proposed in this paper improves significantly. The experimental results show that CGAN balances the data more effectively than the other methods.

Fig. 6. Effects comparison of different data enhancement methods for detection.

From Fig. 7, we can see that CGAN improves detection performance for all four attack categories. The R2L and U2R categories, which have fewer samples, show larger performance improvements. Therefore, it can be concluded that CGAN substantially improves the detection accuracy for small-sample classes.

Fig. 7. Comparison of the detection performance of different models for various types of attacks

5 Conclusion

In this paper, we propose an anomaly identification method based on an improved generative adversarial network, which addresses the low detection accuracy for attack data caused by data imbalance in existing networks. First, we perform feature screening with the chi-square test, which speeds up model training. Then, we use CGAN to generate designated attack data whose distribution is similar to that of the original data, conditioned on the input label categories, to achieve data augmentation and reduce training data imbalance. Finally, our detection method is verified to have better detection performance on the NSL-KDD dataset. In addition, it improves detection accuracy for minority classes and unknown attack types more than the other five baseline methods. This method effectively alleviates the problems caused by changing attack types and data imbalance in the network.