1 Introduction

Intrusion detection, as a kind of multilevel and multilayer network protection measure, aims to detect various intrusion behaviors by collecting and analyzing all kinds of information on the network. In practice, intrusion detection is usually treated as a classification problem: it identifies whether a network traffic behavior is normal or belongs to one of four attack types, Denial of Service (DoS), Probe, Remote to Local (R2L) and User to Root (U2R) [1, 2], and then raises an alarm and takes appropriate measures. Therefore, there is no doubt that constructing a suitable classifier and training it to improve its generalization is the key task of intrusion detection.

Several machine learning methods, including support vector machine (SVM) [3, 4], artificial neural network (ANN) [5, 6], K-nearest neighbors (KNN) [7, 8], random forest (RF) [9] and others [10, 11], have been implemented as classifiers to improve the performance of intrusion detection and have made good progress. However, previous works based on traditional machine learning methodologies, which belong to shallow learning algorithms, have a limited ability to represent complex functions for complex classification problems [12].

More recently, with the development of deep learning, more and more researchers have explored deep learning methods to enhance the performance of classifiers for intrusion detection and have achieved remarkable results. Compared with traditional machine learning algorithms, deep learning methods are adept at representing high-dimensional spatial features and can automatically learn intrinsic features without feature engineering [13]. Experiments have demonstrated that deep networks significantly outperform shallow networks in detecting attacks [12].

However, in practice, whether classifiers for intrusion detection are based on traditional machine learning or on deep learning methods, their detection effectiveness is highly dependent on the number of training samples. After all, in supervised learning, classifiers can only extract limited useful information from a limited number of labeled samples, which usually limits their generalization. The best way to make a machine learning model generalize better is to train it on more data. Data augmentation allows more data to be generated from limited data, increasing the number and diversity of training samples. Supplementing the training samples can reduce the model's dependence on particular attributes, thereby improving its generalization ability. Unfortunately, it is difficult to generate new fake data for a density estimation task unless we have already solved the density estimation problem [14]. In addition, labeling large training datasets is expensive and time-consuming, and sometimes impossible due to emerging and fast-evolving intrusion attacks, which makes these problems particularly severe. Generative adversarial networks (GANs) can learn the probability distribution of a dataset and try to generate new ‘fake’ samples similar to the real data samples. Since GANs introduce interaction into the training stage (which is equivalent to adding a kind of ‘fake’ labeled sample alongside the original data samples), they can expand the labeled data and provide more useful information on top of the training set. Indeed, as semi-supervised models, GANs have enhanced the effectiveness of image recognition [15], anomaly detection [16], imaging markers [17], etc. Since GANs are well suited to modeling the high-dimensional complex distributions of real-world data [18], it is reasonably straightforward to utilize them to offer more useful information and improve the generalization of classifiers. Several works have begun to explore the applications of GANs for anomaly detection. As far as we know, there are three main ways in which GANs have been applied to intrusion detection.

With the purpose of generating adversarial attacks to evade intrusion detection systems, researchers have applied GANs to generate adversarial malicious examples for black-box attacks. The experiments showed that many intrusion detection systems were vulnerable to adversarial perturbations crafted with GANs [19, 20]. A GAN-based framework generated DoS attack traffic similar to normal traffic to evade network traffic classifiers [21]. Obviously, our research goal in this paper is to enhance the performance of the classifier for intrusion detection rather than to generate attacks that evade the intrusion detection system.

With the purpose of balancing previously unbalanced datasets, a GAN-based framework was proposed to generate data that captured the distribution of selected attack types from the dataset; the framework proved feasible for improving the performance of intrusion detection systems [22]. With the same purpose of addressing both data scarcity and data imbalance, a framework was developed to combine deep adversarial learning with statistical learning, and experiments indicated that it outperformed other models [23]. Indeed, class imbalance in the training set degrades detection performance, because it biases the classifier so that its predictions tend toward the majority classes in the dataset. The above methods essentially train the GAN to learn the distribution of the minority classes and then oversample them to balance the training set; ultimately, they change the inter-class distribution of the training set. In this paper, we instead leverage the GAN to generate data with a new label. This is equivalent to adding a new category of data to the training set rather than changing the number of samples in the existing categories, and is fundamentally different from those oversampling methods. It does not need to train each class separately, so the training process is simpler and the training time and overhead are reduced.

With the purpose of identifying anomaly attacks, the authors of [16] developed GAN-based models for anomaly detection and achieved good results on image and network intrusion datasets for binary classification. Compared with those methods, the application scenario in this paper is different. The above methods mainly use GANs for binary classification: the discriminative model of a GAN is itself a binary classifier that judges whether the input data are real or fake, which closely resembles anomaly detection, so GANs can easily be extended to binary classification scenarios. However, in the multiclass classification scenario, the classifier not only needs to detect an anomaly, but also needs to further judge the category of the anomaly.

Inspired by the above reasoning, in this paper we restrict our focus to complementing the labeled samples via adversarial training and thereby augmenting the training set. Specifically, in the training phase, the generative model G continually generates ‘fake’ samples that offer the classifier C useful information, which enhances the classification performance of C; in this way, the ID-GAN framework improves the generalization of the classifier.

To the best of our knowledge, we are the first to propose a supervised learning framework based on GAN for intrusion detection (ID-GAN) under the multiclass classification task. Experimental results show that the ID-GAN framework improves the performance of classifiers by exploiting complementary and helpful information obtained through adversarial training. The framework can effectively enhance the generalization of the classifier and improve the effectiveness of intrusion detection over a series of adversarial rounds, achieving state-of-the-art results on the benchmark NSL-KDD dataset.

The main contributions of this paper lie in overcoming the following challenges.

  1.

    The discriminative model D in the original GAN is a binary classifier, so it can only judge whether a sample comes from the real dataset or not; it does not have the ability to further predict the class of real samples. Hence, a novel supervised learning framework based on GAN for multiclass classification is proposed in this paper.

  2.

    Because the structure of the proposed framework differs from that of the original GAN, the loss function must be re-derived for supervised learning according to the needs of multiclass classification for intrusion detection. Therefore, we present the theoretical derivation of the loss function.

  3.

    Unlike the original GAN and its variants, the purpose of the proposed framework is to train a multiclass model with enhanced performance rather than a generative model. Therefore, the training method of the framework differs from that of the original GAN and its variants, and how to train the proposed framework needs to be studied. Several empirical strategies are proposed to improve the stability of training the framework.

  4.

    Regarding experimental verification, we compare the performance of the original classifier with that of the classifier enhanced via adversarial training on the benchmark NSL-KDD dataset for multiclass classification, and present a detailed graphical depiction of how the classifier is enhanced with the help of the proposed framework.

The remainder of the paper is organized as follows. Section 2 introduces the latest research progress in the field of intrusion detection, especially applications of deep learning in this area. We detail the original GAN and its variants in Sect. 3. Section 4 introduces our method and describes how to construct the supervised learning framework. The experimental configuration and evaluation criteria are given in Sect. 5. Section 6 explains how to train the proposed framework, and the experimental results and discussion are presented in Sect. 7. Finally, conclusions and future work are drawn.

2 Related work

Significant progress has been made in improving the performance of classifiers in the field of intrusion detection.

To improve the accuracy of decision tree (DT) and naïve Bayes (NB) classifiers for multiclass classification, two independent hybrid mining algorithms were presented [24]. The hybrid DT algorithm utilized an NB classifier to avoid overfitting, while the hybrid NB algorithm employed a DT classifier to select important features and improve efficiency. Kanakarajan and Muniasamy [25] introduced the greedy randomized adaptive search procedure with annealed randomness forest (GAR-forest) to improve the performance of multiclass classification with feature selection. Experimental results showed that GAR-forest performed better on the multiclass classification problem than random forest, C4.5, naïve Bayes and multilayer perceptron.

Some researchers reduce the dimensionality of the feature space by removing redundant or unimportant features to further improve the performance of the classifier. Kuang et al. [26] reduced the high-dimensional data using hybrid kernel principal component analysis (KPCA) and then utilized an SVM for intrusion detection. Ikram and Cherukuri [27] studied the integration of principal component analysis (PCA) and SVM for anomaly recognition. This line of research reduces the dimensionality of the input feature space of the intrusion detection system, which effectively improves the overall classification performance. Nevertheless, PCA suffers from the fact that each component is a linear combination of all the original variables, so it often cannot obtain deterministic mappings from high-dimensional spaces to low-dimensional spaces [28]. Furthermore, its nonlinear extension KPCA suffers from two major disadvantages. First, the underlying manifold structure of the data is not considered in process modeling. Second, the selection of the kernel function and kernel parameters is always problematic [29].

More recently, deep learning has become one of the most effective and popular machine learning techniques and has gained a wide range of applications in the intrusion detection community. Researchers take advantage of generative models such as deep autoencoders (DAEs), deep Boltzmann machines (DBMs) and deep belief networks (DBNs) in a pre-training stage (unsupervised learning) to improve detection performance. During this process, each lower layer is trained separately from the other layers, which allows the layers to be greedily trained one by one from the bottom up [12, 30]. The final prediction and classification are then carried out by traditional machine learning algorithms such as SVM or softmax, which avoids manual feature selection and can effectively represent high-dimensional features.

Abolhasanzadeh [31] proposed an approach to detect attacks in big data using a DAE based on dimensionality reduction and neural network bottleneck feature extraction. In terms of accuracy, the approach outperformed PCA, factor analysis and KPCA.

Gao et al. [32] successfully exploited a classifier based on DBN for intrusion detection, and concluded that the classifier achieved high accuracy when the greedy layer-by-layer learning algorithm was used for pre-training.

To optimize the basic network structure of the DBN classification model in an intrusion detection system, Wei et al. [33] designed a hybrid optimization algorithm combining the artificial fish swarm algorithm, genetic algorithm and particle swarm optimization (AFSA-GA-PSO). A framework based on this algorithm (AFSA-GA-PSO-DBN) was proposed to optimize the DBN model and was tested for multiclass classification. Compared with high-performing machine learning models such as SVM, random forest and naïve Bayes, the framework improved the average classification accuracy.

Potluri et al. [34] applied convolutional neural networks (CNNs) to intrusion detection with the purpose of identifying multiple attack classes. Performance metrics such as precision, recall and F-measure were calculated and compared with existing deep learning approaches.

Javaid et al. [35] proposed a deep learning method based on self-taught learning (STL), and improved the performance of network intrusion detection for multiclass classification.

However, the DAE, DBNs and DBMs algorithms have the difficulties of an intractable partition function or an intractable posterior distribution. Therefore, they are typically only used for pre-training a classification network [36].

Generative adversarial networks are another type of deep generative model. Different from other generative models, GANs incorporate the adversarial idea and allow interaction during training, which gives them great potential to be applied in many real-world scenarios. In the past two years, hundreds of GAN variants have appeared. Regarding GANs as semi-supervised models, the authors of [37, 38] extended GANs to semi-supervised image classification to enhance the robustness of unsupervised learning models.

Several works have begun to explore the applications of GANs for the anomaly detection task. A semi-supervised model based on GAN, consisting of two generators, three discriminators and one classifier, was proposed for detecting anomalies in communication packet streams [39]. The approach was effective for binary classification of packet flows.

In [16], the authors developed GAN-based models for anomaly detection and achieved good results on image and network intrusion datasets for binary classification. However, this GAN variant belongs to unsupervised learning, is leveraged only for anomaly detection and is not suited to multiclass classification.

Different from the above variants, in this paper, we develop a novel framework based on GAN, propose the supervised learning approach for multiclass classification, and suggest several empirical techniques for the framework training.

3 Generative adversarial networks

Generative adversarial networks [40] are a deep learning framework first proposed by Ian J. Goodfellow in 2014. The idea has attracted great interest from academics in various fields of study and shows broad application prospects in imaging, visual computing and other areas.

Generally, a standard GAN framework consists of a generative model G and a discriminative model D. During training, the samples generated by the generative model G (called ‘fake’ or generated samples) are mixed with the real data samples and then randomly fed to the discriminative model D. The goal of the discriminative model D (which is equivalent to a binary classifier) is to identify the real data samples and the generated samples as accurately as possible. Meanwhile, the goal of the generative model G is the opposite: to deceive the discriminative model D as much as possible and minimize the probability that D identifies the generated samples. Both sides constantly optimize themselves during training until they reach an equilibrium where neither side can improve and the generated samples are completely indistinguishable from the real data samples.

In summary, the original GAN contains a generative model G and a discriminative model D. G is used to capture the distribution of the dataset and generate similar samples, while D is a discriminator that determines whether the input is a data sample or a generated sample. The basic framework of the original GAN is shown in Fig. 1.

Fig. 1

The framework of original GAN

The generative model G takes noise z drawn from a noise distribution p(z) (usually a Gaussian or uniform distribution) as input and produces fake samples G(z). Meanwhile, the discriminative model D identifies whether a sample comes from the data distribution p(x) or from the generated samples G(z). The loss function of GANs can be defined as the following optimization problem [40]:

$$\mathop{\min}\limits_{G} \mathop{\max}\limits_{D} V\left( {D,G} \right) = E_{{x \sim P_{data} \left( x \right)}} \left[ {\log D\left( x \right)} \right] + E_{{z \sim P_{z} \left( z \right)}} \left[ {\log \left( {1 - D\left( {G\left( z \right)} \right)} \right)} \right]$$
(1)

Equation (1) shows that in the training process of GANs, the discriminative model is constantly revised to maximize the value of V, that is, to maximize D(x) and minimize D(G(z)). Meanwhile, the generative model G is revised to minimize the value of V; in other words, by maximizing D(G(z)), the generative model tries to generate samples that are very similar to the data samples. Finally, G and D reach a Nash equilibrium: the generative model G can estimate the probability distribution of the real data samples, and the detection accuracy of the discriminative model D drops to 50%, meaning it can no longer tell whether a sample is real or fake.
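To make the two objectives in Eq. (1) concrete, the sketch below (an illustration, not the implementation used in this paper) expresses the discriminator and generator losses with binary cross-entropy in Keras; `generator`, `discriminator`, `real_batch` and `noise_dim` are assumed placeholders for an already-built generator, an already-built discriminator, a batch of real samples and the noise dimension.

```python
import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy()

def gan_losses(generator, discriminator, real_batch, noise_dim=100):
    """One evaluation of the two objectives in Eq. (1)."""
    z = tf.random.normal([tf.shape(real_batch)[0], noise_dim])   # z ~ p(z)
    fake_batch = generator(z)

    d_real = discriminator(real_batch)    # D(x)
    d_fake = discriminator(fake_batch)    # D(G(z))

    # D maximizes log D(x) + log(1 - D(G(z))), i.e. minimizes this cross-entropy
    d_loss = bce(tf.ones_like(d_real), d_real) + bce(tf.zeros_like(d_fake), d_fake)
    # G minimizes log(1 - D(G(z))); the common non-saturating form instead maximizes log D(G(z))
    g_loss = bce(tf.ones_like(d_fake), d_fake)
    return d_loss, g_loss
```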

Additionally, GANs can be applied in semi-supervised learning. In the original GAN, the discriminator is a binary classifier that identifies the authenticity of samples. For a K-class task, the output of the generator can be assigned to a (K+1)-th class, and the corresponding discriminator becomes a (K+1)-category classifier [15]. The advantage of this kind of processing is that it can make full use of unlabeled data to learn the probability distribution of the real data samples and thus aid the training process of supervised learning.

$$\begin{aligned} L & = - E_{{x,y \sim P_{data} \left( {x,y} \right)}} \left[ {\log P_{model} \left( {y|x} \right)} \right] - E_{x \sim G} \left[ {\log P_{model} \left( {y = K + 1|x} \right)} \right] \\ & = L_{supervised} + L_{unsupervised} ,\;{\text{where}} \\ L_{supervised} & = - E_{{x,y \sim P_{data} \left( {x,y} \right)}} \log P_{model} \left( {y|x,y < K + 1} \right),\;{\text{and}} \\ L_{unsupervised} & = - \{ E_{{x \sim P_{data} \left( x \right)}} \log \left[ {1 - P_{model} \left( {y = K + 1|x} \right)} \right] \\ & \quad + E_{x \sim G} \log \left[ {P_{model} \left( {y = K + 1|x} \right)} \right]\} \\ \end{aligned}$$
(2)

Let Pmodel(y = K+1|x) denote the probability that x is a generated sample, which corresponds to 1 − D(x) in the original GAN framework. Assuming that the dataset consists of real data and some generated samples, the loss function for training the classifier becomes Eq. (2), which can be divided into two parts according to the data source. For labeled data samples, Lsupervised is the negative log probability of the label, given that the sample comes from the real data; the goal is for the discriminative model to output the correct label on the real data distribution Pdata(x, y). The unsupervised loss Lunsupervised is the loss function defined by GANs; it is in fact the standard GAN game value, which becomes evident when we substitute D(x) = 1 − Pmodel(y = K+1|x) into Eq. (3) [15].

$$L_{unsupervised} = - E_{{x \sim P_{data} \left( x \right)}} \log D\left( x \right) - E_{z \sim noise} \log \left( {1 - D\left( {G\left( z \right)} \right)} \right)$$
(3)
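As a small numerical illustration of the substitution D(x) = 1 − Pmodel(y = K+1|x), the snippet below (with purely hypothetical logits) recovers the implied discriminator score from a (K+1)-class softmax output.

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max())
    return e / e.sum()

# hypothetical (K+1)-class logits: K = 5 real classes plus the 'fake' class
logits = np.array([2.1, 0.3, -1.0, 0.5, -0.7, -2.0])
p = softmax(logits)

p_fake = p[-1]        # P_model(y = K+1 | x)
d_x = 1.0 - p_fake    # the implied discriminator output D(x) used in Eq. (3)
print(p_fake, d_x)
```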

4 Proposed methodologies

4.1 The supervised learning framework using adversarial training for intrusion detection

A standard multiclass classifier for intrusion detection usually takes a sample x as input and outputs a 5-dimensional vector \(\left( {l_{normal} ,l_{probe} ,l_{dos} ,l_{r2l} ,l_{u2r} } \right)\) that can be turned into one of the five possible class probabilities by applying the softmax function. In supervised learning, such a model is then trained by minimizing the cross-entropy between the real labels and the predictive distribution Pmodel(y|x) to obtain the optimal parameters.

As stated before, the two core models in a GAN are the generative model G and the discriminative model D. As a binary classifier, D can only judge whether a sample comes from the real dataset or not; it does not have the ability to further predict the class of real data samples. Additionally, most classifiers for intrusion detection belong to supervised learning, so we need to reconstruct a supervised learning framework based on GAN.

In order to supply more information to the multiclass classifier C, we take the output of the generative model G, together with the original training set, as the input of the classifier C.

To improve the efficiency of the framework and further simplify it, we replace the discriminative model D with a multiclass classifier C. In this way, the classifier C not only undertakes the classification task for the training set, but also plays the role of the discriminative model D in determining whether a sample comes from the generative model G or from the real dataset. We regard the output of the generative model G as the ‘fake’ category, so the multiclass classifier becomes a 6-category classifier. The original GAN is thus transformed into a supervised learning framework for intrusion detection, as shown in Fig. 2. First, the framework trains the classifier through adversarial training, as indicated by the green arrow. Second, as indicated by the blue arrow, the framework feeds the test samples into the trained classifier for multiclass classification.

Fig. 2

The framework of ID-GAN

After the introduction of adversarial training for intrusion detection, the generative model continually generates ‘fake’ samples from a random distribution p(z). During adversarial training, the multiclass classifier identifies whether a sample is normal, fake, or one of the four attack types (DoS, Probe, R2L and U2R), while the generative model dynamically adjusts its strategy for generating more realistic fake samples according to the feedback (fake or real) from the multiclass classifier. Thus, the framework trains the classifier on an augmented training set that includes the original five-category labeled samples and the continually generated ‘fake’ samples.

For example, suppose that originally only a professor (analogous to the classifier) trained students to recognize and classify five languages (Russian, English, Arabic, French and German). An assistant professor (analogous to the generative model) is then added who trains the students to recognize whether an utterance belongs to one of the five languages at all. Although the assistant professor does not directly teach the students how to identify and classify the five languages, the practice of distinguishing ‘whether it belongs to the five languages’ is also helpful for classifying and recognizing them. Such rough feedback is better than no feedback.

In summary, the main idea of the ID-GAN framework is to train a multiclass classifier that plays two roles: performing the classification task and distinguishing generated samples from real data samples. More specifically, the classifier takes a sample as input and classifies it into one of six classes: real data samples are classified into the first five classes, and generated samples are classified into the ‘fake’ class, as shown in Fig. 2.

4.2 The derivation of the loss function

It is assumed that (xl, yl) is a sample from the training set with a 5-category classification label, where \(y_{l} \in \left\{ {normal,dos,probe,r2l,u2r} \right\}\). The generative model generates ‘fake’ samples (xf, yf) from the random noise distribution, where yf = ‘fake.’ The samples (x, y) denote the mixture of real data samples and generated samples, where the label y covers six classes (\({\text{y}} \in \left\{ {normal,dos,probe,r2l,u2r,fake} \right\}\)). For the multiclass classification problem, the classifier takes a sample x as input and, by applying the softmax function, outputs the classification probabilities for the six classes pi (i = 1, 2, 3, 4, 5, 6); the first five categories correspond to the original classes, and the last corresponds to the ‘fake’ category.

Assuming that p is the real probability distribution of the sample and q is the predicted probability distribution of the classifier, the cross-entropy for a given dataset X is defined as:

$${\text{CE}}\left( {p,q} \right) = - \mathop \sum \limits_{x \in X} p\left( x \right)\log q\left( x \right)$$
(4)

The value of Eq. (4) indicates the error between the real classification and the predicted classification. The smaller the value is, the closer the predicted probability distribution is to the real probability distribution, and the more accurate the predicted result will be.
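For concreteness, Eq. (4) can be evaluated in a few lines; the toy distributions below are illustrative only.

```python
import numpy as np

def cross_entropy(p, q, eps=1e-12):
    """CE(p, q) = -sum_x p(x) log q(x), as in Eq. (4)."""
    return -np.sum(p * np.log(np.clip(q, eps, 1.0)))

p = np.array([1.0, 0.0, 0.0])            # true distribution: the sample belongs to the first class
q_good = np.array([0.9, 0.05, 0.05])     # prediction close to the truth
q_bad = np.array([0.4, 0.3, 0.3])        # prediction far from the truth
print(cross_entropy(p, q_good))          # smaller error
print(cross_entropy(p, q_bad))           # larger error
```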

Under the multiclass classification task, the loss function is usually defined as cross-entropy loss. Let \(y_{{x_{i} }}^{j}\) represent the real probability distribution of the sample xi, and let \(P_{model} \left( {y = j |x_{i} } \right)\) represent the predicted probability distribution of the sample xi, then the corresponding loss function can be defined as:

$$L_{{x_{i} }} = - \mathop \sum \limits_{j} y_{{x_{i} }}^{j} \log P_{model} \left( {y = j |x_{i} } \right)$$
$$\forall j \in \left\{ {normal,dos,probe,r2l,u2r,fake} \right\}$$
(5)

For the dataset X, which consists of the real data samples and the generated samples, the corresponding loss function is defined as:

$$Lc = - \frac{1}{N}\mathop \sum \limits_{i = 1}^{N} \mathop \sum \limits_{j} y_{{x_{i} }}^{j} \log P_{model} \left( {y = j |x_{i} } \right)$$
$$\forall j \in \left\{ {normal,dos,probe,r2l,u2r,fake} \right\}$$
(6)

After one-hot coding, the real category of the sample \(y_{{x_{i} }}^{j}\) is mapped into a 6-dimensional vector. For example, [1, 0, 0, 0, 0, 0] indicates that the sample belongs to the ‘normal’ category, [0, 1, 0, 0, 0, 0] to the ‘dos’ category, [0, 0, 1, 0, 0, 0] to the ‘probe’ category, [0, 0, 0, 1, 0, 0] to the ‘r2l’ category, [0, 0, 0, 0, 1, 0] to the ‘u2r’ category, and [0, 0, 0, 0, 0, 1] to the ‘fake’ category. In general, if the sample xi belongs to category c, then \(y_{{x_{i} }}^{c}\) = 1 and all remaining entries are 0, that is, \(y_{{x_{i} }}^{j \ne c}\) = 0.
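A minimal sketch of this one-hot label encoding, using the class ordering given above (the helper function is illustrative):

```python
import numpy as np

CLASSES = ['normal', 'dos', 'probe', 'r2l', 'u2r', 'fake']

def one_hot(label):
    """Map a class name to the 6-dimensional one-hot vector used in Eq. (5)."""
    vec = np.zeros(len(CLASSES), dtype=int)
    vec[CLASSES.index(label)] = 1
    return vec

print(one_hot('normal'))   # [1 0 0 0 0 0]
print(one_hot('fake'))     # [0 0 0 0 0 1]
```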

Therefore, the loss function of the multiclass classifier in the proposed framework can be further expressed as follows.

$$\begin{aligned} Lc & = - \frac{1}{N}\mathop \sum \limits_{i = 1}^{N} \mathop \sum \limits_{j} y_{{x_{i} }}^{j} \log P_{model} \left( {y = j |x_{i} } \right) \\ & = - \frac{1}{N}\mathop \sum \limits_{i = 1}^{N} \left[ {y_{{x_{i} }}^{j = c} \log P_{model} \left( {y = c |x_{i} } \right) + \mathop \sum \limits_{j \ne c} y_{{x_{i} }}^{j} \log P_{model} \left( {y = j |x_{i} } \right) } \right] \\ & = - \frac{1}{N}\mathop \sum \limits_{i = 1}^{N} \left[ {y_{{x_{i} }}^{j = c} \log P_{model} \left( {y = c |x_{i} } \right) } \right] \\ & = - \frac{1}{N}\mathop \sum \limits_{i = 1}^{N} \left[ {\log P_{model} \left( {y = c |x_{i} } \right) } \right] \\ \end{aligned}$$

5 Experiments

5.1 Dataset

The NSL-KDD (Knowledge Discovery and Data Mining) dataset [41, 42] is a benchmark dataset for network intrusion detection. It removes a large amount of redundant data from the original dataset and adjusts the proportions of normal and abnormal data to make the training and test set sizes more reasonable. It remains an ideal and widely trusted public benchmark dataset [43,44,45] for an effective and accurate assessment of different machine learning algorithms for intrusion detection.

The NSL-KDD dataset consists of a training set and a test set. The training set KDDTrain+ contains 125,973 instances, and the test set KDDTest+ contains 22,544 instances, as shown in Table 1.

Table 1 Different classifications in the NSL-KDD dataset

There are 41 features and 1 class label for every traffic record; the features include basic features (No. 1–No. 10), content features (No. 11–No. 22) and traffic features (No. 23–No. 41), as shown in Table 2. According to their characteristics, attacks in the dataset are categorized into four types: DoS, Probe, R2L and U2R. The test set contains some specific attack types that do not appear in the training set, which provides a more realistic basis for evaluating intrusion detection.

Table 2 Features of NSL-KDD dataset

Each instance in the dataset is described by 41 features and 1 label, which is either normal or one of the attack types (DoS, Probe, R2L and U2R). Because the input must be a numeric matrix, we must convert the nonnumeric features, namely ‘protocol_type,’ ‘service’ and ‘flag,’ into numeric form. For example, the feature ‘protocol_type’ has three attribute values, ‘tcp,’ ‘udp’ and ‘icmp,’ which are encoded as the binary vectors (1,0,0), (0,1,0) and (0,0,1). Similarly, the feature ‘service’ has 70 attribute values and the feature ‘flag’ has 11. Continuing in this way, after one-hot encoding and normalization of the features, each 41-dimensional feature vector is transformed into a 122-dimensional feature vector.
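A possible preprocessing sketch along these lines is given below; the exact encoder and scaler used in the experiments are not specified in the text, and `features` is assumed to be a pandas DataFrame holding only the 41 feature columns.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

CATEGORICAL = ['protocol_type', 'service', 'flag']   # 3 + 70 + 11 = 84 binary columns

def preprocess(features):
    """One-hot encode the three symbolic features and scale the remaining ones to [0, 1]."""
    features = pd.get_dummies(features, columns=CATEGORICAL)   # 38 numeric + 84 binary = 122 columns
    numeric = [c for c in features.columns
               if not any(c.startswith(p + '_') for p in CATEGORICAL)]
    # in practice the scaler should be fitted on the training set only
    features[numeric] = MinMaxScaler().fit_transform(features[numeric])
    return features
```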

5.2 Selection of generative model and classifier for ID-GAN

In theory, we can choose any generative model and any classifier as the generative model G and the multiclass classifier C of the ID-GAN framework, respectively. In practical applications, however, the generative model and the classifier are generally nonlinear mapping functions, such as the multilayer perceptron, long short-term memory (LSTM) networks and others.

We selected an LSTM network as the generative model; LSTM is a special type of recurrent neural network (RNN) that can remember long-term information and mitigate the vanishing gradient problem. In this paper, a 3-layer LSTM network was adopted as the generative model in the ID-GAN framework, consisting of an input layer, a hidden layer and an output layer. The number of neurons in the input layer was set to 120, and the number of hidden layer nodes was set to 80. The number of output nodes was 122, the same as the number of processed features described in Sect. 5.1. In addition, the time step was set to 10.
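One plausible Keras reading of this generator description is sketched below; the layer sizes follow the text, while the output activation and the exact arrangement of the three layers are assumptions.

```python
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Input, LSTM, Dense

TIME_STEPS, INPUT_DIM = 10, 120     # time step and input-layer size from the text

def build_generator():
    """Sketch of the generative model G: 120-unit input, 80-unit hidden LSTM layer, 122-unit output."""
    return Sequential([
        Input(shape=(TIME_STEPS, INPUT_DIM)),
        LSTM(80),                           # hidden layer
        Dense(122, activation='sigmoid'),   # one generated 122-dimensional record in [0, 1]
    ])
```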

The artificial neural network (ANN) is one of the most common machine learning methods for complex applications such as pattern recognition, automatic control and deep learning. It is a nonlinear structure mainly used for classification tasks such as spam identification, disease diagnosis and distinguishing images of cats and dogs. In [5], the authors proposed a model based on ANNs for binary and multiclass classification in the realm of intrusion detection and achieved good results. In this paper, we used a 4-layer neural network as the multiclass classifier C, consisting of an input layer, two hidden layers and an output layer. The number of neurons in the input layer was 122, the same as the number of features. The numbers of nodes in the first and second hidden layers were 80 and 20, respectively. For adversarial training, the number of output nodes was 6, the same as the number of classes.
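A corresponding Keras sketch of the classifier C is given below; the layer sizes follow the text, while the activations, optimizer and loss are assumptions.

```python
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Input, Dense

def build_classifier():
    """Sketch of the multiclass classifier C: 122 inputs, hidden layers of 80 and 20 nodes, 6 outputs."""
    model = Sequential([
        Input(shape=(122,)),
        Dense(80, activation='relu'),     # activation choices are assumptions
        Dense(20, activation='relu'),
        Dense(6, activation='softmax'),   # normal, dos, probe, r2l, u2r, fake
    ])
    model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
    return model
```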

5.3 Classification metrics

In order to comprehensively and objectively evaluate the performance of the classifier in detecting intrusion behaviors, five indicators are used: accuracy, precision, recall, f1-score and the confusion matrix.

True Positive (TP) indicates the intrusion traffic that is correctly detected, True Negative (TN) indicates the legitimate traffic that is correctly detected, False Negative (FN) indicates the intrusion traffic that is incorrectly detected as legal traffic, and False Positive (FP) indicates the legal traffic that is falsely detected as illegal traffic.

As shown in Table 3, each column of the confusion matrix represents a predicted category, and the total of each column is the number of samples predicted as that category. Each row represents the true category, and the total of each row is the number of samples actually in that category.

Table 3 Confusion matrix

The accuracy, precision, recall and f1-score are, respectively, defined in Eqs. (8) to (11).

$${\text{Accuracy}} = \frac{{{\text{TP}} + {\text{TN}}}}{{{\text{TP}} + {\text{TN}} + {\text{FP}} + {\text{FN}}}}$$
(8)
$${\text{Precision}} = \frac{\text{TP}}{{{\text{TP}} + {\text{FP}}}}$$
(9)
$${\text{Recall}} = \frac{\text{TP}}{{{\text{TP}} + {\text{FN}}}}$$
(10)
$${\text{F}}1\;{\text{score}} = \frac{{2 \times {\text{Recall}} \times {\text{Precision}}}}{{{\text{Recall}} + {\text{Precision}}}}$$
(11)

To evaluate the performance of the multiclass classifier under label imbalance, we use the weighted average precision, weighted average recall and weighted average f1-score. We calculate each metric for every label and then average the values, weighted by support (the proportion of samples in each class); in other words, each class's weight is the fraction of the total samples that carry that label. Note that the overall accuracy is equal to the weighted average recall.

Hence, a good classifier for intrusion detection should have a higher accuracy, precision, recall and f1-score.
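The weighted-average metrics described above can be computed directly with scikit-learn; a minimal sketch, assuming integer-coded true and predicted labels:

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def weighted_metrics(y_true, y_pred):
    """Accuracy plus support-weighted precision, recall and f1-score."""
    acc = accuracy_score(y_true, y_pred)
    prec, rec, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average='weighted', zero_division=0)
    return acc, prec, rec, f1   # note that acc equals the weighted average recall
```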

5.4 Controlled experiments

In this paper, controlled experiments were adopted to evaluate more accurately and objectively the effect of adversarial training on the multiclass classifier using our framework. As stated before, since an ANN was selected as the multiclass classifier in the ID-GAN framework, we also chose an ANN without adversarial training as the control group. The parameters of the multiclass classifier without adversarial training (denoted as Coriginal) were the same as those of the classifier trained adversarially (denoted as Cenhanced) in the ID-GAN framework: the number of input nodes was also 122, and the numbers of nodes in the first and second hidden layers were 80 and 20, respectively. However, the output layer of Coriginal had five nodes (five categories), which was the only difference from Cenhanced in the ID-GAN framework.

5.5 Experimental configuration

In this paper, we used one of the most widely used deep learning frameworks, Keras, whose architecture is flexible and supports various models and techniques such as batch normalization. The experiments were performed on a ThinkPad E450 personal notebook with an Intel Core i5-5200U CPU @ 2.20 GHz and 8 GB of memory, without GPU acceleration.

6 Training the framework

To ensure the objectivity and impartiality of the experiments and to accurately evaluate how the ID-GAN framework enhances the multiclass classifier, we selected 11 observation points, at which 100, 200, 500, 1000, 2000, 5000, 8000, 10,000, 20,000, 50,000 and 125,973 ‘fake’ samples were mixed into the adversarial training at each epoch. Their respective ratios to the total samples are 0.08%, 0.16%, 0.40%, 0.79%, 1.56%, 3.82%, 5.97%, 7.35%, 13.70%, 28.41% and 50.00%. We then observed the average accuracy of the classifier enhanced via adversarial training on the test set KDDTest+ over the epochs.

Although GANs have achieved great success in image generation, training a stable GAN is still difficult in practice. As mentioned before, we chose the generative model and the multiclass classifier for the ID-GAN framework. This section focuses on the training difficulty caused by the excessively free and hard-to-control nature of GAN frameworks. We examined the two training factors that most affect the performance of the multiclass classifier: whether prior training is used and the choice of noise distribution. This allows us to test how these factors affect the performance of the classifier and how to obtain a more stable, robust and generalized trained classifier.

  1.

    Prior training

The prior training mentioned in this paper refers to training the classifier n_times epochs in advance on the original five-category dataset before adversarial training. This gives the classifier proper guidance and prevents it from becoming too unconstrained to control. Under the same experimental conditions, the detection performance of the multiclass classifier was tested on the test set KDDTest+ with and without prior training.

As shown in Algorithm 1, we first trained the multiclass classifier C for n_times epochs. Then, we drew m generated ‘fake’ samples and n real data samples to train the classifier; since a well-trained classifier C can better guide the adjustment of the generative model G, the classifier C is generally updated more times per round [40]. Third, we froze the classifier C and drew m generated samples again to train the generative model G.

Algorithm 1

Without loss of generality, the number of prior-training epochs n_times was set to 10 and the number of ID-GAN epochs k was set to 90, so the total number of training epochs was 100. A condensed sketch of this procedure is shown below.
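The following Python sketch condenses Algorithm 1 under these settings; `classifier`, `generator` and the stacked `gan` model (the generator followed by the frozen classifier) are assumed to be built and compiled beforehand, `y_train6` is assumed to hold 6-column one-hot labels for the real training data (with the ‘fake’ column zero), and the generator target and the number of classifier steps per round are assumptions rather than the exact choices made in this paper.

```python
import numpy as np

N_TIMES, EPOCHS = 10, 90   # prior-training epochs and adversarial epochs, as in the text
C_STEPS = 2                # classifier is updated more often than G; the exact value is an assumption

def sample_noise(m):
    """Placeholder noise sampler matching the generator input shape (time step 10, width 120)."""
    return np.random.uniform(-1.0, 1.0, size=(m, 10, 120))

def train_id_gan(classifier, generator, gan, x_train, y_train6, m, n):
    """Condensed sketch of Algorithm 1: prior training, then alternating C / G updates."""
    fake_target = np.tile(np.eye(6)[5], (m, 1))       # one-hot 'fake' labels for generated samples
    not_fake = np.full((m, 6), 0.2)                   # generator target: spread mass over the
    not_fake[:, 5] = 0.0                              # five real classes (an assumption)

    for _ in range(N_TIMES):                          # 1) prior training on the real 5-class data
        classifier.fit(x_train, y_train6, epochs=1, verbose=0)

    for _ in range(EPOCHS):                           # 2) adversarial training
        for _ in range(C_STEPS):                      #    update C on n real + m generated samples
            idx = np.random.randint(0, len(x_train), n)
            x_fake = generator.predict(sample_noise(m), verbose=0)
            x_batch = np.vstack([x_train[idx], x_fake])
            y_batch = np.vstack([y_train6[idx], fake_target])
            classifier.train_on_batch(x_batch, y_batch)
        gan.train_on_batch(sample_noise(m), not_fake)  # 3) C frozen inside `gan`; update G only
```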

Figure 3 shows the average accuracy of the classifier on the test set at the 11 observation points, with and without prior training. The results reveal that, at all 11 numbers of mixed ‘fake’ samples, the average detection performance on the test set of the classifier with prior training is higher than that of the classifier without prior training. From these experiments, we concluded that prior training of the multiclass classifier helps steer the ID-GAN model in a more ‘correct’ direction.

Fig. 3

The average accuracy of the classifier using the ID-GAN at different observation points with and without prior training, respectively

  2.

    Uniform distribution versus Gaussian distribution

Either the uniform distribution or the Gaussian distribution can be used as the input of the generative model G. To train a more stable and efficient multiclass classifier, we compared the performance of the adversarially trained multiclass classifier when the uniform distribution and the Gaussian distribution were used as the generator input.

The experimental results shown in Fig. 4 indicate that the average accuracy of the classifier trained via the ID-GAN framework differs little between the Gaussian and uniform input distributions at the five observation points of 1000, 5000, 8000, 20,000 and 125,973, with a maximum difference in average accuracy of 0.0036. However, at the observation points of 500, 2000, 10,000 and 50,000, the average accuracy on the test set with the uniform distribution as the input of G is higher than with the Gaussian distribution under the same conditions, with a maximum difference of 0.0131.

Fig. 4

The average accuracy of the classifier at different observation points with the Gaussian distribution and uniform distribution, respectively

Our study of training the ID-GAN framework shows that applying prior training to the classifier and using the uniform distribution as the generator input help enhance the performance of the classifier. In this way, we can train a relatively good locally optimal classifier via the ID-GAN framework whose detection performance is high and stable. Of course, the model we obtain is not necessarily the best, but we provide a notion of how to train the framework and demonstrate the impact of prior training and of the choice of noise distribution on performance.

7 Results and discussion

7.1 Controlled group

As stated before, controlled experiments were adopted and an ANN was selected as the multiclass classifier in the control group. With 100 training epochs, the accuracy of the original classifier Coriginal over the epochs is shown in Fig. 5. The average accuracies of the original classifier on the training set and the test set are 99.67% and 78.81%, respectively, and the original classifier reaches its highest accuracy of 79.85% on the test set at the 95th epoch.

Fig. 5

The accuracy of the original classifier over epochs

7.2 Overall comparison

In Sect. 6, we trained a locally optimal classifier using the ID-GAN framework and evaluated it on the test set at the 11 observation points, as shown in Fig. 6. The vertical axis represents the average accuracy on the test set, while the horizontal axis represents the number of ‘fake’ samples mixed with the training samples at each adversarial training epoch. The blue dots indicate the average accuracy of the classifier Cenhanced enhanced by the ID-GAN framework on the test set, and the red dotted line indicates the baseline average accuracy of the original classifier Coriginal on the same test set.

Fig. 6

The average accuracy of the enhanced classifier on the test set at 11 observation points

Figure 6 indicates that the average accuracy of the enhanced classifier Cenhanced exceeds that of the original classifier Coriginal at 9 of the 11 points. The classifier trained adversarially with the ID-GAN framework thus clearly improves its detection performance and generalization. As the number of generated samples grows very large (e.g., from 50,000 to 125,973), the average detection performance of the enhanced classifier Cenhanced begins to decline, and the ID-GAN framework no longer enhances the classifier. When generated samples account for a large proportion of the synthesized data at each adversarial training epoch, the classifier focuses too much on deciding whether a sample is ‘fake,’ which decreases the classification performance on the original five classes.

Figure 7 details the accuracy of the original classifier Coriginal and the enhanced classifier Cenhanced on the test set over the epochs at the 11 observation points, and shows how the performance of the classifier is improved by the ID-GAN framework. The blue line indicates the accuracy of the original classifier Coriginal on the test set, while the red line indicates the accuracy of the enhanced classifier Cenhanced trained adversarially on the same test set over 100 epochs. In addition, the black arrow marks the optimal enhanced classifier on the test set.

Fig. 7

The accuracy of multiclass classifiers on the test set over epochs at the 11 observation points

Based on the results of the controlled experiments, the accuracy of the enhanced classifier Cenhanced is significantly higher than that of the original classifier Coriginal between approximately epoch 20 and epoch 80 (the interval [20, 80]) at the 100, 500, 2000, 8000 and 10,000 observation points.

Correspondingly, the maximum accuracies of optimal enhanced classifiers Cenhanced via ID-GAN framework at the 11 observation points are also significantly higher than those of the original classifier Coriginal, no matter how many ‘fake’ samples are mixed.

Figure 7 thus provides a detailed graphical depiction of how the classifier is enhanced with the help of the proposed framework.

7.3 Individual comparison

The accuracy and the weighted averages of the precision, recall and f1-score of the optimal classifiers (the black arrows) can be obtained using the classification_report function from the scikit-learn library.
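For reference, a minimal sketch of that call (the class names and integer label coding are assumptions):

```python
from sklearn.metrics import classification_report

CLASS_NAMES = ['normal', 'dos', 'probe', 'r2l', 'u2r']

def report(y_true, y_pred):
    """Per-class and weighted-average precision, recall and f1-score on KDDTest+."""
    return classification_report(y_true, y_pred, labels=list(range(5)),
                                 target_names=CLASS_NAMES, digits=4, zero_division=0)
```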

The performance measures of different multiclass classifiers on the same test set are shown in Table 4. The row with the orange background color indicates the performance of the original classifier Coriginal (ANN), the rows with the light-gray background color indicate the performance of multiclass classifiers on the same test set in other studies [5, 25, 35], and the rows without background color indicate the performance of enhanced classifiers Cenhanced via adversarial training at different observation points.

Table 4 Performance measures of different multiclass classifiers on the same test set KDDTest+

As shown in Table 4, the accuracy of our original classifier Coriginal based on ANN (79.85%) is almost the same as the result (79.90%) obtained in [5], which supports the validity and objectivity of our experimental results. However, the classifiers enhanced via adversarial training outperform both the original classifier Coriginal without the ID-GAN framework and the classifiers from other studies in terms of accuracy, weighted average precision and the other performance indicators.

The experimental results prove that the ID-GAN framework can enhance the original classifier’s classification performance and generalization ability via adversarial training.

In order to describe in detail how adversarial training improves the detection performance of the classifier, we take the observation point of 8000 as an example. The confusion matrices of the original classifier and the enhanced classifier are shown in Tables 5 and 6, respectively. The confusion matrices indicate that the enhanced classifier correctly classifies more samples in all five classes, and especially improves the detection of Probe and R2L attacks.

Table 5 Confusion matrix of the original classifier on KDDTest+
Table 6 Confusion matrix of the enhanced classifier on KDDTest+

Meanwhile, Table 7 shows that the recall, precision and other performance indicators of the enhanced classifier on each class, as well as their weighted averages, are improved compared with those of the original classifier.

Table 7 Classification metrics of the enhanced classifier compared with the original classifier on KDDTest+

7.4 Training time

In terms of training time, data augmentation increases the number of training samples, so the classifier takes more time to train and the training cost certainly rises. From Table 4, if the ID-GAN framework generates a small number of training samples (fewer than 8000), training the model takes no more than 500 s. As the number of training samples increases, for example when more than 10,000 generated samples are mixed in, training takes more than 700 s. Of course, these are results on a personal computer, and we believe the training time would be greatly reduced with GPU acceleration or on a high-performance server.

7.5 Discussion

Compared with binary classification for intrusion detection, multiclass classification must further judge the intrusion type, so the problem is more complicated and more difficult. As shown in Table 4, in terms of the most important multiclass performance indicator, the detection accuracy of decision tree (C4.5), naïve Bayes, random forest, GAR-forest, ANN and self-taught learning is below 80%, with the ANN model obtaining the highest detection accuracy (79.9%). In our experiments, we further enhanced the multiclass classification of the ANN model through the ID-GAN framework, which clearly improved its detection performance: the detection accuracy, precision and f1-score improved by about 3.25%, 2.87% and 2.78%, respectively. In [33], the DBN optimized for multiclass classification achieved the highest reported detection accuracy (82.36%), and the classification accuracy of the CNN algorithm was 78.42% [34]. Even so, our proposed approach is also superior to those methods.

From the above experimental results, we believe that the samples continuously generated by the generative model (as a sixth class of labeled samples) during adversarial training provide useful and complementary information for the classifier and help improve the accuracy, precision, f1-score and other performance indicators. This verifies that the ID-GAN framework can generate more useful training samples from limited data and increase the number and diversity of training samples, thereby improving the generalization of the classifier.

To give an everyday example: even if no one teaches you to recognize particular words, extra practice in deciding ‘this is not the word’ is still beneficial to recognizing words. Rough feedback is better than no feedback.

In terms of the overall performance comparison, the average detection rate of the enhanced model on the test set over the 100 training iterations exceeds that of the original classifier, especially during the training interval of approximately [20, 80] at the 100, 500, 2000, 8000 and 10,000 observation points. By exploiting complementary and helpful information from the generative model, the ID-GAN framework significantly improves the detection performance of the original classifier. Comparing the optimal models, the optimal classifier obtained through adversarial training is superior to the original classifier in terms of accuracy, precision, f1-score and other performance indicators, and especially improves the detection of Probe and R2L attacks.

Hence, the ID-GAN framework with supervised adversarial training can enhance the generalization of the original classifier in detecting attacks, and can serve as a framework for improving the performance of intrusion detection classifiers.

8 Conclusion and future work

In this paper, we transform the GAN into a supervised learning approach and propose an intrusion detection framework based on GAN that uses adversarial training to enhance the classifier. The ID-GAN framework continually generates ‘fake’ labeled samples with the generative model G, assists the classifier in improving its detection performance, and enhances the generalization of the original classifier. The approach proposed in this paper provides a feasible method for enhancing classifiers. It also offers a new technique and line of thought for research and practice in related fields, such as botnet detection.

Like other GAN-based methods, the proposed approach may suffer from training difficulties and a lack of diversity in the generated samples. As with other deep learning applications, open scientific problems of interpretability and self-adaptation remain. In future research, we will study the effect of the ID-GAN framework on other classifiers and further study optimized training methods. Furthermore, we will focus on the hyperparameters of the ID-GAN framework, such as the value of n_times in prior training and the number of update steps applied to the classifier.