1 Introduction

Machine learning and data mining techniques have attracted researchers worldwide owing to their superior performance in various application domains. In the field of intrusion detection systems (IDS), these techniques demonstrate promising results by predicting future attack patterns using learning paradigms [1]. Learning is the process of constructing a predictive model from a dataset. It comprises several categories, including supervised, unsupervised, and reinforcement learning.

In particular, supervised learning uses classified (labeled) training data to create a model that would then be applied to future unknown data. Furthermore, the objective of supervised learning systems is to obtain a high classification accuracy and to reduce the false positive rate [2, 3].

However, if IDSs are applied to real world environments, there are problems that reduce accuracy and increase false detection rate. The purpose of this study was to analyze the performance degradation caused by data imbalances in learning data.

A data imbalance problem occurs when the number of observations explored for the purpose of classification differs significantly by class. For example, if there are 10,000 samples in the X1 category, but only 20 samples in the X2 category, then X2 is considered relatively small when compared to X1. The data imbalance issue is a factor that degrades the classification performance by treating the minor class instance as a major class instance to minimize the error rate. Data imbalance in learning data is a problem of algorithm degradation that is addressed in a variety of studies including those on intrusion detection systems. Initially, Zhu et al. [4] proposed a way to improve classification accuracy by using a generative adversarial network (GAN) to categorize data objectively. Douzas and Bacao’s [5] work suggested ways to improve classification performance by modifying the synthetic minority oversampling technique (SMOTE) algorithm through data combination by creating a sample of minority classes using GAN. Finally, Mariani [6] proposed a study that uses GAN to improve classification of datasets using landscape photographs. The common factor in all the above studies was GAN, one of the generative models of deep learning that was used to solve the data imbalance problem caused by small amounts of data in a class.
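The effect can be illustrated with the counts from the example above. The following minimal sketch assumes a hypothetical classifier that always predicts the majority class X1; the function name and the classifier are illustrative only and are not part of this study.

```python
def majority_baseline(n_major: int, n_minor: int) -> dict:
    """Score a classifier that labels every sample as the majority class."""
    total = n_major + n_minor
    accuracy = n_major / total        # every majority sample is classified correctly
    minority_recall = 0 / n_minor     # no minority sample is ever detected
    return {"accuracy": accuracy, "minority_recall": minority_recall}


# Counts taken from the X1/X2 example in the text.
result = majority_baseline(10_000, 20)
print(f"accuracy={result['accuracy']:.4f}, "
      f"minority recall={result['minority_recall']:.1f}")
```

Such a classifier reaches an accuracy of about 99.8% while detecting none of the X2 samples, which is why accuracy alone hides the degradation on minority classes.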

The GAN is an algorithm that produces data so realistic that real data and generated (fake) data cannot be told apart. Therefore, it is a suitable model for solving performance degradation problems caused by smaller classes [7]. In this study, similar network traffic was generated for classes with low data counts using the proposed Wasserstein GAN with gradient penalty (WGAN-GP). A Fourier transform based filter was applied with the aim of improving the classification accuracy and performance of IDSs. To evaluate the proposed method, a classification experiment was conducted on each dataset using a convolutional neural network (CNN). In this experiment, the performance of the deep learning based intrusion detection system was improved by reducing the data imbalance of the dataset used for network traffic classification.

In Sect. 2, related work on intrusion detection using deep learning is described. Section 3 provides details of how similar traffic will be generated using the proposed GAN. Section 4 discusses the experimental results and Sect. 5 outlines the conclusions of the study.

2 Related Work

2.1 Deep Learning Based Intrusion Detection

Research on intrusion detection using deep learning was conducted to improve the accuracy of intrusion detection by extracting the attack data characteristics using the undersampling and few-shot methods [8, 9]. Most studies on intrusion detection use the undersampling method to increase the layers of the neural network because it is effective for detecting certain classes [10]. However, the undersampling method exhibits a low accuracy in a real-world environment although its accuracy in an experimental environment is high [26]. In a real-world environment, the data imbalance problem becomes worse than it is in the experimental environment and the detection rate decreases [11]. Therefore, in this study, to apply the oversampling method as a solution to the data imbalance problem, a similar traffic generation method that will increase the attack class traffic training data using a GAN is proposed, as shown in Fig. 1.

Fig. 1. Learning method using the oversampling method

2.2 GAN Learning Based Intrusion Detection

Research on malware detection using a GAN has been conducted. Initially, Chintala [12], Gulrajani et al. [13], Kim, Bu and Cho [14] described the detection of malicious software using a GAN that used an autoencoder to improve performance through learning and transferring weights. They verified the accuracy of this method to be 96.3%. Sun et al. [15] proposed solutions to classify distributed DoS (DDoS) attacks with flash crowds (FC) using a least-squares GAN. A method to distinguish legitimate traffic from attackers is to analyze traffic behavior and statistics for users of DDoS and FC.

Yin et al. [16] focused on increasing the botnet detection model, eliminating the use of network payload information, and improving the performance of GAN. Li et al. [17] proposed a GAN anomaly detection method. Several sensors were time series modeled and an anomaly detection method was proposed that utilized actual data sample differences. In research by Lin et al. [18], the IDSGAN (Generative Adversarial Networks for Attack Generation against Intrusion Detection) extracted session data used by a botnet to generate and classify similar traffic and utilized the NSL-KDD dataset to generate traffic and increase the classification model. The generated data contributed to the improved learning performance of the RF and principal component analysis models.

As mentioned earlier, research using the GAN is advancing rapidly. GAN was used to improve classification performance. However, the data imbalance problem for the network traffic classifier has not yet been resolved.

2.3 Dataset Analysis

The datasets generally used in intrusion detection studies using deep learning were analyzed. The NSL-KDD and UNSW-NB15 datasets have insufficient attack traffic when compared to normal traffic. Furthermore, the DDoS attack traffic accounts for more than 50% of all attack traffic, causing the imbalanced data problem. Dainotti, Pescape, and Claffy [19] suggested that the uniformity of shared trace datasets for test data is the primary obstacle to traffic classification method progress. Many studies of abnormal traffic classification use traffic from self-collection tools or security companies, which are unreliable.

Because the classical machine learning approach focuses on feature selection techniques, many current public traffic datasets are flow feature datasets, not raw traffic datasets. For example, the well-known KDD Cup 1999 and NSL-KDD datasets provide 41 predefined features, and there are several similar datasets [20].

UNSW-NB15 [21] is a dataset composed of nine types of attack traffic and normal traffic packets created by the IXIA PerfectStorm tool at the Cyber Range Lab at the Australian Centre for Cybersecurity (ACCS). The problem of data imbalance was analyzed for the NSL-KDD and UNSW-NB15 datasets to determine the accuracy of classification of the minority class data.

3 Proposed Method

3.1 Network Feature Extraction

Deep packet inspection (DPI) was conducted on the attack packets in the dataset. In this study, 15 dataset characteristics, such as duration, header length, IP version, protocol, flag, and session, were extracted and correlated to detect malware. Through correlation analysis, six features, including duration, transmission control protocol (TCP) header, port, session data, and flag, were selected and applied to the GAN algorithm. To adjust the size of the data, a hash value was applied to quantify it. This allowed images of the same size to be created.

Mode-specific normalization was used for numerical variables. Numerical variables in tabular datasets sometimes follow a multimodal distribution. Gaussian kernel density estimation was used to estimate the number of modes of each continuous variable.

In the three datasets we used in this study, we found 22 out of 27 continuous variables in the NSL-KDD dataset that had multiple modes. Simply normalizing numerical features to [−1, 1] and using tanh activation to generate these features did not work well. For example, when there is a mode close to −1 or 1, the gradient will saturate when back-propagating through tanh.
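The saturation effect can be checked numerically. The short sketch below evaluates the derivative of tanh, 1 − tanh(z)², at a few pre-activation values; the helper name `tanh_grad` and the chosen values of z are illustrative assumptions.

```python
import math


def tanh_grad(z: float) -> float:
    """Derivative of tanh at pre-activation z: 1 - tanh(z)^2."""
    return 1.0 - math.tanh(z) ** 2


# As the output approaches 1, the gradient passed back through tanh shrinks.
for z in (0.0, 1.0, 3.0):
    print(f"z={z:.1f}  tanh={math.tanh(z):.4f}  grad={tanh_grad(z):.4f}")
```

When the target value lies near −1 or 1, the generator output must sit in this flat region, so little gradient reaches the earlier layers.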

To effectively sample values from a multimodal distribution, we cluster values of the numerical variables using a Gaussian Mixture model (GMM). GMM for classification using input data X is performed as outlined in Fig. 2.

Fig. 2. GMM classification algorithm
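The clustering step can be sketched as follows. This is a minimal one-dimensional EM fit written with the Python standard library, not the implementation used in this study; the function names, the number of components, the scaling by twice the standard deviation, and the clipping range are all assumptions made for illustration.

```python
import math
import random


def fit_gmm_1d(values, k=2, iters=50):
    """Fit a 1-D Gaussian mixture with EM; returns (weights, means, stds)."""
    lo, hi = min(values), max(values)
    # Spread the initial means evenly across the data range.
    means = [lo + (i + 0.5) * (hi - lo) / k for i in range(k)]
    stds = [(hi - lo) / (2 * k) or 1.0] * k
    weights = [1.0 / k] * k
    norm = math.sqrt(2 * math.pi)
    for _ in range(iters):
        # E-step: responsibility of each component for each value.
        resp = []
        for x in values:
            dens = [w * math.exp(-0.5 * ((x - m) / s) ** 2) / (s * norm)
                    for w, m, s in zip(weights, means, stds)]
            total = sum(dens) or 1e-300
            resp.append([d / total for d in dens])
        # M-step: re-estimate weights, means, and standard deviations.
        for j in range(k):
            nj = sum(r[j] for r in resp) or 1e-300
            weights[j] = nj / len(values)
            means[j] = sum(r[j] * x for r, x in zip(resp, values)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, values)) / nj
            stds[j] = max(math.sqrt(var), 1e-6)
    return weights, means, stds


def mode_normalize(x, weights, means, stds):
    """Return (mode index, value scaled within that mode and clipped)."""
    # The constant 1/sqrt(2*pi) is dropped because it does not change the argmax.
    dens = [w * math.exp(-0.5 * ((x - m) / s) ** 2) / s
            for w, m, s in zip(weights, means, stds)]
    mode = max(range(len(dens)), key=dens.__getitem__)
    scaled = (x - means[mode]) / (2 * stds[mode])   # assumed scaling rule
    return mode, max(-0.99, min(0.99, scaled))


# Synthetic bimodal feature, used only to exercise the functions.
rng = random.Random(0)
data = ([rng.gauss(0.0, 1.0) for _ in range(300)]
        + [rng.gauss(10.0, 1.0) for _ in range(300)])
weights, means, stds = fit_gmm_1d(data, k=2)
mode, scaled = mode_normalize(9.5, weights, means, stds)
print(means, mode, scaled)
```

Each value is then represented by its mode index together with a bounded offset inside that mode, so no single mode is pushed to the saturated ends of the tanh range.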

Smoothing was used for categorical variables. In generating categorical variables, the model faces a challenge similar to the one it faces in natural language generation: how to make the model differentiable. In natural language generation, people use reinforcement learning [22] or Gumbel Softmax [23] to deal with this issue. We face a similar challenge; however, the number of categories is much smaller than the size of the vocabulary in natural language. Therefore, we can generate the probability distribution directly using Gumbel Softmax. However, we find it necessary to convert categorical variables to a one-hot encoding representation and add noise to binary variables. In this study, sequential data (payload, URL, etc.) were split into words and then converted into vectors using a tokenizer.
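A Gumbel Softmax sample for a small categorical variable can be sketched as follows. The function signature, the temperature value, and the three protocol scores are hypothetical and serve only to illustrate the sampling step.

```python
import math
import random


def gumbel_softmax(logits, tau=1.0, rng=random):
    """Draw a relaxed one-hot sample: softmax((logits + Gumbel noise) / tau)."""
    noisy = []
    for logit in logits:
        # Gumbel(0, 1) noise by inverse transform; keep u strictly inside (0, 1).
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        noisy.append((logit - math.log(-math.log(u))) / tau)
    peak = max(noisy)                      # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
protocol_logits = [2.0, 0.5, -1.0]   # hypothetical scores for tcp, udp, icmp
sample = gumbel_softmax(protocol_logits, tau=0.5, rng=rng)
print(sample)
```

Lower values of `tau` push the sample closer to a one-hot vector, while the output remains a differentiable function of the logits.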

3.2 Image Generation

Network traffic is stored in .pcap files and was converted into images in the Portable Network Graphics (PNG) file structure, as outlined in Fig. 3. The session data was extracted from the .pcap file and the byte size was used to create the image with the correct proportions. The conversion was made because WGAN-GP generates similar traffic based on images. The images created were stored for each class and used as training data.

Fig. 3. Illustration of similar traffic created based on a GAN
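One way to turn session bytes into an image is sketched below. The square layout, the zero padding, and the function name are assumptions made for illustration; the exact layout used in this study is not specified beyond the use of the byte size.

```python
import math


def bytes_to_square_image(payload: bytes) -> list[list[int]]:
    """Lay session bytes out row by row as a square grayscale image (0-255)."""
    # Smallest side length whose square holds every byte (ceil of the square root).
    side = math.isqrt(len(payload) - 1) + 1 if payload else 1
    padded = payload + bytes(side * side - len(payload))   # zero padding
    return [list(padded[r * side:(r + 1) * side]) for r in range(side)]


# Hypothetical session bytes; real input would come from a parsed .pcap session.
image = bytes_to_square_image(b"GET /index.html HTTP/1.1\r\n")
print(len(image), len(image[0]))
```

Each pixel value is one byte of the session, so the resulting grid can be saved as a grayscale image with any imaging library.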

3.3 Training

The generator and critic architecture of the WGAN-GP model used for training is illustrated in Fig. 4 [24] and is one of the algorithms validated by Gulrajani [11]. The critic-to-generator training iteration ratio was maintained at 5:1; that is, for every five training iterations of the critic, the generator was updated once. A default value of \( \lambda \) = 10, as suggested in [13], was used as the gradient penalty hyperparameter in the WGAN-GP loss function. The Adam [25] optimization algorithm was used for training the network with a learning rate of 0.0001, \( \beta_{1} \) = 0, and \( \beta_{2} \) = 0.9. The model was trained for approximately a day across 45,000 generator iterations with final negative critic losses (averaged across mini-batches of an epoch) of 0.4262 on the training set and 1.365 on the validation set. Figure 4 details the training cost (across each iteration) and the validation cost (for every 100 iterations) of the critic on a minibatch of 64 images. This demonstrated a stable training run without any overfitting. The Wasserstein distance estimate converged close to zero consistently across training and validation. Figure 4 shows a sample of 64 synthetic images generated by the trained generator network G.
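For reference, the critic loss of WGAN-GP, as formulated by Gulrajani et al. [13], can be written as shown below. Here \( D \) denotes the critic, \( P_{r} \) and \( P_{\theta} \) are the real and generated data distributions, and \( \hat{x} \) is sampled uniformly along straight lines between a real sample \( x \) and a generated sample \( \tilde{x} \); the symbols \( D \), \( \tilde{x} \), and \( \hat{x} \) are taken from [13] and are not otherwise used in this study.

$$ L = E_{\tilde{x} \sim P_{\theta}} \left[ D\left( \tilde{x} \right) \right] - E_{x \sim P_{r}} \left[ D\left( x \right) \right] + \lambda E_{\hat{x}} \left[ \left( \left\| \nabla_{\hat{x}} D\left( \hat{x} \right) \right\|_{2} - 1 \right)^{2} \right] $$

With \( \lambda \) = 10, the last term penalizes the critic whenever the norm of its gradient at \( \hat{x} \) deviates from 1.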

Fig. 4. Architecture of the generator and the critic networks of the WGAN-GP model

Similar traffic was generated for each class using the WGAN-GP algorithm. As the criterion for generating similar traffic, generated traffic was added to the traffic data of a class only when the generator loss function was 0.98 or higher and the discriminator loss function was 0–1 or higher. In addition, a Fourier transform was applied to each characteristic of the packet to convert the time domain into a value in the frequency domain to allow learning. Working with frequency-domain values reduced the computation required to classify network traffic, improving performance. Finally, the Fourier transform can be used to apply high-frequency filtering to the images generated by WGAN in order to validate the data with code transformations. In addition, images created using .pcap data were filtered through Fourier transforms. The advantage of the Fourier transform is its ability to highlight the features of images in the encoding process. Additionally, the loss value is minimized when decoding a generated image.
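A Fourier-domain filter of this kind can be sketched with NumPy as follows. The sketch assumes a low-pass filter that keeps a central block of frequency coefficients; the function name, the `keep_fraction` parameter, and the rectangular mask are illustrative assumptions, and the filter actually used in this study may differ.

```python
import numpy as np


def fourier_lowpass(image: np.ndarray, keep_fraction: float = 0.25) -> np.ndarray:
    """Zero out high-frequency DFT coefficients and return the filtered image."""
    spectrum = np.fft.fftshift(np.fft.fft2(image))   # move low frequencies to the centre
    rows, cols = image.shape
    half_r = max(1, int(rows * keep_fraction / 2))
    half_c = max(1, int(cols * keep_fraction / 2))
    cr, cc = rows // 2, cols // 2
    # Keep a block of coefficients that is symmetric around the zero frequency.
    mask = np.zeros_like(spectrum, dtype=bool)
    mask[cr - half_r:cr + half_r + 1, cc - half_c:cc + half_c + 1] = True
    filtered = np.fft.ifft2(np.fft.ifftshift(spectrum * mask))
    return filtered.real


# Random 16x16 image standing in for a traffic image.
rng = np.random.default_rng(0)
image = rng.random((16, 16))
smooth = fourier_lowpass(image)
print(smooth.shape)
```

Because the zero-frequency coefficient is retained, the mean intensity of the image is preserved while fine-grained pixel noise is suppressed.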

The following Fig. 5 describes the creation of a similar image using WGAN-GP and Discrete Fourier Transform (DFT). DFT can filter images to maximize the characteristics of the data to enhance learning performance. Similar traffic creation criteria were used to generate analogous traffic for a class when the amount of all attack traffic data in the class was less than 40%. The data generated by GAN algorithm was used only for training data, because over-fitting problems can occur when using validation data for testing. Therefore, only the original data was used for testing.
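The 40% criterion can be sketched as a simple selection rule. The sketch interprets the criterion as the share of a class in all attack traffic; that interpretation, the function name, and the class counts are assumptions chosen only for illustration.

```python
def classes_to_augment(attack_counts: dict[str, int], threshold: float = 0.40) -> list[str]:
    """Return attack classes whose share of all attack traffic is below threshold."""
    total = sum(attack_counts.values())
    if total == 0:
        return []
    return [name for name, count in attack_counts.items() if count / total < threshold]


# Hypothetical counts, not taken from NSL-KDD or UNSW-NB15.
counts = {"dos": 6000, "probe": 2500, "r2l": 1400, "u2r": 100}
selected = classes_to_augment(counts)
print(selected)
```

Only the selected classes would receive generated samples, and those samples would be added to the training split alone.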

Fig. 5. GAN image creation algorithm

4 Performance Analysis

This study investigated a method to improve detection accuracy by creating similar traffic with a GAN and increasing the ratio of attack traffic. The method presented in this paper addresses the performance degradation caused by data imbalance in datasets using a GAN. Existing datasets have a data imbalance problem that impedes the detection of attacks and limits learning. As a solution to this problem, we added similar traffic generated by the GAN algorithm to the training data and improved attack detection accuracy in experiments. First, the GAN algorithms were compared by generating similar traffic with DC-GAN and WGAN-GP. WGAN-GP exhibited increased precision and reduced learning time when compared to DC-GAN. Based on this result, the datasets were then compared using data generated with WGAN-GP.

It was observed that accuracy improved and false positive rates declined. The time required for data generation was also reduced when WGAN-GP was used instead of DC-GAN. Mode collapse is a phenomenon in which the generator keeps producing similar images for one class. When it occurs, images continue to be produced for specific classes only, so the result can be used only to identify a particular class. In the case of network traffic, the difference in the amount of data between classes is large and similar data is repeatedly generated.

If the loss function does not properly represent the distance between the actual and the generated data distribution, eventually the gradient fails to properly update the parameters, making convergence difficult. There are various ways to measure the distance between the two distributions \( P_{r} \) and \( P_{\theta} \). For WGAN-GP, we used the Wasserstein distance, which is defined as follows [11]:

$$ W_{p} \left( P_{r}, P_{\theta} \right) = \inf_{\gamma \in \Gamma} E_{\left( x,y \right) \sim \gamma \left( x,y \right)} \left( \left| x - y \right|^{p} \right) $$
(1)

The Wasserstein distance corresponds to the cost of moving mass to transform one distribution into the other, where \( \gamma \) ranges over the joint distributions whose marginal probability distributions match the two given distributions. It is the value obtained when the expected distance under the joint distribution is minimal. In the case of network traffic, the manifold is concentrated in a small space and classes have many characteristics. Thus, performance improved when the Wasserstein distance was used because of the variety of marginal probability distributions.
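For one-dimensional samples, the quantity in Eq. (1) can be computed directly, because the optimal coupling pairs the sorted values of the two samples when p ≥ 1. The sketch below assumes two empirical samples of equal size; the function name and the sample values are illustrative.

```python
def wasserstein_1d(xs, ys, p=1):
    """Empirical transport cost between two equal-size 1-D samples.

    Follows Eq. (1) as written (no 1/p root). In one dimension, the optimal
    coupling for p >= 1 pairs the sorted samples.
    """
    if len(xs) != len(ys):
        raise ValueError("samples must have the same size")
    pairs = zip(sorted(xs), sorted(ys))
    return sum(abs(x - y) ** p for x, y in pairs) / len(xs)


real = [0.0, 1.0, 2.0]
generated = [3.0, 1.0, 2.0]
print(wasserstein_1d(real, generated))   # each sorted pair differs by 1, so the cost is 1.0
```

Unlike divergences that compare densities point by point, this cost shrinks smoothly as the generated sample moves toward the real one, which is what gives the critic a usable gradient.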

Table 1 lists the performance comparison of DC-GAN and WGAN-GP in terms of detection ratio and duration. Compared with DC-GAN, WGAN-GP reduced the false detection rate by 0.02 and also reduced the detection time.

Table 1. Online anomaly detection results on GAN intrusion detection

Based on the above experiment, the training time and the amount of training increased when normal traffic generated using WGAN-GP for each dataset category was added to the training dataset. The learning speed was analyzed when traffic was doubled on each set of data with the exception of the DDoS data. The accuracy over epochs 1–5000 was measured after increasing the total data generated by the GAN by 30% for non-normal traffic other than DDoS. Figures 6 and 7 show the accuracy for each set of data.

Fig. 6. NSL-KDD dataset precision analysis

Fig. 7. UNSW-NB15 dataset precision analysis

From Figs. 6 and 7, it was observed that the validation accuracy at epochs 3500–4000 was approximately 81–87%, an improvement of approximately 8–12% compared to training without the generated similar traffic. In addition, the accuracy on the test data increased by 6–7%. This confirmed that accuracy increased with the number of training epochs. It was also observed that detection performance improved when the data imbalance caused by classes with less data was removed using the WGAN-GP algorithm.

Table 2 contains the experimental results of the attack traffic classification for the NSL-KDD and UNSW-NB15 datasets using a CNN, with data generated using WGAN-GP added to the existing dataset. The overall validation accuracy for categorizing attack traffic improved by 16% and the test accuracy improved by 14%. This demonstrated that using a GAN to increase the amount of attack traffic is an effective way to raise detection accuracy in a research environment where there is minimal attack traffic.

Table 2. Comparisons of attack traffic classification experiments using CNN

5 Conclusion

In this study, WGAN-GP was used to analyze datasets used in training the intrusion detection system and to resolve the problem of data imbalance. The existing datasets UNSW-NB15 and NSL-KDD were analyzed for minority classes. Based on the analysis results, 50% to 200% of the data for minority classes were generated. The generator that produced similar traffic used only values with a loss value of 0.98 to 1. The .pcap files were converted into images. These files contain a large amount of noise, causing performance degradation. In this study, a method of extracting characteristics using a DFT filter was applied, reducing the noise inherent in the image files. As a result, the discrimination ability increased when the DFT filter was applied to the existing WGAN-GP. The experimental results demonstrated that the increase in the creation of similar traffic images improved the learning rate by 3–7% and the detection accuracy for the attack traffic data by approximately 8%. Furthermore, we were able to improve classification accuracy using the attack characteristics. In the future, classification to determine the type of attack traffic and transformation of the generated images into text form using WGAN-GP will be researched.