
1 Introduction

Traditional vehicular ad hoc networks (VANETs) [1] are gradually evolving into intelligent vehicular networks. While communicating over the network, vehicles are vulnerable to malicious traffic, which may lead to privacy leakage because some devices lack security mechanisms such as firewalls and gateways [1,2,3]. Improving the active defense capability and security of vehicular networks is therefore an important and popular research direction [3, 4]. Traditional intrusion detection techniques can detect ongoing and existing malicious attacks in a timely manner. However, in massive, unevenly distributed network flow, malicious attacks are often hidden among a large amount of normal data, which makes it difficult for traditional intrusion detection methods to cope with evolving attacks and network threats [1, 5].

Currently, the main methods for handling imbalanced data [6] include resampling methods [7], cost-sensitive algorithms, ensemble methods, and the decoupling of feature representation from classification. These methods attempt to rebalance the contribution of each class in the machine learning model, for example by increasing the number of samples of minority attacks. However, these traditional algorithms still have limitations. Generative adversarial networks (GANs) [8, 9] can learn the distribution of given data and generate new samples. GANs have so far mostly been applied to natural images [10, 11], where they have achieved significant results. Inspired by this success, researchers are gradually starting to use GANs to generate adversarial network flow for intrusion detection.

Current intrusion detection approaches for vehicular networks arguably have two drawbacks: (1) The network flow data is unevenly distributed, and the number of samples in the minority classes is too small. The commonly used data augmentation algorithm, SMOTE [12, 13], does not consider noisy data or class boundaries, which may cause overlap between different categories and lead to decreased accuracy and overfitting. (2) Some current intrusion detection models for vehicular networks have a low detection rate and weak classification ability. To address these drawbacks, a new intrusion detection model is proposed that combines GANs with a Deep Belief Network (DBN) [14]. It uses a GAN-based data augmentation method [15] to generate adversarial attack samples for the minority classes in order to expand the CIC-IDS2017 dataset, and an improved DBN classifier is designed to evaluate the effectiveness of this method [14, 16]. Achieving better classification performance, with higher accuracy and precision, is the main innovation and challenge of this paper.

2 Intrusion Detection Model Based on Data Augmentation

2.1 Data Processing and Augmentation

Dataset Analysis and Preprocessing. Because vehicle datasets may leak user privacy, they are generally not open to the public, so this paper adopts an open-source network intrusion detection dataset. Among the existing open-source datasets, CIC-DDoS2019 focuses on the classification of DDoS attacks, and CSE-CIC-IDS2018 is mostly used for anomaly detection. This paper, however, studies data augmentation and intrusion detection for minority classes in vehicular networks. The dataset used is therefore CIC-IDS2017 [17], provided by the Canadian Institute for Cybersecurity. It contains 5 days of normal and attack flow data collected by the institute, with each record having 78 network features. The dataset includes recent cyber attacks and meets the criteria of real-world attacks, and it can be used to simulate attacks on vehicular networks [18].

After the 8 CSV files of the CIC-IDS2017 dataset were merged, records with missing or NaN values were removed to avoid redundancy and exploding gradients when training the model. The dataset was then converted to numerical values and normalized. The most frequently occurring class was labeled 'Normal' and all other classes were labeled 'Attack' so that the dataset met the input requirements of the GAN.
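The preprocessing steps described here can be sketched in plain Python. This is an illustration under stated assumptions, not the authors' code: the function name `preprocess`, the row layout (a feature list paired with a label string), and the use of min-max scaling for normalization are all hypothetical, since the paper does not specify them.

```python
import math
from collections import Counter


def preprocess(rows):
    """Drop incomplete rows, min-max normalize features, and binarize labels.

    Each row is (features, label), where features is a list of floats.
    The row layout and the min-max scaling are assumptions for illustration.
    """
    # Remove records that contain missing (None), NaN, or infinite values.
    clean = [
        (feats, label)
        for feats, label in rows
        if all(v is not None and math.isfinite(v) for v in feats)
    ]
    if not clean:
        return [], []

    # The most frequent class becomes 'Normal'; every other class becomes 'Attack'.
    majority = Counter(label for _, label in clean).most_common(1)[0][0]
    labels = ["Normal" if label == majority else "Attack" for _, label in clean]

    # Min-max normalize each feature column to [0, 1]; constant columns map to 0.
    n_features = len(clean[0][0])
    lows = [min(f[j] for f, _ in clean) for j in range(n_features)]
    highs = [max(f[j] for f, _ in clean) for j in range(n_features)]
    scaled = [
        [
            (f[j] - lows[j]) / (highs[j] - lows[j]) if highs[j] > lows[j] else 0.0
            for j in range(n_features)
        ]
        for f, _ in clean
    ]
    return scaled, labels
```

For the full dataset, a dataframe library would be the more practical choice; the plain-Python form is used only to keep each step visible.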

GANs-Based Data Augmentation Methods. To address the highly imbalanced data in vehicular network datasets, a data augmentation method based on Generative Adversarial Networks (GANs) is used to generate adversarial attack samples for the minority classes in order to expand the CIC-IDS2017 dataset [17] used in this study.

Algorithm 1. Data Augmentation Algorithm Based on GANs

Algorithm 1 describes the GAN training process used for data augmentation, where \(\theta _{G}\) and \(\eta _{\theta _{G}}\) are the weight parameters and gradients of the generator, and \(\theta _{D}\) and \(\eta _{\theta _{D}}\) are the weight parameters and gradients of the discriminator. Its main steps are:

  • (1) Input the category label \(y^{\prime }\) of the preprocessed minority data, together with the random noise vector z, to the generator G, which produces the data sample \(S_G\);

  • (2) Fix the generator G, train the discriminator D, and gradually update the weight parameter \(\theta _{D}\) of the discriminator;

  • (3) Fix the discriminator D, train the generator G, and gradually update the weight parameter \(\theta _{G}\) of the generator;

  • (4) Repeat steps (1)–(3) until \(p_g = p_{data}\), at which point the discriminator outputs 1/2 for every sample and can no longer distinguish between the two distributions, so that the generated samples closely approximate the real data samples.
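For reference, the stopping condition in step (4) follows from the standard GAN minimax objective. The conditional form below is the usual textbook formulation, with real samples x and labels y drawn from \(p_{data}\) and noise z drawn from a prior \(p_z\) (the symbol \(p_z\) is introduced here for illustration). Algorithm 1 may use a variant of this loss, so this should be read as an assumption rather than the authors' exact objective:

$$\begin{aligned} \min _{G}\max _{D} V(D,G)=\mathbb {E}_{x\sim p_{data}}\left[ \log D(x\mid y)\right] +\mathbb {E}_{z\sim p_{z}}\left[ \log \left( 1-D\left( G(z\mid y^{\prime })\right) \right) \right] \end{aligned}$$

For a fixed generator, the discriminator that maximizes this objective is \(D^{*}(x)=p_{data}(x)/\left( p_{data}(x)+p_{g}(x)\right) \), which equals 1/2 everywhere exactly when \(p_g = p_{data}\).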

Figure 1 shows the number of samples per class in the original dataset and in the minority classes after data augmentation. Data augmentation effectively increases the number of samples in classes with fewer than 5000 samples, particularly for extremely rare classes such as Heartbleed and Infiltration, which increase from 11 and 36 samples to 5632 and 8704 samples, respectively.

Fig. 1. Comparison of Original Data and Quantity after Data Augmentation

2.2 Improved DBN Model

Structurally, a Deep Belief Network (DBN) [19] is a probabilistic generative model composed of multiple layers of unsupervised Restricted Boltzmann Machines (RBMs) and a supervised Back-Propagation (BP) network. Each of the stacked RBMs consists of a visible layer and a hidden layer.

The training of a DBN [20] consists of a layer-wise pre-training stage and a back-propagation fine-tuning stage. In the pre-training stage, the Contrastive Divergence (CD) algorithm proposed by Hinton is used to quickly train each RBM and obtain an approximate representation of the input vector v. In the back-propagation fine-tuning stage, the BP algorithm and stochastic gradient descent are used to optimize the connection weights in the DBN and obtain the optimal model parameters. For the pre-training stage, the Mean Square Error (MSE) and Pseudo-Likelihood (PL) loss functions were used to evaluate the accuracy of RBM training. MSE is computed as in Eq. (1):

$$MSE=\frac{1}{m} \sum _{i=1}^{m}\left( x_{i}-\bar{x}\right) ^{2} \tag{1}$$

where m is the number of samples, \(x_{i}\) \((i=1,2,3 \ldots m)\) is the ith sample, and \(\bar{x}\) is the average of the m samples. For the fine-tuning stage, the loss and accuracy on the training set and the validation set are used to evaluate performance and thereby determine the learning rate of the model.
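As a small worked example, Eq. (1) can be computed directly from its definition. The sketch below follows the equation literally, as the mean squared deviation of the samples from their average. The function name `mse_eq1` is hypothetical.

```python
def mse_eq1(samples):
    """Mean squared deviation of the samples from their mean, as written in Eq. (1)."""
    m = len(samples)
    if m == 0:
        raise ValueError("samples must not be empty")
    mean = sum(samples) / m  # x-bar in Eq. (1)
    return sum((x - mean) ** 2 for x in samples) / m
```

For the three samples 1, 2, and 3, the average is 2 and the squared deviations are 1, 0, and 1, so the result is 2/3.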

Algorithm 2. DBN Training Algorithm

Algorithm 2 gives the DBN training procedure, where W and \(W^k\) are the weight matrices of the pre-training stage and the fine-tuning stage, respectively; \(a_{i}\) \((i=1,2,3 \ldots M)\) and \(b_{j}\) \((j=1,2,3 \ldots N)\) are the biases of the visible layer and the hidden layer, respectively; \(\epsilon \) and \(\epsilon _{ft}\) are the learning rates of the pre-training stage and the fine-tuning stage; \(V=\{v_1,v_2, \ldots ,v_m\}\) is the set of RBM training samples; l is the number of RBM layers; and \(a^k\) and \(b^k\) \((k=1,2,3 \ldots l)\) are the biases of the visible layer and the hidden layer at the kth layer, respectively.
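To make the pre-training step concrete, the following is a minimal sketch of one Contrastive Divergence update with a single Gibbs step (CD-1) for a binary RBM, written in plain Python. It uses the notation above (W, a, b, and a learning rate), but it is a generic textbook variant and not necessarily the exact procedure of Algorithm 2. The function names `sigmoid` and `cd1_update`, and the choice to use probabilities rather than binary samples in the reconstruction, are assumptions.

```python
import math
import random


def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def cd1_update(W, a, b, v0, lr, rng):
    """One CD-1 step for a binary RBM on a single sample v0 (updates in place).

    W[i][j] connects visible unit i to hidden unit j; a and b are the
    visible and hidden biases; lr is the pre-training learning rate.
    Returns the reconstruction probabilities of the visible layer.
    """
    M, N = len(a), len(b)
    # Positive phase: hidden probabilities given the data, then a binary sample.
    ph0 = [sigmoid(b[j] + sum(v0[i] * W[i][j] for i in range(M))) for j in range(N)]
    h0 = [1.0 if rng.random() < p else 0.0 for p in ph0]
    # Negative phase: reconstruct the visible layer, then recompute hidden probabilities.
    pv1 = [sigmoid(a[i] + sum(h0[j] * W[i][j] for j in range(N))) for i in range(M)]
    ph1 = [sigmoid(b[j] + sum(pv1[i] * W[i][j] for i in range(M))) for j in range(N)]
    # Update: data-driven statistics minus reconstruction-driven statistics.
    for i in range(M):
        for j in range(N):
            W[i][j] += lr * (v0[i] * ph0[j] - pv1[i] * ph1[j])
        a[i] += lr * (v0[i] - pv1[i])
    for j in range(N):
        b[j] += lr * (ph0[j] - ph1[j])
    return pv1
```

A call on one toy sample with three visible and two hidden units might look like this:

```python
rng = random.Random(0)
W = [[rng.gauss(0.0, 0.01) for _ in range(2)] for _ in range(3)]
a, b = [0.0] * 3, [0.0] * 2
reconstruction = cd1_update(W, a, b, [1.0, 0.0, 1.0], lr=0.015, rng=rng)
```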

3 Experimental Design and Result Analysis

The experiments were conducted in a Windows 10 environment, with a 64-bit Intel(R) Xeon(R) Silver 4100 CPU and 32 GB RAM. The implementation used Python 3.8 and the PyTorch 1.9 framework.

3.1 Dataset Labels

Table 1. Number of Labels After Relabeling

The proposed intrusion detection model was evaluated using the CIC-IDS2017 dataset. After data augmentation, attack classes with similar characteristics and behaviors were merged into new classes, and the dataset was re-labeled. The final standard dataset was divided into 9 classes. Table 1 shows the number of labels in the standardized dataset after re-labeling.

3.2 Experiments and Analysis

To evaluate the detection performance of the proposed intrusion detection model, the following experiments were designed.

Table 2. GANs Training Parameters
Table 3. DBN Network Parameters
Fig. 2. Dataset Classification Results: (a) Training Set Classification Results. (b) Testing Set Classification Results. (c) Validation Set Classification Results.

Experiments on Training GANs-DBN Model. GANs were used to generate samples for the 8 minority classes in the dataset. The GANs training parameters are shown in Table 2. After training, the dataset was re-labeled to generate the standard dataset, which was then divided into training, testing, and validation sets in a 60%/20%/20% ratio. Finally, the data was input into the DBN classifier for model evaluation. The parameter settings for the DBN classifier are shown in Table 3.
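The 60%/20%/20% partition can be sketched as follows. The paper does not state whether the split was stratified or how it was seeded, so the simple seeded random shuffle and the function name `split_dataset` are assumptions for illustration.

```python
import random


def split_dataset(samples, ratios=(0.6, 0.2, 0.2), seed=0):
    """Shuffle and split samples into training, testing, and validation subsets."""
    if abs(sum(ratios) - 1.0) > 1e-9:
        raise ValueError("ratios must sum to 1")
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_train = int(len(shuffled) * ratios[0])
    n_test = int(len(shuffled) * ratios[1])
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    val = shuffled[n_train + n_test:]  # the remainder goes to validation
    return train, test, val
```

With highly imbalanced classes, a stratified split that preserves per-class proportions in each subset is often preferable; the sketch keeps the simpler random form.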

Fig. 3. Comparison of RBM Training Performance at Different Learning Rates

Fig. 4. Comparison of Accuracy and Loss at Different Learning Rates

During the training of the DBN classifier, the learning rate of the RBM in the pre-training stage was determined by varying it over the approximate range [0.001, 0.1]. As shown in Fig. 3, when the learning rate is lr = 0.015, MSE = 0.538 and PL = -0.818, and the training accuracy of the RBM is better than at the other learning rates. Therefore, lr = 0.015 was selected as the learning rate for RBM training. In the back-propagation fine-tuning stage, the loss and accuracy on the training set and the validation set were used to determine the optimal learning rate. Fig. 4 shows that when the learning rate is lr = 0.005, the loss reaches its minimum, with train-loss = 0.453 and val-loss = 0.419, and the accuracy reaches its maximum, with train-acc = 0.993 and val-acc = 0.990. When the learning rate is too high, such as lr = 0.1, the loss reaches 0.98 and the model fails to converge. Therefore, lr = 0.005 was selected as the learning rate for the back-propagation fine-tuning stage.

The classification results of the proposed model for the training set, test set, and validation set are shown in Fig. 2, and the confusion matrix for the predicted classes in the test set is presented in Table 4. The results show that the proposed model correctly classifies most of the network flow, with high precision, recall, and F1 score. The precision, recall, and F1 score for minority classes such as Brute Force, DDoS, and PortScan are close to 1. The precision, recall, and F1 score for extremely rare classes such as Heartbleed and Infiltration are also above 70%, with a recall of 99.7% and a precision of 100% for Heartbleed. The proposed model therefore has strong detection performance for minority-class attacks in vehicular networks while maintaining high performance in detecting other attacks.
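Per-class precision, recall, and F1 score can be derived from a confusion matrix as sketched below. The function name `per_class_metrics` and the row/column orientation (rows are true classes, columns are predicted classes) are assumptions. The orientation of Table 4 should be checked before applying this to its values.

```python
def per_class_metrics(confusion):
    """Compute (precision, recall, F1) per class from a square confusion matrix.

    confusion[i][j] is the number of samples of true class i predicted as
    class j (this orientation is an assumption).
    """
    n = len(confusion)
    metrics = []
    for k in range(n):
        tp = confusion[k][k]
        fp = sum(confusion[i][k] for i in range(n)) - tp  # predicted k, true class differs
        fn = sum(confusion[k]) - tp  # true class k, predicted as another class
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        metrics.append((precision, recall, f1))
    return metrics
```

For an illustrative two-class matrix `[[8, 2], [1, 9]]` (toy values, not taken from Table 4), class 0 has 8 true positives, 1 false positive, and 2 false negatives, giving a precision of 8/9, a recall of 0.8, and an F1 score of 16/19.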

Table 4. Testing Set Confusion Matrix

Performance Comparison Experiments of Different Data Augmentation Methods. To verify the effectiveness of the proposed data augmentation method, this study compared four approaches: the SMOTE algorithm, a class weight strategy, the combination of the SMOTE algorithm and the class weight strategy, and GANs. Each was combined with the DBN classifier for model evaluation. The parameters of the DBN network were kept consistent for each method, and the specific parameter settings are shown in Table 3.

Table 5. Comparison of GANs with Other Data Augmentation Methods

Table 5 shows that, compared with the other three commonly used data augmentation methods, the proposed model improves accuracy, F1 score, and AUC by at least 1%, 2%, and 2%, respectively. Figure 5 compares the offline AUCs for different classes using the various data augmentation methods. The proposed GANs-DBN intrusion detection method outperforms the other methods in overall performance, although it may not perform as well as some of them for certain classes. Overall, this method improves the accuracy of intrusion detection across classes.

Fig. 5. Comparison of AUC for Different Data Augmentation Methods

Fig. 6. Performance Comparison of Different Models

Performance Comparison Experiment of Different Models. To verify the intrusion detection performance of the proposed model, the performance of GANs-DBN was compared with several existing intrusion detection models evaluated on the CIC-IDS2017 dataset, including the model of Othmane Belarbi [21]. Monika Roopak et al. [22] evaluated LSTM, CNN + LSTM, and SVM models for DDoS attack detection on the CIC-IDS2017 dataset. The LSTM model reached a final accuracy of 86.34%, the CNN + LSTM model 97.16%, and the SVM model 95.5%. The detection performance of the GANs-DBN model used in this paper was compared against the performance data reported in these reference papers.

Fig. 6 shows that the proposed model outperforms the other models on all three indicators, reaching 99.27%, 99.80%, and 99.70%, which represent improvements of at least 1.03%, 1.36%, and 0.58%, respectively. Thus, the proposed model significantly improves multi-class intrusion detection performance compared with the other models.

4 Conclusion

This paper presents an integrated network intrusion detection model, GANs-DBN, designed to address the low detection performance for small quantities of malicious flow in vehicular networks caused by the uneven distribution of network attacks. The performance of the model is evaluated using the CIC-IDS2017 dataset. Specifically, GANs are employed for data augmentation, expanding the dataset and enriching its distribution, while an improved DBN classifier is used to assess the model's classification capability. Experimental results demonstrate that the proposed model outperforms alternative methods in overall detection performance, effectively enhancing the detection rate for specific classes of attacks and thereby improving overall accuracy. However, the current research only partially simulates real network conditions, and future work should focus on identifying and defending against the complex traffic characteristics encountered in actual vehicular networks, particularly APT attacks.