1 Introduction

A network intrusion detection system (NIDS) monitors all traffic in a network and inspects each data packet passing through it. Many researchers have studied intrusion detection techniques to deal with network attacks effectively. Treating detection as a classification problem, machine learning algorithms extract features to identify malicious behaviors in network traffic [8]. However, the statistical characteristics of traffic have changed considerably with today's network architectures and applications, and traditional machine learning methods struggle to abstract the spatial and temporal features of abnormal traffic efficiently and accurately.

Self-taught learning is a machine learning framework for using unlabeled data in supervised classification tasks [22]. The method does not assume that the unlabeled data follows the same distribution as the labeled data. Representation learning, in turn, learns representations of the data that make it easier to extract helpful information when building predictors [5]. Inspired by these ideas, we develop a novel network intrusion detection system based on self-taught learning and representation learning.

General traffic features can be divided into two categories: spatial features, such as data packet features, and temporal features, such as network flow features. A NIDS that relies on only spatial or only temporal features often fails to escape local optima [27]. In this paper, we design one-dimensional stacked convolutional autoencoders (1D-SCAE), a self-taught learning model that abstracts spatial features by reducing the dimensionality of complex data signals. In addition, bidirectional gated recurrent units (BiGRU) extract temporal features of traffic sequences through representation learning. We therefore propose a deep neural network model based on 1D-SCAE and BiGRU, which can accurately extract spatial and temporal features and enhance the performance of malicious traffic detection. The main contributions of the proposed work include the following:

  • We design 1D-SCAE—an improved network traffic spatial feature extraction model, which uses sparse regularization to reduce overfitting by invalidating a certain part of active neurons. The greedy layer-wise strategy is adopted to achieve the best detection performance.

  • We propose a BiGRU-based temporal feature extraction model that utilizes TimeseriesGenerator to generate and model traffic time series. It can acquire both memories from history and information from the future.

  • We develop SR-IDS, a network intrusion detection system that simultaneously focuses on network traffic’s spatial and temporal characteristics. Experiments show that the accuracy of SR-IDS on the UNSW-NB15 dataset can reach 98.90\(\%\).

  • We discuss different hyperparameters to determine the optimal model architecture. Furthermore, we compare the detection performance of different RNN variants.

The rest of the paper is organized as follows. The related work on NIDS is reviewed in Sect. 2. Then we present the details of the proposed SR-IDS in Sect. 3. The accuracy and the efficiency of SR-IDS are verified in Sect. 4 by comparing it with several state-of-the-art IDS algorithms. Finally, we provide our conclusions and discuss the future work in Sect. 5.

2 Related Work

NIDS is a necessary foundation and premise for dealing with complex network attacks and identifying malicious traffic behavior. The deep learning models currently applied to network anomaly detection fall into two categories: generative intrusion detection models and discriminative intrusion detection models.

2.1 Generative Intrusion Detection Model

Generative models often adopt a hierarchical learning method to establish a multi-level model, which can flexibly analyze and restore the joint probability distribution. The best-known generative architectures are the autoencoder and its variants [18].

Amir et al. [4] designed a new lightweight architecture that considers feature separation and uses surrounding information of a single value in the feature vector. The accuracy is improved while reducing the memory footprint and the need for processing power. Iliyasu et al. [12] achieved a few-shot learning intrusion detection, which uses the feature extraction model in the few-shot learning stage to fit a classifier with a small number of novel attack samples. Long et al. [17] proposed a network intrusion detection model based on an integrated autoencoder. It uses recursive feature addition to select the optimal subset of features, which can significantly reduce the training time and improve the intrusion detection performance.

2.2 Discriminative Intrusion Detection Model

Discriminative models aim at the best recognition results by directly classifying heterogeneous data. Common discriminative architectures include recurrent neural networks and convolutional neural networks [2].

Imrana et al. [14] proposed a novel feature-driven intrusion detection system. The model first uses a statistical model to rank all the features, then uses a best-first search algorithm to find the best subset, and finally classifies the testing data based on that subset. Sahu et al. [23] proposed a multi-classification intrusion detection method based on LSTM and fully connected networks, which accurately classifies imbalanced intrusion data. Imrana et al. [13] used an improved RNN model for network intrusion detection, which can be associated with the feature knowledge and accurately classify unknown data.

Several works sought to propose ML-based solutions with consideration of as many essential features as possible, and the approaches managed to obtain interesting results. However, there are still some challenges in extracting both spatial and temporal traffic features. Inspired by existing research progress, we propose SR-IDS—a new intrusion detection system with the advantages of generative models and discriminative models. Moreover, it can serially extract the spatial and temporal features of network traffic accurately.

3 The Proposed Model

In this section, we introduce how SR-IDS works. SR-IDS first preprocesses the UNSW-NB15 dataset, including one-hot encoding and normalization. Afterward, SR-IDS uses 1D-SCAE to extract spatial features of network traffic, and the greedy layer-wise strategy is adopted to pre-train the neural network. Finally, SR-IDS uses BiGRU to extract temporal features of network traffic. BiGRU accepts input from pre-trained 1D-SCAE and outputs to the binary classifier. Figure 1 describes the framework of our proposed SR-IDS model.

Fig. 1. The framework of our proposed SR-IDS model. 1D-SCAE (marked in blue) extracts spatial features through encoding and decoding. The output of the last encoding layer of 1D-SCAE is the input of BiGRU (marked in green). BiGRU extracts temporal features by generating time series. (Color figure online)

3.1 Data Preprocessing

In general, machine learning models can only process numerical data, whereas raw traffic records also contain non-numerical (categorical) fields. To enable machine learning models to process and analyze traffic data, such features must be given a numerical encoding. One-hot encoding is a commonly used feature encoding method.

One-hot encoding represents each value of a categorical feature as a binary vector. The N possible values correspond one-to-one to the states of an N-bit register: for any value, exactly one bit is active and all the others are inactive. The representation is generally \(v_i=\{0,1,0,\dots ,0,0\}\), and the dimension of the vector equals the number N of possible values of the feature to be encoded.
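
As a minimal illustration of the encoding described above, the following sketch maps a categorical value to a binary vector with exactly one active bit. The function name `one_hot` and the example category list are assumptions made for illustration; they are not taken from the SR-IDS implementation.

```python
def one_hot(value, categories):
    """Return a binary vector with a single 1 at the position of `value`."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if c == value else 0 for c in categories]


# Hypothetical categorical feature with N = 3 possible values.
protocols = ["tcp", "udp", "icmp"]
encoded = [one_hot(p, protocols) for p in ["udp", "tcp"]]
print(encoded)  # [[0, 1, 0], [1, 0, 0]]
```

The vector length equals the number of categories, so a feature with many distinct values noticeably widens the input.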

After encoding, we use the min-max method to normalize the network traffic samples. With a fixed output range, the min-max method applies a linear transformation to the sequence \(\{x_1,x_2,\dots ,x_n\}\). After transformation, the elements of the new sequence \(\{y_1,y_2,\dots ,y_n\}\) lie in \([0,1]\) and are dimensionless:

$$\begin{aligned} y_{i}=\frac{x_{i}-\min _{1 \le j \le n}\{x_{j}\}}{\max _{1 \le j \le n}\{x_{j}\}-\min _{1 \le j \le n}\{x_{j}\}} \end{aligned}$$
(1)

Min-max normalization thus maps the original input data into [0, 1], and the scaling depends only on the extreme values.
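
The min-max transformation of Eq. 1 can be sketched for a single feature column as follows. The handling of a constant column (all values equal, which would otherwise divide by zero) is an assumed convention, and the sample values are hypothetical.

```python
def min_max_normalize(values):
    """Apply Eq. 1 to one feature column, mapping it into [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumed convention: a constant feature carries no information,
        # so every entry is mapped to 0.0 instead of dividing by zero.
        return [0.0 for _ in values]
    return [(x - lo) / (hi - lo) for x in values]


# Hypothetical packet sizes in bytes.
packet_sizes = [40, 1500, 576, 40]
normalized = min_max_normalize(packet_sizes)
```

Because only the minimum and maximum enter the formula, a single extreme outlier compresses all other values toward one end of the range.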

3.2 Spatial Feature Extraction

Spatial features of network traffic refer to feature sets related to packets, for example, packet size and number. We design a 1D-SCAE for spontaneously learning spatial feature representation, and Fig. 2 describes the architecture. In each layer, the autoencoder convolves the features of the lower layers to produce a high-level representation. The whole methodology is shown as follows:

$$\begin{aligned} x_{j}^{l}=f(\sum _{i \in M_{j}} x_{i}^{l-1} \times k_{i j}^{l}+b_{j}^{l}) \end{aligned}$$
(2)

where \(M_j\) represents the set of input feature maps, l denotes the l-th layer in 1D-SCAE, and \(k_{ij}^l\) is the convolution kernel connecting input map i to output map j. f represents the activation function, and \(b_j^l\) is the bias.
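
To make Eq. 2 concrete, the sketch below computes one 1D convolutional layer in plain Python. It assumes a ReLU activation, no padding, stride 1, a scalar bias per output map, and the sliding-window (cross-correlation) form commonly used in deep learning frameworks; none of these choices is stated in the text above, and the kernel values are hypothetical.

```python
def relu(v):
    return max(0.0, v)


def conv1d_layer(inputs, kernels, biases, activation=relu):
    """Compute the output maps x_j^l of Eq. 2.

    inputs:  list of input feature maps x_i^{l-1} (lists of equal length)
    kernels: kernels[j][i] is the kernel k_ij^l (list of floats)
    biases:  biases[j] is the scalar bias b_j^l
    """
    length = len(inputs[0])
    outputs = []
    for j, bias in enumerate(biases):
        width = len(kernels[j][0])
        out = []
        for t in range(length - width + 1):
            total = bias
            # Sum the contributions of every input map i in M_j.
            for i, x in enumerate(inputs):
                k = kernels[j][i]
                total += sum(x[t + s] * k[s] for s in range(width))
            out.append(activation(total))
        outputs.append(out)
    return outputs


# One input map, two output maps, kernel width 2 (hypothetical values).
x = [[1.0, 2.0, 3.0, 4.0]]
kernels = [[[1.0, -1.0]], [[0.5, 0.5]]]
biases = [0.0, 1.0]
maps = conv1d_layer(x, kernels, biases)
```

Each output map is shorter than the input by the kernel width minus one, which is one way a stack of such layers reduces dimensionality.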

Fig. 2. The structure of the proposed 1D-SCAE model

The 1D-SCAE consists of three convolutional autoencoders, and their encoder layers are stacked in the model construction process to build the complete 1D-SCAE model. After the training is completed, we discard the decoders and connect the last encoder layer to the subsequent temporal extraction model, which will be explained in the next subsection. MSE Loss is used to evaluate the effect of feature extraction and input reconstruction as follows:

$$\begin{aligned} J=\frac{1}{n} \sum _{i=1}^{n}(x_{i}-x'_{i})^{2} \end{aligned}$$
(3)

where i is the sample index, \(x_i\) is the original input data, and \(x'_i\) is the reconstructed data after dimensionality reduction by 1D-SCAE.

We also add a custom regularization term in 1D-SCAE to improve the generalization performance of the model. The principle is that different inputs should activate different neurons, so that each neuron responds to specific patterns in the data. The constant \(\rho \) is the target proportion of activated neurons, against which the average activation \(\hat{\rho }\) of the neurons is compared:

$$\begin{aligned} \hat{\rho }=\frac{1}{N} \sum _{i=1}^{N} \varTheta (x_{i}) \end{aligned}$$
(4)

where N is the number of neurons in the hidden layer and \(\varTheta \) is the corresponding neuron transformation. In machine learning, the forward KL divergence is often used as a training cost to measure the difference between two probability distributions. Here it keeps \(\hat{\rho }\) close to \(\rho \): the regularization term penalizes the deviation between \(\hat{\rho }\) and \(\rho \):

$$\begin{aligned} KL(\rho \Vert \hat{\rho })=\rho \log \frac{\rho }{\hat{\rho }}+(1-\rho ) \log \frac{1-\rho }{1-\hat{\rho }} \end{aligned}$$
(5)

If \(\hat{\rho }\) is equal to \(\rho \), the KL divergence is 0; otherwise, it will gradually increase as the difference between \(\rho \) and \(\hat{\rho }\) increases. Therefore, the error function \(J^{'}\) in the sparse autoencoder is shown as follows:

$$\begin{aligned} J^{'}=J+\mu \sum _{j=1}^{N} KL(\rho \Vert \hat{\rho }) \end{aligned}$$
(6)

where J is the error without the sparsity term, and \(\mu \) is the impact factor that balances the weight of the KL divergence in the entire loss function.
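
The sparse loss of Eqs. 3 to 6 can be sketched as follows. The sketch follows the equations as printed, so the same KL term is summed N times. It assumes that the activations passed in are already the transformed values \(\varTheta (x_i)\) and lie in (0, 1); the values of `rho` and `mu` and the clipping constant `eps` are hypothetical.

```python
import math


def mse(x, x_rec):
    """Reconstruction error J of Eq. 3."""
    return sum((a - b) ** 2 for a, b in zip(x, x_rec)) / len(x)


def kl_sparsity(rho, rho_hat, eps=1e-12):
    """KL(rho || rho_hat) of Eq. 5; rho_hat is clipped to avoid log(0)."""
    rho_hat = min(max(rho_hat, eps), 1.0 - eps)
    return (rho * math.log(rho / rho_hat)
            + (1.0 - rho) * math.log((1.0 - rho) / (1.0 - rho_hat)))


def sparse_loss(x, x_rec, activations, rho=0.05, mu=0.1):
    """J' of Eq. 6: reconstruction error plus weighted sparsity penalty."""
    n_hidden = len(activations)
    rho_hat = sum(activations) / n_hidden            # Eq. 4
    penalty = sum(kl_sparsity(rho, rho_hat) for _ in range(n_hidden))
    return mse(x, x_rec) + mu * penalty


# Perfect reconstruction, but the hidden layer is far denser than the target.
loss_dense = sparse_loss([0.2, 0.8], [0.2, 0.8], activations=[0.9, 0.7])
# Same reconstruction with average activation equal to rho.
loss_sparse = sparse_loss([0.2, 0.8], [0.2, 0.8], activations=[0.05, 0.05])
```

With identical reconstructions, the loss is larger whenever the average activation moves away from the target, which is the behavior the penalty is meant to produce.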

3.3 Temporal Feature Extraction

In this work, we group traffic records by timestep and link the context with their labels. Our proposed SR-IDS can accurately reflect the time characteristics of network traffic and significantly reduce the false positive rate.

SR-IDS takes the output from the spatial feature extraction model as input and uses TimeseriesGenerator—a time series generator to convert isolated samples into a sequence. After serialization, the processed traffic is input into the BiGRU. The principle of BiGRU is to split the neurons of a regular GRU into two directions, one for positive time direction and another for negative time direction.
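
The serialization step can be illustrated with a plain-Python stand-in for a time series generator. This is not the Keras utility itself: the function `make_windows`, the window length, and in particular the choice to pair each window with the label of its last record are assumptions made for illustration, and the library follows its own alignment convention.

```python
def make_windows(records, labels, length):
    """Group consecutive records into overlapping windows of `length` steps."""
    windows, targets = [], []
    for start in range(len(records) - length + 1):
        end = start + length
        windows.append(records[start:end])
        # Assumed alignment: each window takes the label of its last record.
        targets.append(labels[end - 1])
    return windows, targets


# Five hypothetical one-dimensional feature vectors with binary labels.
records = [[0.1], [0.2], [0.3], [0.4], [0.5]]
labels = [0, 0, 1, 1, 0]
X, y = make_windows(records, labels, length=3)
```

Each window has shape (timesteps, features), which is the input layout a recurrent layer expects.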

Assume that the current input vector is \(x_t\) and the activation vector of the previous step is \(h_{t-1}\); W and U are weight matrices representing the connection strength between neurons, and b is the bias vector. With \(\sigma _g\) denoting the sigmoid activation function, the update gate vector \(z_t\) and the reset gate vector \(r_t\) are computed as follows:

$$\begin{aligned} \begin{aligned} z_{t}=\sigma _{g}(W_{z} x_{t}+U_{z} h_{t-1}+b_{z}) \\ r_{t}=\sigma _{g}(W_{r} x_{t}+U_{r} h_{t-1}+b_{r}) \end{aligned} \end{aligned}$$
(7)

The candidate activation vector \(\hat{h}_t\) is obtained through the Hadamard product of \(r_t\) and \(h_{t-1}\), where \(\phi _h\) represents the hyperbolic tangent function:

$$\begin{aligned} \hat{h}_{t}=\phi _{h}(W_{h} x_{t}+U_{h}(r_{t} \odot h_{t-1})+b_{h}) \end{aligned}$$
(8)

Finally, update the activation output vector of the hidden unit \(h_t\) at time t:

$$\begin{aligned} h_{t}=(1-z_{t}) \odot h_{t-1}+z_{t} \odot \hat{h}_{t} \end{aligned}$$
(9)
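
Eqs. 7 to 9 can be traced step by step with the following plain-Python sketch. The parameter dictionary layout, the zero initial state, and the way the two directions are combined (concatenating the final forward and backward states) are assumptions; the equations above specify only a single GRU step.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def affine(W, x, U, h, b):
    """Compute W x + U h + b for list-based matrices and vectors."""
    return [
        sum(w * xi for w, xi in zip(W[r], x))
        + sum(u * hi for u, hi in zip(U[r], h))
        + b[r]
        for r in range(len(b))
    ]


def gru_step(x_t, h_prev, p):
    z = [sigmoid(v) for v in affine(p["Wz"], x_t, p["Uz"], h_prev, p["bz"])]  # Eq. 7
    r = [sigmoid(v) for v in affine(p["Wr"], x_t, p["Ur"], h_prev, p["br"])]  # Eq. 7
    gated = [ri * hi for ri, hi in zip(r, h_prev)]  # elementwise r_t * h_{t-1}
    h_hat = [math.tanh(v)
             for v in affine(p["Wh"], x_t, p["Uh"], gated, p["bh"])]          # Eq. 8
    return [(1.0 - zi) * hp + zi * hc
            for zi, hp, hc in zip(z, h_prev, h_hat)]                          # Eq. 9


def run_gru(sequence, p, hidden_size):
    h = [0.0] * hidden_size  # assumed zero initial state
    for x_t in sequence:
        h = gru_step(x_t, h, p)
    return h


def bigru(sequence, p_fwd, p_bwd, hidden_size):
    """Run one GRU forward and one backward; concatenate the final states."""
    forward = run_gru(sequence, p_fwd, hidden_size)
    backward = run_gru(sequence[::-1], p_bwd, hidden_size)
    return forward + backward


def zero_params(input_size, hidden_size):
    """Hypothetical helper: all-zero weights, for a traceable example."""
    params = {}
    for gate in "zrh":
        params["W" + gate] = [[0.0] * input_size for _ in range(hidden_size)]
        params["U" + gate] = [[0.0] * hidden_size for _ in range(hidden_size)]
        params["b" + gate] = [0.0] * hidden_size
    return params


params = zero_params(input_size=1, hidden_size=2)
params["bh"] = [10.0, -10.0]
state = bigru([[0.3], [0.7]], params, params, hidden_size=2)
```

With all weights zero, the update gate equals 0.5 at every step, so each state is the average of the previous state and the candidate; this makes the example easy to verify by hand.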

Fig. 3. Classification model based on 1D-SCAE and BiGRU

When the 1D-SCAE model is completely trained, we connect it with the subsequent BiGRU network, as shown in Fig. 3. We optimize the free parameters in BiGRU to achieve the global optimum. The binary cross entropy loss function in the final binary classification is adopted to evaluate the model as follows:

$$\begin{aligned} \xi =-\frac{1}{N} \sum _{i=1}^{N} \left[ y_{i} \log (p(y_{i}))+ (1-y_{i}) \log (1-p(y_{i})) \right] \end{aligned}$$
(10)

where i is the sample index, N is the number of samples, \(y_i\) is the binary label of the i-th sample, and \(p(y_i)\) is the predicted probability that the i-th sample belongs to class 1. When the label \(y_i\) is 1, the loss approaches 0 as \(p(y_i)\) approaches 1; conversely, as \(p(y_i)\) approaches 0, the loss grows without bound.
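
The behavior of the binary cross-entropy loss in Eq. 10 can be checked numerically with a short sketch. Here `probs` holds the predicted probability of class 1 for each sample, and the clipping constant `eps`, which keeps the logarithm finite, is an assumption rather than part of the equation.

```python
import math


def binary_cross_entropy(labels, probs, eps=1e-12):
    """Average binary cross-entropy of Eq. 10 over all samples."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(labels)


confident = binary_cross_entropy([1, 0], [0.99, 0.01])
uncertain = binary_cross_entropy([1, 0], [0.6, 0.4])
wrong = binary_cross_entropy([1], [0.001])
```

Confident correct predictions give a loss near zero, while a confident wrong prediction is penalized far more heavily than an uncertain one.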

4 Experiments

In this section, a series of experiments are conducted to verify the efficiency and accuracy of the proposed SR-IDS. Specifically, we first present the experimental settings and some details. Then we analyze some critical parameters to find the optimal solution. Lastly, we evaluate SR-IDS’s performance and compare it with some state-of-the-art methods.

4.1 Dataset

The UNSW-NB15 dataset simulates a modern representation of network traffic [19]. Each instance in the dataset is a network flow that summarizes the activity of a sequence of unidirectional packets with contextual features. Additional features are introduced into the dataset, totaling 49 features.

Table 1. Model hierarchy and some parameters

Fig. 4. Training loss and accuracy of different RNN variants

4.2 Model Hierarchy

SR-IDS feeds the preprocessed data into three independent one-dimensional convolutional autoencoders and trains them separately with a greedy layer-wise strategy. After training, the encoder layers of the three autoencoders are stacked by weight sharing and then connected to the time series generator, which produces traffic groups with contextual features. Afterwards, we use BiGRU to extract temporal features and output the predicted class of each sample in the testing set. The complete model hierarchy and some significant parameters are shown in Table 1.

4.3 Parameters Analysis

We compare the influence of different learning rates and dropout ratios on the convolutional and dense layers, as shown in Table 2. We find that the entire neural network converges extremely slowly when the initial learning rate of the dense layer is less than 0.0001. The time overhead then increases significantly while the improvement is negligible, so we do not adopt a lower initial learning rate.

Table 2. Comparison of different learning rate and dropout ratio

We also compare the loss and accuracy of different RNN models during training iterations. Figure 4 shows the detailed performance of training loss and accuracy for attack detection. It can be seen that loss and accuracy hardly change when the epoch reaches 50, and BiGRU can achieve better performance than the other three RNN variants.

4.4 Evaluation

We compare the proposed method’s performance with some state-of-the-art methods, as shown in Table 3. Additionally, we test our model on the KDD CUP 99 [26] and CIC-IDS-2017 [24] datasets, where it also performs well. In summary, our proposed SR-IDS method can achieve excellent performance in network traffic anomaly detection.

Table 3. Comparison with other machine learning algorithms

5 Conclusion

In this paper, we propose SR-IDS, an intrusion detection system for network traffic based on self-taught learning and representation learning, which simultaneously focuses on the spatial and temporal characteristics of traffic. Specifically, it uses 1D-SCAE to extract spatial features and BiGRU to extract temporal features. The greedy layer-wise strategy is adopted in the training process of 1D-SCAE, and sparse regularization is applied to reduce overfitting. BiGRU operates on time series generated by TimeseriesGenerator to extract advanced temporal features. Our experiments show that BiGRU achieves the best score among the RNN variants compared. The accuracy of the proposed SR-IDS model in classifying network traffic on the UNSW-NB15 dataset reaches 98.90%, which is higher than that of the other IDS methods compared.

In future research, we can consider online operations to improve robustness and stability. Furthermore, defense against attack techniques targeting deep learning models is also a research direction in the future.