1 Introduction

Since 1994, the internet has been used to deliver services to millions of users. Especially in the past decade, it has changed our lives forever, from social media to how we work, shop, and bank online. The dark side of this development is the threats that target our data and privacy, which creates the need for network security systems. Such a system typically includes three parts: firewalls, antivirus software, and network intrusion detection systems (NIDSs) (Xin et al. 2018).

A NIDS, which plays an essential role in network security systems, has three primary functions: monitoring, analysis, and response (Alom et al. 2015). In this research, we consider the analysis function of a NIDS. Based on how a NIDS analyzes the data, there are three types of NIDSs.

  1.

    Signature-based NIDS (SNIDS): compares the received network traffic against an updated database of attack signatures (Ahmed et al. 2016), e.g., Snort (Snort—Network Intrusion Detection & Prevention System 2022). A low false-positive rate and fast response are advantages, while the crucial need for rule updates and the inability to detect zero-day attacks are disadvantages of this kind of NIDS.

  2.

    Anomaly detection based NIDS (ADNIDS): the kind of NIDS that tries to learn the pattern of benign behavior; any network traffic that deviates from this pattern is detected as an attack (Hawkins et al. 2002; Moradi and Zulkernine 2004). It is common to learn this pattern with a machine learning model or a combination of models (hybrid model) (Aljawarneh et al. 2018). Such models can adapt to the changing nature of network attacks (Leung and Leckie 2005). However, the need for high detection accuracy, a low false-positive rate, and fast response are the challenges of an ADNIDS.

  3.

    Hybrid-based NIDS (HNIDS): combining SNIDS and ADNIDS yields an HNIDS (Viegas et al. 2016).

Many shallow and deep learning models, such as SVM (support vector machine) (Mohammed and Sulaiman 2012; Wang et al. 2017), DTs (decision trees) (Eesa et al. 2015; Sindhu et al. 2012), naïve Bayes (Louvieris et al. 2013; Xiao and Chen 2014; Yin et al. 2017), fuzzy logic (Ali et al. 2014; Elhag et al. 2015), KNN (k-nearest neighbors) (Aburomman and Reaz 2016; Li et al. 2014), CNN (convolutional neural network) (Li et al. 2017; Potluri et al. 2018; Vinayakumar et al. 2017; Wu et al. 2018; Zhu et al. 2018), DBN (deep belief net) (Moradi and Zulkernine 2004), RNN (recurrent neural network) (Tang et al. 2018; Yin et al. 2017), and the non-symmetric deep auto-encoder (Shone et al. 2018), have been used to design ADNIDSs.

One of the main features of many previous works on ADNIDSs was the design of a powerful feature selector or extractor, since finding a good representation of the features and reducing misleading data helps the classifier and improves both accuracy and response time (Tsai et al. 2009). Different kinds of feature learning techniques have been used to find the best representation of the data (Coates et al. 2011); these may or may not be based on deep learning models, such as the restricted Boltzmann machine (Smolensky 1986), variational auto-encoders (Kingma and Welling 2013), convolutional neural networks (Goodfellow et al. 2016), and sparse auto-encoders (Ng 2011). Any feature learning algorithm falls into one of two groups: linear or non-linear. The linear algorithms try to optimize \(z=wx+b\), where z is the latent space and x is the original data. The problem with linear algorithms is that the optimal latent space must lie in the space spanned by the inputs; when that is not the case, they yield a large error (Ng 2011). In the non-linear group, the problem is formulated as Eq. 1.

$$z=w\phi \left(x\right)+b,$$
(1)

where \(\phi\) can be any non-linear function (the sigmoid being the most popular). Unlike the linear group, the span is no longer limited to the span of the inputs (Yu and Principe 2019).

Another way to categorize feature learning is by the use of labels, which divides feature learning into supervised, unsupervised, and semi-supervised methods. The supervised methods are beneficial when z is used for a supervised task such as classification because, unlike the unsupervised techniques, they also consider the labels when extracting features. To the best of our knowledge, there is only one paper that used semi-supervised feature learning (Nadeem et al. 2016), and there is no paper on supervised feature learning for NIDSs to date, which is the primary motivation of our work.

The contributions of our paper are as follows:

  i.

    A novel supervised sparse auto-encoder that weights the reconstruction loss by each feature's mutual information with the label.

  ii.

    Combining this model with SVM (a combination of deep and shallow learning models).

  iii.

    Using our model as an ADNIDS on the NSL-KDD, KDDCUP' 99, and CICIDS2017 data sets.

  iv.

    Designing another ADNIDS using previously proposed supervised auto-encoders, for a fair comparison with our model.

  v.

    Comparing our model with (iv) and with several previous approaches; examining the effect of our supervised sparse auto-encoder on the number of support vectors compared to SVM and a simple sparse auto-encoder with SVM; and comparing the zero-day attack detection rate of the proposed model against these two approaches.

The rest of the paper is structured as follows. Section 2 reviews previous related works. Section 3 discusses background information. Section 4 specifies our novel proposed solution. Section 5 presents the proposed model's evaluation and results, together with the comparison discussion. Finally, Sect. 6 concludes the paper and discusses plans for improving the proposed model.

2 Related Work

As mentioned before, NIDSs are among the most essential and challenging parts of network security systems, making them a fascinating topic for researchers, especially in the area of designing ADNIDSs, which requires machine learning models. Various approaches have been used to create ADNIDSs, including different shallow and deep learning models. The shallow models are faster than deep learning models, but they suffer from a low detection rate. In this section, we discuss some of the notable works in this area.

Tavallaee et al. (2009) provided a statistical analysis of the KDDCUP' 99 data set. They found two critical issues in this data set that affect the performance of evaluated systems on real-world data. First, a vast number of redundant records biases classifiers towards the more frequent records and prevents them from learning infrequent ones. Second, the test set and the training set are not independent of each other, meaning they come from a similar source. These problems lead to high detection rates in research that uses this data set but low detection rates in practice. To solve these issues, they proposed the NSL-KDD data set. They also tested the performance of multiple shallow learning models, such as J48, naïve Bayes, and NB tree, on this data set.

Lin et al. (2015) provided a NIDS based on combining cluster centers and k-nearest neighbors. After using a clustering technique to extract cluster centers (the number of clusters equals the number of labels), they replaced all features in the data set with a single feature: the sum of the distances from each element to its cluster center and to its nearest neighbor. They used KNN for the classification task. The detection rate and response time were noticeable when clustering is viewed as preprocessing. They evaluated their model on KDDCUP' 99. Another example of using KNN is Liao and Vemuri (2002).

Alom et al. (2015) proposed a model based on the deep belief net (DBN). They compared their results with SVM and DBN-SVM. They reached 97.5% accuracy when using 40% of the NSL-KDDTrain+ data as the training set and the rest as the test set, which undermines the main idea behind the NSL-KDD data set (keeping the training and test sets independent of each other). Alrawashdeh and Purdy (2016) provided a model based on the restricted Boltzmann machine (RBM) and logistic regression (LR); the data set evaluated in that research was KDDCUP' 99.

Kim et al. (2016) provided a model using the long short-term memory (LSTM) architecture on a recurrent neural network (RNN). The KDDCUP' 99 data set was used to evaluate their model. They compared their approach with the general regression neural network (GRNN), probabilistic neural network (PNN), KNN, SVM, etc.

Yin et al. (2017) also used an RNN, with softmax regression, to develop a NIDS. They evaluated their model on the NSL-KDD data set and provided a comparison with several shallow learning models such as J48, naïve Bayes, NB tree, random forest, and random tree. The improvement in accuracy was noticeable, but the model suffered from a very high response time, which stems from the nature of RNNs.

Wu et al. (2018) proposed a CNN-based model evaluated on the NSL-KDD data set. They turned each data set sample, with 122 features, into an 11 × 11 image by removing the one feature with the lowest variance coefficient. They proposed a model with two convolutional layers, two pooling layers, and one fully connected layer for the classification task. They also weighted the samples in the cost function according to the proportion of samples in the different classes, to address the imbalanced data set problem. Li et al. (2017) also used a CNN to design a NIDS, likewise on the NSL-KDD data set. They turned each sample of the data set into a vector of binaries after discretizing the continuous features and one-hot encoding the non-numerical features. They used two famous CNN-based models (ResNet, GoogLeNet) to classify the data.

Shone et al. (2018) proposed a novel auto-encoder, the non-symmetric deep auto-encoder, in which the number of encoding layers differs from the number of decoding layers, to design a NIDS. They used NSL-KDDTrain+ as the training set and an NSL-KDD test set with 18,794 instances, obtained by removing the attack types that exist in NSL-KDDTest+ but not in NSL-KDDTrain+ (zero-day attacks). The results were noticeable compared to the deep belief net.

Yousefi-Azar et al. (2017) proposed a feature learning model based on deep auto-encoders. They used multiple shallow learning models for the supervised task, such as Gaussian naïve Bayes and SVM.

Farahnakian and Heikkonen (2018) provided a model based on deep auto-encoders. They used four stacked auto-encoders for pre-training; after pre-training, they used a soft-max layer for classification and also fine-tuned all layers. The results of their model were promising on KDDCUP' 99, but it was computationally expensive compared to our model.

Aygun and Yavuz (2017) used an auto-encoder and a de-noising auto-encoder, respectively, to propose two NIDSs. They used the auto-encoder to learn normal behavior and then determined a threshold on the reconstruction error; at test time, this threshold is used to classify the data. The NSL-KDD data set was used in this research.

Nadeem et al. (2016) proposed a model based on the ladder network (Rasmus et al. 2015; Valpola 2015), a semi-supervised model. The model was evaluated on the 10% KDDCUP' 99 data set. Its main issue was a high response time, as mentioned in their paper.

Al-Qatf et al. (2018) provided a model using a sparse auto-encoder and SVM on the NSL-KDD data set. Their model has lower accuracy and converges more slowly in pre-training compared to ours. Javaid et al. (2016) also used a sparse auto-encoder, combined with a softmax layer, again on the NSL-KDD data set.

Tang et al. (2020) proposed a LightGBM/auto-encoder-based model for NIDS. They used the LightGBM model for feature selection and the AE model for training and detection, with a threshold on the AE reconstruction error to distinguish between normal and attack behaviors. The NSL-KDD data set was used for evaluation. The architecture of their AE model consists of 5 hidden layers. They achieved an 89.82% accuracy rate, which is considerable. However, LightGBM models are computationally complex (they are based on gradient boosting algorithms), and their AE model contains five layers; the real-time requirement of NIDSs is not considered in this research. Also, their model cannot handle multiclass classification problems.

Khraisat et al. (2020) proposed a model combining the C5 decision tree and a one-class SVM. They evaluated their model using the NSL-KDD data set, achieving an 83.24% accuracy rate on binary classification.

Ieracitano et al. (2020) proposed a model based on statistical analysis for feature extraction, reducing the input from 122 to 102 dimensions, and an AE for dimensionality reduction from 102 to 50. NSL-KDD was used as the data set in this research. The accuracy rate was 84.21% in binary classification.

Li et al. (2020) designed a model on the NSL-KDD data set using a multi convolutional neural network (multi-CNN). They achieved an 89.95% accuracy rate for binary classification and 81.32% for 5-class classification. Clearly, response time was not a consideration in this method, since a single CNN is already complex and multiple CNNs were used.

Zhou et al. (2020) proposed a method based on M-AdaBoost, a complex ensemble model. They used NSL-KDDTrain+ for both training and testing.

After studying these approaches, we found that many of them suffer from low accuracy, slow convergence in pre-training, or high response time.

To enable comparison with some of the preceding approaches, we evaluated our model on the NSL-KDD and KDDCUP' 99 data sets in two ways: using tenfold cross-validation on the training sets of these data sets, and using the training sets for training and the test sets for testing. We examined both binary and five-class classification problems.

3 Background

This section provides information about the base concepts of the proposed method, which is necessary for understanding the motivation and ideas behind our work. First, we give a brief elucidation of deep learning and several representation learning methods based on it; after that, the soft-margin SVM is discussed. Finally, mutual information is discussed.

3.1 Deep Learning

Deep learning is a subcategory of machine learning that tries to find high-level features in input data (Deng and Yu 2014). Deep learning is based on neural networks (stochastic models inspired directly by human and animal nervous systems, intended to find a relationship between input and output). Any neural network has two paths, forward and backward. In the forward path, the input is mapped to the output by crossing a graph with multiple layers of linear or non-linear transformations. The backward path calculates the gradient of the cost with respect to the weights and tries to minimize the cost by updating the weights. In other words, deep learning is based on feature and representation learning in the different layers of the model. Deep learning models divide each complex concept into simpler concepts, a process that leads the model to find the essential concepts. In recent years, deep learning has achieved considerable success in many fields, such as machine vision, machine translation, and speech recognition, because of its ability to find intricate patterns in data. We use the power of deep learning for representation learning. In the following, we discuss some of the previous deep-learning-based methods in representation learning.

3.1.1 Auto-encoder

Auto-encoders are unsupervised models that try to learn a good representation of data by mapping the data from the original space to a latent space (encoding) and reconstructing the original data from the latent space (decoding) (Baldi 2012; Hinton and Zemel 1997), as shown in Fig. 1. An auto-encoder has three or more layers: an input layer, an output layer, and hidden layer(s). If only linear functions, or a single sigmoid function, are used in the middle layers, the auto-encoder acts like PCA (principal component analysis). Consider F as the encoding function, with parameters w and b, mapping the data to the latent space according to Eq. 2.

Fig. 1: Example of an auto-encoder

$$z= {F}_{w,b }\left(x\right).$$
(2)

Similarly, \(g\) is the decoding function, with parameters \({w}^{^{\prime}}\) and \({b}^{\mathrm{^{\prime}}}\), which tries to reconstruct the data from the latent space according to Eq. 3.

$${x}^{^{\prime}}= {g}_{{w}^{^{\prime}},{\mathrm{b}}^{\mathrm{^{\prime}}}}\left(z\right).$$
(3)

The loss function of the auto-encoder can be anything that minimizes the difference between x and \({x}^{^{\prime}}\), such as the mean square error. The auto-encoder family can be used for feature extraction, data de-noising, and generative data modeling.
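As an illustration, a minimal sketch of this encoder–decoder pair in Keras follows; the layer sizes and the choice of sigmoid activations are our assumptions for illustration, not a prescription from the works cited above.

```python
# Minimal auto-encoder sketch (Eqs. 2-3); sizes/activations are assumed.
import tensorflow as tf

n_features = 122  # e.g., one-hot encoded NSL-KDD (see Sect. 5.3.3)
latent_dim = 64   # assumed latent-space size

inputs = tf.keras.Input(shape=(n_features,))
z = tf.keras.layers.Dense(latent_dim, activation="sigmoid")(inputs)  # encoder, Eq. 2
x_hat = tf.keras.layers.Dense(n_features, activation="sigmoid")(z)   # decoder, Eq. 3

autoencoder = tf.keras.Model(inputs, x_hat)
autoencoder.compile(optimizer="adam", loss="mse")  # minimize ||x - x'||^2
# autoencoder.fit(x_train, x_train, epochs=10, batch_size=128)
```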

3.1.2 Sparse Auto-encoder

Sparse auto-encoders are auto-encoders that try to discover interesting structure by imposing additional constraints on the network. For example, when the latent space has more dimensions than the original space, a vanilla auto-encoder may simply copy the input into the latent space and pad it with zeros. This problem can be handled by adding a sparsity constraint on the hidden layer. This sparsity term is a Kullback–Leibler divergence between the average activation of the nodes in the hidden layer and a fixed number between 0 and 1, so the loss function takes the form of Eq. 4.

$$J=\frac{1}{2m}\sum_{i=1}^{m}{\Vert {x}_{i}-{\widehat{x}}_{i}\Vert }^{2}+\beta \sum_{j=1}^{k}kl\left(\rho \Vert {\widehat{\rho }}_{j}\right),$$
(4)

where \(\beta\) is a constant, treated as a hyper-parameter of the loss function, that controls the importance of the sparsity term. The first term is the mean square error between the original and reconstructed data, and the second term is the sparsity constraint, calculated according to Eq. 5,

$$kl(\rho \| {\widehat{\rho }}_{j})=\rho \mathrm{log}\left(\frac{\rho }{{\widehat{\rho }}_{j}}\right)+ (1-\rho )\mathrm{log}\left(\frac{1-\rho }{1-{\widehat{\rho }}_{j}}\right),$$
(5)

where \(\rho\) is the sparsity parameter and \({\widehat{\rho }}_{j}\) is the average activation of hidden unit \(j\) over all training inputs \(x\) (Ng 2011).
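To make the sparsity term concrete, the following sketch computes the penalty of Eqs. 4–5 for a batch of hidden activations; the clipping epsilon is our numerical safeguard, not part of the formula.

```python
import numpy as np

def kl_sparsity_penalty(activations, rho=0.05):
    """Sparsity term of Eq. 4: sum_j kl(rho || rho_hat_j), where rho_hat_j
    is the mean activation of hidden unit j over the batch (Eq. 5).
    `activations` has shape (batch_size, hidden_units)."""
    rho_hat = np.clip(activations.mean(axis=0), 1e-8, 1 - 1e-8)  # avoid log(0)
    kl = (rho * np.log(rho / rho_hat)
          + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))
    return kl.sum()

# e.g., penalty for a random batch of sigmoid activations:
# kl_sparsity_penalty(np.random.rand(128, 64), rho=0.05)
```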

3.1.3 Supervised Auto-encoder

In this kind of auto-encoder, a supervised constraint is added to the loss function. Such auto-encoders are mostly used when the latent variable serves as the input of a supervised model. For example, Le and Patterson (2018) add a soft-max layer connected to the hidden layer and reformulate the loss as Eq. 6.

$$J=\frac{1}{2m}\sum_{i=1}^{m}{\Vert {x}_{i}-{\widehat{x}}_{i}\Vert }^{2}+\frac{\beta }{2m}\sum_{i=1}^{m}{\Vert {y}_{i}-{\widehat{y}}_{i}\Vert }^{2}.$$
(6)

3.2 SVM (Support Vector Machine)

The SVM is a classifier that aims to find the separating hyperplane maximizing the margin between two classes. Figure 2 shows a schema of how this classifier works. \(W\) and \(b\) are the parameters of the hyperplane that need to be learned; after learning, any new sample can be classified into the positive or negative class subject to Eq. 7.

Fig. 2: A geometric view of an SVM classifier

$$\forall \mathrm{i }{y}^{\left(i\right)}\left({w}^{T}{x}^{\left(\mathrm{i}\right)}+b\right)\ge 1$$
(7)

SVM aims to satisfy two requirements:

  i.

    Maximize the distance between decision boundaries, which is \(\frac{2}{\Vert w\Vert }\).

  ii.

    Correctly classify all \({x}^{\left(i\right)}\).

To achieve these goals, SVM is formulated as an optimization problem whose primal form is shown in Eq. 8,

$${\mathrm{min}}_{w,b}\frac{{\Vert w\Vert }^{2}}{2} +C \sum_{i=1}^{N}{\varepsilon }^{\left(i\right)}$$
$$\mathrm{s.t.}\quad {y}^{\left(i\right)}\left({w}^{T}\varphi ({x}^{\left(i\right)})+b\right)\ge 1-{\varepsilon }^{\left(i\right)},$$
$${\varepsilon }^{\left(i\right)}>0,$$
$$\forall i \in \{1,\cdots ,N\},$$
(8)

where \({\varepsilon }^{\left(i\right)}\) are slack variables, \(C\) is the error penalty term, and \(\varphi\) is the kernel-induced feature map that sends the data into a space where it can be separated better. To solve this problem, it is first written in the dual form, given in Eq. 9.

$${\mathrm{max}}_{\alpha }\,{\mathrm{min}}_{w,b}\frac{{\Vert w\Vert }^{2}}{2} +C\sum_{i=1}^{N}{\alpha }^{\left(i\right)}\left(1-{y}^{\left(i\right)}\left({w}^{T}\varphi \left({x}^{\left(\mathrm{i}\right)}\right)+b\right)\right).$$
(9)

After solving this problem, the optimal value of w is given by Eq. 10.

$$w=\sum_{i=1}^{N}{\alpha }^{\left(i\right)}{y}^{\left(i\right)}\varphi \left({x}^{\left(i\right)}\right).$$
(10)

Also, at test time, the prediction for each sample is calculated by Eq. 11,

$${y}_{test}= \sum_{i=1}^{N}{\alpha }^{\left(i\right)}{y}^{\left(i\right)}{\varphi \left({x}^{\left(i\right)}\right)}^{T}\varphi \left({x}_{test}\right)+b,$$
(11)

where \({\alpha }^{\left(i\right)} > 0\) if and only if \({x}^{\left(i\right)}\) is a support vector (Chapelle 2007). Hence, fewer support vectors mean fewer computations at test time.
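This consequence is easy to observe in practice. The hedged sketch below uses scikit-learn (our choice of library, not necessarily the one used in this paper) to expose the support-vector count that drives the test-time cost of Eq. 11.

```python
# Illustration: the test-time cost of an RBF SVM grows with the number
# of support vectors, since Eq. 11 sums over them only.
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
clf = SVC(kernel="rbf", C=1.0).fit(X, y)

print("support vectors per class:", clf.n_support_)
```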

3.3 Mutual Information

The mutual information between two random variables measures how much information about one of them can be gained by observing the other. It is calculated by Eq. 12 (Cover and Thomas 2012). High mutual information indicates a large reduction in uncertainty, low mutual information indicates a small reduction, and zero means independence.

$$I\left(x ;y\right)= \sum_{x\in X}\sum_{y\in Y}p \left(x,y\right)\mathrm{log }(\frac{p (x , y)}{p\left(x\right)p(y)}).$$
(12)
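For a pair of discrete variables, Eq. 12 can be computed directly from empirical frequencies, as in the following sketch (the frequency-based probability estimates are our assumption about how to instantiate p(x), p(y), and p(x, y)).

```python
import numpy as np

def mutual_information(x, y):
    """Direct implementation of Eq. 12 for two discrete variables,
    estimating p(x), p(y), and p(x, y) from sample frequencies."""
    x_vals, x_idx = np.unique(x, return_inverse=True)
    y_vals, y_idx = np.unique(y, return_inverse=True)
    joint = np.zeros((len(x_vals), len(y_vals)))
    for i, j in zip(x_idx, y_idx):
        joint[i, j] += 1
    joint /= joint.sum()
    px, py = joint.sum(axis=1), joint.sum(axis=0)
    nz = joint > 0  # skip zero-probability cells (0 * log 0 = 0)
    return (joint[nz] * np.log(joint[nz] / np.outer(px, py)[nz])).sum()

# A feature that perfectly determines the label has maximal MI:
# mutual_information([0, 0, 1, 1], [0, 0, 1, 1])  -> log(2) ~= 0.693
```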

4 Proposed Methodology

This section describes our supervised sparse auto-encoder and its benefits in two parts: Sect. 4.1 describes the proposed supervised sparse auto-encoder and the motivation behind it, and Sect. 4.2 presents the benefits of the proposed model compared to traditional supervised auto-encoders.

4.1 Sparse Auto-encoder with Mutual-Information-Weighted Reconstruction Loss

The main challenges in designing NIDSs are reliability and real-time operation, so we aim to propose a method that meets both. This paper introduces our supervised sparse auto-encoder: an auto-encoder with a sparsity constraint and a reconstruction error weighted by a diagonal matrix that reflects each feature's importance in the classification task. One motivation for our approach is that when an unsupervised auto-encoder searches for an optimal solution, it has no notion of which features matter more than others for the supervised (classification) task and should therefore be better represented in the latent variable. Such models only try to minimize the reconstruction loss, so the latent variable may end up carrying no information about a critical feature. This disadvantage of unsupervised auto-encoders significantly affects models designed for NIDSs. There is no reason to use purely unsupervised representation learning in the training process when labels are available: representation learning is mostly used alongside a supervised classification model. Our approach therefore benefits the supervised task by encoding more label-relevant information than an unsupervised auto-encoder would. This also reduces the classifier's complexity by feeding it more profitable information, which makes finding a solution easier (in our case, the number of support vectors reflects this).

Another advantage of our model is that the weight matrix can be any matrix that expresses the importance of the different features. We tested the joint entropy, the correlation, and the mutual information of the features with the label; as expected, mutual information had the greatest effect on accuracy and on the number of support vectors.

The weight matrix is calculated as a preprocessing task. It is a diagonal matrix whose element \(i,i\) is the mutual information of the \(i\mathrm{th}\) feature with the label. Denoting the \(i\mathrm{th}\) feature by \({f}_{i}\), the matrix is calculated as follows:

$${M}_{i,j} = \left\{ \begin{array}{ll} I\left({f}_{i};y\right) & {\text{if}}\; i = j \\ 0 & {\text{if}}\; i \ne j \end{array} \right..$$
(13)

Denoting this matrix by M, the loss function of the proposed model is formulated as follows:

$$J=\frac{1}{2m}\sum_{i=1}^{m}{\left({x}_{i}-{\widehat{x}}_{i}\right)}^{T} M \left({x}_{i}-{\widehat{x}}_{i}\right)+\frac{\lambda }{2}\left(\sum_{k,n}{w}_{kn}^{2}+\sum_{n,k}{v}_{nk}^{2}+\sum_{k}{b}_{k}^{2}+\sum_{n}{{b}_{n}^{^{\prime}}}^{2}\right)+\beta \sum_{j=1}^{k}kl\left(\rho \Vert {\widehat{\rho }}_{j}\right),$$
(14)

where W and V are the weight matrices of the first and second layers, and b and \({b}^{^{\prime}}\) are their respective bias vectors.

As the auto-encoder learns, it gives more importance to the features with higher mutual information (or whichever metric is chosen to express feature importance), because these features carry larger weights in the loss function than the others; to minimize the loss, reconstructing them accurately matters more. Since the decoder reconstructs the data from the latent space, the auto-encoder must keep more information about these features in the latent space to reduce their reconstruction error. This process may increase the risk of overfitting, which we handle by adding the regularizer. After the auto-encoder is well trained, the latent representation of the training data is used to train the SVM. In the test phase, each test sample is first fed to the encoder to obtain its latent representation, which is then fed to the SVM to predict the label. The idea is simple, but our results show that our model finds the optimal latent space faster than the others, reduces the classifier's complexity, and is better than, or at least competitive with, other works that even use more hidden layers. Algorithm 1 shows the pseudocode of the proposed method, and a code sketch follows it.

Algorithm 1: Pseudocode of the proposed method
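The following is a minimal sketch of this pipeline under stated assumptions: scikit-learn's mutual_info_classif as the MI estimator, and illustrative values for the layer size, learning rate, and hyper-parameters. It is not the exact implementation evaluated in Sect. 5.

```python
# Sketch of SSPAE-SVM: MI-weighted sparse auto-encoder (Eq. 14) + SVM.
# Hyper-parameter values and the MI estimator are illustrative assumptions.
import tensorflow as tf
from sklearn.feature_selection import mutual_info_classif
from sklearn.svm import SVC

def train_sspae_svm(x_train, y_train, latent_dim=64,
                    beta=3.0, lam=1e-4, rho=0.05, epochs=50):
    n = x_train.shape[1]
    # Diagonal of M (Eq. 13): M[i, i] = I(f_i; y)
    mi = tf.constant(mutual_info_classif(x_train, y_train), dtype=tf.float32)

    enc = tf.keras.layers.Dense(latent_dim, activation="sigmoid")
    dec = tf.keras.layers.Dense(n, activation="sigmoid")
    opt = tf.keras.optimizers.RMSprop(learning_rate=1e-3)
    x = tf.constant(x_train, dtype=tf.float32)

    for _ in range(epochs):
        with tf.GradientTape() as tape:
            z = enc(x)
            x_hat = dec(z)
            # Term 1 of Eq. 14: since M is diagonal,
            # (x - x_hat)^T M (x - x_hat) = sum_i M_ii (x_i - x_hat_i)^2
            recon = tf.reduce_mean(
                tf.reduce_sum(mi * tf.square(x - x_hat), axis=1)) / 2.0
            # Term 2: L2 regularizer on weights and biases
            params = enc.trainable_variables + dec.trainable_variables
            reg = lam / 2.0 * tf.add_n(
                [tf.reduce_sum(tf.square(v)) for v in params])
            # Term 3: KL sparsity penalty (Eq. 5)
            rho_hat = tf.clip_by_value(tf.reduce_mean(z, axis=0), 1e-8, 1 - 1e-8)
            kl = tf.reduce_sum(rho * tf.math.log(rho / rho_hat)
                               + (1 - rho) * tf.math.log((1 - rho) / (1 - rho_hat)))
            loss = recon + reg + beta * kl
        opt.apply_gradients(zip(tape.gradient(loss, params), params))

    # Algorithm 1: train the SVM on the learned latent representation
    svm = SVC(kernel="rbf").fit(enc(x).numpy(), y_train)
    return enc, svm
```

At test time, the encoder output enc(x_test) is fed to svm.predict, as described above.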

4.2 Benefits of the Proposed Model Compared to Traditional Supervised Auto-encoders

As discussed before, in previous models a supervised constraint was added to the auto-encoder, as in Eq. 6, so the pre-training and training phases were merged. These models have a high potential to over-fit the training data because of the label's direct effect on the encoder layers. In our model, by contrast, the labels do not directly affect the encoder and decoder layers; they only indicate which features should be reconstructed more faithfully than others.

Another benefit of our model over the traditional ones is that in traditional supervised auto-encoders the supervised term acts as a constant in the decoder layers, so it has no effect on backpropagation through them. In our model, it affects both the decoder and encoder layers, which lowers the risk of overfitting the encoder and, as a consequence, makes the decoder a better generative model (though this is not a consideration for NIDSs).

5 Evaluation and Result

Our proposed model, implemented using GPU-enabled TensorFlow, was run on 64-bit Ubuntu 14.04 on a PC with an Intel(R) Core(TM) i5-6400 CPU @ 2.70 GHz, 16 GB of RAM, and an NVIDIA GTX 1070 GPU. The hyper-parameters for classification and for the supervised sparse auto-encoder (SSAE) are provided in Table 1. RMSProp with an exponentially decaying learning rate (initial 0.001 for binary and 0.01 for 5-class classification) is used as the auto-encoder optimizer. All of these hyper-parameters were found by a grid search over different values using validation data.

Table 1 Hyper-parameters of the proposed model
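One way to realize this optimizer configuration in TensorFlow is sketched below; the decay_steps and decay_rate values are assumptions, since only the initial rates are fixed above.

```python
# RMSProp with an exponentially decaying learning rate; the decay
# schedule parameters below are illustrative assumptions.
import tensorflow as tf

schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.001,  # 0.01 for the 5-class setting
    decay_steps=1000,
    decay_rate=0.96)
optimizer = tf.keras.optimizers.RMSprop(learning_rate=schedule)
```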

This section provides the details of our implementation, the results, and the comparison with other research. Section 5.1 discusses the evaluation metrics, Sect. 5.2 gives the details of the data sets, and Sect. 5.3 discusses the preprocessing needed to prepare the data. Finally, the results and comparison are provided in Sect. 5.4.

5.1 Evaluation Metrics

To perform our evaluation, we used the NSL-KDD and KDDCUP' 99 data sets. Our approach's performance was evaluated using the following metrics, which have been used in other research in the same field.

  1.

    True Positive (TP): anomaly records that are correctly classified as an anomaly.

  2.

    False Positive (FP): normal records that are incorrectly classified as anomalies.

  3.

    True Negative (TN): normal records that are correctly classified as normal.

  4.

    False Negative (FN): anomaly records that are incorrectly classified as normal.

We used these metrics to compute the following measures (a short code sketch follows the list):

  1.

    Accuracy rate (AC): percentage of correctly classified records over all records (Metz 1978).

    $$\mathrm{AC}= \frac{\mathrm{TP}+\mathrm{TN} }{\mathrm{TP}+\mathrm{TN}+\mathrm{FP}+\mathrm{FN}}\times 100\%$$
    (15)
  2.

    Precision rate (P): percentage of correctly classified anomaly records over the total records classified as anomalies (Powers 2020).

    $$P = \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}}\times 100\%$$
    (16)
  3.

    Recall (R): percentage of correctly classified anomalies over the total number of attack records (Powers 2020).

    $$R= \frac{\mathrm{TP }}{\mathrm{TP}+\mathrm{FN}}\times 100\%$$
    (17)
  4.

    F-measure (F): Harmonic mean of precision and recall, which is considered the most crucial measure for NIDSs (Powers 2020).

    $$F= \frac{2\times P\times R }{P+R}\times 100\mathrm{\%}$$
    (18)
  5.

    False-positive rate:

    $$\mathrm{FPR}= \frac{\mathrm{FP }}{\mathrm{FP}+\mathrm{TN}}$$
    (19)
  6.

    False-negative rate:

    $$\mathrm{FNR}=\frac{\mathrm{FN}}{\mathrm{TP}+\mathrm{FN}}$$
    (20)
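For reference, a small sketch computing the six measures of Eqs. 15–20 from the four counts follows.

```python
# The six measures of Eqs. 15-20, computed from TP, FP, TN, FN counts.
def nids_metrics(tp, fp, tn, fn):
    ac = (tp + tn) / (tp + tn + fp + fn) * 100   # accuracy, Eq. 15 (%)
    p = tp / (tp + fp) * 100                     # precision, Eq. 16 (%)
    r = tp / (tp + fn) * 100                     # recall, Eq. 17 (%)
    f = 2 * p * r / (p + r)                      # F-measure, Eq. 18 (%)
    fpr = fp / (fp + tn)                         # false-positive rate, Eq. 19
    fnr = fn / (tp + fn)                         # false-negative rate, Eq. 20
    return ac, p, r, f, fpr, fnr

# e.g., nids_metrics(tp=900, fp=50, tn=950, fn=100)
```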

We used all of the preceding metrics for the evaluation on NSL-KDD and CICIDS2017. However, because the goal of using KDDCUP' 99, the predecessor of NSL-KDD, was comparison with related works, we only used the accuracy metric on it.

5.2 Data Sets

In this paper, KDDCUP’ 99, NSL-KDD, and CICIDS2017 data-sets were considered for evaluation. These data sets have been used Widely in NIDSs research areas.

5.2.1 KDDCUP’ 99

Since 1999 KDDCUP’ 99 has been the most used data set for the evaluation of NIDSs. This data set was evaluated by using data captured in DARPA’s 98 programs (Stolfo et al. 2000). The Data consists of 4 gigabytes-worth of compressed tcpdump data resulting from 7 weeks of network traffic. The training data set involve 4,900,000 records each consider as a connection record with 41 feature includes Basic, Domain knowledge and time observation features. Each record was labeled as either Attack from 22 different attack types or normal behavior. It is common to use 10% KDDCUP’ 99 data set with 145,586 records with no duplicated data and a great representation of the complete data set as training data set and KDDCUP’ 99 corrected with 77,291 records as testing data set. The features in this data set involve numerical and symbolic values, so it needs preprocessing for evaluating our model. The numerical values need to be normalized to reduce computational complexity. The whole process of preprocessing will be explaining.

5.2.2 NSL-KDD

NSL-KDD is the newer version of KDDCUP' 99, designed to overcome its problems (Tavallaee et al. 2009). Its structure is similar to the previous data set, consisting of 41 features; three of these features are symbolic, and the others are numeric. As discussed before, preprocessing is crucial for this data set. NSL-KDD consists of two different training data sets, KDDTrain+ and KDDTrain+20%, and two testing data sets, KDDTest+ and KDDTest−21. In this research, we used KDDTrain+ as the training data set and KDDTest+ as the testing data set. The training data set contains 22 different attack types and the testing data set 34 attack types, which means KDDTest+ contains 12 attacks that do not appear in the training data set (zero-day attacks), so the generalization power of the model on zero-day attacks is also assessed. These attacks are categorized into four categories: R2L, U2R, DoS, and Probe. The KDDTrain+ data set contains 125,973 records and the KDDTest+ data set 22,544 records. A description of these data sets is provided in Table 2.

Table 2 Description of the NSL-KDD and 10% KDDCUP' 99 data sets

5.2.3 CIC IDS 2017 (Sharafaldin et al. 2018)

CIC IDS 2017 is the newer version of ISCX 2012 (Shiravi et al. 2012), provided by the Canadian Institute for Cybersecurity, and consists of a variety of benign and malicious network traffic types; the attack types in this data set reflect the most up-to-date attack scenarios. The data set comprises eight CSV files; each file consists of 83 features plus a label, which can be either benign or one of 14 different attack types. Also, 285,735 instances have a null label and therefore have to be deleted from the data set. In this research, we first concatenated all files, which yields a data set with 3,116,478 instances, reduced to 2,830,743 instances after deleting flows with a null label. Table 3 shows the details of this data set. Because the full data set is too vast to train models on, we randomly selected 10% of the instances while preserving the class membership and distribution of the original data set. Some features of this data set (flow id, destination IP, source IP, destination port, source port) need to be deleted when designing a NIDS, because it is not correct to make the IDS model sensitive to them; for example, if the IDS is sensitive to the source IP, the attacker can evade it by changing his/her IP address. Like the previous data sets, this data set also needs normalization.

Table 3 Description of CICIDS2017 data set
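A hedged pandas sketch of these preparation steps follows; the directory path and exact column names (e.g., "Flow ID", "Label") are assumptions about the CSV layout rather than verified identifiers.

```python
# CICIDS2017 preparation sketch: concatenate, drop null labels,
# stratified 10% sample, drop identifier columns. Names are assumed.
import glob
import pandas as pd

frames = [pd.read_csv(f) for f in glob.glob("CICIDS2017/*.csv")]
data = pd.concat(frames, ignore_index=True)
data = data.dropna(subset=["Label"])  # remove the null-label flows

# 10% sample preserving the class distribution of the original data
sample = data.groupby("Label", group_keys=False).apply(
    lambda g: g.sample(frac=0.1, random_state=0))

# drop identifiers the IDS must not be sensitive to
sample = sample.drop(columns=["Flow ID", "Source IP", "Destination IP",
                              "Source Port", "Destination Port"],
                     errors="ignore")
```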

5.3 Data Preprocessing

5.3.1 Normalization

Many features of these data sets have a broad range between their maximum and minimum values, which increases computational complexity during the learning and testing process. Therefore, we normalized these features using min–max normalization, mapping each feature to the range between 0 and 1 according to Eq. 21,

$${x}_{i }= \frac{{x}_{i }-\mathrm{min}}{\mathrm{max}-\mathrm{min}}$$
(21)

where \({x}_{i}\) is a data point, and min and max are the minimum and maximum values of the corresponding feature over all data points.
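A direct column-wise implementation of Eq. 21 might look as follows; the epsilon guard against constant features is our addition.

```python
import numpy as np

def min_max_normalize(X):
    """Eq. 21 applied column-wise; a constant feature would divide by
    zero, so a small epsilon guards the denominator."""
    mn, mx = X.min(axis=0), X.max(axis=0)
    return (X - mn) / np.maximum(mx - mn, 1e-12)
```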

5.3.2 Categorizing Labels into Five Categories

All the attacks in these data sets fall into one of the following four categories (Tavallaee et al. 2009), as listed in Table 4:

  1.

    Denial of Service Attack (DoS): an attack in which the attacker targets the system's availability by making the victim's resources too busy or too full.

  2.

    User to Root Attack (U2R): attacks in which the attacker starts with a regular user account on the system and exploits some vulnerability to gain root privileges.

  3.

    Remote to Local Attack (R2L): attacks in which the attacker can send packets to the system but has no account on it, and exploits some vulnerability to gain local access as a user of the system.

  4.

    Probing Attack: attacks in which the attacker investigates the network or the system to determine vulnerabilities that can later be exploited, potentially circumventing the system's security controls.

Table 4 Categories of attacks in NSL-KDD

5.3.3 One-Hot Coding of Symbolic Features

The NSL-KDD data set has 38 numerical features and three symbolic features. To convert these symbolic features to numerical ones, we used the one-hot coding strategy. For example, the "protocol-type" feature has three distinct attributes, which are encoded as the binary vectors (1, 0, 0), (0, 1, 0), and (0, 0, 1). This process expands the NSL-KDD feature set from 41 dimensions to 122.
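A toy example of this encoding with pandas follows; it mirrors the "protocol-type" case described above on a hypothetical two-column frame.

```python
import pandas as pd

# toy frame with one symbolic column, mirroring "protocol-type"
df = pd.DataFrame({"protocol_type": ["tcp", "udp", "icmp"],
                   "duration": [0, 2, 1]})
encoded = pd.get_dummies(df, columns=["protocol_type"])
# -> protocol_type_icmp / protocol_type_tcp / protocol_type_udp columns,
#    i.e., the (1,0,0)/(0,1,0)/(0,0,1) vectors described above.
```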

5.4 Results and Comparison

5.4.1 Number of Support Vectors, Training Time, and Testing Time

As SVM with an RBF kernel is used as the classifier, the number of support vectors (nSVs) and the input dimensionality impact training and testing time, because the computation and memory usage of SVM with non-linear kernels grow with the number of support vectors. We therefore compare our approach (SSPAE-SVM) in terms of nSVs, training time, and testing time on the NSL-KDD and CICIDS2017 data sets with a simple SVM and a sparse auto-encoder SVM (SPAE-SVM) (Al-Qatf et al. 2018), using four different setups for NSL-KDD and one for CICIDS2017. We also compare the pre-training (representation learning) time of the proposed method and SPAE-SVM.

In the first setup, we used NSL-KDDTrain+ as the training set and NSL-KDDTest+ as the test set; for CICIDS2017, we used 70% of the 10% sample as training data and 30% as testing data. The label is considered binary (normal and attack). The results show that our model reduces the number of support vectors, and consequently the training and testing time, compared to simple SVM and SPAE-SVM. The dimensionality reduction achieved by the auto-encoder also benefits memory and computational complexity. Table 5 shows the results.

Table 5 Training and testing time and number of support vectors (NSV) comparison for SVM, SPAE-SVM, Proposed model (SSPAE-SVM) for binary classification on CICIDS2017, NSL-KDDTrain+ and NSL-KDDTest+

In the second setup, we used tenfold cross-validation on NSL-KDDTrain+, again with binary labels; the results also emphasize the efficiency of the proposed model. Because the pre-training time was similar to the first setup, it is not reported again. Table 6 shows the results.

Table 6 Training and testing time and number of support vectors (NSV) comparison for SVM, SPAE-SVM, proposed model for binary classification based on NSL-KDDTrain+

In the third setup, we used NSL-KDDTrain+ as the training set and NSL-KDDTest+ as the test set, with five labels (normal and the four attack categories) and a one-vs-rest strategy for the classification model. These results also show the positive effect of our model on memory and time complexity; Table 7 shows them. In the fourth setup, NSL-KDDTrain+ was used for both training and testing with a tenfold cross-validation strategy; Table 8 shows the results. What is noticeable in all of these setups is the improvement of our model in the training and testing time of a NIDS; e.g., the total pre-training and training time of our model is less than the training time of SVM alone.

Table 7 Training and testing time and number of support vectors (nSV) comparison for SVM, SPAE-SVM, proposed model for 5 class classification based on NSL-KDDTest+
Table 8 Training and testing time and the number of support vectors (nSV) comparison for SVM, SPAE-SVM, proposed model 5 class classification based on NSL-KDDTrain+

5.4.2 Evaluation Considering Binary Classification

Two setups were considered for binary classification: first, using NSL-KDDTrain+ with NSL-KDDTest+, and the CICIDS2017 10% data set; second, evaluation on NSL-KDDTrain+ using a tenfold cross-validation strategy, where in each fold 70% of the data is used for training and 30% for testing. Table 9 shows the results of the first setup: the proposed model achieves 90.11% accuracy on NSL-KDDTest+ and 91.22% on CICIDS2017, which is very noticeable considering that it uses only one encoding layer for NSL-KDD, compared to other works using deep stacked auto-encoders with more than one auto-encoder and more than one encoding layer (Shone et al. 2018), or multiple convolutional layers with the overhead of turning samples into images (Wu et al. 2018). The proposed model is thus powerful yet computationally simple, meaning it improves both critical requirements of NIDSs. The comparison also shows the improvement of our model over SVM and SPAE-SVM.

Table 9 Evaluation result considering binary classification on NSL-KDDTest+ and CICIDS2017 data sets

The false-positive and false-negative rates of each model on CICIDS2017 and NSL-KDDTest+ are shown in Figs. 3 and 4, respectively.

Fig. 3: FNR and FPR for binary classification on CICIDS2017

Fig. 4: FPR and FNR for binary classification on NSL-KDDTest+

The ROC curve and AUC of the model on CICIDS2017 are shown in Fig. 5, and the ROC curve on NSL-KDDTest+ is shown in Fig. 7. The ROC is a probability curve, and the AUC represents the degree or measure of separability: it tells how capable the model is of distinguishing between classes (Fig. 6 shows the corresponding five-class curve).

Fig. 5: ROC curve and AUC value of the proposed model for binary classification on CICIDS2017

Fig. 6: ROC curve and AUC value of the proposed model for five-class classification on NSL-KDDTest+

Fig. 7: ROC curve and AUC value of the proposed model for binary classification on NSL-KDDTest+

Table 10 shows the results when tenfold cross-validation on NSL-KDDTrain+ was used. As the results show, the improvement is not as massive as in the previous setup, which is a good sign: it indicates less potential for overfitting in our model.

Table 10 Evaluation result considering binary classification on the NSL-KDDTrain+ data set

5.4.3 Evaluation Considering 5 Class Classification

This setup compares our results for the 5-class problem, in terms of accuracy, precision, recall, and F-measure, per class, with SVM and SPAE-SVM. This evaluation also shows our improvement in 5-class classification: our model achieved 80.46% accuracy, while SVM achieved 77.49% and SPAE-SVM 78.54%, as shown in Fig. 8. Figure 9 compares the results under tenfold cross-validation on NSL-KDDTrain+. The ROC curve of the proposed model for the 5-class classification problem is provided in Fig. 6.

Fig. 8: Accuracy, precision, recall, and F-measure values for the proposed model, SVM, and SPAE-SVM for five-class classification on NSL-KDDTest+

Fig. 9: Accuracy, precision, recall, and F-measure values for the proposed model, SVM, and SPAE-SVM for five-class classification on NSL-KDDTrain+

5.4.4 The Zero-Day Attacks Detection Rate

This section compares the zero-day attack detection rate of our model with SVM and SPAE-SVM to show the proposed model's generalization power. Of the 22,544 instances of NSL-KDDTest+, 3750 are zero-day attacks, i.e., attack types not observed in the training data. As the results in Table 11 show, the proposed model achieved a better detection rate on zero-day attacks.

Table 11 Comparison of Zero-day attacks detection rate

5.4.5 Additional Comparison

We also compared the proposed model, in terms of accuracy rate, with some of the related works and with a previously proposed supervised auto-encoder. The evaluation on KDDCUP' 99 is also provided in this section; Table 12 shows this comparison for binary classification, and Table 13 for 5-class classification.

Table 12 Additional performance comparisons with several related approaches in the binary classification
Table 13 Additional performance comparisons with several related approaches in the 5 class classification

All results from Sects. 5.4.1 to 5.4.4 show that our approach has a better, or at least competitive, detection rate compared to SVM, SPAE-SVM, and related works, and significant potential for real-time NIDS, as shown in Sect. 5.4.1. Comparing our model with related works on binary classification shows that our approach has the best accuracy when evaluated on NSL-KDDTest+, even better than the traditional supervised auto-encoder, owing to such models' high potential for overfitting. On KDD CUP'99, however, the improvement was merely competitive. The reason lies in the nature of this data set: the Corrected and 10% training sets are closely related, so a model that learns 10% KDD CUP'99 very well will score highly on KDDCUP' 99 Corrected, whereas on NSL-KDDTrain+ and NSL-KDDTest+ or CICIDS2017 such a model may face overfitting problems. The clear improvement across the NSL-KDDTrain+, NSL-KDDTest+, and CICIDS2017 data sets, alongside the merely competitive results on 10% KDDCUP' 99 and KDDCUP' 99 Corrected, shows the robustness of our model against overfitting. The evaluation on NSL-KDDTest+ with no zero-day attacks is also provided for comparison of the proposed model with Shone et al. (2018); considering that their model was much more computationally complex, both in representation learning (two deep non-symmetric auto-encoders with two encoding layers and one decoding layer each) and in classification (random forest, an ensemble model), the proposed model achieved a better detection rate.

6 Conclusion and Future Work

In this study, we investigated the challenges of designing a NIDS. In response to these challenges, we proposed a novel supervised sparse auto-encoder whose reconstruction loss term is weighted by mutual information, which led to significant improvements in both detection rate and computational complexity. We implemented our model in GPU-enabled TensorFlow and compared it with simple SVM and SPAE-SVM for both binary and 5-class classification in terms of training and test time, accuracy, precision, recall, and F-measure. The improvement of the proposed model in binary classification, with 90.11% accuracy on NSL-KDDTest+ and 91.22% on CICIDS2017, was very promising compared to other works, even much more complex models. In 5-class classification, our model also improved over SVM and the sparse auto-encoder SVM. Moreover, in terms of real-time computation, our model performs better than SVM and the sparse auto-encoder SVM, even with the additional computational term in the pre-training phase, because of its faster convergence to the optimal representation.

The comparison of the proposed model with some of the related studies and with the traditional supervised auto-encoder has been provided based on accuracy for both the NSL-KDD and KDDCUP' 99 data sets, all of which was promising. In the future, we aim to provide a model (e.g., using a genetic algorithm) that learns the weights of the reconstruction loss instead of using mutual information. We will also try to improve our detection rate on the five-class classification problem and evaluate the model on real-world data.