1 Introduction

With the rapid developments in internet and communication technologies, the volume of stored data has increased significantly, which has in turn increased the volume of traffic flowing through networks all over the world [1]. With the surge in network traffic, cyber-attacks are also rapidly increasing due to novel attacks and mutations of older ones. The frequency of attacks exploiting systems' flaws is expected to escalate as more and more devices connect to the internet [43]. Thus, network security has become an essential research domain for protecting data and networks from malicious users and attackers. Several security solutions are available to secure networks from external and internal attacks; however, preventing them is still challenging due to the intrinsic limitations of security policies, firewalls, access control schemes, and antivirus software [31]. To deal with these large-scale network threats, Intrusion Detection Systems (IDS) are deployed to monitor and analyze ongoing events for detecting potential attacks. They provide real-time protection against internal and external attacks by blocking them upon detection [56].

Intrusion Detection Systems are primarily classified into host-based IDS (HIDS) and network-based IDS (NIDS) [82]. HIDS relies on the signatures of known attacks to determine the vulnerability of the system. This approach is commonly restricted to a host system running a specific operating system and requires an updated database of attacks, limiting its ability to detect novel attacks [61]. On the other hand, NIDS targets network behavior by analyzing the format and content of network data packets, which makes it more suitable for detecting unknown attacks as well. However, this approach has its own challenges, including a high false-positive rate due to the non-linear nature of the problem, data imbalance originating from attacks, lack of numerical representation of some features, and high dimensionality [27]. To overcome these challenges, machine and deep learning techniques have emerged as promising tools in contrast to contemporary statistical methods and knowledge-based expert systems [9, 15].

Deep learning has recently emerged as a disruptive technology in this domain as well and has become hugely popular due to its higher accuracy and flexibility with little domain knowledge [32, 63]. Neural networks can discover the relevant patterns to discriminate attacks from regular traffic through a series of non-linear transformations [40]. Also, little domain knowledge is required to construct models since categorical and string attributes of the network can be converted into numerical form through integer or one-hot encoding schemes. Many forms of deep learning such as deep neural networks (DNN) [35], autoencoders (AE) [4], convolutional neural networks (CNN) [42, 74], and recurrent neural networks (RNN) [70, 83] have been utilized so far to provide solutions to this challenging problem. Nevertheless, performance bottlenecks in distinguishing attacks from regular traffic still exist due to overfitting and class imbalance problems [26, 27, 74]. Overfitting is common in this domain due to the lack of highly representative training data points from real-world applications corresponding to the true data distribution. For instance, there is a huge disparity in performance on the popular but outdated dataset KDDCup’99 between the training (>95%) and test sets (<90%) [69]. Consequently, a more challenging dataset, NSL-KDD, was developed from the KDDCup’99 dataset but suffered from the same problem. Therefore, multiple datasets are required to validate an IDS model. When insufficient training data is supplied to neural networks, the model captures noise or superficial information. Another problem associated with IDS is class imbalance, which refers to the uneven distribution of class samples in the dataset where the majority class outnumbers the minority class significantly. When such data is provided to neural networks, they tend to be biased towards the majority class, which causes poor performance on unseen data. Therefore, further improvements can be made to obtain better-generalized models by considering the above issues while developing IDS.

This paper presents a neural network based approach to categorize network traffic into normal and attack classes. The key contributions of this work are two-fold. Firstly, we show through empirical analysis that data normalization and the selection of neural network components such as initialization and activation functions play an essential role in identifying anomalies in traffic data. Secondly, the crucial issue of class imbalance is tackled at the classifier level. A cost-sensitive loss function is designed for imbalanced training and demonstrated to be more valuable than data-level approaches for obtaining generalized classification models. The experiments are performed on the challenging NSL-KDD and UNSW-NB15 datasets to assess the performance. The results are also compared with previous works to measure the improvement in classification performance. The obtained outcomes show that the balanced NN outperforms not only machine learning approaches but also advanced deep learning approaches, including CNN and RNN. These outcomes also confirm that data preprocessing and class imbalance are crucial barriers impeding the development of better IDS models.

The rest of the paper is organized as follows: Section 2 discusses the previous work on IDS using NSL-KDD and UNSW-NB15 datasets. The proposed IDS methodology using NN has been provided in Section 3, and the experimental outcomes are provided in Section 4. Section 5 concludes the work.

2 Literature survey

Numerous approaches have been proposed to build effective IDS from network traffic data by employing data preprocessing, feature reduction with feature subset selection, PCA and AE, machine learning and deep learning techniques. Initially, we briefly review the existing machine learning based approaches to identify their shortcomings and then present the neural network based efforts to solve this problem.

Machine learning based approaches have dominated the research in NIDS for the last two decades. These works mainly employed data normalization [44, 57], class imbalance [68]–[60], feature selection [59, 72], ensemble learning [16, 76], fuzzy rules [39] and comparison of various classifiers [34, 86] over several datasets. However, these approaches fail to address the inherent complex characteristics of the data distribution and usually lack in performance. The feature selection methods were unable to improve the classification accuracy and suffer from the overfitting problem, as observed in many studies [59, 72] and in our experiments [62]. Feature extraction also suffers from similar issues [55], and the comparisons of different learning algorithms or ensemble learning [16, 76] were also not able to yield satisfactory performance. The class imbalance problem, which is inherent in IDS, has been addressed by a few researchers at the data level. For instance, Wu et al. [68] combined k-means clustering and SMOTE, while Tan et al. [67] and Priyadarsini [60] used SMOTE on the full set of features with a random forest classifier. The latter authors also performed feature selection using the artificial bee colony algorithm.

On the other hand, deep learning showed significant improvements comparatively. Several forms of the artificial neural network (ANN) have been used to solve this problem, including self organization map (SOM) by Ibrahim et al. [23] for classifying attacks on KDDCup’99 and NSL-KDD datasets. Almi'ani et al. [3] clustered the outputs of SOM with k-means where largest cluster represents normal traffic while smaller one indicates the attack using NSL-KDD problem. Moukhafi et al. [51] used it with genetic algorithm over KDDCup’99 and UNSW-NB15 datasets. The works based on feedforward neural networks include comparing several backpropagation training algorithms on NSL-KDD [25] and selecting features on UNSW-NB15 dataset [48] using MLP classifier. Ding and Wang [12] and Kim et al. [35] used slightly different DNN models to detect anomalies in KDDCup’99 dataset. The convolutional-based approaches have been utilized in [42] where ResNet 50 and GoogLeNet models were used to identify attacks through conversion of NSL-KDD data into images. Using CNN, Wu et al. [77] proposed a cost-sensitive imbalance approach, Nguyen et al. [54] detected DoS attacks and Verma et al. [74] applied Adaptive Synthetic (ADASYN) sampling on NSL-KDD dataset to overcome the class imbalance. Some works focused on the feature transformation through AE, where Azar et al. [84] transformed the NSL KDD dataset and compared four classifiers. Zhang et al. [85] added a softmax layer to AE for the NSL-KDD problem whereas Khan et al. [78] proposed a two-stage semi-supervised approach using AE for KDDCup’99 and UNSW-NB15 problems. Al-qatf et al. [4] used the combination of AE and SVM for solving binary and multiclass problems. Dong et al. [13] selected features with the multivariate correlation analysis and fed them to LSTM to solve NSL-KDD and UNSW-NB15 classification problems. Yin et al. 
[83] used RNN to solve binary and multiclass classification problems, whereas Tchakoucht and Ezziyyani [70] used a multi-layered echo-state machine on the NSL-KDD and UNSW-NB15 datasets.

Recently, Sethi et al. [64] used a Deep-Q-Network to discriminate normal and attack samples on the UNSW-NB15 dataset. Wu et al. [78] combined a Deep Belief Network (DBN) and feature weighting to develop an IDS using the NSL-KDD dataset. Deep neural networks have been used with PCA-reduced data dimensionality in [63] for the NSL-KDD dataset and with extra-trees ranked features of the UNSW-NB15 and AWID datasets in [29]. Ashiku and Dagli [7] used CNN to classify attacks present in the UNSW-NB15 dataset, whereas Su et al. [66] proposed the multi-layered BAT-MC model comprising multiple CNN layers, a bi-directional LSTM layer, and an attention layer to detect anomalies in network traffic. Wu et al. [79] combined AE and kernel machine learning to solve the NSL-KDD problem. The stacked AE was used for feature transformation, kernel approximation was performed with a random Fourier feature selection approach, and a linear SVM was used for learning. Ieracitano et al. [24] proposed combining AEs and LSTM for the same problem. Mighan and Kahani [49] extracted latent features using a stacked AE and used SVM for classification. Jiang et al. [27] used a combination of under- and oversampling to deal with the data imbalance problem, where One-Side Selection (OSS) was used to reduce the noise samples in the majority category and SMOTE was used to increase the minority samples. A deep hierarchical model combining CNN and bi-directional LSTM was employed for classification. Using CNN, cost-sensitive approaches to handle class imbalance were proposed in [21, 38] with focal loss functions for the NSL KDD, UNSW-NB15, and Bot-IoT datasets.

The analysis of existing works reveals that the neural network based approaches are becoming more prevalent with the recent advances in deep learning. The majority of the researchers have developed end-to-end models using deep neural networks, CNN, and LSTM approaches for classification of traffic and employing AE primarily to reduce data dimensionality. These models yield some fruitful outcomes in terms of data dimensionality as well as better classification. However, the current works focus more on advanced deep learning approaches to alleviate the performance bottlenecks in this domain instead of resolving the class-imbalance issue. As a result, this relevant issue has been tackled in few works only where data and classifier level methods have been suggested. Additionally, recent efforts are directed towards cost-sensitive approaches. So, there is scope to develop the neural network based generalized IDS models through classifier-level balanced learning.

3 Materials and methods

This section provides the details of the datasets considered in the study and the proposed approach for classifying network traffic as normal or attack.

3.1 Datasets

In this paper, two challenging datasets, NSL KDD and UNSW-NB15, are collected from public sources to develop effective IDS. A brief description of both datasets is given as follows:

3.1.1 NSL KDD dataset

The KDDCup’99 dataset has been commonly used to solve network intrusion with ML techniques. The classifiers tend to overfit it due to numerous redundant records. The number of repeated samples belonging to attacks far exceeds the number of normal class samples in both the training (more than 90%) and testing (more than 80%) sets of this dataset, which introduced significant bias and resulted in high accuracy. To resolve this issue, the more challenging NSL KDD set was proposed, consisting of KDDTrain+ and KDDTest+ with 125,973 and 22,544 samples respectively. The details of the dataset are provided in Table 1 along with the class balance ratio.

Table 1. Description of NSL KDD dataset

3.1.2 UNSW-NB15 dataset

Another widely used dataset is the UNSW-NB15 which is developed with IXIA Perfect Storm tool at Australian Centre for Cyber Security. Unlike conventional datasets, it contains several modern synthesized attacks, including worms, fuzzer, generic, and reconnaissance. The attributes are categorized into five groups: flow, basic, content, time, and additional generated features. Table 2 provides the details about this dataset along with the class balance ratio.

Table 2. Description of UNSW-NB15 dataset

3.2 Proposed work

Figure 1 shows the flowchart of the proposed IDS approach with neural networks. The raw traffic data from networks contain heterogeneous feature types such as binary, numerical and categorical. Initially, categorical features are converted to numerical form using integer encoding to attain homogeneity. Afterward, the data is normalized so that all features share a uniform range. Then, the normalized data is fed to the neural network model to differentiate the traffic into normal and attack. The model is trained with a cost-sensitive loss function through class weights to deal with the imbalance issue. The weight for each class is computed based on a heuristic measure which has proven useful for better performance. The detailed methodology is provided as follows:

Fig. 1. Flowchart of the proposed IDS approach

3.2.1 Data preprocessing

Data preprocessing is a prerequisite step in modeling neural networks for analyzing complex features. It includes the transformation of categorical data and rescaling of data through normalization. Both NSL-KDD and UNSW-NB15 datasets consist of numeric and categorical attributes. In these datasets, three features are categorical while 38 features of NSL-KDD and 39 features of UNSW-NB15 are numeric. The categorical features are converted into numerals using integer encoding scheme. For instance, ‘protocol’ attribute in NSL-KDD defines TCP, UDP, and ICMP protocols to make the connection. To convert this attribute into the numeric form, each protocol is assigned with an integer. In this case, TCP, UDP, and ICMP values are converted into 1, 2, and 3 respectively.
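The integer encoding step described above can be sketched as follows. This is only an illustration under stated assumptions: the helper name `integer_encode` and the sample column are hypothetical, and codes are assigned in order of first appearance, which reproduces the TCP, UDP, ICMP to 1, 2, 3 mapping only because of the order of the sample values.

```python
def integer_encode(values, mapping=None):
    """Map each category to an integer code (hypothetical helper).

    Codes start at 1 and are assigned in order of first appearance.
    Passing an existing mapping reuses it, so the same category keeps
    the same code across training and test data.
    """
    if mapping is None:
        mapping = {}
    codes = []
    for value in values:
        if value not in mapping:
            mapping[value] = len(mapping) + 1
        codes.append(mapping[value])
    return codes, mapping


# Illustrative values for the 'protocol' attribute (not taken from the dataset files).
protocol_column = ["tcp", "udp", "icmp", "tcp", "udp"]
codes, protocol_map = integer_encode(protocol_column)

# Reusing the fitted mapping on another split keeps the codes consistent.
test_codes, _ = integer_encode(["icmp", "tcp"], mapping=protocol_map)
```

Integer codes impose an arbitrary ordering on the categories; the text adopts them for homogeneity, and one-hot encoding is the alternative mentioned in the introduction.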

After converting all features, it is essential to rescale them in a uniform range for better performance. The unnormalized features with different ranges introduce bias in learning as the greater numeric feature values dominate the smaller ones [65]. The training algorithms of neural networks also fail to converge due to an uneven range of features [19]. Thus, data is normalized with the min-max method on the basis of the empirical analysis involving several normalization methods. This normalization has been widely used in existing IDS approaches also [4, 23]. In this method, the features are rescaled to the interval [0, 1], and it is given as follows:

$${\hat{x}}_i=\frac{x_i-\min \left({x}_i\right)}{\max \left({x}_i\right)-\min \left({x}_i\right)}$$
(1)

where xi represents the ith feature of the data (denoted as x) to be learned by the classifier, and min and max represent the minimum and maximum values of the ith feature respectively.
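A minimal NumPy sketch of Eq. (1) is given below. Two choices here are assumptions rather than details from the text: the minimum and maximum are estimated on the training split and then reused for other data, and constant features (where max equals min) are mapped to 0 to avoid division by zero. The function names and the toy matrix are hypothetical.

```python
import numpy as np


def min_max_fit(train):
    """Per-feature minimum and maximum, estimated column-wise."""
    return train.min(axis=0), train.max(axis=0)


def min_max_transform(data, col_min, col_max):
    """Rescale each feature as (x - min) / (max - min)."""
    span = col_max - col_min
    # Assumption: a constant feature has span 0; divide by 1 so it maps to 0.
    safe_span = np.where(span == 0, 1.0, span)
    return (data - col_min) / safe_span


# Toy feature matrix: rows are samples, columns are features.
train = np.array([[0.0, 10.0, 5.0],
                  [5.0, 20.0, 5.0],
                  [10.0, 40.0, 5.0]])
col_min, col_max = min_max_fit(train)
scaled = min_max_transform(train, col_min, col_max)
```

Values outside the training range would fall outside [0, 1] after the transform; whether to clip them is a design choice left open here.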

3.2.2 Neural networks

The recent breakthroughs in artificial neural networks have improved the learning capabilities of machines manifold. This powerful learning technique, known as deep neural networks (DNNs), has achieved state-of-the-art performance on numerous classification problems in the fields of image recognition [21, 38], speech [18, 22], and natural language processing [11, 28]. Deep neural networks consist of multiple layers of non-linear functions composed in series. Unlike conventional shallow networks, these networks allow better function approximation. A shallow network corresponds to a model with a single hidden layer, whereas several hidden layers correspond to a deeper network. DNNs are preferred over shallow networks as a more compact representation of the same functions can be achieved. Other critical advances in neural networks are better activation functions, parameter initialization, and backpropagation training algorithms. We discuss these components in detail as follows:

Firstly, the activation functions play a vital role in the performance of neural networks. The rectified linear unit (ReLU) is the most popular non-linear activation function and converges very quickly compared to smoother functions such as the sigmoid and hyperbolic tangent [40, 53]. It is given as f(z) = max(z, 0) where z represents the input units. It preserves the linear nature for positive values while pruning the negative ones. Thus, the sparse activations help to obtain superior performance as well as to avoid the vanishing gradient problem. However, dropping negative values is not always helpful and causes the dead neuron problem [81]. This problem deactivates a large portion of the network, thereby limiting the capacity of the model. To resolve this issue, better non-linear activation functions belonging to the rectified unit family have been proposed. The Leaky ReLU [45] is one of the popular choices among them; it allows a nonzero gradient for negative values. It is given as follows:

$$y=\left\{\begin{array}{cc}z& if\left(z\ge 0\right)\\ {} az& otherwise\end{array}\right.$$
(2)

where z indicates the input units, y indicates the output units, and a is a fixed parameter in the interval (0, 1) that controls the slope for negative units. This function passes all positive values unchanged and a fraction of the negative ones. As the derivative is never zero, the probability of silent neurons reduces significantly. Leaky ReLU has outperformed ReLU in numerous studies [10] and was also found to be better suited to small datasets than ReLU in an empirical analysis of four ReLU-based functions [80]. Secondly, neural networks are sensitive to the random initialization of their parameters, which can cause vanishing and exploding gradient problems. As a result, optimizers such as standard gradient descent usually perform worse on deeper networks with naive random initialization than with statistics-based weight initializers. Thus, parameter initialization also plays an essential role in the performance of different architectures. Glorot initialization [17] is the most common approach, in which the network is initialized by drawing weights from the uniform distribution over (−s, s) where

$$s=\sqrt{\frac{6}{\text{fan-in}+\text{fan-out}}}$$
(6)

where fan-in and fan-out indicate the numbers of inputs and outputs to a neuron.
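The Leaky ReLU of Eq. (2) and the Glorot limit s can be illustrated with a short NumPy sketch. It assumes the uniform variant of Glorot initialization, in which weights are sampled uniformly from (−s, s); the function names are hypothetical. The slope a = 0.3 and the 41 input features (3 categorical plus 38 numeric for NSL-KDD) are the values reported elsewhere in the paper, while the seed is arbitrary.

```python
import numpy as np


def leaky_relu(z, a=0.3):
    """Eq. (2): identity for z >= 0, slope a for negative inputs."""
    z = np.asarray(z, dtype=float)
    return np.where(z >= 0, z, a * z)


def glorot_uniform(fan_in, fan_out, rng):
    """Sample a (fan_in, fan_out) weight matrix uniformly from (-s, s)."""
    s = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-s, s, size=(fan_in, fan_out))


rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
weights = glorot_uniform(41, 100, rng)
activations = leaky_relu([-2.0, 0.0, 3.0])
```

Negative inputs are scaled by a rather than zeroed, so their gradient is a instead of 0, which is the property the text relies on to reduce silent neurons.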

Lastly, neural networks are primarily trained with gradient descent (GD) optimization techniques. Gradient descent minimizes the objective function, typically cross-entropy, by updating the parameters of the network in the direction opposite to the gradient. However, these methods suffer from convergence issues, especially when the loss function is non-convex. They converge very slowly and require a learning rate suited to the problem, which makes them highly susceptible to getting trapped in a local minimum. Therefore, stochastic gradient descent (SGD) algorithms are employed to find solutions with low training error and good generalization [20]. Adam [37] is one of the most successful first-order SGD optimization techniques; it accelerates training with an adaptive step size and momentum. This method adapts the learning rate of every parameter using exponential moving averages of the first and second moments of the gradients, and also introduces bias correction. This improves the convergence rate and makes Adam an excellent training algorithm that combines the advantages of AdaGrad (which works well with sparse gradients) and RMSProp (which works well in online and non-stationary settings). It is given as follows:

$${m}_t={\beta}_1{m}_{t-1}+\left(1-{\beta}_1\right){g}_t$$
$${v}_t={\beta}_2{v}_{t-1}+\left(1-{\beta}_2\right){g}_t^2$$
(3)

where t indicates the time step, gt is the gradient of the objective at step t, mt indicates the first moment, vt indicates the second moment, and β1 and β2 are the decay factors. The bias correction is given as follows:

$${\hat{m}}_t=\frac{m_t}{1-{\beta}_1^t}\quad \text{and}\quad {\hat{v}}_t=\frac{v_t}{1-{\beta}_2^t}$$
(4)

where \({\beta}_1^t\) and \({\beta}_2^t\) are the tth power of β1 and β2 respectively. Each parameter is then updated as follows:

$${\theta}_t={\theta}_{t-1}-\frac{\alpha_t}{\sqrt{{\hat{v}}_t}+\epsilon }{\hat{m}}_t$$
(5)

where αt is the learning rate, and ϵ is a small constant to avoid division by zero.
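The update rules in Eqs. (3) to (5) can be written as a single step function, sketched here in NumPy. The toy objective f(θ) = θ², the larger learning rate used in the loop, and the value eps = 1e-8 are assumptions chosen only to make the example small; they are not settings from the paper.

```python
import numpy as np


def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based time step."""
    m = beta1 * m + (1 - beta1) * grad        # first moment, Eq. (3)
    v = beta2 * v + (1 - beta2) * grad ** 2   # second moment, Eq. (3)
    m_hat = m / (1 - beta1 ** t)              # bias correction, Eq. (4)
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)  # parameter update, Eq. (5)
    return theta, m, v


# Single step from theta = 1 on f(theta) = theta^2, whose gradient is 2 * theta.
first_theta, _, _ = adam_step(np.array([1.0]), np.array([2.0]),
                              np.zeros(1), np.zeros(1), t=1)

# Toy optimization loop with an illustrative learning rate.
theta = np.array([1.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
for t in range(1, 201):
    grad = 2 * theta
    theta, m, v = adam_step(theta, grad, m, v, t, lr=0.05)
```

On the very first step the bias-corrected ratio of the moments has magnitude close to 1, so the parameter moves by roughly the learning rate regardless of the gradient scale.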

The proposed multilayer neural networks consist of an input layer, two hidden layers, and one output layer. The activation functions at the hidden layers are Leaky ReLU with the parametric setting of a = 0.3, whereas softmax is used at the output layer. The number of neurons is set to 100 and 200 at the hidden layers. The network is trained using mini-batch of 128 training samples and by minimizing the cross-entropy between the binary class label vector y = [yA, yN] and the output probability vector \(\hat{y}=\left[{\hat{y}}_A,{\hat{y}}_N\right]\). The binary cross-entropy is given as follows:

$$E\left(y,\hat{y}\right)=-\left({y}_A\log \left({\hat{y}}_A\right)+{y}_N\log \left({\hat{y}}_N\right)\right)$$
(7)

Adam is used to train the network with the following parameters: the learning rate (αt) is set to 0.001, and β1 and β2 are set to 0.9 and 0.999 respectively.
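The forward pass of the described network and its cross-entropy can be sketched in NumPy as follows. Several details are assumptions: the hidden layers are taken in the order 100 then 200, biases start at zero, the loss is averaged over the mini-batch, and the inputs and labels are random placeholders rather than dataset records. Training itself (backpropagation with Adam) is omitted, and this is not the TensorFlow implementation used in the experiments.

```python
import numpy as np


def leaky_relu(z, a=0.3):
    return np.where(z >= 0, z, a * z)


def softmax(z):
    shifted = z - z.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def init_layer(fan_in, fan_out, rng):
    """Glorot-uniform weights and zero biases (the zero biases are an assumption)."""
    s = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-s, s, size=(fan_in, fan_out)), np.zeros(fan_out)


def forward(x, layers):
    hidden = x
    for weights, bias in layers[:-1]:
        hidden = leaky_relu(hidden @ weights + bias)
    weights, bias = layers[-1]
    return softmax(hidden @ weights + bias)  # columns: [attack, normal]


def cross_entropy(y_true, y_prob, eps=1e-12):
    """Cross-entropy between one-hot labels and predicted probabilities, batch-averaged."""
    return float(-np.mean(np.sum(y_true * np.log(y_prob + eps), axis=1)))


rng = np.random.default_rng(0)
n_features = 41  # NSL-KDD: 3 categorical + 38 numeric features
layers = [init_layer(n_features, 100, rng),
          init_layer(100, 200, rng),
          init_layer(200, 2, rng)]

x_batch = rng.random((128, n_features))            # placeholder mini-batch of 128 samples
y_batch = np.eye(2)[rng.integers(0, 2, size=128)]  # placeholder one-hot labels
probs = forward(x_batch, layers)
loss = cross_entropy(y_batch, probs)
```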

3.2.3 Class imbalance

The issue of class imbalance is prevalent in intrusion detection problems due to the lack of sufficient data pertaining to attacks. It refers to improper distribution of data in which one class contains a significantly large number of instances compared to others. While training the model with such data, the classifiers tend to be more skewed towards majority class instances than minority class ones. Although the performance of the model becomes high due to the biased classification, minority class instances suffer greatly. The results usually contain a high false alarm rate. Therefore, this problem needs to be addressed before training the model so that the underrepresented class could have the same importance in learning as does the majority class. Class imbalance can be tackled using several ways at the data and classifier level. At the data level, sampling the data is the popular technique that aims to balance the class distribution of the training data [71]. The balancing can be achieved by either selecting fewer instances of the majority class to equalize the minority class instances which is known as under sampling or adding new instances to the minority class to balance the majority class instances which is known as oversampling.

At the classifier level, this issue is handled with the cost-sensitive approach. In this approach, the penalty for the misclassification of minority classes is applied to force the learning algorithms to focus more on these classes [8]. It is attained by either assigning weights to the classes depending upon their number of instances or explicitly adjusting the prior probabilities of the classes. The first approach is more commonly used to balance different classes at the training time. It is implemented in the form of a loss function by using different error penalties for classes. On the other hand, testing time cost assignment is done either by assigning threshold values to the prior probabilities or adjusting F-score [8].

In this work, weights are assigned to the classes using heuristic measures [36], which is proven to be helpful in solving many classification problems effectively in recent times [41], [58], [47]. The weights are determined for each class as follows:

$${w}_A=\frac{\left|A\right|+\left|N\right|}{2\ \left|A\right|}$$
$${w}_N=\frac{\left|A\right|+\left|N\right|}{2\ \left|N\right|}$$
(8)

where |A| and |N| denote the total samples belonging to attacks and regular traffic respectively, and wA and wN are the corresponding loss weights. The weight for each class modifies the binary cross-entropy function for training as follows:

$$E\left(y,\hat{y}\right)=-\left({w}_A{y}_A\log \left({\hat{y}}_A\right)+{w}_N{y}_N\log \left({\hat{y}}_N\right)\right)$$
(9)

This class-weighted function is used to train neural networks for identifying attacks from normal traffic.
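A small sketch of Eqs. (8) and (9) is shown below. The class counts (200 attacks, 800 normal records) are hypothetical, the loss is averaged over samples as an assumption, and the logarithm is taken of the predicted probabilities.

```python
import numpy as np


def class_weights(n_attack, n_normal):
    """Eq. (8): w_c = (|A| + |N|) / (2 |c|) for c in {A, N}."""
    total = n_attack + n_normal
    return total / (2.0 * n_attack), total / (2.0 * n_normal)


def weighted_cross_entropy(y_true, y_prob, w_attack, w_normal, eps=1e-12):
    """Eq. (9) averaged over samples; columns are ordered [attack, normal]."""
    weights = np.array([w_attack, w_normal])
    per_sample = -np.sum(weights * y_true * np.log(y_prob + eps), axis=1)
    return float(per_sample.mean())


# Hypothetical class counts, chosen only for illustration.
w_a, w_n = class_weights(n_attack=200, n_normal=800)

y_true = np.array([[1.0, 0.0], [0.0, 1.0]])  # one attack, one normal sample
y_prob = np.array([[0.5, 0.5], [0.5, 0.5]])  # uninformative predictions
loss = weighted_cross_entropy(y_true, y_prob, w_a, w_n)
```

With these counts the minority class receives the larger weight (2.5 against 0.625), and the weighted class totals are equal (2.5 × 200 = 0.625 × 800 = 500), so both classes contribute equally to the loss overall.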

4 Experimental results

The proposed methodology is implemented in Python using TensorFlow and scikit-learn on a machine with an Intel® Core™ i5 processor and 8 GB RAM. Two widely used datasets, namely NSL-KDD and UNSW-NB15, have been considered to validate the proposed approach. The performance of the model is evaluated based on five metrics: classification accuracy, precision, recall, F-score, and area under the curve. These metrics are described as follows:

Classification accuracy

It is defined as the percentage of correctly classified instances to the total instances. It is given as follows:

$$ACC=\frac{TP+ TN}{TP+ TN+ FP+ FN}$$

where TP indicates the True Positive, the number of attacks that are correctly identified as an attack, TN indicates the True Negative, the number of normal packets that are correctly identified as normal, FP indicates the False Positive, normal packets that are incorrectly identified as attacks and FN indicates the False Negative, attacks that are incorrectly identified as normal traffic.

Precision

It is defined as the percentage of correctly identified attacks to the total number of records classified as attacks. It is given as follows:

$$Precision=\frac{TP}{TP+ FP}$$

Recall

It measures the percentage of correctly identified attacks versus the total number of attacks and is given as follows:

$$Recall=\frac{TP}{TP+ FN}$$

F1-score

It is the harmonic mean of precision and recall metrics and is computed as follows:

$$FS=\frac{2\times Precision\times Recall}{Precision+ Recall}$$
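The four threshold-based metrics can be computed directly from the confusion counts, as in the sketch below. Accuracy is computed as the share of correctly classified instances, that is, true positives plus true negatives over all instances. The label vectors are made up for illustration, attacks are encoded as 1 (the positive class), and returning 0 when a denominator is empty is an assumption.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, TN, FP, FN with 'attack' as the positive class."""
    tp = tn = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive and pred == positive:
            tp += 1
        elif truth != positive and pred != positive:
            tn += 1
        elif truth != positive and pred == positive:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn


def classification_metrics(tp, tn, fp, fn):
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # Assumption: report 0.0 when a denominator would be zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if (precision + recall) else 0.0)
    return accuracy, precision, recall, f_score


# Made-up labels: 1 = attack, 0 = normal.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
tp, tn, fp, fn = confusion_counts(y_true, y_pred)
accuracy, precision, recall, f_score = classification_metrics(tp, tn, fp, fn)
```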

Area under the ROC curve (AUC)

This measure is defined in terms of the receiver operating characteristic (ROC) curve to measure classifier performance. The ROC curve is obtained by plotting the true positive rate (recall) against the false positive rate at different decision threshold values. The AUC measure is estimated by computing the area under this curve.
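As an illustration of the AUC computation, the sketch below sweeps a decision threshold over the predicted attack scores, records the true positive rate and false positive rate at each threshold, and integrates the resulting curve with the trapezoidal rule. The scores and labels are made up, and the function names are hypothetical.

```python
import numpy as np


def roc_points(y_true, scores):
    """True/false positive rates for thresholds swept from high to low."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    # Start above every score so the curve begins at (0, 0).
    thresholds = np.r_[np.inf, np.sort(np.unique(scores))[::-1]]
    n_pos = (y_true == 1).sum()
    n_neg = (y_true == 0).sum()
    fpr, tpr = [], []
    for threshold in thresholds:
        predicted_attack = scores >= threshold
        tpr.append((predicted_attack & (y_true == 1)).sum() / n_pos)
        fpr.append((predicted_attack & (y_true == 0)).sum() / n_neg)
    return np.array(fpr), np.array(tpr)


def area_under_curve(fpr, tpr):
    """Trapezoidal rule over consecutive ROC points."""
    return float(np.sum((fpr[1:] - fpr[:-1]) * (tpr[1:] + tpr[:-1]) / 2.0))


# Made-up labels (1 = attack) and predicted attack probabilities.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
fpr, tpr = roc_points(y_true, scores)
auc = area_under_curve(fpr, tpr)
```

For this toy input the area is 0.75: three of the four attack/normal pairs are ranked correctly by the scores.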

4.1 Fine-tuning NN architecture

To validate the proposed model for identifying attacks, we performed several experiments using the training sets of NSL-KDD and UNSW-NB15 datasets. The holdout approach is used for this purpose where train sets of both datasets are split into 70% for training and 30% for validation. Figure 2 compares the impact of activation and parameter initialization using the proposed two-layered NN. Specifically, the performance of activation function Leaky ReLU with glorot initialization has been compared with the combination of ReLU activation function and uniform initialization. From the outcomes, the overfitting with ReLU and uniform initialization can be observed on both datasets in contrast to the proposed approach. In the NSL-KDD dataset, even though ReLU attains better convergence than Leaky ReLU, the validation error exceeds the Leaky ReLU after 40 epochs. Similarly, the ReLU obtains better convergence, but validation error does not improve after 50 epochs and is comparable to Leaky ReLU in the UNSW-NB15 dataset. With Leaky ReLU, the training and validation errors are almost similar, thereby showing minor overfitting only. These outcomes showed the negative values have a crucial role in solving both problems. Leaky ReLU reduces the frequency of silent neurons by introducing leak correction in negative units, thereby allowing more neurons to have values during training. Additionally, parameter initialization with glorot scheme also complements the performance of the network as compared to the uniform one. Thus, the proposed activation function and initialization is the better choice for IDS classification.

Fig. 2. Comparison of activation function and parameter initialization with two hidden layered NN

Further, we analysed the effect of network depth on obtaining generalized models for both datasets. The outcomes are shown in Fig. 3, where the comparison has been made between one, two, and three hidden layers with the Leaky ReLU activation and Glorot initialization. From the outcomes, it is evident that the shallow network does not attain good performance. Further, when comparing the two- and three-layered architectures, the former emerges as the better choice. On the NSL-KDD dataset, the performance of both architectures is almost similar. On the UNSW-NB15 dataset, on the other hand, the three-layered architecture overfits in contrast to the two-layered one. Thus, the two-layered architecture is the best choice to develop the IDS models.

Fig. 3. Effect of network depth on NN performance for IDS

4.2 Impact of data normalization

The impact of data normalization on both datasets has been measured to obtain better quality of data. Although several normalization methods are available for this purpose, the best one depends upon the data itself. Thus, this work selects the normalization method based on empirical analysis. Four widely used methods, namely z-score, min-max, pareto scaling, and tanh, have been considered for the experiments. Table 3 provides the outcomes for each method on both datasets. From the outcomes, it is evident that data normalization improves the classification of IDS with deep neural networks. Additionally, min-max normalization helps to attain better performance than the other three methods. The un-normalized data result in very poor performance. The difference between the best and worst accuracy is 22.67% and 11.61% on the NSL-KDD and UNSW-NB15 datasets respectively, when comparing the un-normalized with the min-max outcomes. Further, the improvement is more than 3% on NSL-KDD data while accuracy is improved by 2% on UNSW-NB15 data with the min-max method. Thus, it can be concluded that data preprocessing with normalization plays a crucial role in building better prediction models for detecting attacks. Subsequently, we considered the min-max normalized data for further experiments based on the empirical evidence (Table 3).

Table 3. Effect of data normalization on the classification performance of IDS systems

4.3 Impact of class imbalance

The class imbalance problem is the critical aspect of the IDS model demanding a relevant approach. For this purpose, we have measured the effects of class imbalance on classification performance with several techniques. Table 4 provides the outcomes obtained with the popular methods to deal with class imbalance. It includes imbalanced data as well as data and classifier level approaches. At the data level, three widely used approaches, namely, over-sampling, under-sampling, and bagging, have been considered. Specifically, cluster centroid has been chosen as an under-sampling method in which centers of the clusters are determined from the majority class samples using k-means algorithm to reduce the instances. SMOTE has been selected as the over-sampling method where samples from the minority class are generated using k nearest neighbors randomly. In the bagging approach, multiple training sets are used where each set has the same number of instances from both classes. In this study, the training sets have been made by partitioning the majority class into five non-overlapping subsets. The equal number of minority class instances have been selected randomly to make balanced sets. The classification decision has been determined with soft and hard decision strategies. In the soft decision, the average of prediction probabilities from 5 sets has been used to determine the final class. On the other hand, the majority voting scheme has been employed in the hard decision strategy.
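The bagging scheme and the two decision strategies described above can be sketched as follows. The label vector and the per-model probabilities are made up for illustration, the function names are hypothetical, and sampling the minority instances without replacement when enough are available is an assumption; training the five classifiers is omitted.

```python
import numpy as np


def balanced_subsets(y, n_subsets=5, seed=0):
    """Partition the majority class into non-overlapping parts and pair each
    part with an equal number of randomly chosen minority instances."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    majority = classes[np.argmax(counts)]
    majority_idx = rng.permutation(np.flatnonzero(y == majority))
    minority_idx = np.flatnonzero(y != majority)
    subsets = []
    for part in np.array_split(majority_idx, n_subsets):
        # Assumption: sample with replacement only if the minority class is too small.
        replace = len(part) > len(minority_idx)
        chosen = rng.choice(minority_idx, size=len(part), replace=replace)
        subsets.append(np.concatenate([part, chosen]))
    return subsets


def soft_vote(attack_probs, threshold=0.5):
    """Average the attack probabilities of all models, then threshold."""
    return (np.mean(attack_probs, axis=0) >= threshold).astype(int)


def hard_vote(attack_probs, threshold=0.5):
    """Threshold each model separately, then take the majority vote."""
    votes = (np.asarray(attack_probs) >= threshold).astype(int)
    return (votes.sum(axis=0) > votes.shape[0] / 2).astype(int)


# Made-up labels: 100 majority (0) and 20 minority (1) instances.
y = np.array([0] * 100 + [1] * 20)
subsets = balanced_subsets(y)

# Made-up attack probabilities from 5 models (rows) for 2 samples (columns).
attack_probs = np.array([[0.9, 0.40],
                         [0.8, 0.45],
                         [0.1, 0.45],
                         [0.2, 0.45],
                         [0.3, 0.90]])
soft = soft_vote(attack_probs)
hard = hard_vote(attack_probs)
```

The second sample shows how the strategies can disagree: one confident model lifts the averaged probability above the threshold, while only one of the five individual votes favors the attack class.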

Table 4. Comparison of imbalance data with class-balanced approaches on IDS datasets
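
The bagging procedure and the two decision strategies described above can be sketched as follows. This is a minimal illustration under stated assumptions: binary labels (0 and 1), a 0.5 decision threshold, and hypothetical function names; the per-set classifiers themselves are omitted and represented only by their predicted attack probabilities.

```python
import numpy as np

def balanced_bags(y, n_bags=5, seed=0):
    """Return one index array per bag: a disjoint majority-class partition
    plus an equally sized random draw from the minority class."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    majority = classes[np.argmax(counts)]
    maj_idx = rng.permutation(np.flatnonzero(y == majority))
    min_idx = np.flatnonzero(y != majority)
    bags = []
    for part in np.array_split(maj_idx, n_bags):
        # Draw with replacement only if the minority class is too small.
        replace = len(part) > len(min_idx)
        drawn = rng.choice(min_idx, size=len(part), replace=replace)
        bags.append(np.concatenate([part, drawn]))
    return bags

def soft_decision(probs, threshold=0.5):
    # probs has shape (n_models, n_samples); average the probabilities.
    return (probs.mean(axis=0) >= threshold).astype(int)

def hard_decision(probs, threshold=0.5):
    # Each model votes for a class; the majority of votes decides.
    votes = (probs >= threshold).astype(int)
    return (votes.sum(axis=0) > probs.shape[0] / 2).astype(int)

# Toy labels: 50 majority (0) and 10 minority (1) samples.
y = np.array([0] * 50 + [1] * 10)
bags = balanced_bags(y)

# Toy attack probabilities from five models for two samples.
probs = np.array([[0.90, 0.1], [0.45, 0.2], [0.45, 0.3], [0.45, 0.6], [0.90, 0.4]])
soft = soft_decision(probs)
hard = hard_decision(probs)
```

The first toy sample shows how the two strategies can disagree: two confident models pull the average probability above the threshold, while only two of the five votes favor the attack class.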

The obtained outcomes show that class imbalance causes lower performance on both datasets. Among the class imbalance approaches, the classifier-level method performs better than the data-level approaches. The lowest performance is attained with the two bagging strategies (less than 80% accuracy on both datasets). Further, oversampling with SMOTE achieves better performance than the cluster centroids method. The proposed classifier-level scheme outperforms SMOTE on the NSL-KDD dataset by more than 1.12%. Comparable performance is observed on the UNSW-NB15 dataset, where its accuracy is 0.06% lower, which is a minor difference. Thus, the empirical evidence indicates the superiority of the classifier-level approach to class imbalance in dealing with the IDS problem.
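
A classifier-level treatment of class imbalance typically reweights the loss so that errors on the minority class cost more. The sketch below illustrates one common form, a class-weighted binary cross-entropy with inverse-frequency weights. The weighting rule, the normalization by the sum of weights, and the function names are assumptions made for illustration and may differ from the cost function used in this work.

```python
import numpy as np

def balanced_class_weights(y):
    # Inverse-frequency weights: n_samples / (n_classes * class_count).
    classes, counts = np.unique(y, return_counts=True)
    weights = len(y) / (len(classes) * counts)
    return dict(zip(classes.tolist(), weights.tolist()))

def weighted_bce(y_true, p_pred, class_weight, eps=1e-7):
    # Binary cross-entropy in which each sample is scaled by its class weight.
    p = np.clip(p_pred, eps, 1.0 - eps)
    w = np.where(y_true == 1, class_weight[1], class_weight[0])
    loss = -(y_true * np.log(p) + (1 - y_true) * np.log(1.0 - p))
    return float(np.sum(w * loss) / np.sum(w))

# Toy labels: 8 samples of class 0 and 2 samples of class 1.
y = np.array([0] * 8 + [1] * 2)
cw = balanced_class_weights(y)

# A predictor that outputs 0.2 for every sample, i.e. ignores the minority class.
p = np.full(len(y), 0.2)
plain = weighted_bce(y, p, {0: 1.0, 1: 1.0})
weighted = weighted_bce(y, p, cw)
```

With these toy values the weighted loss is larger than the unweighted one, since the misclassified minority samples now contribute as much as the whole majority class.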

4.4 Performance with proposed approach

The performance of the proposed approach in terms of accuracy, precision, recall, F-score, and AUC for detecting attacks is provided in Table 5. Figure 4 shows the confusion matrices for the NSL-KDD and UNSW-NB15 datasets. From the outcomes, it can be seen that the proposed approach alleviates the class imbalance problem effectively. The minority attack samples on NSL-KDD have been classified more precisely, and a similar trend has been observed on UNSW-NB15, where normal samples belong to the minority class. The proposed approach detects most of the attacks (>94%) on the UNSW-NB15 dataset. In contrast, performance on the NSL-KDD dataset is lower, reflecting a primary challenge that this dataset poses for machine and deep learning approaches. The F-score is above 85% and 91% for NSL-KDD and UNSW-NB15, respectively. Moreover, the AUC values above 94% and 97% for NSL-KDD and UNSW-NB15, respectively, show the strength of the proposed approach in separating normal and attack samples. Figure 5 shows the corresponding ROC curves for both datasets.

Table 5. Performance of proposed approach for the prediction of attacks on different metrics
Fig. 4. Confusion matrices for the NSL-KDD and UNSW-NB15 datasets

Fig. 5. ROC curves with the proposed approach for the NSL-KDD and UNSW-NB15 datasets
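
For reference, the metrics reported in this subsection can be derived from a confusion matrix and from the ranking of predicted scores. The following sketch assumes binary labels with the attack class encoded as 1; the function names and the toy data are illustrative only and are not taken from the experiments.

```python
import numpy as np

def binary_metrics(y_true, y_pred):
    # Confusion-matrix counts with the attack class (1) as positive.
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    accuracy = (tp + tn) / len(y_true)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f_score": f_score}

def auc_score(y_true, scores):
    # AUC as the probability that a random positive sample is scored
    # higher than a random negative one (ties count as one half).
    pos, neg = scores[y_true == 1], scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    wins = np.sum(diff > 0) + 0.5 * np.sum(diff == 0)
    return float(wins / (len(pos) * len(neg)))

# Toy predictions for eight samples.
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1])
scores = np.array([0.9, 0.8, 0.4, 0.3, 0.2, 0.6, 0.1, 0.7])
y_pred = (scores >= 0.5).astype(int)

metrics = binary_metrics(y_true, y_pred)
auc = auc_score(y_true, scores)
```

The pairwise AUC computation uses memory proportional to the product of the positive and negative counts, so it suits small illustrations; a rank-based or library implementation would be preferable for full test sets.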

4.5 Comparison with existing works

The performance of the proposed approach has been compared with existing works to validate its effectiveness in developing a good IDS. The comparisons have been made on the test sets of the NSL-KDD and UNSW-NB15 datasets using accuracy, and the outcomes are provided in Tables 6 and 7, respectively. The proposed approach has obtained satisfactory results on both datasets, and evidently, handling class imbalance plays a crucial role in the better performance. Most previous works have not considered this issue and are therefore unable to achieve better performance on both datasets. On the NSL-KDD dataset, most of the earlier results (22 out of 29) reported accuracy between 80% and 85%, as shown in Table 6. One of the data normalization methods has been used in many reported works (20 out of 29). However, only four methods deal with the class imbalance problem, and these methods reported accuracy in the range of 84% to 85%, which is higher than the other approaches. Two works have achieved 85.73% and 85.80% accuracy, which is 0.17% and 0.24% higher than the proposed work. However, both approaches are much more complex than the simple and straightforward proposed approach. In [78], a combination of deep belief network, feature weighting, particle swarm optimization, and SVM was proposed. In [79], autoencoders, kernel approximation, and linear SVM were combined for better performance. Additionally, class imbalance was not considered in either work.

Table 6. Comparison of the proposed approach with the existing works on NSL-KDD dataset
Table 7. Comparison of the proposed approach with the existing works on UNSW-NB15 dataset

On the UNSW-NB15 dataset, most of the works reported accuracies below 90%. Also, most works used data normalization (11 out of 15) for performance gains, which is similar to the NSL-KDD dataset. On the other hand, unlike on NSL-KDD, several works (6 out of 15) addressed the data imbalance issue in their method while working on this dataset, as shown in Table 7. In terms of performance, the proposed approach outperforms all approaches except one. A higher accuracy of 90.85% is reported by Kasongo [30] using a Decision Tree classifier. However, the reported F1-score of 88.45% is lower than that of the proposed approach (91.85%). These outcomes validate that the proposed approach solves the IDS classification problem more effectively than previous works.

5 Conclusion

This paper presents a simple and straightforward neural-based approach for differentiating regular and attack traffic. To establish the effectiveness of the proposed approach, rigorous experiments are performed using two challenging datasets, NSL-KDD and UNSW-NB15. First, the empirical analysis of normalization validates the higher performance with min-max normalization. This is in contrast to the z-score method, which has been widely used with deep learning-based approaches but does not achieve reasonable accuracy in this domain. Second, the impact of class imbalance has been analysed and compared with other contemporary approaches such as oversampling, undersampling, and ensemble learning. The cost-sensitive function emerged as the better option for modeling IDS problems compared to data-level approaches and the combination of multiple balanced models. Lastly, the outcomes are compared against current works to validate the competitive performance on both datasets. In conclusion, the proposed method with class-weighted neural networks is helpful for effectively classifying traffic. In the future, this work can be extended by introducing dimensionality reduction with a feature subset approach or through a new feature representation with autoencoders. The recognition of attacks is another viable direction for exploring the impact of the proposed work.