1 Introduction

The Internet of Things (IoT) has recently witnessed an explosive expansion into a broad range of daily-life and industrial applications [1,2,3], such as healthcare, smart homes, smart cities, smart energy, smart agriculture, and intelligent transportation. IoT networks aim to provide internet connectivity for transferring data among massive numbers of IoT devices, such as interconnected sensors, drones, actuators, smart vehicles and smart home appliances [2], using either wired or wireless communications. However, most of these IoT devices are low-cost, low-power and resource-limited, making them highly vulnerable to cyber attacks and intrusive activities. Therefore, it is vital to develop network intrusion detection systems (NIDS) that can promptly and reliably identify and prevent malicious attacks on IoT networks. To this end, a wide range of machine learning-based intrusion detection techniques have been designed for IoT, along with a number of public network traffic datasets [2, 4]. These datasets often contain a large number of features, many of which are irrelevant or redundant, which adversely affects both the complexity and accuracy of machine learning algorithms. Thus, many feature reduction methods have been developed for NIDS, among which feature selection and feature extraction are two of the most popular [2, 4], as discussed next.

In NIDS, feature selection has been widely used to reduce the dimensionality of original traffic data. For example, in [8], a mutual information (MI)-based feature selection algorithm was proposed in combination with a classifier called least-squares support vector machine, achieving higher accuracy and lower runtime complexity than existing schemes over three datasets, namely, KDD99 [9], NSLKDD [10] and Kyoto 2006+ [11]. Before that, an MI-based scheme was also proposed for NIDS in [12], which however suffers from higher computational complexity than the approach in [8]. Additionally, several approaches that rely on a genetic algorithm (GA) as a search strategy to select the best subset of features can be found in [13, 14]. These methods provide lower false alarm rates than the baselines on the UNSW-NB15 [15] and KDD99 [9] datasets. In [16], a hybrid feature selection approach, which relies on association rule mining and the central points of attribute values, was developed, showing a better evaluation on the UNSW-NB15 dataset than on NSLKDD. In [17], another hybrid feature selection method that combines particle swarm optimization (PSO), the ant colony algorithm, and GA was proposed, leading to better detection performance than baselines such as GA [13] on both the NSLKDD and UNSW-NB15 datasets. In [18], a pigeon-inspired optimizer was used for selecting features for NIDS, achieving higher accuracy than PSO [13] and the hybrid association rules method [16]. Note that the aforementioned feature selection schemes often suffer from high computational cost, especially those relying on GA, PSO or machine learning-based classifiers. For this reason, a correlation-based feature selection method that offers low computational cost was investigated for NIDS over the KDD99 and UNSW-NB15 datasets in [19], taking the correlation level among features into account. Recently, this correlation-based method was combined with ensemble-based machine learning classifiers to significantly improve the accuracy of NIDS [20], at the cost of higher complexity. Hence, aiming at real-time, low-latency attack detection solutions, this work focuses on the correlation-based feature selection method.

Unlike feature selection, which retains a subset of the original features, feature extraction compresses a large number of original features into a low-dimensional vector such that most of the information is retained. A number of feature extraction techniques have been applied to reduce data dimension in NIDS, such as principal component analysis (PCA), linear discriminant analysis (LDA), and neural network-based autoencoders (AE). For instance, in [23], PCA was applied to significantly reduce the dimension of the KDD99 dataset, improving both the accuracy and speed of NIDS, where a support vector machine was used for attack classification. Several variants of PCA were then adopted for intrusion detection, such as hierarchical PCA neural networks [24] and kernel PCA with GA [25], which can enhance the detection precision for low-frequency attacks. Applications of PCA to recent network traffic datasets such as UNSW-NB15 and CICIDS2017 [26] can be found in [27, 28]. In addition to PCA, LDA was also employed as a feature reduction method for NIDS in [29], remarkably reducing the computational complexity of NIDS. Then, in [30, 31], PCA and LDA were combined into a two-layer dimension reduction, which is capable of reliably detecting low-frequency malicious activities, such as User to Root and Remote to Local, over the NSLKDD dataset. To further improve the efficiency of feature extraction in NIDS, AE-based neural networks were used in a range of research works [32,33,34,35,36,37]. In particular, a stacked sparse AE approach was developed in [32] to conduct a non-linear mapping between high-dimensional and low-dimensional data over the NSLKDD dataset. In [33], a deep stacked AE was used to noticeably reduce the number of features to 5 and 10 for binary and multiclass classification, respectively, leading to better accuracy than previous methods. Additionally, a number of AE architectures based on long short-term memory (LSTM) were developed for dimensionality reduction in NIDS, such as variational LSTM [35] and bidirectional LSTM [34], which can efficiently address imbalanced and high-dimensional data. Note that these AE-based methods suffer from high computational cost compared to PCA and LDA, in both the training and testing phases. To address this issue, a network pruning algorithm was recently proposed in [36] to considerably lower the complexity of AE structures for extracting features in NIDS. In [37], a network architecture using an AE based on convolutional and recurrent neural networks was proposed to extract spatial and temporal features without human engineering.

It is worth noting that most of the aforementioned papers have focused on either improving the detection accuracy or reducing the computational complexity of NIDS, by using machine learning-based classification in combination with feature engineering methods such as feature selection and feature extraction for reducing data dimensionality. However, a comprehensive comparison between these two feature reduction methods has been overlooked in the literature. Our paper aims to address this gap. In particular, we first provide an overview of NIDS, with a focus on the feature reduction phase, where feature extraction with PCA and feature selection with the correlation matrix are two promising candidates for realistic low-latency operation of NIDS. Then, using the modern UNSW-NB15 dataset, we thoroughly compare the detection performance (precision, recall, F1-score) as well as the runtime complexity (training time and inference time) of these two methods, taking into account both binary and multiclass classification with the same number of selected/extracted features, denoted as K. Based on our extensive experiments, we found that feature selection generally achieves higher detection accuracy and requires less training and inference time when the number of reduced features K is large enough, while feature extraction outperforms feature selection when K gets smaller, such as \(K=4\) or less. Furthermore, in order to gain a deeper insight into the detection behaviors of both methods, we investigate and compare their accuracy for each attack class when varying K, using their best machine learning classifiers. This reveals that feature extraction is not only less sensitive to varying the number of reduced features, but also capable of detecting more diverse attack types than feature selection. Additionally, both tend to detect more attacks, i.e., more Abnormal samples, when fewer features are selected or extracted. Relying on such comprehensive observations, we provide a practical guideline for selecting an appropriate intrusion detection approach for each specific scenario, as detailed in Table 14 at the end of Sect. 4, which is, to the best of our knowledge, not available in the literature.

The rest of this paper is organized as follows. Section 2 discusses machine learning-based network intrusion detection methods for IoT networks. An overview of the UNSW-NB15 dataset and its pre-processing is given in Sect. 3. Section 4 provides the experimental results and discussion. Finally, Sect. 5 concludes the paper.

2 Machine learning-based network intrusion detection methods

In this section, we provide an overview of a machine learning-based NIDS, followed by details of the two major feature reduction methods, namely, feature selection and feature extraction.

2.1 Overview of NIDS

Fig. 1: Block diagram of a network intrusion detection system

A NIDS consists of three major components, namely, data pre-processing, feature reduction, and attack classification, as illustrated in Fig. 1. In the first phase, the raw data, denoted as the dataframe \({\textbf{Z}}\), may include unexpected or non-numeric values, such as null or nominal values. \({\textbf{Z}}\) is pre-processed in order to either replace these unexpected values with valid ones or transform them into numeric format using one-hot encoding. Several features that do not affect detection performance, such as the source IP address and the source port number, are dropped. Furthermore, depending on the classifier used for identifying attacks, we may apply normalization, for example, to constrain the values of all features, i.e., the elements of the first phase's output vector \({\textbf{X}}\) in Fig. 1, to the range from 0 to 1. We discuss this in detail in Sect. 3 when presenting the UNSW-NB15 dataset.

After the first phase, the pre-processed data \({\textbf{X}}\in {\mathbb {R}}^{D\times N}\) is likely to have many more features than the original data \({\textbf{Z}}\), particularly due to the use of one-hot encoding, where D is the number of dimensions, or equivalently, the number of features of \({\textbf{X}}\), and N is the number of data samples. For example, for the UNSW-NB15 dataset, the dimension of the data increases from 45 to nearly 200, which is too large for classification techniques to quickly recognize the attack type. To address this fundamental issue, the second phase reduces the number of features that will be used in the attack classification phase (the last phase in Fig. 1). For this, two feature reduction methods, called feature selection and feature extraction, are widely used to either select or extract a small number of the most important features from pre-processed traffic data. This procedure also removes a large number of unnecessary features, which not only increase the complexity of NIDS but also degrade its detection performance, as will be illustrated by the experimental results in Sect. 4. The output of the feature reduction block is denoted as \({\textbf{U}}\in {\mathbb {R}}^{K\times N}\) in Fig. 1, which is expected to have a much lower dimension than \({\textbf{X}}\), i.e., \(K\ll D\), while retaining its most important information.

Finally, in the third phase of the NIDS, binary or multiclass classification approaches based on machine learning, such as decision trees, random forests and multilayer perceptron neural networks, are employed to detect the attack type. Relying on the attack detection results, system administrators can promptly act to prevent malicious activities, ensuring the security of IoT networks. Note that the detection performance and latency of a NIDS strongly depend on which classifier and which feature reduction method it employs. Therefore, in this contribution, we comprehensively investigate the detection performance (in terms of recall, precision, F1-score) and latency (in terms of training time and inference time) of different detection methods in the presence of both feature selection and feature extraction as well as different machine learning classifiers. We focus in particular on the comparison between these two feature reduction methods, which are described in detail in the following subsections.

2.2 Feature selection

A number of feature selection techniques have been used in intrusion detection, namely, information gain (IG) [8] and feature correlation [19, 20, 38]. In this work, we focus on feature correlation for selecting important features, since this method has been shown to achieve competitive detection accuracy and complexity compared to other selection counterparts. Using this correlation-based method, we aim to select the features that are most correlated with the other features, based on the correlation matrix calculated from the training dataset. More specifically, the correlation coefficient between features \(\Omega _{1}\) and \(\Omega _{2}\) is calculated from the numeric pre-processed training dataset \({\textbf{X}}\) as follows [38]:

$${\mathcal {C}}_{\Omega _{1},\Omega _{2}}=\frac{\sum _{i=1}^{N}\left( \alpha _{i}-E_{\Omega _{1}}\right) \left( \beta _{i}-E_{\Omega _{2}}\right) }{\sqrt{\sum _{i=1}^{N}\left( \alpha _{i}-E_{\Omega _{1}}\right) ^{2}}\cdot \sqrt{\sum _{i=1}^{N}\left( \beta _{i}-E_{\Omega _{2}}\right) ^{2}}},$$
(1)

where \(\alpha _{i}\) and \(\beta _{i}\) are the values of these two features, and \(E_{\Omega _{1}}=\sum _{i=1}^{N}\alpha _{i}/N\) and \(E_{\Omega _{2}}=\sum _{i=1}^{N}\beta _{i}/N\) are their means over the N training data samples. Note that after pre-processing the raw data \({\textbf{Z}}\) to obtain \({\textbf{X}}\), all features of \({\textbf{X}}\) are numeric, i.e., \(\alpha _{i}\) and \(\beta _{i}\) are numeric, making (1) applicable. By doing this, we obtain a \(D\times D\) correlation matrix \({\textbf{C}}\), whose elements are given by \(c_{ij}={\mathcal {C}}_{\Omega _{i},\Omega _{j}}\) for \(i,j=1,2,\ldots ,D\). The average correlation of feature \(\Omega _{i}\) to the other features is computed as follows:

$$C_{i}=\frac{\sum _{j=1}^{D}c_{ij}}{D},$$
(2)

where \(c_{ii}=1\) and \(c_{ij}\in [-1,1]\) for \(j\ne i\). Note that the self-correlation coefficient \(c_{ii}\) does not affect the selection results, since it contributes the same amount to every \(C_{i}\) for \(i=1,2,\ldots ,D\). Then, using a suitable threshold, as will be detailed in Sect. 4, we are able to select the K most important features, corresponding to the K largest values \(C_{i}\).

It is worth noting that we only need to calculate the feature correlation in the training phase; in the testing phase, we simply pick the K selected features from the high-dimensional data \({\textbf{X}}\) to form the reduced-dimensional data \({\textbf{U}}\) in Fig. 1. This requires very little computational resource compared with the feature extraction method, which is presented next.
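To make this concrete, the following minimal sketch implements the correlation-based selection of Eqs. (1) and (2); it assumes the pre-processed, all-numeric training data is held in a pandas DataFrame X_train with one column per feature, and the function name select_features is ours rather than from any library. Selecting the K largest values of \(C_i\) is equivalent to the thresholding reported later in Table 3.

```python
import numpy as np
import pandas as pd

def select_features(X_train: pd.DataFrame, k: int) -> list:
    """Rank features by their average correlation to all features,
    as in Eqs. (1)-(2), and keep the k highest-ranked ones."""
    C = X_train.corr().to_numpy()    # D x D correlation matrix, Eq. (1)
    C_avg = np.nanmean(C, axis=1)    # C_i of Eq. (2); NaNs from constant columns ignored
    top_k = np.argsort(C_avg)[::-1][:k]
    return X_train.columns[top_k].tolist()

# Training phase: learn which K columns to keep.
# selected = select_features(X_train, k=8)
# Testing phase: simply pick those columns, with no extra computation.
# U_test = X_test[selected]
```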

2.3 Feature extraction

PCA [23] and AE [36] are the two major feature extraction methods used in NIDS. Unlike feature selection, whose selected features are identical to features appearing in the original data, these feature extraction techniques compress the high-dimensional data \({\textbf{X}}\) into the low-dimensional data \({\textbf{U}}\) using either a projection matrix or an AE-based neural network learned from the training dataset. Note that the AE approach usually suffers from the high computational complexity of a deep neural network (DNN), leading to higher latency than PCA. Thus, in this work, we concentrate on the PCA-based feature extraction approach in order to fulfill the strict latency requirements of a NIDS for promptly preventing severe cyber attacks.

In what follows, we introduce the procedure for producing the \(D\times K\) projection matrix \({\textbf{W}}\) in the training phase, and show how this matrix is utilized in the testing phase. In particular, given the pre-processed training data \({\textbf{X}}\) of N samples, we first center it by subtracting the mean over all training samples from each sample, i.e., the centered data is given by \({\hat{{\textbf{X}}}}={\textbf{X}}-{\bar{{\textbf{X}}}}\), where \({\bar{{\textbf{X}}}}\) is the mean vector. Then, we compute the \(D\times D\) covariance matrix of the training data as \({\textbf{R}}=\frac{1}{N}\hat{{\textbf{X}}}\hat{{\textbf{X}}}^{T}\). We determine its eigenvalues and eigenvectors, from which we select the K eigenvectors corresponding to the K largest eigenvalues to construct the \(D\times K\) projection matrix \({\textbf{W}}\). These K eigenvectors are regarded as the principal components spanning a subspace that is expected to closely approximate the centered high-dimensional data \({\hat{{\textbf{X}}}}\). Finally, the compressed data is determined by \({\textbf{U}}={\textbf{W}}^{T}\hat{{\textbf{X}}}\), which has size \(K\times N\) instead of the \(D\times N\) of the original data.

In the testing phase, for each new data point \({\textbf{x}}_{i}\in {\mathbb {R}}^D\), the dimension is reduced using PCA according to \({\textbf{u}}_{i}={\textbf{W}}^{T}\left( {\textbf{x}}_{i}-{\bar{{\textbf{X}}}}\right)\). This indicates that the output of the PCA training phase includes both the projection matrix \({\textbf{W}}\) and the mean vector of all training samples \({\bar{{\textbf{X}}}}\). It should be noted that the projection matrix calculation can be computationally expensive, particularly when D and K are large.
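As an illustration, the following NumPy sketch mirrors the training and testing steps above; the function names pca_fit and pca_transform are ours, and the data is assumed to be arranged as a \(D\times N\) array (features by samples), matching the notation of this section.

```python
import numpy as np

def pca_fit(X, k):
    """Training phase: learn the D x K projection matrix W and the
    mean vector from the pre-processed data X of shape (D, N)."""
    x_bar = X.mean(axis=1, keepdims=True)   # mean over the N samples
    X_hat = X - x_bar                       # centered data
    R = (X_hat @ X_hat.T) / X.shape[1]      # D x D covariance matrix
    eigvals, eigvecs = np.linalg.eigh(R)    # eigenvalues in ascending order
    W = eigvecs[:, ::-1][:, :k]             # K eigenvectors of the K largest eigenvalues
    return W, x_bar

def pca_transform(X, W, x_bar):
    """Testing phase: u_i = W^T (x_i - x_bar), applied column-wise."""
    return W.T @ (X - x_bar)                # shape (K, M) for input of shape (D, M)
```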

3 Overview of UNSW-NB15 dataset

We now present key information about the UNSW-NB15 dataset, which will be used in our experiments in Sect. 4 to compare feature selection and feature extraction. The pre-processing of this dataset is also discussed.

Fig. 2: Proportions of 10 classes in the training dataset of UNSW-NB15

3.1 Key information of UNSW-NB15 dataset

The UNSW-NB15 dataset was first introduced in [15]; it offers a more realistic mix of modern normal activities and synthetic attack behaviors than previous NIDS datasets such as KDD99 [9] and NSLKDD [10]. A total of 2.5 million data records are included in the UNSW-NB15 dataset, with one normal class and nine attack classes: Analysis, Backdoor, DoS, Exploits, Fuzzers, Generic, Reconnaissance, Shellcode, and Worms. The original data comprises a total of 49 features organized into six groups: flow features, basic features, content features, time features, additional generated features, and labelled features [15]. However, in this work, we use the 10% cleaned dataset of UNSW-NB15, which includes a training set of 175,341 records and a test set of 82,332 records. There are a few minority classes with proportions of less than 2%, including Analysis, Backdoor, Shellcode, and Worms (see Figs. 2, 3). In the 10% dataset, some irrelevant features were removed, such as srcip (source IP address), sport (source port number), dstip (destination IP address), and dsport (destination port number). Therefore, the number of features was reduced to 45, including 41 numerical features and 4 nominal features.

Fig. 3: Proportions of 10 classes in the testing dataset of UNSW-NB15

3.2 Pre-processing dataset

As mentioned above, the 10% dataset of UNSW-NB15 has 45 features, including 41 numerical features and 4 nominal features. We remove the id feature among the numerical features, since it does not affect the detection performance. The attack_cat nominal feature, which contains the names of the attack categories, is also removed. Thus, three useful nominal features remain, namely, proto, service, and state. In addition, null values appearing in the service feature are treated as an 'other' type of service.

Table 1 An example of one-hot encoding for a nominal feature

One-hot encoding is used for transforming the nominal features, i.e., proto, service, and state, into numerical values. For example, assume that the proto feature takes three different values, namely, A, B, and C; its one-hot encoding then results in three numerical features, namely, proto_A, proto_B, proto_C, whose values are 0 or 1, as illustrated in Table 1. As a result, after pre-processing, the number of features increases from 45 in \({\textbf{Z}}\) to approximately 200 in \({\textbf{X}}\) (see Fig. 1), many of which are not really helpful for classifying attacks. Therefore, it is necessary to reduce such a large number of features to a few of the most important ones, which reduces the complexity of the machine learning models in the classification phase. Finally, we note that when feature extraction is used, we normalize the input features with the min-max normalization method [39] to improve the classification accuracy, while we do not use data normalization for feature selection, since it does not improve the performance.
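For concreteness, a minimal pandas/scikit-learn sketch of these pre-processing steps is given below; it assumes the 10% dataset has been loaded into a DataFrame df with the column names above, and that the target column has already been separated out.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = df.drop(columns=["id", "attack_cat"])     # remove unhelpful features
df["service"] = df["service"].fillna("other")  # treat null service as 'other'

# One-hot encode the three remaining nominal features (cf. Table 1).
X = pd.get_dummies(df, columns=["proto", "service", "state"])

# Min-max normalization to [0, 1]: applied for feature extraction only.
scaler = MinMaxScaler().fit(X)   # fit on the training split only
X_scaled = scaler.transform(X)
```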

4 Experimental results and discussion

We now present extensive experimental results investigating the performance of the NIDS using the feature selection and feature extraction methods described in Sect. 2, in combination with a range of machine learning-based classification models. In particular, the performance metrics used for comparison include recall (R), precision (P), F1-score, training time and inference time, which are explained in detail in Sect. 4.1. Both binary and multiclass classifications are considered. We also investigate the accuracy for each attack class to provide insight into the behaviors of the different detection methods. Last but not least, based on our extensive comparison between feature selection and feature extraction, we provide a helpful guideline on how to choose an appropriate detection technique for each specific scenario.

4.1 Implementation setting

4.1.1 Computer configuration

The configuration of our computer, its operating system, and the Python packages used for implementing the intrusion detection algorithms in this work are detailed in Table 2.

Table 2 Hardware and environment specification
Table 3 Threshold setting and features selected
Table 4 Feature Selection versus Feature Extraction: 4 selected/extracted features and binary classification
Table 5 Feature Selection versus Feature Extraction: 8 selected/extracted features and binary classification

4.1.2 Evaluation metrics

We consider the following performance metrics: precision, recall and F1-score, as well as training time and inference time. In particular, the F1-score is calculated from precision and recall as follows:

$$\text {F1-score}=2\times \frac{\text {precision}\times \text {recall}}{\text {precision}+\text {recall}},$$
(3)

which is regarded as a harmonic mean of precision and recall.
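In practice, these metrics can be computed directly with scikit-learn, as sketched below; the 'weighted' averaging over classes is our assumption for the multiclass case, since the averaging mode is not stated here.

```python
from sklearn.metrics import precision_recall_fscore_support

# Precision, recall and F1-score of Eq. (3), averaged over classes.
precision, recall, f1, _ = precision_recall_fscore_support(
    y_test, y_pred, average="weighted", zero_division=0
)
```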

As shown in Fig. 1, the two feature reduction methods considered in this work share the same data pre-processing step, so we do not take the time required for this step into account when estimating their training and inference times. Specifically, the training time consists of the training time of the classification model plus the time consumed by feature reduction in the training phase (FR_train), as follows:

$$\text {Training time} = \text {time}_{train} + \text {time}_{FR\_train}.$$
(4)

Meanwhile, the inference time consists of the prediction time of machine learning classifiers and the time duration required for feature reduction in the testing phase, given by

$$\text {Inference time} = \text {time}_{{predict}} + \text {time}_{FR\_test}.$$
(5)
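A simple way to measure both quantities is sketched below; reduce_fit_transform and reduce_transform are hypothetical placeholders for the feature selection or PCA steps of Sect. 2.

```python
import time

# Training time, Eq. (4): feature reduction plus classifier fitting.
t0 = time.perf_counter()
U_train = reduce_fit_transform(X_train)   # time_FR_train
model.fit(U_train, y_train)               # time_train
training_time_s = time.perf_counter() - t0

# Inference time, Eq. (5): reported per sample in microseconds.
t0 = time.perf_counter()
U_test = reduce_transform(X_test)         # time_FR_test
y_pred = model.predict(U_test)            # time_predict
inference_time_us = (time.perf_counter() - t0) / len(X_test) * 1e6
```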

4.1.3 Classification models

Table 6 Feature Selection versus Feature Extraction: 16 selected/extracted features and binary classification

We use five machine learning models for both the binary and multiclass classification tasks, all available in the Python Scikit-learn library, namely, Decision Tree, Random Forest (max_depth = 5), K-nearest Neighbors (n_neighbors = 5), Multi-layer Perceptron (MLP) (max_iter = 100, hidden_layer_sizes = 200), and Bernoulli Naive Bayes. Additionally, for better insight into feature selection, we list the 4, 8 and 16 selected features in Table 3, together with the corresponding thresholds on the average correlation used to obtain those numbers of selected features.
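For reproducibility, the five classifiers can be instantiated as below; any hyperparameter not listed above is assumed to keep its scikit-learn default, and hidden_layer_sizes is passed as a one-hidden-layer tuple.

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.naive_bayes import BernoulliNB

# The five classifiers used for both binary and multiclass tasks.
models = {
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(max_depth=5),
    "KNeighbors": KNeighborsClassifier(n_neighbors=5),
    "MLP": MLPClassifier(max_iter=100, hidden_layer_sizes=(200,)),
    "Naive Bayes": BernoulliNB(),
}
```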

4.2 Binary classification

We first investigate the detection performance and runtime of the feature selection and feature extraction methods for binary classification in Tables 4, 5 and 6, for 4, 8 and 16 selected/extracted features, respectively. In these tables, the best values (i.e., the maximum precision, recall and F1-score, and the minimum training and inference times in each column) are highlighted in bold, while the best values across both feature selection and feature extraction are highlighted in italics. The training time is measured in seconds (s), while the inference time per data sample is measured in microseconds (\(\upmu\)s).

Table 7 Accuracy comparison for each class between feature selection and feature extraction using binary classification
Table 8 Feature Selection versus Feature Extraction: 4 selected/extracted features and multiclass classification
Table 9 Feature Selection versus Feature Extraction: 8 selected/extracted features and multiclass classification

In terms of detection performance, Tables 4, 5 and 6 show that when the number of reduced features (i.e., extracted or selected) K increases, the detection performance of feature extraction generally improves, while that of feature selection does not improve when K increases from 8 to 16. In fact, the precision, recall and F1-score of feature selection even slightly degrade from Table 5 to Table 6. This phenomenon is understandable, since a larger number of selected features is more likely to include noisy or unimportant features, which deteriorate the detection performance. Moreover, comparing the two feature reduction methods, we find that when the number of reduced features is small, i.e., \(K=4\), the detection performance of feature extraction is much better than that of feature selection. For instance, in Table 4, the highest F1-score of feature extraction is 85.42% with the KNeighbors classifier, while the best F1-score of feature selection is only 81.94%, achieved with the Decision Tree classifier. However, for larger K such as 8 and 16 in Tables 5 and 6, feature selection achieves better accuracy than its extraction counterpart, especially when using Decision Tree for classification. For example, when Decision Tree is employed in Table 5 to achieve the lowest inference time, the F1-score of feature selection is 87.47%, higher than that of feature extraction at 85.69%. Tables 4, 5 and 6 also show that with feature selection, the Decision Tree classifier always provides the best precision, recall and F1-score. By contrast, feature extraction favors the KNeighbors classifier when K is small, i.e., 4 or 8, while Decision Tree is its best classifier only when K becomes larger, such as \(K=16\).

In terms of runtime performance, Tables 4, 5 and 6 demonstrate that both the training time and the inference time of feature selection are lower than those of feature extraction. This is because feature extraction requires additional computation to compress the high-dimensional data into low-dimensional data, as explained in Sect. 2, whereas feature selection requires almost no computation, simply picking K out of D features. More particularly, in Table 5, the best inference time of feature selection is 0.11 \(\upmu\)s, which is 36 times lower than that of feature extraction at 3.95 \(\upmu\)s, where the Decision Tree classifier is the best choice for both feature reduction methods for minimizing the inference time. Again, Decision Tree is one of the best classifiers for minimizing both training and inference times, along with the Naive Bayes classifier, which however does not achieve good accuracy.

Finally, in order to better understand the attack detection performance of feature selection and feature extraction, in Table 7 we compare their accuracy for each class in binary classification, namely, Normal and Abnormal. Similar to Tables 4, 5 and 6, we consider the number of reduced features K being 4, 8 and 16. Based on the results obtained from these three tables, we only include the classifiers that offer the highest F1-scores, namely, MLP and KNeighbors for feature extraction and Decision Tree for feature selection. Herein, the highest accuracy for each class with respect to K is highlighted in bold, while the highest values across both feature selection and feature extraction are highlighted in italics. It is worth noting from this table that for both feature reduction methods, while the accuracy of detecting the Normal class steadily improves when increasing K, that of detecting the Abnormal class gradually degrades. This interestingly indicates that in order to detect more attacks, we should select a small K rather than a large one. In addition, Table 7 shows that for both feature reduction methods, the accuracy for the Abnormal class is much higher than that for the Normal class. Observing the average accuracy in this table, we find that the accuracy of feature extraction is less sensitive to varying K than that of feature selection, which varies significantly with respect to K.

Table 10 Feature Selection versus Feature Extraction: 16 selected/extracted features and multiclass classification
Table 11 Accuracy comparison for each class between feature selection and feature extraction using multiclass classification
Table 12 Accuracy comparison for each class between feature selection and feature extraction using multiclass classification and the same Decision Tree classifier
Table 13 Accuracy comparison for each class between feature selection and feature extraction using multiclass classification and the same MLP classifier
Table 14 A summary of comparison between feature extraction and feature selection

4.3 Multiclass classification

We compare both the detection performance and runtime of feature selection and feature extraction in Tables 8, 9, and 10 for 4, 8, and 16 selected/extracted features, respectively, when multiclass classification is considered. Here, we still employ the same five machine learning models as in binary classification. As shown in these three tables, similar to the binary case, the precision, recall and F1-score of both methods generally improve when increasing the number of reduced features K. For example, the highest F1-scores of feature extraction are 74.11%, 75.39%, and 75.52%, while those of feature selection are 65.43%, 78.36% and 77.64%, for K = 4, 8, and 16, respectively. As such, feature extraction outperforms its counterpart when K is small, such as \(K=4\); however, this is no longer true when K gets larger, such as \(K=8\) and 16, where feature selection performs much better than feature extraction. Again, akin to the binary classification, Tables 9 and 10 show that the detection performance of feature selection degrades when K increases from 8 to 16, mostly due to the impact of noisy or irrelevant features when more features are selected.

Besides, unlike the binary case, where KNeighbors is the best classifier for feature extraction when K is small, such as 4 and 8, with multiclass classification, MLP now provides the best detection performance for feature extraction for all values of K, as shown in Tables 8, 9, and 10. Meanwhile, feature selection still favors the Decision Tree classifier for achieving the highest detection performance, similar to the binary classification analyzed in the previous subsection, while MLP does not offer good detection performance for feature selection. Additionally, the Naive Bayes classifier achieves the worst accuracy for both feature reduction methods.

With regard to the runtime comparison, again, Tables 8, 9, and 10 demonstrate that the training and inference times of feature selection are significantly lower than those of feature extraction. For example, using the same Decision Tree model for achieving the lowest runtime in Table 9, where \(K=8\), the inference time of feature selection is 0.19 \(\upmu\)s, which is 26 times lower than that of feature extraction at 5.04 \(\upmu\)s. Similarly, this table shows that the training time of feature selection is about half that of its extraction counterpart. In addition, the Decision Tree model provides the lowest inference time for both feature reduction methods, while the neural network-based MLP classifier exhibits the highest inference and training times for both.

Finally, we compare the accuracy for detecting each attack type (including 9 attack classes and 1 normal class, as described in Sect. 3) between feature selection and feature extraction in Table 11, where K is 4, 8, and 16. Herein, we employ the MLP and Decision Tree classifiers for feature extraction and feature selection, respectively, in order to achieve the best detection performance, as analyzed in the previous discussions. It is observed from Table 11 that feature selection performs better than feature extraction for most classes, except for the Exploits and Fuzzers classes. This table also shows that both methods achieve higher accuracy for the Exploits, Generic, Normal and Reconnaissance classes than for the remaining ones. Additionally, similar to the binary classification discussed in Table 7, the multiclass classification accuracy of feature extraction is less sensitive to the number of reduced features K than that of its selection counterpart. More importantly, feature extraction with MLP is unable to correctly detect any samples of Analysis and Backdoor, for all three values of K. By contrast, feature selection with the Decision Tree classifier is capable of correctly detecting samples from all classes. We found that this is mainly due to the machine learning classifier rather than the feature reduction method. In order to clarify this issue, we compare the accuracy for each class between the two feature reduction methods using the same Decision Tree and MLP classifiers in Tables 12 and 13, respectively. Table 13 shows that using the same MLP, feature selection, like feature extraction, is unable to correctly detect any samples of Analysis and Backdoor. Observing these two tables, we find that if the same classifier is employed, feature extraction tends to detect more diverse attack types than feature selection. This is because feature extraction extracts key information from all available features, instead of relying solely on a subset of selected features as in the feature selection approach. In other words, feature selection tends to detect only those attack types that are highly correlated with the features it selects.

In summary, considering both binary and multiclass classification for the NIDS, the feature selection method not only provides better detection performance but also requires lower training and inference times than its feature extraction counterpart, especially when the number of reduced features K is sufficiently large. However, the feature extraction method is much more reliable than its selection counterpart, particularly when K is very small, such as \(K=4\). Additionally, among the five considered classifiers, while Decision Tree is the best classifier for improving the accuracy of feature selection, the neural network-based MLP is the best one for feature extraction. Last but not least, feature extraction is less sensitive to changing the number of reduced features K than feature selection, and this holds true for both binary and multiclass classifications. For more details, we provide a comprehensive comparison between feature selection and feature extraction in intrusion detection systems in Table 14.

5 Conclusions

We have compared two typical feature reduction methods for machine learning-based intrusion detection, namely, feature selection and feature extraction, using the modern UNSW-NB15 dataset, where both binary and multiclass classifications were considered. Our extensive comparison showed that when the number of reduced features is large enough, such as 8 or 16, feature selection not only achieves higher detection accuracy, but also requires less training and inference time than feature extraction. However, when the number of reduced features is very small, such as 4 or less, feature extraction notably outperforms its selection counterpart. Besides, the detection performance of feature selection tends to degrade when the number of selected features becomes too large, while that of feature extraction steadily improves with more features. We also found that while MLP is the best classifier for feature extraction, Decision Tree is the best one for feature selection in achieving the highest attack detection accuracy. Finally, our accuracy analysis for each attack class demonstrated that feature extraction is not only less sensitive to varying the number of reduced features, but also capable of detecting more diverse attack types than feature selection. Both tend to detect more attacks, i.e., more Abnormal samples, when fewer features are selected or extracted. We believe that such insightful observations about the performance of the two feature reduction methods provide a helpful guideline for choosing a suitable intrusion detection method for each specific scenario. Finally, note that our study evaluated the effectiveness of feature reduction methods only on the UNSW-NB15 dataset. In the future, we intend to explore whether our observations on UNSW-NB15 apply to other intrusion detection datasets, such as NSL-KDD, KDD99, CICIDS2017, and DARPA1998. We also plan to thoroughly investigate the performance of various deep learning classification models for NIDS, and compare them with existing machine learning models.