1 Introduction

In modern society, network-based services are gaining more and more importance. As technologies such as the IoT, big data, and cloud computing advance, the volume of network traffic is also increasing rapidly. As traffic grows, keeping attack signatures up to date becomes more difficult, time-consuming, and tedious. Hence, as a result of widespread internet use and rapid traffic growth, network security has become an emerging field of research. In this field, researchers try to thwart attackers and intruders who constantly look for flaws in a network or system in order to gain illegal access.

Many solutions exist today to secure a network environment, such as antivirus software, firewalls, and IDS. Among them, the IDS is the most prominent mechanism for defending a network or an individual system.

The IDS also protects sensitive data travelling over a network from being intercepted by attackers or intruders. However, existing IDS are still not sufficiently scalable or flexible. In the years 2014–2016, Yahoo reported two data breaches that affected 500 million customer accounts and resulted in a loss of 350 million dollars [1]. Attacks aimed at stealing data are mounted with the help of intelligent and sophisticated algorithms. Several IDS have been developed over the last few years, but determining whether network traffic is normal or aberrant remains a difficult task. Therefore, several machine learning (ML) algorithms have been introduced and implemented to boost the intelligence of IDS and tackle these challenges [2]. To date, many studies have shown that ML-based IDS perform better in terms of execution and implementation [3]. However, only a few models combine low computational cost with a good detection rate.

Therefore, since network traffic grows at a rapid rate, extracting significant and relevant information from it is a difficult task that must be addressed properly, and the computational cost must be considered at the same time [4].

Moreover, whether the selected features actually improve the performance of the IDS also needs to be investigated.

Hence, one possible way to minimize the computational cost is to identify and select only the relevant features of the dataset that contribute to attack detection. Reducing the dataset dimension lowers the required training time and can simultaneously enhance the performance of the classifier in the IDS [5]. Another possible way to minimize the computational cost is to use only cost-effective algorithms that learn from the data cheaply [6], for example K-nearest neighbours (KNN). Therefore, to minimize the computational cost and increase the performance of the IDS, feature selection approaches (FSA) have been used in this research to remove non-relevant features, and various classifiers have been evaluated to identify the best-performing one. Table 1 lists all the abbreviations used in this paper.

Table 1 Nomenclature

1.1 Contributions

The major contributions of this research are as follows:

  • FSA such as principal component analysis (PCA) and recursive feature elimination (RFE) have been used to discover and select the significant features of the NSL-KDD and CICIDS2017 datasets.

  • A smaller and more appropriate subset of features has been identified, i.e. 13 and 8 key features from the NSL-KDD and CICIDS2017 datasets, respectively.

  • A comparative analysis of different FSA with various classifiers, such as naive Bayes (NB), decision tree (DT), and KNN, on the NSL-KDD dataset is presented.

  • Based on the best classifier and feature selection technique (FST) identified on the NSL-KDD dataset, the same combination has been applied to a real-time dataset, i.e. the combined CICIDS2017 dataset, and its performance has been evaluated in terms of F-measure, G-means, recall (sensitivity), precision, specificity, accuracy, testing time, and training time.

1.2 Organization

The related literature is reviewed in the next section. Section 3 discusses the proposed framework and its approach. The experiments and results are presented in Sect. 4. Finally, the conclusion and future work are given in the last section of this article.

2 Literature survey

Many studies over the previous few decades have used FSA to mitigate the data-dimension problem and improve the IDS detection rate (DR). However, as network traffic grows rapidly, the variety of possible threats increases with it, and researchers still struggle with the issues of dimensionality reduction and computational time [7]. As a result, various ML techniques for IDS combined with FSA have been proposed to date.

Mukkamala et al. [8] examined IDS using support vector machines (SVM) and neural networks (NN). The experiment found that SVM is highly flexible and suitable for huge datasets, whereas NN require a lot of learning time. In this context, in 2004, Fleuret et al. utilized the mutual information approach to choose relevant features; combined with a Bayes network, this approach is more effective than SVM, with an emphasis on overall processing time [9]. In 2005, Chebrolu et al. investigated IDS using a reverse classification tree and Bayes networks as feature selection methods. Using the proposed FST, they extracted 12 essential features capable of recognizing and detecting various attack types; unfortunately, the detection rate for User-to-Root (U2R) attacks was comparatively low [10]. In 2008, Chou et al. used correlation-based feature selection (CFS) and fast CFS as feature selection (FS) methods to handle high-dimensional data issues such as uncertainty, ambiguity, and redundancy in the collected data. To obtain the relevant features, their approach integrated C4.5 and NB. Based on their experiments, they demonstrated that the detection rate of the proposed fuzzy KNN technique improves upon previous classifiers [11].

In this context, Heba et al. used PCA as a reduction technique in combination with SVM to address the challenges of feature-dimension reduction and processing-cost minimization. The experiment demonstrated that IDS performance can be increased with less computational time [12]. Zainal et al. used a DT classifier with filter-based FSA, such as information gain (IG), Chi-square, and relief-F, to examine the KDDcup99 dataset. Out of a total of 41 attributes, the FS methods were used to retain only 5, 10, 15, and 20 relevant features. The results showed that IG as the FST outperformed the other approaches and improved the performance of the model [13]. Revathi et al. explored the effectiveness of various ML algorithms, including random forest (RF), KNN, and artificial neural networks (ANN). They identified 15 key features and built models using RF, KNN, and NN. The results demonstrated that RF performs well in comparison with the others, with an accuracy of 98.88%, whereas RF with all features (without the FST) achieved an accuracy of only 97.94% [14]. Using the NSL-KDD dataset, Kim et al. [15] proposed a hybrid approach for intrusion detection; their results demonstrate that the proposed approach was more effective in terms of detection rate and time complexity. According to [16], the suggested approach is insufficient for time reduction, so future research will concentrate on developing the decision tree approach. In 2015, Jo et al. suggested a DT model that outperforms the NN model, with a detection rate of 91.37% [17]. In the same year, Jebur et al. [18] proposed an approach that combines FS with a fuzzy-genetic IDS; the article uses fuzzy logic to produce rules and represents them with 15 features in order to reduce training time, although the complex computing approach generates less efficient rules than soft computing. Over the UNSW-NB dataset, Mishra et al. proposed a program semantic-aware intrusion detection scheme (Net-visor security) to identify attacks on virtual networks using ML methods such as DT, ANN, linear regression (LR), RF, random tree (RT), and others. Based on their experiments, RF + LR performed better than the others in terms of accuracy but has a higher false positive rate (FPR) than RT + LR [19]. To identify relevant attributes of the KDDcup99 dataset, Mousavi et al. presented an ant colony algorithm and a gradual feature removal method as FST; a model was then built on the selected features using an ensemble of decision trees (AdaBoost classifier). The proposed technique enhanced accuracy significantly and yielded a Matthews correlation coefficient of 0.91 [20]. To select important features of the NSL-KDD dataset, Sah et al. used RFE as the FS method and RF as the classifier; the proposed method enhanced the model's performance to some extent [21]. As a continuation, in 2021, Ankit et al. investigated the impact of FS methods on overall IDS performance. They used the NSL-KDD dataset to implement RFE, IG, and Chi-square as FST with various classifiers, such as NB, SVM, RF, KNN, logistic regression, and ANN. The results, presented as a comparative study, demonstrated that RFE achieved higher performance than the other FST.
However, the entire experiment was performed and tested only on the NSL-KDD dataset, which may not contain modern normal activities according to the literature [22]. Hence, future research should focus on developing and evaluating IDS models using NSL-KDD and more modern datasets such as CICIDS2017. Accordingly, in 2021, Gu et al. [23] suggested a practical IDS method on the NSL-KDD, CICIDS2017, Kyoto 2006+, and UNSW-NB15 datasets that classifies intrusion and regular instances using SVM and NB classifiers. The proposed technique found that embedding NB in SVM produced the maximum detection accuracy compared with a single SVM algorithm. The SVM results also showed that SVM requires a higher training time.

After studying the related works, it has been observed that the majority of researchers are interested in addressing the issue of large data dimensions and in finding relevant features for IDS. It is crucial to note that as data dimensions increase, the processing time of ML approaches grows as well [22]. Multiple FST have been employed in recent years, but these FST are still not flexible enough to extract meaningful information from huge amounts of traffic, and whether the selected features actually increase IDS performance must still be examined. To overcome these problems, effective FSA are required that reduce the features and yield a suitable feature subset. This helps not only in attack detection but also enhances the performance of the IDS and reduces the computational cost. Therefore, a decision-engine approach with a feature-reduction strategy should be established while maintaining lightweight characteristics.

3 Proposed framework

The trade-off between a low computing cost and a high detection rate, together with the high dimensionality of traffic, makes it difficult to develop effective and efficient IDS models. To reduce the computational cost and increase the detection rate of IDS, this study provides an adaptable and effective intrusion detection technique based on FSA. The major objective of the proposed framework is to provide a high detection rate at a minimal computing cost. The proposed framework involves five main steps, namely: dataset, data pre-processing, feature selection approaches, model building and evaluation, and finally the analysis and selection phase, as shown in Fig. 1. The individual steps are described in detail as follows:

Fig. 1

Proposed framework

3.1 Dataset

A standard dataset is essential for measuring IDS performance correctly, and it also allows several estimators or classifiers in an IDS to be compared. In this first step, the standard datasets (CICIDS2017 and NSL-KDD) are described in the subsections below, before the pre-processing steps. These datasets are widely used for IDS and contain a sufficient number of normal activities and attack samples.

3.1.1 NSL-KDD dataset

The NSL-KDD dataset was developed as a refined version of the KDD_1999 dataset [24]. It addresses drawbacks of the KDD_1999 dataset, for example redundant and duplicate records. In the literature, this dataset is used frequently for IDS evaluation and is already divided into two subsets, namely a training set and a testing set. The NSL-KDD dataset has 41 attributes, which are divided into four subgroups: basic features, content features, time-based traffic features, and host-based features. It also has five classes, one for normal traffic and the rest for the attack classes User-to-Root (U2R), Remote-to-Local (R2L), denial of service (DoS), and probing (PROBE), which are described in Table 2. The NSL-KDD dataset is used in this study for the following reasons:

  • Redundant records have been removed from the training and testing sets, which enables the classifier or estimator to produce unbiased results.

  • A sufficient number of objects is available in both the training and testing sets, allowing the experiment to be executed on the entire dataset without having to select small portions at random.

  • It also offers numerous characteristics such as harmful scenarios, realistic network configuration, full packet capture, labelled observations, etc.

Table 2 Attacks and normal classes of NSL_KDD dataset with example

The KDDTrain+ and KDDTest+ files, which contain 125,973 and 22,544 objects, respectively, have been used in this research work; Table 3 lists the different attack-type labels in the training and testing sets.

Table 3 Number of the objects and label distribution of different types of attacks in training and testing set

3.1.2 Combined CICIDS 2017 dataset

The Canadian Institute for Cybersecurity [25] developed the CICIDS2017 dataset for IDS. According to a McAfee report [26], the CICIDS2017 dataset contains a variety of attacks, categorized as Web Attack, Infiltration, DoS, Brute Force, Port Scan, Distributed DoS (DDoS), Botnet attacks, etc., and is distributed across 8 files. The CICIDS2017 dataset has 79 columns: 78 traffic features plus one column representing the label or class.

Researchers have identified a few flaws in the CICIDS2017 dataset: it is easy to see that the dataset is huge, spread over 8 files captured across 5 days, and it contains many duplicate records that may be irrelevant for the IDS training phase. Several possible solutions have been introduced in that context [27]. The dataset also has an imbalanced class distribution [28], which can mislead estimators and bias them towards the majority class. Some of the shortcomings (limitations) of the dataset are as follows:

  • Scattered presence As the dataset is divided into 8 different files, working on each individual file is a monotonous task.

  • An enormous volume of data After integrating all eight files, the resulting set becomes very large, and working on it is tedious because loading and processing the data takes more time.

  • Missing values The dataset has many missing values that have to be removed before working on it.

    An effective IDS model should be capable of detecting any type of attack. Therefore, in order to design such an IDS model, the data of all files of the CICIDS2017 dataset are collected and merged into a single dataset. As a result, the single dataset contains 3,119,345 objects in total, of which 288,602 objects with missing class labels are removed. The combined dataset is thus left with 2,830,743 objects.

  • Dimension of combined CICIDS2017 dataset: (2,830,743, 79).

Merging all traffic files of CICIDS2017 into a single dataset solves the scattered-presence problem, and the missing values have been eliminated from the combined dataset. Because the combined CICIDS2017 dataset becomes enormously large after integrating all 8 files, a sample of 654,321 records has been picked for the experiments in this research. Table 4 shows the updated labelling of all attack traffic in the CICIDS2017 dataset. The dimensions of the training and testing sets are 523,456 and 130,865 records, respectively, as shown in Table 5. Table 6 describes the dimension of each attack class in the sample of the combined CICIDS2017 dataset used for implementation.
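As a minimal sketch of this data-assembly step (not the authors' exact code), the eight daily CSV files could be merged, cleaned of unlabelled rows, and sampled with pandas as follows; the folder path, label-column name, and random seed are assumptions made for illustration.

```python
# Sketch of assembling the combined CICIDS2017 dataset.
# Assumptions: CSVs live in ./CICIDS2017/, the class column is named "Label".
import glob
import pandas as pd

files = sorted(glob.glob("CICIDS2017/*.csv"))            # the 8 daily capture files
combined = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)

label_col = "Label"                                      # assumed; some releases prefix it with a space
combined = combined.dropna(subset=[label_col])           # drop objects with missing class labels

sample = combined.sample(n=654_321, random_state=42)     # working sample quoted above
print(combined.shape, sample.shape)
```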

Table 4 All possible types of attacks and normal traffic with new labels in the CICIDS2017 dataset
Table 5 Record of different types of Attack and normal traffic in training and testing set of sample of combined CICIDS2017 dataset
Table 6 Dimension of each attacks class (including normal objects) in the dataset used for the implementation

3.2 Data pre-processing

The dataset must be pre-processed before the models and methods can be verified. Several operations are performed in this phase, including the replacement of noisy values such as infinity or null symbols with means or zeros, feature transformation, normalization, and splitting. Both datasets require pre-processing.

3.2.1 One-hot encoding

One-hot encoding is used to transform non-numerical data into binary vectors. As stated before, the CICIDS2017 dataset has one column defining the label or class and 78 regular attributes, among which Fwd Header Length, flow packets, and flow bytes always carry the same entries. As a result, the flow packets and flow bytes features are eliminated from the combined CICIDS2017 dataset, leaving 76 attributes plus one label column for analysis. Because all of these attributes are numerical, no data transformation is required. On the other hand, the NSL-KDD dataset has both non-categorical (numeric) and categorical (non-numeric) features, so data transformation is required. Categorical features of the NSL-KDD dataset such as flag, service, and protocol type contain symbolic entries that are transformed into numerical values using LabelEncoder; for example, the protocol-type feature contains three categories (UDP, TCP, and ICMP), which are mapped to the numeric values 1, 2, and 3. These numerical values are then represented as binary vectors for training and testing using one-hot encoding.
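A minimal sketch of this transformation (not the authors' exact code) is given below for the protocol-type column; note that scikit-learn's LabelEncoder assigns 0-based codes rather than the 1–3 mapping quoted above.

```python
# Label-encode the symbolic protocol_type values, then one-hot encode them.
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

df = pd.DataFrame({"protocol_type": ["tcp", "udp", "icmp", "tcp"]})

codes = LabelEncoder().fit_transform(df["protocol_type"])     # e.g. icmp=0, tcp=1, udp=2
onehot = OneHotEncoder().fit_transform(codes.reshape(-1, 1)).toarray()
print(onehot)                                                 # one binary vector per record
```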

3.2.2 Splitting the datasets

The combined CICIDS2017 dataset has been split into six (6) different parts, one for each attack category, whereas the NSL-KDD dataset has been split into four (4) different parts based on the attack types U2R, R2L, probe, and DoS, so that the models can be trained and tested correctly for all types of attacks.
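A hedged sketch of this per-category partitioning is shown below; the frame name, column name, and category labels are illustrative placeholders rather than the exact labels used in the datasets.

```python
# Build one sub-dataset per attack family, each paired with normal traffic.
NSL_KDD_CATEGORIES = ["dos", "probe", "r2l", "u2r"]      # illustrative label names

def per_category_parts(df, categories, label_col="label", normal="normal"):
    parts = {}
    for attack in categories:
        mask = df[label_col].isin([normal, attack])      # keep this family plus normal traffic
        parts[attack] = df[mask].copy()
    return parts

# parts = per_category_parts(nsl_train_df, NSL_KDD_CATEGORIES)
```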

3.2.3 Feature normalization

The next operation after pre-processing is feature normalization using the standardization formula given in Eq. (1) [29], where Z represents the Z-score. Feature normalization brings all attributes onto the same scale and prevents features with large numeric values from receiving more importance in classification algorithms; as a result, the classifier assigns the same weight to every feature. In addition, the linear transformation given in Eq. (2) is used to map each feature's values into the range (0–1) [30].

$$ Z = ( B - \mu )/ \sigma $$
(1)

In Eq. (1), the mean value (µ) is subtracted from the feature value (represented by ‘B’), and the result is divided by the standard deviation (σ). In Eq. (2), min and max stand for the minimum and maximum values of the feature, respectively.

$$ B_{{{\text{normalization}}}} = (B - min(B)) / (max(B) - min(B)) $$
(2)
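Both normalizations map directly onto scikit-learn's scalers. The sketch below assumes plain NumPy feature matrices (the values are made up); fitting on the training split and reusing the learned statistics on the test split is the usual way to avoid information leakage.

```python
# Eq. (1): Z-score standardization; Eq. (2): min–max scaling into [0, 1].
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X_train = np.array([[2.0, 100.0], [4.0, 300.0], [6.0, 500.0]])
X_test  = np.array([[3.0, 200.0]])

z_scaler = StandardScaler().fit(X_train)        # learns mu and sigma per feature
X_train_z, X_test_z = z_scaler.transform(X_train), z_scaler.transform(X_test)

mm_scaler = MinMaxScaler().fit(X_train)         # learns min and max per feature
X_train_mm, X_test_mm = mm_scaler.transform(X_train), mm_scaler.transform(X_test)
```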

3.3 FSA

FSA are used to remove redundant, irrelevant, or unimportant data. Their main purpose is to obtain a subset or an optimal set of important features from the underlying features that can easily separate the given data into different classes or labels. FSA help in handling high-dimensional datasets and compute the importance of each feature, which supports data interpretation.

As stated in Sect. 3.2.1, the flow packets and flow bytes features have been eliminated from the combined CICIDS2017 dataset in our experiment, so 77 attributes remain to be analysed, while the NSL-KDD dataset keeps its 41 features. In this phase, PCA, univariate feature selection using the analysis of variance (ANOVA) F-test, and RFE are used as FSA to reduce the features and acquire an appropriate subset or optimal set of features from the original set. These approaches are described in detail below.

3.3.1 PCA

The PCA [31] method is similar to clustering in that it belongs to the category of unsupervised learning. PCA reduces the complexity of high-dimensional data while maintaining its trends and patterns by projecting the data onto fewer dimensions, which then act as summaries of the original features. It identifies patterns in the data without any reference to whether the samples come from different treatment groups or have phenotypic differences. PCA is primarily used to reduce the number of attributes of a dataset by transforming a large set of variables into a smaller one that still retains most of the information in the dataset.
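A short sketch of this step with scikit-learn is shown below, assuming standardized feature matrices X_train and X_test; the component count of 8 follows what is reported later for PCA on NSL-KDD, and everything else is illustrative.

```python
from sklearn.decomposition import PCA

pca = PCA(n_components=8)                    # keep the 8 leading principal components
X_train_pca = pca.fit_transform(X_train)     # learn the components on the training data
X_test_pca = pca.transform(X_test)           # project the test data onto the same axes

# Fraction of the original variance retained by the reduced representation.
print(pca.explained_variance_ratio_.sum())
```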

3.3.2 Univariate feature selection

In univariate feature selection [32] using the ANOVA F-test [33], each attribute is analysed independently to identify the strength of its relationship with the labels or classes. The best attributes are picked on the basis of a univariate statistical test. Here, ANOVA compares each attribute with the target class to determine whether a statistically significant connection exists between them; during this procedure, all other features are set aside while the test score for that feature is obtained. In the end, the scores of all features are compared and the top-scoring features are selected.
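A minimal sketch of this scoring step, assuming a feature matrix X_train and a label vector y_train, uses scikit-learn's f_classif, which returns one ANOVA F-score and p-value per feature:

```python
from sklearn.feature_selection import f_classif

f_scores, p_values = f_classif(X_train, y_train)

# Rank features by F-score; the top-scoring ones are retained.
ranking = sorted(enumerate(f_scores), key=lambda item: item[1], reverse=True)
print(ranking[:5])      # (feature index, F-score) of the five best features
```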

3.3.3 Percentile method

The Percentile method, or selectPercentile, of the Sklearn library [34] selects attributes that fall within a given percentile of the highest scores. The default scoring function in selectPercentile is the ANOVA F-test, which is applicable only to classification tasks.
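As a hedged illustration (the percentile value of 20 is an assumption, not the paper's setting), selectPercentile can be applied as follows:

```python
from sklearn.feature_selection import SelectPercentile, f_classif

selector = SelectPercentile(score_func=f_classif, percentile=20)   # keep the top 20% of features
X_train_sel = selector.fit_transform(X_train, y_train)
X_test_sel = selector.transform(X_test)

print(selector.get_support(indices=True))    # indices of the retained features
```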

3.3.4 RFE

After the Percentile method, RFE [35] is used as the FST to identify and pick the important features for classifying the network traffic. RFE selects and eliminates features on the basis of their ranks, removing one lowest-ranked attribute at a time. Its main purpose is to acquire the best-performing subset of features. The RFE [36] method evaluates the performance of an estimator or classifier through the following iterative elimination procedure (a code sketch follows the list):

  • Build a classification model on a candidate subset of features.

  • Compute the importance of the features to obtain their ranks.

  • Eliminate the lowest-ranked features based on their relevance.
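The sketch below illustrates this loop with scikit-learn's RFE; a decision tree is used as the ranking estimator and the target of 13 features matches the count reported for RFE on NSL-KDD, while the remaining settings are illustrative assumptions.

```python
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier

rfe = RFE(estimator=DecisionTreeClassifier(random_state=42),
          n_features_to_select=13,   # target subset size (as reported for NSL-KDD)
          step=1)                    # drop the lowest-ranked feature in each round
rfe.fit(X_train, y_train)

print(rfe.support_)    # boolean mask of the selected features
print(rfe.ranking_)    # rank 1 = selected; higher ranks were eliminated earlier
```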

3.4 Model building and evaluation

In this phase, DT, NB, and KNN classifiers are used to build models using both the reduced feature sets (obtained by the FSA) and all features of the NSL-KDD training set. On the NSL-KDD testing set, the recall, precision, accuracy, and F-measure metrics are then calculated to determine the prediction quality of these models. During sampling, learning, and validation, 10-fold cross-validation is performed to measure the performance of the models so that every object contributes to the evaluation.

Subsequently, the best classifier is chosen based on its performance on the NSL-KDD dataset, and the same classifier is trained on the CICIDS2017 training set using both the reduced features (from the best FST) and all features. This model is then evaluated on the CICIDS2017 testing set. The working principles of these classifiers are described in the following subsections.
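A condensed sketch of this model-building step is given below, assuming the binary per-category labels from Sect. 3.2.2; the hyper-parameters are scikit-learn defaults (with k = 5 for KNN) and are illustrative rather than the authors' exact settings.

```python
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

classifiers = {
    "DT": DecisionTreeClassifier(random_state=42),
    "NB": GaussianNB(),
    "KNN": KNeighborsClassifier(n_neighbors=5),
}

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)   # 10-fold cross-validation
for name, clf in classifiers.items():
    scores = cross_val_score(clf, X_train, y_train, cv=cv, scoring="f1")
    print(f"{name}: mean F-measure = {scores.mean():.4f}")
```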

3.4.1 DT

The DT classifier [37] is tree-structured and comprises nodes and edges. Each node denotes the problem category that needs to be classified, whereas each edge represents a decision taken based on the evaluated data. Such trees can be either regression trees or classification trees. The DT classifier can be viewed as a predictive ML model that maps dataset attributes to their corresponding values; each branch represents the possible values of a given category. The tree nodes are chosen using the estimated entropy of the dataset attributes, and the attribute yielding the highest information gain (the largest entropy reduction) becomes the root node. Widely adopted DT models include classification and regression trees (CART), C4.5, and Iterative Dichotomiser 3 (ID3).

The DT classifier has several advantages: it is simple and easy to understand, with short explanations; inferences can be derived on the basis of different probability estimates and costs and used to obtain detailed outputs; and it can flexibly be combined with other classification models to obtain correct results. However, it also has limitations: when the data items are very similar, its accuracy is relatively low, and it is not adaptive, meaning that minor modifications in the data fed to the estimator may lead to a highly unstable decision tree.

3.4.2 NB

NB is a supervised learning method based on the Bayes theorem. The NB method is premised on the assumption that the presence of one feature is independent of the other features of a class. The Bayes theorem is applied to calculate the posterior probability P(cl | y) from P(y | cl), P(y), and P(cl), as given in Eq. (3) [38]:

$$ {\text{P}}({\text{cl}}|{\text{y}}) = \frac{{{\text{P}}({\text{y}}|{\text{cl}}) \cdot {\text{P}}\left( {{\text{cl}}} \right)}}{{{\text{P}}\left( {\text{y}} \right)}} $$
(3)

where P(cl | y) is the posterior probability of the class (cl, the target) given the predictor (y, the features). The prior probabilities of the predictor and the class are denoted by P(y) and P(cl), respectively, and P(y | cl) is the likelihood, i.e. the probability of the predictor given the class.
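A tiny numerical illustration of Eq. (3) with Gaussian naive Bayes is sketched below; the feature values are made up purely for demonstration.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

X = np.array([[1.0], [1.2], [0.9], [5.0], [5.5], [4.8]])
y = np.array([0, 0, 0, 1, 1, 1])          # 0 = normal, 1 = attack

nb = GaussianNB().fit(X, y)
# predict_proba returns the posterior P(cl | y) for every class.
print(nb.predict_proba([[1.1], [5.2]]))
```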

The advantages of the NB algorithm are that it is extremely scalable and fast in classification, and it can be used for both binary and multi-class classification problems. Because it assumes that the features are mutually independent, it cannot model any relationship among the features of a class, and its implementation becomes more complex with large datasets.

3.4.3 KNN

The KNN algorithm is one of the most basic supervised ML classifiers. It assumes that new data are similar to existing data and assigns new data to the category they most closely resemble. It stores all existing data and categorizes a new data point based on its similarity to them; whenever new data arrive, they can thus be effortlessly assigned to the pertinent group. KNN can be used for regression as well as classification, but it is most often used for classification problems [39]. It is a non-parametric procedure, which means that it does not make any assumption about the underlying data.

3.5 Analysis and selection phase

After developing the models, the last phase evaluates them on the NSL-KDD dataset using FSA such as PCA and RFE with the classifiers KNN, DT, and NB. Models built with all features are also considered for comparison. The performance of these models is measured using recall, precision, F-measure, and accuracy to find the best classifier and the most suitable FST on the NSL-KDD dataset. The best classifier and best FST are then also applied to the CICIDS2017 dataset for model building and analysis.

3.5.1 Evaluation metrics

Evaluation metrics such as F-measure, recall, precision, accuracy, training time, testing time, specificity, and G-means (the latter being particularly appropriate for imbalanced datasets) are used to analyse and measure the models' performance. The basic quantities used to compute these metrics are defined as follows:

  • True positive (TP) TP denotes the number of normal objects which are successfully categorized by the model as normal.

  • False positive (FP) FP indicates the number of normal samples that are wrongly classified by the model as attacks.

  • True negative (TN) TN specifies the number of attack samples that are correctly classified (predicted) by the model as attacks.

  • False negative (FN) FN denotes the number of attack samples that are mistakenly classified (predicted) by the model as normal.

3.5.1.1 Accuracy (A)

Accuracy [40] measures how accurately the IDS model predicts the traffic as normal or attack. It is given by the formula

$$ A = \frac{{{\text{TN}} + {\text{TP}}}}{{{\text{TN}} + {\text{TP}} + {\text{FP}} + {\text{FN}}}} $$
3.5.1.2 G-means (G_m)

G_m [41] is derived from specificity and sensitivity. It is mainly appropriate for imbalanced datasets. It is computed as

$$ G\_{\text{m}} = \sqrt {\left( {{\text{specificity}} \times {\text{sensitivity}}} \right)} $$
3.5.1.3 Specificity (S)

Specificity is another name for the true negative rate. It is represented by the given formula

$$ S = \frac{{{\text{TN}}}}{{{\text{FP}} + {\text{TN}}}} $$
3.5.1.4 Recall (sensitivity)

True positive rate (TPR) and detection rate are other terms for recall. It is calculated using the formula

$$ {\text{Recall}}\left( R \right) = \frac{{{\text{TP}}}}{{{\text{FN}} + {\text{TP}}}} $$
3.5.1.5 Precision (P)

Precision is the ratio of true positives (cases correctly predicted by the model as positive) to the total number of predicted positive cases [41]. It is calculated by the formula

$$ P = \frac{{{\text{TP}}}}{{{\text{FP}} + {\text{TP}}}} $$
3.5.1.6 F-measure (F_m)

F_m is the weighted harmonic average of recall and precision. It is mainly suitable for imbalanced datasets and can be calculated by the formula [42]

$$ F\_{\text{m}} = 2*(R* P)/(R + P) $$
3.5.1.7 Training time (T1) in seconds (s)

T1 describes the amount of time that a technique takes to train and build the model on the entire training set of a dataset. It is given by the formula [43]

$$ T_{1} = {\text{End}}_{{{\text{training}}\_{\text{time}} }} - {\text{Start}}_{{{\text{training}}\_{\text{time}} }} . $$
3.5.1.8 Testing time (T2) in seconds (s)

T2 describes the amount of time that a technique utilizes to predict the whole testing set of a dataset as either attack or normal. It is computed as [43]

$$ T_{2} = {\text{End}}_{{{\text{testing}}\_{\text{time}} }} - {\text{Start}}_{{{\text{testing}}\_{\text{time}} }} . $$
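The sketch below shows how these quantities can be obtained for one binary (attack-vs-normal) part of the data; it treats label 1 as the positive class, so the counts should be read against whichever convention (normal or attack as positive) is adopted above, and the helper name is illustrative.

```python
import time
from math import sqrt
from sklearn.metrics import confusion_matrix

def evaluate(model, X_train, y_train, X_test, y_test):
    """Fit, predict, and report the metrics of Sect. 3.5.1 for binary labels {0, 1}."""
    start = time.time()
    model.fit(X_train, y_train)
    t1 = time.time() - start                    # training time T1 in seconds

    start = time.time()
    y_pred = model.predict(X_test)
    t2 = time.time() - start                    # testing time T2 in seconds

    tn, fp, fn, tp = confusion_matrix(y_test, y_pred, labels=[0, 1]).ravel()
    accuracy    = (tp + tn) / (tp + tn + fp + fn)
    precision   = tp / (tp + fp)
    recall      = tp / (tp + fn)                # sensitivity / detection rate
    specificity = tn / (tn + fp)
    f_measure   = 2 * precision * recall / (precision + recall)
    g_means     = sqrt(specificity * recall)
    return {"A": accuracy, "P": precision, "R": recall, "S": specificity,
            "F_m": f_measure, "G_m": g_means, "T1": t1, "T2": t2}
```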

4 Experimental setup and results analysis

For the experiments, the Kaggle platform has been used in this research; it is a cloud-based online resource where Python programs can be run using 'Sklearn' (an ML library implemented in Python) [44]. Kaggle offers a maximum memory of 16 gigabytes (GB) and a storage capacity of 4.9 GB, allowing users to upload data and explore data-analysis models. The entire experiment has been performed under Windows 10 with a quad-core 3.6 gigahertz (GHz) processor.

4.1 Results and discussion

In this research, the CICIDS2017 and NSL-KDD datasets have been used to investigate the effectiveness of the proposed models. Initially, the RFE- and PCA-based FSA have been employed on the datasets to remove non-relevant features and find the important ones. These FSA have then been combined with several ML classifiers, namely DT, NB, and KNN, to enhance and measure the models' performance on the NSL-KDD dataset. The accuracy, recall, precision, and F-measure metrics stated in Sect. 3.5 have been used to analyse the experimental outcomes. Based on these outcomes, further metrics such as specificity, G-means, testing time, and training time have also been calculated to identify the best-performing classifier. Finally, the best-performing classifier with the selected FST has been examined on a real-time dataset (i.e. CICIDS2017).

4.1.1 Results analysis using NSL-KDD dataset

Since an IDS enables early detection of intrusions, feature extraction and selection is always a crucial and difficult task in network security, and it has a considerable impact on both model performance and computational complexity. The main goal of feature selection approaches is to represent a problem completely by picking a subset of important features from the whole dataset; as a result, working with fewer features may yield better outcomes. Therefore, in this experiment, RFE and PCA have been applied to the NSL-KDD dataset to obtain the best set of features.

In order to reduce the computational cost of the proposed models, RFE and PCA have identified the 13 and 8 most important features (rank-wise), respectively, of the NSL-KDD dataset for each classifier, as shown in Table 7. These approaches (PCA and RFE) provide an appropriate set of features that are passed to the classifiers (DT, NB, and KNN) for training and testing in order to construct IDS models for evaluation and comparison purposes. Tables 9 and 10 present the performance of the classifiers with the features selected by RFE and PCA, respectively, on the NSL-KDD dataset, whereas Table 8 presents the performance of the classifiers with all features. The performance of the classifiers has been measured using four metrics, namely recall, precision, F-measure, and accuracy.

Table 7 Selected features of NSL-KDD dataset using RFE and PCA

Figures 2, 3, 4, and 5 illustrate the F-measure, recall, precision, and accuracy of the different algorithms (DT, KNN, and NB) using all features and using the features selected via RFE and PCA, respectively. The Y-axis of each graph depicts the value of the performance metric (F-measure, recall, precision, or accuracy) achieved for each category (Probe, DoS, U2R, R2L), while the X-axis lists the methods DT, NB, and KNN.

Fig. 2

Different classifiers (NB, DT, and, KNN) F-measure using selected features and all features for each category

Fig. 3

Different classifiers recall using selected features and all features for each category

Fig. 4

Different classifiers precision using selected features and all features for each category

Fig. 5

Different classifiers accuracy using selected features and all features for each category

After analysing the results (Tables 8, 9, 10, Figs. 2, 3, 4, 5) on the NSL-KDD dataset with and without FSA (i.e. using all features), it has been found that RFE with DT offers superior performance in terms of accuracy. Additionally, RFE with DT offers higher performance in terms of F-measure, recall, and precision for the DoS, Probe, and R2L attack types. However, KNN with all features provides better F-measure and precision for the U2R attack type, and NB with all features shows better recall for the U2R attack type.

Table 8 Performance evaluations for ML classifiers with all features using NSL-KDD dataset
Table 9 Performance evaluations for ML classifiers with selected features using RFE on NSL-KDD dataset
Table 10 Performance evaluations for ML classifiers with selected features using PCA on NSL-KDD dataset

Further study of the findings for NSL-KDD (Tables 8, 9, 10, Figs. 2, 3, 4, 5) shows that, among the FSA, RFE with DT performs better than PCA with DT, PCA with NB, and PCA with KNN in terms of F-measure, precision (with the exception of the U2R attack type for PCA with KNN), recall, and accuracy.

However, the KNN classifier performs well in terms of precision under a few specific conditions, specifically for the U2R attack type, and in one instance the NB classifier produces similar recall, again only for the U2R attack type. Hence, for the majority of the attack types in the NSL-KDD dataset, the DT classifier with RFE offers higher accuracy, F-measure, precision, and recall. Therefore, DT as the classifier and RFE as the FST are adopted in the proposed model for further investigation, and the additional metrics specificity, G-means, testing time, and training time are also calculated for DT with RFE. Table 11 presents the performance of the DT classifier with all features and with the features selected by RFE on the NSL-KDD dataset. Figure 6 illustrates the evaluation metrics of the DT classifier using the selected features (by RFE) and all features for each category, and Fig. 7 shows the corresponding training and testing times. Compared with DT using all features, the DT classifier with the selected features (by RFE) requires less training and testing time, and the U2R category requires the lowest training and testing time.

Table 11 Performance evaluations for DT classifier with all and selected features using RFE on NSL-KDD dataset
Fig. 6

Evaluation metrics of DT classifiers using selected features (by RFE) and all features for each category on the NSL-KDD dataset

Fig. 7

DT classifier’s training and testing time using selected and all features for each category on the NSL-KDD dataset

After analysing the results (Table 11, Figs. 6, 7) of DT with selected features (by RFE) and DT with all features on the NSL-KDD dataset, it is observed that the DT classifier with selected features (by RFE) yields better results in terms of average accuracy, total training time, total testing time, G-means, and specificity. The analysis also shows that DT with RFE improves the model's precision, G-means, accuracy (approximately identical for U2R), and specificity for each attack category while reducing the computational cost in terms of training and testing time. In terms of recall and F-measure, DT with RFE also offers better results for the DoS, R2L, and Probe attack categories; for the U2R category, its F-measure and recall show only slight changes.

4.1.2 Results analysis using combined CICIDS2017 dataset

After analysing the results on the NSL-KDD dataset, the best-performing classifier with the appropriate FST, that is DT + RFE, has been adopted and evaluated on a real-time dataset (i.e. CICIDS2017) to examine the model's sustainability on a newer real-time IDS dataset. Table 12 shows the 8 important features selected from the combined CICIDS2017 dataset using RFE. These features have then been passed to the DT classifier to build a model. Performance evaluations of the DT method with all features and with the features selected by RFE on the combined CICIDS2017 dataset are shown in Tables 13 and 14.

Table 12 Selected features from the combined CICIDS2017 dataset using RFE
Table 13 Performance evaluations for DT classifier with all features on CICIDS2017 combined dataset
Table 14 Performance evaluations for DT classifier with selected features (by RFE) on CICIDS2017 combined dataset

On the combined CICIDS2017 dataset, Fig. 8 shows the performance of the DT classifier with selected features (by RFE) and with all features for each category in terms of accuracy, precision, F-measure, recall, specificity, and G-means. Moreover, Fig. 9 displays the training and testing times of the DT classifier using the selected and all features for each category.

Fig. 8

Evaluation metrics of the DT classifier with selected features (by RFE) and all features for each category on the combined CICIDS2017 dataset

Fig. 9

DT classifier’s training and testing time using selected and all features for each category on combined CICIDS2017 dataset

After analysing the results (Tables 13, 14, Figs. 8, 9) of DT with selected features (by RFE) and DT with all features on the combined CICIDS2017 dataset, it is observed that the DT classifier with selected features (by RFE) yields better results in terms of average accuracy (approximately identical accuracy for Botnet), total training time, total testing time, G-means (with the exception of the Botnet attack type, where DT with all features is better), and specificity. Additionally, RFE with DT offers higher F-measure, recall, and precision for the Web Attack, Port Scan, DoS/DDoS, and Infiltration attack types.

4.1.3 Comparison

In this study, multiple ML techniques have been combined with various FSA to reduce the number of attributes in the datasets, which can help develop IDS at a lower cost but with improved performance. Tables 15 and 16 give a thorough comparison, on different datasets, between the proposed model and various ML classifiers that use FSA in IDS models to detect different sorts of attacks. Moreover, to compare the results of the proposed model with the others (e.g. [45, 47, 56, 59] and [61]), the average accuracy and the total training and testing times on NSL-KDD and CICIDS2017 are given in Tables 15 and 16, respectively. In the KNN classifier-based model [62], only 4 classes (Brute Force, Cross-site scripting (XSS), SQL injection, BENIGN) of CICIDS2017 were considered, with a reported total training time of 11.130 s. In contrast, the proposed model considers 6 classes, which cover almost all types of attacks present in the CICIDS2017 dataset, and its total training time is 4.42 s.

Table 15 A comparison of ML algorithms for the IDS model utilizing FSA on NSL-KDD
Table 16 A comparison of ML algorithms for the IDS model utilizing FSA on CICIDS2017

5 Conclusion and future work

This paper examines various classifiers with different FSA in order to construct an effective IDS model. The analysis shows that reducing the data dimension in IDS not only decreases processing cost but also enhances model performance. According to the results on the NSL-KDD dataset, RFE as the FST with DT as the classifier produces better recall, precision (except for the U2R attack category), accuracy, and F-measure than the other classifier/FSA combinations. Moreover, the chosen FST identified a smaller, more appropriate subset of features based on information gain and ranking techniques for the classifier: 13 significant features in the NSL-KDD dataset and 8 relevant features in the CICIDS2017 dataset. This helps to increase the model performance at a lower computational cost than a model with all features. The proposed model (RFE + DT) has been evaluated over the combined CICIDS2017 dataset in terms of F-measure, recall, specificity, precision, G-means, accuracy, testing time, and training time. To demonstrate its efficiency and effectiveness, the proposed model has been compared with other well-known models published in the literature. It has been found that using DT as the classification technique and RFE as the feature selection method reduces the computational cost and improves performance.

Future studies may focus on the application of various ML algorithms, such as unsupervised and supervised models, across various IDS-related datasets. The effectiveness of hybrid FST, which combine statistical approaches and meta-heuristics, in selecting features for attack detection will also be a subject of future studies, because it remains a relatively unexplored area.