1 Introduction

The intrusion detection system (IDS) has become a powerful tool that monitors malicious activities and triggers alerts to detect suspicious attacks. Intrusions make network traffic unpredictable because of its nonlinear behaviour [1, 2]. An IDS is required because the security principles of confidentiality, integrity and availability are compromised by intrusions and/or attacks such as spoofing, traffic analysis, cyber-attacks and other harmful vulnerabilities [3].

IDSs fall into two broad categories: signature-based and anomaly-based [4]. A signature-based system, also referred to as misuse-based detection, compares traffic against the signatures of known malicious activities and triggers an alert whenever a match is found. Hence, these systems can diagnose known attacks with a low false alarm rate [5]. Anomaly-based systems are able to counter zero-day attacks: they learn the normal behaviour of the system and trigger an alert whenever the observed behaviour deviates from this pattern. However, this approach suffers from a high false-positive rate (FPR) [6, 7]. IDSs can be further divided into host-based (HIDS) and network-based (NIDS) systems. A HIDS monitors individual hosts and raises alerts based on local system calls, log files, application logs and other host activities, whereas a NIDS monitors all the traffic passing through the network; whenever any activity matches a known attack, an alert is raised and sent to the administrator so that appropriate action can be taken.

As the number of features increases, the complexity of the system increases, and it becomes difficult for the IDS to examine large volumes of data. The selection of useful and critical features is therefore essential for the detection of intrusions in the field of information security. To build an effective and adequate IDS, significant features must be identified before pre-processing. Identifying valuable features is complicated because the dataset contains relevant, irrelevant and redundant features, which increases the computational complexity of intrusion analysis [8,9,10]. The fundamental principle of feature selection (FS) is to improve the quality and performance of the predictor. This study proposes an optimal selection of features to improve the performance of IDSs at low computational complexity, since only a simplified set of features has to be classified. Hence, FS is used to enhance the performance of classifiers and to discard irrelevant attributes before pre-processing. FS is performed on the NSL-KDD dataset [11], a modified version of the KDD Cup 99 dataset [12]; the proposed method achieves higher intrusion detection accuracy and also reduces the false alarm rate because a reduced number of features is used.

2 Description of NSL-KDD dataset

The NSL-KDD dataset is a modified variant of the KDD Cup 99 dataset. It has 41 attributes, and its records are partitioned into 5 classes: normal and 4 attack groups, which are described below. The 42nd attribute is the class attribute, which contains the group information for each record, i.e., whether the instance is positive (attack) or negative (normal).

DoS (Denial of Service) covers attacks in which the services of legitimate users are restricted. Examples: smurf, teardrop, SYN flooding and neptune.

U2R (User to Root) covers attacks in which the attacker gains control of the local machine by exploiting its vulnerabilities. Examples: rootkit, spy, buffer-overflow and SQL attacks.

R2L (Remote to Local) covers attacks in which the attacker tries to gain unauthorized access to a remote machine. Examples: warezmaster, imap, multihop and spy.

Probe covers attacks in which the attacker gathers information about the network through traffic analysis. Examples: port-scan, satan, ping-sweep and nmap.

The attributes of the dataset are further divided into four labels [13, 14].

Basic: These attributes are derived from individual transmission control protocol (TCP) connections. Nine attributes belong to this label; they are illustrated in table 1.

Table 1 Basic label attribute description of the NSL-KDD dataset.

Domain Knowledge: These attributes capture the content within a connection, derived using domain knowledge. Thirteen attributes belong to this label; they are explained in table 2.

Table 2 Domain Knowledge label attribute description of the NSL-KDD dataset.

Traffic: This group contains attributes that are computed over a 2-s time window. Nine attributes belong to this label; they are described in table 3.

Table 3 Traffic label attribute description of the NSL-KDD dataset.

Host: This group contains attributes designed to assess attacks that span intervals longer than 2 s. Ten attributes belong to this label; they are described in table 4.

Table 4 Host label attribute description of the NSL-KDD dataset.

The statistics of the records in the KDD Cup 99 and NSL-KDD datasets are presented in figures 1 and 2, respectively. The attacks in the training and testing sets are divided into the four attack types described above. The training set contains 22 attack types, and the testing set contains 17 additional attack types that do not appear in the training set. Figure 3 presents the frequency distribution of attacks in the NSL-KDD dataset.

Figure 1 KDD dataset distribution.

Figure 2 NSL-KDD dataset distribution.

Figure 3 Frequencies of attack distribution in NSL-KDD dataset.

3 Related work

Ganapathy et al [15] proposed an FS algorithm as well as a classifier using the Support Vector Machine (SVM). The paper also surveyed FS and classification strategies for intrusion detection, and soft computing techniques were used to highlight research challenges in intrusion detection. Ahmad and Amin [16] used the particle swarm optimization (PSO) algorithm for FS and PCA for feature transformation; a theoretical method was introduced for intrusion detection using an SVM classifier on the KDD Cup 99 dataset, and this work was further extended using a neural network on the NSL-KDD dataset. Franco et al [17] presented classification using a Self-Organized Map (SOM), with FS performed using the Fisher discriminant rate algorithm on 17 features of the dataset. Eesa et al [18] proposed a cuttlefish optimization algorithm for intrusion detection, combined with FS and a decision tree (DT) for classification; this method improves the true positive rate (TPR) and accuracy, and reduces the false alarm rate. Chebrolu et al [19] introduced FS using a classification and regression tree with a Bayesian network to improve the performance and detection accuracy of the IDS on the KDD Cup 99 dataset. Zhang et al [20] presented rough-set methods using a genetic algorithm (GA) [21] for deriving classification rules. Machine learning and data mining algorithms, including SOM, DT, K-Means and SVM, have been used for an extensive analysis of the NSL-KDD and KDD Cup 99 datasets [22]. Tsai et al [23] discussed several machine learning techniques used for the identification of intrusions, describing various classifiers, including single, multiple and ensemble classifiers. Modi and Patel [24] deployed an IDS in cloud computing using different classifiers (DT, Bayesian and Associative); the framework was designed to identify distributed attacks in both real-time and offline environments. Seth and Chandra [25] used the Binary Grey Wolf Optimization (BGWO) algorithm for key FS with a neural network classifier. Mazini et al [26] used the AdaBoost algorithm for improved detection rate (DR) and accuracy, with the Artificial Bee Colony (ABC) algorithm for FS. Kumar and Sharma [27] proposed an integrated framework that automatically detects software vulnerabilities, triggers the alarm, and classifies and mitigates them. Alzubi et al [28] used a modified BGWO algorithm with a feature reduction technique for the detection of attacks. Bharathy and Basha [29] used a Multiple Criteria Linear Programming (MCLP) model for classification and the PSO algorithm for parameter tuning to design a network intrusion detection system. Xue et al [30] suggested a self-adaptive PSO for dimensionality reduction over an enormous number of features; experiments were performed on 12 datasets to demonstrate the effectiveness of the algorithm. Bostani and Sheikhan [31] implemented a binary gravitational search algorithm combined with mutual information for FS to achieve better results than existing algorithms. In [32], SVM was used for feature reduction and a neural network was implemented to rank the importance of the features selected from the DARPA dataset. Xue et al [33] introduced self-adaptive differential evolution (SaDE) for FS and k Nearest Neighbour (k-NN) for evaluating different performance measures of an IDS using the KDD Cup 99 dataset in wireless sensor networks.
Wu et al [34] and Yu et al [35] proposed online feature selection (OFS) methods in the fields of data mining and machine learning. Recently, evolutionary computation (EC) techniques have been used for FS due to their global optimization capability; examples include the firefly algorithm [36, 37] and PSO [38,39,40].

4 Machine learning classifiers

We have used the following machine learning classifiers for the analysis of the training and testing sets of the NSL-KDD dataset.

SVM: Cortes and Vapnik [41] developed the SVM classifier for binary classification of data into two classes. It constructs two parallel hyperplanes and seeks to make the margin between them as large as possible; the basic principle is to maximize the separating margin between the negative and positive classes in order to derive the separating hyperplane. In SVM the input vector is mapped to a higher-dimensional feature space, and the optimal separating hyperplane is obtained in that feature space. The SVM has low generalization error, i.e., it does not suffer from overfitting to the training dataset [42]; a model that performs poorly on instances not encountered during training is said to have high generalization error, or to be overfitted. The penalty-factor parameter allows users to trade off the width of the decision boundary against the number of misclassified samples. SVM is an effective technique for solving regression and classification problems. Later, Zhang and Wang [43] developed a Computer-Aided Diagnosis (CAD) system for the classification of images into two classes. They used 126 samples of brain Magnetic Resonance (MR) images, of which 28 samples have Alzheimer's disease (AD) and the remaining samples are normal scans. A Displacement Field (DF) is employed to track the morphometry from normal brains to AD brains. Further, they used the SVM and its two variants, namely the Generalized Eigen-Value Proximal SVM (GEPSVM) and Twin SVM (TSVM), for the classification of images.

Naive Bayes (NB): In many cases it is difficult to express the probabilistic relationships among variables that have causal or statistical relations between them, because one variable may influence others during computation. A probabilistic model designed to exploit the causal or statistical relationships among the random variables of a problem is referred to as a Naive Bayesian network [44, 45]. The classifier assumes that the attributes are independent, and it performs well when combined with some attribute selection measures.

k Nearest Neighbour (k-NN): This is a non-parametric and highly transparent machine learning technique for the classification and regression of samples [46, 47]. It computes the approximate distance between points in the input space and assigns an unlabelled sample to the class of its k nearest neighbours. The parameter k represents the number of observations closest to the given observation in the testing or validation dataset. In this classifier, a new point is classified according to the majority vote of the nearest points in the training data. Euclidean distance is used as the distance metric to measure the similarity between two points.

DT: In a DT, a sample is classified through a sequence of decisions, where each decision informs the next one [48]. Classification proceeds from the root node to a leaf node; each leaf node represents a classification category and each internal node tests an attribute of the sample.

Logistic Regression (LR): This analysis studies the association between a categorical dependent variable and the independent variables, and applies only to binary outcomes such as 0 or 1, yes or no. It follows a binomial distribution in which individual characteristics form the basis for estimating the probability of the model outcome [49].
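For illustration, the baseline classifiers described above can be fitted and compared with scikit-learn roughly as sketched below. The synthetic data, parameter values and train/test split are placeholders for demonstration and do not reproduce the exact configuration used in this study.

```python
# Illustrative sketch (assumed setup): compare the baseline classifiers
# on a numeric feature matrix X with binary labels y.
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.datasets import make_classification

# Stand-in data; in the paper the NSL-KDD training/testing sets are used.
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

classifiers = {
    "SVM": SVC(C=1.0, kernel="rbf"),   # penalty factor C controls the margin trade-off
    "NB": GaussianNB(),                # assumes conditionally independent attributes
    "k-NN": KNeighborsClassifier(n_neighbors=5, metric="euclidean"),
    "DT": DecisionTreeClassifier(random_state=0),
    "LR": LogisticRegression(max_iter=1000),
}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    print(name, accuracy_score(y_test, clf.predict(X_test)))
```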

Random forest (RF): This is a machine learning classifier that combines DTs with ensemble learning [50, 51]. The forest consists of several trees, and the attributes given to each tree are selected randomly. A collection of DTs, each built on a random subset of the data, is created using a bagging approach. The algorithm is considered effective for almost any prediction task and can be used for both classification and regression problems. It is a combination of tree predictors in which every tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. The algorithm has two phases: in the first phase, 'i' random trees are created, which form the RF; in the second phase, the predictions of all DTs built on the same test features are combined. The final prediction is the class returned most frequently across the DTs, or the average of the individual DT estimates. The algorithm works well even on inconsistent data, generates good accuracy, is simple to use and also indicates which variables are important for the classification. It runs accurately on large databases while providing an internal unbiased estimate of the generalization error, and it offers a procedure for balancing the class-wise error on unbalanced datasets. The algorithm also saves data preparation time, as it requires no special input preparation and can handle numerical and categorical features without scaling or transformation. In this ensemble method, several DTs are trained, and the class receiving the majority vote over all trees is returned; RF is slightly better than SVMs [52]. A collection of trees is constructed to form a forest with controlled variance, and the prediction of the classifier is decided by weighted or majority voting. A flow chart of the algorithm is presented in figure 4.

Definition of RF: It uses trees \(g_k(A, \theta_k)\) as the \(k^{\mathrm{th}}\) base learners, where the \(\theta_k\)'s are a collection of random variables that are independent for k = 1,...,K. For training data \(D = \{(a_1,b_1),\ldots,(a_N,b_N)\}\), where \(a_i = (a_{i,1},\ldots,a_{i,m})^T\) represents the m predictors and \(b_i\) represents the response, and a specific realization \(\theta_k\) of \(\Theta_k\), the fitted tree is represented as \({\hat{g}}_k(a, \theta_k, D)\). While this formulation is derived from [53, 54], in practice the random component \(\theta_k\) is not treated explicitly; rather, it is used to inject randomness implicitly in two phases. In the first phase, bagging, an independent bootstrap sample is drawn from the original data to fit each tree; the randomization involved in bootstrap sampling gives one part of \(\theta_k\). In the second phase, at each node split the best split is sought among a random subset of r predictor variables instead of all m predictors; this sampling of predictors gives the remaining part of \(\theta_k\). A pseudo-code of RF is depicted in Algorithm 1. As discussed, the forest consists of several trees, and each tree is constructed by the following steps:

  1. Assumptions: the number of training cases is X and the number of variables in the classifier is Y.

  2. To determine the decision at a node of the tree, y input variables are used, with y < Y.

  3. The training set for a tree is selected by taking a bootstrap sample, i.e., selecting x cases with replacement from all X available training cases. The remaining cases are used to estimate the tree's error.

  4. At each node of the tree, y variables are selected at random and used to make the decision at that node; the best split on these y variables in the training set is calculated.

  5. Each tree is fully grown and not pruned. A new sample is predicted by pushing it down the tree and assigning it the label of the training samples in the terminal node it reaches. This is repeated for all trees, and the prediction of the RF is obtained by the majority (or average) vote over all trees.

Figure 4 Flow chart of the random forest algorithm.

The benefits of this algorithm are the following:

  1. Applicable to both classification and regression problems.

  2. Handles categorical predictors naturally.

  3. Computationally quick and straightforward to fit, even for large problems.

  4. No explicit distributional assumptions (non-parametric).

  5. Can handle highly nonlinear interactions and classification boundaries.

  6. Variable selection is automatic.

  7. Handles missing values through surrogate variables using proximities.

  8. Supports outlier detection, visualization and unsupervised learning.

Algorithm 1 Pseudo-code of the random forest algorithm.
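The construction steps above can be mirrored by a small from-scratch sketch: each tree is fitted on a bootstrap sample (drawn with replacement) and restricted to a random subset of variables at each split, and the forest predicts by majority vote. This is only an illustrative approximation of Algorithm 1, built on scikit-learn decision trees; tree counts and parameters are placeholders.

```python
# Illustrative sketch of the random forest construction steps (1)-(5).
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_random_forest(X, y, n_trees=100, max_features="sqrt", random_state=0):
    # X, y are assumed to be NumPy arrays with integer class labels.
    rng = np.random.default_rng(random_state)
    trees = []
    n = len(X)
    for _ in range(n_trees):
        # Step 3: bootstrap sample drawn with replacement for each tree.
        idx = rng.integers(0, n, size=n)
        # Steps 2 and 4: only a random subset of the y (< Y) variables
        # is considered for the best split at each node.
        tree = DecisionTreeClassifier(max_features=max_features,
                                      random_state=int(rng.integers(1_000_000)))
        tree.fit(X[idx], y[idx])   # Step 5: trees are fully grown, not pruned.
        trees.append(tree)
    return trees

def predict_forest(trees, X):
    # Majority vote over all trees.
    votes = np.stack([t.predict(X) for t in trees])
    return np.apply_along_axis(
        lambda col: np.bincount(col.astype(int)).argmax(), 0, votes)
```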

5 Swarm intelligence

Swarm intelligence has emerged as a class of population-based, nature-inspired algorithms capable of solving complex problems and providing robust, fast and low-cost solutions [55]. It is a subfield of artificial intelligence that draws on the collective behaviour of social insects and animals, including ants, termites, bird flocks and honey bees. These social animals have limited individual capabilities and interact with each other, directly or indirectly, through simple behavioural patterns in order to survive. Direct interaction includes communication through sight or sound, such as the waggle dance of honey bees. Indirect communication occurs when an agent changes the environment and other agents respond to the changed environment, as when ants deposit pheromone trails along their way while searching for food sources. Swarm intelligence is used for optimization problems in data analysis, scheduling, bioinformatics, machine learning, operations research and medical informatics, and also in business and finance. Bonabeau et al [56] defined swarm intelligence, in simple terms, as the collective intelligence that emerges from groups of simple social agents. The principles used for finding a set of solutions in swarm intelligence are the following.

Proximity principle: effective computation for the execution of time and space solutions.

Quality principle: interaction of quality features of the environment.

Different response principle: existence of solutions beyond narrow channels.

Stability principle: stability of actions based on the changing conditions.

Adaptability principle: ability to adapt new conditions based on effective time and space computations.

6 Particle swarm optimization (PSO) algorithm

The PSO algorithm was proposed by R Eberhart and J Kennedy in 1995 [57]. Its implementation in various streams of engineering is explained in [58]. The algorithm is derived from the flocking behaviour of birds and mimics the rapidly changing interactions and movements of social animals such as fish and birds. A flow diagram of the PSO algorithm is presented in figure 5. PSO combines the experience gained by individuals working together as a group with their personal experiences. Optimized solutions are modelled on the flocking behaviour of birds: the birds follow a path towards the food source, and the shortest path found by an individual is referred to as its personal best solution (pbest). Each particle searches the space for the best solution by considering its own flying experience and the experience of others in the group; the best fitness value achieved by any particle in the neighbourhood of a given particle is referred to as gbest. Each particle has an associated velocity that accelerates it towards its pbest and gbest. The basic concept of PSO is to reach the global optimal solution by moving each particle towards its pbest and gbest with arbitrary weights at every step. This algorithm provides improved exploration and exploitation [59, 60].

Figure 5 Flow diagram of the PSO algorithm.

The equations of the PSO algorithm are

$$\begin{aligned} v_{kd}(t+1) = v_{kd}(t) + c_{1}R_{1}\left(p_{kd}(t)-x_{kd}(t)\right) + c_{2}R_{2}\left(p_{gd}(t)-x_{kd}(t)\right) \end{aligned}$$
(1)
$$\begin{aligned} x_{kd}(t+1) = x_{kd}(t) + v_{kd}(t+1) \end{aligned}$$
(2)

where \(v_{kd}(t)\) represents the velocity of the \( k^{\mathrm{th}} \) particle in the \( d^{\mathrm{th}} \) dimension at the \(t^{\mathrm{th}}\) iteration, \(c_{1}\) and \(c_{2}\) are the acceleration factors in the personal (cognitive) and social directions, respectively, \(x_{kd}(t)\) represents the position of the \( k^{\mathrm{th}} \) particle in the \( d^{\mathrm{th}} \) dimension at the \(t^{\mathrm{th}}\) iteration, \(p_{kd}(t)\) represents the pbest position of the \( k^{\mathrm{th}} \) particle in the \( d^{\mathrm{th}} \) dimension at the \(t^{\mathrm{th}}\) iteration and \(p_{gd}(t)\) represents the gbest position obtained by the whole population up to the \(t^{\mathrm{th}}\) iteration; \(R_{1}\) and \(R_{2}\) are random numbers between 0 and 1 used to avoid premature convergence.
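A minimal NumPy sketch of the update rules in Eqs. (1) and (2) is given below. The objective function and parameter values are placeholders, and an inertia weight w is included (the paper later uses W = 0.9, although Eq. (1) is written without it).

```python
# Illustrative PSO sketch implementing Eqs. (1) and (2).
import numpy as np

def pso(objective, dim, n_particles=30, iters=100, c1=0.5, c2=0.3, w=0.9, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.uniform(-1, 1, size=(n_particles, dim))   # positions x_kd
    v = np.zeros_like(x)                              # velocities v_kd
    pbest = x.copy()
    pbest_cost = np.array([objective(p) for p in x])
    g = pbest[pbest_cost.argmin()].copy()             # gbest position p_gd
    for _ in range(iters):
        r1 = rng.random(x.shape)                      # R1, R2 drawn in (0, 1)
        r2 = rng.random(x.shape)
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (g - x)   # Eq. (1), with inertia weight
        x = x + v                                               # Eq. (2)
        cost = np.array([objective(p) for p in x])
        improved = cost < pbest_cost
        pbest[improved], pbest_cost[improved] = x[improved], cost[improved]
        g = pbest[pbest_cost.argmin()].copy()
    return g, pbest_cost.min()

# Example: minimize the sphere function in 5 dimensions.
best_pos, best_cost = pso(lambda p: np.sum(p ** 2), dim=5)
```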

7 Proposed methodology

7.1 Preprocessing

Rough features can lead classifiers to raise false alarms; hence, preprocessing of the dataset is a necessary step. Moreover, some redundant features increase the computation time and memory requirements in a way that the classification techniques cannot avoid. In the NSL-KDD dataset, the rough features are expressed as

$$\begin{aligned} rs = \{fs_1, fs_2, fs_3, fs_4, \ldots , fs_n\}, \end{aligned}$$
(3)

where n = 41, the number of features in the dataset. The redundant features are eliminated from the rough feature set because of their overhead and redundancy. The modified rough feature set is represented as

$$\begin{aligned} rs^{\prime }= \{fs_1, fs_2, fs_3, fs_4, \ldots , fs_p\}, \end{aligned}$$
(4)

where p represents the number of rough features after elimination. Moreover, additional preprocessing is needed to obtain the optimized feature set based on the importance of the features in the dataset. For this purpose, feature reduction and feature transformation are necessary; they are explained in the next section.

7.2 FS method

In this section, we formulate the FS method mathematically. FS is represented as a 6-tuple \( \text {FS} = \{D, V, C, S, f_s, E\} \), where D is a dataset, \( D = \{a_1, a_2,\ldots , a_k\} \) with k instances; V is the set of features, \( V = \{e_1, e_2,\ldots , e_f \} \) with f features; C is the set of target classes, \( C = \{c_1, c_2,\ldots , c_n \} \) with n target classes; S (the search space) is the collection of all subsets that can be constructed from V, \( S = \{s_1, s_2,\ldots , s_l \} \) with \( l=2^f-1 \) and \( s_i=\{e_p, e_q,\ldots ,e_r\}\ (1\le p\ne q\ne r\le f) \); E is an evaluation measure; and the function \( f_s \) represents the FS transformation \( f_s:V \rightarrow S \). FS is the process of eliminating redundant attributes and obtaining the optimized subset(s) from the dataset [61]. The objective of this technique is to select a subset of features that increases efficiency, reduces computational complexity and enhances predictive accuracy. The steps performed in FS are presented in figure 6 and explained here; an illustrative sketch follows the list.

  1. The generation module generates the next candidate subset from the original feature set.

  2. The estimation module calculates the suitability of the subsets using several measuring parameters.

  3. The stopping point, at which the subset of features is recognized as optimal.

  4. The validation module checks the validity of the feature subset.
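Read as a wrapper loop, these modules amount to: generate a candidate subset, estimate its quality with a classifier, stop when the score no longer improves and validate the chosen subset on held-out data. The sequential forward-selection sketch below is only one possible instantiation of figure 6, not the procedure used in this paper.

```python
# Illustrative wrapper-style feature selection loop (figure 6 modules).
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def forward_selection(X, y, estimator=None, cv=5):
    # X is assumed to be a NumPy feature matrix, y the class labels.
    estimator = estimator or DecisionTreeClassifier(random_state=0)
    remaining = list(range(X.shape[1]))
    selected, best_score = [], 0.0
    while remaining:                                   # generation module
        scores = {f: cross_val_score(estimator, X[:, selected + [f]], y, cv=cv).mean()
                  for f in remaining}                  # estimation module
        f_best, score = max(scores.items(), key=lambda kv: kv[1])
        if score <= best_score:                        # stopping point
            break
        selected.append(f_best)
        remaining.remove(f_best)
        best_score = score
    return selected, best_score   # validation module: re-check on a held-out set afterwards
```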

Figure 6 Feature selection process [62].

Figure 7 Proposed research methodology.

Figure 8 List of selected attributes representing their importance using random forest algorithm.

7.3 Proposed model

In this paper, we perform FS of attributes before pre-processing. The process of selecting essential features from the original set is termed FS, and it is necessary because of the presence of misleading and insignificant features in the dataset. The RF algorithm, one of the most prominent machine learning algorithms, is used to perform FS. This algorithm offers easy interpretability, low over-fitting and excellent predictive performance; interpretability comes from the importance derived for each variable in the DTs. Selecting features using RF is referred to as an embedded method, which is an amalgamation of wrapper methods and quality filters. An RF contains 4–1200 trees, each of which is built on a random extraction of observations from the dataset as well as a random subset of features. Such decorrelated trees are less prone to over-fitting because no tree sees all the features seen by the others. Every tree applies a sequence of yes–no questions based on combinations of features; at every node the data are divided into two buckets, each containing observations that are similar among themselves and different from those in the other bucket. The importance and purity of every feature are derived from these buckets. Hence, the RF classifier is applied to the training and testing sets of the NSL-KDD dataset; the classifier selects the top 10 features based on their importance and eliminates the insignificant features. Figure 7 presents the proposed methodology. The selected features, described in section 2, are shown in figure 8 in order of their importance in the dataset. A sketch of this selection step is given below, and the proposed PSO algorithm is blueprinted by Algorithm 2.
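The following is a minimal sketch of this importance-based selection with scikit-learn. The file name, column names and the restriction to numeric attributes are simplifying assumptions for illustration; the exact preprocessing in this study may differ.

```python
# Illustrative sketch: rank NSL-KDD features by random forest importance
# and keep the top 10 (file and column names are placeholders).
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

train = pd.read_csv("KDDTrain+.csv")                    # assumed file name
X = train.drop(columns=["class"]).select_dtypes(include="number")  # numeric attributes only, for simplicity
y = (train["class"] != "normal").astype(int)            # attack vs. normal

rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X, y)

importances = pd.Series(rf.feature_importances_, index=X.columns)
top10 = importances.sort_values(ascending=False).head(10)
print(top10)                                            # features retained for the PSO stage
```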

Algorithm 2 The proposed PSO algorithm.

Both the training and the testing set of the dataset have 41 attributes in addition to the class attribute. One-hot encoding is used to convert the categorical variables into a format suitable for machine learning algorithms, which improves prediction; after one-hot encoding, the 41 attributes yield 74 attributes. These attributes form the input layer of the neural network, and the hidden layer has 20 neurons. Forward propagation is used as the objective function: it computes the forward pass of the neural network as well as the loss, and it receives a set of parameters that must be rolled back into the corresponding weights and biases. The PSO algorithm is applied on top of this forward propagation, using the selected features of the dataset, with different numbers of iterations and a finite number of particles, in order to find the best accuracy over all iterations.
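The workflow described here (a flattened parameter vector rolled back into weights and biases, a forward-propagation loss as the objective, and global-best PSO with c1, c2 and w) can be sketched as follows. The example assumes the pyswarms library for the optimizer and uses placeholder data and a reduced swarm size; the paper's experiments use 2800 particles and up to 28 iterations.

```python
# Illustrative sketch: neural-network loss as the PSO objective
# (assumed to follow the workflow described above, not the exact code of the paper).
import numpy as np
import pyswarms as ps                               # assumed optimizer library

n_inputs, n_hidden, n_classes = 74, 20, 2           # 74 one-hot attributes, 20 hidden neurons
dims = n_inputs * n_hidden + n_hidden + n_hidden * n_classes + n_classes

X = np.random.rand(500, n_inputs)                   # placeholder for the encoded NSL-KDD records
y = np.random.randint(0, n_classes, 500)            # placeholder labels

def forward_prop(params):
    """Roll the flat parameter vector back into weights/biases and return the loss."""
    i = 0
    W1 = params[i:i + n_inputs * n_hidden].reshape(n_inputs, n_hidden); i += n_inputs * n_hidden
    b1 = params[i:i + n_hidden]; i += n_hidden
    W2 = params[i:i + n_hidden * n_classes].reshape(n_hidden, n_classes); i += n_hidden * n_classes
    b2 = params[i:i + n_classes]
    h = np.tanh(X @ W1 + b1)
    logits = h @ W2 + b2
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    return -np.log(probs[np.arange(len(y)), y] + 1e-12).mean()   # cross-entropy loss

def objective(swarm):                                # one cost value per particle
    return np.array([forward_prop(p) for p in swarm])

optimizer = ps.single.GlobalBestPSO(n_particles=100, dimensions=dims,
                                    options={"c1": 0.5, "c2": 0.3, "w": 0.9})
best_cost, best_pos = optimizer.optimize(objective, iters=30)
```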

8 Performance measure

The performance measurement is achieved by calculating the FPR and TPR. FPR is also designated as false alarm rate (FAR) or failure rate; it is the classification of benign traffic as malicious. It can be formulated as the number of incorrect detection of normal records as intrusions divided by the total number of normal records. The DR is the number of correctly identified positive instances divided by the total number of instances identified as positive. The DR is also referred to as TPR or recall or sensitivity. The following parameters are used for performance measurement:

$$\begin{aligned} \text {True Positive Rate (TPR)}&=\dfrac{\text {TP}}{\text {TP+FN}}\times 100,\\ \text {True Negative Rate (TNR)}&=\dfrac{\text {TN}}{\text {TN+FP}}\times 100,\\ \text {False Negative Rate (FNR)}&=100-\text {TPR},\\ \text {False Positive Rate (FPR)}&=100-\text {TNR}, \\ \text {precision}&=\dfrac{\text {TP}}{\text {TP+FP}}\times 100,\\ \text {overall accuracy}&=\dfrac{\text {TP+TN}}{\text {TP+TN+FP+FN}}\times 100,\\ {\mathcal {F}}_1 \text {-score}&=\dfrac{2\text {TP}}{2\text {TP+FP+FN}}\times 100. \end{aligned}$$
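These measures follow directly from the confusion-matrix counts; a small helper computing them as percentages, matching the formulas above, is sketched below.

```python
# Compute the listed measures (in %) from confusion-matrix counts.
def ids_metrics(tp, tn, fp, fn):
    tpr = 100.0 * tp / (tp + fn)            # detection rate / recall / sensitivity
    tnr = 100.0 * tn / (tn + fp)
    return {
        "TPR": tpr,
        "TNR": tnr,
        "FNR": 100.0 - tpr,
        "FPR": 100.0 - tnr,                  # false alarm rate
        "precision": 100.0 * tp / (tp + fp),
        "accuracy": 100.0 * (tp + tn) / (tp + tn + fp + fn),
        "F1": 100.0 * 2 * tp / (2 * tp + fp + fn),
    }

# Example usage: print(ids_metrics(tp=900, tn=950, fp=50, fn=100))
```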
Figure 9 Comparison of accuracy of machine learning classifiers with the proposed method.

Figure 10 Accuracy observed using different iterations of the proposed PSO algorithm.

Table 5 Comparison of the proposed technique with machine learning classifiers in training set.
Table 6 Comparison of the proposed technique with machine learning classifiers in testing set.
Table 7 Observations of the PSO algorithm using fixed number of particles with increased iterations.
Table 8 Observations of the PSO algorithm with different feature sizes.
Table 9 Comparison of the results with other algorithms.

9 Analysis and results

The FS method eliminates unnecessary attributes from the dataset. Removing these unwanted attributes is essential because they decrease the accuracy of the algorithm used for prediction. As the number of features in the dataset increases, the search space also increases; FS is therefore a challenging task, since its difficulty grows with the search space. It acts as a bridge between feature extraction and pre-processing. In this study, we have performed FS using the RF algorithm to cut down the number of attributes in the dataset, which improves computational efficiency and removes the redundant attributes that affect the efficiency of the system. After performing FS, we applied the PSO algorithm to the training and testing sets to enhance the DR and optimize the IDS. Based on the confusion matrix of the results, we calculated various performance measures, including the precision, DR and accuracy of the system; they are listed in tables 5 and 6. The FS method with the proposed PSO algorithm gives the best accuracy of 99.32% on the training set and 87.83% on the testing set, which is better than that of the other machine learning classifiers considered, namely the NB, SVM, DT, LR and k-NN algorithms; the comparison is presented in figure 9. The proposed approach outperforms these classifiers because PSO is a bio-inspired search technique that is simple and easy to implement and involves a single operator for updating solutions. PSO has been very effective in a wide variety of applications and is able to produce good solutions at a very low computational cost; a slight change in its parameters can further improve the performance of the system [65,66,67]. It is robust, supports parallel computation and can solve well-defined mathematical models efficiently. The algorithm converges rapidly without crossover and mutation operators and has a high probability of finding the global optimum [68, 69].

To analyse the performance of the proposed work, we examined several combinations of the PSO parameters and found that the highest accuracy is obtained with \(C_1\) = 0.5, \(C_2\) = 0.3 and W = 0.9; the results for the different parameter settings are presented in table 10. Moreover, to obtain a suitable empirical combination of the number of particles and the number of iterations, we performed several preliminary experiments and found that 2800 particles and 28 iterations give the best performance (see table 7). However, changing the parameters may lead to different observations in a different scenario; thus, parameter optimization of the PSO algorithm remains an open area of research. The change of accuracy with the iterations is depicted in figure 10. We observe that the detection accuracy fluctuates within a small margin (approximately \( \pm 0.5 \)) up to the 27th iteration. This small change occurs because the search particles oscillate between their personal and social attractors, so their positions are updated within a small region of the search space (exploitation). However, some particles may iteratively keep their momentum towards an earlier best position (possible because the inertia weight is 0.9) and escape from the local optimum; they then guide the other particles towards the newly discovered position (exploration). This is observed in figure 10 as a sudden increase in accuracy at the 28th iteration. Excessive exploration of the search space, on the other hand, also drives the particles towards non-optimal regions; accordingly, the accuracy drops suddenly between the 28th and 30th iterations. The algorithm was also tested with different small feature sets of 12, 15, 18, 20, 22 and 25 features, using the same PSO configuration, to compare the results with the selected 10 features (see table 8). The comparison of accuracy and DR with existing methods is depicted in figures 11 and 12, respectively, which show that the proposed technique achieves a better DR and accuracy than the other methods while using fewer selected features. The DR improves when the PSO algorithm is applied to the reduced features of the NSL-KDD dataset; hence, the algorithm gives optimized results in terms of the number of features. Table 9 presents the comparison with existing algorithms, which indicates that this algorithm provides better results with only 10 features (table 10).

Table 10 Empirical parameter settings of the PSO algorithm.
Figure 11 Comparison of accuracy with other algorithms.

Figure 12 Comparison of detection rate with other algorithms.

10 Conclusion and future scope

This paper discusses the implementation of the PSO algorithm with FS to enhance the accuracy and DR of the IDS. We have used only 10 features from the NSL-KDD dataset in an attempt to remove extra and noisy attributes that negatively affect the performance of the system. The main objective of the FS method is to simplify the dataset by reducing its dimensionality and identifying the optimal subset of features. The RF algorithm is used to select the top 10 out of 41 features. The PSO algorithm is applied to the selected 10 features with different numbers of iterations and a fixed number of particles to achieve the optimized result; the number of iterations ranges from 20 to 28, and the number of particles is fixed at 2800. The best accuracy and DR are observed at 28 iterations with 2800 particles. The results are compared with those of the SVM, DT, NB, LR and k-NN algorithms on the training and testing sets of the dataset; the accuracy and other performance measures observed are better than those of the other algorithms. The proposed method is also compared with existing algorithms in which an FS method is applied to the same dataset, and the results show that it achieves the best accuracy with only 10 features.