1 Introduction

The Internet has become one of the most essential tools of the modern era. Networking technologies such as the Local Area Network (LAN), Wide Area Network (WAN) and Wireless Local Area Network (WLAN) have made computer networking attractive for enterprises, security services, health care and other emergency services, and the Internet now plays an indispensable role in daily life. Attackers can exploit this dependency through threats such as botnet-based DDoS attacks [4], viruses, Trojans, worms and spyware. Such attacks are becoming more sophisticated, and their number is increasing day by day. Several models and mechanisms [12, 21] have already been proposed to defend against different attacks. Nevertheless, providing security to every environment in the network remains a major challenge. Intrusion detection systems (IDS) have been developed to defend against these threats. An IDS can be considered a set of techniques and methods used to detect suspicious activities in the network. Existing IDS methods.

This paper proposes a misuse-based IDS that classifies traffic into five categories: Exploit, DoS, Probe, Generic and Normal. Although a firewall provides good control over access to resources on the Internet, attackers have developed various techniques to bypass it. Because the proposed system is misuse-based, it can act as a firewall with some extra information added to it, so it is not limited to the functionality of an IDS. An advantage of the proposed IDS is that it assists the network administrator in classifying the captured traffic into five categories, including the normal category, which yields a lower false alarm rate (FAR) than an anomaly-based IDS.

Most research on IDS is based on the KDD99 or NSL-KDD data sets [2, 3, 8, 16, 18, 20, 22, 28], which are considered obsolete for detecting recent types of attacks. In the proposed work, UNSW-NB15 [23] is used as the offline data set to design the proposed IDS model. Moustafa et al. [24] applied the UNSW-NB15 and KDD99 data sets to intrusion detection and compared the performance on both. Their results show that an IDS built on UNSW-NB15 covers recent attacks better than one built on KDD99, and that KDD99 has little significance today. Their work suggests that the decision tree model performs best on the introduced data set. However, the performance evaluation of [24] on UNSW-NB15 shows that its intrusion detection rate is not high, owing to the overlapping nature of several attacks. In [26], the authors combined a decision tree approach with a genetic algorithm to design a misuse-based IDS; three data sets (KDDCup’99, NSL-KDD and UNSW-NB15) were used to build separate models, each of which was then tested on its respective test set. In this paper, we propose our own integrated model, whose accuracy is higher than that of ENADS [24] and Dendron [26]. The performance of the proposed integrated model is also evaluated on a real-time data set generated in the NIT Patna lab. Sangkatsanee et al. [28] proposed a real-time IDS using machine learning techniques, with KDD99 as the offline data set and RLD09 as the real-time data set. KDD99 is now obsolete and significantly biased, and their real-time data set was evaluated on an existing decision-tree-based model considering only three categories (Normal, Probe, DoS).
In our proposed work, we have also designed a real-time data set, RTNITP18, which is evaluated on our proposed integrated model to detect five categories (Normal, Probe, DoS, Exploit, Generic). Forty nodes at NIT Patna were chosen for this purpose. All of these nodes are connected in a LAN; some act as attackers and some as victims. We observed the packet flows in that network for 7 days, from Monday to Sunday, and captured data packets with the Wireshark packet sniffer [5]. The captured packet file is exported to a .txt file; a Python script then extracts features from it, maps them to the features of the traditional data set and saves the result to a .csv file. In this way we obtain a real-time data set. We evaluate the performance of the proposed model on this real-time data set and observe an accuracy of 85.8%. Several other evaluation metrics (e.g. ADR = 90.32, FAR = 2.01%) are also better for the proposed integrated model than for the existing models [24, 26]. Hence the novelty of the proposed IDS is the development of a model on the basis of a recent data set (UNSW-NB15), whose attack behaviour is considerably more complex than that of the older ones (KDD99, NSL-KDD etc.). The proposed model is built from different decision tree models by considering only rules with a high confidence factor, which in turn reduces its FAR. In addition, the proposed model is able to detect five categories, including the normal category, with a high attack detection rate (ADR) and a low FAR compared with other recent approaches. The model is also evaluated on the real-time data set generated by setting up a virtual environment at the NIT Patna CSE lab.

The proposed model can be used in several domains of industrial application. Industrial control systems (ICS) are widely used, and they entail real-time data acquisition, system monitoring, and the automatic control and management of industrial processes. However, an ICS is an attractive target for hackers, so its security is of paramount importance. The proposed IDS is designed for the automatic detection of malicious attacks. It can collect and analyse different attributes, such as network traffic and security logs, and it can check whether a security infringement exists in the system by auditing data and information from key points of the computer system. Evidence collection using digital forensics is another important domain in which an IDS can be used. A modified version of the IDS can notify the administrator by sending an alert and can also activate a digital forensic tool to capture the current state of the system. The captured system image contains the entire information pertaining to the system at the moment the attack was taking place, so such images can be used as evidence in legal proceedings. The proposed IDS can therefore be applied to maintain the security of ICS and in the domain of digital forensics. It can also be deployed in any organization by installing it on a network device; any malicious activity found is reported to the security administrator for further action. The proposed model can likewise be used to provide security in an IoT environment. Hence our proposed IDS acts as a watchdog for detecting threats on the Internet.

The key observations of our proposed approach are given below:

  1. Most of the related works on IDS are based on the KDD cup99 or NSL-KDD data sets [2, 3, 8, 16, 18, 20, 22, 28], which are not up to date in the sense that most recent attacks are not covered. In this paper, we use a newer data set (UNSW-NB15), which covers more recent attacks than the KDD cup99 data set.

  2. An integrated rule-based model for IDS is proposed. The detection rate of this model is high in comparison with traditional decision-tree-based models and existing state-of-the-art works on IDS [24, 26]. Several other metrics (discussed in Sect. 3.5) are also used to evaluate the proposed work and compare it with other state-of-the-art techniques.

  3. A real-time data set (RTNITP18) is generated at the NIT Patna CSE lab; it acts as a working example to evaluate the performance of the proposed model in a real-time environment.

  4. The proposed model considers five categories in such a way that some other attacks can also be identified, whereas most related works on IDS consider only two to three types of attacks in the network. The performance of the proposed model on UNSW-NB15 (benchmark data set) and on the real-time data set (RTNITP18) shows higher accuracy and ADR than other existing approaches.

  5. The proposed IDS model acts as a complement to the firewall: it collects the traffic arriving at the network from the Internet as well as the traffic of the organization and analyses it for any malicious activity.

  6. The proposed IDS model acts as an out-of-band device, so it does not introduce any jitter into the network, which is an advantage over an intrusion prevention system (IPS).

Figure 1 shows the working environment of the IDS: data coming from the Internet goes to the firewall as well as to the IDS, so that malicious activity missed by the firewall can be found. The proposed IDS also monitors the traffic inside the organization for internal intruders.

Fig. 1 IDS implemented in an organization connected to the Internet

The rest of the paper is organized as follows. Section 2 briefly describes the related work. The proposed model, along with the result analysis, is discussed in Sect. 3. In Sect. 4, a working example for the evaluation of the proposed model is described. Finally, concluding remarks on the proposed model are given in Sect. 5.

2 Related work

This section reviews state-of-the-art studies on IDS. Most previous work is based on the KDD99 data set [18], which is old and does not include the most recent categories of attack. Research on IDS can be divided into two categories: (1) work in which detection is based on provided signatures, and (2) work in which a normal profile is generated and any deviation from that profile is reported as an attack; the latter is also called behaviour-based detection. In the first category, false reports are rare, but the system is less able to detect new types of attack whose signatures are not yet known. In the second category, the system is more prone to false reports, although a novel attack can be detected.

In [2], an important feature selection model for IDS based on Ant Colony Optimization (ACO) has been proposed. In this process, features are treated as nodes in a graph, and an edge between two nodes represents the next choice of feature. The optimal subset of features is selected by traversing the graph. In this model, the pheromone and heuristics are not associated with the links, as in the usual ACO traversal; rather, they are associated with the features themselves. The pheromone represents the attractiveness of a feature. Wang et al. [32] also proposed a function for selecting features using the neighbourhood discernibility matrix. This matrix shows the classification ability of a feature subset, and the function helps to determine the significance of a candidate attribute. Here, a dependency function is used to analyse the relevance between features and the decision made using those features.

Several recent works on IDS/IPS have been proposed to protect against DoS/DDoS attacks. Agarwal et al. [1] proposed an IDS along with an Intrusion Prevention System (IPS) for detecting and recovering from DoS attacks in Wi-Fi networks based on the IEEE 802.11 standard. They also used the Angle of Arrival (AoA) [19] algorithm over RSSI [31] to find the location of the attacker. In [25], the authors proposed an enhanced Confidence Based Filtering (CBF) method that protects cloud services from DDoS attacks by using the concept of correlation patterns. Gupta et al. [13] proposed a flow- and volume-based approach to detect DDoS attacks in an ISP domain; their simulation in the network simulator NS-2 shows good detection with a low false alarm rate. Gou et al. [11] proposed a framework for IDS based on Petri nets that consists of two different functions, one for attack detection and one for upgrading the model.

The authors of [20] proposed an IDS that combines fuzzy set theory with association rule mining and genetic network programming (GNP), using the KDD99 data set. Sub-attribute utilization is used to extract discrete and continuous attributes in order to avoid data loss, which gives effective rule mining with GNP. Wattanapongsakorn et al. [33] proposed an Intrusion Detection and Prevention System (IDPS) covering four categories (Normal, DoS, Probe, Worm). A fuzzy genetic algorithm is used for unknown attacks, and several machine learning techniques (C4.5 Decision Tree, Random Forest, Ripple Rule, Bayesian Network, and Back Propagation Neural Network) are used for known attacks. In [37], the authors introduced the intuitionistic fuzzy rough graph, which is used to handle uncertainty and incomplete information in an information system, and also proposed an algorithm that can efficiently solve decision-making problems.

Das et al. [7] proposed an NIDS model that detects port scan attacks using an SVM. They trained their model on the pattern of frequency change in normal and attack packets. Data is captured every 4 s by the NIDS for analysis. They used the rough set method, in preference to PCA, for optimal feature selection, and only one type of attack was considered. An IDS model is proposed in [16] by applying an SVM to the NSL-KDD data set [27]. The authors presented a framework that selects the features of NSL-KDD so as to distinguish normal traffic from abnormal traffic more accurately. The framework uses filter and wrapper methods for feature selection and ranks the selected features using the information gain ratio. Chowdhury et al. [6] proposed a combination of two machine learning algorithms for the classification of anomaly-based intrusions. Simulated annealing generates a random set of 3 features at each iteration, and an SVM is then applied to the selected set to detect anomalous behaviour. The algorithm uses the data set from the Australian Centre for Cyber Security [23].

Fares et al. [8] proposed an NIDS model based on a Neural Network (NN), which is divided into 3 phases. They reduced the data set in a pre-processing phase so that overfitting due to dominating attack categories does not occur. The categories considered by their model are Normal, DoS, Probe, R2L and U2R. They compared the performance of their model on both data sets (10% of KDD99 and the reduced data set). The model was trained and tested only on offline data sets. In [3], an IDS using an NN with a genetic algorithm is proposed to improve accuracy, with KDD99 as the benchmark data set. The authors used an NN with resilient back propagation and a sigmoid function. In [30], the authors proposed a lightweight IDS for anomaly detection using KDD99, focusing on three major tasks: (1) removing redundant data from the data set, (2) feature extraction, and (3) realization of the proposed IDS. The IDS uses a wrapper approach for feature selection. Bagging is used to generate multiple training data sets, which are used to train multiple neural networks; a new training data set is then generated by replacing the class labels of the original data set with the output labels of these networks. The newly generated data set is then used with a C4.5 model.

Many researchers are working to design efficient IDS using machine learning techniques. In [28], the authors proposed a real-time IDS that works on three categories: Normal, DoS and Probe. The work is divided into three phases: (1) a pre-processing phase, (2) a classification phase and (3) a post-processing phase. They compared the performance on KDD99 and RLD09 [28] with different machine learning techniques, and only two types of attacks were considered. Kalekar et al. [17] proposed a real-time IDS using a Naïve Bayes classifier. Their model classifies any packet as normal or abnormal, but its performance was not evaluated on any data set. An algorithm for removing outliers from KDD cup99 is proposed in [22]. It makes the algorithm compatible with the Weka tool [34]. After executing the proposed algorithm, the authors used 10% of the data set and evaluated the performance of different machine learning algorithms (Bayesian network, naïve Bayes classifier, J48, J48 Graft, and Random forest). Performance is evaluated using precision, recall and F-measure [10]. In [14], the authors proposed a host-based IDS using a hidden Markov model. The model is evaluated on publicly available data sets from the University of New Mexico (UNM) and the Massachusetts Institute of Technology (MIT) Artificial Intelligence laboratory. In this model, the training data set is divided into K equal-sized sub-sequences, and sub-sequences with low mutual correlation are used to train sub-models. The trained sub-models are then merged incrementally to form the final model. Moustafa et al. [24] presented a new data set (UNSW-NB15) for testing any IDS. According to the authors, the data set covers the most recent attacks. They described the completeness of the data set and evaluated the performance of different machine learning algorithms on it.
They also compared the analysis with the existing KDD99 data set. Although their work suggests that the decision tree model performs best on the introduced data set, the accuracy is not very high. In [35], a method for host-based anomaly detection is presented that uses k-Means clustering with the ID3 decision tree model. First, k-Means clustering, with Euclidean distance as the similarity measure, is applied to partition the training data; ID3 is then applied to each cluster to build a decision tree. The results of these two phases are combined using a special algorithm.

Ibrahim et al. [15] proposed a layered-model approach for NIDS that is divided into two stages. The first stage classifies data traffic as either normal or abnormal due to some attack. In the second stage, the attacks are classified individually. This model is inspired by the airport security model. It analyses tcpdump data using data mining techniques. Sasan and Sharma [29] proposed a hybrid model for IDS using J48 and CART. They implemented their model in the Weka tool and evaluated its performance on the NSL-KDD data set. Recently, Yin et al. [36] proposed an improved clonal selection algorithm of the artificial immune system, which is inspired by the biological immune system of humans, to improve the accuracy of IDS. They used KDDcup99 for the performance evaluation of their work.

The state of the art shows that most works on IDS use KDD cup 99 and NSL-KDD as benchmark data sets, which are regarded as old and do not cover recent attacks. Moreover, very few papers design their own real-time data set to evaluate the intrusion detection rate of their IDS in the network. The intrusion detection rate of most IDS is evaluated on traditional decision-tree-based models that detect only two to three attacks.

In our proposed work, we design a signature-based IDS on a new data set (UNSW-NB15), which covers more recent attacks than KDD cup99 and NSL-KDD. An integrated rule-based model for IDS has been designed, and it shows a higher intrusion detection rate than other existing decision-tree-based models for five categories. We have also designed our own real-time data set (RTNITP18), which acts as the testing data set to evaluate the performance of the proposed integrated model.

3 Proposed work

In the proposed system, the UNSW-NB15 dataset is used as the benchmark dataset. This section proposes an integrated rule-based model to optimize the attack detection rate (ADR) and FAR in the network. The working diagram of the proposed work is shown in Fig. 2 and has two parts. In the first part, the IDS model is proposed: the UNSW-NB15 dataset is analysed, the data are pre-processed, and the integrated rule-based model is then built and tested on the benchmark test dataset. In the second part, a working example is considered in which a real-time dataset is generated by setting up a virtual environment, and the performance of the proposed model is evaluated on that dataset.

Fig. 2 Flow diagram of proposed work

3.1 Dataset description

The dataset was created [23] using the IXIA PerfectStorm tool. It includes nine categories of modern attack types and involves realistic activities of normal traffic. The data set contains 49 features, which fall into several categories. Although several datasets are available for IDS evaluation, such as KDD98 and NSLKDD, none of them covers the latest types of attacks. Recent research on IDS [9] notes that these datasets do not comprehensively reflect real network traffic behaviour and modern attacks. The features of UNSW-NB15 are grouped into five categories: (a) flow features, (b) basic features, (c) content features, (d) time features and (e) additional generated features. An overview of the dataset is given in Tables 1 and 2, and the attacks are defined in Table 3.

Table 1 Description of UNSW-NB15 dataset
Table 2 Features of the dataset [23]
Table 3 Attack types that act as the class labels in the training dataset

3.2 Pre-processing phase

This phase is divided into following sub-phases:

3.2.1 Dataset reduction

In this phase, the size of the original UNSW-NB15 dataset is reduced by eliminating redundant data. Clusters are constructed to group similar data in the dataset. The number of clusters is chosen by running the k-means algorithm for several values of k and generating the corresponding cluster configurations, whose quality is measured with the silhouette coefficient. The configuration with the maximum silhouette value at the least value of k is selected for further analysis. Figure 3 confirms that 15 clusters give the best result. The UNSW-NB15 dataset has 10 categories (9 attack types and 1 normal). Some of these attacks behave similarly and therefore overlap with each other. To address this, the whole dataset is divided into 15 clusters, and only the instances of the dominating class are selected from each cluster, which reduces the size of the data.
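
The two decisions described above (choosing k from silhouette values and keeping only the dominating class of each cluster) can be sketched in Python as follows. This is only an illustration under stated assumptions: the silhouette scores are taken as precomputed inputs, the function names `pick_k` and `keep_dominant` are hypothetical, and the toy values are invented rather than taken from UNSW-NB15.

```python
from collections import Counter, defaultdict

def pick_k(silhouette_by_k):
    """Return the smallest k whose silhouette score equals the maximum."""
    best = max(silhouette_by_k.values())
    return min(k for k, s in silhouette_by_k.items() if s == best)

def keep_dominant(cluster_ids, class_labels):
    """Return the indices of instances whose class is the most frequent
    (dominating) class of the cluster they were assigned to."""
    by_cluster = defaultdict(list)
    for idx, (cid, label) in enumerate(zip(cluster_ids, class_labels)):
        by_cluster[cid].append((idx, label))
    kept = []
    for members in by_cluster.values():
        dominant, _ = Counter(label for _, label in members).most_common(1)[0]
        kept.extend(idx for idx, label in members if label == dominant)
    return sorted(kept)

# Toy inputs (invented): scores for three candidate k values and six instances.
scores = {10: 0.41, 15: 0.56, 20: 0.56}
clusters = [0, 0, 0, 1, 1, 1]
labels = ["DoS", "DoS", "Fuzzers", "Exploit", "Shellcode", "Exploit"]

k = pick_k(scores)                      # ties are resolved towards the least k
kept = keep_dominant(clusters, labels)  # the Fuzzers and Shellcode rows are dropped
```

In this toy run the instances of the dominated classes are removed, which mirrors how the reduced dataset retains only the dominating categories.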

Fig. 3 Silhouette coefficient graph for selecting optimal number of clusters

Table 4 shows that most instances of DoS, Analysis, Backdoor and Fuzzers are in cluster 1 and that DoS dominates the Analysis and Fuzzers attacks. Similarly, most instances of the Exploit and Shellcode attacks are in the same clusters, and Exploit dominates Shellcode. Cluster 5 contains most instances of the Generic attack. Clusters 2 and 9 contain most instances of the Probe attack, and Probe dominates Worm. The same holds for Normal, which is clearly distinguishable from the other categories. As a result, DoS, Exploit, Generic, Probe and Normal dominate the rest of the attacks. Table 5 gives the details of our reduced dataset. The proposed model is a misuse-based IDS model, so it detects the attacks for which a signature is available. Moreover, some other attacks that are not present in the training set of the proposed model but behave similarly (for example, Analysis, Backdoor and Fuzzers behave similarly to DoS, and Shellcode behaves similarly to Exploit) can also be detected by this model.

Table 4 Similarity between different types of attacks
Table 5 Reduced dataset

3.2.2 Feature reduction

The UNSW-NB15 data set contains 47 features. The information gain of each feature is computed in order to find the effective set of features for decision making. The information gain of a feature is defined as its contribution to classifying the data set. If D is a given dataset and A is a feature, then the information gain of feature A is calculated as in Eq. 1.

$$ {\text{Information}}\;{\text{gain}}\; ( {\text{A)}} = {\text{Entropy}}({\text{D}}) - {\text{Entropy}}_{\text{A}} ({\text{D}}) $$
(1)

where Entropy(D) is the expected information needed to classify a tuple in D, defined in Eq. 2 with m denoting the number of classes, and EntropyA(D) is the expected information still needed for exact classification after D is partitioned on feature A, defined in Eq. 3.

$$ {\text{Entropy }}\left( {\text{D}} \right) = - \sum\limits_{{{\text{i}} = 1}}^{\text{m}} {{\text{p}}_{i} } \times \log_{2} ({\text{p}}_{i} ) $$
(2)
$$ {\text{Entropy}}_{\text{A}} \left( {\text{D}} \right) = \sum\limits_{{{\text{j}} = 1}}^{\text{v}} {\frac{{\left| {{\text{D}}_{\text{j}} } \right|}}{{\left| {\text{D}} \right|}}} \times {\text{Entropy}}\left( {{\text{D}}_{\text{j}} } \right) $$
(3)

where feature A has ‘v’ distinct values and Dj is the subset of tuples of D that take the j-th distinct value of A, so that |Dj| and |D| are the numbers of tuples in Dj and D. pi is defined in Eq. 4.

$$ p_{i} = \frac{{|C_{i} |}}{|D|} = {\text{Probability}}\;{\text{that}}\;{\text{an}}\;{\text{arbitrary}}\;{\text{tuple}}\;{\text{in}}\;D\;{\text{belongs}}\;{\text{to}}\;{\text{class}}\;C_{i} $$
(4)

The information gain value is averaged over several decision tree models (C5, CHAID, CART, QUEST), as shown in Eq. 5.

$$ {\text{Information}}\;{\text{gain}}_{\text{AVG}} = \frac{{\sum\nolimits_{{{\text{for}}\;{\text{each}}\;{\text{model}}}} {{\text{information}}\;{\text{gain}}} }}{{{\text{Total}}\;{\text{number}}\;{\text{of}}\;{\text{model}}}} $$
(5)

Table 6 shows the information gain values in detail. Of the 47 features, 25 have an average information gain of 0 over the decision tree models, and the remaining 22 have a value greater than 0. We consider only those 22 features.

Table 6 Average information gain values of the 47 features
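
As an illustration of Eqs. 1 to 3 and of the "keep features with gain greater than 0" criterion, the following Python sketch computes the information gain of two features on a four-tuple toy set. The function names and the data are assumptions made for this example only; they are not the tooling used to produce Table 6, and the per-model averaging of Eq. 5 is not reproduced here.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Eq. 2: expected information needed to classify a tuple."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def conditional_entropy(values, labels):
    """Eq. 3: size-weighted entropy of the partitions induced by a feature."""
    parts = defaultdict(list)
    for v, y in zip(values, labels):
        parts[v].append(y)
    n = len(labels)
    return sum(len(p) / n * entropy(p) for p in parts.values())

def information_gain(values, labels):
    """Eq. 1: reduction in entropy obtained by splitting on the feature."""
    return entropy(labels) - conditional_entropy(values, labels)

# Toy data (invented): one informative and one uninformative feature.
labels = ["DoS", "DoS", "Normal", "Normal"]
features = {
    "proto": ["udp", "udp", "tcp", "tcp"],  # separates the classes perfectly
    "flag": ["a", "b", "a", "b"],           # independent of the class
}

gains = {name: information_gain(vals, labels) for name, vals in features.items()}
selected = [name for name, g in gains.items() if g > 1e-12]
```

The small tolerance in the last line guards against floating-point noise when a gain is mathematically zero.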

3.3 Classification

The dataset now contains 22 features and 5 classes. We analyse the performance of the IDS on several existing classification models (C5, CHAID, CART, QUEST) and observe the accuracy of each. Figure 4 shows the accuracy of each decision-tree-based model with 22, 13 and 6 features. With 13 features the accuracy decreases only marginally compared with 22 features, whereas with 6 features it is significantly lower. Therefore, 13 features are selected for designing the proposed integrated model. With 13 features, the accuracy of C5 is 89.76, of CART 80.95, of CHAID 81.76 and of QUEST 57.79.

Fig. 4 Comparative analysis on training data set

3.4 Proposed integrated rule based model

Different decision tree models (C5, CHAID, CART, QUEST) are trained with the selected 13 features of the dataset. Rules (R) are derived from each decision tree model in the form rule (number of instances; confidence factor). Rules are then selected from each model on the basis of a threshold confidence factor.

Figure 5 shows the rule generation process. Leaf nodes represent the target class and the path with confidence factor that satisfies the threshold is used as a rule. For example, according to Fig. 5, path F1–F2–F3–L1 satisfies the threshold value. As a result, the rule is generated in the form “If (F1 and F2 and F3) THEN L1”, otherwise the path is rejected.

Fig. 5 Rule generation from decision tree
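
The path-to-rule step can be sketched as a recursive walk over a tree. The nested-dictionary tree layout, the function name `extract_rules` and the confidence values below are all hypothetical and serve only to reproduce the F1–F2–F3–L1 example; the rule format follows the text.

```python
def extract_rules(node, threshold, path=()):
    """Yield (conditions, label) for every leaf whose confidence factor
    reaches the threshold; paths to other leaves are rejected."""
    if "label" in node:  # leaf node
        if node["confidence"] >= threshold:
            yield list(path), node["label"]
        return
    for condition, child in node["children"]:
        yield from extract_rules(child, threshold, path + (condition,))

# Hypothetical tree: only the leaf L1 has a high confidence factor.
tree = {"children": [
    ("F1", {"children": [
        ("F2", {"children": [
            ("F3", {"label": "L1", "confidence": 0.95}),
            ("not F3", {"label": "L2", "confidence": 0.60}),
        ]}),
        ("not F2", {"label": "L3", "confidence": 0.70}),
    ]}),
    ("not F1", {"label": "L4", "confidence": 0.50}),
]}

rules = [f"If ({' and '.join(conds)}) THEN {label}"
         for conds, label in extract_rules(tree, threshold=0.9)]
```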

The confidence factor of a leaf node is defined as the ratio of the number of instances correctly classified by the leaf to the total number of instances classified by that leaf. Mathematically, it is defined by Eq. 6.

$$ {\text{CF}}\; ( {\text{C}}_{\text{i}} ) { = }\frac{{{\text{Number}}\;{\text{of}}\;{\text{instances}}\;{\text{of}}\;{\text{class}}\;{\text{C}}_{\text{i}} \;{\text{truly}}\;{\text{classified}}\;{\text{by}}\;{\text{the}}\;{\text{leaf}}\;{\text{node}} + 1}}{{{\text{Number}}\;{\text{of}}\;{\text{instances}}\;{\text{of}}\;{\text{class}}\;{\text{C}}_{\text{i}} \;{\text{truly}}\;{\text{classified}}\;{\text{by}}\;{\text{the}}\;{\text{leaf}}\;{\text{node}} + {\text{Number}}\;{\text{of}}\;{\text{instance}}\;{\text{misclassified}}\;{\text{by}}\;{\text{the}}\;{\text{leaf}}\;{\text{node}} + {\text{K}}}} $$
(6)

where CF(Ci) = Confidence factor of the leaf node for predicting the class i, K = Number of output classes.
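
A short worked example of Eq. 6 in Python; the function name and the leaf counts are assumptions chosen only to make the arithmetic visible.

```python
def confidence_factor(correct, misclassified, num_classes):
    """Eq. 6: confidence factor of a leaf with the +1 / +K correction."""
    return (correct + 1) / (correct + misclassified + num_classes)

# A leaf that classifies 8 instances correctly and 2 wrongly, with K = 5:
# (8 + 1) / (8 + 2 + 5) = 9 / 15.
cf = confidence_factor(correct=8, misclassified=2, num_classes=5)
```

Without the correction the same leaf would score 8 / 10; the correction pulls the estimate towards 1 / K, most strongly for leaves that cover few instances.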

The addition of 1 in the numerator and K in the denominator is due to the Laplace correction. Different decision tree models (C5, CHAID, CART, QUEST) are trained, and the rules with the highest confidence factors are selected from each model for each category to design our proposed integrated model. The format of the rules generated by the decision tree models is: Rule i for Attack_Type (number of instances classified, confidence factor). For example, suppose C5 generates the following rules for the DoS attack:

Rule1 for DoS (5,1.0)

Rule2 for DoS (3, 0.6)

Rule3 for DoS (8, 1.0)

Rule4 for DoS (5, 0.5)

Rule5 for DoS (2, 0.7)

In this example, the threshold confidence factor for selecting DoS rules from the C5 model is 1.0. As a result, Rule1 and Rule3 are taken as the rules for the DoS attack on the C5 model. In the same way, the threshold confidence factor is 1.0 for the C5 model, 0.92 for CHAID, 0.88 for CART and 0.83 for QUEST. Rules are chosen for each attack from each model by applying the threshold confidence factor of that model. Table 7 lists the rules selected in this way for the different attacks from each decision-tree-based model.
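
The selection step in the DoS example can be written out as a small Python sketch. The `Rule` tuple and `select_rules` function are hypothetical names, the comparison "confidence at least equal to the threshold" is an assumption about how the threshold is applied, and the fifth rule is named Rule5 here so that rule names are unique.

```python
from typing import NamedTuple

class Rule(NamedTuple):
    name: str
    category: str
    instances: int
    confidence: float

# Per-model threshold confidence factors quoted in the text.
THRESHOLDS = {"C5": 1.0, "CHAID": 0.92, "CART": 0.88, "QUEST": 0.83}

def select_rules(rules, model):
    """Keep the rules whose confidence factor reaches the model's threshold."""
    threshold = THRESHOLDS[model]
    return [r for r in rules if r.confidence >= threshold]

# The example rules assumed to be generated by C5 for the DoS attack.
c5_dos = [
    Rule("Rule1", "DoS", 5, 1.0),
    Rule("Rule2", "DoS", 3, 0.6),
    Rule("Rule3", "DoS", 8, 1.0),
    Rule("Rule4", "DoS", 5, 0.5),
    Rule("Rule5", "DoS", 2, 0.7),
]

selected = [r.name for r in select_rules(c5_dos, "C5")]
```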

Table 7 Generated rules for different attacks from each decision tree based models

From Table 7, rule compositions are made for each model: within a model, the rules of each category are combined using the logical OR operation. The resulting compositions are shown in Table 8. The rules in Table 8 are then combined category-wise across the models, as shown in Table 9.

Table 8 Composition of rules, category-wise, for each model
Table 9 Rule composition for proposed model
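
The composition can be pictured with rules represented as Boolean predicates over a traffic record. In the Python sketch below, the rules, cut-off values and helper names are invented for illustration, and combining the per-model compositions across models with OR is an assumption; the text only states that they are combined category-wise.

```python
def any_of(predicates):
    """Combine predicates with logical OR."""
    return lambda record: any(p(record) for p in predicates)

# Hypothetical selected rules per model and category (cut-offs are invented).
model_rules = {
    "C5": {"DoS": [lambda r: r["sbytes"] > 1000 and r["dur"] < 0.1]},
    "CART": {
        "DoS": [lambda r: r["rate"] > 500],
        "Probe": [lambda r: r["dur"] == 0 and r["sbytes"] < 100],
    },
}

def integrate(model_rules):
    """OR the rules of each category within a model, then OR the
    per-model compositions of the same category across models."""
    per_category = {}
    for categories in model_rules.values():
        for category, rules in categories.items():
            per_category.setdefault(category, []).append(any_of(rules))
    return {c: any_of(preds) for c, preds in per_category.items()}

integrated = integrate(model_rules)
record = {"sbytes": 50, "dur": 0, "rate": 900}
matches = sorted(c for c, pred in integrated.items() if pred(record))
```

The toy record satisfies rules of two categories at once; how such conflicts are resolved is not specified in this section, so the sketch simply reports every matching category.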

3.5 Performance analysis on traditional dataset (UNSW-NB15)

The proposed integrated rule-based model is evaluated on the basis of accuracy (Acc), mean F-measure (MFM), average accuracy (AvgAcc), attack accuracy (AttAcc), ADR and FAR [26]. These metrics are computed using Eqs. 7 to 15.

Accuracy (Acc) Accuracy measures how often instances are classified correctly. It is the number of correctly classified instances, summed over all classes, divided by the total number of samples N in the dataset, as shown in Eq. 7.

$$ {\text{Accuracy}} = \frac{{\sum\nolimits_{i = 1}^{|c|} {TP_{i} } }}{N} $$
(7)

Mean F-measure (MFM) The F-measure is used to measure the balance between precision and recall. MFM is the mean of the per-class F-measures and is computed using Eq. 8.

$$ {\text{MFM}} = \frac{{\sum\nolimits_{i = 1}^{|c|} {FMeasure_{i} } }}{|c|} $$
(8)

where

$$ FMeasure_{i} = \frac{{2 \cdot Recall_{i} \cdot Precision_{i} }}{{Recall_{i} + Precision_{i} }} $$
(9)
$$ {\text{Precision}}_{i} = \frac{{TP_{i} }}{{TP_{i} + FP_{i} }} $$
(10)
$$ {\text{Recall}}_{i} = \frac{{TP_{i} }}{{TP_{i} + FN_{i} }} $$
(11)

Here, \( TP_{i} \) is the number of instances that actually belong to class \( i \) and are predicted as class \( i \), \( FP_{i} \) is the number of instances whose actual class is other than class \( i \) but which are predicted as class \( i \), and \( FN_{i} \) is the number of instances that actually belong to class \( i \) and are falsely predicted to belong to another class.

Average accuracy (AvgAcc) It is calculated as the average of the recall of all classes of a dataset using Eq. 12.

$$ {\text{AvgAcc}} = \frac{1}{\left| C \right|}\mathop \sum \limits_{i = 1}^{\left| C \right|} Recall_{i} $$
(12)

Attack Accuracy (AttAcc) It measures the efficiency of a model in detecting only the attack classes, excluding normal traffic (class 1). It is computed as in Eq. 13.

$$ {\text{AttAcc}} = \frac{1}{\left| C \right| - 1}\mathop \sum \limits_{i = 2}^{\left| C \right|} Recall_{i} $$
(13)

Attack detection rate (ADR) It is the accuracy rate over the attack classes and is defined as in Eq. 14.

$$ {\text{ADR}} = \frac{{\mathop \sum \nolimits_{i = 2}^{\left| C \right|} TP_{i} }}{{\mathop \sum \nolimits_{i = 2}^{\left| C \right|} \left( TP_{i} + FP_{i} \right) }} $$
(14)

False alarm rate (FAR) It is the fraction of normal instances misclassified as attack and is measured as in Eq. 15.

$$ {\text{FAR}} = \frac{{FN_{1} }}{{TP_{1} + FN_{1} }} $$
(15)
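Eqs. 7 to 15 can be sketched in a few lines of code. The sketch assumes a confusion matrix whose rows are actual classes and whose columns are predicted classes, with the Normal class at index 0 (class 1 in the equations). The function name ids_metrics and the 3-class matrix are illustrative and are not taken from the tables of this paper.

```python
def _ratio(numerator, denominator):
    """Division that returns 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def ids_metrics(cm):
    """Compute Acc, MFM, AvgAcc, AttAcc, ADR and FAR from a confusion matrix.

    cm[a][p] counts instances of actual class a predicted as class p.
    Index 0 is assumed to be the Normal class.
    """
    k = len(cm)
    total = sum(sum(row) for row in cm)
    tp = [cm[i][i] for i in range(k)]
    fp = [sum(cm[r][i] for r in range(k)) - tp[i] for i in range(k)]
    fn = [sum(cm[i]) - tp[i] for i in range(k)]

    precision = [_ratio(tp[i], tp[i] + fp[i]) for i in range(k)]   # Eq. 10
    recall = [_ratio(tp[i], tp[i] + fn[i]) for i in range(k)]      # Eq. 11
    f_measure = [_ratio(2 * recall[i] * precision[i],
                        recall[i] + precision[i]) for i in range(k)]  # Eq. 9

    return {
        "Acc": _ratio(sum(tp), total),                              # Eq. 7
        "MFM": sum(f_measure) / k,                                  # Eq. 8
        "AvgAcc": sum(recall) / k,                                  # Eq. 12
        "AttAcc": sum(recall[1:]) / (k - 1),                        # Eq. 13
        "ADR": _ratio(sum(tp[1:]), sum(tp[1:]) + sum(fp[1:])),      # Eq. 14
        "FAR": _ratio(fn[0], tp[0] + fn[0]),                        # Eq. 15
    }


# Illustrative 3-class matrix (Normal plus two attack classes).
example_cm = [
    [8, 2, 0],
    [1, 7, 2],
    [0, 1, 9],
]
metrics = ids_metrics(example_cm)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```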

In this section, the performance of the proposed model is evaluated using the UNSW-NB15 test dataset and compared with other existing techniques. The confusion matrices for the UNSW-NB15 test dataset on the C5 model and the proposed integrated model are shown in Table 10 and Table 11, respectively. Table 12 shows the confusion matrix for UNSW-NB15 based on Dendron, proposed in [26]. A confusion matrix is an N × N matrix, where N is the total number of classes. It is used to visualize how the instances of a dataset are classified. Diagonal entries show the number of instances correctly classified. The row class indicates the actual class label of a data instance and the column class indicates its predicted class label.
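The construction of such a matrix can be sketched as follows, with rows indexed by the actual class and columns by the predicted class. The label order (Normal first) and the short label lists are assumptions made for illustration only.

```python
CATEGORIES = ["Normal", "DoS", "Probe", "Exploit", "Generic"]


def confusion_matrix(actual, predicted, labels=CATEGORIES):
    """Count instances per (actual, predicted) pair; rows are actual classes."""
    index = {label: i for i, label in enumerate(labels)}
    cm = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        cm[index[a]][index[p]] += 1
    return cm


# Illustrative labels, not taken from the UNSW-NB15 test set.
actual = ["Normal", "DoS", "DoS", "Probe"]
predicted = ["Normal", "DoS", "Normal", "Probe"]
cm = confusion_matrix(actual, predicted)
for label, row in zip(CATEGORIES, cm):
    print(label, row)
```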

Table 10 Confusion matrix of C5 on UNSW test dataset
Table 11 Confusion matrix of the Proposed Integrated model on UNSW test dataset
Table 12 Confusion matrix of Dendron for UNSW-NB15 [26]

Figures 6 and 7 depict the precision and recall of the different categories on the UNSW-NB15 testing dataset for the proposed integrated model and the C5 model, respectively. From these figures and Table 13, it can be observed that, compared to the C5 model, the proposed model shows higher precision and recall for Probe, Normal, Generic and Exploit, and average values for DoS.

Fig. 6 Precision and Recall comparison of different categories on Proposed Integrated model

Fig. 7 Precision and Recall comparison of different categories in C5 model

Table 13 Metrics summary of C5 and Proposed Integrated model using UNSW-NB15 test dataset

In this paper, the accuracy of the C5 decision tree based model is found to be higher than that of the other existing decision tree based models. Therefore, the performance of the proposed model is compared with the C5 decision tree based model. The analysis of the test dataset on the C5 and proposed integrated models is shown in Table 13, and the graphical visualization is shown in Fig. 8. The proposed IDS model is also compared with ENADS [24], which also applies a decision tree based model on the UNSW-NB15 dataset, as shown in Fig. 9. It is observed that the proposed system shows lower FAR than ENADS [24] due to fewer false negatives.

Table 14 Features extracted and mapped to the traditional data set features
Fig. 8 Performance comparison of C5 and Proposed Integrated model on UNSW test dataset

Fig. 9 Performance comparison of ENADS [24] and Proposed Integrated model on UNSW test dataset

From Fig. 10 it is observed that the performance of the proposed integrated model is better than that of Dendron [26]. The reason is that Dendron [26] considers 10 categories including Normal, which overlap with each other. In our proposed integrated model, 5 categories including Normal are considered, and they represent the rest of the categories due to their overlapping nature. Further, the proposed model shows a lower misclassification rate than Dendron.

Fig. 10 Comparison of Dendron [26] versus Proposed Integrated model on UNSW-NB15 test dataset

The average detection rate, i.e. average accuracy (AvgAcc), is calculated as the average of the recall of all classes present in the dataset. From Fig. 8 it is observed that the AvgAcc of the proposed system (65.21%) is lower than that of C5 (75.8%) due to the lower recall for the DoS and Exploit attacks. On the other hand, Fig. 10 shows that the AvgAcc of the proposed system (65.21%) is higher than that of Dendron (52.21%) due to the higher recall of the classes of the proposed system. Moreover, it is observed from Figs. 8 and 10 that the ADR of the proposed system (90.32%) is higher than that of C5 (83.47%) and Dendron (63.76%) due to the higher precision of the predicted classes of the proposed system.

4 Working example of proposed model

In this module, the real time data set is designed and the performance of the proposed model on that data set is evaluated. The data set is collected from the setup at the CSE lab in NIT Patna. This phase mainly consists of three parts: (1) Data collection and Feature Extraction, (2) Dataset description and (3) Performance Evaluation on Real time dataset (RTNITP18).

4.1 Data collection and feature extraction phase

In this phase, we generate our own real time data set to evaluate the proposed model. The data set is generated by establishing a setup at the laboratory of the CSE department at NIT Patna. This lab consists of 40 systems, a few of which act as attackers while the rest act as normal users or victims, and we observe the packet flow in the network for 7 days. Kali Linux is installed on each system for the purpose of observation. Kali Linux is a popular open-source platform which provides a set of security tools for hackers; its official webpage is https://www.kali.org. We install the Metasploitable operating system on the victim nodes in the lab. To perform the real time data generation, msfconsole (Kali) is used as the attack generator in the network, and Metasploitable, an intentionally vulnerable version of Ubuntu, is used as the victim. Feature extraction for the Probe, Exploit, DoS and Generic attacks is described below.

A. Probe The NMAP tool is used for the Probe attack. This tool is mainly used in the scanning phase of the attack process. A screenshot of the feature extraction for the Probe attack is given in Fig. 11.

Fig. 11 Feature extraction for Probe attack

B. Exploit The information gathered through the Probe attack (or probing) is further used to perform the attack called Exploit. In this attack, the attacker node uses this information and sends packets accordingly in order to attack the victim without being noticed.

C. DoS The DoS attack is performed using Ettercap, which is available in Kali Linux. Ettercap can be opened using the command sudo ettercap -G. After opening Ettercap, go to the Sniff menu and select Unified sniffing, which pops up the Ettercap input window with eth0. After this, the plugin, protocol dissector and monitored port information becomes visible. The DoS attack can then be started using the plugin. A screenshot showing the attack is given in Fig. 12.

Fig. 12 Performing DoS attack

D. Generic It is a type of attack in which the attacker does not consider the cryptographic implementation of any primitives while running the attack. As an example, consider a cipher text with a K bit key: in the generic brute force attack, the attacker tries every possible combination of the K bits, i.e. \( 2^{K} \) combinations, and tries to decrypt the text. This attack is performed using hydra in Kali Linux. Hydra is broadly used as a login cracker which provides a way to attack several protocols such as Cisco AAA, FTP, Cisco auth, XMPP etc.

To open hydra, go to Applications → Password Attacks → Online Attacks → hydra. It will open the terminal console. We attack the FTP service of the Metasploitable machine, which has the IP 192.168.1.101. In Kali Linux a word list is created with the extension ‘lst’ in the path /usr/share/wordlists/metasploit. To perform the attack we use the command hydra -l /usr/share/wordlists/metasploit/user -p /usr/share/wordlists/metasploit/passwords ftp://192.168.1.101 -v. Figure 13 depicts a screenshot of the Generic attack, where the user name and password are successfully recovered.

Fig. 13 Process performing Generic Attack

E. Normal This category of data does not contain any malicious activity and is collected from the LAN in a regular working environment.

Data for all categories are captured using the Wireshark packet sniffing tool, which can be found at http://git.kali.org/gitweb/?p=packages/wireshark.git;a=summary. A snapshot of the Wireshark capturing process is shown in Fig. 14. We have captured 2000 data packets for each category under consideration over 7 consecutive working days. To generate the RTNITP18 dataset, attacks are performed using the tools available in Kali Linux. Attack generation is automated either through a command line or directly using the tool, and the traffic is captured at the victim end. A total of 10,000 data packets are sampled for all categories through Wireshark at the victim end. The dataset is created from the behavior of the data packets captured at the victim nodes rather than at the source nodes. Captured data samples are saved in separate files with the extension .pcapng at the victim end.

The basic features present in the captured data samples are time, source IP, destination IP, protocol, length and info, where info contains other information such as port numbers, the acknowledgement (ACK) bit, segment size, window size etc. Apart from the basic features, we need derived features for the data set. For this, we have exported the .pcapng file to a .txt file, which contains all the other detailed information such as the time to live (TTL) field. We have used this .txt file to extract the derived features using networking concepts. For example, to calculate sbytes we have chosen pairs of source IP and destination IP and calculated the total number of bytes sent from source to destination. In a similar fashion, dbytes is calculated. We have used a Python script to automate the extraction of the derived features and to map those features to the 13 features of the traditional dataset. This real time data set is known as RTNITP18, and our proposed integrated model is evaluated on it. The features of RTNITP18 are detailed in Table 14.
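The sbytes and dbytes computation described above can be sketched as follows. This is not the authors' script: the record layout (source IP, destination IP, length) and the function name byte_counts are assumptions, and dbytes is taken here to be the bytes sent in the reverse direction of the same pair.

```python
from collections import defaultdict


def byte_counts(packets):
    """Aggregate packet lengths per (source IP, destination IP) pair.

    packets is an iterable of (src_ip, dst_ip, length) tuples, a
    hypothetical layout for rows parsed from the exported .txt file.
    """
    sent = defaultdict(int)
    for src, dst, length in packets:
        sent[(src, dst)] += length

    features = {}
    for (src, dst), sbytes in sent.items():
        # dbytes: bytes flowing back from the destination to the source.
        features[(src, dst)] = {
            "sbytes": sbytes,
            "dbytes": sent.get((dst, src), 0),
        }
    return features


# Illustrative packets; the addresses and lengths are made up.
packets = [
    ("10.0.0.1", "10.0.0.2", 100),
    ("10.0.0.1", "10.0.0.2", 50),
    ("10.0.0.2", "10.0.0.1", 40),
]
features = byte_counts(packets)
print(features[("10.0.0.1", "10.0.0.2")])
```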

Fig. 14 Feature extraction applying Wireshark

4.2 Dataset description

The dataset generated at the NIT Patna lab, with 13 features, 10,000 instances and five categories (DoS, Probe, Generic, Exploit, Normal), is called the Real Time Dataset at NIT Patna (RTNITP18). RTNITP18 acts as the testing data set: we evaluate the performance of the proposed classification based model on the RTNITP18 data set.

The RTNITP18 data set is captured in a separate file for each category. We have selected random data in the same proportion, i.e. 10%, from each category, which gives 200 randomly sampled instances per category (10% of the 2000 captured), as shown in Table 15. We have stored each category in a separate .csv file and, for testing the proposed model, merged all the files into a single .csv file.
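The sampling and merging step can be sketched as below, assuming in-memory records instead of the actual per-category files. The integer placeholders stand in for the 13-feature rows, and the function names, the column names and the fixed seed are illustrative choices.

```python
import csv
import io
import random

CATEGORIES = ["DoS", "Probe", "Generic", "Exploit", "Normal"]


def sample_per_category(records_by_category, fraction=0.10, seed=0):
    """Draw the same fraction at random from every category and merge."""
    rng = random.Random(seed)
    merged = []
    for category in CATEGORIES:
        records = records_by_category[category]
        k = round(len(records) * fraction)
        merged.extend((category, record) for record in rng.sample(records, k))
    return merged


def write_merged_csv(rows, handle):
    """Write the merged sample to a single CSV stream."""
    writer = csv.writer(handle)
    writer.writerow(["category", "record_id"])
    writer.writerows(rows)


# Placeholder data: 2000 record ids per category.
captured = {category: list(range(2000)) for category in CATEGORIES}
merged = sample_per_category(captured)
buffer = io.StringIO()
write_merged_csv(merged, buffer)
print(len(merged))
```

Sampling a fixed fraction per category keeps the test set balanced across the five categories.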

Table 15 RTNITP18 data set for each category of attack

4.3 Performance evaluation on real time dataset (RTNITP18)

This part describes the performance of the proposed model on the RTNITP18 data set, which is supplied to the proposed model in order to evaluate its performance. The confusion matrix for the different types of attacks on the proposed IDS model is shown in Table 16. Table 17 shows the precision and recall, and Table 18 shows the accuracy, ADR, FAR and other metric values for RTNITP18 on the proposed model.

Table 16 Confusion matrix for different categories using RTNITP18 dataset on the proposed IDS
Table 17 Precision and Recall values for RTNITP18 on proposed model
Table 18 Performance metrics of the proposed model on RTNITP18 dataset

It is observed from Fig. 15 that Normal and Probe have the highest precision and recall values compared to DoS and Exploit. Furthermore, DoS has higher recall than Exploit. On the real time data set (RTNITP18), the proposed model gives an accuracy of 83.8% and an ADR of 88.29% for the five categories, as shown in Fig. 16. Hence it can be concluded that the proposed model also performs well on the RTNITP18 dataset.

Fig. 15 Precision and Recall of RTNITP18 dataset on proposed integrated model

Fig. 16 Performance analysis of proposed model using RTNITP18

5 Conclusions

This paper proposes an integrated classification based IDS and evaluates its performance on an offline traditional data set and an online real time data set. The offline evaluation uses a new data set (UNSW-NB15), which covers more recent attacks (DoS, Exploit, Normal, Probe, Generic) than the KDD99 data set. It is observed that the proposed model performs better on several evaluation metrics (e.g. MFM = 84.5%, ADR = 90.32%, FAR = 2.01%) than other existing traditional decision tree based models. Since the proposed approach is misuse based, it is not able to detect zero day attacks, which are publicly unknown and for which no signature exists. However, once such an attack has been performed, its signature becomes available, and the IDS model can be updated with the signature to prevent attacks of that category. This paper also generates a real time data set at the NIT Patna CSE lab (RTNITP18), which acts as the testing data set to evaluate the performance of the proposed model; the accuracy of the proposed model on it is 83.8%. We conclude that the proposed integrated model acts as a watchdog in the network to protect the systems of the organisation from malicious attacks. In future, we will try to address the following drawbacks of the proposed integrated model: (1) enhance the detection rate on the real time data set and (2) develop the ability to classify new unknown attacks.