An integrated rule based intrusion detection system: analysis on UNSW-NB15 data set and the real time online dataset

Kumar, Vikash; Sinha, Ditipriya; Das, Ayan Kumar; Pandey, Subhash Chandra; Goswami, Radha Tamal

doi:10.1007/s10586-019-03008-x

Published: 29 October 2019

Cluster Computing, Volume 23, pages 1397–1418 (2020)

Vikash Kumar¹,
Ditipriya Sinha ORCID: orcid.org/0000-0003-3115-4750¹,
Ayan Kumar Das²,
Subhash Chandra Pandey² &
…
Radha Tamal Goswami³


Abstract

Intrusion detection systems (IDS) have been developed to protect network resources from different types of threats. Existing IDS methods can be classified as anomaly based, misuse (signature) based, or sometimes a combination of both. This paper proposes a novel misuse based intrusion detection system that detects five categories in a network: Exploit, DoS, Probe, Generic and Normal. Most related work on IDS is based on the KDD99 or NSL-KDD data sets, which are considered obsolete for detecting recent types of attacks and are of little significance today. In this paper the UNSW-NB15 data set is used as the offline data set to design an integrated classification based model for detecting malicious activities in the network. The performance of the proposed integrated classification based model is considerably higher than that of other existing decision tree based models in detecting these five categories. Moreover, this paper generates its own real time data set at the NIT Patna CSE lab (RTNITP18), which serves as the working example of the proposed intrusion detection model. The RTNITP18 data set is used as a test data set to evaluate the performance of the proposed model. The performance analysis of the proposed model with UNSW-NB15 (benchmark data set) and the real time data set (RTNITP18) shows higher accuracy, attack detection rate, mean F-measure, average accuracy and attack accuracy, and a lower false alarm rate, in comparison to other existing approaches. The proposed IDS model acts as a watchdog that detects different types of threats in the network.


1 Introduction

The Internet has become an essential tool of the modern era. Technologies such as the Local Area Network (LAN), Wide Area Network (WAN) and Wireless Local Area Network (WLAN) have made computer networking attractive for enterprises, security services, health care and other emergency services, and the Internet now plays an indispensable role in daily life. Attackers can take advantage of this dependency and threaten it with botnet based DDoS attacks [4], viruses, Trojans, worms, spyware and similar threats. These attacks are becoming more sophisticated, and their number is increasing day by day. Several models and mechanisms [12, 21] have already been proposed to defend against different attacks. However, providing security to every environment in the network remains a major challenge. Intrusion detection systems (IDS) have been developed to defend against these threats. An IDS can be considered a set of techniques and methods used to detect suspicious activities in the network. Existing IDS methods can be classified as anomaly based, misuse (signature) based, or sometimes a combination of both.

This paper proposes a misuse based IDS which detects five categories: Exploit, DoS, Probe, Generic and Normal. Although a firewall provides good control over access to resources on the Internet, attackers have developed various techniques to bypass it. Because the proposed system is misuse based, it can act like a firewall with some extra information added to it, so the system is not limited to the functionality of an IDS. The advantage of the proposed IDS is that it assists the network administrator in classifying the captured traffic into five categories, including the normal category, which leads to a lower false alarm rate (FAR) than an anomaly based IDS.

Most research on IDS is based on the KDD99 or NSL-KDD [2, 3, 8, 16, 18, 20, 22, 28] data sets. These data sets are considered obsolete for detecting recent types of attacks. In the proposed work, UNSW-NB15 [23] is used as the offline data set to design the proposed IDS model. Moustafa et al. [24] applied the UNSW-NB15 and KDD99 data sets to intrusion detection and compared the performance on both. Their comparison shows that an IDS built on UNSW-NB15 covers more recent attacks than one built on KDD99, and argues that the KDD99 data set is of little significance today. Their work suggests that the decision tree model performs best on the introduced data set. However, the performance evaluation in [24] using UNSW-NB15 shows that the intrusion detection rate is not high, owing to the overlapping nature of several attacks. In [26], the authors presented a decision tree approach combined with a genetic algorithm to design a misuse based IDS; three data sets (KDDCup’99, NSL-KDD and UNSW-NB15) were used to build separate models, each of which was then tested on its respective test set. In this paper, we propose our own integrated model whose accuracy is higher than that of ENADS [24] and Dendron [26]. The performance of the proposed integrated model is also evaluated on a real time data set generated at the NIT Patna lab. Sangkatsanee et al. [28] proposed a real time IDS using machine learning techniques, with KDD99 as the offline data set and RLD09 as the real time data set. The offline data set used in [28], KDD99, is now obsolete and significantly biased, and the real time data set created there was evaluated on an existing decision tree based model considering only three categories (Normal, Probe, DoS).
In our proposed work, we have also designed a real time data set, RTNITP18, which is evaluated on our own proposed integrated model to detect five categories (Normal, Probe, DoS, Exploit, Generic). Forty nodes at NIT Patna were chosen for this purpose. All of these nodes are connected in a LAN; some of them act as attackers and some as victims. We observed the packet flows in that network for 7 days, from Monday to Sunday, and captured data packets using the Wireshark packet sniffer tool [5]. To extract features from the captured packets, we export the captured pcap file to .txt. We then use a Python script to extract features, map them to the features of the traditional data set, and save the result to a .csv file. In this way we obtain a real time data set. We evaluate the performance of the proposed model on this real time data set and observe an accuracy of 85.8%. Several evaluation metrics (e.g. ADR = 90.32%, FAR = 2.01%) are also better for the proposed integrated model than for other existing models [24, 26]. Hence the novelty of the proposed IDS is the development of a model on the basis of a recent data set (UNSW-NB15), whose attack behaviour is far more complex than that of older data sets (KDD99, NSL-KDD etc.). The proposed model is built from different decision tree models by retaining rules with a high confidence factor, which in turn reduces the FAR of the proposed model. In addition, the proposed model is able to detect five categories, including the normal category, with a high attack detection rate (ADR) and a low FAR compared to other recent approaches. The model is also evaluated on the real-time data set generated by setting up a virtual environment at the NIT Patna CSE lab.
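The export-and-extract step described above can be illustrated with a minimal sketch. It is not the authors' script: the text export format (one packet per line as `timestamp src_ip dst_ip protocol length`), the function names and the chosen output columns (named loosely after UNSW-NB15 style features) are all assumptions made for illustration.

```python
import csv
import io
from collections import defaultdict

# Hypothetical text export: "timestamp src_ip dst_ip protocol length_bytes" per packet.
SAMPLE_EXPORT = """\
0.000 10.0.0.5 10.0.0.9 TCP 60
0.010 10.0.0.9 10.0.0.5 TCP 60
0.500 10.0.0.5 10.0.0.9 TCP 1500
1.200 10.0.0.7 10.0.0.9 UDP 120
"""

FIELDS = ["srcip", "dstip", "proto", "dur", "spkts", "sbytes"]  # assumed column names


def extract_flow_features(text):
    """Group packets by (src, dst, proto) and compute simple per-flow features."""
    flows = defaultdict(lambda: {"pkts": 0, "bytes": 0, "start": None, "end": None})
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 5:
            continue  # skip malformed lines
        ts, src, dst, proto, length = parts
        ts, length = float(ts), int(length)
        flow = flows[(src, dst, proto)]
        flow["pkts"] += 1
        flow["bytes"] += length
        if flow["start"] is None:
            flow["start"] = ts
        flow["end"] = ts
    rows = []
    for (src, dst, proto), f in flows.items():
        rows.append({
            "srcip": src,
            "dstip": dst,
            "proto": proto.lower(),
            "dur": round(f["end"] - f["start"], 6),  # flow duration in seconds
            "spkts": f["pkts"],                      # packets sent by the source
            "sbytes": f["bytes"],                    # bytes sent by the source
        })
    return rows


def to_csv(rows):
    """Serialise the flow records as CSV text."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


rows = extract_flow_features(SAMPLE_EXPORT)
print(to_csv(rows))
```

A real pipeline would need many more features and a proper mapping to the benchmark data set's schema; the sketch only shows the general shape of aggregating packets into labelled-ready flow records.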

The proposed model can be used in different domains of industrial application. Industrial control systems (ICS) are widely used in different domains; they entail real-time data acquisition and system monitoring, and incorporate automatic control and management of industrial processes. However, an ICS is an attractive target for hackers, so its security is of paramount importance. The proposed IDS is designed for the automatic detection of malicious attacks. It can collect and analyse different sources of information, such as network traffic and security logs, and can check for security infringements by auditing data and information from key points of the computer system. Evidence collection using digital forensics is another important domain in which an IDS can be of substantial use. A modified version of the IDS can notify the administrator by sending an alert and can also activate a digital forensic tool to capture the current state of the system. The captured system image includes the entire information pertaining to the system at the moment the attack was taking place, and such images can therefore be used as evidence in legal proceedings. The proposed IDS can thus be applied to maintaining the security of ICS and in the domain of digital forensics. It can also be used by any organization, installed on a network device to protect that organization; any malicious activity found is reported to the security administrator for further action. The proposed model can likewise be used to provide security in an IoT environment. Hence our proposed IDS acts as a watchdog for detecting threats on the Internet.

The key contributions of our proposed approach are given below:

1. Most related work on IDS is based on the KDD cup99 or NSL-KDD [2, 3, 8, 16, 18, 20, 22, 28] data sets, which are out of date in the sense that most recent attacks are not covered. In this paper, we use a newer data set (UNSW-NB15) which covers more recent attacks than the KDD cup99 data set.
2. An integrated rule based model for IDS is proposed. The detection rate of the proposed model is high in comparison to other traditional decision tree based models and existing state-of-the-art works on IDS [24, 26]. Several other metrics (discussed in Sect. 3.5) are also used to evaluate the proposed work and compare it with other state-of-the-art techniques.
3. This paper generates a real-time data set at the NIT Patna CSE lab (RTNITP18), which acts as a working example to evaluate the performance of the proposed model in a real-time environment.
4. The proposed model considers five categories in such a way that some other attacks can also be identified, whereas most related works on IDS consider only two to three types of attacks in the network. The performance of the proposed model with UNSW-NB15 (benchmark data set) and the real-time data set (RTNITP18) shows higher accuracy and ADR in comparison to other existing approaches.
5. The proposed IDS model acts as a complement to the firewall: it collects the traffic arriving at the network from the Internet, as well as the traffic of the organization, and analyses it for any malicious activity.
6. The proposed IDS model acts as an out-of-band device on the network, so it does not introduce any jitter into the network, which is an advantage over an IPS.

Figure 1 shows the working environment of the IDS, where data coming from the Internet goes to the firewall as well as to the IDS in order to find any malicious activity not caught by the firewall. The proposed IDS also monitors the traffic inside the organization for internal intruders.

The rest of the paper is organized as follows. Section 2 briefly describes the related work. The proposed model, along with the result analysis, is discussed in Sect. 3. In Sect. 4, a working example for the evaluation of the proposed model is described. Finally, concluding remarks on the proposed model are given in Sect. 5.

2 Related work

This section reviews the state-of-the-art studies on IDS. Most previous work is based on the KDD99 [18] data set, which is an old data set and does not include the most recent attack categories. Research on IDS can be divided into two categories: (1) works in which detection is based on provided signatures, and (2) works in which a normal profile is generated and any deviation from that profile is reported as an attack; this is also called behaviour based detection. In the first category, false reporting is very low, but the system is less able to detect new types of attack whose signatures are not yet known. In the second category, the system is more prone to false reporting, although a novel attack can be detected.
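The trade-off between the two categories can be made concrete with a small sketch. The signature set, record fields and the three-standard-deviation threshold below are hypothetical choices for illustration and are not taken from any of the cited works.

```python
from statistics import mean, pstdev

# Hypothetical signatures: (protocol, destination port) pairs known to be malicious.
SIGNATURES = {("tcp", 23), ("udp", 1900)}


def misuse_detect(record):
    """Signature (misuse) detection: flag only traffic matching a known signature."""
    return (record["proto"], record["dport"]) in SIGNATURES


def build_profile(normal_sizes):
    """Learn a normal profile as the mean and standard deviation of payload size."""
    return mean(normal_sizes), pstdev(normal_sizes)


def anomaly_detect(record, profile, k=3.0):
    """Anomaly detection: flag traffic deviating more than k sigma from the profile."""
    mu, sigma = profile
    return abs(record["bytes"] - mu) > k * sigma


profile = build_profile([100, 110, 90, 105, 95])
known = {"proto": "tcp", "dport": 23, "bytes": 100}     # matches a signature
novel = {"proto": "tcp", "dport": 8080, "bytes": 5000}  # no signature, unusual size

print(misuse_detect(known), anomaly_detect(known, profile))
print(misuse_detect(novel), anomaly_detect(novel, profile))
```

The known attack is caught only by the signature check, while the novel one is caught only by the profile check, which mirrors the strengths and weaknesses described above.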

In [2], a feature selection model for IDS based on Ant Colony Optimization (ACO) has been proposed. In this process, features are treated as nodes in a graph representation, and an edge between two nodes represents the choice of the next feature. The optimal subset of features is selected by traversing the graph. In this model, the pheromone and heuristics are not associated with the links, as in conventional ACO traversal; rather, they are associated with the features themselves. The pheromone represents the attractiveness of a feature. Wang et al. [32] also proposed a function for selecting features using the neighbourhood discernibility matrix. This matrix shows the classification ability of a feature subset, and the function helps to determine the significance of a candidate attribute. Here, a dependency function is used to analyse the relevance between features and the decisions made using those features.

Several recent works on IDS/IPS have been proposed to protect against DoS/DDoS attacks. Agarwal et al. [1] have proposed an IDS along with an Intrusion Prevention System (IPS) for detecting and recovering from DoS attacks in Wi-Fi networks based on the IEEE 802.11 standard. They have also used the Angle of Arrival (AoA) [19] algorithm over RSSI [31] to find the location of the attacker. In [25], the authors have proposed an enhanced Confidence Based Filtering (CBF) method to protect cloud services from DDoS attacks by using the concept of correlation patterns. Gupta et al. [13] have proposed a flow and volume based approach to detect DDoS attacks in an ISP domain; their simulation in the network simulator NS-2 shows good detection with a low false alarm rate. Gou et al. [11] have proposed a framework for IDS based on Petri nets that consists of two different functions, one for attack detection and one for upgrading the model.

The authors of [20] have proposed an IDS that combines fuzzy set theory with association rule mining and genetic network programming (GNP), using the KDD99 data set. The concept of sub-attribute utilization is used to extract discrete and continuous attributes in order to avoid data loss, and it enables effective rule mining using GNP. Wattanapongsakorn et al. [33] have proposed an Intrusion Detection and Prevention System (IDPS) covering four categories (Normal, DoS, Probe, Worm). A fuzzy genetic algorithm is used for unknown attacks, and several machine learning techniques (C4.5 Decision Tree, Random Forest, Ripple Rule, Bayesian Network, and Back Propagation Neural Network) are used for known attacks. In [37], the authors have introduced the intuitionistic fuzzy rough graph, which is used to handle uncertainty and incomplete information in an information system, and have also proposed an algorithm that can efficiently solve decision making problems.

Das et al. [7] proposed an NIDS model which detects port scan attacks using the machine learning concept of SVM. They trained their model on the pattern of frequency change in normal and attack packets. Data is captured every 4 s by the NIDS for analysis. They used the rough set method, in preference to PCA, for optimal feature selection, and only one type of attack has been considered. An IDS model is proposed in [16] by applying SVM to the NSL-KDD data set [27]. The authors present a framework which selects the features of the NSL-KDD data that distinguish normal traffic from abnormal traffic more accurately. The framework uses filter and wrapper methods for feature selection and ranks the selected features using the information gain ratio. Chowdhury et al. [6] have proposed a combination of two machine learning algorithms for the classification of anomaly based intrusions. Simulated annealing is applied to generate a random set of 3 features each time, and SVM is then applied to the selected set to detect anomalous behaviour. The algorithm uses the data set from the Australian Centre for Cyber Security [23].
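Ranking features by information gain ratio, as mentioned for the framework in [16], can be sketched as follows. This uses the standard textbook definition (information gain divided by the split information of the feature) on a toy data set with made-up feature values; it is not the implementation from [16].

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a list of discrete values."""
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in Counter(values).values())


def gain_ratio(feature_values, labels):
    """Information gain of a discrete feature divided by its split information."""
    total = len(labels)
    groups = {}
    for v, y in zip(feature_values, labels):
        groups.setdefault(v, []).append(y)
    # Expected entropy of the class labels after splitting on the feature.
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - conditional
    split_info = entropy(feature_values)
    return info_gain / split_info if split_info > 0 else 0.0


# Toy data: four records with two hypothetical categorical features.
labels = ["normal", "normal", "attack", "attack"]
features = {
    "proto": ["tcp", "tcp", "udp", "udp"],      # separates the classes perfectly
    "service": ["http", "dns", "http", "dns"],  # carries no class information
}

ranking = sorted(features, key=lambda f: gain_ratio(features[f], labels), reverse=True)
print(ranking)
```

Continuous features would need to be discretised first; the sketch handles categorical values only.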

Fares et al. [8] proposed an NIDS model using the concept of Neural Networks (NN), divided into 3 phases. They reduced the data set in a pre-processing phase so that overfitting due to dominating attack categories does not occur. The categories considered by their model are Normal, DoS, Probe, R2L and U2R. They compared the performance of their proposed model on both data sets (10% of KDD99 and the reduced data set). The model was trained and tested only on offline data. In [3], an IDS using NN with a genetic algorithm is proposed to improve accuracy, using KDD99 as the benchmark data set. The authors used NN with resilient back propagation and a sigmoid function. In [30], the authors have proposed a lightweight IDS for anomaly detection using KDD99 by focusing on three major aspects: (1) removing redundant data from the data set, (2) feature extraction, and (3) realization of the proposed IDS. The IDS uses a wrapper approach for feature selection. A bagging approach is used to generate multiple training data sets, which are then used to train multiple neural networks. Using the output of these NNs, a new training data set is generated by replacing the class labels of the original data set with the output labels. The newly generated data set is then used with the C4.5 model.

Many researchers are working to design efficient IDS by using machine learning techniques. In [28], the authors have proposed a real-time IDS which works on three categories: Normal, DoS and Probe. The work is divided into three phases: (1) pre-processing phase, (2) classification phase and (3) post-processing phase. They have compared the performance on KDD99 and RLD09 [28] with different machine learning techniques, and only two types of attacks have been considered. Kalekar et al. [17] proposed a real-time IDS using a Naïve Bayes classifier. Their proposed model classifies any packet as normal or abnormal; however, its performance is not evaluated on any data set. An algorithm for removing outliers from KDD cup99 is proposed in [22]. It makes the data set compatible with the Weka tool [34]. After executing the proposed algorithm, the authors used 10% of the data set and evaluated performance on different machine learning algorithms (Bayesian network, naïve Bayes classifier, J48, J48 Graft, and Random forest). Performance evaluation is done using precision, recall and F-measure parameters [10]. In [14], the authors have proposed a host based IDS using a hidden Markov model. The proposed model is evaluated using publicly available data sets from the University of New Mexico (UNM) and the Massachusetts Institute of Technology (MIT) Artificial Intelligence Laboratory. In this model, the training data set is divided into K equal sized sub-sequences, and the sub-sequences with the least correlation between them are used to train sub-models. The trained sub-models are then merged incrementally to build the final model. Moustafa et al. [24] have introduced a new data set (UNSW-NB15) for testing any IDS. According to the authors, the data set covers the most recent attacks. They have described the completeness of the data set and also evaluated the performance of different machine learning algorithms on it.
They also compared the analysis with the existing KDD99 data set. Though their work suggests that the decision tree model has the best performance on the introduced data set, the accuracy is not very high. In [35], a method for host-based anomaly detection has been presented which uses the k-Means clustering technique with the ID3 decision tree model. First, k-Means clustering, with Euclidean distance as the similarity measure, is applied to partition the training data; ID3 is then applied to each cluster to build a decision tree. The results of these two phases are combined using a special algorithm.
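The precision, recall and F-measure parameters used in several of the works above follow the standard definitions. A minimal sketch for a single class of interest is shown below; the labels are toy data chosen only for illustration.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Standard precision, recall and F-measure for one class of interest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy labels: two of three DoS records are found, with one false positive.
y_true = ["dos", "dos", "normal", "normal", "dos"]
y_pred = ["dos", "normal", "normal", "dos", "dos"]

precision, recall, f1 = precision_recall_f1(y_true, y_pred, positive="dos")
print(precision, recall, f1)
```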

Ibrahim et al. [15] have proposed a layered-model approach for NIDS which is divided into two stages. The first stage classifies data traffic as either normal or abnormal (due to some attack). In the second stage, the attacks are classified individually. This model is inspired by the airport security model. It analyses tcpdump data using data mining techniques. Sasan and Sharma [29] proposed a hybrid model for IDS using J48 and CART. They implemented their proposed model in the Weka tool and tested it on the NSL-KDD data set to evaluate its performance. Recently, Yin et al. [36] have proposed an improved clonal selection algorithm for an artificial immune system, inspired by the biological immune system of humans, to improve the accuracy of IDS. They used KDDcup99 for the performance evaluation of their proposed work.
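The two-stage layered idea can be sketched as a simple cascade: a coarse normal/abnormal decision followed by a finer attack classifier that only sees abnormal traffic. The rules, thresholds and field names below are hypothetical placeholders and do not come from [15], which uses data mining models at each stage.

```python
def stage_one(record):
    """Hypothetical first stage: coarse normal/abnormal decision on packet rate."""
    return "abnormal" if record["pkts_per_sec"] > 1000 else "normal"


def stage_two(record):
    """Hypothetical second stage: assign an attack category to abnormal traffic."""
    if record["distinct_ports"] > 100:
        return "Probe"
    return "DoS"


def classify(record):
    """Run the cascade: stage two is consulted only if stage one flags the record."""
    if stage_one(record) == "normal":
        return "Normal"
    return stage_two(record)


web_browsing = {"pkts_per_sec": 40, "distinct_ports": 2}
port_scan = {"pkts_per_sec": 5000, "distinct_ports": 900}
flood = {"pkts_per_sec": 20000, "distinct_ports": 1}

print(classify(web_browsing), classify(port_scan), classify(flood))
```

One consequence of this design is that a stage-one miss can never be corrected later, so the first stage is usually tuned for high recall.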

The state of the art shows that most works on IDS use KDD cup 99 and NSL-KDD as benchmark data sets, which do not cover recent attacks and are considered old. On the other hand, very few papers design their own real time data set to evaluate the intrusion detection rate of their IDS in the network. The intrusion detection rate of most IDS is evaluated on traditional decision tree based models that detect only two to three attacks.

In our proposed work, we design a signature based IDS on a newer data set (UNSW-NB15) which covers more recent attacks than KDD cup99 and NSL-KDD. An integrated rule based model for IDS has been designed, and it shows a higher intrusion detection rate than other existing decision tree based models for five categories. We have also designed our own real time data set (RTNITP18), which acts as the testing data set to evaluate the performance of our proposed integrated model.

3 Proposed work

In the proposed system, the UNSW-NB15 data set is used as the benchmark data set. This section proposes an integrated rule based model to optimize the attack detection rate (ADR) and FAR in the network. The working diagram of the proposed work is shown in Fig. 2 and has two parts. In the first part, the IDS model is proposed: it starts with analysing the UNSW-NB15 data set, continues with pre-processing of the data, and ends with proposing the integrated rule based model and testing it on the benchmark test data set. In the second part, a working example is considered in which a real-time data set is generated by setting up a virtual environment, and the performance of the proposed model is evaluated on that data set.
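As a rough illustration of the two quantities the model aims to optimize, the sketch below computes ADR and FAR from true and predicted labels. It assumes commonly used definitions, ADR = TP / (TP + FN) and FAR = FP / (FP + TN), with every non-normal label treated as an attack; the paper's own definitions are given in Sect. 3.5 and may differ. The labels are toy data.

```python
def adr_far(y_true, y_pred, normal_label="Normal"):
    """Attack detection rate and false alarm rate under the assumed definitions.

    An attack predicted as a different attack category still counts as detected.
    """
    tp = fn = fp = tn = 0
    for t, p in zip(y_true, y_pred):
        true_attack = t != normal_label
        pred_attack = p != normal_label
        if true_attack and pred_attack:
            tp += 1
        elif true_attack:
            fn += 1
        elif pred_attack:
            fp += 1
        else:
            tn += 1
    adr = tp / (tp + fn) if tp + fn else 0.0
    far = fp / (fp + tn) if fp + tn else 0.0
    return adr, far


# Toy labels over the five categories considered in the paper.
y_true = ["DoS", "Probe", "Normal", "Normal", "Exploit", "Normal", "Generic", "Normal"]
y_pred = ["DoS", "Normal", "Normal", "DoS", "Exploit", "Normal", "Probe", "Normal"]

adr, far = adr_far(y_true, y_pred)
print(adr, far)
```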

3.1 Dataset description

The data set [23] was created using the IXIA PerfectStorm tool. It includes nine categories of modern attack types and involves realistic activities of normal traffic. The data set contains 49 features, which fall into several categories. Though several other data sets are available for IDS evaluation, such as KDD98 and NSL-KDD, none of them covers the latest types of attacks. Recent research on IDS [9] notes that these data sets do not fully reflect real network traffic behaviour and modern attacks for recent network threats. The features of UNSW-NB15 are grouped into five categories: (a) flow features, (b) basic features, (c) content features, (d) time features and (e) additional generated features. An overview of the data set is shown in Tables 1 and 2. In Table 3, the definitions of the attacks are given.

Table 1 Description of UNSW-NB15 dataset
