
1 Introduction

Intrusion, in the context of computer networks, is defined as any unauthorised activity on a system that exploits its vulnerabilities, such as improper security protocols or software bugs like buffer overflows [1]. Uninterrupted service and data confidentiality are two of the important factors that determine the efficiency of a network system. General attack vectors like denial-of-service (DOS) attacks and buffer-overflow attacks, and protocol-specific attack vectors like ARP poisoning and ICMP vulnerabilities, exploit faults in the network's security either to render the system passive and unresponsive or to gain privileges restricted to administrators [2].

Detecting a network intrusion depends on the defenders’ level of understanding of the network and attacks associated with it. The various techniques used in intrusion detection can be classified into two categories: signature-based and anomaly-based intrusion detection.

Signature-based intrusion detection relies on a previously obtained 'signature' of the network, which is a collection of network features, to classify whether the current network signature falls into the attack category or the normal category [3]. While this technique is well suited for detecting attacks whose signatures are already available, it fails to identify traffic with a new signature. Since the features of a network are constantly changing, this detection approach does not guarantee accurate results.

Anomaly-based intrusion detection relies on the behaviour of the network to classify the current traffic. The model is pre-trained to recognise different network behaviours and then identify the behaviour of the current network [4]. Compared to the signature-based technique, this approach is slower, as it has to identify the behaviour of the current network, which involves analysing various network factors. However, it is well suited for classifying new network behaviours, as it does not depend on predefined properties once the model is trained.

Anomaly-based network intrusion detection systems (NIDS) are the current focus of research, with the aim of building faster and more accurate detection systems. Statistical, clustering and classification methods are commonly used to build NIDS.

In this research, I propose a novel classifier-based anomaly detection technique for network intrusion: a conventional artificial neural network optimised with the Cuckoo Search Optimization algorithm.

Section 2 gives a brief review of the existing literature on the different models used for building NIDS. Section 3 gives the formal definition of the problem, specifying some of the performance measures. Section 4 introduces the basic algorithms used to build the proposed model and the algorithm of the proposed model. The model is tested with the standard KDD Cup 99 dataset, and the results are analysed in Sect. 5. Section 6 concludes the research, summarising the performance analysis of the model, and lists ideas that could help improve its performance in future work. The last section lists the research papers referenced in this paper.

2 Literature Review

In this section, I have explored some of the existing machine learning tools and techniques for network anomaly detection.

The Naive Bayes classifier described in [5] is a very efficient technique for anomaly detection, in terms of accuracy and detection rate, when few features are considered, but its performance drops as the number of features increases (the curse of dimensionality).

The Genetic Network Programming algorithm proposed in [6] has a high detection rate when combined with fuzzy class-association-rule mining. It works well with both discrete and continuous values, but the algorithm is mainly used to extract relevant classification rules, owing to its high performance when little information is available.

The combination of two popular classifiers, NBTree and Random Tree, proposed in [7] is found to give better results than the original Random Tree classifier, but the model takes more training time, and its performance is still below that of other hybrid models.

The innovation of using the PSO-SVM classifier for network traffic anomaly detection, described in [8], has paid off with a good accuracy rate (~98%) for a small number of features. As the feature count increases, the accuracy drops sharply.

The time required to train the Bayesian belief model proposed in [9] is much lower than that of other models, and it also gives a good precision rate for a small feature count. However, even a slight increase in the number of features causes the precision rate to decrease rapidly.

The MRMHC-SVM algorithm [10] has higher accuracy and robustness than other intrusion detection methods and gives good results even when the training dataset is small. However, the time required to train the model is very high.

The popular neural network with backpropagation [11] is highly adaptable to changes in the environment, and hence its training time is lower than that of most models. However, this method is seldom preferred because of its lower detection rate (DR) compared with other models.

A novel hybrid of GA and Weighted KNN classifiers is suggested in [12]. The model extracts features directly from the packets and can therefore be used as a real-time NIDS. However, although its accuracy for known attacks is very high, it does not achieve good accuracy for unknown attacks.

Singh and Silakari [13] suggest the use of the KNN classifier for anomaly detection, which achieves a high accuracy rate when trained with a large dataset and fewer features. However, as the feature count increases, the accuracy drops sharply.

A combination of Decision Tree and SVM algorithms, proposed in [14], gives high detection accuracy for specific attacks like DOS and Probe attacks, but performs poorly for U2R and R2L attack detection. Also, its rate of convergence to the minima and its adaptation to new conditions are slow.

The use of Cluster Centre and Nearest Neighbour (CANN) for anomaly detection was investigated in [15]. The performance of CANN was found to be higher than that of the KNN and SVM classifiers, providing higher detection rates and accuracy and a lower false alarm rate for a smaller feature set. However, the accuracy decreases drastically for larger feature sets.

The use of clustering algorithms like K-Means clustering was studied in [16]. Higher detection accuracy is observed, along with a lower false alarm rate, even for a smaller dataset. However, the model is difficult to implement due to its complex structure and takes a long time to train.

One of the recently developed models, Unsupervised Niche Clustering (UNC), was also investigated for anomaly detection [17]. This algorithm, which takes advantage of fuzzy analysis for detection, has high accuracy and detection rates when coupled with principal component analysis (PCA). However, due to its complex structure and high training time, the model is seldom used.

3 Problem Description

The main problem tackled in this research is to devise a new model to detect network intrusion based on anomaly detection.

Certain factors also had to be defined with which to analyse the performance of the proposed model and to compare it with some of the existing algorithms.

Some of the criteria for analysing the algorithm are:

  1. Precision and Recall: The efficiency of the model in classifying the network traffic is an important factor in analysing the model. The model should have high precision and recall to successfully classify the network traffic as normal or attack. Precision and recall are discussed further in Sect. 5.

  2. Training time: Most of the time invested in building a model is spent training it. While training is an important and essential part of a model, it should take as little time as possible. Some complex algorithms have high precision rates but take more time for training and result evaluation, and hence are seldom used in this field, where real-time classification and anomaly detection are expected.

  3. Dataset size: The dataset available for training the model is usually limited, especially when the model is to detect various types of attacks. Hence, the model must be able to perform well even when the available training dataset is small.

These parameters define the problem to be solved by current research. Some models satisfy one or two of the above factors, but research is still ongoing to develop a model that satisfies all of them.

4 The Proposed Solution

There are several approaches to tackling the problem defined, including statistical, clustering and classification techniques.

In this paper, a supervised learning classification algorithm is chosen for anomaly detection for the following reasons [18]:

  • Classification models have a definite boundary of division and hence make faster decisions than the other models.

  • The unsupervised clustering algorithms make strong assumptions on the nature of distribution of the traffic, such as the anomalous data are farther away from the centres than the normal data and clusters with smaller sizes are anomalous. Supervised algorithms make minimal assumptions, as the training data gives sufficient information.

  • The statistical models are very quick in decision-making, but are very slow in adapting to new environments.

A combination of two algorithms is used in this research: a neural network for classifying the data and the Cuckoo Search Optimization algorithm as the optimizer.

4.1 Artificial Neural Network (ANN)

An artificial neural network (ANN) is modelled as a collection of nodes called artificial neurons [19], the main inspiration being the neurons and their network in the brains of living beings. Each connection between neurons transports signals from one neuron to another.

Each neural network consists of neurons grouped into layers. Each layer performs a specific function, including the input layer, where the data is fed in, and the output layer, where the classification result is obtained. The signal at the connection between two neurons is real-valued, and the output of each neuron is calculated by a nonlinear function of its inputs [20]. The output of a neuron is fed to the inputs of the neurons in the next layer, and the same process is repeated until the output layer is reached.

Learning in an ANN is based on the modification of the function operating on the input of each neuron, using a feedback mechanism [21]. There are various optimizer algorithms for this purpose, of which I am using a hybrid of two popular algorithms, discussed later in this paper.

A simple ANN model is represented by 'weights' on each of its inputs and a function that operates on the inputs and the weights, returning the output for the next layer [21]. Learning in such an ANN model therefore involves modifying the weights based on the penalty assigned to each neuron.
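To make this weighted-sum model concrete, the following is a minimal sketch of a feed-forward pass. The sigmoid activation and the layer sizes (39 inputs, 20 hidden neurons, 3 output classes) are illustrative assumptions, not values fixed by this paper.

```python
import numpy as np

def forward_pass(x, weights, biases):
    """Propagate an input vector through a fully connected network:
    each layer computes a weighted sum of its inputs plus a bias and
    applies a nonlinear activation (a sigmoid is assumed here)."""
    a = x
    for W, b in zip(weights, biases):
        z = W @ a + b                    # weighted sum of the previous layer's outputs
        a = 1.0 / (1.0 + np.exp(-z))     # nonlinear activation
    return a

# Illustrative shapes: 39 input features -> 20 hidden neurons -> 3 output classes
rng = np.random.default_rng(0)
weights = [rng.normal(size=(20, 39)), rng.normal(size=(3, 20))]
biases = [np.zeros(20), np.zeros(3)]
print(forward_pass(rng.normal(size=39), weights, biases))
```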

Advantages of using ANN [19]:

  • Adaptive capability of the ANN model is higher compared to most of the other classification algorithms. This will enable the model to adapt to new datasets and give better results when trained and tested.

  • The model can organise the information (data) it receives during training, forming more generalised classification boundaries and thus improving the results.

  • Parallelism can be achieved in ANN, thus reducing time consumed in training and testing the model (which is considerably high for any heuristic algorithm).

  • Fault tolerance is high, so some of the capabilities of the original model can be retained even when the network is partially damaged.

4.2 Cuckoo Search Algorithm

Cuckoo Search is one of the popular optimizer algorithms currently used in various domains of engineering. Inspired by the brood parasitism exhibited by cuckoo birds, it combines the advantages of swarm optimisation and genetic algorithms [22, 23].

  i. Cuckoo Search: In the algorithm, the various possible solutions of the optimisation problem are considered as the eggs in a nest. Our aim is to develop new cuckoos which replace less fit eggs (solutions) with fitter ones [24]. The following assumptions are made during the construction of the algorithm:

    • A cuckoo lays only one egg at a time and chooses a random nest to lay it in.

    • The nests with the highest-fitness eggs are carried over to the next generation.

    • The host bird detects the cuckoo's egg with a probability \( p_{a} \).

For generating new cuckoos, along with the local random walk, the Cuckoo Search algorithm uses a global explorative random walk, where the parameter \( p_{a} \) is used to switch between them. For the local random walk, the Heaviside function is used, while for the global walk, I will be using the Levy flight [24, 25].

Local random walk:

$$ x_{i}^{t + 1} = x_{i}^{t} + s*H\left( p_{a} - \varepsilon \right)*\left( x_{j}^{t} - x_{k}^{t} \right) $$

where \( x_{i}^{t} \) is the ith solution of the tth generation, \( x_{i}^{t + 1} \) is the ith solution of the (t + 1)th generation, \( H(u) \) is the Heaviside function, \( \varepsilon \) is a random number drawn from a uniform distribution, s is the step size, and \( x_{j}^{t} \) and \( x_{k}^{t} \) are random solutions of the tth generation.

Global random walk:

$$ x_{i}^{t + 1} = x_{i}^{t} + \alpha*L\left( s, \lambda \right) $$

where \( \alpha \) is the boundary parameter, \( \alpha = 0.01*(UB - LB) \) with UB the upper bound and LB the lower bound, and \( L(s, \lambda) \) is the Levy flight function.

  ii. Levy flight: A Levy flight is a random walk whose step length is drawn from the Levy distribution [25]. The Levy flight can be approximated as

$$ L\left( s, \lambda \right) = \frac{1}{s^{\lambda + 1}} $$

where s is the step length given by Mantegna's algorithm and \( \lambda \) is the Levy index.

  iii. Step size: For calculating the step size, I have used Mantegna's algorithm [25]:

$$ s = \frac{u}{v^{\left( 1 + \lambda \right)}} $$

where u is drawn from the normal distribution \( N(0, \sigma_{u}^{2}) \), v is drawn from the normal distribution \( N(0, 1) \), and

$$ \sigma_{u} = \frac{\Gamma \left( 1 + \lambda \right)*\sin \left( \pi\lambda /2 \right)}{\Gamma \left( \left( 1 + \lambda \right)/2 \right)*\lambda*2^{\left( \lambda + 1 \right)/2}} $$
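The random-walk updates above can be sketched as follows. The helper names (`mantegna_step`, `local_walk`, `global_walk`) are illustrative; \( \sigma_{u} \) follows the expression given above, while the step-length line uses the commonly cited Mantegna form \( s = u/|v|^{1/\lambda} \), whose exponent convention differs slightly between references. A typical Levy index such as λ = 1.5 would be supplied by the caller.

```python
import numpy as np
from math import gamma, sin, pi

def mantegna_step(lam, size, rng):
    """Levy-distributed step lengths via Mantegna's algorithm
    (commonly cited form s = u / |v|**(1/lam))."""
    sigma_u = (gamma(1 + lam) * sin(pi * lam / 2) /
               (gamma((1 + lam) / 2) * lam * 2 ** ((lam + 1) / 2)))
    u = rng.normal(0.0, sigma_u, size)   # u ~ N(0, sigma_u^2)
    v = rng.normal(0.0, 1.0, size)       # v ~ N(0, 1)
    return u / np.abs(v) ** (1.0 / lam)

def local_walk(nests, i, j, k, s, pa, rng):
    """Local random walk: x_i + s * H(pa - eps) * (x_j - x_k)."""
    eps = rng.uniform()
    return nests[i] + s * np.heaviside(pa - eps, 0.0) * (nests[j] - nests[k])

def global_walk(x_i, lb, ub, lam, rng):
    """Global random walk (Levy flight): x_i + alpha * L(s, lambda),
    with alpha = 0.01 * (UB - LB)."""
    alpha = 0.01 * (ub - lb)
    return x_i + alpha * mantegna_step(lam, x_i.shape, rng)
```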

4.3 Construction of ANN-CSO Model

I have considered Cuckoo Search Algorithm for the optimisation of neural networks used to classify the dataset. The advantages of using genetic algorithms (GA) and swarm optimisation algorithms are discussed in [26] and [27], respectively. Some of the advantages of using Cuckoo Search optimizer are [23]:

  • The selection criteria used in CSO is similar to that of genetic algorithms and Harmony Search.

  • The step length in CSO is heavy-tailed, which supports any step length, and so randomisation is more efficient.

  • The number of parameters in CSO is smaller than in GA or PSO, so it can adapt to changing environments faster.

In the model proposed here, I use the neural network as the nonlinear objective function to be optimised by the Cuckoo Search Algorithm. A population of 15–40 neural networks with random weights and bias values is created. The cost of each network is then found by training it on the given dataset, and the fitness of a neural network is taken as the inverse of its cost. Each network is considered a possible solution (an egg in the Cuckoo Search Algorithm), and the optimizer is executed to obtain the best solution, i.e., the neural network with the least cost, as sketched below.
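Assuming the `global_walk` and `local_walk` helpers from the previous sketch are in scope, the ANN-CSO loop described above could look roughly as follows. The cost function here is a toy placeholder (the real one would rebuild the ANN from the weight vector and evaluate its training loss), and the population size, abandonment probability pa and iteration count are illustrative choices rather than values fixed by this paper.

```python
import numpy as np

def ann_cost(w, X, y):
    """Placeholder for the network's training cost: a toy quadratic
    surrogate so the sketch runs end to end. In the real model this
    would rebuild the ANN from the weight vector `w` and return its
    loss on the training data (X, y)."""
    return float(np.sum(w ** 2))

def ann_cso(X, y, dim, n_nests=25, pa=0.25, lam=1.5,
            lb=-1.0, ub=1.0, n_iter=100, seed=0):
    """Optimise flattened ANN weight vectors with Cuckoo Search (sketch)."""
    rng = np.random.default_rng(seed)
    nests = rng.uniform(lb, ub, (n_nests, dim))     # population of candidate networks
    cost = np.array([ann_cost(n, X, y) for n in nests])

    for _ in range(n_iter):
        # Global Levy-flight moves: keep a candidate only if it improves its nest
        for i in range(n_nests):
            cand = global_walk(nests[i], lb, ub, lam, rng)
            c = ann_cost(cand, X, y)
            if c < cost[i]:
                nests[i], cost[i] = cand, c

        # Abandon a fraction pa of the worst nests and rebuild them
        # with the local random walk between two random nests
        for i in np.argsort(cost)[-int(pa * n_nests):]:
            j, k = rng.integers(0, n_nests, 2)
            nests[i] = local_walk(nests, i, j, k, s=rng.uniform(), pa=pa, rng=rng)
            cost[i] = ann_cost(nests[i], X, y)

    best = int(np.argmin(cost))
    return nests[best], 1.0 / cost[best]            # best weights and fitness (inverse cost)
```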

5 Experimental Result

5.1 Dataset

For training our model, I have used the KDD Cup '99 dataset. The dataset (corrected form) contains 311,029 instances, each with about 39 features based on network packet data and the network traffic behaviour at the time of capture. The instances are classified into four different classes of attacks:

  • DOS attack (223,298 instances)

  • R2L attack (25,722 instances)

  • Probe attack (377 instances)

  • U2R attack (39 instances).

The remaining 61,593 instances are normal traffic data.

The model is trained using two sets of training data: one containing 10,000 instances divided into 7500 training and 2500 testing instances, and the other containing 2000 instances divided into 1500 training and 500 testing instances, along with the complete dataset split into 46,000 training and 15,593 testing instances. Also, for detecting attacks based on anomalies, the instances with U2R and Probe attack behaviour are very few in the original data and are reduced further in our selection of training data (problems with overfitting and outlier handling would affect the performance). Hence, the instances are classified into three groups (a sketch of one possible label grouping follows the list):

  • Normal Instances

  • DOS attack instances

  • R2L attack instances
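As a sketch of one possible grouping, the snippet below collapses the classic KDD Cup '99 attack label names into the three classes used here. The label sets cover the original training labels and may need extending for the corrected file, the file path is a placeholder, and the label is assumed to be the final column, as in the public KDD files.

```python
import pandas as pd

# Classic KDD Cup '99 attack labels grouped into the classes used here;
# the "corrected" test set contains additional attack names, so this
# mapping may need to be extended for that file.
DOS_LABELS = {"back", "land", "neptune", "pod", "smurf", "teardrop"}
R2L_LABELS = {"ftp_write", "guess_passwd", "imap", "multihop",
              "phf", "spy", "warezclient", "warezmaster"}

def group_label(raw):
    """Map a raw KDD label such as 'smurf.' to one of the three groups."""
    name = str(raw).rstrip(".")
    if name == "normal":
        return "normal"
    if name in DOS_LABELS:
        return "dos"
    if name in R2L_LABELS:
        return "r2l"
    return "other"   # U2R and Probe instances, excluded in this setup

# "kddcup.data.corrected" is a placeholder path for the corrected dataset file
df = pd.read_csv("kddcup.data.corrected", header=None)
df["group"] = df.iloc[:, -1].map(group_label)   # label assumed in the last column
df = df[df["group"] != "other"]                 # keep normal, DOS and R2L instances only
```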

5.2 Performance of the Model

The performance of the model is analysed based on its precision, recall and F1-score. For calculating the precision and recall, I need to define some terms [28]:

  • True Positive: When the prediction is positive and so is the actual data.

  • True Negative: When the prediction is negative and so is the actual data.

  • False Positive: When the prediction is positive, but the actual data is negative, also called false alarm.

  • False Negative: When the prediction is negative, but the actual data is positive, also called miss.

Precision is defined as the ratio of true positive to the total positive prediction.

$$ Precision = True \, Positive/(True \, Positive + False \, Positive) $$

Recall is defined as the ratio of true positive to the total actual positive data.

$$ Recall = True \, Positive/(True \, Positive + False \, Negative) $$

F1-score is the harmonic mean of precision and recall.
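Expressed as a formula, consistent with the two definitions above,

$$ F1 = 2 * Precision * Recall/(Precision + Recall) $$

A minimal sketch of computing these per-class metrics from the confusion counts is shown below; the function name and the example counts are illustrative, not values from this paper.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1-score from per-class confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Illustrative counts: 940 attacks flagged correctly, 25 false alarms, 35 misses
print(precision_recall_f1(tp=940, fp=25, fn=35))
```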

For the purpose of comparing our model with existing models, I compare its performance with two other existing models for anomaly detection: the BCS-GA algorithm proposed in [29] and the fuzzy K-means classifier using the PSO algorithm proposed in [30].

Table 1 gives the precision, recall and F1-score obtained by training and testing the model with the dataset of 2000 instances, the dataset of 10,000 instances and the complete dataset. Table 2 gives the comparative performance, and Table 3 compares the training time of our model with that of the two models considered for the comparative analysis.

Table 1 Performance of the proposed model when trained for 1500 dataset instances and tested for 500 instances, when trained for 7500 dataset instances and tested for 2500 instances, and when trained for 46,000 dataset instances and tested for 15,593 instances
Table 2 Comparison of the performance of our hybrid model with the PSO-K-means model and the BCS-GA model. All the models were trained on the same dataset of 10,000 instances
Table 3 Comparison of the time taken to train the three models considered for the performance analysis. The models were trained on the same computational system with the same dataset

Figures 1, 2 and 3 give the graphical representation of the performance recorded in Table 2.

Fig. 1 Column graph comparing the performance of the three models in the detection of normal packets

Fig. 2 Column graph comparing the performance of the three models in the detection of DOS attack packets

Fig. 3 Column graph comparing the performance of the three models in the detection of R2L attack packets

5.3 Analysis of Performance

From the results of training and testing the model on the chosen dataset, the model can be analysed as follows. The precision and recall for the smaller dataset (2000 instances) are low, as shown in Table 1; this is due to overfitting and insufficient handling of outliers. The performance increases significantly when the model is trained on the larger dataset (10,000 instances). However, there is no significant further improvement as the training data is increased from 10,000 instances to the complete dataset (61,593 instances).

Table 2 suggests that the ANN-CSO model has better precision and recall than the PSO-K-means model for all three classes. Also, the precision and recall for normal and DOS attack detection are higher than those for R2L attack detection; this is due to the relatively smaller number of R2L instances compared with the normal and DOS instances.

The precision and recall of the ANN-CSO model and the BCS-GA hybrid model are very close to each other, though the hybrid proposed in [29] has slightly higher values. However, the training time of that hybrid is longer than the time needed to train the ANN-CSO model proposed in this paper.

6 Conclusion and Future Work

In this paper, I have presented a novel technique for network anomaly detection using an artificial neural network optimised with the cuckoo search algorithm. Our model exploits the advantages of the cuckoo search algorithm and the neural network classifier. Analysing the experimental results, I can conclude that the ANN-CSO model suggested in this paper has higher precision and recall than the PSO-K-means model. Also, the proposed model takes significantly less training time than the BCS-GA hybrid model.

In our model, it is assumed that each nest in the cuckoo search algorithm holds only one egg and that each cuckoo lays only one egg. This restriction can be relaxed to allow many eggs in a single nest, with each cuckoo laying multiple eggs based on different Levy flights. This would allow multi-solution optimisation, which could give better results than the single-solution optimisation proposed in this paper, provided the step length and the Levy flight parameters are properly controlled. Such modifications can be made to the model and the results analysed.