Introduction

Intrusion detection is the process of gathering knowledge about intrusions that have occurred in computer networks or systems and analyzing it to detect future intrusions. Intrusion detection can be divided into two categories: anomaly detection [2] and misuse detection. Anomaly detection analyzes the gathered information and compares it to a defined baseline of what is seen as “normal” service behavior, so it can learn to detect network attacks that are currently unknown. Misuse detection is based on signatures of known attacks, so it is only as good as the database of attack signatures that it uses for comparison. Misuse detection has a low false positive rate but cannot detect novel attacks, whereas anomaly detection can detect unknown attacks but has a high false positive rate.

The Naïve Bayesian (NB) method is based on the work of Thomas Bayes. In Bayesian classification, we hypothesize that the given data belong to a particular class and then calculate the probability that this hypothesis is true. This is among the most practical approaches for certain types of problems, and it requires only one scan of the whole data set.

A Multinomial Logistic Regression (MLR) model is used for data in which the dependent variable is unordered or polytomous, and the independent variables are continuous or categorical predictors. The dependent variable in this type of model is therefore measured on a nominal scale; the model was introduced by McFadden (1974). Unlike a binary logistic model, in which the dependent variable has only a binary choice (e.g., presence/absence of a characteristic), the dependent variable in a multinomial logistic model can have more than two choices that are coded categorically, and one of the categories is taken as the reference category.

In this paper, we propose using two methods, MLR and NB, to build an anomaly-based intrusion detection system (IDS). We compare the performance of these two linear classifiers on attack scenarios that we created, and search for the characteristics of the data that determine the performance. The comparison between logistic regression and naïve Bayes has been studied theoretically by Ng and Jordan (2002).

This paper is organized as follows: Sect. 2 describes the data set used in our experiment. Section 3 presents the foundations of the two methods, naïve Bayesian and multinomial logistic regression, and considers the problem of applying them to build an anomaly-based IDS. In Sect. 4, we give an illustration and experimental results with four attack scenarios; to help in understanding the procedure, a demonstrative case shows the key stages involved in using the introduced concepts. Section 5 concludes the paper.

Dataset

Our data set is created by the following activities:

Data collection activity: this activity collects the attribute values of each flow from the packet data (IP, port, TCP, UDP, ICMP). Based on these attributes, the program builds a profile (at bin level) that contains the characteristic parameters of the network traffic in a given time interval, including: (1–2) entropy compression rate of the source/destination IP address, (3–4) entropy compression rate of the source/destination port, (5) number of packets, (6) total size of the packets, (7) average size of the packets, (8) standard deviation of the packet size, (9) number of TCP packets, (10) number of UDP packets and (11) number of ICMP packets.

Statistical analysis activity: this activity uses the data analyzed from the collected packets to build the corresponding bin arrays. The bins are organized at the following levels: hours, days and months, and fall into three classes of data, namely the current class, the reference class and the differential class:

Cur_bin: represents each “bin” instance (a bin is the smallest time unit; in our program, one minute). These instances are created continuously while the network traffic is being monitored.

Ref_bin: represents the reference model corresponding to one reference unit of time. The reference model is updated adaptively, based on the values of Cur_bin, when no intrusion is detected.

Dif_bin: represents the difference between the current value and the reference value; it is the input of the classifiers.
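The relationship between the three bin classes can be illustrated with a minimal sketch. It rests on several assumptions that are not stated in this paper: a packet is modeled as a dictionary with the hypothetical keys `src_ip`, `dst_ip`, `src_port`, `dst_port`, `size` and `proto`; normalized Shannon entropy stands in for the "entropy compression rate", whose exact formula is not given here; and the adaptive reference update is written as an exponential moving average with a hypothetical weight `alpha`.

```python
import math
from collections import Counter
from statistics import mean, pstdev


def normalized_entropy(values):
    """Shannon entropy of the observed values, scaled to [0, 1].

    Assumption: a stand-in for the "entropy compression rate".
    """
    counts = Counter(values)
    if len(counts) <= 1:
        return 0.0
    total = sum(counts.values())
    h = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return h / math.log2(len(counts))


def build_profile(packets):
    """Build the 11-attribute profile for the packets seen in one bin."""
    sizes = [p["size"] for p in packets]
    protos = Counter(p["proto"] for p in packets)
    return [
        normalized_entropy([p["src_ip"] for p in packets]),         # (1)
        normalized_entropy([p["dst_ip"] for p in packets]),         # (2)
        normalized_entropy([p.get("src_port") for p in packets]),   # (3)
        normalized_entropy([p.get("dst_port") for p in packets]),   # (4)
        len(packets),                                               # (5)
        sum(sizes),                                                 # (6)
        mean(sizes) if sizes else 0.0,                              # (7)
        pstdev(sizes) if sizes else 0.0,                            # (8)
        protos["TCP"],                                              # (9)
        protos["UDP"],                                              # (10)
        protos["ICMP"],                                             # (11)
    ]


def dif_bin(cur, ref):
    """Element-wise difference between the current and reference profiles."""
    return [c - r for c, r in zip(cur, ref)]


def update_ref(ref, cur, alpha=0.1):
    """Move the reference towards the current bin.

    Assumption: exponential moving average; meant to be called only
    for bins in which no intrusion was flagged.
    """
    return [(1 - alpha) * r + alpha * c for r, c in zip(ref, cur)]
```

In this sketch, the output of `dif_bin` is the vector that would be passed to a classifier, and `update_ref` would be applied only to bins classified as normal.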

Methods

Naïve Bayesian

Naïve Bayesian classifiers assume that the effect of an attribute value on a given class is independent of the values of the other attributes. This assumption is called class conditional independence. Bayesian networks, by contrast, allow the representation of dependencies among subsets of attributes [9]. Although the use of Bayesian networks has proved to be effective in certain situations, the results obtained are highly dependent on the assumptions about the behavior of the target system, so a deviation from these hypotheses leads to detection errors attributable to the model considered [10]. The NB classifier works as follows. Let T be a training set of samples, each with its class label. There are m classes \( C_{1} ,C_{2} , \ldots ,C_{m} \), and each sample is represented by an n-dimensional vector \( X = \{ X_{1} ,X_{2} , \ldots ,X_{n} \} \).

Given a sample X, the classifier predicts that X belongs to the class having the highest a posteriori probability, conditional on X. That is, X is predicted to belong to the class \( C_{i} \) if and only if \( P(C_{i} |X) > P(C_{j} |X) \) for \( 1 \le j \le m,j \ne i \).

By Bayes’ theorem, we have \( P(C_{i} |X) = \frac{{P(X|C_{i} )P(C_{i} )}}{P(X)} \). As P(X) is the same for all classes, only \( P(X|C_{i} )P(C_{i} ) \) needs to be maximized. If the class prior probabilities \( P(C_{i} ) \) are not known, it is commonly assumed that the classes are equally likely, that is, \( P(C_{1} ) = P(C_{2} ) = \cdots = P(C_{m} ) \), and we would therefore maximize \( P(X|C_{i} ) \).

In order to reduce the computation involved in evaluating \( P(X|C_{i} ) \), the naïve assumption of class conditional independence is made. Mathematically, this means that \( P(X|C_{i} ) \approx \prod\limits_{k = 1}^{n} {P(X_{k} |C_{i} )} \). The probabilities \( P(X_{k} |C_{i} ) \) can easily be estimated from the training set. If \( X_{k} \) is continuous-valued, we typically assume that its values have a Gaussian distribution with mean \( \mu \) and standard deviation \( \sigma \), so that \( P(X_{k} |C_{i} ) = g(X_{k} ,\mu_{ci} ,\sigma_{ci} ) \). We need to compute \( \mu_{ci} \) and \( \sigma_{ci} \) in the training stage. In order to predict the class label of X, \( P(X|C_{i} )P(C_{i} ) \) is evaluated for each class \( C_{i} \). The classifier predicts that the class label of X is \( C_{i} \) if and only if it is the class that maximizes \( P(X|C_{i} )P(C_{i} ) \).
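The training and prediction stages described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in this study: the function names are hypothetical, every attribute is treated as Gaussian, the class priors are estimated from class frequencies, and a small floor on \( \sigma \) is added to avoid division by zero. The score is computed in log space, so the per-attribute likelihoods combined under class conditional independence are accumulated as a sum of logarithms.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(samples, labels):
    """Estimate the prior, means and standard deviations for each class."""
    by_class = defaultdict(list)
    for x, y in zip(samples, labels):
        by_class[y].append(x)
    model = {}
    for c, rows in by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # Floor on sigma (an assumption) avoids division by zero.
        stds = [
            max(math.sqrt(sum((v - m) ** 2 for v in col) / n), 1e-6)
            for col, m in zip(columns, means)
        ]
        model[c] = (n / len(samples), means, stds)
    return model


def log_gaussian(x, mu, sigma):
    """Logarithm of the Gaussian density g(x, mu, sigma)."""
    return (-0.5 * math.log(2 * math.pi * sigma ** 2)
            - (x - mu) ** 2 / (2 * sigma ** 2))


def predict(model, x):
    """Return the class maximizing log P(C_i) + sum_k log P(X_k | C_i)."""
    def score(c):
        prior, means, stds = model[c]
        return math.log(prior) + sum(
            log_gaussian(v, m, s) for v, m, s in zip(x, means, stds)
        )
    return max(model, key=score)


# Toy two-attribute data with hypothetical labels (not data from this study).
train_x = [[0.0, 0.1], [0.2, -0.1], [5.0, 5.1], [5.2, 4.9]]
train_y = ["normal", "normal", "flood", "flood"]
nb_model = fit_gaussian_nb(train_x, train_y)
print(predict(nb_model, [0.1, 0.0]))
```

Working with logarithms does not change which class attains the maximum, but it avoids numerical underflow when many small probabilities are multiplied.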

Multinomial Logistic Regression

A multinomial logistic regression model is used for data in which the dependent variable is unordered or polytomous, and the independent variables are continuous or categorical predictors. The dependent variable in this type of model is therefore measured on a nominal scale; the model was introduced by McFadden (1974). Unlike a binary logistic model, in which the dependent variable has only a binary choice (e.g., presence/absence of a characteristic), the dependent variable in a multinomial logistic model can have more than two choices that are coded categorically, and one of the categories is taken as the reference category. This study used “0” (normal) as the reference category. Suppose \( y_{i} \) is the dependent variable with five categories for the i-th individual connection. The probability of being in category s (s = “1” [TCP flood], “2” [ICMP flood], “3” [UDP flood], “4” [Scan Port]) is denoted \( \pi_{i}^{(s)} = \Pr (y_{i} = s) \), and the probability of the chosen reference category is denoted \( \pi_{i}^{(0)} \). Then, for a simple model with one independent variable \( x_{i} \), a multinomial logistic regression model with logit link can be represented as:

$$ \log \left( {\frac{{\pi_{i}^{(s)} }}{{\pi_{i}^{(0)} }}} \right) = \beta_{0}^{(s)} + \beta_{1}^{(s)} x_{i} ,s = 1,2,3,4. $$

An alternative way to interpret the effect of an independent variable x is to use the predicted probabilities \( \pi_{i}^{(s)} \) for different values of x:

$$ \pi_{i}^{(s)} = \frac{{\exp (\beta_{0}^{(s)} + \beta_{1}^{(s)} x_{i} )}}{{1 + \sum\limits_{k = 1}^{4} {\exp (\beta_{0}^{(k)} + \beta_{1}^{(k)} x_{i} )} }}. $$

Then, the probability of being in the reference category, “0” (normal), can be calculated by subtraction:

$$ \pi_{i}^{(0)} = 1 - \sum\limits_{k = 1}^{4} {\pi_{i}^{(k)} } $$
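The two formulas above translate directly into code. The following sketch computes the predicted probabilities for a single predictor value; the function name and the coefficient values are hypothetical and serve only as an illustration, since fitted coefficients are not reported here.

```python
import math


def mlr_probabilities(x, coefs):
    """Predicted class probabilities for one predictor value x.

    coefs maps each non-reference category s to (beta0, beta1).
    Category 0 (normal) is the reference and has no coefficients.
    """
    exps = {s: math.exp(b0 + b1 * x) for s, (b0, b1) in coefs.items()}
    denom = 1.0 + sum(exps.values())
    probs = {s: e / denom for s, e in exps.items()}
    # Reference category obtained by subtraction.
    probs[0] = 1.0 - sum(probs.values())
    return probs


# Hypothetical coefficients for categories 1..4 (illustration only).
example_coefs = {
    1: (-2.0, 0.5),   # TCP flood
    2: (-1.0, 0.2),   # ICMP flood
    3: (-3.0, 0.8),   # UDP flood
    4: (-1.5, 0.1),   # Scan Port
}
for category, p in sorted(mlr_probabilities(2.0, example_coefs).items()):
    print(category, round(p, 4))
```

Because the reference category contributes the constant 1 to the denominator, the five probabilities always sum to one, and the predicted class is the category with the largest probability.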

Experiment and Results

In this section, we summarize our experimental results on detecting network intrusions using Naïve Bayes and Multinomial Logistic Regression over the dataset we created, which is based on four attack scenarios: TCP flood, ICMP flood, UDP flood and port scan.

Purpose of Study

The objective of this study is to detect some common attack types in computer systems and networks. We also compare the classification performance of the NB and MLR models.

Dataset

In this study, 11 attributes are measured: the entropy compression rate of the source/destination IP address and of the source/destination port, the number of packets, the total and average size of the packets, the standard deviation of the packet size, and the number of TCP, UDP and ICMP packets. Each instance is therefore represented by a vector of 11 attributes, and the input of each classifier is the differential vector between the current vector and the reference vector, which represents the normal state (Table 1).

Table 1 Number of examples in dataset we created

Experiment

We test the system’s ability to detect anomaly-based intrusion activities using two methods: Naïve Bayes and Multinomial Logistic Regression. We proceed with four attack scenarios: ICMP flood, TCP flood, UDP flood and port scan. Each attack significantly changes the number of ICMP, TCP or UDP packets and the source/destination entropy.

Testing Environment

The system was tested in a virtual 100 Mbps LAN environment built with VMware, consisting of two Windows XP computers and an Ubuntu computer on which the anomaly IDS was installed. These computers are connected to each other through a virtual switch.

Testing Scenarios

The two Windows XP computers carry out TCP flood, UDP flood and ICMP flood attacks (bandwidth flood attacks) using tools such as hping3, udpflood.exe and ping, respectively, or scan ports in the range 1–300 on the Ubuntu computer on which the anomaly IDS is installed. Our program collects and analyzes the packets in order to detect anomalies in the traffic.

Experimental Results

A “confusion matrix” is sometimes used to represent the result of classification, as shown in Table 2 (Naïve Bayes) and Table 3 (Multinomial Logistic Regression). The advantage of using this matrix is that it tells us not only how many instances were misclassified but also which misclassifications occurred. We define the accuracy, detection rate and false-alarm rate as:

$$ \begin{aligned} & \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \quad \text{Detection rate} = \frac{TP}{TP + FP} \\ & \text{False alarm} = \frac{FP}{FP + TN} \end{aligned} $$

Table 2 Confusion matrix for naïve Bayes

Table 3 Confusion matrix for multinomial logistic regression

FN: False Negative, TN: True Negative, TP: True Positive and FP: False Positive (Table 4).

Table 4 Comparison between NB and MLR
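The three measures defined in this section can be computed from the confusion-matrix counts as in the sketch below. The function name is hypothetical, and the counts in the usage example are made up for illustration; they are not results from this study. The detection rate follows the definition as written in this section, TP/(TP + FP), a ratio that is elsewhere often called precision.

```python
def ids_metrics(tp, tn, fp, fn):
    """Accuracy, detection rate and false-alarm rate from confusion counts."""
    def ratio(num, den):
        # Return 0.0 instead of failing when a denominator is zero.
        return num / den if den else 0.0

    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "detection_rate": ratio(tp, tp + fp),   # as defined in this section
        "false_alarm": ratio(fp, fp + tn),
    }


# Hypothetical counts, for illustration only.
for name, value in ids_metrics(tp=90, tn=95, fp=5, fn=10).items():
    print(name, round(value, 4))
```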

Conclusion

This study constructed an anomaly-based intrusion detection model based on the Naïve Bayes and Multinomial Logistic Regression algorithms. We also tested the IDS’s detection ability using both methods on the data sets that we created from four attack scenarios: ICMP flood, UDP flood, TCP flood and port scan. The experimental results show that both methods give very high accuracy and could be applied in practice. However, this is still only an initial test, and more research is needed. In the future, we will continue to improve the system and experiment in a real network environment.