
1 Introduction

For many years, safety systems have been based on previously isolated and classified patterns of threats, known as signatures. Anti-virus software, intrusion detection and prevention systems, and protection against information leaks are just a few examples from a long and diversified list of applications of these techniques. They have one aspect in common: they can protect systems and computer networks only from known attacks described by the mentioned patterns. However, does the absence of traffic matching known signatures mean there is no threat?

Defending against novel, unknown attacks requires a rather radical change in the concept of operation. Instead of searching for attack signatures in network traffic, it is necessary to look for abnormal behavior, i.e. deviations from the normal traffic characteristic. The strength of such an approach lies in solutions which are not based on a priori knowledge of attack signatures, but on what does not correspond to particular norms, the profiles of the analyzed network traffic. Techniques based on these assumptions should be able to detect simple attacks of the DoS (Denial of Service) or DDoS (Distributed Denial of Service) type, intelligent network worms, and hybrid attacks combining numerous different destruction methods. Such attacks give rise to network anomalies, which creates a possibility to detect them, or even to prevent the unwanted actions. The hardest challenge, however, is differentiating between dangerous behavior and normal traffic at its initial stage, in order to limit the usage of network resources.

Anomalies are abnormalities, deviations from the adopted rule. Anomalies in network traffic can signify device damage, a software error, or an attack on resources and network systems. The essence of anomaly detection in computer networks is therefore detecting abnormal behaviors or actions which, in particular, can constitute the source of a potential attack [6]. One possible solution to the presented problem is the implementation of Anomaly Detection Systems. They are currently used as one of the main mechanisms of safety supervision in computer networks. Their operation consists in monitoring and detecting attacks directed at information system resources on the basis of abnormal behaviors reflected in the parameters of network traffic.

Anomaly detection methods have been the topic of numerous surveys and review articles. The works describing these methods employ techniques based on machine learning, neural networks, clustering, and expert systems. At present, the anomaly detection methods under particularly intensive development are those based on statistical models describing the analyzed network traffic [6]. The most often used models are the autoregressive ARMA and ARIMA models and the conditional heteroskedastic models ARCH and GARCH, which allow one to estimate profiles of normal network traffic [18]. In the present article we propose using the estimation of the statistical models AR, MA, ARMA, ARFIMA and FIGARCH for defined behavior profiles of a given network traffic. The process of anomaly (network attack) detection is realized by comparing the parameters of normal behavior (predicted on the basis of the tested statistical models) with the parameters of the real network traffic.

This paper is organized as follows. After the introduction, in Sect. 2 we present the definition of long- and short-memory dependence. In Sect. 3 the different statistical models for data traffic prediction are described in detail. Then, in Sect. 4 the Anomaly Detection System based on AR, MA, ARMA, ARFIMA and FIGARCH model estimation is shown. Experimental results and conclusions are given thereafter.

2 Definition of Long and Short-Memory Dependence

Long-memory dependences manifest themselves in the existence of autocorrelations between the elements of a given time series; in most cases this is high-order autocorrelation. This means that in the examined series there is a dependence between observations, even those distant in time. This phenomenon is called long memory and was discovered by the British hydrologist Hurst [13]. When long memory is present, the autocorrelation function decays slowly, at a hyperbolic rate, and in the spectral domain the energy of the series is concentrated at low frequencies. Short-memory time series, by contrast, show essential autocorrelations of low order only. This means that observations separated even by a relatively short time period are no longer correlated. Short-memory series are easy to recognize: in the time domain the autocorrelation function vanishes quickly, and in the spectral domain the energy is concentrated at high frequencies. A stochastic process is said to have long memory with parameter d if its spectral density function \(f(\lambda )\) meets the condition

$$\begin{aligned} f\left( \lambda \right) \sim c\lambda ^{ - 2d} , \quad \text {as } \lambda \rightarrow 0^ + , \end{aligned}$$
(1)

where c is a constant and the symbol \(\sim \) means that the ratio of the left- and right-hand sides tends to one. When the process meets this condition and \(d>0\), its autocorrelation function decays in a hyperbolic manner [3, 4, 18], i.e.

$$\begin{aligned} \rho _k \sim c_\rho k^{2d - 1} , \quad \text {as } k \rightarrow \infty . \end{aligned}$$
(2)

The parameter d describes the memory of the process. When \(d>0\), the spectral density function is unbounded in the neighborhood of 0; the process is then said to have long memory. When \(d=0\), the spectral density is bounded at 0, and the process is described as having short memory. When \(d<0\), the spectral density vanishes at 0 and the process shows negative memory; such a process is called anti-persistent [10, 12].
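
As an illustration of condition (1), the long-memory parameter d can be estimated directly from the periodogram with the classical Geweke-Porter-Hudak (GPH) log-periodogram regression. The sketch below (Python, assuming only numpy) is a minimal illustration; the bandwidth \(m = n^{1/2}\) and the white-noise test series are arbitrary assumptions, not part of the method of this paper.

```python
import numpy as np

def gph_estimate_d(y, frac=0.5):
    """Geweke-Porter-Hudak log-periodogram estimate of the long-memory
    parameter d, based on the spectral relation f(lambda) ~ c*lambda^(-2d)
    as lambda -> 0+ (Eq. 1)."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    # Periodogram at the Fourier frequencies lambda_j = 2*pi*j/n.
    fft = np.fft.fft(y - y.mean())
    periodogram = (np.abs(fft) ** 2) / (2.0 * np.pi * n)
    j = np.arange(1, int(n ** frac) + 1)        # low frequencies only
    lam = 2.0 * np.pi * j / n
    # Regress log I(lambda_j) on log(4 sin^2(lambda_j/2));
    # under the long-memory model the slope equals -d.
    x = np.log(4.0 * np.sin(lam / 2.0) ** 2)
    slope, _ = np.polyfit(x, np.log(periodogram[j]), 1)
    return -slope

# Example: white noise is short-memory, so the estimate should be near 0.
rng = np.random.default_rng(0)
print(gph_estimate_d(rng.standard_normal(4096)))
```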

3 Statistical Models for Network Traffic Prediction

The tested network traffic is represented by means of time series describing the variation of parameters characterizing the number of TCP, UDP and ICMP packets received and sent within a time unit. A natural way of describing such series are statistical models based on autoregression and moving averages, which account for the differently realized variances and the autocorrelation of the elements creating the given time series.

3.1 Short-Memory Models

In order to describe the properties of short-memory time series (essential autocorrelations of low order only), an often applied approach is the use of the autoregressive model AR, the moving average model MA, and the mixed ARMA models. They can be used for modeling stationary series, i.e. series exhibiting only random fluctuations around the average, or non-stationary series reducible to a stationary form. Their construction is based on the autocorrelation phenomenon, i.e. on the correlation of the predicted variable with values of the same variable delayed in time [5].

Autoregressive Model Numerous time series are composed of interdependent observations, which means that it is possible to estimate model coefficients which describe the successive elements of the series on the basis of the previous, time-lagged elements \(\left( {Y_{t - 1} ,Y_{t - 2} ,\ldots ,Y_{t - p} } \right) \) and the random component \( \varepsilon _t\) in the current period t. The above can be presented as the autoregressive equation of order p, AR(p):

$$\begin{aligned} Y_t = c_0 + \phi _1 Y_{t - 1} + \phi _2 Y_{t - 2} + \cdots + \phi _p Y_{t - p} + \varepsilon _t , \end{aligned}$$
(3)

where \(\phi _1,\phi _2,\ldots ,\phi _p\) are the model parameters, \(c_0\) is a constant, and \(\varepsilon _t \sim \left( {0,\sigma ^2 } \right) \) is the white noise process with zero mean and variance \(\sigma ^2\). AR is a process with memory of previous realizations of the series. Such a process is stationary when all roots of the polynomial \(W\left( z \right) = 1 - \phi _1 z - \phi _2 z^2 - \cdots - \phi _p z^p \) lie outside the unit circle, i.e. each has modulus greater than one. For such a model the prediction is built step by step by recurrent substitution of successive values. For stationary AR(p) processes this prediction converges to the mean value of the process, and the error variance of the forecast converges to the variance of the process.
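
As an illustration of Eq. (3) and the recursive prediction just described, the following minimal sketch, assuming the statsmodels package, fits an AR(1) model to a synthetic stand-in for a per-interval packet-count feature; all numeric values are hypothetical.

```python
import numpy as np
from statsmodels.tsa.ar_model import AutoReg

# Synthetic stand-in for a per-interval packet-count series:
# an AR(1) process with mean 100 and phi_1 = 0.7 (hypothetical values).
rng = np.random.default_rng(1)
y = np.full(500, 100.0)
for t in range(1, 500):
    y[t] = 100 + 0.7 * (y[t - 1] - 100) + rng.normal(scale=5.0)

res = AutoReg(y, lags=1).fit()   # estimates c_0 and phi_1
print(res.params)                # approx. [30.0, 0.7], since c_0 = mean*(1 - phi_1)
fc = res.forecast(steps=20)      # recursive multi-step prediction
print(fc[-1])                    # converges toward the process mean (~100)
```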

Moving Average Model This is a linear model in which the realization of \(Y_t\) in the current period depends on the realization of the random component \(\varepsilon _t\) in the current period and in the q previous periods. It can be presented as the moving average equation of order q, MA(q):

$$\begin{aligned} Y_t = \varepsilon _t + \theta _1 \varepsilon _{t - 1} + \theta _2 \varepsilon _{t - 2} + \cdots + \theta _q \varepsilon _{t - q} , \end{aligned}$$
(4)

where \(\theta _1 ,\theta _2 ,\ldots ,\theta _q \) are the model parameters and \(\varepsilon _t \sim \left( {0,\sigma ^2 } \right) \) is the white noise process with zero mean and variance \(\sigma ^2\). MA is a process with memory of previous values of the random component. Every MA process which can be reduced to a stationary autoregressive process is called invertible. In the general case this condition is fulfilled when the roots of the polynomial \(W\left( z \right) = 1 + \theta _1 z + \theta _2 z^2 + \cdots + \theta _q z^q\) lie outside the unit circle. The prediction made with the MA(q) model is obtained recurrently and converges to the mean value of the process.

Autoregressive Moving Average Model For a stationary series, instead of applying separate models of the AR and MA classes, the connections between observations from subsequent periods can be described with autoregressive moving average models [3], i.e. ARMA(p, q) models with delay orders (p, q), written as

$$\begin{aligned} Y_t = \phi _1 Y_{t - 1} + \phi _2 Y_{t - 2} + \cdots + \phi _p Y_{t - p} + \varepsilon _t + \theta _1 \varepsilon _{t - 1} + \theta _2 \varepsilon _{t - 2} + \cdots + \theta _q \varepsilon _{t - q} , \end{aligned}$$
(5)

where \(\phi _1,\phi _2,\ldots ,\phi _p\) and \(\theta _1,\theta _2,\ldots ,\theta _q\) are the model parameters and \(\varepsilon _t \sim \left( {0,\sigma ^2 } \right) \) is the white noise process with zero mean and variance \(\sigma ^2\). As a result, any linear process can be described with a lower number of AR and MA components than with an AR model or an MA model alone, which is beneficial from the perspective of model estimation and its use in prediction. The ARMA process combines the properties of both AR and MA, which is most easily visible in the decomposition of the ACF function. The ARMA model generates a stationary process if its AR component is stationary and its MA component is invertible. The prediction made with the ARMA(p, q) model is obtained in a recurrent way.
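
A minimal sketch of ARMA estimation and recursive forecasting, assuming statsmodels (which expresses ARMA(p, q) as ARIMA with d = 0); the simulated coefficients are illustrative only.

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.arima_process import arma_generate_sample

# Simulate an ARMA(1, 1) series; arma_generate_sample takes the lag
# polynomials, so the signs follow 1 - phi_1*L and 1 + theta_1*L.
ar = np.array([1.0, -0.6])   # phi_1 = 0.6
ma = np.array([1.0, 0.4])    # theta_1 = 0.4
y = arma_generate_sample(ar, ma, nsample=1000)

# ARMA(p, q) is ARIMA with d = 0; an MA(q) alone would be order (0, 0, q).
res = ARIMA(y, order=(1, 0, 1)).fit()
print(res.params)             # const, phi_1, theta_1, sigma^2
print(res.forecast(steps=5))  # recursive prediction, as described above
```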

3.2 Long-Memory Models

An interesting approach to describing the features of long-memory time series is to combine the autoregressive moving average with fractional differencing. The result is the ARFIMA (Autoregressive Fractionally Integrated Moving Average) model [10], which is a generalization of the ARMA and ARIMA models. Another approach to time series description makes the conditional variance of the process dependent on its previous values using the ARCH model (Autoregressive Conditional Heteroskedastic Model) introduced by Engle [6]. A generalization of this approach is the FIGARCH model (Fractionally Integrated GARCH), whose autocorrelation function of squared model residuals decreases in a hyperbolic way. Such behavior of the autocorrelation function justifies calling FIGARCH a long-memory model in the context of the autocorrelation function of the squared residuals.

3.3 Introduction to ARFIMA Model

The Autoregressive Fractionally Integrated Moving Average model, ARFIMA(p, d, q), is a combination of fractionally differenced noise and the autoregressive moving average, proposed by Granger, Joyeux and Hosking in order to analyze the long-memory property [10, 12].

The ARFIMA(p, d, q) model for a time series \(y_t\) is written as:

$$\begin{aligned} \Phi (L)(1 - L)^d y_t = \Theta (L)\varepsilon _t, \quad t = 1,2,\ldots ,\Omega , \end{aligned}$$
(6)

where \(y_t\) is the time series, \(\varepsilon _t \sim (0,\sigma ^2 )\) is the white noise process with zero mean and variance \(\sigma ^2\), \( \Phi (L) = 1 - \phi _1 L - \phi _2 L^2 - \cdots - \phi _p L^p \) is the autoregressive polynomial and \(\Theta (L) = 1 + \theta _1 L + \theta _2 L^2 + \cdots + \theta _q L^q \) is the moving average polynomial, L is the backward shift operator and \((1-L)^d\) is the fractional differencing operator given by the following binomial expansion:

$$\begin{aligned} (1 - L)^d = \sum \limits _{k = 0}^\infty \binom{d}{k} ( - 1)^k L^k \end{aligned}$$
(7)

and

$$\begin{aligned} \binom{d}{k} ( - 1)^k = \frac{{\Gamma (d + 1)( - 1)^k }}{{\Gamma (d - k + 1)\Gamma (k + 1)}} = \frac{{\Gamma ( - d + k)}}{{\Gamma ( - d)\Gamma (k + 1)}}, \end{aligned}$$
(8)

where \(\Gamma (\cdot )\) denotes the gamma function, d is the number of differences required to give a stationary series, and \((1-L)^d\) is the \(d^{th}\) power of the differencing operator. When \(d\in (-0.5 \), 0.5), the ARFIMA(p, d, q) process is stationary, and if \(d\in (0\), 0.5) the process presents long-memory behavior.
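
In practice the expansion (7)-(8) is computed with a simple recursion rather than with Gamma functions, which overflow quickly. The sketch below (numpy only) implements the truncated operator; the truncation lag k_max = 100 is an arbitrary illustrative choice.

```python
import numpy as np

def frac_diff_weights(d, k_max):
    """Coefficients pi_k of (1 - L)^d = sum_k pi_k L^k from Eqs. (7)-(8),
    via the recursion pi_0 = 1, pi_k = pi_{k-1} * (k - 1 - d) / k, which is
    algebraically identical to the Gamma-function ratio but avoids overflow."""
    w = np.empty(k_max + 1)
    w[0] = 1.0
    for k in range(1, k_max + 1):
        w[k] = w[k - 1] * (k - 1 - d) / k
    return w

def frac_diff(y, d, k_max=100):
    """Apply the operator (1 - L)^d, truncated after k_max lags."""
    w = frac_diff_weights(d, k_max)
    return np.convolve(np.asarray(y, dtype=float), w, mode="full")[: len(y)]

# For d = 0.3 the weights decay hyperbolically, as expected of long memory:
print(frac_diff_weights(0.3, 5))   # [1, -0.3, -0.105, -0.0595, ...]
```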

Forecasting ARFIMA processes is usually carried out by using an infinite autoregressive representation of (6), written as \(\Pi \left( L \right) y_t = \varepsilon _t\), or

$$\begin{aligned} y_t = \sum \limits _{i = 1}^\infty \pi _i y_{t - i} + \varepsilon _t , \end{aligned}$$
(9)

where \(\Pi \left( L \right) = 1 - \pi _1 L - \pi _2 L^2 - \cdots = \Phi \left( L \right) \left( {1 - L} \right) ^d \Theta \left( L \right) ^{ - 1} \). In terms of practical implementation, this form needs truncation after k lags, but there is no obvious way of doing it. The truncation problem is also related to the forecast horizon considered in the predictions (see [12]). From (9) it is clear that the forecasting rule will pick up the influence of distant lags, thus capturing their persistent influence. However, if a shift in the process occurs, pre-shift lags will also have some weight in the prediction, which may cause biases at post-shift horizons [8].
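
For intuition, the following sketch applies the truncated rule (9) in the pure fractional-noise case ARFIMA(0, d, 0), where \(\Pi (L) = (1-L)^d\) and hence \(\pi _i = -w_i\) for the fractional-differencing coefficients \(w_i\); it reuses frac_diff_weights from the previous sketch, and the truncation lag k = 100 is again an arbitrary assumption.

```python
import numpy as np
# frac_diff_weights is the helper defined in the previous sketch.

def arfima00d_one_step(y, d, k=100):
    """One-step forecast from the truncated AR(infinity) form (9) for an
    ARFIMA(0, d, 0) process: Pi(L) = (1 - L)^d, so pi_i = -w_i where the
    w_i are the fractional-differencing coefficients."""
    pi = -frac_diff_weights(d, min(k, len(y)))[1:]
    past = np.asarray(y)[::-1][: len(pi)]   # y_t, y_{t-1}, ... newest first
    return float(pi[: len(past)] @ past)

# Simulate fractional noise by inverting (1 - L)^d on white noise,
# then forecast the next value from the truncated history.
rng = np.random.default_rng(2)
eps = rng.standard_normal(2000)
w = frac_diff_weights(0.3, len(eps))
y_sim = np.empty_like(eps)
for t in range(len(eps)):   # y_t = eps_t - sum_{i=1}^{t} w_i y_{t-i}
    y_sim[t] = eps[t] - w[1 : t + 1][::-1] @ y_sim[:t]
print(arfima00d_one_step(y_sim, d=0.3))   # forecast of the next value
```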

3.4 FIGARCH Model

The model enabling the description of long memory in variance is FIGARCH(p, d, q) (Fractionally Integrated GARCH), introduced by Baillie et al. [1]. The FIGARCH(p, d, q) model for a time series \(y_t\) can be written as:

$$\begin{aligned} y_t = \mu + \varepsilon _t , \quad t = 1,2,\ldots ,\Omega , \end{aligned}$$
(10)
$$\begin{aligned} \varepsilon _t = z_t \sqrt{h_t } , \quad \varepsilon _t |\Theta _{t - 1} \sim N\left( {0,h_t } \right) , \end{aligned}$$
(11)
$$\begin{aligned} h_t = \alpha _0 + \beta \left( L \right) h_t + \left[ {1 - \beta \left( L \right) - \left[ {1 - \phi \left( L \right) } \right] \left( {1 - L} \right) ^d } \right] \varepsilon _t^2 , \end{aligned}$$
(12)

where \(z_t\) is a zero-mean and unit variance process, \(h_t\) is a positive time dependent conditional variance defined as \(h_t = E\left( {\varepsilon _t^2 |\Theta _{t - 1} } \right) \) and \(\Theta _{t - 1}\) is the information set up to time \(t-1\).

The FIGARCH(p, d, q) model of the conditional variance can be motivated as an ARFIMA model applied to the squared innovations:

$$\begin{aligned} \varphi \left( L \right) \left( {1 - L} \right) ^d \varepsilon _t^2 = \alpha _0 + \left( {1 - \beta \left( L \right) } \right) \vartheta _t , \quad \vartheta _t = \varepsilon _t^2 - h_t , \end{aligned}$$
(13)

where \(\varphi \left( L \right) = 1 - \varphi _1 L - \varphi _2 L^2 - \cdots - \varphi _p L^p \) and \(\beta \left( L \right) = \beta _1 L + \beta _2 L^2 + \cdots + \beta _q L^q \); both \(\varphi \left( L \right) \) and \(\left( {1 - \beta \left( L \right) } \right) \) have all their roots outside the unit circle, L is the lag operator, and \(0<d<1\) is the fractional integration parameter. If \(d=0\), the FIGARCH model reduces to GARCH; for \(d=1\) it becomes the IGARCH model. However, the FIGARCH model does not always reduce to the GARCH model. If a GARCH process is stationary in the wide sense, the influence of the current variance on its forecast values decreases to zero at an exponential rate. In the IGARCH case the current variance has an indefinite influence on the forecast of the conditional variance. For a FIGARCH process this influence decreases to zero far more slowly than in the GARCH process, i.e. according to a hyperbolic function [1, 18]. For the practical implementation of prediction with the FIGARCH model see [18].
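
For experimentation, the arch Python package (which provides a FIGARCH volatility process in versions 4.8 and later) can estimate the model by quasi-maximum likelihood. The sketch below uses white noise as a placeholder for a traffic series and is not the estimation pipeline of [18].

```python
import numpy as np
from arch import arch_model

# Placeholder series; in practice y would be a per-interval traffic rate.
rng = np.random.default_rng(3)
y = rng.standard_normal(2000)

# FIGARCH(1, d, 1) conditional variance with a constant mean; d is
# estimated jointly with the other parameters by quasi-maximum likelihood.
am = arch_model(y, mean="Constant", vol="FIGARCH", p=1, q=1)
res = am.fit(disp="off")
print(res.params)               # mu, omega, phi, d, beta
fc = res.forecast(horizon=5)
print(fc.variance.iloc[-1])     # forecasts of the conditional variance h_t
```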

Table 1 Network traffic features used for experiments
Table 2 Detection Rate DR [%] for a given network traffic features
Table 3 False Positive FP [%] for a given network traffic features

4 Parameters Estimation and Choice of Model

The aim of searching for a useful forecasting model is not to use the greatest number of parameters that most accurately describe the variation of the analyzed time series. Too close a fit may capture not only the part of the process called the signal but also random noise, which in finite samples can be mistaken for genuine regularities. The objective is rather to discover a model which describes the most important properties of the analyzed time series by means of a finite number of statistically significant parameters [7].

The most often used method of parameter estimation for autoregressive models is Maximum Likelihood Estimation (MLE). The basic problem appearing while using this method is the necessity to specify the whole model, and consequently the sensitivity of the obtained estimator to possible errors in the specification of the polynomials AR and MA, which are responsible for the process dynamics [9].

There is no universal criterion for the choice of the model. Usually, the more complex the model, the larger its likelihood function. Therefore, one searches for a compromise between the number of parameters occurring in the model and the value of the likelihood function. The choice of a parsimonious form of the model is performed on the basis of information criteria such as Akaike (AIC) or Schwarz (SIC). In our article, for parameter estimation and choice of the model, we utilized the Maximum Likelihood Method due to its relative simplicity and computational efficiency. In order to estimate the orders of the AR and MA models we used the autocorrelation function ACF and the partial autocorrelation function PACF. For the ARMA model we used the Box-Jenkins procedure [2]. For the ARFIMA model we applied the HR estimator (described in Haslett and Raftery's work [11]) and an automatic model order selection algorithm based on information criteria (see Hyndman and Khandakar [14]). For the estimation of the FIGARCH model we used the methodology described in [18].
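
The described compromise between the likelihood value and the number of parameters can be sketched, assuming statsmodels, as a grid search over ARMA orders scored by an information criterion; this illustrates the selection principle only, not the exact Box-Jenkins or Hyndman-Khandakar procedures used in the paper.

```python
import itertools
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

def select_arma_order(y, max_p=3, max_q=3):
    """Grid search over ARMA(p, q) orders scored by the Akaike criterion;
    res.bic would score by the Schwarz criterion (SIC) instead."""
    best_aic, best_order = np.inf, None
    for p, q in itertools.product(range(max_p + 1), range(max_q + 1)):
        if p == 0 and q == 0:
            continue
        try:
            res = ARIMA(y, order=(p, 0, q)).fit()
        except Exception:
            continue   # skip orders where the MLE fails to converge
        if res.aic < best_aic:
            best_aic, best_order = res.aic, (p, q)
    return best_order, best_aic
```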

5 Experimental Results

The experimental results are based on a traffic feature set taken from a SNORT-based [17] preprocessor which we proposed in [16]. We used the 26 traffic features presented in Table 1. For algorithm evaluation we used Kali Linux [15] tools to simulate real-world attacks in a controlled network environment (for example: application-specific DDoS, various port scanning, DoS, DDoS, SYN flooding, packet fragmentation, spoofing and others).

For anomaly detection we used statistical algorithms with short- and long-memory dependence: ARMA, FIGARCH and ARFIMA. Tables 2 and 3 present the DR \([\%]\) and FP \([\%]\) results for the three algorithms. The most promising results in terms of DR and FP were achieved for the long-memory ARFIMA model (with FP less than \(10\,\%\)).
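
The detection step itself reduces to a residual-threshold rule. The following sketch is an illustrative assumption (numpy only), not the exact detector used in the experiments: an interval is flagged when the prediction error exceeds k standard deviations of the residuals observed on attack-free training traffic. The choice of k trades off the detection rate DR against the false positive rate FP reported in Tables 2 and 3.

```python
import numpy as np

def detect_anomalies(observed, predicted, train_residuals, k=3.0):
    """Flag intervals whose one-step prediction error exceeds k standard
    deviations of the residuals measured on attack-free traffic."""
    threshold = k * np.std(np.asarray(train_residuals))
    errors = np.abs(np.asarray(observed) - np.asarray(predicted))
    return errors > threshold   # boolean anomaly mask, one flag per interval
```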

6 Conclusion

Ensuring a sufficient level of safety for resources and information systems is a question that is currently intensively studied and developed by many research centers in the world. A growing number of novel attacks, their global reach and their level of complexity enforce the dynamic development of network safety systems. The most often implemented mechanisms aiming to ensure security are methods of detection and classification of abnormal behaviors reflected in the analyzed traffic.

In the present article, we compare the predictive properties of the analyzed statistical models in terms of their effectiveness at detecting anomalies in network traffic. The analyzed models were those of long and short memory, reflected in the autocorrelation strength of the elements composing a given time series. Parameter estimation and identification of the model order were realized as a compromise between the model's coherence and the size of its estimation error. While implementing the described models, diverse statistical estimates were obtained for the analyzed signals of the network traffic. In order to detect anomalies in the network traffic we used the differences between the real network traffic and its estimated model for the analyzed parameters characterizing the number of TCP, UDP and ICMP packets received or sent within a time unit. The results of the performed experiments show the advantage of the predictive models ARFIMA and FIGARCH in network traffic anomaly detection.