
1 Introduction

At present, the biggest challenge for information systems is providing proper protection against threats. The growing number of attacks, their spreading range, and their complexity force the dynamic development of network protection systems. This protection is realized by mechanisms supervising and monitoring the security of computer networks, implemented as Intrusion Detection/Prevention Systems (IDS/IPS), which detect attacks directed at the broadly understood network resources of information systems [1].

The techniques used in IDS systems based on statistical methods can be divided into two groups. The first consists of methods using threshold analysis, which examine the frequency of events and whether their limits are exceeded within a given time unit; an attack is reported when the examined quantities exceed certain threshold values (a minimal code sketch of this idea is given at the end of this section). A crucial drawback of these methods is their susceptibility to errors caused by temporary sharp rises in legitimate network traffic, together with the difficulty of setting the reference levels that trigger an alarm [2]. The second group consists of methods detecting statistical anomalies on the basis of estimated profiles of specific network traffic parameters. The profiles are characterized by average quantities, i.e., the number of IP packets, the number of newly established connections per time unit, the ratio of packets of individual network protocols, etc. Some statistical dependencies result from the time of day (for instance, greater network traffic shortly after the start of the working day). It is also possible to keep statistics for individual network protocols (for example, the ratio of SYN to FIN packets of the TCP protocol). IDS systems based on these methods are able to learn a typical network profile (a process lasting from a few to several weeks) and then compare the current network activity with the memorized profile. The comparison of the two profiles provides a basis for determining whether a disturbance (for instance, an attack) is occurring in the network [3].

The primary advantage of methods based on anomaly detection is their ability to identify unknown attack types, because they do not depend on information about what a particular attack looks like, but on what does not correspond to the regular norms of the network traffic. Therefore, anomaly-based IDS/IPS systems are more effective than signature-based systems at identifying new, unknown attack types. Anomaly detection methods have been the topic of numerous surveys and review articles [4]. The works describing these methods use techniques based on machine learning, neural networks, and expert systems. At present, the most intensively developed anomaly detection methods are those based on statistical models describing the analyzed network traffic. The most often used models are the autoregressive ARMA and ARIMA models and the conditional heteroskedastic models ARCH and GARCH, which allow the estimation of profiles of normal network traffic [4, 5].

In the present article, we propose using the estimation of the ARFIMA and FIGARCH statistical models for defined behavior profiles of a given network traffic. The process of anomaly (network attack) detection is realized by comparing the parameters of normal behavior (predicted on the basis of the tested statistical models) with the parameters of the real network traffic.

This paper is organized as follows. After the introduction, Sect. 2 presents an overview of DDoS attacks. Sect. 3 describes the ARFIMA and FIGARCH models for data traffic prediction in detail. Sect. 4 covers parameter estimation and the choice of the model. Experimental results and conclusions are given thereafter.
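As a minimal illustration of the threshold analysis described above, the following sketch (Python; the window length and alarm limit are assumed values for illustration, not taken from any cited system) counts events per time unit and reports the windows whose counts exceed a fixed reference level:

```python
import numpy as np

def threshold_alarms(event_times, window=1.0, limit=1000):
    """Return the start time of every window whose event count exceeds `limit`."""
    event_times = np.sort(np.asarray(event_times, dtype=float))
    edges = np.arange(event_times[0], event_times[-1] + window, window)
    counts, _ = np.histogram(event_times, bins=edges)
    return edges[:-1][counts > limit]

# Example: 1 s windows, alarm if more than 1000 events arrive in a window
rng = np.random.default_rng(0)
times = np.cumsum(rng.exponential(1 / 800, size=10_000))  # roughly 800 events/s
print(threshold_alarms(times, window=1.0, limit=1000))
```

A legitimate burst of traffic trips exactly the same rule as an attack, which is the susceptibility to false alarms noted above and the motivation for the profile-based methods studied in this paper.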

2 Overview of DDoS Attacks

Currently, DoS and DDoS attacks have become an important issue for broadly defined IT infrastructure security. The victims of these attacks range from single personal computers to supercomputers and vast networks. The outcomes of such activities are experienced by regular Internet users, the biggest technology companies that often provide mass services, and powerful governmental organizations in many countries. Despite the substantial effort and funds directed toward enhancing IT security procedures, we are at present unable to protect against such attacks effectively.

Attacks such as distributed denial of service (DDoS) use the already known techniques of denial of service (DoS) realized with new technology. A DoS attack has two crucial limitations. First, it is performed from a single computer, whose Internet connection bandwidth is too low compared to the bandwidth of the victim. Second, while performing the attack from one computer, the attacker is exposed to faster detection. Therefore, DoS attacks are often conducted against smaller servers hosting WWW sites. Attacks on bigger targets, for instance a portal or a DNS server, require a more sophisticated method: DDoS, i.e., Distributed Denial of Service, which was created in response to the limitations of DoS. The main difference between the two methods is quantitative: in DDoS, an attack is performed not from a single computer, but simultaneously from numerous compromised machines. The sole idea of a DDoS attack is therefore simple. What constitutes a challenge is its preparation, which sometimes lasts many months. The reason is obvious: it is necessary to take over enough computers to make the attack successful. The more powerful the victim's system resources, the longer the period of preparation.

Why are DDoS attacks so dangerous? Most of all, they are difficult to deter because their source is highly distributed. What is worse, the administrators of the participating hosts most often do not realize that they are actively taking part in the attacks. The statistics are appalling: a survey carried out by the University of California, San Diego, indicates that approximately 15,000 DDoS attacks are performed monthly.

There are a number of methods for conducting a DDoS attack. The first targets memory: every operating system requires free memory space, and if the attacker succeeds in allocating all the available memory, the system will, in theory, stop functioning, or at least its performance will fall drastically. Such a crude attack is able to block the normal work of even the most efficient IT systems. The second method is based on exploiting the restrictions of file systems. The third consists in exploiting malfunctioning network applications, kernel bugs, or errors in the operating system configuration. It is much easier to protect against this kind of attack through proper system configuration. It is, above all, characteristic of the DoS method, which, in contrast with DDoS, is usually not based on sending a great number of requests. Errors in the TCP/IP stacks of different operating systems are an example here: in extreme cases, sending a few packets is enough to remotely hang the server. The last method is generating network traffic large enough that routers or servers cannot handle it [6].

Attacks of this kind are becoming a more and more serious problem. According to quarterly reports published by the Prolexic company, within the last 12 months the number of DDoS attacks has risen by 22 %. Campaigns last longer: not 28.5 h as previously, but 34.5 h (a rise of 21 %). The average traffic generated during an attack is approximately 2 Gbit/s, roughly 25 % greater than in 2013. The record so far was an attack on Spamhaus, an organization dedicated to the fight against spam: in March 2013, hostile network traffic was directed toward the servers of that organization at a rate of 300 Gbit/s. However, according to the Arbor company, most attacks (over 60 %) still do not exceed 1 Gbit/s. Nevertheless, they still constitute a serious threat [7].

The reason DDoS attacks are so problematic is that there are currently no effective means and methods of fully protecting IT systems against them. It is only possible to limit the outcomes of these attacks through early identification. One such solution is the detection of the network traffic anomalies that are the aftermath of a DDoS attack.

3 Statistical Models for Network Traffic Prediction

Most research on the statistical analysis of time series concerns processes characterized by weak or absent dependence between observations separated by a period of time. Nevertheless, in numerous applications there is a need to model processes whose autocorrelation function decays slowly and for which the relation between distant observations, even though small, is essential. An interesting approach to the properties of long-memory time series was to combine autoregressive moving average modeling with fractional differencing. The result was the ARFIMA (Autoregressive Fractionally Integrated Moving Average) model, introduced by Granger, Joyeux, and Hosking [8, 9]; ARFIMA is a generalization of the ARMA and ARIMA models. Another approach to describing time series takes into account the dependence of the conditional variance of the process on its previous values, using the ARCH (Autoregressive Conditional Heteroskedastic) model introduced by Engle [10]. A generalization of this approach is the FIGARCH (Fractionally Integrated GARCH) model introduced by Baillie et al. [11].

3.1 ARFIMA Model

The autoregressive fractionally integrated moving average model, ARFIMA (p, d, q), is a combination of fractionally differenced noise and an autoregressive moving average model, proposed by Granger, Joyeux, and Hosking in order to analyze the long-memory property [8, 9]. The ARFIMA (p, d, q) model for a time series \(y_t\) is written as

$$\begin{aligned} \varPhi (L)(1 - L)^d y_t = \varTheta (L)\varepsilon _t , \qquad t = 1,2,\ldots ,\varOmega , \end{aligned}$$
(1)

where \(y_t\) is the time series, \(\varepsilon _t \sim (0,\sigma ^2 )\) is the white noise process with zero mean and variance \(\sigma ^2\), \( \varPhi (L) = 1 - \phi _1 L - \phi _2 L^2 - \cdots - \phi _p L^p \) is the autoregressive polynomial, \(\varTheta (L) = 1 + \theta _1 L + \theta _2 L^2 + \cdots + \theta _q L^q \) is the moving average polynomial, L is the backward shift operator, and \((1-L)^d\) is the fractional differencing operator given by the following binomial expansion:

$$\begin{aligned} (1 - L)^d = \sum \limits _{k = 0}^\infty \binom{d}{k} ( - 1)^k L^k \end{aligned}$$
(2)

and

$$\begin{aligned} \binom{d}{k} ( - 1)^k = \frac{{\varGamma (d + 1)( - 1)^k }}{{\varGamma (d - k + 1)\varGamma (k + 1)}} = \frac{{\varGamma ( - d + k)}}{{\varGamma ( - d)\varGamma (k + 1)}}, \end{aligned}$$
(3)

where \(\varGamma (\cdot )\) denotes the gamma function, d is the number of differences required to give a stationary series, and \((1-L)^d\) is the dth power of the differencing operator. When \(d \in (-0.5, 0.5)\), the ARFIMA (p, d, q) process is stationary, and if \(d \in (0, 0.5)\) the process exhibits long-memory behavior. Forecasting of ARFIMA processes is usually carried out using the infinite autoregressive representation of (1), written as \( \varPi \left( L \right) y_t = \varepsilon _t \), i.e.,

$$\begin{aligned} y_t = \sum \limits _{i = 1}^\infty \pi _i y_{t - i} + \varepsilon _t , \end{aligned}$$
(4)

where \(\varPi \left( L \right) = 1 - \pi _1 L - \pi _2 L^2 - \cdots = \varPhi \left( L \right) \left( 1 - L \right) ^d \varTheta \left( L \right) ^{ - 1}\). In terms of practical implementation, this form needs to be truncated after k lags, but there is no obvious way of choosing the truncation point. The truncation problem is also related to the forecast horizon considered in predictions (see [12]). From (4), it is clear that the forecasting rule picks up the influence of distant lags, thus capturing their persistent influence. However, if a shift in the process occurs, pre-shift lags will also carry some weight in the prediction, which may cause biases at post-shift horizons [13].
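For concreteness, the weights of the fractional differencing operator in (2)-(3) satisfy a simple recursion, so the truncated operator is easy to apply in practice. The following sketch (Python with NumPy; the function names and the truncation at 100 lags are our illustrative choices, not part of any cited package) computes the weights and fractionally differences a series:

```python
import numpy as np

def frac_diff_weights(d, n_lags):
    """Coefficients w_k of (1 - L)^d = sum_k w_k L^k, computed via the
    recursion w_0 = 1, w_k = w_{k-1} * (k - 1 - d) / k (cf. Eq. (3))."""
    w = np.empty(n_lags + 1)
    w[0] = 1.0
    for k in range(1, n_lags + 1):
        w[k] = w[k - 1] * (k - 1 - d) / k
    return w

def frac_diff(y, d, n_lags=100):
    """Apply the fractional differencing operator truncated at n_lags."""
    w = frac_diff_weights(d, n_lags)
    y = np.asarray(y, dtype=float)
    out = np.full(len(y), np.nan)          # warm-up samples stay undefined
    for t in range(n_lags, len(y)):
        # w_0 * y_t + w_1 * y_{t-1} + ... + w_{n_lags} * y_{t-n_lags}
        out[t] = w @ y[t - n_lags : t + 1][::-1]
    return out
```

For \(d \in (0, 0.5)\) the weights decay hyperbolically, which is exactly why distant lags keep a non-negligible influence and why the choice of truncation point matters at long forecast horizons.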

3.2 FIGARCH Model

The model enabling the description of long memory in variance series is FIGARCH (p, d, q) (Fractionally Integrated GARCH), introduced by Baillie, Bollerslev, and Mikkelsen [11]. The FIGARCH (p, d, q) model for a time series \(y_t\) can be written as

$$\begin{aligned} y_t = \mu + \varepsilon _t , \qquad t = 1,2,\ldots ,\varOmega , \end{aligned}$$
(5)
$$\begin{aligned} \varepsilon _t = z_t \sqrt{h_t } , \qquad \varepsilon _t |\varTheta _{t - 1} \sim N\left( {0,h_t } \right) , \end{aligned}$$
(6)
$$\begin{aligned} h_t = \alpha _0 + \beta \left( L \right) h_t + \left[ {1 - \beta \left( L \right) - \left[ {1 - \phi \left( L \right) } \right] \left( {1 - L} \right) ^d } \right] \varepsilon _t^2 , \end{aligned}$$
(7)

where \(z_t\) is a zero-mean, unit-variance process, \(h_t\) is a positive time-dependent conditional variance defined as \(h_t = E\left( {\varepsilon _t^2 |\varTheta _{t - 1} } \right) \), and \(\varTheta _{t - 1}\) is the information set up to time \(t-1\). The FIGARCH (p, d, q) model of the conditional variance can be motivated as an ARFIMA model applied to the squared innovations:

$$\begin{aligned} \varphi \left( L \right) \left( {1 - L} \right) ^d \varepsilon _t^2 = \alpha _0 + \left( {1 - \beta \left( L \right) } \right) \vartheta _t , \qquad \vartheta _t = \varepsilon _t^2 - h_t , \end{aligned}$$
(8)

where \(\varphi \left( L \right) = \varphi _1 L - \varphi _2 L^2 - \cdots - \varphi _p L^p \) and \(\beta \left( L \right) = \beta _1 L + \beta _2 L^2 + \cdots + \beta _q L^q \), with \(\left( {1 - \beta \left( L \right) } \right) \) having all its roots outside the unit circle; L is the lag operator and \(0<d<1\) is the fractional integration parameter. If \(d=0\), the FIGARCH model reduces to GARCH; for \(d=1\), it becomes the IGARCH model. However, the FIGARCH model does not always reduce to the GARCH model. If a GARCH process is stationary in the broad sense, the influence of the current variance on its forecast values decreases to zero at an exponential rate. In the IGARCH case, the current variance has an indefinite influence on the forecast of the conditional variance. For a FIGARCH process, this influence decreases to zero far more slowly than for a GARCH process, namely according to a hyperbolic function [11, 14]. Rearranging the terms in (8), an alternative representation of the FIGARCH (p, d, q) model may be obtained as

$$\begin{aligned} \left[ {1 - \beta \left( L \right) } \right] h_t = \alpha _0 + \left[ {1 - \beta \left( L \right) - \varphi \left( L \right) } \right] \left( {1 - L} \right) ^d \varepsilon _t^2 . \end{aligned}$$
(9)

From (9), the conditional variance \(h_t\) of \(y_t\) is given by

$$\begin{aligned} h_t = \alpha _0 \left[ {1 - \beta \left( 1 \right) } \right] ^{ - 1} + \lambda \left( L \right) \varepsilon _t^2, \end{aligned}$$
(10)

where \(\lambda \left( L \right) = \lambda _1 L + \lambda _2 L^2 + \cdots \). Of course, for the FIGARCH (p, d, q) model in (8) to be well-defined, the conditional variance in the \(\mathrm{ARCH}\left( \infty \right) \) representation (10) must be non-negative, i.e., \(\lambda _k \ge 0\) for \(k=1,2,\ldots \). The forecast of the conditional variance based on Eq. (10) may be obtained as

$$\begin{aligned} h_{t + 1} = \alpha _0 \left[ {1 - \beta \left( 1 \right) } \right] ^{ - 1} + \lambda _1 \varepsilon _t^2 + \lambda _2 \varepsilon _{t - 1}^2 + \cdots \end{aligned}$$
(11)

The one-step ahead forecast of \(h_t\) is given by

$$\begin{aligned} h_t \left( 1 \right) = \alpha _0 \left[ {1 - \beta \left( 1 \right) } \right] ^{ - 1} + \lambda _1 \varepsilon _t^2 + \lambda _2 \varepsilon _{t - 1}^2 + \cdots \end{aligned}$$
(12)

By analogy, the two-step ahead forecast is given by

$$\begin{aligned} h_t \left( 2 \right) = \alpha _0 \left[ {1 - \beta \left( 1 \right) } \right] ^{ - 1} + \lambda _1 h_t \left( 1 \right) + \lambda _2 \varepsilon _t^2 + \cdots \end{aligned}$$
(13)

In general, the n-step-ahead forecast can be written as

$$\begin{aligned} h_t \left( n \right) = \alpha _0 \left[ {1 - \beta \left( 1 \right) } \right] ^{ - 1} + \lambda _1 h_t \left( {n - 1} \right) + \cdots + \lambda _{n - 1} h_t \left( 1 \right) + \lambda _n \varepsilon _t^2 + \lambda _{n + 1} \varepsilon _{t - 1}^2 + \cdots \end{aligned}$$
(14)

In practical applications, the expansion is stopped at some large N, which leads to the forecasting equation

$$\begin{aligned} h_t \left( n \right) \approx \alpha _0 \left[ {1 - \beta \left( 1 \right) } \right] ^{ - 1} + \sum \limits _{i = 1}^{n - 1} \lambda _i h_t \left( {n - i} \right) + \sum \limits _{j = 0}^N \lambda _{n + j} \varepsilon _{t - j}^2 . \end{aligned}$$
(15)

The parameters will have to be replaced by their corresponding estimates [14].
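The recursion (15) is straightforward to implement once estimates of \(\alpha _0\), \(\beta (1)\), and the weights \(\lambda _k\) are available. The following sketch (Python; it assumes \(q = 1\), so that \(\beta (1) = \beta _1\), and takes the truncated weights \(\lambda _1, \ldots, \lambda _N\) as given inputs rather than deriving them from (10)) is our illustration, not code from [14]:

```python
import numpy as np

def figarch_forecast(alpha0, beta1, lam, eps2, n):
    """n-step-ahead conditional variance forecasts h_t(1..n) via the
    truncated recursion of Eq. (15), assuming q = 1 so beta(1) = beta1.
    lam  : weights lambda_1 .. lambda_N of the truncated expansion
    eps2 : observed squared innovations, most recent last"""
    base = alpha0 / (1.0 - beta1)          # alpha_0 [1 - beta(1)]^{-1}
    N = len(lam)
    hist = list(eps2[::-1])                # eps_t^2, eps_{t-1}^2, ...
    h = []                                 # h_t(1), h_t(2), ...
    for step in range(1, n + 1):
        acc = base
        for i in range(1, step):           # lambda_i * h_t(step - i)
            acc += lam[i - 1] * h[step - i - 1]
        for j, e2 in enumerate(hist):      # lambda_{step+j} * eps_{t-j}^2
            if step + j > N:
                break
            acc += lam[step + j - 1] * e2
        h.append(acc)
    return h

# Example with assumed estimates and hyperbolically decaying weights
lam = 0.3 * np.arange(1, 201, dtype=float) ** (-0.6)
print(figarch_forecast(alpha0=0.1, beta1=0.4, lam=lam, eps2=np.ones(200), n=3))
```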

4 Parameter Estimation and the Choice of Model

The most often used methods for estimating the parameters of autoregressive models are the maximum likelihood method (MLE) and the quasi-maximum likelihood method (QMLE). This is due to the fact that estimation of the parameters by means of both methods is relatively simple and effective. The basic computational problem of the MLE method is finding a solution to the equation

$$\begin{aligned} \frac{{\partial \ln \left( {L_\varOmega \left( \theta \right) } \right) }}{{\partial \theta }} = 0, \end{aligned}$$
(16)

where \(\theta \) is the estimated set of parameters, \(L_\varOmega \left( \theta \right) \) is the likelihood function, and \(\varOmega \) is the number of observations. In the general case, an analytic solution to Eq. (16) is usually impossible, and numerical estimation is then employed. The basic problem occurring when using the maximum likelihood method is the necessity to specify the whole model, and consequently the sensitivity of the resulting estimator to any errors in the specification of the AR and MA polynomials responsible for the dynamics of the process [15, 16]. There is no universal criterion for the choice of the model. Usually, the more complex the model, the greater the value of the likelihood function, so the model fits the data better; however, estimating a larger number of parameters entails larger errors. Therefore, it is crucial to find a compromise between the number of parameters in the model and the value of the likelihood function. The choice of a parsimonious form of the model is often based on information criteria such as the Akaike (AIC) or Schwarz (SIC) criterion. Their values can be computed from the following formulas:

$$\begin{aligned} AIC\left( \rho \right) = - 2\ln \left( {L_\varOmega \left( \rho \right) } \right) + 2\rho , \end{aligned}$$
(17)
$$\begin{aligned} SIC\left( \rho \right) = - 2\ln \left( {L_\varOmega \left( \rho \right) } \right) + \rho \ln \left( \varOmega \right) , \end{aligned}$$
(18)

where \(\rho \) is the number of the model's parameters. Among the candidate forms of the model, the one with the smallest information criterion value is chosen [12, 17]. In our article, we used the maximum likelihood method for parameter estimation and the above information criteria for the choice of the form of the model. The method was chosen for its relative simplicity and computational efficiency. For the ARFIMA model, we used the HR estimator (described by Haslett and Raftery [18]) and an automatic model selection algorithm based on the information criteria (see Hyndman and Khandakar [19]). For FIGARCH model estimation, we used the methodology described in [14].
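The selection step can be illustrated as a grid search over the ARMA orders that minimizes the AIC of Eq. (17). The sketch below uses Python with statsmodels; since statsmodels has no native ARFIMA estimator, it assumes the series has already been fractionally differenced (e.g., with the frac_diff sketch from Sect. 3.1), and the white-noise stand-in series and order bounds are our illustrative choices:

```python
import itertools
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

def select_arma_by_aic(z, max_p=3, max_q=3):
    """Fit ARMA(p, q) candidates by maximum likelihood; return the best AIC."""
    best = None
    for p, q in itertools.product(range(max_p + 1), range(max_q + 1)):
        try:
            res = ARIMA(z, order=(p, 0, q)).fit()
        except Exception:
            continue                       # skip candidates that fail to converge
        if best is None or res.aic < best[0]:
            best = (res.aic, (p, q), res)
    return best

z = np.random.default_rng(0).normal(size=1000)   # stand-in differenced series
aic, order, res = select_arma_by_aic(z)
print(f"selected ARMA{order} with AIC = {aic:.1f}")
```

Replacing res.aic with the SIC of Eq. (18) (available in statsmodels as res.bic) penalizes extra parameters more heavily and tends to select smaller models.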

Table 1 Network traffic features used for experiments

5 Experimental Results

In this section, we present some results of using the ARFIMA and FIGARCH statistical models for DDoS attack detection. We simulated real-world DDoS and application-specific DDoS attacks on a single-LAN test network. As a network sensor, we used the SNORT IDS [20]. In our setup, SNORT is responsible for traffic capture and for extracting the network traffic features (see Table 1). Additionally, we used a traffic testbed that contains DDoS attacks [21]. Twelve traffic features were used for the evaluation of the presented ARFIMA and FIGARCH statistical models. Obviously, not all traffic features were sufficient for detecting all simulated attacks, because the attacks do not have an impact on the entire set of traffic features presented in Table 1. In Table 2 we present detection rate (DR) and false positive (FP) values for the 12 traffic features. Additionally, Tables 3 and 4 present the results for the external testbed for 4 days of network traffic, respectively. We can conclude that the ARFIMA model gives better DR and FP results for the testbed used in our experiments.
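As an illustration of the per-feature decision rule, the following hedged sketch (Python with the third-party arch package, which provides a FIGARCH volatility model; the synthetic stand-in series, the constant-mean specification, and the threshold k = 3 are our assumptions, not settings from the experiments) fits the model to one traffic feature and flags samples deviating from the modeled profile by more than k conditional standard deviations. In a deployment, the model would be estimated on attack-free training traffic and applied to new observations through its forecasts:

```python
import numpy as np
from arch import arch_model

rng = np.random.default_rng(1)
feature = rng.normal(size=3000)          # stand-in for one SNORT traffic feature

am = arch_model(feature, mean='Constant', vol='FIGARCH', p=1, q=1)
res = am.fit(disp='off')
print(res.params)                        # includes the fractional parameter d

# Flag samples whose deviation from the modeled mean exceeds
# k conditional standard deviations (k = 3 is our assumption).
resid = feature - res.params['mu']
alarms = np.abs(resid) > 3.0 * res.conditional_volatility
print(f"{alarms.sum()} anomalous samples flagged out of {len(feature)}")
```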

Table 2 Detection rate DR (%) and false positive FP (%) for the given network traffic features
Table 3 Evaluation of the proposed method with the use of a real-world network traffic testbed [21] for 4 days of traffic
Table 4 Evaluation of the proposed method with the use of a real-world network traffic testbed [21] for 4 days of traffic

6 Conclusion

Cybersecurity of information systems is nowadays a key research area. The growing number of DDoS attacks, their expanding reach, and their level of complexity stimulate the dynamic development of network defense systems. Statistical anomaly detection techniques are currently the most commonly used both for monitoring and for detecting the attacks. In the present article, the construction of the statistical autoregressive models ARFIMA and FIGARCH has been described. The above-mentioned models capture the statistical variability of the modeled parameters by means of the mean or the conditional variance. For the estimation of parameters and the identification of the models, the maximum likelihood method together with information criteria was used. As a result, satisfactory statistical fits for the examined network traffic signals were obtained. The process of detecting anomalies (attacks) consists in comparing the estimated behavior parameters with the real network traffic parameters. The obtained results clearly indicate that the anomalies contained in the network traffic signal can be detected by the suggested methods.