1 Introduction

The analysis and modelling of Web traffic have been an active research topic in recent years. The arrival times of HTTP requests at a Web server are easy to observe and analyze. In practice, the request arrival process at a Web server has been shown to exhibit significant variance (burstiness): peak request rates can exceed the average request rate as much as tenfold and surpass the server capacity, resulting in poor quality of Web service [1, 2]. When this process is bursty on a wide range of time scales, it may be self-similar. As a consequence of burstiness on many time scales, the arrival process may show long-range dependence (LRD), which means that values at any instant are non-negligibly positively correlated with values at all future instants [3]. Although the concepts of self-similarity and long-range dependence are not equivalent, they are often used interchangeably in the literature, which may be attributed to the fact that the presence of both self-similarity and LRD may be estimated with the Hurst parameter (Hurst index), denoted as H.

Self-similarity has been discovered not only in Web server workload [2–4] but also in computer network traffic [5–9] and Web query traffic [10]. Synthetic self-similar traffic can be constructed by multiplexing a large number of on/off sources characterized by heavy-tailed on and off period lengths. Analysis of Web traffic [3] showed that the self-similarity of such traffic can be attributed to several factors, including heavy-tailed distributions of Web document sizes and user “think times”, the effect of caching, and the superimposition of many such transfers in the network.
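The on/off construction mentioned above can be sketched in a few lines. This is only an illustration under stated assumptions: the function names, the Pareto shape `alpha=1.4`, the number of sources, and the discrete time slots are all hypothetical choices, not taken from the cited generators.

```python
import random


def on_off_source(rng, length, alpha):
    """Return the 0/1 activity of one source over `length` time slots."""
    activity = []
    active = rng.random() < 0.5  # random initial state (assumption)
    while len(activity) < length:
        # Pareto-distributed period length; heavy-tailed for alpha < 2.
        slots = max(1, round(rng.paretovariate(alpha)))
        slots = min(slots, length - len(activity))
        activity.extend([1 if active else 0] * slots)
        active = not active
    return activity


def multiplex(n_sources, length, alpha=1.4, seed=1):
    """Sum the activity of n_sources independent on/off sources."""
    rng = random.Random(seed)
    total = [0] * length
    for _ in range(n_sources):
        for t, a in enumerate(on_off_source(rng, length, alpha)):
            total[t] += a
    return total


traffic = multiplex(n_sources=50, length=1000)
print(len(traffic), min(traffic), max(traffic))
```

Each element of `traffic` is the number of sources active in that slot, so the aggregate series can be fed to any Hurst estimator.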

Self-similarity may have a significant negative impact on system performance and scalability [11]. That is why taking this Web traffic feature into consideration is essential when developing a synthetic workload model used to test server system capacity; otherwise, system performance may be overestimated. A number of traffic models and synthetic traffic generators implementing self-similarity and burstiness have been proposed [12–15].

Very few studies have investigated self-similarity and LRD of the arrival process at e-commerce websites so far [2, 4]. The main reason is the difficulty of obtaining traffic traces from online retailers, mainly due to e-business profitability and e-customer privacy concerns. In this paper, we investigate LRD in traffic arriving at a popular e-commerce Web server. An additional motivation for our study was the huge increase in the popularity of online marketing and Web analytics in recent years, which could have changed Web traffic patterns at e-commerce servers, mainly due to the increased share of bot-generated traffic.

The paper is organized as follows. Section 2 presents background information on self-similarity, LRD, and some methods for investigating these phenomena in time series. Section 3 presents the datasets analyzed in our study and discusses the results of LRD intensity estimation. Section 4 concludes the paper.

2 Background

In this section, the notions of self-similarity and long-range dependence are briefly presented and some methods for estimating these phenomena are introduced. For a detailed discussion of these issues see, e.g., [16].

2.1 Self-similarity and Long-Range Dependence

Self-similarity may be defined in terms of the process distribution as follows. A stochastic process Y(t) is self-similar with a self-similarity parameter H if for any positive stretching factor c, the distribution of the rescaled process \(c^{-H}Y(ct)\) is equivalent to that of the original process Y(t) [17].

A self-similar process shows long-range dependence if its autocorrelation function follows a power law: \(r(k)\sim k^{-\beta }\) as \(k\rightarrow \infty \), where \(\beta \in (0,1)\) [3] (it is worth noting that LRD can be also defined for non self-similar processes).

The presence and degree of self-similarity and long-range dependence are expressed by the Hurst parameter, H. When H lies between 0.5 and 1, the process is said to be self-similar [18], and the higher H is, the higher the degree of self-similarity and LRD revealed by the series [2] (although a process can be self-similar even if \(H \le 0.5\), e.g., in the special case of fractional Brownian motion).

2.2 Selected Methods for Estimating the Hurst Parameter

We apply five popular methods for assessing self-similarity and LRD of the Web traffic [3, 8, 12, 19]. Four of them are graphical methods: the aggregate variance method, the R/S plot, the periodogram-based method, and the wavelet-based method. The fifth is the Local Whittle estimator.

The aggregate variance method and the R/S plot method operate in the time domain. Let us consider a time series \(X = (X_{t}; t = 1, 2, \ldots , N)\). In the aggregate variance method, the m-aggregated series \(X^{(m)} = (X_{k}^{(m)}; k = 1, 2, \ldots )\) is defined by averaging the time series X over nonoverlapping blocks of length m. The variance of the series \(X^{(m)}\) is plotted against m on a log-log plot and the points are approximated by a straight line, e.g., by using the least squares method. Then the slope of the line, \(-\beta \), is established and the Hurst parameter is computed as \(H = 1-\beta /2\). For a self-similar series the variance decays slowly, so \(-\beta \) is greater than \(-1\), which gives H higher than 0.5.
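The aggregate variance procedure can be sketched as follows. This is a minimal illustration, not the R routine used in the study: it assumes block means as the aggregated series, a plain least-squares fit, and a hypothetical set of block sizes.

```python
import math
import random
import statistics


def aggregate(series, m):
    """Average the series over non-overlapping blocks of length m."""
    n_blocks = len(series) // m
    return [sum(series[k * m:(k + 1) * m]) / m for k in range(n_blocks)]


def fit_slope(xs, ys):
    """Ordinary least-squares slope of ys against xs."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx


def hurst_aggregate_variance(series, block_sizes):
    """Estimate H from the slope -beta of log Var(X^(m)) versus log m."""
    log_m, log_var = [], []
    for m in block_sizes:
        agg = aggregate(series, m)
        if len(agg) < 2:
            continue  # not enough blocks to compute a variance
        log_m.append(math.log10(m))
        log_var.append(math.log10(statistics.pvariance(agg)))
    beta = -fit_slope(log_m, log_var)
    return 1.0 - beta / 2.0


# Uncorrelated Gaussian noise has no LRD, so H should come out near 0.5.
rng = random.Random(42)
noise = [rng.gauss(0.0, 1.0) for _ in range(20000)]
h_noise = hurst_aggregate_variance(noise, [1, 2, 4, 8, 16, 32, 64, 128])
print(round(h_noise, 2))
```

In practice the smallest and largest block sizes are often excluded from the fit, since the former reflect short-range effects and the latter rest on very few blocks.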

In the R/S plot method, the rescaled range, i.e., the R/S statistic, is plotted against the block length m (traditionally denoted by d in this method) on a log-log plot. For a self-similar series, R/S grows according to a power law with exponent H as a function of d, so the slope of the plot is an estimate of H.
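A minimal sketch of the R/S computation is given below. It assumes the common definition of R as the range of cumulative deviations from the block mean and S as the block standard deviation, averages R/S over non-overlapping blocks, and uses hypothetical block lengths; the exact variant used by a given statistical package may differ.

```python
import math
import random
import statistics


def rescaled_range(block):
    """R/S of one block: range of cumulative deviations over the std deviation."""
    mean = statistics.fmean(block)
    cum = lo = hi = 0.0
    for x in block:
        cum += x - mean
        lo, hi = min(lo, cum), max(hi, cum)
    # A constant block has zero deviation and would raise ZeroDivisionError.
    return (hi - lo) / statistics.pstdev(block)


def hurst_rs(series, block_sizes):
    """Estimate H as the least-squares slope of log(R/S) versus log d."""
    xs, ys = [], []
    for d in block_sizes:
        blocks = [series[i:i + d] for i in range(0, len(series) - d + 1, d)]
        mean_rs = statistics.fmean([rescaled_range(b) for b in blocks])
        xs.append(math.log10(d))
        ys.append(math.log10(mean_rs))
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    return (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))


rng = random.Random(7)
noise = [rng.gauss(0.0, 1.0) for _ in range(8192)]
h_rs = hurst_rs(noise, [16, 32, 64, 128, 256, 512, 1024])
print(round(h_rs, 2))
```

For uncorrelated noise the estimate is typically somewhat above 0.5, because the R/S statistic is known to be biased for short blocks.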

The other three methods operate in the frequency domain. In the periodogram-based method, a periodogram of a time series X is defined by:

$$\begin{aligned} I_{N}(\lambda )=\frac{1}{2\pi N}\left| \sum _{t=1}^{N}X_{t}e^{i\lambda t} \right| ^{2}, \end{aligned}$$
(1)

where \(i=\sqrt{-1}\). It is usually evaluated at the Fourier frequencies \(\lambda _{j,N}=\frac{2\pi j}{N}\), where \(j\in [0,N/2]\). The estimate of H is based on the slope \(\gamma \) of a log-log plot of \(I_{N}(\lambda _{j,N})\) versus \(\lambda _{j,N}\) as the frequency approaches zero. The relationship between the periodogram slope and the Hurst parameter is given by the formula \(\gamma = 1-2H\).
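The periodogram-based estimate can be illustrated with a direct evaluation of the sum in (1). This sketch makes two assumptions that are not specified in the text: only the lowest 20% of the Fourier frequencies enter the regression (the `fraction` parameter is hypothetical), and the sum is computed naively rather than with an FFT, which is adequate only for short series.

```python
import cmath
import math
import random


def periodogram(series, j):
    """Return (lambda_j, I_N(lambda_j)) at the j-th Fourier frequency."""
    n = len(series)
    lam = 2.0 * math.pi * j / n
    total = sum(x * cmath.exp(1j * lam * t)
                for t, x in enumerate(series, start=1))
    return lam, abs(total) ** 2 / (2.0 * math.pi * n)


def hurst_periodogram(series, fraction=0.2):
    """Estimate H from the log-log slope gamma over the lowest frequencies."""
    m = max(2, int(fraction * (len(series) // 2)))
    points = [periodogram(series, j) for j in range(1, m + 1)]
    xs = [math.log10(lam) for lam, _ in points]
    ys = [math.log10(value) for _, value in points]
    mx, my = sum(xs) / m, sum(ys) / m
    gamma = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return (1.0 - gamma) / 2.0  # from gamma = 1 - 2H


# White noise has a flat spectrum, so gamma is near 0 and H near 0.5.
rng = random.Random(3)
noise = [rng.gauss(0.0, 1.0) for _ in range(4096)]
h_per = hurst_periodogram(noise)
print(round(h_per, 2))
```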

The Local Whittle estimator is a non-graphical method based on periodograms. This method assumes that the spectral density \(f(\lambda )\) of the series can be approximated by the function:

$$\begin{aligned} f_{c,H}(\lambda )=c\lambda ^{1-2H} \end{aligned}$$
(2)

for frequencies \(\lambda \) close to zero. The Local Whittle estimator of H is defined by minimizing:

$$\begin{aligned} \sum _{j=1}^{m}\left[ \log f_{c,H}(\lambda _{j,N})+\frac{I_{N}(\lambda _{j,N})}{f_{c,H}(\lambda _{j,N})}\right] \end{aligned}$$
(3)

with respect to c and H; \(I_{N}\) is defined in (1) and \(f_{c,H}\) is defined in (2).
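One way to carry out the minimization of (3) is to eliminate c analytically: for fixed H the sum is minimized by \(c = \frac{1}{m}\sum _{j} I_{N}(\lambda _{j,N})\lambda _{j,N}^{2H-1}\), which leaves a one-dimensional objective in H. The sketch below follows this route with a simple grid search; the search range (0, 1), the grid step, and the function name are assumptions made for illustration.

```python
import math


def local_whittle(freqs, ordinates, grid_step=0.001):
    """Grid-search H in (0, 1) minimising the profiled objective from (3)."""
    m = len(freqs)
    mean_log_freq = sum(math.log(lam) for lam in freqs) / m

    def objective(h):
        # c is replaced by its minimiser: the mean of I_j * lambda_j^(2H-1).
        c_hat = sum(i * lam ** (2.0 * h - 1.0)
                    for lam, i in zip(freqs, ordinates)) / m
        return math.log(c_hat) + (1.0 - 2.0 * h) * mean_log_freq

    steps = round(1.0 / grid_step)
    candidates = [k * grid_step for k in range(1, steps)]
    return min(candidates, key=objective)


# Sanity check on an exact power-law "periodogram" built with H = 0.8,
# so the estimate should land close to 0.8 by construction.
n, m = 4096, 200
freqs = [2.0 * math.pi * j / n for j in range(1, m + 1)]
ordinates = [2.0 * lam ** (1.0 - 2.0 * 0.8) for lam in freqs]
h_hat = local_whittle(freqs, ordinates)
print(round(h_hat, 3))
```

With real data, `freqs` and `ordinates` would be the first m Fourier frequencies and the corresponding periodogram values from (1); the choice of m trades bias against variance.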

In the wavelet-based estimator of the Hurst parameter, wavelets are considered as a generalisation of the Fourier transform. The wavelet coefficients of the series X are determined and, based on their values, a time average \(\mu _{j}\) is computed at a given scale (for the j-th octave). The relationship between \(\mu _{j}\) and H is given by the formula:

$$\begin{aligned} E \log _{2}(\mu _{j})\sim (2H-1)j+C, \end{aligned}$$
(4)

where E denotes the expected value and C depends only on H. Using this relationship, H may be determined from the slope of an appropriate weighted linear regression.
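The relationship in (4) can be illustrated with the simplest wavelet. The sketch below assumes the Haar wavelet, takes \(\mu _{j}\) as the mean squared detail coefficient at octave j, and weights each octave by its number of coefficients; these are illustrative choices, and published estimators use other wavelets and bias-corrected weights.

```python
import math
import random


def haar_step(approx):
    """One Haar analysis step: return (coarser approximation, details)."""
    pairs = range(0, len(approx) - 1, 2)
    coarse = [(approx[i] + approx[i + 1]) / math.sqrt(2.0) for i in pairs]
    detail = [(approx[i] - approx[i + 1]) / math.sqrt(2.0) for i in pairs]
    return coarse, detail


def hurst_wavelet(series, octaves):
    """Weighted regression of log2(mu_j) on j; the slope equals 2H - 1."""
    approx, js, ys, ws = list(series), [], [], []
    for j in range(1, octaves + 1):
        approx, detail = haar_step(approx)
        mu_j = sum(d * d for d in detail) / len(detail)
        js.append(j)
        ys.append(math.log2(mu_j))
        ws.append(len(detail))  # assumed weights: coefficients per octave
    w_sum = sum(ws)
    mj = sum(w * j for w, j in zip(ws, js)) / w_sum
    my = sum(w * y for w, y in zip(ws, ys)) / w_sum
    slope = (sum(w * (j - mj) * (y - my) for w, j, y in zip(ws, js, ys))
             / sum(w * (j - mj) ** 2 for w, j in zip(ws, js)))
    return (slope + 1.0) / 2.0


# For white noise mu_j is roughly constant over octaves, so H is near 0.5.
rng = random.Random(11)
noise = [rng.gauss(0.0, 1.0) for _ in range(16384)]
h_wav = hurst_wavelet(noise, octaves=8)
print(round(h_wav, 2))
```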

Some other methods for determining the Hurst parameter in time series have also been proposed, e.g., detrended fluctuation analysis (DFA) [20] or multifractal analysis [21]. We do not discuss them in the paper due to space limitations.

3 Estimation of the Hurst Parameter for E-Commerce Traffic

3.1 Data Collection

The main goal of our analysis was to investigate LRD in e-commerce Web traffic. The analysis was done for data recorded in Web server log files obtained from an online retailer trading car parts and accessories. HTTP description lines were converted into time series reflecting the request arrival process at the Web server over 14 consecutive days. The 14 one-day traces were analyzed separately (traces are named with the dates of traffic collection).
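The conversion of log entries into a time series could look roughly as follows. The log format and the bin width are not stated here, so this sketch assumes Common Log Format timestamps and one-second bins; the function name and the sample timestamps are made up for illustration.

```python
from collections import Counter
from datetime import datetime


def arrivals_per_second(timestamps, fmt="%d/%b/%Y:%H:%M:%S %z"):
    """Count requests in consecutive one-second bins (assumed bin width)."""
    times = sorted(datetime.strptime(ts, fmt) for ts in timestamps)
    start = times[0]
    counts = Counter(int((t - start).total_seconds()) for t in times)
    length = int((times[-1] - start).total_seconds()) + 1
    return [counts.get(second, 0) for second in range(length)]


# Hypothetical timestamps as they would appear in access-log lines.
sample = [
    "01/Dec/2015:00:00:00 +0100",
    "01/Dec/2015:00:00:00 +0100",
    "01/Dec/2015:00:00:03 +0100",
]
series = arrivals_per_second(sample)
print(series)
```

Bins without requests are kept as zeros so that the series stays evenly spaced, which the estimators in Subsect. 2.2 require.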

Table 1. Cardinality of the analyzed data sets

To verify the results obtained for the e-commerce traces, we decided to perform an additional LRD analysis of traffic at an actual non e-commerce server. To this end, we used seven traces from a server hosting a specialized mailing list.

The number of samples (i.e., the number of HTTP requests) in each trace is presented in Table 1.

R [22] was used to estimate the Hurst parameter for both sets of traces with the five methods described in Subsect. 2.2.

3.2 Results and Discussion

Figures 1–4 provide examples of the application of the four graphical methods to two e-commerce traces, collected on 1 and 5 December 2015. Traffic in the 01.12.2015 trace is characterized by rather low LRD intensity compared to other e-commerce traces. On the other hand, the 05.12.2015 trace yielded the highest mean H estimate in our analysis. Thus, in Figs. 1–4 one can compare plots for Web traffic characterized by a moderate and a high level of long-range dependence.

Fig. 1. Aggregate variance plot for the 01.12.2015 trace (left) and the 05.12.2015 trace (right)

Fig. 2. R/S plot for the 01.12.2015 trace (left) and the 05.12.2015 trace (right)

Fig. 3. Periodogram for the 01.12.2015 trace (left) and the 05.12.2015 trace (right)

Fig. 4. Wavelet-based estimation for the 01.12.2015 trace (left) and the 05.12.2015 trace (right)

Fig. 5. Comparison of H estimates for the e-commerce traces

Figure 1 shows the aggregate variance plots. One can observe that the fitted lines have slopes clearly different from \(-1\), which confirms the self-similarity of the analyzed time series. The slope of the plot for the 01.12.2015 data (left) was estimated as \(-0.56\), giving an estimate of the Hurst parameter of 0.72. The slope estimated for the 05.12.2015 data (right) is \(-0.18\), which results in H of 0.91.

The R/S plots in Fig. 2 have an asymptotic slope between 0.5 and 1 (the corresponding lines are shown for comparison). The slope, being an estimate of H, was determined using regression as 0.65 for the 01.12.2015 trace and 0.54 for the 05.12.2015 trace.

Figure 3 presents example results achieved using the periodogram-based method. Regression lines for the periodogram plots have slopes of \(-0.36\) and \(-0.5\), giving H estimates of 0.68 and 0.75 for the 01.12.2015 and 05.12.2015 traces, respectively.
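As a quick check of the conversions quoted for Figs. 1 and 3, the reported slopes can be substituted into \(H = 1-\beta /2\) (with slope \(-\beta \)) and \(\gamma = 1-2H\), i.e., \(H = (1-\gamma )/2\). The subscripts AV and P below are only labels introduced here for the aggregate variance and periodogram estimates:

$$\begin{aligned} H_{\mathrm {AV}}&= 1-\frac{0.56}{2} = 0.72,&H_{\mathrm {AV}}&= 1-\frac{0.18}{2} = 0.91,\\ H_{\mathrm {P}}&= \frac{1-(-0.36)}{2} = 0.68,&H_{\mathrm {P}}&= \frac{1-(-0.5)}{2} = 0.75. \end{aligned}$$

In each row, the first value refers to the 01.12.2015 trace and the second to the 05.12.2015 trace.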

Figure 4 shows the results of applying the wavelet-based estimator of the Hurst parameter to the two example e-commerce traces. The resulting H estimates were 0.77 and 1.09, respectively. For H determined with this method, confidence intervals are provided (Table 2).

Table 2. H estimates for the e-commerce traces
Table 3. H estimates for the non e-commerce traces

Table 2 summarizes the results of our study across the different methods for all 14 e-commerce traces. In general, the H estimate exceeds 0.5, which indicates the self-similar character of the traffic. The only exception is the H of 0.46 estimated for the 09.12.2015 trace using the R/S plot method. Other values of the Hurst parameter exceed 0.5 and they vary significantly, ranging from 0.51 to even 0.98.

Mean H values estimated for each e-commerce trace (the last column) show significant fluctuations in LRD intensity depending on the day, with H ranging from 0.6 for the 09.12.2015 trace to 0.8 for the 05.12.2015 trace. The last row of Table 2 shows even bigger differences in H estimates depending on the method applied.

Fluctuations in H estimates depending on the trace and the method applied are graphically presented in Fig. 5. One can observe that for the Local Whittle method, the estimate of H stays relatively consistent across all 14 analyzed datasets (with a mean value of 0.63). On the other hand, for the graphical methods it varies greatly. The wavelet-based method tends to give the highest H estimates (with a mean of 0.85), whereas H estimates for the R/S plot method are the lowest (with a mean of 0.6). We cannot give reasons for such a large variance of H estimates across the methods. However, such variance is not uncommon; it has also been observed in some previous studies, e.g., for network traffic [8, 12] and MPEG-1 encoded video sequences [23].

Table 3 presents estimates of the Hurst parameter for the non e-commerce traces and Fig. 6 illustrates fluctuations in these estimates depending on the trace and the method used. For this traffic the Hurst parameter (with a mean of 0.66) seems to be a little lower than that for the e-commerce traffic (with a mean of 0.7). At the same time, H estimates for non e-commerce traffic are much more consistent across days and methods applied (cf. Fig. 5).

Fig. 6. Comparison of H estimates for the non e-commerce traces

4 Conclusions

Application of five popular Hurst parameter estimators to e-commerce traffic shows that this traffic reveals a significant level of long-range dependence. The mean H estimate ranges from 0.6 to 0.8 depending on the day. This result is consistent with results for the request arrival process at other e-commerce sites: 0.66 in [2] and 0.73–0.8 in [4]. Furthermore, our study confirms previous findings that one cannot rely on a single method to estimate the Hurst parameter since different methods usually give different results. In our case, the mean H estimate for the e-commerce traffic ranges from 0.6 to 0.85 depending on the method. For the non e-commerce traffic, analyzed in the paper for comparative purposes, these fluctuations are much smaller and range from 0.63 to 0.69.

A coarse analysis of our results shows that the Hurst parameter determined for 24-hour intervals does not depend on the number of HTTP requests that arrived at the server within these intervals. It also does not depend on the share of Web bot requests in the intervals. A deeper LRD analysis, performed for intervals shorter than 24 hours, is planned to investigate these possible relationships. Furthermore, we plan to use traces from multiple Web servers to check whether the Hurst parameter can be used to distinguish between e-commerce and non e-commerce sources.