1 Introduction

Cloud architectures provide a way to share resources in the form of virtual machines (VMs) among multiple users independent of their geographical locations. Individuals and organizations are moving towards cloud technology and benefit from it in a number of ways, such as on-demand services, elasticity, flexibility, and availability. Elasticity is one of the key properties of the cloud; it provides the flexibility of stretching and compressing virtual hardware resources at any point in the lifetime of hosted applications. However, cloud resource management is a complex and challenging task [1, 2], which has motivated dynamic resource scaling. Resource scaling helps in upholding the quality of service (QoS) promised to subscribers and can be achieved by two different methods. First, a reactive resource scaling mechanism can be adopted, where resources are scaled after the demand arrives. In reactive scaling, users have to wait for a certain amount of time before their requested resources are allocated because VM initialization is not instantaneous. Unlike reactive methods, proactive scaling approaches minimize this waiting time by allocating the requested resources in advance. These approaches utilize estimated workload information for upcoming time instances to scale the resources before the actual demand arrives. The forecasts used in resource provisioning have been classified into application and host load predictions [3]. Application predictions are further categorized into three levels, i.e. performance, SLA parameters, and workload. The data sets considered in the presented work fall into the workload category. Workload forecasting uses historical information to extract the patterns of applications running on the server, and time series forecasting methods have been extensively explored for this purpose. Some of the key contributions in the domain are listed below.

Web server workload forecasting has been carried out using an autoregression approach [4]. The model provides long-term and short-term estimations and considers the correlation and non-stationarity of the workload data. The autoregressive integrated moving average (ARIMA) model and its variants have been used extensively to develop forecasting methodologies. For instance, Roy et al. [5] highlighted the challenges in auto scaling of cloud resources and developed a model-predictive estimation approach for resource auto scaling. They also showed that distributed resources can be allocated in a way that achieves satisfactory QoS at low operational cost. Similarly, Tran et al. [6] proposed a predictive model for server workload forecasting using the seasonal variant of the ARIMA process. The approach is one of the integral parts of the project EnergeTic-FUI, France, and its accuracy was validated with a wide set of experiments on benchmark datasets. The impact of workload estimation was analyzed by Calheiros et al. in [7]. The authors modeled an ARIMA process to estimate the workload of cloud servers. They also studied aspects of predictive resource management and analyzed the accuracy of the predictive model in terms of resource utilization and QoS, claiming an accuracy of up to 91%. Similarly, a bandwidth allocation scheme using predictive analytics was developed for cloud architectures enabled with software defined networks [8]. The approach collects data from virtual machines executing on cloud servers and analyzes the collected data to estimate the bandwidth utilization of each VM for the upcoming time instance. The predicted information is then incorporated to generate the bandwidth allocation pattern and allocate the bandwidth in advance, improving network performance in terms of bandwidth utilization and capacity. Khan et al. advocated the use of multiple time series and the grouping of similar applications to improve forecast accuracy [9]. The approach examines the correlation among VM workloads to extract repeatable workload patterns, uses a clustering approach to group applications exhibiting similar behavior, and then introduces a predictive method to estimate the variation in workload patterns. Long-term QoS requirements have been estimated using multivariate time series while short-term advertisements are predicted using univariate time series [10]. Intra-correlation of QoS attributes is incorporated to improve the accuracy of predictions, while inter-correlation helps in obtaining the best possible service composition through time series group similarity.

Time series forecasting methods have also been used in combination with machine learning approaches. For instance, Messias et al. [11] proposed learning the most suitable time series forecasting model using a genetic algorithm based mechanism, since it is a challenging task to identify which model suits the workload data under consideration; the task becomes even more complex if enough historical data is not available. Liu et al. classified workloads into a number of classes using 0-1 integer programming, and each class is assigned a prediction approach to forecast the workload on the servers [12]. Cetinski and Juric proposed a hybrid approach to forecast the workload on cloud data centers by combining statistical methods with learning approaches [13]. Rahman et al. proposed monitoring the QoS received by the user to keep track of the QoS being provided and received [14]. Khan et al. explored the auto scaling techniques adopted by renowned cloud service providers including Amazon, Google, and Microsoft [15]. The study examined the features and entities of auto scaling approaches; the presented approach allows proactive analysis of workload patterns and estimates the responsiveness of auto scaling operations. Baldán et al. used ARIMA and ES in combination with learning approaches to forecast the workload instances of cloud data centers [16]; the model successfully reduced the under- and over-provisioning cost by up to 30%. Kim et al. carried out an experimental study using a number of different models [17]. The experiments were conducted on an application under four realistic workload patterns, two billing patterns, and three types of predictive scaling, and the study revealed that no approach is universally best for every workload. Kumar and Mazumdar compared the performance of different variants of the ARMA class, such as ARIMA, SARIMA (Seasonal ARIMA), and ARFIMA (Fractionally Integrated ARIMA), with SSA (Singular Spectrum Analysis) on CPU, memory, and bandwidth data traces obtained from the Wikimedia grid [18]. The performance of ARIMA was found to be best on the network data trace while SSA outperformed the other models on the CPU and memory data traces. In general, a number of approaches have been explored to improve estimation accuracy, but no approach produces 100% accurate results [19].

The aim of this article is to assess the performance of time series forecasting models in workload estimation for different types of workloads from the cloud environment. The study comprises six different time series models whose forecast ability is evaluated over five real-world data traces. A comprehensive statistical evaluation is also carried out using the Friedman test with Finner post-hoc analysis, which ranks each model based on its forecast accuracy. In addition, the forecast accuracy of each model is compared and statistically evaluated against the others. The rest of this paper is organized as follows: Section 2 discusses workload prediction and its role in cloud data center management, and lists the performance evaluation metrics used to evaluate the accuracy of the forecasting approaches in this study. The time series prediction models are discussed in Sect. 3. Section 4 describes the simulation environment and the data sets. The results of the experimental study are presented and discussed in Sect. 5, followed by a statistical analysis of the experimental findings in Sect. 6. Finally, the paper is wrapped up with concluding remarks in Sect. 7.

2 Preliminaries

2.1 Workload Prediction

Forecasting plays a vital role in the growth of an organization by estimating future trends in advance, which helps in developing effective strategic plans [20,21,22,23,24,25]. Cloud service providers can also use predictive analytics to estimate future demands in order to improve the quality of service (QoS) and quality of experience (QoE). A general architecture of a cloud system with a forecasting module is depicted in Fig. 1, where a component called the resource manager takes inputs from the forecaster \(\mathcal {F}\) and the current state monitor to manage the cloud resources effectively. The accuracy of \(\mathcal {F}\) is one of the important aspects in improving cloud resource management operations, which include scaling, virtual machine placement, minimizing the number of active physical machines, and others.

Fig. 1
figure 1

A view of cloud system with workload forecaster

Let Y be a set of workload values over time t, where \(y_t\) denotes the workload at time t. Also, consider that f is a function of y that analyzes the previous workload instances to estimate the outcome of future events (Eq. (1)). The forecasts are compared against the actual values and the forecast error (\(e_{t+1}\)) is computed.

$$\begin{aligned} \hat{y}_{t+1} = f\left( y_t, y_{t-1}, \ldots , y_1\right) \end{aligned}$$
(1)

2.2 Performance Measure Indicators

Accuracy analysis using the individual forecast errors is a challenging task. Therefore, a metric that gives a single number to measure the forecast accuracy of a prediction model is required. A wide range of metrics have been proposed in the literature and each of them carries its distinct merits and demerits; for instance, RMSE is highly sensitive to outliers. Thus, we have adopted three evaluation metrics, i.e. Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), and Mean Absolute Scaled Error (MASE), to measure the accuracy of the models.

2.2.1 Root Mean Squared Error (RMSE)

It is one of the most widely used metrics and penalizes a forecasting model for large errors [26]. The model is considered to be more accurate if its score is closer to 0. The mathematical representation of the metric is given in (2), where n is the number of data points in the workload trace.

$$\begin{aligned} RMSE(y,\hat{y}) = \sqrt{\frac{1}{n} \sum _{t=1}^{n} (y_t - \hat{y}_t)^2} \end{aligned}$$
(2)

2.2.2 Mean Absolute Error (MAE)

In the root mean squared error, the squaring gives larger errors more weight, which may distort the assessment of the forecaster. MAE, in contrast, gives equal weight to each error component and measures the accuracy of a forecasting model by computing the mean of the absolute differences between actual and predicted workloads, as shown in (3). The MAE is a better indicator of average error than RMSE [27, 28]. It produces a non-negative number to evaluate the forecast accuracy; if it is close to 0, the forecasts are very similar to the actual values.

$$\begin{aligned} MAE(y,\hat{y}) = \frac{1}{n} \sum _{t=1}^{n} |y_t - \hat{y}_t| \end{aligned}$$
(3)

2.2.3 Mean Absolute Scaled Error (MASE)

The metric was proposed by Rob J. Hyndman and Anne B. Koehler as a substitute for percentage error metrics [29]. Unlike other accuracy metrics, it produces a good accuracy estimate of a forecasting model when comparing across a number of series of different scales. It scales the forecast errors by the in-sample mean absolute error of a naïve forecast method and can be computed using (4), where m denotes the seasonal period. We computed MASE using the accuracy function provided in R and used it to select the best model over a number of traces.

$$\begin{aligned} MASE(y,\hat{y}) = \frac{1}{n} \sum _{t=1}^{n} \Bigg (\frac{|y_t - \hat{y}_t|}{\frac{1}{n-m}\sum _{i=m+1}^{n}|y_i - y_{i-m}|}\Bigg ) \end{aligned}$$
(4)
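The following R sketch shows how these three metrics can be computed directly from Eqs. (2)–(4); the function names and the non-seasonal default \(m=1\) are illustrative choices, not part of the original study.

```r
# Hedged sketch of the accuracy metrics from Eqs. (2)-(4); names are illustrative.
rmse <- function(y, y_hat) sqrt(mean((y - y_hat)^2, na.rm = TRUE))

mae <- function(y, y_hat) mean(abs(y - y_hat), na.rm = TRUE)

# m = 1 gives the non-seasonal naive scaling; the denominator is the
# in-sample mean absolute error of the naive (seasonal) forecast.
mase <- function(y, y_hat, m = 1) {
  n     <- length(y)
  scale <- mean(abs(y[(m + 1):n] - y[1:(n - m)]))
  mean(abs(y - y_hat), na.rm = TRUE) / scale
}
```

The accuracy() function of the forecast package, mentioned above, reports the same metrics for a fitted model object.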

3 Time Series Forecasting Models

The analysis of time series data dates back to 1927 [30]. The common aim of time series analysis is forecasting of trends by extracting meaningful statistics and other characteristics of the data. A number of time series forecasting methods have been derived; this study focuses on six different models.

3.1 Autoregressive (AR)

An autoregressive (AR) process states that the future outcome of a variable is a function of its previous values and a white noise term. Since the variable \(y_{}\) is regressed on past occurrences of itself, the method is named autoregressive. If the past values of a workload variable \(y_{}\) at equally spaced time instances \(t, t-1, \ldots , 1\) are \(y_{t}, y_{t-1}, \ldots , y_{1}\), then an AR process of order \(p\) can be expressed as given in Eq. (5) [31].

$$\begin{aligned} \hat{y}_{t} = \phi _1 \times y_{t-1} + \phi _2 \times y_{t-2} + \ldots + \phi _{p} \times y_{t-p} + a_{t} \end{aligned}$$
(5)
Fig. 2
figure 2

AR process of order p

Figure 2 depicts an AR process of order \(p\), where \(y_{t}\) and \(\hat{y}_{t}\) are the actual and predicted workloads respectively at time instance t. \(\phi _i\) (\(i=1,2,\ldots ,p\)) are model parameters that are restricted to lie between \(-1\) and \(+1\), while \(a_{t}\) is a random noise term.
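As an illustration, an AR(1) model can be fitted to a workload series with the stats::arima() function in R; the series name y and the use of the fitted mean in the manual forecast are assumptions of this sketch.

```r
# Hedged sketch: fit an AR(1) model and form a one-step forecast per Eq. (5).
fit_ar <- arima(y, order = c(1, 0, 0))   # p = 1, no differencing, no MA terms
phi1   <- coef(fit_ar)["ar1"]            # \phi_1
mu     <- coef(fit_ar)["intercept"]      # fitted mean of the series
y_hat  <- mu + phi1 * (tail(y, 1) - mu)  # manual one-step forecast around the mean
predict(fit_ar, n.ahead = 1)$pred        # the same forecast via predict()
```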

3.2 Moving Average (MA)

Unlike the AR process, which uses the previous actual values to forecast the upcoming instances, the Moving Average (MA) process assumes that an observable time series (\(y_{t}\)) can be modeled and generated from a series of independent white noise terms (\(a_{t}\)), if successive values of \(y_{t}\) depend strongly on previous values [32]. These white noise terms are randomly generated from a fixed distribution, usually with mean zero. A moving average method of order \(q\) can be represented as in Eq. (6), where \(\theta _j\) (\(j=\{1,2,\ldots ,q\}\)) are the model parameters (weights), which are required neither to sum to unity nor to be positive [32]. Figure 3 graphically illustrates the MA process of order \(q\), where \(\mathcal {F}_{MA}\) is a function that models the random white noise terms to anticipate future values.

$$\begin{aligned} \hat{y}_{t} = a_{t} - \theta _1 \times a_{t-1} - \theta _2 \times a_{t-2} - \ldots - \theta _{q} \times a_{t-q} \end{aligned}$$
(6)
Fig. 3
figure 3

MA of order q
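A corresponding MA(1) fit in R might look as follows; note that arima() uses a plus sign convention for the \(\theta\) terms, whereas Eq. (6) uses the minus convention, so the estimated coefficient differs in sign. This is a sketch under those assumptions.

```r
# Hedged sketch: fit an MA(1) model; the white noise terms a_t are approximated
# by the model residuals.
fit_ma <- arima(y, order = c(0, 0, 1))
theta1 <- coef(fit_ma)["ma1"]
a_last <- tail(residuals(fit_ma), 1)
y_hat  <- coef(fit_ma)["intercept"] + theta1 * a_last  # one-step forecast (R's sign convention)
```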

3.3 Autoregressive Moving Average (ARMA)

The Autoregressive Moving Average (ARMA) process combines both autoregressive and moving average terms in one model, which is sometimes favorable for achieving a better fit to the actual data. The method parsimoniously describes a time series using two polynomials, one for the autoregressive part and the other for the moving average part. This method is commonly preferred for modeling noisy time series data. Figure 4 shows the working of an ARMA process with \(p\) autoregressive terms and \(q\) moving average terms, which can be represented using expression (7).

$$\begin{aligned} \hat{y}_{t} = \phi _1 \times y_{t-1} + \phi _2 \times y_{t-2} + \cdots + \phi _{p} \times y_{t-p} + a_{t} - \theta _1 \times a_{t-1} -\theta _2 \times a_{t-2} - \cdots - \theta _{q} \times a_{t-q} \end{aligned}$$
(7)
Fig. 4
figure 4

ARMA of order p and q
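Both kinds of terms can be combined in a single arima() call; this sketch fits an ARMA(1,1) model, the order used later in the experiments, on an illustrative series y.

```r
# Hedged sketch: ARMA(1,1) = one AR term and one MA term, per Eq. (7).
fit_arma <- arima(y, order = c(1, 0, 1))
predict(fit_arma, n.ahead = 1)$pred   # one-step-ahead forecast
```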

3.4 Autoregressive Integrated Moving Average (ARIMA)

The Autoregressive Integrated Moving Average (ARIMA) model is a more general form of the ARMA model. It has three different components, namely autoregressive (AR), integration (I), and moving average (MA). It is preferably applied to non-stationary time series data, which can be transformed into stationary data by applying integration or differencing a number of times. A general ARIMA model of order \((p, d, q)\) is represented as ARIMA(\(p, d, q\)) and shown in Fig. 5, where \(p, d\), and \(q\) are the number of autoregressive terms, the number of differences required, and the number of moving average terms in the prediction equation. The differencing operation of order \(d\) is denoted by \(\delta ^{d}\) and \(\tilde{y}_{t}\) represents the value of the differenced time series at time t. The simplest form, ARIMA(1,1,1), can be written as in Eq. (8), where B represents the backward shift operator (\(B\times y_{t} = y_{t-1}\)) [31].

$$\begin{aligned} \underbrace{(1-\phi _1B)}_{\text {AR(1)}} \underbrace{(1-B) y_{t}}_{\text {First Difference}} = \underbrace{(1-\theta _1B) a_{t}}_{\text {MA(1)}} \end{aligned}$$
(8)
Fig. 5
figure 5

ARIMA of order p, d, and q

In addition, the auto ARIMA function (referred to as \(M_{5}\)) provided in the forecast package of R [33] is also considered to analyze the effect of parameter optimization. It analyzes the data series and finds the optimal values for the autoregressive, differencing, and moving average terms. Based on these terms, the order of the ARIMA model is identified, which is then applied to forecast the trends on the data series.
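A sketch of both variants using the forecast package is given below; the series y and the one-step horizon are illustrative assumptions.

```r
library(forecast)

# ARIMA(1,1,1) with fixed orders, matching Eq. (8):
fit_fixed <- Arima(y, order = c(1, 1, 1))

# auto.arima() searches over candidate (p, d, q) orders and returns the best fit:
fit_auto <- auto.arima(y)
forecast(fit_auto, h = 1)   # forecast for the next time instance
```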

3.5 Exponential Smoothing (ES)

Exponential smoothing (ES) assumes that the most recent observation and forecast most strongly affect the next forecast; the weights of older observations reduce exponentially as the data becomes older. The general model for exponential smoothing is given in Eq. (9), where \(\alpha\) is the smoothing constant, i.e. the model's parameter. A visual representation of the model is shown in Fig. 6. A key benefit of the model is that it does not need to store a large amount of data to make forecasts.

$$\begin{aligned} \hat{y}_{t} = \alpha y_{t-1} + (1-\alpha ) \hat{y}_{t-1} \end{aligned}$$
(9)
Fig. 6
figure 6

Exponential smoothing
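A minimal R sketch of the recursion in Eq. (9) is shown below; the initialization \(\hat{y}_2 = y_1\) and the default \(\alpha = 0.5\) are assumptions made only for illustration.

```r
# Hedged sketch of simple exponential smoothing per Eq. (9).
exp_smooth <- function(y, alpha = 0.5) {
  y_hat    <- rep(NA_real_, length(y))
  y_hat[2] <- y[1]                      # initialize the first forecast with the first observation
  for (t in 3:length(y)) {
    y_hat[t] <- alpha * y[t - 1] + (1 - alpha) * y_hat[t - 1]
  }
  y_hat
}
```

The ses() function of the forecast package provides an equivalent simple exponential smoothing implementation in which \(\alpha\) can also be estimated from the data.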

3.6 An Illustration

Let Y be a set of randomly sampled workload instances that represents the memory requested by an application. The simulation is carried out using the R programming language and the workload is generated using sample(min:max, n, replace = FALSE); since the requested memory is sampled at uniform intervals, it is converted into a time series object. Here, min and max define the minimum and maximum values in the workload trace respectively. We generated Y of length \(n=40\) with the replacement option set to FALSE, which means no value is repeated in the generated workload data. A random forecasting model (\(M_{0}\), i.e. \(\hat{y}_t = y_{t-1}+rand\)) is applied over Y to estimate the future values listed as \(\hat{Y}^{M_{0}}\), where rand is a random number between 0 and 1. The forecast error (\(e_t = y_t - \hat{y}_t\)) is measured and listed as \(e^{M_{0}}\); a sketch of this workflow in R is given after the listings below. The random forecasting model (\(M_{0}\)) estimates the workload with 35.64, 29.42, and 0.98 forecast accuracy as per the RMSE, MAE, and MASE metrics, which indicates a large amount of error in the estimations.

  • \(Y = {\{25.00, 21.00, 75.00, 13.00, 44.00, 83.00, 99.00, 30.00, 54.00, 41.00, 90.00, 96.00, 43.00, 46.00, 22.00, 57.00, 28.00, 56.00, 64.00, 61.00, 16.00, 70.00, 32.00, 18.00, 78.00, 69.00, 66.00, 94.00, 12.00, 42.00, 49.00, 87.00, 100.00, 67.00, 37.00, 8.00, 14.00, 35.00, 50.00, 7.00\}}\)

  • \(\hat{Y}^{M_{0}} = {\{NA, 25.12, 21.90, 75.74, 13.50, 44.84, 83.17, 99.65, 30.09, 54.09, 41.52, 90.04, 96.89, 43.19, 46.21, 22.64, 57.13, 28.08, 56.28, 64.83, 61.81, 16.74, 70.91, 32.62, 18.17, 78.64, 69.21, 66.90, 94.26, 12.91, 42.25, 49.80, 87.79, 100.73, 67.64, 37.25, 8.58, 14.03, 35.19, 50.35\}}\)

  • \(e^{M_{0}} = {\{NA, -4.12, 53.10, -62.74, 30.50, 38.16, 15.83, -69.65, 23.91, -13.09, 48.48, 5.96, -53.89, 2.81, -24.21, 34.36, -29.13, 27.92, 7.72, -3.83, -45.81, 53.26, -38.91, -14.62, 59.83, -9.64, -3.21, 27.10, -82.26, 29.09, 6.75, 37.20, 12.21, -33.73, -30.64, -29.25, 5.42, 20.97, 14.81, -43.35\}}\)
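The sketch below reproduces this workflow in R; the seed, the sampling range 1:100, and the error summary are illustrative assumptions and will not regenerate the exact values listed above.

```r
set.seed(1)                                  # arbitrary seed; the paper's draw differs

# Random workload of length 40 without replacement (range 1:100 assumed here)
Y <- sample(1:100, 40, replace = FALSE)
Y <- ts(Y)                                   # treat the uniformly sampled values as a time series

# Random forecasting model M0: y_hat_t = y_{t-1} + rand, rand ~ U(0, 1)
Y_hat_M0 <- c(NA, head(Y, -1) + runif(length(Y) - 1))
e_M0     <- Y - Y_hat_M0                     # forecast errors e_t = y_t - y_hat_t

c(RMSE = sqrt(mean(e_M0^2, na.rm = TRUE)),
  MAE  = mean(abs(e_M0), na.rm = TRUE))
```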

Similarly, we evaluate each time series forecasting model on the random workload data. The estimations obtained from an autoregressive process (\(M_{1}\)) are listed as \(\hat{Y}^{M_{1}}\) and \(e^{M_{1}}\) represents the corresponding forecast errors. The forecasting model randomly selects the values of its two parameters, \(\phi _1 = -0.0093\) and \(c=49.365\). The model yields RMSE, MAE, and MASE of 27.81, 23.78, and 0.78 respectively, which shows better forecasts than \(M_{0}\).

\(\hat{Y}^{M_{1}} = {\{NA, 49.17, 48.67, 49.24, 48.96, 48.59, 48.44, 49.09, 48.86, 48.98, 48.53, 48.47, 48.97, 48.94, 49.16, 48.83, 49.10, 48.84, 48.77, 48.80, 49.22, 48.71, 49.07, 49.20, 48.64, 48.72, 48.75, 48.49, 49.25, 48.97, 48.91, 48.56, 48.44, 48.74, 49.02, 49.29, 49.23, 49.04, 48.90, 49.30 \}}\)

\(e^{M_{1}} = {\{NA, -28.17, 26.33, -36.24, -4.96, 34.41, 50.56, -19.09, 5.14, -7.98, 41.47, 47.53, -5.97, -2.94, -27.16, 8.17, -21.10, 7.16, 15.23, 12.20, -33.22, 21.29, -17.07, -31.20, 29.36, 20.28, 17.25, 45.51, -37.25, -6.97, 0.09, 38.44, 51.57, 18.26, -12.02, -41.29, -35.23, -14.04, 1.10, -42.30 \}}\)

Similarly, on applying the first order MA model (\(M_{2}\)) with randomly chosen \(\theta _1 = -0.126\) and \(c=49.365\) over Y, it forecasts with errors of 28.13, 23.40, and 0.78 under RMSE, MAE, and MASE respectively, which is a similar accuracy to that of \(M_{1}\).

\(\hat{Y}^{M_{2}} = {\{NA, 49.37, 52.94, 46.59, 53.60, 50.57, 45.28, 42.60, 50.95, 48.98, 50.37, 44.37, 42.86, 49.35, 49.79, 52.87, 48.84, 51.99, 48.86, 47.46, 47.66, 53.35, 47.27, 51.29, 53.56, 46.29, 46.50, 46.91, 43.43, 53.33, 50.79, 49.59, 44.65, 42.39, 46.26, 50.53, 54.72, 54.50, 51.82, 49.59\}}\)

\(e^{M_{2}}={\{0.00, -28.37, 22.06, -33.59, -9.60, 32.43, 53.72, -12.60, 3.05, -7.98, 39.63, 51.63, 0.14, -3.35, -27.79, 4.13, -20.84, 4.01, 15.14, 13.54, -31.66, 16.65, -15.27, -33.29, 24.44, 22.71, 19.50, 47.09, -31.43, -11.33, -1.79, 37.41, 55.35, 24.61, -9.26, -42.53, -40.72, -19.50, -1.82, -42.59\}}\)

After applying an ARMA process (\(M_{3}\)) on Y with both p and q equal to 1, where \(\phi _1\), \(\theta _1\), and c are the same as selected in \(M_{1}\) and \(M_{2}\), the forecasts have RMSE, MAE, and MASE equal to 27.34, 23.17, and 0.77 respectively. The forecasts of \(M_{3}\) are a bit more accurate than those of \(M_{1}\) and \(M_{2}\).

\(\hat{Y}^{M_{3}}={\{ NA, 49.13, 45.62, 52.36, 44.28, 48.92, 52.88, 54.25, 46.03, 49.86, 47.86, 53.83, 53.78, 47.60, 48.73, 45.79, 50.24, 46.30, 50.06, 50.52, 50.11, 44.91, 51.87, 46.56, 45.59, 52.72, 50.77, 50.67, 53.95, 43.96, 48.72, 48.94, 53.35, 54.31, 50.34, 47.34, 44.33, 45.41, 47.72, 49.18 \}}\)

\(e^{M_{3}}={\{0.00, -28.13, 29.38, -39.36, -0.28, 34.08, 46.12, -24.25, 7.97, -8.86, 42.14, 42.17, -10.78, -1.60, -26.73, 11.21, -22.24, 9.70, 13.94, 10.48, -34.11, 25.09, -19.87, -28.56, 32.41, 16.28, 15.23, 43.33, -41.95, -1.96, 0.28, 38.06, 46.65, 12.69, -13.34, -39.34, -30.33, -10.41, 2.28, -42.18 \}}\)

We also applied ARIMA(1,1,1) (\(M_{4}\)) over Y, which improves the accuracy further by producing forecasts with RMSE, MAE, and MASE of 26.78, 22.35, and 0.76 respectively.

\(\hat{Y}^{M_{4}} = {\{NA, NA, 51.20, 53.66, 46.62, 50.55, 54.89, 56.57, 48.46, 51.64, 49.95, 55.76, 56.18, 50.00, 50.63, 47.78, 52.00, 48.41, 51.86, 52.62, 52.25, 47.02, 53.56, 48.80, 47.41, 54.46, 53.08, 52.82, 56.09, 46.37, 50.34, 50.93, 55.36, 56.67, 52.77, 49.46, 46.21, 47.05, 49.45, 51.09\}}\)

\(e^{M_{4}} = {\{NA, 0.00, 23.80, -40.66, -2.62, 32.45, 44.11, -26.57, 5.54, -10.64, 40.05, 40.24, -13.18, -4.00, -28.63, 9.22, -24.00, 7.59, 12.14, 8.38, -36.25, 22.98, -21.56, -30.80, 30.59, 14.54, 12.92, 41.18, -44.09, -4.37, -1.34, 36.07, 44.64, 10.33, -15.77, -41.46, -32.21, -12.05, 0.55, -44.09\}}\)

We evaluated an exponential smoothing model (\(M_{6}\)) over Y with a randomly selected \(\alpha =0.5\), which forecasts the workload trace with RMSE, MAE, and MASE equal to 31.12, 27.44, and 0.90 respectively. The forecasts of \(M_{6}\) are better only than those of \(M_{0}\); however, it may produce more accurate forecasts with better values of \(\alpha\).

\(\hat{Y}^{M_{6}} = {\{NA, 49.98, 35.49, 55.24, 34.12, 39.06, 61.03, 80.02, 55.01, 54.50, 47.75, 68.88, 82.44, 62.72, 54.36, 38.18, 47.59, 37.79, 46.90, 55.45, 58.22, 37.11, 53.56, 42.78, 30.39, 54.19, 61.60, 63.80, 78.90, 45.45, 43.72, 46.36, 66.68, 83.34, 75.17, 56.09, 32.04, 23.02, 29.01, 39.51\}}\)

\(e^{M_{6}} = {\{NA, -28.98, 39.51, -42.24, 9.88, 43.94, 37.97, -50.02, -1.01, -13.50, 42.25, 27.12, -39.44, -16.72, -32.36, 18.82, -19.59, 18.21, 17.10, 5.55, -42.22, 32.89, -21.56, -24.78, 47.61, 14.81, 4.40, 30.20, -66.90, -3.45, 5.28, 40.64, 33.32, -16.34, -38.17, -48.09, -18.04, 11.98, 20.99, -32.51\}}\)

Fig. 7
figure 7

Forecast accuracy on random data

Figure 7 shows the forecast accuracy of each model on the random data. It can be seen that the ARIMA process generates better forecasts according to all three evaluation metrics, i.e. RMSE, MAE, and MASE. We then conducted a comprehensive experimental study on various data sets to analyze the behavior of each model.

4 Simulation Environment and Datasets

The experiments are carried out on a machine equipped with an Intel® Core i7-7700 processor with a 3.60 GHz clock speed and 8 GB of main memory. The models are implemented using R. This analytical study is performed over five benchmark data traces: the NASA server HTTP trace, the Calgary HTTP trace, the Saskatchewan server HTTP trace [34], and the CPU and memory resource request traces extracted from the Google cluster trace [35]. In this paper, we refer to these traces as \(D_1, D_2, D_3, D_4\), and \(D_5\) respectively, as listed in Table 1.

The NASA server traces are two HTTP request traces consisting of two months of requests to the NASA Kennedy Space Center WWW server in Florida; in our experiments, we have considered only one month of request data. The data logs are stored in ASCII file format and distributed over five fields: each record contains host, timestamp, request, HTTP reply code, and bytes. Here, host indicates the hostname or IP address of the requesting machine, timestamp denotes the time of the request in the format "DAY MON DD HH:MM:SS YYYY" with timezone −0400, the request is given in quotes, the next field indicates the reply code, and the last field shows the bytes sent in reply to the request. Similarly, the Calgary and Saskatchewan traces contain one year and seven months of HTTP requests to the University of Calgary's Department of Computer Science WWW server and the University of Saskatchewan's WWW server respectively; the two servers are located in Calgary, Alberta, Canada and Saskatoon, Saskatchewan, Canada. The Google cluster trace was released by Google in 2011 and incorporates 29 days of data collected from the organization's cluster cell. The data covers 10,388 machines, over 0.67 million jobs, and 20 million tasks, where a job may consist of multiple tasks and each task may further have multiple processes; the data is arranged in a set of six tables.

Table 1 List of data traces and corresponding acronyms

5 Forecast Evaluation

We carried out a number of experiments with different prediction window sizes (PWS) to study the performance behavior of the time series forecasting models. The prediction window size defines the forecast time interval, e.g. if the length of the window is 10 minutes, the model predicts the upcoming workload on the server every 10 minutes. The experiments are performed with prediction windows of 5, 10, 20, 30, and 60 minutes. 60% of the data is used to estimate the model parameters and the remaining data is used to test the forecast accuracy of the estimated model.
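A minimal sketch of this protocol for one trace and one model is shown below; the variable names, the AR(1) choice, and the multi-step prediction over the hold-out period are illustrative assumptions, not the exact evaluation code of the study.

```r
# ts_load: a workload series already aggregated into fixed prediction windows
# (e.g. request counts or resource demand per 10-minute window).
n_train <- floor(0.6 * length(ts_load))                # 60% for parameter estimation
train   <- ts_load[1:n_train]
test    <- ts_load[(n_train + 1):length(ts_load)]      # 40% for accuracy measurement

fit   <- arima(train, order = c(1, 0, 0))              # e.g. the AR(1) model of Table 2
preds <- predict(fit, n.ahead = length(test))$pred     # forecasts over the test horizon
sqrt(mean((test - preds)^2))                           # RMSE on the hold-out data
```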

The forecast errors of the experiments carried out with the first order autoregressive model are reported in Table 2. The model produces the least root mean squared error over the Calgary trace (\(D_2\)) in all experiments with different lengths of prediction window. In the case of mean absolute error, the least error was again measured on \(D_2\) for prediction windows of 5, 10, 20, and 30 minutes, whereas the Saskatchewan trace (\(D_3\)) receives the least mean absolute error when the window length is 60 minutes. The mean absolute scaled error was minimum for the predictions of \(D_2\) when the prediction window is 5, 10, 30, and 60 minutes long, and the NASA trace (\(D_1\)) achieves better forecast accuracy than the other traces for a PWS of 20 minutes. It is also observed that the first order autoregressive model performs better on web server workloads than on cloud server workloads; among the cloud server workloads, the memory trace, i.e. \(D_5\), attains better results. Based on the MASE results, it can be concluded that AR(1) performs best on the Calgary trace.

Table 2 Performance of AR(1)
Table 3 Performance of MA(1)
Table 4 Performance of ARMA(1,1)
Table 5 Performance of ARIMA(1,1,1)
Table 6 Performance of Auto Arima
Table 7 Performance of exponential smoothing

Similar results were observed in the experiments performed with the first order moving average forecasting model. The errors measured in the forecasts of the MA process are tabulated in Table 3, which shows that the model attained better accuracy over the Calgary trace (\(D_2\)) for all values of the prediction window. The forecasts of the web server workloads were found to be more accurate than those of the cloud server workloads. On comparing the performance of the model over the different cloud workloads, the memory trace (\(D_5\)) achieved better forecasts than \(D_4\), i.e. the CPU demand trace.

We performed the same set of experiments with ARMA(1,1) and the results are listed in Table 4. The achieved forecast errors indicate that the model produces the lowest root mean squared error for the \(D_2\) forecasts in all instances of the prediction window. The Saskatchewan trace received the lowest mean absolute error for prediction windows of 30 and 60 minutes, while for the other values of PWS \(D_2\) got better forecasts according to mean absolute error. \(D_2\) also shows better results in terms of mean absolute scaled error except in one instance, i.e. when the prediction window is 20 minutes. The experimental results depict a similar trend in the accuracy of web and cloud server workloads as observed for the previous two models.

We also performed experiments with ARIMA(1,1,1) and auto ARIMA. The forecast errors measured for both models are shown in Tables 5 and 6 respectively. The results indicate that both models produced forecasts with similar accuracy. In both cases, the root mean squared error and mean absolute error of the Calgary trace forecasts were found to be better than those of the other traces. According to the mean absolute scaled error metric, the Saskatchewan trace and NASA trace forecasts were better when the prediction window was of 5 and 20 minutes duration respectively.

The last model we experimented with is exponential smoothing, and the forecast errors are listed in Table 7. Unlike the previous models, it produces mixed results. For instance, according to the root mean squared error metric the \(D_2\) forecasts are least erroneous for prediction windows of 5, 10, 20, and 30 minutes, while for the 60 minute case the Saskatchewan trace receives better forecasts. MAE indicates that the \(D_2\) forecasts are better when the prediction window duration is 5, 10, and 60 minutes and the \(D_1\) forecasts are better in the other cases. On comparing the MASE, the model performs better on \(D_2\) for the 10 and 30 minute prediction windows, on the Saskatchewan and CPU traces for the 20 and 60 minute windows respectively, and equally well on \(D_2\) and \(D_3\) for the 5 minute prediction window. Again, the model performs better on web server workloads in most of the cases and gives mixed performance over the cloud server traces.

6 Statistical Analysis

This section analyzes the forecast accuracy statistics observed for the different prediction models reported in the previous section. The experimental findings report the forecast accuracy on a particular data trace only, which makes it difficult to rank the models; statistical analysis helps in finding out whether the differences in results are significant or not. We have used the Friedman test [36] because it is considered to be the most powerful test when the performance of five or more models is compared [37]. The Friedman test considers a null hypothesis (\(H_0\)) that assumes the accuracy of each model is the same in nature and that no significant difference in the results is present. The alternative hypothesis (\(H_1\)) states that the experimental findings of at least one model are significantly different from the others. Provided the performance of k models is evaluated on d different datasets, a rank (\(\mathcal {R}_{\xi _{ij}}\)) for an experimental finding, i.e. the forecast accuracy of the \(j^{th}\) forecasting model (\(M_{j}\)) on the \(i^{th}\) dataset (\(D_i\)), is computed for all datasets. These ranks are summed up to compute the rank of a forecasting model (\(\mathcal {R}_{M_{j}}\)) as shown in Eq. (10). The test further computes the Friedman statistic, commonly referred to as the F-Statistic, as given in Eq. (11), where Q can be obtained from Eq. (12). The F-Statistic is tested against the F-quantiles for a given \(\alpha\) with degrees of freedom \(f_1 = k-1\) and \(f_2 = (d-1)(k-1)\), where \(\alpha\) is the significance level being considered. The study includes six models (\(k=6\)) and five data traces (\(d=5\)), and the statistical results are reported with the standard \(\alpha =0.05\).

$$\begin{aligned} \mathcal {R}_{M_j}= & {} \sum _{i = 1}^{d} \mathcal {R}_{\xi _{ij}}; j\in \{1,2,\ldots ,k\} \end{aligned}$$
(10)
$$\begin{aligned} F-Statistic= & {} \frac{(d - 1)Q}{d(k - 1) - Q} \end{aligned}$$
(11)
$$\begin{aligned} Q= & {} \frac{12}{dk(k + 1)} \sum _{j = 1}^{k} \Bigg ( \mathcal {R}_{M_j} - \frac{d(k + 1)}{2} \Bigg )^2 \end{aligned}$$
(12)
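The following sketch computes the Friedman ranks and the F-Statistic of Eqs. (10)–(12) for a matrix of forecast errors; the rank direction (lower error gets rank 1) and the variable names are assumptions of this illustration.

```r
# err: a d x k matrix of forecast errors (rows = data traces, columns = models).
friedman_fstat <- function(err) {
  d <- nrow(err); k <- ncol(err)
  ranks <- t(apply(err, 1, rank))                       # per-dataset ranks R_xi_ij
  R_M   <- colSums(ranks)                               # Eq. (10): rank of each model
  Q     <- 12 / (d * k * (k + 1)) *
           sum((R_M - d * (k + 1) / 2)^2)               # Eq. (12)
  (d - 1) * Q / (d * (k - 1) - Q)                       # Eq. (11): F-Statistic
}
# stats::friedman.test(err) reports the closely related chi-squared form of Q.
```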

We applied the statistical test using the STAC web platform [38]. The Friedman test observed an F-Statistic of 28.67 and a p-value of 0.0, which resulted in the rejection of the null hypothesis \(H_0\) for the RMSE based forecast accuracy, as it finds a significant difference in the predictions of the different models. Similarly, it also rejects \(H_0\) for the MAE and MASE based forecast accuracy evaluations. The F-Statistics and p-values for these findings are given in Table 8, where \(H_0.R\) denotes the rejection of \(H_0\). The Friedman ranks obtained by each model are shown in Fig. 8, where Figs. 8a, b, and c depict the ranks given to the forecasting models for the RMSE, MAE, and MASE based accuracy evaluations respectively.

Table 8 Friedman test (\(\alpha =0.05\))
Fig. 8
figure 8

Friedman test ranks

Table 9 Post hoc analysis using Finner test (Control Method=\(M_5\))

The Friedman test reports the presence of any significant difference in the results and provides a rank to each model based on the experimental results. However, post-hoc analysis methods can be used to explore the statistical properties of the experimental findings further. We have used the Finner post-hoc analysis [39], which performs multiple comparisons. The test considers one of the methods under evaluation as a control method. It works around a null hypothesis (\(H_0^{p}\)), where p denotes the null hypothesis for post-hoc analysis; \(H_0^p\) assumes that the mean of the control method's results is equal to that of each other group member under test. The pairwise comparison results obtained from the Finner post-hoc method are listed in Table 9, where \(H_0^p.R\) and \(H_0^p.A\) represent the rejection and acceptance of \(H_0^p\) respectively. It can be observed that the analysis rejects \(H_0^p\) in most of the comparisons. Based on these statistical findings, it can be observed that \(M_3\), \(M_4\), and \(M_5\) produced similar forecast accuracy and no significant difference among them is noticed. The rankings of these models also exhibit similar behavior, as no major difference in the rankings is observed; however, \(M_5\) received the best ranking among these models.

7 Conclusions

Forecasting has always been beneficial to a wide range of applications in business decision making. This work studies the performance behavior of time series forecasting schemes for server workload estimation in the cloud environment. The study analyzes the performance of six different forecasting models on three web server workload traces and two cloud server workload traces. We have analyzed the experimental findings of each model using statistical tests, including the Friedman test and Finner post-hoc analysis. The study finds that the three models \(M_{3}\), \(M_{4}\), and \(M_{5}\) produced more or less similar forecast accuracy, but the performance of \(M_{5}\) is observed to be the best among the considered models.