1 Introduction

Cloud computing is a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction [18]. A cloud is a type of parallel and distributed system, in which a program or application runs on many connected computers at the same time. Although cloud computing has been widely adopted by industry, many existing issues have not been fully addressed, while new challenges [28] continue to emerge from industrial applications such as dynamic resource provisioning [23], virtual machine migration [19], server consolidation and energy management [8].

There have been many proposals for resource management approaches for cloud infrastructures, but effective resource management remains a major challenge for the leading cloud infrastructure operators (e.g., Amazon, Microsoft, Google), because the underlying workloads and real-world operational demands are highly complex. Large-scale distributed applications on a cloud require service-based software that can monitor system status changes, analyze the acquired information, and adapt its service configuration while considering tradeoffs among multiple QoS features simultaneously [14]. In distributed systems, host load monitoring tools are commonly used to record the host load data as a time series [26]. The aim of this work is to predict the host load in a future interval based on the information obtained from past load traces. Predicting the host load in advance can benefit resource allocation, improve resource utilization and enable the system to take prompt actions.

In 2011, Google released a substantial cluster usage dataset, collected by a Run-time Monitor (RTM) that records the characteristics of the hosts at run time. As a cloud environment is likely to be constructed from a variety of machine classes and heterogeneous applications, the Google cluster trace lacks precise information about the purpose of jobs and the configuration of machines. Thus, we cannot use this information to infer what is running. Nevertheless, we can still predict the host load in a future interval from the information of the past load trace.

In this paper, the load trace provided by Google is used as our experimental dataset. Most of the studied cluster load traces fall into one of the three following categories: long-running services, DAG-of-task systems and high-performance (or throughput) computing [20], while the Google cluster trace captures a range of behaviors that includes all the above categories, among others. To cope with such a mixture of these and other categories, we propose a new approach for host load prediction. In our approach, we choose the echo state network (ESN) as our basic prediction model, which is a kind of recurrent neural network (RNN) belonging to a collection of techniques called Reservoir Computing [13]. To achieve a better representation of the inputs, we introduce an autoencoder to learn higher-level features of the input data instead of using the data directly. The pre-recurrent feature layer can further capture the similarity between load traces.

The rest of the paper is organized as follows. The related work is presented in Sect. 2. Section 3 describes the unsupervised feature learning algorithm, and the comparison between RNN and ESN is discussed in Sect. 4. In Sect. 5, we give the details of our proposed approach. Afterwards, Sect. 6 presents the experimental results and comparison. Finally, we conclude our work in Sect. 7.

2 Related work

Host load prediction has long received attention from researchers due to its potential benefits. Many previous efforts have been devoted to host load prediction in traditional Grids or HPC systems. Akioka et al. [1] combined a Markov model and seasonal analysis to predict the host load one step ahead in a computational Grid. Wu et al. [25] used a hybrid model for multi-step-ahead host load prediction, which combined the autoregressive (AR) model with a Kalman filter. Duy et al. [6] applied an artificial neural network (ANN) to the task of host load prediction. Among these methods, the ANN is the most widely used approach for analyzing stochastic nonlinear systems and shows good performance in traditional distributed systems.

However, according to the comparison of workloads between Cloud and Grid [4], the average noise in a Cloud is approximately 20 times larger than that in a Grid. Therefore, predicting the host load in a Cloud is more complicated than in a Grid. Many traditional methods for time series mining have been evaluated in the Cloud, such as Linear Regression [8] and the Auto Regressive Integrated Moving Average [29], but these methods achieve limited accuracy when applied to the cloud environment.

Di et al. [5] first used a Bayesian model to predict the host load in a Cloud. They proposed 9 novel features to characterize the recent load fluctuation in the evidence window and could predict the mean load over consecutive time intervals. However, their method has two limitations. The first is that the training period in evaluation type B should contain the test period, which is not suitable for the cloud environment. The other is that they used an exponentially segmented pattern, which means that the length of the segments increases exponentially. As the segment length grows, the mean load cannot fully reflect the fluctuation of the host.

In [27], the authors combine Phase Space Reconstruction (PSR) and the Group Method of Data Handling based on an Evolutionary Algorithm (EA-GMDH) for host load prediction. The PSR method is used to reconstruct the load trace from a single-dimensional time series into a multi-dimensional phase space, and the reconstructed time series are used as the input of the EA-GMDH network. The prediction performance of this method is closely related to the chosen parameters, as the evolutionary algorithm is a stochastic global search method that may fall into local optima.

The ESN is a rather recent development in the field of RNNs, and it leads to a fast, simple and constructive algorithm for supervised training of RNNs. The basic idea of the ESN is to use a large reservoir RNN as a supplier of interesting dynamics from which the desired output is combined. This idea was independently discovered and investigated under the name of liquid state machines by Wolfgang Maass and collaborators [17]. The philosophy adopted in Reservoir Computing is to consider the recurrent layer as a large reservoir of nonlinear transformations of the input data and to decouple the learning of parameters inside and outside the reservoir.

Unsupervised feature learning refers to a class of machine learning techniques, developed rapidly since 2006, in which many stages of nonlinear information processing in hierarchical architectures are exploited for pattern classification. Recently, unsupervised feature learning technologies have been successfully applied in many research areas, such as handwritten digit image recognition [9], visual object classification [21] and natural language processing [10]. After initializing a deep neural network with unsupervised feature learning algorithms [e.g., autoencoders, matrix factorization and restricted Boltzmann machines (RBM)], the weights start at a better location in parameter space than if they had been randomly initialized [22]. The deep neural network can also be considered to perform feature learning, since it learns a representation of its input at the hidden layers which is subsequently used for classification or regression at the output layer.

In this paper, we propose a new framework which adds a pre-recurrent feature layer to the conventional ESN. The intuition behind the feature learning layer is that by capturing the similarity between load traces, similar traces will have similar trajectories in the reservoir state space. The feature learning algorithm we use is an autoencoder neural network, because it is simple to implement and achieves satisfactory performance. We give more details about the autoencoder in Sect. 3 and discuss the update of the feature matrix weights in Sect. 5.2.

3 Autoencoder

An autoencoder neural network is an unsupervised learning algorithm that sets the target values to be equal to the inputs, which means the autoencoder tries to learn a function \(f(x) \approx x\). In other words, it is trying to learn an approximation to the identity function, so that the output \(\hat{x}\) is similar to \(x\). By applying constraints on the network, such as limiting the number of hidden units or the average activation of the hidden units, we can discover high-level features of the input data. According to recent research [9, 10, 22], automatically learned features can improve the accuracy of classification and regression tasks compared with manually designed features.

The input of the autoencoder neural network is a set of \(n\) unlabeled examples \(\{x^{(1)}_{u}, x^{(2)}_{u},\ldots , x^{(n)}_{u}\}\), where each \(x^{(i)}_{u} \in \mathbb {R}^{n}\) is an example of the host load in the history window in our approach. Subscript \(u\) here indicates that it is an unlabeled example. The unlabeled data are used to learn a slightly higher level, more succinct representation of the inputs.

The network architecture is shown in Fig. 1. We denote the input host load by \(x\) and the network reconstruction by \(\hat{x} = f(\mathbf W ^{(2)}f(\mathbf W ^{(1)}x + b^{(1)})+b^{(2)})\), where \(\mathbf W ^{(1)}, \mathbf W ^{(2)}\) are the weights, \(b^{(1)},b^{(2)}\) are the bias terms, and \(f(\cdot )\) is the activation function in the form of:

$$\begin{aligned} f(z) = \frac{1}{1+\exp (-z)} \end{aligned}$$
(1)

The middle layer of the network is called the hidden layer \(h\), whose output is a new representation of the inputs. To remove the constraint on the number of hidden units, the autoencoder imposes a sparsity constraint on the hidden units. As a result, the autoencoder can still discover high-level features in the data effectively, even if the number of hidden units is large.

Fig. 1
figure 1

Autoencoder network with input host load \(x\) (left) and reconstructed host load \(\hat{x}\) (right)

Therefore, the optimization problem is to compute the weights and bias terms by minimizing the cost function \(J(\mathbf W ,b)\), whose form is as follows:

$$\begin{aligned} J(\mathbf W ,b)= & {} \frac{1}{n}\sum ^{n}_{i=1}\left( \frac{1}{2}\left\| f_\mathbf{W ,b}(x^{(i)}) -y^{(i)}\right\| ^{2}\right) \nonumber \\&+\frac{\lambda }{2}\sum ^{2}_{l=1}\sum ^{s_{l}}_{i=1}\sum ^{s_{l+1}}_{j=1}\left( \mathbf W ^{(l)}_{ji}\right) ^{2} + \beta \sum ^{s_{2}}_{j=1}\hbox {KL}(\rho \Vert \hat{\rho }_{j}) \end{aligned}$$
(2)

The first term in the cost function ensures that each input \(x^{(i)}_{u}\) is reconstructed well by minimizing the average squared error; the second term is a regularization term that prevents overfitting, where \(s_{l}\) is the number of units in layer \(l\); and the third term restricts the activations to be sparse, which means most of the activations are close to zero. In the third term,

$$\begin{aligned} \hat{\rho }_{j} = \frac{1}{m}\sum ^{m}_{i=1}a^{(2)}_{j}(x^{(i)}) \end{aligned}$$
(3)

which indicates the average activation of hidden unit \(j\), where \(a^{(2)}_{j}(x^{(i)})\) is the activation of hidden unit \(j\) for input \(x^{(i)}\) and \(\rho \) is the sparsity parameter. \(\hbox {KL}(\rho \Vert \hat{\rho }_{j})\) is the Kullback–Leibler (KL) divergence [15], given by

$$\begin{aligned} \hbox {KL}(\rho \Vert \hat{\rho }_{j}) = \rho \log \frac{\rho }{\hat{\rho }_{j}} + (1-\rho )\log \frac{1-\rho }{1-\hat{\rho }_{j}} \end{aligned}$$
(4)

KL divergence is a standard function for measuring how different two distributions are, and the function has the property that \(\hbox {KL}(\rho \Vert \hat{\rho }_{j}) = 0\) if \(\hat{\rho }_{j} = \rho \), and otherwise it increases monotonically as \(\hat{\rho }_{j}\) diverges from \(\rho \).
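To make the optimization concrete, the following minimal NumPy sketch evaluates the cost of Eq. (2) for a batch of input windows. It is an illustration only: the variable names, the random "load windows", and the penalty weights lam and beta are assumptions rather than the exact settings used in our experiments.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sparse_autoencoder_cost(W1, b1, W2, b2, X, lam=1e-4, rho=0.1, beta=3.0):
    """Cost J(W, b) of Eq. (2) for inputs X of shape (n_examples, n_features)."""
    n = X.shape[0]
    A2 = sigmoid(X @ W1.T + b1)          # hidden activations a^(2)
    X_hat = sigmoid(A2 @ W2.T + b2)      # reconstruction f_{W,b}(x)

    recon = 0.5 * np.sum((X_hat - X) ** 2) / n                # reconstruction term
    decay = 0.5 * lam * (np.sum(W1 ** 2) + np.sum(W2 ** 2))   # weight-decay term

    rho_hat = A2.mean(axis=0)                                 # Eq. (3)
    kl = np.sum(rho * np.log(rho / rho_hat)
                + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))  # Eq. (4)

    return recon + decay + beta * kl

# Example usage on random "host load" windows in [0, 1]
rng = np.random.default_rng(0)
X = rng.random((64, 24))                  # 64 history windows of length 24
W1 = 0.01 * rng.standard_normal((200, 24)); b1 = np.zeros(200)
W2 = 0.01 * rng.standard_normal((24, 200)); b2 = np.zeros(24)
print(sparse_autoencoder_cost(W1, b1, W2, b2, X))
```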

4 Echo state network

4.1 Recurrent neural network

An RNN is an ANN with internal loops, which makes it an expressive model for sequence tasks. At each time step, the RNN receives an input, updates its hidden state, and makes a prediction, as shown in Fig. 2. The RNN is powerful because it has a high-dimensional hidden state with nonlinear dynamics that enables it to remember and process past information. Even if the nonlinearity used by each unit is quite simple, iterating it over time leads to very rich dynamics.

Fig. 2
figure 2

The recurrent neural network

The standard RNN is formalized as follows: given a sequence of input vectors \((\mathbf x _{1}, \mathbf x _{2}, \ldots , \mathbf x _{n})\), the RNN computes the hidden states \((\mathbf h _{1}, \mathbf h _{2}, \ldots , \mathbf h _{n})\) and the outputs \((\mathbf y _{1}, \mathbf y _{2}, \ldots , \mathbf y _{n})\) using the following equations:

$$\begin{aligned} \mathbf h _{t}= & {} f(\mathbf W _{\mathrm{in}}\mathbf x _{t} + \mathbf W \mathbf h _{t-1})\end{aligned}$$
(5)
$$\begin{aligned} \mathbf y _{t}= & {} \mathbf W _{\mathrm{out}}\mathbf h _{t} \end{aligned}$$
(6)

where \(t = 1, 2, \ldots , n\), and \(\mathbf W _{\mathrm{in}}\), \(\mathbf W \), \(\mathbf W _{\mathrm{out}}\) are the input-hidden, hidden-hidden and hidden-output connection matrices, respectively.

The standard training algorithm for RNNs is back-propagation through time (BPTT) [24]: the RNN is unfolded through time, and back-propagation is then used to update the weights. However, Hochreiter [11] and Bengio [2] showed that the gradient decays exponentially as it is back-propagated through time, and argued that RNNs cannot learn long-range temporal dependencies when gradient descent is used for training.

4.2 Echo state network

The ESN [13] is a relatively new RNN architecture, based on a rich reservoir of potentially interesting behavior. The reservoir of an ESN is a recurrent layer composed of a large number of sparsely interconnected and randomly initialized standard sigmoid units. Only the output connections are modified during the learning process. A significant advantage of this approach over the standard RNN is that simple linear regression algorithms can be used to adjust the output weights. This is a much easier learning task, and it works surprisingly well provided the recurrent connections are carefully initialized so that the intrinsic dynamics of the network exhibits a rich reservoir of temporal behaviors that can be selectively coupled to the output.

The architecture of ESN used in this paper is shown in Fig. 3. A large RNN is used as a dynamic reservoir, which can be excited by suitably presented input and/or feedback output. The training algorithm of an ESN consists of the following steps:

  (a) Generate an RNN randomly

The input-hidden matrix \(\mathbf W _{\mathrm{in}}\) and the hidden-hidden matrix \(\mathbf W \) are generated randomly. Once generated, they do not change during the entire training process.

Fig. 3
figure 3

The basic architecture of ESN

Only when an RNN has the “echo state property” [12] can it be used for dynamic system modeling. The echo state property means that for each internal unit \(\mathbf h _{i}\) there exists an echo function \(e_{i}\) such that the current state can be written as \(\mathbf h _{i} = e_{i}(\mathbf x _{t}, \mathbf x _{t-1}, \ldots )\); that is, the network state is an “echo” of the input history. A recent input presented to the network has more influence on the network state than an older one, and the influence of an input gradually fades out. Echo states are crucial for the successful operation of an ESN, and their existence is usually ensured by rescaling the recurrent weight matrix \(\mathbf W \) to a specified spectral radius \(\lambda \). This can be achieved by simply multiplying all elements of a randomly generated recurrent weight matrix by \(\frac{\lambda }{\lambda _{\max }}\), where \(\lambda _{\max }\) is the spectral radius of the original matrix.
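For example, the rescaling can be done in a few lines of NumPy (a sketch; the matrix size, the uniform initialization range and the target radius of 0.1 are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.uniform(-0.5, 0.5, size=(100, 100))      # randomly generated recurrent weights
lam_max = np.max(np.abs(np.linalg.eigvals(W)))   # spectral radius of the original matrix
W = W * (0.1 / lam_max)                          # rescale to the target spectral radius 0.1
```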

  (b) Feed the training data to the ESN

When the input units are fed to the ESN, they activate dynamics within the dynamic reservoir. At each time step, the internal dynamic reservoir states are computed according to Eq. (5). Then, we collect the input units \(\mathbf x _{i}\) and the states \(\mathbf h _{i}\) together as the \(i\)-th row of a matrix \(\mathbf M \). Meanwhile, the output data are collected into another matrix \(\mathbf T \).

  (c) Wash out the initial memory of the dynamic reservoir

Since the arbitrarily generated initial network state contains a memory that is not caused by the input, the first part of the collected data is discarded so that the effects of the arbitrary starting state have died out. Thus, we only keep part of the matrices \(\mathbf M \) and \(\mathbf T \); the resulting matrices are denoted \(\mathbf M ^{\mathrm{forget}}\) and \(\mathbf T ^{\mathrm{forget}}\).

  (d) Compute the output weights

The output weights can be derived by the following equation:

$$\begin{aligned} \mathbf W _{\mathrm{out}}^\mathrm{T} = \hbox {pseudoinverse}(\mathbf M ^{\mathrm{forget}})\cdot \mathbf T ^{\mathrm{forget}} \end{aligned}$$
(7)
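Steps (a)-(d) can be summarized in the following NumPy sketch. It is a simplified illustration: the logistic sigmoid of Eq. (1) is used as the reservoir nonlinearity, the washout length and input weight range are assumptions, and the synthetic trace only stands in for a real host load trace.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_esn(x, y, n_res=100, washout=50, radius=0.1, seed=0):
    """Train a basic ESN on a 1-D load trace x with one-step targets y."""
    rng = np.random.default_rng(seed)

    # (a) generate the RNN randomly and rescale W to the target spectral radius
    W_in = rng.uniform(-0.1, 0.1, size=(n_res, 1))
    W = rng.uniform(-0.5, 0.5, size=(n_res, n_res))
    W *= radius / np.max(np.abs(np.linalg.eigvals(W)))

    # (b) feed the training data and collect [x_t; h_t] as the rows of M
    h = np.zeros(n_res)
    M, T = [], []
    for t in range(len(x)):
        h = sigmoid(W_in @ np.array([x[t]]) + W @ h)   # Eq. (5)
        M.append(np.concatenate(([x[t]], h)))
        T.append(y[t])
    M, T = np.array(M), np.array(T)

    # (c) wash out the initial memory of the reservoir
    M_forget, T_forget = M[washout:], T[washout:]

    # (d) compute the readout weights (Eq. (7) gives W_out^T)
    W_out = np.linalg.pinv(M_forget) @ T_forget
    return W_in, W, W_out

# Example: one-step-ahead prediction on a synthetic load trace
rng = np.random.default_rng(1)
trace = 0.5 + 0.3 * np.sin(np.arange(1000) / 20) + 0.05 * rng.standard_normal(1000)
W_in, W, W_out = train_esn(trace[:-1], trace[1:])
```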

We chose the ESN as our basic model based on the following considerations:

  • ESNs can be trained very fast because they just fit a linear model [16].

  • ESNs work surprisingly well if we initialize weights carefully.

  • ESNs can do impressive modeling of one-dimensional time series.

5 Our approach

In a traditional ESN, the reservoir constructs a representation of the input data by applying a complex nonlinear transformation to it. To achieve a better representation of the inputs, we introduce the autoencoder to learn higher-level features of the input data instead of using the data directly. The intuition that led to the introduction of the autoencoder is that by capturing the similarity between load traces, similar traces will have similar trajectories in the reservoir state space. For example, if two load traces appear in similar contexts, most of the time they will have similar features and share the same function.

Another advantage of extracting features from the input data before feeding it to the reservoir is that the dimension of the feature vector is smaller than that of the input vector, which reduces the computational complexity.

5.1 Model definition

The architecture of our approach is shown in Fig. 4. Suppose we have the input data \(\mathbf x = (\mathbf x _{1}, \mathbf x _{2}, \ldots , \mathbf x _{n})\); the equations describing the network are:

$$\begin{aligned} \mathbf h _{t}= & {} f(\mathbf W _{\mathrm{in}} \cdot \mathbf W _{f} \cdot \mathbf x _{t} + \mathbf W \cdot \mathbf h _{t-1})\end{aligned}$$
(8)
$$\begin{aligned} \mathbf y _{t}= & {} \mathbf W _{\mathrm{out}} \cdot \mathbf h _{t} \end{aligned}$$
(9)

where \(\mathbf h _{t}\) is the output of the hidden units, \(f\) is the hidden unit activation function, and \(\mathbf W _{f}\), \(\mathbf W _{\mathrm{in}}\), \(\mathbf W \), and \(\mathbf W _{\mathrm{out}}\) are the input-feature, feature-hidden, hidden-hidden and hidden-output connection matrices, respectively.

Fig. 4
figure 4

Echo state network with an autoencoder

Initialization strategies based on unsupervised pretraining of each layer have been shown to be important for both supervised and unsupervised training of deep architectures [3]. After initializing the deep neural network with unsupervised feature learning algorithms, the weights start at a better location in parameter space than if they had been randomly initialized. The ESN corresponds to a very deep architecture when unfolded in time, so we should ensure the quality of the learned weight matrices. In our experiments, we found that initializing \(\mathbf W _{f}\) with a trained autoencoder yields less noisy filters. The other matrices are initialized to small random values.
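A short sketch of the forward pass in Eqs. (8) and (9) is given below, assuming \(\mathbf W _{f}\) has been initialized with the encoder weights \(\mathbf W ^{(1)}\) of a trained autoencoder and that the remaining matrices are generated as in Sect. 4.2; the function and argument names are illustrative, not the exact implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x_seq, W_f, W_in, W, W_out):
    """Forward pass of the feature-layer ESN, Eqs. (8)-(9).

    x_seq: sequence of input windows of shape (n, d); W_f maps each window
    to a lower-dimensional feature vector before it enters the reservoir.
    """
    h = np.zeros(W.shape[0])
    outputs = []
    for x_t in x_seq:
        f_t = W_f @ x_t                      # Eq. (11): feature layer
        h = sigmoid(W_in @ f_t + W @ h)      # Eq. (8)
        outputs.append(W_out @ h)            # Eq. (9)
    return np.array(outputs), h
```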

5.2 Details of gradient descent

The input-feature matrix \(\mathbf W _{f}\) is learned by minimizing the cost function \(C\) over every load trace in the training set. At each time step, the prediction of the network \(\mathbf y _{t}\) is compared to the real load \(\mathbf r _{t}\),

$$\begin{aligned} C = \frac{1}{2}\sum ^{n-1}_{t=1}(\mathbf y _{t} - \mathbf r _{t})^\mathrm{T}(\mathbf y _{t} - \mathbf r _{t}) \end{aligned}$$
(10)

To make the derivation of the learning procedure clearer, we briefly derive the matrix formulation of back-propagation through time. Consider the ESN with a load trace of length \(n\) (\(\mathbf x = (\mathbf x _{1}, \mathbf x _{2}, \ldots , \mathbf x _{n})\)); the equations describing the unrolled ESN are as follows:

$$\begin{aligned} \mathbf f _{t}= & {} \mathbf W _{f}\mathbf x _{t}\end{aligned}$$
(11)
$$\begin{aligned} \mathbf a _{t}= & {} \mathbf W \mathbf h _{t-1} + \mathbf W _{\mathrm{in}}\mathbf f _{t}\end{aligned}$$
(12)
$$\begin{aligned} \mathbf h _{t}= & {} f(\mathbf a _{t})\end{aligned}$$
(13)
$$\begin{aligned} \mathbf y _{t}= & {} \mathbf W _{\mathrm{out}}\mathbf h _{t} \end{aligned}$$
(14)

According to gradient descent, each weight change in the network should be proportional to the negative gradient of the cost with respect to the specific weight we are interested in modifying.

$$\begin{aligned} \triangle {W} = -\eta \frac{\partial {C}}{\partial {W}} \end{aligned}$$
(15)

The gradient of \(C\) with respect to the parameters of the autoencoder can be computed using the following equations:

$$\begin{aligned} \frac{\partial {C}}{\partial \mathbf{W }_{f,t}}= & {} \frac{\partial {C}}{\partial \mathbf{f }_{t}}\frac{\partial \mathbf{f }_{t}}{\partial \mathbf{W }_{f,t}} =\frac{\partial {C}}{\partial \mathbf{f }_{t}}\mathbf{x }_{t}\end{aligned}$$
(16)
$$\begin{aligned} \frac{\partial {C}}{\partial \mathbf{f _{t}}}= & {} \frac{\partial {C}}{\partial {\mathbf{a }_{t}}}\frac{\partial {\mathbf{a }_{t}}}{\partial {\mathbf{f }_{t}}} =\frac{\partial {C}}{\partial {\mathbf{a }_{t}}}\mathbf{W }_{\mathrm{in},t} \end{aligned}$$
(17)

The gradient of \(C\) with respect to \(\mathbf{f }_{t}\) can be rewritten in the matrix form:

$$\begin{aligned} \nabla _\mathbf{f _{t}}C = (\mathbf W _{\mathrm{in}})^\mathrm{T}\delta _{t} \end{aligned}$$
(18)

where \(\delta _{t} = \frac{\partial {C}}{\partial \mathbf{a _{t}}}\). The quantity \(\delta _{t}\) is usually called the error term at time \(t\), and can be expressed using \(\delta _{t+1}\):

$$\begin{aligned} \delta _{t}= & {} \frac{\partial {C}}{\partial {\mathbf{a }_{t}}}\nonumber \\= & {} \frac{\partial {C}}{\partial {\mathbf{a }_{t+1}}}\frac{\partial {\mathbf{a }_{t+1}}}{\partial {\mathbf{h }_{t}}}\frac{\partial {\mathbf{h }_{t}}}{\partial {\mathbf{a }_{t}}}\nonumber \\= & {} \frac{\partial {C}}{\partial {\mathbf{a }_{t+1}}}\mathbf{W }f^{'}(\mathbf{a }_{t})\nonumber \\= & {} \delta _{t+1}\mathbf{W }f^{'}(\mathbf{a }_{t}) \end{aligned}$$
(19)
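The resulting update can be sketched as below. The sketch assumes the logistic sigmoid of Eq. (1), so that \(f^{'}(\mathbf a _{t}) = f(\mathbf a _{t})(1-f(\mathbf a _{t}))\), and, for completeness, it also propagates the error injected by the readout at every time step into \(\delta _{t}\); the function and variable names are ours, not the exact implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def wf_gradient(x_seq, r_seq, W_f, W_in, W, W_out):
    """Sketch of the BPTT gradient of the cost in Eq. (10) w.r.t. W_f.

    Shapes: x_seq (n, d), r_seq (n, m), W_f (k, d), W_in (n_res, k),
    W (n_res, n_res), W_out (m, n_res).
    """
    n = len(x_seq)

    # forward pass (Eqs. 11-14), storing hidden states and outputs
    h = np.zeros(W.shape[0])
    H, Y = [], []
    for x_t in x_seq:
        a_t = W @ h + W_in @ (W_f @ x_t)
        h = sigmoid(a_t)
        H.append(h)
        Y.append(W_out @ h)

    # backward pass: delta_t = dC/da_t (Eq. 19) plus the readout error at step t
    grad = np.zeros_like(W_f)
    delta_next = np.zeros(W.shape[0])
    for t in reversed(range(n)):
        dC_dh = W_out.T @ (Y[t] - r_seq[t]) + W.T @ delta_next
        delta = dC_dh * H[t] * (1.0 - H[t])         # sigmoid derivative f'(a) = f(a)(1 - f(a))
        grad += np.outer(W_in.T @ delta, x_seq[t])  # Eqs. (16)-(18)
        delta_next = delta

    return grad  # used in the update of Eq. (15): W_f <- W_f - eta * grad
```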

6 Performance evaluation

6.1 Dataset and parameters

To evaluate our approach, a dataset containing information about the hosts in a Google data center [7] has been employed. The most studied cluster workloads fall into one of the three following categories [20]:

  • long-running services: servers that require a certain amount of resources (usually CPU resource) to achieve acceptable performance and run indefinitely.

  • DAG-of-task systems: MapReduce-like systems that run many independent short tasks that are assumed to be CPU bound or I/O bound.

  • high-performance (or throughput) computing: batch queuing systems that typically run CPU-bound programs, can usually tolerate substantial wait times, and often require many machines simultaneously for a long period of time.

Each of these categories brings different challenges to resource management. Unlike most previous clusters, which have not faced the diverse workloads of multi-purpose clouds, the Google cluster trace captures a range of behaviors that includes all the above categories, among others. The trace covers over \(670{,}000\) jobs and over 40 million task events across about 12,000 machines over a 1-month period. The trace lacks precise information about the purpose of jobs and the configuration of machines. The resource (RAM and CPU) information in the trace has already been normalized to \([0,1]\).

In our experiment, we only predict the CPU load, because the CPU load vividly reflects the request and usage of resources. The fluctuation of the CPU load can indicate the application execution process, while a high CPU load will considerably slow down the host. Moreover, predicting the CPU load is more challenging than predicting the memory load, since the fluctuation of the CPU load is more drastic [5]. To apply our proposed method, we split Google's \(29\)-day trace data into three sets: a training set, which is used to update the feature weights and the output weights; a validation set, which is used to prevent overfitting; and a prediction set, which is used to evaluate our proposed method.
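The split itself is chronological; the sketch below divides a single host's trace into the three sets (the 60/20/20 proportions are an illustrative assumption, not the exact ratio used in our experiments).

```python
import numpy as np

def split_trace(trace, train_frac=0.6, val_frac=0.2):
    """Chronologically split one 29-day load trace into training,
    validation and prediction sets (illustrative 60/20/20 split)."""
    trace = np.asarray(trace, dtype=float)
    i1 = int(len(trace) * train_frac)
    i2 = int(len(trace) * (train_frac + val_frac))
    return trace[:i1], trace[i1:i2], trace[i2:]

# a 29-day trace sampled every 5 minutes has 29 * 24 * 12 = 8352 points
train, val, test = split_trace(np.random.rand(29 * 24 * 12))
```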

To achieve the best prediction performance, several important parameters must be chosen. The most important parameters in feature extraction are the number of hidden units hiddenSize and the sparsity of the hidden units sparsityParam, which are estimated according to the reconstruction error. The optimal values of these parameters are computed using tenfold cross-validation. According to the experimental results in Fig. 5, the values of hiddenSize and sparsityParam that minimize the reconstruction error are 200 and 0.1, respectively.

Fig. 5
figure 5

The reconstruction error with different hiddenSize and sparsityParam. The axis hiddenSize represents the number of hidden units of the autoencoder, while the axis sparsityParam represents the sparsity constraint on the hidden units. The minimum reconstruction error is achieved when the number of hidden units is 200 and the sparsity constraint is \(0.1\)

Fig. 6
figure 6

The MSE as a function of the reservoir size. Different colors of the curves indicate different prediction lengths

The parameters of the ESN are also selected by cross-validation. The input connection weights \(\mathbf W _{\mathrm{in}}\) are set to small random values sampled from a uniform distribution over \([-0.1, 0.1]\). The reservoir matrix \(\mathbf W \) is rescaled to a spectral radius of \(0.1\), thus ensuring the echo state property. The size of the reservoir layer is far larger than that of the hidden layer of a traditional RNN at a similar level of performance. In our experiment, we chose the reservoir size by minimizing the MSE of the load traces in the validation set, where the MSE is defined as in Eq. (20). According to Fig. 6, we set the reservoir size to \(100\).

$$\begin{aligned} \hbox {MSE} = \frac{1}{T}\sum ^{T}_{i=1}(y_{i}-r_{i})^{2} \end{aligned}$$
(20)

6.2 Performance evaluation

6.2.1 Methods for comparison

To show the effectiveness of our proposed method, we also rigorously and comprehensively implemented four other load prediction methods. The parameters of these baseline methods, shown in Table 1, are chosen to achieve their best performance. Some details are given as follows:

  • Auto regressive method (AR): the classic AR method is performed according to Eq. (21), where \(y(t)\), \(a_{i}\), \(x_{i}\) and \(\varepsilon _{t}\) refer to the predicted value, the coefficient, the history value and the noise at time \(t\), respectively.

    $$\begin{aligned} y(t) = \sum ^{n}_{i=1}a_{i}x_{i}+\varepsilon _{t} \end{aligned}$$
    (21)

    In general, the AR method can only predict the load value for the next moment; previous works extended it to long-term point prediction by applying the AR method recursively on the predicted values, as illustrated in the sketch after this list.

  • ANN [6] method: the ANN method uses the load in the history window as the input vector, and the load in the future interval as the output vector. After training the network, the model it learned will be used for host load prediction.

  • Bayes [5] method: the aim of the Bayes method is to predict the vector of load values, denoted by \(\mathbf l =(l_{1},\ldots ,l_{n})^\mathrm{T}\), each of which represents the mean load value over a particular segment. According to Di et al. [5], the Bayes method we applied is the best strategy, using MMSE-BC with the single feature \(F_{\mathrm{ml}}\) based on evaluation type A.

  • PSR\(+\)EA-GMDH method [27]: this method combines the PSR method and EA-GMDH method. The PSR is used to reconstruct the load trace from single-dimensional time series to a multi-dimensional phase space, and the time series after reconstruction are used as the input of the EA-GMDH network.
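As referenced in the AR item above, the following sketch shows how the AR baseline is extended to multi-step prediction by feeding its own outputs back as inputs; the order \(p=16\) and the least-squares fitting are illustrative choices, not the tuned settings of Table 1.

```python
import numpy as np

def ar_recursive_forecast(history, p=16, steps=36):
    """Fit an AR(p) model by least squares (Eq. (21), noise term omitted)
    and predict `steps` values ahead by recursing on its own predictions."""
    h = np.asarray(history, dtype=float)
    # build the regression problem y_t = a_1*y_{t-1} + ... + a_p*y_{t-p}
    X = np.column_stack([h[p - i - 1:len(h) - i - 1] for i in range(p)])
    y = h[p:]
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)

    window = list(h[-p:][::-1])          # most recent value first
    preds = []
    for _ in range(steps):
        nxt = float(np.dot(coeffs, window))
        preds.append(nxt)
        window = [nxt] + window[:-1]     # feed the prediction back as input
    return preds

# Example: 36-step (3 h) recursive forecast on a synthetic trace
preds = ar_recursive_forecast(np.sin(np.arange(300) / 10.0), p=16, steps=36)
```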

Table 1 Optimized parameters for the baseline methods

6.2.2 Experimental results

We evaluated the accuracy of the prediction results by the mean squared error (MSE), defined in Eq. (20), where \(T\) is the length of the prediction step, and \(y_{i}\) and \(r_{i}\) are the predicted and real values, respectively. To compare our proposed method with the Bayes method [5], we use our proposed method to predict the mean host load, which is quantified using the mean segment squared error (MSSE) [5]. The MSSE is defined as follows:

$$\begin{aligned} \hbox {MSSE}(s)=\frac{1}{s}\sum ^{n}_{i=1}s_{i}(l_{i}-L_{i})^{2} \end{aligned}$$
(22)

where \(s_{1}=b\), \(s_{i}=b\cdot 2^{i-2}\) for \(i \ge 2\), and \(s=\sum ^{n}_{i=1}s_{i}\); \(s_{i}\) is the length of the \(i\)-th segment; \(l_{i}\) and \(L_{i}\) are the predicted value and the real value, respectively; \(n\) is the total number of segments in the prediction interval. More details are presented in [5].
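For reference, the metric of Eq. (22) can be computed with a few lines of NumPy; the helper below takes the per-segment mean values \(l_{i}\) and \(L_{i}\) directly and is our own illustrative sketch.

```python
import numpy as np

def msse(pred_means, real_means, b=1):
    """Mean segment squared error, Eq. (22).

    pred_means, real_means: per-segment predicted and real mean loads
    (l_i and L_i); segment lengths follow s_1 = b, s_i = b * 2^(i-2) for i >= 2.
    """
    l = np.asarray(pred_means, dtype=float)
    L = np.asarray(real_means, dtype=float)
    n = len(l)
    s_i = np.array([b] + [b * 2 ** (i - 2) for i in range(2, n + 1)], dtype=float)
    return np.sum(s_i * (l - L) ** 2) / s_i.sum()

# Example with three segments of lengths b, b, 2b
print(msse([0.40, 0.50, 0.55], [0.42, 0.48, 0.60], b=1))
```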

First, we compared our proposed method with the conventional ESN. According to Fig. 7, the addition of the feature layer improves the prediction accuracy. The introduction of the unsupervised feature learning layer before the recurrent layer leads to a novel organization of the reservoir activation pattern. The features can better capture the similarity between load traces, and therefore allow separating more sharply the activations corresponding to different load traces. Figure 7 also shows that although the prediction error grows as the prediction length increases, our proposed method can still achieve satisfactory performance when the prediction length is 3 h ahead (the interval of the Google load trace is 5 min, so 3 h ahead corresponds to \(36\) steps ahead).

Fig. 7
figure 7

Comparison of the conventional ESN and our proposed model

Figure 8 shows the comparison between our method and the methods mentioned in Sect. 6.2.1 for actual host load prediction. As the Bayes method can only predict the mean host load, we do not compare our method with the Bayes method in this case. The \(x\)-axis in Fig. 8 represents the value of the MSE, while the \(y\)-axis represents the cumulative distribution function (CDF) of the MSE. The results show that our proposed method outperforms the others in terms of lower MSE. In particular, our proposed method predicts different types of host traces well, while the other methods, such as the AR method, perform poorly on some host traces.

Fig. 8
figure 8

MSE of host load prediction

In Fig. 9, we compare our proposed method with the other four methods for mean host load prediction. According to our statistics, our method improves the accuracy over the second best method (PSR\(+\)EA-GMDH) by 26.1 % for the 2.7 h ahead prediction, 22.4 % for 5.3 h, and about 32.3 % for 10.7 h.

Fig. 9
figure 9

The average MSSE, where \(h\) represents hours and \(s_{1}\) is 5 min

7 Conclusions and future work

In this paper, we proposed a novel method for CPU load prediction. In our proposed method, we use the ESN as the basic prediction model, considering that the ESN can be trained very fast and models one-dimensional time series impressively well. To achieve a better representation of the inputs, we introduce a pre-recurrent feature layer in the form of an autoencoder neural network. The autoencoder can better capture the similarity between load traces. We evaluated our proposed method on the one-month Google load trace. Compared with other state-of-the-art methods, our proposed method achieves higher accuracy. This implies that our work can be used to address scheduling problems in future work. However, the features learned by the autoencoder are not easy to visualize when the input data are host loads; we will try to visualize these features in future work.