
1 Introduction

The study of time series forecasting benefits many fields, such as the prediction of electricity consumption, stock prices, population, rainfall, and so on. Generally, there are two families of time series forecasting models: linear models and non-linear models. The former includes the Auto-Regressive (AR) model, the Moving Average (MA) model, and their combination, the Auto-Regressive Integrated Moving Average (ARIMA) model. For its impact on the financial and economic fields, R. Engle, the proposer of Auto-Regressive Conditional Heteroskedasticity (ARCH) [1], was awarded the Nobel Memorial Prize in Economic Sciences in 2003. The latter, non-linear methods, usually utilize artificial neural networks such as the Multi-Layered Perceptron (MLP), the Radial Basis Function Net (RBFN), and deep learning methods [2,3,4,5,6].

In our previous works [3,4,5,6], the Deep Belief Net (DBN) [7], a well-known deep learning model, was first applied to time series forecasting. A hybrid model combining the DBN and ARIMA was also proposed to improve the prediction precision [8, 9]. The hybrid model was a combination of Artificial Neural Networks (ANNs) and linear models, inspired by the theory of G.P. Zhang [10].

Generally, error Back-Propagation (BP) [11] is used as the training (optimization) method of ANNs. Recently, Adaptive Moment Estimation (Adam) [12], an advanced gradient descent algorithm built on BP, has been widely utilized in the training of deep neural networks. The concept of Adam is to incorporate the first-order moment, i.e., an exponential average of past gradients, and the second-order moment, i.e., an exponential average of past squared gradients, into the parameter update. By smoothing the gradient in this way, Adam alleviates the local extremum problem in the high-dimensional parameter space and handles non-stationary objectives.

In this study, Adam is adopted for the first time in the fine-tuning process of the DBN, replacing the conventional BP optimization method. The benchmark dataset CATS [13, 14], an artificial time series used in a time series forecasting competition, and a chaotic time series generated by the Lorenz system, famous for its butterfly-shaped attractor, were used in the comparison experiments. In both experiments, the DBN with Adam showed its superiority over the conventional BP method in the fine-tuning process.

2 DBN for Time Series Forecasting

The original Deep Belief Net [7] was proposed for dimension reduction and image classification. It is a kind of deep auto-encoder composed of multiple Restricted Boltzmann Machines (RBMs). For time series forecasting, the decoder part of the DBN is replaced by a feedforward ANN, a Multi-Layered Perceptron (MLP), in our previous works [5, 6]. The structure of the DBN is shown in Fig. 1.

Fig. 1. A structure of a DBN composed of RBMs and an MLP [6, 8, 9].

2.1 RBM and Its Learning Rule

The Restricted Boltzmann Machine (RBM) [7] is a kind of Hopfield-type neural network with two layers. Units in the visible layer connect to units in the hidden layer with different weights. The outputs of the units \(v_{i}, h_{j}\) are binary, i.e., 0 or 1, except that the initial values of the visible units are given by the input data. The probability that a visible or hidden unit takes the value 1 is given as follows.

$$p({h}_{j}=1|v)=\frac{1}{1+\mathit{exp}(-{b}_{j}-{\sum }_{i=1}^{n}{w}_{ji}{v}_{i})}$$
(1)
$$p({v}_{i}=1|h)=\frac{1}{1+\mathit{exp}(-{b}_{i}-{\sum }_{j=1}^{m}{w}_{ij}{h}_{j})}$$
(2)

Here \({b}_{i},{b}_{j}, {w}_{ij}\) are the biases and the weights of units. The learning rules of RBM are given as follows.

$$\Delta {w}_{ij}=\varepsilon (<{v}_{i}{h}_{j}{>}_{\text{data}}-<{v}_{i}{h}_{j}{>}_{\text{model}})$$
(3)
$$\Delta {b}_{i}=\varepsilon (<{v}_{i}>-<{\tilde{v }}_{i}>)$$
(4)
$$\Delta {b}_{j}=\varepsilon (<{h}_{j}>-<{\tilde{h }}_{j}>)$$
(5)

where \(0<\varepsilon <1\) is a learning rate, \({p}_{ij}={<v}_{i}{h}_{j}{>}_{\text{data}}\) and \({p}_{ij}^{\prime}={<v}_{i}{h}_{j}{>}_{\text{model}}\) are the expectations under the data distribution and the model distribution, \(<{v}_{i}>, <{h}_{j}>\) indicate the first Gibbs sampling step (k = 0), and \(<{\tilde{v}}_{i}>, <{\tilde{h}}_{j}>\) are the expectations after the k-th Gibbs sampling step; in practice, it already works when k = 1 (contrastive divergence).
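As a concrete illustration of Eqs. (1)–(5), the following Python sketch performs one contrastive-divergence (k = 1) update of an RBM. It is a minimal sketch under our own choice of variable names and data layout, not code from the original works.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, b_v, b_h, eps=0.1, rng=None):
    """One contrastive-divergence (k = 1) update of an RBM, following Eqs. (1)-(5).

    v0  : (n,)   binary visible vector taken from the input data
    W   : (m, n) weights w_ji between hidden unit j and visible unit i
    b_v : (n,)   visible biases b_i
    b_h : (m,)   hidden biases b_j
    eps : learning rate (0 < eps < 1)
    """
    rng = rng or np.random.default_rng()

    # Eq. (1): p(h_j = 1 | v) for the data-driven step (k = 0)
    p_h0 = sigmoid(b_h + W @ v0)
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)

    # Eq. (2): reconstruct the visible layer, then resample the hidden layer (k = 1)
    p_v1 = sigmoid(b_v + W.T @ h0)
    v1 = (rng.random(p_v1.shape) < p_v1).astype(float)
    p_h1 = sigmoid(b_h + W @ v1)

    # Eqs. (3)-(5): <.>_data uses (v0, p_h0); <.>_model is approximated by (v1, p_h1)
    W += eps * (np.outer(p_h0, v0) - np.outer(p_h1, v1))
    b_v += eps * (v0 - v1)
    b_h += eps * (p_h0 - p_h1)
    return W, b_v, b_h
```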

2.2 MLP and Its Learning Rule

The Multi-Layered Perceptron (MLP) [11], a feedforward neural network, inspired the second Artificial Intelligence (AI) boom in the 1980s (see Fig. 1). The input \({x}_{i}\) (\(i=1,2,\ldots,n\)) is propagated to the hidden unit \({z}_{j}\) through the connection weight \({v}_{ji}\) and an activation function, and the output \(y=f(z)\) is computed from the hidden outputs and the connection weights \({w}_{j}\) (\(j=1,2,\ldots,K\)) as follows.

$$y=f(z)=\frac{1}{1+\mathit{exp}(-{\sum }_{j=1}^{K+1}{w}_{j}{z}_{j})}$$
(6)
$$f({z}_{j})=\frac{1}{1+\mathit{exp}(-{\sum }_{i=1}^{n+1}{v}_{ji}{x}_{i})}$$
(7)

where the bias units are fixed at \({x}_{n+1}=1.0\) and \({z}_{K+1}=1.0\).

Error Back-Propagation (BP) [11] serves as the learning rule of MLP as follows.

$$\Delta {w}_{j}=-\varepsilon (y-\tilde{y })y(1-y){z}_{j}$$
(8)
$$\Delta {v}_{ji}=-\varepsilon (y-\tilde{y })y(1-y){w}_{j}{z}_{j}(1-{z}_{j}){x}_{i}$$
(9)

where \(0<\varepsilon <1\) is the learning rate and \(\tilde{y}\) is the teacher signal, i.e., the value of the training sample.
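As a minimal sketch of Eqs. (6)–(9), the following Python code (our own illustration, with NumPy and hypothetical variable names) computes the forward pass and one BP update for a single training sample; the bias units \(x_{n+1}=z_{K+1}=1.0\) are appended explicitly.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mlp_forward(x, V, w):
    """Forward pass of Eqs. (6)-(7). x: (n,), V: (K, n+1), w: (K+1,)."""
    x_b = np.append(x, 1.0)      # bias input x_{n+1} = 1.0
    z = sigmoid(V @ x_b)         # Eq. (7): hidden outputs z_j
    z_b = np.append(z, 1.0)      # bias unit z_{K+1} = 1.0
    y = sigmoid(w @ z_b)         # Eq. (6): network output
    return y, z_b, x_b

def bp_update(x, y_teacher, V, w, eps=0.05):
    """One BP step following Eqs. (8)-(9) for a single training sample."""
    y, z_b, x_b = mlp_forward(x, V, w)
    delta = (y - y_teacher) * y * (1.0 - y)
    # Eq. (9): hidden-layer weights (the bias unit z_{K+1} has no incoming weights)
    V -= eps * delta * np.outer(w[:-1] * z_b[:-1] * (1.0 - z_b[:-1]), x_b)
    # Eq. (8): output weights
    w -= eps * delta * z_b
    return V, w, 0.5 * (y - y_teacher) ** 2
```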

Meanwhile, because the BP method is sensitive to noise and prone to converging to local minima, it is improved by Adam (adaptive moment estimation), proposed by Kingma and Ba in 2014 [12], as follows.

$$\Delta {\theta }_{t}=\frac{{\widehat{m}}_{t}}{\varepsilon +\sqrt{{\widehat{v}}_{t}}}$$
(10)
$${\widehat{m}}_{t}=\frac{{\beta }_{1}^{t}{m}_{t-1}}{1-{\beta }_{1}^{t}}+{g}_{t}$$
(11)
$${\widehat{v}}_{t}=\frac{{\beta }_{2}^{t}{v}_{t-1}}{1-{\beta }_{2}^{t}}+{{g}_{t}}^{2}$$
(12)
$${g}_{t}={\nabla }_{\theta }{E}_{t}({\theta }_{t-1})$$
(13)

where \(\theta =({v}_{ji},{w}_{j})\) are the parameters to be modified, and \(0<\varepsilon ,{\beta }_{1}^{t},{\beta }_{2}^{t}<1\) are hyperparameters given empirical scalar values. \({E}_{t}({\theta }_{t-1})\) is the loss function, e.g., the mean squared error between the output of the network and the teacher signal.

Although Adam has recently become the major optimization method in deep learning, to the best of our knowledge it has not been adopted in the fine-tuning of the DBN for time series forecasting. In this study, it is proposed that the Adam updates of Eqs. (10)–(13) replace the BP updates of Eqs. (8) and (9), i.e., the learning rules in the fine-tuning process of the DBN are given by Adam instead of the BP method, while the pre-training of the RBMs still follows Eqs. (3)–(5).
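For reference, the sketch below shows how the gradient of the fine-tuning loss can be passed through an Adam update. It follows the standard bias-corrected formulation of Kingma and Ba [12], which Eqs. (10)–(13) summarize; flattening the parameters \(({v}_{ji},{w}_{j})\) into a single vector is our own simplification.

```python
import numpy as np

class Adam:
    """Standard Adam update (Kingma & Ba, 2014 [12]) for a flat parameter vector."""

    def __init__(self, n_params, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
        self.alpha, self.beta1, self.beta2, self.eps = alpha, beta1, beta2, eps
        self.m = np.zeros(n_params)   # first-order moment m_t
        self.v = np.zeros(n_params)   # second-order moment v_t
        self.t = 0

    def step(self, theta, grad):
        """theta: parameters (v_ji, w_j) flattened; grad: g_t = dE_t/dtheta (Eq. (13))."""
        self.t += 1
        self.m = self.beta1 * self.m + (1.0 - self.beta1) * grad
        self.v = self.beta2 * self.v + (1.0 - self.beta2) * grad ** 2
        m_hat = self.m / (1.0 - self.beta1 ** self.t)   # bias correction, cf. Eq. (11)
        v_hat = self.v / (1.0 - self.beta2 ** self.t)   # bias correction, cf. Eq. (12)
        return theta - self.alpha * m_hat / (np.sqrt(v_hat) + self.eps)   # cf. Eq. (10)
```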

2.3 Meta Parameter Optimization

To design the structure of the ANNs, a swarm-intelligence evolutionary algorithm, Particle Swarm Optimization (PSO), and a heuristic algorithm, Random Search (RS) [15], are more effective than empirical methods such as the grid search algorithm [16]. In this study, PSO and RS are adopted to optimize the meta parameters of the DBN, i.e., the number of RBMs, the number of units in each RBM, the number of units of the MLP, the learning rate of the RBMs, and the learning rate of the MLP. The detailed algorithms can be found in [16] and are omitted here; a minimal sketch of the RS exploration is given below.
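The sketch below illustrates the Random Search exploration only; the ranges and the evaluation routine train_and_validate are hypothetical placeholders standing in for the full procedure of [16] and the actual values in Table 1.

```python
import random

# Hypothetical exploration ranges (cf. Table 1); the exact bounds are placeholders.
RANGES = {
    "n_rbm":       (1, 3),        # number of RBMs in the DBN
    "n_units":     (2, 20),       # number of units in each RBM / MLP hidden layer
    "lr_rbm":      (1e-3, 1e-1),  # learning rate of the RBMs (pre-training)
    "lr_finetune": (1e-3, 1e-1),  # learning rate of the fine-tuning (BP or Adam)
}

def sample_params():
    return {
        "n_rbm":       random.randint(*RANGES["n_rbm"]),
        "n_units":     random.randint(*RANGES["n_units"]),
        "lr_rbm":      random.uniform(*RANGES["lr_rbm"]),
        "lr_finetune": random.uniform(*RANGES["lr_finetune"]),
    }

def random_search(train_and_validate, n_trials=500):
    """train_and_validate(params) -> validation MSE of the trained DBN (user-supplied)."""
    best_params, best_mse = None, float("inf")
    for _ in range(n_trials):
        params = sample_params()
        mse = train_and_validate(params)
        if mse < best_mse:
            best_params, best_mse = params, mse
    return best_params, best_mse
```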

3 Experiments and Analysis

To investigate the performance of the DBN with the Adam optimization algorithm, comparison experiments of time series forecasting were carried out. The benchmark dataset CATS [13, 14] (see Fig. 2), an artificial time series used in a time series forecasting competition, and a chaotic time series of the Lorenz system (see Fig. 6) were used in the experiments.

3.1 Benchmark CATS

The CATS time series is an artificial benchmark for a forecasting competition with ANN methods [13, 14]. The series consists of 5,000 data points, among which 100 are missing (hidden by the competition organizers) (see Fig. 2). The missing data lie in five blocks:

  • elements 981 to 1,000

  • elements 1,981 to 2,000

  • elements 2,981 to 3,000

  • elements 3,981 to 4,000

  • elements 4,981 to 5,000

The mean squared error \(E_{1}\) is used as the prediction precision in the competition; it is computed from the 100 missing data points and their predicted values as follows:

$$E_{1} = \left\{ \sum\limits_{t=981}^{1000} (y_{t} - \bar{y}_{t})^{2} + \sum\limits_{t=1981}^{2000} (y_{t} - \bar{y}_{t})^{2} + \sum\limits_{t=2981}^{3000} (y_{t} - \bar{y}_{t})^{2} + \sum\limits_{t=3981}^{4000} (y_{t} - \bar{y}_{t})^{2} + \sum\limits_{t=4981}^{5000} (y_{t} - \bar{y}_{t})^{2} \right\} / 100$$
(14)

where \(\bar{y}_{t}\) is the long-term prediction result for the missing data.
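A direct transcription of Eq. (14) in Python, assuming the 5,000-point series is stored in 0-based arrays (element k of the competition description at index k − 1), could look as follows.

```python
import numpy as np

def cats_e1(y_true, y_pred):
    """E1 of Eq. (14): mean squared error over the five hidden blocks of CATS.

    y_true, y_pred: arrays of length 5000; 1-based element k is stored at index k - 1.
    """
    blocks = [(981, 1000), (1981, 2000), (2981, 3000), (3981, 4000), (4981, 5000)]
    total = 0.0
    for start, end in blocks:
        seg_true = y_true[start - 1:end]
        seg_pred = y_pred[start - 1:end]
        total += np.sum((seg_true - seg_pred) ** 2)
    return total / 100.0
```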

Fig. 2. The benchmark dataset CATS [13, 14].

3.2 Results and Analysis of CATS Forecasting

The meta parameter space searched by the heuristic algorithms, i.e., Particle Swarm Optimization (PSO) and Random Search (RS), has 5 dimensions: the number of RBMs in the DBN, the number of units in each RBM, the number of units in the hidden layer of the MLP, the learning rate of the RBMs, and the learning rate of the MLP. The exploration ranges of these meta parameters are shown in Table 1.

During the exploration by PSO and RS, training was stopped either at convergence of the evaluation function or at an upper limit of 2,000 iterations for pre-training (RBM) and 10,000 iterations for fine-tuning (MLP). Additionally, the training finished early when the forecasting error (the mean squared error between the real data and the output of the DBN) on the validation data increased compared with the previous check, as sketched below.
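The stopping rule can be sketched as follows; train_one_epoch and validation_mse are hypothetical placeholders for the actual DBN training and evaluation routines.

```python
def fit_with_early_stopping(train_one_epoch, validation_mse, max_epochs=10000):
    """Stop when the validation MSE increases compared with the previous epoch,
    or when the iteration limit (2,000 for pre-training, 10,000 for fine-tuning) is reached."""
    prev_mse = float("inf")
    for epoch in range(max_epochs):
        train_one_epoch()
        mse = validation_mse()
        if mse > prev_mse:   # validation error increased: stop training this configuration
            break
        prev_mse = mse
    return prev_mse
```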

Table 1. Meta parameter ranges of exploration by PSO and RS.
Table 2. The comparison of long-term prediction precision (E1) of different methods on the CATS data [13, 14].

The forecasting precisions of different ANN and hybrid methods are shown in Table 2. It can be confirmed that the proposed methods, the DBN using the Adam fine-tuning algorithm with RS or PSO, ranked at the top among all methods. The learning curves of the proposed method (Adam) and the conventional method (BP) are shown in Fig. 3 (the case of the 1st block of CATS). The loss (MSE) with Adam converged faster and to a smaller value than with BP under both the PSO and RS exploration algorithms.

Fig. 3. The convergence of loss (MSE) of the DBN with different fine-tuning methods (BP and Adam) and exploration algorithms (PSO and RS), in the case of the 1st block of the CATS data.

The change of the number of units in each RBM under the two exploration algorithms, PSO and RS, is shown in Fig. 4. The exploration by PSO ended after 15 iterations, whereas RS ran for 500 iterations. Both exploration results showed that a DBN with 2 RBMs was the best structure for the 1st block of CATS.

The change of the learning rates of the different RBMs (pre-training) and the MLP (fine-tuning) is shown in Fig. 5. The learning rates did not converge in any of the cases of BP and Adam with PSO or RS.

The exploration results of meta parameters for the 1st block data of CATS are described in Table 3.

Fig. 4. The change of the number of units in the RBM layers with different fine-tuning methods and exploration algorithms (in the case of the 1st block of the CATS data).

Fig. 5. The change of the learning rates with different fine-tuning methods and exploration algorithms (in the case of the 1st block of the CATS data).

Table 3. Meta parameters of DBN optimized by PSO and RS for the CATS data (Block 1)

3.3 Chaotic Time Series Data

Chaotic time series are difficult to predict in the case of long-term forecasting [5]. Here, we used the Lorenz system to compare the performance of DBNs with different fine-tuning methods in the case of short-term forecasting (one-ahead forecasting). The Lorenz system is given by the three-dimensional differential equations below.

$$\left\{ {\begin{array}{*{20}l} {\frac{dx}{{dt}} = - \sigma \cdot x + \sigma \cdot y} \hfill \\ {\frac{dy}{{dt}} = - x \cdot z + r \cdot x - y} \hfill \\ {\frac{dz}{{dt}} = x \cdot y - b \cdot z} \hfill \\ \end{array} } \right.$$
(15)

where the parameters are given by \(\sigma = 10,\; r = 28,\; b = \frac{8}{3},\;\Delta t = 0.01\) in the experiment. The attractor of the Lorenz system, with its butterfly shape, and the time series of the x-axis are shown in Fig. 6.
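Under the stated parameters, the x-axis series can be reproduced with a simple Euler integration of Eq. (15); the initial condition below is our own choice, not taken from the original experiment.

```python
import numpy as np

def lorenz_series(n_steps=1000, sigma=10.0, r=28.0, b=8.0 / 3.0, dt=0.01,
                  x0=1.0, y0=1.0, z0=1.0):
    """Euler integration of Eq. (15); returns the x-axis time series."""
    xs = np.empty(n_steps)
    x, y, z = x0, y0, z0
    for t in range(n_steps):
        dx = -sigma * x + sigma * y
        dy = -x * z + r * x - y
        dz = x * y - b * z
        x, y, z = x + dt * dx, y + dt * dy, z + dt * dz
        xs[t] = x
    return xs
```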

Fig. 6. The Lorenz chaos used in the short-term (one-ahead) prediction experiment.

3.4 Results and Analysis of Chaotic Time Series Forecasting

The exploration results of the meta parameters for the Lorenz chaotic time series obtained by PSO and RS with the different fine-tuning methods (Adam and BP) are described in Table 4. The Adam learning rule resulted in a deeper DBN structure than BP, especially in the case of RS. The convergence of loss (MSE) of the DBN with the different fine-tuning methods (BP and Adam) and exploration algorithms (PSO and RS) on the Lorenz time series (points 1 to 1,000 of the x-axis) is shown in Fig. 7. Finally, the precisions of the different forecasting methods are compared in Table 5. The best method for this time series forecasting was Adam with PSO, which yielded the lowest loss of 1.68 × 10\(^{-5}\).

Table 4. Meta parameters of DBN optimized by PSO and RS for the Lorenz chaos (x-axis).
Fig. 7. The convergence of loss (MSE) of the DBN with different fine-tuning methods (BP and Adam) and exploration algorithms (PSO and RS) using the Lorenz time series (points 1 to 1,000 of the x-axis).

Table 5. Precisions (MSE) of different DBNs (upper: training error; lower: test error).

4 Conclusions

In this study, an improved gradient descent method, Adam, was adopted for the first time in the fine-tuning process of the Deep Belief Net (DBN) for time series forecasting. The effectiveness of the optimization algorithm was shown not only on the benchmark dataset CATS, a long-term forecasting task given by five blocks of artificial data, but also on a chaotic time series, a short-term (one-ahead) forecasting problem. As the Adam optimizer has since been further improved into Nadam, AdaSecant, AMSGrad, AdaBound, etc., new challenges remain for future work.