1 Introduction

Time series forecasting is a popular research topic because of its wide range of applications. For example, traffic forecasting services make travel more convenient, climate forecasts support transportation and agricultural production, and financial forecasts inform business investment. In time series modelling, some data can be characterized by linear models, whereas other data must be characterized by nonlinear models. Systems with linear and/or nonlinear behaviours can be modelled using different methods. For time series data, various prediction models have been proposed in the literature [1], in which the typical linear regression (LR) model is either a linear auto-regression (AR) model or a linear auto-regressive integrated moving average (ARIMA) model, and a promising nonlinear model is the deep belief network (DBN) model [2,3,4]. These three models are used in this paper.

The LR model is widely used in different fields, such as forecasting short-term electricity demands [5], stock indexes [6], and wind speeds [7]. In this paper, the linear AR and ARIMA models are used as predictive models for the linear behaviours of a time series. For nonlinear time series, however, LR modelling may not be a good choice. The artificial neural network (ANN) is an effective nonlinear modelling technique for predicting nonlinear time series [1]. ANNs have been shown to achieve the desired accuracy in nonlinear time series prediction [8,9,10,11]. Over the past several decades, ANN modelling techniques have developed substantially. There were more than 5000 publications using ANNs for time series forecasting by 2007 [12]. Well-known ANN structures include the radial basis function (RBF) NN and the error back-propagation (BP) NN. This body of research has also revealed some problems. The main problem is the difficulty in determining the values of the node connection weights and other parameters in an ANN; if one cannot obtain suitable values, the modelling accuracy decreases. Another problem is that the model parameter search in an optimization process easily becomes trapped in local optima. To address these problems, Hinton and Osindero proposed the deep learning algorithm (DLA) [2], which first uses unsupervised learning to train the connection weights of a deep neural network. The DLA is a good method to extract intricate structures in high-dimensional data, and it has been successfully applied in many fields, such as image compression [13], forecasting exchange rates [14], the evaluation of vehicle interior sound quality [15], electricity load forecasting [3], breast cancer classification [16], magnetic resonance imaging [10], and time series modelling [3, 4, 12].
Currently, the DLA is one of the most effective algorithms for time series forecasting [2, 12] and is also commonly used for nonlinear behaviour modelling [17, 18]. The DLA may avoid falling into local optima and prevent over-fitting, two problems that are often encountered in ANN training. The long short-term memory (LSTM) model [19], a special type of recurrent neural network (RNN) architecture [19, 20], is also widely applied in time series forecasting; RNNs are themselves a type of deep learning model. It has been shown that LSTM outperforms traditional RNNs in many temporal processing tasks [19].

In this paper, we use the DLA to train a DBN, which is used for the nonlinear behaviour prediction of a time series. The DBN has been shown to be very effective in learning representative features from observation data without prior knowledge. The DBN is a stack of restricted Boltzmann machines (RBMs), where the RBM is a basic and powerful neural network whose connections form a bipartite graph [21]. As mentioned in the literature [2, 13], the RBM is widely used in classic machine learning tasks, such as image or voice recognition, and it has shown impressive performance [22]. The RBM is a network of neurons composed of two layers: a hidden layer and a visible layer. The visible layer corresponds to the components of an observation, and the hidden layer is used to extract features from the visible layer [23]. A defining feature of the RBM is that units within the same layer are not connected to each other, whereas the connection between the hidden layer and the visible layer is bidirectional and symmetric. Therefore, an RBM forms a Markov random field [21]. Hinton and Osindero proposed a fast, greedy learning method for the DBN, which learns one layer at a time [2]. After unsupervised training, a regression layer can be added at the top of the network, and labelled data are then used for supervised fine-tuning to adjust the features for better prediction. When the model has few neurons, the traditional ANN performs better than the DBN; however, the DBN performs much better than the traditional ANN when the number of nodes is very large [3].

Although the single model mentioned above may obtain good prediction results in many cases, a hybrid model, such as one combining a linear model and nonlinear model, may give better modelling results than a single model. This is because a hybrid model may absorb the qualities of the two models. Thus, using hybrid models has become a common practice to overcome the limitations of single models and improve prediction accuracy [24]. In the literature [1, 7, 8, 11, 24,25,26,27,28,29], some hybrid models have been proposed to combine the advantages of two or more individual models. For instance, a hybrid ARIMA-ANN model [25] was proposed, which combines a linear ARIMA model and a nonlinear ANN model for predicting sunspot time series, Lynx and exchange rate data, and it was shown that the prediction accuracy of the hybrid model was higher than that of a single model. Babu and Reddy also proposed a hybrid ARIMA-ANN model [1]; they used kurtosis to distinguish linear and nonlinear parts, and the proposed hybrid model achieved a higher prediction accuracy for sunspot data and electricity price data. Akouemo and Povinelli combined auto-regression with eXogenous (ARX) processes and the ANN to identify anomalous data points, and the mean absolute percentage errors decreased [8]. Zhu and Wei [28] proposed a hybrid model using the ARIMA model and least squares support vector machine for carbon price forecasting, and the forecasting accuracy of the model was also better than that of a single model. Nourani et al. [29] used hybrid wavelet-artificial intelligence models in hydrology. Shukur and Lee [30] proposed a hybrid Kalman filter and ANN model to improve the accuracy of daily wind speed forecasting. Barak and Sadegh [31] proposed an ARIMA-ANFIS (adaptive network fuzzy inference systems) hybrid algorithm to forecast energy consumption. Qiu et al. 
[32] reported that the empirical mode decomposition (EMD) and DBN-based hybrid model, called EMD-DBN, normally outperforms the corresponding single-structure models for time series forecasting; nine benchmark methods were compared to verify the effectiveness of the EMD-DBN method (i.e., the persistence method [32], ensemble DBN (EDBN) [33], support vector regression (SVR) [34], ANN [35], DBN [13], random forest (RF) [36], EMD-SVR [37], EMD-ANN [38] and EMD-RF [32]).

Based on the aforementioned studies [1, 9, 23, 25], in this paper, we present a hybrid modelling approach that combines a linear AR or ARIMA model with a DBN model for nonlinear time series prediction; both linear models are considered because the best LR model may be different for different types of data. We call the proposed models “AR-DBN” and “ARIMA-DBN” and conduct thorough experimental studies. Previous studies [3, 12, 14] showed that, compared with the Kalman smoothing model, ARIMA model, multi-layer perceptron (MLP), self-organizing fuzzy neural network (SOFNN), error BP, ARMA and feed forward neural network (FFNN) model, the DBN model is applicable to the prediction of time series and works better than these traditional methods or models. Therefore, the DBN is chosen in this hybrid modelling method for time series prediction in this paper. This new hybrid modelling method is used to overcome the limitation of a single model, as mentioned above, and to achieve more accurate prediction results. For the selection of the LR model in our hybrid model, according to the prediction results of the AR-DBN and ARIMA-DBN models, the model with the minimum MSE (mean square error) is used as the final prediction model. To use the proposed hybrid modelling method, we first use a linear AR or ARIMA model to fit a time series; the residuals of the AR or ARIMA model are then treated as the nonlinear component of the time series. Next, a DBN model is used to model the nonlinear part. We first use a fast, greedy learning method to train the DBN layer by layer [2]; then, according to the target values, the BP algorithm is used for fine-tuning all of the connection weights. By combining the advantages of the LR model and DBN, the hybrid model not only can model time series but can also extract different features of time series. In addition, the AR-DBN and ARIMA-DBN models are trained in a greedy manner, which permits the training of deep layer networks and alleviates trapping in local minima.
Therefore, benefiting from the DBN, the proposed hybrid model has desirable stability and learning ability. The proposed hybrid modelling approach is applied to the prediction of four time series, and the results show that its prediction accuracy is better than that of some of the compared models for the studied time series.

The rest of the paper is organized as follows. The proposed hybrid modelling method is presented in Section 2. The forecasting evaluation criteria used and the results and analysis of the experiments are described in Section 3. Finally, we conclude this paper in Section 4.

2 Hybrid LR-DBN model

A novel hybrid LR-DBN modelling approach to time series prediction is proposed in this section, which first uses linear AR and ARIMA models to fit a time series and then uses two DBN models to fit the two residual series of the linear AR and ARIMA models. The final selected model (i.e., AR-DBN or ARIMA-DBN) is determined by the statistical properties of the final modelling residuals from the two LR-DBN models.

2.1 Linear AR modelling

In the proposed hybrid LR-DBN modelling approach, we first use a linear AR model to represent the linear behaviour of a time series. A linear AR(p) model of order p is a linear function of the relation between the present value of a variable and its past p observations. Given a time series {y(t) ∈ R, t = 1,2,3, ⋯, N}, a linear AR(p) model is defined as follows:

$$ {\displaystyle \begin{array}{l}y(t)=f\left(y\left(t-1\right),y\left(t-2\right),\cdots, y\left(t-p\right)\right)+e(t)\\ {}\kern1.5em ={\alpha}_0+\sum \limits_{i=1}^p{\alpha}_iy\left(t-i\right)+e(t)\end{array}} $$
(1)

where N represents the amount of data, f(•) represents linear AR mapping, αi(i = 0, 1, ⋯, p) represent the linear regressive coefficients of model (1), e(t) represents the modelling error, and p represents the order of the model. Model (1) can be used for one-step- or multi-step-ahead prediction, and \( \widehat{y}(t)=f\left(\bullet \right) \) represents the one-step-ahead forecast result.

For model (1), we use the least squares method to estimate the AR coefficients αi(i = 0, 1, ⋯, p) by making the square of the error reach a minimum, and the obtained AR coefficients are given by

$$ \boldsymbol{\upalpha} ={\left({\mathbf{X}}^{\mathrm{T}}\mathbf{X}\right)}^{-1}{\mathbf{X}}^{\mathrm{T}}\mathbf{Y} $$
(2)

where

$$ \mathbf{X}=\left[\begin{array}{ccccc}1& y(p)& y\left(p-1\right)& \cdots & y(1)\\ {}1& y\left(p+1\right)& y(p)& \cdots & y(2)\\ {}\vdots & \vdots & \vdots & \vdots & \vdots \\ {}1& y\left(N-1\right)& y\left(N-2\right)& \cdots & y\left(N-p\right)\end{array}\right],\mathbf{Y}=\left[\begin{array}{c}y\left(p+1\right)\\ {}y\left(p+2\right)\\ {}\vdots \\ {}y(N)\end{array}\right],\boldsymbol{\upalpha} =\left[\begin{array}{c}{\alpha}_0\\ {}{\alpha}_1\\ {}\vdots \\ {}{\alpha}_p\end{array}\right], $$

and N represents the length of the time series.
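As an illustration of the estimate in (2), the following minimal sketch builds X and Y as defined above and solves the least squares problem with NumPy. The function names `lag_matrix`, `fit_ar_least_squares` and `predict_ar_one_step` are hypothetical, and `numpy.linalg.lstsq` is used here in place of the explicit inverse in (2), which is an implementation choice of this sketch rather than something stated in the paper.

```python
import numpy as np

def lag_matrix(y, p):
    """Build X (with a leading column of ones) and Y as defined after Eq. (2)."""
    y = np.asarray(y, dtype=float)
    N = len(y)
    # The row for target y(t) holds [1, y(t-1), ..., y(t-p)], t = p+1..N (1-based).
    cols = [np.ones(N - p)] + [y[p - i:N - i] for i in range(1, p + 1)]
    return np.column_stack(cols), y[p:]

def fit_ar_least_squares(y, p):
    """Return [alpha_0, alpha_1, ..., alpha_p] minimizing the squared error."""
    X, Y = lag_matrix(y, p)
    # lstsq minimizes ||X alpha - Y||^2, the same criterion as Eq. (2),
    # without forming the explicit inverse of X^T X.
    alpha, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return alpha

def predict_ar_one_step(y, alpha):
    """One-step-ahead forecasts y_hat(t) for t = p+1..N."""
    X, _ = lag_matrix(y, len(alpha) - 1)
    return X @ alpha
```

The residuals y(t) − ŷ(t) of this fit are the quantities that the hybrid method later passes to the DBN.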

2.2 Linear ARIMA modelling

The linear AR model is easily estimated and suitable for use in the proposed hybrid modelling method. However, if the time series to be modelled is nonstationary, the ARIMA model may be better for extracting the linear part from the data in the proposed LR-DBN modelling method. When using the ARIMA model, the given original data are first checked for stationarity. If they are not stationary, a differencing operation needs to be carried out. If the processed data are still nonstationary, the differencing operation is again carried out until the data are made stationary [1]. If differencing is carried out d times, the integration order of the ARIMA model is defined as d, and the resultant data are then fitted by an auto-regressive moving average (ARMA) model as follows:

$$ \tilde{y}(t)={\beta}_0+{\varphi}_1\tilde{y}\left(t-1\right)+{\varphi}_2\tilde{y}\left(t-2\right)+\cdots +{\varphi}_p\tilde{y}\left(t-p\right)+\xi (t)+{\beta}_1\xi \left(t-1\right)+{\beta}_2\xi \left(t-2\right)+\cdots +{\beta}_{\lambda}\xi \left(t-\lambda \right) $$
(3)

where \( \tilde{y}(t)={\left(1-{z}^{-1}\right)}^dy(t) \), z−1y(t) = y(t − 1), ξ(t) represents the modelling error, φi(i = 1, 2, ⋯, p) and βj(j = 0, 1, ⋯, λ) represent the model parameters, and p and λ indicate the orders of the model. For the estimation method of the ARIMA model (3), one can refer to references [1, 9, 11, 24, 25, 27]. Model (3) can be used to compute the one-step-ahead forecasting, \( \widehat{y}(t) \), as follows: first, from (3), we obtain the prediction \( \overline{y}(t)={\beta}_0+{\varphi}_1\tilde{y}\left(t-1\right)+{\varphi}_2\tilde{y}\left(t-2\right)+\cdots +{\varphi}_p\tilde{y}\left(t-p\right)+{\beta}_1\xi \left(t-1\right)+{\beta}_2\xi \left(t-2\right)+\cdots +{\beta}_{\lambda}\xi \left(t-\lambda \right) \) and then compute \( \widehat{y}(t)=\overline{y}(t)-{\left(1-{z}^{-1}\right)}^dy(t)+y(t) \).
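As a worked instance of the last step, consider d = 1. Expanding the differencing operator in the expression for \( \widehat{y}(t) \) shows that the unknown current value y(t) cancels, so the forecast depends only on past observations:

$$ \tilde{y}(t)=y(t)-y\left(t-1\right),\quad \widehat{y}(t)=\overline{y}(t)-\tilde{y}(t)+y(t)=\overline{y}(t)+y\left(t-1\right) $$

Similarly, for d = 2:

$$ \tilde{y}(t)=y(t)-2y\left(t-1\right)+y\left(t-2\right),\quad \widehat{y}(t)=\overline{y}(t)+2y\left(t-1\right)-y\left(t-2\right) $$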

2.3 Hybrid modelling

Many time series data contain both linear and nonlinear characteristics. This subsection presents a hybrid LR-DBN modelling approach that combines the linear AR model (1) or ARIMA model (3) with a DBN model for this type of time series modelling. Time series {y(t) ∈ R, t = 1,2,3, ⋯, N} is decomposed into a linear component and a nonlinear component as follows:

$$ y(t)={y}_L(t)+{y}_N(t) $$
(4)

where yL(t) represents the linear component and yN(t) represents the nonlinear component. We first use the AR model (1) or ARIMA model (3) to fit the time series, and the residual e(t) is given by

$$ e(t)=y(t)-{\widehat{y}}_L(t) $$
(5)

where \( {\widehat{y}}_L(t) \) represents the predicted value using the AR model (1) or ARIMA model (3) at time t and e(t) only contains the nonlinear element. Next, we design a DBN model to fit the nonlinear component e(t) as follows:

$$ e(t)=g\left(e\left(t-1\right),e\left(t-2\right),\cdots, e\left(t-q\right)\right)+\varepsilon (t) $$
(6)

where g(•) represents a nonlinear function approximated by the designed DBN, q represents the order of the model, ε(t) represents the final modelling error, and the prediction of e(t) is denoted as \( \widehat{e}(t)=g\left(e\left(t-1\right),e\left(t-2\right),\cdots, e\left(t-q\right)\right) \).

In this paper, we use the DLA to train the DBN model to calculate the predictive value \( \widehat{e}(t) \). The DBN model is composed of several RBMs, and the structure of the designed hybrid model is shown in Fig. 1, where Nr represents the total number of hidden layers, h(k) (k = 0, 1, ⋯, Nr) represent the output values in the k-th hidden layer, and v(k) (k = 0, 1, ⋯, Nr) represent the input values in the k-th visible layer. The LR modelling residuals are used as input data for the first RBM of the DBN. For a unit in the hidden layer or visible layer of the DBN, the activation probabilities are given by [4]:

$$ p\left({h}_j=1|\mathbf{v}\right)=\varphi \left({b}_j+\sum \limits_{i=1}^m{v}_i{w}_{ij}\right) $$
(7)
$$ p\left({v}_i=1|\mathbf{h}\right)=\varphi \left({a}_i+\sum \limits_{j=1}^n{h}_j{w}_{ij}\right) $$
(8)

where \( \varphi (x)=1/\left(1+{e}^{-x}\right) \) is the sigmoid function, whose values lie in (0, 1) and give the activation probabilities of the hidden or visible units; wij represents the bi-directional weight between visible unit i and hidden unit j; m represents the number of neurons in the visible layer; v = (v1, v2, ⋯, vm)T is the input vector; n represents the number of neurons in the hidden layer; h = (h1, h2, ⋯, hn)T is the output vector; and ai and bj represent the biases of the visible units and hidden units, respectively.
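The conditional probabilities (7) and (8) can be sketched in a few lines, assuming the standard logistic sigmoid φ(x) = 1/(1 + e^(−x)) and a weight matrix W of shape (m, n) whose entry (i, j) is wij. The function names are hypothetical and serve only to illustrate the two formulas.

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-x))

def p_hidden_given_visible(v, W, b):
    """Eq. (7): vector of p(h_j = 1 | v), j = 1..n; W has shape (m, n)."""
    return sigmoid(b + v @ W)

def p_visible_given_hidden(h, W, a):
    """Eq. (8): vector of p(v_i = 1 | h), i = 1..m."""
    return sigmoid(a + W @ h)
```

Because the same matrix W is used in both directions, the symmetry of the visible-hidden connections is explicit in the code.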

Fig. 1 Structure of the LR-DBN model

The RBM in Fig. 1 is an energy-based model, whose energy function is defined as follows [13, 39].

$$ E\left(\mathbf{v},\mathbf{h}\right)=\sum \limits_{i=1}^m\frac{{\left({v}_i-{a}_i\right)}^2}{2{\sigma}_i^2}-\sum \limits_{i=1}^m\sum \limits_{j=1}^n\frac{v_i}{\sigma_i}{h}_j{w}_{ij}-\sum \limits_{j=1}^n{b}_j{h}_j $$
(9)

where σi represents the standard deviation of the input variable vi. The marginal probability distribution over the visible vector is defined by [13, 39]:

$$ p\left(\mathbf{v}\right)=\sum \limits_{\mathbf{h}}\frac{\exp \left(-E\left(\mathbf{v},\mathbf{h}\right)\right)}{\int_{\mathbf{v}}{\sum}_{\mathbf{h}}\exp \left(-E\left(\mathbf{v},\mathbf{h}\right)\right)d\mathbf{v}} $$
(10)

According to the energy function shown in (9), we can define the probabilities for the visible and hidden units as follows:

$$ P\left({v}_i|\mathbf{h}\right)=\frac{1}{\sigma_i\sqrt{2\pi }}\exp \left(-\frac{{\left({v}_i-{a}_i-{\sigma}_i\sum \limits_{j=1}^n{h}_j{w}_{ij}\right)}^2}{2{\sigma}_i^2}\right) $$
(11)
$$ P\left({h}_j|\mathbf{v}\right)=\varphi \left({b}_j+\sum \limits_{i=1}^m\frac{v_i}{\sigma_i}{w}_{ij}\right) $$
(12)

This model is suitable for continuous data, but the regular binomial-Bernoulli RBM can also be used if the data are normalized to [0, 1]. This method is used in our paper.

The RBMs are used as building blocks in the DBN. To minimize the deviation between the actual value and prediction value, we use the contrastive divergence (CD) algorithm, which is a good stochastic approximation approach, and its performance is better than that of some other algorithms [19]. According to the gradient of the log-likelihood log p(v), which is obtained from (9) and (10), we use one-step CD update rules to update the weight wij and the biases ai and bj as follows:

$$ {\displaystyle \begin{array}{l}\varDelta {w}_{ij}={\varepsilon}_1\frac{\partial \log p\left(\mathbf{v}\right)}{\partial {w}_{ij}}={\varepsilon}_1\left(\frac{p\left({h}_j=1|\mathbf{v}\right){v}_i}{\sigma_i^2}-\frac{\sum \limits_{i=1}^mp\left(\mathbf{v}\right)p\left({h}_j=1|\mathbf{v}\right){v}_i}{\sigma_i^2}\right)\\ {}\kern1.75em ={\varepsilon}_1\left({\left\langle \frac{v_i{h}_j}{\sigma_i^2}\right\rangle}_{data}-{\left\langle \frac{v_i{h}_j}{\sigma_i^2}\right\rangle}_{recon}\right)\end{array}} $$
(13)
$$ {\displaystyle \begin{array}{l}\varDelta {a}_i={\varepsilon}_1\frac{\partial \log p\left(\mathbf{v}\right)}{\partial {a}_i}={\varepsilon}_1\left(\frac{v_i}{\sigma_i^2}-\frac{\sum \limits_{i=1}^mp\left(\mathbf{v}\right){v}_i}{\sigma_i^2}\right)\\ {}\kern1.25em ={\varepsilon}_1\left({\left\langle \frac{v_i}{\sigma_i^2}\right\rangle}_{data}-{\left\langle \frac{v_i}{\sigma_i^2}\right\rangle}_{recon}\right)\end{array}} $$
(14)
$$ {\displaystyle \begin{array}{l}\varDelta {b}_j={\varepsilon}_1\frac{\partial \log p\left(\mathbf{v}\right)}{\partial {b}_j}={\varepsilon}_1\left(p\left({h}_j=1|\mathbf{v}\right)-\sum \limits_{i=1}^mp\left(\mathbf{v}\right)p\left({h}_j=1|\mathbf{v}\right)\right)\\ {}\kern1.5em ={\varepsilon}_1\left({\left\langle {h}_j\right\rangle}_{d\mathrm{ata}}-{\left\langle {h}_j\right\rangle}_{recon}\right)\end{array}} $$
(15)

where ε1 represents the learning rate, 〈•〉data denotes an expectation with respect to the data distribution and 〈•〉recon denotes the reconstructed state.
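To make the update rules (13)-(15) concrete, the following is a minimal sketch of a single one-step CD update for one training vector. It assumes the Bernoulli case with data normalized to [0, 1] (i.e., σi = 1), uses hidden probabilities rather than samples in the update statistics, and takes the reconstruction as the visible probabilities; these are common practical choices and are assumptions of this sketch, not details given in the paper. The name `cd1_update` is hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, a, b, lr, rng):
    """One CD-1 step for a single visible vector v0; W has shape (m, n)."""
    # Positive phase: hidden probabilities and a binary sample given the data.
    ph0 = sigmoid(b + v0 @ W)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # Negative phase: reconstruct the visible layer, then recompute the
    # hidden probabilities from the reconstruction.
    v1 = sigmoid(a + W @ h0)
    ph1 = sigmoid(b + v1 @ W)
    # <v_i h_j>_data - <v_i h_j>_recon, and the analogous bias terms.
    W_new = W + lr * (np.outer(v0, ph0) - np.outer(v1, ph1))
    a_new = a + lr * (v0 - v1)
    b_new = b + lr * (ph0 - ph1)
    return W_new, a_new, b_new
```

A typical call would pass `rng = np.random.default_rng(seed)` and loop over the training vectors for several epochs; the learning rate `lr` plays the role of ε1.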

The output value of the first RBM is used as the input data of the second RBM, and the next RBM is trained in the same manner. The output value of the DBN model is the prediction \( \widehat{e}(t) \). Next, using the difference between the actual output value e(t) and predicted output value \( \widehat{e}(t) \), the BP algorithm is executed to fine-tune the parameters of each RBM again.

The pseudo-codes for training the first RBM and fine-tuning the DBN are presented in Algorithms 1 and 2, respectively. Let h(s) (s = 1, ⋯, Nr) represent the output of the s-th hidden layer, where Nr represents the total number of hidden layers. In Algorithm 1, a(s) represents the bias of the s-th visible layer, b(s) represents the bias of the s-th hidden layer, and w(s) represents the weight of the pairwise interaction between the s-th layer and the (s − 1)-th layer. \( \widehat{e}(t) \) represents the output value of the DBN model, which is the predictive output of the residual e(t) in (5).

Algorithm 1 Training the first RBM
Algorithm 2 Fine-tuning the DBN

Finally, the output value of the DBN model is \( \widehat{e}(t) \), which is the predicted value of the residual e(t), where \( \widehat{e}(t) \) can be calculated as follows.

$$ \Big\{{\displaystyle \begin{array}{l}\widehat{e}(t)=\varphi \left({\mathbf{w}}_1^{\left({N}_r\right)}{\mathbf{h}}^{\left({N}_r-1\right)}(t)+{b}_1^{\left({N}_r\right)}\right)\\ {}{\mathbf{h}}^{\left(\ell \right)}(t)={\left({h}_1^{\left(\ell \right)}(t),{h}_2^{\left(\ell \right)}(t),\cdots, {h}_{Q_{\ell}}^{\left(\ell \right)}(t)\right)}^{\mathrm{T}},\kern0.5em \ell \in \left\{1,2,\cdots, {N}_r-1\right\}\\ {}{h}_{n_{\ell}}^{\left(\ell \right)}(t)=\varphi \left({\mathbf{w}}_{n_{\ell}}^{\left(\ell \right)}{\mathbf{h}}^{\left(\ell -1\right)}(t)+{b}_{n_{\ell}}^{\left(\ell \right)}\right),\kern0.5em {n}_{\ell}\in \left\{1,2,\cdots, {Q}_{\ell}\right\}\\ {}{\mathbf{w}}_{n_{\ell}}^{\left(\ell \right)}=\left({w}_{n_{\ell },1}^{\left(\ell \right)},{w}_{n_{\ell },2}^{\left(\ell \right)},\cdots, {w}_{n_{\ell },{Q}_{\ell -1}}^{\left(\ell \right)}\right),\kern0.5em {Q}_0=q\\ {}{\mathbf{h}}^{(0)}(t)={\left(e\left(t-1\right),e\left(t-2\right),\cdots, e\left(t-q\right)\right)}^{\mathrm{T}}\end{array}} $$
(16)

where \( {\mathbf{w}}_{n_{\ell}}^{\left(\ell \right)} \) denotes the row vector of weights between layer \( \ell -1 \) and node \( {n}_{\ell } \) of layer \( \ell \), \( \left({b}_1^{\left(\ell \right)},{b}_2^{\left(\ell \right)},\cdots, {b}_{Q_{\ell}}^{\left(\ell \right)}\right) \) represents the biases of layer \( \ell \), \( {Q}_{\ell } \) represents the number of nodes in layer \( \ell \), Nr represents the total number of layers, and \( {\mathbf{h}}^{\left(\ell \right)}(t) \) denotes the output values of layer \( \ell \).
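The layer-by-layer computation in (16) amounts to a short loop. The sketch below assumes that the weights of layer ℓ are stored as a matrix of shape (Qℓ, Qℓ−1), whose rows are the vectors in (16), and that the last layer has a single node; the function name `dbn_forward` is hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dbn_forward(e_lags, weights, biases):
    """Eq. (16): map h^(0) = (e(t-1), ..., e(t-q)) to the prediction e_hat(t).

    weights[k] has shape (Q_{k+1}, Q_k) and biases[k] has shape (Q_{k+1},).
    """
    h = np.asarray(e_lags, dtype=float)      # h^(0)(t)
    for W, b in zip(weights, biases):
        h = sigmoid(W @ h + b)               # h^(l)(t) from h^(l-1)(t)
    return float(h[0])                       # the single output node
```

Because the output passes through the sigmoid, the prediction lies in (0, 1), which is compatible with training the network on data normalized to [0, 1].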

The final forecast of the time series using the AR-DBN model or ARIMA-DBN model is

$$ \widehat{y}(t)={\widehat{y}}_L(t)+\widehat{e}(t) $$
(17)

The pseudo-code for the hybrid LR-DBN model is presented in Algorithm 3, and the modelling procedure of the proposed method is as follows:

  1. Stage 1:

    Linear modelling. The AR model (1) and ARIMA model (3) are used to estimate linear information from the observations. Then, the residuals e(t) are obtained from this stage. The residuals are used as input data for the next stage.

  2. Stage 2:

    Nonlinear modelling. The DBN models are trained using the residuals of the AR model and ARIMA model. The coefficients of the two DBN models are adjusted.

  3. Stage 3:

    Combining. The predictive results of the first stage and second stage are combined, which results in the final predicted values of the hybrid LR-DBN models. According to the prediction results of the AR-DBN model and ARIMA-DBN model, the LR-DBN model with the minimum MSE is used as the final prediction model.

Algorithm 3 Hybrid LR-DBN model
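The three stages above can be sketched end to end for the AR branch as follows. To keep the example short and self-contained, the DBN is replaced by a clearly labelled stand-in (fixed random sigmoid features with a least-squares output layer); this stand-in is not the DBN of this paper and only serves to exercise the residual decomposition in (5), (6) and (17). All function names are hypothetical.

```python
import numpy as np

def lag_matrix(x, k):
    """Rows [x(t-1), ..., x(t-k)] for t = k+1..len(x), and the targets x(t)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    X = np.column_stack([x[k - i:n - i] for i in range(1, k + 1)])
    return X, x[k:]

def ar_one_step(y, p):
    """Stage 1 (AR branch): least-squares AR(p) forecasts for t = p+1..N."""
    X, Y = lag_matrix(y, p)
    X1 = np.column_stack([np.ones(len(Y)), X])
    alpha, *_ = np.linalg.lstsq(X1, Y, rcond=None)
    return X1 @ alpha

def residual_stand_in(e, q, n_hidden=8, seed=0):
    """Stage 2 placeholder (NOT a DBN): random sigmoid features + least squares."""
    E, target = lag_matrix(e, q)
    rng = np.random.default_rng(seed)
    R = rng.normal(size=(q, n_hidden))
    H = np.column_stack([np.ones(len(target)), 1.0 / (1.0 + np.exp(-(E @ R)))])
    beta, *_ = np.linalg.lstsq(H, target, rcond=None)
    return H @ beta                          # e_hat(t) for the fitted samples

def hybrid_forecast(y, linear_one_step, p, q):
    """Stages 1-3: y_hat(t) = y_hat_L(t) + e_hat(t) for t = p+q+1..N."""
    y = np.asarray(y, dtype=float)
    y_hat_L = linear_one_step(y, p)          # linear forecasts, t = p+1..N
    e = y[p:] - y_hat_L                      # residuals, Eq. (5)
    e_hat = residual_stand_in(e, q)          # residual forecasts, t = p+q+1..N
    return y_hat_L[q:] + e_hat               # combination, Eq. (17)
```

In the full method, the same routine would be run with an ARIMA-based linear stage as a second candidate, and the candidate with the smaller MSE would be kept; the forecasts here are in-sample and are aligned with y(t) for t = p + q + 1, ⋯, N.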

3 Application of the hybrid model to time series

In this section, experimental studies are presented to demonstrate the effectiveness and superiority of the proposed hybrid LR-DBN model. Four time series (i.e., the Mackey-Glass, sunspot, Individual Household Electric Power Consumption (IHEPC) and electricity load demand data sets from the Australian Energy Market Operator (AEMO)) are used to evaluate the proposed hybrid LR-DBN model. The modelling results of the proposed method are compared with the modelling results reported in some studies. To verify the model prediction accuracy, three criteria (i.e., the mean square error (MSE), normalized mean square error (NMSE) and root mean square error (RMSE)) are introduced to measure the performance of the proposed hybrid LR-DBN model as follows:

$$ \mathrm{MSE}=\frac{1}{N-p-q}\sum \limits_{t=p+q+1}^N{\left(y(t)-\widehat{y}(t)\right)}^2 $$
(18)
$$ \mathrm{RMSE}=\sqrt{\frac{1}{N-p-q}\sum \limits_{t=p+q+1}^N{\left(y(t)-\widehat{y}(t)\right)}^2} $$
(19)
$$ \mathrm{NMSE}=\left(\frac{\sum \limits_{t=p+q+1}^N{\left(y(t)-\widehat{y}(t)\right)}^2}{\sum \limits_{t=p+q+1}^N{\left(y(t)-\overline{y}\right)}^2}\right) $$
(20)

where \( \overline{y} \) represents the mean value of the observation data.
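For reference, the three criteria (18)-(20) can be computed as in the sketch below, where the two input lists are assumed to be already aligned on the evaluation samples t = p + q + 1, ⋯, N. Unless a mean is supplied, the sketch takes the mean over the evaluated observations, which is one possible reading of "the mean value of the observation data". The function name `forecast_metrics` is hypothetical.

```python
import math

def forecast_metrics(y_true, y_pred, y_mean=None):
    """Return MSE, RMSE and NMSE for aligned observations and forecasts."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    n = len(y_true)
    sq_err = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    if y_mean is None:
        y_mean = sum(y_true) / n             # assumption: mean of evaluated data
    spread = sum((yt - y_mean) ** 2 for yt in y_true)
    mse = sq_err / n
    return {"MSE": mse, "RMSE": math.sqrt(mse), "NMSE": sq_err / spread}
```

An NMSE below 1 indicates that the model predicts better than the constant forecast given by the mean.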

When using the DBN model, we normalize all of the training and testing data by the following computation:

$$ {y}^{\ast }(t)=\frac{y(t)-\min \left(y(t)\right)}{\max \left(y(t)\right)-\min \left(y(t)\right)} $$
(21)

where \( {y}^{\ast }(t) \) represents the normalized value of y(t).
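A minimal sketch of the normalization in (21), together with its inverse for mapping predictions back to the original scale, is given below. The paper does not state which portion of the data supplies the minimum and maximum, so the sketch simply takes them from whatever reference sequence is passed in; the function names are hypothetical.

```python
def fit_min_max(reference):
    """Return (min, max) of the reference data used for scaling."""
    lo, hi = min(reference), max(reference)
    if hi == lo:
        raise ValueError("constant series cannot be min-max normalized")
    return lo, hi

def normalize(values, lo, hi):
    """Eq. (21): map values to [0, 1] relative to (lo, hi)."""
    return [(v - lo) / (hi - lo) for v in values]

def denormalize(values, lo, hi):
    """Inverse of Eq. (21): map normalized values back to the original scale."""
    return [v * (hi - lo) + lo for v in values]
```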

3.1 Mackey-Glass time series modelling

Here, we use the famous Mackey-Glass Eq. (22) to obtain the time series and set a = 0.2, b = 0.1, and c = 10. Different values of τ produce various degrees of chaos.

$$ \frac{\mathrm{d}y}{dt}=\frac{ay\left(t-\tau \right)}{1+{y}^c\left(t-\tau \right)}- by(t) $$
(22)

For fair comparison, we select τ = 20, as used in Gan et al. [26]. This chaotic time series model was also studied by Jang and Gulley [40] and Shi and Tamura [41]. We use 1000 data points, in which the first 500 observations are used to train the model, and the last 500 observations are used to test the modelling performance. The Mackey-Glass time series is shown in Fig. 2.
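One simple way to generate such a series from (22) is forward Euler integration, as sketched below. The integration scheme, the step size `dt`, the constant initial history `y0` and the number of discarded transient samples are all assumptions of this sketch; the paper does not specify how the series was generated. The function name `mackey_glass` is hypothetical.

```python
def mackey_glass(n_points, tau=20, a=0.2, b=0.1, c=10, dt=0.1, y0=1.2, discard=1000):
    """Sample Eq. (22) at unit time intervals using forward Euler steps."""
    steps_per_unit = round(1 / dt)
    delay = tau * steps_per_unit
    # Constant initial history on [-tau, 0]; hist[-1] is the current value.
    hist = [y0] * (delay + 1)
    for _ in range((n_points + discard) * steps_per_unit):
        y_now = hist[-1]
        y_tau = hist[-delay - 1]             # value tau time units ago
        hist.append(y_now + dt * (a * y_tau / (1 + y_tau ** c) - b * y_now))
    samples = hist[delay::steps_per_unit]    # values at t = 0, 1, 2, ...
    return samples[discard:discard + n_points]
```

Calling `mackey_glass(1000)` gives 1000 samples that can be split into the first 500 for training and the last 500 for testing, as described above.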

Fig. 2 Time series generated from the Mackey-Glass equation

In the first modelling stage, the order p of the LR model is five, which is the same as that set in the literature [26, 40,41,42]. The AR model and ARIMA model are used to fit the original data. For the LR modelling residuals, the structure of the DBN is Nr = 3, Q0 = 5, Q1 = 4, Q2 = 9, and Q3 = 1. The nonlinear DBN model in LR-DBN is fine-tuned by the BP algorithm, and the number of fine-tuning iterations is 1000 in all cases. According to the prediction results of the estimated AR-DBN and ARIMA-DBN models, the model with the minimum MSE is selected as the final prediction model. As seen in Table 1, the ARIMA-DBN model is selected as the prediction model, and the orders of the ARIMA model are p = 5, d = 1, and λ = 4. The modelling results of the proposed hybrid LR-DBN model are shown in Table 1. For comparison, the modelling results of the LLRBF-AR [26], RBF-AR [42], FIS [40], RBF [42], linear AR, ARIMA-ANN, LSTM [19], DBN [13] and EMD-DBN [32] models are also listed in Table 1, from which it can be seen that the ARIMA-DBN model yields the smallest MSE and Akaike information criterion (AIC) value, thus providing the best modelling result among the compared models.

Table 1 Comparison results for the Mackey-Glass time series

For further comparison, we depict the predictive errors and their histograms by the ARIMA model, the LLRBF-AR model and ARIMA-DBN model for the testing data of the Mackey-Glass series in Fig. 3, from which it is seen that the prediction accuracy of the ARIMA-DBN model is better than that of the other two models.

Fig. 3 Predictive errors and histograms by the three models for the testing data

3.2 Sunspot data modelling

A sunspot series is the most basic parameter used to describe the level of solar activity. Studying sunspot data models plays an important role in protecting the environment [43]. The smoothed, monthly sunspot time series from November 1834 to June 2001 (2000 points) in Fig. 4 is obtained from the SIDC (World Data Center for the Sunspot Index) [44]. The data are the same as those used in the literature [45,46,47,48,49,50,51,52,53], where the original data are scaled to [0, 1]. The data contain both linear and nonlinear parts and are widely used in hybrid modelling [1, 25]. To evaluate the advantages of the proposed model, a fair comparison is required. Therefore, the sunspot data are divided into two parts, as in the literature: the first 1000 observations are used to train the model, and the last 1000 observations are used to test the modelling performance.

Fig. 4 Sunspot time series

In this case, the LR model order p is selected as 5, as in [45,46,47,48,49,50,51,52,53]. After obtaining the residuals of the LR model (i.e., the nonlinear part), we use the DBN to fit the residuals. The structure parameters of the DBN model used here are Nr = 2, Q0 = 5, Q1 = 6, and Q2 = 1, and the number of fine-tuning iterations is 400. The model performance measures of the proposed AR-DBN model and other models are given in Table 2. It can be seen from Table 2 that the AR-DBN model gives the best modelling result compared to other models for the testing data.

Table 2 Modelling performance comparison for the test data in a sunspot series

To examine the modelling quality in more detail, we analyse the predictive errors of the proposed AR-DBN model for the testing data, which are plotted in Fig. 5. It can be seen from Fig. 5 that the residuals are small, and their histogram has a reasonably symmetric shape around zero and a Gaussian appearance. The original values of the sunspot time series and their predicted values are compared in Fig. 6, which shows that the AR-DBN model achieves a good prediction accuracy near the peaks and valleys. These results show the good statistical properties of the estimated hybrid model and indicate that it can represent the dynamic behaviour of the original data well.

Fig. 5 Prediction error analysis of the AR-DBN model

Fig. 6 Comparison between the original sunspot time series and the final predicted values

3.3 University of California Irvine repository data modelling

In general, deep learning algorithms are more advantageous for large-scale data. To use a larger data set to further validate the effectiveness of the proposed hybrid modelling method, in this subsection, experiments are carried out on UCI (University of California, Irvine) benchmark data sets [54]. The modelling data used, with 34,000 data points, are the Global Active Power data, which are extracted from the Individual Household Electric Power Consumption (IHEPC) data set, and we use the first 30,000 data points as training data and the following 4000 data points as testing data. We denote the data as IHEPC-GAP.

The order p of the linear regression model in the AR-DBN or ARIMA-DBN model is determined by the AIC values. For the IHEPC-GAP data, we finally select p = 20. The structure parameters of the DBN model used here are Nr = 3, Q0 = 20, Q1 = 35, Q2 = 65, and Q3 = 1, and the number of fine-tuning iterations is 5000. The modelling results of the different models are given in Table 3, which shows that the AR-DBN model, with the smallest MSE and AIC, gives the best modelling performance for the IHEPC-GAP data compared with the other models.

Table 3 Modelling results comparison for IHEPC-GAP data

Table 3 also gives the training time and the number of adjustable parameters for each model. It shows that the training times of the DBN-type models (excluding the EMD-DBN) are similar to one another and longer than those of the other models. This is because of the complexity of training the DBN model and hybrid model. In general, the training time and the number of adjustable parameters of the LR-DBN model are longer and greater, respectively, than those of other single models when the DBN module has many hidden layers. However, training of the LR-DBN model is performed offline; therefore, in most cases, the practical use of the LR-DBN model is not affected by the long offline training. All results were computed on a PC with an Intel i7-3770 CPU (3.4 GHz) and 8 GB of RAM.

3.4 Prediction of the electric load time series from AEMO

In this subsection, using the electricity load time series from the Australian Energy Market Operator (AEMO) [55], the performance of the proposed hybrid modelling method is evaluated by comparing it with eleven other benchmark modelling methods (i.e., the persistence method [32], artificial neural network (ANN) [35], DBN [13], support vector regression (SVR) [34], ensemble DBN (EDBN) [33], empirical mode decomposition (EMD)-based SVR model (EMD-SVR) [37], EMD-based ANN (EMD-ANN) [38], EMD-based random forest (RF [36]) (EMD-RF) [32], EMD-based DBN (EMD-DBN) [32] and LSTM [19]).

For fair comparison, data sets from the year 2013 for Tasmania (TAS) are chosen to train and test the proposed hybrid model and other compared models. For TAS, January, April, July and October data are used to reflect the different seasons. In the experiment, the first 3 weeks of data are used to train the model, and data from the remaining week are used to test the model [32]. The electricity load demand data from AEMO are sampled every half hour, which means that there are 48 data points for 1 day [32, 56]. Therefore, there are 1008 data points for training and 336 data points for testing [32]. In this paper, for one-day-ahead load demand forecasting, the input data are composed of the data points from y(t − 48) to y(t − 96), and y(t) represents the output of the hybrid modelling method, which is the same as that in [32].
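The one-day-ahead input construction described above can be sketched as follows. The sketch reads "from y(t − 48) to y(t − 96)" as an inclusive range of lags, which yields 49 inputs per target; this reading, and the function name `day_ahead_samples`, are assumptions of the sketch.

```python
def day_ahead_samples(y, min_lag=48, max_lag=96):
    """Pair each target y(t) with the inputs y(t - min_lag), ..., y(t - max_lag)."""
    inputs, targets = [], []
    for t in range(max_lag, len(y)):         # first index with a full lag window
        inputs.append([y[t - lag] for lag in range(min_lag, max_lag + 1)])
        targets.append(y[t])
    return inputs, targets
```

With half-hourly sampling, 3 weeks of training data correspond to 3 × 7 × 48 = 1008 points and 1 week of testing data to 7 × 48 = 336 points, in agreement with the counts given above.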

According to the prediction results of the AR-DBN model and the ARIMA-DBN model, the AR-DBN model, which has the minimum MSE, is selected as the final prediction model. The prediction results of the one-day-ahead load forecasting for the testing data, using the estimated AR-DBN model and the eleven other benchmark methods, are given in Table 4. Table 4 shows that the proposed hybrid modelling method gives better prediction results than the other methods in most cases. This supports the advantage of the proposed hybrid model for time series prediction.

Table 4 The results of the one-day ahead load forecasting for testing data of the AEMO time series

4 Conclusion

In this paper, a novel hybrid model composed of the LR model and DBN model was proposed to overcome the deficiencies of a single LR model or a single DBN model. An LR model or a nonlinear DBN model alone may have difficulty characterizing a time series accurately, whereas the proposed hybrid model, which combines the merits of the LR model and DBN model, could be better than a single model. A linear AR model or ARIMA model is first used to capture the linear part of the time series, so that the residuals of the AR or ARIMA model contain only the nonlinear behaviour of the time series. Next, the DBN models the nonlinear part of the time series. Case studies for four well-known time series indicate that the proposed hybrid model has better modelling accuracy than some single models and hybrid models. The main reason is that the DBN has a strong ability to extract features among layers and has self-organization characteristics. Across the four experiments, we observed that the more training data there are, the higher the prediction accuracy, which is consistent with the characteristics of deep learning.