1 Introduction

Time series forecasting is an active research area that has drawn significant attention in a variety of domains such as economics, industrial engineering and hydrology. In time series forecasting, past observations of the same variable are collected and analyzed to develop a model describing the underlying relationships. Over the past several decades, many efforts have been devoted to the development and improvement of time series forecasting models. These models have evolved from traditional linear prediction models and gradually transitioned to nonlinear models.

Traditional statistical models have received much attention and wide application because of their relative simplicity in understanding and implementation [15]. Linear prediction methods mainly include a series of classical statistical models based on the autoregressive moving average (ARMA) model. Nonlinear prediction methods include more complex models represented by various new types of machine learning methods. The limitation of linear models is that the underlying process is assumed to be linear, and consequently, the models may fail to capture nonlinear features commonly encountered in practice [6]. In recent years, machine learning methods have gradually become the main approaches to nonlinear and non-stationary complex time series prediction problems due to their strong nonlinear approximation ability. ANNs are one type of extensively used machine learning method. Approaches based on ANNs for time series forecasting have produced convincing results [7–15], and the body of literature is still growing.

Wavelet transform (WT) is a frequently used data preprocessing method. It can analyze a signal in both the time and frequency domains, so it overcomes the shortcomings of the conventional Fourier transform. WT has been successfully embedded in various time series prediction models and has achieved satisfying results [16–19]. It provides an effective decomposition of time series, so that the resulting sub-series can improve the performance of forecasting models by capturing the available information at different resolution levels.

Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components (PCs). The number of PCs is less than or equal to the number of original variables. Ouyang [20] used PCA to evaluate water quality monitoring stations and showed that the number of monitoring stations could be reduced from 22 to 19. Wang [21] presented an improved method that integrates PCA into a stochastic time strength neural network for forecasting financial time series; PCA extracted two PCs from six variables as the input data.

In much of the existing literature, sub-models are established separately according to the decomposition levels, and the final forecasts are obtained by combining the results of these sub-models. Such a combination can improve forecasting accuracy compared with using the original series without decomposition. However, both the computational load and the training time of the whole model are magnified. Another problem is that the more sub-models are established, the more parameters need to be determined, which may degrade the generalization ability of the model. The main objective of this paper is to provide an improved time series forecasting method that addresses these problems. A single simplified hybrid neural network model is proposed in this study. This model applies PCA to extract useful information from the wavelet components to construct the model inputs, and the model is trained by the cascade-correlation algorithm. The combination of WT and PCA provides a method for solving the feature extraction problem. The effectiveness of the proposed hybrid approach is demonstrated by results obtained on both an artificial data set and a real-world data set. The benchmark Mackey-Glass time series is forecast as the artificial data set, and the mean daily flow of the Oldman River near Brocket is used as the real-world data set.

The rest of this paper is organized as follows. In Sect. 2, the methodologies used in this paper are introduced. Section 3 illustrates the proposed hybrid model and gives the modeling procedures. The experimental results are presented in Sect. 4 and discussed in Sect. 5. Conclusions based on the study are highlighted in Sect. 6.

2 Methodology

In this section, the methodologies used for constructing the forecasting models are introduced. The WT and PCA methods are applied for decomposition and feature extraction. An ANN model with the cascade-correlation (CC) architecture as its learning algorithm is employed as the prediction model. The principles of these methods are presented in detail in the following subsections.

2.1 Wavelet transform (WT)

WT is an essential time–frequency analysis tool for signal processing. It has been widely used because of its better performance compared with the Fourier transform. The basic aims of wavelet analysis are to determine the frequency (or scale) content of a signal and to assess and determine the temporal variation of this frequency content [22]. It has the advantage of flexibility in choosing the mother wavelet according to the characteristics of the time series. Elaborate mathematical definitions of wavelets are given in the literature [23].

WT can be divided into two categories: continuous wavelet transform (CWT) and discrete wavelet transform (DWT). The CWT for an original signal f(t) with respect to a mother wavelet function ψ(t) can be defined as [22]:

$${\text{CWT}}_{f} (a,b) = \left\langle {f(t),\psi_{a,b} (t)} \right\rangle = \frac{1}{{\sqrt {\left| a \right|} }}\int\limits_{ - \infty }^{ + \infty } {f(t)\psi^{*} \left( {\frac{t - b}{a}} \right){\text{d}}t}$$
(1)

where ψ*(t) denotes the complex conjugate of ψ(t), and a and b denote the scale parameter and translation index, respectively. In CWT, wavelet coefficients are produced by continuously dilating and translating the mother wavelet, so that the coefficients are calculated for all possible scales and times. The DWT, by contrast, produces only the minimal number of coefficients necessary to reconstruct the original signal f(t). This reduction is achieved by discretizing the parameters a and b, so that

$$\left\{ \begin{aligned} a & = 2^{ - s} \\ b & = k2^{ - s} \\ \end{aligned} \right.$$
(2)

where s and k belong to the integer set Z and control the wavelet dilation and translation, respectively. Thus, a discrete version of CWT is obtained as

$${\text{DWT}}_{f} (a,b) = \left\langle {f(t),\psi_{a,b} (t)} \right\rangle = \frac{1}{{\sqrt {\left| {2^{ - s} } \right|} }}\int\limits_{ - \infty }^{ + \infty } {f(t)\psi^{*} \left( {\frac{{t - k/2^{s} }}{{1/2^{s} }}} \right){\text{d}}t}$$
(3)

By using this discretization, the time–scale space can be sampled at discrete levels.

In this study, DWT is employed to process an original time series into a group of sub-series. An algorithm developed by Mallat [24] is employed for the multi-resolution computation, which represents an efficient way to implement the DWT using filters. The original series is passed through two kinds of filters. The low-pass filters, which are associated with the scaling function, allow the analysis of low-frequency components. The high-pass filters, which are associated with the wavelet function, allow the analysis of high-frequency components. An n-level wavelet decomposition illustrating the decomposition process is presented in Fig. 1. Signal A represents the low-frequency approximation component, while signal D contains the detail information of the high-frequency component. The decomposition process is continued until the required number of n levels is reached.

Fig. 1
figure 1

Computational process of DWT
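To make the decomposition step concrete, the following is a minimal sketch using the PyWavelets library; the db4 wavelet and the three decomposition levels mirror the choices made later in this paper, while the function name and the toy signal are purely illustrative.

```python
# Sketch of a multi-level DWT of a 1-D series, assuming the PyWavelets library.
import numpy as np
import pywt

def wavelet_components(signal, wavelet="db4", level=3):
    """Return the approximation A<level> and details D1..D<level>,
    each reconstructed back to the length of the original signal."""
    coeffs = pywt.wavedec(signal, wavelet, level=level)   # [cA3, cD3, cD2, cD1]
    names = ["A%d" % level] + ["D%d" % k for k in range(level, 0, -1)]
    components = {}
    for i, name in enumerate(names):
        kept = [np.zeros_like(c) for c in coeffs]
        kept[i] = coeffs[i]                                # keep one band at a time
        components[name] = pywt.waverec(kept, wavelet)[: len(signal)]
    return components

# Toy usage: the reconstructed bands sum back to the original series
x = np.sin(np.linspace(0, 20 * np.pi, 1024)) + 0.1 * np.random.randn(1024)
bands = wavelet_components(x)
print(sorted(bands))                                       # ['A3', 'D1', 'D2', 'D3']
print(np.allclose(sum(bands.values()), x))                 # True: perfect reconstruction
```

Reconstructing each band separately (rather than using the raw, shorter coefficient arrays) keeps every sub-series aligned in time with the original signal, which is convenient when the bands are later lagged to form model inputs.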

Many types of wavelets, including Daubechies, Symmlet, Gaussian, Mexican hat, Morlet and Shannon wavelets, can be used for wavelet-based time series analysis [31]. Daubechies wavelets are among the most extensively used. They represent a collection of orthogonal mother wavelets with compact support, characterized by a maximal number of vanishing moments for a given support length. They are often written as dbN, where N refers to the number of vanishing moments. The Daubechies wavelet transform is defined by computing running averages and differences via scalar products with scaling signals and wavelets [26]. The scaling signals and wavelets produce averages and differences using just a few more values from the signal, which provides a considerable improvement in the capabilities of the Daubechies transforms. Daubechies wavelets are considered in this paper.

2.2 Principal component analysis (PCA)

Feature extraction, which removes redundant or irrelevant input features, not only reduces the computing time for learning but also improves forecasting accuracy. In order to ensure that the features fed to the WCCNN model are useful and small in number, PCA is introduced to extract the feature inputs and reduce the data size. The results of a PCA are usually discussed in terms of component scores (the transformed variable values corresponding to a particular case in the data) and loadings (the weights by which each standardized original variable should be multiplied to obtain the component score). Assume a data matrix with p variables, x1, x2, …, xp, observed m times; the specific steps of PCA are as follows:

$$X = \left[ {\begin{array}{*{20}l} {x_{11} } \hfill & {x_{12} } \hfill & \ldots \hfill & {x_{1p} } \hfill \\ {x_{21} } \hfill & {x_{22} } \hfill & \ldots \hfill & {x_{2p} } \hfill \\ \vdots \hfill & \vdots \hfill & \vdots \hfill & \vdots \hfill \\ {x_{m1} } \hfill & {x_{m2} } \hfill & \ldots \hfill & {x_{mp} } \hfill \\ \end{array} } \right]$$
(4)

The data are normalized as follows:

$$Y_{ji} = (x_{ji} - \bar{x}_{i} )/S_{i}$$
(5)

where

$$\bar{x}_{i} = \frac{1}{m}\sum\limits_{j = 1}^{m} {x_{ji} } ,\quad S_{i} = \sqrt {\frac{1}{m - 1}\sum\limits_{j = 1}^{m} {(x_{ji} - \bar{x}_{i} )^{2} } }$$

Let λ1 ≥ λ2 ≥ ··· ≥ λp ≥ 0 be the eigenvalues of the covariance matrix of the normalized data and L1, L2, …, Lp the corresponding eigenvectors; the ith PC is then

$${\text{PC}}_{i} = L_{i}^{T} X$$
(6)

where i = 1, 2, …, p. The contribution rate of the kth PC and the cumulative contribution rate of the first k PCs are, respectively,

$$\lambda_{k} /\sum\limits_{i = 1}^{p} {\lambda_{i} } ,\quad \sum\limits_{i = 1}^{k} {\lambda_{i} } /\sum\limits_{i = 1}^{p} {\lambda_{i} }$$

In PCA, the cumulative contribution rate represents the ability to synthesize information. If the cumulative contribution rate exceeds a chosen threshold, the first k PCs can be considered to contain most of the information of the p original variables. This threshold needs to be determined according to the specific problem. The PCs reduce the number of input variables and thereby simplify the problem. By selecting the most significant PCs, it is possible to identify certain relationships among the parameters of the data set under consideration.
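As an illustration of the selection rule just described, the following is a minimal sketch, assuming the variables are stacked as the columns of a data matrix; the function name is hypothetical, and the default 85 % threshold is the value adopted later in this paper.

```python
# Sketch of the PCA step: normalize, eigen-decompose, and keep the first k PCs
# whose cumulative contribution rate exceeds a threshold.
import numpy as np

def pca_feature_extraction(X, threshold=0.85):
    # Normalize each variable (Eq. 5): zero mean, unit standard deviation
    Y = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # Eigen-decomposition of the covariance matrix of the normalized data
    eigvals, eigvecs = np.linalg.eigh(np.cov(Y, rowvar=False))
    order = np.argsort(eigvals)[::-1]            # sort eigenvalues in descending order
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Contribution and cumulative contribution rates
    contrib = eigvals / eigvals.sum()
    cum_contrib = np.cumsum(contrib)
    # Number of PCs needed to exceed the threshold
    k = int(np.searchsorted(cum_contrib, threshold) + 1)
    scores = Y @ eigvecs[:, :k]                  # PC scores used as model inputs
    return scores, contrib, cum_contrib, k
```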

2.3 Artificial neural networks (ANNs)

ANNs are data-driven and nonparametric models. The supervised learning algorithm for ANNs used in this paper is the CC architecture, which was proposed by Fahlman and Lebiere (1990) [27]. This architecture automatically constructs an appropriately small network and only needs to train the weights of newly added units. The processes of the CC architecture are illustrated in Fig. 2. Black circles are frozen connection weights, and white circles are weights trained during the output-training phase. The vertical lines sum all incoming activation. The architecture begins with a minimal network that has some input units and one or more output units. A network with no hidden units is a single-layer network and is equivalent to a linear model (Fig. 2a). Hidden units are added to the network one by one as needed during learning, yielding a multilayer structure (Fig. 2b, c). Each hidden unit is placed into a new hidden layer. These units receive a connection from each of the network's original inputs and also from every preexisting hidden unit. This makes it possible to create high-order nonlinear feature detectors, customized for the problem at hand. There is also a bias input, permanently set to +1. Once a new unit has been added to the network, its input weights are frozen; only the output weights are trained repeatedly.

Fig. 2
figure 2

Cascade-correlation (CC) architecture

CCNN was first developed in an attempt to overcome certain problems and limitations of the popular back-propagation learning algorithm. The benchmark problem chosen for CCNN was the “two-spirals” problem, proposed by Alexis Wieland of MITRE Corp., because it is extremely hard for back-propagation algorithms; with CC, the network determines its own size and topology. A feed-forward BP network model is established in this study to compare its performance with that of the CC network.
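The following is a simplified, illustrative sketch of the two alternating phases of cascade-correlation training for a regression task. It is not Fahlman and Lebiere's full algorithm (there is no candidate pool and no Quickprop); it only shows correlation-driven training of a candidate unit whose input weights are then frozen, followed by least-squares retraining of the output weights. All function names are hypothetical.

```python
import numpy as np

def train_output(H, y):
    """Least-squares fit of the linear output unit on the current features H (plus bias)."""
    A = np.hstack([H, np.ones((H.shape[0], 1))])
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    return w, A @ w

def train_candidate(H, residual, lr=0.05, epochs=500):
    """Gradient ascent on the (absolute) covariance between a tanh unit and the residual error."""
    rng = np.random.default_rng(0)
    A = np.hstack([H, np.ones((H.shape[0], 1))])
    v = rng.normal(scale=0.1, size=A.shape[1])
    r = residual - residual.mean()
    for _ in range(epochs):
        z = np.tanh(A @ v)
        s = np.sign(np.sum((z - z.mean()) * r))          # sign of the covariance
        grad = A.T @ ((1.0 - z ** 2) * r) * s            # approximate gradient of |cov|
        v += lr * grad / len(r)
    return v

def cascade_correlation(X, y, n_hidden=5):
    """Grow a cascade of hidden units; only output weights are retrained after each addition."""
    H = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    w, pred = train_output(H, y)
    for _ in range(n_hidden):
        v = train_candidate(H, y - pred)                 # input weights of the new unit, then frozen
        unit = np.tanh(np.hstack([H, np.ones((len(y), 1))]) @ v)
        H = np.hstack([H, unit[:, None]])                # the unit feeds from inputs and all earlier units
        w, pred = train_output(H, y)
    return H, w
```

Because each new unit receives connections from all previous units and its input weights are frozen once trained, only the output weights ever need retraining, which is what makes the procedure fast compared with full back-propagation.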

In order to demonstrate the advantages of the proposed model, the flowcharts of the WCCNN and WCCNN multi-models, which are conjunction models of wavelet decomposition and CCNN, are illustrated in Fig. 3.

Fig. 3
figure 3

Flowcharts of WCCNN multi-models and WCCNN

In the WCCNN model, the original time series is decomposed by DWT after identifying the model order. The approximation and detail components obtained are then used as the CCNN model inputs. In the WCCNN multi-models, the approximation and detail components are used to set up separate sub-models, and the forecast values of these sub-models are summed to obtain the final results.

3 Proposed PCA-WCCNN model

A hybrid PCA-WCCNN model is proposed in this paper. The purpose of decomposition and feature extraction is, on the one hand, to improve the predictive accuracy and, on the other hand, to obtain a single simplified model that reduces the training time.

Before constructing the input–output pairs, the potential model orders need to be determined. The Box–Jenkins methodology is adopted to achieve this objective [28]. The Box–Jenkins methodology is a five-step process for identifying, selecting and assessing conditional mean models for discrete, univariate time series data. The sample autocorrelation function (ACF) and partial autocorrelation function (PACF) are used in this paper. If a time series is not stationary, it must be differenced successively to attain stationarity. The ACF and PACF of a stationary series cut off completely after a few lags. In time series analysis, the PACF gives the partial correlation of a time series with its own lagged values, controlling for the values of the time series at all shorter lags. The use of these functions is part of the Box–Jenkins approach to time series modeling. By plotting the partial autocorrelation function, one can determine the appropriate lags, which give the potential model orders.
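A minimal sketch of this order-identification step is given below, assuming the statsmodels library; the series is differenced once before the PACF is examined (in the paper this is done only when the series is nonstationary), and the lag at which the sample PACF cuts off, i.e., falls inside the approximate confidence bound, is taken as the model order. The function name and defaults are illustrative.

```python
import numpy as np
from statsmodels.tsa.stattools import pacf

def identify_order(series, max_lag=20, n_std=1.96):
    d = np.diff(series)                      # first difference to remove a linear trend
    p = pacf(d, nlags=max_lag)               # p[0] is lag 0, p[1:] are lags 1..max_lag
    bound = n_std / np.sqrt(len(d))          # approximate 95 % confidence bound
    order = 0
    for lag in range(1, max_lag + 1):
        if abs(p[lag]) > bound:              # partial autocorrelation still significant
            order = lag
        else:
            break                            # the PACF has cut off
    return max(order, 1)
```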

Multilevel wavelet decomposition is used to decompose the original time series into an approximation series and a set of detail series using the DWT technique. In the feature selection process, the input variables should be weakly correlated: strong correlation between input variables implies that they carry redundant information, which may increase the computational complexity and reduce the prediction accuracy of the model. The essence of PCA is a rotation of the coordinate space that does not change the data structure. The obtained PCs are linear combinations of the variables, reflecting the original information to the greatest degree, and they are uncorrelated with each other. The framework of the forecasting steps of the PCA-WCCNN model is illustrated in Fig. 4.

Fig. 4
figure 4

Framework of forecasting steps of PCA-WCCNN model
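The following sketch ties the steps of Fig. 4 together, reusing the hypothetical helper functions sketched earlier (identify_order, wavelet_components and pca_feature_extraction); a unit lag spacing is assumed here, whereas the Mackey-Glass example in Sect. 4.2 uses a spacing of Δ = 6.

```python
import numpy as np

def pca_wccnn_inputs(series, horizon=1, wavelet="db4", level=3, threshold=0.85):
    series = np.asarray(series, dtype=float)
    order = identify_order(series)                       # Box-Jenkins model order
    bands = wavelet_components(series, wavelet, level)   # A<level>, D1..D<level>
    rows = len(series) - (order - 1) - horizon
    # Lagged wavelet components: for every band, one column per lag 0..order-1
    X = np.column_stack([bands[name][order - 1 - lag : order - 1 - lag + rows]
                         for name in sorted(bands) for lag in range(order)])
    y = series[order - 1 + horizon : order - 1 + horizon + rows]   # target values
    scores, _, _, k = pca_feature_extraction(X, threshold)         # first k PCs as inputs
    return scores, y, k
```

The returned PC scores and targets would then be split into training and testing portions and fed to the cascade-correlation network.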

4 Experiments

To evaluate the performance of the proposed forecasting model, the Mackey-Glass time series [29] and the mean daily flow of the Oldman River near Brocket from the Time Series Data Library (TSDL) [30] are used in this study. Time series forecasting based on the chaotic Mackey-Glass differential equation is a standard benchmark in the neural network literature for comparing the learning and generalization abilities of different algorithms. River flow modeling and prediction is one of the earliest forecasting problems to have attracted the interest of a large number of scientists and is one of the most frequently analyzed problems in hydrology.

4.1 Performance evaluation

To evaluate the accuracy of the proposed model, two different criteria, the root-mean-squared error (RMSE) and the mean absolute error (MAE), are used in the experiments. These performance indexes can be written as

$${\text{RMSE}} = \sqrt {\frac{1}{N}\sum\limits_{i = 1}^{N} {\left[ {\hat{y}_{i} - y_{i} } \right]^{2} } }$$
(7)
$${\text{MAE}} = \frac{1}{N}\sum\limits_{i = 1}^{N} {\left| {\hat{y}_{i} - y_{i} } \right|}$$
(8)

where N is the number of observed values, ŷi is the predicted value and yi is the actual value. The elapsed time is also calculated in the experiments in order to compare the training speed of the models.
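The two criteria translate directly into code; the following minimal sketch is a one-to-one transcription of Eqs. (7) and (8).

```python
import numpy as np

def rmse(y_hat, y):
    """Root-mean-squared error, Eq. (7)."""
    return np.sqrt(np.mean((np.asarray(y_hat) - np.asarray(y)) ** 2))

def mae(y_hat, y):
    """Mean absolute error, Eq. (8)."""
    return np.mean(np.abs(np.asarray(y_hat) - np.asarray(y)))
```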

4.2 Application to the Mackey-Glass time series forecasting

4.2.1 Analysis

The Mackey-Glass equation was first introduced as a model of white blood cell production. One interesting feature of the Mackey-Glass problem is that real-valued outputs are required instead of the discrete output values found in most neural network benchmarks. The time series to be predicted with the network algorithms is generated from the following equation

$$\frac{{{\text{d}}x(t)}}{{{\text{d}}t}} = \beta x(t) + \frac{{\alpha x(t - \tau )}}{{1 + x(t - \tau )^{10} }}$$
(9)

with parameters α = 0.2, β = −0.1 and τ = 17 [31]. Three thousand five hundred data points are generated with the initial conditions x(0) = 1.2 and x(t) = 0 for t < 0, using the second-order Runge–Kutta method to discretize the differential equation. Figure 5 shows the portion of the series used for this study. The standard approach to this prediction task is to create a mapping f from D points of the time series spaced Δ apart, i.e., [x(t − (D − 1)Δ), …, x(t − Δ), x(t)], to the future point x(t + P). The embedding dimension is D = 2, as the model order is two in this case. Previous studies have used the value Δ = 6 and prediction intervals of P = 6 and P = 85 [32, 33]. The characteristic time constant of x(t) is t_char = 50, which makes it particularly difficult to forecast x(t + P) when P > t_char [31]. By choosing P = Δ, it is possible to predict the value of the time series at any multiple of Δ time steps in the future by feeding the output back into the input and iterating the solution. Based on the above analysis, the model inputs consist of two past values of x(t), i.e., (x(t − 6), x(t)), and the model predicts the value of x(t + 6) in this study.

Fig. 5
figure 5

Mackey-Glass time series
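As an illustration of how such a series can be generated, the following is a sketch of integrating Eq. (9) with a second-order Runge–Kutta (Heun) step and a discretized delay term; the step size dt = 0.1 and the unit-time sampling interval are assumptions, as the paper does not state them.

```python
import numpy as np

def mackey_glass(n_points, alpha=0.2, beta=-0.1, tau=17, x0=1.2, dt=0.1, sample=10):
    """Integrate Eq. (9) with a Heun (second-order Runge-Kutta) step and return
    the series sampled at unit time intervals (dt * sample = 1)."""
    def f(x, x_tau):
        return beta * x + alpha * x_tau / (1.0 + x_tau ** 10)
    delay = int(round(tau / dt))
    steps = n_points * sample
    hist = np.zeros(steps + delay + 1)
    hist[delay] = x0                                     # x(0) = 1.2, x(t) = 0 for t < 0
    for i in range(delay, delay + steps):
        x, x_tau = hist[i], hist[i - delay]
        k1 = f(x, x_tau)
        k2 = f(x + dt * k1, hist[i - delay + 1])         # delayed value one step ahead
        hist[i + 1] = x + dt * (k1 + k2) / 2.0
    return hist[delay::sample][:n_points]

series = mackey_glass(3500)                              # 3500 points, as in this study
```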

The decomposition level and mother wavelet must be selected in advance, as they determine the performance in the wavelet domain. Several researchers have used an empirical equation to determine the decomposition level [34, 35]:

$$L = \text{int} (\log (N))$$
(10)

where L is the decomposition level, N is the number of samples and int is the integer-part function. A trial-and-error procedure is used to determine the decomposition level in this study, and three decomposition levels are obtained, which coincides with the result of the empirical equation. The original time series is decomposed using Daubechies wavelets, and the sub-series components are D1, D2, D3 and A3. There are many Daubechies wavelets, but they are all very similar. The definition and advantages of the db4 wavelet can be found in the literature [25]. Based on the analysis of this time series, the db4 wavelet is adopted. The original and decomposed time series are shown in Fig. 6. The values of detail D1 are very small; in order to see them more clearly, only a part of that series is plotted. Since the other four series show significant regularity, more data than for D1 are used in the figure.

Fig. 6
figure 6

Original and decomposed MG time series using db4 wavelet

For comparison with the CC algorithm, the BP algorithm is also used to train ANNs. Studies by Chester [36] and Zhang [37] suggest that the ideal number of hidden layers in a BP architecture is often one or two, and it is accepted that a three-layer feed-forward network can approximate any continuous function reasonably well [38]. The BPNN used in this study has one input layer, one output layer and one hidden layer between them. Although there are some empirical suggestions for determining the number of hidden neurons, a trial-and-error procedure is used in the current research, which is more reliable despite being time-consuming.
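For the comparison model, a minimal sketch of a one-hidden-layer BPNN is shown below, using scikit-learn's MLPRegressor as a stand-in trained by back-propagation; the parameter values shown are placeholders, since the actual values are determined by trial and error and listed in Table 2.

```python
from sklearn.neural_network import MLPRegressor

def fit_bpnn(X_train, y_train, n_hidden=10, lr=0.01, epochs=1000):
    """One-hidden-layer feed-forward network trained by back-propagation (SGD)."""
    model = MLPRegressor(hidden_layer_sizes=(n_hidden,), activation="logistic",
                         solver="sgd", learning_rate_init=lr, max_iter=epochs)
    return model.fit(X_train, y_train)
```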

4.2.2 Forecasting results

The original time series is decomposed into sub-series. All the wavelet components are as follows:

$$[x_{D1} (t - 6),x_{D1} (t),x_{D2} (t - 6),x_{D2} (t),x_{D3} (t - 6),x_{D3} (t),x_{A3} (t - 6),x_{A3} (t)]$$

PCA is applied to extract the PCs from the above wavelet components. The contribution rates and cumulative contribution rates of the principal components are presented in Table 1. In this paper, if the cumulative contribution rate exceeds 85 %, the first k PCs are considered to contain most of the information of the p original variables.

Table 1 PCA results of MG time series components

This table indicates that the cumulative contribution rate of the first five PCs exceeds 85 %, that is, the first five PCs contain more than 85 % of the information in the wavelet components. These five PCs are used as the inputs of the PCA-WCCNN model instead of all the wavelet components. The training set contains 3000 samples, while 500 data points are used to test the model.

In the BPNN, the original time series is used without decomposition. A number of experiments are carried out to set the parameters, viz. the initial input–hidden–output nodes, learning rate, number of epochs, activation function and learning error, in order to obtain the optimal results. The values that exhibit the best behavior in terms of accuracy have been chosen. The optimal values of all these parameters for both the BP and CC algorithms are listed in Table 2. The blank entries in the table indicate that different models using the CC algorithm have different initial node values, which are determined in each specific model.

Table 2 Parameters and their values during learning processes of the BPNN and CCNN

The testing results of the PCA-WCCNN, WCCNN, WCCNN multi-models, CCNN and BPNN models are illustrated in Fig. 7. The elapsed time of each model is computed under the same conditions. The performance of the five models and the elapsed times are shown in Table 3.

Fig. 7
figure 7

Forecasting results of five models

Table 3 The performances of the five models forecasting the MG time series

4.3 Application to the mean daily flow forecasting

4.3.1 Analysis

The second data set used in this paper is the mean daily flow of the Oldman River near Brocket from the Time Series Data Library (TSDL), which was created by Rob Hyndman, professor of statistics at Monash University, Australia. This data set contains 1460 values in a single time series, from January 01, 1988, to December 31, 1991. The original time series is shown in Fig. 8.

Fig. 8
figure 8

Original time series of the mean daily flow (m³/s)

The mean daily flow data set is used for the PCA-WCCNN, WCCNN, WCCNN multi-models, CCNN and BPNN models. The training samples account for 80 % of the data, and the remaining 20 % are used as testing samples. The original and decomposed mean daily flow time series using the db4 wavelet are presented in Fig. 9. Before training, the sample autocorrelation function (ACF) and partial autocorrelation function (PACF) are used to determine the model orders. Figure 10 plots the sample ACF and PACF for the series; the significant, linearly decaying ACF indicates a nonstationary process (Fig. 10a). In order to remove the linear trend, a first difference of the data is taken, and the sample ACF and PACF of the differenced series are plotted in Fig. 10b. The blue lines are the upper and lower confidence bounds, and the red dots are the ACF and PACF values.

Fig. 9
figure 9

Original and decomposed time series using db4 wavelet

Fig. 10
figure 10

Order identification for mean daily flow time series

The differenced series appears more stationary. The sample ACF of the differenced series decays more quickly, and the sample PACF cuts off after lag 3. Therefore, the model order is identified as 3. All the wavelet components are then expressed as follows:

$$\begin{aligned} & [y_{D1} (t - 2),y_{D1} (t - 1),y_{D1} (t),y_{D2} (t - 2),y_{D2} (t - 1),y_{D2} (t),y_{D3} (t - 2), \\ & \quad y_{D3} (t - 1),y_{D3} (t),y_{A3} (t - 2),y_{A3} (t - 1),y_{A3} (t)] \\ \end{aligned}$$

New feature inputs for the WCCNN model must be constructed from these wavelet components. The same method applied to the MG time series is also used in this case.

4.3.2 Forecasting results

Table 4 gives the computed contribution rates and cumulative contribution rates of the principal components. It can be seen that the cumulative contribution rate of the first seven components exceeds 85 %, i.e., the first seven components contain more than 85 % of the information in the wavelet components. These seven PCs are used as the input data of the PCA-WCCNN model instead of the wavelet components.

Table 4 The PCA results of mean daily flow time series components

The parameter values of the BP and CC algorithms that exhibit the best behavior in terms of accuracy are listed in Table 5. The blank entries in the table indicate that different models using the CC algorithm have different initial node values, which are determined in each specific model.

Table 5 Parameters and their values during learning processes of the BPNN and CCNN

Once the inputs are determined, the input–output pairs are extracted in order to train and test the models. The testing results of the five different models are illustrated in Fig. 11. The mean daily flow forecasting performances and the elapsed times, computed under the same conditions, are shown in Table 6.

Fig. 11
figure 11

Forecasting results of five models

Table 6 The performances of the five models forecasting mean daily flow

5 Discussion

For each example, five models are constructed: CCNN, BPNN, WCCNN, WCCNN multi-models and PCA-WCCNN. The first two are single models, which are built using the original time series. The other three are wavelet decomposition-based models. Comparing the performance of the five models in Tables 1, 2, 3, 4, 5 and 6, the conclusions can be summarized as follows:

  1.

    CCNN and BPNN: The elapsed times show that CCNN learns more quickly than BPNN in these examples. The RMSE and MAE of the two models indicate that CCNN also has better performance. This is because the CC algorithm requires no back-propagation of error signals through the connections of the network. Moreover, it does not need the hidden layer structure to be decided beforehand, as the network determines its own size and topology.

  2.

    WCCNN and CCNN: The wavelet decomposition-based CCNN model outperforms CCNN in terms of RMSE and MAE. This reveals the effectiveness of the wavelet decomposition. By selecting suitable filters, WT can greatly reduce the correlation between different characteristics of the original time series. As for training speed, the WCCNN model is slightly slower than CCNN, which can be attributed to the time taken by the wavelet decomposition.

  3.

    WCCNN and WCCNN multi-models: The WCCNN multi-models construct a separate model for each wavelet sub-series. Although their RMSE and MAE are slightly better than those of WCCNN, the training time is much longer than that of the WCCNN model, which greatly reduces the model efficiency.

  4.

    PCA-WCCNN: The proposed PCA-WCCNN performs better than the other models, as shown by the RMSE and MAE values in the tables. Appropriate inputs are selected through PCA, which extracts significant information from the wavelet components and ensures a better choice of inputs than the alternatives. PCA-WCCNN is concise compared with the WCCNN multi-models and more accurate than WCCNN. It also helps to shorten the training time, although its elapsed time is slightly longer than that of the WCCNN and CCNN models, which can be attributed to the time taken by the wavelet decomposition and PCA processes.

6 Conclusions

From the above studies, it can be seen that model performance depends strongly on the feature inputs and the learning algorithm. On the one hand, the combination of wavelet decomposition and feature selection improves the performance of neural networks significantly. On the other hand, CCNN requires a shorter elapsed time than BPNN, so it can be concluded that the CC algorithm learns faster than the traditional BP algorithm and can speed up learning. Moreover, the feature selection also simplifies the model structure. Considering both forecasting accuracy and training efficiency, the proposed neural network model is well suited to time series forecasting. Further studies can investigate the models using different kinds of mother wavelets.