1 Introduction

Inspired by the neurons of the human brain, which are interconnected by synapses to support decision making and coordination, researchers have developed artificial neural network (ANN) models in machine learning to solve real-world tasks. At present, many deep-learning methods [1,2,3,4] have been successfully applied to time-series tasks (TSKs). Convolutional neural networks and their variants, such as the temporal convolutional neural network (TCNN) [5], the time-series encoder (TSE) [6], the multi-channel deep convolutional neural network (MCDCNN) [7] and time-CNN [8], model the time domain with one-dimensional convolution kernels. As a natural extension of convolutional networks, recurrent neural network (RNN) models such as long short-term memory (LSTM) [9] link and memorize past and current information through recurrent connections. These classic ANN models [5,6,7,8,9] stack multiple layers to accommodate large-scale inputs, but backpropagation through these layers is time-consuming to converge [10, 11]. Echo state networks (ESNs) [12, 13] instead model time-series datasets with a gradient-free method: the echo states are produced by a feedforward pass through the reservoirs, so gradients do not need to be propagated through time. As a result, ESNs [12, 13] avoid the local minima and vanishing gradients that plague backpropagation-based ANNs [14].

Beyond these advantages, the reservoir architecture of an ESN is kept fixed and is governed only by hyper-parameters such as the number of echo units, the leaky rate and the spectral radius. The learning performance of an ESN therefore depends critically on the configuration of these hyper-parameters. However, shallow ESNs [12, 13] cannot effectively represent TSKs in many practical applications, because a single shallow reservoir struggles to capture the complicated features of such tasks. According to previous work [15,16,17,18,19], stacking multiple reservoirs gives ESNs a more powerful hierarchical temporal feature representation than shallow ESNs. Owing to this improved fitting capacity, stacked ESNs have been widely applied to complex tasks such as solar irradiance prediction [15,16,17,18].

When faced with multivariate time-series (MTS) datasets, a single ESN may perform accurately on one channel and poorly on another. Ensemble learning [20] offers a trade-off between the accuracy and the diversity of the base learners, so combining multiple ESNs into an ensemble ESN can be expected to model multivariate series better than a single ESN. Ensemble selection mechanisms for general ensemble models [21, 22] pick diverse base learners from a pool of trained learners to enhance learning performance. For example, individual echo state networks [21, 22] have been combined into ensemble echo state networks to model 3D motion and steady-state visual evoked potential (SSVEP) datasets. In these works, the hyper-parameters of the ensemble ESN are tuned manually or by grid search based on trial and error. Previous works [21, 22] show that such a solution suffers from a search space whose complexity grows exponentially with the number of tuned hyper-parameters.

To bring a deep architecture and an ensemble selection mechanism to echo state networks, this paper proposes the ensemble Bayesian deep echo network (EBDEN) for modeling time-series datasets. To the best of our knowledge, it is the first attempt to use Bayesian optimization to tune the hyper-parameters of a deep echo state network. These merits stem from the following two contributions of the proposed method:

  • To extend the shallow layer to a deep architecture, multiple-scale reservoirs with bidirectional connections across the time steps are fused to build the deep architecture of EBDEN. To ensure the performance of these multiple-scale reservoirs, their hyper-parameters are explored with Bayesian optimization (BO), which improves the learning performance of EBDEN.

  • To solve ensemble selection for echo state networks and avoid redundant modeling of the multiple time domain channels of multivariate time series (MTS), EBDEN determines the optimal ensemble weights and thereby mitigates overfitting.

The remainder of this study is organized as follows: Sect. 2 describes the methodology of EBDEN. Section 3 discusses the benchmarks and characteristics of EBDEN. Section 4 reports the experimental results of leveraging EBDEN to solve various TSKs, such as multivariate time-series classification, chaotic time-series representation and Dansgaard–Oeschger estimation. Section 5 presents the conclusion and future work.

2 Architecture of EBDEN

Throughout this paper, boldface letters denote matrices and italic letters denote vectors. As shown in Fig. 1, EBDEN contains three modules: the input, reservoir and readout layers. A general multivariate time-series (MTS) sequence \({\mathbf{x}}\) incorporates M samples (\({\mathbf{x}} = (x_{1} ,x_{2} , \ldots ,x_{M} )\)) and K channels. According to Fig. 1, the input sequence \({\mathbf{x}}\) is progressively fed into the multiple-scale bidirectional reservoirs to extract the echo state sequences. Compared with a shallow reservoir, the deep architecture provides richer and more differentiated echo state sequences for discriminating past and current information. Apart from global hyper-parameters of EBDEN such as the number of scales L, each reservoir is governed by local hyper-parameters: the number of units in the reservoir N, the leaky rate \(\lambda\), the spectral radius \(\rho\), the scaling coefficient \(\omega\) and the sparsity connection degree \(\eta\). If these hyper-parameters are kept fixed without any optimization, a multiple-scale echo state network fails to achieve competitive fitting ability [15,16,17,18,19]. Therefore, the hyper-parameters of EBDEN are optimized by the BO method.

Fig. 1 The illustration of the basic architecture of EBDEN. EBDEN is based on the echo state network, and the time-series inputs are fed into the multiple scales of reservoirs to compute features. Each reservoir is governed by several hyper-parameters, such as the number of units in each reservoir N, the leaky rate \(\lambda\), the scaling coefficient \(\omega\), the spectral radius \(\rho\), the connection degree \(\eta\) and the scale number L

2.1 Deep multiple-scale reservoir of EBDEN

EBDEN involves several important parameters: the internal weights \({\mathbf{W}}_{{{\mathbf{in}}}}\), the reservoir weights \({\mathbf{W}}_{{\mathbf{r}}}\), the spectral radius \(\rho\), the scaling coefficient \(\omega\) and the sparsity connection degree \(\eta\). The internal weights \({\mathbf{W}}_{{{\mathbf{in}}}}\) are initialized from a uniform distribution on \([-1, 1]\) and then scaled by the coefficient \(\omega\), so their entries range over \([-\omega, \omega]\). The reservoir weights \({\mathbf{W}}_{{\mathbf{r}}}\) are rescaled to the spectral radius \(\rho\) as follows:

$${\mathbf{W}}_{\text{r}} = \rho \frac{{\mathbf{W}}}{{{\text{Eigen}}_{\hbox{max} } ({\mathbf{W}})}}$$
(1)

where \({\mathbf{W}}\) follows the uniform distribution in \([-0.5, 0.5]\) in this paper. According to the echo memory mechanism [19], both \(\rho\) and the sparsity connection degree \(\eta\) of the reservoirs lie in \([0, 1]\). The time-series inputs \({\mathbf{x}}\) are fed into the N units of each of the L reservoir scales through the internal synapse \({\mathbf{W}}_{{{\mathbf{in}}}}^{{\mathbf{0}}}\). All reservoirs share a reservoir weight matrix \({\mathbf{W}}_{{\mathbf{r}}}\) of the same size, \(N \times N\). In this paper, the echo state \({\mathbf{S}}(t)\) is extracted by the reservoir. When L equals 1, the echo state \({\mathbf{S}}(t)\), whose size is \(M \times N\) at each time step, is driven by the external input \({\mathbf{x}}\) and can be written as follows:

$${\mathbf{S}}(t) = (1 - \lambda ){\mathbf{S}}(t - 1) + \lambda {\mathbf{H}}(t)$$
(2)
$${\mathbf{H}}(t) = \tanh ({\mathbf{W}}_{{\mathbf{r}}} {\mathbf{S}}(t - 1) + {\mathbf{W}}_{{{\mathbf{in}}}} {\mathbf{x}}(t))$$
(3)

where λ is the leaky rate of the echo units, which controls the speed of the state dynamics; a larger λ yields faster dynamics for EBDEN. \({\mathbf{H}}(t)\) represents the intermediate variable at time step t, which combines the feedforward input \({\mathbf{x}}(t)\) and the echo state of the previous time step \({\mathbf{S}}(t - 1)\). It is bounded by the tanh function, which constrains the influence of the eigenvalues of the incoming weight matrix and ensures the echo state property [23] of EBDEN. Considering the deep architecture of EBDEN, when L is greater than 1, the \(L{\text{th}}\)-scale variables \({\mathbf{S}}^{L} (t)\) and \({\mathbf{H}}^{L} (t)\) can be rewritten as follows:

$${\mathbf{S}}^{L} (t) = (1 - \lambda ){\mathbf{S}}^{L} (t - 1) + \lambda {\mathbf{H}}^{L} (t)$$
(4)
$${\mathbf{H}}^{L} (t) = \tanh ({\mathbf{W}}_{{{\mathbf{in}}}}^{L - 1} {\mathbf{S}}^{L - 1} (t) + {\mathbf{W}}_{{\mathbf{r}}}^{L} {\mathbf{S}}^{L} (t - 1))$$
(5)

The \({\mathbf{W}}_{{{\mathbf{in}}}}^{L - 1}\) in (5) denotes the internal weight matrix between the \((L - 1){\text{th}}\) and \(L{\text{th}}\) scale reservoirs. In the conventional echo state network mechanism, the echo states \({\mathbf{S}}\) over the whole timescale (\({\mathbf{S}} = [{\mathbf{S}}(1),{\mathbf{S}}(2), \ldots ,{\mathbf{S}}(T) ]\)) are weighted by the readout weights \({\mathbf{W}}_{{{\mathbf{out}}}}\) as follows:

$${\mathbf{y}} = {\mathbf{W}}_{{{\mathbf{out}}}} {\mathbf{S}}$$
(6)

To track arbitrary target dynamics \({\hat{\mathbf{y}}}\), such as complex time-series patterns, the output \({\mathbf{y}}\) is optimized by learning \({\mathbf{W}}_{{{\mathbf{out}}}}\) in Eq. (6) and minimizing the mean square error with a ridge regressor (RC) [24, 25], as shown below:

$$E({\mathbf{y}},{\hat{\mathbf{y}}}) = \left\| {{\mathbf{y}} - {\hat{\mathbf{y}}}} \right\|_{2}^{2}$$
(7)
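To make the computation above concrete, the following is a minimal NumPy sketch of a single-scale reservoir: weight initialization as in Eq. (1), the leaky state update of Eqs. (2)-(3), and a ridge readout for Eqs. (6)-(7). The helper names (`init_reservoir`, `run_reservoir`, `ridge_readout`), the toy sine-wave task and all numeric settings are illustrative assumptions rather than the authors' implementation; for deeper scales, Eqs. (4)-(5) would feed the previous scale's state \({\mathbf{S}}^{L-1}(t)\) in place of the input \({\mathbf{x}}(t)\).

```python
import numpy as np

rng = np.random.default_rng(0)

def init_reservoir(n_in, N, omega=0.5, rho=0.9, eta=0.1):
    """Initialize W_in ~ U[-omega, omega] and a sparse W_r rescaled
    to spectral radius rho, as in Eq. (1)."""
    W_in = rng.uniform(-omega, omega, size=(N, n_in))
    W = rng.uniform(-0.5, 0.5, size=(N, N))
    W *= rng.random((N, N)) < eta                        # sparsity connection degree eta
    W_r = rho * W / np.max(np.abs(np.linalg.eigvals(W)))
    return W_in, W_r

def run_reservoir(x, W_in, W_r, lam=0.3):
    """Leaky-integrator update of Eqs. (2)-(3) for one scale.
    x has shape (T, n_in); the returned states have shape (T, N)."""
    T = x.shape[0]
    N = W_r.shape[0]
    S, s = np.zeros((T, N)), np.zeros(N)
    for t in range(T):
        h = np.tanh(W_r @ s + W_in @ x[t])               # Eq. (3)
        s = (1.0 - lam) * s + lam * h                    # Eq. (2)
        S[t] = s
    return S

def ridge_readout(S, y, alpha=1e-6):
    """Learn W_out of Eq. (6) by ridge regression, minimizing Eq. (7)."""
    return np.linalg.solve(S.T @ S + alpha * np.eye(S.shape[1]), S.T @ y)

# Toy usage: one-step-ahead prediction of a sine wave
x = np.sin(0.1 * np.arange(500))[:, None]
W_in, W_r = init_reservoir(n_in=1, N=100)
S = run_reservoir(x, W_in, W_r)
W_out = ridge_readout(S[:-1], x[1:])
prediction = S[:-1] @ W_out
```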

When dealing with classification problems, support vector machines (SVMs) are commonly used in previous works [26]. Inspired by bidirectional LSTM [27], we adopt echo units with bidirectional connections in EBDEN to extract richer context features from the input \({\mathbf{x}}\). In the reservoir space of EBDEN, besides the forward computation of the echo state \({\mathbf{S}}(t)\) along the time steps, the echo state is also computed along the reverse time direction. Unlike the unidirectional case, the echo state \({\mathbf{S}}(t)\) (described in (4)) and the intermediate variable \({\mathbf{H}}(t)\) (described in (5)) are represented as \(\overrightarrow {{\mathbf{S}}}^{L} (t)\), \(\overrightarrow {{\mathbf{H}}}^{L} (t)\) and \(\overleftarrow {{\mathbf{S}}}^{L} (t^{\prime } )\), \(\overleftarrow {{\mathbf{H}}}^{L} (t^{\prime } )\), respectively:

$$\overrightarrow {{\mathbf{H}}}^{L} (t) = \tanh ({\mathbf{W}}_{{{\mathbf{in}}}}^{L - 1} \overrightarrow {{\mathbf{S}}}^{L - 1} (t) + {\mathbf{W}}_{\text{r}}^{L} \overrightarrow {{\mathbf{S}}}^{L} (t - 1))$$
(8)
$$\overleftarrow {{\mathbf{H}}}^{L} (t^{\prime } ) = \tanh ({\mathbf{W}}_{{{\mathbf{in}}}}^{L - 1} \overleftarrow {{\mathbf{S}}}^{L - 1} (t^{\prime } ) + {\mathbf{W}}_{\text{r}}^{L} \overleftarrow {{\mathbf{S}}}^{L} (t^{\prime } - 1))$$
(9)

where \({\vec{\mathbf{S}}}^{L} (t)\) and \(\overrightarrow {{\mathbf{H}}}^{L} (t)\) denote the \(L{\text{th}}\)-scale forward states, and \(\overleftarrow {{\mathbf{S}}}^{L} (t^{\prime } )\) and \(\overleftarrow {{\mathbf{H}}}^{L} (t^{\prime } )\) the reverse states. Accordingly, the start and end of the reverse simulation \(t^{\prime }\) for \(\overleftarrow {{\mathbf{H}}}^{L} (t^{\prime } )\) in (9) correspond to the end and start of the forward simulation t for \(\overrightarrow {{\mathbf{H}}}^{L} (t)\), respectively. Substituting (8) and (9) into (4) gives the following forward and reverse echo states:

$$\overrightarrow {{\mathbf{S}}}^{L} (t) = (1 - \lambda )\overrightarrow {{\mathbf{S}}}^{L} (t - 1) + \lambda \overrightarrow {{\mathbf{H}}}^{L} (t)$$
(10)
$$\overleftarrow {{\mathbf{S}}}^{L} (t^{\prime } ) = (1 - \lambda )\overleftarrow {{\mathbf{S}}}^{L} (t^{\prime } - 1) + \lambda \overleftarrow {{\mathbf{H}}}^{L} (t^{\prime } )$$
(11)

At the final time step, the forward and reverse echo states are concatenated into \({\mathbf{S}}^{L} (t) = [{\vec{\mathbf{S}}}^{L} (t);{\mathbf{\overset{\lower0.5em\hbox{$\smash{\scriptscriptstyle\leftarrow}$}}{S} }}^{L} (t^{\prime } )]\). Hence, the \(L{\text{th}}\)-scale echo state \({\mathbf{S}}^{L}\) over T time steps in EBDEN has size \(M \times T \times 2NL\).
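A possible realization of this bidirectional mechanism, reusing the hypothetical `run_reservoir` helper from the sketch above: the reverse pass simply reads the input backwards, and the two state sequences are concatenated along the feature axis, mirroring Eqs. (8)-(11).

```python
def run_bidirectional(x, W_in, W_r, lam=0.3):
    """Bidirectional echo states of Eqs. (8)-(11): the forward pass reads x in
    time order, the reverse pass reads it backwards, and the two state
    sequences are concatenated along the feature axis."""
    S_fwd = run_reservoir(x, W_in, W_r, lam)               # Eq. (10)
    S_bwd = run_reservoir(x[::-1], W_in, W_r, lam)[::-1]   # Eq. (11), re-aligned to t
    return np.concatenate([S_fwd, S_bwd], axis=1)          # size (T, 2N)
```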

2.2 Bayesian optimization of EBDEN

Unlike traditional reservoir computing frameworks, several hyper-parameters in EBDEN are left unfixed during the learning procedure so that time-series datasets can be modeled effectively. As mentioned before, these hyper-parameters are fitted by a Gaussian regressor (GR) with a Matern kernel, which is updated following BO. According to Fig. 1, the hyper-parameters \(\rho\), \(\omega\), \(\eta\), \(\lambda\), L and N are sampled by BO, and the readout weight matrix \({\mathbf{W}}_{{{\mathbf{out}}}}\) is learnt to obtain the output \({\mathbf{y}}\) (according to Eq. 6). In EBDEN, the hyper-parameters and the readout weights are optimized alternately. When the hyper-parameters are kept fixed, the readout weights of EBDEN are learnt by the ridge regressor; when the readout weights are kept fixed, hyper-parameters that enhance the performance are sought by BO. For the hyper-parameter optimization procedure, the initial candidates are drawn from a pre-defined search space of equal size for each hyper-parameter, and the average performance of the output \({\mathbf{y}}\) produced by these candidates is inferred by BO. In this paper, the lower confidence bound (LCB) mechanism, which is controlled by the predictive mean and variance functions, is used as the acquisition function of BO and is expected to query solutions that keep the expected loss low over the candidates. Hence, denote the acquisition function by F and the set of parameters of EBDEN by \(\theta\), which contains the spectral radius \(\rho\), the scaling coefficient \(\omega\), the sparsity connection degree \(\eta\), the leaky rate \(\lambda\), the number of reservoir scales L and the number of units in the reservoir N.

figure a
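The alternating procedure of Sect. 2.2 can be sketched with a generic Gaussian-process surrogate and an LCB acquisition. The following Python sketch assumes scikit-learn and a user-supplied `loss_fn` that fixes \(\theta\), trains the ridge readout and returns a validation error; the candidate-pool strategy, the \(\kappa\) value and the budgets are illustrative choices, not the paper's exact algorithm.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

rng = np.random.default_rng(0)

# Search space for theta = (rho, omega, eta, lambda, L, N); bounds as in Sect. 3.4.
# L and N are treated as continuous here and would be rounded before building reservoirs.
bounds = np.array([[0.0, 1.25], [0.0, 1.0], [0.001, 1.0], [0.0, 1.0], [1, 10], [100, 1000]])

def sample_theta(n):
    return rng.uniform(bounds[:, 0], bounds[:, 1], size=(n, len(bounds)))

def lcb(gp, theta, kappa=2.0):
    """Lower confidence bound acquisition: predictive mean minus kappa * std."""
    mu, std = gp.predict(theta, return_std=True)
    return mu - kappa * std

def bayesian_optimize(loss_fn, n_init=10, n_iter=50):
    """Alternating scheme of Sect. 2.2: loss_fn fixes theta, trains the ridge
    readout and returns a validation error; the GP surrogate with a Matern
    kernel then proposes the next theta by minimizing the LCB."""
    X = sample_theta(n_init)
    y = np.array([loss_fn(t) for t in X])
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
    for _ in range(n_iter):
        gp.fit(X, y)
        candidates = sample_theta(1000)                    # random candidate pool
        best = candidates[np.argmin(lcb(gp, candidates))]
        X = np.vstack([X, best])
        y = np.append(y, loss_fn(best))
    return X[np.argmin(y)], y.min()
```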

2.3 Ensemble architecture of EBDEN

For complex time-series types such as multivariate time series (MTS), each time step comprises multiple time domain channels. Is it then necessary to model every time domain channel with EBDEN? In practice, feeding time-series inputs \({\mathbf{x}}\) with many channels into the deep echo network makes the computation very costly. Since the MTS datasets [28, 29] in previous works are known to be driven by a small subset of the total channels, clearly not all time domain channels are decisive for the computation of the network. To avoid redundant computation, an ensemble echo network architecture is applied in EBDEN to determine which channels of the MTS exhibit similar temporal behavior.

According to Fig. 2, the echo networks are assigned to ensembles, and each sub-ensemble is evaluated independently by a deep echo network. As mentioned in Sect. 2, there are K time domain channels and C ensembles, and the hyper-parameters (\(\rho\), \(\omega\), \(\eta\), \(\lambda\), L and N) optimized by BO are denoted briefly by \(\theta\). Furthermore, \(f(\theta )\) denotes the loss function defined in Eq. (7), and the ensemble weight matrix \({\mathbf{W}}_{{{\mathbf{K,C}}}}\) measures the importance of each channel individually as follows:

$${\mathbf{W}}_{k,c} = \frac{{{\text{e}}^{{ - f_{k} (\theta_{c} )}} }}{{\sum\nolimits_{c = 1}^{C} {{\text{e}}^{{ - f_{k} (\theta_{c} )}} } }}$$
(12)

where \(f_{k} (\theta_{c} )\) denotes the loss of the \(c{\text{th}}\) sub-ensemble on the \(k{\text{th}}\) time domain channel. The total loss of EBDEN combines the ensemble weights and the loss function \(f(\theta )\):

$${\text{Loss}}_{\text{total}} = \frac{1}{C}\sum\limits_{k = 1}^{K} {\sum\limits_{c = 1}^{C} {{\mathbf{W}}_{{k{\mathbf{,}}c}} f_{k} (\theta_{c} )} }$$
(13)

According to (13), the stopping of the learning procedure of EBDEN is governed by two factors: the ensemble weight \({\mathbf{W}}_{k,c}\) and the loss \(f_{k} (\theta_{c} )\) optimized by BO. Because the total loss is not determined by the losses of the sub-ensembles alone, EBDEN alleviates the overfitting phenomenon.
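A compact sketch of this ensemble weighting, assuming NumPy; `losses[k, c]` stands for \(f_{k}(\theta_{c})\), and the toy numbers are purely illustrative.

```python
import numpy as np

def ensemble_weights(losses):
    """Eq. (12): softmax of negative per-channel losses.
    losses[k, c] is the loss of sub-ensemble c on channel k."""
    e = np.exp(-losses)
    return e / e.sum(axis=1, keepdims=True)     # each row sums to 1 over the C ensembles

def total_loss(losses):
    """Eq. (13): ensemble-weighted loss averaged over the C sub-ensembles."""
    W = ensemble_weights(losses)
    return (W * losses).sum() / losses.shape[1]

# Toy example: K = 3 channels, C = 2 sub-ensembles
losses = np.array([[0.10, 0.50],
                   [0.30, 0.20],
                   [0.40, 0.45]])
print(ensemble_weights(losses))
print(total_loss(losses))
```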

Fig. 2 The ensemble architecture of EBDEN

3 Datasets and characteristics for EBDEN

In this section, the benchmarks, such as the Baydogan MTS archive, SantaFe Competition A, the NARMA system and chaotic attractors, are described. Subsequently, the memory capacity metric is used to explore the advantage of the deep reservoirs of EBDEN. The running time of EBDEN with BO is then compared with that of grid search to demonstrate the superiority of BO. Finally, the test errors of EBDEN, EBDEN without the ensemble and EBDEN with the global ensemble are compared.

3.1 Datasets

  • The public Baydogan archive [30], which contains 13 multivariate time-series datasets, is adopted to evaluate the performance of EBDEN and the comparison algorithms.

  • The far-infrared-laser SantaFe time-series competition A [31] contains 9000 training samples and 1000 testing samples.

  • The nonlinear autoregressive moving average (NARMA) [32] system of order 10, driven by a uniformly distributed input, is used.

  • The three chaotic attractors [33] are simulated over a total of 2500 time steps. Their trajectories form 1-channel, 2-channel (the channels denote the X- and Y-axis) and 3-channel (the channels denote the X-, Y- and Z-axis) time-series datasets, respectively. The dynamic formulation of the Lorenz attractor system is:

$$\begin{aligned} \dot{x} & = 10(y - x) \\ \dot{y} & = x(28 - z) - y \\ \dot{z} & = xy - \frac{8}{3}z \\ \end{aligned}$$
(14)

The dynamic formulation of the Rossler attractor system is:

$$\begin{aligned} \dot{x} & = - (y + z) \\ \dot{y} & = x + 0.2y \\ \dot{z} & = 0.2 + xz - 8z \\ \end{aligned}$$
(15)

The dynamic formulation of the Mackey–Glass attractor system is:

$$\dot{x} = \frac{0.2x(t - 17)}{{1 + x(t - 17)^{n} }} - 0.1x(t)$$
(16)

where x, y and z denote the channels of the corresponding dynamic systems; a simulation sketch for the Lorenz system is given below.
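For illustration, the Lorenz trajectory of Eq. (14) can be generated with a standard ODE solver. The sketch below assumes SciPy, and the integration horizon, initial state and tolerance are illustrative choices rather than the settings used in the paper (the Mackey–Glass system, being a delay differential equation, needs a dedicated integrator).

```python
import numpy as np
from scipy.integrate import solve_ivp

def lorenz(t, state):
    """Right-hand side of the Lorenz system in Eq. (14)."""
    x, y, z = state
    return [10.0 * (y - x), x * (28.0 - z) - y, x * y - (8.0 / 3.0) * z]

# 2500 evenly spaced samples, matching the number of time steps used for the benchmarks
t_eval = np.linspace(0.0, 25.0, 2500)
sol = solve_ivp(lorenz, (0.0, 25.0), y0=[1.0, 1.0, 1.0], t_eval=t_eval, rtol=1e-8)
series = sol.y.T     # shape (2500, 3): the X-, Y- and Z-channel time series
```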

3.2 Contraction mapping analysis for EBDEN

As described in Sect. 2, the bidirectional deep reservoir architecture of EBDEN is composed of multiple scales of echo states. To the best of our knowledge, contractivity has been widely used in previous works [34] to characterize the echo space of reservoir computing. In this section, the superiority of the bidirectional multiple-scale reservoir mechanism is analyzed at the theoretical level through contraction mapping.

Two inputs consisting of the same time series are considered, one perturbed by Gaussian noise and the other undisturbed, and both are learnt by EBDEN. The Euclidean distance is used as the metric between the corresponding echo states. The echo states of all L reservoir scales, namely \({\mathbf{S}} = \left( {{\mathbf{S}}^{ 1} , \ldots ,{\mathbf{S}}^{L} } \right)\) and \({\hat{\mathbf{S}}} = \left( {{\hat{\mathbf{S}}}^{ 1} , \ldots ,{\hat{\mathbf{S}}}^{L} } \right)\), are computed while EBDEN processes the undisturbed and the disturbed inputs, respectively. As described in Eqs. (4) and (5), the \(L{\text{th}}\)-scale echo state \({\mathbf{S}}^{L}\) is driven by the echo state of the previous scale \({\mathbf{S}}^{L - 1}\). Suppose there is a constant \(C^{(L)}\) such that the \(L{\text{th}}\)-scale network satisfies the contraction condition in the sense of Lipschitz continuity, which can be written as follows:

$$\left\| {F^{(L)} \left( {{\mathbf{x}},{\mathbf{S}}^{ ( 1 )} , \ldots ,{\mathbf{S}}^{(L)} } \right) - F^{(L)} \left( {{\mathbf{x}},{\hat{\mathbf{S}}}^{ ( 1 )} , \ldots ,{\hat{\mathbf{S}}}^{(L)} } \right)} \right\| \le C^{(L)} \left\| {\left( {{\mathbf{S}}^{ ( 1 )} , \ldots ,{\mathbf{S}}^{(L)} } \right) - \left( {{\hat{\mathbf{S}}}^{ ( 1 )} , \ldots ,{\hat{\mathbf{S}}}^{(L)} } \right)} \right\|$$
(17)

where \(F^{(L)}\) denotes the \(L{\text{th}}\)-scale state transition function. According to Eqs. (4) and (5), the state transition function describes the update of the echo states, which simplifies to \(s^{(1)} (t) = F^{(1)} (x(t),s^{(1)} (t - 1))\) when the number of scales L equals 1 and to \(s^{(L)} (t) = F^{(L)} (x(t),s^{(1)} (t - 1), \ldots ,s^{(L)} (t - 1))\) when L is larger than 1. With \(C^{(L)}\) ranging in \([0,1]\), a higher value of \(C^{(L)}\) denotes less contractive dynamics. It is shown below that \(C^{(L)}\) indeed lies in \([0,1]\), so the contractivity of EBDEN across its L reservoir scales satisfies the Lipschitz continuity condition.

According to (4), (5) and (17), the difference between \(F^{(L)} (x(t),s^{(1)} (t - 1), \ldots ,s^{(L)} (t - 1))\) and \(F^{(L)} (x(t),{\hat{\text{s}}}^{(1)} (t - 1), \ldots ,\hat{s}^{(L)} (t - 1))\) can be calculated as follows:

$$\begin{aligned} & \left\| {F^{(L)} \left( {{\mathbf{x}},{\mathbf{S}}^{ ( 1 )} , \ldots ,{\mathbf{S}}^{(L)} } \right) - F^{(L)} \left( {{\mathbf{x}},{\hat{\mathbf{S}}}^{ (1 )} , \ldots ,{\hat{\mathbf{S}}}^{(L)} } \right)} \right\| \\ & \quad = \left\| {(1 - \lambda ){\mathbf{S}}^{(L)} + \lambda \tanh \left( {{\mathbf{W}}_{{\mathbf{r}}}^{(L)} {\mathbf{S}}^{(L)} + {\mathbf{W}}_{\text{in}}^{(L)} F^{(L - 1)} \left( {{\mathbf{x}},{\mathbf{S}}^{ ( 1 )} , \ldots ,{\mathbf{S}}^{(L - 1)} } \right)} \right)} \right. \\ & \left. {\quad \quad - (1 - \lambda ){\hat{\mathbf{S}}}^{(L)} - \lambda \tanh \left( {{\mathbf{W}}_{{\mathbf{r}}}^{(L)} {\hat{\mathbf{S}}}^{(L)} + {\mathbf{W}}_{\text{in}}^{{ ( {\text{L)}}}} F^{(L - 1)} \left( {{\mathbf{x}},{\hat{\mathbf{S}}}^{ ( 1 )} , \ldots ,{\hat{\mathbf{S}}}^{(L - 1)} } \right)} \right)} \right\| \\ \end{aligned}$$
(18)

Since the tanh activation is 1-Lipschitz (its slope is at most 1), (18) can be bounded as follows:

$$\begin{aligned} & {\text{Eq}} .\;(18) \\ & \quad \le \left\| {(1 - \lambda ){\mathbf{S}}^{(L)} + \lambda \left( {{\mathbf{W}}_{{\mathbf{r}}}^{(L)} {\mathbf{S}}^{(L)} + {\mathbf{W}}_{{{\mathbf{in}}}}^{(L)} F^{(L - 1 )} \left( {{\mathbf{x}},{\mathbf{S}}^{ ( 1)} , \ldots ,{\mathbf{S}}^{(L - 1 )} } \right)} \right)} \right. \\ & \quad \quad \left. { - \,(1 - \lambda ){\hat{\mathbf{S}}}^{(L)} - \lambda \left( {{\mathbf{W}}_{{\mathbf{r}}}^{(L)} {\hat{\mathbf{S}}}^{(L)} + {\mathbf{W}}_{{{\mathbf{in}}}}^{(L)} F^{(L - 1 )} \left( {{\mathbf{x}},{\hat{\mathbf{S}}}^{ ( 1)} , \ldots ,{\hat{\mathbf{S}}}^{(L - 1 )} } \right)} \right)} \right\| \\ \end{aligned}$$
(19)

With the leaky rate \(\lambda\) in \([0,1]\), applying the triangle inequality and the sub-multiplicativity of the matrix norm to the terms \({\mathbf{S}}^{(L)} - {\hat{\mathbf{S}}}^{(L)}\) and \(F^{(L - 1)} ({\mathbf{x}},{\mathbf{S}}^{ ( 1)} , \ldots ,{\mathbf{S}}^{(L - 1)} ) - F^{(L - 1)} ({\mathbf{x}},{\hat{\mathbf{S}}}^{ ( 1)} , \ldots ,{\hat{\mathbf{S}}}^{(L - 1)} )\), Eq. (19) can be further bounded as follows:

$$\begin{aligned} & {\text{Eq}} .\;(1 9) \\ & \quad \le (1 - \lambda )\left\| {{\mathbf{S}}^{(L)} - {\hat{\mathbf{S}}}^{(L)} } \right\| + \lambda \left( {\left\| {{\mathbf{W}}_{{\mathbf{r}}}^{(L)} } \right\|\;\left\| {{\mathbf{S}}^{(L)} - {\hat{\mathbf{S}}}^{(L)} } \right\|} \right. \\ & \quad \quad \left. { + \,\left\| {{\mathbf{W}}_{{{\mathbf{in}}}}^{(L)} } \right\|\left\| {F^{(L - 1)} \left( {{\mathbf{x}},{\mathbf{S}}^{ ( 1 )} , \ldots ,{\mathbf{S}}^{(L - 1)} } \right) - F^{(L - 1)} \left( {{\mathbf{x}},{\hat{\mathbf{S}}}^{ ( 1 )} , \ldots ,{\hat{\mathbf{S}}}^{(L - 1)} } \right)} \right\|} \right) \\ \end{aligned}$$
(20)

Considering that \(\left\| {F^{(L - 1)} ({\mathbf{x}},{\mathbf{S}}^{ ( 1 )} , \ldots ,{\mathbf{S}}^{(L - 1)} ) - F^{(L - 1)} ({\mathbf{x}},{\hat{\mathbf{S}}}^{ ( 1 )} , \ldots ,{\hat{\mathbf{S}}}^{(L - 1)} )} \right\| \le C^{(L - 1)} \left\| {({\mathbf{S}}^{ ( 1 )} , \ldots ,{\mathbf{S}}^{(L - 1)} ) - ({\hat{\mathbf{S}}}^{ ( 1 )} , \ldots ,{\hat{\mathbf{S}}}^{(L - 1)} )} \right\|\) and \(\left\| {{\mathbf{S}}^{(L)} - {\hat{\mathbf{S}}}^{(L)} } \right\| \le \left\| {({\mathbf{S}}^{ ( 1)} , \ldots ,{\mathbf{S}}^{(L)} ) - ({\hat{\mathbf{S}}}^{ ( 1)} , \ldots ,{\hat{\mathbf{S}}}^{(L)} )} \right\|\), (20) is further enlarged as follows:

$$\begin{aligned} & {\text{Eq}} .\;( 2 0) \\ & \quad \le (1 - \lambda )\left\| {\left( {{\mathbf{S}}^{ ( 1 )} , \ldots ,{\mathbf{S}}^{(L)} } \right) - \left( {{\hat{\mathbf{S}}}^{ ( 1 )} , \ldots ,{\hat{\mathbf{S}}}^{(L)} } \right)} \right\| + \lambda \left( {\left\| {{\mathbf{W}}_{{\mathbf{r}}}^{(L)} } \right\|\;\left\| {\left( {{\mathbf{S}}^{ ( 1 )} , \ldots ,{\mathbf{S}}^{(L)} } \right) - \left( {{\hat{\mathbf{S}}}^{ ( 1 )} , \ldots ,{\hat{\mathbf{S}}}^{(L)} } \right)} \right\|} \right. \\ & \quad \quad + \,\left. {\left\| {{\mathbf{W}}_{{{\mathbf{in}}}}^{(L)} } \right\|{\kern 1pt} C^{(L - 1)} {\kern 1pt} \left\| {\left( {{\mathbf{x}},{\mathbf{S}}^{ ( 1 )} , \ldots ,{\mathbf{S}}^{(L - 1)} } \right) - \left( {{\mathbf{x}},{\hat{\mathbf{S}}}^{ ( 1 )} , \ldots ,{\hat{\mathbf{S}}}^{(L - 1)} } \right)} \right\|} \right) \\ \end{aligned}$$
(21)

From the above, the recurrence formulation for L scale \(C^{(L)}\) can be acquired as follows:

$$C^{(L)} = (1 - \lambda ) + \lambda \left( {C^{(L - 1)} \left\| {{\mathbf{W}}_{{{\mathbf{in}}}}^{(L)} } \right\|{ + }\left\| {{\mathbf{W}}_{{\mathbf{r}}}^{(L)} } \right\|} \right)$$
(22)

Hence, the \(L{\text{th}}\)-scale state transition function \(F^{(L)}\) of EBDEN fulfills the contractivity requirement when \(0 < C^{(L)} = (1 - \lambda ) + \lambda (C^{(L - 1)} \left\| {{\mathbf{W}}_{{{\mathbf{in}}}}^{(L)} } \right\| + \left\| {{\mathbf{W}}_{{\mathbf{r}}}^{(L)} } \right\|) < 1\). From this recurrence, one can see that when \(C^{(L)}\) satisfies \(0 \le C^{(L)} \le 1\), each reservoir layer, and hence EBDEN, remains contractive. Combining the recurrence for the L-scale Lipschitz constant \(C^{(L)}\) in Eq. (22) with the transition function \(s^{(L)} (t) = F^{(L)} (x(t),s^{(1)} (t - 1), \ldots ,s^{(L)} (t - 1))\) of the echo state update, the echo state property requires that the distance between two state trajectories of EBDEN remains bounded in the absence of external input; for this reason, the external input is set to 0 in the transition functions:

$$\begin{aligned} & \left\| {F^{(L)} \left( {{\mathbf{0}},{\mathbf{S}}^{(L)} } \right) - F^{(L)} \left( {{\mathbf{0}},{\hat{\mathbf{S}}}^{(L)} } \right)} \right\| \\ & \quad \le \,(1 - \lambda )\left\| {{\mathbf{S}}^{(L)} - {\hat{\mathbf{S}}}^{(L)} } \right\| + \lambda \left( {\left\| {{\mathbf{W}}_{{\mathbf{r}}}^{(L)} } \right\|\;\left\| {{\mathbf{S}}^{(L)} - {\hat{\mathbf{S}}}^{(L)} } \right\| + {\kern 1pt} C^{(L - 1)} \left\| {{\mathbf{W}}_{{{\mathbf{in}}}}^{(L)} } \right\|\;\left\| {\left( {{\mathbf{0}},{\mathbf{S}}^{(L - 1)} } \right) - \left( {{\mathbf{0}},{\hat{\mathbf{S}}}^{(L - 1)} } \right)} \right\|} \right) \\ \end{aligned}$$
(23)

Applying (23) recursively across the scales and simplifying yields:

$${\text{Eq}} .\;( 2 3) \propto \lambda^{2} C^{(L - 2)} \left\| {{\mathbf{S}}^{(L)} - {\hat{\mathbf{S}}}^{(L)} } \right\| \propto \lambda^{{^{L - 1} }} C^{(1)} \left\| {{\mathbf{S}}^{(L)} - {\hat{\mathbf{S}}}^{(L)} } \right\|$$
(24)

Because the leaky rate \(\lambda\) lies in \([0,1]\) and the Lipschitz constant \(C^{(1)}\) is bounded as shown above, the bound on \(\left\| {F^{(L)} ({\mathbf{0}},{\mathbf{S}}^{(L)} ) - F^{(L)} ({\mathbf{0}},{\hat{\mathbf{S}}}^{(L)} )} \right\|\) is a constant, and EBDEN thus satisfies the echo state property.
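The recurrence of Eq. (22) can also be checked numerically; the sketch below samples weights as in Sect. 2.1 and iterates \(C^{(L)}\) over the scales. Whether each value stays below 1, and hence whether contractivity holds, depends on the chosen leaky rate and on the norms of the sampled weight matrices; the settings here are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

def lipschitz_constants(lam, scales, N=100, omega=0.1, rho=0.9):
    """Iterate the recurrence of Eq. (22),
    C^(L) = (1 - lam) + lam * (C^(L-1) * ||W_in^(L)|| + ||W_r^(L)||),
    using the spectral norm of randomly sampled weight matrices."""
    C, history = 1.0, []                      # C^(0) = 1: identity map on the input
    for _ in range(scales):
        W_in = rng.uniform(-omega, omega, size=(N, N))
        W = rng.uniform(-0.5, 0.5, size=(N, N))
        W_r = rho * W / np.max(np.abs(np.linalg.eigvals(W)))
        C = (1.0 - lam) + lam * (C * np.linalg.norm(W_in, 2) + np.linalg.norm(W_r, 2))
        history.append(C)
    return history

print(lipschitz_constants(lam=0.3, scales=5))   # contractive only if every value is below 1
```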

3.3 Quantitative analysis of the BO and multiple-scale mechanisms

As described in Sect. 3.2, the theoretical analysis has shown that the multiple-scale reservoir architecture can be contractive. Since an echo state network with a multiple-scale architecture has many more hyper-parameters than a single-scale one, Bayesian optimization (BO) is used in this paper to tune the hyper-parameters. The same input time-series datasets are learnt by EBDEN and by a variant that does not contain BO (EBDEN without BO). Apart from the number of scales, all configurations of the two models are kept consistent, and each result is repeated 10 times.

The necessity of BO for EBDEN is demonstrated in Fig. 3, with the testing error used as the metric. EBDEN and EBDEN without BO achieve comparable performance when the number of scales equals 1. However, the performance of EBDEN without BO does not converge as the number of scales increases, as shown by the blue line in Fig. 3a. As the number of scales grows, more hyper-parameters need to be tuned, and the performance of the network becomes unstable if large numbers of these hyper-parameters are chosen inadequately. In contrast, the testing errors of EBDEN shrink with the number of scales, mainly owing to the BO mechanism that searches for appropriate hyper-parameters.

Fig. 3 The characteristics of EBDEN and shallow EBDEN. a The performances of EBDEN and EBDEN without BO with respect to the number of scales. b The performances of EBDEN and shallow EBDEN with different leaky rates \(\lambda\). c The performances of EBDEN and shallow EBDEN with different spectral radii \(\rho\)

Furthermore, to quantitatively analyze the necessity of the multiple-scale mechanism, we apply the memory capacity [35, 36] to measure the effectiveness of ESNs. In this section, the short-term memory capacity of EBDEN and of shallow EBDEN is compared by evaluating the models' capability of recalling delayed time series. The memory capacity is calculated as:

$$C = \sum\limits_{\tau = 0}^{\infty } {{\text{Corr}}({\hat{\mathbf{y}}}(t - \tau ),{\mathbf{y}}(t))^{2} }$$
(25)

where C and \({\text{Corr}}( \cdot )\) denote the memory capacity and the correlation operator, respectively. The memory capacity is thus the sum of squared correlation coefficients between the prediction \({\mathbf{y}}(t)\) and the target \({\hat{\mathbf{y}}}\) delayed by \(\tau\). Notably, the delay \(\tau\) varies from 0 to the number of input samples.
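A minimal sketch of Eq. (25) for one-dimensional series, assuming NumPy; the function name and the truncation of the delay at `max_delay` (instead of infinity) are illustrative choices.

```python
import numpy as np

def memory_capacity(y_pred, y_target, max_delay):
    """Eq. (25): sum of squared correlations between the prediction y(t)
    and the target delayed by tau, for tau = 0 .. max_delay."""
    mc, T = 0.0, len(y_pred)
    for tau in range(max_delay + 1):
        if tau >= T - 1:
            break
        a = y_pred[tau:]            # y(t) for t = tau .. T-1
        b = y_target[:T - tau]      # y_hat(t - tau) aligned with a
        mc += np.corrcoef(a, b)[0, 1] ** 2
    return mc
```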

From Fig. 3b, c, the performance of EBDEN is seen to be superior to that of shallow EBDEN in the same configuration (such as leaky rates \(\lambda\) and spectral radius \(\rho\)).

3.4 Quantitative analysis of BO versus grid search

In this section, the efficiency of using BO to optimize EBDEN is measured by comparison with the grid search mechanism. It is worth noting that the ranges of the hyper-parameters are identical for the two optimization mechanisms. The search space of the hyper-parameters for EBDEN is: the number of units in each reservoir \(N\left\{ {N \in (100,1000)} \right\}\), the scaling coefficient \(\omega \left\{ {\omega \in (0,1)} \right\}\), the sparsity connection degree \(\eta \left\{ {\eta \in (0.001,1)} \right\}\), the spectral radius \(\rho \left\{ {\rho \in (0,1.25)} \right\}\), the leaky rate \(\lambda \left\{ {\lambda \in (0,1)} \right\}\) and the number of scales \(L\left\{ {L \in (1,10)} \right\}\). Moreover, 100 points are initialized by BO in EBDEN, with the maximum number of iterations and the convergence criterion of BO pre-defined as 1000 and 0.001, respectively. For EBDEN with grid search optimization, each hyper-parameter takes five values: the number of units in each reservoir \(N\left\{ {N \in (100,325,550,775,1000)} \right\}\), the scaling coefficient \(\omega \left\{ {\omega \in (0,0.25,0.5,0.75,1)} \right\}\), the sparsity connection degree \(\eta \left\{ {\eta \in (0.001,0.0056,0.031,0.17,1)} \right\}\), the spectral radius \(\rho \left\{ {\rho \in (0,0.312,0.625,0.937,1.250)} \right\}\), the leaky rate \(\lambda \left\{ {\lambda \in (0,0.25,0.5,0.75,1)} \right\}\) and the number of scales \(L\left\{ {L \in (1,3,5,8,10)} \right\}\).
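The size of the two search budgets can be read off directly: the grid above contains \(5^{6} = 15{,}625\) full evaluations, whereas the BO budget is bounded by the 100 initial points plus at most 1000 iterations. A small sketch, with the dictionary keys used only as labels:

```python
from itertools import product

# Grid of Sect. 3.4: five values per hyper-parameter
grid = {
    "N":      [100, 325, 550, 775, 1000],
    "omega":  [0, 0.25, 0.5, 0.75, 1],
    "eta":    [0.001, 0.0056, 0.031, 0.17, 1],
    "rho":    [0, 0.312, 0.625, 0.937, 1.250],
    "lambda": [0, 0.25, 0.5, 0.75, 1],
    "L":      [1, 3, 5, 8, 10],
}
n_grid = len(list(product(*grid.values())))    # 5**6 = 15625 full evaluations
n_bo = 100 + 1000                              # 100 initial points plus at most 1000 BO iterations
print(n_grid, n_bo)
```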

The experiments are performed on three of the datasets described in Sect. 3.1. Two metrics, running time and test error, are used to measure the performance of EBDEN and of EBDEN with grid search optimization. The experimental results are shown in Table 1.

Table 1 The performance of EBDEN versus EBDEN with grid search

As can be seen from Table 1, EBDEN not only speeds up the search for suitable hyper-parameters compared with EBDEN with grid search optimization, but also achieves smaller test errors.

3.5 Quantitative analysis of the ensemble architecture

As mentioned in Sect. 2.3, several echo state networks are ensembled in EBDEN to restrict the redundant computation over the time domain channels. In this section, the advantage of the ensemble architecture of EBDEN is explored by measuring whether it contributes to improved performance in modeling the MTS dataset. The experiment is benchmarked on a time-series dataset with 30 channels by comparing the performance of EBDEN, EBDEN without the ensemble and EBDEN with the global ensemble. Note that in EBDEN without the ensemble, all of the channels are incorporated into the same echo state network architecture and the mean performance is computed. In EBDEN with the global ensemble, every channel is modeled by its own echo state network (30 ensembles). The performance is shown in Fig. 4.

Fig. 4 The characteristic of EBDEN with respect to the number of ensembles

As seen from Fig. 4, the number of ensembles for EBDEN ranges within [1, 6]. As the number of ensembles grows, the test error of EBDEN decreases from 0.057 to 0.032. When the number of ensembles equals 6, the performance of EBDEN approaches that of EBDEN with the global ensemble.

In addition to the comparison between EBDEN with the global ensemble and EBDEN in Fig. 4, to assess the necessity of the ensemble learning module in EBDEN, the prediction performance of EBDEN is also compared with that of SESNE [22], which uses simple least-squares regression. As described in Sect. 1, SESNE does not contain the optimization procedure, whereas EBDEN uses the loss function with the optimal weights in Eq. (13) to account for the importance of each MTS channel individually. Figure 5 clearly shows that the performance of EBDEN converges at 7 ensembles, whereas the performance of SESNE does not converge with the number of ensembles. Figure 5 thus indicates that the optimized loss function of EBDEN can effectively improve the performance of the ensemble organization of the echo state network.

Fig. 5 The performance comparison between EBDEN and SESNE

4 Results

All the computation models in this paper are implemented on an Nvidia TITAN X graphics card with a dual-core Intel CPU in a Windows environment. Moreover, EBDEN is applied to classification and fitting problems by using SVM and RC, respectively, to learn the readout weights according to Eq. (7). To ensure rigor, all the experiments in Sects. 4.1, 4.2 and 4.3 are repeated 10 times. Critical difference diagrams are depicted in Figs. 6, 7 and 8 to describe the statistically significant scores according to the corresponding probability between two comparisons [37]. The critical distance is controlled by the critical difference and the number of comparisons.

Fig. 6 The critical difference diagram for the comparison of EBDEN with five feature-based models

Fig. 7 The critical difference diagram for the comparison of EBDEN with six deep-learning-based models

4.1 Results for multivariate time-series datasets

When dealing with MTS sets, the datasets comprise multiple time domain channels at each time step. In this study, three categories of representative models for MTS datasets are used for comparison: (1) feature-based models, (2) ensemble-based models and (3) deep-learning-based models.

With regard to the feature-based models, the hidden Markov model (HMM) [38] is used as the baseline model. The autoregressive kernel (AR) [39] utilizes a matrix normal-inverse Wishart prior to measure the similarity between two MTS. Feature DTW [40] measures the distances between time series and feeds them as features into machine learning. Learned pattern similarity (LPS) [41] models the dependency between time stamps by local autopatterns. Time-series feature selection (TSFE) [40] computes thousands of time-series features and then selects discriminative ones with a greedy forward selection technique and a linear classifier.

As for the deep-learning-based models, six contrasts are used. The temporal convolutional neural network [5], the time-series encoder (TSE) [6], the multi-channel deep convolutional neural network (MCDCNN) [7] and time-CNN [8] are described in Sect. 1. The multi-scale convolutional neural network (MSCNN) [42] introduces the window slicing (WS) operator as a feature augmentation method and uses transformation, local convolution and full convolution stages to learn these features. Besides, the multilayer perceptron (MLP) [5] consists of 4 fully connected layers to model the time-series datasets.

Regarding the ensemble-based models, the autoregressive forest (AF) [43] applies an autoregressive learning mechanism to a tree-based architecture to model time series. The shapelet ensemble (SE) [44] captures ensembles of shapelet subsequences of the time-series sets to measure the similarity between series. Symbolic representation for multivariate time series (SMTS) [45] regards the code book as words to label the leaf nodes trained by a random forest.

The comparison of experimental results between EBDEN and the feature-based models on 12 MTS sets is shown in Table 2. The contrasts comprise HMM, AR, feature DTW, LPS and TSFE. From Fig. 6, it can be seen that EBDEN achieves 7 wins, while HMM, AR, feature DTW, LPS and TSFE achieve 1, 3, 2, 3 and 6 wins, respectively. TSFE holds the first rank among these models, and EBDEN achieves comparable performance. In our view, this is mainly due to the performance on UWave, on which TSFE visibly outperforms EBDEN.

Table 2 Performance of five feature-based time-series models and EBDEN on 12 datasets

Table 3 shows the comparison between EBDEN and the deep-learning-based models. EBDEN wins on 8 of the 12 datasets, while the other contrasts win on 0, 9, 3, 0, 0 and 0 datasets, respectively. Hence, EBDEN achieves performance comparable to TCNN and is superior to the other contrasts on the MTS datasets, which is supported by the critical difference diagram in Fig. 7.

Table 3 Performance of six deep-learning-based time-series models and EBDEN on 12 datasets

Table 4 shows the comparison between EBDEN and the three ensemble-based models. EBDEN achieves 8 wins, while AF, SE and SMTS achieve 2, 5 and 3 wins, respectively. This indicates the good performance of SE and EBDEN on MTS sets. According to Fig. 8, EBDEN achieves the first rank among these contrasts and clearly outperforms the other models.

Table 4 Performance of three ensemble-based time-series models and EBDEN on 12 datasets
Fig. 8 The critical difference diagram for the comparison of EBDEN with three ensemble-based models

4.2 Results for chaotic series representation

To the best of our knowledge, chaotic series analysis has been used in a wide range of fields, such as radar detection [46], chemical reactions [47] and EEG signal analysis [48]. Owing to their intricate mathematical formulations, however, chaotic series are always hard to learn. In this section, the performance of EBDEN is evaluated on three chaotic attractor datasets.

TSE and MCDCNN are employed as the contrasts in this section, and the testing error is used as the metric for chaotic series representation. Figures 9 and 10 show the testing outputs and errors, respectively. Since the simulation length is long, only the X-axis of each attractor and 100 of the 2500 samples are shown in the figures.

Fig. 9 Three representation performances of chaotic attractor models. a The performance of three models on Lorenz attractors. b The performance of three models on Mackey–Glass attractors. c The performance of three models on Rossler attractors

Fig. 10 Three testing error performances of chaotic attractor models. a The performance of three models on Lorenz attractors. b The performance of three models on Mackey–Glass attractors. c The performance of three models on Rossler attractors

From Figs. 9 and 10, it can be seen that the TSE model produces the curves with the highest testing errors on the Lorenz and Mackey–Glass attractors, while the MCDCNN model produces the curve with the highest testing errors on the Rossler attractor. The EBDEN model, in contrast, produces curves with small testing errors and achieves a high goodness of fit on all three chaotic attractors, prominently surpassing TSE and MCDCNN (the testing errors of EBDEN approach 0 in Fig. 10).

4.3 Results for Dansgaard–Oeschger component estimation

Dansgaard–Oeschger events are episodes of gradual cooling that follow abrupt increases of up to 15 °C in the surface temperature of the North Atlantic region over a few decades [49]. The Dansgaard–Oeschger signal is recorded in the time series of the Greenland ice cores collected by the INTIMATE project. The Ca2+ and \(\delta^{18}{\text{O}}\) records are two essential components influencing the Dansgaard–Oeschger events; their tendencies are predicted by EBDEN with 2000 time-series samples used for training and 2500 and 2000 samples used as the corresponding testing sets. Moreover, DeepESN [19], which uses recursive least squares, is taken as the contrast model in this experiment.

As shown in Fig. 11, EBDEN achieves comparable and, in places, better performance than DeepESN. From Fig. 11a, b, at irregular points such as extrema, EBDEN performs better than DeepESN.

Fig. 11 Part of the testing results for Ca2+ and \(\delta^{18}{\text{O}}\) from EBDEN and DeepESN

5 Conclusions

The brain can inspire echo state models to encode various time-series datasets, such as multivariate and chaotic time series. In this paper, a novel echo state computation framework, called the ensemble Bayesian deep echo network, is proposed and applied to model time-series datasets.

There are three key contributions in this paper. First, the bidirectional multiple-scale reservoirs across the time steps are fused to construct the deep architecture of the ensemble Bayesian deep echo network, which the contraction mapping analysis in this paper shows to possess a higher memory capacity than shallow reservoirs. Second, Bayesian optimization is used in the network to select suitable hyper-parameters, which enables the network to achieve strong performance; unlike traditional echo state networks, the hyper-parameters of the ensemble Bayesian deep echo network are not kept fixed. Third, through the ensemble mechanism, the ensemble Bayesian deep echo network avoids redundant computation when encountering multiple channels of multivariate time series.

Owing to these contributions, the ensemble Bayesian deep echo network can be used as a high-performance brain-like computational framework for solving realistic problems, such as multivariate time-series classification, chaotic attractor-based time-series representation and Dansgaard–Oeschger component estimation. Besides, it can help bridge the gap between bio-inspired networks and conventional neural network models.