
1 Introduction

Reservoir Computing (RC) [21, 28] delineates a class of Recurrent Neural Network (RNN) models based on the idea of separating the non-linear dynamical component of the network, i.e. the recurrent hidden reservoir layer, from the feed-forward linear readout layer. The reservoir is initialized randomly under stability constraints and then left untrained, so that the burden of training falls only on the readout part of the architecture, resulting in a strikingly efficient approach to RNN design. In this context, the Echo State Network (ESN) model [16, 18] is a popular realization of the RC paradigm that implements the reservoir as a discrete-time non-linear dynamical system. Since their dynamics are untrained, ESNs represent an important tool to understand and characterize the operation and potentialities of recurrent neural models. Shaping the reservoir architecture in order to achieve desired properties and optimized performance in applications, even in the absence of training of the recurrent connections, is one of the key goals of RC research [9].

In this paper we bring together two major trends in the area of ESN architectural studies. The first one focuses on the pattern of connectivity among the recurrent units. In this case, the aim is to constrain the random reservoir initialization process towards topologies that determine specific algebraic properties of the resulting recurrent weight matrices. A relevant class of reservoir variants in this regard is given by ESNs with orthogonal recurrent matrices [15, 30], which were shown to improve over random reservoirs both in terms of memorization skills and in terms of predictive performance on non-linear tasks. In particular, reservoirs whose structure is based on permutation matrices represent particularly appealing instances of orthogonal ESNs [15, 26], entailing a simple and very sparse pattern of connectivity among the recurrent units. Other relevant architectural variants are given by reservoirs structured according to a ring topology or arranged to form a chain of units [24, 26]. The second major line of research that we consider regards the construction of hierarchically structured reservoir models. While initial studies in this context focused on composing multiple ESN modules to form ad-hoc architectures [19, 27], recent works started analyzing the effects of stacking multiple untrained reservoir layers with the introduction of the DeepESN model [8]. On the one hand, the analysis of DeepESN dynamics contributes to uncovering the intrinsic computational properties of deep neural networks in the temporal domain [8, 13]. On the other hand, a proper architectural design of deep reservoirs might have a huge impact in real-world applications [12], enabling effective multiple time-scales processing while preserving the training efficiency typical of RC models.

In this paper we analyze the impact on the predictive performance given by a constrained reservoir topology in DeepESNs. Specifically, we consider deep architectures in which the individual reservoir layers are implemented based on permutation matrices, as well as on ring and on chain topologies. Our study is conducted in comparison to shallow ESN counterparts through numerical experiments on several benchmarks in the RC area.

The rest of this paper is structured as follows. The DeepESN model is introduced in Sect. 2, while the investigated reservoir topologies are described in Sect. 3. The experimental analysis is reported in Sect. 4. Finally, Sect. 5 draws conclusions and delineates future research directions.

2 Deep Echo State Networks

A DeepESN is an RC model in which the reservoir part is organized into a stacked composition of multiple untrained recurrent hidden layers. The external input is propagated only to the first reservoir layer, while each successive level in the deep architecture is fed by the output of the previous one, as graphically illustrated in Fig. 1.

Fig. 1. Hierarchical reservoir architecture in a DeepESN.

To fix our notation, we use L to indicate the number of layers in the deep reservoir, while \(N_U\) and \(N_Y\) respectively denote the sizes of the input and output spaces. For the sake of simplicity in the presentation of the DeepESN model, here we assume that all the reservoir layers have the same number of units, indicated by \(N_R\). The operation of each reservoir layer can be described in terms of a discrete-time non-linear dynamical system, whose state update equation is given in the form of an iterated mapping. In particular, at time-step t, the state of the first layer, i.e. \(\mathbf {x}^{(1)}(t) \in \mathbb {R}^{N_R}\), is computed as follows:

$$\begin{aligned} \mathbf {x}^{(1)}(t) = \tanh (\mathbf {W}_{in}\mathbf {u}(t) + \hat{\mathbf {W}}^{(1)} \mathbf {x}^{(1)}(t-1)), \end{aligned}$$
(1)

while the state of each successive layer \(l>1\), i.e. \(\mathbf {x}^{(l)}(t) \in \mathbb {R}^{N_R}\), is given by:

$$\begin{aligned} \mathbf {x}^{(l)}(t) = \tanh (\mathbf {W}^{(l)} \mathbf {x}^{(l-1)}(t) + \hat{\mathbf {W}}^{(l)} \mathbf {x}^{(l)}(t-1)). \end{aligned}$$
(2)

Here, \(\tanh \) indicates the element-wise application of the hyperbolic tangent non-linearity, \(\mathbf {u}(t) \in \mathbb {R}^{N_U}\) represents the external input at time-step t, while \(\mathbf {W}_{in}\), \(\mathbf {W}^{(l)}\) and \(\hat{\mathbf {W}}^{(l)}\) respectively denote the input weight matrix (which modulates the external input stimulation to the first layer), the inter-layer weight matrix for layer l (which modulates the strength of the connections from layer \(l-1\) to layer l), and the recurrent reservoir weight matrix for layer l. In both Eqs. 1 and 2 we omitted the bias terms for ease of notation. The interested reader can find in [8] a more detailed description of the deep reservoir equations, framed in the more general context of leaky integrator reservoir units. To set up the initial conditions for the state update Eqs. 1 and 2, at time-step 0 all reservoir layers are set to a null state, i.e. \(\mathbf {x}^{(l)}(0) = \mathbf {0}\) for all \(l = 1,\ldots , L\). Given this framework, it is worth noticing that a standard shallow ESN model can be seen as a special case of DeepESN in which a single reservoir layer is considered, i.e. \(L = 1\).
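To make the state update concrete, the following is a minimal NumPy sketch of Eqs. 1 and 2 (bias terms omitted, as above); the function and variable names are ours and do not refer to any specific library implementation.

```python
import numpy as np

def deep_reservoir_step(u_t, x_prev, W_in, W_inter, W_rec):
    """One time-step of the deep reservoir (Eqs. 1 and 2), bias terms omitted.

    u_t     : external input at time-step t, shape (N_U,)
    x_prev  : list of the L previous states, each of shape (N_R,)
    W_in    : input weight matrix, shape (N_R, N_U)
    W_inter : list of inter-layer matrices W^(l) for l = 2..L, each (N_R, N_R)
    W_rec   : list of recurrent matrices W_hat^(l) for l = 1..L, each (N_R, N_R)
    """
    x_new = []
    # First layer is driven by the external input (Eq. 1).
    x = np.tanh(W_in @ u_t + W_rec[0] @ x_prev[0])
    x_new.append(x)
    # Each higher layer is driven by the output of the previous layer (Eq. 2).
    for l in range(1, len(W_rec)):
        x = np.tanh(W_inter[l - 1] @ x_new[l - 1] + W_rec[l] @ x_prev[l])
        x_new.append(x)
    return x_new
```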

As in standard RC approaches, the parameters of the entire reservoir component, i.e. the elements of all the weight matrices in Eqs. 1 and 2, are left untrained after an initialization subject to stability constraints. These are required to prevent the system dynamics from falling into unstable regimes, which would make them unsuitable for robust processing of time-series data. In the context of ESNs, the analysis of asymptotic stability is usually carried out in terms of the Echo State Property (ESP) [16, 21], which provides simple algebraic conditions for the initialization of reservoir weight matrices and has recently been extended to cope with the case of deep reservoirs in [6]. From a practical viewpoint, the analysis in [6] suggests to carefully control the spectral radius of all the reservoir weight matrices in the deep reservoir. In this paper, we use \(\rho ^{(l)}\) to denote the spectral radius in layer l, i.e. the largest among the absolute values of the eigenvalues of \(\hat{\mathbf {W}}^{(l)}\). A simple initialization procedure for the reservoir of a DeepESN then consists in choosing the elements of \(\hat{\mathbf {W}}^{(l)}\) randomly from a uniform distribution on \([-1,1]\), and subsequently re-scaling them to achieve desired values of \(\rho ^{(l)}\), typically not above unity. Similarly, the elements of \(\mathbf {W}_{in}\) and those of \(\mathbf {W}^{(l)}\) (for \(l>1\)) are initialized randomly from a uniform distribution on \([-1,1]\), and then re-scaled to control the input scaling hyper-parameter \(\omega _{in} = \Vert \mathbf {W}_{in}\Vert _2\) and the set of inter-layer scaling hyper-parameters \(\omega _{il}^{(l)} = \Vert \mathbf {W}^{(l)}\Vert _2\).
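As an illustration, here is a minimal sketch of this initialization procedure. Dense matrices are used for brevity (the sparse reservoir connectivity described in Sect. 4.1 is not modeled), and the function names and example hyper-parameter values are ours.

```python
import numpy as np

def init_recurrent_matrix(N_R, rho, rng):
    """Random recurrent matrix, re-scaled to a desired spectral radius rho."""
    W = rng.uniform(-1.0, 1.0, size=(N_R, N_R))
    radius = np.max(np.abs(np.linalg.eigvals(W)))
    return (rho / radius) * W

def init_scaled_matrix(rows, cols, scale, rng):
    """Random matrix, re-scaled so that its 2-norm equals the given scaling."""
    W = rng.uniform(-1.0, 1.0, size=(rows, cols))
    return (scale / np.linalg.norm(W, 2)) * W

rng = np.random.default_rng(42)
N_U, N_R, L = 1, 100, 3
W_in = init_scaled_matrix(N_R, N_U, scale=1.0, rng=rng)                   # omega_in
W_inter = [init_scaled_matrix(N_R, N_R, 1.0, rng) for _ in range(L - 1)]  # omega_il
W_rec = [init_recurrent_matrix(N_R, rho=0.9, rng=rng) for _ in range(L)]  # rho^(l)
```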

The output of the DeepESN is computed by a simple readout tool, which linearly combines the reservoir representations developed in all the layers of the deep architecture. In formulas, the output at time-step t, denoted as \(\mathbf {y}(t) \in \mathbb {R}^{N_Y}\), is computed by the following equation:

$$\begin{aligned} \mathbf {y}(t) = \mathbf {W}_{out}\, [\mathbf {x}^{(1)}(t); \ldots ; \mathbf {x}^{(L)}(t)], \end{aligned}$$
(3)

where \(\mathbf {W}_{out}\) is the output weight matrix, and \([\mathbf {x}^{(1)}(t); \ldots ; \mathbf {x}^{(L)}(t)]\) represents the global deep reservoir state at time-step t, expressed as the concatenation of all the states in the architecture. The elements of \(\mathbf {W}_{out}\) are the only learnable weights of the DeepESN, and are typically adjusted to fit a training set by exploiting non-iterative training algorithms, as in the case of standard RC models [21]. Notice that, although different patterns of reservoir-to-readout connectivity are possible [23], the one employed here, where all reservoir layers feed the readout, has the advantage of allowing the training algorithm to differently modulate and exploit the variety of representations provided by the different levels of the deep architecture.
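A minimal sketch of readout training by pseudo-inversion, assuming that the concatenated deep reservoir states have already been collected into a matrix (the function names are ours):

```python
import numpy as np

def train_readout(states, targets):
    """Fit W_out of Eq. 3 by pseudo-inversion.

    states  : array of shape (T, L * N_R), concatenated deep reservoir states
              collected over the training time-steps (after washout)
    targets : array of shape (T, N_Y), the corresponding target outputs
    Returns W_out of shape (N_Y, L * N_R).
    """
    return (np.linalg.pinv(states) @ targets).T

def readout(W_out, x_layers):
    """Compute the output y(t) from the list of per-layer states at time t."""
    return W_out @ np.concatenate(x_layers)
```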

A more comprehensive description of the characteristics and advantages of the DeepESN approach can be found in [7], while a constantly updated overview on the advancements achieved in this research field is given in [10]. To date, software implementations of the DeepESN model are publicly available as libraries for Python, Matlab and Octave.

3 Reservoir Topology

We consider DeepESN architectural variants where the recurrent weight matrix of each layer l, i.e. \(\hat{\mathbf {W}}^{(l)}\), is characterized by a specific structure, according to the topologies described in the following. The resulting patterns of reservoir connectivity are graphically exemplified in Fig. 2, and a minimal construction sketch of the corresponding recurrent matrices is given at the end of this section.

Fig. 2. Reservoir topologies of DeepESN layers.

  • Sparse: Each reservoir unit is randomly connected to a subset of the others, determining a sparse recurrent matrix \(\hat{\mathbf {W}}^{(l)}\) (see Fig. 2(a)). This corresponds to a common setting used in RC practice and serves here as baseline for our analysis.

  • Permutation: The structure of the recurrence matrix \(\hat{\mathbf {W}}^{(l)}\) is given by a permutation matrix \(\mathbf {P}\), i.e. we have:

    $$\begin{aligned} \hat{\mathbf {W}}^{(l)} = \lambda \, \mathbf {P}, \end{aligned}$$
    (4)

    where \(\mathbf {P}\) is obtained by randomly permuting the columns of the identity matrix, and \(\lambda \) is a multiplicative constant that specifies the value of the non-zero recurrent weights. In this case, the spectral radius of \(\hat{\mathbf {W}}^{(l)}\) is determined by the value of \(\lambda \), i.e. \(\rho ^{(l)} = \lambda \). The permutation topology implies that each row and each column of the recurrence matrix has exactly one non-zero element, resulting in a reservoir architecture that presents a variable number of disjoint cyclic structures, as graphically exemplified in Fig. 2(b). The levels of the deep reservoir architecture are allowed to employ different permutations, i.e. the number of cycles in each reservoir layer can be different.

    In the context of shallow ESNs, this kind of topology has been empirically studied in [2], where it was shown to achieve good memorization skills while also improving over randomly initialized reservoirs on tasks involving non-linear mappings. Interestingly, the permutation topology has also been investigated in [15] as a way to implement orthogonal reservoir matrix structures, under the name of Critical ESNs.

  • Ring: The reservoir units are organized to form a single ring, as shown in Fig. 2(c). Accordingly, the recurrent weight matrix \(\hat{\mathbf {W}}^{(l)}\) is expressed as:

    $$\begin{aligned} \hat{\mathbf {W}}^{(l)}= \lambda \, \left[ \begin{array}{cccc} 0 & 0 & \ldots & 1\\ 1 & 0 & \ldots & 0\\ \vdots & \ddots & \ddots & \vdots \\ 0 & \ldots & 1 & 0\\ \end{array} \right] , \end{aligned}$$
    (5)

    where \(\lambda \) is the value of non-zero recurrent weights, and determines the spectral radius of \(\hat{\mathbf {W}}^{(l)}\), i.e. \(\rho ^{(l)} = \lambda \). The ring topology can be easily seen as a special case of the permutation topology, where the pattern of reservoir connectivity is ruled by the specific permutation matrix in Eq. 5, and the reservoir units are all part of the same cyclic structure.

    Reservoirs following this architectural organization have been the subject of several studies in the shallow RC literature. Notable instances in this regard are the work in [26], in which the ring topology is studied in the context of orthogonal reservoir structures, and the work in [24], where the study is carried out from the perspective of architectural design simplification for minimum complexity ESN construction. One interesting outcome of previous analyses of the ring topology is that, compared to randomly initialized reservoirs, it shows a superior memory capacity that, at least in the linear case, approaches the optimal value [24]. While this optimal memory characterization has been extensively analyzed in the literature for the more general class of orthogonal recurrent weight matrices (see e.g. [3, 17, 30]), the ring topology presents the advantage of a strikingly simple (and sparse) dynamical network construction.

  • Chain: The recurrent units are arranged in a pipeline, where each unit, except for the first one, receives as input the activation of the previous one, forming a chain as in the example in Fig. 2(d). The only non-zero elements of \(\hat{\mathbf {W}}^{(l)}\) are located on the lower sub-diagonal, i.e. we have:

    $$\begin{aligned} \hat{\mathbf {W}}^{(l)}= \lambda \, \left[ \begin{array}{cccc} 0 & 0 & \ldots & 0\\ 1 & 0 & \ldots & 0\\ \vdots & \ddots & \ddots & \vdots \\ 0 & \ldots & 1 & 0\\ \end{array} \right] , \end{aligned}$$
    (6)

    where, as in the previous cases, \(\lambda \) specifies the value of the non-zero weights. Although in this case \(\hat{\mathbf {W}}^{(l)}\) is nilpotent and hence its spectral radius is always 0, we still operate on \(\lambda \) to control the magnitude of the recurrent weights. As such, with a slight abuse of notation, also in this case we set \(\rho ^{(l)} = \lambda \). Overall, the chain topology results in a particularly simple design strategy that, from the architectural perspective, further simplifies the ring topology by removing the connection that closes the cycle.

    Works in the shallow ESN literature have pointed out the merits of reservoir organizations based on a chain topology (also called delay-line reservoirs): they offer a very simple approach to the architectural design of the network, resulting in a model that is easier to analyze [30] and that achieves comparable or even better performance than standard ESNs [24, 26].
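The construction sketch announced at the beginning of this section is given below: it builds the per-layer recurrent matrices for the sparse baseline and for the topologies of Eqs. 4-6. It is a minimal NumPy illustration under our own naming and connectivity conventions, not an excerpt from any specific library.

```python
import numpy as np

def sparse_reservoir(N_R, rho, n_conn, rng):
    """Sparse baseline: each unit receives n_conn random recurrent connections."""
    W = np.zeros((N_R, N_R))
    for i in range(N_R):
        idx = rng.choice(N_R, size=n_conn, replace=False)
        W[i, idx] = rng.uniform(-1.0, 1.0, size=n_conn)
    radius = np.max(np.abs(np.linalg.eigvals(W)))
    return (rho / radius) * W

def permutation_reservoir(N_R, lam, rng):
    """Eq. 4: lambda times a random permutation of the identity's columns."""
    P = np.eye(N_R)[:, rng.permutation(N_R)]
    return lam * P

def ring_reservoir(N_R, lam):
    """Eq. 5: a single cycle through all the units."""
    W = np.roll(np.eye(N_R), shift=1, axis=0)  # 1s on the sub-diagonal plus the corner
    return lam * W

def chain_reservoir(N_R, lam):
    """Eq. 6: as the ring, but with the cycle-closing connection removed (nilpotent)."""
    W = np.roll(np.eye(N_R), shift=1, axis=0)
    W[0, -1] = 0.0
    return lam * W
```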

4 Experiments

In this section we illustrate the experimental analysis conducted in this paper. Specifically, in Sect. 4.1 we detail the datasets considered and the experimental settings adopted in our work, whereas in Sect. 4.2 we report and discuss the achieved numerical results.

4.1 Datasets and Experimental Settings

In our experiments, we considered benchmark datasets consisting of univariate time-series (i.e., \(N_U = N_Y = 1\)).

The first dataset is obtained from a non-linear auto-regressive moving average system of the 10th order (NARMA10). At each time-step, the input u(t) is drawn from a uniform distribution over [0, 0.5], whereas the corresponding target output \(y_{tg}(t)\) is given by the following relation:

$$\begin{aligned} y_{tg}(t) = 0.3 \, y_{tg}(t-1)+0.05 \, y_{tg}(t-1) \sum \limits _{i = 1}^{10} y_{tg}(t-i) + 1.5 \, u(t-10) \, u(t-1)+0.1. \end{aligned}$$
(7)
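A minimal sketch of the NARMA10 data generation, following the index convention of Eq. 7 literally; the warm-up length discarded at the beginning of the sequence is our choice.

```python
import numpy as np

def narma10(T, rng, warmup=200):
    """Generate a NARMA10 input/target pair of length T, following Eq. 7."""
    total = T + warmup
    u = rng.uniform(0.0, 0.5, size=total)
    y = np.zeros(total)
    for t in range(10, total):
        y[t] = (0.3 * y[t - 1]
                + 0.05 * y[t - 1] * np.sum(y[t - 10:t])
                + 1.5 * u[t - 10] * u[t - 1]
                + 0.1)
    return u[warmup:], y[warmup:]
```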

The second dataset that we considered is the Santa Fe Laser time-series [29], where the input values u(t) are sampled intensities from a far-infrared laser in a chaotic regime, re-scaled by a factor of 0.01. We used the Laser dataset to define a next-step prediction task, where \(y_{tg}(t) = u(t+1)\) for each time-step t.
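A minimal sketch of the corresponding task setup; the file name is hypothetical, and only the re-scaling and the next-step target definition reflect the description above.

```python
import numpy as np

# Hypothetical file containing the Santa Fe Laser intensities (the name is ours).
laser = np.loadtxt("santafe_laser.txt") * 0.01  # re-scale by a factor of 0.01
u = laser[:-1]      # input u(t)
y_tg = laser[1:]    # next-step prediction target: y_tg(t) = u(t+1)
```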

The last two datasets are instances of the Mackey-Glass (MG) [4, 22] time-series, obtained by discretizing the following non-linear differential equation:

$$\begin{aligned} \frac{d u(t)}{d t} = \frac{0.2 \, u(t-\tau )}{1+u(t-\tau )^{10}} - 0.1 \, u(t), \end{aligned}$$
(8)

where \(\tau \) is a parameter of the system influencing its dynamical behavior. We generated two MG time-series using \(\tau = 17\) (MG17) and \(\tau = 30\) (MG30), representing cases of increasingly complex chaotic behavior. In both cases, the elements of the time-series were shifted by \(-1\) and then passed through the \(\tanh \) squashing function, as in [16, 18]. The two MG time-series allowed us to set up two next-step prediction tasks, where \(y_{tg}(t) = u(t+1)\) for each time-step t.
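A minimal sketch of the MG task setup, assuming a simple Euler discretization of Eq. 8; the step size, the constant initial history and the discarded warm-up length are our choices, not taken from the paper.

```python
import numpy as np

def mackey_glass(T, tau=17, dt=0.1, warmup=1000, init_value=1.2):
    """Euler discretization of Eq. 8, subsampled to one value per time unit."""
    steps_per_unit = int(round(1.0 / dt))
    delay = tau * steps_per_unit
    total = (T + warmup) * steps_per_unit
    x = np.full(delay + total, init_value)   # constant initial history
    for k in range(delay + 1, delay + total):
        x_tau = x[k - 1 - delay]             # delayed term u(t - tau)
        x[k] = x[k - 1] + dt * (0.2 * x_tau / (1.0 + x_tau ** 10) - 0.1 * x[k - 1])
    series = x[delay + warmup * steps_per_unit::steps_per_unit][:T]
    u = np.tanh(series - 1.0)                # shift by -1 and squash with tanh
    return u[:-1], u[1:]                     # next-step task: y_tg(t) = u(t+1)
```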

For NARMA10, MG17 and MG30 we generated datasets of 10000 time-steps, while the Laser dataset contained 10092 samples. In all cases, the available data was split into a training set, comprising the first 5000 time-steps, and a test set, comprising the remaining samples. The first 100 time-steps were used as a transient to wash out the initial conditions. The performance of the considered RC models was evaluated in terms of mean squared error (MSE) on all the tasks.
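For reference, a minimal sketch of the split and of the evaluation metric; the helper functions are ours, and u and y_tg denote the full input and target sequences of any of the tasks above.

```python
import numpy as np

def split_task(u, y_tg, train_len=5000):
    """First train_len steps for training, the rest for test (as in the paper)."""
    return (u[:train_len], y_tg[:train_len]), (u[train_len:], y_tg[train_len:])

def mse(y_pred, y_true, washout=0):
    """Mean squared error, optionally discarding an initial transient."""
    return np.mean((y_pred[washout:] - y_true[washout:]) ** 2)
```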

In our experiments, we considered DeepESNs with a total of 500 recurrent reservoir units, distributed evenly across the layers of the deep architecture, varying the number of layers L from 2 to 5. All the reservoir layers in the deep architecture shared the same values of the scaling hyper-parameters \(\rho \) and \(\omega _{il}\), i.e. \(\rho = \rho ^{(1)} = \ldots = \rho ^{(L)}\) and \(\omega _{il}= \omega _{il}^{(2)} = \ldots = \omega _{il}^{(L)}\). To account for sparsity, each reservoir unit was randomly connected to 5 units in the previous layer and to 5 units in the same layer. Of course, when considering the permutation, ring and chain reservoir topologies, the connectivity of the reservoir units in each layer followed the corresponding structure described in Sect. 3. In all the cases, we used fully-connected input weight matrices. For every task and choice of the reservoir topology, the DeepESN hyper-parametrization was chosen by model selection on a validation set comprising the last 1000 time-steps of the training split. To this end, we performed a random search with 50 network configurations for each number of layers, sampling the value of \(\rho \) from a uniform distribution in (0.1, 1), and the values of \(\omega _{in}\) and \(\omega _{il}\) from uniform distributions in (0.1, 2). The achieved results were averaged over 10 network guesses for each explored hyper-parametrization, and readout training was performed by pseudo-inversion. Finally, our experimental analysis was conducted in comparison with shallow ESN setups, considering the same reservoir topologies investigated in the DeepESN case and the same experimental setting described above, with the crucial exception that all the available reservoir units were organized into a single layer (i.e., \(L = 1\)). Also note that, to provide a fair comparison, the shallow reservoir configuration is not included among the DeepESN settings (i.e., for DeepESNs we always consider \(L>1\)).
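A minimal sketch of the model selection procedure described above; build_and_eval is a hypothetical callable that builds a network with the given scalings, trains the readout by pseudo-inversion and returns the validation MSE (its name and signature are ours).

```python
import numpy as np

def random_search(build_and_eval, n_configs=50, n_guesses=10, seed=0):
    """Random search over the scaling hyper-parameters rho, omega_in, omega_il."""
    rng = np.random.default_rng(seed)
    best_val, best_config = np.inf, None
    for _ in range(n_configs):
        rho = rng.uniform(0.1, 1.0)
        omega_in, omega_il = rng.uniform(0.1, 2.0, size=2)
        # Average the validation error over n_guesses random reservoir guesses.
        val = np.mean([build_and_eval(rho, omega_in, omega_il, seed=g)
                       for g in range(n_guesses)])
        if val < best_val:
            best_val, best_config = val, (rho, omega_in, omega_il)
    return best_val, best_config
```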

4.2 Results

The test MSE values obtained by DeepESNs for all the considered types of layer-wise reservoir topology are reported in Table 1. For the sake of comparison, the same table shows the results achieved by shallow ESNs under the same architectural conditions examined in the deep case. In all the cases, the sparse reservoir topology is considered as the baseline setup for our analysis. For completeness, in Appendix A we report the hyper-parametrization values selected on the validation set for all the considered architectural settings.

Table 1. Test MSE (and std) achieved by shallow ESN and DeepESN settings for different choices of the reservoir topology. The last column reports the number of layers selected for DeepESN. Best results for each task are underlined.

The performance values reported in Table 1 allow us to draw several lines of observation. First of all, our results confirm the effectiveness of the considered reservoir architectural variants already in the shallow setup, showing improved performance (i.e., a smaller MSE) with respect to the sparse baseline in all the cases (with the sole exception of permutation shallow reservoirs on Laser). Second, we observe that the performance of DeepESNs with constrained topology (i.e. permutation, ring and chain) improves on that of the sparse DeepESN in all the considered tasks (with the only exception of deep reservoirs with chain architecture on Laser). Moreover, we can see that DeepESN improves the results of shallow ESN in all the tasks and for all the choices of reservoir topology, both in the constrained architectural cases and for the base sparse reservoir setup. Taken together, the results in Table 1 clearly indicate the performance advantage arising from the synergy between deep organization and constrained topology as factors of architectural design of reservoirs. Giving structure to the reservoir architecture both at a coarser level, i.e. organizing the recurrent units into layers, and at a finer level, i.e. organizing the units of individual layers into cyclic or chain structures, amplifies the benefits brought by the two factors individually.

Finally, we notice that the best performing architecture in our experiments is the DeepESN with permutation reservoir topology, which obtained the smallest errors on all the tasks, and is put forward here as a particularly effective (yet sparse and efficient) approach to the architectural design of reservoir models. We leave to further studies the analysis of the dynamical properties that make DeepESN constructions based on permutation matrices so effective in applications, while here we limit ourselves to intuitive yet insightful considerations that might explain the observed results. On the one hand, recent works in the literature (e.g., [5, 8]) provided empirical evidence that higher layers in deep reservoirs tend to develop progressively more abstract temporal representations of the driving input, and are naturally characterized by longer memory. On the other hand, reservoir architectures based on the permutation topology present multiple ring sub-structures that (at least in the linear approximation) possibly feature maximized memory. The resulting reservoir provides a variety of memories (that can be easily controlled by scaling the strength of the recurrent connections), and the developed state representations are enriched [15]. Our results show that DeepESNs with permutation reservoir topology are able to effectively exploit both the advantages of deep recurrent architectures and those of multiple ring sub-structures.

5 Conclusions

In this paper we have investigated the role of reservoir topology in the architectural design of DeepESNs. Specifically, we focused on analyzing the effects of constraining the recurrent weight matrix of each layer according to permutation, ring and chain topologies. Numerical results on several RC benchmarks pointed out a striking beneficial effect arising from the combination of a deep reservoir construction with a structured organization of the recurrent units in each layer. Our results indicate that DeepESN with reservoir units arranged to obey a permutation scheme (i.e., forming multiple rings) provides a particularly advantageous design strategy for reservoirs, leading to the best performance in all the explored tasks.

While already providing interesting empirical evidence on the potentialities of deep RC architectures, the study presented in this paper opens the way to several directions of further research. First of all, the experimental analysis described here suggests that simplified deep RC models have a great potential that can be exploited extensively in real-world applications. Leveraging the parsimonious design approach resulting from the structured sparsity of reservoir units, the class of deep neural models studied in this work seems an ideal candidate, e.g., for embedding advanced learning capabilities on resource-constrained computing devices. On the methodological side, a natural extension of the work in this paper is to analyze the effect of a broader pool of reservoir architectural variants, including e.g. small-world [20], cycles with regular jumps [25] and concentric [1] reservoirs. Moreover, future research could pursue even further the simplification of architectural construction in deep RC models, reducing the impact of randomness in the network initialization in the same vein as the works on minimum complexity ESNs [24, 25]. Simplifying the reservoir structure locally to each layer can also be exploited from a more theoretically-oriented perspective, easing the mathematical analysis of dynamical properties naturally emerging in deep RNNs. In this regard, it is certainly interesting to extend fundamental mathematical results, e.g. pertaining to the short-term memory capacity [17, 24, 30] or to the approximation properties [14] of shallow reservoirs, to the case of DeepESNs. In addition to this, we believe that the role of orthogonality in deep reservoirs, studied in this paper in relation to the individual layers of the architecture, is an intriguing concept that deserves to be investigated also at the level of global (instead of local) DeepESN dynamics. Finally, the advantages of constrained DeepESN architectures delineated in this paper can be extended to larger classes of models, including e.g. deep RC for complex data structures [11], as well as fully trained deep RNNs.