
1 Introduction

Reservoir Computing (RC) [17, 19, 22] is a popular technique for efficiently training Recurrent Neural Networks (RNNs) by utilizing the stable neural dynamics of a fixed recurrent reservoir layer and a trainable readout for output computation. This approach has been successful in various applications, in particular for implementing distributed learning functionalities in embedded systems [2, 3, 6] and as a reference paradigm for neuromorphic hardware implementations of recurrent neural models [20, 21].

The effective operation of RC neural networks depends largely on the stability of their dynamics, which can be achieved through a global asymptotic stability property known as the Echo State Property in the widely used Echo State Network (ESN) model [14, 15]. This property ensures that the dynamics of the reservoir remain stable, but at the same time it constrains the reservoir's memory and state-space structure, thus hindering the transmission of input information across multiple time steps.

Recently, a new approach to overcome the limitations of fading memory in standard ESNs has been proposed, which involves discretizing an Ordinary Differential Equation (ODE) while ensuring stability and non-dissipative constraints. This approach computes the reservoir dynamics as the forward Euler solution of an ODE, hence the resulting model is called the Euler State Network (EuSN) [7, 9]. As their dynamics are neither unstable nor lossy, EuSNs are capable of preserving input information over time, making them better suited than ESNs for tasks involving long-term memorization. The EuSN approach has already been shown to exceed the accuracy of ESNs and achieve comparable performance levels to fully trainable state-of-the-art RNN models on time-series classification tasks, while still maintaining the efficiency advantage of RC [9]. At the same time, the study of the architectural organization of the EuSN reservoir system is still largely unexplored.

In this paper, we deepen the analysis of EuSN architectures and propose ways to improve the diversification of reservoir dynamics. Our first proposal is to introduce a variability factor by using different integration rates in different reservoir neurons. The second variability factor is to consider different diffusion coefficients, which result in different strengths of the self-feedback connections of the reservoir neurons. We analyze the effects of these factors, both individually and in synergy, on the resulting dynamical characterization of the reservoir system, and assess them in a wide range of experiments on time-series classification benchmarks.

The rest of this paper is organized as follows. In Sect. 2 we summarize the fundamental aspects of the RC methodology and of the popular ESN model, while in Sect. 3 we introduce the crucial concepts behind non-dissipative RC dynamics and the EuSN model. Then, in Sect. 4, we illustrate the proposed approach to enhance the diversification of reservoir dynamics in EuSNs. Our empirical analysis on several time-series classification benchmarks is given in Sect. 5. Finally, Sect. 6 concludes the paper.

2 Reservoir Computing

Reservoir Computing (RC) [17, 22] refers to a category of efficiently trainable recurrent neural models in which the internal connections pointing to the hidden recurrent layer, the reservoir, are left untrained after randomization subject to asymptotic stability constraints. The neural architecture is then completed by an output layer, the readout, which is the only trained component of the model. Within such a class, we introduce the popular Echo State Network (ESN) [14, 15] model, which employs the \(\tanh \) non-linearity and operates in discrete time-steps.

To set our notation, let us consider a reservoir that comprises \(N_h\) neurons, and that is stimulated by a driving (external) \(N_x\)-dimensional input signal. Accordingly, we denote the reservoir state and the input at time step t respectively as \(\textbf{h}(t) \in \mathbb {R}^{N_h}\), and \(\textbf{x}(t) \in \mathbb {R}^{N_x}\). We refer to the general case of leaky integrator ESNs [16], and describe the dynamical operation of the reservoir by the following iterated map:

$$\begin{aligned} \textbf{h}(t) = (1-\alpha ) \, \textbf{h}(t-1) + \alpha \, \tanh (\textbf{W}_h\;\textbf{h}(t-1) + \textbf{W}_x\;\textbf{x}(t)+\textbf{b}), \end{aligned}$$
(1)

where \(\textbf{W}_h \in \mathbb {R}^{N_h \times N_h}\) is the reservoir recurrent weight matrix, \(\textbf{W}_x \in \mathbb {R}^{N_h \times N_x}\) is the input weight matrix, \(\textbf{b} \in \mathbb {R}^{N_h}\) is the bias vector, and \(\tanh (\cdot )\) denotes the element-wise applied hyperbolic tangent non-linearity. Moreover, \(\alpha \in (0,1]\) represents the leaking rate hyper-parameter, influencing the relative speed of reservoir dynamics with respect to the dynamics of the input. Before being driven by the external input signal \(\textbf{x}(t)\), the reservoir state is typically initialized in the origin, i.e., \(\textbf{h}(0) = \textbf{0}\).
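For concreteness, the update of Eq. 1 can be sketched in a few lines of NumPy (a minimal illustration; variable names are ours):

```python
import numpy as np

def esn_step(h_prev, x, W_h, W_x, b, alpha):
    """One leaky-integrator ESN update, as in Eq. 1.

    h_prev : (N_h,) previous reservoir state
    x      : (N_x,) current input
    """
    h_tilde = np.tanh(W_h @ h_prev + W_x @ x + b)    # non-linear pre-activation
    return (1.0 - alpha) * h_prev + alpha * h_tilde  # leaky integration
```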

After their initialization, the weight values of \(\textbf{W}_h\), \(\textbf{W}_x\), and \(\textbf{b}\) are kept fixed in accordance with the Echo State Property (ESP) [23], which ensures global asymptotic stability of the reservoir dynamical system. In practice, the recurrent weights in \(\textbf{W}_h\) are typically randomly drawn from a uniform distribution over \((-1,1)\) and then adjusted to limit the resulting spectral radius \(\rho (\textbf{W}_h)\) to values smaller than 1. The value of \(\rho (\textbf{W}_h)\) has a direct influence on the dynamical properties of the resulting reservoir, and in particular on the extent of its fading memory. As such, it is a crucial hyper-parameter of the ESN model. As the spectral radius re-scaling of a (potentially large) matrix \(\textbf{W}_h\) can represent a computational bottleneck, in this paper we resort to an efficient initialization scheme introduced in [10], which leverages results in random matrix theory to provide a fast initialization of the recurrent weights in \(\textbf{W}_h\). The input weight matrix \(\textbf{W}_x\) and the bias vector \(\textbf{b}\) are also randomly initialized, and then re-scaled to control their magnitude. A widely used approach consists in drawing their values from uniform distributions over \((-\omega _x,\omega _x)\) and \((-\omega _b, \omega _b)\), respectively, where \(\omega _x\) is the input scaling hyper-parameter and \(\omega _b\) is the bias scaling hyper-parameter.
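A plain version of this initialization can be sketched as follows (for clarity we re-scale via an explicit eigenvalue computation, rather than with the fast scheme of [10]; the default hyper-parameter values are placeholders):

```python
import numpy as np

def init_esn(N_h, N_x, rho=0.9, omega_x=1.0, omega_b=0.1, seed=0):
    """Random ESN initialization with spectral-radius re-scaling."""
    rng = np.random.default_rng(seed)
    W_h = rng.uniform(-1.0, 1.0, size=(N_h, N_h))
    W_h *= rho / np.max(np.abs(np.linalg.eigvals(W_h)))  # enforce rho(W_h) = rho < 1
    W_x = rng.uniform(-omega_x, omega_x, size=(N_h, N_x))
    b = rng.uniform(-omega_b, omega_b, size=N_h)
    return W_h, W_x, b
```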

The ESN architecture also includes a trainable dense readout layer which, in the case of time-series classification tasks, is fed by the last reservoir state corresponding to each input time series. As the reservoir parameters are kept fixed, the readout is often trained in closed form [17], e.g., by pseudo-inversion or ridge regression.
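As an illustration of the closed-form training, a minimal ridge-regression readout for time-series classification might look as follows (a sketch with our own naming; a bias column and the details of the class encoding are omitted):

```python
import numpy as np

def train_readout(H_last, Y, reg=1.0):
    """Closed-form ridge-regression readout.

    H_last : (n_sequences, N_h) last reservoir state of each input time series
    Y      : (n_sequences, n_classes) one-hot class targets
    """
    A = H_last.T @ H_last + reg * np.eye(H_last.shape[1])
    W_out = np.linalg.solve(A, H_last.T @ Y)  # ridge solution
    return W_out                              # predict: argmax(H_last @ W_out, axis=1)
```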

Finally, it is worth remarking that the ESN model relies on the ESP stability property to regulate the reservoir dynamics. This property ensures that when the network is fed with a long input time-series, the initial state conditions eventually fade away, and the state encoding produced by the reservoir becomes stable. However, this characterization is linked to the fading memory and suffix-based Markovian organization of the reservoir state space (see, e.g., [8, 11, 13]). These properties make it difficult to transfer information across multiple time-steps, limiting the effectiveness of ESNs for tasks that require long-term memory retention of the input information.

3 Non-dissipative Reservoir Computing

To overcome the limitations of a fading memory reservoir system, an alternative approach based on discretizing ODEs subject to stability and non-dissipativity conditions has recently been proposed [7, 9]. The resulting RC model is derived from the continuous-time dynamics expressed by the following ODE:

$$\begin{aligned} \textbf{h}'(t) = \tanh (\textbf{W}_h\textbf{h}(t) + \textbf{W}_x\textbf{x}(t) + \textbf{b}), \end{aligned}$$
(2)

requiring that the corresponding Jacobian has eigenvalues with real parts \(\approx 0\). In addition to stability, such a critical condition implies non-dissipative system dynamics, which can be leveraged to effectively propagate the input information over multiple time-steps [5, 12]. Crucially, the requested condition on the eigenvalues of the Jacobian of Eq. 2 can be easily met architecturally by using an antisymmetric recurrent weight matrix, i.e., requiring that \(\textbf{W}_h = -\textbf{W}_h^T\). In such a case, indeed, the eigenvalues of both \(\textbf{W}_h\) and the Jacobian lie on the imaginary axis (see, e.g., [5, 9] for further details). Interestingly, this property does not need to be learned from data; rather, it can be enforced in the neural processing system by design. In other words, provided that the antisymmetric condition holds, the recurrent weight matrix \(\textbf{W}_h\) can be initialized with random weights and then left untrained, as in standard RC approaches. Finally, the resulting constrained ODE system is discretized by the forward Euler method, yielding the following state transition equation ruling the behavior of a discrete-time recurrent neural layer:

$$\begin{aligned} \textbf{h}(t) = \textbf{h}(t-1)\; + \varepsilon \tanh \Big ((\textbf{W}_h-\gamma \textbf{I})\textbf{h}(t-1)+\textbf{W}_x\textbf{x}(t)+\textbf{b}\Big ), \end{aligned}$$
(3)

where \(\textbf{W}_h= -\textbf{W}_h^T\) is the antisymmetric recurrent weight matrix, while \(\varepsilon \) and \(\gamma \) are two (typically small) positive hyper-parameters that represent, respectively, the integration step size and the diffusion coefficient used to stabilize the discretization [12]. As in standard ESNs, the weight values in \(\textbf{W}_h\), \(\textbf{W}_x\) and \(\textbf{b}\) are left untrained after initialization, and the resulting RC model is named Euler State Network (EuSN). In particular, the weight values of \(\textbf{W}_h\) in Eq. 3 can be obtained starting from a random matrix \(\textbf{W}\) whose elements are drawn from a uniform distribution in \((-\omega _r,\omega _r)\), with \(\omega _r\) being a recurrent scaling hyper-parameter, and then setting \(\textbf{W}_h = \textbf{W}- \textbf{W}^T\), which guarantees the antisymmetric property. The weight values in \(\textbf{W}_x\) and \(\textbf{b}\) are initialized as described in Sect. 2 for ESNs. Moreover, as in standard ESNs, the state is initialized in the origin, i.e., \(\textbf{h}(0) = \textbf{0}\), and the neural network architecture is completed by a readout layer, which is the only trained component of the model. It has already been shown in the literature that the EuSN model is extremely effective at propagating input information across many time steps, providing an excellent trade-off between complexity and accuracy in time-series classification tasks. Overall, EuSNs make it possible to retain the efficiency typical of untrained RC networks while achieving - and even exceeding - the accuracy of fully trained recurrent models (see [7, 9] for an extended comparison in this regard). In this paper, starting from the basic EuSN model, we show how its dynamics can be enriched by simple architectural modifications that affect the variety of its dynamic behavior.
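To make the construction concrete, a minimal sketch of the EuSN initialization and of the state update in Eq. 3 could read as follows (our own naming; the default hyper-parameter values are only indicative):

```python
import numpy as np

def init_eusn(N_h, N_x, omega_r=1.0, omega_x=1.0, omega_b=0.1, seed=0):
    """Untrained EuSN weights: antisymmetric recurrent matrix plus input/bias weights."""
    rng = np.random.default_rng(seed)
    W = rng.uniform(-omega_r, omega_r, size=(N_h, N_h))
    W_h = W - W.T                                    # antisymmetric by construction
    W_x = rng.uniform(-omega_x, omega_x, size=(N_h, N_x))
    b = rng.uniform(-omega_b, omega_b, size=N_h)
    return W_h, W_x, b

def eusn_step(h_prev, x, W_h, W_x, b, eps=1e-2, gamma=1e-2):
    """One forward Euler step of Eq. 3."""
    N_h = h_prev.shape[0]
    pre = (W_h - gamma * np.eye(N_h)) @ h_prev + W_x @ x + b
    return h_prev + eps * np.tanh(pre)
```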

4 Diversifying Dynamics in Euler State Networks Reservoirs

We start our analysis by noting that the reservoir system of an EuSN, as described in Eq. 3, has an effective spectral radius intrinsically close to unity. In fact, using standard arguments in the RC area, we can observe that the Jacobian of the system in Eq. 3, evaluated around the origin and for null input, takes the following form:

$$\begin{aligned} \textbf{J} = (1-\varepsilon \, \gamma )\;\textbf{I}+\varepsilon \, \textbf{W}_h, \end{aligned}$$
(4)

whose eigenvalues have a fixed real part, given by \(1-\varepsilon \gamma \), and an imaginary part given by \(\varepsilon \) times the imaginary part of a corresponding eigenvalue of \(\textbf{W}_h\). Using \(\lambda _k(\cdot )\) to denote the k-th eigenvalue of its matrix argument, we have that:

$$\begin{aligned} \lambda _k(\textbf{J}) = 1-\varepsilon \gamma + i \; \varepsilon \beta _k, \end{aligned}$$
(5)

where \(\beta _k = Im(\lambda _k(\textbf{W}_h))\). All eigenvalues are thus concentrated (vertically in the Gaussian plane) in a neighborhood of \(1-\varepsilon \gamma \). Given that both \(\varepsilon \) and \(\gamma \) take small positive values, we can notice that all the eigenvalues in Eq. 5 are close to 1 by design, and the eigenvalues of \(\textbf{W}_h\) have only a minor perturbation impact. This is illustrated in Fig. 1 (top, left). As analyzed in [9], this characterization can be interpreted as an architectural bias of the EuSN model towards critical dynamics. Notice that this bias is fundamentally different from the suffix-based Markovian nature of reservoir dynamics typical of the conventional ESN [8].
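This eigenvalue structure can be checked numerically; the short sketch below (with the same illustrative values used for Fig. 1, assuming \(\omega _r = 1\)) verifies that all real parts collapse onto \(1-\varepsilon \gamma \):

```python
import numpy as np

# Numerical check of Eqs. 4-5 (illustrative values).
N_h, eps, gamma = 500, 1e-2, 1e-2
rng = np.random.default_rng(0)
W = rng.uniform(-1.0, 1.0, size=(N_h, N_h))
W_h = W - W.T                                      # antisymmetric reservoir matrix

J = (1.0 - eps * gamma) * np.eye(N_h) + eps * W_h  # Eq. 4
eigs = np.linalg.eigvals(J)

print(np.allclose(eigs.real, 1.0 - eps * gamma))   # expected: True (fixed real part)
print(eigs.imag.min(), eigs.imag.max())            # small imaginary perturbations
```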

Despite the application success of the EuSN model already in its original form (as evidenced by the results in [7, 9]), there is still room to improve its dynamical characterization. In particular, while one of the keys to the success of RC is its ability to cover a wide range of dynamical behaviors by randomizing the reservoir parameters, in the case of EuSNs this randomization does not seem to be fully exploited. This can be seen, firstly, from the squeezing of the Jacobian eigenvalues onto a line and, secondly, from the observation that the reservoir state transition function in Eq. 3 contains a self-loop term modulated by the same \(\gamma \) value for all neurons. Accordingly, in the following, we introduce variants of the basic EuSN model in which different neurons can have different values of the step size parameter \(\varepsilon \) and of the diffusion parameter \(\gamma \).

Step Size Variability. We consider EuSN reservoir neurons with different values of the step size. The resulting state transition function is given by:

$$\begin{aligned} \textbf{h}(t) = \textbf{h}(t-1)\; + \pmb {\varepsilon } \odot \tanh \Big ((\textbf{W}_h-\gamma \textbf{I})\textbf{h}(t-1)+\textbf{W}_x\textbf{x}(t)+\textbf{b}\Big ), \end{aligned}$$
(6)

where \(\pmb {\varepsilon } \in \mathbb {R}^{N_h}\) is a vector containing the integration step sizes of the individual neurons, and \(\odot \) denotes component-wise (Hadamard) multiplication. As an effect of this modification, the neurons in the EuSN reservoir exhibit dynamics with variable integration speed, potentially offering greater richness to the encoding produced by the system. Moreover, the resulting Jacobian is given by:

$$\begin{aligned} \textbf{J} = diag(\textbf{1}-\gamma \pmb {\varepsilon })+diag(\pmb {\varepsilon }) \textbf{W}_h, \end{aligned}$$
(7)

where \(diag(\cdot )\) indicates a diagonal matrix with specified diagonal elements, and \(\textbf{1}\in \mathbb {R}^{N_h}\) is a vector of ones. The resulting eigenvalues are no longer characterized by the same real part, and present a more varied configuration, as illustrated in Fig. 1 (top, right). In the following, we use EuSN-\(\varepsilon \) to refer to an EuSN network whose reservoir is ruled by Eq. 6.
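Sketched in the same NumPy style as above (names are ours), the EuSN-\(\varepsilon \) update and its Jacobian read:

```python
import numpy as np

def eusn_eps_step(h_prev, x, W_h, W_x, b, eps_vec, gamma=1e-2):
    """EuSN-eps update (Eq. 6): one integration step size per neuron."""
    N_h = h_prev.shape[0]
    pre = (W_h - gamma * np.eye(N_h)) @ h_prev + W_x @ x + b
    return h_prev + eps_vec * np.tanh(pre)          # element-wise (Hadamard) product

def eusn_eps_jacobian(W_h, eps_vec, gamma=1e-2):
    """Jacobian at the origin for null input (Eq. 7)."""
    return np.diag(1.0 - gamma * eps_vec) + np.diag(eps_vec) @ W_h
```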

Diffusion Variability. We consider EuSN reservoir neurons with different values of the diffusion coefficient. In this case, the state transition function is given by:

$$\begin{aligned} \textbf{h}(t) = \textbf{h}(t-1)\; + \varepsilon \tanh \Big ((\textbf{W}_h-diag(\pmb \gamma ))\textbf{h}(t-1)+\textbf{W}_x\textbf{x}(t)+\textbf{b}\Big ), \end{aligned}$$
(8)

where \(\pmb {\gamma } \in \mathbb {R}^{N_h}\) is a vector containing the diffusion coefficients of the individual neurons. Differently from the previous case of EuSN-\(\varepsilon \), all the reservoir neurons operate at the same integration speed, but the reservoir topology is enriched by different strengths of the self-loops. The resulting Jacobian is given by:

$$\begin{aligned} \textbf{J} = diag(\textbf{1}-\varepsilon \pmb {\gamma })+\varepsilon \, \textbf{W}_h, \end{aligned}$$
(9)

whose eigenvalue variability is illustrated in Fig. 1 (bottom, left). In the following, EuSN-\(\gamma \) is used to refer to an EuSN network whose reservoir is described by Eq. 8.

Full Variability. We finally introduce an EuSN in which each reservoir neuron has its own integration step size and diffusion coefficient. This configuration includes both variability factors introduced by EuSN-\(\varepsilon \) and EuSN-\(\gamma \), and is denoted by EuSN-\(\varepsilon ,\gamma \). In this case, the reservoir state transition function reads as follows:

$$\begin{aligned} \textbf{h}(t) = \textbf{h}(t-1)\; + \pmb {\varepsilon } \odot \tanh \Big ((\textbf{W}_h-diag(\pmb \gamma ))\textbf{h}(t-1)+\textbf{W}_x\textbf{x}(t)+\textbf{b}\Big ), \end{aligned}$$
(10)

and the resulting Jacobian is given by:

$$\begin{aligned} \textbf{J} = diag(\textbf{1}-\pmb \varepsilon \odot \pmb {\gamma })+diag(\pmb \varepsilon )\, \textbf{W}_h. \end{aligned}$$
(11)

The reservoir thus exhibits both dynamics with multiple scales of integration speed and diverse self-loop strengths. Moreover, while the architectural bias toward Jacobian eigenvalues near 1 is preserved, the eigenvalues show a wider variability, as illustrated in Fig. 1 (bottom, right).
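The fully variable case can be sketched as a single update with per-neuron vectors for both hyper-parameters; EuSN-\(\varepsilon \) and EuSN-\(\gamma \) are recovered by keeping one of the two vectors constant (a minimal sketch with our own naming):

```python
import numpy as np

def eusn_full_step(h_prev, x, W_h, W_x, b, eps_vec, gamma_vec):
    """EuSN-eps,gamma update (Eq. 10): per-neuron step size and diffusion."""
    pre = (W_h - np.diag(gamma_vec)) @ h_prev + W_x @ x + b
    return h_prev + eps_vec * np.tanh(pre)

def eusn_full_jacobian(W_h, eps_vec, gamma_vec):
    """Jacobian at the origin for null input (Eq. 11)."""
    return np.diag(1.0 - eps_vec * gamma_vec) + np.diag(eps_vec) @ W_h
```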

Fig. 1. Eigenvalues of the Jacobian for a 500-dimensional reservoir in EuSN (top left), EuSN-\(\varepsilon \) (top right), EuSN-\(\gamma \) (bottom left), and EuSN-\(\varepsilon ,\gamma \) (bottom right). The plots correspond to a system with \(\omega _r = 1\), \(\varepsilon = 0.01\), \(\gamma = 0.01\). Variable values of the step size were randomly sampled from a uniform distribution on \([\varepsilon ,\varepsilon +0.1]\); variable values of the diffusion were randomly sampled from a uniform distribution on \([\gamma ,\gamma +0.1]\).

5 Experiments

We have experimentally evaluated the performance of the proposed EuSN variants (introduced in Sect. 4), in comparison to the base EuSN setup (described in Sect. 3) and the conventional ESN model (described in Sect. 2).

Datasets. The performed analysis involved experiments on a large pool of diverse time-series classification benchmarks. The first 23 datasets were taken from the UEA & UCR time-series classification repository [4], namely: Adiac, Blink, CharacterTrajectories, Computers, Cricket, ECG5000, Epilepsy, FordA, FordB, HandOutlines, HandMovementDirection, Handwriting, Heartbeat, KeplerLightCurves, Libras, Lightning2, Mallat, MotionSenseHAR, ShapesAll, Trace, UWaveGestureLibraryAll, Wafer, and Yoga. We also ran experiments on the IMDB movie review sentiment classification dataset [18] and on the Reuters newswire classification dataset from UCI [1], which were used in their publicly available online forms. For these two tasks, we applied a preprocessing step in order to represent each sentence by a time series of 32-dimensional word embeddings. For all datasets, we used the original training/test splits, applying a further 67\(\%\) - 33\(\%\) stratified splitting of the original training data into training and validation sets. Relevant information on the used datasets is reported in Table 1.

Table 1. Information on the time-series classification benchmarks used in our experiments, including the number of sequences in the training set (# Seq Tr) and in the test set (# Seq Ts), the maximum length of a sequence in the dataset (Length), the number of input features (Feat.), and the number of output classes (Classes).

Experimental Settings. In our experiments, we considered EuSNs with a number of recurrent units \(N_h\) ranging between 10 and 500. We explored values of \(\omega _r\), \(\omega _x\) and \(\omega _b\) in \(\{10^{-3},10^{-2},\ldots ,10\}\), and values of \(\varepsilon \) and \(\gamma \) in \(\{10^{-5},10^{-4},\ldots , 1\}\). For EuSN settings with step size variability, we explored values of \(\varDelta \varepsilon \) in \(\{10^{-5},10^{-4},\ldots , 1\}\), and generated the values of \(\pmb \varepsilon \) from a uniform distribution in \([\varepsilon , \varepsilon + \varDelta \varepsilon ]\). An analogous setting was used for the case with diffusion variability. For comparison, we ran experiments with standard ESNs, exploring values of \(\rho (\textbf{W}_h)\) in \(\{0.3, 0.6, 0.9, 1.2\}\), \(\alpha \) in \(\{10^{-5},10^{-4},\ldots , 1\}\), and \(\omega _x\) and \(\omega _b\) as for the EuSN models. In all cases, the readout was trained by ridge regression (with regularization coefficient equal to 1).
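For reference, the explored ranges can be encoded as a random-search space along the following lines (a hypothetical sketch; parameter names are ours and the actual experimental code is not reported here):

```python
import numpy as np

rng = np.random.default_rng(0)

search_space = {
    "N_h":     lambda: int(rng.integers(10, 501)),
    "omega_r": lambda: 10.0 ** rng.integers(-3, 2),  # {1e-3, ..., 10}
    "omega_x": lambda: 10.0 ** rng.integers(-3, 2),
    "omega_b": lambda: 10.0 ** rng.integers(-3, 2),
    "eps":     lambda: 10.0 ** rng.integers(-5, 1),  # {1e-5, ..., 1}
    "gamma":   lambda: 10.0 ** rng.integers(-5, 1),
    "d_eps":   lambda: 10.0 ** rng.integers(-5, 1),  # Delta-eps for step-size variability
}
configs = [{name: sample() for name, sample in search_space.items()}
           for _ in range(1000)]                     # 1000 random-search iterations
```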

Table 2. Results on the time-series classification benchmarks. For every task, we report the test-set accuracy achieved by ESN, EuSN, EuSN with variable step size (EuSN-\(\varepsilon \)), EuSN with variable diffusion (EuSN-\(\gamma \)), and EuSN with both variable step size and diffusion (EuSN-\(\varepsilon ,\gamma \)). Results are averaged over 10 random reservoir guesses (standard deviations are also reported). Best results for each task are highlighted in bold.
Table 3. Average ranking across all the time-series classification benchmarks.

For each model individually, the values of the hyper-parameters were tuned by model selection, by means of a random search with 1000 iterations. After the model selection process, the selected configuration of each model was instantiated 10 times (generating random reservoir guesses). These 10 instances were trained on the entire training set and then evaluated on the test set. Our code was written in Keras, and was run on a system with 2\(\,\times \,\)20 Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20 GHz.

Results. The achieved results are given in Table 2, which reports the test accuracy of each tested model, averaged over the 10 repetitions. The results in the table show the practical effectiveness of the architectural variants proposed in this paper, which overall achieve the best result in the vast majority of the cases examined. In particular, the variant comprising the maximum variability explored in the paper, i.e., EuSN-\(\varepsilon ,\gamma \), is the one that is found to be superior in most cases. Taken individually, the variability on the step size (EuSN-\(\varepsilon \)) is slightly less effective than the full variability, while the variability on the diffusion term (EuSN-\(\gamma \)) is individually the least effective. It is interesting to note that although in some cases the difference in performance between the best proposed variant and the baseline EuSN model is minimal, in many cases (including Blink, Computers, Cricket, HandMovementDirection, Handwriting, Heartbeat, Lightning2, Mallat, MotionSenseHAR, and Yoga) the improvement achieved is substantial. Furthermore, the results clearly confirm the accuracy advantage of the EuSN approach over traditional ESNs. In the few cases where ESNs exceed the accuracy of standard EuSNs (Computers, Cricket, MotionSenseHAR), the proposed EuSN variants achieve even higher accuracy.

Our analysis is further supported by the ranking values given in Table 3, which indicate that on average on the considered datasets, EuSN-\(\varepsilon ,\gamma \) and EuSN-\(\varepsilon \) models perform the best, followed by EuSN-\(\gamma \) and standard EuSN, while ESN has the worst performance.

6 Conclusions

In this paper we have empirically explored the effects of (architecturally) introducing dynamical variability in the behavior of Euler State Networks (EuSNs), a recently introduced Reservoir Computing (RC) methodology characterized by non-dissipative dynamics. Diversity has been enforced by using reservoir neurons with variable integration step sizes (EuSN-\(\varepsilon \)) and with different diffusion coefficients (EuSN-\(\gamma \)). Both approaches affect the diversification of the dynamical behavior of the model, as pointed out by analyzing the eigenvalues of the resulting Jacobian. Moreover, results on several time-series classification benchmarks showed the efficacy of the proposed variants, and of their synergy, as the EuSN model with both of the introduced variability factors (EuSN-\(\varepsilon ,\gamma \)) resulted in the highest accuracy in the largest number of cases. Beyond the clear advantage of basic EuSNs over conventional Echo State Networks in the explored tasks, the results suggest that, from a practical point of view, it is convenient to explore EuSNs in conjunction with at least the EuSN-\(\varepsilon ,\gamma \) variant.

Future work will focus on theoretical analysis of the effects of the dynamic variability factors introduced in this paper, and their application in pervasive artificial intelligence contexts.