Although the use of Recurrent Neural Networks (RNNs) in machine learning is growing rapidly, also as effective building blocks for deep learning architectures, a comprehensive understanding of their working principles is still missing [4, 26]. Of particular relevance are Echo State Networks (ESNs), introduced by Jaeger [13] and independently by Maass et al. [16] under the name of Liquid State Machine (LSM), which stand out among RNNs for the simplicity of their training. The basic idea behind ESNs is to create a randomly connected recurrent network, called a reservoir, and to feed it with a signal so that the network encodes the underlying dynamics in its internal states. The desired, task-dependent output is then generated by a readout layer (usually linear) trained to map the internal states to the target outputs. Despite the simplified training protocol, ESNs are universal function approximators [10] and have been shown to be effective in many relevant tasks [2, 3, 7, 19, 20, 21, 22].
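
To make this training protocol concrete, the following Python sketch builds a random reservoir, drives it with a signal, and fits a linear readout with ridge regression. It is a minimal illustration: the network size, scaling values and the toy delayed-recall task are arbitrary assumptions, not the settings used in the cited works.

```python
import numpy as np

rng = np.random.default_rng(0)
n_reservoir, n_input, washout = 100, 1, 50

# Randomly connected reservoir and input weights (fixed, never trained).
W = rng.uniform(-1, 1, (n_reservoir, n_reservoir))
W *= 0.9 / max(abs(np.linalg.eigvals(W)))          # rescale to spectral radius 0.9
W_in = rng.uniform(-1, 1, (n_reservoir, n_input))

def run_reservoir(u):
    """Drive the reservoir with the input sequence u and collect its internal states."""
    x = np.zeros(n_reservoir)
    states = []
    for u_t in u:
        x = np.tanh(W @ x + W_in @ np.atleast_1d(u_t))
        states.append(x.copy())
    return np.array(states)

# Toy task: recall the input seen 5 steps in the past, using a ridge-regression readout.
u_train = rng.uniform(-1, 1, 500)
y_train = np.roll(u_train, 5)
X, Y = run_reservoir(u_train)[washout:], y_train[washout:]
W_out = np.linalg.solve(X.T @ X + 1e-6 * np.eye(n_reservoir), X.T @ Y)
y_pred = X @ W_out                                 # readout prediction on the training states
```

Note that only W_out is learned: the reservoir and input weights stay fixed, which is what makes the training protocol so simple compared with fully trained RNNs.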

These networks are known to be sensitive to the setting of hyper-parameters like the Spectral Radius (SR), the input scaling and the sparseness degree [13], which critically affect their behaviour and, hence, their performance on a given task. Fine-tuning the hyper-parameters requires cross-validation or ad-hoc criteria for selecting the best-performing configuration. Experimental evidence and some theoretical results show that ESN performance is usually maximised in a very narrow region of hyper-parameter space called the Edge of Chaos (EoC) [1, 6, 14, 15, 23, 24, 25, 30]. However, beyond this region ESNs behave chaotically, resulting in useless and unreliable computations. At the same time, it is anything but trivial to configure the hyper-parameters so that the network lies on the EoC while still guaranteeing non-chaotic behaviour. An important property of ESNs is the Echo State Property (ESP), which asserts that the network's behaviour should depend only on the signal driving the network, regardless of its initial conditions [32]. Despite being at the foundation of theoretical results [10], the ESP in its original formulation raises some issues, mainly because it does not account for multi-stability and is not tightly linked to the properties of the specific input signal driving the network [17, 31, 32].
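
As an illustration of how these hyper-parameters are typically imposed, and of what the ESP requires in practice, the sketch below rescales a sparse random reservoir to a target spectral radius and then checks empirically that two copies driven by the same input forget their different initial conditions. All numerical values are assumptions chosen for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
n, spectral_radius, input_scaling, sparseness = 100, 0.95, 0.5, 0.9

W = rng.uniform(-1, 1, (n, n))
W[rng.random((n, n)) < sparseness] = 0.0                  # enforce the desired sparseness
W *= spectral_radius / max(abs(np.linalg.eigvals(W)))     # enforce the target spectral radius
W_in = input_scaling * rng.uniform(-1, 1, n)              # enforce the input scaling

u = rng.uniform(-1, 1, 300)
x_a, x_b = rng.normal(size=n), rng.normal(size=n)         # two different initial conditions
for u_t in u:
    x_a = np.tanh(W @ x_a + W_in * u_t)
    x_b = np.tanh(W @ x_b + W_in * u_t)

print(np.linalg.norm(x_a - x_b))   # close to zero when the echo state property holds
```

If the spectral radius is pushed well beyond 1, the two trajectories typically fail to converge: the network no longer forgets its initial conditions and its computations become unreliable, which is the practical symptom of leaving the non-chaotic regime.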

In this context, the analysis of the memory capacity of input-driven systems, measured as the ability of the network to reconstruct or remember past inputs, plays a fundamental role in the study of ESNs [8, 9, 12, 27]. In particular, ESNs are known to be characterized by a memory–nonlinearity trade-off [5, 11, 28]: introducing nonlinear dynamics in the network degrades its memory capacity. Moreover, it has recently been shown that optimizing memory capacity does not necessarily lead to networks with higher prediction performance [18].
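
For concreteness, the following sketch computes the standard delay-reconstruction memory capacity from a matrix of reservoir states (for instance, the states collected by the run_reservoir helper in the first sketch above). The maximum delay and washout length are illustrative assumptions.

```python
import numpy as np

def memory_capacity(states, u, max_delay=40, washout=50):
    """Sum over delays k of the squared correlation between the delayed input
    u(t - k) and its best linear reconstruction from the reservoir states."""
    mc = 0.0
    X = states[washout:]
    for k in range(1, max_delay + 1):
        y = np.roll(u, k)[washout:]                # target: the input delayed by k steps
        w = np.linalg.lstsq(X, y, rcond=None)[0]   # linear readout trained for this delay
        mc += np.corrcoef(X @ w, y)[0, 1] ** 2     # contribution MC_k of delay k
    return mc                                      # bounded above by the reservoir size
```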

Fig. 1. Results of the experiments on memory for different benchmarks. Panel (a) displays the white noise memorization task, (b) the MSO, (c) the x-coordinate of the Lorenz system, (d) the Mackey-Glass series and (e) the Santa Fe laser dataset. As described in the legend (f), different line types indicate results obtained on training and test data. The shaded areas represent the standard deviations, computed using 20 different realizations for each point.

In a recent paper [29], we proposed an ESN model that eliminates the critical dependence on hyper-parameters, resulting in networks that cannot enter a chaotic regime. In addition to this major outcome, we showed that such networks exhibit nonlinear behaviour in phase space characterised by a large memory of past inputs (see Fig. 1): the proposed model generates dynamics that are rich enough to approximate the nonlinear systems typically used as benchmarks. Our contribution was based on a nonlinear activation function that normalizes neuron activations onto a hyper-sphere. We showed that the spectral radius of the reservoir, usually the most important hyper-parameter for controlling ESN behaviour, plays a marginal role in the stability of the proposed model, although it does affect the capability of the network to memorize past inputs. Our theoretical analysis shows that this property derives from the impossibility for the system to display chaotic behaviour: the maximum Lyapunov exponent is always null. An interpretation of this important result is that the network always operates on the EoC, regardless of the setting chosen for its hyper-parameters.
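
A minimal sketch of the kind of hyper-sphere normalization described above is given below. It is an illustrative reading of the idea rather than a verbatim reproduction of the update rule in [29], and all sizes and scalings are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100
W = rng.normal(size=(n, n))
W *= 2.0 / max(abs(np.linalg.eigvals(W)))   # deliberately large spectral radius
W_in = rng.uniform(-1, 1, n)

def step(x, u_t):
    """One update: affine pre-activation followed by projection onto the unit hyper-sphere."""
    z = W @ x + W_in * u_t
    return z / np.linalg.norm(z)            # the state always has unit norm

x = rng.normal(size=n)
x /= np.linalg.norm(x)
for u_t in rng.uniform(-1, 1, 200):
    x = step(x, u_t)
```

Because every state is projected back onto the sphere, the trajectory cannot diverge no matter how the spectral radius is chosen, which is the intuition behind the claim that the model cannot enter a chaotic regime even though the spectral radius still modulates how past inputs are retained.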