1 Introduction

About ten years ago, a new trend in understanding, training, and using Recurrent Neural Networks (RNNs) was started with Echo State Networks (ESNs) [19, 21] and Liquid State Machines (LSMs) [34]. While the former came from the field of Machine Learning (ML) and the latter from computational neuroscience, both approaches share the same basic idea. It stems from the observation that, as long as an RNN possesses certain generic properties, supervised adaptation of all interconnection weights is not necessary: training only a memoryless supervised readout from it is enough to obtain excellent performance in many tasks (Fig. 1). The RNN is called a reservoir in this context.

Fig. 1: The difference between a full gradient descent training of an RNN (A) and the ESN training (B)

This was a welcome discovery, since training RNNs has always been much more difficult than training feedforward neural networks. Cyclic dependencies in RNNs lead to bifurcations during training: infinitesimally small changes to RNN parameters can lead to drastic, discontinuous changes in its behavior. This phenomenon may render classical gradient descent RNN training methods (like [52, 53]) non-converging [11]. Even if they do converge, the process is typically slow, computationally expensive, requires careful selection of learning parameters, and ends in a local minimum. Learning long-term dependencies in the data is hard [2] (but see [15] for an RNN architecture specialized on learning such dependencies, and [35] for recent progress in generic RNNs). Because of the complexity and computational costs, the number of neurons used in so-trained RNNs has typically been on the order of tens, which in turn limits their expressive capacity.

The approach started by ESNs and LSMs reinvigorated interest in RNN research and applications, a stream which became collectively known as Reservoir Computing (RC) [49]. The field now encompasses many related methods and extensions of the original idea (see [30] for an extensive overview; http://reservoir-computing.org is a web portal collectively maintained by leading RC groups). We will mention a few selected variants here, but let us start with the original basic ESN RC approach.

2 The Basic ESN Approach

Here are the update equations of a typical RNN used in ML with leaky-integrated discrete-time continuous-value units:

$$ \tilde{\mathbf{x}}(n)=\tanh \bigl({\mathbf{W}^{\mathrm{in}}}\bigl[1;\mathbf{u}(n)\bigr]+\mathbf{W}\mathbf{x}(n-1) \bigr), $$
(1)
$$ \mathbf{x}(n)=(1-\alpha)\mathbf{x}(n-1)+\alpha\tilde{\mathbf{x}}(n), $$
(2)

where n is discrete time, \(\mathbf{u}(n) \in{\mathbb {R}}^{N_{\mathrm{u}}}\) is the input signal, \(\mathbf{x}(n) \in{\mathbb{R}}^{N_{\mathrm{x}}}\) is a vector of reservoir neuron activations and \(\tilde{\mathbf{x}}(n) \in{\mathbb{R}}^{N_{\mathrm {x}}}\) is its update, all at time step n, tanh(⋅) is applied element-wise, [⋅;⋅] stands for a vertical vector concatenation, W in and W are the input and recurrent weight matrices respectively, and α∈(0,1] is the leaking rate. The model is also frequently used without the leaky integration, which is a special case obtained by setting α=1 and thus \(\tilde{\mathbf{x}}(n) \equiv\mathbf {x}(n)\). The linear readout layer is defined as

$$ \mathbf{y}(n)={\mathbf{W}^{\mathrm{out}}}\bigl[1;\mathbf{u}(n);\mathbf{x}(n) \bigr], $$
(3)

where \(\mathbf{y}(n) \in{\mathbb{R}}^{N_{\mathrm{y}}}\) is the network output and W out is the output weight matrix. An additional nonlinearity can be applied to y(n) in (3), and feedback connections W fb from y(n−1) to \(\tilde{\mathbf{x}}(n)\) can be added in (1).
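To make the notation concrete, here is a minimal NumPy sketch of the update equations (1)–(2) and the readout (3); the weight matrices and the leaking rate α are assumed to be given, and names such as `esn_step` are our own, not from the cited works.

```python
import numpy as np

def esn_step(x, u, W_in, W, alpha):
    """One reservoir update, equations (1) and (2)."""
    # Eq. (1): candidate state from bias, current input, and previous state
    x_tilde = np.tanh(W_in @ np.concatenate(([1.0], u)) + W @ x)
    # Eq. (2): leaky integration of the previous state and the update
    return (1.0 - alpha) * x + alpha * x_tilde

def esn_readout(x, u, W_out):
    """Linear readout, equation (3): y(n) = W_out [1; u(n); x(n)]."""
    return W_out @ np.concatenate(([1.0], u, x))
```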

The original method of RC introduced with ESNs [19] is to:

  • generate a large random reservoir (W in,W,α);

  • run it using the training input u(n) and collect the corresponding reservoir activation states x(n);

  • compute the linear readout weights W out from the reservoir using linear regression, minimizing the mean square error of the network output w.r.t. the training target signal y target(n);

  • use the trained network on new input data u(n) by computing y(n) employing the trained output weights W out.

Let us look at these steps in more detail.

For the approach to work, the reservoir must possess the echo state property, which can roughly be described as a fading memory of the input: trajectories of the reservoir state should converge given the same input, irrespective of the previous history. This is typically ensured by appropriately scaling the recurrent connection weights W [19]. A few other parameters, most importantly the scaling of the input weights W in and the leaking rate α, should also be adjusted for an optimal validation performance in a given task.
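In practice the scaling of W is often done by rescaling a randomly generated matrix to a target spectral radius around or below unity; a minimal sketch under that common convention, where the target radius `rho` and the input scaling are hyperparameters to be tuned on validation data:

```python
import numpy as np

def random_reservoir(N_x, N_u, rho=0.9, input_scaling=1.0, seed=None):
    """Generate random (W_in, W), rescaling W to spectral radius rho."""
    rng = np.random.default_rng(seed)
    W = rng.uniform(-1.0, 1.0, size=(N_x, N_x))
    W *= rho / np.max(np.abs(np.linalg.eigvals(W)))   # set the spectral radius of W
    W_in = input_scaling * rng.uniform(-1.0, 1.0, size=(N_x, 1 + N_u))  # bias + input columns
    return W_in, W
```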

While running the generated model with training data, the vectors [1;u(n);x(n)] as in (3) are collected into a matrix X and the desired teacher targets y target(n) into a matrix Y target, both having a column for every training time step n. The training is typically done by computing the output weights via ridge regression

$$ {\mathbf{W}^{\mathrm{out}}}=\mathbf{Y}_{\mathrm{target}}\mathbf {X}^{{\mathrm{T}}} \bigl(\mathbf{X}\mathbf{X}^{{\mathrm{T}}}+ \gamma^2 \mathbf{I} \bigr)^{-1}, $$
(4)

where I is the identity matrix and γ is a regularization parameter. For optimal results γ should also be selected through validation; note that the network needs no rerunning with a different γ to recompute W out. By avoiding training of the RNN connections W, the learning is done in a single pass through the training data and the optimal output weights W out are computed with high precision using the closed-form solution (4). This also enables the practical use of reservoirs with sizes of thousands or even tens of thousands of units on contemporary computers [46]. Also note that Y target X T and XX T in (4) can be computed incrementally and stored in memory, instead of Y target and X, for arbitrarily long training data sequences. Alternatively, W out can be continuously adapted by an online learning algorithm [19].
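A minimal sketch of this batch training step, built on the `esn_step` sketch above: the extended states are collected into X and the output weights are computed via the ridge regression (4). In practice one would typically also discard an initial transient of states before collecting, a detail omitted here.

```python
import numpy as np

def train_esn(inputs, targets, W_in, W, alpha, gamma):
    """Collect extended states [1; u(n); x(n)] into X and solve Eq. (4) for W_out."""
    x = np.zeros(W.shape[0])
    cols = []
    for u in inputs:                          # inputs: sequence of u(n) vectors
        x = esn_step(x, u, W_in, W, alpha)
        cols.append(np.concatenate(([1.0], u, x)))
    X = np.array(cols).T                      # shape (1 + N_u + N_x) x T
    Y = np.array(targets).T                   # shape N_y x T
    # Eq. (4): ridge regression; Y X^T and X X^T could also be accumulated incrementally
    return Y @ X.T @ np.linalg.inv(X @ X.T + gamma**2 * np.eye(X.shape[0]))
```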

Such simple and efficient RNN training was demonstrated to outperform fully-trained RNNs in many benchmark tasks, e.g., [17, 22, 23, 46, 50]. Some examples of applications are presented in Sect. 7.

3 Perspectives on RC

The principles of RC can be perceived from several different perspectives.

There are certain parallels between RC and kernel methods in ML. The reservoir can be seen as a nonlinear high-dimensional expansion x(n) of the input signal u(n). For classification tasks, input data u(n) which are not linearly separable in the original space \({\mathbb{R}}^{N_{\mathrm{u}}}\) often become linearly separable in the expanded space \({\mathbb{R}}^{N_{\mathrm{x}}}\) of x(n), where they are separated by W out. At the same time, the reservoir serves as a memory, providing the temporal context. The “kernel trick” is typically not used in RC; however, it is possible to define recursive, temporal context-sensitive kernels that integrate over a continuum of W in and W, which can be used as in regular Support Vector Machines (SVMs), but for sequence data [13]. SVM-style readouts can also be trained from the reservoirs [41].

The separation between the fixed reservoir and the adaptive readout can also be arrived at when analyzing the dynamics of a full gradient descent RNN training, and optimizing it. In an efficient version of gradient descent RNN training introduced by Atiya and Parlos [1] the output weights W out are adapted much more than W and W in [38], which led to a further optimization where they remain constant, an RC method called BackPropagation-DeCorrelation (BPDC) [43]. BPDC is an online RNN learning algorithm which runs with O(N x) time complexity.

From a biological perspective, RC gives a simple and yet powerful interpretation of how generic cortical circuits with no well-understood supervised adaptation can be utilized for purposeful computation [34]. Reservoirs also correspond well to how temporal information is spatially encoded in the brain and provide a context for interpreting current inputs [6]. Fixed RNNs had been employed as parts of sensory-motor sequence [8] and speech [10] learning architectures even before the original ESN and LSM publications.

Another advantage of RC is that the same reservoir can be used as a generic computational substrate for multiple tasks concerning the same input. For each task a new readout can be learned independently and without interfering with what has been learned before. This might have potential for general-purpose artificial intelligence mechanisms and corresponds well to natural intelligence.

4 Other RC Approaches

Despite the success of the original ESN approach described in Sect. 2, many extensions, modifications, and improvements are possible.

For example, intuitively, there should be something better than a random reservoir. The error landscape of the RNN parameter space for a given task is usually notoriously complicated (this is why gradient descent is difficult). The probability of exactly hitting the global, or even a local, minimum in this landscape by picking a random point is virtually equal to zero. The slight but ever-present variations in performance among randomly sampled reservoirs confirm this.

The linear readout is also quite limited in its expressive power.

Guided by such intuitions, the modern field of RC has substantially widened and differentiated. It has moved away from the initial paradigm of having a fixed RNN and training only the output. However, what still sets the RC approach apart from other RNN training methods is that the recurrent part (the reservoir) is generated or trained differently than the readout. This has become the modern paradigm of RC.

The RC paradigm of separating the reservoir and readout training allows for these two research directions to be pursued virtually independently, and the best results from both to be combined. There are numerous different methods proposed in the literature for both of the directions.

Output training is in essence a standard ML problem, where virtually any method capable of learning an input-to-output mapping can be employed, with its respective strengths and weaknesses.

For the reservoir part, there has also been a large number of proposals in the literature. They can roughly be classified into three categories:

  • Generic methods for generating RNNs with different neuron models, connectivity patterns and dynamics;

  • Unsupervised adaptation of the reservoir, based on the input data u(n), but not y target(n);

  • Supervised or semi-supervised adaptation of the reservoir (e.g., via reinforcement learning), using task-specific information from both u(n) and y target(n), but exploiting it differently than for the readout training.

Since the readout training can be very efficient, the quality of a reservoir for a particular task can be tested quickly by measuring the error of the readout. This makes RC a convenient and popular testing ground for many types of RNN models, topologies, unsupervised, reinforcement, and biologically inspired adaptation algorithms.

Most of these different approaches are reviewed in [30], updated in [29]. For the sake of brevity only a few of those have been mentioned here.

The numerous proposed RC modifications have introduced multiple improvements, often case-specific, extending the power of RC to new domains and offering new insights into the workings of RNNs. The original ESN approach of Sect. 2, however, still holds its ground for its combination of simplicity and power.

5 Beyond Neural Networks

The RC principle can also be seen as a strategy to implement useful computations on generic dynamical systems, treating them as reservoirs, either in simulations or even in physical instantiations. Thus RC has spread well beyond the world of artificial neural networks. In particular, it enables useful computation on hardware platforms where, e.g., it is hard or just impractical to implement equivalents of basic electronic logic gates and memory cells. Potential and functioning examples include: analog electronics [39, 40], randomly crystallized nonlinear electronic networks [44], opto-electronic [26, 36] and optical [47] systems, or just a bucket filled with water [12]. Readouts from such reservoirs are typically implemented in more conventional ways.

Many of these directions are very active research areas. In the long run such non-neural, physical reservoirs might significantly complement or even, for some domains, replace the omnipresent digital electronic computers.

6 Training of the Dynamics

Even if the reservoir is kept fixed, for some tasks the trained readouts are fed back to the reservoir, and thus the training process changes its dynamics. In other words, a recurrence exists between the reservoir and the readout. Pattern generation is a typical example of such a task. This is either realized by feedback connections W fb from the trained output y(n−1) to the reservoir \(\tilde{\mathbf{x}}(n)\), or by looping the output y(n−1) back as an input u(n) for the next update step n in a predictive generator mode in (1). Note that these two options are equivalent and just a matter of notation: u(n) and W in instead of y(n−1) and W fb. In some cases, however, both external input and output feedback can be present.

This extends the power of RC, because it no longer relies on fixed random input-driven dynamics to construct the output, but the dynamics are adapted to the task. This power has its price, because stability issues arise here. In order to avoid falling prey to the same difficulties as with full RNN training algorithms, two strategies are used in RC.

The first strategy is to disengage the recurrent relationship between the reservoir and the readout using teacher forcing and treat output learning as a feedforward task. This is done by feeding the desired output y target(n−1) through the feedback connections W fb (or W in) instead of the real output y(n−1) while learning. The target signal y target(n) “bootstraps” the learning process and if the output is learned with high precision (i.e., y(n)≈y target(n)), the recurrent system runs much the same way with the real y(n) in feedbacks after training as it did with y target(n) during training.
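A minimal sketch of this strategy in the generator setting, reusing the `esn_step` and `train_esn` sketches from Sect. 2 and treating y(n−1) as the input u(n); the names and layout are our own conventions:

```python
import numpy as np

def train_generator(y_target, W_in, W, alpha, gamma):
    """Teacher forcing: drive the reservoir with y_target(n-1), learn to produce y_target(n)."""
    return train_esn(y_target[:-1], y_target[1:], W_in, W, alpha, gamma)

def generate(n_steps, y0, W_in, W, W_out, alpha):
    """After training, run autonomously: loop the real output y(n-1) back as input."""
    x, y, ys = np.zeros(W.shape[0]), y0, []
    for _ in range(n_steps):
        x = esn_step(x, y, W_in, W, alpha)
        y = W_out @ np.concatenate(([1.0], y, x))
        ys.append(y)
    return ys
```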

There are some caveats here. The approach works very well if the output can be learned precisely [19]. However, if this is not the case, the distorted feedback leads to an even more distorted output and feedback at the next time step, and so on, with the actual generated output y(n) quickly diverging from the desired y target(n). Even with well-learned outputs the dynamical stability of the autonomous running system is often an issue. In both cases the problem is alleviated by some kind of regularization of the weights or “immunization” of the state and/or the feedbacks with noise.

The second strategy is using a specialized RC learning algorithm to train the outputs W out while the real feedbacks are present. The aforementioned BPDC algorithm is an efficient online option with an optimal time complexity [43]. A recent approach named FORCE learning uses a powerful 2nd-order online learning algorithm to vigorously adapt W out in the presence of the real feedbacks [45]. By the initial fast and strong adaptation of W out the feedbacks y(n) are kept close to the desired y target(n) from the very beginning of the learning process, similar to teacher forcing. It appears that FORCE learning is well suited to yield very stable and accurate neural pattern generators.
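To illustrate what such a 2nd-order online update looks like, here is a schematic recursive-least-squares step in the spirit of FORCE learning; it is a sketch under our own conventions (r denotes the extended state vector used by the readout, P a running inverse correlation estimate), not a faithful reproduction of the algorithm in [45].

```python
import numpy as np

def force_step(W_out, P, r, y_target):
    """One schematic RLS update of the readout while the real feedbacks are running."""
    k = P @ r
    c = 1.0 / (1.0 + r @ k)
    P -= c * np.outer(k, k)        # update the inverse correlation estimate of the states
    e = W_out @ r - y_target       # error committed with the real feedback in place
    W_out -= c * np.outer(e, k)    # strong immediate correction keeps y(n) near y_target(n)
    return W_out, P
```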

7 Applications

RC methods have been widely employed in more or less academic applications. The nature of these applications spans all kinds that are amenable to supervised modeling of temporal systems, e.g., temporal pattern classification, temporal pattern generation, time series prediction, timing, routing, memorizing, or controlling nonlinear systems. We refrain from giving an ad hoc selection here; googling “echo state network application” will retrieve a few hundred relevant instances.

Useful hints for setting up RC learning systems for practical tasks are given in [20, 48]. It should be clearly appreciated that, as always in machine learning, achieving very good results requires experience, experimentation, and insight into the nature of the respective task. Furthermore, an understanding of the basic principles of machine learning is a necessary precondition. Specifically, an insightful use of cross-validation and regularization is key for good performance. RC is not a miracle method that can be used out-of-the-box and then be expected to excel.

Instead of attempting a comprehensive overview, we will highlight a number of applications in which the authors have been (or still are) personally involved.

Speech Recognition

One of the textbook examples of temporal sequence recognition is speech recognition. ESNs and LSMs have been applied to this domain early on. The first approaches focused specifically on isolated recognition of Japanese vowels [20] and digits [49, 51]. The first attempt at continuous speech recognition was based on a rather atypical setup: a large committee of predictive classifiers using ESNs [42]. It showed good results on a benchmark dataset, but due to the use of a custom acoustic front-end, it is not trivial to compare to state-of-the-art work. More recently, in the European FP7 project ORGANIC (http://reservoir-computing.org/organic), which set out to establish neurodynamical architectures as a viable alternative to statistical methods for speech and handwriting recognition, different approaches to speech recognition have been applied. In [46] it was demonstrated that competitive phoneme recognition rates can be achieved using a straightforward application of the ESN setup on a hard benchmark dataset. Based on this front-end, ESN-HMM hybrids are currently being investigated to realize word recognition, with excellent results. Research on noise-robust recognition using ESNs [24] also demonstrates that they perform better than classic HMM approaches.

Handwriting Recognition

Handwriting recognition is in many respects very similar to speech recognition, and traditionally similar computational approaches have been employed [5]. Therefore it is no coincidence that the Organic project also hosts an industrial partner who develops text recognition solutions, e.g., for car number plate reading (easy) or address recognition in automated postal parcel sorting plants (difficult). This partner, Planet intelligent systems GmbH, has been developing ESN-based recognition modules in a long-standing co-operation with the Machine Learning group at Jacobs University. Important customers of Planet’s parcel sorting technology are FedEx and the US Postal Service. ESN-based offline text recognition functions by scanning the text with a virtual linear camera from left to right, obtaining a time series of pixel vectors, which is passed to a hierarchical reservoir recognizer architecture. On subsequent layers, increasingly aggregate “chunks” are recognized (e.g., letters → words). Importantly, no explicit segmentation routine is necessary (“segmentation-free” processing). The different layers are trained individually in a supervised way, which requires training data that are teacher-annotated on each representational level. Planet seeks collaboration with academic partners, and, quite remarkably, allows scientific results which emerge from such collaborations to be published (e.g., [23, 31]). Planet furthermore has made its very large annotated training dataset available to the scientific community as a benchmark (http://reservoir-computing.org/organic/benchmarks/294).

Robot Motor Control

ESNs can be conveniently trained as deadbeat controllers for nonlinear plants. The setup for such controllers is detailed in the original ESN patent document [18] and was first employed in practice for the tracking control of omniwheel Robocup robots at Fraunhofer AIS (now Fraunhofer IAIS) [37]. The training principle is to feed the controller ESN with the current plant output observation and an n-timestep-delayed version of the same, which enables the ESN to acquire an nth order model of the plant. In exploitation, the direct plant output feedback channel is replaced by the reference signal, while the input that received the delayed output observation during training now receives the direct observations. In a very different way, ESNs are currently being explored as neural pattern generators for the humanoid iCub robot (http://www.icub.org/) within the European FP7 project AMARSi (http://www.amarsi-project.eu/). Here, the objective is to obtain neural pattern generators which can be modulated by higher-level control input, e.g., in order to adapt the frequency, amplitude, offset, phase, or waveform of the generated pattern. Modulatable neural pattern generation is an extensive field of research [16]. The innovation offered by ESNs is a generic learning mechanism by which an existing neural pattern generator can acquire essentially arbitrary novel modes of modulatability by learning [28].
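To illustrate the input wiring described above for the deadbeat controller, a hypothetical sketch; the names, the fixed delay value, and the signal layout are our own and not taken from [18]:

```python
import numpy as np

N_DELAY = 5  # hypothetical delay n; in practice chosen to match the plant order

def controller_input_training(plant_out, t):
    """Training: the current plant output and its n-step-delayed version."""
    return np.concatenate((plant_out[t], plant_out[t - N_DELAY]))

def controller_input_exploitation(reference, plant_out, t):
    """Exploitation: the reference replaces the direct feedback channel,
    while the formerly delayed channel now receives the direct observation."""
    return np.concatenate((reference[t], plant_out[t]))
```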

Financial Forecasting

Here is an episode worth telling. In a graduate seminar held at Jacobs University in 2007, a group of five students with no previous exposure to machine learning engaged in an international financial time series prediction contest (http://neural-forecasting-competition.com/NN3/). The competition data consisted of a set of 111 time series of very diverse nature (it was part of the challenge to develop versatile predictors). Within 3 months, the students acquired the basic knowledge of standard data preprocessing methods used in the field, applied them to the raw data, developed ESN predictors, implemented them, submitted their predictions, and won the contest against competitors with years of professional experience in financial forecasting [17]. An informal account of this story is given at http://minds.jacobs-university.de/teaching/highlights. The predictions were obtained by combining the outputs of ensembles of 500 independently created reservoirs whose sizes ranged around 100 units. This episode underlines the simplicity of RC modeling and its motivational capacity in education as much as it illustrates its modeling performance.

Medical

Ghent University has been actively pursuing the use of ESNs in bio-medical applications with great success. ESNs have been applied to real-time detection of epileptic seizures with very low latency and high accuracy, outperforming the state of the art [7]. This technology would enable treatments for epilepsy that are based on closing the loop: rapidly detecting the seizure and actively counteracting it using, e.g., medication or brain stimulation. Based on the good results on seizure detection, we also started investigating various forms of Brain Computer Interfaces (BCIs). The most notable result here was that ESNs are very good at detecting the so-called “rest state”: the interval between specific thoughts. Combining ESNs with Common Spatial Patterns led to state-of-the-art results in motor imagery BCI [25].

Here we mentioned only applications in engineering and machine learning. This is one of two main directions of utilizing RC, the other being to model biological phenomena in the cognitive and neurosciences. This is often done with more biologically plausible reservoirs made up of spiking neurons, and is mostly associated with the “liquid state machine” flavor of RC. Pioneers in this area are Peter F. Dominey and Wolfgang Maass. Dominey was actually the first to explicitly spell out the RC principle, as early as 1995 [8], and ever since he has continued to extend and refine his models of the cortico-striatal processing loop for temporal sequence learning (e.g., [9, 10, 14]). Maass et al. widely explored the RC principle to understand generic computational properties of cortical microcircuits (e.g., [32–34]). Recently he and his group have added reinforcement learning [27] and Bayesian inference [4] to the picture of microcircuit adaptation. RC principles have been taken up by other leading researchers in computational neuroscience (e.g., [3, 45]).

8 Resources

Leading European RC groups jointly maintain an RC web portal at http://reservoir-computing.org. Here potential users can find introductory tutorials, an extensive bibliography, an option to subscribe to an RC mailing list, and links to a choice of RC tools. Among the latter we want to point out the OGER engine, a very comprehensive Python-based toolbox with interfaces to a number of standard (spiking) neural simulators (supporting the computational neuroscience branch of RC) and numerous pre-installed validation, regularization, and optimization methods supporting the machine learning side of RC. This engine has been developed within the Organic FP7 project.