1 Introduction

Backpropagation through time (BPTT) [6] is currently the most used algorithm for training recurrent neural networks (RNNs) and is derived from applying the chain rule (backpropagation) to the computational graph of the RNN unrolled in time. However, it suffers from undesirable characteristics both in terms of biological plausibility and large-scale applicability: (i) it is non-causal, since at each time step it requires future activity to compute the current gradient of the loss with respect to the parameters; and (ii) it is non-local, since it requires error signals to propagate backwards across all neurons and all synapses. An equivalent algorithm is real-time recurrent learning (RTRL) [8]. It recursively updates eligibility traces at each time step in order to be causal, and can therefore be run online. However, this comes at the cost of very high computational and memory complexity, since all temporal forward dependencies have to be maintained over time. RTRL is hence also non-local. Recently, a new online learning algorithm called e-prop [2] has been proposed, which is tailored for training recurrent spiking neural networks (RSNNs) with local neural dynamics. The aim was to find an alternative to BPTT (and RTRL) that is causal and local, but also computationally and memory efficient.

In this paper, we look in depth into the formalisation of BPTT and RTRL and bring e-prop into the same picture. To do so, we use the computational graph and notation of the architecture in the e-prop paper [2] to understand how these three algorithms relate to each other. Furthermore, a later paper [9] showed that e-prop is an approximation of RTRL. Here, by formalising RTRL in the same framework, we confirm this connection and make it more explicit (cf. Fig. 1). In the process, we uncover a family of algorithms determined by the level of approximation one allows in exchange for causality and locality. The main focus of this paper is to give intuition about and understanding of all of these gradient computation rules.

Fig. 1. Overview of all the algorithms and how they relate to each other.

1.1 Background

The most common way to train a model in supervised learning is to compute the gradient of a given loss \(\mathcal {L}\) with respect to the parameters \(\theta \), \(\nicefrac {d\mathcal {L}}{d\theta }\), and use this gradient in some gradient descent scheme of the form \(\theta (\tau +1) = \theta (\tau ) - f(\nicefrac {d\mathcal {L}}{d\theta })\), where \(\tau \) refers to the current update iteration and f is some gradient postprocessing. Therefore, we here focus on the algorithms for the computation (or approximation) of this gradient.
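As a minimal sketch of this update scheme (the quadratic toy loss and the plain postprocessing \(f(g) = \eta \, g\) with learning rate \(\eta \) are assumptions made here purely for illustration, not part of the formulation above):

```python
import numpy as np

def grad_loss(theta, target):
    # toy quadratic loss L = 0.5 * ||theta - target||^2, so dL/dtheta = theta - target
    return theta - target

def sgd_update(theta, grad, lr=0.1):
    # simplest postprocessing f(g) = lr * g (plain gradient descent)
    return theta - lr * grad

target = np.array([1.0, -2.0, 0.5])
theta = np.zeros_like(target)
for tau in range(200):                                   # update iterations tau
    theta = sgd_update(theta, grad_loss(theta, target))  # theta(tau+1) = theta(tau) - f(dL/dtheta)
# theta now approximates target
```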

In particular, we focus on a general class of RNN models with n computational units. These units have hidden states \(c_i^t\) at each time step that influence the hidden state at the next time step \(c_i^{t+1}\) (implicit recurrence) as well as the output of the unit at the current time step \(h_i^t\). The output \(h_i^t\) of a unit at a given time step influences the hidden state of the same and other units at the following time step \(c_j^{t+1}\) (explicit recurrence) through a weighted connection \(w_{ij}\). Finally, these outputs also constitute the model’s output (either directly or through some further computation, e.g. a linear readout) and are therefore subject to evaluation by a loss function \(\mathcal {L}\). The formalisation here is agnostic to the particular dimensionality and computational relation between the variables and therefore applies to different RNNs, such as LSTMs [4] or RSNNs [1].

Fig. 2. Simple example of a computational graph and the distinction between the total and partial derivative of f with respect to x.

For a function f(x, y(x)) we distinguish the notation of the total derivative \(\nicefrac {df}{dx}\) from that of the partial derivative \(\nicefrac {\partial f}{\partial x}\): the first represents the whole gradient through all paths, while the second expresses only the direct relation between the variables. To illustrate, using the chain rule (cf. Fig. 2) with the example \(y = 2x\) and \(f(x,y(x)) = xy\), the total derivative is calculated as:

$$\begin{aligned} \frac{df}{dx} = \frac{\partial f}{\partial x} + \frac{\partial f}{\partial y}\frac{\partial y}{\partial x} = y + x\cdot 2 = 4x \end{aligned}$$
(1)
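The distinction can be checked numerically; the following sketch (finite differences, a choice made here only for illustration) reproduces the example above, where the partial derivative evaluates to \(y = 2x\) and the total derivative to 4x:

```python
def f(x, y):
    return x * y              # f(x, y) = x*y: df/dx = y, df/dy = x

def y_of_x(x):
    return 2.0 * x            # y(x) = 2x

x0, eps = 1.5, 1e-6
# partial derivative: only the direct path x -> f, with y held fixed
partial = (f(x0 + eps, y_of_x(x0)) - f(x0, y_of_x(x0))) / eps         # ~ y = 2*x0 = 3
# total derivative: all paths, y follows x
total = (f(x0 + eps, y_of_x(x0 + eps)) - f(x0, y_of_x(x0))) / eps     # ~ 4*x0 = 6
print(partial, total)
```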

2 Backpropagation Through Time

In RNNs, since previous states affect the current state, the trick to applying the chain rule is to unroll the RNN in time, obtain a virtual feed-forward architecture and apply error backpropagation [7] to this computational graph. The resulting algorithm is BPTT [6]; it is currently the most used algorithm to compute this gradient since it reuses many previous computations and is therefore highly efficient. Here, we focus our attention on the role of the recurrences, dividing the algorithm into the following steps:

Fig. 3. Computational graph for A) explicit recurrence gradients, B) implicit recurrence gradients and C) the final computation of BPTT.

Explicit Recurrences: Compute \(\nicefrac {d\mathcal {L}}{dh^{t}_j}\) using the recursive definition given by the explicit recurrences (cf. Fig. 3A):

$$\begin{aligned} \begin{aligned} \frac{d\mathcal {L}}{dh^{t}_j}&= \frac{\partial \mathcal {L}}{\partial h^{t}_j} + \frac{d\mathcal {L}}{dc^{t+1}_j} \frac{\partial c^{t+1}_j}{\partial h_j^{t}} + \sum _{k \ne j} \frac{d\mathcal {L}}{dc^{t+1}_k} \frac{\partial c^{t+1}_k}{\partial h_j^{t}} \\&= \frac{\partial \mathcal {L}}{\partial h^{t}_j} + \sum _k \frac{d\mathcal {L}}{dc^{t+1}_k} \frac{\partial c^{t+1}_k}{\partial h_j^{t}} \end{aligned} \end{aligned}$$
(2)

Implicit Recurrences: Compute \(\nicefrac {d\mathcal {L}}{dc^{t}_j}\) using the value obtained in the previous step and the recursive definition given by the implicit recurrence (cf. Fig. 3B):

$$\begin{aligned} \frac{d\mathcal {L}}{dc^{t}_j} = \frac{d\mathcal {L}}{dh^{t}_j} \frac{\partial h^{t}_j}{\partial c_j^{t}} + \frac{d\mathcal {L}}{dc^{t+1}_j}\frac{\partial c_j^{t+1}}{\partial c_j^{t}} \end{aligned}$$
(3)

BPTT: Finally, compute \(\nicefrac {d\mathcal {L}}{dw_{ij}}\) using the values obtained in the previous two steps for all time steps (cf. Fig. 3C):

$$\begin{aligned} \frac{d\mathcal {L}}{dw_{ij}} = \sum _{t} \frac{d\mathcal {L}}{dc_j^{t}} \frac{\partial c^{t}_j}{\partial w_{ij}} \end{aligned}$$
(4)

We apply the explicit and implicit recurrences backwards from the maximum time T for all \(t \le T\), and finally sum the results of Eq. 3 according to Eq. 4 (a minimal code sketch of these steps is given at the end of this section). The existence of these recurrences gives BPTT the following problems [2, 5]:

Non-locality: Due to the explicit recurrences, we need to take into account how the current synaptic strength \(w_{ij}\) between the neurons i and j affects, through the output of the postsynaptic neuron, the future hidden values \(c^{t+1}_k\) for all \(k \ne j\) (cf. Eq. 2). This means that to compute the weight change for synapse \(w_{ij}\) we need information about the hidden variables \(c^{t+1}_k\) for all k. Moreover, this chain of dependencies continues at each time step, so that at the next time step we need information about the variables \(c^{t+2}_q\) for all q (including \(q \ne j\)), and so forth. By contrast, a local algorithm would not require messages to be passed from every neuron to every synapse to compute the gradients, but would only need information close to the given synapse.

Non-causality: Due to the recurrences shown before, we need to take into account all the gradients in the future (in the same way as current-layer computations need the gradients of posterior layers in feed-forward architectures), leading to two main problems. First, we need to compute the values of the variables (update locking) and the gradients (backwards locking) across all future time steps before computing the current gradient [3]. Secondly, all the values of all the variables across time have to be saved during the inference phase to be used while computing the gradients, requiring a memory overhead (O(nT) with n neurons and T time steps). By contrast, a causal algorithm would only need information from previous and current activity to compute the current gradient. Therefore, it could do so at each time step (online) and while inference is running.
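To make the three steps above concrete, here is a minimal sketch of BPTT (Eqs. 2–4) for a toy cell chosen only for illustration (leaky tanh units with a fixed leak, every unit read out directly under a squared loss); it is not the architecture of [2], but it makes the backward recursions and the O(nT) storage explicit:

```python
import numpy as np

def bptt_gradient(x, y, W, leak=0.8):
    """Toy BPTT (Eqs. 2-4) for c^t = leak*c^{t-1} + W.T @ h^{t-1} + x^t,
       h^t = tanh(c^t) and L = 0.5 * sum_t ||h^t - y^t||^2."""
    T, n = x.shape
    # forward pass: all states are stored (the memory overhead of BPTT)
    c = np.zeros((T + 1, n)); h = np.zeros((T + 1, n))
    for t in range(1, T + 1):
        c[t] = leak * c[t - 1] + W.T @ h[t - 1] + x[t - 1]
        h[t] = np.tanh(c[t])
    # backward pass, from t = T down to 1 (non-causal)
    grad_W = np.zeros_like(W)
    dLdc_next = np.zeros(n)                                  # dL/dc^{t+1}
    for t in range(T, 0, -1):
        dLdh = (h[t] - y[t - 1]) + W @ dLdc_next             # Eq. 2 (explicit recurrences)
        dLdc = dLdh * (1.0 - h[t] ** 2) + leak * dLdc_next   # Eq. 3 (implicit recurrence)
        grad_W += np.outer(h[t - 1], dLdc)                   # Eq. 4: dc^t_j/dw_ij = h^{t-1}_i
        dLdc_next = dLdc
    return grad_W
```

In this sketch \(w_{ij}\) = W[i, j] connects presynaptic neuron i to postsynaptic neuron j, so \(\partial c^{t+1}_k / \partial h^t_j\) = W[j, k] and the term W @ dLdc_next implements the sum over k in Eq. 2.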

3 Real-Time Recurrent Learning

RTRL [8] is a causal learning algorithm that can be implemented online and that computes the same gradient as BPTT, at the cost of being computationally more expensive. We derive the equation for RTRL starting from BPTT (cf. Eq. 4) by re-expressing the gradients that connect the current computation with future gradients, in order to obtain a causal algorithm. These gradients correspond to the implicit and explicit recurrences.

3.1 Re-expressing Implicit Recurrence

First, we re-express the implicit recurrence gradient, i.e. the term \(\frac{d\mathcal {L}}{dc^{t+1}_j}\frac{\partial c_j^{t+1}}{\partial c_j^{t}}\) in Eq. 3.

Unrolling the Recursion: To unroll, we plug the equation of implicit recurrence Eq. 3 into Eq. 4:

$$\begin{aligned} \begin{aligned} \frac{d\mathcal {L}}{dw_{ij}} =&\sum _{t'} \left( \frac{d\mathcal {L}}{dh^{t'}_j} \frac{\partial h^{t'}_j}{\partial c_j^{t'}} + \frac{d\mathcal {L}}{dc^{t'+1}_j}\frac{\partial c_j^{t'+1}}{\partial c_j^{t'}} \right) \frac{\partial c^{t'}_j}{\partial w_{ij}} \\ =&\sum _{t'} \left( \frac{d\mathcal {L}}{dh^{t'}_j} \frac{\partial h^{t'}_j}{\partial c_j^{t'}} + \left( \frac{d\mathcal {L}}{dh^{t'+1}_j} \frac{\partial h^{t'+1}_j}{\partial c_j^{t'+1}} + (\cdots ) \frac{\partial c_j^{t'+2}}{\partial c_j^{t'+1}} \right) \frac{\partial c_j^{t'+1}}{\partial c_j^{t'}} \right) \frac{\partial c^{t'}_j}{\partial w_{ij}} \\ =&\sum _{t'}\sum _{t \ge t'} \frac{d\mathcal {L}}{dh^t_j} \frac{\partial h^t_j}{\partial c^t_j} \frac{\partial c^t_j}{\partial c^{t-1}_j} \cdots \frac{\partial c^{t'+1}_j}{\partial c^{t'}_j} \frac{\partial c^{t'}_j}{\partial w_{ij}} \end{aligned} \end{aligned}$$
(5)

Flip Time Indices: The derived formula is non-causal since it requires future gradients (for each \(t'\) we sum products of gradients with factors starting from \(t \ge t'\)). To make it causal, we change the indices as follows:

$$\begin{aligned} \begin{aligned} \frac{d\mathcal {L}}{dw_{ij}} =&\sum _t \frac{d\mathcal {L}}{dh^t_j} \frac{\partial h^t_j}{\partial c^t_j} \sum _{t' \le t} \frac{\partial c^t_j}{\partial c^{t-1}_j} \cdots \frac{\partial c^{t'+1}_j}{\partial c^{t'}_j} \frac{\partial c^{t'}_j}{\partial w_{ij}} \end{aligned} \end{aligned}$$
(6)
Fig. 4. Computational graph for the implicit variable \(\epsilon ^t_{ij}\) with \(t'= t-2\).

Definition

(Implicit variable). We define the implicit variable \(\epsilon ^t_{ij}\) as:

$$\begin{aligned} \epsilon _{ij}^t := \sum _{t' \le t} \frac{\partial c^t_j}{\partial c^{t-1}_j} \cdots \frac{\partial c^{t'+1}_j}{\partial c^{t'}_j} \frac{\partial c^{t'}_j}{\partial w_{ij}} \end{aligned}$$
(7)

Backwards Interpretation: Starting at \(c^t_j\), the implicit variable represents the sum over all the paths going backwards through the implicit recurrence until \(c^{t'}_j\) and from there to the synaptic weight \(w_{ij}\) (cf. Fig. 4).

Forwards Interpretation: The implicit variable represents how the hidden variable of neuron j has been affected by the synapse weight \(w_{ij}\) through time, i.e. taking into account also how the hidden variables at previous time steps have affected the variables at the current time step through the implicit recurrence.

Incremental Computation: Importantly, there is a recursive relation to this variable that allows it to be updated at each time step:

$$\begin{aligned} \epsilon _{ij}^t = \frac{\partial c^t_j}{\partial c^{t-1}_j} \epsilon ^{t-1}_{ij} + \frac{\partial c^t_j}{\partial w_{ij}} \end{aligned}$$
(8)

Definition

(Implicit eligibility trace). Given the implicit variable \(\epsilon ^t_{ij}\), we define the implicit eligibility trace \(e^t_{ij}\) as:

$$\begin{aligned} e^t_{ij} := \frac{\partial h^{t}_j}{\partial c^{t}_j} \epsilon ^t_{ij} \end{aligned}$$
(9)

Since \(\nicefrac {\partial h^{t}_j}{\partial c^{t}_j}\) is causal and local, and so is the implicit variable \(\epsilon ^t_{ij}\) (it can be computed at each time step and is specific to each synapse), the implicit eligibility trace \(e^t_{ij}\) is also causal and local.
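For the same toy leaky tanh cell used in the BPTT sketch above (an illustrative assumption, where \(\partial c^t_j / \partial c^{t-1}_j\) is the constant leak and \(\partial c^t_j / \partial w_{ij} = h^{t-1}_i\)), Eqs. 8 and 9 reduce to a purely online, per-synapse update:

```python
import numpy as np

def update_traces(eps, h_prev, h, leak=0.8):
    """One online step of Eqs. 8-9 for the toy leaky tanh cell:
       eps[i, j] ~ eps^t_ij, e[i, j] ~ e^t_ij."""
    eps = leak * eps + h_prev[:, None]       # Eq. 8: dc^t_j/dc^{t-1}_j = leak, dc^t_j/dw_ij = h^{t-1}_i
    e = eps * (1.0 - h ** 2)[None, :]        # Eq. 9: dh^t_j/dc^t_j = 1 - tanh(c^t_j)^2
    return eps, e
```

Both updates only use the presynaptic output \(h^{t-1}_i\) and the postsynaptic state of neuron j, which is what makes the trace causal and local; the storage is one value per synapse.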

Final Equation with Re-expressed Implicit Recurrence: Combining all of this, and substituting Eq. 7 into Eq. 6, BPTT (cf. Eq. 4) becomes the following (cf. Fig. 5A):

$$\begin{aligned} \frac{d\mathcal {L}}{dw_{ij}} = \sum _{t} \frac{d\mathcal {L}}{dh^{t}_j} \frac{\partial h^{t}_j}{\partial c^{t}_j} \epsilon _{ij}^{t} = \sum _{t} \frac{d\mathcal {L}}{dh^{t}_j} e^{t}_{ij} \end{aligned}$$
(10)

Even though \(e^t_{ij}\) is causal and local, this equation as a whole is not, since the factor \(\nicefrac {d\mathcal {L}}{dh^{t}_j}\) still includes the explicit recurrences. E-prop will simply ignore these recurrences to solve this problem (cf. Fig. 5B, Sect. 4).

Fig. 5. Computational graph for A) BPTT re-expressed with the implicit eligibility trace (cf. Eq. 10) and B) symmetric e-prop.

3.2 Re-expressing Explicit Recurrences of Order 1

Now we re-express the explicit recurrences’ gradient analogously to the implicit recurrence in the previous section. First, we plug Eq. 2 (explicit recurrences) into Eq. 10 (re-expressed implicit recurrence):

$$\begin{aligned} \begin{aligned}\ \frac{d\mathcal {L}}{dw_{ij}} =&\sum _{t'} \left( \frac{\partial \mathcal {L}}{\partial h^{t'}_j} + \sum _k \frac{d\mathcal {L}}{dc^{t'+1}_k} \frac{\partial c^{t'+1}_k}{\partial h_j^{t'}}\right) e^{t'}_{ij} \\ =&\sum _{t'} \frac{\partial \mathcal {L}}{\partial h^{t'}_j} e^{t'}_{ij} + \sum _k \sum _{t'} \frac{d\mathcal {L}}{dc^{t'+1}_k} \frac{\partial c^{t'+1}_k}{\partial h_j^{t'}} e^{t'}_{ij} \end{aligned} \end{aligned}$$
(11)

The first term of this sum is already causal since it only requires the direct derivative and the implicit eligibility trace introduced in Eq. 9. Focusing on the second term, it represents the gradient up to \(c^{t'+1}_k\), the jump to \(h^{t'}_j\), and the implicit eligibility trace \(e^{t'}_{ij}\) stored there, which represents the sum over all of the paths from there to \(w_{ij}\).

Unrolling the Recursion: We can now unroll the recursion by plugging the equation of explicit recurrences Eq. 2 into the second term of Eq. 11:

$$\begin{aligned}&\sum _{t'} \frac{d\mathcal {L}}{dc^{t'+1}_k} \frac{\partial c^{t'+1}_k}{\partial h_j^{t'}} e^{t'}_{ij} = \sum _{t'} \left( \frac{d\mathcal {L}}{dh^{t'+1}_k} \frac{\partial h^{t'+1}_k}{\partial c_k^{t'+1}} + \frac{d\mathcal {L}}{dc^{t'+2}_k}\frac{\partial c_k^{t'+2}}{\partial c_k^{t'+1}} \right) \frac{\partial c^{t'+1}_k}{\partial h^{t'}_j}e^{t'}_{ij} \nonumber \\&= \sum _{t'} \left( \frac{d\mathcal {L}}{dh^{t'+1}_k} \frac{\partial h^{t'+1}_k}{\partial c_k^{t'+1}} + \left( \frac{d\mathcal {L}}{dh^{t'+2}_k} \frac{\partial h^{t'+2}_k}{\partial c_k^{t'+2}} + (\cdots ) \frac{\partial c_k^{t'+3}}{\partial c_k^{t'+2}} \right) \frac{\partial c_k^{t'+2}}{\partial c_k^{t'+1}} \right) \frac{\partial c^{t'+1}_k}{\partial h^{t'}_j}e^{t'}_{ij} \nonumber \\&= \sum _{t'}\sum _{t \ge t'} \frac{d\mathcal {L}}{dh^{t+1}_k} \frac{\partial h^{t+1}_k}{\partial c^{t+1}_k} \frac{\partial c^{t+1}_k}{\partial c^{t}_k} \cdots \frac{\partial c^{t'+2}_k}{\partial c^{t'+1}_k} \frac{\partial c^{t'+1}_k}{\partial h^{t'}_j} e^{t'}_{ij} \end{aligned}$$
(12)

Flip Time Indices: We flip the indices again to have a causal formula:

$$\begin{aligned} \begin{aligned} \sum _{t'} \frac{d\mathcal {L}}{dc^{t'+1}_k} \frac{\partial c^{t'+1}_k}{\partial h_j^{t'}} e^{t'}_{ij} =&\sum _{t} \frac{d\mathcal {L}}{dh^{t+1}_k} \frac{\partial h^{t+1}_k}{\partial c^{t+1}_k} \sum _{t' \le t} \frac{\partial c^{t+1}_k}{\partial c^{t}_k} \cdots \frac{\partial c^{t'+2}_k}{\partial c^{t'+1}_k} \frac{\partial c^{t'+1}_k}{\partial h^{t'}_j} e^{t'}_{ij} \end{aligned} \end{aligned}$$
(13)
Fig. 6. Computational graph for the explicit variable \(\beta _{ij}^t(k,k',...,j)\) with \(t'= t-1\).

Definition

(Explicit variable). We define the explicit variable \( \beta _{ij}^t(k,k',...,j)\) as:

$$\begin{aligned} \beta _{ij}^t(k,k', ...,j) := \sum _{t' \le t-1} \frac{\partial c^{t}_k}{\partial c^{t-1}_k} \cdots \frac{\partial c^{t'+2}_k}{\partial c^{t'+1}_k} \frac{\partial c^{t'+1}_k}{\partial h^{t'}_{k'}} \frac{\partial h^{t'}_{k'}}{\partial c^{t'}_{k'}} \beta _{ij}^{t'}(k', k'', ...,j) \end{aligned}$$
(14)

with \(\beta _{ij}^t(j) = \epsilon ^t_{ij}\).

Backwards Interpretation: The explicit variable represents starting at \(c^t_k\) and moving an arbitrary number of steps backwards through the implicit recurrence \(c^{t}_k \rightarrow c^{t-1}_k\), until at a certain \(t'\) one jumps to the output variable of another neuron \(h^{t'}_{k'}\), down to its hidden variable \(c^{t'}_{k'}\), and then starts again with a path now beginning at \(c^{t'}_{k'}\). In total, it considers all possible paths, of arbitrary length and spending an arbitrary number of steps in each of the neurons (through the implicit recurrences), from \(c^t_k\) to \(c^{t'}_j\) through the neurons \(k', k'', ...\), multiplied by the implicit variable \(\epsilon ^{t'}_{ij}\) (cf. Fig. 6).

Forwards Interpretation: The explicit variable accounts for the influence of the activity of neuron j at any previous time step \(c^{t'}_j\) on neuron k at the later time step \(c^t_{k}\), through the neurons \(k', k'', ...\).

Incremental Computation: The recursive relation to this variable that allows it to be updated at each time step is:

$$\begin{aligned} \beta _{ij}^t(k,k', ...,j) = \frac{\partial c^{t}_k}{\partial c^{t-1}_k} \beta _{ij}^{t-1}(k,k', ...,j) + \frac{\partial c^{t}_k}{\partial h^{t-1}_{k'}} \frac{\partial h^{t-1}_{k'}}{\partial c^{t-1}_{k'}} \beta _{ij}^{t-1}(k', ...,j) \end{aligned}$$
(15)

Definition

(Explicit eligibility trace). Given the explicit variable \(\beta _{ij}^t(k,k', ...,j)\), we define the explicit eligibility trace \(b_{ij}^t(k,k', ...,j)\) as:

$$\begin{aligned} b_{ij}^t(k,k', ...,j) := \frac{\partial h^{t}_k}{\partial c^{t}_k} \beta _{ij}^t(k,k', ...,j) \end{aligned}$$
(16)

with \(b_{ij}^t(j) = e^t_{ij}\).
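For the order-1 case used below (path (k, j)), Eq. 15 together with the base cases \(\beta _{ij}^{t}(j) = \epsilon ^{t}_{ij}\) and \(b_{ij}^{t}(j) = e^{t}_{ij}\) reduces to:

$$\begin{aligned} \beta _{ij}^t(k,j) = \frac{\partial c^{t}_k}{\partial c^{t-1}_k} \beta _{ij}^{t-1}(k,j) + \frac{\partial c^{t}_k}{\partial h^{t-1}_{j}} e^{t-1}_{ij}, \qquad b_{ij}^t(k,j) = \frac{\partial h^{t}_k}{\partial c^{t}_k} \beta _{ij}^t(k,j) \end{aligned}$$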

Since \(\nicefrac {\partial h^{t}_k}{\partial c^{t}_k}\) is causal and local, and the explicit variable \(\beta _{ij}^t(k,k', ...,j)\) is causal but only partially local (it requires message passing from the presynaptic neuron \(k'\) to the postsynaptic neuron k), then the explicit eligibility trace \(b_{ij}^t(k,k', ...,j)\) is also causal but only partially local.

Final Equation with Re-expressed Explicit Recurrence of Order 1: Substituting the explicit variable Eq. 14 in Eq. 13 yields:

$$\begin{aligned} \sum _k \sum _{t} \frac{d\mathcal {L}}{dc^{t+1}_k} \frac{\partial c^{t+1}_k}{\partial h_j^{t}} e^{t}_{ij} = \sum _k \sum _{t} \frac{d\mathcal {L}}{dh^{t+1}_k} b_{ij}^{t+1}(k,j) \end{aligned}$$
(17)

And substituting this back into the original equation (cf. Eq. 11):

$$\begin{aligned} \begin{aligned}\ \frac{d\mathcal {L}}{dw_{ij}} =&\sum _{t} \frac{\partial \mathcal {L}}{\partial h^{t}_j} e^{t}_{ij} + \sum _k \sum _{t} \frac{d\mathcal {L}}{dc^{t+1}_k} \frac{\partial c^{t+1}_k}{\partial h_j^{t}} e^{t}_{ij}\\ =&\sum _{t} \frac{\partial \mathcal {L}}{\partial h^{t}_j} e^{t}_{ij} + \sum _k \sum _{t} \frac{d\mathcal {L}}{dh^{t+1}_k} b_{ij}^{t+1}(k,j) \end{aligned} \end{aligned}$$
(18)

Here it becomes clear how setting this second term to 0 is what gives us e-prop (cf. the right arrow in Fig. 1), since we forcefully ignore the influence of a neuron on other neurons (and itself) through the explicit recurrences.

3.3 Re-expressing Explicit Recurrences of Order \(> 1\)

Now that we have seen how the explicit eligibility trace connects the activity of neuron j with other neurons through explicit recurrences, we can use it to re-express higher-order explicit recurrences.

Unroll the Recursion: Starting from the equation with one order of explicit recurrence already re-expressed (cf. Eq. 18), and alternately using the definition of the explicit recurrences (cf. Eq. 2) and the action of the explicit eligibility trace (cf. Eq. 17), we can repeat the previous steps for higher orders:

$$\begin{aligned} \frac{d\mathcal {L}}{dw_{ij}} =&\sum _{t} \frac{\partial \mathcal {L}}{\partial h^{t}_j} e^{t}_{ij} + \sum _{k_1} \sum _{t} \left( \frac{\partial \mathcal {L}}{\partial h^{t+1}_{k_1}} + \sum _{k_2} \frac{d\mathcal {L}}{dc^{t+2}_{k_2}} \frac{\partial c^{t+2}_{k_2}}{\partial h_{k_1}^{t+1}} \right) b_{ij}^{t+1}(k_1,j) \nonumber \\ =&\sum _{t} \frac{\partial \mathcal {L}}{\partial h^{t}_j} e^{t}_{ij} + \sum _{t} \sum _{k_1} \frac{\partial \mathcal {L}}{\partial h^{t+1}_{k_1}} b_{ij}^{t+1}(k_1,j) + \sum _{t} \sum _{k_1, k_2} \frac{d\mathcal {L}}{dc^{t+2}_{k_2}} \frac{\partial c^{t+2}_{k_2}}{\partial h_{k_1}^{t+1}} b_{ij}^{t+1}(k_1,j) \nonumber \\ =&\sum _{t} \frac{\partial \mathcal {L}}{\partial h^{t}_j} e^{t}_{ij} + \sum _{t} \sum _{k_1} \frac{\partial \mathcal {L}}{\partial h^{t+1}_{k_1}} b_{ij}^{t+1}(k_1,j) + \sum _{t} \sum _{k_1, k_2} \frac{d \mathcal {L}}{d h^{t+2}_{k_2}} b_{ij}^{t+2}(k_2, k_1,j) \nonumber \\ =&\sum _{t} \sum _{t' \ge t} \sum _{k_0=j, k_1,..,k_{t}} \frac{\partial \mathcal {L}}{\partial h^{{t'}}_{k_{t}}} b_{ij}^{t'}(k_{t}, \cdots , k_1,k_0=j) \end{aligned}$$
(19)

This gives us a high-level overview of how the different orders of explicit recurrences separate, which will lead to the definition of m-order e-prop (Sect. 4).

Flip Time Indices: As before, we change the time indices and reorganise to obtain a causal expression:

$$\begin{aligned} \begin{aligned} \frac{d\mathcal {L}}{dw_{ij}} =&\sum _{t} \sum _{k} \frac{\partial \mathcal {L}}{\partial h^{{t}}_{k}} \frac{\partial h^{t}_{k}}{\partial c^{t}_{k}} \sum _{t' \le t} \sum _{k_0=j, k_1,..,k_{t'-1}} \beta _{ij}^{t}(k, k_{t'-1}, \cdots , k_1,k_0 = j) \end{aligned} \end{aligned}$$
(20)

Definition

(Recurrence variable). We define the recurrence variable \( \alpha ^{t,r}_{ij}\) as:

$$\begin{aligned} \alpha ^{t,r}_{ij} = \sum _{t' \le t} \sum _{k_0=j, k_1,..,k_{t'-1}} \beta _{ij}^{t}(r, k_{t'-1}, \cdots , k_1,k_0 = j) \end{aligned}$$
(21)

Backwards Interpretation: Starting at current time t in neuron r, the recurrence variable represents all combinations of paths through any combination of neurons \(k_{t'-1},..,k_1\) ending in neuron j.

Forwards Interpretation: The recurrence variable accounts for the influence of the activity of neuron j at any previous time step on neuron r at the current time step t, through all possible paths through all neurons.

Incremental Computation: Once again, importantly, we have a recursive equation for computing the recurrence variable:

$$\begin{aligned} \alpha ^{t,r}_{ij} = \frac{\partial c^{t}_r}{\partial c_{r}^{t-1}} \alpha ^{{t-1},r}_{ij} + \sum _{k} \frac{\partial c^{t}_r}{\partial h_{k}^{t-1}} \frac{\partial h^{t-1}_{k}}{\partial c_{k}^{t-1}} \alpha ^{{t-1},k}_{ij} \end{aligned}$$
(22)

Definition

(Recurrence eligibility trace). Given the recurrence variable \(\alpha ^{t,r}_{ij}\), we define the recurrence eligibility trace \(a_{ij}^{t,r}\) as:

$$\begin{aligned} a_{ij}^{t,r} := \frac{\partial h^t_{r}}{\partial c_{r}^t} \alpha ^{t,r}_{ij} \end{aligned}$$
(23)

Since \(\nicefrac {\partial h^t_{r}}{\partial c_{r}^t}\) is causal and local, but the recurrence variable \(\alpha ^{t,r}_{ij}\) is causal and non-local, the recurrence eligibility trace \(a^{t,r}_{ij}\) is also causal but non-local. It is non-local in a way analogous to BPTT: each synapse ij has to store a variable representing how the past activity of any other neuron r affects its computation in the present, even if \(r \ne i,j\), and through all possible paths of synapses. Moreover, the recursive computation of \(\alpha ^{t,r}_{ij}\) requires summing the recurrence variables of all neurons, which requires non-local communication.

Final Equation of RTRL: Substituting Eq. 21 into Eq. 20 yields the final equation for RTRL (cf. Fig. 7):

$$\begin{aligned} \frac{d\mathcal {L}}{dw_{ij}} = \sum _{t} \sum _{k} \frac{\partial \mathcal {L}}{\partial h^{t}_{k}} \frac{\partial h^t_{k}}{\partial c_{k}^t} \alpha ^{t,k}_{ij} = \sum _{t} \sum _{k} \frac{\partial \mathcal {L}}{\partial h^{t}_{k}} a_{ij}^{t,k} \end{aligned}$$
(24)

We now have a causal but still non-local gradient computation algorithm.
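Staying with the toy leaky tanh cell of the earlier sketches (again an illustrative choice, not the architecture of [2]), one online step of the computations in Eqs. 22 and 24 could look as follows; note that the direct dependence \(\partial c^t_j/\partial w_{ij} = h^{t-1}_i\), which in the derivation is carried by the base case \(\beta ^t_{ij}(j) = \epsilon ^t_{ij}\), is injected explicitly into the recurrence variable here:

```python
import numpy as np

def rtrl_step(alpha_var, grad_W, W, h_prev, h, err, leak=0.8):
    """One online RTRL step for the toy leaky tanh cell. alpha_var[r, i, j]
       stores the recurrence variable alpha^{t-1,r}_ij; err is dL^t/dh^t."""
    n = W.shape[0]
    D_prev = 1.0 - h_prev ** 2                      # dh^{t-1}_k / dc^{t-1}_k
    # Eq. 22: implicit term + explicit term (messages from all neurons k)
    alpha_var = leak * alpha_var \
        + np.einsum('kr,k,kij->rij', W, D_prev, alpha_var)
    # direct dependence dc^t_j/dw_ij = h^{t-1}_i (nonzero only for r = j)
    alpha_var[np.arange(n), :, np.arange(n)] += h_prev
    # Eq. 24: accumulate the gradient online with the instantaneous loss gradient
    grad_W = grad_W + np.einsum('k,k,kij->ij', err, 1.0 - h ** 2, alpha_var)
    return alpha_var, grad_W
```

The tensor alpha_var has one entry per triple (r, i, j), i.e. O(n^3) memory and O(n^4) operations per time step for a dense W, which is the price RTRL pays for being causal.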

Fig. 7. Computational graph for the final computation of RTRL.

4 E-prop

The e-prop algorithm approximates the gradient by not considering the explicit recurrences in RNNs. E-prop was originally formulated for RSNNs, since it is considered more biologically plausible than BPTT and RTRL due to being causal and local [2]. The approximation of the gradient that defines e-prop is:

$$\begin{aligned} \frac{d\mathcal {L}}{dw_{ij}} \approx \sum _{t} \frac{\partial \mathcal {L}}{\partial h^{t}_j} e^t_{ij} \end{aligned}$$
(25)
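A minimal online implementation of this approximation for the toy leaky tanh cell of the earlier sketches (again an illustrative assumption, with every unit read out directly so that \(\partial \mathcal {L}^t/\partial h^t_j = h^t_j - y^t_j\)) could look like:

```python
import numpy as np

def eprop_gradient(x, y, W, leak=0.8):
    """Online e-prop (Eq. 25) for the toy leaky tanh cell: only causal, local
       traces and the instantaneous learning signal dL^t/dh^t_j are used."""
    T, n = x.shape
    c = np.zeros(n); h = np.zeros(n)
    eps = np.zeros((n, n))                      # implicit variable eps^t_ij
    grad_W = np.zeros_like(W)
    for t in range(T):
        h_prev = h
        c = leak * c + W.T @ h_prev + x[t]      # forward dynamics
        h = np.tanh(c)
        eps = leak * eps + h_prev[:, None]      # Eq. 8
        e = eps * (1.0 - h ** 2)[None, :]       # Eq. 9
        L_signal = h - y[t]                     # dL^t/dh^t_j (direct readout)
        grad_W += e * L_signal[None, :]         # Eq. 25, accumulated online
    return grad_W
```

Compared with the BPTT and RTRL sketches, no future activity and no cross-neuron messages are required: each synapse only combines its own trace with the learning signal of its postsynaptic neuron.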

Through the derivation of RTRL from BPTT, e-prop has arisen naturally in three different places. This allows for three equivalent interpretations of the approximation, each more detailed than the previous one.

First, and as originally proposed [2], we can understand e-prop from the equation that arises after re-expressing the implicit eligibility trace (cf. Eq. 10):

$$\begin{aligned} \frac{d\mathcal {L}}{dw_{ij}} = \sum _{t} \frac{d\mathcal {L}}{dh^{t}_j} e^t_{ij} \end{aligned}$$

Here we approximate the non-causal and non-local total derivative by the causal and local partial derivative, i.e. \(\nicefrac {d\mathcal {L}}{dh^{t}_j} \approx \nicefrac {\partial \mathcal {L}}{\partial h^{t}_j}\) (cf. Fig. 5).

Second, we can understand it from the equation after re-expressing the explicit recurrences of order 1 (cf. Eq. 18):

$$\begin{aligned} \frac{d\mathcal {L}}{dw_{ij}} = \sum _{t} \frac{\partial \mathcal {L}}{\partial h^{t}_j} e^{t}_{ij} + \sum _k \sum _{t} \frac{d\mathcal {L}}{dh^{t+1}_k} b_{ij}^{t+1}(k,j) \end{aligned}$$

Here we see explicitly what we are ignoring in the approximation, since ignoring this non-causal and non-local second term, i.e. setting \(\sum _k \sum _{t} \frac{d\mathcal {L}}{dh^{t+1}_k} b_{ij}^{t+1}(k,j) = 0\), defines e-prop. Ignoring these future dependencies on other neurons through explicit recurrences leads to a gradient computation algorithm that treats each neuron as producing an output only for the network’s computation and not to communicate with other neurons. Therefore, synapses arriving at neurons that are not directly connected to the readout of the RNN are not modified by e-prop (e-prop does not compute through additional feed-forward layers).

Finally, the most expressive interpretation comes from the equation that shows how to apply the order-1 re-expression of the explicit recurrences recursively to re-express higher orders (cf. Eq. 19):

$$\begin{aligned} \frac{d\mathcal {L}}{dw_{ij}} =&\sum _{t} \frac{\partial \mathcal {L}}{\partial h^{t}_j} e^{t}_{ij} + \sum _k \sum _{t} \frac{d\mathcal {L}}{dh^{t+1}_k} b_{ij}^{t+1}(k,j) \\ =&\sum _{t} \frac{\partial \mathcal {L}}{\partial h^{t}_j} e^{t}_{ij} + \sum _{t} \sum _{k_1} \frac{\partial \mathcal {L}}{\partial h^{t+1}_{k_1}} b_{ij}^{t+1}(k_1,j) + \sum _{t} \sum _{k_1, k_2} \frac{d \mathcal {L}}{d h^{t+2}_{k_2}} b_{ij}^{t+2}(k_2, k_1,j) \\ =&\cdots \end{aligned}$$

Here we define m-order e-prop as the approximation resulting from setting \(\sum _t \sum _{k_0 = j, k_1, ..., k_m} \frac{d \mathcal {L}}{d h^{t+m}_{k_m}} b_{ij}^{t+m}(k_m, \cdots , k_1,k_0=j)=0\) in the above equation. By increasing the order m we better approximate the gradient, at the cost of needing the activities of other neurons up to \(m-1\) time steps ahead to compute the current gradient of the loss. In this view, standard e-prop [2] is just the 1-order e-prop (fully causal and local, but the most inaccurate approximation). On the other extreme, the T-order e-prop (nothing is approximated or set to 0) corresponds to the full gradient computation, a middle form between BPTT and RTRL (the exact computation of the gradient, but completely non-causal and non-local). Moreover, synapses arriving at neurons connected to the readout through up to \(m-1\) synapses will be modified by m-order e-prop (m-order e-prop computes through up to \(m-1\) additional feed-forward layers).
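For concreteness, 2-order e-prop keeps the first two terms of Eq. 19 and sets the remaining sum to zero:

$$\begin{aligned} \frac{d\mathcal {L}}{dw_{ij}} \approx \sum _{t} \frac{\partial \mathcal {L}}{\partial h^{t}_j} e^{t}_{ij} + \sum _{t} \sum _{k_1} \frac{\partial \mathcal {L}}{\partial h^{t+1}_{k_1}} b_{ij}^{t+1}(k_1,j) \end{aligned}$$

That is, each synapse additionally needs the partial loss derivatives of the neurons its postsynaptic neuron projects to, one time step ahead.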

5 Conclusion

In this paper, we formally explored how BPTT, RTRL, and e-prop relate to each other. We extended the general scheme for re-expressing recurrences as eligibility traces from [2] and applied it iteratively to go from BPTT to RTRL. In the process, we found intermediate expressions that allow for better intuition of these algorithms. Moreover, we showed how e-prop can be seen as an extreme case of a series of approximation algorithms, which we coin m-order e-prop.