
1 Introduction

Animals demonstrate remarkable adaptability to their environments, a trait honed through the evolution of their morphological and neural structures [30, 46]. They are born equipped with both hard-wired behavioral routines (e.g. breathing, motor babbling) and learning capabilities for adapting based on their own experience. The costs and benefits of evolving hard-wired behaviors vs. learning capabilities depend on different factors, a central one being the level of unpredictability of environmental conditions across generations [17, 42]. Environmental challenges that are shared across many generations favor the evolution of hard-wired behavior (e.g. breathing). On the other hand, traits whose utility can hardly be predicted from their utility in previous generations are likely to be learned through individual development (e.g. learning a specific language). Some brain regions might have evolved to generically facilitate the learning of diverse behaviors. For example, central pattern generators (CPGs) enable limb babbling, which may facilitate locomotion, pointing and vocalizations in humans [24]. Another example is the prefrontal cortex (PFC), a brain region that maps inputs within a high-dimensional non-linear space from which they can be decoded by other brain regions, acting as a reservoir for computations [14, 23].

Fig. 1.

(left) A simplified view of the evolution of brain structures. The generating parameters of neural structures are modified in an evolutionary loop; in the developmental loop, agents equipped with these neural structures learn to interact with their environment. (right) The parallel with our computational approach. We propose a computational framework where an evolutionary algorithm optimizes hyperparameters that generate neural structures called reservoirs. These reservoirs are then integrated into RL agents that learn an action policy to maximize their reward in an environment.

This prompts an intriguing question: How can neural structures, optimized at an evolutionary scale, enhance the capabilities of agents to learn complex tasks at a developmental scale? To address this question, we propose to model the interplay between evolution and development as two nested adaptive loops: neural structures are optimized through natural selection over generations (i.e. at an evolutionary scale), while learning specific behaviors occurs during an agent’s lifetime (i.e. at a developmental scale). Figure 1 illustrates the interactions between evolutionary-scale and developmental-scale optimization. This model agrees with recent views on evolution that emphasize the importance of both scales for evolving complex skills [19, 20]. It is also compatible with the biological principle of a genomic bottleneck, i.e. the fact that the information contained in the genome of most organisms is not sufficient to fully describe their morphology [52]. In consequence, genomes must instead encode macro-level properties of morphological features such as synaptic connection patterns.

In line with these biological principles, we propose a novel computational approach, called Evolving Reservoirs for Meta Reinforcement Learning (ER-MRL), integrating mechanisms from Reservoir Computing (RC), Meta Reinforcement Learning (Meta-RL) and Evolutionary Algorithms (EAs). We use RL as a model of learning at a developmental scale [9, 29]. In RL, an agent interacts with a simulated environment through actions and observations, receiving rewards according to the task at hand. The objective is to learn an action policy from experience, mapping the observations perceived by the agent to actions in order to maximize cumulative reward over time. The policy is usually modeled as a deep neural network which is iteratively optimized through gradient descent. We use RC as a model of how a genome can encode macro properties of the agent’s neural structure. In RC, the connection weights of a recurrent neural network (RNN) are generated from a handful of hyperparameters (HPs) controlling macro-level properties of the network related to connectivity, memory and sensitivity. Our choice of RC relies on its parallels with biological brain structures such as CPGs and the PFC [15, 50], as well as on the fact that its indirect encoding of a neural network in global hyperparameters makes it compatible with the genomic bottleneck principle mentioned above. Being a cheap and versatile computational paradigm, reservoirs may have been favored by evolution [39].

We use Meta-RL to model how evolution shapes development [8, 32]. Meta-RL considers an outer loop, akin to evolution, optimizing HPs of an inner loop, akin to development. At the evolutionary scale, we use an evolutionary algorithm to optimize a genome specifying HPs of reservoirs. At a developmental scale, an agent equipped with a generated reservoir learns an action policy to maximize cumulative reward in a simulated environment. Thus, the objective of the outer evolutionary loop is to optimize hyperparameters of reservoirs in order to facilitate the learning of an action policy in the inner developmental loop.

Using this computational model, we run experiments in diverse simulated environments, e.g. 2D environments where the agent learns how to balance a pendulum and 3D environments where the agent learns how to control complex morphologies. These experiments provide support for three main hypotheses on how evolved reservoirs can affect development. First, they can facilitate solving partially-observable tasks, where the agent lacks access to all the information necessary to solve the task. In this case, we test the hypothesis that the recurrent nature of the reservoir enables inferring the unobservable information. Second, they can generate oscillatory dynamics useful for solving locomotion tasks; in this case, the reservoir acts as a meta-learned CPG. Third, they can facilitate the generalization of learned behaviors to new tasks unknown during the evolution phase, a core hypothesis in meta-learning. Here, our expectation is that HPs of reservoirs evolved across different environments will capture abstract properties useful for adaptation.

In Sect. 2, we detail the methods underlying our proposed model, including RL (Sect. 2.1), Meta-RL (Sect. 2.2), RC (Sect. 2.3) and EAs (Sect. 2.4). We then explain their integration into our ER-MRL architecture (Sect. 3). Our results, aligned with the three hypotheses, are presented in Sect. 4. Computational specifics and supplementary experiments can be found in the appendix. The source code and videos are accessible at this link.

2 Background

Fig. 2.

Our proposed architecture, called ER-MRL, integrates several ML paradigms. We consider an RL agent learning an action policy (a), having access to a reservoir (c). We consider two nested adaptive loops in the spirit of Meta-RL (b). Our proposed architecture (d) consists of evolving HPs \(\varPhi \) for the generation of reservoirs in an outer loop. In an inner loop, the agent learns an action policy that takes as input the neural activations of the reservoir. The policy is trained using RL in order to maximize episodic return. Section 2 provides the computational details of each ML paradigm.

2.1 Reinforcement Learning as a Model of Development

Reinforcement Learning (RL) involves an agent that interacts with an environment by taking actions, receiving rewards, and learning an action policy in order to maximize its accumulated rewards (Fig. 2a). This interaction is formalized as a Markov Decision Process (MDP) [33]. An MDP is represented as a tuple \((S, A, P, p_0, R)\), where S is the space of possible states of the environment, A is the space of actions available to the agent, \(P(s_{t+1}|s_t, a_t)\) is the transition function specifying how the state at time \(t+1\) is determined by the current state and action at time t, \(p_0\) represents the initial state distribution, and \(R(s_t, a_t)\) defines the reward received by the agent for a specific state-action pair. At each time step of an episode lasting T time steps, the agent observes the environment’s state \(s_t\), takes an action \(a_t\), and receives a reward \(r_t\). The environment then transitions to the next state according to \(P(s_{t+1}|s_t, a_t)\). The objective of RL is to learn a policy \(\pi _\theta (a|s)\) that maps observed states to actions in order to maximize the cumulative discounted reward G over time, where \(G=\sum _{t=0}^{T} \gamma ^t r_t\) [44]. The parameter \(\gamma < 1\) discounts future rewards during decision making.
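
To make the return explicit, here is a minimal sketch (plain Python, not tied to any particular RL library) of how the discounted return G is accumulated from the rewards of one episode:

```python
def discounted_return(rewards, gamma=0.99):
    """Compute G = sum_t gamma^t * r_t over one episode."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Example: three steps with reward 1 and gamma = 0.9 give 1 + 0.9 + 0.81 = 2.71
print(discounted_return([1.0, 1.0, 1.0], gamma=0.9))
```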

In Deep RL [21], the policy is implemented as an artificial neural network, whose connection weights are iteratively updated as the agent interacts with the environment. In all conducted experiments, we employ the Proximal Policy Optimization (PPO) RL algorithm [38] (see details in Sect. 6.1).
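
For concreteness, the sketch below trains a feedforward PPO policy on a standard control task. It assumes the Stable-Baselines3 implementation of PPO and the Gymnasium Pendulum-v1 environment; these are illustrative choices, not necessarily the exact experimental setup of the paper.

```python
import gymnasium as gym
from stable_baselines3 import PPO

# Developmental-scale learning: a PPO agent with a feedforward (MLP) policy.
env = gym.make("Pendulum-v1")
model = PPO("MlpPolicy", env, verbose=0)
model.learn(total_timesteps=100_000)  # iterative policy updates from interaction

# Roll out the learned policy for one episode.
obs, _ = env.reset()
done = False
while not done:
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, _ = env.step(action)
    done = terminated or truncated
```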

2.2 Meta Reinforcement Learning as a Model of the Interplay Between Evolution and Development

While RL has led to impressive applications [4, 25, 40], it suffers from several limitations: the learned policy is specific to the task at hand, does not necessarily generalize well to variations of the environment, and requires a large amount of data to converge. To address these issues, Meta Reinforcement Learning (Meta-RL) [3] aims at training agents that learn how to learn, i.e. agents that can quickly adapt to new tasks or environments unknown during training. It is based on two nested adaptive loops: an outer loop, analogous to evolution, optimizes the HPs of an inner loop, analogous to development (Fig. 2b) [31, 32]. The objective of the outer loop is to maximize the average performance of the inner loop on a distribution of environments. Formally, a set of HPs \(\varPhi \) is meta-optimized in the outer loop, with the objective of maximizing the average performance of a population of RL agents conditioned on \(\varPhi \). In this paper, we leverage the RC framework, where \(\varPhi \) corresponds to HPs encoding macro-level properties of an RNN, as explained in the next subsection.

2.3 Reservoir Computing as a Model of Neural Structure Generation

Meta-RL algorithms often directly optimize the weights of an RNN through backpropagation in the outer loop [10, 11]. While this technique has demonstrated remarkable efficacy, it is ill-suited for addressing the research question outlined in the introduction, due to its lack of biological plausibility in two main aspects: (1) evolutionary-scale adaptation cannot rely on backpropagation mechanisms [43] and (2) the notion that evolution directly fine-tunes neural network weights contradicts the genomic bottleneck principle mentioned in the introduction [52]. Our method instead evolves RNNs based on the Reservoir Computing (RC) paradigm: rather than directly optimizing the neural network weights at the evolutionary scale, it optimizes HPs encoding macro-level properties of randomly generated recurrent networks.

The fundamental idea behind RC is to create a dynamic ‘reservoir’ of computation, where inputs are nonlinearly and recurrently recombined over time [22]. This provides a set of dynamic features from which a linear ‘readout’ can be easily trained: such training is equivalent to selecting and combining the features relevant to the given task (Fig. 2c).

A reservoir is generated from a few HPs which play a crucial role in shaping the reservoir dynamics. These include the number of neurons in the reservoir, the spectral radius sr (controlling the level of recurrence in the generated network), the input scaling iss (controlling the strength of the network’s inputs), and the leak rate lr (controlling how much the neurons retain past information); we explain reservoir HPs in more detail in Appendix 6.1. In this paper, we propose to meta-optimize the reservoir HPs \(\varPhi = (sr, iss, lr)\) in a Meta-RL outer loop, using the evolutionary algorithms explained in the next subsection. We then explain how we integrate RC with RL in Sect. 3.
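
As an illustration, these HPs appear directly as constructor arguments in libraries such as ReservoirPy; the sketch below (with placeholder values, not the evolved ones) generates a reservoir and collects its states on a random input sequence:

```python
import numpy as np
from reservoirpy.nodes import Reservoir

# Macro-level HPs: the connection weights themselves are drawn at random.
reservoir = Reservoir(
    units=100,          # number of neurons
    sr=0.9,             # spectral radius: level of recurrence
    input_scaling=1.0,  # iss: strength of the inputs
    lr=0.3,             # leak rate: how much past information is retained
)

inputs = np.random.uniform(-1, 1, size=(50, 3))  # 50 timesteps of a 3-dim signal
states = reservoir.run(inputs)                   # internal states, shape (50, 100)
print(states.shape)
```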

2.4 Evolutionary Algorithms as a Model of Evolution

Evolutionary Algorithms (EAs) draw inspiration from the fundamental principles of biological evolution [2, 36], where species improve their fitness through the selection and variation of their genomes. EAs iteratively improve a population of candidate parameterized solutions to a given optimization problem, selecting those with higher fitness (i.e. better performance) and mutating their parameters for the next generation.

In our approach, we utilize the Covariance Matrix Adaptation Evolution Strategy (CMA-ES) [13] as our evolutionary algorithm to meta-optimize the HPs \(\varPhi \) of reservoirs. In CMA-ES, a population of candidate HPs is sampled from a multivariate Gaussian distribution with mean \(\mu \) and covariance matrix V. The fitness of each sample \(\varPhi _i\) of the population is evaluated (see Sect. 3 for how we do it in our proposed method). The Gaussian distribution is then updated by weighting each sample proportionally to its fitness, resulting in a new mean and covariance matrix biased toward solutions with higher fitness. This process iterates until either the generated HPs converge towards sufficiently high fitness values, or a predefined budget of candidate evaluations is reached.
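
For illustration, the ask/tell interface of the pycma package (one possible implementation, used here as an assumption) captures this loop in a few lines; note that pycma minimizes, so fitness values are negated:

```python
import cma

def fitness(phi):
    """Toy stand-in for the real objective: average return of agents built from phi = (sr, iss, lr)."""
    sr, iss, lr = phi
    return -((sr - 0.9) ** 2 + (iss - 1.0) ** 2 + (lr - 0.3) ** 2)

# Initial mean (a guess for sr, iss, lr) and initial step-size.
es = cma.CMAEvolutionStrategy([1.0, 1.0, 0.5], 0.3)
while not es.stop():
    candidates = es.ask()                                   # sample a population of HPs
    es.tell(candidates, [-fitness(c) for c in candidates])  # update mean and covariance
best_phi = es.result.xbest
```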

3 Evolving Reservoirs for Meta Reinforcement Learning (ER-MRL)

General Approach. Our objective is to devise a computational framework to address a fundamental question: How can neural structures adapt at an evolutionary scale, enabling agents to better adapt to their environment at a developmental scale? To this end, we integrate the Machine Learning paradigms presented above. The architecture is illustrated in Fig. 2d and the optimization procedure in Fig. 3. We call our method ER-MRL, for “Evolving Reservoirs for Meta Reinforcement Learning”.

The ER-MRL method encompasses two nested optimization loops (as in Meta-RL, Sect. 2.2). In the outer loop, operating at an evolutionary scale, the HPs \(\varPhi \) for generating a reservoir (Sect. 2.3) are optimized using an evolutionary algorithm (Sect. 2.4). In the inner loop, operating at a developmental scale, an RL algorithm (Sect. 2.1) learns an action policy \(\pi _\theta \) using the reservoir state as input. In other words, the outer loop meta-learns HPs able to generate reservoirs resulting in maximal average performance over multiple inner loops. The whole process is illustrated in Fig. 3 and detailed below.

Fig. 3.

In the evolution phase (top), CMA-ES refines Reservoir HPs \(\varPhi \). At each generation i of the evolution loop (left), a population \(\varPhi _i : \{\varPhi _i^1, \ldots , \varPhi _i^n\}\) of HPs is sampled from the CMA-ES Gaussian distribution. Each \(\varPhi _i^j\) undergoes evaluation on multiple random seeds, generating multiple reservoirs. An ER-MRL agent is created for each reservoir, with its action policy being trained from the states of that reservoir (lighter grey frames). The fitness of a sampled \(\varPhi _i^j\) is determined by the average score of all ER-MRL agents generated from it (mid-grey frames). The fitness values are used to update the CMA-ES distribution for the next generation (dotted arrow). This process iterates until a predetermined threshold is reached. In the Testing phase (bottom), the best set of HPs \(\varPhi ^{*}\) from all CMA-ES samples is employed. Multiple reservoirs are generated within ER-MRL agents, and their performance is evaluated.

Inner Loop. To represent the development of an agent, we consider an RL agent (Sect. 2.1) that interacts with an environment through observations \(o_t\), actions \(a_t\) and rewards \(r_t\) at each time step t (Fig. 2a). In our proposed ER-MRL method, this agent is composed of three main parts: a reservoir generated by the HPs \(\varPhi =\{iss,lr,sr\}\) (see Sect. 6.1 for more details), a feedforward action policy network \(\pi _\theta \) and an RL algorithm. At each time step, we feed the reservoir with the current observation \(o_t\), together with the previous action and reward \(a_{t-1}\) and \(r_{t-1}\) (Fig. 2d). Contrary to standard RL, the policy \(\pi _\theta \) does not directly access the observation of the environment’s state \(o_t\), but instead the context \(c_t\) of the reservoir (i.e. the vector of all the reservoir’s neuron activations at time t). Because reservoirs are recurrent neural networks, \(c_t\) not only encompasses information about the current time step, but also integrates information over previous time steps. In some experiments, we also use ER-MRL with multiple reservoirs. In this case, we still generate the reservoirs from a set of HPs \(\varPhi \), and the context \(c_t\) given to the policy is the concatenation of the hidden states of all reservoirs. We then train the policy \(\pi _\theta (a|c_t)\) using RL.
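
The following sketch illustrates one possible implementation of this inner-loop interface as a Gymnasium wrapper: the observation, previous action and previous reward are pushed through a fixed ReservoirPy reservoir, and only the reservoir context is exposed to the policy. This is our own illustrative reconstruction of the described architecture, not the authors’ code.

```python
import numpy as np
import gymnasium as gym
from reservoirpy.nodes import Reservoir

class ReservoirContextWrapper(gym.Wrapper):
    """Expose the reservoir context c_t to the policy instead of the raw observation."""

    def __init__(self, env, units=100, sr=0.9, iss=1.0, lr=0.3):
        super().__init__(env)
        self.reservoir = Reservoir(units=units, sr=sr, input_scaling=iss, lr=lr)
        self.prev_action = np.zeros(env.action_space.shape[0])
        self.prev_reward = 0.0
        self.observation_space = gym.spaces.Box(-np.inf, np.inf, shape=(units,))

    def _context(self, obs):
        # Reservoir input: current observation + previous action + previous reward.
        x = np.concatenate([obs, self.prev_action, [self.prev_reward]])
        return self.reservoir(x.reshape(1, -1)).ravel()

    def reset(self, **kwargs):
        obs, info = self.env.reset(**kwargs)
        if self.reservoir.is_initialized:
            self.reservoir.reset()  # clear the recurrent state between episodes
        self.prev_action[:] = 0.0
        self.prev_reward = 0.0
        return self._context(obs), info

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        self.prev_action = np.asarray(action, dtype=float)
        self.prev_reward = float(reward)
        return self._context(obs), reward, terminated, truncated, info
```

An ER-MRL agent is then obtained by training any standard RL algorithm (e.g. PPO) on the wrapped environment; with several reservoirs, their states would simply be concatenated in `_context`.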

Outer Loop. The outer loop employs the Covariance Matrix Adaptation Evolution Strategy (CMA-ES) (Sect. 2.4) to optimize the reservoir HPs \(\varPhi \). The objective is to generate reservoirs which, on average over multiple agents, improve learning abilities. For each set of HPs, we assess the performance of our agents in multiple inner loops (3 in our experiments), each one with a different random seed. Using different random seeds implies that, while sharing the same HPs, agents are initialized with different reservoir connection weights, different policy weights and different initial environment states. Note that while the generated reservoirs have different connection weights, they share the same macro-properties in terms of spectral radius sr, input scaling iss and leak rate lr (since they are generated from the same HPs). To assess an agent’s fitness within its RL environment, we compute the mean episodic reward over the final 10 episodes of its training. To obtain the fitness of a set of reservoir HPs, we average the fitness of three agents trained on three differently seeded versions of the same environment. These steps are iterated until we reach a predetermined number of CMA-ES iterations (set to 900 in our experiments).
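
In pseudocode, the fitness of one candidate HP set could be computed as below; `train_er_mrl_agent` is a hypothetical helper standing for one full inner loop that returns the list of episodic rewards collected during training:

```python
import numpy as np

def fitness_of_hps(phi, env_id, n_seeds=3, last_episodes=10):
    """Mean over seeds of the average episodic reward on the last training episodes."""
    scores = []
    for seed in range(n_seeds):
        # Hypothetical helper: builds a reservoir from phi, trains a policy with RL
        # on env_id, and returns the episodic rewards observed during training.
        episode_rewards = train_er_mrl_agent(phi, env_id, seed=seed)
        scores.append(np.mean(episode_rewards[-last_episodes:]))
    return float(np.mean(scores))
```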

Evaluation. To evaluate our method, we select the HPs \(\varPhi ^{*}\) that yielded the best fitness during the whole outer loop optimization with CMA-ES (see bottom of Fig. 3). We then generate 10 ER-MRL agents with different random seeds (with a different reservoir sampled from \(\varPhi ^{*}\) for each seed, together with random initial policy weights \(\theta \)) and train the action policy \(\pi _\theta \) of each agent using RL. We report our results in the next section, comparing the performance of these agents against vanilla RL agents using a feedforward policy.

4 Results

We designed experiments to study the following hypotheses: the ER-MRL architecture combining reservoirs and RL could enable (1) solving tasks with partial observability, (2) generating oscillatory dynamics that facilitate the learning of locomotion tasks, and (3) generalizing learned behaviors to new tasks unseen during the evolution phase.

4.1 Evolved Reservoirs Improve Learning in Highly Partially Observable Environments

In this section, we evaluate our approach on tasks with partial observability, where we purposefully remove information from the agent’s observations. Our hypothesis is that the evolved reservoir can help reconstruct this missing information. Partial observability is an important challenge in RL, where agents have access to only a limited portion of the environmental information needed to make decisions. Such a setting is referred to as a Partially Observable Markov Decision Process (POMDP) [26] rather than a traditional MDP. In this context, the task becomes harder, or even impossible, to learn, as the agent needs to make decisions based on an incomplete observation of the environment state. To explore this issue, our experimental framework is based on control environments, such as CartPole, Pendulum, and LunarLander (see details in Fig. 9 of the appendix). We modify these environments by removing velocity-related observations, thus making the task partially observable.

Let us illustrate this issue with the first environment (CartPole), where the agent’s goal is to keep the pole upright on the cart while the cart moves laterally. If we remove velocity-related observations (both the cart’s velocity and the pole’s angular velocity), a standard feedforward RL agent cannot effectively solve the task. The reason is straightforward: without this information, the agent does not know in which direction the cart is moving or whether the pole is falling or rising. We apply the same process to the other two environments, removing all velocity-related observations. Can the ER-MRL architecture address this challenge? To find out, we independently evolve reservoirs using ER-MRL for each task, searching for effective HPs tailored to the partial observability of each environment. To evaluate our approach, we compare the learning curves of ER-MRL agents (from the test phase, see bottom of Fig. 3) on these three partially observable environments against an agent with a feedforward policy.
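
As a concrete illustration of this modification (our own sketch, using the CartPole-v1 observation layout where indices 1 and 3 hold the cart and pole velocities), an observation wrapper can simply drop the velocity entries:

```python
import numpy as np
import gymnasium as gym

class MaskVelocityWrapper(gym.ObservationWrapper):
    """Keep only the position-like entries of the observation (CartPole: indices 0 and 2)."""

    def __init__(self, env, keep_indices=(0, 2)):
        super().__init__(env)
        self.keep_indices = list(keep_indices)
        low = env.observation_space.low[self.keep_indices]
        high = env.observation_space.high[self.keep_indices]
        self.observation_space = gym.spaces.Box(low=low, high=high, dtype=np.float32)

    def observation(self, obs):
        return obs[self.keep_indices].astype(np.float32)

env = MaskVelocityWrapper(gym.make("CartPole-v1"))
obs, _ = env.reset()
print(obs)  # only cart position and pole angle remain
```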

Fig. 4.

Learning curves for partially observable tasks. The x-axis represents the number of timesteps during the training and the y-axis the mean episodic reward. Learning curves of our ER-MRL methods correspond to the testing phase described in the bottom of Fig. 3. Vanilla RL corresponds to a feedforward policy RL agent. The curves and the shaded areas represent the mean and the standard deviation of the reward for 10 random seeds. See Sect. 6.3 for a comparison with another method.

Figure 4 presents the results for the three selected partially observable tasks. We observe, as expected, that vanilla RL agents cannot learn to solve the tasks under partial observability (for the reasons mentioned above). In comparison, our approach leads to performance scores close to those obtained by an RL algorithm with full observability. This indicates that the evolved reservoir is able to reconstruct missing velocity-related information from its own internal recurrent dynamics, confirming the hypothesis that an agent with a reservoir can solve partially observable tasks by using the internal reservoir state to reconstruct missing information. We explain in more detail why this method can work in Sect. 6.3 of the appendix. The performance gap between the one- and two-reservoir models on the LunarLander environment suggests that solving it requires encoding dynamics at (at least) two different timescales. Our interpretation is that solving LunarLander requires dealing with both an “approaching” and a “landing” phase, unlike the two other environments.

4.2 Evolved Reservoirs Could Generate Oscillatory Dynamics that Facilitate the Learning of Locomotion Tasks

In this section, we evaluate our approach on agents with 3D morphologies that have to learn the locomotion tasks shown in Fig. 10. We postulate that the integration of an evolved reservoir can engender oscillatory patterns that aid in coordinating body movements, akin to Central Pattern Generators (CPGs). CPGs are interconnected networks of neurons responsible for generating intricate and repetitive rhythmic patterns that govern movements or behaviors [24] such as walking, swimming, or other cyclical movements. The existing literature hypothesizes that reservoirs, which can exhibit strong rhythmic components, are closely related to CPGs [37]. We propose to study this hypothesis using motor tasks involving rhythmic movements.

We employed 3D MuJoCo environments (detailed in Fig. 10 of the appendix), where the goal is to exert forces on various rotors of creatures to propel them forward. Notably, while the ultimate goal across these tasks remains constant (forward movement), the creatures exhibit diverse morphologies, including humanoids, insects, worms, bipeds, and more. Furthermore, the action and observation spaces vary for each morphology. We individually evaluate our ER-MRL architecture on each of these tasks.

Fig. 5.

Learning curves for locomotion tasks. Same conventions as Fig. 4

Our approach demonstrates improved performance on some tasks (Ant, HalfCheetah, and Swimmer) compared to a standard RL baseline, particularly noticeable in the early stages of learning, as illustrated in Fig. 5. This suggests that the evolved reservoir may generate beneficial oscillatory patterns that facilitate the learning of locomotion tasks, in line with the notion that reservoirs could function as CPGs aiding in solving motor tasks. Although carefully testing this hypothesis would require more analysis, we present in Sect. 6.4 of the appendix preliminary data suggesting that the evolved reservoir is able to generate oscillatory dynamics that could facilitate learning in the Swimmer environment. However, as shown in Fig. 5, no performance enhancement was observed in the Walker and Hopper environments compared to the RL baseline. Locomotion in both environments demands precise closed-loop control strategies to maintain the agent’s equilibrium; in such cases, generated oscillatory patterns may not be as beneficial.

4.3 Evolved Reservoirs Improve Generalization on New Tasks Unseen During Evolution Phase

In this section, we address a key aspect of our study: the ability of evolved reservoirs to facilitate adaptation to novel environments. This inquiry is crucial in assessing the potential of evolved neural structures to generalize and enhance an agent’s adaptability beyond the evolution phase. Building on the promising results of ER-MRL with two reservoirs in previous experiments, we focus exclusively on this configuration for comparison with the RL baseline.

Generalizing Across Different Morphologies with Similar Tasks. In prior experiments, ER-MRL demonstrated effectiveness in environments like Ant, HalfCheetah, and Swimmer. This success led us to explore whether reservoirs evolved for two of these tasks could be adaptable to the third, indicating potential generalization across different morphologies. However, due to variations in environments, including differences in morphology, observation and action spaces, and reward functions, generalization from one set of tasks to another presents a complex challenge. To ensure fair task representation of each environment in the final fitness, we employ the normalization formula detailed in Sect. 6.6. Subsequently, we select the reservoir HPs \(\varPhi ^{*}\) that yielded the highest fitness and evaluate them in a distinct environment. For instance, if we evolve reservoirs on Ant and HalfCheetah, we test them in the Swimmer task.

Fig. 6.

Learning curves for generalization on similar locomotion tasks with different morphologies. The curves evaluate the performance of ER-MRL on an environment that was unseen during the evolution phase. For instance, the left plot shows performance of an agent on Ant, using reservoirs evolved on only HalfCheetah and Swimmer.

Figure 6 shows a notable improvement in the performance of ER-MRL agents equipped with reservoirs evolved on different tasks, particularly in the HalfCheetah and Swimmer environments. This substantiates the capacity of evolved reservoirs to generalize to new tasks and to encode diverse dynamics from environments with distinct morphologies. However, this improvement was not replicated in the Ant task. This could be attributed to the unique characteristics of the Ant environment, with its stable four-legged structure, in contrast to the simpler anatomies of Swimmer and HalfCheetah. For a detailed analysis, please refer to Sect. 6.7 in the appendix.

Fig. 7.

Learning curves for generalization on different locomotion tasks with similar morphologies. The reservoirs are evolved on one task and tested on the other one.

Generalizing Across Different Tasks with Similar Morphologies. We have seen how reservoirs facilitated the ability of ER-MRL agents to generalize across locomotion tasks with different morphologies. Now, we shift our focus to tasks with consistent morphologies but distinct objectives. To delve into this, we turn to the Humanoid and HumanoidStandup environments (shown in Fig. 12 of the appendix), both presenting tasks within the realm of humanoid movement. One task involves learning to walk as far as possible, while the other centers on the challenge of standing up from the ground. As in the previous experiments, we evolve reservoir-generating HPs on one task and evaluate their performance on the other.

Figure 7 provides a visual representation of our findings. While the performance improvement may not be dramatic, it underscores the generalization capabilities of reservoirs across tasks with similar morphologies but differing objectives. This observation, though promising, invites further investigation, given the limited number of experiments conducted in this context. This aspect represents an avenue for future research.

5 Discussion

In this paper, we have addressed the compelling question of whether reservoir-like neural structures can be evolved at an evolutionary time scale, to facilitate the learning of agents on a multitude of sensorimotor tasks at a developmental scale. Our results demonstrate the effectiveness of employing evolutionary algorithms to optimize these reservoirs, especially on Reinforcement Learning tasks involving partial observability, locomotion, and generalization of evolved reservoirs to unseen tasks.

Our ER-MRL approach has parallels to previous RL algorithms that employ an indirect encoding for mapping a genome to a particular neural network architecture [12, 28, 41]. Our choice of employing reservoirs comes with the benefit of a very small genome size (reservoirs are parameterised by a handful of HPs, listed in Appendix 6.1) without reducing the complexity of the phenotype (the number of weights of the reservoir policy is independent of the number of HPs). Moreover, our approach clearly distinguishes the neural structures optimized at the evolutionary scale (the reservoirs) from those optimized at the developmental scale (the RL action policy).

Nonetheless, some limitations persist within our methodology. The combination of reservoir computing and reinforcement learning remains underexplored in the existing literature [6, 7], leaving substantial room for refining the algorithmic framework for improved performance. Moreover, our generalization experiments and quantitative analyses warrant further extensive testing to gain deeper insights. Notably, our approach does incur a computational cost due to the time required to train a new policy with RL for each generated reservoir. Future studies could devise more efficient evolutionary strategies or employ alternative optimization techniques.

However, because our method remains agnostic to the specific characteristics of the environment and agent (a reservoir architecture being independent of the shape of its inputs and outputs), we could in theory evolve reservoirs across a very wide range of environments and agent morphologies. Such evolved generalist reservoirs could then result in a highly reduced computational cost at the developmental scale, as our results suggest, compared to training recurrent architectures from scratch.

Moving forward, there are several promising avenues for exploration. Firstly, a more comprehensive understanding of the interaction between RL and RC could significantly improve the performance of such methods on developmental learning tasks. Secondly, integrating our approach with more sophisticated Meta-RL algorithms could offer a means to initialize RL policy weights with purposefully selected values rather than random ones. Additionally, a broader framework allowing for the evolution of neural structures with greater flexibility, such as varying HPs and neuron counts, could yield more intricate patterns during the evolution phase, potentially resulting in substantial improvements in agent performance across developmental tasks [28, 41].

Our research bridges the gap between evolutionary algorithms, reservoir computing and meta-reinforcement learning, creating a robust framework for modelling neural architecture evolution. We believe that this integrative approach opens up exciting perspectives for future research in RC and Meta-RL to propose new paradigms of computations. It also provides a computational framework to study the complex interplay between evolution and development, a central issue in modern biology [16, 18, 27, 49].