Abstract
Engineering approaches to machine learning (including robot learning) typically seek the best learning algorithm for a particular problem or set of problems. In contrast, the mammalian brain appears to act as a toolbox of different learning strategies, so that any newly encountered situation can be autonomously learned by an animal through a combination of existing strategies. For example, when facing a new navigation problem, a rat can learn a map of the environment and then plan within this map to find a path to its goal. Alternatively, it can learn sequences of egocentric movements in response to identifiable features of the environment. For about 15 years, computational neuroscientists have searched for the coordination mechanisms which enable the mammalian brain to find efficient, if not necessarily optimal, combinations of existing learning strategies to solve new problems. Understanding such coordination principles could have great implications in robotics, by enabling robots to autonomously determine which learning strategies are appropriate in different contexts. Here, we review some of the main neuroscience models for the coordination of learning strategies and present some of the early results obtained when applying these models to robot learning. We moreover highlight important energy costs which can be reduced with such bio-inspired solutions compared to current deep reinforcement learning approaches. We conclude by sketching a roadmap for the further development of such bio-inspired hybrid learning approaches to robotics.
1 Introduction
The mammalian brain combines multiple learning systems whose interactions, sometimes competitive and sometimes cooperative, are thought to be largely responsible for the high degree of behavioral flexibility observed in mammals [13, 18, 34, 35, 37, 46, 51, 53, 57]. For instance, the hippocampus is a brain region playing an important role in the rapid acquisition of episodic memories – memories of individual episodes previously experienced, such as the sequence of places visited while exploring a new city [4, 25, 42]. Together with the prefrontal cortex, the hippocampus can link these episodes so as to store in long-term memory a mental representation (or ‘model’) of the statistical regularities of the environment [18, 26, 66]. In the spatial domain, such a mental model can take the form of a ‘cognitive map’ [53]. Even if it constitutes an imperfect and incomplete representation of the environment, it can be used to mentally explore the map [8, 27], or to plan potential trajectories to a desired goal before acting [32, 45]. Such a model-based strategy enables rapid and flexible adaptation to changes in goal location: the map can be updated instantaneously with the new goal location, so that the animal can plan the new correct trajectory in one trial [18]. Nevertheless, such flexibility comes at the expense of time- and energy-consuming planning phases (the larger the map, the longer it takes to find the shortest path between two locations). This is typically observed when a human or an animal takes longer to decide after a task change, putatively reflecting a re-planning phase [32, 70].
In contrast, the basal ganglia, and especially their main input region, the striatum, are involved in the slow acquisition of procedural memories [37, 55, 73]. This type of memory is typically acquired through the repetition of sequences of egocentric movements (e.g., turn left, go straight, turn right) or sequences of stimulus-triggered responses (e.g., start moving in response to a light flash, then stop in response to a sound) which become behavioral ‘habits’ [16, 31]. These habits are known to be inflexible, resulting in slow adaptation to both changes in the environment (e.g., a change in goal location) and changes in motivation (e.g., an overtrained rat habitually presses a food-delivering lever even when it is satiated) [17]. Nevertheless, once these habits have been well acquired in a familiar environment, they enable the animal to make fast decisions and to perform efficient action sequences without relying on the time-consuming planning system [13, 37]. Such a learning strategy is thus called model-free because making a decision does not require the manipulation of an internal model to mentally represent the potential long-term consequences of actions before acting. Instead, it is the perception of a stimulus, or the recognition of a familiar location, which triggers the execution of a habitual behavioral sequence.
Lesion studies have highlighted a fascinating degree of modularity in the organization of learning systems within the brain. Lesioning the hippocampus impairs behavioral flexibility as well as behavioral strategies relying on a cognitive map (see [37] for a review). As a consequence, hippocampus-lesioned animals only display stimulus-response behaviors in a maze and do not seem to remember the location of previously encountered food. In contrast, animals with a lesion of the dorsolateral striatum retain an intact map-based behavioral strategy and perform fewer egocentric movements during navigation (see [37] for a review). Nevertheless, the modularity is not perfect, and an important degree of distributed information processing also exists. For instance, lesions of the ventral striatum seem to impair representations of reward value which are required for both model-based and model-free learning strategies (again see [37] for a review). Moreover, some brain regions do not seem to play a specific role in one particular learning strategy, but rather a role in the coordination of these strategies. For instance, lesions of the medial prefrontal cortex only impair the initial acquisition of model-based strategies, not their later expression [54]. Indeed, the medial prefrontal cortex seems to play a central role in the arbitration between model-based and model-free strategies [39]. As a consequence, lesioning a subpart of the medial prefrontal cortex can even restore flexible model-based strategies in overtrained rats [11].
This paper aims at illustrating how neuroscience studies of decision-making have progressively elucidated (and are still investigating) the neural mechanisms underlying animals’ ability to adaptively coordinate model-based and model-free learning, and how this biological knowledge can help in developing behaviorally flexible autonomous robots. Since the computational models of these processes largely rely on the reinforcement learning theoretical framework, the next section first describes the adopted formalism. The third section then briefly reviews some of the neuroscience results which contribute to deciphering the principles of this coordination, and how this coordination has been mathematically formalized by computational neuroscience models. The fourth section reviews some of the experimental tests of these principles in robotic scenarios. We finally discuss the perspectives of this field of research, and how it could contribute not only to improving robots’ behavioral flexibility, but also to reducing the computational cost of machine learning algorithms for robots (by enabling them to skip model-based strategies when they autonomously recognize that model-free strategies are sufficient).
2 Formalism Adopted to Describe Model-Based and Model-Free Reinforcement Learning
For simplicity, existing computational models are most often framed within standard Markov Decision Process (MDP) settings, where an agent visits discrete states \(s\in \mathcal {S}\) using a finite set of discrete actions \(a\in \mathcal {A}\). The agent can encounter scalar reward values \(r\in \mathbb {R}\) after performing some actions a in some states s, which it has to discover. Finally, a transition probability function \(\mathcal {T}(s,a,s'):(\mathcal {S},\mathcal {A},\mathcal {S})\rightarrow \left[ 0;1\right] \) constitutes the generative model underlying the statistics of the task that the agent will face: it determines the probability of ending up in a state \(s'\) after performing an action a in state s.
In navigation scenarios, states represent unique locations in space. In neuroscience modeling studies, these states are usually equally spaced on a square grid, information expected to be provided by place cells in the hippocampus, neurons that participate in the estimation of the animal’s current location within the map. In the robot navigation experiments presented later on, the state space discretization is autonomously performed by the robot after an initial exploration of the environment; as a consequence, different states have different sizes and are unevenly distributed. It is important to note that these models can easily be extended to more distributed or continuous state representations [1], for instance when facing Partially Observable Markov Decision Process (POMDP) settings [7]. The models reviewed here can also be generalized to more continuous representations of space and actions (e.g., [22, 38]). Nevertheless, we stick here to the discrete state representation for the sake of simplicity and because it has a high explanatory power.
As classically assumed within the Reinforcement Learning (RL) theoretical framework, the agent’s goal is here considered to be the maximization of the expected long-term reward, that is, the maximization of the discounted sum of future rewards over a potentially infinite horizon: \(\mathbb {E}\left[ \sum _{t=0}^\infty \gamma ^t r(s_t,a_t)\right] \), where \(\gamma \) \((\gamma <1)\) is a discount factor which assigns weaker weights to long-term rewards than to short-term rewards. In order to meet this objective, the agent will learn a state-action value function \(Q:(\mathcal {S},\mathcal {A})\rightarrow \mathbb {R}\) which evaluates the total discounted sum of future rewards that the agent expects to receive when starting from a given state s, taking the action a and then following a certain (possibly learned) behavioral policy \(\pi \):

\(Q^{\pi }(s,a)=\mathbb {E}_{\pi }\left[ \sum _{t=0}^{\infty }\gamma ^{t} r(s_t,a_t) \,\Big |\, s_0=s, a_0=a\right] \)     (1)
Saying that the agent adopts a model-based RL strategy means that the agent progressively tries to estimate an internal model of its environment. Conventionally, this model is the combination of the estimated transition function \(\hat{\mathcal {T}}(s,a,s')\) and the estimated reward function \(\hat{\mathcal {R}}(s,a)\), which aims at capturing the rules chosen by the human experimenter to determine which (state, action) pairs yield reward in the considered task.
Various ways to learn these two functions exist. Here, we will simply consider that the agent estimates the frequencies of occurrence of states, actions and rewards from its observations. The learned internal model can then be used by the agent to infer the value of each action in each possible state. This inference can be a computationally costly process, especially when all (s, a) pairs have to be visited multiple times before reaching an accurate estimation of the state-action value function Q. Some heuristics exist to make these computations simpler or less costly, like trajectory sampling [68] or prioritized sweeping [49, 56], which we review in [8]. Some alternatives to full model-based strategies also exist, like the successor representation [14], which provides the agent with more flexibility and generalization ability than a pure model-free strategy at a smaller computational cost than a pure model-based strategy, and which at the same time can contribute to describing some neural learning mechanisms in the hippocampus [48, 65]. Nevertheless, for the sake of simplicity, here we will consider that the inference process in the model-based (MB) RL agent is performed through a value iteration process [68]:

\(Q_{MB}^{(t+1)}(s,a) \leftarrow \hat{\mathcal {R}}(s,a) + \gamma \sum _{s' \in \mathcal {S}} \hat{\mathcal {T}}(s,a,s') \max _{k \in \mathcal {A}} Q_{MB}^{(t)}(s',k)\)     (2)
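To make the inference step concrete, here is a minimal Python sketch of value iteration on a toy deterministic 4-state chain. The environment, variable names and parameter values are invented purely for illustration, not taken from the models reviewed here:

```python
# Toy 4-state chain MDP: states 0..3, actions 0 (left) and 1 (right).
# Reaching state 3 yields a reward of 1; transitions are deterministic.
GAMMA = 0.9
N_STATES, N_ACTIONS = 4, 2

# Estimated (here deterministic) transition model: T_hat[s][a] -> next state
T_hat = [[0, 1], [0, 2], [1, 3], [3, 3]]
# Estimated reward model: R_hat[s][a]
R_hat = [[0.0, 0.0], [0.0, 0.0], [0.0, 1.0], [0.0, 0.0]]

# Value iteration: repeatedly apply the Bellman optimality backup to Q_MB
# until the largest update becomes negligible.
Q_mb = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
for _ in range(500):
    Q_new = [[R_hat[s][a] + GAMMA * max(Q_mb[T_hat[s][a]])
              for a in range(N_ACTIONS)]
             for s in range(N_STATES)]
    delta = max(abs(Q_new[s][a] - Q_mb[s][a])
                for s in range(N_STATES) for a in range(N_ACTIONS))
    Q_mb = Q_new
    if delta < 1e-9:
        break
```

Note that with such a model in hand, the agent can re-plan after a task change simply by editing `R_hat` and re-running the inference loop, which is exactly the flexibility (and the computational cost) discussed above.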
In contrast, an agent adopting a model-free RL strategy will neither have access to, nor try to estimate, any internal model of the environment. Instead, the agent iteratively updates its estimation of the state-action value function through its interactions with the environment. Each of these interactions consists in performing an action a in a state s and observing the consequences, in terms of the reward r that this yields and the new state \(s'\) of the environment. Again, in a navigation problem, the possible actions are typically movements towards cardinal directions (North, South, East, West), and the new state \(s'\) is the new location of the agent within the environment after acting. A classical and widely used model-free RL algorithm is Q-learning [71]:

\(Q_{MF}^{(t+1)}(s_t,a_t) = Q_{MF}^{(t)}(s_t,a_t) + \alpha \left( r_t + \gamma \max _{k \in \mathcal {A}} Q_{MF}^{(t)}(s_{t+1},k) - Q_{MF}^{(t)}(s_t,a_t)\right) \)     (3)
where \(\alpha \in \left[ 0;1\right] \) is the learning rate, and the term between parentheses, often written \(\delta _t\), is called the temporal-difference error [68] or the reward prediction error [64] because it constitutes a reinforcement signal comparing the new estimation of value (\(r_t + \gamma \max _{k \in \mathcal {A}} Q_{MF}^{(t)}(s_{t+1},k)\)) after performing action \(a_t\) in state \(s_t\), arriving in state \(s_{t+1}\) and receiving a reward \(r_t\), with the expected value \(Q_{MF}^{(t)}(s_t,a_t)\) before executing this action. Any deviation between the two is used as an error signal to correct the current estimation of the state-action value function Q.
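The Q-learning update and its reward prediction error can be sketched in the same toy chain setting, now used only as a black-box simulator: the agent observes (r, s') after each action and never reads the model itself (environment, names and parameter values are again invented for illustration):

```python
import random

# A toy 4-state chain used only as a simulator: the agent can query it by
# acting, but has no access to T or R as a model.
GAMMA, ALPHA = 0.9, 0.5
N_STATES, N_ACTIONS = 4, 2
T = [[0, 1], [0, 2], [1, 3], [3, 3]]
R = [[0.0, 0.0], [0.0, 0.0], [0.0, 1.0], [0.0, 0.0]]

Q_mf = [[0.0] * N_ACTIONS for _ in range(N_STATES)]

random.seed(0)
s = 0
for _ in range(5000):
    a = random.randrange(N_ACTIONS)      # pure random exploration for this sketch
    r, s_next = R[s][a], T[s][a]
    # Reward prediction (temporal-difference) error delta_t
    delta = r + GAMMA * max(Q_mf[s_next]) - Q_mf[s][a]
    Q_mf[s][a] += ALPHA * delta          # the Q-learning update
    s = 0 if s_next == 3 else s_next     # restart an episode at the goal
```

After enough interactions, `Q_mf` approaches the same values that the model-based inference computes, but only through many cheap incremental updates, which is precisely the slow-but-inexpensive character of the habitual system described above.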
Finally, for the decision-making phase, whether the agent is model-based or model-free, it selects the next action a to perform from a probability distribution over actions in the current state s, computed from the estimated state-action value function \(x^{(t)}(s,a)\) (with \(x^{(t)}(s,a)=Q_{MB}^{(t)}(s,a)\) if the agent is model-based, or \(x^{(t)}(s,a)=Q_{MF}^{(t)}(s,a)\) if the agent is model-free), using a Boltzmann softmax function:

\(P^{(t)}(a \mid s) = \frac{\exp \left( \beta \, x^{(t)}(s,a)\right) }{\sum _{k \in \mathcal {A}} \exp \left( \beta \, x^{(t)}(s,k)\right) }\)     (4)
where \(\beta \) is called the inverse temperature and regulates the exploration/exploitation trade-off by modulating the stochasticity of choice: the closer \(\beta \) is to zero, the more the contrast between Q-values is attenuated, the extreme being \(\beta =0\), which produces a flat action probability distribution (random exploration); conversely, the larger the value of \(\beta \), the more the contrast between Q-values is enhanced, the probability of the action with the highest Q-value approaching 1 as \(\beta \) tends towards \(\infty \) (exploitation).
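The effect of the inverse temperature can be illustrated with a small sketch of the Boltzmann softmax (the Q-values below are invented for illustration):

```python
import math

def softmax_policy(q_values, beta):
    """Boltzmann softmax: turn a list of Q-values into action probabilities.
    The max is subtracted before exponentiating for numerical stability;
    this leaves the resulting probabilities unchanged."""
    m = max(q_values)
    exps = [math.exp(beta * (q - m)) for q in q_values]
    z = sum(exps)
    return [e / z for e in exps]

q = [0.2, 0.8, 0.5]
p_explore = softmax_policy(q, beta=0.0)   # flat distribution: random exploration
p_exploit = softmax_policy(q, beta=50.0)  # near-greedy choice: exploitation
```

With `beta=0.0` every action is equiprobable regardless of the Q-values, while with a large `beta` almost all the probability mass concentrates on the highest-valued action.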
3 Neuroscience Studies of the Coordination of Model-Based and Model-Free Reinforcement Learning
Reinforcement learning models (initially only from the model-free family) started to become popular in neuroscience in the mid-1990s, when researchers discovered that a part of the brain called the dopaminergic system (because it innervates the rest of the brain with a neuromodulator called dopamine) increases its activity in response to unpredicted rewards, decreases its activity in response to the absence of a predicted reward, and does not respond to predicted rewards, corresponding to \(\delta _t\), the term between parentheses in Eq. 3, being positive, negative or null, respectively [64]. This discovery was followed by a large set of diverse neuroscience experiments verifying that other parts of the brain show neural activity compatible with other variables of reinforcement learning models, such as state-action values (see examples of comparisons of neuroscience results with RL models’ predictions in [2, 3]; and see [35] for a review).
More important for the topic of this paper, over the past decade an increasing number of neuroscience studies have started to investigate whether human and animal behavior during reward-based learning tasks could involve some sort of combination of model-based (MB) and model-free (MF) learning processes, and what neural mechanisms underlie such a coordination.
The simplest possible way of combining MB and MF RL processes is to consider that they occur in parallel in the brain, and that any decision made by the subject results from a simple weighted sum of MB and MF state-action values (i.e., replacing \(x^{(t)}(s,a)\) in Eq. 4 by \((1-\omega )Q_{MB}^{(t)}(s_t,a_t)+\omega Q_{MF}^{(t)}(s_t,a_t)\), with \(\omega \in \left[ 0;1\right] \) a weighting parameter). This works well to obtain a first approximation of the degree to which different individual subjects, whether human adults [12], human children [15], or animals [41], rely on a model-based system to make decisions, considering the ensemble of trials made by the subject during a whole experiment. This has for instance helped understand that children make more model-free decisions than adults because the brain area subserving model-based decisions (the prefrontal cortex) takes years to mature [15]. It has also helped better model why some subjects are more prone than others to be influenced by reward-predicting stimuli (which has implications for understanding stimulus-triggered drug-taking behaviors in humans): roughly, because their model-based process contributes less to their decisions [41].
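As a minimal illustration of this weighted-sum scheme (the Q-values and \(\omega \) settings below are invented for illustration):

```python
def combine_mb_mf(q_mb, q_mf, omega):
    """Weighted sum of MB and MF action values for one state (Eq. 4's x):
    omega = 0 -> purely model-based; omega = 1 -> purely model-free."""
    return [(1.0 - omega) * mb + omega * mf for mb, mf in zip(q_mb, q_mf)]

# Illustrative values for one state with three actions: the MB system
# prefers action 0, while the MF (habitual) system prefers action 2.
q_mb = [0.9, 0.1, 0.2]
q_mf = [0.1, 0.2, 0.9]

x_adult = combine_mb_mf(q_mb, q_mf, omega=0.2)  # mostly model-based decisions
x_child = combine_mb_mf(q_mb, q_mf, omega=0.8)  # mostly model-free decisions
```

Fitting \(\omega \) per subject is what yields the first-approximation characterizations mentioned above: a low \(\omega \) describes a subject whose choices track the model-based values, a high \(\omega \) one whose choices track the habitual values.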
Nevertheless, a systematic weighted sum of MB and MF decisions has the disadvantage of always requiring the (potentially heavy) computations of both learning systems. In contrast, relying on habitual behaviors learned by the model-free system when the environment is stable and familiar is thought to be useful precisely to avoid the costly inference-related computations of the model-based system [13, 35]. There could thus be evolutionary reasons why humans do not always make rational choices, as would (more often) be the case if they relied more on their model-based system [33]: namely, that they would otherwise not be able to make fast decisions in easy, familiar situations. Instead, they would always need to make long inferences with their internal models before deciding, and would be exhausted at the end of the day if they had to think deeply about all the simple decisions they make every day: whether to drink a coffee before taking a shower or the opposite, whether to wear a blue or a red shirt, where to go for lunch, etc.
Alternatively, early neuroscience studies of the coordination of MB and MF processes hypothesized a sequential activation of the two systems: humans and animals should initially rely on their MB system when facing a new task, so as to figure out the statistics of the task and the optimal thing to do; and as they repeat the same task over and over and get habituated to it (making it familiar), they should switch to the less costly MF system, which will hopefully have had time to learn during this long repetition phase. Moreover, if the task suddenly changes, they should resume using their MB system (and thus break their previously acquired habit) in order to figure out what has changed and what the new optimal behavioral policy is. Then again, after many repetitions with the new task settings, they can acquire a new behavioral habit with the MF system.
An illustrative example is the case where someone has to visit a new city. People usually look at a map, which is an allocentric representation of the city, and try to remember the parts of the map that are useful to reach a desired location. Then, while walking through the city, if they suddenly find themselves in front of a landmark that they did not expect to encounter during their planned trip (e.g., a monument), they can close their eyes and try to infer where they might actually be located within their mental map, and what change they should make to their trajectory. This is typically a MB inference process. In contrast, when one always takes the same path from home to work, one rarely performs MB inference, and rather lets the body automatically turn at the right corner and lead one to the usual arrival point. This works well even when discussing with a friend while walking, or when not fully awake. We thus think that in such a case the brain has shifted its decisions to the MF system. This frees other parts of the brain, which can be used while walking to think about the last book we read, or to try to solve the maths problem we are currently addressing.
One initial computational proposal for the coordination of MB and MF learning systems which captures these dynamics well consists in comparing the uncertainty of the MB and MF systems and relying on the most certain one [13]. This can be achieved by adopting a Bayesian formulation of RL in which the agent does not simply learn point estimates of state-action values, but rather full distributions over the value of each (state, action) pair. In that case, the precision of a distribution can be used to represent the level of uncertainty. In practice, when facing a new task, the uncertainty in both systems is high, but the uncertainty in the MB system decreases faster with learning (i.e., after fewer observations resulting from the agent’s interactions with the world, even if these observations are processed during a long inference phase). As a consequence, the agent will rely more on the MB system during early learning. In parallel, the uncertainty in the MF system slowly decreases, until the MF system is sufficiently certain to take control over the agent’s actions. When the task changes (e.g., the goal location changes, or a path is now obstructed by an obstacle), uncertainty re-increases in both systems, but again decreases faster in the MB system, so that a sequence of MB decisions followed, after a long second learning phase, by MF decisions can again be produced.
However, systematically monitoring uncertainty in the MB system can be computationally heavy, and does not really avoid the costly computations of the MB system when the MF system is currently leading. Alternatively, a more recent computational neuroscience model proposes to monitor uncertainty only within the MF system, considering that the MB system provides perfect information by default, so that it should be chosen when the MF system is uncertain and avoided only when the MF system is sufficiently certain [34]. This works well in a number of situations and captures the behavior of animals in a number of experiments. However, there are situations where this assumption cannot hold. In particular, as we will illustrate with some robotic tests of this kind of models in the next section, if the agent has an inaccurate internal model, it is better to rely on the more reactive MF system even when it is still uncertain [6, 59]. Less costly estimations of uncertainty can be used to permanently monitor both the MB and the MF system [40]. For instance, the degree of instability of Q-values (MB or MF) before convergence is reached can be a good proxy for uncertainty [8, 36]. Choice confidence can also give a relatively good proxy for choice uncertainty in simple situations, by measuring the entropy of the probability distribution over all possible actions in a given state and comparing MB and MF estimations of this measure [70]. This moreover captures well not only the choices made by human subjects during simple decision-making tasks, but also their decision times (the more uncertain they are, the more time they need to make their decision). Finally, this type of mechanism also explains why the ideal MB-to-MF sequence is not always observed, since early choices of human subjects can sometimes significantly rely on the MF system because their MF system might initially be overconfident [70].
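As an illustration of this entropy-based confidence proxy, the following sketch (with invented Q-values and an arbitrary \(\beta \)) computes the entropy of each system's softmax action distribution and hands control to the more confident, i.e. lower-entropy, system:

```python
import math

def action_probs(q_values, beta=5.0):
    """Boltzmann softmax over Q-values (beta is an arbitrary illustrative value)."""
    m = max(q_values)
    exps = [math.exp(beta * (q - m)) for q in q_values]
    z = sum(exps)
    return [e / z for e in exps]

def entropy(probs):
    """Shannon entropy of an action distribution: high = uncertain choice."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def arbitrate(q_mb, q_mf):
    """Hand control to the system whose action distribution has the lower
    entropy (the more confident one) for the current state."""
    h_mb = entropy(action_probs(q_mb))
    h_mf = entropy(action_probs(q_mf))
    return ("MB", q_mb) if h_mb < h_mf else ("MF", q_mf)

# A confident MF system (sharp preferences) wins over an undecided MB one
chosen, _ = arbitrate(q_mb=[0.4, 0.5, 0.45], q_mf=[0.1, 0.9, 0.0])
```

Note that this arbitration only needs each system's current Q-values in the current state, not a full Bayesian treatment of value distributions, which is what makes it a comparatively cheap proxy.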
Another important current question is whether uncertainty alone is sufficient to arbitrate between MB and MF systems [52], or whether, when the two systems are equally uncertain, the agent should rely on the least computationally costly one [24]. If we want the agent to be initially agnostic about which system is more costly, and if we even want the agent to be able to arbitrate between N different learning systems with a priori unknown computational characteristics, then one proposal is simply to measure the average time taken by each system to make decisions [24]. In some of the robotic experiments described in the next section, we found that this principle works robustly and produces the ideal MB-to-MF sequence, not only during initial learning but also after a task change. We will come back to this later.
Finally, other outstanding questions are whether the two systems should always be in competition, or whether they should sometimes also cooperate (as can be achieved with the weighted sum of their contributions described above); and whether an efficient coordination mechanism should arbitrate between MB and MF at each timestep from currently available measures (e.g., uncertainty, computational cost, etc.), or whether it is sometimes more efficient to learn and remember that the MF system is usually better in situations of type A while the MB system is better in situations of type B. The latter could enable the agent to instantaneously rely on the best memorized system, without needing to fully experience a new situation identified as belonging to a recognized type. This issue relates to current investigations in subfields of machine learning known as transfer learning, life-long learning and open-ended learning [21, 62, 69]. One solution to this coordination memory problem consists in adopting a hierarchical organization where a second, higher-level learning process (what Dollé and colleagues call a ‘meta-controller’) learns which strategy (model-based, model-free or random exploration) is the best (in terms of the amount of reward it yields) in different parts of the environment [18] (Fig. 1). This meta-controller learns through RL which strategy is the most efficient in each part of the environment. It can moreover learn that a certain equilibrium between MB and MF processes is required for good performance, thus resulting in cooperation between systems. It can even learn to change the weight of each system’s contribution over time, as learning in the MF system progresses, thus producing something that looks like the ideal MB-to-MF sequence.
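A toy sketch can illustrate the idea of such a meta-controller: a higher-level RL process learns, from the rewards obtained, which strategy to trust in each part of the environment. The regions, payoffs and parameters below are invented for illustration only; the actual model [18] is far richer:

```python
import random

random.seed(1)
STRATEGIES = ["MB", "MF", "explore"]
REGIONS = ["near_goal", "far_from_goal"]

# Invented payoffs: in this toy setting the MF habit happens to work best
# near the goal and the MB planner works best far from it.
PAYOFF = {("near_goal", "MB"): 0.5, ("near_goal", "MF"): 0.9,
          ("near_goal", "explore"): 0.1,
          ("far_from_goal", "MB"): 0.9, ("far_from_goal", "MF"): 0.3,
          ("far_from_goal", "explore"): 0.1}

ALPHA, EPS = 0.1, 0.1
Q_meta = {(g, s): 0.0 for g in REGIONS for s in STRATEGIES}

for _ in range(2000):
    region = random.choice(REGIONS)
    if random.random() < EPS:                    # occasional exploration
        strategy = random.choice(STRATEGIES)
    else:                                        # otherwise pick the best so far
        strategy = max(STRATEGIES, key=lambda s: Q_meta[(region, s)])
    reward = PAYOFF[(region, strategy)]
    # Model-free update of the meta-controller's own strategy values
    Q_meta[(region, strategy)] += ALPHA * (reward - Q_meta[(region, strategy)])

best_near = max(STRATEGIES, key=lambda s: Q_meta[("near_goal", s)])
best_far = max(STRATEGIES, key=lambda s: Q_meta[("far_from_goal", s)])
```

In this invented setting, the meta-controller discovers on its own that the habitual strategy should control behavior near the goal and the planner far from it, echoing the emergent division of labor observed in the robotic experiments described in the next section.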
With these principles in hand, the Dollé model [18] can explain a variety of experimentally observed rat navigation behaviors, including data that initially appeared contradictory either to the cognitive map theory, or to the associative learning theory, which roughly considers that navigation behaviors should all be learned through model-free reinforcement learning. Finally, it is worth noting that performing offline replay of the MB system can result in learning by observation in the MF system, so that the two somehow cooperate [8], as inspired by the now classical DYNA architecture [67].
Overall, this short review highlights that the investigation of the principles underlying the adaptive coordination of model-based and model-free reinforcement learning mechanisms in humans and animals is currently an active area of research in neuroscience.
4 Robotic Tests of Bio-Inspired Principles for the Coordination of Model-Based and Model-Free Reinforcement Learning
The importation of these bio-inspired ideas to robotics is quite recent, and is still an emerging field of research. Nevertheless, a few studies and their outcomes deserve to be mentioned here.
To our knowledge, the first test with a real robot of a bio-inspired algorithm for the online coordination of model-based and model-free reinforcement learning has been presented in [6]. This work included an indoor robot navigation scenario with the Psikharpax rat robot [47] within a 2 m \(\times \) 2.5 m arena (Fig. 2). The robot first explored the environment to autonomously learn a cognitive map of the environment (hence a mental model used by its model-based learning strategy). In addition, the robot could use a model-free reinforcement learning strategy to learn movements in 8 cardinal directions in response to perceived salient features within the environment (i.e., stimuli in the vocabulary of psychology). The latter MF RL component of the model was later improved in [5] to make it able to learn movements away from visual features when needed. The proposed algorithm for the online coordination of MB and MF RL was based on the computational neuroscience model of Dollé and colleagues [18,19,20], which has been presented in the previous section and sketched in Fig. 1.
The first important result of this robotic work is that the algorithm could autonomously learn the appropriate coordination of MB and MF systems for each specific configuration of the environment presented to the robot. For a first configuration associated with an initial goal location (where the robot could obtain a scalar reward), the algorithm learned that the MB strategy was appropriate to guide the robot from far away towards the goal, and that the MF strategy was appropriate to control the fine movements of the robot closer to the goal. This was an emergent property of the coordination that was not designed by hand: it was autonomously learned by the algorithm in response to the specific environment where the robot was located. The reason is that the robot had explored the area around the initial goal location less, so that its cognitive map was less precise there. As a consequence, a pure MB version of the algorithm could guide the robot near the goal, but could not learn to precisely reach it (because of the imprecision of the map), and the autonomous coordination algorithm found out that the MF system could compensate for this lack of precision. From this simple example we can learn two things. First, in contradiction to the assumption made by some previously discussed computational models that the MB system has access to perfect information, the map (i.e., model) learned by the MB system can be imperfect, and the coordination algorithm has to cope with it. More generally, we think that when experimenting with robots, there will always be situations where the map cannot be accurately learned, because of noisy perceptions, lighting issues, etc.
So, rather than endlessly trying to refine the MB system to make it appropriate for each new situation at hand, it might be better to let the coordination algorithm autonomously find the appropriate alternation between MB and MF for the present situation. The second thing that we can learn from this example is that a simple coordination algorithm which puts MB and MF systems in competition, and selects the most efficient one, can sometimes produce a sort of cooperation between them. In this particular example, a learned trajectory of the robot to the goal can be hybrid, involving a first use of the MB strategy far from the goal, and then a shift to the MF strategy closer to it. This enables us to draw a model-driven prediction for neuroscience: animals solving these types of tasks may sometimes display a trajectory within a maze or an arena that is not the result of a single learning system, but rather a hybrid byproduct of the coordination of multiple systems.
Another important result of this robotic work relates to its ability to learn context-specific coordination patterns, which can relate to what people call episodic control. This occurred when we changed the goal location after some time, and let the robot adapt to the new configuration. What happened is that the algorithm first detected the change because of the different profile of reward propagation through its mental map that this induced. Then the algorithm decided to store in memory the previously learned coordination pattern between MB and MF, and to learn a new one. After a new learning phase, the algorithm found a new coordination pattern adapted to the new condition, thus producing good performance again. Finally, we suddenly moved the goal location back to its initial location. The algorithm could detect the change and recognize the previous configuration (again thanks to the profile of reward propagation through its mental map). As a consequence, the algorithm retrieved the previously learned coordination pattern, which enabled the robot to instantaneously restore the appropriate behavior without re-learning.
Nevertheless, some limitations and perspectives of this seminal work ought to be mentioned here. First, the coordination component of the algorithm (called the meta-controller in [5, 6, 18]) slowly learns through MF RL (in addition to the MF RL mechanism used within the system dedicated to the MF strategy) which strategy is the most appropriate in each part of the environment. In other words, the model involves a hierarchical learning process on top of the parallel learning process between the MB and MF strategies. While this enables the robot to memorize specific coordination patterns for each context (i.e., for each configuration of the goal location within the arena), it nevertheless requires a long time to achieve good coordination within each context. Thus, it would be interesting to also test coordination mechanisms based on instantaneous measures such as uncertainty, as discussed in the previous section. A second limitation is that this experiment involved a specific adaptation of a coordination model to a simple indoor navigation task, with a small map, a small number of states to learn, and an action repertoire specific to navigation scenarios. A third limitation is technical, the work relying on an old custom robot. Thus, it is not clear whether these results would generalize to other robotic platforms facing a wider variety of tasks, including more complex tasks involving a larger number of states.
A more recent series of robotic experiments with the same research goal (i.e., assessing the efficiency and robustness of bio-inspired coordination principles for MB and MF learning) has been presented in [58,59,60] and later in [9, 23, 24]. First, [58,59,60] compared different coordination principles, including methods from ensemble learning [72], in several different simulated robotic tasks. They found again that the MB system was not always the most reliable, especially in tasks with hundreds of states, where it requires long inference durations to come up with a good approximation of the state-action value function Q. These experiments highlighted the respective advantages and disadvantages of MB and MF reinforcement learning in a variety of simulated robotic situations, and again concluded in favor of coordinating them. In [9, 23, 24], simulated and real robotic experiments were presented, some involving navigation with a Turtlebot, and others involving simulated tasks with the PR2 and Baxter robots (Fig. 3).
The first important result of this new series of experiments is that the coordination of MB and MF RL was efficient in a variety of tasks, including navigation tasks with detours and non-stationary environment configurations (i.e., sudden introduction of obstacles obstructing some corridors), but also simple human-robot interaction tasks. In the latter, the human and the robot had to cooperate to clean a table by putting objects in a trashbin. Importantly, some objects were reachable by the human and some by the robot, thus forcing them to communicate and cooperate. In that case, the model-based system could compute joint action plans in which actions by the robot and by the human alternated. In all these situations, the robot could autonomously learn the task and reach good performance.
The second important result is that instantaneous measures of uncertainty in the MB and MF systems allow the coordination mechanism to react more quickly to changes in the environment. Nevertheless, unlike the work of [6], this does not permit memorization or episodic control. The two sets of results are thus complementary, and it would be interesting in the future to test combinations of these two principles.
The last important result to highlight here is that an efficient coordination mechanism proposed by [23, 24], and successfully applied to robot navigation and human-robot interaction scenarios, consists in taking into account not only the uncertainty but also the computational cost of each learning system. In practice, the proposed algorithm monitored the average time taken by each system to complete its inference phase before deciding. It learned that, in these specific tasks, the MB system takes on average 10 times longer than the MF system to make a decision. As a consequence, the coordination algorithm gave the lead to the MF system in cases of equal uncertainty, and even in cases of slightly higher uncertainty in the MF system. As a result, the algorithm mostly relied on the MF system and transiently but efficiently gave the lead to the MB system only when needed, both during initial learning and after task changes. The algorithm could thus reproduce the MB-to-MF sequence that we discussed in previous sections. Moreover, with this new coordination principle, the robot achieved the same optimal performance as a pure MB system (which was optimal in these cases) while incurring a cumulative computational cost closer to that of a pure MF system (which constitutes a lower bound on computational cost in these experiments). Thus, the coordination algorithm not only allowed for efficient and flexible behavior of the robot in these non-stationary tasks, but also reduced the computational cost of controlling it. Finally, the authors also compared their algorithm with a state-of-the-art deep reinforcement learning algorithm, and found that the latter requires a very large number of iterations to learn, many more than their proposed MB-MF coordination algorithm.
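The cost-aware trade-off just described can be sketched as follows. The exact formulation of [23, 24] is not reproduced here: the trade-off coefficient `kappa`, the exponentially weighted averaging of inference durations, and the tie-breaking rule are illustrative assumptions chosen to mirror the qualitative behavior described in the text (ties, and small uncertainty gaps created by the cost penalty, go to the cheaper MF system).

```python
class CostAwareArbiter:
    """Sketch of a coordination mechanism that weighs each system's current
    uncertainty against its average decision cost (inference duration), so
    that the expensive MB system only takes the lead when its uncertainty
    advantage outweighs its extra computational cost."""

    def __init__(self, kappa=1.0, alpha=0.1):
        self.kappa = kappa                        # cost-vs-uncertainty trade-off
        self.alpha = alpha                        # smoothing for cost averages
        self.mean_cost = {"MB": 0.0, "MF": 0.0}

    def record_cost(self, system, duration):
        # exponentially weighted running mean of observed inference durations
        m = self.mean_cost[system]
        self.mean_cost[system] = (1 - self.alpha) * m + self.alpha * duration

    def select(self, uncertainty):
        # uncertainty: dict of current uncertainty per system,
        # e.g. {"MB": 0.1, "MF": 0.2}; lower combined score wins
        score = {s: uncertainty[s] + self.kappa * self.mean_cost[s]
                 for s in ("MB", "MF")}
        # ties go to the cheaper MF system
        return "MF" if score["MF"] <= score["MB"] else "MB"
```

With this rule, the arbiter mostly hands control to MF and only escalates to MB when MF's uncertainty grows large (e.g., during initial learning or just after a task change), which is the transient MB-to-MF pattern reported above.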
Finally, it is interesting to mention that, in the meantime, several other research groups throughout the world have also started to test hybrid MB/MF algorithms for robot learning applications [29, 30, 43, 44, 61, 63]. In particular, the deep reinforcement learning community is showing a growing interest in such hybrid learning algorithms [10, 28, 50]. This illustrates the potentially broad appeal that this type of hybrid solution to reinforcement learning can have across research communities.
5 Conclusion
This paper first aimed at illustrating current outstanding questions and investigations to better understand and model the neural mechanisms for the online adaptive coordination of multiple learning strategies in humans and animals. Secondly, it reviewed a series of recent robot learning experiments aimed at testing such bio-inspired principles for the coordination of model-based and model-free reinforcement learning strategies.
We discussed the respective advantages and disadvantages of different coordination mechanisms: on the one hand, mechanisms relying on instantaneous measures of uncertainty, choice confidence, performance, as well as computational cost; on the other hand, mechanisms relying on hierarchical learning where a high-level meta-controller autonomously learns which strategy is the most efficient in each situation.
The robotic experiments discussed here showed that this type of coordination principle can work efficiently, robustly and at a reduced computational cost in a variety of robotic scenarios (navigation, human-robot interaction). This is of particular importance at a time when energy saving is a critical issue for the planet and for slowing down global warming. In contrast, many current machine learning techniques, especially those relying on deep learning, require tremendous amounts of energy and long pre-training phases.
Finally, the paper also aimed at illustrating the interest of testing neuro-inspired models in real robots interacting with the real world, so as to generate novel model-driven predictions for neuroscience and psychology. In the particular case of the adaptive coordination of model-based and model-free reinforcement learning strategies, we showed that some situations can induce cooperation between learning strategies. We moreover showed that taking into account not only the uncertainty of each learning system but also its computational cost could work efficiently in a variety of tasks. This raises the prediction that the mammalian brain may also monitor and memorize the average computational cost (for instance, in terms of the duration required for inference) of different learning strategies in different memory systems, in order to favor those which cost less when they are equally efficient. This paves the way for novel neuroscience experiments aimed at testing these new model-driven predictions and understanding the underlying neural mechanisms.
References
Arleo, A., Smeraldi, F., Gerstner, W.: Cognitive navigation based on nonuniform Gabor space sampling, unsupervised growing networks, and reinforcement learning. IEEE Trans. Neural Netw. 15(3), 639–652 (2004). https://doi.org/10.1109/TNN.2004.826221
Bellot, J., Sigaud, O., Khamassi, M.: Which temporal difference learning algorithm best reproduces dopamine activity in a multi-choice task? In: Ziemke, T., Balkenius, C., Hallam, J. (eds.) SAB 2012. LNCS (LNAI), vol. 7426, pp. 289–298. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-33093-3_29
Bellot, J., Sigaud, O., Roesch, M.R., Schoenbaum, G., Girard, B., Khamassi, M.: Dopamine neurons activity in a multi-choice task: reward prediction error or value function? In: Proceedings of the French Computational Neuroscience NeuroComp12 Workshop, pp. 1–7 (2012)
Burgess, N., Maguire, E.A., O’Keefe, J.: The human hippocampus and spatial and episodic memory. Neuron 35(4), 625–641 (2002)
Caluwaerts, K., et al.: Neuro-inspired navigation strategies shifting for robots: integration of a multiple landmark taxon strategy. In: Prescott, T.J., Lepora, N.F., Mura, A., Verschure, P.F.M.J. (eds.) Living Machines 2012. LNCS (LNAI), vol. 7375, pp. 62–73. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-31525-1_6
Caluwaerts, K., et al.: A biologically inspired meta-control navigation system for the Psikharpax rat robot. Bioinspiration Biomim. 7, 025009 (2012)
Cassandra, A.R., Kaelbling, L.P., Littman, M.L.: Acting optimally in partially observable stochastic domains. In: AAAI, vol. 94, pp. 1023–1028 (1994)
Cazé, R., Khamassi, M., Aubin, L., Girard, B.: Hippocampal replays under the scrutiny of reinforcement learning models. J. Neurophysiol. 120(6), 2877–2896 (2018)
Chatila, R., et al.: Toward self-aware robots. Front. Robot. AI 5(1), 88–108 (2018)
Chebotar, Y., Hausman, K., Zhang, M., Sukhatme, G., Schaal, S., Levine, S.: Combining model-based and model-free updates for trajectory-centric reinforcement learning. arXiv preprint arXiv:1703.03078 (2017)
Coutureau, E., Killcross, S.: Inactivation of the infralimbic prefrontal cortex reinstates goal-directed responding in overtrained rats. Behav. Brain Res. 146(1–2), 167–174 (2003)
Daw, N.D., Gershman, S.J., Seymour, B., Dayan, P., Dolan, R.J.: Model-based influences on humans’ choices and striatal prediction errors. Neuron 69(6), 1204–1215 (2011)
Daw, N.D., Niv, Y., Dayan, P.: Uncertainty-based competition between prefrontal and dorsolateral striatal systems for behavioral control. Nat. Neurosci. 8(12), 1704–1711 (2005)
Dayan, P.: Improving generalization for temporal difference learning: the successor representation. Neural Comput. 5(4), 613–624 (1993)
Decker, J.H., Otto, A.R., Daw, N.D., Hartley, C.A.: From creatures of habit to goal-directed learners: tracking the developmental emergence of model-based reinforcement learning. Psychol. Sci. 27(6), 848–858 (2016)
Dezfouli, A., Balleine, B.W.: Habits, action sequences and reinforcement learning. Eur. J. Neurosci. 35(7), 1036–1051 (2012)
Dickinson, A., Balleine, B.: Motivational control of goal-directed action. Anim. Learn. Behav. 22(1), 1–18 (1994)
Dollé, L., Chavarriaga, R., Guillot, A., Khamassi, M.: Interactions of spatial strategies producing generalization gradient and blocking: a computational approach. PLoS Comput. Biol. 14(4), e1006092 (2018)
Dollé, L., Khamassi, M., Girard, B., Guillot, A., Chavarriaga, R.: Analyzing interactions between navigation strategies using a computational model of action selection. In: Freksa, C., Newcombe, N.S., Gärdenfors, P., Wölfl, S. (eds.) Spatial Cognition 2008. LNCS (LNAI), vol. 5248, pp. 71–86. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-87601-4_8
Dollé, L., Sheynikhovich, D., Girard, B., Chavarriaga, R., Guillot, A.: Path planning versus cue responding: a bio-inspired model of switching between navigation strategies. Biol. Cybern. 103(4), 299–317 (2010)
Doncieux, S., et al.: Dream architecture: a developmental approach to open-ended learning in robotics. arXiv preprint arXiv:2005.06223 (2020)
Doya, K.: Reinforcement learning in continuous time and space. Neural Comput. 12(1), 219–245 (2000)
Dromnelle, R., Girard, B., Renaudo, E., Chatila, R., Khamassi, M.: Coping with the variability in humans reward during simulated human-robot interactions through the coordination of multiple learning strategies. In: Proceedings of the 29th IEEE International Conference on Robot and Human Interactive Communication (RO-MAN 2020), Naples, Italy (2020)
Dromnelle, R., Renaudo, E., Pourcel, G., Chatila, R., Girard, B., Khamassi, M.: How to reduce computation time while sparing performance during robot navigation? A neuro-inspired architecture for autonomous shifting between model-based and model-free learning. In: 9th International Conference on Biomimetic & Biohybrid Systems (Living Machines 2020). pp. 1–12. LNAI, Online Conference (Initially Planned in Freiburg, Germany) (2020)
Eichenbaum, H.: Prefrontal-hippocampal interactions in episodic memory. Nat. Rev. Neurosci. 18(9), 547–558 (2017)
Frankland, P.W., Bontempi, B.: The organization of recent and remote memories. Nat. Rev. Neurosci. 6(2), 119–130 (2005)
Gupta, A.S., van der Meer, M.A., Touretzky, D.S., Redish, A.D.: Hippocampal replay is not a simple function of experience. Neuron 65(5), 695–705 (2010)
Hafez, M.B., Weber, C., Kerzel, M., Wermter, S.: Curious meta-controller: adaptive alternation between model-based and model-free control in deep reinforcement learning. In: 2019 International Joint Conference on Neural Networks (IJCNN), pp. 1–8. IEEE (2019)
Hangl, S., Dunjko, V., Briegel, H.J., Piater, J.: Skill learning by autonomous robotic playing using active learning and creativity. arXiv preprint arXiv:1706.08560 (2017)
Jauffret, A., Cuperlier, N., Gaussier, P., Tarroux, P.: From self-assessment to frustration, a small step toward autonomy in robotic navigation. Front. Neurorobotics 7, 16 (2013)
Jog, M.S., Kubota, Y., Connolly, C.I., Hillegaart, V., Graybiel, A.M.: Building neural representations of habits. Science 286(5445), 1745–1749 (1999)
Johnson, A., Redish, A.D.: Neural ensembles in CA3 transiently encode paths forward of the animal at a decision point. J. Neurosci. 27(45), 12176–12189 (2007)
Kahneman, D.: Thinking, Fast and Slow. Macmillan, New York (2011)
Keramati, M., Dezfouli, A., Piray, P.: Speed/accuracy trade-off between the habitual and the goal-directed processes. PLoS Comput. Biol. 7(5), e1002055 (2011)
Khamassi, M.: Complementary roles of the rat prefrontal cortex and striatum in reward-based learning and shifting navigation strategies. Ph.D. thesis, Université Pierre et Marie Curie-Paris VI (2007)
Khamassi, M., Girard, B.: Modeling awake hippocampal reactivations with model-based bidirectional search. Biol. Cybern. 114, 231–248 (2020)
Khamassi, M., Humphries, M.D.: Integrating cortico-limbic-basal ganglia architectures for learning model-based and model-free navigation strategies. Front. Behav. Neurosci. 6, 79 (2012)
Khamassi, M., Velentzas, G., Tsitsimis, T., Tzafestas, C.: Robot fast adaptation to changes in human engagement during simulated dynamic social interaction with active exploration in parameterized reinforcement learning. IEEE Trans. Cogn. Dev. Syst. 10(4), 881–893 (2018)
Killcross, S., Coutureau, E.: Coordination of actions and habits in the medial prefrontal cortex of rats. Cereb. Cortex 13(4), 400–408 (2003)
Lee, S.W., Shimojo, S., O’Doherty, J.P.: Neural computations underlying arbitration between model-based and model-free learning. Neuron 81(3), 687–699 (2014)
Lesaint, F., Sigaud, O., Flagel, S.B., Robinson, T.E., Khamassi, M.: Modelling individual differences in the form of Pavlovian conditioned approach responses: a dual learning systems approach with factored representations. PLoS Comput. Biol. 10(2) (2014). https://doi.org/10.1371/journal.pcbi.1003466
Leutgeb, S., Leutgeb, J.K., Barnes, C.A., Moser, E.I., McNaughton, B.L., Moser, M.B.: Independent codes for spatial and episodic memory in hippocampal neuronal ensembles. Science 309(5734), 619–623 (2005)
Llofriu, M., et al.: A computational model for a multi-goal spatial navigation task inspired by rodent studies. In: 2019 International Joint Conference on Neural Networks (IJCNN), pp. 1–8. IEEE (2019)
Maffei, G., Santos-Pata, D., Marcos, E., Sánchez-Fibla, M., Verschure, P.F.: An embodied biologically constrained model of foraging: from classical and operant conditioning to adaptive real-world behavior in DAC-X. Neural Netw. 72, 88–108 (2015)
Mattar, M.G., Daw, N.D.: Prioritized memory access explains planning and hippocampal replay. Nat. Neurosci. 21(11), 1609–1617 (2018)
McClelland, J.L., McNaughton, B.L., O’Reilly, R.C.: Why there are complementary learning systems in the hippocampus and neocortex: insights from the successes and failures of connectionist models of learning and memory. Psychol. Rev. 102(3), 419 (1995)
Meyer, J.A., Guillot, A., Girard, B., Khamassi, M., Pirim, P., Berthoz, A.: The Psikharpax project: towards building an artificial rat. Robot. Auton. Syst. 50(4), 211–223 (2005)
Momennejad, I.: Learning structures: predictive representations, replay, and generalization. Curr. Opin. Behav. Sci. 32, 155–166 (2020)
Moore, A.W., Atkeson, C.G.: Prioritized sweeping: reinforcement learning with less data and less time. Mach. Learn. 13(1), 103–130 (1993)
Nagabandi, A., Kahn, G., Fearing, R.S., Levine, S.: Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning. In: 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 7559–7566. IEEE (2018)
Nakahara, H., Doya, K., Hikosaka, O.: Parallel cortico-basal ganglia mechanisms for acquisition and execution of visuomotor sequences-a computational approach. J. Cogn. Neurosci. 13(5), 626–647 (2001)
O’Doherty, J.P., Lee, S., Tadayonnejad, R., Cockburn, J., Iigaya, K., Charpentier, C.J.: Why and how the brain weights contributions from a mixture of experts (2020)
O’keefe, J., Nadel, L.: The Hippocampus as a Cognitive Map. Clarendon Press, Oxford (1978)
Ostlund, S.B., Balleine, B.W.: Lesions of medial prefrontal cortex disrupt the acquisition but not the expression of goal-directed learning. J. Neurosci. 25(34), 7763–7770 (2005)
Packard, M.G., Knowlton, B.J.: Learning and memory functions of the basal ganglia. Annu. Rev. Neurosci. 25(1), 563–593 (2002)
Peng, J., Williams, R.J.: Efficient learning and planning within the Dyna framework. Adapt. Behav. 1(4), 437–454 (1993)
Pezzulo, G., Rigoli, F., Chersi, F.: The mixed instrumental controller: using value of information to combine habitual choice and mental simulation. Front. Psychol. 4, 92 (2013)
Renaudo, E., Girard, B., Chatila, R., Khamassi, M.: Design of a control architecture for habit learning in robots. In: Duff, A., Lepora, N.F., Mura, A., Prescott, T.J., Verschure, P.F.M.J. (eds.) Living Machines 2014. LNCS (LNAI), vol. 8608, pp. 249–260. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-09435-9_22
Renaudo, E., Girard, B., Chatila, R., Khamassi, M.: Respective advantages and disadvantages of model-based and model-free reinforcement learning in a robotics neuro-inspired cognitive architecture. In: Biologically Inspired Cognitive Architectures BICA 2015, Lyon, France, pp. 178–184 (2015)
Renaudo, E., Girard, B., Chatila, R., Khamassi, M.: Which criteria for autonomously shifting between goal-directed and habitual behaviors in robots? In: 5th International Conference on Development and Learning and on Epigenetic Robotics (ICDL-EPIROB), Providence, RI, USA, pp. 254–260. (2015)
Rojas-Castro, D.M., Revel, A., Menard, M.: Rhizome architecture: an adaptive neurobehavioral control architecture for cognitive mobile robots’ application in a vision-based indoor robot navigation context. Int. J. Soc. Robot. (3), 1–30 (2020)
Ruvolo, P., Eaton, E.: ELLA: an efficient lifelong learning algorithm. In: International Conference on Machine Learning, pp. 507–515 (2013)
Santos-Pata, D., Zucca, R., Verschure, P.F.M.J.: Navigate the unknown: implications of grid-cells “Mental Travel” in vicarious trial and error. In: Lepora, N.F.F., Mura, A., Mangan, M., Verschure, P.F.M.J.F.M.J., Desmulliez, M., Prescott, T.J.J. (eds.) Living Machines 2016. LNCS (LNAI), vol. 9793, pp. 251–262. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-42417-0_23
Schultz, W., Dayan, P., Montague, P.R.: A neural substrate of prediction and reward. Science 275, 1593–1599 (1997)
Stachenfeld, K.L., Botvinick, M.M., Gershman, S.J.: The hippocampus as a predictive map. Nat. Neurosci. 20(11), 1643 (2017)
Stoianov, I., Maisto, D., Pezzulo, G.: The hippocampal formation as a hierarchical generative model supporting generative replay and continual learning. bioRxiv (2020)
Sutton, R.S.: Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In: Proceedings of the Seventh International Conference on Machine Learning, pp. 216–224 (1990)
Sutton, R.S., Barto, A.G.: Introduction to Reinforcement Learning, 1st edn. MIT Press, Cambridge (1998)
Thrun, S.: Lifelong learning algorithms. In: Thrun, S., Pratt, L. (eds.) Learning to Learn, pp. 181–209. Springer, Boston (1998). https://doi.org/10.1007/978-1-4615-5529-2_8
Viejo, G., Khamassi, M., Brovelli, A., Girard, B.: Modeling choice and reaction time during arbitrary visuomotor learning through the coordination of adaptive working memory and reinforcement learning. Front. Behav. Neurosci. 9, 225 (2015)
Watkins, C.J., Dayan, P.: Q-learning. Mach. Learn. 8(3–4), 279–292 (1992)
Wiering, M.A., van Hasselt, H.: Ensemble algorithms in reinforcement learning. IEEE Trans. Syst. Man Cybern. Part B 38(4), 930–936 (2008). https://doi.org/10.1109/TSMCB.2008.920231
Wise, S.P.: The role of the basal ganglia in procedural memory. In: Seminars in Neuroscience, vol. 8, pp. 39–46. Elsevier (1996)
Acknowledgements
The author would like to thank all his collaborators who have contributed through the years to this line of research. In particular, Andrea Brovelli, Romain Cazé, Ricardo Chavarriaga, Laurent Dollé, Benoît Girard, Agnes Guillot, Mark Humphries, Florian Lesaint, Olivier Sigaud, Guillaume Viejo for their contribution to the design, implementation, test, and analysis of computational models of the coordination of learning processes in humans and animals. And Rachid Alami, Lise Aubin, Ken Caluwaerts, Raja Chatila, Aurélie Clodic, Sandra Devin, Rémi Dromnelle, Antoine Favre-Félix, Benoît Girard, Christophe Grand, Agnes Guillot, Jean-Arcady Meyer, Steve N’Guyen, Guillaume Pourcel, Erwan Renaudo, Mariacarla Staffa for their contribution to the design, implementation, test and analysis of robotic experiments aimed at testing neuro-inspired principles for the coordination of learning processes.
Funding
This work has been funded by the Centre National de la Recherche Scientifique (CNRS)’s interdisciplinary programs (MITI) under the grant name ‘Hippocampal replay through the prism of reinforcement learning’.
Ethics declarations
The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Copyright information
© 2020 Springer Nature Switzerland AG
Cite this paper
Khamassi, M. (2020). Adaptive Coordination of Multiple Learning Strategies in Brains and Robots. In: Martín-Vide, C., Vega-Rodríguez, M.A., Yang, MS. (eds) Theory and Practice of Natural Computing. TPNC 2020. Lecture Notes in Computer Science(), vol 12494. Springer, Cham. https://doi.org/10.1007/978-3-030-63000-3_1