
1 Introduction

The continual interaction between an organism and its environment requires an active form of regulation of the mechanisms safeguarding its integrity. There are several aspects an agent must consider, ranging from assessing various sources of information to anticipating changes in its surroundings. In order to decide what to do, an agent must deliberate between different courses of action and factor in the potential costs and benefits derived from its hypothetical future behavior. This process of selection among different value-based choices can be formally described as an optimization problem. Depending on the formalism, the cost or utility functions optimized by the agent presuppose different normative interpretations.

In reinforcement learning (RL), for instance, an agent has to maximize the expected reward guided by a signal provided externally by the environment in an oracular fashion. In some cases the reward is complemented with an intrinsic contribution, generally corresponding to an epistemic deficiency within the agent, for example prediction error [24], novelty [3, 5, 23] or ensemble disagreement [25]. It is important to note that incorporating these surrogate rewards into the objectives of an agent is often regarded as one of many possible enhancements to increase its performance, rather than being motivated by a concern with explaining the roots of goal-directed behavior.

In active inference [14], the optimization is framed as the minimization of the variational free energy, seeking to reduce the difference between sensations and predictions. Instead of rewards, the agent holds a prior over preferred future outcomes; thus an agent minimizing its free energy acts to maximize the occurrence of these preferences and to minimize its own surprisal. Value arises not as an external property of the environment, but is instead conferred by the agent as a contextual consequence of the interplay between its current configuration and the interpretation of stimuli.

Several recent studies have successfully demonstrated how to reformulate RL and control tasks under the active inference framework. While for living processes it is reasonable to assume that priors emerge and are refined over evolutionary scales and during a lifetime, translating this view into a detailed algorithmic characterization raises important considerations, because there is no evolutionary prior to draw from. Approaches to specifying a distribution of preferences have therefore included, for instance, encoding the reward an RL agent would receive as the prior [16, 21, 29, 32, 33, 34], connecting it to task objectives [29], or relying on expert demonstrations [6, 7, 30].

In principle this suggests that much of the effort that goes into reward engineering in RL is relocated to specifying preferred outcomes or to defining a phase space. Nonetheless, active inference provides important conceptual adjustments that could facilitate more principled schemes towards a theory of agents, one offering a richer account of autonomous behavior and the self-generation of goals, desires or preferences. These include the formulation of objectives and utilities in a common language residing in belief space, and an appeal to a worldview in which utility is not treated as independent or detached from the agent. The latter, in particular, could encourage a more organismic perspective of the agent in terms of the perturbations it must endure and the behavioral policies it attains to maintain its integrity [11].

Here we explore this direction by considering how a signal acquires functional significance as the agent identifies it as a condition necessary for its viability and future continuity in the environment. Mandated by an imperative to minimize surprisal, the agent learns to associate sensorimotor events with specific outcomes. We start by introducing the surprise minimizing RL (SMiRL) specification [4] before giving a brief overview of the expected free energy. We then motivate our approach from the perspective of a self-regulatory organism. Finally, we present results from our case study and close with some observations and potential further directions.

2 Preliminaries

2.1 Model-Free Surprisal Minimization

Consider an environment whose generative process produces a state \(s_t \in \mathcal {S}\) at each time step t, resulting in an agent observing \(o_t \in \mathcal {O}\). The agent acts on the environment with \(a_t \in \mathcal {A}\) according to a policy \(\pi \), obtaining the next observation \(o_{t+1}\). Suppose the agent performs density estimation on the last \(t-k\) observations to obtain a current set of parameters \(\theta _t\) summarizing \(p_{\theta }(o)\). As these sufficient statistics contain information about the agent-environment coupling, they are concatenated with the observations into an augmented state \(x_t = (o_t, \theta _t)\). At every time step, the agent computes the surprisal generated by a new observation given its current estimate and then updates the estimate accordingly. To minimize surprisal in this model-free RL setting, the agent should maximize the expected discounted log model evidence \(\mathbb {E}[\sum _t \gamma ^t \ln p_{\theta _{t}}(o_t)]\) [4]. Alternatively, to maintain consistency with active inference, we express the optimal surprisal Q-function as,

$$\begin{aligned} Q_{\pi ^*}(x, a) = \mathbb {E}_\pi [-\ln p_\theta (o) + \gamma \min _{a'} Q_{\pi ^*}(x',a')] \end{aligned}$$
(1)

estimated via DQN [22] or any function approximator with parameters \(\phi \) such that \(Q_{\pi ^*}(x,a) \approx Q(x,a;\phi )\).
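As an illustrative sketch only (not the implementation used in [4] or in this work), the snippet below shows how the augmented state and the Bellman target of Eq. (1) could be assembled, assuming a simple diagonal-Gaussian running density plays the role of \(p_\theta (o)\); the class and function names are hypothetical.

```python
# Sketch of a surprisal-minimizing DQN target (Eq. 1), assuming a diagonal-Gaussian
# running density as p_theta(o). Names are illustrative, not from the original work.
import numpy as np
import torch
import torch.nn as nn

class RunningGaussian:
    """Running mean/variance over past observations, standing in for p_theta(o)."""
    def __init__(self, dim: int, eps: float = 1e-4):
        self.mean = np.zeros(dim)
        self.var = np.ones(dim)
        self.count = eps

    def update(self, o: np.ndarray) -> None:
        # Welford-style incremental update of the sufficient statistics theta_t
        self.count += 1.0
        delta = o - self.mean
        self.mean = self.mean + delta / self.count
        self.var = self.var + (delta * (o - self.mean) - self.var) / self.count

    def surprisal(self, o: np.ndarray) -> float:
        # -ln p_theta(o) for a diagonal Gaussian
        return float(0.5 * np.sum(np.log(2 * np.pi * self.var)
                                  + (o - self.mean) ** 2 / self.var))

    def stats(self) -> np.ndarray:
        return np.concatenate([self.mean, self.var])

def augmented_state(o: np.ndarray, density: RunningGaussian) -> torch.Tensor:
    # x_t = (o_t, theta_t): observation concatenated with the sufficient statistics
    return torch.as_tensor(np.concatenate([o, density.stats()]), dtype=torch.float32)

def dqn_target(q_net: nn.Module, x_next: torch.Tensor,
               surprisal: float, gamma: float = 0.99) -> torch.Tensor:
    # Bellman target for the surprisal cost: -ln p_theta(o) + gamma * min_a' Q(x', a')
    with torch.no_grad():
        return surprisal + gamma * q_net(x_next.unsqueeze(0)).min(dim=1).values
```

A standard DQN regression loss between \(Q(x,a;\phi )\) and this target would then be minimized by gradient descent.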

2.2 Expected Free Energy

The free energy principle (FEP) [15] has evolved from an account of message passing in the brain to propose a probabilistic interpretation of self-organizing phenomena [13, 27, 28]. Central to current discourse around the FEP is the notion of the Markov blanket, which describes a causal separation between the internal states of a system and external states, mediated by the interfacing blanket states (i.e. sensory and active states). The FEP advances the view that a system remains far from equilibrium by maintaining a low entropy distribution over the states it occupies during its lifetime. Accordingly, the system attempts to minimize the surprisal of an event at a particular point in time.

This can be more concretely specified if we consider a distribution p(o) encoding the states, drives or desires the system should fulfil. Thus the system strives to obtain an outcome o that minimizes the surprisal \(- \ln p(o)\). Alternatively, we can also state this as the agent maximizing its model evidence or marginal likelihood p(o). In most cases estimating the actual marginal is intractable; therefore a system instead minimizes the free energy [10, 18], which provides an upper bound on the negative log marginal (the surprisal) [19],

$$\begin{aligned} \mathbf {F} = \mathbb {E}_{q(s)}[\ln q(s) -\ln p(o,s)] \end{aligned}$$
(2)

where \(p(o,s)\) is the generative model and \(q(s)\) the variational density approximating hidden causes. Equation 2 is used to compute a static form of free energy and to infer hidden causes given a set of observations. However, if we instead consider an agent that acts over an extended temporal dimension, it must infer and select policies that minimize the expected free energy (EFE) \(\mathbf {G}\) [14] of a policy \(\pi \) at a future step \(\tau >t\). This can be expressed as,

$$\begin{aligned} \mathbf {G}(\pi , \tau ) = \mathbb {E}_{q(o_{\tau }, s_{\tau }|\pi )}[\ln q(s_\tau |\pi ) - \ln p(o_\tau ,s_\tau |\pi )] \end{aligned}$$
(3)

where \(p(o_\tau ,s_\tau |\pi )=q(s_\tau |o_\tau ,\pi )p(o_\tau )\) is the generative model of the future. Rearranging, \(\mathbf {G}\) can be written as,

$$\begin{aligned} \mathbf {G}(\pi , \tau ) = - \underbrace{\mathbb {E}_{q(o_\tau |\pi )}[\ln p(o_\tau )]}_{instrumental\ value} -\underbrace{\mathbb {E}_{q(o_\tau |\pi )}\big [D_{KL}[q(s_\tau |o_\tau ,\pi )||q(s_\tau |\pi )]\big ]}_{epistemic\ value} \end{aligned}$$
(4)

which illustrates how the EFE entails a pragmatic, instrumental or goal-seeking term that realizes preferences and an epistemic or information-seeking term that resolves uncertainty. An agent selects a policy with probability \(q(\pi ) = \sigma (-\beta \sum _\tau \mathbf {G}(\pi , \tau ))\), where \(\sigma \) is the softmax function and \(\beta \) is the inverse temperature. In summary, an agent minimizes its free energy via active inference by changing its beliefs about the world or by sampling the regions of the space that conform to its beliefs.
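As a minimal sketch of this policy-selection rule (our own illustration; the array shapes are assumptions), the softmax over summed expected free energies can be computed as follows.

```python
# Sketch: policy selection q(pi) = softmax(-beta * sum_tau G(pi, tau)).
import numpy as np

def policy_distribution(G: np.ndarray, beta: float = 1.0) -> np.ndarray:
    """G has shape (num_policies, horizon): EFE of each policy at each future step."""
    logits = -beta * G.sum(axis=1)
    logits -= logits.max()            # subtract max for numerical stability
    q = np.exp(logits)
    return q / q.sum()

# e.g. three candidate policies evaluated over a 4-step horizon
G = np.array([[1.2, 0.9, 1.1, 1.0],
              [0.4, 0.5, 0.6, 0.5],
              [2.0, 1.8, 1.9, 2.1]])
print(policy_distribution(G, beta=2.0))   # the lowest-EFE policy receives most mass
```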

3 Adaptive Control via Self-regulation

The concept of homeostasis has played a crucial role in our understanding of physiological regulation. It describes the capacity of a system to maintain its internal variables within certain bounds. Recent developments describing the behavior of self-organizing systems under the FEP can be interpreted as an attempt to formalize this concept [28]. From this point of view, homeostatic control in an organism refers to the actions necessary to minimize the surprisal of the values reported by interoceptive channels, constraining them to those favored by a viable set of states. What is less well understood is how these attracting states come into existence, that is, how they emerge from the particular conditions surrounding the system and how they are discovered among the potential space of signals.

Recently, it has been shown that complex behavior may arise by minimizing surprisal in observation space (i.e. sensory states) without pre-encoded fixed prior distributions in large state spaces [4]. Here we consider an alternative angle intended to remain closer to the homeostatic characterization of a system. In our scenario, we assume that, given the particular dynamics of an environment, an agent equipped only with a basic density estimation capacity will find it difficult to structure its behavior around the type of regularities in observation space that can sustain it in time. In such situations with fast-changing dynamics, rather than minimizing free energy over sensory signals, the agent may instead leverage them to maintain a low future surprisal of another target variable. This implies that although the agent may in principle have access to multiple signals, it might be interested in maintaining only some of them within a certain expected range.

Defining what should constitute the artificial physiology of simulated agents is not well established. We therefore assume the introduction of an information channel representing, in abstract terms, the interoceptive signals that inform the agent about its continuity in the environment. A rudimentary comparison is to think of this value in a way similar to how feelings agglutinate and coarse-grain the changes of several internal physical responses [9]. In addition, we are interested in the agent learning to determine whether this signal is conducive to its self-preservation in the environment or not.

3.1 Case Study

We assess the behavior of an agent in the Flappy Bird environment (Fig. 1 left). This is a task where a bird must navigate between obstacles (pipes) at different positions while stabilizing its flight. Despite its apparent simplicity, the environment exhibits a fundamental aspect of the physical world: the inherent dynamics lead spontaneously to the functional disintegration of the agent. If the agent stops propelling itself, it succumbs to gravity and falls. At the same time, the environment has a constant scrolling rate, which implies that the agent cannot remain floating at a single point and cannot survive simply by flying aimlessly. Originally, the task provides a reward every time the bird passes between two pipes; however, in our case study the information about the rewards is never propagated and therefore has no impact on the behavior of the agent.

The agent receives a feature vector of observations indicating its location and those of the obstacles. In addition, the agent obtains a measurement v indicating its presence in the task (i.e. 1 or 0). This measurement does not represent anything positive or negative by itself; it is simply another signal that we assume the agent is able to calculate. As outlined in Sect. 2.1, the agent monitors the last \(t-k\) values of this measurement and estimates their density to obtain \(\theta _t\). These become the statistics describing the current approximated distribution of preferences \(p(v|\theta _t)\) or \(p_{\theta _t}(v)\), which are also used to augment the observations to \(x_t=(o_t,\theta _t)\). When the agent takes a new measurement \(v_{t}\), it evaluates the surprisal against \(p_{\theta _{t-1}}(v_t)\), in this case via a Bernoulli density function such that \(-\ln p_{\theta _{t-1}}(v_t) = - (v_t \ln \theta _{t-1} + (1-v_t) \ln (1-\theta _{t-1}))\). First, we train a baseline model-free surprisal-minimizing DQN as specified in Sect. 2.1, parameterized by a neural network (NN). Then we examine the behavior of a second agent that minimizes the expected free energy. This agent learns an augmented state transition model of the world, parameterized by an ensemble of NNs, and an expected surprisal model, parameterized by another NN. To identify an optimal policy we apply rolling horizon evolution [26] to generate candidate policies \(\pi =(a_\tau ,...,a_T)\) and to associate each with an expected free energy given by (Appendix A)

$$\begin{aligned} \mathbf {G}(\pi ,\tau ) \approx -\mathbb {E}_{q(o_\tau ,v_\tau ,\theta |\pi )}D_{KL}[q(s_\tau |o_\tau ,v_\tau ,\pi )||q(s_\tau |\pi )] - \mathbb {E}_{q(v_\tau ,\theta ,s_\tau |\pi )}[\ln p_\theta (v_\tau )] \end{aligned}$$
(5)

If we explicitly consider the model parameters \(\phi \), Eq. 5 can be decomposed as (Appendix B),

$$\begin{aligned} \mathbf {G}(\pi ,\tau )&\approx -\underbrace{\mathbb {E}_{q(o_\tau ,v_\tau ,\phi |\pi )}D_{KL}[q(s_\tau |o_\tau ,v_\tau ,\pi )||q(s_\tau |\pi )]}_{salience}\nonumber \\&\quad -\underbrace{\mathbb {E}_{q(o_\tau ,v_\tau ,s_\tau |\pi )}D_{KL}[q(\phi |s_\tau , o_\tau , v_\tau , \pi )||q(\phi )]}_{novelty}\nonumber \\&\quad - \underbrace{\mathbb {E}_{q(o_\tau ,v_\tau ,s_\tau ,\phi |\pi )}[\ln p_\theta (v_\tau )]}_{instrumental\ value}\nonumber \\ \end{aligned}$$
(6)

The expression further unpacks the epistemic contributions to the EFE in terms of salience and novelty [17], which refer to the expected reduction in uncertainty about hidden causes and about the parameters, respectively. For this task \(o=s\), thus only the first and third terms are considered.
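To make these quantities concrete, the sketch below (our own illustration; `transition_ensemble` and its interface are hypothetical placeholders) computes the Bernoulli surprisal of a measurement under the running preference statistic \(\theta \) and scores a candidate action sequence using only the instrumental term of Eq. (5); the salience term and the full rolling horizon search are omitted for brevity.

```python
# Sketch of the quantities monitored in the case study; the transition_ensemble
# below is a hypothetical iterable of learned models, not the original code.
import numpy as np

def bernoulli_surprisal(v: float, theta: float, eps: float = 1e-6) -> float:
    # -ln p_theta(v) = -(v ln theta + (1 - v) ln(1 - theta))
    theta = np.clip(theta, eps, 1.0 - eps)
    return float(-(v * np.log(theta) + (1.0 - v) * np.log(1.0 - theta)))

def update_theta(history: list, window: int = 200) -> float:
    # Sufficient statistic of the running Bernoulli preference distribution p_theta(v)
    return float(np.mean(history[-window:]))

def score_policy(x0: np.ndarray, actions: list, transition_ensemble,
                 theta: float) -> float:
    """Expected surprisal of future measurements v_tau under p_theta(v), i.e. the
    negative instrumental value of a candidate policy (lower is better). The
    epistemic (salience) term of Eq. (5) is not included in this sketch."""
    G = 0.0
    for model in transition_ensemble:          # expectation over ensemble members
        x = x0
        for a in actions:
            x, v_pred = model(x, a)            # predicted next state and measurement
            G += bernoulli_surprisal(v_pred, theta) / len(transition_ensemble)
    return G
```

In rolling horizon evolution, a population of such action sequences is mutated and re-scored at every step, and only the first action of the best-scoring sequence is executed.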

3.2 Evaluation

The plot in Fig. 1 (center) tracks the performance of an EFE agent in the environment (averaged over 10 seeds). The dotted line represents the surprisal-minimizing DQN agent after 1000 episodes. The left axis corresponds to the (unobserved) task reward while the right axis indicates the approximate number of time steps the agent survives. During the first trials, before the agent exhibits any form of competence, the natural coupling between agent and environment grants the agent a life expectancy of roughly 19–62 time steps in the task. This is essential as it starts to populate the statistics of v. Measuring a specific quantity v, although initially representing just another signal, begins to acquire a certain value due to the frequency with which it occurs. In turn, this starts to dictate the preferences of the agent, as it hints that measuring a certain signal correlates with having a stable configuration for this particular environment, as implied by its low surprisal. Fig. 1 (right) shows the evolution of the parameter \(\theta \) (averaged within an episode) corresponding to the distribution of preferred measurements \(p_\theta (v)\), which determines the level of surprisal assigned when receiving the next v. As the agent reduces its uncertainty about the environment it also becomes more capable of associating sensorimotor events with specific measurements. Its behavior becomes more consistent with seeking less surprising measurements and, as we observe, this reinforces its preferences, exhibiting the circular self-evidencing dynamics that characterize an agent minimizing its free energy.

Fig. 1. Left: The Flappy Bird environment. Center: Performance of an EFE agent. The left axis indicates the unobserved rewards as reported by the task and the right axis the number of time steps it survives in the environment. The dotted line shows the average performance of an SM-DQN after 1000 episodes. Right: Parameter \(\theta \) in time, summarizing the intra-episode sufficient statistics of \(p_\theta (v)\).

4 Discussion

Learning Preferences in Active Inference: The major thesis in active inference is the notion of an agent acting in order to minimize its expected surprise. This implies that the agent will exhibit a tendency to seek the sort of outcomes that have high prior probability according to a biased model of the world, giving rise to goal-directed behavior. Due to the difficulty of modeling agents that exhibit increasing levels of autonomy, agent-based simulations under this framework, much as has largely occurred in RL, have tended to concentrate on generating a particular expected behavior in the agent. That is, on how to make the agent perform a task by encoding predefined goals [16, 21, 29, 32, 33, 34] or providing guidance [6, 7, 30]. However, there has been recent progress in mitigating this issue. For example, in some of the simulations in [29] the authors included a distribution over prior preferences to account for each of the cells in Frozen Lake, a gridworld-like environment. Over time the prior preferences are tuned, leading to habit formation. Most related to our work are the studies on surprise minimizing RL (SMiRL) by [4], where model-free agents performed density estimation on their observation space and acquired complex behavior in various tasks by maximizing the model evidence of their observations. Here we have also opted for this approach; however, we have grounded it in organismic considerations of viability, as inspired by insights on the nature of agency and adaptive behavior [1, 11, 12]. It has been suggested that even if some of these aspects are defined exogenously, they could capture general components of all physical systems and could potentially be derived in a more objective manner than task-based utilities [20]. Moreover, these views suggest that the inherent conditions of precariousness and the perturbations an agent must face are crucial ingredients for the emergence of purpose-generating mechanisms. In that sense, our main concern has been to explore an instance of the conditions in which a stable set of attracting states arises, conferring value on observations and leading to what appeared to be self-sustaining dynamics. Although all measurements lacked any initial functional value, the model presupposes the capacity of the agent to measure its operational integrity, as would occur in an organism monitoring its bodily states. This raises the issue of establishing more principled protocols for defining what should constitute the internal milieu of an agent.

Agent-Environment Coupling: A matter for further analysis, also motivated by the results in [4], is the role of the environment in providing structure to the behavior of the agent. For instance, in the environments in [4], a distribution of preferences spontaneously built on the initial set of visual observations tends to correlate with good performance on the task. In the work presented here, the initial set of internal measurements afforded by the environment contributes to the formation of a steady state, with the visual features informing the actions necessary to maintain it. Hence, similarly to [4], the initial conditions of the agent-environment coupling that furnish the distribution p(v) provide a starting solution for the problem of self-maintenance, as long as the agent is able to preserve the statistics. Thus, if the agent lacks a sophisticated sensory apparatus or the capacity to extract invariances, or if the initial statistics of sensory data do not favor the emergence of goal-seeking behavior, tracking its internal configuration may suffice in some situations. However, this requires further unpacking, not only because, as discussed earlier, it remains uncertain how to define the internal aspects of an agent, but also because simulations often do not capture the essential characteristics of real environments either [8].

Drive Decomposition: While here we have afforded our model a certain level of independence between the sensory data and the internal measurements, it is sensible to imagine that internal states would affect perception and that perceptual misrepresentation would affect internal states. Moreover, as the agent moves from normative conditions based entirely on viability to acquiring other higher-level preferences, it learns to integrate and balance different drives and goals. From Eq. 8 it is also possible to conceive a simplified scenario and establish the following expression (Appendix D),

$$\begin{aligned} \mathbf {G}(\pi ,\tau )&\approx \underbrace{\mathbb {E}_{q(o_\tau ,v_\tau ,\theta ,s_\tau |\pi )}[\ln q(s_\tau |\pi ) - \ln p(s_\tau |o_\tau ,\pi )]}_{epistemic\ value}\nonumber \\&\quad - \underbrace{\mathbb {E}_{q(o_\tau ,v_\tau ,\theta ,s_\tau |\pi )}[\ln p(o_\tau )]}_{high\ level\ value}\nonumber \\&\quad + \underbrace{\mathbb {E}_{q(o_\tau , s_\tau |\pi )} H[p(v_\tau |s_\tau ,o_\tau ,\pi )]}_{regulatory\ value} \end{aligned}$$
(7)

where the goal-seeking value is decomposed into a component that considers preferences encoded in a distribution p(o) and another element estimating the expected entropy of the distribution of essential variables. Policies would balance these contributions, resolving hypothetical situations such as a higher-level goal being at odds with the viability of the system.
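As a small numerical illustration of the regulatory term in Eq. (7) (our own sketch, under the assumption that \(p(v_\tau |s_\tau ,o_\tau ,\pi )\) is Bernoulli with a predicted viability probability for each sampled rollout):

```python
# Sketch: regulatory value as the expected entropy of a Bernoulli p(v | s, o, pi),
# estimated by Monte Carlo over sampled rollouts. Names are illustrative.
import numpy as np

def bernoulli_entropy(p: float, eps: float = 1e-6) -> float:
    p = np.clip(p, eps, 1.0 - eps)
    return float(-(p * np.log(p) + (1.0 - p) * np.log(1.0 - p)))

def regulatory_value(predicted_probs: np.ndarray) -> float:
    # E_{q(o,s|pi)} H[p(v|s,o,pi)], averaged over rollouts of a candidate policy
    return float(np.mean([bernoulli_entropy(p) for p in predicted_probs]))

# A policy whose rollouts keep v nearly deterministic incurs a low regulatory
# penalty; uncertain viability (probabilities near 0.5) is penalized.
print(regulatory_value(np.array([0.95, 0.9, 0.99])))   # low
print(regulatory_value(np.array([0.5, 0.6, 0.45])))    # high
```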