
1 Introduction

Active vision considers an observer that can act by controlling the geometric properties of its sensor in order to improve the quality of the perceptual results [1]. The problem becomes apparent when dealing with occlusions, a limited field of view, or limited sensor resolution [2]. In many cases, selecting the next viewpoint should be done as efficiently as possible, given the limited resources for processing new observations and the time it takes to reach a new observation pose. This problem is traditionally solved with frontier-based methods [21], in which the environment is represented as an occupancy grid. These approaches rely on evaluating engineered utility functions that estimate the amount of new information each potential viewpoint would provide [8, 21]; usually, this utility function measures the number of unobserved voxels a given viewpoint would uncover. Instead of using hand-crafted heuristics, this function can also be learned from data [8, 9]. A different approach is to predict the optimal viewpoint, with respect to reducing uncertainty and ambiguity, directly from a reconstructed volumetric grid [3, 13]. A bio-inspired alternative is proposed by Rasouli et al. [17], in which action is driven by a visual attention mechanism in conjunction with a non-myopic decision-making algorithm that takes previous observations at different locations into account.

Friston et al. [7, 14] cast the active vision problem as a low-dimensional, discrete state-space Markov decision process (MDP) that can be solved using the active inference framework. In this paradigm, agents act in order to minimize their surprise, i.e. their free energy. In this paper, instead of using an explicit 3D representation or a simple MDP formulation of the environment, we learn a generative model and latent state distribution purely from observations. Previous work has also used deep learning techniques to learn the generative model in order to engage in active inference [22], while other work has created an end-to-end active inference pipeline using pixel-based observations [20]. Similar to Friston et al. [6, 7, 14], we then use the expected free energy to drive action selection. Similar to the work of Nair, Pong et al. [15], where the imagined latent state is used to compute the reward value for optimizing reinforcement learning tasks, and the work of Finn and Levine [5], where a predictive model estimates pixel observations for different control policies, we employ the imagined observations from the generative model to compute the expected free energy. We evaluate our method on a grasping task with a robotic manipulator with an in-hand camera. In this task, we want the robot to reach the target object as fast as possible; for this reason, we consider the case of best viewpoint selection. We show how active inference yields information-seeking behavior, and how the robot is able to reach its goal faster than a random or systematic grid search.

2 Active Inference

Active inference posits that all living organisms minimize free energy (FE) [6]. The variational free energy is given by:

$$\begin{aligned} F&= \mathbb{E}_{Q}[\log Q(\tilde{\mathbf{s}}) - \log P(\tilde{\mathbf{o}}, \tilde{\mathbf{s}}, \pi)] \\ &= \mathrm{D}_{KL}[Q(\tilde{\mathbf{s}})\,||\,P(\tilde{\mathbf{s}}, \pi)] - \mathbb{E}_{Q}[\log P(\tilde{\mathbf{o}} \,|\, \tilde{\mathbf{s}}, \pi)], \end{aligned}$$
(1)

where \(\tilde{\mathbf{o}}\) is a sequence of observations, \(\tilde{\mathbf{s}}\) the sequence of corresponding model belief states, \(\pi\) the followed policy, i.e. the sequence of actions taken, and \(Q(\tilde{\mathbf{s}})\) the approximate posterior of the joint distribution \(P(\tilde{\mathbf{o}}, \tilde{\mathbf{s}}, \pi)\). Crucially, in active inference, policies are selected that minimize the expected free energy \(G(\pi, \tau)\) for future timesteps \(\tau\) [6]:

$$\begin{aligned} G(\pi, \tau) \approx - \mathbb{E}_{Q(\mathbf{o}_\tau |\pi)} \big[\mathrm{D}_{KL}[Q(\mathbf{s}_\tau |\mathbf{o}_\tau, \pi)\,||\,Q(\mathbf{s}_\tau |\pi)]\big] - \mathbb{E}_{Q(\mathbf{o}_\tau |\pi)}[\log P(\mathbf{o}_\tau)]. \end{aligned}$$
(2)

This can be viewed as a trade-off between an epistemic, uncertainty-reducing term and an instrumental, goal-seeking term. The epistemic term is the Kullback-Leibler divergence between the belief over future states expected after following policy \(\pi\) and observing \(\mathbf{o}_\tau\), and the current belief under that policy. The instrumental term is the likelihood that the preferred (goal) observation will be obtained when following policy \(\pi\).
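To make the two terms concrete, the snippet below is a minimal sketch (not the authors' implementation) of how Eq. (2) could be evaluated for a single candidate pose, assuming diagonal Gaussian beliefs over the latent state and an independent Gaussian likelihood per pixel with an assumed fixed standard deviation `obs_std`:

```python
import torch.distributions as D

def expected_free_energy(prior_belief, posterior_belief, imagined_obs,
                         preferred_obs, obs_std=0.1):
    """One-sample estimate of Eq. (2) for a single candidate pose (sketch).

    `prior_belief` is Q(s_tau | pi), `posterior_belief` is Q(s_tau | o_tau, pi)
    after conditioning on the imagined observation; both are assumed to be
    torch.distributions.Normal objects with diagonal covariance.
    """
    # Epistemic term: information gained by the imagined observation.
    info_gain = D.kl_divergence(posterior_belief, prior_belief).sum()

    # Instrumental term: preference likelihood, modelled as an independent
    # Gaussian per pixel centred on the imagined observation.
    instrumental = D.Normal(imagined_obs, obs_std).log_prob(preferred_obs).sum()

    # Lower G is better: it rewards both information gain and goal proximity.
    return -info_gain - instrumental
```

In practice, the expectation over \(Q(\mathbf{o}_\tau |\pi)\) in Eq. (2) is approximated by averaging this quantity over several imagined observations per candidate pose, as described in Sect. 3.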

3 Environment and Approach

In this paper, we consider a simulated robot manipulator with an in-hand camera which can actively query observations from different viewpoints or poses by moving its gripper, as shown in Fig. 1. The robotic agent acts in a static workspace, in which we randomly spawn a red, green and blue cube of fixed size. Each such configuration of random cube positions is dubbed a scene, and the goal of the robot is to find a cube of a particular color in the workspace. The agent initially has no knowledge about the object positions and has to infer this information from multiple observations at different poses. Example observations for different downward facing poses are given in Fig. 2.

Fig. 1. Franka Emika Panda robot in the CoppeliaSim simulator in a random scene with three colored cubes. (Color figure online)

Fig. 2. Sampled observations on a grid of potential poses used for evaluating the expected free energy. (Color figure online)

To engage in active inference, the agent needs to be equipped with a generative model. This generative model should be able to generate new observations given an action, or in this particular case, a new robot pose. In contrast with [7, 14], we do not fix the generative model upfront, but learn it from data. We generate a dataset of 250 different scenes, each consisting of approximately 25 discrete time steps in which the robot observes the scene from a different viewpoint. Using this dataset, we train two deep neural networks to approximate the likelihood distribution \(P(\mathbf{o}_t | \mathbf{s}_t, \pi)\) and the approximate posterior distribution \(Q(\mathbf{s}_t | \mathbf{o}_t, \pi)\) as multivariate Gaussian distributions. In our notation, \(\mathbf{o}_t\) and \(\mathbf{s}_t\) respectively represent the observation and latent state at discrete timestep t. Both distributions are conditioned on the policy \(\pi\), the action that the robot should take in order to acquire a new observation, or equivalently, the new observation viewpoint. The models are optimized by minimizing the free energy of Eq. (1), with a zero-mean isotropic Gaussian prior \(P(\mathbf{s}_t | \pi) = \mathcal{N}(0, 1)\). Hence, the system is trained as an encoder-decoder that predicts scene observations from unseen poses, given a number of observations of the same scene at different poses. This is similar to a Generative Query Network (GQN) [4]. For more details on the model architecture and training hyperparameters, we refer to Appendix A.
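As a rough sketch of this training objective (our own illustration, with hypothetical `encoder` and `decoder` networks and a fixed pixel standard deviation as assumptions; the actual architecture is described in Appendix A), the loss is the free energy of Eq. (1) with a standard-normal prior, i.e. a reconstruction term plus a KL regularizer:

```python
import torch
import torch.distributions as D

def free_energy_loss(encoder, decoder, context_obs, context_poses,
                     query_obs, query_pose, pixel_std=0.1):
    """Free energy of Eq. (1) with a standard-normal prior (illustrative sketch).

    `encoder` maps context observations and poses to the parameters of a
    diagonal Gaussian belief Q(s | o, pi); `decoder` maps a latent sample and a
    query pose to the mean of a pixel-wise Gaussian P(o | s, pi). Both are
    hypothetical stand-ins for the networks of Appendix A.
    """
    mu, log_sigma = encoder(context_obs, context_poses)
    q_s = D.Normal(mu, log_sigma.exp())
    prior = D.Normal(torch.zeros_like(mu), torch.ones_like(mu))

    s = q_s.rsample()                       # reparameterized latent sample
    recon_mu = decoder(s, query_pose)       # imagined observation for the query pose
    p_o = D.Normal(recon_mu, pixel_std)     # independent Gaussian per pixel (assumption)

    nll = -p_o.log_prob(query_obs).sum()    # -E_Q[log P(o | s, pi)]
    kl = D.kl_divergence(q_s, prior).sum()  # D_KL[Q(s) || P(s | pi)]
    return nll + kl
```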

At inference time, the policy \(\pi\), or equivalently the next observer pose, is selected by evaluating Eq. (2) for a number of candidate policies and picking the one with the lowest expected free energy. These candidate policies are obtained by sampling a grid of poses over the workspace. For each candidate policy, the trained decoder generates an imagined observation from the state vector obtained by encoding the initial observations. The corresponding expected posterior distributions are computed by forwarding these imagined observations, together with the initial observations, through the encoder. For the goal-seeking term, we provide the robot with a preferred observation, i.e. the image of the colored cube to fetch, and evaluate \(\log P(\mathbf{o}_\tau)\) accordingly. The epistemic term is evaluated by using the likelihood model to imagine what the robot would see from the candidate pose, and then computing the KL divergence between the state distributions produced by the posterior model before and after "seeing" this imagined observation. The expectations are approximated by drawing a number of samples for each candidate pose.
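A possible realization of this selection loop, building on the hypothetical `encoder`, `decoder` and `expected_free_energy` sketches above (the list-based interface, grid of candidate poses and number of samples are our assumptions), could look as follows:

```python
import torch.distributions as D

def select_next_pose(encoder, decoder, context_obs, context_poses,
                     candidate_poses, preferred_obs, n_samples=10):
    """Return the candidate pose with the lowest sampled expected free energy."""
    mu, log_sigma = encoder(context_obs, context_poses)
    prior_belief = D.Normal(mu, log_sigma.exp())

    best_pose, best_G = None, float("inf")
    for pose in candidate_poses:
        G = 0.0
        for _ in range(n_samples):
            s = prior_belief.rsample()
            imagined = decoder(s, pose)  # what the robot expects to see there
            # Re-encode the context extended with the imagined observation
            # to obtain the expected posterior belief for this pose.
            mu_p, log_sigma_p = encoder(context_obs + [imagined],
                                        context_poses + [pose])
            posterior_belief = D.Normal(mu_p, log_sigma_p.exp())
            G += float(expected_free_energy(prior_belief, posterior_belief,
                                            imagined, preferred_obs)) / n_samples
        if G < best_G:
            best_pose, best_G = pose, G
    return best_pose
```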

4 Experiments

We evaluate our system in two scenarios. In the first scenario, only the epistemic term is taken into account, which results in an exploring agent that actively queries information about the scene. In the second scenario, we add the instrumental term, by which the agent makes an exploration-exploitation trade-off to reach the goal state as fast as possible.

4.1 Exploring Behaviour

First, we focus on exploratory or information-seeking behaviour, i.e. actions are chosen based on the minimization of only the epistemic term of the expected free energy. For evaluation we restrict the robot arm to a fixed number of poses at a fixed height close to the table, so it can only observe a limited area of the workspace. The ground truth observations corresponding to the candidate poses are shown in a grid in Fig. 2.

Initially, the agent has no information about the scene, and the initial state distribution \(Q(\mathbf{s})\) is a zero-mean isotropic Gaussian. The expected observation is computed over 125 samples and visualized in the top row of Fig. 3a. Clearly, the agent does not know the position of any of the objects in the scene, which results in a relatively low epistemic term from Eq. (2) for all candidate poses, plotted in the bottom row of Fig. 3a. The agent selects the upper left pose, as indicated by the green hatched square in Fig. 3b. After observing the blue cube in the upper left corner, the expected information gain for the left poses drops, and the robot queries a pose at the right side of the workspace. Finally, the robot queries one of the central poses, after which the epistemic value of all poses becomes relatively high, as new observations no longer yield additional information. Notice that at this point, the robot can also accurately reconstruct the correct cubes from any pose, as shown in the top row of Fig. 3d.

Fig. 3. The top row represents the imagined observations, i.e. the observations generated by the generative model, for each of the considered potential poses at a given step; the bottom row represents the epistemic value for the corresponding poses. Darker values represent a larger influence of the epistemic value. The green hatched squares mark the observed poses. (Color figure online)

4.2 Goal-Seeking Behaviour

In this experiment, we use the same scene and grid of candidate poses, but now provide the robot with a preferred observation of the red cube, indicated by the red hatched square in the bottom row of Figs. 4a through 4d.

Initially, the agent has no information on the target's position, and the same information-seeking behaviour as in Sect. 4.1 can be observed in the first steps, as the epistemic value dominates. However, after the second step, the agent has observed the red cube and knows which pose will reach the preferred state. The instrumental value now takes the upper hand, as indicated by the red values in Figs. 4a through 4d, and is reflected in a significantly lower expected free energy. Even though the agent has not yet observed the green cube and is unable to create correct reconstructions, as shown in Fig. 4d, it drives itself towards the preferred state. The trade-off between exploratory and goal-seeking behaviour can clearly be observed: in Fig. 4c, the agent still has low epistemic values for the candidate poses to the left, but they do not outweigh the low instrumental value of reaching the preferred state. The middle column of potential observations also has a relatively low instrumental value, which is an artifact of using independent Gaussians to estimate the likelihood of each pixel.
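Concretely, with an independent Gaussian per pixel (our notation, with pixel index i, imagined pixel mean \(\mu_i\) and an assumed fixed variance \(\sigma^2\)), the instrumental term factorizes as

$$\begin{aligned} \log P(\mathbf{o}_\tau) = \sum _i \log \mathcal{N}\!\left( o_{\tau, i};\, \mu_i, \sigma^2\right) , \end{aligned}$$

so any imagined view whose pixels happen to resemble those of the preferred image, even without matching its spatial structure, is assigned a relatively high preference likelihood.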

Fig. 4. The top row shows the imagined observations for each of the considered potential poses at a given time step. The bottom row shows the expected free energy for the corresponding poses. Blue is used to represent the epistemic value, while red is used to represent the instrumental value. The values of both terms are shown in the legend. The green hatched squares mark the observed poses, while the red hatched square marks the preferred state. (Color figure online)

The number of steps needed to reach the preferred state is computed on 30 validation scenes not seen during training, with the preferred state chosen randomly. On average, the agent needs 3.7 steps to reach its goal; this is clearly more efficient than a systematic grid search, which would take 12.5 steps on average.

5 Conclusion

This work shows promising results in using the active inference framework for active vision. The problem is tackled with a generative model learned, without supervision, from raw pixel data. The proposed approach can be used to efficiently explore and solve robotic grasping scenarios in complex environments with a lot of uncertainty, for example when the field of view is limited or many occlusions are present.

We show that learned latent state-space models can serve as generative models for active inference, and that both exploring and goal-seeking behaviour emerge when using active inference for action selection. We demonstrated our approach in a realistic robot simulator and plan to extend it to a real-world setup.