
1 Introduction

Active inference proposes a unifying principle for perception and action as jointly minimizing the free energy of an agent’s internal generative model [6]. It has been strongly influential in contemporary neuroscience and cognitive science. More recently, active inference has been proposed as a framework for modeling driving behavior, both at the conceptual [5, 10] and computational levels [31]. The framework is attractive for computational driver behavior modeling because it enables the learning of complex behaviors from large amounts of driving data while being grounded in a fundamental theory of cognition and behavior, which guides model design and makes machine-learned models more interpretable. However, most existing active inference models in the cognitive neuroscience literature address relatively simple toy problems. Thus, scaling active inference by means of modern machine learning techniques is currently an active area of research [29]. The novel contribution of this paper is to explore the application of active inference models to learning human driving behavior from recorded data (i.e., Learning from Demonstration; LfD).

LfD provides an efficient alternative to the current manual specification or trial-and-error learning approaches to active inference model design. Assuming the demonstrating agent is an active inference agent, we can instead estimate the agent’s generative model, consisting of a world model and a preference model, from demonstrated behavior. This approach is similar to inverse reinforcement learning (IRL) [20, 33], with one important difference: instead of a single reward function, active inference explains the demonstrator with a world model–preference pair, which makes the agent’s decision process more transparent than under traditional IRL methods because we can introspect the learned world model. This allows us to understand variations in human behavior as “optimal inference in suboptimal models” [26, 31].

The closest approaches to the work presented here are [1, 11, 15, 23]. We build on these works by jointly estimating the agent’s world model and preference model from demonstration. However, our work differs from these approaches in that it does not assume the environment is fully observable as in [23], it makes no assumptions about the alignment between the agent’s world model and the environment, consistent with the active inference formulation [11, 15], and it focuses on a large continuous environment rather than a small discrete environment [1]. We demonstrate our method in continuous car following scenarios recorded on highways [32]. The learned driving policy jointly models the agent’s own states, road geometry, and other vehicles (i.e., agents) using discrete abstract states and implements continuous vehicle control. We show that this approach can mimic human driving behavior in simple scenarios but that it may learn an incorrect model of the world, known as “causal confusion” in LfD [4], and occasionally deviate from the lane. We further show that this deviation can be corrected by revising the observation set based on an established theory of driver steering [25], thus illustrating how inductive biases and domain knowledge can be injected into LfD approaches.

2 Active Inference Model of Highway Driving

In this section, we propose a mixed discrete-continuous active inference model of driving behavior and present the update rules for driver perception and control by minimizing expected free energy.

2.1 World Model

We model the driver’s perceptual process using a discrete-time controlled hidden Markov process with discrete hidden states \(s \in \mathcal {S}\), discrete actions \(a \in \mathcal {A}\), and continuous observations \(o \in \mathcal {O}\). The hidden states are the driver’s internal representation of the driving environment, which is used to guide action selection (e.g., steering and braking). The discrete actions represent driving motor primitives (i.e., prototype actions as described in [16]). The continuous observations are a vector of signals known to influence driving control behavior (e.g., visual looming of the lead vehicle [18]). The state evolves according to a Markov chain with transition probabilities \(P(s_{t+1}|s_t, a_t)\). The driver cannot directly observe the state but instead receives a high-dimensional continuous signal \(o_t\) with distribution \(P(o_t|s_t)\). Importantly, the definition of the states and the corresponding transition and observation probabilities are free to deviate from the actual environment as long as they explain the demonstrated behavior.
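
To make the model components concrete, the following minimal sketch sets up such a world model with categorical states and actions and per-state Gaussian observations; the dimensions and variable names here are illustrative assumptions, not the values used in our experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

S, A, D = 30, 60, 7  # hidden states, discrete actions, observation dimension (illustrative)

# Transition model P(s_{t+1} | s_t, a_t): one row-stochastic S x S matrix per action.
logits = rng.normal(size=(A, S, S))
B = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)

# Observation model P(o_t | s_t): one diagonal-covariance Gaussian per hidden state.
obs_mean = rng.normal(size=(S, D))
obs_var = np.ones((S, D))

def sample_observation(s):
    """Draw a continuous observation from the Gaussian attached to hidden state s."""
    return obs_mean[s] + np.sqrt(obs_var[s]) * rng.normal(size=D)
```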

2.2 A POMDP Formulation of Active Inference

Given the world model, the agent’s perception-action loop at every decision epoch consists of inferring a belief distribution on the current hidden state and selecting an action controlling the evolution of the hidden state. Active inference posits the minimization of free energy as a unifying principle for describing the perception-action loop.

Let \(h_t = \{o_t,...,o_0, a_{t-1},...,a_0\} \in H_t\) denote the observable history of the dynamic decision process, including all past and present revealed observations and all implemented actions up to time \(t>0\), where \(H_t \triangleq \mathcal {O}^{t+1}\times \mathcal {A}^{t}\).

According to the free energy minimization principle, the agent’s belief distribution at time \(t>0\), which we denote by \(b_t(s_t)\), must correspond to the Bayes-updated belief distribution on the state \(s_t\), i.e. the conditional probability distribution of \(s_t\) given history \(h_t\): \(b_t(s_t)= P(s_t|h_t)\). The active inference model of the perception-action loop assumes the agent has preferences over hidden states \(s_{t+1}\), represented by a probability distribution \(\tilde{P}(s_{t+1})\). The expected free energy associated with the choice of action \(a_t\) and current belief distribution \(b_t\) at time \(t>0\) can be written as [3]:

$$\begin{aligned} EFE(b_t,a_t)&= \mathbb {E}\big [D_{KL}\big (b_{t+1}||\tilde{P}\big )\big ] + \mathbb {E}[\mathcal {H}(o_{t+1})] \end{aligned}$$
(1)

where the first expectation is taken with respect to

$$\begin{aligned} \begin{aligned} P(o_{t+1}|b_t,a_t)&:= \sum _{s_{t+1}} P(o_{t+1}|s_{t+1})P(s_{t+1}|b_t,a_t) \\&= \sum _{s_{t+1}} P(o_{t+1}|s_{t+1})\sum _{s_t}P(s_{t+1}|s_t,a_t)b(s_t) \end{aligned} \end{aligned}$$
(2)

and \(D_{KL}\big (b_{t+1}||\tilde{P}\big )\) is the Kullback-Leibler divergence between the random belief distribution \(b_{t+1}(\cdot )=P(\cdot |h_t \cup \{o_{t+1},a_t\})\) and \(\tilde{P}(\cdot )\). \(\mathbb {E}[\mathcal {H}(o_{t+1})]\) is the entropy of the observation model \(P(o_{t+1}|s_{t+1})\) expected under the predictive next-state distribution \(P(s_{t+1}|b_t, a_t)\) appearing in (2). The first term in (1) measures the extent to which the belief distribution \(b_{t+1}\) (resulting from implementing action \(a_t\) and recording observation \(o_{t+1}\)) differs from the preferred one, \(\tilde{P}\). Let \(\pi \in \varPi \) denote a randomized action selection policy conditioned on the history of the process, i.e. \(\pi (a|h_t) \in [0,1], a \in \mathcal {A}\) and \(\sum _{a \in \mathcal {A}} \pi (a|h_t)=1\) for all \(h_t \in H_t\). An information processing cost is modeled as the Kullback-Leibler divergence between the policy \(\pi \) and a default a priori control policy \(\pi _0\), which is oblivious to new information [21, 28], i.e.:

$$\begin{aligned} D_{KL}(\pi (\cdot |h_t)||\pi _0)&:= \sum _{a \in \mathcal {A}} \pi (a|h_t)\log \frac{\pi (a|h_t)}{\pi _0(a)} \end{aligned}$$

With a uniform default distribution, \(D_{KL}(\pi (\cdot |h_t)||\pi _0)=\mathbb {E}_{\pi (a|h_t)}\log \pi (a|h_t) + \log |\mathcal {A}|\). For a finite planning horizon T, the active inference controller is the solution to the problem:

$$\begin{aligned} \mathcal {G}^*_{\tau }(h_{\tau })&\triangleq \min _{\pi \in \varPi } \mathbb {E}\big [\sum _{t=\tau }^T (EFE(b_t,a_t)+\log \pi (a_t|h_t))\big ] \end{aligned}$$
(3)

The combination of additive structure and Markovian dynamics allows for a recursive characterization of the optimal policy as follows:

$$\begin{aligned} \begin{aligned} \mathcal {G}^*_{t}(h_t)&= \min _{\pi \in \varPi } \Bigg \{\sum _{a_t \in \mathcal {A}} \pi (a_t|h_t) \big [\\&EFE(b_t,a_t) + \log \pi (a_t|h_t) + \int _{\mathcal {O}}P(o_{t+1}|h_t, a_t)\mathcal {G}^*_{t+1}(h_{t+1})d o_{t+1}\big ] \Bigg \} \end{aligned} \end{aligned}$$
(4)

where \(h_{t+1}=h_t\cup \{o_{t+1},a_t\}\). Note that, without loss of generality, the recursive equation can be expressed in terms of belief states \(b_t\) rather than the history \(h_t\). The following is a standard result characterizing the optimal solution to (4) [7].

Proposition 1

Let \(\mathcal {G}_t^{*}(b_t,a_t)\) be defined as:

$$ \mathcal {G}^*_t(b_t,a_t):=EFE(b_t,a_t) + \log \pi (a_t|b_t) + \int _{\mathcal {O}} P(o_{t+1}|b_t,a_t)\mathcal {G}^*_{t+1}(b_{t+1})do_{t+1} $$

The optimal policy is of the form:

$$\begin{aligned} \pi (a|b_t)&=\frac{e^{-\mathcal {G}^*_t(b_t,a)}}{\sum _{\tilde{a} \in \mathcal {A}} e^{-\mathcal {G}^*_t(b_t,\tilde{a})}} \end{aligned}$$
(5)
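
As a rough illustration of how the recursion (4) and the softmax policy (5) can be computed in practice, the sketch below runs a finite-horizon soft value iteration over states that are treated as directly observable at the next step, anticipating the QMDP-style approximation described in Sect. 3.2; the function and argument names are our own, and the exact treatment of the \(\log \pi \) term may differ from our implementation.

```python
import numpy as np

def soft_value_iteration(efe, B, T):
    """Finite-horizon recursion for G*(s, a) with a softmax (Boltzmann) policy.

    efe: (A, S) expected free energy of taking action a in state s
    B:   (A, S, S) transition probabilities P(s'|s, a)
    T:   planning horizon
    Returns G of shape (A, S) and the policy pi(a|s) of shape (A, S).
    """
    G = efe.copy()
    for _ in range(T - 1):
        # Soft "value" of a state: -log sum_a exp(-G(s, a)).
        # (A log-sum-exp trick would be used for numerical stability in practice.)
        V = -np.log(np.exp(-G).sum(axis=0))           # shape (S,)
        G = efe + np.einsum("ask,k->as", B, V)        # add expected next-step value
    pi = np.exp(-G) / np.exp(-G).sum(axis=0, keepdims=True)
    return G, pi
```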

2.3 Estimation of POMDP Model

Given the model for the active inference controller described above, in this section we describe the problem of estimating such a model from recorded sequences of actions and observables. This is akin to the inverse learning of a POMDP model (see Sect. 4.7 in [22]).

In what follows, we consider a parametrization of the observation probabilities \(P_{\theta _1}(o_{t+1}|s_{t+1}) \) and state dynamics \(P_{\theta _1}(s_{t+1}|s_t,a_t)\) with \(\theta _1 \in \mathbb {R}^{p_1}\), where \(p_1>0\). Given data in the form of finite histories \(h_{T,i}= \{(o_{t,i},a_{t,i})\}_{t=0}^{T}\) for \(i \in \{1,\dots ,N\}\), a sequence of belief trajectories \(\{b_{t,\theta _1,i}\}_{t=0}^T\) can be recursively computed for a fixed value of \(\theta _1\).
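
For a fixed \(\theta _1\), this belief recursion is a standard Bayes filter alternating observation corrections and transition predictions. A minimal sketch under the Gaussian observation model of Sect. 3.2 is shown below; the function and argument names are illustrative.

```python
import numpy as np
from scipy.stats import multivariate_normal

def belief_trajectory(history, B, obs_mean, obs_var, b0):
    """Recursively compute b_t(s) = P(s_t | h_t) for a fixed parameter value theta_1.

    history:  list of (o_t, a_t) pairs; the final action may be None
    B:        (A, S, S) transition probabilities P(s'|s, a)
    obs_mean, obs_var: per-state Gaussian observation parameters, each of shape (S, D)
    b0:       prior belief over the initial state, shape (S,)
    """
    beliefs, b = [], b0
    for o, a in history:
        # Correction: weight each state by its observation likelihood, then renormalize.
        like = np.array([
            multivariate_normal.pdf(o, mean=obs_mean[s], cov=np.diag(obs_var[s]))
            for s in range(len(b))
        ])
        b = like * b
        b = b / b.sum()
        beliefs.append(b)
        if a is not None:
            # Prediction: push the belief through the transition model for action a.
            b = B[a].T @ b
    return beliefs
```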

Assuming preferences over hidden states are parametrized as \(\tilde{P}_{\theta _2}(s_{t+1})\) with \(\theta _2 \in \mathbb {R}^{p_2}\), \(p_2>0\), the log-likelihood of the observed actions can be written as:

$$\begin{aligned} \log \ell (\theta )&= \sum _{i=1}^N \sum _{t=0}^{T-1} \log \pi _{\theta }(a_{t,i}|b_{t,\theta _1,i}) \end{aligned}$$
(6)

where \(\pi _{\theta }(\cdot |b_{t, \theta _1, i})\) is the optimal policy in (5) and \(\theta := (\theta _1,\theta _2)\).

Equation (6) can be optimized using a nested-loop algorithm alternating between (i) a parameter update step at iteration \(k>0\), in which we set \(\theta ^{k+1}\) as the solution to:

$$\begin{aligned} \max _{\boldsymbol{\theta }} \sum _{i=1}^N \sum _{t=0}^{T-1}\log \pi _{\theta }(a_{t,i}|b_{t,\theta _1^k,i}) \quad \text {{ s.t.}} \quad \pi _{\theta }(a_t|b_t) = \frac{e^{-\mathcal {G}_{t, \theta ^k}^*(b_t, a_t)}}{\sum _{\tilde{a}_t \in \mathcal {A}}e^{-\mathcal {G}_{t, \theta ^k}^*(b_t, \tilde{a}_t)}} \end{aligned}$$

where \(\mathcal {G}^*_{t,\theta ^k}\) denotes the current free energy function and (ii) solving for the free energy function \(\{\mathcal {G}^*_{t,\theta ^{k+1}}\}_t\) given the new parameter values.
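A high-level sketch of this nested-loop scheme is given below, assuming gradient-based optimization in PyTorch; the helper callables solve_free_energy and action_logprob are hypothetical stand-ins for step (ii) and for the policy likelihood in (5)-(6), and the hyperparameters are arbitrary.

```python
import torch

def fit(theta, data, solve_free_energy, action_logprob,
        num_outer=50, num_inner=100, lr=1e-2):
    """Nested-loop estimation sketch: (i) gradient steps on the action log-likelihood
    with the free energy function held fixed, then (ii) re-solving the free energy
    under the updated parameters.

    theta:             dict of torch tensors (theta_1, theta_2) with requires_grad=True
    data:              iterable of trajectories {(o_t, a_t)}
    solve_free_energy: callable theta -> G (detached), e.g. a soft value iteration
    action_logprob:    callable (theta, G, trajectory) -> summed log pi_theta(a_t | b_t)
    """
    opt = torch.optim.Adam(list(theta.values()), lr=lr)
    for _ in range(num_outer):
        G = solve_free_energy(theta)        # step (ii): current free energy function
        for _ in range(num_inner):          # step (i): likelihood ascent given G
            loss = -sum(action_logprob(theta, G, traj) for traj in data)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return theta
```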

3 Implementation

In this section, we first describe the signals assumed to be observed by the drivers during a car-following scenario and defer a detailed description of the dataset to appendix A.1. We then describe the model fitting process, including an extension of the model to continuous braking and steering control. Finally, we describe the procedure for model comparison.

3.1 Driver Observations

We leveraged prior works on driver behavior theory [17, 18, 25] to define the observation vector o used in the car-following task. Markkula et al. [17] proposed visual looming, denoted by \(\tau ^{-1}\), as a central observation signal in human longitudinal vehicle control; it is defined as the time derivative of the optical angle of the lead vehicle subtended on the driver’s retina divided by the angle itself: \(\tau ^{-1} = \dot{\theta }/\theta \). Salvucci & Gray [25] proposed a two-point model of human lateral vehicle control in which the driver represents road curvature with a near-point, assumed to lie at a fixed distance in front of the vehicle, and a far-point, assumed to be the lead vehicle in the car-following context, and steers to minimize the deviation from a combination of the near- and far-points. Using these insights, we designed an observation vector consisting of three sensory modalities:

  1. The state of the ego vehicle in ego-centric coordinates

  2. Relationships with the lead vehicle in ego-centric coordinates

  3. Road geometry

We featurized the ego state with the longitudinal and lateral velocities, and the relationship to the lead vehicle with the longitudinal and lateral components of the relative distance and speed, together with looming. To encode the road geometry in the two-point model, we used the lane center \(30\text { m}\) ahead of the current position as the near-point and the lead vehicle as the far-point, and used as features the heading errors to the near- and far-points and the distance from the current road position to the lane center.
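
As an illustration, the looming and heading-error features could be computed as in the sketch below, assuming the lead vehicle’s physical width and ego-centric positions are available; the variable names are ours, and the exact feature definitions in our pipeline may differ in detail.

```python
import numpy as np

def looming(width, distance, closing_speed):
    """Inverse tau: theta_dot / theta, where theta = 2*arctan(width / (2*distance)) is the
    optical angle subtended by the lead vehicle and closing_speed = -d(distance)/dt."""
    theta = 2.0 * np.arctan(width / (2.0 * distance))
    theta_dot = width * closing_speed / (distance**2 + width**2 / 4.0)
    return theta_dot / theta

def heading_error(ego_pos, ego_heading, point):
    """Signed angle between the ego heading and the bearing to a near- or far-point."""
    bearing = np.arctan2(point[1] - ego_pos[1], point[0] - ego_pos[0])
    return np.arctan2(np.sin(bearing - ego_heading), np.cos(bearing - ego_heading))
```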

3.2 Model Fitting

We parameterized the hidden state transition probabilities \(P(s_{t+1}|s_t, a_t)\) and preference distribution \(\tilde{P}(s_t)\) with categorical distributions, and the observation probabilities \(P(o_t|s_t)\) with multivariate Gaussian distributions. For a fixed belief vector \(b_t\), the expected KL divergence and entropy in (1) can be computed in closed form. We used the QMDP method [14] to approximate the cumulative expected free energy, assuming the states become fully observable in the next time step: \(\mathcal {G}^*(b_t, a_t) \approx \sum _{s_t}b(s_t)\mathcal {G}^*(s_t, a_t)\). This allows us to train the model in automatic differentiation frameworks (e.g., PyTorch) using Value Iteration Network-style implementations [9, 27].
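
Under these distributional choices, the per-state expected free energy admits a simple closed form; the sketch below shows one such computation, approximating the next-step posterior by the predictive next-state distribution, followed by the QMDP-style belief average. This is an illustrative reconstruction rather than the exact expressions used in our implementation.

```python
import numpy as np

def state_action_efe(B, obs_var, log_pref):
    """Per-state EFE: KL from the predicted next-state distribution to the preference,
    plus the expected entropy of the diagonal-Gaussian observation model.

    B:        (A, S, S) transition probabilities P(s'|s, a)
    obs_var:  (S, D) diagonal observation variances
    log_pref: (S,) log of the preference distribution over hidden states
    """
    # Differential entropy of each state's Gaussian observation model.
    H_obs = 0.5 * np.sum(np.log(2.0 * np.pi * np.e * obs_var), axis=-1)      # (S,)
    kl = np.sum(B * (np.log(B + 1e-12) - log_pref), axis=-1)                 # (A, S)
    expected_entropy = B @ H_obs                                             # (A, S)
    return kl + expected_entropy

def qmdp_efe(b, G_sa):
    """QMDP-style approximation: G(b, a) ~= sum_s b(s) * G(s, a)."""
    return G_sa @ b   # (A, S) @ (S,) -> (A,)
```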

In order to fit the discrete action model from Sect. 2 to continuous longitudinal and lateral controls, we extended the model with a continuous control module. Let u denote a multidimensional continuous control vector (longitudinal and lateral accelerations in the current setting); we modeled the mapping from a discrete action a to u using P(u|a), parameterized as a multivariate Gaussian with its parameters added to the vector \(\theta _1\). P(u|a) thus automatically extracts primitive actions, such as different magnitudes of acceleration and deceleration [16], from data by adaptively discretizing the action space. We assume that at a given time step t, the agent also performs a Bayesian belief update over the previously realized action, with prior given by the policy \(\pi (a_t|b_t)\) and posterior \(P(a_t|u_t) \propto P(u_t|a_t)\pi (a_t|b_t)\). The action log-likelihood objective in (6) is modified as:

$$\begin{aligned} \log \ell (\theta ) = \sum _{i=1}^{N}\sum _{t=0}^{T-1}\log \sum _{a_{t, i}}P_{\theta _{1}}(u_{t,i}|a_{t,i})\pi _{\theta }(a_{t,i}|b_{t,\theta _{1},i}) \end{aligned}$$
(7)
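
The modified objective in (7) marginalizes the discrete action out of the Gaussian control module, and the corresponding action posterior follows by Bayes’ rule. A small sketch with illustrative argument names:

```python
import numpy as np
from scipy.stats import multivariate_normal

def control_loglik(u, pi_a, ctrl_mean, ctrl_var):
    """log sum_a P(u|a) pi(a|b): marginal log-likelihood of a continuous control u.

    u:         (2,) observed longitudinal and lateral acceleration
    pi_a:      (A,) policy probabilities pi(a|b_t)
    ctrl_mean: (A, 2) Gaussian means of P(u|a)
    ctrl_var:  (A, 2) diagonal Gaussian variances of P(u|a)
    """
    comp = np.array([
        multivariate_normal.pdf(u, mean=ctrl_mean[a], cov=np.diag(ctrl_var[a]))
        for a in range(len(pi_a))
    ])
    return np.log(np.sum(comp * pi_a) + 1e-12)

def action_posterior(u, pi_a, ctrl_mean, ctrl_var):
    """Belief update over the realized discrete action: P(a|u) proportional to P(u|a) pi(a|b)."""
    comp = np.array([
        multivariate_normal.pdf(u, mean=ctrl_mean[a], cov=np.diag(ctrl_var[a]))
        for a in range(len(pi_a))
    ])
    post = comp * pi_a
    return post / post.sum()
```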

3.3 Model Comparison

We measured the quality of the trained agents using a combination of offline and online testing metrics on a held-out dataset. For the offline metric, we used the mean absolute error (MAE). For the online metrics, we first ran the trained agents in a simulator that replayed the recorded trajectories of the lead vehicles and then recorded the final displacement and average lane deviation for each trajectory tested. The final displacement is defined as the distance between the final position reached by the trained agent and the final position in the dataset. The average lane deviation is the agent’s distance to the tangent point on the lane center line, averaged over all time steps in the trajectory.
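
These metrics are straightforward to compute from logged trajectories; a sketch is given below, where the lane deviation uses the closest lane-center point as a stand-in for the tangent point, and the array names are illustrative.

```python
import numpy as np

def mae(pred_accel, true_accel):
    """Offline metric: mean absolute error of the predicted accelerations (m/s^2)."""
    return np.mean(np.abs(np.asarray(pred_accel) - np.asarray(true_accel)))

def final_displacement(agent_xy, human_xy):
    """Online metric: distance between the agent's and the recorded final positions."""
    return np.linalg.norm(np.asarray(agent_xy)[-1] - np.asarray(human_xy)[-1])

def average_lane_deviation(agent_xy, centerline_xy):
    """Online metric: mean distance from the agent to the nearest lane-center point."""
    agent = np.asarray(agent_xy)[:, None, :]         # (T, 1, 2)
    center = np.asarray(centerline_xy)[None, :, :]   # (1, M, 2)
    dists = np.linalg.norm(agent - center, axis=-1)  # (T, M)
    return dists.min(axis=-1).mean()
```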

We varied three aspects of the agents to compare with the canonical agent described previously. First, we examined the importance of the chosen features by replacing the near-point heading error and distance to lane center with distances to the left and right boundaries at the current road position, a feature set commonly used by driving agents for simulated testing [2, 13]. We label the agents trained with the original two-point observation as “TP”. Next, we examined the importance of grounding the world model in actual observations by adding an observation regularizer to the training objective with a coefficient of 0.01:

$$\begin{aligned} \mathcal {L}_{obs} = \sum _{t=1}^T \log P(o_t|h_t) \end{aligned}$$
(8)

This encourages the agent to have a more accurate belief about the world with higher observation likelihood under the agent’s posterior beliefs. We label agents trained with this penalty “Obs”. Finally, we examined the impact of agent planning objectives on the learned world model and behavior. We replaced EFE with an alternative objective called expected cross entropy (ECE):

$$\begin{aligned} ECE(b_t, a_t) = \mathbb {E}[\log \tilde{P}(o_{t+1})] \end{aligned}$$
(9)

which is the expected log marginal likelihood of the next observation under the agent’s preference model.
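
Since the marginal \(\tilde{P}(o_{t+1}) = \sum_{s}P(o_{t+1}|s)\tilde{P}(s)\) is a Gaussian mixture, the expectation in (9) has no simple closed form; one option is a Monte Carlo estimate as sketched below, assuming the outer expectation is taken, as in (1), with respect to the predictive distribution (2). The names are illustrative, and the actual implementation may use a different approximation.

```python
import numpy as np

def ece_monte_carlo(b, a, B, obs_mean, obs_var, pref, rng, n_samples=256):
    """Monte Carlo estimate of the expectation in (9): E[ log sum_s P(o|s) pref(s) ],
    with the outer expectation over the predictive distribution P(o_{t+1} | b, a)."""
    S, D = obs_mean.shape
    pred = B[a].T @ b                                # predictive next-state distribution
    s_next = rng.choice(S, size=n_samples, p=pred)   # sample s_{t+1}
    o = obs_mean[s_next] + np.sqrt(obs_var[s_next]) * rng.normal(size=(n_samples, D))
    # log N(o; mu_s, var_s) for every sample and every state (diagonal Gaussians).
    diff = o[:, None, :] - obs_mean[None, :, :]                                  # (n, S, D)
    log_like = -0.5 * np.sum(diff**2 / obs_var + np.log(2 * np.pi * obs_var), axis=-1)
    log_mix = log_like + np.log(pref + 1e-12)
    m = log_mix.max(axis=-1, keepdims=True)          # log-sum-exp over states
    return np.mean(m.squeeze(-1) + np.log(np.exp(log_mix - m).sum(axis=-1)))
```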

We used 30 states and 60 actions for all agents as they were sufficient to produce reasonable behavior. As a baseline, we trained a behavior cloning (BC) agent consisting of a recurrent and a feed-forward neural network to emulate the belief update and control modules of the active inference agent. We provide more details of the BC agent in appendix A.2.

4 Results and Discussions

Figure 1 shows the offline (left panel) and online (middle and right panels) testing metrics for each agent, tested using the same set of 15 scenarios sampled from the held-out dataset, with the canonical agent labeled “TP+EFE”. The MAE of all active inference agents was between 0.11 and 0.14 \(\text {m}/\text {s}^2\). The BC agent outperformed all other agents with an MAE of 0.08 \(\text {m}/\text {s}^2\); however, the BC+TP agent had a higher MAE of 0.135 \(\text {m}/\text {s}^2\). This is likely due to the sensitivity of neural networks to input features during training, despite their better function approximation capability. The final displacements were on average 13 m, the average lane deviation was 1.37 m, and no collision with the lead vehicle was observed. These metrics show that the agents can generate reasonable behavior by staying in the lane and following the lead vehicle (see the sample trajectories in Fig. 3a).

Fig. 1. Box plots of offline (column 1) and online (columns 2 & 3) performance metrics of the compared agents. Offline metrics are calculated on the entire held-out set. Each box plot in the online metrics shows the distribution of agent performance in 15 random held-out scenarios tested with 3 different random seeds.

Comparing across the different agents, Fig. 1 shows that adding an observation penalty increased the offline MAE; however, it did not noticeably affect the agents’ online performance. This might be related to the objective mismatch problem in model-based reinforcement learning, where a model better fitted to the observations may not enhance control capabilities [12]. The middle and right panels show that some of the agents produced final displacements and lane deviations as large as 100 m and 15 m, respectively, as a result of deviating from the lane and failing to make corrections (see Fig. 3b). Interestingly, the active inference agents using the two-point observation model generated noticeably less lane deviation than the other agents (see Fig. 1, right, with the x axis in log scale), despite similar performance in terms of offline metrics. This observation highlights the importance of incorporating generalizable features into the agent’s world model.

Figure 2 shows a subset of the parameters of the learned world models. In all panels, the states are ordered by desirability, so that states with lower EFE are assigned smaller indices. The left panel plots the variance of the observation distribution for the relative distance feature against the states. The orange and blue lines represent the ECE and EFE objectives, respectively. This panel shows a clear increase in the observation variance as state desirability decreases. The middle and right panels show the transition matrices controlled by the learned policy: \(P^{\pi }(s'|s) = \sum _{a \in \mathcal {A}}P(s'|s, a)\pi (a|b=\delta (s))\), where \(b=\delta (s)\) denotes a belief concentrated on a single state. Whereas the transition probabilities of the ECE agent spread more uniformly across the state space, the transition matrix of the EFE agent has a block-diagonal structure. As a result, it is difficult to traverse from the undesirable states in the lower diagonal block (states 24–30) to the desirable states in the upper diagonal block (states 0–24). We have empirically observed that when the EFE agent deviates from the lane, its EFE values also increase significantly without it taking any corrective actions. This suggests that the increasing observation variance played a more important role in determining the desirability of a state than the KL divergence from the preferred states.
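
The controlled transition matrix shown in the middle and right panels can be computed as in the sketch below, where policy_fn is an assumed callable returning \(\pi(a|b)\) for a given belief vector.

```python
import numpy as np

def controlled_transition_matrix(B, policy_fn):
    """P^pi(s'|s) = sum_a P(s'|s, a) * pi(a | b = delta(s)), i.e. the transition kernel
    obtained when the belief is concentrated on each individual state."""
    A, S, _ = B.shape
    P_pi = np.zeros((S, S))
    for s in range(S):
        b = np.zeros(S)
        b[s] = 1.0                       # belief concentrated on state s
        pi = policy_fn(b)                # (A,) action probabilities
        P_pi[s] = np.einsum("a,ak->k", pi, B[:, s, :])
    return P_pi
```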

Fig. 2. Parameters of the learned world models. States are sorted by desirability (i.e., low expected free energy). Left: Observation variance vs. state. Middle & right: Heat map of the controlled transition matrix. Darker color corresponds to higher transition probability.

The observation made in Fig. 2 is similar to the “causal confusion” problem in LfD [4]. In [4], the authors found that the learning agent may falsely attribute the cause of an action to previous actions in the demonstration rather than to the observation signals and its own goals. Our agent exhibited a different type of “causal confusion”, similar to the model exploitation phenomenon in reinforcement learning [8], where the cause of an action is attributed to a model with incorrect counterfactual state and observation predictions. The consequence is that the agent is unable to make corrections when entering these states. However, learning the correct counterfactual states from demonstration is difficult because such states rarely appear in the demonstrations, as the demonstrating agents are usually experts who rarely visit undesirable states. Prior works addressed this by interacting with an environment [30] or receiving real-time expert feedback [24]. We have instead partially alleviated this issue by designing domain-specific features (i.e., the two-point observation model) to reduce the probability of the agent deviating from desired states. However, given that active inference relies strongly on counterfactual simulation of the world model in the planning step, future work should focus on discovering the correct counterfactual states from human demonstrations using approaches at the model level rather than the feature level, e.g., by constraining the model class or learning causal world models via environment interactions [4].