1 Introduction

As autonomous vehicle (AV) technology progresses towards full autonomy, self-driving vehicles are likely to become a reality in the coming years [1]. Although AVs have demonstrated reliable operation in structured environments under reasonable driving circumstances, they still exhibit unpredictable behaviours with potentially disastrous consequences in unseen, highly uncertain environments [2]. A crucial component of autonomous driving is making driving decisions based on changes in the vehicle's environment and adapting to different types of driving conditions [3].

Traditional planning methods require heavy manual tuning to model the dynamic interactions among vehicles and are therefore not always scalable or cost-effective [4]. Imitation Learning (IL) offers an alternative, data-driven solution that mimics an expert's driving behaviour. However, a primary challenge in IL is specifying how an AV should act when it encounters states not covered by the demonstrations [5]. Reinforcement Learning (RL) can overcome this issue by exploring new states and actions while the agent interacts with the environment. Exploration plays a vital role in RL, yet it typically requires a large amount of training data and computational resources to succeed [6]. In response to this challenge, IL leverages inverse reinforcement learning (IRL) to deduce the reward function from demonstrations [7]–[8]. Learning a parameterized reward function provides a compact representation of the demonstrator's preferences and allows policy optimization to generalize to unseen states. However, IRL methods always carry uncertainty about the proper reward function, which can have adverse outcomes if the learning agent infers a reward function that leads it to the wrong policy. Additionally, systems that use deterministic models rather than probabilistic ones ignore the stochastic nature of dynamic environments. We propose an IL approach combined with a hierarchical probabilistic model under the Active Inference framework [9] to overcome these limitations. An active imitator agent (i.e., an AV) aims to learn a policy that is robust to uncertainty and dynamic environmental changes while accomplishing a specific task. Within active inference, the agent can effectively trade off the exploration-exploitation rate against the expected return: the AV detects novel situations by comparing what it expects to observe, based on the rules learned from the expert's demonstrations, with what it actually observes, and decides whether to explore new actions or exploit what has been learned by imitating the expert. The AV aims to learn a sequence of actions that minimizes the abnormality. The main contributions of this work are: 1) integrating Active Inference into IL to optimize the learned policy by balancing expected return and abnormality value under new experiences; 2) optimizing the exploration-exploitation rate by distinguishing normal and abnormal situations through Bayesian predictive and diagnostic messages; 3) optimizing action planning under uncertainty and minimizing the imitation loss with respect to the expert demonstrations; 4) demonstrating that the proposed framework outperforms existing RL methods in terms of learning performance.

2 Proposed Framework

The proposed framework is divided into two main phases: offline and online learning. During the first phase, we learn a situation model explaining how an expert agent (\(\textrm{E}\)) and a dynamic object (\(\textrm{O}\)) interact in the environment. Moreover, we build a First-Person generative model (FP), allowing a learning agent (\(\textrm{L}\)) to learn \(\textrm{E}\)'s behavior by observing its demonstrations. In the second phase, \(\textrm{L}\) modifies and updates its imperfect knowledge acquired from the sub-optimal expert demonstrations through an Active First-Person generative model (AFP), with respect to its relative distance from a moving object (\(\mathrm {\hat{O}}\)) in a dynamic environment. Both the FP and AFP models are structured as Dynamic Bayesian Network (DBN) representations [10].

Fig. 1. Learning models: a) Situation model, b) FP model, c) AFP model.

2.1 Offline Learning Phase

2.1.1 Situation Model.

This model is structured to explain the interaction between two dynamic agents (see Fig. 1-(a)), \(\textrm{E}\) and \(\textrm{O}\), by using a switching DBN [11]. The variables \(\textrm{Z}_{k}^{\textrm{E}}\) and \(\textrm{Z}_{k}^{\textrm{O}}\) represent the agents' observations at the low level of the hierarchy (yellow nodes). The middle level (green nodes) describes the joint hidden continuous Generalized states (GSs) containing the dynamics of the two agents at each time instant k as follows: \(\tilde{\boldsymbol{\textrm{X}}}_{k} = {[ \tilde{\textrm{X}}^{\textrm{E}}_{k}\ \tilde{\textrm{X}}^{\textrm{O}}_{k} ]}^{\intercal }\), where \(\tilde{\textrm{X}}^{\textrm{E}}_{k}\) and \(\tilde{\textrm{X}}^{\textrm{O}}_{k}\) denote the GSs of \(\textrm{E}\) and \(\textrm{O}\), respectively. The GS related to agent i is defined as a vector composed of the agent's state and its first-order temporal derivative, such that \(\tilde{\textrm{X}}_{k}^{i} = [\boldsymbol{x}\ \boldsymbol{\dot{x}}]^\intercal \), where \(\boldsymbol{x} \in \mathbb {R}^d\), \(\boldsymbol{\dot{x}} \in \mathbb {R}^d\), \(i \in \{\textrm{E}, \textrm{O}\}\), and d stands for the dimensionality. The relation between \(\textrm{Z}_{k}^{i}\) and \(\tilde{\textrm{X}}_{k}^{i}\), which describes the observation model, is defined as:

$$\begin{aligned} \textrm{Z}_{k}^{i}=\textrm{H}\tilde{\textrm{X}}_{k}^{i} + v_{k}, \end{aligned}$$
(1)

where \(\textrm{H}=[\textrm{I}_d\ 0_{d,d}]\) is the observation matrix that measures how dependent the measurements (\(\textrm{Z}_{k}^{i}\)) are upon the hidden GSs (\(\tilde{\textrm{X}}_{k}^{i}\)), and \(v_k \sim \mathcal {N}(0,\textrm{R})\) is the measurement noise, which follows a zero-mean Gaussian distribution with covariance \(\textrm{R}\).
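To make the observation model concrete, the following minimal Python/NumPy sketch builds \(\textrm{H}=[\textrm{I}_d\ 0_{d,d}]\) and draws one noisy observation according to (1); the dimensionality d = 2, the GS values, and the noise covariance are illustrative assumptions, not values from the paper.

```python
# Minimal sketch of the observation model in Eq. (1); d = 2 and the noise
# level are assumptions made only for this example.
import numpy as np

d = 2
H = np.hstack([np.eye(d), np.zeros((d, d))])      # H = [I_d  0_{d,d}], shape (d, 2d)
R = 0.01 * np.eye(d)                              # measurement noise covariance

rng = np.random.default_rng(0)
X_tilde = np.array([1.0, 2.0, 0.5, -0.3])         # GS: position and velocity stacked
v = rng.multivariate_normal(np.zeros(d), R)       # v_k ~ N(0, R)
Z = H @ X_tilde + v                               # Eq. (1): noisy observation of the position part
print(Z)
```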

Initially, the evolution of \(\tilde{\boldsymbol{\textrm{X}}}_{k}\) is assumed to follow a static equilibrium model described by:

$$\begin{aligned} \tilde{\textrm{X}}_{k}^{i} = \textrm{A} \tilde{\textrm{X}}_{k-1}^{i} + w_k, \end{aligned}$$
(2)

where \(\textrm{A} \in \mathbb {R}^{d \times d}\) and \(w_k \sim \mathcal {N}(0,\textrm{Q})\) denote the dynamic matrix and the process noise, respectively. \(\tilde{\boldsymbol{\textrm{X}}}_{k}\) is predicted by a Null Force Filter (NFF) according to (2). The NFF calculates the innovations that encode the deviations between predictions and observations as: \(\varepsilon _{\tilde{\textrm{X}}_{k}}=\textrm{H}^{-1}\big (\textrm{Z}_{k}^{i}-\textrm{H}\tilde{\textrm{X}}_{k}^{i}\big )\). The Growing Neural Gas with utility measurement (GNG-U) [12] is employed to cluster, in an unsupervised manner, those innovations \(\varepsilon _{\tilde{\textrm{X}}_{k}}\), which characterise the generalized errors (GEs), and it outputs a vocabulary defined as \(\boldsymbol{\textrm{S}}^{i}=\{\textrm{s}^i_1, \textrm{s}^i_2, \dots , \textrm{s}^i_{\mathcal {L}_{i}}\}\) consisting of \(\mathcal {L}_{i}\) clusters.
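As a rough illustration of this step, the sketch below computes the innovations and clusters them. Since \(\textrm{H}\) is not square, a pseudo-inverse is used in place of \(\textrm{H}^{-1}\), and plain k-means stands in for GNG-U purely for brevity; the synthetic data and the number of clusters are assumptions.

```python
# Sketch: innovations (generalized errors) plus an unsupervised clustering step.
# k-means is an illustrative stand-in for the GNG-U used in the paper.
import numpy as np
from sklearn.cluster import KMeans

d = 2
H = np.hstack([np.eye(d), np.zeros((d, d))])
H_pinv = np.linalg.pinv(H)                         # H is rectangular; pseudo-inverse in practice

rng = np.random.default_rng(1)
K = 200
X_pred = rng.normal(size=(K, 2 * d))               # NFF predictions of the GSs (Eq. (2))
Z_obs = (H @ X_pred.T).T + rng.normal(scale=0.1, size=(K, d))   # noisy observations

# Innovations / generalized errors: eps_k = H^+ (Z_k - H * X_tilde_k)
eps = (H_pinv @ (Z_obs - (H @ X_pred.T).T).T).T    # shape (K, 2d)

# Stand-in for GNG-U: cluster the generalized errors into a vocabulary S^i
vocab = KMeans(n_clusters=5, n_init=10, random_state=0).fit(eps)
print(vocab.cluster_centers_.shape)                # (L_i, 2d) cluster prototypes
```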

The interaction between \(\textrm{E}\) and \(\textrm{O}\) at multiple levels (i.e., discrete and continuous) is described by a joint vocabulary (i.e., \(\boldsymbol{\textrm{S}}^{\textrm{E}}, \boldsymbol{\textrm{S}}^{\textrm{O}}\)) expressing discrete regions with quasi-linear models that explain the interactive dynamic evolution of the joint states over time. Each discrete cluster \(\textrm{s}^i \in \boldsymbol{\textrm{S}}^{i}\) follows a multivariate Gaussian distribution with covariance matrix \(\tilde{\Sigma }_{\textrm{s}^{i}_{k}}\) and generalized mean value \(\tilde{\boldsymbol{\mu }}^{\textrm{s}^{i}}=[\mu _{Pos}^{s^i} \ \mu _{V}^{s^i}]\), where \(\mu _{Pos}^{s^i}\) is the states' mean value on positions and \(\mu _{V}^{s^i}\) is the states' mean value on velocity.

Consequently, an extra vocabulary encoding the joint configurations is defined as \(\boldsymbol{\textrm{D}} = \{\textrm{D}_{1}, \textrm{D}_{2}, \cdots , \textrm{D}_{\textrm{M}}\}\) with \(\textrm{M}\) configurations, where \(\textrm{D}_{k} \in \boldsymbol{\textrm{D}}\) and \(\textrm{D}_{k} = [\textrm{s}_{k}^{\textrm{E}},\textrm{s}_k^{\textrm{O}}]^\intercal \) depicts an interaction configuration explaining the jointly activated clusters occurring simultaneously in the agents' vocabularies (red nodes). Each configuration \(\textrm{D}_{k}\) consists of the average position and average velocity of the two agents, according to \(\textrm{D}_{k} = \big [ (\mu _{Pos}, \mu _{V})^{\textrm{E}}, (\mu _{Pos}, \mu _{V})^{\textrm{O}} \big ]\). After learning the joint vocabulary, the dynamic model defined in (2) can be updated as follows:

$$\begin{aligned} \tilde{\textrm{X}}_{k}^{i} = \textrm{A}\tilde{\textrm{X}}_{k-1}^{i} + \textrm{B}\mu _V^{s_k^i} + w_k, \end{aligned}$$
(3)

where \(\textrm{B} \in \mathbb {R}^{d \times d}\) stands for the control model matrix and \(\mu _V^{\textrm{s}_{k}^{i}}=[\dot{x}_{s_{k}^i}, \dot{y}_{s_{k}^i}]\) is a control vector encoding the agent's velocity (on x and y) associated with \(\textrm{s}_{k}^{i}\). The Transition Matrix (\(\textrm{TM}\)), which encodes the dynamic transitions among the learned configurations at the top level of the hierarchy, is learned by estimating the transition probabilities \(\textrm{P}(\textrm{D}_{k+1}|\textrm{D}_k)\) and is defined as:

$$\begin{aligned} \textrm{TM} = \begin{bmatrix} \textrm{P}({\textrm{D}_{1}|\textrm{D}_{1}}) & \textrm{P}({\textrm{D}_{1}| \textrm{D}_{2}}) & \dots & \textrm{P}({\textrm{D}_{1}| \textrm{D}_{\textrm{M}}}) \\ \textrm{P}({\textrm{D}_{2}| \textrm{D}_{1}}) & \textrm{P}({\textrm{D}_{2}| \textrm{D}_{2}}) & \dots & \textrm{P}({\textrm{D}_{2}| \textrm{D}_{\textrm{M}}}) \\ \vdots & \vdots & \ddots & \vdots \\ \textrm{P}({\textrm{D}_{\textrm{M}}| \textrm{D}_{1}}) & \textrm{P}({\textrm{D}_{\textrm{M}}| \textrm{D}_{2}}) & \dots & \textrm{P}({\textrm{D}_{\textrm{M}}| \textrm{D}_{\textrm{M}}}) \end{bmatrix} \end{aligned}$$
(4)

where \(\sum _{p=1}^{\textrm{M}} \textrm{P}({\textrm{D}_{p}| \textrm{D}_{m}})=1\) for every \(m \in \{1, \dots , \textrm{M}\}\).
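A minimal sketch of the cluster-conditioned prediction in (3) and of estimating \(\textrm{TM}\) in (4) by counting transitions between activated configurations follows; the matrices, the time step, and the toy configuration sequence are assumptions made for illustration only.

```python
# Sketch of the dynamic model with a cluster-dependent control input (Eq. (3))
# and of a count-based estimate of the transition matrix (Eq. (4)).
import numpy as np

d = 2
dt = 0.1
A = np.block([[np.eye(d), dt * np.eye(d)], [np.zeros((d, d)), np.eye(d)]])  # assumed GS dynamics
B = np.vstack([dt * np.eye(d), np.eye(d)])          # maps the velocity control onto the GS

X_prev = np.zeros(2 * d)
mu_V = np.array([1.0, 0.5])                         # mean velocity of the active cluster s_k^i
w = np.random.default_rng(2).normal(scale=0.01, size=2 * d)
X_pred = A @ X_prev + B @ mu_V + w                  # Eq. (3)

# Eq. (4): TM[p, m] approximates P(D_p | D_m); each column sums to one.
def estimate_tm(config_seq, M):
    counts = np.zeros((M, M))
    for prev, nxt in zip(config_seq[:-1], config_seq[1:]):
        counts[nxt, prev] += 1.0
    col_sums = counts.sum(axis=0, keepdims=True)
    col_sums[col_sums == 0] = 1.0                   # avoid dividing by zero for unvisited configs
    return counts / col_sums

config_seq = [0, 0, 1, 2, 2, 1, 0, 2]               # toy sequence of activated configurations
print(estimate_tm(config_seq, M=3))
```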

2.1.2 First-Person Model.

We build the FP model from the switching DBN (situation model), represented as a generative DBN, to consider the interaction states with a dynamic object (\(\mathrm {\hat{O}}\)) under \(\textrm{L}\)'s interpretation (see Fig. 1-(b)). The discrete level represents the learned configurations \(\textrm{D}_{m} \in \boldsymbol{\textrm{D}}\) (red nodes). The continuous level stands for the generalized relative distance between \(\textrm{E}\) and \(\textrm{O}\), which will be updated by \(\textrm{L}\) and \(\mathrm {\hat{O}}\) while interacting in the environment (green nodes). At FP-model initialization, the generalized relative distance can be computed as the difference of the joint GSs, to consider the agents' interaction in a specific configuration at time k, as follows: \(\tilde{\textrm{X}}_{k} = \big [ \tilde{\textrm{X}}^{\textrm{E}}_{k} - \tilde{\textrm{X}}^{\textrm{O}}_{k} \big ] = \big [ (\boldsymbol{x}^{\textrm{E}} - \boldsymbol{x}^{\textrm{O}})\ (\dot{\boldsymbol{x}}^{\textrm{E}} - \dot{\boldsymbol{x}}^{\textrm{O}}) \big ]\). At the bottom level (yellow nodes), the observation can be mapped onto the observations \((\textrm{Z}^{i}_{k})\) of both agents as \(\textrm{Z}_{k} = \big [ \textrm{Z}^{\textrm{E}}_{k} - \textrm{Z}^{\textrm{O}}_{k} \big ]\).
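The initialization of the generalized relative distance can be sketched as follows; the numeric GSs and observations are placeholders chosen only for illustration.

```python
# Sketch of initializing the FP model's generalized relative distance
# from the expert's and the object's GSs.
import numpy as np

d = 2
X_E = np.array([0.0, 0.0, 1.0, 0.0])               # expert GS: [position, velocity]
X_O = np.array([5.0, 1.0, 0.8, 0.0])               # object GS: [position, velocity]

X_rel = X_E - X_O                                  # generalized relative distance (positions and velocities)

Z_E = X_E[:d] + 0.01                               # toy noisy position observations
Z_O = X_O[:d] - 0.02
Z_rel = Z_E - Z_O                                  # relative observation fed to the FP model
print(X_rel, Z_rel)
```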

2.2 Online Learning Phase

In this phase, we aim to enhance IL's efficiency through a probabilistic hierarchical model in a dynamic environment. Therefore, the FP model must be combined with the active states to minimise the divergence between the internal predictions and the observations while acting in the environment.

Active First-Person Model. The AFP model takes as input a sequence of observed or previously learned configurations (\(\textrm{D}_{1:\textrm{M}}\)), an action based on the current configuration (\(a_k\)), and a random variable (\(\textrm{Z}_k\)) representing the learner's observation from its exteroceptive sensor at time instant k. The observation provides the relative distance between \(\textrm{L}\) and \(\mathrm {\hat{O}}\), which is embedded in the continuous level (\(\tilde{\textrm{X}}_k\)) of the online learning model. \(\textrm{L}\) maintains an internal dynamic representation \(\textrm{P}(\textrm{Z}, \tilde{\mathrm {{X}}}, \textrm{D}, {a})\) of the external environment, encoded in the AFP model (see Fig. 1-(c)), and aims to implicitly reduce the mismatch between what it expects to receive from the environment and what it actually perceives.

Action Selection Probability. At each time instant, \(\textrm{L}\) evaluates its current situation. In one condition, \(\textrm{L}\) has no prior knowledge about the current interaction with \(\mathrm {\hat{O}}\). In the other condition, \(\textrm{L}\) knows that \(\textrm{E}\) was in the same situation while interacting with \(\textrm{O}\) (prior belief). Thus, \(\textrm{L}\) decides whether to explore by performing new actions based on the current configuration (first condition) or to exploit the learnt configuration (second condition). The divergence between observation and prediction is the main criterion guiding the action selection procedure. \(\textrm{L}\) employs a Particle Filter (PF) to predict the learnt configuration with the least divergence up to time k. Likelihood messages (\(\lambda (\tilde{\textrm{X}}_{k})\) and \(\lambda (\textrm{D}_{k})\)), passed backward from the bottom level towards the higher levels inside the AFP, enable measuring the anomaly between the predictions of the propagated particles and the learner's observation, which is computed by the cosine similarity \((\cos (\theta ))\) as follows:

$$\begin{aligned} \cos (\theta ) = \dfrac{ \tilde{\textrm{Z}}_{k} \cdot \tilde{\textrm{X}}_{k,n}}{ \Vert \tilde{\textrm{Z}}_{k}\Vert \, \Vert \tilde{\textrm{X}}_{k,n}\Vert }. \end{aligned}$$
(5)

The computed abnormality assesses the similarity of the current observation with the predictions, and the configuration corresponding to the particle with the highest weight (the least anomaly) is assigned as the activated configuration (\(\mathring{\textrm{D}}\)). The particles' weights can be updated by using \(\lambda (\textrm{D}_{k})\), which is defined as \(\lambda (\textrm{D}_{k}) = \lambda (\tilde{\textrm{X}}_{k}) \textrm{P}(\tilde{\textrm{X}}_{k}|\textrm{D}_{k})\), where \(\lambda (\tilde{\textrm{X}}_{k})=\textrm{P}(\textrm{Z}_{k}|\tilde{\textrm{X}}_{k,n})\) is a multivariate Gaussian distribution such that \(\lambda (\tilde{\textrm{X}}_{k}) \sim \mathcal {N}(\textrm{Z}_{k}, v_{k})\), and \(\lambda (\textrm{D}_{k})\) is a discrete probability distribution. Additionally, \(\textrm{L}\) considers two parameters to perform an action. The first is the exploration rate (\(\epsilon \)), on which the highest particle weight \((\alpha )\) acts as a control input: \(\epsilon _{k} = 1 - \alpha _{k}\). Both \(\epsilon _{k}\) and \(\alpha _{k}\) take values in the interval [0, 1]: as \(\alpha _{k}\) tends to 1 (higher similarity), \(\epsilon _{k}\) goes to 0 (less exploration). \(\textrm{L}\) learns to decrease the exploration rate while minimizing the discrepancy between prediction and observation. The second parameter is a threshold (\(\rho \)), which is obtained by a trial-and-error process. \(\textrm{L}\) compares \(\rho \) and \(\epsilon _{k}\) to select an action at time k as below:

$$\begin{aligned} a_{k} \sim {\left\{ \begin{array}{ll} \underset{a_{k}}{\arg \max }\, \textrm{Q}(\mathcal {A},\mathring{\textrm{D}}_{k}), \ \text {if} \ \epsilon _{k} < \rho \ \ \text {(exploitation)}, \\ \textit{random } \ \text {from } \mathcal {A}^{+}, \ \text {if} \ \epsilon _{k} \ge \rho \ \text {(exploration)}, \end{array}\right. } \end{aligned}$$
(6)

where \(\mathcal {A}=\{\mathcal {A}^{\textrm{E}}, \mathcal {A}^{+}\}\), \(\mathcal {A}^{\textrm{E}}=\{a_{1}^{\textrm{E}}, a_{2}^{\textrm{E}}, \dots , a_{\textrm{Y}}^{\textrm{E}}\}\) is the set of actions performed by \(\textrm{E}\) and encoded in the situation model, which \(\textrm{L}\) aims to imitate during exploitation, and \(\mathcal {A}^{+}=\{a_{1}, a_{2}, \dots , a_{8}\}\) is a set of actions realizing the 8 cardinal and ordinal directions, from which \(\textrm{L}\) selects during exploration. Moreover, during exploration, \(\textrm{L}\) incrementally records the newly experienced pair \((\textrm{D}_{k}^{+}, a^{+}_{k} \in \mathcal {A}^{+})\) in the \(\textrm{Q}\)-table.
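A minimal sketch of the abnormality-driven selection in (5)-(6) is given below; the particle values, the threshold \(\rho \), the \(\textrm{Q}\)-table, and the sizes of the action and configuration sets are illustrative assumptions.

```python
# Sketch of Eqs. (5)-(6): the exploration rate is one minus the best particle
# weight (cosine similarity), compared against a threshold rho.
import numpy as np

rng = np.random.default_rng(3)

def cosine_similarity(z, x):
    return float(z @ x / (np.linalg.norm(z) * np.linalg.norm(x)))   # Eq. (5)

Z_k = np.array([1.0, 0.2, 0.5, 0.0])               # learner's (generalized) observation
particles = rng.normal(size=(50, 4))               # PF predictions of the relative GS
weights = np.array([cosine_similarity(Z_k, p) for p in particles])

best = int(np.argmax(weights))                     # particle with the least anomaly
alpha = np.clip(weights[best], 0.0, 1.0)
epsilon = 1.0 - alpha                              # exploration rate, epsilon_k = 1 - alpha_k
rho = 0.3                                          # threshold found by trial and error

Q = rng.random((8, 10))                            # Q[a, D]: action-selection probabilities
D_active = 4                                       # configuration matched by the best particle
if epsilon < rho:                                  # exploitation: imitate the learned policy
    a_k = int(np.argmax(Q[:, D_active]))
else:                                              # exploration: random cardinal/ordinal move
    a_k = int(rng.integers(8))
print(epsilon, a_k)
```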

Free Energy Measurement. By leveraging its hierarchical structure, the AFP model computes the imitation cost using bottom-up \((\lambda )\) and lateral messages (predictions via inter-slice links, \(\pi \)) that drive the posterior agent movements toward better predictions in order to optimize the Free Energy (FE), which is updated after gathering novel observations. During the online phase, \(\textrm{L}\) learns an optimized mapping between the Bayesian messages, producing a sequence of observations with a minimum discrepancy between expectation and likelihood. This discrepancy can be estimated between the predictive message \(\pi (\tilde{\textrm{X}}_{k})\) and the diagnostic message \(\lambda (\tilde{\textrm{X}}_{k})\) after performing an action (\(a_{k-1}\)). We employ the Kullback-Leibler Divergence (\(\mathcal {D_{KL}}\)) between \(\pi (\tilde{\textrm{X}}_{k})\) and \(\lambda (\tilde{\textrm{X}}_{k})\) to calculate the FE after each performed action, as:

$$\begin{aligned} \mathcal {F} = \mathcal {D_{KL}}\bigg (\lambda (\tilde{\textrm{X}}_{k}) || \pi (\tilde{\textrm{X}}_{k})\bigg ) = \int \lambda (\tilde{\textrm{X}}_{k}) \log \bigg (\frac{\lambda (\tilde{\textrm{X}}_{k})}{\pi (\tilde{\textrm{X}}_{k})}\bigg )\textrm{d}\tilde{\textrm{X}}_{k}. \end{aligned}$$
(7)
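When both messages are Gaussian, as assumed above for \(\lambda (\tilde{\textrm{X}}_{k})\), the KL divergence in (7) has a closed form; the sketch below uses illustrative means and covariances, which are assumptions rather than values from the experiments.

```python
# Sketch of the free-energy measurement in Eq. (7) for Gaussian messages,
# using the closed-form KL divergence between multivariate Gaussians.
import numpy as np

def kl_gaussian(mu0, S0, mu1, S1):
    """KL( N(mu0, S0) || N(mu1, S1) )."""
    k = mu0.shape[0]
    S1_inv = np.linalg.inv(S1)
    diff = mu1 - mu0
    return 0.5 * (np.trace(S1_inv @ S0) + diff @ S1_inv @ diff - k
                  + np.log(np.linalg.det(S1) / np.linalg.det(S0)))

mu_lambda = np.array([1.0, 0.5])                   # diagnostic message lambda(X), centered on Z_k
S_lambda = 0.05 * np.eye(2)
mu_pi = np.array([0.8, 0.6])                       # predictive message pi(X) from the AFP model
S_pi = 0.10 * np.eye(2)

F = kl_gaussian(mu_lambda, S_lambda, mu_pi, S_pi)  # Eq. (7): F = KL(lambda || pi)
print(F)
```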

Action Selection Probability Update. At each time instant k, the model evaluates the performed action \((a_{k-1})\) with respect to the FE measurement and updates the selection probabilities in the probabilistic \(\textrm{Q}\)-table according to the following equation:

$$\begin{aligned} \textrm{Q}^*_k=(1-\eta )\textrm{P}(a_{k-1}|\textrm{D}_{k-1}) + \eta \bigg [(1-{\mathcal {F}}_k) + \gamma \underset{a_{k}}{\max }\, \textrm{P}(a_{k}|\textrm{D}_{k}) \bigg ], \end{aligned}$$
(8)

where \(\eta \) is the learning rate, which controls the extent to which new experiences override the previously recorded situations, \((1-{\mathcal {F}}_k)\) is the normalized reward measurement with a range in [0, 1], and \(\gamma \) is a discount factor. Our objective is to minimize the long-term loss by keeping \(\mathcal {F}\) low through improving the action selection procedure.
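A sketch of the update in (8) on a tabular \(\textrm{Q}\) representation follows; the learning rate, discount factor, table sizes, and the normalization of \(\mathcal {F}\) are assumptions made for illustration.

```python
# Sketch of the probabilistic Q-table update in Eq. (8), where the normalized
# reward is (1 - F_k).
import numpy as np

eta, gamma = 0.1, 0.9
Q = np.full((8, 10), 1.0 / 8)                      # Q[a, D] ~ P(a | D): 8 actions, 10 configurations

def update_q(Q, a_prev, D_prev, D_curr, F_k):
    target = (1.0 - F_k) + gamma * Q[:, D_curr].max()
    Q[a_prev, D_prev] = (1.0 - eta) * Q[a_prev, D_prev] + eta * target   # Eq. (8)
    return Q

F_k = 0.2                                          # normalized free energy in [0, 1]
Q = update_q(Q, a_prev=3, D_prev=2, D_curr=5, F_k=F_k)
print(Q[3, 2])
```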

3 Experimental Evaluation

Our framework is validated using the multisensory information of two autonomous vehicles, 'iCab 1' and 'iCab 2' [13]. To consider the lane-changing scenario, the odometry module provides positional information and velocity from the vehicles, where iCab 2 overtakes iCab 1 from the left side without colliding.

3.1 Offline Learning Phase

This section shows how the situation model is structured by employing the NFF as an initial filter on the data. The GEs provided by the NFF are clustered using GNG-U, which outputs a set of discrete clusters representing the discrete regions of the trajectories generated by \(\textrm{E}\) and \(\textrm{O}\). The joint clusters yield a set of configurations that encode the interactive behavior between the agents (see Fig. 2).

Fig. 2. Learning the situation model. a) iCab2 overtakes iCab1, b) Clustering of GEs.

3.2 Online Learning Phase

The learning agent employs the FP model as prior beliefs to imitate the expected transitions. During the online phase, by balancing the exploration-exploitation trade-off, \(\textrm{L}\)'s actions are engaged to resolve the uncertain aspects of action selection caused by the new dynamic environment, which adjusts \(\textrm{L}\)'s hypotheses. The experiments are executed in a simulated environment over 500 episodes with different start positions to train the learning agent \(\textrm{L}\). Each episode consists of 10 iterations, i.e., \(\textrm{L}\) performs 5k iterations over 500 different start positions to learn the policies. We evaluate the performance of the proposed framework and compare it with other learning algorithms from the literature, namely, general Q-learning, IRL (when an optimal expert is available), and self-learning in the RL context (when optimal expert data is not available).

Performance Evaluation. After the trial stage, \(\textrm{L}\) acquires knowledge about the contingencies, and the likelihood mapping in the generative model is adequately aligned with the reference generative process and the targeted goal (e.g., overtaking the dynamic object). Crucially, we assume that the correctness and accuracy of the action selection procedure guide the learning agent to the expected observations. Figure 3-(a) illustrates that \(\textrm{L}\)'s movements are engaged coherently, which results in less exploration in each trial epoch (i.e., each episode). Additionally, Fig. 3-(a) compares the number of executed actions during training using different learning methods, showing that \(\textrm{L}\) performs fewer actions to accomplish its task with our method than with the others. Moreover, \(\textrm{L}\) adopting the proposed method achieves more successful trajectories than the other methods, as depicted in Fig. 3-(b).

Fig. 3. a) The performed actions by \(\textrm{L}\), b) The success rate.

Fig. 4. Incremental learning. a) Corresponding TM to the situation model, b) TM after clustering the learnt \(\textrm{Q}\)-table.

Fig. 5. Learning evaluation. a) The impact of exploration and learning rates after each training quarter on the imitation cost, b) FE measurement, and c) Training result.

Another noticeable point is that the AFP model expands its repertoire of action representations by learning new interactions between \(\textrm{L}\) and \(\mathrm {\hat{O}}\). The newly experienced configurations are recorded in the \(\textrm{Q}\)-table incrementally and are clustered using GNG after training. Figure 4-(a)-(b) shows that the model learned during the online phase has a larger transition matrix than the situation model, due to the exploratory aspect of \(\textrm{L}\)'s behavior.

Learning Cost Evaluation. The reduction of the loss cost is one of the components of updating the actions' probabilities (see (8)) that guides the action selection policies in active inference. Figure 5-(a) demonstrates how modifying the actions can reduce the exploration and minimize the imitation cost, resulting in a high learning rate during the training phase. Our goal is to find the set of actions that minimizes the imitation loss in terms of FE. Figure 5-(b) shows that the normalized global FE (\(\mathcal {F}\)) drops below 0.1. Figure 5-(c) demonstrates that our method outperforms the others in terms of success and loss rate, which is attributed to the effectiveness of motion prediction in dealing with abnormalities, improving the success rate.

During testing, the agent travels through 500 paths with start positions different from those used during training. The testing stage includes two levels of difficulty: I) the agent needs to overtake a single dynamic object, and II) the agent needs to overtake multiple dynamic objects. Moreover, each level has three scenarios: overtaking from the left side, overtaking from the right side, and deciding whether to overtake from the left or the right side of the object(s). Table 1 shows that the agent adapted to the learned model through the presented method can effectively overtake a single object as well as multiple dynamic objects in the environment, whereas the other methods still have a high failure rate. Moreover, the vocabulary provided by \(\textrm{E}\)'s demonstrations favors lane changes from the left side of \(\textrm{O}\). By experiencing unseen configurations during the online learning phase, \(\textrm{L}\) learns to reduce the collision probability by interacting from the dynamic object's right side. The testing results show that, with 5k training trajectories, the agent can complete the experiments in the different scenarios where it is necessary i) to overtake from the left side of the dynamic object(s), ii) to overtake from the right side of the dynamic object(s), and iii) to decide from which side of the dynamic object(s) to overtake with the lowest collision probability (mixed situation from both sides).

Table 1. Testing the learnt model after 5k trial trajectories.

4 Conclusion

A novel framework has been proposed to integrate Imitation Learning with Active Inference for autonomous driving. In the presented hybrid model, the errors between prediction and observation guide the action selection in two regimes. A low error leads \(\textrm{L}\) to exploit the prior knowledge to perform an action; in this case, \(\textrm{L}\) produces an imitative response. On the other hand, when experiencing an unobserved configuration causes a high error, \(\textrm{L}\) needs to rely on random movements. During the online phase, \(\textrm{L}\) learns how to minimize the FE and guide its movements towards imitative actions. Future work will concentrate on employing the errors to guide the random movements, which might facilitate the execution of actions congruent with the expert demonstrations.