1 Introduction

As autonomous vehicle (AV) technology progresses towards full autonomy, self-driving vehicles are likely to become a reality in the coming years [1]. Although AVs have demonstrated reliable operation in structured environments under reasonable driving circumstances, they still exhibit unpredictable behaviours with potentially disastrous consequences in unseen, highly uncertain environments [2]. A crucial component of autonomous driving is making driving decisions based on changes in the vehicle's environment and adapting to different types of driving conditions [3].

Traditional planning methods require heavy manual tuning to model the dynamic interactions among vehicles and are therefore not always scalable or cost-effective [4]. Imitation Learning (IL) offers an alternative, data-driven solution that mimics an expert's driving behaviour. However, a primary challenge in IL is specifying how an AV should act when it encounters states not covered by the demonstrations [5]. Reinforcement Learning (RL) can overcome this issue by exploring new states and actions while the agent interacts with the environment. Exploration plays a vital role in RL, yet it typically requires a large amount of training data and computational resources to succeed [6]. In response to this challenge, IL leverages inverse reinforcement learning (IRL) to deduce the reward function from demonstrations [7]–[8]. Learning a parameterized reward function provides a compact representation of the demonstrator's preferences and allows policy optimization to generalize to unseen states. However, IRL methods always carry uncertainty about the proper reward function, which can have adverse outcomes if the learning agent infers a reward function that leads it to the wrong policy. Additionally, systems that use deterministic models rather than probabilistic ones ignore the stochastic nature of dynamic environments. We propose an IL approach combined with a hierarchical probabilistic model under the Active Inference framework [9] to overcome these limitations. An active imitator agent (i.e., an AV) aims to learn a policy that is robust to uncertainty and dynamic environmental changes while accomplishing a specific task. Within active inference, the agent can effectively trade off the exploration-exploitation rate against the expected return: the AV detects novel situations by comparing what it expects to observe, based on the rules learned from the expert's demonstrations, with what it actually observes, and decides whether to explore new actions or exploit what has been learned by imitating the expert. The AV aims to learn a sequence of actions that minimizes the abnormality. The main contributions of this work are: 1) integrating Active Inference into IL to optimize the learned policy by balancing expected return and abnormality value under new experiences; 2) optimizing the exploration-exploitation rate by distinguishing normal and abnormal situations through Bayesian predictive and diagnostic messages; 3) optimizing action planning under uncertainty and minimizing the imitation loss with respect to the expert demonstrations; 4) demonstrating that the proposed framework outperforms existing RL methods in terms of learning performance.

2 Proposed Framework

The proposed framework is divided into two main phases: offline and online learning. During the first phase, we learn a situation model explaining how an expert agent (\(\textrm{E}\)) and a dynamic object (\(\textrm{O}\)) interact in the environment. Moreover, we build a First-Person generative model (FP), allowing a learning agent (\(\textrm{L}\)) to learn \(\textrm{E}\)'s behavior by observing its demonstrations. In the second phase, \(\textrm{L}\) modifies and updates its imperfect knowledge acquired from the sub-optimal expert demonstrations through an Active First-Person generative model (AFP), with respect to its relative distance from a moving object (\(\mathrm {\hat{O}}\)) in a dynamic environment. Both the FP and AFP models are structured as Dynamic Bayesian Network (DBN) representations [10].

Fig. 1. Learning models: a) Situation model, b) FP model, c) AFP model.

2.1 Offline Learning Phase

2.1.1 Situation Model.

This model is structured to explain the interaction between two dynamic agents (see Fig. 1-(a)), \(\textrm{E}\) and \(\textrm{O}\), by using a switching DBN [11]. The variables \(\textrm{Z}_{k}^{\textrm{E}}\) and \(\textrm{Z}_{k}^{\textrm{O}}\) represent the agents' observations at the low level of the hierarchy (yellow nodes). The middle level (green nodes) describes the joint hidden continuous Generalized states (GSs) containing the dynamics of the two agents at each time instant k as follows: \(\tilde{\boldsymbol{\textrm{X}}}_{k} = {[ \tilde{\textrm{X}}^{\textrm{E}}_{k}\ \tilde{\textrm{X}}^{\textrm{O}}_{k} ]}^{\intercal }\), where \(\tilde{\textrm{X}}^{\textrm{E}}_{k}\) and \(\tilde{\textrm{X}}^{\textrm{O}}_{k}\) denote the GSs of \(\textrm{E}\) and \(\textrm{O}\), respectively. The GS related to agent i is defined as a vector composed of the agent's state and its first-order temporal derivative, such that \(\tilde{\textrm{X}}_{k}^{i} = [\boldsymbol{x}\ \boldsymbol{\dot{x}}]^\intercal \), where \(\boldsymbol{x} \in \mathbb {R}^d\), \(\boldsymbol{\dot{x}} \in \mathbb {R}^d\), \(i \in \{\textrm{E}, \textrm{O}\}\), and d stands for the dimensionality. The relation between \(\textrm{Z}_{k}^{i}\) and \(\tilde{\textrm{X}}_{k}^{i}\), which describes the observation model, is defined as:

$$\begin{aligned} \textrm{Z}_{k}^{i}=\textrm{H}\tilde{\textrm{X}}_{k}^{i} + v_{k}, \end{aligned}$$
(1)

where \(\textrm{H}=[\textrm{I}_d\ 0_{d,d}]\) is the observation matrix that measures how dependent the measurements (\(\textrm{Z}_{k}^{i}\)) are upon the hidden GSs (\(\tilde{\textrm{X}}_{k}^{i}\)), and \(v_k \sim \mathcal {N}(0,\textrm{R})\) is the measurement noise, which follows a zero-mean Gaussian distribution with covariance \(\textrm{R}\).
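To make the observation model concrete, the following minimal Python/NumPy sketch builds \(\textrm{H}=[\textrm{I}_d\ 0_{d,d}]\) and draws one noisy observation according to (1); the dimensionality d = 2, the GS values, and the noise covariance are illustrative assumptions, not values from the paper.

```python
# Minimal sketch of the observation model in Eq. (1); d = 2 and the noise
# level are assumptions made only for this example.
import numpy as np

d = 2
H = np.hstack([np.eye(d), np.zeros((d, d))])      # H = [I_d  0_{d,d}], shape (d, 2d)
R = 0.01 * np.eye(d)                              # measurement noise covariance

rng = np.random.default_rng(0)
X_tilde = np.array([1.0, 2.0, 0.5, -0.3])         # GS: position and velocity stacked
v = rng.multivariate_normal(np.zeros(d), R)       # v_k ~ N(0, R)
Z = H @ X_tilde + v                               # Eq. (1): noisy observation of the position part
print(Z)
```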

Initially, the evolution of \(\tilde{\boldsymbol{\textrm{X}}}_{k}\) is assumed to follow a static equilibrium model described by:

$$\begin{aligned} \tilde{\textrm{X}}_{k}^{i} = \textrm{A} \tilde{\textrm{X}}_{k-1}^{i} + w_k, \end{aligned}$$
(2)

where \(\textrm{A} \in \mathbb {R}^{d \times d}\) and \(w_k \sim \mathcal {N}(0,\textrm{Q})\) denote the dynamic matrix and the process noise, respectively. \(\tilde{\boldsymbol{\textrm{X}}}_{k}\) is predicted by a Null Force Filter (NFF) according to (2). The NFF calculates the innovations that encode the deviations between predictions and observations as: \(\varepsilon _{\tilde{\textrm{X}}_{k}}=\textrm{H}^{-1}\big (\textrm{Z}_{k}^{i}-\textrm{H}\tilde{\textrm{X}}_{k}^{i}\big )\). The Growing Neural Gas with utility measurement (GNG-U) [12] is employed to cluster, in an unsupervised manner, those innovations \(\varepsilon _{\tilde{\textrm{X}}_{k}}\), which characterise the generalized errors (GEs), and it outputs a vocabulary defined as \(\boldsymbol{\textrm{S}}^{i}=\{\textrm{s}^i_1, \textrm{s}^i_2, \dots , \textrm{s}^i_{\mathcal {L}_{i}}\}\) consisting of \(\mathcal {L}_{i}\) clusters.
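As a rough illustration of this step, the sketch below computes the innovations and clusters them. Since \(\textrm{H}\) is not square, a pseudo-inverse is used in place of \(\textrm{H}^{-1}\), and plain k-means stands in for GNG-U purely for brevity; the synthetic data and the number of clusters are assumptions.

```python
# Sketch: innovations (generalized errors) plus an unsupervised clustering step.
# k-means is an illustrative stand-in for the GNG-U used in the paper.
import numpy as np
from sklearn.cluster import KMeans

d = 2
H = np.hstack([np.eye(d), np.zeros((d, d))])
H_pinv = np.linalg.pinv(H)                         # H is rectangular; pseudo-inverse in practice

rng = np.random.default_rng(1)
K = 200
X_pred = rng.normal(size=(K, 2 * d))               # NFF predictions of the GSs (Eq. (2))
Z_obs = (H @ X_pred.T).T + rng.normal(scale=0.1, size=(K, d))   # noisy observations

# Innovations / generalized errors: eps_k = H^+ (Z_k - H * X_tilde_k)
eps = (H_pinv @ (Z_obs - (H @ X_pred.T).T).T).T    # shape (K, 2d)

# Stand-in for GNG-U: cluster the generalized errors into a vocabulary S^i
vocab = KMeans(n_clusters=5, n_init=10, random_state=0).fit(eps)
print(vocab.cluster_centers_.shape)                # (L_i, 2d) cluster prototypes
```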

The interaction between \(\textrm{E}\) and \(\textrm{O}\) at multiple levels (i.e., discrete and continuous) is described by a joint vocabulary (i.e., \(\boldsymbol{\textrm{S}}^{\textrm{E}}, \boldsymbol{\textrm{S}}^{\textrm{O}}\)) expressing discrete regions with quasi-linear models that explain the interactive dynamic evolution of the joint states over time. Each discrete cluster \(\textrm{s}^i \in \boldsymbol{\textrm{S}}^{i}\) follows a multivariate Gaussian distribution with covariance matrix \(\tilde{\Sigma }_{\textrm{s}^{i}_{k}}\) and generalized mean value \(\tilde{\boldsymbol{\mu }}^{\textrm{s}^{i}}=[\mu _{Pos}^{s^i} \ \mu _{V}^{s^i}]\), where \(\mu _{Pos}^{s^i}\) is the states' mean value on positions and \(\mu _{V}^{s^i}\) is the states' mean value on velocity.

Consequently, an extra vocabulary encoding the joint configurations is defined as \(\boldsymbol{\textrm{D}} = \{\textrm{D}_{1}, \textrm{D}_{2}, \cdots , \textrm{D}_{\textrm{M}}\}\) with \(\textrm{M}\) configurations, where \(\textrm{D}_{k} \in \boldsymbol{\textrm{D}}\) and \(\textrm{D}_{k} = [\textrm{s}_{k}^{\textrm{E}},\textrm{s}_k^{\textrm{O}}]^\intercal \) depicts an interaction configuration explaining the jointly activated clusters occurring simultaneously in the agents' vocabularies (red nodes). Each configuration \(\textrm{D}_{k}\) consists of the average position and average velocity of the two agents, according to \(\textrm{D}_{k} = \big [ (\mu _{Pos}, \mu _{V})^{\textrm{E}}, (\mu _{Pos}, \mu _{V})^{\textrm{O}} \big ]\). After learning the joint vocabulary, the dynamic model defined in (2) can be updated as follows:

$$\begin{aligned} \tilde{\textrm{X}}_{k}^{i} = \textrm{A}\tilde{\textrm{X}}_{k-1}^{i} + \textrm{B}\mu _V^{s_k^i} + w_k, \end{aligned}$$
(3)

where \(\textrm{B} \in \mathbb {R}^{d \times d}\) stands for the control model matrix and \(\mu _V^{\textrm{s}_{k}^{i}}=[\dot{x}_{s_{k}^i}, \dot{y}_{s_{k}^i}]\) is a control vector encoding the agent's velocity (on x and y) associated with \(\textrm{s}_{k}^{i}\). The Transition Matrix (\(\textrm{TM}\)), which encodes the dynamic transitions among the learned configurations at the top level of the hierarchy, is learned by estimating the transition probabilities \(\textrm{P}(\textrm{D}_{k+1}|\textrm{D}_k)\) and is defined as:

$$\begin{aligned} \textrm{TM} = \begin{bmatrix} \textrm{P}({\textrm{D}_{1}|\textrm{D}_{1}}) & \textrm{P}({\textrm{D}_{1}| \textrm{D}_{2}}) & \dots & \textrm{P}({\textrm{D}_{1}| \textrm{D}_{\textrm{M}}}) \\ \textrm{P}({\textrm{D}_{2}| \textrm{D}_{1}}) & \textrm{P}({\textrm{D}_{2}| \textrm{D}_{2}}) & \dots & \textrm{P}({\textrm{D}_{2}| \textrm{D}_{\textrm{M}}}) \\ \vdots & \vdots & \ddots & \vdots \\ \textrm{P}({\textrm{D}_{\textrm{M}}| \textrm{D}_{1}}) & \textrm{P}({\textrm{D}_{\textrm{M}}| \textrm{D}_{2}}) & \dots & \textrm{P}({\textrm{D}_{\textrm{M}}| \textrm{D}_{\textrm{M}}}) \end{bmatrix} \end{aligned}$$
(4)

where \(\sum _{p=1}^{\textrm{M}} \textrm{P}({\textrm{D}_{p}| \textrm{D}_{m}})=1\) for every \(m \in \{1, \dots , \textrm{M}\}\).
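A minimal sketch of the cluster-conditioned prediction in (3) and of estimating \(\textrm{TM}\) in (4) by counting transitions between activated configurations follows; the matrices, the time step, and the toy configuration sequence are assumptions made for illustration only.

```python
# Sketch of the dynamic model with a cluster-dependent control input (Eq. (3))
# and of a count-based estimate of the transition matrix (Eq. (4)).
import numpy as np

d = 2
dt = 0.1
A = np.block([[np.eye(d), dt * np.eye(d)], [np.zeros((d, d)), np.eye(d)]])  # assumed GS dynamics
B = np.vstack([dt * np.eye(d), np.eye(d)])          # maps the velocity control onto the GS

X_prev = np.zeros(2 * d)
mu_V = np.array([1.0, 0.5])                         # mean velocity of the active cluster s_k^i
w = np.random.default_rng(2).normal(scale=0.01, size=2 * d)
X_pred = A @ X_prev + B @ mu_V + w                  # Eq. (3)

# Eq. (4): TM[p, m] approximates P(D_p | D_m); each column sums to one.
def estimate_tm(config_seq, M):
    counts = np.zeros((M, M))
    for prev, nxt in zip(config_seq[:-1], config_seq[1:]):
        counts[nxt, prev] += 1.0
    col_sums = counts.sum(axis=0, keepdims=True)
    col_sums[col_sums == 0] = 1.0                   # avoid dividing by zero for unvisited configs
    return counts / col_sums

config_seq = [0, 0, 1, 2, 2, 1, 0, 2]               # toy sequence of activated configurations
print(estimate_tm(config_seq, M=3))
```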

2.1.2 First-Person Model.

We build the FP model from the switching DBN (situation model), represented as a generative DBN, to consider the interaction states with a dynamic object (\(\mathrm {\hat{O}}\)) under \(\textrm{L}\)'s interpretation (see Fig. 1-(b)). The discrete level represents the learned configurations \(\textrm{D}_{m} \in \boldsymbol{\textrm{D}}\) (red nodes). The continuous level stands for the generalized relative distance between \(\textrm{E}\) and \(\textrm{O}\), which will be updated by \(\textrm{L}\) and \(\mathrm {\hat{O}}\) while interacting in the environment (green nodes). At FP-model initialization, the generalized relative distance can be computed as the difference of the joint GSs, to consider the agents' interaction in a specific configuration at time k, as follows: \(\tilde{\textrm{X}}_{k} = \big [ \tilde{\textrm{X}}^{\textrm{E}}_{k} - \tilde{\textrm{X}}^{\textrm{O}}_{k} \big ] = \big [ (\boldsymbol{x}^{\textrm{E}} - \boldsymbol{x}^{\textrm{O}})\ (\dot{\boldsymbol{x}}^{\textrm{E}} - \dot{\boldsymbol{x}}^{\textrm{O}}) \big ]\). At the bottom level (yellow nodes), the observation can be mapped onto the observations \((\textrm{Z}^{i}_{k})\) of both agents as \(\textrm{Z}_{k} = \big [ \textrm{Z}^{\textrm{E}}_{k} - \textrm{Z}^{\textrm{O}}_{k} \big ]\).
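The initialization of the generalized relative distance can be sketched as follows; the numeric GSs and observations are placeholders chosen only for illustration.

```python
# Sketch of initializing the FP model's generalized relative distance
# from the expert's and the object's GSs.
import numpy as np

d = 2
X_E = np.array([0.0, 0.0, 1.0, 0.0])               # expert GS: [position, velocity]
X_O = np.array([5.0, 1.0, 0.8, 0.0])               # object GS: [position, velocity]

X_rel = X_E - X_O                                  # generalized relative distance (positions and velocities)

Z_E = X_E[:d] + 0.01                               # toy noisy position observations
Z_O = X_O[:d] - 0.02
Z_rel = Z_E - Z_O                                  # relative observation fed to the FP model
print(X_rel, Z_rel)
```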

2.2 Online Learning Phase

In this phase, we aim to enhance IL's efficiency through a probabilistic hierarchical model in a dynamic environment. Therefore, the FP model must be combined with the active states to minimise the divergence between the internal predictions and the observations while acting in the environment.

Active First-Person Model. The AFP model takes as input a sequence of observed or previously learned configurations (\(\textrm{D}_{1:\textrm{M}}\)), an action based on the current configuration (\(a_k\)), and a random variable (\(\textrm{Z}_k\)) representing the learner's observation from its exteroceptive sensor at time instant k. The observation provides the relative distance between \(\textrm{L}\) and \(\mathrm {\hat{O}}\), which is embedded in the continuous level (\(\tilde{\textrm{X}}_k\)) of the online learning model. \(\textrm{L}\) maintains an internal dynamic representation \(\textrm{P}(\textrm{Z}, \tilde{\mathrm {{X}}}, \textrm{D}, {a})\) of the external environment, encoded in the AFP model (see Fig. 1-(c)), and aims to implicitly reduce the mismatch between what it expects to receive from the environment and what it actually perceives.

Action Selection Probability. At each time instant, \(\textrm{L}\) evaluates its current situation. In one condition, \(\textrm{L}\) has no prior knowledge about the current interaction with \(\mathrm {\hat{O}}\). In the other condition, \(\textrm{L}\) knows that \(\textrm{E}\) was in the same situation while interacting with \(\textrm{O}\) (prior belief). Thus, \(\textrm{L}\) decides whether to explore by performing new actions based on the current configuration (first condition) or to exploit the learnt configuration (second condition). The divergence between observation and prediction is the main criterion guiding the action selection procedure. \(\textrm{L}\) employs a Particle Filter (PF) to predict the learnt configuration with the least divergence up to time k. Likelihood messages (\(\lambda (\tilde{\textrm{X}}_{k})\) and \(\lambda (\textrm{D}_{k})\)), passed backward from the bottom level towards the higher levels inside the AFP, enable measuring the anomaly between the predictions of the propagated particles and the learner's observation, which is computed by the cosine similarity \((\cos (\theta ))\) as follows:

$$\begin{aligned} \cos (\theta ) = \dfrac{ \tilde{\textrm{Z}}_{k} \cdot \tilde{\textrm{X}}_{k,n}}{ \Vert \tilde{\textrm{Z}}_{k}\Vert \, \Vert \tilde{\textrm{X}}_{k,n}\Vert }. \end{aligned}$$
(5)

The computed abnormality assesses the similarity of the current observation with the predictions, and the configuration corresponding to the particle with the highest weight (the least anomaly) is assigned as the activated configuration (\(\mathring{\textrm{D}}\)). The particles' weights can be updated by using \(\lambda (\textrm{D}_{k})\), which is defined as \(\lambda (\textrm{D}_{k}) = \lambda (\tilde{\textrm{X}}_{k}) \textrm{P}(\tilde{\textrm{X}}_{k}|\textrm{D}_{k})\), where \(\lambda (\tilde{\textrm{X}}_{k})=\textrm{P}(\textrm{Z}_{k}|\tilde{\textrm{X}}_{k,n})\) is a multivariate Gaussian distribution such that \(\lambda (\tilde{\textrm{X}}_{k}) \sim \mathcal {N}(\textrm{Z}_{k}, v_{k})\), and \(\lambda (\textrm{D}_{k})\) is a discrete probability distribution. Additionally, \(\textrm{L}\) considers two parameters to perform an action. The first is the exploration rate (\(\epsilon \)), on which the highest particle weight \((\alpha )\) acts as a control input: \(\epsilon _{k} = 1 - \alpha _{k}\). Both \(\epsilon _{k}\) and \(\alpha _{k}\) take values in the interval [0, 1]: as \(\alpha _{k}\) tends to 1 (higher similarity), \(\epsilon _{k}\) goes to 0 (less exploration). \(\textrm{L}\) learns to decrease the exploration rate while minimizing the discrepancy between prediction and observation. The second parameter is a threshold (\(\rho \)), which is obtained by a trial-and-error process. \(\textrm{L}\) compares \(\rho \) and \(\epsilon _{k}\) to select an action at time k as below:

$$\begin{aligned} a_{k} \sim {\left\{ \begin{array}{ll} \underset{a_{k}}{\arg \max }\, \textrm{Q}(\mathcal {A},\mathring{\textrm{D}}_{k}), \ \text {if} \ \epsilon _{k} < \rho \ \ \text {(exploitation)}, \\ \textit{random } \ \text {from } \mathcal {A}^{+}, \ \text {if} \ \epsilon _{k} \ge \rho \ \text {(exploration)}, \end{array}\right. } \end{aligned}$$
(6)

where \(\mathcal {A}=\{\mathcal {A}^{\textrm{E}}, \mathcal {A}^{+}\}\), \(\mathcal {A}^{\textrm{E}}=\{a_{1}^{\textrm{E}}, a_{2}^{\textrm{E}}, \dots , a_{\textrm{Y}}^{\textrm{E}}\}\) is the set of actions performed by \(\textrm{E}\) and encoded in the situation model, which \(\textrm{L}\) aims to imitate during exploitation, and \(\mathcal {A}^{+}=\{a_{1}, a_{2}, \dots , a_{8}\}\) is a set of actions realizing the 8 cardinal and ordinal directions, from which \(\textrm{L}\) selects during exploration. Moreover, during exploration, \(\textrm{L}\) incrementally records the newly experienced pair \((\textrm{D}_{k}^{+}, a^{+}_{k} \in \mathcal {A}^{+})\) in the \(\textrm{Q}\)-table.
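A minimal sketch of the abnormality-driven selection in (5)-(6) is given below; the particle values, the threshold \(\rho \), the \(\textrm{Q}\)-table, and the sizes of the action and configuration sets are illustrative assumptions.

```python
# Sketch of Eqs. (5)-(6): the exploration rate is one minus the best particle
# weight (cosine similarity), compared against a threshold rho.
import numpy as np

rng = np.random.default_rng(3)

def cosine_similarity(z, x):
    return float(z @ x / (np.linalg.norm(z) * np.linalg.norm(x)))   # Eq. (5)

Z_k = np.array([1.0, 0.2, 0.5, 0.0])               # learner's (generalized) observation
particles = rng.normal(size=(50, 4))               # PF predictions of the relative GS
weights = np.array([cosine_similarity(Z_k, p) for p in particles])

best = int(np.argmax(weights))                     # particle with the least anomaly
alpha = np.clip(weights[best], 0.0, 1.0)
epsilon = 1.0 - alpha                              # exploration rate, epsilon_k = 1 - alpha_k
rho = 0.3                                          # threshold found by trial and error

Q = rng.random((8, 10))                            # Q[a, D]: action-selection probabilities
D_active = 4                                       # configuration matched by the best particle
if epsilon < rho:                                  # exploitation: imitate the learned policy
    a_k = int(np.argmax(Q[:, D_active]))
else:                                              # exploration: random cardinal/ordinal move
    a_k = int(rng.integers(8))
print(epsilon, a_k)
```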

Free Energy Measurement. By leveraging its hierarchical structure, the AFP model computes the imitation cost using bottom-up \((\lambda )\) and lateral messages (predictions via inter-slice links, \(\pi \)) that drive the posterior agent movements toward better predictions in order to optimize the Free Energy (FE), which is updated after gathering novel observations. During the online phase, \(\textrm{L}\) learns an optimized mapping between the Bayesian messages, producing a sequence of observations with a minimum discrepancy between expectation and likelihood. This discrepancy can be estimated between the predictive message \(\pi (\tilde{\textrm{X}}_{k})\) and the diagnostic message \(\lambda (\tilde{\textrm{X}}_{k})\) after performing an action (\(a_{k-1}\)). We employ the Kullback-Leibler Divergence (\(\mathcal {D_{KL}}\)) between \(\pi (\tilde{\textrm{X}}_{k})\) and \(\lambda (\tilde{\textrm{X}}_{k})\) to calculate the FE after each performed action, as:

$$\begin{aligned} \mathcal {F} = \mathcal {D_{KL}}\bigg (\lambda (\tilde{\textrm{X}}_{k}) || \pi (\tilde{\textrm{X}}_{k})\bigg ) = \int \lambda (\tilde{\textrm{X}}_{k}) \log \bigg (\frac{\lambda (\tilde{\textrm{X}}_{k})}{\pi (\tilde{\textrm{X}}_{k})}\bigg )\textrm{d}\tilde{\textrm{X}}_{k}. \end{aligned}$$
(7)
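When both messages are Gaussian, as assumed above for \(\lambda (\tilde{\textrm{X}}_{k})\), the KL divergence in (7) has a closed form; the sketch below uses illustrative means and covariances, which are assumptions rather than values from the experiments.

```python
# Sketch of the free-energy measurement in Eq. (7) for Gaussian messages,
# using the closed-form KL divergence between multivariate Gaussians.
import numpy as np

def kl_gaussian(mu0, S0, mu1, S1):
    """KL( N(mu0, S0) || N(mu1, S1) )."""
    k = mu0.shape[0]
    S1_inv = np.linalg.inv(S1)
    diff = mu1 - mu0
    return 0.5 * (np.trace(S1_inv @ S0) + diff @ S1_inv @ diff - k
                  + np.log(np.linalg.det(S1) / np.linalg.det(S0)))

mu_lambda = np.array([1.0, 0.5])                   # diagnostic message lambda(X), centered on Z_k
S_lambda = 0.05 * np.eye(2)
mu_pi = np.array([0.8, 0.6])                       # predictive message pi(X) from the AFP model
S_pi = 0.10 * np.eye(2)

F = kl_gaussian(mu_lambda, S_lambda, mu_pi, S_pi)  # Eq. (7): F = KL(lambda || pi)
print(F)
```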

Action Selection Probability Update. At each time instant k, the model evaluates the performed action \((a_{k-1})\) with respect to the FE measurement and updates the selection probabilities in the probabilistic \(\textrm{Q}\)-table according to the following equation:

$$\begin{aligned} \textrm{Q}^*_k=(1-\eta )\textrm{P}(a_{k-1}|\textrm{D}_{k-1}) + \eta \bigg [(1-{\mathcal {F}}_k) + \gamma \underset{a_{k}}{\max }\, \textrm{P}(a_{k}|\textrm{D}_{k}) \bigg ], \end{aligned}$$
(8)

where \(\eta \) is the learning rate, which controls the extent to which new experiences override the previously recorded situations, \((1-{\mathcal {F}}_k)\) is the normalized reward measurement with a range in [0, 1], and \(\gamma \) is a discount factor. Our objective is to minimize the long-term loss by keeping \(\mathcal {F}\) low through improving the action selection procedure.
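A sketch of the update in (8) on a tabular \(\textrm{Q}\) representation follows; the learning rate, discount factor, table sizes, and the normalization of \(\mathcal {F}\) are assumptions made for illustration.

```python
# Sketch of the probabilistic Q-table update in Eq. (8), where the normalized
# reward is (1 - F_k).
import numpy as np

eta, gamma = 0.1, 0.9
Q = np.full((8, 10), 1.0 / 8)                      # Q[a, D] ~ P(a | D): 8 actions, 10 configurations

def update_q(Q, a_prev, D_prev, D_curr, F_k):
    target = (1.0 - F_k) + gamma * Q[:, D_curr].max()
    Q[a_prev, D_prev] = (1.0 - eta) * Q[a_prev, D_prev] + eta * target   # Eq. (8)
    return Q

F_k = 0.2                                          # normalized free energy in [0, 1]
Q = update_q(Q, a_prev=3, D_prev=2, D_curr=5, F_k=F_k)
print(Q[3, 2])
```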

3 Experimental Evaluation

Our framework is validated using the multisensory information of two autonomous vehicles, 'iCab 1' and 'iCab 2' [13]. To consider the lane-changing scenario, the odometry module provides positional information and velocity from the vehicles, where iCab 2 overtakes iCab 1 from the left side without colliding.

3.1 Offline Learning Phase

This section shows how the situation model is structured by employing the NFF as an initial filter on the data. The GEs provided by the NFF are clustered using GNG-U, which outputs a set of discrete clusters representing the discrete regions of the trajectories generated by \(\textrm{E}\) and \(\textrm{O}\). The joint clusters yield a set of configurations that encode the interactive behavior between the agents (see Fig. 2).

Fig. 2. Learning the situation model. a) iCab2 overtakes iCab1, b) Clustering of GEs.

3.2 Online Learning Phase

The learning agent employs the FP model as prior beliefs to imitate the expected transitions. During the online phase, by balancing the exploration-exploitation trade-off, \(\textrm{L}\)'s actions are engaged to resolve the uncertain aspects of action selection caused by the new dynamic environment, which adjusts \(\textrm{L}\)'s hypotheses. The experiments are executed in a simulated environment over 500 episodes with different start positions to train the learning agent \(\textrm{L}\). Each episode consists of 10 iterations, i.e., \(\textrm{L}\) performs 5k iterations over 500 different start positions to learn the policies. We evaluate the performance of the proposed framework and compare it with other learning algorithms from the literature, namely, general Q-learning, IRL (when an optimal expert is available), and self-learning in the RL context (when optimal expert data is not available).

Performance Evaluation. After the trial stage, \(\textrm{L}\) acquires knowledge about the contingencies, and the likelihood mapping in the generative model is adequately aligned with the reference generative process and the targeted goal (e.g., overtaking the dynamic object). Crucially, we assume that the correctness and accuracy of the action selection procedure guide the learning agent to the expected observations. Figure 3-(a) illustrates that \(\textrm{L}\)'s movements are engaged coherently, which results in less exploration in each trial epoch (i.e., each episode). Additionally, Fig. 3-(a) compares the number of executed actions during training using different learning methods, showing that \(\textrm{L}\) performs fewer actions to accomplish its task with our method than with the others. Moreover, \(\textrm{L}\) adopting the proposed method achieves more successful trajectories than the other methods, as depicted in Fig. 3-(b).

Fig. 3. a) The performed actions by \(\textrm{L}\), b) The success rate.

Fig. 4. Incremental learning. a) Corresponding TM to the situation model, b) TM after clustering the learnt \(\textrm{Q}\)-table.

Fig. 5. Learning evaluation. a) The impact of exploration and learning rates after each training quarter on the imitation cost, b) FE measurement, and c) Training result.

Another noticeable point is that the AFP model expands its repertoire of action representations by learning new interactions between \(\textrm{L}\) and \(\mathrm {\hat{O}}\). The newly experienced configurations are recorded in the \(\textrm{Q}\)-table incrementally and are clustered using GNG after training. Figure 4-(a)-(b) shows that the model learned during the online phase has a larger transition matrix than the situation model, due to the exploratory aspect of \(\textrm{L}\)'s behavior.

Learning Cost Evaluation. The reduction of the loss cost is one of the components of updating the actions' probabilities (see (8)) that guides the action selection policies in active inference. Figure 5-(a) demonstrates how modifying the actions can reduce the exploration and minimize the imitation cost, resulting in a high learning rate during the training phase. Our goal is to find the set of actions that minimizes the imitation loss in terms of FE. Figure 5-(b) shows that the normalized global FE (\(\mathcal {F}\)) drops below 0.1. Figure 5-(c) demonstrates that our method outperforms the others in terms of success and loss rate, which is attributed to the effectiveness of motion prediction in dealing with abnormalities, improving the success rate.

During testing, the agent travels through 500 paths with start positions different from those used during training. The testing stage includes two levels of difficulty: I) the agent needs to overtake a single dynamic object, and II) the agent needs to overtake multiple dynamic objects. Moreover, each level has three scenarios: overtaking from the left side, overtaking from the right side, and deciding whether to overtake from the left or the right side of the object(s). Table 1 shows that the agent adapted to the learned model through the presented method can effectively overtake a single object as well as multiple dynamic objects in the environment, whereas the other methods still have a high failure rate. Moreover, the vocabulary provided by \(\textrm{E}\)'s demonstrations favors lane changes from the left side of \(\textrm{O}\). By experiencing unseen configurations during the online learning phase, \(\textrm{L}\) learns to reduce the collision probability by interacting from the dynamic object's right side. The testing results show that, with 5k training trajectories, the agent can complete the experiments in the different scenarios where it is necessary i) to overtake from the left side of the dynamic object(s), ii) to overtake from the right side of the dynamic object(s), and iii) to decide from which side of the dynamic object(s) to overtake with the lowest collision probability (mixed situation from both sides).

Table 1. Testing the learnt model after 5k trial trajectories.

4 Conclusion

A novel framework has been proposed to integrate Imitation Learning with Active Inference for autonomous driving. In the presented hybrid model, the errors between prediction and observation guide the action selection in two regimes. A low error leads \(\textrm{L}\) to exploit the prior knowledge to perform an action; in this case, \(\textrm{L}\) produces an imitative response. On the other hand, when experiencing an unobserved configuration causes a high error, \(\textrm{L}\) needs to rely on random movements. During the online phase, \(\textrm{L}\) learns how to minimize the FE and guide its movements towards imitative actions. Future work will concentrate on employing the errors to guide the random movements, which might facilitate the execution of actions congruent with the expert demonstrations.