Introduction

Autonomous driving has attracted increasing attention from practitioners and researchers [1]. The primary research areas for autonomous driving cover three major aspects: visual perception, planning and decision-making, and motion control [2]. Trajectory tracking is a key technology for precisely controlling autonomous vehicles. As a fundamental part of the motion control module, trajectory tracking aims to compute the vehicle control commands from the reference trajectory provided by the path planning module so that the reference trajectory is tracked accurately. The performance of trajectory tracking largely determines the overall performance of autonomous vehicles, affecting driving safety, passenger comfort, travel efficiency, energy consumption, and so on. The trajectory tracking problem in autonomous driving is very challenging, as the system involved is typically highly nonlinear and contains large-scale state-wise constraints [3]. How to model the trajectory tracking problem properly, set up the obstacle avoidance strategy, reduce the heavy computational burden of the solution process, and minimize the trajectory tracking error are crucial issues in trajectory tracking research [4, 5].

The current vehicle trajectory tracking control methods mainly include sliding mode control (SMC), proportional-integral-derivative (PID) control, and model predictive control (MPC) [6,7,8]. SMC is a nonlinear variable-structure control method in which the controller structure switches according to the system state, forcing the system to evolve along a prescribed sliding mode. SMC has the advantages of fast response, insensitivity to parameter changes and perturbations, and no need for online system identification [9], but it is prone to chattering, which is the most noticeable issue affecting its practical application [10]. The chattering can be lessened by approximating the discontinuous terms with continuous functions such as the saturation function, the arctan function, and the hyperbolic tangent function [11,12,13]. Although some improvements have been obtained with these modified SMC methods, problems remain in the presence of external disturbances [14,15,16]. PID control is used broadly in industry because it is simple and does not require a mathematical model [17], but it handles nonlinear systems poorly and its parameter tuning is complicated, which makes it hard to guarantee control accuracy and robustness [18]. MPC employs the nonlinear dynamics model of the controlled system and predicts its output behavior over a future horizon by solving a constrained optimal control problem [19]; however, its tracking performance is sensitive to the accuracy of the prediction model, and nonlinear MPC places a high demand on computational resources [20].

Reinforcement learning (RL) solves sequential decision-making problems via a trial-and-error process of interacting with the environment [21]. Model-based RL builds environment models in which trial and error can take place without real-world costs, and it can improve RL algorithms significantly by making them more sample-efficient and thus reducing errors [22]. Deep neural networks (DNNs) have been commonly used as function approximators in RL; they mitigate the state-action space explosion as the number of recorded state-action pairs grows [23] and generalize prior experience to previously unseen state-action pairs, leading to the birth of deep reinforcement learning (DRL). Actor-critic methods for RL are hybrid methods that combine the advantages of policy-based and value-based algorithms, where the ‘actor’ is the policy structure in charge of choosing actions and the ‘critic’ is the estimated value function that evaluates the actor’s actions. Zhao et al. [24] proposed a model-based actor-critic framework for optimal tracking control of robotic systems, which overcomes the issue of long learning cycles for control strategies. The actor-critic solution framework is widely used in autonomous driving tasks, especially when the training of RL or DRL becomes hard because the state space is enormous [25]. Recently, given the uncertainties in driving conditions, a soft actor-critic scheme was used in [26] to solve the decision-making and planning problem with a continuous action space. Note that, in general, DRL lacks closed-loop stability analysis, guarantees of state-constraint satisfaction, and a meaningful weight initialization capability [27]. To deal with these issues, the prospect of using MPC-based RL has been proposed and justified in [28], where MPC is suggested as the function approximation for the optimal policy in RL. MPC-based strategies can effectively deal with the multivariable constraints in autonomous vehicle tracking under uncertain environments by building systems that satisfy state constraints and safety requirements [29].

Recently, Guan et al. [30] proposed an integrated decision and control (IDC) framework for autonomous vehicles, which has better interpretability and computational efficiency than the traditional decomposed scheme [31] and the end-to-end scheme [32], and is applicable to different driving scenarios and tasks. A model-based actor-critic RL algorithm, called the generalized exterior point method (GEP), is presented in [30] for solving the formulated MPC-based optimal control problem with large-scale state-wise constraints in a rolling optimization process so as to minimize the trajectory tracking error. The GEP extends the exterior point method from the optimization domain to the field of neural networks (NNs): it transforms the constrained optimal control problem into an unconstrained problem with penalties for safety violations, and an actor-critic framework is employed to solve it. The purpose of the actor is to obtain a good policy NN that makes the trajectory tracking performance as good as possible without collision. The penalized objective function is mapped to the parameterized loss function of the policy NN for model training; the input is the state vector of the autonomous vehicle and the output is the control vector, thus providing the corresponding control commands for the autonomous vehicle. Meanwhile, the critic, whose role is to judge how good a state is, can serve as the path selector. Unlike the exterior point method in the optimization domain, the GEP relaxes the conditions needed to establish convergence of the exterior point method, eliminates the need to find global minima in each gradient descent step, and updates the NN parameters effectively under the guidance of the model. An approximately feasible optimal control strategy is obtained by performing gradient descent and updating the penalty parameter alternately. For the exterior point method, it can be shown that under certain conditions, when the penalty parameter increases to infinity, the optimal solution of the penalty problem is the optimal solution of the original problem [33]. However, as the penalty parameter increases, the problem becomes ill-conditioned [34, 35], which leads to severe numerical instabilities [36], so the penalty parameter cannot be made arbitrarily large. A similar issue arises in DNN training [37,38,39], where each iterative update of the network parameters is essentially a step of gradient descent. On the one hand, we do not know exactly how large the penalty parameters need to be: theoretically they must be increased to infinity to ensure convergence, but due to practical computational limitations they can be neither too large nor too small [40]. On the other hand, as the penalty parameters keep increasing, the condition number of the Hessian matrix of the loss function grows, the correlation between the columns of the matrix becomes too high [41, 42], and the ill-conditioning in DNN training becomes more and more serious, resulting in very slow or even non-converging training [38, 39]. As a consequence, the control output by the DNN may be suboptimal or even constraint-violating, so the quality of the trained model can hardly be guaranteed.

In this paper, we propose a model-based RL algorithm, called the actor-critic objective penalty function method (ACOPFM), to solve the trajectory tracking problem. The ACOPFM can effectively deal with large-scale nonlinear state-wise constraints and avoids the ill-conditioning caused by the growing condition number of the Hessian matrix during the training of the GEP policy network. The ACOPFM transforms the nonlinear programming problem into a parametric, unconstrained optimization problem and employs the objective penalty function as the loss function for policy network updates, finding the optimal control strategy by alternately performing gradient descent and adaptively adjusting the penalty parameter. The convergence of the ACOPFM is proved. Meanwhile, the ACOPFM improves the computational efficiency of the solution process and updates the NNs efficiently under the guidance of the model. To the best of our knowledge, this is the first time that the objective penalty function method has been extended to the RL framework. Moreover, we conduct extensive simulation experiments on the joint platform consisting of CARLA and SUMO. During offline training, the proposed ACOPFM converges faster and more steadily to the optimal control strategy than the GEP baseline. During online testing, the trained autonomous vehicle successfully completes the trajectory tracking task in the overtaking challenge under a multi-lane test scenario with frequent vehicle interactions.

The remainder of this paper is organized as follows. Section “Related work” briefly introduces the penalty function methods in the optimization field and the development of the integration of MPC and RL. Section “Planning and control scheme” describes the planning and control scheme we adopt based on the IDC framework. Section “MPC-based nonlinear programming problem” presents the modeling of the trajectory tracking problem as an MPC-based nonlinear programming problem. Section “ACOPFM framework” presents the ACOPFM solution for the modeled trajectory tracking problem under the RL framework, together with its theoretical convergence analysis. Section “Simulation experiments” reports the numerical results of the offline training and online testing of the NN trained using the ACOPFM for the multi-lane simulation scene. Finally, the conclusions of the present work and some suggestions for future work are given in Sect. “Conclusion and future work”.

Related work

In this section, we first introduce penalty function methods, including the development of objective penalty function methods in recent decades, and then the integration of MPC and RL, including its applications in the field of autonomous driving.

Penalty function methods

Consider the following constrained optimization problem:

$$\begin{aligned} \qquad \qquad \quad&\min \quad f(x),\\ \qquad \qquad \quad&\;\hspace{0.152em}\mathrm {s.t.}\quad g_{c}(x)\le 0,\quad c\in \varvec{C}, \end{aligned}$$
(P)

where \(f:{\mathbb {R}}^n\rightarrow {\mathbb {R}}\) is the objective function, \(g_c: {\mathbb {R}}^n\rightarrow {\mathbb {R}}, c\in {\varvec{C}}\) are constraints. The set

$$\begin{aligned} X=\left\{ x\in {\mathbb {R}}^n \, \big |\, g_{c}(x)\le 0,c\in {\varvec{C}}\right\} \end{aligned}$$

is called the feasible set of (P).

The problem finds applications in various fields such as artificial intelligence and transportation [43, 44].

Penalty function methods are popular choices for solving problem (P). Based on the principle of converting a constrained optimization problem into a series of unconstrained optimization problems, penalty function methods are relatively simple and effective in finding the optimal solution [45, 46]. A commonly used penalty function for (P) is the \(l_2\) penalty function, which is defined as:

$$\begin{aligned} F(x,\rho )=f(x)+\rho \sum _{c\in {\varvec{C}}}\max \{g_{c}(x),0\}^2, \end{aligned}$$

where \(\rho >0\) is the penalty parameter. The corresponding penalty problem for (P) is given as follows:

$$\begin{aligned} \quad \qquad \qquad \min F(x,\rho ) \qquad \mathrm {s.t.}\quad x\in {\varvec{R}}^n. \end{aligned}$$
(P′)

It can be shown that under certain conditions, when the penalty parameter increases to infinity, the optimal solution of the problem (P\(^\prime \)) is also the optimal solution of the original problem (P) [46]. Therefore, as the penalty parameter keeps increasing, only a series of penalty problems need to be solved to obtain an approximate optimal solution of the original problem (P). However, the Hessian matrix of the penalty function becomes increasingly ill-conditioned as the penalty parameter increases [34, 35, 47], which leads to numerical difficulties. Another commonly used penalty function is the \(l_1\) penalty function, which is defined as:

$$\begin{aligned} F_1(x,\rho )=f(x)+\rho \sum _{c\in {\varvec{C}}}\max \{g_{c}(x),0\}. \end{aligned}$$

The corresponding penalty problem for (P) is given as follows:

$$\begin{aligned} \quad \qquad \qquad \min F_1(x,\rho ) \qquad \mathrm {s.t.}\quad x\in {\varvec{R}}^n. \end{aligned}$$
(P′′)

Note that the \(l_1\) penalty function is an exact penalty function (EPF) [48,49,50,51], which means that under some conditions there exists a \(\rho ^*\) such that an optimal solution to (P) is also an optimal solution to (P\(^{\prime \prime }\)) for all \(\rho \ge \rho ^*\). However, since \(\rho ^*\) is not known in advance, the penalty parameter still has to be increased gradually to find the optimal solution to (P), and the ill-conditioning issue still occurs. Besides, the \(l_1\) penalty function is nondifferentiable at any point x where some \(g_c(x)=0\); since the most powerful optimization methods require a differentiable cost function, this restricts the choice of methods for solving the exact penalty problem.
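For concreteness, the following Python sketch evaluates the \(l_2\) and \(l_1\) penalty functions for a problem of the form (P) and shows the classical loop that re-solves the unconstrained penalty problem while increasing \(\rho \). The toy objective, the constraint, and all function names are our own illustrative assumptions, not part of the cited methods.

```python
import numpy as np
from scipy.optimize import minimize

# Toy instance of (P): minimize f(x) subject to g(x) <= 0 (illustrative only).
f = lambda x: (x[0] - 2.0) ** 2 + (x[1] - 1.0) ** 2
constraints = [lambda x: x[0] + x[1] - 2.0]          # g(x) <= 0

def l2_penalty(x, rho):
    # F(x, rho) = f(x) + rho * sum_c max(g_c(x), 0)^2
    return f(x) + rho * sum(max(g(x), 0.0) ** 2 for g in constraints)

def l1_penalty(x, rho):
    # F_1(x, rho) = f(x) + rho * sum_c max(g_c(x), 0)  (exact but nondifferentiable)
    return f(x) + rho * sum(max(g(x), 0.0) for g in constraints)

# Classical scheme: solve a sequence of unconstrained problems with increasing rho.
x, rho = np.zeros(2), 1.0
for _ in range(8):
    x = minimize(l2_penalty, x, args=(rho,), method="BFGS").x
    rho *= 10.0   # larger rho tightens feasibility but worsens conditioning
print(x)          # approaches the constrained minimizer as rho grows
```

The loop illustrates the trade-off discussed above: each increase of \(\rho \) pushes the iterate closer to feasibility but worsens the conditioning of the unconstrained subproblem.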

The idea of the objective penalty function method was first proposed in [52]. By adjusting the objective penalty parameter adaptively, the ill-conditioning issue can be effectively overcome.

The following objective penalty function was presented in [53]:

$$\begin{aligned} F(x,M)=(f(x)-M)^2+\sum _{c\in {\varvec{C}}}\max \{g_{c}(x),0\}^p, \end{aligned}$$

where \(p>1\), and an objective penalty function method (OPFM) was developed to solve (P); the convergence of the corresponding algorithm was established under the assumption that X is connected and compact. In [54] a general objective penalty function was proposed as follows:

$$\begin{aligned} F(x,M)=Q(f(x)-M)+\sum _{c\in {\varvec{C}}}P(g_c(x)), \end{aligned}$$

where Q(t) is strictly decreasing on \((-\infty , 0]\) and strictly increasing on \([0, +\infty )\), \(Q(t)>0\) for \(t\ne 0,\) \(Q(0)=0\), and \(P(t)>0\) for \(t>0\), \(P(t)=0\) for \(t\le 0\). For example, \(Q(t)=t^2\) and \(P(t)=\mathop {\max }\{t,0\}^2.\) In [55] modified choices of Q(t) and P(t) were used to construct the objective penalty function, where \(Q(t)=0\) for \(t<0\) and P(t) is further required to be strictly increasing on \([0, +\infty ).\) The convergence results for the objective penalty function method are established in [55] without the compactness and connectivity assumptions on the feasible set.
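The construction can be summarized in a short Python sketch, assuming the example choices \(Q(t)=t^2\) and \(P(t)=\max \{t,0\}^2\); the helper names are ours and are purely illustrative.

```python
def Q(t):
    # Example choice Q(t) = t^2; the one-sided variant of [55] sets Q(t) = 0 for t < 0.
    return t ** 2

def P(t):
    # P(t) = max{t, 0}^2: zero for satisfied constraints, positive for violations.
    return max(t, 0.0) ** 2

def objective_penalty(x, M, f, constraints):
    # F(x, M) = Q(f(x) - M) + sum_c P(g_c(x)); the objective penalty parameter M acts
    # as a target objective value and is adjusted adaptively between solves, instead
    # of a multiplicative penalty parameter being driven to infinity.
    return Q(f(x) - M) + sum(P(g(x)) for g in constraints)
```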

Integration of MPC and RL

MPC is a well-established control strategy that uses a mathematical model to predict the system behavior over a finite time horizon and optimize a performance criterion subject to constraints [8], but the performance of MPC is highly dependent on the accuracy of the model used for predictions. RL, on the other hand, is a machine learning technique that enables an agent to learn from its interactions with an environment and make decisions that maximize a reward signal [21], but many critical aspects still need to be tackled, including safety and stability issues [27].

Fig. 1 Illustration of our integrated decision and control scheme

The integration of MPC and RL can leverage the strengths of both approaches. MPC provides a framework for handling constraints and modeling complex systems, while RL allows for adaptive control in uncertain environments where traditional model-based methods may fail. Aswani et al. [57] first presented a learning-based model predictive control (LBMPC) scheme and used a linear model with bounds on its uncertainty to construct invariant sets that provide deterministic guarantees on robustness and safety. In [58], Koller et al. presented a learning-based model predictive control scheme that provides provable high-probability safety guarantees. Gros and Zanon [28] showed that a nonlinear model predictive control (NMPC) scheme can be tuned to deliver the optimal policy of the real system even when an inaccurate model is used; a practical outcome of their theory is that all RL techniques can be used directly to tune the NMPC scheme and improve its performance on the real system, and this was the first work proposing to use NMPC as a function approximator in RL. In [27], Gros and Zanon proposed an MPC-based RL formulation that addresses the issue of safety, which was not rigorously enforced in [28, 59]. A key idea is that MPC is used as a function approximator within RL to provide safety and stability guarantees, while RL is used to tune the MPC parameters, thus improving closed-loop performance in a data-driven fashion [27, 28, 59].

Recently, a method to implement the stochastic policy gradient method using actor-critic techniques was proposed in [29], where the policy is approximated by an MPC scheme and a computationally inexpensive approach is used to build a stochastic policy generating samples that are expected to be feasible for the MPC constraints. In [60], a novel algorithm called reinforced predictive control (RL-MPC) was proposed to merge the relative merits of MPC and RL, and the complementarity between RL and MPC was emphasized.

In summary, the integration of MPC and RL can provide a more efficient, robust, and adaptive control system for autonomous driving.

Planning and control scheme

This section describes the planning and control scheme we employ based on the IDC framework [30], which consists of two main modules: path planning and trajectory tracking control, as seen in Fig. 1.

Firstly, the global road points are generated in consideration of the static road information [30], including road characteristics (shapes, intersections, obstacles, etc.) and traffic rules (traffic light signals, vehicle speed limits, etc.). The candidate reference paths from the starting point to the end point of the ego vehicle are then filtered using this static information and serve as the underlying reference paths for the dynamic optimal tracking task (in the CARLA simulator, the built-in \(A^*\) algorithm can quickly generate the reference paths). Next, the dynamic traffic information, including the surrounding vehicles’ movements, is used to formulate the constrained trajectory tracking problem with respect to each candidate path; the formulated problems are solved separately, and the path with the best tracking performance is followed.

The formulated MPC-based trajectory tracking problem consists of three parts: first, the objective function evaluating the trajectory tracking performance; second, the state transfer equation constraints based on the bicycle vehicle model [14] and vehicle kinematics for the ego vehicle and the other vehicles, respectively; third, the constraints resulting from dynamic collision avoidance with respect to the surrounding vehicles and the distance restrictions with respect to the lane boundaries.

For the resulting constrained optimization problem, the ACOPFM is developed as the solution scheme. The objective penalty function method from the optimization field is employed to handle the constraints, yielding an unconstrained optimization problem and, further, the loss function required for training the DNN model; the objective penalty parameters are adjusted adaptively during the model training process to avoid the ill-conditioning issue.

As seen in Fig. 2, the highway scenario is used to demonstrate the trajectory tracking task. The \(A^*\) algorithm is used to generate the reference paths. The optimal trajectory for the ego vehicle is obtained by the ACOPFM. Note that the goal of the optimization problem is to minimize the trajectory tracking error between the trajectory of the ego vehicle and the reference trajectory as much as possible under the premise of safety and reasonableness.

MPC-based nonlinear programming problem

The trajectory tracking model for autonomous driving was given in [30]. Equation (1) shows the variables related to the state and control.

$$\begin{aligned} \begin{aligned} \qquad \qquad {\varvec{x}}_{i|t}&=[l_{\mathrm{{x}}}, \ l_{\mathrm{{y}}}, \ v_{\mathrm{{lon}}}, \ v_{\mathrm{{lat}}}, \ \varphi , \ \omega ]_{i|t}^{\top }\\ \qquad \qquad {\varvec{x}}_{i|t}^{j}&=[l_{\mathrm{{x}}}^j, \ l_{\mathrm{{y}}}^j, \ v_{\mathrm{{lon}}}^j, \ 0, \ \varphi ^j, \ 0]_{i|t}^{\top }\\ \qquad \qquad {\varvec{x}}_{i|t}^{\mathrm{{lane}}}&=[l_{\mathrm{{x}}}^{\mathrm{{lane}}}, \ l_{\mathrm{{y}}}^{\mathrm{{lane}}}, \ 0, \ 0, \ 0, \ 0]_{i|t}^{\top }\\ \qquad \qquad {\varvec{x}}_{i|t}^{\mathrm{{ref}}}&=[l_{\mathrm{{x}}}^{\mathrm{{ref}}}, \ l_{\mathrm{{y}}}^{\mathrm{{ref}}}, \ v_{\mathrm{{lon}}}^{\mathrm{{ref}}}, \ 0, \ \varphi ^{\mathrm{{ref}}}, \ 0]_{i|t}^{\top }\\ \qquad \qquad {\varvec{u}}_{i|t}&=[\delta ,\ a]_{i|t}^{\top } \end{aligned} \end{aligned}$$
(1)

The prediction horizon is set to T steps. \(x_{i|t}\) is the state of the ego vehicle at the ith time step within T from the current time step t. \(x_{i|t}^{j}\) is the state of the jth vehicle in \({\varvec{N}}\), the collection of vehicles interacting with the ego vehicle, where \(l_{\mathrm{{x}}}\) and \(l_{\mathrm{{y}}}\) are the position coordinates (for the ego and other vehicles, they stand for the position of the center of gravity (CG)), \(v_{\mathrm{{lon}}}\) and \(v_{\mathrm{{lat}}}\) are the longitudinal and lateral velocities, \(\varphi \) is the heading angle, and \(\omega \) is the yaw rate. \(x_{i|t}^{\mathrm{{ref}}}\) and \(x_{i|t}^{\mathrm{{lane}}}\) are the points closest to \(x_{i|t}\) on the reference path and on the road edge, respectively. \(u_{i|t}\) is the control of the ego vehicle at the ith time step within T from the current time step t. The vehicle control is expressed in terms of the front-wheel angle \(\delta \) and the acceleration a.

Fig. 2 Schematic diagram of the trajectory tracking task in a highway scenario. The blue line segment illustrates the trajectory tracking error. The pink dashed segment illustrates the predicted trajectory of other vehicles, according to the state transfer equation. The red thick line segment is the optimized trajectory obtained for the ego vehicle after dynamically solving the MPC-based trajectory tracking problem using ACOPFM. The green solid line is the reference path, generated by the \(A^*\) algorithm in the CARLA simulator

The objective function is given in (2):

$$\begin{aligned} \begin{aligned} \qquad \min \limits _{u_{i|t},i=0:T-1}{\varvec{J}}({\varvec{x}}_{i|t},{\varvec{u}}_{i|t})=\sum _{i=0}^{T-1}{\varvec{l}}({\varvec{x}}_{i|t})+\varvec{\eta }({\varvec{u}}_{i|t}) \end{aligned} \end{aligned}$$
(2)

where

$$\begin{aligned} \qquad \qquad \qquad {\varvec{l}}({\varvec{x}}_{i|t})&=||{\varvec{x}}_{i|t}-{\varvec{x}}_{i|t}^{\mathrm{{ref}}}||_{{\varvec{Q}}}^2\\ \qquad \qquad \qquad&=({\varvec{x}}_{i|t}-{\varvec{x}}_{i|t}^{\mathrm{{ref}}})^{\top }{\varvec{Q}}({\varvec{x}}_{i|t}-{\varvec{x}}_{i|t}^{\mathrm{{ref}}}) ,\\ \qquad \qquad \qquad \varvec{\eta }({\varvec{u}}_{i|t})&={\varvec{u}}_{i|t}^{\top }{\varvec{R}}{\varvec{u}}_{i|t}, \end{aligned}$$

\({\varvec{Q}},{\varvec{R}}\) are positive definite weighting matrices. Note that the trajectory tracking problem is formulated as a rolling optimization process: at each step the new \({\varvec{x}}_{0|t}\) is provided, and the tracking error is minimized to get the optimal control sequence \(\{{\varvec{u}}_{0|t}, {\varvec{u}}_{1|t},\ldots ,{\varvec{u}}_{T-1|t}\}\). Due to the uncertainty of the dynamic road information, only \({\varvec{u}}_{0|t}\) is actually applied to the vehicle each time. The state of the vehicle at the next time step is obtained by the state transfer equation, and this optimization process is repeated at each time step [8, 61]. Based on the vehicle kinematics and the bicycle vehicle model [14], the state transfer equation of the ego vehicle is

$$\begin{aligned} {\varvec{x}}_{i+1|t}={\varvec{F}}_{ego}({\varvec{x}}_{i|t},{\varvec{u}}_{i|t}), \end{aligned}$$

which is specifically given in Eq. (3):

$$\begin{aligned} \qquad \begin{bmatrix} v_{\mathrm{{lon}}|i+1} \\ {\varphi }_{i+1}\\ l_{\mathrm{{x}}|i+1}\\ l_{\mathrm{{y}}|i+1}\\ v_{\mathrm{{lat}}|i+1}\\ \omega _{i+1} \end{bmatrix} = \begin{bmatrix} v_{\mathrm{{lon}}|i}+T_s\cdot a_{i}\\ {\varphi }_{i}+T_s\cdot v_{\mathrm{{lon}}|i+1}\cdot \dfrac{\tan \delta _i}{L}\\ l_{\mathrm{{x}}|i}+ T_s \cdot v_{\mathrm{{lon}}|i+1}\cdot \cos \varphi _{i+1}\\ l_{\mathrm{{y}}|i}+ T_s \cdot v_{\mathrm{{lon}}|i+1}\cdot \sin \varphi _{i+1}\\ v_{\mathrm{{lat}}|i}\\ \omega _{i} \end{bmatrix} \end{aligned}$$
(3)

The state transfer equation of other vehicles is

$$\begin{aligned} {\varvec{x}}_{i+1|t}^j={\varvec{F}}_{other}({\varvec{x}}_{i|t}^j), \end{aligned}$$

which is specifically given in nonlinear Eq.(4):

$$\begin{aligned} \qquad \begin{bmatrix} v_{\mathrm{{lon}}| i+1}^j \\ {\varphi }_{i+1}^j\\ l_{\mathrm{{x}}|i+1}^j\\ l_{\mathrm{{y}}|i+1}^j\\ v_{\mathrm{{lat}}|i+1}^j\\ \omega _{i+1}^j \end{bmatrix} = \begin{bmatrix} v_{\mathrm{{lon}}|i}^j\\ {\varphi }_{i}^j\\ l_{\mathrm{{x}}|i}^j+ T_s \cdot v_{\mathrm{{lon}}|i+1}^j\cdot \cos \varphi _{i+1}^j\\ l_{\mathrm{{y}}|i}^j+ T_s \cdot v_{\mathrm{{lon}}|i+1}^j\cdot \sin \varphi _{i+1}^j\\ v_{\mathrm{{lat}}|i}^j\\ \omega _{i}^j \end{bmatrix} \end{aligned}$$
(4)
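A minimal Python sketch of the two state transfer maps is given below; it assumes the state ordering \([l_{\mathrm{x}}, l_{\mathrm{y}}, v_{\mathrm{lon}}, v_{\mathrm{lat}}, \varphi , \omega ]\) of Eq. (1) and the values \(T_s=0.05\) s and \(L=3.0\) m used later in the experiments. The function names are ours.

```python
import numpy as np

def F_ego(x, u, Ts=0.05, L=3.0):
    # Kinematic bicycle update of Eq. (3); state x = [l_x, l_y, v_lon, v_lat, phi, omega],
    # control u = [delta, a], Ts is the sampling interval, L the wheelbase.
    lx, ly, v_lon, v_lat, phi, omega = x
    delta, a = u
    v_lon_next = v_lon + Ts * a
    phi_next = phi + Ts * v_lon_next * np.tan(delta) / L
    lx_next = lx + Ts * v_lon_next * np.cos(phi_next)
    ly_next = ly + Ts * v_lon_next * np.sin(phi_next)
    return np.array([lx_next, ly_next, v_lon_next, v_lat, phi_next, omega])

def F_other(xj, Ts=0.05):
    # Constant-speed, constant-heading prediction of Eq. (4) for a surrounding vehicle.
    lx, ly, v_lon, v_lat, phi, omega = xj
    lx_next = lx + Ts * v_lon * np.cos(phi)
    ly_next = ly + Ts * v_lon * np.sin(phi)
    return np.array([lx_next, ly_next, v_lon, v_lat, phi, omega])
```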

Next we move to the nonlinear constraints involved in the model:

$$\begin{aligned} \qquad \qquad \qquad ||{\varvec{x}}_{i|t}-{\varvec{x}}_{i|t}^{j}||_2&\ge {\varvec{D}}^{\mathrm{{safe}}}_{other},\\ \qquad \qquad \qquad ||{\varvec{x}}_{i|t}-{\varvec{x}}_{i|t}^{\mathrm{{lane}}}||_2&\ge {\varvec{D}}^{\mathrm{{safe}}}_{lane}, \end{aligned}$$

where \({\varvec{D}}^{\mathrm{{safe}}}_{other}\) is the safe distance between the ego vehicle and other vehicles, while \({\varvec{D}}^{\mathrm{{safe}}}_{lane}\) is the safe distance between the ego vehicle and the lane line. More specifically, we have the following nonlinear constraints, as shown in Eq. (5), which update dynamically at each time step with the change of the ego vehicle’s state, other vehicles’ states, lane boundary information, and number of other vehicles:

$$\begin{aligned} \left\{ \begin{array}{lr} \!\sqrt{(l_{\mathrm{{x}}}-l_{\mathrm{{x}}}^{j})^2+(l_{\mathrm{{y}}}-l_{\mathrm{{y}}}^{j})^2} \!\ge \!{\varvec{D}}^{\mathrm{{safe}}}_{other}, j\in {\varvec{N}}\\ \\ \!\sqrt{(l_{\mathrm{{x}}}-l_{\mathrm{{x}}}^{\mathrm{{lane,1}}})^2+(l_{\mathrm{{y}}}-l_{\mathrm{{y}}}^{\mathrm{{lane,1}}})^2} \!\ge \! {\varvec{D}}^{\mathrm{{safe}}}_{lane}\\ \\ \!\sqrt{(l_{\mathrm{{x}}}-l_{\mathrm{{x}}}^{\mathrm{{lane,2}}})^2+(l_{\mathrm{{y}}}-l_{\mathrm{{y}}}^{\mathrm{{lane,2}}})^2} \! \ge \! {\varvec{D}}^{\mathrm{{safe}}}_{lane} \end{array} \right. \end{aligned}$$
(5)

where \(l^{\mathrm{{lane,1}}}\) and \(l^{\mathrm{{lane,2}} }\) represent the left lane boundary and the right lane boundary.

A model predictive control-based nonlinear programming problem is given as follows to fully describe the trajectory tracking problem:

$$\begin{aligned} \begin{aligned} \qquad \min \limits _{u_{i|t},i=0:T-1}&{\varvec{J}}({\varvec{x}}_{i|t},{\varvec{u}}_{i|t})=\sum _{i=0}^{T-1}{\varvec{l}}({\varvec{x}}_{i|t})+\varvec{\eta }({\varvec{u}}_{i|t})\\ \qquad {\text { s.t. }}\quad&{\varvec{x}}_{i+1|t}={\varvec{F}}_{ego}({\varvec{x}}_{i|t},{\varvec{u}}_{i|t})\\ \qquad&{\varvec{x}}_{i+1|t}^j={\varvec{F}}_{other}({\varvec{x}}_{i|t}^j)\\ \qquad&||{\varvec{x}}_{i|t}-{\varvec{x}}_{i|t}^{j}||_2\ge {\varvec{D}}^{\mathrm{{safe}}}_{other}\\ \qquad&||{\varvec{x}}_{i|t}-{\varvec{x}}_{i|t}^{\mathrm{{lane}}}||_2\ge {\varvec{D}}^{\mathrm{{safe}}}_{lane}\\ \qquad&{\varvec{x}}_{0|t}={\varvec{x}}_{t},{\varvec{x}}_{0|t}^j={\varvec{x}}_{t}^j,{\varvec{u}}_{0|t}={\varvec{u}}_{t}\\ \qquad&i=0:T-1,j\in {\varvec{N}}. \end{aligned} \end{aligned}$$
(6)
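The rolling optimization around problem (6) can be sketched as follows, reusing F_ego and F_other from the previous sketch and assuming a generic solver solve_tracking_problem, a hypothetical placeholder for any method that solves (6) (for example, the ACOPFM policy introduced later); only the first control of each optimized sequence is applied.

```python
def receding_horizon_control(x0, others0, solve_tracking_problem, n_steps, T=30):
    # solve_tracking_problem(x, others, horizon) is assumed to return the control
    # sequence u_{0|t}, ..., u_{T-1|t} for the current problem (6).
    x, others = x0, others0
    applied_controls = []
    for t in range(n_steps):
        u_seq = solve_tracking_problem(x, others, horizon=T)
        u0 = u_seq[0]                                 # only the first control is executed
        applied_controls.append(u0)
        x = F_ego(x, u0)                              # ego state advances with Eq. (3)
        others = [F_other(xj) for xj in others]       # surrounding vehicles advance with Eq. (4)
    return applied_controls
```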

As the problem is highly nonlinear, the number of constraints involved is large, and the vehicle-mounted computing resources are limited, traditional numerical optimization methods cannot solve the problem satisfactorily. It has been shown that model-based RL algorithms can produce reasonable solutions for such problems [30, 62]; however, the constraints involved pose a great challenge to these algorithms. In the next section, we present a generalized objective penalty function method to handle the constraints under the model-based RL framework, which effectively avoids the ill-conditioning issue that occurs in the traditional penalty function methods.

ACOPFM framework

DRL-based parameterization

We consider a standard RL setup in which the agent interacts with the environment in discrete time steps. Here the agent is the autonomous vehicle, and the simulation environment is modeled as a Markov decision process (MDP) defined by the tuple \(({\mathcal {S}},{\mathcal {A}},{\mathcal {R}},{\mathcal {P}},{\mathcal {V}})\). Both the state space and the action space are continuous. Within each discrete time step, for each state \(s_t\in {\mathcal {S}}\) given by the simulation environment, the ego vehicle takes an appropriate action \(a_t\in {\mathcal {A}}\) according to the input state \(s_t\) and the policy \(\pi :{\mathcal {S}}\times {\mathcal {A}}\rightarrow {\mathbb {R}}\) to complete the trajectory tracking task. The function \(r(s_t,a_t): {\mathcal {S}}\times {\mathcal {A}}\rightarrow {\mathbb {R}}\) describes the cost, i.e., the objective function of the trajectory tracking task optimized at each moment, represented here as \(l(x_t)+\eta (u_t)\). The function \(p(s_{t+1}|s_{t},a_t): {\mathcal {S}}\times {\mathcal {A}}\rightarrow {\mathcal {P}}(s_{t+1})\) represents the probability density of the next state \(s_{t+1}\) to which the current state \(s_t\) is transferred by taking action \(a_t\). \(p(\cdot )\) is usually unknown due to the complexity and uncertainty of the autonomous vehicle dynamics and the motion of the surrounding traffic participants, but we can obtain the state at the next time step from the state transfer equation \(s_{t+1}={\varvec{F}}(s_{t}).\) The state value function \(V^{\pi }(s_t)\) is defined as the expected cost sum from \(s_t\) onwards and can be used to evaluate the value of an initial state \(s_t\) under policy \(\pi \). It is closely related to the policy \(\pi \) through \(V^{\pi }(s_t)=\left\{ \sum _{t=0}^{T-1}l(x_t)+\eta (u_t)\,\big |\,s_0=s_t\right\} .\)

We use the actor-critic RL scheme to solve the trajectory tracking problem. The purpose of the actor is to find the optimal policy \(\pi _{\theta }^*\); its loss reflects the tracking performance during the training of the NNs. The purpose of the critic is to approximate the true value function, which becomes more and more accurate as training proceeds and thereby improves the closed-loop performance; the loss of the critic reflects the safety performance during the training process.

Our RL problems are modeled as follows:

$$\begin{aligned}&\min \limits _{\theta }J_{actor}\! =\!{\mathbb {E}}_{s_{0|t}}\!\left\{ \sum _{i=0}^{T-1}{\varvec{l}}(s_{i|t},\pi _{\theta }(s_{i|t}))\!+\!\varvec{\eta }(u_{i|t},\pi _{\theta }(s_{i|t}))\!\right\} \nonumber \\&\mathrm {s.t.}\ s_{i+1|t}\!={\varvec{F}}(s_{i|t},\pi _{\theta }(s_{i|t}))\nonumber \\&g_c(s_{i|t})\ge 0,c\in {\varvec{C}}\nonumber \\&s_{0|t}=s_t \leftarrow \{\tau ^*,x_t,x_t^{j},j\in N\} \sim d\nonumber \\&i=0:T-1 \end{aligned}$$
(7)
$$\begin{aligned}&\min \limits _{\omega }J_{critic}\nonumber \\&\!=\!{\mathbb {E}}_{s_{0|t}}\!\left\{ \!\left( \!\sum _{i=0}^{T-1}{\varvec{l}}(s_{i|t},\!\pi _{\theta }(s_{i|t}))\!+\!\varvec{\eta }(u_{i|t},\!\pi _{\theta }(s_{i|t}))\!-\!{\varvec{V}}_{\omega }(s_{0|t})\!\right) ^2\!\right\} \nonumber \\&\mathrm {s.t.}\ s_{i+1|t}={\varvec{F}}(s_{i|t},\pi _{\theta }(s_{i|t}))\nonumber \\&s_{0|t}=s_t \leftarrow \{\tau ^*,x_t,x_t^{j},j\in N\} \sim d\nonumber \\&i=0:T-1, \end{aligned}$$
(8)

where the tracking error objective

$$\begin{aligned} \begin{aligned}&{\varvec{l}}(s_{i|t},\pi _{\theta }(s_{i|t}))+\varvec{\eta }(u_{i|t},\pi _{\theta }(s_{i|t}))\\&=||{\varvec{x}}_{i|t}-{\varvec{x}}_{i|t}^{\mathrm{{ref}}}||_{{\varvec{Q}}}^2+\varvec{\pi }_{\theta }(s_{i|t})^{\top }{\varvec{R}}\varvec{\pi }_{\theta }(s_{i|t}), \end{aligned} \end{aligned}$$

and \(g_c(s_{i|t})\ge 0, c\in {\varvec{C}},\) denote all the constraints; the number of constraints changes dynamically according to the surrounding vehicle information and the location information at each moment. Note that \({\varvec{F}}\in \{{\varvec{F}}_{ego},{\varvec{F}}_{other}\}\). In the value network, the optimal value function \({\varvec{V}}_{\omega ^*}\) is obtained by fitting the optimization parameter \(\omega \); we train the value network so that \({\varvec{V}}_{\omega ^*}\) approximates the obtained \(\varvec{J^*}.\) The control \(u_{i|t}=\pi _{\theta }(s_{i|t})\) is obtained via the policy network. Under some assumptions, \(u_{i|t}\) can be mapped as \(\pi _{\theta }(s_{i|t})\), and the optimal policy \(\pi _{\theta }^*(s_{i|t})\) is equivalent to the optimal action \(u_{i|t}^{*}.\)
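A minimal sketch of evaluating the actor and critic objectives (7) and (8) by rolling the policy out through the model for T steps is given below. PyTorch is assumed; policy_net, value_net, the combined transition F, the separately supplied reference x_ref, and the assumption that the first six entries of s are the ego state are all our own simplifications, and the expectation over \(s_{0|t}\) is approximated by a single sample. The constraint penalty of (9) is added in the next subsection.

```python
import torch

def rollout_losses(s0, x_ref, policy_net, value_net, F, Q, R, T=30):
    # Roll the policy out through the known model F for T steps and accumulate the
    # tracking cost of (7); the critic loss (8) regresses V_omega(s_{0|t}) onto it.
    s, J_actor = s0, torch.zeros(())
    for i in range(T):
        u = policy_net(s)                               # u_{i|t} = pi_theta(s_{i|t})
        err = s[:6] - x_ref[i]                          # ego tracking error x - x^ref
        J_actor = J_actor + err @ Q @ err + u @ R @ u   # l(x_{i|t}) + eta(u_{i|t})
        s = F(s, u)                                     # model-based state transition
    J_critic = (J_actor.detach() - value_net(s0)) ** 2  # (cost sum - V_omega(s_{0|t}))^2
    return J_actor, J_critic
```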

Actor-critic objective penalty function method

Taking \(P(t)=\max \{0,t\}^2\) and introducing a dynamic objective penalty parameter M, the constrained actor problem (7) is converted into a generalized objective penalty function problem as follows:

$$\begin{aligned}&\min \limits _{\theta }J_{p}=Q(J_{actor}-M)+J_{penalty}\nonumber \\&\quad ={\mathbb {E}}_{s_{0|t}}\!\left\{ \!Q\!\left( \sum _{i=0}^{T-1}{\varvec{l}}(s_{i|t},\pi _{\theta }(s_{i|t}))\!+\!\varvec{\eta }(u_{i|t},\pi _{\theta }(s_{i|t}))\!-\! M\!\right) \!\right\} \nonumber \\&\qquad +{\mathbb {E}}_{s_{0|t}}\!\left\{ \sum _{i=0}^{T-1}\phi _i(\theta )\right\} \nonumber \\&\qquad \mathrm {s.t.}\ s_{i+1|t}={\varvec{F}}(s_{i|t},\pi _{\theta }(s_{i|t}))\nonumber \\&\qquad \phi _i(\theta )=\sum _{c\in {\varvec{C}}}[\textrm{max}\{0,-g_c(s_{i|t})\}]^2\nonumber \\&\qquad s_{0|t}=s_t \leftarrow \{\tau ^*,x_t,x_t^{j},j\in {\varvec{N}}\} \sim d\nonumber \\&\qquad i=0:T-1. \end{aligned}$$
(9)

The algorithmic procedure of the actor-critic objective penalty method is given in Algorithm 1. Let the set

$$\begin{aligned} U=\{u=\pi _{\theta }(s_t)\, \big |\,g_c(s_t)\ge 0,c\in {\varvec{C}}\}. \end{aligned}$$

The optimal control obtained each time through the policy network is \(u_k\), and \(f(u_k)\) is the value of \(\sum _{i=0}^{T-1}{\varvec{l}}(s_{i|t},\pi _{\theta }(s_{i|t}))+\varvec{\eta }(u_{i|t},\pi _{\theta }(s_{i|t}))\) evaluated at \(u_k\). Since the punishment for constraint violations is formulated as a penalty term, \(u_k\) does not necessarily satisfy the constraints.

Algorithm 1 ACOPFM for Offline Training
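Since Algorithm 1 appears as an image in the original article, the following Python-style sketch reconstructs only its outer loop from the update rules used in the proof of Theorem 2 below; the helper functions are hypothetical placeholders and the details may differ from the authors' implementation.

```python
def acopfm_outer_loop(J_actor, J_penalty, optimize_policy, a, b, n_rounds=50, tol=1e-3):
    # a and b are assumed to bracket M_* = min_theta J_actor(theta): a <= M_* <= b.
    # optimize_policy(M, theta) stands for the inner actor-critic gradient-descent
    # updates minimizing J_p(theta, M) in (9); it returns the updated parameter theta.
    theta = None
    for k in range(n_rounds):
        M = 0.5 * (a + b)                            # M_k = (a_k + b_k) / 2
        theta = optimize_policy(M, theta)            # inner loop on the loss (9)
        f_val = J_actor(theta)                       # f(u_k): parameterized tracking cost
        J_p = max(f_val - M, 0.0) ** 2 + J_penalty(theta)
        if J_p == 0.0:                               # in practice, below a small tolerance
            b = f_val                                # feasible with f(u_k) <= M: shrink from above
        elif J_penalty(theta) > 0.0:
            a = max(f_val, M)                        # infeasible: M_k was too small, raise a
        else:
            return theta                             # feasible with J_p > 0: optimal (Theorem 1(iii))
        if b - a < tol:                              # the bracket is roughly halved each round
            break
    return theta
```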

Convergence analysis of the ACOPFM

We will show that the ACOPFM converges to the optimal policy under certain conditions. We say that a “round” is completed when an optimization process is completed.

Assumption 1

After the round k completes, we have an optimized policy parameter \(\theta _k\). Let

$$\begin{aligned} \begin{aligned} \qquad P(\theta _k,\beta )=\left\{ \theta _k\, \big |\,J_{actor}(\theta _k)\le \beta ,k=1,2,...\right\} , \end{aligned} \end{aligned}$$
(10)

which is called a generalized P-level set. We assume that \(P(\theta _k,\beta )\) is bounded for any given \(\beta >0\) if the sequence \(\{M_k\}\) is convergent.

Next we establish the convergence of the sequence \(\{M_k\}\).

Theorem 1

Let \(M_{*}=\min \limits _{\theta }J_{actor}(\theta )\), then \(M_{*}=f(\pi _{\theta ^{*}}(s_t)).\) Suppose that for some M, \(u_M\) is the control obtained from \(J_{p}(\theta ,M),\) and let \(\theta _M\) be the optimal parameter corresponding to \(J_{p}(\theta ,M)\), i.e. \(u_M=\pi _{\theta _M}(s_t).\) Then the following three assertions hold.

(i) If \(J_{p}(\theta _M,M)=0,\) then \(u_M\in U\) and \(M_{*}\le f(u_M)\le M.\)

(ii) If \(J_{p}(\theta _M,M)>0\) and \(u_M\notin U,\) then \(M\le M_{*}\) and \(f(u_M)< M_{*}.\)

(iii) If \(J_{p}(\theta _M,M)>0\) and \(u_M\in U,\) then \(\theta _{M}\in \mathop {\arg \min }\limits _{\theta }J_{actor}(\theta ).\)

Proof

(i) The conclusion clearly holds according to the definitions of P and Q.

(ii) Assume that \(\theta ^{*}\in \mathop {\arg \min }\limits _{\theta }J_{actor}(\theta ).\) Obviously

$$\begin{aligned} J_{penalty}(\theta ^{*})=0, J_{actor}(\theta ^{*})=f(\pi _{\theta ^{*}}(s_t)). \end{aligned}$$

As \(J_{p}(\theta _M,M)>0,\) it follows

$$\begin{aligned} 0<J_p(\theta _M,M)\le J_{p}(\theta ^{*},M)=Q(f(\pi _{\theta ^{*}}(s_t))-M). \end{aligned}$$

By the definition of Q, \(M<f(\pi _{\theta ^{*}}(s_t))=M_{*}.\)

If \(f(u_M)\le M,\) then \(f(u_M)\le M\le M_{*}.\) If \(f(u_M)> M,\) then

$$\begin{aligned} \begin{aligned} \qquad 0&<Q(f(u_M)-M)\\ \qquad&\le J_p(\theta _{M},M)\le J_p(\theta ^*,M)=Q(f(\pi _{\theta ^{*}}(s_t))-M). \end{aligned} \end{aligned}$$

Therefore, \(f(u_M)< f(\pi _{\theta ^{*}}(s_t))=M_*.\) Note that at the initial step 0, the ego vehicle is generated at the starting point and the default parameter settings satisfy the constraints, so \(J_{penalty}(\theta _{0})=0.\)

(iii) According to the given conditions, we have

$$\begin{aligned} \begin{aligned} \qquad \qquad \quad 0&< Q(f(u_M)-M)\\ \quad \qquad \qquad&=J_p(\theta _{M},M)\le J_p(\theta ,M)\\ \quad \qquad \qquad&=Q(f(u)-M),\forall u\in U. \end{aligned} \end{aligned}$$

Since \(u_M\in U,\) this implies that

$$\begin{aligned} f(u_M)-M\le f(u)-M,\forall u\in U, \end{aligned}$$

i.e. \(\theta _M\in \mathop {\arg \min }\limits _{\theta }J_{actor}(\theta )\).\(\square \)

Theorem 2

Let \(\{M_{k}\}\) be the sequence of objective penalty parameters generated during the iteration, together with the sequences \(\{a_{k}\}\) and \(\{b_{k}\}.\) Then \(\{a_{k}\}\) is increasing, \(\{b_{k}\}\) is decreasing, and they satisfy

$$\begin{aligned} a_{k}\le M_{*} \le b_{k},\qquad k=1,2,\ldots , \end{aligned}$$
(11)
$$\begin{aligned} b_{k+1}-a_{k+1}\le \dfrac{b_{k}-a_{k}}{2},\qquad k=1,2,\ldots . \end{aligned}$$
(12)

Proof

Consider first the case \(k = 0\). Clearly, \(a_{0}\le M_{*}\le b_{0}.\) When \(k = 1\), if \(J_{p}(\theta _1,M_0)=0,\) then \(a_1=a_0,\ b_1=f(u_1)=M_0,\ M_1=\dfrac{a_1+b_1}{2},\) and thus

$$\begin{aligned} b_1-a_1=M_0-a_0=\dfrac{a_0+b_0}{2}-a_0=\dfrac{b_0-a_0}{2}; \end{aligned}$$

Otherwise \(J_{p}(\theta _1,M_0)\ne 0\); if \(J_{penalty}({\theta }_{1})=0,\) the conclusion clearly holds, while for the case \(J_{penalty}({\theta }_{1})>0\) we have

$$\begin{aligned} \begin{aligned} \qquad \quad a_1&=\max \{f(u_1),M_0\}\\ \qquad \quad&\ge M_0=\dfrac{a_0+b_0}{2}\ge \dfrac{a_0+a_0}{2}=a_0,\\ \qquad \quad b_1&=b_0, \end{aligned} \end{aligned}$$

thus

$$\begin{aligned} \begin{aligned} \qquad \quad b_1-a_1&=b_0-\max \{f(u_1),M_0\}\\ \qquad \quad&\le b_0-M_0=b_0-\dfrac{a_0+b_0}{2}=\dfrac{b_0-a_0}{2}, \end{aligned} \end{aligned}$$

this implies that \(b_1-a_1\le \dfrac{b_0-a_0}{2}\).

Consider now the case \(k\ge 1\). By induction, suppose that \(a_k\le M_{*}\le b_k\) holds for some \(k \ge 1\); we then consider the case \(k+1\).

If \(J_{p}(\theta _{k+1},M_k)=0,\) this implies that \(u_{k+1}\in U,\) \(f(u_{k+1})=M_k=b_{k+1},\) \(a_{k+1}=a_{k},\) \(M_{k+1}=\dfrac{a_{k+1}+b_{k+1}}{2},\) thus

$$\begin{aligned} \begin{aligned} \qquad \quad a_{k+1}&=a_k\le M_{*}\le b_{k+1}=M_k=\dfrac{a_{k}+b_{k}}{2}\le b_k,\\ \qquad \quad b_{k+1}-a_{k+1}&=M_k-a_k=\dfrac{b_{k}-a_{k}}{2}. \end{aligned} \end{aligned}$$

If \(J_p(\theta _{k+1},M_k)\ne 0,\) then for the case \(J_{penalty}(\theta _{k+1})=0\) the conclusion obviously holds. For the case \(J_{penalty}(\theta _{k+1})>0\), we have \(b_{k+1}=b_k\) and \(a_{k+1}=\max \{f(u_{k+1}),M_k\}\ge M_k,\) thus

$$\begin{aligned} \begin{aligned} \qquad \quad a_{k+1}\ge M_k=\dfrac{a_k+b_k}{2}\ge \dfrac{a_k+a_k}{2}=a_k. \end{aligned} \end{aligned}$$


From Theorem 1, we have

$$\begin{aligned} a_{k+1}\le M_{*}\le b_k=b_{k+1}. \end{aligned}$$

Therefore,

$$\begin{aligned} \begin{aligned} \qquad \quad b_{k+1}-a_{k+1}&=b_k-\max \{f(u_{k+1}),M_k\}\\ \qquad \quad&\le b_k-M_k=\dfrac{b_k-a_k}{2}. \end{aligned} \end{aligned}$$

Also, we can find that the sequence \(\{a_k\}\) is increasing and the sequence \(\{b_k\}\) is decreasing, and both of them are bounded. Therefore, the sequences \(\{a_k\}\) and \(\{b_k\}\) are both convergent. Let

$$\begin{aligned} {\lim _{k\rightarrow \infty }}a_{k}=a^{*}, \ {\lim _{k\rightarrow \infty }}b_{k}=b^{*}. \end{aligned}$$

By Eqs. (11) and (12), it is clear that \(a^{*}=b^{*}.\) Then we have that the sequence \(\{M_{k}\}\) converges to \(a^{*}=b^{*}.\) \(\square \)

Theorem 3

Under Assumption 1, the limit point of any convergent subsequence of the sequence of the optimization parameters \(\{\theta _{k}\},k=1,2...,\) is the optimal solution to the original problem, where each \(\theta _k\) is obtained in the kth iteration.

Proof

We first show that the sequence of optimization parameters \(\{\theta _{k}\},k=1,2...\) is bounded.

Since \(\theta _{k}\) is the optimal parameter after k rounds of iterations, and \(u_0\in U\) with \(J_{penalty}(\theta _0)=0\) when \(k=0\), we have

$$\begin{aligned} \begin{aligned} J_p(\theta _{k},M_{k-1})\le Q(J_{actor}(\theta _{0})-M_{k-1}),\quad k=0,1,2,... \end{aligned} \end{aligned}$$

where the right-hand side is bounded by the convergence of \(\{M_k\}\), i.e., \({\lim _{k\rightarrow \infty }}M_{k}=a^{*}.\)

By Assumption 1, there exists \({\overline{\beta }}>0\) such that for any \(\beta >{\overline{\beta }}\) we have

$$\begin{aligned} \begin{aligned} \quad \qquad J_p(\theta _{k},M_{k-1})&\le Q(J_{actor}(\theta _{0})-M_{k-1})\\ \quad \qquad&<\beta , \quad k=0,1,2,\ldots \end{aligned} \end{aligned}$$

and hence, since the P-level set \(P(\theta _k,\beta )\) is bounded, the sequence \(\{\theta _k\}\) is bounded.

Since \(M_{*}=\min \limits _{\theta }J_{actor}(\theta ),\) without loss of generality let \(\theta ^{*}\in \mathop {\arg \min }\limits _{\theta }J_{actor}(\theta )\). We have proved that \(a_k \le M_* \le b_k,\ k=0,1,2,\ldots ,\) and that the sequences \(\{a_k\},\{b_k\},\{M_k\}\) converge to \(a^{*}.\) Letting \(k\rightarrow +\infty ,\) we obtain \(a^{*}=M_*=J_{actor}(\theta ^*).\)

Note that \(J_p, \ J_{actor}, \ J_{penalty}\) are all continuous on the parameter space. As the sequence \(\{\theta _k\}\) is bounded, it has a convergent subsequence \(\{\theta _{k_j}\}\) with \(\theta _{k_j}\rightarrow {\overline{\theta }}\) as \(k_j\rightarrow +\infty \).

We will show that \(J_{actor}({\overline{\theta }})=a^{*}.\) Clearly \(M_{*}=J_{actor}(\theta ^*),\) and note that

$$\begin{aligned} \begin{aligned} \quad \qquad J_p(\theta _{k_j},M_{k_j-1})&\le J_p(\theta ^*,M_{k_j-1})\\ \quad \qquad&=Q(J_{actor}(\theta ^*)-M_{k_j-1}). \end{aligned} \end{aligned}$$

Letting \(k_j \rightarrow +\infty ,\) we have \(M_{k_j-1}\rightarrow a^*.\) Then

$$\begin{aligned} \begin{aligned} \qquad \qquad \lim _{k_j\rightarrow +\infty }J_p(\theta _{k_j},M_{k_j-1})&= J_p({\overline{\theta }},a^{*})\\ \qquad \qquad&\le Q(J_{actor}(\theta ^*)-a^{*})\\ \qquad \qquad&=0. \end{aligned} \end{aligned}$$

Therefore \(J_p({\overline{\theta }},a^{*}) \le 0.\) It is clear that \(J_p({\overline{\theta }},a^{*})= 0\) due to the nonnegativity, and \(J_{penalty}({\overline{\theta }})=0,\) \(J_{actor}({\overline{\theta }})=a^{*},\) i.e., the limit of any convergent subsequence of the parameter sequence \(\{\theta _{k}\},k=0,1,2...\) is the optimal solution. \(\square \)

Fig. 3 Schematic diagram of the policy network structure. The input layer has 34 neurons, i.e., the input space is 34-dimensional. The inputs include the ego vehicle information (6 dimensions), the information of the 6 other vehicles (6\(\times \)4 dimensions) and the reference information (4 dimensions). The number of neurons in both hidden layers is 256, and ELU is used as the activation function. To ensure that the control output by the network conforms to the boundary constraints of the control, a sigmoid function with parameters is used to map the network output to the corresponding intervals, and the CARLA simulator simulates the control of the autonomous vehicle by adjusting the steering wheel angle and throttle accordingly

Simulation experiments

Our simulation experiments are conducted on the CARLA-SUMO joint platform. In order to simulate the real-time trajectory tracking scenario under complex constraints, the simulation platform should offer realistic sensor simulation, vehicle dynamics simulation, and traffic flow simulation. We thus choose CARLA, which comes with a realistic sensor simulation system, a vehicle dynamics simulation system, and game engine rendering [63]. SUMO, which offers realistic complex random traffic simulation [64], is chosen to produce realistic traffic flow in CARLA; some important variable settings of the SUMO traffic flow simulation are shown in Table 1. For the experimental scenario, we choose the multi-lane setting of the Town06 map in CARLA version 0.9.13, and it is worth noting that the vehicle speed limit is set at 20 m/s.

Table 1 Important variable settings in SUMO simulation traffic flow

Experimental environment

We choose the vehicle.tesla.model3 simulation model in the CARLA simulator as the ego vehicle for the trajectory tracking task. The state information contains three parts: the ego vehicle’s state information, the other vehicles’ state information, and the reference state information:

$$\begin{aligned} s=[s^{\mathrm{{ego}}},s^{\mathrm{{other}}},s^{\mathrm{{ref}}}]. \end{aligned}$$

The specific settings of \(s^{\mathrm{{ego}}}\) are described in Sect. “MPC-based nonlinear programming problem”. \(s^{\mathrm{{other}}}=[l_{\mathrm{{x}}}^j, l_{\mathrm{{y}}}^j, v_{\mathrm{{lon}}}^j, \varphi ^j]_{i|t}^{\top },j\in {\varvec{N}},\) with \(|{\varvec{N}}|=6.\) \(s^{\mathrm{{ref}}}=[l_{\mathrm{{x}}}^{\mathrm{{ref}}}, l_{\mathrm{{y}}}^{\mathrm{{ref}}}, v_{\mathrm{{lon}}}^{\mathrm{{ref}}}, \varphi ^{\mathrm{{ref}}}]_{i|t}^{\top }\), in which

$$\begin{aligned} (l_{\mathrm{{x}}}^{\mathrm{{ref}}}, \ l_{\mathrm{{y}}}^{\mathrm{{ref}}}) \in \mathop {\arg \min }\limits _{i}\left\{ \sqrt{(l_{\mathrm{{x}}}-l_{\mathrm{{x}}}^{\mathrm{{ref}}})^2+(l_{\mathrm{{y}}}-l_{\mathrm{{y}}}^{\mathrm{{ref}}})^2}\right\} . \end{aligned}$$

The reference speed \(v_{\mathrm{{lon}}}^{\mathrm{{ref}}}\) is 13 m/s, which is generated based on the real-time traffic rules, and the reference heading angle \(\varphi ^{\mathrm{{ref}}}\) is generated adaptively by the CARLA simulator. In the policy network, the input space of the DNN contains all three pieces of state information and is thus a 34-dimensional tensor.

The control is set as

$$\begin{aligned} u=[\delta , \ a]^{\top }, \end{aligned}$$

where \(\delta \) is the front-wheel angle, which can be adjusted by changing the steering wheel angle in the CARLA simulator (the steering wheel angle range is \([-1,\ 1]\) rad), and a is the acceleration of the ego vehicle, which can be converted to throttle in the CARLA simulator by the PID controller, parameterized by the throttle range \([0, \ 1]\). The output space contains the front-wheel angle and the acceleration, i.e., a 2-dimensional tensor with an angle range of \([-0.4,\ 0.4]\) rad and an acceleration range of \([-3,\ 3]\) m/s\(^2\). Due to the nonlinear mapping relationship, we use the ELU function to activate the network outputs and a sigmoid function with parameters to map the network outputs to the appropriate intervals, ensuring that the output control lies within the control ranges. The basic structure of the policy network is shown in Fig. 3. The value network is also a DNN with two hidden layers, but it differs structurally from the policy network in that it outputs a single real-valued signal, which is the score of the state-control pair.
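A sketch of the policy network described above and in Fig. 3 is given below, assuming PyTorch; the layer names are ours, and the output scaling uses the stated control bounds (front-wheel angle \([-0.4,\ 0.4]\) rad, acceleration \([-3,\ 3]\) m/s\(^2\)).

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    # 34-dim state input -> two 256-unit ELU hidden layers -> 2-dim control output,
    # squashed by a sigmoid and rescaled to the control bounds described in the text.
    def __init__(self, state_dim=34, hidden=256):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ELU(),
            nn.Linear(hidden, hidden), nn.ELU(),
            nn.Linear(hidden, 2),
        )
        # Control bounds: front-wheel angle [-0.4, 0.4] rad, acceleration [-3, 3] m/s^2.
        self.register_buffer("low", torch.tensor([-0.4, -3.0]))
        self.register_buffer("high", torch.tensor([0.4, 3.0]))

    def forward(self, s):
        z = torch.sigmoid(self.backbone(s))           # map outputs into (0, 1)
        return self.low + (self.high - self.low) * z  # rescale to [low, high]

# Example: one forward pass on a dummy 34-dimensional state.
u = PolicyNet()(torch.zeros(34))   # tensor([delta, a]) within the control bounds
```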

The information of all road points of the map is obtained from the CARLA simulation platform in advance. During the simulation process, the ego vehicle is automatically destroyed when it reaches the end point, collides with another vehicle, or crosses the lane boundary, and is then generated again at the starting position. The coordinates of the start and end points of the reference path are given in Table 2.

Table 2 The starting and ending points’ coordinates

The prediction horizon is \(T=30\), and the simulation interval for each time step is \(T_s=0.05\) s. The positive definite weighting matrices are

$$\begin{aligned} \begin{aligned} \qquad \qquad&{\varvec{Q}}=diag\{ 0.04, 0.04, 0.01, 0.05,0.1, 0.06\},\\ \qquad \qquad&{\varvec{R}}=diag\{ 0.1, 0.05\}. \end{aligned} \end{aligned}$$

The safe distances are \({\varvec{D}}^{\mathrm{{safe}}}_{other}=3.0 \ m\) and \({\varvec{D}}^{\mathrm{{safe}}}_{lane}=1.5\ m\). The distance from the front axle to the rear axle is \(L=3.0\ m.\) The height of the top-view point in CARLA is \(50\ m.\) The important parameters of the NNs are shown in Table 3, and the computational complexity (e.g., FLOPs and model parameter sizes) is shown in Tables 4 and 5.

Table 3 Important hyperparameter settings
Table 4 Computational complexity of the policy network
Table 5 Computational complexity of the value network
Fig. 4 Comparative experimental results of the ACOPFM with the GEP and the EPF in a multi-lane overtaking simulation scenario. The solid lines represent the mean value over different episodes. The shaded regions represent the 95% confidence interval. a The variation of \(J_{actor}\). b The objective penalty parameter M. c The penalty parameter \(\rho \) of the GEP and the EPF. d The variation of \(J_{p}\). e The variation of \(J_{critic}\). f The variation of \(J_{penalty}\)

Table 6 Average violation degree \({\bar{J}}_{penalty}\) of the ACOPFM, GEP and EPF

Offline training

The ACOPFM, the GEP, and the \(l_1\) exact penalty function-based method (denoted by EPF) are trained for comparison. For the key iteration step of the GEP [30], ‘if i mod m: \(\rho \leftarrow c\rho \)’, the parameter settings are shown in Table 3. Five runs are conducted with different random seeds on a desktop computer with a 3.70 GHz 10-core Intel Xeon W-2255 CPU, with evaluations every 100 iterations. The GEP and the EPF are trained using four different sets of initial penalty parameters \(\rho _{0}\) and are compared with the ACOPFM. The numerical results are reported in Fig. 4.

As seen in Fig. 4a, from the trend of \(J_{actor}\) we can see that, compared with the GEP and the EPF, the ACOPFM converges faster and is more stable; in other words, the ACOPFM finds the optimal strategy more efficiently, meaning that its trajectory tracking performance is better than that of the GEP and the EPF. Figure 4b shows the variation of the objective penalty parameter M: in contrast to the unbounded growth of the penalty parameter \(\rho \) of the GEP and the EPF shown in Fig. 4c, M converges quickly to a finite value that is roughly equal to the parameterized \(J_{actor}\). Figure 4d shows the variation of \(J_{p}\); it can be seen that the ACOPFM converges to a small value quite steadily, whereas the GEP exhibits a diverging and oscillating trend and the EPF also oscillates. The numerical performance of the value networks of the three methods is shown in Fig. 4e: the ACOPFM converges very fast, whereas the GEP and the EPF diverge. Figure 4f demonstrates the variation of \(J_{penalty}\), which indicates the degree of constraint violation; the ACOPFM performs much better than the GEP and the EPF. Moreover, we use the average violation degree [65] \({\bar{J}}_{penalty}\) of \(J_{penalty}\) to measure the safety performance:

$$\begin{aligned} \begin{aligned} \qquad \qquad {\bar{J}}_{penalty}= {\mathbb {E}}_{0|k}\left\{ {\mathbb {E}}_{s_{0|t}}\left\{ \sum _{i=0}^{T-1}\phi _i(\theta )\right\} \right\} . \end{aligned} \end{aligned}$$
(13)

Overall, the average violation degree of the ACOPFM is lower than that of the GEP and the EPF, as shown in Table 6, which indicates that the safety performance of the ACOPFM is generally better than that of the GEP and the EPF.

Online testing

We deploy the model trained by the ACOPFM to the ego vehicle and test it in a Town06 multi-lane scenario. The red thick line segment is the optimized trajectory obtained by the ACOPFM, which is formed by connecting waypoints within the prediction horizon (\(T = 30\)) at each simulation moment. The green solid line is the reference path, generated by the \(A^*\) algorithm in the CARLA simulator after considering the road static information, and there are 5 reference paths corresponding to the multi-lane scenario.

1) Performance in the environment without surrounding vehicles

In such an environment, the ego vehicle trajectory should be well maintained in the reference lane. In the SUMO simulator we set the vehicle generation frequency vehsPerHour to 0 vehicles per hour. The dynamic breakdown of the whole simulation test is shown in Fig. 5, where the ego vehicle is initialized in lane 2 with the goal of reaching the end of the lane in a safe and traffic-rule-compliant manner.

During the whole test process, the changes of the indicators of the ego vehicle are shown in Fig. 6. It can be seen that the ego vehicle keeps an almost constant speed after accelerating for a period of time, while the front-wheel angle, the steering wheel angle, and the heading angle are basically kept near 0, which indicates that the trajectory tracking performance of the ego vehicle is good.

Fig. 5 Visualization of the trajectory tracking simulation process without interactions with other vehicles. a The ego vehicle starts in lane 2. b The ego vehicle is in motion. c The ego vehicle is approaching the end (marked by red dot) smoothly

Fig. 6 Results of the indicators in the environment without interactions with other vehicles. a The variation of the speed. b The variation of the acceleration. c The variation of the steering wheel angle. d The variation of the front-wheel angle. e The variation of the heading angle. f The variation of the throttle parameter

Fig. 7 Visualization of the ego vehicle’s maneuvers during the first overtaking. a The ego vehicle starts in lane 1. b The ego vehicle detects vehicle A in the right front of it through sensors in the CARLA simulator and prepares to steer itself to enter lane 2 on the right. c The ego vehicle interacts with vehicle A safely and stably. d The ego vehicle completes the first lane change to overtake and travels on lane 2

Fig. 8 Visualization of the ego vehicle’s maneuvers during the second overtaking. a The ego vehicle travels on lane 2 for some time. b The ego vehicle detects vehicle B in the right front of it via sensors in the CARLA simulator and prepares to steer itself to enter lane 1 on the left. c The ego vehicle interacts with vehicle B safely and stably. d The ego vehicle completes the second lane change to overtake and travels on lane 1 to reach the end (marked with red dot) successfully

Fig. 9 Results of indicators in the environment with surrounding vehicles. a Variation of the speed. b Variation of the acceleration. c Variation of the steering wheel angle. d Variation of the front-wheel angle. e Variation of the heading angle. f Variation of the throttle parameter

2) Performance in the environment with surrounding vehicles

In this environment, the challenging task is to safely and stably complete the overtaking control of the ego vehicle. We adopt a multistep safety shield [30] after the output of the policy to enhance safety during the online implementation of the trained model. In the SUMO simulator we set the vehicle generation frequency vehsPerHour to 5000 vehicles per hour. The dynamic breakdown of the whole process of the simulation test is shown in Figs. 7 and 8.

A collision could occur when a vehicle A is in front of the ego vehicle. The ego vehicle therefore turns into lane 2, as shown in Fig. 7b, and chooses lane 2 as the path to be tracked to complete the first overtaking, as shown in Fig. 7c and d.

After driving stably on lane 2 for a period of time, the sensor detects vehicle B directly ahead of the ego vehicle, as shown in Fig. 8b, and the ego vehicle performs a second lane change for overtaking. The ego vehicle chooses to change lanes, entering lane 1 on the left and overtaking vehicle B by accelerating, as shown in Fig. 8c. After the second overtaking, the ego vehicle ultimately arrives at the end of the lane safely and stably, as shown in Fig. 8c and d.

During the whole test process, the changes of the indicators of the ego vehicle are shown in Fig. 9. Around simulation steps 170 and 530, there are clear fluctuations in the indicators, since the first and second overtaking maneuvers entail changes of the acceleration and the front-wheel angle, which are reflected in Fig. 9a to f. Note that the first and second overtaking maneuvers involve the right-turn and left-turn control of the steering wheel, which are implemented by changing the steering wheel angle (Fig. 9c). The front-wheel angle (Fig. 9d) and the heading angle (Fig. 9e) change abruptly around simulation steps 170 and 530.

Conclusion and future work

In this paper, we have proposed a model-based RL algorithm, the ACOPFM, to avoid the ill-conditioning issue faced by the GEP during model training and to deal effectively with large-scale nonlinear state-wise constraints. A planning and control scheme based on the IDC framework is adopted, and an MPC-based model of the trajectory tracking problem is established. Subsequently, the model is converted into an unconstrained optimization problem using the objective penalty function method and parameterized based on the actor-critic framework. The loss function guiding the update of the policy network takes the form of the objective penalty function. The ACOPFM finds the optimal control strategy faster and more stably than the GEP by alternately performing gradient descent and adaptively adjusting the penalty parameter. The simulation results show that, under multi-lane scenarios, the ACOPFM performs well in trajectory tracking tasks and is able to complete the overtaking maneuvers successfully.

As the ACOPFM, like the GEP, converts the original problem into an unconstrained optimization problem by imposing penalties for constraint violations, there is no 100% guarantee of safety. Our next step is to improve the multistep safety shield [30] to ensure safety performance.

Note that the state transfer equation employed in the present work is relatively simple and may introduce errors in real applications. Some environmental factors, such as wind speed, ambient temperature, and ground slope, are not considered in the simulation process. Moreover, some scenes (e.g., some control behaviors of the ego vehicle and other vehicles) are not encountered during model training, which can cause long-tail problems and makes real-world applications of the method less secure. However, by refining the model and extending the training process, these problems can be overcome to a large extent.