Introduction

Autonomous driving has attracted increasing attention from practitioners and researchers [1]. The primary research areas for autonomous driving cover three major aspects: visual perception, planning and decision-making, and motion control [2]. Trajectory tracking is a key technology for precisely controlling autonomous vehicles. As a fundamental part of the motion control module, trajectory tracking aims to compute the vehicle control commands from the reference trajectory provided by the path planning module so that the reference trajectory is tracked accurately. The performance of trajectory tracking largely determines the overall performance of autonomous vehicles, affecting driving safety, passenger comfort, travel efficiency, energy consumption, and so on. The trajectory tracking problem in autonomous driving is very challenging, as the system involved is typically highly nonlinear and contains large-scale state-wise constraints [3]. How to model the trajectory tracking problem properly, set up the obstacle avoidance strategy, reduce the heavy computational burden of the solution process, and minimize the trajectory tracking error are crucial issues in trajectory tracking research [4, 5].

The current vehicle trajectory tracking control methods mainly include sliding mode control (SMC), proportional-integral-derivative (PID) control, and model predictive control (MPC) [6,7,8]. SMC is a nonlinear variable-structure control method in which the controller structure switches according to the system state, forcing the system to evolve along a prescribed sliding mode. SMC has the advantages of fast response, insensitivity to parameter changes and perturbations, and no need for online system identification [9], but it is prone to chattering, which is the most noticeable issue affecting its practical application [10]. The chattering can be lessened by approximating the discontinuous terms with continuous functions such as the saturation function, the arctan function, and the hyperbolic tangent function [11,12,13]. Although some improvements have been obtained with these modified SMC methods, problems remain in the presence of external disturbances [14,15,16]. PID control is used broadly in industry because it is simple and does not require a mathematical model [17], but it handles nonlinear systems poorly and its parameter tuning is complicated, which makes it hard to guarantee control accuracy and robustness [18]. MPC employs the nonlinear dynamics model of the controlled system and predicts its output behavior over a future horizon by solving a constrained optimal control problem [19]; however, its tracking performance is sensitive to the accuracy of the prediction model, and nonlinear MPC places a high demand on computational resources [20].

Reinforcement learning (RL) solves sequential decision-making problems via a trial-and-error process of interacting with the environment [21]. Model-based RL builds environment models in which trial and error can take place without real-world costs, and it can improve RL algorithms significantly by making them more sample-efficient and thus reducing errors [22]. Deep neural networks (DNNs) have been commonly used as function approximators in RL; they mitigate the state-action space explosion as the number of recorded state-action pairs grows [23] and generalize prior experience to previously unseen state-action pairs, leading to the birth of deep reinforcement learning (DRL). Actor-critic methods for RL are hybrid methods that combine the advantages of policy-based and value-based algorithms, where the ‘actor’ is the policy structure in charge of choosing actions and the ‘critic’ is the estimated value function that evaluates the actor’s actions. Zhao et al. [24] proposed a model-based actor-critic framework for optimal tracking control of robotic systems, which overcomes the issue of long learning cycles for control strategies. The actor-critic solution framework is widely used in autonomous driving tasks, especially when the training of RL or DRL becomes hard because the state space is enormous [25]. Recently, given the uncertainties in driving conditions, a soft actor-critic scheme was used in [26] to solve the decision-making and planning problem with a continuous action space. Note that, in general, DRL lacks closed-loop stability analysis, guarantees of state-constraint satisfaction, and a meaningful weight initialization capability [27]. To deal with these issues, the prospect of using MPC-based RL has been proposed and justified in [28], where MPC is suggested as the function approximation for the optimal policy in RL. MPC-based strategies can effectively deal with the multivariable constraints in autonomous vehicle tracking under uncertain environments by building systems that satisfy state constraints and safety requirements [29].

Recently, Guan et al. [30] proposed an integrated decision and control (IDC) framework for autonomous vehicles, which has better interpretability and computational efficiency than the traditional decomposed scheme [31] and the end-to-end scheme [32], and is applicable to different driving scenarios and tasks. A model-based actor-critic RL algorithm, called the generalized exterior point method (GEP), is presented in [30] for solving the formulated MPC-based optimal control problem with large-scale state-wise constraints in a rolling optimization process so as to minimize the trajectory tracking error. The GEP extends the exterior point method from the optimization domain to the field of neural networks (NNs): it transforms the constrained optimal control problem into an unconstrained problem with penalties for safety violations, and an actor-critic framework is employed to solve it. The purpose of the actor is to obtain a good policy NN that makes the trajectory tracking performance as good as possible without collision. The penalized objective function is mapped to the parameterized loss function of the policy NN for model training; the input is the state vector of the autonomous vehicle and the output is the control vector, thus providing the corresponding control commands for the autonomous vehicle. Meanwhile, the critic, whose role is to judge how good a state is, can serve as the path selector. Unlike the exterior point method in the optimization domain, the GEP relaxes the conditions needed to establish convergence of the exterior point method, eliminates the need to find global minima in each gradient descent step, and updates the NN parameters effectively under the guidance of the model. An approximately feasible optimal control strategy is obtained by performing gradient descent and updating the penalty parameter alternately. For the exterior point method, it can be shown that under certain conditions, when the penalty parameter increases to infinity, the optimal solution of the penalty problem is the optimal solution of the original problem [33]. However, as the penalty parameter increases, the problem becomes ill-conditioned [34, 35], which leads to severe numerical instabilities [36], so the penalty parameter cannot be made arbitrarily large. A similar issue arises in DNN training [37,38,39], where each iterative update of the network parameters is essentially a step of gradient descent. On the one hand, we do not know exactly how large the penalty parameters need to be: theoretically they must be increased to infinity to ensure convergence, but due to practical computational limitations they can be neither too large nor too small [40]. On the other hand, as the penalty parameters keep increasing, the condition number of the Hessian matrix of the loss function grows, the correlation between the columns of the matrix becomes too high [41, 42], and the ill-conditioning in DNN training becomes more and more serious, resulting in very slow or even non-converging training [38, 39]. As a consequence, the control output by the DNN may be suboptimal or even constraint-violating, so the quality of the trained model can hardly be guaranteed.

In this paper, we propose a model-based RL algorithm, called the actor-critic objective penalty function method (ACOPFM), to solve the trajectory tracking problem. The ACOPFM can effectively deal with large-scale nonlinear state-wise constraints and avoids the ill-conditioning caused by the growing condition number of the Hessian matrix during the training of the GEP policy network. The ACOPFM transforms the nonlinear programming problem into a parametric, unconstrained optimization problem and employs the objective penalty function as the loss function for policy network updates, finding the optimal control strategy by alternately performing gradient descent and adaptively adjusting the penalty parameter. The convergence of the ACOPFM is proved. Meanwhile, the ACOPFM improves the computational efficiency of the solution process and updates the NNs efficiently under the guidance of the model. To the best of our knowledge, this is the first time that the objective penalty function method has been extended to the RL framework. Moreover, we conduct extensive simulation experiments on the joint platform consisting of CARLA and SUMO. During offline training, the proposed ACOPFM converges faster and more steadily to the optimal control strategy than the GEP baseline. During online testing, the trained autonomous vehicle successfully completes the trajectory tracking task in the overtaking challenge under a multi-lane test scenario with frequent vehicle interactions.

The remainder of this paper is organized as follows. Section “Related work” briefly introduces the penalty function methods in the optimization field and the development of the integration of MPC and RL. Section “Planning and control scheme” describes the planning and control scheme we adopt based on the IDC framework. Section “MPC-based nonlinear programming problem” presents the modeling of the trajectory tracking problem as an MPC-based nonlinear programming problem. Section “ACOPFM framework” presents the ACOPFM solution for the modeled trajectory tracking problem under the RL framework, together with its theoretical convergence analysis. Section “Simulation experiments” reports the numerical results of the offline training and online testing of the NN trained using the ACOPFM for the multi-lane simulation scene. Finally, the conclusions of the present work and some suggestions for future work are given in Sect. “Conclusion and future work”.

Related work

In this section, we first introduce penalty function methods, including the development of objective penalty function methods in recent decades, and then the integration of MPC and RL, including its applications in the field of autonomous driving.

Penalty function methods

Consider the following constrained optimization problem:

$$\begin{aligned} \qquad \qquad \quad&\min \quad f(x),\\ \qquad \qquad \quad&\;\hspace{0.152em}\mathrm {s.t.}\quad g_{c}(x)\le 0,\quad c\in \varvec{C}, \end{aligned}$$
(P)

where \(f:{\mathbb {R}}^n\rightarrow {\mathbb {R}}\) is the objective function, \(g_c: {\mathbb {R}}^n\rightarrow {\mathbb {R}}, c\in {\varvec{C}}\) are constraints. The set

$$\begin{aligned} X=\left\{ x\in {\mathbb {R}}^n \, \big |\, g_{c}(x)\le 0,c\in {\varvec{C}}\right\} \end{aligned}$$

is called the feasible set of (P).

The problem finds applications in various fields such as artificial intelligence and transportation [43, 44].

Penalty function methods are popular choices for solving problem (P). Based on the principle of converting a constrained optimization problem into a series of unconstrained optimization problems, penalty function methods are relatively simple and effective in finding the optimal solution [45, 46]. A commonly used penalty function for (P) is the \(l_2\) penalty function, which is defined as:

$$\begin{aligned} F(x,\rho )=f(x)+\rho \sum _{c\in {\varvec{C}}}\max \{g_{c}(x),0\}^2, \end{aligned}$$

where \(\rho >0\) is the penalty parameter. The corresponding penalty problem for (P) is given as follows:

$$\begin{aligned} \quad \qquad \qquad \min F(x,\rho ) \qquad \mathrm {s.t.}\quad x\in {\varvec{R}}^n. \end{aligned}$$
(P′)

It can be shown that under certain conditions, when the penalty parameter increases to infinity, the optimal solution of the problem (P\(^\prime \)) is also the optimal solution of the original problem (P) [46]. Therefore, as the penalty parameter keeps increasing, only a series of penalty problems need to be solved to obtain an approximate optimal solution of the original problem (P). However, the Hessian matrix of the penalty function becomes increasingly ill-conditioned as the penalty parameter increases [34, 35, 47], which leads to numerical difficulties. Another commonly used penalty function is the \(l_1\) penalty function, which is defined as:

$$\begin{aligned} F_1(x,\rho )=f(x)+\rho \sum _{c\in {\varvec{C}}}\max \{g_{c}(x),0\}. \end{aligned}$$

The corresponding penalty problem for (P) is given as follows:

$$\begin{aligned} \quad \qquad \qquad \min F_1(x,\rho ) \qquad \mathrm {s.t.}\quad x\in {\varvec{R}}^n. \end{aligned}$$
(P′′)

Note that the \(l_1\) penalty function is an exact penalty function (EPF) [48,49,50,51], which means that under some conditions there exists a \(\rho ^*\) such that an optimal solution to (P) is also an optimal solution to (P\(^{\prime \prime }\)) for all \(\rho \ge \rho ^*\). However, since \(\rho ^*\) is not known in advance, the penalty parameter still has to be increased gradually to find the optimal solution to (P), and the ill-conditioning issue still occurs. Besides, the \(l_1\) penalty function is nondifferentiable at any point x where some \(g_c(x)=0\); since the most powerful optimization methods require a differentiable cost function, this restricts the choice of methods for solving the exact penalty problem.
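For concreteness, the following Python sketch evaluates the \(l_2\) and \(l_1\) penalty functions for a problem of the form (P) and shows the classical loop that re-solves the unconstrained penalty problem while increasing \(\rho \). The toy objective, the constraint, and all function names are our own illustrative assumptions, not part of the cited methods.

```python
import numpy as np
from scipy.optimize import minimize

# Toy instance of (P): minimize f(x) subject to g(x) <= 0 (illustrative only).
f = lambda x: (x[0] - 2.0) ** 2 + (x[1] - 1.0) ** 2
constraints = [lambda x: x[0] + x[1] - 2.0]          # g(x) <= 0

def l2_penalty(x, rho):
    # F(x, rho) = f(x) + rho * sum_c max(g_c(x), 0)^2
    return f(x) + rho * sum(max(g(x), 0.0) ** 2 for g in constraints)

def l1_penalty(x, rho):
    # F_1(x, rho) = f(x) + rho * sum_c max(g_c(x), 0)  (exact but nondifferentiable)
    return f(x) + rho * sum(max(g(x), 0.0) for g in constraints)

# Classical scheme: solve a sequence of unconstrained problems with increasing rho.
x, rho = np.zeros(2), 1.0
for _ in range(8):
    x = minimize(l2_penalty, x, args=(rho,), method="BFGS").x
    rho *= 10.0   # larger rho tightens feasibility but worsens conditioning
print(x)          # approaches the constrained minimizer as rho grows
```

The loop illustrates the trade-off discussed above: each increase of \(\rho \) pushes the iterate closer to feasibility but worsens the conditioning of the unconstrained subproblem.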

The idea of the objective penalty function method was first proposed in [52]. By adjusting the objective penalty parameter adaptively, the ill-conditioning issue can be effectively overcome.

The following objective penalty function was presented in [53]:

$$\begin{aligned} F(x,M)=(f(x)-M)^2+\sum _{c\in {\varvec{C}}}\max \{g_{c}(x),0\}^p, \end{aligned}$$

where \(p>1\), and an objective penalty function method (OPFM) was developed to solve (P); the convergence of the corresponding algorithm was established under the assumption that X is connected and compact. In [54] a general objective penalty function was proposed as follows:

$$\begin{aligned} F(x,M)=Q(f(x)-M)+\sum _{c\in {\varvec{C}}}P(g_c(x)), \end{aligned}$$

where Q(t) is strictly decreasing on \((-\infty , 0]\) and strictly increasing on \([0, +\infty )\), \(Q(t)>0\) for \(t\ne 0,\) \(Q(0)=0\), and \(P(t)>0\) for \(t>0\), \(P(t)=0\) for \(t\le 0\). For example, \(Q(t)=t^2\) and \(P(t)=\mathop {\max }\{t,0\}^2.\) In [55] modified choices of Q(t) and P(t) were used to construct the objective penalty function, where \(Q(t)=0\) for \(t<0\) and P(t) is further required to be strictly increasing on \([0, +\infty ).\) The convergence results for the objective penalty function method are established in [55] without the compactness and connectivity assumptions on the feasible set.
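The construction can be summarized in a short Python sketch, assuming the example choices \(Q(t)=t^2\) and \(P(t)=\max \{t,0\}^2\); the helper names are ours and are purely illustrative.

```python
def Q(t):
    # Example choice Q(t) = t^2; the one-sided variant of [55] sets Q(t) = 0 for t < 0.
    return t ** 2

def P(t):
    # P(t) = max{t, 0}^2: zero for satisfied constraints, positive for violations.
    return max(t, 0.0) ** 2

def objective_penalty(x, M, f, constraints):
    # F(x, M) = Q(f(x) - M) + sum_c P(g_c(x)); the objective penalty parameter M acts
    # as a target objective value and is adjusted adaptively between solves, instead
    # of a multiplicative penalty parameter being driven to infinity.
    return Q(f(x) - M) + sum(P(g(x)) for g in constraints)
```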

Integration of MPC and RL

MPC is a well-established control strategy that uses a mathematical model to predict the system behavior over a finite time horizon and optimize a performance criterion subject to constraints [8], but the performance of MPC is highly dependent on the accuracy of the model used for predictions. RL, on the other hand, is a machine learning technique that enables an agent to learn from its interactions with an environment and make decisions that maximize a reward signal [21], but many critical aspects still need to be tackled, including safety and stability issues [27].

Fig. 1 Illustration of our integrated decision and control scheme

The integration of MPC and RL can leverage the strengths of both approaches. MPC provides a framework for handling constraints and modeling complex systems, while RL allows for adaptive control in uncertain environments where traditional model-based methods may fail. Aswani et al. [57] first presented a learning-based model predictive control (LBMPC) scheme and used a linear model with bounds on its uncertainty to construct invariant sets that provide deterministic guarantees on robustness and safety. In [58], Koller et al. presented a learning-based model predictive control scheme that provides provable high-probability safety guarantees. Gros and Zanon [28] showed that a nonlinear model predictive control (NMPC) scheme can be tuned to deliver the optimal policy of the real system even when an inaccurate model is used; a practical outcome of their theory is that all RL techniques can be used directly to tune the NMPC scheme and improve its performance on the real system, and this was the first work proposing to use NMPC as a function approximator in RL. In [27], Gros and Zanon proposed an MPC-based RL formulation that addresses the issue of safety, which was not rigorously enforced in [28, 59]. A key idea is that MPC is used as a function approximator within RL to provide safety and stability guarantees, while RL is used to tune the MPC parameters, thus improving closed-loop performance in a data-driven fashion [27, 28, 59].

Recently, a method to implement the stochastic policy gradient method using actor-critic techniques was proposed in [29], where the policy is approximated by an MPC scheme and a computationally inexpensive approach is used to build a stochastic policy generating samples that are expected to be feasible for the MPC constraints. In [60], a novel algorithm called reinforced predictive control (RL-MPC) was proposed to merge the relative merits of MPC and RL, and the complementarity between RL and MPC was emphasized.

In summary, the integration of MPC and RL can provide a more efficient, robust, and adaptive control system for autonomous driving.

Planning and control scheme

This section describes the planning and control scheme we employ based on the IDC framework [30], which consists of two main modules: path planning and trajectory tracking control, as seen in Fig. 1.

Firstly, the global road points are generated in consideration of the static road information [30], including road characteristics (shapes, intersections, obstacles, etc.) and traffic rules (traffic light signals, vehicle speed limits, etc.). The candidate reference paths from the starting point to the end point of the ego vehicle are then filtered using this static information and serve as the underlying reference paths for the dynamic optimal tracking task (in the CARLA simulator, the built-in \(A^*\) algorithm can quickly generate the reference paths). Next, the dynamic traffic information, including the surrounding vehicles’ movements, is used to formulate the constrained trajectory tracking problem with respect to each candidate path; the formulated problems are solved separately, and the path with the best tracking performance is followed.

The formulated MPC-based trajectory tracking problem consists of three parts: first, the objective function evaluating the trajectory tracking performance; second, the state transfer equation constraints based on the bicycle vehicle model [14] and vehicle kinematics for the ego vehicle and the other vehicles, respectively; third, the constraints resulting from dynamic collision avoidance with respect to the surrounding vehicles and the distance restrictions with respect to the lane boundaries.

For the resulting constrained optimization problem, the ACOPFM is developed as the solution scheme. The objective penalty function method from the optimization field is employed to handle the constraints, yielding an unconstrained optimization problem and, further, the loss function required for training the DNN model; the objective penalty parameters are adjusted adaptively during the model training process to avoid the ill-conditioning issue.

As seen in Fig. 2, the highway scenario is used to demonstrate the trajectory tracking task. The \(A^*\) algorithm is used to generate the reference paths. The optimal trajectory for the ego vehicle is obtained by the ACOPFM. Note that the goal of the optimization problem is to minimize the trajectory tracking error between the trajectory of the ego vehicle and the reference trajectory as much as possible under the premise of safety and reasonableness.

MPC-based nonlinear programming problem

The trajectory tracking model for autonomous driving was given in [30]. Equation (1) shows the variables related to the state and control.

$$\begin{aligned} \begin{aligned} \qquad \qquad {\varvec{x}}_{i|t}&=[l_{\mathrm{{x}}}, \ l_{\mathrm{{y}}}, \ v_{\mathrm{{lon}}}, \ v_{\mathrm{{lat}}}, \ \varphi , \ \omega ]_{i|t}^{\top }\\ \qquad \qquad {\varvec{x}}_{i|t}^{j}&=[l_{\mathrm{{x}}}^j, \ l_{\mathrm{{y}}}^j, \ v_{\mathrm{{lon}}}^j, \ 0, \ \varphi ^j, \ 0]_{i|t}^{\top }\\ \qquad \qquad {\varvec{x}}_{i|t}^{\mathrm{{lane}}}&=[l_{\mathrm{{x}}}^{\mathrm{{lane}}}, \ l_{\mathrm{{y}}}^{\mathrm{{lane}}}, \ 0, \ 0, \ 0, \ 0]_{i|t}^{\top }\\ \qquad \qquad {\varvec{x}}_{i|t}^{\mathrm{{ref}}}&=[l_{\mathrm{{x}}}^{\mathrm{{ref}}}, \ l_{\mathrm{{y}}}^{\mathrm{{ref}}}, \ v_{\mathrm{{lon}}}^{\mathrm{{ref}}}, \ 0, \ \varphi ^{\mathrm{{ref}}}, \ 0]_{i|t}^{\top }\\ \qquad \qquad {\varvec{u}}_{i|t}&=[\delta ,\ a]_{i|t}^{\top } \end{aligned} \end{aligned}$$
(1)

The prediction horizon is set to T steps. \(x_{i|t}\) is the state of the ego vehicle at the ith time step within T from the current time step t. \(x_{i|t}^{j}\) is the state of the jth vehicle in \({\varvec{N}}\), the collection of vehicles interacting with the ego vehicle, where \(l_{\mathrm{{x}}}\) and \(l_{\mathrm{{y}}}\) are the position coordinates (for the ego and other vehicles, they stand for the position of the center of gravity (CG)), \(v_{\mathrm{{lon}}}\) and \(v_{\mathrm{{lat}}}\) are the longitudinal and lateral velocities, \(\varphi \) is the heading angle, and \(\omega \) is the yaw rate. \(x_{i|t}^{\mathrm{{ref}}}\) and \(x_{i|t}^{\mathrm{{lane}}}\) are the points closest to \(x_{i|t}\) on the reference path and on the road edge, respectively. \(u_{i|t}\) is the control of the ego vehicle at the ith time step within T from the current time step t. The vehicle control is expressed in terms of the front-wheel angle \(\delta \) and the acceleration a.

Fig. 2 Schematic diagram of the trajectory tracking task in a highway scenario. The blue line segment illustrates the trajectory tracking error. The pink dashed segment illustrates the predicted trajectory of other vehicles, according to the state transfer equation. The red thick line segment is the optimized trajectory obtained for the ego vehicle after dynamically solving the MPC-based trajectory tracking problem using ACOPFM. The green solid line is the reference path, generated by the \(A^*\) algorithm in the CARLA simulator

The objective function is given in (2):

$$\begin{aligned} \begin{aligned} \qquad \min \limits _{u_{i|t},i=0:T-1}{\varvec{J}}({\varvec{x}}_{i|t},{\varvec{u}}_{i|t})=\sum _{i=0}^{T-1}{\varvec{l}}({\varvec{x}}_{i|t})+\varvec{\eta }({\varvec{u}}_{i|t}) \end{aligned} \end{aligned}$$
(2)

where

$$\begin{aligned} \qquad \qquad \qquad {\varvec{l}}({\varvec{x}}_{i|t})&=||{\varvec{x}}_{i|t}-{\varvec{x}}_{i|t}^{\mathrm{{ref}}}||_{{\varvec{Q}}}^2\\ \qquad \qquad \qquad&=({\varvec{x}}_{i|t}-{\varvec{x}}_{i|t}^{\mathrm{{ref}}})^{\top }{\varvec{Q}}({\varvec{x}}_{i|t}-{\varvec{x}}_{i|t}^{\mathrm{{ref}}}) ,\\ \qquad \qquad \qquad \varvec{\eta }({\varvec{u}}_{i|t})&={\varvec{u}}_{i|t}^{\top }{\varvec{R}}{\varvec{u}}_{i|t}, \end{aligned}$$

\({\varvec{Q}},{\varvec{R}}\) are positive definite weighting matrices. Note that the trajectory tracking problem is formulated as a rolling optimization process: at each step the new \({\varvec{x}}_{0|t}\) is provided, and the tracking error is minimized to get the optimal control sequence \(\{{\varvec{u}}_{0|t}, {\varvec{u}}_{1|t},\ldots ,{\varvec{u}}_{T-1|t}\}\). Due to the uncertainty of the dynamic road information, only \({\varvec{u}}_{0|t}\) is actually applied to the vehicle each time. The state of the vehicle at the next time step is obtained by the state transfer equation, and this optimization process is repeated at each time step [8, 61]. Based on the vehicle kinematics and the bicycle vehicle model [14], the state transfer equation of the ego vehicle is

$$\begin{aligned} {\varvec{x}}_{i+1|t}={\varvec{F}}_{ego}({\varvec{x}}_{i|t},{\varvec{u}}_{i|t}), \end{aligned}$$

which is specifically given in Eq. (3):

$$\begin{aligned} \qquad \begin{bmatrix} v_{\mathrm{{lon}}|i+1} \\ {\varphi }_{i+1}\\ l_{\mathrm{{x}}|i+1}\\ l_{\mathrm{{y}}|i+1}\\ v_{\mathrm{{lat}}|i+1}\\ \omega _{i+1} \end{bmatrix} = \begin{bmatrix} v_{\mathrm{{lon}}|i}+T_s\cdot a_{i}\\ {\varphi }_{i}+T_s\cdot v_{\mathrm{{lon}}|i+1}\cdot \dfrac{\tan \delta _i}{L}\\ l_{\mathrm{{x}}|i}+ T_s \cdot v_{\mathrm{{lon}}|i+1}\cdot \cos \varphi _{i+1}\\ l_{\mathrm{{y}}|i}+ T_s \cdot v_{\mathrm{{lon}}|i+1}\cdot \sin \varphi _{i+1}\\ v_{\mathrm{{lat}}|i}\\ \omega _{i} \end{bmatrix} \end{aligned}$$
(3)

The state transfer equation of other vehicles is

$$\begin{aligned} {\varvec{x}}_{i+1|t}^j={\varvec{F}}_{other}({\varvec{x}}_{i|t}^j), \end{aligned}$$

which is specifically given in nonlinear Eq.(4):

$$\begin{aligned} \qquad \begin{bmatrix} v_{\mathrm{{lon}}| i+1}^j \\ {\varphi }_{i+1}^j\\ l_{\mathrm{{x}}|i+1}^j\\ l_{\mathrm{{y}}|i+1}^j\\ v_{\mathrm{{lat}}|i+1}^j\\ \omega _{i+1}^j \end{bmatrix} = \begin{bmatrix} v_{\mathrm{{lon}}|i}^j\\ {\varphi }_{i}^j\\ l_{\mathrm{{x}}|i}^j+ T_s \cdot v_{\mathrm{{lon}}|i+1}^j\cdot \cos \varphi _{i+1}^j\\ l_{\mathrm{{y}}|i}^j+ T_s \cdot v_{\mathrm{{lon}}|i+1}^j\cdot \sin \varphi _{i+1}^j\\ v_{\mathrm{{lat}}|i}^j\\ \omega _{i}^j \end{bmatrix} \end{aligned}$$
(4)
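A minimal Python sketch of the two state transfer maps is given below; it assumes the state ordering \([l_{\mathrm{x}}, l_{\mathrm{y}}, v_{\mathrm{lon}}, v_{\mathrm{lat}}, \varphi , \omega ]\) of Eq. (1) and the values \(T_s=0.05\) s and \(L=3.0\) m used later in the experiments. The function names are ours.

```python
import numpy as np

def F_ego(x, u, Ts=0.05, L=3.0):
    # Kinematic bicycle update of Eq. (3); state x = [l_x, l_y, v_lon, v_lat, phi, omega],
    # control u = [delta, a], Ts is the sampling interval, L the wheelbase.
    lx, ly, v_lon, v_lat, phi, omega = x
    delta, a = u
    v_lon_next = v_lon + Ts * a
    phi_next = phi + Ts * v_lon_next * np.tan(delta) / L
    lx_next = lx + Ts * v_lon_next * np.cos(phi_next)
    ly_next = ly + Ts * v_lon_next * np.sin(phi_next)
    return np.array([lx_next, ly_next, v_lon_next, v_lat, phi_next, omega])

def F_other(xj, Ts=0.05):
    # Constant-speed, constant-heading prediction of Eq. (4) for a surrounding vehicle.
    lx, ly, v_lon, v_lat, phi, omega = xj
    lx_next = lx + Ts * v_lon * np.cos(phi)
    ly_next = ly + Ts * v_lon * np.sin(phi)
    return np.array([lx_next, ly_next, v_lon, v_lat, phi, omega])
```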

Next we move to the nonlinear constraints involved in the model:

$$\begin{aligned} \qquad \qquad \qquad ||{\varvec{x}}_{i|t}-{\varvec{x}}_{i|t}^{j}||_2&\ge {\varvec{D}}^{\mathrm{{safe}}}_{other},\\ \qquad \qquad \qquad ||{\varvec{x}}_{i|t}-{\varvec{x}}_{i|t}^{\mathrm{{lane}}}||_2&\ge {\varvec{D}}^{\mathrm{{safe}}}_{lane}, \end{aligned}$$

where \({\varvec{D}}^{\mathrm{{safe}}}_{other}\) is the safe distance between the ego vehicle and other vehicles, while \({\varvec{D}}^{\mathrm{{safe}}}_{lane}\) is the safe distance between the ego vehicle and the lane line. More specifically, we have the following nonlinear constraints, as shown in Eq. (5), which update dynamically at each time step with the change of the ego vehicle’s state, other vehicles’ states, lane boundary information, and number of other vehicles:

$$\begin{aligned} \left\{ \begin{array}{lr} \!\sqrt{(l_{\mathrm{{x}}}-l_{\mathrm{{x}}}^{j})^2+(l_{\mathrm{{y}}}-l_{\mathrm{{y}}}^{j})^2} \!\ge \!{\varvec{D}}^{\mathrm{{safe}}}_{other}, j\in {\varvec{N}}\\ \\ \!\sqrt{(l_{\mathrm{{x}}}-l_{\mathrm{{x}}}^{\mathrm{{lane,1}}})^2+(l_{\mathrm{{y}}}-l_{\mathrm{{y}}}^{\mathrm{{lane,1}}})^2} \!\ge \! {\varvec{D}}^{\mathrm{{safe}}}_{lane}\\ \\ \!\sqrt{(l_{\mathrm{{x}}}-l_{\mathrm{{x}}}^{\mathrm{{lane,2}}})^2+(l_{\mathrm{{y}}}-l_{\mathrm{{y}}}^{\mathrm{{lane,2}}})^2} \! \ge \! {\varvec{D}}^{\mathrm{{safe}}}_{lane} \end{array} \right. \end{aligned}$$
(5)

where \(l^{\mathrm{{lane,1}}}\) and \(l^{\mathrm{{lane,2}} }\) represent the left lane boundary and the right lane boundary.

A model predictive control-based nonlinear programming problem is given as follows to fully describe the trajectory tracking problem:

$$\begin{aligned} \begin{aligned} \qquad \min \limits _{u_{i|t},i=0:T-1}&{\varvec{J}}({\varvec{x}}_{i|t},{\varvec{u}}_{i|t})=\sum _{i=0}^{T-1}{\varvec{l}}({\varvec{x}}_{i|t})+\varvec{\eta }({\varvec{u}}_{i|t})\\ \qquad {\text { s.t. }}\quad&{\varvec{x}}_{i+1|t}={\varvec{F}}_{ego}({\varvec{x}}_{i|t},{\varvec{u}}_{i|t})\\ \qquad&{\varvec{x}}_{i+1|t}^j={\varvec{F}}_{other}({\varvec{x}}_{i|t}^j)\\ \qquad&||{\varvec{x}}_{i|t}-{\varvec{x}}_{i|t}^{j}||_2\ge {\varvec{D}}^{\mathrm{{safe}}}_{other}\\ \qquad&||{\varvec{x}}_{i|t}-{\varvec{x}}_{i|t}^{\mathrm{{lane}}}||_2\ge {\varvec{D}}^{\mathrm{{safe}}}_{lane}\\ \qquad&{\varvec{x}}_{0|t}={\varvec{x}}_{t},{\varvec{x}}_{0|t}^j={\varvec{x}}_{t}^j,{\varvec{u}}_{0|t}={\varvec{u}}_{t}\\ \qquad&i=0:T-1,j\in {\varvec{N}}. \end{aligned} \end{aligned}$$
(6)
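The rolling optimization around problem (6) can be sketched as follows, reusing F_ego and F_other from the previous sketch and assuming a generic solver solve_tracking_problem, a hypothetical placeholder for any method that solves (6) (for example, the ACOPFM policy introduced later); only the first control of each optimized sequence is applied.

```python
def receding_horizon_control(x0, others0, solve_tracking_problem, n_steps, T=30):
    # solve_tracking_problem(x, others, horizon) is assumed to return the control
    # sequence u_{0|t}, ..., u_{T-1|t} for the current problem (6).
    x, others = x0, others0
    applied_controls = []
    for t in range(n_steps):
        u_seq = solve_tracking_problem(x, others, horizon=T)
        u0 = u_seq[0]                                 # only the first control is executed
        applied_controls.append(u0)
        x = F_ego(x, u0)                              # ego state advances with Eq. (3)
        others = [F_other(xj) for xj in others]       # surrounding vehicles advance with Eq. (4)
    return applied_controls
```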

As the problem is highly nonlinear, the number of constraints involved is large, and the vehicle-mounted computing resources are limited, traditional numerical optimization methods cannot solve the problem satisfactorily. It has been shown that model-based RL algorithms can produce reasonable solutions for such problems [30, 62]; however, the constraints involved pose a great challenge to these algorithms. In the next section, we present a generalized objective penalty function method to handle the constraints under the model-based RL framework, which effectively avoids the ill-conditioning issue that occurs in the traditional penalty function methods.

ACOPFM framework

DRL-based parameterization

We consider a standard RL setup in which the agent interacts with the environment in discrete time steps. Here the agent is the autonomous vehicle, and the simulation environment is modeled as a Markov decision process (MDP) defined by the tuple \(({\mathcal {S}},{\mathcal {A}},{\mathcal {R}},{\mathcal {P}},{\mathcal {V}})\). Both the state space and the action space are continuous. Within each discrete time step, for each state \(s_t\in {\mathcal {S}}\) given by the simulation environment, the ego vehicle takes an appropriate action \(a_t\in {\mathcal {A}}\) according to the input state \(s_t\) and the policy \(\pi :{\mathcal {S}}\times {\mathcal {A}}\rightarrow {\mathbb {R}}\) to complete the trajectory tracking task. The function \(r(s_t,a_t): {\mathcal {S}}\times {\mathcal {A}}\rightarrow {\mathbb {R}}\) describes the cost, i.e., the objective function of the trajectory tracking task optimized at each moment, represented here as \(l(x_t)+\eta (u_t)\). The function \(p(s_{t+1}|s_{t},a_t): {\mathcal {S}}\times {\mathcal {A}}\rightarrow {\mathcal {P}}(s_{t+1})\) represents the probability density of the next state \(s_{t+1}\) to which the current state \(s_t\) is transferred by taking action \(a_t\). \(p(\cdot )\) is usually unknown due to the complexity and uncertainty of the autonomous vehicle dynamics and the motion of the surrounding traffic participants, but we can obtain the state at the next time step from the state transfer equation \(s_{t+1}={\varvec{F}}(s_{t}).\) The state value function \(V^{\pi }(s_t)\) is defined as the expected cost sum from \(s_t\) onwards and can be used to evaluate the value of an initial state \(s_t\) under policy \(\pi \). It is closely related to the policy \(\pi \) through \(V^{\pi }(s_t)=\left\{ \sum _{t=0}^{T-1}l(x_t)+\eta (u_t)\,\big |\,s_0=s_t\right\} .\)

We use the actor-critic RL scheme to solve the trajectory tracking problem. The purpose of the actor is to find the optimal policy \(\pi _{\theta }^*\); its loss reflects the tracking performance during the training of the NNs. The purpose of the critic is to approximate the true value function, which becomes more and more accurate as training proceeds and thereby improves the closed-loop performance; the loss of the critic reflects the safety performance during the training process.

Our RL problems are modeled as follows:

$$\begin{aligned}&\min \limits _{\theta }J_{actor}\! =\!{\mathbb {E}}_{s_{0|t}}\!\left\{ \sum _{i=0}^{T-1}{\varvec{l}}(s_{i|t},\pi _{\theta }(s_{i|t}))\!+\!\varvec{\eta }(u_{i|t},\pi _{\theta }(s_{i|t}))\!\right\} \nonumber \\&\mathrm {s.t.}\ s_{i+1|t}\!={\varvec{F}}(s_{i|t},\pi _{\theta }(s_{i|t}))\nonumber \\&g_c(s_{i|t})\ge 0,c\in {\varvec{C}}\nonumber \\&s_{0|t}=s_t \leftarrow \{\tau ^*,x_t,x_t^{j},j\in N\} \sim d\nonumber \\&i=0:T-1 \end{aligned}$$
(7)
$$\begin{aligned}&\min \limits _{\omega }J_{critic}\nonumber \\&\!=\!{\mathbb {E}}_{s_{0|t}}\!\left\{ \!\left( \!\sum _{i=0}^{T-1}{\varvec{l}}(s_{i|t},\!\pi _{\theta }(s_{i|t}))\!+\!\varvec{\eta }(u_{i|t},\!\pi _{\theta }(s_{i|t}))\!-\!{\varvec{V}}_{\omega }(s_{0|t})\!\right) ^2\!\right\} \nonumber \\&\mathrm {s.t.}\ s_{i+1|t}={\varvec{F}}(s_{i|t},\pi _{\theta }(s_{i|t}))\nonumber \\&s_{0|t}=s_t \leftarrow \{\tau ^*,x_t,x_t^{j},j\in N\} \sim d\nonumber \\&i=0:T-1, \end{aligned}$$
(8)

where the tracking error objective

$$\begin{aligned} \begin{aligned}&{\varvec{l}}(s_{i|t},\pi _{\theta }(s_{i|t}))+\varvec{\eta }(u_{i|t},\pi _{\theta }(s_{i|t}))\\&=||{\varvec{x}}_{i|t}-{\varvec{x}}_{i|t}^{\mathrm{{ref}}}||_{{\varvec{Q}}}^2+\varvec{\pi }_{\theta }(s_{i|t})^{\top }{\varvec{R}}\varvec{\pi }_{\theta }(s_{i|t}), \end{aligned} \end{aligned}$$

and \(g_c(s_{i|t})\ge 0, c\in {\varvec{C}},\) denote all the constraints; the number of constraints changes dynamically according to the surrounding vehicle information and the location information at each moment. Note that \({\varvec{F}}\in \{{\varvec{F}}_{ego},{\varvec{F}}_{other}\}\). In the value network, the optimal value function \({\varvec{V}}_{\omega ^*}\) is obtained by fitting the optimization parameter \(\omega \); we train the value network so that \({\varvec{V}}_{\omega ^*}\) approximates the obtained \(\varvec{J^*}.\) The control \(u_{i|t}=\pi _{\theta }(s_{i|t})\) is obtained via the policy network. Under some assumptions, \(u_{i|t}\) can be mapped as \(\pi _{\theta }(s_{i|t})\), and the optimal policy \(\pi _{\theta }^*(s_{i|t})\) is equivalent to the optimal action \(u_{i|t}^{*}.\)
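A minimal sketch of evaluating the actor and critic objectives (7) and (8) by rolling the policy out through the model for T steps is given below. PyTorch is assumed; policy_net, value_net, the combined transition F, the separately supplied reference x_ref, and the assumption that the first six entries of s are the ego state are all our own simplifications, and the expectation over \(s_{0|t}\) is approximated by a single sample. The constraint penalty of (9) is added in the next subsection.

```python
import torch

def rollout_losses(s0, x_ref, policy_net, value_net, F, Q, R, T=30):
    # Roll the policy out through the known model F for T steps and accumulate the
    # tracking cost of (7); the critic loss (8) regresses V_omega(s_{0|t}) onto it.
    s, J_actor = s0, torch.zeros(())
    for i in range(T):
        u = policy_net(s)                               # u_{i|t} = pi_theta(s_{i|t})
        err = s[:6] - x_ref[i]                          # ego tracking error x - x^ref
        J_actor = J_actor + err @ Q @ err + u @ R @ u   # l(x_{i|t}) + eta(u_{i|t})
        s = F(s, u)                                     # model-based state transition
    J_critic = (J_actor.detach() - value_net(s0)) ** 2  # (cost sum - V_omega(s_{0|t}))^2
    return J_actor, J_critic
```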

Actor-critic objective penalty function method

Taking \(P(t)=\max \{0,t\}^2\) and introducing a dynamic objective penalty parameter M, the constrained actor problem (7) is converted into a generalized objective penalty function problem as follows:

$$\begin{aligned}&\min \limits _{\theta }J_{p}=Q(J_{actor}-M)+J_{penalty}\nonumber \\&\quad ={\mathbb {E}}_{s_{0|t}}\!\left\{ \!Q\!\left( \sum _{i=0}^{T-1}{\varvec{l}}(s_{i|t},\pi _{\theta }(s_{i|t}))\!+\!\varvec{\eta }(u_{i|t},\pi _{\theta }(s_{i|t}))\!-\! M\!\right) \!\right\} \nonumber \\&\qquad +{\mathbb {E}}_{s_{0|t}}\!\left\{ \sum _{i=0}^{T-1}\phi _i(\theta )\right\} \nonumber \\&\qquad \mathrm {s.t.}\ s_{i+1|t}={\varvec{F}}(s_{i|t},\pi _{\theta }(s_{i|t}))\nonumber \\&\qquad \phi _i(\theta )=\sum _{c\in {\varvec{C}}}[\textrm{max}\{0,-g_c(s_{i|t})\}]^2\nonumber \\&\qquad s_{0|t}=s_t \leftarrow \{\tau ^*,x_t,x_t^{j},j\in {\varvec{N}}\} \sim d\nonumber \\&\qquad i=0:T-1. \end{aligned}$$
(9)

The algorithmic procedure of the actor-critic objective penalty method is given in Algorithm 1. Let the set

$$\begin{aligned} U=\{u=\pi _{\theta }(s_t)\, \big |\,g_c(s_t)\ge 0,c\in {\varvec{C}}\}. \end{aligned}$$

The optimal control obtained each time through the policy network is \(u_k\), and \(f(u_k)\) is the value of \(\sum _{i=0}^{T-1}{\varvec{l}}(s_{i|t},\pi _{\theta }(s_{i|t}))+\varvec{\eta }(u_{i|t},\pi _{\theta }(s_{i|t}))\) evaluated at \(u_k\). Since the punishment for constraint violations is formulated as a penalty term, \(u_k\) does not necessarily satisfy the constraints.

Algorithm 1 ACOPFM for Offline Training
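Since Algorithm 1 appears as an image in the original article, the following Python-style sketch reconstructs only its outer loop from the update rules used in the proof of Theorem 2 below; the helper functions are hypothetical placeholders and the details may differ from the authors' implementation.

```python
def acopfm_outer_loop(J_actor, J_penalty, optimize_policy, a, b, n_rounds=50, tol=1e-3):
    # a and b are assumed to bracket M_* = min_theta J_actor(theta): a <= M_* <= b.
    # optimize_policy(M, theta) stands for the inner actor-critic gradient-descent
    # updates minimizing J_p(theta, M) in (9); it returns the updated parameter theta.
    theta = None
    for k in range(n_rounds):
        M = 0.5 * (a + b)                            # M_k = (a_k + b_k) / 2
        theta = optimize_policy(M, theta)            # inner loop on the loss (9)
        f_val = J_actor(theta)                       # f(u_k): parameterized tracking cost
        J_p = max(f_val - M, 0.0) ** 2 + J_penalty(theta)
        if J_p == 0.0:                               # in practice, below a small tolerance
            b = f_val                                # feasible with f(u_k) <= M: shrink from above
        elif J_penalty(theta) > 0.0:
            a = max(f_val, M)                        # infeasible: M_k was too small, raise a
        else:
            return theta                             # feasible with J_p > 0: optimal (Theorem 1(iii))
        if b - a < tol:                              # the bracket is roughly halved each round
            break
    return theta
```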

Convergence analysis of the ACOPFM

We will show that the ACOPFM converges to the optimal policy under certain conditions. We say that a “round” is completed when an optimization process is completed.

Assumption 1

After the round k completes, we have an optimized policy parameter \(\theta _k\). Let

$$\begin{aligned} \begin{aligned} \qquad P(\theta _k,\beta )=\left\{ \theta _k\, \big |\,J_{actor}(\theta _k)\le \beta ,k=1,2,...\right\} , \end{aligned} \end{aligned}$$
(10)

which is called a generalized P-level set. We assume that \(P(\theta _k,\beta )\) is bounded for any given \(\beta >0\) if the sequence \(\{M_k\}\) is convergent.

Next we establish the convergence of the sequence \(\{M_k\}\).

Theorem 1

Let \(M_{*}=\min \limits _{\theta }J_{actor}(\theta )\), then \(M_{*}=f(\pi _{\theta ^{*}}(s_t)).\) Suppose that for some M, \(u_M\) is the control obtained from \(J_{p}(\theta ,M),\) and let \(\theta _M\) be the optimal parameter corresponding to \(J_{p}(\theta ,M)\), i.e. \(u_M=\pi _{\theta _M}(s_t).\) Then the following three assertions hold.

(i) If \(J_{p}(\theta _M,M)=0,\) then \(u_M\in U\) and \(M_{*}\le f(u_M)\le M.\)

(ii) If \(J_{p}(\theta _M,M)>0\) and \(u_M\notin U,\) then \(M\le M_{*}\) and \(f(u_M)< M_{*}.\)

(iii) If \(J_{p}(\theta _M,M)>0\) and \(u_M\in U,\) then \(\theta _{M}\in \mathop {\arg \min }\limits _{\theta }J_{actor}(\theta ).\)

Proof

(i) The conclusion clearly holds according to the definitions of P and Q.

(ii) Assume that \(\theta ^{*}\in \mathop {\arg \min }\limits _{\theta }J_{actor}(\theta ).\) Obviously

$$\begin{aligned} J_{penalty}(\theta ^{*})=0, J_{actor}(\theta ^{*})=f(\pi _{\theta ^{*}}(s_t)). \end{aligned}$$

As \(J_{p}(\theta _M,M)>0,\) it follows

$$\begin{aligned} 0<J_p(\theta _M,M)\le J_{p}(\theta ^{*},M)=Q(f(\pi _{\theta ^{*}}(s_t))-M). \end{aligned}$$

By the definition of Q, \(M<f(\pi _{\theta ^{*}}(s_t))=M_{*}.\)

If \(f(u_M)\le M,\) then \(f(u_M)\le M\le M_{*}.\) If \(f(u_M)> M,\) then

$$\begin{aligned} \begin{aligned} \qquad 0&<Q(f(u_M)-M)\\ \qquad&\le J_p(\theta _{M},M)\le J_p(\theta ^*,M)=Q(f(\pi _{\theta ^{*}}(s_t))-M). \end{aligned} \end{aligned}$$

Therefore, \(f(u_M)< f(\pi _{\theta ^{*}}(s_t))=M_*.\) Note that at the initial step 0, the ego vehicle is generated at the starting point and the default parameter settings satisfy the constraints, so \(J_{penalty}(\theta _{0})=0.\)

(iii) According to the given conditions, we have

$$\begin{aligned} \begin{aligned} \qquad \qquad \quad 0&< Q(f(u_M)-M)\\ \quad \qquad \qquad&=J_p(\theta _{M},M)\le J_p(\theta ,M)\\ \quad \qquad \qquad&=Q(f(u)-M),\forall u\in U. \end{aligned} \end{aligned}$$

Since \(u_M\in U,\) this implies that

$$\begin{aligned} f(u_M)-M\le f(u)-M,\forall u\in U, \end{aligned}$$

i.e. \(\theta _M\in \mathop {\arg \min }\limits _{\theta }J_{actor}(\theta )\).\(\square \)

Theorem 2

Let \(\{M_{k}\}\) be the sequence of objective penalty parameters generated during the iteration, together with the sequences \(\{a_{k}\}\) and \(\{b_{k}\}.\) Then \(\{a_{k}\}\) is increasing, \(\{b_{k}\}\) is decreasing, and they satisfy

$$\begin{aligned} a_{k}\le M_{*} \le b_{k},\qquad k=1,2,\ldots , \end{aligned}$$
(11)
$$\begin{aligned} b_{k+1}-a_{k+1}\le \dfrac{b_{k}-a_{k}}{2},\qquad k=1,2,\ldots . \end{aligned}$$
(12)

Proof

Consider first the case \(k = 0\). Clearly, \(a_{0}\le M_{*}\le b_{0}.\) When \(k = 1\), if \(J_{p}(\theta _1,M_0)=0,\) then \(a_1=a_0,\ b_1=f(u_1)=M_0,\ M_1=\dfrac{a_1+b_1}{2},\) and thus

$$\begin{aligned} b_1-a_1=M_0-a_0=\dfrac{a_0+b_0}{2}-a_0=\dfrac{b_0-a_0}{2}; \end{aligned}$$

Otherwise \(J_{p}(\theta _1,M_0)\ne 0\); if \(J_{penalty}({\theta }_{1})=0,\) the conclusion clearly holds, while for the case \(J_{penalty}({\theta }_{1})>0\) we have

$$\begin{aligned} \begin{aligned} \qquad \quad a_1&=\max \{f(u_1),M_0\}\\ \qquad \quad&\ge M_0=\dfrac{a_0+b_0}{2}\ge \dfrac{a_0+a_0}{2}=a_0,\\ \qquad \quad b_1&=b_0, \end{aligned} \end{aligned}$$

thus

$$\begin{aligned} \begin{aligned} \qquad \quad b_1-a_1&=b_0-\max \{f(u_1),M_0\}\\ \qquad \quad&\le b_0-M_0=b_0-\dfrac{a_0+b_0}{2}=\dfrac{b_0-a_0}{2}, \end{aligned} \end{aligned}$$

this implies that \(b_1-a_1\le \dfrac{b_0-a_0}{2}\).

Consider now the case \(k\ge 1\). By induction, suppose that \(a_k\le M_{*}\le b_k\) holds for some \(k \ge 1\); we then consider the case \(k+1\).

If \(J_{p}(\theta _{k+1},M_k)=0,\) this implies that \(u_{k+1}\in U,\) \(f(u_{k+1})=M_k=b_{k+1},\) \(a_{k+1}=a_{k},\) \(M_{k+1}=\dfrac{a_{k+1}+b_{k+1}}{2},\) thus

$$\begin{aligned} \begin{aligned} \qquad \quad a_{k+1}&=a_k\le M_{*}\le b_{k+1}=M_k=\dfrac{a_{k}+b_{k}}{2}\le b_k,\\ \qquad \quad b_{k+1}-a_{k+1}&=M_k-a_k=\dfrac{b_{k}-a_{k}}{2}. \end{aligned} \end{aligned}$$

If \(J_p(\theta _{k+1},M_k)\ne 0,\) then for the case \(J_{penalty}(\theta _{k+1})=0\) the conclusion obviously holds. For the case \(J_{penalty}(\theta _{k+1})>0\), we have \(b_{k+1}=b_k\) and \(a_{k+1}=\max \{f(u_{k+1}),M_k\}\ge M_k,\) thus

$$\begin{aligned} \begin{aligned} \qquad \quad a_{k+1}\ge M_k=\dfrac{a_k+b_k}{2}\ge \dfrac{a_k+a_k}{2}=a_k. \end{aligned} \end{aligned}$$


From Theorem 1, we have

$$\begin{aligned} a_{k+1}\le M_{*}\le b_k=b_{k+1}. \end{aligned}$$

Therefore,

$$\begin{aligned} \begin{aligned} \qquad \quad b_{k+1}-a_{k+1}&=b_k-\max \{f(u_{k+1}),M_k\}\\ \qquad \quad&\le b_k-M_k=\dfrac{b_k-a_k}{2}. \end{aligned} \end{aligned}$$

Also, we can find that the sequence \(\{a_k\}\) is increasing and the sequence \(\{b_k\}\) is decreasing, and both of them are bounded. Therefore, the sequences \(\{a_k\}\) and \(\{b_k\}\) are both convergent. Let

$$\begin{aligned} {\lim _{k\rightarrow \infty }}a_{k}=a^{*}, \ {\lim _{k\rightarrow \infty }}b_{k}=b^{*}. \end{aligned}$$

By Eqs. (11) and (12), it is clear that \(a^{*}=b^{*}.\) Then we have that the sequence \(\{M_{k}\}\) converges to \(a^{*}=b^{*}.\) \(\square \)

Theorem 3

Under Assumption 1, the limit point of any convergent subsequence of the sequence of the optimization parameters \(\{\theta _{k}\},k=1,2...,\) is the optimal solution to the original problem, where each \(\theta _k\) is obtained in the kth iteration.

Proof

We first show that the sequence of optimization parameters \(\{\theta _{k}\},k=1,2...\) is bounded.

Since \(\theta _{k}\) is the optimal parameter after k rounds of iterations, and \(u_0\in U\) with \(J_{penalty}(\theta _0)=0\) when \(k=0\), we have

$$\begin{aligned} \begin{aligned} J_p(\theta _{k},M_{k-1})\le Q(J_{actor}(\theta _{0})-M_{k-1}),\quad k=0,1,2,... \end{aligned} \end{aligned}$$

where the right-hand side is bounded by the convergence of \(\{M_k\}\), i.e., \({\lim _{k\rightarrow \infty }}M_{k}=a^{*}.\)

By Assumption 1, there exists \({\overline{\beta }}>0\) such that for any \(\beta >{\overline{\beta }}\) we have

$$\begin{aligned} \begin{aligned} \quad \qquad J_p(\theta _{k},M_{k-1})&\le Q(J_{actor}(\theta _{0})-M_{k-1})\\ \quad \qquad&<\beta , \quad k=0,1,2,\ldots \end{aligned} \end{aligned}$$

and hence, since the P-level set \(P(\theta _k,\beta )\) is bounded, the sequence \(\{\theta _k\}\) is bounded.

Since \(M_{*}=\min \limits _{\theta }J_{actor}(\theta ),\) without loss of generality let \(\theta ^{*}\in \mathop {\arg \min }\limits _{\theta }J_{actor}(\theta )\). We have proved that \(a_k \le M_* \le b_k,\ k=0,1,2,\ldots ,\) and that the sequences \(\{a_k\},\{b_k\},\{M_k\}\) converge to \(a^{*}.\) Letting \(k\rightarrow +\infty ,\) we obtain \(a^{*}=M_*=J_{actor}(\theta ^*).\)

Note that \(J_p, \ J_{actor}, \ J_{penalty}\) are all continuous on the parameter space. As the sequence \(\{\theta _k\}\) is bounded, it has a convergent subsequence \(\{\theta _{k_j}\}\) with \(\theta _{k_j}\rightarrow {\overline{\theta }}\) as \(k_j\rightarrow +\infty \).

We will show that \(J_{actor}({\overline{\theta }})=a^{*}.\) Clearly \(M_{*}=J_{actor}(\theta ^*),\) and note that

$$\begin{aligned} \begin{aligned} \quad \qquad J_p(\theta _{k_j},M_{k_j-1})&\le J_p(\theta ^*,M_{k_j-1})\\ \quad \qquad&=Q(J_{actor}(\theta ^*)-M_{k_j-1}). \end{aligned} \end{aligned}$$

Letting \(k_j \rightarrow +\infty ,\) we have \(M_{k_j-1}\rightarrow a^*.\) Then

$$\begin{aligned} \begin{aligned} \qquad \qquad \lim _{k_j\rightarrow +\infty }J_p(\theta _{k_j},M_{k_j-1})&= J_p({\overline{\theta }},a^{*})\\ \qquad \qquad&\le Q(J_{actor}(\theta ^*)-a^{*})\\ \qquad \qquad&=0. \end{aligned} \end{aligned}$$

Therefore \(J_p({\overline{\theta }},a^{*}) \le 0.\) It is clear that \(J_p({\overline{\theta }},a^{*})= 0\) due to the nonnegativity, and \(J_{penalty}({\overline{\theta }})=0,\) \(J_{actor}({\overline{\theta }})=a^{*},\) i.e., the limit of any convergent subsequence of the parameter sequence \(\{\theta _{k}\},k=0,1,2...\) is the optimal solution. \(\square \)

Fig. 3 Schematic diagram of the policy network structure. The input layer has 34 neurons, i.e., the input space is 34-dimensional. The inputs include the ego vehicle information (6 dimensions), the information of the 6 other vehicles (6\(\times \)4 dimensions) and the reference information (4 dimensions). The number of neurons in both hidden layers is 256, and ELU is used as the activation function. To ensure that the control output by the network conforms to the boundary constraints of the control, a sigmoid function with parameters is used to map the network output to the corresponding intervals, and the CARLA simulator simulates the control of the autonomous vehicle by adjusting the steering wheel angle and throttle accordingly

Simulation experiments

Our simulation experiments are conducted on the CARLA-SUMO joint platform. In order to simulate the real-time trajectory tracking scenario under complex constraints, the simulation platform should offer realistic sensor simulation, vehicle dynamics simulation, and traffic flow simulation. We thus choose CARLA, which comes with a realistic sensor simulation system, a vehicle dynamics simulation system, and game engine rendering [63]. SUMO, which offers realistic complex random traffic simulation [64], is chosen to produce realistic traffic flow in CARLA; some important variable settings of the SUMO traffic flow simulation are shown in Table 1. For the experimental scenario, we choose the multi-lane setting of the Town06 map in CARLA version 0.9.13, and it is worth noting that the vehicle speed limit is set at 20 m/s.

Table 1 Important variable settings in SUMO simulation traffic flow

Experimental environment

We choose the vehicle.tesla.model3 simulation model in the CARLA simulator as the ego vehicle for the trajectory tracking task. The state information contains three parts: the ego vehicle’s state information, the other vehicles’ state information, and the reference state information:

$$\begin{aligned} s=[s^{\mathrm{{ego}}},s^{\mathrm{{other}}},s^{\mathrm{{ref}}}]. \end{aligned}$$

The specific settings of \(s^{\mathrm{{ego}}}\) are described in Sect. “MPC-based nonlinear programming problem”. \(s^{\mathrm{{other}}}=[l_{\mathrm{{x}}}^j, l_{\mathrm{{y}}}^j, v_{\mathrm{{lon}}}^j, \varphi ^j]_{i|t}^{\top },j\in {\varvec{N}},\) with \(|{\varvec{N}}|=6.\) \(s^{\mathrm{{ref}}}=[l_{\mathrm{{x}}}^{\mathrm{{ref}}}, l_{\mathrm{{y}}}^{\mathrm{{ref}}}, v_{\mathrm{{lon}}}^{\mathrm{{ref}}}, \varphi ^{\mathrm{{ref}}}]_{i|t}^{\top }\), in which

$$\begin{aligned} (l_{\mathrm{{x}}}^{\mathrm{{ref}}}, \ l_{\mathrm{{y}}}^{\mathrm{{ref}}}) \in \mathop {\arg \min }\limits _{i}\left\{ \sqrt{(l_{\mathrm{{x}}}-l_{\mathrm{{x}}}^{\mathrm{{ref}}})^2+(l_{\mathrm{{y}}}-l_{\mathrm{{y}}}^{\mathrm{{ref}}})^2}\right\} . \end{aligned}$$

The reference speed \(v_{\mathrm{{lon}}}^{\mathrm{{ref}}}\) is 13 m/s, which is generated based on the real-time traffic rules, and the reference heading angle \(\varphi ^{\mathrm{{ref}}}\) is generated adaptively by the CARLA simulator. In the policy network, the input space of the DNN contains all three pieces of state information and is thus a 34-dimensional tensor.

The control is set as

$$\begin{aligned} u=[\delta , \ a]^{\top }, \end{aligned}$$

where \(\delta \) is the front-wheel angle, which can be adjusted by changing the steering wheel angle in the CARLA simulator (the steering wheel angle range is \([-1,\ 1]\) rad), and a is the acceleration of the ego vehicle, which can be converted to throttle in the CARLA simulator by the PID controller, parameterized by the throttle range \([0, \ 1]\). The output space contains the front-wheel angle and the acceleration, i.e., a 2-dimensional tensor with an angle range of \([-0.4,\ 0.4]\) rad and an acceleration range of \([-3,\ 3]\) m/s\(^2\). Due to the nonlinear mapping relationship, we use the ELU function to activate the network outputs and a sigmoid function with parameters to map the network outputs to the appropriate intervals, ensuring that the output control lies within the control ranges. The basic structure of the policy network is shown in Fig. 3. The value network is also a DNN with two hidden layers, but it differs structurally from the policy network in that it outputs a single real-valued signal, which is the score of the state-control pair.
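A sketch of the policy network described above and in Fig. 3 is given below, assuming PyTorch; the layer names are ours, and the output scaling uses the stated control bounds (front-wheel angle \([-0.4,\ 0.4]\) rad, acceleration \([-3,\ 3]\) m/s\(^2\)).

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    # 34-dim state input -> two 256-unit ELU hidden layers -> 2-dim control output,
    # squashed by a sigmoid and rescaled to the control bounds described in the text.
    def __init__(self, state_dim=34, hidden=256):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ELU(),
            nn.Linear(hidden, hidden), nn.ELU(),
            nn.Linear(hidden, 2),
        )
        # Control bounds: front-wheel angle [-0.4, 0.4] rad, acceleration [-3, 3] m/s^2.
        self.register_buffer("low", torch.tensor([-0.4, -3.0]))
        self.register_buffer("high", torch.tensor([0.4, 3.0]))

    def forward(self, s):
        z = torch.sigmoid(self.backbone(s))           # map outputs into (0, 1)
        return self.low + (self.high - self.low) * z  # rescale to [low, high]

# Example: one forward pass on a dummy 34-dimensional state.
u = PolicyNet()(torch.zeros(34))   # tensor([delta, a]) within the control bounds
```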

The information of all road points of the map is obtained from the CARLA simulation platform in advance. During the simulation process, the ego vehicle is automatically destroyed when it reaches the end point, collides with another vehicle, or crosses the lane boundary, and is then generated again at the starting position. The coordinates of the start and end points of the reference path are given in Table 2.

Table 2 The starting and ending points’ coordinates

The prediction horizon is \(T=30\), and the simulation interval for each time step is \(T_s=0.05\) s. The positive definite weighting matrices are

$$\begin{aligned} \begin{aligned} \qquad \qquad&{\varvec{Q}}=diag\{ 0.04, 0.04, 0.01, 0.05,0.1, 0.06\},\\ \qquad \qquad&{\varvec{R}}=diag\{ 0.1, 0.05\}. \end{aligned} \end{aligned}$$

The safe distances are \({\varvec{D}}^{\mathrm{{safe}}}_{other}=3.0 \ m\) and \({\varvec{D}}^{\mathrm{{safe}}}_{lane}=1.5\ m\). The distance from the front axle to the rear axle is \(L=3.0\ m.\) The height of the top-view point in CARLA is \(50\ m.\) The important parameters of the NNs are shown in Table 3, and the computational complexity (e.g., FLOPs and model parameter sizes) is shown in Tables 4 and 5.

Table 3 Important hyperparameter settings
Table 4 Computational complexity of the policy network
Table 5 Computational complexity of the value network
Fig. 4 Comparative experimental results of the ACOPFM with the GEP and the EPF in a multi-lane overtaking simulation scenario. The solid lines represent the mean value over different episodes. The shaded regions represent the 95% confidence interval. a The variation of \(J_{actor}\). b The objective penalty parameter M. c The penalty parameter \(\rho \) of the GEP and the EPF. d The variation of \(J_{p}\). e The variation of \(J_{critic}\). f The variation of \(J_{penalty}\)

Table 6 Average violation degree \({\bar{J}}_{penalty}\) of the ACOPFM, GEP and EPF

Offline training

The ACOPFM, the GEP, and the \(l_1\) exact penalty function-based method (denoted by EPF) are trained for comparison. For the key iteration step of the GEP [30], ‘if i mod m: \(\rho \leftarrow c\rho \)’, the parameter settings are shown in Table 3. Five runs are conducted with different random seeds on a desktop computer with a 3.70 GHz 10-core Intel Xeon W-2255 CPU, with evaluations every 100 iterations. The GEP and the EPF are trained using four different sets of initial penalty parameters \(\rho _{0}\) and are compared with the ACOPFM. The numerical results are reported in Fig. 4.

As seen in Fig. 4a, from the trend of \(J_{actor}\) we can see that, compared with the GEP and the EPF, the ACOPFM converges faster and is more stable; in other words, the ACOPFM finds the optimal strategy more efficiently, meaning that its trajectory tracking performance is better than that of the GEP and the EPF. Figure 4b shows the variation of the objective penalty parameter M: in contrast to the unbounded growth of the penalty parameter \(\rho \) of the GEP and the EPF shown in Fig. 4c, M converges quickly to a finite value that is roughly equal to the parameterized \(J_{actor}\). Figure 4d shows the variation of \(J_{p}\); it can be seen that the ACOPFM converges to a small value quite steadily, whereas the GEP exhibits a diverging and oscillating trend and the EPF also oscillates. The numerical performance of the value networks of the three methods is shown in Fig. 4e: the ACOPFM converges very fast, whereas the GEP and the EPF diverge. Figure 4f demonstrates the variation of \(J_{penalty}\), which indicates the degree of constraint violation; the ACOPFM performs much better than the GEP and the EPF. Moreover, we use the average violation degree [65] \({\bar{J}}_{penalty}\) of \(J_{penalty}\) to measure the safety performance:

$$\begin{aligned} \begin{aligned} \qquad \qquad {\bar{J}}_{penalty}= {\mathbb {E}}_{0|k}\left\{ {\mathbb {E}}_{s_{0|t}}\left\{ \sum _{i=0}^{T-1}\phi _i(\theta )\right\} \right\} . \end{aligned} \end{aligned}$$
(13)

Overall, the average violation degree of the ACOPFM is lower than that of the GEP and the EPF, as shown in Table 6, which indicates that the safety performance of the ACOPFM is generally better than that of the GEP and the EPF.

Online testing

We deploy the model trained by the ACOPFM to the ego vehicle and test it in a Town06 multi-lane scenario. The red thick line segment is the optimized trajectory obtained by the ACOPFM, which is formed by connecting waypoints within the prediction horizon (\(T = 30\)) at each simulation moment. The green solid line is the reference path, generated by the \(A^*\) algorithm in the CARLA simulator after considering the road static information, and there are 5 reference paths corresponding to the multi-lane scenario.

1) Performance in the environment without surrounding vehicles

In such an environment, the ego vehicle trajectory should be well maintained in the reference lane. In the SUMO simulator we set the vehicle generation frequency vehsPerHour to 0 vehicles per hour. The dynamic breakdown of the whole simulation test is shown in Fig. 5, where the ego vehicle is initialized in lane 2 with the goal of reaching the end of the lane in a safe and traffic-rule-compliant manner.

During the whole test process, the changes of the indicators of the ego vehicle are shown in Fig. 6. It can be seen that the ego vehicle keeps an almost constant speed after accelerating for a period of time, while the front-wheel angle, the steering wheel angle, and the heading angle are basically kept near 0, which indicates that the trajectory tracking performance of the ego vehicle is good.

Fig. 5 Visualization of the trajectory tracking simulation process without interactions with other vehicles. a The ego vehicle starts in lane 2. b The ego vehicle is in motion. c The ego vehicle is approaching the end (marked by red dot) smoothly

Fig. 6 Results of the indicators in the environment without interactions with other vehicles. a The variation of the speed. b The variation of the acceleration. c The variation of the steering wheel angle. d The variation of the front-wheel angle. e The variation of the heading angle. f The variation of the throttle parameter

Fig. 7 Visualization of the ego vehicle’s maneuvers during the first overtaking. a The ego vehicle starts in lane 1. b The ego vehicle detects vehicle A in the right front of it through sensors in the CARLA simulator and prepares to steer itself to enter lane 2 on the right. c The ego vehicle interacts with vehicle A safely and stably. d The ego vehicle completes the first lane change to overtake and travels on lane 2

Fig. 8 Visualization of the ego vehicle’s maneuvers during the second overtaking. a The ego vehicle travels on lane 2 for some time. b The ego vehicle detects vehicle B in the right front of it via sensors in the CARLA simulator and prepares to steer itself to enter lane 1 on the left. c The ego vehicle interacts with vehicle B safely and stably. d The ego vehicle completes the second lane change to overtake and travels on lane 1 to reach the end (marked with red dot) successfully

Fig. 9 Results of indicators in the environment with surrounding vehicles. a Variation of the speed. b Variation of the acceleration. c Variation of the steering wheel angle. d Variation of the front-wheel angle. e Variation of the heading angle. f Variation of the throttle parameter

2) Performance in the environment with surrounding vehicles

In this environment, the challenging task is to safely and stably complete the overtaking control of the ego vehicle. We adopt a multistep safety shield [30] after the output of the policy to enhance safety during the online implementation of the trained model. In the SUMO simulator we set the vehicle generation frequency vehsPerHour to 5000 vehicles per hour. The dynamic breakdown of the whole process of the simulation test is shown in Figs. 7 and 8.

A collision could occur when a vehicle A is in front of the ego vehicle. The ego vehicle therefore turns into lane 2, as shown in Fig. 7b, and chooses lane 2 as the path to be tracked to complete the first overtaking, as shown in Fig. 7c and d.

After driving stably on lane 2 for a period of time, the sensor detects vehicle B directly ahead of the ego vehicle, as shown in Fig. 8b, and the ego vehicle performs a second lane change for overtaking. The ego vehicle chooses to change lanes, entering lane 1 on the left and overtaking vehicle B by accelerating, as shown in Fig. 8c. After the second overtaking, the ego vehicle ultimately arrives at the end of the lane safely and stably, as shown in Fig. 8c and d.

During the whole test process, the changes of the indicators of the ego vehicle are shown in Fig. 9. Around simulation steps 170 and 530, there are clear fluctuations in the indicators, since the first and second overtaking maneuvers entail changes of the acceleration and the front-wheel angle, which are reflected in Fig. 9a to f. Note that the first and second overtaking maneuvers involve the right-turn and left-turn control of the steering wheel, which are implemented by changing the steering wheel angle (Fig. 9c). The front-wheel angle (Fig. 9d) and the heading angle (Fig. 9e) change abruptly around simulation steps 170 and 530.

Conclusion and future work

In this paper, we have proposed a model-based RL algorithm, the ACOPFM, to avoid the ill-conditioning issue faced by the GEP during model training and to deal effectively with large-scale nonlinear state-wise constraints. A planning and control scheme based on the IDC framework is adopted, and an MPC-based model of the trajectory tracking problem is established. Subsequently, the model is converted into an unconstrained optimization problem using the objective penalty function method and parameterized based on the actor-critic framework. The loss function guiding the update of the policy network takes the form of the objective penalty function. The ACOPFM finds the optimal control strategy faster and more stably than the GEP by alternately performing gradient descent and adaptively adjusting the penalty parameter. The simulation results show that, under multi-lane scenarios, the ACOPFM performs well in trajectory tracking tasks and is able to complete the overtaking maneuvers successfully.

As the ACOPFM, like the GEP, converts the original problem into an unconstrained optimization problem by imposing penalties for constraint violations, there is no 100% guarantee of safety. Our next step is to improve the multistep safety shield [30] to ensure safety performance.

Note that the state transfer equation employed in the present work is relatively simple and may introduce errors in real applications. Some environmental factors, such as wind speed, ambient temperature, and ground slope, are not considered in the simulation process. Moreover, some scenes (e.g., some control behaviors of the ego vehicle and other vehicles) are not encountered during model training, which can cause long-tail problems and makes real-world applications of the method less secure. However, by refining the model and extending the training process, these problems can be overcome to a large extent.