1 Introduction

The objective of optimal tracking control is to design a controller that makes the system’s output track a specified reference signal while minimizing a prescribed performance index. This field has attracted significant attention and research effort, finding applications in practical domains such as chaotic systems, helicopters, permanent magnet synchronous motors, dispatch, and electric vehicles [1,2,3,4,5]. Optimal control techniques rely on Pontryagin’s minimum principle. For linear systems, the optimal control is obtained by solving the algebraic Riccati equation, as shown in [6]. For nonlinear systems, the optimal control necessitates the solution of the nonlinear Hamilton-Jacobi-Bellman (HJB) equation. Despite the practical utility of optimal control, the conventional methodology encounters a significant challenge, namely, the difficulty of solving the nonlinear HJB equation for higher-order systems [7,8,9,10].

In recent years, numerous efforts have been made to obtain the optimal controller, including inverse optimal control, \(\theta\)-D techniques, numerical approximation methods, and others [11, 13, 14]. The inverse optimal control method, presented in [11, 12], offers a solution that avoids the need to solve the HJB equation. For nonlinear systems, a suboptimal control approach was proposed in [13]. Another approach, described in [14], employed a \(\theta\)-D approximation method to solve the HJB equation by transforming it into state-dependent Lyapunov equations. It is important to note that these methods, although effective, are typically performed offline. Consequently, when the system parameters change, the control performance may degrade. To address this issue, researchers have explored the integration of reinforcement learning and adaptive control with optimal control [7, 15,16,17,18,19,20,21].

Approximate dynamic programming (ADP), proposed by [7] in 1992, utilizes function approximation structures to approximate the cost function and control strategy in the dynamic programming equation. ADP has been developed in subsequent works [15,16,17] using neural networks (NNs) to achieve optimal tracking control. These methods have been thoroughly studied and widely adopted [18, 24]. Furthermore, advancements in hardware have paved the way for data-driven approaches in optimal control. For example, [22] introduced a computational adaptive optimal controller for linear systems with completely unknown dynamics. Nonlinear adaptive optimal control was achieved through value iteration and ADP, as described in [23].

Inspired by this, we have incorporated the principles of adaptive and reinforcement learning to develop efficient tracking controllers using an actor-critic approach. Nevertheless, previous studies such as [25, 26] have highlighted a limitation of optimal tracking control, which involves the introduction of a discount factor into the performance index. This factor is intended to prevent the index from growing indefinitely, but it can hinder the convergence of the system state to zero. To address this issue, our paper proposes a reinforcement learning-based tracking control technique that utilizes a filtered error system, thereby eliminating the need for a discount factor.

In practical systems, the presence of disturbances is an inevitable issue [27, 28, 35]. These disturbances encompass both internal factors, such as unmodeled dynamics, perturbed model parameters, and structural perturbations, and external environmental disturbances [37]. To achieve the desired control outcomes, including improved disturbance rejection, fast dynamic response, and minimal steady-state error, it is crucial to explore highly reliable controllers. Extensive research has been conducted on various anti-disturbance control methods, such as robust control [29], sliding mode control [30, 31], and output regulation theory [32]. Among these methods, two approaches have gained attention for their ability to achieve fast disturbance suppression based on system dynamics: disturbance observer-based control and active disturbance rejection control [33,34,35]. By employing disturbance observers or extended state observers to estimate and actively compensate for disturbances, their influence can be effectively mitigated [35].

However, mismatched disturbances are difficult to handle, as highlighted in [36, 37]. In [37], the authors proposed a composite control strategy based on the backstepping method for higher-order nonlinear systems with non-vanishing disturbances. By incorporating the disturbance estimate at each step of the virtual control design, the output is regulated to zero. While this method effectively handles mismatched disturbances, it is not optimal for two reasons. First, the nonlinearity is canceled at each step of the virtual control design. Second, the gain of the virtual control is assigned artificially and only satisfies the condition for making the derivative of the Lyapunov function negative definite. Therefore, we employ the idea of backstepping to construct a filtered error system that retains the nonlinear terms, ensuring optimality while dealing with mismatched disturbances.

Furthermore, the majority of existing studies focus on achieving asymptotic estimates of disturbances, implying that estimation errors persist even as the system converges. To mitigate the impact of disturbances, researchers have proposed fixed-time observers [38,39,40]. This approach involves estimating unknown disturbances within a predetermined time period, thereby minimizing their subsequent effects. In our study, we also employ a fixed-time disturbance observer (FTDOB) to estimate disturbances and reduce their influence on the neural network training process.

Therefore, this paper aims to address the limitations of existing optimal control methods and anti-disturbance methods in order to tackle more complex scenarios. The primary contributions of this paper are as follows:

  • Two neural networks are utilized to implement an actor-critic network, enabling the approximation of both the optimal control and cost function.

  • The fixed-time algorithm is employed in the design of the observer, allowing for the estimation of disturbances over a predetermined time interval, thereby enhancing the reliability of the control strategy.

  • Filtered error systems are constructed to attain an optimal controller for high-order nonlinear systems affected by mismatched disturbances.

The rest of the paper is organized as follows. In Sect. 2, the system description and some necessary definitions are given. Section 3 presents the main results on disturbance observer design and controller design. Simulation examples are given in Sect. 4, and the conclusion is given in Sect. 5.

2 System descriptions and some preliminaries

Consider the following disturbed nonlinear system,

$$\begin{aligned} \left\{ \begin{aligned} {\dot{x}}_{i}&= x_{i+1}+f_{i}+d_{i}, ~~i=1, 2,\ldots , n-1,\\ {\dot{x}}_{n}&= f_{n}+u+d_{n}, \end{aligned} \right. \ \end{aligned}$$
(1)

where \(x_{i}\), \(d_{i}\), \(f_{i}\), \(i=1, 2,\ldots , n\) denote the system states, disturbances and nonlinear functions, respectively, and u is the control input. Complete state information is assumed to be available.

Assumption 1

There exists a sufficiently small constant \(\xi\) such that \(\Vert {\dot{d}}\Vert <\xi\), where \(d=[d_{1},d_{2},\ldots ,d_{n}]^{\textsf {T}}\).

Here, we recall the optimal control theory [6]. For the nominal system, i.e., without considering the disturbance, a cost function is given as

$$\begin{aligned} \begin{aligned} J=\int _{0}^{\infty }[Q(x)+u^{\textsf {T}}Ru]\mathrm{{d}}t, \end{aligned} \end{aligned}$$
(2)

where Q(x) is a positive definite function and R is a symmetric positive definite constant matrix. Writing the nominal dynamics in the control-affine form \({\dot{x}}=f+gu\), define \(\frac{\partial J}{\partial x}=\nabla J\) and choose the Hamiltonian as \(H= \nabla J^{\textsf {T}}{\dot{x}}+Q+u^{\textsf {T}}Ru\). Then, the optimal value function \(J^{*}\) satisfies \(0= \min _{u}[H(x,u,\nabla J^{*})]\). With the optimal control policy \(u^{*}\), the HJB equation becomes

$$\begin{aligned} \begin{aligned} 0&= Q+u^{*\textsf {T}}Ru^{*}+\nabla J^{*\textsf {T}}(f+gu^{*}). \end{aligned} \end{aligned}$$
(3)

Then, we have the optimal control input \(u^{*}\) as

$$\begin{aligned} \begin{aligned} u^{*}= \arg \min \limits _{u}[H(x,u,\nabla J^{*})]=-\frac{1}{2}R^{-1}g^{\textsf {T}}\nabla J^{*}. \end{aligned} \end{aligned}$$
(4)
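For completeness, (4) follows from the stationarity of the Hamiltonian with respect to u: since H is quadratic in u,

$$\begin{aligned} \begin{aligned} \frac{\partial H}{\partial u}=2Ru+g^{\textsf {T}}\nabla J^{*}=0 \quad \Longrightarrow \quad u^{*}=-\frac{1}{2}R^{-1}g^{\textsf {T}}\nabla J^{*}, \end{aligned} \end{aligned}$$

and substituting \(u^{*}\) back into \(0=\min _{u}[H(x,u,\nabla J^{*})]\) recovers the HJB equation (3).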

The existing optimal control methods face two challenges: (1) robustness in the presence of disturbances, especially mismatched disturbances; (2) the complexity of the nonlinear HJB equation, whose solution is very resource-intensive. Hence, we propose a robust optimal control strategy based on NNs and disturbance observers, which will be given in detail in Sect. 3. Next, we provide one definition for the later development.

Definition 1

The equilibrium \(x_{e}\) of system (1) is uniformly ultimately bounded (UUB) if there is a compact set \(S\subset {\mathbb {R}}^{n}\) such that, for any initial value \(x_{0}\) belonging to that compact set and any initial time \(t_{0}\), there exist an upper bound B and a time \(T(B,x_{0})\) such that \(\Vert x(t)-x_{e}\Vert \le B\) for all \(t>t_{0}+T\).

3 Main results

The classic control method usually adopts the idea of feedback control plus feedforward control [35], but it has the following two shortcomings: (1) an asymptotically convergent observer causes the estimation error to persist; (2) feedback control can only stabilize the system, without optimality. This paper avoids these shortcomings by fusing fixed-time estimation with reinforcement learning. Figure 1 visually represents the core concepts discussed in this paper. The output of the system is directly used as the input of the disturbance observer. By choosing the observer gains reasonably, complete tracking of the disturbance can be realized within a fixed time. Then, the original system, together with the disturbance estimate, is transformed into a filtered error system, which enables us to deal with mismatched disturbances well. Under the framework of optimal control, a reinforcement learning method relying on actor and critic NNs is proposed. By training the NNs, the optimal controller of the error system is obtained.

Fig. 1 Reinforcement learning-based robust optimal control strategy. The fixed-time observer provides an accurate estimate of the disturbance. By compensating for it in the original system, filtered error systems are constructed. Actor and critic NNs are used to achieve reinforcement learning optimal control

Firstly, we design the fixed-time disturbance observers. With the disturbance estimates in hand, the original system is then transformed into a filtered error system.

3.1 Fixed-time disturbance observer design

The fixed-time disturbance observer is designed for each channel as

$$\begin{aligned} \left\{ \begin{aligned} {\dot{z}}_{i1}&= z_{i2}-\lambda _{1}(z_{i1}-x_{i})^{\alpha _{1}}-\lambda _{2}(z_{i1}-x_{i})^{\beta _{1}}+x_{i+1}+f_{i},\\ {\dot{z}}_{i2}&= -\lambda _{3}(z_{i1}-x_{i})^{\alpha _{2}}-\lambda _{4}(z_{i1}-x_{i})^{\beta _{2}}, \end{aligned} \right. \ \end{aligned}$$
(5)

where \(i=1, 2,\ldots , n\) (for \(i=n\), the term \(x_{i+1}\) is replaced by the control input u, in accordance with (1)), \(z_{i1}\), \(z_{i2}\) are the estimates of \(x_{i}\) and \(d_{i}\), \(\lambda _{1}\), \(\lambda _{2}\), \(\lambda _{3}\), \(\lambda _{4}\) are observer gains to be designed, and \(\alpha _{1}\), \(\alpha _{2}\), \(\beta _{1}\), \(\beta _{2}\) are observer internal parameters.
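As an illustration, a minimal numerical sketch of one observer channel of (5) is given below. It assumes the fractional powers are implemented as signed powers \(\mathrm {sig}(e,a)=|e|^{a}\mathrm {sign}(e)\) (a common convention in fixed-time designs) and uses a simple Euler step; the function names and the integration scheme are illustrative, not part of the paper.

```python
import numpy as np

def sig(e, a):
    """Signed power |e|^a * sign(e); assumed interpretation of the fractional exponents."""
    return np.abs(e) ** a * np.sign(e)

def ftdob_step(z1, z2, x_i, x_next, f_i, g, dt):
    """One Euler step of the fixed-time disturbance observer (5) for channel i.

    z1, z2 : current estimates of x_i and d_i
    x_i    : measured state of channel i
    x_next : x_{i+1} (or the control input u when i = n)
    f_i    : known nonlinearity f_i evaluated at the current state
    g      : dict of gains/exponents lambda1..lambda4, alpha1, alpha2, beta1, beta2
    dt     : step size (illustrative; any ODE solver could be used instead)
    """
    e = z1 - x_i  # observer output error z_{i1} - x_i
    z1_dot = (z2 - g["lambda1"] * sig(e, g["alpha1"])
                 - g["lambda2"] * sig(e, g["beta1"]) + x_next + f_i)
    z2_dot = -g["lambda3"] * sig(e, g["alpha2"]) - g["lambda4"] * sig(e, g["beta2"])
    return z1 + dt * z1_dot, z2 + dt * z2_dot
```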

Theorem 1

Given system (1), if the observer gains are chosen properly, the disturbance can be estimated within a fixed time \(T_{d}\) that is independent of the initial values.

Proof

Define the estimation errors as \(e_{i1}=x_{i}-z_{i1}\), \(e_{i2}=d_{i}-z_{i2}\). Differentiating \(e_{i1}\) and \(e_{i2}\) with respect to time gives

$$\begin{aligned} \left\{ \begin{aligned} {\dot{e}}_{i1}&= e_{i2}-\lambda _{1}(e_{i1})^{\alpha _{1}}-\lambda _{2}(e_{i1})^{\beta _{1}},\\ {\dot{e}}_{i2}&= -\lambda _{3}(e_{i1})^{\alpha _{2}}-\lambda _{4}(e_{i1})^{\beta _{2}}+{\dot{d}}_{i}. \end{aligned} \right. \ \end{aligned}$$
(6)

As long as the observer gains are chosen carefully, the estimation error dynamics, which can be written as \({\dot{e}}=\Lambda (e)+D\) with \(D=[0,\quad {\dot{d}}_{i}]^{\textsf {T}}\), are fixed-time convergent. The rest of the proof is similar to that in [31] and is omitted here. \(\square\)

Under the designed observer, the mismatched disturbance can be handled. With the help of the backstepping method, define the tracking error \(z_{1}=x_{1}-r\), where r is the reference signal, so that \({\dot{z}}_{1}=x_{2}+f_{1}+d_{1}-{\dot{r}}\). Denote \(z_{2}=x_{2}-x_{2}^{*}\) and choose the virtual control \(x_{2}^{*}=-k_{1}z_{1}-{\hat{d}}_{1}+{\dot{r}}\); then \({\dot{z}}_{1}={\dot{x}}_{1}-{\dot{r}}=z_{2}-k_{1}z_{1}+f_{1}+e_{1}\), where \(e_{1}=d_{1}-{\hat{d}}_{1}\) is the disturbance estimation error. Likewise, we have

$$\begin{aligned} \left\{ \begin{aligned} {\dot{z}}_{i}&= z_{i+1}-k_{i}z_{i}+f_{i}+e_{i}, i=1,\ldots , n-1,\\ {\dot{z}}_{n}&= u_{o}+f_{n}+e_{n}. \end{aligned} \right. \ \end{aligned}$$
(7)

Then (7) is rewritten as

$$\begin{aligned} \begin{aligned} {\dot{Z}}&=F(Z)+Gu_{o}, \end{aligned} \end{aligned}$$
(8)

where \(Z=[z_{1},z_{2},\ldots ,z_{n}]^{\textsf {T}}\), \(F(Z)=[z_{2}+f_{1}-k_{1}z_{1}, \cdots , f_{n}-k_{n}z_{n}]^{\textsf {T}}\), \(G=[0,0,\ldots , 1]^{T}\).
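For reference, a brief sketch of the right-hand side of (8) is given below, with the observer estimation errors \(e_{i}\) of (7) omitted (i.e., the nominal form that holds after the FTDOB has converged). Variable names are illustrative, and a scalar control input is assumed, consistent with (1).

```python
import numpy as np

def filtered_error_dynamics(Z, f_vals, k, u_o):
    """Right-hand side of (8): Zdot = F(Z) + G * u_o.

    Z      : filtered errors [z_1, ..., z_n]
    f_vals : nonlinearities [f_1, ..., f_n] evaluated at the current state
    k      : backstepping gains [k_1, ..., k_n]
    u_o    : scalar control input of the transformed system
    """
    n = len(Z)
    F = np.empty(n)
    for i in range(n - 1):
        F[i] = Z[i + 1] + f_vals[i] - k[i] * Z[i]   # z_{i+1} + f_i - k_i z_i
    F[n - 1] = f_vals[n - 1] - k[n - 1] * Z[n - 1]  # f_n - k_n z_n
    G = np.zeros(n)
    G[-1] = 1.0                                     # G = [0, ..., 0, 1]^T
    return F + G * u_o
```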

Remark 1

Canceling the nonlinearity in the backstepping method leads to a nonoptimal controller, as the nonlinearity may actually be beneficial in meeting the stabilization and/or performance objectives [11].

Remark 2

During the actual production process, the controlled system often encounters abrupt disturbances that can be characterized as lumped disturbances [35]. These types of disturbances do not satisfy Assumption 1. Nevertheless, the proposed control strategy is still able to stabilize the system and exhibits a certain level of robustness. This is because, even in the presence of sudden disturbance changes, the designed observer is able to estimate the disturbance within a fixed time. It is worth noting that the nonlinear function employed in the controller design is represented as \(f+e\). However, since the term e exists only transiently and eventually diminishes to zero, the overall effect on the controller’s performance is minimal.

Following the former section, we define \(\frac{\partial J}{\partial Z}=\nabla J\) and choose the Hamiltonian as \(H= \nabla J^{\textsf {T}}{\dot{Z}}+Z^{\textsf {T}}QZ+u_{o}^{\textsf {T}}Ru_{o}\). Then, we have the optimal control as \(u_{o}^{*}= \arg \min \limits _{u_{o}}[H(Z,u_{o},\nabla J^{*})]=-\frac{1}{2}R^{-1}G^{\textsf {T}}\nabla J^{*}\), satisfying \(0= Q+u_{o}^{*\textsf {T}}Ru_{o}^{*}+\nabla J^{*\textsf {T}}(F+Gu_{o}^{*})\).

3.2 Critic NN design

The cost function is approximated by a critic neural network,

$$\begin{aligned} \begin{aligned} J=W^{\textsf {T}}\phi (Z)+\epsilon (Z), \end{aligned} \end{aligned}$$
(9)

where W denotes the ideal neuron weights, \(\phi (Z): {\mathbb {R}}^{n}\rightarrow {\mathbb {R}}^{N}\) is the NN activation function vector, N stands for the number of neurons in the hidden layer, and \(\epsilon (Z)\) is the approximation error. As \(N\rightarrow \infty\), \(\epsilon (Z)\rightarrow 0\). Since the ideal weight W is unknown, the output of the critic neural network is expressed as

$$\begin{aligned} \begin{aligned} {\hat{J}}&={\hat{W}}_{c}^{\textsf {T}}\phi (Z), \end{aligned} \end{aligned}$$
(10)

where \({\hat{W}}_{c}\) is the estimate of W.
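As a concrete (and merely illustrative) choice, a quadratic polynomial activation vector and the corresponding critic output (10) can be coded as follows; for \(n=2\) this basis, \(\phi (Z)=[z_{1}^{2}, z_{1}z_{2}, z_{2}^{2}]^{\textsf {T}}\), is the one implicitly used in the linear example of Sect. 4.1, but other activation functions are equally admissible.

```python
import numpy as np

def phi_quadratic(Z):
    """Quadratic activation vector: all monomials z_i * z_j with i <= j."""
    n = len(Z)
    return np.array([Z[i] * Z[j] for i in range(n) for j in range(i, n)])

def grad_phi_quadratic(Z):
    """Jacobian of phi_quadratic with respect to Z, shape (N, n)."""
    n = len(Z)
    eye = np.eye(n)
    return np.array([Z[i] * eye[j] + Z[j] * eye[i]
                     for i in range(n) for j in range(i, n)])

def critic_value(W_c_hat, Z):
    """Critic output (10): J_hat = W_c_hat^T phi(Z)."""
    return W_c_hat @ phi_quadratic(Z)
```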

Considering (9) and (10), the corresponding Hamilton functions are rewritten as

$$\begin{aligned} \begin{aligned} H(Z,u_{o},W)&=W^{\textsf {T}}\nabla \phi (F+Gu_{o})+Q+u_{o}^{\textsf {T}}Ru_{o}+\upsilon _{H} \end{aligned} \end{aligned}$$
(11)

and

$$\begin{aligned} \begin{aligned} H(Z,u_{o},{\hat{W}}_{c})&={\hat{W}}_{c}^{\textsf {T}}\nabla \phi (F+Gu_{o})+Q+u_{o}^{\textsf {T}}Ru_{o}, \end{aligned} \end{aligned}$$
(12)

where \(\upsilon _{H}=\nabla \epsilon (F+Gu_{o})\).

Define the critic NN weight estimation error \({\tilde{W}}_{c}=W-{\hat{W}}_{c}\); then we have

$$\begin{aligned} \begin{aligned} e_{H}=H(Z,u_{o},W)-H(Z,u_{o},{\hat{W}}_{c}) ={\tilde{W}}_{c}^{\textsf {T}}\nabla \phi (F+Gu_{o})+\upsilon _{H}. \end{aligned} \end{aligned}$$
(13)

Given any admissible control policy, it is desired to select \({\hat{W}}_{c}\) to minimize the quadratic error

$$\begin{aligned} \begin{aligned} E=\frac{1}{2}e_{H}^{\textsf {T}}e_{H}. \end{aligned} \end{aligned}$$
(14)

The normalized gradient algorithm is adopted to tune the critic weights

$$\begin{aligned} \begin{aligned} \dot{{\hat{W}}}_{c}=-a_{1}\frac{\partial E}{\partial {\hat{W}}_{c}}=-a_{1}\frac{\sigma _{1}}{(\sigma _{1}^{\textsf {T}}\sigma _{1}+1)^{2}}[\sigma _{1}^{\textsf {T}}{\hat{W}}_{c}+Q+u_{o}^{\textsf {T}}Ru_{o}],\\ \end{aligned} \end{aligned}$$
(15)

where \(\sigma _{1}=\nabla \phi (F+Gu_{o})\), \((\sigma _{1}^{\textsf {T}}\sigma _{1}+1)^{2}\) is used for normalization, and \(a_{1}\) is a scalar gain to be designed.
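A one-step numerical sketch of (15) is shown below, assuming a scalar control input and scalar R for brevity (the matrix case is analogous); the Euler step and the argument names are illustrative.

```python
import numpy as np

def critic_update(W_c_hat, grad_phi, F, G, u_o, Q_val, R, a1, dt):
    """One Euler step of the critic tuning law (15).

    grad_phi : Jacobian of the activation vector phi(Z), shape (N, n)
    F, G     : drift and input vectors of the filtered error system (8)
    u_o      : scalar control input currently applied
    Q_val    : Z^T Q Z evaluated at the current Z
    R        : scalar control weight
    a1       : critic learning gain
    """
    sigma1 = grad_phi @ (F + G * u_o)              # sigma_1 = grad_phi (F + G u_o)
    norm = (sigma1 @ sigma1 + 1.0) ** 2            # normalization term
    e_H = sigma1 @ W_c_hat + Q_val + R * u_o ** 2  # approximate Hamiltonian residual
    W_c_dot = -a1 * sigma1 / norm * e_H
    return W_c_hat + dt * W_c_dot
```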

As mentioned in [2, 18], the identification of the critic parameter needs to fulfill the persistent excitation (PE) condition. In order to satisfy this condition, there are numerous options available for the signal selection, as long as the PE condition outlined in [18] is met.

3.3 Actor NN design

According to (4), we know the optimal control can be written as \(-\frac{1}{2}R^{-1}G^{\textsf {T}}(\nabla \phi ^{\textsf {T}}W+\nabla \epsilon )\). Since the parameter W is unknown, we utilize an actor NN to approximate the control input. The controller is then represented as

$$\begin{aligned} \begin{aligned} {\hat{u}}_{o}&=-\frac{1}{2}R^{-1}G^{\textsf {T}}\nabla \phi ^{\textsf {T}}{\hat{W}}_{a}, \end{aligned} \end{aligned}$$
(16)

where \({\hat{W}}_{a}\) denotes the estimated value of W.

Similarly, \({\hat{W}}_{a}\) should be designed to approach W as closely as possible. Here, the tuning law of the actor NN is

$$\begin{aligned} \begin{aligned} \dot{{\hat{W}}}_{a}=-a_{2}\{F_{2}{\hat{W}}_{a}-F_{2}{\hat{W}}_{c}-\frac{1}{4}{\bar{D}}_{1}{\hat{W}}_{a}m^{\textsf {T}}{\hat{W}}_{c}\}, \end{aligned} \end{aligned}$$
(17)

where \({\bar{D}}_{1}=\nabla \phi GR^{-1}G^{\textsf {T}}\nabla \phi ^{\textsf {T}}\), \(m=\frac{\sigma _{2}}{(\sigma _{2}^{\textsf {T}}\sigma _{2}+1)^{2}}\), \(\sigma _{2}=\nabla \phi (F+G{\hat{u}}_{o})\), and \(a_{2}\) is a scalar gain to be designed.
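A corresponding sketch of the actor output (16) and of one Euler step of (17) is given below, again assuming a scalar control input and scalar R; taking the design matrix \(F_{2}\) as a positive multiple of the identity is an illustrative choice, not the paper's prescription.

```python
import numpy as np

def actor_control(W_a_hat, grad_phi, G, R_inv):
    """Actor output (16): u_hat = -1/2 * R^{-1} * G^T grad_phi^T W_a_hat (scalar control)."""
    return -0.5 * R_inv * (G @ (grad_phi.T @ W_a_hat))

def actor_update(W_a_hat, W_c_hat, grad_phi, F, G, R_inv, F2, a2, dt):
    """One Euler step of the actor tuning law (17)."""
    D1_bar = R_inv * grad_phi @ np.outer(G, G) @ grad_phi.T  # grad_phi G R^{-1} G^T grad_phi^T
    u_hat = actor_control(W_a_hat, grad_phi, G, R_inv)
    sigma2 = grad_phi @ (F + G * u_hat)
    m = sigma2 / (sigma2 @ sigma2 + 1.0) ** 2                # m = sigma_2 / (sigma_2^T sigma_2 + 1)^2
    W_a_dot = -a2 * (F2 @ W_a_hat - F2 @ W_c_hat
                     - 0.25 * (D1_bar @ W_a_hat) * (m @ W_c_hat))
    return W_a_hat + dt * W_a_dot
```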

The actor NN and the critic NN are tuned simultaneously by an online algorithm that combines (15)–(17) with the FTDOB and the filtered error construction; an illustrative sketch of such a loop is given below.

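The sketch reuses the helper functions from the previous subsections (observer step, filtered error dynamics, critic and actor updates). The plant/observer wrapper `sim`, the exploration schedule, the step size, and the stopping rule are hypothetical placeholders, not the paper's exact algorithm.

```python
import numpy as np

def online_actor_critic(sim, T=100.0, dt=1e-3, a1=50.0, a2=1.0, T_explore=80.0):
    """Illustrative online loop for simultaneous actor-critic tuning.

    Per step: (i) read the filtered errors Z built from the plant states, the FTDOB
    estimates and the reference; (ii) compute the actor control (16) plus an
    exploration signal for PE; (iii) apply it and advance plant + observers;
    (iv) update the critic by (15) and the actor by (17).
    `sim` is a hypothetical wrapper exposing the quantities needed below.
    """
    N = sim.num_neurons
    W_c = np.random.uniform(0.0, 1.0, N)   # critic weights (random initialization)
    W_a = np.random.uniform(0.0, 1.0, N)   # actor weights
    F2 = np.eye(N)                          # actor design matrix (illustrative choice)
    t = 0.0
    while t < T:
        Z = sim.filtered_error()                                  # [z_1, ..., z_n]
        grad_phi = grad_phi_quadratic(Z)                          # Jacobian of the activation vector
        F = filtered_error_dynamics(Z, sim.nonlinearities(), sim.k, 0.0)  # drift F(Z) only
        G = np.zeros(len(Z)); G[-1] = 1.0
        u = actor_control(W_a, grad_phi, G, sim.R_inv)
        if t < T_explore:                                         # probing noise to satisfy PE
            u += sim.exploration(t)
        sim.step(u, dt)                                           # plant + FTDOB integration
        Q_val = Z @ sim.Q @ Z
        W_c = critic_update(W_c, grad_phi, F, G, u, Q_val, sim.R, a1, dt)
        W_a = actor_update(W_a, W_c, grad_phi, F, G, sim.R_inv, F2, a2, dt)
        t += dt
    return W_c, W_a
```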

3.4 Stability analysis

The following assumption is necessary for stability analysis in Theorem 2.

Assumption 2

[18] In equation (9), the NN approximation error, the NN activation functions, and their gradients are bounded on a compact set, i.e., \(\Vert \epsilon \Vert <b_{\epsilon }\), \(\Vert \phi \Vert <b_{\phi }\), \(\Vert \nabla \epsilon \Vert <b_{\epsilon _{x}}\), \(\Vert \nabla \phi \Vert <b_{\phi _{x}}\).

Theorem 2

Given system (8), the critic NN update law (15), the actor NN update law (17), and the controller \(u=u_{o}\), there exists a positive integer \(N_0\) such that, if the number of hidden layer units satisfies \(N > N_0\), the closed-loop system states, the critic NN approximation error, and the actor NN approximation error are UUB.

Proof

Choose the Lyapunov function as

$$\begin{aligned} \begin{aligned} V=J+\frac{1}{2}a_{1}^{-1}{\tilde{W}}_{c}^{\textsf {T}}{\tilde{W}}_{c}+\frac{1}{2}a_{2}^{-1}{\tilde{W}}_{a}^{\textsf {T}}{\tilde{W}}_{a}+\frac{1}{2}e^{\textsf {T}}e, \end{aligned} \end{aligned}$$
(18)

where \(e=[e_{i1}, e_{i2}]^{\textsf {T}}\) is the observer estimation error. Taking the derivative yields

$$\begin{aligned} \begin{aligned} {\dot{V}}={\dot{J}}+a_{1}^{-1}{\tilde{W}}_{c}^{\textsf {T}}\dot{{\tilde{W}}}_{c}+a_{2}^{-1}{\tilde{W}}_{a}^{\textsf {T}}\dot{{\tilde{W}}}_{a}+e^{\textsf {T}}{\dot{e}}. \end{aligned} \end{aligned}$$
(19)

Firstly, we have

$$\begin{aligned} \begin{aligned} {\dot{J}}&=W^{\textsf {T}}\nabla \phi {\dot{Z}}+\nabla \epsilon ^{\textsf {T}}{\dot{Z}}\\&=W^{\textsf {T}}\nabla \phi (F-\frac{1}{2}GR^{-1}G^{\textsf {T}}\nabla \phi ^{\textsf {T}}{\hat{W}}_{a})\\&\quad +\nabla \epsilon ^{\textsf {T}}(F-\frac{1}{2}GR^{-1}G^{\textsf {T}}\nabla \phi ^{\textsf {T}}{\hat{W}}_{a}). \end{aligned} \end{aligned}$$
(20)

Here we define \(\nabla \phi GR^{-1}G^{\textsf {T}}\nabla \phi ^{\textsf {T}}\) as \({\bar{D}}_{1}\) and \(\nabla \epsilon ^{\textsf {T}}(F-\frac{1}{2}GR^{-1}G^{\textsf {T}}\nabla \phi ^{\textsf {T}}{\hat{W}}_{a})\) as \(\mu _{1}\), so that \({\dot{J}}=W^{\textsf {T}}\sigma _{1}+\frac{1}{2}W^{\textsf {T}}{\bar{D}}_{1}{\tilde{W}}_{a}+\mu _{1}\). From the HJB Eq. (11), we have \(W^{\textsf {T}}\sigma _{1}=-Q-\frac{1}{4}W^{\textsf {T}}{\bar{D}}_{1}W+\upsilon _{H}\). Then, it follows that

$$\begin{aligned} \begin{aligned} {\dot{J}}&=-Q-\frac{1}{4}W^{\textsf {T}}{\bar{D}}_{1}W+\frac{1}{2}W^{\textsf {T}}{\bar{D}}_{1}{\tilde{W}}_{a}+\upsilon _{H}+\mu _{1}. \end{aligned} \end{aligned}$$
(21)

In addition, we have

$$\begin{aligned} \begin{aligned} a_{1}^{-1}{\tilde{W}}_{c}^{\textsf {T}}\dot{{\tilde{W}}}_{c} =&{\tilde{W}}_{c}^{\textsf {T}}\frac{\sigma _{2}}{(\sigma _{2}^{\textsf {T}}\sigma _{2}+1)^{2}}[\sigma _{2}^{\textsf {T}}{\hat{W}}_{c}+Q+{\hat{u}}_{o}^{\textsf {T}}R{\hat{u}}_{o}]\\ =&{\tilde{W}}_{c}^{\textsf {T}}\frac{\sigma _{2}}{(\sigma _{2}^{\textsf {T}}\sigma _{2}+1)^{2}}[-\sigma _{2}^{\textsf {T}}{\tilde{W}}_{c}\\&\quad +\frac{1}{4}{\tilde{W}}_{a}^{\textsf {T}}{\bar{D}}_{1}{\tilde{W}}_{a}+\upsilon _{H}]. \end{aligned} \end{aligned}$$
(22)

Based on the FTDOB, the estimation error becomes 0 after \(T_{d}\) seconds. With the update law of \({\hat{W}}_{a}\) given in (17), we have

$$\begin{aligned} \begin{aligned} {\tilde{W}}_{a}^{\textsf {T}}F_{2}{\hat{W}}_{a}-{\tilde{W}}_{a}^{\textsf {T}}F_{2}{\hat{W}}_{c} ={\tilde{W}}_{a}^{\textsf {T}}F_{2}W-{\tilde{W}}_{a}^{\textsf {T}}F_{2}{\tilde{W}}_{a}-{\tilde{W}}_{a}^{\textsf {T}}F_{2}W+{\tilde{W}}_{a}^{\textsf {T}}F_{2}{\tilde{W}}_{c}. \end{aligned} \end{aligned}$$
(23)

Then, combining (6), (15), (17) and (21), \({\dot{V}}\) can be obtained as

$$\begin{aligned} \begin{aligned} {\dot{V}}&=-Q-\frac{1}{4}W^{\textsf {T}}{\bar{D}}_{1}W\\&\quad +{\tilde{W}}_{c}^{\textsf {T}}\frac{\sigma _{2}}{(\sigma _{2}^{\textsf {T}}\sigma _{2}+1)^{2}}[-\sigma _{2}^{\textsf {T}}{\tilde{W}}_{c}+\upsilon _{H}]+\upsilon _{H}+\mu _{1}\\&\quad +\frac{1}{4}{\tilde{W}}_{a}^{\textsf {T}}{\bar{D}}_{1}{\hat{W}}_{a}\frac{{\bar{\sigma }}^{\textsf {T}}}{m_{s}}{\tilde{W}}_{c}\\&\quad +\frac{1}{2}W^{\textsf {T}}{\bar{D}}_{1}{\tilde{W}}_{a}+\frac{1}{4}{\tilde{W}}_{a}^{\textsf {T}}{\bar{D}}_{1}W\frac{{\bar{\sigma }}^{\textsf {T}}}{m_{s}}{\tilde{W}}_{a}\\&\quad -\frac{1}{4}{\tilde{W}}_{a}^{\textsf {T}}{\bar{D}}_{1}W\frac{{\bar{\sigma }}^{\textsf {T}}}{m_{s}}W+{\tilde{W}}_{a}^{\textsf {T}}F_{2}W\\&\quad -{\tilde{W}}_{a}^{\textsf {T}}F_{2}{\tilde{W}}_{a}-{\tilde{W}}_{a}^{\textsf {T}}F_{2}W+{\tilde{W}}_{a}^{\textsf {T}}F_{2}{\tilde{W}}_{c}, \end{aligned} \end{aligned}$$
(24)

where \({\bar{\sigma }}=\frac{\sigma _{2}}{\sigma _{2}^{\textsf {T}}\sigma _{2}+1}\), \(m_{s}=\sigma _{2}^{\textsf {T}}\sigma _{2}+1\).

It is obvious that under Assumption 2,

$$\begin{aligned} \begin{aligned} \mu _{1}\!<\!b_{\epsilon _{x}}\Vert b_{f}\Vert \Vert x\Vert \!+\!\frac{b_{\epsilon _{x}}b_{\phi _{x}}b_{g}^{2}\sigma _{\min }(R)(\Vert W\Vert \!+\!\Vert {\tilde{W}}_{a}\Vert )}{2}. \end{aligned} \end{aligned}$$
(25)

As shown in [18], \(\upsilon _{H}\) converges to 0 as the number of neurons increases. Hence, \(N_{0}\) can be selected such that \(\sup \nolimits _{x\in \Omega }\Vert \upsilon _{H}\Vert <\upsilon\). Assuming \(N>N_{0}\) and defining \({\tilde{Z}}=[Z,\quad {\tilde{W}}_{c},\quad {\tilde{W}}_{a},\quad e]^{\textsf {T}}\), we have

$$\begin{aligned} \begin{aligned} {\dot{V}}&<-\Vert {\tilde{Z}}\Vert ^{2}\sigma _{\min }(M)+\Vert p\Vert \Vert {\tilde{Z}}\Vert +c+\upsilon , \end{aligned} \end{aligned}$$
(26)

where \(c=\frac{1}{4}\Vert W\Vert ^{2}\Vert {\bar{D}}_{1}\Vert +\upsilon +\frac{1}{2}\Vert W\Vert b_{\epsilon _{x}}b_{\phi _{x}}b_{g}^{2}\sigma _{\min }(R)\),

$$\begin{aligned} \begin{aligned} M&=\left[ \begin{array}{cccc} qI&0&0&0\\ 0&I&\left( -\frac{F_{2}}{2}-\frac{{\bar{D}}_{1}W}{8m_{s}}\right) ^{\textsf {T}}&0\\ 0&\left( -\frac{F_{2}}{2}-\frac{{\bar{D}}_{1}W}{8m_{s}}\right) &F_{2}-\frac{{\bar{D}}_{1}Wm^{\textsf {T}}+mW^{\textsf {T}}{\bar{D}}_{1}}{8}&0\\ 0&0&0&\frac{1}{2}\\ \end{array} \right] ,\\ p&=\left[ \begin{array}{c} b_{\epsilon _{x}}b_{f}\\ \frac{\upsilon }{m_{s}}\\ \frac{1}{2}\left( {\bar{D}}_{1}-\frac{{\bar{D}}_{1}Wm^{\textsf {T}}}{4}\right) W+\frac{b_{\epsilon _{x}}b_{\phi _{x}}b_{g}^{2}\sigma _{\min }(R)}{2}\\ -\frac{1}{2}e+f(e)+D \end{array} \right] . \end{aligned} \end{aligned}$$
(27)

Let the parameters be chosen such that \(M>0\). If \(\Vert {\tilde{Z}}\Vert >\sqrt{\frac{\Vert p\Vert ^{2}}{4\sigma _{\min }^{2}(M)}+\frac{c+\upsilon }{\sigma _{\min }(M)}}+\frac{\Vert p\Vert }{2\sigma _{\min }(M)}\), then \({\dot{V}}\) is negative. Hence, the states and the weight errors are UUB. \(\square\)

4 Examples

In this section, a linear system is presented first to show that the designed update laws guarantee the convergence of the weights to their ideal values. Then, a nonlinear system example is employed to highlight the effectiveness of the proposed method.

4.1 Linear system example

Consider a linear system, \({\dot{x}}_{1}= -x_{1}-2x_{2}+u\), \({\dot{x}}_{2}= x_{1}-4x_{2}-3u\), where \(x_{1}\) and \(x_{2}\) are the system states and u is the control input. Choose the cost function as \(J=\int _{0}^{\infty }(x^{\textsf {T}}Qx+u^{\textsf {T}}Ru)\mathrm{{d}}t\), where \(Q=\mathrm{diag}(1, 1)\) and \(R=1\).

Clearly, the optimal controller based on linear quadratic regulator (LQR) theory can be easily found. Hence, the ideal NN weights can also be deduced as \(W=\left[ \begin{array}{ccc} 0.3199&-0.1162&0.1292 \end{array} \right]\). For this system, the NN-based optimal control is implemented as (16), and the NN tuning laws are selected as (15) and (17). In order to ensure the PE condition during NN convergence, we add the noise signal \(0.5(\sin ^{2}(t)\cos (t)+\sin ^{2}(2t)\cos (0.1t)+\sin ^{2}(-1.2t)\cos (0.5t)+\sin ^{5}(t))\) to the control input. The reference signal is set as \(r=0\). The simulation results are shown in Fig. 2. The weights converge to the optimal values after 50 s, i.e., \({\hat{W}}_{c}=\left[ \begin{array}{ccc} 0.3199&-0.1162&0.1292 \end{array} \right]\) and \({\hat{W}}_{a}=\left[ \begin{array}{ccc} 0.3199&-0.1162&0.1292 \end{array} \right]\). The optimal controller approximated by the NNs is given as

$$\begin{aligned} \begin{aligned} {\hat{u}}&=-\frac{R^{-1}}{2}\left[ \begin{array}{cccccc} \!1 \\ \!-3 \end{array} \right] ^{T}\left[ \begin{array}{cccccc} 2x_{1}&{}0\\ x_{2}&{}x_{1}\\ 0&{}2x_{2} \end{array} \right] ^{T}\left[ \begin{array}{cccccc} 0.3199\\ -0.1162\\ 0.1292 \end{array} \right] . \end{aligned} \end{aligned}$$
(28)
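The reported ideal weights can be cross-checked independently. For the quadratic basis \(\phi (x)=[x_{1}^{2}, x_{1}x_{2}, x_{2}^{2}]^{\textsf {T}}\) implied by (28), the optimal cost is \(J^{*}=x^{\textsf {T}}Px\) with P the stabilizing solution of the algebraic Riccati equation, so \(W=[P_{11}, 2P_{12}, P_{22}]^{\textsf {T}}\). A short sketch using SciPy (the use of SciPy is an assumption about tooling; any ARE solver works) is:

```python
import numpy as np
from scipy.linalg import solve_continuous_are

# Linear example of Sect. 4.1: x1_dot = -x1 - 2*x2 + u, x2_dot = x1 - 4*x2 - 3*u
A = np.array([[-1.0, -2.0],
              [ 1.0, -4.0]])
B = np.array([[1.0],
              [-3.0]])
Q = np.eye(2)
R = np.array([[1.0]])

P = solve_continuous_are(A, B, Q, R)                   # stabilizing ARE solution, J* = x^T P x
W_ideal = np.array([P[0, 0], 2.0 * P[0, 1], P[1, 1]])  # weights for phi = [x1^2, x1*x2, x2^2]
print(W_ideal)  # reproduces, up to rounding, the ideal weights reported above
```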
Fig. 2 a Trajectories of the system states under (16) with reference signal \(r=0\). Before 80 s, the states fluctuate constantly due to the presence of the excitation signal. After the excitation signal is removed, the system states reach a bound near the equilibrium point \((0,0)^{\textsf {T}}\). b Critic NN (CNN) weights. Driven by update law (15), the weights of the CNN eventually converge to within a bounded range of the ideal weights. c Actor NN (ANN) weights. Driven by update law (17), the weights of the ANN eventually converge to within a bounded range of the ideal weights

The excitation signal is introduced to satisfy the PE condition, so that sufficiently rich data are generated to train the neural networks and ensure their convergence. After 80 s, the neural networks have converged. The exploration signal is then removed, and the system states remain near 0 afterwards.

4.2 Nonlinear system example

Firstly, we consider a reference signal \(r=0\). In this case, the tracking problem reduces to a stabilization problem. The exploration signal is chosen as \(200e^{-0.23t}(\sin ^{2}(t)\cos (t)+\sin ^{2}(2t)\cos (0.1t)+\sin ^{2}(-1.2t)\cos (0.5t)+\sin ^{5}(t)+\sin ^{2}(1.12t)+\cos (2.4t)\sin ^{3}(2.4t))\), and the corresponding results are depicted in Fig. 3.

Fig. 3 a Trajectories of the system states under (16) with reference signal \(r=0\). Before 80 s, the states fluctuate constantly due to the presence of the excitation signal. b CNN weights. Driven by update law (15), the weights of the CNN eventually converge to within a bounded range of the ideal weights. c ANN weights. Driven by update law (17), the weights of the ANN eventually converge to within a bounded range of the ideal weights

Then, we set \(r=5\) and choose the exploration signal as \(200e^{-0.35t}(\sin ^{2}(t)\cos (t)+\sin ^{2}(2t)\cos (0.1t)+\sin ^{2}(-1.2t)\cos (0.5t)+\sin ^{5}(t)+\sin ^{2}(1.12t)+\cos (2.4t)\sin ^{3}(2.4t))\). The results are depicted in Fig. 4.

Fig. 4 a Trajectories of the system states under (16) with reference signal \(r=5\). Before 80 s, the states fluctuate constantly due to the presence of the excitation signal. After the excitation signal is removed, state \(x_{1}\) converges to 5 and \(x_{2}\) reaches the equilibrium point. b CNN weights. Driven by update law (15), the weights of the CNN eventually converge to within a bounded range of the ideal weights. c ANN weights. Driven by update law (17), the weights of the ANN eventually converge to within a bounded range of the ideal weights

In our simulation, the sampling time is relatively small (0.001 s). It is therefore reasonable to increase the decay rate of the exponential envelope of the excitation signal, which reduces the overall training time and avoids wasting computational resources. However, in practical systems, hardware limitations often prevent maintaining such a small sampling time. In such cases, as highlighted in [2, 3], it becomes crucial to ensure that the excitation signal does not decay too rapidly, so that an ample amount of data is available for training the neural networks.
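For reproducibility, the two exploration signals quoted above can be written as simple functions; only the decay rate of the exponential envelope differs between the two runs, which is exactly the trade-off discussed in this paragraph (the function names are ours).

```python
import numpy as np

def probing_terms(t):
    """Sum-of-sinusoids probing signal shared by both runs of Sect. 4.2."""
    return (np.sin(t)**2 * np.cos(t) + np.sin(2*t)**2 * np.cos(0.1*t)
            + np.sin(-1.2*t)**2 * np.cos(0.5*t) + np.sin(t)**5
            + np.sin(1.12*t)**2 + np.cos(2.4*t) * np.sin(2.4*t)**3)

def exploration_r0(t):
    """Exploration signal used for the r = 0 run (decay rate 0.23)."""
    return 200.0 * np.exp(-0.23 * t) * probing_terms(t)

def exploration_r5(t):
    """Exploration signal used for the r = 5 run (decay rate 0.35)."""
    return 200.0 * np.exp(-0.35 * t) * probing_terms(t)
```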

5 Conclusion

This paper focused on the design of robust optimal controllers for high-order nonlinear systems in the presence of mismatched disturbances. The proposed approach involves the design of disturbance observers that ensure fixed-time convergence. Subsequently, the original system is transformed into a filtered error nonlinear system. To address the challenges associated with solving Hamilton–Jacobi–Bellman (HJB) equations, the reinforcement learning method has been introduced. Two neural networks have been designed to approximate the cost function and the optimal control, respectively. By integrating these components, a robust optimal controller is finally obtained. The effectiveness of the proposed method has been validated through two illustrative examples.