1 Introduction

The objective of optimal tracking control is to design a controller that makes the system’s output track a specified reference signal while minimizing a prescribed performance index. This field has attracted significant attention and research effort, finding applications in practical domains such as chaotic systems, helicopters, permanent magnet synchronous motors, dispatch, and electric vehicles [1,2,3,4,5]. Optimal control techniques rely on Pontryagin’s minimum principle. For linear systems, the optimal control is obtained by solving the algebraic Riccati equation, as shown in [6]. For nonlinear systems, the optimal control necessitates the solution of the nonlinear Hamilton-Jacobi-Bellman (HJB) equation. Despite the practical utility of optimal control, the conventional methodology encounters a significant challenge, namely, the difficulty of solving the nonlinear HJB equation for higher-order systems [7,8,9,10].

In recent years, numerous efforts have been made to obtain the optimal controller, including inverse optimal control, \(\theta\)-D techniques, numerical approximation methods, and others [11, 13, 14]. The inverse optimal control method, presented in [11, 12], offers a solution that avoids the need to solve the HJB equation. For nonlinear systems, a suboptimal control approach was proposed in [13]. Another approach, described in [14], employed a \(\theta\)-D approximation method to solve the HJB equation by transforming it into state-dependent Lyapunov equations. It is important to note that these methods, although effective, are typically performed offline. Consequently, when the system parameters change, the control performance may degrade. To address this issue, researchers have explored the integration of reinforcement learning and adaptive control with optimal control [7, 15,16,17,18,19,20,21].

Approximate dynamic programming (ADP), proposed by [7] in 1992, utilizes function approximation structures to approximate the cost function and control strategy in the dynamic programming equation. ADP has been developed in subsequent works [15,16,17] using neural networks (NNs) to achieve optimal tracking control. These methods have been thoroughly studied and widely adopted [18, 24]. Furthermore, advancements in hardware have paved the way for data-driven approaches in optimal control. For example, [22] introduced a computational adaptive optimal controller for linear systems with completely unknown dynamics. Nonlinear adaptive optimal control was achieved through value iteration and ADP, as described in [23].

Inspired by this, we have incorporated the principles of adaptive and reinforcement learning to develop efficient tracking controllers using an actor-critic approach. Nevertheless, previous studies such as [25, 26] have highlighted a limitation of optimal tracking control, which involves the introduction of a discount factor into the performance index. This factor is intended to prevent the index from growing indefinitely, but it can hinder the convergence of the system state to zero. To address this issue, our paper proposes a reinforcement learning-based tracking control technique that utilizes a filtered error system, thereby eliminating the need for a discount factor.

In practical systems, the presence of disturbances is an inevitable issue [27, 28, 35]. These disturbances encompass both internal factors, such as unmodeled dynamics, perturbed model parameters, and structural perturbations, and external environmental disturbances [37]. To achieve the desired control outcomes, including improved disturbance rejection, fast dynamic response, and minimal steady-state error, it is crucial to explore highly reliable controllers. Extensive research has been conducted on various anti-disturbance control methods, such as robust control [29], sliding mode control [30, 31], and output regulation theory [32]. Among these methods, two approaches have gained attention for their ability to achieve fast disturbance suppression based on system dynamics: disturbance observer-based control and active disturbance rejection control [33,34,35]. By employing disturbance observers or extended state observers to estimate and actively compensate for disturbances, their influence can be effectively mitigated [35].

However, mismatched disturbances are difficult to handle, as highlighted in [36, 37]. In [37], the authors proposed a composite control strategy based on the backstepping method for higher-order nonlinear systems with non-vanishing disturbances. By incorporating the disturbance estimate at each step of the virtual control design, the output is regulated to zero. While this method effectively handles mismatched disturbances, it is not optimal for two reasons. First, the nonlinearity is canceled at each step of the virtual control design. Second, the gain of the virtual control is assigned artificially and only satisfies the condition for making the derivative of the Lyapunov function negative definite. Therefore, we employ the idea of backstepping to construct a filtered error system that retains the nonlinear terms, ensuring optimality while dealing with mismatched disturbances.

Furthermore, the majority of existing studies focus on achieving asymptotic estimates of disturbances, implying that estimation errors persist even as the system converges. To mitigate the impact of disturbances, researchers have proposed fixed-time observers [38,39,40]. This approach involves estimating unknown disturbances within a predetermined time period, thereby minimizing their subsequent effects. In our study, we also employ a fixed-time disturbance observer (FTDOB) to estimate disturbances and reduce their influence on the neural network training process.

Therefore, this paper aims to address the limitations of existing optimal control methods and anti-disturbance methods in order to tackle more complex scenarios. The primary contributions of this paper are as follows:

  • Two neural networks are utilized to implement an actor-critic network, enabling the approximation of both the optimal control and cost function.

  • The fixed-time algorithm is employed in the design of the observer, allowing for the estimation of disturbances over a predetermined time interval, thereby enhancing the reliability of the control strategy.

  • Filtered error systems are constructed to attain an optimal controller for high-order nonlinear systems affected by mismatched disturbances.

The rest of the paper is organized as follows. In Sect. 2, the system description and some necessary definitions are given. Section 3 presents the main results on disturbance observer design and controller design. Simulation examples are given in Sect. 4, and the conclusion is given in Sect. 5.

2 System descriptions and some preliminaries

Consider the following disturbed nonlinear system,

$$\begin{aligned} \left\{ \begin{aligned} {\dot{x}}_{i}&= x_{i+1}+f_{i}+d_{i}, ~~i=1, 2,\ldots , n-1,\\ {\dot{x}}_{n}&= f_{n}+u+d_{n}, \end{aligned} \right. \ \end{aligned}$$
(1)

where \(x_{i}\), \(d_{i}\), \(f_{i}\), \(i=1, 2,\ldots , n\) denote the system states, disturbances and nonlinear functions, respectively, and u is the control input. Complete state information is assumed to be available.

Assumption 1

There exists a sufficiently small constant \(\xi\) such that \(\Vert {\dot{d}}\Vert <\xi\), where \(d=[d_{1},d_{2},\ldots ,d_{n}]^{\textsf {T}}\).

Here, we recall the optimal control theory [6]. For the nominal system, i.e., without considering the disturbance, a cost function is given as

$$\begin{aligned} \begin{aligned} J=\int _{0}^{\infty }[Q(x)+u^{\textsf {T}}Ru]\mathrm{{d}}t, \end{aligned} \end{aligned}$$
(2)

where Q(x) is a positive definite function and R is a symmetric positive definite constant matrix. Writing the nominal dynamics in the control-affine form \({\dot{x}}=f+gu\), define \(\frac{\partial J}{\partial x}=\nabla J\) and choose the Hamiltonian as \(H= \nabla J^{\textsf {T}}{\dot{x}}+Q+u^{\textsf {T}}Ru\). Then, the optimal value function \(J^{*}\) satisfies \(0= \min _{u}[H(x,u,\nabla J^{*})]\). With the optimal control policy \(u^{*}\), the HJB equation becomes

$$\begin{aligned} \begin{aligned} 0&= Q+u^{*\textsf {T}}Ru^{*}+\nabla J^{*\textsf {T}}(f+gu^{*}). \end{aligned} \end{aligned}$$
(3)

Then, we have the optimal control input \(u^{*}\) as

$$\begin{aligned} \begin{aligned} u^{*}= \arg \min \limits _{u}[H(x,u,\nabla J^{*})]=-\frac{1}{2}R^{-1}g^{\textsf {T}}\nabla J^{*}. \end{aligned} \end{aligned}$$
(4)
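For completeness, (4) follows from the stationarity of the Hamiltonian with respect to u: since H is quadratic in u,

$$\begin{aligned} \begin{aligned} \frac{\partial H}{\partial u}=2Ru+g^{\textsf {T}}\nabla J^{*}=0 \quad \Longrightarrow \quad u^{*}=-\frac{1}{2}R^{-1}g^{\textsf {T}}\nabla J^{*}, \end{aligned} \end{aligned}$$

and substituting \(u^{*}\) back into \(0=\min _{u}[H(x,u,\nabla J^{*})]\) recovers the HJB equation (3).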

The existing optimal control methods face two challenges: (1) robustness in the presence of disturbances, especially mismatched disturbances; (2) the complexity of the nonlinear HJB equation, whose solution is very resource-intensive. Hence, we propose a robust optimal control strategy based on NNs and disturbance observers, which will be given in detail in Sect. 3. Next, we provide one definition for the later development.

Definition 1

The equilibrium \(x_{e}\) of system (1) is uniformly ultimately bounded (UUB) if there is a compact set \(S\subset {\mathbb {R}}^{n}\) such that, for any initial value \(x_{0}\) belonging to that compact set and any initial time \(t_{0}\), there exist an upper bound B and a time \(T(B,x_{0})\) such that \(\Vert x(t)-x_{e}\Vert \le B\) for all \(t>t_{0}+T\).

3 Main results

The classic control method usually adopts the idea of feedback control plus feedforward control [35], but it has the following two shortcomings: (1) an asymptotically convergent observer causes the estimation error to persist; (2) feedback control can only stabilize the system, without optimality. This paper avoids these shortcomings by fusing fixed-time estimation with reinforcement learning. Figure 1 visually represents the core concepts discussed in this paper. The output of the system is directly used as the input of the disturbance observer. By choosing the observer gains reasonably, complete tracking of the disturbance can be realized within a fixed time. Then, the original system, together with the disturbance estimate, is transformed into a filtered error system, which enables us to deal with mismatched disturbances well. Under the framework of optimal control, a reinforcement learning method relying on actor and critic NNs is proposed. By training the NNs, the optimal controller of the error system is obtained.

Fig. 1 Reinforcement learning-based robust optimal control strategy. The fixed-time observer provides an accurate estimate of the disturbance. By compensating for it in the original system, filtered error systems are constructed. Actor and critic NNs are used to achieve reinforcement learning optimal control

Firstly, we design the fixed-time disturbance observers. With the disturbance estimates in hand, the original system is then transformed into a filtered error system.

3.1 Fixed-time disturbance observer design

The fixed-time disturbance observer is designed for each channel as

$$\begin{aligned} \left\{ \begin{aligned} {\dot{z}}_{i1}&= z_{i2}-\lambda _{1}(z_{i1}-x_{i})^{\alpha _{1}}-\lambda _{2}(z_{i1}-x_{i})^{\beta _{1}}+x_{i+1}+f_{i},\\ {\dot{z}}_{i2}&= -\lambda _{3}(z_{i1}-x_{i})^{\alpha _{2}}-\lambda _{4}(z_{i1}-x_{i})^{\beta _{2}}, \end{aligned} \right. \ \end{aligned}$$
(5)

where \(i=1, 2,\ldots , n\) (for \(i=n\), the term \(x_{i+1}\) is replaced by the control input u, in accordance with (1)), \(z_{i1}\), \(z_{i2}\) are the estimates of \(x_{i}\) and \(d_{i}\), \(\lambda _{1}\), \(\lambda _{2}\), \(\lambda _{3}\), \(\lambda _{4}\) are observer gains to be designed, and \(\alpha _{1}\), \(\alpha _{2}\), \(\beta _{1}\), \(\beta _{2}\) are observer internal parameters.
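As an illustration, a minimal numerical sketch of one observer channel of (5) is given below. It assumes the fractional powers are implemented as signed powers \(\mathrm {sig}(e,a)=|e|^{a}\mathrm {sign}(e)\) (a common convention in fixed-time designs) and uses a simple Euler step; the function names and the integration scheme are illustrative, not part of the paper.

```python
import numpy as np

def sig(e, a):
    """Signed power |e|^a * sign(e); assumed interpretation of the fractional exponents."""
    return np.abs(e) ** a * np.sign(e)

def ftdob_step(z1, z2, x_i, x_next, f_i, g, dt):
    """One Euler step of the fixed-time disturbance observer (5) for channel i.

    z1, z2 : current estimates of x_i and d_i
    x_i    : measured state of channel i
    x_next : x_{i+1} (or the control input u when i = n)
    f_i    : known nonlinearity f_i evaluated at the current state
    g      : dict of gains/exponents lambda1..lambda4, alpha1, alpha2, beta1, beta2
    dt     : step size (illustrative; any ODE solver could be used instead)
    """
    e = z1 - x_i  # observer output error z_{i1} - x_i
    z1_dot = (z2 - g["lambda1"] * sig(e, g["alpha1"])
                 - g["lambda2"] * sig(e, g["beta1"]) + x_next + f_i)
    z2_dot = -g["lambda3"] * sig(e, g["alpha2"]) - g["lambda4"] * sig(e, g["beta2"])
    return z1 + dt * z1_dot, z2 + dt * z2_dot
```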

Theorem 1

Given system (1), if the observer gains are chosen properly, the disturbance can be estimated within a fixed time \(T_{d}\) that is independent of the initial values.

Proof

Define the estimation errors as \(e_{i1}=x_{i}-z_{i1}\), \(e_{i2}=d_{i}-z_{i2}\). Differentiating \(e_{i1}\) and \(e_{i2}\) with respect to time gives

$$\begin{aligned} \left\{ \begin{aligned} {\dot{e}}_{i1}&= e_{i2}-\lambda _{1}(e_{i1})^{\alpha _{1}}-\lambda _{2}(e_{i1})^{\beta _{1}},\\ {\dot{e}}_{i2}&= -\lambda _{3}(e_{i1})^{\alpha _{2}}-\lambda _{4}(e_{i1})^{\beta _{2}}+{\dot{d}}_{i}. \end{aligned} \right. \ \end{aligned}$$
(6)

As long as the observer gains are chosen carefully, the estimation error dynamics, which can be written as \({\dot{e}}=\Lambda (e)+D\) with \(D=[0,\quad {\dot{d}}_{i}]^{\textsf {T}}\), are fixed-time convergent. The rest of the proof is similar to that in [31] and is omitted here. \(\square\)

Under the designed observer, the mismatched disturbance can be handled. With the help of the backstepping method, define the tracking error \(z_{1}=x_{1}-r\), where r is the reference signal, so that \({\dot{z}}_{1}=x_{2}+f_{1}+d_{1}-{\dot{r}}\). Denote \(z_{2}=x_{2}-x_{2}^{*}\) and choose the virtual control \(x_{2}^{*}=-k_{1}z_{1}-{\hat{d}}_{1}+{\dot{r}}\); then \({\dot{z}}_{1}={\dot{x}}_{1}-{\dot{r}}=z_{2}-k_{1}z_{1}+f_{1}+e_{1}\), where \(e_{1}=d_{1}-{\hat{d}}_{1}\) is the disturbance estimation error. Likewise, we have

$$\begin{aligned} \left\{ \begin{aligned} {\dot{z}}_{i}&= z_{i+1}-k_{i}z_{i}+f_{i}+e_{i}, i=1,\ldots , n-1,\\ {\dot{z}}_{n}&= u_{o}+f_{n}+e_{n}. \end{aligned} \right. \ \end{aligned}$$
(7)

Then (7) is rewritten as

$$\begin{aligned} \begin{aligned} {\dot{Z}}&=F(Z)+Gu_{o}, \end{aligned} \end{aligned}$$
(8)

where \(Z=[z_{1},z_{2},\ldots ,z_{n}]^{\textsf {T}}\), \(F(Z)=[z_{2}+f_{1}-k_{1}z_{1}, \cdots , f_{n}-k_{n}z_{n}]^{\textsf {T}}\), \(G=[0,0,\ldots , 1]^{T}\).
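For reference, a brief sketch of the right-hand side of (8) is given below, with the observer estimation errors \(e_{i}\) of (7) omitted (i.e., the nominal form that holds after the FTDOB has converged). Variable names are illustrative, and a scalar control input is assumed, consistent with (1).

```python
import numpy as np

def filtered_error_dynamics(Z, f_vals, k, u_o):
    """Right-hand side of (8): Zdot = F(Z) + G * u_o.

    Z      : filtered errors [z_1, ..., z_n]
    f_vals : nonlinearities [f_1, ..., f_n] evaluated at the current state
    k      : backstepping gains [k_1, ..., k_n]
    u_o    : scalar control input of the transformed system
    """
    n = len(Z)
    F = np.empty(n)
    for i in range(n - 1):
        F[i] = Z[i + 1] + f_vals[i] - k[i] * Z[i]   # z_{i+1} + f_i - k_i z_i
    F[n - 1] = f_vals[n - 1] - k[n - 1] * Z[n - 1]  # f_n - k_n z_n
    G = np.zeros(n)
    G[-1] = 1.0                                     # G = [0, ..., 0, 1]^T
    return F + G * u_o
```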

Remark 1

Canceling the nonlinearity in the backstepping method leads to a nonoptimal controller, as the nonlinearity may actually be beneficial in meeting the stabilization and/or performance objectives [11].

Remark 2

During the actual production process, the controlled system often encounters abrupt disturbances that can be characterized as lumped disturbances [35]. These types of disturbances do not satisfy Assumption 1. Nevertheless, the proposed control strategy is still able to stabilize the system and exhibits a certain level of robustness. This is because, even in the presence of sudden disturbance changes, the designed observer is able to estimate the disturbance within a fixed time. It is worth noting that the nonlinear function employed in the controller design is represented as \(f+e\). However, since the term e exists only transiently and eventually diminishes to zero, the overall effect on the controller’s performance is minimal.

Following the former section, we define \(\frac{\partial J}{\partial Z}=\nabla J\) and choose the Hamiltonian as \(H= \nabla J^{\textsf {T}}{\dot{Z}}+Z^{\textsf {T}}QZ+u_{o}^{\textsf {T}}Ru_{o}\). Then, we have the optimal control as \(u_{o}^{*}= \arg \min \limits _{u_{o}}[H(Z,u_{o},\nabla J^{*})]=-\frac{1}{2}R^{-1}G^{\textsf {T}}\nabla J^{*}\), satisfying \(0= Q+u_{o}^{*\textsf {T}}Ru_{o}^{*}+\nabla J^{*\textsf {T}}(F+Gu_{o}^{*})\).

3.2 Critic NN design

The cost function is approximated by a critic neural network,

$$\begin{aligned} \begin{aligned} J=W^{\textsf {T}}\phi (Z)+\epsilon (Z), \end{aligned} \end{aligned}$$
(9)

where W denotes the ideal neuron weights, \(\phi (Z): {\mathbb {R}}^{n}\rightarrow {\mathbb {R}}^{N}\) is the NN activation function vector, N stands for the number of neurons in the hidden layer, and \(\epsilon (Z)\) is the approximation error. As \(N\rightarrow \infty\), \(\epsilon (Z)\rightarrow 0\). Since the ideal weight W is unknown, the output of the critic neural network is expressed as

$$\begin{aligned} \begin{aligned} {\hat{J}}&={\hat{W}}_{c}^{\textsf {T}}\phi (Z), \end{aligned} \end{aligned}$$
(10)

where \({\hat{W}}_{c}\) is the estimate of W.
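As a concrete (and merely illustrative) choice, a quadratic polynomial activation vector and the corresponding critic output (10) can be coded as follows; for \(n=2\) this basis, \(\phi (Z)=[z_{1}^{2}, z_{1}z_{2}, z_{2}^{2}]^{\textsf {T}}\), is the one implicitly used in the linear example of Sect. 4.1, but other activation functions are equally admissible.

```python
import numpy as np

def phi_quadratic(Z):
    """Quadratic activation vector: all monomials z_i * z_j with i <= j."""
    n = len(Z)
    return np.array([Z[i] * Z[j] for i in range(n) for j in range(i, n)])

def grad_phi_quadratic(Z):
    """Jacobian of phi_quadratic with respect to Z, shape (N, n)."""
    n = len(Z)
    eye = np.eye(n)
    return np.array([Z[i] * eye[j] + Z[j] * eye[i]
                     for i in range(n) for j in range(i, n)])

def critic_value(W_c_hat, Z):
    """Critic output (10): J_hat = W_c_hat^T phi(Z)."""
    return W_c_hat @ phi_quadratic(Z)
```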

Considering (9) and (10), the corresponding Hamilton functions are rewritten as

$$\begin{aligned} \begin{aligned} H(Z,u_{o},W)&=W^{\textsf {T}}\nabla \phi (F+Gu_{o})+Q+u_{o}^{\textsf {T}}Ru_{o}+\upsilon _{H} \end{aligned} \end{aligned}$$
(11)

and

$$\begin{aligned} \begin{aligned} H(Z,u_{o},{\hat{W}}_{c})&={\hat{W}}_{c}^{\textsf {T}}\nabla \phi (F+Gu_{o})+Q+u_{o}^{\textsf {T}}Ru_{o}, \end{aligned} \end{aligned}$$
(12)

where \(\upsilon _{H}=\nabla \epsilon (F+Gu_{o})\).

Define the critic NN weight estimation error \({\tilde{W}}_{c}=W-{\hat{W}}_{c}\); then we have

$$\begin{aligned} \begin{aligned} e_{H}=H(Z,u_{o},W)-H(Z,u_{o},{\hat{W}}_{c}) ={\tilde{W}}_{c}^{\textsf {T}}\nabla \phi (F+Gu_{o})+\upsilon _{H}. \end{aligned} \end{aligned}$$
(13)

Given any admissible control policy, it is desired to select \({\hat{W}}_{c}\) to minimize the quadratic error

$$\begin{aligned} \begin{aligned} E=\frac{1}{2}e_{H}^{\textsf {T}}e_{H}. \end{aligned} \end{aligned}$$
(14)

The normalized gradient algorithm is adopted to tune the critic weights

$$\begin{aligned} \begin{aligned} \dot{{\hat{W}}}_{c}=-a_{1}\frac{\partial E}{\partial {\hat{W}}_{c}}=-a_{1}\frac{\sigma _{1}}{(\sigma _{1}^{\textsf {T}}\sigma _{1}+1)^{2}}[\sigma _{1}^{\textsf {T}}{\hat{W}}_{c}+Q+u_{o}^{\textsf {T}}Ru_{o}],\\ \end{aligned} \end{aligned}$$
(15)

where \(\sigma _{1}=\nabla \phi (F+Gu_{o})\), \((\sigma _{1}^{\textsf {T}}\sigma _{1}+1)^{2}\) is used for normalization, and \(a_{1}\) is a scalar gain to be designed.
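A one-step numerical sketch of (15) is shown below, assuming a scalar control input and scalar R for brevity (the matrix case is analogous); the Euler step and the argument names are illustrative.

```python
import numpy as np

def critic_update(W_c_hat, grad_phi, F, G, u_o, Q_val, R, a1, dt):
    """One Euler step of the critic tuning law (15).

    grad_phi : Jacobian of the activation vector phi(Z), shape (N, n)
    F, G     : drift and input vectors of the filtered error system (8)
    u_o      : scalar control input currently applied
    Q_val    : Z^T Q Z evaluated at the current Z
    R        : scalar control weight
    a1       : critic learning gain
    """
    sigma1 = grad_phi @ (F + G * u_o)              # sigma_1 = grad_phi (F + G u_o)
    norm = (sigma1 @ sigma1 + 1.0) ** 2            # normalization term
    e_H = sigma1 @ W_c_hat + Q_val + R * u_o ** 2  # approximate Hamiltonian residual
    W_c_dot = -a1 * sigma1 / norm * e_H
    return W_c_hat + dt * W_c_dot
```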

As mentioned in [2, 18], the identification of the critic parameter needs to fulfill the persistent excitation (PE) condition. In order to satisfy this condition, there are numerous options available for the signal selection, as long as the PE condition outlined in [18] is met.

3.3 Actor NN design

According to (4), we know the optimal control can be written as \(-\frac{1}{2}R^{-1}G^{\textsf {T}}(\nabla \phi ^{\textsf {T}}W+\nabla \epsilon )\). Since the parameter W is unknown, we utilize an actor NN to approximate the control input. The controller is then represented as

$$\begin{aligned} \begin{aligned} {\hat{u}}_{o}&=-\frac{1}{2}R^{-1}G^{\textsf {T}}\nabla \phi ^{\textsf {T}}{\hat{W}}_{a}, \end{aligned} \end{aligned}$$
(16)

where \({\hat{W}}_{a}\) denotes the estimated value of W.

Similarly, \({\hat{W}}_{a}\) should be designed to approach W as closely as possible. Here, the tuning law of the actor NN is

$$\begin{aligned} \begin{aligned} \dot{{\hat{W}}}_{a}=-a_{2}\{F_{2}{\hat{W}}_{a}-F_{2}{\hat{W}}_{c}-\frac{1}{4}{\bar{D}}_{1}{\hat{W}}_{a}m^{\textsf {T}}{\hat{W}}_{c}\}, \end{aligned} \end{aligned}$$
(17)

where \({\bar{D}}_{1}=\nabla \phi GR^{-1}G^{\textsf {T}}\nabla \phi ^{\textsf {T}}\), \(m=\frac{\sigma _{2}}{(\sigma _{2}^{\textsf {T}}\sigma _{2}+1)^{2}}\), \(\sigma _{2}=\nabla \phi (F+G{\hat{u}}_{o})\), and \(a_{2}\) is a scalar gain to be designed.
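A corresponding sketch of the actor output (16) and of one Euler step of (17) is given below, again assuming a scalar control input and scalar R; taking the design matrix \(F_{2}\) as a positive multiple of the identity is an illustrative choice, not the paper's prescription.

```python
import numpy as np

def actor_control(W_a_hat, grad_phi, G, R_inv):
    """Actor output (16): u_hat = -1/2 * R^{-1} * G^T grad_phi^T W_a_hat (scalar control)."""
    return -0.5 * R_inv * (G @ (grad_phi.T @ W_a_hat))

def actor_update(W_a_hat, W_c_hat, grad_phi, F, G, R_inv, F2, a2, dt):
    """One Euler step of the actor tuning law (17)."""
    D1_bar = R_inv * grad_phi @ np.outer(G, G) @ grad_phi.T  # grad_phi G R^{-1} G^T grad_phi^T
    u_hat = actor_control(W_a_hat, grad_phi, G, R_inv)
    sigma2 = grad_phi @ (F + G * u_hat)
    m = sigma2 / (sigma2 @ sigma2 + 1.0) ** 2                # m = sigma_2 / (sigma_2^T sigma_2 + 1)^2
    W_a_dot = -a2 * (F2 @ W_a_hat - F2 @ W_c_hat
                     - 0.25 * (D1_bar @ W_a_hat) * (m @ W_c_hat))
    return W_a_hat + dt * W_a_dot
```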

The actor NN and the critic NN are tuned simultaneously by an online algorithm that combines (15)–(17) with the FTDOB and the filtered error construction; an illustrative sketch of such a loop is given below.

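The sketch reuses the helper functions from the previous subsections (observer step, filtered error dynamics, critic and actor updates). The plant/observer wrapper `sim`, the exploration schedule, the step size, and the stopping rule are hypothetical placeholders, not the paper's exact algorithm.

```python
import numpy as np

def online_actor_critic(sim, T=100.0, dt=1e-3, a1=50.0, a2=1.0, T_explore=80.0):
    """Illustrative online loop for simultaneous actor-critic tuning.

    Per step: (i) read the filtered errors Z built from the plant states, the FTDOB
    estimates and the reference; (ii) compute the actor control (16) plus an
    exploration signal for PE; (iii) apply it and advance plant + observers;
    (iv) update the critic by (15) and the actor by (17).
    `sim` is a hypothetical wrapper exposing the quantities needed below.
    """
    N = sim.num_neurons
    W_c = np.random.uniform(0.0, 1.0, N)   # critic weights (random initialization)
    W_a = np.random.uniform(0.0, 1.0, N)   # actor weights
    F2 = np.eye(N)                          # actor design matrix (illustrative choice)
    t = 0.0
    while t < T:
        Z = sim.filtered_error()                                  # [z_1, ..., z_n]
        grad_phi = grad_phi_quadratic(Z)                          # Jacobian of the activation vector
        F = filtered_error_dynamics(Z, sim.nonlinearities(), sim.k, 0.0)  # drift F(Z) only
        G = np.zeros(len(Z)); G[-1] = 1.0
        u = actor_control(W_a, grad_phi, G, sim.R_inv)
        if t < T_explore:                                         # probing noise to satisfy PE
            u += sim.exploration(t)
        sim.step(u, dt)                                           # plant + FTDOB integration
        Q_val = Z @ sim.Q @ Z
        W_c = critic_update(W_c, grad_phi, F, G, u, Q_val, sim.R, a1, dt)
        W_a = actor_update(W_a, W_c, grad_phi, F, G, sim.R_inv, F2, a2, dt)
        t += dt
    return W_c, W_a
```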

3.4 Stability analysis

The following assumption is necessary for stability analysis in Theorem 2.

Assumption 2

[18] In equation (9), the NN approximation error, the NN activation functions, and their gradients are bounded on a compact set, i.e., \(\Vert \epsilon \Vert <b_{\epsilon }\), \(\Vert \phi \Vert <b_{\phi }\), \(\Vert \nabla \epsilon \Vert <b_{\epsilon _{x}}\), \(\Vert \nabla \phi \Vert <b_{\phi _{x}}\).

Theorem 2

Given system (8), the critic NN update law (15), the actor NN update law (17), and the controller \(u=u_{o}\), there exists a positive integer \(N_0\) such that, if the number of hidden layer units satisfies \(N > N_0\), the closed-loop system states, the critic NN approximation error, and the actor NN approximation error are UUB.

Proof

Choose the Lyapunov function as

$$\begin{aligned} \begin{aligned} V=J+\frac{1}{2}a_{1}^{-1}{\tilde{W}}_{c}^{\textsf {T}}{\tilde{W}}_{c}+\frac{1}{2}a_{2}^{-1}{\tilde{W}}_{a}^{\textsf {T}}{\tilde{W}}_{a}+\frac{1}{2}e^{\textsf {T}}e, \end{aligned} \end{aligned}$$
(18)

where \(e=[e_{i1}, e_{i2}]^{\textsf {T}}\) is the observer estimation error. Taking the derivative yields

$$\begin{aligned} \begin{aligned} {\dot{V}}={\dot{J}}+a_{1}^{-1}{\tilde{W}}_{c}^{\textsf {T}}\dot{{\tilde{W}}}_{c}+a_{2}^{-1}{\tilde{W}}_{a}^{\textsf {T}}\dot{{\tilde{W}}}_{a}+e^{\textsf {T}}{\dot{e}}. \end{aligned} \end{aligned}$$
(19)

Firstly, we have

$$\begin{aligned} \begin{aligned} {\dot{J}}&=W^{\textsf {T}}\nabla \phi {\dot{Z}}+\nabla \epsilon ^{\textsf {T}}{\dot{Z}}\\&=W^{\textsf {T}}\nabla \phi (F-\frac{1}{2}GR^{-1}G^{\textsf {T}}\nabla \phi ^{\textsf {T}}{\hat{W}}_{a})\\&\quad +\nabla \epsilon ^{\textsf {T}}(F-\frac{1}{2}GR^{-1}G^{\textsf {T}}\nabla \phi ^{\textsf {T}}{\hat{W}}_{a}). \end{aligned} \end{aligned}$$
(20)

Here we define \(\nabla \phi GR^{-1}G^{\textsf {T}}\nabla \phi ^{\textsf {T}}\) as \({\bar{D}}_{1}\) and \(\nabla \epsilon ^{\textsf {T}}(F-\frac{1}{2}GR^{-1}G^{\textsf {T}}\nabla \phi ^{\textsf {T}}{\hat{W}}_{a})\) as \(\mu _{1}\), so that \({\dot{J}}=W^{\textsf {T}}\sigma _{1}+\frac{1}{2}W^{\textsf {T}}{\bar{D}}_{1}{\tilde{W}}_{a}+\mu _{1}\). From the HJB Eq. (11), we have \(W^{\textsf {T}}\sigma _{1}=-Q-\frac{1}{4}W^{\textsf {T}}{\bar{D}}_{1}W+\upsilon _{H}\). Then, it follows that

$$\begin{aligned} \begin{aligned} {\dot{J}}&=-Q-\frac{1}{4}W^{\textsf {T}}{\bar{D}}_{1}W+\frac{1}{2}W^{\textsf {T}}{\bar{D}}_{1}{\tilde{W}}_{a}+\upsilon _{H}+\mu _{1}. \end{aligned} \end{aligned}$$
(21)

In addition, we have

$$\begin{aligned} \begin{aligned} a_{1}^{-1}{\tilde{W}}_{c}^{\textsf {T}}\dot{{\tilde{W}}}_{c} =&{\tilde{W}}_{c}^{\textsf {T}}\frac{\sigma _{2}}{(\sigma _{2}^{\textsf {T}}\sigma _{2}+1)^{2}}[\sigma _{2}^{\textsf {T}}{\hat{W}}_{c}+Q+{\hat{u}}_{o}^{\textsf {T}}R{\hat{u}}_{o}]\\ =&{\tilde{W}}_{c}^{\textsf {T}}\frac{\sigma _{2}}{(\sigma _{2}^{\textsf {T}}\sigma _{2}+1)^{2}}[-\sigma _{2}^{\textsf {T}}{\tilde{W}}_{c}\\&\quad +\frac{1}{4}{\tilde{W}}_{a}^{\textsf {T}}{\bar{D}}_{1}{\tilde{W}}_{a}+\upsilon _{H}]. \end{aligned} \end{aligned}$$
(22)

Based on the FTDOB, the estimation error becomes 0 after \(T_{d}\) seconds. With the update law of \({\hat{W}}_{a}\) given in (17), we have

$$\begin{aligned} \begin{aligned} {\tilde{W}}_{a}^{\textsf {T}}F_{2}{\hat{W}}_{a}-{\tilde{W}}_{a}^{\textsf {T}}F_{2}{\hat{W}}_{c} ={\tilde{W}}_{a}^{\textsf {T}}F_{2}W-{\tilde{W}}_{a}^{\textsf {T}}F_{2}{\tilde{W}}_{a}-{\tilde{W}}_{a}^{\textsf {T}}F_{2}W+{\tilde{W}}_{a}^{\textsf {T}}F_{2}{\tilde{W}}_{c}. \end{aligned} \end{aligned}$$
(23)

Then, combining (6), (15), (17) and (21), \({\dot{V}}\) can be obtained as

$$\begin{aligned} \begin{aligned} {\dot{V}}&=-Q-\frac{1}{4}W^{\textsf {T}}{\bar{D}}_{1}W\\&\quad +{\tilde{W}}_{c}^{\textsf {T}}\frac{\sigma _{2}}{(\sigma _{2}^{\textsf {T}}\sigma _{2}+1)^{2}}[-\sigma _{2}^{\textsf {T}}{\tilde{W}}_{c}+\upsilon _{H}]+\upsilon _{H}+\mu _{1}\\&\quad +\frac{1}{4}{\tilde{W}}_{a}^{\textsf {T}}{\bar{D}}_{1}{\hat{W}}_{a}\frac{{\bar{\sigma }}^{\textsf {T}}}{m_{s}}{\tilde{W}}_{c}\\&\quad +\frac{1}{2}W^{\textsf {T}}{\bar{D}}_{1}{\tilde{W}}_{a}+\frac{1}{4}{\tilde{W}}_{a}^{\textsf {T}}{\bar{D}}_{1}W\frac{{\bar{\sigma }}^{\textsf {T}}}{m_{s}}{\tilde{W}}_{a}\\&\quad -\frac{1}{4}{\tilde{W}}_{a}^{\textsf {T}}{\bar{D}}_{1}W\frac{{\bar{\sigma }}^{\textsf {T}}}{m_{s}}W+{\tilde{W}}_{a}^{\textsf {T}}F_{2}W\\&\quad -{\tilde{W}}_{a}^{\textsf {T}}F_{2}{\tilde{W}}_{a}-{\tilde{W}}_{a}^{\textsf {T}}F_{2}W+{\tilde{W}}_{a}^{\textsf {T}}F_{2}{\tilde{W}}_{c}, \end{aligned} \end{aligned}$$
(24)

where \({\bar{\sigma }}=\frac{\sigma _{2}}{\sigma _{2}^{\textsf {T}}\sigma _{2}+1}\), \(m_{s}=\sigma _{2}^{\textsf {T}}\sigma _{2}+1\).

It is obvious that under Assumption 2,

$$\begin{aligned} \begin{aligned} \mu _{1}\!<\!b_{\epsilon _{x}}\Vert b_{f}\Vert \Vert x\Vert \!+\!\frac{b_{\epsilon _{x}}b_{\phi _{x}}b_{g}^{2}\sigma _{\min }(R)(\Vert W\Vert \!+\!\Vert {\tilde{W}}_{a}\Vert )}{2}. \end{aligned} \end{aligned}$$
(25)

As shown in [18], \(\upsilon _{H}\) converges to 0 as the number of neurons increases. Hence, \(N_{0}\) can be selected such that \(\sup \nolimits _{x\in \Omega }\Vert \upsilon _{H}\Vert <\upsilon\). Assuming \(N>N_{0}\) and defining \({\tilde{Z}}=[Z,\quad {\tilde{W}}_{c},\quad {\tilde{W}}_{a},\quad e]^{\textsf {T}}\), we have

$$\begin{aligned} \begin{aligned} {\dot{V}}&<-\Vert {\tilde{Z}}\Vert ^{2}\sigma _{\min }(M)+\Vert p\Vert \Vert {\tilde{Z}}\Vert +c+\upsilon , \end{aligned} \end{aligned}$$
(26)

where \(c=\frac{1}{4}\Vert W\Vert ^{2}\Vert {\bar{D}}_{1}\Vert +\upsilon +\frac{1}{2}\Vert W\Vert b_{\epsilon _{x}}b_{\phi _{x}}b_{g}^{2}\sigma _{\min }(R)\),

$$\begin{aligned} \begin{aligned} M&=\left[ \begin{array}{cccc} qI&0&0&0\\ 0&I&\left( -\frac{F_{2}}{2}-\frac{{\bar{D}}_{1}W}{8m_{s}}\right) ^{\textsf {T}}&0\\ 0&\left( -\frac{F_{2}}{2}-\frac{{\bar{D}}_{1}W}{8m_{s}}\right) &F_{2}-\frac{{\bar{D}}_{1}Wm^{\textsf {T}}+mW^{\textsf {T}}{\bar{D}}_{1}}{8}&0\\ 0&0&0&\frac{1}{2}\\ \end{array} \right] ,\\ p&=\left[ \begin{array}{c} b_{\epsilon _{x}}b_{f}\\ \frac{\upsilon }{m_{s}}\\ \frac{1}{2}\left( {\bar{D}}_{1}-\frac{{\bar{D}}_{1}Wm^{\textsf {T}}}{4}\right) W+\frac{b_{\epsilon _{x}}b_{\phi _{x}}b_{g}^{2}\sigma _{\min }(R)}{2}\\ -\frac{1}{2}e+f(e)+D \end{array} \right] . \end{aligned} \end{aligned}$$
(27)

Let the parameters be chosen such that \(M>0\). If \(\Vert {\tilde{Z}}\Vert >\sqrt{\frac{\Vert p\Vert ^{2}}{4\sigma _{\min }^{2}(M)}+\frac{c+\upsilon }{\sigma _{\min }(M)}}+\frac{\Vert p\Vert }{2\sigma _{\min }(M)}\), then \({\dot{V}}\) is negative. Hence, the states and the weight errors are UUB. \(\square\)

4 Examples

In this section, a linear system is presented first to show that the designed update laws guarantee the convergence of the weights to their ideal values. Then, a nonlinear system example is employed to highlight the effectiveness of the proposed method.

4.1 Linear system example

Consider a linear system, \({\dot{x}}_{1}= -x_{1}-2x_{2}+u\), \({\dot{x}}_{2}= x_{1}-4x_{2}-3u\), where \(x_{1}\) and \(x_{2}\) are the system states and u is the control input. Choose the cost function as \(J=\int _{0}^{\infty }(x^{\textsf {T}}Qx+u^{\textsf {T}}Ru)\mathrm{{d}}t\), where \(Q=\mathrm{diag}(1, 1)\) and \(R=1\).

Clearly, the optimal controller based on linear quadratic regulator (LQR) theory can be easily found. Hence, the ideal NN weights can also be deduced as \(W=\left[ \begin{array}{ccc} 0.3199&-0.1162&0.1292 \end{array} \right]\). For this system, the NN-based optimal control is implemented as (16), and the NN tuning laws are selected as (15) and (17). In order to ensure the PE condition during NN convergence, we add the noise signal \(0.5(\sin ^{2}(t)\cos (t)+\sin ^{2}(2t)\cos (0.1t)+\sin ^{2}(-1.2t)\cos (0.5t)+\sin ^{5}(t))\) to the control input. The reference signal is set as \(r=0\). The simulation results are shown in Fig. 2. The weights converge to the optimal values after 50 s, i.e., \({\hat{W}}_{c}=\left[ \begin{array}{ccc} 0.3199&-0.1162&0.1292 \end{array} \right]\) and \({\hat{W}}_{a}=\left[ \begin{array}{ccc} 0.3199&-0.1162&0.1292 \end{array} \right]\). The optimal controller approximated by the NNs is given as

$$\begin{aligned} \begin{aligned} {\hat{u}}&=-\frac{R^{-1}}{2}\left[ \begin{array}{cccccc} \!1 \\ \!-3 \end{array} \right] ^{T}\left[ \begin{array}{cccccc} 2x_{1}&{}0\\ x_{2}&{}x_{1}\\ 0&{}2x_{2} \end{array} \right] ^{T}\left[ \begin{array}{cccccc} 0.3199\\ -0.1162\\ 0.1292 \end{array} \right] . \end{aligned} \end{aligned}$$
(28)
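The reported ideal weights can be cross-checked independently. For the quadratic basis \(\phi (x)=[x_{1}^{2}, x_{1}x_{2}, x_{2}^{2}]^{\textsf {T}}\) implied by (28), the optimal cost is \(J^{*}=x^{\textsf {T}}Px\) with P the stabilizing solution of the algebraic Riccati equation, so \(W=[P_{11}, 2P_{12}, P_{22}]^{\textsf {T}}\). A short sketch using SciPy (the use of SciPy is an assumption about tooling; any ARE solver works) is:

```python
import numpy as np
from scipy.linalg import solve_continuous_are

# Linear example of Sect. 4.1: x1_dot = -x1 - 2*x2 + u, x2_dot = x1 - 4*x2 - 3*u
A = np.array([[-1.0, -2.0],
              [ 1.0, -4.0]])
B = np.array([[1.0],
              [-3.0]])
Q = np.eye(2)
R = np.array([[1.0]])

P = solve_continuous_are(A, B, Q, R)                   # stabilizing ARE solution, J* = x^T P x
W_ideal = np.array([P[0, 0], 2.0 * P[0, 1], P[1, 1]])  # weights for phi = [x1^2, x1*x2, x2^2]
print(W_ideal)  # reproduces, up to rounding, the ideal weights reported above
```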
Fig. 2 a Trajectories of the system states under (16) with reference signal \(r=0\). Before 80 s, the states fluctuate constantly due to the presence of the excitation signal. After the excitation signal is removed, the system states reach a bound near the equilibrium point \((0,0)^{\textsf {T}}\). b Critic NN (CNN) weights. Driven by update law (15), the weights of the CNN eventually converge to within a bounded range of the ideal weights. c Actor NN (ANN) weights. Driven by update law (17), the weights of the ANN eventually converge to within a bounded range of the ideal weights

The excitation signal is introduced to satisfy the PE condition, so that sufficiently rich data are generated to train the neural networks and ensure their convergence. After 80 s, the neural networks have converged. The exploration signal is then removed, and the system states remain near 0 afterwards.

4.2 Nonlinear system example

Firstly, we consider a reference signal \(r=0\). In this case, the tracking problem reduces to a stabilization problem. The exploration signal is chosen as \(200e^{-0.23t}(\sin ^{2}(t)\cos (t)+\sin ^{2}(2t)\cos (0.1t)+\sin ^{2}(-1.2t)\cos (0.5t)+\sin ^{5}(t)+\sin ^{2}(1.12t)+\cos (2.4t)\sin ^{3}(2.4t))\), and the corresponding results are depicted in Fig. 3.

Fig. 3 a Trajectories of the system states under (16) with reference signal \(r=0\). Before 80 s, the states fluctuate constantly due to the presence of the excitation signal. b CNN weights. Driven by update law (15), the weights of the CNN eventually converge to within a bounded range of the ideal weights. c ANN weights. Driven by update law (17), the weights of the ANN eventually converge to within a bounded range of the ideal weights

Then, we set \(r=5\) and choose the exploration signal as \(200e^{-0.35t}(\sin ^{2}(t)\cos (t)+\sin ^{2}(2t)\cos (0.1t)+\sin ^{2}(-1.2t)\cos (0.5t)+\sin ^{5}(t)+\sin ^{2}(1.12t)+\cos (2.4t)\sin ^{3}(2.4t))\). The results are depicted in Fig. 4.

Fig. 4 a Trajectories of the system states under (16) with reference signal \(r=5\). Before 80 s, the states fluctuate constantly due to the presence of the excitation signal. After the excitation signal is removed, state \(x_{1}\) converges to 5 and \(x_{2}\) reaches the equilibrium point. b CNN weights. Driven by update law (15), the weights of the CNN eventually converge to within a bounded range of the ideal weights. c ANN weights. Driven by update law (17), the weights of the ANN eventually converge to within a bounded range of the ideal weights

In our simulation, the sampling time is relatively small (0.001 s). It is therefore reasonable to increase the decay rate of the exponential envelope of the excitation signal, which reduces the overall training time and avoids wasting computational resources. However, in practical systems, hardware limitations often prevent maintaining such a small sampling time. In such cases, as highlighted in [2, 3], it becomes crucial to ensure that the excitation signal does not decay too rapidly, so that an ample amount of data is available for training the neural networks.
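For reproducibility, the two exploration signals quoted above can be written as simple functions; only the decay rate of the exponential envelope differs between the two runs, which is exactly the trade-off discussed in this paragraph (the function names are ours).

```python
import numpy as np

def probing_terms(t):
    """Sum-of-sinusoids probing signal shared by both runs of Sect. 4.2."""
    return (np.sin(t)**2 * np.cos(t) + np.sin(2*t)**2 * np.cos(0.1*t)
            + np.sin(-1.2*t)**2 * np.cos(0.5*t) + np.sin(t)**5
            + np.sin(1.12*t)**2 + np.cos(2.4*t) * np.sin(2.4*t)**3)

def exploration_r0(t):
    """Exploration signal used for the r = 0 run (decay rate 0.23)."""
    return 200.0 * np.exp(-0.23 * t) * probing_terms(t)

def exploration_r5(t):
    """Exploration signal used for the r = 5 run (decay rate 0.35)."""
    return 200.0 * np.exp(-0.35 * t) * probing_terms(t)
```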

5 Conclusion

This paper focused on the design of robust optimal controllers for high-order nonlinear systems in the presence of mismatched disturbances. The proposed approach involves the design of disturbance observers that ensure fixed-time convergence. Subsequently, the original system is transformed into a filtered error nonlinear system. To address the challenges associated with solving Hamilton–Jacobi–Bellman (HJB) equations, the reinforcement learning method has been introduced. Two neural networks have been designed to approximate the cost function and the optimal control, respectively. By integrating these components, a robust optimal controller is finally obtained. The effectiveness of the proposed method has been validated through two illustrative examples.