1 Introduction

Differential game theory has emerged as a powerful instrument for theoretical and applied research in the control field [1,2,3,4]. Many complex real-world systems consist of multiple controllers, which can be regarded as players in a game [5]. When differential games are applied to control problems, they essentially translate into multi-controller optimal control problems [6, 7]. Generally, game problems are divided into zero-sum games (ZSGs) [8] and non-ZSGs [9]. Furthermore, solving game problems relies on the Hamilton-Jacobi (HJ) or Hamilton-Jacobi-Isaacs (HJI) equations arising in the optimal control problem, which are intractable or impossible to solve analytically [10,11,12,13,14]. Therefore, many scholars have proposed approximate approaches to tackle this difficulty. In particular, the adaptive dynamic programming (ADP) technique was introduced to address HJ or HJI equations for a variety of game problems [15,16,17,18].

For non-ZSG problems, an off-policy scheme was established to deal with the multi-player non-ZSG without requiring the system dynamics [19]. By employing an ADP-based mechanism, each player could obtain its performance index and control. Zhao et al. [20] designed a dual actor-critic scheme and proved that the control policies reach the Nash equilibrium of the non-ZSG. In [21], an online ADP-based model-free control structure was proposed to handle the multi-player non-ZSG problem for unknown discrete-time systems.

With regard to ZSG problems, the \(H_\infty \)-constrained control problem was transformed into a two-player ZSG, which pointed out a key direction for \(H_\infty \) control [22]. Under the ADP scheme, a critic network was then designed to deal with the HJI equation. In [23], the linear two-player ZSG was investigated based on an adaptive online learning architecture, which approximately solves the modified game algebraic Riccati equation via online data. Yazidi et al. [24] developed a pioneering mechanism that converges to the mixed Nash equilibrium of two-player ZSGs with incomplete information. Different from [25], Song et al. [26] established a single-critic network framework to tune the weight and solve the HJI equation without complete dynamics. The aforementioned ADP-based results address only two-player ZSG problems. However, most industrial process plants are controlled by multiple controllers, which means that the cost function designed for two-player ZSGs no longer applies to multi-player ZSGs. Therefore, multi-player ZSGs deserve more attention. In [27], an off-policy framework was devised for the multi-player ZSG of completely unknown systems, from which the iterative cost function, controls, and disturbances were obtained. In [28], the single-critic mechanism was employed and an event-based structure was developed in a multi-player ZSG setting to reduce data transmission and computation. Qiao et al. [29] extended the adaptive critic mechanism to the combined problem of multi-player ZSG and optimal tracking control, and provided two multi-player ZSG cases in the simulation stage.

Note that most of the aforementioned ADP-based frameworks require known system dynamics, which are difficult to obtain accurately in industrial processes. To overcome this disadvantage, neural network-based system identification algorithms are utilized to reconstruct unknown system dynamics with approximate structures [30,31,32]. For example, Na et al. [33] combined critic-based ADP with an identifier network driven by system data to address the optimal tracking control problem online. In [34], an ADP mechanism for a two-player nonlinear ZSG was designed by utilizing an identifier-critic network. In [35], an intelligent control mechanism was established by using a recurrent neural network and a unique critic network instead of the mathematical model. Huo et al. [36] extended these results to constrained decentralized systems by utilizing the identifier-critic mechanism.

Subsequently, in order to address the problem caused by the persistence of excitation (PE) condition, the experience replay (ER) scheme was designed for nonlinear systems [37,38,39,40]. The ER scheme can effectively utilize historical and current data simultaneously. Under the ER framework, a novel ADP-based approach was developed in [41] to approximate the Nash equilibrium for multi-player non-ZSGs with unknown drift dynamics, which also accelerates the convergence of the critic network weights. In [42], a critic network with a new weight updating rule based on the ER method was developed for systems with uncertain interconnections. Thereafter, Zhu et al. [43] realized the optimal control of constrained-input, partially unknown systems; to tune the critic network weight, they leveraged the ER algorithm to make effective use of recorded data. To relax the PE condition, the ER scheme was also introduced into an off-policy framework to address the optimal output regulation problem with unknown system dynamics [44].

Moreover, control constraints are widespread in practical systems owing to the inherent physical limits of actuators. If ignored, they can degrade system performance or even cause instability. Thus, an ADP-based controller should achieve the desired performance in the presence of control constraints [45,46,47,48,49]. Accounting for control constraints, an adaptive critic design based on ADP was implemented for two-player nonlinear non-ZSGs [50]. Further, an actor-critic architecture was proposed to approximate the Nash equilibrium using real-time data. In [51], the unknown multi-player ZSG with control constraints was considered, and an observer-critic structure was established to tackle the HJI equation. Sun and Liu [52] investigated a fixed directed graph structure for multi-agent systems with control constraints to handle the distributed differential game tracking problem. Nevertheless, only a few studies consider control constraints for multi-player ZSGs. More importantly, ADP-based optimal control for ZSGs was also investigated in [26, 28, 51] and [53]. However, the single-critic network scheme was not established in [53], the constrained control input was not considered in [26] and [28], and the weight updating rule with the ER technique was not analyzed in [51]. These gaps motivate our research. Hence, this article concerns the ER-based adaptive critic design for unknown multi-player ZSGs with control constraints.

The innovations of this article are summarized in four parts.

  1.

    This paper extends the ADP-based scheme to solve the multi-player ZSG problem for nonlinear systems. It is applicable to both the two-player and the multi-player ZSG problems.

  2.

    Additionally, by constructing a modified non-quadratic utility function, control constraints are handled in the multi-player ZSG setting.

  3.

    Different from the traditional identifier-actor-critic mechanism [54], an identifier-critic scheme for all players is developed to solve the HJI equation, which simplifies the architecture and reduces the computational cost.

  4.

    By introducing the ER mechanism, a novel weight tuning criterion is employed and the PE condition is relaxed to an easily checked rank condition (see Remark 3), which yields an easy-to-implement scheme. Moreover, the uniform ultimate boundedness (UUB) stability of both the critic network weight estimation error and the multi-player system can be guaranteed.

The outline of the article is summarized as follows. In Sect. 2, the problem description is provided. In Sect. 3, a neural network-based identifier is established to identify the system dynamics, and its stability is proved; moreover, the single-critic network scheme is introduced with the corresponding stability analysis. In Sect. 4, a simulation example is presented. In Sect. 5, the conclusion is given.

2 Problem statement

Consider the multi-player nonlinear ZSG system

$$\begin{aligned} \dot{x}(t)=f(x(t))+\sum _{q=1}^{N}g_{q}(x(t))u_{q}(t)+\sum _{l=1}^{M}k_{l}(x(t))d_{l}(t), \end{aligned}$$
(1)

where \(x\in \mathbb {R}^{n}\) denotes the state; \(u_q\in \mathbb {R}^{m_{q}}\) and \(d_l\in \mathbb {R}^{w_{l}}\) are the constrained control inputs and the disturbance inputs, respectively. The functions \(f(x)\in \mathbb {R}^{n}\), \(g_q(x)\in \mathbb {R}^{n\times m_q}\), and \(k_l(x)\in \mathbb {R}^{n\times w_l}\) are assumed to be unknown and Lipschitz continuous on a compact set \(\varOmega \subset \mathbb {R}^{n}\) with \(f(0)=0\). Let \(x(0)=x_{0}\) be the initial state, and assume the system is stabilizable on \(\varOmega \).

Define the cost function as

$$\begin{aligned} J(x_0,\mathscr {U},\mathscr {D})=\int _0^\infty h(x(\tau ),\mathscr {U},\mathscr {D})\,\mathrm d\tau , \end{aligned}$$
(2)

where \(\mathscr {U}=\{u_1,\dots ,u_N\}\) is the set of constrained control inputs with \(\left| u_q\right| \le \beta _q\), where \(\beta _q>0\) is the constraint bound; \(\mathscr {D}=\{d_1,\dots ,d_M \}\) is the set of disturbance inputs; \(h(x(t),\mathscr {U},\mathscr {D})=x^{{\textsf{T}}}Qx+U(\mathscr {U},\mathscr {D})\) is the utility function; and \(U(\mathscr {U},\mathscr {D})=2{\textstyle \sum _{q=1}^N}R_{q}\int _0^{u_q}\beta _q\rho ^{\mathsf {-T}}(v/\beta _q)\,\mathrm dv-\lambda ^2 \textstyle \sum _{l=1}^M d^{{\textsf{T}}}_ld_l\), with \(\lambda \) denoting the disturbance attenuation level. \(Q\ge 0\) is a symmetric positive semidefinite matrix and \(R_{q}>0\) is a symmetric positive definite matrix. Moreover, \(\rho (\cdot )\) is a monotonic bounded odd function, chosen here as \(\rho (\cdot )=\tanh (\cdot )\).
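To make the non-quadratic penalty concrete, the following Python sketch (our illustration; the function names and the scalar-channel simplification are assumptions, not part of the original design) evaluates the control-weighted term of \(U(\mathscr {U},\mathscr {D})\) both by numerical quadrature and by the closed form obtained from integration by parts.

```python
import numpy as np
from scipy.integrate import quad

def control_penalty(u, beta, R):
    """2*R*int_0^u beta*atanh(v/beta) dv for a scalar channel (|u| < beta)."""
    numeric, _ = quad(lambda v: beta * np.arctanh(v / beta), 0.0, u)
    # closed form from integration by parts
    closed = beta * u * np.arctanh(u / beta) \
        + 0.5 * beta**2 * np.log(1.0 - (u / beta)**2)
    return 2.0 * R * numeric, 2.0 * R * closed

def utility(x, u_list, d_list, Q, R_list, betas, lam):
    """h(x, U, D) = x'Qx + U(U, D) with scalar players, a simplification."""
    h = float(x @ Q @ x)
    for u, R, b in zip(u_list, R_list, betas):
        h += control_penalty(u, b, R)[1]
    for d in d_list:
        h -= lam**2 * float(d * d)
    return h
```

Both evaluations agree for \(|u|<\beta \), and the penalty grows without bound as \(u\) approaches the constraint, which is what keeps the minimizing policy inside the bound.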

Then, the multi-player ZSG subject to (1) is defined as

$$\begin{aligned} J^*(x_0)= \inf _{u_1}\inf _{u_2}\cdots \inf _{u_N}\sup _{d_1}\sup _{d_2}\cdots \sup _{d_M}J(x_0,\mathscr {U},\mathscr {D}), \end{aligned}$$
(3)

where \(J^*(x_0)\) denotes the optimal cost function.

For the multi-player ZSG, we seek the saddle point solution \((u_q^*,d_l^*)\) satisfying the inequalities

$$\begin{aligned} J(x,\mathscr {U}^*,\mathscr {D})\le J(x,\mathscr {U}^*,\mathscr {D}^*)\le J(x,\mathscr {U},\mathscr {D}^*), \end{aligned}$$
(4)

where \(\mathscr {U}^*=\{u_1^*,u_2^*,\dots ,u_N^*\}\) and \(\mathscr {D}^*=\{d_1^*,d_2^*,\dots ,d_M^*\}\) indicate the sets of the optimal control strategies and the worst disturbance strategies, respectively.

Differentiating cost function (2) along the trajectories of (1) yields the Bellman equation

$$\begin{aligned} 0=h(x,\mathscr {U},\mathscr {D})+(\nabla J(x))^{{\textsf{T}}}\left( f(x)+\sum _{q=1}^{N}g_{q}(x)u_{q}+\sum _{l=1}^{M}k_{l}(x)d_{l}\right) , \end{aligned}$$
(5)

where \(\nabla (\cdot )\triangleq \partial (\cdot )/\partial x\) denotes the gradient operator.

The Hamiltonian function is constructed as

$$\begin{aligned} H\left( x, \nabla J(x), \mathscr {U},\mathscr {D}\right)&=x^{{\textsf{T}}}Qx+2\sum _{q=1}^{N}R_{q}\int _0^{u_q}\beta _q\rho ^{\mathsf {-T}}(v/\beta _q)\,\mathrm dv-\lambda ^2\sum _{l=1}^{M}d^{{\textsf{T}}}_{l}d_{l}\nonumber \\&\quad +(\nabla J(x))^{{\textsf{T}}}\left( f(x)+\sum _{q=1}^{N}g_{q}(x)u_{q}+\sum _{l=1}^{M}k_{l}(x)d_{l}\right) . \end{aligned}$$
(6)
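Given the utility above, evaluating the Hamiltonian is a direct transcription of (6); a minimal sketch reusing utility() from the previous block (scalar players assumed, names illustrative):

```python
def hamiltonian(x, grad_J, u_list, d_list, f, g_list, k_list,
                Q, R_list, betas, lam):
    """H(x, grad J, U, D) of (6): utility plus grad J'(f + sum g_q u_q + sum k_l d_l)."""
    xdot = f(x)
    for g, u in zip(g_list, u_list):
        xdot = xdot + g(x) * u        # scalar u_q times input vector g_q(x)
    for k, d in zip(k_list, d_list):
        xdot = xdot + k(x) * d
    return utility(x, u_list, d_list, Q, R_list, betas, lam) + float(grad_J @ xdot)
```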

The associated HJI equation can be described as

$$\begin{aligned} \min _{\mathscr {U}}\max _{\mathscr {D}}H\left( x, \nabla J^*(x), \mathscr {U},\mathscr {D}\right) = 0. \end{aligned}$$
(7)

Then, the optimal constrained control policy and the worst disturbance strategy can be derived from the following stationary conditions

$$\begin{aligned} \frac{\partial H\left( x, \nabla J^{*}(x), \mathscr {U}, \mathscr {D}\right) }{\partial u_{q}}&=0,\quad q=1,2,\ldots ,N, \end{aligned}$$
(8)
$$\begin{aligned} \frac{\partial H\left( x, \nabla J^{*}(x), \mathscr {U}, \mathscr {D}\right) }{\partial d_{l}}&=0,\quad l=1,2,\ldots ,M. \end{aligned}$$
(9)

Therefore, the optimal control law and the worst disturbance law can be obtained by

$$\begin{aligned} u_{q}^*&=-\beta _q \tanh (B^*), \end{aligned}$$
(10)
$$\begin{aligned} d_{l}^*&=\frac{1}{2 \lambda ^2}k_{l}^{{\textsf{T}}} \nabla J^*, \end{aligned}$$
(11)

where \(B^{*}= (1/(2 \beta _q)) R_{q}^{-1} g_{q}^{{\textsf{T}}} \nabla J^*\) is defined for each player \(q\).
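For a given value-function gradient, (10) and (11) are explicit expressions; a hedged sketch of their evaluation at one state (vector-valued channels and matrix \(R_q\) assumed; all names are illustrative):

```python
import numpy as np

def saddle_point_policies(grad_J, g_list, k_list, R_list, betas, lam):
    """Evaluate (10)-(11): u_q* = -beta_q tanh(B_q*), d_l* = k_l' gradJ / (2 lam^2)."""
    u_star = []
    for g, R, b in zip(g_list, R_list, betas):
        # B_q* = (1/(2 beta_q)) R_q^{-1} g_q' gradJ, via a linear solve
        B = np.linalg.solve(2.0 * b * R, g.T @ grad_J)
        u_star.append(-b * np.tanh(B))
    d_star = [k.T @ grad_J / (2.0 * lam**2) for k in k_list]
    return u_star, d_star
```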

Inserting (10) and (11) into (7), we can get the HJI equation expressed as

$$\begin{aligned} 0&=x^{{\textsf{T}}}Qx+2\sum _{q=1}^{N}R_{q}\int _{0}^{-\beta _q \tanh (B^*)}\beta _q\tanh ^{\mathsf {-T}}(v/\beta _q)\,\mathrm dv\nonumber \\&\quad +\frac{1}{4\lambda ^2}\sum _{l=1}^{M}(\nabla J^*)^{{\textsf{T}}}k_l(x)k_l^{{\textsf{T}}}(x)\nabla J^*+(\nabla J^*)^{{\textsf{T}}}f(x)\nonumber \\&\quad -(\nabla J^*)^{{\textsf{T}}}\sum _{q=1}^{N}g_q(x)\beta _q\tanh (B^*). \end{aligned}$$
(12)

Note that it is intractable to solve equation (12) analytically. Generally, the traditional policy iteration (PI) scheme can be employed to overcome this bottleneck, but it depends on the system dynamics. Hence, in the next section, an identifier-critic network framework is developed to tackle the constrained multi-player ZSG problem without requiring the system dynamics.

Remark 1

Obviously, this paper considers the multi-player ZSG with control constraints, so the traditional quadratic cost function is no longer suitable. Instead, the control constraints are handled by an improved non-quadratic cost function, which restricts the control policies within the given bounds.

3 Approximate solution for multi-player ZSGs

In this section, an identifier-critic framework based on neural networks is constructed for the multi-player ZSG problem of unknown dynamics with control constraints.

First, an identifier network is designed to relax the requirement of unknown system dynamics. Then, a single-critic network is applied and the implementation process is also given. Finally, the stability is proved by using the Lyapunov approach.

3.1 System identification

Since the system dynamics of the multi-player ZSG are unknown, an identifier is used to reconstruct them. System (1) can be reformulated as

$$\begin{aligned} \dot{x}&=Sx+\omega _{f}^{{\textsf{T}}}\varphi _{f}(x)+\varepsilon _{f}(x)+\sum _{q=1}^{N}\left( \omega _{gq}^{{\textsf{T}}}\varphi _{gq}(x)+\varepsilon _{gq}(x)\right) u_{q}\nonumber \\&\quad +\sum _{l=1}^{M}\left( \omega _{kl}^{{\textsf{T}}}\varphi _{kl}(x)+\varepsilon _{kl}(x)\right) d_{l}, \end{aligned}$$
(13)

where \(S\in { {\mathbb {R}}}^{n\times n}\) is a designed matrix. \(\omega _{f}\in { {\mathbb {R}}}^{n\times n}\), \(\omega _{gq}\in { {\mathbb {R}}}^{n\times n}\), and \(\omega _{kl}\in { {\mathbb {R}}}^{n\times n}\) represent the ideal weight matrices. \(\varphi _{f}(\cdot )\in { {\mathbb {R}}}^{n}\), \(\varphi _{gq}(\cdot )\in { {\mathbb {R}}}^{n\times m_q}\), and \(\varphi _{kl}(\cdot )\in { {\mathbb {R}}^{n \times w_l}}\) denote the activation functions. \(\varepsilon _{f}(\cdot )\in { {\mathbb {R}}}^{n}\), \(\varepsilon _{gq}(\cdot )\in { {\mathbb {R}}}^{n\times m_q}\), and \(\varepsilon _{kl}(\cdot )\in { {\mathbb {R}}}^{n \times w_l}\) are bounded reconstruction errors. The activation functions are picked as the \(\tanh \) function and satisfy

$$\begin{aligned} 0\le \varphi (x)-\varphi (y)\le \delta (x-y), \end{aligned}$$
(14)

for all \(x,y\in \mathbb {R}\) with \(x\ge y\) and some \(\delta >0\). Based on (13), the output of the identifier network is written as

$$\begin{aligned} \dot{{\hat{x}}}=S\hat{x}+{\hat{\omega }}_{f}^{{\textsf{T}}}\varphi _{f}({\hat{x}})+\sum _{q=1}^{N}{\hat{\omega }}_{gq}^{{\textsf{T}}}\varphi _{gq}({\hat{x}})u_{q}+\sum _{l=1}^{M}{\hat{\omega }}_{kl}^{{\textsf{T}}}\varphi _{kl}({\hat{x}})d_{l}, \end{aligned}$$
(15)

where \({\hat{\omega }}_{f}\in { {\mathbb {R}}}^{n\times n}\), \({\hat{\omega }}_{gq}\in { {\mathbb {R}}}^{n\times n}\), and \({\hat{\omega }}_{kl}\in { {\mathbb {R}}}^{n\times n}\) denote the estimations of the corresponding ideal weights. Moreover, the identification error is described as

$$\begin{aligned} {{\tilde{x}}}=x-\hat{x}. \end{aligned}$$
(16)

Then, the derivative of (16) can be derived as

$$\begin{aligned} \dot{{\tilde{x}}}&=\dot{x}-\dot{\hat{x}}\nonumber \\&=S{\tilde{x}}+{\tilde{\omega }}_{f}^{{\textsf{T}}}\varphi _{f}(\hat{x})+\omega _{f}^{{\textsf{T}}}\left( \varphi _{f}(x)-\varphi _{f}(\hat{x})\right) +\varepsilon _{f}(x)\nonumber \\&\quad +\sum _{q=1}^{N}\left( {\tilde{\omega }}_{gq}^{{\textsf{T}}}\varphi _{gq}({\hat{x}})+\omega _{gq}^{{\textsf{T}}}\left( \varphi _{gq}(x)-\varphi _{gq}({\hat{x}})\right) +\varepsilon _{gq}(x)\right) u_{q}\nonumber \\&\quad +\sum _{l=1}^{M}\left( {\tilde{\omega }}_{kl}^{{\textsf{T}}}\varphi _{kl}({\hat{x}})+\omega _{kl}^{{\textsf{T}}}\left( \varphi _{kl}(x)-\varphi _{kl}({\hat{x}})\right) +\varepsilon _{kl}(x)\right) d_{l}, \end{aligned}$$
(17)

where \({\tilde{\omega }}_{f}=\omega _{f}-{\hat{\omega }}_{f}\), \({\tilde{\omega }}_{gq}=\omega _{gq}-{\hat{\omega }}_{gq}\), and \({\tilde{\omega }}_{kl}=\omega _{kl}-{\hat{\omega }}_{kl}\).

Assumption 1

The ideal weights are bounded as

$$\begin{aligned} \left\| \omega _{f}\right\| \le \bar{\omega }_{f},\left\| \omega _{g q}\right\| \le \bar{\omega }_{g q},\left\| \omega _{k l}\right\| \le \bar{\omega }_{k l}, \end{aligned}$$

where \(\bar{\omega }_{f}\), \(\bar{\omega }_{gq}\), and \(\bar{\omega }_{kl}\) are positive constants.

Assumption 2

The reconstruction errors \(\varepsilon _{f}\), \(\varepsilon _{gq}\), and \(\varepsilon _{kl}\) are bounded by the identification error function, that is,

$$\begin{aligned} \varepsilon _{f}^{{\textsf{T}}} \varepsilon _{f} \le \gamma \tilde{x}^{{\textsf{T}}} \tilde{x}, \varepsilon _{g q}^{{\textsf{T}}} \varepsilon _{g q} \le \gamma \tilde{x}^{{\textsf{T}}} \tilde{x}, \varepsilon _{kl}^{{\textsf{T}}} \varepsilon _{kl} \le \gamma \tilde{x}^{{\textsf{T}}} \tilde{x}, \end{aligned}$$

where \(\gamma \) is a positive constant.

Theorem 1

Consider multi-player ZSG (1) with the system dynamics formulated by (13). The identification error \(\tilde{x}\) will converge to zero when \(t\rightarrow \infty \), if the weights \(\hat{\omega }_{f}\), \(\hat{\omega }_{gq}\), and \(\hat{\omega }_{kl}\) are updated by

$$\begin{aligned} \dot{{\hat{\omega }}}_{f}&=\varLambda _{f} \varphi _{f}({\hat{x}})\tilde{x}^{{\textsf{T}}}, \nonumber \\ \dot{{\hat{\omega }}}_{gq}&=\varLambda _{gq} \varphi _{gq}({\hat{x}})u_q\tilde{x}^{{\textsf{T}}}, q=1,\dots ,N, \nonumber \\ \dot{{\hat{\omega }}}_{kl}&=\varLambda _{kl} \varphi _{kl}({\hat{x}})d_l\tilde{x}^{{\textsf{T}}}, l=1,\dots ,M, \end{aligned}$$
(18)

where \(\varLambda _{f}\), \(\varLambda _{gq}\), and \(\varLambda _{kl}\) are symmetric positive definite matrices.

Proof

Select the Lyapunov function as

$$\begin{aligned} L_3(t)&=\frac{1}{2}\tilde{x}^{{\textsf{T}}}\tilde{x}+\frac{1}{2}\textrm{tr}\left( \tilde{\omega }_{f}^{{\textsf{T}}}\varLambda _{f}^{-1}\tilde{\omega }_{f}\right) \nonumber \\&\quad +\sum _{q=1}^{N} \frac{1}{2}\textrm{tr}\left( \tilde{\omega }_{gq}^{{\textsf{T}}}\varLambda _{gq}^{-1}\tilde{\omega }_{gq}\right) \nonumber \\&\quad +\sum _{l=1}^{M} \frac{1}{2}\textrm{tr}\left( \tilde{\omega }_{kl}^{{\textsf{T}}}\varLambda _{kl}^{-1}\tilde{\omega }_{kl}\right) . \end{aligned}$$
(19)

Computing the time derivative of \(L_3(t)\), one has

$$\begin{aligned} \dot{L}_3(t)&=\tilde{x}^{{\textsf{T}}}\dot{\tilde{x}}+\textrm{tr}\left( \tilde{\omega }_{f}^{{\textsf{T}}}\varLambda _{f}^{-1}\dot{\tilde{\omega }}_{f}\right) +\sum _{q=1}^{N}\textrm{tr}\left( \tilde{\omega }_{gq}^{{\textsf{T}}}\varLambda _{gq}^{-1}\dot{\tilde{\omega }}_{gq}\right) \nonumber \\&\quad +\sum _{l=1}^{M}\textrm{tr}\left( \tilde{\omega }_{kl}^{{\textsf{T}}}\varLambda _{kl}^{-1}\dot{\tilde{\omega }}_{kl}\right) . \end{aligned}$$
(20)

Observing (18) and using \(-\dot{\hat{\omega }}_{f}=\dot{\tilde{\omega }}_{f}\), \(-\dot{\hat{\omega }}_{gq}=\dot{\tilde{\omega }}_{gq}\), and \(-\dot{\hat{\omega }}_{kl}=\dot{\tilde{\omega }}_{kl}\), we can obtain

$$\begin{aligned} \textrm{tr}\left( \tilde{\omega }_{f}^{{\textsf{T}}}\varLambda _{f}^{-1}\dot{\tilde{\omega }}_{f}\right)&=-\tilde{x}^{{\textsf{T}}} \tilde{\omega }_{f}^{{\textsf{T}}}\varphi _{f}({\hat{x}}), \nonumber \\ \sum _{q=1}^{N} \textrm{tr}\left( \tilde{\omega }_{gq}^{{\textsf{T}}}\varLambda _{gq}^{-1}\dot{\tilde{\omega }}_{gq}\right)&=-\tilde{x}^{{\textsf{T}}}\sum _{q=1}^{N} \tilde{\omega }_{gq}^{{\textsf{T}}}\varphi _{gq}({\hat{x}})u_q, \nonumber \\ \sum _{l=1}^{M} \textrm{tr}\left( \tilde{\omega }_{kl}^{{\textsf{T}}}\varLambda _{kl}^{-1}\dot{\tilde{\omega }}_{kl}\right)&=-\tilde{x}^{{\textsf{T}}}\sum _{l=1}^{M} \tilde{\omega }_{kl}^{{\textsf{T}}}\varphi _{kl}({\hat{x}})d_l. \end{aligned}$$
(21)

Then, we have

$$\begin{aligned} \dot{L}_3(t)&={{\tilde{x}}}^{{\textsf{T}}}S{\tilde{x}}+{{\tilde{x}}}^{{\textsf{T}}}\omega _{f}^{{\textsf{T}}}\left( \varphi _{f}(x)-\varphi _{f}(\hat{x})\right) +{{\tilde{x}}}^{{\textsf{T}}}\varepsilon _{f}(x)\nonumber \\&\quad +{{\tilde{x}}}^{{\textsf{T}}}\sum _{q=1}^{N}\omega _{gq}^{{\textsf{T}}}\left( \varphi _{gq}(x)-\varphi _{gq}({\hat{x}})\right) u_{q}+{{\tilde{x}}}^{{\textsf{T}}}\sum _{q=1}^{N}\varepsilon _{gq}(x)u_{q}\nonumber \\&\quad +{{\tilde{x}}}^{{\textsf{T}}}\sum _{l=1}^{M}\omega _{kl}^{{\textsf{T}}}\left( \varphi _{kl}(x)-\varphi _{kl}({\hat{x}})\right) d_{l}+{{\tilde{x}}}^{{\textsf{T}}}\sum _{l=1}^{M}\varepsilon _{kl}(x)d_{l}. \end{aligned}$$
(22)

Based on (14), we have

$$\begin{aligned} {{\tilde{x}}}^{{\textsf{T}}}\omega _{f}^{{\textsf{T}}}\left( \varphi _{f}(x)-\varphi _{f}(\hat{x})\right)&\le \frac{1}{2}{{\tilde{x}}}^{{\textsf{T}}}\omega _{f}^{{\textsf{T}}}\omega _{f}{{\tilde{x}}}+\frac{1}{2}\delta ^2{{\tilde{x}}}^{{\textsf{T}}}{{\tilde{x}}},\nonumber \\ {{\tilde{x}}}^{{\textsf{T}}}\omega _{gq}^{{\textsf{T}}}\left( \varphi _{gq}(x)-\varphi _{gq}({\hat{x}})\right)&\le \frac{1}{2}{{\tilde{x}}}^{{\textsf{T}}}\omega _{gq}^{{\textsf{T}}}\omega _{gq}{{\tilde{x}}}+\frac{1}{2}\delta ^2{{\tilde{x}}}^{{\textsf{T}}}{{\tilde{x}}},\nonumber \\ {{\tilde{x}}}^{{\textsf{T}}}\omega _{kl}^{{\textsf{T}}}\left( \varphi _{kl}(x)-\varphi _{kl}({\hat{x}})\right)&\le \frac{1}{2}{{\tilde{x}}}^{{\textsf{T}}}\omega _{kl}^{{\textsf{T}}}\omega _{kl}{{\tilde{x}}}+\frac{1}{2}\delta ^2{{\tilde{x}}}^{{\textsf{T}}}{{\tilde{x}}}. \end{aligned}$$
(23)

Considering Assumption 2, one has

$$\begin{aligned} {{\tilde{x}}}^{{\textsf{T}}}\varepsilon _{f}(x)&\le \frac{1}{2}{{\tilde{x}}}^{{\textsf{T}}}{\tilde{x}} +\frac{1}{2}\gamma {{\tilde{x}}}^{{\textsf{T}}}{\tilde{x}}, \nonumber \\ {{\tilde{x}}}^{{\textsf{T}}}\varepsilon _{gq}(x)&\le \frac{1}{2}{{\tilde{x}}}^{{\textsf{T}}}{\tilde{x}} +\frac{1}{2}\gamma {{\tilde{x}}}^{{\textsf{T}}}{\tilde{x}}, \nonumber \\ {{\tilde{x}}}^{{\textsf{T}}}\varepsilon _{kl}(x)&\le \frac{1}{2}{{\tilde{x}}}^{{\textsf{T}}}{\tilde{x}} +\frac{1}{2}\gamma {{\tilde{x}}}^{{\textsf{T}}}{\tilde{x}}. \end{aligned}$$
(24)

Hence, (22) can be bounded as

$$\begin{aligned} \dot{L}_3(t)&\le {{\tilde{x}}}^{{\textsf{T}}}S{\tilde{x}}+\frac{1}{2}{{\tilde{x}}}^{{\textsf{T}}}\omega _{f}^{{\textsf{T}}}\omega _{f}{{\tilde{x}}}+\frac{1}{2}\delta ^2{{\tilde{x}}}^{{\textsf{T}}}{{\tilde{x}}}+\frac{1}{2}{{\tilde{x}}}^{{\textsf{T}}}{\tilde{x}}+\frac{1}{2}\gamma {{\tilde{x}}}^{{\textsf{T}}}{\tilde{x}}\nonumber \\&\quad +\frac{1}{2}{{\tilde{x}}}^{{\textsf{T}}}\sum _{q=1}^{N}\left( u_{q}\omega _{gq}^{{\textsf{T}}}\omega _{gq}{{\tilde{x}}}\right) +\frac{1}{2}\delta ^2\sum _{q=1}^{N}\left( u_{q}{{\tilde{x}}}^{{\textsf{T}}}{{\tilde{x}}}\right) +\frac{1+\gamma }{2}\sum _{q=1}^{N}\left( u_{q}{{\tilde{x}}}^{{\textsf{T}}}{{\tilde{x}}}\right) \nonumber \\&\quad +\frac{1}{2}{{\tilde{x}}}^{{\textsf{T}}}\sum _{l=1}^{M}\left( d_{l}\omega _{kl}^{{\textsf{T}}}\omega _{kl}{{\tilde{x}}}\right) +\frac{1}{2}\delta ^2\sum _{l=1}^{M}\left( d_{l}{{\tilde{x}}}^{{\textsf{T}}}{{\tilde{x}}}\right) +\frac{1+\gamma }{2}\sum _{l=1}^{M}\left( d_{l}{{\tilde{x}}}^{{\textsf{T}}}{{\tilde{x}}}\right) \nonumber \\&={{\tilde{x}}}^{{\textsf{T}}}\varGamma {{\tilde{x}}}, \end{aligned}$$
(25)

where

$$\begin{aligned} \varGamma&=S+\frac{1}{2}\omega _{f}^{{\textsf{T}}}\omega _{f}+\frac{1}{2}\sum _{q=1}^{N}u_{q}\omega _{gq}^{{\textsf{T}}}\omega _{gq}+\frac{1}{2}\sum _{l=1}^{M}d_{l}\omega _{kl}^{{\textsf{T}}}\omega _{kl}\nonumber \\&\quad +\bigg (\frac{1}{2}+\frac{1}{2}\gamma +\frac{1}{2}\delta ^{2}+\frac{1+\gamma }{2}\sum _{q=1}^{N}u_{q}+\frac{1}{2}\delta ^{2}\sum _{q=1}^{N}u_{q}+\frac{1+\gamma }{2}\sum _{l=1}^{M}d_{l}+\frac{1}{2}\delta ^{2}\sum _{l=1}^{M}d_{l}\bigg )I_{n} \end{aligned}$$
(26)

with \(I_n\) denoting the identity matrix. If \(S\) is chosen such that \(\varGamma \) is negative definite, then \(\dot{L}_3(t)\le {{\tilde{x}}}^{{\textsf{T}}}\varGamma {{\tilde{x}}}\le 0\), and \(\tilde{x}(t)\rightarrow 0\) as \(t\rightarrow \infty \).

According to Theorem 1, the unknown system dynamics can be replaced by the identifier. Consequently, system (1) is described by

$$\begin{aligned} \dot{x}=Sx+{\hat{\omega }}_{f}^{{\textsf{T}}}\varphi _{f}(x)+\sum _{q=1}^{N}{\hat{\omega }}_{gq}^{{\textsf{T}}}\varphi _{gq}(x)u_{q}+\sum _{l=1}^{M}{\hat{\omega }}_{kl}^{{\textsf{T}}}\varphi _{kl}(x)d_{l}. \end{aligned}$$
(27)

\(\square \)
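For implementation, identifier (15) and tuning laws (18) can be advanced with a simple Euler step; a minimal sketch under simplifying assumptions (scalar inputs \(u_q\), \(d_l\), one shared \(\tanh \) activation, illustrative names):

```python
import numpy as np

def identifier_step(x, x_hat, Wf, Wg, Wk, u, d, S, Lf, Lg, Lk, dt=1e-3):
    """One Euler step of identifier (15) and weight tuning laws (18)."""
    x_tilde = x - x_hat                     # identification error (16)
    ph = np.tanh(x_hat)                     # shared activation phi(x_hat)
    x_hat_dot = S @ x_hat + Wf.T @ ph
    x_hat_dot = x_hat_dot + sum(W.T @ ph * uq for W, uq in zip(Wg, u))
    x_hat_dot = x_hat_dot + sum(W.T @ ph * dl for W, dl in zip(Wk, d))
    # tuning laws (18): W_dot = Lambda * phi(x_hat) * (input) * x_tilde'
    Wf += dt * Lf @ np.outer(ph, x_tilde)
    for W, uq in zip(Wg, u):
        W += dt * Lg @ np.outer(ph * uq, x_tilde)
    for W, dl in zip(Wk, d):
        W += dt * Lk @ np.outer(ph * dl, x_tilde)
    return x_hat + dt * x_hat_dot
```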

3.2 Approximate optimal learning scheme with single-critic network

For implementation, only a single critic network is constructed to solve the HJI equation. The optimal cost function \(J^*(x)\) is expressed as

$$\begin{aligned} J^*(x)=\omega _{c}^{{\textsf{T}}}\varphi _{c}(x)+\varepsilon _{c}(x), \end{aligned}$$
(28)

where \(\omega _{c}\in { {\mathbb {R}}}^{n_{c}}\) is the ideal weight, \(\varphi _{c}(x)\in { {\mathbb {R}}}^{n_{c}}\) is the activation function, \(n_c\) represents the number of neurons, and \(\varepsilon _{c}\in { {\mathbb {R}}}\) is the reconstruction error.

The gradient of (28) with respect to \(x\) is

$$\begin{aligned} \nabla J^*(x)=(\nabla \varphi _{c}(x))^{{\textsf{T}}}\omega _{c}+\nabla \varepsilon _{c}(x). \end{aligned}$$
(29)

Then, the approximate formulation of \(J^*(x)\) is written as

$$\begin{aligned} {\hat{J}}^*(x)={{\hat{\omega }}}_{c}^{{\textsf{T}}}\varphi _{c}(x), \end{aligned}$$
(30)

where \({\hat{\omega }}_{c}\) is the estimated weight. Similarly, one has

$$\begin{aligned} \nabla {\hat{J}}^*(x)=(\nabla \varphi _{c}(x))^{{\textsf{T}}}{{\hat{\omega }}}_{c}. \end{aligned}$$
(31)

Utilizing the identification result and considering (10), (11), and (29), we have

$$\begin{aligned} u_{q}^*=-\beta _q\tanh \left( \frac{1}{2\beta _q}R_{q}^{-1}({\hat{\omega }}_{gq}^{{\textsf{T}}}\varphi _{gq})^{{\textsf{T}}}\left( \nabla \varphi _{c}^{{\textsf{T}}}\omega _{c}+\nabla \varepsilon _{c}(x)\right) \right) , \end{aligned}$$
(32)
$$\begin{aligned} d_{l}^*&=\frac{1}{2 \lambda ^2}({\hat{\omega }}_{kl}^{{\textsf{T}}}\varphi _{kl})^{{\textsf{T}}} \left( \nabla \varphi _{c}^{{\textsf{T}}}\omega _{c}+\nabla \varepsilon _{c}(x)\right) . \end{aligned}$$
(33)

In light of (31), the approximate forms of (32) and (33) are stated as

$$\begin{aligned} {\hat{u}}_{q}^*&=-\beta _q \tanh \left( \hat{B}\right) , \end{aligned}$$
(34)
$$\begin{aligned} {\hat{d}}_{l}^*&=\frac{1}{2 \lambda ^2}({\hat{\omega }}_{kl}^{{\textsf{T}}}\varphi _{kl})^{{\textsf{T}}} \nabla \varphi _{c}^{{\textsf{T}}}{\hat{\omega }}_{c}, \end{aligned}$$
(35)

where \(\hat{B}=(1/(2 \beta _q)) R_{q}^{-1} ({\hat{\omega }}_{gq}^{{\textsf{T}}}\varphi _{gq})^{{\textsf{T}}} \nabla \varphi _{c}^{{\textsf{T}}}{\hat{\omega }}_{c}\).

Based on the identifier-critic framework, the approximate Hamiltonian can be presented as

$$\begin{aligned} {\hat{H}}\left( x, {\hat{\omega }}_c,{\hat{u}}_q^*,{\hat{d}}_l^*\right)&=x^{{\textsf{T}}}Qx+2\sum _{q=1}^{N}R_{q}\int _0^{{\hat{u}}_q^*}\beta _q\tanh ^{\mathsf {-T}}(v/\beta _q)\,\mathrm dv-\lambda ^2\sum _{l=1}^{M}({\hat{d}}_l^*)^{{\textsf{T}}}{\hat{d}}_l^*\nonumber \\&\quad +{\hat{\omega }}_{c}^{{\textsf{T}}}\nabla \varphi _{c}(x)\bigg (Sx+{\hat{\omega }}_{f}^{{\textsf{T}}}\varphi _{f}(x)+\sum _{q=1}^{N}{\hat{\omega }}_{gq}^{{\textsf{T}}}\varphi _{gq}(x){\hat{u}}_{q}^*+\sum _{l=1}^{M}{\hat{\omega }}_{kl}^{{\textsf{T}}}\varphi _{kl}(x){\hat{d}}_{l}^*\bigg )\triangleq e_c. \end{aligned}$$
(36)

Based on the ER approach [42], we define the objective function as

$$\begin{aligned} E_{c}=\frac{1}{2}\left( e_{c}^{{\textsf{T}}}e_{c}+\sum _{p=1}^{Z_P}e^{{\textsf{T}}}(t_p)e(t_p)\right) , \end{aligned}$$
(37)

where \(e(t_p)=h(x(t_p),{\hat{u}}_q^*,{\hat{d}}_l^*)+{\hat{\omega }}_{c}^{{\textsf{T}}}\phi _p\), \(\phi _p=\nabla \varphi _{c}(x(t_p))\big (Sx(t_p)+{\hat{\omega }}_{f}^{{\textsf{T}}}\varphi _{f}(x(t_p))+\sum _{q=1}^{N}{\hat{\omega }}_{gq}^{{\textsf{T}}}\varphi _{gq}(x(t_p))\hat{u}_{q}^*+\sum _{l=1}^{M}{\hat{\omega }}_{kl}^{{\textsf{T}}}\varphi _{kl}(x(t_p))\hat{d}_{l}^*\big )\), and \(p\in \{1,\dots ,Z_P\}\) is the index of the stored samples.

To minimize the objective function \(E_{c}\), we construct a novel critic weight tuning law based on the gradient descent technique as follows

$$\begin{aligned} {\dot{{\hat{\omega }}}}_{c}&=-\alpha _{c}\left( \frac{\partial E_{c}}{\partial {{\hat{\omega }}}_{c}}\right) \nonumber \\&=-\alpha _{c}\phi (\phi ^{{\textsf{T}}}{{\hat{\omega }}}_{c}+h(x,{\hat{u}}_q^*,{\hat{d}}_l^*))\nonumber \\&\quad -\alpha _{c}\sum _{p=1}^{Z_P}\phi _p(\phi _p^{{\textsf{T}}}{{\hat{\omega }}}_{c}+h(x(t_p),{\hat{u}}_q^*,{\hat{d}}_l^*)), \end{aligned}$$
(38)

where \(\alpha _{c}>0\) is the adjustable learning rate of the critic network and \(\phi ={\nabla \varphi }_{c}(x)(Sx +{\hat{\omega }}_{f}^{{\textsf{T}}}\varphi _{f}(x)+\sum _{q=1}^N{\hat{\omega }}_{gq}^{{\textsf{T}}}\varphi _{gq}(x)\hat{u}_{q}^*+\sum _{l=1}^M{\hat{\omega }}_{kl}^{{\textsf{T}}}\varphi _{kl}(x)\hat{d}_{l}^*)\).
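A minimal sketch of one Euler step of tuning law (38); the replay buffer of stored pairs \((\phi _p,h_p)\) and all names are illustrative assumptions:

```python
import numpy as np

def critic_update(w_c, phi_now, h_now, replay, alpha_c, dt=1e-3):
    """One Euler step of the ER-based tuning law (38); replay holds (phi_p, h_p)."""
    grad = phi_now * float(phi_now @ w_c + h_now)   # current-data term
    for phi_p, h_p in replay:                       # experience-replay term
        grad += phi_p * float(phi_p @ w_c + h_p)
    return w_c - dt * alpha_c * grad
```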

Remark 2

According to [55], the second term in (38) serves to relax the PE condition. Unlike the PE condition, the new condition is convenient to check during the online learning process. That is, the ER approach is straightforward to implement by using historical system data.

Remark 3

When using the ER approach, the following condition should be satisfied. Define \(\varXi =[\varphi _{c}(x(t_1)),\dots , \varphi _{c}(x(t_{Z_P}))]\) as the historical data matrix. Then \(\varXi \) must contain as many linearly independent elements as there are neurons, i.e., \(\textrm{rank}(\varXi )=n_c\).
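This rank condition can be checked directly from the stored samples during online learning; a small sketch (the basis function and sample list are placeholders):

```python
import numpy as np

def er_rank_condition(stored_states, phi_c, n_c):
    """Return True if rank(Xi) = n_c for Xi = [phi_c(x(t_1)), ..., phi_c(x(t_ZP))]."""
    Xi = np.column_stack([phi_c(x) for x in stored_states])  # n_c x Z_P matrix
    return np.linalg.matrix_rank(Xi) == n_c
```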

Define the weight estimation error of the critic network as \({{\tilde{\omega }}}_{c}=\omega _{c}-{{\hat{\omega }}}_{c}\). Then, by taking the time derivative, we have

$$\begin{aligned} {\dot{{\tilde{\omega }}}}_{c}=-\alpha _{c}\phi \left( \phi ^{{\textsf{T}}}{{\tilde{\omega }}}_{c}-\varepsilon _{H}\right) -\alpha _{c}\sum _{p=1}^{Z_P}\phi _p\left( \phi _p^{{\textsf{T}}}{{\tilde{\omega }}}_{c}-\varepsilon _{H_p}\right) , \end{aligned}$$
(39)

where \(\varepsilon _{H}=-\nabla \varepsilon _{c}^{{\textsf{T}}}(x)\big (Sx+{\hat{\omega }}_{f}^{{\textsf{T}}}\varphi _{f}(x)+\sum _{q=1}^{N}{\hat{\omega }}_{gq}^{{\textsf{T}}}\varphi _{gq}(x)\hat{u}_{q}^*+\sum _{l=1}^{M}{\hat{\omega }}_{kl}^{{\textsf{T}}}\varphi _{kl}(x)\hat{d}_{l}^*\big )\) and \(\varepsilon _{H_p}=-\nabla \varepsilon _{c}^{{\textsf{T}}}(x(t_p))\big (Sx(t_p)+{\hat{\omega }}_{f}^{{\textsf{T}}}\varphi _{f}(x(t_p))+\sum _{q=1}^{N}{\hat{\omega }}_{gq}^{{\textsf{T}}}\varphi _{gq}(x(t_p))\hat{u}_{q}^*+\sum _{l=1}^{M}{\hat{\omega }}_{kl}^{{\textsf{T}}}\varphi _{kl}(x(t_p))\hat{d}_{l}^*\big )\) are the residual errors.

Fig. 1: Structure of the ADP-based optimal control scheme. The solid line represents the signal flow, while the dashed line denotes the neural network back-propagation path.

Based on the above discussion, the structure of the ADP-based optimal control scheme is shown in Fig. 1.

3.3 Stability analysis

In this subsection, the stability analysis of the multi-player ZSG is presented. First, the following assumptions, used in [42, 46], and [50], are provided.

Assumption 3

Denote \(z_{g}\), \(z_{k}\), and \(z_{\omega _c}\) as positive constants. \({\hat{\omega }}_{gq}\), \({\hat{\omega }}_{kl}\), and \(\omega _{c}\) are upper bounded as \(\left\| {\hat{\omega }}_{gq}\right\| \le z_{g}\), \(\left\| {\hat{\omega }}_{kl}\right\| \le z_{k}\), and \(\left\| \omega _{c}\right\| \le z_{\omega _c}\), respectively.

Assumption 4

Denote \(z_{\varepsilon _{c}}\), \(z_{\varepsilon _{cd}}\), \(z_{\varepsilon _{H}}\), and \(z_{\varepsilon _{H_p}}\) as positive constants. \(\varepsilon _{c}\), \(\nabla \varepsilon _{c}\), \(\varepsilon _{H}\), and \({\varepsilon _{H_p}}\) are upper bounded guaranteeing \(\left\| \varepsilon _{c}\right\| \le z_{\varepsilon _{c}}\), \(\left\| \nabla \varepsilon _{c}\right\| \le z_{\varepsilon _{cd}}\), \(\left\| \varepsilon _{H}\right\| \le z_{\varepsilon _{H}}\), and \(\left\| \varepsilon _{H_p}\right\| \le z_{\varepsilon _{H_p}}\), respectively.

Assumption 5

Denote \(z_{\varphi _{c}}\), \(z_{\varphi _{cd}}\), \(z_{\varphi _{gq}}\), and \(z_{\varphi _{kl}}\) as positive constants. \(\varphi _{c}\), \(\nabla \varphi _{c}\), \(\varphi _{gq}\), and \({\varphi _{kl}}\) are upper bounded guaranteeing \(\left\| \varphi _{c}\right\| \le z_{\varphi _{c}}\), \(\left\| \nabla \varphi _{c}\right\| \le z_{\varphi _{cd}}\), \(\left\| \varphi _{gq}\right\| \le z_{\varphi _{gq}}\), and \(\left\| \varphi _{kl}\right\| \le z_{\varphi _{kl}}\), respectively.

Theorem 2

Consider multi-player ZSG (1) with the identifier network, developed control policy (34) and disturbance strategy (35), and single-critic network weight tuning law (38). Then, the UUB stability of the controlled system state and the critic weight estimation error is ensured.

Proof

Select the Lyapunov function as

$$\begin{aligned} L(t)=L_{1}(t)+L_{2}(t) =J^*(x)+\frac{1}{2} {\tilde{\omega }}_{c}^{{\textsf{T}}}{{\tilde{\omega }}}_{c} . \end{aligned}$$
(40)

Calculating the time derivative of \(L_1(t)\) and using reconstructed system (27), one has

$$\begin{aligned} {\dot{L}}_{1}(t)=&\left( \nabla J^*(x)\right) ^{{\textsf{T}}}\big (Sx +{\hat{\omega }}_{f}^{{\textsf{T}}}\varphi _{f}(x) \nonumber \\&+\sum _{q=1}^N{\hat{\omega }}_{gq}^{{\textsf{T}}}\varphi _{gq}(x)\hat{u}_{q}^*+\sum _{l=1}^M{\hat{\omega }}_{kl}^{{\textsf{T}}}\varphi _{kl}(x)\hat{d}_{l}^*\big ). \end{aligned}$$
(41)

Let

$$\begin{aligned} \varpi (u_q)=2R_{q}\int _0^{u_q}\beta _q\tanh ^{\mathsf {-T}}(v/\beta _q)\,\mathrm dv. \end{aligned}$$
(42)

According to [55], substituting (10) into (42) yields

$$\begin{aligned} \varpi \left( u_q^{*}\right) =\beta _q(\nabla J^*)^{{\textsf{T}}}g_q(x)\tanh \left( B^{*}\right) +\beta _q^{2}\bar{R}\ln \left( \bar{\textbf{1}}-\tanh ^{2}\left( B^{*}\right) \right) , \end{aligned}$$
(43)

where \(B^*=[B^*_{1}, B^*_{2}, \ldots , B^*_{m_q}]^{{\textsf{T}}}\) with \(B_l^* \in \mathbb {R}\), \(l=1,2, \ldots , m_q\); \(\tanh ^{2}(\cdot )\) and \(\ln (\cdot )\) act elementwise; \(\bar{\textbf{1}}\) is a column vector having all of its elements equal to 1; and \(\bar{R}=[r_1,\dots ,r_{m_q}]\in \mathbb {R}^{1\times m_q}\).
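For completeness, the scalar-channel identity behind (43) follows from integration by parts (an added derivation step):

$$\begin{aligned} \int _0^{-\beta _q\tanh (B)}\beta _q\tanh ^{-1}(v/\beta _q)\,\mathrm dv=\beta _q^{2}B\tanh (B)+\frac{\beta _q^{2}}{2}\ln \left( 1-\tanh ^{2}(B)\right) , \end{aligned}$$

and substituting \(B^*=(1/(2\beta _q))R_q^{-1}g_q^{{\textsf{T}}}\nabla J^*\) turns the first term, after scaling by \(2R_q\), into \(\beta _q(\nabla J^*)^{{\textsf{T}}}g_q(x)\tanh (B^*)\), which yields (43).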

From (10)–(12) and (43), we obtain

$$\begin{aligned} (\nabla J^*(x))^{{\textsf{T}}}(Sx+{\hat{\omega }}_{f}^{{\textsf{T}}}\varphi _{f}(x))&=-x^{{\textsf{T}}}Qx-\varpi \left( u_q^{*}\right) -(\nabla J^*(x))^{{\textsf{T}}}\sum _{q=1}^{N}{\hat{\omega }}_{gq}^{{\textsf{T}}}\varphi _{gq}(x)u_{q}^*\nonumber \\&\quad -\lambda ^2\sum _{l=1}^{M}(d_{l}^*)^{{\textsf{T}}}d_{l}^*, \end{aligned}$$
(44)
$$\begin{aligned} (\nabla J^*(x))^{{\textsf{T}}}\sum _{l=1}^{M}{\hat{\omega }}_{kl}^{{\textsf{T}}}\varphi _{kl}(x)\hat{d}_{l}^*=2\lambda ^2\sum _{l=1}^{M}(d_{l}^*)^{{\textsf{T}}}\hat{d}_{l}^*. \end{aligned}$$
(45)

Thus, (41) becomes

$$\begin{aligned} {\dot{L}}_{1}(t)&=-x^{{\textsf{T}}}Qx-\varpi \left( u_q^{*}\right) -\left( \nabla J^*(x)\right) ^{{\textsf{T}}}\sum _{q=1}^{N}{\hat{\omega }}_{gq}^{{\textsf{T}}}\varphi _{gq}(x)(u_{q}^*-\hat{u}_{q}^*)\nonumber \\&\quad -\lambda ^2\sum _{l=1}^{M}(d_{l}^*)^{{\textsf{T}}}(d_{l}^*-2\hat{d}_{l}^*)\nonumber \\&=-x^{{\textsf{T}}}Qx-\varpi \left( u_q^{*}\right) +\beta _q\left( \nabla J^*(x)\right) ^{{\textsf{T}}}\sum _{q=1}^{N}{\hat{\omega }}_{gq}^{{\textsf{T}}}\varphi _{gq}(x)\left( \tanh (B^*)-\tanh (\hat{B})\right) \nonumber \\&\quad -\lambda ^2\sum _{l=1}^{M}(d_{l}^*-\hat{d}_{l}^*)^{{\textsf{T}}}(d_{l}^*-\hat{d}_{l}^*)+\lambda ^2\sum _{l=1}^{M}(\hat{d}_{l}^*)^{{\textsf{T}}}\hat{d}_{l}^*. \end{aligned}$$
(46)

Then, utilizing (29), Assumptions 3–5, and the fact that \(\varpi (u_q^*)\) is positive definite [55], (46) can be rewritten as

$$\begin{aligned} {\dot{L}}_{1}(t)&\le -x^{{\textsf{T}}}Qx+2\beta _q(\omega _{c}^{{\textsf{T}}}\nabla \varphi _{c}+\nabla \varepsilon _{c}^{{\textsf{T}}})\sum _{q=1}^{N}{\hat{\omega }}_{gq}^{{\textsf{T}}}\varphi _{gq}(x)\nonumber \\&\quad -\lambda ^2\sum _{l=1}^{M}\Vert d_{l}^*-\hat{d}_{l}^*\Vert ^2+\lambda ^2\sum _{l=1}^{M}\Vert \hat{d}_{l}^*\Vert ^2\nonumber \\&\le -x^{{\textsf{T}}}Qx+2\beta _q z_{\omega _c}z_{\varphi _{cd}}\sum _{q=1}^{N}z_{g}z_{\varphi _{gq}}+2\beta _q z_{\varepsilon _{cd}}\sum _{q=1}^{N}z_{g}z_{\varphi _{gq}}+\frac{1}{4\lambda ^2}z_{\varphi _{cd}}^2\sum _{l=1}^{M}z_{k}^2 z_{\varphi _{kl}}^2\left\| \hat{\omega }_{c}\right\| ^2. \end{aligned}$$
(47)

Recalling \({{\tilde{\omega }}}_{c}=\omega _{c}-{{\hat{\omega }}}_{c}\), we further get that

$$\begin{aligned} {\dot{L}}_{1}(t)&\le -x^{{\textsf{T}}}Qx+b_2-2b_1\omega _{c}^{{\textsf{T}}}{{\tilde{\omega }}}_{c}+b_1\left\| {{\tilde{\omega }}}_{c}\right\| ^2\nonumber \\&\le -\lambda _{\min }(Q)\Vert x\Vert ^2-b_3\Vert {{\tilde{\omega }}}_{c}\Vert ^2+b_2, \end{aligned}$$
(48)

where \(b_1=(1/(4\lambda ^2)) z_{\varphi _{cd}}^2\sum _{l=1}^M z_{k}^2 z_{\varphi _{kl}}^2\), \(b_2=2\beta _q z_{\omega _c} z_{\varphi _{cd}}\sum _{q=1}^N z_{g}z_{\varphi _{gq}}+2\beta _q z_{\varepsilon _{cd}}\sum _{q=1}^Nz_{g}z_{\varphi _{gq}}+b_1z_{\omega _c}^2\), and \(b_3=(1/2)b_1^2-b_1\).

Next, considering (39), the derivative of \(L_2(t)\) is formulated as

$$\begin{aligned} {\dot{L}}_{2}(t)&=-\alpha _{c}{\tilde{\omega }}_{c}^{{\textsf{T}}}\phi \phi ^{{\textsf{T}}}{{\tilde{\omega }}}_{c}-\alpha _{c}\sum _{p=1}^{Z_P}{\tilde{\omega }}_{c}^{{\textsf{T}}}\phi _p\phi _p^{{\textsf{T}}}{{\tilde{\omega }}}_{c}+\alpha _{c}{\tilde{\omega }}_{c}^{{\textsf{T}}}\phi \varepsilon _{H}+\alpha _{c}\sum _{p=1}^{Z_P}{\tilde{\omega }}_{c}^{{\textsf{T}}}\phi _p\varepsilon _{H_p}. \end{aligned}$$
(49)

With the aid of Young's inequality, the last two terms of (49) can be bounded as follows:

$$\begin{aligned}&\alpha _{c}{\tilde{\omega }}_{c}^{{\textsf{T}}}\phi \varepsilon _{H} \le \frac{\alpha _{c}}{2} {\tilde{\omega }}_{c}^{{\textsf{T}}}\phi \phi ^{{\textsf{T}}}{{\tilde{\omega }}}_{c}+\frac{\alpha _{c}}{2}\varepsilon _{H}^{{\textsf{T}}}\varepsilon _{H} , \end{aligned}$$
(50)
$$\begin{aligned}&\alpha _{c}\sum _{p=1}^{Z_P}{\tilde{\omega }}_{c}^{{\textsf{T}}}\phi _p\varepsilon _{H_p}\le \frac{\alpha _{c}}{2}\sum _{p=1}^{Z_P}{\tilde{\omega }}_{c}^{{\textsf{T}}}\phi _p\phi _p^{{\textsf{T}}}{{\tilde{\omega }}}_{c}+\frac{\alpha _{c}}{2}\sum _{p=1}^{Z_P}\varepsilon _{H_p}^{{\textsf{T}}}\varepsilon _{H_p}. \end{aligned}$$
(51)

Applying Assumptions 3–5 and considering (50) and (51), (49) becomes

$$\begin{aligned} {\dot{L}}_{2}(t)\le -\frac{\alpha _{c}}{2}\lambda _{\min }(\varPhi (\phi ,\phi _p))\Vert {{\tilde{\omega }}}_{c}\Vert ^2+\frac{\alpha _{c}(Z_P+1)}{2}z_{\varepsilon _{H}}^2, \end{aligned}$$
(52)

where \(\varPhi (\phi ,\phi _p)=\phi \phi ^{{\textsf{T}}}+\sum _{p=1}^{Z_P}\phi _p\phi _p^{{\textsf{T}}}\).

Combining (48) with (52), one has

$$\begin{aligned} {\dot{L}}(t)&\le -\lambda _{\min }(Q)\Vert x\Vert ^2-b_3\Vert {{\tilde{\omega }}}_{c}\Vert ^2+b_2\nonumber \\&\quad -\frac{\alpha _{c}}{2}\lambda _{\min }(\varPhi (\phi ,\phi _p))\Vert {{\tilde{\omega }}}_{c}\Vert ^2+\frac{\alpha _{c}(Z_P+1)}{2}z_{\varepsilon _{H}}^2. \end{aligned}$$
(53)

Therefore, (53) implies \(\dot{L}(t)<0\) whenever one of the following inequalities holds

$$\begin{aligned} \left\| x\right\| >\sqrt{\frac{2b_2+\alpha _{c}(Z_P+1)z_{\varepsilon _{H}}^2}{2\lambda _{\min }(Q)}}\triangleq \mathscr {D}_1 \end{aligned}$$
(54)

or

$$\begin{aligned} \left\| {{\tilde{\omega }}}_{c}\right\| >\sqrt{\frac{2b_2+\alpha _{c}(Z_P+1)z_{\varepsilon _{H}}^2}{2b_3+\alpha _{c}\lambda _{\min }(\varPhi (\phi ,\phi _p))}}\triangleq \mathscr {D}_2 \end{aligned}$$
(55)

with \(2b_3+\alpha _{c}\lambda _{\min }(\varPhi (\phi ,\phi _p))>0\). It implies that the UUB stability of x and \({{\tilde{\omega }}}_{c}\) is guaranteed. \(\square \)

4 Simulation

In this section, we present a simulation of a multi-player ZSG with constrained control inputs to demonstrate the effectiveness of the established ADP-based identifier-critic framework.

Consider the multi-player ZSG with \(N=2\) control players and \(M=1\) disturbance, described as

$$\begin{aligned} \dot{x}=f(x)+g_{1}(x){u}_{1}+g_{2}(x){u}_{2}+k(x){d}, \end{aligned}$$
(56)

where

$$\begin{aligned} f(x)&=\left[ \begin{array}{c} -0.5x_{1}+0.4x_{2}\\ -0.6x_{1}-0.6x_{2}+0.5x_{2}x_1^2\end{array}\right] ,\\ g_{1}(x)&=\begin{bmatrix}0\\ \sin (x_{1})\end{bmatrix},\quad g_{2}(x)=\begin{bmatrix}0\\ 2x_{1}\end{bmatrix},\quad k(x)=\begin{bmatrix}0\\ x_{1}\end{bmatrix}. \end{aligned}$$

The system state \(x={[x_{1},x_{2}]}^{{\textsf{T}}}\in \mathbb {R}^2\) is initialized to \(x_{0}={[0.8,-0.8]}^{{\textsf{T}}}\), and \(u_{1},u_{2}\in \mathbb {R}\) are the constrained control inputs. Let \(Q=5I_2\), \(R_{1}=R_{2}=I\), and \(\lambda =2\). In this case, we assume the control inputs \(u_1\) and \(u_2\) are constrained by \(\left| u_1\right| \le 0.4\) and \(\left| u_2\right| \le 0.8\), respectively. Then \(\varpi (u_1)\) and \(\varpi (u_2)\) defined in the utility function are

$$\begin{aligned} \varpi (u_1)&=2R_{1}\int _0^{u_1}(0.4\tanh ^{-1}(v/0.4))^{{\textsf{T}}}\,\mathrm dv,\\ \varpi (u_2)&=2R_{2}\int _0^{u_2}(0.8\tanh ^{-1}(v/0.8))^{{\textsf{T}}}\,\mathrm dv, \end{aligned}$$

respectively.

Since the dynamics of (56) are unknown, an identifier network is built to reconstruct them based on (15). In the system identification stage, the initial weights \({\hat{\omega }} _f\), \({\hat{\omega }} _{gq}\), and \({\hat{\omega }} _{kl}\) are chosen randomly from \([-1, 1]\). The identifier activation functions are selected as \(\varphi _f(\cdot )=\varphi _{gq}(\cdot )=\varphi _{kl}(\cdot )=\tanh (\cdot )\), and the learning matrix is \(S=[-1, 0; 0,-1]\). The other parameters are designed as \(\varLambda _f = \varLambda _{gq} = \varLambda _{kl} =[1, 0.4; 0.1, 0.6]\).

For the proposed ADP-based approach, we select the activation function as \(\varphi _{c}(x)={[x_{1}^2,x_{1}x_{2},x_{2}^2]}^{{\textsf{T}}}\). The approximate critic network weight is \({{\hat{\omega }}}_{c}={[{{\hat{\omega }}}_{c1},{{\hat{\omega }}}_{c2},{{\hat{\omega }}}_{c3}]}^{{\textsf{T}}}\), whose initial values are randomly selected from \([-1,1]\).
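For this quadratic basis the gradient needed in (34) and (35) is available in closed form; the following sketch evaluates the approximate policies, using the true \(g_1\), \(g_2\), \(k\) in place of the identifier outputs as an illustrative simplification:

```python
import numpy as np

def phi_c(x):
    """Critic basis [x1^2, x1*x2, x2^2]."""
    return np.array([x[0]**2, x[0]*x[1], x[1]**2])

def grad_phi_c(x):
    """Jacobian of phi_c with respect to x, shape 3 x 2."""
    return np.array([[2*x[0], 0.0],
                     [x[1],   x[0]],
                     [0.0,  2*x[1]]])

def approx_policies(x, w_c, g1, g2, k, betas=(0.4, 0.8), lam=2.0, R=(1.0, 1.0)):
    """Approximate laws (34)-(35) with grad J_hat = grad_phi_c(x)' w_c."""
    gJ = grad_phi_c(x).T @ w_c
    u1 = -betas[0] * np.tanh(g1(x) @ gJ / (2.0 * betas[0] * R[0]))
    u2 = -betas[1] * np.tanh(g2(x) @ gJ / (2.0 * betas[1] * R[1]))
    d = k(x) @ gJ / (2.0 * lam**2)
    return u1, u2, d
```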

We employ the ER method with recorded data to relax the PE condition. The number of historical data samples for the critic network is set to \(Z_P = 12\). The critic network is then trained for 80 s with the novel weight tuning law, which combines the ER technique with the standard gradient descent algorithm.
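Putting the pieces together, a schematic closed-loop run under the settings above might look as follows (Euler integration; the probing noise, sample storage, and identifier coupling of the actual experiment are omitted, so this is a sketch rather than the exact implementation):

```python
import numpy as np

def f(x):
    return np.array([-0.5*x[0] + 0.4*x[1],
                     -0.6*x[0] - 0.6*x[1] + 0.5*x[1]*x[0]**2])
def g1(x): return np.array([0.0, np.sin(x[0])])
def g2(x): return np.array([0.0, 2.0*x[0]])
def kd(x): return np.array([0.0, x[0]])

dt, T = 1e-3, 80.0
x = np.array([0.8, -0.8])
w_c = np.random.uniform(-1.0, 1.0, 3)
replay = []                  # would hold Z_P = 12 stored (phi_p, h_p) pairs
for _ in range(int(T / dt)):
    u1, u2, d = approx_policies(x, w_c, g1, g2, kd)   # sketch above
    x = x + dt * (f(x) + g1(x)*u1 + g2(x)*u2 + kd(x)*d)
    # w_c = critic_update(w_c, phi_now, h_now, replay, alpha_c)  # see (38)
```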

Fig. 2: Curves of reconstruction errors.

Simulation results are depicted in Figs. 2, 3, 4, 5, 6, and 7. The convergence curves of the reconstruction errors of the neural network-based identifier are depicted in Fig. 2; the errors converge to a small region around the origin at about \(t = 20\) s, which illustrates that the identifier network reconstructs system (56) well. Figure 3 shows the convergence of the system states for the ZSG with control constraints; the states finally converge to the equilibrium point (0, 0). The convergence of the critic network weights is displayed in Fig. 4: the weights stabilize after \(t=20\) s and finally converge to \({{\hat{\omega }}}_{c}=[1.0912,0.0725,1.3409]^{{\textsf{T}}}\). To demonstrate the effectiveness of the proposed ADP-based learning approach, we also apply the method of [28] to system (56); the resulting weight convergence is shown in Fig. 5. Comparing Figs. 4 and 5, it is evident that the developed algorithm accelerates the convergence of the critic network weights. The converged weights are then inserted into (34) and (35) to obtain the approximate optimal control strategies \(\{{\hat{u}}_{1}^*,{\hat{u}}_{2}^*\}\) and the approximate worst disturbance strategy \({\hat{d}}^*\) for nonlinear ZSG (56).

Fig. 3: State trajectories.

Fig. 4: Convergence curves of \({{\hat{\omega }}}_{c}\).

Fig. 5: Convergence curves of \({{\hat{\omega }}}_{c}\) under the method in [28].

Fig. 6: Trajectories of constrained control inputs.

Fig. 7: Trajectory of the unconstrained disturbance input.

Figure 6 shows the trajectories of the constrained control inputs during the control process. As illustrated in Fig. 6, the constrained control inputs \(u_1\) and \(u_2\) are effectively limited by the predetermined bounds \(\left| u_q\right| \le \beta _q\,(q=1,2)\) as expected, which indicates that the control input signals stay within the control constraints and demonstrates the effectiveness of the constrained policy. Figure 7 presents the trajectory of the unconstrained disturbance input, which converges to an adjustable neighborhood of zero. Therefore, the simulation results confirm the effectiveness of the designed ADP-based scheme in the identifier-critic form and show that the framework is applicable to nonlinear multi-player ZSGs with constrained control inputs.

5 Conclusion

In this article, the multi-player ZSG problem with unknown system dynamics and control constraints is handled by employing a novel ADP-based learning framework. Initially, a neural network-based identifier is adopted to rebuild the system dynamics from system data. Then, we define a new non-quadratic utility function that addresses the control constraints and obtain the constrained HJI equation. Furthermore, a single-critic network mechanism is designed to approximately solve the constrained HJI equation. Subsequently, a novel weight tuning rule based on the ER algorithm is constructed to approach the optimal control strategies and the worst disturbance strategies; hence, the traditional PE condition is removed via the recorded and current data. Additionally, the UUB stability of the multi-player system and the critic network weight estimation error is analyzed. Finally, we demonstrate the convergence and performance of the proposed scheme through simulation studies. A limitation of the proposed scheme is the reconstruction error inevitably introduced by the identifier. In future work, relaxing the requirement of system dynamics without reconstruction errors may be investigated.