9.1 Introduction

Game theory is concerned with the study of decision making in situations where two or more rational opponents are involved under conditions of conflicting interests. It has been widely investigated by many authors [5, 7, 8, 12, 13]. Although the nonlinear optimal control solution in terms of the Hamilton–Jacobi–Bellman equation is hard to obtain directly [4], that problem at least involves only a single controller or decision maker. In the previous chapter, we studied discrete-time zero-sum games based on the ADP method. In this chapter, we consider continuous-time games.

In much of the literature on zero-sum differential games [1, 6, 11], the existence of the saddle point is assumed before the saddle point is obtained. In many real-world applications, however, the saddle point of a game may not exist, which means that only a mixed optimal solution of the game can be obtained. In Sect. 9.2, for a class of affine nonlinear zero-sum games, we study how to obtain the saddle point by the ADP method without imposing complex existence conditions, and how to obtain the mixed optimal solution when the saddle point does not exist. Note that many practical zero-sum games have nonaffine control inputs. In Sect. 9.3, we therefore focus on finite horizon zero-sum games for a class of nonaffine nonlinear systems.

Non-zero-sum differential game theory also has a number of potential applications in control engineering, economics, and the military field [9]. In zero-sum differential games, the two players act on a single cost functional, which one of them minimizes and the other maximizes. In non-zero-sum games, by contrast, the control objective is to find a set of policies that guarantee the stability of the system and minimize each player's individual performance function so as to yield a Nash equilibrium. In Sect. 9.4, non-zero-sum differential games will be studied using a single-network ADP.

9.2 Infinite Horizon Zero-Sum Games for a Class of Affine Nonlinear Systems

In this section, nonlinear infinite horizon zero-sum differential games are studied. We propose a new iterative ADP method that is effective whether or not the saddle point exists. When the saddle point exists, the usual existence conditions are avoided and the value function reaches the saddle point under the present iterative ADP method. When the saddle point does not exist, the mixed optimal value function is obtained under a deterministic mixed optimal control scheme, using the same iterative ADP algorithm.

9.2.1 Problem Formulation

Consider the following two-person zero-sum differential games. The system is described by the continuous-time affine nonlinear equation

$$ \dot{x}(t) = f(x(t),u(t),w(t)) = f(x(t)) + g(x(t))u(t) + k(x(t))w(t), $$
(9.1)

where \(x(t)\in\mathbb{R}^n\), \(u(t)\in\mathbb{R}^k\), \(w(t)\in\mathbb{R}^m\), and the initial condition \(x(0)=x_0\) is given.

The cost functional is the generalized quadratic form given by

$$ J(x(0),u,w) = \int_0^\infty l(x(t),u(t),w(t))\,\mathrm{d}t, $$
(9.2)

where \(l(x,u,w)=x^\mathrm{T}Ax+u^\mathrm{T}Bu+w^\mathrm{T}Cw+2u^\mathrm{T}Dw+2x^\mathrm{T}Eu+2x^\mathrm{T}Fw\). The matrices A, B, C, D, E, and F have suitable dimensions and A≥0, B>0, and C<0. According to the roles of the two players, we have the following definitions. Let \(\overline{J} (x): = \inf _{u } \sup_{w } J(x,u,w)\) be the upper value function and \(\underline{J}(x): = \sup_{w } \inf _{u } J(x,u,w)\) be the lower value function, with the obvious inequality \(\overline{J}(x) \ge \underline{J}(x)\). Define the optimal control pairs for the upper and lower value functions to be \((\overline{u}, \overline{w})\) and \((\underline{u}, \underline{w})\), respectively. Then, we have \(\overline{J} (x) = J(x,\overline{u},\overline{w})\) and \(\underline {J} (x) = J(x,\underline{u},\underline{w})\).

If both \(\overline{J} (x)\) and \(\underline{J}(x)\) exist and

$$ \overline{J} (x)= \underline{J}(x)=J^\ast(x) $$
(9.3)

holds, we say that the saddle point exists, and the corresponding optimal control pair is denoted by \((u^\ast, w^\ast)\).
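The following minimal numpy sketch makes the generalized quadratic utility \(l(x,u,w)\) of (9.2) and its sign conditions concrete; all dimensions and numerical values are hypothetical placeholders chosen only for illustration, not taken from this chapter.

```python
import numpy as np

# Hypothetical dimensions (n=2, k=1, m=1) and weight matrices, chosen only to
# illustrate the generalized quadratic utility l(x, u, w) and its sign conditions.
A = np.diag([1.0, 0.5])          # A >= 0 (positive semidefinite)
B = np.array([[2.0]])            # B > 0  (positive definite)
C = np.array([[-1.0]])           # C < 0  (negative definite)
D = np.array([[0.1]])            # u-w cross weight (k x m)
E = np.array([[0.2], [0.0]])     # x-u cross weight (n x k)
F = np.array([[0.0], [0.3]])     # x-w cross weight (n x m)

def utility(x, u, w):
    """Generalized quadratic utility l(x,u,w) of the cost functional (9.2)."""
    return (x @ A @ x + u @ B @ u + w @ C @ w
            + 2 * u @ D @ w + 2 * x @ E @ u + 2 * x @ F @ w)

# Check the sign conditions through the eigenvalues of the symmetric weights.
assert np.all(np.linalg.eigvalsh(A) >= 0)
assert np.all(np.linalg.eigvalsh(B) > 0)
assert np.all(np.linalg.eigvalsh(C) < 0)

x, u, w = np.array([1.0, -0.5]), np.array([0.2]), np.array([0.1])
print("l(x,u,w) =", utility(x, u, w))
```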

We have the following lemma.

Lemma 9.1

If the nonlinear system (9.1) is controllable and both the upper value function and lower value function exist, then \(\overline{J}(x)\) is a solution of the following upper Hamilton–Jacobi–Isaacs (HJI) equation:

(9.4)

which is denoted by \(\mathrm{HJI}(\overline{J}(x), \overline{u},\overline{w})=0 \) and \(\underline{J}(x)\) is a solution of the following lower HJI equation:

(9.5)

which is denoted by \(\mathrm{HJI}(\underline{J}(x), \underline{u}, \underline{w})=0\).

9.2.2 Zero-Sum Differential Games Based on Iterative ADP Algorithm

As the HJI equations (9.4) and (9.5) cannot be solved in general, in the following, a new iterative ADP method for zero-sum differential games is developed.

9.2.2.1 Derivation of the Iterative ADP Method

The goal of the present iterative ADP method is to obtain the saddle point. As the saddle point may not exist, this motivates us to obtain the mixed optimal value function J o(x) where \(\underline{J}(x) \le J^{o}(x) \le\overline{J}(x)\).

Theorem 9.2

(cf. [15])

Let \((\overline{u}, \overline{w})\) be the optimal control pair for \(\overline{J}(x)\) and \((\underline{u}, \underline{w})\) be the optimal control pair for \(\underline{J}(x)\). Then, there exist control pairs \((\overline{u},w)\) and \((u,\underline{w})\) which lead to \(J^{o}(x)=J(x,\overline{u},w)=J(x,u,\underline{w})\). Furthermore, if the saddle point exists, then J o(x)=J (x).

Proof

According to the definition of \(\overline{J} (x)\), we have \(J(x,\overline{u},w) \leq J(x,\overline{u},\overline{w})\). As J o(x) is a mixed optimal value function, we also have \(J^{o}(x) \leq J(x,\overline{u},\overline{w})\). As the system (9.1) is controllable and w is continuous on ℝm, there exists a control pair \((\overline{u},w)\) which makes \(J^{o}(x)=J(x,\overline{u},w)\). On the other hand, we have \(J^{o}(x) \geq J(x,\underline{u},\underline{w})\). We also have \(J(x,{u},\underline{w}) \geq J(x,\underline{u},\underline{w})\). As u is continuous on ℝk, there exists a control pair \(({u},\underline{w})\) which makes \(J^{o}(x)=J(x,{u},\underline{w})\). If the saddle point exists, we have (9.3). On the other hand, \(\underline{J}(x) \leq J^{o}(x) \leq\overline{J} (x)\). Then, clearly J o(x)=J (x). □

If (9.3) holds, we have a saddle point; if not, we adopt a mixed trajectory to obtain the mixed optimal solution of the game. To apply the mixed trajectory method, a game matrix over the trajectory sets of the control pair (u,w) is needed. Small Gaussian noises \(\gamma_u\in\mathbb{R}^k\) and \(\gamma_w\in\mathbb{R}^m\) are introduced and added to the optimal controls \(\underline{u}\) and \(\overline{w}\), respectively, where \(\gamma_{u}^{i} \sim N(0, \sigma_{i}^{2})\), i=1,…,k, and \(\gamma_{w}^{j} \sim N(0, \sigma_{j}^{2})\), j=1,…,m, are zero-mean Gaussian noises with variances \(\sigma_{i}^{2}\) and \(\sigma_{j}^{2}\), respectively.

We define the expected value function as

\(E(J(x))= \min_{P_{\mathrm{I}i} } \max_{P_{\mathrm{II}j} } \sum_{i = 1}^{2} \sum_{j = 1}^{2} {P_{\mathrm{I}i} L_{ij} P_{\mathrm{II}j} }\), where we let \(L_{11}= J(x,\overline {u},\overline{w})\), \(L_{12}=J(x,(\underline{u} + \gamma_{u} ),\underline{w})\), \(L_{21}=J(x,\underline{u},\underline{w})\), and \(L_{22}=J(x,\overline{u}, (\overline{w} + \gamma_{w} ))\). Let \(\sum_{i = 1}^{2} {P_{\mathrm{I}i} } = 1 \) with \(P_{\mathrm{I}i} >0\), and \(\sum_{j = 1}^{2} {P_{\mathrm{II}j} } = 1 \) with \(P_{\mathrm{II}j} >0\). Next, let N be a large enough positive integer. Calculating the expected value function N times, we obtain \(E_1(J(x)), E_2(J(x)), \ldots, E_N(J(x))\). Then, the mixed optimal value function can be written as

$$J^o (x) = E(E_i (J(x))) = \displaystyle\frac{1}{N}\sum\limits_{i = 1}^N {E_i (J(x))}. $$
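The expected value function above is the value of a 2×2 matrix game played in mixed strategies. As a hedged illustration (scipy is assumed to be available, and the entries of the game matrix and the noise level are placeholders standing in for the simulated costs, not values from this chapter), the sketch below solves one such game by linear programming for the minimizing player \(P_{\mathrm{I}}\) and then averages N realizations, as in the definition of \(J^o(x)\).

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)

def mixed_game_value(L):
    """Value min_{P_I} max_{P_II} P_I^T L P_II of a 2x2 zero-sum matrix game.
    Decision variables: [p1, p2, v], where (p1, p2) is the minimizer's mix."""
    c = np.array([0.0, 0.0, 1.0])                       # minimize v
    A_ub = np.array([[L[0, 0], L[1, 0], -1.0],          # p^T L[:, j] <= v for each
                     [L[0, 1], L[1, 1], -1.0]])         # pure column j of P_II
    b_ub = np.zeros(2)
    A_eq = np.array([[1.0, 1.0, 0.0]])                  # p1 + p2 = 1
    b_eq = np.array([1.0])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0, 1), (0, 1), (None, None)])
    return res.x[2]

# Placeholder payoffs: L11 = J(x, u_bar, w_bar) and L21 = J(x, u_low, w_low) are fixed;
# L12 and L22 change across realizations because of the Gaussian control noise.
L_base = np.array([[0.65, 0.45],
                   [0.45, 0.65]])
N = 500                                      # the chapter's example uses N = 5000
values = []
for _ in range(N):
    L = L_base.copy()
    L[0, 1] += 0.05 * rng.standard_normal()  # stands in for J(x, u_low + gamma_u, w_low)
    L[1, 1] += 0.05 * rng.standard_normal()  # stands in for J(x, u_bar, w_bar + gamma_w)
    values.append(mixed_game_value(L))

J_mixed = np.mean(values)                    # J^o(x) = (1/N) sum_i E_i(J(x))
print("mixed optimal value estimate:", J_mixed)
```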

Remark 9.3

In the classical mixed trajectory method, the whole control sets ℝk and ℝm should be searched under some distribution functions. As there are no constraints on either control, there exist controls that make the system unstable, which is not permitted for real-world control systems. Thus, it is impossible to search the whole control sets, and we can only search the local area around the stable controls that guarantees stability of the system. This is the reason why the small Gaussian noises γ u and γ w are introduced; they define a local stable neighborhood of the control pairs. A proposition will be given to show that the control pair chosen in this local area is stable (see Proposition 9.14). Similar work can also be found in [3, 14].

The mixed optimal solution is a mathematical expectation, which means that it cannot be realized exactly once the trajectories are fixed. For most practical optimal control problems, however, the expected optimal solution (or mixed optimal solution) has to be achieved. To overcome this difficulty, a new method is developed in this section. Let \(\alpha = {{({J^{o}}(x) - \underline{J}(x))} / {(\overline{J}(x) -\underline{J} (x))}}\). Then, J o(x) can be written as \(J^{o} (x) = \alpha\overline{J}(x) + (1 - \alpha)\underline{J}(x)\). Let \(l^{o}(x,\overline{u},\overline{w}, \underline{u}, \underline{w})=\alpha l(x,\overline{u},\overline{w})+ (1-\alpha) l(x,\underline{u}, \underline{w})\). We have \(J^{o} (x(0)) = \int_{0}^{\infty}{l^{o}} \mathrm{d}t\). According to Theorem 9.2, the mixed optimal control pair can be obtained by regulating the control w in the control pair \((\overline{u}, \overline{w})\) so as to minimize the error between \(\mathcal{J}(x)\) and J o(x), where the value function \(\mathcal{J}(x)\) is defined as \(\mathcal{J}(x(0)) = \,J (x(0), \overline{u}, w) = \int_{0}^{\infty}l(x,\overline{u},w)\mathrm{d}t\) and \(\underline{J}(x(0)) \leq\mathcal {J}(x(0)) \leq\overline{J}(x(0))\).

Define \(\widetilde{J}(x(0)) = \int_{0}^{\infty}{\widetilde{l}(x,w)}\, \mathrm{d}t\), where \(\widetilde{l}(x,w)=l(x,\overline{u},w) - l^{o}(x,\overline {u},\overline{w}, \underline{u}, \underline{w})\). Then, the problem can be described as \(\min_{w} (\widetilde{J}(x))^{2}\).

According to the principle of optimality, when \(\widetilde{J}(x)\geq 0\) we have the following HJB equation:

(9.6)

For \(\widetilde{J}(x)< 0\), we have \(-\widetilde{J}(x)=-(\mathcal {J}(x)-J^{o}(x))> 0\), and we can obtain the same HJB equation as (9.6).

9.2.2.2 The Iterative ADP Algorithm

Given the above preparation, we now formulate the iterative ADP algorithm for zero-sum differential games as follows:

  1. 1.

    Initialize the algorithm with a stabilizing control pair \((u^{[0]},w^{[0]})\) and an initial value function \(V^{[0]}\). Choose the computation precision ζ>0. Set i=0.

  2. 2.

    For the upper value function, let

    (9.7)

    where the iterative optimal control pair is formulated as

    $$ \overline{u}^{[i+1]} = - \frac{1}{2}B^{ - 1}\big(2D\overline{w}^{[i+1]} + 2E^\mathrm{T}x+g^\mathrm{T}(x)\overline{V}^{[i]}_x\big), $$
    (9.8)

    and

    $$ \overline{w}^{[i+1]} = - \frac{1}{2}C^{ - 1}\big(2D^\mathrm{T}\overline{u}^{[i+1]} + 2F^\mathrm{T}x+k^\mathrm{T} (x)\overline{V}^{[i]}_x\big). $$
    (9.9)

    \((\overline{u}^{[i]},\overline{w}^{[i]})\) satisfies the HJI equation \(\mathrm{HJI}(\overline{V}^{[i]}(x)\), \(\overline{u}^{[i]},\overline{w}^{[i]})=0\), and \(\overline{V}_{x}^{[i]} = d\overline{V}^{[i]}(x)/ dx\). (A numerical sketch of the coupled control update (9.8)–(9.9) is given after the algorithm listing.)

  3. 3.

    If \(| \overline{V}^{[i+1]} (x(0)) - \overline{V}^{[i]} (x(0)) | < \zeta\), let \(\overline{u} = \overline{u}^{[i]} \), \(\overline{w} =\overline{w}^{[i]} \) and \(\overline{J}(x)=\overline{V}^{[i+1]} (x)\). Set i=0 and go to Step 4. Else, set i=i+1 and go to Step 2.

  4. 4.

    For the lower value function, let

    (9.10)

    where the iterative optimal control pair is formulated as

    $$ \underline{u}^{[i+1]} = - \frac{1}{2}B^{ - 1} \big(2D\underline{w}^{[i+1]} +2E^\mathrm{T}x+ g^\mathrm{T}(x)\underline {V}_x^{[i]} \big), $$
    (9.11)

    and

    $$ \underline{w}^{[i+1]} = - \frac{1}{2}C^{ - 1}\big(2D^\mathrm{T}\underline{u}^{[i+1]} + 2F^\mathrm{T}x+k^\mathrm{T}(x)\underline{V}^{[i]}_x\big). $$
    (9.12)

    \((\underline{u}^{[i]},\underline{w}^{[i]})\) satisfies the HJI equation \(\mathrm{HJI}(\underline{V}^{[i]}(x), \underline{u}^{[i]},\underline{w}^{[i]})=0\), and \(\underline{V}_{x}^{[i]} = {d{\underline{V}^{[i]}}(x)} / {dx}\).

  5. 5.

    If \(| {\underline{V}^{[i+1]} (x(0)) - \underline{V}^{[i]} (x(0))} | < \zeta\), let \(\underline{u}=\underline{u}^{[i]}\), \(\underline{w}=\underline{w}^{[i]}\) and \(\underline{J}(x)=\underline{V}^{[i+1]} (x)\). Set i=0 and go to Step 6. Else, set i=i+1 and go to Step 4.

  6. 6.

    If \(| {\overline{J}(x(0)) - \underline{J}(x(0))} | < \zeta\), stop, and the saddle point is achieved. Else set i=0 and go to the next step.

  7. 7.

    Regulate the control w for the upper value function and let

    (9.13)

    The iterative optimal control is formulated as

    $$ {w}^{[i]} = - \frac{1}{2}C^{ - 1}(2D^\mathrm{T} \overline{u} + 2F^\mathrm{T}x+k^\mathrm{T}(x) \widetilde{V}^{[i+1]}_x), $$
    (9.14)

    where \(\widetilde{V}_{x}^{[i]} = d{\widetilde{V}^{[i]}}(x)/ dx\).

  8. 8.

    If \(| \mathcal{V}^{[i+1]} (x(0)) - J^{o} (x(0)) | < \zeta\), stop. Else, set i=i+1 and go to Step 7.
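In steps 2, 4, and 7, the only quantity that has to be computed pointwise is the control pair satisfying the stationarity conditions (9.8)–(9.9) (or their lower-value counterparts (9.11)–(9.12)). Once \(\overline{V}_x^{[i]}(x)\) is fixed, these conditions are linear in (u, w) and can be solved jointly as a single linear system. The sketch below does this with numpy; the weight matrices, the input gains, and the gradient value are hypothetical placeholders, so this is an illustration of the coupled update rather than the chapter's implementation.

```python
import numpy as np

# Hypothetical problem data (n=2, k=1, m=1); the values only illustrate the solve.
B = np.array([[2.0]]); C = np.array([[-1.0]]); D = np.array([[0.1]])
E = np.array([[0.2], [0.0]]); F = np.array([[0.0], [0.3]])
g = np.array([[0.0], [1.0]])        # g(x), n x k
k = np.array([[0.0], [0.5]])        # k(x), n x m

def control_pair(x, Vx):
    """Solve the coupled stationarity conditions
       2B u + 2D w   = -(2E^T x + g^T Vx)
       2D^T u + 2C w = -(2F^T x + k^T Vx)
    jointly for (u, w), with the critic gradient Vx = dV/dx held fixed."""
    M = np.block([[2 * B, 2 * D],
                  [2 * D.T, 2 * C]])
    rhs = -np.concatenate([2 * E.T @ x + g.T @ Vx,
                           2 * F.T @ x + k.T @ Vx])
    uw = np.linalg.solve(M, rhs)
    return uw[:1], uw[1:]           # (u, w)

x  = np.array([1.0, -0.5])
Vx = np.array([0.4, -0.2])          # placeholder for the critic gradient at x
u, w = control_pair(x, Vx)
print("u =", u, " w =", w)
```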

9.2.2.3 Properties of the Iterative ADP Algorithm

In this part, some results are presented to show the stability and convergence of the present iterative ADP algorithm.

Theorem 9.4

(cf. [15])

If for ∀ i≥0, \(\mathrm{HJI}(\overline {V}^{[i]}(x),\overline{u}^{[i]}, \overline{w}^{[i]}) = 0\) holds, and for ∀ t, \(l(x, \overline{u}^{[i]}, \overline{w}^{[i]}) \ge 0\), then the control pairs \(( \overline{u}^{[i]},\overline{w}^{[i]})\) make system (9.1) asymptotically stable.

Proof

According to (9.7), for ∀ t, taking the derivative of \(\overline{V}^{[i]}(x)\), we have

(9.15)

From the HJI equation we have

(9.16)

Combining (9.15) and (9.16), we get

(9.17)

According to (9.8) we have

(9.18)

So, \(\overline{V}^{[i]}(x)\) is a Lyapunov function. Let ε>0 and ∥x(t 0)∥<δ(ε). Then, there exist two functions α(∥x∥) and β(∥x∥) which belong to class \(\mathcal{K}\) and satisfy

(9.19)

Therefore, system (9.1) is asymptotically stable. □

Theorem 9.5

(cf. [15])

If for ∀ i≥0, \(\mathrm{HJI}(\underline {V}^{[i]}(x),\underline{u}^{[i]}, \underline{w}^{[i]}) = 0\) holds, and for ∀ t, \(l(x,\underline{u}^{[i]},\underline{w}^{[i]}) < 0\), then the control pairs \((\underline{u}^{[i]}, \underline{w}^{[i]})\) make system (9.1) asymptotically stable.

Corollary 9.6

If for ∀ i≥0, \(\mathrm{HJI}(\underline{V}^{[i]}(x), \underline{u}^{[i]},\underline{w}^{[i]}) = 0\) holds, and for ∀ t, \(l(x,\underline{u}^{[i]},\underline{w}^{[i]}) \geq0\), then the control pairs \(( \underline{u}^{[i]},\underline{w}^{[i]})\) make system (9.1) asymptotically stable.

Proof

As \(\underline{V}^{[i]}(x) \leq\overline{V}^{[i]}(x)\) and \(l(x,\underline{u}^{[i]},\underline{w}^{[i]}) \geq0\), we have \(0 \leq\underline{V}^{[i]}(x) \leq\overline{V}^{[i]}(x)\).

From Theorem 9.4, we know that for ∀t 0, there exist two functions α(∥x∥) and β(∥x∥) which belong to class \(\mathcal{K}\) and satisfy (9.19).

As \(\overline{V}^{[i]}(x) \rightarrow0\), there exist time instants t 1 and t 2 (without loss of generality, let t 0<t 1<t 2) that satisfy

(9.20)

Choose ε 1>0 that satisfies \(\underline{V}^{[i]} (x(t_{0})) \geq\alpha(\varepsilon_{1}) \geq\overline{V}^{[i]} (x(t_{2}))\). Then, there exists δ 1(ε 1)>0 that makes \(\alpha(\varepsilon_{1}) \geq\beta(\delta_{1}) \geq\overline {V}^{[i]} (x(t_{2})) \). Then, we obtain

(9.21)

According to (9.19), we have

(9.22)

Since α(∥x∥) belongs to class \(\mathcal{K}\), we obtain ∥x∥≤ε.

Therefore, we conclude that the system (9.1) is asymptotically stable. □

Corollary 9.7

If for ∀ i≥0, \(\mathrm{HJI}(\overline{V}^{[i]}(x), \overline{u}^{[i]},\overline{w}^{[i]}) = 0\) holds, and for ∀ t, \(l(x, \overline{u}^{[i]}, \overline{w}^{[i]}) < 0\), then the control pairs \(( \overline{u}^{[i]},\overline{w}^{[i]})\) make system (9.1) asymptotically stable.

Theorem 9.8

(cf. [15])

If for ∀ i≥0, \(\mathrm{HJI}(\overline {V}^{[i]}(x),\overline{u}^{[i]}, \overline{w}^{[i]}) = 0\) holds, and \(l(x, \overline{u}^{[i]}, \overline{w}^{[i]}) \) is the utility function, then the control pairs \((\overline{u}^{[i]}, \overline{w}^{[i]})\) make system (9.1) asymptotically stable.

Proof

For the time sequence \(t_0<t_1<t_2<\cdots<t_m<t_{m+1}<\cdots\), without loss of generality, we assume \(l(x, \overline{u}^{[i]}, \overline{w}^{[i]}) \geq0\) on \([t_{2n},t_{2n+1})\) and \(l(x, \overline{u}^{[i]}, \overline{w}^{[i]}) < 0\) on \([t_{2n+1},t_{2(n+1)})\), where n=0,1,….

Then, for t∈[t 0,t 1) we have \(l(x, \overline{u}^{[i]}, \overline{w}^{[i]}) \geq0\) and \(\int_{t_{0}}^{t_{1}}l(x, \overline {u}^{[i]}, \overline{w}^{[i]}) \mathrm{d}t \geq0\). According to Theorem 9.4, we have ∥x(t 0)∥≥∥x(t)∥≥∥x(t 1)∥.

For t∈[t 1,t 2) we have \(l(x, \overline{u}^{[i]}, \overline {w}^{[i]}) < 0\) and \(\int_{t_{1}}^{t_{2}}l(x, \overline{u}^{[i]}, \overline {w}^{[i]}) \mathrm{d}t < 0\). According to Corollary 9.7, we have ∥x(t 1)∥>∥x(t)∥>∥x(t 2)∥. So we obtain ∥x(t 0)∥≥∥x(t)∥>∥x(t 2)∥, for ∀t∈[t 0,t 2).

Using mathematical induction, for ∀t, we have ∥x(t′)∥≤∥x(t)∥ where t′∈[t,∞). So we conclude that the system (9.1) is asymptotically stable, and the proof is completed. □

Theorem 9.9

(cf. [15])

If for ∀ i≥0, \(\mathrm{HJI}(\underline{V}^{[i]}(x),\underline{u}^{[i]}, \underline{w}^{[i]}) = 0\) holds, and \(l(x,\underline{u}^{[i]},\underline{w}^{[i]}) \) is the utility function, then the control pairs \((\underline{u}^{[i]}, \underline{w}^{[i]})\) make system (9.1) asymptotically stable.

Next, we will give the convergence proof of the iterative ADP algorithm.

Proposition 9.10

If for ∀ i≥0, \(\mathrm{HJI}(\overline {V}^{[i]}(x),\overline{u}^{[i]}, \overline{w}^{[i]}) = 0\) holds, then the control pairs \((\overline{u}^{[i]}, \overline{w}^{[i]} )\) make the upper value function \(\overline{V}^{[i]}(x) \rightarrow \overline{J}(x) \) as i→∞.

Proof

According to \(\mathrm{HJI}(\overline{V}^{[i]}(x),\overline{u}^{[i]}, \overline{w}^{[i]}) = 0\), we obtain \({{\mathrm{d}{{\overline{V}}^{[i+1]}}(x)} / {\mathrm{d}t}} \) by replacing the index “i” by the index “i+1”:

(9.23)

According to (9.18), we obtain

(9.24)

Since the system (9.1) is asymptotically stable, its state trajectories x converge to zero, and so does \(\overline {V}^{[i+1]}(x) - \overline{V}^{[i]}(x)\). Since \({{d( {{\overline{V}}^{[i+1]}}(x) - {\overline{V}}^{[i]}(x) )}/ {\mathrm{d}t}} \ge0\) on these trajectories, it implies that \(\overline {V}^{[i+1]}(x) - \overline{V}^{[i]}(x) \le0\); that is \(\overline {V}^{[i+1]}(x) \le\overline{V}^{[i]}(x)\). Thus, \(\overline {V}^{[i]}(x)\) is convergent as i→∞.

Next, we define \(\lim_{i \to\infty} \overline{V}^{[i]} (x) = \overline{V}^{[\infty]} (x)\).

For ∀i, let \(\overline{w}^{\ast}= \mathrm{arg}\max _{w} \{\int_{t}^{\hat{t}} l(x,u,w) \mathrm{d}\tau+ \overline{V}^{[i]}(x(\hat{t}))\}\). Then, according to the principle of optimality, we have

(9.25)

Since \(\overline{V}^{[i+1]}(x) \le\overline{V}^{[i]}(x)\), we have \(\overline{V}^{[\infty]}(x) \leq \int_{t}^{\hat{t}} l(x,u,\overline{w}^{\ast}) \mathrm{d}\tau+ \overline{V}^{[i]}(x(\hat{t}))\).

Letting i→∞, we obtain \(\overline{V}^{[\infty]}(x) \leq \int_{t}^{\hat{t}} l(x,u,\overline{w}^{\ast}) \mathrm{d}\tau+ \overline{V}^{[\infty]}(x(\hat{t}))\). So, we have \(\overline{V}^{[\infty]}(x) \leq \inf _{u }\sup_{w } \{\int_{t}^{\hat{t}} l(x,u,w) \mathrm{d}\tau+ \overline{V}^{[\infty]}(x(\hat{t}))\}\).

Let ϵ>0 be an arbitrary positive number. Since the upper value function is nonincreasing and convergent, there exists a positive integer i such that \(\overline{V}^{[i]}(x) - \epsilon \leq \overline{V}^{[\infty]}(x) \leq\overline{V}^{[i]}(x)\).

Let \(\overline{u}^{\ast}=\mathrm{arg} \min_{u} \{\int_{t}^{\hat{t}} l(x,u,\overline{w}^{\ast}) \mathrm{d}\tau+ \overline{V}^{[i]}(x(\hat{t}))\}\). Then we get \(\overline{V}^{[i]}(x) = \int_{t}^{\hat{t}} l(x,\overline{u}^{\ast},\overline{w}^{\ast}) \mathrm{d}\tau+ \overline {V}^{[i]}(x(\hat{t}))\).

Thus, we have

(9.26)

Since ϵ is arbitrary, we have

$$\overline{V}^{[\infty]}(x) \geq\mathop{\inf }\limits_{u }\mathop{\sup}\limits_{w }\left\{\int_{t}^{\hat{t}} l(x,u,w) \mathrm{d}\tau+ \overline{V}^{[\infty]}(x(\hat{t})) \right\}. $$

Therefore, we obtain

$$\overline {V}^{[\infty]}(x) = \mathop{\inf }\limits_{u }\mathop{\sup}\limits_{w }\left\{\int_{t}^{\hat{t}} l(x,u,w) \mathrm{d}\tau+ \overline{V}^{[\infty]}(x(\hat{t})) \right\}. $$

Letting \(\hat{t} \rightarrow \infty\), we have

$$\overline{V}^{[\infty]}(x)=\mathop{\inf}\limits_{u}\mathop{\sup }\limits_{w} J(x,u,w)=\overline{J} (x). $$

 □

Proposition 9.11

If for ∀ i≥0, \(\mathrm{HJI}(\underline{V}^{[i]}(x),\underline{u}^{[i]}, \underline{w}^{[i]}) = 0\) holds, then the control pairs \((\underline{u}^{[i]}, \underline{w}^{[i]} )\) make the lower value function \(\underline{V}^{[i]}(x) \rightarrow \underline{J}(x) \) as i→∞.

Theorem 9.12

(cf. [15])

If the saddle point of the zero-sum differential game exists, then the control pairs \((\overline {u}^{[i]},\overline {w}^{[i]})\) and \((\underline {u}^{[i]}, \underline {w}^{[i]})\) make \(\overline{V}^{[i]} (x) \rightarrow J^{\ast}(x)\) and \(\underline{V}^{[i]} (x) \rightarrow J^{\ast}(x)\), respectively, as i→∞.

Proof

For the upper value function, according to Proposition 9.10, we have \(\overline{V}^{[i]} (x) \rightarrow\overline {J}(x)\) under the control pairs \((\overline {u}^{[i]},\overline {w}^{[i]})\) as i→∞. So the optimal control pair for the upper value function satisfies \(\overline{J}(x) = J(x,\overline{u} , \overline{w}) = \inf_{u } \sup_{w} J(x,u,w)\).

On the other hand, there exists an optimal control pair \((u^\ast,w^\ast)\) making the value function reach the saddle point. According to the property of the saddle point, the optimal control pair \((u^\ast,w^\ast)\) satisfies \(J^\ast(x)=J(x,u^\ast,w^\ast)=\inf_{u}\sup_{w} J(x,u,w)\).

So, we have \(\overline{V}^{[i]}(x) \rightarrow J^{\ast}(x)\) under the control pair \((\overline{u}^{[i]}, \overline{w}^{[i]})\) as i→∞. Similarly, we can derive \(\underline{V}^{[i]}(x) \rightarrow J^{\ast}(x)\) under the control pairs \((\underline {u}^{[i]}, \underline {w}^{[i]})\) as i→∞. □

Remark 9.13

From the proofs we see that the complex existence conditions of the saddle point in [1, 2] are not necessary. If the saddle point exists, the iterative value functions can converge to the saddle point using the present iterative ADP algorithm.

In the following, we show that when the saddle point does not exist, the mixed optimal solution can still be obtained effectively using the iterative ADP algorithm.

Proposition 9.14

If \(\overline{u}\in\mathbb{R}^{k}\), \(w^{[i]}\in\mathbb{R}^{m}\), the utility function is \(\tilde{l}(x, w^{[i]}) = l(x,\overline{u}, w^{[i]})- l^{o}(x,\overline{u},\overline{w},\underline{u},\underline{w})\), and \(w^{[i]}\) is given by (9.14), then the control pairs \((\overline{u} , w^{[i]})\) make the system (9.1) asymptotically stable.

Proposition 9.15

If \(\overline{u}\in\mathbb{R}^{k}\), \(w^{[i]}\in\mathbb{R}^{m}\), and for ∀ t the utility function \(\tilde{l}(x, w^{[i]}) \geq0\), then the control pairs \((\overline{u} , w^{[i]})\) make \(\widetilde{V}^{[i]}(x)\) a nonincreasing convergent sequence as i→∞.

Proposition 9.16

If \(\overline{u}\in\mathbb{R}^{k}\), \(w^{[i]}\in\mathbb{R}^{m}\), and for ∀ t the utility function \(\tilde{l}(x, w^{[i]})< 0\), then the control pairs \((\overline {u} , w^{[i]})\) make \(\widetilde{V}^{[i]}(x)\) a nondecreasing convergent sequence as i→∞.

Theorem 9.17

(cf. [15])

If \(\overline {u}\in\mathbb{R}^{k}\), \(w^{[i]}\in\mathbb{R}^{m}\), and \(\tilde{l} (x,w^{[i]})\) is the utility function, then the control pairs \((\overline{u} , w^{[i]})\) make \(\widetilde{V}^{[i]}(x) \) convergent as i→∞.

Proof

For the time sequence \(t_0<t_1<t_2<\cdots<t_m<t_{m+1}<\cdots\), without loss of generality, we suppose \(\tilde{l} (x,w^{[i]}) \geq0\) on \([t_{2n},t_{2n+1})\) and \(\tilde{l} (x,w^{[i]}) < 0\) on \([t_{2n+1},t_{2(n+1)})\), where n=0,1,….

For \(t\in[t_{2n},t_{2n+1})\) we have \(\tilde{l} (x,w^{[i]})\geq0\) and \(\int_{t_{2n}}^{t_{2n+1}} \tilde{l} (x,w^{[i]}) \mathrm{d}t \geq0\). According to Proposition 9.15, we have \(\widetilde {V}^{[i+1]}(x) \leq\widetilde{V}^{[i]}(x) \). For \(t\in[t_{2n+1},t_{2(n+1)})\) we have \(\tilde{l} (x,w^{[i]}) < 0\) and \(\int_{t_{2n+1}}^{t_{2(n+1)}} \tilde{l} (x,w^{[i]}) \mathrm{d}t < 0\). According to Proposition 9.16, we have \(\widetilde{V}^{[i+1]}(x) > \widetilde{V}^{[i]}(x) \). Then, for ∀ t 0, we have

(9.27)

So, \(\widetilde{{V}}^{[i]}(x) \) is convergent as i→∞. □

Theorem 9.18

(cf. [15])

If \(\overline{u}\in \mathbb{R}^{k}\), \(w^{[i]}\in\mathbb{R}^{m}\), and \(\tilde{l} (x,w^{[i]}) \) is the utility function, then the control pairs \((\overline{u} , w^{[i]})\) make \(\mathcal{V}^{[i]}(x) \rightarrow J^{o}(x) \) as i→∞.

Proof

The proof is by contradiction. Suppose that the control pair \((\overline{u},w^{[i]})\) makes the value function \(\mathcal {V}^{[i]}(x)\) converge to \(\mathcal{J}'(x)\) with \(\mathcal {J}'(x)\neq J^{o}(x)\).

According to Theorem 9.17, based on the principle of optimality, as i→∞ we have the HJB equation \(\mathrm{HJB}(\widetilde{J}(x),w) =0\).

From the assumptions we know that \(|\mathcal{V}^{[i]}(x)-J^{o}(x)|\neq0\) as i→∞. From Theorem 9.2, we know that there exists a control pair \((\overline{u},w')\) that makes \(J(x,\overline{u}, w')=J^{o}(x)\), which minimizes the performance index function \(\widetilde{J}(x)\). According to the principle of optimality, we also have the HJB equation \(\mathrm{HJB}(\widetilde{J}(x),w') =0\).

Thus w and w′ satisfy the same HJB equation and hence yield the same value, i.e., \(\mathcal{J}'(x)= J(x,\overline{u}, w')=J^{o}(x)\), which is a contradiction. So the assumption does not hold, and we have \(\mathcal{V}^{[i]}(x) \rightarrow J^{o}(x) \) as i→∞. □

Remark 9.19

For the situation where the saddle point does not exist, the methods in [1, 2] are all invalid. Using our iterative ADP method, the iterative value function reaches the mixed optimal value function J o(x) under the deterministic control pair. Therefore, we emphasize that the present iterative ADP method is more effective.

9.2.3 Simulations

Example 9.20

The dynamics of the benchmark nonlinear plant can be expressed by system (9.1) where

(9.28)

and ε=0.2. The initial state is given as \(x(0)=[1,1,1,1]^\mathrm{T}\). The cost functional is defined by (9.2), where the utility function is \(l(x,u,w)=x_{1}^{2}+0.1x_{2}^{2}+0.1x_{3}^{2}+0.1x_{4}^{2}+\|u\|^{2}-\gamma^{2}\|w\|^{2} \) with \(\gamma^{2}=10\).

Any differentiable structure can be used to implement the iterative ADP method. To facilitate the implementation of the algorithm, we choose three-layer neural networks as the critic networks with the structure 4–8–1. The action networks for u and w have the structures 4–8–1 and 5–8–1 for the upper value function, and 5–8–1 and 4–8–1 for the lower one. The initial weights are all chosen randomly in [−0.1, 0.1]. Then, for each i, the critic network and the action networks are trained for 1000 time steps so that the given accuracy \(\zeta=10^{-6}\) is reached. The learning rate is η=0.01. The iterative ADP algorithm runs for 70 iterations, and the convergence trajectories of the value functions are shown in Fig. 9.1. We can see that the saddle point of the game exists. Then, we apply the controller to the benchmark system and run it for \(T_f=60\) seconds. The optimal control trajectories are shown in Fig. 9.2, and the corresponding state trajectories are shown in Figs. 9.3 and 9.4, respectively.
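As a hedged sketch of the function-approximation structure described above (the tanh activation and the random weights in [−0.1, 0.1] are assumptions for illustration; the chapter does not prescribe a specific activation), the following numpy code builds a 4–8–1 critic network \(\overline V(x)\) and the state gradient \(\overline V_x\) needed when forming the controls (9.8)–(9.9).

```python
import numpy as np

rng = np.random.default_rng(0)

# 4-8-1 critic network; weights initialized randomly in [-0.1, 0.1].
W1 = rng.uniform(-0.1, 0.1, size=(8, 4))
b1 = rng.uniform(-0.1, 0.1, size=8)
W2 = rng.uniform(-0.1, 0.1, size=8)

def critic(x):
    """Approximate value function V(x) for the 4-dimensional benchmark state."""
    h = np.tanh(W1 @ x + b1)
    return W2 @ h

def critic_gradient(x):
    """dV/dx, the gradient required by the iterative control updates (9.8)-(9.9)."""
    h = np.tanh(W1 @ x + b1)
    return W1.T @ ((1.0 - h ** 2) * W2)

x0 = np.array([1.0, 1.0, 1.0, 1.0])
print("V(x0)     =", critic(x0))
print("dV/dx(x0) =", critic_gradient(x0))
```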

Fig. 9.1 Trajectories of the upper and lower value functions

Fig. 9.2 Trajectories of the controls

Fig. 9.3 Trajectories of states x 1 and x 3

Fig. 9.4 Trajectories of states x 2 and x 4

Remark 9.21

The simulation results illustrate the effectiveness of the present iterative ADP algorithm. If the saddle point exists, the iterative control pairs \((\overline{u}^{[i]},\overline{w}^{[i]})\) and \((\underline{u}^{[i]},\underline{w}^{[i]})\) can make the iterative value functions reach the saddle point, while the existence conditions of the saddle point are avoided.

Example 9.22

In this example, we just change the utility function to

and all other conditions are the same as in Example 9.20. We obtain \(\overline{J}(x(0))= 0.65297\) and \(\underline{J}(x(0))=0.44713\), with trajectories shown in Figs. 9.5(a) and (b), respectively. Obviously, the saddle point does not exist, so the method in [1] is invalid. Using the present mixed trajectory method, we choose the Gaussian noises \(\gamma_u \sim N(0, 0.05^2)\) and \(\gamma_w \sim N(0, 0.05^2)\) and let N=5000. The value function trajectories are shown in Fig. 9.5(c). We then obtain the mixed optimal value \(J^o(x(0))=0.55235\) and thus α=0.5936. Regulating the control w, we obtain the trajectory of the mixed optimal value function displayed in Fig. 9.5(d). The state trajectories are shown in Figs. 9.6 and 9.7, respectively, and the corresponding control trajectories are shown in Figs. 9.8 and 9.9, respectively.

Fig. 9.5 Performance index function trajectories. (a) Trajectory of the upper value function. (b) Trajectory of the lower value function. (c) Performance index functions with disturbances. (d) Trajectory of the mixed optimal performance index function

Fig. 9.6 Trajectories of states x 1 and x 3

Fig. 9.7 Trajectories of states x 2 and x 4

Fig. 9.8 Trajectory of control u

Fig. 9.9 Trajectory of control w

9.3 Finite Horizon Zero-Sum Games for a Class of Nonlinear Systems

In this section, a new iterative approach is derived to obtain the optimal policies of finite horizon quadratic zero-sum games for a class of continuous-time nonaffine nonlinear systems. Through an iteration between two sequences, namely a sequence of state trajectories of linear quadratic zero-sum games and a sequence of corresponding Riccati differential equations, the optimal policies for nonaffine nonlinear zero-sum games are obtained. Under very mild conditions of local Lipschitz continuity, the convergence of the approximating linear time-varying sequences is proved.

9.3.1 Problem Formulation

Consider a continuous-time nonaffine nonlinear zero-sum game described by the state equation

(9.29)

with the finite horizon cost functional

(9.30)

where x(t)∈ℝn is the state, x(t 0)∈ℝn is the initial state, t f is the terminal time, the control input u(t) takes values in a convex and compact set \(U\subset\mathbb{R}^{m_{1}}\), and w(t) takes values in a convex and compact set \(W\subset\mathbb{R}^{m_{2}}\). The input u(t) seeks to minimize the cost functional J(x 0,u,w), while w(t) seeks to maximize it. The state-dependent weight matrices F(x(t)), Q(x(t)), R(x(t)), and S(x(t)) have suitable dimensions with F(x(t))≥0, Q(x(t))≥0, R(x(t))>0, and S(x(t))>0. In this section, x(t), u(t), and w(t) are sometimes written as x, u, and w for brevity. Our objective is to find the optimal policies for the above nonaffine nonlinear zero-sum games.

In the nonaffine nonlinear zero-sum game problem, the nonlinear system functions are implicit in the control inputs, and it is very hard to obtain the optimal policies satisfying (9.29) and (9.30). For practical purposes, one may just as well be interested in finding a near-optimal or approximate optimal policy. Therefore, we present an iterative algorithm to deal with this problem: the nonaffine nonlinear zero-sum game is transformed into an equivalent sequence of linear quadratic zero-sum games, to which linear quadratic zero-sum game theory applies directly.

9.3.2 Finite Horizon Optimal Control of Nonaffine Nonlinear Zero-Sum Games

Using a factored form to represent the system (9.29), we get

(9.31)

where f:ℝn→ℝn×n is a nonlinear matrix-valued function of x, \(g\colon\mathbb{R}^{n} \times \mathbb{R}^{m_{1}} \rightarrow\mathbb{R}^{n\times{m_{1}}}\) is a nonlinear matrix-valued function of both the state x and control input u, and \(k\colon\mathbb{R}^{n} \times\mathbb{R}^{m_{2}} \rightarrow \mathbb{R}^{n\times{m_{2}}}\) is a nonlinear matrix-valued function of both the state x and control input w.

We use the following sequence of linear time-varying differential equations to approximate the state equation (9.31):

(9.32)

with the corresponding cost functional

(9.33)

where the superscript i represents the iteration index. For the first approximation, i=0, we assume that the initial values x i−1(t)=x 0, u i−1(t)=0, and w i−1(t)=0. Obviously, for the ith iteration, f(x i−1(t)), g(x i−1(t),u i−1(t)), k(x i−1(t),w i−1(t)), F(x i−1(t f )), Q(x i−1(t)), R(x i−1(t)), and S(x i−1(t)) are time-varying functions which do not depend on x i (t), u i (t), and w i (t). Hence, each approximation problem in (9.32) and (9.33) is a linear quadratic zero-sum game problem which can be solved by the existing classical linear quadratic zero-sum game theory.

The corresponding Riccati differential equation of each linear quadratic zero-sum game can be expressed as

(9.34)

where P i ∈ℝn×n is a real, symmetric and nonnegative definite matrix.

Assumption 9.23

It is assumed that \(S(x_{i-1}(t))>\hat {S}\,_{i}\), where the threshold value \(\hat {S}\,_{i}\) is defined as \(\hat {S}\,_{i}= \mathrm{inf} \{S_{i}(t)>0,\ \mbox{and (9.34) does not have a conjugate point on} [0,t_{f}]\}\).

If Assumption 9.23 is satisfied, the game admits the optimal policies given by

(9.35)

where x i (t) is the corresponding optimal state trajectory, generated by

(9.36)

By iterating between the sequences (9.34) and (9.36) sequentially, the limit of the solution of the approximating sequence (9.32) converges to the unique solution of system (9.29), and the sequences of optimal policies (9.35) converge as well. The convergence of the iterative algorithm will be analyzed in the next section. Notice that the factored form in (9.31) need not be unique; the approximating linear time-varying sequences will converge regardless of the chosen representation of f(x(t)), g(x(t),u(t)), and k(x(t),w(t)).

Remark 9.24

For the fixed finite interval [t 0,t f ], if \(S(x_{i-1}(t))>\hat{S}\,_{i}\), the Riccati differential equation (9.34) does not have a conjugate point on [t 0,t f ], which means that V i (x 0,u,w) is strictly concave in w. On the other hand, since V i (x 0,u,w) is quadratic and R(t)>0, F(t)≥0, Q(t)≥0, it follows that V i (x 0,u,w) is strictly convex in u. Hence, for the linear quadratic zero-sum game (9.32) with the performance index function (9.33), there exists a unique saddle point, which yields the optimal policies.
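Since (9.30) and (9.34)–(9.36) are not reproduced above, the following sketch assumes the standard soft-constrained linear-quadratic zero-sum form: running cost \(x^\mathrm{T}Qx+u^\mathrm{T}Ru-w^\mathrm{T}Sw\), Riccati equation \(-\dot P = f^\mathrm{T}P+Pf+Q-P(gR^{-1}g^\mathrm{T}-kS^{-1}k^\mathrm{T})P\) with \(P(t_f)=F\), and policies \(u=-R^{-1}g^\mathrm{T}Px\), \(w=S^{-1}k^\mathrm{T}Px\). Under that assumption, one frozen-coefficient approximating game of the kind in (9.32)–(9.33) can be solved as below; the coefficient matrices are arbitrary placeholders and scipy is assumed to be available.

```python
import numpy as np
from scipy.integrate import solve_ivp

# Frozen (time-invariant) coefficients of one approximating LQ game; hypothetical values.
A  = np.array([[0.0, 1.0], [-1.0, -0.5]])   # stands in for f(x_{i-1}(t))
B1 = np.array([[0.0], [1.0]])               # stands in for g(x_{i-1}(t), u_{i-1}(t))
B2 = np.array([[0.0], [0.5]])               # stands in for k(x_{i-1}(t), w_{i-1}(t))
Q, F = 0.01 * np.eye(2), 0.01 * np.eye(2)
R, S = np.array([[1.0]]), np.array([[1.0]])
t0, tf = 0.0, 5.0

def riccati_rhs(t, p_flat):
    # dP/dt = -(A'P + PA + Q - P (B1 R^{-1} B1' - B2 S^{-1} B2') P)
    P = p_flat.reshape(2, 2)
    dP = -(A.T @ P + P @ A + Q
           - P @ (B1 @ np.linalg.solve(R, B1.T)
                  - B2 @ np.linalg.solve(S, B2.T)) @ P)
    return dP.ravel()

# Integrate the Riccati equation backward from the terminal condition P(tf) = F.
sol_P = solve_ivp(riccati_rhs, [tf, t0], F.ravel(), dense_output=True, rtol=1e-8)

def policies(t, x):
    P = sol_P.sol(t).reshape(2, 2)
    u = -np.linalg.solve(R, B1.T @ P @ x)    # minimizing player
    w =  np.linalg.solve(S, B2.T @ P @ x)    # maximizing player
    return u, w

def closed_loop(t, x):
    u, w = policies(t, x)
    return A @ x + B1 @ u + B2 @ w

x0 = np.array([0.6, 0.0])
sol_x = solve_ivp(closed_loop, [t0, tf], x0, dense_output=True, rtol=1e-8)
print("terminal state of this approximating game:", sol_x.y[:, -1])
```

In the full procedure summarized after Theorem 9.27, the matrices above would be re-evaluated along the previous iterate \(x_{i-1}(t)\), making them time varying, and the loop would repeat until \(\|x_i(t)-x_{i-1}(t)\|\) falls below the prescribed accuracy.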

The convergence of the algorithm described above requires the following:

  1. 1.

    The sequence {x i (t)} converges on C([t 0,t f ];ℝn), which means that the limit of the solution of approximating sequence (9.32) converges to the unique solution of system (9.29).

  2. 2.

The sequences of optimal policies {u i (t)} and {w i (t)} converge on \(C([t_{0},t_{f}];\mathbb{R}^{m_{1}})\) and \(C([t_{0},t_{f}];\mathbb{R}^{m_{2}})\), respectively.

For simplicity, the approximating sequence (9.32) is rewritten as

(9.37)

where

The optimal policies for zero-sum games are rewritten as

(9.38)

where

Assumption 9.25

g(x,u), k(x,w), R −1(x), S −1(x), F(x) and Q(x) are bounded and Lipschitz continuous in their arguments x, u, and w, thus satisfying:

  1. (C1)

    ∥g(x,u)∥≤b, ∥k(x,w)∥≤e

  2. (C2)

    ∥R^{−1}(x)∥≤r, ∥S^{−1}(x)∥≤s

  3. (C3)

    ∥F(x)∥≤f, ∥Q(x)∥≤q

for ∀x∈ℝn, \(\forall u\in \mathbb{R}^{m_{1}}\), \(\forall w\in\mathbb{R}^{m_{2}}\), and for finite positive numbers b, e, r, s, f, and q.

Define Φ i−1(t,t 0) as the transition matrix generated by f i−1(t). It is well known that

(9.39)

where μ(f) is the measure of matrix f, \(\mu(f)= \lim_{h \to0+ }\frac{\parallel I+hf \parallel-1}{h}\). We use the following lemma to get an estimate for Φ i−1(t,t 0)−Φ i−2(t,t 0).
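For the induced 2-norm, the matrix measure used in (9.39) has the closed form \(\mu_2(f)=\lambda_{\max}\big((f+f^\mathrm{T})/2\big)\). The following small sketch, with an arbitrary placeholder matrix, compares this closed form with the defining limit quoted above.

```python
import numpy as np

def mu2(f):
    """Matrix measure induced by the 2-norm: largest eigenvalue of (f + f^T)/2."""
    return np.linalg.eigvalsh((f + f.T) / 2).max()

def mu_limit(f, h=1e-7):
    """Finite-difference approximation of the defining limit (||I + h f|| - 1)/h."""
    n = f.shape[0]
    return (np.linalg.norm(np.eye(n) + h * f, 2) - 1.0) / h

f = np.array([[0.0, 2.0], [-1.0, -3.0]])   # placeholder for a frozen f_{i-1}(t)
print("mu_2(f), closed form:", mu2(f))
print("mu_2(f), via limit  :", mu_limit(f))
```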

The following lemma is relevant for the solution of the Riccati differential equation (9.34), which is the basis for proving the convergence.

Lemma 9.26

Let Assumption 9.25 hold; the solution of the Riccati differential equation (9.34) satisfies:

  1. 1.

    P i (t) is Lipschitz continuous.

  2. 2.

    P i (t) is bounded, if the linear time-varying system (9.32) is controllable.

Proof

First, let us prove that P i (t) is Lipschitz continuous. We transform (9.34) into the form of a matrix differential equation:

where

Thus, the solution P i (t) of the Riccati differential equations (9.34) becomes

(9.40)

If Assumption 9.25 is satisfied, such that f(x), g(x,u), k(x,w), R −1(x), S −1(x), F(x), and Q(x) are Lipschitz continuous, then X i (t) and λ i (t) are Lipschitz continuous. Furthermore, it is easy to verify that (X i (t))−1 also satisfies the Lipschitz condition. Hence, P i (t) is Lipschitz continuous.

Next, we prove that P i (t) is bounded.

If the linear time varying system (9.32) is controllable, there must exist \(\hat{u}_{i}(t), \hat{w}_{i}(t)\) such that x(t 1)=0 at t=t 1. We define \(\bar{u}_{i}(t), \bar{w}_{i}(t)\) as

where \(\hat{u}_{i}(t)\) is any control policy making x(t 1)=0 and \(\hat{w}_{i}(t)\) is the optimal policy. For t≥t 1, we let \(\bar {u}_{i}(t)\) and \(\bar {w}_{i}(t)\) be 0, so the state x(t) will remain at 0.

The optimal cost functional \(V^{*}_{i}(x_{0},u,w)\) is described as

(9.41)

where \(u^{*}_{i}(t)\) and \(w^{*}_{i}(t)\) are the optimal policies. \(V^{*}_{i}(x_{0},u,w)\) is minimized by \(u^{*}_{i}(t)\) and maximized by \(w^{*}_{i}(t)\).

For the linear system, \(V^{*}_{i}(x_{0},u,w)\) can be expressed as \(V^{*}_{i}(x_{0},u,w)= \frac{1}{2} x^{\mathrm{T}}_{i}(t)P_{i}(t)x_{i}(t)\). Since x i (t) is arbitrary, if \(V^{*}_{i}(x_{0},u,w)\) is bounded, then P i (t) is bounded. Next, we discuss the boundedness of \(V^{*}_{i}(x_{0},u,w)\) in two cases:

Case 1:

t 1<t f ; we have

(9.42)
Case 2:

t 1≥t f ; we have

(9.43)

From (9.42) and (9.43), we know that \(V_{i}^{*}(x)\) has an upper bound, independent of t f . Hence, P i (t) is bounded. □

According to Lemma 9.26, P i (t) is bounded and Lipschitz continuous. If Assumption 9.25 is satisfied, then M(x,u), N(x,w), G(x,u), and K(x,w) are bounded and Lipschitz continuous in their arguments, thus satisfying:

  1. (C4)

    ∥M(x,u)∥≤δ 1, ∥N(x,w)∥≤σ 1,

  2. (C5)

    ∥M(x 1,u 1)−M(x 2,u 2)∥≤δ 2∥x 1−x 2∥+δ 3∥u 1−u 2∥, ∥N(x 1,w 1)−N(x 2,w 2)∥≤σ 2∥x 1−x 2∥+σ 3∥w 1−w 2∥,

  3. (C6)

    ∥G(x,u)∥≤ζ 1, ∥K(x,w)∥≤ξ 1,

  4. (C7)

    ∥G(x 1,u 1)−G(x 2,u 2)∥≤ζ 2∥x 1−x 2∥+ζ 3∥u 1−u 2∥, ∥K(x 1,w 1)−K(x 2,w 2)∥≤ξ 2∥x 1−x 2∥+ξ 3∥w 1−w 2∥,

∀x∈ℝn, \(\forall u\in \mathbb{R}^{m_{1}}\), \(\forall w\in\mathbb{R}^{m_{2}}\), and for finite positive numbers δ j , σ j , ζ j , ξ j , j=1,2,3.

Theorem 9.27

(cf. [16])

Consider the system (9.29) of nonaffine nonlinear zero-sum games with the cost functional (9.30), for which the approximating sequences (9.32) and (9.33) are introduced. We have F(x(t))≥0, Q(x(t))≥0, R(x(t))>0, and the terminal time t f is specified. Let Assumption 9.25 and Assumptions (A1) and (A2) hold, and let \(S(x(t))> \tilde{S}\). Then, for small enough t f or x 0, the limit of the solution of the approximating sequence (9.32) converges to the unique solution of system (9.29) on C([t 0,t f ];ℝn). Meanwhile, the approximating sequences of optimal policies given by (9.35) also converge on \(C([t_{0},t_{f}];\mathbb{R}^{m_{1}})\) and \(C([t_{0},t_{f}];\mathbb{R}^{m_{2}})\), if

(9.44)

where

Proof

The approximating sequence (9.37) is a nonhomogeneous differential equation, whose solution can be given by

(9.45)

Then,

(9.46)

According to inequality (9.39) and assuming (C6) to hold, we obtain

(9.47)

On the basis of Gronwall–Bellman’s inequality

(9.48)

which is bounded for a small time interval [t 0,t f ] or a small x 0.

From (9.45) we have

(9.49)

Take the supremum on both sides of (9.49) and let

By using (9.39), (C6), and (C7), we get

(9.50)

Combining similar terms, we have

(9.51)

where ψ 1(t) through ψ 3(t) are described in (9.44).

Similarly, from (9.38), we get

(9.52)

According to (C4), (C5), and (9.48), we have

(9.53)

where ψ 4(t) through ψ 9(t) are shown in (9.44).

Then, combining (9.51) and (9.53), we have

(9.54)

where and .

By induction, Θ i satisfies

(9.55)

which implies that {x i (t)}, {u i (t)}, and {w i (t)} are Cauchy sequences in the Banach spaces C([t 0,t f ];ℝn), \(C([t_{0},t_{f}];\mathbb{R}^{m_{1}})\), and \(C([t_{0},t_{f}];\mathbb{R}^{m_{2}})\), respectively. Hence, {x i (t)} converges on C([t 0,t f ];ℝn), and the sequences of optimal policies {u i } and {w i } converge on \(C([t_{0},t_{f}];\mathbb{R}^{m_{1}})\) and \(C([t_{0},t_{f}];\mathbb{R}^{m_{2}})\) on [t 0,t f ].

It means that x i−1(t)=x i (t), u i−1(t)=u i (t), w i−1(t)=w i (t) when i→∞. Hence, the system (9.29) has a unique solution on [t 0,t f ], which is given by the limit of the solution of approximating sequence (9.32). □

Based on the iterative algorithm described in Theorem 9.27, the design procedure of optimal policies for nonlinear nonaffine zero-sum games is summarized as follows:

  1. 1.

    Give x 0, maximum iteration times i max and approximation accuracy ε.

  2. 2.

    Use a factored form to represent the system as (9.31).

  3. 3.

    Set i=0. Let x i−1(t)=x 0, u i−1(t)=0 and w i−1(t)=0. Compute the corresponding matrix-valued functions f(x 0), g(x 0,0), k(x 0,0), F(x 0), Q(x 0), R(x 0), and S(x 0).

  4. 4.

    Compute x [0](t) and P [0](t) according to differential equations (9.34) and (9.36) with x(t 0)=x 0, P(t f )=F(x f ).

  5. 5.

    Set i=i+1. Compute the corresponding matrix-valued functions f(x i−1(t)), g(x i−1(t),u i−1(t)), k(x i−1(t),w i−1(t)), Q(x i−1(t)), R(x i−1(t)), F(x i−1(t f )), and S(x i−1(t)).

  6. 6.

    Compute x i (t) and P i (t) by (9.34) and (9.36) with x(t 0)=x 0, P(t f )=F(x i−1(t f )).

  7. 7.

    If ∥x i (t)−x i−1(t)∥<ε, go to Step 9; otherwise, go to Step 8.

  8. 8.

    If i>i max, then go to Step 9; else, go to Step 5.

  9. 9.

    Stop.

9.3.3 Simulations

Example 9.28

We now show the power of our iterative algorithm for finding optimal policies for nonaffine nonlinear zero-sum games.

In the following, we introduce an example of a control system that has the form (9.29), with control input u(t), subject to a disturbance w(t), and with a cost functional V(x 0,u,w). The control input u(t) is required to minimize the cost functional V(x 0,u,w), while the disturbance w(t), which may have a great effect on the system, is taken to maximize it. This conflicting design guarantees the optimality and strong robustness of the system at the same time. The resulting zero-sum game problem can be described by the state equations

(9.56)

Define the finite horizon cost functional to be of the form (9.30), where F=0.01 I 2×2, Q=0.01 I 2×2, R=1, and S=1, with I an identity matrix. Clearly, (9.56) is not affine in u(t) and w(t); it has a nonaffine nonlinear control structure. Therefore, we represent the system (9.56) in the factored form f(x(t))x(t), g(x(t),u(t))u(t), and k(x(t),w(t))w(t), which, given the wide selection of possible representations, have been chosen as

(9.57)

The optimal policy design given by Theorem 9.27 can now be applied to (9.31) with the dynamics (9.57).

The initial state vector is chosen as x 0=[0.6,0]T and the terminal time is set to t f =5. Let us define the required error norm between the solutions of the linear time-varying differential equations by ∥x i (t)−x i−1(t)∥<ε=0.005, which needs to be satisfied if convergence is to be achieved. The factorization is given by (9.57). Implementing the present iterative algorithm, only six iterations are needed to satisfy the required bound, with ∥x [6](t)−x [5](t)∥=0.0032. As the number of iterations increases, the approximation error decreases markedly; at iteration i=25, the approximation error is just 5.1205×10−10.

Set the maximum number of iterations to i max=25. Figure 9.10 shows the state trajectory of each approximating linear quadratic zero-sum game. It can be seen that the sequence is clearly convergent. Magnified views of the state trajectories are given in the figure, showing that the error becomes smaller as the number of iterations grows. The trajectories of the control input u(t) and the disturbance input w(t) of each iteration are also convergent, as shown in Figs. 9.11 and 9.12. The approximate optimal policies u ∗(t) and w ∗(t) are obtained at the last iteration. Substituting the approximate optimal policies u ∗(t) and w ∗(t) into the system of zero-sum games (9.56), we get the state trajectory. The norm of the error between this state trajectory and the state trajectory of the last iteration is just 0.0019, which shows that the approximating iterative approach developed in this section is highly effective.

Fig. 9.10 The state trajectory x 1(t) of each iteration

Fig. 9.11 The trajectory u(t) of each iteration

Fig. 9.12 The trajectory w(t) of each iteration

9.4 Non-Zero-Sum Games for a Class of Nonlinear Systems Based on ADP

In this section, a near-optimal control scheme is developed for the non-zero-sum differential games of continuous-time nonlinear systems. Single-network ADP is utilized to obtain the optimal control policies that make the cost functions reach the Nash equilibrium of the non-zero-sum differential games, where only one critic network is used for each player instead of the action–critic dual network used in a typical ADP architecture. Furthermore, novel weight tuning laws for the critic neural networks are developed, which not only ensure that the Nash equilibrium is reached but also guarantee the stability of the system. No initial stabilizing control policy is required for each player. Moreover, Lyapunov theory is utilized to demonstrate the uniform ultimate boundedness of the closed-loop system.

9.4.1 Problem Formulation of Non-Zero-Sum Games

Consider the following continuous-time nonlinear systems:

$$ \dot {x}(t)=f(x(t))+g(x(t))u(t)+k(x(t))w(t), $$
(9.58)

where x(t)∈ℝn is the state vector, and u(t)∈ℝm and w(t)∈ℝq are the control input vectors. Assume that f(0)=0 and that f(x), g(x), and k(x) are locally Lipschitz.

The cost functional associated with u is defined as

$$ J_1(x,u,w)=\int_t^\infty r_1(x(\tau),u(\tau),w(\tau))\mathrm {d}\tau, $$
(9.59)

where r 1(x,u,w)=Q 1(x)+u T R 11 u+w T R 12 w, Q 1(x)≥0 is the penalty on the states, R 11∈ℝm×m is a positive definite matrix, and R 12∈ℝq×q is a positive semidefinite matrix.

The cost functional associated with w is defined as

$$ J_2(x,u,w)=\int_t^\infty r_2(x(\tau),u(\tau),w(\tau))\mathrm {d}\tau, $$
(9.60)

where r 2(x,u,w)=Q 2(x)+u T R 21 u+w T R 22 w, Q 2(x)≥0 is the penalty on the states, R 21∈ℝm×m is a positive semidefinite matrix, and R 22∈ℝq×q is a positive definite matrix.

For the above non-zero-sum differential games, the two feedback control policies u and w are chosen by player 1 and player 2, respectively, where player 1 tries to minimize the cost functional (9.59), while player 2 attempts to minimize the cost functional (9.60).

Definition 9.29

u=μ 1(x) and w=μ 2(x) are defined as admissible with respect to (9.59) and (9.60) on Ω⊆ℝn, denoted by μ 1∈ψ(Ω) and μ 2∈ψ(Ω), respectively, if μ 1(x) and μ 2(x) are continuous on Ω, μ 1(0)=0 and μ 2(0)=0, μ 1(x) and μ 2(x) stabilize (9.58) on Ω, and (9.59) and (9.60) are finite ∀x 0∈Ω.

Definition 9.30

The policy set (u ,w ) is a Nash equilibrium policy set if the inequalities

(9.61)

hold for any admissible control policies u and w.

Next, define the Hamilton functions for the cost functionals (9.59) and (9.60) with associated admissible control input u and w, respectively, as follows:

(9.62)
(9.63)

where ▽J i is the partial derivative of the cost function J i (x,u,w) with respect to x, i=1,2.

According to the stationarity conditions of optimization, we have

Therefore, the associated optimal feedback control policies \(u^\ast\) and \(w^\ast\) are found and revealed to be

(9.64)
(9.65)

The optimal feedback control policies \(u^\ast\) and \(w^\ast\) provide a Nash equilibrium for the non-zero-sum differential games among all the feedback control policies.
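Although the bodies of (9.64) and (9.65) are not reproduced above, they follow from the stationarity conditions \(\partial H_1/\partial u=0\) and \(\partial H_2/\partial w=0\) applied to (9.62) and (9.63), which for the quadratic utilities r 1 and r 2 give the standard forms \(u^\ast=-\tfrac{1}{2}R_{11}^{-1}g^\mathrm{T}(x)\triangledown J_1\) and \(w^\ast=-\tfrac{1}{2}R_{22}^{-1}k^\mathrm{T}(x)\triangledown J_2\). The numpy sketch below evaluates these assumed forms; all matrices, gains, and gradient values are placeholders, not data from this chapter.

```python
import numpy as np

# Placeholder problem data (n=2, m=1, q=1), used only to illustrate the formulas.
R11 = np.array([[1.0]])
R22 = np.array([[1.0]])
g = np.array([[0.0], [1.0]])       # g(x)
k = np.array([[0.0], [0.5]])       # k(x)

def nash_policies(grad_J1, grad_J2):
    """Stationarity-based feedback policies assumed for (9.64)-(9.65):
    u* = -1/2 R11^{-1} g(x)^T dJ1/dx,  w* = -1/2 R22^{-1} k(x)^T dJ2/dx."""
    u = -0.5 * np.linalg.solve(R11, g.T @ grad_J1)
    w = -0.5 * np.linalg.solve(R22, k.T @ grad_J2)
    return u, w

# Placeholder value-function gradients at some state x.
grad_J1 = np.array([0.8, -0.3])
grad_J2 = np.array([0.1, 0.4])
u_star, w_star = nash_policies(grad_J1, grad_J2)
print("u* =", u_star, " w* =", w_star)
```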

Considering \(H_1(x,u^\ast,w^\ast)=0\) and \(H_2(x,u^\ast,w^\ast)=0\), and substituting the optimal feedback control policies (9.64) and (9.65) into the Hamilton functions (9.62) and (9.63), we have

(9.66)
(9.67)

If the coupled HJ equations (9.66) and (9.67) can be solved for the optimal value functions \(J_1(x,u^\ast,w^\ast)\) and \(J_2(x,u^\ast,w^\ast)\), the optimal control can then be implemented by using (9.64) and (9.65). However, these equations are generally difficult or impossible to solve due to their inherently nonlinear nature. To overcome this difficulty, a near-optimal control scheme is developed to learn the solution of the coupled HJ equations online using a single-network ADP in order to obtain the optimal control policies.

Before presenting the near-optimal control scheme, the following lemma is required.

Lemma 9.31

Consider the system (9.58) with the associated cost functionals (9.59) and (9.60) and the optimal feedback control policies (9.64) and (9.65). For player i, i=1,2, let L i (x) be a continuously differentiable, radially unbounded Lyapunov candidate such that \(\dot {L}_{i}=\triangledown L_{i}^{\mathrm{T}}\dot {x}=\triangledown L_{i}^{\mathrm{T}} (f(x)+g(x)u^{\ast}+k(x)w^{\ast})<0\), with ▽L i being the partial derivative of L i (x) with respect to x. Moreover, let \(\bar {Q}_{i}(x)\in\mathbb{R}^{n\times n}\) be a positive definite matrix satisfying \(\|\bar {Q}_{i}(x)\|=0\) if and only if ∥x∥=0 and \(\bar {Q}_{i\min}\leq\|\bar {Q}_{i}(x)\|\leq\bar {Q}_{i\max}\) for χ min≤∥x∥≤χ max with positive constants \(\bar {Q}_{i\min}\), \(\bar {Q}_{i\max}\), χ min, χ max. In addition, let \(\bar{Q}_{i}(x)\) satisfy \(\lim_{x\rightarrow\infty}\bar{Q}_{i}(x)=\infty\) as well as

(9.68)

Then the following relation holds:

(9.69)

Proof

When the optimal controls \(u^\ast\) and \(w^\ast\) in (9.64) and (9.65) are applied to the nonlinear system (9.58), the value function \(J_i(x,u^\ast,w^\ast)\) becomes a Lyapunov function, i=1,2. Then, for i=1,2, differentiating the value function \(J_i(x,u^\ast,w^\ast)\) with respect to t, we have

(9.70)

Using (9.68), (9.70) can be rewritten as

(9.71)

Next, multiplying both sides of (9.71) by \(\triangledown L_{i}^{\mathrm{T}}\), (9.69) can be obtained.

This completes the proof. □

9.4.2 Optimal Control of Nonlinear Non-Zero-Sum Games Based on ADP

To begin the development, we rewrite the cost functions (9.59) and (9.60) by NNs as

(9.72)
(9.73)

where W i , ϕ i (x), and ε i are the critic NN ideal constant weights, the critic NN activation function vector and the NN approximation error for player i, i=1,2, respectively.

The derivative of the cost functions with respect to x can be derived as

(9.74)
(9.75)

where \(\triangledown\phi_i \triangleq \partial\phi_i(x)/\partial x\), \(\triangledown\varepsilon_i \triangleq \partial\varepsilon_i/\partial x\), i=1,2.

Using (9.74) and (9.75), the optimal feedback control policies (9.64) and (9.65) can be rewritten as

(9.76)
(9.77)

and the coupled HJ equations (9.66) and (9.67) can be rewritten as

(9.78)
(9.79)

where

(9.80)

The residual error due to the NN approximation for player 1 is

(9.81)

The residual error due to the NN approximation for player 2 is

(9.82)

Let \(\hat{W}_{c1}\) and \(\hat{W}_{c2}\) be the estimates of W c1 and W c2, respectively. Then we have the estimates of V 1(x) and V 2(x) as follows:

(9.83)
(9.84)

Substituting (9.83) and (9.84) into (9.64) and (9.65), respectively, the estimates of optimal control policies can be written as

(9.85)
(9.86)

Applying (9.85) and (9.86) to the system (9.58), we have the closed-loop system dynamics as follows:

(9.87)

Substituting (9.83) and (9.84) into (9.62) and (9.63), respectively, the approximate Hamilton functions can be derived as follows:

(9.88)
(9.89)

It is desired to select \(\hat{W}_{c1}\) and \(\hat{W}_{c2}\) to minimize the squared residual error \(E=e_{1}^{\mathrm{T}}e_{1}/2+e_{2}^{\mathrm{T}}e_{2}/2\). Then we have \(\hat{W}_{c1}\rightarrow W_{c1}\), \(\hat{W}_{c2}\rightarrow W_{c2}\), and \(e_{1}\rightarrow\varepsilon_{\rm HJ1}\), \(e_{2}\rightarrow\varepsilon_{\rm HJ2}\). In other words, the Nash equilibrium of the non-zero-sum differential games of the continuous-time nonlinear system (9.58) can be obtained. However, tuning the critic NN weights to minimize the squared residual error E alone does not ensure the stability of the nonlinear system (9.58) during the learning process of the critic NNs. Therefore, we propose novel weight tuning laws of the critic NNs for the two players, which can not only minimize the squared residual error E but also guarantee the stability of the system, as follows:

(9.90)
(9.91)

where \(\bar{\sigma}_{i}=\hat{\sigma}_{i}/(\hat{\sigma}_{i}^{\mathrm{T}} \hat {\sigma}_{i}+1)\), \(\hat {\sigma}_{i}=\triangledown\phi_{i}(f(x)-D_{1}\triangledown\phi_{1}^{\mathrm{T}} \hat {W}_{c1}/2-D_{2}\triangledown\phi_{2}^{\mathrm{T}}\hat{W}_{c2}/2)\), \(m_{s_{i}}=\hat{\sigma}_{i}^{\mathrm{T}} \hat{\sigma}_{i}+1\), α i >0 is the adaptive gain, ▽L i is described in Lemma 9.31, i=1,2. F 1, F 2, F 3, and F 4 are design parameters. The operator \(\varSigma(x,\hat{u},\hat{w})\) is given by

(9.92)

where \(\dot{x}\) is given as (9.87).

Remark 9.32

The first terms in (9.90) and (9.91) are utilized to minimize the squared residual error E and derived by using a normalized gradient descent algorithm. The other terms are utilized to guarantee the stability of the closed-loop system while the critic NNs learn the optimal cost functions and are derived by following Lyapunov stability analysis. The operator \(\varSigma(x,\hat {u},\hat {w})\) is selected based on the Lyapunov’s sufficient condition for stability, which means that the state x is stable if L i (x)>0 and \(\triangledown L_{i}\dot{x}<0\) for player i, i=1,2. When the system (9.58) is stable, the operator \(\varSigma(x,\hat{u},\hat{w})=0\) and it will not take effect. When the system (9.58) is unstable, the operator \(\varSigma(x,\hat{u},\hat{w})=1\) and it will be activated. Therefore, no initial stabilizing control policies are needed due to the introduction of the operator \(\varSigma(x,\hat{u},\hat{w})\).

Remark 9.33

From (9.88) and (9.89), it can be seen that the approximate Hamilton functions \(H_{1}(x,\hat{W}_{c1},\hat{W}_{c2})=e_{1}=0\) and \(H_{2}(x,\hat{W}_{c1},\hat{W}_{c2})=e_{2}=0\) when x=0. In this case, the tuning laws of the critic NN weights for the two players (9.90) and (9.91) can no longer achieve the purpose of optimization. This can be considered as a persistency of excitation requirement on the system states. Therefore, the system states must be persistently excited enough that minimizing the squared residual error E drives the critic NN weights toward their ideal values. In order to satisfy the persistent excitation condition, probing noise is added to the control input.
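One common way to meet this persistent excitation requirement, shown here only as a hedged illustration (the chapter does not specify the exact probing signal), is to superimpose a small, decaying sum of sinusoids on the control input during the learning phase.

```python
import numpy as np

def probing_noise(t, amplitude=0.1, decay=0.01, freqs=(1.0, 3.0, 7.0, 11.0)):
    """Small exponentially decaying multi-sine signal added to the control input
    so that the states stay persistently excited while the critic weights are
    tuned (illustrative choice of amplitude, decay, and frequencies)."""
    return amplitude * np.exp(-decay * t) * sum(np.sin(f * t) for f in freqs)

# Example: perturb a nominal control value during learning.
t = 2.5
u_hat = -0.4                        # placeholder for the current estimate from (9.85)
u_applied = u_hat + probing_noise(t)
print("applied control:", u_applied)
```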

Define the weight estimation errors of critic NNs for two players to be \(\tilde{W}_{c1}=W_{c1}-\hat{W}_{c1}\) and \(\tilde{W}_{c2}=W_{c2}-\hat{W}_{c2}\), respectively. From (9.78) and (9.79), we observe that

(9.93)
(9.94)

Combining (9.90) with (9.93), we have

(9.95)

Similarly, combining (9.91) with (9.94), we have

(9.96)

In the following, the stability analysis will be performed. First, the following assumption is made, which can reasonably be satisfied under the current problem settings.

Assumption 9.34

  1. (a)

    g(⋅) and k(⋅) are upper bounded, i.e., ∥g(⋅)∥≤g M and ∥k(⋅)∥≤k M with g M and k M being positive constants.

  2. (b)

    The critic NN approximation errors and their gradients are upper bounded so that ∥ε i ∥≤ε iM and ∥▽ε i ∥≤ε idM with ε iM and ε idM being positive constants, i=1,2.

  3. (c)

    The critic NN activation function vectors are upper bounded, so that ∥ϕ i ∥≤ϕ iM and ∥▽ϕ i ∥≤ϕ idM , with ϕ iM and ϕ idM being positive constants, i=1,2.

  4. (d)

    The critic NN weights are upper bounded so that ∥W i ∥≤W iM with W iM being positive constant, i=1,2. The residual errors \(\varepsilon_{\rm HJi}\) are upper bounded, so that \(\|\varepsilon_{\rm HJi}\|\leq\varepsilon_{{\rm HJ}iM}\) with \(\varepsilon_{{\rm HJ}iM}\) being positive constant, i=1,2.

Now we are ready to prove the following theorem.

Theorem 9.35

(cf. [17])

Consider the system given by (9.58). Let the control inputs be provided by (9.85) and (9.86), and let the critic NN weight tuning laws be given by (9.90) and (9.91). Then, the system state x and the weight estimation errors of the critic NNs \(\tilde {W}_{c1}\) and \(\tilde {W}_{c2}\) are uniformly ultimately bounded (UUB). Furthermore, the obtained control inputs \(\hat {u}\) and \(\hat {w}\) in (9.85) and (9.86) converge approximately to the Nash equilibrium policies of the non-zero-sum differential games, i.e., \(\hat {u}\) and \(\hat {w}\) are close to the optimal control inputs \(u^\ast\) and \(w^\ast\) within bounds ϵ u and ϵ w , respectively.

Proof

Choose the following Lyapunov function candidate:

(9.97)

where L 1(x) and L 2(x) are given by Lemma 9.31.

The derivative of the Lyapunov function candidate (9.97) along the system (9.87) is computed as

(9.98)

Then, substituting (9.95) and (9.96) into (9.98), we have

(9.99)

In (9.99), the last two terms can be rewritten as

(9.100)

Define \(z=[\bar{\sigma}_{1}^{\mathrm{T}}\tilde {W}_{c1},\bar{\sigma} _{2}^{\mathrm{T}} \tilde {W}_{c2},\tilde {W}_{c1},\tilde {W}_{c2}]^{\mathrm{T}}\); then (9.99) can be rewritten as

(9.101)

where the components of the matrix M are given by

and the components of the vector δ=[d 1 d 2 d 3 d 4]T are given as

According to Assumption 9.34 and the facts that \(\bar{\sigma}_{1}<1\) and \(\bar{\sigma}_{2}<1\), it can be concluded that δ is bounded by δ M . Let the parameters F 1, F 2, F 3, and F 4 be chosen such that M>0. Then, taking the upper bound of (9.101) yields
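Whether a candidate choice of the parameters F 1, F 2, F 3, and F 4 indeed renders M>0 can be checked numerically; the fragment below is a generic positive-definiteness test (the entries of M are problem dependent and are not reconstructed here):

```python
import numpy as np

def is_positive_definite(M):
    """Return True if the (symmetrized) matrix M is positive definite.
    A Cholesky factorization succeeds exactly when M > 0, so this gives a
    simple numerical check for a candidate choice of F1, ..., F4."""
    try:
        np.linalg.cholesky((M + M.T) / 2.0)   # symmetrize to guard against round-off
        return True
    except np.linalg.LinAlgError:
        return False
```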

(9.102)

Now, the cases of \(\varSigma(x,\hat{u},\hat{w})=0\) and \(\varSigma(x,\hat{u},\hat{w})=1\) will be considered.

(1) When \(\varSigma(x,\hat{u},\hat{w})=0\), the first two terms of (9.102) are negative. Noting that ∥x∥>0, as guaranteed by the persistent excitation condition, and using the operator defined in (9.92), it can be ensured that there exists a constant \(\dot{x}_{\min}\) satisfying \(0<\dot{x}_{\min}<\|\dot{x}\|\). Then (9.102) becomes

(9.103)

If any one of the following inequalities:

(9.104)

or

(9.105)

or

(9.106)

holds, then \(\dot{L}<0\). Therefore, using Lyapunov theory, it can be concluded that ∥▽L 1∥, ∥▽L 2∥, and ∥z∥ are UUB.

(2) When \(\varSigma(x,\hat{u},\hat{w})=1\), the feedback control inputs (9.85) and (9.86) may not stabilize the system (9.58). Adding and subtracting \(\triangledown L_{1}^{\mathrm{T}} D_{1}\varepsilon_{1}/2+\triangledown L_{2}^{\mathrm{T}}D_{2}\varepsilon_{2}/2\) on the right-hand side of (9.102), and using (9.64), (9.65), and (9.80), we have

(9.107)

According to Assumption 9.34, D i is bounded by D iM , where D iM is a known constant, i=1,2. Using Lemma 9.31 and recalling the boundedness of ▽ε 1, ▽ε 2, and δ, (9.107) can be rewritten as

(9.108)

where

If any one of the following inequalities:

(9.109)

or

(9.110)

or

(9.111)

holds, then \(\dot{L}<0\). Therefore, using Lyapunov theory, it can be concluded that ∥▽L 1∥, ∥▽L 2∥, and ∥z∥ are UUB.

In summary, for both cases \(\varSigma(x,\hat{u},\hat{w})=0\) and \(\varSigma(x,\hat{u},\hat{w})=1\), if \(\|\triangledown L_{1}\|>\max( B_{\triangledown L_{1}}, B'_{\triangledown L_{1}})\triangleq\bar{B}_{\triangledown L_{1}}\), or \(\|\triangledown L_{2}\|>\max( B_{\triangledown L_{2}}, B'_{\triangledown L_{2}})\triangleq\bar{B} _{\triangledown L_{2}}\), or \(\|z\|>\max( B_{z}, B'_{z})\triangleq\bar{B}_{z}\), then \(\dot{L}<0\). Therefore, we can conclude that ∥▽L 1∥, ∥▽L 2∥, and ∥z∥ are bounded by \(\bar{B}_{\triangledown L_{1}}\), \(\bar{B}_{\triangledown L_{2}}\), and \(\bar{B}_{z}\), respectively. According to Lemma 9.31, the Lyapunov function candidates L 1(x) and L 2(x) are radially unbounded and continuously differentiable, so the boundedness of ∥▽L 1∥ and ∥▽L 2∥ implies the boundedness of ∥x∥. Specifically, ∥x∥ is bounded by \(\bar{B}_{x}=\max(B_{1x},B_{2x})\), where B 1x and B 2x are determined by \(\bar{B}_{\triangledown L_{1}}\) and \(\bar{B}_{\triangledown L_{2}}\), respectively. Besides, note that if any component of z exceeded \(\bar{B}_{z}\), i.e., \(\|\tilde {W}_{c1}\|>\bar{B}_{z}\) or \(\|\tilde{W}_{c2}\|>\bar{B}_{z}\) or \(\|\bar {\sigma}_{1}^{\mathrm{T}}\tilde {W}_{c1}\|>\bar{B}_{z}\) or \(\|\bar {\sigma}_{2}^{\mathrm{T}}\tilde {W}_{c2}\|>\bar{B}_{z}\), then ∥z∥ would exceed \(\bar{B}_{z}\) as well; hence the critic NN weight estimation errors \(\|\tilde{W}_{c1}\|\) and \(\|\tilde{W}_{c2}\|\) are also bounded by \(\bar{B}_{z}\).

Next, we will prove \(\|\hat{u}-u^{\ast}\|\leq\epsilon_{u}\) and \(\|\hat{w}-w^{\ast}\|\leq\epsilon_{w}\). From (9.64) and (9.85) and recalling the boundedness of ∥▽ϕ 1∥ and \(\|\tilde{W}_{c1}\|\), we have

(9.112)

Similarly, from (9.65) and (9.86) and recalling the boundedness of ∥▽ϕ 2∥ and \(\|\tilde{W}_{c2}\|\), we obtain \(\|\hat{w}-w^{\ast}\|\leq\epsilon_{w}\).

This completes the proof. □

Remark 9.36

In [10], each player needs two NNs, a critic NN and an action NN, to implement the online learning algorithm. By contrast, the present method requires only one critic NN for each player; the action NN is eliminated, which results in a simpler architecture and a lower computational burden.

Remark 9.37

In Remark 3 of [10], it was pointed out that the NN weights can be initialized randomly but must be nonzero. The reason is that the method proposed in [10] requires initial stabilizing control policies to guarantee the stability of the system. By contrast, in this subsection no initial stabilizing control policies are needed, because an operator selected according to Lyapunov's sufficient condition for stability is added to the critic NN weight tuning law of each player.

9.4.3 Simulations

Example 9.38

An example is provided to demonstrate the effectiveness of the present control scheme.

Consider the affine nonlinear system as follows:

(9.113)

where

(9.114)
(9.115)

The cost functionals for player 1 and player 2 are defined by (9.59) and (9.60), respectively, where Q 1(x)=2x T x, R 11=R 12=2I, Q 2(x)=x T x, R 21=R 22=2I, and I denotes an identity matrix of appropriate dimensions.
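The expressions (9.59) and (9.60) are not restated here; assuming they take the standard quadratic-in-control form \(J_{i}=\int_{0}^{\infty}(Q_{i}(x)+u^{\mathrm{T}}R_{i1}u+w^{\mathrm{T}}R_{i2}w)\,\mathrm{d}t\), the running costs of this example read as in the following sketch (scalar control inputs are assumed purely for illustration):

```python
import numpy as np

R11 = R12 = R21 = R22 = 2.0 * np.eye(1)   # R matrices of Example 9.38 (dimension assumed)

def running_cost_player1(x, u, w):
    """l_1(x,u,w) = Q_1(x) + u^T R_11 u + w^T R_12 w with Q_1(x) = 2 x^T x."""
    return 2.0 * (x @ x) + u @ R11 @ u + w @ R12 @ w

def running_cost_player2(x, u, w):
    """l_2(x,u,w) = Q_2(x) + u^T R_21 u + w^T R_22 w with Q_2(x) = x^T x."""
    return x @ x + u @ R21 @ u + w @ R22 @ w
```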

For player 1, the optimal cost function is \(V^{\ast}_{1}(x)=0.5x_{1}^{2}+x_{2}^{2}\). For player 2, the optimal cost function is \(V^{\ast}_{2}(x)=0.25x_{1}^{2}+0.5x_{2}^{2}\). The activation functions of the critic NNs of the two players are selected as \(\phi_{1}=\phi_{2}= [x_{1}^{2},x_{1}x_{2}, x_{2}^{2}]^{\mathrm{T}}\). Then, the optimal values of the critic NN weights for player 1 are W c1=[0.5,0,1]T, and the optimal values of the critic NN weights for player 2 are W c2=[0.25,0,0.5]T. The estimates of the critic NN weights for the two players are denoted by \(\hat{W}_{c1} = [W_{11}, W_{12},W_{13}]^{\mathrm{T}}\) and \(\hat{W}_{c2} = [W_{21}, W_{22},W_{23}]^{\mathrm{T}}\), respectively. The adaptive gains for the critic NNs are selected as a 1=1 and a 2=1, and the design parameters are selected as F 1=F 2=F 3=F 4=10I. All NN weights are initialized to zero, which means that no initial stabilizing control policies are needed for implementing the present control scheme. The system state is initialized as [0.5,0.2]T. To maintain the excitation condition, probing noise is added to the control input for the first 250 s.
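As a quick consistency check, the optimal critic weights above follow directly from expanding \(V^{\ast}_{i}(x)\) in the chosen activation basis; the sketch below (variable names are illustrative) evaluates \(\hat{V}_{i}(x)=\hat{W}_{ci}^{\mathrm{T}}\phi_{i}(x)\) at the optimal weights and compares it with \(V^{\ast}_{i}(x)\) at the initial state:

```python
import numpy as np

def phi(x):
    """Critic activation vector phi_1 = phi_2 = [x1^2, x1*x2, x2^2]^T."""
    x1, x2 = x
    return np.array([x1 ** 2, x1 * x2, x2 ** 2])

def V_hat(W, x):
    """Approximate cost V_i(x) = W_ci^T phi_i(x)."""
    return W @ phi(x)

W_c1 = np.array([0.5, 0.0, 1.0])    # optimal critic weights, player 1
W_c2 = np.array([0.25, 0.0, 0.5])   # optimal critic weights, player 2
x0 = np.array([0.5, 0.2])           # initial state used in the simulation

print(V_hat(W_c1, x0), 0.5 * x0[0] ** 2 + x0[1] ** 2)          # both 0.165
print(V_hat(W_c2, x0), 0.25 * x0[0] ** 2 + 0.5 * x0[1] ** 2)   # both 0.0825
```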

After simulation, the trajectories of the system states are shown in Fig. 9.13. The convergence trajectories of the critic NN weights for player 1 are shown in Fig. 9.14, from which we see that they finally converge to [0.4490,0.0280,0.9777]T. The convergence trajectories of the critic NN weights for player 2 are shown in Fig. 9.15, from which we see that they finally converge to [0.1974,0.0403,0.4945]T. The convergence trajectory of \(e_{u}=\hat{u}-u^{\ast}\) is shown in Fig. 9.16, and the convergence trajectory of \(e_{w}=\hat{w}-w^{\ast}\) is shown in Fig. 9.17. From Fig. 9.16, we see that the error between the estimated control \(\hat{u}\) and the optimal control \(u^{\ast}\) for player 1 is close to zero by t=230 s. Similarly, it can be seen from Fig. 9.17 that the error between the estimated control \(\hat{w}\) and the optimal control \(w^{\ast}\) for player 2 is close to zero by t=180 s. The simulation results reveal that the present control scheme makes the critic NN of each player learn the optimal cost function while guaranteeing the stability of the closed-loop system.

Fig. 9.13 The trajectories of system states

Fig. 9.14 The convergence trajectories of critic NN weights for player 1

Fig. 9.15 The convergence trajectories of critic NN weights for player 2

Fig. 9.16 The convergence trajectory of e u

Fig. 9.17 The convergence trajectory of e w

For comparison with [10], we use the method proposed in [10] to solve the non-zero-sum game of the system (9.113) with all NN weights initialized to zero; the resulting trajectories of the system states are shown in Fig. 9.18. The system is unstable, which confirms that the method in [10] requires initial stabilizing control policies to guarantee the stability of the system. By contrast, the present method does not need initial stabilizing control policies.

Fig. 9.18 The trajectories of system states obtained by the method in [10] with initial NN weights selected to be zero

As pointed out earlier, one of the main advantages of the single-network ADP approach is that it results in a lower computational burden and eliminates the approximation error introduced by the action NNs. To demonstrate this quantitatively, we apply the method in [10] and our method to the system (9.113) with the same initial condition. Figures 9.19 and 9.20 show the convergence trajectories of the critic NN weights for player 1 and player 2, where the solid line and the dashed line represent the results of the method in [10] and our method, respectively. For the convenience of comparison, we define an evaluation function \(\text{PER}(i)=\sum_{k=1}^{N} \|\tilde{W}_{i}(k)\|\), i=1,2, i.e., the sum of the norms of the critic NN weight estimation errors over the running time, where N is the number of sample points. The evaluation functions of the critic NN estimation errors, as well as the time taken by the method in [10] and by our method, are shown in Table 9.1. The table clearly indicates that the present method takes less time and yields a smaller approximation error than the method in [10].
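The evaluation function PER(i) can be computed directly from the recorded weight trajectories; the following sketch (array shapes and names are illustrative) implements the definition above:

```python
import numpy as np

def per(W_hat_history, W_opt):
    """PER(i) = sum_{k=1}^{N} ||W_tilde_i(k)||: the accumulated norm of the
    critic NN weight estimation error over the N sample points of a run.
    W_hat_history is an (N, 3) array of recorded weight estimates."""
    errors = W_hat_history - W_opt              # W_tilde_i(k) at every sample point
    return np.sum(np.linalg.norm(errors, axis=1))

# Example: per(recorded_W_c1, np.array([0.5, 0.0, 1.0])) for player 1.
```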

Fig. 9.19 The convergence trajectories of critic NN weights for player 1 (solid line: the method in [10]; dashed line: our method)

Fig. 9.20 The convergence trajectories of critic NN weights for player 2 (solid line: the method in [10]; dashed line: our method)

Table 9.1 Critic NN estimation errors and calculation time

9.5 Summary

In this chapter, we investigated the problem of continuous-time differential games based on ADP. In Sect. 9.2, we developed a new iterative ADP method to obtain the optimal control pair or the mixed optimal control pair for a class of affine nonlinear zero-sum differential games. In Sect. 9.3, finite horizon zero-sum games for nonaffine nonlinear systems were studied. Then, in Sect. 9.4, the case of non-zero-sum differential games was studied using a single network ADP. Several numerical simulations showed that the present methods are effective.