9.1 Introduction

Game theory is concerned with the study of decision making in situations where two or more rational opponents are involved under conditions of conflicting interests. It has been widely investigated by many authors [5, 7, 8, 12, 13]. Although the nonlinear optimal control solution in terms of the Hamilton–Jacobi–Bellman equation is hard to obtain directly [4], that problem at least involves only a single controller or decision maker. In the previous chapter, we studied discrete-time zero-sum games based on the ADP method. In this chapter, we consider continuous-time games.

In much of the literature on zero-sum differential games [1, 6, 11], the existence of the saddle point is assumed before the saddle point is obtained. In many real-world applications, however, the saddle point of a game may not exist, which means that only a mixed optimal solution of the game can be obtained. In Sect. 9.2, for a class of affine nonlinear zero-sum games, we study how to obtain the saddle point by the ADP method without imposing complex existence conditions, and how to obtain the mixed optimal solution when the saddle point does not exist. Note that many practical zero-sum games have nonaffine control inputs. In Sect. 9.3, we therefore focus on finite horizon zero-sum games for a class of nonaffine nonlinear systems.

Non-zero-sum differential game theory also has a number of potential applications in control engineering, economics, and the military field [9]. In zero-sum differential games, the two players act on a single cost functional, which one of them minimizes and the other maximizes. In non-zero-sum games, by contrast, the control objective is to find a set of policies that guarantee the stability of the system and minimize each player's individual performance function so as to yield a Nash equilibrium. In Sect. 9.4, non-zero-sum differential games will be studied using a single-network ADP.

9.2 Infinite Horizon Zero-Sum Games for a Class of Affine Nonlinear Systems

In this section, nonlinear infinite horizon zero-sum differential games are studied. We propose a new iterative ADP method that is effective whether or not the saddle point exists. When the saddle point exists, the usual existence conditions are avoided and the value function reaches the saddle point under the present iterative ADP method. When the saddle point does not exist, the mixed optimal value function is obtained under a deterministic mixed optimal control scheme, using the same iterative ADP algorithm.

9.2.1 Problem Formulation

Consider the following two-person zero-sum differential games. The system is described by the continuous-time affine nonlinear equation

$$ \dot{x}(t) = f(x(t),u(t),w(t)) = f(x(t)) + g(x(t))u(t) + k(x(t))w(t), $$
(9.1)

where \(x(t)\in\mathbb{R}^n\), \(u(t)\in\mathbb{R}^k\), \(w(t)\in\mathbb{R}^m\), and the initial condition \(x(0)=x_0\) is given.

The cost functional is the generalized quadratic form given by

$$ J(x(0),u,w) = \int_0^\infty l(x(t),u(t),w(t))\,\mathrm{d}t, $$
(9.2)

where \(l(x,u,w)=x^\mathrm{T}Ax+u^\mathrm{T}Bu+w^\mathrm{T}Cw+2u^\mathrm{T}Dw+2x^\mathrm{T}Eu+2x^\mathrm{T}Fw\). The matrices A, B, C, D, E, and F have suitable dimensions and A≥0, B>0, and C<0. According to the roles of the two players, we have the following definitions. Let \(\overline{J} (x): = \inf _{u } \sup_{w } J(x,u,w)\) be the upper value function and \(\underline{J}(x): = \sup_{w } \inf _{u } J(x,u,w)\) be the lower value function, with the obvious inequality \(\overline{J}(x) \ge \underline{J}(x)\). Define the optimal control pairs for the upper and lower value functions to be \((\overline{u}, \overline{w})\) and \((\underline{u}, \underline{w})\), respectively. Then, we have \(\overline{J} (x) = J(x,\overline{u},\overline{w})\) and \(\underline {J} (x) = J(x,\underline{u},\underline{w})\).

If both \(\overline{J} (x)\) and \(\underline{J}(x)\) exist and

$$ \overline{J} (x)= \underline{J}(x)=J^\ast(x) $$
(9.3)

holds, we say that the saddle point exists, and the corresponding optimal control pair is denoted by \((u^\ast, w^\ast)\).
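The following minimal numpy sketch makes the generalized quadratic utility \(l(x,u,w)\) of (9.2) and its sign conditions concrete; all dimensions and numerical values are hypothetical placeholders chosen only for illustration, not taken from this chapter.

```python
import numpy as np

# Hypothetical dimensions (n=2, k=1, m=1) and weight matrices, chosen only to
# illustrate the generalized quadratic utility l(x, u, w) and its sign conditions.
A = np.diag([1.0, 0.5])          # A >= 0 (positive semidefinite)
B = np.array([[2.0]])            # B > 0  (positive definite)
C = np.array([[-1.0]])           # C < 0  (negative definite)
D = np.array([[0.1]])            # u-w cross weight (k x m)
E = np.array([[0.2], [0.0]])     # x-u cross weight (n x k)
F = np.array([[0.0], [0.3]])     # x-w cross weight (n x m)

def utility(x, u, w):
    """Generalized quadratic utility l(x,u,w) of the cost functional (9.2)."""
    return (x @ A @ x + u @ B @ u + w @ C @ w
            + 2 * u @ D @ w + 2 * x @ E @ u + 2 * x @ F @ w)

# Check the sign conditions through the eigenvalues of the symmetric weights.
assert np.all(np.linalg.eigvalsh(A) >= 0)
assert np.all(np.linalg.eigvalsh(B) > 0)
assert np.all(np.linalg.eigvalsh(C) < 0)

x, u, w = np.array([1.0, -0.5]), np.array([0.2]), np.array([0.1])
print("l(x,u,w) =", utility(x, u, w))
```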

We have the following lemma.

Lemma 9.1

If the nonlinear system (9.1) is controllable and both the upper value function and lower value function exist, then \(\overline{J}(x)\) is a solution of the following upper Hamilton–Jacobi–Isaacs (HJI) equation:

(9.4)

which is denoted by \(\mathrm{HJI}(\overline{J}(x), \overline{u},\overline{w})=0 \) and \(\underline{J}(x)\) is a solution of the following lower HJI equation:

(9.5)

which is denoted by \(\mathrm{HJI}(\underline{J}(x), \underline{u}, \underline{w})=0\).

9.2.2 Zero-Sum Differential Games Based on Iterative ADP Algorithm

As the HJI equations (9.4) and (9.5) cannot be solved in general, in the following, a new iterative ADP method for zero-sum differential games is developed.

9.2.2.1 Derivation of the Iterative ADP Method

The goal of the present iterative ADP method is to obtain the saddle point. As the saddle point may not exist, this motivates us to obtain the mixed optimal value function J o(x) where \(\underline{J}(x) \le J^{o}(x) \le\overline{J}(x)\).

Theorem 9.2

(cf. [15])

Let \((\overline{u}, \overline{w})\) be the optimal control pair for \(\overline{J}(x)\) and \((\underline{u}, \underline{w})\) be the optimal control pair for \(\underline{J}(x)\). Then, there exist control pairs \((\overline{u},w)\) and \((u,\underline{w})\) which lead to \(J^{o}(x)=J(x,\overline{u},w)=J(x,u,\underline{w})\). Furthermore, if the saddle point exists, then J o(x)=J (x).

Proof

According to the definition of \(\overline{J} (x)\), we have \(J(x,\overline{u},w) \leq J(x,\overline{u},\overline{w})\). As J o(x) is a mixed optimal value function, we also have \(J^{o}(x) \leq J(x,\overline{u},\overline{w})\). As the system (9.1) is controllable and w is continuous on ℝm, there exists a control pair \((\overline{u},w)\) which makes \(J^{o}(x)=J(x,\overline{u},w)\). On the other hand, we have \(J^{o}(x) \geq J(x,\underline{u},\underline{w})\). We also have \(J(x,{u},\underline{w}) \geq J(x,\underline{u},\underline{w})\). As u is continuous on ℝk, there exists a control pair \(({u},\underline{w})\) which makes \(J^{o}(x)=J(x,{u},\underline{w})\). If the saddle point exists, we have (9.3). On the other hand, \(\underline{J}(x) \leq J^{o}(x) \leq\overline{J} (x)\). Then, clearly J o(x)=J (x). □

If (9.3) holds, we have a saddle point; if not, we adopt a mixed trajectory to obtain the mixed optimal solution of the game. To apply the mixed trajectory method, a game matrix over the trajectory sets of the control pair (u,w) is needed. Small Gaussian noises \(\gamma_u\in\mathbb{R}^k\) and \(\gamma_w\in\mathbb{R}^m\) are introduced and added to the optimal controls \(\underline{u}\) and \(\overline{w}\), respectively, where \(\gamma_{u}^{i} \sim N(0, \sigma_{i}^{2})\), i=1,…,k, and \(\gamma_{w}^{j} \sim N(0, \sigma_{j}^{2})\), j=1,…,m, are zero-mean Gaussian noises with variances \(\sigma_{i}^{2}\) and \(\sigma_{j}^{2}\), respectively.

We define the expected value function as

\(E(J(x))= \min_{P_{\mathrm{I}i} } \max_{P_{\mathrm{II}j} } \sum_{i = 1}^{2} \sum_{j = 1}^{2} {P_{\mathrm{I}i} L_{ij} P_{\mathrm{II}j} }\), where we let \(L_{11}= J(x,\overline {u},\overline{w})\), \(L_{12}=J(x,(\underline{u} + \gamma_{u} ),\underline{w})\), \(L_{21}=J(x,\underline{u},\underline{w})\), and \(L_{22}=J(x,\overline{u}, (\overline{w} + \gamma_{w} ))\). Let \(\sum_{i = 1}^{2} {P_{\mathrm{I}i} } = 1 \) with \(P_{\mathrm{I}i} >0\), and \(\sum_{j = 1}^{2} {P_{\mathrm{II}j} } = 1 \) with \(P_{\mathrm{II}j} >0\). Next, let N be a large enough positive integer. Calculating the expected value function N times, we obtain \(E_1(J(x)), E_2(J(x)), \ldots, E_N(J(x))\). Then, the mixed optimal value function can be written as

$$J^o (x) = E(E_i (J(x))) = \displaystyle\frac{1}{N}\sum\limits_{i = 1}^N {E_i (J(x))}. $$
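The expected value function above is the value of a 2×2 matrix game played in mixed strategies. As a hedged illustration (scipy is assumed to be available, and the entries of the game matrix and the noise level are placeholders standing in for the simulated costs, not values from this chapter), the sketch below solves one such game by linear programming for the minimizing player \(P_{\mathrm{I}}\) and then averages N realizations, as in the definition of \(J^o(x)\).

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)

def mixed_game_value(L):
    """Value min_{P_I} max_{P_II} P_I^T L P_II of a 2x2 zero-sum matrix game.
    Decision variables: [p1, p2, v], where (p1, p2) is the minimizer's mix."""
    c = np.array([0.0, 0.0, 1.0])                       # minimize v
    A_ub = np.array([[L[0, 0], L[1, 0], -1.0],          # p^T L[:, j] <= v for each
                     [L[0, 1], L[1, 1], -1.0]])         # pure column j of P_II
    b_ub = np.zeros(2)
    A_eq = np.array([[1.0, 1.0, 0.0]])                  # p1 + p2 = 1
    b_eq = np.array([1.0])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0, 1), (0, 1), (None, None)])
    return res.x[2]

# Placeholder payoffs: L11 = J(x, u_bar, w_bar) and L21 = J(x, u_low, w_low) are fixed;
# L12 and L22 change across realizations because of the Gaussian control noise.
L_base = np.array([[0.65, 0.45],
                   [0.45, 0.65]])
N = 500                                      # the chapter's example uses N = 5000
values = []
for _ in range(N):
    L = L_base.copy()
    L[0, 1] += 0.05 * rng.standard_normal()  # stands in for J(x, u_low + gamma_u, w_low)
    L[1, 1] += 0.05 * rng.standard_normal()  # stands in for J(x, u_bar, w_bar + gamma_w)
    values.append(mixed_game_value(L))

J_mixed = np.mean(values)                    # J^o(x) = (1/N) sum_i E_i(J(x))
print("mixed optimal value estimate:", J_mixed)
```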

Remark 9.3

In the classical mixed trajectory method, the whole control sets ℝk and ℝm should be searched under some distribution functions. As there are no constraints on either control, there exist controls that make the system unstable, which is not permitted for real-world control systems. Thus, it is impossible to search the whole control sets, and we can only search the local area around the stable controls that guarantees stability of the system. This is the reason why the small Gaussian noises γ u and γ w are introduced; they define a local stable neighborhood of the control pairs. A proposition will be given to show that the control pair chosen in this local area is stable (see Proposition 9.14). Similar work can also be found in [3, 14].

The mixed optimal solution is a mathematical expectation, which means that it cannot be realized exactly once the trajectories are fixed. For most practical optimal control problems, however, the expected optimal solution (or mixed optimal solution) has to be achieved. To overcome this difficulty, a new method is developed in this section. Let \(\alpha = {{({J^{o}}(x) - \underline{J}(x))} / {(\overline{J}(x) -\underline{J} (x))}}\). Then, J o(x) can be written as \(J^{o} (x) = \alpha\overline{J}(x) + (1 - \alpha)\underline{J}(x)\). Let \(l^{o}(x,\overline{u},\overline{w}, \underline{u}, \underline{w})=\alpha l(x,\overline{u},\overline{w})+ (1-\alpha) l(x,\underline{u}, \underline{w})\). We have \(J^{o} (x(0)) = \int_{0}^{\infty}{l^{o}} \mathrm{d}t\). According to Theorem 9.2, the mixed optimal control pair can be obtained by regulating the control w in the control pair \((\overline{u}, \overline{w})\) so as to minimize the error between \(\mathcal{J}(x)\) and J o(x), where the value function \(\mathcal{J}(x)\) is defined as \(\mathcal{J}(x(0)) = \,J (x(0), \overline{u}, w) = \int_{0}^{\infty}l(x,\overline{u},w)\mathrm{d}t\) and \(\underline{J}(x(0)) \leq\mathcal {J}(x(0)) \leq\overline{J}(x(0))\).

Define \(\widetilde{J}(x(0)) = \int_{0}^{\infty}{\widetilde{l}(x,w)}\, \mathrm{d}t\), where \(\widetilde{l}(x,w)=l(x,\overline{u},w) - l^{o}(x,\overline {u},\overline{w}, \underline{u}, \underline{w})\). Then, the problem can be described as \(\min_{w} (\widetilde{J}(x))^{2}\).

According to the principle of optimality, when \(\widetilde{J}(x)\geq 0\) we have the following HJB equation:

(9.6)

For \(\widetilde{J}(x)< 0\), we have \(-\widetilde{J}(x)=-(\mathcal {J}(x)-J^{o}(x))> 0\), and we can obtain the same HJB equation as (9.6).

9.2.2.2 The Iterative ADP Algorithm

Given the above preparation, we now formulate the iterative ADP algorithm for zero-sum differential games as follows:

  1. 1.

    Initialize the algorithm with a stabilizing control pair \((u^{[0]},w^{[0]})\) and an initial value function \(V^{[0]}\). Choose the computation precision ζ>0. Set i=0.

  2. 2.

    For the upper value function, let

    (9.7)

    where the iterative optimal control pair is formulated as

    $$ \overline{u}^{[i+1]} = - \frac{1}{2}B^{ - 1}\big(2D\overline{w}^{[i+1]} + 2E^\mathrm{T}x+g^\mathrm{T}(x)\overline{V}^{[i]}_x\big), $$
    (9.8)

    and

    $$ \overline{w}^{[i+1]} = - \frac{1}{2}C^{ - 1}\big(2D^\mathrm{T}\overline{u}^{[i+1]} + 2F^\mathrm{T}x+k^\mathrm{T} (x)\overline{V}^{[i]}_x\big). $$
    (9.9)

    \((\overline{u}^{[i]},\overline{w}^{[i]})\) satisfies the HJI equation \(\mathrm{HJI}(\overline{V}^{[i]}(x)\), \(\overline{u}^{[i]},\overline{w}^{[i]})=0\), and \(\overline{V}_{x}^{[i]} = d\overline{V}^{[i]}(x)/ dx\). (A numerical sketch of the coupled control update (9.8)–(9.9) is given after the algorithm listing.)

  3. 3.

    If \(| \overline{V}^{[i+1]} (x(0)) - \overline{V}^{[i]} (x(0)) | < \zeta\), let \(\overline{u} = \overline{u}^{[i]} \), \(\overline{w} =\overline{w}^{[i]} \) and \(\overline{J}(x)=\overline{V}^{[i+1]} (x)\). Set i=0 and go to Step 4. Else, set i=i+1 and go to Step 2.

  4. 4.

    For the lower value function, let

    (9.10)

    where the iterative optimal control pair is formulated as

    $$ \underline{u}^{[i+1]} = - \frac{1}{2}B^{ - 1} \big(2D\underline{w}^{[i+1]} +2E^\mathrm{T}x+ g^\mathrm{T}(x)\underline {V}_x^{[i]} \big), $$
    (9.11)

    and

    $$ \underline{w}^{[i+1]} = - \frac{1}{2}C^{ - 1}\big(2D^\mathrm{T}\underline{u}^{[i+1]} + 2F^\mathrm{T}x+k^\mathrm{T}(x)\underline{V}^{[i]}_x\big). $$
    (9.12)

    \((\underline{u}^{[i]},\underline{w}^{[i]})\) satisfies the HJI equation \(\mathrm{HJI}(\underline{V}^{[i]}(x), \underline{u}^{[i]},\underline{w}^{[i]})=0\), and \(\underline{V}_{x}^{[i]} = {d{\underline{V}^{[i]}}(x)} / {dx}\).

  5. 5.

    If \(| {\underline{V}^{[i+1]} (x(0)) - \underline{V}^{[i]} (x(0))} | < \zeta\), let \(\underline{u}=\underline{u}^{[i]}\), \(\underline{w}=\underline{w}^{[i]}\) and \(\underline{J}(x)=\underline{V}^{[i+1]} (x)\). Set i=0 and go to Step 6. Else, set i=i+1 and go to Step 4.

  6. 6.

    If \(| {\overline{J}(x(0)) - \underline{J}(x(0))} | < \zeta\), stop, and the saddle point is achieved. Else set i=0 and go to the next step.

  7. 7.

    Regulate the control w for the upper value function and let

    (9.13)

    The iterative optimal control is formulated as

    $$ {w}^{[i]} = - \frac{1}{2}C^{ - 1}(2D^\mathrm{T} \overline{u} + 2F^\mathrm{T}x+k^\mathrm{T}(x) \widetilde{V}^{[i+1]}_x), $$
    (9.14)

    where \(\widetilde{V}_{x}^{[i]} = d{\widetilde{V}^{[i]}}(x)/ dx\).

  8. 8.

    If \(| \mathcal{V}^{[i+1]} (x(0)) - J^{o} (x(0)) | < \zeta\), stop. Else, set i=i+1 and go to Step 7.
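In steps 2, 4, and 7, the only quantity that has to be computed pointwise is the control pair satisfying the stationarity conditions (9.8)–(9.9) (or their lower-value counterparts (9.11)–(9.12)). Once \(\overline{V}_x^{[i]}(x)\) is fixed, these conditions are linear in (u, w) and can be solved jointly as a single linear system. The sketch below does this with numpy; the weight matrices, the input gains, and the gradient value are hypothetical placeholders, so this is an illustration of the coupled update rather than the chapter's implementation.

```python
import numpy as np

# Hypothetical problem data (n=2, k=1, m=1); the values only illustrate the solve.
B = np.array([[2.0]]); C = np.array([[-1.0]]); D = np.array([[0.1]])
E = np.array([[0.2], [0.0]]); F = np.array([[0.0], [0.3]])
g = np.array([[0.0], [1.0]])        # g(x), n x k
k = np.array([[0.0], [0.5]])        # k(x), n x m

def control_pair(x, Vx):
    """Solve the coupled stationarity conditions
       2B u + 2D w   = -(2E^T x + g^T Vx)
       2D^T u + 2C w = -(2F^T x + k^T Vx)
    jointly for (u, w), with the critic gradient Vx = dV/dx held fixed."""
    M = np.block([[2 * B, 2 * D],
                  [2 * D.T, 2 * C]])
    rhs = -np.concatenate([2 * E.T @ x + g.T @ Vx,
                           2 * F.T @ x + k.T @ Vx])
    uw = np.linalg.solve(M, rhs)
    return uw[:1], uw[1:]           # (u, w)

x  = np.array([1.0, -0.5])
Vx = np.array([0.4, -0.2])          # placeholder for the critic gradient at x
u, w = control_pair(x, Vx)
print("u =", u, " w =", w)
```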

9.2.2.3 Properties of the Iterative ADP Algorithm

In this part, some results are presented to show the stability and convergence of the present iterative ADP algorithm.

Theorem 9.4

(cf. [15])

If for ∀ i≥0, \(\mathrm{HJI}(\overline {V}^{[i]}(x),\overline{u}^{[i]}, \overline{w}^{[i]}) = 0\) holds, and for ∀ t, \(l(x, \overline{u}^{[i]}, \overline{w}^{[i]}) \ge 0\), then the control pairs \(( \overline{u}^{[i]},\overline{w}^{[i]})\) make system (9.1) asymptotically stable.

Proof

According to (9.7), for ∀ t, taking the derivative of \(\overline{V}^{[i]}(x)\), we have

(9.15)

From the HJI equation we have

(9.16)

Combining (9.15) and (9.16), we get

(9.17)

According to (9.8) we have

(9.18)

So, \(\overline{V}^{[i]}(x)\) is a Lyapunov function. Let ε>0 and ∥x(t 0)∥<δ(ε). Then, there exist two functions α(∥x∥) and β(∥x∥) which belong to class \(\mathcal{K}\) and satisfy

(9.19)

Therefore, system (9.1) is asymptotically stable. □

Theorem 9.5

(cf. [15])

If for ∀ i≥0, \(\mathrm{HJI}(\underline {V}^{[i]}(x),\underline{u}^{[i]}, \underline{w}^{[i]}) = 0\) holds, and for ∀ t, \(l(x,\underline{u}^{[i]},\underline{w}^{[i]}) < 0\), then the control pairs \((\underline{u}^{[i]}, \underline{w}^{[i]})\) make system (9.1) asymptotically stable.

Corollary 9.6

If for ∀ i≥0, \(\mathrm{HJI}(\underline{V}^{[i]}(x), \underline{u}^{[i]},\underline{w}^{[i]}) = 0\) holds, and for ∀ t, \(l(x,\underline{u}^{[i]},\underline{w}^{[i]}) \geq0\), then the control pairs \(( \underline{u}^{[i]},\underline{w}^{[i]})\) make system (9.1) asymptotically stable.

Proof

As \(\underline{V}^{[i]}(x) \leq\overline{V}^{[i]}(x)\) and \(l(x,\underline{u}^{[i]},\underline{w}^{[i]}) \geq0\), we have \(0 \leq\underline{V}^{[i]}(x) \leq\overline{V}^{[i]}(x)\).

From Theorem 9.4, we know that for ∀t 0, there exist two functions α(∥x∥) and β(∥x∥) which belong to class \(\mathcal{K}\) and satisfy (9.19).

As \(\overline{V}^{[i]}(x) \rightarrow0\), there exist time instants t 1 and t 2 (without loss of generality, let t 0<t 1<t 2) that satisfy

(9.20)

Choose ε 1>0 that satisfies \(\underline{V}^{[i]} (x(t_{0})) \geq\alpha(\varepsilon_{1}) \geq\overline{V}^{[i]} (x(t_{2}))\). Then, there exists δ 1(ε 1)>0 that makes \(\alpha(\varepsilon_{1}) \geq\beta(\delta_{1}) \geq\overline {V}^{[i]} (x(t_{2})) \). Then, we obtain

(9.21)

According to (9.19), we have

(9.22)

Since α(∥x∥) belongs to class \(\mathcal{K}\), we obtain ∥x∥≤ε.

Therefore, we conclude that the system (9.1) is asymptotically stable. □

Corollary 9.7

If for ∀ i≥0, \(\mathrm{HJI}(\overline{V}^{[i]}(x), \overline{u}^{[i]},\overline{w}^{[i]}) = 0\) holds, and for ∀ t, \(l(x, \overline{u}^{[i]}, \overline{w}^{[i]}) < 0\), then the control pairs \(( \overline{u}^{[i]},\overline{w}^{[i]})\) make system (9.1) asymptotically stable.

Theorem 9.8

(cf. [15])

If for ∀ i≥0, \(\mathrm{HJI}(\overline {V}^{[i]}(x),\overline{u}^{[i]}, \overline{w}^{[i]}) = 0\) holds, and \(l(x, \overline{u}^{[i]}, \overline{w}^{[i]}) \) is the utility function, then the control pairs \((\overline{u}^{[i]}, \overline{w}^{[i]})\) make system (9.1) asymptotically stable.

Proof

For the time sequence \(t_0<t_1<t_2<\cdots<t_m<t_{m+1}<\cdots\), without loss of generality, we assume \(l(x, \overline{u}^{[i]}, \overline{w}^{[i]}) \geq0\) on \([t_{2n},t_{2n+1})\) and \(l(x, \overline{u}^{[i]}, \overline{w}^{[i]}) < 0\) on \([t_{2n+1},t_{2(n+1)})\), where n=0,1,….

Then, for t∈[t 0,t 1) we have \(l(x, \overline{u}^{[i]}, \overline{w}^{[i]}) \geq0\) and \(\int_{t_{0}}^{t_{1}}l(x, \overline {u}^{[i]}, \overline{w}^{[i]}) \mathrm{d}t \geq0\). According to Theorem 9.4, we have ∥x(t 0)∥≥∥x(t)∥≥∥x(t 1)∥.

For t∈[t 1,t 2) we have \(l(x, \overline{u}^{[i]}, \overline {w}^{[i]}) < 0\) and \(\int_{t_{1}}^{t_{2}}l(x, \overline{u}^{[i]}, \overline {w}^{[i]}) \mathrm{d}t < 0\). According to Corollary 9.7, we have ∥x(t 1)∥>∥x(t)∥>∥x(t 2)∥. So we obtain ∥x(t 0)∥≥∥x(t)∥>∥x(t 2)∥, for ∀t∈[t 0,t 2).

Using mathematical induction, for ∀t, we have ∥x(t′)∥≤∥x(t)∥ where t′∈[t,∞). So we conclude that the system (9.1) is asymptotically stable, and the proof is completed. □

Theorem 9.9

(cf. [15])

If for ∀ i≥0, \(\mathrm{HJI}(\underline{V}^{[i]}(x),\underline{u}^{[i]}, \underline{w}^{[i]}) = 0\) holds, and \(l(x,\underline{u}^{[i]},\underline{w}^{[i]}) \) is the utility function, then the control pairs \((\underline{u}^{[i]}, \underline{w}^{[i]})\) make system (9.1) asymptotically stable.

Next, we will give the convergence proof of the iterative ADP algorithm.

Proposition 9.10

If for ∀ i≥0, \(\mathrm{HJI}(\overline {V}^{[i]}(x),\overline{u}^{[i]}, \overline{w}^{[i]}) = 0\) holds, then the control pairs \((\overline{u}^{[i]}, \overline{w}^{[i]} )\) make the upper value function \(\overline{V}^{[i]}(x) \rightarrow \overline{J}(x) \) as i→∞.

Proof

According to \(\mathrm{HJI}(\overline{V}^{[i]}(x),\overline{u}^{[i]}, \overline{w}^{[i]}) = 0\), we obtain \({{\mathrm{d}{{\overline{V}}^{[i+1]}}(x)} / {\mathrm{d}t}} \) by replacing the index “i” by the index “i+1”:

(9.23)

According to (9.18), we obtain

(9.24)

Since the system (9.1) is asymptotically stable, its state trajectories x converge to zero, and so does \(\overline {V}^{[i+1]}(x) - \overline{V}^{[i]}(x)\). Since \({{d( {{\overline{V}}^{[i+1]}}(x) - {\overline{V}}^{[i]}(x) )}/ {\mathrm{d}t}} \ge0\) on these trajectories, it implies that \(\overline {V}^{[i+1]}(x) - \overline{V}^{[i]}(x) \le0\); that is \(\overline {V}^{[i+1]}(x) \le\overline{V}^{[i]}(x)\). Thus, \(\overline {V}^{[i]}(x)\) is convergent as i→∞.

Next, we define \(\lim_{i \to\infty} \overline{V}^{[i]} (x) = \overline{V}^{[\infty]} (x)\).

For ∀i, let \(\overline{w}^{\ast}= \mathrm{arg}\max _{w} \{\int_{t}^{\hat{t}} l(x,u,w) \mathrm{d}\tau+ \overline{V}^{[i]}(x(\hat{t}))\}\). Then, according to the principle of optimality, we have

(9.25)

Since \(\overline{V}^{[i+1]}(x) \le\overline{V}^{[i]}(x)\), we have \(\overline{V}^{[\infty]}(x) \leq \int_{t}^{\hat{t}} l(x,u,\overline{w}^{\ast}) \mathrm{d}\tau+ \overline{V}^{[i]}(x(\hat{t}))\).

Letting i→∞, we obtain \(\overline{V}^{[\infty]}(x) \leq \int_{t}^{\hat{t}} l(x,u,\overline{w}^{\ast}) \mathrm{d}\tau+ \overline{V}^{[\infty]}(x(\hat{t}))\). So, we have \(\overline{V}^{[\infty]}(x) \leq \inf _{u }\sup_{w } \{\int_{t}^{\hat{t}} l(x,u,w) \mathrm{d}\tau+ \overline{V}^{[\infty]}(x(\hat{t}))\}\).

Let ϵ>0 be an arbitrary positive number. Since the upper value function is nonincreasing and convergent, there exists a positive integer i such that \(\overline{V}^{[i]}(x) - \epsilon \leq \overline{V}^{[\infty]}(x) \leq\overline{V}^{[i]}(x)\).

Let \(\overline{u}^{\ast}=\mathrm{arg} \min_{u} \{\int_{t}^{\hat{t}} l(x,u,\overline{w}^{\ast}) \mathrm{d}\tau+ \overline{V}^{[i]}(x(\hat{t}))\}\). Then we get \(\overline{V}^{[i]}(x) = \int_{t}^{\hat{t}} l(x,\overline{u}^{\ast},\overline{w}^{\ast}) \mathrm{d}\tau+ \overline {V}^{[i]}(x(\hat{t}))\).

Thus, we have

(9.26)

Since ϵ is arbitrary, we have

$$\overline{V}^{[\infty]}(x) \geq\mathop{\inf }\limits_{u }\mathop{\sup}\limits_{w }\left\{\int_{t}^{\hat{t}} l(x,u,w) \mathrm{d}\tau+ \overline{V}^{[\infty]}(x(\hat{t})) \right\}. $$

Therefore, we obtain

$$\overline {V}^{[\infty]}(x) = \mathop{\inf }\limits_{u }\mathop{\sup}\limits_{w }\left\{\int_{t}^{\hat{t}} l(x,u,w) \mathrm{d}\tau+ \overline{V}^{[\infty]}(x(\hat{t})) \right\}. $$

Letting \(\hat{t} \rightarrow \infty\), we have

$$\overline{V}^{[\infty]}(x)=\mathop{\inf}\limits_{u}\mathop{\sup }\limits_{w} J(x,u,w)=\overline{J} (x). $$

 □

Proposition 9.11

If for ∀ i≥0, \(\mathrm{HJI}(\underline{V}^{[i]}(x),\underline{u}^{[i]}, \underline{w}^{[i]}) = 0\) holds, then the control pairs \((\underline{u}^{[i]}, \underline{w}^{[i]} )\) make the lower value function \(\underline{V}^{[i]}(x) \rightarrow \underline{J}(x) \) as i→∞.

Theorem 9.12

(cf. [15])

If the saddle point of the zero-sum differential game exists, then the control pairs \((\overline {u}^{[i]},\overline {w}^{[i]})\) and \((\underline {u}^{[i]}, \underline {w}^{[i]})\) make \(\overline{V}^{[i]} (x) \rightarrow J^{\ast}(x)\) and \(\underline{V}^{[i]} (x) \rightarrow J^{\ast}(x)\), respectively, as i→∞.

Proof

For the upper value function, according to Proposition 9.10, we have \(\overline{V}^{[i]} (x) \rightarrow\overline {J}(x)\) under the control pairs \((\overline {u}^{[i]},\overline {w}^{[i]})\) as i→∞. So the optimal control pair for the upper value function satisfies \(\overline{J}(x) = J(x,\overline{u} , \overline{w}) = \inf_{u } \sup_{w} J(x,u,w)\).

On the other hand, there exists an optimal control pair \((u^\ast,w^\ast)\) making the value function reach the saddle point. According to the property of the saddle point, the optimal control pair \((u^\ast,w^\ast)\) satisfies \(J^\ast(x)=J(x,u^\ast,w^\ast)=\inf_{u}\sup_{w} J(x,u,w)\).

So, we have \(\overline{V}^{[i]}(x) \rightarrow J^{\ast}(x)\) under the control pair \((\overline{u}^{[i]}, \overline{w}^{[i]})\) as i→∞. Similarly, we can derive \(\underline{V}^{[i]}(x) \rightarrow J^{\ast}(x)\) under the control pairs \((\underline {u}^{[i]}, \underline {w}^{[i]})\) as i→∞. □

Remark 9.13

From the proofs we see that the complex existence conditions of the saddle point in [1, 2] are not necessary. If the saddle point exists, the iterative value functions can converge to the saddle point using the present iterative ADP algorithm.

In the following, we show that when the saddle point does not exist, the mixed optimal solution can still be obtained effectively using the iterative ADP algorithm.

Proposition 9.14

If \(\overline{u}\in\mathbb{R}^{k}\), \(w^{[i]}\in\mathbb{R}^{m}\), the utility function is \(\tilde{l}(x, w^{[i]}) = l(x,\overline{u}, w^{[i]})- l^{o}(x,\overline{u},\overline{w},\underline{u},\underline{w})\), and \(w^{[i]}\) is given by (9.14), then the control pairs \((\overline{u} , w^{[i]})\) make the system (9.1) asymptotically stable.

Proposition 9.15

If \(\overline{u}\in\mathbb{R}^{k}\), \(w^{[i]}\in\mathbb{R}^{m}\), and for ∀ t the utility function \(\tilde{l}(x, w^{[i]}) \geq0\), then the control pairs \((\overline{u} , w^{[i]})\) make \(\widetilde{V}^{[i]}(x)\) a nonincreasing convergent sequence as i→∞.

Proposition 9.16

If \(\overline{u}\in\mathbb{R}^{k}\), \(w^{[i]}\in\mathbb{R}^{m}\), and for ∀ t the utility function \(\tilde{l}(x, w^{[i]})< 0\), then the control pairs \((\overline {u} , w^{[i]})\) make \(\widetilde{V}^{[i]}(x)\) a nondecreasing convergent sequence as i→∞.

Theorem 9.17

(cf. [15])

If \(\overline {u}\in\mathbb{R}^{k}\), \(w^{[i]}\in\mathbb{R}^{m}\), and \(\tilde{l} (x,w^{[i]})\) is the utility function, then the control pairs \((\overline{u} , w^{[i]})\) make \(\widetilde{V}^{[i]}(x) \) convergent as i→∞.

Proof

For the time sequence \(t_0<t_1<t_2<\cdots<t_m<t_{m+1}<\cdots\), without loss of generality, we suppose \(\tilde{l} (x,w^{[i]}) \geq0\) on \([t_{2n},t_{2n+1})\) and \(\tilde{l} (x,w^{[i]}) < 0\) on \([t_{2n+1},t_{2(n+1)})\), where n=0,1,….

For \(t\in[t_{2n},t_{2n+1})\) we have \(\tilde{l} (x,w^{[i]})\geq0\) and \(\int_{t_{2n}}^{t_{2n+1}} \tilde{l} (x,w^{[i]}) \mathrm{d}t \geq0\). According to Proposition 9.15, we have \(\widetilde {V}^{[i+1]}(x) \leq\widetilde{V}^{[i]}(x) \). For \(t\in[t_{2n+1},t_{2(n+1)})\) we have \(\tilde{l} (x,w^{[i]}) < 0\) and \(\int_{t_{2n+1}}^{t_{2(n+1)}} \tilde{l} (x,w^{[i]}) \mathrm{d}t < 0\). According to Proposition 9.16, we have \(\widetilde{V}^{[i+1]}(x) > \widetilde{V}^{[i]}(x) \). Then, for ∀ t 0, we have

(9.27)

So, \(\widetilde{{V}}^{[i]}(x) \) is convergent as i→∞. □

Theorem 9.18

(cf. [15])

If \(\overline{u}\in \mathbb{R}^{k}\), \(w^{[i]}\in\mathbb{R}^{m}\), and \(\tilde{l} (x,w^{[i]}) \) is the utility function, then the control pairs \((\overline{u} , w^{[i]})\) make \(\mathcal{V}^{[i]}(x) \rightarrow J^{o}(x) \) as i→∞.

Proof

The proof is by contradiction. Suppose that the control pair \((\overline{u},w^{[i]})\) makes the value function \(\mathcal {V}^{[i]}(x)\) converge to \(\mathcal{J}'(x)\) with \(\mathcal {J}'(x)\neq J^{o}(x)\).

According to Theorem 9.17, based on the principle of optimality, as i→∞ we have the HJB equation \(\mathrm{HJB}(\widetilde{J}(x),w) =0\).

From the assumptions we know that \(|\mathcal{V}^{[i]}(x)-J^{o}(x)|\neq0\) as i→∞. From Theorem 9.2, we know that there exists a control pair \((\overline{u},w')\) that makes \(J(x,\overline{u}, w')=J^{o}(x)\), which minimizes the performance index function \(\widetilde{J}(x)\). According to the principle of optimality, we also have the HJB equation \(\mathrm{HJB}(\widetilde{J}(x),w') =0\).

Thus w and w′ satisfy the same HJB equation and hence yield the same value, i.e., \(\mathcal{J}'(x)= J(x,\overline{u}, w')=J^{o}(x)\), which is a contradiction. So the assumption does not hold, and we have \(\mathcal{V}^{[i]}(x) \rightarrow J^{o}(x) \) as i→∞. □

Remark 9.19

For the situation where the saddle point does not exist, the methods in [1, 2] are all invalid. Using our iterative ADP method, the iterative value function reaches the mixed optimal value function J o(x) under the deterministic control pair. Therefore, we emphasize that the present iterative ADP method is more effective.

9.2.3 Simulations

Example 9.20

The dynamics of the benchmark nonlinear plant can be expressed by system (9.1) where

(9.28)

and ε=0.2. The initial state is given as \(x(0)=[1,1,1,1]^\mathrm{T}\). The cost functional is defined by (9.2), where the utility function is \(l(x,u,w)=x_{1}^{2}+0.1x_{2}^{2}+0.1x_{3}^{2}+0.1x_{4}^{2}+\|u\|^{2}-\gamma^{2}\|w\|^{2} \) with \(\gamma^{2}=10\).

Any differentiable structure can be used to implement the iterative ADP method. To facilitate the implementation of the algorithm, we choose three-layer neural networks as the critic networks with the structure 4–8–1. The action networks for u and w have the structures 4–8–1 and 5–8–1 for the upper value function, and 5–8–1 and 4–8–1 for the lower one. The initial weights are all chosen randomly in [−0.1, 0.1]. Then, for each i, the critic network and the action networks are trained for 1000 time steps so that the given accuracy \(\zeta=10^{-6}\) is reached. The learning rate is η=0.01. The iterative ADP algorithm runs for 70 iterations, and the convergence trajectories of the value functions are shown in Fig. 9.1. We can see that the saddle point of the game exists. Then, we apply the controller to the benchmark system and run it for \(T_f=60\) seconds. The optimal control trajectories are shown in Fig. 9.2, and the corresponding state trajectories are shown in Figs. 9.3 and 9.4, respectively.
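As a hedged sketch of the function-approximation structure described above (the tanh activation and the random weights in [−0.1, 0.1] are assumptions for illustration; the chapter does not prescribe a specific activation), the following numpy code builds a 4–8–1 critic network \(\overline V(x)\) and the state gradient \(\overline V_x\) needed when forming the controls (9.8)–(9.9).

```python
import numpy as np

rng = np.random.default_rng(0)

# 4-8-1 critic network; weights initialized randomly in [-0.1, 0.1].
W1 = rng.uniform(-0.1, 0.1, size=(8, 4))
b1 = rng.uniform(-0.1, 0.1, size=8)
W2 = rng.uniform(-0.1, 0.1, size=8)

def critic(x):
    """Approximate value function V(x) for the 4-dimensional benchmark state."""
    h = np.tanh(W1 @ x + b1)
    return W2 @ h

def critic_gradient(x):
    """dV/dx, the gradient required by the iterative control updates (9.8)-(9.9)."""
    h = np.tanh(W1 @ x + b1)
    return W1.T @ ((1.0 - h ** 2) * W2)

x0 = np.array([1.0, 1.0, 1.0, 1.0])
print("V(x0)     =", critic(x0))
print("dV/dx(x0) =", critic_gradient(x0))
```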

Fig. 9.1 Trajectories of the upper and lower value functions

Fig. 9.2 Trajectories of the controls

Fig. 9.3 Trajectories of states x 1 and x 3

Fig. 9.4 Trajectories of states x 2 and x 4

Remark 9.21

The simulation results illustrate the effectiveness of the present iterative ADP algorithm. If the saddle point exists, the iterative control pairs \((\overline{u}^{[i]},\overline{w}^{[i]})\) and \((\underline{u}^{[i]},\underline{w}^{[i]})\) can make the iterative value functions reach the saddle point, while the existence conditions of the saddle point are avoided.

Example 9.22

In this example, we just change the utility function to

and all other conditions are the same as in Example 9.20. We obtain \(\overline{J}(x(0))= 0.65297\) and \(\underline{J}(x(0))=0.44713\), with trajectories shown in Figs. 9.5(a) and (b), respectively. Obviously, the saddle point does not exist, so the method in [1] is invalid. Using the present mixed trajectory method, we choose the Gaussian noises \(\gamma_u \sim N(0, 0.05^2)\) and \(\gamma_w \sim N(0, 0.05^2)\) and let N=5000. The value function trajectories are shown in Fig. 9.5(c). We then obtain the mixed optimal value \(J^o(x(0))=0.55235\) and thus α=0.5936. Regulating the control w, we obtain the trajectory of the mixed optimal value function displayed in Fig. 9.5(d). The state trajectories are shown in Figs. 9.6 and 9.7, respectively, and the corresponding control trajectories are shown in Figs. 9.8 and 9.9, respectively.

Fig. 9.5 Performance index function trajectories. (a) Trajectory of the upper value function. (b) Trajectory of the lower value function. (c) Performance index functions with disturbances. (d) Trajectory of the mixed optimal performance index function

Fig. 9.6 Trajectories of states x 1 and x 3

Fig. 9.7 Trajectories of states x 2 and x 4

Fig. 9.8 Trajectory of control u

Fig. 9.9 Trajectory of control w

9.3 Finite Horizon Zero-Sum Games for a Class of Nonlinear Systems

In this section, a new iterative approach is derived to obtain the optimal policies of finite horizon quadratic zero-sum games for a class of continuous-time nonaffine nonlinear systems. Through an iteration between two sequences, namely a sequence of state trajectories of linear quadratic zero-sum games and a sequence of corresponding Riccati differential equations, the optimal policies for nonaffine nonlinear zero-sum games are obtained. Under very mild conditions of local Lipschitz continuity, the convergence of the approximating linear time-varying sequences is proved.

9.3.1 Problem Formulation

Consider a continuous-time nonaffine nonlinear zero-sum game described by the state equation

(9.29)

with the finite horizon cost functional

(9.30)

where x(t)∈ℝn is the state, x(t 0)∈ℝn is the initial state, t f is the terminal time, the control input u(t) takes values in a convex and compact set \(U\subset\mathbb{R}^{m_{1}}\), and w(t) takes values in a convex and compact set \(W\subset\mathbb{R}^{m_{2}}\). The input u(t) seeks to minimize the cost functional J(x 0,u,w), while w(t) seeks to maximize it. The state-dependent weight matrices F(x(t)), Q(x(t)), R(x(t)), and S(x(t)) have suitable dimensions with F(x(t))≥0, Q(x(t))≥0, R(x(t))>0, and S(x(t))>0. In this section, x(t), u(t), and w(t) are sometimes written as x, u, and w for brevity. Our objective is to find the optimal policies for the above nonaffine nonlinear zero-sum games.

In the nonaffine nonlinear zero-sum game problem, the nonlinear system functions are implicit in the control inputs, and it is very hard to obtain the optimal policies satisfying (9.29) and (9.30). For practical purposes, one may just as well be interested in finding a near-optimal or approximate optimal policy. Therefore, we present an iterative algorithm to deal with this problem: the nonaffine nonlinear zero-sum game is transformed into an equivalent sequence of linear quadratic zero-sum games, to which linear quadratic zero-sum game theory applies directly.

9.3.2 Finite Horizon Optimal Control of Nonaffine Nonlinear Zero-Sum Games

Using a factored form to represent the system (9.29), we get

(9.31)

where f:ℝn→ℝn×n is a nonlinear matrix-valued function of x, \(g\colon\mathbb{R}^{n} \times \mathbb{R}^{m_{1}} \rightarrow\mathbb{R}^{n\times{m_{1}}}\) is a nonlinear matrix-valued function of both the state x and control input u, and \(k\colon\mathbb{R}^{n} \times\mathbb{R}^{m_{2}} \rightarrow \mathbb{R}^{n\times{m_{2}}}\) is a nonlinear matrix-valued function of both the state x and control input w.

We use the following sequence of linear time-varying differential equations to approximate the state equation (9.31):

(9.32)

with the corresponding cost functional

(9.33)

where the superscript i represents the iteration index. For the first approximation, i=0, we assume that the initial values x i−1(t)=x 0, u i−1(t)=0, and w i−1(t)=0. Obviously, for the ith iteration, f(x i−1(t)), g(x i−1(t),u i−1(t)), k(x i−1(t),w i−1(t)), F(x i−1(t f )), Q(x i−1(t)), R(x i−1(t)), and S(x i−1(t)) are time-varying functions which do not depend on x i (t), u i (t), and w i (t). Hence, each approximation problem in (9.32) and (9.33) is a linear quadratic zero-sum game problem which can be solved by the existing classical linear quadratic zero-sum game theory.

The corresponding Riccati differential equation of each linear quadratic zero-sum game can be expressed as

(9.34)

where P i ∈ℝn×n is a real, symmetric and nonnegative definite matrix.

Assumption 9.23

It is assumed that \(S(x_{i-1}(t))>\hat {S}\,_{i}\), where the threshold value \(\hat {S}\,_{i}\) is defined as \(\hat {S}\,_{i}= \mathrm{inf} \{S_{i}(t)>0,\ \mbox{and (9.34) does not have a conjugate point on} [0,t_{f}]\}\).

If Assumption 9.23 is satisfied, the game admits the optimal policies given by

(9.35)

where x i (t) is the corresponding optimal state trajectory, generated by

(9.36)

By iterating between the sequences (9.34) and (9.36) sequentially, the limit of the solution of the approximating sequence (9.32) converges to the unique solution of system (9.29), and the sequences of optimal policies (9.35) converge as well. The convergence of the iterative algorithm will be analyzed in the next section. Notice that the factored form in (9.31) need not be unique; the approximating linear time-varying sequences will converge regardless of the chosen representation of f(x(t)), g(x(t),u(t)), and k(x(t),w(t)).

Remark 9.24

For the fixed finite interval [t 0,t f ], if \(S(x_{i-1}(t))>\hat{S}\,_{i}\), the Riccati differential equation (9.34) does not have a conjugate point on [t 0,t f ], which means that V i (x 0,u,w) is strictly concave in w. On the other hand, since V i (x 0,u,w) is quadratic and R(t)>0, F(t)≥0, Q(t)≥0, it follows that V i (x 0,u,w) is strictly convex in u. Hence, for the linear quadratic zero-sum game (9.32) with the performance index function (9.33), there exists a unique saddle point, which yields the optimal policies.
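Since (9.30) and (9.34)–(9.36) are not reproduced above, the following sketch assumes the standard soft-constrained linear-quadratic zero-sum form: running cost \(x^\mathrm{T}Qx+u^\mathrm{T}Ru-w^\mathrm{T}Sw\), Riccati equation \(-\dot P = f^\mathrm{T}P+Pf+Q-P(gR^{-1}g^\mathrm{T}-kS^{-1}k^\mathrm{T})P\) with \(P(t_f)=F\), and policies \(u=-R^{-1}g^\mathrm{T}Px\), \(w=S^{-1}k^\mathrm{T}Px\). Under that assumption, one frozen-coefficient approximating game of the kind in (9.32)–(9.33) can be solved as below; the coefficient matrices are arbitrary placeholders and scipy is assumed to be available.

```python
import numpy as np
from scipy.integrate import solve_ivp

# Frozen (time-invariant) coefficients of one approximating LQ game; hypothetical values.
A  = np.array([[0.0, 1.0], [-1.0, -0.5]])   # stands in for f(x_{i-1}(t))
B1 = np.array([[0.0], [1.0]])               # stands in for g(x_{i-1}(t), u_{i-1}(t))
B2 = np.array([[0.0], [0.5]])               # stands in for k(x_{i-1}(t), w_{i-1}(t))
Q, F = 0.01 * np.eye(2), 0.01 * np.eye(2)
R, S = np.array([[1.0]]), np.array([[1.0]])
t0, tf = 0.0, 5.0

def riccati_rhs(t, p_flat):
    # dP/dt = -(A'P + PA + Q - P (B1 R^{-1} B1' - B2 S^{-1} B2') P)
    P = p_flat.reshape(2, 2)
    dP = -(A.T @ P + P @ A + Q
           - P @ (B1 @ np.linalg.solve(R, B1.T)
                  - B2 @ np.linalg.solve(S, B2.T)) @ P)
    return dP.ravel()

# Integrate the Riccati equation backward from the terminal condition P(tf) = F.
sol_P = solve_ivp(riccati_rhs, [tf, t0], F.ravel(), dense_output=True, rtol=1e-8)

def policies(t, x):
    P = sol_P.sol(t).reshape(2, 2)
    u = -np.linalg.solve(R, B1.T @ P @ x)    # minimizing player
    w =  np.linalg.solve(S, B2.T @ P @ x)    # maximizing player
    return u, w

def closed_loop(t, x):
    u, w = policies(t, x)
    return A @ x + B1 @ u + B2 @ w

x0 = np.array([0.6, 0.0])
sol_x = solve_ivp(closed_loop, [t0, tf], x0, dense_output=True, rtol=1e-8)
print("terminal state of this approximating game:", sol_x.y[:, -1])
```

In the full procedure summarized after Theorem 9.27, the matrices above would be re-evaluated along the previous iterate \(x_{i-1}(t)\), making them time varying, and the loop would repeat until \(\|x_i(t)-x_{i-1}(t)\|\) falls below the prescribed accuracy.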

The convergence of the algorithm described above requires the following:

  1. 1.

    The sequence {x i (t)} converges on C([t 0,t f ];ℝn), which means that the limit of the solution of approximating sequence (9.32) converges to the unique solution of system (9.29).

  2. 2.

The sequences of optimal policies {u i (t)} and {w i (t)} converge on \(C([t_{0},t_{f}];\mathbb{R}^{m_{1}})\) and \(C([t_{0},t_{f}];\mathbb{R}^{m_{2}})\), respectively.

For simplicity, the approximating sequence (9.32) is rewritten as

(9.37)

where

The optimal policies for zero-sum games are rewritten as

(9.38)

where

Assumption 9.25

g(x,u), k(x,w), R −1(x), S −1(x), F(x) and Q(x) are bounded and Lipschitz continuous in their arguments x, u, and w, thus satisfying:

  1. (C1)

    ∥g(x,u)∥≤b, ∥k(x,w)∥≤e

  2. (C2)

    ∥R^{−1}(x)∥≤r, ∥S^{−1}(x)∥≤s

  3. (C3)

    ∥F(x)∥≤f, ∥Q(x)∥≤q

for ∀x∈ℝn, \(\forall u\in \mathbb{R}^{m_{1}}\), \(\forall w\in\mathbb{R}^{m_{2}}\), and for finite positive numbers b, e, r, s, f, and q.

Define Φ i−1(t,t 0) as the transition matrix generated by f i−1(t). It is well known that

(9.39)

where μ(f) is the measure of matrix f, \(\mu(f)= \lim_{h \to0+ }\frac{\parallel I+hf \parallel-1}{h}\). We use the following lemma to get an estimate for Φ i−1(t,t 0)−Φ i−2(t,t 0).
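For the induced 2-norm, the matrix measure used in (9.39) has the closed form \(\mu_2(f)=\lambda_{\max}\big((f+f^\mathrm{T})/2\big)\). The following small sketch, with an arbitrary placeholder matrix, compares this closed form with the defining limit quoted above.

```python
import numpy as np

def mu2(f):
    """Matrix measure induced by the 2-norm: largest eigenvalue of (f + f^T)/2."""
    return np.linalg.eigvalsh((f + f.T) / 2).max()

def mu_limit(f, h=1e-7):
    """Finite-difference approximation of the defining limit (||I + h f|| - 1)/h."""
    n = f.shape[0]
    return (np.linalg.norm(np.eye(n) + h * f, 2) - 1.0) / h

f = np.array([[0.0, 2.0], [-1.0, -3.0]])   # placeholder for a frozen f_{i-1}(t)
print("mu_2(f), closed form:", mu2(f))
print("mu_2(f), via limit  :", mu_limit(f))
```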

The following lemma is relevant for the solution of the Riccati differential equation (9.34), which is the basis for proving the convergence.

Lemma 9.26

Let Assumption 9.25 hold; the solution of the Riccati differential equation (9.34) satisfies:

  1. 1.

    P i (t) is Lipschitz continuous.

  2. 2.

    P i (t) is bounded, if the linear time-varying system (9.32) is controllable.

Proof

First, let us prove that P i (t) is Lipschitz continuous. We transform (9.34) into the form of a matrix differential equation:

where

Thus, the solution P i (t) of the Riccati differential equations (9.34) becomes

(9.40)

If Assumption 9.25 is satisfied, such that f(x), g(x,u), k(x,w), R −1(x), S −1(x), F(x), and Q(x) are Lipschitz continuous, then X i (t) and λ i (t) are Lipschitz continuous. Furthermore, it is easy to verify that (X i (t))−1 also satisfies the Lipschitz condition. Hence, P i (t) is Lipschitz continuous.

Next, we prove that P i (t) is bounded.

If the linear time varying system (9.32) is controllable, there must exist \(\hat{u}_{i}(t), \hat{w}_{i}(t)\) such that x(t 1)=0 at t=t 1. We define \(\bar{u}_{i}(t), \bar{w}_{i}(t)\) as

where \(\hat{u}_{i}(t)\) is any control policy making x(t 1)=0 and \(\hat{w}_{i}(t)\) is the optimal policy. For t≥t 1, we let \(\bar {u}_{i}(t)\) and \(\bar {w}_{i}(t)\) be 0, so the state x(t) will remain at 0.

The optimal cost functional \(V^{*}_{i}(x_{0},u,w)\) is described as

(9.41)

where \(u^{*}_{i}(t)\) and \(w^{*}_{i}(t)\) are the optimal policies. \(V^{*}_{i}(x_{0},u,w)\) is minimized by \(u^{*}_{i}(t)\) and maximized by \(w^{*}_{i}(t)\).

For the linear system, \(V^{*}_{i}(x_{0},u,w)\) can be expressed as \(V^{*}_{i}(x_{0},u,w)= \frac{1}{2} x^{\mathrm{T}}_{i}(t)P_{i}(t)x_{i}(t)\). Since x i (t) is arbitrary, if \(V^{*}_{i}(x_{0},u,w)\) is bounded, then P i (t) is bounded. Next, we discuss the boundedness of \(V^{*}_{i}(x_{0},u,w)\) in two cases:

Case 1:

t 1<t f ; we have

(9.42)
Case 2:

t 1≥t f ; we have

(9.43)

From (9.42) and (9.43), we know that \(V_{i}^{*}(x)\) has an upper bound, independent of t f . Hence, P i (t) is bounded. □

According to Lemma 9.26, P i (t) is bounded and Lipschitz continuous. If Assumption 9.25 is satisfied, then M(x,u), N(x,w), G(x,u), and K(x,w) are bounded and Lipschitz continuous in their arguments, thus satisfying:

  1. (C4)

    ∥M(x,u)∥≤δ 1, ∥N(x,w)∥≤σ 1,

  2. (C5)

    ∥M(x 1,u 1)−M(x 2,u 2)∥≤δ 2∥x 1−x 2∥+δ 3∥u 1−u 2∥, ∥N(x 1,w 1)−N(x 2,w 2)∥≤σ 2∥x 1−x 2∥+σ 3∥w 1−w 2∥,

  3. (C6)

    ∥G(x,u)∥≤ζ 1, ∥K(x,w)∥≤ξ 1,

  4. (C7)

    ∥G(x 1,u 1)−G(x 2,u 2)∥≤ζ 2∥x 1−x 2∥+ζ 3∥u 1−u 2∥, ∥K(x 1,w 1)−K(x 2,w 2)∥≤ξ 2∥x 1−x 2∥+ξ 3∥w 1−w 2∥,

∀x∈ℝn, \(\forall u\in \mathbb{R}^{m_{1}}\), \(\forall w\in\mathbb{R}^{m_{2}}\), and for finite positive numbers δ j , σ j , ζ j , ξ j , j=1,2,3.

Theorem 9.27

(cf. [16])

Consider the system (9.29) of nonaffine nonlinear zero-sum games with the cost functional (9.30), for which the approximating sequences (9.32) and (9.33) are introduced. We have F(x(t))≥0, Q(x(t))≥0, R(x(t))>0, and the terminal time t f is specified. Let Assumption 9.25 and Assumptions (A1) and (A2) hold, and let \(S(x(t))> \tilde{S}\). Then, for small enough t f or x 0, the limit of the solution of the approximating sequence (9.32) converges to the unique solution of system (9.29) on C([t 0,t f ];ℝn). Meanwhile, the approximating sequences of optimal policies given by (9.35) also converge on \(C([t_{0},t_{f}];\mathbb{R}^{m_{1}})\) and \(C([t_{0},t_{f}];\mathbb{R}^{m_{2}})\), if

(9.44)

where

Proof

The approximating sequence (9.37) is a nonhomogeneous differential equation, whose solution can be given by

(9.45)

Then,

(9.46)

According to inequality (9.39) and assuming (C6) to hold, we obtain

(9.47)

On the basis of Gronwall–Bellman’s inequality

(9.48)

which is bounded for a small time interval [t 0,t f ] or a small x 0.

From (9.45) we have

(9.49)

Take the supremum on both sides of (9.49) and let

By using (9.39), (C6), and (C7), we get

(9.50)

Combining similar terms, we have

(9.51)

where ψ 1(t) through ψ 3(t) are described in (9.44).

Similarly, from (9.38), we get

(9.52)

According to (C4), (C5), and (9.48), we have

(9.53)

where ψ 4(t) through ψ 9(t) are shown in (9.44).

Then, combining (9.51) and (9.53), we have

(9.54)

where and .

By induction, Θ i satisfies

(9.55)

which implies that {x i (t)}, {u i (t)}, and {w i (t)} are Cauchy sequences in the Banach spaces C([t 0,t f ];ℝn), \(C([t_{0},t_{f}];\mathbb{R}^{m_{1}})\), and \(C([t_{0},t_{f}];\mathbb{R}^{m_{2}})\), respectively. Hence, {x i (t)} converges on C([t 0,t f ];ℝn), and the sequences of optimal policies {u i } and {w i } converge on \(C([t_{0},t_{f}];\mathbb{R}^{m_{1}})\) and \(C([t_{0},t_{f}];\mathbb{R}^{m_{2}})\) on [t 0,t f ].

It means that x i−1(t)=x i (t), u i−1(t)=u i (t), w i−1(t)=w i (t) when i→∞. Hence, the system (9.29) has a unique solution on [t 0,t f ], which is given by the limit of the solution of approximating sequence (9.32). □

Based on the iterative algorithm described in Theorem 9.27, the design procedure of optimal policies for nonlinear nonaffine zero-sum games is summarized as follows:

  1. 1.

    Give x 0, maximum iteration times i max and approximation accuracy ε.

  2. 2.

    Use a factored form to represent the system as (9.31).

  3. 3.

    Set i=0. Let x i−1(t)=x 0, u i−1(t)=0 and w i−1(t)=0. Compute the corresponding matrix-valued functions f(x 0), g(x 0,0), k(x 0,0), F(x 0), Q(x 0), R(x 0), and S(x 0).

  4. 4.

    Compute x [0](t) and P [0](t) according to differential equations (9.34) and (9.36) with x(t 0)=x 0, P(t f )=F(x f ).

  5. 5.

    Set i=i+1. Compute the corresponding matrix-valued functions f(x i−1(t)), g(x i−1(t),u i−1(t)), k(x i−1(t),w i−1(t)), Q(x i−1(t)), R(x i−1(t)), F(x i−1(t f )), and S(x i−1(t)).

  6. 6.

    Compute x i (t) and P i (t) by (9.34) and (9.36) with x(t 0)=x 0, P(t f )=F(x i−1(t f )).

  7. 7.

    If ∥x i (t)−x i−1(t)∥<ε, go to Step 9; otherwise, go to Step 8.

  8. 8.

    If i>i max, then go to Step 9; else, go to Step 5.

  9. 9.

    Stop.

9.3.3 Simulations

Example 9.28

We now show the power of our iterative algorithm for finding optimal policies for nonaffine nonlinear zero-sum games.

In the following, we introduce an example of a control system that has the form (9.29), with control input u(t), subject to a disturbance w(t), and with a cost functional V(x 0,u,w). The control input u(t) is required to minimize the cost functional V(x 0,u,w), while the disturbance w(t), which may have a great effect on the system, is taken to maximize it. This conflicting design guarantees the optimality and strong robustness of the system at the same time. The resulting zero-sum game problem can be described by the state equations

(9.56)

Define the finite horizon cost functional to be of the form (9.30), where F=0.01 I 2×2, Q=0.01 I 2×2, R=1, and S=1, with I an identity matrix. Clearly, (9.56) is not affine in u(t) and w(t); it has a nonaffine nonlinear control structure. Therefore, we represent the system (9.56) in the factored form f(x(t))x(t), g(x(t),u(t))u(t), and k(x(t),w(t))w(t), which, given the wide selection of possible representations, have been chosen as

(9.57)

The optimal policy design given by Theorem 9.27 can now be applied to (9.31) with the dynamics (9.57).

The initial state vector is chosen as x 0=[0.6,0]T and the terminal time is set to t f =5. Let us define the required error norm between the solutions of the linear time-varying differential equations by ∥x i (t)−x i−1(t)∥<ε=0.005, which needs to be satisfied if convergence is to be achieved. The factorization is given by (9.57). Implementing the present iterative algorithm, only six iterations are needed to satisfy the required bound, with ∥x [6](t)−x [5](t)∥=0.0032. As the number of iterations increases, the approximation error decreases markedly; at iteration i=25, the approximation error is just 5.1205×10−10.

Set the maximum number of iterations to i max=25. Figure 9.10 shows the state trajectory of each approximating linear quadratic zero-sum game. It can be seen that the sequence is clearly convergent. Magnified views of the state trajectories are given in the figure, showing that the error becomes smaller as the number of iterations grows. The trajectories of the control input u(t) and the disturbance input w(t) of each iteration are also convergent, as shown in Figs. 9.11 and 9.12. The approximate optimal policies u ∗(t) and w ∗(t) are obtained at the last iteration. Substituting the approximate optimal policies u ∗(t) and w ∗(t) into the system of zero-sum games (9.56), we get the state trajectory. The norm of the error between this state trajectory and the state trajectory of the last iteration is just 0.0019, which shows that the approximating iterative approach developed in this section is highly effective.

Fig. 9.10 The state trajectory x 1(t) of each iteration

Fig. 9.11 The trajectory u(t) of each iteration

Fig. 9.12 The trajectory w(t) of each iteration

9.4 Non-Zero-Sum Games for a Class of Nonlinear Systems Based on ADP

In this section, a near-optimal control scheme is developed for the non-zero-sum differential games of continuous-time nonlinear systems. Single-network ADP is utilized to obtain the optimal control policies that make the cost functions reach the Nash equilibrium of the non-zero-sum differential games, where only one critic network is used for each player instead of the action–critic dual network used in a typical ADP architecture. Furthermore, novel weight tuning laws for the critic neural networks are developed, which not only ensure that the Nash equilibrium is reached but also guarantee the stability of the system. No initial stabilizing control policy is required for each player. Moreover, Lyapunov theory is utilized to demonstrate the uniform ultimate boundedness of the closed-loop system.

9.4.1 Problem Formulation of Non-Zero-Sum Games

Consider the following continuous-time nonlinear systems:

$$ \dot {x}(t)=f(x(t))+g(x(t))u(t)+k(x(t))w(t), $$
(9.58)

where x(t)∈ℝn is the state vector, and u(t)∈ℝm and w(t)∈ℝq are the control input vectors. Assume that f(0)=0 and that f(x), g(x), and k(x) are locally Lipschitz.

The cost functional associated with u is defined as

$$ J_1(x,u,w)=\int_t^\infty r_1(x(\tau),u(\tau),w(\tau))\mathrm {d}\tau, $$
(9.59)

where r 1(x,u,w)=Q 1(x)+u T R 11 u+w T R 12 w, Q 1(x)≥0 is the penalty on the states, R 11∈ℝm×m is a positive definite matrix, and R 12∈ℝq×q is a positive semidefinite matrix.

The cost functional associated with w is defined as

$$ J_2(x,u,w)=\int_t^\infty r_2(x(\tau),u(\tau),w(\tau))\mathrm {d}\tau, $$
(9.60)

where r 2(x,u,w)=Q 2(x)+u T R 21 u+w T R 22 w, Q 2(x)≥0 is the penalty on the states, R 21∈ℝm×m is a positive semidefinite matrix, and R 22∈ℝq×q is a positive definite matrix.

For the above non-zero-sum differential games, the two feedback control policies u and w are chosen by player 1 and player 2, respectively, where player 1 tries to minimize the cost functional (9.59), while player 2 attempts to minimize the cost functional (9.60).

Definition 9.29

u=μ 1(x) and w=μ 2(x) are defined as admissible with respect to (9.59) and (9.60) on Ω⊆ℝn, denoted by μ 1∈ψ(Ω) and μ 2∈ψ(Ω), respectively, if μ 1(x) and μ 2(x) are continuous on Ω, μ 1(0)=0 and μ 2(0)=0, μ 1(x) and μ 2(x) stabilize (9.58) on Ω, and (9.59) and (9.60) are finite ∀x 0∈Ω.

Definition 9.30

The policy set (u ,w ) is a Nash equilibrium policy set if the inequalities

(9.61)

hold for any admissible control policies u and w.

Next, define the Hamilton functions for the cost functionals (9.59) and (9.60) with associated admissible control input u and w, respectively, as follows:

(9.62)
(9.63)

where ▽J i is the partial derivative of the cost function J i (x,u,w) with respect to x, i=1,2.

According to the stationarity conditions of optimization, we have

Therefore, the associated optimal feedback control policies \(u^\ast\) and \(w^\ast\) are found and revealed to be

(9.64)
(9.65)

The optimal feedback control policies \(u^\ast\) and \(w^\ast\) provide a Nash equilibrium for the non-zero-sum differential games among all the feedback control policies.
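Although the bodies of (9.64) and (9.65) are not reproduced above, they follow from the stationarity conditions \(\partial H_1/\partial u=0\) and \(\partial H_2/\partial w=0\) applied to (9.62) and (9.63), which for the quadratic utilities r 1 and r 2 give the standard forms \(u^\ast=-\tfrac{1}{2}R_{11}^{-1}g^\mathrm{T}(x)\triangledown J_1\) and \(w^\ast=-\tfrac{1}{2}R_{22}^{-1}k^\mathrm{T}(x)\triangledown J_2\). The numpy sketch below evaluates these assumed forms; all matrices, gains, and gradient values are placeholders, not data from this chapter.

```python
import numpy as np

# Placeholder problem data (n=2, m=1, q=1), used only to illustrate the formulas.
R11 = np.array([[1.0]])
R22 = np.array([[1.0]])
g = np.array([[0.0], [1.0]])       # g(x)
k = np.array([[0.0], [0.5]])       # k(x)

def nash_policies(grad_J1, grad_J2):
    """Stationarity-based feedback policies assumed for (9.64)-(9.65):
    u* = -1/2 R11^{-1} g(x)^T dJ1/dx,  w* = -1/2 R22^{-1} k(x)^T dJ2/dx."""
    u = -0.5 * np.linalg.solve(R11, g.T @ grad_J1)
    w = -0.5 * np.linalg.solve(R22, k.T @ grad_J2)
    return u, w

# Placeholder value-function gradients at some state x.
grad_J1 = np.array([0.8, -0.3])
grad_J2 = np.array([0.1, 0.4])
u_star, w_star = nash_policies(grad_J1, grad_J2)
print("u* =", u_star, " w* =", w_star)
```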

Considering \(H_1(x,u^\ast,w^\ast)=0\) and \(H_2(x,u^\ast,w^\ast)=0\), and substituting the optimal feedback control policies (9.64) and (9.65) into the Hamilton functions (9.62) and (9.63), we have

(9.66)
(9.67)

If the coupled HJ equations (9.66) and (9.67) can be solved for the optimal value functions \(J_1(x,u^\ast,w^\ast)\) and \(J_2(x,u^\ast,w^\ast)\), the optimal control can then be implemented by using (9.64) and (9.65). However, these equations are generally difficult or impossible to solve due to their inherently nonlinear nature. To overcome this difficulty, a near-optimal control scheme is developed to learn the solution of the coupled HJ equations online using a single-network ADP in order to obtain the optimal control policies.

Before presenting the near-optimal control scheme, the following lemma is required.

Lemma 9.31

Consider the system (9.58) with the associated cost functionals (9.59) and (9.60) and the optimal feedback control policies (9.64) and (9.65). For player i, i=1,2, let L i (x) be a continuously differentiable, radially unbounded Lyapunov candidate such that \(\dot {L}_{i}=\triangledown L_{i}^{\mathrm{T}}\dot {x}=\triangledown L_{i}^{\mathrm{T}} (f(x)+g(x)u^{\ast}+k(x)w^{\ast})<0\), with ▽L i being the partial derivative of L i (x) with respect to x. Moreover, let \(\bar {Q}_{i}(x)\in\mathbb{R}^{n\times n}\) be a positive definite matrix satisfying \(\|\bar {Q}_{i}(x)\|=0\) if and only if ∥x∥=0 and \(\bar {Q}_{i\min}\leq\|\bar {Q}_{i}(x)\|\leq\bar {Q}_{i\max}\) for χ min≤∥x∥≤χ max with positive constants \(\bar {Q}_{i\min}\), \(\bar {Q}_{i\max}\), χ min, χ max. In addition, let \(\bar{Q}_{i}(x)\) satisfy \(\lim_{x\rightarrow\infty}\bar{Q}_{i}(x)=\infty\) as well as

(9.68)

Then the following relation holds:

(9.69)

Proof

When the optimal controls \(u^\ast\) and \(w^\ast\) in (9.64) and (9.65) are applied to the nonlinear system (9.58), the value function \(J_i(x,u^\ast,w^\ast)\) becomes a Lyapunov function, i=1,2. Then, for i=1,2, differentiating the value function \(J_i(x,u^\ast,w^\ast)\) with respect to t, we have

(9.70)

Using (9.68), (9.70) can be rewritten as

(9.71)

Next, multiplying both sides of (9.71) by \(\triangledown L_{i}^{\mathrm{T}}\), (9.69) can be obtained.

This completes the proof. □

9.4.2 Optimal Control of Nonlinear Non-Zero-Sum Games Based on ADP

To begin the development, we rewrite the cost functions (9.59) and (9.60) by NNs as

(9.72)
(9.73)

where W i , ϕ i (x), and ε i are the critic NN ideal constant weights, the critic NN activation function vector and the NN approximation error for player i, i=1,2, respectively.

The derivative of the cost functions with respect to x can be derived as

(9.74)
(9.75)

where \(\triangledown\phi_i \triangleq \partial\phi_i(x)/\partial x\), \(\triangledown\varepsilon_i \triangleq \partial\varepsilon_i/\partial x\), i=1,2.

Using (9.74) and (9.75), the optimal feedback control policies (9.64) and (9.65) can be rewritten as

(9.76)
(9.77)

and the coupled HJ equations (9.66) and (9.67) can be rewritten as

(9.78)
(9.79)

where

(9.80)

The residual error due to the NN approximation for player 1 is

(9.81)

The residual error due to the NN approximation for player 2 is

(9.82)

Let \(\hat{W}_{c1}\) and \(\hat{W}_{c2}\) be the estimates of W c1 and W c2, respectively. Then we have the estimates of V 1(x) and V 2(x) as follows:

(9.83)
(9.84)

Substituting (9.83) and (9.84) into (9.64) and (9.65), respectively, the estimates of optimal control policies can be written as

(9.85)
(9.86)

Applying (9.85) and (9.86) to the system (9.58), we have the closed-loop system dynamics as follows:

(9.87)

Substituting (9.83) and (9.84) into (9.62) and (9.63), respectively, the approximate Hamilton functions can be derived as follows:

(9.88)
(9.89)

It is desired to select \(\hat{W}_{c1}\) and \(\hat{W}_{c2}\) to minimize the squared residual error \(E=e_{1}^{\mathrm{T}}e_{1}/2+e_{2}^{\mathrm{T}}e_{2}/2\). Then we have \(\hat{W}_{c1}\rightarrow W_{c1}\), \(\hat{W}_{c2}\rightarrow W_{c2}\), and \(e_{1}\rightarrow\varepsilon_{\rm HJ1}\), \(e_{2}\rightarrow\varepsilon_{\rm HJ2}\). In other words, the Nash equilibrium of the non-zero-sum differential games of the continuous-time nonlinear system (9.58) can be obtained. However, tuning the critic NN weights to minimize the squared residual error E alone does not ensure the stability of the nonlinear system (9.58) during the learning process of the critic NNs. Therefore, we propose novel weight tuning laws of the critic NNs for the two players, which can not only minimize the squared residual error E but also guarantee the stability of the system, as follows:

(9.90)
(9.91)

where \(\bar{\sigma}_{i}=\hat{\sigma}_{i}/(\hat{\sigma}_{i}^{\mathrm{T}} \hat {\sigma}_{i}+1)\), \(\hat {\sigma}_{i}=\triangledown\phi_{i}(f(x)-D_{1}\triangledown\phi_{1}^{\mathrm{T}} \hat {W}_{c1}/2-D_{2}\triangledown\phi_{2}^{\mathrm{T}}\hat{W}_{c2}/2)\), \(m_{s_{i}}=\hat{\sigma}_{i}^{\mathrm{T}} \hat{\sigma}_{i}+1\), α i >0 is the adaptive gain, ▽L i is described in Lemma 9.31, i=1,2. F 1, F 2, F 3, and F 4 are design parameters. The operator \(\varSigma(x,\hat{u},\hat{w})\) is given by

(9.92)

where \(\dot{x}\) is given as (9.87).

Remark 9.32

The first terms in (9.90) and (9.91) are utilized to minimize the squared residual error E and derived by using a normalized gradient descent algorithm. The other terms are utilized to guarantee the stability of the closed-loop system while the critic NNs learn the optimal cost functions and are derived by following Lyapunov stability analysis. The operator \(\varSigma(x,\hat {u},\hat {w})\) is selected based on the Lyapunov’s sufficient condition for stability, which means that the state x is stable if L i (x)>0 and \(\triangledown L_{i}\dot{x}<0\) for player i, i=1,2. When the system (9.58) is stable, the operator \(\varSigma(x,\hat{u},\hat{w})=0\) and it will not take effect. When the system (9.58) is unstable, the operator \(\varSigma(x,\hat{u},\hat{w})=1\) and it will be activated. Therefore, no initial stabilizing control policies are needed due to the introduction of the operator \(\varSigma(x,\hat{u},\hat{w})\).

Remark 9.33

From (9.88) and (9.89), it can be seen that the approximate Hamilton functions \(H_{1}(x,\hat{W}_{c1},\hat{W}_{c2})=e_{1}=0\) and \(H_{2}(x,\hat{W}_{c1},\hat{W}_{c2})=e_{2}=0\) when x=0. In this case, the tuning laws of the critic NN weights for the two players (9.90) and (9.91) can no longer achieve the purpose of optimization. This can be considered as a persistency of excitation requirement on the system states. Therefore, the system states must be persistently excited enough that minimizing the squared residual error E drives the critic NN weights toward their ideal values. In order to satisfy the persistent excitation condition, probing noise is added to the control input.
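One common way to meet this persistent excitation requirement, shown here only as a hedged illustration (the chapter does not specify the exact probing signal), is to superimpose a small, decaying sum of sinusoids on the control input during the learning phase.

```python
import numpy as np

def probing_noise(t, amplitude=0.1, decay=0.01, freqs=(1.0, 3.0, 7.0, 11.0)):
    """Small exponentially decaying multi-sine signal added to the control input
    so that the states stay persistently excited while the critic weights are
    tuned (illustrative choice of amplitude, decay, and frequencies)."""
    return amplitude * np.exp(-decay * t) * sum(np.sin(f * t) for f in freqs)

# Example: perturb a nominal control value during learning.
t = 2.5
u_hat = -0.4                        # placeholder for the current estimate from (9.85)
u_applied = u_hat + probing_noise(t)
print("applied control:", u_applied)
```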

Define the weight estimation errors of critic NNs for two players to be \(\tilde{W}_{c1}=W_{c1}-\hat{W}_{c1}\) and \(\tilde{W}_{c2}=W_{c2}-\hat{W}_{c2}\), respectively. From (9.78) and (9.79), we observe that

(9.93)
(9.94)

Combining (9.90) with (9.93), we have

(9.95)

Similarly, combining (9.91) with (9.94), we have

(9.96)

In the following, the stability analysis will be performed. First, the following assumption is made, which can reasonably be satisfied under the current problem settings.

Assumption 9.34

  1. (a)

    g(⋅) and k(⋅) are upper bounded, i.e., ∥g(⋅)∥≤g M and ∥k(⋅)∥≤k M with g M and k M being positive constants.

  2. (b)

    The critic NN approximation errors and their gradients are upper bounded so that ∥ε i ∥≤ε iM and ∥▽ε i ∥≤ε idM with ε iM and ε idM being positive constants, i=1,2.

  3. (c)

    The critic NN activation function vectors are upper bounded, so that ∥ϕ i ∥≤ϕ iM and ∥▽ϕ i ∥≤ϕ idM , with ϕ iM and ϕ idM being positive constants, i=1,2.

  4. (d)

    The critic NN weights are upper bounded so that ∥W i ∥≤W iM with W iM being positive constant, i=1,2. The residual errors \(\varepsilon_{\rm HJi}\) are upper bounded, so that \(\|\varepsilon_{\rm HJi}\|\leq\varepsilon_{{\rm HJ}iM}\) with \(\varepsilon_{{\rm HJ}iM}\) being positive constant, i=1,2.

Now we are ready to prove the following theorem.

Theorem 9.35

(cf. [17])

Consider the system given by (9.58). Let the control inputs be provided by (9.85) and (9.86), and let the critic NN weight tuning laws be given by (9.90) and (9.91). Then, the system state x and the weight estimation errors of the critic NNs \(\tilde {W}_{c1}\) and \(\tilde {W}_{c2}\) are uniformly ultimately bounded (UUB). Furthermore, the obtained control inputs \(\hat {u}\) and \(\hat {w}\) in (9.85) and (9.86) converge approximately to the Nash equilibrium policies of the non-zero-sum differential games, i.e., \(\hat {u}\) and \(\hat {w}\) are close to the optimal control inputs \(u^\ast\) and \(w^\ast\) within bounds ϵ u and ϵ w , respectively.

Proof

Choose the following Lyapunov function candidate:

(9.97)

where L 1(x) and L 2(x) are given by Lemma 9.31.

The derivative of the Lyapunov function candidate (9.97) along the system (9.87) is computed as

(9.98)

Then, substituting (9.95) and (9.96) into (9.98), we have

(9.99)

In (9.99), the last two terms can be rewritten as

(9.100)

Define \(z=[\bar{\sigma}_{1}^{\mathrm{T}}\tilde {W}_{c1},\bar{\sigma} _{2}^{\mathrm{T}} \tilde {W}_{c2},\tilde {W}_{c1},\tilde {W}_{c2}]^{\mathrm{T}}\); then (9.99) can be rewritten as

(9.101)

where the components of the matrix M are given by

and the components of the vector δ=[d 1 d 2 d 3 d 4]T are given as

According to Assumption 9.34 and the facts that \(\bar{\sigma}_{1}<1\) and \(\bar{\sigma}_{2}<1\), it can be concluded that δ is bounded by δ M . Let the parameters F 1, F 2, F 3, and F 4 be chosen such that M>0. Then, taking the upper bound of (9.101) yields
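Whether a candidate choice of the parameters F 1, F 2, F 3, and F 4 indeed renders M>0 can be checked numerically; the fragment below is a generic positive-definiteness test (the entries of M are problem dependent and are not reconstructed here):

```python
import numpy as np

def is_positive_definite(M):
    """Return True if the (symmetrized) matrix M is positive definite.
    A Cholesky factorization succeeds exactly when M > 0, so this gives a
    simple numerical check for a candidate choice of F1, ..., F4."""
    try:
        np.linalg.cholesky((M + M.T) / 2.0)   # symmetrize to guard against round-off
        return True
    except np.linalg.LinAlgError:
        return False
```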

(9.102)

Now, the cases of \(\varSigma(x,\hat{u},\hat{w})=0\) and \(\varSigma(x,\hat{u},\hat{w})=1\) will be considered.

(1) When \(\varSigma(x,\hat{u},\hat{w})=0\), the first two terms of (9.102) are negative. Noting that ∥x∥>0, as guaranteed by the persistent excitation condition, and using the operator defined in (9.92), it can be ensured that there exists a constant \(\dot{x}_{\min}\) satisfying \(0<\dot{x}_{\min}<\|\dot{x}\|\). Then (9.102) becomes

(9.103)

If any one of the following inequalities:

(9.104)

or

(9.105)

or

(9.106)

holds, then \(\dot{L}<0\). Therefore, using Lyapunov theory, it can be concluded that ∥▽L 1∥, ∥▽L 2∥, and ∥z∥ are UUB.

(2) When \(\varSigma(x,\hat{u},\hat{w})=1\), the feedback control inputs (9.85) and (9.86) may not stabilize the system (9.58). Adding and subtracting \(\triangledown L_{1}^{\mathrm{T}} D_{1}\varepsilon_{1}/2+\triangledown L_{2}^{\mathrm{T}}D_{2}\varepsilon_{2}/2\) on the right-hand side of (9.102), and using (9.64), (9.65), and (9.80), we have

(9.107)

According to Assumption 9.34, D i is bounded by D iM , where D iM is a known constant, i=1,2. Using Lemma 9.31 and recalling the boundedness of ▽ε 1, ▽ε 2, and δ, (9.107) can be rewritten as

(9.108)

where

If any one of the following inequalities:

(9.109)

or

(9.110)

or

(9.111)

holds, then \(\dot{L}<0\). Therefore, using Lyapunov theory, it can be concluded that ∥▽L 1∥, ∥▽L 2∥, and ∥z∥ are UUB.

In summary, for both cases \(\varSigma(x,\hat{u},\hat{w})=0\) and \(\varSigma(x,\hat{u},\hat{w})=1\), if \(\|\triangledown L_{1}\|>\max( B_{\triangledown L_{1}}, B'_{\triangledown L_{1}})\triangleq\bar{B}_{\triangledown L_{1}}\), or \(\|\triangledown L_{2}\|>\max( B_{\triangledown L_{2}}, B'_{\triangledown L_{2}})\triangleq\bar{B} _{\triangledown L_{2}}\), or \(\|z\|>\max( B_{z}, B'_{z})\triangleq\bar{B}_{z}\), then \(\dot{L}<0\). Therefore, we can conclude that ∥▽L 1∥, ∥▽L 2∥, and ∥z∥ are bounded by \(\bar{B}_{\triangledown L_{1}}\), \(\bar{B}_{\triangledown L_{2}}\), and \(\bar{B}_{z}\), respectively. According to Lemma 9.31, the Lyapunov function candidates L 1(x) and L 2(x) are radially unbounded and continuously differentiable, so the boundedness of ∥▽L 1∥ and ∥▽L 2∥ implies the boundedness of ∥x∥. Specifically, ∥x∥ is bounded by \(\bar{B}_{x}=\max(B_{1x},B_{2x})\), where B 1x and B 2x are determined by \(\bar{B}_{\triangledown L_{1}}\) and \(\bar{B}_{\triangledown L_{2}}\), respectively. Besides, note that if any component of z exceeded \(\bar{B}_{z}\), i.e., \(\|\tilde {W}_{c1}\|>\bar{B}_{z}\) or \(\|\tilde{W}_{c2}\|>\bar{B}_{z}\) or \(\|\bar {\sigma}_{1}^{\mathrm{T}}\tilde {W}_{c1}\|>\bar{B}_{z}\) or \(\|\bar {\sigma}_{2}^{\mathrm{T}}\tilde {W}_{c2}\|>\bar{B}_{z}\), then ∥z∥ would exceed \(\bar{B}_{z}\) as well; hence the critic NN weight estimation errors \(\|\tilde{W}_{c1}\|\) and \(\|\tilde{W}_{c2}\|\) are also bounded by \(\bar{B}_{z}\).

Next, we will prove \(\|\hat{u}-u^{\ast}\|\leq\epsilon_{u}\) and \(\|\hat{w}-w^{\ast}\|\leq\epsilon_{w}\). From (9.64) and (9.85) and recalling the boundedness of ∥▽ϕ 1∥ and \(\|\tilde{W}_{c1}\|\), we have

(9.112)

Similarly, from (9.65) and (9.86) and recalling the boundedness of ∥▽ϕ 2∥ and \(\|\tilde{W}_{c2}\|\), we obtain \(\|\hat{w}-w^{\ast}\|\leq\epsilon_{w}\).

This completes the proof. □

Remark 9.36

In [10], each player needs two NNs, a critic NN and an action NN, to implement the online learning algorithm. By contrast, the present method requires only one critic NN for each player; the action NN is eliminated, which results in a simpler architecture and a lower computational burden.

Remark 9.37

In Remark 3 of [10], it was pointed out that the NN weights can be initialized randomly but must be nonzero. The reason is that the method proposed in [10] requires initial stabilizing control policies to guarantee the stability of the system. By contrast, in this subsection no initial stabilizing control policies are needed, because an operator selected according to Lyapunov's sufficient condition for stability is added to the critic NN weight tuning law of each player.

9.4.3 Simulations

Example 9.38

An example is provided to demonstrate the effectiveness of the present control scheme.

Consider the affine nonlinear system as follows:

(9.113)

where

(9.114)
(9.115)

The cost functionals for player 1 and player 2 are defined by (9.59) and (9.60), respectively, where Q 1(x)=2x T x, R 11=R 12=2I, Q 2(x)=x T x, R 21=R 22=2I, and I denotes an identity matrix of appropriate dimensions.
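The expressions (9.59) and (9.60) are not restated here; assuming they take the standard quadratic-in-control form \(J_{i}=\int_{0}^{\infty}(Q_{i}(x)+u^{\mathrm{T}}R_{i1}u+w^{\mathrm{T}}R_{i2}w)\,\mathrm{d}t\), the running costs of this example read as in the following sketch (scalar control inputs are assumed purely for illustration):

```python
import numpy as np

R11 = R12 = R21 = R22 = 2.0 * np.eye(1)   # R matrices of Example 9.38 (dimension assumed)

def running_cost_player1(x, u, w):
    """l_1(x,u,w) = Q_1(x) + u^T R_11 u + w^T R_12 w with Q_1(x) = 2 x^T x."""
    return 2.0 * (x @ x) + u @ R11 @ u + w @ R12 @ w

def running_cost_player2(x, u, w):
    """l_2(x,u,w) = Q_2(x) + u^T R_21 u + w^T R_22 w with Q_2(x) = x^T x."""
    return x @ x + u @ R21 @ u + w @ R22 @ w
```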

For player 1, the optimal cost function is \(V^{\ast}_{1}(x)=0.5x_{1}^{2}+x_{2}^{2}\). For player 2, the optimal cost function is \(V^{\ast}_{2}(x)=0.25x_{1}^{2}+0.5x_{2}^{2}\). The activation functions of the critic NNs of the two players are selected as \(\phi_{1}=\phi_{2}= [x_{1}^{2},x_{1}x_{2}, x_{2}^{2}]^{\mathrm{T}}\). Then, the optimal values of the critic NN weights for player 1 are W c1=[0.5,0,1]T, and the optimal values of the critic NN weights for player 2 are W c2=[0.25,0,0.5]T. The estimates of the critic NN weights for the two players are denoted by \(\hat{W}_{c1} = [W_{11}, W_{12},W_{13}]^{\mathrm{T}}\) and \(\hat{W}_{c2} = [W_{21}, W_{22},W_{23}]^{\mathrm{T}}\), respectively. The adaptive gains for the critic NNs are selected as a 1=1 and a 2=1, and the design parameters are selected as F 1=F 2=F 3=F 4=10I. All NN weights are initialized to zero, which means that no initial stabilizing control policies are needed for implementing the present control scheme. The system state is initialized as [0.5,0.2]T. To maintain the excitation condition, probing noise is added to the control input for the first 250 s.
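As a quick consistency check, the optimal critic weights above follow directly from expanding \(V^{\ast}_{i}(x)\) in the chosen activation basis; the sketch below (variable names are illustrative) evaluates \(\hat{V}_{i}(x)=\hat{W}_{ci}^{\mathrm{T}}\phi_{i}(x)\) at the optimal weights and compares it with \(V^{\ast}_{i}(x)\) at the initial state:

```python
import numpy as np

def phi(x):
    """Critic activation vector phi_1 = phi_2 = [x1^2, x1*x2, x2^2]^T."""
    x1, x2 = x
    return np.array([x1 ** 2, x1 * x2, x2 ** 2])

def V_hat(W, x):
    """Approximate cost V_i(x) = W_ci^T phi_i(x)."""
    return W @ phi(x)

W_c1 = np.array([0.5, 0.0, 1.0])    # optimal critic weights, player 1
W_c2 = np.array([0.25, 0.0, 0.5])   # optimal critic weights, player 2
x0 = np.array([0.5, 0.2])           # initial state used in the simulation

print(V_hat(W_c1, x0), 0.5 * x0[0] ** 2 + x0[1] ** 2)          # both 0.165
print(V_hat(W_c2, x0), 0.25 * x0[0] ** 2 + 0.5 * x0[1] ** 2)   # both 0.0825
```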

After simulation, the trajectories of the system states are shown in Fig. 9.13. The convergence trajectories of the critic NN weights for player 1 are shown in Fig. 9.14, from which we see that they finally converge to [0.4490,0.0280,0.9777]T. The convergence trajectories of the critic NN weights for player 2 are shown in Fig. 9.15, from which we see that they finally converge to [0.1974,0.0403,0.4945]T. The convergence trajectory of \(e_{u}=\hat{u}-u^{\ast}\) is shown in Fig. 9.16, and the convergence trajectory of \(e_{w}=\hat{w}-w^{\ast}\) is shown in Fig. 9.17. From Fig. 9.16, we see that the error between the estimated control \(\hat{u}\) and the optimal control \(u^{\ast}\) for player 1 is close to zero by t=230 s. Similarly, it can be seen from Fig. 9.17 that the error between the estimated control \(\hat{w}\) and the optimal control \(w^{\ast}\) for player 2 is close to zero by t=180 s. The simulation results reveal that the present control scheme makes the critic NN of each player learn the optimal cost function while guaranteeing the stability of the closed-loop system.

Fig. 9.13 The trajectories of system states

Fig. 9.14 The convergence trajectories of critic NN weights for player 1

Fig. 9.15 The convergence trajectories of critic NN weights for player 2

Fig. 9.16 The convergence trajectory of e u

Fig. 9.17 The convergence trajectory of e w

For comparison with [10], we use the method proposed in [10] to solve the non-zero-sum game of the system (9.113) with all NN weights initialized to zero; the resulting trajectories of the system states are shown in Fig. 9.18. The system is unstable, which confirms that the method in [10] requires initial stabilizing control policies to guarantee the stability of the system. By contrast, the present method does not need initial stabilizing control policies.

Fig. 9.18 The trajectories of system states obtained by the method in [10] with initial NN weights selected to be zero

As pointed out earlier, one of the main advantages of the single-network ADP approach is that it results in a lower computational burden and eliminates the approximation error introduced by the action NNs. To demonstrate this quantitatively, we apply the method in [10] and our method to the system (9.113) with the same initial condition. Figures 9.19 and 9.20 show the convergence trajectories of the critic NN weights for player 1 and player 2, where the solid line and the dashed line represent the results of the method in [10] and our method, respectively. For the convenience of comparison, we define an evaluation function \(\text{PER}(i)=\sum_{k=1}^{N} \|\tilde{W}_{i}(k)\|\), i=1,2, i.e., the sum of the norms of the critic NN weight estimation errors over the running time, where N is the number of sample points. The evaluation functions of the critic NN estimation errors, as well as the time taken by the method in [10] and by our method, are shown in Table 9.1. The table clearly indicates that the present method takes less time and yields a smaller approximation error than the method in [10].
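The evaluation function PER(i) can be computed directly from the recorded weight trajectories; the following sketch (array shapes and names are illustrative) implements the definition above:

```python
import numpy as np

def per(W_hat_history, W_opt):
    """PER(i) = sum_{k=1}^{N} ||W_tilde_i(k)||: the accumulated norm of the
    critic NN weight estimation error over the N sample points of a run.
    W_hat_history is an (N, 3) array of recorded weight estimates."""
    errors = W_hat_history - W_opt              # W_tilde_i(k) at every sample point
    return np.sum(np.linalg.norm(errors, axis=1))

# Example: per(recorded_W_c1, np.array([0.5, 0.0, 1.0])) for player 1.
```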

Fig. 9.19 The convergence trajectories of critic NN weights for player 1 (solid line: the method in [10]; dashed line: our method)

Fig. 9.20 The convergence trajectories of critic NN weights for player 2 (solid line: the method in [10]; dashed line: our method)

Table 9.1 Critic NN estimation errors and calculation time

9.5 Summary

In this chapter, we investigated the problem of continuous-time differential games based on ADP. In Sect. 9.2, we developed a new iterative ADP method to obtain the optimal control pair or the mixed optimal control pair for a class of affine nonlinear zero-sum differential games. In Sect. 9.3, finite horizon zero-sum games for nonaffine nonlinear systems were studied. Then, in Sect. 9.4, the case of non-zero-sum differential games was studied using a single network ADP. Several numerical simulations showed that the present methods are effective.