1 Introduction

Adaptive dynamic programming (ADP), proposed by Werbos [33, 34], is an effective adaptive learning control approach for solving optimal control problems forward-in-time. Several synonyms are used for ADP, including "adaptive critic designs" [3, 4, 21], "adaptive dynamic programming" [20, 27], "approximate dynamic programming" [5, 34, 35], "neural dynamic programming" [9], "neuro-dynamic programming" [7], and "reinforcement learning" [22, 23]. In [21, 34, 35], ADP approaches were classified into several main schemes: heuristic dynamic programming (HDP), action-dependent HDP (ADHDP), also known as Q-learning [28], dual heuristic dynamic programming (DHP), action-dependent DHP (ADDHP), globalized DHP (GDHP), and ADGDHP. In recent years, iterative methods have also been used in ADP to obtain the solution of the Hamilton–Jacobi–Bellman (HJB) equation indirectly, and they have received considerable attention [1, 2, 10, 11, 12, 17, 20, 25, 27, 32, 35]. There are two main classes of iterative ADP algorithms, based on policy iteration and value iteration, respectively [14]. Policy iteration algorithms for the optimal control of continuous-time systems with continuous state and action spaces were given in [1]. In 2011, Wang et al. [26] studied finite-horizon optimal control problems of discrete-time nonlinear systems with unspecified terminal time using policy iteration algorithms. Value iteration algorithms for the optimal control of discrete-time nonlinear systems were given in [6]. In [5], a value iteration algorithm, referred to as the greedy HDP iteration algorithm, was proposed for finding the optimal control law, and its convergence was also proved. In [29], an iterative θ-ADP algorithm was established that permits the ADP algorithm to be implemented both on-line and off-line without requiring an initial admissible control sequence.

Although iterative ADP algorithms have attracted considerable attention [3, 16, 19, 24, 37, 40, 41], nearly all of them require the iterative control law of each iteration to be obtained accurately. Such algorithms can be called "accurate iterative ADP algorithms". For most real-world control systems, however, the accurate control laws of the iterative ADP algorithms can hardly be obtained, no matter what kind of fuzzy or neural network structure is used [18, 31, 36, 38, 39]. In other words, approximation errors are inherent in all real control systems. Hence, it is necessary to discuss ADP optimal control schemes with approximation errors. Unfortunately, discussions of ADP algorithms with approximation errors are very scarce. To the best of our knowledge, only in [30] was a convergence property established, and there the stability of the system could not be guaranteed, which means that the algorithm can only be implemented off-line. To overcome this difficulty, new methods must be developed. This motivates our research.

In this paper, inspired by [29, 30], a new stable iterative ADP algorithm based on the iterative θ-ADP algorithm is established for discrete-time nonlinear systems. The theoretical contribution of this paper is the proof that, when the iterative control law and the iterative performance index function cannot be obtained accurately in each iteration, the performance index function of the present iterative ADP algorithm converges to a finite neighborhood of the optimal performance index function and the iterative control law stabilizes the system. First, we show that the properties of iterative ADP algorithms without approximation errors may become invalid once approximation errors are introduced. Second, we show that the stability property of the algorithm in [30] cannot be guaranteed when approximation errors exist. Third, convergence properties of the iterative ADP algorithm with approximation errors are presented to guarantee that the iterative performance index function converges to a finite neighborhood of the optimal one. Finally, we show that the nonlinear system can be stabilized under the iterative control law, which makes the developed iterative ADP algorithm feasible for implementation both on-line and off-line.

This paper is organized as follows. In Sect. 2, the problem statement is presented. In Sect. 3, the stable iterative ADP algorithm is derived. The convergence and stability properties are also analyzed in this section. In Sect. 4, the neural network implementation for the optimal control scheme is discussed. In Sect. 5, two simulation examples are given to demonstrate the effectiveness of the present algorithm. Finally, in Sect. 6, the conclusion is drawn.

2 Problem statement

In this paper, we will study the following discrete-time nonlinear system

$$ x_{k+1}=F(x_k, u_k), \quad k=0, 1, 2,\ldots, $$
(1)

where \({x_k \in \mathbb{R}^n}\) is the n-dimensional state vector, and \({u_k \in \mathbb{R}^m}\) is the m-dimensional control vector. Let \(x_0\) be the initial state.

Let \(\underline{u}_k=(u_k,u_{k+1},\ldots)\) be an arbitrary sequence of controls from time k to \(\infty\). The performance index function for state \(x_0\) under the control sequence \(\underline{u}_0=(u_0,u_1,\ldots)\) is defined as

$$ J(x_0, \underline{u}_0)=\sum_{k=0}^{\infty}U(x_k,u_k), $$
(2)

where \(U(x_k,u_k) > 0\), for \(\forall x_k, u_k \neq 0\), is the utility function.
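As a concrete illustration of (2), the short sketch below evaluates a finite-horizon truncation of the performance index for a given control sequence; the scalar dynamics F and the quadratic utility U are hypothetical stand-ins chosen only for this illustration.

```python
import numpy as np

def F(x, u):
    # Hypothetical scalar dynamics standing in for (1).
    return 0.9 * x + np.sin(u)

def U(x, u):
    # Quadratic utility, positive for all nonzero (x, u).
    return x ** 2 + u ** 2

def performance_index(x0, controls):
    # Finite-horizon truncation of J(x_0, u_0) = sum_k U(x_k, u_k) in (2).
    J, x = 0.0, x0
    for u in controls:
        J += U(x, u)
        x = F(x, u)
    return J

print(performance_index(1.0, [0.0] * 50))  # cost of applying u_k = 0 for 50 steps
```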

In this paper, the results are based on the following assumptions.

Assumption 1

The system (1) is controllable, and the function \(F(x_k, u_k)\) is Lipschitz continuous for \(\forall x_k, u_k\).

Assumption 2

The system state \(x_k = 0\) is an equilibrium state of system (1) under the control \(u_k = 0\), i.e., \(F(0,0) = 0\).

Assumption 3

The feedback control \(u_k = u(x_k)\) is a Lipschitz continuous function for \(\forall x_k\) and satisfies \(u_k = u(x_k) = 0\) for \(x_k = 0\).

Assumption 4

The utility function \(U(x_k, u_k)\) is a continuous positive definite function of \(x_k\) and \(u_k\).

Define the set of control sequences as \({\underline{\mathfrak{U}}_k=\big\{\underline {u}_k\colon \underline {u}_k=(u_k, u_{k+1}, \ldots), \forall u_{k+i}\in { \mathbb R}^m, \quad i=0,1,\ldots\big\}. }\) Then, for an arbitrary control sequence \({\underline {u}_k \in \underline {\mathfrak{U}}_k}\), the optimal performance index function can be defined as

$$ J^*(x_k)={\inf} \left\{ J(x_k,\underline {u}_k)\colon \underline {u}_k\in\underline {{\mathfrak{U}}}_{k}\right\}. $$

According to Bellman's principle of optimality, \(J^*(x_k)\) satisfies the discrete-time HJB equation

$$ J^*(x_k)=\inf_{u_k \in {\mathbb{R}}^m} \left\{ U(x_k,u_k)+J^*(F(x_k,u_k))\right\}. $$
(3)

Then, the optimal control law can be expressed as

$$ u^*(x_k)={{\hbox{arg}} \min_{u_k \in {\mathbb{R}}^m}} \left\{ U(x_k,u_k)+J^*(F(x_k,u_k))\right\}. $$

Hence, the HJB Eq. (3) can be written as

$$ J^*(x_k)=U(x_k,u ^*(x_k))+J^*(F(x_k,u ^*(x_k))). $$
(4)

Generally, \(J^*(x_k)\) cannot be obtained by solving the HJB Eq. (4) directly. In [29], an iterative θ-ADP algorithm was proposed to obtain the performance index function and the control law iteratively. The iterative θ-ADP algorithm can be expressed by the following equations:

$$ {v_i}({x_k}) = {\hbox{arg}} \mathop{\min}\limits_{{u_k} \in {{\mathbb{R}}^m}} \left\{ {U({x_k},{u_k}) + {V_i}({x_{k + 1}})} \right\} $$
(5)

and

$$ \begin{aligned} {V_{i + 1}}({x_k}) =& \mathop{\min }\limits_{{u_k} \in {{\mathbb{R}}^m}} \left\{ {U({x_k},{u_k}) + {V_i}({x_{k + 1}})}\right\}\\ =& U({x_k},{v_i}({x_k})) + {V_i}(F({x_k},{v_i}({x_k}))) \end{aligned} $$
(6)

where \(V_0(x_k) =\theta \Uppsi(x_k), \Uppsi(x_k) \in \bar{\Uppsi}_{x_k}\), is the initial performance index function, and θ > 0 is a large finite positive constant. The set of positive definite functions \(\bar{\Uppsi}_{x_k}\) is defined as follows.

Definition 1

Let

$$ \begin{aligned} \bar{\Uppsi}_{x_k} &= \left\{ \Uppsi(x_k) \colon \Uppsi ({x_k}) > 0 \, {{{\it {\hbox{is}\,\hbox{positive} \,\hbox{definite}}}}}, \right.\\ &\quad \left.{\hbox{and}} \, \exists \nu ({x_k}) \in {\mathbb R}^m, \Uppsi (F({x_k},\nu ({x_k}))) < \Uppsi ({x_k}) \right\} \end{aligned} $$

be the set of initial positive definite functions.

In [29], it is proved that the iterative performance index function \(V_i(x_k)\) converges to \(J^*(x_k)\) as \(i\rightarrow\infty\). It is also shown that the iterative control law \(v_i(x_k)\) is a stable control law for \(i=0,1,\ldots\). For the iterative θ-ADP algorithm, the accurate iterative control law and the accurate iterative performance index function must be obtained to guarantee the convergence of the iterative performance index function. To obtain the accurate control law, it is also required that the control space be continuous with no constraints. In real-world implementations, however, for \(\forall i =0,1,\ldots\), the accurate iterative control law \(v_i(x_k)\) and the accurate iterative performance index function \(V_i(x_k)\) are generally impossible to obtain. In this situation, the iterative control law cannot guarantee that the iterative performance index function converges to the optimum. Furthermore, the stability of the system under the iterative control law cannot be proved. These properties will be shown in the next section. To overcome this difficulty, a new ADP algorithm and a new analysis method must be developed.
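As a simple worked illustration of one exact iteration of (5)–(6), consider the assumed scalar linear system \(x_{k+1}=ax_k+bu_k\) with quadratic utility \(U(x_k,u_k)=x_k^2+u_k^2\) and \(V_0(x_k)=\theta x_k^2\) (this system is not one of those studied later). The minimization in (5) can then be carried out in closed form, giving

$$ v_0(x_k)=-\frac{\theta ab}{1+\theta b^2}\,x_k, \qquad V_1(x_k)=U(x_k,v_0(x_k))+V_0(ax_k+bv_0(x_k))=\left(1+\frac{\theta a^2}{1+\theta b^2}\right)x_k^2, $$

which shows how the quadratic form of the performance index function is preserved from one iteration to the next in this special case.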

3 Properties of the approximation error-based iterative ADP algorithm

In this section, based on the iterative θ-ADP algorithm, a new stable iterative ADP algorithm with approximation errors is developed to obtain the nearly optimal controller for the nonlinear system (1). The goal of the present ADP algorithm is to construct an iterative controller that moves an arbitrary initial state \(x_0\) to the equilibrium and simultaneously makes the iterative performance index function reach a finite neighborhood of the optimal performance index function. Convergence proofs will be presented to guarantee that the iterative performance index functions converge to a finite neighborhood of the optimal one. Stability proofs will be given to show that the iterative control laws stabilize the nonlinear system (1), which makes the algorithm feasible for implementation both on-line and off-line.

3.1 Derivation of the stable iterative ADP algorithm with approximation error

In the present iterative ADP algorithm, the performance index function and the control law are updated by iterations, with the iteration index i increasing from 0 to infinity. For \({\forall x_k\in \mathbb{R}^n}\), let the initial function \(\Uppsi(x_k)\) be an arbitrary function that satisfies \(\Uppsi(x_k)\in \bar{\Uppsi}_{x_k}\), where \(\bar{\Uppsi}_{x_k}\) is defined in Definition 1. For \({\forall x_k\in \mathbb{R}^n}\), let the initial performance index function be \(\hat{V}_0(x_k) =\theta \Uppsi(x_k)\), where θ > 0 is a large finite positive constant. The iterative control law \(\hat{v}_0(x_k)\) can be computed as follows:

$$ \hat{v}_0 (x_k ) = \mathop{{\hbox{arg}}\min}\limits_{u_k \in {\mathbb{R}}^m} \left\{ {U(x_k,u_k ) + \hat{V}_0 (x_{k + 1})} \right\} +\rho_0(x_k) $$
(7)

where \(\hat{V}_0 (x_{k + 1} )= \theta\Uppsi(x_{k+1}),\) and the performance index function can be updated as

$$ \hat{V}_1 (x_k ) = U(x_k,\hat{v}_0 (x_k )) + \hat{V}_0 (F(x_k ,\hat{v}_0 (x_k )))+ \pi_0 (x_k), $$
(8)

where \(\rho_0(x_k)\) and \(\pi_0(x_k)\) are the approximation error functions of the iterative control law and the iterative performance index function, respectively.

For \(i = 1, 2, \ldots\), the iterative ADP algorithm will iterate between

$$ \hat{v}_i (x_k ) =\mathop{{\hbox{arg}}\min }\limits_{u_k\in {\mathbb{R}}^m} \left\{ {U(x_k,u_k ) + \hat{V}_i (x_{k + 1} )} \right\}+\rho_i (x_k) $$
(9)

and performance index function

$$ \hat{V}_{i+1} (x_k )=U(x_k,\hat{v}_i (x_k )) + \hat{V}_i (F(x_k ,\hat{v}_i (x_k )))+\pi_i(x_k), $$
(10)

where \(\rho_i(x_k)\) and \(\pi_i(x_k)\) are finite approximation error functions of the iterative control law and the iterative performance index function, respectively.
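A minimal numerical sketch of the iteration (7)–(10) on discretized state and control grids is given below; the one-dimensional dynamics, the utility, the choice of \(\Uppsi\), the grids, and the bounded random perturbations standing in for \(\rho_i(x_k)\) and \(\pi_i(x_k)\) are all assumptions made only for illustration. Setting both error bounds to zero recovers the accurate iteration (5)–(6).

```python
import numpy as np

# Illustrative one-dimensional setup (all of these choices are assumptions).
x_grid = np.linspace(-1.0, 1.0, 101)
u_grid = np.linspace(-1.0, 1.0, 101)
F = lambda x, u: 0.9 * x + 0.5 * u            # hypothetical dynamics (1)
U = lambda x, u: x ** 2 + u ** 2              # quadratic utility
Psi = lambda x: x ** 2                        # an element of the set in Definition 1
theta = 10.0                                  # large finite positive constant

V = theta * Psi(x_grid)                       # V_0(x_k) = theta * Psi(x_k)
V_of = lambda V, x: np.interp(x, x_grid, V)   # tabulated V_i evaluated at x_{k+1}
rng = np.random.default_rng(0)

def adp_iteration(V, rho_bound=0.0, pi_bound=0.0):
    # One sweep of (7)-(10); rho_bound = pi_bound = 0 reduces it to (5)-(6).
    V_next, v = np.empty_like(V), np.empty_like(x_grid)
    for j, x in enumerate(x_grid):
        q = U(x, u_grid) + V_of(V, F(x, u_grid))   # U(x_k, u_k) + V_i(x_{k+1})
        v[j] = u_grid[np.argmin(q)] + rng.uniform(-rho_bound, rho_bound)       # + rho_i
        V_next[j] = (U(x, v[j]) + V_of(V, F(x, v[j]))
                     + rng.uniform(-pi_bound, pi_bound))                       # + pi_i
    return V_next, v

for i in range(30):
    V, v = adp_iteration(V, rho_bound=1e-3, pi_bound=1e-3)
```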

Remark 1

From the iterative ADP algorithm (7)–(10), we can see that for \(i=0,1,\ldots\), the iterative performance index function \(V_i(x_k)\) and the iterative control law \(v_i(x_k)\) in (5)–(6) are replaced by \(\hat{V}_i(x_k)\) and \(\hat{v}_i(x_k)\), respectively. As there exist approximation errors, in general we have \(\hat{v}_i(x_k) \neq {{v}}_i(x_k)\) and \(\hat{V}_{i+1}(x_k) \neq {V}_{i+1}(x_k)\) for \(\forall i \geq 0\). This means that there exists an error between \(\hat{V}_{i+1}(x_k)\) and \(V_{i+1}(x_k)\). It should be pointed out that this approximation error is not constant across iterations. In fact, although the approximation error in each single iteration is finite, the bound on the accumulated approximation error increases to infinity as the iteration index \(i \rightarrow \infty\). The following lemma shows this property.

Lemma 1

Let \({x_k \in \mathbb{R}^n}\) be an arbitrary controllable state and Assumptions 1–4 hold. For \(i=1,2,\ldots, \) define a new iterative performance index function as

$$ \Upgamma_{i}(x_k)=\mathop{\min}\limits_{{u_k\in {\mathbb{R}}^m}} \{ U({x_k},{u_k}) + {\hat{V}_{i-1}}({x_{k + 1}})\} $$
(11)

where \(\hat{V}_{i} (x_k)\) is defined in (10). If the initial iterative performance index function satisfies \(\hat{V}_0(x_k) ={\Upgamma}_0(x_k)= \theta \Uppsi({x_k})\), and for \(i=1,2,\ldots\) there exists a finite constant \(\epsilon\) such that

$$ \hat{V}_i(x_k) -\Upgamma_i(x_k)\leq \epsilon $$
(12)

holds uniformly, then we have

$$ \hat{V}_i(x_k)-V_i(x_k)\leq i\epsilon $$
(13)

where \(\epsilon\) is called the uniform approximation error (the approximation error for short).

Proof

The lemma can be proved by mathematical induction. First, let i = 1. We have

$$ \begin{aligned} \Upgamma_1 (x_k) &= \mathop {\min }\limits_{{u_k \in {\mathbb{R}}^m}} \{ U({x_k},{u_k}) + {{\hat V}_{0}}({x_{k + 1}})\}\\ &= \mathop {\min }\limits_{{u_k \in {\mathbb{R}}^m}} \{ U({x_k},{u_k}) +{{V}_{0}}(F(x_{k},u_k))\} \\&=V_1(x_k). \end{aligned} $$

Then, according to (12), we can get

$$ \hat{V}_1(x_k) -V_1(x_k)\leq \epsilon. $$

Assume that (13) holds for \(i=l-1,\, l=2,3,\ldots. \) Then, for i = l, we have

$$ \begin{aligned} \Upgamma_l (x_k) =& \mathop {\min }\limits_{{u_k \in {\mathbb{R}}^m}} \{ U({x_k},{u_k}) + {{\hat V}_{l-1}}({x_{k + 1}})\}\\ \leq &\mathop {\min }\limits_{{u_k \in {\mathbb{R}}^m}} \{ U({x_k},{u_k}) + V_{l-1}(x_{k+1})+(l-1)\epsilon\}\\ =&{V}_l(x_k)+(l-1)\epsilon. \end{aligned} $$

Then, according to (12), we can get (13).

Remark 2

From Lemma 1, we can see that for the iterative ADP algorithm (7)–(10), the error bound between \(V_i(x_k)\) and \(\hat{V}_i(x_k)\) grows as i increases. This means that although the approximation error in each single iteration is finite and may be small, the accumulated approximation error between \(\hat{V}_i(x_k)\) and \(V_i(x_k)\) may increase to infinity as \(i\rightarrow\infty\). Hence, we can say that the original iterative θ-ADP algorithm may become invalid in the presence of approximation errors.

To overcome these difficulties, we must discuss the convergence and stability properties of the iterative ADP algorithm with finite approximation error.

3.2 Properties of the stable iterative ADP algorithm with finite approximation error

For convenience of analysis, we transform the expression of the approximation error as follows. According to the definitions of \(\hat{V}_i(x_k)\) and \(\Upgamma_i(x_k)\) in (10) and (11), we have \(\Upgamma_i(x_k)\leq \hat{V}_i(x_k)\). Then, for \(\forall i =0,1,\ldots\), there exists a \(\sigma \geq 1\) that makes

$$ \Upgamma_i(x_k)\leq \hat{V}_i(x_k)\leq\sigma\Upgamma_i(x_k) $$
(14)

hold uniformly. Hence, we can give the following theorem.

Theorem 1

Let \({x_k \in \mathbb{R}^n}\) be an arbitrary controllable state and Assumptions 1–4 hold. For \(\forall i=0,1,\ldots\) , let \(\Upgamma_i(x_k)\) be expressed as (11) and \(\hat{V}_i (x_k) \) be expressed as (10). Let \(\gamma < \infty\) and \(1 \leq \delta < \infty\) be constants that make

$$ J^*(F(x_k,u_k)) \leq \gamma U(x_k,u_k) $$

and

$$ J^*(x_k)\leq V_0(x_k) \leq \delta J^*(x_k) $$
(15)

hold uniformly. If there exists a σ, i.e., \(1\leq \sigma < \infty\) , that makes (14) hold uniformly, then we have

$$ \begin{aligned} J^*(x_k) \leq& \hat{V}_i(x_k)\\ \leq & \sigma \left( 1 + \sum\limits_{j = 1}^i {\frac{{{\gamma ^j}{\sigma ^{j - 1}}(\sigma - 1)}}{{{{(\gamma + 1)}^j}}}} + \frac{{{\gamma ^i}{\sigma ^i}(\delta - 1)}}{{{{(\gamma + 1)}^i}}} \right)\\ &\quad\times J^*(x_k) \end{aligned} $$
(16)

where we define \(\sum\nolimits_j^i {(\cdot) = 0} \) for ∀ j > i.

Proof

The theorem can be proved by mathematical induction. First, let i = 0. Then, (16) becomes

$$ J^*(x_k) \leq \hat{V}_0(x_k)\leq \sigma \delta J^*(x_k). $$
(17)

As \(J^*(x_k) \leq V_0(x_k) \leq \delta J^*(x_k)\) and \(\hat{V}_0(x_k) ={V}_0(x_k)= \theta \Uppsi({x_k})\), we can obtain (17). Then, the conclusion holds for i = 0.

Assume that (16) holds for i = l − 1, \(l=1,2,\ldots. \) Then for i = l, we have

$$ \begin{aligned} {\Upgamma_l}({x_k})\leq & \mathop {\min }\limits_{{u_k}\in {\mathbb{R}}^m} \Bigg\{ \Bigg(1+\gamma \left.\sum\limits_{j = 1}^{l - 1} {\frac{{{\gamma ^{j - 1}}{\sigma ^{j - 1}}(\sigma - 1)}}{{{{(\gamma + 1)}^j}}}} \right.\\ &\quad\left.+ \frac{{{\gamma ^{l- 1}}{\sigma ^{l - 1}}(\sigma \delta - 1)}}{{{{(\gamma +1)}^l}}}\right)U(x_k,u_k)\\ &\quad+\Bigg[\sigma \Bigg( 1 + \sum\limits_{j = 1}^l {\frac{{{\gamma ^j}{\sigma ^{j - 1}}(\sigma -1)}}{{{{(\gamma + 1)}^j}}}} + \frac{{{\gamma ^l}{\sigma ^l}(\delta - 1)}}{{{{(\gamma + 1)}^l}}} \Bigg)\\ &\quad-\Bigg(\sum\limits_{j = 1}^{l - 1} {\frac{{{\gamma ^{j - 1}}{\sigma ^{j - 1}}(\sigma - 1)}}{{{{(\gamma + 1)}^j}}}} + \frac{{{\gamma ^{l - 1}}{\sigma ^{l - 1}}(\sigma \delta - 1)}}{{{{(\gamma + 1)}^l}}}\Bigg)\Bigg]\\ &\quad\times J^*(F(x_k,u_k))\Bigg\}\\ =&\left( {1 + \sum\limits_{j = 1}^l {\frac{{{\gamma ^j}{\sigma ^{j - 1}}(\sigma - 1)}}{{{{(\gamma + 1)}^j}}}} + \frac{{{\gamma ^l}{\sigma ^l}(\delta - 1)}}{{{{(\gamma + 1)}^l}}}} \right)J^*(x_k). \end{aligned} $$
(18)

Then, according to (14), we can obtain (16) which proves the conclusion for \(\forall i=0,1,\ldots\).

Lemma 2

Let \({x_k \in \mathbb{R}^n}\) be an arbitrary controllable state and Assumptions 1–4 hold. Suppose Theorem 1 holds for \({\forall x_k \in \mathbb{R}^n}\) . If for \(\gamma < \infty\) and σ ≥ 1, the inequality

$$ \sigma < \frac{{\gamma + 1}}{\gamma} $$
(19)

holds, then as \(i\rightarrow \infty\), the iterative performance index function \(\hat{V}_i({x_k})\) in the iterative ADP algorithm (7)–(10) converges to a finite neighborhood of the optimal performance index function \(J^*(x_k)\), i.e.,

$$ \mathop{\lim }\limits_{i \to \infty } {\hat{V} _i}({x_k}) ={\hat{V}_\infty}({x_k}) \leq \sigma\left( {1 + \frac{{\gamma (\sigma - 1)}}{{1 - \gamma (\sigma - 1)}}} \right){J^*}({x_k}). $$
(20)

Proof

According to (18) in Theorem 1, we can see that for \(j=1,2,\ldots\), the sequence \(\left\{\frac{{{\gamma ^j}{\sigma ^{j - 1}}(\sigma - 1)}}{{{{(\gamma + 1)}^j}}} \right\}\) is a geometric series. Then, (18) can be written as

$$ \begin{aligned} {\Upgamma_i}({x_k}) \leq& \Bigg[ 1 + \frac{{\frac{{\gamma (\sigma - 1)}}{{\gamma + 1}}\left( {1 - {{\left( {\frac{{\gamma \sigma }}{{\gamma + 1}}} \right)}^i}} \right)}}{{1 - \frac{{\gamma \sigma }}{{\gamma + 1}}}}\\ &+ \frac{{{\gamma ^i}{\sigma ^i}\left( {\delta - 1} \right)}}{{{{\left( {\gamma + 1} \right)}^i}}} \Bigg]{J^*}({x_k}). \end{aligned} $$
(21)

As \(i\rightarrow \infty, \) if \(1 \leq \sigma < \frac{{\gamma + 1}}{\gamma }\), then (21) becomes

$$ \mathop{\lim}\limits_{i \to \infty } {\Upgamma _i}({x_k}) ={\Upgamma_\infty}({x_k}) \leq \left( {1 + \frac{{\gamma (\sigma - 1)}}{{1 - \gamma (\sigma - 1)}}} \right){J^*}({x_k}). $$
(22)

According to (14), let \(i\rightarrow\infty,\) and then we have

$$ \hat{V}_\infty(x_k)\leq\sigma\Upgamma_\infty(x_k). $$
(23)

According to (22) and (23), we can obtain (20).

Remark 3

We can see that if the approximation error σ satisfies \(\sigma < \frac{{\gamma + 1}}{\gamma }\), then the iterative performance index function converges to a finite neighborhood of the optimal performance index function \(J^*(x_k)\). However, the stability of the system under \(\hat{v}_i(x_k)\) cannot be guaranteed. The following theorem shows this property.
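To make these conditions concrete, the sketch below evaluates the factor multiplying \(J^*(x_k)\) on the right-hand side of (16) (denoted \(\chi_i\) in (24) below) together with the limiting factor in (20); the numerical values of γ, σ, and δ are assumptions chosen only for illustration.

```python
def chi(i, gamma, sigma, delta):
    # Factor multiplying J*(x_k) in (16); chi(0, ...) equals sigma * delta, as in (17).
    s = sum(gamma ** j * sigma ** (j - 1) * (sigma - 1) / (gamma + 1) ** j
            for j in range(1, i + 1))
    return sigma * (1.0 + s + gamma ** i * sigma ** i * (delta - 1) / (gamma + 1) ** i)

def limit_bound(gamma, sigma):
    # Limiting factor in (20), valid only when sigma < (gamma + 1) / gamma, i.e., (19).
    assert sigma < (gamma + 1) / gamma, "condition (19) is violated"
    return sigma * (1.0 + gamma * (sigma - 1) / (1.0 - gamma * (sigma - 1)))

gamma, sigma, delta = 2.0, 1.05, 1.5   # assumed values
print([round(chi(i, gamma, sigma, delta), 4) for i in range(6)])
print(round(limit_bound(gamma, sigma), 4))
```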

Theorem 2

Let \({x_k \in \mathbb{R}^n}\) be an arbitrary controllable state and Assumptions 1–4 hold. Let \(\chi_i\) be defined as

$$ \chi_i= \sigma \!\!\left( {1 + \sum\limits_{j = 1}^i {\frac{{{\gamma ^j}{\sigma ^{j - 1}}(\sigma - 1)}}{{{{(\gamma + 1)}^j}}}} + \frac{{{\gamma ^i}{\sigma ^i}(\delta - 1)}}{{{{(\gamma + 1)}^i}}}} \right). $$
(24)

Let the difference of the iterative performance index function be expressed as

$$ \Updelta \hat{V}_{i+1}(x_k)= \hat{V}_{i+1}(x_{k+1})- \hat{V}_{i+1}(x_k). $$
(25)

If Theorem 1 and Lemma 2 hold for \({\forall x_k \in \mathbb{R}^n}\), then \(\Updelta \hat{V}_{i+1}(x_k)\) satisfies

$$ \begin{aligned} &-\frac{1}{{{\chi _i}}} U({x_k},{\hat v_i}({x_k})) - \frac{{({\chi _i} - 1){\chi _{i + 1}}}}{{{\chi _i}}}{J^*}({x_k})\\ &\le \Updelta {{\hat V}_{i + 1}}({x_k})\\ &\le - U({x_k},{{\hat v}_i}({x_k})) + ({\chi _{i+1}} - 1){J^*}(F(x_k,\hat{v}_i(x_k))). \end{aligned} $$
(26)

Proof

According to Theorem 1, inequality (16) holds for \(\forall i=0,1,\ldots\). Substituting (16) into (10), we can get

$$ \begin{aligned} \hat{V}_{i+1}(x_k)&\leq U({x_k},{{\hat v}_i}({x_k})) + \chi_i J^*(x_{k+1}) \\ &\leq U({x_k},{{\hat v}_i}({x_k})) + \chi_i \hat{V}_{i+1}(x_{k+1}). \end{aligned} $$

Then, we have

$$ \begin{aligned} {\chi _i}({{\hat V}_{i + 1}}({x_{k + 1}}) - {{\hat V}_{i + 1}}({x_k})) \ge & - U({x_k},{{\hat v}_i}({x_k}))\\ &- ({\chi _i} - 1){{\hat V}_{i + 1}}({x_{k }}) \cr \ge & - U({x_k},{{\hat v}_i}({x_k}))\\ &- ({\chi _i} - 1){\chi _{i+1}}{J^*}({x_{k}}). \end{aligned} $$
(27)

Substituting (25) into (27), we can obtain the left-hand side of inequality (26).

On the other hand, according to (10), we can also obtain

$$ \chi_{i+1} J^*(x_k)\geq \hat{V}_{i+1}(x_k)\geq U({x_k},{{\hat v}_i}({x_k}))+ J^*(x_{k+1}). $$

Then, we can get

$$ \begin{aligned} {{\hat V}_{i + 1}}({x_{k + 1}}) - {{\hat V}_{i + 1}}({x_k})\le& {\chi _{i+1}}{J^*}({x_{k + 1}}) - U({x_k},{{\hat v}_i}({x_k}))\\ &- {J^*}({x_{k + 1}})\\ =& - U({x_k},{{\hat v}_i}({x_k})) \\ &+({\chi _{i+1}} -1){J^*}({x_{k + 1}}). \end{aligned} $$
(28)

Substituting (25) into (28), we can obtain the right-hand side of inequality (26).

According to Lemma 2, \(\chi_i\) is finite for \(\forall i=0,1,\ldots\). Furthermore, according to the definitions of σ and δ in (14) and (15), respectively, we have σ > 1 and δ > 1. Then, \(\chi_i > 1\) for \(\forall i = 0, 1,\ldots\). Hence, the inequality (26) is well defined for \(\forall i = 0, 1,\ldots\). The proof is completed.

From Theorem 2, we can see that for \(\forall i = 1, 2,\ldots\), \(\Updelta \hat{V}_{i}(x_k)\) in (25) is not necessarily negative definite. Hence, \(\hat{V}_{i}(x_k)\) cannot be guaranteed to be a Lyapunov function, and the system (1) may not be stable under the iterative control law \(\hat{v}_i(x_k)\). This means that the iterative ADP algorithm (7)–(10) cannot be implemented on-line. To overcome this difficulty, we will establish new convergence criteria in the following. It will be shown that if the new convergence conditions are satisfied, the stability of the system can be guaranteed, which makes the algorithm implementable both on-line and off-line.

Theorem 3

Let \({x_k \in \mathbb{R}^n}\) be an arbitrary controllable state and Assumptions 1–4 hold. Suppose Theorem 1 holds for \({\forall x_k \in \mathbb{R}^n}\) . If

$$ \sigma \le 1 + \frac{{\delta - 1}}{{\gamma \delta }}, $$
(29)

then the iterative performance index function \(\hat{V}_i(x_k)\) converges to a finite neighborhood of J * (x k ), as \(i\rightarrow\infty.\)

Proof

For \(i=0,1,\ldots\), let

$$ \bar{V}_i(x_k)=\chi_i J^*(x_k) $$
(30)

be the least upper bound of the iterative performance index function \(\hat{V}_i(x_k)\), where χ i is defined in (24). From (24), we can get

$$ \begin{aligned} \chi_{i+1}-\chi_i =& \frac{{{\gamma ^{i + 1}}{\sigma ^i}(\sigma - 1)}}{{{{\left( {\gamma + 1} \right)}^{i + 1}}}} + \frac{{{\gamma ^{i + 1}}{\sigma ^{i + 1}}(\delta - 1)}}{{{{\left( {\gamma + 1} \right)}^{i + 1}}}} \\ &- \frac{{{\gamma ^i}{\sigma ^i}(\delta - 1)}}{{{{\left( {\gamma + 1} \right)}^i}}}. \end{aligned} $$

Setting \(\chi_{i+1} - \chi_i \leq 0\), we can obtain (29). On the other hand, we can see that

$$ \frac{{\gamma + 1}}{\gamma }= 1 + \frac{1}{\gamma } > 1 + \frac{{1 - \frac{1}{\delta }}}{\gamma }=1 + \frac{{\delta - 1}}{{\gamma \delta }}. $$

Hence, for \(\sigma \le 1 + \frac{{\delta - 1}}{{\gamma \delta }}\), according to Theorem 1, the least upper bound \(\bar{V}_i(x_k)\) of the iterative performance index function converges to a finite neighborhood of \(J^*(x_k)\) as \(i\rightarrow\infty\). As \(\bar{V}_i(x_k)\) is the least upper bound of the iterative performance index function, \(\hat{V}_i(x_k)\), which satisfies \(J^*(x_k)\leq \hat{V}_i(x_k) \leq \bar{V}_i(x_k)\), also converges to a finite neighborhood of \(J^*(x_k)\) as \(i\rightarrow\infty\).

Theorem 4

Let \({x_k \in \mathbb{R}^n}\) be an arbitrary controllable state and Assumptions 1–4 hold. For \(i=0,1,\ldots\), let the iterative performance index function \(\hat{V}_i(x_k)\) and the iterative control law \(\hat{v}_i(x_k)\) be obtained by (7)–(10). If Theorem 1 holds for \(\forall x_k\) and σ satisfies (29), then for \(\forall i=0,1,\ldots\), the iterative control law \(\hat{v}_i(x_k)\) is an asymptotically stable control law for system (1).

Proof

The theorem can be proved in three steps.

(1) Show that for \(\forall i=0,1,\ldots\), \(\hat{V}_{i}(x_k)\) is a positive definite function.

According to the iterative ADP algorithm (7)–(10), for i = 0, we have

$$ \hat{V}_0 (x_k ) ={V}_0 (x_k ) = \theta \Uppsi(x_k). $$

As \(\Uppsi(0)=0\), we have that \(\hat{V}_0(x_k)=0\) at \(x_k = 0\). Since \(\Uppsi(x_k)\) is positive definite (see Definition 1), \(\hat{V}_0(x_k)\) is a positive definite function.

Assume that for some i, the iterative performance index function \(\hat{V}_i(x_k)\) is a positive definite function. Then, for i + 1, (10) holds. As U(0,0) = 0, \(\hat{V}_{i+1}(x_k)\) and \(\hat{v}_i(x_k)\) are well defined at \(x_k = 0\). According to Assumptions 1–4, we can get

$$ \hat{V}_{i+1} (0) = U(0,\hat{v}_i (0)) + \hat{V}_i(F(0,\hat{v}_i(0))) =0 $$

for \(x_k = 0\). As \(\hat{V}_i(x_k)\) is a positive definite function, we have \(\hat{V}_i(0)=0\). Then, we have \(\hat{V}_{i+1}(0)=0\). If \(x_k \neq 0\), according to Assumption 4, we have \(\hat{V}_{i+1}(x_k) > 0\). On the other hand, letting \(\|x_k\| \rightarrow \infty\), since \(U(x_k, u_k)\) is a positive function for \(\forall x_k, u_k \neq 0\), we have \(\hat{V}_{i+1}(x_k) \rightarrow \infty\). So, \(\hat{V}_{i+1}(x_k)\) is a positive definite function. The mathematical induction is completed.

(2) Show that \(\hat{v}_i(x_k)\) is an asymptotically stable control law if \(\hat{V}_{i+1}(x_k)\) reaches its least upper bound.

Let \(\bar{V}_i(x_k)\) be defined as in (30). As \(\sigma \le 1 + \frac{{\delta - 1}}{{\gamma \delta }}\), according to Theorem 3, we have \(\bar{V}_{i+1}(x_k)\leq \bar{V}_i(x_k)\). Then, we can get

$$ \bar{V}_{i} (x_k)\geq \bar{V}_{i+1} (x_k) =U({x_k},{\bar{v}_i(x_k)}) + {{\bar V}_{i}}({x_{k + 1}}) $$

where

$$ \bar{v}_i (x_k ) = \mathop{{\hbox{arg}}\min }\limits_{u_k\in {\mathbb{R}}^m} \left\{ {U(x_k,u_k ) + \bar{V}_i (x_{k + 1} )} \right\}+\bar{\rho}_i(x_k). $$

So, we have \({{\bar V}_{i}}({x_{k + 1}})-\bar{V}_{i} (x_k)\leq - U({x_k},{\bar{v}_i(x_k)})\leq 0\). Hence, \(\bar V_i(x_k)\) is a Lyapunov function and \(\bar{v}_i(x_k)\) is an asymptotically stable control law for \(\forall i=0,1,\ldots.\)

(3) Show that \(\hat{v}_i(x_k)\) is an asymptotically stable control law when \(\hat{V}_{i+1}(x_k)\) does not reach its least upper bound.

As \(\bar V_i(x_k)\) is a Lyapunov function, there exist two functions \(\alpha(\|x_k\|)\) and \(\beta(\|x_k\|)\) belonging to class \(\mathcal{K}\) which satisfy \(\alpha(\|x_k\|)\geq \bar V_i(x_k) \geq \beta(\|x_k\|)\) (details can be found in [15]). For \(\forall \varepsilon > 0\), there exists \(\delta(\varepsilon) > 0\) such that \(\beta(\delta) \leq \alpha(\varepsilon)\). So for \(\forall k_0\) and \(\left\|{x(k_0 )}\right\| < \delta (\varepsilon )\), there exists a \(k \in [k_0, \infty)\) that satisfies

$$ \alpha(\varepsilon) \geq \beta({\delta}) \geq \bar V_i (x_{k_0}) \geq \bar V_i (x_k) \geq \alpha (\|x_k\|). $$
(31)

As \(\bar V_i(x_k)\) is the least upper bound of \(\hat V_i(x_k)\) for \(\forall i=0,1,\ldots, \) we have

$$ \hat V_i(x_k)\leq \bar V_i(x_k). $$
(32)

For the Lyapunov function \(\bar V_i(x_k)\), if we let \(k\rightarrow \infty\), then \(\bar V_i(x_k) \rightarrow 0\). So, according to (32), there exists a time instant \(k_1\) that satisfies \(k_0 < k_1 < k\) and makes

$$ \bar V_i (x_{k_0}) \geq \bar V_i (x_{k_1}) \geq \hat V_i (x_{k_0}) \geq \bar V_i (x_{k}) $$

hold. Choose \(\varepsilon_1 >0\) that satisfies \(\hat V_i (x_{k_0}) \geq \alpha(\varepsilon_1) \geq \bar V_i (x_{k})\). Then, there exists \(\bar{\delta}_1(\varepsilon_1) > 0 \) that makes \(\alpha(\varepsilon_1) \geq \beta({\delta}_1) \geq \bar V_i (x_{k}) \) hold. Then, we can obtain

$$ \hat V_i (x_{k_0}) \geq \alpha(\varepsilon_1) \geq \beta(\delta_1) \geq \bar V_i (x_k) \geq \hat V_i (x_k) \geq \alpha(\|x_k\|). $$

According to (31), we have

$$ \begin{aligned} \alpha(\varepsilon) &\geq \beta({\delta}) \geq \hat V_i (x_{k_0}) \geq \alpha(\varepsilon_1) \geq \beta({\delta}_1) \geq \hat V_i (x_k)\\ &\geq \alpha(\|x_k\|). \end{aligned} $$

Since \(\alpha( \left\|x_k\right\| ) \) belongs to class \(\mathcal {K}\), we can obtain \(\left\|{x_k}\right\| \leq \varepsilon\). Therefore, we can conclude that \(\hat{v}_i(x_k)\) is an asymptotically stable control law for all \(\hat{V}_{i}(x_k)\).

Remark 4

According to the analysis in this subsection, we can see that the iterative performance index function \(\hat{V}_i(x_k)\) of the iterative ADP algorithm (7)–(10) possesses different convergence properties for different values of σ.

First, if σ = 1, then the iterative performance index function \(V_i(x_k)\) and the iterative control law \(v_i(x_k)\) can be obtained accurately. The stable iterative ADP algorithm (7)–(10) then reduces to the regular iterative θ-ADP algorithm (5)–(6). It has been proved in [29] that the iterative performance index function \(\hat{V}_i({x_k})\) converges nonincreasingly to \(J^*(x_k)\) and that the iterative ADP algorithm can be implemented both on-line and off-line.

Second, if σ satisfies (29), then the iterative performance index function \(V_i(x_k)\) and the iterative control law \(v_i(x_k)\) cannot be obtained accurately. In this situation, the iterative performance index function \(\hat{V}_i(x_k)\) converges to a finite neighborhood of \(J^*(x_k)\). In the iteration process, the iterative control law \(\hat{v}_i(x_k)\) is still a stable control law, which means that the stable iterative ADP algorithm can also be implemented both on-line and off-line.

Third, when σ satisfies (19) but not (29), the iterative performance index function \(\hat{V}_i(x_k)\) can still converge to a finite neighborhood of \(J^*(x_k)\). In the iteration process, however, the stability of the iterative control law \(\hat{v}_i(x_k)\) cannot be guaranteed, which means that the iterative ADP algorithm can only be implemented off-line.

Finally, when σ does not satisfy (19), the convergence of \(V_i(x_k)\) cannot be guaranteed.
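The three conditions discussed above can be checked directly for given γ, σ, and δ; the small helper below is a sketch of such a check, with the numerical values assumed only for illustration (note that the threshold in (29) is never larger than the one in (19)).

```python
def sigma_regime(sigma, gamma, delta):
    # Classify sigma according to the cases of Remark 4.
    if sigma == 1.0:
        return "accurate iteration: reduces to the theta-ADP algorithm (5)-(6)"
    if sigma <= 1.0 + (delta - 1.0) / (gamma * delta):      # condition (29)
        return "convergent and stable: implementable on-line and off-line"
    if sigma < (gamma + 1.0) / gamma:                       # condition (19)
        return "convergent, but stability not guaranteed: off-line only"
    return "convergence not guaranteed"

for s in (1.0, 1.1, 1.4, 1.6):                              # assumed values
    print(s, "->", sigma_regime(s, gamma=2.0, delta=1.5))
```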

4 Implementation of the stable iterative ADP algorithm by neural networks

In this section, back-propagation (BP) neural networks are introduced to approximate \(\hat{V}_i(x_k)\) and to compute the control law \(v_i(x_k)\). Assume that the number of hidden layer neurons is denoted by l, the weight matrix between the input layer and the hidden layer is denoted by Y, and the weight matrix between the hidden layer and the output layer is denoted by W. Then the output of the three-layer neural network is represented by:

$$ \hat{F}(X,Y,W) = W^T\sigma (Y^TX) $$

where \(\sigma (Y^TX) \in {\mathbb{R}}^l\) and \([\sigma (z)]_i = \frac{e^{z_{i}} - e^{ - z_i }}{e^{z_{i}} + e^{ - z_i }}\), \(i = 1,\ldots, l\), is the activation function.
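The following sketch shows this forward pass for assumed layer sizes (n = 2 inputs, l = 8 hidden neurons, and a scalar output); it merely evaluates the expression above and is not tied to any particular network in the paper.

```python
import numpy as np

def forward(X, Y, W):
    # Output of the three-layer network: W^T * sigma(Y^T * X),
    # with the hyperbolic-tangent activation defined above.
    z = Y.T @ X
    s = (np.exp(z) - np.exp(-z)) / (np.exp(z) + np.exp(-z))   # sigma(z)
    return W.T @ s

rng = np.random.default_rng(0)
Y = rng.standard_normal((2, 8))     # input-to-hidden weights (assumed sizes)
W = rng.standard_normal((8, 1))     # hidden-to-output weights
print(forward(np.array([0.5, -0.3]), Y, W))
```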

Two neural networks, namely the critic network and the action network, are used to implement the stable iterative ADP algorithm. The whole structure diagram is shown in Fig. 1.

Fig. 1 The structure diagram of the algorithm

4.1 The critic network

The critic network is used to approximate the performance index function \(V_i(x_k)\). The output of the critic network is denoted as

$$ \hat{V}_i (x_k) = W_{ci} ^T\sigma (Y_{ci} ^Tx_k). $$
(33)

The target function can be written as

$$ V_{i + 1} (x_k) = U(x_k, \hat{v}_i(x_k))+ \hat{V}_i (x_{k+1}). $$
(34)

Then, we define the error function for the critic network as

$$ e_{\hbox{ci}} (k) = \hat{V}_{i + 1} (x_k) - V_{i + 1} (x_k). $$
(35)

The objective function to be minimized in the critic network training is

$$ E_{\hbox{ci}} (k) = \frac{1}{2}e_{\hbox{ci}}^2 (k). $$
(36)

So, the gradient-based weight update rule [22] for the critic network is given by

$$ \begin{aligned} w_{c(i + 1)} (k) & =w_{\hbox{ci}} (k) + \Updelta w_{\hbox{ci}} (k),\\ &=w_{\hbox{ci}} (k) -\alpha _c \left[{\frac{\partial E_{\hbox{ci}} (k)}{\partial w_{\hbox{ci}}(k)}} \right],\\ &=w_{\hbox{ci}} (k) -\alpha _c\frac{\partial E_{\hbox{ci}} (k)}{\partial \hat {V}_i(x_k)}\frac{\partial \hat{V}_i (x_k)}{\partial w_{\hbox{ci}} (k)}, \end{aligned} $$
(37)

where \(\alpha_c > 0\) is the learning rate of the critic network, and \(w_{ci}(k)\) is the weight vector of the critic network, which stands for either \(W_{ci}\) or \(Y_{ci}\). If we define

$$ {q_l}(k) = \sum\limits_{j = 1}^{{L_c}} {Y_{\hbox{ci}}^{{l_j}}(k){x_{jk}}}, \quad l = 1, 2, \ldots,{L_c} $$

and

$$ {p_l}(k) = \frac{{{e^{{q_l}(k)}} - {e^{ - {q_l}(k)}}}}{{{e^{{q_l}(k)}} + {e^{ - {q_l}(k)}}}},\qquad \quad l = 1,2, \ldots,{L_c}, $$

then we have

$$ {{\hat V}_{i + 1}}({x_k}) = \sum\limits_{l = 1}^{{L_c}} {W_{\hbox{ci}}^l(k){p_l}(k)}, $$

where \(L_c\) is the total number of hidden nodes in the critic network. By applying the chain rule, the adaptation of the critic network is summarized as follows.

The hidden-to-output layer weights of the critic network are updated as

$$ \begin{aligned} \Updelta W_{\hbox{ci}}^l(k) &= - {\alpha _c}\frac{{\partial {E_{\hbox{ci}}}(k)}}{{\partial {{\hat V}_{i + 1}}({x_k})}}\frac{{\partial {{\hat V}_{i + 1}}({x_k})}}{{\partial W_{\hbox{ci}}^l(k)}} \\ &= - {\alpha _c}{e_{\hbox{ci}}}(k){p_l}(k). \end{aligned} $$

The input-to-hidden layer weights of the critic network are updated as

$$ \begin{aligned} \Updelta Y_{\hbox{ci}}^l(k) =& - {\alpha _c}\frac{{\partial {E_{\hbox{ci}}}(k)}}{{\partial {{\hat V}_{i + 1}}({x_k})}}\frac{{\partial {{\hat V}_{i + 1}}({x_k})}}{{\partial Y_{\hbox{ci}}^l(k)}} \\ =& - {\alpha _c}\frac{{\partial {E_{\hbox{ci}}}(k)}}{{\partial {{\hat V}_{i + 1}}({x_k})}}\frac{{\partial {{\hat V}_{i + 1}}({x_k})}}{{\partial {p_l}(k)}}\frac{{\partial {p_l}(k)}}{{\partial {q_l}(k)}}\frac{{\partial {q_l}(k)}}{{\partial Y_{\hbox{ci}}^l(k)}} \\ =& - {\alpha _c}{e_{\hbox{ci}}}(k)W_{\hbox{ci}}^l(k)\left[ {\frac{1}{2}\left( {1 - p_l^2(k)} \right)} \right]{x_{lk}}. \end{aligned} $$
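A minimal sketch of one such critic update is given below; the tanh activation matches the network defined above, while the layer sizes, the scalar-valued critic, and the helper name `critic_update` are assumptions of this sketch rather than the authors' implementation.

```python
import numpy as np

def critic_update(x_k, target, W_ci, Y_ci, alpha_c=0.01):
    # One gradient-based update (37); `target` is V_{i+1}(x_k) from (34).
    q = Y_ci.T @ x_k                       # q_l(k)
    p = np.tanh(q)                         # p_l(k)
    V_hat = float(W_ci.T @ p)              # hat V_{i+1}(x_k), as in (33)
    e_c = V_hat - target                   # e_ci(k), as in (35)

    # Hidden-to-output update: Delta W_ci^l = -alpha_c * e_ci(k) * p_l(k).
    W_new = W_ci - alpha_c * e_c * p[:, None]
    # Input-to-hidden update, following the rule written above (the exact tanh
    # derivative would be (1 - p_l^2) without the 1/2 factor).
    grad_hidden = W_ci[:, 0] * 0.5 * (1.0 - p ** 2)
    Y_new = Y_ci - alpha_c * e_c * np.outer(x_k, grad_hidden)
    return W_new, Y_new, 0.5 * e_c ** 2    # E_ci(k), as in (36)

# Illustrative usage with assumed sizes (2 inputs, 8 hidden neurons).
rng = np.random.default_rng(0)
W, Y = rng.standard_normal((8, 1)), rng.standard_normal((2, 8))
W, Y, E = critic_update(np.array([0.5, -0.3]), target=1.2, W_ci=W, Y_ci=Y)
print(E)
```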

4.2 The action network

In the action network, the state \(x_k\) is used as the input to generate the control law as the output of the network. The output can be formulated as

$$ \hat{v}_i (x_k) = W_{\hbox{ai}}^T\sigma (Y_{\hbox{ai}} ^Tx_k). $$

The target output of the action network is given by (5). So, we can define the output error of the action network as

$$ e_{\hbox{ai}} (k) = \hat{v}_i (x_k) - v_i (x_k). $$

The weights in the action network are updated to minimize the following performance error measure:

$$ E_{\hbox{ai}} (k) = \frac{1}{2}e_{\hbox{ai}}^2 (k). $$

The weights updating algorithm is similar to the one for the critic network. By the gradient descent rule, we can obtain

$$ \begin{aligned} w_{a(i + 1)} (k) & =w_{\hbox{ai}} (k) + \Updelta w_{\hbox{ai}} (k), \\ &=w_{\hbox{ai}} (k)-\beta _a \left[ { \frac{\partial E_{\hbox{ai}} (k)}{\partial w_{\hbox{ai}} (k)}} \right] \\ &=w_{\hbox{ai}} (k)-\beta _a\frac{\partial E_{\hbox{ai}} (k)}{\partial e_{\hbox{ai}} (k)}\frac{\partial e_{\hbox{ai}} (k)}{\partial \hat{v}_i (k)}\frac{\partial \hat{v}_i (k)}{\partial w_{\hbox{ai}} (k)}, \end{aligned} $$
(38)

where \(\beta_a > 0\) is the learning rate of the action network.

If we define

$$ {g_l}(k) = \sum\limits_{j = 1}^{{L_a}} {Y_{\hbox{ai}}^{{l_j}}(k){x_{jk}}}, \quad l = 1,2, \ldots,{L_a} $$

and

$$ {h_l}(k) = \frac{{{e^{{g_l}(k)}} - {e^{ - {g_l}(k)}}}}{{{e^{{g_l}(k)}} + {e^{ - {g_l}(k)}}}},\qquad\quad l = 1,2, \ldots,{L_a}, $$

then we have

$$ {{\hat v}_{i}}({x_k}) = \sum\limits_{l = 1}^{{L_a}} {W_{\hbox{ai}}^l(k){h_l}(k)}, $$

where \(L_a\) is the total number of hidden nodes in the action network. By applying the chain rule, the adaptation of the action network is summarized as follows.

The hidden-to-output layer weights of the action network are updated as

$$ \begin{aligned} \Updelta W_{\hbox{ai}}^l(k) &= - {\beta _a}\frac{{\partial {E_{\hbox{ai}}}(k)}}{{\partial {{\hat v}_{i }}({x_k})}}\frac{{\partial {{\hat v}_{i }}({x_k})}}{{\partial W_{\hbox{ai}}^l(k)}} \\ &= - {\beta _a}{e_{\hbox{ai}}}(k){h_l}(k). \end{aligned} $$

The input-to-hidden layer weights of the action network are updated as

$$ \begin{aligned} \Updelta Y_{\hbox{ai}}^l(k) &= - {\beta _a}\frac{{\partial {E_{\hbox{ai}}}(k)}}{{\partial {{\hat v}_{i}}({x_k})}}\frac{{\partial {{\hat v}_{i }}({x_k})}}{{\partial Y_{\hbox{ai}}^l(k)}} \\ &= - {\beta _a}\frac{{\partial {E_{\hbox{ai}}}(k)}}{{\partial {{\hat v}_{i}}({x_k})}}\frac{{\partial {{\hat v}_{i}}({x_k})}}{{\partial {h_l}(k)}}\frac{{\partial {h_l}(k)}}{{\partial {g_l}(k)}}\frac{{\partial {g_l}(k)}}{{\partial Y_{\hbox{ai}}^l(k)}} \\ &= - {\beta _a}{e_{\hbox{ai}}}(k)W_{\hbox{ai}}^l(k)\left[ {\frac{1}{2}\left( {1 - h_l^2(k)} \right)} \right]{x_{lk}}. \end{aligned} $$

The detailed neural training algorithm can also be seen in [22].
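In practice, the action-network target \(v_i(x_k)\) in (5) also has to be computed numerically; the sketch below obtains it by enumerating a discretized control grid with a given critic standing in for \(\hat{V}_i\). The grid, the quadratic critic, and the one-dimensional dynamics are assumptions of this sketch, since the algorithm itself is formulated over a continuous control space.

```python
import numpy as np

def action_target(x_k, critic, F, U, u_grid):
    # Approximate v_i(x_k) in (5) by enumerating candidate controls.
    costs = [U(x_k, u) + critic(F(x_k, u)) for u in u_grid]
    return u_grid[int(np.argmin(costs))]

# Illustrative usage with hypothetical one-dimensional dynamics and a
# quadratic critic standing in for hat V_i.
F = lambda x, u: 0.9 * x + 0.5 * u
U = lambda x, u: x ** 2 + u ** 2
critic = lambda x: 3.0 * x ** 2
u_grid = np.linspace(-1.0, 1.0, 201)
print(action_target(0.5, critic, F, U, u_grid))
```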

5 Simulation studies

Example 1

Our first example is chosen as the one in [8, 19, 37]. Consider the following discrete-time affine nonlinear system

$$ {x_{k + 1}} = f({x_k}) + g({x_k}){u_k}, $$
(39)

where \(f({x_k}) = \left[{\begin{array}{c} { - 0.8{x_{2k}}} \\ {\sin (0.8{x_{1k}} - {x_{2k}}) + 1.8{x_{2k}}} \\ \end{array}} \right]\) and \(g({x_k}) = \left[ {\begin{array}{c} 0 \\ { - {x_{2k}}} \\ \end{array}} \right]\). Let the initial state be \(x_0 = [0.5, 0.5]^T\). Let the performance index function be of the quadratic form

$$ J({x_0}, \underline {u}_0) = \sum\limits_{k = 0}^{\infty} {\left(x_k^TQ{x_k} + u_k^TR{u_k}\right)} $$

where Q = R = I, and I denotes the identity matrix with suitable dimensions. Neural networks are used to implement the stable iterative ADP algorithm. The critic network and the action network are chosen as three-layer BP neural networks with structures 2–8–1 and 1–8–1, respectively. Let θ = 7 and \(\Uppsi(x_k)=x^T_kQx_k\) to initialize the algorithm. We choose a random array of states in [−0.5, 0.5] to train the neural networks. In each iteration step, the critic network and the action network are trained for 800 steps with learning rates \(\alpha_c = \beta_a = 0.01\) so that the computation precision \(10^{-6}\) is reached. The stable iterative ADP algorithm is implemented for 30 iteration steps to guarantee the convergence of the iterative performance index function. The admissible approximation error obtained by (29) is shown in Fig. 2. From Fig. 2, we can see that the computation precision \(10^{-6}\) satisfies (29). Hence, the iterative performance index function is convergent. The trajectory of the iterative performance index function is shown in Fig. 3a.
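For reference, the sketch below encodes the dynamics of system (39), the quadratic utility with Q = R = I, and the initialization θ = 7, \(\Uppsi(x_k)=x_k^TQx_k\) used here; the neural-network training itself is omitted, so this is only a setup sketch.

```python
import numpy as np

def f(x):
    # Drift term of system (39).
    return np.array([-0.8 * x[1],
                     np.sin(0.8 * x[0] - x[1]) + 1.8 * x[1]])

def g(x):
    # Input gain of system (39).
    return np.array([0.0, -x[1]])

def F(x, u):
    # x_{k+1} = f(x_k) + g(x_k) * u_k, with scalar control u_k.
    return f(x) + g(x) * u

Q, R = np.eye(2), np.eye(1)
U = lambda x, u: float(x @ Q @ x + u * R[0, 0] * u)   # quadratic utility
theta, Psi = 7.0, (lambda x: float(x @ Q @ x))        # V_0 = theta * Psi

x0 = np.array([0.5, 0.5])
print(U(x0, 0.0), theta * Psi(x0))
```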

Fig. 2 The trajectories of the admissible approximation error

Fig. 3 The trajectories of the iterative performance index functions. a Iterative performance index function by stable iterative ADP algorithm. b Iterative performance index function by value iteration algorithm

As the value iteration algorithm is one of the most basic iterative ADP algorithms, comparisons between the stable iterative ADP algorithm and the value iteration algorithm in [19] are presented to show the effectiveness of the present algorithm. In [5, 19], it is proved that the iterative performance index function converges nondecreasingly to the optimum. The trajectory of the iterative performance index function obtained by value iteration is shown in Fig. 3b.

From Fig. 3a, b, we can see that the iterative performance index functions obtained by the stable iterative ADP algorithm and the value iteration algorithm both converge, which shows the effectiveness of the present algorithm. Applying the converged iterative control law obtained by the stable iterative ADP algorithm to the control system (39) for \(T_f = 40\) time steps, we obtain the optimal system states and optimal control trajectories. The trajectories of the optimal system states obtained by the stable iterative ADP algorithm are shown in Fig. 4a, and the corresponding optimal control trajectory is shown in Fig. 4b. The trajectories of the optimal system states obtained by the value iteration algorithm are shown in Fig. 4c, and the corresponding optimal control trajectory is shown in Fig. 4d. From the simulation results, we can see that the present stable iterative ADP algorithm achieves effective results.

Fig. 4 The trajectories of optimal states and controls. a Optimal states by stable iterative ADP algorithm. b Optimal control by stable iterative ADP algorithm. c Optimal states by value iteration algorithm. d Optimal control by value iteration algorithm

For the value iteration algorithm [37] and many other iterative ADP algorithms [3, 5, 16, 24, 40, 41], approximation errors were not considered. In the next example, we will analyze the convergence and stability properties under different approximation errors.

Example 2

Our second example is chosen as the one in [13, 26, 30]. Consider the following discrete-time nonaffine nonlinear system

$$ {x_{k + 1}} = F(x_k,u_k)={x_k} + \sin (0.1x_k^2 + {u_k}) $$
(40)

with \(x_0 = 1\). Let the performance index function be the same as in Example 1. Neural networks are used to implement the stable iterative ADP algorithm. The critic network and the action network are chosen as three-layer BP neural networks with structures 1–8–1 and 1–8–1, respectively. Let θ = 8 and \(\Uppsi(x_k)=x^T_kQx_k\) to initialize the algorithm. We choose a random array of states in [−0.5, 0.5] to train the neural networks. For the critic network and the action network, let the learning rates be \(\alpha_c = \beta_a = 0.01\). The stable iterative ADP algorithm is implemented for 40 iteration steps. To show the effectiveness of the present stable iterative ADP algorithm, we choose four different approximation errors, \(\epsilon =10^{-6}, 10^{-4}, 10^{-3}, 10^{-1}\), respectively. The trajectories of the iterative performance index functions under these four approximation errors are shown in Fig. 5a–d, respectively.
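Analogously, the setup of Example 2 is sketched below: system (40), the same quadratic utility, the initialization θ = 8, \(\Uppsi(x_k)=x_k^TQx_k\), and the four approximation-error levels being compared; how each error level is enforced during network training is an implementation detail omitted here.

```python
import numpy as np

F = lambda x, u: x + np.sin(0.1 * x ** 2 + u)   # system (40)
U = lambda x, u: x ** 2 + u ** 2                # quadratic utility (Q = R = 1)
theta, Psi = 8.0, (lambda x: x ** 2)            # V_0(x_k) = theta * Psi(x_k)
x0 = 1.0

# Approximation-error levels compared in this example.
epsilons = [1e-6, 1e-4, 1e-3, 1e-1]
print(F(x0, 0.0), theta * Psi(x0), epsilons)
```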

Fig. 5 The trajectories of the iterative performance index functions. a The approximation error \(\epsilon =10^{-6}\). b The approximation error \(\epsilon =10^{-4}\). c The approximation error \(\epsilon =10^{-3}\). d The approximation error \(\epsilon =10^{-1}\)

In this paper, we have proved that if the approximation error satisfies (29), then the iterative performance index function is convergent. As the approximation errors are different for different states \(x_k\), the admissible error trajectory obtained by (29) is shown in Fig. 6a. In [30], it is shown that if the approximation error satisfies (19), then the iterative performance index function is also convergent. The admissible error trajectory obtained by (19) is shown in Fig. 6b.

Fig. 6 The trajectories of the admissible approximation error. a The trajectory of the admissible approximation error by (29). b The trajectory of the admissible approximation error by (19)

From Figs. 5 and 6, we can see that for the approximation errors \(\epsilon=10^{-6}\) and \(\epsilon=10^{-4}\), the iterative performance index functions in Fig. 5a, b are both convergent. The corresponding trajectories of the control and the state are displayed in Fig. 7a, b, respectively, where the implementation time is \(T_f = 15\). For the approximation error \(\epsilon=10^{-3}\), the approximation error satisfies (19). From [30], we know that the iterative performance index function is also convergent, and its convergence trajectory is shown in Fig. 5c. In this case, however, the iterative performance index function does not converge monotonically. The corresponding control and state trajectories are displayed in Fig. 7c. For the approximation error \(\epsilon=10^{-1}\), the iterative performance index function is no longer convergent because of the large approximation error. The trajectories of the control and the state are displayed in Fig. 7d.

Fig. 7 The trajectories of states and controls under the approximation errors. a The approximation error \(\epsilon =10^{-6}\). b The approximation error \(\epsilon =10^{-4}\). c The approximation error \(\epsilon =10^{-3}\). d The approximation error \(\epsilon =10^{-1}\)

According to Theorem 4, we know that if the approximation error \(\epsilon\) satisfies (29), then the iterative control law is stable. To show the effectiveness of the present ADP algorithm, we display the state and control trajectories under the control law \(\hat{v}_0(x_k)\) with different approximation errors. The results are shown in Fig. 8a–d, respectively. From Fig. 8a, b, we can see that the system is stable under the iterative control law \(\hat{v}_0(x_k)\) with the approximation errors \(\epsilon =10^{-6}\) and \(\epsilon =10^{-4}\). In this paper, we have shown that the approximation error condition in [30] cannot guarantee the stability of the control system. From Fig. 8c, we can see that for the approximation error \(\epsilon =10^{-3}\), the system is not stable under the control law \(\hat{v}_0(x_k)\). From Fig. 8d, for \(\epsilon =10^{-1}\), as the iterative performance index function is not convergent, the stability of the system cannot be guaranteed.

Fig. 8 The trajectories of states and controls under the control law \(\hat{v}_0(x_k)\) with different approximation errors. a The approximation error \(\epsilon =10^{-6}\). b The approximation error \(\epsilon =10^{-4}\). c The approximation error \(\epsilon =10^{-3}\). d The approximation error \(\epsilon =10^{-1}\)

6 Conclusions

In this paper, an effective stable iterative ADP algorithm is developed to find the infinite-horizon optimal control for discrete-time nonlinear systems. In the present stable iterative ADP algorithm, every iterative control law is stable for the nonlinear system, which makes the present algorithm feasible for implementation both on-line and off-line. Convergence of the performance index function of the iterative ADP algorithm is proved, and stability proofs are also given. Neural networks are used to implement the present ADP algorithm. Finally, simulation results are given to illustrate the performance of the present algorithm.