
1 Introduction

Adaptive dynamic programming (ADP) has been an active research area ever since it was proposed by Werbos [1]. ADP is a useful and powerful intelligent method for solving nonlinear control problems. To obtain the optimal control law, corresponding iterative learning methods have been applied to analyze the convergence and optimality properties of ADP [2,3,4,5,6,7].

It has to be noted that the iterative control laws and iterative value functions usually have to be updated over the whole state space [8,9,10,11,12,13,14,15,16,17,18]; such methods are known as “global policy iteration algorithms”. Global policy iteration algorithms suffer from low efficiency in applications: the algorithm often has to pause until a sweep of the entire state space is completed, which lowers the computational efficiency. This constraint has hindered the development of this research area. Therefore, more efficient policy iteration algorithms are needed.

This paper proposes a new “local policy iteration algorithm” for discrete-time nonlinear systems, in which the iterations are carried out over a small region. The algorithm updates the iterative control laws and iterative value functions only within a given subset of the state space. Although the iterative control laws are updated within a preset subset of the state space, the system remains stable under every iterative control law. Finally, the simulation section demonstrates the good performance of this newly developed method.

2 Problem Statement

Consider the deterministic discrete-time nonlinear system

$$\begin{aligned} s_{k+1}=F(s_k, c_k), \ k=0, 1, 2, \dots , \end{aligned}$$
(1)

where \(s_k \in {\mathbb {R}}^n\) is the state vector, \(c_k\in {\mathbb {R}}^m\) is the control vector, \(s_0\) is the initial state, and \(F(s_k,c_k)\) is the system function. Let \({\underline{c}}_k=(c_k,c_{k+1},\dots )\) denote an arbitrary sequence of controls starting at time \(k\). The performance index function is defined as

$$\begin{aligned} J(s_0,{\underline{c}}_0)=\sum _{k=0}^{\infty }U(s_k,c_k), \end{aligned}$$
(2)

for state \(s_0\) under the control sequence \({\underline{c}}_0=(c_0,c_1,\dots )\). The utility function \(U(s_k,c_k)\) is positive definite in \(s_k\) and \(c_k\). Note that \({\underline{c}}_k\) runs from time \(k\) to \(\infty \).

We aim to find an optimal control scheme that minimizes the performance index function (2) while stabilizing system (1).
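As a brief illustration of the performance index (2), the following Python sketch sums the utility along a finite control sequence; the function and variable names (`F`, `U`, `truncated_performance_index`) are placeholders for illustration only, and the infinite sum is simply truncated at the length of the supplied sequence.

```python
import numpy as np

def truncated_performance_index(F, U, s0, controls):
    """Approximate J(s_0, c_0) in (2) by accumulating the utility
    along a finite control sequence (the infinite tail is dropped)."""
    s, J = np.asarray(s0, dtype=float), 0.0
    for c in controls:
        J += U(s, c)      # accumulate U(s_k, c_k)
        s = F(s, c)       # advance the state: s_{k+1} = F(s_k, c_k)
    return J
```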

Define the set of control sequences as \(\underline{{\mathfrak {U}}}_k=\big \{{\underline{c}}_k :{\underline{c}}_k=(c_k, c_{k+1}, \ldots ),\, \forall c_{k+i}\in {\mathbb {R}}^m, i=0,1,2,\ldots \big \}\).

Then, for an arbitrary control sequence \({\underline{c}}_k \in \underline{{\mathfrak {U}}}_k\), the optimal performance index function is

$$\begin{aligned} J^*(s_k)=\inf _{{\underline{c}}_k} \left\{ J(s_k,{\underline{c}}_k):{\underline{c}}_k\in \underline{{\mathfrak {U}}}_{k}\right\} . \end{aligned}$$
(3)

Based on the Bellman principle of optimality, \(J^*(s_k)\) satisfies the discrete-time HJB equation

$$\begin{aligned} J^*(s_k)=\inf _{c_k} \left\{ U(s_k,c_k)+J^*(F(s_k,c_k))\right\} . \end{aligned}$$
(4)

Define the optimal control law as

$$\begin{aligned} c ^*(s_k)={\arg \inf _{c_k }} \left\{ U(s_k,c_k)+J^*(F(s_k,c_k))\right\} . \end{aligned}$$
(5)

Therefore, the HJB equation (4) can be written as

$$\begin{aligned} J^*(s_k)=U(s_k,c ^*(s_k))+J^*(F(s_k,c ^*(s_k))). \end{aligned}$$
(6)

However, because of the curse of dimensionality, it is very difficult to obtain numerical solutions of (6) with traditional dynamic programming algorithms. Motivated by this, a new ADP algorithm is developed in the following section.

3 Descriptions of This New Local Iterative ADP Algorithm

This section describes the new local iterative ADP algorithm, which is designed to obtain the optimal control law for system (1). Let \(\{\varTheta _s^{i }\}\) denote a sequence of state subsets with \(\varTheta _s^{i } \subseteq \varOmega _s\) for all \(i\). The iterative value functions and control laws of the developed algorithm are updated as follows.

For all \( s_k\in \varOmega _s\), let \(v_0(s_k)\) be an admissible control law and let \(V_0(s_k)\) be the initial iterative value function, which satisfies the generalized HJB (GHJB) equation

$$\begin{aligned} V_0(s_k) = U(s_k,v_0(s_k))+V_0(s_{k+1}), \end{aligned}$$
(7)

where \(s_{k+1}=F(s_k,v_0(s_k))\). Then, for all \( s_k \in \varTheta _s^{0}\), the local iterative control law \(v_1(s_k)\) is computed as

$$\begin{aligned} v_1 (s_k ) =&\, \arg \mathop {\min }\limits _{c_k } \left\{ {U(s_k ,c_k ) + V_0 (s_{k + 1})} \right\} \end{aligned}$$
(8)

and let \(v_1(s_k)=v_0(s_k)\), for all \( s_k \in {\varOmega _s}\mathrm{{\backslash }}{\varTheta }_s^0\).

For all \( s_k \in \varOmega _s\), let \(V_1(s_k)\) be the iterative value function, which satisfies the GHJB equation

$$\begin{aligned} V_1 (s_k ) = U(s_k ,v_1 (s_k )) + V_1 (F(s_k ,v_1(s_k ))). \end{aligned}$$
(9)

For \( i=1,2,\ldots \), let \(V_i (s_k )\) be the iterative value function, which satisfies the following GHJB equation

$$\begin{aligned} V_i (s_k ) = U(s_k ,v_i (s_k )) + V_i (F(s_k ,v_i (s_k ))). \end{aligned}$$
(10)

For all \( s_k \in \varTheta _s^{i}\), the iterative control law \(v_{i+1}(s_k)\) is computed as

$$\begin{aligned} v_{i+1} (s_k ) =&\, \arg \mathop {\min }\limits _{c_k} \left\{ {U(s_k ,c_k ) + V_i (s_{k + 1} )} \right\} \nonumber \\ =&\,\arg \mathop {\min }\limits _{c_k} \left\{ {U(s_k ,c_k ) + V_i (F(s_k,c_k) )} \right\} , \end{aligned}$$
(11)

and for all \( s_k \in {\varOmega _s}\mathrm{{\backslash }}{\varTheta }_s^i\), let \(v_{i+1}(s_k)=v_{i}(s_k)\).

According to (7) and (11), the local policy iteration algorithm updates the iterative value function and control law only within a preset subset of the state space, which is part of the whole state space. Therefore, as soon as the local state data become available during the iterations, the developed algorithm can proceed immediately, instead of waiting for the data of the whole state space as traditional global algorithms do. This greatly improves the computational efficiency. Moreover, if the preset subset is enlarged to the whole state space, the local policy iteration algorithm reduces to the global policy iteration algorithm.
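The following Python sketch illustrates one iteration of the local scheme under simplifying assumptions that are not part of the paper: the GHJB equation (10) is approximated by a finite-horizon rollout of the admissible policy over sampled states, the minimization in (11) is taken over a discretized control grid, and `V_lookup` is assumed to be some approximation of \(V_i\) at arbitrary states (e.g., a nearest-neighbor or interpolated table). All names are illustrative.

```python
import numpy as np

def evaluate_policy(F, U, policy, states, horizon=200):
    """Approximate the GHJB solution V_i in (10) by rolling out the
    admissible policy v_i from each sampled state over a finite horizon."""
    V = {}
    for s0 in states:
        s, cost = np.asarray(s0, dtype=float), 0.0
        for _ in range(horizon):
            c = policy(s)
            cost += U(s, c)
            s = F(s, c)
        V[tuple(np.asarray(s0, dtype=float))] = cost
    return V

def improve_policy_locally(F, U, V_lookup, policy, local_set, control_grid):
    """Local improvement (11): minimize U(s,c) + V_i(F(s,c)) over a control
    grid, but only for states in Theta_s^i; elsewhere keep the old law."""
    updated = {}
    for s in local_set:
        s = np.asarray(s, dtype=float)
        costs = [U(s, c) + V_lookup(F(s, c)) for c in control_grid]
        updated[tuple(s)] = control_grid[int(np.argmin(costs))]

    def new_policy(s):
        # Outside Theta_s^i the previous control law is kept: v_{i+1} = v_i.
        return updated.get(tuple(np.asarray(s, dtype=float)), policy(s))

    return new_policy
```

In this sketch the saving comes from the fact that `improve_policy_locally` touches only the states in `local_set` (the preset subset \(\varTheta_s^{i}\)); taking `local_set` equal to all sampled states recovers the global update.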

4 Simulation Examples

First, consider the following discretized nonaffine nonlinear system:

$$\begin{aligned} {s_{1(k + 1)}}&= (1 - \varDelta T){s_{1k}} + \varDelta T{s_{2k}}{c_k},\nonumber \\ {s_{2(k + 1)}}&= (1 - \varDelta T){s_{2k}} + \varDelta T(1 + s_{1k}^2){c_k} + \varDelta Tc_k^3. \end{aligned}$$
(12)

We choose the utility function with weight matrices \(Q=I_1\) and \(R=I_2\), where \(I_1\) and \(I_2\) denote identity matrices with suitable dimensions, and take \(\varOmega _s\) as the state space. Let the initial state be \(s_0=[1,-1]^{\mathsf {T}}\). The iterations are implemented based on Algorithm 1 in [16].
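For reference, a minimal sketch of the example system (12) and its utility is given below. The quadratic form of the utility and the sampling interval \(\varDelta T = 0.1\) are assumptions for illustration, since neither is stated explicitly in the text.

```python
import numpy as np

DT = 0.1  # assumed sampling interval Delta T (not given in the text)

def F(s, c):
    """System (12): discretized nonaffine nonlinear dynamics."""
    s1, s2 = s
    s1_next = (1.0 - DT) * s1 + DT * s2 * c
    s2_next = (1.0 - DT) * s2 + DT * (1.0 + s1**2) * c + DT * c**3
    return np.array([s1_next, s2_next])

def U(s, c):
    """Assumed quadratic utility with Q = I and R = I (scalar control)."""
    return float(s @ s + c * c)

s0 = np.array([1.0, -1.0])  # initial state s_0 = [1, -1]^T
```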

Fig. 1. Simulation results of the new local policy iteration algorithm. (a) Iterative value function. (b) State trajectories. (c) Control trajectories. (d) Optimal state and control trajectories.

The iterative value functions and iterative control laws are then updated accordingly. After 30 iterations, the algorithm reaches the prescribed computation precision \(\varepsilon = 0.001\). Figure 1(a) shows that the iterative value function is monotonically nonincreasing and converges to the optimum. Figure 1(b) illustrates the state trajectories, while Fig. 1(c) shows the control trajectories. Figure 1(d) presents the optimal state and control trajectories.

5 Conclusion

In this paper, we have proposed a new local policy iteration ADP algorithm for discrete-time nonlinear systems. Compared with traditional global policy iteration algorithms, it greatly improves computational efficiency and reduces computation time. The key feature of the developed algorithm is that the iterative control laws and iterative value functions are updated only within a preset subset of the state space. Simulation results have demonstrated the effectiveness of the developed algorithm.