1 Introduction

Markov jump linear systems (MJLSs), first proposed by Krasovskii and Lidskii [1] in 1961, can be regarded as a class of multi-model stochastic systems. An MJLS involves two mechanisms, i.e., the modes and the states. The modes are the jumping dynamics, modeled by a finite-state Markov chain. The states are continuous or discrete, modeled by a set of differential or difference equations. With the development of control science and stochastic theory, MJLSs have attracted wide attention and many research results are available, such as stochastic stability and stabilizability [2,3,4], controllability [5,6,7,8,9] and robust estimation and filtering [10,11,12].

In recent years, the adaptive optimal control problem has become a focal issue in controller design, and many related works have been published. For example, the authors in [13] studied adaptive surface optimal control methods for strict-feedback systems. Then, an observer-based adaptive fuzzy control law was proposed for nonlinear nonstrict-feedback systems [14]. A general method for solving the optimal control problem through the associated algebraic Riccati equations (AREs) was studied in [15]. In [16], Kleinman considered an offline iterative technique to solve the AREs. In [17, 18], an iterative algorithm and a temporal difference method were, respectively, proposed to solve the discrete-time coupled AREs. To obtain the solutions of a set of coupled AREs for MJLSs, a Lyapunov iteration method [19] was derived. For more related research, we refer to [20,21,22,23,24]. However, all these algorithms are offline and are difficult to implement when the system dynamics are unknown.

Reinforcement learning (RL) and approximate/adaptive dynamic programming (ADP) algorithms are widely applied to solve optimal control problems [25,26,27,28]. Moreover, RL and ADP algorithms can obtain the optimal controller through online learning. In [29], an online feedback control scheme was designed for linear systems by applying the RL and ADP method. It was then extended to solve the zero-sum differential game [30]. In [31, 32], adaptive policy iteration algorithms were considered for continuous-time and discrete-time nonlinear systems, respectively. Then, reinforcement Q-learning for the optimal tracking control problem was studied for linear discrete-time systems with unknown dynamics [33] and nonlinear discrete-time systems with unknown nonaffine dead-zone input [34]. Applying the \( H_{\infty } \) state feedback scheme, an online policy update algorithm was proposed in [35]. For more results on this topic, we refer readers to [36,37,38,39,40,41,42,43,44].

It is worth pointing out that the methods related to RL and ADP are mostly used to solve optimal controller design problems for general linear/nonlinear dynamics. Because MJLSs contain both states and modes, the mode-dependent switching subsystems are coupled with each other. Obviously, the RL and ADP algorithms mentioned above cannot directly solve the adaptive optimal control problems of MJLSs. To study the online control problem for MJLSs, the authors in [45] considered an online adaptive optimal control scheme by applying a novel policy iteration algorithm. The relevant results were then extended to Itô stochastic systems [46] and to the \( H_{\infty } \) optimal control problem [47] of MJLSs. However, these results require partial knowledge of the system dynamics; they can therefore be considered only partially model-free.

In this paper, we propose a novel parallel RL algorithm to solve the optimal control problem of MJLSs with completely unknown dynamics. This algorithm is based on the subsystem transformation technique and a parallel RL scheme. The subsystem transformation technique is used to obtain the N AREs. Then, the corresponding N AREs can be solved in parallel by collecting and learning the subsystem state and input information. Compared with the published results in [45, 46], the system dynamics are not used in this paper, and an exploration noise is added to describe the actual control input. Moreover, the solutions can be obtained directly in one step by repeated iteration and computation, without separate policy evaluation and improvement. The iterative algorithm in this paper is thus effective for studying the online optimal control problems of MJLSs, and its convergence is proved. In order to demonstrate the effectiveness of the designed method, two simulation examples are given at the end.

The main contributions of this paper can be summarized as follows:

  1. A novel parallel RL algorithm with completely unknown dynamic information is first proposed to solve the optimal control problem of MJLSs;

  2. To deal with the jumping modes, the subsystem transformation technique is used to decompose the MJLSs;

  3. Compared with the results in [19], the subsystem state and input information is collected and learned online to capture the dynamics information, and the iterative results are then optimized and updated to obtain the optimal control policy. The designed parallel RL algorithm is online and does not require the dynamic information of the MJLSs.

The remainder of this paper is organized as follows. The problem formulation and preliminaries are introduced in Sect. 2. The novel parallel iteration algorithm is proposed in Sect. 3.1, the online implementation of the algorithm is given in Sect. 3.2, and Sect. 3.3 provides the convergence proof of the designed algorithm. Two simulation examples are shown in Sect. 4. A conclusion is given in Sect. 5.


Notations Throughout this paper, \( \Re^{n} \) denotes the n-dimensional real Euclidean space and \( \Re^{n \times n} \) the set of \( n \times n \) real matrices; \( \lambda (A) \) denotes the eigenvalues of A; \( \text{rank}(A) \) denotes the rank of A; \( Z_{ + } \) denotes the set of non-negative integers; \( {\mathbf{E}}\{ \cdot \} \) is the mathematical expectation of a stochastic process or vector; \( \left\| \cdot \right\| \) represents the Euclidean norm; \( \otimes \) represents the Kronecker product, which has the following properties: (1) \( (A \otimes B)^{\text{T}} = A^{\text{T}} \otimes B^{\text{T}} \), (2) \( x^{\text{T}} Px = (x^{\text{T}} \otimes x^{\text{T}} )\text{vec}(P) \). For a matrix \( A \in \Re^{n \times n} \), we denote \( \text{vec}(A) = [a_{1}^{\text{T}} ,\;a_{2}^{\text{T}} ,\; \ldots ,\;a_{n}^{\text{T}} ]^{\text{T}} \), where \( a_{i} \in \Re^{n} \) is the ith column of \( A \).
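As a quick illustration of property (2), the quadratic form \( x^{\text{T}} Px \) can be evaluated through \( \text{vec}(P) \). The following minimal numerical check (a Python/NumPy sketch added here for illustration only) verifies the identity:

```python
import numpy as np

n = 3
rng = np.random.default_rng(0)
x = rng.standard_normal(n)
P = rng.standard_normal((n, n))
P = (P + P.T) / 2                      # symmetric P

vec_P = P.flatten(order="F")           # vec(P): stack the columns of P
lhs = x @ P @ x                        # x^T P x
rhs = np.kron(x, x) @ vec_P            # (x^T (kron) x^T) vec(P)
assert np.isclose(lhs, rhs)
```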

2 Backgrounds and preliminaries

2.1 Problem description

Consider a class of continuous-time Markov jump linear systems (MJLSs) described by:

$$ \left\{ {\begin{array}{*{20}l} {\dot{x}(t) = A(r(t))x(t) + B(r(t))u(t)} \hfill \\ {x(0) = x_{0} ,\quad r(0) = r_{0} ,\quad t = 0} \hfill \\ \end{array} } \right. $$
(1)

with the infinite horizon performance index:

$$ J\left( {x(t)} \right) = {\mathbf{E}}\left\{ {\left. {\int_{0}^{\infty } {\left[ {x^{\text{T}} (\tau )Q_{i} x(\tau ) + u^{\text{T}} (\tau )R_{i} u(\tau )} \right]\text{d}\tau } } \right|\;t = 0,\;x_{0} ,\;r_{0} } \right\} $$
(2)

where \( x(t) \in \Re^{n} \) is the system state with the initial value \( x_{0} \); \( u(t) \in \Re^{m} \) is the control input; \( A(r(t)) \) and \( B(r(t)) \) are the coefficient matrices; \( Q(r(t)) \) and \( R(r(t)) \) are the mode-dependent weighting matrices, with \( Q(r(t)) \) positive semi-definite and \( R(r(t)) \) positive definite; the pair \( (A(r(t)),\;Q^{1/2} (r(t))) \) is assumed to be observable; \( r(t) \) is a right-continuous, continuous-time Markov stochastic process with discrete state space, which represents the system mode and takes values in the finite set \( M = \left\{ {1,\;2,\; \ldots ,\;N} \right\} \); \( r_{0} \) represents the initial mode. Here, MJLSs (1) is assumed to be stochastically stabilizable.

For MJLSs (1), the transition probabilities of each mode satisfy:

$$ P_{r} \left\{ {{\kern 1pt} r(t + \Delta t) = j\left| {r(t) = i} \right.} \right\} = \left\{ \begin{aligned} & \pi_{ij} \Delta t + o(\Delta t){\kern 1pt} \quad j \ne i \\ & 1 + \pi_{ii} \Delta t + o(\Delta t)\quad j = i \\ \end{aligned} \right.{\kern 1pt} $$
(3)

where \( i,\;j \in M \), \( \Delta t > 0 \) and \( \mathop {\lim }\nolimits_{\Delta t \to 0} \frac{o(\Delta t)}{\Delta t} = 0 \); \( \pi_{ij} \ge 0 \, (i \ne j) \) represents the transition rate from mode \( i \) to mode \( j \) over the interval from \( t \) to \( t + \Delta t \), and \( \pi_{ii} = - \sum\nolimits_{j \ne i} {\pi_{ij} } \). For simplicity, when \( r(t) = i \), \( A(r(t)) \), \( B(r(t)) \), \( Q(r(t)) \), \( R(r(t)) \) are denoted by \( A_{i} \), \( B_{i} \), \( Q_{i} \), \( R_{i} \), respectively.
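For intuition, the mode process \( r(t) \) governed by (3) can be simulated directly from the generator matrix \( \varPi = [\pi_{ij} ] \) by drawing exponential holding times. The following Python sketch (illustrative only; the function name and structure are our own) uses the two-mode generator of Example 1 in Sect. 4:

```python
import numpy as np

def simulate_modes(Pi, r0, T, rng):
    """Simulate the Markov mode process r(t) on [0, T] from the generator Pi.

    Pi[i, j] (i != j) is the transition rate pi_ij and Pi[i, i] = -sum_{j != i} pi_ij.
    Returns the jump times and the visited modes (0-based indices)."""
    t, r = 0.0, r0
    times, modes = [0.0], [r0]
    while True:
        rate = -Pi[r, r]
        if rate <= 0.0:                      # absorbing mode: no further jumps
            break
        t += rng.exponential(1.0 / rate)     # exponential holding time in mode r
        if t >= T:
            break
        probs = np.maximum(Pi[r], 0.0)
        probs[r] = 0.0
        probs /= probs.sum()                 # embedded jump distribution
        r = rng.choice(len(probs), p=probs)
        times.append(t)
        modes.append(r)
    return times, modes

Pi = np.array([[-3.0, 3.0], [2.5, -2.5]])    # generator used in Example 1
times, modes = simulate_modes(Pi, r0=0, T=2.0, rng=np.random.default_rng(1))
```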

For a given feedback control policy \( u(t) = - K_{i} x(t) \), we define the cost function as:

$$ V(x(t)) = {\mathbf{E}}\left\{ {\left. {\int_{t}^{\infty } {\left[ {x^{\text{T}} (\tau )Q_{i} x(\tau ) + u^{\text{T}} (\tau )R_{i} u(\tau )} \right]\text{d}\tau } } \right|\;t,\;x(t),\;r(t)} \right\} $$
(4)

The optimal control problem for MJLSs (1) is to find a mode-dependent admissible feedback control policy, which minimizes the performance index, i.e.

$$ \begin{aligned} V^{*} (x(t)) & = \mathop {\hbox{min} }\limits_{u(t)} J\left( {x(t)} \right) \\ & = \mathop {\hbox{min} }\limits_{u(t)} {\mathbf{E}}\left\{ {\left. {\int_{0}^{\infty } {\left[ {x^{\text{T}} (\tau )Q_{i} x(\tau ) + u^{\text{T}} (\tau )R_{i} u(\tau )} \right]\text{d}\tau } } \right|\;t = 0,\;x_{0} ,\;r_{0} } \right\} \\ \end{aligned} $$
(5)

Then, the cost function (4) can be denoted as \( V(x(t)){\kern 1pt} = x_{{i,{\kern 1pt} {\kern 1pt} t}}^{\text{T}} P_{i} x_{{i,{\kern 1pt} {\kern 1pt} t}} \), where \( x_{{i,{\kern 1pt} {\kern 1pt} t}} \) is the state of the ith jump mode; \( P_{i} \) is a positive-definite and mode-dependent matrix which can be obtained through solving the following coupled algebraic Riccati equations (AREs):

$$ \hat{A}_{i}^{\text{T}} P_{i} + P_{i} \hat{A}_{i} + Q_{i} + \sum\limits_{j = 1,i \ne j}^{N} {\pi_{ij} P_{j} } - P_{i} B_{i} R_{i}^{ - 1} B_{i}^{\text{T}} P_{i} = 0 $$
(6)

where \( \hat{A}_{i} = A_{i} + \frac{{\pi_{ii} }}{2}I \). Then, the optimal feedback control policy can be determined by:

$$ u(t) = - K_{i} x(t) = - R_{i}^{ - 1} B_{i}^{\text{T}} P_{i} x(t) . $$
(7)

Definition 1

MJLSs (1) is stochastically stable (SS), if for any initial conditions \( x_{0} \) and \( r_{0} \), the following formula is satisfied:

$$ \mathop {\lim }\limits_{t \to \infty } {\mathbf{E}}{\kern 1pt} {\kern 1pt} \left\| {x(t)} \right\|^{2} = 0 . $$
(8)

Inspired by [48], using the subsystem transformation technique, we can rewrite MJLSs (1) as the following N reconstructed subsystems:

$$ \left\{ {\begin{array}{*{20}l} {\dot{x}_{{i,{\kern 1pt} {\kern 1pt} t}} = \hat{A}_{i} x_{{i,{\kern 1pt} {\kern 1pt} t}} + B_{i} u_{{i,{\kern 1pt} {\kern 1pt} t}} } \hfill \\ {x_{i} (0) = x_{{i,{\kern 1pt} 0}} ,\quad t_{0} = 0} \hfill \\ \end{array} } \right. $$
(9)

In order to guarantee the stabilizability of the N re-constructed subsystems in Eq. (9) by the optimal feedback control policy (7), the triples \( (A_{i} ,{\kern 1pt} {\kern 1pt} {\kern 1pt} B_{i} ,{\kern 1pt} {\kern 1pt} \sqrt {Q_{i} } ) \) are assumed to be stabilizable-detectable [19].

2.2 An offline parallel policy iteration algorithm (OPPIA)

According to the assumptions mentioned above, the exact solution \( P_{i}^{*} \) of the positive-definite and mode-dependent matrix \( P_{i} \) can be obtained by solving AREs (6). Notice that AREs (6) are a set of nonlinear algebraic equations and are difficult to solve directly, especially for high-dimensional matrices. To tackle this problem, some algorithms have been proposed by Kleinman [16] and by Gajic and Borno [19]. For MJLSs (1), we give the following Lemma 1.

Lemma 1

[19] For each subsystem, given an initial stabilizing gain \( K_{i(0)} \), the solutions of the following decoupled algebraic Lyapunov equations converge to the solutions of AREs (6):

$$ \begin{aligned} & (\hat{A}_{i} - B_{i} K_{i(k)} )^{\text{T}} P_{i(k + 1)} + P_{i(k + 1)} (\hat{A}_{i} - B_{i} K_{i(k)} ) \\ & {\kern 1pt} \quad = - Q_{i(k)} - P_{i(k)} B_{i} R_{i}^{ - 1} B_{i}^{\text{T}} P_{i(k)} \\ \end{aligned} $$
(10)

In order to solve the N parallel decoupled algebraic Lyapunov equations in (10), we give the following Algorithm 1.

Algorithm 1 The offline parallel policy iteration algorithm (OPPIA) (presented as a figure in the original)

Remark 1

Due to the fact that MJLSs (1) can be decomposed into N independent subsystems, the solutions can be obtained by solving the N decoupled algebraic Lyapunov equations in parallel. Thus, the solutions to the N nonlinear AREs (6) are numerically approximated. By Algorithm 1, the exact solutions \( P_{i}^{*} \) are obtained as \( k \to \infty \). Notice that the method in Algorithm 1 is offline and the knowledge of the dynamics \( A_{i} \), \( B_{i} \) is required.
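Since Algorithm 1 appears only as a figure in the original, the following Python/SciPy sketch outlines the offline parallel Lyapunov iteration described by Lemma 1 and Remark 1. It is an illustration under our assumptions, in particular that \( Q_{i(k)} \) in (10) denotes \( Q_{i} + \sum_{j \ne i} \pi_{ij} P_{j(k)} \) and that the gain update of Eq. (11) is \( K_{i(k + 1)} = R_{i}^{ - 1} B_{i}^{\text{T}} P_{i(k + 1)} \):

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

def oppia(A, B, Q, R, Pi, iters=50):
    """Offline parallel policy iteration (sketch of Algorithm 1).

    A, B, Q, R are lists of mode-dependent matrices and Pi is the
    transition-rate matrix; returns approximations of P_i^* and K_i^*."""
    N = len(A)
    n = A[0].shape[0]
    A_hat = [A[i] + 0.5 * Pi[i, i] * np.eye(n) for i in range(N)]   # A_i + (pi_ii / 2) I
    P = [np.zeros((n, n)) for _ in range(N)]
    K = [np.zeros((B[i].shape[1], n)) for i in range(N)]            # K_{i(0)} = 0
    for _ in range(iters):
        P_new = []
        for i in range(N):
            # assumed coupling term folded into Q_{i(k)} of Eq. (10)
            Q_k = Q[i] + sum(Pi[i, j] * P[j] for j in range(N) if j != i)
            rhs = -(Q_k + P[i] @ B[i] @ np.linalg.solve(R[i], B[i].T) @ P[i])
            Acl = A_hat[i] - B[i] @ K[i]
            # decoupled Lyapunov equation (10): Acl^T X + X Acl = rhs
            P_new.append(solve_continuous_lyapunov(Acl.T, rhs))
        P = P_new
        # assumed gain update (Eq. (11)): K_{i(k+1)} = R_i^{-1} B_i^T P_{i(k+1)}
        K = [np.linalg.solve(R[i], B[i].T @ P[i]) for i in range(N)]
    return P, K
```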

3 Main results

In this section, we propose a parallel reinforcement learning (RL) algorithm to design the feedback control policy for MJLSs (1). This is a novel online learning strategy that does not require knowledge of the coefficient matrices \( A_{i} \) and \( B_{i} \). At each solution step, the N parallel decoupled algebraic Lyapunov equations in (10) are solved. Finally, the convergence of the new algorithm is proved.

3.1 A parallel RL algorithm with completely unknown dynamics

In the offline Algorithm 1, the dynamics of the N subsystems are required to solve the decoupled algebraic Lyapunov equations in (10). In practical applications, this algorithm is difficult to implement if the exact system knowledge is not available in advance. To overcome this weakness, a new online parallel RL algorithm is developed to obtain the optimal solutions for MJLSs with completely unknown dynamics. For the ith subsystem, the unique symmetric positive-definite solutions \( P_{i(k)} \) and the state feedback gains \( K_{i(k)} \) \( (\forall k \in Z_{ + } ) \) should satisfy Eqs. (10), (11).

Recalling Eq. (9), we rewrite the subsystem state equation as:

$$ \left\{ \begin{aligned} & \dot{x}_{{i,{\kern 1pt} t}} = \bar{A}_{i(k)} x_{i,t} + B_{i} (u_{i,t} + K_{i(k)} x_{i,t} ) \\ & x_{i} (0) = x_{{i,{\kern 1pt} 0}} ,\quad {\kern 1pt} {\kern 1pt} t_{0} = 0 \\ \end{aligned} \right. $$
(12)

where \( \bar{A}_{i(k)} = \hat{A}_{i} - B_{i} K_{i(k)} \) and the pairs \( (\hat{A}_{i} ,\;B_{i} ) \) are stabilizable.

Then, the novel algorithm is given as follows:

Algorithm 2 The online parallel RL algorithm with completely unknown dynamics (presented as a figure in the original)

Notice that the N subsystems in Eq. (9) are reconstructed into the N new continuous-time subsystems in Eq. (12). Next, we give an online implementation of the parallel RL algorithm to solve Eq. (13). In Eq. (12), \( A_{i} \) and \( \pi_{ii} \) are embedded in the reconstructed subsystem states, and \( B_{i} \) is embedded in the reconstructed subsystem control inputs, both of which can be measured online. For real-time learning of the information of \( x_{{i,{\kern 1pt} {\kern 1pt} t}} \) and \( u_{{i,{\kern 1pt} {\kern 1pt} t}} \), we choose \( u_{{i,{\kern 1pt} {\kern 1pt} t}} = - K_{i(k)} x_{{i,{\kern 1pt} {\kern 1pt} t}} + e_{i} \) as the actual control input, where \( e_{i} \) is an additional exploration noise.

Remark 2

Solving Eq. (13), we obtain a sequence of solutions, where \( P_{i(k)} \) and \( K_{i(k)} \) satisfy (10), (11), and \( x_{{i,{\kern 1pt} {\kern 1pt} t}} \) is the solution of Eq. (12) under an arbitrary control input \( u_{{i,{\kern 1pt} {\kern 1pt} t}} \). This fact enables us to choose \( u_{{i,{\kern 1pt} {\kern 1pt} t}} = - K_{i(k)} x_{{i,{\kern 1pt} {\kern 1pt} t}} + e_{i} \) as the real input signal for learning. Obviously, this choice does not affect the convergence of the learning process.

Remark 3

In Algorithm 2, we choose an exploration noise and assume that it is a nonzero, measurable and bounded signal. In fact, the particular choice of the exploration noise does not affect the algorithm; it is added as an excitation signal to satisfy the persistence of excitation condition. For convenience, an exponentially decreasing probing noise or a sum of sinusoidal signals is usually chosen as the exploration noise.

Remark 4

In Eq. (13), the internal dynamics information of the subsystems is not needed, i.e., the dynamics matrices \( A_{i} \) and \( B_{i} \) do not appear. In order to solve Eq. (13) by the online RL algorithm, the state and input information of the N subsystems should be sampled continuously online at each iteration step. In fact, the dynamics information is already contained in the sampled data. Therefore, the online Algorithm 2 does not require the dynamics information, i.e., Algorithm 2 is a completely model-free algorithm.

3.2 Online implementation of the novel parallel RL algorithm

To find a set of pairs \( (P_{i(k)} ,\;K_{i(k + 1)} ) \) that satisfy Eqs. (10), (11), the online implementation of the novel parallel RL algorithm is derived. Observe that the dynamics matrices \( A_{i} \), \( B_{i} \) and the transition rate \( \pi_{ii} \) are not needed in Algorithm 2. According to the Kronecker product representation, the left side of Eq. (13) can be rewritten as:

$$ \begin{aligned} & x_{{i,{\kern 1pt} t + T}}^{\text{T}} P_{i(k)} x_{{i,{\kern 1pt} t + T}} - x_{{i,{\kern 1pt} t}}^{\text{T}} P_{i(k)} x_{{i,{\kern 1pt} t}} \\ & \quad = \bar{P}_{i(k)}^{\text{T}} \left[ {x_{{i,{\kern 1pt} t + T}}^{\text{T}} \otimes x_{{i,{\kern 1pt} t + T}}^{\text{T}} - x_{{i,{\kern 1pt} t}}^{\text{T}} \otimes x_{{i,{\kern 1pt} t}}^{\text{T}} } \right]{\kern 1pt} {\kern 1pt} = \bar{P}_{i(k)}^{\text{T}} \delta_{ix} \\ \end{aligned} $$
(14)

where \( \bar{P}_{i(k)} \) is the vectorized form of the positive-definite and mode-dependent matrix \( P_{i(k)} \); in general, \( \bar{P}_{i} \) is the column vector whose entries are generated from \( P_{i} \) as:

$$ \bar{P}_{i} = [p_{i11} ,{\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} 2p_{i12} ,{\kern 1pt} {\kern 1pt} {\kern 1pt} \ldots ,{\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} 2p_{i1n} ,{\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} p_{i22} ,{\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} 2p_{i23} ,{\kern 1pt} {\kern 1pt} {\kern 1pt} \ldots ,{\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} p_{inn} ]^{\text{T}} , $$
(15)
$$ x_{i}^{\text{T}} \otimes x_{i}^{\text{T}} = [x_{i1}^{2} ,\;x_{i1} x_{i2} ,\; \ldots ,\;x_{i1} x_{in} ,\;x_{i2}^{2} ,\;x_{i2} x_{i3} ,\; \ldots ,\;x_{i2} x_{in} ,\; \ldots ,\;x_{in}^{2} ]^{\text{T}} . $$
(16)

Using a method similar to that in Eq. (14), we have:

$$ x_{{i,{\kern 1pt} t}}^{\text{T}} {\kern 1pt} {\kern 1pt} \bar{Q}_{i(k)} {\kern 1pt} {\kern 1pt} x_{{i,{\kern 1pt} t}} = (x_{{i,{\kern 1pt} t}}^{\text{T}} \otimes x_{{i,{\kern 1pt} t}}^{\text{T}} ){\kern 1pt} {\kern 1pt} {\text{vec}}{\kern 1pt} {\kern 1pt} (\bar{Q}_{i(k)} ) $$
(17)

where \( \bar{Q}_{i(k)} = Q_{i(k)} + K_{i(k)}^{\text{T}} R_{i} K_{i(k)} \). Then, the right side of Eq. (13) can be rewritten as:

$$ \begin{aligned} & (u_{{i,{\kern 1pt} t}} + K_{i(k)} x_{{i,{\kern 1pt} t}} )^{\text{T}} R_{i} K_{i(k + 1)} x_{{i,{\kern 1pt} t}} \\ & {\kern 1pt} \quad {\kern 1pt} = [(x_{{i,{\kern 1pt} t}}^{\text{T}} \otimes x_{{i,{\kern 1pt} t}}^{\text{T}} )(I_{n} \otimes K_{i(k)}^{\text{T}} R_{i} ) + (x_{{i,{\kern 1pt} t}}^{\text{T}} \otimes u_{{i,{\kern 1pt} {\kern 1pt} {\kern 1pt} t}}^{\text{T}} )(I_{n} \otimes R_{i} )]{\kern 1pt} {\kern 1pt} {\text{vec}}(K_{i(k + 1)} ). \\ \end{aligned} $$
(18)

Meanwhile, we define the following operators:

$$ I_{ixx} = \left[ {\begin{array}{*{20}c} {\int_{{t_{0} }}^{{t_{1} }} {x_{{i,{\kern 1pt} \tau }}^{\text{T}} \otimes x_{{i,{\kern 1pt} \tau }}^{\text{T}} } \text{d}\tau } \\ {\int_{{t_{1} }}^{{t_{2} }} {x_{{i,{\kern 1pt} \tau }}^{\text{T}} \otimes x_{{i,{\kern 1pt} \tau }}^{\text{T}} } \text{d}\tau } \\ \vdots \\ {\int_{{t_{l - 1} }}^{{t_{l} }} {x_{{i,{\kern 1pt} \tau }}^{\text{T}} \otimes x_{{i,{\kern 1pt} \tau }}^{\text{T}} } \text{d}\tau } \\ \end{array} } \right] , $$
(19)
$$ I_{ixu} = \left[ {\begin{array}{*{20}c} {\int_{{t_{0} }}^{{t_{1} }} {x_{{i,{\kern 1pt} \tau }}^{\text{T}} \otimes u_{{i,{\kern 1pt} \tau }}^{\text{T}} } \text{d}\tau } \\ {\int_{{t_{1} }}^{{t_{2} }} {x_{{i,{\kern 1pt} \tau }}^{\text{T}} \otimes u_{{i,{\kern 1pt} \tau }}^{\text{T}} } \text{d}\tau } \\ \vdots \\ {\int_{{t_{l - 1} }}^{{t_{l} }} {x_{{i,{\kern 1pt} \tau }}^{\text{T}} \otimes u_{{i,{\kern 1pt} \tau }}^{\text{T}} } \text{d}\tau } \\ \end{array} } \right] . $$
(20)

Thus, Eq. (13) can be described as:

$$ \varTheta_{i(k)} \left[ {\begin{array}{*{20}c} {\bar{P}_{i(k)} } \\ {\text{vec}(K_{i(k + 1)} )} \\ \end{array} } \right] = \varXi_{i(k)} $$
(21)

where \( \varTheta_{i(k)} \in \Re^{{l \times [\frac{1}{2}n(n + 1) + mn]}} \), \( \varXi_{i(k)} \in \Re^{l} \) and they can be described as:

$$ \varTheta_{i(k)} = \left[ {\begin{array}{*{20}c} {\delta_{ixx} } & { - 2{\kern 1pt} {\kern 1pt} {\kern 1pt} [I_{ixx} (I_{n} \otimes K_{i(k)}^{\text{T}} R_{i} ) + I_{ixu} (I_{n} \otimes R_{i} )]} \\ \end{array} } \right] , $$
(22)
$$ \varXi_{i(k)} = - I_{ixx} \text{vec}(\bar{Q}_{i(k)} ) . $$
(23)

For Eq. (21), if \( \varTheta_{i(k)} \) has full column rank, the least-squares solution of Eq. (13) can be obtained by:

$$ \left[ {\begin{array}{*{20}c} {\bar{P}_{i(k)} } \\ {\text{vec}(K_{i(k + 1)} )} \\ \end{array} } \right] = (\varTheta_{i(k)}^{\text{T}} \varTheta_{i(k)} )^{ - 1} \varTheta_{i(k)}^{\text{T}} \varXi_{i(k)} $$
(24)

By formula (24), we can obtain \( P_{i(k)} \) and \( K_{i(k + 1)} \). To compute the matrices \( I_{ixx} \) and \( I_{ixu} \), we can use \( \frac{n(n + 1)}{2} + nm \) integrators to collect the state and control input information of each subsystem. In order to guarantee that Algorithm 2 can be implemented online at each step, \( \varTheta_{i(k)} \) is required to have full column rank for all \( k \in Z_{ + } \), i.e., there exists an integer \( l_{0} > 0 \) such that, for all \( l > l_{0} \), the following condition holds:

$$ \text{rank}{\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} \left( {{\kern 1pt} {\kern 1pt} \left[ {I_{ixx} ,{\kern 1pt} {\kern 1pt} I_{ixu} } \right]{\kern 1pt} {\kern 1pt} } \right) = \frac{n(n + 1)}{2} + nm . $$
(25)

3.3 The convergence proof

In order to prove the convergence of Algorithm 2, we establish the following Theorem 1.

Theorem 1

For the novel parallel RL Algorithm 2, given a stabilizing initial gain \( K_{i(0)} \) for each \( i \in M \) and assuming that \( \varTheta_{i(k)} \) has full column rank, the solutions of Eq. (13) satisfy Eqs. (10), (11).

Proof

Let \( K_{i(0)} \) be a stabilizing gain, i.e., \( A_{i} - B_{i} K_{i(0)} \) is Hurwitz. According to Algorithm 1, the exact solutions \( P_{i}^{*} = \mathop {\lim }\nolimits_{k \to \infty } P_{i(k)} \) can be obtained from the decoupled algebraic Lyapunov equations in (10), and the feedback gains \( K_{i}^{*} = \mathop {\lim }\nolimits_{k \to \infty } K_{i(k)} \) are uniquely determined by Eq. (11). For \( \forall t > 0 \), we choose the cost function as \( V(x(t)) = x_{{i,{\kern 1pt} {\kern 1pt} t}}^{\text{T}} P_{i} x_{{i,{\kern 1pt} {\kern 1pt} t}} \). Then, for the ith reconstructed subsystem in Eq. (12), along the solutions of (12) and using (10) and (11), the following is satisfied:

$$ \begin{aligned} & x_{{i,{\kern 1pt} t + T}}^{\text{T}} P_{i(k)} x_{{i,{\kern 1pt} t + T}} - x_{{i,{\kern 1pt} t}}^{\text{T}} P_{i(k)} x_{{i,{\kern 1pt} t}} \\ & \quad = \int_{t}^{t + T} {\left[ {x_{{i,{\kern 1pt} \tau }}^{\text{T}} (\bar{A}_{i(k)}^{\text{T}} P_{i(k)} + P_{i(k)} \bar{A}_{i(k)} )x_{{i,{\kern 1pt} \tau }} + 2(u_{{i,{\kern 1pt} \tau }} + K_{i(k)} x_{{i,{\kern 1pt} \tau }} )^{\text{T}} B_{i}^{\text{T}} P_{i(k)} x_{{i,{\kern 1pt} \tau }} } \right]\text{d}\tau } \\ & \quad = - \int_{t}^{t + T} {x_{{i,{\kern 1pt} \tau }}^{\text{T}} \bar{Q}_{i(k)} x_{{i,{\kern 1pt} \tau }} \text{d}\tau } + \int_{t}^{t + T} {2(u_{{i,{\kern 1pt} \tau }} + K_{i(k)} x_{{i,{\kern 1pt} \tau }} )^{\text{T}} R_{i} K_{i(k + 1)} x_{{i,{\kern 1pt} \tau }} \text{d}\tau } \\ \end{aligned} $$
(26)

Since \( \varTheta_{i(k)} \) has full column rank, the least-squares solution in Eq. (24) is an online implementation of Eq. (13). Therefore, the solutions of Eq. (13) are equivalent to the iteration solutions of Eqs. (10) and (11). The proof is completed.

The flowchart of Algorithm 2 is given in Fig. 1.

Fig. 1 The flowchart of Algorithm 2

Remark 5

The online implementation in Algorithm 2 highlights the differences between the offline Algorithm 1 and the online Algorithm 2. Algorithm 1 applies an offline iterative procedure to solve the Riccati-type equations directly, whereas Algorithm 2 uses online sampling and learning. Moreover, Eqs. (10), (11) in Algorithm 1 depend on the dynamics information, while Eq. (13) in Algorithm 2 does not require any dynamics information.

4 Simulation examples

Example 1

To show the applicability of Algorithm 2, we consider a fourth-order jump linear model with two jump modes:

$$ N = 2,\;\varPi = \left[ {\begin{array}{*{20}c} { - \;3} & 3 \\ {2.5} & { - \;2.5} \\ \end{array} } \right], $$
$$ A_{1} = \left[ {\begin{array}{*{20}c} { - \;2.2361} & { - \;1.1358} & 1 & {0.6324} \\ { - \;0.1024} & { - \;3} & {0.3835} & {0.85} \\ {0.7112} & {11.2346} & { - \;36.8199} & 4 \\ {1.0692} & {13.4230} & {20.1185} & { - \;12.1801} \\ \end{array} } \right], $$
$$ A_{2} = \left[ {\begin{array}{*{20}c} { - \;1.5326} & { - \;1.2436} & {0.5458} & {0.7136} \\ { - \;0.8} & { - \;2.9346} & {0.0920} & {0.42} \\ {11.1634} & {23} & { - \;26.4655} & { - \;1.8347} \\ {25} & {8.3132} & { - \;3.8714} & { - \;31.4631} \\ \end{array} } \right], $$
$$ B_{1} = B_{2} = \left[ {\begin{array}{*{20}c} 1 \\ 1 \\ 1 \\ 1 \\ \end{array} } \right],\quad Q_{1} = Q_{2} = \left[ {\begin{array}{*{20}c} 1 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 \\ 1 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 \\ \end{array} } \right]. $$

In the simulation, we set \( R_{1} = R_{2} = 1 \), the error tolerance \( \varepsilon \) as \( 10^{ - 15} \) and the initial stabilizing feedback gain as \( K_{i(0)} = 0 \). According to the OPPIA in Algorithm 1, AREs (6) can be solved directly. After 14 iterations, the exact solutions are obtained:

$$ P_{1}^{*} = \left[ {\begin{array}{*{20}c} {0.3505} & { - \;0.0320} & {0.0180} & {0.0191} \\ { - \;0.0320} & {0.5943} & {0.0572} & {0.0845} \\ {0.0180} & {0.0572} & {0.0268} & {0.0232} \\ {0.0191} & {0.0845} & {0.0232} & {0.0503} \\ \end{array} } \right], $$
$$ P_{2}^{*} = \left[ {\begin{array}{*{20}c} {0.5574} & { - 0.0483} & {0.0140} & {0.0227} \\ { - 0.0483} & {0.4057} & {0.0164} & {0.0111} \\ {0.0140} & {0.0164} & {0.0196} & { - 0.0003} \\ {0.0227} & {0.0111} & { - 0.0003} & {0.0178} \\ \end{array} } \right]. $$

The OPPIA in Algorithm 1 is offline and requires all the subsystem dynamics matrices. To show the effectiveness and applicability of Algorithm 2, we use the same conditions mentioned above. For the two subsystems, we select the initial states as \( x_{10} = x_{20} = \left[ {\begin{array}{*{20}c} 0 & { - \;0.1} & {0.1} & 0 \\ \end{array} } \right]^{\text{T}} \) and apply the controller \( u_{i(k)} = - K_{i(k)} x_{i} + e_{i} \) on the time interval \( \left[ {0,\;2} \right] \) with the exploration noises \( e_{1} = e_{2} = 100\sum\nolimits_{n = 1}^{100} {\sin (\omega_{n} t)} \), where \( \omega_{n} \) is randomly selected from \( \left[ { - \;500,\;500} \right] \). For the two subsystems, the state and input information is sampled every 0.01 s, and the controller is updated every 0.1 s. By the given iteration criterion, we obtain the final results in step 4 of Algorithm 2 after 7 iterations:

$$ P_{1(7)} = \left[ {\begin{array}{*{20}c} {0.3505} & { - \;0.0320} & {0.0180} & {0.0191} \\ { - \;0.0320} & {0.5943} & {0.0572} & {0.0845} \\ {0.0180} & {0.0572} & {0.0268} & {0.0232} \\ {0.0191} & {0.0845} & {0.0232} & {0.0503} \\ \end{array} } \right], $$
$$ P_{2(7)} = \left[ {\begin{array}{*{20}c} {0.5574} & { - \;0.0483} & {0.0140} & {0.0227} \\ { - \;0.0483} & {0.4057} & {0.0164} & {0.0111} \\ {0.0140} & {0.0164} & {0.0196} & { - \;0.0003} \\ {0.0227} & {0.0111} & { - \;0.0003} & {0.0178} \\ \end{array} } \right]. $$

The optimal state feedback controller gains are:

$$ K_{1(7)} = \left[ {\begin{array}{*{20}c} {0.3557} & {0.7040} & {0.1253} & {0.1771} \\ \end{array} } \right], $$
$$ K_{2(7)} = \left[ {\begin{array}{*{20}c} {0.5458} & {0.3849} & {0.0497} & {0.0513} \\ \end{array} } \right]. $$

Then, the simulation results are obtained using MATLAB. Figures 2, 3, 4 and 5 show the convergence of Algorithm 2: the convergence of \( P_{1} \) and \( P_{2} \) to the optimal values is shown in Figs. 2 and 3, and the updates of the elements of the matrices \( P_{1} \) and \( P_{2} \) are shown in Figs. 4 and 5.
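For readers who wish to reproduce the data-collection stage, the following Python sketch (an illustration only; the paper's results were produced in MATLAB) integrates one reconstructed subsystem (9) with the probing input above, samples every 0.01 s and closes a data interval every 0.1 s. The returned matrices can then be passed to the rl_update sketch of Sect. 3.2; the forward-Euler integration and the function name collect_data are our own choices:

```python
import numpy as np

def collect_data(A_hat, B, K, x0, T_learn=2.0, dt=0.01, T_update=0.1, seed=3):
    """Collect delta_xx, I_xx and I_xu for one reconstructed subsystem (9)."""
    rng = np.random.default_rng(seed)
    omegas = rng.uniform(-500.0, 500.0, size=100)        # frequencies of e_i
    n, m = B.shape
    quad = lambda z: np.array([z[i] * z[j] for i in range(n) for j in range(i, n)])

    x, t, step = x0.astype(float).copy(), 0.0, 0
    steps_per_update = int(round(T_update / dt))
    x_start = x.copy()
    acc_xx, acc_xu = np.zeros(n * n), np.zeros(n * m)
    delta_rows, Ixx_rows, Ixu_rows = [], [], []
    while t < T_learn:
        u = -K @ x + 100.0 * np.sum(np.sin(omegas * t))  # probing control input
        acc_xx += np.kron(x, x) * dt                     # integrand of Eq. (19)
        acc_xu += np.kron(x, u) * dt                     # integrand of Eq. (20)
        x = x + dt * (A_hat @ x + B @ u)                 # Euler step of Eq. (9)
        t, step = t + dt, step + 1
        if step % steps_per_update == 0:                 # close one data interval
            delta_rows.append(quad(x) - quad(x_start))
            Ixx_rows.append(acc_xx)
            Ixu_rows.append(acc_xu)
            x_start = x.copy()
            acc_xx, acc_xu = np.zeros(n * n), np.zeros(n * m)
    return np.array(delta_rows), np.array(Ixx_rows), np.array(Ixu_rows)
```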

Fig. 2 The convergence of \( P_{1(k)} \) to the optimal value \( P_{1}^{*} \)

Fig. 3 The convergence of \( P_{2(k)} \) to the optimal value \( P_{2}^{*} \)

Fig. 4 The updated elements of \( P_{1(k)} \) at each step in Algorithm 2

Fig. 5 The updated elements of \( P_{2(k)} \) at each step in Algorithm 2

Example 2

In order to further evaluate the proposed approach, we give the following comparative results. Applying the parallel RL algorithm proposed in this paper and simulating the experiment with the same parameters as in [45], we obtain the final solutions after 7 iterations:

$$ P_{1(7)} = \left[ {\begin{array}{*{20}c} {0.2408} & {0.0705} & {0.0393} & {0.0182} \\ {0.0705} & {0.0308} & {0.0085} & {0.0064} \\ {0.0393} & {0.0085} & {0.0157} & {0.0025} \\ {0.0182} & {0.0064} & {0.0025} & {0.0016} \\ \end{array} } \right], $$
$$ P_{2(7)} = \left[ {\begin{array}{*{20}c} {0.5026} & {0.1343} & {0.0518} & {0.0097} \\ {0.1343} & {0.0485} & {0.0138} & {0.0026} \\ {0.0518} & {0.0138} & {0.0193} & {0.0002} \\ {0.0097} & {0.0026} & {0.0002} & {0.0003} \\ \end{array} } \right]. $$

Compared with the online algorithm proposed in [45], the errors of the results computed by the algorithm in this paper nearly converge to zero, which shows that the parallel RL algorithm in this paper gives a more accurate solution. It is also seen from Figs. 6 and 7 that the convergence rate of the parallel RL algorithm is faster than that of the results in [45].

Fig. 6 The updated elements of \( P_{1(k)} \) using the parallel RL algorithm

Fig. 7 The updated elements of \( P_{2(k)} \) using the parallel RL algorithm

Remark 6

In the two simulation examples, we solve for the positive-definite solution \( P_{i(k)} \) by the online reinforcement learning method. First, the N coupled Riccati equations are obtained by decomposing the MJLSs into subsystems. According to the online implementation process, the states and inputs of each subsystem are collected so that the full-column-rank condition of the least-squares equation is satisfied. Then, the least-squares formula is solved iteratively until the final optimal solution satisfying the prescribed precision requirement is obtained.

5 Conclusion

A novel online parallel RL algorithm is proposed for MJLSs with completely unknown dynamics. It solves the AREs by collecting the state and input information of each subsystem. Therefore, the novel parallel RL algorithm can compute the corresponding N coupled AREs in parallel with completely unknown dynamics. The convergence of the parallel RL algorithm is also proved. Two simulation examples illustrate the applicability and effectiveness of the designed algorithm. In future work, the RL algorithm and related optimal control algorithms may be applied to other areas, such as artificial intelligence, industrial informatics, pattern processing and wireless automation devices.