
1 Introduction

Linear quadratic tracking (LQT) for discrete-time (DT) systems is an important problem in the field of control. The basic idea is that the performance index is a quadratic function formed by accumulating the deviation between the reference signal and the system output together with the control effort. By designing an optimal controller that minimizes this index, the system output tracks the reference signal in an optimal manner. The traditional way to solve the LQT problem is to solve an algebraic Riccati equation [1,2,3], and such traditional controller design methods all require the system model information to be known.

Reinforcement learning (RL) is a class of machine learning methods that emerged in the 1950s and 1960s [4,5,6]. Through dynamic interaction with an unknown environment, performance is evaluated and the action is updated, so that the optimal performance together with the optimal action can be learned [7,8,9,10,11]. Reinforcement learning has many advantages and strong adaptability. It has become an important method for solving optimization problems, is widely used in robotics, artificial intelligence, intelligent systems and other fields, and has been a research hot spot in recent years [12,13,14,15].

Among existing results on model-free optimal tracking control by reinforcement learning, most use system state data to learn a state feedback control strategy that tracks the reference input in an optimal or nearly optimal way, such as the optimal tracking control of [16,17,18,19,20,21,22]. In [19], a Q-learning method based on off-policy iteration was proposed to solve the optimal tracking control problem of networked control systems. In [20], an optimal control method for linear networked control was proposed for the case where the system parameters are unknown. Based on the state feedback information of the system, a novel off-policy Q-learning algorithm was proposed in [21], which solves the discrete-time linear quadratic tracking problem with unknown system parameters. In [22], an optimal tracking control scheme was also proposed.

In this paper, a Q-learning algorithm is developed to design the output feedback optimal tracking control strategy, so that the optimal quadratic tracking problem can be solved without knowledge of the system dynamics.

The contributions of this paper are: (a) unlike traditional research that requires model information [1,2,3], this paper learns the optimal tracking control strategy when the system model parameters are unknown; (b) compared with other model-free research based on state feedback of the system [20,21,22], this paper adopts a fully data-driven off-policy Q-learning method to solve the linear quadratic tracking control problem of discrete-time systems using only output feedback and without the system model parameters. Finally, simulation experiments and a practical application example are given to verify the effectiveness of the algorithm.

2 On the Optimal Control Problem

This section introduces the optimal control formulation of the linear quadratic tracking problem for discrete-time systems. Consider the following linear discrete-time system in state-space form:

$$ \left\{ {\begin{array}{*{20}c} {x_{k + 1} \, = \,Ax_{k} \, + \,Bu_{k} } \\ {y_{k} \, = \,Cx_{k} } \\ \end{array} } \right. $$
(1)

where \( x_{k} \) is the state of the controlled plant, \( u_{k} \) is its input, and \( y_{k} \) is its output. \( A \), \( B \), and \( C \) are matrices of appropriate dimensions. The reference signal of interest is generated by:

$$ r_{k + 1} \, = \,Fr_{k} $$
(2)

where \( r_{k} \) is the reference signal and \( F \) is a matrix of appropriate dimensions. For the linear quadratic tracking problem of the discrete-time system, the output \( y_{k} \) of system (1) should gradually track the reference signal \( r_{k} \) as time goes on. Accordingly, the output feedback controller is designed and selected as:

$$ u_{k} \, = \, - K_{1} y_{k} \, - \,K_{2} r_{k} $$
(3)

The purpose of the controller is to minimize the performance index \( J \):

$$ J\, = \,\mathop {\hbox{min} }\limits_{{u_{k} }} \sum\limits_{k = 0}^{\infty } {\beta^{k} [(y_{k} \, - \,r_{k} )^{T} Q(y_{k} \, - \,r_{k} )\, + \,u_{k}^{T} Ru_{k} ]} $$
(4)

where \( Q\, \ge \,0 \) and \( R\, > \,0 \) are symmetric matrices, and \( \beta \) is a discount factor with \( 0\, < \,\beta \, < \,1 \). The constraints are as follows:

$$ \left\{ {\begin{array}{*{20}c} {x_{k + 1} \, = \,Ax_{k} \, + \,Bu_{k} } \\ {y_{k} \, = \,Cx_{k} } \\ {r_{k + 1} \, = \,Fr_{k} } \\ \end{array} } \right. $$
(5)
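To make the cost (4) and the constraints (5) concrete, the following minimal Python/NumPy sketch simulates system (1)-(2) under a fixed output feedback controller of the form (3) and accumulates the discounted cost. All matrices, gains and the horizon length are hypothetical illustration values, not taken from the examples of Sect. 4.

```python
import numpy as np

# Hypothetical plant, reference generator, weights and gains, for illustration only.
A = np.array([[0.8, 0.2], [0.0, 0.7]])
B = np.array([[0.1], [0.5]])
C = np.eye(2)
F = np.array([[np.cos(0.3), np.sin(0.3)],
              [-np.sin(0.3), np.cos(0.3)]])      # slowly rotating reference (2)
Q = np.diag([10.0, 10.0])
R = np.array([[1.0]])
beta = 0.95                                      # discount factor of (4)
K1 = np.array([[0.5, 0.1]])                      # fixed output-feedback gains of (3)
K2 = np.array([[-0.4, -0.1]])

def discounted_cost(K1, K2, x0, r0, steps=500):
    """Accumulate the discounted tracking cost (4) along a closed-loop trajectory."""
    x, r, J = x0.copy(), r0.copy(), 0.0
    for k in range(steps):
        y = C @ x
        u = -K1 @ y - K2 @ r                     # controller (3)
        e = y - r
        J += beta ** k * (e.T @ Q @ e + u.T @ R @ u).item()
        x = A @ x + B @ u                        # plant (1)
        r = F @ r                                # reference generator (2)
    return J

print(discounted_cost(K1, K2, x0=np.zeros((2, 1)), r0=np.array([[1.0], [0.5]])))
```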

According to the performance index, we can define the optimal value function \( V^{*} \) as:

$$ \begin{aligned} V^{*} (x_{k} ,r_{k} )\, & = \,\mathop {\hbox{min} }\limits_{u} \sum\limits_{i = k}^{\infty } {\beta^{i - k} [(y_{i} \, - \,r_{i} )^{T} Q(y_{i} \, - \,r_{i} )\, + \,u_{i}^{T} Ru_{i} ]} \\ & = \,\mathop {\hbox{min} }\limits_{u} \sum\limits_{i = k}^{\infty } {\beta^{i - k} [(Cx_{i} \, - \,r_{i} )^{T} Q(Cx_{i} \, - \,r_{i} )\, + \,u_{i}^{T} Ru_{i} ]} \\ \end{aligned} $$
(6)

Then, the \( Q \) function can be expressed as:

$$ Q(x_{k} ,\,r_{k} ,\,u_{k} )\, = \,(y_{k} \, - \,r_{k} )^{T} Q(y_{k} \, - \,r_{k} )\, + \,u_{k}^{T} Ru_{k} \, + \,\sum\limits_{i = k + 1}^{\infty } [ (y_{i} \, - \,r_{i} )^{T} Q(y_{i} \, - \,r_{i} )\, + \,u_{i}^{T} Ru_{i} ] $$
(7)

The optimal function \( Q^{*} \) can be expressed as:

$$ Q^{*} (x_{k} ,\,r_{k} ,\,u_{k} )\, = \,(y_{k} \, - \,r_{k} )^{T} Q(y_{k} \, - \,r_{k} )\, + \,u_{k}^{T} Ru_{k} \, + \,V^{*} (x_{k + 1} ,\,r_{k + 1} ) $$
(8)

For the convenience of calculation and understanding, (8) can be rewritten as:

$$ Q^{*} (x_{k} ,\,r_{k} ,\,u_{k} )\, = \,\left[ {\begin{array}{*{20}c} {y_{k} } \\ {r_{k} } \\ {u_{k} } \\ \end{array} } \right]^{T} \overline{Q} \left[ {\begin{array}{*{20}c} {y_{k} } \\ {r_{k} } \\ {u_{k} } \\ \end{array} } \right]\, + \,V^{*} (x_{k + 1} ,\,r_{k + 1} ) $$
(9)

where \( \overline{Q} \) can be written as:

$$ \overline{Q} \, = \,\left[ {\begin{array}{*{20}c} Q & { - Q} & 0 \\ { - Q} & Q & 0 \\ 0 & 0 & R \\ \end{array} } \right] $$
(10)
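As a small aid, the block matrix \( \overline{Q} \) of (10) can be assembled directly from the weights of (4); a sketch with hypothetical dimensions follows.

```python
import numpy as np

# Assemble Q_bar of (10) from the tracking weight Q (p x p) and control weight R (m x m).
Q = np.diag([10.0, 10.0])
R = np.array([[1.0]])
p, m = Q.shape[0], R.shape[0]
Q_bar = np.block([
    [ Q, -Q,               np.zeros((p, m))],
    [-Q,  Q,               np.zeros((p, m))],
    [np.zeros((m, p)), np.zeros((m, p)), R],
])
```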

From the above, the relationship between the optimal value function \( V^{*} \) and the optimal \( Q \) function is:

$$ V^{*} (x_{k} ,\,r_{k} )\, = \,Q^{*} (x_{k} ,\,r_{k} ,\,u_{k}^{*} ) $$
(11)

Since system (1) is controllable and the cost is quadratic, the optimal value function \( V^{*} \) can be expressed as [15]:

$$ V^{*} (x_{k} ,\,r_{k} )\, = \,\left[ {\begin{array}{*{20}c} {x_{k} } \\ {r_{k} } \\ \end{array} } \right]^{T} P\left[ {\begin{array}{*{20}c} {x_{k} } \\ {r_{k} } \\ \end{array} } \right] $$
(12)

The optimal \( Q \) function can be expressed as:

$$ Q^{*} (x_{k} ,\,r_{k} ,\,u_{k} )\, = \,\left[ {\begin{array}{*{20}c} {x_{k} } \\ {r_{k} } \\ {u_{k} } \\ \end{array} } \right]^{T} H\left[ {\begin{array}{*{20}c} {x_{k} } \\ {r_{k} } \\ {u_{k} } \\ \end{array} } \right] $$
(13)

The matrix \( H \) can be expressed as follows:

$$ H\,{ = }\,\left[ {\begin{array}{*{20}c} {H_{xx} } & {H_{xr} } & {H_{xu} } \\ {H_{rx} } & {H_{rr} } & {H_{ru} } \\ {H_{ux} } & {H_{ur} } & {H_{uu} } \\ \end{array} } \right]\, = \,\left[ {\begin{array}{*{20}c} {A^{T} PA\, + \,C^{T} QC} & {A^{T} PF\, - \,C^{T} Q} & {A^{T} PB} \\ {F^{T} PA\, - \,QC} & {F^{T} PF\, + \,Q} & {F^{T} PB} \\ {B^{T} PA} & {B^{T} PF} & {B^{T} PB\,{ + }\,R} \\ \end{array} } \right] $$
(14)

According to the necessary condition for optimality, setting \( \frac{{\partial Q^{*} (x_{k} ,\,r_{k} ,\,u_{k} )}}{{\partial u_{k} }}\, = \,0 \) yields the following.

$$ \left\{ {\begin{array}{*{20}c} {K_{1} C\, = \,H_{uu}^{ - 1} H_{xu}^{T} } \\ {K_{2} \, = \,H_{uu}^{ - 1} H_{ru}^{T} } \\ \end{array} } \right. $$
(15)

From (15), \( K_{1} \) cannot be obtained directly when the matrix \( C \) is unknown. Assuming \( C \) has full column rank, the output equation can be inverted as follows.

$$ (C^{T} C)^{ - 1} C^{T} y_{k} \, = \,x_{k} $$
(16)
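A quick numerical check of (16), assuming \( C \) has full column rank (it is square and invertible in Example 1 below), can be done as follows.

```python
import numpy as np

# State recovery (16): with C of full column rank, (C^T C)^{-1} C^T is a left
# inverse of C and maps the measured output back to the state.
C = np.array([[1.0, -2.0], [-1.0, 4.0]])       # square and invertible, as in Example 1
x = np.array([[0.3], [-0.7]])
y = C @ x                                      # output equation of (1)
C_left = np.linalg.solve(C.T @ C, C.T)         # (C^T C)^{-1} C^T
print(np.allclose(C_left @ y, x))              # True
```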

By substituting (16) into (9) above, we can obtain a new optimal \( Q \) function equation:

$$ \begin{aligned} & Q^{*} (x_{k} ,\,r_{k} ,\,u_{k} )\, = \,(y_{k} \, - \,r_{k} )^{T} Q(y_{k} \, - \,r_{k} )\, + \,u_{k}^{T} Ru_{k} \, + \,\left[ {\begin{array}{*{20}c} {x_{k + 1} } \\ {r_{k + 1} } \\ \end{array} } \right]^{T} P\left[ {\begin{array}{*{20}c} {x_{k + 1} } \\ {r_{k + 1} } \\ \end{array} } \right] \\ & { = }\,\left[ {\begin{array}{*{20}c} {y_{k} } \\ {r_{k} } \\ {u_{k} } \\ \end{array} } \right]^{T} \left[ {\begin{array}{*{20}c} Q & { - Q} & 0 \\ { - Q} & Q & 0 \\ 0 & 0 & R \\ \end{array} } \right]\left[ {\begin{array}{*{20}c} {y_{k} } \\ {r_{k} } \\ {u_{k} } \\ \end{array} } \right]\,{ + }\,\left[ {\begin{array}{*{20}c} {y_{k} } \\ {r_{k} } \\ {u_{k} } \\ \end{array} } \right]^{\text{T}} \left[ {\begin{array}{*{20}c} {(C^{T} C)^{ - 1} C^{T} } & 0 & 0 \\ 0 & I & 0 \\ 0 & 0 & I \\ \end{array} } \right]^{\text{T}} \left[ {\begin{array}{*{20}c} A & 0 & B \\ 0 & F & 0 \\ \end{array} } \right]^{T} P \\ & \times \left[ {\begin{array}{*{20}c} A & 0 & B \\ 0 & F & 0 \\ \end{array} } \right]\left[ {\begin{array}{*{20}c} {(C^{T} C)^{ - 1} C^{T} } & 0 & 0 \\ 0 & I & 0 \\ 0 & 0 & I \\ \end{array} } \right]\left[ {\begin{array}{*{20}c} {y_{k} } \\ {r_{k} } \\ {u_{k} } \\ \end{array} } \right]\,{ = }\,\left[ {\begin{array}{*{20}c} {y_{k} } \\ {r_{k} } \\ {u_{k} } \\ \end{array} } \right]^{T} \left[ {\begin{array}{*{20}c} {\overline{H}_{yy} } & {\overline{H}_{yr} } & {\overline{H}_{yu} } \\ {\overline{H}_{ry} } & {\overline{H}_{rr} } & {\overline{H}_{ru} } \\ {\overline{H}_{uy} } & {\overline{H}_{ur} } & {\overline{H}_{uu} } \\ \end{array} } \right]\left[ {\begin{array}{*{20}c} {y_{k} } \\ {r_{k} } \\ {u_{k} } \\ \end{array} } \right] \\ & = \,Z_{k}^{T} \overline{H} Z_{k} \\ \end{aligned} $$
(17)

where

$$ \begin{aligned} & \overline{H} \, = \,\left[ {\begin{array}{*{20}c} {\overline{H}_{yy} } & {\overline{H}_{yr} } & {\overline{H}_{yu} } \\ {\overline{H}_{ry} } & {\overline{H}_{rr} } & {\overline{H}_{ru} } \\ {\overline{H}_{uy} } & {\overline{H}_{ur} } & {\overline{H}_{uu} } \\ \end{array} } \right] \\ & = \,\left[ {\begin{array}{*{20}c} {[C^{T} (CC^{T} )^{ - 1} ]^{T} (A^{T} PA)[C^{T} (CC^{T} )^{ - 1} ]\,{ + }\,Q} \\ {(F^{T} PA)[C^{T} (CC^{T} )^{ - 1} ]\, - \,Q} \\ {B^{T} PA[C^{T} (CC^{T} )^{ - 1} ]} \\ \end{array} } \right.\begin{array}{*{20}c} {(C^{T} C)^{ - 1} C^{T} A^{T} PF\, - \,Q} \\ {F^{T} PF\, + \,Q} \\ {B^{T} PF} \\ \end{array} \left. {\begin{array}{*{20}c} {(C^{T} C)^{ - 1} C^{T} A^{T} PB} \\ {F^{T} PB} \\ {B^{T} PB\, + \,R} \\ \end{array} } \right] \\ \end{aligned} $$
(18)

Setting \( \frac{{\partial Q^{*} (x_{k} ,r_{k} ,u_{k} )}}{{\partial u_{k} }}\, = \,0 \) now yields the following forms

$$ \left\{ {\begin{array}{*{20}c} {K_{1} \, = \,\overline{H}_{uu}^{ - 1} \overline{H}_{yu}^{T} } \\ {K_{2} \, = \,\overline{H}_{uu}^{ - 1} \overline{H}_{ru}^{T} } \\ \end{array} } \right. $$
(19)
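The following sketch shows how, once the kernel \( P \) of (12) is available, \( \overline{H} \) can be assembled as in (17)-(18) and the output feedback gains read off as in (19). The model matrices are hypothetical illustration values and \( P \) is a placeholder; the discount factor \( \beta \) of (4) is kept explicit, and \( \beta = 1 \) reproduces (18) as printed.

```python
import numpy as np

# Hypothetical model used only for illustration.
A = np.array([[0.8, 0.2], [0.0, 0.7]])
B = np.array([[0.1], [0.5]])
C = np.array([[1.0, -2.0], [-1.0, 4.0]])
F = np.array([[np.cos(0.3), np.sin(0.3)], [-np.sin(0.3), np.cos(0.3)]])
Q = np.diag([10.0, 10.0]); R = np.array([[1.0]])
n, p, q, m = 2, 2, 2, 1                                     # state, output, reference, input sizes

def hbar_from_P(P, beta=1.0):
    """Assemble H_bar per (17)-(18); beta = 1 matches (18) exactly."""
    M = np.linalg.solve(C.T @ C, C.T)                       # left inverse of C, see (16)
    W = np.block([[A @ M, np.zeros((n, q)), B],             # maps [y; r; u] to [x_{k+1}; r_{k+1}]
                  [np.zeros((q, p)), F, np.zeros((q, m))]])
    Q_bar = np.block([[Q, -Q, np.zeros((p, m))],
                      [-Q, Q, np.zeros((p, m))],
                      [np.zeros((m, p)), np.zeros((m, p)), R]])
    return Q_bar + beta * W.T @ P @ W

def gains_from_hbar(H):
    """Read off K1 and K2 from the blocks of H_bar as in (19)."""
    Huu = H[p + q:, p + q:]
    K1 = np.linalg.solve(Huu, H[:p, p + q:].T)              # H_uu^{-1} H_yu^T
    K2 = np.linalg.solve(Huu, H[p:p + q, p + q:].T)         # H_uu^{-1} H_ru^T
    return K1, K2

P = np.eye(n + q)                                           # placeholder kernel of (12)
K1, K2 = gains_from_hbar(hbar_from_P(P))
print(K1, K2)
```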

According to the above relation, the Riccati equation for the optimal \( Q \) function is as follows:

$$ Z_{k}^{T} \overline{H} Z_{k} \, = \,Z_{k}^{T} \overline{Q} Z_{k} \, + \,Z_{k + 1}^{T} \overline{H} Z_{k + 1} $$
(20)

3 Data Driven Q-Learning Algorithm

This section presents an off-policy Q-learning algorithm for designing the output feedback optimal tracking control strategy, under which the system output tracks the reference signal in an approximately optimal way.

First, the on-policy Q-learning algorithm is given; the off-policy Q-learning algorithm is then derived from it.

Algorithm 1: On-policy Q-learning algorithm.

  1. Give stabilizing controller gains \( K_{1}^{j} \) and \( K_{2}^{j} \), and set the iteration index \( j \) to 0. The control input is defined as

     $$ u_{k}^{j} = - K_{1}^{j} y_{k} - K_{2}^{j} r_{k} $$
     (21)

  2. Evaluate the control policy by solving the following equation for the matrix \( \overline{H}^{j + 1} \).

     $$ Z_{k}^{T} \overline{H}^{j + 1} Z_{k} \, = \,(y_{k} - r_{k} )^{T} Q(y_{k} - r_{k} ) + (u_{k}^{j} )^{T} Ru_{k}^{j} + Z_{k + 1}^{T} \overline{H}^{j + 1} Z_{k + 1} $$
     (22)

  3. Update the control policy.

     $$ \begin{aligned} u_{k}^{j + 1} = - K_{1}^{j + 1} y_{k} - K_{2}^{j + 1} r_{k} \hfill \\ \left\{ {\begin{array}{*{20}c} {K_{1}^{j + 1} = (\bar{H}_{uu}^{j + 1} )^{ - 1} (\bar{H}_{yu}^{j + 1} )^{T} } \\ {K_{2}^{j + 1} = (\bar{H}_{uu}^{j + 1} )^{ - 1} (\bar{H}_{ru}^{j + 1} )^{T} } \\ \end{array} } \right. \hfill \\ \end{aligned} $$
     (23)

  4. If \( \left\| {\bar{H}^{j + 1} \, - \,\bar{H}^{j} } \right\| \le \varepsilon \) (\( \varepsilon \) is a small positive number), stop the iteration. Otherwise, set \( j = j + 1 \) and return to Step 2.
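To illustrate the iteration (21)-(23), the following model-based Python sketch emulates Algorithm 1: the policy-evaluation Step 2 is carried out with a Lyapunov solve that uses (here hypothetical) model matrices, whereas the data-driven algorithm derived below replaces this step with least squares on measured data. The discount factor \( \beta \) of (4) is kept explicit; \( \beta = 1 \) corresponds to (22) as printed.

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

# Hypothetical, open-loop-stable plant so that zero initial gains are admissible.
A = np.array([[0.8, 0.2], [0.0, 0.7]])
B = np.array([[0.1], [0.5]])
C = np.array([[1.0, -2.0], [-1.0, 4.0]])
F = np.array([[np.cos(0.3), np.sin(0.3)], [-np.sin(0.3), np.cos(0.3)]])
Q = np.diag([10.0, 10.0]); R = np.array([[1.0]])
beta = 0.95
n, p, q, m = 2, 2, 2, 1
M = np.linalg.solve(C.T @ C, C.T)                       # left inverse of C, see (16)

def evaluate_policy(K1, K2):
    """Step 2 (model-based stand-in): value kernel P of the policy u = -K1 y - K2 r."""
    G = np.block([[A - B @ K1 @ C, -B @ K2],
                  [np.zeros((q, n)), F]])               # closed-loop dynamics of [x; r]
    KC = np.hstack([K1 @ C, K2])                        # u = -KC [x; r]
    E = np.hstack([C, -np.eye(q)])                      # y - r = E [x; r]
    Qcl = E.T @ Q @ E + KC.T @ R @ KC                   # per-step cost kernel
    return solve_discrete_lyapunov(np.sqrt(beta) * G.T, Qcl)   # P = Qcl + beta G^T P G

def improve_policy(P):
    """Step 3: assemble H_bar as in (18) and update the gains as in (23)."""
    W = np.block([[A @ M, np.zeros((n, q)), B],
                  [np.zeros((q, p)), F, np.zeros((q, m))]])
    Q_bar = np.block([[Q, -Q, np.zeros((p, m))],
                      [-Q, Q, np.zeros((p, m))],
                      [np.zeros((m, p)), np.zeros((m, p)), R]])
    H = Q_bar + beta * W.T @ P @ W
    Huu = H[p + q:, p + q:]
    K1 = np.linalg.solve(Huu, H[:p, p + q:].T)
    K2 = np.linalg.solve(Huu, H[p:p + q, p + q:].T)
    return H, K1, K2

K1, K2 = np.zeros((m, p)), np.zeros((m, q))             # Step 1: initial (stabilizing) gains
H_prev = np.zeros((p + q + m, p + q + m))
for j in range(50):
    P = evaluate_policy(K1, K2)                         # Step 2
    H, K1, K2 = improve_policy(P)                       # Step 3
    if np.linalg.norm(H - H_prev) <= 1e-8:              # Step 4
        break
    H_prev = H
print(K1, K2)
```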

In view of the advantages of off-policy Q-learning, we now propose an off-policy algorithm based on the Q-function and adopt a model-free, data-driven scheme to solve the linear quadratic tracking problem of discrete-time systems. The target control strategy is introduced into the system dynamics to obtain the following equations, where \( u_{k} \) is the behavior control strategy and \( u_{k}^{j} \) is the target control strategy.

$$ y_{k + 1} \, = \,Cx_{k + 1} \, = \,CAx_{k} \, + \,CBu_{k} \, = \,CAC^{T} (CC^{T} )^{ - 1} y_{k} \, + \,CBu_{k} $$
(24)
$$ \left[ {\begin{array}{*{20}c} {y_{k + 1} } \\ {r_{{{\text{k}} + 1}} } \\ \end{array} } \right] = \left[ {\begin{array}{*{20}c} {CAC^{T} (CC^{T} )^{ - 1} } & 0 \\ 0 & F \\ \end{array} } \right] \times \left[ {\begin{array}{*{20}c} {y_{k} } \\ {r_{k} } \\ \end{array} } \right] + \left[ {\begin{array}{*{20}c} {CB} \\ 0 \\ \end{array} } \right] \times u_{k}^{j} + \left[ {\begin{array}{*{20}c} {CB} \\ 0 \\ \end{array} } \right] \times (u_{k} - u_{k}^{j} ) $$
(25)

Substituting (25) into Eq. (22), one has

$$ \begin{aligned} & \left[ {\begin{array}{*{20}c} {y_{k} } \\ {r_{k} } \\ \end{array} } \right]^{T} P^{j + 1} \left[ {\begin{array}{*{20}c} {y_{k} } \\ {r_{k} } \\ \end{array} } \right] \\ & - \,\left( {\left[ {\begin{array}{*{20}c} {y_{k + 1} } \\ {r_{k + 1} } \\ \end{array} } \right]\, - \,\left[ {\begin{array}{*{20}c} {CB} \\ 0 \\ \end{array} } \right](u_{k} \, - \,u_{k}^{j} )} \right)^{T} P^{j + 1} \left( {\left[ {\begin{array}{*{20}c} {y_{k + 1} } \\ {r_{k + 1} } \\ \end{array} } \right]\, - \,\left[ {\begin{array}{*{20}c} {CB} \\ 0 \\ \end{array} } \right](u_{k} \, - \,u_{k}^{j} )} \right) \\ & = \,(y_{k} \, - \,r_{k} )^{T} Q(y_{k} \, - \,r_{k} )\, + \,(u_{k}^{j} )^{T} Ru_{k}^{j} \\ \end{aligned} $$
(26)

where

$$ P^{j + 1} \, = \,\left[ {\begin{array}{*{20}c} I & 0 \\ 0 & I \\ { - K_{1}^{j} } & { - K_{2}^{j} } \\ \end{array} } \right]^{T} \overline{H}^{{{\text{j + }}1}} \left[ {\begin{array}{*{20}c} I & 0 \\ 0 & I \\ { - K_{1}^{j} } & { - K_{2}^{j} } \\ \end{array} } \right] $$
(27)

Expanding the quadratic terms, (26) can be rewritten as:

$$ \begin{aligned} & \left[ {\begin{array}{*{20}c} {y_{k} } \\ {r_{k} } \\ \end{array} } \right]^{T} P^{j + 1} \left[ {\begin{array}{*{20}c} {y_{k} } \\ {r_{k} } \\ \end{array} } \right]\, - \,\left[ {\begin{array}{*{20}c} {y_{k + 1} } \\ {r_{{{\text{k}} + 1}} } \\ \end{array} } \right]^{T} P^{j + 1} \left[ {\begin{array}{*{20}c} {y_{k + 1} } \\ {r_{{{\text{k}} + 1}} } \\ \end{array} } \right]\, + \,2\left[ {\begin{array}{*{20}c} {y_{k + 1} } \\ {r_{{{\text{k}} + 1}} } \\ \end{array} } \right]^{T} P^{j + 1} \left[ {\begin{array}{*{20}c} {CB} \\ 0 \\ \end{array} } \right]\, \times \,(u_{k} \, - \,u_{k}^{j} ) \\ & - \,(u_{k} \, - \,u_{k}^{j} )^{T} (CB)^{T} P^{j + 1} (CB)(u_{k} \, - \,u_{k}^{j} )\, = \,(y_{k} \, - \,r_{k} )^{T} Q(y_{k} \, - \,r_{k} )\, + \,(u_{k}^{j} )^{T} Ru_{\text{k}}^{j} \\ \end{aligned} $$
(28)

where

$$ \begin{aligned} & 2\left[ {\begin{array}{*{20}c} {y_{k + 1} } \\ {r_{{{\text{k}} + 1}} } \\ \end{array} } \right]^{T} \,P^{j + 1} \left[ {\begin{array}{*{20}c} {CB} \\ 0 \\ \end{array} } \right]\, \times \,(u_{k} \, - \,u_{k}^{j} )\,{ = }\,2\left[ {\begin{array}{*{20}c} {CAC^{T} (CC^{T} )^{ - 1} y_{k} \, + \,CBu_{k} } \\ {Fr_{k} } \\ \end{array} } \right]^{T} \,P^{j + 1} \\ & \times \,\left[ {\begin{array}{*{20}c} {CB} \\ 0 \\ \end{array} } \right](u_{k} \, - \,u_{k}^{j} )\, = \,2y_{k}^{T} [C^{T} (CC^{T} )^{ - 1} ]^{T} (CA)^{T} P^{j + 1} CB(u_{k} \, - \,u_{k}^{j} ) \\ & + \,2u_{k}^{T} (CB)^{T} P^{j + 1} CB(u_{k} \, - \,u_{k}^{j} )\, + \,2r_{k}^{T} F^{T} P^{j + 1} CB(u_{k} \, - \,u_{k}^{j} ) \\ \end{aligned} $$
(29)

Further, one has

$$ \begin{aligned} & \left[ {\begin{array}{*{20}c} {y_{k} } \\ {r_{k} } \\ \end{array} } \right]^{T} \,P^{j + 1} \left[ {\begin{array}{*{20}c} {y_{k} } \\ {r_{k} } \\ \end{array} } \right]\, - \,\left[ {\begin{array}{*{20}c} {y_{k + 1} } \\ {r_{{{\text{k}} + 1}} } \\ \end{array} } \right]^{T} \,P^{j + 1} \left[ {\begin{array}{*{20}c} {y_{k + 1} } \\ {r_{{{\text{k}} + 1}} } \\ \end{array} } \right] \\ & + \,2y_{k}^{T} [C^{T} (CC^{T} )^{ - 1} ]^{T} (CA)^{T} P^{j + 1} CB(u_{k} \, - \,u_{k}^{j} )\, + \,2u_{k}^{T} (CB)^{T} P^{j + 1} CB(u_{k} \, - \,u_{k}^{j} ) \\ & + \,2r_{k}^{T} F^{T} P^{j + 1} CB(u_{k} \, - \,u_{k}^{j} )\, - \,(u_{k} \, - \,u_{k}^{j} )^{T} (CB)^{T} P^{j + 1} (CB)(u_{k} \, - \,u_{k}^{j} ) \\ & { = }\,(y_{k} \, - \,r_{k} )^{T} Q(y_{k} \, - \,r_{k} )\, + \,(u_{k}^{j} )^{T} Ru_{\text{k}}^{j} \\ \end{aligned} $$
(30)

Rewriting (30) as a linear regression in the unknown entries of \( \overline{H}^{j + 1} \) yields

$$ \theta_{k}^{j} L^{j + 1} \, = \,\rho_{k}^{j} $$
(31)

where

$$ \rho_{k}^{j} \,{ = }\,(y_{k} \, - \,r_{k} )^{T} Q(y_{k} \, - \,r_{k} )\, + \,(u_{k} )^{T} Ru_{k} $$
$$ \begin{aligned} & L^{j + 1} \,{ = }\,\left[ {(vec(L_{1}^{j + 1} ))^{T} \begin{array}{*{20}c} {(vec(L_{2}^{j + 1} ))^{T} } & {(vec(L_{3}^{j + 1} ))^{T} } \\ \end{array} } \right. \\ & \left. {\begin{array}{*{20}c} {(vec(L_{4}^{j + 1} ))^{T} } & {(vec(L_{5}^{j + 1} ))^{T} } & {(vec(L_{6}^{j + 1} ))^{T} } \\ \end{array} } \right]^{\text{T}} \\ \end{aligned} $$

\( L_{1}^{j + 1} \,{ = }\,\overline{H}_{yy}^{j + 1} , \) \( L_{2}^{j + 1} \,{ = }\,\overline{H}_{yr}^{j + 1} , \) \( L_{3}^{j + 1} \,{ = }\,\overline{H}_{yu}^{j + 1} , \) \( L_{4}^{j + 1} \,{ = }\,\overline{H}_{rr}^{j + 1} , \) \( L_{5}^{j + 1} \,{ = }\,\overline{H}_{ru}^{j + 1} , \) \( L_{6}^{j + 1} \,{ = }\,\overline{H}_{uu}^{j + 1} , \)

$$ \theta_{k}^{j} \,{ = }\,\left[ {\begin{array}{*{20}c} {\theta_{1}^{j} } & {\theta_{2}^{j} } & {\theta_{3}^{j} } & {\theta_{4}^{j} } & {\theta_{5}^{j} } & {\theta_{6}^{j} } \\ \end{array} } \right] $$
$$ \theta_{1}^{\text{j}} \,{ = }\,y_{k}^{T} \, \otimes \,y_{k}^{T} \, - \,y_{k + 1}^{T} \, \otimes \,y_{k + 1}^{T} $$
$$ \theta_{2}^{j} \,{ = }\,2y_{k}^{T} \, \otimes \,r_{k}^{T} \, - \,2y_{k + 1}^{T} \, \otimes \,r_{k + 1}^{T} $$
$$ \theta_{3}^{\text{j}} \,{ = }\,2y_{k}^{T} \, \otimes \,{\text{u}}_{k}^{T} \, - \,2y_{k + 1}^{T} \, \otimes \,(u_{k + 1}^{j} )^{T} $$
$$ \theta_{4}^{j} \,{ = }\,r_{k}^{T} \, \otimes \,r_{k}^{T} \, - \,r_{k + 1}^{T} \, \otimes \,r_{k + 1}^{T} $$
$$ \theta_{5}^{\text{j}} \,{ = }\,2{\text{r}}_{k}^{T} \, \otimes \,u_{k}^{\text{T}} \, - \,2{\text{r}}_{k + 1}^{T} \, \otimes \,(u_{k + 1}^{j} )^{T} $$
$$ \theta_{6}^{\text{j}} \,{ = }\,{\text{u}}_{k}^{T} \, \otimes \,{\text{u}}_{k}^{T} \, - \,(u_{k + 1}^{j} )^{T} \, \otimes \,(u_{k + 1}^{j} )^{T} $$
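Equation (31) is a linear regression in the unknown entries of \( \overline{H}^{j + 1} \): collecting \( \rho_{k}^{j} \) and \( \theta_{k}^{j} \) over sufficiently many time steps, \( L^{j + 1} \) is obtained by least squares, the gains are then updated via (19), and the procedure repeats until \( \overline{H}^{j} \) converges, all without knowing \( A \), \( B \), \( C \) or \( F \). The Python sketch below illustrates this. The system matrices are hypothetical and are used only to generate data, never in the learning update; for numerical convenience the full symmetric \( \overline{H} \) is estimated by half-vectorization instead of the six blocks \( L_{1},\ldots,L_{6} \) of (31) (the information content is the same); and the discount factor \( \beta \) of (4) is kept explicit, with \( \beta = 1 \) matching (31) as printed.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical plant and reference generator, used only to produce measurements.
A = np.array([[0.8, 0.2], [0.0, 0.7]])
B = np.array([[0.1], [0.5]])
C = np.array([[1.0, -2.0], [-1.0, 4.0]])
F = np.array([[np.cos(0.3), np.sin(0.3)], [-np.sin(0.3), np.cos(0.3)]])
Q = np.diag([10.0, 10.0]); R = np.array([[1.0]])
beta = 0.95
p, q, m = 2, 2, 1
nz = p + q + m

def quad_features(z):
    """phi(z) such that z^T H z = phi(z) @ theta for symmetric H stored by its upper triangle."""
    z = z.ravel()
    feats = []
    for i in range(nz):
        for j in range(i, nz):
            feats.append(z[i] * z[j] if i == j else 2.0 * z[i] * z[j])
    return np.array(feats)

def mat_from_upper(theta):
    """Rebuild the symmetric matrix H_bar from its upper-triangular parameters."""
    H = np.zeros((nz, nz))
    k = 0
    for i in range(nz):
        for j in range(i, nz):
            H[i, j] = H[j, i] = theta[k]
            k += 1
    return H

# Collect data once under a behavior policy (here: probing noise only).
x = np.array([[0.5], [-0.3]]); r = np.array([[1.0], [0.5]])
data = []
for k in range(200):
    y = C @ x
    noise = 0.2 * np.sin(0.7 * k) + 0.1 * np.cos(1.9 * k) + 0.05 * rng.standard_normal()
    u = np.array([[noise]])                          # behavior input u_k
    x_next, r_next = A @ x + B @ u, F @ r
    data.append((y, r, u, C @ x_next, r_next))
    x, r = x_next, r_next

# Off-policy iterations: only the recorded data are used from here on.
K1, K2 = np.zeros((m, p)), np.zeros((m, q))          # initial target gains
for j in range(30):
    Phi, rho = [], []
    for (y, rr, u, y1, r1) in data:
        zk = np.vstack([y, rr, u])                   # behavior action at time k
        u1 = -K1 @ y1 - K2 @ r1                      # target action at time k + 1
        zk1 = np.vstack([y1, r1, u1])
        Phi.append(quad_features(zk) - beta * quad_features(zk1))
        e = y - rr
        rho.append((e.T @ Q @ e + u.T @ R @ u).item())
    theta, *_ = np.linalg.lstsq(np.array(Phi), np.array(rho), rcond=None)
    H = mat_from_upper(theta)
    Huu = H[p + q:, p + q:]
    K1_new = np.linalg.solve(Huu, H[:p, p + q:].T)   # gain update as in (19)
    K2_new = np.linalg.solve(Huu, H[p:p + q, p + q:].T)
    converged = np.linalg.norm(K1_new - K1) + np.linalg.norm(K2_new - K2) < 1e-9
    K1, K2 = K1_new, K2_new
    if converged:
        break
print("learned output-feedback gains:", K1, K2)
```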

4 Simulation Experiment

In this section, the proposed algorithm is applied in simulation, and the results are used to verify its effectiveness.

Example 1:

Consider the following system:

$$ x_{k + 1} \, = \,\left[ {\begin{array}{*{20}c} { - 1} & 2 \\ {2.2} & {1.7} \\ \end{array} } \right]x_{k} \, + \,\left[ {\begin{array}{*{20}c} 2 \\ {1.6} \\ \end{array} } \right]u_{k} $$
(32)
$$ y_{k} \, = \,\left[ {\begin{array}{*{20}c} 1 & { - 2} \\ { - 1} & 4 \\ \end{array} } \right]x_{k} $$
(33)

The reference signal generator is:

$$ r_{k + 1} \, = \,\left[ {\begin{array}{*{20}c} { - 1} & 0 \\ 0 & { - 1} \\ \end{array} } \right]r_{k} $$
(34)

Choose \( Q\, = \,\left[ {\begin{array}{*{20}c} {1000} & 0 \\ 0 & {10} \\ \end{array} } \right] \) and \( R\, = \,1 \). The optimal matrices \( H \) and \( \overline{H} \) can be obtained from (13) and (17), respectively, and the control gains \( K_{1} \) and \( K_{2} \) of the optimal tracking control can be obtained from (19).

$$ H\, = \,\left[ {\begin{array}{*{20}c} {55014.1382} & { - 47201.9872} \\ { - 47201.9872} & {124780.7167} \\ { - 10241.4410} & {9344.6976} \\ {193.0046} & { - 162.08466} \\ { - 48231.9264} & {124282.7373} \\ \end{array} } \right.\left. {\begin{array}{*{20}c} { - 10241.4410} & {193.0046} & { - 48231.9264} \\ {9344.6976} & { - 162.08466} & {124282.7373} \\ {2067.8905} & {51.2434} & {9527.2908} \\ {51.2434} & {49.1044} & { - 166.5050} \\ {9527.2908} & { - 166.5050} & {123826.6900} \\ \end{array} } \right] $$
(35)
$$ \overline{H} \,\, = \left[ {\begin{array}{*{20}c} {157847.7575} & {70420.4747} \\ {70420.4747} & {39017.3301} \\ { - 16810.5331} & { - 5569.0922} \\ {304.5860} & {101.5813} \\ { - 34322.4842} & {13909.4422} \\ \end{array} } \right.\left. {\begin{array}{*{20}c} { - 16810.5331} & {304.5860} & { - 34322.4842} \\ { - 5569.0922} & {101.5813} & {13909.4422} \\ {3067.8905} & {51.2434} & {9527.2908} \\ {51.2434} & {59.1044} & { - 166.5050} \\ {9527.2908} & { - 166.5050} & {123826.6900} \\ \end{array} } \right] $$
(36)
$$ \left\{ {\begin{array}{*{20}c} {K_{1} \, = \,\left[ {\begin{array}{*{20}c} {0.2772} & { - 0.1123} \\ \end{array} } \right]} \\ {K_{2} \, = \,\left[ {\begin{array}{*{20}c} {0.0769} & {0.0013} \\ \end{array} } \right]} \\ \end{array} } \right. $$
(37)

After 8 iterations, the algorithm converges, and the matrix \( \overline{H}^{8} \) and the control gains \( K_{1}^{8} \) and \( K_{2}^{8} \) of the optimal tracking control are as follows.

$$ \overline{H}^{8} \, = \,\left[ {\begin{array}{*{20}c} {157847.7575} & {70420.4747} \\ {70420.4747} & {39017.3301} \\ { - 16810.5331} & { - 5569.0922} \\ {304.5860} & {101.5813} \\ { - 34322.4842} & {13909.4422} \\ \end{array} } \right.\left. {\begin{array}{*{20}c} { - 16810.5331} & {304.5860} & { - 34322.4842} \\ { - 5569.0922} & {101.5813} & {13909.4422} \\ {3067.8905} & {51.2434} & {9527.2908} \\ {51.2434} & {59.1044} & { - 166.5050} \\ {9527.2908} & { - 166.5050} & {123826.6900} \\ \end{array} } \right] $$
(38)
$$ \left\{ {\begin{array}{*{20}c} {K_{1}^{8} \, = \,\left[ {\begin{array}{*{20}c} {0.2772} & { - 0.1123} \\ \end{array} } \right]} \\ {K_{2}^{8} \, = \,\left[ {\begin{array}{*{20}c} {0.0769} & {0.0013} \\ \end{array} } \right]} \\ \end{array} } \right. $$
(39)

During learning, the optimal tracking controller gains converge; Fig. 1 shows the convergence process.

Fig. 1. Optimal control gain convergence process of tracking controller

In order to obtain the exact solution of the Riccati equation (20) for the optimal Q-function, the condition of sufficient (persistent) excitation must be satisfied, so probing noise is added to the control input during learning (Figs. 2 and 3).
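One common choice of probing noise (an assumption for illustration; the paper does not specify its exact probing signal) is a sum of sinusoids of mixed frequencies plus a small random term, added to the behavior-policy input:

```python
import numpy as np

def probing_noise(k, rng=np.random.default_rng(1)):
    # Persistent generator kept in the default argument so repeated calls differ.
    return (0.2 * np.sin(0.7 * k) + 0.1 * np.sin(1.1 * k + 0.5)
            + 0.1 * np.cos(2.3 * k) + 0.05 * rng.standard_normal())
```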

Fig. 2. System output and reference signal

Fig. 3. Tracking errors using the learned controller

Example 2:

We select the TTS20 three-tank water tank system made by the German company Ingenieurbüro Gurski-Schramm as the simulation object; the tank is shown in Fig. 4. This system is a nonlinear multi-input multi-output plant with two actuators and a digital controller, which meets our requirements. The main structure and overall industrial process of the TTS20 three-tank system are shown in Fig. 5.

Fig. 4. Three-tank system of TTS20

Fig. 5. Structure and industrial process of TTS20

The TTS20 three-tank device consists of three plexiglass cylinders T1, T2 and T3 with cross section A, connected in series through cylindrical pipes with cross section Sn. There is a one-way valve in the outlet pipe of T2, and the outflowing liquid is collected in a reservoir that supplies pump 1 and pump 2. \( H_{\hbox{max} } \) is the highest liquid level; if the level of T1 or T2 exceeds this value, the corresponding pump shuts down automatically. Q1 and Q2 denote the flows of pump 1 and pump 2. In addition, to simulate leakage, each tank has a circular opening with a manually adjustable ball valve on its cross section; the drain valve and leakage flow can describe fault information of the tank. The liquid drawn from the reservoir is injected into T1 and T2 by pump 1 (P1) and pump 2 (P2), respectively, and then flows back to the reservoir through their bottom valves and the T3 drain valve for recycling by P1 and P2, forming a circuit. The levels of T1, T2 and T3 are measured by three pressure level sensors as the measuring elements of the system, and the flows Q1 and Q2 are regulated by the digital controller.

For the TTS20, its model can be written as follows:

$$ \left\{ {\begin{array}{*{20}c} {\left[ {\begin{array}{*{20}c} {\mathop {h_{1} }\limits^{ \bullet } } \\ {\mathop {h_{3} }\limits^{ \bullet } } \\ \end{array} } \right]\, = \,\frac{1}{S}\left[ {\begin{array}{*{20}c} { - Q_{13} } \\ {Q_{13} \, - \,Q_{out} } \\ \end{array} } \right]\, + \,\frac{1}{S}\left[ {\begin{array}{*{20}c} 1 \\ 0 \\ \end{array} } \right]Q_{in} } \\ {y\, = \,h_{1} } \\ \end{array} } \right. $$
(40)

In the model, \( h_{1} \) and \( h_{3} \) are the controlled variables, representing the water levels of tanks T1 and T3. The inflow \( Q_{in} \), i.e. the flow of pump 1, is selected as the control input. The flow from T1 to T3 is \( Q_{13} \, = \,az_{1} S_{n} \text{sgn} (h_{1} \, - \,h_{3} )\sqrt {2g\left| {h_{1} \, - \,\left. {h_{3} } \right|} \right.} \), and the outflow from the bottom of T3 is \( Q_{out} \, = \,az_{2} S_{1} \sqrt {2gh_{3} } \), where \( S_{1} \, = \,S_{n} \, = \,5\, \times \,10^{ - 5} \,\text{m}^{2} \), \( S\, = \,0.154\,\text{m}^{2} \), \( H_{\hbox{max} } \, = \,0.6\,\text{m} \), the flow coefficients are \( az_{1} \, = \,0.48 \) and \( az_{2} \, = \,0.58 \), and \( \text{sgn} ( \bullet ) \) is the sign function. The initial values of \( h_{1} \) and \( h_{3} \) are set to 0, the state and input variables are \( \left[ {\begin{array}{*{20}c} {x_{1} (k)} \\ {x_{2} (k)} \\ \end{array} } \right]\,{ = }\,\left[ {\begin{array}{*{20}c} {h_{1} (k)} \\ {h_{3} (k)} \\ \end{array} } \right] \) and \( u(k)\, = \,Q_{in} (k) \). The discrete-time state space model of the TTS20 is as follows:

$$ x_{k + 1} \, = \,\left[ {\begin{array}{*{20}c} {0.9850} & {0.0107} \\ {0.0078} & {0.9784} \\ \end{array} } \right]x_{k} \, + \,\left[ {\begin{array}{*{20}c} {64.4453} \\ {0.2559} \\ \end{array} } \right]u_{k} $$
(41)
$$ y_{k} \, = \,\left[ {\begin{array}{*{20}c} 1 & 0 \\ \end{array} } \right]x_{k} $$
(42)
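For reference, the nonlinear tank dynamics described above can be simulated directly; the sketch below uses the stated parameters and flow relations, with an assumed Euler step size and constant inflow (the paper does not give the sampling period used to obtain (41)-(42)).

```python
import numpy as np

# Nonlinear TTS20 two-state model (levels of T1 and T3), per the flow relations above.
S, Sn, S1 = 0.154, 5e-5, 5e-5          # cross sections [m^2]
az1, az2 = 0.48, 0.58                  # flow coefficients
g = 9.81

def tank_step(h, q_in, dt=1.0):
    """One forward-Euler step of the level dynamics (40); dt is an assumed step size."""
    h1, h3 = float(h[0]), float(h[1])
    q13 = az1 * Sn * np.sign(h1 - h3) * np.sqrt(2 * g * abs(h1 - h3))
    q_out = az2 * S1 * np.sqrt(2 * g * max(h3, 0.0))
    dh1 = (q_in - q13) / S             # T1: pump inflow minus pipe flow to T3
    dh3 = (q13 - q_out) / S            # T3: pipe inflow minus drain outflow
    return np.array([h1 + dt * dh1, h3 + dt * dh3])

h = np.zeros(2)
for k in range(2000):                  # constant inflow; levels rise toward an equilibrium below H_max
    h = tank_step(h, q_in=5e-5)
print(h)
```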

The reference signal generator is:

$$ r_{k + 1} \, = \,r_{k} $$
(43)

The reference water level is selected as 0.5 m. Choosing \( Q\, = \,10 \) and \( R\, = \,5 \), the optimal Q-function matrices \( H \) and \( \overline{H} \) are obtained, and the control gains \( K_{1} \) and \( K_{2} \) of the optimal tracking control are as follows:

$$ H\, = \,\left[ {\begin{array}{*{20}c} {17.7618} & { - 17.6205} & {507.6889} \\ { - 17.6205} & { - 131.2013} & { - 514.5708} \\ {507.6889} & { - 514.5708} & {33224.5750} \\ \end{array} } \right] $$
(44)
$$ \overline{H} \, = \,\left[ {\begin{array}{*{20}c} {17.7627} & { - 17.8809} & {507.8883} \\ { - 17.8809} & {18.0010} & { - 515.6235} \\ {507.8883} & { - 515.6235} & {33234.4549} \\ \end{array} } \right] $$
(45)
$$ \left\{ {\begin{array}{*{20}c} {K_{1} \, = \,\left[ { - 0.0153} \right]} \\ {K_{2} \, = \,\left[ {0.0155} \right]} \\ \end{array} } \right. $$
(46)

After 10 iterations, the algorithm converges, and the optimal Q-function matrix \( \overline{H}^{10} \) and the gains \( K_{1}^{10} \) and \( K_{2}^{10} \) of the optimal tracking control are as follows.

$$ \overline{H}^{10} \, = \,\left[ {\begin{array}{*{20}c} {17.7627} & { - 17.8809} & {507.8883} \\ { - 17.8809} & {18.0010} & { - 515.6235} \\ {507.8883} & { - 515.6235} & {33234.4549} \\ \end{array} } \right] $$
(47)
$$ \left\{ {\begin{array}{*{20}c} {K_{1}^{10} \, = \,\left[ { - 0.0153} \right]} \\ {K_{2}^{10} \, = \,\left[ {0.0155} \right]} \\ \end{array} } \right. $$
(48)

During learning, the optimal tracking controller gains converge; the convergence process and the tracking results are shown in Figs. 6, 7, 8 and 9.

Fig. 6. Optimal \( \overline{H} \) matrix convergence process

Fig. 7. Optimal control gain convergence process of tracking controller

Fig. 8. Output trajectories of the system

Fig. 9. Tracking errors using the learned controller

5 Conclusion

In this paper, a data-driven off-policy Q-learning method is proposed to solve the linear quadratic tracking problem of discrete-time systems using output feedback. The on-policy and off-policy Q-learning methods for this problem are introduced and compared, dynamic programming is combined with Q-learning, and the off-policy Q-learning method is used to learn the optimal controller gains when the system dynamics are unknown. Finally, simulation results show that the method is effective.