Abstract
In this paper, an off-policy Q-learning method is proposed to solve the linear quadratic tracking problem for discrete-time systems using only output feedback, when the system model parameters are unknown. First, a linear discrete-time system with unknown system matrices is given. Then, based on Q-learning and dynamic programming, an off-policy Q-learning algorithm that requires no knowledge of the system model is proposed; the resulting controller uses only measured output data to learn the output feedback data-driven optimal tracking control strategy for linear discrete-time systems. Finally, simulation results verify the effectiveness of the method.
1 Introduction
Linear quadratic tracking (LQT) for discrete-time (DT) systems is an important problem in the field of control. The basic idea is that the performance index is a quadratic function defined by the accumulated deviation between the reference signal and the system output, together with the accumulated control input. By designing an optimal controller that minimizes this performance index, the system output follows the reference signal in an optimal manner. The traditional way to solve the LQT problem is to solve an algebraic Riccati equation [1,2,3]; such controller design methods all require the system model to be known.
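To contrast with the data-driven approach developed below, the traditional model-based route can be sketched in a few lines: with known \( A \) and \( B \), the regulator gain follows from the discrete-time algebraic Riccati equation. The matrices and the plain LQ-regulator setting below are illustrative assumptions, not taken from this paper.

```python
# Model-based LQ design (illustrative matrices, not from the paper):
# solve the discrete-time algebraic Riccati equation, then form the gain.
import numpy as np
from scipy.linalg import solve_discrete_are

A = np.array([[0.9, 0.1],
              [0.0, 0.8]])          # assumed system matrix
B = np.array([[0.0],
              [1.0]])               # assumed input matrix
Q = np.eye(2)                       # state weighting (Q >= 0)
R = np.array([[1.0]])               # input weighting (R > 0)

P = solve_discrete_are(A, B, Q, R)  # stabilizing DARE solution
# optimal state feedback u_k = -K x_k
K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
```

This model-based step is exactly what becomes impossible when \( A \) and \( B \) are unknown, which motivates the Q-learning route of this paper.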
Reinforcement learning (RL) is a class of machine learning methods that originated in the 1950s and 1960s [4,5,6]. During dynamic interaction with an unknown environment, performance is evaluated and the action is updated, such that the optimal performance and the optimal action are learned together [7,8,9,10,11]. Reinforcement learning is highly adaptable and has become an important method for solving optimization problems; it is widely used in robotics, artificial intelligence, intelligent systems and other fields, and has been a research hot spot in recent years [12,13,14,15].
In the existing research on model-free optimal tracking control by reinforcement learning, most methods use system state data to learn a state feedback control strategy that tracks the reference input in an optimal or nearly optimal way, such as the optimal tracking control results in [16,17,18,19,20,21,22]. In [19], a Q-learning method based on off-policy iteration was proposed to solve the optimal tracking control problem of networked control systems. In [20], an optimal control method for linear networked control with unknown system parameters was proposed. Based on state feedback information, a novel off-policy Q-learning algorithm was proposed in [21] that solves the discrete-time linear quadratic tracking problem under unknown system parameters. In [22], an optimal tracking control scheme was proposed.
In this paper, a Q-learning algorithm is developed to design the output feedback optimal tracking control strategy, such that the optimal quadratic tracking problem can be solved without the knowledge of system dynamics.
The innovation of this paper is twofold. (a) Different from traditional research that requires model information [1,2,3], this paper learns the optimal tracking control strategy when the system model parameters are unknown. (b) Compared with other model-free research based on state feedback [20,21,22], this paper adopts a fully data-driven off-policy Q-learning method that solves the linear quadratic tracking control problem of discrete-time systems using only output feedback, independent of the system model parameters. Finally, simulation experiments and a practical application example are given to verify the effectiveness of the algorithm.
2 On the Optimal Control Problem
This section formulates the optimal control problem of linear quadratic tracking for discrete-time systems. Consider the linear discrete-time system

$$ x_{k + 1} = Ax_{k} + Bu_{k} ,\quad y_{k} = Cx_{k} $$(1)

where \( x_{k} \) is the state of the controlled plant, \( u_{k} \) is its input, and \( y_{k} \) is its output; \( A \), \( B \), and \( C \) are matrices of appropriate dimensions. The reference signal of interest is generated by

$$ r_{k + 1} = Fr_{k} $$(2)
where \( r_{k} \) is the reference signal and \( F \) is a matrix of appropriate dimensions. For the linear quadratic tracking problem, the output \( y_{k} \) of system (1) must track the reference signal \( r_{k} \) as time goes on. Accordingly, we design the output feedback controller

$$ u_{k} = - K_{1} y_{k} - K_{2} r_{k} $$(3)
The purpose of the controller is to minimize the performance index

$$ J = \sum\limits_{i = k}^{\infty } {\beta^{i - k} \left[ {(y_{i} - r_{i} )^{T} Q(y_{i} - r_{i} ) + u_{i}^{T} Ru_{i} } \right]} $$(4)

where \( Q\, \ge \,0 \) and \( R\, > \,0 \) are symmetric weighting matrices, and \( \beta \) is a discount factor with \( 0\, < \,\beta \, < \,1 \). The minimization is subject to the following constraints:
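As a concrete reading of the performance index, the discounted tracking cost can be evaluated numerically along a recorded finite trajectory; the helper below is an illustrative sketch (the name `tracking_cost` and all values are ours, not the paper's).

```python
# Evaluate the discounted tracking cost of (4) along a finite trajectory.
import numpy as np

def tracking_cost(ys, rs, us, Q, R, beta):
    """Sum of beta^i [(y_i - r_i)' Q (y_i - r_i) + u_i' R u_i] over the data."""
    J = 0.0
    for i, (y, r, u) in enumerate(zip(ys, rs, us)):
        e = y - r                       # tracking error at step i
        J += beta**i * (e @ Q @ e + u @ R @ u)
    return J

# example: one step with unit error, Q = 2, zero input -> cost 2.0
# tracking_cost([np.array([1.0])], [np.array([0.0])], [np.array([0.0])],
#               np.array([[2.0]]), np.array([[1.0]]), 0.9)
```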
According to the performance index, we can define the optimal value function \( V^{*} \) as:
Then, the \( Q \) function can be expressed as:
The optimal function \( Q^{*} \) can be expressed as:
For the convenience of calculation and understanding, (8) can be rewritten as:
where \( \overline{Q} \) can be written as:
From the above formulas, the relationship between the optimal value function \( V^{ *} \) and the optimal Q-function \( Q^{*} \) is:
Since the input \( u_{k} \) of the control object is controllable, the optimal value function \( V^{*} \) can be expressed as [15]:
The optimal \( Q \) function can be expressed as:
The matrix \( H \) can be expressed as follows:
According to the first-order necessary condition for optimality, setting \( \frac{{\partial Q^{*} (x_{k} ,\,r_{k} ,\,u_{k} )}}{{\partial u_{k} }}\, = \,0 \) yields the following forms.
From (15), we observe that \( K_{1} \) cannot be obtained if the matrix \( C \) is unknown. The output equation is therefore treated as follows.
By substituting (16) into (9) above, we can obtain a new optimal \( Q \) function equation:
where
Setting \( \frac{{\partial Q^{*} (x_{k} ,r_{k} ,u_{k} )}}{{\partial u_{k} }}\, = \,0 \) again yields the following forms
According to the above relation, the Riccati equation for the optimal \( Q \) function is as follows:
3 Data Driven Q-Learning Algorithm
This section presents an off-policy Q-learning algorithm for designing the output feedback optimal tracking control strategy, under which the system output tracks the reference signal in an approximately optimal way. First, the on-policy Q-learning algorithm is given; the off-policy Q-learning algorithm is then derived from it.
Algorithm 1: On-policy Q-learning algorithm.

1. Choose stabilizing controller gains \( K_{1}^{j} \) and \( K_{2}^{j} \) and set the iteration index \( j = 0 \). The control input is defined as

$$ u_{k}^{j} = - K_{1}^{j} y_{k} - K_{2}^{j} r_{k} $$(21)

2. Evaluate the control policy by solving for the matrix \( \overline{H}^{j + 1} \) in

$$ Z_{k}^{T} \overline{H}^{j + 1} Z_{k} = (y_{k} - r_{k} )^{T} Q(y_{k} - r_{k} ) + (u_{k}^{j} )^{T} Ru_{k}^{j} + \beta Z_{k + 1}^{T} \overline{H}^{j + 1} Z_{k + 1} $$(22)

3. Update the control policy:

$$ u_{k}^{j + 1} = - K_{1}^{j + 1} y_{k} - K_{2}^{j + 1} r_{k} ,\quad K_{1}^{j + 1} = (\bar{H}_{uu}^{j + 1} )^{ - 1} (\bar{H}_{yu}^{j + 1} )^{T} ,\quad K_{2}^{j + 1} = (\bar{H}_{uu}^{j + 1} )^{ - 1} (\bar{H}_{ru}^{j + 1} )^{T} $$(23)

4. If \( \left\| {\bar{H}^{j + 1} } \right. - \left. {\bar{H}^{j} } \right\| \le \varepsilon \), where \( \varepsilon \) is a small positive threshold, stop the policy iteration. Otherwise, set \( j = j + 1 \) and return to Step 2.
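A minimal numerical sketch of Algorithm 1 on a hypothetical scalar plant (chosen so that \( y_k = x_k \) and a quadratic Q-function in \( (y_k, r_k, u_k) \) is exact). Probing noise is added to the policy for excitation, Step 2 is solved by least squares over the six free entries of the symmetric \( \overline{H} \), and all numerical values are our assumptions, not the paper's:

```python
# Policy iteration of Algorithm 1 on a scalar plant (illustrative values).
import numpy as np

rng = np.random.default_rng(0)
A, B, C, F = 0.8, 1.0, 1.0, 1.0      # scalar plant and reference dynamics
q, Rw, beta = 10.0, 1.0, 0.95        # weights and discount factor

def phi(y, r, u):
    # features so that phi(Z) @ theta == Z^T Hbar Z for symmetric Hbar
    return np.array([y*y, r*r, u*u, 2*y*r, 2*y*u, 2*r*u])

K1, K2 = 0.5, 0.0                    # initial stabilizing gains
for j in range(10):
    rows, rhs = [], []
    x, r = 1.0, 1.0
    for k in range(200):             # collect data under the noisy policy
        u = -K1*C*x - K2*r + 0.5*rng.standard_normal()
        xn, rn = A*x + B*u, F*r
        un = -K1*C*xn - K2*rn        # target policy applied at step k+1
        rows.append(phi(C*x, r, u) - beta*phi(C*xn, rn, un))
        rhs.append(q*(C*x - r)**2 + Rw*u**2)
        x, r = xn, rn
    # policy evaluation: least-squares solution of the Bellman equation (22)
    Hyy, Hrr, Huu, Hyr, Hyu, Hru = np.linalg.lstsq(
        np.array(rows), np.array(rhs), rcond=None)[0]
    K1, K2 = Hyu/Huu, Hru/Huu        # policy improvement, cf. (23)
```

Each outer iteration evaluates the current policy from data and then improves it; for matrix-valued signals the same least-squares structure applies entrywise to \( \overline{H} \).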
In view of the advantages of off-policy Q-learning, we now propose an off-policy algorithm based on the Q-function, a fully data-driven, model-free method for the linear quadratic tracking problem of discrete-time systems. Introducing the target control policy into the system dynamics gives

$$ x_{k + 1} = Ax_{k} - BK_{1}^{j} y_{k} - BK_{2}^{j} r_{k} + B(u_{k} + K_{1}^{j} y_{k} + K_{2}^{j} r_{k} ) $$(24)

where \( u_{k} \) is the behavior control policy and \( u_{k}^{j} = - K_{1}^{j} y_{k} - K_{2}^{j} r_{k} \) is the target control policy.
From Eq. (22), one has
where
(26) can be rewritten as:
where
Further, one has
Rewriting (30) yields:
where
\( L_{1}^{j + 1} = \overline{H}_{yy}^{j + 1} \), \( L_{2}^{j + 1} = \overline{H}_{yr}^{j + 1} \), \( L_{3}^{j + 1} = \overline{H}_{yu}^{j + 1} \), \( L_{4}^{j + 1} = \overline{H}_{rr}^{j + 1} \), \( L_{5}^{j + 1} = \overline{H}_{ru}^{j + 1} \), \( L_{6}^{j + 1} = \overline{H}_{uu}^{j + 1} \).
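Once \( \overline{H} \) has been learned, the controller gains are read off from its blocks as in (23); the sketch below uses made-up dimensions and placeholder values to show the slicing.

```python
# Reading the gains off a learned Hbar (placeholder values; the dimensions
# p for y_k, r_k and m for u_k are assumptions for illustration).
import numpy as np

p, m = 2, 1                      # output/reference dimension p, input dimension m
Hbar = np.eye(2 * p + m)         # stand-in for a learned symmetric matrix
Hbar[2 * p:, :p] = 0.3           # illustrative Hbar_uy entries
Hbar[:p, 2 * p:] = 0.3           # keep Hbar symmetric

Hyu = Hbar[:p, 2 * p:]           # L3 block: Hbar_yu
Hru = Hbar[p:2 * p, 2 * p:]      # L5 block: Hbar_ru
Huu = Hbar[2 * p:, 2 * p:]       # L6 block: Hbar_uu
K1 = np.linalg.solve(Huu, Hyu.T) # K1 = Huu^{-1} Hyu^T, cf. (23)
K2 = np.linalg.solve(Huu, Hru.T) # K2 = Huu^{-1} Hru^T
```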
4 Simulation Experiment
In this section, the proposed algorithm is tested in simulation and the experimental results are used to verify its effectiveness.
Example 1:
Consider the following system:
The reference signal generator is:
Choose \( Q\, = \,\left[ {\begin{array}{*{20}c} {1000} & 0 \\ 0 & {10} \\ \end{array} } \right] \) and \( R\, = \,1 \). The optimal matrices \( H \) and \( \overline{H} \) are obtained from (13) and (17), respectively, and the optimal tracking control gains \( K_{1} \) and \( K_{2} \) from (19).
After 8 iterations the algorithm converges; the matrix \( \overline{H}^{8} \) and the optimal tracking control gains \( K_{1}^{8} \) and \( K_{2}^{8} \) are as follows.
During learning, the optimal tracking controller gains converge; Fig. 1 shows the convergence process.
In order to obtain the exact solution of the Riccati equation (20) for the optimal Q-function, probing noise must be added so that the persistent excitation condition is satisfied (Figs. 2 and 3).
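A common concrete choice for such probing noise is a small sum of sinusoids added to the behavior input; the amplitude and frequencies below are arbitrary assumptions, not the paper's values.

```python
# Probing (exploration) noise for persistent excitation: a small sum of
# sinusoids; amplitude and frequencies are arbitrary illustrative choices.
import numpy as np

def probing_noise(k, amp=0.1):
    freqs = np.array([0.7, 1.3, 2.9, 4.1])   # assumed frequencies [rad/sample]
    return amp * float(np.sum(np.sin(freqs * k)))

# behavior input during data collection:
#   u_k = -K1 @ y_k - K2 @ r_k + probing_noise(k)
```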
Example 2:
As our second simulation object we select the TTS20 three-tank system made by the German company Ingenieurbüro Gurski Schramm; the apparatus is shown in Fig. 4. It is a nonlinear multi-input multi-output system with two actuators and a digital controller, which meets our requirements. The main structure and overall industrial process of the TTS20 three-tank system are shown in Fig. 5.
The TTS20 apparatus consists of three plexiglass cylinders T1, T2 and T3 of cross-sectional area A, connected in series through cylindrical pipes of cross-section Sn. A one-way valve is installed in the outlet pipe of T2, and the outflowing liquid is collected in a reservoir that supplies pump 1 and pump 2. \( H_{\hbox{max} } \) is the highest admissible liquid level; if the level of T1 or T2 exceeds this value, the corresponding pump is automatically shut down. Q1 and Q2 denote the flows of pump 1 and pump 2. In addition, to simulate leakage, each tank has a circular opening with a manually adjustable ball valve in its cross-section; the drain valves and leakage flows can describe fault conditions of the apparatus. The liquid drawn from the reservoir is injected into T1 and T2 by pump 1 (P1) and pump 2 (P2), respectively; the bottom valves and the drain valve of T3 then discharge water back into the reservoir for recycling by P1 and P2, forming a closed circuit. The levels of T1, T2 and T3 are measured by three pressure level sensors, which serve as the measuring elements of the system, and the flows Q1 and Q2 are regulated by the digital controller.
For TTS20, we can design its model as follows:
In the model, \( h_{1} \) and \( h_{3} \) are the controlled variables, representing the water levels of tanks T1 and T3, and the flow \( Q_{in} \) delivered by pump 1 is selected as the control input. The flow from T1 to T3 is \( Q_{13} \, = \,az_{1} S_{n} \text{sgn} (h_{1} \, - \,h_{3} )\sqrt {2g\left| {h_{1} \, - \,h_{3} } \right|} \), and the outflow at the bottom of T3 is \( Q_{out} \, = \,az_{2} S_{1} \sqrt {2gh_{3} } \), where \( S_{1} \, = \,S_{n} \, = \,5\, \times \,10^{ - 5} \,\text{m}^{2} \), \( S\, = \,0.154\,\text{m}^{2} \), \( H_{\hbox{max} } \, = \,0.6\,\text{m} \), the flow coefficients are \( az_{1} \, = \,0.48 \) and \( az_{2} \, = \,0.58 \), and \( \text{sgn} ( \bullet ) \) is the sign function. The initial values of \( h_{1} \) and \( h_{3} \) are set to 0, and the state and input variables are \( \left[ {\begin{array}{*{20}c} {x_{1} (k)} \\ {x_{2} (k)} \\ \end{array} } \right]\,{ = }\,\left[ {\begin{array}{*{20}c} {h_{1} (k)} \\ {h_{3} (k)} \\ \end{array} } \right] \), \( u(k)\, = \,Q_{in} (k) \). The state space model of TTS20 is as follows:
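Using the quoted parameters, the nonlinear two-state model can be simulated directly; the sketch below assumes a forward-Euler discretization with a 1 s sample time, takes the outflow to depend on \( h_{3} \), and picks an arbitrary constant inflow.

```python
# Forward-Euler simulation of the two-state TTS20 model (dt and Qin are
# assumptions; the physical parameters are those quoted in the text).
import numpy as np

S1 = Sn = 5e-5          # pipe cross-sections [m^2]
S = 0.154               # tank cross-sectional area [m^2]
az1, az2 = 0.48, 0.58   # flow coefficients
g, dt = 9.81, 1.0       # gravity [m/s^2] and sample time [s] (dt assumed)

def step(h1, h3, Qin):
    # flow from T1 to T3, and outflow at the bottom of T3
    Q13 = az1 * Sn * np.sign(h1 - h3) * np.sqrt(2 * g * abs(h1 - h3))
    Qout = az2 * S1 * np.sqrt(2 * g * max(h3, 0.0))
    h1n = h1 + dt * (Qin - Q13) / S
    h3n = h3 + dt * (Q13 - Qout) / S
    return max(h1n, 0.0), max(h3n, 0.0)   # levels cannot go negative

h1, h3 = 0.0, 0.0       # zero initial levels, as in the text
for k in range(1000):
    h1, h3 = step(h1, h3, Qin=1e-4)       # constant inflow 1e-4 m^3/s (assumed)
```

A linear state-space model such as the one used in the paper would then be obtained by linearizing these dynamics around an operating point (our reading).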
The reference signal generator is:
The reference water level is set to 0.5 m. Choosing \( Q\, = \,10 \) and \( R\, = \,5 \), the optimal Q-function matrices \( H \) and \( \overline{H} \) are obtained, and the optimal tracking control gains \( K_{1} \) and \( K_{2} \) are as follows:
After 10 iterations the algorithm converges; the optimal Q-function matrix \( \overline{H}^{10} \) and the optimal tracking control gains \( K_{1}^{10} \) and \( K_{2}^{10} \) are as follows.
The optimal tracking controller gains again converge during learning; their convergence process is shown in Figs. 6, 7, 8 and 9.
5 Conclusion
In this paper, a data-driven off-policy Q-learning method is proposed to solve the linear quadratic tracking problem of discrete-time systems based on output feedback. The on-policy and off-policy Q-learning methods for this problem are introduced and compared; combining dynamic programming with Q-learning, the off-policy method learns the optimal controller gains when the system dynamics are unknown. Finally, the simulation results show that the method is effective.
References
Lewis, F.L., Vrabie, D.L., Syrmos, V.L.: Optimal Control, pp. 287–296. Wiley, Hoboken (2012)
Hengster-Movric, K., You, K., Lewis, F.L., Xie, L.: Synchronization of discrete-time multi-agent systems on graphs using riccati design. Automatica 49(2), 414–423 (2013)
Stoorvogel, A.A., Weeren, A.J.T.M.: The discrete-time riccati equation related to the H∞ control problem. IEEE Trans. Autom. Control 39(3), 686–691 (1994)
Kaelbling, L.P., Littman, M.L., Moore, A.W.: Reinforcement learning: a survey. J. Artif. Intell. Res. 4(1), 237–285 (1996)
Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction. MIT Press, Cambridge (1998)
Li, J., Ding, J., Chai, T., Lewis, F.L.: Nonzero-sum game reinforcement learning for performance optimization in large-scale industrial processes. IEEE Trans. Cybern. 50(9), 4132–4145 (2020)
Sutton, R.: Learning to predict by the methods of temporal differences. Mach. Learn. 3(1), 9–44 (1988)
Bertsekas, D.P., Tsitsiklis, J.N., Volgenant, A.: Neuro-dynamic programming. Encycl. Optim. 27(6), 1687–1692 (2011)
Santamaria, J.C., Sutton, R., Ram, A.: Experiments with reinforcement learning in problems with continuous state and action spaces. Adap. Behav. 6, 163–217 (1997)
Watkins, C., Dayan, P.: Q-learning. Mach. Learn. 8, 279–292 (1992)
Wang, D., Liu, D.: Learning and guaranteed cost control with event-based adaptive critic implementation. IEEE Trans. Neural Netw. Learn. Syst. 29(12), 6004–6014 (2018)
Smart, W.D., Kaelbling, L.P.: Effective reinforcement learning for mobile robots. In: Proceedings of the IEEE International Conference on Robotics and Automation, pp. 3404–3410 (2002)
Beom, H.R., Cho, H.S.: A sensor-based navigation for a mobile robot using fuzzy logic and reinforcement learning. IEEE Trans. Syst. Man Cybern. 25, 464–477 (1995)
Kondo, T., Ito, K.: A reinforcement learning with evolutionary state recruitment strategy for autonomous mobile robots control. Robot. Auton. Syst. 46(2), 111–124 (2004)
Li, J.N., Ding, J.L., Chai, T.Y, Li, C., Lewis, F.L.: Nonzero-sum game reinforcement learning for performance optimization in large-scale industrial processes. IEEE Trans. Cybern. (2019). https://doi.org/10.1109/tcyb.2019.2950262
Kiumarsi, B., Lewis, F.L., Jiang, Z.P.: H∞ control of linear discrete-time systems: off-policy reinforcement learning. Automatica 37(1), 144–152 (2017)
Kim, J.H., Lewis, F.L.: Model-free H∞ control design for unknown linear discrete-time systems via Q-learning with LMI. Automatica 46(8), 1320–1326 (2010)
Al-Tamimi, A., Lewis, F.L., Abu-Khalaf, M.: Model-free Q-learning designs for linear discrete-time zero-sum games with application to H-infinity control. Automatica 43(3), 473–481 (2007)
Li, J.N., Yin, Z.X.: Optimal tracking control of networked control systems based on off policy Q-learning. Control Decis. 34(11), 2343–2349 (2019)
Xu, H., Sahoo, A., Jagannathan, S.: Stochastic adaptive event-triggered control and network scheduling protocol co-design for distributed networked systems. Control Theory Appl. IET. 8(18), 2253–2265 (2014)
Li, J.N., Chai, T.Y., Lewis, F.L., Ding, Z.T., Jiang, Y.: Off-policy interleaved Q-learning: optimal control for affine nonlinear discrete-time systems. IEEE Trans. Neural Netw. Learn. Syst. 30(5), 1308–1320 (2019)
Li, X.F., Xue, L., Sun, C.Y.: Linear quadratic tracking control of unknown discrete-time systems using value iteration algorithm. Neurocomputing 314(7), 86–93 (2018)
Acknowledgments
This work was supported by the National Natural Science Foundation of China under Grant 61673280, the Open Project of Key Field Alliance of Liaoning Province under Grant 2019-KF-03-06, and the Project of Liaoning Shihua University under Grant 2018XJJ-005.
© 2020 Springer Nature Singapore Pte Ltd.
Chen, S., Xiao, Z., Li, J. (2020). Data-Driven Optimal Tracking Control for Linear Systems Based on Output Feedback Approach. In: Qian, J., Liu, H., Cao, J., Zhou, D. (eds) Robotics and Rehabilitation Intelligence. ICRRI 2020. Communications in Computer and Information Science, vol 1336. Springer, Singapore. https://doi.org/10.1007/978-981-33-4932-2_21
Publisher Name: Springer, Singapore
Print ISBN: 978-981-33-4931-5
Online ISBN: 978-981-33-4932-2