
Consider a standard problem in optimal control, where one wants to find a sequence of control signals \(u_t\) that solves the following optimization problem:

$$\begin{aligned}&\min _{u} \mathbb {E} \left[ \sum _{t=0}^T \ell (y_t, u_t, t) \right] \end{aligned}$$
(1a)
$$\begin{aligned}&x_{t+1} = f(x_t, u_t) + v_t, \quad t = 0,\ldots , T-1, \end{aligned}$$
(1b)
$$\begin{aligned}&y_t = g(x_t, u_t) + w_t, \quad t = 0,\ldots , T, \end{aligned}$$
(1c)

where \(\ell \) denotes a loss function, f and g denote the system and observation dynamics, and \(v_t\) and \(w_t\) denote system and observation noise, respectively. Further, we assume that we are presented with a nominal version of (1), where \(\ell \) is a quadratic form, and \(f(x, u) = Ax + Bu\), \(g(x, u) = C x\), for some matrices \(A\), \(B\), \(C\), and where \(v_t, w_t\) are i.i.d. samples from zero-mean Gaussian distributions with covariance matrices V and W, respectively. In the sequel, whenever we talk about a nominal model, we refer to the matrices \(A\), \(B\), \(C\) of (1).

Given the above nominal model, it is well known from control theory that we can design an optimal nominal controller as a linear quadratic regulator (LQR) combined with a Kalman estimator, with Kalman gains \(K_t\) and linear feedback gains \(L_t\) (see e.g. [1]). The optimal controller yields a feedback law that explicitly gives the control signal through

$$\begin{aligned} \hat{x}_{t+1}&= A \hat{x}_t + B u_t + K_t\left[ y_t - C\left( A\hat{x}_t + B u_t\right) \right] \end{aligned}$$
(2a)
$$\begin{aligned} u_t&= L_t \hat{x}_t. \end{aligned}$$
(2b)

One can alternatively consider a model-free reinforcement learning approach to solving problem (1). Given the recent, highly impressive successes of model-free reinforcement learning in highly complex domains (e.g. AlphaZero), it is perhaps surprising that such an approach can fail even on simple problems [6], in particular with regards to sample efficiency and robustness. In the authors’ view, this failure is in large part due to an inherent disadvantage of model-free approaches compared to model-based approaches when good models are available.

Here we consider an indirectly model-based approach to solving problem (1). Given a fixed nominal model, we ask whether it is possible to modify the operation of the nominal controller using a reinforcement learning agent. That is, instead of letting a reinforcement learning agent directly provide the actual control signals \(u_t\) as actions, we investigate various ways of letting the agent’s actions affect the control law in (2). This requires some care when defining the action space of the agent, and also opens up the possibility of designing various reward functions guided by the fixed nominal model; we perform ablation studies over these design choices. We note the similar previous work in [3, 5]; however, to the authors’ knowledge, direct manipulation of nominal models seems to be unexplored in the literature.

1 Actions

There are many ways of modifying the operation of the nominal controller, but for brevity we here only discuss what we consider to be an illustrative subset of the full action space, which we leave undefined here. This subset consists of

  (a) Perturbations \(\delta A_t\) of the nominal A-matrix.

  (b) Perturbations \(\delta u_t\) of the nominal control signal \(u_t\).

  (c) Hidden (explained later) perturbations \(\delta u^h_t\) of the nominal control signal \(u_t\).

For completeness, the control law (2) using the possible actions (a)–(c) is

$$\begin{aligned} \hat{x}_{t+1}&= (A + \delta A_t) \hat{x}_t + B (u_t - \delta u^h_t) + K_t\left[ y_t - C\left( (A+\delta A_t)\hat{x}_t + B (u_t - \delta u^h_t)\right) \right] , \end{aligned}$$
(3a)
$$\begin{aligned} u_t&= L_t \hat{x}_t + \delta u_t + \delta u^h_t, \end{aligned}$$
(3b)

where the Kalman gain \(K_t\) and the feedback \(L_t\) are adjusted according to the perturbation of the A-matrix. Note the difference: \(\delta u^h_t\) does not affect the state estimation (3a), whereas \(\delta u_t\) does.
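As a small illustration of how the actions (a)–(c) enter (3), consider the following sketch of a single controller step. The function name, and the assumption that \(K_t\) and \(L_t\) have already been recomputed for \(A + \delta A_t\), are ours; states and signals are plain 1-D numpy arrays.

```python
import numpy as np

def perturbed_controller_step(x_hat, y, A, B, C, K_t, L_t, dA, du, du_h):
    """One step of the perturbed control law (3); a sketch, not the authors' code.

    dA   : perturbation of the nominal A-matrix        (action (a))
    du   : visible perturbation of the control signal  (action (b))
    du_h : hidden perturbation of the control signal   (action (c))
    """
    A_pert = A + dA
    u = L_t @ x_hat + du + du_h                 # (3b): control actually applied
    u_est = u - du_h                            # the estimator never sees du_h
    pred = A_pert @ x_hat + B @ u_est           # one-step prediction of the state
    x_hat_next = pred + K_t @ (y - C @ pred)    # (3a): innovation correction
    return u, x_hat_next
```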

2 Environment

For the observation space we will, again for brevity, only use a rolling window of measurements; that is, the observation \(o_t\) that the agent receives at time t is \(\left[ y_t, y_{t-1}, \ldots , y_{t-m} \right] ^T\) for a window length m. To facilitate online learning, we introduce normal shocks to the benchmark problems, simulating control towards a varying reference signal, and we also extend the observation with the size and timing of these shocks. We point out, however, that the observation space can be extended in many other ways, e.g., by including the nominally estimated states, the nominal value function, etc. in the observation.
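A rolling-window observation of this kind is straightforward to maintain; the sketch below is ours, and zero-padding the window before m + 1 measurements are available is our own assumption, not something specified above.

```python
from collections import deque
import numpy as np

class RollingObservation:
    """Maintains o_t = [y_t, y_{t-1}, ..., y_{t-m}]^T for a window length m."""

    def __init__(self, m, y_dim):
        # Start from an all-zero window (our assumption for the first m steps).
        self.window = deque([np.zeros(y_dim) for _ in range(m + 1)], maxlen=m + 1)

    def update(self, y_t):
        self.window.appendleft(np.asarray(y_t, dtype=float))   # newest measurement first
        return np.concatenate(list(self.window))                # stacked observation o_t
```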

As rewards we use the following signals:

System loss: \(R_t = -\ell (y_t, u_t, t)\),

Innovation: \(R_t = -\Vert y_t - C\left( A\hat{x}_t + B u_t\right) \Vert ^2\), and

Nominalized: \(R_t = -\ell (y_t, u_t, t) - \delta R_t^{\text {nom}}\),

as well as a weighted aggregation of the above. The system loss is the naïve reward derived directly from (1). The innovation reward encourages modifying the nominal model so that the state estimates become correct. The nominalized reward is a form of reward shaping [4], intended to reduce the variance of stochastic policy gradient estimates, as in Generalized Advantage Estimation [7], by factoring out the part of the raw system reward that can be considered the responsibility of the nominal controller. That is, we may take \(\delta R_t^{\text {nom}}(x_t, u_t, x_{t+1}) = \gamma V^{\text {nom}}(x_{t+1}) - V^{\text {nom}}(x_t)\), where \(V^{\text {nom}}(x_t)\) denotes the (known) value function of the nominal control policy, assuming the nominal model to be exactly correct. Concretely, we implement an approximation of this by letting

$$\begin{aligned} \delta R^{\text {nom}}_t = -\ell (\hat{x}_{t+1|t}, u_t) \approx \mathbb {E}_{\pi ^{\text {nom}}}\left[ -\ell (x_{t+1}, u_t) \,|\, x_0, \ldots , x_t, u_0, \ldots , u_{t-1} \right] . \end{aligned}$$
(4)
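The three reward signals can then be computed side by side as in the sketch below; the function and argument names are ours, `loss` stands for the stage cost \(\ell\) of (1a), and evaluating it on the nominal one-step prediction is our reading of (4).

```python
import numpy as np

def reward_signals(loss, y_t, u_t, t, x_hat_t, x_hat_next_pred, A, B, C):
    """System loss, innovation, and nominalized rewards at time t (a sketch)."""
    r_system = -loss(y_t, u_t, t)                        # raw system loss
    innovation = y_t - C @ (A @ x_hat_t + B @ u_t)
    r_innovation = -float(innovation @ innovation)       # -||innovation||^2
    d_r_nom = -loss(x_hat_next_pred, u_t, t)             # approximation (4)
    r_nominalized = r_system - d_r_nom                   # shaped (nominalized) reward
    return r_system, r_innovation, r_nominalized
```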

3 Experimental results

In view of [6], and the failure demonstrated therein of model-free reinforcement learning approaches on even simple optimal control problems, we take as benchmark problems perturbations of a discrete-time, frictionless, unit-mass double-integrator system. The nominal model is thus

$$\begin{aligned} f^{nom}(x, u) = \begin{bmatrix} 1 & dt \\ 0 & 1 \end{bmatrix} x + \begin{bmatrix} dt^2 / 2 \\ dt \end{bmatrix} u,&g^{nom}(x, u) = \begin{bmatrix} 1&0 \end{bmatrix} x. \end{aligned}$$
(5a)
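In code, the nominal model (5) is a direct transcription (the function name is ours):

```python
import numpy as np

def double_integrator(dt):
    """Nominal discrete-time frictionless unit-mass double integrator, cf. (5)."""
    A = np.array([[1.0, dt],
                  [0.0, 1.0]])
    B = np.array([[dt**2 / 2.0],
                  [dt]])
    C = np.array([[1.0, 0.0]])   # only the position is measured
    return A, B, C
```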

We train all agents with the PPO2 algorithm [8], as implemented in [2], with an increased learning rate, and use neural networks to approximate both the value function and the policy. We train in an online fashion, i.e., we learn from a single trajectory of the system. Further, we induce large random shocks to the system at regular intervals, and all agents are trained using 10000 samples.
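As an indication of the training setup, the following is a minimal sketch. We assume the Stable Baselines implementation of PPO2, which may or may not coincide with [2], and the learning rate shown is only a placeholder for "increased"; the Pendulum-v0 task is used purely as a stand-in, since the actual environment wraps the perturbed nominal controller described above.

```python
# Minimal PPO2 training sketch (library and hyperparameters are assumptions, not
# the authors' exact setup); Pendulum-v0 is a stand-in environment only.
import gym
from stable_baselines import PPO2
from stable_baselines.common.policies import MlpPolicy

env = gym.make('Pendulum-v0')                     # stand-in for the controller environment
model = PPO2(MlpPolicy, env, learning_rate=1e-3)  # increased learning rate (placeholder value)
model.learn(total_timesteps=10000)                # all agents are trained on 10000 samples
```

The two benchmark perturbations of the nominal double integrator are the following.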

  • Misidentified linear system. The (2, 2)-component of the A-matrix is replaced by \(1 - \mu \in (0, 1]\), representing friction.

  • Piecewise linear system. \(f(x,u) = f^{nom}(x,u) + \mathbb {I}_{\Vert x \Vert > 1} \begin{bmatrix} 0 \\ -\text {sgn}(x)\sin \theta \end{bmatrix}\), corresponding to a mass on a plane that, at unit distance from the origin, slopes downward at an angle \(\theta \) (a sketch of both benchmark systems is given below).
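A sketch of both perturbed benchmark systems might look as follows; interpreting sgn(x) as the sign of the position coordinate, and applying the slope term to the velocity component, are our own reading of the piecewise linear case.

```python
import numpy as np

def f_misidentified(x, u, A, B, mu):
    """Benchmark 1: the true (2,2)-entry of A is 1 - mu, i.e. the mass has friction."""
    A_true = A.copy()
    A_true[1, 1] = 1.0 - mu
    return A_true @ x + (B * u).ravel()

def f_piecewise(x, u, A, B, theta):
    """Benchmark 2: beyond unit distance from the origin the plane slopes down at angle theta.

    sgn is taken on the position coordinate x[0] (our assumption).
    """
    slope = np.zeros(2)
    if np.linalg.norm(x) > 1.0:
        slope = np.array([0.0, -np.sign(x[0]) * np.sin(theta)])
    return A @ x + (B * u).ravel() + slope
```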

Fig. 1. Median reward of 12 agents compared to an optimal controller, evaluated after every 256 samples during training on a set of fixed episodes. The trained agent is shown in blue, the nominal controller in orange, and shaded regions indicate the 10th–90th percentiles. (a) Varying reward signals. (b) Varying action spaces.

The main results are presented in Fig. 1. Figure 1a shows a clear improvement in sample efficiency using reward nominalization, compared to both the raw system loss and the innovation reward. A weighted aggregation appears to give an additional increase in robustness, indicated by relatively narrower error bars. Figure 1b illustrates the importance of choosing the correct action: in the top row, the agent’s actions enter the feedback loop of the nominal controller and cause severe problems for the nominal state estimator. On the other hand, when its actions are hidden from the state estimator, the agent successfully learns to compensate for the unmodelled nonlinearities using only roughly 1000 samples.