
1 Introduction

Creating systems that learn to solve complex tasks from interactions with their environment is one of the primary goals of artificial intelligence research. Recently, much progress has been made in this regard, mainly through modern reinforcement learning (RL) techniques [1, 21]. Examples of recent successes include systems which exceed human-level performance in playing console-based Atari games [12] or can navigate 3D virtual environments [11]; AlphaGo Zero [17] became the first program to beat world-class Go players by learning from self-play only. Function approximators such as deep neural networks, when combined with off-policy and bootstrapping methods such as Q-learning, used to be unstable, a combination referred to as the “deadly triad” [20]; they have since proven to be a competent approach thanks to techniques such as experience replay [8], which stabilizes learning with the help of a large replay memory.

Spurred by these successes, another line of recent research has considered alternative approaches to RL using black-box optimization methods which do not require back-propagation of gradients. Corresponding contributions include systems [10, 14] that are trained using so-called evolution strategies and achieve competitive performance in playing Atari games. Similar performance was obtained in [19], where genetic algorithms were found to scale better than evolution strategies. This has revived interest in black-box methods for solving RL problems, as they can easily be parallelized on modern distributed architectures. However, most real-world systems must deal with limited and noisy state information, resulting in the partial observability encountered in partially observable Markov decision processes (POMDPs). To learn policies under such circumstances, systems need internal memory. Recurrent RL methods that cope with partial observability have therefore recently been investigated, but were found to be difficult to train [4].

In this paper, we focus on this kind of problem and consider RL in partially observable environments. Since echo state networks (ESNs) [5] are known for their simple architecture and short-term memorization capabilities, we choose them to represent parameterized control policies. In particular, we propose simultaneous perturbation stochastic approximation (SPSA), a gradient approximation technique, as the training algorithm; at each iteration it requires only two evaluations of the objective function, regardless of the dimension of the parameter vector. Using SPSA, we devise three types of ESN training that differ in which weight matrices are updated in each iteration. Finally, we use such ESNs to learn policies and test them against baselines on classic control problems.

Previous work on black-box methods for training echo state networks combines genetic algorithms to train the internal weights of the reservoir with stochastic gradient descent to train the output weights [3, 15]. Similar work was done in [6], where output weights and the spectral radii of the internal weight matrices were evolved. More recent work [16] pursued a different learning strategy and used Hebbian learning rules to adapt the reservoir matrices. An interesting hybrid of Hebbian learning and temporal difference learning was later proposed in [7] to adapt actor-critic ESNs. In contrast to these previous approaches, we use SPSA to optimize all of the network weights, which has several noteworthy properties: (i) it requires only two loss measurements at each iteration, (ii) it does not require back-propagation of gradients, (iii) it does not require maintaining a population of candidate solutions as in genetic algorithms, and (iv) it can handle stochastic returns and hence does not require averaging over multiple measurements to account for noisy returns.

2 Simultaneous Perturbation Stochastic Approximation

In this short section, we briefly recall the main ideas behind simultaneous perturbation stochastic approximation (SPSA) for derivative free optimization; readers familiar with this technique may safely skip ahead.

Consider the general problem of maximizing a differentiable objective function \(f(\varvec{\theta }):\mathbb {R}^d \rightarrow \mathbb {R}\), that is, consider the problem of finding \(\varvec{\theta }^* = {\mathop {\mathrm {argmax}}\nolimits _{\varvec{\theta }}} f(\varvec{\theta })\).

For many complex systems, the gradient \(\partial f / \partial \varvec{\theta }\) cannot be computed directly so that \(\partial f / \partial \varvec{\theta } = \varvec{0}\) can often not be solved. It is, however, typically possible to evaluate \(f(\varvec{\theta })\) at various values of \(\varvec{\theta }\) which, in turn, allows for computing stochastic approximations of the gradient. One method in this regard is SPSA due to Spall [18], which iteratively updates estimates of the optimal \(\varvec{\theta }\) as

$$\begin{aligned} \varvec{\theta }_{k+1} = \varvec{\theta }_{k} + l_k \, \varvec{\hat{g}}_k(\varvec{\theta }_k) \end{aligned}$$
(1)

where \(\varvec{\hat{g}}_k(\varvec{\theta }_k)\) is an estimator of the gradient at \(\varvec{\theta }_k\) and \(l_k\) is the learning rate in iteration k. To estimate the gradient, two perturbations of the current estimate are generated, namely \((\varvec{\theta }_k + c_k \, \varvec{\delta }_k)\) and \((\varvec{\theta }_k - c_k \, \varvec{\delta }_k)\), where \(\varvec{\delta }_k\) is a perturbation vector and \(c_k\) is a scaling parameter. Then, the possibly noisy objective function \(F(\cdot ) = f(\cdot ) + \text{noise}\) is measured at both points, yielding \(F(\varvec{\theta }_k + c_k \, \varvec{\delta }_k)\) and \(F(\varvec{\theta }_k - c_k \, \varvec{\delta }_k)\), and the gradient is estimated using the two-sided approximation

$$\begin{aligned} \varvec{\hat{g}}_k(\varvec{\theta }_k) = \dfrac{F(\varvec{\theta }_k + c_k \, \varvec{\delta }_k) - F(\varvec{\theta }_k - c_k \, \varvec{\delta }_k) }{2 \, c_k \, \varvec{\delta }_k}. \end{aligned}$$
(2)

Note that the division by the vector \(\varvec{\delta }_k\) in (2) is understood element-wise. The convergence of the SPSA algorithm critically depends on the choice of its parameters \(l_k\), \(c_k\) and \(\varvec{\delta }_k\). In particular, the learning rate \(l_k\) must meet the Robbins-Monro conditions [13], namely \(l_k > 0\) and \(\sum _{k=1}^{\infty } l_k = \infty \); a common choice in practice is therefore \(l_k = \tfrac{l}{{(L + k)}^\alpha }\) where \(l, \alpha , L>0\). Similarly, the scaling factor \(c_k\) must satisfy \(\sum _{k=1}^{\infty } \bigl ( \tfrac{l_k}{c_k} \bigr )^2 < \infty \), so that a good choice is \(c_k = \frac{c}{k^\gamma }\) where \(c, \gamma >0\). Finally, each element of the perturbation vector \(\varvec{\delta }_k\) is sampled from a uniform distribution over the set \(\{-1,+1\}\).
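To make these choices concrete, the following minimal sketch (our own illustration, not code from the paper) implements a single SPSA iteration in Python/NumPy; the gain constants l, L, alpha, c, gamma and the helper name spsa_step are illustrative assumptions.

```python
# Minimal sketch of one SPSA iteration following Eqs. (1)-(2); the gain
# constants below are illustrative assumptions, not values from the paper.
import numpy as np

def spsa_step(theta, F, k, l=0.1, L=10.0, alpha=0.602, c=0.1, gamma=0.101, rng=None):
    """One SPSA update of theta at iteration k >= 1; F is a (possibly noisy)
    evaluation of the objective f."""
    rng = rng if rng is not None else np.random.default_rng()
    l_k = l / (L + k) ** alpha                          # learning rate l_k
    c_k = c / k ** gamma                                # scaling factor c_k
    delta = rng.choice([-1.0, 1.0], size=theta.shape)   # entries of delta_k from {-1, +1}
    g_hat = (F(theta + c_k * delta) - F(theta - c_k * delta)) / (2.0 * c_k * delta)
    return theta + l_k * g_hat                          # gradient ascent step, Eq. (1)
```

For example, applying spsa_step repeatedly to a simple objective such as F = lambda t: -np.sum(t**2) drives theta towards the maximizer at the origin, even when the measurements are noisy.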

3 Learning Policies Using Echo State Networks

In this section, we first briefly review policy learning under partial observability as well as echo state networks and then introduce our approach towards policy learning using echo state networks trained via SPSA.

3.1 Partial Observability

Consider an agent interacting with an environment. At any time t, the agent observes the state \(\varvec{s}_t\) of the environment and performs an action \(a_t\) by following a policy \(\pi (a_t | \varvec{s}_t)\), which maps the state \(\varvec{s}_t\) to the probability of choosing action \(a_t\). In return, the environment responds with a reward \(r_{t}\) and transitions into a new state \(\varvec{s}_{t+1}\).

However, in environments that are only partially observable, the agent does not receive all relevant state information because of limited sensory inputs. In this case, the state \(\varvec{s}_t\) does not satisfy the Markov property because it does not summarize what has happened in the past, so that an informed decision cannot be taken. For such non-Markovian states, it is necessary to make the policy depend on a history of states \(h_t = \{\varvec{s}_t,\varvec{s}_{t-1}, \dots \}\) rather than on the current state \(\varvec{s}_t\) only. Hence, the policy becomes \(\pi (a_t | h_t)\).

This, however, becomes impractical whenever different tasks require histories of arbitrary length. In situations like these, an echo state network can be used to integrate the required history into its reservoir states. In this way, we can parameterize the policy with the weights \(\varvec{\theta }\) of an echo state network as \(\pi (a_t | \varvec{s}_t, \varvec{\theta })\), which takes the current state \(\varvec{s}_t\) as input and returns action probabilities by compacting the history of input states in the reservoir memory.

3.2 Echo State Networks

We next briefly recall the notion of echo state networks. These belong to the reservoir computing paradigm in which a large reservoir of recurrently interconnected neurons processes sequential input data. In our setup, the state of the environment \(\varvec{s}_t \in \mathbb {R}^{n_{s}}\) is given as the input to the network, and the hidden state and output of our policy network are denoted by \(\varvec{h}_t \in \mathbb {R}^{n_{h}}\) and \(\varvec{\pi }_t \in \mathbb {R}^{n_{a}}\), respectively. The temporal evolution of such a network is governed by the following non-linear dynamical system

$$\begin{aligned} \varvec{h}_t&= (1-\beta ) \, \varvec{h}_{t-1} + \beta \, f_h \bigl ( \varvec{W}^{h} \varvec{h}_{t-1} + \varvec{W}^{s} \varvec{s}_t \bigr ) \end{aligned}$$
(3)
$$\begin{aligned} \varvec{\pi }_t&= f_\pi \bigl ( \varvec{W}^{a} \varvec{h}_t \bigr ) \end{aligned}$$
(4)

where \(\beta \in [0, 1]\) is called the leaking rate and \(\varvec{W}^s\), \(\varvec{W}^h\), and \(\varvec{W}^a\) are the input, reservoir, and output weight matrices, respectively. The function \(f_h(\cdot )\) is understood to act component-wise on its argument and is typically a sigmoidal activation function. For the output layer, however, \(f_\pi (\cdot )\) is usually just a linear or softmax function depending on the application context.
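As a concrete illustration of Eqs. (3) and (4), the following sketch (an assumption of ours, not the authors' code) implements one time step of the echo state policy network with a tanh reservoir and a softmax read-out; all function and matrix names are chosen for illustration only.

```python
# Sketch of the leaky ESN state update (Eq. (3)) and softmax policy read-out
# (Eq. (4)); names are chosen for illustration only.
import numpy as np

def softmax(x):
    z = np.exp(x - np.max(x))       # subtract the maximum for numerical stability
    return z / np.sum(z)

def esn_step(h_prev, s_t, W_h, W_s, W_a, beta=0.3):
    """Advance the reservoir by one step and return the new hidden state
    together with the action probabilities pi_t."""
    h_t = (1.0 - beta) * h_prev + beta * np.tanh(W_h @ h_prev + W_s @ s_t)   # Eq. (3)
    pi_t = softmax(W_a @ h_t)                                                # Eq. (4)
    return h_t, pi_t
```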

3.3 Policy Learning Using Echo State Networks

At any time, the goal of the agent is to maximize the expected cumulative reward, or return, received over a period of time, which is defined as \(R_T = \sum _{t=1}^{T} r_t\).

Hence, the objective function that is to be maximized is \(f(\varvec{\theta }) = \mathbb {E}_{\pi _\theta } \big [ R_T\big ]\) and finding an optimal policy amounts to finding \(\varvec{\theta }^* = {\mathop {\mathrm {argmax}}\nolimits _{\varvec{\theta }}} f(\varvec{\theta })\) where we now write \(\varvec{\theta }\) to denote the set of weights of an echo state network used to approximate the policy \(\pi (a_t | \varvec{s}_t, \varvec{\theta })\).

According to our discussion in Sect. 2, we can then iteratively learn an optimal \(\varvec{\theta }\) according to a stochastic gradient ascent rule that follows the gradient \(\nabla _\theta \mathbb {E}_{\pi _\theta } \big [ R_T\big ]\). In particular, we can resort to SPSA in order to approximate this gradient as

$$\begin{aligned} \nabla _\theta \mathbb {E}_{\pi _\theta } \big [ R_T\big ] \approx \dfrac{F(\varvec{\theta } + \varvec{\epsilon }) - F(\varvec{\theta } - \varvec{\epsilon }) }{2 \, \varvec{\epsilon }} \end{aligned}$$
(5)

where \(F(\cdot )\) is the stochastic return obtained from the environment by running an episode in which, at each step, the agent follows the policy \(\pi (a_t | \varvec{s}_t, \varvec{\theta })\) approximated by the ESN, and where \(\varvec{\epsilon } = c_k \, \varvec{\delta }_k\) is the perturbation generated by SPSA. A summary of this learning method can be found in Algorithm 1.

Algorithm 1. Policy learning with an SPSA-trained echo state network.
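Since the pseudocode of Algorithm 1 is not reproduced here, the sketch below (our own illustration, assuming the classic OpenAI Gym reset/step interface and a discrete softmax policy) shows how the return measurement \(F(\varvec{\theta })\) of Eq. (5) can be obtained: one episode is rolled out with the ESN policy and the episodic total reward \(R_T\) is accumulated. Combined with a generic SPSA update such as the spsa_step sketch in Sect. 2, this yields the complete training loop.

```python
# Sketch of the return measurement F(theta) referred to in Eq. (5) and
# Algorithm 1: one episode is rolled out with the ESN softmax policy and the
# episodic total reward R_T is returned. The classic Gym reset/step interface
# is assumed; matrix names follow Sect. 3.2.
import numpy as np

def episodic_return(env, W_s, W_h, W_a, beta=0.3, rng=None):
    rng = rng if rng is not None else np.random.default_rng()
    h = np.zeros(W_h.shape[0])                  # reservoir state starts at zero
    s, done, total_reward = env.reset(), False, 0.0
    while not done:
        h = (1.0 - beta) * h + beta * np.tanh(W_h @ h + W_s @ s)   # Eq. (3)
        logits = W_a @ h                                           # Eq. (4), pre-softmax
        p = np.exp(logits - logits.max())
        p /= p.sum()
        a = int(rng.choice(len(p), p=p))        # sample an action from the softmax policy
        s, r, done, _ = env.step(a)
        total_reward += r
    return total_reward
```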

3.4 Deterministic and Stochastic Policies

An agent’s policy can be either deterministic or stochastic. In a discrete action space, the agent may apply a deterministic, greedy, “winner-takes-all” strategy to select an action, i.e. \(a_t = {\mathop {\mathrm {argmax}}\nolimits _{a}} \pi (a | \varvec{s}_t, \varvec{\theta })\). However, in order to encourage exploration, the agent can instead follow a stochastic softmax policy in which actions are sampled according to the action probabilities \(\pi (a_t | \varvec{s}_t, \varvec{\theta })\), i.e. \(a_t \sim f_\pi \) where \(f_\pi \) is the softmax function. In a continuous action space, the agent’s actions are sampled from a Gaussian policy parameterized by mean and variance neurons, that is, \(f_\pi \) is taken to be a Gaussian probability distribution. The three selection schemes are sketched below.
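The sketch is illustrative only; in particular, parameterizing the Gaussian policy via one mean and one log-variance output per action dimension is a common choice we assume here, not a detail taken from the paper.

```python
# Illustrative sketch of the three action-selection schemes; the Gaussian
# parameterization via mean and log-variance outputs is an assumption.
import numpy as np

rng = np.random.default_rng()

def greedy_action(pi):
    """Deterministic 'winner-takes-all' selection: a_t = argmax_a pi(a | s_t, theta)."""
    return int(np.argmax(pi))

def softmax_action(pi):
    """Stochastic discrete policy: sample a_t according to the probabilities pi."""
    return int(rng.choice(len(pi), p=pi))

def gaussian_action(mean, log_var):
    """Stochastic continuous policy: sample a_t from N(mean, var)."""
    return rng.normal(mean, np.sqrt(np.exp(log_var)))
```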

3.5 Three Variants of Echo State Network Training

Typically, in echo state networks, only the output weight matrix \(\varvec{W}^a\) is optimized. However, some tasks require tuning of the input and reservoir weights \(\varvec{W}^s\) and \(\varvec{W}^h\) in order to extract relevant information from observations or to construct missing state information. Therefore, we consider three variants of our SPSA algorithm using different choices of \(\varvec{\theta }\) at each iteration (a sketch of the corresponding parameter selection follows the list):

1. output_spsa: at each iteration, we optimize only the output weight matrix, that is, we let \(\varvec{\theta } = \varvec{W}^a\).

2. all_spsa: at each iteration, all of the weight matrices are updated at once, that is, we let \(\varvec{\theta } = \{\varvec{W}^s, \varvec{W}^h,\varvec{W}^a\}\).

3. alternating_spsa: at each iteration, we update only one of these matrices and alternate between them in subsequent iterations.
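A minimal sketch of how \(\varvec{\theta }\) could be assembled for each variant is given below; the flattening scheme and the function name select_theta are our own assumptions.

```python
# Sketch of how the flat SPSA parameter vector theta could be chosen per
# variant; the packing scheme and function name are assumptions.
import numpy as np

def select_theta(variant, W_s, W_h, W_a, iteration=0):
    if variant == "output_spsa":
        matrices = [W_a]
    elif variant == "all_spsa":
        matrices = [W_s, W_h, W_a]
    elif variant == "alternating_spsa":
        matrices = [[W_s, W_h, W_a][iteration % 3]]   # cycle through the matrices
    else:
        raise ValueError(f"unknown variant: {variant}")
    return np.concatenate([w.ravel() for w in matrices])
```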

4 Experiments and Results

We evaluated the above SPSA variants on a benchmark of classic control problems available from OpenAI Gym [2] and compared them against temporal difference and policy gradient learning methods.

Fig. 1. Test environments: (a) description of the observation and action spaces for the acrobot and mountain car tasks; (b), (c) task illustrations from OpenAI Gym.

4.1 Acrobot and Mountain Car

We considered two classic problems, namely Mountain Car and Acrobot, with both discrete and continuous action selection. For both problems, we restrict state observations to positional information, excluding velocities, so that the agent has to infer velocity information in order to recover the full state. An illustration of these OpenAI Gym problems and their state-action spaces is given in Fig. 1.

4.2 Implementation Details

We used the same echo state network architecture, consisting of 40 reservoir neurons with tanh activation functions, for our SPSA variants and their RL baselines. The number of input and output neurons and the output activation function are chosen depending on the task and the type of policy being learned. The weight matrices are initialized according to parameters such as sparsity, scaling and spectral radius, which are carefully set following the guidelines in [9]. The input and reservoir matrices are drawn from a uniform distribution over \([-0.5,0.5]\), whereas the output scaling is chosen differently for each task. The initial spectral radius of the reservoir matrix and the leaking rate are set to 1.0 and 0.3, respectively, for all tasks. The SPSA parameters, such as learning rate, scaling factor and decay rates, and the parameters of the reinforcement learning baselines, such as discount factor and learning rates, are tuned for each experiment. Table 1 lists all hyperparameters and their values.
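For concreteness, the following sketch initializes the ESN weight matrices with the settings reported above (40 reservoir neurons, uniform weights in \([-0.5, 0.5]\), initial spectral radius 1.0); the sparsity level and the zero-initialized output matrix are illustrative assumptions rather than values from the paper.

```python
# Sketch of ESN weight initialization using the reported settings; the
# sparsity level and zero-initialized output weights are assumptions.
import numpy as np

def init_esn(n_s, n_a, n_h=40, scale=0.5, spectral_radius=1.0, sparsity=0.9, seed=0):
    rng = np.random.default_rng(seed)
    W_s = rng.uniform(-scale, scale, size=(n_h, n_s))      # input weights
    W_h = rng.uniform(-scale, scale, size=(n_h, n_h))      # reservoir weights
    W_h *= rng.random((n_h, n_h)) > sparsity               # keep only a fraction of connections
    W_h *= spectral_radius / np.max(np.abs(np.linalg.eigvals(W_h)))   # rescale spectral radius
    W_a = np.zeros((n_a, n_h))                             # output weights (scaling is task-dependent)
    return W_s, W_h, W_a
```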

Table 1. Hyperparameters and their values for different experiments.

4.3 Results

First, we used our algorithms to train deterministic greedy policies for the discrete versions of the acrobot and mountain car tasks and found that all SPSA variants are able to solve both tasks. In a quantitative evaluation, we computed mean learning curves over 10 different random seeds and compared them to curves obtained using echo state networks trained with temporal difference methods, namely Q-learning and SARSA, using stochastic gradient descent. Figures 2(a) and (b) show the learning curves in terms of the evolution of episodic total reward during learning (the higher, the better). As we can observe, all SPSA variants find better policies than Q-learning or SARSA.

Next, we used our algorithms to learn stochastic policies for a discrete version of the acrobot task and a continuous version of the mountain car task. We found that the SPSA variants are able to solve these tasks by finding a softmax policy and a Gaussian policy, respectively. In a quantitative evaluation, we again computed mean learning curves over 10 different random seeds and compared them to results obtained using actor-critic methods. In the actor-critic method, two echo state networks are used, one to learn the policy (policy network) and one to learn the state value function (value network); both act on the same limited state information as our SPSA variants. Figures 2(c) and (d) show the learning curves; the SPSA variants again perform better than the actor-critic methods. Finally, in order to visualize the learned Gaussian policy for the mountain car task, we plotted action probabilities for selected input states. As can be seen in Fig. 3(b), for the same input states, the resulting action probability distribution is a mixture of Gaussians, meaning that actions are sampled from appropriate mixture components based on the hidden states of the network, which reconstruct the missing velocity information.

Fig. 2. Learning curves: (a), (c) evolution of episodic total reward when learning deterministic policies for the discrete versions of acrobot and mountain car; (b), (d) evolution of episodic total reward when learning a softmax and a Gaussian policy for the discrete acrobot and continuous mountain car tasks, respectively. It is evident that the SPSA variants perform better than the RL baselines.

Fig. 3. Performance summary: (a) evaluation results containing the average episodic total reward in the last 100 iterations of policy learning on the classic problems for the different variants and their baselines (the higher the value, the better the performance); (b) visualization of the Gaussian policy learned for the mountain car task.

Our most important evaluation results are summarized in Fig. 3(a), which shows average episodic total rewards in the last 100 iterations over 10 different random seeds. Here we observe: (i) training only the output weight matrix using SPSA yields better performance than its RL counterparts in all experiments, which indicates that SPSA is a powerful alternative to common RL methods; (ii) updating all weight matrices at once gives the best performance in all tasks; however, training in an alternating fashion also seems to be a promising approach that warrants further investigation; (iii) for the acrobot tasks, SPSA works better when learning a deterministic policy than a stochastic one. The reason could be that it is not necessary to also introduce stochasticity into the action space, since exploration already happens in parameter space through the perturbations. This concurs with the work in [14], whose authors also learn a deterministic policy when using black-box methods. Nevertheless, our approach demonstrates the general feasibility of learning both deterministic and stochastic policies.

5 Conclusion

In this paper, we considered the use of SPSA for training echo state networks to solve action selection tasks under partial observability. We proposed three variants that perform gradient updates without back-propagation. Experiments on classic control problems indicate that SPSA is a powerful alternative to the reinforcement learning methods commonly used for policy learning. In future work, we intend to extend the ideas reported here to LSTM units in order to solve more complex RL problems that require long-term dependencies. We also plan to examine the alternating SPSA variant further to verify its applicability in training deep recurrent neural networks.