An Online Policy Gradient Algorithm for Markov Decision Processes with Continuous States and Actions

Ma, Yao; Zhao, Tingting; Hatano, Kohei; Sugiyama, Masashi

doi:10.1007/978-3-662-44851-9_23

Yao Ma²³,
Tingting Zhao²³,
Kohei Hatano²⁴ &
…
Masashi Sugiyama²³

Part of the book series: Lecture Notes in Computer Science ((LNAI,volume 8725))

Included in the following conference series:

Joint European Conference on Machine Learning and Knowledge Discovery in Databases

4186 Accesses
1 Citations

Abstract

We consider the learning problem under an online Markov decision process (MDP), which is aimed at learning the time-dependent decision-making policy of an agent that minimizes the regret — the difference from the best fixed policy. The difficulty of online MDP learning is that the reward function changes over time. In this paper, we show that a simple online policy gradient algorithm achieves regret \(O(\sqrt{T})\) for T steps under a certain concavity assumption and O(logT) under a strong concavity assumption. To the best of our knowledge, this is the first work to give an online MDP algorithm that can handle continuous state, action, and parameter spaces with guarantee. We also illustrate the behavior of the online policy gradient method through experiments.

Download to read the full chapter text

Chapter PDF

Softmax policy gradient methods can take exponential time to converge

Article Open access 23 January 2023

Online Learning in Markov Decision Processes with Continuous Actions

Policy gradient in Lipschitz Markov Decision Processes

Article 03 March 2015

Keywords

References

Even-Dar, E., Kakade, S.M., Mansour, Y.: Experts in a Markov Decision Process. In: Advances in Neural Information Processing System 17, pp. 401–408. MIT Press, Cambridge (2005)
Google Scholar
Even-Dar, E., Karade, S.M., Mansour, Y.: Online Markov Decision Processes. Mathematics of Operations Research 34(3), 726–736 (2009)
Article MATH MathSciNet Google Scholar
Neu, G., György, A., Szepesvári, C., Antos, A.: Online Markov Decision Processes under Bandit Feedback. In: Advances in Neural Information Processing Systems 23, pp. 1804–1812 (2010)
Google Scholar
Neu, G., György, A., Szepesvári, C.: The Online Loop-free Stochastic Shortest-path Problem. In: Conference on Learning Theory, pp. 231–243 (2010)
Google Scholar
Peter, J., Schaal, S.: Policy Gradient Methods for Robotics. In: Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (2006)
Google Scholar
Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction. MIT Press (1998)
Google Scholar
Williams, R.J.: Simple Statistical Gradient-following Algorithms for Connetionist Reinforcement Learning. Machine Learning 8(3-4), 229–256 (1992)
Article MATH Google Scholar
Yu, J.Y., Mannor, S., Shimkin, N.: Markov Decision Processes with Arbitrary Reward Processes. Mathematics of Operations Research 34(3), 737–757 (2009)
Article MATH MathSciNet Google Scholar
Zimin, A., Neu, G.: Online Learning in Episodic Markovian Decision Processes by Relative Entropy Policy Search. In: Advances in Neural Information Processing Systems 26, pp. 1583–1591 (2013)
Google Scholar
Zinkevich, M.: Online Convex Programming and Generalized Infinitesimal Gradient Ascent. In: International Conference on Machine Learning, pp. 928–936. AAAI Press (2003)
Google Scholar
Abbasi-Yadkori, Y., Bartlett, P., Kanade, V., Seldin, Y., Szepesvári, C.: Online Learning in Markov Decision Processes with Adversarially Chosen Transition Probability Distributions. In: Advances in Neural Information Processing Systems 26 (2013)
Google Scholar
Hazan, E., Agarwal, A., Kale, S.: Logarithmic Regret Algorithms for Online Convex Optimization. Machine Learning 69, 169–192 (2007)
Article Google Scholar

Download references

Author information

Authors and Affiliations

Tokyo Institute of Technology, 2-12-1 O-okayama, Meguro, Tokyo, 152-8552, Japan
Yao Ma, Tingting Zhao & Masashi Sugiyama
Kyushu University, 744 Motooka, Nishi, Fukuoka, 819-0395, Japan
Kohei Hatano

Authors

Yao Ma
View author publications
You can also search for this author in PubMed Google Scholar
Tingting Zhao
View author publications
You can also search for this author in PubMed Google Scholar
Kohei Hatano
View author publications
You can also search for this author in PubMed Google Scholar
Masashi Sugiyama
View author publications
You can also search for this author in PubMed Google Scholar

Editor information

Editors and Affiliations

Faculty of Applied Sciences,Department of Computer and Decision Engineering, Université Libre de Bruxelles, Av. F. Roosevelt, CP 165/15, 1050, Brussels, Belgium
Toon Calders
Dipartimento di Informatica, Università degli Studi “Aldo Moro”, via Orabona 4, 70125, Bari, Italy
Floriana Esposito
Department of Computer Science, Universität Paderborn, Warburger Str. 100, 33098, Paderborn, Germany
Eyke Hüllermeier
Dipartimento di Informatica, Università degli Studi di Torino, Corso Svizzera 185, 10149, Torino, Italy
Rosa Meo

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Ma, Y., Zhao, T., Hatano, K., Sugiyama, M. (2014). An Online Policy Gradient Algorithm for Markov Decision Processes with Continuous States and Actions. In: Calders, T., Esposito, F., Hüllermeier, E., Meo, R. (eds) Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2014. Lecture Notes in Computer Science(), vol 8725. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-44851-9_23

Download citation

DOI: https://doi.org/10.1007/978-3-662-44851-9_23
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-662-44850-2
Online ISBN: 978-3-662-44851-9
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics

An Online Policy Gradient Algorithm for Markov Decision Processes with Continuous States and Actions

Abstract

Chapter PDF

Similar content being viewed by others

Softmax policy gradient methods can take exponential time to converge

Online Learning in Markov Decision Processes with Continuous Actions

Policy gradient in Lipschitz Markov Decision Processes

Keywords

References

Author information

Authors and Affiliations

Editor information

Editors and Affiliations

Rights and permissions

Copyright information

About this paper

Cite this paper

Download citation

Publish with us

Navigation

An Online Policy Gradient Algorithm for Markov Decision Processes with Continuous States and Actions

Abstract

Chapter PDF

Similar content being viewed by others

Softmax policy gradient methods can take exponential time to converge

Online Learning in Markov Decision Processes with Continuous Actions

Policy gradient in Lipschitz Markov Decision Processes

Keywords

References

Author information

Authors and Affiliations

Editor information

Editors and Affiliations

Rights and permissions

Copyright information

About this paper

Cite this paper

Download citation

Share this paper

Publish with us

Search

Navigation