Abstract
In Chapter 3, we showed that potentials and performance gradients can be estimated from a single sample path of a Markov chain and that the estimated potentials and gradients can be used in gradient-based performance optimization of Markov systems. In this chapter, we show that sample-path-based potential estimates can also be used in policy iteration to find optimal policies. We focus on the average-reward optimality criterion and on ergodic Markov chains.
It is a mistake to try to look too far ahead. The chain of destiny can only be grasped one link at a time.
Sir Winston Churchill, British politician (1874–1965)
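Before the formal development, a minimal numerical sketch may help fix ideas. Assuming states 0, ..., S−1, a reward vector f, and one transition matrix per action, the Python fragment below estimates the potentials g(i) from one sample path by averaging truncated sums of f(X_l) − η̂ over the visits to each state i, and then performs one greedy policy-improvement step. The truncation length L, the helper names, and the averaging scheme are illustrative assumptions, not the book's exact algorithm.

```python
import numpy as np

def simulate(P, rewards, start, T, rng):
    """Generate a sample path of length T and its reward sequence
    from transition matrix P (assumed ergodic)."""
    path = np.empty(T, dtype=int)
    path[0] = start
    for t in range(1, T):
        path[t] = rng.choice(P.shape[0], p=P[path[t - 1]])
    return path, rewards[path]

def estimate_potentials(path, f, num_states, L=50):
    """Sample-path potential estimate (a common truncated form):
    g(i) ~ average over visits n with X_n = i of
           sum_{l=n}^{n+L-1} (f(X_l) - eta_hat),
    with eta_hat the long-run average reward along the path."""
    eta_hat = f.mean()              # average-reward estimate
    g_sum = np.zeros(num_states)
    visits = np.zeros(num_states)
    for n in range(len(path) - L):
        i = path[n]
        g_sum[i] += np.sum(f[n:n + L] - eta_hat)
        visits[i] += 1
    visits[visits == 0] = 1         # guard unvisited states
    return g_sum / visits

def improve_policy(P_a, r_a, g):
    """Greedy improvement: in each state pick the action maximizing
    f(i, a) + sum_j p(j | i, a) * g(j), using estimated potentials g."""
    num_states = P_a[0].shape[0]
    return np.array([
        np.argmax([r_a[a][i] + P_a[a][i] @ g for a in range(len(P_a))])
        for i in range(num_states)
    ])
```

Iterating these three steps (simulate under the current policy, estimate potentials, improve greedily) until the policy stops changing gives the sample-path-based policy-iteration loop that this chapter develops rigorously.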
Copyright information
© 2007 Springer Science+Business Media, LLC
About this chapter
Cite this chapter
Cao, X.-R. (2007). Sample-Path-Based Policy Iteration. In: Stochastic Learning and Optimization. Springer, Boston, MA. https://doi.org/10.1007/978-0-387-69082-7_5
DOI: https://doi.org/10.1007/978-0-387-69082-7_5
Publisher Name: Springer, Boston, MA
Print ISBN: 978-0-387-36787-3
Online ISBN: 978-0-387-69082-7
eBook Packages: Computer Science (R0)