Abstract
We consider the classical policy iteration method of dynamic programming (DP), where approximations and simulation are used to deal with the curse of dimensionality. We survey a number of issues: convergence and rate of convergence of approximate policy evaluation methods, singularity and susceptibility to simulation noise of policy evaluation, exploration issues, constrained and enhanced policy iteration, policy oscillation and chattering, and optimistic and distributed policy iteration. Our discussion of policy evaluation is couched in general terms and aims to unify the available methods in the light of recent research developments and to compare the two main policy evaluation approaches: projected equations and temporal differences (TD), and aggregation. In the context of these approaches, we survey two different types of simulation-based algorithms: matrix inversion methods, such as least-squares temporal difference (LSTD), and iterative methods, such as least-squares policy evaluation (LSPE) and TD(λ), and their scaled variants. We discuss a recent method, based on regression and regularization, which rectifies the unreliability of LSTD for nearly singular projected Bellman equations. An iterative version of this method belongs to the LSPE class of methods and provides the connecting link between LSTD and LSPE. Our discussion of policy improvement focuses on the role of policy oscillation and its effect on performance guarantees. We illustrate that policy evaluation when done by the projected equation/TD approach may lead to policy oscillation, but when done by aggregation it does not. This implies better error bounds and more regular performance for aggregation, at the expense of some loss of generality in cost function representation capability. Hard aggregation provides the connecting link between projected equation/TD-based and aggregation-based policy evaluation, and is characterized by favorable error bounds.
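The projected-equation methods contrasted in the abstract admit a compact matrix description. As an illustrative sketch (the toy Markov chain, feature matrix, and all numbers below are invented for illustration and are not taken from the paper), the following Python code computes, for a fixed policy, the matrix-inversion (LSTD-style) and iterative (LSPE-style) solutions of the projected Bellman equation Φr = ΠT(Φr), together with a regularized solve of the kind the survey discusses for nearly singular systems; the exact matrices stand in for their simulation-based estimates.

```python
import numpy as np

# Toy 3-state chain under a fixed policy (illustrative numbers only).
# The projected Bellman equation Phi r = Pi T(Phi r) reduces to A r = b with
#   A = Phi' Xi (I - alpha P) Phi,   b = Phi' Xi g,
# where Xi = diag(xi) and xi is the steady-state distribution of P.
alpha = 0.9                                   # discount factor
P = np.array([[0.5, 0.4, 0.1],                # transition probabilities
              [0.2, 0.5, 0.3],
              [0.3, 0.3, 0.4]])
g = np.array([1.0, 2.0, 0.5])                 # expected one-stage costs
Phi = np.array([[1.0, 0.0],                   # feature matrix, 2 features
                [1.0, 1.0],
                [1.0, 2.0]])

# Steady-state distribution: left eigenvector of P for eigenvalue 1.
evals, evecs = np.linalg.eig(P.T)
xi = np.real(evecs[:, np.argmax(np.real(evals))])
xi = xi / xi.sum()
Xi = np.diag(xi)

A = Phi.T @ Xi @ (np.eye(3) - alpha * P) @ Phi
b = Phi.T @ Xi @ g

# Matrix-inversion (LSTD-style) solution.
r = np.linalg.solve(A, b)

# Regularized regression-based solve: damps the effect of a nearly
# singular A (or of simulation noise in its estimate); beta is a small
# illustrative regularization weight.
beta = 1e-3
r_reg = np.linalg.solve(A.T @ A + beta * np.eye(2), A.T @ b)

# Iterative (LSPE-style) solution: Phi r_{k+1} = Pi T(Phi r_k), i.e.
# r_{k+1} = r_k - (Phi' Xi Phi)^{-1} (A r_k - b); converges geometrically
# since Pi T is an alpha-contraction in the xi-weighted norm.
G = np.linalg.inv(Phi.T @ Xi @ Phi)
r_it = np.zeros(2)
for _ in range(200):
    r_it = r_it - G @ (A @ r_it - b)

print(r, r_reg, r_it)
```

The same A and b are what LSTD estimates by simulation before inverting, and what LSPE uses one iteration at a time; with exact matrices, as here, the two solutions coincide.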
This work was supported by the National Science Foundation (No. ECCS-0801549), the LANL Information Science and Technology Institute, and the Air Force (No. FA9550-10-1-0412).
Dimitri P. BERTSEKAS studied engineering at the National Technical University of Athens, Greece, obtained his M.S. degree in Electrical Engineering at the George Washington University, Washington D.C. in 1969, and his Ph.D. in System Science in 1971 at the Massachusetts Institute of Technology. He has held faculty positions with the Engineering-Economic Systems Department, Stanford University (1971–1974) and the Electrical Engineering Department of the University of Illinois, Urbana (1974–1979). Since 1979 he has been teaching at the Electrical Engineering and Computer Science Department of the Massachusetts Institute of Technology (MIT), where he is currently McAfee Professor of Engineering. He consults regularly with private industry and has held editorial positions in several journals. His research at MIT spans several fields, including optimization, control, large-scale computation, and data communication networks, and is closely tied to his teaching and book authoring activities. He has written numerous research papers, and fourteen books, several of which are used as textbooks in MIT classes.
He was awarded the INFORMS 1997 Prize for Research Excellence in the Interface Between Operations Research and Computer Science for his book ‘Neuro-dynamic Programming’ (coauthored with John Tsitsiklis), the 2000 Greek National Award for Operations Research, the 2001 ACC John R. Ragazzini Education Award, and the 2009 INFORMS Expository Writing Award. In 2001, he was elected to the United States National Academy of Engineering.
Bertsekas, D.P. Approximate policy iteration: a survey and some new methods. J. Control Theory Appl. 9, 310–335 (2011). https://doi.org/10.1007/s11768-011-1005-3