Abstract
In this paper we consider the problem of finding a near-optimal policy in a continuous-space, discounted Markovian Decision Problem (MDP) by employing value-function-based methods when only a single trajectory of a fixed policy is available as input. We study a policy-iteration algorithm in which the iterates are obtained via empirical risk minimization with a risk function that penalizes large magnitudes of the Bellman residual. Our main result is a finite-sample, high-probability bound on the performance of the computed policy that depends on the mixing rate of the trajectory, the capacity of the function set as measured by a novel capacity concept (the VC-crossing dimension), the approximation power of the function set, and the controllability properties of the MDP. Moreover, we prove that when a linear parameterization is used, the new algorithm is equivalent to Least-Squares Policy Iteration. To the best of our knowledge, this is the first theoretical result for off-policy control learning over continuous state spaces using a single trajectory.
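To make the scheme described above concrete, the following is a minimal, illustrative sketch, not the authors' exact algorithm: policy iteration in which each evaluation step fits a linearly parameterized action-value function by least-squares minimization of the plain squared empirical Bellman residual along a single trajectory, followed by a greedy improvement step. The paper itself uses a modified (debiased) Bellman-residual risk, and the names below (phi, n_actions, gamma, the trajectory layout) are assumptions made only for this example.

```python
# Illustrative sketch of Bellman-residual-based fitted policy iteration from a
# single sample path, with a linear parameterization Q(s, a) = phi(s, a) . w.
# This uses the naive squared Bellman residual; the paper employs a modified risk.
import numpy as np

def fit_q_brm(states, actions, rewards, next_states, phi, gamma, policy):
    """Policy evaluation: minimize sum_t (r_t + gamma*Q(s'_t, policy(s'_t)) - Q(s_t, a_t))^2
    over linear Q, which reduces to a linear least-squares problem in the weights w."""
    X = np.array([phi(s, a) for s, a in zip(states, actions)])
    Xn = np.array([phi(sn, policy(sn)) for sn in next_states])
    A = X - gamma * Xn                      # Bellman residual is A w - r for linear Q
    w, *_ = np.linalg.lstsq(A, np.array(rewards), rcond=None)
    return w

def greedy(w, phi, n_actions):
    """Policy improvement: act greedily with respect to the fitted Q."""
    return lambda s: max(range(n_actions), key=lambda a: float(phi(s, a) @ w))

def fitted_policy_iteration(trajectory, phi, n_actions, gamma, n_iter=10):
    """trajectory: a single sample path of (s, a, r, s') tuples generated by a fixed policy."""
    states, actions, rewards, next_states = map(list, zip(*trajectory))
    policy = lambda s: 0                    # arbitrary initial policy
    for _ in range(n_iter):
        w = fit_q_brm(states, actions, rewards, next_states, phi, gamma, policy)
        policy = greedy(w, phi, n_actions)
    return policy
```

Because the residual is linear in the weights under a linear parameterization, each evaluation step is an ordinary least-squares problem; this is the structural reason the approach can be related to Least-Squares Policy Iteration, as the paper proves for its modified risk.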
References
Anthony, M., & Bartlett, P. L. (1999). Neural network learning: theoretical foundations. Cambridge: Cambridge University Press.
Antos, A., Szepesvári, C., & Munos, R. (2006). Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. In G. Lugosi & H. Simon (Eds.), LNCS/LNAI : Vol. 4005. Proceedings of the nineteenth annual conference on learning theory, COLT 2006 (pp. 574–588), Pittsburgh, PA, USA, 22–25 June 2006. Berlin: Springer.
Antos, A., Munos, R., & Szepesvári, C. (2007a, in press). Fitted Q-iteration in continuous action-space MDPs. In Advances in neural information processing systems.
Antos, A., Szepesvári, C., & Munos, R. (2007b). Value-iteration based fitted policy iteration: learning with a single trajectory. In IEEE symposium on approximate dynamic programming and reinforcement learning (ADPRL 2007) (pp. 330–337), Honolulu, HI, 1–5 April 2007. New York: IEEE.
Baraud, Y., Comte, F., & Viennet, G. (2001). Adaptive estimation in autoregression or β-mixing regression via model selection. Annals of Statistics, 29, 839–875.
Bellman, R., & Dreyfus, S. (1959). Functional approximation and dynamic programming. Mathematical Tables and Other Aids to Computation, 13, 247–251.
Bertsekas, D. P., & Shreve, S. (1978). Stochastic optimal control (the discrete time case). New York: Academic Press.
Bertsekas, D. P., & Tsitsiklis, J. N. (1996). Neuro-dynamic programming. Belmont: Athena Scientific.
Bradtke, S., & Barto, A. (1996). Linear least-squares algorithms for temporal difference learning. Machine Learning, 22, 33–57.
Carrasco, M., & Chen, X. (2002). Mixing and moment properties of various GARCH and stochastic volatility models. Econometric Theory, 18, 17–39.
Cheney, E. (1966). Introduction to approximation theory. London: McGraw-Hill.
Davydov, Y. A. (1973). Mixing conditions for Markov chains. Theory of Probability and its Applications, 18, 312–328.
Devroye, L., Györfi, L., & Lugosi, G. (1996). A probabilistic theory of pattern recognition. Applications of mathematics: stochastic modelling and applied probability. New York: Springer.
Dietterich, T. G., & Wang, X. (2002). Batch value function approximation via support vectors. In T. G. Dietterich, S. Becker, & Z. Ghahramani (Eds.), Advances in neural information processing systems (Vol. 14). Cambridge: MIT Press.
Doukhan, P. (1994). Lecture notes in statistics: Vol. 85. Mixing: properties and examples. Berlin: Springer.
Ernst, D., Geurts, P., & Wehenkel, L. (2005). Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 6, 503–556.
Gordon, G. (1995). Stable function approximation in dynamic programming. In A. Prieditis & S. Russell (Eds.), Proceedings of the twelfth international conference on machine learning (pp. 261–268). San Francisco: Morgan Kaufmann.
Guestrin, C., Koller, D., & Parr, R. (2001). Max-norm projections for factored MDPs. In Proceedings of the international joint conference on artificial intelligence.
Györfi, L., Kohler, M., Krzyżak, A., & Walk, H. (2002). A distribution-free theory of nonparametric regression. Berlin: Springer.
Haussler, D. (1995). Sphere packing numbers for subsets of the Boolean n-cube with bounded Vapnik–Chervonenkis dimension. Journal of Combinatorial Theory, Series A, 69(2), 217–232.
Howard, R. A. (1960). Dynamic programming and Markov processes. Cambridge: MIT Press.
Kuczma, M. (1985). An introduction to the theory of functional equations and inequalities. Katowice: Silesian University Press.
Lagoudakis, M., & Parr, R. (2003). Least-squares policy iteration. Journal of Machine Learning Research, 4, 1107–1149.
Meir, R. (2000). Nonparametric time series prediction through adaptive model selection. Machine Learning, 39(1), 5–34.
Meyn, S., & Tweedie, R. (1993). Markov chains and stochastic stability. New York: Springer.
Munos, R. (2003). Error bounds for approximate policy iteration. In 19th international conference on machine learning (pp. 560–567).
Munos, R., & Szepesvári, C. (2006). Finite time bounds for sampling based fitted value iteration (Technical report). Computer and Automation Research Institute of the Hungarian Academy of Sciences, Kende u. 13-17, Budapest 1111, Hungary.
Murphy, S. (2005). A generalization error for Q-learning. Journal of Machine Learning Research, 6, 1073–1097.
Nobel, A. (1996). Histogram regression estimation using data-dependent partitions. Annals of Statistics, 24(3), 1084–1105.
Ormoneit, D., & Sen, S. (2002). Kernel-based reinforcement learning. Machine Learning, 49, 161–178.
Pollard, D. (1984). Convergence of stochastic processes. New York: Springer.
Pollard, D. (1990). Empirical processes: theory and applications. NSF-CBMS regional conference series in probability and statistics. Hayward: Institute of Mathematical Statistics.
Precup, D., Sutton, R., & Dasgupta, S. (2001). Off-policy temporal difference learning with function approximation. In Proceedings of the eighteenth international conference on machine learning (ICML 2001) (pp. 417–424).
Samuel, A. (1959). Some studies in machine learning using the game of checkers. IBM Journal of Research and Development, pp. 210–229. Reprinted in E. A. Feigenbaum & J. Feldman (Eds.), Computers and thought. New York: McGraw-Hill (1963).
Schweitzer, P., & Seidmann, A. (1985). Generalized polynomial approximations in Markovian decision processes. Journal of Mathematical Analysis and Applications, 110, 568–582.
Sutton, R., & Barto, A. (1998). Reinforcement learning: an introduction. Cambridge: MIT Press (A Bradford Book).
Szepesvári, C., & Munos, R. (2005). Finite time bounds for sampling based fitted value iteration. In ICML’2005 (pp. 881–886).
Szepesvári, C., & Smart, W. (2004). Interpolation-based Q-learning. In D. Schuurmans & R. Greiner (Eds.), Proceedings of the international conference on machine learning (pp. 791–798).
Tsitsiklis, J. N., & Van Roy, B. (1996). Feature-based methods for large scale dynamic programming. Machine Learning, 22, 59–94.
Wang, X., & Dietterich, T. (1999). Efficient value function approximation using regression trees. In Proceedings of the IJCAI workshop on statistical machine learning for large-scale optimization, Stockholm, Sweden.
Williams, R. J., & Baird, L. C., III (1994). Tight performance bounds on greedy policies based on imperfect value functions. In Proceedings of the tenth Yale workshop on adaptive and learning systems.
Yu, B. (1994). Rates of convergence for empirical processes of stationary mixing sequences. The Annals of Probability, 22(1), 94–116.
Additional information
Editors: Hans Ulrich Simon, Gabor Lugosi, Avrim Blum.
This paper appeared in a preliminary form at COLT 2006 (Antos et al. in LNCS/LNAI, vol. 4005, pp. 574–588, 2006).
Cite this article
Antos, A., Szepesvári, C. & Munos, R. Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. Mach Learn 71, 89–129 (2008). https://doi.org/10.1007/s10994-007-5038-2