Abstract
In this paper we consider the problem of finding a near-optimal policy in a continuous-space, discounted Markovian Decision Problem (MDP) by employing value-function-based methods when only a single trajectory of a fixed policy is available as input. We study a policy-iteration algorithm in which the iterates are obtained via empirical risk minimization with a risk function that penalizes large magnitudes of the Bellman residual. Our main result is a finite-sample, high-probability bound on the performance of the computed policy that depends on the mixing rate of the trajectory, the capacity of the function set as measured by a novel capacity concept (the VC-crossing dimension), the approximation power of the function set, and the controllability properties of the MDP. Moreover, we prove that when a linear parameterization is used, the new algorithm is equivalent to Least-Squares Policy Iteration. To the best of our knowledge, this is the first theoretical result for off-policy control learning over continuous state spaces using a single trajectory.
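To make the scheme described above concrete, the following is a minimal, illustrative sketch, not the authors' exact algorithm: policy iteration in which each evaluation step fits a linearly parameterized action-value function by least-squares minimization of the plain squared empirical Bellman residual along a single trajectory, followed by a greedy improvement step. The paper itself uses a modified (debiased) Bellman-residual risk, and the names below (phi, n_actions, gamma, the trajectory layout) are assumptions made only for this example.

```python
# Illustrative sketch of Bellman-residual-based fitted policy iteration from a
# single sample path, with a linear parameterization Q(s, a) = phi(s, a) . w.
# This uses the naive squared Bellman residual; the paper employs a modified risk.
import numpy as np

def fit_q_brm(states, actions, rewards, next_states, phi, gamma, policy):
    """Policy evaluation: minimize sum_t (r_t + gamma*Q(s'_t, policy(s'_t)) - Q(s_t, a_t))^2
    over linear Q, which reduces to a linear least-squares problem in the weights w."""
    X = np.array([phi(s, a) for s, a in zip(states, actions)])
    Xn = np.array([phi(sn, policy(sn)) for sn in next_states])
    A = X - gamma * Xn                      # Bellman residual is A w - r for linear Q
    w, *_ = np.linalg.lstsq(A, np.array(rewards), rcond=None)
    return w

def greedy(w, phi, n_actions):
    """Policy improvement: act greedily with respect to the fitted Q."""
    return lambda s: max(range(n_actions), key=lambda a: float(phi(s, a) @ w))

def fitted_policy_iteration(trajectory, phi, n_actions, gamma, n_iter=10):
    """trajectory: a single sample path of (s, a, r, s') tuples generated by a fixed policy."""
    states, actions, rewards, next_states = map(list, zip(*trajectory))
    policy = lambda s: 0                    # arbitrary initial policy
    for _ in range(n_iter):
        w = fit_q_brm(states, actions, rewards, next_states, phi, gamma, policy)
        policy = greedy(w, phi, n_actions)
    return policy
```

Because the residual is linear in the weights under a linear parameterization, each evaluation step is an ordinary least-squares problem; this is the structural reason the approach can be related to Least-Squares Policy Iteration, as the paper proves for its modified risk.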
References
Anthony, M., & Bartlett, P. L. (1999). Neural network learning: theoretical foundations. Cambridge: Cambridge University Press.
Antos, A., Szepesvári, C., & Munos, R. (2006). Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. In G. Lugosi & H. Simon (Eds.), LNCS/LNAI : Vol. 4005. Proceedings of the nineteenth annual conference on learning theory, COLT 2006 (pp. 574–588), Pittsburgh, PA, USA, 22–25 June 2006. Berlin: Springer.
Antos, A., Munos, R., & Szepesvári, C. (2007a, in press). Fitted Q-iteration in continuous action-space MDPs. In Advances in neural information processing systems.
Antos, A., Szepesvári, C., & Munos, R. (2007b). Value-iteration based fitted policy iteration: learning with a single trajectory. In IEEE symposium on approximate dynamic programming and reinforcement learning (ADPRL 2007) (pp. 330–337), Honolulu, HI, 1–5 April 2007. New York: IEEE.
Baraud, Y., Comte, F., & Viennet, G. (2001). Adaptive estimation in autoregression or β-mixing regression via model selection. Annals of Statistics, 29, 839–875.
Bellman, R., & Dreyfus, S. (1959). Functional approximation and dynamic programming. Mathematical Tables and Other Aids to Computation, 13, 247–251.
Bertsekas, D. P., & Shreve, S. (1978). Stochastic optimal control (the discrete time case). New York: Academic Press.
Bertsekas, D. P., & Tsitsiklis, J. N. (1996). Neuro-dynamic programming. Belmont: Athena Scientific.
Bradtke, S., & Barto, A. (1996). Linear least-squares algorithms for temporal difference learning. Machine Learning, 22, 33–57.
Carrasco, M., & Chen, X. (2002). Mixing and moment properties of various GARCH and stochastic volatility models. Econometric Theory, 18, 17–39.
Cheney, E. (1966). Introduction to approximation theory. London: McGraw-Hill.
Davydov, Y. A. (1973). Mixing conditions for Markov chains. Theory of Probability and its Applications, 18, 312–328.
Devroye, L., Györfi, L., & Lugosi, G. (1996). A probabilistic theory of pattern recognition. Applications of mathematics: stochastic modelling and applied probability. New York: Springer.
Dietterich, T. G., & Wang, X. (2002). Batch value function approximation via support vectors. In T. G. Dietterich, S. Becker, & Z. Ghahramani (Eds.), Advances in neural information processing systems (Vol. 14). Cambridge: MIT Press.
Doukhan, P. (1994). Lecture notes in statistics: Vol. 85. Mixing: properties and examples. Berlin: Springer.
Ernst, D., Geurts, P., & Wehenkel, L. (2005). Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 6, 503–556.
Gordon, G. (1995). Stable function approximation in dynamic programming. In A. Prieditis & S. Russell (Eds.), Proceedings of the twelfth international conference on machine learning (pp. 261–268). San Francisco: Morgan Kaufmann.
Guestrin, C., Koller, D., & Parr, R. (2001). Max-norm projections for factored MDPs. In Proceedings of the international joint conference on artificial intelligence.
Györfi, L., Kohler, M., Krzyżak, A., & Walk, H. (2002). A distribution-free theory of nonparametric regression. Berlin: Springer.
Haussler, D. (1995). Sphere packing numbers for subsets of the Boolean n-cube with bounded Vapnik–Chervonenkis dimension. Journal of Combinatorial Theory, Series A, 69(2), 217–232.
Howard, R. A. (1960). Dynamic programming and Markov processes. Cambridge: MIT Press.
Kuczma, M. (1985). An introduction to the theory of functional equations and inequalities. Katowice: Silesian University Press.
Lagoudakis, M., & Parr, R. (2003). Least-squares policy iteration. Journal of Machine Learning Research, 4, 1107–1149.
Meir, R. (2000). Nonparametric time series prediction through adaptive model selection. Machine Learning, 39(1), 5–34.
Meyn, S., & Tweedie, R. (1993). Markov chains and stochastic stability. New York: Springer.
Munos, R. (2003). Error bounds for approximate policy iteration. In 19th international conference on machine learning (pp. 560–567).
Munos, R., & Szepesvári, C. (2006). Finite time bounds for sampling based fitted value iteration (Technical report). Computer and Automation Research Institute of the Hungarian Academy of Sciences, Kende u. 13-17, Budapest 1111, Hungary.
Murphy, S. (2005). A generalization error for Q-learning. Journal of Machine Learning Research, 6, 1073–1097.
Nobel, A. (1996). Histogram regression estimation using data-dependent partitions. Annals of Statistics, 24(3), 1084–1105.
Ormoneit, D., & Sen, S. (2002). Kernel-based reinforcement learning. Machine Learning, 49, 161–178.
Pollard, D. (1984). Convergence of stochastic processes. New York: Springer.
Pollard, D. (1990). Empirical processes: theory and applications. NSF-CBMS regional conference series in probability and statistics. Hayward: Institute of Mathematical Statistics.
Precup, D., Sutton, R., & Dasgupta, S. (2001). Off-policy temporal difference learning with function approximation. In Proceedings of the eighteenth international conference on machine learning (ICML 2001) (pp. 417–424).
Samuel, A. (1959). Some studies in machine learning using the game of checkers. IBM Journal of Research and Development, pp. 210–229. Reprinted in E. A. Feigenbaum & J. Feldman (Eds.), Computers and thought. New York: McGraw-Hill (1963).
Schweitzer, P., & Seidmann, A. (1985). Generalized polynomial approximations in Markovian decision processes. Journal of Mathematical Analysis and Applications, 110, 568–582.
Sutton, R., & Barto, A. (1998). Reinforcement learning: an introduction. Cambridge: MIT Press (A Bradford Book).
Szepesvári, C., & Munos, R. (2005). Finite time bounds for sampling based fitted value iteration. In ICML’2005 (pp. 881–886).
Szepesvári, C., & Smart, W. (2004). Interpolation-based Q-learning. In D. Schuurmans & R. Greiner (Eds.), Proceedings of the international conference on machine learning (pp. 791–798).
Tsitsiklis, J. N., & Van Roy, B. (1996). Feature-based methods for large scale dynamic programming. Machine Learning, 22, 59–94.
Wang, X., & Dietterich, T. (1999). Efficient value function approximation using regression trees. In Proceedings of the IJCAI workshop on statistical machine learning for large-scale optimization, Stockholm, Sweden.
Williams, R. J., & Baird, L. C., III (1994). Tight performance bounds on greedy policies based on imperfect value functions. In Proceedings of the tenth Yale workshop on adaptive and learning systems.
Yu, B. (1994). Rates of convergence for empirical processes of stationary mixing sequences. The Annals of Probability, 22(1), 94–116.
Additional information
Editors: Hans Ulrich Simon, Gabor Lugosi, Avrim Blum.
This paper appeared in a preliminary form at COLT 2006 (Antos et al. in LNCS/LNAI, vol. 4005, pp. 574–588, 2006).
Cite this article
Antos, A., Szepesvári, C. & Munos, R. Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. Mach Learn 71, 89–129 (2008). https://doi.org/10.1007/s10994-007-5038-2