Abstract
We present new algorithms for reinforcement learning and prove that they have polynomial bounds on the resources required to achieve near-optimal return in general Markov decision processes. After observing that the number of actions required to approach the optimal return is lower bounded by the mixing time T of the optimal policy (in the undiscounted case) or by the horizon time T (in the discounted case), we give algorithms requiring a number of actions and total computation time that are only polynomial in T and in the number of states and actions, for both the undiscounted and discounted cases. An interesting aspect of our algorithms is their explicit handling of the exploration-exploitation trade-off.
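The explicit handling of the exploration-exploitation trade-off can be pictured with a short sketch. Below is a minimal, hypothetical Python illustration of an E^3-style decision rule: state-action pairs are tried until a state becomes "known", unknown states trigger balanced-wandering exploration, and known states are exploited. For brevity the exploitation step here is one-step greedy over empirical mean rewards, whereas the paper's algorithm plans multi-step policies in an induced known-state MDP; the constant KNOWN_THRESHOLD, the class names, and the toy MDP are assumptions for illustration only, not the paper's notation.

```python
import random
from collections import defaultdict

KNOWN_THRESHOLD = 10  # assumed visit count after which a state-action pair is "known"

class ExploreExploitSketch:
    """Tabular learner that explicitly separates exploration from exploitation."""

    def __init__(self, n_states, n_actions):
        self.n_actions = n_actions
        self.counts = defaultdict(int)        # (s, a) -> number of visits
        self.reward_sum = defaultdict(float)  # (s, a) -> summed observed reward

    def known(self, s):
        # A state is "known" once every action from it has been tried enough times.
        return all(self.counts[(s, a)] >= KNOWN_THRESHOLD
                   for a in range(self.n_actions))

    def act(self, s):
        if not self.known(s):
            # Explore: balanced wandering -- take the least-tried action from s.
            return min(range(self.n_actions), key=lambda a: self.counts[(s, a)])
        # Exploit: one-step greedy on empirical mean reward (a stand-in for
        # the paper's planning step in the induced known-state MDP).
        return max(range(self.n_actions),
                   key=lambda a: self.reward_sum[(s, a)] / self.counts[(s, a)])

    def update(self, s, a, r):
        self.counts[(s, a)] += 1
        self.reward_sum[(s, a)] += r

# Toy usage on a random 5-state, 2-action MDP (hypothetical, for illustration).
random.seed(0)
N_S, N_A = 5, 2
P = {}
for s in range(N_S):
    for a in range(N_A):
        w = [random.random() for _ in range(N_S)]
        z = sum(w)
        P[(s, a)] = [x / z for x in w]  # normalized transition distribution
R = {(s, a): random.random() for s in range(N_S) for a in range(N_A)}

agent = ExploreExploitSketch(N_S, N_A)
s = 0
for _ in range(2000):
    a = agent.act(s)
    agent.update(s, a, R[(s, a)])
    s = random.choices(range(N_S), weights=P[(s, a)])[0]
```

The split inside act mirrors the abstract's point: at every step the learner either exploits the well-estimated part of the model or provably gains new information about an unknown state, which is what drives the polynomial bounds.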
Cite this article
Kearns, M., Singh, S. Near-Optimal Reinforcement Learning in Polynomial Time. Machine Learning 49, 209–232 (2002). https://doi.org/10.1023/A:1017984413808