Abstract
We present new algorithms for reinforcement learning and prove that they have polynomial bounds on the resources required to achieve near-optimal return in general Markov decision processes. After observing that the number of actions required to approach the optimal return is lower bounded by the mixing time T of the optimal policy (in the undiscounted case) or by the horizon time T (in the discounted case), we give algorithms requiring a number of actions and total computation time that are only polynomial in T and in the number of states and actions, for both the undiscounted and discounted cases. An interesting aspect of our algorithms is their explicit handling of the exploration-exploitation trade-off.
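The explicit handling of the exploration-exploitation trade-off can be pictured with a short sketch. Below is a minimal, hypothetical Python illustration of an E^3-style decision rule: state-action pairs are tried until a state becomes "known", unknown states trigger balanced-wandering exploration, and known states are exploited. For brevity the exploitation step here is one-step greedy over empirical mean rewards, whereas the paper's algorithm plans multi-step policies in an induced known-state MDP; the constant KNOWN_THRESHOLD, the class names, and the toy MDP are assumptions for illustration only, not the paper's notation.

```python
import random
from collections import defaultdict

KNOWN_THRESHOLD = 10  # assumed visit count after which a state-action pair is "known"

class ExploreExploitSketch:
    """Tabular learner that explicitly separates exploration from exploitation."""

    def __init__(self, n_states, n_actions):
        self.n_actions = n_actions
        self.counts = defaultdict(int)        # (s, a) -> number of visits
        self.reward_sum = defaultdict(float)  # (s, a) -> summed observed reward

    def known(self, s):
        # A state is "known" once every action from it has been tried enough times.
        return all(self.counts[(s, a)] >= KNOWN_THRESHOLD
                   for a in range(self.n_actions))

    def act(self, s):
        if not self.known(s):
            # Explore: balanced wandering -- take the least-tried action from s.
            return min(range(self.n_actions), key=lambda a: self.counts[(s, a)])
        # Exploit: one-step greedy on empirical mean reward (a stand-in for
        # the paper's planning step in the induced known-state MDP).
        return max(range(self.n_actions),
                   key=lambda a: self.reward_sum[(s, a)] / self.counts[(s, a)])

    def update(self, s, a, r):
        self.counts[(s, a)] += 1
        self.reward_sum[(s, a)] += r

# Toy usage on a random 5-state, 2-action MDP (hypothetical, for illustration).
random.seed(0)
N_S, N_A = 5, 2
P = {}
for s in range(N_S):
    for a in range(N_A):
        w = [random.random() for _ in range(N_S)]
        z = sum(w)
        P[(s, a)] = [x / z for x in w]  # normalized transition distribution
R = {(s, a): random.random() for s in range(N_S) for a in range(N_A)}

agent = ExploreExploitSketch(N_S, N_A)
s = 0
for _ in range(2000):
    a = agent.act(s)
    agent.update(s, a, R[(s, a)])
    s = random.choices(range(N_S), weights=P[(s, a)])[0]
```

The split inside act mirrors the abstract's point: at every step the learner either exploits the well-estimated part of the model or provably gains new information about an unknown state, which is what drives the polynomial bounds.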
Cite this article
Kearns, M., Singh, S. Near-Optimal Reinforcement Learning in Polynomial Time. Machine Learning 49, 209–232 (2002). https://doi.org/10.1023/A:1017984413808