Abstract
TD(λ) is a popular family of algorithms for approximate policy evaluation in large MDPs. TD(λ) works by incrementally updating the value function after each observed transition. It has two major drawbacks: it may make inefficient use of data, and it requires the user to manually tune a stepsize schedule for good performance. For the case of linear value function approximations and λ = 0, the Least-Squares TD (LSTD) algorithm of Bradtke and Barto (1996, Machine Learning, 22:1–3, 33–57) eliminates all stepsize parameters and improves data efficiency.
This paper updates Bradtke and Barto's work in three significant ways. First, it presents a simpler derivation of the LSTD algorithm. Second, it generalizes from λ = 0 to arbitrary values of λ; at the extreme of λ = 1, the resulting new algorithm is shown to be a practical, incremental formulation of supervised linear regression. Third, it presents a novel and intuitive interpretation of LSTD as a model-based reinforcement learning technique.
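To make the algorithm concrete, the following is a minimal sketch of LSTD(λ) for linear policy evaluation, written in Python with NumPy. It is not code from the paper: the function name `lstd_lambda`, the trajectory format, and the small ridge term `reg` are illustrative assumptions. The accumulated statistics A and b, the eligibility trace z, and the final linear solve follow the standard LSTD(λ) formulation.

```python
import numpy as np

def lstd_lambda(trajectories, phi, n_features, gamma=1.0, lam=0.0, reg=1e-6):
    """Sketch of LSTD(lambda) for linear policy evaluation.

    trajectories: iterable of episodes, each a list of (s, r, s_next, done)
    phi:          feature map, state -> np.ndarray of length n_features
    """
    A = np.zeros((n_features, n_features))
    b = np.zeros(n_features)
    for episode in trajectories:
        z = np.zeros(n_features)                 # eligibility trace
        for (s, r, s_next, done) in episode:
            f = phi(s)
            f_next = np.zeros(n_features) if done else phi(s_next)
            z = gamma * lam * z + f              # decay and accumulate trace
            A += np.outer(z, f - gamma * f_next) # sufficient statistics
            b += z * r
    # Solve A w = b for the value-function weights; the ridge term
    # guards against a singular A early in learning (an assumption here,
    # not part of the original algorithm).
    w = np.linalg.solve(A + reg * np.eye(n_features), b)
    return w
```

With lam = 0 this reduces to Bradtke and Barto's original LSTD; at lam = 1 the accumulated statistics correspond to the incremental supervised linear-regression view described in the abstract. Note that, unlike TD(λ), no stepsize appears anywhere.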
References
Atkeson, C. G., & Santamaria, J. C. (1997). A comparison of direct and model-based reinforcement learning. In International Conference on Robotics and Automation.
Bertsekas, D., & Tsitsiklis, J. (1996). Neuro-dynamic programming. Belmont, MA: Athena Scientific.
Boyan, J. A. (1998). Learning evaluation functions for global optimization. Ph.D. Thesis, Carnegie Mellon University.
Boyan, J. A., & Moore, A. W. (1998) Learning evaluation functions for global optimization and Boolean satisfiability. In Proceedings of the Fifteenth National Conference on Artificial Intelligence (AAAI).
Bradtke, S. J., & Barto, A. G. (1996). Linear least-squares algorithms for temporal difference learning. Machine Learning, 22:1–3, 33–57.
Lin, L.-J. (1993). Reinforcement learning for robots using neural networks. Ph.D. Thesis, Carnegie Mellon University.
Moore, A. W., & Atkeson, C. G. (1993). Prioritized sweeping: Reinforcement learning with less data and less time. Machine Learning, 13, 103–130.
Press, W. H., Teukolsky, S. A., Vetterling, W. T., & Flannery, B. P. (1992). Numerical recipes in C: The art of scientific computing. (2nd ed.), Cambridge: Cambridge University Press.
Singh, S., & Bertsekas, D. (1997). Reinforcement learning for dynamic channel allocation in cellular telephone systems. In M. C. Mozer, M. I. Jordan, & T. Petsche (Eds.), NIPS-9 (p. 974). Cambridge, MA: The MIT Press.
Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3, 9–44.
Sutton, R. S. (1990). Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In Proceedings of the Seventh International Conference on Machine Learning. San Mateo, CA: Morgan Kaufmann.
Sutton, R. S. (1992). Gain adaptation beats least squares. In Proceedings of the 7th Yale Workshop on Adaptive and Learning Systems (pp. 161–166).
Sutton, R. S. (1995). TD models: Modeling the world at a mixture of time scales. In Machine Learning: Proceedings of the 12th International Conference (pp. 531–539). San Mateo, CA: Morgan Kaufmann.
Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction. Cambridge, MA: MIT Press.
Tesauro, G. (1994). TD-Gammon, a self-teaching backgammon program, achieves master-level play. Neural Computation, 6:2, 215–219.
Tsitsiklis, J. N., & Van Roy, B. (1997). An analysis of temporal-difference learning with function approximation. IEEE Trans. Auto. Control, 42:5, 674–690.
Cite this article
Boyan, J.A. Technical Update: Least-Squares Temporal Difference Learning. Machine Learning 49, 233–246 (2002). https://doi.org/10.1023/A:1017936530646