Abstract
TD(λ) is a popular family of algorithms for approximate policy evaluation in large MDPs. TD(λ) works by incrementally updating the value function after each observed transition. It has two major drawbacks: it may make inefficient use of data, and it requires the user to manually tune a stepsize schedule for good performance. For the case of linear value function approximations and λ = 0, the Least-Squares TD (LSTD) algorithm of Bradtke and Barto (1996, Machine Learning, 22:1–3, 33–57) eliminates all stepsize parameters and improves data efficiency.
This paper updates Bradtke and Barto's work in three significant ways. First, it presents a simpler derivation of the LSTD algorithm. Second, it generalizes from λ = 0 to arbitrary values of λ; at the extreme of λ = 1, the resulting new algorithm is shown to be a practical, incremental formulation of supervised linear regression. Third, it presents a novel and intuitive interpretation of LSTD as a model-based reinforcement learning technique.
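To make the algorithm concrete, the following is a minimal sketch of LSTD(λ) for linear policy evaluation, written in Python with NumPy. It is not code from the paper: the function name `lstd_lambda`, the trajectory format, and the small ridge term `reg` are illustrative assumptions. The accumulated statistics A and b, the eligibility trace z, and the final linear solve follow the standard LSTD(λ) formulation.

```python
import numpy as np

def lstd_lambda(trajectories, phi, n_features, gamma=1.0, lam=0.0, reg=1e-6):
    """Sketch of LSTD(lambda) for linear policy evaluation.

    trajectories: iterable of episodes, each a list of (s, r, s_next, done)
    phi:          feature map, state -> np.ndarray of length n_features
    """
    A = np.zeros((n_features, n_features))
    b = np.zeros(n_features)
    for episode in trajectories:
        z = np.zeros(n_features)                 # eligibility trace
        for (s, r, s_next, done) in episode:
            f = phi(s)
            f_next = np.zeros(n_features) if done else phi(s_next)
            z = gamma * lam * z + f              # decay and accumulate trace
            A += np.outer(z, f - gamma * f_next) # sufficient statistics
            b += z * r
    # Solve A w = b for the value-function weights; the ridge term
    # guards against a singular A early in learning (an assumption here,
    # not part of the original algorithm).
    w = np.linalg.solve(A + reg * np.eye(n_features), b)
    return w
```

With lam = 0 this reduces to Bradtke and Barto's original LSTD; at lam = 1 the accumulated statistics correspond to the incremental supervised linear-regression view described in the abstract. Note that, unlike TD(λ), no stepsize appears anywhere.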
References
Atkeson, C. G., & Santamaria, J. C. (1997). A comparison of direct and model-based reinforcement learning. In International Conference on Robotics and Automation.
Bertsekas, D., & Tsitsiklis, J. (1996). Neuro-dynamic programming. Belmont, MA: Athena Scientific.
Boyan, J. A. (1998). Learning evaluation functions for global optimization. Ph.D. Thesis, Carnegie Mellon University.
Boyan, J. A., & Moore, A. W. (1998) Learning evaluation functions for global optimization and Boolean satisfiability. In Proceedings of the Fifteenth National Conference on Artificial Intelligence (AAAI).
Bradtke, S. J., & Barto, A. G. (1996). Linear least-squares algorithms for temporal difference learning. Machine Learning, 22:1–3, 33–57.
Lin, L.-J. (1993). Reinforcement learning for robots using neural networks. Ph.D. Thesis, Carnegie Mellon University.
Moore, A. W., & Atkeson, C. G. (1993). Prioritized sweeping: Reinforcement learning with less data and less time. Machine Learning, 13, 103–130.
Press, W. H., Teukolsky, S. A., Vetterling, W. T., & Flannery, B. P. (1992). Numerical recipes in C: The art of scientific computing. (2nd ed.), Cambridge: Cambridge University Press.
Singh, S., & Bertsekas, D. (1997). Reinforcement learning for dynamic channel allocation in cellular telephone systems. In M. C. Mozer, M. I. Jordan, & T. Petsche (Eds.), NIPS-9 (p. 974). Cambridge, MA: The MIT Press.
Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3, 9–44.
Sutton, R. S. (1990). Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In Proceedings of the Seventh International Conference on Machine Learning. San Mateo, CA: Morgan Kaufmann.
Sutton, R. S. (1992). Gain adaptation beats least squares. In Proceedings of the 7th Yale Workshop on Adaptive and Learning Systems (pp. 161–166).
Sutton, R. S. (1995). TD models: Modeling the world at a mixture of time scales. In Machine Learning: Proceedings of the 12th International Conference (pp. 531–539). San Mateo, CA: Morgan Kaufmann.
Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction. Cambridge, MA: MIT Press.
Tesauro, G. (1994). TD-Gammon, a self-teaching backgammon program, achieves master-level play. Neural Computation, 6:2, 215–219.
Tsitsiklis, J. N., & Van Roy, B. (1997). An analysis of temporal-difference learning with function approximation. IEEE Trans. Auto. Control, 42:5, 674–690.
Cite this article
Boyan, J.A. Technical Update: Least-Squares Temporal Difference Learning. Machine Learning 49, 233–246 (2002). https://doi.org/10.1023/A:1017936530646