Abstract
We provide an analytical comparison between discounted and average reward temporal-difference (TD) learning with linearly parameterized approximations. We first consider the asymptotic behavior of the two algorithms. We show that as the discount factor approaches 1, the value function produced by discounted TD approaches the differential value function generated by average reward TD. We further argue that if the constant function—which is typically used as one of the basis functions in discounted TD—is appropriately scaled, the transient behaviors of the two algorithms are also similar. Our analysis suggests that the computational advantages of average reward TD that have been observed in some prior empirical work may have been caused by inappropriate basis function scaling rather than fundamental differences in problem formulations or algorithms.
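To make the comparison concrete, here is a minimal sketch of the two updates the abstract contrasts: discounted TD(0) and average-reward TD(0) with linearly parameterized value functions. The 3-state Markov reward process, the features, and the step sizes are illustrative assumptions, not taken from the paper; the comment on scaling the constant basis function paraphrases the abstract's suggestion.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 3-state Markov reward process (illustrative, not from the paper).
P = np.array([[0.5, 0.5, 0.0],
              [0.1, 0.6, 0.3],
              [0.3, 0.2, 0.5]])
r = np.array([1.0, 0.0, 2.0])


def features(s, const_scale=1.0):
    # Constant basis function plus one state-dependent feature. The analysis
    # suggests scaling the constant feature (e.g., by 1/(1 - gamma)) so that the
    # transient behavior of discounted TD matches that of average-reward TD.
    return np.array([const_scale, float(s)])


def discounted_td0(gamma, const_scale=1.0, steps=50_000, alpha=0.01):
    """Discounted TD(0): delta = r + gamma * phi' @ theta - phi @ theta."""
    theta, s = np.zeros(2), 0
    for _ in range(steps):
        s_next = rng.choice(3, p=P[s])
        phi = features(s, const_scale)
        phi_next = features(s_next, const_scale)
        delta = r[s] + gamma * (phi_next @ theta) - phi @ theta
        theta += alpha * delta * phi
        s = s_next
    return theta


def average_reward_td0(steps=50_000, alpha=0.01, beta=0.01):
    """Average-reward TD(0): delta = r - mu + phi' @ theta - phi @ theta.

    Only the state-dependent feature is used here, since the differential
    value function is defined only up to an additive constant.
    """
    theta, mu, s = np.zeros(1), 0.0, 0
    for _ in range(steps):
        s_next = rng.choice(3, p=P[s])
        phi = features(s)[1:]
        phi_next = features(s_next)[1:]
        delta = r[s] - mu + phi_next @ theta - phi @ theta
        theta += alpha * delta * phi
        mu += beta * (r[s] - mu)  # running estimate of the average reward
        s = s_next
    return theta, mu
```

As the discount factor approaches 1, the discounted iterate's component along the state-dependent feature should approach the average-reward iterate's, in line with the asymptotic result described above.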
Cite this article
Tsitsiklis, J.N., Van Roy, B. On Average Versus Discounted Reward Temporal-Difference Learning. Machine Learning 49, 179–191 (2002). https://doi.org/10.1023/A:1017980312899