Abstract
We provide an analytical comparison between discounted and average reward temporal-difference (TD) learning with linearly parameterized approximations. We first consider the asymptotic behavior of the two algorithms. We show that as the discount factor approaches 1, the value function produced by discounted TD approaches the differential value function generated by average reward TD. We further argue that if the constant function—which is typically used as one of the basis functions in discounted TD—is appropriately scaled, the transient behaviors of the two algorithms are also similar. Our analysis suggests that the computational advantages of average reward TD that have been observed in some prior empirical work may have been caused by inappropriate basis function scaling rather than fundamental differences in problem formulations or algorithms.
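To make the comparison concrete, here is a minimal sketch of the two updates the abstract contrasts: discounted TD(0) and average-reward TD(0) with linearly parameterized value functions. The 3-state Markov reward process, the features, and the step sizes are illustrative assumptions, not taken from the paper; the comment on scaling the constant basis function paraphrases the abstract's suggestion.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 3-state Markov reward process (illustrative, not from the paper).
P = np.array([[0.5, 0.5, 0.0],
              [0.1, 0.6, 0.3],
              [0.3, 0.2, 0.5]])
r = np.array([1.0, 0.0, 2.0])


def features(s, const_scale=1.0):
    # Constant basis function plus one state-dependent feature. The analysis
    # suggests scaling the constant feature (e.g., by 1/(1 - gamma)) so that the
    # transient behavior of discounted TD matches that of average-reward TD.
    return np.array([const_scale, float(s)])


def discounted_td0(gamma, const_scale=1.0, steps=50_000, alpha=0.01):
    """Discounted TD(0): delta = r + gamma * phi' @ theta - phi @ theta."""
    theta, s = np.zeros(2), 0
    for _ in range(steps):
        s_next = rng.choice(3, p=P[s])
        phi = features(s, const_scale)
        phi_next = features(s_next, const_scale)
        delta = r[s] + gamma * (phi_next @ theta) - phi @ theta
        theta += alpha * delta * phi
        s = s_next
    return theta


def average_reward_td0(steps=50_000, alpha=0.01, beta=0.01):
    """Average-reward TD(0): delta = r - mu + phi' @ theta - phi @ theta.

    Only the state-dependent feature is used here, since the differential
    value function is defined only up to an additive constant.
    """
    theta, mu, s = np.zeros(1), 0.0, 0
    for _ in range(steps):
        s_next = rng.choice(3, p=P[s])
        phi = features(s)[1:]
        phi_next = features(s_next)[1:]
        delta = r[s] - mu + phi_next @ theta - phi @ theta
        theta += alpha * delta * phi
        mu += beta * (r[s] - mu)  # running estimate of the average reward
        s = s_next
    return theta, mu
```

As the discount factor approaches 1, the discounted iterate's component along the state-dependent feature should approach the average-reward iterate's, in line with the asymptotic result described above.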
Cite this article
Tsitsiklis, J.N., Van Roy, B. On Average Versus Discounted Reward Temporal-Difference Learning. Machine Learning 49, 179–191 (2002). https://doi.org/10.1023/A:1017980312899