Abstract
In this analytical study we derive the optimal unbiased value estimator (MVU) and compare its statistical risk to that of three well-known value estimators: temporal difference learning (TD), Monte Carlo estimation (MC), and least-squares temporal difference learning (LSTD). We demonstrate that LSTD is equivalent to the MVU if the Markov reward process (MRP) is acyclic, and we show that the two differ for most cyclic MRPs, as LSTD is then typically biased. More generally, we show that estimators that fulfill the Bellman equation can only be unbiased for special cyclic MRPs. The reason is that at each state the bias is computed with respect to a different probability measure, and due to the strong coupling induced by the Bellman equation it is typically impossible for a set of value estimators to be unbiased with respect to each of these measures. Furthermore, we derive relations of the MVU to MC and TD. The most important of these is the equivalence of MC to the MVU and to LSTD for undiscounted MRPs in which MC has access to the same information; in the discounted case this equivalence no longer holds. For TD we show that it is essentially unbiased for acyclic MRPs and biased for cyclic MRPs. We also order the estimators according to their risk and present counterexamples showing that no general ordering exists between the MVU and LSTD, between MC and LSTD, or between TD and MC. The theoretical results are supported by examples and an empirical evaluation.
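The equivalence of MC and tabular LSTD in the undiscounted acyclic setting can be illustrated with a minimal sketch. The following is not taken from the paper: the two-state chain MRP, its Gaussian rewards, and all parameter values are illustrative assumptions. MC averages full returns, while tabular LSTD solves the empirical Bellman equations; because every episode visits both states, the two estimators see the same information and coincide.

```python
import random

random.seed(0)
gamma = 1.0  # undiscounted

# Hypothetical acyclic chain MRP: s0 -> s1 -> terminal,
# with Gaussian transition rewards (illustrative choice, not from the paper).
def episode():
    r0 = random.gauss(1.0, 1.0)   # reward for s0 -> s1
    r1 = random.gauss(-0.5, 1.0)  # reward for s1 -> terminal
    return r0, r1

episodes = [episode() for _ in range(1000)]
n = len(episodes)

# Monte Carlo: average the full return observed from the start state.
mc_v0 = sum(r0 + r1 for r0, r1 in episodes) / n

# Tabular LSTD: solve the empirical Bellman equations
#   V(s1) = mean(r1),   V(s0) = mean(r0) + gamma * V(s1)
lstd_v1 = sum(r1 for _, r1 in episodes) / n
lstd_v0 = sum(r0 for r0, _ in episodes) / n + gamma * lstd_v1

# With gamma = 1 and identical information, the estimates agree
# (up to floating-point rounding).
print(abs(mc_v0 - lstd_v0) < 1e-9)
```

With discounting (gamma < 1) the two empirical solutions generally differ, consistent with the abstract's remark that the equivalence does not carry over to the discounted case.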
Editor: P. Tadepalli.
Grünewälder, S., Obermayer, K. The optimal unbiased value estimator and its relation to LSTD, TD and MC. Mach Learn 83, 289–330 (2011). https://doi.org/10.1007/s10994-010-5220-9