Tuning Bandit Algorithms in Stochastic Environments

Audibert, Jean-Yves; Munos, Rémi; Szepesvári, Csaba

doi:10.1007/978-3-540-75225-7_15

Jean-Yves Audibert⁴,
Rémi Munos⁵ &
Csaba Szepesvári⁶

Part of the book series: Lecture Notes in Computer Science ((LNAI,volume 4754))

Included in the following conference series:

International Conference on Algorithmic Learning Theory

2475 Accesses
48 Citations

Abstract

Algorithms based on upper-confidence bounds for balancing exploration and exploitation are gaining popularity since they are easy to implement, efficient and effective. In this paper we consider a variant of the basic algorithm for the stochastic, multi-armed bandit problem that takes into account the empirical variance of the different arms. In earlier experimental works, such algorithms were found to outperform the competing algorithms. The purpose of this paper is to provide a theoretical explanation of these findings and provide theoretical guidelines for the tuning of the parameters of these algorithms. For this we analyze the expected regret and for the first time the concentration of the regret. The analysis of the expected regret shows that variance estimates can be especially advantageous when the payoffs of suboptimal arms have low variance. The risk analysis, rather unexpectedly, reveals that except for some very special bandit problems, the regret, for upper confidence bounds based algorithms with standard bias sequences, concentrates only at a polynomial rate. Hence, although these algorithms achieve logarithmic expected regret rates, they seem less attractive when the risk of suffering much worse than logarithmic regret is also taken into account.

Access provided by Autonomous University of Puebla. Download to read the full chapter text

Chapter PDF

Meta-learning of Exploration/Exploitation Strategies: The Multi-armed Bandit Case

Refined Algorithms for Infinitely Many-Armed Bandits with Deterministic Rewards

Sub-sampling for Multi-armed Bandits

Keywords

These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

References

Agrawal, R.: Sample mean based index policies with O(logn) regret for the multi-armed bandit problem. Advances in Applied Probability 27, 1054–1078 (1995)
MathSciNet MATH Google Scholar
Audibert, J.-Y., Munos, R., Szepesvári,Cs.: Variance estimates and exploration function in multi-armed bandit. Research report 07-31, Certis - Ecole des Ponts (2007), http://cermics.enpc.fr/~audibert/RR0731.pdf
Auer, P., Cesa-Bianchi, N., Fischer, P.: Finite time analysis of the multiarmed bandit problem. Machine Learning 47(2-3), 235–256 (2002)
Article MATH Google Scholar
Auer, P., Cesa-Bianchi, N., Shawe-Taylor, J.: Exploration versus exploitation challenge. In: 2nd PASCAL Challenges Workshop. Pascal Network (2006)
Google Scholar
Gittins, J.C.: Multi-armed Bandit Allocation Indices. In: Wiley-Interscience series in systems and optimization. Wiley, Chichester (1989)
Google Scholar
Lai, T.L., Robbins, H.: Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics 6, 4–22 (1985)
Article MathSciNet MATH Google Scholar
Lai, T.L., Yakowitz, S.: Machine learning and nonparametric bandit theory. IEEE Transactions on Automatic Control 40, 1199–1209 (1995)
Article MathSciNet MATH Google Scholar
Robbins, H.: Some aspects of the sequential design of experiments. Bulletin of the American Mathematics Society 58, 527–535 (1952)
Article MathSciNet MATH Google Scholar
Thompson, W.R.: On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25, 285–294 (1933)
Article MATH Google Scholar

Download references

Author information

Authors and Affiliations

CERTIS - Ecole des Ponts, 19, rue Alfred Nobel - Cité Descartes, 77455 Marne-la-Vallée, France
Jean-Yves Audibert
INRIA Futurs Lille, SequeL project, 40 avenue Halley, 59650 Villeneuve d’Ascq, France
Rémi Munos
University of Alberta, Edmonton T6G 2E8, Canada
Csaba Szepesvári

Authors

Jean-Yves Audibert
View author publications
You can also search for this author in PubMed Google Scholar
Rémi Munos
View author publications
You can also search for this author in PubMed Google Scholar
Csaba Szepesvári
View author publications
You can also search for this author in PubMed Google Scholar

Editor information

Editors and Affiliations

RSISE @ ANU and SML @ NICTA, Canberra,, ACT, 0200, Australia
Marcus Hutter
Columbia University, NY, P.O. Box, New York, USA
Rocco A. Servedio
Graduate School of Information Sciences, Tohoku University,, Sendai 980-8579, Japan
Eiji Takimoto

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Audibert, JY., Munos, R., Szepesvári, C. (2007). Tuning Bandit Algorithms in Stochastic Environments. In: Hutter, M., Servedio, R.A., Takimoto, E. (eds) Algorithmic Learning Theory. ALT 2007. Lecture Notes in Computer Science(), vol 4754. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-75225-7_15

Download citation

DOI: https://doi.org/10.1007/978-3-540-75225-7_15
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-75224-0
Online ISBN: 978-3-540-75225-7
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics

Tuning Bandit Algorithms in Stochastic Environments

Abstract

Chapter PDF

Similar content being viewed by others

Meta-learning of Exploration/Exploitation Strategies: The Multi-armed Bandit Case

Refined Algorithms for Infinitely Many-Armed Bandits with Deterministic Rewards

Sub-sampling for Multi-armed Bandits

Keywords

References

Author information

Authors and Affiliations

Editor information

Editors and Affiliations

Rights and permissions

Copyright information

About this paper

Cite this paper

Download citation

Publish with us

Navigation

Tuning Bandit Algorithms in Stochastic Environments

Abstract

Chapter PDF

Similar content being viewed by others

Meta-learning of Exploration/Exploitation Strategies: The Multi-armed Bandit Case

Refined Algorithms for Infinitely Many-Armed Bandits with Deterministic Rewards

Sub-sampling for Multi-armed Bandits

Keywords

References

Author information

Authors and Affiliations

Editor information

Editors and Affiliations

Rights and permissions

Copyright information

About this paper

Cite this paper

Download citation

Share this paper

Publish with us

Search

Navigation