Abstract
Algorithms based on upper-confidence bounds for balancing exploration and exploitation are gaining popularity since they are easy to implement, efficient and effective. In this paper we consider a variant of the basic algorithm for the stochastic, multi-armed bandit problem that takes into account the empirical variance of the different arms. In earlier experimental works, such algorithms were found to outperform the competing algorithms. The purpose of this paper is to provide a theoretical explanation of these findings and provide theoretical guidelines for the tuning of the parameters of these algorithms. For this we analyze the expected regret and for the first time the concentration of the regret. The analysis of the expected regret shows that variance estimates can be especially advantageous when the payoffs of suboptimal arms have low variance. The risk analysis, rather unexpectedly, reveals that except for some very special bandit problems, the regret, for upper confidence bounds based algorithms with standard bias sequences, concentrates only at a polynomial rate. Hence, although these algorithms achieve logarithmic expected regret rates, they seem less attractive when the risk of suffering much worse than logarithmic regret is also taken into account.
Access provided by Autonomous University of Puebla. Download to read the full chapter text
Chapter PDF
Similar content being viewed by others
Keywords
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.
References
Agrawal, R.: Sample mean based index policies with O(logn) regret for the multi-armed bandit problem. Advances in Applied Probability 27, 1054–1078 (1995)
Audibert, J.-Y., Munos, R., Szepesvári,Cs.: Variance estimates and exploration function in multi-armed bandit. Research report 07-31, Certis - Ecole des Ponts (2007), http://cermics.enpc.fr/~audibert/RR0731.pdf
Auer, P., Cesa-Bianchi, N., Fischer, P.: Finite time analysis of the multiarmed bandit problem. Machine Learning 47(2-3), 235–256 (2002)
Auer, P., Cesa-Bianchi, N., Shawe-Taylor, J.: Exploration versus exploitation challenge. In: 2nd PASCAL Challenges Workshop. Pascal Network (2006)
Gittins, J.C.: Multi-armed Bandit Allocation Indices. In: Wiley-Interscience series in systems and optimization. Wiley, Chichester (1989)
Lai, T.L., Robbins, H.: Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics 6, 4–22 (1985)
Lai, T.L., Yakowitz, S.: Machine learning and nonparametric bandit theory. IEEE Transactions on Automatic Control 40, 1199–1209 (1995)
Robbins, H.: Some aspects of the sequential design of experiments. Bulletin of the American Mathematics Society 58, 527–535 (1952)
Thompson, W.R.: On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25, 285–294 (1933)
Author information
Authors and Affiliations
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2007 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Audibert, JY., Munos, R., Szepesvári, C. (2007). Tuning Bandit Algorithms in Stochastic Environments. In: Hutter, M., Servedio, R.A., Takimoto, E. (eds) Algorithmic Learning Theory. ALT 2007. Lecture Notes in Computer Science(), vol 4754. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-75225-7_15
Download citation
DOI: https://doi.org/10.1007/978-3-540-75225-7_15
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-75224-0
Online ISBN: 978-3-540-75225-7
eBook Packages: Computer ScienceComputer Science (R0)