Abstract
We consider parametric exponential families of dimension $K$ on the real line. We study a variant of boundary crossing probabilities coming from the multi-armed bandit literature, in the case when the real-valued distributions form an exponential family of dimension $K$. Formally, our result is a concentration inequality that bounds the probability that $B_\psi(\hat{\theta}_n, \theta^\star) \geq f(t/n)/n$, where $\theta^\star$ is the parameter of an unknown target distribution, $\hat{\theta}_n$ is the empirical parameter estimate built from $n$ observations, $\psi$ is the log-partition function of the exponential family and $B_\psi$ is the corresponding Bregman divergence. From the perspective of stochastic multi-armed bandits, we pay special attention to the case when the boundary function $f$ is logarithmic, as it enables the analysis of the regret of the state-of-the-art KL-ucb and KL-ucb+ strategies, whose analysis was left open in such generality. Indeed, previous results only held for the case $K = 1$, while we provide results for arbitrary finite dimension $K$, thus considerably extending the existing results. Perhaps surprisingly, we highlight that the proof techniques needed to achieve these strong results already existed three decades ago in the work of T. L. Lai, and were apparently forgotten in the bandit community. We provide a modern rewriting of these beautiful techniques, which we believe are useful beyond the application to stochastic multi-armed bandits.
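To make the quantities in the abstract concrete, here is a minimal sketch for one $K = 2$ family, the Gaussian with unknown mean and variance, using natural parameters with respect to $T(x) = (x, x^2)$. It assumes the standard Bregman convention $B_\psi(\theta, \theta') = \psi(\theta) - \psi(\theta') - \langle \theta - \theta', \nabla\psi(\theta') \rangle$, under which $B_\psi(\theta, \theta') = \mathrm{KL}(p_{\theta'} \,\|\, p_\theta)$; the article's own argument order may differ. All function names are hypothetical, $\hat{\theta}_n$ is taken to be the maximum-likelihood estimate, and the logarithmic boundary $f = \log$ is only an illustrative choice, not necessarily the boundary analyzed in the article.

```python
import math

def natural_params(mu, sigma2):
    # Natural parameters of N(mu, sigma2) for sufficient statistic T(x) = (x, x^2).
    return (mu / sigma2, -1.0 / (2.0 * sigma2))

def log_partition(theta):
    # psi(theta) = -theta1^2 / (4 theta2) - 0.5 log(-2 theta2), base measure constant omitted.
    t1, t2 = theta
    return -t1 * t1 / (4.0 * t2) - 0.5 * math.log(-2.0 * t2)

def grad_log_partition(theta):
    # grad psi(theta) = (E[x], E[x^2]) = (mu, mu^2 + sigma^2).
    t1, t2 = theta
    return (-t1 / (2.0 * t2), t1 * t1 / (4.0 * t2 * t2) - 1.0 / (2.0 * t2))

def bregman(theta, theta_prime):
    # B_psi(theta, theta') = psi(theta) - psi(theta') - <theta - theta', grad psi(theta')>.
    g = grad_log_partition(theta_prime)
    inner = sum((a - b) * gi for a, b, gi in zip(theta, theta_prime, g))
    return log_partition(theta) - log_partition(theta_prime) - inner

def gaussian_kl(mu1, s1, mu2, s2):
    # Closed-form KL(N(mu1, s1) || N(mu2, s2)), where s1 and s2 are variances.
    return 0.5 * math.log(s2 / s1) + (s1 + (mu1 - mu2) ** 2) / (2.0 * s2) - 0.5

def empirical_params(xs):
    # Gaussian MLE: sample mean and biased sample variance, mapped to natural parameters.
    n = len(xs)
    m = sum(xs) / n
    v = sum((x - m) ** 2 for x in xs) / n
    return natural_params(m, v)

def crosses_boundary(xs, theta_star, t, f=math.log):
    # Event B_psi(theta_hat_n, theta_star) >= f(t / n) / n, with a hypothetical default f = log.
    n = len(xs)
    return bregman(empirical_params(xs), theta_star) >= f(t / n) / n

if __name__ == "__main__":
    theta_star = natural_params(0.0, 1.0)
    sample = [0.0, 2.0]  # n = 2, empirical mean 1, empirical variance 1
    print(bregman(empirical_params(sample), theta_star))  # equals KL(N(0,1) || N(1,1)) = 0.5
    print(crosses_boundary(sample, theta_star, t=2))   # threshold log(1)/2 = 0
    print(crosses_boundary(sample, theta_star, t=20))  # threshold log(10)/2 > 0.5
```

The Gaussian was picked because both $\psi$ and the divergence have closed forms, so the identity between $B_\psi$ and the Kullback–Leibler divergence can be checked directly; any other exponential family only needs its own `log_partition` and `grad_log_partition`.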
References
R. Agrawal, “Sample Mean Based Index Policies by O(log n) Regret for the Multi-Armed Bandit Problem”, Adv. in Appl. Probab. 27 (4), 1054–1078 (1995).
J.-Y. Audibert, R. Munos, and Cs. Szepesvári, “Exploration-Exploitation Trade-Off Using Variance Estimates in Multi-Armed Bandits”, Theoret. Comp. Sci. 410 (19) (2009).
J.-Y. Audibert and S. Bubeck, “Regret Bounds and Minimax Policies under Partial Monitoring”, J. Machine Learning Res. 11, 2635–2686 (2010).
P. Auer, N. Cesa-Bianchi, and P. Fischer, “Finite-Time Analysis of the Multiarmed Bandit Problem”, Machine Learning 47 (2), 235–256 (2002).
L. M. Bregman, “The Relaxation Method of Finding the Common Point of Convex Sets and Its Application to the Solution of Problems in Convex Programming”, USSR Comput. Math. and Math. Phys. 7 (3), 200–217 (1967).
S. Bubeck, N. Cesa-Bianchi, et al., “Regret Analysis of Stochastic and Nonstochastic Multi-Armed Bandit Problems”, Foundations and Trends® in Machine Learning 5 (1), 1–122 (2012).
A. N. Burnetas and M. N. Katehakis, “Optimal Adaptive Policies for Markov Decision Processes”, Math. Oper. Res., 222–255 (1997).
O. Cappé, A. Garivier, O.-A. Maillard, R. Munos, and G. Stoltz, “Kullback–Leibler Upper Confidence Bounds for Optimal Sequential Allocation”, Ann. Statist. 41 (3), 1516–1541 (2013).
Y. S. Chow and H. Teicher, Probability Theory, 2nd ed. (Springer, 1988).
I. H. Dinwoodie, “Mesures dominantes et théorème de Sanov”, in Annales de l’IHP Probabilités et statistiques (1992), Vol. 28, pp. 365–373.
A. Garivier, P. Ménard, and G. Stoltz, “Explore First, Exploit Next: The True Shape of Regret in Bandit Problems”, arXiv preprint arXiv:1602.07182 (2016).
J. C. Gittins, “Bandit Processes and Dynamic Allocation Indices”, J. Roy. Statist. Soc., Ser. B 41 (2), 148–177 (1979).
J. Honda and A. Takemura, “An Asymptotically Optimal Bandit Algorithm for Bounded Support Models”, in Conf. Comput. Learning Theory, Ed. by T. Kalai and M. Mohri (Haifa, Israel, 2010).
T. L. Lai and H. Robbins, “Asymptotically Efficient Adaptive Allocation Rules”, Advances in Appl. Math. 6 (1), 4–22 (1985).
T. L. Lai, “Adaptive Treatment Allocation and the Multi-Armed Bandit Problem”, Ann. Statist., 1091–1114 (1987).
T. L. Lai, “Boundary Crossing Problems for Sample Means”, Ann. Probab., 375–396 (1988).
O.-A. Maillard, R. Munos, and G. Stoltz, “A Finite-Time Analysis of Multi-Armed Bandits Problems with Kullback–Leibler Divergences”, in Proc. 24th Conference on Learning Theory (Budapest, Hungary), 497–514 (2011).
H. Robbins, “Some Aspects of the Sequential Design of Experiments”, Bull. Amer. Math. Soc. 58 (5), 527–535 (1952).
H. Robbins, Herbert Robbins Selected Papers (Springer, 2012).
W. R. Thompson, “On the Likelihood That One Unknown Probability Exceeds Another in View of the Evidence of Two Samples”, Biometrika 25 (3/4), 285–294 (1933).
W. R. Thompson, “On a Criterion for the Rejection of Observations and the Distribution of the Ratio of Deviation to Sample Standard Deviation”, Ann. Math. Statist. 6 (4), 214–219 (1935).
A. Wald, “Sequential Tests of Statistical Hypotheses”, Ann. Math. Statist. 16 (2), 117–186 (1945).
Cite this article
Maillard, OA. Boundary Crossing Probabilities for General Exponential Families. Math. Meth. Stat. 27, 1–31 (2018). https://doi.org/10.3103/S1066530718010015