Abstract
We study on-line decision problems where the set of actions available to the decision algorithm varies over time. With a few notable exceptions, such problems have remained largely unaddressed in the literature, despite their applicability to many practical problems. Departing from previous work on this “Sleeping Experts” problem, we compare algorithms against the payoff obtained by the best ordering of the actions, a natural benchmark for this type of problem. We study both the full-information (best expert) and partial-information (multi-armed bandit) settings and consider both stochastic and adversarial reward models. For all settings we give algorithms achieving (almost) information-theoretically optimal regret bounds (up to a constant or a sub-logarithmic factor) with respect to the best-ordering benchmark.
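To make the best-ordering benchmark concrete, here is an illustrative sketch (not code from the paper): a fixed ordering of the actions always plays its highest-ranked action among those awake in the current round, and the benchmark is the total payoff of the best such ordering in hindsight. The brute-force search below is only for intuition; the function name and the toy instance are our own.

```python
# Illustrative sketch of the best-ordering benchmark for sleeping experts.
# In each round only a subset of actions is awake; an ordering sigma plays
# its highest-ranked awake action. The benchmark maximizes total payoff
# over all orderings (brute force, so only sensible for a few actions).
from itertools import permutations

def best_ordering_payoff(availability, rewards):
    """availability[t]: set of awake actions in round t.
    rewards[t][a]: reward of action a in round t.
    Returns the maximum total payoff over all orderings."""
    actions = sorted({a for avail in availability for a in avail})
    best = float("-inf")
    for sigma in permutations(actions):
        total = 0.0
        for avail, r in zip(availability, rewards):
            # play the first action in sigma that is awake this round
            chosen = next(a for a in sigma if a in avail)
            total += r[chosen]
        best = max(best, total)
    return best

# toy instance: 3 actions, 3 rounds
availability = [{0, 1}, {1, 2}, {0, 2}]
rewards = [{0: 0.5, 1: 0.9}, {1: 0.2, 2: 0.8}, {0: 0.4, 2: 0.1}]
print(best_ordering_payoff(availability, rewards))  # ordering (2, 1, 0) earns 1.8
```

Note that the benchmark is weaker than the best sequence of awake actions but stronger than any single fixed action, which is what makes it natural when the action set varies.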
Additional information
Editors: Sham Kakade and Ping Li.
Robert Kleinberg was supported by NSF awards CCF-0643934, IIS-0905467, and AF-0910940, a Microsoft Research New Faculty Fellowship, and an Alfred P. Sloan Foundation Fellowship.
Alexandru Niculescu-Mizil was supported by NSF awards 0347318, 0412930, 0427914, and 0612031.
Yogeshwer Sharma was supported by NSF grants CCF-0514628 and CCF-0830519.
Cite this article
Kleinberg, R., Niculescu-Mizil, A. & Sharma, Y. Regret bounds for sleeping experts and bandits. Mach Learn 80, 245–272 (2010). https://doi.org/10.1007/s10994-010-5178-7