Abstract
We study a bad arm existence checking problem in a stochastic K-armed bandit setting, in which a player’s task is to judge whether a positive arm exists or all the arms are negative among given K arms by drawing arms as few times as possible. Here, an arm is positive if the expected loss suffered by drawing it is at least a given threshold \(\theta _U\), and it is negative if that expected loss is less than another given threshold \(\theta _L(\le \theta _U)\). This problem is a formalization of diagnosis of disease or machine failure. An interesting structure of this problem is the asymmetry of positive and negative arms’ roles: finding one positive arm is enough to judge positive existence, while all the arms must be discriminated as negative to judge whole negativity. In the case with \(\varDelta =\theta _U-\theta _L>0\), we propose elimination algorithms with an arm selection policy (a policy to determine the next arm to draw) and a decision condition (a condition to conclude a positive arm’s existence or the drawn arm’s negativity) that utilize this asymmetric problem structure, and we prove their effectiveness theoretically and empirically.
1 Introduction
In the diagnosis of disease or machine failure, the test object is judged as “positive” if some anomaly is detected in at least one of many parts. When the purpose of the diagnosis is classification into one of two classes, “positive” or “negative”, the diagnosis can be terminated right after the first anomalous part has been detected. Thus, fast diagnosis is realized if one of the anomalous parts can be detected as fast as possible in the positive case.
Fast diagnosis is particularly important when the judgment is based on measurements using a costly or slow device. For example, a Raman spectral image is known to be useful for cancer diagnosis (Haka et al. 2009), but its acquisition time is 1–10 seconds per point (pixel) (Footnote 1), resulting in the order of hours or days per image (typically 10,000–40,000 pixels), so it is critical to measure only the points necessary for cancer diagnosis in order to achieve fast measurement. A Raman spectrum of each point is believed to be convertible to a cancer index, which indicates how likely the point is to be inside a cancer cell, and we can judge the existence of cancer cells from the existence of an area with a high cancer index.
The above cancer cell existence checking problem can be formulated as the problem of checking the existence of a grid with a high cancer index for a given area that is divided into grids. By regarding each grid as an arm, we formalize this problem as a loss version of a stochastic K-armed bandit problem in which the existence of positive arms is checked by drawing arms and suffering losses for the drawn arms. In our formulation, given an acceptable error rate \(0<\delta <1/2\) and two thresholds \(\theta _L\) and \(\theta _U\) with \(0<\theta _L\le \theta _U<1\), a player is required to, with probability at least \(1-\delta \), answer “positive” if positive arms exist and “negative” if all the arms are negative. Here, an arm is defined to be positive if its loss mean is at least \(\theta _U\), and negative if its loss mean is less than \(\theta _L\). We call player algorithms for this problem \((\theta _L,\theta _U,\delta )\)-BAEC (Bad Arm Existence Checking) algorithms. The objective of this research is to design a \((\theta _L,\theta _U,\delta )\)-BAEC algorithm that minimizes the number of arm draws, that is, an algorithm with the lowest sample complexity. We call the problem with this objective the Bad Arm Existence Checking Problem.
The bad arm existence checking problem is closely related to the thresholding bandit problem (Locatelli et al. 2016), which is a kind of pure-exploration problem such as the best arm identification problem (Even-Dar et al. 2006; Audibert et al. 2010). In the thresholding bandit problem, provided a threshold \(\theta \) and a required precision \(\epsilon >0\), the player’s task is to classify each arm into positive (its loss mean is at least \(\theta +\epsilon \)) or negative (its loss mean is less than \(\theta -\epsilon \)) by drawing a fixed number of samples, and his/her objective is to minimize the error probability, that is, the probability that positive (resp. negative) arms are wrongly classified into negative (resp. positive). Apart from the difference between fixed confidence (a constraint on the error probability to achieve) and fixed budget (a constraint on the allowable number of draws), positive and negative arms are treated symmetrically in the thresholding bandit problem while they are dealt with asymmetrically in our problem setting: judging that one positive arm exists is enough for a positive conclusion, though all the arms must be judged as negative for a negative conclusion. This asymmetry has also been considered in the good arm identification problem (Kano et al. 2017), and our problem can be seen as its specialized version, though their problem deals with the case with \(\theta _L=\theta _U\) only. In their setting, the player’s task is to output all the arms with above-threshold means with probability at least \(1-\delta \), and his/her objective is to minimize the number of drawn samples until \(\lambda \) arms are outputted as arms with above-threshold means for a given \(\lambda \). In the case with \(\lambda =1\), algorithms for their problem can be used to solve our existence checking problem. Their proposed algorithm, however, does not utilize the asymmetric problem structure. Kaufmann et al. (2018) studied the problem of sequential test for the lowest mean, which is basically the same problem as the bad arm existence checking problem except for the difference in the number of thresholds; they also treat the case with \(\theta _L=\theta _U\) only. They proposed an algorithm that utilizes the asymmetric problem structure: Murphy Sampling and an asymmetric stopping condition. Our approach to utilizing the asymmetric problem structure is different from theirs; our algorithm is an elimination algorithm, and asymmetric conditions are used not only to stop but also to eliminate the drawn arm.
We consider elimination algorithms \(\mathrm {BAEC}[\mathrm {ASP},\mathrm {LB},\mathrm {UB}]\) that are mainly composed of an arm-selection policy \({\mathop {\mathop {\hbox {arg max}}\limits }\nolimits _{i}}\mathrm {ASP}(t,i)\) and a decision condition \(\mathrm {LB}(t)\ge \theta _L\) or \(\mathrm {UB}(t)< \theta _U\) at time t. The arm-selection policy decides which arm is drawn at each time t based on loss samples obtained so far. The decision condition is used to conclude a positive arm’s existence if \(\mathrm {LB}(t)\ge \theta _L\) holds or the drawn arm’s negativity if \(\mathrm {UB}(t)< \theta _U\) holds. If the conclusion is a positive arm’s existence, then the algorithms stop immediately by returning “positive”. In the case that the conclusion is the drawn arm’s negativity, the arm is eliminated from the set of positive-arm candidates, which is composed of all the arms initially, and will not be drawn any more. If there remains no positive-arm candidate, then the algorithms stop by returning “negative”. To utilize our asymmetric problem structure, we propose a decision condition that uses \(\varDelta \)-dependent asymmetric confidence bounds \({\underline{\mu }}(t)\) and \({\overline{\mu }}(t)\) of estimated loss means as \(\mathrm {LB}(t)\) and \(\mathrm {UB}(t)\) in the case with \(\varDelta =\theta _U-\theta _L>0\). Here, asymmetric bounds mean that the width of the upper confidence interval is narrower than the width of the lower confidence interval. As an arm selection policy, we propose policy \(\mathrm {APT}_\mathrm {P}\) that is derived by modifying policy APT (Anytime Parameter-free Thresholding) (Locatelli et al. 2016) so as to favor arms with sample means larger than a single threshold \(\theta \) (rather than arms with sample means closer to \(\theta \) as the original APT does). Here, as the single threshold \(\theta \) used by policy \(\mathrm {APT}_\mathrm {P}\), we use not the center between \(\theta _L\) and \(\theta _U\) but a value closer to \(\theta _U\), reflecting the asymmetry of our confidence bounds.
By using \(\varDelta \)-dependent asymmetric confidence bounds as the decision condition, the worst-case bound on the number of samples for each arm is shown to be improved by \(\varOmega \left( \frac{1}{\varDelta ^2}\ln \frac{\sqrt{K}}{\varDelta ^2}\right) \) compared to the case using the conventional symmetric confidence bounds of the successive elimination algorithm (Even-Dar et al. 2006).
Our sample complexity results regarding the asymptotic behavior as \(\delta \rightarrow 0\) are summarized in Table 1. Reflecting the asymmetric structure of the problem, the existence of a positive arm makes the sample complexity lower. In the case with negative arms only, algorithm \(\mathrm {BAEC}[*,{\underline{\mu }},{\overline{\mu }}]\), our elimination algorithm with any arm selection policy and \(\varDelta \)-dependent asymmetric confidence bounds, is proved to achieve almost optimal sample complexity. In the case with positive arm existence, the upper bound on the expected number of samples for algorithm \(\mathrm {BAEC}[\mathrm {APT}_\mathrm {P},{\underline{\mu }},{\overline{\mu }}]\) is proved to be almost optimal when all the positive arms have the same loss mean, while that for algorithm \(\mathrm {BAEC}[\mathrm {UCB},{\underline{\mu }},{\overline{\mu }}]\), which uses \(\mathrm {UCB}\) (Upper Confidence Bound) (Auer et al. 2002) as the arm selection policy like HDoC (Hybrid algorithm for the Dilemma of Confidence) (Kano et al. 2017), is proved to be almost optimal when just one positive arm has the largest loss mean.
The effectiveness of our decision condition using the \(\varDelta \)-dependent asymmetric confidence bounds is demonstrated in simulation experiments. The algorithm using our \(\varDelta \)-dependent asymmetric confidence bounds stops drawing an arm about two times faster than the algorithm using the symmetric confidence bounds when its loss mean is around the center of the thresholds. Our algorithm \(\mathrm {BAEC}[\mathrm {APT}_\mathrm {P},{\underline{\mu }},{\overline{\mu }}]\) almost always stops faster than the algorithm \(\mathrm {BAEC}[\mathrm {UCB},{\underline{\mu }},{\overline{\mu }}]\), and our algorithm’s stopping time is shorter than or comparable to the stopping time of the algorithm \(\mathrm {BAEC}[\mathrm {ASP},{\underline{\mu }},{\overline{\mu }}]\) using LUCB (Lower and Upper Confidence Bounds) (Kalyanakrishnan et al. 2012), Thompson Sampling (Thompson 1933) and Murphy Sampling (Kaufmann et al. 2018) as \(\mathrm {ASP}\)s in almost all of our simulations, which use Bernoulli loss distributions with synthetically generated means and means generated from a real-world dataset.
2 Related work
The bad arm existence checking problem is a kind of multi-armed bandit problem, which is a classical problem studied by Thompson (1933) and Robbins (1952). A bandit problem is an online learning problem (Littlestone and Warmuth 1994) in which a player can obtain only partial information. In our study, the loss (or reward) distribution is assumed to be stochastic, which is easier to deal with than the adversarial setting (Auer et al. 2003). For the bandit problem, depending on problem objectives, two kinds of settings exist: the regret-minimization setting (Auer et al. 2002) and the pure-exploration setting (Bubeck et al. 2011). Most pure-exploration problems are best arm identification problems (Even-Dar et al. 2006; Audibert et al. 2010; Kalyanakrishnan et al. 2012; Kaufmann and Kalyanakrishnan 2013), which are the problems of identifying the arms with the maximum reward means. There are a fixed budget version and a fixed confidence version of best arm identification problems, and algorithms for the fixed confidence version have an arm selection policy and a stopping condition. Some best arm identification algorithms eliminate arms that are estimated not to be the best, and most of those algorithms use uniform sampling as an arm selection policy (Even-Dar et al. 2006; Bubeck et al. 2013). Non-elimination best arm identification algorithms use more sophisticated adaptive sampling as an arm selection policy (Gabillon et al. 2012; Kalyanakrishnan et al. 2012). A comparative analysis of elimination and non-elimination algorithms was performed by Kaufmann and Kalyanakrishnan (2013). Identification of the above-or-below-threshold arms (Locatelli et al. 2016; Kano et al. 2017; Kaufmann et al. 2018) is a variant of best arm identification, and among these algorithms, only ours and HDoC (Kano et al. 2017) are elimination algorithms using adaptive sampling. This combination is effective for checking the existence of above-or-below-threshold arms.
3 Preliminaries
For given thresholds \(0<\theta _L\le \theta _U<1\), consider the following bandit problem. Let \(K(\ge 2)\) be the number of arms, and at each time \(t=1,2,\dots \), a player draws arm \(i_t\in \{1,\dots ,K\}\). For \(i\in \{1,\dots ,K\}\), \(X_i(n)\in [0,1]\) denotes the loss for the nth draw of arm i, where \(X_i(1),X_i(2),\dots \) are a sequence of i.i.d. random variables generated according to a probability distribution \(\nu _i\) with mean \(\mu _i\in [0,1]\). We assume independence between \(\{X_i(t)\}_{t=1}^{\infty }\) and \(\{X_j(t)\}_{t=1}^{\infty }\) for any \(i,j\in \{1,\dots ,K\}\) with \(i\ne j\). For a distribution set \({\varvec{\nu }}=\{\nu _i\}\) of K arms, \({\mathbb {E}}_{\varvec{\nu }}\) and \({\mathbb {P}}_{\varvec{\nu }}\) denote the expectation and the probability under \(\varvec{\nu }\), respectively, and we omit the subscript \(\varvec{\nu }\) if it is trivial from the context. Without loss of generality, we can assume that \(\mu _1\ge \cdots \ge \mu _K\) and the player does not know this ordering. Let \(n_i(t)\) denote the number of draws of arm i right before the beginning of the round at time t. After the player observed the loss \(X_{i_t}(n_{i_t}(t)+1)\), he/she can choose stopping or continuing to play at time \(t+1\). Let T denote the stopping time.
The player’s objective is to check the existence of some positive arm(s) with as small a stopping time T as possible. Here, arm i is said to be positive if \(\mu _i\ge \theta _U\), negative if \(\mu _i<\theta _L\), and neutral otherwise. We consider a bad arm existence checking problem, which is the problem of developing algorithms that satisfy the following definition with as small a number of arm draws as possible.
Definition 1
Given (Footnote 2) \(0<\theta _L\le \theta _U<1\) and \(\delta \in (0,1/2)\), consider a game that repeats choosing one of K arms and observing its loss at each time t. A player algorithm for this game is said to be a \((\theta _L,\theta _U,\delta )\)-BAEC (Bad Arm Existence Checking) algorithm if it stops in a finite time outputting “positive” with probability at least \(1-\delta \) in the case that at least one arm is positive, and “negative” with probability at least \(1-\delta \) in the case that all the arms are negative.
Note that the definition of BAEC algorithms requires nothing when arm 1 is neutral. Our problem definition coincides with the highest-mean version problem of sequential testing for the lowest mean (Kaufmann et al. 2018) in the case with \(\theta _L=\theta _U\). Table 2 is the table of notations used throughout this paper.
4 Sample complexity lower bound
In this section, we derive a lower bound on the expected number of samples needed for a \((\theta _L,\theta _U,\delta )\)-BAEC algorithm. The derived lower bound is used to evaluate algorithm’s sample complexity upper bound in Sects. 5.3 and 6.2.
We let \(\mathrm {KL}(\nu ,\nu ')\) denote Kullback–Leibler divergence from distribution \(\nu '\) to \(\nu \) and define d(x, y) as
\[d(x,y)=x\ln \frac{x}{y}+(1-x)\ln \frac{1-x}{1-y}.\]
Note that \(\mathrm {KL}(\nu ,\nu ')=d(\mu _i,\mu '_i)\) holds if \(\nu \) and \(\nu '\) are Bernoulli distributions with means \(\mu _i\) and \(\mu '_i\), respectively.
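As a small illustration, the function d(x, y) can be evaluated numerically. The sketch below assumes d is the standard binary relative entropy \(x\ln (x/y)+(1-x)\ln ((1-x)/(1-y))\) with the convention \(0\ln 0=0\); the function name `bernoulli_kl` is hypothetical and not taken from the paper.

```python
import math


def bernoulli_kl(x: float, y: float) -> float:
    """Binary relative entropy d(x, y) between Bernoulli means x and y.

    Assumes the convention 0 * ln(0 / q) = 0 and returns infinity when
    x puts mass on an outcome that y excludes.
    """
    if not (0.0 <= x <= 1.0 and 0.0 <= y <= 1.0):
        raise ValueError("x and y must lie in [0, 1]")

    def term(p: float, q: float) -> float:
        if p == 0.0:
            return 0.0          # 0 * ln(0 / q) := 0
        if q == 0.0:
            return math.inf     # p > 0 but q = 0
        return p * math.log(p / q)

    return term(x, y) + term(1.0 - x, 1.0 - y)


if __name__ == "__main__":
    # d is not symmetric in its arguments.
    print(bernoulli_kl(0.5, 0.25), bernoulli_kl(0.25, 0.5))
```

The asymmetry of d matters here because the lower bounds compare an arm's mean with a threshold in a fixed argument order.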
The following theorem is an extension (Footnote 3) of Lemma 1 in Kaufmann et al. (2018) to the case with two thresholds.
Theorem 1
Let \(\{\nu _i\}\) be a set of Bernoulli distributions with means \(\{\mu _i\}\). Then, the stopping time T of any \((\theta _L,\theta _U,\delta )\)-BAEC algorithm with \(\theta _U\) and \(\theta _L\) is bounded as
if some arm is positive, and
if all the arms are negative.
Proof
See “Appendix A”. \(\square \)
Remark 1
Identification is not needed for checking existence; however, in terms of asymptotic behavior as \(\delta \rightarrow +0\), the shown expected sample complexity lower bounds of the two tasks are the same: \(\lim _{\delta \rightarrow +0}\mathbb {E}(T)/\ln (1/\delta )\ge 1/d(\mu _1,\theta _L)\) for both tasks in the case with some positive arms. The bounds are tight considering the shown upper bounds, so bad arm existence checking is not more difficult than good arm identification (Footnote 4) (Kano et al. 2017) with respect to asymptotic behavior as \(\delta \rightarrow +0\).
5 Algorithm
5.1 \(\mathrm {BAEC}[\mathrm {ASP},\mathrm {LB},\mathrm {UB}]\) algorithm framework
As \((\theta _L,\theta _U,\delta )\)-BAEC algorithms, we consider algorithm \(\mathrm {BAEC}[\mathrm {ASP},\mathrm {LB},\mathrm {UB}]\) shown in Algorithm 1 that, at each time t, chooses an arm \(i_t\) from the set \(A_t\) of positive-candidate arms by an arm-selection policy \(\mathrm {ASP}\)
\[i_t=\operatorname*{arg\,max}_{i\in A_t}\mathrm {ASP}(t,i)\]
using some index value \(\mathrm {ASP}(t,i)\) of arm i at time t (Line 4), suffers a loss \(X_{i_t}(n_{i_t}(t+1))\) (Line 6) and then checks whether a decision condition
\[\mathrm {LB}(t)\ge \theta _L\quad \text {or}\quad \mathrm {UB}(t)<\theta _U\]
is satisfied (Lines 8 and 10). Here, \(\mathrm {LB}(t)\) and \(\mathrm {UB}(t)\) are lower and upper confidence bounds of an estimated loss mean of the currently drawn arm \(i_t\); condition \(\mathrm {LB}(t)\ge \theta _L\) is the condition for deciding that a positive arm exists, and condition \(\mathrm {UB}(t)<\theta _U\) is the condition for concluding the drawn arm’s negativity and eliminating arm \(i_t\) from the set \(A_{t+1}\) of positive-candidate arms of time \(t+1\). In addition to the case with a positive conclusion, algorithm \(\mathrm {BAEC}[\mathrm {ASP},\mathrm {LB},\mathrm {UB}]\) also stops with a negative conclusion when \(A_t\) becomes empty.
Define the sample loss mean \({\hat{\mu }}_i(n)\) of arm i with n draws as
\[{\hat{\mu }}_i(n)=\frac{1}{n}\sum _{s=1}^{n}X_i(s),\]
and we use \({\hat{\mu }}_{i_t}(n_{i_t}(t+1))\) as an estimated loss mean of the current drawn arm \(i_t\) at time t.
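The framework described above can be sketched as a generic loop with pluggable components. This is a minimal illustration, not the paper's Algorithm 1: the function names, the `(t, i, counts, sums)` signature of the index function, the `max_draws` safeguard, and the fixed-width radius used in the demo are all assumptions made for this example.

```python
import math
from typing import Callable, List, Tuple


def run_baec(
    draw: Callable[[int], float],
    num_arms: int,
    asp: Callable[[int, int, List[int], List[float]], float],
    lb: Callable[[float, int], float],
    ub: Callable[[float, int], float],
    theta_l: float,
    theta_u: float,
    max_draws: int = 1_000_000,
) -> Tuple[str, int]:
    """Elimination loop: returns (answer, number of draws)."""
    counts = [0] * num_arms          # n_i: draws of each arm so far
    sums = [0.0] * num_arms          # cumulative loss of each arm
    candidates = list(range(num_arms))  # positive-candidate arms A_t
    t = 0
    while candidates and t < max_draws:
        t += 1
        # Arm selection: maximize the index over remaining candidates.
        arm = max(candidates, key=lambda i: asp(t, i, counts, sums))
        loss = draw(arm)
        counts[arm] += 1
        sums[arm] += loss
        mean = sums[arm] / counts[arm]
        if lb(mean, counts[arm]) >= theta_l:
            return "positive", t     # a positive arm is judged to exist
        if ub(mean, counts[arm]) < theta_u:
            candidates.remove(arm)   # drawn arm judged negative
    return ("negative" if not candidates else "undecided"), t


def radius(n: int, log_term: float = 8.0) -> float:
    """Hypothetical confidence radius with a fixed log term (demo only)."""
    return math.sqrt(log_term / (2.0 * n))


def round_robin(t: int, i: int, counts: List[int], sums: List[float]) -> float:
    """Index that prefers the least-drawn candidate arm (uniform sampling)."""
    return -float(counts[i])


if __name__ == "__main__":
    means = [0.25, 0.9375, 0.25]     # deterministic losses for a reproducible demo
    print(run_baec(lambda i: means[i], 3, round_robin,
                   lambda m, n: m - radius(n), lambda m, n: m + radius(n),
                   theta_l=0.5, theta_u=0.7))
```

Passing the bounds and the index function as arguments mirrors the notation \(\mathrm {BAEC}[\mathrm {ASP},\mathrm {LB},\mathrm {UB}]\): any of the three components can be swapped without touching the loop.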
5.2 Asymmetric \(\varDelta \)-dependent confidence bounds
As we use the sample mean \({\hat{\mu }}_i(n)\) as an estimated loss mean, \(\mathrm {LB}(t)\) and \(\mathrm {UB}(t)\) are determined by defining lower and upper bounds of a confidence interval of \({\hat{\mu }}_i(n)\) for \(i=i_t\) and \(n=n_{i_t}(t+1)\).
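As lower and upper confidence bounds of \({\hat{\mu }}_i(n)\),
\[\underline{\mu }'_{i}(n)={\hat{\mu }}_i(n)-\sqrt{\frac{1}{2n}\ln \frac{2Kn^2}{\delta }}\quad \text {and}\quad \overline{\mu }'_{i}(n)={\hat{\mu }}_i(n)+\sqrt{\frac{1}{2n}\ln \frac{2Kn^2}{\delta }},\tag{3}\]
respectively, are generally used (Footnote 5) in successive elimination algorithms (Even-Dar et al. 2006). Define \({\underline{\mu }}'(t)\) and \({\overline{\mu }}'(t)\) as \({\underline{\mu }}'(t)=\underline{\mu }'_{i_t}(n_{i_t}(t+1))\) and \({\overline{\mu }}'(t)=\overline{\mu }'_{i_t}(n_{i_t}(t+1))\) for use as \(\mathrm {LB}(t)\) and \(\mathrm {UB}(t)\).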
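Consider the case with \(\theta _L< \theta _U\), namely, the case that \(\theta _L\) is strictly smaller than \(\theta _U\). In this case, we propose asymmetric bounds \(\underline{\mu }_{i}(n)\) and \(\overline{\mu }_{i}(n)\) defined using a gray zone width \(\varDelta =\theta _U-\theta _L\) as follows:
\[\underline{\mu }_{i}(n)={\hat{\mu }}_i(n)-\sqrt{\frac{1}{2n}\ln \frac{KN_{{\varDelta }}}{\delta }}\quad \text {and}\quad \overline{\mu }_{i}(n)={\hat{\mu }}_i(n)+\sqrt{\frac{1}{2n}\ln \frac{N_{{\varDelta }}}{\delta }},\tag{4}\]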
where
We also let \({\underline{\mu }}(t)\) and \({\overline{\mu }}(t)\) denote \(\mathrm {LB}(t)\) and \(\mathrm {UB}(t)\) using these bounds, that is, \({\underline{\mu }}(t)=\underline{\mu }_{i_t}(n_{i_t}(t+1))\) and \({\overline{\mu }}(t)=\overline{\mu }_{i_t}(n_{i_t}(t+1))\).
Note that \(\overline{\mu }_{i}(n)<\overline{\mu }'_{i}(n)\) for \(n>\sqrt{N_{{\varDelta }}/2K}\) and \(\underline{\mu }_{i}(n)>\underline{\mu }'_{i}(n)\) for \(n>\sqrt{N_{{\varDelta }}/2}\), so \(\overline{\mu }_{i}(n)-\underline{\mu }_{i}(n)<\overline{\mu }'_{i}(n)-\underline{\mu }'_{i}(n)\) holds for \(n\ge \sqrt{N_{{\varDelta }}/2}\). Both \(\overline{\mu }_{i}(n)-\underline{\mu }_{i}(n)\) and \(\overline{\mu }'_{i}(n)-\underline{\mu }'_{i}(n)\) decrease as n increases, and \(\mathrm {LB}(t)\ge \theta _L\) or \(\mathrm {UB}(t)<\theta _U\) is satisfied for \(\mathrm {BAEC}[*,{\underline{\mu }},{\overline{\mu }}]\) and \(\mathrm {BAEC}[*,{\underline{\mu }}',{\overline{\mu }}']\) when they become at most \(\varDelta \) for \(n=n_i(t+1)\), where \(\mathrm {ASP}=*\) means that any index function \(\mathrm {ASP}(t,i)\) can be assumed.
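The comparison between the two kinds of bounds can be checked numerically. The sketch below assumes Hoeffding-style forms: radius \(\sqrt{\frac{1}{2n}\ln \frac{2Kn^2}{\delta }}\) on both sides for the symmetric bounds, and radii \(\sqrt{\frac{1}{2n}\ln \frac{KN_{\varDelta }}{\delta }}\) (lower) and \(\sqrt{\frac{1}{2n}\ln \frac{N_{\varDelta }}{\delta }}\) (upper) for the asymmetric bounds, which is consistent with the thresholds on n and the width ratio stated in this section. The constant \(N_{\varDelta }\) is passed in as the parameter `n_delta`; the function names and the value 856 used in the demo are hypothetical.

```python
import math
from typing import Tuple


def asymmetric_bounds(mean: float, n: int, num_arms: int,
                      n_delta: float, delta: float) -> Tuple[float, float]:
    """Assumed Delta-dependent bounds: the lower radius carries the factor K."""
    lower = mean - math.sqrt(math.log(num_arms * n_delta / delta) / (2.0 * n))
    upper = mean + math.sqrt(math.log(n_delta / delta) / (2.0 * n))
    return lower, upper


def symmetric_bounds(mean: float, n: int, num_arms: int,
                     delta: float) -> Tuple[float, float]:
    """Assumed conventional bounds with an n-dependent log term."""
    r = math.sqrt(math.log(2.0 * num_arms * n * n / delta) / (2.0 * n))
    return mean - r, mean + r


if __name__ == "__main__":
    K, n_delta, delta = 100, 856, 0.01   # n_delta is an illustrative value
    for n in (2, 3, 20, 21, 500):
        print(n, asymmetric_bounds(0.5, n, K, n_delta, delta),
              symmetric_bounds(0.5, n, K, delta))
```

With these illustrative inputs, \(\sqrt{N_{\varDelta }/2}\approx 20.7\) and \(\sqrt{N_{\varDelta }/2K}\approx 2.07\), so the asymmetric lower bound becomes tighter from n = 21 and the asymmetric upper bound from n = 3.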
Remark 2
Condition \({\underline{\mu }}(t)\ge \theta _L\) essentially identifies a non-negative arm \(i_t\). Is there a real-valued function \(\mathrm {LB}\) that can check the existence of a non-negative arm without identifying it? The answer is yes. Consider a virtual arm at each time t whose mean loss \(\mu ^t\) is a weighted average over the mean losses \(\mu _i\) of all the arms i (\(i=1,\dots ,K\)) defined as \(\mu ^t=\frac{1}{t}\sum _{i=1}^Kn_i(t+1)\mu _i\). If \(\mu ^t\ge \theta _L\), then at least one arm i must be non-negative. Thus, we can check the existence of a non-negative arm by judging whether \(\mu ^t\ge \theta _L\) or not. Since \({\underline{\mu }}^t(t)\) defined as
can be considered to be a lower bound of the estimated value of \(\mu ^t\), \({\underline{\mu }}^t\) can be used as \(\mathrm {LB}\) for checking the existence of a non-negative arm without identifying it. Instead of the set of all arms, any arm subset can be considered to be a virtual arm, as in the stopping condition proposed by Kaufmann et al. (2018) in the case with \(\theta _L=\theta _U\). However, increasing the number of subsets to be considered also increases the number of samples required for each subset because of the union bound. In this paper, we do not pursue this direction, and instead focus on investigating the effect of the decision condition using \(\varDelta \)-dependent asymmetric confidence bounds.
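The ratio of the width of our upper confidence interval \(\left[ {\hat{\mu }}_i(n),\overline{\mu }_{i}(n)\right] \) to the width of our lower confidence interval \(\left[ \underline{\mu }_{i}(n),{\hat{\mu }}_i(n)\right] \) is \(\sqrt{\ln \frac{N_{{\varDelta }}}{\delta }}:\sqrt{\ln \frac{KN_{{\varDelta }}}{\delta }}=1:\sqrt{1+\frac{\ln K}{\ln \frac{N_{{\varDelta }}}{\delta }}}\). Thus, we define \(\theta \) as the point that divides the interval between the thresholds in the same ratio, that is, \((\theta _U-\theta ):(\theta -\theta _L)=1:\sqrt{1+\frac{\ln K}{\ln \frac{N_{{\varDelta }}}{\delta }}}\):
\[\theta =\theta _U-\frac{\varDelta }{1+\sqrt{1+\frac{\ln K}{\ln \frac{N_{{\varDelta }}}{\delta }}}}.\]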
This \(\theta \) can be considered to be the balanced center between the thresholds \(\theta _L\) and \(\theta _U\) for our asymmetric confidence bounds.
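The balanced center can be computed directly from the width ratio. The sketch below assumes that \(\theta \) is the point dividing \([\theta _L,\theta _U]\) so that \((\theta _U-\theta ):(\theta -\theta _L)\) equals the upper-to-lower width ratio given above; the function name and the demo value of `n_delta` (standing for \(N_{\varDelta }\)) are hypothetical.

```python
import math


def balanced_threshold(theta_l: float, theta_u: float, num_arms: int,
                       n_delta: float, delta: float) -> float:
    """Point theta with (theta_u - theta) : (theta - theta_l) = 1 : alpha,

    where alpha = sqrt(1 + ln K / ln(N_Delta / delta)) is the assumed ratio
    of the lower confidence width to the upper confidence width.
    """
    alpha = math.sqrt(1.0 + math.log(num_arms) / math.log(n_delta / delta))
    return theta_u - (theta_u - theta_l) / (1.0 + alpha)


if __name__ == "__main__":
    # Illustrative inputs only; n_delta = 856 is not taken from the paper.
    print(balanced_threshold(0.4, 0.6, num_arms=100, n_delta=856, delta=0.01))
```

For a single arm (K = 1) the two widths coincide and the function returns the midpoint; for K > 1 the result moves toward \(\theta _U\), in line with the description in the introduction.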
5.3 Arm selection policy \(\mathrm {APT}_{\mathrm {P}}\)
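As arm selection policy \(\mathrm {ASP}\), we consider policy \(\mathrm {APT}_{\mathrm {P}}\) that uses index function
\[\mathrm {APT}_{\mathrm {P}}(t,i)=\sqrt{n_i(t)}\left( {\hat{\mu }}_i(n_i(t))-\theta \right) ,\]
where we use \({\hat{\mu }}_i(n_i(t))=\theta \) when \(n_i(t)=0\). This arm-selection policy is a modification of the policy of \(\mathrm {APT}\) (Anytime Parameter-free Thresholding algorithm) (Locatelli et al. 2016), in which an arm
\[\operatorname*{arg\,min}_{i}\sqrt{n_i(t)}\left( \left| {\hat{\mu }}_i(n_i(t))-\theta \right| +\epsilon \right) \]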
is chosen for given threshold \(\theta \) and accuracy \(\epsilon \). In the original APT, arm i with the sample mean \({\hat{\mu }}_i(n_i(t))\) closest to \(\theta \) is preferred no matter whether \({\hat{\mu }}_i(n_i(t))\) is larger or smaller than \(\theta \). In \(\mathrm {APT}_{\mathrm {P}}\), there is at most one arm i whose sample mean \({\hat{\mu }}_i(n_i(t))\) is larger than \(\theta \) at any time t, which follows by induction on t from our definition above of \({\hat{\mu }}_j(n_j(t))\) for arms j with \(n_j(t)=0\), and this unique arm i is always chosen as long as \({\hat{\mu }}_i(n_i(t))>\theta \).
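The behavior described above can be illustrated with a small sketch. It assumes the index \(\sqrt{n_i(t)}\,({\hat{\mu }}_i(n_i(t))-\theta )\), which is consistent with the stated convention for undrawn arms and with the property that an arm whose sample mean exceeds \(\theta \) keeps being chosen; the function names are hypothetical.

```python
import math
from typing import List, Sequence


def apt_p_index(n_i: int, mean_i: float, theta: float) -> float:
    """Assumed APT_P index: sqrt(n_i) * (sample mean - theta).

    An undrawn arm is treated as having sample mean theta, so its index is 0.
    """
    if n_i == 0:
        return 0.0
    return math.sqrt(n_i) * (mean_i - theta)


def select_arm(candidates: Sequence[int], counts: List[int],
               means: List[float], theta: float) -> int:
    """Pick the candidate with the largest index (smallest arm number on ties)."""
    return max(sorted(candidates),
               key=lambda i: apt_p_index(counts[i], means[i], theta))


if __name__ == "__main__":
    theta = 0.5
    # An undrawn arm (index 0) beats arms whose sample means are below theta.
    print(select_arm([0, 1, 2], [4, 9, 0], [0.3, 0.3, 0.0], theta))
    # An arm whose sample mean exceeds theta is chosen.
    print(select_arm([0, 1, 2], [4, 9, 1], [0.3, 0.3, 0.9], theta))
```

Among arms below \(\theta \), the factor \(\sqrt{n_i(t)}\) makes the index of a frequently drawn arm more negative, so less-explored arms are revisited first.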
6 Theoretical analyses of algorithm \(\mathrm {BAEC}[*,{\underline{\mu }},{\overline{\mu }}]\)
In the following sections, we consider the case with \(\theta _L<\theta _U\) (\(\varDelta >0\)). We first analyze arm’s sample complexity for any arm, then analyze algorithm’s sample complexity.
6.1 Worst case sample complexity upper bound for any arm
One merit of the two-threshold setting with \(\theta _L<\theta _U\) is that the number of samples drawn until the decision condition is satisfied is upper-bounded for any arm by a common constant depending on \(\varDelta =\theta _U-\theta _L\) and \(\delta \). In this subsection, we prove such a common constant bound for our \(\varDelta \)-dependent asymmetric confidence bounds and compare it with the corresponding number of samples for the conventional symmetric confidence bounds.
Let \(\tau _i\) denote the smallest number n of draws of arm i for which the decision condition is met, that is, either \(\underline{\mu }_{i}(n)\ge \theta _L\) or \(\overline{\mu }_{i}(n)<\theta _U\) holds. Define \(T_{{\varDelta }}\) as \(T_{{\varDelta }}=\left\lceil \frac{2}{\varDelta ^2}\ln \frac{\sqrt{K}N_{{\varDelta }}}{\delta }\right\rceil \). Then, \(\tau _i\) can be upper-bounded by \(T_{{\varDelta }}\) for any arm i, as stated in the following theorem.
Theorem 2
Inequality \(\tau _i\le T_{{\varDelta }}\) holds for \(i=1,\dots ,K\).
Proof
See “Appendix B”. \(\square \)
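The bound \(T_{{\varDelta }}\) is straightforward to evaluate once \(N_{{\varDelta }}\) is known. The sketch below takes \(N_{{\varDelta }}\) as an input parameter `n_delta` rather than computing it; the values 856 and 1038 used in the demo are hypothetical inputs chosen for illustration, and with them the formula returns 684 and 808, the values of \(T_{{\varDelta }}\) quoted in Remark 3 for \(K=100\), \(\varDelta =0.2\) and \(\delta =0.01, 0.001\).

```python
import math


def worst_case_draws(num_arms: int, gap: float, delta: float,
                     n_delta: float) -> int:
    """T_Delta = ceil((2 / Delta^2) * ln(sqrt(K) * N_Delta / delta)).

    `n_delta` stands for the constant N_Delta, supplied by the caller.
    """
    return math.ceil(
        2.0 / gap ** 2 * math.log(math.sqrt(num_arms) * n_delta / delta)
    )


if __name__ == "__main__":
    K = 100
    # n_delta values below are illustrative assumptions, not from the paper.
    for delta, n_delta in ((0.01, 856), (0.001, 1038)):
        t_delta = worst_case_draws(K, 0.2, delta, n_delta)
        # K * T_Delta bounds the total number of draws over all arms.
        print(delta, t_delta, K * t_delta)
```

Because every arm is drawn at most \(T_{{\varDelta }}\) times, multiplying by K gives a bound on the total number of draws that does not depend on the arms' means.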
How good is the worst-case bound \(T_{{\varDelta }}\) on the number of samples for each arm compared to the case with \(\mathrm {LB}={\underline{\mu }}'\) and \(\mathrm {UB}={\overline{\mu }}'\) (Eq. 3)? The following theorem shows that, in \(\mathrm {BAEC}[*,{\underline{\mu }}',{\overline{\mu }}']\), the number of arm draws \(\tau '_i\) for some arm i, which corresponds to \(\tau _i\), can be larger than \(T'_{{\varDelta }}=\lfloor \frac{2}{\varDelta ^2}\ln \frac{448K}{\varDelta ^4\delta }\rfloor \), which means \(\tau '_i-\tau _i=\varOmega \left( \frac{1}{\varDelta ^2}\ln \frac{\sqrt{K}}{\varDelta ^2}\right) \) if \(\frac{1}{\delta }=o\left( \mathrm {e}^{\sqrt{K}/\varDelta ^2}\right) \).
Theorem 3
Consider algorithm \(\mathrm {BAEC}[*,{\underline{\mu }}',{\overline{\mu }}']\) and define \(\tau '_i=\min \{n\mid \underline{\mu }'_{i}(n)\ge \theta _L \text { or } \overline{\mu }'_{i}(n)<\theta _U\}\) for \(i=1,\dots ,K\). Then, event \(\tau '_i > T'_{{\varDelta }}\) can happen for \(i=1,\dots ,K\), where \(T'_{{\varDelta }}\) is defined as \(T'_{{\varDelta }}=\lfloor \frac{2}{\varDelta ^2}\ln \frac{448K}{\varDelta ^4\delta }\rfloor \). Furthermore, the difference between the worst case decision times \(\tau '_i-\tau _i\) is lower-bounded as
\[\tau '_i-\tau _i\ge \frac{2}{\varDelta ^2}\left( \ln \frac{52\sqrt{K}}{\varDelta ^2}-\ln \ln \frac{3\sqrt{K}}{\varDelta ^2\delta }\right) .\]
Proof
See “Appendix C”. \(\square \)
Remark 3
Theorem 3 says that the difference between the worst case decision times \(\tau '_i\) and \(\tau _i\) of arm i is \(\varOmega \left( \frac{1}{\varDelta ^2}\ln \frac{\sqrt{K}}{\varDelta ^2}\right) \) for \(\delta =\omega \left( \frac{\sqrt{K}}{\varDelta ^2}\mathrm {e}^{-\frac{52\sqrt{K}}{\varDelta ^2}}\right) \) under the condition that \(\delta >\frac{3\sqrt{K}}{\varDelta ^2}\mathrm {e}^{-\frac{52\sqrt{K}}{\varDelta ^2}}\). In the experimental setting of Sect. 7.1, in which parameters \(K=100\), \((\varDelta ,\delta )=(0.2,0.01), (0.2,0.001), (0.02,0.01), (0.02,0.001)\) are used, the lower bounds of \(\tau '_i - \tau _i\) calculated using the above inequality are 352.7, 343.4, 56579.7, 55900.7, respectively, which seem relatively large compared to the corresponding \(T_{{\varDelta }}=684, 808, 93098, 105307\). The range of \(\delta \) which guarantees that the lower bound of \(\tau '_i -\tau _i\) is positive is \(\delta >1.11\times 10^{-5643}\) for \(\varDelta =0.2\) and \(\delta >1.12\times 10^{-564578}\) for \(\varDelta =0.02\).
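The numbers in this remark can be reproduced with a short calculation. The sketch below assumes that the lower bound of Theorem 3 has the form \(\frac{2}{\varDelta ^2}\left( \ln \frac{52\sqrt{K}}{\varDelta ^2}-\ln \ln \frac{3\sqrt{K}}{\varDelta ^2\delta }\right) \), a form inferred from the positivity condition on \(\delta \) and the reported values rather than quoted from the theorem; the function name is hypothetical.

```python
import math


def decision_time_gap_lower_bound(num_arms: int, gap: float,
                                  delta: float) -> float:
    """Assumed lower bound on tau'_i - tau_i (worst case), see lead-in."""
    a = 52.0 * math.sqrt(num_arms) / gap ** 2
    b = 3.0 * math.sqrt(num_arms) / (gap ** 2 * delta)
    return 2.0 / gap ** 2 * (math.log(a) - math.log(math.log(b)))


if __name__ == "__main__":
    K = 100
    for gap, delta in ((0.2, 0.01), (0.2, 0.001), (0.02, 0.01), (0.02, 0.001)):
        print(gap, delta, round(decision_time_gap_lower_bound(K, gap, delta), 1))
```

The bound is positive exactly when \(\ln \frac{3\sqrt{K}}{\varDelta ^2\delta }<\frac{52\sqrt{K}}{\varDelta ^2}\), which is the condition on \(\delta \) stated in the remark.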
Remark 4
Instead of \(\overline{\mu }'_{i}(n)\) defined in Eq. (3), \(\overline{\mu }''_{i}(n)={\hat{\mu }}_i(n)+\sqrt{\frac{1}{2n}\ln \frac{2n^2}{\delta }}\) can be used because a union bound is not necessary for a positive arm, as is the case for \(\overline{\mu }_{i}(n)\) defined in Eq. (4). For the algorithm \(\mathrm {BAEC}[*,{\underline{\mu }}',{\overline{\mu }}'']\) using this upper confidence bound \(\overline{\mu }''_{i}(n)\) (\(i=1,\ldots ,K\)), the decision time difference from \(\tau _i\) is still lower-bounded by \(\frac{2}{\varDelta ^2}\left( \ln \frac{2}{K^{\frac{1}{4}}\varDelta ^2}-\ln \ln \frac{3\sqrt{K}}{\varDelta ^2\delta }\right) \) by Theorem 9 in “Appendix D”. The values of this lower bound for the experimental setting of Sect. 7.1, that is, \(K=100\), \((\varDelta ,\delta )=(0.2,0.01), (0.2,0.001), (0.02,0.01), (0.02,0.001)\), are 17.13, 7.80, 23020.3, 22341.3, respectively. Compared to the corresponding \(T_{{\varDelta }}=684, 808, 93098, 105307\), the difference still seems large for \(\varDelta =0.02\), though it becomes small for \(\varDelta =0.2\). The range of \(\delta \) guaranteeing positiveness of the lower bound is \(\delta >1.03\times 10^{-4}\) for \(\varDelta =0.2\) and \(\delta >1.63\times 10^{-682}\) for \(\varDelta =0.02\).
6.2 Algorithm’s correctness
In this subsection, we prove that algorithm \(\mathrm {BAEC}[*,{\underline{\mu }},{\overline{\mu }}]\) is a \((\theta _L,\theta _U,\delta )\)-BAEC algorithm.
We define events \(\mathcal {E}^+\) and \(\mathcal {E}^-\) as
Note that algorithm \(\mathrm {BAEC}[*,{\underline{\mu }},{\overline{\mu }}]\) returns “positive” under the event \(\mathcal {E}^+\) and returns “negative” under the event \(\mathcal {E}^-\). For any event \(\mathcal {E}\), we let \(\mathbb {1}\{\mathcal {E}\}\) denote an indicator function of \(\mathcal {E}\), that is, \(\mathbb {1}\{\mathcal {E}\}=1\) if \(\mathcal {E}\) occurs and \(\mathbb {1}\{\mathcal {E}\}=0\) otherwise.
The following proposition is used to prove Lemma 1.
Proposition 1
\(T_{{\varDelta }}\le N_{{\varDelta }}\).
Proof
See “Appendix E”. \(\square \)
The next lemma says that algorithm’s output is correct with probability at least \(1-\delta \) in the cases that at least one positive arm exists or all the arms are negative.
Lemma 1
For the complementary events \(\overline{\mathcal {E}^+}\), \(\overline{\mathcal {E}^-}\) of events \(\mathcal {E}^+\), \(\mathcal {E}^-\), inequality \(\mathbb {P}\{\overline{\mathcal {E}^+}\}\le \delta \) holds when \(\mu _1\ge \theta _U\) and inequality \(\mathbb {P}\{\overline{\mathcal {E}^-}\}\le \delta \) holds when \(\mu _1<\theta _L\).
Proof
Assume that \(\mu _1\ge \theta _U\). Using De Morgan’s laws, \(\overline{\mathcal {E}^+}\) can be expressed as
So, the probability that event \(\overline{\mathcal {E}^+}\) occurs is bounded by \(\delta \) using Hoeffding’s inequality:
Assume that \(\mu _1< \theta _L\). Using De Morgan’s laws, \(\overline{\mathcal {E}^-}\) can be expressed as
So, the probability that event \(\overline{\mathcal {E}^-}\) occurs is bounded by \(\delta \) using the union bound and Hoeffding’s inequality:
\(\square \)
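The role of Proposition 1 in these two bounds can be traced through a short calculation. The following is a sketch only: it assumes that the confidence radii are \(\sqrt{\frac{1}{2n}\ln \frac{N_{\varDelta }}{\delta }}\) (upper) and \(\sqrt{\frac{1}{2n}\ln \frac{KN_{\varDelta }}{\delta }}\) (lower), as suggested by the width ratio stated in Sect. 5.2, and that the failure events range over \(n=1,\dots ,T_{\varDelta }\); the exact events used in the proof may differ.

```latex
\begin{align*}
% Positive case: only arm 1 matters, so no union over arms is needed.
\mathbb{P}\Bigl\{\exists n\le T_{\varDelta}:\
  \hat{\mu}_1(n)+\sqrt{\tfrac{1}{2n}\ln\tfrac{N_{\varDelta}}{\delta}}<\mu_1\Bigr\}
 &\le \sum_{n=1}^{T_{\varDelta}}
      \exp\Bigl(-2n\cdot\tfrac{1}{2n}\ln\tfrac{N_{\varDelta}}{\delta}\Bigr)
  = \frac{T_{\varDelta}}{N_{\varDelta}}\,\delta \le \delta,\\
% Negative case: the union bound runs over all K arms as well.
\mathbb{P}\Bigl\{\exists i,\ \exists n\le T_{\varDelta}:\
  \hat{\mu}_i(n)-\sqrt{\tfrac{1}{2n}\ln\tfrac{KN_{\varDelta}}{\delta}}\ge\mu_i\Bigr\}
 &\le \sum_{i=1}^{K}\sum_{n=1}^{T_{\varDelta}}
      \exp\Bigl(-2n\cdot\tfrac{1}{2n}\ln\tfrac{KN_{\varDelta}}{\delta}\Bigr)
  = \frac{K\,T_{\varDelta}}{K\,N_{\varDelta}}\,\delta \le \delta.
\end{align*}
```

In both lines the first step is Hoeffding's inequality applied for each fixed n, and the last step uses \(T_{\varDelta }\le N_{\varDelta }\) from Proposition 1. The extra factor K inside the lower radius is what absorbs the union over arms in the negative case.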
The following theorem states that algorithm \(\mathrm {BAEC}[*,{\underline{\mu }},{\overline{\mu }}]\) is a \((\theta _L,\theta _U,\delta )\)-BAEC algorithm which needs at most \(KT_{{\varDelta }}\) samples in the worst case.
Theorem 4
Algorithm \(\mathrm {BAEC}[*,{\underline{\mu }},{\overline{\mu }}]\) is a \((\theta _L,\theta _U,\delta )\)-BAEC algorithm that stops after at most \(KT_{{\varDelta }}\) arm draws.
Proof
By the definition of \(\tau _i\), algorithm \(\mathrm {BAEC}[*,{\underline{\mu }},{\overline{\mu }}]\) draws arm i at most \(\tau _i\) times, which is upper-bounded by \(T_{{\varDelta }}\) due to Theorem 2. So, algorithm \(\mathrm {BAEC}[*,{\underline{\mu }},{\overline{\mu }}]\) stops after at most \(KT_{{\varDelta }}\) arm draws.
When at least one arm is positive, that is, in the case with \(\mu _1\ge \theta _U\), algorithm \(\mathrm {BAEC}[*,{\underline{\mu }},{\overline{\mu }}]\) returns “positive” if event \(\mathcal {E}^+\) occurs. Thus, algorithm \(\mathrm {BAEC}[*,{\underline{\mu }},{\overline{\mu }}]\) returns “positive” with probability \(\mathbb {P}\{\mathcal {E}^+\}=1-\mathbb {P}\{\overline{\mathcal {E}^+}\}\ge 1-\delta \) by Lemma 1. When all the arms are negative, that is, in the case with \(\mu _1< \theta _L\), algorithm \(\mathrm {BAEC}[*,{\underline{\mu }},{\overline{\mu }}]\) returns “negative” if event \(\mathcal {E}^-\) occurs. Thus, algorithm \(\mathrm {BAEC}[*,{\underline{\mu }},{\overline{\mu }}]\) returns “negative” with probability \(\mathbb {P}\{\mathcal {E}^-\}=1-\mathbb {P}\{\overline{\mathcal {E}^-}\}\ge 1-\delta \) by Lemma 1.\(\square \)
6.3 High-probability and average-case bounds
By Theorem 4, we know a worst-case upper bound \(KT_{{\varDelta }}\) on the number of samples needed for algorithm \(\mathrm {BAEC}[*,{\underline{\mu }},{\overline{\mu }}]\). In this section, we show a high-probability bound and an average-case bound for the algorithm.
We define \(\varDelta _i\) as
and let \(T_{{\varDelta _i}}\) denote \(T_{{\varDelta _i}}=\left\lceil \frac{2}{\varDelta _i^2}\ln \frac{\sqrt{K}N_{{\varDelta }}}{\delta }\right\rceil \).
A high-probability upper bound on the number of samples needed for algorithm \(\mathrm {BAEC}[*,{\underline{\mu }},{\overline{\mu }}]\) is shown in the next theorem. Compared to the worst-case bound, \(KT_{{\varDelta }}\) can be improved to \(\sum _{i=1}^KT_{{\varDelta _i}}\) in the case with \(\mu _1<\theta _L\). In the case with \(\mu _1\ge \theta _U\), however, only one \(T_{{\varDelta }}\) is guaranteed to be improved, namely to the maximum \(T_{{\varDelta _i}}\) among those of positive arms i.
Theorem 5
In algorithm \(\mathrm {BAEC}[*,{\underline{\mu }},{\overline{\mu }}]\), inequality \(\tau _i \le T_{{\varDelta _i}}\) holds for at least one positive arm i with probability at least \(1-\delta \) when \(\mu _1\ge \theta _U\). Inequality \(\tau _i \le T_{{\varDelta _i}}\) holds for every arm \(i=1,\dots ,K\) with probability at least \(1-\delta \) when \(\mu _1< \theta _L\). As a result, with probability at least \(1-\delta \), the stopping time T of algorithm \(\mathrm {BAEC}[*,{\underline{\mu }},{\overline{\mu }}]\) is upper-bounded as \(T\le \max _{i:\mu _i\ge \theta _U}T_{{\varDelta _i}}+(K-1)T_{{\varDelta }}\) when \(\mu _1\ge \theta _U\) and \(T\le \sum _{i=1}^KT_{{\varDelta _i}}\) when \(\mu _1<\theta _L\).
Proof
See “Appendix F”. \(\square \)
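As a rough numerical illustration of how \(T_{{\varDelta _i}}\) and the all-negative bound \(\sum _{i=1}^KT_{{\varDelta _i}}\) of Theorem 5 behave, the following Python sketch evaluates them directly. The function names and the gap values are assumptions made here for illustration, and \(N_{{\varDelta }}\) is simply passed in as a number; the example value is backed out from \(\sqrt{N_{{\varDelta }}/2}=246.99\), which Section 8.1 quotes for \(\varDelta =0.02\), \(K=100\) and \(\delta =0.01\).

```python
import math


def t_delta_i(gap, k, n_delta, delta):
    """T_{Delta_i} = ceil((2 / gap^2) * ln(sqrt(K) * N_Delta / delta))."""
    return math.ceil(2.0 / gap ** 2 * math.log(math.sqrt(k) * n_delta / delta))


def all_negative_bound(gaps, n_delta, delta):
    """High-probability bound sum_i T_{Delta_i} for the case mu_1 < theta_L."""
    k = len(gaps)
    return sum(t_delta_i(g, k, n_delta, delta) for g in gaps)


# Illustrative values only: N_Delta is recovered from sqrt(N_Delta / 2) = 246.99.
N_DELTA = round(2 * 246.99 ** 2)

# One arm with a hypothetical gap of 0.1 among K = 100 arms.
print(t_delta_i(0.1, 100, N_DELTA, 0.01))

# Three hypothetical negative arms with gaps 0.1, 0.2 and 0.4 (so K = 3 here).
print(all_negative_bound([0.1, 0.2, 0.4], N_DELTA, 0.01))
```

Because the bound scales as \(1/\varDelta _i^2\), arms with small gaps dominate the sum, which is the source of the improvement over \(KT_{{\varDelta }}\) when most arms are far from the thresholds.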
The last sample complexity upper bound for algorithm \(\mathrm {BAEC}[*,{\underline{\mu }},{\overline{\mu }}]\) is an upper bound on the expected number of samples. Compared to the high-probability bound, \(T_{{\varDelta _i}}=\left\lceil \frac{2}{\varDelta _i^2}\ln \frac{\sqrt{K}N_{{\varDelta }}}{\delta }\right\rceil \) is improved to \(\frac{1}{2\varDelta _i^2}\ln \frac{KN_{{\varDelta }}}{\delta }\) or \(\frac{1}{2\varDelta _i^2}\ln \frac{N_{{\varDelta }}}{\delta }\).
Theorem 6
For algorithm \(\mathrm {BAEC}[*,{\underline{\mu }},{\overline{\mu }}]\), the expected value of \(\tau _i\) of each arm i is upper-bounded as follows.
As a result, the expected stopping time \(\mathbb {E}[T]\) of algorithm \(\mathrm {BAEC}[*,{\underline{\mu }},{\overline{\mu }}]\) is upper-bounded as
The above theorem can be easily derived from the following lemma by setting event \(\mathcal {E}\) to a certain event (an event that occurs with probability 1).
Lemma 2
For any event \(\mathcal {E}\), in algorithm \(\mathrm {BAEC}[*,{\underline{\mu }},{\overline{\mu }}]\), inequality
holds for any arm i with \(\mu _i\ge \theta \) and
holds for any arm i with \(\mu _i< \theta \).
Proof
See “Appendix G”. \(\square \)
Remark 5
When all the arms have Bernoulli loss distributions with means less than \(\theta _L\), by Pinsker’s Inequality \(d(x,y)\ge 2(x-y)^2\), the right-hand side of Ineq. (2) in Theorem 1 can be upper-bounded as
Since Pinsker’s Inequality is tight in the worst case, algorithm \(\mathrm {BAEC}[*,{\underline{\mu }},{\overline{\mu }}]\) is almost asymptotically optimal as \(\delta \rightarrow +0\). Algorithm \(\mathrm {BAEC}[*,{\underline{\mu }},{\overline{\mu }}]\) is a kind of elimination algorithm, that is, the arms that satisfy the negative decision condition are eliminated. Among non-elimination algorithms, UCB and Murphy Sampling coupled with a box stopping rule are also known to have asymptotically optimal stopping time in this case when \(\varDelta =0\) (Kaufmann et al. 2018).
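The role of Pinsker’s Inequality \(d(x,y)\ge 2(x-y)^2\) in this remark can be checked numerically. The sketch below is an illustration only: the function names and the grid are choices made here, and \(d(x,y)\) is taken to be the Kullback-Leibler divergence between Bernoulli distributions with means x and y.

```python
import math


def bernoulli_kl(x, y):
    """KL divergence d(x, y) between Bernoulli(x) and Bernoulli(y), 0 < y < 1."""
    def term(p, q):
        return 0.0 if p == 0.0 else p * math.log(p / q)
    return term(x, y) + term(1.0 - x, 1.0 - y)


def pinsker_gap(x, y):
    """d(x, y) - 2 (x - y)^2, which Pinsker's inequality says is non-negative."""
    return bernoulli_kl(x, y) - 2.0 * (x - y) ** 2


# Check the inequality on a grid of means strictly inside (0, 1).
grid = [i / 20 for i in range(1, 20)]
worst = min(pinsker_gap(x, y) for x in grid for y in grid)
print(worst >= -1e-12)

# Near 1/2 the two sides almost coincide, which is the "tight in the worst case" regime.
print(bernoulli_kl(0.5, 0.51) / (2 * 0.01 ** 2))
```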
7 Sample complexity of algorithm \(\mathrm {BAEC}[\mathrm {APT}_\mathrm {P},{\underline{\mu }},{\overline{\mu }}]\)
7.1 Sample complexity upper bound
If all the arms are judged as negative in algorithm \(\mathrm {BAEC}[\mathrm {ASP},{\underline{\mu }},{\overline{\mu }}]\), that is, drawing arm i is stopped by the decision condition of \({\overline{\mu }}_i(\tau _i)<\theta _U\) for all \(i=1,\dots ,K\), the stopping time T is \(\sum _{i=1}^K\tau _i\) regardless of arm-selection policy \(\mathrm {ASP}\). In the case that some positive arms exist, however, the stopping time depends on how fast the \((\theta _L,\theta _U,\delta )\)-BAEC algorithm can find one of positive arms.
In this subsection, we prove upper bounds on the expected number of samples needed for algorithm \(\mathrm {BAEC}[\mathrm {APT}_\mathrm {P},{\underline{\mu }},{\overline{\mu }}]\), an instance of algorithm \(\mathrm {BAEC}[*,{\underline{\mu }},{\overline{\mu }}]\) with specific arm-selection policy \(\mathrm {APT}_\mathrm {P}\).
Let arm \({\hat{i}}_1\) denote the first arm that is drawn \(\tau _i\) times in algorithm \(\mathrm {BAEC}[\mathrm {APT}_\mathrm {P},{\underline{\mu }},{\overline{\mu }}]\). In addition to \(\varDelta _i\), we also use \(\underline{\varDelta }_i=|\mu _i-\theta |\) in the following analysis. We let m denote the number of arms i with \(\mu _i\ge \theta \). The event that arm i is judged as positive is denoted as \(\mathcal {E}_i^\mathrm {POS}\).
From the following theorem and corollary, we know that, when \(\delta \) is small, the dominant terms of our upper bound on the expected stopping time of algorithm \(\mathrm {BAEC}[\mathrm {APT}_\mathrm {P},{\underline{\mu }},{\overline{\mu }}]\) are \(\frac{\mathbb {P}\left[ {\hat{i}}_1=i,\mathcal {E}_i^\mathrm {POS}\right] }{2\varDelta _i^2}\ln \frac{1}{\delta }\) (\(i=1,\ldots ,m\)), whose sum is between \(\frac{1}{2\varDelta _1^2}\ln \frac{1}{\delta }\) and \(\frac{1}{2\varDelta _m^2}\ln \frac{1}{\delta }\).
Theorem 7
If \(m\ge 1\) (or \(\mu _1\ge \theta \)), then the expected stopping time \(\mathbb {E}[T]\) of algorithm \(\mathrm {BAEC}[\mathrm {APT}_\mathrm {P},{\underline{\mu }},{\overline{\mu }}]\) is upper-bounded as
Proof
See “Appendix H”. \(\square \)
The next corollary is easily derived from Theorem 7.
Corollary 1
If \(m\ge 1\), then
holds for the expected stopping time \(\mathbb {E}[T]\) of algorithm \(\mathrm {BAEC}[\mathrm {APT}_\mathrm {P},{\underline{\mu }},{\overline{\mu }}]\).
7.2 Comparison with \(\mathrm {BAEC}[\mathrm {UCB},{\underline{\mu }},{\overline{\mu }}]\)
HDoC (Hybrid algorithm for the Dilemma of Confidence) (Kano et al. 2017) for the good arm identification problem uses arm selection policy UCB (Upper Confidence Bound) (Auer et al. 2002), in which
is used as \(\mathrm {ASP}(t,i)\). In this section, we analyze a sample complexity upper bound of algorithm \(\mathrm {BAEC}[\mathrm {UCB},{\underline{\mu }},{\overline{\mu }}]\) (see footnote 6) and compare it with that of \(\mathrm {BAEC}[\mathrm {APT}_\mathrm {P},{\underline{\mu }},{\overline{\mu }}]\).
Define \(\varDelta _{1i}\) as \(\varDelta _{1i}=\mu _1-\mu _i\). Then, we can obtain the following theorem and corollary, from which we know that, when \(\delta \) is small, the dominant terms of our upper bound on the expected stopping time of algorithm \(\mathrm {BAEC}[\mathrm {UCB},{\underline{\mu }},{\overline{\mu }}]\) are \(\frac{1}{2\varDelta _i^2}\ln \frac{1}{\delta }\) (\(i:\mu _i=\mu _1\)), whose sum is \(\frac{|\{i\mid \mu _i=\mu _1\}|}{2\varDelta _1^2}\ln \frac{1}{\delta }\).
Theorem 8
If \(m\ge 1\), then expected stopping time \(\mathbb {E}[T]\) of algorithm \(\mathrm {BAEC}[\mathrm {UCB},{\underline{\mu }},{\overline{\mu }}]\) is upper-bounded as
Proof
See “Appendix I”. \(\square \)
Corollary 2
If \(m\ge 1\), then
holds for the expected stopping time \(\mathbb {E}[T]\) of algorithm \(\mathrm {BAEC}[\mathrm {UCB},{\underline{\mu }},{\overline{\mu }}]\).
Remark 6
From the upper bound shown by Ineq. (7), inequality
is derived. This means that the expected stopping time upper bounds for algorithms \(\mathrm {BAEC}[\mathrm {APT}_\mathrm {P},{\underline{\mu }},{\overline{\mu }}]\) and \(\mathrm {BAEC}[\mathrm {UCB},{\underline{\mu }},{\overline{\mu }}]\) shown in Theorems 7 and 8 are asymptotically smaller than that of algorithm \(\mathrm {BAEC}[*,{\underline{\mu }},{\overline{\mu }}]\) as \(\delta \rightarrow +0\).
Remark 7
When all the arms have Bernoulli loss distributions, the right-hand side of Ineq. (1) in Theorem 1 can be upper-bounded as
by Pinsker’s Inequality. Considering the tightness of Pinsker’s Inequality, \(\frac{1}{2\varDelta _1^2}\) is considered to be a tight upper bound of \(\lim _{\delta \rightarrow +0}\frac{\mathbb {E}[T]}{\ln \frac{1}{\delta }}\) if Ineq. (1) is tight. There is a large gap between \(\sum _{i=1}^m\frac{\lim _{\delta \rightarrow +0}\mathbb {P}\left[ {\hat{i}}_1=i,\mathcal {E}_i^\mathrm {POS}\right] }{2\varDelta _i^2}\) and \(\frac{1}{2\varDelta _1^2}\), and improvement of the upper bound on the number of samples for \(\mathrm {APT}_\mathrm {P}\) seems difficult, so the algorithm BAEC with arm selection policy \(\mathrm {APT}_\mathrm {P}\) does not seem asymptotically optimal unless \(\lim _{\delta \rightarrow +0}\mathbb {P}\left[ {\hat{i}}_1=1,\mathcal {E}_1^\mathrm {POS}\right] =1\). On the other hand, \(\lim _{\delta \rightarrow +0}\frac{\mathbb {E}[T]}{\ln \frac{1}{\delta }}\) for \(\mathrm {UCB}\) is upper-bounded by \(\frac{1}{2\varDelta _1^2}\), that is, \(\mathrm {UCB}\) is asymptotically optimal, when \(\mu _i<\mu _1\) for every arm \(i \ne 1\). In the case with \(\mu _i=\mu _1\) for all \(i=1,\dots ,m\), however, \(\lim _{\delta \rightarrow +0}\frac{\mathbb {E}[T]}{\ln \frac{1}{\delta }}\le \frac{m}{2\varDelta _1^2}\) holds for \(\mathrm {UCB}\) while the corresponding bound for \(\mathrm {APT}_\mathrm {P}\) is asymptotically optimal, that is, \(\lim _{\delta \rightarrow +0}\frac{\mathbb {E}[T]}{\ln \frac{1}{\delta }}\le \frac{1}{2\varDelta _1^2}\) holds. The asymptotic optimality of the stopping time of Murphy Sampling coupled with a box stopping rule (Kaufmann et al. 2018) for \(\varDelta =0\) is basically the same as that of \(\mathrm {BAEC}[\mathrm {UCB},{\underline{\mu }},{\overline{\mu }}]\) for \(\varDelta >0\); its stopping time is optimal in the unique-best-arm case but not in the multiple-best-arms case.
Remark 8
Comparing non-dominant terms of \(\mathrm {BAEC}[\mathrm {APT}_\mathrm {P},{\underline{\mu }},{\overline{\mu }}]\) and \(\mathrm {BAEC}[\mathrm {UCB},{\underline{\mu }},{\overline{\mu }}]\), a cause for the large upper bound of the expected stopping time can be the existence of arms i whose loss mean \(\mu _i\) is close to \(\mu _1\) in \(\mathrm {BAEC}[\mathrm {UCB},{\underline{\mu }},{\overline{\mu }}]\) while it can be the existence of arms i whose loss mean \(\mu _i\) is close to \(\theta \) in \(\mathrm {BAEC}[\mathrm {APT}_\mathrm {P},{\underline{\mu }},{\overline{\mu }}]\).
8 Experiments
In this section, we report the results of our experiments that were conducted in order to demonstrate the effectiveness of our \(\varDelta \)-dependent asymmetric confidence bounds used in decision condition and arm selection policy on the stopping time.
In all the tables of experimental results, the smallest averaged stopping time in each parameter setting is shown in bold or italics; bold indicates that the difference is statistically significant.
8.1 Effectiveness of \(\varDelta \)-dependent asymmetric confidence bounds
As lower and upper confidence bounds \(\mathrm {LB}\) and \(\mathrm {UB}\), we proposed \({\underline{\mu }}\) and \({\overline{\mu }}\) based on \(\varDelta \)-dependent asymmetric bounds \(\underline{\mu }_{i}(n)\) and \(\overline{\mu }_{i}(n)\) defined by Eq. (4), instead of \({\underline{\mu }}'\) and \({\overline{\mu }}'\) based on conventional non-\(\varDelta \)-dependent symmetric bounds \(\underline{\mu }'_{i}(n)\) and \(\overline{\mu }'_{i}(n)\) defined by Eq. (3). In this subsection, we empirically compare the number of draws needed for an arm with mean \(\mu _i\) to satisfy the decision condition under each pair of bounds.
In the experiment, an i.i.d. loss sequence \(X_i(1),\ldots \) was generated according to a Bernoulli distribution with mean \(\mu _i\), and we measured the decision time \(\tau _i\), which is the smallest n that satisfies the decision condition (\(\underline{\mu }_{i}(n)\ge \theta _L\) or \(\overline{\mu }_{i}(n)< \theta _U\)). The decision times were averaged over 100 runs for each combination of parameters \(\delta =0.001, 0.01\), \(\mu _i = 0.2, 0.4, 0.6, 0.8\) and \((\theta _L,\theta _U) = (0.1,0.3), (0.3,0.5), (0.5,0.7), (0.7,0.9), (0.19,0.21), (0.39,0.41), (0.59,0.61), (0.79,0.81)\). Note that \(\varDelta =\theta _U-\theta _L=0.2\) for the first four threshold pairs and \(\varDelta =0.02\) for the last four. We used \(K=100\) so as to make the bounds asymmetric. As a result, \(\alpha =1.154,1.186\) for \(\delta =0.001, 0.01\), respectively. So, in the settings with \(\varDelta =0.2\), \(\theta \) is \((\theta _L+\theta _U)/2+0.007\) for \(\delta =0.001\) and \((\theta _L+\theta _U)/2+0.009\) for \(\delta =0.01\); since \(\theta =\theta _U-\frac{1}{1+\alpha }\varDelta \), the offset from the midpoint is proportionally smaller when \(\varDelta =0.02\).
The result is shown in Table 3. As we can see from the table, the decision condition using \(\varDelta \)-dependent asymmetric bounds makes the decision time shorter than that using conventional bounds, except in the case with \(\varDelta =0.02\) and \(\mu _i>\theta \). The effect of the proposed \(\varDelta \)-dependent asymmetric confidence bounds becomes significant when the arm is neutral or negative; notably, the decision is 1.74 to 2.08 times faster when \(\mu _i\approx \theta \). The reason why the decision condition using conventional bounds performs better for \(\varDelta =0.02\) and \(\mu _i>\theta \) is that \(\underline{\mu }'_{i}(\tau '_i)>\underline{\mu }_{i}(\tau '_i)\) occurs frequently for decision time \(\tau '_i\) using \({\underline{\mu }}'\). In fact, \(\underline{\mu }'_{i}(n)>\underline{\mu }_{i}(n)\) holds for \(n<\sqrt{N_{{\varDelta }}/2}\), and \(\sqrt{N_{{\varDelta }}/2}=246.99, 264.79\) for \(\delta =0.01, 0.001\), respectively, in the case with \(\varDelta =0.02\) and \(K=100\). The width \({\hat{\mu }}_i(n)-\underline{\mu }'_{i}(n)\) of the lower confidence interval of \(\underline{\mu }'_{i}(n)\) is 0.206 for \(\delta =0.01\) and \(n=246\), and 0.210 for \(\delta =0.001\) and \(n=264\). Thus, arm i with mean \(\mu _i\) larger than \(\theta \) by more than 0.21 is more likely to satisfy condition \(\underline{\mu }'_{i}(n_i(t))\ge \theta _L\) before satisfying condition \(\underline{\mu }_{i}(n_i(t))\ge \theta _L\). This indicates that, in the case with very small \(\varDelta \), the decision condition using conventional bounds is better for arms far from \(\theta \).
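The quoted widths 0.206 and 0.210 can be reproduced with a few lines of Python. Eq. (3) is not restated in this section, so the radius \(\sqrt{\frac{1}{2n}\ln \frac{2Kn^2}{\delta }}\) used below is an assumption, chosen because it matches both quoted values; the function names, the draw budget and the seed are likewise illustrative. The second function sketches how a decision time could be measured for a single Bernoulli arm under these symmetric bounds.

```python
import math
import random


def conventional_radius(n, k, delta):
    """Assumed radius sqrt(ln(2 K n^2 / delta) / (2 n)) of the symmetric bounds."""
    return math.sqrt(math.log(2.0 * k * n * n / delta) / (2.0 * n))


def decision_time(mu, theta_l, theta_u, k, delta, rng, max_draws=1_000_000):
    """Smallest n at which the symmetric bounds decide a Bernoulli(mu) arm."""
    losses = 0
    for n in range(1, max_draws + 1):
        losses += 1 if rng.random() < mu else 0
        mean = losses / n
        radius = conventional_radius(n, k, delta)
        # Positive decision: lower bound >= theta_L; negative: upper bound < theta_U.
        if mean - radius >= theta_l or mean + radius < theta_u:
            return n
    return None  # no decision within the (hypothetical) draw budget


print(round(conventional_radius(246, 100, 0.01), 3))
print(round(conventional_radius(264, 100, 0.001), 3))
print(decision_time(0.8, 0.1, 0.3, 100, 0.01, random.Random(0)))
```

Averaging `decision_time` over independent seeds would give the kind of per-setting average reported for the conventional bounds; the asymmetric bounds of Eq. (4) would need their own radius functions.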
8.2 Effectiveness of arm selection policy \(\mathrm {APT}_\mathrm {P}\)
8.2.1 Simulation using synthetic distribution parameters
In this experiment, we first generated distribution means \(\mu _1,\dots ,\mu _{100}\) of 100 arms, and then ran algorithm \(\mathrm {BAEC}[\mathrm {APT}_\mathrm {P},{\underline{\mu }},{\overline{\mu }}]\) simulating an arm-i draw by generating a loss according to a Bernoulli distribution with mean \(\mu _i\).
For a given natural number m and a threshold pair \((\theta _L,\theta _U)\), m distribution means were generated according to a uniform distribution over \([\theta , 1]\) and \(100-m\) distribution means were generated according to a uniform distribution over \([0,\theta )\), where \(\theta =\theta _U-\frac{1}{1+\alpha }\varDelta \).
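A small Python sketch of the single threshold \(\theta =\theta _U-\frac{1}{1+\alpha }\varDelta \) may help to see how far it sits from the midpoint of \(\theta _L\) and \(\theta _U\). The function name is hypothetical; the values of \(\alpha \) are the ones quoted in Section 8.1 for \(K=100\).

```python
def single_threshold(theta_l, theta_u, alpha):
    """theta = theta_U - Delta / (1 + alpha), with Delta = theta_U - theta_L."""
    return theta_u - (theta_u - theta_l) / (1.0 + alpha)


# Offsets of theta from the midpoint 0.2 of (theta_L, theta_U) = (0.1, 0.3).
for alpha in (1.154, 1.186):
    theta = single_threshold(0.1, 0.3, alpha)
    print(alpha, round(theta - 0.2, 4))
```

For \(\varDelta =0.2\) the offsets come out near 0.007 and 0.009, in line with the values stated in Section 8.1. Since \(\alpha >1\), \(\theta \) always lies closer to \(\theta _U\) than to \(\theta _L\), and \(\alpha =1\) would recover the midpoint.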
For each set of 100 distribution means, we also ran algorithms \(\mathrm {BAEC}[\mathrm {ASP},{\underline{\mu }},{\overline{\mu }}]\) for \(\mathrm {ASP}=\mathrm {UCB}, \mathrm {LUCB},\mathrm {TS}\) (Thompson sampling) and \(\mathrm {MS}\) (Murphy sampling; see footnote 7), in addition to \(\mathrm {ASP}=\mathrm {APT}_\mathrm {P}\), using the same i.i.d. loss sequence for the same arm, which can be realized by feeding the same seed to a random number generator for the same arm. Here, arm selection policy LUCB uses
Note that LUCB (Kalyanakrishnan et al. 2012; see footnote 8) is an algorithm for the best-k arm identification problem, and the above policy is exactly the same arm-selection policy as the original LUCB for \(k=1\).
Both TS and MS decide the arm to select at each round t based on samples \({\tilde{\mu }}_i^t\) drawn from [0, 1] according to each arm’s posterior loss-mean distribution \(\pi _i^t\) (\(i=1,\dots ,K\)). TS chooses the arm \(i \in A_t\) with \({\tilde{\mu }}_i^t=\max _j {\tilde{\mu }}_j^t\) without any condition, while MS similarly chooses the maximum-sampled-mean arm \(i \in A_t\) (see footnote 9) under the condition that \(\max _j{\tilde{\mu }}_j^t > \theta \) (see footnote 10). We used an independent uniform distribution over [0, 1] for each arm as the prior loss-mean distribution of TS and MS.
For each \(m=0, 1, 25, 50, 100\), we generated 100 sets (see footnote 11) of 100 distribution means, and ran the algorithms for each set and for each combination of parameters \(\delta =0.01, 0.001\) and \((\theta _L,\theta _U) = (0.19,0.21), (0.49,0.51), (0.79,0.81), (0.1,0.3), (0.4,0.6), (0.7,0.9)\). As for threshold pairs \((\theta _L,\theta _U)\), \(\varDelta =0.02\) for the first three and \(\varDelta =0.2\) for the last three. Stopping times were averaged over 100 runs.
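The TS and MS selection steps described above can be sketched in Python as follows. This is a minimal illustration, not the code used in the experiments: the function names, the restriction of sampling to the active arms, and the cap `max_tries` on the number of rejections are assumptions made here. With a uniform prior and Bernoulli losses, the posterior of an arm with s observed losses in n draws is the Beta distribution with parameters \(1+s\) and \(1+n-s\).

```python
import random


def sample_means(active, losses, draws, rng):
    """Draw one posterior sample per active arm from Beta(1 + s, 1 + n - s)."""
    return {i: rng.betavariate(1 + losses[i], 1 + draws[i] - losses[i])
            for i in active}


def ts_select(active, losses, draws, rng):
    """Thompson sampling: pick the active arm with the largest sampled mean."""
    samples = sample_means(active, losses, draws, rng)
    return max(samples, key=samples.get)


def ms_select(active, losses, draws, theta, rng, max_tries=10_000):
    """Murphy-style sampling: resample until the largest sample exceeds theta."""
    samples = sample_means(active, losses, draws, rng)
    for _ in range(max_tries):
        if max(samples.values()) > theta:
            break
        # Reject the whole set of samples and draw a fresh one.
        samples = sample_means(active, losses, draws, rng)
    return max(samples, key=samples.get)


# Two arms: arm 0 showed 0 losses in 50 draws, arm 1 showed 50 losses in 50 draws.
rng = random.Random(0)
print(ts_select([0, 1], [0, 50], [50, 50], rng))
print(ms_select([0, 1], [0, 50], [50, 50], 0.5, rng))
```

The rejection loop can be slow when the posterior puts little mass above \(\theta \), which is why the sketch bounds the number of retries and then falls back to the last set of samples.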
The result is shown in Table 4. In the case with large \(\varDelta (=0.2)\), the averaged stopping time for \(\mathrm {APT}_\mathrm {P}\) is the smallest for all the combinations of parameters in this experiment. In the case with small \(\varDelta (=0.02)\), \(\mathrm {BAEC}[\mathrm {APT}_\mathrm {P},{\underline{\mu }},{\overline{\mu }}]\) also stopped first, on average, for more than half of the combinations of parameters. For this small \(\varDelta \), MS, TS and LUCB also performed well to some extent; in fact, MS and TS stopped first for most of the settings with small m (\(m=1, 25\)), and LUCB’s stopping time was the shortest for about a quarter of the parameter combinations. \(\mathrm {BAEC}[\mathrm {APT}_\mathrm {P},{\underline{\mu }},{\overline{\mu }}]\) stopped first even when \(m=0\), that is, in the case that all the loss means are below \(\theta \). In such a case, some gray zone arms can be judged as positive and make the algorithm stop. \(\mathrm {BAEC}[\mathrm {APT}_\mathrm {P},{\underline{\mu }},{\overline{\mu }}]\) is considered to have found such gray zone arms faster.
8.2.2 Simulation based on real dataset
In this experiment, as loss distribution means, we used estimated ad click rates by users in the same category, calculated from the Real-Time Bidding dataset provided by iPinYou (Zhang et al. 2014). From the training dataset of the second season of the iPinYou dataset, we chose the 20 most frequently appearing user categories (sets of user profile ids) and, for each of them, calculated the click rate by the users in the category using the impression and click logs. Since the click rates are smaller than 0.001, we used the values multiplied by 100 as loss means. The loss means \(\mu _1,\dots ,\mu _{20}\) used in the experiment are as follows:
In this experiment, five threshold pairs \((\theta _L,\theta _U)=(\theta _{m'}-0.01,\theta _{m'}+0.01)\) for \(m'=0,1,5,10,19\) are used so as to let the loss means of about \(m'\) arms be at least \(\theta \), where \(\theta _0=\mu _1+\frac{\mu _1-\mu _2}{2}\) and \(\theta _{m'}=\frac{\mu _{m'}+\mu _{m'+1}}{2}\) for \(m'=1,5,10,19\). For these \((\theta _L,\theta _U)\)s, \(\theta =0.06649,0.05966,0.04168,0.03485,0.01220\) when \(\delta =0.001\), and \(\theta =0.06659,0.05976,0.04178,0.03495,0.01230\) when \(\delta =0.01\). For these \(\theta \)s, the number of arms whose loss mean is at least \(\theta \) is 0, 1, 4, 10, 19. For each combination of parameters \(\delta = 0.01, 0.001\), \((\theta _L,\theta _U)=(\theta _{m'}-0.01,\theta _{m'}+0.01)\) (\(m'=0,1,5,10,19\)), we ran algorithm \(\mathrm {BAEC}[\mathrm {ASP},{\underline{\mu }},{\overline{\mu }}]\) with three arm selection policies \(\mathrm {ASP}=\mathrm {APT}_\mathrm {P}\), \(\mathrm {LUCB}\) and \(\mathrm {UCB}\) 100 times and calculated their stopping times averaged over the 100 runs.
The result is shown in Table 5. For \(m=1\), the stopping times for \(\mathrm {APT}_\mathrm {P}\) are significantly smaller than those for the other four arm selection policies. The shortest averaged stopping time was achieved by MS and TS for \(m=4, 10\) and by LUCB for \(m=19\), though the differences from \(\mathrm {APT}_\mathrm {P}\)’s stopping times are not significant except for the stopping time of MS and TS in the case with \(\delta =0.001, m=10\). When \(m = 0\), the stopping times of the three algorithms are equal, which means that all the arms, including the unique neutral arm \(\mu _1\), were always judged as negative arms in the experiment.
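The construction of the threshold pairs from sorted loss means can be sketched as follows. The loss means in the example are hypothetical placeholders, not the values from the dataset, and the function names are chosen here for illustration. The list is assumed to be sorted in decreasing order, so that `mus[0]` plays the role of \(\mu _1\).

```python
def threshold_pair(mus, m_prime, half_width=0.01):
    """Return (theta_L, theta_U) centred on theta_{m'} for decreasing loss means."""
    if m_prime == 0:
        # theta_0 = mu_1 + (mu_1 - mu_2) / 2 lies above every loss mean.
        center = mus[0] + (mus[0] - mus[1]) / 2.0
    else:
        # theta_{m'} = (mu_{m'} + mu_{m'+1}) / 2 separates the top m' arms.
        center = (mus[m_prime - 1] + mus[m_prime]) / 2.0
    return center - half_width, center + half_width


def count_at_least(mus, theta):
    """Number of arms whose loss mean is at least theta."""
    return sum(1 for mu in mus if mu >= theta)


# Hypothetical loss means in decreasing order (not the iPinYou values).
mus = [0.06, 0.05, 0.03, 0.01]
for m_prime in (0, 1, 3):
    theta_l, theta_u = threshold_pair(mus, m_prime)
    center = (theta_l + theta_u) / 2.0
    print(m_prime, round(theta_l, 4), round(theta_u, 4), count_at_least(mus, center))
```

The count here uses the center \(\theta _{m'}\); because the single threshold \(\theta \) is shifted slightly toward \(\theta _U\), the number of arms with mean at least \(\theta \) can differ from \(m'\), as in the setting above where \(m'=5\) yields 4 such arms.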
9 Conclusions
We theoretically and empirically studied the sample complexity of a bad arm existence checking problem (BAEC problem), whose objective is to judge whether some arms are bad (having loss mean at least \(\theta _U\)) or all the arms are good (having loss mean less than \(\theta _L\)) correctly with probability at least \(1-\delta \) for given thresholds \(0<\theta _L\le \theta _U<1\) and a given acceptable error rate \(0<\delta <1/2\). In the case with \(\varDelta =\theta _U-\theta _L>0\), we proposed algorithm \(\mathrm {BAEC}[\mathrm {APT}_\mathrm {P},{\underline{\mu }},{\overline{\mu }}]\) that utilizes the asymmetry of positive and negative arms’ roles in this problem; the algorithm combines a decision condition for each arm i with the current number of draws n using \(\varDelta \)-dependent asymmetric confidence bounds \(\underline{\mu }_{i}(n)\) and \(\overline{\mu }_{i}(n)\), and arm selection policy \(\mathrm {APT}_\mathrm {P}\) that uses a single threshold \(\theta \) closer to \(\theta _U\) instead of the center between \(\theta _L\) and \(\theta _U\). Effectiveness of our decision condition was shown empirically and theoretically. Algorithm \(\mathrm {BAEC}[\mathrm {APT}_\mathrm {P},{\underline{\mu }},{\overline{\mu }}]\) stopped faster than, or comparably fast to, algorithms \(\mathrm {BAEC}[\mathrm {ASP},{\underline{\mu }},{\overline{\mu }}]\) for \(\mathrm {ASP}=\mathrm {LUCB}, \mathrm {UCB}, \mathrm {TS}\) (Thompson Sampling) and \(\mathrm {MS}\) (Murphy Sampling) in almost all of our simulations. We also showed an asymptotic upper bound of the expected stopping time for \(\mathrm {BAEC}[\mathrm {APT}_\mathrm {P},{\underline{\mu }},{\overline{\mu }}]\) which is smaller than that for \(\mathrm {BAEC}[\mathrm {UCB},{\underline{\mu }},{\overline{\mu }}]\) in the case that there are multiple positive arms and all the positive arms have the same loss means.
Current theoretical support for our arm selection policy \(\mathrm {APT}_\mathrm {P}\) is very limited, and further theoretical analysis that explains its empirically observed small stopping times is our future work.
Notes
Thresholds \(\theta _L\) and \(\theta _U\) correspond to \(\theta -\epsilon \) and \(\theta +\epsilon \), respectively, in the thresholding bandit problem (Locatelli et al. 2016) with one threshold \(\theta \) and precision \(\epsilon \), but we use the two thresholds because they are more convenient for our asymmetric problem structure.
The original lemma treats the problem to decide whether the lowest mean is less than a given one threshold for one-parameter canonical exponential family of K distributions.
The lower bound on the stopping time under the decision of no more positive arm is not analyzed in Kano et al. (2017), and the stopping time in the case with no positive arm is the time of its special case. In good arm identification, the algorithm must stop without falsely identifying any arm as positive in such case with probability at least \(1-\delta \), so its task is the same as our bad arm existence checking problem in the case with no positive arm.
Precisely speaking, \({\hat{\mu }}_i(n)\pm \sqrt{\frac{1}{2n}\ln \frac{4Kn^2}{\delta }}\) is used in successive elimination algorithms for best arm identification problem. A narrower confidence interval is enough to judge whether expected loss is larger than a fixed threshold.
This is not completely the same algorithm as HDoC because, in the HDoC’s decision condition, bounds \({\hat{\mu }}_{i}(n_i(t))\pm \sqrt{\frac{1}{2n_i(t)} \ln {\frac{4Kn_i(t)^2}{\delta }}}\) are used.
Note that \(\mathrm {BAEC}[\mathrm {MS},{\underline{\mu }},{\overline{\mu }}]\) is an elimination algorithm though original Murphy sampling does not eliminate arms.
LUCB means that both of LCB (lower confidence bound) and UCB (upper confidence bound) are used in the algorithm. In fact, it chooses the arm i with the smallest LCB among the arms with the largest m sample means when \(m\ge 2\).
The original Murphy sampling is an algorithm for checking the existence of negative arms and the procedure of MS here is completely opposite to the original one.
This conditioned sampling is realized by rejecting a condition-unsatisfied set of samples and drawing another one repeatedly until a condition-satisfied set of samples is drawn.
Note that the results shown in Table 4 are the averaged decision times not for a specific set of Bernoulli distributions but for 100 sets of Bernoulli distributions with means generated from certain uniform distributions. So, the decision times obtained in this experiment are not a direct experimental evaluation of the theoretically analyzed decision times.
An arm is judged as positive when both the positive and negative decision conditions are satisfied simultaneously.
References
Audibert, J., Bubeck, S., & Munos, R. (2010). Best arm identification in multi-armed bandits. In Proceedings of the 23rd conference on learning theory (pp. 41–53).
Auer, P., Cesa-Bianchi, N., & Fischer, P. (2002). Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2–3), 235–256.
Auer, P., Cesa-Bianchi, N., Freund, Y., & Schapire, R. E. (2003). The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1), 48–77.
Bubeck, S., Munos, R., & Stoltz, G. (2011). Pure exploration in finitely-armed and continuous-armed bandits. Theoretical Computer Science, 412(19), 1832–1852.
Bubeck, S., Wang, T., & Viswanathan, N. (2013). Multiple identifications in multi-armed bandits. In Proceedings of the 30th international conference on machine learning, proceedings of machine learning research, vol 28 (pp. 258–265).
Even-Dar, E., Mannor, S., & Mansour, Y. (2006). Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. Journal of Machine Learning Research, 7, 1079–1105.
Gabillon, V., Ghavamzadeh, M., & Lazaric, A. (2012). Best arm identification: A unified approach to fixed budget and fixed confidence. Advances in Neural Information Processing Systems, 25, 3212–3220.
Haka, A. S., Volynskaya, Z., Gardecki, J. J. A., Nazemi, J., Shenk, R., Wang, N., et al. (2009). Diagnosing breast cancer using Raman spectroscopy: Prospective analysis. Journal of Biomedical Optics, 14(5), 054023.
Kalyanakrishnan, S., Tewari, A., Auer, P., & Stone, P. (2012). PAC subset selection in stochastic multi-armed bandits. In Proceedings of the 29th international conference on machine learning (pp. 655–662).
Kano, H., Honda, J., Sakamaki, K., Matsuura, K., Nakamura, A., & Sugiyama, M. (2017). Good arm identification via bandit feedback. arXiv e-prints arXiv:1710.06360.
Kaufmann, E., Cappé, O., & Garivier, A. (2016). On the complexity of best-arm identification in multi-armed bandit models. Journal of Machine Learning Research, 17(1), 1–42.
Kaufmann, E., & Kalyanakrishnan, S. (2013). Information complexity in bandit subset selection. In Proceedings of the 26th annual conference on learning theory, proceedings of machine learning research, vol 30 (pp. 228–251).
Kaufmann, E., Koolen, W. M., & Garivier, A. (2018). Sequential test for the lowest mean: From Thompson to Murphy sampling. In Proceedings of the 32nd conference on neural information processing systems (pp. 6333–6343).
Littlestone, N., & Warmuth, M. K. (1994). The weighted majority algorithm. Information and Computation, 108(2), 212–261.
Locatelli, A., Gutzeit, M., & Carpentier, A. (2016). An optimal algorithm for the thresholding bandit problem. In Proceedings of the 33rd international conference on machine learning, vol PMLR 48 (pp. 1690–1698).
Robbins, H. (1952). Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 58(5), 527–535.
Thompson, W. R. (1933). On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4), 285–294.
Zhang, W., Yuan, S., Wang, J., & Shen, X. (2014). Real-time bidding benchmarking with iPinYou dataset. arXiv e-prints. arXiv:1407.7073.
Acknowledgements
This work was partially supported by JST CREST Grant Numbers JPMJCR1662 and JPMJCR18K3, JSPS KAKENHI Grant Numbers JP18H05413 and JP19H04161.
Additional information
Editors: Po-Ling Loh, Evimaria Terzi, Antti Ukkonen, Karsten Borgwardt.
Appendices
Proof of Theorem 1
We use the following lemma to prove our lower bound on the number of samples needed for a \((\theta _L,\theta _U,\delta )\)-BAEC algorithm.
Lemma 3
(Kaufmann et al. 2016) Let \({\varvec{\nu }}\) and \({\varvec{\nu }}'\) be two loss distribution sets of K arms such that distributions \(\nu _i\) and \(\nu '_i\) are mutually absolutely continuous for \(i=1,\dots ,K\). For any almost-surely finite stopping time T and any event \({\mathcal {E}}\), the following inequality holds.
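For reference, this change-of-measure inequality is commonly stated in the following form. The notation is an assumption made here rather than taken from the present paper: \(N_i(T)\) denotes the number of draws of arm i up to time T, \(\mathrm {KL}(\nu _i,\nu '_i)\) the Kullback-Leibler divergence between the two loss distributions of arm i, and \(d(x,y)\) the Kullback-Leibler divergence between Bernoulli distributions with means x and y.

\[
\sum _{i=1}^{K}\mathbb {E}_{\varvec{\nu }}\left[ N_i(T)\right] \mathrm {KL}(\nu _i,\nu '_i)\ge d\left( \mathbb {P}_{\varvec{\nu }}(\mathcal {E}),\mathbb {P}_{\varvec{\nu }'}(\mathcal {E})\right)
\]

In words, the information accumulated from all the arm draws must be large enough to account for the difference between the probabilities of \(\mathcal {E}\) under the two distribution sets.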
Proof of Theorem 1
Consider a set \({\varvec{\nu }}\) of Bernoulli distributions \(\nu _i\) with mean \(\mu _i\) for which some positive arms exist, that is, the case with \(\mu _1\ge \theta _U\). Let k be the number of arms i with \(\mu _i\ge \theta _L\) in \(\{\nu _i\}\), that means \(\mu _1\ge \cdots \ge \mu _k\ge \theta _L > \mu _{k+1}\ge \cdots \ge \mu _K\). For an arbitrary fixed \(\epsilon >0\), let \(\{\nu '_i\}\) be the set of Bernoulli distributions with means \(\mu '_i\) defined as
For any \((\theta _L,\theta _U,\delta )\)-BAEC algorithm, \(\mathcal {E}_\mathrm {POS}\) denotes the event that its output is “positive”. Since some positive arms exist for the distribution set \({\varvec{\nu }}\), the probability that the event \(\mathcal {E}_\mathrm {POS}\) occurs must be at least \(1-\delta \) by Definition 1, that is, inequality \({\mathbb {P}}_{\varvec{\nu }}(\mathcal {E}_\mathrm {POS})\ge 1-\delta \) holds. All the arms are negative in the distribution set \({\varvec{\nu }'}=\{\nu '_i\}\), likewise by Definition 1, inequality \({\mathbb {P}}_{\varvec{\nu }'}(\mathcal {E}_\mathrm {POS})<\delta \) holds. Thus,
holds. From the fact that \(\max _{i\in \{1,\dots ,k\}}d(\mu _i,\theta _L-\epsilon )=d(\mu _1,\theta _L-\epsilon )\),
holds, which leads to Ineq. (1) by considering its limit as \(\epsilon \rightarrow +0\).
Next, consider a set \({\varvec{\nu }}\) of Bernoulli distributions \(\nu _i\) with mean \(\mu _i\) for which all the arms are negative, that is, the case with \(\mu _1<\theta _L\). Fix \(j\in \{1,\dots ,K\}\) arbitrarily. For arbitrary \(\epsilon >0\), let \({\varvec{\nu }'}\) be a set of Bernoulli distributions \(\nu '_i\) with mean \(\mu '_i\) defined as
For any \((\theta _L,\theta _U,\delta )\)-BAEC algorithm, \(\mathcal {E}_\mathrm {NEG}\) denotes the event that its output is “negative”. Then, inequalities \({\mathbb {P}}_{\varvec{\nu }}(\mathcal {E}_\mathrm {NEG})\ge 1-\delta \) and \({\mathbb {P}}_{\varvec{\nu }'}(\mathcal {E}_\mathrm {NEG})<\delta \) hold by Definition 1 because all the arms are negative in \(\varvec{\nu }\) and arm j is positive in \(\varvec{\nu }'\). Thus, by Lemma 3,
holds, that is, for each \(j=1,\dots ,K\),
holds. This leads to Ineq. (2) by considering its limit as \(\epsilon \rightarrow +0\) and the summation over \(j=1,\dots ,K\). \(\square \)
Proof of Theorem 2
We prove Theorem 2 using the following proposition.
Proposition 2
For any \(x > 0\), the following inequality holds:
Proof
Since
holds,
and
hold for \(x>0\). \(\square \)
Proof of Theorem 2
We prove this theorem by contradiction. Assume that \({\overline{\mu }}_i(T_{{\varDelta }})\ge \theta _U\) and \(\theta _L > {\underline{\mu }}_i(T_{{\varDelta }})\). Then,
holds. On the other hand,
holds, which contradicts Ineq. (10). \(\square \)
Proof of Theorem 3
If \(\overline{\mu }'_{i}(T'_{{\varDelta }})-\underline{\mu }'_{i}(T'_{{\varDelta }})>\varDelta \) holds, then \(\overline{\mu }'_{i}(n)-\underline{\mu }'_{i}(n)>\varDelta \) holds for \(n=1,\dots ,T'_{{\varDelta }}\). In this case, \(\underline{\mu }'_{i}(n)<\theta _L\) and \(\overline{\mu }'_{i}(n)\ge \theta _U\) hold for \(n=1,\dots ,T'_{{\varDelta }}\) when \(\theta _U- (\overline{\mu }'_{i}(n)-\underline{\mu }'_{i}(n))/2\le {\hat{\mu }}_i(n)<\theta _L+ (\overline{\mu }'_{i}(n)-\underline{\mu }'_{i}(n))/2\), which means \(\tau '_i>T'_{{\varDelta }}\). In fact, Inequality \(\overline{\mu }'_{i}(T'_{{\varDelta }})-\underline{\mu }'_{i}(T'_{{\varDelta }})>\varDelta \) holds because
The difference between the worst case stopping times \(\tau '_i-\tau _i\) is lower-bounded as
Theorem referred to in Remark 4
Define \(\overline{\mu }''_{i}(n)\) as
Then, the following theorem holds.
Theorem 9
Consider algorithm \(\mathrm {BAEC}[*,{\underline{\mu }}',{\overline{\mu }}'']\) and define \(\tau ''_i=\min \{n\mid \underline{\mu }'_{i}(n)\ge \theta _L \text { or } \overline{\mu }''_{i}(n)<\theta _U\}\) for \(i=1,\dots ,K\). Then, event \(\tau ''_i > T''_{{\varDelta }}\) can happen for \(i=1,\dots ,K\), where \(T''_{{\varDelta }}\) is defined as \(T''_{{\varDelta }}=\lfloor \frac{2}{\varDelta ^2}\ln \frac{366K^{1/4}}{\varDelta ^4\delta }\rfloor \). Furthermore, the difference between the worst case stopping times \(\tau ''_i-\tau _i\) is lower-bounded as
Proof
If \(\overline{\mu }''_{i}(T''_{{\varDelta }})-\underline{\mu }'_{i}(T''_{{\varDelta }})>\varDelta \) holds, then \(\overline{\mu }''_{i}(n)-\underline{\mu }'_{i}(n)>\varDelta \) holds for \(n=1,\dots ,T''_{{\varDelta }}\). In this case, \(\underline{\mu }'_{i}(n)<\theta _L\) and \(\overline{\mu }''_{i}(n)\ge \theta _U\) hold for \(n=1,\dots ,T''_{{\varDelta }}\) when \(\theta _U- (\overline{\mu }''_{i}(n)-\underline{\mu }'_{i}(n))/2\le {\hat{\mu }}_i(n)<\theta _L+ (\overline{\mu }''_{i}(n)-\underline{\mu }'_{i}(n))/2\), which means \(\tau ''_i>T''_{{\varDelta }}\). In fact, Inequality \(\overline{\mu }''_{i}(T''_{{\varDelta }})-\underline{\mu }'_{i}(T''_{{\varDelta }})>\varDelta \) holds because
The difference between the worst case stopping times \(\tau ''_i-\tau _i\) is lower-bounded as
\(\square \)
Proof of Proposition 1
The following proposition is needed to prove Proposition 1.
Proposition 3
For \(0<a<1\), any \(t \ge \frac{\mathrm {e}}{(\mathrm {e}-1)a} \ln \frac{1}{a}\) satisfies the inequality \(a t -\ln t \ge 0\).
Proof
For \(0<a<1\), let \(f(t) = a t - \ln {t}\). When \(a > \frac{1}{\mathrm {e}}\), f(t) is positive for any \(t > 0\) since f(t) attains its minimum value \(1 - \ln {\frac{1}{a}}>0\) at \(t = \frac{1}{a}\).
When \(a \le \frac{1}{\mathrm {e}}\), if \(t = \frac{\mathrm {e}}{(\mathrm {e}-1)a} \ln \frac{1}{a}\),
holds because \(y=\frac{1}{\mathrm {e}- 1}x - \ln \frac{\mathrm {e}}{\mathrm {e}-1}\) is the tangent line to \(y=\ln x\) at \(x=\mathrm {e}-1\) and \(\ln x\) is concave, so the line lies on or above the curve. If \(t> \frac{\mathrm {e}}{(\mathrm {e}-1)a} \ln \frac{1}{a} \left( \ge \frac{\mathrm {e}}{(\mathrm {e}-1)a} > \frac{1}{a}\right) \), \(\frac{d f(t)}{d t} = a - \frac{1}{t}\) is positive. Therefore, for \(t \ge \frac{\mathrm {e}}{(\mathrm {e}-1)a} \ln \frac{1}{a}\), \(a t -\ln t \ge 0\). \(\square \)
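As a numerical sanity check of Proposition 3 (it is not part of the proof), the following Python sketch evaluates \(f(t)=at-\ln t\) at the threshold \(t_0=\frac{\mathrm{e}}{(\mathrm{e}-1)a}\ln \frac{1}{a}\) and at several larger values of t. The function names and the sampled multipliers are illustrative choices, not taken from the paper.

```python
import math

def threshold(a: float) -> float:
    """t_0 = e / ((e - 1) a) * ln(1 / a), the threshold in Proposition 3."""
    return math.e / ((math.e - 1.0) * a) * math.log(1.0 / a)

def f(a: float, t: float) -> float:
    """f(t) = a t - ln t, claimed to be nonnegative for every t >= t_0."""
    return a * t - math.log(t)

def check(a: float, factors=(1.0, 1.5, 2.0, 10.0, 100.0)) -> bool:
    """Evaluate f at t_0 and at a few illustrative multiples of t_0."""
    t0 = threshold(a)
    return all(f(a, c * t0) >= 0.0 for c in factors)

if __name__ == "__main__":
    for a in (0.9, 0.5, 0.1, 0.01, 1e-4, 1e-8):
        print(a, threshold(a), f(a, threshold(a)), check(a))
```

The value \(a=\mathrm{e}^{-(\mathrm{e}-1)}\) is the point where the tangent-line inequality is tight, so \(f(t_0)\) is exactly zero there and a floating-point evaluation at that value may come out marginally below zero.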
Proof of Proposition 1
Set a to \(\frac{\varDelta ^2\delta }{2\sqrt{K}}\) in Proposition 3. Its condition on t is then satisfied by \(t = \frac{\sqrt{K}N_{{\varDelta }}}{\delta } \ge \frac{2\mathrm {e}\sqrt{K}}{(\mathrm {e}-1)\varDelta ^2\delta } \ln \frac{2\sqrt{K}}{\varDelta ^2\delta }\), which yields the following inequality,
Thus,
holds, and so
holds. \(\square \)
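For readers following the substitution, it can be written out as below. This only restates the step in the notation of Proposition 3, and it assumes \(0<a<1\), which holds for instance when \(\varDelta \le 1\), \(0<\delta <1\) and \(K\ge 1\).

\[
a=\frac{\varDelta^2\delta}{2\sqrt{K}},\quad t=\frac{\sqrt{K}N_{\varDelta}}{\delta}
\;\Longrightarrow\;
at=\frac{\varDelta^2 N_{\varDelta}}{2},
\qquad
at-\ln t\ge 0
\;\Longleftrightarrow\;
N_{\varDelta}\ge \frac{2}{\varDelta^2}\ln \frac{\sqrt{K}N_{\varDelta}}{\delta}.
\]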
Proof of Theorem 5
Consider the case that \(\mu _1\ge \theta _U\) and event \(\mathcal {E}^+\) occurs. In this case, \(\bigcap _{n=1}^{T_{{\varDelta }}}\{\overline{\mu }_{i}(n)\ge \mu _i\}\) holds for some i with \(\mu _i\ge \theta _U\). Assume \(T_{{\varDelta _i}}<\tau _i\) for this i. Then, \(\overline{\mu }_{i}(T_{{\varDelta _i}})\ge \mu _i\ge \theta _U\) and \(\underline{\mu }_{i}(T_{{\varDelta _i}})< \theta _L\) hold. However,
holds, which contradicts the fact that \(\underline{\mu }_{i}(T_{{\varDelta _i}})< \theta _L\). Thus, \(\tau _i\le T_{{\varDelta _i}}\) holds for at least one positive arm i with probability \(\mathbb {P}\{\mathcal {E}^+\}\) which is at least \(1-\delta \) by Lemma 1.
Consider the case that \(\mu _1< \theta _L\) holds and event \(\mathcal {E}^-\) occurs. Assume \(T_{{\varDelta _i}}<\tau _i\) for \(i=1,\dots ,K\). Then, \(\overline{\mu }_{i}(T_{{\varDelta _i}})\ge \theta _U\) and \(\underline{\mu }_{i}(T_{{\varDelta _i}})< \mu _i <\theta _L\) hold. However,
holds, which contradicts the fact that \(\overline{\mu }_{i}(T_{{\varDelta _i}})\ge \theta _U\). Thus, \(\tau _i\le T_{{\varDelta _i}}\) holds for all arms i with probability \(\mathbb {P}\{\mathcal {E}^-\}\) which is at least \(1-\delta \) by Lemma 1.
Proof of Lemma 2
Let \(\epsilon \) be an arbitrary real that satisfies \(0<\epsilon <\varDelta /2(1+\alpha )\).
Consider the case with \(\mu _i\ge \theta \). Define \(n_i\) as \(n_i=\frac{1}{2(\varDelta _i-\epsilon )^2}\ln \frac{KN_{{\varDelta }}}{\delta }\). Then,
holds. Since \(\frac{1}{\varDelta _i^2}+\frac{6\epsilon }{\varDelta _i^3}- \frac{1}{(\varDelta _i-\epsilon )^2}=\frac{\epsilon (\varDelta _i-2\epsilon )(4\varDelta _i-3\epsilon )}{\varDelta _i^3(\varDelta _i-\epsilon )^2}\ge 0\) holds for \(0<\epsilon \le \frac{\varDelta _i}{2}\), \(\frac{1}{(\varDelta _i-\epsilon )^2}\le \frac{1}{\varDelta _i^2}+\frac{6\epsilon }{\varDelta _i^3}\) holds for \(0<\epsilon <\varDelta /2(1+\alpha )\le \varDelta _i/2\). Thus, Ineq. (8) can be obtained by setting \(\epsilon \) to \(O((\ln \frac{KN_{{\varDelta }}}{\delta })^{-1/3})\).
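The factorization used above can be checked by expanding the numerator over the common denominator \(\varDelta_i^3(\varDelta_i-\epsilon)^2\):

\[
(\varDelta_i+6\epsilon)(\varDelta_i-\epsilon)^2-\varDelta_i^3
=4\varDelta_i^2\epsilon-11\varDelta_i\epsilon^2+6\epsilon^3
=\epsilon(\varDelta_i-2\epsilon)(4\varDelta_i-3\epsilon).
\]

Both factors \(\varDelta_i-2\epsilon\) and \(4\varDelta_i-3\epsilon\) are nonnegative for \(0<\epsilon\le \varDelta_i/2\), which gives the stated sign.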
Next, consider the case with \(\mu _i<\theta \). Define \(n_i\) as \(n_i=\frac{1}{2(\varDelta _i-\epsilon )^2}\ln \frac{N_{{\varDelta }}}{\delta }\). Then,
holds. Similar calculation leads to Inequality (9).
Proof of Theorem 7
Define \(\mathrm {apt}_\mathrm {P}(n,i)\) as \(\mathrm {apt}_\mathrm {P}(n,i)=\sqrt{n}({\hat{\mu }}_i(n)-\theta )\) for convenience. Note that \(\mathrm {APT}_\mathrm {P}(t,i)=\mathrm {apt}_\mathrm {P}(n_i(t),i)\). Random variables \(Y_i\) and \(N_i(a)\) are defined as
To obtain an upper bound on the expected stopping time \(\mathbb {E}[T]\) for algorithm \(\mathrm {BAEC}[\mathrm {APT}_\mathrm {P},{\underline{\mu }},{\overline{\mu }}]\), we consider the case that, for some arm i with \(\mu _i\ge \theta \), arm i is the first arm that satisfies the decision condition and \({\underline{\mu }}_i(\tau _i)\ge \theta _L\), that is, the case that event \(\{{\hat{i}}_1=i, \mathcal {E}_i^\mathrm {POS}\}\) occurs for \(i\le m\). In the case with no such arm i, stopping time T is upper-bounded by the worst case bound \(KT_{{\varDelta }}\) (Theorem 4), and the probability of this case can be proved to vanish as \(\delta \rightarrow +0\) fast enough relative to the growth of \(KT_{{\varDelta }}\) (for the case with \({\hat{i}}_1=i\ge m+1\) by Lemma 13 and for the case that event \(\overline{\mathcal {E}_i^\mathrm {POS}}\) occurs for \(i\le m\) by Lemma 14), so its contribution can be ignored asymptotically as \(\delta \rightarrow +0\). An upper bound on \(\mathbb {E}[T\mathbb {1}\{{\hat{i}}_1=i, \mathcal {E}_i^\mathrm {POS}\}]\) for arm i with \(\mu _i\ge \theta \) is proved in Lemma 10. When event \(\{{\hat{i}}_1=i, \mathcal {E}_i^\mathrm {POS}\}\) occurs for arm i with \(\mu _i\ge \theta \), the number of draws is \(\tau _i\) for arm i, at most \(N_j(Y_i)\) for each arm \(j\ne i\) if \(Y_i\le 0\), and at most \(N_j(0)\) for each arm \(j\ne i\) if \(Y_i>0\). So, to prove the upper bound in Lemma 10, we upper-bound \(\mathbb {E}[\tau _i\mathbb {1}\{{\hat{i}}_1=i, \mathcal {E}_i^\mathrm {POS}\}]\) by Lemma 2, \(\mathbb {E}[N_j(Y_i)\mathbb {1}\{Y_i\le 0,{\hat{i}}_1=i, \mathcal {E}_i^\mathrm {POS}\}]\) for \(j\ne i\) by Lemmas 5 and 8, and \(\mathbb {E}[N_j(0)\mathbb {1}\{Y_i>0,{\hat{i}}_1=i, \mathcal {E}_i^\mathrm {POS}\}]\) for \(j\ne i\) by Lemma 9.
Lemma 4
\(\mathrm {BAEC}[\mathrm {APT}_\mathrm {P},{\underline{\mu }},{\overline{\mu }}]\) satisfies
Proof
\(\square \)
Lemma 5
\(\mathrm {BAEC}[\mathrm {APT}_\mathrm {P},{\underline{\mu }},{\overline{\mu }}]\) satisfies
for \(i=1,\dots ,K\) and \(j\le m \ (i\ne j)\).
Proof
Define \({\mathbb {F}}_i(a)\) as \({\mathbb {F}}_i(a)=\mathbb {P}[Y_i\le a]\). Then,
holds.
\(\square \)
Lemma 6
\(\mathrm {BAEC}[\mathrm {APT}_\mathrm {P},{\underline{\mu }},{\overline{\mu }}]\) satisfies
for \(j\ge m+1\) and \(a\le 0\).
Proof
Define \(n_0\) as \(n_0=\frac{4a^2}{\underline{\varDelta }_j^2}\). Note that \(\underline{\varDelta }_j+\frac{a}{\sqrt{n}}> \frac{\underline{\varDelta }_j}{2}\) for \(n> n_0\). Then,
\(\square \)
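The note on \(n_0\) in the proof above follows from a short computation, assuming \(\underline{\varDelta}_j>0\) and using \(a\le 0\):

\[
n>n_0=\frac{4a^2}{\underline{\varDelta}_j^2}
\;\Longrightarrow\;
\sqrt{n}>\frac{2|a|}{\underline{\varDelta}_j}
\;\Longrightarrow\;
\frac{a}{\sqrt{n}}=-\frac{|a|}{\sqrt{n}}>-\frac{\underline{\varDelta}_j}{2}
\;\Longrightarrow\;
\underline{\varDelta}_j+\frac{a}{\sqrt{n}}>\frac{\underline{\varDelta}_j}{2}.
\]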
Lemma 7
\(\mathrm {BAEC}[\mathrm {APT}_\mathrm {P},{\underline{\mu }},{\overline{\mu }}]\) satisfies
Proof
\(\square \)
Lemma 8
For \(i\le m\) and \(j\ge m+1\), \(\mathrm {BAEC}[\mathrm {APT}_\mathrm {P},{\underline{\mu }},{\overline{\mu }}]\) satisfies
Proof
Define \({\mathbb {F}}_i(a)\) as \({\mathbb {F}}_i(a)=\mathbb {P}[Y_i\le a]\). Then,
\(\square \)
Lemma 9
For \(i\le m\), \(\mathrm {BAEC}[\mathrm {APT}_\mathrm {P},{\underline{\mu }},{\overline{\mu }}]\) satisfies
Proof
\(\square \)
Lemma 10
For \(i\le m\) and any event \(\mathcal {E}\), \(\mathrm {BAEC}[\mathrm {APT}_\mathrm {P},{\underline{\mu }},{\overline{\mu }}]\) satisfies
Proof
In the case that the decision condition is satisfied first by one of the arms i with \(\mu _i\ge \theta \) (\(i\le m\)), that is, \({\hat{i}}_1=i\), the stopping time T is at most \(\tau _i+\sum _{j\ne i}N_j(Y_i)\) if \(Y_i\le 0\) and at most \(\tau _i+\sum _{j\ne i}N_j(0)\) if \(Y_i>0\). Thus, for \(i\le m\),
holds.
\(\square \)
Define \(n_{{\varDelta ,\delta }}\) as \(n_{{\varDelta ,\delta }}=\left\lceil \frac{1}{2(\max \{\theta _U,1-\theta _L\})^2}\ln \frac{N_{{\varDelta }}}{\delta }\right\rceil \). Then, \(\tau _i\) for any arm \(i=1,\dots ,K\) is bounded by \(n_{{\varDelta ,\delta }}\) from below.
Lemma 11
In algorithm \(\mathrm {BAEC}[*,{\underline{\mu }},{\overline{\mu }}]\), \(\tau _i\ge n_{{\varDelta ,\delta }}\) holds for any arm \(i=1,\dots ,K\).
Proof
By the definition of \(\tau _i\), \(\overline{\mu }_{i}(\tau _i)<\theta _U\) or \(\underline{\mu }_{i}(\tau _i)\ge \theta _L\) must be satisfied for any arm i. In the case with \(\overline{\mu }_{i}(\tau _i)<\theta _U\),
holds. Since \({\hat{\mu }}_i(\tau _i)\ge 0\),
holds. So, we obtain
In the case with \(\underline{\mu }_{i}(\tau _i)\ge \theta _L\),
holds. Since \({\hat{\mu }}_i(\tau _i)\le 1\),
holds. So, we obtain
Therefore,
holds. Since \(\tau _i\) is a natural number,
holds. \(\square \)
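The lower bound of Lemma 11 can be illustrated numerically. The sketch below is hedged: the Hoeffding-style forms of the confidence bounds, \(\overline{\mu}_i(n)={\hat{\mu}}_i(n)+\sqrt{\frac{1}{2n}\ln \frac{N_{\varDelta}}{\delta}}\) and \(\underline{\mu}_i(n)={\hat{\mu}}_i(n)-\sqrt{\frac{1}{2n}\ln \frac{KN_{\varDelta}}{\delta}}\), are assumptions made here for illustration (their definitions are not restated in this appendix), \(N_{\varDelta}\) is passed in as a plain number, and the function names are hypothetical.

```python
import math

def n_lower(theta_l: float, theta_u: float, n_delta: float, delta: float) -> int:
    """n_{Delta,delta} = ceil( ln(N_Delta/delta) / (2 max{theta_U, 1-theta_L}^2) )."""
    m = max(theta_u, 1.0 - theta_l)
    return math.ceil(math.log(n_delta / delta) / (2.0 * m * m))

def earliest_decision(theta_l: float, theta_u: float,
                      n_delta: float, delta: float, k: int) -> int:
    """Smallest n at which either decision could fire under the ASSUMED bounds.

    The most favourable empirical means are used: 0 for a negative decision
    (upper bound < theta_U) and 1 for a positive one (lower bound >= theta_L).
    """
    n = 1
    while True:
        r_up = math.sqrt(math.log(n_delta / delta) / (2.0 * n))      # assumed upper radius
        r_lo = math.sqrt(math.log(k * n_delta / delta) / (2.0 * n))  # assumed lower radius
        if 0.0 + r_up < theta_u or 1.0 - r_lo >= theta_l:
            return n
        n += 1

if __name__ == "__main__":
    args = (0.4, 0.6, 1000.0, 0.05)  # illustrative theta_L, theta_U, N_Delta, delta
    print(n_lower(*args), earliest_decision(*args, k=10))
```

Under these assumptions, no arm can satisfy the decision condition before \(n_{\varDelta ,\delta}\) draws even with the most extreme empirical means, which is the content of the lemma.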
Lemma 12
\(\mathrm {BAEC}[\mathrm {APT}_\mathrm {P},{\underline{\mu }},{\overline{\mu }}]\) satisfies
for \(i\ge m+1\).
Proof
\(\square \)
Lemma 13
For \(m\ge 1\) and \(i\ge m+1\), \(\mathrm {BAEC}[\mathrm {APT}_\mathrm {P},{\underline{\mu }},{\overline{\mu }}]\) satisfies
Proof
Define \({\mathbb {F}}_i(a)\) as \({\mathbb {F}}_i(a)=\mathbb {P}[Y_i\ge a]\). Then,
The second term is bounded as
Thus, by Ineqs. (12), (13) and Lemma 12,
holds. \(\square \)
Lemma 14
For the complementary event \(\overline{\mathcal {E}_i^\mathrm {POS}}\) of event \(\mathcal {E}_i^\mathrm {POS}\), inequality
holds when \(i\le m\).
Proof
In the case with \({\hat{\mu }}_i(\tau _i)\ge \theta \), arm i is judged as positive because \(\underline{\mu }_{i}(\tau _i)\ge \theta _L\) holds whenever \(\overline{\mu }_{i}(\tau _i)<\theta _U\) holds (Footnote 12). This is because \(\theta _U-\theta :\theta -\theta _L=\overline{\mu }_{i}(\tau _i)-{\hat{\mu }}_i(\tau _i):{\hat{\mu }}_i(\tau _i)-\underline{\mu }_{i}(\tau _i)=1:\alpha \) holds. Thus,
holds. \(\square \)
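The ratio argument in the proof above can be spelled out as follows. Write \(r=\overline{\mu}_i(\tau_i)-{\hat{\mu}}_i(\tau_i)\) for the upper confidence radius (the symbol r is introduced here only for illustration), so that the lower radius is \(\alpha r\) and \(\theta-\theta_L=\alpha(\theta_U-\theta)\). Then, for \(\alpha>0\),

\[
{\hat{\mu}}_i(\tau_i)\ge \theta,\ \ \overline{\mu}_i(\tau_i)<\theta_U
\;\Longrightarrow\;
r<\theta_U-{\hat{\mu}}_i(\tau_i)\le \theta_U-\theta
\;\Longrightarrow\;
\underline{\mu}_i(\tau_i)={\hat{\mu}}_i(\tau_i)-\alpha r>\theta-\alpha(\theta_U-\theta)=\theta_L .
\]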
Proof of Theorem 7
\(\square \)
Proof of Theorem 8
We consider event \(\bigcup _{i:\mu _i=\mu _1}\mathcal {E}_i^\mathrm {POS}\), that is, the event that one of the best arms i is judged as positive. In the case that event \(\bigcup _{i:\mu _i=\mu _1}\mathcal {E}_i^\mathrm {POS}\) does not occur, stopping time T is upper-bounded by the worst case bound \(KT_{{\varDelta }}\) (Theorem 4), and the probability of this case can be proved to vanish as \(\delta \rightarrow +0\) fast enough relative to the growth of \(KT_{{\varDelta }}\) (Lemma 14), so its contribution can be ignored asymptotically as \(\delta \rightarrow +0\). When event \(\bigcup _{i:\mu _i=\mu _1}\mathcal {E}_i^\mathrm {POS}\) occurs, a non-optimal arm i with \(\mu _i<\mu _1\) is drawn either in the case of \(\mu _i\)’s overestimation (\(\mathrm {UCB}(t,i)\ge \mu _1-\epsilon \)) or in the case of \(\mu _1\)’s underestimation (\(\mathrm {UCB}(t,1)< \mu _1-\epsilon \)). So, \(\mathbb {E}[T\mathbb {1}\{\bigcup _{i:\mu _i=\mu _1}\mathcal {E}_i^\mathrm {POS}\}]\) is upper-bounded by bounding three quantities: \(\mathbb {E}[\tau _i\mathbb {1}\{\bigcup _{i:\mu _i=\mu _1}\mathcal {E}_i^\mathrm {POS}\}]\) for optimal arms i with \(\mu _i=\mu _1\) by Lemma 2, the expected number of overestimations \(\mathbb {E}\left[ \sum _{t=1}^{KT_{{\varDelta }}}\mathbb {1}[\mathrm {UCB}(t,i)\ge \mu _1-\epsilon , i_t=i]\right] \) for non-optimal arms i with \(\mu _i<\mu _1\) by Lemma 15, and the expected number of underestimations \(\mathbb {E}\left[ \sum _{t=1}^{KT_{{\varDelta }}}\mathbb {1}[\mathrm {UCB}(t,1)< \mu _1-\epsilon ]\right] \) for the optimal arm 1 by Lemma 16.
Lemma 15
For an arbitrary \(\epsilon >0\), \(\mathrm {BAEC}[\mathrm {UCB},{\underline{\mu }},{\overline{\mu }}]\) satisfies
for \(i=2,\ldots ,K\) with \(\mu _i<\mu _1\).
Proof
Let \(n'_i=\frac{\ln KT_{{\varDelta }}}{2(\varDelta _{1i}-2\epsilon )^2}\). Then,
Therefore,
\(\square \)
Lemma 16
For \(\mathrm {BAEC}[\mathrm {UCB},{\underline{\mu }},{\overline{\mu }}]\), the following inequality holds.
for \(0<\epsilon \le 1\).
Proof
Define \({\mathbb {F}}_n(x)\) as \({\mathbb {F}}_n(x)=\mathbb {P}\{{\hat{\mu }}_1(n)\le x\}\). Note that \({\mathbb {F}}_n(x)\le \mathrm {e}^{-2n(\mu _1-x)^2}\) for \(x<\mu _1\) by Hoeffding’s Inequality. Then,
\(\square \)
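The Hoeffding bound \({\mathbb {F}}_n(x)\le \mathrm {e}^{-2n(\mu _1-x)^2}\) used in the proof of Lemma 16 can be compared with an exact tail probability. The sketch below assumes Bernoulli losses purely for illustration, so that \(n{\hat{\mu }}_1(n)\) is binomial; the function names are hypothetical.

```python
import math

def binom_cdf(n: int, p: float, k: int) -> float:
    """Exact P[Bin(n, p) <= k], i.e. P[mean of n Bernoulli(p) draws <= k/n]."""
    return sum(math.comb(n, j) * p**j * (1.0 - p) ** (n - j) for j in range(k + 1))

def hoeffding_bound(n: int, mu: float, x: float) -> float:
    """exp(-2 n (mu - x)^2), valid as a bound for x < mu."""
    return math.exp(-2.0 * n * (mu - x) ** 2)

def bound_holds(n: int, mu: float, k: int) -> bool:
    """Check F_n(k/n) <= exp(-2 n (mu - k/n)^2) for a threshold x = k/n < mu."""
    x = k / n
    return binom_cdf(n, mu, k) <= hoeffding_bound(n, mu, x)

if __name__ == "__main__":
    n, mu = 20, 0.5  # illustrative values
    for k in range(0, 10):
        print(k / n, binom_cdf(n, mu, k), hoeffding_bound(n, mu, k / n))
```

Thresholds are parametrized by the integer count k rather than by x directly, which avoids floating-point rounding when converting x back to a count.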
Proof of Theorem 8
Let \(\epsilon \) be \(0<\epsilon \le \min _{i:\varDelta _{1i}>0} \varDelta _{1i}/4\).
Since \(\frac{1}{\varDelta _{1i}^2}+\frac{12\epsilon }{\varDelta _{1i}^3}- \frac{1}{(\varDelta _{1i}-2\epsilon )^2}=\frac{4\epsilon (\varDelta _{1i}-4\epsilon )(2\varDelta _{1i}-3\epsilon )}{\varDelta _{1i}^3(\varDelta _{1i}-2\epsilon )^2}\ge 0\) holds for \(0<\epsilon \le \frac{\varDelta _{1i}}{4}\), \(\frac{1}{(\varDelta _{1i}-2\epsilon )^2}\le \frac{1}{\varDelta _{1i}^2}+\frac{12\epsilon }{\varDelta _{1i}^3}\) holds. Thus, by setting \(\epsilon \) to \(O((\ln KT_{{\varDelta }})^{-1/3})\), we have
\(\square \)
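The algebraic identity used in this proof can be verified exactly with rational arithmetic. The following Python sketch is a check under illustrative values of \(\varDelta _{1i}\) and \(\epsilon \), with hypothetical function names; it is not part of the original argument.

```python
from fractions import Fraction

def gap(d: Fraction, eps: Fraction) -> Fraction:
    """Left-hand side: 1/d^2 + 12 eps/d^3 - 1/(d - 2 eps)^2, with d = Delta_{1i}."""
    return 1 / d**2 + 12 * eps / d**3 - 1 / (d - 2 * eps) ** 2

def factored(d: Fraction, eps: Fraction) -> Fraction:
    """Right-hand side: 4 eps (d - 4 eps)(2 d - 3 eps) / (d^3 (d - 2 eps)^2)."""
    return 4 * eps * (d - 4 * eps) * (2 * d - 3 * eps) / (d**3 * (d - 2 * eps) ** 2)

def identity_and_sign_hold(d: Fraction, eps: Fraction) -> bool:
    """Both sides agree exactly and are nonnegative when 0 < eps <= d/4."""
    return gap(d, eps) == factored(d, eps) and gap(d, eps) >= 0

if __name__ == "__main__":
    d = Fraction(1, 5)  # illustrative gap Delta_{1i}
    for k in range(1, 11):
        eps = d * Fraction(k, 40)  # eps ranges over (0, d/4]
        print(eps, gap(d, eps), identity_and_sign_hold(d, eps))
```

At \(\epsilon =\varDelta _{1i}/4\) the factor \(\varDelta _{1i}-4\epsilon \) vanishes, so the two sides of the resulting inequality coincide there.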
Cite this article
Tabata, K., Nakamura, A., Honda, J. et al. A bad arm existence checking problem: How to utilize asymmetric problem structure?. Mach Learn 109, 327–372 (2020). https://doi.org/10.1007/s10994-019-05854-7
DOI: https://doi.org/10.1007/s10994-019-05854-7