1 Introduction

Multi-armed bandit (MAB) algorithms [7] are now widely used to model and solve problems where decisions must be made sequentially at every time step and there is an exploration-exploitation dilemma. This dilemma is the tradeoff the planner faces in deciding whether to explore arms that may yield higher rewards in the future or to exploit the arms that have already yielded high rewards in the past. If the rewards are generated from fixed distributions with unknown parameters, the setting goes by the name stochastic MAB [7]. Popular algorithms in the stochastic MAB setting include Upper Confidence Bound (UCB) based algorithms [2] and Thompson Sampling based algorithms [1]. These algorithms incur \(O(\log T)\) regret, where T is the total number of time steps. MAB algorithms are well studied, with several variants [8, 9, 22, 23] and applications [11, 29,30,31].

When the arms are controlled by strategic agents, we need to tackle additional challenges. Mechanism design [26,27,28] has been applied in this context, leading to stochastic MAB mechanisms [24]. The design of such mechanisms requires ideas from online learning as well as mechanism design, both of which are increasingly gaining importance in the field of artificial intelligence. An immediate application of stochastic MAB mechanisms is in sponsored search auctions (SSA). In SSA, there are several advertisers who wish to display their ads along with the search results generated in response to a query from an internet user. In the standard model, an advertiser has only one ad to display. We use the terms agent, ad, and advertiser interchangeably. There are two components of interest to the planner or search engine: (1) the stochastic component, namely the click through rate (CTR) of an ad, that is, the probability that a displayed ad receives a click; and (2) the strategic component, namely the valuation of the agent for every click that the agent’s ad receives. The search engine seeks to allocate the slot to the ad with the maximum social welfare (product of click through rate and valuation). However, neither the CTRs nor the valuations of the agents are known. This calls for a learning algorithm to learn the stochastic component (CTR) as well as a mechanism to elicit the strategic component (valuation). The problem becomes much harder as the agents may manipulate the learning process [4, 19] to gain higher utilities.

For single slot SSA, it is known that any deterministic MAB mechanism (that is, a MAB mechanism with a deterministic allocation and payment rule) suffers a regret [7, 12] of \({\varOmega }(T^{2/3})\) [4]. Furthermore, there exists a deterministic MAB mechanism whose regret matches this lower bound [4] and which also satisfies ex-post truthfulness, the strongest notion of truthfulness (a posteriori to the click realizations). When a more relaxed notion of truthfulness is targeted (truthfulness in expectation over the clicks), the regret guarantee improves to \(O(T^{1/2})\) [3]. Truthfulness in expectation has also been achieved in [16, 17]. The regret can be further improved when randomized mechanisms are used; in fact, the regret in this space is \(O(\log T)\) [3, 21]. However, the high variance that is inevitable in the payments of randomized mechanisms is a serious deterrent to their use. Towards reducing the variance, [15] propose a MAB mechanism using Thompson sampling [1]. However, the notion of truthfulness achieved is ‘within period DSIC’ and holds only with high probability. Thus, again, only a weaker notion of truthfulness is achieved compared to ex-post truthfulness.

In this work, we observe that the characterization provided by Babaioff et al. [4] targets the worst case scenario. In particular, in their lower bound proof of \({\varOmega }(T^{2/3})\) regret, they consider an example scenario where the actual separation, \(\bar {{\varDelta }}\), between the expected rewards of the arms is a function of T. We note that when a similar example (\(\bar {{\varDelta }} = T^{-1}\)) is used with the popular UCB algorithm [2], the number of pulls of sub-optimal arms could be linear, even in the non-strategic case. Hence, a dependence of \(\bar {{\varDelta }}\) on T is severely restrictive for the case when the rewards are stochastic, even when the arms are non-strategic. We make the observation that \(\bar {{\varDelta }}\) is in most situations independent of T. This motivates the main idea in this paper, which is to provide the planner an option to specify a parameter Δ, the tolerance or distinguishing level for sub-optimal arms. The understanding is that any arm that is within Δ of the best arm will not cause any additional regret to the planner. For example, the best arm may yield an expected reward of 6.000 while a sub-optimal arm may yield a very close expected reward of 5.999. The planner is typically indifferent to such small differences. Traditional exploration-separated schemes end up spending a huge number of exploration rounds in order to distinguish between these two closely separated arms.


Another scenario that the planner might be interested in is to ensure that over T rounds, the cumulative loss from choosing sub-optimal arms stays within an overall budget of τ. To achieve this, the tolerance level could be set to Δ = τ/T. The notion of Δ tolerance requires an appropriate definition of regret, which we call Δ-regret. Focusing on Δ-regret instead of the usual notion of regret helps us reduce the number of exploration rounds significantly, from \(O(T^{2/3})\) to \(O(\log T)\). We propose an exploration separated mechanism based on UCB, which achieves a Δ-regret of \(O(\log T)\). This mechanism can be readily applied in several settings such as SSA, crowdsourcing, and online procurement. For the rest of the paper, however, we use SSA as a running example.

1.1 Contributions:

  1. We make the crucial observation that in most MAB scenarios, the separation between the agents’ rewards is rarely a function of T (the number of time steps). Moreover, when the rewards of some arms are arbitrarily close to that of the best arm, the regret contributed by such sub-optimal arms is negligible. We exploit this observation to allow the center to specify the resolution, Δ, with which the agents must be distinguished, and introduce the notion of Δ-regret to formalize the corresponding loss. The notion of regret used in the state of the art does not permit a reduction in the number of exploratory rounds for sub-optimal arms, whereas the more flexible notion of Δ-regret that we introduce supports a reduction in the number of mandatory pulls of the sub-optimal arms.

  2. Using sponsored search auctions as a concrete example, we propose a dominant strategy incentive compatible (DSIC) and individually rational (IR) MAB mechanism with a deterministic allocation and payment rule, based on ideas from the UCB family of MAB algorithms. The proposed mechanism Δ-UCB achieves a Δ-regret of \(O(\log T)\) for the case of single slot sponsored search auctions. The truthfulness achieved by Δ-UCB holds a posteriori to the click realizations and is the strongest form of truthfulness. A loss of only \(O(\log T)\) would not have been possible if the traditional notion of regret were used. In particular, the number of exploration rounds in Δ-UCB is \(O(\log T)\), as opposed to the \(O(T^{2/3})\) rounds which were mandatory so far for ensuring a truthful, deterministic mechanism. Thus the planner is relieved of this huge number of exploration rounds. We also show that any mechanism must suffer a Δ-regret of at least \({\varOmega }(\log T)\).

  3. We non-trivially extend the above results to the case where multiple slots are to be allocated. Here again, our mechanism is DSIC and IR, and requires sub-optimal arms to be explored for only \(O(\log T)\) rounds. Therefore a Δ-regret of \(O(\log T)\) is achieved in the multiple slot scenario as well.

Our results are generic to stochastic MAB mechanisms and can be applied to other popular applications such as crowdsourcing and online procurement.

2 Relevant work

In the area of MAB mechanisms, a significant body of work addresses sponsored search auctions. Babaioff et al. [4] provide a characterization of truthful MAB mechanisms, wherein the objective is to maximize social welfare. They introduce the notion of influential rounds, the rounds in which the parameters of the reward distributions (CTRs) are learnt. One of the characterizations of truthful deterministic mechanisms is that the allocation must be exploration separated, that is, in such influential rounds, the allocation must not depend on the bids of the agents. The allocation is also required to be pointwise monotone. One of the main results of their paper is that any truthful, deterministic MAB mechanism incurs a regret of \({\varOmega }(T^{2/3})\). In particular, their analysis is adversarial in nature: the gap between the best and second best arm is chosen, as if by an adversary, to be proportional to \(T^{-1/3}\). Such a choice forces a huge regret for any truthful, deterministic mechanism. They also provide a mechanism which incurs a matching upper bound regret of \(O(T^{2/3})\). Devanur et al. [10] concurrently provide similar bounds on the regret when the objective is revenue maximization rather than social welfare maximization.

All the above results pertain to the setting of single slot auctions, where there is a single slot for which the agents compete. In the generalization of this setting, multiple slots are reserved for ads. This setting is more challenging as the slots are not identical and some slots are more prominent than others. MAB mechanisms have also been extended to the multiple slot setting [14] in line with the characterization in [4]. Hence, a similar regret of \(O(T^{2/3})\) on the social welfare has been attained here as well. Similar results are also stated in the characterization provided in [32].

MAB mechanisms have also been proposed in the context of crowdsourcing [6]. Some of these mechanisms incur a regret of \(O(\log T)\). This is rendered possible due to the specific nature of the problem in hand. In particular, Bhat et al. [5] look at divisible tasks. Jain et al. [20] look at deterministic mechanisms where a block of tasks is allocated to each agent and provide a weaker notion of truthfulness.

The \({\varOmega }(T^{2/3})\) lower bounds on both social welfare regret and revenue regret have influenced subsequent research to follow similar assumptions and thereby obtain similar regret. However, we show in this work that it is indeed possible to design a deterministic mechanism which attains logarithmic regret and is also truthful in the dominant strategy incentive compatible (DSIC) sense [25]. DSIC, of course, is the most preferred form of truthfulness [26]. This work opens up the possibility for a planner to move away from the worst case scenario to a more realistic scenario. We enable the planner to specify a resolution parameter for distinguishing the arms, introduce the notion of Δ-regret, and thereafter propose a mechanism that ensures that the number of exploration rounds, and hence the regret suffered, is only \(O(\log T)\) instead of the expensive \({\varOmega }(T^{2/3})\) in the current state of the art. We summarize the contrast between our work and the state of the art in Table 1.

Table 1 Comparison of our results with state of the art

3 The model: Single slot SSA

We now describe our SSA setting. For ease of reference, our notation is provided in Table 2. Let K be the number of agents or arms. We denote the set of arms by [K]. Each of the K arms, when pulled, gives rewards from a distribution with unknown parameters. We assume that the form of the distributions is known but their parameters are unknown. In SSA, the rewards of the arms correspond to clicks. The clicks for the advertisements are assumed to be generated from Bernoulli distributions with parameters μ1, μ2,…, μK, where μi is the CTR, that is, the probability that advertisement i receives a click once observed. The means μ1,…, μK are unknown.

Table 2 Notations for the single slot SSA setting

A click realization ρ represents the click information of every agent at all rounds, that is, ρi(t) = 1 if agent i received a click in round t. In a round t, only the click information of the allocated agent is revealed after the completion of the round. Click information of all other unallocated agents is never known to the planner.

The agents also have their valuations for each click they receive. We work in the ‘pay per click’ setting where the agent pays the search engine for each click received. Let the true valuation of agent i be θi for a click. θi is a private type of agent i and is never known to the learner. However, the agent is asked to bid his valuation. Let the bid of agent i be bi. We denote by the vector b = (b1,…, bK) the bid profile of all the agents. The central planner wants to ensure that the agents bid their true valuations, that is, bi must be equal to θi. Assume that there is a single slot which must be allocated to one of the K agents. We denote by Wi the social welfare when agent i is allocated the slot, that is, Wi = μiθi. The social welfare Wi represents the expected valuation agent i derives per impression. If the CTRs of the agents as well as their valuations were known, the planner would have selected the arm with the maximum social welfare μiθi. However, neither μi nor θi is known to the planner. Assume θmax is the maximum valuation that any agent can have and is common knowledge. The central planner wants to allocate the single slot to one of the ads in such a way that the net social welfare of the allocation is maximized.

A mechanism \({\mathscr{M}} = \left \langle \mathcal {A}, P\right \rangle \) is a tuple containing an allocation rule \(\mathcal {A}\) and a payment rule P. At every time step or round t, the allocation rule acts on the bid profile b of the agents as well as the click realization ρ and allocates the slot to one of the K agents, say i. Then \(\mathcal {A}(b, \rho , t) = i\). Alternatively, we denote by the indicator variable \(\mathcal {A}_i(b, \rho , t)\) whether agent i is allocated the slot at time t, that is, \(\mathcal {A}_i(b, \rho , t) = 1\) if \(\mathcal {A}(b, \rho , t) = i\) and 0 otherwise. The payment rule is \(P^t = (P_1^t, P_{2}^t, \ldots , P_K^t)\), where \(P_i^t(b, \rho )\) is the payment to be made by agent i at time t upon receiving a click, when the bids are b and the click realization is ρ. As stated earlier, ρi(t) is observed for the allocated agent alone. Also note that the allocation as well as the payments in each round t depend only on the click history till that round.

Let \(i_{*}\) be the arm with the largest social welfare, that is, \(i_{*} = \underset {i \in [K]}{\arg \max \limits } W_i\). We denote the corresponding social welfare as \(W_{*} = \max \limits _{i \in [K]} W_{i} \). We denote by It the agent chosen at time t, as a shorthand for \(\mathcal {A}(b, \rho , t)\). For any given Δ > 0, define the set \(S_{{\varDelta }} = \{i \in [K] : W_{*} - W_{i} < {\varDelta } \}\). SΔ denotes the set of all agents whose social welfare is within Δ of that of the best arm i∗. These arms are therefore indistinguishable to the center and they contribute zero to the regret. Note that Δ is a parameter that the center fixes based on the amount in dollars he is willing to trade off for choosing sub-optimal arms, given he has only a fixed time horizon T at his disposal. To capture this revised and more practical notion of regret, we introduce the metric Δ-regret. Formally,

$$ {\varDelta}\text{-regret} = \sum\limits_{t=1}^{T} (W_{*} - W_{I_{t}})\mathbbm{1}\left[ I_{t} \in [K]\setminus S_{{\varDelta}}\right] $$
(1)

The center may not want to invest a huge number of exploration rounds (\({\varOmega }(T^{2/3})\) in the state of the art) to perfectly distinguish arms that are arbitrarily close. Often, the planner may instead be willing to allocate arms that are at most Δ away from the best arm. The center therefore suffers a regret only when an agent whose social welfare is at least Δ below W∗ is chosen. Δ-regret captures this loss.

The goal of our mechanism is to select agents at every round t to minimize the Δ-regret.

4 Our mechanism: Δ-UCB

We are now ready to describe our mechanism Δ-UCB. The idea in Δ-UCB is to explore all the arms in a round-robin fashion for a fixed number of rounds. The number of exploration rounds is fixed based on the desired Δ specified by the planner. At the end of exploration, with high probability, we are guaranteed that the arms not in SΔ are well separated from the best arm i∗ with respect to their social welfare estimates. In the exploration rounds, agents need not pay; these rounds are free.

Thereafter, for all the remaining rounds, the best arm as per the UCB estimate of social welfare is chosen. However, in the exploitation rounds, the chosen agent pays an amount for each click he receives. The amount to be paid by the agent is fixed based on a variant of the well-known Vickrey-Clarke-Groves (VCG) scheme [33] known as weighted VCG [28]. Note that no learning takes place in these rounds and the UCB and LCB indices do not change thereafter. We present our mechanism in Algorithm 1.

Algorithm 1 The Δ-UCB mechanism for single slot SSA
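Since Algorithm 1 appears only as an image in the original, the following is a minimal Python sketch of the single slot Δ-UCB mechanism as described above: round-robin exploration sized as in Lemma 2, frozen UCB indices, and the weighted second-price payment used in the proof of Theorem 1. The click simulation via true_ctrs, the function and variable names, and the assumptions of at least two agents and truthfully reported bids are ours.

```python
import math
import random


def delta_ucb_single_slot(bids, true_ctrs, T, delta, theta_max, rng=random):
    """Sketch of single slot Delta-UCB: explore round-robin, then exploit.

    bids[i]      : reported per-click valuation b_i of agent i
    true_ctrs[i] : (hidden) Bernoulli CTR mu_i, used only to simulate clicks
    Returns a list of (allocated agent, per-click payment) for t = 1..T.
    """
    K = len(bids)
    # Pulls per arm so that 2 * eps_{i,t} * theta_i < delta afterwards (Lemma 2).
    n_explore = math.ceil(8 * theta_max ** 2 * math.log(T) / delta ** 2) + 1
    clicks, pulls, history = [0] * K, [0] * K, []

    t = 0
    # Exploration phase: round-robin, bids ignored, no payments charged.
    while t < min(T, K * n_explore):
        i = t % K
        clicks[i] += rng.random() < true_ctrs[i]
        pulls[i] += 1
        history.append((i, 0.0))
        t += 1

    # UCB index on the CTR, frozen after exploration (no further learning).
    def mu_plus(i):
        return clicks[i] / pulls[i] + math.sqrt(2 * math.log(T) / pulls[i])

    # Exploitation: allocate by UCB-weighted bid; the winner pays, per click,
    # the second-highest UCB-weighted bid divided by his own UCB index.
    while t < T:
        ranked = sorted(range(K), key=lambda i: mu_plus(i) * bids[i], reverse=True)
        winner, runner_up = ranked[0], ranked[1]
        payment = mu_plus(runner_up) * bids[runner_up] / mu_plus(winner)
        history.append((winner, payment))
        t += 1
    return history
```

For instance, `delta_ucb_single_slot([0.9, 0.5, 0.4], [0.3, 0.6, 0.2], T=100000, delta=0.2, theta_max=1.0)` spends roughly 2300 pulls per arm on exploration before switching to exploitation.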

4.1 Properties of Δ-UCB

Next we discuss the properties satisfied by Δ-UCB regarding truthfulness and regret. Before that, we state a few useful definitions which will help in understanding the notion of truthfulness.

At any time step, every agent obtains some utility by participating in the mechanism. This utility is a function of his bid, his valuation, the bids of the other agents, and his click realization. Let Θi denote the space of bids of agent i. b−i = (b1,…, bi− 1, bi+ 1,…, bK) is the bid profile containing the bids of all agents except agent i. Let Θ−i denote the space of bids of all agents other than agent i. Therefore Θ−i = Θ1 ×… ×Θi− 1 ×Θi+ 1 ×… ×ΘK. We denote by ui(bi, b−i, ρ, t;θi) the utility to agent i at time t when his bid is bi, his valuation is θi, the bid profile of the remaining agents is b−i, and the click realization is ρ. All agents are assumed to be rational and are interested in maximizing their own utilities.

In our setting the utility to an agent i is computed as,

$$ \begin{array}{@{}rcl@{}} u_{i}(b_{i}, b_{-i}, \rho, t; \theta_{i}) = (\theta_{i} - {P_{i}^{t}}(b, \rho)) \mathcal{A}_{i}(b_{i}, b_{-i}, \rho, t)\rho_{i}(t) \end{array} $$
(2)

The idea behind the computation of the utility is as follows. If an agent i does not receive an allocation (that is, \(\mathcal {A}_i(b_i, b_{-i}, \rho , t)=0\)), his utility is also zero. He gets a non-zero utility only if he receives an allocation. If he receives an allocation and also a click (ρi(t) = 1), then his utility is the difference between his valuation for the click and the amount he has to pay to the search engine (\(\theta _{i} - P_i^t(b, \rho )\)). If he does not receive a click (ρi(t) = 0), his utility is zero.

Definition 1

Dominant Strategy Incentive Compatible (DSIC) [4]: A mechanism \(M = \left \langle \mathcal {A},P \right \rangle \) is said to be dominant strategy incentive compatible if ∀i ∈ [K], ∀bi ∈ Θi, ∀b−i ∈ Θ−i, ∀ρ, ∀t, ui(θi, b−i, ρ, t;θi) ≥ ui(bi, b−i, ρ, t;θi).

Note that in the above definition, truthfulness is demanded a posteriori to even the click realization [14]. Hence it is the strongest notion of truthfulness. Examples of weaker forms of truthfulness include those which take an expectation over the click realizations.

Definition 2

Individually Rational (IR): A mechanism \(M = \left \langle \mathcal {A},P \right \rangle \) is said to be individually rational if ∀i ∈ [K], ∀b−i ∈ Θ−i, ∀ρ, ∀t, ui(θi, b−i, ρ, t;θi) ≥ 0.

Theorem 1

The Δ-UCB mechanism is dominant strategy incentive compatible (DSIC) and individually rational (IR).

Proof

We analyze the scenarios where an agent i bids his true valuation and receives an allocation and also when he does not. We show that in both these scenarios, bidding his true valuation θi is indeed a best response strategy. We only need to consider the exploitation rounds because in the exploration rounds, every agent is allocated a fixed number of rounds independent of his bids and these rounds are also free for agents.

  • Case 1: \(\mathcal {A}_i(\theta _i, b_{-i}, \rho , t)=1\)

    This implies that when the agent bids his true valuation, he gets an allocation. Therefore \(\widehat {\mu }_{i,t}^+ \theta _{i} > \widehat {\mu }_{l,t}^+ b_l\) for all the other agents l. In particular, let agent j be such that \(j = \underset {l \in [K] \setminus \{i\}}{\arg \max \limits } \widehat {\mu }_{l,t}^+ b_l\). The amount to be paid by agent i is \(P_i^t(\theta _i, b_{-i}, \rho ) = \widehat {\mu }_{j,t}^+ b_{j} / \widehat {\mu }_{i,t}^+\). If he receives a click then \(u_i(\theta _i, b_{-i}, \rho ,t; \theta _i)= \theta _{i} - \widehat {\mu }_{j,t}^+ b_{j} / \widehat {\mu }_{i,t}^+ > 0 \).

    Overbid:

    If agent i bids a value bi > θi, he continues to receive an allocation and his payment is still the same, \(P_i^t(b_i, b_{-i}, \rho ) = \widehat {\mu }_{j,t}^+ b_{j} / \widehat {\mu }_{i,t}^+\). Therefore his utility continues to be \(u_i(b_i, b_{-i}, \rho ,t; \theta _i)= \theta _{i} - \widehat {\mu }_{j,t}^+ b_{j} / \widehat {\mu }_{i,t}^+ = u_i(\theta _i, b_{-i}, \rho ,t; \theta _i)\). Therefore he does not benefit from an overbid.

    Underbid:

    Suppose agent i bids a value bi < θi.

    1. Case a:

      If bi is such that \(\widehat {\mu }_{i,t}^+b_{i} < \widehat {\mu }_{j,t}^+ b_{j}\), then he fails to get an allocation as \(\mathcal {A}(b_i, b_{-i}, \rho , t) =j \neq i \). Then the utility to agent i is ui(bi, b−i, ρ, t;θi) = 0 < ui(θi, b−i, ρ, t;θi). Therefore he clearly loses utility by such an underbid.

    2. Case b:

      Suppose bi is such that \(\widehat {\mu }_{i,t}^+\theta _{i} > \widehat {\mu }_{i,t}^+b_{i} > \widehat {\mu }_{j,t}^+ b_{j}\). That is agent i bids in such a way that he wins the allocation even with an underbid. Then, if he gets a click, the amount he must pay to the center is \(P_i^t(b_i, b_{-i}, \rho ) = \widehat {\mu }_{j,t}^+ b_{j} / \widehat {\mu }_{i,t}^+ \). Therefore his utility \(u_i(b_i, b_{-i}, \rho ,t; \theta _i)= \theta _{i} - \widehat {\mu }_{j,t}^+ b_{j} / \widehat {\mu }_{i,t}^+= u_i(\theta _i, b_{-i}, \rho ,t; \theta _i) \). He obtains the same utility as a truthful bid and there is no benefit from such an underbid.

  • Case 2: \(\mathcal {A}_i(\theta _i, b_{-i}, \rho , t)=0\)

    This implies that when the agent bids his true valuation, he does not get an allocation. Suppose agent j wins the allocation. \(\mathcal {A}(\theta _i, b_{-i}, \rho , t)=j\) and \(\widehat {\mu }_{i,t}^+ \theta _{i} < \widehat {\mu }_{j,t}^+ b_{j}\).

    Truthful bid:

    Since agent i does not win the allocation with a truthful bid, his utility ui(θi, b−i, ρ, t;θi) = 0.

    Overbid:

    Suppose agent i bids in such a way that bi > θi. We have two sub-cases here.

    1. Case a:

      If bi is such that \(\widehat {\mu }_{i,t}^+ \theta _{i} < \widehat {\mu }_{j,t}^+ b_{j} < \widehat {\mu }_{i,t}^+ b_{i} \), then agent i wins the allocation. So, \(\mathcal {A}_i(b_i, b_{-i}, \rho , t)=1\). If he gets a click, he now has to make a payment \(P_i^t(b_i, b_{-i}, \rho ) = \widehat {\mu }_{j,t}^+ b_{j}/ \widehat {\mu }_{i,t}^+ \). Now his utility \(u_i(b_i, b_{-i}, \rho ,t; \theta _i) = \theta _{i} - \widehat {\mu }_{j,t}^+ b_{j}/ \widehat {\mu }_{i,t}^+ \) < 0. In particular, ui(bi, b−i, ρ, t;θi) < ui(θi, b−i, ρ, t;θi) = 0. Therefore, such an overbid is clearly disadvantageous compared to a truthful bid.

    2. Case b:

      Suppose \(\widehat {\mu }_{i,t}^+ \theta _{i} < \widehat {\mu }_{i,t}^+ b_{i} < \widehat {\mu }_{j,t}^+ b_{j} \). The overbid by agent i is not sufficient to make him win the allocation, and agent j wins the allocation, \(\mathcal {A}(b_i, b_{-i}, \rho , t)=j\). The utility of agent i is ui(bi, b−i, ρ, t;θi) = 0 = ui(θi, b−i, ρ, t;θi). Therefore there is no advantage for agent i from this case of overbid.

    Underbid:

    If agent i bids in such a way that bi < θi, he continues to lose the allocation and therefore his utility ui(bi, b−i, ρ, t;θi) = 0 = ui(θi, b−i, ρ, t;θi). Since the utility from an underbid remains the same as that of a truthful bid, there is clearly no advantage in underbidding.

All the above cases show that our mechanism is DSIC a posteriori to the click realizations. Also, in each of the above cases, note that the utility of agent i under truthful bidding, ui(θi, b−i, ρ, t;θi), is non-negative. Therefore, by truthful bidding he never gets a negative utility. This proves that our mechanism is individually rational. □
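As a concrete numerical illustration of the exploitation-phase payment rule analyzed above (the numbers are ours, chosen only for illustration): suppose

$$ \widehat{\mu}_{i,t}^{+} = 0.5,\quad \theta_{i} = 1.0,\quad \widehat{\mu}_{j,t}^{+} = 0.4,\quad b_{j} = 0.9 . $$

Then agent i wins the slot since \(\widehat{\mu}_{i,t}^{+}\theta_{i} = 0.5 > \widehat{\mu}_{j,t}^{+}b_{j} = 0.36\), and on receiving a click he pays \(P_{i}^{t} = \widehat{\mu}_{j,t}^{+} b_{j} / \widehat{\mu}_{i,t}^{+} = 0.36/0.5 = 0.72\), for a utility of \(\theta_{i} - 0.72 = 0.28 \geq 0\), consistent with both DSIC and IR.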

We next discuss the regret incurred by Δ-UCB. We note that the regret analysis we provide differs in spirit from the worst case analysis in [4]. The number of exploration rounds in [4] is required to be \({\varOmega }(T^{2/3})\) since the separation between the best and second best arm is fixed in an adversarial manner in their analysis. Our analysis does not resort to any adversarial arguments.

In order to prove our Δ-regret results, we will first need to prove several other lemmas.

Lemma 1

Social Welfare UCB index: For an agent i, we define the social welfare UCB indices for agent i as,

$$ \begin{array}{@{}rcl@{}} \widehat{W}_{i,t}^{+} = \widehat{\mu}_{i,t}\theta_{i}+ \epsilon_{i,t}\theta_{i} = \widehat{\mu}_{i,t}\theta_{i} + \sqrt{2\frac{{\theta_{i}^{2}} \log T}{N_{i,t}}} \end{array} $$
(3)
$$ \begin{array}{@{}rcl@{}} \widehat{W}_{i,t}^{-} = \widehat{\mu}_{i,t}\theta_{i} - \epsilon_{i,t}\theta_{i} = \widehat{\mu}_{i,t}\theta_{i} - \sqrt{2\frac{{\theta_{i}^{2}} \log T}{N_{i,t}}} \end{array} $$
(4)

Then, \( \forall t,\; P\left (\left \lbrace \omega : W_{i} \notin [\widehat {W}_{i,t}^-(\omega ), \widehat {W}_{i,t}^+ (\omega )] \right \rbrace \right ) \leq 2T^{-4} \).

Proof

Let \(\widehat {\mu }_{i,t}^+\) and \(\widehat {\mu }_{i,t}^-\) denote the UCB and LCB indices for the estimate \(\widehat {\mu }_i\). Then the events \(\{\omega : \mu _{i} \notin [\widehat {\mu }_{i,t}^-(\omega ),\) \( \widehat {\mu }_{i,t}^+(\omega ) ] \}\) and \(\{\omega : W_{i} \notin [\widehat {W}_{i,t}^-(\omega ), \widehat {W}_{i,t}^+(\omega ) ] \}\) are identical. So, \(P(W_{i} \notin [\widehat {W}_{i,t}^-, \widehat {W}_{i,t}^+]) = P(\mu _{i} \notin [\widehat {\mu }_{i,t}^-, \widehat {\mu }_{i,t}^+] )\). An application of Hoeffding bound [18] gives \(P(\mu _{i} \notin [\widehat {\mu }_{i,t}^-, \widehat {\mu }_{i,t}^+] ) \leq 2\exp (-2 N_{i,t} \epsilon _{i,t}^2) \). As per the mechanism \(\epsilon _{i,t} = \sqrt {2\log T/N_{i,t}}\). So, \(P(\mu _{i} \notin [\widehat {\mu }_{i,t}^-, \widehat {\mu }_{i,t}^+] ) \leq 2\exp (-2 N_{i,t} \times 2 \log T/ N_{i,t}) = 2T^{-4}\). □

Lemma 2

Suppose at time step t, \(N_{i,t} > \frac {8\theta _{max}^2 \log T}{{\varDelta }^2} \forall i \in [K]\). Then ∀i ∈ [K], 2𝜖i, tθi < Δ.

Proof

Given that \(N_{i,t} > \frac {8 \theta _{max}^2 \log T}{{\varDelta }^2}\). Therefore,

$$ \begin{array}{@{}rcl@{}} {\varDelta}^{2} > \frac{8 \theta_{max}^{2} \log T}{N_{i,t}} \geq \frac{8 {\theta_{i}^{2}} \log T}{N_{i,t}} \geq 4 \left[\frac{2 {\theta_{i}^{2}} \log T}{N_{i,t}} \right] \end{array} $$

Taking square roots on both sides of the above inequality yields Δ > 2𝜖i, tθi, thereby proving the lemma. □

Lemma 3

Suppose K ≤ T. For an agent i and time step t, let Bi, t be the event \(B_{i,t} = \{\omega : W_{i} \notin [\widehat {W}_{i,t}^-, \widehat {W}_{i,t}^+] \}\). Define the event \(G =\bigcap \limits _t \bigcap \limits _{i \in [K]} B_{i,t}^c\), where \(B_{i,t}^c\) is the complement of Bi, t. Then \(P(G) \geq 1 - \frac {2}{T^2}\).

Proof

From Lemma 1, the probability of the ‘bad’ event, P(Bi, t) ≤ 2T− 4.

$$ \begin{array}{@{}rcl@{}} P(G)&=& P\left( \underset{t}{\bigcap} \underset{i}{\bigcap} B_{i,t}^{c}\right) = 1- P\left( \underset{t}{\bigcup} \underset{i}{\bigcup} B_{i,t}\right) \\ &\geq& 1 - \sum\limits_{t} \sum\limits_{i \in [K]} P(B_{i,t}) \geq 1- \sum\limits_{t} \sum\limits_{i \in [K]} 2T^{-4} \geq 1 - \frac{2}{T^{2}} \end{array} $$

The second step uses the union bound; the last inequality follows by summing over the T rounds and the K ≤ T arms. □

Theorem 2

Suppose at time step t, \(N_{j,t} > \frac {8\theta _{max}^2 \log T}{{\varDelta }^2}\) for all j ∈ [K]. Then ∀i ∈ [K] ∖ SΔ, \(\widehat {W}_{i_{*}, t}^+ > \widehat {W}_{i, t}^+\) with probability at least \(1 - 2/T^{4}\).

Proof

In Theorem 1, we have shown that Δ-UCB is DSIC. Therefore, all the agents bid their valuations truthfully, that is, bi = θi for all i ∈ [K]. Suppose in exploitation round t, a sub-optimal arm i is pulled. Therefore, \(\widehat {W}_{i, t}^+ \geq \widehat {W}_{i_{*}, t}^+\). Then one of the following three conditions must have occurred.

  • Condition 1: \(W_{i} < \widehat {W}_{i,t}^{-}\). This condition implies a drastic overestimate of the sub-optimal arm i so that the true social welfare Wi is even below the LCB index \(\widehat {W}_{i,t}^-\). Figure 1 shows this case.

  • Condition 2: \( W_{*} > \widehat {W}_{i_{*}, t}^{+}\). This implies an underestimate of the optimal arm so that the true social welfare W lies above even the UCB index \(\widehat {W}_{i_{*},t}^+\). This situation is shown in Fig 2.

  • Condition 3: W∗ − Wi < 2𝜖i, tθi. This implies an overlap in the confidence intervals of the optimal and sub-optimal arm. Even though Conditions 1 and 2 are false, the UCB of the sub-optimal arm i is still greater than the UCB of the optimal arm i∗.

Fig. 1 Condition 1, proof of Theorem 2

Fig. 2 Condition 2, proof of Theorem 2

From Fig. 3, \(W_{*} - W_{i} \leq \widehat {W}_{i,t}^+ - \widehat {W}_{i,t}^- \leq 2 \epsilon _{i,t} \theta _{i} \)

Fig. 3 Condition 3, proof of Theorem 2

If all the three conditions above were false, then,

$$ \begin{array}{@{}rcl@{}} \widehat{W}_{i_{*},t}^{+} &> W_{*} > W_{i} + 2 \epsilon_{i,t} \theta_{i} > \widehat{W}_{i,t}^{-} + 2 \epsilon_{i,t} \theta_{i}= \widehat{W}_{i,t}^{+} \end{array} $$

This implies that \(\widehat {W}_{i_{*},t}^+ > \widehat {W}_{i,t}^+\), leading to a contradiction.

As per the statement of the theorem, \(N_{i,t} > \frac {8 \theta _{max}^2 \log T}{{\varDelta }^2}\). Therefore by Lemma 2, 2𝜖i, tθi < Δ. For i ∈ [K] ∖ SΔ, W∗ − Wi ≥ Δ > 2𝜖i, tθi. So Condition 3 above does not hold. Hence, if the sub-optimal arm i is pulled, the only possibilities are Condition 1 or Condition 2.

$$ \begin{array}{@{}rcl@{}} P(\widehat{W}_{i,t}^{+} \geq \widehat{W}_{i_{*},t}^{+}) &\leq& P(\text{Condition 1}) + P(\text{Condition 2}) \\ & \leq& T^{-4} + T^{-4} = \frac{2}{T^{4}} \end{array} $$
$$ \begin{array}{@{}rcl@{}} P(\widehat{W}_{i_{*},t}^{+} > \widehat{W}_{i,t}^{+} ) = 1- P(\widehat{W}_{i,t}^{+} \geq \widehat{W}_{i_{*},t}^{+}) \geq 1 - \frac{2}{T^{4}} \end{array} $$

where each of P(Condition 1) and P(Condition 2) is at most \(T^{-4}\) by the one-sided Hoeffding bound used in the proof of Lemma 1, thereby completing the proof. □

We are now ready to state our main result on the incurred regret.

Theorem 3

If the Δ-UCB mechanism is executed for a total time horizon of T rounds, it achieves an expected Δ-regret of \(O(\log T)\).

Proof

The main idea in the proof is to compute the Δ-regret conditional on the two events G and Gc and then to bound each of these conditional expectations.

We first bound \(\mathbb {E}\left [ {\varDelta }\text {-regret} |G \right ]\). Every exploration pull contributes at most θmax to the Δ-regret, and the exploration phase lasts \(8K\theta _{max}^2 \log T/{\varDelta }^2\) rounds in total, so that at its end \(N_{i,t}\geq 8 \theta _{max}^2 \log T/{\varDelta }^2\) for every arm i. Given that the event G has occurred, Conditions 1 and 2 in the proof of Theorem 2 are eliminated, and by Theorem 2 no Δ-regret is incurred during the exploitation rounds. Therefore \(\mathbb {E}\left [ {\varDelta }\text {-regret} |G \right ] \leq \frac {8K\theta _{max}^{3} \log T}{{\varDelta }^{2}}\).

We now compute \(\mathbb {E}\left [ {\varDelta }\text {-regret} |G^c \right ]\).

$$ \begin{array}{@{}rcl@{}} \mathbb{E}\left[ {\varDelta}\text{-regret} |G^{c} \right] \leq T \theta_{max} \end{array} $$
(5)

But \(P(G^c) = 1 - P(G)< \frac {2}{T^2}\) from Lemma 3. Putting all the steps together,

$$ \begin{array}{@{}rcl@{}} \mathbb{E}\left[ {\varDelta}\text{-regret} \right] &=& \mathbb{E}\left[ {\varDelta}\text{-regret} |G \right] P(G) + \mathbb{E}\left[ {\varDelta}\text{-regret} |G^{c} \right] P(G^{c}) \\ & \leq& \frac{8K\theta_{max}^{3} \log T}{{\varDelta}^{2}} * 1 + T\theta_{max}*\frac{2}{T^{2}} \\ & \leq& \frac{8K\theta_{max}^{3} \log T}{{\varDelta}^{2}} + 2 \end{array} $$
(6)

The second term is at most 2 since θmax ≤ T. This completes the proof. □

A consequence of the above theorem is that even if an adversary chooses an arbitrarily small gap between the best and second best arm, the planner need not worry: if the gap is less than his tolerance Δ, no loss is incurred, as opposed to the \({\varOmega }(T^{2/3})\) loss in [4].

4.2 A lower bound for Δ-regret

We will now discuss a lower bound on the Δ-regret incurred by our approach. In particular, we will provide the lower bound for the case where θi = 1 for all i and this is known. The proof follows along the lines of the lower bound proof in [7]. The same lower bound naturally applies to the general strategic version as well, since our proposed mechanism Δ-UCB is truthful and achieves a matching upper bound.

Let kl(p, q) denote the KL divergence between the distributions Bernoulli(p) and Bernoulli(q). Then \(kl(p,q) = p \log (p/q) + (1-p) \log ((1-p)/(1-q))\).

Theorem 4

Consider the setting where θi = 1 for all i ∈ [K]. Suppose an algorithm satisfies \(\mathbb {E}[N_{i,t}] = o(t^a)\) for any set of Bernoulli reward distributions, for all arms \(i \notin S_{{\varDelta }}\), and for every a > 0. Then for any set of Bernoulli reward distributions we have,

$$ \liminf_{T \rightarrow \infty} \frac{\mathbb{E}[{\varDelta}\text{-regret}]}{\log T } \geq \sum\limits_{i \notin S_{{\varDelta}}} \frac{{\varDelta}_{i}}{kl(\mu_{i}, \mu^{*} + {\varDelta})} $$
(7)

where \(\mu ^{*} = \max \limits _{j \in [K]} \mu _{j}\) and \({\varDelta }_{i} = \mu ^{*} - \mu _{i}\) for all i ∈ [K].

Proof

We will provide the proof for the case of two agents. The proof for the case K > 2 follows analogously. Assume that \(\mu _{2} < \mu _{1}\), \(\mu _{1} + {\varDelta } < 1\), and \(\mu _{1} - \mu _{2} > {\varDelta }\). Therefore agent 1 is optimal and agent 2 does not belong to SΔ. For any 𝜖 > 0, due to the continuity of kl(μ2, x), we can find \(\mu ^{\prime }_{2} \in (\mu _1 + {\varDelta }, 1)\) such that

$$ kl(\mu_{2}, \mu^{\prime}_{2}) \leq (1+ \epsilon) kl(\mu_{2}, \mu_{1} + {\varDelta}) $$
(8)

This configuration then corresponds to an alternate setting where the mean of agent 2 is \(\mu ^{\prime }_{2}\). In this alternate setting, \(\mu ^{\prime }_{2} - \mu _1 > {\varDelta }\) and agent 2 is the unique optimal. For s ∈{1,…, T}, let,

$$ \tilde{kl}_{s} = \sum\limits_{t=1}^{s} \log \frac{\mu_{2} {\rho_{2}^{t}} + (1-\mu_{2}) (1- {\rho_{2}^{t}})}{\mu^{\prime}_{2} {\rho_{2}^{t}} + (1-\mu^{\prime}_{2}) (1- {\rho_{2}^{t}})} $$
(9)

It can be verified that \(\lim _{t \rightarrow \infty } \mathbb {E}[ \tilde {kl}_t ]/t = kl(\mu _{2}, \mu ^{\prime }_{2})\) (where the expectation is taken over \(\rho _{2}^t\)) and therefore \(\tilde {kl}_t\) serves as an un-normalized estimate for \(kl(\mu _{2}, \mu ^{\prime }_{2})\).

Let CT denote the indicator random variable

$$ C_{T} = \mathbbm{1}\left[ N_{2,T} < f_{T} \text{ and } \tilde{kl}_{N_{2,T}} \leq (1-\epsilon/2)\log T \right] $$

where fT is specified below. One may verify that \(\mathbb {P}_{\mu ^{\prime }_{2}} (C_{T} = 1) = \mathbb {E}_{\mu _{2}}[C_{T} \exp (-\tilde {kl}_{N_{2,T}})]\) by applying a change of measure. We will now show that \( \mathbb {P}_{\mu _{2}} (C_{T} = 1) \rightarrow 0\) as \(T \rightarrow \infty \). This is due to the following:

$$ \begin{array}{@{}rcl@{}} \mathbb{P}_{\mu^{\prime}_{2}} (C_{T} = 1) &=& \mathbb{E}_{\mu_{2}}[C_{T} \exp(-\tilde{kl}_{N_{2,T}})] \\&&\geq \exp (- (1- \epsilon/2) \log T) \times \mathbb{P}_{\mu_{2}}(C_{T} = 1) \end{array} $$

Therefore, setting \(f_{T} = \frac {(1-\epsilon )\log T}{kl(\mu _{2},\mu ^{\prime }_{2})}\), and applying Markov inequality we get,

$$ \begin{array}{@{}rcl@{}} \mathbb{P}_{\mu_{2}}(C_{T} = 1)& \leq& T^{1- \epsilon/2} \mathbb{P}_{\mu^{\prime}_{2}} (C_{T} = 1) \leq T^{1- \epsilon/2} \mathbb{P}_{\mu^{\prime}_{2}} (N_{2,T} \leq f_{T})\\ &\leq& T^{1- \epsilon/2} \frac{\mathbb{E}_{\mu^{\prime}_{2}}[T - N_{2,T}]}{T-f_{T}} \rightarrow 0 \end{array} $$

The last step follows because T − N2, T = N1, T and agent 1 is sub-optimal (in fact, it lies outside SΔ) in the alternate setting where agent 2 has mean reward \(\mu ^{\prime }_{2}\); hence \(\mathbb {E}_{\mu ^{\prime }_{2}}[N_{1,T}] = o(T^{a})\) for every a > 0 by assumption.

We will finally show that \(\mathbb {P}_{\mu _{2}}(N_{2,T} < f_T) \rightarrow 0\).

$$ \begin{array}{@{}rcl@{}} \mathbb{P}_{\mu_{2}}(C_{T} = 1) &\geq& \mathbb{P}_{\mu_{2}}(N_{2,T} < f_{T} \text{ and } \max_{s \leq f_{T}} \tilde{kl}_{s} \leq (1-\epsilon/2)\log T) \\ &=& \mathbb{P}_{\mu_{2}}(N_{2,T} < f_{T} \text{ and } \frac{kl(\mu_{2}, \mu^{\prime}_{2})}{(1-\epsilon) \log T} \max_{s \leq f_{T}} \tilde{kl}_{s} \\&\leq& \frac{kl(\mu_{2}, \mu^{\prime}_{2})}{(1-\epsilon)}(1-\epsilon/2)) \end{array} $$

Note that \(kl(\mu _{2}, \mu ^{\prime }_{2}) > 0\) and \(\frac {1-\epsilon /2}{1 - \epsilon } \geq 1\). Therefore by an application of the strong law of large numbers, we have

$$ \begin{array}{@{}rcl@{}} \underset{T \rightarrow \infty}{\lim} \mathbb{P}_{\mu_{2}}(\frac{kl(\mu_{2}, \mu^{\prime}_{2})}{(1-\epsilon) \log T} \max_{s \leq f_{T}} \tilde{kl}_{s} \leq \frac{kl(\mu_{2}, \mu^{\prime}_{2})}{(1-\epsilon)}(1-\epsilon/2)) = 1 \end{array} $$

Since \(\mathbb {P}_{\mu _{2}}(C_{T} = 1) \rightarrow 0\), we must have \(\mathbb {P}_{\mu _{2}}(N_{2,T} < f_T) \rightarrow 0\) as well. Applying Markov inequality again, we get,

$$ \begin{array}{@{}rcl@{}} \mathbb{E}_{\mu_{2}}[N_{2,T}] &\geq& \mathbb{P}_{\mu_{2}}(N_{2,T} \geq f_{T}) f_{T} = \mathbb{P}_{\mu_{2}}(N_{2,T} \geq f_{T}) \frac{(1-\epsilon)\log T}{kl(\mu_{2},\mu^{\prime}_{2})} \\&\geq& (1-o(1))\frac{1-\epsilon}{1+\epsilon} \frac{\log T}{ kl(\mu_{2}, \mu_{1} + {\varDelta})} \end{array} $$

The last step is obtained by applying Eq. 8 together with \(\mathbb {P}_{\mu _{2}}(N_{2,T} \geq f_{T}) \rightarrow 1\). This completes the proof. Note that the key difference between our proof and [7] lies in Eq. 8: the term μ1 + Δ on the RHS of Eq. 8 is necessary to ensure that, in the alternate scenario, agent 1 falls outside SΔ. □

Remark 1

The lower bound on the expected Δ-regret in Theorem 4 is quite similar to the classical regret lower bound in [7]. The difference is that the KL divergence term in the bound is also a function of the parameter Δ: intuitively, instead of the divergence kl(μi, μ∗), we give an allowance of Δ to the optimal agent.
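As a small illustration of Theorem 4, the sketch below evaluates the coefficient of \(\log T\) on the right hand side of Eq. 7 for a given set of CTRs and tolerance Δ (with θi = 1 as in the theorem). The function names and the example numbers are ours; the code assumes μ∗ + Δ < 1 so that the divergence is finite.

```python
import math


def bernoulli_kl(p, q):
    """KL divergence kl(p, q) between Bernoulli(p) and Bernoulli(q)."""
    eps = 1e-12  # guard against log(0)
    p, q = min(max(p, eps), 1 - eps), min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def delta_regret_lower_bound_coeff(mus, delta):
    """Coefficient of log T in the Theorem 4 lower bound (theta_i = 1)."""
    mu_star = max(mus)
    total = 0.0
    for mu_i in mus:
        gap = mu_star - mu_i  # Delta_i = mu^* - mu_i
        if gap >= delta:      # only arms outside S_Delta contribute
            total += gap / bernoulli_kl(mu_i, mu_star + delta)
    return total


# Example: three arms; the second arm is within delta of the best and contributes nothing.
print(delta_regret_lower_bound_coeff([0.7, 0.68, 0.4], delta=0.05))
```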

5 Extension to multi-slot SSA

In the previous sections, we assumed that there was a single slot for which the advertisers were competing. We now look at a more general setting where there are M slots to be allocated to the K agents. As before, each advertiser has exactly one ad for display and the CTR for advertisement i is denoted by μi. Recall that in the case of single slot auctions, the CTR exactly denoted the probability with which an ad received a click. However in the generalized setting of multi-slot auctions, an additional parameter comes into play while computing the click probability due to which the problem becomes much harder [13].

Each position or slot m is associated with a parameter λm called ‘prominence’. λm denotes the probability with which a user observes an ad at slot m + 1 given he has observed the ad at slot m. In order to understand the need for this parameter, a useful scenario to imagine is the listing of web-pages in Google for a query. There are two phases that one can think of once the listing of pages or results have appeared.

  • Phase 1: This is the phase where a user scans through the pages listed. A page listed higher up in the ranking (say second from the top) has more chances of being observed by a user rather than a page that is far below in the ranking (say fifth from the top). λ4, for instance, denotes the probability that a user observes the fifth page, given he has observed the fourth page. Coming back to sponsored ads, we assume that λ0 = 1, that is, the ad listed in the first slot is surely observed. We denote by Γm the probability that an ad at slot m is observed. Γm is computed as,

    $$ \begin{array}{@{}rcl@{}} {\varGamma}_{m} = \begin{cases} 1 & \quad \text{if } m = 1 \\ \prod\limits_{s=1}^{m-1} \lambda_{s} & \quad \text{if } 2 \leq m \leq M\\ 0 & \quad \text{if } m> M\\ \end{cases} \end{array} $$
    (10)

    This modeling assumption for Γm is known as the position dependent cascade model.

  • Phase 2: After having scanned through the list, the user decides to click one or more of the shown ads. In the multi-slot setting [14], it is assumed that multiple ads in a listing may receive clicks. The probability that ad i receives a click when shown at slot m is Γmμi.

We assume that λm, m = 1,…, M are known to the planner a-priori. The problem of learning these parameters along with the CTRs μi is much harder in the presence of strategic agents. Therefore, in this section, we work with the assumption that the λm’s, and hence the Γm’s, are known. In Section 6.2, we give pointers for the design of mechanisms where the Γm’s are unknown.
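Since the Γm’s are assumed known, they can be computed directly from the prominence parameters via Eq. 10. The following is a minimal sketch of that computation; the function name and list layout are ours.

```python
def gamma(lambdas, m, M):
    """Gamma_m under the position dependent cascade model (Eq. 10).

    lambdas[s-1] = lambda_s : probability of observing slot s+1 given slot s
    m, M                    : slot index (1-based) and total number of slots
    """
    if m == 1:
        return 1.0          # the top slot is observed with certainty
    if m > M:
        return 0.0          # slots beyond M are never observed
    prod = 1.0
    for s in range(1, m):   # product lambda_1 * ... * lambda_{m-1}
        prod *= lambdas[s - 1]
    return prod
```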

The above modeling assumptions are as per standard conventions [13]. In the multi-slot setting, the allocation is given to multiple agents at every time step. We denote by \(\mathcal {A}(b, \rho , t)\) ⊂{1,…, K}, the allocation at time t for bids b and click realization ρ. The cardinality of the allocated set \(|\mathcal {A}(b, \rho , t)| = M\). We also use the notation \(\mathcal {A}_i(b, \rho , t) = m\) to denote the allocation to agent i at time t is slot m, for the bid profile b, click realization ρ. If an agent i is not allocated any of the M slots at time t, we say \(\mathcal {A}_i(b, \rho , t) = 0\).

We denote by Wi, m the social welfare of agent i, when he is given slot m. Wi, m is the expected valuation that agent i receives when he is given slot m and is computed as,

$$ \begin{array}{@{}rcl@{}} W_{i,m} = {\varGamma}_{m} \mu_{i} \theta_{i} \end{array} $$
(11)

For ease of reference, the additional relevant parameters for the multi-slot setting are provided in Table 3.

Table 3 Additional notations for multi-slot SSA

Having described the multi-slot setting, we now analyze the scenario from the viewpoint of the search engine or central planner. In the ideal scenario, the planner would like to allot the ads exactly to the top M agents with the largest social welfare. This use case has been studied in the literature [14] and exploration separated mechanisms with regret of \(O(T^{2/3})\) have been proposed. Various possible allocations are explored for \(O(T^{2/3})\) time steps for every agent, after which the allocation algorithm is guaranteed to converge to the ideal allocation with high probability. As in the single slot case, \(O(T^{2/3})\) exploration rounds are required to distinguish all the agents perfectly from each other when there are agents whose social welfare values are arbitrarily close.

However, a much more practical problem of interest is to study and design mechanisms when the search engine is indifferent to a gap of Δ in social welfare for every slot. We observe that in cases where the agents are well-separated, \(O(T^{2/3})\) exploration rounds are not required. In fact, \(O(\log T)\) exploration rounds are sufficient to converge to an allocation that is well within the requirements of the search engine.

Having explained the problem, we now formalize the notion of separatedness in this setting. Let K(1),…, K(M) ∈ [K] be the best M agents in terms of their single slot social welfare values, that is, \(\mu _{K^{(1)}} \theta _{K^{(1)}} > \mu _{K^{(2)}} \theta _{K^{(2)}} > {\ldots } >\mu _{K^{(M)}} \theta _{K^{(M)}} \). Let \(W_{*,m} = W_{K^{(m)}, m}\). The ideal solution would be to allocate agent K(m) the slot m. This allocation would yield the largest social welfare, but in the worst case, when the agents’ social welfares are separated by a function of T, converging to this optimal allocation would require \(O(T^{2/3})\) exploration rounds [14]. Instead, for a prescribed value of Δ fixed by the search engine, define the set,

$$ S_{{\varDelta},m } = \left\lbrace i \in [K]: W_{K^{(m)}, m} - W_{i,m} < {\varDelta} \right\rbrace. $$
(12)

SΔ, m is the set of all agents whose social welfare is at most Δ away from the agent K(m) (who should have ideally been given slot m). The planner is indifferent to the regret contributed by the agents in SΔ, m, if any of them are allotted slot m. Hence we define the multi-slot Δ-regret metric as,

$$ \begin{array}{@{}rcl@{}} {\varDelta}\text{-regret} = \sum\limits_{t=1}^{T} \sum\limits_{m=1}^{M} (W_{*,m} - W_{I_{t,m}, m})\mathbbm{1}\left[ I_{t,m} \in [K]\setminus S_{{\varDelta},m}\right] \end{array} $$

The Δ-UCB mechanism for the multi-slot SSA is given in Algorithm 2.

Algorithm 2 The Δ-UCB mechanism for multi-slot SSA
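Algorithm 2 appears only as an image in the original. As a rough sketch consistent with the analysis in this section, the exploitation-phase allocation assigns slot m to the agent with the m-th largest UCB-weighted bid. The code below illustrates only this allocation step; the payments, based on the weighted VCG scheme referred to in Theorem 5, are omitted, and the function name and inputs are ours.

```python
def multi_slot_allocation(mu_plus, bids, M):
    """Exploitation-phase allocation for multi-slot Delta-UCB (sketch).

    mu_plus[i] : frozen UCB index on the CTR of agent i after exploration
    bids[i]    : reported per-click valuation of agent i
    Returns a dict mapping slot (1-based) to the allocated agent.
    """
    K = len(bids)
    # Rank agents by UCB-weighted bid; slot m goes to the m-th ranked agent.
    ranked = sorted(range(K), key=lambda i: mu_plus[i] * bids[i], reverse=True)
    return {m + 1: ranked[m] for m in range(min(M, K))}
```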

We analyze the regret and truthfulness of Algorithm 2. The lemmas and theorems for establishing the results for the multi-slot setting are similar to the single slot setting, however there are subtle differences in proving many of the results. We will highlight them as and when necessary.

Theorem 5

In the multi-slot setting Δ-UCB is Dominant Strategy Incentive Compatible (DSIC) and Individually Rational (IR).

Proof

The mechanism is an implementation of the weighted VCG scheme (with the weight \(w_{i} = \widehat {\mu }_{i}^{+}/\mu _{i}\) for each agent i) and is hence DSIC and IR. □

Lemma 4

For an agent i, the click through rate UCB and LCB indices,

$$ \begin{array}{@{}rcl@{}} \widehat{\mu}_{i,t}^{+} = \widehat{\mu}_{i, t} + \epsilon_{i,t} = \widehat{\mu}_{i, t} + \sqrt{\left( \sum\limits_{m^{\prime}=1}^{M} \frac{M_{i,t}^{(m^{\prime})}}{{\varGamma}_{m^{\prime}}^{2}} \right) \frac{2\log T}{N_{i, t}^{2}}} \end{array} $$
(13)
$$ \begin{array}{@{}rcl@{}} \widehat{\mu}_{i,t}^{-} = \widehat{\mu}_{i,t} - \epsilon_{i,t} = \widehat{\mu}_{i,t} - \sqrt{\left( \sum\limits_{m^{\prime}=1}^{M} \frac{M_{i,t}^{(m^{\prime})}}{{\varGamma}_{m^{\prime}}^{2}} \right) \frac{2\log T}{N_{i, t}^{2}}} \end{array} $$
(14)

satisfy \(P(\mu _{i} \notin [\widehat {\mu }_{i, t}^{-}, \widehat {\mu }_{i,t}^{+}]) \leq 2T^{-4}\) for all t.

Proof

At every time step, we observe samples \(\rho _{I_{t,m}}(t), m = 1, \ldots , M \) corresponding to the clicks of the allocated ads. These samples also encompass slot specific information which must be accounted for in the computation of empirical mean as well as UCB index for μi. For an agent i, let the random variable Ci, m denote whether ad i receives a click at slot m. Therefore Ci, m is a Bernoulli random variable with bias Γmμi.

We obtain a sample ρi(⋅) of Ci, m when ad i is allocated slot m. However, it is the samples of Ci, m/Γm that give us an unbiased estimator for μi. Therefore, the random variable of interest is the scaled Bernoulli random variable,

$$ \begin{array}{@{}rcl@{}} D_{i,m} = \begin{cases} 0 & \quad \text{w.p } 1- {\varGamma}_{m}\mu_{i} \\ 1/{\varGamma}_{m} & \quad \text{w.p } {\varGamma}_{m}\mu_{i} \end{cases} \end{array} $$
(15)

Di, m is bounded in [0,1/Γm] and \(\mathbb {E}[D_{i,m}]\) is μi. Also,

$$ \begin{array}{@{}rcl@{}} \log \mathbb{E}\left[\exp(\lambda (D_{i,m} - \mu_{i})\right] \leq \frac{\lambda^{2}}{8 {{\varGamma}_{m}^{2}}} \text{ (by Hoeffding's Lemma)} \end{array} $$

Consider the scenario where, for an ad i, a single sample click is available from each slot. Let Xi, m denote this sample of Ci, m. Assume Xi, m are all independent and \( \widehat {\mu }_{i} = 1/M{\sum }_{m=1}^{M} X_{i,m}/{\varGamma }_{m}\). \(\mathbb {E}[\widehat {\mu _{i}}] = \mu _{i}\). Now,

$$ \begin{array}{@{}rcl@{}} P(\widehat{\mu}_{i} - \mu_{i} > \epsilon) &=& P\left( \sum\limits_{m=1}^{M} X_{i,m}/{\varGamma}_{m} - M \mu_{i} > \epsilon M \right) \\ &=& P\left( \exp\left( \lambda \left( \sum\limits_{m=1}^{M} X_{i,m}/{\varGamma}_{m} - M \mu_{i} \right)\right) > \exp(\lambda\epsilon M) \right) \\ &\leq& \mathbb{E}\left[\exp\left( \lambda \left( \sum\limits_{m=1}^{M} X_{i,m}/{\varGamma}_{m} - M \mu_{i} \right)\right)\right] \big/ \exp(\lambda\epsilon M) \quad \text{(by Markov inequality)} \\ &=& \prod\limits_{m=1}^{M} \mathbb{E}\left[\exp(\lambda (X_{i,m}/{\varGamma}_{m} - \mu_{i} ))\right] \big/ \exp(\lambda\epsilon M) \quad \text{(by independence of } X_{i,m}\text{)} \\ &\leq& \exp\left( \sum\limits_{m=1}^{M} \frac{\lambda^{2} }{8 {{\varGamma}_{m}^{2}}} - \lambda M\epsilon\right) \quad \text{(by Hoeffding's lemma)} \end{array} $$
(16)

In order to tighten the above bound, one must find the λ that minimizes \(\exp ({\sum }_{m=1}^{M} \frac {\lambda ^{2} }{8 {{\varGamma }_{m}^{2}}} - \lambda M\epsilon )\). Setting \(\lambda = \lambda ^{*} = 4M\epsilon /\eta \), where \(\eta = {\sum }_{m=1}^{M} 1/{{\varGamma }_{m}^{2}}\), achieves the minimum value. Therefore,

$$ \begin{array}{@{}rcl@{}} P(& \widehat{\mu_{i}} - \mu_{i} > \epsilon) \leq \exp(-2M^{2} \epsilon^{2} /\eta) \end{array} $$
(17)

In order to obtain a δ confidence on \(P(\widehat {\mu _{i}} - \mu _{i} > \epsilon ) \), 𝜖 must be set so that \(\exp (-2M^{2} \epsilon ^{2} /\eta ) = \delta = T^{-4}\). Therefore, \(\epsilon = \sqrt {\sum \limits _{m=1}^{M} \left (\frac {1}{{{\varGamma }_{m}^{2}}} \right ) \frac {2 \log T}{M^{2}}}\). In the above analysis we assumed that one sample was available from each slot. When we have a total of Ni, t independent samples for ad i, with \(M_{i,t}^{(m)}\) samples from slot m at time t, we have \(\eta = {\sum }_{m=1}^{M} M_{i,t}^{(m)}/{{\varGamma }_{m}^{2}}\) and therefore \(\epsilon _{i,t} = \sqrt {\left (\sum \limits _{m^{\prime }=1}^{M} \frac {M_{i,t}^{(m^{\prime })}}{{\varGamma }_{m^{\prime }}^{2}} \right ) \frac {2\log T}{N_{i, t}^{2}}}\), completing the proof. □

A noteworthy feature of our estimates is the following. An allocation of an ad i in a slot m yields a sample for the computation of not only \(\widehat {W}_{i,m,t}\), but also of \(\widehat {W}_{i,m^{\prime },t}\) for all slots \(m^{\prime } \in \{1, \ldots , M \}\). This is because Γm is known to the planner a-priori. Note therefore that Ni, t, the number of allocations that ad i receives till time t, counts the allocations agent i receives across all the slots.
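The per-slot reweighting and the confidence width of Lemma 4 can be summarized by the short sketch below, which computes the UCB index \(\widehat{\mu}_{i,t}^{+}\) of an ad from its click samples; the function name and the sample layout are ours.

```python
import math


def ctr_ucb_index(samples, gammas, T):
    """UCB index for mu_i from per-slot click samples (cf. Lemma 4).

    samples : list of (slot, clicked) pairs observed for ad i (slots 1-based)
    gammas  : gammas[m-1] = Gamma_m, probability that slot m is observed
    T       : time horizon used in the confidence width
    """
    n = len(samples)  # N_{i,t}: total allocations of ad i, across all slots
    # Unbiased estimate: a click from slot m is reweighted by 1 / Gamma_m.
    mu_hat = sum(clicked / gammas[slot - 1] for slot, clicked in samples) / n
    # eta = sum_m M_{i,t}^{(m)} / Gamma_m^2, accumulated sample by sample.
    eta = sum(1.0 / gammas[slot - 1] ** 2 for slot, _ in samples)
    eps = math.sqrt(eta * 2.0 * math.log(T) / n ** 2)
    return mu_hat + eps
```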

Lemma 5

For an agent i and slot m, the social welfare UCB indices for agent i,

$$ \begin{array}{@{}rcl@{}} \widehat{W}_{i,m,t}^{+} &=& {\varGamma}_{m} \widehat{\mu}_{i, t} \theta_{i} + \epsilon_{i,m,t} = {\varGamma}_{m} \widehat{\mu}_{i, t} \theta_{i} \\&&+ \sqrt{\left( \sum\limits_{m^{\prime}=1}^{M} \frac{M_{i,t}^{(m^{\prime})}}{{\varGamma}_{m^{\prime}}^{2}} \right) \frac{2{\theta_{i}^{2}} {{\varGamma}_{m}^{2}}\log T}{N_{i, t}^{2}}} \end{array} $$
(18)
$$ \begin{array}{@{}rcl@{}} \widehat{W}_{i,m,t}^{-} &=& {\varGamma}_{m} \widehat{\mu}_{i, t} \theta_{i} - \epsilon_{i,m,t} = {\varGamma}_{m} \widehat{\mu}_{i,t} \theta_{i} \\&&- \sqrt{\left( \sum\limits_{m^{\prime}=1}^{M} \frac{M_{i,t}^{(m^{\prime})}}{{\varGamma}_{m^{\prime}}^{2}} \right) \frac{2{\theta_{i}^{2}} {{\varGamma}_{m}^{2}}\log T}{N_{i, t}^{2}}} \end{array} $$
(19)

satisfy \(P(W_{i,m} \notin [\widehat {W}_{i, m, t}^{-}, \widehat {W}_{i,m, t}^{+}]) \leq 2T^{-4}\) for all t.

Proof

The proof idea is similar to Lemma 1. □

Lemma 6

Suppose at time step t, \(N_{j,t} > \frac {8\theta _{max}^{2} \log T}{{\varDelta }^{2}} \forall j \in [K]\). Then ∀i ∈ [K] and ∀m ∈ [M], 2𝜖i, m, t < Δ.

Proof

The proof is similar to Lemma 2. □

Lemma 7

For an agent i, slot m and time t, let Bi, m, t be the event \(B_{i,m,t} = \{\omega : W_{i,m} \notin [\widehat {W}_{i,m,t}^{-}(\omega ), \widehat {W}_{i,m,t}^{+}(\omega )] \}\). Define the event \(G =\underset {t}{\bigcap } \underset {i}{\bigcap } \underset {m}{\bigcap } B_{i,m,t}^{c}\). Then \(P(G) \geq 1 - \frac {2}{T^{2}}\).

Proof

The proof has some subtle differences from Lemma 3 because in the multi-slot extension, the events Bi, m, t are not independent across the slots.

Observation:

If an element ω from the set of outcomes is such that ωBi, m, t, then \(\omega \in B_{i,m^{\prime },t} \forall m^{\prime } \in [M]\). This is because, for any two slots m and \(m^{\prime }\),

$$ \begin{array}{@{}rcl@{}} W_{i,m} \notin [\widehat{W}_{i,m,t}^{-}, \widehat{W}_{i,m,t}^{+}] &\iff& \mu_{i} \notin [\widehat{\mu}_{i,t}^{-}, \widehat{\mu}_{i,t}^{+}]\\ & \iff& W_{i,m^{\prime}} \notin [\widehat{W}_{i,m^{\prime},t}^{-}, \widehat{W}_{i,m^{\prime},t}^{+}] \end{array} $$

Therefore \(P(\bigcup _{m} B_{i,m,t}) = P(B_{i,1,t})\). From Lemma 5, \(P(\bigcup _{m} B_{i,m,t}) = P(B_{i,1,t}) \leq 2T^{-4}\). Hence,

$$ \begin{array}{@{}rcl@{}} P(G) &=& 1 - P\left( \underset{t}{\bigcup} \underset{i}{\bigcup} \underset{m}{\bigcup} B_{i,m,t}\right) = 1 - P\left( \underset{t}{\bigcup} \underset{i}{\bigcup} B_{i,1,t}\right) \\ &\geq& 1 - \frac{2}{T^{2}} \text{(as in Lemma 3)}. \end{array} $$

Theorem 6

Suppose at time t, \(N_{j,t}>8\theta _{max}^{2} \log T/{\varDelta }^{2}\) ∀j ∈ [K]. Then ∀m ∈ [M],∀i ∈ [K] ∖ SΔ, m, \(\widehat {W}_{K^{(m)}, m, t}^{+} > \widehat {W}_{i, m, t}^{+}\) with probability at least \(1 - 2/T^{4}\).

Proof

Suppose at time t where \(N_{j,t} > 8 \theta _{max}^{2} \log T/{\varDelta }^{2}\)j ∈ [K], there exists some m ∈ [M] such that \(\widehat {W}_{K^{(m)}, m, t}^{+} < \widehat {W}_{i, m, t}^{+}\). (Note that this statement does not arise from any assumptions on the allocation, for instance, that agent i is given slot m. This is the major difference from Theorem 2). But the relation between the true social welfare values of these agents is \(W_{K^{(m)}, m} > W_{i, m}\). Then one of the following three conditions must have occurred, like in proof of Theorem 2.

  • Condition 1: \(W_{i,m} < \widehat {W}_{i,m,t}^{-}\). This condition implies a drastic overestimate of the sub-optimal arm i so that the true mean social welfare Wi, m is even below the LCB index \(\widehat {W}_{i,m,t}^{-}\). Figure 4 captures this condition.

  • Condition 2: \( W_{K^{(m)},m} > \widehat {W}_{K^{(m)},m, t}^{+}\). This implies an underestimate of the optimal arm so that the true mean social welfare \(W_{K^{(m)},m} \) lies above even the UCB index \(\widehat {W}_{K^{(m)},m,t}^{+}\). See Fig. 5.

  • Condition 3: \(W_{K^{(m)},m} - W_{i,m} < 2\epsilon _{i,m,t}\). This implies an overlap in the confidence intervals of the optimal and sub-optimal arm. Even though Conditions 1 and 2 are false, the UCB of the sub-optimal arm i is still greater than the UCB of the optimal arm K(m). See Fig. 6 for an illustration of this condition.

Fig. 4 Condition 1, Proof of Theorem 6

Fig. 5 Condition 2, Proof of Theorem 6

Fig. 6 Condition 3, Proof of Theorem 6

From the figure, \( W_{K^{(m)},m} - W_{i,m} \leq \widehat {W}_{i,m,t}^{+} - \widehat {W}_{i,m,t}^{-} \leq 2 \epsilon _{i, m, t} \). If all the three conditions above were false, then,

$$ \begin{array}{@{}rcl@{}} \widehat{W}_{K^{(m)},m,t}^{+} &>& W_{K^{(m)},m} > W_{i,m} + 2 \epsilon_{i,m,t} > \widehat{W}_{i,m,t}^{-} + 2 \epsilon_{i,m,t} \\ & =& \widehat{W}_{i,m, t}^{+} \text{ (A contradiction!)} \end{array} $$

As per the statement of the theorem, \(N_{i,t} > 8 \theta _{max}^{2} \log T/{\varDelta }^{2}\). Therefore by Lemma 6, 2𝜖i, m, t < Δ. For agent i ∈ [K] ∖ SΔ, m, \(W_{K^{(m)},m} - W_{i, m} \geq {\varDelta } > 2\epsilon _{i,m,t}\). Therefore, Condition 3 above does not hold true. So,

$$ \begin{array}{@{}rcl@{}} P(\widehat{W}_{i,m,t}^{+} \geq \widehat{W}_{K^{(m)},m,t}^{+}) &\leq& P(\text{Condition 1}) + P(\text{Condition 2}) \\ & \leq & T^{-4} + T^{-4} = \frac{2}{T^{4}} \end{array} $$
$$ \begin{array}{@{}rcl@{}} P(\widehat{W}_{K^{(m)},m, t}^{+} >\widehat{W}_{i,m, t}^{+} ) &=& 1- P(\widehat{W}_{i,m, t}^{+} \geq \widehat{W}_{K^{(m)},m,t}^{+}) \geq 1 - \frac{2}{T^{4}} \end{array} $$

where, as in Theorem 2, each of the two condition probabilities is at most \(T^{-4}\). □

Theorem 7

If the Δ-UCB mechanism is executed in the multiple slot scenario for a total time horizon of T rounds, it achieves an expected Δ-regret of \(O(\log T)\).

Proof

The proof idea has some subtle differences from the proof of Theorem 3. As before, we first compute the expected Δ-regret conditional on G. During the exploration rounds, the mechanism incurs a Δ-regret of at most \(\xi =\frac {8MK \theta _{max}^{3} \log T}{{\varDelta }^{2}}\).

$$ \begin{array}{@{}rcl@{}} \mathbb{E}\left[ {\varDelta}\text{-regret}|G \right] \leq \xi + \sum\limits_{t=\gamma+1}^{T} \sum\limits_{m=1}^{M} (W_{K^{(m)},m} - W_{I_{t,m},m}) \mathbbm{1}\left[I_{t,m} \in [K] \setminus S_{{\varDelta},m}\right] \end{array} $$

We will now show that the second term above evaluates to zero. For any m, the cardinality of SΔ, m is at least m. This is because for all K(j) above m in the ranking of agents (j < m), \(W_{K^{(m)}, m} - W_{K^{(j)},m} < 0 < {\varDelta }\) as \(W_{K^{(j)},m}> W_{K^{(m)}, m}\); therefore there are at least m − 1 such agents in SΔ, m. Also K(m) ∈ SΔ, m as \(W_{K^{(m)}, m} - W_{K^{(m)},m}=0 < {\varDelta }\). Therefore ∀j ∈{1,…, m}, K(j) ∈ SΔ, m. While allocating slot m, at least one of the agents in SΔ, m must still be unallocated: this follows from the pigeonhole principle, since only m − 1 slots have been filled so far while SΔ, m contains at least m agents. Now if the allocated agent for slot m, It, m ∈ [K] ∖ SΔ, m, one of the following two cases occurs.

  • Case 1: The ideal agents K(1),…, K(m− 1) for all the previous slots 1,…, m − 1 have already been allocated before the allocation of slot m. This means that K(m) has not been allocated yet. Also, \(\widehat {W}_{(I_{t,m}),m, \gamma }^{+} > \widehat {W}_{K^{(m)},m, \gamma }^{+} \). Since G is true and t > γ, the above event cannot occur (by Theorem 6).

  • Case 2: The agent K(m) has already been allocated to some other slot before the allocation of slot m has begun. Therefore there is some agent K(j), j < m, with a larger social welfare value, who has still not been allocated. That is, \(W_{K^{(j)},m} > W_{K^{(m)},m} > W_{(I_{t,m}),m}\). Given that It, m ∉ SΔ, m, we can deduce that It, m ∉ SΔ, j. This is because,

    $$ \begin{array}{@{}rcl@{}} &&W_{K^{(m)},m} - W_{(I_{t,m}),m} \geq {\varDelta} \\& \implies& W_{K^{(j)},m} - W_{(I_{t,m}),m} \geq {\varDelta} \\& \implies& \mu_{K^{(j)}} \theta_{K^{(j)}} - \mu_{I_{t,m}} \theta_{I_{t,m}} \geq {\varDelta}/{\varGamma}_{m} \\& \implies& {\varGamma}_{j} (\mu_{K^{(j)}} \theta_{K^{(j)}} - \mu_{I_{t,m}} \theta_{I_{t,m}}) \geq {\varGamma}_{j} {\varDelta}/{\varGamma}_{m} \\& \implies& W_{K^{(j)},j} - W_{(I_{t,m}),j} \geq {\varDelta} \end{array} $$
    (20)

The last line in the above chain of implications holds as \({\varGamma}_{j} > {\varGamma}_{m}\) for j < m. But \(\widehat {W}_{K^{(j)},m, \gamma }^{+} < \widehat {W}_{(I_{t,m}),m, \gamma }^{+}\). Then the inequality \(\widehat {W}_{K^{(j)},j, \gamma }^{+} < \widehat {W}_{(I_{t,m}),j, \gamma }^{+}\) also holds, due to the way the slot-specific UCB indices are computed. From Theorem 6 applied to slot j, we find that \(\widehat {W}_{K^{(j)},j, \gamma }^{+} > \widehat {W}_{(I_{t,m}),j, \gamma }^{+}\). Again, this cannot happen as G is true and t > γ. Therefore we get that \(\mathbb {E}\left [ {\varDelta }\text {-regret}|G \right ] \leq \xi \). A small numerical check of the chain of implications in Eq. 20 is sketched below.
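The following sketch assumes the social welfare decomposition \(W_{i,m} = {\varGamma}_m \mu_i \theta_i\) used above and verifies, on random instances, that whenever the slot-m gap to \(K^{(m)}\) is at least Δ and the prominences are non-increasing in the slot index, the slot-j gap to \(K^{(j)}\) (j < m) is also at least Δ. The instance generator and all names are our own, purely for illustration.

```python
import numpy as np

# Numerical sanity check of the implication chain in Eq. 20, assuming
# W_{i,m} = Gamma_m * mu_i * theta_i and Gamma_1 >= Gamma_2 >= ... >= Gamma_M.

rng = np.random.default_rng(1)

def check_once(K=6, M=4, Delta=0.05):
    mu = rng.uniform(0.1, 0.9, size=K)                     # CTRs
    theta = rng.uniform(0.1, 1.0, size=K)                  # valuations per click
    Gamma = np.sort(rng.uniform(0.2, 1.0, size=M))[::-1]   # non-increasing prominences
    q = mu * theta                                         # per-click welfare mu_i * theta_i
    order = np.argsort(-q)                                 # K^{(1)}, K^{(2)}, ... by decreasing q

    for m in range(1, M):                                  # slot index (0-based here)
        Km = order[m]
        for j in range(m):                                 # j < m, so Gamma[j] >= Gamma[m]
            Kj = order[j]
            for i in range(K):
                gap_m = Gamma[m] * (q[Km] - q[i])          # W_{K^{(m)},m} - W_{i,m}
                gap_j = Gamma[j] * (q[Kj] - q[i])          # W_{K^{(j)},j} - W_{i,j}
                if gap_m >= Delta:                         # premise of Eq. 20
                    assert gap_j >= Delta, "implication chain violated"

for _ in range(1000):
    check_once()
print("Eq. 20 implication chain held on all random instances.")
```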

Also, \(P(G^{c}) = 1 - P(G)< \frac {2}{T^{2}}\) from Lemma 7. Putting all the steps together,

$$ \begin{array}{@{}rcl@{}} \mathbb{E}\left[ {\varDelta}\text{-regret}\right] &=& \mathbb{E}\left[ {\varDelta}\text{-regret} |G \right] P(G) + \mathbb{E}\left[ {\varDelta}\text{-regret} |G^{c} \right] P(G^{c}) \\ & \leq& \frac{8KM\theta_{max}^{3} \log T}{{\varDelta}^{2}} \cdot 1 + TM\theta_{max}\cdot\frac{2}{T^{2}} \\ & \leq& \frac{8KM\theta_{max}^{3} \log T}{{\varDelta}^{2}} + 2\theta_{max} \end{array} $$
(21)

The simplification in the second line is because \(\mathbb {E}\left [{\varDelta }\text {-regret} |G^{c} \right ] \leq TM\theta_{max}\). In the last line we use the fact that \(M \leq T\). This completes the proof. □
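To get a feel for the bound in Eq. 21, the short sketch below evaluates \(\frac{8KM\theta_{max}^{3}\log T}{{\varDelta}^{2}} + 2\theta_{max}\) for a few parameter choices. The numbers are arbitrary and only meant to show the logarithmic growth in T, in contrast with the \(T^{2/3}\) scaling of exploration-separated mechanisms that target the worst case.

```python
import math

# Illustrative evaluation of the Delta-regret bound of Theorem 7:
#   E[Delta-regret] <= 8*K*M*theta_max^3*log(T)/Delta^2 + 2*theta_max
K, M = 10, 3          # number of ads and slots (arbitrary)
theta_max = 1.0       # maximum valuation per click (arbitrary)
Delta = 0.1           # tolerance level chosen by the planner (arbitrary)

for T in (10**4, 10**6, 10**8):
    bound = 8 * K * M * theta_max**3 * math.log(T) / Delta**2 + 2 * theta_max
    print(f"T = {T:>9}: Delta-regret bound ~ {bound:,.0f}  "
          f"(vs. T^(2/3) = {T ** (2/3):,.0f})")
```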

6 Extensions to other variants of multi-slot SSA

In this section, we look at other variants in the multi-slot SSA setting and discuss how our mechanism can be adapted to such settings.

6.1 Position and ad dependent cascade model

We have explained our algorithm and performed the analysis for the position dependent cascade model for SSA, where the function \({\varGamma}_{m}\) is characterized by Eq. 10 and is known to the planner a priori. A more general model would be one where \({\varGamma}_{m}\) may also depend on the ad displayed at position m. Our mechanism can also be used in such scenarios and the same analysis will hold.

6.2 Handling the case of unknown Γm

We have assumed that the functions \({\varGamma}_{m}\) are known to the planner a priori. Now suppose that the \({\varGamma}_{m}\)s need to be learnt. The same allocation scheme as in Algorithm 2 may still be used. However, the proposed payment scheme in Algorithm 2 can no longer be computed, as the payments use the \({\varGamma}_{m}\)s, which are unknown.

In order to handle such a scenario, we must first obtain estimates of the \({\varGamma}_{m}\)s. It is known that the parameter for the first slot is \({\varGamma}_{1} = 1\); only \({\varGamma}_{2},\ldots, {\varGamma}_{M}\) need to be estimated. We first describe a mechanism which relies on an arbitrary learning algorithm to provide estimates \( \widehat {{\varGamma }}_{2}, \ldots ,\widehat {{\varGamma }}_{M}\). Thereafter, we remark on possible learning schemes.

Proposition 1

Suppose we have a learning scheme that gives us estimates \(\widehat {{\varGamma }}_{2}, \ldots ,\widehat {{\varGamma }}_{M}\) such that, \(\widehat {{\varGamma }}_{2} \geq \widehat {{\varGamma }}_{3} \geq {\ldots } \geq \widehat {{\varGamma }}_{M}\) and \(0 \leq \widehat {{\varGamma }}_{m} \leq 1 \text { for } m = 2, \ldots , M\). Let \(\widehat {{\varGamma }}_{1} = 1\).

We propose a weighted VCG mechanism [28], which is known to be DSIC truthful and is also IR. Suppose the private valuation of agent i for a click is \(\theta_{i}\). Let x ∈ {0,1}^{K×M} be an outcome of the allocation, with \(x_{im} = 1\) if ad i is allotted slot m and zero otherwise. The valuation function of agent i in this case is,

$$ \begin{array}{@{}rcl@{}} v_{i}(x, \theta_{i}) = \sum\limits_{m=1}^{M} {\varGamma}_{m} \mu_{i} \theta_{i} x_{im} \end{array} $$
(22)

Define a weight vector \(w_{i} \in \mathbb{R}^{M}\) for every agent i, whose entry corresponding to slot m is \(w_{i,m} = \frac {\widehat {\mu }_{i}^{+} \widehat {{\varGamma }}_{m}}{\mu _{i} {\varGamma }_{m}}\). Here, \(\widehat {\mu }_{i}^{+}\) is the UCB index corresponding to the CTR of ad i, computed after the fixed number of exploration rounds as in Algorithm 2. However, in this scenario, the UCB index is constructed using samples of the clicks from allocations in the first slot alone.
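The role of the weights is easiest to see by plugging them into the per-slot weighted value: under the valuation in Eq. 22, \({\varGamma}_m \mu_i b_i \, w_{i,m} = \widehat{{\varGamma}}_m \widehat{\mu}_i^{+} b_i\), so the unknown \(\mu_i\) and \({\varGamma}_m\) cancel. The following minimal sketch, with made-up numbers, just illustrates this cancellation.

```python
# Minimal illustration of the weight cancellation in the weighted VCG objective.
# All numbers below are made up; mu_hat_plus stands for the UCB index of the CTR.

mu = 0.3              # true (unknown) CTR of ad i
mu_hat_plus = 0.35    # UCB index of the CTR after exploration
gamma = 0.6           # true (unknown) prominence of slot m
gamma_hat = 0.55      # estimate of the prominence
b = 2.0               # bid of agent i

w_im = (mu_hat_plus * gamma_hat) / (mu * gamma)   # weight w_{i,m}
weighted_value = gamma * mu * b * w_im            # Gamma_m * mu_i * b_i * w_{i,m}

assert abs(weighted_value - gamma_hat * mu_hat_plus * b) < 1e-12
print(f"weighted per-slot value = {weighted_value:.4f} "
      f"= Gamma_hat_m * mu_hat_i^+ * b_i")
```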

Our weighted VCG mechanism is described in Fig. 7. The mechanism uses the allocation,

$$ \begin{array}{@{}rcl@{}} A^{*}(b_{i}, b_{-i}) = \underset{x}{\arg\max} \sum\limits_{i=1}^{K} \sum\limits_{m=1}^{M} {\varGamma}_{m} \mu_{i} b_{i} x_{im} w_{i,m} \end{array} $$
Fig. 7: Δ-UCB mechanism for the position dependent cascade model using estimates for \({\varGamma}_{m}\)s

Note, however, that this allocation rule boils down to the same allocation used in Algorithm 2. This is because the estimates \(\widehat {{\varGamma }}_{m}\) are monotonically non-increasing in m. The procedure for obtaining the allocation \(A^{*}(b_{i}, b_{-i})\) is the following: we sort the agents based on \(\widehat {\mu }_{i}^{+} b_{i}\) and allocate the slots to the best M agents. Therefore, the allocation rule is independent of the \({\varGamma}_{m}\)s and is equivalent to,

$$ \begin{array}{@{}rcl@{}} A^{*}(b_{i}, b_{-i}) = \underset{x}{\arg\max} \sum\limits_{i=1}^{K} \sum\limits_{m=1}^{M} \widehat{\mu}_{i}^{+} b_{i} x_{im} \end{array} $$
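A minimal sketch of this allocation rule: sort the agents by \(\widehat{\mu}_i^{+} b_i\) and assign the top M agents to slots 1,…, M in that order. The function and variable names are ours; the rule itself is the one stated above.

```python
import numpy as np

def allocate(mu_hat_plus, bids, M):
    """Allocate M slots by sorting agents on mu_hat_i^+ * b_i (ties broken arbitrarily).

    Returns a list slot_of_agent where slot_of_agent[i] is the slot (1-based)
    given to agent i, or 0 if agent i receives no slot.
    """
    score = np.asarray(mu_hat_plus) * np.asarray(bids)
    ranking = np.argsort(-score)                 # agents in decreasing order of score
    slot_of_agent = [0] * len(score)
    for m, agent in enumerate(ranking[:M], start=1):
        slot_of_agent[agent] = m
    return slot_of_agent

# Example with made-up UCB indices and bids: K = 4 agents, M = 2 slots.
print(allocate(mu_hat_plus=[0.5, 0.2, 0.4, 0.3], bids=[2.0, 3.0, 1.5, 2.5], M=2))
# scores = [1.0, 0.6, 0.6, 0.75]; agent 0 gets slot 1 and agent 3 gets slot 2.
```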

The expected payment to be made by agent i when allocated a slot \(m^{\prime }\) is,

$$ \begin{array}{@{}rcl@{}} \mathbb{E}[{P_{i}^{t}}(b, \rho)] = \frac{\mu_{i} {\varGamma}_{m^{\prime}}}{\widehat{\mu}_{i,t}^{+} \widehat{{\varGamma}}_{m^{\prime}}} \sum\limits_{j \neq i} \sum\limits_{m=m^{\prime}+1}^{M+1} \widehat{\mu}_{j,t}^{+} b_{j} x_{jm} (\widehat{{\varGamma}}_{m-1} - \widehat{{\varGamma}}_{m} ) \end{array} $$

The above is the externality-based payment prescribed by weighted VCG. However, since we adopt the pay-per-click scheme,

$$ \begin{array}{@{}rcl@{}} {P_{i}^{t}}(b,\rho) = \frac{\rho_{i}(t)}{\widehat{\mu}_{i,t}^{+} \widehat{{\varGamma}}_{m^{\prime}}} \sum\limits_{j \neq i} \sum\limits_{m=m^{\prime}+1}^{M+1} \widehat{\mu}_{j,t}^{+} b_{j} x_{jm} (\widehat{{\varGamma}}_{m-1} - \widehat{{\varGamma}}_{m} ) \end{array} $$
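The payment rule above is straightforward to compute once the estimates \(\widehat{{\varGamma}}_m\) and the allocation are fixed. The sketch below implements the formula literally; the variable names, the example instance, and the convention of taking \(\widehat{{\varGamma}}_{M+1} = 0\) with "position" M + 1 held by the best unallocated agent are our assumptions for illustration.

```python
import numpy as np

def pay_per_click_payment(i, m_prime, mu_hat_plus, bids, x, gamma_hat, rho_i_t):
    """Pay-per-click payment P_i^t for agent i allocated slot m_prime (1-based).

    x is a K x (M+1) 0/1 matrix with x[j, m-1] = 1 if agent j occupies position m;
    gamma_hat is a length-(M+1) vector of estimates with gamma_hat[M] = 0 assumed
    (our convention for the M+1 term of the outer sum).
    rho_i_t is the click indicator rho_i(t) for agent i in round t.
    """
    K, M_plus_1 = x.shape
    total = 0.0
    for j in range(K):
        if j == i:
            continue
        for m in range(m_prime + 1, M_plus_1 + 1):        # m = m'+1, ..., M+1
            total += (mu_hat_plus[j] * bids[j] * x[j, m - 1]
                      * (gamma_hat[m - 2] - gamma_hat[m - 1]))
    return rho_i_t / (mu_hat_plus[i] * gamma_hat[m_prime - 1]) * total

# Example: K = 3 agents, M = 2 slots; agent 0 in slot 1, agent 2 in slot 2,
# agent 1 as the best unallocated agent in "position" M+1 (our assumption).
mu_hat_plus = [0.5, 0.2, 0.4]
bids = [2.0, 3.0, 1.5]
x = np.array([[1, 0, 0],
              [0, 0, 1],
              [0, 1, 0]])
gamma_hat = [1.0, 0.6, 0.0]      # Gamma_hat_1, Gamma_hat_2, Gamma_hat_{M+1} = 0
print(pay_per_click_payment(0, 1, mu_hat_plus, bids, x, gamma_hat, rho_i_t=1))
```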

Therefore, the computation of the payments is now feasible. The above weighted VCG scheme is DSIC truthful and IR; the proof follows from the standard weighted VCG argument with the weights as defined above. We now remark on the Δ-regret of the mechanism.

6.2.1 Remarks on learning \(\widehat {{\varGamma }}_{m}\) and computation of Δ-regret

In the above mechanism, we have assumed that the estimates \(\widehat {{\varGamma }}_{m}\) satisfy Proposition 1. The allocation scheme described above ultimately does not rely on these estimates, although the weights \(w_{i,m}\) use them. The mechanism therefore uses the estimates only in the payment rule. We now make an important observation.

Observation::

When any set of estimates \(\{\widehat {{\varGamma }}_{m}\}\), m = 1,…, M satisfying Proposition 1 is used in the mechanism above, the mechanism is DSIC truthful, IR and suffers only logarithmic Δ-regret.

The reason is that the mechanism is an instance of the weighted VCG mechanism and therefore is DSIC truthful and IR, for any estimates of the \({\varGamma}_{m}\)s. As far as the Δ-regret in social welfare is concerned, it is determined by the allocation rule. The allocation rule used turns out to be identical to the one used when \({\varGamma}_{m}\) is known, and is independent of the estimates. Note that it is now possible to minimize the regret in payments by choosing estimates \(\widehat {{\varGamma }}_{m}\) that maximize the payments while satisfying the constraints in Proposition 1. This leads to a constrained optimization problem which can be solved; one possible formulation is sketched below. However, the current work focuses on minimizing the Δ-regret in social welfare, and the problem of minimizing the regret in payments remains open.
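A hedged sketch of one such formulation: treat \(\widehat{{\varGamma}}_2, \ldots, \widehat{{\varGamma}}_M\) as decision variables, maximize the total payment for a fixed allocation, and impose the monotonicity and boundedness constraints of Proposition 1. The objective below (total pay-per-click payment with the click indicators set to 1), the instance numbers, the "position M + 1" convention, and the function names are all our own illustrative choices, not something prescribed in the paper.

```python
import numpy as np
from scipy.optimize import minimize

# Decision variables: Gamma_hat_2, ..., Gamma_hat_M (Gamma_hat_1 = 1, Gamma_hat_{M+1} = 0).
mu_hat_plus = np.array([0.5, 0.4, 0.2])   # UCB indices of CTRs (made up)
bids = np.array([2.0, 1.5, 2.0])          # bids (made up)
M = 2                                     # number of slots
# fixed allocation: agent 0 -> slot 1, agent 1 -> slot 2, agent 2 -> "position" M+1
position = {0: 1, 1: 2, 2: 3}

def total_payment(g):
    # g = [Gamma_hat_2, ..., Gamma_hat_M]; prepend Gamma_hat_1 = 1, append Gamma_hat_{M+1} = 0
    gamma_hat = np.concatenate(([1.0], g, [0.0]))
    total = 0.0
    for i, m_prime in position.items():
        if m_prime > M:
            continue                       # unallocated agents pay nothing
        pay = 0.0
        for j, m in position.items():
            if j != i and m > m_prime:
                pay += mu_hat_plus[j] * bids[j] * (gamma_hat[m - 2] - gamma_hat[m - 1])
        total += pay / (mu_hat_plus[i] * gamma_hat[m_prime - 1])
    return total

# constraints of Proposition 1: 1 >= Gamma_hat_2 >= ... >= Gamma_hat_M >= 0
constraints = [{"type": "ineq", "fun": lambda g: 1.0 - g[0]}]
constraints += [{"type": "ineq", "fun": lambda g, k=k: g[k] - g[k + 1]} for k in range(M - 2)]
bounds = [(1e-6, 1.0)] * (M - 1)           # keep Gamma_hat_m > 0 so payments stay finite

res = minimize(lambda g: -total_payment(g), x0=np.full(M - 1, 0.5),
               bounds=bounds, constraints=constraints, method="SLSQP")
print("estimates maximizing the (illustrative) payment objective:", res.x)
```

In this toy instance the optimizer pushes the estimate to a corner of the feasible region, which illustrates why the choice of estimates interacts with the payment regret even though it leaves the allocation, and hence the Δ-regret in social welfare, untouched.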

7 Conclusion

We have studied a more practical use case of MAB mechanisms in which the planner has the option to specify a tolerance level Δ for sub-optimal arms. The papers in the literature on MAB mechanisms propose schemes that target the worst-case scenario where the arms are arbitrarily close, and therefore prescribe investing a huge number of exploration rounds (Ω(T^{2/3})) to perfectly distinguish the arms. However, the planner may not want to perfectly distinguish arms that are arbitrarily close; often, the planner may instead be willing to allocate arms that are at most Δ away from the best arm. The state of the art does not permit this flexibility to the planner. Towards providing such flexibility, we have, for the first time, introduced a new notion of regret called Δ-regret. When arms that are less than Δ away from the best arm are selected, the Δ-regret incurred is zero; only arms more than Δ away from the best arm contribute to the Δ-regret.

From this perspective, we have revisited the application of MAB mechanisms to sponsored search auctions. First, we analysed the single slot SSA setting and proposed a deterministic, exploration-separated MAB mechanism called Δ-UCB. We showed that Δ-UCB is DSIC truthful, IR and achieves a Δ-regret of \(O(\log T)\). Next, we studied the more challenging setting of multi-slot SSA. In particular, we adopted the cascade model and adapted Δ-UCB to this setting, first under the assumption that the prominence parameters are known. Here too, we showed that the mechanism is DSIC truthful, IR and achieves a Δ-regret of \(O(\log T)\). Finally, we adapted the mechanism to the general multi-slot SSA setting where neither the CTRs nor the prominences are known. Here too, our deterministic, exploration-separated mechanism is DSIC truthful, IR and suffers a Δ-regret of \(O(\log T)\). The other mechanisms in the literature for this setting do not obtain all of these desirable properties: they either compromise on truthfulness, satisfying a weaker notion (truthfulness in expectation), or are forced to resort to randomization in the mechanism.

Our results are generic and apply equally well to several other applications where MAB mechanisms have been used.