1 Introduction

Multi-armed bandit (MAB) algorithms [7] are now widely used to model and solve problems where decisions must be made sequentially at every time step and there is an exploration-exploitation dilemma. This dilemma is the tradeoff the planner faces in deciding whether to explore arms that may yield higher rewards in the future or to exploit the arms that have already yielded high rewards in the past. If the rewards are generated from fixed distributions with unknown parameters, the setting goes by the name stochastic MAB [7]. Popular algorithms in the stochastic MAB setting include Upper Confidence Bound (UCB) based algorithms [2] and Thompson Sampling based algorithms [1]. These algorithms incur \(O(\log T)\) regret, where T is the total number of time steps. MAB algorithms are well studied, with several variants [8, 9, 22, 23] and applications [11, 29,30,31].

When the arms are controlled by strategic agents, we need to tackle additional challenges. Mechanism design [26,27,28] has been applied in this context, leading to stochastic MAB mechanisms [24]. The design of such mechanisms requires ideas from online learning as well as mechanism design, both of which are increasingly gaining importance in the field of artificial intelligence. An immediate application of stochastic MAB mechanisms is in sponsored search auctions (SSA). In SSA, there are several advertisers who wish to display their ads along with the search results generated in response to a query from an internet user. In the standard model, an advertiser has only one ad to display. We use the terms agent, ad, and advertiser interchangeably. There are two components of interest to the planner or search engine: (1) the stochastic component, namely the click through rate (CTR) of an ad, that is, the probability that a displayed ad receives a click; and (2) the strategic component, namely the valuation of the agent for every click that the agent’s ad receives. The search engine seeks to allocate the slot to the ad with the maximum social welfare (product of click through rate and valuation). However, neither the CTRs nor the valuations of the agents are known. This calls for a learning algorithm to learn the stochastic component (CTR) as well as a mechanism to elicit the strategic component (valuation). The problem becomes much harder as the agents may manipulate the learning process [4, 19] to gain higher utilities.

For single slot SSA, it is known that any deterministic MAB mechanism (that is, a MAB mechanism with a deterministic allocation and payment rule) suffers a regret [7, 12] of \({\varOmega }(T^{2/3})\) [4]. Furthermore, there exists a deterministic MAB mechanism whose regret matches this lower bound [4] and which also satisfies ex-post truthfulness, the strongest notion of truthfulness (a posteriori to the click realizations). When a more relaxed notion of truthfulness is targeted (truthfulness in expectation over the clicks), the regret guarantee improves to \(O(T^{1/2})\) [3]. Truthfulness in expectation has also been achieved in [16, 17]. The regret can be further improved when randomized mechanisms are used; in fact, the regret in this space is \(O(\log T)\) [3, 21]. However, the high variance that is inevitable in the payments of randomized mechanisms is a serious deterrent to their use. Towards reducing the variance, [15] propose a MAB mechanism using Thompson sampling [1]. However, the notion of truthfulness achieved is ‘within period DSIC’ and holds only with high probability. Thus, again, only a weaker notion of truthfulness is achieved compared to ex-post truthfulness.

In this work, we observe that the characterization provided by Babaioff et al. [4] targets the worst case scenario. In particular, in their lower bound proof of \({\varOmega }(T^{2/3})\) regret, they consider an example scenario where the actual separation, \(\bar {{\varDelta }}\), between the expected rewards of the arms is a function of T. We note that when a similar example (\(\bar {{\varDelta }} = T^{-1}\)) is used with the popular UCB algorithm [2], the number of pulls of sub-optimal arms could be linear, even in the non-strategic case. Hence, a dependence of \(\bar {{\varDelta }}\) on T is severely restrictive for the case when the rewards are stochastic, even when the arms are non-strategic. We make the observation that \(\bar {{\varDelta }}\) is in most situations independent of T. This motivates the main idea in this paper, which is to provide the planner an option to specify a parameter Δ, the tolerance or distinguishing level for sub-optimal arms. The understanding is that any arm that is within Δ of the best arm will not cause any additional regret to the planner. For example, the best arm may yield an expected reward of 6.000 while a sub-optimal arm may yield a very close expected reward of 5.999. The planner is typically indifferent to such small differences. Traditional exploration-separated schemes end up spending a huge number of exploration rounds in order to distinguish between these two closely separated arms.


Another scenario that the planner might be interested in is to ensure that over T rounds, the cumulative loss from choosing sub-optimal arms stays within an overall budget of τ. To achieve this, the tolerance level could be set to Δ = τ/T. The notion of Δ tolerance requires an appropriate definition of regret, which we call Δ-regret. Focusing on Δ-regret instead of the usual notion of regret helps us reduce the number of exploration rounds significantly, from \(O(T^{2/3})\) to \(O(\log T)\). We propose an exploration separated mechanism based on UCB, which achieves a Δ-regret of \(O(\log T)\). This mechanism can be readily applied in several settings such as SSA, crowdsourcing, and online procurement. For the rest of the paper, however, we use SSA as a running example.

1.1 Contributions:

  1. We make the crucial observation that in most MAB scenarios, the separation between the agents’ rewards is rarely a function of T (the number of time steps). Moreover, when the rewards of some arms are arbitrarily close to that of the best arm, the regret contributed by such sub-optimal arms is negligible. We exploit this observation to allow the center to specify the resolution, Δ, with which the agents must be distinguished, and introduce the notion of Δ-regret to formalize the corresponding loss. The notion of regret used in the state of the art does not permit a reduction in the number of exploratory rounds for sub-optimal arms, whereas the more flexible notion of Δ-regret that we introduce supports a reduction in the number of mandatory pulls of the sub-optimal arms.

  2. Using sponsored search auctions as a concrete example, we propose a dominant strategy incentive compatible (DSIC) and individually rational (IR) MAB mechanism with a deterministic allocation and payment rule, based on ideas from the UCB family of MAB algorithms. The proposed mechanism Δ-UCB achieves a Δ-regret of \(O(\log T)\) for the case of single slot sponsored search auctions. The truthfulness achieved by Δ-UCB holds a posteriori to the click realizations and is the strongest form of truthfulness. A loss of only \(O(\log T)\) would not have been possible if the traditional notion of regret were used. In particular, the number of exploration rounds in Δ-UCB is \(O(\log T)\), as opposed to the \(O(T^{2/3})\) rounds which were mandatory so far for ensuring a truthful, deterministic mechanism. Thus the planner is relieved of this huge number of exploration rounds. We also show that any mechanism must suffer a Δ-regret of at least \({\varOmega }(\log T)\).

  3. We non-trivially extend the above results to the case where multiple slots are to be allocated. Here again, our mechanism is DSIC and IR, and requires sub-optimal arms to be explored for only \(O(\log T)\) rounds. Therefore a Δ-regret of \(O(\log T)\) is achieved in the multiple slot scenario as well.

Our results are generic to stochastic MAB mechanisms and can be applied to other popular applications such as crowdsourcing and online procurement.

2 Relevant work

In the area of MAB mechanisms, a significant body of work addresses sponsored search auctions. Babaioff et al. [4] provide a characterization of truthful MAB mechanisms, wherein the objective is to maximize social welfare. They introduce the notion of influential rounds, the rounds in which the parameters of the reward distributions (CTRs) are learnt. One of the characterizations of truthful deterministic mechanisms is that the allocation must be exploration separated, that is, in such influential rounds, the allocation must not depend on the bids of the agents. The allocation is also required to be pointwise monotone. One of the main results of their paper is that any truthful, deterministic MAB mechanism incurs a regret of \({\varOmega }(T^{2/3})\). In particular, their analysis is adversarial in nature: the gap between the best and second best arm is chosen, as if by an adversary, to be proportional to \(T^{-1/3}\). Such a choice forces a huge regret for any truthful, deterministic mechanism. They also provide a mechanism which incurs a matching upper bound regret of \(O(T^{2/3})\). Devanur et al. [10] concurrently provide similar bounds on the regret when the objective is revenue maximization rather than social welfare maximization.

All the above results pertain to the setting of single slot auctions, where there is a single slot for which the agents compete. In the generalization of this setting, multiple slots are reserved for ads. This setting is more challenging as the slots are not identical and some slots are more prominent than others. MAB mechanisms have also been extended to the multiple slot setting [14] in line with the characterization in [4]. Hence, a similar regret of \(O(T^{2/3})\) on the social welfare has been attained here as well. Similar results are also stated in the characterization provided in [32].

MAB mechanisms have also been proposed in the context of crowdsourcing [6]. Some of these mechanisms incur a regret of \(O(\log T)\). This is rendered possible due to the specific nature of the problem in hand. In particular, Bhat et al. [5] look at divisible tasks. Jain et al. [20] look at deterministic mechanisms where a block of tasks is allocated to each agent and provide a weaker notion of truthfulness.

The \({\varOmega }(T^{2/3})\) lower bounds on both social welfare regret and revenue regret have influenced subsequent research to follow similar assumptions and thereby obtain similar regret. However, we show in this work that it is indeed possible to design a deterministic mechanism which attains logarithmic regret and is also truthful in the dominant strategy incentive compatible (DSIC) sense [25]. DSIC, of course, is the most preferred form of truthfulness [26]. This work opens up the possibility for a planner to move away from the worst case scenario to a more realistic scenario. We enable the planner to specify a resolution parameter for distinguishing the arms, introduce the notion of Δ-regret, and thereafter propose a mechanism that ensures that the number of exploration rounds, and hence the regret suffered, is only \(O(\log T)\) instead of the expensive \({\varOmega }(T^{2/3})\) in the current state of the art. We summarize the contrast between our work and the state of the art in Table 1.

Table 1 Comparison of our results with state of the art

3 The model: Single slot SSA

We now describe our SSA setting. For ease of reference, our notation is provided in Table 2. Let K be the number of agents or arms. We denote the set of arms by [K]. Each of the K arms, when pulled, gives rewards from a distribution with unknown parameters. We assume that the form of the distributions is known but their parameters are unknown. In SSA, the rewards of the arms correspond to clicks. The clicks for the advertisements are assumed to be generated from Bernoulli distributions with parameters μ1, μ2,…, μK, where μi is the CTR, that is, the probability that advertisement i receives a click once observed. The means μ1,…, μK are unknown.

Table 2 Notations for the single slot SSA setting

A click realization ρ represents the click information of every agent at all rounds, that is, ρi(t) = 1 if agent i received a click in round t. In a round t, only the click information of the allocated agent is revealed after the completion of the round. Click information of all other unallocated agents is never known to the planner.

The agents also have their valuations for each click they receive. We work in the ‘pay per click’ setting where the agent pays the search engine for each click received. Let the true valuation of agent i be θi for a click. θi is a private type of agent i and is never known to the learner. However, the agent is asked to bid his valuation. Let the bid of agent i be bi. We denote by the vector b = (b1,…, bK) the bid profile of all the agents. The central planner wants to ensure that the agents bid their true valuations, that is, bi must be equal to θi. Assume that there is a single slot which must be allocated to one of the K agents. We denote by Wi the social welfare when agent i is allocated the slot, that is, Wi = μiθi. The social welfare Wi represents the expected valuation agent i derives per impression. If the CTRs of the agents as well as their valuations were known, the planner would have selected the arm with the maximum social welfare μiθi. However, neither μi nor θi is known to the planner. Assume θmax is the maximum valuation that any agent can have and is common knowledge. The central planner wants to allocate the single slot to one of the ads in such a way that the net social welfare of the allocation is maximized.

A mechanism \({\mathscr{M}} = \left \langle \mathcal {A}, P\right \rangle \) is a tuple containing an allocation rule \(\mathcal {A}\) and a payment rule P. At every time step or round t, the allocation rule acts on the bid profile b of the agents as well as the click realization ρ and allocates the slot to one of the K agents, say i. Then \(\mathcal {A}(b, \rho , t) = i\). Alternatively, we denote by the indicator variable \(\mathcal {A}_i(b, \rho , t)\) whether agent i is allocated the slot at time t, that is, \(\mathcal {A}_i(b, \rho , t) = 1\) if \(\mathcal {A}(b, \rho , t) = i\) and 0 otherwise. The payment rule is \(P^t = (P_1^t, P_{2}^t, \ldots , P_K^t)\), where \(P_i^t(b, \rho )\) is the payment to be made by agent i at time t upon receiving a click, when the bids are b and the click realization is ρ. As stated earlier, ρi(t) is observed for the allocated agent alone. Also note that the allocation as well as the payments in each round t depend only on the click history till that round.

Let \(i_{*}\) be the arm with the largest social welfare, that is, \(i_{*} = \underset {i \in [K]}{\arg \max \limits } W_i\). We denote the corresponding social welfare as \(W_{*} = \max \limits _{i \in [K]} W_{i} \). We denote by It the agent chosen at time t, as a shorthand for \(\mathcal {A}(b, \rho , t)\). For any given Δ > 0, define the set \(S_{{\varDelta }} = \{i \in [K] : W_{*} - W_{i} < {\varDelta } \}\). SΔ denotes the set of all agents whose social welfare is within Δ of that of the best arm i∗. These arms are therefore indistinguishable to the center and they contribute zero to the regret. Note that Δ is a parameter that the center fixes based on the amount in dollars he is willing to trade off for choosing sub-optimal arms, given he has only a fixed time horizon T at his disposal. To capture this revised and more practical notion of regret, we introduce the metric Δ-regret. Formally,

$$ {\varDelta}\text{-regret} = \sum\limits_{t=1}^{T} (W_{*} - W_{I_{t}})\mathbbm{1}\left[ I_{t} \in [K]\setminus S_{{\varDelta}}\right] $$
(1)

The center may not want to invest a huge number of exploration rounds (\({\varOmega }(T^{2/3})\) in the state of the art) to perfectly distinguish arms that are arbitrarily close. Often, the planner may instead be willing to allocate arms that are at most Δ away from the best arm. The center therefore suffers a regret only when an agent whose social welfare is at least Δ below W∗ is chosen. Δ-regret captures this loss.

The goal of our mechanism is to select agents at every round t to minimize the Δ-regret.

4 Our mechanism: Δ-UCB

We are now ready to describe our mechanism Δ-UCB. The idea in Δ-UCB is to explore all the arms in a round-robin fashion for a fixed number of rounds. The number of exploration rounds is fixed based on the desired Δ specified by the planner. At the end of exploration, with high probability, we are guaranteed that the arms not in SΔ are well separated from the best arm i∗ with respect to their social welfare estimates. In the exploration rounds, agents need not pay; these rounds are free.

Thereafter, for all the remaining rounds, the best arm as per the UCB estimate of social welfare is chosen. However, in the exploitation rounds, the chosen agent pays an amount for each click he receives. The amount to be paid by the agent is fixed based on a variant of the well-known Vickrey-Clarke-Groves (VCG) scheme [33] known as weighted VCG [28]. Note that no learning takes place in these rounds and the UCB and LCB indices do not change thereafter. We present our mechanism in Algorithm 1.

Algorithm 1 The Δ-UCB mechanism for single slot SSA
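Since Algorithm 1 appears only as an image in the original, the following is a minimal Python sketch of the single slot Δ-UCB mechanism as described above: round-robin exploration sized as in Lemma 2, frozen UCB indices, and the weighted second-price payment used in the proof of Theorem 1. The click simulation via true_ctrs, the function and variable names, and the assumptions of at least two agents and truthfully reported bids are ours.

```python
import math
import random


def delta_ucb_single_slot(bids, true_ctrs, T, delta, theta_max, rng=random):
    """Sketch of single slot Delta-UCB: explore round-robin, then exploit.

    bids[i]      : reported per-click valuation b_i of agent i
    true_ctrs[i] : (hidden) Bernoulli CTR mu_i, used only to simulate clicks
    Returns a list of (allocated agent, per-click payment) for t = 1..T.
    """
    K = len(bids)
    # Pulls per arm so that 2 * eps_{i,t} * theta_i < delta afterwards (Lemma 2).
    n_explore = math.ceil(8 * theta_max ** 2 * math.log(T) / delta ** 2) + 1
    clicks, pulls, history = [0] * K, [0] * K, []

    t = 0
    # Exploration phase: round-robin, bids ignored, no payments charged.
    while t < min(T, K * n_explore):
        i = t % K
        clicks[i] += rng.random() < true_ctrs[i]
        pulls[i] += 1
        history.append((i, 0.0))
        t += 1

    # UCB index on the CTR, frozen after exploration (no further learning).
    def mu_plus(i):
        return clicks[i] / pulls[i] + math.sqrt(2 * math.log(T) / pulls[i])

    # Exploitation: allocate by UCB-weighted bid; the winner pays, per click,
    # the second-highest UCB-weighted bid divided by his own UCB index.
    while t < T:
        ranked = sorted(range(K), key=lambda i: mu_plus(i) * bids[i], reverse=True)
        winner, runner_up = ranked[0], ranked[1]
        payment = mu_plus(runner_up) * bids[runner_up] / mu_plus(winner)
        history.append((winner, payment))
        t += 1
    return history
```

For instance, `delta_ucb_single_slot([0.9, 0.5, 0.4], [0.3, 0.6, 0.2], T=100000, delta=0.2, theta_max=1.0)` spends roughly 2300 pulls per arm on exploration before switching to exploitation.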

4.1 Properties of Δ-UCB

Next we discuss the properties satisfied by Δ-UCB regarding truthfulness and regret. Before that, we state a few useful definitions which will help in understanding the notion of truthfulness.

At any time step, every agent obtains some utility by participating in the mechanism. This utility is a function of his bid, his valuation, the bids of the other agents, and his click realization. Let Θi denote the space of bids of agent i. b−i = (b1,…, bi− 1, bi+ 1,…, bK) is the bid profile containing the bids of all agents except agent i. Let Θ−i denote the space of bids of all agents other than agent i. Therefore Θ−i = Θ1 ×… ×Θi− 1 ×Θi+ 1 ×… ×ΘK. We denote by ui(bi, b−i, ρ, t;θi) the utility to agent i at time t when his bid is bi, his valuation is θi, the bid profile of the remaining agents is b−i, and the click realization is ρ. All agents are assumed to be rational and are interested in maximizing their own utilities.

In our setting the utility to an agent i is computed as,

$$ \begin{array}{@{}rcl@{}} u_{i}(b_{i}, b_{-i}, \rho, t; \theta_{i}) = (\theta_{i} - {P_{i}^{t}}(b, \rho)) \mathcal{A}_{i}(b_{i}, b_{-i}, \rho, t)\rho_{i}(t) \end{array} $$
(2)

The idea behind the computation of the utility is as follows. If an agent i does not receive an allocation (that is, \(\mathcal {A}_i(b_i, b_{-i}, \rho , t)=0\)), his utility is also zero. He gets a non-zero utility only if he receives an allocation. If he receives an allocation and also a click (ρi(t) = 1), then his utility is the difference between his valuation for the click and the amount he has to pay to the search engine (\(\theta _{i} - P_i^t(b, \rho )\)). If he does not receive a click (ρi(t) = 0), his utility is zero.

Definition 1

Dominant Strategy Incentive Compatible (DSIC) [4]: A mechanism \(M = \left \langle \mathcal {A},P \right \rangle \) is said to be dominant strategy incentive compatible if ∀i ∈ [K], ∀bi ∈ Θi, ∀b−i ∈ Θ−i, ∀ρ, ∀t, ui(θi, b−i, ρ, t;θi) ≥ ui(bi, b−i, ρ, t;θi).

Note that in the above definition, truthfulness is demanded a posteriori to even the click realization [14]. Hence it is the strongest notion of truthfulness. Examples of weaker forms of truthfulness include those which take an expectation over the click realizations.

Definition 2

Individually Rational (IR): A mechanism \(M = \left \langle \mathcal {A},P \right \rangle \) is said to be individually rational if ∀i ∈ [K], ∀b−i ∈ Θ−i, ∀ρ, ∀t, ui(θi, b−i, ρ, t;θi) ≥ 0.

Theorem 1

The Δ-UCB mechanism is dominant strategy incentive compatible (DSIC) and individually rational (IR).

Proof

We analyze the scenarios where an agent i bids his true valuation and receives an allocation and also when he does not. We show that in both these scenarios, bidding his true valuation θi is indeed a best response strategy. We only need to consider the exploitation rounds because in the exploration rounds, every agent is allocated a fixed number of rounds independent of his bids and these rounds are also free for agents.

  • Case 1: \(\mathcal {A}_i(\theta _i, b_{-i}, \rho , t)=1\)

    This implies that when the agent bids his true valuation, he gets an allocation. Therefore \(\widehat {\mu }_{i,t}^+ \theta _{i} > \widehat {\mu }_{l,t}^+ b_l\) for all the other agents l. In particular, let agent j be such that \(j = \underset {l \in [K] \setminus \{i\}}{\arg \max \limits } \widehat {\mu }_{l,t}^+ b_l\). The amount to be paid by agent i is \(P_i^t(\theta _i, b_{-i}, \rho ) = \widehat {\mu }_{j,t}^+ b_{j} / \widehat {\mu }_{i,t}^+\). If he receives a click then \(u_i(\theta _i, b_{-i}, \rho ,t; \theta _i)= \theta _{i} - \widehat {\mu }_{j,t}^+ b_{j} / \widehat {\mu }_{i,t}^+ > 0 \).

    Overbid:

    If agent i bids a value bi > θi, he continues to receive an allocation and his payment is still the same, \(P_i^t(b_i, b_{-i}, \rho ) = \widehat {\mu }_{j,t}^+ b_{j} / \widehat {\mu }_{i,t}^+\). Therefore his utility continues to be \(u_i(b_i, b_{-i}, \rho ,t; \theta _i)= \theta _{i} - \widehat {\mu }_{j,t}^+ b_{j} / \widehat {\mu }_{i,t}^+ = u_i(\theta _i, b_{-i}, \rho ,t; \theta _i)\). Therefore he does not benefit from an overbid.

    Underbid:

    Suppose agent i bids a value bi < θi.

    1. Case a:

      If bi is such that \(\widehat {\mu }_{i,t}^+b_{i} < \widehat {\mu }_{j,t}^+ b_{j}\), then he fails to get an allocation as \(\mathcal {A}(b_i, b_{-i}, \rho , t) =j \neq i \). Then the utility to agent i is ui(bi, b−i, ρ, t;θi) = 0 < ui(θi, b−i, ρ, t;θi). Therefore he clearly loses utility by such an underbid.

    2. Case b:

      Suppose bi is such that \(\widehat {\mu }_{i,t}^+\theta _{i} > \widehat {\mu }_{i,t}^+b_{i} > \widehat {\mu }_{j,t}^+ b_{j}\). That is agent i bids in such a way that he wins the allocation even with an underbid. Then, if he gets a click, the amount he must pay to the center is \(P_i^t(b_i, b_{-i}, \rho ) = \widehat {\mu }_{j,t}^+ b_{j} / \widehat {\mu }_{i,t}^+ \). Therefore his utility \(u_i(b_i, b_{-i}, \rho ,t; \theta _i)= \theta _{i} - \widehat {\mu }_{j,t}^+ b_{j} / \widehat {\mu }_{i,t}^+= u_i(\theta _i, b_{-i}, \rho ,t; \theta _i) \). He obtains the same utility as a truthful bid and there is no benefit from such an underbid.

  • Case 2: \(\mathcal {A}_i(\theta _i, b_{-i}, \rho , t)=0\)

    This implies that when the agent bids his true valuation, he does not get an allocation. Suppose agent j wins the allocation. \(\mathcal {A}(\theta _i, b_{-i}, \rho , t)=j\) and \(\widehat {\mu }_{i,t}^+ \theta _{i} < \widehat {\mu }_{j,t}^+ b_{j}\).

    Truthful bid:

    Since agent i does not win the allocation with a truthful bid, his utility ui(θi, b−i, ρ, t;θi) = 0.

    Overbid:

    Suppose agent i bids in such a way that bi > θi. We have two sub-cases here.

    1. Case a:

      If bi is such that \(\widehat {\mu }_{i,t}^+ \theta _{i} < \widehat {\mu }_{j,t}^+ b_{j} < \widehat {\mu }_{i,t}^+ b_{i} \), then agent i wins the allocation. So, \(\mathcal {A}_i(b_i, b_{-i}, \rho , t)=1\). If he gets a click, he now has to make a payment \(P_i^t(b_i, b_{-i}, \rho ) = \widehat {\mu }_{j,t}^+ b_{j}/ \widehat {\mu }_{i,t}^+ \). Now his utility \(u_i(b_i, b_{-i}, \rho ,t; \theta _i) = \theta _{i} - \widehat {\mu }_{j,t}^+ b_{j}/ \widehat {\mu }_{i,t}^+ \) < 0. In particular, ui(bi, b−i, ρ, t;θi) < ui(θi, b−i, ρ, t;θi) = 0. Therefore, such an overbid is clearly disadvantageous compared to a truthful bid.

    2. Case b:

      Suppose \(\widehat {\mu }_{i,t}^+ \theta _{i} < \widehat {\mu }_{i,t}^+ b_{i} < \widehat {\mu }_{j,t}^+ b_{j} \). The overbid by agent i is not sufficient to make him win the allocation, and agent j wins the allocation, \(\mathcal {A}(b_i, b_{-i}, \rho , t)=j\). The utility of agent i is ui(bi, b−i, ρ, t;θi) = 0 = ui(θi, b−i, ρ, t;θi). Therefore there is no advantage for agent i from this case of overbid.

    Underbid:

    If agent i bids in such a way that bi < θi, he continues to lose the allocation and therefore his utility ui(bi, b−i, ρ, t;θi) = 0 = ui(θi, b−i, ρ, t;θi). Since the utility from an underbid remains the same as that of a truthful bid, there is clearly no advantage in underbidding.

All the above cases show that our mechanism is DSIC a posteriori to the click realizations. Also, in each of the above cases, note that the utility of agent i under truthful bidding, ui(θi, b−i, ρ, t;θi), is non-negative. Therefore, by truthful bidding he never gets a negative utility. This proves that our mechanism is individually rational. □
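As a concrete numerical illustration of the exploitation-phase payment rule analyzed above (the numbers are ours, chosen only for illustration): suppose

$$ \widehat{\mu}_{i,t}^{+} = 0.5,\quad \theta_{i} = 1.0,\quad \widehat{\mu}_{j,t}^{+} = 0.4,\quad b_{j} = 0.9 . $$

Then agent i wins the slot since \(\widehat{\mu}_{i,t}^{+}\theta_{i} = 0.5 > \widehat{\mu}_{j,t}^{+}b_{j} = 0.36\), and on receiving a click he pays \(P_{i}^{t} = \widehat{\mu}_{j,t}^{+} b_{j} / \widehat{\mu}_{i,t}^{+} = 0.36/0.5 = 0.72\), for a utility of \(\theta_{i} - 0.72 = 0.28 \geq 0\), consistent with both DSIC and IR.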

We next discuss the regret incurred by Δ-UCB. We note that the regret analysis we provide differs in spirit from the worst case analysis in [4]. The number of exploration rounds in [4] is required to be \({\varOmega }(T^{2/3})\) since the separation between the best and second best arm is fixed in an adversarial manner in their analysis. Our analysis does not resort to any adversarial arguments.

In order to prove our Δ-regret results, we will first need to prove several other lemmas.

Lemma 1

Social Welfare UCB index: For an agent i, we define the social welfare UCB indices for agent i as,

$$ \begin{array}{@{}rcl@{}} \widehat{W}_{i,t}^{+} = \widehat{\mu}_{i,t}\theta_{i}+ \epsilon_{i,t}\theta_{i} = \widehat{\mu}_{i,t}\theta_{i} + \sqrt{2\frac{{\theta_{i}^{2}} \log T}{N_{i,t}}} \end{array} $$
(3)
$$ \begin{array}{@{}rcl@{}} \widehat{W}_{i,t}^{-} = \widehat{\mu}_{i,t}\theta_{i} - \epsilon_{i,t}\theta_{i} = \widehat{\mu}_{i,t}\theta_{i} - \sqrt{2\frac{{\theta_{i}^{2}} \log T}{N_{i,t}}} \end{array} $$
(4)

Then, \( \forall t,\; P\left (\left \lbrace \omega : W_{i} \notin [\widehat {W}_{i,t}^-(\omega ), \widehat {W}_{i,t}^+ (\omega )] \right \rbrace \right ) \leq 2T^{-4} \).

Proof

Let \(\widehat {\mu }_{i,t}^+\) and \(\widehat {\mu }_{i,t}^-\) denote the UCB and LCB indices for the estimate \(\widehat {\mu }_i\). Then the events \(\{\omega : \mu _{i} \notin [\widehat {\mu }_{i,t}^-(\omega ),\) \( \widehat {\mu }_{i,t}^+(\omega ) ] \}\) and \(\{\omega : W_{i} \notin [\widehat {W}_{i,t}^-(\omega ), \widehat {W}_{i,t}^+(\omega ) ] \}\) are identical. So, \(P(W_{i} \notin [\widehat {W}_{i,t}^-, \widehat {W}_{i,t}^+]) = P(\mu _{i} \notin [\widehat {\mu }_{i,t}^-, \widehat {\mu }_{i,t}^+] )\). An application of Hoeffding bound [18] gives \(P(\mu _{i} \notin [\widehat {\mu }_{i,t}^-, \widehat {\mu }_{i,t}^+] ) \leq 2\exp (-2 N_{i,t} \epsilon _{i,t}^2) \). As per the mechanism \(\epsilon _{i,t} = \sqrt {2\log T/N_{i,t}}\). So, \(P(\mu _{i} \notin [\widehat {\mu }_{i,t}^-, \widehat {\mu }_{i,t}^+] ) \leq 2\exp (-2 N_{i,t} \times 2 \log T/ N_{i,t}) = 2T^{-4}\). □

Lemma 2

Suppose at time step t, \(N_{i,t} > \frac {8\theta _{max}^2 \log T}{{\varDelta }^2} \forall i \in [K]\). Then ∀i ∈ [K], 2𝜖i, tθi < Δ.

Proof

Given that \(N_{i,t} > \frac {8 \theta _{max}^2 \log T}{{\varDelta }^2}\). Therefore,

$$ \begin{array}{@{}rcl@{}} {\varDelta}^{2} > \frac{8 \theta_{max}^{2} \log T}{N_{i,t}} \geq \frac{8 {\theta_{i}^{2}} \log T}{N_{i,t}} \geq 4 \left[\frac{2 {\theta_{i}^{2}} \log T}{N_{i,t}} \right] \end{array} $$

Taking square roots on both sides of the above inequality yields Δ > 2𝜖i, tθi, thereby proving the lemma. □

Lemma 3

Suppose K ≤ T. For an agent i and time step t, let Bi, t be the event \(B_{i,t} = \{\omega : W_{i} \notin [\widehat {W}_{i,t}^-, \widehat {W}_{i,t}^+] \}\). Define the event \(G =\bigcap \limits _t \bigcap \limits _{i \in [K]} B_{i,t}^c\), where \(B_{i,t}^c\) is the complement of Bi, t. Then \(P(G) \geq 1 - \frac {2}{T^2}\).

Proof

From Lemma 1, the probability of the ‘bad’ event, P(Bi, t) ≤ 2T− 4.

$$ \begin{array}{@{}rcl@{}} P(G)&=& P\left( \underset{t}{\bigcap} \underset{i}{\bigcap} B_{i,t}^{c}\right) = 1- P\left( \underset{t}{\bigcup} \underset{i}{\bigcup} B_{i,t}\right) \\ &\geq& 1 - \sum\limits_{t} \sum\limits_{i \in [K]} P(B_{i,t}) \geq 1- \sum\limits_{t} \sum\limits_{i \in [K]} 2T^{-4} \geq 1 - \frac{2}{T^{2}} \end{array} $$

The second step uses the union bound; the last inequality follows by summing over the T rounds and the K ≤ T arms. □

Theorem 2

Suppose at time step t, \(N_{j,t} > \frac {8\theta _{max}^2 \log T}{{\varDelta }^2}\) for all j ∈ [K]. Then ∀i ∈ [K] ∖ SΔ, \(\widehat {W}_{i_{*}, t}^+ > \widehat {W}_{i, t}^+\) with probability at least \(1 - 2/T^{4}\).

Proof

In Theorem 1, we have shown that Δ-UCB is DSIC. Therefore, all the agents bid their valuations truthfully, that is, bi = θi for all i ∈ [K]. Suppose in exploitation round t, a sub-optimal arm i is pulled. Therefore, \(\widehat {W}_{i, t}^+ \geq \widehat {W}_{i_{*}, t}^+\). Then one of the following three conditions must have occurred.

  • Condition 1: \(W_{i} < \widehat {W}_{i,t}^{-}\). This condition implies a drastic overestimate of the sub-optimal arm i so that the true social welfare Wi is even below the LCB index \(\widehat {W}_{i,t}^-\). Figure 1 shows this case.

  • Condition 2: \( W_{*} > \widehat {W}_{i_{*}, t}^{+}\). This implies an underestimate of the optimal arm so that the true social welfare W lies above even the UCB index \(\widehat {W}_{i_{*},t}^+\). This situation is shown in Fig 2.

  • Condition 3: W∗ − Wi < 2𝜖i, tθi. This implies an overlap in the confidence intervals of the optimal and sub-optimal arm. Even though Conditions 1 and 2 are false, the UCB of the sub-optimal arm i is still greater than the UCB of the optimal arm i∗.

Fig. 1 Condition 1, proof of Theorem 2

Fig. 2 Condition 2, proof of Theorem 2

From Fig. 3, \(W_{*} - W_{i} \leq \widehat {W}_{i,t}^+ - \widehat {W}_{i,t}^- \leq 2 \epsilon _{i,t} \theta _{i} \)

Fig. 3 Condition 3, proof of Theorem 2

If all the three conditions above were false, then,

$$ \begin{array}{@{}rcl@{}} \widehat{W}_{i_{*},t}^{+} &> W_{*} > W_{i} + 2 \epsilon_{i,t} \theta_{i} > \widehat{W}_{i,t}^{-} + 2 \epsilon_{i,t} \theta_{i}= \widehat{W}_{i,t}^{+} \end{array} $$

This implies that \(\widehat {W}_{i_{*},t}^+ > \widehat {W}_{i,t}^+\), leading to a contradiction.

As per the statement of the theorem, \(N_{i,t} > \frac {8 \theta _{max}^2 \log T}{{\varDelta }^2}\). Therefore by Lemma 2, 2𝜖i, tθi < Δ. For i ∈ [K] ∖ SΔ, W∗ − Wi ≥ Δ > 2𝜖i, tθi. So Condition 3 above does not hold. Hence, if the sub-optimal arm i is pulled, the only possibilities are Condition 1 or Condition 2.

$$ \begin{array}{@{}rcl@{}} P(\widehat{W}_{i,t}^{+} \geq \widehat{W}_{i_{*},t}^{+}) &\leq& P(\text{Condition 1}) + P(\text{Condition 2}) \\ & \leq& T^{-4} + T^{-4} = \frac{2}{T^{4}} \end{array} $$
$$ \begin{array}{@{}rcl@{}} P(\widehat{W}_{i_{*},t}^{+} > \widehat{W}_{i,t}^{+} ) = 1- P(\widehat{W}_{i,t}^{+} \geq \widehat{W}_{i_{*},t}^{+}) \geq 1 - \frac{2}{T^{4}} \end{array} $$

where each of P(Condition 1) and P(Condition 2) is at most \(T^{-4}\) by the one-sided Hoeffding bound used in the proof of Lemma 1, thereby completing the proof. □

We are now ready to state our main result on the incurred regret.

Theorem 3

If the Δ-UCB mechanism is executed for a total time horizon of T rounds, it achieves an expected Δ-regret of \(O(\log T)\).

Proof

The main idea in the proof is to compute the Δ-regret conditional on the two events G and Gc and then to bound each of these conditional expectations.

We first bound \(\mathbb {E}\left [ {\varDelta }\text {-regret} |G \right ]\). Every exploration pull contributes at most θmax to the Δ-regret, and the exploration phase lasts \(8K\theta _{max}^2 \log T/{\varDelta }^2\) rounds in total, so that at its end \(N_{i,t}\geq 8 \theta _{max}^2 \log T/{\varDelta }^2\) for every arm i. Given that the event G has occurred, Conditions 1 and 2 in the proof of Theorem 2 are eliminated, and by Theorem 2 no Δ-regret is incurred during the exploitation rounds. Therefore \(\mathbb {E}\left [ {\varDelta }\text {-regret} |G \right ] \leq \frac {8K\theta _{max}^{3} \log T}{{\varDelta }^{2}}\).

We now compute \(\mathbb {E}\left [ {\varDelta }\text {-regret} |G^c \right ]\).

$$ \begin{array}{@{}rcl@{}} \mathbb{E}\left[ {\varDelta}\text{-regret} |G^{c} \right] \leq T \theta_{max} \end{array} $$
(5)

But \(P(G^c) = 1 - P(G)< \frac {2}{T^2}\) from Lemma 3. Putting all the steps together,

$$ \begin{array}{@{}rcl@{}} \mathbb{E}\left[ {\varDelta}\text{-regret} \right] &=& \mathbb{E}\left[ {\varDelta}\text{-regret} |G \right] P(G) + \mathbb{E}\left[ {\varDelta}\text{-regret} |G^{c} \right] P(G^{c}) \\ & \leq& \frac{8K\theta_{max}^{3} \log T}{{\varDelta}^{2}} * 1 + T\theta_{max}*\frac{2}{T^{2}} \\ & \leq& \frac{8K\theta_{max}^{3} \log T}{{\varDelta}^{2}} + 2 \end{array} $$
(6)

The second term is at most 2 since θmax ≤ T. This completes the proof. □

A consequence of the above theorem is that even if an adversary chooses an arbitrarily small gap between the best and second best arm, the planner need not worry: if the gap is less than his tolerance Δ, no loss is incurred, as opposed to the \({\varOmega }(T^{2/3})\) loss in [4].

4.2 A lower bound for Δ-regret

We will now discuss a lower bound on the Δ-regret incurred by our approach. In particular, we will provide the lower bound for the case where θi = 1 for all i and this is known. The proof follows along the lines of the lower bound proof in [7]. The same lower bound naturally applies to the general strategic version as well, since our proposed mechanism Δ-UCB is truthful and achieves a matching upper bound.

Let kl(p, q) denote the KL divergence between the distributions Bernoulli(p) and Bernoulli(q). Then \(kl(p,q) = p \log (p/q) + (1-p) \log ((1-p)/(1-q))\).

Theorem 4

Consider the setting where θi = 1 for all i ∈ [K]. Suppose an algorithm satisfies \(\mathbb {E}[N_{i,t}] = o(t^a)\) for any set of Bernoulli reward distributions, for all arms \(i \notin S_{{\varDelta }}\), and for every a > 0. Then for any set of Bernoulli reward distributions we have,

$$ \liminf_{T \rightarrow \infty} \frac{\mathbb{E}[{\varDelta}\text{-regret}]}{\log T } \geq \sum\limits_{i \notin S_{{\varDelta}}} \frac{{\varDelta}_{i}}{kl(\mu_{i}, \mu^{*} + {\varDelta})} $$
(7)

where \(\mu ^{*} = \max \limits _{j \in [K]} \mu _{j}\) and \({\varDelta }_{i} = \mu ^{*} - \mu _{i}\) for all i ∈ [K].

Proof

We will provide the proof for the case of two agents. The proof for the case K > 2 follows analogously. Assume that \(\mu _{2} < \mu _{1}\), \(\mu _{1} + {\varDelta } < 1\), and \(\mu _{1} - \mu _{2} > {\varDelta }\). Therefore agent 1 is optimal and agent 2 does not belong to SΔ. For any 𝜖 > 0, due to the continuity of kl(μ2, x), we can find \(\mu ^{\prime }_{2} \in (\mu _1 + {\varDelta }, 1)\) such that

$$ kl(\mu_{2}, \mu^{\prime}_{2}) \leq (1+ \epsilon) kl(\mu_{2}, \mu_{1} + {\varDelta}) $$
(8)

This configuration then corresponds to an alternate setting where the mean of agent 2 is \(\mu ^{\prime }_{2}\). In this alternate setting, \(\mu ^{\prime }_{2} - \mu _1 > {\varDelta }\) and agent 2 is the unique optimal. For s ∈{1,…, T}, let,

$$ \tilde{kl}_{s} = \sum\limits_{t=1}^{s} \log \frac{\mu_{2} {\rho_{2}^{t}} + (1-\mu_{2}) (1- {\rho_{2}^{t}})}{\mu^{\prime}_{2} {\rho_{2}^{t}} + (1-\mu^{\prime}_{2}) (1- {\rho_{2}^{t}})} $$
(9)

It can be verified that \(\lim _{t \rightarrow \infty } \mathbb {E}[ \tilde {kl}_t ]/t = kl(\mu _{2}, \mu ^{\prime }_{2})\) (where the expectation is taken over \(\rho _{2}^t\)) and therefore \(\tilde {kl}_t\) serves as an un-normalized estimate for \(kl(\mu _{2}, \mu ^{\prime }_{2})\).

Let CT denote the indicator random variable

$$ C_{T} = \mathbbm{1}\left[ N_{2,T} < f_{T} \text{ and } \tilde{kl}_{N_{2,T}} \leq (1-\epsilon/2)\log T \right] $$

where fT is specified below. One may verify that \(\mathbb {P}_{\mu ^{\prime }_{2}} (C_{T} = 1) = \mathbb {E}_{\mu _{2}}[C_{T} \exp (-\tilde {kl}_{N_{2,T}})]\) by applying a change of measure. We will now show that \( \mathbb {P}_{\mu _{2}} (C_{T} = 1) \rightarrow 0\) as \(T \rightarrow \infty \). This is due to the following:

$$ \begin{array}{@{}rcl@{}} \mathbb{P}_{\mu^{\prime}_{2}} (C_{T} = 1) &=& \mathbb{E}_{\mu_{2}}[C_{T} \exp(-\tilde{kl}_{N_{2,T}})] \\&&\geq \exp (- (1- \epsilon/2) \log T) \times \mathbb{P}_{\mu_{2}}(C_{T} = 1) \end{array} $$

Therefore, setting \(f_{T} = \frac {(1-\epsilon )\log T}{kl(\mu _{2},\mu ^{\prime }_{2})}\), and applying Markov inequality we get,

$$ \begin{array}{@{}rcl@{}} \mathbb{P}_{\mu_{2}}(C_{T} = 1)& \leq& T^{1- \epsilon/2} \mathbb{P}_{\mu^{\prime}_{2}} (C_{T} = 1) \leq T^{1- \epsilon/2} \mathbb{P}_{\mu^{\prime}_{2}} (N_{2,T} \leq f_{T})\\ &\leq& T^{1- \epsilon/2} \frac{\mathbb{E}_{\mu^{\prime}_{2}}[T - N_{2,T}]}{T-f_{T}} \rightarrow 0 \end{array} $$

The last step follows because T − N2, T = N1, T and agent 1 is sub-optimal (in fact, it lies outside SΔ) in the alternate setting where agent 2 has mean reward \(\mu ^{\prime }_{2}\); hence \(\mathbb {E}_{\mu ^{\prime }_{2}}[N_{1,T}] = o(T^{a})\) for every a > 0 by assumption.

We will finally show that \(\mathbb {P}_{\mu _{2}}(N_{2,T} < f_T) \rightarrow 0\).

$$ \begin{array}{@{}rcl@{}} \mathbb{P}_{\mu_{2}}(C_{T} = 1) &\geq& \mathbb{P}_{\mu_{2}}(N_{2,T} < f_{T} \text{ and } \max_{s \leq f_{T}} \tilde{kl}_{s} \leq (1-\epsilon/2)\log T) \\ &=& \mathbb{P}_{\mu_{2}}(N_{2,T} < f_{T} \text{ and } \frac{kl(\mu_{2}, \mu^{\prime}_{2})}{(1-\epsilon) \log T} \max_{s \leq f_{T}} \tilde{kl}_{s} \\&\leq& \frac{kl(\mu_{2}, \mu^{\prime}_{2})}{(1-\epsilon)}(1-\epsilon/2)) \end{array} $$

Note that \(kl(\mu _{2}, \mu ^{\prime }_{2}) > 0\) and \(\frac {1-\epsilon /2}{1 - \epsilon } \geq 1\). Therefore by an application of the strong law of large numbers, we have

$$ \begin{array}{@{}rcl@{}} \underset{T \rightarrow \infty}{\lim} \mathbb{P}_{\mu_{2}}(\frac{kl(\mu_{2}, \mu^{\prime}_{2})}{(1-\epsilon) \log T} \max_{s \leq f_{T}} \tilde{kl}_{s} \leq \frac{kl(\mu_{2}, \mu^{\prime}_{2})}{(1-\epsilon)}(1-\epsilon/2)) = 1 \end{array} $$

Since \(\mathbb {P}_{\mu _{2}}(C_{T} = 1) \rightarrow 0\), we must have \(\mathbb {P}_{\mu _{2}}(N_{2,T} < f_T) \rightarrow 0\) as well. Applying Markov inequality again, we get,

$$ \begin{array}{@{}rcl@{}} \mathbb{E}_{\mu_{2}}[N_{2,T}] &\geq& \mathbb{P}_{\mu_{2}}(N_{2,T} \geq f_{T}) f_{T} = \mathbb{P}_{\mu_{2}}(N_{2,T} \geq f_{T}) \frac{(1-\epsilon)\log T}{kl(\mu_{2},\mu^{\prime}_{2})} \\&\geq& (1-o(1))\frac{1-\epsilon}{1+\epsilon} \frac{\log T}{ kl(\mu_{2}, \mu_{1} + {\varDelta})} \end{array} $$

The last step is obtained by applying Eq. 8 together with \(\mathbb {P}_{\mu _{2}}(N_{2,T} \geq f_{T}) \rightarrow 1\). This completes the proof. Note that the key difference between our proof and [7] lies in Eq. 8: the term μ1 + Δ on the RHS of Eq. 8 is necessary to ensure that, in the alternate scenario, agent 1 falls outside SΔ. □

Remark 1

The lower bound on the expected Δ-regret in Theorem 4 is quite similar to the classical regret lower bound in [7]. The difference is that the KL divergence term in the bound is also a function of the parameter Δ: intuitively, instead of the divergence kl(μi, μ∗), we give an allowance of Δ to the optimal agent.
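As a small illustration of Theorem 4, the sketch below evaluates the coefficient of \(\log T\) on the right hand side of Eq. 7 for a given set of CTRs and tolerance Δ (with θi = 1 as in the theorem). The function names and the example numbers are ours; the code assumes μ∗ + Δ < 1 so that the divergence is finite.

```python
import math


def bernoulli_kl(p, q):
    """KL divergence kl(p, q) between Bernoulli(p) and Bernoulli(q)."""
    eps = 1e-12  # guard against log(0)
    p, q = min(max(p, eps), 1 - eps), min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def delta_regret_lower_bound_coeff(mus, delta):
    """Coefficient of log T in the Theorem 4 lower bound (theta_i = 1)."""
    mu_star = max(mus)
    total = 0.0
    for mu_i in mus:
        gap = mu_star - mu_i  # Delta_i = mu^* - mu_i
        if gap >= delta:      # only arms outside S_Delta contribute
            total += gap / bernoulli_kl(mu_i, mu_star + delta)
    return total


# Example: three arms; the second arm is within delta of the best and contributes nothing.
print(delta_regret_lower_bound_coeff([0.7, 0.68, 0.4], delta=0.05))
```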

5 Extension to multi-slot SSA

In the previous sections, we assumed that there was a single slot for which the advertisers were competing. We now look at a more general setting where there are M slots to be allocated to the K agents. As before, each advertiser has exactly one ad for display and the CTR for advertisement i is denoted by μi. Recall that in the case of single slot auctions, the CTR exactly denoted the probability with which an ad received a click. However in the generalized setting of multi-slot auctions, an additional parameter comes into play while computing the click probability due to which the problem becomes much harder [13].

Each position or slot m is associated with a parameter λm called ‘prominence’. λm denotes the probability with which a user observes an ad at slot m + 1 given he has observed the ad at slot m. In order to understand the need for this parameter, a useful scenario to imagine is the listing of web-pages in Google for a query. There are two phases that one can think of once the listing of pages or results have appeared.

  • Phase 1: This is the phase where a user scans through the pages listed. A page listed higher up in the ranking (say second from the top) has more chances of being observed by a user rather than a page that is far below in the ranking (say fifth from the top). λ4, for instance, denotes the probability that a user observes the fifth page, given he has observed the fourth page. Coming back to sponsored ads, we assume that λ0 = 1, that is, the ad listed in the first slot is surely observed. We denote by Γm the probability that an ad at slot m is observed. Γm is computed as,

    $$ \begin{array}{@{}rcl@{}} {\varGamma}_{m} = \begin{cases} 1 & \quad \text{if } m = 1 \\ \prod\limits_{s=1}^{m-1} \lambda_{s} & \quad \text{if } 2 \leq m \leq M\\ 0 & \quad \text{if } m> M\\ \end{cases} \end{array} $$
    (10)

    This modeling assumption for Γm is known as the position dependent cascade model.

  • Phase 2: After having scanned through the list, the user decides to click one or more of the shown ads. In the multi-slot setting [14], it is assumed that multiple ads in a listing may receive clicks. The probability that ad i receives a click when shown at slot m is Γmμi.

We assume that λm, m = 1,…, M are known to the planner a-priori. The problem of learning these parameters along with the CTRs μi is much harder in the presence of strategic agents. Therefore, in this section, we work with the assumption that the λm’s, and hence the Γm’s, are known. In Section 6.2, we give pointers for the design of mechanisms where the Γm’s are unknown.
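Since the Γm’s are assumed known, they can be computed directly from the prominence parameters via Eq. 10. The following is a minimal sketch of that computation; the function name and list layout are ours.

```python
def gamma(lambdas, m, M):
    """Gamma_m under the position dependent cascade model (Eq. 10).

    lambdas[s-1] = lambda_s : probability of observing slot s+1 given slot s
    m, M                    : slot index (1-based) and total number of slots
    """
    if m == 1:
        return 1.0          # the top slot is observed with certainty
    if m > M:
        return 0.0          # slots beyond M are never observed
    prod = 1.0
    for s in range(1, m):   # product lambda_1 * ... * lambda_{m-1}
        prod *= lambdas[s - 1]
    return prod
```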

The above modeling assumptions are as per standard conventions [13]. In the multi-slot setting, the allocation is given to multiple agents at every time step. We denote by \(\mathcal {A}(b, \rho , t)\) ⊂{1,…, K}, the allocation at time t for bids b and click realization ρ. The cardinality of the allocated set \(|\mathcal {A}(b, \rho , t)| = M\). We also use the notation \(\mathcal {A}_i(b, \rho , t) = m\) to denote the allocation to agent i at time t is slot m, for the bid profile b, click realization ρ. If an agent i is not allocated any of the M slots at time t, we say \(\mathcal {A}_i(b, \rho , t) = 0\).

We denote by Wi, m the social welfare of agent i, when he is given slot m. Wi, m is the expected valuation that agent i receives when he is given slot m and is computed as,

$$ \begin{array}{@{}rcl@{}} W_{i,m} = {\varGamma}_{m} \mu_{i} \theta_{i} \end{array} $$
(11)

For ease of reference, the additional relevant parameters for the multi-slot setting are provided in Table 3.

Table 3 Additional notations for multi-slot SSA

Having described the multi-slot setting, we now analyze the scenario from the viewpoint of the search engine or central planner. In the ideal scenario, the planner would like to allot the ads exactly to the top M agents with the largest social welfare. This use case has been studied in the literature [14] and exploration separated mechanisms with regret of \(O(T^{2/3})\) have been proposed. Various possible allocations are explored for \(O(T^{2/3})\) time steps for every agent, after which the allocation algorithm is guaranteed to converge to the ideal allocation with high probability. As in the single slot case, \(O(T^{2/3})\) exploration rounds are required to distinguish all the agents perfectly from each other when there are agents whose social welfare values are arbitrarily close.

However, a much more practical problem of interest is to study and design mechanisms when the search engine is indifferent to a gap of Δ in social welfare for every slot. We observe that in cases where the agents are well-separated, \(O(T^{2/3})\) exploration rounds are not required. In fact, \(O(\log T)\) exploration rounds are sufficient to converge to an allocation that is well within the requirements of the search engine.

Having explained the problem, we now formalize the notion of separatedness in this setting. Let K(1),…, K(M) ∈ [K] be the best M agents in terms of their single slot social welfare values, that is, \(\mu _{K^{(1)}} \theta _{K^{(1)}} > \mu _{K^{(2)}} \theta _{K^{(2)}} > {\ldots } >\mu _{K^{(M)}} \theta _{K^{(M)}} \). Let \(W_{*,m} = W_{K^{(m)}, m}\). The ideal solution would be to allocate agent K(m) the slot m. This allocation would yield the largest social welfare, but in the worst case, when the agents’ social welfares are separated by a function of T, converging to this optimal allocation would require \(O(T^{2/3})\) exploration rounds [14]. Instead, for a prescribed value of Δ fixed by the search engine, define the set,

$$ S_{{\varDelta},m } = \left\lbrace i \in [K]: W_{K^{(m)}, m} - W_{i,m} < {\varDelta} \right\rbrace. $$
(12)

SΔ, m is the set of all agents whose social welfare is at most Δ away from the agent K(m) (who should have ideally been given slot m). The planner is indifferent to the regret contributed by the agents in SΔ, m, if any of them are allotted slot m. Hence we define the multi-slot Δ-regret metric as,

$$ \begin{array}{@{}rcl@{}} {\varDelta}\text{-regret} = \sum\limits_{t=1}^{T} \sum\limits_{m=1}^{M} (W_{*,m} - W_{I_{t,m}, m})\mathbbm{1}\left[ I_{t,m} \in [K]\setminus S_{{\varDelta},m}\right] \end{array} $$

The Δ-UCB mechanism for the multi-slot SSA is given in Algorithm 2.

Algorithm 2 The Δ-UCB mechanism for multi-slot SSA
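Algorithm 2 appears only as an image in the original. As a rough sketch consistent with the analysis in this section, the exploitation-phase allocation assigns slot m to the agent with the m-th largest UCB-weighted bid. The code below illustrates only this allocation step; the payments, based on the weighted VCG scheme referred to in Theorem 5, are omitted, and the function name and inputs are ours.

```python
def multi_slot_allocation(mu_plus, bids, M):
    """Exploitation-phase allocation for multi-slot Delta-UCB (sketch).

    mu_plus[i] : frozen UCB index on the CTR of agent i after exploration
    bids[i]    : reported per-click valuation of agent i
    Returns a dict mapping slot (1-based) to the allocated agent.
    """
    K = len(bids)
    # Rank agents by UCB-weighted bid; slot m goes to the m-th ranked agent.
    ranked = sorted(range(K), key=lambda i: mu_plus[i] * bids[i], reverse=True)
    return {m + 1: ranked[m] for m in range(min(M, K))}
```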

We analyze the regret and truthfulness of Algorithm 2. The lemmas and theorems for establishing the results for the multi-slot setting are similar to the single slot setting, however there are subtle differences in proving many of the results. We will highlight them as and when necessary.

Theorem 5

In the multi-slot setting Δ-UCB is Dominant Strategy Incentive Compatible (DSIC) and Individually Rational (IR).

Proof

The mechanism is an implementation of the weighted VCG scheme (with the weight \(w_{i} = \widehat {\mu }_{i}^{+}/\mu _{i}\) for each agent i) and is hence DSIC and IR. □

Lemma 4

For an agent i, the click through rate UCB and LCB indices,

$$ \begin{array}{@{}rcl@{}} \widehat{\mu}_{i,t}^{+} = \widehat{\mu}_{i, t} + \epsilon_{i,t} = \widehat{\mu}_{i, t} + \sqrt{\left( \sum\limits_{m^{\prime}=1}^{M} \frac{M_{i,t}^{(m^{\prime})}}{{\varGamma}_{m^{\prime}}^{2}} \right) \frac{2\log T}{N_{i, t}^{2}}} \end{array} $$
(13)
$$ \begin{array}{@{}rcl@{}} \widehat{\mu}_{i,t}^{-} = \widehat{\mu}_{i,t} - \epsilon_{i,t} = \widehat{\mu}_{i,t} - \sqrt{\left( \sum\limits_{m^{\prime}=1}^{M} \frac{M_{i,t}^{(m^{\prime})}}{{\varGamma}_{m^{\prime}}^{2}} \right) \frac{2\log T}{N_{i, t}^{2}}} \end{array} $$
(14)

satisfy \(P(\mu _{i} \notin [\widehat {\mu }_{i, t}^{-}, \widehat {\mu }_{i,t}^{+}]) \leq 2T^{-4}\) for all t.

Proof

At every time step, we observe samples \(\rho _{I_{t,m}}(t), m = 1, \ldots , M \) corresponding to the clicks of the allocated ads. These samples also encompass slot specific information which must be accounted for in the computation of empirical mean as well as UCB index for μi. For an agent i, let the random variable Ci, m denote whether ad i receives a click at slot m. Therefore Ci, m is a Bernoulli random variable with bias Γmμi.

We obtain a sample ρi(⋅) of Ci, m when ad i is allocated slot m. However, it is the samples of Ci, m/Γm that give us an unbiased estimator for μi. Therefore, the random variable of interest is the scaled Bernoulli random variable,

$$ \begin{array}{@{}rcl@{}} D_{i,m} = \begin{cases} 0 & \quad \text{w.p } 1- {\varGamma}_{m}\mu_{i} \\ 1/{\varGamma}_{m} & \quad \text{w.p } {\varGamma}_{m}\mu_{i} \end{cases} \end{array} $$
(15)

Di, m is bounded in [0,1/Γm] and \(\mathbb {E}[D_{i,m}]\) is μi. Also,

$$ \begin{array}{@{}rcl@{}} \log \mathbb{E}\left[\exp(\lambda (D_{i,m} - \mu_{i})\right] \leq \frac{\lambda^{2}}{8 {{\varGamma}_{m}^{2}}} \text{ (by Hoeffding's Lemma)} \end{array} $$

Consider the scenario where, for an ad i, a single sample click is available from each slot. Let Xi, m denote this sample of Ci, m. Assume Xi, m are all independent and \( \widehat {\mu }_{i} = 1/M{\sum }_{m=1}^{M} X_{i,m}/{\varGamma }_{m}\). \(\mathbb {E}[\widehat {\mu _{i}}] = \mu _{i}\). Now,

$$ \begin{array}{@{}rcl@{}} P(\widehat{\mu}_{i} - \mu_{i} > \epsilon) &=& P\left( \sum\limits_{m=1}^{M} X_{i,m}/{\varGamma}_{m} - M \mu_{i} > \epsilon M \right) \\ &=& P\left( \exp\left( \lambda \left( \sum\limits_{m=1}^{M} X_{i,m}/{\varGamma}_{m} - M \mu_{i} \right)\right) > \exp(\lambda\epsilon M) \right) \\ &\leq& \mathbb{E}\left[\exp\left( \lambda \left( \sum\limits_{m=1}^{M} X_{i,m}/{\varGamma}_{m} - M \mu_{i} \right)\right)\right] \big/ \exp(\lambda\epsilon M) \quad \text{(by Markov inequality)} \\ &=& \prod\limits_{m=1}^{M} \mathbb{E}\left[\exp(\lambda (X_{i,m}/{\varGamma}_{m} - \mu_{i} ))\right] \big/ \exp(\lambda\epsilon M) \quad \text{(by independence of } X_{i,m}\text{)} \\ &\leq& \exp\left( \sum\limits_{m=1}^{M} \frac{\lambda^{2} }{8 {{\varGamma}_{m}^{2}}} - \lambda M\epsilon\right) \quad \text{(by Hoeffding's lemma)} \end{array} $$
(16)

In order to tighten the above bound, one must find the λ that minimizes \(\exp ({\sum }_{m=1}^{M} \frac {\lambda ^{2} }{8 {{\varGamma }_{m}^{2}}} - \lambda M\epsilon )\). Setting \(\lambda = \lambda ^{*} = 4M\epsilon /\eta \), where \(\eta = {\sum }_{m=1}^{M} 1/{{\varGamma }_{m}^{2}}\), achieves the minimum value. Therefore,

$$ \begin{array}{@{}rcl@{}} P(& \widehat{\mu_{i}} - \mu_{i} > \epsilon) \leq \exp(-2M^{2} \epsilon^{2} /\eta) \end{array} $$
(17)

In order to obtain a δ confidence on \(P(\widehat {\mu _{i}} - \mu _{i} > \epsilon ) \), 𝜖 must be set so that \(\exp (-2M^{2} \epsilon ^{2} /\eta ) = \delta = T^{-4}\). Therefore, \(\epsilon = \sqrt {\sum \limits _{m=1}^{M} \left (\frac {1}{{{\varGamma }_{m}^{2}}} \right ) \frac {2 \log T}{M^{2}}}\). In the above analysis we assumed that one sample was available from each slot. When we have a total of Ni, t independent samples for ad i, with \(M_{i,t}^{(m)}\) samples from slot m at time t, we have \(\eta = {\sum }_{m=1}^{M} M_{i,t}^{(m)}/{{\varGamma }_{m}^{2}}\) and therefore \(\epsilon _{i,t} = \sqrt {\left (\sum \limits _{m^{\prime }=1}^{M} \frac {M_{i,t}^{(m^{\prime })}}{{\varGamma }_{m^{\prime }}^{2}} \right ) \frac {2\log T}{N_{i, t}^{2}}}\), completing the proof. □

A noteworthy feature of our estimates is the following. An allocation of an ad i in a slot m yields a sample for the computation of not only \(\widehat {W}_{i,m,t}\), but also of \(\widehat {W}_{i,m^{\prime },t}\) for all slots \(m^{\prime } \in \{1, \ldots , M \}\). This is because Γm is known to the planner a-priori. Note therefore that Ni, t, the number of allocations that ad i receives till time t, counts the allocations agent i receives across all the slots.
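The per-slot reweighting and the confidence width of Lemma 4 can be summarized by the short sketch below, which computes the UCB index \(\widehat{\mu}_{i,t}^{+}\) of an ad from its click samples; the function name and the sample layout are ours.

```python
import math


def ctr_ucb_index(samples, gammas, T):
    """UCB index for mu_i from per-slot click samples (cf. Lemma 4).

    samples : list of (slot, clicked) pairs observed for ad i (slots 1-based)
    gammas  : gammas[m-1] = Gamma_m, probability that slot m is observed
    T       : time horizon used in the confidence width
    """
    n = len(samples)  # N_{i,t}: total allocations of ad i, across all slots
    # Unbiased estimate: a click from slot m is reweighted by 1 / Gamma_m.
    mu_hat = sum(clicked / gammas[slot - 1] for slot, clicked in samples) / n
    # eta = sum_m M_{i,t}^{(m)} / Gamma_m^2, accumulated sample by sample.
    eta = sum(1.0 / gammas[slot - 1] ** 2 for slot, _ in samples)
    eps = math.sqrt(eta * 2.0 * math.log(T) / n ** 2)
    return mu_hat + eps
```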

Lemma 5

For an agent i and slot m, the social welfare UCB indices for agent i,

$$ \begin{array}{@{}rcl@{}} \widehat{W}_{i,m,t}^{+} &=& {\varGamma}_{m} \widehat{\mu}_{i, t} \theta_{i} + \epsilon_{i,m,t} = {\varGamma}_{m} \widehat{\mu}_{i, t} \theta_{i} \\&&+ \sqrt{\left( \sum\limits_{m^{\prime}=1}^{M} \frac{M_{i,t}^{(m^{\prime})}}{{\varGamma}_{m^{\prime}}^{2}} \right) \frac{2{\theta_{i}^{2}} {{\varGamma}_{m}^{2}}\log T}{N_{i, t}^{2}}} \end{array} $$
(18)
$$ \begin{array}{@{}rcl@{}} \widehat{W}_{i,m,t}^{-} &=& {\varGamma}_{m} \widehat{\mu}_{i, t} \theta_{i} - \epsilon_{i,m,t} = {\varGamma}_{m} \widehat{\mu}_{i,t} \theta_{i} \\&&- \sqrt{\left( \sum\limits_{m^{\prime}=1}^{M} \frac{M_{i,t}^{(m^{\prime})}}{{\varGamma}_{m^{\prime}}^{2}} \right) \frac{2{\theta_{i}^{2}} {{\varGamma}_{m}^{2}}\log T}{N_{i, t}^{2}}} \end{array} $$
(19)

satisfy \(P(W_{i,m} \notin [\widehat {W}_{i, m, t}^{-}, \widehat {W}_{i,m, t}^{+}]) \leq 2T^{-4}\) for all t.

Proof

The proof idea is similar to Lemma 1. □

Lemma 6

Suppose at time step t, \(N_{j,t} > \frac {8\theta _{max}^{2} \log T}{{\varDelta }^{2}} \forall j \in [K]\). Then ∀i ∈ [K] and ∀m ∈ [M], 2𝜖i, m, t < Δ.

Proof

The proof is similar to Lemma 2. □

Lemma 7

For an agent i, slot m and time t, let Bi, m, t be the event \(B_{i,m,t} = \{\omega : W_{i,m} \notin [\widehat {W}_{i,m,t}^{-}(\omega ), \widehat {W}_{i,m,t}^{+}(\omega )] \}\). Define the event \(G =\underset {t}{\bigcap } \underset {i}{\bigcap } \underset {m}{\bigcap } B_{i,m,t}^{c}\). Then \(P(G) \geq 1 - \frac {2}{T^{2}}\).

Proof

The proof has some subtle differences from Lemma 3 because in the multi-slot extension, the events Bi, m, t are not independent across the slots.

Observation:

If an element ω from the set of outcomes is such that ωBi, m, t, then \(\omega \in B_{i,m^{\prime },t} \forall m^{\prime } \in [M]\). This is because, for any two slots m and \(m^{\prime }\),

$$ \begin{array}{@{}rcl@{}} W_{i,m} \notin [\widehat{W}_{i,m,t}^{-}, \widehat{W}_{i,m,t}^{+}] &\iff& \mu_{i} \notin [\widehat{\mu}_{i,t}^{-}, \widehat{\mu}_{i,t}^{+}]\\ & \iff& W_{i,m^{\prime}} \notin [\widehat{W}_{i,m^{\prime},t}^{-}, \widehat{W}_{i,m^{\prime},t}^{+}] \end{array} $$

Therefore \(P(\bigcup _{m} B_{i,m,t}) = P(B_{i,1,t})\). From Lemma 5, \(P(\bigcup _{m} B_{i,m,t}) = P(B_{i,1,t}) \leq 2T^{-4}\). Hence,

$$ \begin{array}{@{}rcl@{}} P(G) &=& 1 - P\left( \underset{t}{\bigcup} \underset{i}{\bigcup} \underset{m}{\bigcup} B_{i,m,t}\right) = 1 - P\left( \underset{t}{\bigcup} \underset{i}{\bigcup} B_{i,1,t}\right) \\ &\geq& 1 - \frac{2}{T^{2}} \text{(as in Lemma 3)}. \end{array} $$

Theorem 6

Suppose at time t, \(N_{j,t}>8\theta _{max}^{2} \log T/{\varDelta }^{2}\) ∀j ∈ [K]. Then ∀m ∈ [M],∀i ∈ [K] ∖ SΔ, m, \(\widehat {W}_{K^{(m)}, m, t}^{+} > \widehat {W}_{i, m, t}^{+}\) with probability at least \(1 - 2/T^{4}\).

Proof

Suppose at time t where \(N_{j,t} > 8 \theta _{max}^{2} \log T/{\varDelta }^{2}\)j ∈ [K], there exists some m ∈ [M] such that \(\widehat {W}_{K^{(m)}, m, t}^{+} < \widehat {W}_{i, m, t}^{+}\). (Note that this statement does not arise from any assumptions on the allocation, for instance, that agent i is given slot m. This is the major difference from Theorem 2). But the relation between the true social welfare values of these agents is \(W_{K^{(m)}, m} > W_{i, m}\). Then one of the following three conditions must have occurred, like in proof of Theorem 2.

  • Condition 1: \(W_{i,m} < \widehat {W}_{i,m,t}^{-}\). This condition implies a drastic overestimate of the sub-optimal arm i so that the true mean social welfare Wi, m is even below the LCB index \(\widehat {W}_{i,m,t}^{-}\). Figure 4 captures this condition.

  • Condition 2: \( W_{K^{(m)},m} > \widehat {W}_{K^{(m)},m, t}^{+}\). This implies an underestimate of the optimal arm so that the true mean social welfare \(W_{K^{(m)},m} \) lies above even the UCB index \(\widehat {W}_{K^{(m)},m,t}^{+}\). See Fig. 5.

  • Condition 3: \(W_{K^{(m)},m} - W_{i,m} < 2\epsilon _{i,m,t}\). This implies an overlap in the confidence intervals of the optimal and sub-optimal arm. Even though Conditions 1 and 2 are false, the UCB of the sub-optimal arm i is still greater than the UCB of the optimal arm K(m). See Fig. 6 for an illustration of this condition.

Fig. 4 Condition 1, Proof of Theorem 6

Fig. 5 Condition 2, Proof of Theorem 6

Fig. 6 Condition 3, Proof of Theorem 6

From the figure, \( W_{K^{(m)},m} - W_{i,m} \leq \widehat {W}_{i,m,t}^{+} - \widehat {W}_{i,m,t}^{-} \leq 2 \epsilon _{i, m, t} \). If all the three conditions above were false, then,

$$ \begin{array}{@{}rcl@{}} \widehat{W}_{K^{(m)},m,t}^{+} &>& W_{K^{(m)},m} > W_{i,m} + 2 \epsilon_{i,m,t} > \widehat{W}_{i,m,t}^{-} + 2 \epsilon_{i,m,t} \\ & =& \widehat{W}_{i,m, t}^{+} \text{ (A contradiction!)} \end{array} $$

As per the statement of the theorem, \(N_{i,t} > 8 \theta _{max}^{2} \log T/{\varDelta }^{2}\). Therefore by Lemma 6, 2𝜖i, m, t < Δ. For agent i ∈ [K] ∖ SΔ, m, \(W_{K^{(m)},m} - W_{i, m} \geq {\varDelta } > 2\epsilon _{i,m,t}\). Therefore, Condition 3 above does not hold true. So,

$$ \begin{array}{@{}rcl@{}} P(\widehat{W}_{i,m,t}^{+} \geq \widehat{W}_{K^{(m)},m,t}^{+}) &\leq& P(\text{Condition 1}) + P(\text{Condition 2}) \\ & \leq & T^{-4} + T^{-4} = \frac{2}{T^{4}} \end{array} $$
$$ \begin{array}{@{}rcl@{}} P(\widehat{W}_{K^{(m)},m, t}^{+} >\widehat{W}_{i,m, t}^{+} ) &=& 1- P(\widehat{W}_{i,m, t}^{+} \geq \widehat{W}_{K^{(m)},m,t}^{+}) \geq 1 - \frac{2}{T^{4}} \end{array} $$

where, as in Theorem 2, each of the two condition probabilities is at most \(T^{-4}\). □

Theorem 7

If the Δ-UCB mechanism is executed in the multiple slot scenario for a total time horizon of T rounds, it achieves an expected Δ-regret of \(O(\log T)\).

Proof

The proof idea has some subtle differences from the proof of Theorem 3. As before, we first compute the expected Δ-regret conditional on G. During the exploration rounds, the mechanism incurs a Δ-regret of at most \(\xi =\frac {8MK \theta _{max}^{3} \log T}{{\varDelta }^{2}}\).

$$ \begin{array}{@{}rcl@{}} \mathbb{E}\left[ {\varDelta}\text{-regret}|G \right] \leq \xi + \sum\limits_{t=\gamma+1}^{T} \sum\limits_{m=1}^{M} (W_{K^{(m)},m} - W_{I_{t,m},m}) \mathbbm{1}\left[I_{t,m} \in [K] \setminus S_{{\varDelta},m}\right] \end{array} $$

We will now show that the second term above evaluates to zero. For any m, the cardinality of SΔ, m is at least m. This is because for all K(j) above m in the ranking of agents (j < m), \(W_{K^{(m)}, m} - W_{K^{(j)},m} < 0 < {\varDelta }\) as \(W_{K^{(j)},m}> W_{K^{(m)}, m}\); therefore there are at least m − 1 such agents in SΔ, m. Also K(m) ∈ SΔ, m as \(W_{K^{(m)}, m} - W_{K^{(m)},m}=0 < {\varDelta }\). Therefore ∀j ∈{1,…, m}, K(j) ∈ SΔ, m. While allocating slot m, at least one of the agents in SΔ, m must still be unallocated: this follows from the pigeonhole principle, since only m − 1 slots have been filled so far while SΔ, m contains at least m agents. Now if the allocated agent for slot m, It, m ∈ [K] ∖ SΔ, m, one of the following two cases occurs.

  • Case 1: The ideal agents K(1),…, K(m− 1) for all the previous slots 1,…, m − 1 have already been allocated before the allocation of slot m. This means that K(m) has not been allocated yet. Also, \(\widehat {W}_{(I_{t,m}),m, \gamma }^{+} > \widehat {W}_{K^{(m)},m, \gamma }^{+} \). Since G is true and t > γ, the above event cannot occur (by Theorem 6).

  • Case 2: The agent K(m) has already been allocated to some other slot before the allocation of slot m has begun. Therefore there is some agent K(j), j < m, with a larger social welfare value, who has still not been allocated. That is, \(W_{K^{(j)},m} > W_{K^{(m)},m} > W_{(I_{t,m}),m}\). Given that It, m ∉ SΔ, m, we can deduce that It, m ∉ SΔ, j. This is because,

    $$ \begin{array}{@{}rcl@{}} &&W_{K^{(m)},m} - W_{(I_{t,m}),m} \geq {\varDelta} \\& \implies& W_{K^{(j)},m} - W_{(I_{t,m}),m} \geq {\varDelta} \\& \implies& \mu_{K^{(j)}} \theta_{K^{(j)}} - \mu_{I_{t,m}} \theta_{I_{t,m}} \geq {\varDelta}/{\varGamma}_{m} \\& \implies& {\varGamma}_{j} (\mu_{K^{(j)}} \theta_{K^{(j)}} - \mu_{I_{t,m}} \theta_{I_{t,m}}) \geq {\varGamma}_{j} {\varDelta}/{\varGamma}_{m} \\& \implies& W_{K^{(j)},j} - W_{(I_{t,m}),j} \geq {\varDelta} \end{array} $$
    (20)

The last line in the above chain of implications holds as \({\varGamma}_{j} > {\varGamma}_{m}\) for j < m. But \(\widehat {W}_{K^{(j)},m, \gamma }^{+} < \widehat {W}_{(I_{t,m}),m, \gamma }^{+}\). Then the inequality \(\widehat {W}_{K^{(j)},j, \gamma }^{+} < \widehat {W}_{(I_{t,m}),j, \gamma }^{+}\) also holds, due to the way the slot-specific UCB indices are computed. From Theorem 6 applied to slot j, we find that \(\widehat {W}_{K^{(j)},j, \gamma }^{+} > \widehat {W}_{(I_{t,m}),j, \gamma }^{+}\). Again, this cannot happen as G is true and t > γ. Therefore we get that \(\mathbb {E}\left [ {\varDelta }\text {-regret}|G \right ] \leq \xi \). A small numerical check of the chain of implications in Eq. 20 is sketched below.
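The following sketch assumes the social welfare decomposition \(W_{i,m} = {\varGamma}_m \mu_i \theta_i\) used above and verifies, on random instances, that whenever the slot-m gap to \(K^{(m)}\) is at least Δ and the prominences are non-increasing in the slot index, the slot-j gap to \(K^{(j)}\) (j < m) is also at least Δ. The instance generator and all names are our own, purely for illustration.

```python
import numpy as np

# Numerical sanity check of the implication chain in Eq. 20, assuming
# W_{i,m} = Gamma_m * mu_i * theta_i and Gamma_1 >= Gamma_2 >= ... >= Gamma_M.

rng = np.random.default_rng(1)

def check_once(K=6, M=4, Delta=0.05):
    mu = rng.uniform(0.1, 0.9, size=K)                     # CTRs
    theta = rng.uniform(0.1, 1.0, size=K)                  # valuations per click
    Gamma = np.sort(rng.uniform(0.2, 1.0, size=M))[::-1]   # non-increasing prominences
    q = mu * theta                                         # per-click welfare mu_i * theta_i
    order = np.argsort(-q)                                 # K^{(1)}, K^{(2)}, ... by decreasing q

    for m in range(1, M):                                  # slot index (0-based here)
        Km = order[m]
        for j in range(m):                                 # j < m, so Gamma[j] >= Gamma[m]
            Kj = order[j]
            for i in range(K):
                gap_m = Gamma[m] * (q[Km] - q[i])          # W_{K^{(m)},m} - W_{i,m}
                gap_j = Gamma[j] * (q[Kj] - q[i])          # W_{K^{(j)},j} - W_{i,j}
                if gap_m >= Delta:                         # premise of Eq. 20
                    assert gap_j >= Delta, "implication chain violated"

for _ in range(1000):
    check_once()
print("Eq. 20 implication chain held on all random instances.")
```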

Also, \(P(G^{c}) = 1 - P(G)< \frac {2}{T^{2}}\) from Lemma 7. Putting all the steps together,

$$ \begin{array}{@{}rcl@{}} \mathbb{E}\left[ {\varDelta}\text{-regret}\right] &=& \mathbb{E}\left[ {\varDelta}\text{-regret} |G \right] P(G) + \mathbb{E}\left[ {\varDelta}\text{-regret} |G^{c} \right] P(G^{c}) \\ & \leq& \frac{8KM\theta_{max}^{3} \log T}{{\varDelta}^{2}} \cdot 1 + TM\theta_{max}\cdot\frac{2}{T^{2}} \\ & \leq& \frac{8KM\theta_{max}^{3} \log T}{{\varDelta}^{2}} + 2\theta_{max} \end{array} $$
(21)

The simplification in the second line is because \(\mathbb {E}\left [{\varDelta }\text {-regret} |G^{c} \right ] \leq TM\theta_{max}\). In the last line we use the fact that \(M \leq T\). This completes the proof. □
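To get a feel for the bound in Eq. 21, the short sketch below evaluates \(\frac{8KM\theta_{max}^{3}\log T}{{\varDelta}^{2}} + 2\theta_{max}\) for a few parameter choices. The numbers are arbitrary and only meant to show the logarithmic growth in T, in contrast with the \(T^{2/3}\) scaling of exploration-separated mechanisms that target the worst case.

```python
import math

# Illustrative evaluation of the Delta-regret bound of Theorem 7:
#   E[Delta-regret] <= 8*K*M*theta_max^3*log(T)/Delta^2 + 2*theta_max
K, M = 10, 3          # number of ads and slots (arbitrary)
theta_max = 1.0       # maximum valuation per click (arbitrary)
Delta = 0.1           # tolerance level chosen by the planner (arbitrary)

for T in (10**4, 10**6, 10**8):
    bound = 8 * K * M * theta_max**3 * math.log(T) / Delta**2 + 2 * theta_max
    print(f"T = {T:>9}: Delta-regret bound ~ {bound:,.0f}  "
          f"(vs. T^(2/3) = {T ** (2/3):,.0f})")
```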

6 Extensions to other variants of multi-slot SSA

In this section, we look at other variants in the multi-slot SSA setting and discuss how our mechanism can be adapted to such settings.

6.1 Position and ad dependent cascade model

We have explained our algorithm and performed the analysis for the position dependent cascade model for SSA, where the function \({\varGamma}_{m}\) is characterized by Eq. 10 and is known to the planner a priori. A more general model would be one where \({\varGamma}_{m}\) may also depend on the ad displayed at position m. Our mechanism can also be used in such scenarios and the same analysis will hold.

6.2 Handling the case of unknown Γm

We have assumed that the functions \({\varGamma}_{m}\) are known to the planner a priori. Now suppose that the \({\varGamma}_{m}\)s need to be learnt. The same allocation scheme as in Algorithm 2 may still be used. However, the proposed payment scheme in Algorithm 2 can no longer be computed, as the payments use the \({\varGamma}_{m}\)s, which are unknown.

In order to handle such a scenario, we must first obtain estimates of the \({\varGamma}_{m}\)s. It is known that the parameter for the first slot is \({\varGamma}_{1} = 1\); only \({\varGamma}_{2},\ldots, {\varGamma}_{M}\) need to be estimated. We first describe a mechanism which relies on an arbitrary learning algorithm to provide estimates \( \widehat {{\varGamma }}_{2}, \ldots ,\widehat {{\varGamma }}_{M}\). Thereafter, we remark on possible learning schemes.

Proposition 1

Suppose we have a learning scheme that gives us estimates \(\widehat {{\varGamma }}_{2}, \ldots ,\widehat {{\varGamma }}_{M}\) such that, \(\widehat {{\varGamma }}_{2} \geq \widehat {{\varGamma }}_{3} \geq {\ldots } \geq \widehat {{\varGamma }}_{M}\) and \(0 \leq \widehat {{\varGamma }}_{m} \leq 1 \text { for } m = 2, \ldots , M\). Let \(\widehat {{\varGamma }}_{1} = 1\).

We propose a weighted VCG mechanism [28], which is known to be DSIC truthful and is also IR. Suppose the private valuation of agent i for a click is \(\theta_{i}\). Let x ∈ {0,1}^{K×M} be an outcome of the allocation, with \(x_{im} = 1\) if ad i is allotted slot m and zero otherwise. The valuation function of agent i in this case is,

$$ \begin{array}{@{}rcl@{}} v_{i}(x, \theta_{i}) = \sum\limits_{m=1}^{M} {\varGamma}_{m} \mu_{i} \theta_{i} x_{im} \end{array} $$
(22)

Define a weight vector \(w_{i} \in \mathbb{R}^{M}\) for every agent i, whose entry corresponding to slot m is \(w_{i,m} = \frac {\widehat {\mu }_{i}^{+} \widehat {{\varGamma }}_{m}}{\mu _{i} {\varGamma }_{m}}\). Here, \(\widehat {\mu }_{i}^{+}\) is the UCB index corresponding to the CTR of ad i, computed after the fixed number of exploration rounds as in Algorithm 2. However, in this scenario, the UCB index is constructed using samples of the clicks from allocations in the first slot alone.
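The role of the weights is easiest to see by plugging them into the per-slot weighted value: under the valuation in Eq. 22, \({\varGamma}_m \mu_i b_i \, w_{i,m} = \widehat{{\varGamma}}_m \widehat{\mu}_i^{+} b_i\), so the unknown \(\mu_i\) and \({\varGamma}_m\) cancel. The following minimal sketch, with made-up numbers, just illustrates this cancellation.

```python
# Minimal illustration of the weight cancellation in the weighted VCG objective.
# All numbers below are made up; mu_hat_plus stands for the UCB index of the CTR.

mu = 0.3              # true (unknown) CTR of ad i
mu_hat_plus = 0.35    # UCB index of the CTR after exploration
gamma = 0.6           # true (unknown) prominence of slot m
gamma_hat = 0.55      # estimate of the prominence
b = 2.0               # bid of agent i

w_im = (mu_hat_plus * gamma_hat) / (mu * gamma)   # weight w_{i,m}
weighted_value = gamma * mu * b * w_im            # Gamma_m * mu_i * b_i * w_{i,m}

assert abs(weighted_value - gamma_hat * mu_hat_plus * b) < 1e-12
print(f"weighted per-slot value = {weighted_value:.4f} "
      f"= Gamma_hat_m * mu_hat_i^+ * b_i")
```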

Our weighted VCG mechanism is described in Fig. 7. The mechanism uses the allocation,

$$ \begin{array}{@{}rcl@{}} A^{*}(b_{i}, b_{-i}) = \underset{x}{\arg\max} \sum\limits_{i=1}^{K} \sum\limits_{m=1}^{M} {\varGamma}_{m} \mu_{i} b_{i} x_{im} w_{i,m} \end{array} $$
Fig. 7: Δ-UCB mechanism for the position dependent cascade model using estimates for \({\varGamma}_{m}\)s

Note, however, that this allocation rule boils down to the same allocation used in Algorithm 2. This is because the estimates \(\widehat {{\varGamma }}_{m}\) are monotonically non-increasing in m. The procedure for obtaining the allocation \(A^{*}(b_{i}, b_{-i})\) is the following: we sort the agents based on \(\widehat {\mu }_{i}^{+} b_{i}\) and allocate the slots to the best M agents. Therefore, the allocation rule is independent of the \({\varGamma}_{m}\)s and is equivalent to,

$$ \begin{array}{@{}rcl@{}} A^{*}(b_{i}, b_{-i}) = \underset{x}{\arg\max} \sum\limits_{i=1}^{K} \sum\limits_{m=1}^{M} \widehat{\mu}_{i}^{+} b_{i} x_{im} \end{array} $$
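A minimal sketch of this allocation rule: sort the agents by \(\widehat{\mu}_i^{+} b_i\) and assign the top M agents to slots 1,…, M in that order. The function and variable names are ours; the rule itself is the one stated above.

```python
import numpy as np

def allocate(mu_hat_plus, bids, M):
    """Allocate M slots by sorting agents on mu_hat_i^+ * b_i (ties broken arbitrarily).

    Returns a list slot_of_agent where slot_of_agent[i] is the slot (1-based)
    given to agent i, or 0 if agent i receives no slot.
    """
    score = np.asarray(mu_hat_plus) * np.asarray(bids)
    ranking = np.argsort(-score)                 # agents in decreasing order of score
    slot_of_agent = [0] * len(score)
    for m, agent in enumerate(ranking[:M], start=1):
        slot_of_agent[agent] = m
    return slot_of_agent

# Example with made-up UCB indices and bids: K = 4 agents, M = 2 slots.
print(allocate(mu_hat_plus=[0.5, 0.2, 0.4, 0.3], bids=[2.0, 3.0, 1.5, 2.5], M=2))
# scores = [1.0, 0.6, 0.6, 0.75]; agent 0 gets slot 1 and agent 3 gets slot 2.
```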

The expected payment to be made by agent i when allocated a slot \(m^{\prime }\) is,

$$ \begin{array}{@{}rcl@{}} \mathbb{E}[{P_{i}^{t}}(b, \rho)] = \frac{\mu_{i} {\varGamma}_{m^{\prime}}}{\widehat{\mu}_{i,t}^{+} \widehat{{\varGamma}}_{m^{\prime}}} \sum\limits_{j \neq i} \sum\limits_{m=m^{\prime}+1}^{M+1} \widehat{\mu}_{j,t}^{+} b_{j} x_{jm} (\widehat{{\varGamma}}_{m-1} - \widehat{{\varGamma}}_{m} ) \end{array} $$

The above is the externality-based payment prescribed by weighted VCG. However, since we adopt the pay-per-click scheme,

$$ \begin{array}{@{}rcl@{}} {P_{i}^{t}}(b,\rho) = \frac{\rho_{i}(t)}{\widehat{\mu}_{i,t}^{+} \widehat{{\varGamma}}_{m^{\prime}}} \sum\limits_{j \neq i} \sum\limits_{m=m^{\prime}+1}^{M+1} \widehat{\mu}_{j,t}^{+} b_{j} x_{jm} (\widehat{{\varGamma}}_{m-1} - \widehat{{\varGamma}}_{m} ) \end{array} $$
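The payment rule above is straightforward to compute once the estimates \(\widehat{{\varGamma}}_m\) and the allocation are fixed. The sketch below implements the formula literally; the variable names, the example instance, and the convention of taking \(\widehat{{\varGamma}}_{M+1} = 0\) with "position" M + 1 held by the best unallocated agent are our assumptions for illustration.

```python
import numpy as np

def pay_per_click_payment(i, m_prime, mu_hat_plus, bids, x, gamma_hat, rho_i_t):
    """Pay-per-click payment P_i^t for agent i allocated slot m_prime (1-based).

    x is a K x (M+1) 0/1 matrix with x[j, m-1] = 1 if agent j occupies position m;
    gamma_hat is a length-(M+1) vector of estimates with gamma_hat[M] = 0 assumed
    (our convention for the M+1 term of the outer sum).
    rho_i_t is the click indicator rho_i(t) for agent i in round t.
    """
    K, M_plus_1 = x.shape
    total = 0.0
    for j in range(K):
        if j == i:
            continue
        for m in range(m_prime + 1, M_plus_1 + 1):        # m = m'+1, ..., M+1
            total += (mu_hat_plus[j] * bids[j] * x[j, m - 1]
                      * (gamma_hat[m - 2] - gamma_hat[m - 1]))
    return rho_i_t / (mu_hat_plus[i] * gamma_hat[m_prime - 1]) * total

# Example: K = 3 agents, M = 2 slots; agent 0 in slot 1, agent 2 in slot 2,
# agent 1 as the best unallocated agent in "position" M+1 (our assumption).
mu_hat_plus = [0.5, 0.2, 0.4]
bids = [2.0, 3.0, 1.5]
x = np.array([[1, 0, 0],
              [0, 0, 1],
              [0, 1, 0]])
gamma_hat = [1.0, 0.6, 0.0]      # Gamma_hat_1, Gamma_hat_2, Gamma_hat_{M+1} = 0
print(pay_per_click_payment(0, 1, mu_hat_plus, bids, x, gamma_hat, rho_i_t=1))
```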

Therefore, the computation of the payments is now feasible. The above weighted VCG scheme is DSIC truthful and IR; the proof follows from the standard weighted VCG argument with the weights as defined above. We now remark on the Δ-regret of the mechanism.

6.2.1 Remarks on learning \(\widehat {{\varGamma }}_{m}\) and computation of Δ-regret

In the above mechanism, we have assumed that the estimates \(\widehat {{\varGamma }}_{m}\) satisfy Proposition 1. The allocation scheme described above ultimately does not rely on these estimates, although the weights \(w_{i,m}\) use them. The mechanism therefore uses the estimates only in the payment rule. We now make an important observation.

Observation::

When any set of estimates \(\{\widehat {{\varGamma }}_{m}\}\), m = 1,…, M satisfying Proposition 1 is used in the mechanism above, the mechanism is DSIC truthful, IR and suffers only logarithmic Δ-regret.

The reason is that the mechanism is an instance of the weighted VCG mechanism and therefore is DSIC truthful and IR, for any estimates of the \({\varGamma}_{m}\)s. As far as the Δ-regret in social welfare is concerned, it is determined by the allocation rule. The allocation rule used turns out to be identical to the one used when \({\varGamma}_{m}\) is known, and is independent of the estimates. Note that it is now possible to minimize the regret in payments by choosing estimates \(\widehat {{\varGamma }}_{m}\) that maximize the payments while satisfying the constraints in Proposition 1. This leads to a constrained optimization problem which can be solved; one possible formulation is sketched below. However, the current work focuses on minimizing the Δ-regret in social welfare, and the problem of minimizing the regret in payments remains open.
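A hedged sketch of one such formulation: treat \(\widehat{{\varGamma}}_2, \ldots, \widehat{{\varGamma}}_M\) as decision variables, maximize the total payment for a fixed allocation, and impose the monotonicity and boundedness constraints of Proposition 1. The objective below (total pay-per-click payment with the click indicators set to 1), the instance numbers, the "position M + 1" convention, and the function names are all our own illustrative choices, not something prescribed in the paper.

```python
import numpy as np
from scipy.optimize import minimize

# Decision variables: Gamma_hat_2, ..., Gamma_hat_M (Gamma_hat_1 = 1, Gamma_hat_{M+1} = 0).
mu_hat_plus = np.array([0.5, 0.4, 0.2])   # UCB indices of CTRs (made up)
bids = np.array([2.0, 1.5, 2.0])          # bids (made up)
M = 2                                     # number of slots
# fixed allocation: agent 0 -> slot 1, agent 1 -> slot 2, agent 2 -> "position" M+1
position = {0: 1, 1: 2, 2: 3}

def total_payment(g):
    # g = [Gamma_hat_2, ..., Gamma_hat_M]; prepend Gamma_hat_1 = 1, append Gamma_hat_{M+1} = 0
    gamma_hat = np.concatenate(([1.0], g, [0.0]))
    total = 0.0
    for i, m_prime in position.items():
        if m_prime > M:
            continue                       # unallocated agents pay nothing
        pay = 0.0
        for j, m in position.items():
            if j != i and m > m_prime:
                pay += mu_hat_plus[j] * bids[j] * (gamma_hat[m - 2] - gamma_hat[m - 1])
        total += pay / (mu_hat_plus[i] * gamma_hat[m_prime - 1])
    return total

# constraints of Proposition 1: 1 >= Gamma_hat_2 >= ... >= Gamma_hat_M >= 0
constraints = [{"type": "ineq", "fun": lambda g: 1.0 - g[0]}]
constraints += [{"type": "ineq", "fun": lambda g, k=k: g[k] - g[k + 1]} for k in range(M - 2)]
bounds = [(1e-6, 1.0)] * (M - 1)           # keep Gamma_hat_m > 0 so payments stay finite

res = minimize(lambda g: -total_payment(g), x0=np.full(M - 1, 0.5),
               bounds=bounds, constraints=constraints, method="SLSQP")
print("estimates maximizing the (illustrative) payment objective:", res.x)
```

In this toy instance the optimizer pushes the estimate to a corner of the feasible region, which illustrates why the choice of estimates interacts with the payment regret even though it leaves the allocation, and hence the Δ-regret in social welfare, untouched.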

7 Conclusion

We have studied a more practical use case of MAB mechanisms in which the planner has the option to specify a tolerance level Δ for sub-optimal arms. The papers in the literature on MAB mechanisms propose schemes that target the worst-case scenario where the arms are arbitrarily close, and therefore prescribe investing a huge number of exploration rounds (Ω(T^{2/3})) to perfectly distinguish the arms. However, the planner may not want to perfectly distinguish arms that are arbitrarily close; often, the planner may instead be willing to allocate arms that are at most Δ away from the best arm. The state of the art does not permit this flexibility to the planner. Towards providing such flexibility, we have, for the first time, introduced a new notion of regret called Δ-regret. When arms that are less than Δ away from the best arm are selected, the Δ-regret incurred is zero; only arms more than Δ away from the best arm contribute to the Δ-regret.

From this perspective, we have revisited the application of MAB mechanisms to sponsored search auctions. First, we analysed the single slot SSA setting and proposed a deterministic, exploration-separated MAB mechanism called Δ-UCB. We showed that Δ-UCB is DSIC truthful, IR and achieves a Δ-regret of \(O(\log T)\). Next, we studied the more challenging setting of multi-slot SSA. In particular, we adopted the cascade model and adapted Δ-UCB to this setting, first under the assumption that the prominence parameters are known. Here too, we showed that the mechanism is DSIC truthful, IR and achieves a Δ-regret of \(O(\log T)\). Finally, we adapted the mechanism to the general multi-slot SSA setting where neither the CTRs nor the prominences are known. Here too, our deterministic, exploration-separated mechanism is DSIC truthful, IR and suffers a Δ-regret of \(O(\log T)\). The other mechanisms in the literature for this setting do not obtain all of these desirable properties: they either compromise on truthfulness, satisfying a weaker notion (truthfulness in expectation), or are forced to resort to randomization in the mechanism.

Our results are generic and apply equally well to several other applications where MAB mechanisms have been used.