
1 Introduction

A growing fraction of online advertisements are sold via ad exchanges. In an ad exchange, after a visitor arrives on a webpage, advertisers compete in an auction to win the impression (the right to deliver an ad to that visitor). Typically, these auctions are second-price auctions, where the winner pays the second highest bid or a reserve price (whichever is larger), and no sale occurs if all of the bids are lower than the reserve price. However, as indicated by e.g. [2, 3, 21], a non-trivial fraction of auctions involve only a single bidder, in which case the auction reduces to a posted-price auction [20] when reserve prices are known: the seller sets a reserve price and the buyer decides whether to accept or reject it. A single publisher can track a large number of visitors with similar properties over time and sell the impressions generated by these visitors to buyers. As buyers typically are involved in a large number of auctions, there is an incentive for them to act strategically [2, 3, 16, 21]. These observations have led to the study of repeated posted-price auctions between a single seller and a strategic buyer.

In this paper we consider a repeated posted-price auction between a single seller and a single buyer, similar to the setting considered in [2, 21]. In every round, the seller posts a price and the buyer decides whether or not to buy at that price. The buyer knows neither the distribution of his valuation, nor the seller’s pricing algorithm, nor the seller’s price set. Furthermore, the seller does not know the valuation distribution and needs to learn how to set the price over time. There are a number of differences between this paper and previous work on repeated posted-price auctions such as [2, 3, 21]. First, unlike in previous work, we study the problem from the perspective of a buyer that aims to maximize his expected utility or surplus, instead of the perspective of a seller that aims to maximize his revenue. Second, previous papers assume that the buyer knows his valuation in each round. In this paper, we relax this assumption: the buyer does not know the distribution of his valuation, and the valuation is only revealed after he buys the item. This is motivated by applications in online advertising where the buyer (advertiser) does not know the exact value of showing ads to a set of users: some users may click on the ad and in some cases the ad may lead to a sale, but the buyer only observes such a response after he displays the advertisement to the user.

As the valuation distribution is unknown, buyers face an exploration-exploitation trade-off and their decisions lead to regret: (i) accepting a price that is at most the mean valuation leads to positive expected utility, while accepting a price above it leads to negative expected utility; (ii) buying the item yields additional information about the mean valuation (at the risk of negative utility), but by not buying there is a risk of missing out on positive utility. We study two types of buyers: strategic buyers and non-strategic buyers. Non-strategic buyers are only interested in achieving sub-linear regret given the prices that are observed and do not attempt to influence those prices. If non-strategic buyers knew the mean valuation, they would use the following rule: always accept a price that is at most the mean valuation and always reject a price above it. Their decisions would then have low regret, but the seller could learn to ask a price very close to the mean valuation, resulting in low utility for the buyer [2, 21]. Strategic buyers, on the other hand, sometimes deviate from this rule in an attempt to influence the learning process of the seller, in order to lower future prices and increase their utility. However, as these attempts are not guaranteed to succeed (buyers know neither the seller’s pricing algorithm nor his price set), strategic buyers still want to ensure sub-linear regret for all possible price sequences.

In our setting, the seller needs to learn to set his prices because he does not know the valuation distribution. To the best of our knowledge, there are no existing ‘optimal’ algorithms with performance guarantees (specifically) for repeated posted-price auctions with a single seller and a single strategic buyer who does not know his valuation: existing algorithms (e.g., [2, 3, 14, 15, 21, 25]) assume that buyers know their valuation and thus lose their performance guarantees. In our experiments (see Sect. 5) we therefore assume that the seller uses an off-the-shelf low-regret learning algorithm for adaptive adversarial bandit feedback, as these algorithms have known performance guarantees [10, 20, 22].

Our main contributions are as follows. First, to the best of our knowledge, we are the first to study repeated posted-price auctions in strategic settings from the perspective of the buyer. We do not assume that the buyer knows his valuation distribution. Second, we construct algorithms with sub-linear (in the problem horizon) regret for both non-strategic and strategic buyers by using ideas from the popular multi-armed bandit algorithms UCB1 [5] and Thompson Sampling [1]. Our algorithms do not require knowledge about the seller’s pricing algorithm or price set. Third, we support our theoretical findings with experiments. These show that, if the seller uses a low-regret learning algorithm based on weight updating (such as EXP3.P [4, 10]), then strategic buyers can obtain much higher utilities than non-strategic buyers.

The remainder of this paper is organized as follows. In Sect. 2 we discuss the related literature. Section 3 provides a formal description of the problem. In Sect. 4 we present our proposed algorithms and provide a theoretical analysis. In Sect. 5 we perform experiments in order to assess the quality of our proposed algorithms. Section 6 concludes our work and provides some directions for further research.

2 Related Literature

The work in this paper is mainly related to the following areas of the literature: posted-price auctions, low-regret learning by sellers and buyers, and decision making for buyers in auctions. We discuss these areas in more detail below.

Repeated posted-price auctions with the goal of maximizing the seller’s revenue, under the assumption that the feedback from buyers is i.i.d. distributed, were studied in [20]. Other works [2, 3, 14, 15, 19, 21, 25] instead study repeated posted-price auctions with strategic buyers. However, these papers all study the seller side of the problem and assume that buyers know their valuations in each round.

On a high level, this paper is related to works that study repeated auctions in which the seller and/or the buyer runs a low-regret learning algorithm [8, 9, 11] and to works on the interaction between bandit algorithms and incentives of buyers [6, 7, 12, 18]. The goal in such studies is to design (truthful) mechanisms that maximize either the revenue of the seller or welfare, when decisions are made based on low-regret algorithms. This is not the focus of our paper.

The aforementioned works focus either on the seller side or on mechanism design, but there is also work that considers the perspective of buyers or bidders. In [13, 23, 24] the focus is on maximizing clicks when click-through rates are unknown, typically under budget constraints. In this paper, rewards for buyers are not determined by the number of clicks; instead, the buyer aims to maximize his cumulative utility or net surplus as in e.g. [2, 3, 21]. In [17] the focus is on designing bidding strategies for buyers that compete against each other and whose valuations are unknown. However, these studies do not focus on repeated posted-price auctions, and strategic behaviour of buyers is not considered.

3 Problem Formulation

We consider a single buyer and a single seller that interact for \( T \) rounds. An item, such as an advertisement space, is repeatedly offered for sale by the seller to the buyer over these \( T \) rounds. In each round \(t \in \mathcal {T}=\lbrace 1, \dots , T \rbrace \), a price \( p_{t} \in \mathcal {P}\) is offered by the seller and a decision \(a_{t} \in \lbrace 0, 1 \rbrace \) is made by the buyer: \(a_{t} = 1\) when the buyer accepts to buy at that price, \(a_{t} = 0\) otherwise. The buyer holds a private valuation \( v_{t} \in [0, 1] \) for the item in round \(t \). The value of \( v_{t} \) is an i.i.d. draw from a distribution \( \mathcal {D} \) and has expectation \( \nu = \mathbb {E} \left\{ v_{t} \right\} \). The buyer does not know \( \mathcal {D} \) and \( \nu \). Also, the buyer does not know \( \mathcal {P} \) or the seller’s pricing algorithm. The value \( v_{t} \) is only revealed to the buyer if he buys the item in round \( t \), i.e., the buyer only observes the value after he buys the item. The seller also does not know \( \mathcal {D} \) or \( \nu \) and does not observe \( v_{t} \).

The utility of the buyer in round \( t \) is given by \( u_{t} = a_{t} \cdot (v_{t} - p_{t}) \). In other words, if the buyer purchases the item the utility is the difference between the valuation and the price. Otherwise, the utility is zero. For a fixed sequence \( \overrightarrow{p} = p_{1}, \dots , p_{T} \) of observed prices and a fixed sequence of decisions \( a_{1}, \dots , a_{T} \) by the buyer, the pseudo-regret of the buyer over \( T \) rounds is defined as \( R_{T}(\overrightarrow{p}) = \sum _{t = 1}^{T} \max \lbrace \nu - p_{t}, 0\rbrace - \sum _{t = 1}^{T} a_{t} \cdot (\nu - p_{t}) \). The term \( \max \lbrace \nu - p_{t}, 0\rbrace \) represents the expected utility of the optimal decision in round \(t \) and the term \(a_{t} \cdot (\nu - p_{t}) \) represents the expected utility of the actual decision that is made by the buyer in round \(t \). The expected pseudo-regret over \( T \) rounds is defined as \(\mathcal {R}_{T}(\overrightarrow{p}) = \mathbb {E} \left\{ R_{T}(\overrightarrow{p}) \right\} \), where the expectation is taken with respect to possible randomization in the selection of the actions \( a_{1}, \dots , a_{T} \). In the remainder, the expected pseudo-regret will simply be referred to as the regret. The notation using \(\overrightarrow{p} \) makes it clear that the regret depends on the sequence of observed prices. We will omit this dependence when the meaning is clear from the context or when a relation is understood to hold for all possible price sequences. For example, we write \( \mathcal {R}_{T} \le O(\sqrt{T \log {T}} )\) when \(\mathcal {R}_{T}(\overrightarrow{p}) \le O(\sqrt{T \log {T}} ) \) for all choices of \(\overrightarrow{p} \).
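To make these definitions concrete, the pseudo-regret can be computed as follows (a minimal sketch; the function name and list-based interface are ours, not from the paper):

```python
def pseudo_regret(nu, prices, actions):
    """Pseudo-regret over T rounds: the expected utility of the per-round
    optimal decisions, max(nu - p_t, 0), minus the expected utility of the
    decisions actually taken, a_t * (nu - p_t)."""
    optimal = sum(max(nu - p, 0.0) for p in prices)
    realized = sum(a * (nu - p) for p, a in zip(prices, actions))
    return optimal - realized
```

Note that a buyer that follows the oracle rule (buy if and only if \( p_{t} \le \nu \)) incurs zero pseudo-regret on every price sequence.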

We consider two types of buyers: non-strategic buyers and strategic buyers. Non-strategic buyers are interested in achieving sub-linear regret for all possible price sequences, but they treat the price sequence as exogenous. That is, if non-strategic buyers knew \( \nu \), then they would follow this rule: buy if and only if \( p_{t} \le \nu \). Strategic buyers also want sub-linear regret for all possible price sequences, but they would sometimes deviate from this rule in an attempt to influence (i.e., lower) future prices that will be offered. If non-strategic buyers knew \( \nu \), then their decisions would have low regret but the seller could learn to ask a price just below \( \nu \), resulting in low utility for the buyer [2, 21]. Strategic buyers actively attempt to influence the learning process of the seller in order to lower the price and to increase their utility. However, as these attempts are not guaranteed to succeed (recall that buyers know neither the seller’s pricing algorithm nor \( \mathcal {P} \)), strategic buyers still want to ensure sub-linear regret for all possible price sequences. The seller does not know \( \mathcal {D} \) or \( \nu \) and does not observe \( v_{t} \), and so he has to learn how to set his price over time under bandit feedback. This paper focuses on the buyer side and the regret bounds that we derive do not depend on the seller’s pricing algorithm. However, in order to test our algorithms, some assumption about the seller’s algorithm is required. To the best of our knowledge, there are no existing ‘optimal’ algorithms for sellers with performance guarantees (specifically) for repeated posted-price auctions with a single seller and a single strategic buyer who does not know his valuation: existing algorithms (e.g., [2, 3, 14, 15, 19, 21, 25]) assume that buyers know \( v_{t} \) and thus lose their performance guarantees. In our experiments (see Sect. 5) we therefore assume that the seller uses an off-the-shelf low-regret learning algorithm for adaptive adversarial bandit feedback, as these algorithms have known performance guarantees [10, 20, 22].


4 Algorithms and Analysis

In this section we present our proposed algorithms for strategic and non-strategic buyers and we provide a theoretical analysis of these algorithms.

4.1 Non-strategic Buyers

We provide two algorithms for non-strategic buyers that have sub-linear regret. The first algorithm, UCB-NS, is based on UCB (upper confidence bound) style bandit algorithms [5] and the second algorithm, TS-NS, is based on the Thompson Sampling principle [1]. In every round, UCB-NS maintains an optimistic estimate of the unknown mean \( \nu \) and decides to buy the item if the estimate is at least as large as the offered price \( p_{t} \). TS-NS samples from a posterior distribution and decides to buy the item if the sampled value is at least as large as the offered price \( p_{t} \). Propositions 1 and 2 bound the regret of UCB-NS and TS-NS, respectively.
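The two decision rules can be sketched as follows. This is an illustration rather than the exact Algorithms 1 and 2: the confidence radius \( \sqrt{2 \log T / n} \), the Gaussian posterior, and the buy-once initialization are our assumptions, and the forced-exploration input \( N \) of TS-NS is omitted.

```python
import math
import random

class UCBNSBuyer:
    """Non-strategic buyer: buy iff an optimistic estimate of nu is >= price.
    The confidence radius sqrt(2 log T / n) is an illustrative choice."""
    def __init__(self, T):
        self.T, self.n, self.total = T, 0, 0.0
    def decide(self, price):
        if self.n == 0:
            return True  # buy once to obtain a first valuation sample
        index = self.total / self.n + math.sqrt(2 * math.log(self.T) / self.n)
        return index >= price
    def observe(self, value):  # called only after a purchase
        self.n += 1
        self.total += value

class TSNSBuyer:
    """Non-strategic buyer: buy iff a posterior sample is >= price.
    The Gaussian posterior is an illustrative stand-in for Algorithm 2."""
    def __init__(self, rng=None):
        self.rng = rng or random.Random(0)
        self.n, self.total = 0, 0.0
    def decide(self, price):
        if self.n == 0:
            return True
        sample = self.rng.gauss(self.total / self.n, 1.0 / math.sqrt(self.n))
        return sample >= price
    def observe(self, value):
        self.n += 1
        self.total += value
```

After many purchases the optimistic index concentrates slightly above the empirical mean, so clearly profitable prices are accepted and clearly unprofitable ones rejected.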

Proposition 1

If Algorithm 1 is run with input \( T \), then \( \mathcal {R}_{T} \le O(\sqrt{T \log {T}} )\).

Proof

If \( \mathbb {I} \left\{ \nu> p_{t} > I_{t} \right\} = 1 \) then the buyer did not buy the item when instead he should have bought it. Similarly, if \( \mathbb {I} \left\{ \nu < p_{t} \le I_{t} \right\} = 1 \), then the buyer did buy the item when instead he should not have bought it.

Note that we can bound the regret as follows

$$\begin{aligned} \mathcal {R}_{T} \le 1&+ \sum _{t = 1}^{T} \mathbb {E} \left\{ (\nu - p_{t}) \cdot \mathbb {I} \left\{ \nu> p_{t} > I_{t} \right\} \right\} \\&+ \sum _{t = 1}^{T} \mathbb {E} \left\{ (p_{t} - \nu ) \cdot \mathbb {I} \left\{ \nu < p_{t} \le I_{t} \right\} \right\} . \end{aligned}$$

Define \( A = \sum _{t = 1}^{T} \mathbb {E} \left\{ (\nu - p_{t}) \cdot \mathbb {I} \left\{ \nu> p_{t} > I_{t} \right\} \right\} \) and \( B = \sum _{t = 1}^{T} \mathbb {E} \left\{ (p_{t} - \nu ) \cdot \mathbb {I} \left\{ \nu < p_{t} \le I_{t} \right\} \right\} \). We will bound each term separately.

For term \(A \) we have \( A \le \sum _{t = 1}^{T} \mathbb {P} \left\{ \nu > I_{t} \right\} \), since \( \nu - p_{t} \le 1 \). Using Hoeffding’s inequality (and a union bound) we obtain \( \mathbb {P} \left\{ \nu > I_{t} \right\} \le \frac{1}{t^{2}} \). Therefore, we conclude that \( A \le \sum _{t = 1}^{T} \mathbb {P} \left\{ \nu > I_{t} \right\} \le \frac{\pi ^2}{6} \).

Define \( \mathcal {B} = \left\{ t \in \mathcal {T} : \nu < p_{t} \le I_{t} \right\} \) and split \( B = B_{1} + B_{2} \), where \( B_{1} \) collects the rounds in \( \mathcal {B} \) in which the empirical mean of the observed valuations lies within its confidence radius of \( \nu \), and \( B_{2} \) collects the remaining rounds. In the rounds counted by \( B_{1} \), the overpayment \( p_{t} - \nu \) is at most twice the confidence radius, so that \( B_{1} \le \sum _{n = 1}^{T} 2 \sqrt{ \frac{2 \log {T}}{n} } \le 4 \sqrt{ 2 \log {T}} \sqrt{T} \). For term \( B_{2} \) we have \( B_{2} \overset{(a)}{\le } T \cdot \frac{2}{T^{4}} \). Inequality (a) follows from applying Hoeffding’s inequality and from the fact that \( |\mathcal {B}| \le T \).

Putting everything together we obtain \( \mathcal {R}_{T} \le 1+ \frac{\pi ^2}{6}+ 4 \sqrt{ 2 \log {T}} \sqrt{T} + T \cdot \frac{2}{T^{4}}\). Therefore, we conclude that \( \mathcal {R}_{T} \le O(\sqrt{T \log {T}} )\).    \(\square \)

Proposition 2

If Algorithm 2 is run with inputs: \( T \) and \(N = \lceil c_{N} \cdot T^{\frac{2}{3}} \rceil \), then \( \mathcal {R}_{T} \le O(T^{\frac{2}{3}} \sqrt{\log {T}} )\).

Proof

The proof can be found in the Appendix.    \(\square \)

4.2 Strategic Buyers

In this section we show how the algorithms for non-strategic buyers can be converted into algorithms for strategic buyers with the same growth rate (up to constant factors) for the regret. Our proposed approach BUYER-STRAT is presented in Algorithm 3. The main idea behind BUYER-STRAT is to take a base algorithm \(\mathcal {A}_{base}\) for non-strategic buyers (e.g. UCB-NS or TS-NS) and modify it using what we refer to as strategic cycles.

We now describe how Algorithm 3 works. In BUYER-STRAT the buyer makes decisions according to \(\mathcal {A}_{base}\) for the first \( N_{1} \) rounds. Afterwards, in the next \( N_{2} \) rounds, we enter a so-called strategic cycle. In this strategic cycle, the buyer only buys the item if the price is below some threshold, that is, if \(p_{t} \le v^{*} - c_{1} \). Here \( v^{*} \) is an estimate of the unknown mean \( \nu \) and \( 0< c_{1} < 1 \) is a parameter chosen by the buyer (e.g. \(c_{1} = 0.1 \)). The purpose of this strategic cycle is to entice the seller into asking prices that are lower than \( \nu \). After this strategic cycle comes to an end, another strategic cycle of length \( L \) is started with some small probability \( p_{cycle} \). If another strategic cycle has been triggered, a new parameter \( 0< c_{target} < 1 \) is set and only prices \(p_{t} \le v^{*} - c_{target} \) are accepted. If no strategic cycle is triggered, the buyer makes decisions according to \(\mathcal {A}_{base}\) and, in the next round, a strategic cycle of length \( L \) is again started with probability \( p_{cycle} \); this process is repeated.

Algorithm 3 makes use of the functions \( F_{1} \), \( F_{2} \), \( F_{3} \), \( F_{4} \), \( F_{5} \), \( F_{6} \). The intuition behind these functions is as follows. In every strategic cycle, only prices that satisfy \( p_{t} \le v^{*} - c \) are accepted, where \( c \in \mathcal {C}\) for some set \( \mathcal {C} \). The value of \(v^{*} \) is selected using the function \( F_{5}(\cdot ) \), which takes as input a base algorithm \(\mathcal {A}_{base}\). The value \( c \in \mathcal {C}\) is selected using the function \( F_{1}(\cdot ) \), which depends on a counter \( C_{phase} \) of the number of strategic cycles that have passed. Initially, the number of strategic cycles in which values \( c \in \mathcal {C}\) are used is equal to \( N_{phase} \). When \( F_{2}(x) = 1 \), this indicates that the last strategic cycle in which a value \( c \in \mathcal {C} \) is used has just been completed, and the function \( F_{3}(\cdot ) \) is used to collect information about the price trajectory. When \( F_{6}(x) = 1 \), a final value for \( p_{target} \) is chosen (using \( F_{4}(\cdot ) \)) and only prices with \( p_{t} \le p_{target}\) are accepted in all subsequent strategic cycles. In Sect. 5 we discuss these functions in more detail and give specific examples that are used in our experiments.
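The per-round control flow described above can be sketched as follows. This is a simplification of Algorithm 3: a single threshold offset `c` stands in for the schedule over \( c_{1} \) and \( c_{target} \), the \( F_{i} \) bookkeeping is omitted, and all names are ours.

```python
import random

def buyer_strat_decision(t, price, state, base_decide, rng):
    """One round of the strategic-cycle control flow (sketch).
    state holds N1, N2, L, p_cycle, v_star, c and a cycle_left counter."""
    N1, N2, L, p_cycle = state["N1"], state["N2"], state["L"], state["p_cycle"]
    if t <= N1:
        return base_decide(price)             # warm-up: act non-strategically
    if t <= N1 + N2:                          # first (long) strategic cycle
        return price <= state["v_star"] - state["c"]
    if state["cycle_left"] > 0:               # inside a triggered cycle
        state["cycle_left"] -= 1
        return price <= state["v_star"] - state["c"]
    if rng.random() < p_cycle:                # trigger a new cycle of length L
        state["cycle_left"] = L - 1           # current round is the first of it
        return price <= state["v_star"] - state["c"]
    return base_decide(price)                 # otherwise defer to the base algorithm
```

Outside strategic cycles the buyer is indistinguishable from a non-strategic buyer; inside a cycle he rejects every price above the lowered threshold.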

The key parameters to control the regret of Algorithm 3 are the cycle probability \( p_{cycle} \) and the cycle length \( L \). Proposition 3 shows that BUYER-STRAT with \(\mathcal {A}_{base}\) chosen as UCB-NS has regret of order \( O(\sqrt{T \log {T}} )\) if the probability \( p_{cycle} \) and the cycle length \( L \) are carefully chosen. Proposition 4 shows an analogous result for BUYER-STRAT with TS-NS.

Proposition 3

Let \( A_{p} \), \( A_{L} \) and \( A_{N} \) be positive real constants. Assume that Algorithm 3 is run with \(\mathcal {A}_{base}\) chosen as UCB-NS and with inputs: \( T \), \( N_{1} = \lceil T^{\frac{2}{3}} {(\log {T})}^{\frac{1}{2}} \rceil \), \( N_{2} = \lceil A_{N} \sqrt{T \log {T}} \rceil \), \( p_{cycle} = A_{p} T^{ -\frac{1}{2} } \) and \( L = A_{L} \sqrt{\log {T}} \). Then \( \mathcal {R}_{T} \le O( \sqrt{T \log {T}} ) \).

Proof

We will decompose the regret into two parts: the regret incurred in rounds that are part of strategic cycles and the regret incurred in rounds that are not. For an arbitrary subset \( \mathcal {T}^{*} \subseteq \mathcal {T}\), let \( \mathcal {R}_{T, \mathcal {T}^{*} } = \mathbb {E} \left\{ \sum _{t \in \mathcal {T}^{*}} \max \lbrace \nu - p_{t}, 0\rbrace - \sum _{t \in \mathcal {T}^{*}} a_{t} \cdot (\nu - p_{t}) \right\} \) denote the regret restricted to the rounds in \( \mathcal {T}^{*} \). Let \( \mathcal {T}_{S} \subseteq \mathcal {T}\) denote the indices of the rounds that are part of strategic cycles and let \( \mathcal {T}_{NS} = \mathcal {T} \setminus \mathcal {T}_{S} \) denote the indices of the rounds that are not. Then we can write \( \mathcal {R}_{T} = \mathcal {R}_{T, \mathcal {T}_{NS} } + \mathcal {R}_{T, \mathcal {T}_{S} } \).

For \(\mathcal {R}_{T, \mathcal {T}_{S} } \) we have that \( \mathcal {R}_{T, \mathcal {T}_{S} } \le N_{2} + T \cdot p_{cycle} \cdot L\). This follows from the fact that the expected number of triggered strategic cycles (after round \( N_{1} + N_{2} \)) is at most \( T \cdot p_{cycle} \) and the regret in every such cycle is at most \( L \). Furthermore, the first strategic cycle has length \( N_{2} \). For \(\mathcal {R}_{T, \mathcal {T}_{NS} } \) we have that \( \mathcal {R}_{T, \mathcal {T}_{NS} } \le 5+ 4 \sqrt{ 2 \log {T}} \sqrt{T} \). This follows from the fact that \(\mathcal {R}_{T, \mathcal {T}_{NS} } \) represents the regret after \( |\mathcal {T}_{NS}| \le T \) rounds in a problem with horizon \( T \), and by Proposition 1, this quantity is bounded by \( 5+ 4 \sqrt{ 2 \log {T}} \sqrt{T} \). By plugging in the values we get \( \mathcal {R}_{T} =\mathcal {R}_{T, \mathcal {T}_{NS} } + \mathcal {R}_{T, \mathcal {T}_{S} } \le O( \sqrt{T \log {T}} ) \).    \(\square \)
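Plugging the stated parameter choices into this decomposition makes the rate explicit:

$$\begin{aligned} \mathcal {R}_{T} \le N_{2} + T \cdot p_{cycle} \cdot L + 5 + 4 \sqrt{ 2 \log {T}} \sqrt{T} \le \lceil A_{N} \sqrt{T \log {T}} \rceil + A_{p} A_{L} \sqrt{T \log {T}} + 5 + 4 \sqrt{2} \sqrt{T \log {T}} = O( \sqrt{T \log {T}} ). \end{aligned}$$

Here we used \( T \cdot p_{cycle} \cdot L = T \cdot A_{p} T^{-\frac{1}{2}} \cdot A_{L} \sqrt{\log {T}} = A_{p} A_{L} \sqrt{T \log {T}} \).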


Proposition 4

Let \( A_{p} \), \( A_{L} \) and \( A_{N} \) be positive real constants. Assume that Algorithm 3 is run with \(\mathcal {A}_{base}\) chosen as TS-NS and with inputs: \( T \), \( N_{1} = \lceil T^{\frac{2}{3}} {(\log {T})}^{\frac{1}{2}} \rceil \), \( N_{2} = \lceil A_{N} \sqrt{T \log {T}} \rceil \), \( p_{cycle} = A_{p} T^{ -\frac{1}{2} } \) and \( L = A_{L} \sqrt{\log {T}} \). Assume that TS-NS is run with inputs: \( T \) and \(N = \lceil c_{N} \cdot T^{\frac{2}{3}} \rceil \). Then \(\mathcal {R}_{T} \le O( T^{\frac{2}{3}} \sqrt{\log {T}} ) \).

Proof

The proof uses similar arguments as the proof of Proposition 3 and is omitted. A complete proof can be found in the Appendix.    \(\square \)

Remark 5

In order to derive the results of Propositions 3 and 4, we only used the fact that the regret of \(\mathcal {A}_{base}\) is bounded by \( O( \sqrt{T \log {T}} ) \) or \( O( T^{\frac{2}{3}} \sqrt{\log {T}} ) \), respectively. The same proof is also valid for any other base algorithm that satisfies these bounds. Also, the exact choices for the functions \( F_{1} \), \( F_{2} \), \( F_{3} \), \( F_{4} \), \( F_{5} \), \( F_{6} \) do not affect the regret guarantee (in Sect. 5 we discuss these functions in more detail).

In which setting is BUYER-STRAT useful? As the seller does not know \( \mathcal {D} \), it is reasonable to assume (as argued in Sect. 3) that the seller uses a low-regret algorithm to learn how to set prices. Note that many online learning algorithms (e.g. EXP3 and its variants) are weight-based algorithms: at round \( t \), there are weights \( w_{1,t}, \dots , w_{K,t} \) and an action \( k \in \lbrace 1, \dots , K\rbrace \) is chosen with probability \( w_{k, t}/ \sum _{j = 1}^{K} w_{j,t} \). We call an algorithm a pure weight-based algorithm if, in round \( t \), only the weight of the selected action gets updated and if weights can only increase due to positive rewards (note that EXP3 is an example; see the Appendix for a general definition). Proposition 6 shows that, if the seller uses a pure weight-based algorithm, then BUYER-STRAT tends to encourage lower prices by using strategic cycles.
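A pure weight-based seller can be sketched as follows (an EXP3-flavoured illustration, not the exact algorithm from the literature; the learning rate and revenue-based reward are our assumptions). It makes the intuition behind Proposition 6 visible: during a strategic cycle, rejected high prices earn zero reward and their weights stay put, while accepted low prices can only gain weight.

```python
import math
import random

class PureWeightBasedSeller:
    """Sketch of a pure weight-based pricing algorithm: only the selected
    price's weight is updated, and a weight can only increase (reward >= 0)."""
    def __init__(self, prices, eta=0.1, rng=None):
        self.prices = prices
        self.weights = [1.0] * len(prices)
        self.eta = eta
        self.rng = rng or random.Random(0)
    def post_price(self):
        total = sum(self.weights)
        probs = [w / total for w in self.weights]
        self.k = self.rng.choices(range(len(self.prices)), probs)[0]
        return self.prices[self.k]
    def update(self, accepted):
        reward = self.prices[self.k] if accepted else 0.0  # revenue feedback
        if reward > 0:  # importance-weighted multiplicative update, only upward
            prob_k = self.weights[self.k] / sum(self.weights)
            self.weights[self.k] *= math.exp(self.eta * reward / prob_k)
```

If the buyer accepts only prices below some target during a cycle, the probability mass on prices at or below the target cannot decrease, which is exactly the monotonicity that Proposition 6 states.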

Proposition 6

Assume that the buyer uses Algorithm 3, that the seller uses a pure weight-based algorithm and that the price set \( \mathcal {P} \) is finite. Suppose that a strategic cycle with target price \( p_{target} \) runs from round \( t+ 1 \) to round \( t + L \). Then \( \mathbb {P} \left\{ p_{t+L+1} \le p_{target} \right\} \ge \mathbb {P} \left\{ p_{t+1} \le p_{target} \right\} \).

Proof

The proof can be found in the Appendix.    \(\square \)

5 Experiments

In this section we verify the theoretical results that were derived and investigate the effects of strategic behaviour on the regret in different scenarios.

5.1 Setup of Experiments

In the experiments \( v_{t} \) is drawn from a uniform distribution on \( [ a -0.3, a + 0.3] \), where \( a \) is drawn from a uniform distribution on \( [ 0.4, 0.7] \) independently for each run. We consider two settings for the set of prices used by the seller, given by \( \mathcal {P}_{1} \) and \(\mathcal {P}_{2} \): \( \mathcal {P}_{1} = \lbrace a+x \mid x \in \lbrace -0.35, -0.3, -0.25, -0.2, -0.1, -0.05, -0.02, 0.0, 0.1, 0.3 \rbrace \rbrace \), \( \mathcal {P}_{2} = \lbrace a+x \mid x \in \lbrace -0.05, -0.02, 0.0, 0.1, 0.3 \rbrace \rbrace \). We abbreviate these settings as P1 and P2, respectively.

We consider three options for the seller pricing algorithm: (i) the seller chooses a price at random from the price set (RAND seller); (ii) the seller uses the low-regret learning algorithm EXP3.P (EXP3.P seller); (iii) the seller uses the full-information algorithm HEDGE (HEDGE seller). RAND seller is included because it models a situation where the buyer has no influence over the prices. EXP3.P seller is included because it is a bandit algorithm designed for adaptive adversaries and it enjoys high-probability regret bounds [4, 10]. It models a seller that is learning which prices to use based on bandit feedback that is non-stochastic. HEDGE seller is included in order to investigate whether the restriction to bandit feedback has a major impact on the performance of BUYER-STRAT. HEDGE seller is tuned according to Remark 5.17 in [22] and EXP3.P according to Theorem 3.2 in [10].

In the experiments, BUYER-STRAT is tuned with \( N_{1} = \lceil T^{\frac{2}{3}} \log {T} \rceil \), \( N_{2} = \lceil 2 \sqrt{ T \log {T}} \rceil \), \( L = \lfloor 25 \sqrt{\log {T}} \rfloor \), \( p_{cycle} = \frac{5}{\sqrt{ T}} \), \( c_{1} = 0.1 \). We set \( N_{phase} = 4 \cdot N_{3}\), where \( N_{3} = \lceil 0.1 \cdot \sqrt{ T} \rceil \). TS-NS is tuned with \( N = \lceil 0.005 \cdot T^{\frac{2}{3}} \rceil \). We will refer to BUYER-STRAT with \(\mathcal {A}_{base}\) chosen as UCB-NS as UCB-S (Upper Confidence Bound Strategic). Similarly, we will refer to BUYER-STRAT with \(\mathcal {A}_{base}\) chosen as TS-NS as TS-S (Thompson Sampling Strategic). The functions \( F_{1} \), \( F_{2} \), \( F_{3} \), \( F_{4} \), \( F_{5} \), \( F_{6} \) are chosen as follows.

The function \( F_{1}(x) \) selects the value \( c \in \mathcal {C} = \lbrace 0.1, 0.2, 0.3, 0.4, 0.5 \rbrace \) used in the current strategic cycle, moving to the next (larger) value of \( c \) after every \( N_{3} \) strategic cycles.

For \( F_{2}(x) \) we take \( F_{2}(x) = \mathbb {I} \lbrace x \in \lbrace N_{3}, 2 \cdot N_{3}, 3 \cdot N_{3}, 4 \cdot N_{3} \rbrace \rbrace \). The function \( F_{3}(L_{p}) \) takes the last 100 elements added to the input list \( L_{p} \) and calculates the 25-th percentile of these 100 values. The function \( F_{4}(\cdot ) \) is defined as \( F_{4}(L_{target}) = \min \lbrace L_{target} \rbrace + \varepsilon \), i.e., it takes the smallest number in the set \( L_{target} \) and adds a small value to it; in our experiments we use \( \varepsilon = 0.005 \). The function \( F_{5}(\cdot ) \) takes as input a base algorithm and returns the value of \( \bar{v}_{t} \) in the base algorithm. For \( F_{6}(x) \) we take \( F_{6}(x) = \mathbb {I} \lbrace x = 4 \cdot N_{3} \rbrace \).
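Under the descriptions above, \( F_{2} \), \( F_{3} \), \( F_{4} \) and \( F_{6} \) admit direct implementations (a sketch; the exact percentile convention in \( F_{3} \) is our assumption, and \( F_{1} \), \( F_{5} \) are omitted since they depend on the cycle counter and the base algorithm's state):

```python
def F2(x, N3):
    """1 iff the last strategic cycle for the current value of c just ended."""
    return int(x in {N3, 2 * N3, 3 * N3, 4 * N3})

def F3(L_p):
    """25th percentile of the last 100 prices added to the list L_p
    (simple order-statistic rule; the exact convention is illustrative)."""
    last = sorted(L_p[-100:])
    return last[(len(last) * 25) // 100]

def F4(L_target, eps=0.005):
    """Smallest candidate target price plus a small margin eps."""
    return min(L_target) + eps

def F6(x, N3):
    """1 iff all phases are complete and p_target should be fixed."""
    return int(x == 4 * N3)
```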

The intuition behind these choices is as follows. In every strategic cycle, only prices that satisfy \( p_{t} \le v^{*} - c \) are accepted, where \( c \in \mathcal {C} = \lbrace 0.1, 0.2, 0.3, 0.4, 0.5 \rbrace \) and where \( c \) is chosen in increasing order (to try to reduce the price in stages) as the number of strategic cycles increases (this is specified by the function \( F_{1}(\cdot ) \)). Initially, the number of strategic cycles in which every \( c \in \mathcal {C}\) is used, is proportional to \( N_{3} \). When \( F_{2}(x) = 1 \), this indicates that the last strategic cycle in which \( c = x \) has just been completed, and the function \( F_{3}(\cdot ) \) is used to collect information about the price trajectory. When \( F_{6}(x) = 1 \), a final value for \( p_{target} \) is chosen (using \( F_{4}(\cdot ) \)) and this value is used in all subsequent strategic cycles.

We perform 100 independent simulation runs and use three performance metrics to evaluate our algorithms. In each run, we calculate the cumulative regret \( R_{T} = \sum _{t = 1}^{T} \max \lbrace \nu - p_{t}, 0\rbrace - \sum _{t = 1}^{T} a_{t} \cdot (\nu - p_{t}) \), the cumulative utility \( U_{T} = \sum _{t = 1}^{T} a_{t} \cdot (\nu - p_{t}) \) and the scaled cumulative regret \( R^{S}_{T} = R_{T}/\sum _{t = 1}^{T} \max \lbrace \nu - p_{t}, 0\rbrace \). In the experiments we set \( T \in \lbrace 25000, 50000, 75000, 100000, 200000, \dots , 1000000 \rbrace \).

5.2 Results: Non-strategic Buyers vs. Strategic Buyers

Non-strategic Buyers. In Figs. 1 and 4 the cumulative regret is shown for different experimental settings and different values of the problem horizon. Each point in the graph shows the cumulative regret over \( T \) rounds for a problem of horizon \( T \), averaged over 100 simulations. In all figures, the lines indicate the mean and the shaded region indicates a 95% confidence interval. The results indicate that the expected regret indeed grows as a sub-linear function of \( T \) and that this pattern holds for both RAND seller and EXP3.P seller. An interesting finding is that the regret for TS-NS is lower than for UCB-NS: based on the theoretical analysis one would expect the opposite pattern. Figures 3 and 6 show the scaled cumulative regret and provide further evidence that the expected regret is a sub-linear function of the horizon \( T \), as the curves show a monotonically decreasing pattern. Figures 2 and 5 show the cumulative utility against different sellers. Here we observe that the utility tends to be higher if the seller uses \( \mathcal {P}_{1} \), which makes intuitive sense as this price set contains lower prices.

Strategic Buyers. Figures 7, 8, 9, 10, 11 and 12 show the same performance metrics as for the non-strategic buyers. A comparison with Figs. 1 and 4 shows that the level of the expected regret is higher for strategic buyers than for non-strategic buyers. Figures 9 and 12 again indicate that the expected regret is sub-linear in \( T \), as the curves show a monotonically decreasing pattern (from Fig. 7 alone it is hard to tell). Thus, we observe sub-linear regret for both UCB-S and TS-S regardless of the seller algorithm, and this is in line with the theoretical analysis. If we compare the cumulative utility in Figs. 8 and 11 with that in Figs. 2 and 5, then we observe some interesting results. First, when strategic buyers face RAND seller (Fig. 8), the cumulative utility is about 70%–80% of the cumulative utility of non-strategic buyers facing RAND seller (Fig. 2). Second, if the seller uses EXP3.P (i.e., a low-regret learning algorithm), then the cumulative utility for strategic buyers is much higher than the cumulative utility for non-strategic buyers. In scenario P1 utilities are about 2.5–3 times higher and in scenario P2 utilities are about 2 times higher. The results for scenario P2 imply that, even when the lowest price is very close to the unknown mean valuation (absolute distance at most 0.05), it is still beneficial to act strategically. Additional experimental results for a seller using EXP3.S [4] can be found in the Appendix.

Fig. 1. \( R_{T}\) with RAND seller.
Fig. 2. \( U_{T}\) with RAND seller.
Fig. 3. \( R^{S}_{T}\) with RAND seller.
Fig. 4. \( R_{T}\) with EXP3.P seller.
Fig. 5. \( U_{T}\) with EXP3.P seller.
Fig. 6. \( R^{S}_{T}\) with EXP3.P seller.

5.3 Explanation of Differences

In order to study the impact of the quality of the feedback that the seller observes, we give the seller full-information feedback instead of bandit feedback. More specifically, we assume the seller uses the algorithm HEDGE. Figures 13, 14 and 15 show results for TS-S and TS-NS against HEDGE seller. Even with full information the results are qualitatively similar to before: the regret for strategic buyers is sub-linear and the cumulative utility is much higher for strategic buyers. Thus, the results indicate that the feedback type is not the main driver of the observed patterns.

Figures 16 and 17 display the gap \( \nu -p_{t} \) for a problem with horizon \( T = 200000 \), averaged over the 100 simulation runs. If the seller is using a low-regret algorithm to set prices and buyers are non-strategic, then we observe that prices tend to increase towards the mean valuation \( \nu \). This effect is stronger for HEDGE seller than for EXP3.P seller, which is in line with expectations as HEDGE uses full-information feedback. For EXP3.P seller, we see a qualitatively similar pattern for the price sets \( \mathcal {P}_{1} \) and \( \mathcal {P}_{2} \), although the increase in price with \( \mathcal {P}_{2} \) is slightly larger; for HEDGE seller, we hardly see any difference between the price sets. If buyers are strategic then we see the opposite pattern. The algorithms for strategic buyers tend to lower the price over time, and the magnitude of this reduction depends on the price set of the seller (the reduction for \( \mathcal {P}_{1} \) is larger than for \( \mathcal {P}_{2} \)).

Fig. 7. \( R_{T}\) with RAND seller.
Fig. 8. \( U_{T}\) with RAND seller.
Fig. 9. \( R^{S}_{T}\) with RAND seller.
Fig. 10. \( R_{T}\) with EXP3.P seller.
Fig. 11. \( U_{T}\) with EXP3.P seller.
Fig. 12. \( R^{S}_{T}\) with EXP3.P seller.

Fig. 13. \( R_{T}\) with HEDGE seller.
Fig. 14. \( U_{T}\) with HEDGE seller.
Fig. 15. \( R^{S}_{T}\) with HEDGE seller.
Fig. 16. \(\nu -p_{t}\) with EXP3.P seller.
Fig. 17. \(\nu -p_{t}\) with HEDGE seller.

However, even with price set \( \mathcal {P}_{2} \) where the lowest prices are very close to \( \nu \), strategic behaviour is beneficial and strategic buyers can induce prices that are almost twice as far from \( \nu \).

6 Conclusion

In this paper we study repeated posted-price auctions with a single seller from the perspective of a utility-maximizing buyer that does not know the distribution of his valuation. Previous work has focused only on the seller side and does not study how buyers should make decisions; in this paper, we address this gap in the literature. We study two types of buyers (strategic and non-strategic) and derive sub-linear regret bounds that hold for all possible sequences of observed prices. Our algorithms are based on ideas from UCB-type bandit algorithms and Thompson Sampling. Our experiments show that, if the seller is using a low-regret learning algorithm based on weight updating, then strategic buyers can obtain much higher utilities than non-strategic buyers.

In practice, buyers have limited budgets for purchasing items. One direction for future work is to investigate the impact of this on the problem. In particular, it would be interesting to analyze how a budget constraint would affect the regret guarantees derived in this paper and whether budget constraints make it easier or harder to engage in strategic behavior.