1 Introduction

In this chapter, we consider learning and pricing problems with inventory constraints: given an initial inventory of one or multiple products and a finite selling season, a seller must choose prices dynamically to maximize revenue over the course of the season. Inventory constraints are prevalent in many business settings. For most goods and services, inventory is limited due to supply constraints, sellers’ budget constraints, or limited storage space. Therefore, one must consider the impact of inventory constraints when learning demand functions and setting prices.

Dynamic pricing with inventory constraints has been extensively studied in the revenue management literature, often under the additional assumption that the demand function (i.e., the relationship between demand and price) is known to the seller prior to the selling season. However, when the demand function is unknown, the seller faces a trade-off commonly referred to as the exploration–exploitation trade-off. Toward the beginning of the selling season, the seller may offer different prices to try to learn and estimate the demand rate at each price (“exploration” objective). Over time, the seller can use these demand rate estimates to set prices that maximize revenue throughout the remainder of the selling season (“exploitation” objective). With limited inventory, pursuing the exploration objective comes at the cost of not only lowering revenue but also diminishing valuable inventory. Simply put, if inventory is depleted while exploring different prices, there is no inventory left to exploit the knowledge gained.

In this chapter, we will study how one should design learning algorithms in the presence of inventory constraints. Specifically, we will study how one can overcome the additional challenges brought by limited inventory and still design efficient algorithms for learning demand functions with regret guarantees. In what follows, we first discuss the simplest case in this setting in Sect. 5.2, i.e., the learning and pricing problem of a single product with an inventory constraint. Then, in Sect. 5.3, we discuss the problem of learning and pricing with multiple products under inventory constraints. Finally, in Sect. 5.4, we consider a Bayesian learning setting with inventory constraints. In each of these sections, we describe the model and its challenges and then present the algorithms and analysis for the respective problems. In Sect. 5.5, we present concluding remarks and some further readings for this chapter.

2 Single Product Case

In this section, we consider the problem of a monopolist selling a single product over a finite selling season of length T. We assume that the seller has a fixed inventory x at the beginning of the season and that no replenishment can be made during the season. During the selling season, customers arrive according to a Poisson process with an instantaneous demand rate λt at time t. In our model, we assume that λt depends solely on the price the seller offers at time t. That is, we can write λt = λ(p(t)), where p(t) is the price offered at time t. Sales are terminated at time T, and there is no salvage value for the remaining items.
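
To make the arrival model concrete, the following minimal sketch simulates sales under a piecewise-constant price path. The exponential demand curve and all numbers are hypothetical choices for illustration, and truncating a period’s sales at the remaining inventory is a simplification of the exact censored process.

```python
import numpy as np

def simulate_sales(prices, dt, lam, x, rng):
    """Simulate Poisson sales under a piecewise-constant price path.

    prices : sequence of prices, one per review period of length dt
    lam    : demand-rate function lambda(p)
    x      : initial inventory (sales stop once it is depleted)
    """
    inventory, revenue = x, 0.0
    for p in prices:
        # Arrivals during a period of length dt are Poisson(lambda(p) * dt).
        sales = min(rng.poisson(lam(p) * dt), inventory)
        revenue += p * sales
        inventory -= sales
        if inventory == 0:
            break
    return revenue, inventory

rng = np.random.default_rng(0)
lam = lambda p: 2.0 * np.exp(-p)                 # hypothetical demand curve
rev, left = simulate_sales([1.0] * 100, dt=0.01, lam=lam, x=5, rng=rng)
```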

In our model, we assume that the set of feasible prices is an interval \([ \underline {p},\overline {p}]\) with an additional cut-off price \(p_\infty\) such that \(\lambda(p_\infty) = 0\). The demand rate function λ(p) is assumed to be strictly decreasing in p and has an inverse function p = γ(λ). We define a revenue rate function r(λ) = λγ(λ), which captures the expected revenue when the price is chosen such that the demand rate is λ. We further assume r(λ) is concave in λ. These assumptions on the demand function are quite standard; such demand functions are called regular in the revenue management literature (Gallego and van Ryzin, 1994).

In addition to the above, we make the following assumptions on the demand rate function λ(p) and the revenue rate function r(λ):

Assumption 1

For some positive constants M, K, \(m_L\), and \(m_U\),

  1.

    Boundedness: |λ(p)|≤ M for all \(p\in [ \underline {p},\overline {p}]\).

  2.

    Lipschitz continuity: λ(p) and r(λ(p)) are Lipschitz continuous with respect to p with factor K. Also, the inverse demand function p = γ(λ) is Lipschitz continuous in λ with factor K.

  3.

    Strict concavity and differentiability: r″(λ) exists and \(-m_L \le r''(\lambda) \le -m_U < 0\) for all λ in the range of λ(p) for \(p\in [ \underline {p},\overline {p}]\).

In the following, we use Γ =  Γ(M, K, \(m_L\), \(m_U\)) to denote the set of demand functions satisfying the above assumptions with the corresponding coefficients. In our model, the seller does not know the true demand function λ. The only knowledge the seller has is that the demand function belongs to Γ. Note that Γ does not need to have any parametric representation. We note that Assumption 1 is quite mild, and it is satisfied by many commonly used demand function classes including linear, exponential, and logit demand functions.

To evaluate the performance of any pricing algorithm, we adopt the minimax regret objective. We call a pricing policy π = (p(t) : 0 ≤ t ≤ T) admissible if (1) it is a non-anticipating price process that is defined on \([ \underline {p},\overline {p}]\cup \{p_\infty \}\) and (2) it satisfies the inventory constraint, that is, \(\int _0^TdN^\pi (s)\le x, \mbox{with probability }1\), where \(N^\pi (t)=N\left (\int _{0}^t\lambda (p(s))ds\right )\) denotes the cumulative demand up to time t using policy π.

We denote the set of admissible pricing policies by \(\mathcal {P}\). We define the expected revenue generated by a policy π by

$$\displaystyle \begin{aligned} J^\pi(x,T;\lambda) = E\left[\int_{0}^Tp(s)dN^\pi(s)\right]. \end{aligned} $$
(5.1)

Given a demand rate function λ, there exists an optimal admissible policy π∗ that maximizes (5.1). In our model, since we do not know λ in advance, we seek \(\pi \in \mathcal {P}\) that performs as closely to π∗ as possible.

However, even if the demand function λ were known, computing the expected value of the optimal policy is computationally prohibitive: it involves solving a continuous-time dynamic program exactly. Fortunately, as shown in Gallego and van Ryzin (1994), there exists an upper bound on the expected value of any policy based on the following deterministic optimization problem:

$$\displaystyle \begin{aligned} \begin{array}{rcl}{} J^D(x,T;\lambda ) = & \sup &\displaystyle \int_0^Tr(\lambda(p(s))) ds \\ & \mbox{s.t.} &\displaystyle \int_0^T \lambda(p(s)) ds \le x \\ & &\displaystyle p(s)\in[\underline{p},\overline{p}]\cup \{p_\infty\}, \quad \forall s\in [0,T]. \end{array} \end{aligned} $$
(5.2)

Gallego and van Ryzin (1994) showed that JD(x, T;λ) provides an upper bound on the expected revenue generated by any admissible pricing policy π, that is, Jπ(x, T;λ) ≤ JD(x, T;λ), for all λ ∈ Γ and \(\pi \in \mathcal {P}\). With this upper bound, we define the regret Rπ(x, T;λ) for any given demand function λ ∈ Γ and policy \(\pi \in \mathcal {P}\) by

$$\displaystyle \begin{aligned} R^\pi(x,T;\lambda) = 1 - \frac{J^\pi(x,T;\lambda)}{J^D(x,T;\lambda)}. \end{aligned} $$
(5.3)

Clearly, Rπ(x, T;λ) is always greater than 0. Furthermore, the smaller Rπ(x, T;λ) is, the closer the performance of π is to that of the optimal policy. However, since the decision-maker does not know the true demand function, it is desirable to have a pricing policy π that achieves a small regret across all possible underlying demand functions λ ∈ Γ. To capture this, we consider the worst-case regret: the decision-maker chooses a pricing policy π, nature picks the worst possible demand function for that policy, and our goal is to minimize the worst-case regret:

$$\displaystyle \begin{aligned} \begin{array}{rcl}{} \inf_{\pi\in \mathcal{P}}\sup_{\lambda\in\Gamma}R^\pi(x,T;\lambda). \end{array} \end{aligned} $$
(5.4)

Unfortunately, it is hard to evaluate (5.4) for any finite-size problem. In order to obtain theoretical guarantees for proposed policies, we adopt a widely used asymptotic performance analysis. In particular, we consider a regime in which both the size of the initial inventory and the demand rate grow proportionally. Specifically, in a problem with size n, the initial inventory and the demand function are given by

$$\displaystyle \begin{aligned} \begin{array}{rcl} x_n = nx \mbox{ and } \lambda_n(\cdot) = n \lambda(\cdot). \end{array} \end{aligned} $$

Define \(J^D_n(x,T;\lambda ) = J^D(nx, T; n\lambda ) = nJ^D(x, T; \lambda )\) to be the optimal value of the deterministic problem with size n and \(J^\pi _n(x,T;\lambda ) = J^\pi (nx, T; n\lambda )\) to be the expected value of a pricing policy π when it is applied to a problem with size n. The regret for the size-n problem \(R^\pi _n(x,T;\lambda )\) is therefore

$$\displaystyle \begin{aligned} R^\pi_n(x,T;\lambda) = 1 - \frac{J^\pi_n(x,T;\lambda)}{J^D_n(x,T;\lambda)}, \end{aligned}$$

and our objective is to study the asymptotic behavior of \(R^\pi _n(x,T;\lambda )\) as n grows large and design an algorithm with small asymptotic regret.

2.1 Dynamic Pricing Algorithm

In this section, we introduce a dynamic pricing algorithm that achieves near-optimal asymptotic regret for the aforementioned problem. We first consider the full-information deterministic problem (5.2). As shown in Besbes and Zeevi (2009), the optimal solution to (5.2) is given by

$$\displaystyle \begin{aligned} p(t) = p^D = \max\{p^u,p^c\} \end{aligned} $$
(5.5)

where

$$\displaystyle \begin{aligned} p^u = \mbox{arg}\max_{p\in[\underline{p},\overline{p}]}\{r(\lambda(p))\}, \quad p^c = \mbox{arg}\min_{p\in[\underline{p},\overline{p}]}\left|\lambda(p)-\frac{x}{T}\right|. \end{aligned} $$
(5.6)
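
For intuition, pu, pc, and hence pD in (5.5)–(5.6) can be computed numerically for any given demand curve. The linear curve below is a hypothetical example, not part of the model:

```python
import numpy as np
from scipy.optimize import minimize_scalar

p_lo, p_hi = 0.5, 1.5          # feasible price interval
x, T = 0.4, 1.0                # initial inventory and season length
lam = lambda p: 1.0 - 0.5 * p  # hypothetical linear demand rate

# p^u maximizes the revenue rate r(lambda(p)) = p * lambda(p).
res = minimize_scalar(lambda p: -p * lam(p), bounds=(p_lo, p_hi), method="bounded")
p_u = res.x

# p^c makes the demand rate match the inventory consumption rate x / T.
res = minimize_scalar(lambda p: abs(lam(p) - x / T), bounds=(p_lo, p_hi), method="bounded")
p_c = res.x

p_D = max(p_u, p_c)            # optimal deterministic price, cf. (5.5)
```

Here λ(p) = 1 − 0.5p with x∕T = 0.4 gives pu = 1 and pc = 1.2, so the inventory constraint binds and pD = pc.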

The following important lemma is proved in Gallego and van Ryzin (1994).

Lemma 1

Let pD be the optimal deterministic price when the underlying demand function is λ. Let πD be the pricing policy that uses the deterministic optimal price pD throughout the selling season until there is no inventory left. Then, \(R_n^{\pi ^D}(x,T;\lambda ) = O(n^{-1/2})\).

Lemma 1 states that if one knows pD in advance, then simply applying this price throughout the entire time horizon achieves asymptotically optimal performance. Therefore, the idea of our algorithm is to efficiently find an estimate of pD that is close enough to the true one, using the empirical observations at hand. In particular, under Assumption 1, we know that if pD = pu > pc, then

$$\displaystyle \begin{aligned} \begin{array}{rcl}{} \left|r(p)-r(p^D)\right|\le \frac{1}{2}m_L(p-p^D)^2 \end{array} \end{aligned} $$
(5.7)

for p close to pD, while if pD = pc ≥ pu, then

$$\displaystyle \begin{aligned} \begin{array}{rcl}{} \left|r(p)-r(p^D)\right| \le K\left|p-p^D\right| \end{array} \end{aligned} $$
(5.8)

for p close to pD. In the following discussion, without loss of generality, we assume \(p^D\in ( \underline {p}, \overline {p})\). Note that this can always be achieved by choosing a sufficiently large interval \([ \underline {p}, \overline {p}]\).

We now state the main result in this section. We use the notation f(n) = O∗(g(n)) to denote that there exist constants C and k such that \(f(n)\le C\cdot g(n)\cdot \log ^k{n}\).

Theorem 1

Let Assumption 1 hold for Γ =  Γ(M, K, mL, mU). Then, there exists an admissible policy π generated by Algorithm 1, such that for all n ≥ 1,

$$\displaystyle \begin{aligned} \begin{array}{rcl} \sup_{\lambda\in \Gamma}R_n^{\pi}(x,T;\lambda) = O^*\left(n^{-1/2}\right). \end{array} \end{aligned} $$

A corollary of Theorem 1 follows from the relationship between the nonparametric model and the parametric one:

Corollary 1

Assume that Γ is a parameterized demand function family satisfying Assumption 1. Then, there exists an admissible policy π generated by Algorithm 1, such that for all n ≥ 1,

$$\displaystyle \begin{aligned} \begin{array}{rcl} \sup_{\lambda\in \Gamma}R_n^{\pi}(x,T;\lambda) = O^*\left(n^{-1/2}\right). \end{array} \end{aligned} $$

Now, we explain the meaning of Theorem 1 and Corollary 1. First, as the matching lower bound in Theorem 2 will show, the result in Theorem 1 is the best asymptotic regret that one can achieve in this setting. Another consequence of our result is that there is no performance gap between the parametric and nonparametric settings in the asymptotic sense, implying that the value of knowing the parametric form of the demand function is marginal in this problem when the best algorithm is adopted. In this sense, our algorithm could save firms the effort of searching for the right parametric form of the demand function.

Now, we describe the dynamic pricing algorithm. As mentioned earlier, we aim to learn pD through price experimentations. Specifically, the algorithm will be able to distinguish whether pu or pc is optimal. Meanwhile, it keeps a shrinking interval containing the optimal price with high probability until a certain accuracy is achieved.

Algorithm 1 Dynamic pricing algorithm (DPA)

Now, we explain the ideas behind Algorithm 1. In the algorithm, we divide the selling season into a carefully selected set of time periods. In each time period, we test a set of prices within a certain price interval. Based on the empirical observations, we then shrink the price interval to a smaller subinterval that still contains the optimal price with high probability and enter the next time period with a smaller price range. We repeat this shrinking procedure until the price interval is small enough that the desired accuracy is achieved.

Recall that the optimal deterministic price pD is the maximum of pu and pc, where pu and pc are solved from (5.6). As shown in (5.7) and (5.8), the local behavior of the revenue rate function is quite different around pu and pc: the former resembles a quadratic function, while the latter resembles a linear function (an important feature due to the inventory constraint). This difference requires us to use different shrinking strategies for the cases pu > pc and pc > pu, which is why the algorithm has two learning steps (Steps 2 and 3). Specifically, in Step 2, the algorithm shrinks the price interval until either a transition condition (5.10) is triggered or the learning phase terminates. We show that when the transition condition (5.10) is triggered, with high probability the optimal solution to the deterministic problem is pc. Otherwise, if learning terminates before the condition is triggered, then pu is either the optimal solution to the deterministic problem or close enough to it that using pu yields near-optimal revenue. When (5.10) is triggered, we switch to Step 3, in which we use a new set of shrinking and price-testing parameters. Note that in Step 3, we restart from the initial price interval rather than the current interval; this is not necessary but simplifies the analysis. Both Step 2 and Step 3 (if invoked) terminate in a finite number of iterations.

At the end of the algorithm, a fixed price is used for the remaining selling season (Step 4) until the inventory runs out. In fact, instead of applying a fixed price in Step 4, one may continue learning using our shrinking strategy; however, doing so does not further improve the asymptotic performance of our algorithm.
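
Before stating the exact schedules, the following stylized sketch shows the grid-test-shrink skeleton that Steps 2 and 3 share. It ignores the inventory side and the transition test, and the demand model and the toy schedule are hypothetical stand-ins for the parameters derived below.

```python
import numpy as np

def shrink_once(p_lo, p_hi, kappa, tau, n, observe):
    """One learning period: test a grid of kappa prices for total time tau,
    then return a smaller interval centered at the empirical best price."""
    grid = np.linspace(p_lo, p_hi, kappa)
    dt = tau / kappa                            # equal testing time per price
    lam_hat = np.array([observe(p, dt) / (n * dt) for p in grid])
    rev_hat = grid * lam_hat                    # empirical revenue rates
    p_best = grid[np.argmax(rev_hat)]
    # Inflate the grid granularity by log(n) so the interval keeps the
    # optimal price with high probability (cf. relation (5.15) below).
    half = np.log(n) * (p_hi - p_lo) / kappa
    return max(p_lo, p_best - half), min(p_hi, p_best + half)

n = 100                                          # problem size
rng = np.random.default_rng(1)
observe = lambda p, dt: rng.poisson(n * (1.0 - 0.5 * p) * dt)  # hypothetical
interval = (0.5, 1.5)
for kappa, tau in [(10, 0.05), (10, 0.1), (10, 0.2)]:          # toy schedule
    interval = shrink_once(*interval, kappa, tau, n, observe)
```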

In the following, we define \(\tau _i^u, \kappa _i^u, N^u, \tau _i^c, \kappa _i^c\), and \(N^c\). Without loss of generality, we assume T = 1 and \(\overline {p}- \underline {p} = 1\) in the following discussion. We first provide a set of relations we want \((\tau _i^u,\kappa _i^u)\) and \((\tau _i^c, \kappa _i^c)\) to satisfy. Then, we explain the meaning of each relation and derive a set of parameters satisfying these relations. We use the notation f ∼ g to mean that f and g are of the same order in n.

The relations that we want \((\tau _i^u,\kappa _i^u)_{i=1}^{N^u}\) to satisfy are

$$\displaystyle \begin{aligned} \left(\frac{\overline{p}_i^u-\underline{p}_i^u}{\kappa_i^u}\right)^2\sim\sqrt{\frac{\kappa_i^u}{n\tau_i^u}},\quad \forall i = 2,\ldots,N^u, \end{aligned} $$
(5.14)
$$\displaystyle \begin{aligned} \overline{p}_{i+1}^u-\underline{p}_{i+1}^u \sim \log{n}\cdot\frac{\overline{p}_i^u-\underline{p}_i^u}{\kappa_i^u}, \quad \forall i = 1,\ldots,N^u-1, \end{aligned} $$
(5.15)
$$\displaystyle \begin{aligned} \tau_{i+1}^u\cdot \left(\frac{\overline{p}_i^u - \underline{p}_i^u}{\kappa_i^u}\right)^2\cdot\sqrt{\log{n}} \sim \tau_1^u, \quad \forall i = 1,\ldots,N^u-1. \end{aligned} $$
(5.16)

Also, we define

$$\displaystyle \begin{aligned} N^u = \min_{l}\left\{l:\left(\frac{\overline{p}_l^u - \underline{p}_l^u}{\kappa_l^u}\right)^2\sqrt{\log{n}}<\tau_1^u\right\}. \end{aligned} $$
(5.17)

Next, we state the set of relations we want \((\tau _i^c,\kappa _i^c)_{i=1}^{N^c}\) to satisfy

$$\displaystyle \begin{aligned} \frac{\overline{p}_i^c-\underline{p}_i^c}{\kappa_i^c}\sim\sqrt{\frac{\kappa_i^c}{n\tau_i^c}},\quad \forall i = 2,\ldots,N^c, \end{aligned} $$
(5.18)
$$\displaystyle \begin{aligned} \overline{p}_{i+1}^c-\underline{p}_{i+1}^c \sim \log{n}\cdot\frac{\overline{p}_i^c-\underline{p}_i^c}{\kappa_i^c}, \quad \forall i = 1,\ldots,N^c-1, \end{aligned} $$
(5.19)
$$\displaystyle \begin{aligned} \tau_{i+1}^c \cdot \frac{\overline{p}_i^c - \underline{p}_i^c}{\kappa_i^c}\cdot\sqrt{\log{n}} \sim \tau_1^c,\quad \forall i = 1,\ldots,N^c-1. \end{aligned} $$
(5.20)

Also, we define

$$\displaystyle \begin{aligned} N^c = \min_{l}\left\{l:\frac{\overline{p}_l^c - \underline{p}_l^c}{\kappa_l^c}\sqrt{\log{n}}<\tau_1^c\right\}. \end{aligned} $$
(5.21)

To understand the above relations, it is useful to examine the sources of revenue loss in our algorithm. First, there is an exploration loss in each period: the prices tested are not optimal, resulting in a suboptimal revenue rate or a suboptimal inventory consumption rate. The magnitude of this loss in each period is roughly the deviation of the revenue rate (or the inventory consumption rate) multiplied by the length of the period. Second, there is a deterministic loss due to the limited learning capacity: we only test a grid of prices in each period and may never use the exact optimal price. Third, since the demand follows a stochastic process, the observed demand rate may deviate from the true underlying demand rate, resulting in a stochastic loss. These three losses also arise in the learning algorithm proposed in Besbes and Zeevi (2009). In dynamic learning, however, the loss in one period does not simply appear once; it may affect all future periods. The design of our algorithm balances these losses in each step to achieve the maximum efficiency of learning. With this in mind, we explain the meaning of each relation above in the following:

  • The first relation (5.14) ((5.18), respectively) balances the deterministic loss induced by only considering the grid points (the grid granularity is \(\frac {\overline {p}_i^u- \underline {p}_i^u}{\kappa _i^u}\) \((\frac {\overline {p}_i^c- \underline {p}_i^c}{\kappa _i^c}\), resp.)) and the stochastic loss induced in the learning period which will be shown to be \(\sqrt {\frac {\kappa _i^u}{n\tau _i^u}}\) (\(\sqrt {\frac {\kappa _i^c}{n\tau _i^c}}\), respectively). Due to the relation in (5.7) and (5.8), the loss is quadratic in the price granularity in Step 2 and linear in Step 3.

  • The second relation (5.15) ((5.19), respectively) makes sure that with high probability, the price intervals \(I_i^u\) (\(I_i^c\), respectively) contain the optimal price pD. This is necessary, since otherwise a constant loss will be incurred in all periods afterward.

  • The third relation (5.16) ((5.20), respectively) bounds the exploration loss in each learning period. This is done by considering the product of the revenue rate deviation (also the demand rate deviation) and the length of the learning period, which in our case can be upper bounded by \(\tau _{i+1}^u\sqrt {\log {n}}\cdot \left ({\frac {\overline {p}_i^u - \underline {p}_i^u}{\kappa _i^u}}\right )^2\) (\(\tau _{i+1}^c\sqrt {\log {n}}\cdot {\frac {\overline {p}_i^c - \underline {p}_i^c}{\kappa _i^c}}\), respectively). We want this loss to be of the same order in each learning period (and equal to the loss in the first learning period, which is of order \(\tau _1\)) to achieve the maximum efficiency of learning.

  • Formula (5.17) ((5.21), respectively) determines when the price we obtain is close enough to optimal such that we can apply this price in the remaining selling season. We show that \(\sqrt {\log {n}}\cdot \left ({\frac {\overline {p}_l^u - \underline {p}_l^u}{\kappa _l^u}}\right )^2\) (\(\sqrt {\log {n}}\cdot {\frac {\overline {p}_l^c - \underline {p}_l^c}{\kappa _l^c}}\), respectively) is an upper bound of the revenue rate and demand rate deviations of price \(\hat {p}_l\). When this is less than τ1, we can simply apply \(\hat {p}_l\) and the loss will not exceed the loss of the first learning period.

Now, we solve the relations (5.14)–(5.16) and obtain a set of parameters that satisfy them:

$$\displaystyle \begin{aligned} \begin{array}{rcl}{} \tau_1^u = n^{-\frac{1}{2}}\cdot{(\log{n})}^{3.5}\mbox{ and } \tau_i^u = n^{-\frac{1}{2}\cdot(\frac{3}{5})^{i-1}}\cdot{(\log{n})}^{5},\,\forall i = 2,\ldots,N^u, \end{array} \end{aligned} $$
(5.22)
$$\displaystyle \begin{aligned} \begin{array}{rcl}{} \kappa_i^u = n^{\frac{1}{10}\cdot(\frac{3}{5})^{i-1}}\cdot\log{n},\quad \forall i = 1,2,\ldots,N^u. \end{array} \end{aligned} $$
(5.23)

As a by-product, we have

$$\displaystyle \begin{aligned} \begin{array}{rcl}{} \overline{p}_i^u - \underline{p}_i^u = n^{-\frac{1}{4}(1-(\frac{3}{5})^{i-1})},\quad \forall i = 1,2,\ldots,N^u. \end{array} \end{aligned} $$
(5.24)

Similarly, we solve the relations (5.18)–(5.20) and obtain a set of parameters that satisfy them:

$$\displaystyle \begin{aligned} \begin{array}{rcl}{} \tau_1^c = n^{-\frac{1}{2}}\cdot{(\log{n})}^{2.5} \mbox{ and } \tau_i^c = n^{-\frac{1}{2}\cdot(\frac{2}{3})^{i-1}}\cdot{(\log{n})}^{3},\,\forall i = 2,\ldots,N^c, \end{array} \end{aligned} $$
(5.25)
$$\displaystyle \begin{aligned} \begin{array}{rcl}{} \kappa_i^c = n^{\frac{1}{6}\cdot(\frac{2}{3})^{i-1}}\cdot\log{n},\quad \forall i = 1,2,\ldots,N^c \end{array} \end{aligned} $$
(5.26)

and

$$\displaystyle \begin{aligned} \begin{array}{rcl}{} \overline{p}_i^c - \underline{p}_i^c = n^{-\frac{1}{2}(1-(\frac{2}{3})^{i-1})},\quad \forall i = 1,\ldots,N^c. \end{array} \end{aligned} $$
(5.27)

Note that by (5.24) and (5.27), the price intervals defined in our algorithm indeed shrink in each iteration.
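
As a quick illustration, the Step 2 schedule can be tabulated directly from (5.22)–(5.24). The interval width in the last column decreases toward n−1∕4 as i grows, which by the quadratic local behavior (5.7) corresponds to a revenue-rate loss of order n−1∕2:

```python
import numpy as np

n = 10**8
logn = np.log(n)
for i in range(1, 6):
    tau = n**-0.5 * logn**3.5 if i == 1 else n**(-0.5 * 0.6**(i - 1)) * logn**5
    kappa = n**(0.1 * 0.6**(i - 1)) * logn          # from (5.23)
    width = n**(-0.25 * (1.0 - 0.6**(i - 1)))       # from (5.24)
    print(f"i={i}: tau_u={tau:.2e}  kappa_u={kappa:.0f}  width={width:.2e}")
```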

2.2 Lower Bound Example

In the last section, we proposed a dynamic pricing algorithm and proved an upper bound of O∗(n−1∕2) on its regret in Theorem 1. In this section, we show that there exists a class of demand functions satisfying our assumptions such that no pricing policy can achieve an asymptotic regret of order smaller than n−1∕2. This lower bound example provides clear evidence that the upper bound is tight (up to logarithmic factors). Therefore, our algorithm achieves nearly the best performance among all possible algorithms and closes the performance gap for this problem. Because our algorithm and the lower bound example apply to both parametric and nonparametric settings, this also closes the gap for the problem with a known parametric demand function.

Theorem 2 (Lower Bound Example)

Let λ(p;z) = 1∕2 + z − zp, where z is a parameter taking values in Z = [1∕3, 2∕3] (we denote this demand function set by Λ). Let \( \underline {p} = 1/2\), \(\overline {p} = 3/2\), x = 2, and T = 1. Then, we have the following:

  • This class of demand functions satisfies Assumption 1. Furthermore, for any z ∈ [1∕3, 2∕3], the optimal price pD always equals pu, and pD ∈ [7∕8, 5∕4].

  • For any admissible pricing policy π and all n ≥ 1,

    $$\displaystyle \begin{aligned} \begin{array}{rcl} \sup_{z\in Z}R_n^\pi(x,T;z)\ge \frac{1}{12(48)^2\sqrt{n}}. \end{array} \end{aligned} $$

We first explain some intuition behind this example. Note that all the demand functions in Λ cross at one common point: when p = 1, λ(p;z) = 1∕2. Such a price is called an uninformative price in Broder and Rusmevichientong (2012). When there exists an uninformative price, experimenting at that price yields no information about the demand function. Therefore, in order to learn the demand function (i.e., the parameter z) and determine the optimal price, one must perform some price experiments at prices away from the uninformative price; on the other hand, when the optimal price is indeed the uninformative price, experimenting away from the optimal price incurs revenue losses. This tension is the key driver of the lower bound, and mathematically it is captured by statistical bounds on hypothesis testing. For the proof of Theorem 2, we refer the reader to Wang et al. (2014).
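
The structure of the example is also easy to verify numerically. The sketch below checks that every curve in Λ passes through the uninformative price p = 1 and that pu (which equals pD here) stays in [7∕8, 5∕4]:

```python
import numpy as np

for z in np.linspace(1 / 3, 2 / 3, 101):
    lam = lambda p: 0.5 + z - z * p
    assert abs(lam(1.0) - 0.5) < 1e-12     # p = 1 is the uninformative price
    # r(p; z) = p (1/2 + z - z p) is maximized at p^u = (1/2 + z) / (2 z),
    # and x / T = 2 exceeds every demand rate, so p^c < p^u and p^D = p^u.
    p_u = (0.5 + z) / (2 * z)
    assert 7 / 8 - 1e-9 <= p_u <= 5 / 4 + 1e-9
```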

3 Multiproduct Setting

In this section, we consider a multiple-product, multiple-resource generalization of the problem introduced in the previous section. This more general problem, also known as the price-based Network Revenue Management (NRM) problem with learning, considers a setting in which a seller sells n types of products to incoming customers, each product being made up of a subset of m types of resources, during a finite selling season consisting of T decision periods. Denote by \(A = [A_{ij}] \in \mathbb {R}_+^{m \times n}\) the resource consumption matrix, which indicates that a single unit of product j requires Aij units of resource i. Denote by \(C \in \mathbb {R}_+^m\) the vector of initial capacity levels of all resources at the beginning of the selling season; the capacities cannot be replenished and have zero salvage value at the end of the season. At the beginning of period t ∈ [T], the seller first decides the price pt = (pt,1;…;pt,n) for his products, where pt is chosen from a convex and compact set \(\mathcal {P} = \otimes _{l=1}^n [ \underline {p}_l,\bar {p}_l] \subseteq \mathbb {R}^n\) of feasible price vectors. Let \(D_t(p_t) = (D_{t,1}(p_t); \dots ; D_{t,n}(p_t)) \in \mathcal {D}:= \{(d_1;\dots ;d_n) \in \{0,1\}^n: \sum _{i=1}^n d_i \leq 1\}\) denote the vector of realized demand in period t under price pt. For simplicity, we assume that at most one sale of one of the products occurs in each period. We assume that the purchase probability vector for all products under any price pt, i.e., λ∗(pt) := E [Dt(pt)], is unknown to the seller, and this relationship λ∗(.), also known as the demand function, needs to be estimated from the data the seller observes during the finite selling season. Define the revenue function r∗(p) := p ⋅ λ∗(p) to be the one-period expected revenue that the seller can earn under price p. It is typically assumed in the literature that λ∗(.) is invertible (see the regularity assumptions below). By abuse of notation, we can then write r∗(p) = p ⋅ λ∗(p) = λ ⋅ p∗(λ) = r∗(λ) to emphasize the dependence of revenue on the demand rate instead of the price. We make the following regularity assumptions about λ∗(.) and r∗(.), which can be viewed as multidimensional counterparts of Assumption 1.

Regularity Assumptions

  R1.

    λ∗(.) is twice continuously differentiable, and it has an inverse function p∗(.) that is also twice continuously differentiable.

  R2.

    There exists a set of turnoff prices \(p^{\infty }_j \in \mathbb {R}_+ \cup \{+\infty \}\) for j = 1, …, n such that for any p = (p1;…;pn), \(p_j = p_{j}^{\infty }\) implies that \(\lambda _j^*(p)=0\).

  R3.

    r∗(.) is bounded and strongly concave in λ.

Compared to the single-product setting, the NRM setting poses two challenges. First, the clean solution structure of the single-product setting breaks down in the presence of multiple types of products and resources. Second, the approach of estimating the deterministic optimal prices and then applying the learned prices may not suffice for a tight regret bound, since attaining the same estimation error for the deterministic optimal prices in a multidimensional setting requires significantly more observations, which in turn affects the best achievable regret bound of this approach. The goal of this section is twofold. First, we introduce two settings of NRM where the demand function possesses additional structural properties: the parametric setting, where the demand function comes from a family of functions parameterized by a finite number of parameters, and the nonparametric setting, where the demand function is sufficiently smooth. Second, we introduce an adaptive exploitation pricing scheme that helps achieve tight regret bounds in both settings. In the remainder of this section, after introducing some additional preliminary results in Sect. 5.3.1, we investigate the parametric setting in Sect. 5.3.2 and then the nonparametric setting in Sect. 5.3.3.

3.1 Preliminaries

Let D1:t := (D1, D2, …, Dt) denote the history of the demand realized up to and including period t. Let \(\mathcal {H}_t\) denote the σ-field generated by D1:t. We define a control π as a sequence of functions π = (π1, π2, …, πT), where πt is an \(\mathcal {H}_{t-1}\)-measurable function that maps the history D1:t−1 to \(\otimes _{j=1}^n [ \underline {p}_j, \bar {p}_j] \cup \{p_j^{\infty }\}\). This class of controls is often referred to as non-anticipating controls because the decision in each period depends only on the accumulated observations up to the beginning of the period. Under policy π, the seller sets the price in period t equal to \( p^{\pi }_t = \pi _t(D_{1:t-1})\) almost surely (a.s.). Let Π denote the set of all admissible controls:

$$\displaystyle \begin{aligned} \begin{array}{rcl} \Pi := \left\{\pi: \sum_{t=1}^T A D_t(p^{\pi}_t) \preceq C \,\, \mbox{ and } \,\, p^{\pi}_t = \pi_t(\mathcal{H}_{t-1}) \,\,\, a.s.\right\}. \end{array} \end{aligned} $$

Note that even though the seller does not know the underlying demand function, the existence of the turnoff prices \(p_1^{\infty },\dots ,p_n^{\infty }\) guarantees that this constraint can be satisfied if the seller applies \(p_j^{\infty }\) for product j as soon as the remaining capacity at hand is not sufficient to produce one more unit of product j. Let \({\mathbf {P}}^{\pi }_t\) denote the induced probability measure of D1:t = d1:t under an admissible control π ∈ Π, i.e.,

$$\displaystyle \begin{aligned} \begin{array}{rcl} {\mathbf{P}}^{\pi}_t(d_{1:t}) & = &\displaystyle \prod_{s=1}^t \left[ \left(1 - \sum_{j=1}^n \lambda_j^*(p^{\pi}_s)\right)^{\left(1 - \sum_{j=1}^n d_{s,j}\right)} \prod_{j=1}^n \lambda_j^*(p^{\pi}_s)^{d_{s,j}}\right], \end{array} \end{aligned} $$

where \(p^{\pi }_s = \pi _{s}(d_{1:s-1})\) and \(d_s = [d_{s,j}] \in \mathcal {D}\) for all s = 1, …, t. (By definition of λ(.), the term \(1 - \sum _{j=1}^n \lambda _j^*(p^{\pi }_s)\) can be interpreted as the probability of no-purchase in period s under price \(p^{\pi }_s\).) Denote by Eπ the expectation with respect to the probability measure \({\mathbf {P}}^{\pi }_t\). The total expected revenue under π ∈ Π is then given by

$$\displaystyle \begin{aligned} \begin{array}{rcl} R^{\pi} & = &\displaystyle {\mathbf{E}}^{\pi}\left[\sum_{t=1}^T p^{\pi}_t \cdot D_t(p^{\pi}_t)\right]. \end{array} \end{aligned} $$

The multidimensional version of the deterministic problem in the previous section can be formulated as follows:

$$\displaystyle \begin{aligned} \begin{array}{rcl}{} \mathbf{P}_{\lambda}: \quad J^D := & \max &\displaystyle \sum_{t=1}^T r^*(\lambda_t) \\ & \mbox{s.t.} &\displaystyle \sum_{t=1}^T A \lambda_t \preceq C \\ & &\displaystyle \lambda_t \in \lambda^*(\mathcal{P}), \quad \forall t \in [T]. \end{array} \end{aligned} $$

By assumption R3, Pλ is a convex program, and it can be shown that JD is an upper bound on the total expected revenue under any admissible control, i.e., Rπ ≤ JD for all π ∈ Π. This allows us to define the regret of an admissible control π ∈ Π as ρπ := JD − Rπ. Let λD denote the optimal solution of Pλ, and let pD = p∗(λD) denote the corresponding optimal deterministic price. (Since r∗(λ) is strongly concave in λ, by Jensen’s inequality, the optimal solution is static, i.e., λt = λD for all t.) Let Ball(x, r) denote the closed Euclidean ball centered at x with radius r. We state our fourth regularity assumption below, which essentially states that the static price should be neither so low that it attracts too much demand nor so high that it induces no demand:

  R4.

    There exists ϕ > 0 such that \(\mathsf {Ball}(p^D, \phi ) \subseteq \mathcal {P}\).

Finally, we will consider a sequence of problems where the length of the selling season and the initial capacity levels are scaled proportionally by a factor k > 0. One can interpret k as the size of the problem. One can show that the optimal deterministic solution in the scaled problems remains λD. Let ρπ(k) denote the regret under an admissible control π ∈ Π for the problem with scaling factor k. We use the asymptotic order of ρπ(k) as the metric for heuristic performance.

3.2 Parametric Case

In the parametric setting, the functional form of the demand is known, but the finitely many parameters that pin down the function are unknown. Mathematically, let Θ be a compact subset of \(\mathbb {R}^q\), where \(q \in \mathbb {Z}_{++}\) is the number of unknown parameters. In the parametric demand case, the seller knows that the underlying demand function λ∗(.) equals λ(.;θ∗) for some θ∗ ∈ Θ. Although the functional form λ(.;θ) is known, the true parameter vector θ∗ is unknown and needs to be estimated from the data. The one-period expected revenue function is given by r(p;θ) := p ⋅ λ(p;θ). To leverage the parametric structure of the unknown function, we focus primarily on Maximum Likelihood (ML) estimation, which not only has desirable theoretical properties but is also widely used in practice. As shown in the statistics literature, certain statistical conditions need to be satisfied to guarantee the regular behavior of the ML estimator. To formalize these conditions in our context, it is convenient to first consider the distribution of a sequence of demand realizations when a sequence of \(\tilde {q} \in \mathbb {Z}_{++}\) fixed price vectors \(\tilde {p} = (\tilde {p}^{(1)}, \tilde {p}^{(2)}, \dots , \tilde {p}^{(\tilde {q})}) \in \mathcal {P}^{\tilde {q}}\) has been applied. For all \(d_{1:\tilde {q}} \in \mathcal {D}^{\tilde {q}}\), we define

$$\displaystyle \begin{aligned} \begin{array}{rcl} {\mathbf{P}}^{\tilde{p}, \theta}(d_{1:\tilde{q}}) & := &\displaystyle \prod_{s=1}^{\tilde{q}} \left[ \left(1 - \sum_{j=1}^n \lambda_j(\tilde{p}^{(s)}; \theta)\right)^{\left(1 - \sum_{j=1}^n d_{s,j}\right)} \prod_{j=1}^n \lambda_j(\tilde{p}^{(s)}; \theta)^{d_{s,j}}\right] \end{array} \end{aligned} $$

and denote by \({\mathbf {E}}^{\tilde {p}}_{\theta }\) the expectation with respect to \({\mathbf {P}}^{\tilde {p}, \theta }\). In addition to the regularity assumptions R1–R4, we impose additional properties to ensure that the function class {λ(.;θ)}θ ∈ Θ is well-behaved.

Parametric Family Assumptions

  A1.

    λ(p;θ) and \(\frac {\partial \lambda _j}{\partial p_i}(p;\theta )\) for all i, j ∈ [n] and i ≠ j are continuously differentiable in θ.

  A2.

    R1 and R3 hold for all θ ∈ Θ.

  A3.

    There exists \(\tilde {p} = (\tilde {p}^{(1)}, \tilde {p}^{(2)}, \dots , \tilde {p}^{(\tilde {q})}) \in \mathcal {P}^{\tilde {q}}\) such that for all θ ∈ Θ,

    i.

      \({\mathbf {P}}^{\tilde {p}, \theta }(.) \neq {\mathbf {P}}^{\tilde {p}, \theta '}(.)\) for all θ′∈ Θ and θ′≠ θ.

    ii.

      For all \(k \in [\tilde {q}]\) and j ∈ [n], \(\lambda _j(\tilde {p}^{(k)};\theta ) > 0\) and \(\sum _{j=1}^n \lambda _j(\tilde {p}^{(k)};\theta ) <1\).

    iii.

      The minimum eigenvalue of the matrix \(\mathcal {I}(\tilde {p}, \theta ):= [\mathcal {I}_{i,j}(\tilde {p}, \theta )] \in \mathbb {R}^{q \times q}\) where

      $$\displaystyle \begin{aligned} \begin{array}{rcl} \mathcal{I}_{i,j}(\tilde{p}, \theta) & = &\displaystyle {\mathbf{E}}^{\tilde{p}}_{\theta}\left[ - \frac{\partial^2}{\partial \theta_i \partial \theta_j} \log {\mathbf{P}}^{\tilde{p}, \theta}(D_{1:\tilde{q}}) \right] \end{array} \end{aligned} $$

      is bounded from below by a positive number.

Note that A1 and A2 are quite natural assumptions satisfied by many demand functions such as linear, multinomial logit, and exponential demand. We call \(\tilde {p}\) in A3 exploration prices. A3 ensures that there exists a set of price vectors (e.g., \(\tilde {p}\)) which, when used repeatedly, allows the seller to use the ML estimator to statistically identify the true demand parameter. The symmetric matrix \(\mathcal {I}(\tilde {p}, \theta )\) defined in A3-iii is known in the literature as the Fisher information matrix, and it captures the amount of information that the seller obtains about the true parameter vector by using the exploration prices \(\tilde {p}\). A3-iii requires the Fisher matrix to be strongly positive definite; this guarantees that the seller’s information about the underlying parameter vector strictly increases as he observes more demand realizations under \(\tilde {p}\). We point out that it is easy to find exploration prices for commonly used demand function families. For example, for the linear and exponential demand function families, any \(\tilde {q} = n+1\) price vectors \(\tilde {p}^{(1)}, \dots , \tilde {p}^{(n+1)}\) constitute a set of exploration prices if (a) they are all in the interior of \(\mathcal {P}\) and (b) the vectors \((1;\tilde {p}^{(1)})\), …, \((1;\tilde {p}^{(n+1)}) \in \mathbb {R}^{n+1}\) are linearly independent. For the multinomial logit demand function family, any \(\tilde {q} = 2\) price vectors \(\tilde {p}^{(1)}, \tilde {p}^{(2)}\) constitute a set of exploration prices if (a) they are both in the interior of \(\mathcal {P}\) and (b) \(\tilde {p}^{(1)}_i \neq \tilde {p}_i^{(2)}\) for all i = 1, …, n.
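
For instance, the linear-independence condition for the linear and exponential families reduces to a rank computation; the candidate price vectors below are arbitrary illustrations:

```python
import numpy as np

n = 2                                                     # two products
tilde_p = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.3]])  # n + 1 candidates
# For linear or exponential families, these are exploration prices iff they
# lie in the interior of the price set and the augmented vectors (1; p) are
# linearly independent.
aug = np.hstack([np.ones((n + 1, 1)), tilde_p])
assert np.linalg.matrix_rank(aug) == n + 1
```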

Next, we develop a heuristic called Parametric Self-adjusting Control (PSC). In PSC, the selling season is divided into an exploration stage followed by an exploitation stage. The exploration stage lasts for L periods (L is a tuning parameter to be selected by the seller), during which the seller alternates among the exploration prices to learn the demand function. At the end of the exploration stage, the seller computes the ML estimate of θ∗, denoted by \(\hat {\theta }_{L}\) (in case the maximizer of the likelihood function is not unique, take any maximizer as the ML estimate), based on all observations so far, and solves P\(_{\lambda }(\hat {\theta }_L)\) for its solution \(\lambda ^D(\hat {\theta }_L)\), which serves as an estimate of the deterministically optimal demand rate λD(θ∗). Then, for the remaining (T − L)-period exploitation stage, the seller sets price vectors according to a simple adaptive rule, which we explain in more detail below. Define \(\hat {\Delta }_t(p_t;\hat {\theta }_{L}) := D_t - \lambda (p_t;\hat {\theta }_{L})\), and let Ct denote the remaining capacity at the end of period t. The complete PSC procedure is given in Algorithm 2.

Algorithm 2 Parametric self-adjusting control (PSC)

In contrast to many proposed heuristics that use the learned deterministic optimal price for exploitation, PSC uses the adaptive price adjustment rule in (5.28) for exploitation. To see the idea behind this design, suppose first that the estimate of the parameter vector is exact, as in Jasin (2014). In that case, \(\hat {\Delta }_t\) equals the stochastic variability in demand arrivals Δt := Dt − λ(pt;θ∗), and the pricing rule in (5.28) reduces to adjusting the prices in each period t to achieve a target demand rate, i.e., \(\lambda ^D(\theta ^*) - \sum _{s=L+1}^{t-1} \frac {\Delta _{s}}{T - s}\). The first part of this expression, λD(θ∗), is the optimal demand rate if there were no stochastic variability, and we use it as a base rate; the second part works as a fine adjustment to the base rate to mitigate the observed stochastic variability. To see how this adjustment works, consider the case of a single product: if there is more demand than the seller expects in period s, i.e., Δs > 0, then the pricing rule automatically accounts for it by reducing the target demand rate for all remaining T − s periods; moreover, the adjustment is spread uniformly across these T − s periods so as to minimize unnecessary price variation. Jasin (2014) has shown that the ability to accurately mitigate stochastic variability makes this self-adjusting pricing rule effective when the parameter vector is known. However, such precise adjustment is not possible when the parameter vector is subject to estimation error. Indeed, when \(\hat {\theta }_L \neq \theta ^*\), the seller can only adjust the target demand rate based on an estimate of Δs, namely \(\hat {\Delta }_s\); moreover, the seller can no longer identify the price vector that accurately induces (on average) the target demand rate, since the inverse demand function is also subject to estimation error. Can this pricing rule work well when the underlying demand parameter is subject to estimation error? The answer is yes, and the key observation is that these two sources of systematic bias push the price decisions in opposing directions, so their impact is reduced. To see this, consider a single-product case where the seller overestimates demand at all prices, i.e., \(\lambda (p;\hat {\theta }_L) > \lambda (p;\theta ^*)\) for all p. On the one hand, since the seller underestimates the stochastic variation that needs to be adjusted for (i.e., \(\hat {\Delta }_s = D_s - \lambda (p_s;\hat {\theta }_L) < D_s - \lambda (p_s;\theta ^*) = \Delta _s\)), this pushes the target demand rate up (and hence the price down) relative to the case of no estimation error; on the other hand, since \(p(\lambda ;\hat {\theta }_L) > p(\lambda ;\theta ^*)\), for a given target demand rate the estimation error pushes the price up. Quite interestingly, these opposing mechanisms are sufficient for PSC to achieve the optimal rate of regret.
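
A single-product sketch of this adaptive rule may help fix ideas. The linear model coefficients, the true Bernoulli demand, and the clipping of the target rate are all illustrative choices rather than the exact multiproduct rule (5.28) stated in Algorithm 2:

```python
import numpy as np

# Hypothetical fitted single-product linear model lambda(p; theta_hat).
a_hat, b_hat = 1.0, 0.5
lam_hat = lambda p: a_hat - b_hat * p
p_inv_hat = lambda lam: (a_hat - lam) / b_hat      # estimated inverse demand

def psc_exploitation(lam_D_hat, L, T, cap, demand, rng):
    """Self-adjusting exploitation stage: a sketch of the rule in (5.28)."""
    adj, revenue = 0.0, 0.0        # adj accumulates Delta_hat_s / (T - s)
    for t in range(L + 1, T + 1):
        if cap < 1:
            break                  # out of stock: switch to the turnoff price
        target = np.clip(lam_D_hat - adj, 0.01, a_hat - 0.01)
        p_t = p_inv_hat(target)
        d_t = demand(p_t, rng)     # realized one-period demand
        sold = min(d_t, cap)
        revenue += p_t * sold
        cap -= sold
        if t < T:                  # spread the correction over T - t periods
            adj += (d_t - lam_hat(p_t)) / (T - t)
    return revenue

rng = np.random.default_rng(2)
true_demand = lambda p, rng: rng.binomial(1, min(1.0, max(0.0, 1.02 - 0.51 * p)))
rev = psc_exploitation(lam_D_hat=0.4, L=100, T=1000, cap=350,
                       demand=true_demand, rng=rng)
```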

Theorem 3

Suppose that R1–R4 and A1–A3 hold. Set \(L = \lceil \, \sqrt {k T} \, \rceil \). Then, there exists a constant M1 > 0 independent of k ≥ 1 such that \(\rho ^{PSC}(k) \le M_1 \sqrt {k}\) for all k ≥ 1.

Note that in light of the lower bound example in the previous section, PSC achieves the best achievable regret. The reason PSC achieves this tight bound can be briefly explained as follows. First, it leverages the fact that the demand model is fully determined by a finite-dimensional vector θ∗, which can be efficiently estimated by ML estimation. Under ML, roughly speaking, to obtain an estimation error of order 𝜖, the seller needs to spend roughly Θ(𝜖−2) periods exploring the demand curve with exploration prices, which are not necessarily optimal. Second, the self-adjusting pricing rule in (5.28) reduces the impact of estimation error on the revenue earned during exploitation compared to using the learned deterministic price directly. To see this, suppose that the true parameter vector is misestimated by a small error 𝜖; then one can show that \(\lambda ^D(\hat {\theta }_{L})\) is roughly 𝜖 away from λD(θ∗). If the seller simply used the learned deterministic optimal price \(p^D(\hat {\theta }_L)\) throughout the exploitation stage, the one-period regret would be roughly \(r(\lambda ^D(\theta ^*);\theta ^*) - r(\lambda ^D(\hat {\theta }_{L}); \theta ^*) \approx \nabla _{\lambda } r(\lambda ^D(\theta ^*);\theta ^*) \cdot (\lambda ^D(\theta ^*) - \lambda ^D(\hat {\theta }_{L})) = \Theta (\epsilon )\) (note that a tighter bound cannot be obtained since the gradient at the constrained optimal solution is not necessarily zero). In PSC, as mentioned above, the pricing rule (5.28) introduces opposing mechanisms that mitigate the impact of the systematic error 𝜖 on regret, resulting in a one-period regret of Θ(𝜖2). Thus, the total regret from exploration and exploitation is bounded by Θ(L) +  Θ(𝜖2(kT − L)) = O(𝜖−2 + 𝜖2kT), which is \(O(\sqrt {kT})\) after optimally tuning 𝜖 (or equivalently, L).
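
The final tuning step is a simple balancing argument: the bound O(𝜖−2 + 𝜖2kT) is minimized by equating its two terms,

$$\displaystyle \begin{aligned} \min_{\epsilon>0}\left\{\epsilon^{-2}+\epsilon^2 kT\right\} = 2\sqrt{kT}, \quad \mbox{attained at } \epsilon = (kT)^{-1/4}, \end{aligned} $$

which corresponds to an exploration stage of length \(L = \Theta (\epsilon ^{-2}) = \Theta (\sqrt {kT})\) and matches the choice of L in Theorem 3.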

3.3 Nonparametric Case

The setting in Sect. 5.3.2 assumes that the seller has good prior knowledge of the functional form of the demand function, which may not be appropriate in cases such as a new product launch where no historically relevant data are available. Blindly assuming a parametric demand model could result in significant revenue loss if the parametric form is misspecified, e.g., a seller who uses a linear model to fit data generated by a logit model. An alternative setting, also known as the nonparametric approach, is one where the seller has no prior knowledge of the functional form but tries to estimate the demand directly. The challenge of this approach is that, instead of estimating a finite number of parameters, the seller now needs to directly estimate the demand function value at different price vectors to get an idea of the shape of the demand curve; thus, the number of point estimates needed to ensure low estimation error increases exponentially with the number of products. To keep the estimation problem tractable, a common assumption in the statistics literature on nonparametric approaches is to impose smoothness conditions on the underlying demand functions (Gyorfi et al., 2002). To that end, let \(\bar {s}\) denote the largest integer such that \(|\partial ^{a_1,\dots , a_n} \lambda ^*_j(p) / \partial p_1^{a_1} \dots \partial p_n^{a_n}|\) is uniformly bounded for all j ∈ [n] and \(0 \leq a_1, \dots , a_n \leq \bar {s}\). We call \(\bar {s}\) the smoothness index. We make the following smoothness assumptions:

Nonparametric Function Smoothness Assumptions

  N1.

    \(\bar {s} \geq 2\).

  N2.

    \(\left |\frac {\partial ^{a_1,\dots , a_n} \lambda ^*_j(p)}{\partial p_1^{a_1} \dots \partial p_n^{a_n}}\right | \) is uniformly bounded for all j ∈ [n], \(p \in \mathcal {P}\), \(0 \leq a_1, \dots , a_n \leq \bar {s}\).

The above assumptions are fairly mild and are satisfied by most commonly used demand functions, including linear, higher-degree polynomial, logit, and exponential demand with a bounded domain of feasible prices. The smoothness index \(\bar {s}\) indicates the level of difficulty in estimating the corresponding demand function: the larger the value of \(\bar {s}\), the smoother the demand function, and the easier it is to estimate its shape because its value cannot exhibit drastic local changes.

The idea of the nonparametric approach introduced later in this section is to replace the ML estimator in PSC with a nonparametric estimation procedure. One such approach is to use a linear combination of spline functions to approximate the underlying demand function, which we introduce below. Spline functions have been widely used in engineering to approximate complicated functions, and their popularity is primarily due to their flexibility in effectively approximating complex curve shapes (Schumaker, 2007). This flexibility lies in the piecewise nature of spline functions: a spline function is constructed by attaching piecewise polynomial functions of a certain degree, and the coefficients of these polynomials are computed in such a way that a sufficiently high degree of smoothness is ensured at the places where the polynomials connect. More formally, for all l ∈ [n], let \( \underline {p}_l = x_{l,0} < x_{l,1} \dots < x_{l,d} < x_{l,d+1} = \bar {p}_l\) be a partition that divides \([ \underline {p}_l, \bar {p}_l]\) into d + 1 subintervals of equal length, where \(d \in \mathbb {Z}_{++}\). Let \(\mathcal {G} := \otimes _{l=1}^n \mathcal {G}_l\) denote a set of grid points, where \(\mathcal {G}_l = \{x_{l,i}\}_{i=0}^{d+1}\). We define the function space of tensor-product polynomial splines of order \((s;\dots ;s)\in \mathbb {R}^n\) with grid points \(\mathcal {G}\) as \(\mathsf {S}(\mathcal {G},s) := \otimes _{l=1}^n \mathsf {S}_l(\mathcal {G}_l,s)\), where \(\mathsf {S}_l(\mathcal {G}_l, s) := \{f \in \mathsf {C}^{s-2}([ \underline {p}_l, \bar {p}_l]):\) f is a single-variate polynomial of degree s − 1 on each subinterval [xl,i−1, xl,i), for all i ∈ [d], and on [xl,d, xl,d+1]}. One of the key questions that spline approximation theory addresses is the following: given an arbitrary function λ that satisfies N1–N2, find a spline function \(g^* \in \mathsf {S}(\mathcal {G},s)\) that approximates λ well. Among the various approaches, one of the most popular is to use the so-called tensor-product B-spline basis functions (Schumaker, 2007). This approach uses linear combinations of a collection of \((s+d)^n\) tensor-product B-spline basis functions, denoted by \(\{N_{i_1, \dots , i_n}(x_1,\dots ,x_n)\}_{i_1=1, \dots ,i_n=1}^{s+d,\dots ,s+d}\), which span the functional space \(\mathsf {S}(\mathcal {G},s)\), to approximate the target function λ. Therefore, the problem of finding g∗ reduces to computing the coefficients representing g∗. Schumaker (2007) proposed an explicit formula for computing these coefficients when the value of λ is perfectly observable; the coefficients depend on λ(.) only via its function values at a finite number of price vectors in \(\mathcal {P}\) (i.e., the \((s+d)^n s^n\) price vectors in \(\tilde {\mathcal {G}}\) defined in Algorithm 3). The details of the formula are a bit technical, but we provide them in Algorithm 3 for completeness. In our problem setting, finding an approximation of \(\lambda ^*_j(.)\) for all j ∈ [n] is more challenging since we only observe noisy realizations of the function value, so we use the empirical mean of demand realizations as a surrogate for \(\lambda ^*_j(p)\) and propose the Spline Estimation algorithm in Algorithm 4 to estimate the demand, which involves observing \(\tilde {L}_0 := L_0 (s+d)^n s^n\) samples.

Algorithm 3 Spline approximation

Algorithm 4 Spline estimation
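
Here is a simplified one-product illustration of the estimation idea behind Algorithms 3 and 4: average repeated noisy observations at each grid price, then fit a spline to the averages. SciPy’s least-squares spline fit is used as a stand-in for the explicit coefficient formulas of Schumaker (2007), and the logit-style true curve and all constants are hypothetical.

```python
import numpy as np
from scipy.interpolate import make_lsq_spline

rng = np.random.default_rng(3)
true_lam = lambda p: 1.0 / (1.0 + np.exp(p - 1.0))   # hypothetical smooth curve

# Average L0 noisy Bernoulli demand observations at each grid price.
p_lo, p_hi, L0 = 0.5, 1.5, 400
grid = np.linspace(p_lo, p_hi, 30)
lam_bar = np.array([rng.binomial(L0, true_lam(p)) / L0 for p in grid])

# Least-squares fit over the spline space S(G, s) with order s = 4 (cubic).
k = 3                                                 # spline degree = s - 1
interior = np.linspace(p_lo, p_hi, 7)[1:-1]           # d = 5 interior knots
t = np.r_[[p_lo] * (k + 1), interior, [p_hi] * (k + 1)]  # clamped knot vector
lam_tilde = make_lsq_spline(grid, lam_bar, t, k=k)    # callable spline estimate
```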

Let \(\tilde {\lambda }(.)\) denote the spline function computed via Algorithm 4. It can be shown that, with high probability, the approximation error of \(\tilde {\lambda }(.)\) converges to zero at a slightly slower rate than that of the ML estimator in the parametric case. While one may be tempted to directly apply the exploitation method in PSC, i.e., the pricing rule in (5.28), the analysis of such an approach is quite difficult since, given the nature of B-spline functions and the estimation procedure, \(\tilde {\lambda }(.)\) may lose some of the regularity properties that λ∗(.) possesses. Thus, we introduce two more functional approximations of \(\tilde {\lambda }(.)\) before applying the self-adjusting pricing procedure for exploitation. To that end, we introduce a quadratic program approximation of P in which we approximate the constraints of P with linear functions and its objective with a quadratic function. First, to linearize the constraints of P: since the capacity constraints form an affine transformation of the demand function, we simply linearize the demand function. For any \(a \in \mathbb {R}^n, B \in \mathbb {R}^{n \times n}\), let B1, …, Bn be the columns of B and define \(\theta _{\iota } = (a; B_1; \dots ; B_n) \in \mathbb {R}^{n^2+n}\), where the subscript ι stands for linear demand. We denote a linear demand function by λ(p;θι) = a + B′p. Next, we explain how we use a quadratic function to approximate the objective of P. For any \(E \in \mathbb {R}, F \in \mathbb {R}^n, G \in \mathbb {R}^{n \times n}\), let G1, …, Gn denote the columns of G and define \(\theta _o = (E;F;G_1;\dots ;G_n) \in \mathbb {R}^{n^2+n+1}\), where the subscript o stands for objective. We denote the resulting quadratic function by \(q(p;\theta _o) = E + F' p + \frac {1}{2} p' G p\). Finally, let \(\theta = (\theta _o;\theta _{\iota }) \in \mathbb {R}^{2n^2+2n+1}\). For any \(\theta \in \mathbb {R}^{2n^2 + 2n + 1}, \delta \in \mathbb {R}^m\), we can define a quadratic program QP(θ;δ) as follows:

$$\displaystyle \begin{aligned} \begin{array}{rcl}{} \mathbf{QP}(\theta;\delta): \quad & \max\limits_{p \in \mathcal{P}} &\displaystyle q(p;\theta_o) \\ & \mbox{s.t.} &\displaystyle A \lambda(p;\theta_{\iota}) \preceq \frac{C}{T} - \delta. \end{array} \end{aligned} $$

It can be shown that the quadratic program has the same optimal solution as P and possesses some very useful stability properties if the parameters of the quadratic and linear functions are chosen as follows: for the linear demand function, let \(\theta ^*_{\iota } = (a^*; B^*_1;\dots ;B^*_n)\), where B∗ := ∇λ∗(pD) and a∗ := λD − (B∗)′pD; for the quadratic objective function, let \(\theta _o^* = (E^*;F^*;G_1^*;\dots ;G_n^*)\), where

$$\displaystyle \begin{aligned} \begin{array}{rcl} E^* := \frac{1}{2} (p^D)' H^* p^D, \,\,\, F^* := a^* - H^* p^D, \,\,\, G^* := B^* + (B^*)' + H^*, \end{array} \end{aligned} $$

where H∗ is an n × n symmetric matrix defined as \(H^* := B^* \nabla ^2 r_{\lambda }^*(\lambda ^D) (B^*)' - B^* - (B^*)'\). Finally, let \(\theta ^* := (\theta ^*_o;\theta ^*_{\iota })\). Note that QP(θ∗;0) is a very intuitive approximation of P since the function \(\lambda (p;\theta _{\iota }^*) = a^* + (B^*)' p = \lambda ^D + (B^*)' (p - p^D)\) can be viewed as a linearization of λ∗(.) at pD. Note also that the gradients of the objective function and the constraints in QP(θ∗;0) at pD coincide with those in P. By the Karush–Kuhn–Tucker (KKT) optimality conditions, it can be shown that the optimal solution of QP(θ∗;0) is the same as the optimal solution of P.

We are now ready to describe Nonparametric Self-adjusting Control (NSC) and discuss its asymptotic performance. NSC consists of an exploration procedure and an exploitation procedure. The exploration procedure uses the Spline Estimation algorithm in Algorithm 4 to construct a spline approximation \(\tilde {\lambda }(.)\) of the underlying demand function λ∗(.). This function \(\tilde {\lambda }(.)\) is then used to construct a linear function \(\lambda (.;\hat {\theta }_{\iota })\) that closely approximates \(\lambda (.;\theta ^*_{\iota })\) in the neighborhood of pD, and a quadratic program that closely approximates P. During the exploitation phase, we use the optimal solution of the approximate quadratic program as the baseline control and automatically adjust the prices according to a version of (5.28). Further details are provided below. Recall that \(\tilde {L}_0\) is the duration of the Spline Estimation algorithm. Let Ct denote the remaining capacity at the end of period t. Let \( \hat {\theta } := (\hat {\theta }_o;\hat {\theta }_{\iota })\), where \(\hat {\theta }_{\iota } := (\hat {a};\hat {B}_1;\dots ;\hat {B}_n), \hat {\theta }_o := (\hat {E};\hat {F};\hat {G}_1;\dots ;\hat {G}_n)\), and

$$\displaystyle \begin{aligned} \begin{array}{rcl} & &\displaystyle \hat{B} := \nabla \tilde{\lambda}(\tilde{p}^D), \, \hat{a} := \tilde{\lambda}^D - \hat{B}' \tilde{p}^D, \, \hat{E} := \textstyle \frac{1}{2} (\tilde{p}^D)' \hat{H} \tilde{p}^D, \, \hat{F} := \hat{a} - \hat{H} \tilde{p}^D,\,\\ & &\displaystyle \hat{G} := \hat{B} + \hat{B}' + \hat{H}, \quad \mbox{and}\\ & &\displaystyle \hat{H} = [\hat{H}_{ij}] \,\, \mbox{where} \, \hat{H}_{ij} :=\! -\hat{u}^{\prime}_{ij} \hat{B}^{-1} \tilde{\lambda}^D \,\, \mbox{and} \,\, \hat{u}_{ij} :=\! \left[\frac{\partial^2 \tilde{\lambda}_1(\tilde{p}^D)}{\partial p_i \partial p_j}; \dots; \frac{\partial^2 \tilde{\lambda}_n(\tilde{p}^D)}{\partial p_i \partial p_j}\right]. \end{array} \end{aligned} $$

(Note that \(\tilde {p}^D\) is the deterministic optimal solution of a version of P in which λ∗ is replaced by \(\tilde {\lambda }\), and \(\tilde {\lambda }^D := \tilde {\lambda }(\tilde {p}^D)\).) The details of NSC are given in Algorithm 5.

Algorithm 5 Nonparametric self-adjusting control (NSC)

The following result states that the performance of NSC is close to the best achievable (asymptotic) performance bound.

Theorem 4

Suppose that s ≥ 4, \(L_0 = \lceil (kT)^{(s+n/2)/(2s + n -2)} (\log (kT))^{(2s + n -4)/(2s +n-2)} \rceil \) and \(d = \lceil (L_0^{1/2} / \log (kT))^{1/(s+n/2)}\rceil \). There exists a constant M1 > 0 independent of k > 3 such that for all s ≥ 4, we have

$$\displaystyle \begin{aligned} \begin{array}{rcl} \rho^{NSC}(k) & \leq &\displaystyle M_{1} k^{\frac{1}{2} + \epsilon(n,s, \bar{s})} \log k, \,\,\, \mathit{\mbox{where }}\epsilon(n,s, \bar{s}) = \frac{1}{2} \left(\frac{2s - 2(s\wedge \bar{s}) + n+2}{2s + n - 2}\right). \end{array} \end{aligned} $$

Note that since most commonly used demand functions such as polynomials of arbitrary degree, logit, and exponential are infinitely differentiable (i.e., \(\bar {s}\) can be taken arbitrarily large), for any fixed 𝜖 > 0 we can select an integer s ≥ (n + 2)∕(4𝜖) − (n − 2)∕2 such that the performance of NSC is \(\mathcal {O}(k^{1/2 + \epsilon } \log k)\). Theoretically, this means that the asymptotic performance of NSC is very close to the best achievable performance lower bound of \(\Omega (\sqrt {k})\). Comparing the algorithms and analyses of PSC and NSC, the extra 𝜖 in the exponent of the regret bound of NSC is driven by the slightly slower rate of convergence of the nonparametric demand estimator. It remains an open question whether there exists a nonparametric approach for the NRM setting with a continuum of feasible price vectors that attains a regret bound of \(O(\sqrt {k})\).

4 Bayesian Learning Setting

The multi-armed bandit (MAB) problem is often used to model the exploration–exploitation trade-off in the dynamic learning and pricing model without inventory constraints (see Chap. 1 for an overview of the MAB problem). In one of the earliest papers on the multi-armed bandit problem, Thompson (1933) proposed a novel randomized Bayesian algorithm, which has since been referred to as the Thompson sampling algorithm. The basic idea of Thompson sampling is that at each time period, random numbers are sampled according to the posterior distributions of the reward for each action, and then the action with the highest sampled reward is chosen. In a revenue management setting, each “action” or “arm” is a price, and “reward” refers to the revenue earned by offering that price. Thus, in the original Thompson sampling algorithm—in the absence of inventory constraints—random numbers are sampled according to the posterior distributions of the mean demand rates for each price, and the price with the highest sampled revenue (i.e., price times sampled demand) is offered.

In this section, we develop a class of Bayesian learning algorithms for the multiproduct pricing problem with inventory constraints. This class of algorithms extends the powerful machine learning technique known as Thompson sampling to balance the exploration–exploitation trade-off in the presence of inventory constraints. We focus on a model with discrete price sets and present two algorithms (the algorithms can also be used for continuous price sets; see Ferreira et al. (2018)). The first algorithm adapts Thompson sampling by adding a linear programming (LP) subroutine to incorporate inventory constraints. The second algorithm builds upon the first; specifically, in each period, we modify the LP subroutine to further account for the purchases made to date. Both algorithms consist of two simple steps in each iteration: sampling from a posterior distribution and solving a linear program. As a result, the algorithms are easy to implement in practice.

4.1 Model Setting

We consider a retailer who sells N products, indexed by i ∈ [N], over a finite selling season. (Below, we denote by [x] the set {1, 2, …, x}.) These products consume M resources, indexed by j ∈ [M]. Specifically, we assume that one unit of product i consumes aij units of resource j, where aij is a fixed constant. The selling season is divided into T periods. There are Ij units of initial inventory for each resource j ∈ [M], and there is no replenishment during the selling season. We define Ij(t) as the inventory at the end of period t, and we denote Ij(0) = Ij. In each period t ∈ [T], the following sequence of events occurs:

  1.

    The retailer offers a price for each product, chosen from a finite set of admissible price vectors. We denote this set by {p1, p2, …, pK}, where pk (∀k ∈ [K]) is a vector of length N specifying the price of each product. More specifically, we have pk = (p1k, …, pNk), where pik is the price of product i, for all i ∈ [N]. Following the tradition in the dynamic pricing literature, we also assume that there is a “shut-off” price p such that the demand for any product under this price is zero with probability one. We denote by P(t) = (P1(t), …, PN(t)) the prices chosen by the retailer in this period, and require that P(t) ∈ {p1, p2, …, pK, p}.

  2.

    Customers then observe the prices chosen by the retailer and make purchase decisions. We denote by D(t) = (D1(t), …, DN(t)) the demand of each product at period t. We assume that given P(t) = pk, the demand D(t) is sampled from a probability distribution on \(\mathbb {R}^N_+\) with joint cumulative distribution function (CDF) F(x1, …, xN;pk, θ), indexed by a parameter (or a vector of parameters) θ that takes values in the parameter space \(\Theta \subset \mathbb {R}^l\). The distribution is assumed to be subexponential; note that many commonly used demand distributions, such as the normal, Poisson, and exponential distributions, as well as all bounded distributions, belong to the family of subexponential distributions. We also assume that D(t) is independent of the history \(\mathcal {H}_{t-1} = (P(1),D(1),\ldots ,P(t-1),D(t-1))\) given P(t).

    Depending on whether there is sufficient inventory, one of the following events happens:

    (a)

      If there is enough inventory to satisfy all demand, the retailer receives an amount of revenue equal to \(\sum _{i=1}^{N}D_{i}(t)P_{i}(t)\), and the inventory level of each resource j ∈ [M] diminishes by the amount of each resource used such that \(I_j(t) = I_j(t-1) - \sum _{i=1}^{N}D_{i}(t)a_{ij}\).

    (b)

      If there is not enough inventory to satisfy all demand, the demand is partially satisfied and the rest of the demand is lost. Let \(\tilde {D}_i(t)\) denote the satisfied demand for product i. We require \(\tilde {D}_i(t)\) to satisfy three conditions: (i) \(0\leq \tilde {D}_i(t)\leq D_i(t), \forall i\in [N]\); (ii) the inventory level of each resource at the end of this period is nonnegative: \(I_j(t) = I_j(t-1) - \sum _{i=1}^{N}\tilde {D}_{i}(t)a_{ij}\geq 0, \forall j\in [M]\); (iii) there exists at least one resource j′∈ [M] whose inventory level is zero at the end of this period, i.e., \(I_{j'}(t)=0\). Beyond these natural conditions, we do not impose any additional assumption on how demand is specifically fulfilled. The retailer then receives revenue equal to \(\sum _{i=1}^{N}\tilde {D}_{i}(t)P_{i}(t)\) in this period.

We assume that the demand parameter θ is fixed but unknown to the retailer at the beginning of the season, and the retailer must learn the true value of θ from demand data. That is, in each period t ∈ [T], the price vector P(t) can only be chosen based on the observed history \(\mathcal {H}_{t-1}\), but cannot depend on the unknown value θ or any event in the future. The retailer’s objective is to maximize expected revenue over the course of the selling season given the prior distribution on θ.

We use a parametric Bayesian approach in our model, where the retailer has a known prior distribution of θ ∈ Θ at the beginning of the selling season. Our model allows the retailer to choose an arbitrary prior. In particular, the retailer can assume an arbitrary parametric form of the demand CDF, given by F(x1, …, xN;pk, θ). This joint CDF parametrized by θ can parsimoniously model the correlation of demand among products. For example, the retailer may specify the products’ joint demand distribution based on some discrete choice model, where θ is the unknown parameter in the multinomial logit function. Another benefit of the Bayesian approach is that the retailer may choose a prior distribution over θ such that demand is correlated across different prices, enabling the retailer to learn the demand for all prices, not just the offered price.

4.2 Thompson Sampling with Fixed Inventory Constraints

We now present the first version of the Thompson sampling-based pricing algorithm. For each resource j ∈ [M], we define a fixed constant cj := Ij∕T, the average inventory available per period. Given any demand parameter ρ ∈ Θ, we define the mean demand under ρ as the expectation associated with the CDF F(x1, …, xN;pk, ρ) for each product i ∈ [N] and price vector k ∈ [K]. We denote by d = {dik}i ∈ [N],k ∈ [K] the mean demand under the true model parameter θ.

The Thompson sampling with Fixed Inventory Constraints (TS-fixed) algorithm is shown in Algorithm 6. Here, “TS” stands for Thompson sampling, while “fixed” refers to the fact that we use fixed constants cj for all time periods as opposed to updating cj over the selling season as inventory is depleted; this latter idea is incorporated into the second algorithm that we will present later.

Algorithm 6 Thompson sampling with fixed inventory constraints (TS-fixed)

Steps 1 and 4 are based on the Thompson sampling algorithm for the classical multi-armed bandit setting, whereas Steps 2 and 3 are added to incorporate inventory constraints. In Step 1, we randomly sample a parameter θ(t) according to the posterior distribution of the unknown demand parameter θ; as in the original Thompson sampling algorithm, this random sampling from the posterior is the key device for balancing the exploration–exploitation trade-off. The algorithm differs from ordinary Thompson sampling in Steps 2 and 3. In Step 2, the retailer solves a linear program, LP(d(t)), which identifies the optimal mixed price strategy that maximizes expected revenue given the sampled parameters. The first constraint specifies that the average resource consumption in this time period cannot exceed cj, the average inventory available per period. The second constraint specifies that the probabilities of choosing the price vectors sum to at most one. In Step 3, the retailer randomly offers one of the K price vectors (or p) according to the probabilities specified by the optimal solution of LP(d(t)). Finally, in Step 4, the algorithm updates the posterior distribution of θ given \(\mathcal {H}_{t}\). Such Bayesian updating is a simple and powerful tool for updating belief probabilities as more information—customer purchase decisions in our case—becomes available. By employing Bayesian updating in Step 4, we ensure that as any price vector pk is offered more and more times, the sampled mean demand associated with pk for each product i becomes increasingly concentrated around the true mean demand dik.

We note that the LP defined in Step 2 is closely related to the LP used by Gallego and van Ryzin (1997), who consider a network revenue management problem with known demand. Essentially, their pricing algorithm is a special case of Algorithm 6 in which LP(d), i.e., LP(d(t)) with d(t) = d, is solved in every time period. A sketch of the LP subroutine is given below.
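The following is a minimal sketch of the LP(d) subroutine, assuming the formulation described above: the decision variables are the probabilities of offering each price vector, the objective is expected revenue under the (sampled) mean demands, and the constraints are the per-period capacities cj and the total probability mass. The function and variable names are ours, not from the original papers.

```python
import numpy as np
from scipy.optimize import linprog

def solve_lp(d, p, a, c):
    """LP(d): maximize sum_k x_k * (expected revenue of price vector k)
    s.t. average resource consumption <= c_j for each resource j,
         sum_k x_k <= 1, and x >= 0.
    d, p: K x N arrays of mean demands and prices; a: N x M consumption
    matrix; c: length-M per-period capacities. Returns the length-K vector
    of probabilities of offering each price vector (any remaining mass
    1 - sum(x) is assigned to the shut-off price)."""
    K = d.shape[0]
    revenue = (d * p).sum(axis=1)   # expected one-period revenue per price vector
    consumption = d @ a             # K x M expected resource consumption
    res = linprog(
        c=-revenue,                 # linprog minimizes, so negate the objective
        A_ub=np.vstack([consumption.T, np.ones((1, K))]),
        b_ub=np.append(c, 1.0),
        bounds=[(0, None)] * K,
    )
    return res.x
```

Because the LP has only K variables and M + 1 constraints, this step is cheap relative to the posterior update, which is one reason the overall algorithm is practical.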

Next, we illustrate the application of our TS-fixed algorithm by providing one concrete example. For simplicity, in this example, we assume that the prior distribution of demand for different prices is independent; however, the definition of TS-fixed is quite general and allows the prior distribution to be arbitrarily correlated for different prices. As mentioned earlier, this enables the retailer to learn the mean demand not only for the offered price but also for prices that are not offered.

Example (Bernoulli Demand with Independent Uniform Prior)

We assume that for all prices, the demand for each product is Bernoulli distributed. In this case, the unknown parameter θ is simply the mean demand of each product under each price vector. We use Beta distributions for the posterior because the Beta distribution is conjugate to the Bernoulli distribution. We assume that the prior distribution of each mean demand dik is uniform on [0, 1] (which is equivalent to a Beta(1, 1) distribution) and is independent across all i ∈ [N] and k ∈ [K]. In this example, the posterior distribution is very simple to calculate. Let Nk(t − 1) be the number of time periods in which the retailer has offered price pk during the first t − 1 periods, and let Wik(t − 1) be the number of those periods in which product i was purchased under price pk. In Step 1 of TS-fixed, the posterior distribution of dik is Beta(Wik(t − 1) + 1, Nk(t − 1) − Wik(t − 1) + 1), so we sample dik(t) independently from this distribution for each price vector k and each product i. In Steps 2 and 3, LP(d(t)) is solved and a price vector \(p_{k'}\) is chosen; then, the customer demand Di(t) is revealed to the retailer. In Step 4, we update \(N_{k'}(t)\leftarrow N_{k'}(t-1)+1\) and \(W_{ik'}(t)\leftarrow W_{ik'}(t-1)+D_{i}(t)\) for all i ∈ [N]. The posterior distributions associated with the K − 1 unchosen price vectors (k ≠ k′) remain unchanged.
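To illustrate, here is a minimal simulation of TS-fixed for this Bernoulli example, reusing the solve_lp sketch above. The true mean demands d_true, the problem sizes, and all variable names are hypothetical; for brevity, the sketch keeps cj fixed and does not simulate inventory depletion or stock-outs.

```python
import numpy as np

rng = np.random.default_rng(seed=1)
N, K, M, T = 2, 3, 1, 1000
p = rng.uniform(1.0, 5.0, size=(K, N))       # admissible price vectors
a = np.ones((N, M))                          # unit resource consumption a_ij
c = np.full(M, 0.4)                          # c_j = I_j / T
d_true = rng.uniform(0.1, 0.9, size=(K, N))  # true mean demands (unknown to the retailer)

W = np.zeros((K, N))   # W_ik: purchases of product i under price vector k
n_off = np.zeros(K)    # N_k: number of periods price vector k was offered

for t in range(T):
    # Step 1: sample mean demands from the Beta(W + 1, N - W + 1) posteriors.
    d_sample = rng.beta(W + 1.0, n_off[:, None] - W + 1.0)
    # Step 2: solve LP(d(t)) for the mixed price strategy.
    x = np.clip(solve_lp(d_sample, p, a, c), 0.0, None)
    # Step 3: offer a price vector (index K stands for the shut-off price).
    probs = np.append(x, max(0.0, 1.0 - x.sum()))
    k = rng.choice(K + 1, p=probs / probs.sum())
    if k == K:
        continue                             # shut-off price: no demand, no update
    # Step 4: observe Bernoulli demand and update the posterior counts.
    demand = (rng.random(N) < d_true[k]).astype(float)
    W[k] += demand
    n_off[k] += 1.0
```

Note how only the counts of the chosen price vector k′ are updated each period, exactly as in the description above.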

4.3 Thompson Sampling with Inventory Constraint Updating

Now, we propose the second Thompson sampling-based algorithm. Recall that in TS-fixed, we use fixed inventory constants cj in every period. Alternatively, we can update cj over the selling season as inventory is depleted, thereby incorporating real-time inventory information into the algorithm.

In particular, recall that Ij(t) is the inventory level of resource j at the end of period t. Define cj(t) = Ij(t − 1)∕(T − t + 1) as the average inventory of resource j available from period t to period T. We then replace the constants cj with cj(t) in LP(d(t)) in Step 2 of TS-fixed, which gives us the Thompson sampling with Inventory Constraint Updating algorithm (TS-update for short) shown in Algorithm 7. The term “update” refers to the fact that in every iteration, the algorithm updates the inventory constants cj(t) in LP(d(t)) to incorporate real-time inventory information; a sketch of this one-line change appears after Algorithm 7.

Algorithm 7 Thompson sampling with inventory constraint updating (TS-update)
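The following sketch isolates the single change relative to the TS-fixed simulation above; the function name and the variable I_remaining are illustrative assumptions.

```python
import numpy as np

def capacity(I_remaining: np.ndarray, t: int, T: int) -> np.ndarray:
    """c_j(t) = I_j(t-1) / (T - t + 1): the average inventory of each resource
    available over the remaining periods t, t+1, ..., T (t is 1-indexed)."""
    return I_remaining / (T - t + 1)

# In the period-t loop of the TS-fixed sketch, one would instead call
#   x = solve_lp(d_sample, p, a, capacity(I_remaining, t, T))
# and decrement I_remaining by the realized resource consumption each period.
```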

In the revenue management literature, the idea of using updated inventory rates like cj(t) has been previously studied in various settings (Jasin and Kumar, 2012; Jasin, 2014). TS-update is an algorithm that incorporates real-time inventory updating when the retailer faces an exploration–exploitation trade-off with its pricing decisions. Although intuitively incorporating updated inventory information into the pricing algorithm should improve the performance of the algorithm, Cooper (2002) provides a counterexample where the expected revenue is reduced after the updated inventory information is included. Therefore, it is not immediately clear if TS-update would achieve higher revenue than TS-fixed. We will rigorously analyze the performance of both TS-fixed and TS-update in the next section; our numerical simulation shows that in fact there are situations where TS-update outperforms TS-fixed and vice versa.

4.4 Performance Analysis

To evaluate the proposed Bayesian learning algorithms, we compare the retailer’s revenue with a benchmark where the true demand distribution is known a priori. We define the retailer’s regret over the selling horizon as

$$\displaystyle \begin{aligned} \text{Regret}(T,\theta)=\mathrm{E}[\text{Rev}^{*}(T)\mid \theta]-\mathrm{E}[\text{Rev}(T)\mid \theta], \end{aligned}$$

where Rev∗(T) is the revenue achieved by the optimal policy when the demand parameter θ is known a priori, and Rev(T) is the revenue achieved by an algorithm that does not know θ. The conditional expectation is taken over random demand realizations given θ, and possibly over external randomization used by the algorithm (e.g., the random samples in Thompson sampling). In words, the regret is a nonnegative quantity measuring the retailer’s revenue loss due to not knowing the latent demand parameter.

We also define the Bayesian regret (also known as Bayes risk) by

$$\displaystyle \begin{aligned} \text{BayesRegret}(T) = \mathrm{E}[\text{Regret}(T,\theta)], \end{aligned}$$

where the expectation is taken over the prior distribution of θ.

We now present regret bounds for TS-fixed and TS-update under the assumption of bounded demand. Specifically, in the following analysis, we further assume that for each product i ∈ [N], the demand is bounded: \(D_i(t)\in [0, \bar {d}_i]\) under any price vector pk, ∀k ∈ [K]. The results can be generalized to the case where the demand is unbounded but follows a sub-Gaussian distribution. We also define the constants

$$\displaystyle \begin{aligned} p_{\max} := \max_{k\in [K]} \sum_{i=1}^N p_{ik}\bar{d}_i,\quad p^j_{\max} := \max_{i\in[N]: a_{ij}\neq 0,k\in[K] }\frac{p_{ik}}{a_{ij}},\;\forall j\in[M], \end{aligned}$$

where \(p_{\max }\) is the maximum revenue that can possibly be achieved in one period, and \(p^j_{\max }\) is the maximum revenue that can possibly be achieved by adding one unit of resource j, ∀j ∈ [M].
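As a concrete illustration, these constants can be computed directly from their definitions; the sketch below does so for a given price matrix, consumption matrix, and demand bounds. All names are ours, and the sketch assumes each resource is consumed by at least one product.

```python
import numpy as np

def regret_constants(p: np.ndarray, a: np.ndarray, d_bar: np.ndarray):
    """p: K x N prices; a: N x M consumption matrix; d_bar: length-N demand
    upper bounds. Returns (p_max, p_j_max) as defined in the text."""
    p_max = (p * d_bar).sum(axis=1).max()  # max one-period revenue over price vectors
    K, N = p.shape
    M = a.shape[1]
    p_j_max = np.array([
        max(p[k, i] / a[i, j] for k in range(K) for i in range(N) if a[i, j] != 0)
        for j in range(M)
    ])                                     # max revenue per unit of resource j
    return p_max, p_j_max
```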

Theorem 5

The Bayesian regret of TS-fixed is bounded by

$$\displaystyle \begin{aligned} \mathrm{BayesRegret}(T) \leq \left( 18 p_{\max} + 37 \sum_{i=1}^N \sum_{j=1}^{M}p^j_{\max} a_{ij} \bar{d}_i\right) \sqrt{TK\log K}. \end{aligned}$$

Theorem 6

The Bayesian regret of TS-update is bounded by

$$\displaystyle \begin{aligned} \mathrm{BayesRegret}(T)\leq \left( 18 p_{\max} + 40 \sum_{i=1}^N \sum_{j=1}^M p^j_{\max} a_{ij}\bar{d}_i \right) \sqrt{TK\log K}+p_{\max}M. \end{aligned}$$

The results above state that the Bayesian regrets of both TS-fixed and TS-update are bounded by \(O(\sqrt {TK\log K})\), where K is the number of price vectors that the retailer is allowed to use and T is the number of time periods. Moreover, the regret bounds are prior-free as they do not depend on the prior distribution of parameter θ; the constants in the bounds can be computed explicitly without knowing the demand distribution.

It has been shown that for a multi-armed bandit problem with rewards in [0, 1]—a special case of our model with no inventory constraints—no algorithm can achieve a prior-free Bayesian regret smaller than \(\Omega (\sqrt {KT})\) (see Theorem 3.5 of Bubeck and Cesa-Bianchi 2012). In that sense, the above regret bounds are optimal with respect to T and cannot be improved by more than a factor of \(\sqrt {\log K}\).

Note that the regret bound of TS-update is slightly worse than that of TS-fixed. Although intuition would suggest that updating inventory information in TS-update should lead to better performance than TS-fixed, perhaps surprisingly this is not always true: there exist counterexamples where updating inventory information actually degrades performance for a given horizon length T.

The detailed proofs of Theorems 5 and 6 are omitted; we briefly summarize the intuition behind them. For both theorems, we first assume an “ideal” scenario where the retailer is able to collect revenue even after inventory runs out. We show that if prices are set according to TS-fixed or TS-update, the expected revenue achieved by the retailer is within \(O(\sqrt {T})\) of the optimal revenue Rev∗(T). However, this argument overestimates the expected revenue. To compute the actual revenue under constrained inventory, we must subtract the revenue associated with lost sales. For Theorem 5 (TS-fixed), we prove that the amount associated with lost sales is no more than \(O(\sqrt {T})\). For Theorem 6 (TS-update), we show that the amount associated with lost sales is no more than O(1).

5 Remarks and Further Reading

The content of Sect. 5.2 is based on Wang (2012) and Wang et al. (2014); readers are referred to Wang et al. (2014) for the proofs of the main results as well as implementation suggestions for the proposed algorithms. Note that in practical implementation, the algorithm can be made more efficient by relaxing some requirements stated in Algorithm 1. Extensive numerical experiments and comparisons with other algorithms can be found in Wang (2012) and Wang et al. (2014). Later, Lei et al. (2014) improved the result of Theorem 1, removing the logarithmic factor in the worst-case regret using a bisection-type method; for details of the algorithm and its analysis, we refer readers to Lei et al. (2014).

Section 5.3 is adapted from Chen et al. (2019) and Chen et al. (2021), which contain full proofs of the theorems presented here and additional numerical studies. Chen et al. (2021) further consider a well-separated condition on the demand functions and derive a much sharper \(O(\log ^2 k)\) regret bound than the \(O(\sqrt {k})\) bound for the general demand case.

Section 5.4 is primarily based on Ferreira et al. (2018). The definition of Bayesian regret used in this section is a standard metric for the performance of online Bayesian algorithms; see Russo and Van Roy (2014). Ferreira et al. (2018) also develop Thompson sampling algorithms for the linear demand case and for the bandits-with-knapsacks problem; see Badanidiyuru et al. (2013).

Other methods have been proposed in the literature to address learning and pricing problems in the constrained inventory setting. One approach is to separate the selling season (T periods) into a disjoint exploration phase (say, from period 1 to τ) and an exploitation phase (from period τ + 1 to T) (Besbes and Zeevi, 2009, 2012). One drawback of this strategy is that it does not use purchasing data after period τ to continuously refine demand estimates. Furthermore, when there is very limited inventory, this approach is susceptible to running out of inventory during the exploration phase, before any demand learning can be exploited. Another approach is to use multi-armed bandit methods such as the upper confidence bound (UCB) algorithm (Auer et al., 2002) to make pricing decisions in each period. The UCB algorithm constructs a confidence set for the unknown demand using purchase data and then selects a price that maximizes revenue among all parameter values in the confidence set. We refer readers to Badanidiyuru et al. (2013) and Agrawal and Devanur (2014) for UCB algorithms with constrained inventory.