1 Problem Formulation in General

Since the seminal paper of Whitin (1955), joint pricing and inventory control problems have attracted tremendous attention and have been studied by hundreds of research papers in the literature. For a comprehensive review, see the survey papers of Petruzzi and Dada (1999), Elmaghraby and Keskinocak (2003), Yano and Gilbert (2005), and Chen and Simchi-Levi (2012). The traditional literature assumes that the demand distribution is known and takes this information as model input, an assumption that is rarely satisfied in practice. In this chapter, we relax this assumption and discuss online algorithms that learn the demand only from historical data. As time goes by, the learning algorithms learn the demand better and better, so that the solutions they prescribe converge to the true optimal solution, i.e., the one that would be obtained had the demand distribution been known.

In this section, we discuss the general setup for the problem of joint inventory and pricing. Consider a periodic review system in which a firm (e.g., a retailer) sells a non-perishable product over a planning horizon of T periods. At the beginning of each period t, the firm observes on-hand inventory xt and determines an inventory order-up-to level yt and a price pt, where \(y_t \ge x_t, \ y_t \in \mathcal {Y} =[y^l, y^h]\) and \(p_t \in \mathcal {P}=[p^l, p^h]\) with yl < yh and pl < ph. For simplicity we assume that the system is initially empty, i.e., x1 = 0. Demand for period t, denoted by Dt(pt), is stochastic and price dependent. Demand is satisfied as much as possible by on-hand inventory, and profits are collected by the firm. There might be a mismatch between supply and demand. If yt > Dt(pt), any leftover inventory is carried over to the next period, and the firm pays a holding cost h per unit carried. If yt < Dt(pt), some demands are not fulfilled, and the firm pays a penalty cost b per unit of stockout. The per-unit ordering cost is normalized to 0 without loss of generality. The firm's objective is to maximize the T-period total profit.
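To make the dynamics concrete, the following Python sketch simulates one review period under both transition rules discussed later in this section (backlog and lost sales). The demand function, cost parameters, and the revenue convention for backlogged units are illustrative assumptions, not specifications from the chapter.

```python
import numpy as np

def simulate_period(x_t, p_t, y_t, demand, h=1.0, b=2.0, backlog=True):
    """One review period: order up to y_t >= x_t, post price p_t, observe demand."""
    assert y_t >= x_t, "order-up-to level cannot be below on-hand inventory"
    d_t = demand(p_t)                        # realized demand D_t(p_t)
    leftover = max(y_t - d_t, 0.0)           # holding cost h per leftover unit
    shortage = max(d_t - y_t, 0.0)           # penalty cost b per unit of stockout
    # Backlogged units are eventually sold, so revenue is credited on full demand;
    # under lost sales only the filled portion min{y_t, d_t} earns revenue.
    revenue = p_t * d_t if backlog else p_t * min(y_t, d_t)
    profit = revenue - h * leftover - b * shortage
    x_next = y_t - d_t if backlog else leftover   # backlog vs. lost-sales transition
    return profit, x_next

rng = np.random.default_rng(0)
demand = lambda p: max(10.0 - 2.0 * p + rng.normal(), 0.0)   # hypothetical demand curve
profit, x2 = simulate_period(x_t=0.0, p_t=3.0, y_t=5.0, demand=demand)
```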

If the distribution of Dt(pt) is known a priori to the firm (complete information scenario), then the optimization problem the firm wishes to solve is

$$\displaystyle \begin{aligned} \max_{\{(p_t, y_t),\, t=1,\ldots,T\}} \; \mathbb{E}\left[\sum_{t=1}^{T} v(p_t, y_t)\right] \quad \text{s.t.} \;\; y_t \ge x_t, \; (p_t, y_t) \in \mathcal{P} \times \mathcal{Y}, \end{aligned} $$
(12.1)

where v(pt, yt) is the instantaneous reward during period t. Let V∗ represent the maximum T-period expected profit generated by the optimal policy under complete information.

In practice, the demand distribution is unknown; therefore, the firm needs to develop an admissible policy that prescribes pricing and ordering decisions for each period. An admissible policy is represented by a sequence of prices and order-up-to levels, {(pt, yt), t ≥ 1}, where (pt, yt) depends only on data realized and decisions made prior to period t, and yt ≥ xt, i.e., (pt, yt) is adapted to the filtration generated by {(ps, ys), os : s = 1, …, t − 1}. Here os represents the observable demand data. Ideally, os = Ds(ps), meaning that demand is fully observable, but in some cases demand data is censored, which yields os < Ds(ps) whenever a stockout occurs. Given any admissible policy π, the sequence of events for each period t is described as follows:

  1. At the beginning of period t, the retailer observes the initial inventory level xt.

  2. The retailer decides the selling price pt and the order-up-to level yt ≥ xt. New orders, if any, arrive instantaneously.

  3. Demand realizes and is satisfied to the maximum extent using on-hand inventory. Unsatisfied demand is backlogged or lost, and any leftover inventory is carried to the next period. The retailer observes data ot.

  4. At the end of period t, the retailer collects the profit of the current period.

The firm’s objective is to find an admissible policy to maximize the T-period total profit while learning the unknown demand distribution on the fly. The regret of policy π, denoted by Rπ(T), is defined as the total profit loss over T periods, which is

$$\displaystyle \begin{aligned} R^{\pi}(T)=V^*-\mathbb{E}\left [\sum_{t=1}^{T}v(p_t, y_t)\right]. \end{aligned}$$

The smaller the regret, the better the policy.

In this chapter, we will discuss a number of models under the framework of joint inventory and pricing. These models differ in the following three dimensions.

  1. Backlog versus lost-sales

    • In a backlog system, if yt < Dt(pt), any unsatisfied demands will be backlogged and served in future periods, and xt+1 = yt − Dt(pt).

    • In a lost-sales system, unmet demands will leave the market without any purchases, and xt+1 = (yt − Dt(pt))+.

  2. Unlimited price changes versus limited price changes

    • Most models we will discuss allow an unlimited number of price changes, i.e., the retailer is allowed to change the price every period.

    • We will discuss one model where the firm is not allowed to change the price more than a certain number of times.

  3. With versus without setup cost

    • If a setup cost is present, a fixed fee is charged whenever a positive amount of inventory is ordered.

In Sects. 12.2 and 12.3, we discuss the classic joint inventory and pricing problem with backlogged demand and lost sales, respectively. In Sect. 12.4, we consider scenarios with a limited number of price changes. In Sect. 12.5, we discuss the joint pricing and inventory control problem with setup cost. In Sect. 12.6, we discuss other models of joint pricing and inventory control that have been studied in the literature.

2 Nonparametric Learning for Backlogged Demand

In this section, we discuss the joint pricing and inventory control problem with backlogged demand, one of the most classical models under the topic of joint pricing and inventory control. We will discuss the model, algorithm, regret convergence results and proof sketch based on Chen et al. (2019).

Per-period demand can be either Dt(pt) = λ(pt) + 𝜖t (additive) or Dt(pt) = λ(pt)𝜖t (multiplicative), where λ(⋅) is a strictly decreasing deterministic function and 𝜖t, t = 1, 2, …, T, are independent and identically distributed random variables with probability density function f(⋅) and cumulative distribution function F(⋅). Here we focus on the multiplicative demand form, which, to guarantee positive demand, is parameterized as \(D_t(p_t)=e^{\lambda(p_t)}e^{\epsilon_t}\) in the formulas below. Unsatisfied demands are backlogged, and one has xt+1 = yt − Dt(pt) for all t = 1, …, T. The instantaneous reward for period t is \(G(p_t, y_t)=p_t \mathbb {E}[D_t(p_t)]-h \mathbb {E} [y_t- D_t(p_t)]^+ - b \mathbb {E} [D_t(p_t)-y_t]^+\).
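Since the expectations above generally have no closed form, they can be approximated by Monte Carlo. The sketch below evaluates G(p, y) under the multiplicative form \(D(p)=e^{\lambda(p)+\epsilon}\); the linear λ and the Gaussian error are illustrative assumptions.

```python
import numpy as np

def G_hat(p, y, lam, eps_samples, h=1.0, b=2.0):
    """Monte Carlo estimate of G(p, y) = p E[D] - h E[(y - D)^+] - b E[(D - y)^+]."""
    d = np.exp(lam(p) + eps_samples)          # demand scenarios e^{lam(p)} e^{eps}
    return (p * d
            - h * np.maximum(y - d, 0.0)
            - b * np.maximum(d - y, 0.0)).mean()

rng = np.random.default_rng(1)
eps = rng.normal(0.0, 0.1, size=100_000)      # i.i.d. error samples
lam = lambda p: 2.0 - 0.3 * p                 # strictly decreasing (assumed)
print(G_hat(p=3.0, y=1.1 * np.exp(lam(3.0)), lam=lam, eps_samples=eps))
```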

By Sobel (1981), the myopic policy is optimal for this problem. Therefore, to optimize the T-period problem in (12.1), it suffices to solve the single-period problem

$$\displaystyle \begin{aligned} \max_{\text{{(p, y)}} \in \mathcal{P} \times \mathcal{Y}} \;\; G(p, y), {} \end{aligned} $$
(12.2)

where

$$\displaystyle \begin{aligned} \begin{array}{rcl} G(p, y)& =&\displaystyle p \mathbb{E}[D(p)]-h \mathbb{E} [y- D(p)]^+ - b \mathbb{E} [D(p)-y]^+ \\ & =&\displaystyle p e^{ \lambda(p)} \mathbb{E}\big [e^{ \epsilon}\big] - \left \{h \mathbb{E} \big[y-e^{\lambda(p)}e^{\epsilon}\big]^+ + b \mathbb{E} \big [e^{ \lambda(p)}e^{ \epsilon}-y\big]^+ \right \}. {} \end{array} \end{aligned} $$

Let \(Q(p, e^{\lambda (p)}) := \max _{y \in \mathcal {Y}}G(p, y)\); then problem (12.2) can be rewritten as

$$\displaystyle \begin{aligned} \begin{array}{rcl} & &\displaystyle \mathbf{Problem\ CI\!:} \\ & &\displaystyle \max_{p \in \mathcal{P}}Q(p, e^{\lambda(p)}) \\ & &\displaystyle :=\max_{p \in \mathcal{P}} \Big\{ p e^{ \lambda(p)} \mathbb{E}\big [e^{ \epsilon}\big] - \min_{y \in \mathcal{Y}} \left \{h \mathbb{E} \big[y-e^{\lambda(p)}e^{\epsilon}\big]^+ + b \mathbb{E} \big [e^{ \lambda(p)}e^{ \epsilon}-y\big]^+ \right \} \Big\}. \\ {} \end{array} \end{aligned} $$
(12.3)

The inner optimization problem (minimization) determines the optimal order-up-to level that minimizes the expected holding and backlog cost for a given price p, and we denote it by \(\overline y \left (e^{\lambda (p)}\right )\). The outer optimization solves for the optimal price p. Because (p∗, y∗) is the optimal solution for (12.3), it satisfies \(y^*=\overline y(e^{\lambda (p^*)})\).
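The structure of (12.3) suggests a simple numerical scheme under complete information: for each candidate price, the inner minimization is a newsvendor problem whose solution is the b∕(b + h) fractile of demand, and the outer maximization can be done by grid search. The sketch below follows this logic with sampled errors standing in for the known distribution; the grid resolution is an illustrative choice.

```python
import numpy as np

def solve_CI(lam, eps_samples, price_grid, h=1.0, b=2.0):
    """Grid-search the outer problem; solve the inner newsvendor by a quantile."""
    q = np.quantile(np.exp(eps_samples), b / (b + h))   # critical fractile of e^eps
    best = None
    for p in price_grid:
        y = np.exp(lam(p)) * q                          # \bar y(e^{lam(p)})
        d = np.exp(lam(p) + eps_samples)
        G = (p * d - h * np.maximum(y - d, 0.0)
                   - b * np.maximum(d - y, 0.0)).mean()
        if best is None or G > best[0]:
            best = (G, p, y)
    return best   # (G(p*, y*), p*, y*)
```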

The firm knows neither the function λ(⋅) nor the distribution of random variable 𝜖t. In the backlog system, true demand realizations can be observed. Therefore, ot = Dt(pt), and an admissible policy (pt, yt) is adapted to the filtration generated by {(ps, ys), Ds(ps) : s = 1, …, t − 1}.

Learning Algorithm

A learning algorithm named DDA (shorthand for Data-Driven Algorithm) is proposed in Chen et al. (2019). DDA approximates λ(p) by an affine function, and it constructs empirical and dependent error samples from the collected data, called centered samples. DDA divides the planning horizon into stages whose lengths are exponentially increasing (in the stage index). At the start of each stage, the firm sets two pairs of prices and order-up-to levels based on its current linear estimation of the demand-price function and the constructed centered samples of random error, and the collected demand data from this stage are used to update the linear estimation of the demand-price function and the empirical distribution of random error. These are then utilized to find the pricing and inventory decision for the next stage. The detailed algorithm design is presented in Algorithm 1.

Algorithm 1 Data-Driven Algorithm (DDA)

As shown in Algorithm 1, for i = 1, 2, … in the DDA algorithm, iteration i focuses on stage i, which consists of 2Ii periods. The algorithm sets the ordering quantity and selling price for each period in stage i based on estimates derived from the previous iteration. The first Ii periods (from ti + 1 to ti + Ii) try to implement the order-up-to-\(\hat y_{i,1}\) policy, while the second Ii periods try to implement the order-up-to-\(\hat y_{i,2}\) policy. Because the starting inventory level may be higher than the order-up-to level, \(\hat y_{i,1}\) and \(\hat y_{i,2}\) may not be attained, and one challenge is to identify the impact of this carryover inventory constraint on the performance of a learning algorithm.

The algorithm applies the realized demand data and the least-squares method to update the linear approximation, \(\hat {\alpha }_{i+1}- \hat {\beta }_{i+1} p\), of λ(p) and computes a centered sample ηt of the random error 𝜖t, for t = ti + 1, …, ti + 2Ii. Note that ηt is not a true sample of the random error 𝜖t. This is because 𝜖t = Dt(pt) − λ(pt) but \(1/I_i \sum _{k=t_i+1}^{t_i+I_i} D_k \neq \lambda (p_t)\). For this reason, the constructed objective function for holding and shortage costs is not a sample average of the newsvendor problem. In the traditional SAA, mathematical expectations are replaced by true sample averages; see, e.g., Kleywegt et al. (2002); Levi et al. (2007, 2015). When only biased samples are available, techniques from statistics such as jackknife resampling can be applied to reduce the bias of SAA (Wu et al., 1986). In this work, samples of 𝜖t cannot be observed; however,

$$\displaystyle \begin{aligned} \begin{array}{rcl} \eta_t=D_t(p_t)-\frac{1}{I_i}\sum_{k=t_i+1}^{t_i+I_i} D_k=\epsilon_t-\frac{1}{I_i} \sum_{k=t_i+1}^{t_i+I_i} \epsilon_k \end{array} \end{aligned} $$

can be obtained. Since \(\mathbb {E}[\epsilon _k]=0\), \({1}/{I_i} \sum _{k=t_i+1}^{t_i+I_i} \epsilon _k\) converges to 0 in probability as Ii grows, and one would expect ηt → 𝜖t in probability as t grows. Thus, DDA uses ηt in place of 𝜖t in computing proxy objectives. Since these samples are obtained from the original i.i.d. samples after subtracting the sample average, we call ηt centered samples; note that {ηt, t = ti + 1, …, ti + 2Ii} are dependent.
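The sketch below illustrates the two estimation steps of a DDA stage: a least-squares fit of the affine proxy α − βp to the stage's price-observation pairs, and the construction of centered samples by subtracting each phase's sample mean. Under the multiplicative model, `obs` would hold log-demands; treating the two phases separately is an implementation assumption consistent with the constant price within each phase.

```python
import numpy as np

def dda_update(prices, obs, I_i):
    """Least-squares update of (alpha, beta) and centered error samples."""
    A = np.column_stack([np.ones_like(prices), -prices])   # model: alpha - beta * p
    alpha_hat, beta_hat = np.linalg.lstsq(A, obs, rcond=None)[0]
    # Centered samples: subtract the within-phase sample mean (prices are
    # constant within each phase, so the phase mean estimates lambda(p)).
    eta_first = obs[:I_i] - obs[:I_i].mean()
    eta_second = obs[I_i:] - obs[I_i:].mean()
    return alpha_hat, beta_hat, np.concatenate([eta_first, eta_second])
```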

A data-driven optimization problem is then constructed. When \(\hat \beta _{i+1}>0\), the algorithm solves an optimization problem with a jointly concave objective. Technical analyses in the paper show that the probability that \(\hat \beta _{i+1} > 0\) converges to 1 as i grows.

The DDA algorithm integrates a process of earning (exploitation) and learning (exploration) in each stage. The earning phase consists of the first Ii periods starting at ti + 1, during which the algorithm implements the optimal strategy for the proxy optimization problem \(G_{i}^{DD}(p,y)\). In the next Ii periods of learning phase that starts from ti + Ii + 1, the algorithm uses a different price \(\hat p_{i}+ \delta _i\) and its corresponding order-up-to level. The purpose of this phase is to extract demand sensitivity information around the selling price. Note that, even though the firm deviates from the optimal strategy of the proxy problem in the second phase, the policies, \((\hat p_{i}+\delta _i, \hat y_{i,2})\) and \((\hat p_{i}, \hat y_{i,1})\), will be very close to each other as i increases. Chen et al. (2019) show that they both converge to the clairvoyant optimal solution and the loss of profit from this deviation converges to zero.

Regret Convergence

An upper bound for regret of the DDA policy is provided as \(R^{DDA}(T)=V^*-\mathbb {E} \left [\sum _{t=1}^{T}G(p_t,y_t)\right ] \le C_1 T^{1/2}\), for some constant C1 > 0. The lower bound for regret is Ω(T1∕2), which is implied by Keskin and Zeevi (2014). This shows that the regret convergence rate for DDA is tight.

The intuition for the regret convergence is the following. Note that during stage i, two distinct prices are implemented, based on which demand data is generated. The two prices differ by δi, which decreases to 0 as i increases. Therefore, the two prices get closer and closer, and the linear function yielded by the least-squares approximation approaches the tangent line of λ(⋅), providing gradient information for future decisions.

Proof Sketch

To compare the DDA policy with the clairvoyant optimal policy, i.e., the optimal solutions of problem DD (constructed in the DDA algorithm) and problem CI (12.3), note that these two objective functions have significant differences. In problem CI, both λ(p) and the distribution of 𝜖 are known, but in problem DD, λ(p) is approximated by a linear function and the distribution of 𝜖 is estimated using centered samples instead of true samples. Therefore, to analyze DDA, the authors' approach is to introduce several "intermediate" bridging problems and, in each step, compare two "adjacent" problems that differ along only one dimension.

First, for parameters α and β > 0, we introduce bridging problem B1 defined by

$$\displaystyle \begin{aligned} \begin{array}{rcl} & &\displaystyle \mathbf{Bridging\ Problem\ B1:} \\ & &\displaystyle \max_{p \in \mathcal{P}} Q(p, e^{\alpha-\beta p}) \\ & &\displaystyle :=\max_{p \in \mathcal{P}} \Big\{ p e^{\alpha-\beta p} \mathbb{E}\big[e^{\epsilon}\big] - \min_{y \in \mathcal{Y}} \left\{ h \mathbb{E}\big[y-e^{\alpha-\beta p+\epsilon}\big]^+ + b \mathbb{E}\big[e^{\alpha-\beta p+\epsilon}-y\big]^+ \right\} \Big\}. {} \end{array} \end{aligned} $$
(12.5)

It is easy to see that the only difference between problem B1 and problem CI in (12.3) is that in problem B1 the demand-price function of CI is replaced by the affine function α − βp. Let \(\overline p\big (\alpha , \beta \big )\) denote the optimal price for problem B1, and for given \(p\in {\mathcal P}\), let \(\overline y(e^{\alpha -\beta p})\) denote its optimal order-up-to level, which is the optimal solution for the inner minimization problem in (12.5).

The second bridging problem, B2, is defined for each iteration i of the DDA algorithm, and for any α and β > 0, it is given by

$$\displaystyle \begin{aligned} &\mathbf{Bridging\ Problem\ B2:} \\ &\max_{p \in \mathcal{P}} \tilde Q_{i+1}(p,e^{\alpha-\beta p}) :=\max_{p \in \mathcal{P}} \Bigg \{p e^{\alpha-\beta p}\left (\frac{1}{2I_i} \sum_{t=t_i+1}^{t_i+2I_i}e^{{ {\epsilon_t}}}\right) {} \\ &\quad -\min_{y \in \mathcal{Y}} \bigg\{ \frac{1}{2I_i} \sum_{t=t_i+1}^{t_i+2I_i}\Big (h \big (y-e^{ \alpha- \beta p+\epsilon_t}\big)^++b\big(e^{ \alpha- \beta p+\epsilon_t}-y\big)^+\Big)\bigg\} \Bigg \}. \end{aligned} $$
(12.6)

Compared with problem B1, B2 is obtained by replacing the expectations in B1 with sample averages; hence B2 is the sample average approximation (SAA) of problem B1. Here 𝜖t, t = ti + 1, …, ti + 2Ii, represent the realizations of the random errors during stage i. Let \(\tilde p_{i+1}\left (\alpha , \beta \right )\) denote the optimal price and \(\tilde y_{i+1}(e^{\alpha -\beta p})\) the optimal order-up-to level for problem B2, the latter being the optimal solution for the inner minimization problem in (12.6).

The third bridging problem B3 is a variation of problem B2, which replaces the true random error 𝜖t by a biased error sample ζt, t = ti + 1, …, ti + 2Ii. That is, for

$$\displaystyle \begin{aligned} \begin{array}{rcl} \zeta_{t=t_i+1}^{t_i+I_i}=\big(\zeta_{t_i+1}, \ldots, \zeta_{t_i+I_i}\big), \quad \zeta_{t=t_i+I_i+1}^{t_i+2I_i}=\big(\zeta_{t_i+I_i+1}, \ldots, \zeta_{t_i+2I_i}\big), \end{array} \end{aligned} $$

and parameters α and β > 0, we define the third bridging problem B3 as

$$\displaystyle \begin{aligned} &\mathbf{Bridging\ Problem\ B3:} \\ &\max_{p \in \mathcal{P}} \breve Q_{i+1}(p,e^{\alpha-\beta p}) :=\max_{p \in \mathcal{P}} \Bigg \{p e^{\alpha-\beta p}\left (\frac{1}{2I_i} \sum_{t=t_i+1}^{t_i+2I_i}e^{\zeta_t}\right) \\ &\quad -\min_{y \in \mathcal{Y}} \bigg\{ \frac{1}{2I_i} \sum_{t=t_i+1}^{t_i+2I_i}\Big (h \big (y-e^{ \alpha- \beta p+\zeta_t}\big)^++b\big(e^{ \alpha- \beta p+\zeta_t}-y\big)^+\Big)\bigg\} \Bigg \}. \end{aligned} $$
Note that when \((\alpha , \beta )=(\hat \alpha _{i+1}, \hat \beta _{i+1})\) and ζt = ηt for t = ti + 1, …, ti + 2Ii, problem B3 reduces to problem DD in the DDA algorithm. Thus, problem B3 serves as a bridge between problem B2 and problem DD. We denote the optimal price of problem B3 by \(\breve p_{i+1} \big ((\alpha , \beta ), \zeta _{t=t_i+1}^{t_i+I_i}, \zeta _{t=t_i+I_i+1}^{t_i+2I_i} \big )\) and its optimal order-up-to level, for given price p, by \(\breve y_{i+1}\big (e^{\alpha -\beta p}, \zeta _{t=t_i+1}^{t_i+I_i}, \zeta _{t=t_i+I_i+1}^{t_i+2I_i}\big )\).

Based on their definitions, problem CI, bridging problems B1–B3, and problem DD require less and less information about the demand process. Problem CI has complete information about both λ(⋅) and the distribution of 𝜖; problem B1 does not know λ(⋅) but knows the distribution of 𝜖; problem B2 does not know either λ(⋅) or the distribution of 𝜖 but has access to true samples of 𝜖; problems B3 and DD do not have true samples and have to use biased samples. Chen et al. (2019) prove convergence for each pair of adjacent problems, and eventually establish convergence of problem DD to problem CI.

3 Nonparametric Learning for Lost-Sales System

In contrast to Sect. 12.2, which considers backlogged demand, in this section we consider lost sales and censored demand. This scenario arises when, in case of a stockout, rejected customers leave the store without purchasing. These customers cannot be observed by the retailer, and demand data is thus truncated by inventory levels. We will discuss the model, algorithms, and regret convergence results based on Chen et al. (2021a, 2020b).

Consider the additive demand model Dt(pt) = λ(pt) + 𝜖t with λ(⋅) being a non-increasing deterministic function and 𝜖t, t = 1, 2, …, T, being i.i.d. random variables with \(\mathbb {E}[\epsilon _t]=0\). We denote the CDF of 𝜖t by F(⋅), which is assumed to be continuous and differentiable, the PDF by f(⋅) with f(𝜖t) < ∞ for any 𝜖t, and the standard deviation of 𝜖t by σ. For notational convenience, we use 𝜖t and 𝜖 interchangeably because of the i.i.d. assumption. Demands are satisfied as much as possible by on-hand inventory, and unsatisfied demands are lost and unobservable. The system dynamics are xt+1 = (yt − Dt(pt))+. The instantaneous reward for period t is \(p_t \mathbb {E}[\min \{y_t, D_t(p_t)\}] - b \mathbb {E} [D_t(p_t)-y_t]^+ -h \mathbb {E} [y_t- D_t(p_t)]^+=p_t \mathbb {E}[D_t(p_t)] - (b+p_t) \mathbb {E} [D_t(p_t)-y_t]^+ -h \mathbb {E} [y_t- D_t(p_t)]^+\).

The firm knows neither the function λ(pt) nor the distribution of the random term 𝜖t a priori, which must be learned from censored demands collected over time while maximizing the cumulative profit. In this system, demand is censored, therefore, \(o_t=\min \{D_t(p_t), y_t\}\). For an admissible policy, (pt, yt) is adapted to the filtration generated by \(\left \{ (p_s, y_s), \min \left \{D_s(p_s), y_s \right \}: s=1,\ldots , t-1\right \}\) under censored demand.

If the underlying demand-price function λ(p) and the distribution of the error term 𝜖t were known a priori, the clairvoyant optimal policy for this problem would be a myopic policy (see Sobel 1981). Define the single-period problem by

$$\displaystyle \begin{aligned} \begin{array}{rcl} Q(p,y) = p \mathbb{E}[D_1(p)] - (b+p) \mathbb{E} [D_1(p)-y]^+ -h \mathbb{E} [y-D_1(p)]^+. \end{array} \end{aligned} $$

To find the optimal pricing and inventory decisions, it suffices to maximize the single-period profit Q(p, y), which can be expressed as

$$\displaystyle \begin{aligned} & \max_{p, y} \left \{p \mathbb{E}[D_1(p)] - (b+p) \mathbb{E} [D_1(p)-y]^+ -h \mathbb{E} [y-D_1(p)]^+ \right \} \\ &\quad = \max_{p} \left\{ p \lambda(p)- \min_{y} \left \{ (b+p) \mathbb{E} \left [\lambda(p)+\epsilon-y\right]^+ +h \mathbb{E} \left[y-\lambda(p)-\epsilon \right]^+ \right \} \right\}. \end{aligned} $$

Hence, we rewrite the clairvoyant problem as

$$\displaystyle \begin{aligned} \mathbf{Problem\ CI\!:} \quad \max_{p \in \mathcal{P}} \left\{ p \lambda(p)- \min_{y \in \mathcal{Y}} \left\{ (b+p) \mathbb{E} \left[\lambda(p)+\epsilon-y\right]^+ +h \mathbb{E} \left[y-\lambda(p)-\epsilon \right]^+ \right\} \right\}. \end{aligned} $$
This problem was first studied in Chen et al. (2021a), whose learning method and result will be briefly reviewed. Then we shift our focus to Chen et al. (2020b), which studies the problem in a more general setting and improves the convergence rate in Chen et al. (2021a).

3.1 Algorithms and Results in Chen et al. (2021a)

Chen et al. (2021a) assume G(⋅) is concave and λ(p) is differentiable to a high order. They provide a spline approximation based learning algorithm (SALA) under an exploration-exploitation framework.

Algorithm for Concave G(⋅)

The learning algorithm follows an exploration-exploitation framework and is based on spline approximation.

We now formally describe how a spline approximation for the demand-price function λ(⋅) is constructed. Before doing that, we first present a high-level view of the approximation method.

Spline approximation needs two integer inputs, m > 0 and l > 0, and it requires the specification of knots, basis functions, and coefficients. Knots, denoted as wi, i = 1, …, 2m + l, are equally spaced price points on the whole price interval, and there are 2m + l of them in total. The more knots a model has, the more observations of λ(⋅) the model uses for estimation, which in general leads to a more accurate spline approximation. Let \(\mathcal {L}\lambda (p)\) denote the spline approximation operator of a deterministic function λ(p); it can be represented as

$$\displaystyle \begin{aligned} \begin{array}{rcl} {} \mathcal{L}\lambda(p)=\sum_{i=1}^{m+l}\gamma_i^{\lambda}\cdot N_i^m(p), \end{array} \end{aligned} $$
(12.7)

where \(N_i^m(p)\), i = 1, …, m + l, are the basis functions with coefficients \(\gamma _i^\lambda \). The basis function \(N_i^m(p)\) is a piecewise polynomial in p of degree at most m − 1 and is constructed based on knots wi, …, wi+m. The larger the m, the smoother the \(N_i^m(p)\) and \(\mathcal {L}\lambda (p)\). The coefficient \(\gamma _i^\lambda \) is computed based on some specific price points on [wi, wi+m] and the corresponding values of λ(p) at these price points. To be more specific, the price points used here include wi, …, wi+m and τi1, …, τim, which will be defined shortly in Algorithm 2. The detailed procedure of spline approximation is also presented in Algorithm 2.
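For concreteness, the following sketch evaluates an order-m B-spline basis function via the standard Cox-de Boor recursion; the knot vector is an illustrative choice, and scipy.interpolate.BSpline offers an equivalent, production-grade implementation.

```python
import numpy as np

def basis(i, m, knots, p):
    """Order-m B-spline basis N_i^m(p), supported on (knots[i], knots[i+m])."""
    if m == 1:                                   # order 1: piecewise constant
        return 1.0 if knots[i] <= p < knots[i + 1] else 0.0
    left = right = 0.0
    if knots[i + m - 1] > knots[i]:
        left = ((p - knots[i]) / (knots[i + m - 1] - knots[i])
                * basis(i, m - 1, knots, p))
    if knots[i + m] > knots[i + 1]:
        right = ((knots[i + m] - p) / (knots[i + m] - knots[i + 1])
                 * basis(i + 1, m - 1, knots, p))
    return left + right

w = np.linspace(1.0, 2.0, 9)    # 2m + l equally spaced knots (here m = 3, l = 3)
print(basis(0, 3, w, 1.2))      # N_0^3 evaluated at price p = 1.2
```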

Algorithm 2 Constructing a Spline Approximation (SA)

It follows from Schumaker (2007) that for the basis function \(N_i^m(p)\), its (m − 2)-th order derivative exists and is continuous. Together with Theorem 4.9 in Schumaker (2007), one can verify that the basis function \(N_i^m(p)=0\) for p∉(wi, wi+m) and \(N_i^m(p)>0\) for p ∈ (wi, wi+m).

Given the detailed construction of the spline approximation, we are ready to present the main learning algorithm termed SALA in Algorithm 3.

Algorithm 3 Spline Approximation Based Learning Algorithm (SALA)

The learning algorithm SALA separates the planning horizon into a disjoint exploration phase and exploitation phase.

The algorithm specifies the parameters m and l for determining the density of the spline approximation, the parameter Δ for determining the grid size of the (sparse) discrete optimization problem, and the parameter L for determining the length of the exploration phase. Note that these parameters are chosen "optimally" to minimize the theoretical regret rate.

SALA then enters the exploration phase, with a total length of L(m + l)m periods, which is roughly on the order of \(\sqrt {T}\). The price space is discretized into equally spaced prices {τij} (which will also be used for constructing the spline approximation). For each i and j, SALA offers the price τij, together with the pre-specified target inventory level yt, for an equal number of periods. We note here that the high-level reason for the target inventory level yt to be on the order of \(\log L\log \log L\) is to ensure that the bias caused by demand censoring is appropriately bounded.

SALA leverages the sales collected at the prices {τij} to carry out an empirical spline approximation \(\hat {\lambda }(p)\) of the true demand-price function λ(p). Also, SALA computes the so-called residual errors ηt, which are used to approximate the random errors 𝜖t. It is important to note that \(\hat {\lambda }(p)\) is constructed based on sales (i.e., censored demand) and, therefore, suffers a bias in estimating λ(p), which must be quantified in the regret analysis. Similarly, due to demand censoring, ηt is a biased representation of 𝜖t, and this bias must also be quantified.

SALA essentially treats the empirical spline approximation \(\hat {\lambda }(p)\) as the true demand-price function λ(p) and the residual error ηt as the true random error 𝜖t, and constructs the corresponding sample average approximation (SAA) based surrogate optimization problem. Note that the surrogate optimization problem is solved sequentially: the inner problem is to find the optimal inventory target level for a given price, while the outer problem is to find the optimal price on the grid. The inner problem is convex in the inventory target level, which can be efficiently solved using first-order methods, whereas the outer problem is a one-dimensional discretized problem but solved on a sparse grid.
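A minimal sketch of this surrogate problem is given below, assuming the spline estimate \(\hat\lambda\) and the residual errors ηt are already available: the inner newsvendor step reduces to an empirical (b + p)∕(b + p + h) fractile, and the outer step scans the sparse price grid.

```python
import numpy as np

def solve_surrogate(lam_hat, eta, price_grid, h=1.0, b=2.0):
    """SAA surrogate: inner empirical newsvendor, outer grid search over prices."""
    best = None
    for p in price_grid:
        z = np.quantile(eta, (b + p) / (b + p + h))   # empirical critical fractile
        y = lam_hat(p) + z                            # inventory target for price p
        d = lam_hat(p) + eta                          # SAA demand scenarios
        profit = (p * d - (b + p) * np.maximum(d - y, 0.0)
                        - h * np.maximum(y - d, 0.0)).mean()
        if best is None or profit > best[0]:
            best = (profit, p, y)
    return best   # (profit estimate, price, target level)
```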

Finally, SALA completes the exploration phase and enters the exploitation phase. For the remaining planning horizon, SALA implements the optimal price and target inventory level suggested by the (sampled) surrogate optimization problem. Note that the length of the exploitation phase is T − L(m + l)m, which is roughly on the order of \(T-\sqrt {T}\).

Regret Convergence

Chen et al. (2021a) show that the convergence of the spline approximation can be bounded as \(\mathbb {P} \Big \{ \|\lambda (p)-\hat \lambda (p) \|{ }_{\infty } \le C_2 T^{-{1}/{4}} \Big \}>1-T^{-2}\) and \(\mathbb {P} \Big \{ \|\lambda '(p)-\hat \lambda '(p) \|{ }_{\infty } \le C_2 T^{-{1}/{4}} \Big \}>1-T^{-2}\) for some constant C2 > 0 and any \(p \in \mathcal {P}\). Moreover, the convergence of the error estimation is shown as \(\mathbb {P} \bigg \{ \Big | \mathbb {E}[\epsilon -z^*(p)]^+ -\frac {1}{L(m+l)m} \sum _{t=1}^{L(m+l)m}(\eta _t -\hat z(p))^+ \Big | \le C_3 T^{-{1}/{4}} \bigg \} >1-10T^{-2}\), where \(z^*(p)=F^{-1}\Big (\frac {b+p}{b+p+h}\Big )\) and \(\hat z(p)\) is its empirical counterpart computed from the residual errors, for some constant C3 > 0 and any \(p \in \mathcal {P}\). Based on these results, the regret of SALA is upper bounded as \({R}^{SALA}(T) \le C_4 T^{{1}/{2}+\varepsilon } (\log T)^{3} \log \log T\), where \(\varepsilon ={1}/\sqrt [3]{\log T}+0.25/\sqrt {\log T}\) and constant C4 > 0. Note that for any constant c > 0, one has \(\log \log T/\log T <\varepsilon < c\) (or equivalently, \(\log T < T^{\varepsilon } < T^{c}\)) for large enough T. Since the regret lower bound for this problem is Ω(T1∕2), the SALA algorithm matches the lower bound up to a factor of Tε.

3.2 Algorithms and Results in Chen et al. (2020b)

Chen et al. (2020b) consider both concave and non-concave G(⋅), provide learning algorithms for the two scenarios, and show that the convergence rates of both algorithms match the theoretical lower bounds, respectively.

3.2.1 Concave G(⋅)

In this section, we discuss the scenario with concave G(⋅).

Algorithm for Concave G(⋅)

A different algorithm is proposed in Chen et al. (2020b) for concave G(⋅), which approaches the optimal y using bisection and the optimal p using trisection. The detailed algorithm is presented in Algorithms 4, 5, and 6.

Algorithm 4 Bisection search for order-up-to level y

Algorithm 5 Estimation of reward (G(⋅)) differences at p < p′

Algorithm 6 The main algorithm: trisection search on prices

With the SearchOrderUpTo routine in Algorithm 4, for every price \(p\in [ \underline p,\overline p]\) one can estimate, using relatively few selling periods, a near-optimal order-up-to level \(\hat y_n\) such that \(Q(p,\hat y_n)\approx Q(p,y^*(p))=G(p)\), where y∗(p) is the optimal inventory level under price p. It is tempting to apply a similar search strategy to G(⋅), which is strongly concave in the price, in order to find the optimal price p∗; such strategies have been applied to pure pricing problems without inventory replenishment in the literature (Wang et al., 2014; Lei et al., 2014). This approach, however, encounters a major technical hurdle: neither the reward G(⋅) nor its derivative can be directly observed or even accurately estimated, due to demand censoring and the lost-sales component in the objective function.

In this section we present the key idea of Chen et al. (2020b) that overcomes this significant technical hurdle. The important observation is that, in a bisection or trisection search method, it is not necessary to estimate G(p) accurately. Instead, one only needs to estimate the difference of rewards G(p′) − G(p) at two prices p, p′ in order to decide how to progress, and this difference can be estimated accurately even in the presence of censored demands and lost sales. We sketch and summarize this idea below.

The Key Idea of Algorithm 5—“Difference Estimator”

Let p < p′ be two different prices and recall the definition \(G(p)=p\mathbb E[\min \{\lambda (p)+\varepsilon ,y^*(p)\}] - b\mathbb E[(\varepsilon +\lambda (p)-y^*(p))^+] - h\mathbb E[(y^*(p)-\lambda (p)-\varepsilon )^+]\). When y∗(p) is relatively accurately estimated (from the previous section and Algorithm 4), the only term in G(p) that cannot be directly observed without bias is the lost-sales penalty \(-b\mathbb E[(\varepsilon +\lambda (p)-y^*(p))^+]\). Hence, to estimate G(p′) − G(p) accurately, Chen et al. (2020b) only need to estimate the difference

$$\displaystyle \begin{aligned} \mathbb E[(\varepsilon+\lambda(p)-y^*(p))^+] - \mathbb E[(\varepsilon+\lambda(p')-y^*(p'))^+]. {} \end{aligned} $$
(12.13)

By the property of the newsvendor solution, y∗(p) = λ(p) + zp, where zp is such that \(F_\mu (z_p)=\int _{-\infty }^{z_p}f_\mu (u)\mathrm {d} u=\phi (p)=\frac {b+p}{b+p+h}\), and similarly \(y^*(p')=\lambda (p')+z_{p'}\) such that \(F_{\mu }(z_{p'})=\phi (p')=\frac {b+p'}{b+p'+h}\). Since p < p′, we have \(z_p<z_{p'}\). Equation (12.13) can subsequently be simplified to

$$\displaystyle \begin{aligned} \underbrace{\big(z_{p'}-z_p\big)\Pr[\varepsilon\geq z_{p}]}_{\text{Part A}} \; - \; \underbrace{\mathbb E\big[(z_{p'}-\varepsilon)^+\mathbf 1\{0\leq (z_{p'}-\varepsilon)^+\leq z_{p'}-z_p\}\big]}_{\text{Part B}}. {} \end{aligned} $$
(12.14)

For Part A of Eq. (12.14), the \(\Pr [\varepsilon \geq {z_{p}}]\) term has the closed-form, known formula \(\Pr [\varepsilon \geq {z_{p}}]=1-F_\mu ( { z_{p}})=1-\phi ( {p})=\frac {h}{b+ {p}+h}\). To estimate \(z_{p'}-z_p\), which is nonnegative, Chen et al. (2020b) use the following observation:

$$\displaystyle \begin{aligned} 1-\phi(p)=\frac{h}{b+p+h}=\Pr[\varepsilon\geq z_p] \overset{(*)}{=} \Pr[(z_{p'}-\varepsilon)^+\leq z_{p'}-z_p]. {} \end{aligned} $$
(12.15)

Here the crucial Eq. (*) holds because \(z_{p'}>z_p\), and, therefore, the event ε ≥ zp is equivalent to either \(\varepsilon >z_{p'}\) (for which \((z_{p'}-\varepsilon )^+\) is zero), or \(\varepsilon \leq z_{p'}\) and \(z_{p'}-\varepsilon \leq z_{p'}-z_p\). Furthermore, the random variable \((z_{p'}-\varepsilon )^+=(y^*(p')-\lambda (p')-\varepsilon )^+\) is (approximately) observable when y(p′) is estimated accurately, because this is the leftover inventory at ordering-up-to level y(p′) and posted price p′. Therefore, one can collect samples of \((z_{p'}-\varepsilon )^+\), construct an empirical cumulative distribution function (CDF) and infer the value of \(z_{p'}-z_p\) by inverting the empirical CDF at h∕(b + p + h). A similar approach can be taken to estimate Part B of Eq. (12.14), by plugging in the empirical distribution of the random variable \((z_{p'}-\varepsilon )^+\mathbf 1\{ 0\leq (z_{p'}-\varepsilon )^+\leq z_{p'}-z_p\}\).
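A minimal sketch of this inversion step is given below: given (approximately observed) leftover-inventory samples of \((z_{p'}-\varepsilon )^+\) collected at price p′, the gap \(z_{p'}-z_p\) is read off the empirical CDF at the known level h∕(b + p + h), per Eq. (12.15). The cost parameters are illustrative.

```python
import numpy as np

def estimate_gap(leftover_samples, p, h=1.0, b=2.0):
    """Estimate z_{p'} - z_p by inverting the empirical CDF of (z_{p'} - eps)^+."""
    target = h / (b + p + h)          # = Pr[eps >= z_p], known in closed form
    return np.quantile(leftover_samples, target)
```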

A pseudo-code description of the reward difference estimation routine is given in Algorithm 5. The design of Algorithm 5 roughly follows the key ideas demonstrated in the previous paragraph. The ot and δt random variables correspond to the censored demand and the leftover inventory at time period t, and the distribution of δt (or \(\delta _t^{\prime }\)) would be close to the distribution of (zpε)+ (or \((z_{p'}-\varepsilon )^+\)). Using the observation in Eq. (12.15), \(\hat u\) in Algorithm 5 would be a good estimate of \(z_{p'}-z_p\) by inverting the empirical CDFs.

As the last component and the main entry point of the algorithmic framework, Chen et al. (2020b) describe a trisection search method to localize the optimal price p∗ that maximizes G(⋅), based on the strong concavity of G(⋅) in p that is assumed for this scenario. The trisection principle for concave functions itself is not a new idea and has been explored in the existing literature on pure pricing problems without inventory replenishment (Lei et al., 2014; Wang et al., 2014). A significant difference, nevertheless, is that in this application the expected reward function G(⋅) cannot be observed directly (even up to centered additive noise) due to the presence of censored demands, and one must rely on the procedure described in the previous section to estimate the reward difference function ΔG(⋅, ⋅) instead. Below we describe the key idea of this component.

The Key Idea of Algorithm 6

Recall that \(G(p)=\max _{y\in [0,\bar y]}Q(p,y)\) and ΔG(p, p′) = G(p′) − G(p). A trisection search algorithm is used to locate \(p^*\in [ \underline p,\overline p]\) that maximizes G(⋅), under the assumption that G(⋅) is twice continuously differentiable and strongly concave in p. The algorithm starts with \(I_0=[ \underline p,\overline p]\) and, after each epoch ζ, shrinks the interval to 2∕3 of its length without discarding the optimal price p∗, with high probability. Suppose at epoch ζ the interval Iζ = [Lζ, Uζ] includes p∗, and let αζ, βζ be the trisection points of Iζ. Depending on the location of p∗ relative to αζ and βζ, the updated, shrunken interval Iζ+1 = [Lζ+1, Uζ+1] can be computed. The above discussion shows that the trisection updates can be carried out by simply determining the sign of ΔG(αζ, βζ). A complete pseudo-code description of the procedure is given in Algorithm 6.
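The sketch below captures the trisection loop, assuming a hypothetical oracle delta_G(a, b) that returns an estimate of G(b) − G(a) (the role of Algorithm 5); the epoch count is an illustrative stopping rule in place of the paper's confidence-based schedule.

```python
def trisect(p_lo, p_hi, delta_G, epochs=20):
    """Trisection search for the maximizer of a concave G on [p_lo, p_hi]."""
    L, U = p_lo, p_hi
    for _ in range(epochs):
        a = L + (U - L) / 3.0            # trisection points alpha, beta
        b = L + 2.0 * (U - L) / 3.0
        if delta_G(a, b) >= 0:           # G(b) >= G(a): p* cannot lie in [L, a)
            L = a
        else:                            # G(a) > G(b): p* cannot lie in (b, U]
            U = b
        # each epoch keeps p* and shrinks the interval to 2/3 of its length
    return (L + U) / 2.0
```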

Regret Convergence for Concave G(⋅)

The regret rate of the algorithm for concave G(⋅) is upper bounded as \(R(T) \leq O\left (T^{1/2} (\ln T)^{2} \right )\) with probability 1 − O(T−1). This upper bound almost matches the theoretical lower bound of Ω(T1∕2).

3.2.2 Non-Concave G(⋅)

In this section, we discuss the scenario with non-concave G(⋅).

Algorithm for Non-Concave G(⋅)

For non-concave G(⋅), Chen et al. (2020b) still rely on bisection to search for the optimal y, but for p, the previous trisection framework can no longer be applied due to the loss of concavity. They design an active tournament algorithm based on the difference estimator to search for the optimal p.

Key idea 1: discretization. The price interval \([ \underline p,\overline p]\) is first partitioned into J evenly spaced points {p(j)}j ∈ [J], with J = ⌈T1∕5⌉. Because G(⋅) is twice continuously differentiable (implied by the first condition in Chen et al. (2020b)) and \(p^*\in ( \underline p,\overline p)\), there exists \(p_{j^*}\) for some j∗∈ [J] such that \(G(p^*)-G(p_{j^*})\leq O(|p^*-p_{j^*}|{ }^2) \leq O(J^{-2}) = O(T^{-2/5})\), because G′(p∗) = 0. The problem then reduces to a multi-armed bandit problem over the J arms {pj}j ∈ [J], with the important difference that the actual reward of each arm is not directly observable due to the censored demands.

Key idea 2: active elimination with tournaments. With the sub-routines developed in Algorithms 4 and 5 in the previous section, we can in principle estimate the reward difference ΔG(p, p′) at two prices p < p′ up to an error on the order of \(\widetilde O(1/\sqrt {n})\), with ≈ 2n review periods for each price and without incurring large regret. In Algorithm 6, we successfully applied this “pairwise comparison” oracle in a trisection approach to utilize the concavity of G(⋅). Without concavity of G(⋅), we are going to use an active elimination with tournaments approach to find the price with the highest rewards in {pj}j ∈ [J].

More specifically, consider epochs γ = 1, 2, ⋯ with geometrically increasing sample sizes nγ implied by geometrically decreasing accuracy levels \(\Delta_\gamma = 2^{-\gamma}\). At the beginning of each epoch γ, the algorithm maintains an "active set" \(\mathcal S_\gamma \subseteq [J]\) of prices such that for all \(p\in \mathcal S_\gamma \), \(G(p_{j^*})-G(p)\leq \Delta _\gamma \), where \(\Delta _\gamma =\widetilde O(1/\sqrt {n_\gamma })\). Chen et al. (2020b) use a "tournament" approach to eliminate prices in \(\mathcal S_\gamma \) that have large sub-optimality gaps. In particular, all prices in \(\mathcal S_\gamma \) are formed into pairs, and each pair is allocated nγ samples to either eliminate the inferior price in the pair or combine both prices into one and advance to the next round of the tournament. The tournament ends once there is only one price left, \(\hat p_\gamma \). Afterwards, a separate elimination procedure is invoked to retain all other prices that are close to \(\hat p_\gamma \) in terms of performance. A detailed algorithm for non-concave G(⋅) is presented in Algorithm 7.
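A simplified sketch of one tournament round is given below; it keeps near-ties instead of merging paired prices, which is a simplification of the paper's combine-and-advance rule, and again assumes a hypothetical difference-estimation oracle delta_G.

```python
def tournament_epoch(active, delta_G, Delta_gamma):
    """One elimination round: drop the clearly inferior price in each pair."""
    survivors = []
    for j in range(0, len(active) - 1, 2):
        p, q = active[j], active[j + 1]
        gap = delta_G(p, q)              # estimate of G(q) - G(p)
        if gap > Delta_gamma:            # q clearly better: eliminate p
            survivors.append(q)
        elif gap < -Delta_gamma:         # p clearly better: eliminate q
            survivors.append(p)
        else:                            # within accuracy level: keep both
            survivors.extend([p, q])
    if len(active) % 2 == 1:             # odd price advances automatically
        survivors.append(active[-1])
    return survivors
```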

Algorithm 7 A discretization + tournament approach with non-concave G(⋅)

Regret Convergence for Non-Concave G(⋅)

The regret convergence rate for non-concave G(⋅) is upper bounded as \(R(T) \leq O\left (T^{3/5} (\ln T)^2\right )\) with probability 1 − O(T−1). Chen et al. (2020b) then prove a lower bound for non-concave G(⋅) and show that the upper bound matches it. They prove that there exists a problem instance such that, for any learning-while-doing policy π and the sequential decisions \(\{p_t,y_t\}_{t=1}^T\) the policy π produces, it holds for sufficiently large T that \(\sup _{\lambda } \mathbb E\left [V^*-\sum _{t=1}^T Q(p_t,y_t)\right ] \geq C_5\times {T^{3/5}}/{\ln T}\) for some constant C5 > 0. The lower bound is established by a novel information-theoretic argument based on the generalized squared Hellinger distance, which is significantly different from conventional arguments based on the Kullback-Leibler divergence.

4 Parametric Learning with Limited Price Changes

Models discussed in Sects. 12.2 and 12.3 assume that the price can be adjusted at the beginning of every period. In practice, however, retailers may hesitate to change prices too frequently. Cheung et al. (2017) discussed several practical reasons for not allowing frequent price changes, including customers' negative responses (e.g., frequent changes may cause confusion and hurt the seller's brand reputation) and the cost associated with such changes (e.g., the cost of changing price labels in brick-and-mortar stores). In this section, we introduce a constraint that allows the retailer to change prices at most a certain number of times. Clearly, such a constraint limits the firm's ability to learn the demand.

Demand in period t, t ∈{1, 2, …, T}, is random and depends on the selling price pt, and its distribution function belongs to some family parameterized by \({ \mathbf {z}} \in {\mathcal {Z}} \subset {\mathbb R}^{k}, k \ge 1\), where \({\mathcal {Z}}\) is a compact and convex set. Let Dt(pt, z) be the demand in period t with probability mass function f(⋅;pt, z), cumulative distribution function F(⋅;pt, z), and support {dl, dl + 1, …, dh} with dl being a nonnegative integer and dh ≤ +∞, and let dt denote the realization of Dt(pt, z). The firm knows f(⋅;pt, z) up to the parameter vector z, which has to be learned from sales data.

Chen and Chao (2019) consider the backlog system and Chen et al. (2020a) consider the lost-sales system with censored demand. This section will be mainly devoted to discussing the algorithms and results in Chen et al. (2020a), where the firm can only observe sales data but not the actual demand when a stockout occurs. Therefore, \(o_t=\min \{D_t(p_t, \mathbf {z}), y_t\}\), and (pt, yt) is adapted to the filtration generated by \(\left \{ (p_s, y_s), o_s: s=1,\ldots , t-1\right \}\) under censored demand. Let \(p_t \in \mathcal {P}=[p^l, p^h]\) and \(y_t\in \mathcal {Y}=\{y^l, y^l+1, \ldots , y^h \}\), where the bounds 0 ≤ pl ≤ ph < +∞ and 0 ≤ yl ≤ yh < +∞ are known. Assume for any \(p_t \in \mathcal {P}\) it holds that \(\mathbb {E}[D_t(p_t, \mathbf {z})]>0\). The state transition is xt+1 = (yt − Dt(pt, z))+.

The expected total profit over the planning horizon, given an admissible policy ϕ = ((p1, y1), (p2, y2), …, (pT, yT)), is

$$\displaystyle \begin{aligned} \begin{array}{rcl} V^\phi(T, \mathbf{z}) & =&\displaystyle \sum_{t=1}^{T} \Big\{ p_t \mathbb{E}[\min \{ D_t(p_t, \mathbf{z}), y_t \}] {}\\ & &\displaystyle -\ \left \{h \mathbb{E}\left [y_t-D_t(p_t, \mathbf{z})\right]^++b\mathbb{E}\left[D_t(p_t, \mathbf{z})-y_t \right]^+\right \}\Big\} \end{array} \end{aligned} $$
(12.16)

and the prices need to satisfy the limited price change constraint for some given integer m ≥ 1:

$$\displaystyle \begin{aligned} \begin{array}{rcl} {} \sum_{t=1}^{T-1} \mathbf{1}(p_t \neq p_{t+1}) \le m , \end{array} \end{aligned} $$
(12.17)

where 1(A) is the indicator function taking value 1 if statement A is true and 0 otherwise.

The single-period objective function is

$$\displaystyle \begin{aligned} \begin{array}{rcl} G(p,y, \mathbf{z}){=}p \mathbb{E}[D(p, \mathbf{z})]{-}h \mathbb{E}\left [y-D(p, \mathbf{z})\right]^+{-}(b+p)\mathbb{E}\left[D(p, \mathbf{z}){-}y\right]^+,\qquad {} \end{array} \end{aligned} $$
(12.18)

where D(p, z) is a generic random demand when the true parameter is z and the price is \(p \in \mathcal {P}\). For the underlying system parameter vector z, let (p∗, y∗) be a maximizer of G(p, y, z). If z were known, then the firm could set (p∗, y∗) in every period without changing the price; this is the clairvoyant solution, whose T-period total profit is denoted by V∗.

Demand models are categorized into two groups: (1) the well-separated case and (2) the general case. Two probability mass functions are said to be identifiable if they are not identical.

4.1 Well-Separated Demand

The family of distributions \(\{ f(\cdot ; p, z): z \in \mathcal {Z}\}\) is called well-separated if for any \(p \in \mathcal {P}\), the class of probability mass functions \(\{ f(\cdot ; p, z): z \in \mathcal {Z}\}\) is identifiable, i.e., f(⋅;p, z1) ≠ f(⋅;p, z2) for \(z_1 \neq z_2 \in \mathcal {Z}\).

If a family of distributions is well-separated, then no matter what selling price p is charged, the sales data will allow the firm to learn about the parameter z. Thus, in the well-separated case, exploration comes as a side benefit of exploitation, and no active price exploration is necessary.

Chen et al. (2020a) consider two scenarios of limited-price constraint for well-separated demand. The first scenario is that the number of price changes is restricted to be no more than a given integer m ≥ 1 that is independent of the length of planning horizon T, while for the second scenario, the number of allowed price changes is at most \(\beta \log T\) for the T-period problem for some constant β > 0.

Algorithm for m Price Changes Under Well-Separated Demand

The main idea of the algorithm is to estimate the unknown parameter z by maximum likelihood estimation based on censored demand. The detailed algorithm is presented in Algorithm 8.

Algorithm 8 m price changes for the well-separated case

As shown in Algorithm 8, exploration in the inventory space is needed. If \(\hat y_i\) equals dl, then implementing \(\hat y_i\) will not yield any information about the demand. Hence the algorithm imposes \(\tilde y_i=\hat y_i+\Delta \), which ensures that some demand information is revealed with positive probability. The algorithm then constructs an MLE estimator using the censored data \(\min \{d_t, y_t\}\), which are neither independent nor identically distributed. This is because the inventory level yt depends on the carryover inventory xt, which is a function of earlier inventory levels and demands, and earlier demands depend on the pricing decisions. Assumption 1(i) in the paper guarantees that, with high probability (its complement has a probability decaying exponentially fast in Ii), the MLE objective function is strictly concave, so there exists a unique global maximizer.
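The sketch below illustrates the censored-data likelihood: an observation ot = min{dt, yt} contributes the pmf value f(ot; pt, z) when ot < yt and the tail probability Pr[D ≥ yt] when the period stocked out. The Poisson demand with mean z1 − z2p is purely an illustrative parameterization, not the family used in Chen et al. (2020a).

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import poisson

def neg_log_lik(z, prices, obs, levels):
    """Negative log-likelihood of censored sales under Poisson(z1 - z2 * p) demand."""
    z1, z2 = z
    rates = np.maximum(z1 - z2 * prices, 1e-8)
    ll = np.where(obs < levels,
                  poisson.logpmf(obs, rates),        # exact demand observed
                  poisson.logsf(levels - 1, rates))  # stockout: only D >= y_t known
    return -ll.sum()

# Illustrative usage with synthetic data arrays (prices, obs, levels):
# z_hat = minimize(neg_log_lik, x0=np.array([8.0, 1.0]),
#                  args=(prices, obs, levels), method="Nelder-Mead").x
```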

Regret Convergence for m Price Changes Under Well-Separated Demand

Chen et al. (2020a) provide both regret upper and lower bounds for well-separated demand with m price changes. The regret upper bound is \(R(T) \le C_{6} \; T^{\frac {1}{m+1}}\) for some constant C6 > 0. The lower bound is as follows: there exist problem instances such that the regret of any admissible learning algorithm that changes the price at most m times is lower bounded by \(R(T) \ge C_{7} \; T^{\frac {1}{m+1}}\) for some constant C7 > 0 and large enough T.

One fundamental challenge to prove this lower bound is that the times of price changes are dynamically determined, i.e., they are increasing random stopping times. An adversarial parameter class is constructed, among which a policy needs to identify the true parameter. The parameter class is constructed in a hierarchical manner such that when going further down the hierarchy the parameters are harder to distinguish. A delicate information-theoretical argument is employed to prove the lower bound. Here we only illustrate the high-level idea using a special case m = 2.

Chen et al. (2020a) construct a problem instance in which the inventory order-up-to level for each period is fixed and high enough so that any realization of the demand can be satisfied under any price. Therefore, the effect of lost sales and censored data is eliminated and the original joint pricing and inventory control problem is reduced to a dynamic pricing problem with fixed inventory control strategies. Suppose the demand follows a Bernoulli distribution with a single unknown parameter z ∈ [0, 1].

Let (p0, p1, p2) be the m + 1 = 3 different prices of a policy π and (T0, T1, T2) be the numbers of time periods each price is committed to, with T2 = T − T0 − T1. The paper constructs an adversarial parameter class consisting of 2m+1 = 8 parameters, among which policy π needs to identify the true parameter. These parameters are constructed in a hierarchical way. The 8 parameters are first partitioned into two 4-parameter groups, with the parameters in each group being close to each other and the two groups being about 1∕4 apart. Each 4-parameter group is then divided into two 2-parameter groups, with a distance of T−1∕6 between them. Within each 2-parameter group, the two parameters are T−1∕3 apart. A policy needs to work down the hierarchy to locate the true parameter, and the further it works down, the harder it becomes to differentiate between groups/parameters.

The proof first shows the tradeoff in deciding (p0, T0) at the first hierarchy level. Assume without loss of generality that z resides in the first branch of the tree. Because policy π does not have any observations when deciding p0, there is a constant probability that p0 is selected to favor the other branch. This risk implies that T0 cannot exceed O(T1∕3), because otherwise the regret accumulated during T0 alone would imply an Ω(T1∕3) regret.

If T0 is upper bounded by O(T1∕3), the tradeoff in deciding (p1, T1) is as follows. With so few demand observations during T0, policy π will not be able to distinguish the groups on the second level. Therefore, assuming the true z resides in the first group, it can be shown that p1 is selected to favor the wrong (second) group with a constant probability. Given this risk and the fact that the parameters of the first and second groups are separated by T−1∕6, T1 cannot exceed O(T2∕3), because otherwise the regret would again be Ω(T1∕3). The same argument then carries over to the third level when deciding p2. Summing up the regrets from all three levels shows that the total regret of policy π cannot be better than Ω(T1∕3).

In practice, it may happen that T is not clearly specified at the beginning. The firm requires that the price not change too often, but it usually allows more price changes over a longer planning horizon. Chen et al. (2020a) propose a learning algorithm where the number of price changes is restricted to \(\beta \log T\) for some constant β > 0.

Algorithm for \(\beta \log T\) Price Changes Under Well-Separated Demand

The algorithm runs very similarly to the one for m price changes, except that now the number of periods in iteration i is given by \(I_i=\left \lceil I_0 v^{i}\right \rceil \), i = 1, 2, …, N, and there is a total of \(N=\mathcal {O}(\log T)\) iterations.

Regret Convergence for \(\beta \log T\) Price Changes Under Well-Separated Demand

The regret of the algorithm with fewer than \(\beta \log T\) price changes is upper bounded as \(R(T) \le C_{8} \log T\) for a constant C8 > 0 and large enough T. The lower bound is also provided: there exist problem instances such that the regret of any learning algorithm satisfies \(R(T) \ge C_{9} \log T\) for some constant C9 > 0 and T ≥ 1.

4.2 General Demand

Now we consider the more general case in which the parameter of the probability mass function f(⋅;p, z) is a k-dimensional vector, i.e., \(\mathbf {z}=(z_1, \ldots , z_k) \in {\mathcal {Z}} \subset \mathbb {R}^k\) for some integer k ≥ 1. For a set of given prices \(\mathbf {p}=(p_1, \ldots , p_k) \in \mathcal {P}^k\) and corresponding realized demands d = (d1, …, dk) ∈{dl, dl + 1, …, dh}k, define

$$\displaystyle \begin{aligned} \begin{array}{rcl} \mathcal{Q}^{\mathbf{p}, \mathbf{z}}(\mathbf{d}) =\prod_{j=1}^{k} f(d_j; p_j, \mathbf{z}). \end{array} \end{aligned} $$

The family of distributions \(\{ \mathcal {Q}^{ \mathbf {p}, \mathbf {z}}(\cdot ) : \mathbf {z} \in \mathcal {Z} \}\) is said to belong to the general case if there exist k price points \(\bar {\mathbf {p}}=(\bar p_1, \ldots , \bar p_k) \in \mathcal {P}^k\) such that the family of distributions \(\{ \mathcal {Q}^{\bar {\mathbf {p}}, \mathbf {z}}(\cdot ) : \mathbf {z} \in \mathcal {Z} \}\) is identifiable, i.e., \( \mathcal {Q} ^{\bar {\mathbf {p}}, {\mathbf {z}}_1}(\cdot ) \neq \mathcal {Q} ^{\bar {\mathbf {p}}, {\mathbf {z}}_2}(\cdot )\) for any z1 ≠ z2 in \(\mathcal {Z}\).

Suppose we are allowed to make up to m price changes during the planning horizon. We consider the case of m ≥ k in this section, as in the case of m < k no algorithm will be able to identify the k unknown parameters and, therefore, the regret would be linear in T.

Algorithm for General Demand

The algorithm follows an exploration-exploitation framework, and the unknown parameter vector z is estimated by MLE. Detailed algorithm is presented in Algorithm 9.

Algorithm 9 mk price changes for the general case

As shown in Algorithm 9, during the exploration phase the algorithm experiments with k prices (thus k − 1 price changes). Because of censored data, the true demand realizations exceeding the inventory level cannot be observed. To make sure sufficient demand data is received, every time a stockout occurs, the algorithm increases the order-up-to level by a certain percentage. Because dh may be infinite, this does not resolve the data censoring issue entirely, but it does so with high probability. In the MLE step, the sales data \(\min \{d_t, y_t \}\) are correlated and non-identically distributed, because the inventory levels yt are dependent through the "raising inventory" decisions as well as the carryover inventories. Propositions in Chen et al. (2020a) state that, despite the dependent data, the MLE possesses the desired properties. The empirical optimal solution is implemented for the rest of the planning horizon, resulting in k price changes.

Regret Convergence for General Demand

Chen et al. (2020a) provide the regret upper bound for the general demand case as follows: if the demand is unbounded (dh = +∞), then the regret is upper bounded by \(R(T) \le C_{10} T^{{1}/{2}} \log T\); if the demand is bounded (dh < +∞), then the regret is upper bounded by R(T) ≤ C10T1∕2, for some constant C10 > 0. The theoretical lower bound for this problem is Ω(T1∕2), which is established in Broder and Rusmevichientong (2012) for a dynamic pricing problem with infinite initial inventory.

5 Backlog System with Fixed Ordering Cost

In this section, we consider the presence of a fixed ordering cost, i.e., a fixed fee incurred by the firm whenever a positive amount of inventory is ordered.

Demand is modeled as D = D0(p) + β, where \(D_0:[0,1]\to [ \underline d_0,\overline d_0]\) is the (expected) demand function and β is the random noise with zero mean. Unsatisfied demands are backlogged. Chen et al. (2021b) consider both linear models and generalized linear models for D0(p) with unknown parameters θ0. The distribution of β is unknown in the nonparametric sense. Let k > 0 be the fixed ordering cost, c > 0 be the variable cost of ordering one unit of inventory, and \(h:\mathbb R\to \mathbb R^+\) be the holding cost (when the remaining inventory level is positive) or the backlogging cost (when the remaining inventory level is negative). The instantaneous reward for period t is

$$\displaystyle \begin{aligned} v_t = p_t\big(D_0(p_t)+\beta_t\big) - k\,\mathbf 1\{y_t > x_t\} - c\,(y_t-x_t) - h\big(y_t - D_0(p_t) - \beta_t\big), \end{aligned}$$

and the firm would like to maximize the T-period total reward.

With a known demand curve D0 and noise distribution μ0, Chen and Simchi-Levi (2004a) prove that, under mild conditions, for both the average and the discounted profit criteria there exists an (s, S, p) policy that is optimal in the long run. Under an (s, S, p) policy, the retailer only orders new inventory when xt < s, and after ordering raises the inventory level to yt = S. The function p prescribes the pricing decision, which depends on the initial inventory level of the same period.

The performance of a particular (s, S, p) policy can be evaluated as follows. Define H0(x, p;μ) as the expected immediate reward of pricing decision p at inventory level x, without ordering new inventories. It is easy to verify that

$$\displaystyle \begin{aligned} H_0(x,p;\mu) = -\mathbb E_\mu[h(x- D_0(p)-\beta)] + pD_0(p) - c D_0(p). {} \end{aligned} $$
(12.22)

For a certain (s, S, p) policy, define quantities I(s, x, p;μ) and M(s, x, p;μ) as follows:

$$\displaystyle \begin{aligned} I(s,x,\mathbf p;\mu) &= \left\{\begin{array}{ll} H_0(x,\mathbf p(x);\mu)+\mathbb E_\mu[I(s,x- D_0(\mathbf p(x))-\beta {,\mathbf p};\mu)],& x\geq s,\\ 0,& x<s;\end{array}\right. {} \end{aligned} $$
(12.23)
$$\displaystyle \begin{aligned} M(s,x,\mathbf p;\mu) &= \left\{\begin{array}{ll} 1+\mathbb E_\mu[M(s,x-D_0(\mathbf p(x))-\beta {,\mathbf p};\mu)],& x\geq s,\\ 0,& x<s;\end{array}\right.{} \end{aligned} $$
(12.24)

Define r(s, S, p;μ) as

$$\displaystyle \begin{aligned} r(s,S,\mathbf p;\mu) = \frac{-k+I(s,S,\mathbf p;\mu)}{M(s,S,\mathbf p;\mu)}. {} \end{aligned} $$
(12.25)

When I(s, S, p;μ0) and M(s, S, p;μ0) are bounded, Lemma 2 of Chen and Simchi-Levi (2004a) shows that \(\lim _{T\to \infty } R_T(\pi ) = r(s,S,\mathbf p;\mu _0)\).
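Because the recursions (12.23)-(12.24) are expectations over one replenishment cycle, r(s, S, p; μ) can be approximated by simulating cycles that start at S and end when the inventory falls below s. The sketch below does exactly this; the demand function, pricing rule, noise sampler, and cost function are all assumed inputs.

```python
import numpy as np

def r_hat(s, S, price, D0, noise, h, k=5.0, c=1.0, n_cycles=2000, seed=0):
    """Monte Carlo estimate of r(s, S, p; mu) = (-k + I(s,S,p)) / M(s,S,p)."""
    rng = np.random.default_rng(seed)
    I_sum = M_sum = 0.0
    for _ in range(n_cycles):
        x = S
        while x >= s:                    # one pass of the recursions (12.23)-(12.24)
            p = price(x)                 # pricing rule p(x)
            d = D0(p) + noise(rng)       # realized demand
            I_sum += p * D0(p) - c * D0(p) - h(x - d)   # sampled immediate reward H_0
            M_sum += 1.0
            x -= d                       # demand is backlogged
    return (-k + I_sum / n_cycles) / (M_sum / n_cycles)
```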

Learning Algorithm

The learning algorithm proposed in Chen et al. (2021b) is based on an (s, S, p)-policy with evolving inventory levels (s, S) and pricing strategies p. Because unsatisfied demands are backlogged, the decision maker can observe true demand realizations. A regularized least-squares estimation is used to estimate θ0, and a sample average approximation approach is used to construct an empirical distribution for β.

Next we present the detailed learning algorithm. For linear models, \(D_0(p)=\eta (p)^\top \theta _0\), where η(⋅) is a given feature mapping of the price, and the unknown parameter θ0 is estimated by (regularized) least squares, i.e., let

$$\displaystyle \begin{aligned} \hat\theta_{\mbox{LIN}} := \arg\min_{\theta \in \Theta} \sum_{t \in \mathcal H} \big(\eta(p_t)^\top \theta - d_t\big)^2 + \|\theta\|_2^2. {} \end{aligned} $$
(12.26)

For generalized linear models, \(D_0(p)=\upsilon (\eta (p)^\top \theta _0)\) for υ(⋅) a given link function. Let the unknown parameter θ0 be estimated by

$$\displaystyle \begin{aligned} \hat\theta_{\mbox{GLM}} := \arg\min_{\theta \in \Theta} \left\|\sum_{t \in \mathcal H} (\upsilon(\eta(p_t)^\top \theta) - d_t) \eta(p_t) \right\|{}_{\Lambda^{-1}} . {} \end{aligned} $$
(12.27)

Let b ∈{1, 2, ⋯ } be a particular epoch and \(\mathcal H_{b-1} = \mathcal B_1\cup \cdots \cup \mathcal B_{b-1}\) be the union of all epochs prior to b. For a time period \(t\in \mathcal H_{b-1}\), let pt be the advertised price and dt = D0(pt) + βt be the realized demand. Let the estimate \(\hat \theta _b\) of the unknown regression parameter θ0 be computed by (12.26) if demand is linear or by (12.27) if demand is generalized linear, given samples from \(\mathcal H_{b-1}\). Define \(\Lambda _b := I + \sum _{t\in \mathcal H_{b-1}}\eta (p_t)\eta (p_t)^\top \). For every p ∈ [0, 1], define Δb(p) as

$$\displaystyle \begin{aligned} \Delta_b(p) := \gamma\sqrt{\eta(p)^\top\Lambda_b^{-1}\eta(p)}, \end{aligned}$$

where γ > 0 is the oracle-specific parameter. We then define an upper estimate of D0, \(\bar D_b\), as

(12.28)

where \(\overline d_0, \underline d_0\) are the maximum and minimum demands, L is the Lipschitz constant, and υ is taken to be the identity map for linear models. Note that the Lipschitz continuity of η(p) and Λb ≽ I imply the continuity of Δb(⋅) in p, which further implies the continuity of \(\bar D_b(\cdot )\) in p.
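
The following sketch turns the estimate into the confidence width Δb and the upper estimate \(\bar D_b\), using the clipped plug-in form reconstructed in (12.28); the values of γ, L, and the demand bounds are illustrative, and `link` is the identity for linear models.

```python
import numpy as np

def make_ucb_demand(theta_hat, Lam, eta, gamma, L, d_min, d_max,
                    link=lambda z: z):
    """Return Delta_b(p) = gamma * sqrt(eta(p)^T Lam^{-1} eta(p)) and the
    clipped upper-confidence demand estimate bar-D_b(p) of (12.28)."""
    Lam_inv = np.linalg.inv(Lam)
    def Delta(p):
        f = eta(p)
        return gamma * float(np.sqrt(f @ Lam_inv @ f))
    def D_bar(p):
        pred = link(float(eta(p) @ theta_hat))   # plug-in demand prediction
        return float(np.clip(pred + L * Delta(p), d_min, d_max))
    return D_bar, Delta

# Continuing the hypothetical example above:
D_bar, Delta = make_ucb_demand(theta_hat, Lam, eta,
                               gamma=1.0, L=1.0, d_min=0.5, d_max=3.0)
```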

One key challenge in the learning-while-doing setting is that all of the important quantities H0, I, M, and r involve expectations evaluated under the noise distribution μ0, which is not known a priori. In this section, we give details on how empirical distributions are used to approximate μ0.

At the beginning of epoch b, let \(\mathcal E_{<b}\subseteq \mathcal B_1\cup \cdots \cup \mathcal B_{b-1}\) be a non-empty subset of historical selling periods used to approximate the noise distribution μ0. We define the empirical noise distribution \(\hat \mu _b\) as

$$\displaystyle \begin{aligned} \hat\mu_b := \frac{1}{|\mathcal E_{<b}|}\sum_{t\in\mathcal E_{<b}} \mathbb I\big[d_t - \bar D_{b(t)}(p_t)\big], {} \end{aligned} $$
(12.29)

where \(\mathbb I[\beta ']\) is the point mass at β′ and b(t) denotes the epoch to which selling period t belongs. Note that the samples \(\{d_t - \bar D_{b(t)}(p_t)\}_{t\in \mathcal E_{<b}}\) are dependent because both pt and \(\hat \theta _{b(t)}\) are dependent across periods. For technical reasons, \(\mathcal E_{<b}\) is not chosen to include all selling periods prior to epoch b. Instead, we construct \(\mathcal E_{<b}\) such that every \(t\in \mathcal E_{<b}\) has a small estimation error of D0 at the advertised price.

To further upper bound the deviation of \(H_0(x,p;\hat \mu _b)\) from H0(x, p;μ0), we need to demonstrate that the empirical distribution \(\hat \mu _b\) is close to the true noise distribution μ0. Because such deviations must include the estimation errors of D0 by \(\bar D_{b(t)}\) themselves, it is crucial to select time periods \(t\in \mathcal B_1\cup \cdots \cup \mathcal B_{b-1}\) during which the error Δb(t)(pt) is small. To this end, we define \(\mathcal E_{<b}\) as

$$\displaystyle \begin{aligned} \mathcal E_{<b} := \left\{t\in\mathcal B_1\cup\cdots\cup\mathcal B_{b-1}: \Delta_{b(t)}(p_t) \leq \kappa/\sqrt{b} \right\}, {} \end{aligned} $$
(12.30)

where κ > 0 is a scaling parameter of the algorithm; its exact value is specified in the paper and depends only logarithmically on T. As shown in the proof of the paper, this selection of κ leads to \(|\mathcal E_{< b}| \geq b/2\), meaning that the set is non-empty and, therefore, the definition in Eq. (12.30) is proper. The idea behind the construction of \(\mathcal E_{<b}\) in Eq. (12.30) is as follows. Note that \(d_t - \bar D_{b(t)}(p_t) = \beta _t + \big (D_0(p_t) - \bar D_{b(t)}(p_t)\big )\). While βt is the desired sample from the noise distribution, the term \(D_0(p_t) - \bar D_{b(t)}(p_t)\) is incurred due to the estimation error of \(\hat \theta _{b(t)}\), which may be very large. Also note that the absolute value of this estimation error is upper bounded by Δb(t)(pt). Constructing \(\mathcal E_{<b}\) as in Eq. (12.30) allows us to exploit only those selling periods during which the estimation errors are sufficiently small, which ensures that the obtained (approximate) noise samples are of high quality.
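
In code, the filtering rule (12.30) and the sample construction behind (12.29) amount to a one-line selection; the tuple layout of `history` below is a hypothetical convention for illustration.

```python
import math

def empirical_noise_samples(history, kappa, b):
    """Build the support of the empirical distribution hat-mu_b in (12.29):
    keep period t only if Delta_{b(t)}(p_t) <= kappa / sqrt(b), per (12.30),
    and use d_t - bar-D_{b(t)}(p_t) as its (approximate) noise sample.
    `history` holds one tuple (delta_at_pt, dbar_at_pt, d_t) per past period."""
    threshold = kappa / math.sqrt(b)
    return [d - dbar for (delta, dbar, d) in history if delta <= threshold]
```

The distribution \(\hat \mu _b\) is then the uniform distribution over the returned samples.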

With the upper-confidence bound \(\bar D_b\) and the approximate noise distribution \(\hat \mu _b\) constructed at the beginning of epoch b, Chen et al. (2021b) use the dynamic programming approach of Chen and Simchi-Levi (2004a) to obtain an approximately optimal strategy (sb, Sb, pb) to be carried out during epoch b.

First define an upper bound estimate \(\bar H_b(x,p;\hat \mu _b)\) on \(H_0(x,p;\hat \mu _b)\) as

$$\displaystyle \begin{aligned} \bar H_b(x,p;\hat\mu_b) := -\mathbb E_{\hat\mu_b}[h(x-\bar D_b(p)-\beta)] + p\bar D_b(p) - c\bar D_b(p) + (c+L')\Delta_b(p), {}\end{aligned} $$
(12.31)

where L′ is a constant defined in Assumption (A3) of the paper.

For any \(s\in [ \underline s,\overline s]\), \(S\in [ \underline S,\overline S]\), \(r\in \mathbb R\), demand function \(D:[0,1]\to [ \underline d,\infty )\), noise distribution μ and their associated \(H:\mathbb R\times [0,1]\to \mathbb R\), define

$$\displaystyle \begin{aligned} \begin{array}{rcl} &&\phi^{(s,S)}(x;D,r,\mu) \\ &&\quad := \left\{\begin{array}{ll} \sup_{p\in[0,1]} H(x,p;\mu) - r + \mathbb E_\mu[\phi^{(s,S)}(x- D(p)-\beta; D,r,\mu)],& x\geq s;\\ 0,& x<s.\end{array}\right. \\ {} \end{array} \end{aligned} $$
(12.32)

With \(D=\bar D_b\) and \(H=\bar H_b(\cdot ,\cdot ;\hat \mu _b)\), the functions \(\phi ^{(s,S)}(x;\bar D_b,r,\hat \mu _b)\) can be computed for every \(s\in [ \underline s,\overline s]\), \(S\in [ \underline S,\overline S]\) and \(r\in \mathbb R\), since both \(H(\cdot ,\cdot ;\hat \mu _b)\) and the expectation with respect to \(\hat \mu _b\) can be evaluated. For every (s, S), define

$$\displaystyle \begin{aligned} \bar r_b(s,S) := \inf\{r\in\mathbb R: \phi^{(s,S)}(S;\bar D_b,r,\hat\mu_b) = k\} {} \end{aligned} $$
(12.33)

and let the pricing strategy p (associated with inventory levels s, S) be the optimal solution to the dynamic program \(\phi ^{(s,S)}(\cdot ;\bar D_b,\bar r_b(s,S),\hat \mu _b)\); that is, p(x) attains the supremum in (12.32), so that \(\phi ^{(s,S)}(x;\bar D_b,\bar r_b(s,S),\hat \mu _b)=\bar H_b(x,\mathbf p(x);\hat \mu _b) - \bar r_b(s,S)+ \mathbb E_{\hat \mu _b}[\phi ^{(s,S)}(x-\bar D_b(\mathbf p(x))-\beta ;\bar D_b,\bar r_b(s,S),\hat \mu _b)]\) for all x ≥ s.

Comparing Eqs. (12.32)–(12.33) with Eqs. (12.22)–(12.25), it is easy to observe the connection between them. r(s, S, p;μ) in (12.25) represents the expected per-period profit, which accounts for both the immediate reward H and the fixed ordering cost k. On the other hand, \(\phi ^{(s,S)}(S;D,r,\mu )\) in (12.32) accumulates the immediate reward H over time and subtracts a constant r every period. If the constant r in (12.32) equals the expected per-period profit involving both H and k, intuitively one would expect \(\phi ^{(s,S)}(S;D,r,\mu )\) to equal k. Lemma 3 of Chen and Simchi-Levi (2004b) confirms this intuition, showing that \(\phi ^{(s,S)}(S;D,r(s,S),\mu ) = k\), where \(r(s,S) = \sup _{\mathbf p} r(s,S,\mathbf p;\mu )\). Therefore, \(\bar r_b(s,S)\) can be regarded as an empirical approximation of r(s, S).

We finally remark that, in practice, one may discretize the choices of s, S, x, and p in the dynamic programming scheme described above with granularity \(T^{-1}\). This leads to a computationally efficient algorithm. On the other hand, by the Lipschitz property of \(\overline {H}_b(\cdot , \cdot ; \hat {\mu }_b)\), it can be shown that the error caused by discretization is at most \(O(T^{-1})\), which does not affect the order of the overall regret.
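
As a concrete rendering of this discretized scheme, the sketch below tabulates \(\phi ^{(s,S)}\) of (12.32) on an inventory grid, assuming every realized demand \(\bar D_b(p)+\beta \) is strictly positive so the recursion always moves strictly downward, and then bisects on r to find \(\bar r_b(s,S)\) of (12.33). The grid resolution, price grid, bisection bounds, and tolerance are all illustrative choices.

```python
import numpy as np

def phi_at_S(s, S, prices, D_bar, H_bar, noise, r, dx=0.01):
    """Tabulate phi^{(s,S)}(x; D_bar, r, mu_hat) of (12.32) on an x-grid,
    taking expectations over the empirical noise samples `noise`, and
    return its value at x = S."""
    noise = np.asarray(noise)
    x_min = s - max(D_bar(p) for p in prices) - noise.max() - dx
    grid = np.arange(x_min, S + dx, dx)
    phi = np.zeros_like(grid)                      # phi(x) = 0 for x < s
    for i, x in enumerate(grid):
        if x < s:
            continue
        best = -np.inf
        for p in prices:
            nxt = x - D_bar(p) - noise             # next inventory levels
            idx = np.clip(((nxt - x_min) / dx).astype(int), 0, i - 1)
            best = max(best, H_bar(x, p) - r + phi[idx].mean())
        phi[i] = best
    return phi[-1]

def r_bar(s, S, prices, D_bar, H_bar, noise, k, lo=-10.0, hi=10.0, tol=1e-4):
    """Bisection for bar-r_b(s, S) in (12.33): phi(S; r) is decreasing in r,
    so we search for the r at which it crosses the fixed ordering cost k."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if phi_at_S(s, S, prices, D_bar, H_bar, noise, mid) > k:
            lo = mid                               # r too small
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

Recovering the pricing rule p(x) amounts to additionally recording the maximizing price at each grid point.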

The proposed algorithm is based on an (s, S, p)-policy with evolving inventory levels (s, S) and pricing strategies p. As mentioned earlier, in the learning algorithm the T time periods are partitioned into epochs, labeled as \(\mathcal B_1,\mathcal B_2,\cdots \). Re-stocking only occurs at the first time period of each epoch \(\mathcal B_b\), b ∈{1, 2, ⋯ }. Each epoch \(\mathcal B_b\) is also associated with inventory levels (sb, Sb) and pricing strategy pb, such that for the first time period \(t_b\in \mathcal B_b\), the re-stocked inventory level is \(y_{t_b}=S_b\); the epoch \(\mathcal B_b\) terminates whenever xt < sb, and for all \(t\in \mathcal B_b\backslash \{t_b\}\), yt = xt and pt = pb(xt). Algorithm 10 gives a pseudo-code description of the proposed algorithm.

Algorithm 10 The main algorithm: dynamic inventory control and pricing with unknown demand

Updates of the (s, S, p) policy being implemented occur at the beginning of each epoch, as detailed in the corresponding steps of Algorithm 10. More specifically, at the beginning of epoch b, when a policy update is due, the algorithm first collects all realized demand information from previous epochs to construct the model estimate \(\hat \theta _b\) (of the demand-rate curve) and the noise distribution \(\hat \mu _b\). With estimates \(\hat \theta _b\) and \(\hat \mu _b\), a dynamic program (reflected in \(\phi ^{(s_b,S_b)}(\cdot ;\bar D_b,\bar r_b,\hat \mu _b)\)) is solved to obtain an approximately optimal pricing function pb, as well as the inventory levels sb, Sb.
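
The overall control flow can be paraphrased as the following skeleton. This is not Algorithm 10 itself: `env.step` and `update_policy` (which bundles the estimation, noise approximation, and DP steps sketched above) are hypothetical placeholders.

```python
def epoch_based_learning(T, env, initial_policy, update_policy):
    """Skeleton of the epoch-based (s, S, p) learning scheme: restock to S
    at the start of each epoch, price via the epoch's strategy p, end the
    epoch once inventory falls below s, then recompute the policy."""
    s, S, price_rule = initial_policy
    history, b, t = [], 1, 0
    while t < T:
        y = S                          # restock up to S at the epoch start
        while t < T:
            p = price_rule(y)          # p_t = p_b(current inventory level)
            d = env.step(p)            # realized demand (fully observed)
            history.append((p, d))
            y -= d                     # backlogged inventory dynamics
            t += 1
            if y < s:                  # epoch B_b terminates
                break
        b += 1
        s, S, price_rule = update_policy(history, b)
```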

Regret Convergence

The regret of the algorithm described above, measured against the optimal policy π that maximizes r(s, S, p;μ0), is upper bounded by \(\widetilde O(T^{1/2})\) with probability 1 − O(T−1). The \(\widetilde O(\cdot )\) notation omits polynomial dependency on \(\log T\) and other problem parameters. With k = c = 0 and h(⋅) ≡ 0, the problem reduces to a pure pricing problem with unknown linear demand functions, for which, as long as τ > 1, Broder and Rusmevichientong (2012) prove an Ω(T1∕2) lower bound for any admissible pricing policy. Therefore, the \(\widetilde O(T^{1/2})\) regret established here is optimal up to logarithmic factors.

In Algorithm 10, a dynamic program needs to be solved after each epoch b to obtain a new policy (sb, Sb, pb). Because each epoch lasts at most \(\overline S/ \underline d=O(1)\) selling periods, the algorithm requires Ω(T) DP calculations, which can be computationally expensive. Chen et al. (2021b) therefore propose an improved algorithm that needs only \(O(\tau \log T)\) DP calculations to achieve virtually the same regret, making it much more computationally efficient.

Algorithm with Infrequent DP Updates

The detailed description is presented in Algorithm 11.

Algorithm 11 Dynamic inventory control and pricing with infrequent DP solutions

Note that in Algorithm 11, a new (s, S, p) policy is computed only when the number of elapsed epochs reaches 2ι for some ι ∈{1, 2, ⋯ }, or when the determinant of the sample covariance matrix Λb doubles. This greatly reduces the number of DP calculations from Ω(T) to \(O(\tau \log T)\).
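
The update trigger can be stated in a few lines; the function below is a paraphrase of the rule just described, not the paper's pseudo-code.

```python
def dp_update_due(b, det_Lam_b, det_at_last_dp):
    """Recompute the (s, S, p) policy only when the epoch index b is a
    power of two, or when det(Lambda_b) has at least doubled since the
    last DP computation."""
    is_power_of_two = b >= 2 and (b & (b - 1)) == 0
    return is_power_of_two or det_Lam_b >= 2.0 * det_at_last_dp
```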

Regret Convergence for Infrequent DP Updates

For the algorithm with infrequent DP updates, the regret is upper bounded by \(\widetilde O(T^{1/2})\) with probability 1 − O(T−1).

6 Other Models

Burnetas and Smith (2000) is one of the earliest papers, if not the first, that studies joint pricing and inventory control with unknown demand distribution. They assume the lost-sales cost is zero and that inventory perishes at the end of each period. The pricing decision is modeled as a multiarmed bandit problem, while the order quantity decision is made via a stochastic approximation procedure. Burnetas and Smith (2000) prove policy convergence of their proposed algorithm. Katehakis et al. (2020) consider the joint optimization problem with discrete backlogged demand in different settings with or without a leading price. Keskin et al. (2021) study the joint pricing and inventory control problem with learning in a changing environment under a parametric demand-rate function, assuming lost sales are observable. They provide learning algorithms whose convergence rates match the theoretical lower bound.