1 Introduction

We study iterative primal-dual stochastic subgradient algorithms to solve risk-sensitive optimization problems of the form

(1)

where \(\omega \in \varOmega \) is random and \(\alpha , \pmb {\beta }:=(\beta ^1, \ldots , \beta ^m)\) in [0, 1) define risk-aversion parameters. The real-valued functions \(f_\omega \), \(g^1_\omega , \ldots , g^m_\omega \) are assumed to be convex, but not necessarily differentiable, over the closed convex set \({\mathbb {X}}\subseteq {\mathbb {R}}^n\), where \({\mathbb {R}}\) and \({\mathbb {R}}_+\) stand for the sets of real and nonnegative real numbers, respectively. Denote by \({\varvec{G}}\) and \({\varvec{g}}_\omega \) the collections of \(G^i\)’s and \(g^i_\omega \)’s, respectively, for \(i = 1,\ldots , m\). \(\mathrm{CVaR}\) stands for conditional value-at-risk. For any \(\delta \in [0, 1)\), \(\mathrm{CVaR}_\delta [y_\omega ]\) of a scalar random variable \(y_\omega \) with a continuous distribution equals its expectation computed over the \(1-\delta \) tail of the distribution of \(y_\omega \). For \(y_\omega \) with a general distribution, \(\mathrm{CVaR}\) is defined via the variational characterization

$$\begin{aligned} \mathrm{CVaR}_{\delta }[y_\omega ] = \min _{u \in {\mathbb {R}}} \left\{ u + \frac{1}{1-\delta }\mathbb {E} [y_\omega - u]^+ \right\} , \end{aligned}$$
(2)

following [36]. For each \(\varvec{x} \in {\mathbb {X}}\), assume that \(\mathbb {E} [ | f_\omega (\varvec{x}) | ]\) and \(\mathbb {E} [ | g^i_\omega (\varvec{x}) | ]\) are finite, implying that F and \({\varvec{G}}\) are well defined everywhere in \({\mathbb {X}}\).

\({{{{\mathcal {P}}}}^\mathrm{CVaR}}\) offers a modeler the flexibility to indicate her risk preference through \(\alpha , \pmb {\beta }\). With \(\alpha \) close to zero, she indicates risk-neutrality toward the uncertain cost associated with the decision. With \(\alpha \) closer to one, she expresses risk aversion toward the same and seeks a decision that limits the possibility of large random costs. Similarly, the \(\beta \)’s express her risk tolerance toward constraint violation. Choosing \(\beta \)’s close to zero indicates that constraints should be satisfied on average over \(\varOmega \) rather than on each sample. Driving \(\beta \)’s to unity amounts to requiring the constraints to be met almost surely. Said succinctly, \({{{{\mathcal {P}}}}^\mathrm{CVaR}}\) permits the modeler to position her risk preference anywhere between the risk-neutral choice of expected evaluations of the functions and the conservative choice of robust evaluations.

There is a growing interest in solving risk-sensitive optimization problems with data. See [3, 20] for recent examples that tackle problems with the generalized mean semi-deviation risk \(\mathbb {E} [{y_\omega }] + c \mathbb {E} [ \vert {y_\omega } - \mathbb {E} [{y_\omega }] \vert ^{p}]^{1/p}\) of a random variable \(y_\omega \) for \(p>1\). There is an extensive literature on risk measures, e.g., see [1, 12, 25, 33, 36, 37]. We choose \(\mathrm{CVaR}\) for three particular reasons. First, it is a coherent risk measure, meaning that it is normalized, sub-additive, positively homogeneous and translation invariant, i.e.,

$$\begin{aligned} \mathrm{CVaR}_{\delta }[0] = 0, \; \mathrm{CVaR}_{\delta }[y^1_\omega + y^2_\omega ] \le \mathrm{CVaR}_{\delta }[y^1_\omega ] + \mathrm{CVaR}_{\delta }[y^2_\omega ], \\ \mathrm{CVaR}_{\delta }[t y_\omega ] = t \mathrm{CVaR}_{\delta }[y_\omega ], \; \mathrm{CVaR}_{\delta }[y_\omega + t'] = \mathrm{CVaR}_{\delta }[y_\omega ] + t' \end{aligned}$$

for random variables \(y_\omega , y^1_\omega , y^2_\omega \), \(t >0\) and \(t' \in {\mathbb {R}}\). An important consequence of coherence is that F and \({\varvec{G}}\) in \({{{{\mathcal {P}}}}^\mathrm{CVaR}}\) inherit the convexity of \(f_\omega \) and \({\varvec{g}}_\omega \). Convexity, together with the variational characterization in (2), allows us to design sampling-based primal-dual methods for \({{{{\mathcal {P}}}}^\mathrm{CVaR}}\) for which we are able to provide a finite-sample analysis of approximate optimality and feasibility. The popularity of the \(\mathrm{CVaR}\) measure is our second reason to study \({{{{\mathcal {P}}}}^\mathrm{CVaR}}\). Following Rockafellar and Uryasev’s seminal work in [36], \(\mathrm{CVaR}\) has found applications in various engineering domains, e.g., see [22, 27], and therefore we anticipate wide applications of our results. Our third and final reason to study \({{{{\mathcal {P}}}}^\mathrm{CVaR}}\) is its close relation to other optimization paradigms in the literature, as we describe next.

\({{{{\mathcal {P}}}}^\mathrm{CVaR}}\) without constraints and \(\alpha = 0\) reduces to the minimization of \(\mathbb {E} [f_\omega (\varvec{x})]\), the canonical stochastic optimization problem. With \(\alpha \uparrow 1\), the problem description of \({{{{\mathcal {P}}}}^\mathrm{CVaR}}\) approaches that of a robust optimization problem (see [4]) of the form \(\min _{\varvec{x} \in {\mathbb {X}}} \text {ess}\sup _{\omega \in \varOmega } f_\omega (\varvec{x})\), where \(\text {ess}\sup \) denotes the essential supremum. Driving \(\beta \)’s to unity, \({{{{\mathcal {P}}}}^\mathrm{CVaR}}\) demands that the constraints be enforced almost surely. Such robust constraint enforcement is common in multi-stage stochastic optimization problems with recourse and discrete-time optimal control problems, e.g., in [16, 39, 40]. \(\mathrm{CVaR}\)-based constraints are closely related to chance constraints introduced by Charnes and Cooper in [12] that enforce \( \mathsf{Pr}\{g_\omega (\varvec{x}) \le 0 \} > 1-\varepsilon \), where \(\mathsf{Pr}\) refers to the probability measure on \(\varOmega \). Even if \(g_\omega \) is convex, chance constraints typically describe a nonconvex feasible set. It is well known that \(\mathrm{CVaR}\)-based constraints provide a convex inner approximation of chance constraints. Restricting the probability of constraint violation does not limit the extent of any possible violation, while \(\mathrm{CVaR}\)-based enforcement does so in expectation. \(\mathrm{CVaR}\) is also intimately related to the buffered probability of exceedance (bPOE), introduced and studied more recently in [25, 45]. In fact, bPOE is the inverse function of \(\mathrm{CVaR}\), and hence, problems with bPOE constraints can often be reformulated as instances of \({{{{\mathcal {P}}}}^\mathrm{CVaR}}\).
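The inner-approximation property can be checked numerically: if the empirical \(\mathrm{CVaR}\) of \(g_\omega (\varvec{x})\) at level \(\beta \) is nonpositive, then the empirical probability of satisfying the constraint is at least \(\beta \). The sketch below is ours, with an arbitrary Gaussian stand-in for the distribution of \(g_\omega (\varvec{x})\) at a fixed \(\varvec{x}\):

```python
import numpy as np

rng = np.random.default_rng(2)
beta = 0.95
# Samples of g_omega(x) at a fixed decision x; the Gaussian is an arbitrary stand-in.
g = rng.normal(-2.0, 0.5, size=100_000)

# Empirical CVaR_beta: mean over the upper (1 - beta) tail of the samples.
cvar = np.sort(g)[int(beta * g.size):].mean()

# CVaR_beta[g] <= 0 implies VaR_beta[g] <= 0, i.e., Pr{g <= 0} >= beta,
# so the CVaR constraint is a conservative (inner) approximation.
chance = np.mean(g <= 0.0)
```

Here \(\mathrm{CVaR}_\beta [g_\omega (\varvec{x})] \le 0\) forces the \(\beta \)-quantile (value at risk) to be nonpositive, which is exactly the chance constraint; the converse implication fails, which is why the \(\mathrm{CVaR}\) constraint is conservative.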

It can be challenging to compute \(\mathrm{CVaR}\) of \(f_\omega (\varvec{x})\) or \({\varvec{g}}_\omega (\varvec{x})\) for a given decision variable \(\varvec{x}\) with respect to a general distribution on \(\varOmega \) for two reasons. First, if samples from \(\varOmega \) are obtained from a simulation tool, an explicit representation of the probability distribution on \(\varOmega \) may not be available. Second, even if such a distribution is available, computation of \(\mathrm{CVaR}\) (or even the expectation) can be difficult. For example, with \(f_\omega \) as the positive part of an affine function and \(\omega \) being uniformly distributed over a unit hypercube, computation of \(\mathbb {E} [f_\omega ]\) via a multivariate integral is #P-hard according to [17, Corollary 1]. Therefore, we do not assume knowledge of F and \({\varvec{G}}\) but rather study a sampling-based algorithm to solve \({{{{\mathcal {P}}}}^\mathrm{CVaR}}\).

Solution architectures for \({{{{\mathcal {P}}}}^\mathrm{CVaR}}\) via sampling come in two flavors. The first approach is sample average approximation (SAA), which replaces the expectation in (2) by an empirical average over N samples. One can then solve the sampled problem as a deterministic convex program. We take the second and alternative approach of stochastic approximation and process independent and identically distributed (i.i.d.) samples from \(\varOmega \) in an online fashion. Iterative stochastic approximation algorithms for the unconstrained problem have been studied since the early works by Robbins and Monro in [34] and by Kiefer and Wolfowitz in [21]. See [24] for a more recent survey. Zinkevich in [46] proposed a projected stochastic subgradient method that can be applied to tackle constraints in such problems. Without directly knowing \({\varvec{G}}\), we cannot easily project the iterates on the feasible set \(\{ \varvec{x} \in {\mathbb {X}}\ | \ {\varvec{G}}(\varvec{x}) \le 0 \}\). We circumvent this challenge by associating Lagrange multipliers \({\varvec{z}} \in {\mathbb {R}}^m_+\) with the constraints and iteratively updating \(\varvec{x}, {\varvec{z}}\) using \(f_\omega , {\varvec{g}}_\omega \) and their subgradients via a first-order stochastic primal-dual algorithm for \({{{{\mathcal {P}}}}^\mathrm{CVaR}}\), along the lines of [30, 42, 44].

In Sect. 2, we first design and analyze Algorithm 1 for \({{{{\mathcal {P}}}}^\mathrm{CVaR}}\) with \(\alpha =0, \pmb {\beta } = 0\), i.e., the optimization problem

(3)

First-order stochastic primal-dual algorithms have a long history, dating back almost forty years, including those in [15, 24, 30–32]. The analyses of these algorithms often require a bound on the possible growth of the dual variables. Borkar and Meyn in [8] stress the importance of compactness assumptions in their analysis of stochastic approximation algorithms. A priori bounds used in [31] are difficult to know in practice, and techniques for iterative construction of such bounds as in [30] require extra computational effort. A regularization term in the dual update has been proposed in [23, 26] to circumvent this limitation. Instead, we propose a different modification to the classical primal-dual stochastic subgradient algorithm. With this simple modification, we are able to circumvent the need to bound the dual variables in executing the algorithm. As will become clear in Sect. 2, we rely on the existence of a saddle point of the Lagrangian function for \({{{{\mathcal {P}}}}^{\mathrm{E}}}\), which is typically guaranteed under a Slater-type constraint qualification. However, knowledge of that saddle point or of a strictly feasible “Slater” point is required neither to execute the algorithm nor to derive its convergence rate. While the classical primal-dual approach samples once for a single update of the primal and the dual variables, we sample twice: once to update the primal variable and again to update the dual variable at the most recent primal iterate, thus adopting a Gauss–Seidel approach in place of a Jacobi framework. For Algorithm 1 with a constant step-size, we bound the expected optimality gap and constraint violations at a suitably weighted average of the iterates by \(\eta /\sqrt{K}\) for a constant \(\eta \). Using these bounds, we then carefully optimize the step-size, which allows us to reach within a given threshold of suboptimality and constraint violation with the minimum number of iterations.
While we do not bound the dual variables to execute the algorithm or to characterize the \(1/\sqrt{K}\) convergence rate, we do require an overestimate of the distance of the dual initialization from an optimal point to calculate the constant \(\eta \), which in turn is required to optimize the constant step-size. The additional sample required in our update aids the analysis; however, it comes at the price of making the sample complexity double the iteration complexity. Given the popularity of decaying step-sizes in first-order algorithms, we also provide a stability analysis of our algorithm with such step-sizes. This analysis exploits a dissipation inequality that we derive for our Gauss–Seidel approach. Such a stability analysis is crucial for our primal-dual algorithm, given that we do not explicitly restrict the growth of the dual variables.

In Sect. 3, we solve \({{{{\mathcal {P}}}}^\mathrm{CVaR}}\) with general risk-aversion parameters \(\alpha , \pmb {\beta }\) by applying Algorithm 1 to an instance of \({{{{\mathcal {P}}}}^{\mathrm{E}}}\) obtained through a standard reformulation via the variational formula for \(\mathrm{CVaR}\) in (2) from [36]. We then bound the expected suboptimality and constraint violation at a weighted average of the iterates for \({{{{\mathcal {P}}}}^\mathrm{CVaR}}\) by \(\eta (\alpha , {\varvec{\beta }})/\sqrt{K}\). Upon utilizing the optimized step-sizes from the analysis of \({{{{\mathcal {P}}}}^{\mathrm{E}}}\), we are then able to study the precise growth in the required iteration (and sample) complexity of \({{{{\mathcal {P}}}}^\mathrm{CVaR}}\) as a function of \(\alpha , {\varvec{\beta }}\). Not surprisingly, the more risk-averse the problem one aims to solve, the larger this complexity grows. A modeler chooses risk-aversion parameters primarily driven by attitudes toward risk in specific applications. Our precise characterization of the growth in sample complexity with risk aversion will permit the modeler to balance desired risk levels against the computational challenges of handling that risk. We remark that the algorithmic architecture for the risk-neutral problem may not directly apply to the risk-sensitive variant for general risk measures. For example, the algorithm described in [20] for general mean-semideviation-type risk measures is considerably more complex than that required for the risk-neutral problem. We are able to extend our algorithm and its analysis for \({{{{\mathcal {P}}}}^{\mathrm{E}}}\) to \({{{{\mathcal {P}}}}^\mathrm{CVaR}}\), thanks to the variational form in (2) that \(\mathrm{CVaR}\) admits. See the discussion after the proof of Theorem 3.1 for a precise list of properties a risk measure must exhibit for us to apply the same technique.
Using concentration inequalities, we also report an interesting connection of our results to that in [10, 11] on scenario approximations to chance-constrained programs. The resemblance in sample complexity is surprising, given that the approach in [10, 11] solves a deterministic convex program with sampled constraints, while we process samples in an online fashion.

We illustrate properties of our algorithm through a stylized example. Our experiments reveal that the optimized iteration count (and sample complexity) even for a simple example is quite high. This limitation is unfortunately common to subgradient algorithms and likely cannot be overcome when optimizing the general nonsmooth functions that we study. While the bounds are order-optimal, our numerical experiments reveal that a solution with the desired risk tolerance can be found in fewer iterations than the upper bound suggests. This is an artifact of optimizing step-sizes based on upper bounds on suboptimality and constraint violation. We end the paper in Sect. 4 with discussions on possible extensions of our analysis.

Very recently, it was brought to our attention that concurrent work in [7] presents a related approach to tackle optimization of composite nonconvex functions under related but different assumptions. In fact, their work appeared at the same time as our early version and claims a similar result that does not require bounds on the dual variables. Our setup does not require or analyze strongly convex functions, and therefore Nesterov-style acceleration remains untenable; as a result, our algorithm is different. Our focus on \(\mathrm{CVaR}\) further permits us to analyze the growth in optimized sample complexity with risk aversion and its connection to chance-constrained optimization, which is quite different.

2 Algorithm for \({{{{\mathcal {P}}}}^{\mathrm{E}}}\) and Its Analysis

We present the primal-dual stochastic subgradient method to solve \({{{{\mathcal {P}}}}^{\mathrm{E}}}\) in Algorithm 1.

[Algorithm 1, the primal-dual stochastic subgradient method for \({{{{\mathcal {P}}}}^{\mathrm{E}}}\), appears here as a figure.]

The notation \(\langle \cdot , \cdot \rangle \) stands for the usual inner product in Euclidean space and \(\Vert \cdot \Vert \) denotes the induced \(\ell _2\)-norm. Here, \(\nabla h(\varvec{x})\) stands for a subgradient of an arbitrary convex function h at \(\varvec{x}\). For our analysis, the subgradient in Algorithm 1 for functions \(f_\omega (\varvec{x})\) and \(g_\omega (\varvec{x})\) can be arbitrary elements of the closed convex subdifferential sets \(\partial f_\omega (\varvec{x})\) and \(\partial g_\omega (\varvec{x})\), respectively. We assume that these subdifferential sets are nonempty everywhere in \({\mathbb {X}}\).

The primal-dual method in Algorithm 1 leverages Lagrangian duality theory. Specifically, define the Lagrangian function for \({{{{\mathcal {P}}}}^{\mathrm{E}}}\) as

$$\begin{aligned} {{\mathcal {L}}}(\varvec{x}, {\varvec{z}}) := F(\varvec{x}) + {\varvec{z}}^{{\intercal }} {\varvec{G}}(\varvec{x}) = \mathbb {E} [ {{\mathcal {L}}}_\omega (\varvec{x}, {\varvec{z}})], \end{aligned}$$
(6)

for \(\varvec{x} \in {\mathbb {X}}\), \({\varvec{z}} \in {\mathbb {R}}^m_+\), where \({{\mathcal {L}}}_\omega (\varvec{x}, {\varvec{z}}) :=f_\omega (\varvec{x}) + {\varvec{z}}^{{\intercal }} {\varvec{g}}_\omega (\varvec{x})\). Then, \({{{{\mathcal {P}}}}^{\mathrm{E}}}\) admits the standard reformulation as a min-max problem of the form

$$\begin{aligned} {{p}^{\mathrm{E}}_\star }:=\min _{\varvec{x} \in {\mathbb {X}}} \max _{{\varvec{z}}\in {\mathbb {R}}^m_+} {{\mathcal {L}}}(\varvec{x}, {\varvec{z}}). \end{aligned}$$
(7)

Denote its optimal set by \({\mathbb {X}}_\star \subseteq {\mathbb {X}}\). Define the dual problem of \({{{{\mathcal {P}}}}^{\mathrm{E}}}\) as

$$\begin{aligned} {{d}^{\mathrm{E}}_\star }:=\max _{{\varvec{z}}\in {\mathbb {R}}^m_+} \min _{\varvec{x} \in {\mathbb {X}}} {{\mathcal {L}}}(\varvec{x}, {\varvec{z}}). \end{aligned}$$
(8)

Denote its optimal set by \({\mathbb {Z}}_\star \subseteq {\mathbb {R}}^m_+\). Weak duality then guarantees \({{p}^{\mathrm{E}}_\star }\ge {{d}^{\mathrm{E}}_\star }\). When the inequality holds with equality, the problem is said to satisfy strong duality. A point \((\varvec{x}_\star , {\varvec{z}}_\star ) \in {\mathbb {X}}\times {\mathbb {R}}^m_+\) is a saddle point of \({{\mathcal {L}}}\) if

$$\begin{aligned} {{\mathcal {L}}}(\varvec{x}_\star , {\varvec{z}}) \le {{\mathcal {L}}}(\varvec{x}_\star , {\varvec{z}}_\star ) \le {{\mathcal {L}}}(\varvec{x}, {\varvec{z}}_\star ) \end{aligned}$$
(9)

for all \((\varvec{x}, {\varvec{z}}) \in {\mathbb {X}}\times {\mathbb {R}}^m_+\). The following well-known saddle point theorem (see [6, Theorem 2.156]) relates saddle points with primal-dual optimal solutions.

Theorem

(Saddle point theorem) A saddle point of \({{\mathcal {L}}}\) exists if and only if \({{{{\mathcal {P}}}}^{\mathrm{E}}}\) satisfies strong duality, i.e., \({{p}^{\mathrm{E}}_\star }= {{d}^{\mathrm{E}}_\star }\). Moreover, the set of saddle points of \({{\mathcal {L}}}\) is given by \({\mathbb {X}}_\star \times {\mathbb {Z}}_\star \).

Our convergence analysis of Algorithm 1 requires the following assumptions.

Assumption 1

\({{{{\mathcal {P}}}}^{\mathrm{E}}}{}\) must satisfy the following properties:

  1. (a)

    Subgradients of F and \({\varvec{G}}\) are bounded, i.e., \(\Vert \nabla F(\varvec{x}) \Vert \le C_F\), \(\Vert \nabla G^i(\varvec{x}) \Vert \le C_G^i\) for each \(i = 1, \dots , m\) and all \(\varvec{x} \in {\mathbb {X}}\).

  2. (b)

    \(\nabla f_\omega \) and \(\nabla g_\omega ^i\) for \(i = 1, \dots , m\) have bounded variance, i.e., \(\mathbb {E} \Vert \nabla f_\omega (\varvec{x}) - \mathbb {E} [\nabla f_\omega (\varvec{x})] \Vert ^2 \le \sigma _F^2\) and \(\mathbb {E} \Vert \nabla g_\omega ^i(\varvec{x}) - \mathbb {E} [\nabla g_\omega ^i(\varvec{x})] \Vert ^2 \le [\sigma _G^i]^2\) for all \(\varvec{x} \in {\mathbb {X}}\).

  3. (c)

Each \(g^i_\omega (\varvec{x})\) has a bounded second moment, i.e., \(\mathbb {E} | g_\omega ^i(\varvec{x}) |^2 \le [D_G^i]^2\) for \(i = 1, \dots , m\) and all \(\varvec{x} \in {\mathbb {X}}\).

  4. (d)

    The Lagrangian function \({{\mathcal {L}}}\) admits a saddle point \((\varvec{x}_\star , {\varvec{z}}_\star ) \in {\mathbb {X}}\times {\mathbb {R}}^m_+\).

The subgradient of F and the variance of its noisy estimate are assumed bounded. Such an assumption is standard in the convergence analysis of unconstrained stochastic subgradient methods. The assumptions regarding \({\varvec{G}}\) are similar, but we additionally require the second moment of the noisy estimate of \({\varvec{G}}\) to be bounded over \({\mathbb {X}}\). Boundedness of \({\varvec{G}}\) in primal-dual subgradient methods has appeared in prior literature, e.g., in [42, 44]. The second moment remains bounded if \({g}^i_\omega \) is uniformly bounded over \({\mathbb {X}}\) and \(\varOmega \) for each i. It is also satisfied if \({\varvec{G}}\) remains bounded over \({\mathbb {X}}\) and its noisy estimate has a bounded variance. Convergence analysis of unconstrained optimization problems typically assumes the existence of a finite optimal solution. We extend that requirement to the existence of a saddle point in the primal-dual setting, which by the saddle point theorem is equivalent to the existence of finite primal and dual optimal solutions. A variety of conditions imply the existence of such a point; the next result delineates two such sufficient conditions in (a) and (b), where (a) implies (b).

Lemma 2.1

(Sufficient conditions for existence of a saddle point) For \({{{{\mathcal {P}}}}^{\mathrm{E}}}\), the Lagrangian function \({{\mathcal {L}}}\) admits a saddle point, if either of the following conditions hold:

  1. (a)

    \({\mathbb {X}}_\star \) is nonempty, \({{p}^{\mathrm{E}}_\star }\) is finite and Slater’s constraint qualification holds, i.e., there exists \( \varvec{x}\) in the relative interior of \({\mathbb {X}}\) for which \({\varvec{G}}(\varvec{x}) < 0\).

  2. (b)

\({{{{\mathcal {P}}}}^{\mathrm{E}}}\) admits a finite \((\varvec{x}_\star , {\varvec{z}}_\star ) \in {\mathbb {X}}\times {\mathbb {R}}_+^m\) that satisfies the generalized Karush–Kuhn–Tucker (KKT) conditions given by

    $$\begin{aligned} \begin{aligned} 0 \in \partial _x {{\mathcal {L}}}(\varvec{x}_\star , {\varvec{z}}_\star ) + {{\mathcal {N}}}_{{\mathbb {X}}}(\varvec{x}_\star ), \; {G}^i(\varvec{x}_\star ) \le 0, \; z^i_\star G^i(\varvec{x}_\star ) = 0 \end{aligned} \end{aligned}$$
    (10)

    for \(i = 1, \dots , m\), where \({{\mathcal {N}}}_{{\mathbb {X}}}(\varvec{x}_\star )\) denotes the normal cone of \({\mathbb {X}}\) at \(\varvec{x}_\star \).

Proof

Part (a) is a direct consequence of [6, Theorem 1.265]. To prove part (b), notice that (10) ensures the existence of subgradients \(\nabla F(\varvec{x}_\star ) \in \partial F(\varvec{x}_\star )\), \(\nabla {G}^i(\varvec{x}_\star ) \in \partial {G}^i(\varvec{x}_\star )\), \(i=1,\ldots ,m\) and \({\varvec{n}} \in {{\mathcal {N}}}_{\mathbb {X}}(\varvec{x}_\star )\) for which

$$\begin{aligned} \nabla F(\varvec{x}_\star ) + \sum _{i = 1}^m {z}_\star ^i \nabla G^i(\varvec{x}_\star ) + {\varvec{n}} = 0. \end{aligned}$$
(11)

Then, for any \(\varvec{x} \in {\mathbb {X}}\), we have

$$\begin{aligned} \underbrace{\langle \nabla F(\varvec{x}_\star ), \varvec{x} - \varvec{x}_\star \rangle }_{\le F(\varvec{x}) - F(\varvec{x}_\star )} + \sum _{i = 1}^m \underbrace{ {z}_\star ^i \langle \nabla G^i(\varvec{x}_\star ) , \varvec{x} - \varvec{x}_\star \rangle }_{\le {z}_\star ^i [ G^i(\varvec{x}) - G^i(\varvec{x}_\star ) ]} + \underbrace{\langle {\varvec{n}} , \varvec{x} - \varvec{x}_\star \rangle }_{\le 0} = 0. \end{aligned}$$
(12)

The inequalities in the above relation follow from the convexity of F and \({G}^i\)’s, nonnegativity of \({\varvec{z}}_\star \), and the definition of the normal cone. From the above inequalities, we conclude \({{\mathcal {L}}}(\varvec{x}, {\varvec{z}}_\star ) \ge {{\mathcal {L}}}(\varvec{x}_\star , {\varvec{z}}_\star )\) for all \(\varvec{x} \in {\mathbb {X}}\). Furthermore, for any \({\varvec{z}} \ge 0\), we have

$$\begin{aligned} {{\mathcal {L}}}(\varvec{x}_\star , {\varvec{z}}_\star ) - {{\mathcal {L}}}(\varvec{x}_\star , {\varvec{z}}) = {\varvec{z}}_\star ^{{\intercal }} {\varvec{G}}(\varvec{x}_\star ) - {\varvec{z}}^{{\intercal }} {\varvec{G}}(\varvec{x}_\star ) \ge 0, \end{aligned}$$
(13)

where the last step follows from the nonnegativity of \({\varvec{z}}\) and (10), completing the proof. \(\square \)

We now present our first main result that provides a bound on the expected distance to optimality and constraint violation at a weighted average of the iterates generated by the algorithm on \({{{{\mathcal {P}}}}^{\mathrm{E}}}\) under Assumption 1. Denote by \({\varvec{C}}_G\), \({\varvec{D}}_G\), and \({\varvec{\sigma }}_G\) the collections of \(C_G^i\), \(D_G^i\), and \(\sigma _G^i\), respectively. We make use of the following notation.

$$\begin{aligned} \begin{aligned}&P_1 := 2 \Vert \varvec{x}_1 - \varvec{x}_\star \Vert ^2 + 4\Vert \mathbb {1}+ {\varvec{z}}_\star \Vert ^2, \\&P_2 := 8 (4C_F^2 + \sigma _F^2) + 2 \Vert {\varvec{D}}_G \Vert ^2, \quad \\&P_3:= 8 m (4 \Vert {\varvec{C}}_G \Vert ^2 + \Vert {\varvec{\sigma }}_G \Vert ^2). \end{aligned} \end{aligned}$$
(14)

Theorem 2.1

(Convergence result for \({{{{\mathcal {P}}}}^{\mathrm{E}}}\)) Suppose Assumption 1 holds. For a positive sequence \(\{\gamma _k\}_{k=1}^K\), if \(P_3 \sum _{k=1}^K \gamma _k^2 < 1\), then the iterates generated by Algorithm 1 satisfy

$$\begin{aligned} \mathbb {E} [ F(\bar{\varvec{x}}_{K+1})] - {{p}^{\mathrm{E}}_\star }&\le \frac{1}{4 \sum _{k=1}^K \gamma _k } \left( \frac{P_1 + P_2 \sum _{k=1}^K \gamma _k^2}{1 - P_3 \sum _{k=1}^K \gamma _k^2} \right) , \end{aligned}$$
(15)
$$\begin{aligned} \mathbb {E} [G^i(\bar{\varvec{x}}_{K+1})]&\le \frac{1}{4 \sum _{k=1}^K \gamma _k } \left( \frac{P_1 + P_2 \sum _{k=1}^K \gamma _k^2}{1 - P_3 \sum _{k=1}^K \gamma _k^2} \right) \end{aligned}$$
(16)

for each \(i = 1, \dots , m\), where \(\bar{\varvec{x}}_{K+1} := \frac{\sum _{k=1}^K \gamma _k \varvec{x}_{k+1}}{\sum _{k=1}^K \gamma _k}\). Moreover, if \(\gamma _k =\gamma /\sqrt{K}\) for \(k=1,\ldots ,K\) with \(0< \gamma < P_3^{-1/2}\), then

$$\begin{aligned} \mathbb {E} [ F(\bar{\varvec{x}}_{K+1})] - {{p}^{\mathrm{E}}_\star }\le \frac{\eta }{\sqrt{K}}, \qquad \mathbb {E} [G^i(\bar{\varvec{x}}_{K+1})] \le \frac{\eta }{\sqrt{K}} \end{aligned}$$
(17)

for \(i = 1,\ldots ,m\), where \(\eta := \frac{P_1 + P_2 \gamma ^2}{4\gamma (1-P_3\gamma ^2)}\).

A constant step-size of \(\gamma /\sqrt{K}\) over a fixed number K of iterations yields the \({{\mathcal {O}}}(1/\sqrt{K})\) decay rate in the expected distance to optimality and constraint violation of Algorithm 1. This rate is indeed order-optimal, as implied by Nesterov’s celebrated result in [32, Theorem 3.2.1].
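To make the choice of \(\gamma \) concrete, the sketch below (ours; the numeric values of \(P_1, P_2, P_3\) are placeholders rather than constants computed from (14)) evaluates \(\eta (\gamma ) = (P_1 + P_2 \gamma ^2)/(4\gamma (1-P_3\gamma ^2))\) from Theorem 2.1 and optimizes \(\gamma \) over \((0, P_3^{-1/2})\) by grid search:

```python
import numpy as np

# Placeholder values for the constants P1, P2, P3 defined in (14).
P1, P2, P3 = 10.0, 4.0, 2.0

def eta(gamma):
    """The constant eta in (17) for the constant step-size gamma_k = gamma / sqrt(K)."""
    assert 0.0 < gamma < P3 ** -0.5  # range required by Theorem 2.1
    return (P1 + P2 * gamma**2) / (4.0 * gamma * (1.0 - P3 * gamma**2))

# Grid search for the gamma that minimizes the eta / sqrt(K) bound.
grid = np.linspace(1e-3, P3 ** -0.5 - 1e-3, 10_000)
gamma_opt = grid[np.argmin([eta(g) for g in grid])]
```

With the optimized \(\gamma \), (17) implies that \(K \ge (\eta /\epsilon )^2\) iterations suffice for expected suboptimality and constraint violation at most \(\epsilon \).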

Remark 2.1

While we present the proof for an i.i.d. sequence of samples, we believe that the result can be extended to the case where \(\omega \)’s follow a Markov chain with geometric mixing rate following the technique in [41]. For such settings, the expectations in the definition of \({F}, {\varvec{G}}\) should be computed with respect to the stationary distribution of the chain. The results will then possibly apply to Markov decision processes with applications in stochastic control.

Given that the literature on primal-dual subgradient methods is extensive, it is important for us to relate and distinguish Algorithm 1 and Theorem 2.1 with prior work. Using the Lagrangian in (6), Algorithm 1 can be written as

$$\begin{aligned} \begin{aligned} \varvec{x}_{k+1}&:={{\,\mathrm{proj}\,}}_{{\mathbb {X}}} [\varvec{x}_k - \gamma _k \nabla _x {{\mathcal {L}}}_\omega (\varvec{x}_k, {\varvec{z}}_k)], \\ {\varvec{z}}_{k+1}&:={{\,\mathrm{proj}\,}}_{{{\mathbb {R}}_+^m}} [{\varvec{z}}_k + \gamma _k \nabla _z {{\mathcal {L}}}_\omega (\varvec{x}_{k+1}, {\varvec{z}}_k)], \end{aligned} \end{aligned}$$
(18)

where \({{\,\mathrm{proj}\,}}_{\mathbb {A}}\) projects its argument on set \({\mathbb {A}}\). The vectors \(\nabla _x {{\mathcal {L}}}_\omega \) and \(\nabla _z {{\mathcal {L}}}_\omega \) are stochastic subgradients of the Lagrangian function with respect to \(\varvec{x}\) and \({\varvec{z}}\), respectively. Therefore, Algorithm 1 is a projected stochastic subgradient algorithm that seeks to solve the saddle-point reformulation of \({{{{\mathcal {P}}}}^{\mathrm{E}}}\) in (7). Implicit in our algorithm is the assumption that projection on \({\mathbb {X}}\) is computationally easy. Any functional constraints describing \({\mathbb {X}}\) that makes such projection challenging should be included in \({\varvec{G}}\).
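The updates in (18) can be exercised on a toy instance; the one-dimensional problem, distribution, initialization, and step-size below are our illustrative assumptions, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy instance: minimize F(x) = E[(x - omega)^2] over X = [-2, 2]
# subject to G(x) = E[x - omega] + 0.5 <= 0, with omega ~ N(1, 0.1^2).
# The constraint reads x <= 0.5, so the optimum is x_star = 0.5.
K = 50_000
gamma = 1.0 / np.sqrt(K)   # constant step-size, as in Theorem 2.1
x, z = 0.0, 0.0            # primal iterate in X, dual iterate in R_+
xs = []

for _ in range(K):
    # Primal step: project x_k - gamma * grad_x L_omega(x_k, z_k) onto X.
    w = rng.normal(1.0, 0.1)            # sample omega_k
    x = np.clip(x - gamma * (2.0 * (x - w) + z), -2.0, 2.0)
    # Dual step (Gauss-Seidel): draw a FRESH sample omega_{k+1/2} and evaluate
    # the constraint at the just-computed x_{k+1} before projecting onto R_+.
    w_half = rng.normal(1.0, 0.1)       # sample omega_{k+1/2}
    z = max(z + gamma * (x - w_half + 0.5), 0.0)
    xs.append(x)

x_bar = np.mean(xs)  # ergodic average of the iterates; approaches x_star = 0.5
```

Note the two independent samples per iteration: the dual update uses \(\omega _{k+1/2}\) at the already computed \(x_{k+1}\), which is the Gauss–Seidel modification discussed above.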

Closest in spirit to our work on \({{{{\mathcal {P}}}}^{\mathrm{E}}}\) are the papers by Baes et al. in [2], Yu et al. in [44], Xu in [42], and Nedic and Ozdaglar in [30]. The stochastic mirror-prox algorithm in [2] and the projected subgradient method in [30] are similar in their updates to ours, except in two ways. First, in the context of \({{{{\mathcal {P}}}}^{\mathrm{E}}}\), these algorithms update the dual variable \({\varvec{z}}_k\) based on \({\varvec{G}}\) or its noisy estimate evaluated at \(\varvec{x}_k\), while we update it based on the estimate at \(\varvec{x}_{k+1}\). Second, both project the dual variable on a compact subset of \({\mathbb {R}}^m_+\) that contains the optimal set of dual multipliers. While the authors in [2] assume an a priori set to project on, the authors in [30] compute such a set from a “Slater point” that satisfies \({\varvec{G}}(\varvec{x}) < 0\). Specifically, Slater’s condition guarantees that the set of optimal dual solutions \({\mathbb {Z}}_\star \) is bounded (see [6, Theorem 1.265], [18]). Moreover, a Slater point can be used to construct a compact set that contains \({\mathbb {Z}}_\star \), e.g., using [30, Lemma 4.1]. While one can project the dual variables on such a set in each iteration, execution of the algorithm then requires a priori knowledge of such a point. We do not assume knowledge of such a point (or of any explicit bound on \({\mathbb {Z}}_\star \)) to execute Algorithm 1. Rather, our proof provides an explicit bound on the growth of the dual variable sequence for Algorithm 1, much in line with Xu’s analysis in [42]. Much to our surprise, the minor modification of using a Gauss–Seidel-style dual update, as opposed to the popular Jacobi-style dual update, obviates the need for this assumption for the proofs to work.
Unfortunately, our Gauss–Seidel-style dual update comes at the additional cost of an extra sample per iteration of the primal-dual algorithm, making the sample complexity double the iteration complexity. The constant factor of two, however, does not impact the order-wise complexity. We surmise that the additional sample and the Gauss–Seidel update of the dual variable help to decouple the analysis of the primal and dual updates and point to a possible extension of our result to an asynchronous setting, often useful in engineering applications. We remark that while we do not utilize a priori knowledge of a dual optimal solution to explicitly restrict the dual variables within a set containing it, an overestimate of \(\left\| {\varvec{z}}_1 - {\varvec{z}}_\star \right\| \) is required to compute \(\eta \) and hence the precise bound in (17). In other words, gauging the quality of the ergodic mean after K iterations still requires that knowledge. We suspect that the distance of the ergodic mean \(\bar{{\varvec{z}}}_{K+1}\) to the dual optimal set is crucial to bound the extent of expected suboptimality and constraint violation. While an analysis such as that in [2] achieves this by explicitly imposing a bound on the entire trajectory of the \({\varvec{z}}_k\)’s, we do so by assuming a bound on the distance of the initial point to the optimal set and then characterizing the growth over the trajectory.

Our work shares some parallels with that in [42], but has an important difference. Xu considers a collection of deterministic constraint functions, i.e., \({\varvec{g}}_\omega \) is identical for all \(\omega \in \varOmega \), and considers a modified augmented Lagrangian function of the form \({\tilde{{{\mathcal {L}}}}}(\varvec{x}, {\varvec{z}}) := F(\varvec{x}) + \frac{1}{m} \sum _{i=1}^m \varphi _\delta (\varvec{x}, z^i)\), where

$$\begin{aligned} \varphi _\delta (\varvec{x}, z^i) := {\left\{ \begin{array}{ll} z^i g^i(\varvec{x}) + \frac{\delta }{2} [g^i(\varvec{x})]^2, &{} \text {if } \delta g^i(\varvec{x}) + z^i \ge 0, \\ -\frac{(z^i)^2}{2\delta }, &{} \text {otherwise} \end{array}\right. } \end{aligned}$$
(19)

for \(i = 1,\ldots ,m\) with a suitable time-varying sequence of \(\delta \)’s. His algorithm is similar to Algorithm 1 but performs a randomized coordinate update for the dual variable instead of (5). To the best of our knowledge, Xu’s analysis in [42] with such a Lagrangian function does not directly apply to our setting with stochastic constraints, which is crucial for the subsequent analysis of the risk-sensitive problem \({{{{\mathcal {P}}}}^\mathrm{CVaR}}\).
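As a quick sanity check on (19), the two branches of \(\varphi _\delta \) agree on the switching surface \(\delta g^i(\varvec{x}) + z^i = 0\), where both evaluate to \(-\frac{\delta }{2}[g^i(\varvec{x})]^2\), so the function is continuous there. A minimal numerical verification with arbitrary illustrative values of \(\delta \) and \(g^i(\varvec{x})\):

```python
def phi(delta, g_val, z):
    """Xu's modified augmented-Lagrangian term from (19), for one constraint."""
    if delta * g_val + z >= 0.0:
        return z * g_val + 0.5 * delta * g_val ** 2
    return -z ** 2 / (2.0 * delta)

# Continuity check across the switching surface delta * g + z = 0:
# both branches reduce to -(delta / 2) * g ** 2 there.
delta, g_val = 0.7, -1.3
z0 = -delta * g_val
left = phi(delta, g_val, z0 - 1e-9)   # evaluated just inside the second branch
right = phi(delta, g_val, z0 + 1e-9)  # evaluated just inside the first branch
```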

Finally, Yu et al.’s work in [44] provides an analysis of the algorithm that updates its dual variables using

$$\begin{aligned} {\varvec{z}}_{k+1}&:=\mathop {\mathrm{argmax}}\limits _{{\varvec{z}} \in {\mathbb {R}}_+^m} \left\langle \varvec{v}_k, {\varvec{z}} - {\varvec{z}}_k \right\rangle - \frac{1}{2\gamma _k} \Vert {\varvec{z}} - {\varvec{z}}_k \Vert ^2, \end{aligned}$$
(20)

where \({v}_k^i := {g}^i_{\omega _{k}}(\varvec{x}_{k}) + \langle \nabla {g}^i_{\omega _k}(\varvec{x}_k) ,\varvec{x}_{k+1} - \varvec{x}_k \rangle \) for \(i=1,\ldots ,m\). In contrast, our \({\varvec{z}}\)-update in (5) samples \(\omega _{k+1/2}\) and sets \(\varvec{v}_k := {\varvec{g}}_{\omega _{k+1/2}}(\varvec{x}_{k+1})\) at the already computed point \(\varvec{x}_{k+1}\). We are able to recover the \({{\mathcal {O}}}(1/\sqrt{K})\) decay rate of suboptimality and constraint violation with a proof technique much closer to the classical analysis of subgradient methods in [9, 30]. Unlike [44], we provide a clean characterization of the constant \(\eta \) in (17) that is crucial to study the growth in sample (and iteration) complexity of Algorithm 1 applied to a reformulation of \({{{{\mathcal {P}}}}^\mathrm{CVaR}}\).
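To make the sampling pattern of our \({\varvec{z}}\)-update concrete, the following sketch runs the primal-dual iteration on a small synthetic instance. The problem data (\(f_\omega \), \(g_\omega \), the set \({\mathbb {X}}\), and all constants) are our own illustrative choices, not from the paper; the point is only the Gauss–Seidel pattern: draw \(\omega _k\) for the primal step, then a fresh \(\omega _{k+1/2}\) for the dual step evaluated at \(\varvec{x}_{k+1}\).

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative instance (our own, not from the paper):
#   minimize E[(1 + 0.1 w) x]  subject to  E[2 - (1 + 0.1 w) x] <= 0,  X = [0, 10],
# whose solution is x_star = 2 with multiplier z_star = 1.
def grad_f(x, w): return 1.0 + 0.1 * w           # stochastic subgradient of f
def grad_g(x, w): return -(1.0 + 0.1 * w)        # stochastic subgradient of g
def g(x, w):      return 2.0 - (1.0 + 0.1 * w) * x

K = 20000
gamma = 0.5 / np.sqrt(K)                         # constant step size
x, z, xs = 0.0, 0.0, []
for k in range(K):
    w_k = rng.standard_normal()                  # sample for the primal step (4)
    x_next = float(np.clip(x - gamma * (grad_f(x, w_k) + z * grad_g(x, w_k)),
                           0.0, 10.0))
    w_half = rng.standard_normal()               # FRESH sample for the dual step (5)
    z = max(z + gamma * g(x_next, w_half), 0.0)  # Gauss-Seidel: uses x_{k+1}
    x = x_next
    xs.append(x)

x_bar = float(np.mean(xs))                       # ergodic mean of the primal iterates
```

With the constant step size, the ergodic mean hovers near the optimizer \(x_\star = 2\); the two independent draws per iteration are what make the sample complexity twice the iteration count, as discussed above.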

2.1 Proof of Theorem 2.1

The proof proceeds in three steps.

  1. (a)

    We establish the following dissipation inequality that consecutive iterates of the algorithm satisfy.

    $$\begin{aligned} \begin{aligned}&\gamma _k \mathbb {E} [{{\mathcal {L}}}(\varvec{x}_{k+1}, {\varvec{z}}) - {{\mathcal {L}}}(\varvec{x}, {\varvec{z}}_k)] + \frac{1}{2} \mathbb {E} \Vert \varvec{x}_{k+1} - \varvec{x} \Vert ^2 + \frac{1}{2} \mathbb {E} \Vert {\varvec{z}}_{k+1} - {\varvec{z}} \Vert ^2 \\&\quad \le \frac{1}{2} \mathbb {E} \Vert \varvec{x}_k - \varvec{x} \Vert ^2 + \frac{1}{2} \mathbb {E} \Vert {\varvec{z}}_k - {\varvec{z}} \Vert ^2 + \frac{1}{4}P_2\gamma _k^2 + \frac{1}{4}P_3 \gamma _k^2 \mathbb {E} \Vert {\varvec{z}}_k \Vert ^2 \end{aligned} \end{aligned}$$
    (21)

    for any \(\varvec{x} \in {\mathbb {X}}\) and \({\varvec{z}} \in {\mathbb {R}}_+^m\).

  2. (b)

    Next, we bound \(\mathbb {E} \Vert {\varvec{z}}_k\Vert ^2\) generated by our algorithm from above using step (a) as

    $$\begin{aligned} \mathbb {E} \Vert {\varvec{z}}_k\Vert ^2 \le \frac{P_1 + P_2 A_K}{1 - P_3 A_K} \end{aligned}$$
    (22)

    for \(k = 1, \ldots , K\), where \(A_K := \sum _{k=1}^K \gamma _k^2\).

  3. (c)

    We combine the results in steps (a) and (b) to complete the proof.

Define the filtration \(\{ {{\mathcal {F}}}_k \}\), where \({{\mathcal {F}}}_k\) is the \(\sigma \)-algebra generated by the samples \(\omega _1, \ldots , \omega _{k-1/2}\) for k being multiples of 1/2, starting from unity. Then, \(\{ \varvec{x}_1, {\varvec{z}}_1, \dots , \varvec{x}_k, {\varvec{z}}_k \}\) becomes \({{\mathcal {F}}}_k\)-measurable, while \(\{ \varvec{x}_1, {\varvec{z}}_1, \dots , \varvec{x}_k, {\varvec{z}}_k, \varvec{x}_{k+1} \}\) is \({{\mathcal {F}}}_{k+1/2}\)-measurable.

\(\bullet \) Step (a)—Proof of (21): We first utilize the \(\varvec{x}\)-update in (4) to prove

(23)

for all \(\varvec{x} \in {\mathbb {X}}\). Then, we utilize the \({\varvec{z}}\)-update in (5) to prove

(24)

for all \({\varvec{z}} \in {\mathbb {R}}^m_+\). The law of total probability is then applied to the sum of (23) and (24), followed by multiplication by \(\gamma _k\), yielding the desired result in (21).

Proof of (23): The \(\varvec{x}\)-update in (4) yields

$$\begin{aligned} \left\langle \varvec{x}_{k+1} - \varvec{x}, \nabla f_\omega (\varvec{x}_k) + \sum _{i = 1}^m {z}_k^i \nabla {g}_\omega ^i(\varvec{x}_k) + \frac{1}{\gamma _k} (\varvec{x}_{k+1} - \varvec{x}_k) \right\rangle \le 0. \end{aligned}$$
(25)

We now simplify the inner product. The product with \(\nabla f_\omega (\varvec{x}_k)\) can be expressed as

$$\begin{aligned} \begin{aligned} \langle \varvec{x}_{k+1} - \varvec{x}, \nabla f_\omega (\varvec{x}_k) \rangle =&\underbrace{\langle \varvec{x}_{k+1} - \varvec{x}_k, \nabla F(\varvec{x}_{k+1}) \rangle }_{\ge F(\varvec{x}_{k+1}) - F(\varvec{x}_k) } + \langle \varvec{x}_k - \varvec{x}, \nabla f_\omega (\varvec{x}_k) \rangle \\&\; - \langle \varvec{x}_{k+1} - \varvec{x}_k, \nabla F(\varvec{x}_{k+1}) - \nabla f_\omega (\varvec{x}_k) \rangle , \end{aligned} \end{aligned}$$
(26)

where \(\nabla F(\varvec{x}_{k+1})\) denotes a subgradient of F at \(\varvec{x}_{k+1}\). The inequality for the first term follows from the convexity of F. Since \(\mathbb {E} [ \nabla f_\omega (\varvec{x}_k) \mid {{\mathcal {F}}}_k ] = \nabla F(\varvec{x}_k)\) from [5], the expectation of the second summand on the right-hand side (RHS) of (26) satisfies

(27)

Taking expectations in (26), the above relation implies

(28)

Next, we bound the inner product with the second term on the RHS of (26). To that end, utilize the convexity of member functions in \({\varvec{g}}_\omega \) and \({\varvec{G}}\) along the above lines to infer

(29)

To tackle the inner product with the third term in the RHS of (25), we use the identity

$$\begin{aligned} \begin{aligned}&\left\langle \varvec{x}_{k+1} - \varvec{x}, \frac{1}{\gamma _k} (\varvec{x}_{k+1} - \varvec{x}_k) \right\rangle \\&\quad = \frac{1}{2\gamma _k} [\Vert \varvec{x}_{k+1} - \varvec{x} \Vert ^2 - \Vert \varvec{x}_k - \varvec{x} \Vert ^2 + \Vert \varvec{x}_{k+1} - \varvec{x}_k \Vert ^2 ]. \end{aligned} \end{aligned}$$
(30)
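The identity (30) is the standard three-point expansion of the inner product; a quick numerical check with arbitrary vectors (any dimension works):

```python
import numpy as np

rng = np.random.default_rng(1)
x_next, x, x_prev = rng.standard_normal((3, 5))  # stand-ins for x_{k+1}, x, x_k
gamma = 0.7

# Left- and right-hand sides of the identity (30).
lhs = np.dot(x_next - x, (x_next - x_prev) / gamma)
rhs = (np.dot(x_next - x, x_next - x)
       - np.dot(x_prev - x, x_prev - x)
       + np.dot(x_next - x_prev, x_next - x_prev)) / (2.0 * gamma)
```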

The inequalities in (28), (29), and the equality in (30) together give

(31)

To simplify the above relation, apply Young’s inequality to obtain

(32)

Recall from Assumption 1 that subgradients of F are bounded and \(\nabla f_\omega \) has bounded variance. Therefore, we infer from the above inequality that

(33)

Appealing to Young’s inequality m times and a similar line of argument as above gives

(34)

Leveraging the relations in (33) and (34) in (31), we get

(35)

that upon simplification gives (23).

Proof of (24): From the \({\varvec{z}}\)-update in (5), we obtain

$$\begin{aligned} \left\langle {\varvec{z}}_{k+1} - {\varvec{z}}, -{\varvec{g}}_\omega (\varvec{x}_{k+1}) + \frac{1}{\gamma _k} ({\varvec{z}}_{k+1} - {\varvec{z}}_k) \right\rangle \le 0 \end{aligned}$$
(36)

for all \({\varvec{z}} \ge 0\). Again, we deal with the two summands in the second factor of the inner product of (36) separately. The expectation of the inner product with the first term yields

(37)

In the above derivation, we have utilized Young’s inequality and the boundedness of the second moment of \({\varvec{g}}_\omega \). Since \({{\mathcal {F}}}_k \subseteq {{\mathcal {F}}}_{k+1/2}\), the law of total probability can be used to condition (37) on \({{\mathcal {F}}}_k\) rather than on \({{\mathcal {F}}}_{k+1/2}\). To simplify the inner product with the second term in (36), we use the identity

$$\begin{aligned} \begin{aligned} \left\langle {\varvec{z}}_{k+1} - {\varvec{z}}, \frac{1}{\gamma _k } ({\varvec{z}}_{k+1} - {\varvec{z}}_k) \right\rangle = \frac{1}{2\gamma _k} [ \Vert {\varvec{z}}_{k+1} - {\varvec{z}} \Vert ^2 - \Vert {\varvec{z}}_k - {\varvec{z}} \Vert ^2 + \Vert {\varvec{z}}_{k+1} - {\varvec{z}}_k \Vert ^2]. \end{aligned} \end{aligned}$$
(38)

Utilizing (37) and (38) in (36) gives (24). Adding (23) and (24) followed by a multiplication by \(\gamma _k\) yields

(39)

Taking the expectation and applying the law of total probability completes the proof of (21).

\(\bullet \) Step (b)—Proof of (22): Plugging \((\varvec{x}, {\varvec{z}}) = (\varvec{x}_\star , {\varvec{z}}_\star )\) in the inequality for the one-step update in (21) and summing it over \(k=1, \ldots , \kappa \) for \(\kappa \le K\) gives

$$\begin{aligned} \begin{aligned}&\sum _{k=1}^\kappa \gamma _k \underbrace{\mathbb {E} [{{\mathcal {L}}}(\varvec{x}_{k+1}, {\varvec{z}}_\star ) - {{\mathcal {L}}}(\varvec{x}_\star , {\varvec{z}}_k)] }_{\ge 0 \text { from } (9)} + \frac{1}{2} \sum _{k=1}^\kappa \left[ \mathbb {E} \Vert \varvec{x}_{k+1} - \varvec{x}_\star \Vert ^2 + \mathbb {E} \Vert {\varvec{z}}_{k+1} - {\varvec{z}}_\star \Vert ^2 \right] \\&\quad \le \frac{1}{2} \sum _{k=1}^\kappa \left[ \mathbb {E} \Vert \varvec{x}_k - \varvec{x}_\star \Vert ^2 + \mathbb {E} \Vert {\varvec{z}}_k - {\varvec{z}}_\star \Vert ^2 \right] + \frac{1}{4} P_2 \sum _{k=1}^\kappa \gamma _k^2 + \frac{1}{4} P_3 \sum _{k=1}^\kappa \gamma _k^2 \mathbb {E} \Vert {\varvec{z}}_k\Vert ^2 \end{aligned} \end{aligned}$$
(40)

for \(\kappa = 1,\ldots ,K\). The above then yields

$$\begin{aligned} \begin{aligned}&\underbrace{ \mathbb {E} \Vert \varvec{x}_{\kappa +1} - \varvec{x}_\star \Vert ^2}_{\ge 0} + {\mathbb {E} \Vert {\varvec{z}}_{\kappa +1} - {\varvec{z}}_\star \Vert ^2} \\&\quad \le \Vert \varvec{x}_1 - \varvec{x}_\star \Vert ^2 + \Vert {\varvec{z}}_1 - {\varvec{z}}_\star \Vert ^2 + \frac{1}{2} P_2 \sum _{k=1}^\kappa \gamma _k^2 + \frac{1}{2} P_3 \sum _{k=1}^\kappa \gamma _k^2 \mathbb {E} \Vert {\varvec{z}}_k\Vert ^2. \end{aligned} \end{aligned}$$
(41)

Notice that \( 2 \mathbb {E} \Vert {\varvec{z}}_{\kappa +1} - {\varvec{z}}_\star \Vert ^2 + 2\Vert {\varvec{z}}_\star \Vert ^2 \ge \mathbb {E} \Vert {\varvec{z}}_{\kappa +1}\Vert ^2\). This inequality and \({\varvec{z}}_1 = 0\) in (41) give

$$\begin{aligned} \begin{aligned} \mathbb {E} \Vert {\varvec{z}}_{\kappa +1}\Vert ^2&\le 2 \Vert \varvec{x}_{1} - \varvec{x}_\star \Vert ^2 + 4 \Vert {\varvec{z}}_\star \Vert ^2 + P_2 \sum _{k=1}^\kappa \gamma _k^2 + P_3 \sum _{k=1}^\kappa \gamma _k^2 \mathbb {E} \Vert {\varvec{z}}_k\Vert ^2 \\&\le P_1 + P_2 \sum _{k=1}^\kappa \gamma _k^2 + P_3 \sum _{k=1}^\kappa \gamma _k^2 \mathbb {E} \Vert {\varvec{z}}_k\Vert ^2. \end{aligned} \end{aligned}$$
(42)

We argue the bound on \(\mathbb {E} \Vert {\varvec{z}}_k\Vert ^2\) for \(k=1,\ldots ,K\) inductively. Since \({\varvec{z}}_1 = 0\), the base case trivially holds. Assume that the bound holds for \(k = 1,\ldots ,\kappa \) for some \(\kappa < K\). With the notation \(A_K = \sum _{k=1}^K \gamma _k^2\), the relation in (42) implies

$$\begin{aligned} \mathbb {E} \Vert {\varvec{z}}_{\kappa +1}\Vert ^2&\le P_1 + P_2 \sum _{k=1}^{\kappa } \gamma _k^2 + P_3 \sum _{k=1}^{\kappa } \gamma _k^2 \frac{P_1 + P_2 A_K}{1 - P_3 A_K} \nonumber \\&\le P_1 + P_2 A_K + P_3 \frac{P_1 + P_2 A_K}{1 - P_3 A_K} A_K \nonumber \\&= \frac{P_1 + P_2 A_K}{1 - P_3 A_K}, \end{aligned}$$
(43)

completing the proof of step (b).
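The induction in step (b) can also be exercised numerically: iterating (42) with equality (the worst case) never exceeds the closed-form bound in (22), provided \(P_3 A_K < 1\). A minimal check with arbitrary illustrative constants:

```python
import numpy as np

P1, P2, P3 = 4.0, 2.0, 0.5
gammas = np.full(50, 0.1)             # any step sizes with P3 * A_K < 1
A_K = float(np.sum(gammas ** 2))
assert P3 * A_K < 1.0                 # required for the bound (22) to be valid
bound = (P1 + P2 * A_K) / (1.0 - P3 * A_K)

u = [0.0]                             # u_k mimics E||z_k||^2, with z_1 = 0
for kappa in range(len(gammas)):
    g2 = gammas[: kappa + 1] ** 2
    # worst case of (42): equality instead of inequality
    u.append(P1 + P2 * float(g2.sum()) + P3 * float(np.dot(g2, np.array(u))))
```

Every entry of `u` stays below `bound`, mirroring the inductive argument in (43).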

\(\bullet \) Step (c)—Combining steps (a) and (b) to prove Theorem 2.1: For any \({\varvec{z}} \ge 0\), the inequality in (21) with \(\varvec{x} = \varvec{x}_\star \) from step (a) summed over \(k=1,\ldots , K\) gives

$$\begin{aligned} \begin{aligned}&\sum _{k=1}^K \gamma _k \mathbb {E} [{{\mathcal {L}}}(\varvec{x}_{k+1}, {\varvec{z}}) - {{\mathcal {L}}}(\varvec{x}_\star , {\varvec{z}}_k)] + \frac{1}{2} \sum _{k=1}^K \left[ \mathbb {E} \Vert \varvec{x}_{k+1} - \varvec{x}_\star \Vert ^2 + \mathbb {E} \Vert {\varvec{z}}_{k+1} - {\varvec{z}} \Vert ^2 \right] \\&\quad \le \frac{1}{2} \sum _{k=1}^K \left[ \mathbb {E} \Vert \varvec{x}_k - \varvec{x}_\star \Vert ^2 + \mathbb {E} \Vert {\varvec{z}}_k - {\varvec{z}} \Vert ^2 \right] + \frac{1}{4} P_2\sum _{k=1}^K \gamma _k^2 + \frac{1}{4} P_3 \sum _{k=1}^K \gamma _k^2 \mathbb {E} \Vert {\varvec{z}}_k\Vert ^2. \end{aligned} \end{aligned}$$
(44)

Using \({\varvec{z}}_1 = 0\) and an appeal to the saddle point property of \((\varvec{x}_\star , {\varvec{z}}_\star )\) yields

$$\begin{aligned}&\sum _{k=1}^K \gamma _k \mathbb {E} [{{\mathcal {L}}}(\varvec{x}_{k+1}, {\varvec{z}}) - {{\mathcal {L}}}(\varvec{x}_\star , {\varvec{z}}_\star )] + \frac{1}{2} \mathbb {E} \Vert \varvec{x}_{K+1} - \varvec{x}_\star \Vert ^2 + \frac{1}{2} \mathbb {E} \Vert {\varvec{z}}_{K+1} - {\varvec{z}} \Vert ^2 \nonumber \\&\quad \le \frac{1}{2} \mathbb {E} \Vert \varvec{x}_1 - \varvec{x}_\star \Vert ^2 + \frac{1}{4} P_2\sum _{k=1}^K \gamma _k^2 + \frac{1}{4} P_3 \sum _{k=1}^K \gamma _k^2 \mathbb {E} \Vert {\varvec{z}}_k\Vert ^2 + \frac{1}{2} \Vert {\varvec{z}}\Vert ^2 \nonumber \\&\quad \le \frac{1}{4} P_1 - \Vert \mathbb {1}+ {\varvec{z}}_\star \Vert ^2 + \frac{1}{4} P_2 A_K + \frac{1}{4} P_3 \sum _{k=1}^K \gamma _k^2 \frac{P_1 + P_2A_K}{1 - P_3 A_K} + \frac{1}{2} \Vert {\varvec{z}}\Vert ^2 \nonumber \\&\quad = \frac{1}{4} P_1 + \frac{1}{4} P_2 A_K + \frac{1}{4} P_3 A_K \frac{P_1 + P_2A_K}{1 - P_3 A_K} + \frac{1}{2} \Vert {\varvec{z}}\Vert ^2 - \Vert \mathbb {1}+ {\varvec{z}}_\star \Vert ^2 \nonumber \\&\quad = \frac{1}{4} \left( \frac{P_1 + P_2 A_K}{1 - P_3 A_K} \right) + \frac{1}{2} \Vert {\varvec{z}}\Vert ^2 - \Vert \mathbb {1}+{\varvec{z}}_\star \Vert ^2. \end{aligned}$$
(45)

In deriving the above inequality, we have utilized the bound on \(\mathbb {E} \Vert {\varvec{z}}_k\Vert ^2\) from step (b) and the definition of \(P_1\) and \(A_K\). To further simplify the above inequality, notice that the saddle point property of \((\varvec{x}_\star , {\varvec{z}}_\star )\) in (9) yields

$$\begin{aligned} F(\varvec{x}_\star ) = {{\mathcal {L}}}(\varvec{x}_\star , 0) \le {{\mathcal {L}}}(\varvec{x}_\star , {\varvec{z}}_\star ) = F(\varvec{x}_\star ) + {\varvec{z}}_\star ^{\intercal }{\varvec{G}}(\varvec{x}_\star ), \end{aligned}$$
(46)

which implies \({\varvec{z}}_\star ^{\intercal }{\varvec{G}}(\varvec{x}_\star ) \ge 0\). However, the saddle point theorem guarantees that \(\varvec{x}_\star \) is an optimizer of \({{{{\mathcal {P}}}}^{\mathrm{E}}}\), meaning that \(\varvec{x}_\star \) is feasible and \({\varvec{G}}(\varvec{x}_\star ) \le 0\), implying \({\varvec{z}}_\star ^{\intercal }{\varvec{G}}(\varvec{x}_\star ) \le 0\) as \({\varvec{z}}_\star \in {\mathbb {R}}^m_+\). Taken together, we infer

$$\begin{aligned} {\varvec{z}}_\star ^{\intercal }{\varvec{G}}(\varvec{x}_\star ) = 0 \implies {{\mathcal {L}}}(\varvec{x}_\star , {\varvec{z}}_\star ) = F(\varvec{x}_\star ) = {{p}^{\mathrm{E}}_\star }. \end{aligned}$$
(47)

Since \({{\mathcal {L}}}(\varvec{x}, {\varvec{z}})\) is convex in \(\varvec{x}\), Jensen’s inequality and (47) imply

$$\begin{aligned} \sum _{k=1}^K \gamma _k \mathbb {E} [{{\mathcal {L}}}(\varvec{x}_{k+1}, {\varvec{z}}) - {{\mathcal {L}}}(\varvec{x}_\star , {\varvec{z}}_\star )] \ge \left( {\sum _{k=1}^K \gamma _k} \right) \mathbb {E} [{{\mathcal {L}}}(\bar{\varvec{x}}_{K+1}, {\varvec{z}}) - {{p}^{\mathrm{E}}_\star }], \end{aligned}$$
(48)

where recall that \(\bar{\varvec{x}}_{K+1}\) is the \(\gamma \)-weighted average of the iterates. Utilizing (48) in (45), we get

$$\begin{aligned} \begin{aligned} \left( \sum _{k=1}^K \gamma _k \right) \mathbb {E} [ {{\mathcal {L}}}(\bar{\varvec{x}}_{K+1}, {\varvec{z}}) - {{p}^{\mathrm{E}}_\star }]&\le \frac{1}{4} \left( \frac{P_1 + P_2 A_K}{1 - P_3 A_K} \right) + \frac{1}{2} \Vert {\varvec{z}}\Vert ^2 - \Vert \mathbb {1}+ {\varvec{z}}_\star \Vert ^2. \end{aligned} \end{aligned}$$
(49)

The above relation defines a bound on \(\mathbb {E} [{{\mathcal {L}}}(\bar{\varvec{x}}_{K+1}, {\varvec{z}})]\) for every \({\varvec{z}} \ge 0\). Choosing \({\varvec{z}} = 0\) and noting \(\Vert \mathbb {1}+ {\varvec{z}}_\star \Vert ^2 \ge 0\), we get the bound on expected suboptimality in (15). To derive the bound on expected constraint violation in (16), notice that the saddle point property in (9) and the relation (47) imply

$$\begin{aligned} \begin{aligned}&\mathbb {E} [{{\mathcal {L}}}(\bar{\varvec{x}}_{K+1}, \mathbb {1}^i+{\varvec{z}}_\star ) - {{p}^{\mathrm{E}}_\star }] \\&\quad = \mathbb {E} [{{\mathcal {L}}}(\bar{\varvec{x}}_{K+1}, {\varvec{z}}_\star ) - {{\mathcal {L}}}(\varvec{x}_\star , {\varvec{z}}_\star )] + \mathbb {E} \left[ [\mathbb {1}^i]^{{\intercal }} {\varvec{G}}(\bar{\varvec{x}}_{K+1})\right] \\&\quad \ge \mathbb {E} [G^i(\bar{\varvec{x}}_{K+1})], \end{aligned} \end{aligned}$$
(50)

where \(\mathbb {1}^i \in {\mathbb {R}}^m\) is a vector of all zeros except the i-th entry that is unity. Choosing \({\varvec{z}} = \mathbb {1}^i + {\varvec{z}}_\star \) in (49) and the observation in (50) then gives

$$\begin{aligned} \mathbb {E} [G^i(\bar{\varvec{x}}_{K+1})]&\; \le \frac{1}{4 \sum _{k=1}^K \gamma _k } \left( \frac{P_1 + P_2 A_K}{1 - P_3 A_K} + 2 \Vert \mathbb {1}^i + {\varvec{z}}_\star \Vert ^2 - 4 \Vert \mathbb {1}+ {\varvec{z}}_\star \Vert ^2 \right) \nonumber \\&\; \le \frac{1}{4 \sum _{k=1}^K \gamma _k } \left( \frac{P_1 + P_2 A_K}{1 - P_3 A_K} \right) \end{aligned}$$
(51)

for each \(i=1,\ldots ,m\). This completes the proof of (16). The bounds in (17) are immediate from those in (15)–(16). This completes the proof of Theorem 2.1. \(\square \)

Remark 2.2

The bound in (16) can be sharpened to

$$\begin{aligned} \sum _{i=1}^m \mathbb {E} \left[ {G}^i(\bar{\varvec{x}}_{K+1})\right] ^+ \le \frac{1}{4 \sum _{k=1}^K \gamma _k } \left( \frac{P_1 + P_2 \sum _{k=1}^K \gamma _k^2}{1 - P_3 \sum _{k=1}^K \gamma _k^2} \right) \end{aligned}$$
(52)

using \({\varvec{z}}\) defined by \( {z}^i :={z}^i_\star + {{\,\mathrm{{\mathbb {I}}}\,}}_{\{{G}^i(\bar{\varvec{x}}_{K+1}) > 0 \}} \) for \(i = 1, \ldots , m\) in (49). Here, \({{\,\mathrm{{\mathbb {I}}}\,}}_{\{A\}}\) is the indicator function, evaluating to 1 if A holds and 0 otherwise. This improved bound was suggested to us by an anonymous reviewer. Notice that (52) is a much tighter bound on the expected constraint violation per constraint than (16) when m is large.

In what follows, we offer insights into two specific aspects of our proof. First, we present our conjecture on how the Gauss–Seidel nature of our dual update, obtained with an extra sample, helps us circumvent the need for an a priori bound on the dual variable. Notice that our dual update allows us to derive the third line of (37) that ultimately yields the term \(-{\varvec{z}}_k^{{\intercal }} {\varvec{G}}(\varvec{x}_{k+1})\) in (24). This term conveniently disappears when (24) is added to the inequality in (23) obtained from the primal update. We conjecture that this cancellation, made possible by our dual update, is what keeps the theoretical analysis simple. We anticipate that the classical Jacobi-style dual iteration, derived with one sample shared between the primal and the dual steps, will not lead to said cancellation and will instead yield a term of the form \({\varvec{z}}_k^{{\intercal }} \left[ {\varvec{G}}(\varvec{x}_{k+1}) - {\varvec{G}}(\varvec{x}_{k}) \right] \). Bounding the growth of such a term might prove challenging without an available bound on \(\Vert {\varvec{z}}_k\Vert \) and will likely require a different argument. A detailed comparison between the proof techniques of the Jacobi and the Gauss–Seidel updates is left for future endeavors.

Second, we comment on the presence of the dimensionless constant \(\mathbb {1}\) in \(P_1\) together with \({\varvec{z}}_\star \). We use the inequality in (21) to establish (49), which is valid at all \({\varvec{z}} \ge 0\). Inspired by arguments in [42], we then utilize (49) not only at the dual iterate \({\varvec{z}}_k\), as is often the case in prior analyses, but also at \({\varvec{z}}=0\) and \({\varvec{z}} = \mathbb {1}^i + {\varvec{z}}_\star \). Specifically, the nature of the Lagrangian function \({{\mathcal {L}}}(\varvec{x}, {\varvec{z}})\) in \({\varvec{z}}\) permits us to relate these evaluations at \({\varvec{z}}=0\) and \({\varvec{z}} = \mathbb {1}^i + {\varvec{z}}_\star \) to the extents of suboptimality and constraint violation, respectively, using

$$\begin{aligned} \mathcal {L}(\varvec{x}, 0) = F(\varvec{x}), \quad \mathcal {L}(\varvec{x}, \mathbb {1}^i + {\varvec{z}}) = \mathcal {L}(\varvec{x}, {\varvec{z}}) + G^i(\varvec{x}). \end{aligned}$$
(53)

The deliberate inclusion of \(\Vert {\mathbb {1}+{\varvec{z}}_\star } \Vert ^2\) in the constant \(P_1\) aids in absorbing the effect of the term \(\frac{1}{2}\Vert {{\varvec{z}}}\Vert ^2\) in (49) evaluated at \({\varvec{z}} = \mathbb {1}^i + {\varvec{z}}_\star \) when deriving the bound on the extent of constraint violation, without impacting the same when evaluated at \({\varvec{z}}=0\), used in deriving the bound on the extent of suboptimality.

2.2 Optimal Step Size Selection

We exploit the bounds in Theorem 2.1 to select a step size that minimizes the iteration count to reach an \(\varepsilon \)-approximately feasible and optimal solution to \({{{{\mathcal {P}}}}^{\mathrm{E}}}\) and solve

$$\begin{aligned} \begin{aligned} \underset{K, \ \gamma > 0}{\text {minimize}}&\qquad \ K, \\ \text {subject to }&\qquad \ \frac{\eta }{\sqrt{K}} = \frac{P_1 + P_2 \gamma ^2}{4 \gamma \sqrt{K}(1 - P_3 \gamma ^2)} \le \varepsilon , \ \ P_3 \gamma ^2 < 1. \end{aligned} \end{aligned}$$
(54)

The following characterization of optimal step sizes and the resulting iteration count from Proposition 2.1 will prove useful in studying the growth in iteration complexity in solving \({{{{\mathcal {P}}}}^\mathrm{CVaR}}\) with the risk-aversion parameters \(\alpha , {\varvec{\beta }}\) in the following section.

Proposition 2.1

For any \(\varepsilon > 0\), the optimal solution of (54) satisfies

$$\begin{aligned} \begin{aligned} \gamma _\star ^2 = \frac{2 P_3^{-1}}{2 + y + \sqrt{y^2 + 8y}}, \; \; K_\star = \frac{(P_1 + P_2 \gamma _\star ^2)^2}{16 \gamma _\star ^2 (1 - P_3 \gamma _\star ^2)^2 \varepsilon ^2}, \end{aligned} \end{aligned}$$
(55)

where \(y = 1 + \frac{P_2}{P_1 P_3}\).

Proof

It is evident from (55) that \(\gamma _\star ^2 < P_3^{-1}\). Then, it suffices to show that \(\gamma _\star \) from (55) minimizes

$$\begin{aligned} \sqrt{K} = \frac{P_1 + P_2 \gamma ^2}{4 \gamma (1 - P_3 \gamma ^2) \varepsilon } \end{aligned}$$
(56)

over \(\gamma > 0\). To that end, notice that

$$\begin{aligned} \frac{d}{d\gamma } \left( \frac{P_1 + P_2 \gamma ^2}{\gamma (1 - P_3 \gamma ^2)}\right)&= \frac{P_2 P_3 \gamma ^4 + (P_2 + 3 P_1 P_3) \gamma ^2 - P_1}{\gamma ^2 (1 - P_3 \gamma ^2)^2}. \end{aligned}$$
(57)

The above derivative is negative at \(\gamma = 0^+\) and vanishes only at \(\gamma _\star \) over positive values of \(\gamma \), certifying it as the global minimizer. \(\square \)
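The closed form (55) and the stationarity condition from (57) can be checked numerically: \(\gamma _\star ^2\) computed from (55) zeroes the numerator of (57), and a grid search over \((0, P_3^{-1/2})\) confirms it minimizes \(\sqrt{K}\) in (56). The constants below are arbitrary illustrative values:

```python
import numpy as np

P1, P2, P3, eps = 4.0, 2.0, 0.5, 1e-2  # illustrative constants, not from the paper
y = 1.0 + P2 / (P1 * P3)
g2 = (2.0 / P3) / (2.0 + y + np.sqrt(y ** 2 + 8.0 * y))  # gamma_star^2 from (55)
g_star = float(np.sqrt(g2))

# The numerator of the derivative in (57) should vanish at gamma_star ...
residual = P2 * P3 * g_star ** 4 + (P2 + 3.0 * P1 * P3) * g_star ** 2 - P1

# ... and a grid search over (0, P3^{-1/2}) confirms the minimum of (56).
def sqrt_K(g):
    return (P1 + P2 * g ** 2) / (4.0 * g * (1.0 - P3 * g ** 2) * eps)

grid = np.linspace(1e-3, 1.0 / np.sqrt(P3) - 1e-3, 2000)
```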

Parameter \(P_1\) is generally not known a priori. However, it is often possible to bound it from above. One can calculate \(\gamma _\star \) and \(K_\star \) using (55), replacing \(P_1\) with its overestimate. Notice that

$$\begin{aligned} \frac{d K_\star }{d P_1} := \frac{\partial K_\star }{\partial P_1} + \frac{\partial K_\star }{\partial \gamma _\star } \frac{d \gamma _\star }{d y} \frac{d y}{d P_1}. \end{aligned}$$
(58)

It is straightforward to verify that \(\frac{\partial K_\star }{\partial P_1} > 0\), \(\frac{d y}{d P_1} \le 0\), and \(\frac{d \gamma _\star }{d y} \le 0\), and hence, overestimating \(P_1\) results in a smaller \(\gamma _\star \). Finally, \(\frac{\partial K_\star }{\partial \gamma } > 0\) for \(\gamma > \gamma _\star \), implying that \(K_\star \) calculated with an overestimate of \(P_1\) is larger than the optimal iteration count, the computational burden we must bear for not knowing \(P_1\). Implementing our algorithm does require knowledge of \(P_3\), which in turn depends only on the nature of the functions defining the constraints and not on a primal-dual optimizer.

2.3 Asymptotic Almost Sure Convergence with Decaying Step-Sizes

Subgradient methods are often studied with decaying nonsummable, square-summable step sizes, for which they converge to an optimizer in the unconstrained setting. The result holds even for distributed variants and for mirror descent methods (see [13]). Establishing convergence of Algorithm 1 to a primal-dual optimizer of \({{{{\mathcal {P}}}}^{\mathrm{E}}}\) is much more challenging without assuming strong convexity of the objective. With such step sizes, we provide the following result to guarantee the stability of our algorithm, which is reminiscent of [28, Theorem 4].

Proposition 2.2

Suppose Assumption 1 holds and \(\{\gamma _k\}_{k=1}^\infty \) is a nonsummable square-summable nonnegative sequence, i.e., \(\sum _{k=1}^\infty \gamma _k = \infty , \sum _{k=1}^\infty \gamma _k^2 < \infty \). Then, \((\varvec{x}_k, {\varvec{z}}_k)\) generated by Algorithm 1 remains bounded and \(\lim _{k\rightarrow \infty } {{\mathcal {L}}}(\varvec{x}_k, {\varvec{z}}_\star ) - {{\mathcal {L}}}(\varvec{x}_\star , {\varvec{z}}_k) = 0\) almost surely.

This ‘gap’ function \({{\mathcal {L}}}(\varvec{x}, {\varvec{z}}_\star ) - {{\mathcal {L}}}(\varvec{x}_\star , {\varvec{z}})\) looks deceptively similar to the duality gap at \((\varvec{x}, {\varvec{z}})\), but is not the same. We are unaware of any results on asymptotic almost sure convergence of primal-dual first-order algorithms to an optimizer for constrained convex programs with convex, but not necessarily strongly convex, objectives. A recent result in [43] establishes such a convergence for primal-dual dynamics in continuous time; our attempts at leveraging discretizations of the same have so far proven unsuccessful.

The proof of Proposition 2.2 takes advantage of the one-step update in (21) that makes it amenable to the well-studied almost supermartingale convergence result by Robbins and Siegmund in [35, Theorem 1].

Theorem

(Convergence of almost supermartingales) Let \(m_k, n_k, r_k, s_k\) be \({{\mathcal {F}}}_k\)-measurable finite nonnegative random variables, where \({{\mathcal {F}}}_1 \subseteq {{\mathcal {F}}}_2 \subseteq \ldots \) describes a filtration. If \(\sum _{k=1}^\infty s_k < \infty \), \(\sum _{k=1}^\infty r_k < \infty \), and

$$\begin{aligned} \mathbb {E} [m_{k+1} | {{\mathcal {F}}}_k] \le m_k(1 + s_k) + r_k - n_k, \end{aligned}$$
(59)

then \(\lim _{k \rightarrow \infty } m_k\) exists and is finite and \(\sum _{k = 1}^\infty n_k < \infty \) almost surely.
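A deterministic special case of the theorem can be exercised directly: with summable \(r_k, s_k\) and nonnegative \(n_k\), a sequence obeying (59) with equality settles to a finite limit while \(\sum _k n_k\) stays bounded. A minimal sketch with our own illustrative choices of the sequences:

```python
m, total_n = 1.0, 0.0
history = []
for k in range(1, 20001):
    s_k = 1.0 / k ** 2            # summable
    r_k = 1.0 / k ** 2            # summable
    n_k = 0.5 * m / k ** 1.5      # nonnegative, small enough to keep m >= 0
    m = m * (1.0 + s_k) + r_k - n_k   # (59) taken with equality
    total_n += n_k
    history.append(m)
```

The trajectory `history` flattens out and `total_n` remains finite, matching the conclusion of the theorem in this toy deterministic case.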

Proof of Proposition 2.2

Using notation from the proof of Theorem 2.1, (23) and (24) together yield

(60)

We utilize the above to derive a similar inequality, replacing \({{\mathcal {L}}}(\varvec{x}_{k+1}, {\varvec{z}}_\star )\) with \({{\mathcal {L}}}(\varvec{x}_{k}, {\varvec{z}}_\star )\), by bounding the difference between them. Then, we apply the almost supermartingale convergence theorem to the result to conclude the proof. To bound said difference, the convexity of \({{\mathcal {L}}}\) in \(\varvec{x}\) and Young’s inequality together imply

(61)

where \({\nabla }_x {{\mathcal {L}}}\) denotes a subgradient of \({{\mathcal {L}}}\) w.r.t. \(\varvec{x}\). To further bound the RHS of (61), Assumption 1 allows us to deduce

$$\begin{aligned} \begin{aligned} \Vert \nabla _x {{\mathcal {L}}}(\varvec{x}, {\varvec{z}}_\star ) \Vert ^2&\le 2 \Vert \nabla F(\varvec{x}) \Vert ^2 + 2m\sum _{i = 1}^m\Vert z_\star ^i \nabla G^i(\varvec{x}) \Vert ^2 \\&\le 2 C_F^2 + 2m \Vert {\varvec{z}}_\star \Vert ^2 \Vert {\varvec{C}}_G \Vert ^2 \\&:= 2Q_1 \end{aligned} \end{aligned}$$
(62)

for any \(\varvec{x} \in {\mathbb {X}}\). Furthermore, the \(\varvec{x}\)-update in (18) and the nonexpansive nature of the projection operator yield

(63)

From Assumption 1, we get

(64)

and along similar lines

(65)

that together in (63) yield

(66)

Combining the above with (62) in (61) gives

(67)

Adding (67) to (60) and simplifying, we obtain

(68)

The above inequality with

$$\begin{aligned} \Vert {\varvec{z}}_k \Vert ^2 \le 2\Vert \varvec{x}_k - \varvec{x}_\star \Vert ^2 + 2 \Vert {\varvec{z}}_k - {\varvec{z}}_\star \Vert ^2 + 2 \Vert {\varvec{z}}_\star \Vert ^2 \end{aligned}$$
(69)

becomes (59), where

$$\begin{aligned} m_k = \frac{1}{2} \Vert \varvec{x}_k - \varvec{x}_\star \Vert ^2 + \frac{1}{2} \Vert {\varvec{z}}_k - {\varvec{z}}_\star \Vert ^2, \quad n_k = \gamma _k [{{\mathcal {L}}}(\varvec{x}_{k}, {\varvec{z}}_\star ) - {{\mathcal {L}}}(\varvec{x}_\star , {\varvec{z}}_k) ], \nonumber \\ r_k = \gamma _k^2 \left[ \frac{1}{4}P_2 + Q_1 + Q_2 + \left( \frac{1}{2}P_3 + 2 Q_3 \right) \Vert {\varvec{z}}_\star \Vert ^2 \right] , \quad s_k = \gamma _k^2 \left( \frac{1}{2}P_3 + 2 Q_3 \right) . \nonumber \\ \end{aligned}$$
(70)

Each term is nonnegative, owing to (9), and \({\varvec{\gamma }}\) defines a square-summable sequence. Applying [35, Theorem 1], \(m_k\) converges to a finite limit and \(\sum _{k = 1}^\infty n_k < \infty \) almost surely. The latter, combined with the nonsummability of \({\varvec{\gamma }}\), implies the result. \(\square \)

3 Algorithm for \({{{{\mathcal {P}}}}^\mathrm{CVaR}}\) and Its Analysis

We now devote our attention to solving \({{{{\mathcal {P}}}}^\mathrm{CVaR}}\) via a primal-dual algorithm. To do so, we reformulate it as an instance of \({{{{\mathcal {P}}}}^{\mathrm{E}}}\) and utilize Algorithm 1 to solve that reformulation with constant step-sizes under a stronger set of assumptions given below. In the sequel, we use \({{\mathcal {L}}}\) to denote the Lagrangian function defined in (6), but with F and \({\varvec{G}}\) as defined in \({{{{\mathcal {P}}}}^\mathrm{CVaR}}{}\).

Assumption 2

\({{{{\mathcal {P}}}}^\mathrm{CVaR}}\) must satisfy the following properties:

  1. (a)

    Subgradients of F and \({\varvec{G}}\) are bounded, i.e., \(\Vert \nabla f_\omega (\varvec{x}) \Vert \le C_F\) and \(\Vert \nabla g_\omega ^i(\varvec{x}) \Vert \le C_G^i\) almost surely for all \(\varvec{x} \in {\mathbb {X}}\).

  2. (b)

    \({\varvec{g}}_\omega (\varvec{x})\) is bounded, i.e., \(| g^i_\omega (\varvec{x}) | \le D_G^i\) for all \(\varvec{x} \in {\mathbb {X}}\), almost surely.

  3. (c)

    The Lagrangian function \({{\mathcal {L}}}\) admits a saddle point \((\varvec{x}_\star , {\varvec{z}}_\star ) \in {\mathbb {X}}\times {\mathbb {R}}^m_+\).

Using the variational characterization (2) of \(\mathrm{CVaR}\), rewrite \({{{{\mathcal {P}}}}^\mathrm{CVaR}}\) as

$$\begin{aligned} \begin{aligned} \underset{\varvec{x} \in {\mathbb {X}}}{\text {minimize}}&\qquad \ \min _{u^0 \in {\mathbb {R}}} \mathbb {E} [ \psi ^f_\omega (\varvec{x}, u^0; \alpha )], \\ \text {subject to }&\qquad \ \min _{u^i \in {\mathbb {R}}} \mathbb {E} [\psi ^{g^i}_\omega (\varvec{x}, u^i; \beta ^i)] \le 0, \quad i=1,\ldots , m, \end{aligned} \end{aligned}$$
(71)

where \({\psi ^h_\omega (\varvec{x}, u; \delta )} :={u + \frac{1}{1-\delta }[h_\omega (\varvec{x}) - u]^+}\) for any collection of convex functions \(h_\omega : {\mathbb {R}}^n \rightarrow {\mathbb {R}}\), \(\omega \in \varOmega \). Coupled with Assumption 2, we will show that we can bound \(|u^i| \le D_G^i\) for each \(i = 1, \dots , m\), which allows us to rewrite \({{{{\mathcal {P}}}}^\mathrm{CVaR}}\) as

(72)

where \(| \cdot |\) denotes the element-wise absolute value. Denote the optimal value of \({{{{\mathcal {P}}}}^\mathrm{CVaR}}\) by \(p^\mathrm{CVaR}_\star \) in the sequel.
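The reformulation rests on the variational characterization (2): for a sampled distribution, minimizing \(u + \frac{1}{1-\delta }\mathbb {E} [y_\omega - u]^+\) over u reproduces the average over the upper \(1-\delta \) tail. A quick empirical check on Gaussian samples (illustrative data, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(2)
delta = 0.9
y = rng.standard_normal(20000)

# CVaR via the variational characterization (2), minimized over a grid of u.
us = np.linspace(-2.0, 4.0, 601)
vals = [u + np.mean(np.maximum(y - u, 0.0)) / (1.0 - delta) for u in us]
cvar_variational = float(min(vals))

# CVaR as the average over the upper (1 - delta) tail of the samples.
cvar_tail = float(np.sort(y)[int(delta * len(y)):].mean())
```

The two estimates agree up to grid and sampling error, which is exactly why the inner minimization over \(u^i\) in (71) can replace the \(\mathrm{CVaR}\) evaluations.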

Theorem 3.1

(Convergence result for \({{{{\mathcal {P}}}}^\mathrm{CVaR}}\)) Suppose Assumption 2 holds. The iterates generated by Algorithm 1 on \({{{{\mathcal {P}}}}^{\mathrm{E}}}'\) for \({{{{\mathcal {P}}}}^\mathrm{CVaR}}\) with parameters \(\alpha , {\varvec{\beta }}\) satisfy

$$\begin{aligned} \mathbb {E} [\mathrm{CVaR}_{\alpha }(f_\omega (\bar{\varvec{x}}_{K+1}))] - p^\mathrm{CVaR}_\star&\le \frac{\eta (\alpha , {\varvec{\beta }})}{\sqrt{K}}, \end{aligned}$$
(73)
$$\begin{aligned} \mathbb {E} [\mathrm{CVaR}_{\beta ^i}(g_\omega ^i(\bar{\varvec{x}}_{K+1}))]&\le \frac{\eta (\alpha , {\varvec{\beta }})}{\sqrt{K}} \end{aligned}$$
(74)

for \(i = 1, \ldots , m\) with step sizes \(\gamma _k = \gamma / \sqrt{K}\) for \(k = 1, \dots , K\) with \(0< \gamma < {P_3^{-1/2}(\alpha , {\varvec{\beta }})}\), where \(\eta (\alpha , {\varvec{\beta }}) :=\frac{P_1 + \gamma ^2 P_2(\alpha , {\varvec{\beta }})}{4\gamma (1 - \gamma ^2 P_3(\alpha , {\varvec{\beta }}))}\) and

$$\begin{aligned} \begin{aligned} P_2(\alpha , {\varvec{\beta }})&:=\frac{16 (C_F^2 + 1)}{(1 - \alpha )^2} + 2 \left\| {\mathrm {diag}}(\mathbb {1}+ {\varvec{\beta }}) {\mathrm {diag}}(\mathbb {1}- {\varvec{\beta }})^{-1} {\varvec{D}}_G \right\| ^2, \\ P_3(\alpha , {\varvec{\beta }})&:=16 m \left\| \begin{pmatrix} {\mathrm {diag}}(\mathbb {1}- {\varvec{\beta }})^{-1} {\varvec{C}}_G \\ {\mathrm {diag}}(\mathbb {1}- {\varvec{\beta }})^{-1} \mathbb {1}\end{pmatrix} \right\| ^2. \end{aligned} \end{aligned}$$
(75)

Proof

We prove the result in the following steps.

  1. (a)

    Under Assumption 2, we revise \(P_2\) and \(P_3\) in Theorem 2.1 for \({{{{\mathcal {P}}}}^{\mathrm{E}}}\).

  2. (b)

    We show that if \(f_\omega , {\varvec{g}}_\omega \) satisfy Assumption 2, then \(\psi ^f_\omega \) and \(\psi ^{g^i}_\omega , i=1,\ldots ,m\) satisfy Assumption 2, but with different bounds on the gradients and function values. Leveraging these bounds, we obtain \(P_2(\alpha , {\varvec{\beta }})\) and \(P_3(\alpha , {\varvec{\beta }})\) for \({{{{\mathcal {P}}}}^{\mathrm{E}}}'\) using step (a).

  3. (c)

    Using Assumption 2, we prove that the Lagrangian function \({{\mathcal {L}}}':{\mathbb {X}}\times {\mathbb {R}}\times {\mathbb {U}}\times {\mathbb {R}}^m_+ \rightarrow {\mathbb {R}}\) defined as

    $$\begin{aligned} {{\mathcal {L}}}'(\varvec{x}, u^0, {\varvec{u}}, {\varvec{z}}) :=\mathbb {E} [ \psi ^f_\omega (\varvec{x}, u^0; \alpha )] + \sum _{i=1}^m z^i \mathbb {E} [ \psi ^{g^i}_\omega (\varvec{x}, u^i; \beta ^i)] \end{aligned}$$
    (76)

    admits a saddle point in \({\mathbb {X}}\times {\mathbb {R}}\times {\mathbb {U}}\times {\mathbb {R}}^m_+\), where \({\mathbb {U}}:=\left\{ {\varvec{u}} \in {\mathbb {R}}^m \mid | {\varvec{u}}| \le {\varvec{D}}_G\right\} \).

  4. (d)

    We then apply Theorem 2.1 with \(P_2(\alpha , {\varvec{\beta }})\) and \(P_3(\alpha , {\varvec{\beta }})\) from step (b) on \({{{{\mathcal {P}}}}^{\mathrm{E}}}'\) to derive the bounds in (73) and (74).

\(\bullet \) Step (a)—Revising Theorem 2.1 with Assumption 2: Recall that in the derivation of (33) in the proof of Theorem 2.1, Assumption 1 yields

$$\begin{aligned} \Vert \nabla F(\varvec{x}_{k+1}) - \nabla f_\omega (\varvec{x}_k) \Vert ^2 \le 2(4C_F^2 + \sigma _F^2). \end{aligned}$$
(77)

Assumption 2 allows us to bound the same by \(4C_F^2\), yielding \(P_2 = 16 C_F^2 + 2\Vert {\varvec{D}}_G \Vert ^2\). Along the same lines, we get \(P_3 = 16 m \Vert {\varvec{C}}_G \Vert ^2\).

\(\bullet \) Step (b)—Deriving properties of \(\psi _\omega \): Consider the stochastic subgradient of \(\psi ^f_\omega (\varvec{x}, u; \alpha )\) given by

$$\begin{aligned} \nabla \psi _\omega ^f(\varvec{x}, u; \alpha )&= \begin{pmatrix} \frac{1}{1 - \alpha } \nabla f_\omega (\varvec{x}) {{\,\mathrm{{\mathbb {I}}}\,}}_{\{ f_\omega (\varvec{x}) \ge u\}} \\ 1 - \frac{1}{1 - \alpha } {{\,\mathrm{{\mathbb {I}}}\,}}_{\{ f_\omega (\varvec{x}) \ge u\}} \end{pmatrix}, \end{aligned}$$
(78)

where \({{\,\mathrm{{\mathbb {I}}}\,}}_{\{\cdot \}}\) is the indicator function. Recall that \(\Vert \nabla f_\omega (\varvec{x}) \Vert \le C_F\) for all \(\varvec{x} \in {\mathbb {X}}\) almost surely. Therefore, we have

$$\begin{aligned} \begin{aligned} \Vert \nabla \psi _\omega ^f(\varvec{x}, u; \alpha ) \Vert ^2&= \left\| \frac{1}{1 - \alpha } \nabla f_\omega (\varvec{x}) {{\,\mathrm{{\mathbb {I}}}\,}}_{\{ f_\omega (\varvec{x}) \ge u\}} \right\| ^2 + \left\| 1 - \frac{1}{1 - \alpha } {{\,\mathrm{{\mathbb {I}}}\,}}_{\{ f_\omega (\varvec{x}) \ge u\}} \right\| ^2 \\&\le \frac{C_F^2 + 1}{(1 - \alpha )^2}. \end{aligned} \end{aligned}$$
(79)

Proceeding similarly, we obtain

$$\begin{aligned} \left\| \nabla \psi ^{g^i}_\omega (\varvec{x}, u^i; \beta ^i)\right\| ^2 \le \frac{[C_G^i]^2 + 1}{(1-\beta ^{i})^2}. \end{aligned}$$
(80)

We also have

$$\begin{aligned} \begin{aligned} | \psi ^{g^i}_\omega (\varvec{x}, u^i; \beta ^i) | = \left| \max \left\{ \frac{g^i_\omega (\varvec{x}) - \beta ^i u^i}{1 - \beta ^i}, u^i \right\} \right| \le \frac{1 + \beta ^i}{1 - \beta ^i} D_G^i. \end{aligned} \end{aligned}$$
(81)

Then, (75) follows from step (a) using (79), (80), and (81).
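The bound (79) can be checked numerically. The sketch below (assuming NumPy and a hypothetical linear loss \(f_\omega (\varvec{x}) = \omega ^\top \varvec{x}\) with \(\Vert \omega \Vert \le C_F\), so that \(\nabla f_\omega (\varvec{x}) = \omega \); this choice is for illustration only) samples subgradients of the form (78) and confirms that their squared norm never exceeds \((C_F^2+1)/(1-\alpha )^2\):

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, n = 0.3, 4
C_F = 2.0  # hypothetical uniform bound on ||grad f_w||

def grad_psi(x, u, w):
    """Stochastic subgradient (78) of psi^f_w(x, u; alpha) for f_w(x) = <w, x>."""
    ind = 1.0 if w @ x >= u else 0.0
    gx = (1.0 / (1.0 - alpha)) * w * ind       # component in x
    gu = 1.0 - (1.0 / (1.0 - alpha)) * ind     # component in u
    return np.concatenate([gx, [gu]])

bound = (C_F ** 2 + 1.0) / (1.0 - alpha) ** 2  # right-hand side of (79)
ok = True
for _ in range(1000):
    w = rng.standard_normal(n)
    w *= C_F * rng.uniform() / np.linalg.norm(w)   # enforce ||w|| <= C_F
    x, u = rng.standard_normal(n), rng.standard_normal()
    ok &= np.linalg.norm(grad_psi(x, u, w)) ** 2 <= bound + 1e-12
assert ok
```

When the indicator is active, the squared norm is at most \((C_F^2 + \alpha ^2)/(1-\alpha )^2\); when it is inactive, it equals 1. Both cases fall under the bound in (79).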

\(\bullet \) Step (c)—Showing that \({{\mathcal {L}}}'\) admits a saddle point: According to [36, Theorem 10], the minimizers of \(\mathbb {E} [ \psi ^f_\omega (\varvec{x}, u^0; \alpha )]\) over \(u^0\) define a nonempty closed bounded interval (possibly a singleton). Thus, we have

$$\begin{aligned} F(\varvec{x}) = \mathbb {E} [ \psi ^f_\omega (\varvec{x}, u^0(\varvec{x}); \alpha )] \end{aligned}$$
(82)

for some \(u^0(\varvec{x})\in {\mathbb {R}}\) for each \(\varvec{x} \in {\mathbb {X}}\). Similarly, we infer

$$\begin{aligned} G^i(\varvec{x}) = \mathbb {E} [ \psi ^{g^i}_\omega (\varvec{x}, u^i(\varvec{x}); \beta ^i)] \end{aligned}$$
(83)

for some \(u^i(\varvec{x})\in {\mathbb {R}}\) for each \(\varvec{x} \in {\mathbb {X}}\). Moreover, for all \(u^i > D_G^i\), we have

$$\begin{aligned} \mathbb {E} [ \psi ^{g^i}_\omega (\varvec{x}, u^i; \beta ^i)] = u^i, \end{aligned}$$
(84)

and for \(u^i < -D_G^i\), we have

$$\begin{aligned} \mathbb {E} [ \psi ^{g^i}_\omega (\varvec{x}, u^i; \beta ^i)] = \frac{1}{1-\beta ^i} \left( \mathbb {E} [g^i_\omega (\varvec{x})] -{\beta ^i}u^i \right) . \end{aligned}$$
(85)

Thus, \(\mathbb {E} [ \psi ^{g^i}_\omega (\varvec{x}, u^i; \beta ^i)]\) is nonincreasing in \(u^i\) for \(u^i < -D_G^i\) and increasing in \(u^i\) for \(u^i > D_G^i\). Hence, at least one among the minimizers of \(\mathbb {E} [ \psi ^{g^i}_\omega (\varvec{x}, u^i; \beta ^i)]\) must lie in \([-D_G^i, D_G^i]\). In the sequel, let \(u^i(\varvec{x})\) refer to such a minimizer.
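The identities (84) and (85) are easy to confirm numerically. The following sketch (assuming NumPy; the bound \(D_G\), the parameter \(\beta \), and the sample distribution are illustrative) checks both closed forms for a bounded sample of constraint values:

```python
import numpy as np

rng = np.random.default_rng(2)
beta, D_G = 0.2, 1.0
g = rng.uniform(-D_G, D_G, size=100_000)   # bounded samples |g_w(x)| <= D_G at a fixed x

def expected_psi(u):
    # E[psi^g_w(x, u; beta)] = u + E[g_w(x) - u]^+ / (1 - beta)
    return u + np.maximum(g - u, 0.0).mean() / (1.0 - beta)

# For u above D_G the positive part vanishes, matching (84) ...
assert abs(expected_psi(1.5) - 1.5) < 1e-12
# ... and for u below -D_G it is affine and decreasing in u, matching (85)
u = -1.5
assert abs(expected_psi(u) - (g.mean() - beta * u) / (1.0 - beta)) < 1e-9
```

Above \(D_G\) the map is the identity (hence increasing), and below \(-D_G\) its slope is \(-\beta /(1-\beta ) \le 0\), which is exactly the monotonicity used to restrict the minimizer to \([-D_G, D_G]\).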

Consider a saddle point \((\varvec{x}_\star , {\varvec{z}}_\star ) \in {\mathbb {X}}\times {\mathbb {R}}^m_+\) of \({{{{\mathcal {P}}}}^\mathrm{CVaR}}\). We argue that \((\varvec{x}_\star , u^0(\varvec{x}_\star ), {\varvec{u}}(\varvec{x}_\star ), {\varvec{z}}_\star )\) is a saddle point of \({{\mathcal {L}}}'\). From the definitions of \({{\mathcal {L}}}\), \({{\mathcal {L}}}'\), (82), (83), and the saddle point property of \((\varvec{x}_\star , {\varvec{z}}_\star )\), we obtain

$$\begin{aligned} \begin{aligned} {{\mathcal {L}}}'(\varvec{x}_\star , u^0(\varvec{x}_\star ), {\varvec{u}}(\varvec{x}_\star ), {\varvec{z}}_\star )&= {{\mathcal {L}}}(\varvec{x}_\star , {\varvec{z}}_\star ) \\&\le {{\mathcal {L}}}(\varvec{x}, {\varvec{z}}_\star ) \\&= \mathbb {E} [ \psi ^{f}_\omega (\varvec{x}, u^0(\varvec{x}); \alpha )] + \sum _{i=1}^m z^i_\star \mathbb {E} [ \psi ^{g^i}_\omega (\varvec{x}, u^i(\varvec{x}); \beta ^i)] \\&\le {{\mathcal {L}}}'(\varvec{x}, u^0, {\varvec{u}}, {\varvec{z}}_\star ) \end{aligned} \end{aligned}$$
(86)

for all \((\varvec{x}, u^0, {\varvec{u}}) \in {\mathbb {X}}\times {\mathbb {R}}\times {\mathbb {U}}\). Also, for all \({\varvec{z}} \in {\mathbb {R}}^m_+\), we have

$$\begin{aligned} \begin{aligned} {{\mathcal {L}}}'(\varvec{x}_\star , u^0(\varvec{x}_\star ), {\varvec{u}}(\varvec{x}_\star ), {\varvec{z}}) = {{\mathcal {L}}}(\varvec{x}_\star , {\varvec{z}}) \le {{\mathcal {L}}}(\varvec{x}_\star , {\varvec{z}}_\star ) = {{\mathcal {L}}}'(\varvec{x}_\star , u^0(\varvec{x}_\star ), {\varvec{u}}(\varvec{x}_\star ), {\varvec{z}}_\star ). \end{aligned} \end{aligned}$$
(87)

\(\bullet \) Step (d)—Proof of (73) and (74): By the saddle point theorem and (86), we have \({{\mathcal {L}}}(\varvec{x}_\star , {\varvec{z}}_\star ) = p_\star ^\mathrm{CVaR}\), which also equals the optimal value of \({{{{\mathcal {P}}}}^{\mathrm{E}}}'\). Applying Theorem 2.1 with the revised \(P_2\) and \(P_3\) from step (b) to \({{{{\mathcal {P}}}}^{\mathrm{E}}}'\), for which \(\varvec{x}_0, \dots , \varvec{x}_{K+1}\) and \(u^0_0, \dots , u^0_{K+1}\) are measurable with respect to the filtration generated by the samples, we obtain

$$\begin{aligned} \mathbb {E} [\mathrm{CVaR}_{\alpha }(f_\omega (\bar{\varvec{x}}_{K+1}))] - p^\mathrm{CVaR}_\star \le \mathbb {E} [ \psi ^f_\omega (\bar{\varvec{x}}_{K+1}, {\bar{u}}^0_{K+1}; \alpha )] - p^\mathrm{CVaR}_\star \le \frac{\eta (\alpha , {\varvec{\beta }})}{\sqrt{K}}, \end{aligned}$$
(88)

where the first inequality follows from the variational characterization (2), since \(\mathrm{CVaR}_{\alpha }\) is the minimum of the left-hand side of (2) over \(u^0\).

Following a similar argument for \(i=1,\ldots ,m\), we get

$$\begin{aligned} \mathbb {E} [\mathrm{CVaR}_{\beta ^i}(g^i_\omega (\bar{\varvec{x}}_{K+1}))] \le \mathbb {E} [ \psi ^{g^i}_\omega (\bar{\varvec{x}}_{K+1}, {\bar{u}}^i_{K+1}; \beta ^i)] \le \frac{\eta (\alpha , {\varvec{\beta }})}{\sqrt{K}}, \end{aligned}$$
(89)

completing the proof. \(\square \)

Our proof architecture generalizes to problems with other risk measures as long as that measure preserves convexity of \(f_\omega , {\varvec{g}}_\omega \), admits a variational characterization as in (2), and a subgradient for this modified objective can be easily computed and remains bounded. We restrict our attention to \(\mathrm{CVaR}\) to keep the exposition concrete.

As opposed to sample average approximation (SAA) algorithms, we neither compute nor estimate \(F(\varvec{x}) =\mathrm{CVaR}[f_\omega (\varvec{x})]\) and \({\varvec{G}}(\varvec{x}) = \mathrm{CVaR}[{\varvec{g}}_\omega (\varvec{x})]\) at any given decision \(\varvec{x}\) to run the algorithm. Yet, our analysis provides guarantees on these quantities at \(\bar{\varvec{x}}_{K+1}\) in expectation. If one needs to compute F at a particular decision, e.g., at \(\bar{\varvec{x}}_{K+1}\), one can employ the variational characterization in (2). Such evaluation requires additional computational effort. Notice that Theorem 3.1 does not relate \(F(\bar{\varvec{x}}_{K+1})\) to \(p_\star ^\mathrm{CVaR}\) in an almost sure sense; it only relates the two in expectation according to (73), where the expectation is taken with respect to the stochastic sample path.

\(\mathrm{CVaR}\) of a random variable depends on the tail of its distribution. The higher the risk aversion, the further into the tail one needs to look, generally requiring more samples. Even if we do not explicitly compute the tail-dependent CVaR relevant to the objective or the constraints, it is natural to expect our sample complexity to grow with risk aversion, which the following result confirms.

Proposition 3.1

Suppose Assumption 2 holds, and consider computing an \(\varepsilon \)-approximately feasible and optimal solution of \({{{{\mathcal {P}}}}^\mathrm{CVaR}}\) with risk-aversion parameters \(\alpha , {\varvec{\beta }}\) using Algorithm 1 on \({{{{\mathcal {P}}}}^{\mathrm{E}}}'\). Then, \(\gamma _\star (\alpha , {\varvec{\beta }})\) from Proposition 2.1 decreases, and \(K_\star (\alpha , {\varvec{\beta }})\) increases, with both \(\alpha \) and \({\varvec{\beta }}\).

Proof

We borrow the notation from Proposition 2.1 and tackle the variation with \(\alpha \) and \({\varvec{\beta }}\) separately.

\(\bullet \) Variation with \(\alpha \): \(P_2\) increases with \(\alpha \), implying \(\gamma _\star \) decreases with \(\alpha \) because \(\frac{d \gamma _\star ^2}{d y} \le 0\) and \(\frac{d y}{d P_2} \ge 0\). Furthermore, using \(\frac{\partial K_\star }{\partial \gamma _\star } < 0\) for \(\gamma < \gamma _\star \) and \(\frac{\partial K_\star }{\partial P_2} \ge 0\) in

$$\begin{aligned} \frac{d K_\star }{d P_2} = \frac{\partial K_\star }{\partial P_2} + \frac{\partial K_\star }{\partial \gamma _\star }\frac{d \gamma _\star }{d P_2} \end{aligned}$$
(90)

we infer that \(K_\star \) increases with \(\alpha \).

\(\bullet \) Variation with \(\beta ^i\): Both \(P_2\) and \(P_3\) increase with \(\beta ^i\) and

$$\begin{aligned} \frac{d \gamma _\star ^2}{d {\beta ^i}} = \frac{\partial \gamma _\star ^2}{\partial P_2} \frac{d P_2}{d {\beta ^i}} + \frac{\partial \gamma _\star ^2}{\partial P_3} \frac{d P_3}{d {\beta ^i}}. \end{aligned}$$
(91)

Following an argument similar to that for the variation with \(\alpha \), the first term on the RHS of the above equation can be shown to be nonpositive. Next, we show that the second term is nonpositive to conclude that \(\gamma _\star \) decreases with \(\beta ^i\), where we use \(\frac{d P_3}{d {\beta ^i}} \ge 0\). Utilizing \(\frac{P_2}{P_1 P_3} = y - 1\), we infer

$$\begin{aligned} \begin{aligned} \frac{\partial \gamma _\star ^2}{\partial P_3}&= -\frac{2}{P_3^2(2 + y + \sqrt{y^2 + 8y})} + \frac{\partial \gamma _\star ^2}{\partial y} \frac{\partial y}{\partial P_3} \\&= -\frac{2}{P_3^2(2 + y + \sqrt{y^2 + 8y})} + 2\frac{4 + y + \sqrt{y^2 + 8y}}{P_3 \sqrt{y^2 + 8y} (2 + y + \sqrt{y^2 + 8y})^2} \frac{P_2}{P_1 P_3^2} \\&= -2\frac{{ 5y + 4 + 3 \sqrt{y^2 + 8y} }}{P_3^2 \sqrt{y^2 + 8y} (2 + y + \sqrt{y^2 + 8y})^2} \\&\le 0. \end{aligned} \end{aligned}$$
(92)
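The last equality in (92) is pure algebra. It can be sanity-checked numerically at random parameter values (a Python sketch; the substitution \(P_2/(P_1 P_3^2) = (y-1)/P_3\) comes from \(\frac{P_2}{P_1 P_3} = y - 1\)):

```python
import math, random

random.seed(3)
for _ in range(100):
    y = random.uniform(1.0, 50.0)       # y > 1 so that P_2 = (y - 1) P_1 P_3 > 0
    P3 = random.uniform(0.1, 100.0)
    s = math.sqrt(y * y + 8.0 * y)
    # second line of (92), with P_2 / (P_1 P_3^2) replaced by (y - 1) / P_3
    lhs = (-2.0 / (P3 ** 2 * (2.0 + y + s))
           + 2.0 * (4.0 + y + s) * (y - 1.0) / (P3 ** 2 * s * (2.0 + y + s) ** 2))
    # final expression in (92)
    rhs = -2.0 * (5.0 * y + 4.0 + 3.0 * s) / (P3 ** 2 * s * (2.0 + y + s) ** 2)
    assert abs(lhs - rhs) <= 1e-9 * max(1.0, abs(rhs))
```

Putting both terms over the common denominator \(P_3^2 \sqrt{y^2+8y}\, (2+y+\sqrt{y^2+8y})^2\) and using \(s^2 = y^2 + 8y\) collapses the numerator to \(-2(5y + 4 + 3s)\), which is the displayed final expression.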

To characterize the variation of \(K_\star \), notice that

$$\begin{aligned} \frac{d K_\star }{d \beta ^i} = \frac{\partial K_\star }{\partial P_2}\frac{\partial P_2}{\partial \beta ^i} + \frac{\partial K_\star }{\partial P_3}\frac{\partial P_3}{\partial \beta ^i} . \end{aligned}$$
(93)

Again, the first term on the RHS of the above relation is nonnegative, owing to an argument similar to that used for the variation of \(K_\star \) with \(\alpha \). We show \(\frac{d K_\star }{d P_3} \ge 0\) to conclude the proof. Treating \(K_\star \) as a function of \(P_3\) and \(\gamma _\star \), we obtain

$$\begin{aligned} \frac{d K_\star }{d P_3} = \frac{\partial K_\star }{\partial P_3} + \frac{\partial K_\star }{\partial \gamma _\star } \frac{\partial \gamma _\star }{\partial P_3}. \end{aligned}$$
(94)

It is straightforward to verify that the first summand is nonnegative. We have already argued that \(\gamma _\star \) decreases with \(P_3\), and \(\frac{\partial K_\star }{\partial \gamma } < 0\) for \(\gamma < \gamma _\star \), implying that the second summand is nonnegative as well, completing the proof. \(\square \)

It is easy to compute the optimized iteration count \(K_\star (\alpha , {\varvec{\beta }})\) and the optimized constant step-size \(\gamma _\star (\alpha , {\varvec{\beta }})/\sqrt{K_\star (\alpha , {\varvec{\beta }})}\) from Proposition 2.1. The formula is omitted for brevity. Instead, we derive additional insight by fixing \({\varvec{\beta }}\) and driving \(\alpha \) towards unity. For such an \(\alpha , {\varvec{\beta }}\), we have

$$\begin{aligned} P_2(\alpha , {\varvec{\beta }}) \sim (1-\alpha )^{-2}, \ \gamma _\star (\alpha , {\varvec{\beta }}) \sim (1-\alpha ), \ K_\star (\alpha , {\varvec{\beta }}) \sim \frac{1}{\varepsilon ^2(1-\alpha )^2}. \end{aligned}$$
(95)

With \(\alpha \) approaching unity, notice that \({{{{\mathcal {P}}}}^\mathrm{CVaR}}\) approaches a robust optimization problem. Thus, Algorithm 1 for \({{{{\mathcal {P}}}}^{\mathrm{E}}}'\) aims to solve a robust optimization problem via sampling. Not surprisingly, the sample complexity grows without bound with such robustness requirements, since we do not assume \(\varOmega \) to be finite. Also, this growth matches that of solving the SAA problem within \(\varepsilon \)-tolerance for the unconstrained problem of minimizing \({\widehat{F}}(\varvec{x}) := \frac{1}{K} \sum _{j=1}^K \psi _{\omega ^j}^f(\varvec{x}, u; \alpha )\). To see this, apply Theorem 2.1 to \({\widehat{F}}(\varvec{x})\) with the optimized step size from Proposition 2.1, where \(P_2 \sim \Vert \nabla {\widehat{F}}(\varvec{x})\Vert ^2 \sim (1-\alpha )^{-2}\) and \(P_3=1\).

Parallelization can lead to stronger bounds. More precisely, run stochastic approximation in parallel on N machines, each with K samples, and compute \( \langle \bar{\varvec{x}}\rangle _{K+1} := \frac{1}{N} \sum _{j=1}^N \bar{\varvec{x}}_{K+1}[j]\) using \(\bar{\varvec{x}}_{K+1} [1], \ldots , \bar{\varvec{x}}_{K+1} [N]\) obtained from the N separate runs. Then, we have

$$\begin{aligned} \begin{aligned}&\mathsf{Pr}\left\{ G^i \left( \langle \bar{\varvec{x}}\rangle _{K+1} \right) \ge (1+ \tau ) {\eta (\alpha , {\varvec{\beta }})}/{\sqrt{K}} \right\} \\&\quad \le \mathsf{Pr}\left\{ \frac{1}{N} \sum _{j=1}^N \mathrm{CVaR}_{\beta ^i}\left[ { g^i_\omega \left( \bar{\varvec{x}}_{K+1}[j] \right) } \right] \ge (1+ \tau ) \frac{\eta (\alpha , {\varvec{\beta }})}{\sqrt{K}} \right\} \\&\quad \le \exp \left( - \frac{N \tau ^2 \eta ^2(\alpha , {\varvec{\beta }})}{{K [D^i_G]^2}} \right) \end{aligned} \end{aligned}$$
(96)

for \(i = 1,\ldots ,m\) and \(\tau > 0\). The steps combine coherence of \(\mathrm{CVaR}\), convexity and uniform boundedness of \(g^i_\omega \), Hoeffding’s inequality and Theorem 3.1. A similar bound can be derived for suboptimality. Thus, parallelized stochastic approximation produces a result whose \({{\mathcal {O}}}(1/\sqrt{K})\)-violation occurs with a probability that decays exponentially with the degree of parallelization N.
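The exponential decay of the tail bound in (96) with the degree of parallelization N is immediate from its closed form. The sketch below (with hypothetical, purely illustrative constants for \(\tau \), \(\eta \), \(D^i_G\), and K) evaluates the right-hand side of (96):

```python
import math

def tail_bound(N, K, tau, eta, D):
    """Right-hand side of (96): violation probability for N parallel runs."""
    return math.exp(-N * tau ** 2 * eta ** 2 / (K * D ** 2))

# Hypothetical constants: tolerance factor tau, bound eta, range D, horizon K.
K, tau, eta, D = 10_000, 0.5, 4.0, 1.0
bounds = [tail_bound(N, K, tau, eta, D) for N in (1, 10, 100, 1000)]

# The bound decays geometrically in the degree of parallelization N.
assert all(b2 < b1 for b1, b2 in zip(bounds, bounds[1:]))
```

Each additional machine multiplies the bound by the same factor \(\exp (-\tau ^2\eta ^2/(K [D^i_G]^2))\), which is the exponential decay in N noted above.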

The bound in (96) reveals an interesting connection with results for chance constrained programs. To describe the link, notice that \(\mathrm{CVaR}_{\delta }[y_\omega ] \le 0\) implies \(\mathsf{Pr}\{ y_\omega \le 0\} \ge \delta \) for any random variable \(y_\omega \) and \(\delta \in [0, 1)\). Therefore, (96) implies

$$\begin{aligned} \mathsf{Pr}\left\{ \mathsf{Pr}\left\{ g^i_\omega \left( \bar{\varvec{x}}_{K+1} \right) \le C/{\sqrt{K}} \right\} \ge \beta ^i \text { is violated}\right\} \le \exp \left( - C' / K \right) \le \nu , \end{aligned}$$
(97)

for constants \(C, C'\). Said differently, our stochastic approximation algorithm requires \({{\mathcal {O}}}(\log (1/{\nu }))\) samples to produce a solution that satisfies an \({{\mathcal {O}}}( 1/\sqrt{\log (1/\sqrt{\nu })})\)-approximate chance-constraint with a violation probability bounded by \(\nu \). This result bears a striking similarity to that derived in [11], where the authors deterministically enforce \({{\mathcal {O}}}(\log (1/{\nu }))\) sampled constraints to produce a solution that satisfies the exact chance-constraint \(\mathsf{Pr}\left\{ g^i_\omega \left( \varvec{x} \right) \le 0 \right\} \ge \beta ^i \) with a violation probability bounded by \(\nu \). This resemblance in order-wise sample complexity is intriguing, given the significant differences between the algorithms.

3.1 An Illustrative Example

We explore the use of our algorithm on the following example problem

$$\begin{aligned} \begin{aligned} \underset{-\frac{1}{2} \le x \le \frac{1}{2}}{\text {minimize}} \ \ \mathrm{CVaR}_{\alpha }\left[ \frac{1}{2} \left( x - \omega -\frac{1}{2} \right) ^2 \right] , \ \text {subject to} \ \mathrm{CVaR}_{\beta }\left[ x + \omega \right] \le 0. \end{aligned} \end{aligned}$$
(98)

Let \(\omega \sim \frac{1}{3}\mathsf{beta}(2, 2)\) and consider the specific choice of risk parameters \(\alpha = 0.3, \beta = 0.2\). To gain intuition into the optimal solution for this example, we numerically estimate F(x) and \(G^1(x)\) for each x and plot them in Fig. 1a. To that end, we first obtain a million samples of \(\omega \). Then, for each value of the decision variable x, we sort the objective function values \(f_\omega (x)\) and the constraint function values \(g_\omega (x)\) computed from these samples. We then estimate F and \(G^1\) as the averages of the highest \(1-\alpha =70\%\) of the \(f_\omega (x)\)'s and the highest \(1-\beta =80\%\) of the \(g_\omega (x)\)'s, respectively, at each x. The unique optimum for (98) is numerically evaluated as \(x_\star \approx -0.1929\), for which \(F(x_\star ) \approx 0.4042\) and \(G^1(x_\star ) \approx 0\).
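The estimation procedure just described can be sketched as follows (assuming NumPy; the seed is arbitrary, and the reference values are those reported above):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10 ** 6
omega = rng.beta(2, 2, size=n) / 3.0          # omega ~ (1/3) * beta(2, 2)
alpha, beta = 0.3, 0.2

def cvar(samples, delta):
    """Empirical CVaR_delta: average of the highest (1 - delta) fraction."""
    return np.sort(samples)[int(delta * len(samples)):].mean()

def F(x):      # objective CVaR_alpha[ (1/2)(x - omega - 1/2)^2 ]
    return cvar(0.5 * (x - omega - 0.5) ** 2, alpha)

def G1(x):     # constraint CVaR_beta[ x + omega ]
    return cvar(x + omega, beta)

x_star = -0.1929                               # optimum reported in the text
assert abs(G1(x_star)) < 5e-3                  # constraint is active: G1(x_star) ~ 0
assert abs(F(x_star) - 0.4042) < 5e-3          # objective value ~ 0.4042
```

With a million samples, the tail averages reproduce the reported values \(F(x_\star ) \approx 0.4042\) and \(G^1(x_\star ) \approx 0\) to well within sampling error.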

For this example, it is easy to show that \(C_F = \frac{4}{3}\), \(C_G = 1\) and \(D_G = \frac{5}{6}\), which yields \(P_2(0.3, 0.2) = \frac{8276}{93}\) and \(P_3(0.3, 0.2) = 50\). To run Algorithm 1 on \({{{{\mathcal {P}}}}^{\mathrm{E}}}'\) derived from (98), we can use the constant step-size \(\gamma _k = \gamma /\sqrt{K}\) with a pre-determined number of steps K for any \(0< \gamma < P_3^{-1/2}(0.3, 0.2) = \frac{1}{5\sqrt{2}}\). With any given K, Theorem 3.1 guarantees that the expected distance to \(F(x_\star )\) and the expected constraint violation evaluated at \(\bar{{x}}_{K+1}\) decay as \(1/\sqrt{K}\). For a given K and \(\gamma < \frac{1}{5\sqrt{2}}\), calculating the precise bound \(\eta (0.3, 0.2)/\sqrt{K}\) requires the knowledge of \(P_1\) or an overestimate thereof. For this example, \(| x_\star | \le \frac{1}{2}\) and \(|u^1_\star | \le D_G = \frac{5}{6}\). Also, \( | u^0_\star |\) is bounded above by the maximum value that \(|f_\omega (x)|\) can take, which is \(\frac{8}{9}\). Since we cannot determine \(z_\star \) a priori, we assume \(| z_\star | \le 2\) (which will later be shown to be consistent with our result). Starting from \((x_0, u^0_0, u^1_0, z_0) = 0\), we then obtain \(P_1=\frac{3197}{81}\). To solve \({{{{\mathcal {P}}}}^\mathrm{CVaR}}\) (or equivalently \({{{{\mathcal {P}}}}^{\mathrm{E}}}'\)) with a tolerance of \(\varepsilon =5 \times 10^{-3}\), we require \(\eta (0.3, 0.2)/\sqrt{K} \le 5 \times 10^{-3}\). With this tolerance and the values of \(P_1, P_2, P_3\), Proposition 2.1 yields an optimized \(\gamma _\star = 0.0808\) and \(K_\star \approx 1.35 \times 10^{9}\). We run Algorithm 1 on \({{{{\mathcal {P}}}}^{\mathrm{E}}}'\) with constant step-size \(\gamma _\star /\sqrt{K_\star }\) and plot F and \(G^1\) at the running ergodic mean of the iterates, i.e., at \({\bar{x}}_k := \frac{1}{k}\sum _{j=1}^k x_j\) for each k. Again, F and \(G^1\) are evaluated numerically using the \(\mathrm{CVaR}\)-estimation procedure outlined above.
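Some of these constants can be cross-checked numerically. The sketch below (assuming NumPy; the grid over \(\varvec{x}\) and \(\omega \) is a stand-in for the analytical maximization) recovers \(C_F\), \(D_G\), the constant \(P_3(0.3, 0.2)\) from (75) with \(m = 1\), and the resulting step-size cap:

```python
import numpy as np

beta = 0.2
xs = np.linspace(-0.5, 0.5, 1001)         # X = [-1/2, 1/2]
ws = np.linspace(0.0, 1.0 / 3.0, 1001)    # support of omega ~ (1/3) beta(2, 2)
X, W = np.meshgrid(xs, ws)

C_F = np.abs(X - W - 0.5).max()           # |f'| = |x - omega - 1/2| <= 4/3
C_G, m = 1.0, 1                           # g(x) = x + omega has gradient 1
D_G = np.abs(X + W).max()                 # |g| <= 1/2 + 1/3 = 5/6

# P_3 from (75) with m = 1, and the step-size cap gamma < P_3^{-1/2}
P_3 = 16.0 * m * ((C_G / (1 - beta)) ** 2 + (1.0 / (1 - beta)) ** 2)
assert abs(C_F - 4.0 / 3.0) < 1e-9
assert abs(D_G - 5.0 / 6.0) < 1e-9
assert abs(P_3 - 50.0) < 1e-9
assert abs(P_3 ** -0.5 - 1.0 / (5.0 * np.sqrt(2.0))) < 1e-12
```

This reproduces \(P_3(0.3, 0.2) = 50\) and the step-size cap \(\frac{1}{5\sqrt{2}}\) quoted above.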

Fig. 1
figure 1

Plots of a numerically estimated F and \(G^1\) over \({\mathbb {X}}= [-\frac{1}{2}, \frac{1}{2}]\), and b convergence of the running ergodic mean and of F, \(G^1\) evaluated at the mean for the example problem (98) with \(\alpha = 0.3\), \(\beta = 0.2\)

Notice that Theorem 3.1 only guarantees a bound on \(F ({\bar{x}}_{K_\star +1}) - F(x_\star )\) and \(G^1({\bar{x}}_{K_\star +1})\) in expectation. Thus, one would expect only the averages of F and \(G^1\) evaluated at \({\bar{x}}_{K_\star + 1}\) over multiple sample paths to respect the \(\varepsilon \)-bound. However, our simulation yielded \({\bar{x}}_{K_\star + 1} = -0.1926\) and \({\bar{z}}_{K_\star + 1} = 0.8976\), for which

$$\begin{aligned} \begin{aligned} F({\bar{x}}_{K_\star + 1}) \approx 0.4040 \le F(x_\star ) + \varepsilon \approx 0.4042 + 0.0050 = 0.4092, \\ G^1({\bar{x}}_{K_\star + 1}) \approx 0.0002 \le G^1(x_\star ) + \varepsilon \approx 0 + 0.0050 = 0.0050, \end{aligned} \end{aligned}$$
(99)

i.e., the ergodic mean after \(K_\star \) iterations respects the \(\varepsilon \)-bound over the plotted sample path. The same behavior was observed over multiple sample paths. The ergodic mean of the dual iterate is indeed consistent with our assumption \(|z_\star | \le 2\) made in deriving \(\eta (0.3, 0.2)\). We point out that the ergodic mean in Fig. 1b moves much more smoothly than our evaluations of F and \(G^1\) at those means, especially for large k. The noise in F and \(G^1\) emanates from the finitely many samples we use to evaluate them. The errors appear much more pronounced at larger k, given the logarithmic scale of the plot.

The optimized iteration count \(K_\star (\alpha , \beta )\) from Proposition 2.1 with a modest \(\alpha =0.3, \beta =0.2\) is quite high even for this simple example. This iteration count only grows with increased risk aversion as Fig. 2 reveals. Figure 1b suggests that the \(\varepsilon =5\times 10^{-3}\) tolerance is met earlier than \(K_\star \) iterations. This is the downside of optimizing upper bounds to decide step-sizes for subgradient methods. Carefully designed termination criteria may prove useful in practical implementations. In Fig. 2, we calculate \(K_\star \) and \(\gamma _\star \) with \(P_1 = \frac{3197}{81}\) obtained using \(| z_\star | \le 2\); extensive simulations with various \((\alpha , \beta ) \in [0,0.99]^2\) suggest that this over-estimate indeed holds.

Fig. 2
figure 2

Plot of the optimized number of iterations \(K_\star (\alpha , {\beta })\) on the left and the optimized step size \(\gamma _\star (\alpha , \beta )/\sqrt{K_\star (\alpha , \beta )}\) on the right to achieve a tolerance of \(\varepsilon =5 \times 10^{-3}\) for the example problem in (98)

We end the numerical example with a remark comparing Algorithm 1, which uses the Gauss–Seidel-type dual update in (5), with a variant that uses the popular Jacobi-type dual update on \({{{{\mathcal {P}}}}^{\mathrm{E}}}'\) for (98) with \(\alpha =0.3, \beta =0.2\). This alternate dual update replaces \({\varvec{g}}_{\omega _{k +1/2}}(\varvec{x}_{k+1})\) in (5) by \({\varvec{g}}_{\omega _{k}}(\varvec{x}_{k})\). That is, the same sample \(\omega _k\) is used for both the primal and the dual update, and the primal iterate \(\varvec{x}_k\) is used instead of \(\varvec{x}_{k+1}\) to update the dual variable. We numerically compared this primal-dual algorithm with Algorithm 1 for various choices of step-sizes (consistent with the requirements of Theorem 3.1) and iteration counts on our example and its variations. In each run, the iterates from the two algorithms moved very similarly; the differences are too small to report. The Jacobi-type update requires half the number of samples compared to Algorithm 1. While the extra sample helps us in the theoretical analysis, our experience with this stylized example does not suggest any empirical advantage. A more thorough comparison between these algorithms, both theoretical and empirical, is left to future work.

4 Conclusions and Future Work

In this paper, we study a stochastic approximation algorithm for \(\mathrm{CVaR}\)-sensitive optimization problems. Such problems are remarkably rich in their modeling power and encompass a plethora of stochastic programming problems with broad applications. We study a primal-dual algorithm that processes samples in an online fashion, i.e., it obtains samples and updates decision variables in each iteration. Such algorithms are useful when sampling is easy and intermediate approximate solutions, albeit inexact, are useful. The convergence analysis allows us to optimize the number of iterations required to reach a solution within a prescribed tolerance on expected suboptimality and constraint violation. The sample and iteration complexities grow predictably with risk aversion. Our work affirms that a modeler must consider not only the attitude toward risk but also the computational burden of risk in deciding the problem formulation.

Two possible extensions are of immediate interest to us. First, primal-dual algorithms find applications in multi-agent distributed optimization problems over a possibly time-varying communication network. We plan to extend our results to solve distributed risk-sensitive convex optimization problems over networks, borrowing techniques from [14, 29]. Second, the relationship to sample complexity for chance-constrained programs in [11] encourages us to pursue a possible exploration of stochastic approximation for such optimization problems.