1 Introduction

Extended real-valued upper-semicontinuous (usc) functions on \({\mathbb {R}}^n\) are fundamental in the study of finite-dimensional constrained maximization problems as essentially all such problems can be represented by usc functions. They arise in probability theory with distribution and càdlàg functions also being usc. Emerging applications of usc functions in infinite-dimensional problems include nonparametric statistical M-estimation [38], distributionally robust optimization [36], and more generally function identification [35]. In these applications, optimization problems are formulated over spaces of usc functions. Regardless of the setting, it becomes important to have means to approximate usc functions as well as an understanding of the difficulty with such an undertaking. This article provides three main results in these directions: (i) We establish that every usc function is the limit of a hypo-converging sequence of mesh-free piecewise affine functions of the difference-of-max type. Thus, as a corollary, the difference-of-convex (dc) functions are hypo-dense in spaces of usc functions. With the advances in computational treatment of dc functions (see for example [8]), this leads to numerous algorithmic possibilities, which we illustrate in the context of function identification problems. (ii) We provide upper and lower bounds on covering numbers for bounded sets of usc functions under the Attouch-Wets (aw) distance and thereby quantify the ease with which classes of usc functions can be approximated by finite collections. (iii) For stochastic optimization problems defined over spaces of usc functions, we establish confidence regions for optimal solutions in terms of the aw-distance and sample average approximations, with rates of convergence as the sample size grows. 
The result requires only semicontinuity of the objective function and therefore applies in challenging settings such as simulation optimization of “black-box” stochastic systems where little structure may be known.

The consideration of approximations in the sense of hypo-convergence, which is metrized by the aw-distance, is natural and convenient in many applications. If an usc function is approximated in this sense, then the maximizers of the approximating function will be “near” those of the actual function. This is exactly the desired property when the usc function represents a constrained maximization problem. It is also the goal when the usc function is a probability density function and we need to estimate its modes; the approximating density will have modes “near” the actual modes. The situation is similar when the usc function is a surrogate model in an engineering design problem that needs to be maximized to find an optimal design; see Sect. 5 for an example. The notion of approximation is further motivated in the context of distribution functions by the fact that for such functions hypo-convergence is equivalent to convergence in distribution, a property that is leveraged to address optimization under stochastic ambiguity in [36]. An alternative focus on approximations in the sense of uniform convergence would have limited the scope to finite-valued continuous functions with common compact domains, which is too restrictive in many applications. Hypo-convergence permits treatment of usc functions defined on any subset of \({\mathbb {R}}^n\).

The study of hypo-converging usc functions and, in parallel, epi-converging extended real-valued lower-semicontinuous (lsc) functions has a long history, with important accomplishments in convex and nonsmooth analysis as well as the approximation theory of maximization and minimization problems; see [31] for details. Connections to probability theory are established in [39, 40] and more recently in [36]. The first formulation of infinite-dimensional optimization problems over spaces of semicontinuous functions appears in [34], with theoretical developments in [35]. In particular, the latter reference defines the class of epi-splines (see also [33]), which are piecewise polynomial functions, and establishes that the class is dense in spaces of semicontinuous functions under the aw-distance. Even though epi-splines furnish a means to approximate arbitrary semicontinuous functions using a finite number of parameters, they suffer from the need to partition \({\mathbb {R}}^n\) into a finite number of subsets. In the present paper, we show that semicontinuous functions can be approximated by piecewise affine functions that are defined without specifying a partition and that are characterized structurally as being the difference of two functions of the form \(x\mapsto \max _{k=1, \dots , p} \langle a^k, x\rangle + \alpha _k\). Consequently, we refer to these piecewise affine functions as mesh free; the domain of each affine component adapts and is not preselected. This is a significant feature for medium- and high-dimensional problems, where representative low-dimensional subspaces need to be discovered and exploited and standard polynomial approximations become challenging (see [47] for some progress in such directions). 
Our approximation result for usc functions extends the well-known fact that every continuous function on a convex compact set is the limit of a uniformly convergent sequence of dc functions, which can be traced back to the local property of dc functions established in [18]; see for example Proposition 2.3 in [21].

Covering numbers express the size of a class of functions in terms of the smallest number of balls with a certain radius needed to cover the class and are central to most consistency, rate of convergence, and error analysis in (non)parametric estimation and machine learning; see for example [16, 44, 45]. The pioneering works [5, 23] deal with continuous and smooth functions; see [30] for a more recent discussion. Functions of bounded variation are considered in [3] and analytic functions in [7]. An upper estimate for the covering numbers of the unit ball of Gaussian reproducing kernel Hilbert spaces is given in [48], with further refinements and applications in [24, 46]. Covering numbers of sets of convex functions are established in [6, 11], with significant improvements in [14]. The present paper establishes an upper bound on the covering numbers of bounded classes of usc functions under the aw-distance and shows that it is sharp within a logarithmic factor.

Although sample average approximations are often used to solve stochastic optimization problems, it remains challenging to assess the quality of a solution obtained through such approximations. Upper and lower bounds on minimum values can be computed using the approaches in [4, 20, 27, 29] (see also [42, Sect. 5.6]), at least when problem relaxations can be solved to near global optimality. Validation approaches based on optimality conditions are found in [19, 25, 26, 32, 43] and [42, Sect. 5.6]. Rates of convergence of optimization problems with Lipschitz continuous objective functions defined on a compact subset of \({\mathbb {R}}^n\) are given in [42, Sect. 5.3]. We leverage the results on covering numbers to establish confidence regions for optimal solutions of infinite-dimensional stochastic optimization problems defined on spaces of usc functions without assuming Lipschitz continuity. The result is novel even when specialized to finite dimensions. For a Hölder continuous case, we obtain, in some sense, a stronger result.

After a section laying out notation and terminology, we proceed in Sect. 3 with the result on piecewise affine approximations and its applications. Section 4 establishes bounds on covering numbers. Section 5 constructs confidence regions and discusses rates of convergence for solutions of stochastic optimization problems and their applications to nonparametric estimation. The paper ends with an “Appendix” supplementing a proof.

2 Preliminaries

In some applications, it is natural and beneficial to consider usc functions defined only on a strict subset of \({\mathbb {R}}^n\), as extending them to the whole \({\mathbb {R}}^n\) by assigning the value \(-\infty \) may not be meaningful. For example, if an usc function represents a necessarily nonnegative probability density, then such an assignment would not yield a useful extension. Consequently, we develop most results for usc functions defined on a nonempty closed subset \(S\subset {\mathbb {R}}^n\), which could be all of \({\mathbb {R}}^n\) and is assumed to include the origin. Throughout, S will be such a set and the analysis will usually take place on the metric spaces \((S,\Vert \cdot - \cdot \Vert _\infty )\) and \((S\times {\mathbb {R}},\Vert \cdot - \cdot \Vert _\infty )\); the difference from the usual \(({\mathbb {R}}^n, \Vert \cdot -\cdot \Vert _\infty )\) is anyhow minor and will be highlighted when significant. Of course, the sup-norm can be replaced by any other norm, but this choice simplifies some expressions in Sect. 4. Likewise, the assumption \(0\in S\) can be relaxed, with the introduction of additional notation better avoided here. The facts of this section can be found in or deduced from [31, Chapter 7] and [33].

We recall that \(\mathrm{hypo} \;f = \{(x,\alpha )\in S\times {\mathbb {R}}~|~f(x) \ge \alpha \}\subset S\times {\mathbb {R}}\) is the hypograph of a function \(f:S\rightarrow {\overline{{\mathbb {R}}}}= [-\infty ,\infty ]\). The collection of usc functions on S is denoted by

$$\begin{aligned} {\text {usc-fcns}}(S) = \{f:S\rightarrow {\overline{{\mathbb {R}}}}~|~\mathrm{hypo} \;f \text{ is } \text{ nonempty } \text{ and } \text{ closed }\}. \end{aligned}$$

We let \({\mathbb {N}}= \{1, 2, \dots \}\). The outer limit of a sequence of sets \(\{A^\nu , \nu \in {\mathbb {N}}\}\) in a topological space, denoted by \(\mathop {\mathrm{OutLim}}\nolimits A^\nu \), is the collection of points to which a subsequence of \(\{a^\nu \in A^\nu , \nu \in {\mathbb {N}}\}\) converges. The inner limit, denoted by \(\mathop {\mathrm{InnLim}}\nolimits A^\nu \), is the collection of points to which a sequence \(\{a^\nu \in A^\nu , \nu \in {\mathbb {N}}\}\) converges. If both limits exist and are equal to A, we say that \(\{A^\nu , \nu \in {\mathbb {N}}\}\) set-converges to A and write \(A^\nu \rightarrow A\) or \(\mathop {\mathrm{Lim}}\nolimits A^\nu = A\). We denote by \(\mathrm{int}A\) and \(\mathrm{cl}A\) the interior and closure of A, respectively.

For \(f^\nu , f\in {\text {usc-fcns}}(S)\),

$$\begin{aligned} f^\nu \, \textit{hypo-converges}\;\hbox {to } f, \hbox { written } f^\nu \rightarrow f \Longleftrightarrow \mathrm{hypo} \;f^\nu \rightarrow \mathrm{hypo} \;f. \end{aligned}$$

Set-convergence of hypographs in this case, and therefore also hypo-convergence, is equivalent to having

$$\begin{aligned}&\forall x^\nu \in S\rightarrow x, ~\mathop {\mathrm{limsup}}\nolimits f^\nu (x^\nu ) \le f(x) \end{aligned}$$
(1)
$$\begin{aligned}&\forall x\in S, ~\exists x^\nu \in S\rightarrow x \text{ with } \mathop {\mathrm{liminf}}\nolimits f^\nu (x^\nu ) \ge f(x). \end{aligned}$$
(2)

The Attouch-Wets (aw) distance \({\mathbb {d}}\), which quantifies the distance between usc functions in terms of a distance between their hypographs, metrizes hypo-convergence. Specifically, for \(f,g\in {\text {usc-fcns}}(S)\), it is defined as

$$\begin{aligned} {\mathbb {d}}(f,g) = \int _{0}^\infty {\mathbb {d}}_\rho (f,g) e^{-\rho } d\rho , \end{aligned}$$

where, for \(\rho \ge 0\), the \(\rho \)-aw-distance

$$\begin{aligned} {\mathbb {d}}_\rho (f,g) = \mathop {\mathrm{max}}\nolimits _{z\in \rho {\mathbb {B}}_\infty } \big |\mathop {\mathrm{dist}}\nolimits _\infty \big (z, \mathrm{hypo} \;f\big ) - \mathop {\mathrm{dist}}\nolimits _\infty \big (z, \mathrm{hypo} g\big )\big |, \end{aligned}$$

with \(\mathop {\mathrm{dist}}\nolimits _\infty (z, A)\) being the usual point-to-set distance between a point \(z\in S\times {\mathbb {R}}\) and a set \(A\subset S\times {\mathbb {R}}\) under the sup-norm, \(\rho {\mathbb {B}}_\infty = {\mathbb {B}}_\infty (0,\rho )\), with \({\mathbb {B}}_\infty ({\bar{z}},\rho ) = \{z \in S\times {\mathbb {R}}~|~ \Vert {\bar{z}} - z\Vert _\infty \le \rho \}\) for any \({\bar{z}}\in S\times {\mathbb {R}}\). Since the meaning will be clear from the context, we also write \({\mathbb {B}}_\infty ({\bar{x}},\rho ) = \{x \in S~|~ \Vert {\bar{x}} - x\Vert _\infty \le \rho \}\) with \({\bar{x}}\in S\). For any nonempty closed set \(S\subset {\mathbb {R}}^n\), \(({\text {usc-fcns}}(S),{\mathbb {d}})\) is a complete separable metric space. Every closed and bounded subset \(F\subset ({\text {usc-fcns}}(S),{\mathbb {d}})\) is compact. Moreover, for all \(f,g\in {\text {usc-fcns}}(S)\),

$$\begin{aligned}&\big |\mathop {\mathrm{dist}}\nolimits _\infty (0,\mathrm{hypo} \;f) - \mathop {\mathrm{dist}}\nolimits _\infty (0,\mathrm{hypo} \;g)\big | \nonumber \\&\quad \le {\mathbb {d}}(f,g) \le \max \big \{\mathop {\mathrm{dist}}\nolimits _\infty (0,\mathrm{hypo} \;f), \mathop {\mathrm{dist}}\nolimits _\infty (0,\mathrm{hypo} \;g)\big \} + 1 \end{aligned}$$
(3)

and, thus, if \(f,g\ge 0\), then \(0\le {\mathbb {d}}(f,g) \le 1\). We also see that a sufficient condition for F to be bounded is that there exists \((x,\alpha )\in S\times {\mathbb {R}}\) such that \(f(x)\ge \alpha \) for all \(f\in F\).
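To make the \(\rho \)-aw-distance concrete, it can be approximated numerically by discretizing the hypographs on a grid and comparing point-to-set distances under the sup-norm. The following Python sketch (not part of the paper's development; helper names are hypothetical and accuracy depends on the grid resolution) illustrates \({\mathbb {d}}_\rho \) for functions on \({\mathbb {R}}\):

```python
import itertools

import numpy as np

def hypo_points(f, xs, alphas):
    """Finite stand-in for hypo f = {(x, a) : a <= f(x)} on a grid."""
    return np.array([(x, a) for x in xs for a in alphas if a <= f(x)])

def dist_inf(z, A):
    """Sup-norm point-to-set distance from the point z to the finite set A."""
    return float(np.min(np.max(np.abs(A - z), axis=1)))

def d_rho(f, g, rho, n_grid=41):
    """Grid approximation of max_{z in rho*B} |dist(z, hypo f) - dist(z, hypo g)|."""
    xs = np.linspace(-rho, rho, n_grid)
    alphas = np.linspace(-rho, rho, n_grid)
    Hf, Hg = hypo_points(f, xs, alphas), hypo_points(g, xs, alphas)
    return max(
        abs(dist_inf(np.array(z), Hf) - dist_inf(np.array(z), Hg))
        for z in itertools.product(xs, alphas)
    )
```

For instance, for the constant functions \(f\equiv 0\) and \(g\equiv -0.5\) with \(\rho =1\), the hypographs differ by a vertical shift and the grid value comes out near 0.5.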

If not specified otherwise, the index \(\nu \) runs over \({\mathbb {N}}\) so that \(x^\nu \rightarrow x\) means that the whole sequence \(\{x^\nu , \nu \in {\mathbb {N}}\}\) converges to x. Let

$$\begin{aligned} {{\mathcal {N}}}_\infty ^{\scriptscriptstyle \#} \text{ be } \text{ all } \text{ the } \text{ subsets } \text{ of } {\mathbb {N}}\hbox { determined by subsequences}, \end{aligned}$$

i.e., \(N\in {{\mathcal {N}}}_\infty ^{\scriptscriptstyle \#}\) is an infinite collection of strictly increasing natural numbers. Thus, \(\{x^\nu , \nu \in N\}\) is a subsequence of \(\{x^\nu , \nu \in {\mathbb {N}}\}\); its convergence to x is denoted by \(x^\nu \xrightarrow {N}x\).

A similar development is available for functions defined on the metric space \(({\text {usc-fcns}}(S),{\mathbb {d}})\). However, we adopt a slightly different set-up that highlights the role of domains of definition. For \(F,F^\nu \subset {\text {usc-fcns}}(S)\), the functions \(\varphi ^\nu :F^\nu \rightarrow {\overline{{\mathbb {R}}}}\) epi-converge to \(\varphi :F\rightarrow {\overline{{\mathbb {R}}}}\) whenever

$$\begin{aligned}&\forall N\in {{\mathcal {N}}}_\infty ^{\scriptscriptstyle \#} \text{ and } f^\nu \in F^\nu \xrightarrow {N}f, ~\mathop {\mathrm{liminf}}\nolimits _{\nu \in N} \varphi ^\nu (f^\nu ) \\&\quad \ge \varphi (f) \text{ if } f\in F \text{ and } \varphi ^\nu (f^\nu )\xrightarrow {N}\infty \text{ otherwise }\\&\forall f\in F, ~\exists f^\nu \in F^\nu \rightarrow f \text{ with } \mathop {\mathrm{limsup}}\nolimits \varphi ^\nu (f^\nu ) \le \varphi (f). \end{aligned}$$

For \(\varepsilon \ge 0\), \(\varepsilon \text{- }\mathop {\mathrm{argmin}}\nolimits _{f\in F} \varphi (f) = \{f\in F~|~\varphi (f) \le \mathop {\mathrm{inf} }\nolimits _{g\in F} \varphi (g) + \varepsilon \}\), with the usual extended real-valued calculus in play when needed. We deviate slightly from the convention in [31] by setting \(\varepsilon \text{- }\mathop {\mathrm{argmin}}\nolimits _{f\in F} \varphi (f) = F\) when \(\varphi (f) = \infty \) for all \(f\in F\). This is tenable because we refrain from extending functions to the whole space and \(\infty \) is not assigned a special role in that regard. Consequently, we alleviate the need for checking that functions are finite at least somewhere, which in an infinite-dimensional setting may require excessively strong assumptions.
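The \(\varepsilon \)-argmin convention just described can be made concrete for a finite collection. A small Python sketch (`eps_argmin` is a hypothetical helper; numbers stand in for the elements \(f\in F\)):

```python
def eps_argmin(phi, F, eps):
    """eps-argmin_{f in F} phi(f) for a finite collection F; returns all of F
    when phi is identically +inf, per the convention adopted in the text."""
    vals = {f: phi(f) for f in F}
    m = min(vals.values())
    if m == float("inf"):
        return set(F)
    return {f for f, v in vals.items() if v <= m + eps}
```

For example, with \(\varphi (f)=f^2\) on \(F=\{-2,-1,0,1\}\) and \(\varepsilon =1\), the set \(\{-1,0,1\}\) is returned.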

3 Piecewise affine approximations

In this section, we establish that every usc function on \(S\subset {\mathbb {R}}^n\) can be approximated by piecewise affine functions with a particular structure under the additional assumption that S is convex. For \(\rho \in [0,\infty )\) and \(q\in {\mathbb {N}}\), let

$$\begin{aligned}&{\text {pa-fcns}}^{q}_\rho (S) = \Big \{f:S\rightarrow [-\infty ,\infty )~\Big |~\exists a^k,b^k\in {\mathbb {R}}^n, \\&\quad \alpha _k,\beta _k\in {\mathbb {R}},~ k=1, \dots , q \text{ such } \text{ that } \\&~f(x) = \mathop {\mathrm{max}}\nolimits _{k=1, \dots , q} \big [\langle a^k, x\rangle + \alpha _k\big ] - \mathop {\mathrm{max}}\nolimits _{k=1, \dots , q} \big [\langle b^k, x\rangle + \beta _k\big ]~\forall x\in S\cap \rho {\mathbb {B}}_\infty ;\\&\quad f(x) = -\infty \text{ otherwise }\Big \}. \end{aligned}$$

A function in \({\text {pa-fcns}}^{q}_\rho (S)\) is a difference of pointwise maxima of affine functions on \(S \cap \rho {\mathbb {B}}_\infty \) and therefore is finite and continuous on that set. We say that \(U\subset {\text {usc-fcns}}(S)\) is hypo-dense in \({\text {usc-fcns}}(S)\) if every \(f\in {\text {usc-fcns}}(S)\) is the limit of a hypo-converging sequence \(\{f^\nu \in U, \nu \in {\mathbb {N}}\}\).
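Evaluating a function in \({\text {pa-fcns}}^{q}_\rho (S)\) is straightforward; the following Python sketch (with the hypothetical constructor `make_pa_fcn`, and \(S={\mathbb {R}}^n\) for simplicity) returns the difference of the two max-affine pieces on the box \(\rho {\mathbb {B}}_\infty \) and \(-\infty \) outside it:

```python
import numpy as np

def make_pa_fcn(A, alpha, B, beta, rho):
    """Difference-of-max piecewise affine function:
    max_k <a^k, x> + alpha_k  -  max_k <b^k, x> + beta_k on rho*B_inf,
    and -inf outside that box. Rows of A and B hold the vectors a^k, b^k."""
    A, B = np.atleast_2d(A), np.atleast_2d(B)
    alpha, beta = np.asarray(alpha, float), np.asarray(beta, float)
    def f(x):
        x = np.atleast_1d(np.asarray(x, float))
        if np.max(np.abs(x)) > rho:
            return -np.inf
        return float(np.max(A @ x + alpha) - np.max(B @ x + beta))
    return f
```

For instance, with \(q=2\), \(a^1=1\), \(a^2=-1\), zero offsets, and a zero second max, the function agrees with \(|x|\) on the box.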

Theorem 3.1

(piecewise affine approximations) Suppose that S is convex and that \(\rho ^\nu \in [0,\infty )\) and \(q^\nu \in {\mathbb {N}}\) tend to \(\infty \). Then,

$$\begin{aligned} \bigcup _{\nu \in {\mathbb {N}}} {\text {pa-fcns}}^{q^\nu }_{\rho ^\nu }(S) \text{ is } \text{ hypo-dense } \text{ in } {\text {usc-fcns}}(S). \end{aligned}$$

Proof

Let \(f\in {\text {usc-fcns}}(S)\). We construct a sequence in \(\cup _{\nu \in {\mathbb {N}}} {\text {pa-fcns}}^{q^\nu }_{\rho ^\nu }(S)\) that hypo-converges to f. Let \(f^\nu :S\rightarrow {\overline{{\mathbb {R}}}}\) have \(f^\nu (x) = \min \{f(x),\nu \}\) for all \(x\in S\). Clearly, \(f^\nu \rightarrow f\); recall that \(\rightarrow \) always denotes hypo-convergence when written between usc functions. For any \(\nu \in {\mathbb {N}}\), \(f^\nu \) is (upper) prox-bounded so that for every \(\lambda >0\), the (upper) Moreau-envelope \(e_\lambda f^\nu :S\rightarrow {\mathbb {R}}\) of \(f^\nu \), which is given by

$$\begin{aligned} e_\lambda f^\nu (x) = \mathop {\mathrm{sup}}\nolimits _{y\in S} f^\nu (y) - \frac{1}{2\lambda } \Vert y-x\Vert _2^2, \end{aligned}$$

is finite and continuous. Moreover, \(e_\lambda f^\nu \rightarrow f^\nu \) as \(\lambda \searrow 0\) (see for example the discussion after Proposition 7.4 in [31]). Thus, there exists \(\{\lambda ^\nu >0, ~\nu \in {\mathbb {N}}\}\rightarrow 0\) such that

$$\begin{aligned} e_{\lambda ^\nu } f^\nu \rightarrow f. \end{aligned}$$

Next, we define \(\varphi ^\nu :S\rightarrow {\overline{{\mathbb {R}}}}\) as

$$\begin{aligned} \varphi ^\nu (x) = e_{\lambda ^\nu } f^\nu (x) ~\forall x\in S\cap \rho ^\nu {\mathbb {B}}_\infty \text{ and } \varphi ^\nu (x) = -\infty \text{ otherwise }. \end{aligned}$$

Since \(\rho ^\nu {\mathbb {B}}_\infty \rightarrow {\mathbb {R}}^n\), we also have that \(\varphi ^\nu \rightarrow f\).

Every real-valued continuous function on a convex compact subset of \({\mathbb {R}}^n\) is the limit in the sup-norm of a sequence of finite-valued dc functions defined on the same set; see for example [21, Prop. 2.3]. Consequently, for every \(\nu \in {\mathbb {N}}\), there exist convex functions \(\{g^\nu _\mu ,h^\nu _\mu :S\rightarrow {\overline{{\mathbb {R}}}}, ~\mu \in {\mathbb {N}}\}\), finite on \(S\cap \rho ^\nu {\mathbb {B}}_\infty \), with the property that

$$\begin{aligned} \mathop {\mathrm{sup}}\nolimits _{x\in S\cap \rho ^\nu {\mathbb {B}}_\infty } \big |\varphi ^\nu (x) - [g^\nu _\mu (x) - h^\nu _\mu (x)]\big | \rightarrow 0 \text{ as } \mu \rightarrow \infty . \end{aligned}$$

Let \(\psi ^\nu _\mu :S\rightarrow {\overline{{\mathbb {R}}}}\) be defined by

$$\begin{aligned} \psi ^\nu _\mu (x) = g^\nu _\mu (x) - h^\nu _\mu (x) \text{ for } x\in S\cap \rho ^\nu {\mathbb {B}}_\infty \text{ and } \psi ^\nu _\mu (x) = - \infty \text{ otherwise. } \end{aligned}$$

Since already \(\varphi ^\nu \rightarrow f\), we can construct \(\{\mu ^\nu \in {\mathbb {N}}, ~\nu \in {\mathbb {N}}\}\rightarrow \infty \) as \(\nu \rightarrow \infty \) such that \(\psi ^\nu _{\mu ^\nu } \rightarrow f\). Let \(\psi ^\nu = \psi ^\nu _{\mu ^\nu }\).

The convex functions \(\{g^\nu _{\mu ^\nu },h^\nu _{\mu ^\nu }, ~\nu \in {\mathbb {N}}\}\) are lsc and proper. Let \(g^\nu = g^\nu _{\mu ^\nu }\) and \(h^\nu = h^\nu _{\mu ^\nu }\). Consequently, for every \(\nu \in {\mathbb {N}}\),

$$\begin{aligned} g^\nu (x) = \mathop {\mathrm{sup}}\nolimits _{(a,\alpha )\in A(g^\nu )} \{\langle a,x\rangle + \alpha \} \text{ and } h^\nu (x) = \mathop {\mathrm{sup}}\nolimits _{(a,\alpha )\in A(h^\nu )} \{\langle a,x\rangle + \alpha \} \text{ for } x\in S, \end{aligned}$$

where for \(u:S\rightarrow {\overline{{\mathbb {R}}}}\),

$$\begin{aligned} A(u) = \{(a,\alpha )\in {\mathbb {R}}^n\times {\mathbb {R}}~|~\langle a,x\rangle + \alpha \le u(x)~\forall x\in S\}. \end{aligned}$$

Since \(A(g^\nu ),A(h^\nu )\subset {\mathbb {R}}^{n+1}\), which is separable, there exist increasing sets \(\{A^\mu (g^\nu ), A^\mu (h^\nu ), ~\mu \in {\mathbb {N}}\}\), each with finite cardinality, such that

$$\begin{aligned} \bigcup _{\mu \in {\mathbb {N}}} A^\mu (g^\nu ) \text{ is } \text{ dense } \text{ in } A(g^\nu ) \text{ and } \bigcup _{\mu \in {\mathbb {N}}} A^\mu (h^\nu ) \text{ is } \text{ dense } \text{ in } A(h^\nu ). \end{aligned}$$

For \(\nu , \mu \in {\mathbb {N}}\), we define \({\tilde{g}}^\nu _\mu ,\tilde{h}^\nu _\mu :S\rightarrow {\overline{{\mathbb {R}}}}\) by setting

$$\begin{aligned} {\tilde{g}}_\mu ^\nu (x)= & {} \mathop {\mathrm{max}}\nolimits _{(a,\alpha )\in A^\mu (g^\nu )} \{\langle a,x\rangle + \alpha \} \text{ and } \\ {\tilde{h}}_\mu ^\nu (x)= & {} \mathop {\mathrm{max}}\nolimits _{(a,\alpha )\in A^\mu (h^\nu )} \{\langle a,x\rangle + \alpha \} \text{ for } x\in S\cap \rho ^\nu {\mathbb {B}}_\infty , \end{aligned}$$

and \({\tilde{g}}_\mu ^\nu (x)={\tilde{h}}_\mu ^\nu (x)=\infty \) otherwise. The characterization of hypo-convergence in (1) and (2) enables us to conclude that for all \(\nu \in {\mathbb {N}}\),

$$\begin{aligned} -{\tilde{g}}_\mu ^\nu \rightarrow -g^\nu \text{ and } -{\tilde{h}}_\mu ^\nu \rightarrow -h^\nu \text{ as } \mu \rightarrow \infty , \end{aligned}$$

with pointwise convergence holding as well on \(S\cap \rho ^\nu {\mathbb {B}}_\infty \).

Let \({{\tilde{\psi }}}^\nu _\mu :S\rightarrow {\overline{{\mathbb {R}}}}\) be defined by

$$\begin{aligned} {{\tilde{\psi }}}^\nu _\mu (x) = {\tilde{g}}^\nu _\mu (x) - \tilde{h}^\nu _\mu (x) \text{ for } x\in S\cap \rho ^\nu {\mathbb {B}}_\infty \text{ and } {{\tilde{\psi }}}^\nu _\mu (x) = - \infty \text{ otherwise. } \end{aligned}$$

If \({{\tilde{\psi }}}_\mu ^\nu \rightarrow \psi ^\nu \) as \(\mu \rightarrow \infty \), then we can construct \(\{\mu ^\nu \in {\mathbb {N}}, ~\nu \in {\mathbb {N}}\}\rightarrow \infty \) such that

$$\begin{aligned} {{\tilde{\psi }}}^\nu _{\mu ^\nu } \rightarrow f \end{aligned}$$

because already \(\psi ^\nu \rightarrow f\). Since \({{\tilde{\psi }}}^\nu _{\mu ^\nu }\in {\text {pa-fcns}}^{q^\nu }_{\rho ^\nu }(S)\), with \(q^\nu \) being the largest cardinality of \(A^{\mu ^\nu }(g^\nu )\) and of \(A^{\mu ^\nu }(h^\nu )\), the conclusion will follow.

It only remains to establish that \({{\tilde{\psi }}}_\mu ^\nu \rightarrow \psi ^\nu \) as \(\mu \rightarrow \infty \). For this purpose, we again leverage the characterization of hypo-convergence in (1) and (2). Suppose that \(x^\mu \in S\cap \rho ^\nu {\mathbb {B}}_\infty \rightarrow x\). Then,

$$\begin{aligned} \mathop {\mathrm{limsup}}\nolimits _\mu \big ( {\tilde{g}}_\mu ^\nu (x^\mu ) - {\tilde{h}}_\mu ^\nu (x^\mu ) \big )&\le \mathop {\mathrm{limsup}}\nolimits _\mu {\tilde{g}}_\mu ^\nu (x^\mu ) - \mathop {\mathrm{liminf}}\nolimits _\mu {\tilde{h}}_\mu ^\nu (x^\mu )\\&\le \mathop {\mathrm{limsup}}\nolimits _\mu g^\nu (x^\mu ) - h^\nu (x) \le g^\nu (x) - h^\nu (x), \end{aligned}$$

where we use the facts that \({\tilde{g}}_\mu ^\nu \) lower bounds \(g^\nu \), \(-{\tilde{h}}_\mu ^\nu \rightarrow -h^\nu \), and \(g^\nu \) is continuous on \(S\cap \rho ^\nu {\mathbb {B}}_\infty \). Also, for \(x\in S\cap \rho ^\nu {\mathbb {B}}_\infty \),

$$\begin{aligned} \mathop {\mathrm{liminf}}\nolimits _\mu \big ( {\tilde{g}}_\mu ^\nu (x) - {\tilde{h}}_\mu ^\nu (x) \big ) = \mathop {\mathrm{liminf}}\nolimits _\mu {\tilde{g}}_\mu ^\nu (x) - \mathop {\mathrm{limsup}}\nolimits _\mu \tilde{h}_\mu ^\nu (x) \ge g^\nu (x) - h^\nu (x). \end{aligned}$$

These inequalities are trivially satisfied for sequences outside of \(S\cap \rho ^\nu {\mathbb {B}}_\infty \). Hence, the assertion is established. \(\square \)
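The upper Moreau-envelope smoothing employed at the start of the proof is easy to examine numerically. The following Python sketch (not part of the paper's development) approximates \(e_\lambda f(x) = \sup _y f(y) - \Vert y-x\Vert _2^2/(2\lambda )\) by taking the sup over a one-dimensional grid:

```python
import numpy as np

def upper_moreau(f_vals, xs, lam):
    """Grid approximation of e_lam f(x) = sup_y f(y) - |y - x|^2 / (2 lam),
    with the sup restricted to the grid points xs carrying the values f_vals."""
    return np.array([np.max(f_vals - (xs - x) ** 2 / (2.0 * lam)) for x in xs])
```

Applied to the usc step function \(f(x)=1\) for \(x\le 0\) and \(f(x)=0\) otherwise, the envelope is finite, continuous, and dominates f, and shrinking \(\lambda \) tightens it toward f, in line with \(e_\lambda f^\nu \rightarrow f^\nu \) as \(\lambda \searrow 0\).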

We illustrate the usefulness of the theorem in the solution of function identification problems of the form

$$\begin{aligned} \text{(FIP) }~~~~\min _{f\in F} \varphi (f), \text{ where } F\subset {\text {usc-fcns}}(S) \text{ and } \varphi :F\rightarrow {\overline{{\mathbb {R}}}}. \end{aligned}$$

These problems arise in nonparametric estimation, spatial statistics, and curve fitting; see [34] for applications to estimation of financial curves, electricity demand, commodity prices, and uncertainty in physical systems. For example, if F is a class of n-dimensional probability density functions and \(\varphi (f) = -\frac{1}{m}\sum _{j=1}^m \log f(x^j)\), then any minimizer of (FIP) is a maximum likelihood estimate based on the data \(x^1, \dots , x^m\in S\). When \(\varphi (f) = \frac{1}{m}\sum _{j=1}^m (y^j - f(x^j))^2\), a minimizer furnishes a least-squares fit of the data \(\{(x^j, y^j)\in S\times {\mathbb {R}}, ~j=1, \dots , m\}\) over the class F. We refer to [35, 38] for numerous examples. Further unexplored applications for usc functions and their piecewise affine approximations may arise in stochastic and robust optimization where problems can be formulated over spaces of decision rules and be approximated using polynomial and piecewise polynomial functions [2], finite collections of policies [17], and linear decision rules, possibly in higher-dimensional spaces [12]. In special cases of (FIP), such as when F consists of concave functions only, one might be able to reformulate the problem as an equivalent finite-dimensional one; see [9] for the context of maximum likelihood estimation over the log-concave class and [41] for least-squares regression over convex functions. However, this is not possible in general and we need to settle for approximations.

For \(\rho ^\nu \in [0,\infty )\) and \(q^\nu \in {\mathbb {N}}\), we consider the approximating function identification problem

$$\begin{aligned} \text{(FIP) }^\nu ~~~~\min _{f\in F^\nu } \varphi (f), \text{ where } F^\nu = F\cap {\text {pa-fcns}}^{q^\nu }_{\rho ^\nu }(S). \end{aligned}$$

Every function in \({\text {pa-fcns}}^{q^\nu }_{\rho ^\nu }(S)\) is described by \(2q^\nu (n+1)\) parameters. Thus, \(\text{(FIP) }^\nu \) is equivalent to a finite-dimensional optimization problem with the same number of variables. The number grows only linearly in n, which makes the approach promising for high-dimensional problems. In comparison, approximations based on epi-splines (see [35, 38]) require a preselected partition of S not easily decided on in a computationally tractable manner beyond four or five dimensions. The piecewise approximations in \(\text{(FIP) }^\nu \) are mesh free, with the domain of each affine component adapting to the problem at hand. Thus, we expect to be able to identify and leverage low-dimensional structures, if present, when solving (FIP) by means of \(\text{(FIP) }^\nu \).
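To make the parameter count concrete, the least-squares instance of \(\text{(FIP) }^\nu \) reduces to minimizing a function of a vector of length \(2q^\nu (n+1)\). A Python sketch of that finite-dimensional objective (`unpack` and `fip_objective` are hypothetical names; data points are assumed to lie in the box \(\rho ^\nu {\mathbb {B}}_\infty \)):

```python
import numpy as np

def unpack(theta, q, n):
    """Split theta (length 2q(n+1)) into the max-affine pieces (A, alpha, B, beta)."""
    a_end, al_end, b_end = q * n, q * (n + 1), q * (2 * n + 1)
    A = theta[:a_end].reshape(q, n)
    alpha = theta[a_end:al_end]
    B = theta[al_end:b_end].reshape(q, n)
    beta = theta[b_end:]
    return A, alpha, B, beta

def fip_objective(theta, q, xs, ys):
    """Least-squares objective (1/m) sum_j (y^j - f(x^j))^2, where f is the
    difference-of-max function encoded by theta; xs has shape (m, n)."""
    n = xs.shape[1]
    A, alpha, B, beta = unpack(theta, q, n)
    vals = np.max(xs @ A.T + alpha, axis=1) - np.max(xs @ B.T + beta, axis=1)
    return float(np.mean((ys - vals) ** 2))
```

For example, with \(n=1\) and \(q=2\), the vector encoding \(f(x)=\max \{x,-x\}-0=|x|\) fits the data \((-1,1),(0,0),(1,1)\) exactly, giving objective value zero.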

It is apparent that an application may demand approximations also of the objective function in (FIP), which we address in Sect. 5 for the central case of stochastic optimization where \(\varphi = {\mathbb {E}}[\psi ({{\varvec{\xi }}},\cdot )]\); the expectation with respect to the distribution of a random element \({{\varvec{\xi }}}\) is denoted by \({\mathbb {E}}\). Here, we concentrate on the application of Theorem 3.1 to justify \(\text{(FIP) }^\nu \).

Let \(\varphi ^\nu :F^\nu \rightarrow {\overline{{\mathbb {R}}}}\) be the function defined by \(\varphi ^\nu (f) = \varphi (f)\) for \(f\in F^\nu \). Suppose that S is convex, F is a nonempty and solid subset of \(({\text {usc-fcns}}(S),{\mathbb {d}})\), i.e., \(F=\mathrm{cl}(\mathrm{int}F)\), and \(\varphi :F\rightarrow {\overline{{\mathbb {R}}}}\) is continuous on F. Then, a standard argument (see for example the proof of Theorem 3.16 in [35]) in conjunction with Theorem 3.1 establishes that

$$\begin{aligned} \varphi ^\nu \text{ epi-converges } \text{ to } \varphi \text{ provided } \text{ that } \rho ^\nu ,q^\nu \rightarrow \infty . \end{aligned}$$

Thus, when \(\{\varepsilon ^\nu \ge 0,~ \nu \in {\mathbb {N}}\}\rightarrow 0\),

$$\begin{aligned} \mathop {\mathrm{OutLim}}\nolimits \Big ( \varepsilon ^\nu \text{- }\mathop {\mathrm{argmin}}\nolimits _{f\in F^\nu } \varphi (f) \Big ) \subset \mathop {\mathrm{argmin}}\nolimits _{f\in F} \varphi (f), \end{aligned}$$

which can be deduced, for example, from [33]. The constraint qualification that F is solid cannot be relaxed without introducing some other assumption. Obviously, \(F\cap {\text {pa-fcns}}^{q^\nu }_{\rho ^\nu }(S)\) can, in general, be empty for all \(\nu \), but when F is solid this is ruled out.

In view of this discussion, the challenge of solving an infinite-dimensional function identification problem from the broad class (FIP) is shifted to that of obtaining a near-minimizer of a finite-dimensional problem. Of course, the difficulty of that task depends on the specific properties of \(\varphi \) and F. Typically, \(\text{(FIP) }^\nu \) would be nonconvex, but the special affine structure of functions in \({\text {pa-fcns}}^{q^\nu }_{\rho ^\nu }(S)\) is bound to be important in developing computational procedures. Initial efforts in that direction are already found in [8], which presents several algorithms with guarantees to obtain at least certain stationary points and numerical results from nonparametric least-squares regression, as well as in [28], which approximates functions in up to \(n=41\) dimensions using piecewise affine functions in the context of nonparametric support vector machines. The nonconvexity of \(\text{(FIP) }^\nu \) encountered in [28] appears to be only moderately challenging and handled by common randomization strategies.

4 Covering numbers

It is well known that every bounded \(F\subset ({\text {usc-fcns}}(S), {\mathbb {d}})\) has a finite cover by virtue of being totally bounded. However, this is not sufficient to establish certain rate-of-convergence results for sample average approximations of stochastic optimization problems of the form \(\min _{f\in F} {\mathbb {E}}[\psi ({{\varvec{\xi }}},f)]\). It is usually necessary to bound, for all \(\varepsilon >0\), the covering number of F, denoted by \(N(F,\varepsilon )\), which is the smallest number of closed \({\mathbb {d}}\)-balls of radius \(\varepsilon \) needed to cover F. We next provide such a bound and show that it is nearly sharp. Section 5 applies the result to establish rates of convergence of minimizers of stochastic optimization problems.
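For a finite stand-in for F, the covering number can be bounded from above by the classical greedy construction: repeatedly designate any point not within \(\varepsilon \) of a previous center as a new center. The resulting centers form an \(\varepsilon \)-cover, so their count upper-bounds \(N(F,\varepsilon )\). A generic metric-space sketch in Python (illustrative only, not tied to the aw-distance):

```python
def greedy_cover(points, dist, eps):
    """Greedy eps-cover: every point ends up within eps of some center,
    so len(centers) is an upper bound on the covering number."""
    centers = []
    for p in points:
        if all(dist(p, c) > eps for c in centers):
            centers.append(p)
    return centers
```

On the integers \(0,\dots ,10\) with the absolute-value metric and \(\varepsilon =1\), the greedy pass selects the even points, yielding six centers.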

We start by recording a useful estimate of the hypo-distance. For \(f,g\in {\text {usc-fcns}}(S)\) and \(\rho \ge 0\), we define the auxiliary quantity

$$\begin{aligned} \hat{\mathbb {d}}_\rho (f,g)&= \mathop {\mathrm{inf} }\nolimits \Big \{ \tau \ge 0 ~\Big |\\&\quad \mathop {\mathrm{sup}}\nolimits _{y\in {\mathbb {B}}_\infty (x,\tau )} g(y) \ge \min \{f(x), \rho \} - \tau , \forall x\in \rho {\mathbb {B}}_\infty \text{ with } f(x) \ge -\rho \\&\quad \mathop {\mathrm{sup}}\nolimits _{y\in {\mathbb {B}}_\infty (x,\tau )} f(y) \ge \min \{g(x), \rho \} - \tau , \forall x\in \rho {\mathbb {B}}_\infty \text{ with } g(x) \ge -\rho \Big \}. \end{aligned}$$

As the notation indicates, \(\hat{\mathbb {d}}_\rho \) is closely related to \({\mathbb {d}}_\rho \) (see Proposition 3.1 in [33]) and therefore also to \({\mathbb {d}}\). We record the relevant properties next.

Lemma 4.1

For \(f,g\in {\text {usc-fcns}}(S)\) and \(\rho \ge 0\),

$$\begin{aligned} e^{-\rho } \hat{\mathbb {d}}_\rho (f,g)\le {\mathbb {d}}(f,g) \le (1-e^{-\rho }){\hat{\mathbb {d}}}_{2\rho +\delta }(f,g) + e^{-\rho }(\delta +\rho +1), \end{aligned}$$

where \(\delta = \max \{\mathop {\mathrm{dist}}\nolimits _\infty (0,\mathrm{hypo} \;f), \mathop {\mathrm{dist}}\nolimits _\infty (0,\mathrm{hypo} \;g)\}\).

Proof

The results can be deduced from Propositions 3.1 and 3.2 in [33]. \(\square \)
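For intuition, the auxiliary quantity \(\hat{\mathbb {d}}_\rho \) can be estimated numerically for functions given on a grid by testing its two defining conditions for increasing values of \(\tau \). The following sketch is for illustration only and is not part of the formal development; the one-dimensional grid, the candidate \(\tau \) values, and the example functions are assumptions.

```python
import numpy as np

# Numerical sketch: estimate d-hat_rho(f, g) on a one-dimensional grid by
# returning the smallest candidate tau for which both defining conditions
# hold.  Grid and tau candidates are illustrative assumptions.
def d_hat_rho(f, g, xs, rho, taus):
    fv = np.array([f(x) for x in xs])
    gv = np.array([g(x) for x in xs])

    def one_sided(av, bv, tau):
        # for every x in rho*B with a(x) >= -rho:
        #   sup_{|y - x| <= tau} b(y) >= min(a(x), rho) - tau
        for i, x in enumerate(xs):
            if abs(x) <= rho and av[i] >= -rho:
                near = np.abs(xs - x) <= tau
                if bv[near].max() < min(av[i], rho) - tau:
                    return False
        return True

    for tau in taus:  # taus assumed sorted increasingly
        if one_sided(fv, gv, tau) and one_sided(gv, fv, tau):
            return tau
    return np.inf

xs = np.linspace(-2.0, 2.0, 401)
taus = np.linspace(0.0, 1.0, 101)
```

For example, with \(f\equiv 0\) and \(g\equiv 0.5\) and \(\rho = 1\), the estimate returns 0.5: a vertical shift of size 0.5 is detected exactly, since only the condition on the g-side of the definition is binding.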

As mentioned in Sect. 1, epi-splines [33, 35] furnish a dense subset of classes of semicontinuous functions, and associated error bounds are known. We leverage these results here. Although the piecewise affine functions of Sect. 3 are also dense in the usc functions, no error bounds are presently available for them, so they cannot serve as the basis for the construction in the proof of the next theorem. This is anyhow less critical since a lower bound result (Theorem 4.4 below) shows that the obtained upper bound on covering numbers is within a logarithmic factor of being sharp.

For any \(f:S\rightarrow {\overline{{\mathbb {R}}}}\) and \(x\in S\), let \(\mathop {\mathrm{liminf}}\nolimits _{{\bar{x}}\rightarrow x} f({\bar{x}}) = \mathop {\mathrm{lim}}\nolimits _{\delta \downarrow 0}\mathop {\mathrm{inf} }\nolimits _{\bar{x}\in {\mathbb {B}}_\infty (x,\delta )} f({\bar{x}})\). Epi-splines are defined in terms of a finite collection of subsets of S. A finite collection \(R_1, R_2, \ldots , R_K\) of open subsets of S is a partition of S if \(\cup _{k=1}^K \mathrm{cl}R_k = S\) and \(R_k \cap R_l = \emptyset \) for all \(k\ne l\). Specifically, an epi-spline \(s:S\rightarrow {\mathbb {R}}\), with partition \({{\mathcal {R}}}=\{R_1, \dots , R_K\}\) of S, is a function that

$$\begin{aligned}&\text{ on } \text{ each } R_k\text{, } k=1, \ldots , K\text{, } \text{ takes } \text{ a } \text{ constant } \text{ real } \text{ number } \text{ as } \text{ value, }\\&\text{ and } \text{ for } \text{ every } x\in S\text{, } \text{ has } s(x) = \mathop {\mathrm{liminf}}\nolimits _{x'\rightarrow x} s(x'). \end{aligned}$$

The family of all such epi-splines is denoted by \({\text {e-spl}}({{\mathcal {R}}})\). Epi-splines are lsc by construction and approximate lsc functions in the sense of epi-convergence. Since the present setting involves usc functions and hypo-convergence, we “reorient” and introduce minus signs in some expressions. We refer to [33, 35] for further information and extensions that go beyond these zeroth order epi-splines and also beyond \({\mathbb {R}}^n\).
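A zeroth-order epi-spline is easy to construct explicitly in one dimension: it is constant on each open cell of the partition and takes the liminf value, i.e., the minimum of the adjacent cell constants, at the cell boundaries. The following sketch illustrates the definition under the assumption that the partition consists of consecutive open intervals; it is not code from [33, 35].

```python
import numpy as np

# Zeroth-order epi-spline on S = [knots[0], knots[-1]]: constant on each
# open cell (knots[k], knots[k+1]) and lower semicontinuous at the knots,
# where its value is the liminf (minimum of the adjacent cell constants).
def make_epi_spline(knots, values):
    knots = np.asarray(knots, dtype=float)
    values = np.asarray(values, dtype=float)
    assert len(values) == len(knots) - 1

    def s(x):
        if x < knots[0] or x > knots[-1]:
            raise ValueError("x outside S")
        # cells whose closure contains x (two cells at an interior knot)
        adjacent = [k for k in range(len(values))
                    if knots[k] <= x <= knots[k + 1]]
        return float(min(values[k] for k in adjacent))
    return s
```

For instance, `make_epi_spline([0, 1, 2], [3, 1])` has value 3 on (0, 1), value 1 on (1, 2), and the liminf value 1 at the interior knot, making it lsc there.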

Proposition 4.2

For a partition \({{\mathcal {R}}}=\{R_1, \dots , R_K\}\) of S and \(\rho \ge 0\), we have that for every \(f\in {\text {usc-fcns}}(S)\), there exists an \(s\in {\text {e-spl}}({{\mathcal {R}}})\) such that

$$\begin{aligned}&\hat{\mathbb {d}}_\rho (f,-s) \le \mu _\rho ({{\mathcal {R}}}) \\&\quad = \mathop {\mathrm{inf} }\nolimits \big \{ \tau \ge 0~|~ R_k \subset {\mathbb {B}}_\infty (x,\tau ) \text{ for } \text{ all } x\in \rho {\mathbb {B}}_\infty \text{ and } k \text{ satisfying } x \in \mathrm{cl}R_k\big \}. \end{aligned}$$

If \(\mu _\rho ({{\mathcal {R}}}) \le \rho \), then s can be taken to satisfy \(-\rho ' \le s(x) \le \max \{-\rho ', \min [\rho ', -f(x)]\}\) for any \(\rho '>\rho \) and \(x\in S\).

Proof

The first part of the proposition is a direct application of [33, Theorem 5.9]. The fact that s can be taken to satisfy \(-\rho ' \le s(x) \le \max \{-\rho ',\min [\rho ',-f(x)]\}\) for any \(\rho '>\rho \) follows from an examination of that theorem’s proof. \(\square \)

The quantity \(\mu _\rho ({{\mathcal {R}}})\) is the meshsize of \({{\mathcal {R}}}=\{R_1, \dots , R_K\}\) and, essentially, quantifies the size of the largest \(R_k\).

Theorem 4.3

(covering numbers) For every bounded subset F of \(({\text {usc-fcns}}(S), {\mathbb {d}})\), there exist \(c\ge 0\) and \({{\bar{\varepsilon }}}>0\) (both independent of n, the dimension of S) such that

$$\begin{aligned} \log N(F,\varepsilon ) \le \left( \frac{c}{\varepsilon }\right) ^n\left( \log \frac{1}{\varepsilon }\right) ^{n+1} \text{ for } \text{ all } \varepsilon \in (0, {{\bar{\varepsilon }}}]. \end{aligned}$$

Proof

Since F is bounded, there exists an \(r>0\) such that \(\mathop {\mathrm{dist}}\nolimits _\infty (0,\mathrm{hypo} \;f) \le r\) for all \(f\in F\). Let \(\gamma _1,\gamma _2,\gamma _3 > 0\) be such that \(\gamma _1+\gamma _2+\gamma _3 = 1\). Set \({{\bar{\varepsilon }}} \in (0,1)\) such that

$$\begin{aligned} \frac{2(r + 1)}{r}\left[ \log \frac{1}{\varepsilon } + \log \frac{1}{\gamma _1} + \frac{r}{2} + \log \left( r + 1\right) \right] - 1>\gamma _2 \varepsilon \text{ for } \text{ all } \varepsilon \in (0,{{\bar{\varepsilon }}}]. \end{aligned}$$
(4)

Fix \(\varepsilon \in (0,{{\bar{\varepsilon }}}]\) and define \(\rho \) to be the expression on the left-hand side of (4). We next construct a partition of S. Set \(\omega >1\) and

$$\begin{aligned} \nu = \left\lceil \frac{2\omega \rho }{\gamma _2 \varepsilon } \right\rceil , \end{aligned}$$

where \(\lceil a \rceil \) is the smallest integer no smaller than a. The partition is obtained by dividing the box \([-\omega \rho ,\omega \rho ]^n\subset {\mathbb {R}}^n\) into \(\nu ^n\) boxes of equal size and then intersecting with S. Let \(K = \nu ^n + 1\). Specifically, identifying each \(k\in \{1, 2, \dots , \nu ^n\}\) with a multi-index \((k_1, \dots , k_n)\in \{1, \dots , \nu \}^n\), set

$$\begin{aligned} R_k = \mathrm{int}\Big (S\cap \prod _{i=1}^n (l_i^k, u_i^k)\Big ), \text{ with } l_i^k = 2(k_i-1)\omega \rho /\nu - \omega \rho \text{ and } u_i^k = l_i^k + 2\omega \rho /\nu \end{aligned}$$

so that \(\cup _{k=1}^{K-1} \mathrm{cl}R_k = S\cap [-\omega \rho ,\omega \rho ]^n\). Also, \(R_K = \mathrm{int}( S\setminus [-\omega \rho ,\omega \rho ]^n)\). Again, we recall that the interior and closure are taken relative to \((S,\Vert \cdot -\cdot \Vert _\infty )\). We denote by \({{\mathcal {R}}}= \{R_1, \dots , R_K\}\) this partition. Clearly, \(\mu _{\rho }({{\mathcal {R}}}) = 2\omega \rho /\nu \). Next, we consider a discretization of parts of the range of functions and set

$$\begin{aligned} m = \left\lceil \frac{\omega \rho }{\gamma _3 \varepsilon } \right\rceil + 1. \end{aligned}$$

The points \(\sigma _j = -\omega \rho + 2(j-1)\omega \rho /(m-1)\), \(j=1, 2, \dots , m\), discretize the interval \([-\omega \rho , \omega \rho ]\). Let \(F_0\) be the collection of piecewise constant functions on \({{\mathcal {R}}}\) with values in \(\{\sigma _1, \dots , \sigma _m\}\) defined as follows. If \(f\in F_0\), then for every \(k\in \{1, \dots , K\}\) there exists a \(j_k\in \{1, \dots , m\}\) such that \(f(x) = \sigma _{j_k}\) for \(x\in R_k\) and \(f(x) = \mathop {\mathrm{lim}}\nolimits _{\delta \downarrow 0} \mathop {\mathrm{sup}}\nolimits _{y\in {\mathbb {B}}_\infty (x,\delta )} f(y)\) otherwise. By construction, every such f is usc. Obviously, \(F_0\) contains \(m^K\) functions. We now show that every \(f\in F\) has \({\mathbb {d}}(f,f_0) \le \varepsilon \) for some \(f_0\in F_0\).

Let \(f\in F\) be arbitrary. By Proposition 4.2 and the fact that \(\mu _\rho ({{\mathcal {R}}}) = 2\omega \rho /\nu \le \gamma _2\varepsilon < \rho \), there exists \(s\in {\text {e-spl}}({{\mathcal {R}}})\) such that

$$\begin{aligned} \hat{\mathbb {d}}_{\rho }(f,-s) \le \mu _{\rho }({{\mathcal {R}}}) \text{ and } -\omega \rho \le s(x) \le \max \{-\omega \rho , \min [\omega \rho ,-f(x)]\} \text{ for } x\in S. \end{aligned}$$

Since \(\mathop {\mathrm{dist}}\nolimits _\infty (0,\mathrm{hypo} \;f) \le r\), there exists \(x\in r{\mathbb {B}}_\infty \) such that \(f(x)\ge -r\). Consequently, \(-s(x) \ge \min \{\omega \rho , \max [-\omega \rho ,f(x)]\} \ge -r\). So we also have that \(\mathop {\mathrm{dist}}\nolimits _\infty (0,\mathrm{hypo}\;(-s)) \le r\).

Since \(\varepsilon ,\gamma _1 \le 1\),

$$\begin{aligned} \rho \ge \frac{2(r+1)}{r}\left[ \frac{r}{2} + \log (r+1) \right] - 1 = r + \frac{2(r+1)}{r}\log (r+1) \ge r. \end{aligned}$$

Thus, using the notation \({{\bar{\rho }}} = (\rho -r)/2\), Lemma 4.1 gives that

$$\begin{aligned}&{\mathbb {d}}(f,-s) \le \hat{\mathbb {d}}_{\rho }(f,-s) + e^{-{{\bar{\rho }}}}(r + {{\bar{\rho }}} + 1) \le \mu _{\rho }({{\mathcal {R}}}) + e^{-{{\bar{\rho }}}}(r + {{\bar{\rho }}} + 1) \\&\quad = 2\omega \rho /\nu + e^{-{{\bar{\rho }}}}(r + {{\bar{\rho }}} + 1). \end{aligned}$$

In view of [35, Theorem 3.17], there exists \(f_0 \in F_0\) such that \({\mathbb {d}}(-s,f_0) \le \omega \rho /(m-1)\) since we can select \(f_0\) such that \(|-s(x)-f_0(x)| \le \omega \rho /(m-1)\) for all \(x\in S\). The triangle inequality then gives that

$$\begin{aligned} {\mathbb {d}}(f,f_0) \le \omega \rho /(m-1) + 2\omega \rho /\nu + e^{-{{\bar{\rho }}}}(r + {{\bar{\rho }}} + 1). \end{aligned}$$
(5)

It remains to show that the right-hand side is no greater than \(\varepsilon \). We start with the last term in (5). By concavity of the log-function, we have that

$$\begin{aligned} \log \left( \frac{1}{2}(\rho +r)+1\right) \le \log \left( r+1\right) + \frac{\rho -r}{2r+ 2}. \end{aligned}$$

Consequently,

$$\begin{aligned}&\log \left[ e^{-{{\bar{\rho }}}}(r + {{\bar{\rho }}} + 1)\right] \\&\quad = \frac{1}{2}(r - \rho ) + \log \left( \frac{1}{2}(\rho +r)+1\right) \le \frac{1}{2}(r - \rho ) + \log \left( r+1\right) + \frac{\rho -r}{2r+ 2}\\&\quad = \frac{r}{2}-\frac{r(\rho +1)}{2(r+1)} + \log (r+1) = \log \gamma _1 \varepsilon , \end{aligned}$$

where the last equality follows from inserting the expression for \(\rho \). Thus, \(e^{-{{\bar{\rho }}}}(r + {{\bar{\rho }}} + 1)\le \gamma _1\varepsilon \). We next examine the second term on the right-hand side of (5). Inserting the expression for \(\nu \), we obtain that

$$\begin{aligned} \frac{2\omega \rho }{\nu } \le \gamma _2 \varepsilon . \end{aligned}$$

Finally, we consider the first term on the right-hand side of (5). In view of the definition of m, we have that

$$\begin{aligned} \frac{\omega \rho }{m-1} \le \gamma _3 \varepsilon . \end{aligned}$$

Thus, \({\mathbb {d}}(f,f_0) \le \varepsilon \) and we have established that \({\mathbb {d}}\)-balls with radius \(\varepsilon \) and centered at points in \(F_0\) cover F. The logarithm of the number of functions in \(F_0\) is \((\nu ^n + 1)\log m\). At this point, the order of the result is immediate. A possible expression for the constant c is obtained as follows. Let \(c_1 = 2(r+1)/r\) and

$$\begin{aligned} c_2 = \frac{2(r + 1)}{r}\left[ \log \frac{1}{\gamma _1} + \frac{r}{2} + \log \left( r + 1\right) \right] - 1. \end{aligned}$$

Thus, \(\rho = c_1\log \varepsilon ^{-1} + c_2\). Moreover, let \(c_3 = 2\omega /\gamma _2\) and \(c_4 = \omega /\gamma _3\). Using these expressions, we find that

$$\begin{aligned} (\nu ^n + 1)\log m\le & {} \left[ \left( c_1c_3 + \frac{c_2c_3+1}{\log {{\bar{\varepsilon }}}^{-1}} \right) ^n \left( \frac{1}{\varepsilon } \log \frac{1}{\varepsilon } \right) ^n +1 \right] \\&\log \left[ \left( c_1c_4 + \frac{c_2c_4+2}{\log {{\bar{\varepsilon }}}^{-1}} \right) \frac{1}{\varepsilon }\log \frac{1}{\varepsilon } \right] . \end{aligned}$$

Let

$$\begin{aligned} c_5 = c_1c_3 + \frac{c_2c_3+1}{\log {{\bar{\varepsilon }}}^{-1}} \text{ and } c_6 = c_1c_4 + \frac{c_2c_4+2}{\log {{\bar{\varepsilon }}}^{-1}}. \end{aligned}$$

We then find that

$$\begin{aligned} (\nu ^n + 1)\log m \le c_7^n \left( \frac{1}{\varepsilon } \log \frac{1}{\varepsilon } \right) ^n \left[ \log c_6 + \log \frac{1}{\varepsilon } + \log \log \frac{1}{\varepsilon } \right] , \end{aligned}$$

where \(c_7 = c_5 + \frac{1}{{{\bar{\varepsilon }}}^{-1}\log \bar{\varepsilon }^{-1}}\).

Using the fact that \(\log \log \varepsilon ^{-1}/\log \varepsilon ^{-1} \le e^{-1}\) for \(\varepsilon \in (0,1)\), we obtain

$$\begin{aligned} (\nu ^n + 1)\log m \le c_7^n\left[ \frac{\log c_6}{\log \bar{\varepsilon }^{-1}} + 1 + e^{-1} \right] \frac{1}{\varepsilon ^n} \left( \log \frac{1}{\varepsilon } \right) ^{n+1}, \end{aligned}$$
(6)

which gives a particular expression for c in the theorem statement. Since the choice of \({{\bar{\varepsilon }}}\) is independent of n, this c is independent of n. For example, with \({{\bar{\varepsilon }}} = 0.01\), \(\omega = 1.00000001\), and \(r = 3.22\), we obtain \(c_7 = 13.5\), and the term in brackets in (6) evaluates to 2.3. \(\square \)
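The counting at the heart of the proof is straightforward to reproduce numerically: for a given \(\varepsilon \), compute \(\rho \) (the left-hand side of (4)), the number of boxes \(\nu ^n\), the number of discretized range values m, and \(\log |F_0| = (\nu ^n+1)\log m\). In the sketch below, the parameters \(r\), \(\omega \), and \(\gamma _1,\gamma _2,\gamma _3\) are illustrative choices satisfying the stated constraints, not values fixed by the theorem.

```python
import math

# Sketch of the counting in the proof of Theorem 4.3.  The defaults for
# r, omega and the gammas are illustrative assumptions.
def log_cover_bound(eps, n, r=1.0, omega=1.5, g1=1/3, g2=1/3, g3=1/3):
    # rho is the left-hand side of (4)
    rho = (2*(r + 1)/r) * (math.log(1/eps) + math.log(1/g1)
                           + r/2 + math.log(r + 1)) - 1
    nu = math.ceil(2*omega*rho/(g2*eps))     # boxes per coordinate
    m = math.ceil(omega*rho/(g3*eps)) + 1    # discretized range values
    return (nu**n + 1) * math.log(m)         # log of the m^K functions in F_0
```

As \(\varepsilon \) decreases, the returned bound grows like \((1/\varepsilon )^n (\log (1/\varepsilon ))^{n+1}\), in agreement with the theorem.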

Although a comparison to the classical result of \(O(\varepsilon ^{-n})\) for Lipschitz continuous functions on bounded subsets, which goes back to [23] (see for example [44, Theorem 2.7.1]), is not entirely relevant due to the different settings, we note that our bound is only slightly worse (by a logarithmic term) for the larger families of usc functions. We do not require any bound on the variation of the functions and allow functions defined on all of \({\mathbb {R}}^n\), possibly extended real-valued. Still, the entropy integral \(\int _0^{{{\bar{\varepsilon }}}} \sqrt{\log N(F,\varepsilon )}\, d\varepsilon \) is finite only for \(n = 1\) and, therefore, these families are “large,” and increasingly so as n grows.

Theorem 4.4

(covering numbers; lower bound) For every \(n\in {\mathbb {N}}\), there exist a bounded subset \(F\subset {\text {usc-fcns}}({\mathbb {R}}^n)\) and corresponding \(c\ge 0\) and \(\bar{\varepsilon }>0\) (independent of n) such that

$$\begin{aligned} \log N(F,\varepsilon ) \ge \left( \frac{c}{\varepsilon }\right) ^n \log \frac{1}{\varepsilon } \text{ for } \text{ all } \varepsilon \in (0, {{\bar{\varepsilon }}}]. \end{aligned}$$

Proof

See the “Appendix”. \(\square \)

In comparison with the upper bound of Theorem 4.3, we see that the lower bound differs by a logarithmic factor only and therefore the upper bound is nearly sharp. The bounded set F in Theorem 4.4 does not have to be large. In fact, an examination of the proof reveals that F can be selected such that \({\mathbb {d}}(0,f)\le r\) for all \(f\in F\), with r only slightly above one. Here, 0 is the function in \({\text {usc-fcns}}({\mathbb {R}}^n)\) that is identically zero.

The proof of Theorem 4.4 constructs a collection of functions that are real-valued only on a grid of points in \([0,\rho ]^n\), with \(\rho >0\) and grid points spaced roughly \(\varepsilon \) apart. At each of these grid points, a function takes on a value from a set of discretized values between \(-\rho \) and 0, again spaced roughly \(\varepsilon \) apart. Outside these grid points, the functions are assigned \(-\infty \). The number of such functions is \((\rho /\varepsilon )^\nu \), where \(\nu = (\rho /\varepsilon )^n\). Thus, its logarithm is of the order \(O(\varepsilon ^{-n} \log \varepsilon ^{-1})\).
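The count behind the lower bound can be transcribed directly. The following sketch, with \(\rho /\varepsilon \) taken to be an integer for simplicity, computes the logarithm of the number of grid functions in the construction:

```python
import math

# Counting the functions in the lower-bound construction: a grid of
# nu = (rho/eps)^n points in [0, rho]^n, each assigned one of roughly
# rho/eps values in [-rho, 0], with -infinity elsewhere.  Assumes rho/eps
# is (close to) an integer.
def log_num_lower_bound_fcns(rho, eps, n):
    levels = round(rho/eps)           # values available at each grid point
    nu = round(rho/eps)**n            # number of grid points
    return nu * math.log(levels)      # log of levels**nu grid functions
```

For \(\rho = 1\), \(\varepsilon = 0.1\), and n = 1 this gives \(10\log 10\), and for n = 2 it gives \(100\log 10\), matching the order \(\varepsilon ^{-n}\log \varepsilon ^{-1}\).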

5 Applications to stochastic optimization and statistical estimation

Suppose that \((\Xi , {{\mathcal {A}}}, {\mathbb {P}})\) is a complete probability space, \(F\subset {\text {usc-fcns}}(S)\) is closed, and \(\psi :\Xi \times F\rightarrow {\overline{{\mathbb {R}}}}\) is a function with suitable properties as discussed below. We denote by boldface, for example \({{\varvec{\xi }}}\), random elements with values in \(\Xi \). Then, a function identification problem under uncertainty takes the form

$$\begin{aligned} \text{(FIP-U) }~~~~\min _{f\in F} {\mathbb {E}}[\psi ({{\varvec{\xi }}},f)] = \int \psi (\xi ,f) d{\mathbb {P}}(\xi ). \end{aligned}$$

Section 3 furnishes some instances of \(\psi \) in the context of probability density estimation and regression, see also below, but there are numerous other examples.

A sample average approximation of the problem leverages a sample \({{\varvec{\xi }}}^1, {{\varvec{\xi }}}^2, \dots \) of independent random elements, each with values in \(\Xi \) and distributed according to \({\mathbb {P}}\), and leads to the approximating problem

$$\begin{aligned} \text{(FIP-U) }^\nu ~~~~ \min _{f\in F} \frac{1}{\nu } \sum _{j=1}^\nu \psi ({{\varvec{\xi }}}^j, f). \end{aligned}$$

Under mild assumptions (see Proposition 5.1 below), minimizers of \(\text{(FIP-U) }^\nu \) tend to those of (FIP-U) almost surely. However, a main challenge in the justification of such an approach is to quantify the rate with which the error in solutions of \(\text{(FIP-U) }^\nu \) vanishes as \(\nu \) grows. Before stating the results that rely on the covering numbers of the previous section, we formalize the setting. The following definitions and facts are well known; see for example [31, Ch. 14].
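To make the sample average approximation concrete, the following sketch solves \(\text{(FIP-U) }^\nu \) over a finite candidate family. The least-squares criterion \(\psi (\xi ,f) = (f(x)-y)^2\), the linear candidate functions, and the data model are illustrative assumptions, not part of the theory above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Finite candidate family F: linear functions x -> a*x with slopes on a grid.
candidates = [lambda x, a=a: a*x for a in np.linspace(0.0, 2.0, 21)]

def saa_minimizer(sample, F):
    # minimize the sample average nu^{-1} sum_j psi(xi^j, f) over f in F
    def avg(f):
        return np.mean([(f(x) - y)**2 for x, y in sample])
    return min(range(len(F)), key=lambda i: avg(F[i]))

xs = np.random.default_rng(0).uniform(-1.0, 1.0, 200)
ys = 1.0*xs + 0.05*rng.normal(size=200)   # data generated with slope 1.0
best = saa_minimizer(list(zip(xs, ys)), candidates)
```

With this sample, the selected slope `0.1 * best` lands close to the data-generating slope 1.0, illustrating the almost-sure convergence of minimizers discussed next.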

We say that \(\psi :\Xi \times F\rightarrow {\overline{{\mathbb {R}}}}\) is a random lsc function if for all \(\xi \in \Xi \), \(\psi (\xi ,\cdot )\) is lsc as a function on the metric space \((F,{\mathbb {d}})\) and \(\psi \) is measurable with respect to the product sigma-algebra on \(\Xi \times F\). A random lsc function \(\psi :\Xi \times F\rightarrow {\overline{{\mathbb {R}}}}\) is locally inf-integrable if

$$\begin{aligned} \forall f\in F~\exists \rho>0 \text{ such } \text{ that } \int \mathop {\mathrm{inf} }\nolimits _{g\in F} \{\psi (\xi ,g)~|~{\mathbb {d}}(g,f)\le \rho \}~ d{\mathbb {P}}(\xi ) > -\infty . \end{aligned}$$

If \(\psi :\Xi \times F\rightarrow {\overline{{\mathbb {R}}}}\) is a locally inf-integrable random lsc function, then \(f\mapsto {\mathbb {E}}[\psi ({{\varvec{\xi }}},f)]\) is well-defined, always greater than \(-\infty \), and lsc.

Proposition 5.1

Suppose that \((\Xi , {{\mathcal {A}}}, {\mathbb {P}})\) is a complete probability space, \(F\subset {\text {usc-fcns}}(S)\) is closed, and \(\psi :\Xi \times F\rightarrow {\overline{{\mathbb {R}}}}\) is a locally inf-integrable random lsc function. If \({{\varvec{\xi }}}^1, {{\varvec{\xi }}}^2, \dots \) is a sequence of independent random elements, each with values in \(\Xi \) and distribution \({\mathbb {P}}\), and \(\{\varepsilon ^\nu \ge 0, \nu \in {\mathbb {N}}\}\rightarrow 0\), then, almost surely,

$$\begin{aligned} \mathop {\mathrm{OutLim}}\nolimits \left( \varepsilon ^\nu \text{- }\mathop {\mathrm{argmin}}\nolimits _{f\in F} \frac{1}{\nu }\sum _{j=1}^\nu \psi ({{\varvec{\xi }}}^j,f)\right) \subset \mathop {\mathrm{argmin}}\nolimits _{f\in F} {\mathbb {E}}[\psi ({{\varvec{\xi }}}^1,f)]. \end{aligned}$$

Moreover, if F is bounded, then almost surely

$$\begin{aligned} \lim \left( \mathop {\mathrm{inf} }\nolimits _{f\in F} \frac{1}{\nu }\sum _{j=1}^\nu \psi ({{\varvec{\xi }}}^j,f)\right) = \mathop {\mathrm{inf} }\nolimits _{f\in F} {\mathbb {E}}[\psi ({{\varvec{\xi }}}^1,f)] > -\infty . \end{aligned}$$

Proof

This is a consequence of a law of large numbers for lsc functions and epi-convergence; see for example Proposition 7.1 in [38]. \(\square \)

In the following, we assume that \(F\subset {\text {usc-fcns}}(S)\) is closed and bounded as it results in some simplifications. In particular, \((F,{\mathbb {d}})\) then becomes a compact metric space. The assumption is anyhow minor as it is often acceptable in applications to impose a lower bound on the functions in \({\text {usc-fcns}}(S)\) under consideration; see the remark in conjunction with (3).

The excess of a set \(F_1\subset F\) over a set \(F_2\subset F\) is given by

$$\begin{aligned} \mathop {\mathrm{exs}}\nolimits (F_1,F_2) = \mathop {\mathrm{sup}}\nolimits _{f\in F_1} \mathop {\mathrm{dist}}\nolimits (f, F_2) \text{ if } F_1,F_2 \text{ are } \text{ nonempty }, \end{aligned}$$

\(\mathop {\mathrm{exs}}\nolimits (F_1,F_2) = \infty \) if \(F_1\) nonempty and \(F_2\) empty, and \(\mathop {\mathrm{exs}}\nolimits (F_1,F_2) = 0\) otherwise. Here, \(\mathop {\mathrm{dist}}\nolimits (f, F_2) = \mathop {\mathrm{inf} }\nolimits _{g\in F_2} {\mathbb {d}}(f,g)\) is the usual point-to-set distance in \((F,{\mathbb {d}})\). The Pompeiu-Hausdorff distance is given by \({\mathbb {H}}(F_1,F_2) = \max \{\mathop {\mathrm{exs}}\nolimits (F_1,F_2), \mathop {\mathrm{exs}}\nolimits (F_2,F_1)\}\). Let the level sets of any \(\varphi :F\rightarrow {\overline{{\mathbb {R}}}}\) be denoted by

$$\begin{aligned} \mathop {{\mathrm{lev}}}\nolimits _{\le \delta } \varphi = \{f\in F~|~\varphi (f) \le \delta \}. \end{aligned}$$

If \(\psi :\Xi \times F\rightarrow {\overline{{\mathbb {R}}}}\) is a random lsc function and \(F_1,F_2\subset F\) are closed, then

$$\begin{aligned} \xi \mapsto \mathop {\mathrm{exs}}\nolimits \big (\varepsilon \text{- }\mathop {\mathrm{argmin}}\nolimits _{f\in F_1} \psi (\xi ,f), ~F_2\big ) \text{ and } \xi \mapsto \mathop {\mathrm{exs}}\nolimits \big (F_2, ~\mathop {{\mathrm{lev}}}\nolimits _{\le \delta } \psi (\xi ,\cdot )\big ) \end{aligned}$$

are random variables on \((\Xi , {{\mathcal {A}}}, {\mathbb {P}})\) for any \(\varepsilon \ge 0\) and \(\delta \in {\mathbb {R}}\).
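For finite families, the excess and the Pompeiu-Hausdorff distance above admit a direct transcription, given any metric d; the following sketch is for illustration only.

```python
# Excess of F1 over F2, and the Pompeiu-Hausdorff distance, for finite
# families under a user-supplied metric d, following the conventions above
# (excess is infinity if F1 is nonempty and F2 is empty, zero if F1 is empty).
def exs(F1, F2, d):
    if not F1:
        return 0.0
    if not F2:
        return float("inf")
    return max(min(d(f, g) for g in F2) for f in F1)

def pompeiu_hausdorff(F1, F2, d):
    return max(exs(F1, F2, d), exs(F2, F1, d))
```

For example, with d(a, b) = |a - b| on points of the real line, exs([0, 1], [0, 3], d) = 1 while exs([0, 3], [0, 1], d) = 2, so the Pompeiu-Hausdorff distance is 2; the asymmetry of the excess is what Theorem 5.3 below exploits.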

When considering sample average approximations with sample size \(\nu \), the relevant probability space is the \(\nu \)-fold product space formed by \((\Xi ,{{\mathcal {A}}}, {\mathbb {P}})\); the above definitions apply also for this probability space. The sample average function

$$\begin{aligned} \big ((\xi ^1, \dots , \xi ^\nu ),f\big )\mapsto \frac{1}{\nu }\sum _{j=1}^\nu \psi (\xi ^j,f) \end{aligned}$$

is then a random lsc function on the product probability space provided that \(\psi >-\infty \) is a random lsc function on \((\Xi ,{{\mathcal {A}}},{\mathbb {P}})\). Since this is the case below, the following probabilistic statements are meaningful. The measure on the product probability space is denoted by \(P^\nu \) and the sample space by \(\Xi ^\nu \).

We also need a quantitative result about differences between minimizers and related quantities. The following result improves on [33, Thms. 4.3 and 4.5]; see also [10] for related results in the convex setting. We let \({\mathbb {B}}(f,\rho ) = \{g\in {\text {usc-fcns}}(S)~|~{\mathbb {d}}(f,g)\le \rho \}\).

Proposition 5.2

For a closed and bounded \(F\subset {\text {usc-fcns}}(S)\), let \(F_1,F_2\subset F\) be nonempty and \(\varphi _1:F_1\rightarrow (-\infty ,\infty ]\) as well as \(\varphi _2:F_2\rightarrow (-\infty , \infty ]\) be lsc functions on the metric space \((F,{\mathbb {d}})\). Suppose that for some \(\tau ,\gamma \ge 0\),

$$\begin{aligned} {\mathbb {B}}(g,\gamma )\cap F_1\ne \emptyset \text{ and } \mathop {\mathrm{inf} }\nolimits _{f\in {\mathbb {B}}(g,\gamma )\cap F_1} \varphi _1(f) \le \varphi _2(g) + \tau ~\forall g\in F_2. \end{aligned}$$

Then, for any \(\delta \in {\mathbb {R}}\),

$$\begin{aligned} \mathop {\mathrm{exs}}\nolimits \big (\mathop {{\mathrm{lev}}}\nolimits _{\le \delta } \varphi _2, \mathop {{\mathrm{lev}}}\nolimits _{\le \delta + \tau } \varphi _1\big ) \le \gamma . \end{aligned}$$
(7)

If in addition

$$\begin{aligned} {\mathbb {B}}(g,\gamma )\cap F_2\ne \emptyset \text{ and } \mathop {\mathrm{inf} }\nolimits _{f\in {\mathbb {B}}(g,\gamma )\cap F_2} \varphi _2(f) \le \varphi _1(g) + \tau ~\forall g\in F_1, \end{aligned}$$

then for any \(\varepsilon \ge 0\),

$$\begin{aligned} \mathop {\mathrm{exs}}\nolimits \big (\varepsilon \text {-}\mathop {\mathrm{argmin}}\nolimits _{f\in F_1} \varphi _1(f), (\varepsilon +2\tau )\text {-}\mathop {\mathrm{argmin}}\nolimits _{f\in F_2} \varphi _2(f)\big ) \le \gamma . \end{aligned}$$
(8)

Proof

Let \(g\in \mathop {{\mathrm{lev}}}\nolimits _{\le \delta } \varphi _2\). Since \(\varphi _1\) is lsc and \({\mathbb {B}}(g,\gamma )\) is compact, there exists \(f^\star \in {\mathbb {B}}(g,\gamma )\cap F_1\) such that

$$\begin{aligned} \varphi _1(f^\star )=\mathop {\mathrm{inf} }\nolimits _{f\in {\mathbb {B}}(g,\gamma )\cap F_1} \varphi _1(f) \le \varphi _2(g) + \tau \le \delta +\tau . \end{aligned}$$

We have established that \(f^\star \in \mathop {{\mathrm{lev}}}\nolimits _{\le \delta + \tau } \varphi _1\). Thus, \(\mathop {\mathrm{dist}}\nolimits (g,\mathop {{\mathrm{lev}}}\nolimits _{\le \delta + \tau } \varphi _1)\le \gamma \) and (7) follows.

For (8), we note that there exists \(f^\star \in \mathop {\mathrm{argmin}}\nolimits _{f\in F_2} \varphi _2(f)\) because \(F_2\) is totally bounded. Thus,

$$\begin{aligned} \mathop {\mathrm{inf} }\nolimits _{f\in F_1} \varphi _1(f) \le \mathop {\mathrm{inf} }\nolimits _{f\in {\mathbb {B}}(f^\star ,\gamma )\cap F_1} \varphi _1(f) \le \varphi _2(f^\star ) + \tau = \mathop {\mathrm{inf} }\nolimits _{f\in F_2} \varphi _2(f) + \tau . \end{aligned}$$

Suppose that \(g\in \varepsilon \text {-}\mathop {\mathrm{argmin}}\nolimits _{f\in F_1} \varphi _1(f)\). Again, there exists \(f^{\star \star } \in {\mathbb {B}}(g,\gamma )\cap F_2\) such that

$$\begin{aligned} \varphi _2(f^{\star \star })= & {} \mathop {\mathrm{inf} }\nolimits _{f\in {\mathbb {B}}(g,\gamma )\cap F_2} \varphi _2(f)\le \varphi _1(g) + \tau \\\le & {} \mathop {\mathrm{inf} }\nolimits _{f\in F_1} \varphi _1(f) + \varepsilon + \tau \le \mathop {\mathrm{inf} }\nolimits _{f\in F_2} \varphi _2(f) + \varepsilon +2\tau . \end{aligned}$$

We have established that \(f^{\star \star } \in (\varepsilon +2\tau )\text{- }\mathop {\mathrm{argmin}}\nolimits _{f\in F_2} \varphi _2(f)\). Thus, \(\mathop {\mathrm{dist}}\nolimits (g,(\varepsilon +2\tau )\text{- }\mathop {\mathrm{argmin}}\nolimits _{f\in F_2} \varphi _2(f))\le \gamma \) and (8) follows. \(\square \)

We observe that if \(\varphi _1\) and \(\varphi _2\) in the proposition are pointwise within \(\delta \) of each other uniformly on F, then \(\tau \) can be set to \(\delta \) and \(\gamma \) to zero. However, the focus on uniform bounds is limiting as it rules out discontinuous functions and especially cases with \(F_1 \ne F_2\).

5.1 Confidence regions

We are then in a position to state the first of the two main results in this section.

Theorem 5.3

(confidence region) For a complete probability space \((\Xi ,{{\mathcal {A}}},{\mathbb {P}})\) and a closed and bounded set \(F\subset {\text {usc-fcns}}(S)\), suppose that \(\psi :\Xi \times F\rightarrow (-\infty ,\infty ]\) is a locally inf-integrable random lsc function, \({{\varvec{\xi }}}^1, {{\varvec{\xi }}}^2, \dots \) are independent random elements, each with values in \(\Xi \) and distributed according to \({\mathbb {P}}\), and \(\psi ({{\varvec{\xi }}}^1,f)\) is sub-exponential for all \(f\in F\). Given \(\alpha \in (0,1)\) and \(\delta > \mathop {\mathrm{inf} }\nolimits _{f\in F} {\mathbb {E}}[\psi ({{\varvec{\xi }}}^1,f)]\), there exist \({{\bar{\nu }}}\in {\mathbb {N}}\) and \(c\in [0,\infty )\) such that for all \(\nu \ge {{\bar{\nu }}}\)

$$\begin{aligned}&P^\nu \bigg [\mathop {\mathrm{exs}}\nolimits \Big (\mathop {\mathrm{argmin}}\nolimits _{f\in F} {\mathbb {E}}[\psi ({{\varvec{\xi }}}^1,f)], ~\mathop {{\mathrm{lev}}}\nolimits _{\le \delta } \Big \{\frac{1}{\nu }\sum _{j=1}^\nu \psi ({{\varvec{\xi }}}^j, \cdot )\Big \}\Big ) \le \frac{c(\log \nu )^{1+1/n}}{\nu ^{1/n}}\bigg ]\\&\quad \ge 1 - \alpha . \end{aligned}$$

Proof

Let \(\varphi :F\rightarrow {\mathbb {R}}\) have values \(\varphi (f) = {\mathbb {E}}[\psi ({{\varvec{\xi }}}^1,f)]\), which is well-defined, lsc, and indeed finite-valued due to the sub-exponential assumption. Let \(\varphi ^\nu :\Xi ^\nu \times F\rightarrow (-\infty , \infty ]\) have values \(\varphi ^\nu ((\xi ^1, \dots , \xi ^\nu ), f) = \nu ^{-1}\sum _{j=1}^\nu \psi (\xi ^j,f)\), which then is a random lsc function on the product probability space. At any given \(f\in F\), with probability one, \(\varphi ^\nu (({{\varvec{\xi }}}^1, \dots , {{\varvec{\xi }}}^\nu ), f) < \infty \) because otherwise \({\mathbb {E}}[\psi ({{\varvec{\xi }}}^1,f)]\) would not be finite.

As in Theorem 4.3, there is a finite number \(N = N(F,\gamma _1)\) of closed balls in \(({\text {usc-fcns}}(S),{\mathbb {d}})\) with radius \(\gamma _1>0\) and centers \(f_k\in {\text {usc-fcns}}(S)\) covering F. Without loss of generality, we can assume that \({\mathbb {B}}(f_k,\gamma _1)\cap F \ne \emptyset \). Moreover, let \(f_k^\star \in \mathop {\mathrm{argmin}}\nolimits _{f\in {\mathbb {B}}(f_k,\gamma _1)\cap F} \varphi (f)\). Since \(\psi ({{\varvec{\xi }}}^1,f_k^\star )\) is sub-exponential, there exist, by Bernstein's inequality, \(\gamma _2 \in (0, \delta - \mathop {\mathrm{inf} }\nolimits _{f\in F} \varphi (f))\) and \(c_0>0\) such that

$$\begin{aligned} P^\nu \big (|\varphi ^\nu (({{\varvec{\xi }}}^1, \dots , {{\varvec{\xi }}}^\nu ), f_k^\star ) - \varphi (f_k^\star )| \ge \gamma _2\big ) \le 2e^{-\nu c_0\gamma _2^2} \text{ for } \text{ all } k=1, \dots , N. \end{aligned}$$

Consequently, as long as

$$\begin{aligned} 2Ne^{-\nu c_0\gamma _2^2} \le \alpha ~ \text{ or, } \text{ equivalently, } ~\nu \ge \frac{\log N - \log (\alpha /2)}{c_0 \gamma _2^2} \end{aligned}$$
(9)

we have that

$$\begin{aligned} P^\nu \bigg (\max _{k=1, \dots ,N} \Big |\varphi ^\nu (({{\varvec{\xi }}}^1,\dots , {{\varvec{\xi }}}^\nu ), f_k^\star ) - \varphi (f_k^\star )\Big | \ge \gamma _2\bigg ) \le \alpha . \end{aligned}$$

Suppose that we have an event \((\xi ^1, \dots , \xi ^\nu )\in \Xi ^\nu \) where

$$\begin{aligned} \max _{k=1, \dots ,N} \Big |\varphi ^\nu ((\xi ^1,\dots , \xi ^\nu ), f_k^\star ) - \varphi (f_k^\star )\Big | < \gamma _2. \end{aligned}$$

Next, we apply Proposition 5.2 and start by establishing the required condition. Let \(g\in F\) and \(\delta _0 = \mathop {\mathrm{inf} }\nolimits _{f\in F} \varphi (f)\), which is finite because F is compact. Then, there exists \(k^\star \in \{1, \dots , N\}\) such that \(g\in {\mathbb {B}}(f_{k^\star },\gamma _1)\) and

$$\begin{aligned}&\mathop {\mathrm{inf} }\nolimits _{f\in {\mathbb {B}}(g,2\gamma _1)\cap F} \varphi ^\nu ((\xi ^1, \dots , \xi ^\nu ),f) \le \varphi ^\nu ((\xi ^1, \dots , \xi ^\nu ),f_{k^\star }^\star ) \\&\quad \le \varphi (f_{k^\star }^\star ) + \gamma _2 \le \varphi (g) + \delta - \delta _0. \end{aligned}$$

Thus, the first condition in Proposition 5.2 holds with \(\gamma = 2\gamma _1\) and \(\tau = \delta - \delta _0\), and

$$\begin{aligned} \mathop {\mathrm{exs}}\nolimits \big (\mathop {{\mathrm{lev}}}\nolimits _{\le \delta _0} \varphi , \mathop {{\mathrm{lev}}}\nolimits _{\le \delta _0 + \tau } \varphi ^\nu ((\xi ^1, \dots , \xi ^\nu ),\cdot )\big ) \le 2\gamma _1. \end{aligned}$$

Equivalently,

$$\begin{aligned} \mathop {\mathrm{exs}}\nolimits \big (\mathop {\mathrm{argmin}}\nolimits _{f\in F} \varphi (f), \mathop {{\mathrm{lev}}}\nolimits _{\le \delta } \varphi ^\nu ((\xi ^1, \dots , \xi ^\nu ),\cdot )\big ) \le 2\gamma _1. \end{aligned}$$

By Theorem 4.3, there exists \({{\bar{\varepsilon }}}>0\) such that \(\log N\) is bounded from above by a term proportional to \(\gamma _1^{-n}(\log \gamma _1^{-1})^{n+1}\) for all \(\gamma _1\in (0,{{\bar{\varepsilon }}}]\). Thus, there is a constant \(c_1>0\) such that

$$\begin{aligned} \frac{\log N - \log (\alpha /2)}{c_0 \gamma _2^2} \le c_1 \gamma _1^{-n}(\log \gamma _1^{-1})^{n+1} \text{ for } \gamma _1 \in (0,{{\bar{\varepsilon }}}]. \end{aligned}$$
(10)

In view of (9), the right-hand side of (10) provides the rate of increase in sample size that is needed to guarantee an excess of at most \(2\gamma _1\) with confidence level \(1-\alpha \). Inverting the expression, we find that \(\gamma _1\) can be proportional to \(\nu ^{-1/n} (\log \nu )^{1+1/n}\) as long as \(\nu \) is sufficiently large, which establishes the conclusion. \(\square \)
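Condition (9) in the proof translates into an explicit minimal sample size, \(\nu \ge (\log N - \log (\alpha /2))/(c_0\gamma _2^2)\). The sketch below evaluates this requirement; N, \(c_0\), and \(\gamma _2\) are problem-dependent constants from the proof, and the values used in the illustration are purely hypothetical.

```python
import math

# Minimal sample size implied by condition (9) in the proof of Theorem 5.3.
# N (covering number), c0 (Bernstein constant) and gamma2 are
# problem-dependent; the example values below are illustrative assumptions.
def required_sample_size(N, alpha, c0, gamma2):
    return math.ceil((math.log(N) - math.log(alpha/2))/(c0*gamma2**2))
```

For instance, N = 10^6 candidate centers with alpha = 0.05, c0 = 0.5, and gamma2 = 0.1 give a required sample size of 3501; note that the dependence on N is only logarithmic, which is why the exponential growth of the covering numbers in n translates into the polynomial rate of Theorem 5.3.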

When \(f^\star \in \mathop {\mathrm{argmin}}\nolimits _{f\in F} {\mathbb {E}}[\psi ({{\varvec{\xi }}}^1,f)]\), the theorem guarantees that with probability \(1-\alpha \)

$$\begin{aligned} \mathop {\mathrm{dist}}\nolimits \Big (f^\star , \mathop {{\mathrm{lev}}}\nolimits _{\le \delta } \Big \{\frac{1}{\nu }\sum _{j=1}^\nu \psi ({{\varvec{\xi }}}^j, \cdot )\Big \}\Big ) \le \frac{c (\log \nu )^{1+1/n}}{\nu ^{1/n}} \end{aligned}$$

for sufficiently large \(\nu \). Hence, the minimizer \(f^\star \) of (FIP-U) is covered by the given level set when appropriately enlarged by a quantity that vanishes with increasing sample size at nearly the rate \(\nu ^{-1/n}\). The confidence region is not given in terms of minimizers of the approximating problem (FIP-U)\(^\nu \), but rather in terms of certain level sets. Membership in such a level set is trivially assessed, does not require solving the approximating problem, and can be used to rule out the optimality of a candidate f. In general, minimizers of (FIP-U)\(^\nu \) are not well behaved and depend on the conditioning of (FIP-U) as discussed in [33]. Theorem 5.3 bypasses this issue by considering level sets. Other strengths of Theorem 5.3 are its mild assumptions on the (random) objective function \(\psi \) and the wide range of constraints permitted; the family F can be any closed and bounded set in \(({\text {usc-fcns}}(S),{\mathbb {d}})\). The function \(f\mapsto \psi (\xi ,f)\) is only required to be lsc. The assumption of a sub-exponential distribution of \(\psi ({{\varvec{\xi }}}^1,f)\) can be checked pointwise for each \(f\in F\). Actually, this assumption can be relaxed because the proof of Theorem 5.3 only requires that sample averages are sufficiently low relative to the actual expectations, but this merely improves c in the theorem and we omit the refinement.

The practical construction of confidence regions is hampered by the unknown and hard-to-estimate constants c and \({{\bar{\nu }}}\) in Theorem 5.3. In practice, coverage may therefore only be guaranteed asymptotically. The other unknown parameter \(\delta \) is easy to estimate conservatively because for any \(f\in F\), the sample average \(\frac{1}{\nu }\sum _{j=1}^\nu \psi ({{\varvec{\xi }}}^j,f)\), computed from an independent sample, furnishes an estimator of \({\mathbb {E}}[\psi ({{\varvec{\xi }}}^1,f)]\), which in turn is an upper bound on \(\mathop {\mathrm{inf} }\nolimits _{f\in F} {\mathbb {E}}[\psi ({{\varvec{\xi }}}^1,f)]\). An effort to select a low \(\delta \) would obviously result in a smaller level set, but typically also in larger c and \({{\bar{\nu }}}\).
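Since membership in the level set amounts to comparing a sample average with \(\delta \), the test is cheap to carry out. A minimal sketch, with a hypothetical scalar integrand \(\psi \) standing in for a genuine random lsc function and \(\delta \) estimated conservatively from an independent sample as described above:

```python
import random


def in_confidence_level_set(psi_values, delta):
    """Membership in lev_{<= delta} of the sample-average objective:
    only requires averaging psi(xi^j, f) over the sample."""
    return sum(psi_values) / len(psi_values) <= delta


# Hypothetical toy integrand: psi(xi, f) = (xi - f)^2 with scalar "f".
def psi(xi, f):
    return (xi - f) ** 2


random.seed(0)
sample = [random.gauss(1.0, 0.5) for _ in range(2000)]
holdout = [random.gauss(1.0, 0.5) for _ in range(2000)]

# Conservative delta: sample average at some (suboptimal) feasible f = 1.5,
# an upper bound on the minimal expected value.
delta = sum(psi(xi, 1.5) for xi in holdout) / len(holdout)

good = in_confidence_level_set([psi(xi, 1.0) for xi in sample], delta)
bad = in_confidence_level_set([psi(xi, 3.0) for xi in sample], delta)
```

A candidate failing the test is ruled out as a minimizer; passing it places the candidate in the confidence region, without ever solving the approximating problem.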

The effect of n on the rate of convergence is profound and in line with the growth of the covering numbers as n increases. It highlights, for example, the fundamental challenge associated with high-dimensional nonparametric estimation, which is already well documented (see [1, 22]). On a positive note, if \(n=1\), which already captures many interesting applications [37], then the convergence rate is nearly \(\nu ^{-1}\) and therefore faster than the canonical \(\nu ^{-1/2}\) rate. If F is restricted to some finite-dimensional subset of \({\text {usc-fcns}}(S)\), then the covering numbers from Theorem 4.3 can be replaced by much improved ones, typically of order \(O(\varepsilon ^{-1})\) so that their logarithm is of order \(O(\log \varepsilon ^{-1})\) and the rate improves from essentially \(\nu ^{-1/n}\) to \(e^{-\nu }\) in Theorem 5.3.
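To make the dependence on n concrete, the following sketch (with the constant c suppressed) tabulates the enlargement radius \((\log \nu )^{1+1/n}/\nu ^{1/n}\) from Theorem 5.3 for a few dimensions:

```python
import math


def radius(nu, n):
    # Enlargement radius (up to the constant c) from Theorem 5.3.
    return math.log(nu) ** (1 + 1 / n) / nu ** (1 / n)


# The radius degrades rapidly as the dimension n grows.
for n in (1, 2, 5, 10):
    print(f"n={n:2d}: radius at nu=10^6 is {radius(1e6, n):.4g}")
```

For \(\nu = 10^6\) the radius spans several orders of magnitude between \(n=1\) and \(n=10\), a numerical face of the curse of dimensionality discussed above.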

We illustrate the application of Theorem 5.3 on stochastic optimization problems arising in nonparametric statistics.

Example 1

Maximum likelihood estimation of probability densities. Suppose that we would like to estimate an unknown probability density function \(f^0\in {\text {usc-fcns}}(S)\). Since we permit densities to have value zero on a subset of S, there is no requirement that the support of \(f^0\) is known; S just needs to contain the support. Given a sample \({{\varvec{\xi }}}^1, \dots , {{\varvec{\xi }}}^\nu \), which in this case takes values in S, i.e., \(\Xi = S\), a maximum likelihood estimator of \(f^0\) over a class \(F\subset {\text {usc-fcns}}(S)\) is any minimizer of

$$\begin{aligned} \mathop {\mathrm{min}}\nolimits _{f\in F} -\frac{1}{\nu }\sum _{j=1}^\nu \log f({{\varvec{\xi }}}^j) \end{aligned}$$

and, in the notation above, \(\psi (\xi ,f) = -\log f(\xi )\). The function \((\xi ,f)\mapsto -\log f(\xi )\) is a random lsc function on the probability space \((S,{{\mathcal {B}}},P)\), where P is the probability distribution with density \(f^0\) and \({{\mathcal {B}}}\) contains the Borel sets of \((S,\Vert \cdot -\cdot \Vert _\infty )\) supplemented with the necessary probability-zero sets to make the probability space complete. This fact is easily realized because the function is actually lsc jointly in its arguments; see [38] for details.

In this case, \(\psi ({{\varvec{\xi }}}^1,f)\) being sub-exponential amounts to having F consist of sub-exponential densities. The requirement about inf-integrability is extensively discussed in [38]. For example, suppose F is a nonempty closed subset of

$$\begin{aligned} \Bigg \{f\in {\text {usc-fcns}}(S)~\Bigg | ~\int f(x) dx = 1, ~\int x f(x) dx \in C, ~u(x) \le f(x) \le v(x), ~\forall x\in S\Bigg \}, \end{aligned}$$

where \(C\subset {\mathbb {R}}^n\) is closed and \(u,v:S\rightarrow (0,\infty )\), with \(v\in {\text {usc-fcns}}(S)\). Moreover, suppose that the actual density \(f^0\in F\) and for some \(\gamma _1\ge 0\), \(\gamma _2>0\), and \(\zeta _1,\zeta _2\in {\mathbb {R}}\),

$$\begin{aligned} u(x) \ge e^{-\gamma _1 \Vert x\Vert _\infty + \zeta _1} \text{ and } v(x) \le e^{-\gamma _2 \Vert x\Vert _\infty + \zeta _2}. \end{aligned}$$

All the assumptions of Theorem 5.3 are then satisfied. The requirement that F is bounded is automatically satisfied because \(f\ge 0\) for all \(f\in F\).

In this example the maximum likelihood estimator finds the best estimate that satisfies the given pointwise bounds and moment restriction. There is no requirement that the actual density or its estimate should be smooth or even continuous. Of course, a large variety of other constraints can be brought in too; see [38] for some possibilities.
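As a toy numerical illustration (not the estimator over the full usc class), one may replace F by a finite grid of exponential densities, a hypothetical family chosen purely for tractability, and minimize the sample-average negative log-likelihood by enumeration:

```python
import math
import random

random.seed(1)
nu = 5000
true_rate = 1.0
sample = [random.expovariate(true_rate) for _ in range(nu)]

# Finite stand-in for F: exponential densities f_lam(x) = lam * exp(-lam * x).
candidates = [0.5, 0.75, 1.0, 1.25, 1.5]


def neg_log_likelihood(lam):
    # Sample average of psi(xi, f) = -log f(xi) for f = f_lam.
    return sum(-math.log(lam) + lam * xi for xi in sample) / nu


best = min(candidates, key=neg_log_likelihood)
```

With this sample size, the empirical minimizer over the grid recovers the true rate parameter; in the paper's setting the enumeration over a grid is replaced by optimization over a bounded closed subset of \({\text {usc-fcns}}(S)\).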

We recall that a subset \(F_0 \subset {\text {usc-fcns}}(S)\) is equi-usc [31, Sect. 7.B] if there exists \(\delta :S\times (0,\infty ) \times (0,\infty ) \rightarrow (0,\infty )\) such that for any \(\varepsilon ,\rho >0\), \({\bar{x}}\in S\), and \(f\in F_0\),

$$\begin{aligned} \mathop {\mathrm{sup}}\nolimits _{x\in {\mathbb {B}}_\infty ({\bar{x}},\delta ({\bar{x}},\varepsilon ,\rho ))} f(x) \le \max \{f({\bar{x}}) + \varepsilon , -\rho \}. \end{aligned}$$

If \(F_0\) is a singleton, then the condition reduces to that of usc. If \(F_0\) contains only Lipschitz continuous functions, or only piecewise Lipschitz continuous functions, or only finite-valued concave functions on \({\mathbb {R}}^n\), to mention some examples, then \(F_0\) is equi-usc.

Example 2

Least-Squares Regression. Suppose that we are given the random design model

$$\begin{aligned} \mathbf{y }^j = f^0(\mathbf{x }^j) + \mathbf{z }^j, ~~j=1, 2, \dots , \nu \end{aligned}$$

where \(\mathbf{x }^1, \mathbf{x }^2, \dots , \mathbf{x }^\nu \) are independent and identically distributed n-dimensional random vectors that take values in a closed set \(S\subset {\mathbb {R}}^n\), \(\mathbf{z }^1, \mathbf{z }^2, \dots , \mathbf{z }^\nu \) are zero-mean random variables that are also independent of \(\mathbf{x }^1, \mathbf{x }^2, \dots , \mathbf{x }^\nu \), and \(f^0:S\rightarrow {\mathbb {R}}\) is an unknown function to be estimated based on observations of \({{\varvec{\xi }}}^j = (\mathbf{x }^j,\mathbf{y }^j)\), \(j=1, \dots , \nu \). In this case, \(\Xi = S \times {\mathbb {R}}\) and, again, we adopt a sigma-algebra that contains the Borel sets on \(\Xi \) and that results in a complete probability space under the distribution of \((\mathbf{x }^1,\mathbf{y }^1)\). The least-squares estimator of \(f^0\) over the class \(F\subset {\text {usc-fcns}}(S)\) is then any minimizer of

$$\begin{aligned} \mathop {\mathrm{min}}\nolimits _{f\in F} \frac{1}{\nu }\sum _{j=1}^\nu \big (\mathbf{y }^j - f(\mathbf{x }^j)\big )^2. \end{aligned}$$

Resulting estimates furnish approximations of \(f^0\) that in an engineering design context can be maximized to find an optimal design without any (additional) costly simulation of system performance. The only simulations required are those needed to generate a data set \(\{(x^j,y^j), j=1, \ldots , \nu \}\).

In this case, \(\psi ((x,y),f) = (y - f(x))^2\). Since \((x,f)\mapsto f(x)\) is usc and thus measurable, we also have that \(\psi \) is measurable. Consequently, \(\psi \) is a random lsc function provided that F is equi-usc, an assumption that provides the necessary pointwise convergence (cf. [31, Thm. 7.10]). Its nonnegativity ensures that \(\psi \) is also locally inf-integrable.

A confidence region for \(f^0\in F\) emerges from Theorem 5.3 when \((\mathbf{y }^1 - f(\mathbf{x }^1))^2\) is sub-exponential for all \(f\in F\). For example, this will be the case when \(\mathbf{z }^1\) and every component of \(\mathbf{x }^1\) are sub-Gaussian, and for some \(\gamma ,\zeta \in {\mathbb {R}}\),

$$\begin{aligned} f\in F \Longrightarrow |f(x)| \le \gamma \Vert x\Vert _\infty + \zeta , ~\forall x\in S. \end{aligned}$$

Since \(f^0\) must be a minimizer of \(\mathop {\mathrm{min}}\nolimits _{f\in F} {\mathbb {E}}[(\mathbf{y }^1 - f(\mathbf{x }^1))^2]\), provided that \(f^0\in F\), Theorem 5.3 guarantees that \(f^0\) is covered by the stipulated level set when appropriately enlarged.
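A small numerical sketch of this setting, with a hypothetical discontinuous (but usc) regression function and a finite grid of step-function candidates standing in for F:

```python
import random

random.seed(2)
nu = 4000


def f0(x):
    # Discontinuous but usc regression function: no smoothness is required.
    return 1.0 if x < 0.5 else 2.0


data = []
for _ in range(nu):
    x = random.random()
    data.append((x, f0(x) + random.gauss(0.0, 0.1)))


def empirical_loss(t):
    # Sample average of psi((x, y), f) = (y - f(x))^2 for the step
    # candidate f(x) = 1 on [0, t) and f(x) = 2 on [t, 1].
    return sum((y - (1.0 if x < t else 2.0)) ** 2 for x, y in data) / nu


# The empirical minimizer over the grid recovers the true jump location.
best_t = min([0.3, 0.4, 0.5, 0.6, 0.7], key=empirical_loss)
```

The grid of jump locations is, of course, a hypothetical stand-in for the usc classes treated by the theory, but it shows that the least-squares criterion has no difficulty with a discontinuous target.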

5.2 Rates of convergence under Hölder condition

Theorem 5.3 does not rule out the possibility that the limit of the given level sets strictly contains \(\mathop {\mathrm{argmin}}\nolimits _{f\in F} {\mathbb {E}}[\psi ({{\varvec{\xi }}}^1,f)]\). In fact, this cannot be ruled out unless additional assumptions are brought in; [33] contains a discussion. Still, a Hölder condition enables us to “reverse” Theorem 5.3 and quantify the rate of convergence of the excess of minimizers of (FIP-U)\(^\nu \) over those of (FIP-U). Since it is relatively straightforward, we also address approximating constraints. Although the approximating constraints can be rather general, the rate of convergence in the following theorem depends on the rate with which the approximating feasible set approaches the actual one. Thus, it is not immediately clear how the piecewise affine functions discussed in Sect. 3, which have unknown rate of convergence, can be used for constructing these approximations.

As usual, we let \(\alpha ^\nu = o(r^\nu )\) mean that for every \(\delta >0\) there exists \({{\bar{\nu }}}\) such that \(\alpha ^\nu \le \delta r^\nu \) for all \(\nu \ge {{\bar{\nu }}}\).

Theorem 5.4

(rate of convergence) For a complete probability space \((\Xi ,{{\mathcal {A}}},{\mathbb {P}})\) and closed and bounded sets \(F^\nu ,F^0\subset F\subset {\text {usc-fcns}}(S)\), suppose that \(\psi :\Xi \times F\rightarrow (-\infty ,\infty ]\) is a random lsc function for which there exist \(p\in (0,\infty )\) and an integrable random variable \(\kappa :\Xi \rightarrow [0,\infty )\) such that

$$\begin{aligned} |\psi (\xi ,f) - \psi (\xi ,g)| \le \kappa (\xi ) [ {\mathbb {d}}(f,g) ]^p \text{ for } \text{ all } f,g\in F \text{ and } \xi \in \Xi . \end{aligned}$$

Suppose also that \({{\varvec{\xi }}}^1, {{\varvec{\xi }}}^2, \dots \) are independent random elements, each with values in \(\Xi \) and distributed according to \({\mathbb {P}}\), and \(\psi ({{\varvec{\xi }}}^1,f)\) is sub-exponential for all \(f\in F\). Let

$$\begin{aligned} r^\nu = \nu ^{\frac{-1}{2+n/p}}(\log \nu )^{\frac{1+n}{2+n/p}}. \end{aligned}$$

If \({\mathbb {H}}(F^\nu ,F^0) = o(\min \{r^\nu ,(r^\nu )^{1/p}\})\) and \(\alpha \in (0,1)\), then there exist \(c\in [0,\infty )\) and \(\bar{\nu }\in {\mathbb {N}}\) such that for \(\nu \ge {{\bar{\nu }}}\) and \(\varepsilon ^\nu \ge 0\),

$$\begin{aligned}&P^\nu \bigg [\mathop {\mathrm{exs}}\nolimits \Big (\varepsilon ^\nu \text{- }\mathop {\mathrm{argmin}}\nolimits _{f\in F^\nu } \frac{1}{\nu }\sum _{j=1}^\nu \psi ({{\varvec{\xi }}}^j, f), ~~(\varepsilon ^\nu +c r^\nu )\text{- }\mathop {\mathrm{argmin}}\nolimits _{f\in F^0} {\mathbb {E}}[\psi ({{\varvec{\xi }}}^1,f)]\Big )\\&\quad \le {\mathbb {H}}(F^\nu ,F^0)\bigg ] \ge 1 - \alpha . \end{aligned}$$

Proof

Let \(\zeta >0\). Since \(\kappa ({{\varvec{\xi }}}^1)\) is integrable, there exists \({{\bar{\nu }}}_0\in {\mathbb {N}}\) such that \(P^\nu (|\nu ^{-1}\sum _{j=1}^\nu \kappa ({{\varvec{\xi }}}^j) - {\mathbb {E}}[\kappa ({{\varvec{\xi }}}^1)]| \ge \zeta ) \le \alpha /2\) for all \(\nu \ge {{\bar{\nu }}}_0\). Let \(\gamma _1>0\). As in Theorem 4.3, there is a finite number \(N = N(F,\gamma _1/2)\) of closed balls in \(({\text {usc-fcns}}(S),{\mathbb {d}})\) with radius \(\gamma _1/2\) and centers \(f_k'\) covering F. To make sure that the balls are centered at points in F, we can always select other centers \(f_k\in F\) and balls with radius \(\gamma _1\) and still cover F.

Let \(\varphi \) and \(\varphi ^\nu \) be as defined in the proof of Theorem 5.3. We note that \(\psi \) is locally inf-integrable due to the Hölder condition and the pointwise sub-exponential property. Since \(\psi ({{\varvec{\xi }}}^1,f_k)\) is sub-exponential, there exist by Bernstein’s inequality \({{\bar{\gamma }}}_2>0\) and \(c_0>0\) such that for \(\gamma _2 \in [0, {{\bar{\gamma }}}_2]\),

$$\begin{aligned} P^\nu \big (|\varphi ^\nu (({{\varvec{\xi }}}^1, \dots , {{\varvec{\xi }}}^\nu ), f_k) - \varphi (f_k)| \ge \gamma _2\big ) \le 2e^{-\nu c_0\gamma _2^2} \text{ for } \text{ all } k=1, \dots , N. \end{aligned}$$

Consequently, as long as \(\nu \ge {{\bar{\nu }}}_0\) and \(2Ne^{-\nu c_0\gamma _2^2} \le \alpha /2\), or, equivalently,

$$\begin{aligned} \nu \ge \max \Big \{{{\bar{\nu }}}_0, \frac{\log N - \log (\alpha /4)}{c_0 \gamma _2^2}\Big \} \end{aligned}$$

we have that

$$\begin{aligned}&P^\nu \bigg (\max _{k=1, \dots ,N} \Big |\varphi ^\nu (({{\varvec{\xi }}}^1,\dots , {{\varvec{\xi }}}^\nu ), f_k) - \varphi (f_k)\Big | \ge \gamma _2 ~ \text{ or } ~\Big |\frac{1}{\nu }\sum _{j=1}^\nu \kappa ({{\varvec{\xi }}}^j) - {\mathbb {E}}[\kappa ({{\varvec{\xi }}}^1)]\Big | \ge \zeta \bigg )\\&\quad \le \alpha . \end{aligned}$$

Suppose that we have an event \((\xi ^1, \dots , \xi ^\nu )\in \Xi ^\nu \) where

$$\begin{aligned} \max _{k=1, \dots ,N} \Big |\varphi ^\nu ((\xi ^1,\dots , \xi ^\nu ), f_k) - \varphi (f_k)\Big |< \gamma _2 ~ \text{ and } ~ \Big |\frac{1}{\nu }\sum _{j=1}^\nu \kappa (\xi ^j) - {\mathbb {E}}[\kappa ({{\varvec{\xi }}}^1)]\Big | < \zeta . \end{aligned}$$

Next, we apply Proposition 5.2 for the lsc functions \({{\bar{\varphi }}}:F^0\rightarrow {\mathbb {R}}\) given by \({{\bar{\varphi }}}(f) = \varphi (f)\) and \({{\bar{\varphi }}}^\nu :F^\nu \rightarrow {\mathbb {R}}\) given by \({{\bar{\varphi }}}^\nu (f) = \varphi ^\nu ((\xi ^1, \dots , \xi ^\nu ),f)\). In view of the Hölder assumption on \(\psi \), \({{\bar{\varphi }}}^\nu \) is finite-valued. Moreover, for all \(f,g\in F\),

$$\begin{aligned} \big |\varphi (f) - \varphi (g)\big | \le {\mathbb {E}}[\kappa ({{\varvec{\xi }}}^1)] [{\mathbb {d}}(f,g)]^p. \end{aligned}$$

Let \(\delta ^\nu = {\mathbb {H}}(F^\nu ,F^0)\). Suppose that \(f\in F^\nu \). Then, there is \(f'\in F^0\) and \(k^\star \in \{1, \dots , N\}\) such that \({\mathbb {d}}(f,f')\le \delta ^\nu \) and \({\mathbb {d}}(f',f_{k^\star }) \le \gamma _1\). Thus,

$$\begin{aligned}&\mathop {\mathrm{inf} }\nolimits _{g\in {\mathbb {B}}(f,\delta ^\nu )\cap F} {{\bar{\varphi }}}(g) \le {{\bar{\varphi }}}(f') \le {{\bar{\varphi }}}(f_{k^\star }) + {\mathbb {E}}[\kappa ({{\varvec{\xi }}}^1)] \gamma _1^p\\&\quad < {{\bar{\varphi }}}^\nu (f_{k^\star }) + \gamma _2 + {\mathbb {E}}[\kappa ({{\varvec{\xi }}}^1)] \gamma _1^p\\&\quad \le {{\bar{\varphi }}}^\nu (f) + \gamma _2 + {\mathbb {E}}[\kappa ({{\varvec{\xi }}}^1)] \gamma _1^p + \frac{1}{\nu }\sum _{j=1}^\nu \kappa (\xi ^j)(\delta ^\nu + \gamma _1)^p\\&\quad \le {{\bar{\varphi }}}^\nu (f) + \gamma _2 + {\mathbb {E}}[\kappa ({{\varvec{\xi }}}^1)](\gamma _1^p + (\delta ^\nu + \gamma _1)^p) + \zeta (\delta ^\nu + \gamma _1)^p. \end{aligned}$$

Similarly, suppose that \(f\in F^0\). Then, there is \(f'\in F^\nu \) and \(k^\star \in \{1, \dots , N\}\) such that \({\mathbb {d}}(f,f')\le \delta ^\nu \) and \({\mathbb {d}}(f',f_{k^\star }) \le \gamma _1\). Consequently,

$$\begin{aligned} \mathop {\mathrm{inf} }\nolimits _{g\in {\mathbb {B}}(f,\delta ^\nu )\cap F} {{\bar{\varphi }}}^\nu (g)&\le {{\bar{\varphi }}}^\nu (f') \le {{\bar{\varphi }}}^\nu (f_{k^\star }) + \frac{1}{\nu }\sum _{j=1}^\nu \kappa (\xi ^j)\gamma _1^p\\&< {{\bar{\varphi }}}(f_{k^\star }) + \gamma _2 + \frac{1}{\nu }\sum _{j=1}^\nu \kappa (\xi ^j)\gamma _1^p\\&\le {{\bar{\varphi }}}(f) + {\mathbb {E}}[\kappa ({{\varvec{\xi }}}^1)](\delta ^\nu + \gamma _1)^p + \gamma _2 + \frac{1}{\nu }\sum _{j=1}^\nu \kappa (\xi ^j)\gamma _1^p\\&\le {{\bar{\varphi }}}(f) + {\mathbb {E}}[\kappa ({{\varvec{\xi }}}^1)][(\delta ^\nu + \gamma _1)^p + \gamma _1^p] + \gamma _2 + \zeta \gamma _1^p. \end{aligned}$$

Thus, we have shown that the conditions of Proposition 5.2 hold for the functions \({{\bar{\varphi }}}\) and \({{\bar{\varphi }}}^\nu \) with

$$\begin{aligned} \delta ^\nu \text{ and } \tau _0 = \gamma _2 + {\mathbb {E}}[\kappa ({{\varvec{\xi }}}^1)](\gamma _1^p + (\delta ^\nu + \gamma _1)^p) + \zeta (\delta ^\nu + \gamma _1)^p \end{aligned}$$

as the two error parameters (\(\gamma \) and \(\tau \)) and we therefore have that

$$\begin{aligned} \mathop {\mathrm{exs}}\nolimits \big (\varepsilon ^\nu \text{- }\mathop {\mathrm{argmin}}\nolimits _{f\in F^\nu } \varphi ^\nu ((\xi ^1, \dots , \xi ^\nu ), f), ~(\varepsilon ^\nu +2\tau _0)\text{- }\mathop {\mathrm{argmin}}\nolimits _{f\in F^0} \varphi (f) \big ) \le \delta ^\nu . \end{aligned}$$

By Theorem 4.3, \(\log N\) is bounded from above by a term proportional to \(\gamma _1^{-n}(\log \gamma _1^{-1})^{n+1}\) for sufficiently small \(\gamma _1\). Thus, there exist constants \(c_1,c_2>0\) such that

$$\begin{aligned} \frac{\log N - \log (\alpha /4)}{c_0 \gamma _2^2} \le c_1 \gamma _1^{-n}(\log \gamma _1^{-1})^{n+1}\gamma _2^{-2} + c_2 \gamma _2^{-2}, \end{aligned}$$

which gives the rate of growth in \(\nu \) as \(\gamma _1\) and \(\gamma _2\) vanish. For \(\tau >0\), the error \(\tau _0\) can be kept below \(\tau \) if \(\gamma _1\) is proportional to \(\tau ^{1/p}\), \(\gamma _2\) is proportional to \(\tau \), \(\delta ^\nu \) is proportional to \(\min \{\tau , \tau ^{1/p}\}\), and the (positive) proportionality constants are selected sufficiently close to zero. In view of these choices about \(\gamma _1\) and \(\gamma _2\), there is a constant \(c_3>0\) such that

$$\begin{aligned} c_1 \gamma _1^{-n}(\log \gamma _1^{-1})^{n+1}\gamma _2^{-2} + c_2 \gamma _2^{-2} \le c_3 \tau ^{-2-n/p}(\log \tau ^{-1/p})^{n+1}. \end{aligned}$$

With \(\nu \) exceeding both \({{\bar{\nu }}}_0\) and the previous right-hand side, or, equivalently, for some \(c_4>0\),

$$\begin{aligned} \tau \ge c_4 \nu ^{\frac{-1}{2+n/p}} (\log \nu )^{\frac{1+n}{2+n/p}}, \end{aligned}$$

we ensure the required confidence level and the conclusion follows. \(\square \)

A corollary of the theorem for the case with \(\varepsilon ^\nu =0\) and \(F^\nu = F^0 = F\) is that

$$\begin{aligned} \mathop {\mathrm{argmin}}\nolimits _{f\in F} \frac{1}{\nu } \sum _{j=1}^\nu \psi ({{\varvec{\xi }}}^j,f) ~ \subset ~cr^\nu \text{- }\mathop {\mathrm{argmin}}\nolimits _{f\in F} {\mathbb {E}}[\psi ({{\varvec{\xi }}}^1,f)] \end{aligned}$$

with probability at least \(1-\alpha \). Thus, minimizers of (FIP-U)\(^\nu \) converge at the rate \(r^\nu \) to a minimizer of (FIP-U). The rate depends on the Hölder exponent p as well as the dimension n of the domain of the functions under consideration.
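The interplay between p and n in \(r^\nu \) can be inspected numerically; the sketch below simply evaluates the rate from Theorem 5.4 with constants suppressed:

```python
import math


def rate(nu, n, p):
    # r^nu = nu^{-1/(2 + n/p)} * (log nu)^{(1 + n)/(2 + n/p)} from Theorem 5.4.
    e = 2 + n / p
    return nu ** (-1 / e) * math.log(nu) ** ((1 + n) / e)


# A larger Holder exponent p (and a smaller dimension n) gives a faster rate.
for n, p in [(1, 1), (2, 1), (2, 2), (5, 2)]:
    print(f"n={n}, p={p}: r at nu=10^6 is {rate(1e6, n, p):.4g}")
```

In particular, doubling p from 1 to 2 at fixed n visibly shrinks the rate, matching the exponent \(-1/(2+n/p)\) moving toward the canonical \(-1/2\).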

We illustrate the assumptions of the theorem for two stochastic optimization problems arising in nonparametric statistics, but start with an intermediate result.

Proposition 5.5

For Lipschitz continuous functions \(f,g\in {\text {usc-fcns}}(S)\) with common modulus \(\kappa \in [0,\infty )\),

$$\begin{aligned} |f(x) - g(x)| \le (1+\kappa )e^{\rho (x)} {\mathbb {d}}(f,g) \text{ for } \text{ all } x\in S, \end{aligned}$$

where \(\rho (x) = \max \{\Vert x\Vert _\infty , |f(x)|, |g(x)|\}\).

Proof

Let \(x\in S\). The result is trivial if \(\rho (x) = \infty \). Suppose that \(\rho (x)<\infty \). From Lemma 4.1, \({\mathbb {d}}(f,g) \ge e^{-\rho (x)} \hat{\mathbb {d}}_{\rho (x)}(f,g)\). Let \(\tau \in (\hat{\mathbb {d}}_{\rho (x)}(f,g),\infty )\). Again by Lemma 4.1, there exists \(y\in {\mathbb {B}}_\infty (x,\tau )\) such that \(f(y) \ge g(x) - \tau \). Thus, \(g(x) - f(x) = g(x) - f(y) + f(y) - f(x) \le \tau + \kappa \tau \). A similar argument establishes that \(f(x) - g(x) \le \tau + \kappa \tau \). Hence, by letting \(\tau \) tend to its lower limit, we obtain that \(|f(x) - g(x)| \le (1 + \kappa )\hat{\mathbb {d}}_{\rho (x)}(f,g)\) and the conclusion follows. \(\square \)

Example 3

Least-Squares Regression. We return to the setting of Example 2, but now let F be a family that contains only Lipschitz continuous functions with common modulus \(\kappa _0\ge 0\). Suppose also that \(\mathbf{z }^1\) and every component of \(\mathbf{x }^1\) are sub-Gaussian, the unknown function \(f^0\in F\), and there exists \(\beta <\infty \) such that \(|f(0)|\le \beta \) for all \(f\in F\). Then, F is equi-usc and there are \(\gamma ,\zeta \in {\mathbb {R}}\) such that \(|f(x)| \le \gamma \Vert x\Vert _\infty +\zeta \) for all \(f\in F\). Proposition 5.5 then ensures that the Hölder condition in Theorem 5.4 holds with \(p=2\) and \(\kappa ((x,y)) = (1+\kappa _0)^2 \exp (2\max \{\Vert x\Vert _\infty , \gamma \Vert x\Vert _\infty +\zeta \})\), which is integrable in view of the sub-Gaussianity of \(\mathbf{x }^1\).

Using the bound on |f(x)|, we also have that \((\mathbf{y }^1-f(\mathbf{x }^1))^2 = (f^0(\mathbf{x }^1)-f(\mathbf{x }^1) + \mathbf{z }^1)^2\) is sub-exponential. The assumptions of Theorem 5.4 therefore hold,

$$\begin{aligned} r^\nu = \nu ^{\frac{-2}{4+n}}(\log \nu )^{\frac{2(1+n)}{4+n}}, \end{aligned}$$

and, for closed \(F^\nu ,F^0\subset F\), there exist \(c\in [0,\infty )\) and \({{\bar{\nu }}}\in {\mathbb {N}}\) such that

$$\begin{aligned}&P^\nu \bigg [\mathop {\mathrm{exs}}\nolimits \Big (\mathop {\mathrm{argmin}}\nolimits _{f\in F^\nu } \frac{1}{\nu }\sum _{j=1}^\nu (\mathbf{y }^j - f(\mathbf{x }^j))^2, ~~cr^\nu \text{- }\mathop {\mathrm{argmin}}\nolimits _{f\in F^0} {\mathbb {E}}[(\mathbf{y }^1 - f(\mathbf{x }^1))^2]\Big )\\&\quad \le {\mathbb {H}}(F^\nu ,F^0)\bigg ] \ge 1 - \alpha \end{aligned}$$

provided that \(\nu \ge {{\bar{\nu }}}\) and \({\mathbb {H}}(F^\nu ,F^0) = o(r^\nu )\). Thus, when \({\mathbb {H}}(F^\nu ,F^0) =0\) and \({\hat{f}}^\nu \in \mathop {\mathrm{argmin}}\nolimits _{f\in F^\nu } \frac{1}{\nu }\sum _{j=1}^\nu (\mathbf{y }^j - f(\mathbf{x }^j))^2\) is measurable,

$$\begin{aligned} P^\nu \Big ({\mathbb {E}}\big [({\hat{f}}^\nu (\mathbf{x }^1) - f^0(\mathbf{x }^1))^2\big ] \le cr^\nu \Big ) \ge 1-\alpha . \end{aligned}$$

The rates developed here apply in rather general settings and remain in effect even if \(f^0\not \in F^0\). More specific settings give improved results, as in the case of regression with fixed design and Lipschitz continuous functions defined on a compact convex subset [44, p. 333] and in the univariate case [15].

Example 4

Least-Squares Probability Density Estimation. We return to the setting of Example 1, but now consider the least-squares estimator of \(f^0\), which is any minimizer of

$$\begin{aligned} \mathop {\mathrm{min}}\nolimits _{f\in F^\nu } -\frac{2}{\nu }\sum _{j=1}^\nu f({{\varvec{\xi }}}^j) + \int [f(x)]^2 dx. \end{aligned}$$

This estimator is motivated by the fact that the unknown function

$$\begin{aligned} f^0 \in \mathop {\mathrm{argmin}}\nolimits _{f\in F} \int \big [f(x) - f^0(x)\big ]^2 dx = \mathop {\mathrm{argmin}}\nolimits _{f\in F} -2{\mathbb {E}}[f({{\varvec{\xi }}}^1)] + \int [f(x)]^2 dx \end{aligned}$$

whenever \(f^0\in F\). To make the case rather concrete, let \(\kappa \in [0,\infty )\) and for some bounded function \(h:S\rightarrow [0,\infty )\), with \(\int h(x) dx < \infty \),

$$\begin{aligned} F= & {} \bigg \{f\in {\text {usc-fcns}}(S)~\bigg |~\int f(x) dx = 1, ~0 \le f(x) \le h(x), ~|f(x)\\&- f(y)|\le \kappa \Vert x-y\Vert _\infty , \forall x,y\in S\bigg \}, \end{aligned}$$

which can be shown to be closed and bounded; see arguments in [38]. In this case, \((\xi ,f) \mapsto \psi (\xi ,f) = -2f(\xi ) + \int [f(x)]^2 dx\) is a random lsc function, as can be seen by invoking Fatou’s Lemma and pointwise convergence; again see [38]. Then, \(\psi ({{\varvec{\xi }}}^1,f)\) is sub-exponential for all \(f\in F\) as it is in fact bounded.

It remains to check the Hölder condition in Theorem 5.4. Suppose that \({\mathbb {E}}[\exp (\Vert {{\varvec{\xi }}}^1\Vert _\infty )]<\infty \). In view of Proposition 5.5, if the integral term in \(\psi \) had not been present, then the condition would hold with \(p=1\); Lipschitz continuity and the fact that \({\mathbb {E}}[\exp (\Vert {{\varvec{\xi }}}^1\Vert _\infty )]<\infty \) ensure integrability of the Hölder modulus. If S were compact, then \(\psi \) would still satisfy the condition with \(p=1\). For a noncompact S, the argument needs to be slightly modified by first “ignoring” the integral term and then reintroducing it in a slightly generalized version of Theorem 5.4. We omit the details.

In summary, for the given F and under the assumption that \({\mathbb {E}}[\exp (\Vert {{\varvec{\xi }}}^1\Vert _\infty )]<\infty \), we can show by invoking Theorem 5.4 (or the mentioned extensions) that for any \(\alpha \in (0,1)\) there exist \(c\in [0,\infty )\) and \(\bar{\nu }\in {\mathbb {N}}\) such that for every \(\nu \ge {{\bar{\nu }}}\) and \(\varepsilon ^\nu \ge 0\)

$$\begin{aligned}&P^\nu \bigg [\varepsilon ^\nu \text{- }\mathop {\mathrm{argmin}}\nolimits _{f\in F} -\frac{2}{\nu }\sum _{j=1}^\nu f({{\varvec{\xi }}}^j) + \int [f(x)]^2 dx \subset \\&\quad (\varepsilon ^\nu +cr^\nu )\text{- }\mathop {\mathrm{argmin}}\nolimits _{f\in F} -2{\mathbb {E}}[f({{\varvec{\xi }}}^1)] + \int [f(x)]^2 dx\bigg ] \\&\qquad \ge 1 - \alpha \text{ with } r^\nu = \nu ^{\frac{-1}{2+n}}(\log \nu )^{\frac{1+n}{2+n}}. \end{aligned}$$

A sharper result is available in the univariate case over the class of nonincreasing convex functions [13].
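To see the least-squares density objective in action, a toy sketch: the candidates are triangular densities (a hypothetical Lipschitz family within a set like F), the integral \(\int [f(x)]^2 dx\) is available in closed form, and the minimizer is found by enumeration.

```python
import random

random.seed(3)
nu = 8000

# Sample from the triangular density f0(x) = 1 - x/2 on [0, 2]
# via its inverse CDF x = 2 - 2*sqrt(1 - u).
sample = [2.0 - 2.0 * (1.0 - random.random()) ** 0.5 for _ in range(nu)]


def tri(b):
    # Triangular density (2/b)(1 - x/b) on [0, b]; Lipschitz on its support.
    return lambda x: (2.0 / b) * (1.0 - x / b) if 0.0 <= x <= b else 0.0


def ls_objective(b):
    # -(2/nu) * sum_j f(xi^j) + int f^2, with int f_b^2 = 4/(3b) in closed form.
    f = tri(b)
    return -2.0 * sum(f(xi) for xi in sample) / nu + 4.0 / (3.0 * b)


# The empirical least-squares criterion selects the true support parameter.
best_b = min([1.5, 2.0, 2.5], key=ls_objective)
```

Here the grid over the support parameter b plays the role of \(F^\nu \); in the paper's framework, enumeration is replaced by minimization over a closed and bounded subset of \({\text {usc-fcns}}(S)\).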