1 INTRODUCTION

In many problems of mathematical statistics and learning theory, a crucial step is to understand how well the empirical distribution of a sample approximates the underlying true distribution. The theory of empirical processes is devoted to this question. Many papers and books treat this and related problems, both from asymptotic and nonasymptotic points of view; see, for instance, del Barrio et al. [5], van der Vaart and Wellner [20]. Among the many remarkable achievements of the theory of empirical processes, two results have been particularly often invoked and used in the recent literature in statistics and machine learning.

To quickly present these two results, let us give some details on the framework. It is assumed that \(n\) independent copies \(X_{1},\ldots,X_{n}\) of a random variable \(X\) taking its values in the \(d\)-dimensional hypercube \([0,1]^{d}\) are observed. The aforementioned two results characterize the order of magnitude of the supremum of the empirical process \(\mathbb{X}_{n}(f)=\frac{1}{n}\sum_{i=1}^{n}f(X_{i})-\mathbb{E}[f(X)]\) over some class of functions \(\mathcal{F}\). More precisely, the first result, established by [6], states that \(\sup_{f\in\textsf{Lip(1)}}\mathbb{X}_{n}(f)\) is of order \(O(n^{-1/d})\), where Lip(1) is the set of all Lipschitz-continuous functions with Lipschitz constant 1. The second result [3, Lemma 1] tells us that if \(\mathcal{F}\) contains only functions that are smooth enough, for instance functions lying in a finite ball of an RKHS defined by a bounded kernel, then \(\sup_{f\in\mathcal{F}}\mathbb{X}_{n}(f)\) is of order \(O(n^{-1/2})\), i.e., the same order as in the case when \(\mathcal{F}\) contains only one function.

The main result of this note provides an interpolation between the two aforementioned results. Roughly speaking, it shows that if \(\mathcal{F}\) is the class of functions defined on \([0,1]^{d}\) that are Hölder-continuous with a given constant \(L\) and a given order \(\alpha>0\), then the supremum of the empirical process over \(\mathcal{F}\) is of order \(O(n^{-(\frac{\alpha}{d}\wedge\frac{1}{2})})\), with an additional slowly varying factor \(\log n\) when \(\alpha=d/2\). Clearly, when \(\alpha=1\) this coincides with the result from [6], while for \(\alpha\geqslant d/2\) we get the fast, dimension-free rate \(n^{-1/2}\), up to a logarithmic factor when \(\alpha=d/2\).

The rest of this note is organized as follows. We complete this introduction by fixing the notation used throughout the note. Section 2 is devoted to presenting and formally defining Hölder classes and Integral Probability Metrics (IPM). In Section 3, we present some important concepts and results from empirical process theory needed for our proofs. We state our main theorem in Section 4, and some extensions are mentioned in Section 5. The proofs are postponed to the Appendix.

Notations

A multi-index \(\mathbf{k}\) is a vector with integer coordinates \((k_{1},\dots,k_{d})\). We write \(|\mathbf{k}|=\sum_{i=1}^{d}k_{i}\). For a given multi-index \(\mathbf{k}=(k_{1},\dots,k_{d})\), we define the differential operator

$$D^{\mathbf{k}}=\frac{\partial^{|\mathbf{k}|}}{\partial x_{1}^{k_{1}}\dots\partial x_{d}^{k_{d}}}.$$

For any positive real number \(x\), \(\lfloor x\rfloor\) denotes the largest integer strictly smaller than \(x\) (in particular, \(\lfloor k\rfloor=k-1\) for a positive integer \(k\)). We let \(\mathcal{X}\) be a convex bounded set in \(\mathbb{R}^{d}\) with non-empty interior. We assume that all the functions and function classes considered in this note are supported on the bounded set \(\mathcal{X}\). For any integer \(k\), we denote by \(C^{k}(\mathcal{X},\mathbb{R})\) the class of real-valued functions on \(\mathcal{X}\) that are \(k\)-times differentiable with continuous \(k\)-th differentials. For any real-valued bounded function \(f\) on \(\mathcal{X}\), we let \(||f||_{\infty}:=\sup_{x\in\mathcal{X}}|f(x)|\in[0,+\infty)\). Note that we could consider the essential supremum instead of the supremum over \(\mathcal{X}\), in which case our results would hold almost surely. We let \(||\cdot||\) denote some norm on \(\mathbb{R}^{d}\). We denote by \(\sigma_{1},\dots,\sigma_{n}\) i.i.d. Rademacher random variables, i.e., discrete random variables such that \(\mathbb{P}(\sigma_{1}=1)=\mathbb{P}(\sigma_{1}=-1)=1/2\), which are independent of any other source of randomness. We use the convention \({1}/{0}=+\infty\).

2 A PRIMER ON HÖLDER CLASSES AND INTEGRAL PROBABILITY METRICS

In this section we define Hölder classes of functions and integral probability metrics. We then discuss some properties of these notions and highlight their role in statistics and statistical learning theory.

2.1 Hölder Classes

A central problem in nonparametric statistics is to estimate a function belonging to an infinite-dimensional space (e.g., density estimation, regression function estimation, hazard function estimation); see Tsybakov [19] for an introduction to nonparametric estimation. To obtain nontrivial rates of convergence, some kind of regularity is assumed on the function of interest. It can be expressed as conditions on the function itself, on its derivatives, on the coefficients of the function in a given basis, etc. Hölder classes are among the most common classes considered in the nonparametric estimation literature; they form a natural extension of Lipschitz-continuous functions and can be formalised with the following simple conditions. For any real number \(\alpha>0\), we define the Hölder norm of smoothness \(\alpha\) of a \(\lfloor\alpha\rfloor\)-times differentiable function \(f\) as

$$||f||_{\mathcal{H}^{\alpha}}:=\max_{|\mathbf{k}|\leqslant\lfloor\alpha\rfloor}||D^{\mathbf{k}}f||_{\infty}+\max_{|\mathbf{k}|=\lfloor\alpha\rfloor}\sup_{x\neq y}\frac{|D^{\mathbf{k}}f(x)-D^{\mathbf{k}}f(y)|}{||x-y||^{\alpha-\lfloor\alpha\rfloor}}.$$

The Hölder ball of smoothness \(\alpha\) and radius \(L>0\), denoted by \(\mathcal{H}^{\alpha}(L)\), is then defined as the class of \(\lfloor\alpha\rfloor\)-times continuously differentiable functions with Hölder norm bounded by the radius \(L\):

$${\mathcal{H}}^{\alpha}(L)=\left\{f\in C^{\lfloor\alpha\rfloor}(\mathcal{X},\mathbb{R}):||f||_{\mathcal{H}^{\alpha}}\leqslant L\right\}.$$
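To make the definition concrete, here is a minimal numerical sketch (an illustration of our own, not taken from any reference) that estimates the Hölder norm of a function from a finite point cloud when \(0<\alpha\leqslant 1\), in which case \(\lfloor\alpha\rfloor=0\) and the norm reduces to \(||f||_{\infty}+\sup_{x\neq y}|f(x)-f(y)|/||x-y||^{\alpha}\).

```python
import numpy as np

# Rough Monte Carlo estimate (illustration only): for 0 < alpha <= 1,
# approximate ||f||_{H^alpha} = ||f||_inf + sup_{x != y} |f(x)-f(y)| / ||x-y||^alpha
# by maximizing over a random cloud of points in [0, 1]^d.
def holder_norm_estimate(f, d=2, alpha=0.5, n_points=400, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.random((n_points, d))                      # points in [0, 1]^d
    fx = np.array([f(p) for p in x])
    sup_norm = np.abs(fx).max()
    dist = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    gaps = np.abs(fx[:, None] - fx[None, :])
    mask = dist > 0
    seminorm = (gaps[mask] / dist[mask] ** alpha).max()
    return sup_norm + seminorm

# f(x) = sqrt(||x||) is 1/2-Hölder on [0, 1]^2 but not Lipschitz near the origin.
print(holder_norm_estimate(lambda p: np.sqrt(np.linalg.norm(p)), alpha=0.5))
```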

2.2 Integral Probability Metrics

The class \(\mathcal{H}^{1}\)(1) of \(1\)-Lipschitz functions has received a lot of attention in the optimal transport literature; see [13] for an overview of the topic of mathematical optimal transport. This interest comes from the Kantorovitch duality, which implies that the Wasserstein-1 distance (also known as the earth mover’s distance) can be expressed, for any probability measures \(P,Q\), as a supremum of some functional over \(1\)-Lipschitz functions:

$$W_{1}(P,Q)=\sup_{f\in\mathcal{H}^{1}(1)}|\mathbb{E}_{X\sim P}f(X)-\mathbb{E}_{Y\sim Q}f(Y)|.$$

More generally, for a given class \(\mathcal{F}\) of bounded functions, one can define a pseudo-metric on the space of probability measures, the integral probability metric (IPM) induced by the class \(\mathcal{F}\), as

$$d_{\mathcal{F}}(P,Q)=\sup_{f\in\mathcal{F}}|\mathbb{E}_{X\sim P}f(X)-\mathbb{E}_{Y\sim Q}f(Y)|.$$
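As a hedged toy illustration (our own example: with an infinite class such as the Lipschitz functions, maximizing over a finite dictionary only yields a lower bound on the IPM), the plug-in quantity \(d_{\mathcal{F}}(P_{n},Q_{m})\) between two empirical measures can be computed as follows.

```python
import numpy as np

# Toy sketch (not from the paper): plug-in IPM between two empirical measures,
# where the supremum over F is replaced by a maximum over a finite dictionary
# of test functions; with an infinite class this only gives a lower bound.
def ipm_plugin(xs, ys, dictionary):
    return max(abs(np.mean(f(xs)) - np.mean(f(ys))) for f in dictionary)

rng = np.random.default_rng(0)
xs = rng.beta(2.0, 2.0, size=1000)        # sample from P, supported on [0, 1]
ys = rng.beta(2.0, 5.0, size=1000)        # sample from Q, supported on [0, 1]

# A few bounded 1-Lipschitz test functions on [0, 1].
dictionary = [lambda t: t, np.sin, lambda t: np.abs(t - 0.5)]
print(ipm_plugin(xs, ys, dictionary))     # lower bound on the Lipschitz IPM
```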

The literature on IPM has recently been boosted by the advent of adversarial generative models [1, 8]. A reason for this is that an IPM can be seen as an adversarial loss: to compare two probability distributions, it seeks the function that discriminates the most between the two distributions in expectation. Initially studied by the deep learning community, adversarial generative models obtained impressive empirical results on several tasks such as image generation, which led statisticians to study them theoretically [3, 4, 10] (see also Sriperumbudur et al. [18] for statistical results on IPM in a general framework). Since, as pointed out earlier, Lipschitz functions are also Hölder, one can wonder what happens for IPM indexed by general Hölder classes. Such IPM have already appeared in the literature: Scetbon et al. [14] showed that \(\alpha\)-Hölder IPM with smoothness \(\alpha\leqslant 1\) correspond to the cost of a generalized optimal transport problem.

To further motivate our study, let us consider the abstract problem of minimum distance estimation: for a given probability measure \(P\), find a distribution \(Q\) in a given set of probability measures \(\mathcal{Q}\) such that \(Q\) is close to \(P\) under the metric \(d_{\mathcal{F}}\):

$$\min_{Q\in\mathcal{Q}}d_{\mathcal{F}}(Q,P).$$
(1)

For example, when \(\mathcal{F}\) is taken to be the class of \(1\)-Lipschitz functions, this problem is known as minimum Kantorovitch estimation [2]. In statistics, the probability measure \(P\) is usually unknown and one is only given i.i.d. samples \(X_{1},\dots,X_{n}\) from it. A natural strategy is then to employ the empirical distribution \(P_{n}=\frac{1}{n}\sum_{i=1}^{n}\delta_{X_{i}}\) as a proxy for the theoretical distribution and, instead of (1), to solve the problem:

$$\min_{Q\in\mathcal{Q}}d_{\mathcal{F}}(Q,P_{n}).$$
(2)

Since the triangle inequality yields

$$|d_{\mathcal{F}}(Q,P)-d_{\mathcal{F}}(Q,P_{n})|\leqslant d_{\mathcal{F}}(P,P_{n})=\sup_{f\in\mathcal{F}}\left|\frac{1}{n}\sum_{i=1}^{n}f(X_{i})-\mathbb{E}f(X)\right|,$$

one question of interest is to measure how fast the empirical measure approximates the true measure under the IPM \(d_{\mathcal{F}}\). If the rates are fast, we do not lose much by considering the empirical problem (2) instead of the theoretical problem (1). If the rates are slow, however, one cannot expect the solution of (2) to be nearly as close to \(P\) as the solution of (1). We will see in the next section that the right-hand side above is the supremum of the empirical process indexed by the class \(\mathcal{F}\); this will enable us to leverage the rich literature on empirical processes to obtain rates of convergence for \(d_{\mathcal{F}}(P,P_{n})\).
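As a hedged toy version of the empirical problem (2) (the model, the dictionary and the grid search below are illustrative choices of our own), one can fit a one-parameter family \(\{Q_{\theta}\}\) by minimizing a plug-in IPM between samples from \(Q_{\theta}\) and the observed empirical measure \(P_{n}\).

```python
import numpy as np

# Toy minimum distance estimation (illustration only): choose theta so that
# samples from Q_theta = Beta(theta, 2) are close to the observed sample in a
# plug-in IPM over a small dictionary of bounded Lipschitz test functions.
def ipm_plugin(xs, ys, dictionary):
    return max(abs(np.mean(f(xs)) - np.mean(f(ys))) for f in dictionary)

dictionary = [lambda t: t, np.sin, lambda t: np.abs(t - 0.5)]
rng = np.random.default_rng(0)
observed = rng.beta(3.0, 2.0, size=2000)          # the unknown P is Beta(3, 2)

thetas = np.linspace(0.5, 6.0, 23)                # crude grid over the model
losses = [ipm_plugin(rng.beta(th, 2.0, size=2000), observed, dictionary)
          for th in thetas]
print(thetas[int(np.argmin(losses))])             # should land close to 3
```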

3 EMPIRICAL PROCESSES, METRIC ENTROPY AND DUDLEY’S BOUNDS

This section provides a short account of the notions and tools from the theory of empirical processes which are necessary for stating and establishing the main result.

3.1 Empirical Processes

Empirical processes are ubiquitous in statistical learning theory; we refer the reader to [7, 9] for a general presentation of results on empirical processes and their links with statistics and learning theory. For clarity, we begin by recalling the definition of an empirical process.

Definition 1. Let \(\mathcal{F}\) be a class of real-valued functions \(f\colon\mathcal{X}\to\mathbb{R}\), where \((\mathcal{X},\mathcal{A},P)\) is a probability space. Let \(X\) be a random point in \(\mathcal{X}\) distributed according to the distribution \(P\) and let \(X_{1},\dots,X_{n}\) be independent copies of \(X\). The random process \(\big{(}\mathbb{X}_{n}(f)\big{)}_{f\in\mathcal{F}}\) defined by

$$\mathbb{X}_{n}(f):=\frac{1}{n}\sum_{i=1}^{n}f(X_{i})-\mathbb{E}f(X),$$

is called an empirical process indexed by \(\mathcal{F}\).
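For a single function \(f\), \(\mathbb{X}_{n}(f)\) is just a centered empirical mean and fluctuates at the parametric scale \(n^{-1/2}\); the difficulty comes from taking the supremum over \(\mathcal{F}\). A tiny illustration of our own (with \(f(x)=x^{2}\) and \(X\sim\mathcal{U}([0,1])\)):

```python
import numpy as np

# X_n(f) for a single f: f(x) = x^2, X ~ Uniform[0, 1], so E f(X) = 1/3.
rng = np.random.default_rng(0)
for n in (10**2, 10**4, 10**6):
    x = rng.random(n)
    print(n, np.mean(x ** 2) - 1 / 3)   # shrinks roughly like n^{-1/2}
```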

In our case, we are interested in controlling the (expectation of the) supremum of an empirical process, a common task in the literature. The first step toward this goal is usually to ``symmetrize'' the empirical process, as allowed by the following lemma. Let \(\widehat{R}_{n}(\mathcal{F})\) be the empirical Rademacher complexity of the function class \(\mathcal{F}\), defined as

$$\widehat{R}_{n}(\mathcal{F})=\mathbb{E}\left[\sup_{f\in\mathcal{F}}\frac{1}{n}\sum_{i=1}^{n}\sigma_{i}f(X_{i})\Big{|}X_{1},\ldots,X_{n}\right].$$

Lemma 1 (Symmetrization). For any class \(\mathcal{F}\) of \(P\)-integrable functions,

$$\mathbb{E}\left[\sup_{f\in\mathcal{F}}|\mathbb{X}_{n}(f)|\right]\leqslant 2\mathbb{E}\big{[}\widehat{R}_{n}(\mathcal{F})\big{]}.$$
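For a finite class, the conditional expectation defining \(\widehat{R}_{n}(\mathcal{F})\) can be approximated by Monte Carlo over the Rademacher signs; the sketch below (with an arbitrary toy class of our own choosing) makes Lemma 1 concrete.

```python
import numpy as np

# Monte Carlo sketch (our own toy illustration) of the empirical Rademacher
# complexity of a finite class, conditionally on a sample:
#   R_hat_n(F) = E_sigma[ sup_{f in F} (1/n) sum_i sigma_i f(X_i) | X_1,...,X_n ].
def empirical_rademacher(values, n_draws=2000, seed=0):
    """values: array of shape (|F|, n); row j holds (f_j(X_1), ..., f_j(X_n))."""
    rng = np.random.default_rng(seed)
    _, n = values.shape
    sigma = rng.choice([-1.0, 1.0], size=(n_draws, n))   # Rademacher signs
    sups = (sigma @ values.T / n).max(axis=1)            # sup over the class, per draw
    return sups.mean()

rng = np.random.default_rng(1)
x = rng.random(200)                                      # sample on [0, 1]
fs = [np.sin, np.cos, np.sqrt, lambda t: t ** 2]         # a tiny function class
print(empirical_rademacher(np.stack([f(x) for f in fs])))
```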

The advantage of Rademacher processes is that, regardless of the distribution of the random variable \(X\) and the function class \(\mathcal{F}\), for a fixed sample \(X_{1},\dots,X_{n}\), the random variable \(\sum_{i=1}^{n}\sigma_{i}f(X_{i})\) has a sub-Gaussian behavior, in the following sense.

Definition 2 (Sub-Gaussian behavior). A centered random variable \(Y\) has a sub-Gaussian behavior if there exists a positive constant \(\sigma\) such that

$$\mathbb{E}e^{\lambda Y}\leqslant e^{\lambda^{2}\sigma^{2}/2},\quad\forall\lambda\in\mathbb{R}.$$

In that case, we define the sub-Gaussian norm of \(Y\) as

$$||Y||_{\psi_{2}}=\inf\left\{t>0:\mathbb{E}e^{Y^{2}/t^{2}}\leqslant 2\right\}.$$

Having a sub-Gaussian behavior essentially means being at least as concentrated around the mean as a Gaussian random variable. Our definition is equivalent, up to the value of the constant \(\sigma\), to the tail bound

$$\mathbb{P}(|Y|>t)\leqslant 2e^{-t^{2}/(2\sigma^{2})},\quad\forall t>0.$$
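To see why the Rademacher sums appearing above are sub-Gaussian, one can use the elementary bound \(\cosh(u)\leqslant e^{u^{2}/2}\): conditionally on the sample, writing \(a_{i}=f(X_{i})\),

$$\mathbb{E}_{\sigma}\Big[e^{\lambda\sum_{i=1}^{n}\sigma_{i}a_{i}}\Big]=\prod_{i=1}^{n}\cosh(\lambda a_{i})\leqslant\exp\Big(\frac{\lambda^{2}}{2}\sum_{i=1}^{n}a_{i}^{2}\Big),\quad\forall\lambda\in\mathbb{R},$$

so that \(\frac{1}{n}\sum_{i=1}^{n}\sigma_{i}f(X_{i})\) has a sub-Gaussian behavior with \(\sigma^{2}=\frac{1}{n^{2}}\sum_{i=1}^{n}f(X_{i})^{2}\), i.e., \(||f||_{L_{2}(P_{n})}^{2}/n\) in the notation of Section 3.3 below.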

This type of behavior will be crucial to obtain the main result of this note. Indeed, as we will see, the behavior of the supremum of an empirical process (and, more generally, of a stochastic process) with sub-Gaussian increments depends only on the metric structure of the set indexing the process.

3.2 Metric Entropy

Let \((T,d)\) be a totally bounded metric space, i.e., for every real number \(\varepsilon>0\), there exists a finite collection of open balls of radius \(\varepsilon\) whose union contains \(T\). We give a formal definition of such finite collections.

Definition 3. Given \(\varepsilon>0\), a subset \(T_{\varepsilon}\subset T\) is called an \(\varepsilon\)-cover of \(T\) if for every \(t\in T\), there exists \(s\in T_{\varepsilon}\) such that \(d(s,t)\leqslant\varepsilon\).

Note that adding any point to an \(\varepsilon\)-cover still yields an \(\varepsilon\)-cover. It is therefore natural to look for an \(\varepsilon\)-cover of smallest cardinality; this cardinality is called the covering number.

Definition 4. The \(\varepsilon\)-covering number of \(T\), denoted by \(\mathcal{N}(T,d,\varepsilon)\), is the cardinality of the smallest \(\varepsilon\)-cover of \(T\), that is

$$\mathcal{N}(T,d,\varepsilon):=\min\big{\{}|T_{\varepsilon}|:T_{\varepsilon}\textit{ is an }\varepsilon\textit{-cover of }T\big{\}}.$$

The metric entropy of \(T\) is given by the logarithm of the \(\varepsilon\)-covering number.

Remark 1. A totally bounded metric space \((T,d)\) is pre-compact in the sense that its closure is compact. The metric entropy (or entropic numbers) of \((T,d)\) can then be seen as a measure of the compactness of the space. Indeed, \(\mathcal{N}(T,d,\varepsilon)\) quantifies precisely how many balls of radius \(\varepsilon\) are needed to cover the whole space \(T\).
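For intuition, covering numbers of a finite point cloud can be bounded from above by a greedy construction; the sketch below (our own illustration, using the Euclidean metric) exhibits the roughly \(\varepsilon^{-2}\) growth for a cloud filling \([0,1]^{2}\).

```python
import numpy as np

# Greedy sketch (illustration only): build an eps-cover of a finite point cloud
# T in R^d and use its size as an upper bound on N(T, d, eps); greedy covers
# are not minimal in general, so this only bounds the covering number from above.
def greedy_cover(points, eps):
    centers = []
    uncovered = points.copy()
    while len(uncovered) > 0:
        c = uncovered[0]                              # pick any uncovered point
        centers.append(c)
        dist = np.linalg.norm(uncovered - c, axis=1)  # Euclidean metric
        uncovered = uncovered[dist > eps]             # drop points now covered
    return np.array(centers)

rng = np.random.default_rng(0)
cloud = rng.random((2000, 2))                         # points in [0, 1]^2
for eps in (0.2, 0.1, 0.05):
    print(eps, len(greedy_cover(cloud, eps)))         # grows roughly like eps^{-2}
```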

Entropic numbers for Hölder classes are known and can be found in, e.g., Shiryayev [15], van der Vaart and Wellner [20].

Theorem 1 (Theorem 2.7.3 in [20]). Let \(\mathcal{X}\) be a bounded, convex subset of \(\mathbb{R}^{d}\) with nonempty interior. There exists a constant \(K_{\alpha,d}\) depending only on \(\alpha\) and \(d\) such that, for every \(\varepsilon>0\),

$$\log\mathcal{N}(\mathcal{H}^{\alpha}(1),||\cdot||_{\infty},\varepsilon)\leqslant K_{\alpha,d}\lambda_{d}(\mathcal{X}^{1})\varepsilon^{-d/\alpha},$$

where \(\lambda_{d}\) is the \(d\)-dimensional Lebesgue measure and \(\mathcal{X}^{1}\) is the \(1\)-blowup of \(\mathcal{X}\): \(\mathcal{X}^{1}=\{y:\inf_{x\in\mathcal{X}}||y-x||<1\}\).

3.3 Dudley’s Bound and Its Refined Version

We now present classic results which show the link between the topology of the indexing set and the behavior of the supremum of the corresponding empirical process. Following [21, Definition 8.1.1], for \(K\geqslant 0\), we say that a random process \((X_{t})_{t\in T}\) on a metric space \((T,d)\) has \(K\)-sub-Gaussian increments if

$$||X_{t}-X_{s}||_{\psi_{2}}\leqslant Kd(t,s),\quad\text{ for all }\quad t,s\in T.$$

Theorem 2 (Dudley’s inequality). Let \((X_{t})_{t\in T}\) be a mean-zero random process on a metric space \((T,d)\) with \(K\)-sub-Gaussian increments. Then

$$\mathbb{E}\Big{[}\sup_{t\in T}X_{t}\Big{]}\leqslant CK\int\limits_{0}^{+\infty}\sqrt{\log\mathcal{N}(T,d,\varepsilon)}d\varepsilon$$

for some universal constant \(C>0\).

One drawback of Dudley's bound is that the integral on the right-hand side may diverge if the metric entropy of \(T\) tends to infinity too fast when \(\varepsilon\to 0\). For example, when the metric entropy is upper bounded by \(\varepsilon^{-\gamma}\), as was seen to be the case with \(\gamma=d/\alpha\) for \(\alpha\)-Hölder-smooth \(d\)-variate functions, the integral converges if and only if \(\gamma<2\).
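To make the last claim explicit, note that \(\mathcal{N}(T,d,\varepsilon)=1\) as soon as \(\varepsilon\) exceeds the diameter \(D\) of \(T\), so the integral effectively runs over \((0,D]\) and, under the bound \(\log\mathcal{N}(T,d,\varepsilon)\leqslant A\varepsilon^{-\gamma}\),

$$\int\limits_{0}^{D}\sqrt{A\varepsilon^{-\gamma}}\,d\varepsilon=\sqrt{A}\int\limits_{0}^{D}\varepsilon^{-\gamma/2}\,d\varepsilon=\begin{cases}\frac{\sqrt{A}\,D^{1-\gamma/2}}{1-\gamma/2}<+\infty\quad\text{if $\gamma<2$}\\ +\infty\quad\text{if $\gamma\geqslant 2$}.\end{cases}$$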

An improvement of Dudley's bound, which circumvents the divergence of the integral when the process is a Rademacher average indexed by a class of functions \(\mathcal{F}\), was proposed in [17, Lemma A.3] (see also Srebro and Sridharan [16]). Before stating the theorem, let us recall the definition of the \(L_{2}(P_{n})\) norm of a function \(f\):

$$||f||_{L_{2}(P_{n})}^{2}=\int\limits_{\mathcal{X}}f^{2}dP_{n}=\frac{1}{n}\sum_{i=1}^{n}f(X_{i})^{2}.$$

Theorem 3. Let \(\mathcal{F}\subset\{f\colon\mathcal{X}\to\mathbb{R}\}\) be any class of measurable functions containing the uniformly zero function and let \(S_{n}(\mathcal{F})=\sup_{f\in\mathcal{F}}||f||_{L_{2}(P_{n})}\). We have

$$\widehat{R}_{n}(\mathcal{F})\leqslant\inf_{\tau>0}\left\{4\tau+\frac{12}{\sqrt{n}}\int\limits_{\tau}^{S_{n}(\mathcal{F})}\sqrt{\log\mathcal{N}(\mathcal{F},L_{2}(P_{n}),\varepsilon)}d\varepsilon\right\}.$$
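To see how Theorem 3 behaves under a polynomial entropy bound \(\log\mathcal{N}(\mathcal{F},L_{2}(P_{n}),\varepsilon)\leqslant A\varepsilon^{-\gamma}\) (as provided for Hölder classes, with \(\gamma=d/\alpha\), by Theorem 1 combined with Lemma 2 below), one can evaluate the right-hand side numerically. The following sketch (our own, using the closed form of the entropy integral and a crude grid search over \(\tau\)) exhibits the \(n^{-1/2}\) rate for \(\gamma<2\) and the \(n^{-1/\gamma}\) rate for \(\gamma>2\).

```python
import numpy as np

# Evaluate the refined Dudley bound of Theorem 3 under the polynomial entropy
# bound log N(F, L2(Pn), eps) <= A * eps**(-gamma)  (numerical sketch only).
def entropy_integral(tau, S, A, gamma):
    if gamma == 2.0:
        return np.sqrt(A) * np.log(S / tau)
    e = 1.0 - gamma / 2.0
    return np.sqrt(A) * (S ** e - tau ** e) / e

def refined_dudley(n, A=1.0, gamma=1.0, S=1.0):
    taus = np.logspace(-8, 0, 500)                       # crude grid search over tau
    return min(4 * t + 12 * entropy_integral(t, S, A, gamma) / np.sqrt(n)
               for t in taus)

# gamma = 1 (< 2): bound ~ n^{-1/2}.   gamma = 4 (> 2): bound ~ n^{-1/4}.
for n in (10**2, 10**4, 10**6):
    print(n, refined_dudley(n, gamma=1.0), refined_dudley(n, gamma=4.0))
```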

Note that the refined Dudley bound controls the empirical Rademacher complexity and involves the metric entropy with respect to the empirical norm \(L_{2}(P_{n})\). The following simple lemma shows that the \(L_{2}(P_{n})\)-norm can be replaced by the supremum norm in the refined Dudley bound.

Lemma 2. Let \(\mathcal{F}\) be any class of bounded functions defined on \(\mathcal{X}\). For any sample \(X_{1},\dots,X_{n}\), let \(\mathcal{F}_{|X_{1},\ldots,X_{n}}\) be the subset of \(\mathbb{R}^{n}\) defined by

$$\mathcal{F}_{|X_{1},\ldots,X_{n}}=\big{\{}u\in\mathbb{R}^{n}:\exists f\in\mathcal{F}\textit{ such that }u_{i}=f(X_{i})\textit{ for all }i=1,\ldots,n\big{\}}.$$

For any \(\varepsilon>0\), we have

$$\mathcal{N}(\mathcal{F},L_{2}(P_{n}),\varepsilon)\leqslant\mathcal{N}(\mathcal{F}_{|X_{1},\dots,X_{n}},||\cdot||_{\infty},\varepsilon)\leqslant\mathcal{N}(\mathcal{F},||\cdot||_{\infty},\varepsilon).$$

Proof. Let \(\{u_{1},\dots,u_{M}\}\) be a minimal \(\varepsilon\)-net for \(\mathcal{F}_{|X_{1},\dots,X_{n}}\) with respect to the supremum norm. Let \(f_{1},\ldots,f_{M}\in\mathcal{F}\) be such that \(\big{(}f_{j}(X_{1}),\ldots,f_{j}(X_{n}))=u_{j}\) for every \(j=1,\ldots,M\). Then, for any \(f\in\mathcal{F}\), there exists an index \(j\in[M]\) such that \(\max_{i}|f(X_{i})-(u_{j})_{i}|=\max_{i}|f(X_{i})-f_{j}(X_{i})|\leqslant\varepsilon\). Since for any function \(f\) in \(\mathcal{F}\),

$$||f-f_{j}||_{L_{2}(P_{n})}^{2}=\frac{1}{n}\sum_{i=1}^{n}(f(X_{i})-f_{j}(X_{i}))^{2}\leqslant||f-f_{j}||_{\infty}^{2},$$

\(\{f_{1},\dots,f_{M}\}\) is an \(\varepsilon\)-net for \(\mathcal{F}\) with respect to the empirical \(L_{2}\) norm. This proves the first inequality. Let now \(f_{1},\ldots,f_{M}\) be an \(\varepsilon\)-net of \((\mathcal{F},||\cdot||_{\infty})\). One readily checks that \(u_{1},\ldots,u_{M}\) defined by \(u_{j}=(f_{j}(X_{1}),\ldots,f_{j}(X_{n}))\) is an \(\varepsilon\)-net of \(\mathcal{F}_{|X_{1},\dots,X_{n}}\). This completes the proof. \(\Box\)

4 MAIN RESULT

We are now in a position to state the main theorem which gives, for an IPM defined by a Hölder class, the rate of convergence of the empirical measure towards its theoretical counterpart.

Theorem 4. Let \(\mathcal{X}\subset\mathbb{R}^{d}\) be a convex bounded set with non-empty interior. Let \(\mathcal{H}^{\alpha}(L)\) be the Hölder class of \(\alpha\)-smooth functions supported on the set \(\mathcal{X}\) and with Hölder norm bounded by \(L\). For any probability distribution \(P\) supported on \(\mathcal{X}\), denoting by \(P_{n}\) the empirical measure associated to i.i.d. samples \(X_{1},\dots,X_{n}\sim P\), we have,

$$\mathbb{E}\big{[}d_{\mathcal{H}^{\alpha}(L)}(P_{n},P)\big{]}=\mathbb{E}\bigg{[}\sup_{h\in\mathcal{H}^{\alpha}(L)}\big{|}\mathbb{X}_{n}(h)\big{|}\bigg{]}\leqslant cL\begin{cases}n^{-{\alpha}/{d}}\quad\text{if $\alpha<{d}/{2}$}\\ n^{-{1}/{2}}\ln(n)\quad\text{if $\alpha={d}/{2}$}\\ n^{-{1}/{2}}\quad\text{if $\alpha>{d}/{2}$},\end{cases}$$

where \(c\) is a constant depending only on \(d\), \(\lambda_{d}(\mathcal{X}^{1})\) and \(\alpha\).

We notice two different regimes: for highly smooth functions (\(\alpha>d/2\)), the rate of convergence depends neither on the smoothness \(\alpha\) nor on the dimension \(d\), and corresponds to the usual parametric rate of convergence (note that it also matches the rate known for the Maximum Mean Discrepancy metric, which is an IPM indexed by the unit ball of an RKHS with bounded kernel [3]). For less regular Hölder functions (\(\alpha<d/2\)), the rate of convergence depends both on the smoothness and on the dimension, in a typical curse-of-dimensionality behavior. These two regimes coincide, up to a logarithmic factor, at the smoothness boundary \(\alpha=d/2\): the transition is continuous in terms of the exponent of the sample size. Interestingly, the rates we obtain interpolate between the \(n^{-1/d}\) rate known for the Wasserstein-1 distance [22], when considering \(\mathcal{H}^{1}(1)\), and the \(n^{-1/2}\) rate for the Maximum Mean Discrepancy, when considering Hölder classes that are smooth enough.

Finally, let us mention that the formulation of Theorem 4 given above aims at characterizing the behaviour of the expected error in the asymptotic setting of large samples. This result follows from the following finite sample upper bound (proved in Section 6.2):

$$\mathbb{E}\left[d_{\mathcal{H}^{\alpha}(L)}(P_{n},P)\right]\leqslant 12L\begin{cases}\left(\frac{K\lambda}{n}\right)^{{\alpha}/{d}}\left[\frac{d}{d-2\alpha}\wedge(1+0.5\log(\frac{n}{9K\lambda}))\right]\quad\text{if $\alpha<{d}/{2}$}\\ \left(\frac{K\lambda}{n}\right)^{{1}/{2}}\left[\frac{2\alpha}{2\alpha-d}\wedge(1+\frac{\alpha}{d}\log(\frac{n}{9K\lambda}))\right]\quad\text{if $\alpha\geqslant{d}/{2}$},\end{cases}$$

where \(\lambda:=\lambda_{d}(\mathcal{X}^{1})\) and \(K=K_{\alpha,d}\) is the constant depending only on \(\alpha\) and \(d\) borrowed from Theorem 1.
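For quick numerical inspection, this finite-sample bound can be evaluated directly. In the sketch below (a helper of our own; the constants \(K=K_{\alpha,d}\) and \(\lambda=\lambda_{d}(\mathcal{X}^{1})\) are treated as user-supplied inputs since Theorem 1 does not make them explicit), the two regimes of Theorem 4 are clearly visible.

```python
import numpy as np

# Evaluate the finite-sample upper bound displayed above for given n, alpha, d,
# radius L and constants lam = lambda_d(X^1), K = K_{alpha,d} (inputs here).
def holder_ipm_bound(n, alpha, d, L=1.0, K=1.0, lam=1.0):
    r = K * lam / n
    if alpha < d / 2:
        factor = min(d / (d - 2 * alpha), 1 + 0.5 * np.log(n / (9 * K * lam)))
        return 12 * L * r ** (alpha / d) * factor
    factor = min(2 * alpha / (2 * alpha - d) if 2 * alpha > d else np.inf,
                 1 + (alpha / d) * np.log(n / (9 * K * lam)))
    return 12 * L * np.sqrt(r) * factor

for n in (10**2, 10**4, 10**6):
    # alpha = 0.5, d = 3 (slow regime) versus alpha = 2, d = 3 (parametric regime)
    print(n, holder_ipm_bound(n, 0.5, 3), holder_ipm_bound(n, 2.0, 3))
```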

5 SOME EXTENSIONS

A slightly less precise but more general result can be obtained for any bounded class whose entropy grows polynomially in \(1/\varepsilon\); see also Rakhlin et al. [12, Theorem 2], where this condition naturally arises. Such an extension can be stated as follows.

Theorem 5. Let \(\mathcal{X}\subset\mathbb{R}^{d}\) be a convex bounded set with non-empty interior. Let \(\mathcal{H}\) be a bounded class of functions supported on the set \(\mathcal{X}\). Assume that the entropy of the class grows polynomially, i.e., there exist positive real numbers \(p\) and \(A\) such that

$$\forall\varepsilon>0,\quad\log\mathcal{N}(\mathcal{H},||\cdot||_{\infty},\varepsilon)\leqslant A\varepsilon^{-p}.$$

Then, for any probability distribution \(P\) supported on \(\mathcal{X}\), denoting by \(P_{n}\) the empirical measure associated to i.i.d. samples \(X_{1},\dots,X_{n}\sim P\), we have,

$$\mathbb{E}\big{[}d_{\mathcal{H}}(P_{n},P)\big{]}=\mathbb{E}\bigg{[}\sup_{h\in\mathcal{H}}\big{|}\mathbb{X}_{n}(h)\big{|}\bigg{]}\leqslant c\begin{cases}n^{-{1}/{p}}\quad\text{if $p>2$}\\ n^{-{1}/{2}}\ln(n)\quad\text{if $p=2$}\\ n^{-{1}/{2}}\quad\text{if $p<2$},\end{cases}$$

where \(c\) is a constant.
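Note that the Hölder case fits this framework: by Theorem 1 and the rescaling identity \(\mathcal{N}(\mathcal{H}^{\alpha}(L),||\cdot||_{\infty},\varepsilon)=\mathcal{N}(\mathcal{H}^{\alpha}(1),||\cdot||_{\infty},\varepsilon/L)\), the class \(\mathcal{H}^{\alpha}(L)\) satisfies the entropy assumption with \(p=d/\alpha\) and \(A=K_{\alpha,d}\lambda_{d}(\mathcal{X}^{1})L^{d/\alpha}\), and the three regimes \(p>2\), \(p=2\) and \(p<2\) correspond exactly to the regimes \(\alpha<d/2\), \(\alpha=d/2\) and \(\alpha>d/2\) of Theorem 4.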

The proof of the extension is exactly the same as the proof of Theorem 4 up to constants. In this note we have presented Hölder classes as examples of classes whose entropy grows polynomially, but there are many other such classes. To illustrate this, we give the example of Sobolev classes, which, in some cases, are more general than Hölder classes. For a positive integer \(s\) and a real number \(1\leqslant p\leqslant+\infty\), define the Sobolev space \(\mathcal{W}_{p}^{s}(r)\) with radius \(r>0\) as

$$\mathcal{W}^{s}_{p}(r):=\left\{f\in C^{s}(\mathcal{X},\mathbb{R}):\sum_{|\mathbf{k}|\leqslant s}||D^{\mathbf{k}}f||_{p}\leqslant r\right\}.$$

Note that for any positive integer \(s\) and for any positive radius \(L\), there exist radii \(r\) and \(r^{\prime}\) such that

$$\mathcal{W}^{s}_{\infty}(r)\subset\mathcal{H}^{s}(L)\subset\mathcal{W}^{s-1}_{\infty}(r^{\prime}).$$

A consequence of [11, Corollary 1] is that for any positive integer \(s\) and any real number \(p\) such that \(d/s<p\leqslant+\infty\), the entropy of the Sobolev class grows polynomially as

$$\log\mathcal{N}(\mathcal{W}^{s}_{p}(L),||\cdot||_{\infty},\varepsilon)\leqslant A\varepsilon^{-d/s}$$

for some positive constant \(A\). Thus Theorem 5 holds for this class. Finally we point out that such bounds on the entropy hold for more general spaces such as some Besov spaces. We refer the reader to Nickl and Pötscher [11] for more details.