1 Introduction

We introduce a new Orlicz norm which we name the Bernstein–Orlicz norm. It interpolates sub-Gaussian and sub-exponential tail behavior. With this new norm, we apply the usual techniques based on Orlicz norms. In particular, we derive deviation inequalities for suprema in a fairly simple and straightforward way. The Bernstein–Orlicz norm captures Bernstein’s probability inequalities, and its use puts further derivations in a unifying framework, shared for example by techniques for the sub-Gaussian case, such as those for empirical processes based on symmetrization and Hoeffding’s inequality.

We furthermore introduce chaining and generic chaining along a tree, which we believe is conceptually simpler than the usual chaining and generic chaining. We invoke it for the presentation of maximal inequalities for general random variables with finite Bernstein–Orlicz norm. The supremum of the empirical process is then studied as a special case, and we show that chaining along a tree can be done using entropy with bracketing. We establish a deviation inequality for the empirical process indexed by a class of functions \(\mathcal{G}\) in terms of the new Bernstein–Orlicz norm. The class \(\mathcal{G}\) is assumed to satisfy a uniform Bernstein condition, but need not be uniformly bounded in supremum norm.

The paper is organized as follows. In Sect. 2, we introduce the Bernstein–Orlicz norm and discuss the relation with Bernstein’s inequality. We then present some bounds for maxima of finitely many random variables (Sect. 3) and suprema over a countable set of random variables (Sect. 4). Section 4 also contains the concept of (generic) chaining along a tree. The proofs of the results in Sects. 2, 3 and 4 are elementary and given immediately following their statement. Section 5 contains the application to the empirical process. The proofs here are more technical and given separately in Sects. 6 and 7.

2 The Bernstein–Orlicz norm

Consider a random variable \(Z \in \mathbb R \) with distribution \(\mathbb P \). We first recall the general Orlicz norm (see e.g. [7]).

Definition 1

Let \(\varPsi : [0, \infty ) \rightarrow [0 , \infty ) \) be a non-decreasing, convex function with \(\varPsi (0)=0\). The \(\varPsi \)-Orlicz norm of \(Z\) is

$$\begin{aligned} \Vert Z \Vert _{\varPsi } := \inf \biggl \{ c >0 : \mathbb E \varPsi \biggl ({ |Z | \over c } \biggr ) \le 1 \biggr \} . \end{aligned}$$

A special case is the \(L_m (\mathbb P )\)-norm (\(m \ge 1\)) which corresponds to \(\varPsi (z) = z^m \). Other important special cases are \(\varPsi (z) = \exp [ z^2 ] - 1\) for sub-Gaussian random variables and \(\varPsi (z) = \exp (z) -1 \) for sub-exponential random variables. We propose functions \(\varPsi \) that combine sub-Gaussian intermediate tails and sub-exponential far tails.

For each \(L>0\) we define

$$\begin{aligned} \varPsi _L (z):= \exp \biggl [ \biggl ( { \sqrt{1+ 2L z } - 1 \over L } \biggr )^2 \biggr ] - 1 , \quad z \ge 0 . \end{aligned}$$
(1)

It is easy to see that \(\varPsi _L\) is increasing and convex, and that \(\varPsi _L (0)=0\).

Definition 2

Let \(L>0\) be given. The (\(L\)-)Bernstein–Orlicz norm is the \(\varPsi \)-Orlicz norm with \(\varPsi = \varPsi _L\) given in (1).

Indeed, the Bernstein–Orlicz norm combines sub-Gaussian and sub-exponential behavior:

$$\begin{aligned} \varPsi _L (z) \approx \left\{ \begin{array}{ll}\exp [ z^2] -1&\quad \text{for}\; Lz \;\text{small}, \\ \exp [2z/L ] -1&\quad \text{for}\; Lz \;\text{large}. \\ \end{array}\right. \end{aligned}$$

Note that the constant \(L\) governs the range of the sub-Gaussian behavior. It is a dimensionless constant, i.e., it does not depend on the scale of measurement.
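This interpolation is easy to inspect numerically. The following minimal sketch (the value \(L=1/2\) and the grid of \(z\)-values are arbitrary choices) compares \(\log ( 1+ \varPsi _L (z) )\) with the sub-Gaussian regime \(z^2\) and the sub-exponential regime \(2z/L\); the latter is approached only up to lower order terms.

```python
# Illustrative sketch: log(1 + Psi_L(z)) = ((sqrt(1 + 2Lz) - 1) / L)^2,
# compared with z^2 (the regime for Lz small) and 2z/L (Lz large).
# The choice L = 0.5 and the grid are arbitrary.
import math

def log1p_psi(z, L):
    return ((math.sqrt(1.0 + 2.0 * L * z) - 1.0) / L) ** 2

L = 0.5
for z in [0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]:
    print(f"z={z:8.2f}  log(1+Psi_L)={log1p_psi(z, L):12.4f}  "
          f"z^2={z * z:12.4f}  2z/L={2.0 * z / L:10.1f}")
```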

The inverse of \(\varPsi _L\) is

$$\begin{aligned} \varPsi _L^{-1} (t)= \sqrt{ \log (1+t)} + {L \over 2} \log (1+t) , \quad t \ge 0 . \end{aligned}$$
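Indeed, substituting \(z = \varPsi _L^{-1} (t)\) into (1) and using

$$\begin{aligned} 1+ 2L \varPsi _L^{-1} (t) = 1 + 2L \sqrt{\log (1+t)} + L^2 \log (1+t) = \bigl ( 1 + L \sqrt{\log (1+t)} \bigr )^2 , \end{aligned}$$

one finds \(\bigl ( \sqrt{1+ 2L \varPsi _L^{-1} (t) } -1 \bigr )/L = \sqrt{\log (1+t)}\), so that \(\varPsi _L ( \varPsi _L^{-1} (t) ) = \exp [ \log (1+t) ] - 1 = t\).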

With this and with Chebyshev’s inequality, one now directly derives a probability inequality for \(Z\).

Lemma 1

Let \(\tau := \Vert Z \Vert _{\varPsi _L} \). We have for all \(t >0\),

$$\begin{aligned} \mathbb P \biggl ( |Z| >\tau \biggl [ \sqrt{t} + {Lt \over 2} \biggr ] \biggr ) \le 2 \exp [-t] . \end{aligned}$$

Proof of Lemma 1

By Chebyshev’s inequality, for all \(c>\Vert Z \Vert _{\varPsi _L} \),

$$\begin{aligned} \mathbb P \biggl ( | Z | / c \ge \sqrt{t} + {Lt \over 2} \biggr )&= \mathbb P \biggl ( |Z| / c \ge \varPsi _L^{-1} ( \mathrm{e}^t -1) \biggr )\\&= \mathbb P \biggl ( \varPsi _L (| Z|/c) \ge \mathrm{e}^t -1 \biggr ) \le \biggl ( \mathbb E \varPsi _L ( |Z|/ c ) +1 \biggr ) \mathrm{e}^{-t} . \end{aligned}$$

Thus,

$$\begin{aligned} \mathbb P \biggl ( | Z | / \tau > \sqrt{t} + {Lt \over 2} \biggr )&= \lim _{c \downarrow \tau } \mathbb P \biggl ( | Z | / c > \sqrt{t} + {Lt \over 2} \biggr )\\&\le \lim _{c \downarrow \tau } \biggl ( \mathbb E \varPsi _L ( |Z|/ c ) +1 \biggr ) \mathrm{e}^{-t} \le 2 \mathrm{e}^{-t} . \end{aligned}$$

\(\square \)

The next lemma says that a converse result holds as well, that is, from the probability inequality of Lemma 1 one can derive a bound for the Bernstein–Orlicz norm, with constants \(L\) and \(\tau \) multiplied by \(\sqrt{3}\).

Lemma 2

Suppose that for some constants \(\tau \) and \(L\), and for all \(t >0\),

$$\begin{aligned} \mathbb P \biggl ( |Z| \ge \tau \biggl [ \sqrt{t} + {Lt \over 2} \biggr ] \biggr ) \le 2 \exp [-t] . \end{aligned}$$

Then \(\Vert Z \Vert _{\varPsi _{\sqrt{3} L }} \le \sqrt{3} \tau \).

Proof of Lemma 2

We have

$$\begin{aligned}&\mathbb E \varPsi _{\sqrt{3}L} \biggl (|Z|/(\sqrt{3} \tau )\biggr )= \int _0^{\infty } \mathbb P \biggl ( |Z| \ge \sqrt{3} \tau \varPsi _{\sqrt{3} L}^{-1} (t) \biggr )dt\\&\quad = \int _0^{\infty } \mathbb P \biggl ( |Z| \ge \sqrt{3} \tau \biggl [ \sqrt{\log (1+t) } + {\sqrt{3} L \over 2 } \log (1+t ) \biggr ] \biggr )dt\\&\quad = \int _0^{\infty } \mathbb P \biggl ( |Z| \ge \tau \biggl [ \sqrt{\log (1+t)^3 } + { L \over 2 } \log (1+t )^3 \biggr ] \biggr )dt \le 2 \int _{0}^{\infty } {1 \over (1+t)^3} dt = 1. \end{aligned}$$

\(\square \)

We recall Bernstein’s inequality, see [2].

Theorem 1

Let \(X_1 , \ldots , X_n \) be independent random variables with values in \(\mathbb R \) and with mean zero. Suppose that for some constants \(\sigma \) and \(K\), one has

$$\begin{aligned} {1 \over n} \sum _{i=1}^n \mathbb E | X_i|^m \le {m! \over 2} K^{m-2} \sigma ^2 , \quad m =2 , 3 , \ldots . \end{aligned}$$

Then for all \(t>0\),

$$\begin{aligned} \mathbb P \left( {1 \over \sqrt{n} } \biggl | \sum _{i=1}^n X_i \biggr | \ge \sigma \sqrt{2 t } + { Kt \over \sqrt{n}} \right) \le 2 \exp [-t] . \end{aligned}$$

The following corollary shows that \(\Vert \cdot \Vert _{\varPsi _L}\) indeed captures the nature of Bernstein’s inequality.

Corollary 1

Let \(X_1 , \ldots , X_n \) be independent random variables satisfying the conditions of Theorem 1. Then by this theorem and Lemma 2, for \(L := \sqrt{6} K /( \sqrt{n} \sigma ) \), we have

$$\begin{aligned} \biggl \Vert {1 \over \sqrt{n} } \sum _{i=1}^n X_i \biggr \Vert _{\varPsi _L} \le \sqrt{6} \sigma . \end{aligned}$$
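As a numerical illustration of Definition 1 and Corollary 1, the \(\varPsi _L\)-Orlicz norm can be estimated by Monte Carlo and bisection on \(c\). In the following sketch all choices (centered \(\mathrm{Exp}(1)\) summands, \(n=50\), the sample size) are arbitrary; a direct moment computation shows that the conditions of Theorem 1 hold here with \(\sigma = K = 1\).

```python
# Monte Carlo sketch (illustrative only) of the Psi_L-Orlicz norm of a
# normalized sum of centered Exp(1) variables, estimated by bisection.
import math
import random

def psi_L(z, L):
    e = ((math.sqrt(1.0 + 2.0 * L * z) - 1.0) / L) ** 2
    return math.exp(min(e, 700.0)) - 1.0      # cap the exponent against overflow

def orlicz_norm(sample, L, lo=1e-2, hi=1e3, iters=60):
    """Smallest c with (empirical) E Psi_L(|Z|/c) <= 1, by bisection in log scale."""
    def excess(c):
        return sum(psi_L(abs(z) / c, L) for z in sample) / len(sample) - 1.0
    for _ in range(iters):
        mid = math.sqrt(lo * hi)
        lo, hi = (mid, hi) if excess(mid) > 0.0 else (lo, mid)
    return hi

random.seed(1)
n = 50
# Z = n^{-1/2} sum_i (X_i - 1) with X_i ~ Exp(1); sigma = K = 1.
sample = [sum(random.expovariate(1.0) - 1.0 for _ in range(n)) / math.sqrt(n)
          for _ in range(2000)]
L = math.sqrt(6.0) / math.sqrt(n)             # L = sqrt(6) K / (sqrt(n) sigma)
print("estimate:", orlicz_norm(sample, L), "  Corollary 1 bound:", math.sqrt(6.0))
```

The estimate lies below the bound \(\sqrt{6}\) of Corollary 1, as it must.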

3 The Bernstein–Orlicz norm for the maximum of finitely many variables

Using Orlicz norms, the argument for obtaining a bound for the expectation of maxima is standard. We refer to [14] for a general approach. We consider the special case of the Bernstein–Orlicz norm.

Lemma 3

Let \(\tau \) and \(L\) be constants, and let \(Z_1 , \ldots , Z_p\) be random variables satisfying

$$\begin{aligned} \max _{1 \le j \le p } \Vert Z_j \Vert _{\varPsi _L} \le \tau . \end{aligned}$$

Then

$$\begin{aligned} \mathbb E \max _{1 \le j \le p} |Z_j | \le \tau \biggl [ \sqrt{\log (1+ p)} + {L\over 2} \log (1+p ) \biggr ] . \end{aligned}$$

Proof of Lemma 3

Let \(c > \tau \). Then by Jensen’s inequality

$$\begin{aligned} \mathbb E \max _{1 \le j \le p} |Z_j |&\le c \varPsi _L^{-1} \left( \mathbb E \varPsi _L\biggl ( \max _{1 \le j \le p } |Z_j | / c\biggr ) \right) = c \varPsi _L^{-1} \left( \mathbb E \max _{1 \le j \le p } \varPsi _L\biggl ( |Z_j | / c\biggr ) \right)\\&\le c \varPsi _L^{-1} \left( \sum _{j=1}^p \mathbb E \varPsi _L\biggl ( |Z_j | / c\biggr ) \right) \le c \varPsi _L^{-1} \left( p \max _{1 \le j \le p } \mathbb E \varPsi _L\biggl ( |Z_j | / c\biggr ) \right)\!. \end{aligned}$$

Therefore,

$$\begin{aligned} \mathbb E \max _{1 \le j \le p} |Z_j |&\le \lim _{c \downarrow \tau } c \varPsi _L^{-1} \left( p \max _{1 \le j \le p } \mathbb E \varPsi _L\biggl ( |Z_j | / c\biggr ) \right) \le \tau \varPsi _L^{-1} ( p )\\&= \tau \biggl [ \sqrt{\log (1+ p)} + {L\over 2} \log (1+p ) \biggr ] . \end{aligned}$$

\(\square \)

As a special case, one may consider the random variables

$$\begin{aligned} Z_j := {1 \over \sqrt{n}} \sum _{i=1}^n g_j(X_i), \quad j=1 , \ldots , p, \end{aligned}$$

where \(X_1 , \ldots , X_n\) are independent random variables with values in some space \(\mathcal{X}\), and where \(g_1 , \ldots , g_p\) are real-valued functions on \(\mathcal{X}\). If the \(g_j(X_i)\) are centered for all \(i\) and \(j\), and if one assumes the Bernstein condition

$$\begin{aligned} {1 \over n } \sum _{i=1}^n \mathbb E | g_j (X_i) |^m \le { m! \over 2 } K^{m-2} \sigma ^2 , \quad m=2,3, \ldots , \ j=1 , \ldots , p , \end{aligned}$$

then one can apply Lemma 3, with \(\tau := \sqrt{6} \sigma \) and \(L= \sqrt{6} K/ (\sqrt{n} \sigma )\), giving the inequality

$$\begin{aligned} \mathbb E \max _{1 \le j \le p} \biggl | {1 \over \sqrt{n}} \sum _{i=1}^n g_j (X_i) \biggr | \le \sigma \sqrt{6 \log (1+ p)} + { 3K \over \sqrt{n}} \log (1+ p ) . \end{aligned}$$
(2)

This follows from Corollary 1. The constants can however be improved when using direct arguments (see e.g. Lemma 14.12 in [5]).

We now present a deviation inequality in probability for the maximum of finitely many variables.

Lemma 4

Let \(Z_1 , \ldots , Z_p\) be random variables satisfying for some \(L\) and \(\tau \)

$$\begin{aligned} \max _{1 \le j \le p } \Vert Z_j \Vert _{\varPsi _L} \le \tau . \end{aligned}$$

Then for all \(t > 0\)

$$\begin{aligned} \mathbb P \left( \max _{1 \le j \le p } | Z_j | \ge \tau \biggl [ \sqrt{\log (1+ p)} + {L\over 2} \log (1+p ) + \sqrt{t} +{ Lt \over 2} \biggr ] \right) \le 2 \exp [-t] . \end{aligned}$$

Proof of Lemma 4

We first use that for any \(a>0\) and \(t>0\) one has \(\sqrt{a} + \sqrt{t} > \sqrt{a+t}\) so that

$$\begin{aligned}&\mathbb P \left( \max _{1 \le j \le p } | Z_j | \ge \tau \biggl [ \sqrt{\log (1+ p)} + {L\over 2} \log (1+p ) + \sqrt{t} +{ Lt \over 2} \biggr ] \right)\\&\quad \le \mathbb P \left( \max _{1 \le j \le p } | Z_j | > \tau \biggl [ \sqrt{t+\log (1+ p)} + {L\over 2} ( t+ \log (1+p ) ) \biggr ] \right). \end{aligned}$$

Next, we apply the union bound and Lemma 1:

$$\begin{aligned}&\mathbb P \left( \max _{1 \le j \le p } | Z_j | > \tau \biggl [ \sqrt{t+\log (1+ p)} + {L\over 2} ( t+ \log (1+p ) ) \biggr ] \right)\\&\quad \le \sum _{j=1}^p \mathbb P \left( | Z_j | > \tau \biggl [ \sqrt{t+\log (1+ p)} + {L\over 2} ( t+ \log (1+p ) ) \biggr ] \right)\\&\quad \le 2 p \exp \biggl [ -(t+ \log (1+p) ) \biggr ] = {2p \over 1+p} \exp [-t] \le 2 \exp [-t] . \end{aligned}$$

\(\square \)

Using Lemma 2, this is easily converted into the following deviation inequality for the Bernstein–Orlicz norm. We use the notation

$$\begin{aligned} x_+ := x \, \mathbf{1} \{ x > 0 \} . \end{aligned}$$

Lemma 5

Let \(Z_1 , \ldots , Z_p\) be random variables satisfying for some \(L\) and \(\tau \)

$$\begin{aligned} \max _{1 \le j \le p } \Vert Z_j \Vert _{\varPsi _L} \le \tau . \end{aligned}$$

Then

$$\begin{aligned} \biggl \Vert \biggl ( \max _{1 \le j \le p } |Z_j| - \tau \biggl [ \sqrt{\log (1+p)} + {L \over 2} \log (1+p) \biggr ] \biggr )_+ \biggr \Vert _{\varPsi _{\sqrt{3} L}} \le \sqrt{3} \tau . \end{aligned}$$

Proof of Lemma 5

Let

$$\begin{aligned} Z := \biggl (\max _{1 \le j \le p } |Z_j| - \tau \biggl [ \sqrt{\log (1+ p)} + {L\over 2} \log (1+p ) \biggr ] \biggr )_+ . \end{aligned}$$

By Lemma 4, we have for all \(t >0\)

$$\begin{aligned}&\mathbb P \biggl ( Z \ge \tau \biggl [ \sqrt{t} + {Lt \over 2} \biggr ] \biggr )\\&\quad \!=\! \mathbb P \left( \max _{1 \le j \le p } | Z_j | \ge \tau \biggl [ \sqrt{\log (1\!+\! p)} \!+\! {L\over 2} \log (1\!+\!p ) \!+\! \sqrt{t} \!+\!{ L t\over 2} \biggr ] \right)\le 2 \exp [-t] . \end{aligned}$$

Application of Lemma 2 finishes the proof. \(\square \)

4 Chaining along a tree

A common technique for bounding suprema of stochastic processes is chaining as developed by Kolmogorov, leading to versions of Dudley’s entropy bound ([6]). See e.g. [14] or [13] and the references therein. We however propose another method which we call chaining along a tree. This method is conceptually simpler than the usual chaining and, as far as we know, does not introduce unnecessary restrictions. An example will be detailed in Sect. 5 for the case of entropy with bracketing. The generic chaining technique of [12] is a refinement of the usual chaining which we shall also consider in Definition 6 and Theorem 3.

Let \(S \in \mathbb N _0\) be fixed.

Definition 3

A finite tree \(\mathcal{T} \) is a partition \(\{ G_s \}_{s=0}^S \) of \(\{ 1 , \ldots , N\} \) together with a function

$$\begin{aligned} \mathrm{parent} : \{ 1 , \ldots , N\} \rightarrow \{ 1 , \ldots , N\} , \end{aligned}$$

such that \(\mathrm{parent } (j) \in G_{s-1}\) for \(j \in G_s \), \(s =1 , \ldots , S \). We call an element of \(\{ 1 , \ldots , N\}\) a node and \(G_s\) a generation, \(s=0 , \ldots , S\). A branch with end node \(j_S \in G_S\) is the sequence \(\{ j_0 , \ldots , j_S\} \) with \(j_{s-1} = \mathrm{parent}({j_s})\), \(s=1, \ldots , S\).

Definition 4

Let a collection of real-valued random variables \(\mathcal{W }:= \{ W_j \}_{j=1}^N\) be given. A finite labeled tree \((\mathcal{T}, \mathcal{W})\) is a finite tree with a label \(W_j \) attached to each node \(j\).

Let \(\Theta \) be some countable set and let \(Z_{\theta } \in \mathbb R \) be a random variable defined for each \(\theta \in \Theta \). We consider the supremum of the process \(\{ |Z_{\theta } | : \ \theta \in \Theta \} \).

Definition 5

Let \(\delta >0\) and \(\tau >0\) be constants and let \(\mathcal{L} := \{ L_s \}_{s=0}^S\) be a sequence of positive numbers. A \((\delta ,\tau , \mathcal{L} )\) finite tree chain for \(\{ Z_{\theta } \}\) is a finite labeled tree \((\mathcal{T} ,\mathcal{W} )\) such that for all \(s=0 , \ldots , S\),

$$\begin{aligned} \Vert W_{j} \Vert _{\varPsi _{L_s}} \le \tau 2^{-s} , \quad j \in G_s, \end{aligned}$$

and such that one can apply chaining of \(\{ Z_{\theta } \}\) along the tree \((\mathcal{T} , \mathcal{W})\), with approximation error \(\delta \). That is, for each \(\theta \in \Theta \) there is an end node \(j_S \in G_S\) such that the branch \(\{ j_0 , \ldots , j_S \}\) satisfies

$$\begin{aligned} | Z_{\theta }| \le \sum _{s=0}^S | W_{j_s} | + \delta . \end{aligned}$$

In the above definition, the approximation error \(\delta \) will generally depend on the depth \(S\) of the tree. We assume that at a fine enough level, the approximation error is small. The usual chaining technique does not assume a tree structure, but indeed often needs only a finite number of steps. A tree structure follows if the members at the finest level are taken as end nodes. With a finite number of steps, the sum given in (3) is finite. This avoids requiring convergence of an infinite sum.
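For concreteness, here is a minimal sketch of Definitions 3–5 (the generations, parent map, labels \(W_j\) and approximation error \(\delta \) are all made up): a finite tree is stored as its generations together with a parent map, and the chaining bound is evaluated along each branch.

```python
# Sketch of a finite labeled tree (Definitions 3-5) with made-up data.
# Generations G_0, ..., G_S partition the nodes; here S = 2.
generations = [[0], [1, 2], [3, 4, 5, 6]]
parent = {1: 0, 2: 0, 3: 1, 4: 1, 5: 2, 6: 2}   # parent(j) lies in the previous generation

def branch(end_node):
    """Follow parent pointers from an end node back to the root; returns j_0, ..., j_S."""
    path = [end_node]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path[::-1]

# Chaining along the tree: |Z_theta| <= sum_s |W_{j_s}| + delta for the branch
# assigned to theta. The labels W_j and delta below are placeholder values.
W = {0: 0.9, 1: 0.5, 2: 0.4, 3: 0.3, 4: 0.2, 5: 0.25, 6: 0.1}
delta = 0.05
for k in generations[-1]:
    bound = sum(abs(W[j]) for j in branch(k)) + delta
    print(f"end node {k}: branch {branch(k)}, chaining bound {bound:.2f}")
```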

We have presented the definition of a finite tree chain for the Bernstein–Orlicz norm \(\Vert \cdot \Vert _{\varPsi _L}\). However, the concept is not particularly tied to this norm, e.g., for sub-Gaussian cases one may choose to replace the Bernstein–Orlicz norm by the \(L_2 (\mathbb P )\) norm (corresponding to the case where the constants in \(\mathcal{L}\) all vanish): see Lemma 14.35 in [5].

Let us now turn to the results.

Theorem 2

Let \((\mathcal{T},\mathcal{W} )\) be a \((\delta ,\tau , \mathcal{L} )\) finite tree chain for \(\{ Z_{\theta } \}\). Define

$$\begin{aligned} \gamma := \tau \sum _{s=0}^S 2^{-s} \biggl [ \sqrt{\log (1+ |G_s| ) } + {L_s \over 2} \log (1+ |G_s| ) \biggr ] . \end{aligned}$$
(3)

It holds that

$$\begin{aligned} \mathbb E \sup _{\theta \in \Theta } | Z_{\theta } | \le \gamma + \delta . \end{aligned}$$
(4)

Remark 1

One may minimize the right hand side of (4) over all finite trees.

Proof of Theorem 2

We have

$$\begin{aligned} \mathbb E \sup _{\theta \in \Theta } |Z_{\theta } | \le \sum _{s=0}^S \mathbb E \max _{j \in G_s } | W_j |+ \delta . \end{aligned}$$

Application of Lemma 3 gives that for each \(s \in \{ 0 , \ldots , S \} \)

$$\begin{aligned} \mathbb E \max _{j \in G_s } |W_j| \le \tau 2^{-s} \biggl [ \sqrt{\log (1+ |G_s| ) } + { L_s \over 2} \log (1+ | G_s | ) \biggr ] . \end{aligned}$$

\(\square \)

With generic chaining, the condition on the Bernstein–Orlicz norm of the labels is dropped in the definition of the tree. This Bernstein–Orlicz norm then turns up in the constants (5) and (6) which appear in the generic chaining bound of Theorem 3.

Definition 6

Let \(\delta >0\) be a constant. A \(\delta \) finite generic tree chain for \(\{ Z_{\theta } \}\) is a finite labeled tree \((\mathcal{T} ,\mathcal{W} )\) such that one can apply generic chaining of \(\{ Z_{\theta } \}\) along the tree \((\mathcal{T} , \mathcal{W})\) with approximation error \(\delta \). That is, for each \(\theta \in \Theta \) there is an end node \(j_S \in G_S\) such that the branch \(\{ j_0 , \ldots , j_S \}\) satisfies

$$\begin{aligned} | Z_{\theta }| \le \sum _{s=0}^S | W_{j_s} | + \delta . \end{aligned}$$

Let \((\mathcal{T} ,\mathcal{W} )\) a finite labeled tree. For each end node \(k \in G_S\), we let

$$\begin{aligned} \{ j_0 (k) , \ldots , j_S(k)\} \end{aligned}$$

be the corresponding branch (so that \(j_S (k) =k \)), and we write

$$\begin{aligned} W_s(k):= W_{j_s(k)} , \quad k \in G_S , \ s=0, 1 , \ldots , S . \end{aligned}$$

Fix a sequence of positive constants \(\mathcal{L} := \{ L_s \}_{s=0}^S \). We write for \(k \in G_S\),

$$\begin{aligned} \gamma _{1,*} (k)&:= \sum _{s=0}^S \Vert W_s (k) \Vert _{\varPsi _{L_s}} \sqrt{\log (1+ | G_s | ) } ,\end{aligned}$$
(5)
$$\begin{aligned} \gamma _{2,*} (k)&:= \sum _{s=0}^S \Vert W_s (k) \Vert _{\varPsi _{L_s}} { L_s } \log (1+ | G_s | ) , \end{aligned}$$
(6)
$$\begin{aligned} \gamma _*(k) := \gamma _{1,*} (k)+ { \gamma _{2,*}(k) \over 2} . \end{aligned}$$

Moreover, we let

$$\begin{aligned} \gamma _{1,*} := \max _{k \in G_S} \gamma _{1,*} (k) , \quad \gamma _{2,*} := \max _{k \in G_S} \gamma _{2,*} (k) , \quad \gamma _{*} := \max _{k \in G_S} \gamma _{*} (k) , \end{aligned}$$

and

$$\begin{aligned} \tau _* := \max _{k \in G_S } \sum _{s=0}^S \Vert W_s (k) \Vert _{\varPsi _{L_s}} \sqrt{1+s} , \end{aligned}$$

and

$$\begin{aligned} L_* \tau _* := \max _{k \in G_S } \sum _{s=0}^S \Vert W_s (k) \Vert _{\varPsi _{L_s}} (1+s) L_s . \end{aligned}$$
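The quantities (5), (6), \(\tau _*\) and \(L_*\) are elementary to compute for a given labeled tree; a small sketch (the branch norms \(\Vert W_s(k) \Vert _{\varPsi _{L_s}}\), generation sizes and constants \(L_s\) below are made up):

```python
# Sketch: the generic chaining constants gamma_*, tau_* and L_* of a labeled
# tree, computed from made-up branch norms ||W_s(k)||_{Psi_{L_s}}.
import math

G_sizes = [1, 2, 4]                          # |G_0|, |G_1|, |G_2|
L_seq = [0.5, 0.25, 0.125]                   # the constants L_s
norms = {3: [1.0, 0.5, 0.25], 4: [1.0, 0.5, 0.2],
         5: [1.0, 0.4, 0.25], 6: [1.0, 0.4, 0.2]}   # end node k -> (||W_s(k)||)_s

def per_branch(ws):
    g1 = sum(w * math.sqrt(math.log(1 + g)) for w, g in zip(ws, G_sizes))        # (5)
    g2 = sum(w * L * math.log(1 + g) for w, L, g in zip(ws, L_seq, G_sizes))     # (6)
    tau = sum(w * math.sqrt(1 + s) for s, w in enumerate(ws))
    ltau = sum(w * (1 + s) * L for s, (w, L) in enumerate(zip(ws, L_seq)))
    return g1, g2, tau, ltau

g1s, g2s, taus, ltaus = zip(*(per_branch(ws) for ws in norms.values()))
gamma_star = max(g1 + g2 / 2.0 for g1, g2 in zip(g1s, g2s))
tau_star = max(taus)
L_star = max(ltaus) / tau_star               # via L_* tau_* = max_k sum_s ...
print(f"gamma_* = {gamma_star:.3f}, tau_* = {tau_star:.3f}, L_* = {L_star:.3f}")
```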

Theorem 3

Let \((\mathcal{T} ,\mathcal{W} )\) be a \(\delta \) finite generic tree chain for \(\{ Z_{\theta } \} \). Then for all \(t >0\)

$$\begin{aligned} \mathbb P \biggl ( \sup _{\theta \in \Theta } | Z_{\theta } | \ge \gamma _* + \delta + \tau _* \biggl [1+ {L_* \over 2} \biggr ] + \tau _* \biggl [ \sqrt{t} + {L_* t \over 2} \biggr ] \biggr ) \le 2 \exp [-t] . \end{aligned}$$

Remark 2

The result of Theorem 3 may again be optimized over all finite generic trees.

Proof of Theorem 3

Define for \(s=0 , \ldots , S\),

$$\begin{aligned} \alpha _s&:= \biggl [ \sqrt{\log (1+ |G_s| ) } + {L_s \over 2} \log (1+ |G_s| ) \biggr ] \\&\quad+ \biggl [ \sqrt{(1+s)(1+t) } + {(1+s)(1+t) L_s \over 2}\biggr ] . \end{aligned}$$

Using Lemma 4, we see that

$$\begin{aligned} \mathbb P \biggl ( \max _{j \in G_s } { |W_j | \over \Vert W_j \Vert _{\varPsi _{L_s} } } \ge \alpha _s \biggr ) \le 2 \exp [-(1+t)(1+s) ] , \quad s=0 , \ldots , S . \end{aligned}$$
(7)

We have

$$\begin{aligned}&\mathbb P \biggl (\max _{k } \sum _{s=0}^S | W_{s} (k)| \ge \max _{k } \sum _{s=0}^S\Vert W_s(k) \Vert _{\varPsi _{L_s}} \alpha _s \biggr )\\&\quad \le \mathbb P \biggl (\exists k : \ \sum _{s=0}^S | W_{s} (k)| \ge \sum _{s=0}^S \Vert W_s(k) \Vert _{\varPsi _{L_s}} \alpha _s \biggr )\\&\quad \le \sum _{s=0}^{S} \mathbb P \biggl (\exists k : \ |W_{s} (k)| \ge \Vert W_s(k) \Vert _{\varPsi _{L_s}} \alpha _s \biggr )\\&\quad = \sum _{s=0}^S \mathbb P \biggl (\max _k { | W_{s} (k)| \over \Vert W_s(k) \Vert _{\varPsi _{L_s}} }\ge \alpha _s \biggr ) \le \sum _{s=0}^S \mathbb P \biggl ( \max _{j \in G_s } { |W_j | \over \Vert W_j \Vert _{\varPsi _{L_s}} } \ge \alpha _s \biggr ) . \end{aligned}$$

Now insert (7) to find

$$\begin{aligned}&\mathbb P \biggl (\max _k \sum _{s=0}^S |W_{s} (k) | \ge \max _k \sum _{s=0}^S \Vert W_s (k) \Vert _{\varPsi _{L_s}} \alpha _s \biggr ) \\&\quad \le 2 \sum _{s=0}^S \exp [-(1+t)(1+s) ]\le {2 \mathrm{e} ^{-(1+t)} \over 1- \mathrm{e} ^{-(1+t) } } \le {2 \mathrm{e} ^{-1} \over 1- \mathrm{e} ^{-1 } } \exp [-t] \le 2 \exp [-t] . \end{aligned}$$

We have by definition

$$\begin{aligned}&\displaystyle \max _k \sum _{s=0}^S \Vert W_s(k) \Vert _{\varPsi _{L_s}} \biggl [ \sqrt{\log (1+ |G_s| ) } + {L_s \over 2} \log (1+ |G_s| ) \biggr ]=\gamma _*,\\&\displaystyle \max _k \sum _{s=0}^S \Vert W_s(k) \Vert _{\varPsi _{L_s}} \sqrt{(1+s)} =\tau _* , \end{aligned}$$

and

$$\begin{aligned} \max _k \sum _{s=0}^S \Vert W_s(k) \Vert _{\varPsi _{L_s}} (1+s) L_s =\tau _* L_*. \end{aligned}$$

Therefore,

$$\begin{aligned} \max _k \sum _{s=0}^S \Vert W_s(k) \Vert _{\varPsi _{L_s}}\alpha _s&\le \gamma _* + \tau _* \sqrt{1+t} + { \tau _* (1+t) L_* \over 2}\\&\le \gamma _* + \tau _* + {\tau _* L_* \over 2} + \tau _* \biggl [ \sqrt{t} + {L_*t \over 2} \biggr ] . \end{aligned}$$

Since by the definition of a generic tree chain \(\sup _{\theta \in \Theta } |Z_{\theta } | \le \max _k \sum _{s=0}^S |W_s(k)| + \delta \), the result follows. \(\square \)

Note that the constants \(L_*\) and \(\tau _*\) possibly depend on the complexity of \(\Theta \) through the quantities \(\{ \Vert W_s (k) \Vert _{\varPsi _{L_s} } : \ k \in G_s , \ s=0, \ldots , S \} \). Moreover, the choice of the constants \(\mathcal{L} = \{ L_s \}_{s=0}^S \) may also depend on the complexity of \(\Theta \). In the application to the empirical process (see Sect. 5), the latter will indeed be the case. We will nevertheless derive there a deviation inequality where we put the dependency on the complexity of \(\Theta \) in the shift.

As a simple corollary of Theorem 3, one obtains a deviation inequality in the Bernstein–Orlicz norm. We state this for completeness. In Sect. 5 we will not apply Corollary 2 directly, because as such, it does not allow us to put all dependency on the complexity of \(\Theta \) in the shift.

Corollary 2

Let the conditions of Theorem 3 be met. Then the combination of this theorem with Lemma 2 gives

$$\begin{aligned} \biggl \Vert \biggl ( \sup _{\theta \in \Theta } | Z_{\theta } | - ( \gamma _* + \delta + \tau _* [1+ L_* / 2 ] ) \biggr )_+ \biggr \Vert _{\varPsi _{\sqrt{3} L_*}} \le \sqrt{3} \tau _* . \end{aligned}$$

By Jensen’s inequality, we then get

$$\begin{aligned} \mathbb E \sup _{\theta \in \Theta } | Z_{\theta } | \le \gamma _* + \delta + \tau _* \biggl [1+ { L_* \over 2} \biggr ] + \sqrt{3} \tau _* \biggl [ \sqrt{\log 2} + {\sqrt{3} L_* \over 2} \log 2 \biggr ] . \end{aligned}$$

Example 1

In [12], the size \(|G_s|\) of generation \(s\) is fixed to be

$$\begin{aligned} | G_s |= 2^{2^{2s}} , \quad s=0 , \ldots , S . \end{aligned}$$

In that case,

$$\begin{aligned} \log (1+ |G_s | ) \le (2^{2s} +1 ) \log 2 \le 2^{2s+1} \le 2^{2(s+1)}. \end{aligned}$$

Hence

$$\begin{aligned} \gamma _* \le 2 \gamma _0 , \end{aligned}$$

where

$$\begin{aligned} \gamma _0 := \max _{k \in G_S} \gamma _0 (k) , \end{aligned}$$

and for \(k \in G_S\),

$$\begin{aligned} \gamma _{0} (k) := \gamma _{1,0} (k) + { \gamma _{2,0} (k) \over 2} , \end{aligned}$$

and

$$\begin{aligned} \gamma _{1,0} (k) := \sum _{s=0}^S \Vert W_s (k) \Vert _{\varPsi _{L_s}} 2^s ,\quad \gamma _{2,0} (k) := \sum _{s=0}^S \Vert W_s(k) \Vert _{\varPsi _{L_s}} L_s 2^{2s} . \end{aligned}$$

Furthermore, since \(1+s \le 2^{2s} \) for all \(s \ge 0\),

$$\begin{aligned} \tau _* \le \gamma _{1,0} := \max _{k \in G_S} \gamma _{1,0} (k) , \end{aligned}$$

and

$$\begin{aligned} \tau _* L_* \le \gamma _{2,0} := \max _{k \in G_S} \gamma _{2,0} (k) . \end{aligned}$$

Hence,

$$\begin{aligned} \gamma _* + \tau _* \biggl [1+ {L_* \over 2}\biggr ] \le 3 \biggl [ \gamma _{1,0} + {\gamma _{2,0} \over 2} \biggr ] , \end{aligned}$$

and

$$\begin{aligned} \sqrt{3}\tau _* \biggl [ \sqrt{\log 2} + {\sqrt{3} L_* \over 2} \log 2 \biggr ] \le \sqrt{3 \log 2} \ \gamma _{1,0} + { 3 \log 2 \over 2} \gamma _{2,0} . \end{aligned}$$

It follows from Corollary 2 that

$$\begin{aligned} \mathbb E \sup _{\theta \in \Theta } | Z_{\theta } | \le (3+ \sqrt{3\log 2} ) \gamma _{1,0} + {3+ 3 \log 2 \over 2} \gamma _{2,0} + \delta .\end{aligned}$$

Thus, we arrive at a special case of Theorem 1.2.7 in [12].

When using a \((\delta , \tau , \mathcal{L} )\) finite tree chain, one takes \(\Vert W_s (k) \Vert _{\varPsi _{L_s}} \le \tau 2^{-s} \) for all \(s\) and \(k \in G_s\). In that case, the constants \(\tau _*\) and \(L_*\) in the bounds given in Corollary 2 only depend on the scale parameter \(\tau \) and on the constants \(\mathcal{L} = \{ L_s \}_{s=0}^S \). This is detailed in the next theorem.

Theorem 4

Let the conditions of Theorem 2 be met, and define

$$\begin{aligned} \gamma := \tau \sum _{s=0}^S2^{-s} \biggl [ \sqrt{\log (1+ |G_s| ) } + {L_s \over 2} \log (1+ |G_s| ) \biggr ] \end{aligned}$$

and

$$\begin{aligned} L:= \sum _{s=0}^S { 2^{-s} L_s (1+s) \over 4}. \end{aligned}$$

Then for all \(t>0\)

$$\begin{aligned} \mathbb P \biggl ( \sup _{\theta \in \Theta } |Z_{\theta } | \ge \gamma + \delta +4\tau \biggl [ 1 + {L \over 2} \biggr ] + 4 \tau \biggl [ \sqrt{t} + {Lt \over 2} \biggr ]\biggr ) \le 2 \exp [-t] . \end{aligned}$$

Proof of Theorem 4

This follows from Theorem 3, where one takes

$$\begin{aligned} \Vert W_s (k) \Vert _{\varPsi _{L_s}} \le \tau 2^{-s} . \end{aligned}$$

We have

$$\begin{aligned} \tau _* /\tau \le \sum _{s=0}^S 2^{-s} \sqrt{(1+s)} =2 \sum _{s=1}^{S+1} 2^{-s} \sqrt{s} \le 2 \int _0^{\infty } 2^{-x} \sqrt{x} dx = {\sqrt{\pi }\over (\log 2)^{3/2} } \le 4 . \end{aligned}$$

Moreover,

$$\begin{aligned} L_* \tau _* / \tau \le \sum _{s=0}^S 2^{-s} L_s (1+s) = 4L . \end{aligned}$$

\(\square \)
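The two numerical constants appearing in this proof are easily verified; a quick check (not needed for the argument) of this series bound and of the analogous one used later in the proof of Lemma 8:

```python
# Numeric check of sum_{s>=0} 2^{-s} sqrt(1+s) <= sqrt(pi)/(log 2)^{3/2} <= 4
# and of sum_{s>=1} 2^{-s} (1+s) <= 2/(log 2)^2 (used in Lemma 8).
import math

s1 = sum(2.0 ** (-s) * math.sqrt(1.0 + s) for s in range(60))
s2 = sum(2.0 ** (-s) * (1.0 + s) for s in range(1, 60))
print(s1, "<=", math.sqrt(math.pi) / math.log(2.0) ** 1.5, "<=", 4.0)
print(s2, "<=", 2.0 / math.log(2.0) ** 2)
```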

5 Application to empirical processes

Let \(\mathcal{X}\) be some measurable space, and consider independent \(\mathcal{X}\)-valued random variables \(X_1 , \ldots , X_n\). Let \(\mathcal{G}\) be a collection of real-valued functions on \(\mathcal{X}\).

Write for a real valued function \(g\)

$$\begin{aligned} P_n g := {1 \over n} \sum _{i=1}^n g(X_i) \end{aligned}$$

and

$$\begin{aligned} Pg := {1 \over n} \sum _{i=1}^n \mathbb E g(X_i), \ \Vert g \Vert ^2 := {1 \over n} \sum _{i=1}^n \mathbb E g^2 (X_i) \end{aligned}$$

whenever the expectations exist. We assume the normalization

$$\begin{aligned} \sup _{g \in \mathcal{G} } \Vert g \Vert \le 1 . \end{aligned}$$

We study the supremum of the empirical process \( \{ \nu _n (g) : \ g \in \mathcal{G} \} \), where \( \nu _n (g) := \sqrt{n} (P_n - P)g \).
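For orientation, a minimal sketch of these quantities (the uniform design and the two functions are arbitrary choices for which \(Pg\) is available in closed form and the normalization \(\sup _{g} \Vert g \Vert \le 1 \) holds):

```python
# Sketch: the empirical process nu_n(g) = sqrt(n) (P_n - P) g for a small
# class of functions on [0,1], with X_i ~ Uniform[0,1] so that Pg is explicit.
import math
import random

random.seed(0)
n = 1000
X = [random.random() for _ in range(n)]
G = {"g1": (lambda x: x, 0.5),                 # (g, Pg); ||g||^2 = 1/3 <= 1
     "g2": (lambda x: x * x, 1.0 / 3.0)}       # ||g||^2 = 1/5 <= 1

for name, (g, Pg) in G.items():
    Pn_g = sum(g(x) for x in X) / n
    print(f"{name}: nu_n(g) = {math.sqrt(n) * (Pn_g - Pg):+.3f}")
```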

We recall the deviation inequality of [9], which presents explicit and refined constants for results in [11].

Theorem 5

[9] Suppose that for a constant \(K\)

$$\begin{aligned} \sup _{g \in \mathcal{G} } \sup _{x \in \mathcal{X} } | g (x) | \le K . \end{aligned}$$
(8)

Then for all \(\epsilon >0\) and all \(t>0\), it holds that

$$\begin{aligned} \mathbb P \left( \sup _{g \in \mathcal{G}} | \nu _n (g) | \ge (1+ \epsilon ) \mathbb E \sup _{g \in \mathcal{G}} | \nu _n (g) |+ \sqrt{2 \kappa t} + \kappa (\epsilon ) K t /\sqrt{n} \right) \le \exp [-t] , \end{aligned}$$
(9)

where the values of \(\kappa \) and \(\kappa (\epsilon )\) can be chosen to be \(\kappa =4\) and \(\kappa (\epsilon )= 2.5+ 32/\epsilon \).

For the i.i.d. case, [4] obtained constants remarkably close to those for the case where \(\mathcal{G}\) is a singleton. In fact, [9] and others have derived concentration inequalities which in addition to upper bounds show similar lower bounds for the supremum of the empirical process. This is complemented in [8] by moment concentration inequalities assuming only moment conditions on the envelope \(\Gamma ( \cdot ) := \sup _{g \in \mathcal{G}} |g(\cdot ) |\), instead of the boundedness assumption (8).

In this paper, we provide a deviation inequality of the same spirit as in the above Theorem 5, where we replace condition (8) by a weaker Bernstein condition (see (11)), which essentially requires that the \(g(X_i)\) have sub-exponential tails, and where we also present a deviation result in Bernstein–Orlicz norm. These deviation results in probability and in Bernstein–Orlicz norm are given in Theorem 8. We have not tried to optimize the constants. Moreover, we replace the expectation \(\mathbb E \sup _{g \in \mathcal{G} }| \nu _n (g) |\) in (9) by the upper bound we obtain from chaining arguments. Deviation inequalities for the sub-exponential case can be found in the literature (see e.g. [15]), but these do not cover the more refined interpolation of sub-Gaussian and sub-exponential tail behavior. The above cited work also contains lower bounds for suprema, thus completing the results to concentration inequalities.

Now our first aim is to show that entropy with bracketing conditions allow one to construct a finite tree chain. We recall here the definition of a bracketing set and entropy with bracketing (see [3], or see [13, 14] and their references).

Definition 7

Let \(s>0\) be arbitrary. A \(2^{-s}\)-bracketing set for \(\{ \mathcal{G}, \Vert \cdot \Vert \} \) is a finite collection of functions \(\{ [\tilde{g}_j^L, \tilde{g}_j^U ] \}_{j=1}^{\tilde{N}_s} \) satisfying \(\Vert \tilde{g}_j^U - \tilde{g}_j^L \Vert \le 2^{-s} \) for all \(j\) and such that for each \(g \in \mathcal{G}\) there is a \(j \in \{ 1 , \ldots , \tilde{N}_s \} \) such that \(\tilde{g}_j^L \le g \le \tilde{g}_j^U \). If no such finite collection exists, we write \(\tilde{N}_s = \infty \).

We also introduce a generalized bracketing set, in the spirit of [13].

Definition 8

Let \(K>0\) be a fixed constant. A generalized bracketing set for \( \mathcal{G} \) is a finite collection of functions \(\{ [\tilde{g}_j^L,\tilde{g}_j^U ] \}_{j=1}^{\tilde{N}_0} \) satisfying for all \(j\)

$$\begin{aligned} P| \tilde{g}_j^U - \tilde{g}_j^L |^m \le {m! \over 2} (2K)^{m-2} , \quad m=2,3, \ldots \end{aligned}$$

and such that for each \(g \in \mathcal{G}\) there is a \(j \in \{ 1 , \ldots , { \tilde{N}}_0 \} \) such that \(\tilde{g}_j^L \le g \le \tilde{g}_j^U \). Write \(\tilde{N}_0= \infty \) if no such finite collection exists.

A special case is where the envelope function \(\Gamma := \sup _{g \in \mathcal{G}} | g| \) satisfies the Bernstein condition

$$\begin{aligned} P\Gamma ^m \le {m! \over 2} (2K)^{m-2} ,\quad m=2,3, \ldots . \end{aligned}$$

Then one can take \([ - \Gamma , \Gamma ]\) as a generalized bracketing set, consisting of only one element.

In what follows we let, for each \(s \in \mathbb N \), \(\tilde{N}_s\) be the cardinality of a minimal \(2^{-s}\)-bracketing set for \(\mathcal{G}\). The \(2^{-s} \)-entropy with bracketing of \(\mathcal{G}\) is

$$\begin{aligned} \tilde{H}_s := \log (1+ \tilde{N}_s) , \quad s \in \mathbb N . \end{aligned}$$

Moreover, \(\tilde{N}_0 \) is the cardinality of a minimal generalized bracketing set, and we let

$$\begin{aligned} { \tilde{H}}_0 := \log (1+ { \tilde{N}}_0). \end{aligned}$$

Finally, we write

$$\begin{aligned} N_s := \prod _{k=0}^s \tilde{N}_k , \ H_s := \log (1+ N_s ) , \quad s \in \mathbb N _0 . \end{aligned}$$
(10)

The following theorem uses arguments of [10] and is comparable to Theorem 2.7.11 in [12] (which adapts the technique of [10]). However, unlike [12], we apply the usual chaining technique and not generic chaining here. On the other hand, our results lead to the more involved deviation inequalities as given in Theorem 8.

Theorem 6

Suppose that, for some constant \(K \ge 1\), one has the Bernstein condition

$$\begin{aligned} \sup _{g \in \mathcal{G}} P| g|^m \le {m! \over 2} K^{m-2} , \quad m=2,3, \ldots . \end{aligned}$$
(11)

Let \(S\) be some integer, \(\tau := 3 \sqrt{6}\) and \(\delta : = 4 \sqrt{n} \sum _{s=1}^{S} 2^{-{2s} } / K_{s-1} +\sqrt{n} 2^{-S} \), where \(\{K_{s-1} \}_{s=1}^S \) is an arbitrary decreasing sequence of positive constants (called truncation levels). Suppose that \(\tilde{N}_s < \infty \) for all \(s=0, \ldots , S\). Then there is a \((\delta , \tau , \mathcal{L} )\) finite tree chain for \(\{ \nu _n (g) \}\) with \(|G_s | \le N_s\), \(s=0, \ldots , S\), and with

$$\begin{aligned} L_0 = { 4 \sqrt{6} K \over \sqrt{n}} , \ L_s = { 2 \sqrt{6} \ 2^s K_{s-1} \over 3 \sqrt{n} } , \ s=1 , \ldots , S . \end{aligned}$$

As a consequence, we can derive a bound for the expectation of the supremum of the empirical process.

Theorem 7

Assume the Bernstein condition (11). Let

$$\begin{aligned} \bar{\mathbf{E}}_S := 2^{-S} \sqrt{n} + 14 \sum _{s=0}^S 2^{-s} \sqrt{6 \tilde{H}_s } + 6^2 K { \tilde{H}_0 \over \sqrt{n} } . \end{aligned}$$

Then one has

$$\begin{aligned} \mathbb E \biggl ( \sup _{g \in \mathcal{G} } |\nu _n (g) | \biggr ) \le \min _{S } \bar{\mathbf{E}}_S . \end{aligned}$$

Remark 3

When \(\Theta \) is finite, say \(|\Theta |=p\), one may choose \(S=\delta =0\) and \(\tilde{H}_0 \le \log (1+ p)\). Theorem 7 then yields, up to constants, the same bound as in (2).
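The bound of Theorem 7 is straightforward to evaluate once the entropies are specified; a small sketch (the entropy \(\tilde{H}_s\) of parametric type as well as \(n\), \(K\) and the range of \(S\) are hypothetical choices) minimizing \(\bar{\mathbf{E}}_S\) over \(S\):

```python
# Sketch: evaluate E_bar(S) of Theorem 7 and minimize over S, for the
# hypothetical bracketing entropies H_tilde_s = V (1+s) log 2 with V = 5.
import math

n, K, V = 10_000, 2.0, 5.0

def H(s):
    return V * (1.0 + s) * math.log(2.0)

def E_bar(S):
    return (2.0 ** (-S) * math.sqrt(n)
            + 14.0 * sum(2.0 ** (-s) * math.sqrt(6.0 * H(s)) for s in range(S + 1))
            + 36.0 * K * H(0) / math.sqrt(n))

best_S = min(range(30), key=E_bar)
print(f"min_S E_bar(S) = {E_bar(best_S):.2f}, attained at S = {best_S}")
```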

Finally, we present the main result of this section. We give deviation results in probability and in Bernstein–Orlicz norm, where the dependency on the complexity of \(\mathcal{G}\) is only in the shift.

Theorem 8

Assume the Bernstein condition (11). Define as in Theorem 7,

$$\begin{aligned} \bar{\mathbf{E}}_S := 2^{-S} \sqrt{n} + 14 \sum _{s=0}^S 2^{-s} \sqrt{6 \tilde{H}_s } + 6^2 K { \tilde{H}_0 \over \sqrt{n} } . \end{aligned}$$

Let

$$\begin{aligned} \tilde{L}:= { \sqrt{6} K \over 2 \sqrt{n} } . \end{aligned}$$

Then for all \(t>0\),

$$\begin{aligned} \mathbb P \biggl ( \sup _{g \in \mathcal{G}} | \nu _n (g) | \ge \min _S \bar{ \mathbf E}_S + 6^2 K / \sqrt{n} + 24 \sqrt{6} + 24 \sqrt{6} \biggl [ \sqrt{t} + { \tilde{L} t \over 2} \biggr ] \biggr ) \le 2 \exp [-t] . \end{aligned}$$

Moreover,

$$\begin{aligned} \biggl \Vert \biggl ( \sup _{g \in \mathcal{G}} | \nu _n (g)| - \biggl [ \min _{S} \bar{\mathbf{E}}_S + 6^2 K /\sqrt{n} + 24 \sqrt{6} \biggr ] \biggr )_+ \biggr \Vert _{\varPsi _{\sqrt{3} \tilde{L}} } \le 72 \sqrt{2} . \end{aligned}$$

Theorem 8 can be compared to results in [1]. Taking \(\varPsi (z) = \exp (z) -1\), \( z \ge 0\), one sees that our bound replaces the sub-exponential Orlicz-norm

$$\begin{aligned} \biggl \Vert \max _{1 \le i \le n} \sup _{g \in \mathcal{G} } | g(X_i) |\biggr \Vert _{\varPsi } , \end{aligned}$$

occurring in [1] by a constant proportional to \(K\), which means we generally gain a \(\log n \)-term. On the other hand, the shift in [1] is up to a factor \((1+\epsilon )\) equal to the expectation

$$\begin{aligned} \mathbb E \sup _{g \in \mathcal{G} } | \nu _n (g) | , \end{aligned}$$

as in [9] (whose result is cited here in Theorem 5).

Remark 4

Again, when \(|\Theta |= p\) is finite, one can choose \(S=\delta =0\), and \(\tilde{H}_0 \le \log (1+p)\) as in Remark 3. Theorem 8 then reduces to the usual union bound type deviation inequalities for the maximum of finitely many random variables (that is, the results are—up to constants—a special case of Lemmas 4 and 5).

6 Proofs for Sect. 5

6.1 Proof of Theorem 6

This follows from arguments similar to those in [13], which in turn uses ideas of [10]. Let for \(s=1, \ldots , S\),

$$\begin{aligned} \{ [ \tilde{g}_j^{s,L} , \tilde{g}_j^{s,U} ]\}_{j=1}^{\tilde{N}_s} \end{aligned}$$

be a minimal \(2^{-s }\)-bracketing set for \(\{ \mathcal{G} , \Vert \cdot \Vert \} \). Let \(\{ [\tilde{g}_j^{0,L}, \tilde{g}_j^{0,U} ] \}_{j=1}^{{\tilde{N}}_0} \) be a generalized bracketing set.

Consider some \(g\in \mathcal{G}\), and let \( [\tilde{g}^{0,L} , \tilde{g}^{0,U} ] \) be the corresponding generalized bracket, and for all \(s\in \{1 , \ldots , S \} \), let the corresponding brackets be \([\tilde{g}^{s,L} , \tilde{g}^{s,U} ]\). Thus

$$\begin{aligned} \tilde{g}^{s,L} \le g \le \tilde{g}^{s,U} , \quad s=0, \ldots , S, \end{aligned}$$

and

$$\begin{aligned}&\displaystyle P | \tilde{g}^{0,U} - \tilde{g}^{0,L} |^m \le { m! \over 2} (2K)^{m-2}, \quad m=2,3, \ldots ,\\&\displaystyle P | \tilde{g}^{s,U} - \tilde{g}^{s,L} |^2\le 2^{-2s} , \quad s=1, \ldots , S . \end{aligned}$$

If for some \(s\) there are several brackets in \(\{ [ \tilde{g}_j^{s,L}, \tilde{g}_j^{s,U} ] \}_{j=1}^{\tilde{N}_s}\) corresponding to \(g\), we choose a fixed but otherwise arbitrary one. Define

$$\begin{aligned} g^{s,L} := \max _{ 0 \le k \le s} \tilde{g}^{k,L} , \quad g^{s,U} := \min _{ 0 \le k \le s} \tilde{g}^{k,U}. \end{aligned}$$

Then

$$\begin{aligned} g^{0,L} \le g^{1,L} \le \cdots \le g^{S,L} \le g \le g^{S,U} \le \cdots \le g^{1,U} \le g^{0,U} , \end{aligned}$$

and moreover \( g^{s,U} - g^{s,L} \le \tilde{g}^{s,U} - \tilde{g}^{s,L} \). Denote the difference between upper and lower bracket by

$$\begin{aligned} \Delta ^s := g^{s,U}-g^{s,L} , \quad s=0 , \ldots , S. \end{aligned}$$

The differences \(\Delta ^s\) are decreasing in \(s\). Furthermore, \(\Vert \Delta ^s \Vert \le 2^{-s} \), for all \(s\in \{ 0,1 , \ldots , S \} \).

Let \(\mathcal{N}_s := | \{ [ g_j^{s,L}, g_j^{s,U}] \} | \), \(s=0, \ldots , S\). It is easy to see that

$$\begin{aligned} \mathcal{N}_s \le \prod _{k=0}^s \tilde{N}_k =: N_s, \quad s=0, \ldots , S. \end{aligned}$$

We define a tree with end nodes \(\{ 1 , \ldots , \mathcal{N}_S\}\). At each end node \(j\) sits a pair of brackets \([g_j^{S,L} , g_j^{S,U} ]\). For each \(s=1 , \ldots , S\), we define the parents of the nodes in generation \(s\) as follows. Let

$$\begin{aligned} \tilde{V}_k^s := \left\{ l: \left[g_k^{s-1,L}, g_k^{s-1,U}\right] \ \text{forms a } 2^{-(s-1)}\text{-bracket for} \left[ g_l^{s,L} , g_l^{s,U} \right] \right\} . \end{aligned}$$

Then \(\cup _{k=1}^{\mathcal{N}_{s-1} } \tilde{V}_k^s = \{ 1 , \ldots , \mathcal{N}_s\}\), that is, for each bracket \([ g_l^{s,L}, g_l^{s,U} ] \) there is a \(k\in \{ 1 , \ldots , \mathcal{N}_{s-1}\}\) with \( l \in \tilde{V}_k^s\). To see this, we note that for each \(l\), there is a function \(g\) with \(g_l^{s,L} \le g \le g_l^{s,U}\), and by the above construction, there is a \(k\) with \(g_k^{{s-1}, L} \le g_l^{s, L} \le g \le g_l^{s,U} \le g_k^{s-1, U} \). We let \(\{ V_k^s \}_{k=1}^{\mathcal{N}_{s-1}} \) be a disjoint version of \(\{ \tilde{V}_k^s \} \), e.g., the one given by

$$\begin{aligned} V_1^s = \tilde{V}_1^s , \ V_k^s = \tilde{V}_k^s \backslash \cup _{l=1}^{k-1} \tilde{V}_l^s, \quad k=2 , \ldots , \mathcal{N}_{s-1} . \end{aligned}$$

We let

$$\begin{aligned} \mathrm{parent} (j_s) = k \quad \mathrm{if} \ j_s \in V_k^s. \end{aligned}$$
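For illustration, the disjointification and the resulting parent map can be sketched in a few lines (the covering sets are made up): each child index is assigned to the first bracket that covers it.

```python
# Sketch of the disjointification V_k^s = V_tilde_k^s minus the earlier sets,
# and the induced parent map; the covering sets below are made up.
V_tilde = {0: {0, 1, 2}, 1: {2, 3}, 2: {3, 4}}
seen, parent = set(), {}
for k in sorted(V_tilde):
    for l in sorted(V_tilde[k] - seen):
        parent[l] = k                      # parent(j_s) = k if j_s in V_k^s
    seen |= V_tilde[k]
print(parent)                              # {0: 0, 1: 0, 2: 0, 3: 1, 4: 2}
```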

We now turn to an adaptive truncation device. For each \(s=0, \ldots , S-1\), we are given a truncation level \(K_s\), where \(K_s\) is assumed to be decreasing in \(s\). Let \(g\) be fixed and

$$\begin{aligned} g^{0,L} \le g^{1,L} \le \cdots \le g^{S,L} \le g \le g^{S,U} \le \cdots \le g^{1,U} \le g^{0,U} . \end{aligned}$$

Define

$$\begin{aligned} \Delta ^s := g^{s,U} - g^{s,L}, \quad y_s := \mathbf{1} \{ \Delta ^s \ge K_s \} , \quad s=0, \ldots , S-1 , \end{aligned}$$

with the convention \(y_S := 1\) (no truncation at the finest level).

Then

$$\begin{aligned} K_s \mathbf{1} \{ y_s=1\} \le \Delta ^s \mathbf{1} \{ y_s=1 \} , \quad s=0 , \ldots , S-1 , \end{aligned}$$

which implies (for \(s=0, \ldots , S-1\))

$$\begin{aligned} P\Delta ^s \mathbf{1} \{ y_s=1 \} \le { P| \Delta ^s|^2 \over K_s } \le { 2^{-2s} \over K_s} . \end{aligned}$$

We can write any \(g \in \mathcal{G}\) as

$$\begin{aligned} g&= \sum _{s=1}^S ( g - g^{s,L} ) \mathbf{1} \{ y_s =1 , y_{s-1} = \cdots = y_0 =0 \} \nonumber \\&\quad + \sum _{s=1}^S ( g^{s,L} - g^{s-1, L} ) \mathbf{1} \{ y_{s-1}= \cdots = y_0 =0 \} + g^{0,L} + ( g- g^{0,L} ) \mathbf{1} \{ y_0=1 \} \nonumber \\ \end{aligned}$$
(12)

Let

$$\begin{aligned}&\displaystyle W_{j_0 } := |\nu _n ( g^{0,L}) |+ |\nu _n (\Delta ^0 ) | ,\\&\displaystyle W_{j_s }:=|\nu _n (\Delta ^s \mathbf{1} \{ y_{s-1}=0 \} ) |+ |\nu _n ((g^{s,L} - g^{s-1, L} )\mathbf{1}\{ y_{s-1}=0 \} ) | , \quad s=1 , \ldots , S . \end{aligned}$$

Then it follows from (12) that

$$\begin{aligned} | \nu _n (g) | \le \sum _{s=0}^S | W_{j_s} | + \sqrt{n} \sum _{s=0}^S P \Delta ^s \mathbf{1} \{ y_s=1 \} \le \sum _{s=0}^S | W_{j_s} | + \delta , \end{aligned}$$

for

$$\begin{aligned} \delta = \sqrt{n} \sum _{s=1}^{S} { 4 \ 2^{-2s} \over K_{s-1} }+ \sqrt{n} 2^{-S} . \end{aligned}$$

Note now that

$$\begin{aligned} ( P |g^{0,L} |^m)^{1/m}&\le ( P |g |^m)^{1/m} + ( P |\Delta ^0 |^m)^{1/m}\\&\le ( { m! \over 2} K^{m-2} )^{1/m} + ( { m! \over 2} (2K)^{m-2} )^{1/m} \le 2 ( { m! \over 2} (2K)^{m-2} )^{1/m} , \end{aligned}$$

so

$$\begin{aligned} P |g^{0,L} |^m \le {m!\over 2} (4K)^{m-2} 2^2 . \end{aligned}$$

By Corollary 1

$$\begin{aligned} \Vert \nu _n ( g^{0,L} ) \Vert _{\varPsi _{L_0} } \le 2 \sqrt{6} , \end{aligned}$$

for

$$\begin{aligned} L_0 = { \sqrt{6} } (8K /2)/\sqrt{n} = 4 \sqrt{6} K / \sqrt{n} , \end{aligned}$$

where we multiplied by a factor 2 because the Bernstein condition for the centered functions holds with the above \(4K\) replaced by \(8K\). Moreover, \(L_0 = {\sqrt{6} } (4K)/\sqrt{n}\), so again by Corollary 1,

$$\begin{aligned} \Vert \nu _n ( \Delta ^0 ) \Vert _{\varPsi _{L_0} } \le \sqrt{6} . \end{aligned}$$

The triangle inequality gives

$$\begin{aligned} \biggl \Vert | \nu _n (g^{0,L} ) | + |\nu _n (\Delta ^0) | \biggr \Vert _{\varPsi _{L_0} } \le 3 \sqrt{6} =: \tau . \end{aligned}$$

Moreover, for \(s=1 , \ldots , S\),

$$\begin{aligned} | ( g^{s, L} - g^{s-1, L}) \mathbf{1} \{ y_{s-1}=0 \} | \le \Delta ^{s-1} \mathbf{1} \{ y_{s-1}=0 \} \le K_{s-1} , \quad \Vert \Delta ^{s-1} \Vert \le 2^{-s+1} , \end{aligned}$$

and

$$\begin{aligned} \Delta ^s \mathbf{1}\{ y_{s-1}=0 \} \le \Delta ^{s-1} \mathbf{1}\{ y_{s-1}=0 \} \le K_{s-1} , \quad \Vert \Delta ^s \Vert \le 2^{-s}. \end{aligned}$$

So, again by Corollary 1, we may take

$$\begin{aligned} L_s := \sqrt{6} \ 2^s \max \left( {2 \over 3} K_{s-1} /2, \ {2 \over 3} K_{s-1} \right) \Big / \sqrt{n} = {2 \sqrt{6} \ 2^s K_{s-1} \over 3 \sqrt{n} } , \quad s=1 , \ldots , S . \end{aligned}$$

Then, again by the triangle inequality,

$$\begin{aligned} \biggl \Vert | \nu _n (( g^{s, L} - g^{s-1, L}) \mathrm{l} \{ y_{s-1}=0 \} ) |+ |\nu _n (\Delta ^s \mathrm{l} \{ y_{s-1}=0 \}) | \biggr \Vert _{\varPsi _{L_s}} \le 3 \sqrt{6} \ 2^{-s} . \end{aligned}$$

\(\square \)

6.2 Three technical lemmas

To apply the result of Theorem 6, we need three technical lemmas. First we need a bound for \(N_s:= \prod _{k=0}^s \tilde{N}_k\), or actually for \(H_s := \log (1+ N_s)\).

Lemma 6

For \(s \in \{ 0, \ldots , S\}\), let \(H_s := \log (1+ \prod _{k=0}^s \tilde{N}_k ) \) and \( \tilde{H}_s := \log (1+ \tilde{N}_s)\). It holds that

$$\begin{aligned} \sum _{s=1}^S 2^{-s} \sqrt{H_s} \le \sqrt{\tilde{H}_0} + 2 \sum _{s=1}^S 2^{-s} \sqrt{\tilde{H}_s} . \end{aligned}$$

Proof of Lemma 6

We have

$$\begin{aligned} \sqrt{H_s} \le \sum _{k=0}^s \sqrt{\tilde{H}_k } , \end{aligned}$$

so

$$\begin{aligned} \sum _{s=1}^S 2^{-s} \sqrt{H_s}&\le \sum _{s=1}^S2^{-s} \sqrt{ \tilde{H}_0 } +\sum _{s=1}^S 2^{-s} \sum _{k=1}^s \sqrt{\tilde{H}_k }\\&\le \sqrt{\tilde{H}_0} + \sum _{k=1}^S \sum _{s=k}^S 2^{-s} \sqrt{\tilde{H}_k } \le \sqrt{\tilde{H}_0} +2 \sum _{k=1}^S 2^{-k} \sqrt{\tilde{H}_k } . \end{aligned}$$

\(\square \)

The next lemma inserts a special choice for the truncation levels \(\{ K_s \}\), and then establishes a bound for the expectation of the supremum of the empirical process, derived from the one of Theorem 2.

Lemma 7

Let \(S\) be some integer and \(\epsilon \ge 0\) be an arbitrary constant. Take

$$\begin{aligned} K_{s-1} := 2^{-s} \sqrt{n} \biggl ( { \sqrt{6} \over 3 \sqrt{ \log (1+ N_s) }} \wedge { 1 \over \epsilon } \biggr ) , \quad s=1 , \ldots , S, \end{aligned}$$

where \(u\wedge v\) denotes the minimum of \(u\) and \(v\). Define as in Theorem 6,

$$\begin{aligned}&\displaystyle L_0 := {4 \sqrt{6} K \over \sqrt{n}} , \ L_s := { 2 \sqrt{6} \ 2^s K_{s-1} \over 3 \sqrt{n} } , \quad s=1 , \ldots , S ,\\&\displaystyle \delta := 4 \sqrt{n} \sum _{s=1}^{S} 2^{-{2s} } / K_{s-1} +\sqrt{n} 2^{-S} , \end{aligned}$$

and \(\tau := 3\sqrt{6}\). Let

$$\begin{aligned} \mathbf{E}_S := \tau \sum _{s=0}^S 2^{-s} \biggl [ \sqrt{\log (1+ N_s)} + {L_s \over 2} \log (1+ N_s ) \biggr ] + \delta . \end{aligned}$$

Then

$$\begin{aligned} \mathbf{E}_S \le \bar{\mathbf{E}}_S +4\epsilon , \end{aligned}$$

where

$$\begin{aligned} \bar{\mathbf{E}}_S := 2^{-S} \sqrt{n} + 14 \sum _{s=0}^S 2^{-s} \sqrt{6 \tilde{H}_s } + 6^2 K { \tilde{H}_0 \over \sqrt{n} } . \end{aligned}$$

Proof of Lemma 7

We have

$$\begin{aligned} \mathbf{E}_S&= \sum _{s=1}^{S} {4 \ 2^{-2s} \sqrt{n} \over K_{s-1} } + 2^{-S} \sqrt{n} + \tau \sqrt{\log (1+ N_0)} + 2 \sqrt{6}\ \tau K { \log (1+ N_0) \over \sqrt{n} }\\&\quad+ \tau \sum _{s=1}^S 2^{-s} \biggl [\sqrt{\log (1+ N_s)} + {1\over 3} \sqrt{6} \ 2^s K_{s-1} { \log (1+ N_s ) \over \sqrt{n}} \biggr ]\\&= \sum _{s=1}^{S} {4\ 2^{-2s} \sqrt{n} \over K_{s-1} } + 2^{-S} \sqrt{n} + 3 \sqrt{6 \log (1+ N_0)} + 6^2 K { \log (1+ N_0) \over \sqrt{n} }\\&\quad+ 3 \sum _{s=1}^S 2^{-s}\sqrt{6 \log (1+ N_s)} + \sum _{s=1}^S 6 K_{s-1} { \log (1+ N_s ) \over \sqrt{n}} =I+II+III, \end{aligned}$$

where

$$\begin{aligned} I&:= 2^{-S} \sqrt{n} + 3 \sqrt{6 \log (1+ N_0)} + 6^2 K { \log (1+ N_0) \over \sqrt{n} },\\ II&:= 3 \sum _{s=1}^S 2^{-s}\sqrt{6 \log (1+ N_s)}, \end{aligned}$$

and

$$\begin{aligned} III:= \sum _{s=1}^{S} {4 \ 2^{-2s} \sqrt{n} \over K_{s-1} } + \sum _{s=1}^S 6 K_{s-1} { \log (1+ N_s ) \over \sqrt{n}}. \end{aligned}$$

Insert

$$\begin{aligned} K_{s-1}={1 \over 3} \sqrt{6} \ 2^{-s} \sqrt{ n \over \log (1+ N_s) } \wedge 2^{-s} { \sqrt{n} \over \epsilon } , \quad s=1 , \ldots , S. \end{aligned}$$

Note that \(K_s\) is decreasing in \(s\). Moreover

$$\begin{aligned} {4 \ 2^{-2s} \sqrt{n} \over K_{s-1} } + 6 K_{s-1} { \log (1+ N_s ) \over \sqrt{n}}\le 4 \sqrt{6}\ 2^{-s} \sqrt{ \log (1+ N_s ) } + 4\ 2^{-s} \epsilon . \end{aligned}$$

We find

$$\begin{aligned} III \le 4 \sqrt{6}\sum _{s=1}^S 2^{-s} \sqrt{ \log (1+ N_s ) } + 4\epsilon , \end{aligned}$$

so that

$$\begin{aligned} II+III\le 7 \sqrt{6}\sum _{s=1}^S 2^{-s} \sqrt{ \log (1+ N_s ) } + 4\epsilon . \end{aligned}$$

Now apply Lemma 6. This gives

$$\begin{aligned} II + III \le 7 \sqrt{6} \sqrt{ \log (1+ \tilde{N}_0) }+14 \sqrt{6} \sum _{s=1}^S 2^{-s} \sqrt{ \log (1+ \tilde{N}_s ) } + 4\epsilon . \end{aligned}$$

Hence,

$$\begin{aligned} I+II+III&\le 2^{-S} \sqrt{n} + 6^2K { \log (1+ \tilde{N}_0) \over \sqrt{n} } +10 \sqrt{6} \sqrt{\log (1+ \tilde{N}_0) }\\&+ 14 \sqrt{6} \sum _{s=1}^S 2^{-s} \sqrt{ \log (1+ \tilde{N}_s)} + 4\epsilon \\&\le 2^{-S} \sqrt{n} + 14 \sqrt{6} \sum _{s=0}^S 2^{-s} \sqrt{ \log (1+ \tilde{N}_s)} + 6^2K { \log (1+ \tilde{N}_0) \over \sqrt{n} } +4\epsilon . \end{aligned}$$

\(\square \)

We now derive some bounds which will be used for obtaining the deviation inequalities in probability and in Bernstein–Orlicz norm of Theorem 8.

Lemma 8

Let the constants \(\{K_{s-1} \}_{s=1}^S\), \(\{ L_s\}_{s=0}^S\), and \(\tau \) be as in Lemma 7. Let

$$\begin{aligned} L:= \sum _{s=0}^S 2^{-s} { L_s (1+s) \over 4}. \end{aligned}$$

Then

$$\begin{aligned} L \le \sqrt{6} K/\sqrt{n} + \biggl ( 2 \wedge {\sqrt{6} \over \epsilon } \biggr ) , \end{aligned}$$

and

$$\begin{aligned} 4 \tau (1+L/2) \le 6^2 K/\sqrt{n} + 24 \sqrt{6} . \end{aligned}$$

Proof of Lemma 8

We have

$$\begin{aligned} L&= {L_0 \over 4} + \sum _{s=1}^S {2^{-s} L_s (1+s) \over 4 }\\&= { \sqrt{6} K \over \sqrt{n}} + \sum _{s=1}^S { (1+s) K_{s-1} \over \sqrt{6n}} . \end{aligned}$$

But

$$\begin{aligned} \sum _{s=1}^S 2^{-s} (1+s) \le 2 \int _0^{\infty } 2^{-x} x dx = {2 \over (\log 2)^2 } , \end{aligned}$$

and since \(H_s =\log (1+ N_s ) \ge \log (2)\),

$$\begin{aligned} K_{s-1} \le 2^{-s} \sqrt{n} \biggl ( {\sqrt{6} \over 3 ( \log (2))^{1/2} } \wedge {1 \over \epsilon } \biggr ) . \end{aligned}$$

Hence,

$$\begin{aligned} L&\le { \sqrt{6} K \over \sqrt{n}} + { 2 \over \sqrt{6} (\log 2)^2} \biggl ( {\sqrt{6} \over 3 ( \log 2)^{1/2} } \wedge {1 \over \epsilon }\biggr )\\&= { \sqrt{6} K \over \sqrt{n}} + { 2 \over 3 (\log 2)^{5/2}} \wedge { 2 \over 6 (\log 2)^2} {\sqrt{6} \over \epsilon }\\&\le {\sqrt{6} K\over \sqrt{n}} + 2 \wedge {\sqrt{6} \over \epsilon }. \end{aligned}$$

As \(\tau =3 \sqrt{6}\), we get

$$\begin{aligned} 4 \tau (1+L/2) \le 6^2 K/\sqrt{n} + 24 \sqrt{6} . \end{aligned}$$

\(\square \)

7 Proof of Theorems 7 and 8

Proof of Theorem 7

This follows from Theorem 2, Theorem 6, and Lemma 7 with \(\epsilon =0\). \(\square \)

Proof of Theorem 8

Let \(t>0\) be arbitrary. Note that \(\bar{\mathbf{E}}_S\) is as in Lemma 7. Apply the bounds of Lemma 8 with \(\epsilon =3 \sqrt{t}\) for the constant \(L\) defined there. Then

$$\begin{aligned}&\tau (4+2L) +4\epsilon + 4\tau \biggl [ \sqrt{t}+ {L t \over 2} \biggr ] \\&\quad \le 6^2 K/\sqrt{n} + 24 \sqrt{6} + 4 \epsilon + 12 \sqrt{6t} + 2\tau { \sqrt{6} K t \over \sqrt{n}} + 2\tau {\sqrt{6} t \over \epsilon }\\&\quad = 6^2 K /\sqrt{n} + 6^2 Kt/\sqrt{n} + 24 \sqrt{6}+ 12 \sqrt{6t} +24 \sqrt{t}\\&\quad \le 6^2 K / \sqrt{n} + 24 \sqrt{6}+ 24 \sqrt{6} \biggl [ \sqrt{t} + { \tilde{L} t \over 2} \biggr ] , \end{aligned}$$

where

$$\begin{aligned} \tilde{L}:= { \sqrt{6} K \over 2 \sqrt{n} } . \end{aligned}$$

Then by Theorem 4,

$$\begin{aligned} \mathbb P \biggl ( \sup _{g \in \mathcal{G}} | \nu _n (g) | \ge \min _S \bar{\mathbf{E}}_S + 6^2 K / \sqrt{n} +24 \sqrt{6}+ 24 \sqrt{6} \biggl [ \sqrt{t} + { \tilde{L} t \over 2} \biggr ] \biggr ) \le 2 \exp [-t] \end{aligned}$$

and by Lemma 2

$$\begin{aligned} \biggl \Vert \biggl ( \sup _{g \in \mathcal{G}} | \nu _n (g)| - \biggl [ \min _{S} \bar{\mathbf{E}}_S + 6^2 K /\sqrt{n} + 24 \sqrt{6} \biggr ] \biggr )_+ \biggr \Vert _{\varPsi _{\sqrt{3} \tilde{L}} } \le 72 \sqrt{2} . \end{aligned}$$

\(\square \)