1 Introduction

Concentration inequalities bound the probabilities that random quantities deviate from their average, median, or otherwise typical values. If this deviation is small with high probability, then a repeated experiment or observation will likely produce a similar result. In this way concentration inequalities can give quantitative guarantees of reproducibility, a concept at the heart of empirical science [25].

In this chapter we limit ourselves to the study of quantities whose randomness arises through dependence on many independent random variables. Suppose that \(\left( \Omega _{i},\Sigma _{i}\right) \) are measurable spaces for \( i\in \left\{ 1,...,n\right\} \) and that f is a real-valued function defined on the product space \(\Omega =\prod _{i=1}^{n}\Omega _{i}\),

$$\begin{aligned} f:\mathbf {x}=\left( x_{1},...,x_{n}\right) \in \Omega \mapsto f\left( \mathbf {x}\right) \in \mathbb {R}\text {.} \end{aligned}$$

Now let \(\mathbf {X}=\left( X_{1},...,X_{n}\right) \) be a vector of independent random variables, where \(X_{i}\) is distributed as \(\mu _{i}\) in \( \Omega _{i}\). For \(t>0\) and \(\mathbf {X}^{\prime }\) iid to \(\mathbf {X}\) we then want to give bounds on the upwards deviation probability

$$\begin{aligned} \Pr _{\mathbf {X}}\left\{ f\left( \mathbf {X}\right) -E\left[ f\left( \mathbf {X} ^{\prime }\right) \right] >t\right\} \end{aligned}$$

in terms of the deviation t, the measures \(\mu _{i}\) and properties of the function f. Downward deviation bounds are then obtained by replacing f with \(-f\). Usually we will just write \(\Pr \left\{ f-Ef>t\right\} \) for the deviation probability above.

The first bounds of this type were given by Chebychev and Bienaymé [11] in the late 19th century for additive functions of the form

$$\begin{aligned} f\left( \mathbf {x}\right) =\sum _{i=1}^{n}f_{i}\left( x_{i}\right) . \end{aligned}$$
(1)

The subject has since been developed by Bernstein, Chernoff, Bennett, Hoeffding, and many others [4, 16], and results were extended from sums to more general and complicated nonlinear functions. During the past decades research activity has been stimulated by the contributions of Michel Talagrand [27, 28] and by the relevance of concentration phenomena to the rapidly growing field of computer science. Some concentration inequalities, like the well known bounded difference inequality, have become standard tools in the analysis of algorithms [23].

One of the more recent methods to derive concentration inequalities, the so-called entropy method, is rooted in the early investigations of Boltzmann [5] and Gibbs [12] into the foundations of statistical mechanics. While the modern entropy method evolved along a complicated historical path via quantum field theory and the logarithmic Sobolev inequality of Leonard Gross [14], its hidden simplicity was understood and emphasized by Michel Ledoux, who also recognized the key role which the subadditivity of entropy can play in the derivation of concentration inequalities [18]. The method has been refined by Bobkov, Massart [20], Bousquet [9], and Boucheron et al. [7]. Recently Boucheron et al. [8] showed that the entropy method is sufficiently strong to derive a form of Talagrand’s convex distance inequality.

In this chapter we present a variation of the entropy method in a compact and simplified form, closely tied to its origins in statistical mechanics. We give an exposition of the method in Sect. 2 and compress it into a toolbox to derive concentration inequalities.

In Sect. 3 we will then use this method to prove two classical concentration inequalities, the bounded difference inequality and a generalization of Bennett’s inequality. As example applications we treat vector-valued concentration and generalization in empirical risk minimization, a standard problem in machine learning theory.

In Sect. 4 we address more difficult problems. The bounded difference inequality is used to prove the famous Gaussian concentration inequality. We also give some more recent inequalities which we apply to analyze the concentration of convex Lipschitz functions on \( \left[ 0,1\right] ^{n}\), or of the spectral norm of a random matrix.

In Sect. 5 we describe some more advanced techniques, self-boundedness and decoupling. As examples we give sub-Gaussian lower tail bounds for convex Lipschitz functions and a version of the Hanson-Wright inequality for bounded random variables, and we derive an exponential inequality for the suprema of empirical processes. We conclude with another version of Bernstein’s inequality and its application to U-statistics.

We limit ourselves to exponential deviation bounds from the mean. For moment bounds and other advanced methods to establish concentration inequalities, such as the transportation method or an in-depth treatment of logarithmic Sobolev inequalities, we recommend the monographs by Ledoux [18] and Boucheron, Lugosi, and Massart [6], and the overview article by McDiarmid [23]. Another important recent development not covered is the method of exchangeable pairs proposed by Chatterjee [10].

We fix some conventions and notation:

If \(\left( \Omega ,\Sigma \right) \) is any measurable space, \(\mathcal {A} \left( \Omega \right) \) will denote the algebra of bounded, measurable, real-valued functions on \(\Omega \). When there is no ambiguity we often just write \(\mathcal {A}\) for \(\mathcal {A}\left( \Omega \right) \). Although we give some results for unbounded functions, most functions for which we will prove concentration inequalities are assumed to be measurable and bounded, that is \(f\in \mathcal {A}\). This assumption simplifies the statement of our results, because it guarantees the existence of algebraic and exponential moments, and it makes our arguments more transparent.

If \(\left( \Omega ,\Sigma ,\mu \right) \) is a probability space we write \( \Pr F=\mu \left( F\right) \) for \(F\in \Sigma \), \(E\left[ f\right] =\int _{\Omega }fd\mu \) for \(f\in L_{1}\left[ \mu \right] \), and \(\sigma ^{2} \left[ f\right] =E\left[ \left( f-E\left[ f\right] \right) ^{2}\right] \) for \(f\in L_{2}\left[ \mu \right] \). Wherever we use \(\Pr \), E or \(\sigma ^{2}\), we assume that there is an underlying probability space \(\left( \Omega ,\Sigma ,\mu \right) \). If we refer to measures other than \(\mu \), we indicate them with corresponding subscripts.

If \(\mathcal {X}\) is any set and \(n\in \mathbb {N}\), then for \(y\in \mathcal {X} \) and \(k\in \left\{ 1,...,n\right\} \) the substitution operator \(S_{y}^{k}: \mathcal {X}^{n}\rightarrow \mathcal {X}^{n}\) is defined as

$$\begin{aligned} S_{y}^{k}x=\left( x_{1},...,x_{k-1},y,x_{k+1},...,x_{n}\right) \text { for } x=\left( x_{1},...,x_{n}\right) \in \mathcal {X}^{n}. \end{aligned}$$

This and other notation which we introduce along the way is also summarized in a final section in tabular form.

2 The Entropy Method

In this section we develop the entropy method and package it into a toolbox to prove concentration inequalities.

2.1 Markov’s Inequality and Exponential Moment Method

The most important tool in the proof of deviation bounds is Markov’s inequality, which we now introduce along with two corollaries, Chebychev’s inequality and the exponential moment method.

Theorem 1

(Markov inequality) Let \(f\in L_{1}\left[ \mu \right] \), \(f\ge 0\) and \(t>0\). Then

$$\begin{aligned} \Pr \left\{ f>t\right\} \le \frac{E\left[ f\right] }{t} \end{aligned}$$

Proof

Since \(f\ge 0\) and \(t>0\) we have \(1_{f>t}\le f/t\) and therefore

$$\begin{aligned} \Pr \left\{ f>t\right\} =E\left[ 1_{f>t}\right] \le E\left[ f/t\right] = \frac{E\left[ f\right] }{t}. \end{aligned}$$

   \(\square \)

Corollary 2

(Chebychev inequality) Let \(f\in L_{2}\left[ \mu \right] \) and \(t>0\). Then

$$\begin{aligned} \Pr \left\{ \left| f-E\left[ f\right] \right|>t\right\} =\Pr \left\{ \left( f-E\left[ f\right] \right) ^{2}>t^{2}\right\} \le \frac{E\left[ \left( f-E\left[ f\right] \right) ^{2}\right] }{t^{2}}=\frac{\sigma ^{2}\left( f\right) }{t^{2}}. \end{aligned}$$

To use Chebychev’s inequality we need to bound the variance \(\sigma ^{2}\left( f\right) \). If f is a sum of independent variables, the variance of f is just the sum of the variances of the individual variables, but this additivity fails for general functions. In Sect. 3.1, however, we give the Efron–Stein inequality, which asserts that for functions of independent variables the variance is bounded by the expected sum of conditional variances.

The idea of Chebychev’s inequality obviously extends to other even centered moments \(E\left[ \left( f-E\left[ f\right] \right) ^{2p}\right] \). Bounding higher moments of functions of independent variables is an important technique discussed, for example, in [6].

Here the most important corollary of Markov’s inequality is the exponential moment method, an idea apparently due to Bernstein [4].

Corollary 3

(exponential moment method) Let \( f\in \mathcal {A}\), \(\beta \ge 0\) and \(t>0\). Then

$$\begin{aligned} \Pr \left\{ f>t\right\} =\Pr \left\{ e^{\beta f}>e^{\beta t}\right\} \le e^{-\beta t}E\left[ e^{\beta f}\right] . \end{aligned}$$

To use this we need to bound the quantity \(E\left[ e^{\beta f}\right] \) and to optimize the right-hand side above over \(\beta \). We call \(E\left[ e^{\beta f}\right] \) the partition function, denoted \(Z_{\beta f}=E \left[ e^{\beta f}\right] \). Bounding the partition function (or its logarithm) is the principal problem in the derivation of exponential tail bounds.

If f is a sum of independent components (as in (1)), then the partition function, which is just the moment generating function of f, factorizes into the partition functions corresponding to these components, and its logarithm is additive. This is a convenient basis to obtain deviation bounds for sums, but it does not immediately extend to general non-additive functions. The approach taken here, the entropy method, balances simplicity and generality.
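As a quick numerical illustration of the exponential moment method (this sketch and its parameter choices are additions, not part of the text), the following Python snippet computes the optimized bound \(\inf _{\beta \ge 0}e^{-\beta t}E\left[ e^{\beta f}\right] \) for a sum of independent Bernoulli variables, where the moment generating function is available in closed form, and compares it with a Monte Carlo estimate of the deviation probability.

```python
import numpy as np

# Sketch: exponential moment method for f = sum of n Bernoulli(p) variables.
# All parameter values (n, p, t) are illustrative assumptions.
rng = np.random.default_rng(0)
n, p, t = 100, 0.3, 10.0

betas = np.linspace(1e-3, 5.0, 2000)
# log partition function of the centered sum: n * ln E[exp(beta (X - p))]
log_Z = n * np.log(p * np.exp(betas * (1 - p)) + (1 - p) * np.exp(-betas * p))
chernoff = np.exp(np.min(log_Z - betas * t))   # optimize e^{-beta t} E[e^{beta f}]

samples = rng.binomial(n, p, size=200_000)
empirical = np.mean(samples - n * p > t)

print(f"optimized exponential moment bound: {chernoff:.4g}")
print(f"empirical deviation probability   : {empirical:.4g}")
```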

2.2 Entropy and Concentration

For the remainder of this section we take the function \(f\in \mathcal {A}\) as fixed. We could interpret the points \(x\in \Omega \) as possible states of a physical system and f as the negative energy function, so that \(-f\left( x\right) \), the value of the Hamiltonian \(H=-f\), is the system’s energy in the state x. The measure \(\mu \) then models an a priori probability distribution of states in the absence of any constraining information. We will define another probability measure on \(\Omega \), with specified expected energy, but with otherwise minimal assumptions.

If \(\rho \) is a function on \(\Omega \), \(\rho \ge 0\) and \(E\left[ \rho \right] =1\), the Kullback–Leibler divergence \(KL\left( \rho d\mu ,d\mu \right) \) of \(\rho d\mu \) to \(d\mu \) is

$$\begin{aligned} KL\left( \rho d\mu ,d\mu \right) =E\left[ \rho \ln \rho \right] . \end{aligned}$$

\(KL\left( \rho d\mu ,d\mu \right) \) can be interpreted as the information we gain if we are told that the probability measure is \(\rho d\mu \) instead of the a priori measure \(d\mu \).

Theorem 4

For all \(f\in \mathcal {A}\), \(\beta \in \mathbb {R}\)

$$\begin{aligned} \sup _{\rho }\left\{ \beta E\left[ \rho f\right] -E\left[ \rho \ln \rho \right] \right\} =\ln E \left[ e^{\beta f}\right] , \end{aligned}$$

where the supremum is over all nonnegative measurable functions \(\rho \) on \( \Omega \) satisfying \(E\left[ \rho \right] =1\).

The supremum is attained for the density

$$\begin{aligned} \rho _{\beta f}=e^{\beta f}/E\left[ e^{\beta f}\right] . \end{aligned}$$

Proof

We can assume \(\beta =1\) by absorbing it in f. Let \(\rho \ge 0\) satisfy \(E \left[ \rho \right] =1\), so that \(\rho d\mu \) is a probability measure and \( g\in \mathcal {A}\mapsto E_{\rho }\left[ g\right] :=E\left[ \rho g\right] \) an expectation functional. Let \(\phi \left( x\right) =1/\rho \left( x\right) \) if \(\rho \left( x\right) >0\) and \(\phi \left( x\right) =0\) if \(\rho \left( x\right) =0\). Then \(E\left[ \rho \ln \rho \right] =-E\left[ \rho \ln \phi \right] =-E_{\rho }\left[ \ln \phi \right] \) and with Jensen’s inequality

$$\begin{aligned} E\left[ \rho f\right] -E\left[ \rho \ln \rho \right]= & {} E_{\rho }\left[ f+\ln \phi \right] =\ln \exp \left( E_{\rho }\left[ f+\ln \phi \right] \right) \\\le & {} \ln E_{\rho }\left[ \exp \left( f+\ln \phi \right) \right] =\ln E_{\rho }\left[ \phi e^{f}\right] \\= & {} \ln E\left[ \rho \phi e^{f}\right] =\ln E\left[ 1_{\rho >0}e^{f}\right] \\\le & {} \ln E\left[ e^{f}\right] . \end{aligned}$$

On the other hand

$$\begin{aligned} E\left[ \rho _{f}f\right] -E\left[ \rho _{f}\ln \rho _{f}\right] =\frac{E \left[ fe^{f}\right] }{E\left[ e^{f}\right] }-\frac{E\left[ e^{f}\ln \left( e^{f}/E\left[ e^{f}\right] \right) \right] }{E\left[ e^{f}\right] }=\ln E \left[ e^{f}\right] . \end{aligned}$$

   \(\square \)
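The variational formula is easy to check numerically on a finite state space. The Python sketch below is an added illustration; the uniform a priori measure, the state space size and the random choice of f are assumptions made only for the example.

```python
import numpy as np

# Sketch of Theorem 4 on a finite state space with uniform a priori measure.
# The state space size, beta and f are illustrative assumptions.
rng = np.random.default_rng(1)
m, beta = 50, 1.7
mu = np.full(m, 1.0 / m)                 # a priori measure
f = rng.uniform(-1.0, 1.0, size=m)       # bounded function ("negative energy")

log_Z = np.log(np.dot(mu, np.exp(beta * f)))
rho_thermal = np.exp(beta * f) / np.dot(mu, np.exp(beta * f))

def objective(rho):
    # beta * E[rho f] - E[rho ln rho], expectations taken under mu
    return beta * np.dot(mu, rho * f) - np.dot(mu, rho * np.log(rho))

print("ln E[exp(beta f)] :", log_Z)
print("at the thermal rho:", objective(rho_thermal))   # attains the supremum
for _ in range(3):                                     # other densities stay below
    w = rng.exponential(size=m)
    rho = w / np.dot(mu, w)                            # normalized: E[rho] = 1
    print("at a random rho   :", objective(rho))
```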

In statistical physics the maximizing probability measure \(d\mu _{\beta f}=\rho _{\beta f}d\mu =e^{\beta f}d\mu /E\left[ e^{\beta f}\right] \) is called the thermal measure, sometimes also the canonical ensemble. It is used to describe a system in thermal equilibrium with a heat reservoir at temperature \(T\approx 1/\beta \). The corresponding expectation functional

$$\begin{aligned} E_{\beta f}\left[ g\right] =\frac{E\left[ ge^{\beta f}\right] }{E\left[ e^{\beta f}\right] }=Z_{\beta f}^{-1}E\left[ ge^{\beta f}\right] \text {, for }g\in \mathcal {A} \end{aligned}$$

is called the thermal expectation. The normalizing quantity \( Z_{\beta f}=E\left[ e^{\beta f}\right] \) is the partition function already introduced above. Notice that for any constant c we have \(E_{\beta \left( f+c\right) }\left[ g\right] =E_{\beta f}\left[ g\right] \).

The value of the function \(\rho \mapsto E\left[ \rho \ln \rho \right] \) at the thermal density \(\rho _{\beta f}=Z_{\beta f}^{-1}e^{\beta f}\) is called the canonical entropy or simply entropy,

$$\begin{aligned} \mathrm {Ent}_{f}\left( \beta \right) =E\left[ \rho _{\beta f}\ln \rho _{\beta f}\right] =\beta E_{\beta f}\left[ f\right] -\ln Z_{\beta f}. \end{aligned}$$
(2)

Note that \(\mathrm {Ent}_{-f}\left( \beta \right) =\mathrm {Ent}_{f}\left( -\beta \right) \), a simple but very useful fact.

Suppose that \(\rho \) is any probability density on \(\Omega \) that gives the same expected value for the energy as \(\rho _{\beta f}\), so that \(E\left[ \rho f\right] =E_{\beta f}\left[ f\right] \). Then

$$\begin{aligned} 0\le & {} KL\left( \rho d\mu ,Z_{\beta f}^{-1}e^{\beta f}d\mu \right) \\= & {} E\left[ \rho \ln \rho \right] -\beta E\left[ \rho f\right] +\ln Z_{\beta f} \\= & {} E\left[ \rho \ln \rho \right] -\beta E_{\beta f}\left[ f\right] +\ln Z_{\beta f} \\= & {} KL\left( \rho d\mu ,d\mu \right) -KL\left( \rho _{\beta f}d\mu ,d\mu \right) . \end{aligned}$$

The thermal measure \(d\mu _{\beta f}=\rho _{\beta f}d\mu \) therefore minimizes the information gain relative to the a priori measure \(d\mu \), given the expected value \(-E_{\beta f}\left[ f\right] \) of the internal energy.

For \(g\in \mathcal {A}\) and \(\rho =Z_{\beta f}^{-1}e^{\beta f}\) Theorem 4 gives

$$\begin{aligned} E_{\beta f}\left[ g\right] \le \mathrm {Ent}_{f}\left( \beta \right) +\ln E \left[ e^{g}\right] , \end{aligned}$$

which allows us to decouple g from f. This decoupling plays an important role later in this chapter.

For \(\beta \ne 0\) define a function

$$\begin{aligned} A_{f}\left( \beta \right) =\frac{1}{\beta }\ln Z_{\beta f}=\frac{1}{\beta } \ln E\left[ e^{\beta f}\right] \text {.} \end{aligned}$$
(3)

By l’Hospital’s rule we have \(\lim _{\beta \rightarrow 0}A_{f}\left( \beta \right) =E\left[ f\right] \), so \(A_{f}\) extends continuously to \(\mathbb {R}\) by setting \(A_{f}\left( 0\right) =E\left[ f\right] \). In statistical physics the quantity \(A_{f}\left( \beta \right) \) so defined is called the free energy corresponding to the Hamiltonian (energy function) \(H=-f\) and temperature \(T\approx \beta ^{-1}\). Theorem 4 exhibits the free energy and the canonical entropy as a pair of convex conjugates. Dividing (2) by \(\beta \) and writing \(U=E_{\beta f} \left[ f\right] \), we recover the classical thermodynamic relation

$$\begin{aligned} A=U-T~\mathrm {Ent}, \end{aligned}$$

which describes the macroscopically available energy A as the difference between the internal energy U and an energy portion \(T~\mathrm {Ent}\), which is inaccessible due to ignorance of the microscopic state.

The following theorem establishes the connection of entropy, the exponential moment method and concentration inequalities.

Theorem 5

For \(f\in \mathcal {A}\) and any \(\beta \ge 0\) we have

$$\begin{aligned} \ln E\left[ e^{\beta \left( f-Ef\right) }\right] =\beta \int _{0}^{\beta } \frac{\mathrm {Ent}_{f}\left( \gamma \right) }{\gamma ^{2}}d\gamma \end{aligned}$$

and, for \(t\ge 0\),

$$\begin{aligned} \Pr \left\{ f-Ef>t\right\} \le \inf _{\beta \ge 0}\exp \left( \beta \int _{0}^{\beta }\frac{\mathrm {Ent}_{f}\left( \gamma \right) }{\gamma ^{2}} d\gamma -\beta t\right) . \end{aligned}$$

Proof

Differentiating the free energy with respect to \(\beta \), and using \(\left( d/d\beta \right) \ln Z_{\beta f}=E_{\beta f}\left[ f\right] \) (see Lemma 6 below), we find

$$\begin{aligned} A_{f}^{\prime }\left( \beta \right) =\frac{1}{\beta }E_{\beta f}\left[ f \right] -\frac{1}{\beta ^{2}}\ln Z_{\beta f}=\beta ^{-2}\mathrm {Ent} _{f}\left( \beta \right) . \end{aligned}$$

By the fundamental theorem of calculus

$$\begin{aligned} \ln E\left[ e^{\beta \left( f-Ef\right) }\right]= & {} \ln Z_{\beta f}-\beta E\left[ f\right] =\beta \left( A_{f}\left( \beta \right) -A_{f}\left( 0\right) \right) \\= & {} \beta \int _{0}^{\beta }A_{f}^{\prime }\left( \gamma \right) d\gamma =\beta \int _{0}^{\beta }\frac{\mathrm {Ent}_{f}\left( \gamma \right) }{\gamma ^{2}}d\gamma , \end{aligned}$$

which proves the first identity. Then by Markov’s inequality

$$\begin{aligned} \Pr \left\{ f-Ef>t\right\}\le & {} e^{-\beta t}E\left[ e^{\beta \left( f-Ef\right) }\right] \\\le & {} \exp \left( \beta \int _{0}^{\beta }\frac{\mathrm {Ent}_{f}\left( \gamma \right) }{\gamma ^{2}}d\gamma -\beta t\right) . \end{aligned}$$

   \(\square \)

Our strategy to establish concentration results will therefore be the search for appropriate bounds on the entropy.

2.3 Entropy and Energy Fluctuations

The thermal variance of a function \(g\in \mathcal {A}\) is just the variance of g relative to the thermal expectation. It is denoted \(\sigma _{\beta f}^{2}\left( g\right) \) and defined by

$$\begin{aligned} \sigma _{\beta f}^{2}\left( g\right) =E_{\beta f}\left[ \left( g-E_{\beta f}\left[ g\right] \right) ^{2}\right] =E_{\beta f}\left[ g^{2}\right] -\left( E_{\beta f}\left[ g\right] \right) ^{2}. \end{aligned}$$

For constant c we have \(\sigma _{\beta \left( f+c\right) }^{2}\left[ g\right] =\sigma _{\beta f}^{2}\left[ g\right] \).

The proof of the following lemma consists of straightforward calculations, an easy exercise to familiarize oneself with thermal measure, expectation and variance.

Lemma 6

The following formulas hold for \(f\in \mathcal {A} \)

1. \(\frac{d}{d\beta }\left( \ln Z_{\beta f}\right) =E_{\beta f}\left[ f\right] \).

2. If \(h:\beta \mapsto h\left( \beta \right) \in \mathcal {A}\) is differentiable and \(\left( d/d\beta \right) h\left( \beta \right) \in \mathcal {A}\) then

$$\begin{aligned} \frac{d}{d\beta }E_{\beta f}\left[ h\left( \beta \right) \right] =E_{\beta f}\left[ h\left( \beta \right) f\right] -E_{\beta f}\left[ h\left( \beta \right) \right] E_{\beta f}\left[ f\right] +E_{\beta f}\left[ \frac{d}{d\beta }h\left( \beta \right) \right] . \end{aligned}$$

3. \(\frac{d}{d\beta }E_{\beta f}\left[ f^{k}\right] =E_{\beta f}\left[ f^{k+1}\right] -E_{\beta f}\left[ f^{k}\right] E_{\beta f}\left[ f\right] .\)

4. \(\frac{d^{2}}{d\beta ^{2}}\left( \ln Z_{\beta f}\right) =\frac{d}{d\beta }E_{\beta f}\left[ f\right] =\sigma _{\beta f}^{2}\left[ f\right] .\)

Proof

1. is immediate and 2. a straightforward computation. 3. and 4. are immediate consequences of 1. and 2.   \(\square \)

Since the members of \(\mathcal {A}\) are bounded it follows from 2. that for \(f,g\in \mathcal {A}\) the functions \(\beta \mapsto E_{\beta f}\left[ g\right] \) and \(\beta \mapsto \sigma _{\beta f}^{2}\left[ g\right] \) are \(C^{\infty }\).

The thermal variance of f itself corresponds to energy fluctuations. The next theorem represents the entropy as a double integral of such fluctuations. The utility of this representation for deriving concentration results has been noted by David McAllester [22].

Theorem 7

We have for \(\beta >0\)

$$\begin{aligned} \mathrm {Ent}_{f}\left( \beta \right) =\int _{0}^{\beta }\int _{t}^{\beta }\sigma _{sf}^{2}\left[ f\right] ds~dt\text {.} \end{aligned}$$

Proof

Using the previous lemma and the fundamental theorem of calculus we obtain the formulas

$$\begin{aligned} \beta E_{\beta f}\left[ f\right] =\int _{0}^{\beta }E_{\beta f}\left[ f\right] dt=\int _{0}^{\beta }\left( \int _{0}^{\beta }\sigma _{sf}^{2}\left[ f\right] ds+E\left[ f\right] \right) dt \end{aligned}$$

and

$$\begin{aligned} \ln Z_{\beta f}=\int _{0}^{\beta }E_{tf}\left[ f\right] dt=\int _{0}^{\beta }\left( \int _{0}^{t}\sigma _{sf}^{2}\left[ f\right] ds+E\left[ f\right] \right) dt, \end{aligned}$$

which we subtract to obtain

$$\begin{aligned} \mathrm {Ent}_{f}\left( \beta \right)= & {} \beta E_{\beta f}\left[ f\right] -\ln Z_{\beta f}=\int _{0}^{\beta }\left( \int _{0}^{\beta }\sigma _{sf}^{2}\left[ f\right] ds-\int _{0}^{t}\sigma _{sf}^{2}\left[ f\right] ds\right) dt \\= & {} \int _{0}^{\beta }\left( \int _{t}^{\beta }\sigma _{sf}^{2}\left[ f\right] ds\right) dt. \end{aligned}$$

   \(\square \)

Since bounding \(\sigma _{\beta f}^{2}\left[ f\right] \) is central to our method, it is worth mentioning an interpretation in terms of heat capacity, or specific heat. Recall that \(-E_{\beta f}\left[ f\right] \) is the expected internal energy. The rate of change of this quantity with temperature T is the heat capacity. By conclusion 4 of Lemma 6 we have

$$\begin{aligned} \frac{d}{dT}\left( -E_{\beta f}\left[ f\right] \right) =\frac{1}{T^{2}}\sigma _{\beta f}^{2}\left[ f\right] , \end{aligned}$$

which exhibits the proportionality of heat capacity and energy fluctuations.
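The fluctuation representation of Theorem 7 can also be checked numerically. The sketch below is an added illustration (the finite state space with uniform measure and the particular f are assumptions); it compares the entropy computed from definition (2) with a Riemann approximation of the double integral.

```python
import numpy as np

# Sketch: check Ent_f(beta) = int_0^beta int_t^beta sigma_{s f}^2[f] ds dt
# on a small finite state space with uniform a priori measure (assumptions).
rng = np.random.default_rng(2)
m, beta = 20, 2.0
f = rng.uniform(-1.0, 1.0, size=m)
mu = np.full(m, 1.0 / m)

def thermal_exp(g, s):                    # E_{s f}[g]
    w = mu * np.exp(s * f)
    return np.dot(w, g) / np.sum(w)

def thermal_var(s):                       # sigma_{s f}^2[f]
    return thermal_exp(f * f, s) - thermal_exp(f, s) ** 2

log_Z = np.log(np.dot(mu, np.exp(beta * f)))
entropy = beta * thermal_exp(f, beta) - log_Z          # definition (2)

# crude midpoint Riemann approximation of the double integral
N = 4000
ds = beta / N
s_grid = (np.arange(N) + 0.5) * ds
var_grid = np.array([thermal_var(s) for s in s_grid])
inner = np.cumsum(var_grid[::-1])[::-1] * ds           # ~ int_t^beta sigma^2 ds
double_integral = np.sum(inner) * ds                   # ~ int_0^beta (...) dt

print("Ent_f(beta)                :", entropy)
print("double integral (numerical):", double_integral)
```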

2.4 Product Spaces and Conditional Operations

We now set \(\Omega =\prod _{k=1}^{n}\Omega _{k}\) and \(d\mu =\prod _{k=1}^{n}d\mu _{k}\), where each \(\mu _{k}\) is the probability measure representing the distribution of some variable \(X_{k}\) in the space \(\Omega _{k}\), so that the \(X_{k}\) are assumed to be independent.

With \(\mathcal {A}_{k}\) we denote the subalgebra of those functions \(f\in \mathcal {A}\) that do not depend on the k-th argument. To efficiently deal with operations performed on individual arguments of functions in \(\mathcal {A}\) we need some special notation.

Now let \(k\in \left\{ 1,...,n\right\} \) and \(y\in \Omega _{k}\). If \(\Xi \) is any set and F is any function \(F:\Omega \rightarrow \Xi \), we extend the definition of the substitution operator \(S_{y}^{k}\) to F by \(S_{y}^{k}\left( F\right) =F\circ S_{y}^{k}\). This means

$$\begin{aligned} S_{y}^{k}\left( F\right) \left( x_{1},...,x_{n}\right) =F\left( x_{1},...,x_{k-1},y,x_{k+1},...,x_{n}\right) , \end{aligned}$$

so the k-th argument is simply replaced by y. Since for \(f\in \mathcal {A} \) the function \(S_{y}^{k}f\) is independent of \(x_{k}\) (which had been replaced by y) we see that \(S_{y}^{k}\) is a homomorphic (linear and multiplication-preserving) projection of \(\mathcal {A}\) onto \(\mathcal {A}_{k}\).

For \(k\in \left\{ 1,...,n\right\} \) and \(y,y^{\prime }\in \Omega _{k}\) we define the difference operator \(D_{y,y^{\prime }}^{k}:\mathcal {A}\rightarrow \mathcal {A}_{k}\) by

$$\begin{aligned} D_{y,y^{\prime }}^{k}f=S_{y}^{k}f-S_{y^{\prime }}^{k}f\text { for }f\in \mathcal {A}\text {.} \end{aligned}$$

Clearly \(D_{y,y^{\prime }}^{k}\) annihilates \(\mathcal {A}_{k}\). The operator \(r_{k}:\mathcal {A}\rightarrow \mathcal {A}_{k}\), defined by \(r_{k}f=\sup _{y,y^{\prime }\in \Omega _{k}}D_{y,y^{\prime }}^{k}f\) is called the k-th conditional range. We also use the abbreviations \(\inf _{k}f=\inf _{y\in \Omega _{k}}S_{y}^{k}f\) and \(\sup _{k}f=\sup _{y\in \Omega _{k}}S_{y}^{k}f\) for the conditional infimum and supremum.

Given the measures \(\mu _{k}\) and \(k\in \left\{ 1,...,n\right\} \) we define the operator \(E_{k}:\mathcal {A}\rightarrow \mathcal {A}_{k}\) by

$$\begin{aligned} E_{k}f=E_{y\sim \mu _{k}}\left[ S_{y}^{k}f\right] =\int _{\Omega _{k}}S_{y}^{k}f~d\mu _{k}\left( y\right) . \end{aligned}$$

The operator \(E_{k}\left[ .\right] =E\left[ .|X_{1},...,X_{k-1},X_{k+1},...,X_{n}\right] \) is the expectation conditional on all variables with indices different from k. \(E_{k}\) is a linear projection onto \(\mathcal {A}_{k}\). The operators \(E_{k}\) also commute with one another, and for \(h\in \mathcal {A}\) and \(g\in \mathcal {A}_{k}\) we have

$$\begin{aligned} E\left[ \left[ E_{k}h\right] g\right] =E\left[ E_{k}\left[ hg\right] \right] =E\left[ hg\right] . \end{aligned}$$
(4)

Replacing the operator E by \(E_{k}\) leads to the definition of conditional thermodynamic quantities, all of which are now members of the algebra \(\mathcal {A}_{k}\):

  • The conditional partition function \(Z_{k,\beta f}=E_{k}\left[ e^{\beta f}\right] \),

  • The conditional thermal expectation \(E_{k,\beta f}\left[ g\right] =Z_{k,\beta f}^{-1}E_{k}\left[ ge^{\beta f}\right] \) for \(g\in \mathcal {A},\)

  • The conditional entropy \(\mathrm {Ent}_{k,f}\left( \beta \right) =\beta E_{k,\beta f}\left[ f\right] -\ln Z_{k,\beta f},\)

  • The conditional thermal variance \(\sigma _{k,\beta f}^{2}\left[ g\right] =E_{k,\beta f}\left[ \left( g-E_{k,\beta f}\left[ g\right] \right) ^{2}\right] \) for \(g\in \mathcal {A}\). As \(\beta \rightarrow 0\) this becomes

  • The conditional variance \(\sigma _{k}^{2}\left[ g\right] =E_{k}\left[ \left( g-E_{k}\left[ g\right] \right) ^{2}\right] \) for \(g\in \mathcal {A}\).

The previously established relations hold also for the corresponding conditional quantities. Of particular importance for our method is the conditional version of Theorem 7

$$\begin{aligned} \mathrm {Ent}_{k,f}\left( \beta \right) =\int _{0}^{\beta }\int _{t}^{\beta }\sigma _{k,sf}^{2}\left[ f\right] ds~dt. \end{aligned}$$

The following lemma, which states that the conditional thermal expectation just behaves like a conditional expectation, will also be used frequently.

Lemma 8

For any \(f,g\in \mathcal {A}\), \(k\in \left\{ 1,...,n\right\} \), \(\beta \in \mathbb {R}\)

$$\begin{aligned} E_{\beta f}\left[ E_{k,\beta f}\left[ g\right] \right] =E_{\beta f}\left[ g\right] . \end{aligned}$$

Proof

Using \(E\left[ E_{k}\left[ h\right] g\right] =E\left[ hE_{k}\left[ g\right] \right] \)

$$\begin{aligned} E_{\beta f}\left[ E_{k,\beta f}\left[ g\right] \right]= & {} Z_{\beta f}^{-1}E\left[ E_{k}\left[ ge^{\beta f}\right] \frac{e^{\beta f}}{E_{k}\left[ e^{\beta f}\right] }\right] \\= & {} Z_{\beta f}^{-1}E\left[ ge^{\beta f}E_{k}\left[ \left( \frac{e^{\beta f}}{E_{k}\left[ e^{\beta f}\right] }\right) \right] \right] \\= & {} Z_{\beta f}^{-1}E\left[ ge^{\beta f}\right] \\= & {} E_{\beta f}\left[ g\right] . \end{aligned}$$

   \(\square \)

2.5 The Subadditivity of Entropy

In the non-interacting case, when the energy function f is a sum, \(f=\sum f_{k},\) it is easily verified that the conditional entropy \(\mathrm {Ent}_{k,f}\left( \beta \right) \) does not depend on \(\mathbf {x}\) and that

$$\begin{aligned} \mathrm {Ent}_{f}\left( \beta \right) =\sum _{k=1}^{n}\mathrm {Ent}_{k,f}\left( \beta \right) . \end{aligned}$$

In statistical physics it is said that entropy is an extensive quantity: the entropy of non-interacting systems is equal to the sum of the individual entropies.

Equality no longer holds in the interacting, nonlinear case, but there is a subadditivity property which is sufficient for the purpose of concentration inequalities:

The total entropy is no greater than the thermal average of the sum of the conditional entropies.

Theorem 9

For \(f\in \mathcal {A}\) and \(\beta >0\)

$$\begin{aligned} \mathrm {Ent}_{f}\left( \beta \right) \le E_{\beta f}\left[ \sum _{k=1}^{n}\mathrm {Ent}_{k,f}\left( \beta \right) \right] \end{aligned}$$
(5)

In 1975 Elliott Lieb [19] gave a proof of this result, which was probably known some time before, at least in the classical setting relevant to our arguments. Together with Theorem 5 and Theorem 7 it completes our basic toolbox to prove concentration inequalities. For the proof we need two auxiliary results.

Lemma 10

Let \(h,g>0\) be bounded measurable functions on \(\Omega \). Then for any expectation E

$$\begin{aligned} E\left[ h\right] \ln \frac{E\left[ h\right] }{E\left[ g\right] }\le E\left[ h\ln \frac{h}{g}\right] . \end{aligned}$$

Proof

Define an expectation functional \(E_{g}\) by \(E_{g}\left[ h\right] =E\left[ gh\right] /E\left[ g\right] \). The function \(\Phi \left( t\right) =t\ln t\) is convex for positive t, since \(\Phi ^{\prime \prime }=1/t>0\). Then

$$\begin{aligned} \Phi \left( E_{g}\left[ \frac{h}{g}\right] \right) =\frac{E\left[ h\right] }{E\left[ g\right] }\ln \frac{E\left[ h\right] }{E\left[ g\right] }. \end{aligned}$$

Thus, by Jensen’s inequality,

$$\begin{aligned} E\left[ h\right] \ln \frac{E\left[ h\right] }{E\left[ g\right] }= & {} E\left[ g\right] E_{g}\left[ \frac{h}{g}\right] \ln E_{g}\left[ \frac{h}{g}\right] =E\left[ g\right] \Phi \left( E_{g}\left[ \frac{h}{g}\right] \right) \\\le & {} E\left[ g\right] E_{g}\left[ \Phi \left( \frac{h}{g}\right) \right] =E\left[ h\ln \frac{h}{g}\right] . \end{aligned}$$

   \(\square \)

Next we prove (5) for general positive functions.

Lemma 11

Let \(\rho \in \mathcal {A}\), \(\rho >0\). Then

$$\begin{aligned} E\left[ \rho \ln \frac{\rho }{E\left[ \rho \right] }\right] \le \sum _{k}E\left[ \rho \ln \frac{\rho }{E_{k}\left[ \rho \right] }\right] . \end{aligned}$$

Proof

Write \(E^{k}\left[ .\right] =E_{1}E_{2}...E_{k}\left[ .\right] \) with \(E^{0}\) being the identity map on \(\mathcal {A}\). The innocuous looking identity \(E\left[ E^{k}\left[ .\right] \right] =E\left[ .\right] \) is an obvious consequence of the fact that we work with product probabilities. Without independence it would not hold, and the following simple argument would break down. Note that \(E^{n}=E\). We expand

$$\begin{aligned} \frac{\rho }{E\left[ \rho \right] }=\frac{E^{0}\left[ \rho \right] }{E^{1}\left[ \rho \right] }\frac{E^{1}\left[ \rho \right] }{E^{2}\left[ \rho \right] }...\frac{E^{n-1}\left[ \rho \right] }{E^{n}\left[ \rho \right] }=\prod _{k=1}^{n}\frac{E^{k-1}\left[ \rho \right] }{E^{k-1}\left[ E_{k}\left[ \rho \right] \right] }. \end{aligned}$$

We get from Lemma 10, using \(E\left[ E^{k-1}\left[ .\right] \right] =E\left[ .\right] ,\)

$$\begin{aligned} E\left[ \rho \ln \frac{\rho }{E\left[ \rho \right] }\right]= & {} \sum _k E\left[ E^{k-1}\left[ \rho \right] \ln \frac{E^{k-1}\left[ \rho \right] }{E^{k-1}\left[ E_{k}\left[ \rho \right] \right] }\right] \\\le & {} \sum _{k}E\left[ E^{k-1}\left[ \rho \ln \frac{\rho }{E_{k}\left[ \rho \right] }\right] \right] =\sum _{k}E\left[ \rho \ln \frac{\rho }{E_{k}\left[ \rho \right] }\right] \end{aligned}$$

   \(\square \)

Finally we specialize to the canonical entropy.

Proof of Theorem 9

Set \(\rho =e^{\beta f}\) in Lemma 11 to get

$$\begin{aligned} \mathrm {Ent}_{f}\left( \beta \right)= & {} Z_{\beta f}^{-1}E\left[ e^{\beta f}\ln \frac{e^{\beta f}}{E\left[ e^{\beta f}\right] }\right] \\\le & {} Z_{\beta f}^{-1}\sum _{k}E\left[ e^{\beta f}\ln \frac{e^{\beta f}}{E_{k}\left[ e^{\beta f}\right] }\right] \\= & {} \sum _{k}E_{\beta f}\left[ \beta f-\ln E_{k}\left[ e^{\beta f}\right] \right] \\= & {} E_{\beta f}\left[ \sum _{k}\text {Ent}_{k,f}\left( \beta \right) \right] , \end{aligned}$$

where we used Lemma 8 in the last identity.   \(\square \)

2.6 Summary of Results

The exponential moment method, Corollary 3, and Theorems 5, 7, and 9 provide us with the tools to prove several useful concentration inequalities. Here is a summary:

Theorem 12

For \(f\in \mathcal {A}\) and \(\beta >0\) we have

$$\begin{aligned} \Pr \left\{ f-Ef>t\right\}\le & {} E\left[ e^{\beta \left( f-Ef\right) }\right] e^{-\beta t} \end{aligned}$$
(6)
$$\begin{aligned} \ln E\left[ e^{\beta \left( f-Ef\right) }\right]= & {} \beta \int _{0}^{\beta }\frac{\mathrm {Ent}_{f}\left( \gamma \right) }{\gamma ^{2}}d\gamma \end{aligned}$$
(7)
$$\begin{aligned} \mathrm {Ent}_{f}\left( \beta \right)\le & {} E_{\beta f}\left[ \sum _{k=1}^{n}\mathrm {Ent}_{k,f}\left( \beta \right) \right] \end{aligned}$$
(8)
$$\begin{aligned} \mathrm {Ent}_{f}\left( \beta \right)= & {} \int _{0}^{\beta }\int _{t}^{\beta }\sigma _{sf}^{2}\left[ f\right] ds~dt \end{aligned}$$
(9)
$$\begin{aligned} \mathrm {Ent}_{k,f}\left( \beta \right)= & {} \int _{0}^{\beta }\int _{t}^{\beta }\sigma _{k,sf}^{2}\left[ f\right] ds~dt \end{aligned}$$
(10)

Concatenating the exponential moment bound (6), the entropy representation (7) of the logarithmic moment generating function, the subadditivity of entropy (8) and the fluctuation representation of the conditional entropy (10), we obtain the following generic concentration inequality.

$$\begin{aligned} \Pr \left\{ f-Ef>t\right\} \le \inf _{\beta >0}\exp \left( \beta \int _{0}^{\beta }\gamma ^{-2}E_{\gamma f}\left[ \sum _{k=1}^{n}\int _{0}^{\gamma }\int _{t}^{\gamma }\sigma _{k,sf}^{2}\left[ f\right] ds~dt\right] d\gamma -\beta t\right) . \end{aligned}$$

This is the template for the results given in the next section.

3 First Applications of the Entropy Method

We now develop some first consequences of the method, beginning with the Efron–Stein inequality, a general bound on the variance. We then continue with the derivation of the bounded difference inequality, a simple and perhaps the most useful concentration inequality, for which we give two illustrative applications. Finally we give a Bennett–Bernstein-type inequality, which we apply to the concentration of vector-valued random variables.

3.1 The Efron–Stein Inequality

Combining the fluctuation representations (9) and (10) with the subadditivity (8) of entropy and dividing by \(\beta ^{2}\) we obtain

$$\begin{aligned} \frac{1}{\beta ^{2}}\int _{0}^{\beta }\int _{t}^{\beta }\sigma _{sf}^{2}\left[ f\right] ds~dt\le E_{\beta f}\left[ \sum _{k=1}^{n}\frac{1}{\beta ^{2}}\int _{0}^{\beta }\int _{t}^{\beta }\sigma _{k,sf}^{2}\left[ f\right] ds~dt\right] \text {.} \end{aligned}$$

Using the continuity properties of \(\beta \mapsto E_{\beta f}\left[ g\right] \) and \(\beta \mapsto \sigma _{\beta f}^{2}\left[ f\right] \), which follow from Lemma 6, we can take the limit as \(\beta \rightarrow 0\) and multiply by 2 to obtain

$$\begin{aligned} \sigma ^{2}\left[ f\right] \le E\left[ \sum _{k}\sigma _{k}^{2}\left[ f\right] \right] =E\left[ \Sigma ^{2}\left( f\right) \right] , \end{aligned}$$
(11)

where we introduced the notation \(\Sigma ^{2}\left( f\right) =\sum _{k}\sigma _{k}^{2}\left[ f\right] \) for the sum of conditional variances.

Equation (11) is the famous Efron–Stein–Steele inequality [26]. It is an easy exercise to provide the details of the above limit process and to extend the inequality to general functions \(f\in L_{2}\left[ \mu \right] \) by approximation with a sequence of truncations.
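The following Monte Carlo sketch is an added illustration (the choice \(f\left( \mathbf {x}\right) =\max \left( x_{1},...,x_{n}\right) \) with uniform coordinates, and all sample sizes, are assumptions); it compares the variance of f with the Efron–Stein bound (11), estimating each conditional variance by resampling a single coordinate.

```python
import numpy as np

# Sketch of the Efron-Stein bound (11) for f(x) = max(x_1,...,x_n),
# x_i independent Uniform[0,1]. All sizes below are illustrative assumptions.
rng = np.random.default_rng(3)
n, n_outer, n_inner = 10, 2000, 200

X = rng.uniform(size=(n_outer, n))
var_f = X.max(axis=1).var()

cond_var_sum = np.zeros(n_outer)
for k in range(n):
    Y = np.repeat(X[:, None, :], n_inner, axis=1)        # copies of each sample
    Y[:, :, k] = rng.uniform(size=(n_outer, n_inner))    # resample coordinate k
    cond_var_sum += Y.max(axis=2).var(axis=1)            # conditional variance at k

print("Var[f]                       :", var_f)
print("Efron-Stein bound (estimated):", cond_var_sum.mean())
```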

3.2 The Bounded Difference Inequality

The variance of a bounded real random variable is never greater than a quarter of the square of its range.

Lemma 13

If \(f\in \mathcal {A}\) satisfies \(a\le f\le b\) then \(\sigma ^{2}\left[ f\right] \le \left( b-a\right) ^{2}/4\).

Proof

$$\begin{aligned} \sigma ^{2}\left( f\right)= & {} E\left[ \left( f-E\left[ f\right] \right) f\right] =E\left[ \left( f-E\left[ f\right] \right) \left( f-a\right) \right] \\\le & {} E\left[ \left( b-E\left[ f\right] \right) \left( f-a\right) \right] =\left( b-E\left[ f\right] \right) \left( E\left[ f\right] -a\right) \\\le & {} \frac{\left( b-a\right) ^{2}}{4}. \end{aligned}$$

To see the last inequality, use calculus to find the maximal value of the function \(t\mapsto \left( b-t\right) \left( t-a\right) \).   \(\square \)

The bounded difference inequality bounds the deviation of a function from its mean in terms of the sum of squared conditional ranges, which is the operator \(R^{2}:\mathcal {A\rightarrow A}\) defined by

$$\begin{aligned} R^{2}\left( f\right) =\sum _{k=1}^{n}r_{k}\left( f\right) ^{2}=\sum _{k=1}^{n}\sup _{y,y^{\prime }\in \Omega _{k}}\left( D_{y,y^{\prime }}^{k}f\right) ^{2}. \end{aligned}$$

Theorem 14

(Bounded difference inequality) For \(f\in \mathcal {A}\) and \(t>0\)

$$\begin{aligned} \Pr \left\{ f-Ef>t\right\} \le \exp \left( \frac{-2t^{2}}{\sup _{\mathbf {x}\in \Omega }R^{2}\left( f\right) \left( \mathbf {x}\right) }\right) . \end{aligned}$$

Proof

Applied to the conditional thermal variance, Lemma 13 gives

$$\begin{aligned} \sigma _{k,sf}^{2}\left[ f\right] \le \frac{1}{4}\sup _{y,y^{\prime }\in \Omega _{k}}\left( D_{y,y^{\prime }}^{k}f\right) ^{2}=\frac{1}{4}r_{k}\left( f\right) ^{2}, \end{aligned}$$

so combining the subadditivity of entropy (8) and the fluctuation representation (10) gives

$$\begin{aligned} \text {Ent}_{f}\left( \gamma \right)\le & {} E_{\gamma f}\left[ \sum _{k=1}^{n}\mathrm {Ent}_{k,f}\left( \gamma \right) \right] =E_{\gamma f}\left[ \sum _{k=1}^{n}\int _{0}^{\gamma }\int _{t}^{\gamma }\sigma _{k,sf}^{2}\left[ f\right] ds~dt\right] \nonumber \\\le & {} \frac{1}{4}E_{\gamma f}\left[ \int _{0}^{\gamma }\int _{t}^{\gamma }\sum _{k=1}^{n}r_{k}\left( f\right) ^{2}ds~dt\right] \nonumber \\= & {} \frac{\gamma ^{2}}{8}E_{\gamma f}\left[ R^{2}\left( f\right) \right] . \end{aligned}$$
(12)

Bounding the thermal expectation \(E_{\gamma f}\) by the supremum over \(\mathbf {x}\in \Omega \) we obtain from Theorem 12 (7)

$$\begin{aligned} \ln E\left[ e^{\beta \left( f-Ef\right) }\right] =\beta \int _{0}^{\beta }\frac{\mathrm {Ent}_{f}\left( \gamma \right) }{\gamma ^{2}}d\gamma \le \frac{\beta ^{2}}{8}\sup _{\mathbf {x}\in \Omega }R^{2}\left( f\right) \left( \mathbf {x}\right) , \end{aligned}$$

and the tail bound (6) gives for all \(\beta >0\)

$$\begin{aligned} \Pr \left\{ f-Ef>t\right\} \le \exp \left( \frac{\beta ^{2}}{8}\sup _{\mathbf {x}\in \Omega }R^{2}\left( f\right) \left( \mathbf {x}\right) -\beta t\right) . \end{aligned}$$

Substitution of the minimizing value \(\beta =4t/\left( \sup _{\mathbf {x}\in \Omega }R^{2}\left( f\right) \left( \mathbf {x}\right) \right) \) completes the proof.   \(\square \)

Notice that the conditional range \(r_{k}\left( f\right) \) is a function in \(\mathcal {A}_{k}\) and may depend on all \(x_{i}\) except \(x_{k}\). The sum \(R^{2}\left( f\right) =\sum _{k=1}^{n}r_{k}\left( f\right) ^{2}\) may thus depend on all the \(x_{i}\). It is therefore a very pleasant feature that the supremum over \(\mathbf {x}\) is taken outside the sum. In the sequel this will allow us to derive the Gaussian concentration inequality from Theorem 14. The bound (12) will be re-used in Sect. 5.4 to prove a version of the Hanson-Wright inequality for quadratic forms.

In the literature one often sees the following weaker version of Theorem 14.

Corollary 15

For \(f\in \mathcal {A}\) and \(t>0\)

$$\begin{aligned} \Pr \left\{ f-Ef>t\right\} \le \exp \left( \frac{-2t^{2}}{\sum _{k=1}^{n}\sup _{\mathbf {x}\in \Omega }r_{k}\left( f\right) ^{2}\left( \mathbf {x}\right) }\right) . \end{aligned}$$

If f is a sum, \(f\left( \mathbf {x}\right) =\sum _{k}x_{k}\), then \(r_{k}\left( f\right) ^{2}\) is independent of \(\mathbf {x}\) and the two results are equivalent. In this case we obtain the well-known Hoeffding inequality [16].

Corollary 16

(Hoeffding’s inequality) Let \(X_{k}\) be independent real random variables with \(a_{k}\le X_{k}\le b_{k}\). Then

$$\begin{aligned} \Pr \left\{ \sum _{k}\left( X_{k}-E\left[ X_{k}\right] \right) >t\right\} \le \exp \left( \frac{-2t^{2}}{\sum _{k=1}^{n}\left( b_{k}-a_{k}\right) ^{2}}\right) . \end{aligned}$$

Returning to the general case of non-additive functions, it is remarkable that for many applications the following “little bounded difference inequality”, which is weaker still than Corollary 15, seems to be sufficient.

Corollary 17

For \(f\in \mathcal {A}\) and \(t>0\)

$$\begin{aligned} \Pr \left\{ f-Ef>t\right\} \le \exp \left( \frac{-2t^{2}}{nc^{2}}\right) , \end{aligned}$$

where

$$\begin{aligned} c=\max _{k}\sup _{\mathbf {x}\in \Omega ,y,y^{\prime }\in \Omega _{k}}D_{y,y^{\prime }}^{k}f\left( \mathbf {x}\right) . \end{aligned}$$
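As an illustration of Corollary 17 for a genuinely non-additive function (the example below, the number of distinct values in a sample, is an added assumption and not from the text), note that changing a single coordinate changes the count of distinct values by at most one, so c = 1. The Python sketch compares the resulting bound with empirical deviation frequencies.

```python
import numpy as np

# Sketch: Corollary 17 for f(x) = number of distinct values among
# x_1,...,x_n drawn uniformly from {1,...,m}; here c = 1 (assumed example).
rng = np.random.default_rng(4)
n, m, trials = 100, 50, 50_000

X = rng.integers(1, m + 1, size=(trials, n))
f = np.array([len(np.unique(row)) for row in X])
Ef = f.mean()

for t in (3, 5, 8):
    empirical = np.mean(f - Ef > t)
    bound = np.exp(-2 * t ** 2 / n)      # little bounded difference inequality
    print(f"t={t}: empirical {empirical:.4g}   bound {bound:.4g}")
```

The bound is valid but rather crude for this example, since it ignores the much smaller variance of f; the Bennett–Bernstein inequalities of Sect. 3.5 exploit such variance information.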

3.3 Vector-Valued Concentration

Suppose the \(X_{i}\) are independent random variables with values in a normed space \(\mathcal {B}\) such that \(EX_{i}=0\) and \(\left\| X_{i}\right\| \le c_{i}\). Let \(\Omega _{i}=\left\{ y\in \mathcal {B}:\left\| y\right\| \le c_{i}\right\} \) and define \(f:\prod _{i=1}^{n}\Omega _{i}\rightarrow \mathbb {R}\) by

$$\begin{aligned} f\left( \mathbf {x}\right) =\left\| \sum _{i}x_{i}\right\| \text {.} \end{aligned}$$

Then by the triangle inequality, for \(y,y^{\prime }\) with \(\left\| y\right\| ,\left\| y^{\prime }\right\| \le c_{k}\)

$$\begin{aligned} D_{y,y^{\prime }}^{k}f\left( \mathbf {x}\right)= & {} \left\| \sum _{i}S_{y}^{k}\left( \mathbf {x}\right) _{i}\right\| -\left\| \sum _{i}S_{y^{\prime }}^{k}\left( \mathbf {x}\right) _{i}\right\| \\\le & {} \left\| \sum _{i}S_{y}^{k}\left( x\right) _{i}-\sum _{i}S_{y^{\prime }}^{k}\left( x\right) _{i}\right\| =\left\| y-y^{\prime }\right\| \\\le & {} 2c_{k}, \end{aligned}$$

so \(R^{2}\left( f\right) \left( \mathbf {x}\right) \le 4\sum _{i}c_{i}^{2}\). It follows from Corollary 15 that

$$\begin{aligned} \Pr \left\{ f-E\left[ f\right] >t\right\} \le \exp \left( \frac{-t^{2}}{2\sum _{i}c_{i}^{2}}\right) , \end{aligned}$$

or that for \(\delta >0\) with probability at least \(1-\delta \) in \(\left( X_{1},...,X_{n}\right) \)

$$\begin{aligned} \left\| \sum _{i}X_{i}\right\| \le E\left\| \sum _{i}X_{i}\right\| +\sqrt{2\sum _{i}c_{i}^{2}\ln \left( 1/\delta \right) }. \end{aligned}$$
(13)

If \(\mathcal {B}\) is a Hilbert space we can bound \(E\left\| \sum _{i}X_{i}\right\| \le \sqrt{\sum _{i}E\left[ \left\| X_{i}\right\| ^{2}\right] }\) by Jensen’s inequality and if all the \(X_{i}\) are iid we get with probability at least \(1-\delta \)

$$\begin{aligned} \left\| \frac{1}{n}\sum _{i}X_{i}\right\| \le \sqrt{\frac{E\left[ \left\| X_{1}\right\| ^{2}\right] }{n}}+c_{1}\sqrt{\frac{2\ln \left( 1/\delta \right) }{n}} \end{aligned}$$
(14)
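A Monte Carlo sketch of the bound (14) is given below (an added illustration; the choice of vectors uniform on the unit sphere, the dimension and the sample sizes are assumptions). It compares the empirical \(1-\delta \) quantile of \(\left\| \frac{1}{n}\sum _{i}X_{i}\right\| \) with the right-hand side of (14).

```python
import numpy as np

# Sketch of (14): X_i iid uniform on the unit sphere in R^d, so EX_i = 0,
# ||X_i|| <= c = 1 and E||X_1||^2 = 1. All sizes are illustrative assumptions.
rng = np.random.default_rng(5)
d, n, trials, delta, c = 20, 200, 2000, 0.05, 1.0

G = rng.normal(size=(trials, n, d))
X = G / np.linalg.norm(G, axis=2, keepdims=True)    # uniform on the unit sphere
norms = np.linalg.norm(X.mean(axis=1), axis=1)      # ||(1/n) sum_i X_i||

bound = np.sqrt(1.0 / n) + c * np.sqrt(2 * np.log(1 / delta) / n)
print("empirical (1-delta)-quantile:", np.quantile(norms, 1 - delta))
print("bound (14)                  :", bound)
```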

3.4 Rademacher Complexities and Generalization

Now let \(\mathcal {X}\) be any measurable space, \(\mathcal {F}\) a countable class of functions \(f:\mathcal {X\rightarrow }\left[ 0,1\right] \), and \(\mathbf {X}=\left( X_{1},...,X_{n}\right) \) a vector of iid random variables with values in \(\mathcal {X}\).

Empirical risk minimization seeks an \(f\in \mathcal {F}\) with minimal risk \(E\left[ f\left( X\right) \right] \), but, as the true distribution of X is unknown, it has to be content with minimizing the empirical surrogate

$$\begin{aligned} \frac{1}{n}\sum _{i}f\left( X_{i}\right) . \end{aligned}$$

One way to justify this method is by giving a bound on the uniform estimation error

$$\begin{aligned} \sup _{f\in \mathcal {F}}\frac{1}{n}\left| \sum _{i}f\left( X_{i}\right) -E\left[ f\left( X\right) \right] \right| . \end{aligned}$$

The vector space

$$\begin{aligned} \mathcal {B}=\left\{ g:\mathcal {F}\rightarrow \mathbb {R}:\sup _{f\in \mathcal {F}}\left| g\left( f\right) \right| <\infty \right\} \end{aligned}$$

becomes a normed space with norm \(\left\| g\right\| =\sup _{f\in \mathcal {F}}\left| g\left( f\right) \right| \). For each \(X_{i}\) define \(\hat{X}_{i}\) \(\in \mathcal {B}\) by \(\hat{X}_{i}\left( f\right) =f\left( X_{i}\right) -E\left[ f\left( X_{i}\right) \right] \). Then the \(\hat{X}_{i}\) are zero mean random variables in \(\mathcal {B}\) satisfying \(\left\| \hat{X}_{i}\right\| \le 1\), and (13) of the preceding section gives with probability at least \(1-\delta \)

$$\begin{aligned} \sup _{f\in \mathcal {F}}\left| \frac{1}{n}\sum _{i}f\left( X_{i}\right) -E\left[ f\left( X_{i}\right) \right] \right| \le \frac{1}{n}E\sup _{f\in \mathcal {F}}\left| \sum _{i}f\left( X_{i}\right) -E\left[ f\left( X_{i}\right) \right] \right| +\sqrt{\frac{2\ln \left( 1/\delta \right) }{n}}. \end{aligned}$$

The expectation term on the right-hand side can be bounded in terms of the Rademacher complexity [3], the function \({\mathcal {R}}\left( \mathcal {F},\cdot \right) :\mathcal {X}^{n}\rightarrow \mathbb {R}\) defined as

$$\begin{aligned} {\mathcal {R}}\left( \mathcal {F},\mathbf {x}\right) =\frac{2}{n}E_{\mathbf {\epsilon }}\sup _{f\in \mathcal {F}}\left| \sum _{i}\epsilon _{i}f\left( x_{i}\right) \right| , \end{aligned}$$

where \(\mathbf {\epsilon }=\left( \epsilon _{1},...,\epsilon _{n}\right) \) is a vector of independent Rademacher variables, each uniformly distributed on \(\left\{ -1,1\right\} \). We have, with \(X_{i}^{\prime }\) iid to \(X_{i}\)

$$\begin{aligned} \frac{1}{n}E\sup _{f\in \mathcal {F}}\left| \sum _{i}f\left( X_{i}\right) -E\left[ f\left( X_{i}\right) \right] \right|\le & {} \frac{1}{n}E_{\mathbf {XX}^{\prime }}\sup _{f\in \mathcal {F}}\left| \sum _{i}f\left( X_{i}\right) -f\left( X_{i}^{\prime }\right) \right| \\= & {} \frac{1}{n}E_{\mathbf {XX}^{\prime }}\sup _{f\in \mathcal {F}}\left| \sum _{i}\epsilon _{i}\left( f\left( X_{i}\right) -f\left( X_{i}^{\prime }\right) \right) \right| , \end{aligned}$$

for any \(\epsilon \in \left\{ -1,1\right\} ^{n}\), because the expectation is invariant under the interchange of \(X_{i}\) and \(X_{i}^{\prime }\) on an arbitrary subset of indices. Passing to the expectation in \(\epsilon \) and using the triangle inequality gives

$$\begin{aligned} \frac{1}{n}E\sup _{f\in \mathcal {F}}\left| \sum _{i}f\left( X_{i}\right) -E\left[ f\left( X_{i}\right) \right] \right|\le & {} \frac{1}{n}E_{\mathbf {XX}^{\prime }}E_{\mathbf {\epsilon }}\sup _{f\in \mathcal {F}}\left| \sum _{i}\epsilon _{i}\left( f\left( X_{i}\right) -f\left( X_{i}^{\prime }\right) \right) \right| \\\le & {} \frac{2}{n}E_{\mathbf {X}}E_{\mathbf {\epsilon }}\sup _{f\in \mathcal {F}}\left| \sum _{i}\epsilon _{i}f\left( X_{i}\right) \right| \\= & {} E_{\mathbf {X}}{\mathcal {R}}\left( \mathcal {F},\mathbf {X}\right) . \end{aligned}$$

Now we use the bounded difference inequality again to bound the deviation of \(\mathcal {R}\left( \mathcal {F},\mathbf {.}\right) \) from its expectation. We have, again using the triangle inequality,

$$\begin{aligned} D_{y,y^{\prime }}^{k}{\mathcal {R}}\left( \mathcal {F},\mathbf {x}\right)= & {} \frac{2}{n}E_{\mathbf {\epsilon }}\left[ \sup _{f\in \mathcal {F}}\left| \sum _{i}\epsilon _{i}S_{y}^{k}f\left( x_{i}\right) \right| -\sup _{f\in \mathcal {F}}\left| \sum _{i}\epsilon _{i}S_{y^{\prime }}^{k}f\left( x_{i}\right) \right| \right] \\\le & {} \frac{2}{n}E_{\mathbf {\epsilon }}\left[ \sup _{f\in \mathcal {F}}\left| \epsilon _{k}\left( f\left( y\right) -f\left( y^{\prime }\right) \right) \right| \right] \le \frac{2}{n} \end{aligned}$$

and thus obtain

$$\begin{aligned} \Pr \left\{ E\left[ {\mathcal {R}}\left( \mathcal {F},\mathbf {.}\right) \right] >{\mathcal {R}}\left( \mathcal {F},\mathbf {.}\right) +t\right\} \le e^{-nt^{2}/2}, \end{aligned}$$

or, for every \(\delta >0\) with probability at least \(1-\delta \)

$$\begin{aligned} E\left[ {\mathcal {R}}\left( \mathcal {F},\mathbf {X}\right) \right] \le {\mathcal {R}}\left( \mathcal {F},\mathbf {X}\right) +\sqrt{\frac{2\ln \left( 1/\delta \right) }{n}}. \end{aligned}$$
(15)

By a union bound we conclude that with probability at least \(1-\delta \)

$$\begin{aligned} \sup _{f\in \mathcal {F}}\left| \frac{1}{n}\sum _{i}f\left( X_{i}\right) -E\left[ f\left( X_{i}\right) \right] \right| \le {\mathcal {R}}\left( \mathcal {F},\mathbf {X}\right) +2\sqrt{\frac{2\ln \left( 2/\delta \right) }{n}}. \end{aligned}$$
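For a concrete, entirely illustrative finite class, the empirical Rademacher complexity can be estimated by Monte Carlo over the Rademacher vector. The sketch below does this for threshold functions on [0,1]; the class, the grid of thresholds, and all sample sizes are assumptions made only for the example.

```python
import numpy as np

# Sketch: Monte Carlo estimate of R(F, x) for the (assumed) class of
# threshold functions f_theta(x) = 1[x <= theta], theta on a finite grid.
rng = np.random.default_rng(6)
n, n_thresholds, n_eps = 200, 100, 2000

x = rng.uniform(size=n)                               # the sample x_1,...,x_n
thetas = np.linspace(0.0, 1.0, n_thresholds)
F = (x[None, :] <= thetas[:, None]).astype(float)     # one row per f in F

eps = rng.choice([-1.0, 1.0], size=(n_eps, n))        # Rademacher vectors
sup_abs = np.abs(eps @ F.T).max(axis=1)               # sup_f |sum_i eps_i f(x_i)|
R_hat = 2.0 / n * sup_abs.mean()
print("estimated empirical Rademacher complexity:", R_hat)
```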

3.5 The Bennett and Bernstein Inequalities

The proof of the bounded difference inequality relied on bounding the thermal variance \(\sigma _{k,\beta f}^{2}\left( f\right) \) uniformly in \(\beta \), using the constraints on the conditional ranges of f. We now consider the case where we use only one constraint on the ranges, say \(f-E_{k}\left[ f\right] \le 1\), together with information on the conditional variances. This leads to a Bennett-type inequality as in [23]. Recall the notation for the sum of conditional variances \(\Sigma ^{2}\left( f\right) :=\sum \sigma _{k}^{2}\left( f\right) \). Again we start with a bound on the thermal variance.

Lemma 18

Assume \(f-Ef\le 1\). Then for \(\beta >0\)

$$\begin{aligned} \sigma _{\beta f}^{2}\left( f\right) \le e^{\beta }\sigma ^{2}\left( f\right) \end{aligned}$$

Proof

$$\begin{aligned} \sigma _{\beta f}^{2}\left( f\right)= & {} \sigma _{\beta \left( f-Ef\right) }^{2}\left( f-Ef\right) =E_{\beta \left( f-Ef\right) }\left[ \left( f-Ef\right) ^{2}\right] -\left( E_{\beta \left( f-Ef\right) }\left[ f-Ef\right] \right) ^{2} \\\le & {} E_{\beta \left( f-Ef\right) }\left[ \left( f-Ef\right) ^{2}\right] =\frac{E\left[ \left( f-Ef\right) ^{2}e^{\beta \left( f-Ef\right) }\right] }{E\left[ e^{\beta \left( f-Ef\right) }\right] } \\\le & {} E\left[ \left( f-Ef\right) ^{2}e^{\beta \left( f-Ef\right) }\right] \text { (by Jensen's inequality)} \\\le & {} e^{\beta }E\left[ \left( f-Ef\right) ^{2}\right] \text { (now using }f-Ef\le 1\text {).} \end{aligned}$$

   \(\square \)

Next we bound the entropy Ent\(_{f}\left( \beta \right) \).

Lemma 19

Assume that \(f-E_{k}f\le 1\) for all \(k\in \left\{ 1,...,n\right\} \). Then for \(\beta >0\)

$$\begin{aligned} \text {Ent}_{f}\left( \beta \right) \le \left( \beta e^{\beta }-e^{\beta }+1\right) ~E_{\beta f}\left[ \Sigma ^{2}\left( f\right) \right] . \end{aligned}$$

Proof

From Theorem 12 and the conditional version of the previous lemma we get

$$\begin{aligned} \text {Ent}_{f}\left( \beta \right) \le E_{\beta f}\left[ \sum _{k=1}^{n}\int _{0}^{\beta }\int _{t}^{\beta }\sigma _{k,sf}^{2}\left[ f\right] ds~dt\right] \le \int _{0}^{\beta }\int _{t}^{\beta }e^{s}ds~dt~E_{\beta f}\left[ \Sigma ^{2}\left( f\right) \right] . \end{aligned}$$

The conclusion follows from the formula

$$\begin{aligned} \int _{0}^{\beta }\int _{t}^{\beta }e^{s}ds~dt=\int _{0}^{\beta }\left( e^{\beta }-e^{t}\right) dt=\beta e^{\beta }-e^{\beta }+1. \end{aligned}$$

   \(\square \)

We need one more technical lemma.

Lemma 20

For \(x\ge 0\)

$$\begin{aligned} \left( 1+x\right) \ln \left( 1+x\right) -x\ge 3x^{2}/\left( 6+2x\right) . \end{aligned}$$

Proof

We have to show that

$$\begin{aligned} f_{1}\left( x\right) :=\left( 6+8x+2x^{2}\right) \ln \left( 1+x\right) -6x-5x^{2}\ge 0. \end{aligned}$$

Since \(f_{1}\left( 0\right) =0\) and \(f_{1}^{\prime }\left( x\right) =4f_{2}\left( x\right) \) with \(f_{2}\left( x\right) :=\left( 2+x\right) \ln \left( 1+x\right) -2x\), it is enough to show that \(f_{2}\left( x\right) \ge 0\). But \(f_{2}\left( 0\right) =0\) and \(f_{2}^{\prime }\left( x\right) =\left( 1+x\right) ^{-1}+\ln \left( 1+x\right) -1\), so \(f_{2}^{\prime }\left( 0\right) =0\), while \(f_{2}^{\prime \prime }\left( x\right) =x\left( 1+x\right) ^{-2}\ge 0\), so \(f_{2}^{\prime }\ge 0\) and therefore \(f_{2}\left( x\right) \ge 0\).   \(\square \)

Now we can prove our version of Bennett’s inequality.

Theorem 21

Assume \(f-E_{k}f\le 1,\forall k\). Let \(t>0\) and denote \(V=\sup _{\mathbf {x}\in \Omega }\Sigma ^{2}\left( f\right) \left( \mathbf {x}\right) \). Then

$$\begin{aligned} \Pr \left\{ f-E\left[ f\right] >t\right\}\le & {} \exp \left( -V\left( \left( 1+tV^{-1}\right) \ln \left( 1+tV^{-1}\right) -tV^{-1}\right) \right) \\\le & {} \exp \left( \frac{-t^{2}}{2V+2t/3}\right) . \end{aligned}$$

Proof

Fix \(\beta >0\). We define the real function

$$\begin{aligned} \psi \left( t\right) =e^{t}-t-1, \end{aligned}$$
(16)

which arises from deleting the first two terms in the power series expansion of the exponential function, and observe that

$$\begin{aligned} \int _{0}^{\beta }\frac{\gamma e^{\gamma }-e^{\gamma }+1}{\gamma ^{2}}d\gamma =\beta ^{-1}\left( e^{\beta }-\beta -1\right) =\beta ^{-1}\psi \left( \beta \right) , \end{aligned}$$

because \(\left( d/d\gamma \right) \left( \gamma ^{-1}\left( e^{\gamma }-1\right) \right) =\gamma ^{-2}\left( \gamma e^{\gamma }-e^{\gamma }+1\right) \) and \(\lim _{\gamma \rightarrow 0}\gamma ^{-1}\left( e^{\gamma }-1\right) =1\). Theorem 12 and Lemma 19 combined with a uniform bound then give

$$\begin{aligned} \ln Ee^{\beta \left( f-Ef\right) }= & {} \beta \int _{0}^{\beta }\frac{\text {Ent}_{f}\left( \gamma \right) d\gamma }{\gamma ^{2}} \\\le & {} \beta \left( \int _{0}^{\beta }\frac{\gamma e^{\gamma }-e^{\gamma }+1}{\gamma ^{2}}d\gamma \right) \sup _{\mathbf {x}\in \Omega }\Sigma ^{2}\left( f\right) \left( \mathbf {x}\right) =\psi \left( \beta \right) V. \end{aligned}$$

It now follows from Theorem 12 that \(\Pr \left\{ f-E\left[ f\right] >t\right\} \le \exp \left( \psi \left( \beta \right) V-\beta t\right) \) for any \(\beta >0\). Substitution of \(\beta =\ln \left( 1+tV^{-1}\right) \) gives the first inequality; the second follows from Lemma 20.   \(\square \)

Observe that the assumptions of the theorem bound f from above. Since the existence of exponential moments \(E\left[ e^{\beta f}\right] \) is needed only for \(\beta \ge 0\), the assumption \(f\in \mathcal {A}\) can be dropped in this case.

If f is additive the theorem reduces to the familiar Bennett and Bernstein inequalities [16].

Corollary 22

Let \(X_{k}\) be independent real random variables with \(X_{k}\le E\left[ X_{k}\right] +1\) and let \(V=\sum _{k}\sigma ^{2}\left( X_{k}\right) \). Then

$$\begin{aligned} \Pr \left\{ \sum _{k}\left( X_{k}-E\left[ X_{k}\right] \right) >t\right\}\le & {} \exp \left( -V\left( \left( 1+tV^{-1}\right) \ln \left( 1+tV^{-1}\right) -tV^{-1}\right) \right) \\\le & {} \exp \left( \frac{-t^{2}}{2V+2t/3}\right) . \end{aligned}$$

Theorem 21 and its corollary can be applied to functions satisfying \(f-E_{k}\left[ f\right] <b\) by a simple rescaling argument. Then Bernstein’s inequality becomes

$$\begin{aligned} \Pr \left\{ f-E\left[ f\right] >t\right\} \le \exp \left( \frac{-t^{2}}{2\sup _{\mathbf {x}\in \Omega }\Sigma ^{2}\left( f\right) \left( \mathbf {x}\right) +2bt/3}\right) . \end{aligned}$$

Inequalities of this kind exhibit two types of tails, depending on which of the two terms in the denominator \(A+Bt\) of the exponent is dominant. In the sub-Gaussian regime \(A\gg Bt\) the tail decays as \(e^{-t^{2}/A}\). This is the way the bounded difference inequality behaves globally, but with a very crude approximation for A, while Bernstein’s inequality uses variance information. For larger deviations, when \(A\ll Bt\), the tail only decays as \(e^{-t/B}\). This subexponential behavior is absent in the bounded difference inequality and is the price paid for the fine-tuning in Bernstein’s inequality.
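The two regimes are easy to see numerically. The comparison below is an added illustration (the Bernoulli parameters are assumptions); it evaluates the Bennett bound, the Bernstein bound and the Hoeffding bound of Corollary 16 for a sum of Bernoulli variables with small success probability, where the variance term V is much smaller than the worst-case range term.

```python
import numpy as np

# Sketch: Bennett vs. Bernstein vs. Hoeffding for a sum of n Bernoulli(p)
# variables with X_k - E[X_k] <= 1. Parameters are illustrative assumptions.
n, p = 1000, 0.01
V = n * p * (1 - p)                      # sum of the variances

for t in (5.0, 10.0, 20.0):
    u = t / V
    bennett = np.exp(-V * ((1 + u) * np.log(1 + u) - u))
    bernstein = np.exp(-t ** 2 / (2 * V + 2 * t / 3))
    hoeffding = np.exp(-2 * t ** 2 / n)  # Corollary 16 with b_k - a_k = 1
    print(f"t={t:5.1f}  Bennett {bennett:.3g}  Bernstein {bernstein:.3g}  "
          f"Hoeffding {hoeffding:.3g}")
```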

3.6 Vector-Valued Concentration Revisited

We look again at the situation of Sect. 3.3. Suppose again that the \(X_{i}\) are independent zero mean random variables with values in a normed space, which we now assume to be a Hilbert space H, but that we now have a uniform bound \(\left\| X_{i}\right\| \le c\). Again we define \(f:\left\{ y\in H:\left\| y\right\| \le c\right\} ^{n}\rightarrow \mathbb {R}\) by \(f\left( \mathbf {x}\right) =\left\| \sum _{i}x_{i}\right\| \) and observe that for \(y,y^{\prime }\in H\), \(D_{y,y^{\prime }}^{k}f\left( \mathbf {x}\right) \le \left\| y-y^{\prime }\right\| \). This implies that \(f-E_{k}\left[ f\right] \le 2c\) and also

$$\begin{aligned} \sigma _{k}^{2}\left( f\right) =\frac{1}{2}E_{\left( y,y^{\prime }\right) \sim \mu _{k}^{2}}\left( D_{y,y^{\prime }}^{k}f\left( \mathbf {x}\right) \right) ^{2}\le \frac{1}{2}E_{\left( y,y^{\prime }\right) \sim \mu _{k}^{2}}\left\| y-y^{\prime }\right\| ^{2}=E\left\| X_{k}\right\| ^{2}. \end{aligned}$$

Thus \(\Sigma ^{2}\left( f\right) \le \sum _{i}E\left\| X_{i}\right\| ^{2}\) and by Bernstein’s inequality (Theorem 21, in the rescaled form with \(b=2c\)),

$$\begin{aligned} \Pr \left\{ f-E\left[ f\right] >t\right\} \le \exp \left( \frac{-t^{2}}{2\sum _{i}E\left\| X_{i}\right\| ^{2}+4ct/3}\right) , \end{aligned}$$

or that for \(\delta >0\) with probability at least \(1-\delta \) in \(\left( X_{1},...,X_{n}\right) \)

$$\begin{aligned} \left\| \sum _{i}X_{i}\right\| \le \sqrt{\sum _{i}E\left[ \left\| X_{i}\right\| ^{2}\right] }+\sqrt{2\sum _{i}E\left\| X_{i}\right\| ^{2}\ln \left( 1/\delta \right) }+4c\ln \left( 1/\delta \right) /3, \end{aligned}$$

where we again used Jensen’s inequality to bound \(E\left\| \sum _{i}X_{i}\right\| \). If all the \(X_{i}\) are iid we get with probability at least \(1-\delta \)

$$\begin{aligned} \left\| \frac{1}{n}\sum _{i}X_{i}\right\| \le \sqrt{\frac{E\left[ \left\| X_{1}\right\| ^{2}\right] }{n}}\left( 1+\sqrt{2\ln \left( 1/\delta \right) }\right) +\frac{4c\ln \left( 1/\delta \right) }{3n}. \end{aligned}$$

If the variance \(E\left[ \left\| X_{1}\right\| ^{2}\right] \) is small and n is large, this is much better than the bound (14), which we got from the bounded difference inequality.
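The gap between the two bounds is illustrated by the small computation below (added here; all numerical values are assumptions), which evaluates the right-hand sides of (14) and of the Bernstein-type bound above for vectors with small second moment.

```python
import numpy as np

# Sketch: compare the bounded difference bound (14) with the Bernstein-type
# bound above, for iid vectors with ||X_i|| <= c but small E||X_1||^2 = v.
# All numerical values are illustrative assumptions.
n, c, v, delta = 10_000, 1.0, 0.01, 0.05
L = np.log(1 / delta)

bound_bdd_diff = np.sqrt(v / n) + c * np.sqrt(2 * L / n)
bound_bernstein = np.sqrt(v / n) * (1 + np.sqrt(2 * L)) + 4 * c * L / (3 * n)
print("bounded difference bound (14):", bound_bdd_diff)
print("Bernstein-type bound         :", bound_bernstein)
```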

4 Inequalities for Lipschitz Functions and Dimension Free Bounds

We now prove some more advanced concentration inequalities. First we will use the bounded difference inequality to prove a famous sub-Gaussian bound for Lipschitz functions of independent standard normal variables. We then derive an exponential Efron–Stein inequality which allows us to prove a similar result for convex Lipschitz functions on \(\left[ 0,1\right] ^{n}\). We also obtain a concentration inequality for the operator norm of a random matrix, with deviations independent of the size of the matrix.

4.1 Gaussian Concentration

The advantage of the bounded difference inequality, Theorem 14, over its simplified Corollary 15 is the supremum over \(\mathbf {x}\) outside the sum over k. This allows us to prove the following powerful Gaussian concentration inequality (Tsirelson-Ibragimov–Sudakov inequality, Theorem 5.6 in [6]). We assume \(\Omega _{k}=\mathbb {R}\) and \(\mu _{k}\) to be the distribution of a standard normal variable, and we require f to be an L-Lipschitz function, which means that for all \(\mathbf {x,x}^{\prime }\in \mathbb {R}^{n}\)

$$\begin{aligned} f\left( \mathbf {x}\right) -f\left( \mathbf {x}^{\prime }\right) \le L\left\| \mathbf {x}-\mathbf {x}^{\prime }\right\| , \end{aligned}$$

where \(\left\| .\right\| \) is the Euclidean norm on \(\mathbb {R}^{n}\).

Theorem 23

Let \(f:\mathbb {R}^{n}\rightarrow \mathbb {R}\) be L-Lipschitz and let \(\mathbf {X}=\left( X_{1},...,X_{n}\right) \) be a vector of independent standard normal variables. Then for any \(s>0\)

$$\begin{aligned} \Pr \left\{ f\left( \mathbf {X}\right) >{{\mathbb {E}}}f\left( \mathbf {X}\right) +s\right\} \le e^{-s^{2}/2L^{2}}. \end{aligned}$$

Note that the function f is not assumed to be bounded on \(\mathbb {R}^{n}\).
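
As a quick numerical illustration of Theorem 23 (an illustration only: the 1-Lipschitz function \(f\left( \mathbf {x}\right) =\max _{i}x_{i}\), the value of n and the number of Monte Carlo trials are arbitrary choices) one can compare the empirical upper tail with the bound \(e^{-s^{2}/2}\).

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustration only: n and the number of Monte Carlo trials are arbitrary.
n, trials = 50, 200000
f = lambda x: x.max(axis=-1)     # the maximum is 1-Lipschitz for the Euclidean norm

samples = f(rng.normal(size=(trials, n)))
mean = samples.mean()

for s in (0.5, 1.0, 1.5, 2.0):
    empirical = np.mean(samples > mean + s)
    print(f"s={s:3.1f}  empirical tail={empirical:.5f}"
          f"  bound exp(-s^2/2)={np.exp(-s ** 2 / 2):.5f}")
```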

Proof

The idea of the proof is to use the central limit theorem to approximate the \(X_{i}\) by appropriately scaled Rademacher sums \(h_{K}\left( \mathbf {\epsilon }_{i}\right) \) and to apply the bounded difference inequality to \(f\left( h_{K}\left( \mathbf {\epsilon }_{1}\right) ,...,h_{K}\left( \mathbf {\epsilon }_{n}\right) \right) \).

By an approximation argument using convolution with Gaussian kernels of decreasing width it suffices to prove the result if f is \(C^{2}\) with \(\left| \left( \partial ^{2}/\partial x_{i}^{2}\right) f\left( \mathbf {x}\right) \right| \le B\) for all \(\mathbf {x}\in \mathbb {R}^{n}\) and \(i\in \left\{ 1,...,n\right\} \), where B is a finite, but arbitrarily large constant. For \(K\in \mathbb {N} \) define a function \(h_{K}:\left\{ -1,1\right\} ^{K}\rightarrow \mathbb {R}\), a vector-valued function \(\mathbf {h}_{K}:\left\{ -1,1\right\} ^{Kn}\rightarrow \mathbb {R}^{n}\) and a function \(G_{K}:\left\{ -1,1\right\} ^{Kn}\rightarrow \mathbb {R}\) by

$$\begin{aligned} h_{K}\left( \mathbf {\epsilon }\right)= & {} \frac{1}{\sqrt{K}}\sum _{k=1}^{K}\epsilon _{k},\text { for }\epsilon \in \left\{ -1,1\right\} ^{K} \\ \mathbf {h}_{K}\left( \mathbf {\epsilon }\right)= & {} \left( h_{K}\left( \mathbf {\epsilon }_{1}\right) ,...,h_{K}\left( \mathbf {\epsilon }_{n}\right) \right) \text { for }\mathbf {\epsilon }=\left( \mathbf {\epsilon }_{1},...,\mathbf {\epsilon }_{n}\right) \in \left\{ -1,1\right\} ^{Kn} \\ G_{K}= & {} f\left( \mathbf {h}_{K}\left( \mathbf {\epsilon }\right) \right) \text { for }\mathbf {\epsilon }\in \left\{ -1,1\right\} ^{Kn}\text {.} \end{aligned}$$

We will use Theorem 14 on the function \(G_{K}\) applied to independent Rademacher variables \(\mathbf {\epsilon }\).

Fix an arbitrary configuration \(\mathbf {\epsilon \in }\left\{ -1,1\right\} ^{Kn}\) and let \(\mathbf {x}=\left( x_{1},...,x_{n}\right) =\mathbf {h}_{K}\left( \mathbf {\epsilon }\right) \). For each \(i\in \left\{ 1,...,n\right\} \) we introduce the real function \(f_{i}\left( t\right) =S_{t}^{i}f\left( \mathbf {x}\right) \), so that we replace the i-th argument \(x_{i}\) by t, leaving all the other \(x_{j}\) fixed. Since f is \(C^{2}\) we have for any \(t\in \mathbb {R}\)

$$\begin{aligned} f_{i}\left( x+t\right) -f_{i}\left( x\right) =tf_{i}^{\prime }\left( x\right) +\frac{t^{2}}{2}f_{i}^{\prime \prime }\left( s\right) \end{aligned}$$

for some \(s\in \mathbb {R}\), and by the Lipschitz condition and the bound on \(\left| f_{i}^{\prime \prime }\right| \)

$$\begin{aligned} \left( f_{i}\left( x+t\right) -f_{i}\left( x\right) \right) ^{2}= & {} t^{2}\left( f_{i}^{\prime }\left( x\right) \right) ^{2}+t^{3}f_{i}^{\prime }\left( x\right) f_{i}^{\prime \prime }\left( s\right) +\frac{t^{4}}{4}\left( f_{i}^{\prime \prime }\left( s\right) \right) ^{2} \\\le & {} t^{2}\left( f_{i}^{\prime }\left( x\right) \right) ^{2}+\left| t\right| ^{3}LB+\frac{t^{4}}{4}B^{2}. \end{aligned}$$

Now fix a pair of indices \(\left( i,k\right) \) with \(i\in \left\{ 1,...,n\right\} \) and \(k\in \left\{ 1,...,K\right\} \) and arbitrary values \(y,y^{\prime }\in \left\{ -1,1\right\} \) with \(y\ne y^{\prime }\). We want to bound \(\left( D_{y,y^{\prime }}^{\left( i,k\right) }G_{K}\left( \mathbf {\epsilon }\right) \right) ^{2}\). Now either one of y or \(y^{\prime }\) is equal to \(\epsilon _{ik}\), so either \(S_{y}^{\left( i,k\right) }G_{K}\left( \mathbf {\epsilon }\right) \) or \(S_{y^{\prime }}^{\left( i,k\right) }G_{K}\left( \mathbf {\epsilon }\right) \) is equal to \(G_{K}\left( \mathbf {\epsilon }\right) \). Without loss of generality we assume the second. Furthermore \(S_{y}^{k}h_{K}\left( \mathbf {\epsilon }_{i}\right) \) and \(h_{K}\left( \mathbf {\epsilon }_{i}\right) \) differ by at most \(2/\sqrt{K}\), so

$$\begin{aligned} \left( D_{y,y^{\prime }}^{\left( i,k\right) }G_{K}\left( \mathbf {\epsilon }\right) \right) ^{2}= & {} \left( f\left( x_{1},...,S_{y}^{k}h_{K}\left( \mathbf {\epsilon }_{i}\right) ,...,x_{n}\right) -f\left( x_{1},...,h_{K}\left( \mathbf {\epsilon }_{i}\right) ,...,x_{n}\right) \right) ^{2} \\= & {} \left( f_{i}\left( h_{K}\left( \mathbf {\epsilon }_{i}\right) \pm \frac{2}{\sqrt{K}}\right) -f_{i}\left( h_{K}\left( \mathbf {\epsilon }_{i}\right) \right) \right) ^{2} \\\le & {} \frac{4f_{i}^{\prime }\left( h_{K}\left( \mathbf {\epsilon }_{i}\right) \right) ^{2}}{K}+\frac{8LB}{K^{3/2}}+\frac{4B^{2}}{K^{2}}. \end{aligned}$$

Now \(f_{i}^{\prime }\left( h_{K}\left( \mathbf {\epsilon }_{i}\right) \right) \) is just equal to \(\left( \partial /\partial x_{i}\right) f\left( \mathbf {x}\right) \), so

$$\begin{aligned} \sum _{i}f_{i}^{\prime }\left( h_{K}\left( \mathbf {\epsilon }_{i}\right) \right) ^{2}\le \sup _{\mathbf {x}\in \mathbb {R}^{n}}\left\| \nabla f\left( \mathbf {x}\right) \right\| ^{2}\le L^{2}\text {.} \end{aligned}$$

Since \(\mathbf {\epsilon }\) was arbitrary we have

$$\begin{aligned} \sup _{\mathbf {\epsilon }}\sum _{k,i}\sup _{y,y^{\prime }}\left( D_{y,y^{\prime }}^{\left( i,k\right) }G_{K}\left( \mathbf {\epsilon }\right) \right) ^{2}\le 4L^{2}+\frac{8nLB}{K^{1/2}}+\frac{4nB^{2}}{K}. \end{aligned}$$

Since \(f\left( \mathbf {h}_{K}\left( \mathbf {\epsilon }\right) \right) =G_{K}\left( \mathbf {\epsilon }\right) \), we conclude from Theorem 14 that

$$\begin{aligned} \Pr \left\{ f\left( \mathbf {h}_{K}\left( \mathbf {\epsilon }\right) \right) -{{\mathbb {E}}}f\left( \mathbf {h}_{K}\left( \mathbf {\epsilon }^{\prime }\right) \right) >s\right\} \le \exp \left( \frac{-s^{2}}{2L^{2}+4nLB/K^{1/2}+2nB^{2}/K}\right) . \end{aligned}$$

The conclusion now follows from the central limit theorem, since \(\mathbf {h}_{K}\left( \mathbf {\epsilon }\right) \rightarrow \mathbf {X}\) weakly as \(K\rightarrow \infty \).   \(\square \)

4.2 Exponential Efron–Stein Inequalities

We will now use the entropy method to derive some other “dimension free” bounds of this type. We need the following very useful result.

Lemma 24

(Chebychev’s association inequality) Let g and h be real functions, X a real random variable.

If g and h are either both nondecreasing or both nonincreasing then

$$\begin{aligned} E\left[ g\left( X\right) h\left( X\right) \right] \ge E\left[ g\left( X\right) \right] E\left[ h\left( X\right) \right] . \end{aligned}$$

If either one of g or h is nondecreasing and the other nonincreasing then

$$\begin{aligned} E\left[ g\left( X\right) h\left( X\right) \right] \le E\left[ g\left( X\right) \right] E\left[ h\left( X\right) \right] . \end{aligned}$$

Proof

Let \(X^{\prime }\) be a random variable iid to X. Then

$$\begin{aligned} E\left[ g\left( X\right) h\left( X\right) \right] -E\left[ g\left( X\right) \right] E\left[ h\left( X\right) \right] =\frac{1}{2}E\left[ \left( g\left( X\right) -g\left( X^{\prime }\right) \right) \left( h\left( X\right) -h\left( X^{\prime }\right) \right) \right] . \end{aligned}$$

Now if g and h are either both nondecreasing or both nonincreasing then

$$\begin{aligned} \left( g\left( X\right) -g\left( X^{\prime }\right) \right) \left( h\left( X\right) -h\left( X^{\prime }\right) \right) \end{aligned}$$

is always nonnegative, because both factors always have the same sign; in the other case it is always nonpositive.   \(\square \)
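
The association inequality is also easy to check numerically; in the following sketch the distribution of X and the monotone functions are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(6)

# Quick numerical check of Chebychev's association inequality.
X = rng.exponential(size=1_000_000)
g, h_inc, h_dec = np.sqrt(X), np.log1p(X), 1.0 / (1.0 + X)

# g and h_inc are both nondecreasing, so E[g h] >= E[g] E[h]
print(np.mean(g * h_inc), ">=", np.mean(g) * np.mean(h_inc))
# g is nondecreasing and h_dec is nonincreasing, so E[g h] <= E[g] E[h]
print(np.mean(g * h_dec), "<=", np.mean(g) * np.mean(h_dec))
```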

We use this inequality to prove a bound on the thermal variance. First recall that for two iid random variables X and \(X^{\prime }\) we have

$$\begin{aligned} \sigma ^{2}\left( X\right)= & {} \frac{1}{2}E_{XX^{\prime }}\left[ \left( X-X^{\prime }\right) ^{2}\right] \\= & {} \frac{1}{2}E_{XX^{\prime }}\left[ \left( X-X^{\prime }\right) ^{2}1_{X>X^{\prime }}\right] +\frac{1}{2}E_{XX^{\prime }}\left[ \left( X-X^{\prime }\right) ^{2}1_{X<X^{\prime }}\right] \\= & {} E_{XX^{\prime }}\left[ \left( X-X^{\prime }\right) _{+}^{2}\right] . \end{aligned}$$

Lemma 25

Let \(0\le s\le \beta \). Then

$$\begin{aligned} \sigma _{sf}^{2}\left( f\right) \le E_{x\sim \mu _{\beta f}}\left[ E_{x^{\prime }\sim \mu }\left[ \left( f\left( x\right) -f\left( x^{\prime }\right) \right) _{+}^{2}\right] \right] . \end{aligned}$$

Proof

Let \(\psi \) be any real function. Lemma 6 (2) gives

$$\begin{aligned} \frac{d}{d\beta }E_{\beta f}\left[ \psi \left( f\right) \right] =E_{\beta f}\left[ \psi \left( f\right) f\right] -E_{\beta f}\left[ \psi \left( f\right) \right] E_{\beta f}\left[ f\right] . \end{aligned}$$
(17)

By Chebychev’s association inequality \(E_{\beta f}\left[ \psi \left( f\right) \right] \) is nonincreasing (nondecreasing) in \(\beta \) if \(\psi \) is nonincreasing (nondecreasing). Now define \(g:\mathbb {R}^{2}\rightarrow \mathbb {R}\) by

$$\begin{aligned} g\left( s,t\right) =E_{x\sim \mu _{sf}}\left[ E_{x^{\prime }\sim \mu _{tf}}\left[ \left( f\left( x\right) -f\left( x^{\prime }\right) \right) ^{2}1_{f\left( x\right) \ge f\left( x^{\prime }\right) }\right] \right] , \end{aligned}$$

so that

$$\begin{aligned} \sigma _{sf}^{2}\left( f\right) =\frac{1}{2}E_{x\sim \mu _{sf}}\left[ E_{x^{\prime }\sim \mu _{sf}}\left[ \left( f\left( x\right) -f\left( x^{\prime }\right) \right) ^{2}\right] \right] =g\left( s,s\right) . \end{aligned}$$

Now for fixed x the function \(\left( f\left( x\right) -f\left( x^{\prime }\right) \right) ^{2}1_{f\left( x\right) \ge f\left( x^{\prime }\right) }\) is nonincreasing in \(f\left( x^{\prime }\right) \), so \(g\left( s,t\right) \) is nonincreasing in t. On the other hand, for fixed \(x^{\prime }\), \(\left( f\left( x\right) -f\left( x^{\prime }\right) \right) ^{2}1_{f\left( x\right) \ge f\left( x^{\prime }\right) }\) is nondecreasing in \(f\left( x\right) \), so \(g\left( s,t\right) \) is nondecreasing in s (this involves exchanging the two expectations in the definition of \(g\left( s,t\right) \)). So, since \(\mu _{0f}=\mu \), we get from \(0\le s\le \beta \) that

$$\begin{aligned} \sigma _{sf}^{2}\left( f\right) =g\left( s,s\right) \le g\left( \beta ,0\right) =E_{x\sim \mu _{\beta f}}\left[ E_{x^{\prime }\sim \mu }\left[ \left( f\left( x\right) -f\left( x^{\prime }\right) \right) _{+}^{2}\right] \right] . \end{aligned}$$

   \(\square \)

Here is another way to write the conclusion: let \(h\in \mathcal {A}\) be defined by \(h\left( x\right) =E_{x^{\prime }\sim \mu }\left[ \left( f\left( x\right) -f\left( x^{\prime }\right) \right) _{+}^{2}\right] \). Then \(\sigma _{sf}^{2}\left( f\right) \le E_{\beta f}\left[ h\right] \).

Define two operators \(D^{2}:\mathcal {A}\rightarrow \mathcal {A}\) and \(V_{+}^{2}:\mathcal {A}\rightarrow \mathcal {A}\) by

$$\begin{aligned} D^{2}f= & {} \sum _{k}\left( f-\inf _{y\in \Omega _{k}}S_{y}^{k}f\right) ^{2} \\ \text {and }V_{+}^{2}f= & {} \sum _{k}E_{y\sim \mu _{k}}\left[ \left( \left( f-S_{y}^{k}f\right) _{+}\right) ^{2}\right] . \end{aligned}$$

Clearly \(V_{+}^{2}f\le D^{2}f\) as \(D^{2}f\) is obtained by bounding the expectations in the definition of \(V_{+}^{2}\) by their suprema.
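
To make the two operators concrete, here is a minimal sketch on a finite product space; the choice \(\Omega _{k}=\left\{ 0,...,m-1\right\} \) with uniform \(\mu _{k}\) and f the maximum of the coordinates is arbitrary and only serves to illustrate the definitions and the pointwise inequality \(V_{+}^{2}f\le D^{2}f\).

```python
import itertools
import numpy as np

# Illustration only: Omega_k = {0,...,m-1}, mu_k uniform, f = max of coordinates.
m, n = 4, 3
Omega_k = list(range(m))

def f(x):
    return max(x)

def substitute(x, k, y):          # S_y^k x : replace the k-th coordinate by y
    z = list(x)
    z[k] = y
    return tuple(z)

def D2(f, x):                     # sum_k ( f(x) - inf_y S_y^k f(x) )^2
    return sum((f(x) - min(f(substitute(x, k, y)) for y in Omega_k)) ** 2
               for k in range(n))

def V2plus(f, x):                 # sum_k E_{y ~ mu_k}[ ((f(x) - S_y^k f(x))_+)^2 ]
    return sum(np.mean([max(f(x) - f(substitute(x, k, y)), 0.0) ** 2
                        for y in Omega_k])
               for k in range(n))

for x in itertools.product(Omega_k, repeat=n):
    assert V2plus(f, x) <= D2(f, x) + 1e-12   # V_+^2 f <= D^2 f pointwise
print("checked V_+^2 f <= D^2 f on all", m ** n, "points")
```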

Lemma 26

For \(\beta >0\) and \(f\in \mathcal {A}\)

$$\begin{aligned} \text {Ent}_{f}\left( \beta \right) \le \frac{\beta ^{2}}{2}E_{\beta f}\left[ V_{+}^{2}f\right] . \end{aligned}$$

Proof

For \(k\in \left\{ 1,...,n\right\} \) write \(h_{k}=E_{y\sim \mu _{k}}\left[ \left( f-S_{y}^{k}f\right) _{+}^{2}\right] \), so that \(V_{+}^{2}f=\sum _{k}h_{k}\). The conditional version of Lemma 25 then reads for \(0\le s\le \beta \) and \(k\in \left\{ 1,...,n\right\} \)

$$\begin{aligned} \sigma _{k,sf}^{2}\left( f\right) \le E_{k,\beta f}\left[ h_{k}\right] . \end{aligned}$$

Theorem 12 gives

$$\begin{aligned} \text {Ent}_{f}\left( \beta \right)\le & {} \int _{0}^{\beta }\int _{t}^{\beta }\sum _{k}E_{\beta f}\left[ \sigma _{k,sf}^{2}\left( f\right) \right] dsdt \\\le & {} \int _{0}^{\beta }\int _{t}^{\beta }\sum _{k}E_{\beta f}\left[ E_{k,\beta f}\left[ h_{k}\right] \right] dsdt \\= & {} \int _{0}^{\beta }\int _{t}^{\beta }\sum _{k}E_{\beta f}\left[ h_{k}\right] dsdt \\= & {} \frac{\beta ^{2}}{2}E_{\beta f}\left[ V_{+}^{2}f\right] , \end{aligned}$$

where we used the identity \(E_{\beta f}\left[ E_{k,\beta f}\left[ h\right] \right] =E_{\beta f}\left[ h\right] \) for \(h\in \mathcal {A}.\)   \(\square \)

The usual arguments involving Theorem 12 and an optimization in \(\beta \) now immediately lead to

Theorem 27

With \(t>0\)

$$\begin{aligned} \Pr \left\{ f-E\left[ f\right] >t\right\} \le \exp \left( \frac{-t^{2}}{2\sup _{\mathbf {x}\in \Omega }V_{+}^{2}f\left( \mathbf {x}\right) }\right) \le \exp \left( \frac{-t^{2}}{2\sup _{\mathbf {x}\in \Omega }D^{2}f\left( \mathbf {x}\right) }\right) . \end{aligned}$$

We get a corresponding lower tail bound only for \(D^{2}\) and we have to use an estimate similar to what was used in the proof of Bennett’s inequality.

Lemma 28

If \(f-\inf _{k}f\le 1,\forall k\) then for \(\beta >0\)

$$\begin{aligned} \text {Ent}_{-f}\left( \beta \right) \le \psi \left( \beta \right) E_{-\beta f}\left[ D^{2}f\right] , \end{aligned}$$

with \(\psi \left( t\right) =e^{t}-t-1\) defined as in (16).

Proof

Let \(k\in \left\{ 1,...,n\right\} \). We write \(h_{k}:=f-\inf _{k}f\). Then \(h_{k}\in \left[ 0,1\right] \) and for \(s\le \beta \) we have \(1\le e^{\left( \beta -s\right) h_{k}}\le e^{\beta -s}\), so

$$\begin{aligned} E_{k,-sh_{k}}\left[ h_{k}^{2}\right] =\frac{E_{k}\left[ h_{k}^{2}e^{-\beta h_{k}}e^{\left( \beta -s\right) h_{k}}\right] }{E_{k}\left[ e^{-\beta h_{k}}e^{\left( \beta -s\right) h_{k}}\right] }\le e^{\left( \beta -s\right) }\frac{E_{k}\left[ h_{k}^{2}e^{-\beta h_{k}}\right] }{E_{k}\left[ e^{-\beta h_{k}}\right] }=e^{\left( \beta -s\right) }E_{k,-\beta h_{k}}\left[ h_{k}^{2}\right] . \end{aligned}$$

We therefore have

$$\begin{aligned} \int _{0}^{\beta }\int _{t}^{\beta }E_{k,-sf}\left[ h_{k}^{2}\right] ds~dt= & {} \int _{0}^{\beta }\int _{t}^{\beta }E_{k,-sh_{k}}\left[ h_{k}^{2}\right] ds~dt \\\le & {} \left( \int _{0}^{\beta }\int _{t}^{\beta }e^{\beta -s}ds~dt\right) E_{k,-\beta h_{k}}\left[ h_{k}^{2}\right] =\psi \left( \beta \right) E_{k,-\beta f}\left[ h_{k}^{2}\right] , \end{aligned}$$

where we used the formula

$$\begin{aligned} \int _{0}^{\beta }\int _{t}^{\beta }e^{-s}ds~dt=1-e^{-\beta }-\beta e^{-\beta }. \end{aligned}$$

Thus, using Theorem 12 and the identity \(E_{-\beta f}E_{k,-\beta f}=E_{-\beta f},\)

$$\begin{aligned} \text {Ent}_{-f}\left( \beta \right)\le & {} E_{-\beta f}\left[ \sum _{k}\int _{0}^{\beta }\int _{t}^{\beta }\sigma _{k,-sf}^{2}\left[ f\right] ds~dt\right] \le E_{-\beta f}\left[ \sum _{k}\int _{0}^{\beta }\int _{t}^{\beta }E_{k,-sf}\left[ h_{k}^{2}\right] ds~dt\right] \\\le & {} \psi \left( \beta \right) E_{-\beta f}\left[ \sum _{k}E_{k,-\beta f}\left[ h_{k}^{2}\right] \right] =\psi \left( \beta \right) E_{-\beta f}\left[ D^{2}f\right] . \end{aligned}$$

   \(\square \)

Lemmas 26 and 28 together with (7) imply the inequalities

$$\begin{aligned} \ln E\left[ e^{\beta \left( f-E\left[ f\right] \right) }\right] \le \frac{\beta }{2}\int _{0}^{\beta }E_{\gamma f}\left[ V_{+}^{2}f\right] d\gamma . \end{aligned}$$
(18)

and, if \(f-\inf _{k}f\le 1\) for all k, then

$$\begin{aligned} \ln E\left[ e^{\beta \left( E\left[ f\right] -f\right) }\right] \le \frac{\psi \left( \beta \right) }{\beta }\int _{0}^{\beta }E_{-\gamma f}\left[ D^{2}f\right] d\gamma , \end{aligned}$$
(19)

where in the last inequality we also used the fact that \(\gamma \mapsto \psi \left( \gamma \right) /\gamma ^{2}\) is nondecreasing. Bounding the thermal expectation with the uniform norm and substitution of \(\beta =\ln \left( 1+t\left\| D^{2}f\right\| _{\infty }^{-1}\right) \) gives the following lower tail bound as in the proof of the Bennett-Bernstein inequalities.

Theorem 29

If \(f-\inf _{k}f\le 1\) for all k, then for \(t>0\) and with \(\Delta :=\sup _{\mathbf {x}\in \Omega }D^{2}f\left( \mathbf {x}\right) \)

$$\begin{aligned} \Pr \left\{ Ef-f>t\right\}&\le \exp \left( -\Delta \left( \left( 1+\frac{t}{\Delta }\right) \ln \left( 1+\frac{t}{\Delta }\right) -\frac{t}{\Delta }\right) \right) \\&\le \exp \left( \frac{-t^{2}}{2\sup _{\mathbf {x}\in \Omega }D^{2}f\left( \mathbf {x}\right) +2t/3}\right) . \end{aligned}$$

4.3 Convex Lipschitz Functions

In Sect. 4.1 we gave a sub-gaussian bound for Lipschitz functions of independent standard normal variables. Now we prove the same upper tail bound under different hypotheses. Instead of assuming \(\mu _{k}\) to be standard normal we require \(\Omega _{k}=\left[ 0,1\right] \) and let \(\mu _{k}\) be perfectly arbitrary. On the other hand, in addition to being an L-Lipschitz function, we require f to be convex (actually only separately convex in each argument).

Theorem 30

Let \(\Omega _{k}=\mathcal {I}\), an interval of unit diameter, and let \(f\in \mathcal {A}\) be \(C^{1}\), L-Lipschitz and such that \(y\in \Omega _{k}\mapsto S_{y}^{k}f\left( \mathbf {x}\right) \) is convex for all k and all \(\mathbf {x}\). Then

$$\begin{aligned} \Pr \left\{ f-Ef>t\right\} \le e^{-t^{2}/2L^{2}}. \end{aligned}$$

Proof

By an approximation argument we can assume f to be differentiable. Let \(\mathbf {x}\in \left[ 0,1\right] ^{n}\), \(k\in \left\{ 1,...,n\right\} \) and \(y\in \left[ 0,1\right] \) such that \(S_{y}^{k}f\left( \mathbf {x}\right) \le f\left( \mathbf {x}\right) \). Then, using separate convexity,

$$\begin{aligned} f\left( \mathbf {x}\right) -S_{y}^{k}f\left( \mathbf {x}\right) \le \left\langle \mathbf {x}-S_{y}^{k}\mathbf {x},\partial f\left( \mathbf {x}\right) \right\rangle _{\mathbb {R}^{n}}=\left( x_{k}-y\right) \frac{\partial }{\partial x_{k}}f\left( \mathbf {x}\right) \le \left| \frac{\partial }{\partial x_{k}}f\left( \mathbf {x}\right) \right| . \end{aligned}$$

We therefore have \(f\left( \mathbf {x}\right) -\inf _{y}S_{y}^{k}f\left( \mathbf {x}\right) \le \left| \left( \partial /\partial x_{k}\right) f\left( \mathbf {x}\right) \right| \) and

$$\begin{aligned} D^{2}f\left( \mathbf {x}\right) =\sum _{k=1}^{n}\left( f\left( \mathbf {x}\right) -\inf _{y}S_{y}^{k}f\left( \mathbf {x}\right) \right) ^{2}\le \left\| \nabla f\left( \mathbf {x}\right) \right\| _{\mathbb {R}^{n}}^{2}\le L^{2}. \end{aligned}$$

Theorem 27 then gives the conclusion.   \(\square \)

For future reference we record the following fact: if \(\Omega _{k}\) is an interval of unit diameter and A an \(m\times n\)-matrix then \(\mathbf {x}\mapsto \left\| A\mathbf {x}\right\| \) is a convex function with Lipschitz constant \(\left\| A\right\| \) and thus

$$\begin{aligned} D^{2}\left( \left\| A\mathbf {x}\right\| \right) \le \left\| A\right\| ^{2}. \end{aligned}$$
(20)

4.4 The Operator Norm of a Random Matrix

For \(\mathbf {x}\in \left[ -1,1\right] ^{n^{2}}\) let \(M\left( \mathbf {x}\right) \) be the \(n\times n\) matrix whose entries are given by the components of \(\mathbf {x}\). We are interested in the concentration properties of the operator norm of \(M\left( \mathbf {X}\right) \), when \(\mathbf {X}\) is a vector with independent, but possibly not identically distributed components chosen from \(\left[ -1,1\right] \). The function in question is then \(f:\left[ -1,1\right] ^{n^{2}}\rightarrow \mathbb {R}\) defined by

$$\begin{aligned} f\left( \mathbf {x}\right) =\left\| M\left( \mathbf {x}\right) \right\| =\sup _{\left\| w\right\| ,\left\| v\right\| =1}\left\langle M\left( \mathbf {x}\right) v,w\right\rangle , \end{aligned}$$

where \(\left\langle .,.\right\rangle \) and \(\left\| .\right\| \) refer to inner product and norm in \(\mathbb {R}^{n}\).

To bound \(D^{2}f\left( \mathbf {x}\right) \) first let \(\mathbf {x}\in \left[ -1,1\right] ^{n^{2}}\) be arbitrary but fixed, and let v and w be unit vectors witnessing the supremum in the definition of \(f\left( \mathbf {x}\right) \).

Now let \(\left( k,l\right) \) be any index to a matrix entry and choose any \(y\in \left[ -1,1\right] \) such that \(S_{y}^{\left( k,l\right) }f\left( \mathbf {x}\right) \le f\left( \mathbf {x}\right) \). Then

$$\begin{aligned} f\left( \mathbf {x}\right) -S_{y}^{\left( k,l\right) }f\left( \mathbf {x}\right)= & {} \left\langle M\left( \mathbf {x}\right) v,w\right\rangle -\sup _{\left\| w^{\prime }\right\| ,\left\| v^{\prime }\right\| =1}\left\langle S_{y}^{\left( k,l\right) }M\left( \mathbf {x}\right) v^{\prime },w^{\prime }\right\rangle \\\le & {} \left\langle \left( M\left( \mathbf {x}\right) -S_{y}^{\left( k,l\right) }M\left( \mathbf {x}\right) \right) v,w\right\rangle =\left( x_{kl}-y\right) v_{k}w_{l} \\\le & {} 2\left| v_{k}\right| \left| w_{l}\right| . \end{aligned}$$

Note that \(f-\inf _{k}f\le 2\). Also

$$\begin{aligned} D^{2}f\left( \mathbf {x}\right)= & {} \sum _{k,l}\left( f\left( \mathbf {x}\right) -\inf _{y\in \left[ -1,1\right] }S_{y}^{\left( k,l\right) }f\left( \mathbf {x}\right) \right) ^{2} \\\le & {} 4\sum _{k,l}\left| v_{k}\right| ^{2}\left| w_{l}\right| ^{2}=4. \end{aligned}$$

The results of the previous section (rescaling for the lower tail to get \(f-\inf _{k}f\le 1\)) then lead to a concentration inequality independent of the size of the random matrix.

Theorem 31

Let \(\mathbf {X}=\left( X_{ij}\right) _{1\le i,j\le n}\) be a vector of \(n^{2}\) independent random variables with values in \(\left[ -1,1\right] \), and \(\mathbf {X}^{\prime }\) iid to \(\mathbf {X}\). Then for \(t>0\)

$$\begin{aligned} \Pr \left\{ \left\| M\left( \mathbf {X}\right) \right\| -E\left[ \left\| M\left( \mathbf {X}^{\prime }\right) \right\| \right] \ge t\right\} \le \exp \left( \frac{-t^{2}}{8}\right) \end{aligned}$$

and

$$\begin{aligned} \Pr \left\{ E\left[ \left\| M\left( \mathbf {X}^{\prime }\right) \right\| \right] -\left\| M\left( \mathbf {X}\right) \right\| \ge t\right\} \le \exp \left( \frac{-t^{2}}{8+4t/3}\right) . \end{aligned}$$

Observe that the argument depends on the fact that the unit vectors v and w could be fixed independent of k and l. This would not have been possible with the bounded difference inequality. Also note that square matrices were chosen for notational convenience only. The same proof would work for rectangular matrices.
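
A small simulation illustrates the dimension-free nature of Theorem 31; the entry distribution (uniform on \(\left[ -1,1\right] \)) and the matrix sizes are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustration only: entry distribution and matrix sizes are arbitrary.
trials = 300
for n in (20, 50, 100, 200):
    norms = np.array([np.linalg.norm(rng.uniform(-1, 1, size=(n, n)), ord=2)
                      for _ in range(trials)])
    print(f"n={n:4d}  mean ||M|| ~ {norms.mean():7.2f}   std ||M|| ~ {norms.std():5.2f}")
# The mean operator norm grows with n, while the fluctuations about the mean
# stay of order one, as predicted by the dimension-free bounds of Theorem 31.
```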

5 Beyond Uniform Bounds

All of the above applications of the entropy method to derive upper tail bounds involved an inequality of the form

$$\begin{aligned} \text {Ent}_{f}\left( \gamma \right) \le \xi \left( \gamma \right) E_{\gamma f}\left[ G\left( f\right) \right] , \end{aligned}$$

where \(\xi \) is some nonnegative real function and G is some operator \(G:\mathcal {A}\rightarrow \mathcal {A}\), which is positively homogeneous of order two. For the bounded difference inequality \(\xi \left( \gamma \right) =\gamma ^{2}/8\) and \(G=R^{2}\), for the Bennett inequality \(\xi \left( \gamma \right) =\gamma e^{\gamma }-e^{\gamma }+1\) and \(G=\Sigma ^{2}\), for Theorem 27 we had \(\xi \left( \gamma \right) =\gamma ^{2}/2\) and \(G=V_{+}^{2}\). Theorem 12 is then used to conclude that

$$\begin{aligned} \ln Ee^{\beta \left( f-Ef\right) }\le \beta \int _{0}^{\beta }\frac{\xi \left( \gamma \right) }{\gamma ^{2}}E_{\gamma f}\left[ G\left( f\right) \right] d\gamma \le \beta \sup _{\mathbf {x}}G\left( f\right) \left( \mathbf {x}\right) \int _{0}^{\beta }\frac{\xi \left( \gamma \right) d\gamma }{\gamma ^{2}}. \end{aligned}$$
(21)

An analogous strategy was employed for the various lower tail bounds.

The uniform estimate \(E_{\gamma f}\left[ G\left( f\right) \right] \le \sup _{\mathbf {x}}G\left( f\right) \left( \mathbf {x}\right) \) in (21), while being very simple, is somewhat loose and can sometimes be avoided by exploiting special properties of the thermal expectation and the function in question.

5.1 Self-boundedness

The first possibility we consider is that the function \(G\left( f\right) \) can be bounded in terms of the function f itself, a property referred to as self-boundedness [8]. For example, if simply \(G\left( f\right) \le f\), then \(E_{\gamma f}\left[ G\left( f\right) \right] \le E_{\gamma f}\left[ f\right] =\left( d/d\gamma \right) \ln Z_{\gamma f}\), and if the function \(\xi \) has some reasonable behavior, then the first integral in (21) above can be bounded by partial integration or even more easily. As an example we apply this idea in the setting of Theorems 27 and 29.

Lemma 32

Suppose that for \(f\in \mathcal {A}\) there are nonnegative numbers a, b such that

(i) \(V_{+}^{2}f\le af+b\). Then for \(0\le \beta <2/a\)

$$\begin{aligned} \ln E\left[ e^{\beta \left( f-E\left[ f\right] \right) }\right] \le \frac{\beta ^{2}\left( aEf+b\right) }{2-a\beta }, \end{aligned}$$

(ii) \(D^{2}f\le af+b\). If in addition \(f-\inf _{k}f\le 1\) for all k, then for \(\beta >0\) and \(a\ge 1\)

$$\begin{aligned} \ln E\left[ e^{\beta \left( E\left[ f\right] -f\right) }\right] \le \frac{\beta ^{2}\left( aE\left[ f\right] +b\right) }{2}. \end{aligned}$$

Proof

(i) We use (18) and get

$$\begin{aligned} \ln E\left[ e^{\beta \left( f-E\left[ f\right] \right) }\right]\le & {} \frac{\beta }{2}\int _{0}^{\beta }E_{\gamma f}\left[ V_{+}^{2}f\right] d\gamma \le \frac{a\beta }{2}\int _{0}^{\beta }E_{\gamma f}\left[ f\right] d\gamma +\frac{b\beta ^{2}}{2} \\= & {} \frac{a\beta }{2}\ln Z_{\beta f}+\frac{b\beta ^{2}}{2}, \end{aligned}$$

where the last identity follows from the fact that \(E_{\gamma f}\left[ f\right] =\left( d/d\gamma \right) \ln Z_{\gamma f}\). Thus

$$\begin{aligned} \ln E\left[ e^{\beta \left( f-E\left[ f\right] \right) }\right] \le \frac{a\beta }{2}\ln Ee^{\beta \left( f-E\left[ f\right] \right) }+\frac{a\beta ^{2}}{2}Ef+\frac{b\beta ^{2}}{2}, \end{aligned}$$

and rearranging this inequality for \(\beta \in \left( 0,2/a\right) \) establishes the claim.

(ii) We use (19)

$$\begin{aligned} \ln E\left[ e^{\beta \left( E\left[ f\right] -f\right) }\right]\le & {} \frac{\psi \left( \beta \right) }{\beta }\int _{0}^{\beta }E_{-\gamma f}\left[ D^{2}f\right] d\gamma \\\le & {} \frac{a\psi \left( \beta \right) }{\beta }\int _{0}^{\beta }E_{-\gamma f}\left[ f\right] d\gamma +b\psi \left( \beta \right) =\frac{-a\psi \left( \beta \right) }{\beta }\ln Z_{-\beta f}+b\psi \left( \beta \right) \\= & {} \frac{-a\psi \left( \beta \right) }{\beta }\ln E\left[ e^{\beta \left( E\left[ f\right] -f\right) }\right] +\psi \left( \beta \right) \left( aE\left[ f\right] +b\right) . \end{aligned}$$

Rearranging gives

$$\begin{aligned} \ln E\left[ e^{\beta \left( E\left[ f\right] -f\right) }\right] \le \frac{\psi \left( \beta \right) }{1+a\beta ^{-1}\psi \left( \beta \right) }\left( aE\left[ f\right] +b\right) \le \frac{\beta ^{2}\left( aE\left[ f\right] +b\right) }{2}, \end{aligned}$$

where one verifies that for \(\beta >0\) and \(a\ge 1\) we have \(\psi \left( \beta \right) \left( 1+a\beta ^{-1}\psi \left( \beta \right) \right) ^{-1}\le \beta ^{2}/2\).   \(\square \)

The bound in part (i) requires an upper bound on \(\beta \). To proceed we need the following optimization lemma, which will be used several times in the sequel and leads to tail bounds with both sub-Gaussian and subexponential regimes, similar to Bernstein’s inequality.

Lemma 33

Let C and b denote two positive real numbers, \(t>0\). Then

$$\begin{aligned} \inf _{\beta \in [0,1/b)}\left( -\beta t+\frac{C\beta ^{2}}{1-b\beta }\right) \le \frac{-t^{2}}{2\left( 2C+bt\right) }. \end{aligned}$$
(22)

Proof

Let \(h\left( t\right) =1+t-\sqrt{1+2t}\). Then use

$$\begin{aligned} 2h\left( t\right) \left( 1+t\right)= & {} 2\left( 1+t\right) ^{2}-2\left( 1+t\right) \sqrt{1+2t} \\= & {} \left( 1+t\right) ^{2}-2\left( 1+t\right) \sqrt{1+2t}+\left( 1+2t\right) +t^{2} \\= & {} \left( 1+t-\sqrt{1+2t}\right) ^{2}+t^{2} \\\ge & {} t^{2}, \end{aligned}$$

so that

$$\begin{aligned} h\left( t\right) \ge \frac{t^{2}}{2\left( 1+t\right) }. \end{aligned}$$
(23)

Substituting

$$\begin{aligned} \beta =\frac{1}{b}\left( 1-\left( 1+\frac{bt}{C}\right) ^{-1/2}\right) \end{aligned}$$

in the left side of (22) we obtain

$$\begin{aligned} \inf _{\beta \in [0,1/b)}\left( -\beta t+\frac{C\beta ^{2}}{1-b\beta }\right) \le -\frac{2C}{b^{2}}h\left( \frac{bt}{2C}\right) \le \frac{-t^{2}}{2\left( 2C+bt\right) }, \end{aligned}$$

where we have used (23).   \(\square \)
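
A quick numerical sanity check of Lemma 33 can be obtained by plugging in the value of \(\beta \) used in the proof; the random choices of C, b and t below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)

# Check the objective at the beta chosen in the proof; its value bounds the infimum.
for _ in range(1000):
    C, b, t = rng.uniform(0.1, 10.0, size=3)
    beta = (1.0 - (1.0 + b * t / C) ** (-0.5)) / b        # the beta from the proof
    assert 0.0 <= beta < 1.0 / b
    value = -beta * t + C * beta ** 2 / (1.0 - b * beta)  # objective at that beta
    rhs = -t ** 2 / (2.0 * (2.0 * C + b * t))
    assert value <= rhs + 1e-9                            # hence the infimum is <= rhs
print("Lemma 33 checked on 1000 random (C, b, t) triples")
```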

Theorem 34

Suppose that for \(f\in \mathcal {A}\) there are nonnegative numbers a, b such that

(i) \(V_{+}^{2}f\le af+b\). Then for \(t>0\) we have

$$\begin{aligned} \Pr \left\{ f-E\left[ f\right] >t\right\} \le \exp \left( \frac{-t^{2}}{2\left( aE\left[ f\right] +b+at/2\right) }\right) . \end{aligned}$$

(ii) \(D^{2}f\le af+b\). If in addition, \(a\ge 1\) and \(f-\inf _{k}f\le 1,\forall k\in \left\{ 1,...,n\right\} \), then

$$\begin{aligned} \Pr \left\{ E\left[ f\right] -f>t\right\} \le \exp \left( \frac{-t^{2}}{2\left( aE\left[ f\right] +b\right) }\right) . \end{aligned}$$

Proof

Part (i) follows from Lemma 32 (i) and Lemma 33. Part (ii) is immediate from Lemma 32 (ii).   \(\square \)

Boucheron et al. [8] have given a refined version of the lower tail bound, in which the condition \(a\ge 1\) is relaxed to \(a\ge 1/3\). There they also show that Theorems 34 and 27 together suffice to derive a version of the convex distance inequality which differs from Talagrand’s original result only in that it has an inferior constant in the exponent.
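
A classical example of a self-bounding function, not treated in the text but convenient for a numerical illustration of Theorem 34, is the number of distinct values among n independent draws from a finite set: for this function \(D^{2}f\le f\) and \(f-\inf _{k}f\le 1\), so both parts apply with \(a=1\) and \(b=0\). The parameters in the sketch below are arbitrary, and \(E\left[ f\right] \) is estimated empirically.

```python
import numpy as np

rng = np.random.default_rng(7)

# Illustration only: f = number of distinct values among n uniform draws from
# {1,...,K}; this f is self-bounding with a = 1, b = 0 (an assumption stated
# in the accompanying text, not taken from the chapter itself).
n, K, trials = 200, 100, 100000
samples = rng.integers(1, K + 1, size=(trials, n))
f = np.array([len(np.unique(row)) for row in samples])
Ef = f.mean()

for t in (3.0, 6.0, 9.0):
    upper = np.mean(f - Ef > t)
    lower = np.mean(Ef - f > t)
    print(f"t={t}: upper tail {upper:.5f} <= {np.exp(-t ** 2 / (2 * (Ef + t / 2))):.5f},"
          f"  lower tail {lower:.5f} <= {np.exp(-t ** 2 / (2 * Ef)):.5f}")
```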

5.2 Convex Lipschitz Functions Revisited

In Sect. 4.3 we gave a sub-Gaussian bound for the upper tail of separately convex Lipschitz functions on \(\left[ 0,1\right] ^{n}\). Now we use self-boundedness to complement this with a sub-Gaussian lower bound, using an elegant trick of Boucheron et al. [6] where the lower bound in Theorem 34 is applied to the square of the Lipschitz function f. The essence of the trick is the following simple lemma.

Lemma 35

If \(f\ge 0\) then \(D^{2}\left( f^{2}\right) \le 4D^{2}\left( f\right) f^{2}\).

Proof

Since \(f\ge 0\) we have \(\inf _{k}\left( f^{2}\right) =\left( \inf _{k}f\right) ^{2}\), so, using \(\left( a+b\right) ^{2}\le 2a^{2}+2b^{2},\)

$$\begin{aligned} D^{2}\left( f^{2}\right)= & {} \sum _{k}\left( f^{2}-\inf _{k}f^{2}\right) ^{2}=\sum _{k}\left( f-\inf _{k}f\right) ^{2}\left( f+\inf _{k}f\right) ^{2} \\\le & {} 4f^{2}\sum _{k}\left( f-\inf _{k}f\right) ^{2}=4D^{2}\left( f\right) f^{2}. \end{aligned}$$

   \(\square \)

For the sub-Gaussian lower bound we need the additional assumption that \(f^{2}\) takes values in an interval of length at most one.

Theorem 36

Let \(\Omega _{k}=\left[ 0,1\right] \) and let \(f\in \mathcal {A}\) be L-Lipschitz, nonnegative and such that \(y\in \left[ 0,1\right] \mapsto S_{y}^{k}f\left( \mathbf {x}\right) \) is convex for all k and all \(\mathbf {x}\), and suppose in addition, that \(f^{2}\) takes values in an interval of length at most one. Then for all \(t\in \left[ 0,E\left[ f\right] \right] \)

$$\begin{aligned} \Pr \left\{ E\left[ f\right] -f>t\right\} \le e^{-t^{2}/8L^{2}}. \end{aligned}$$

Proof

The trick is to study the function \(f^{2}\) instead of f. Let \(\mathbf {x}\in \left[ 0,1\right] ^{n}\). Using separate convexity as in the proof of Theorem 30 we have \(D^{2}f\le L^{2}\), so by the previous lemma \(D^{2}\left( f^{2}\right) \le 4L^{2}f^{2}\). For any k we have \(f^{2}\left( \mathbf {x}\right) -\inf _{k}f^{2}\left( \mathbf {x}\right) \le 1\), so by the lower tail bound of Theorem 34 we get a lower tail bound for \(f^{2}\)

$$\begin{aligned} \Pr \left\{ E\left[ f^{2}\right] -f^{2}>t\right\} \le \exp \left( \frac{-t^{2}}{8L^{2}E\left[ f^{2}\right] }\right) . \end{aligned}$$

Thus

$$\begin{aligned} \Pr \left\{ E\left[ f\right] -f>t\right\}= & {} \Pr \left\{ \sqrt{E\left[ f^{2}\right] }\left( E\left[ f\right] -f\right)>\sqrt{E\left[ f^{2}\right] }t\right\} \\\le & {} \Pr \left\{ \left( \sqrt{E\left[ f^{2}\right] }+f\right) \left( \sqrt{E\left[ f^{2}\right] }-f\right)>\sqrt{E\left[ f^{2}\right] }t\right\} \\= & {} \Pr \left\{ E\left[ f^{2}\right] -f^{2}>\sqrt{E\left[ f^{2}\right] }t\right\} \\\le & {} \exp \left( \frac{-t^{2}}{8L^{2}}\right) . \end{aligned}$$

Here we used \(E\left[ f\right] \le \sqrt{E\left[ f^{2}\right] }\) and the assumption that f is nonnegative in the first inequality.   \(\square \)

5.3 Decoupling

A second method to avoid the uniform bound on the thermal expectation uses decoupling. By the duality formula of Theorem 4 we have for any \(f,g\in \mathcal {A}\) and \(\beta \in \mathbb {R}\)

$$\begin{aligned} E_{\beta f}\left[ g\right] \le \text {Ent}_{f}\left( \beta \right) +\ln E\left[ e^{g}\right] . \end{aligned}$$
(24)

Recall the discussion at the beginning of Sect. 5, where we had a general bound of the form Ent\(_{f}\left( \beta \right) \le \xi \left( \beta \right) E_{\beta f}\left[ G\left( f\right) \right] \). Using (24) we can now obtain for any \(\lambda >0\)

$$\begin{aligned} \text {Ent}_{f}\left( \beta \right) \le \xi \left( \beta \right) \lambda ^{-1}E_{\beta f}\left[ \lambda G\left( f\right) \right] \le \xi \left( \beta \right) \lambda ^{-1}\left( \text {Ent}_{f}\left( \beta \right) +\ln E\left[ \exp \left( \lambda G\left( f\right) \right) \right] \right) , \end{aligned}$$

and for values of \(\beta \) and \(\lambda \) where \(\lambda >\xi \left( \beta \right) \) we obtain

$$\begin{aligned} \text {Ent}_{f}\left( \beta \right)\le & {} \frac{\xi \left( \beta \right) }{\lambda -\xi \left( \beta \right) }\ln E\left[ \exp \left( \lambda G\left( f\right) \right) \right] \\= & {} \frac{\xi \left( \beta \right) }{\lambda -\xi \left( \beta \right) }\left( \ln E\left[ e^{\lambda \left( G\left( f\right) -E\left[ G\left( f\right) \right] \right) }\right] +\lambda E\left[ G\left( f\right) \right] \right) . \nonumber \end{aligned}$$
(25)

Hence, if we can control the moment generating function of \(G\left( f\right) \) (or some suitable bound thereof), we obtain concentration inequalities for f, effectively passing from the thermal measure \(\mu _{\beta f}\) to the thermal measure \(\mu _{\lambda G\left( f\right) }\). The second line shows that in this way the supremum of \(G\left( f\right) \) can possibly be replaced by an expectation. The \(\lambda -\xi \left( \beta \right) \) in the denominator makes some constraint on \(\beta \) necessary, so the improvement comes at the price of an extra or enlarged subexponential term in the resulting concentration inequality. We conclude this chapter with three applications of this trick, which has been proposed in [7].

5.4 Quadratic Forms

As a first illustration we give a version of the Hanson-Wright inequality (Theorem 6.2.1 in [29]) for bounded variables. Let A be a symmetric \(n\times n\)-matrix, which is zero on the diagonal, that is \(A_{ii}=0\) for all i, and suppose that \(X_{1},...,X_{n}\) are independent random variables with values in an interval \(\mathcal {I}\) of unit diameter. We study the random variable \(f\left( \mathbf {X}\right) \), where

$$\begin{aligned} f\left( \mathbf {x}\right) =\sum _{i,j}x_{i}A_{ij}x_{j}. \end{aligned}$$

As operator G we use \(R^{2}\), the sum of squared conditional ranges which appears in the bounded difference inequality. For the function in question we have

$$\begin{aligned} D_{y,y^{\prime }}^{k}f\left( \mathbf {x}\right) =2\left( y-y^{\prime }\right) \sum _{i}A_{ki}x_{i}=2\left( y-y^{\prime }\right) \left( A\mathbf {x}\right) _{k}, \end{aligned}$$

and, since \(\mathcal {I}\) has unit diameter

$$\begin{aligned} R^{2}\left( f\right) \left( \mathbf {x}\right) =\sum _{k}\sup _{y,y^{\prime }\in \mathcal {I}}\left( D_{y,y^{\prime }}^{k}f\left( \mathbf {x}\right) \right) ^{2}\le 4\sum _{k}\left( A\mathbf {x}\right) _{k}^{2}=4\left\| A\mathbf {x}\right\| ^{2}\text {.} \end{aligned}$$

We can therefore conclude from (12) in the proof of the bounded difference inequality (Theorem 14), that Ent\(_{f}\left( \gamma \right) \le \left( \gamma ^{2}/8\right) E_{\gamma f}\left[ R^{2}\left( f\right) \right] \le \left( \gamma ^{2}/2\right) E_{\gamma f}\left[ \left\| A\mathbf {X}\right\| ^{2}\right] \). But instead of bounding the last thermal expectation by a supremum, as we did before, we now look for concentration properties of the function \(\mathbf {x}\mapsto \left\| A\mathbf {x}\right\| ^{2}\).

By (20) and Lemma 35 we have the self-bounding inequality \(D^{2}\left( \left\| A\mathbf {x}\right\| ^{2}\right) \le 4\left\| A\right\| ^{2}\left\| A\mathbf {x}\right\| ^{2}\) and Lemma 32 gives for \(0\le \lambda <1/\left( 2\left\| A\right\| ^{2}\right) \)

$$\begin{aligned} \ln E\left[ e^{\lambda \left\| A\mathbf {x}\right\| ^{2}}\right] \le \frac{\lambda E\left[ \left\| A\mathbf {x}\right\| ^{2}\right] }{1-2\left\| A\right\| ^{2}\lambda }. \end{aligned}$$

Now let \(0<\gamma <1/\left\| A\right\| \) and set \(\lambda :=\gamma /\left( 2\left\| A\right\| \right) <1/\left( 2\left\| A\right\| ^{2}\right) \). Using the above bound on Ent\(_{f}\left( \gamma \right) \) and the decoupling inequality (24) we get

$$\begin{aligned} \lambda \text {Ent}_{f}\left( \gamma \right)\le & {} \frac{\gamma ^{2}}{2}E_{\gamma f}\left[ \lambda \left\| Ax\right\| ^{2}\right] \le \frac{\gamma ^{2}}{2}\left( \text {Ent}_{f}\left( \gamma \right) +\ln E\left[ e^{\lambda \left\| Ax\right\| ^{2}}\right] \right) \\\le & {} \frac{\gamma ^{2}}{2}\text {Ent}_{f}\left( \gamma \right) +\frac{\gamma ^{2}}{2}\frac{\lambda E\left[ \left\| Ax\right\| ^{2}\right] }{1-2\left\| A\right\| ^{2}\lambda }. \end{aligned}$$

Collect terms in Ent\(_{f}\left( \gamma \right) \), divide by \(\lambda -\gamma ^{2}/2\) (which is positive by the constraint on \(\gamma \) and the choice of \(\lambda \)) and substitute the value of \(\lambda \) to get

$$\begin{aligned} \text {Ent}_{f}\left( \gamma \right) \le \frac{\gamma ^{2}}{\left( 1-\left\| A\right\| \gamma \right) ^{2}}\frac{E\left[ \left\| Ax\right\| ^{2}\right] }{2}. \end{aligned}$$

From Theorem 12 we conclude that for \(\beta <1/\left\| A\right\| \)

$$\begin{aligned} \Pr \left\{ f-Ef>t\right\}\le & {} \exp \left( \beta \int _{0}^{\beta }\frac{\mathrm {Ent}_{f}\left( \gamma \right) }{\gamma ^{2}}d\gamma -\beta t\right) \\\le & {} \exp \left( \frac{\beta ^{2}}{1-\left\| A\right\| \beta }\frac{E\left[ \left\| Ax\right\| ^{2}\right] }{2}-\beta t\right) , \end{aligned}$$

and using Lemma 33 to minimize the last expression in \(\beta \in \left( 0,1/\left\| A\right\| \right) \) gives our version of the Hanson-Wright inequality for bounded variables.

Theorem 37

Let A be a symmetric \(n\times n\)-matrix, zero on the diagonal, and \(\mathbf {X}=\left( X_{1},...,X_{n}\right) \) a vector of independent random variables with values in an interval \(\mathcal {I}\) of unit diameter. Let \(f:\mathcal {I}^{n}\rightarrow \mathbb {R}\) be defined by \(f\left( x\right) =\sum _{ij}x_{i}A_{ij}x_{j}\). Then for \(t>0\)

$$\begin{aligned} \Pr \left\{ f-Ef>t\right\} \le \exp \left( \frac{-t^{2}}{2E\left[ \left\| A\mathbf {X}\right\| ^{2}\right] +2\left\| A\right\| t}\right) . \end{aligned}$$
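
A Monte Carlo illustration of Theorem 37: the matrix A, the dimension n and the distribution of the \(X_{i}\) (uniform on an interval of unit diameter) are arbitrary choices, and \(E\left[ \left\| A\mathbf {X}\right\| ^{2}\right] \) is estimated empirically. The bound is valid, although not sharp, for this particular choice of A.

```python
import numpy as np

rng = np.random.default_rng(4)

# Illustration only: n, A and the entry distribution are arbitrary choices.
n, trials = 30, 200000
A = rng.normal(size=(n, n))
A = (A + A.T) / 2.0
np.fill_diagonal(A, 0.0)                       # symmetric, zero diagonal
opnorm = np.linalg.norm(A, ord=2)

X = rng.uniform(-0.5, 0.5, size=(trials, n))   # interval of unit diameter
f = np.einsum("ti,ij,tj->t", X, A, X)          # f(X) = sum_ij X_i A_ij X_j
EAX2 = np.mean(np.sum((X @ A) ** 2, axis=1))   # estimate of E||AX||^2

for t in (2.0, 5.0, 10.0):
    empirical = np.mean(f - f.mean() > t)
    bound = np.exp(-t ** 2 / (2 * EAX2 + 2 * opnorm * t))
    print(f"t={t:4.1f}  empirical tail={empirical:.5f}  bound={bound:.5f}")
```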

5.5 The Supremum of an Empirical Process

We will now apply the decoupling trick to the upwards tail of the supremum of an empirical process, sharpening the bound obtained in Sect. 3.4.

Theorem 38

Let \(X_{1},...,X_{n}\) be independent with values in some space \(\mathcal {X}\) with \(X_{i}\) distributed as \(\mu _{i}\), and let \(\mathcal {F}\) be an at most countable class of functions \(f:\mathcal {X}\rightarrow \left[ -1,1\right] \) with \(E\left[ f\left( X_{i}\right) \right] =0\). Define \(F:\mathcal {X}^{n}\rightarrow \mathbb {R}\) and \(W:\mathcal {X}^{n}\rightarrow \mathbb {R}\) by

$$\begin{aligned} F\left( \mathbf {x}\right)= & {} \sup _{f\in \mathcal {F}}\sum _{i}f\left( x_{i}\right) \text { and} \\ W\left( \mathbf {x}\right)= & {} \sup _{f\in \mathcal {F}}\sum _{i}\left( f^{2}\left( x_{i}\right) +E\left[ f^{2}\left( X_{i}\right) \right] \right) . \end{aligned}$$

Then for \(t>0\)

$$\begin{aligned} \Pr \left\{ F-E\left[ F\right] >t\right\} \le \exp \left( \frac{-t^{2}}{2E\left[ W\right] +t}\right) . \end{aligned}$$

This inequality improves over Theorem 12.2 in [6], since by the triangle inequality \(E\left[ W\right] \le \Sigma ^{2}+\sigma ^{2}\) and the constants in the denominator of the exponent are better by a factor of two, and optimal for the variance term.

Proof

Let \(0<\gamma \le \beta <2\). Using Lemma 26 and (24) we get

$$\begin{aligned} \text {Ent}_{F}\left( \gamma \right) \le \frac{\gamma }{2}E_{\gamma F}\left[ \gamma V_{+}^{2}\left( F\right) \right] \le \frac{\gamma }{2}\left( \text {Ent}_{F}\left( \gamma \right) +\ln Ee^{\gamma V_{+}^{2}\left( F\right) }\right) . \end{aligned}$$

Rearranging gives

$$\begin{aligned} \text {Ent}_{F}\left( \gamma \right) \le \frac{\gamma }{2-\gamma }\ln Ee^{\gamma V_{+}^{2}\left( F\right) }. \end{aligned}$$
(26)

Fix some \(\mathbf {x}\in \mathcal {X}^{n}\) and let \(\hat{f}\in \mathcal {F}\) witness the maximum in the definition of \(F\left( \mathbf {x}\right) \). For \(y\in \mathcal {X}\) we have \(\left( F-S_{y}^{k}F\right) _{+}\le \left( \hat{f}\left( x_{k}\right) -\hat{f}\left( y\right) \right) _{+}\) and by the zero mean assumption

$$\begin{aligned} V_{+}^{2}\left( F\right) \left( \mathbf {x}\right)= & {} \sum _{k}E_{y\sim \mu _{k}}\left[ \left( F\left( \mathbf {x}\right) -S_{y}^{k}F\left( \mathbf {x}\right) \right) _{+}^{2}\right] \\\le & {} \sum _{k}E_{y\sim \mu _{k}}\left( \hat{f}\left( x_{k}\right) -\hat{f}\left( y\right) \right) _{+}^{2} \\\le & {} \sum _{k}E_{y\sim \mu _{k}}\left( \hat{f}\left( x_{k}\right) -\hat{f}\left( y\right) \right) ^{2} \\= & {} \sum _{k}\left( \hat{f}^{2}\left( x_{k}\right) +E\left[ \hat{f}^{2}\left( X_{k}\right) \right] \right) \\\le & {} W\left( \mathbf {x}\right) . \end{aligned}$$

So \(V_{+}^{2}\left( F\right) \le W\). It follows from (26) that

$$\begin{aligned} \text {Ent}_{F}\left( \gamma \right) \le \frac{\gamma }{2-\gamma }\ln Ee^{\gamma V_{+}^{2}\left( F\right) }\le \frac{\gamma }{2-\gamma }\ln E\left[ e^{\gamma W}\right] . \end{aligned}$$
(27)

Next we establish self-boundedness of W. Let \(\hat{f}\in \mathcal {F}\) (different from the previous \(\hat{f}\), which we don’t need any more) witness the maximum in the definition of \(W\left( \mathbf {x}\right) \). Then

$$\begin{aligned} V_{+}^{2}\left( W\right) \left( \mathbf {x}\right)= & {} \sum _{k}E_{y\sim \mu _{k}}\left( W\left( \mathbf {x}\right) -S_{y}^{k}W\left( \mathbf {x}\right) \right) _{+}^{2} \\\le & {} \sum _{k}E_{y\sim \mu _{k}}\left[ \left( \hat{f}^{2}\left( x_{k}\right) -\hat{f}^{2}\left( y\right) \right) _{+}^{2}\right] \\\le & {} \sum _{k}\hat{f}^{2}\left( x_{k}\right) \\\le & {} W. \end{aligned}$$

It therefore follows from the self-bounding lemma, Lemma 32, that

$$\begin{aligned} \ln E\left[ e^{\gamma W}\right] \le \frac{\gamma ^{2}E\left[ W\right] }{2-\gamma }+\gamma E\left[ W\right] =\frac{\gamma E\left[ W\right] }{1-\gamma /2}. \end{aligned}$$

Combining this with (27) gives

$$\begin{aligned} \text {Ent}_{F}\left( \gamma \right) \le \frac{\gamma }{2-\gamma }\left( \frac{\gamma E\left[ W\right] }{1-\gamma /2}\right) =\frac{\gamma ^{2}}{\left( 1-\gamma /2\right) ^{2}}\frac{E\left[ W\right] }{2}. \end{aligned}$$

From (6) in Theorem 12 we conclude that

$$\begin{aligned} \ln Ee^{\beta \left( F-EF\right) }= & {} \beta \int _{0}^{\beta }\frac{\text {Ent}_{F}\left( \gamma \right) }{\gamma ^{2}}d\gamma \le \beta \int _{0}^{\beta }\frac{1}{\left( 1-\gamma /2\right) ^{2}}d\gamma ~\frac{E\left[ W\right] }{2} \\= & {} \frac{\beta ^{2}}{1-\beta /2}\frac{E\left[ W\right] }{2}. \end{aligned}$$

Using Lemma 33 it follows that

$$\begin{aligned} \Pr \left\{ F-E\left[ F\right] >t\right\}\le & {} \inf _{\beta \in \left( 0,2\right) }\exp \left( -\beta t+\frac{\beta ^{2}}{1-\beta /2}\frac{E\left[ W\right] }{2}\right) \\\le & {} \exp \left( \frac{-t^{2}}{2E\left[ W\right] +t}\right) . \end{aligned}$$

   \(\square \)

5.6 Another Version of Bernstein’s Inequality

A potential weakness of Theorem 21 is the occurrence of the supremum in the definition of the variance parameter \(V=\sup _{\mathbf {x}\in \Omega }\Sigma ^{2}\left( f\right) \left( \mathbf {x}\right) \). If the supremum could be replaced by an expectation, the variance parameter would become the Efron–Stein upper bound \(E\left[ \Sigma ^{2}\left( f\right) \right] \) on the variance \(\sigma ^{2}\left( f\right) \), making the inequality considerably stronger. Such a modification is possible at the expense of enlarging the subexponential term in Bernstein’s inequality. Define the interaction functional

$$\begin{aligned} J\left( f\right) =2\left( \sup _{\mathbf {x,z}\in \Omega }\sum _{k,l:k\ne l}\sigma _{k}^{2}\left( f-S_{z_{l}}^{l}f\right) \left( \mathbf {x}\right) \right) ^{1/2}. \end{aligned}$$

The following theorem is given in [21].

Theorem 39

Suppose \(f\in \mathcal {A}\left( \Omega \right) \) satisfies \(f-E_{k}f\le b\) for all k. Then for all \(t>0\)

$$\begin{aligned} \Pr \left\{ f-Ef>t\right\} \le \exp \left( \frac{-t^{2}}{2E\left[ \Sigma ^{2}\left( f\right) \right] +\left( 2b/3+J\left( f\right) \right) t}\right) . \end{aligned}$$

Here we will use the tools introduced above to prove a slight strengthening of this result, replacing the one-sided boundedness condition by a weaker moment condition.

Let \(f:\Omega =\prod _{i=1}^{n}\Omega _{i}\rightarrow \mathbb {R}\) and consider the three conditions

$$\begin{aligned} \left( A\right)= & {} \left( \left( f-E_{k}f\right) \le b\text { for all }k\right) \\ \left( B\right)= & {} \left( E_{k}\left[ \left( f-E_{k}f\right) ^{m}\right] \le \frac{1}{2}m!\sigma _{k}^{2}\left( f\right) b^{m-2}\text { for }m\ge 2\text { and all }k\right) \\ \left( C\right)= & {} \left( \sum _{k=1}^{n}E_{k}\left[ \left( f-E_{k}f\right) ^{m}\right] \le \frac{\Sigma ^{2}\left( f\right) }{2}m!b^{m-2}\text { for }m\ge 2\right) . \end{aligned}$$

Then \(\left( A\right) \implies \left( B\right) \implies \left( C\right) \). The last condition (sometimes called “Bernstein condition” in the literature) is sufficient for the following version of Bernstein’s inequality, which extends Theorem 2.10 in [6] from sums to general functions and replaces the one-sided boundedness requirement of Theorem 39 by the Bernstein condition.

Theorem 40

Let \(f:\Omega =\prod _{i=1}^{n}\Omega _{i}\rightarrow \mathbb {R}\) be measurable and suppose that \(\left( C\right) \) holds. Then for \(t>0\)

$$\begin{aligned} \Pr \left\{ f-Ef>t\right\} \le \exp \left( \frac{-t^{2}}{2E\left[ \Sigma ^{2}\left( f\right) \right] +\left( 2b+J\left( f\right) \right) t}\right) . \end{aligned}$$

The first step is to bound the entropy of f under the condition (C), thus replacing Lemma 19 in the proof of Theorem 21.

Lemma 41

Suppose \(\left( C\right) \) holds with \(b=1\). Then for all \(\beta \in \left[ 0,1\right) \)

$$\begin{aligned} \text {Ent}_{f}\left( \beta \right) \le \frac{\beta ^{2}E_{\beta f}\left[ \Sigma ^{2}\left( f\right) \right] }{2\left( 1-\beta \right) ^{2}}. \end{aligned}$$

Proof

First we get from the variational property of the variance that

$$\begin{aligned} \sigma _{k,\beta f}^{2}\left( f\right)\le & {} E_{k,\beta f}\left[ \left( f-E_{k}\left( f\right) \right) ^{2}\right] =\frac{E_{k}\left[ \left( f-E_{k}\left( f\right) \right) ^{2}e^{\beta \left( f-E_{k}f\right) }\right] }{E_{k}\left[ e^{\beta \left( f-E_{k}f\right) }\right] } \\\le & {} E_{k}\left[ \left( f-E_{k}\left( f\right) \right) ^{2}e^{\beta \left( f-E_{k}f\right) }\right] , \end{aligned}$$

where we used Jensen’s inequality to get \(E_{k}\left[ \exp \left( \beta \left( f-E_{k}f\right) \right) \right] \ge 1\) for the second inequality. From monotone convergence and \(\left( C\right) \) we then get

$$\begin{aligned} \sum _{k=1}^{n}\sigma _{k,\beta f}^{2}\left( f\right)\le & {} \sum _{k=1}^{n}E_{k}\left[ \left( f-E_{k}f\right) ^{2}e^{\beta \left( f-E_{k}f\right) }\right] =\sum _{m=0}^{\infty }\sum _{k=1}^{n}\frac{\beta ^{m}}{m!}E_{k}\left[ \left( f-E_{k}f\right) ^{m+2}\right] \\\le & {} \frac{\Sigma ^{2}\left( f\right) }{2}\sum _{m=0}^{\infty }\left( m+1\right) \left( m+2\right) \beta ^{m}. \end{aligned}$$

Thus from Theorem 12

$$\begin{aligned} \text {Ent}_{f}\left( \beta \right)\le & {} E_{\beta f}\left[ \int _{0}^{\beta }\int _{t}^{\beta }\sum _{k=1}^{n}\sigma _{k,sf}^{2}\left( f\right) ~ds~dt\right] \\\le & {} \frac{E_{\beta f}\left[ \Sigma ^{2}\left( f\right) \right] }{2}\sum _{m=0}^{\infty }\left( m+1\right) \left( m+2\right) \int _{0}^{\beta }\int _{t}^{\beta }s^{m}dsdt \\= & {} \frac{E_{\beta f}\left[ \Sigma ^{2}\left( f\right) \right] }{2}\beta ^{2}\sum _{m=0}^{\infty }\left( m+1\right) \beta ^{m}=\frac{\beta ^{2}E_{\beta f}\left[ \Sigma ^{2}\left( f\right) \right] }{2\left( 1-\beta \right) ^{2}}. \end{aligned}$$

   \(\square \)

At this point we could bound the thermal expectation \(E_{\beta f}\left[ \Sigma ^{2}\left( f\right) \right] \) by a supremum and proceed along the usual path to obtain a version of Theorem 21 under condition (C), which, for sums of independent variables, would reduce to Theorem 2.10 in [6]. Instead we wish to exploit the decoupling idea and look for concentration properties of \(\Sigma ^{2}\left( f\right) \).

The crucial property of the interaction functional J is that \(J^{2}\) is a self-bound for \(\Sigma ^{2}\left( f\right) \). The following lemma is also the key to the proof of Theorem 39.

Lemma 42

We have \(D^{2}\left( \Sigma ^{2}\left( f\right) \right) \le J\left( f\right) ^{2}~\Sigma ^{2}\left( f\right) \) for any \(f\in \mathcal {A}\left( \Omega \right) \).

Proof

Fix \(\mathbf {x}\in \Omega \). Below all members of \(\mathcal {A}\) are understood as evaluated on \(\mathbf {x}\). For \(l\in \left\{ 1,...,n\right\} \) let \(z_{l}\in \Omega _{l}\) be a minimizer in z of \(S_{z}^{l}\Sigma ^{2}\left( f\right) \). Then

$$\begin{aligned} D^{2}\left( \Sigma ^{2}\left( f\right) \right) =\sum _{l}\left( \sum _{k:k\ne l}\left( \sigma _{k}^{2}\left( f\right) -S_{z_{l}}^{l}\sigma _{k}^{2}\left( f\right) \right) \right) ^{2}. \end{aligned}$$

The sum is restricted to \(k\ne l\), since \(\sigma _{l}^{2}\left( f\right) \in \mathcal {A}_{l}\), so \(S_{z_{l}}^{l}\sigma _{l}^{2}\left( f\right) =\sigma _{l}^{2}\left( f\right) \) and the term with \(k=l\) vanishes. Then, using \(2\sigma _{k}^{2}\left( f\right) =E_{\left( y,y^{\prime }\right) \sim \mu _{k}^{2}}\left( D_{y,y^{\prime }}^{k}f\right) ^{2}\), we get

$$\begin{aligned} 4D^{2}\left( \Sigma ^{2}\left( f\right) \right)= & {} \sum _{l}\left( \sum _{k:k\ne l}E_{\left( y,y^{\prime }\right) \sim \mu _{k}^{2}}\left( D_{y,y^{\prime }}^{k}f\right) ^{2}-S_{z_{l}}^{l}E_{\left( y,y^{\prime }\right) \sim \mu _{k}^{2}}\left( D_{y,y^{\prime }}^{k}f\right) ^{2}\right) ^{2} \\= & {} \sum _{l}\left( \sum _{k\ne l}E_{\left( y,y^{\prime }\right) \sim \mu _{k}^{2}}\left[ \left( D_{y,y^{\prime }}^{k}f\right) ^{2}-\left( D_{y,y^{\prime }}^{k}S_{z_{l}}^{l}f\right) ^{2}\right] \right) ^{2} \\= & {} \sum _{l}\left( \sum _{k\ne l}E_{\left( y,y^{\prime }\right) \sim \mu _{k}^{2}}\left[ \left( D_{y,y^{\prime }}^{k}f-D_{y,y^{\prime }}^{k}S_{z_{l}}^{l}f\right) \left( D_{y,y^{\prime }}^{k}f+D_{y,y^{\prime }}^{k}S_{z_{l}}^{l}f\right) \right] \right) ^{2} \\\le & {} \sum _{l}\sum _{k:k\ne l}E_{\left( y,y^{\prime }\right) \sim \mu _{k}^{2}}\left[ D_{y,y^{\prime }}^{k}\left( f-S_{z_{l}}^{l}f\right) \right] ^{2}\times \\ \text { \ }&{ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ }\sum _{k:k\ne l}E_{\left( y,y^{\prime }\right) \sim \mu _{k}^{2}}\left[ D_{y,y^{\prime }}^{k}f+D_{y,y^{\prime }}^{k}S_{z_{l}}^{l}f\right] ^{2} \end{aligned}$$

by an application of Cauchy–Schwarz. Now, using \(\left( a+b\right) ^{2}\le 2a^{2}+2b^{2}\), we can bound the last sum, independently of l, by

$$\begin{aligned}&\sum _{k:k\ne l}E_{\left( y,y^{\prime }\right) \sim \mu _{k}^{2}}\left[ 2\left( D_{y,y^{\prime }}^{k}f\right) ^{2}+2\left( D_{y,y^{\prime }}^{k}S_{z_{l}}^{l}f\right) ^{2}\right] \\&=4\sum _{k:k\ne l}\sigma _{k}^{2}\left( f\right) +4S_{z_{l}}^{l}\sum _{k:k\ne l}\sigma _{k}^{2}\left( f\right) \\&\le 4\left( \Sigma ^{2}\left( f\right) +S_{z_{l}}^{l}\Sigma ^{2}\left( f\right) \right) =4\left( \Sigma ^{2}\left( f\right) +\inf _{z\in \Omega _{l}}S_{z}^{l}\Sigma ^{2}\left( f\right) \right) \le 8\Sigma ^{2}\left( f\right) , \end{aligned}$$

so that

$$\begin{aligned} D^{2}\left( \Sigma ^{2}\left( f\right) \right)\le & {} 2\sum _{l}\sum _{k:k\ne l}E_{\left( y,y^{\prime }\right) \sim \mu _{k}^{2}} \left[ D_{y,y^{\prime }}^{k}\left( f-S_{z_{l}}^{l}f\right) \right] ^{2}\Sigma ^{2}\left( f\right) \\\le & {} 4\sup _{\mathbf {x,z}\in \Omega }\sum _{k,l:k\ne l}\sigma _{k}^{2}\left( f-S_{z}^{l}f\right) \left( \mathbf {x}\right) \Sigma ^{2}\left( f\right) =J^{2}\left( f\right) \Sigma ^{2}\left( f\right) . \end{aligned}$$

   \(\square \)

Now we can use decoupling to put these pieces together.

Proof of Theorem 40

By rescaling it suffices to prove the result for \(b=1\). We can also assume \(J:=J\left( f\right) >0\). Let \(0<\gamma \le \beta <1/\left( 1+J/2\right) \) and set \(\theta =\gamma /\left( J\left( 1-\gamma \right) \right) \). Then \(\gamma ^{2}/\left( 2\left( 1-\gamma \right) ^{2}\right)<\theta <2/J^{2}\). By Lemma 41

$$\begin{aligned} \theta \text {Ent}_{f}\left( \gamma \right) \le \frac{\gamma ^{2}}{2\left( 1-\gamma \right) ^{2}}E_{\gamma f}\left[ \theta \Sigma ^{2}\left( f\right) \right] \le \frac{\gamma ^{2}}{2\left( 1-\gamma \right) ^{2}}\left( \text {Ent}_{f}\left( \gamma \right) +\ln E\left[ e^{\theta \Sigma ^{2}\left( f\right) }\right] \right) , \end{aligned}$$

where the second inequality follows from the decoupling inequality (24). Subtract \(\gamma ^{2}/\left( 2\left( 1-\gamma \right) ^{2}\right) \)Ent\(_{f}\left( \gamma \right) \) to get

$$\begin{aligned} \text {Ent}_{f}\left( \gamma \right) \left( \theta -\frac{\gamma ^{2}}{2\left( 1-\gamma \right) ^{2}}\right) \le \frac{\gamma ^{2}}{2\left( 1-\gamma \right) ^{2}}\ln E\left[ e^{\theta \Sigma ^{2}\left( f\right) }\right] . \end{aligned}$$

Since \(\gamma ^{2}/\left( 2\left( 1-\gamma \right) ^{2}\right) <\theta \) this simplifies, using the value of \(\theta \), to

$$\begin{aligned} \text {Ent}_{f}\left( \gamma \right) \le \frac{\gamma J}{2\left( 1-\left( 1+J/2\right) \gamma \right) }\ln E\left[ e^{\theta \Sigma ^{2}\left( f\right) }\right] . \end{aligned}$$
(28)

On the other hand \(\theta <2/J^{2}\), so the self-boundedness of \(\Sigma ^{2}\left( f\right) \) (Lemma 42) and part (i) of Lemma 32 give

$$\begin{aligned} \ln E\left[ e^{\theta \Sigma ^{2}\left( f\right) }\right] \le \frac{\theta }{1-J^{2}\theta /2}E\left[ \Sigma ^{2}\left( f\right) \right] =\frac{\gamma /J}{1-\left( 1+J/2\right) \gamma }E\left[ \Sigma ^{2}\left( f\right) \right] . \end{aligned}$$
(29)

Combining (28) and (29) to get a bound on Ent\(_{f}\left( \gamma \right) \) gives

$$\begin{aligned} \text {Ent}_{f}\left( \gamma \right) \le \frac{\gamma ^{2}}{2\left( 1-\left( 1+J/2\right) \gamma \right) ^{2}}E\left[ \Sigma ^{2}\left( f\right) \right] \end{aligned}$$

and from Theorem 12 and Lemma 33

$$\begin{aligned} \Pr \left\{ f-Ef>t\right\}\le & {} \inf _{\beta \in \left( 0,1/\left( 1+J/2\right) \right) }\exp \left( \frac{E\left[ \Sigma ^{2}\left( f\right) \right] }{2}\frac{\beta ^{2}}{1-\left( 1+J/2\right) \beta }-\beta t\right) \\\le & {} \exp \left( \frac{-t^{2}}{2\left( E\left[ \Sigma ^{2}\left( f\right) \right] +\left( 1+J/2\right) t\right) }\right) . \end{aligned}$$

   \(\square \)

To use Theorem 40 one has to bound b and J. For the latter it is often sufficient to use the simple bound

$$\begin{aligned} J\left( f\right) \le n\max _{k\ne l}\sup _{\mathbf {x}\in \Omega }\sup _{z,z^{\prime }\in \Omega _{l}}\sup _{y,y^{\prime }\in \Omega _{k}}D_{z,z^{\prime }}^{l}D_{y,y^{\prime }}^{k}f\left( \mathbf {x}\right) , \end{aligned}$$
(30)

which can be obtained from Lemma 13.

We conclude with an application to U-statistics. Let \(m<n\) be integers, \(\Omega _{i}=\mathcal {X}\) and \(\kappa :\mathcal {X}^{m}\rightarrow \mathbb {R}\) a symmetric kernel. For a subset of indices \(S=\left\{ j_{1},...,j_{m}\right\} \subseteq \left\{ 1,...,n\right\} \) of cardinality m define \(\kappa _{S}:\mathcal {X}^{n}\rightarrow \mathbb {R}\) by \(\kappa _{S}\left( \mathbf {x}\right) =\kappa \left( x_{j_{1}},...,x_{j_{m}}\right) \). The U-statistic of order m induced by \(\kappa \) is then the function \(U:\mathcal {X}^{n}\rightarrow \mathbb {R}\) given by

$$\begin{aligned} U\left( \mathbf {x}\right) =\left( {\begin{array}{c}n\\ m\end{array}}\right) ^{-1}\sum _{S\subseteq \left\{ 1,...,n\right\} ,\left| S\right| =m}\kappa _{S}\left( \mathbf {x}\right) . \end{aligned}$$

U-statistics were introduced by Hoeffding [15]. Their importance stems from the fact that for iid \(\mathbf {X}=\left( X_{1},...,X_{n}\right) \) the random variable \(U\left( \mathbf {X}\right) \) is an unbiased estimator of \(E\left[ \kappa \left( X_{1},...,X_{m}\right) \right] \). Starting with Hoeffding's work there has been a long line of research on concentration inequalities for U-statistics. To simplify the presentation we will not exploit the advantage of Theorem 40 over Theorem 39 and simply assume the kernel to be bounded, \(\kappa :\mathcal {X}^{m}\rightarrow \left[ 0,1\right] \).
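The following Python sketch computes a U-statistic of order \(m=2\) directly from the definition above; it is an illustration only, and the kernel \(\kappa \left( a,b\right) =\left( a-b\right) ^{2}/2\) is an arbitrary bounded symmetric choice (for data in \(\left[ 0,1\right] \) it takes values in \(\left[ 0,1/2\right] \)). With this particular kernel \(U\) coincides with the unbiased sample variance.

```python
from itertools import combinations
from math import comb
import random

# Direct implementation of the U-statistic defined above (illustration only).
# kappa(a, b) = (a - b)^2 / 2 is symmetric and bounded for data in [0, 1];
# with this kernel U equals the unbiased sample variance.

def u_statistic(x, kappa, m):
    n = len(x)
    total = sum(kappa(*(x[j] for j in S)) for S in combinations(range(n), m))
    return total / comb(n, m)

def kappa(a, b):
    return (a - b) ** 2 / 2

random.seed(0)
x = [random.random() for _ in range(20)]
print("U =", u_statistic(x, kappa, m=2))
```

For this kernel \(E\left[ U\left( \mathbf {X}\right) \right] =E\left[ \left( X_{1}-X_{2}\right) ^{2}/2\right] =\sigma ^{2}\left( X_{1}\right) \), consistent with the unbiasedness noted above.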

Notice that, if \(k\notin S\), then \(\kappa _{S}\in \mathcal {A}_{k}\), so \(\kappa _{S}\left( \mathbf {x}\right) -E_{k}\left[ \kappa _{S}\left( \mathbf {x}\right) \right] =0\) and thus

$$\begin{aligned} U\left( \mathbf {x}\right) -E_{k}\left[ U\left( \mathbf {x}\right) \right]= & {} \left( {\begin{array}{c}n\\ m\end{array}}\right) ^{-1}\sum _{\begin{array}{c} S\subseteq \left\{ 1,...,n\right\} \\ k\in S \end{array}}\left( \kappa _{S}\left( \mathbf {x}\right) -E_{k}\left[ \kappa _{S}\left( \mathbf {x}\right) \right] \right) \\\le & {} \left( {\begin{array}{c}n\\ m\end{array}}\right) ^{-1}\left| \left\{ S\subseteq \left\{ 1,...,n\right\} :k\in S\right\} \right| \\= & {} \frac{\left( {\begin{array}{c}n-1\\ m-1\end{array}}\right) }{\left( {\begin{array}{c}n\\ m\end{array}}\right) }=\frac{m!\left( n-1\right) !}{n!\left( m-1\right) !}=\frac{m}{n}, \end{aligned}$$

so we can set the quantity b in Theorem 40 to m/n. To bound J, use (30) and note that each double difference of a \(\left[ 0,1\right] \)-valued kernel is at most 2, to obtain

$$\begin{aligned} J\left( U\right)\le & {} n\max _{k\ne l}\sup _{\mathbf {x}\in \Omega }\sup _{z,z^{\prime }\in \Omega _{l}}\sup _{y,y^{\prime }\in \Omega _{k}}D_{z,z^{\prime }}^{l}D_{y,y^{\prime }}^{k}U\left( \mathbf {x}\right) \\\le & {} n\left( {\begin{array}{c}n\\ m\end{array}}\right) ^{-1}\sum _{\begin{array}{c} S\subseteq \left\{ 1,...,n\right\} \\ k,l\in S \end{array}}D_{z,z^{\prime }}^{l}D_{y,y^{\prime }}^{k}\kappa _{S}\left( \mathbf {x}\right) \\\le & {} 2n\left( {\begin{array}{c}n\\ m\end{array}}\right) ^{-1}\left| \left\{ S\subseteq \left\{ 1,...,n\right\} :k,l\in S\right\} \right| \\= & {} \frac{2n\left( {\begin{array}{c}n-2\\ m-2\end{array}}\right) }{\left( {\begin{array}{c}n\\ m\end{array}}\right) }\le \frac{2m^{2}}{n}. \end{aligned}$$
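The counting identities used above are easy to verify numerically. The following sketch (with arbitrary values of n and m, \(m\ge 2\)) checks exactly that \(\left( {\begin{array}{c}n-1\\ m-1\end{array}}\right) /\left( {\begin{array}{c}n\\ m\end{array}}\right) =m/n\) and that \(2n\left( {\begin{array}{c}n-2\\ m-2\end{array}}\right) /\left( {\begin{array}{c}n\\ m\end{array}}\right) =2m\left( m-1\right) /\left( n-1\right) \le 2m^{2}/n\).

```python
from math import comb
from fractions import Fraction

# Exact check of the two counting steps above (n and m arbitrary, m >= 2).

for n, m in [(10, 2), (50, 3), (200, 5)]:
    b = Fraction(comb(n - 1, m - 1), comb(n, m))          # bound on U - E_k U
    assert b == Fraction(m, n)
    j = Fraction(2 * n * comb(n - 2, m - 2), comb(n, m))  # bound on J(U)
    assert j == Fraction(2 * m * (m - 1), n - 1)
    assert j <= Fraction(2 * m * m, n)
    print(f"n={n:4d}, m={m}:  b = {b},  J <= {float(j):.4f} <= {2 * m * m / n:.4f}")
```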

Substituting these bounds in Theorem 40 gives, for \(t>0\),

$$\begin{aligned} \Pr \left\{ U-EU>t\right\} \le \exp \left( \frac{-t^{2}}{2E\left[ \Sigma ^{2}\left( U\right) \right] +2\left( m+m^{2}\right) t/n}\right) . \end{aligned}$$

It can be shown (see, e.g., [21], Houdré [17]) that in general \(E\left[ \Sigma ^{2}\left( f\right) \right] \le \sigma ^{2}\left( f\right) +J^{2}\left( f\right) /4\), so for U-statistics the Efron–Stein inequality is nearly tight in the sense that \(E\left[ \Sigma ^{2}\left( U\right) \right] \le \sigma ^{2}\left( U\right) +m^{4}/n^{2}\). It follows that for deviations \(t\ge m^{2}/n\)

$$\begin{aligned} \Pr \left\{ U-EU>t\right\} \le \exp \left( \frac{-t^{2}}{2\sigma ^{2}\left( U\right) +2\left( m+2m^{2}\right) t/n}\right) . \end{aligned}$$

This inequality can be compared to the classical work of Hoeffding [15] and to the more recent results of Arcones [2], both of which consider undecoupled, nondegenerate U-statistics of arbitrary order. Hoeffding [15] does not obtain the correct variance term, while [2] gives the correct variance term but severely overestimates the subexponential coefficient of Bernstein's inequality, which there grows exponentially in the degree m of the U-statistic (above it is only of order \(m^{2}\)). This exponential dependence on m results from the use of the decoupling inequalities in [24] and seems to beset most works on U-statistics of higher order (e.g., [1, 13]), which in many other ways improve over our simple inequality above.
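As a final informal illustration (not from the text; the kernel, the distribution of the \(X_{i}\) and all parameters are arbitrary choices), the following Monte Carlo sketch compares the empirical upper tail of a U-statistic of order \(m=2\) with the variance-based bound above, taking \(t=m^{2}/n\), the smallest deviation covered by that bound, and estimating \(EU\) and \(\sigma ^{2}\left( U\right) \) from the simulated values. For moderate n the bound is loose, since the \(\left( m+2m^{2}\right) t/n\) term dominates the denominator.

```python
import math
import random

# Monte Carlo sanity check of the variance-based tail bound above, with m = 2,
# the bounded kernel kappa(a, b) = |a - b| and X_i uniform on [0, 1].
# EU and sigma^2(U) are estimated from the same simulated values.

def U(x):
    n = len(x)
    s = sum(abs(x[i] - x[j]) for i in range(n) for j in range(i + 1, n))
    return s / (n * (n - 1) / 2)

random.seed(0)
n, m, reps = 100, 2, 2000
t = m ** 2 / n
samples = [U([random.random() for _ in range(n)]) for _ in range(reps)]
mean = sum(samples) / reps
var = sum((u - mean) ** 2 for u in samples) / reps
empirical = sum(u - mean > t for u in samples) / reps
bound = math.exp(-t ** 2 / (2 * var + 2 * (m + 2 * m ** 2) * t / n))
print(f"empirical P(U - EU > t) = {empirical:.4f},  bound = {bound:.4f}")
```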

6 Appendix I. Table of Notation

General notation

\(\Omega =\prod _{k=1}^{n}\Omega _{k}\): underlying (product) probability space
\(\mathcal {A}\): bounded measurable functions on \(\Omega \)
\(\mu =\otimes _{k=1}^{n}\mu _{k}\): (product) probability measure on \(\Omega \)
\(X_{k}\): random variable distributed as \(\mu _{k}\) in \(\Omega _{k}\)
\(f\in \mathcal {A}\): fixed function under investigation
\(g\in \mathcal {A}\): generic function
\(E\left[ g\right] =\int _{\Omega }g\,d\mu \): expectation of g in \(\mu \)
\(\sigma ^{2}\left( g\right) =E\left[ \left( g-E\left[ g\right] \right) ^{2}\right] \): variance of g in \(\mu \)

Notation for the entropy method

\(\beta =1/T\): inverse temperature
\(E_{\beta f}\left[ g\right] =E\left[ ge^{\beta f}\right] /E\left[ e^{\beta f}\right] \): thermal expectation of g
\(Z_{\beta f}=E\left[ e^{\beta f}\right] \): partition function
\(d\mu _{\beta f}=Z_{\beta f}^{-1}\,e^{\beta f}d\mu \): thermal measure (canonical ensemble)
Ent\(_{f}\left( \beta \right) =\beta E_{\beta f}\left[ f\right] -\ln Z_{\beta f}\): (canonical) entropy
\(A_{f}\left( \beta \right) =\frac{1}{\beta }\ln Z_{\beta f}\): free energy
\(\sigma _{\beta f}^{2}\left( g\right) =E_{\beta f}\left[ \left( g-E_{\beta f}\left[ g\right] \right) ^{2}\right] \): thermal variance of g
\(\psi \left( t\right) =e^{t}-t-1\)
\(S_{y}^{k}F\left( \mathbf {x}\right) =F\left( x_{1},...,x_{k-1},y,x_{k+1},...,x_{n}\right) \): substitution operator
\(E_{k}\left[ g\right] \left( \mathbf {x}\right) =\int _{\Omega _{k}}S_{y}^{k}g~d\mu _{k}\left( y\right) \): conditional expectation
\(\mathcal {A}_{k}\subset \mathcal {A}\): functions independent of the k-th variable
\(Z_{k,\beta f}=E_{k}\left[ e^{\beta f}\right] \): conditional partition function
\(E_{k,\beta f}\left[ g\right] =Z_{k,\beta f}^{-1}E_{k}\left[ ge^{\beta f}\right] \): conditional thermal expectation
Ent\(_{k,f}\left( \beta \right) =\beta E_{k,\beta f}\left[ f\right] -\ln Z_{k,\beta f}\): conditional entropy
\(\sigma _{k,\beta f}^{2}\left( g\right) =E_{k,\beta f}\left[ \left( g-E_{k,\beta f}\left[ g\right] \right) ^{2}\right] \): conditional thermal variance
\(\sigma _{k}^{2}\left( g\right) =E_{k}\left[ \left( g-E_{k}\left[ g\right] \right) ^{2}\right] \): conditional variance

Operators on \(\mathcal {A}\)

\(D_{y,y^{\prime }}^{k}g=S_{y}^{k}g-S_{y^{\prime }}^{k}g\): difference operator
\(r_{k}\left( g\right) =\sup _{y,y^{\prime }\in \Omega _{k}}D_{y,y^{\prime }}^{k}g\): conditional range operator
\(R^{2}\left( g\right) =\sum _{k}r_{k}^{2}\left( g\right) \): sum of squared conditional ranges
\(\Sigma ^{2}\left( g\right) =\sum _{k}\sigma _{k}^{2}\left( g\right) \): sum of conditional variances
\(\left( \inf _{k}g\right) \left( \mathbf {x}\right) =\inf _{y\in \Omega _{k}}S_{y}^{k}g\left( \mathbf {x}\right) \): conditional infimum operator
\(V_{+}^{2}g=\sum _{k}E_{y\sim \mu _{k}}\left[ \left( \left( g-S_{y}^{k}g\right) _{+}\right) ^{2}\right] \): Efron–Stein variance proxy
\(D^{2}g=\sum _{k}\left( g-\inf _{k}g\right) ^{2}\): worst-case variance proxy