Abstract
The chapter presents the entropy method for proving concentration inequalities for functions of independent random variables. After the introduction of a general framework, the famous bounded difference inequality, versions of Bernstein’s inequality, and the Gaussian concentration inequality are derived. Applications include vector-valued concentration, random matrices, and the suprema of empirical processes.
1 Introduction
Concentration inequalities bound the probabilities that random quantities deviate from their average, median, or otherwise typical values. If this deviation is small with high probability, then a repeated experiment or observation will likely produce a similar result. In this way concentration inequalities can give quantitative guarantees of reproducibility, a concept at the heart of empirical science [25].
In this chapter we limit ourselves to the study of quantities whose randomness arises through the dependence on many independent random variables. Suppose that \(\left( \Omega _{i},\Sigma _{i}\right) \) are measurable spaces for \( i\in \left\{ 1,...,n\right\} \) and that f is a real-valued function defined on the product space \(\Omega =\prod _{i=1}^{n}\Omega _{i}\),
Now let \(\mathbf {X}=\left( X_{1},...,X_{n}\right) \) be a vector of independent random variables, where \(X_{i}\) is distributed as \(\mu _{i}\) in \( \Omega _{i}\). For \(t>0\) and \(\mathbf {X}^{\prime }\) iid to \(\mathbf {X}\) we then want to give bounds on the upwards deviation probability
in terms of the deviation t, the measures \(\mu _{i}\) and properties of the function f. Downward deviation bounds are then obtained by replacing f with \(-f\). Usually we will just write \(\Pr \left\{ f-Ef>t\right\} \) for the deviation probability above.
The first bounds of this type were given by Chebychev and Bienaymé [11] in the late 19th century for additive functions of the form
The subject has since been developed by Bernstein, Chernoff, Bennett, Hoeffding, and many others [4, 16], and results were extended from sums to more general and complicated nonlinear functions. During the past decades research activity has been stimulated by the contributions of Michel Talagrand [27, 28] and by the relevance of concentration phenomena to the rapidly growing field of computer science. Some concentration inequalities, like the well known bounded difference inequality, have become standard tools in the analysis of algorithms [23].
One of the more recent methods to derive concentration inequalities, the so-called entropy method, is rooted in the early investigations of Boltzmann [5] and Gibbs [12] into the foundations of statistical mechanics. While the modern entropy method evolved along a complicated historical path via quantum field theory and the logarithmic Sobolev inequality of Leonard Gross [14], its hidden simplicity was understood and emphasized by Michel Ledoux, who also recognized the key role which the subadditivity of entropy can play in the derivation of concentration inequalities [18]. The method has been refined by Bobkov, Massart [20], Bousquet [9], and Boucheron et al. [7]. Recently Boucheron et al. [8] showed that the entropy method is sufficiently strong to derive a form of Talagrand’s convex distance inequality.
In this chapter we present a variation of the entropy method in a compact and simplified form, closely tied to its origins in statistical mechanics. We give an exposition of the method in Sect. 2 and compress it into a toolbox to derive concentration inequalities.
In Sect. 3 we will then use this method to prove two classical concentration inequalities, the bounded difference inequality and a generalization of Bennett’s inequality. As example applications we treat vector-valued concentration and generalization in empirical risk minimization, a standard problem in machine learning theory.
In Sect. 4 we address more difficult problems. The bounded difference inequality is used to prove the famous Gaussian concentration inequality. We also give some more recent inequalities which we apply to analyze the concentration of convex Lipschitz functions on \( \left[ 0,1\right] ^{n}\), or of the spectral norm of a random matrix.
In Sect. 5 we describe some of the more advanced techniques, self-boundedness, and decoupling. As examples we give sub-Gaussian lower tail bounds for convex Lipschitz functions and a version of the Hanson-Wright inequality for bounded random variables and we derive an exponential inequality for the suprema of empirical processes. We conclude with another version of Bernstein’s inequality and its application to U-statistics.
We limit ourselves to exponential deviation bounds from the mean. For moment bounds and other advanced methods to establish concentration inequalities, such as the transportation method or an in-depth treatment of logarithmic Sobolev inequalities, we recommend the monographs by Ledoux [18] and Boucheron, Lugosi, and Massart [6], and the overview article by McDiarmid [23]. Another important recent development not covered is the method of exchangeable pairs proposed by Chatterjee [10] .
We fix some conventions and notation:
If \(\left( \Omega ,\Sigma \right) \) is any measurable space, \(\mathcal {A} \left( \Omega \right) \) will denote the algebra of bounded measurable real-valued functions on \(\Omega \). When there is no ambiguity we often just write \(\mathcal {A}\) for \(\mathcal {A}\left( \Omega \right) \). Although we give some results for unbounded functions, most functions for which we will prove concentration inequalities are assumed to be measurable and bounded, that is \(f\in \mathcal {A}\). This assumption simplifies the statement of our results, because it guarantees the existence of algebraic and exponential moments and makes our arguments more transparent.
If \(\left( \Omega ,\Sigma ,\mu \right) \) is a probability space we write \( \Pr F=\mu \left( F\right) \) for \(F\in \Sigma \), and \(E\left[ f\right] =\int _{\Omega }fd\mu \) for \(f\in L_{1}\left[ \mu \right] \) and \(\sigma ^{2} \left[ f\right] =E\left[ \left( f-E\left[ f\right] \right) ^{2}\right] \) for \(f\in L_{2}\left[ \mu \right] \). Wherever we use \(\Pr \), E or \(\sigma ^{2},\) we assume that there is an underlying probability space \(\left( \Omega ,\Sigma ,\mu \right) \). If we refer to other measures than \(\mu \), then we identify them with corresponding subscripts.
If \(\mathcal {X}\) is any set and \(n\in \mathbb {N}\), then for \(y\in \mathcal {X} \) and \(k\in \left\{ 1,...,n\right\} \) the substitution operator \(S_{y}^{k}: \mathcal {X}^{n}\rightarrow \mathcal {X}^{n}\) is defined as \(S_{y}^{k}\left( x_{1},...,x_{n}\right) =\left( x_{1},...,x_{k-1},y,x_{k+1},...,x_{n}\right) \).
This and other notation which we introduce along the way is also summarized in a final section in tabular form.
2 The Entropy Method
In this section we develop the entropy method and package it into a toolbox to prove concentration inequalities.
2.1 Markov’s Inequality and Exponential Moment Method
The most important tool in the proof of deviation bounds is Markov’s inequality, which we now introduce along with two corollaries, Chebychev’s inequality and the exponential moment method.
Theorem 1
(Markov inequality) Let \(f\in L_{1}\left[ \mu \right] \), \(f\ge 0\) and \(t>0\). Then \(\Pr \left\{ f>t\right\} \le E\left[ f\right] /t\).
Proof
Since \(f\ge 0\) and \(t>0\) we have \(1_{f>t}\le f/t\) and therefore
\(\square \)
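Markov’s inequality is easy to check by simulation. The following Python sketch is purely illustrative and not part of the formal development; the exponential distribution and the sample size are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
# nonnegative samples with E[f] = 1
f = rng.exponential(scale=1.0, size=100_000)

for t in [1.0, 2.0, 5.0]:
    empirical = np.mean(f > t)   # empirical tail Pr{f > t}
    markov = f.mean() / t        # Markov bound E[f]/t
    assert empirical <= markov
```

The bound is loose here (the exponential tail decays like \(e^{-t}\)), which is one reason the exponential moment method below is preferable for deviation bounds.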
Corollary 2
(Chebychev inequality) Let \(f\in L_{2}\left[ \mu \right] \) and \(t>0\). Then \(\Pr \left\{ \left| f-E\left[ f\right] \right| >t\right\} \le \sigma ^{2}\left[ f\right] /t^{2}\).
To use Chebychev’s inequality we need to bound the variance \(\sigma ^{2}\left( f\right) \). If f is a sum of independent variables, the variance of f is just the sum of the variances of the individual variables, but this doesn’t work for general functions. In Sect. 3.1, however, we give the Efron–Stein inequality, which asserts that for functions of independent variables the variance is bounded by the expected sum of conditional variances.
The idea of Chebychev’s inequality obviously extends to other even centered moments \(E\left[ \left( f-E\left[ f\right] \right) ^{2p}\right] \). Bounding higher moments of functions of independent variables is an important technique discussed, for example, in [6].
Here the most important corollary of Markov’s inequality is the exponential moment method, an idea apparently due to Bernstein [4].
Corollary 3
(exponential moment method) Let \( f\in \mathcal {A}\), \(\beta \ge 0\) and \(t>0\). Then \(\Pr \left\{ f>t\right\} \le e^{-\beta t}E\left[ e^{\beta f}\right] \).
To use this we need to bound the quantity \(E\left[ e^{\beta f}\right] \) and to optimize the right-hand side above over \(\beta \). We call \(E\left[ e^{\beta f}\right] \) the partition function, denoted \(Z_{\beta f}=E \left[ e^{\beta f}\right] \). Bounding the partition function (or its logarithm) is the principal problem in the derivation of exponential tail bounds.
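As a small numerical sketch of the method (an illustration with arbitrary choices, not part of the text), one can estimate the partition function of a centered sum of independent signs on a grid of \(\beta \) and take the best resulting tail bound:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 50, 100_000
# f is a centered sum of n independent signs, so E[f] = 0
samples = rng.choice([-1.0, 1.0], size=(m, n)).sum(axis=1)

t = 10.0
betas = np.linspace(0.05, 0.6, 40)
# Markov applied to e^{beta f}: Pr{f > t} <= E[e^{beta f}] e^{-beta t};
# estimate the partition function and optimize the bound over beta
bounds = [np.mean(np.exp(b * samples)) * np.exp(-b * t) for b in betas]
chernoff = min(bounds)
empirical = np.mean(samples > t)
assert empirical <= chernoff
```

In practice one bounds \(E\left[ e^{\beta f}\right] \) analytically rather than by simulation; the grid search merely mimics the optimization over \(\beta \).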
If f is a sum of independent components (as in (1)), then the partition function is the product of the partition functions corresponding to these components, and its logarithm (called the moment generating function) is additive. This is a convenient basis to obtain deviation bounds for sums, but it does not immediately extend to general non-additive functions. The approach taken here, the entropy method, balances simplicity and generality.
2.2 Entropy and Concentration
For the remainder of this section we take the function \(f\in \mathcal {A}\) as fixed. We could interpret the points \(x\in \Omega \) as possible states of a physical system and f as the negative energy (or Hamiltonian) function, so that \(-f\left( x\right) \) is the system’s energy in the state x. The measure \(\mu \) then models an a priori probability distribution of states in the absence of any constraining information. We will define another probability measure on \(\Omega \), with specified expected energy, but with otherwise minimal assumptions.
If \(\rho \) is a function on \(\Omega \), \(\rho \ge 0\) and \(E\left[ \rho \right] =1\), the Kullback–Leibler divergence \(KL\left( \rho d\mu ,d\mu \right) \) of \(\rho d\mu \) to \(d\mu \) is \(KL\left( \rho d\mu ,d\mu \right) =E\left[ \rho \ln \rho \right] \).
\(KL\left( \rho d\mu ,d\mu \right) \) can be interpreted as the information we gain, if we are told that the probability measure is \(\rho d\mu \) instead of the a priori measure \(d\mu \).
Theorem 4
For all \(f\in \mathcal {A}\), \(\beta \in \mathbb {R}\)
where the supremum is over all nonnegative measurable functions \(\rho \) on \( \Omega \) satisfying \(E\left[ \rho \right] =1\).
The supremum is attained for the density
Proof
We can assume \(\beta =1\) by absorbing it in f. Let \(\rho \ge 0\) satisfy \(E \left[ \rho \right] =1\), so that \(\rho d\mu \) is a probability measure and \( g\in \mathcal {A}\mapsto E_{\rho }\left[ g\right] :=E\left[ \rho g\right] \) an expectation functional. Let \(\phi \left( x\right) =1/\rho \left( x\right) \) if \(\rho \left( x\right) >0\) and \(\phi \left( x\right) =0\) if \(\rho \left( x\right) =0\). Then \(E\left[ \rho \ln \rho \right] =-E\left[ \rho \ln \phi \right] =-E_{\rho }\left[ \ln \phi \right] \) and with Jensen’s inequality
On the other hand
\(\square \)
In statistical physics the maximizing probability measure \(d\mu _{\beta f}=\rho _{\beta f}d\mu =e^{\beta f}d\mu /E\left[ e^{\beta f}\right] \) is called the thermal measure, sometimes also the canonical ensemble. It is used to describe a system in thermal equilibrium with a heat reservoir at temperature \(T\approx 1/\beta \). The corresponding expectation functional
is called the thermal expectation. The normalizing quantity \( Z_{\beta f}=E\left[ e^{\beta f}\right] \) is the partition function already introduced above. Notice that for any constant c we have \(E_{\beta \left( f+c\right) }\left[ g\right] =E_{\beta f}\left[ g\right] \).
The value of the function \(\rho \mapsto E\left[ \rho \ln \rho \right] \) at the thermal density \(\rho _{\beta f}=Z_{\beta f}^{-1}e^{\beta f}\) is called the canonical entropy or simply entropy,
Note that \(\mathrm {Ent}_{-f}\left( \beta \right) =\mathrm {Ent}_{f}\left( -\beta \right) \), a simple but very useful fact.
Suppose that \(\rho \) is any probability density on \(\Omega \), which gives the same expected value for the energy as \(\rho _{\beta f}\), so that \(E\left[ \rho f\right] =E_{\beta f}\left[ f\right] \). Then
The thermal measure \(d\mu _{\beta f}=\rho _{\beta f}d\mu \) therefore minimizes the information gain relative to the a priori measure \(d\mu \), given the expected value \(-E_{\beta f}\left[ f\right] \) of the internal energy.
For \(g\in \mathcal {A}\) and \(\rho =Z_{\beta f}^{-1}e^{\beta f}\) Theorem 4 gives
which allows us to decouple g from f. This plays an important role later in this chapter.
For \(\beta \ne 0\) define a function
By l’Hospital’s rule we have \(\lim _{\beta \rightarrow 0}A_{f}\left( \beta \right) =E\left[ f\right] \), so \(A_{f}\) extends continuously to \(\mathbb {R}\) by setting \(A_{f}\left( 0\right) =E\left[ f\right] \). In statistical physics the quantity \(A_{f}\left( \beta \right) \) so defined is called the free energy corresponding to the Hamiltonian (energy function) \(H=-f\) and temperature \(T\approx \beta ^{-1}\). Theorem 4 exhibits the free energy and the canonical entropy as a pair of convex conjugates. Dividing (2) by \(\beta \) and writing \(U=E_{\beta f} \left[ f\right] \), we recover the classical thermodynamic relation
which describes the macroscopically available energy A as the difference between the internal energy U and an energy portion \(T~\mathrm {Ent}\), which is inaccessible due to ignorance of the microscopic state.
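On a finite state space all of these quantities can be computed directly. The following sketch (with an arbitrary energy function and a uniform a priori measure, chosen only for illustration) verifies the relation \(A=U-T\,\mathrm {Ent}\) with \(T=1/\beta \), and the identification of the canonical entropy with the Kullback–Leibler divergence of the thermal measure from \(\mu \).

```python
import numpy as np

# finite state space with uniform a priori measure mu; f = negative energy
f = np.array([0.0, 0.5, 1.0, 2.0])
mu = np.full(4, 0.25)
beta = 1.3

Z = np.sum(mu * np.exp(beta * f))      # partition function Z_{beta f}
p = mu * np.exp(beta * f) / Z          # thermal measure mu_{beta f}
U = np.sum(p * f)                      # thermal expectation E_{beta f}[f]
ent = beta * U - np.log(Z)             # canonical entropy Ent_f(beta)
A = np.log(Z) / beta                   # free energy A_f(beta)

# classical thermodynamic relation A = U - T * Ent with T = 1/beta
assert np.isclose(A, U - ent / beta)
# the entropy equals KL(rho_{beta f} d mu, d mu)
assert np.isclose(ent, np.sum(p * np.log(p / mu)))
```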
The following theorem establishes the connection of entropy, the exponential moment method and concentration inequalities.
Theorem 5
For \(f\in \mathcal {A}\) and any \(\beta \ge 0\) we have
and, for \(t\ge 0\),
Proof
Differentiating the free energy with respect to \(\beta \) we find
By the fundamental theorem of calculus
which is the first inequality. Then by Markov’s inequality
\(\square \)
Our strategy to establish concentration results will therefore be the search for appropriate bounds on the entropy.
2.3 Entropy and Energy Fluctuations
The thermal variance of a function \(g\in \mathcal {A}\) is just the variance of g relative to the thermal expectation. It is denoted \(\sigma _{\beta f}^{2}\left( g\right) \) and defined by
For constant c we have \(\sigma _{\beta \left( f+c\right) }^{2}\left[ g\right] =\sigma _{\beta f}^{2}\left[ g\right] \).
The proof of the following lemma consists of straightforward calculations, an easy exercise to familiarize oneself with thermal measure, expectation and variance.
Lemma 6
The following formulas hold for \(f\in \mathcal {A} \)
1. \(\frac{d}{d\beta }\left( \ln Z_{\beta f}\right) =E_{\beta f}\left[ f\right] \).
2. If \(h:\beta \mapsto h\left( \beta \right) \in \mathcal {A}\) is differentiable and \(\left( d/d\beta \right) h\left( \beta \right) \in \mathcal {A}\) then
3. \(\frac{d}{d\beta }E_{\beta f}\left[ f^{k}\right] =E_{\beta f}\left[ f^{k+1}\right] -E_{\beta f}\left[ f^{k}\right] E_{\beta f}\left[ f\right] .\)
4. \(\frac{d^{2}}{d\beta ^{2}}\left( \ln Z_{\beta f}\right) =\frac{d}{d\beta }E_{\beta f}\left[ f\right] =\sigma _{\beta f}^{2}\left[ f\right] .\)
Proof
1. is immediate and 2. a straightforward computation. 3. and 4. are immediate consequences of 1. and 2. \(\square \)
Since the members of \(\mathcal {A}\) are bounded, it follows from 2. that for \(f,g\in \mathcal {A}\) the functions \(\beta \mapsto E_{\beta f}\left[ g\right] \) and \(\beta \mapsto \sigma _{\beta f}^{2}\left[ g\right] \) are \(C^{\infty }\).
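The identities of Lemma 6 can also be checked numerically by finite differences. The sketch below (an arbitrary three-point example, for illustration only) verifies conclusions 1 and 4.

```python
import numpy as np

f = np.array([0.0, 1.0, 3.0])          # values of f on a three-point space
mu = np.array([0.5, 0.3, 0.2])         # a priori probabilities

def log_Z(beta):                       # ln Z_{beta f}
    return np.log(np.sum(mu * np.exp(beta * f)))

def thermal_E(beta, g):                # thermal expectation E_{beta f}[g]
    p = mu * np.exp(beta * f)
    p /= p.sum()
    return np.sum(p * g)

beta, h = 0.7, 1e-4
# conclusion 1: d/dbeta ln Z = E_{beta f}[f]
num = (log_Z(beta + h) - log_Z(beta - h)) / (2 * h)
assert abs(num - thermal_E(beta, f)) < 1e-6
# conclusion 4: d^2/dbeta^2 ln Z = thermal variance of f
num2 = (log_Z(beta + h) - 2 * log_Z(beta) + log_Z(beta - h)) / h**2
var = thermal_E(beta, f**2) - thermal_E(beta, f)**2
assert abs(num2 - var) < 1e-4
```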
The thermal variance of f itself corresponds to energy fluctuations. The next theorem represents entropy as a double integral of such fluctuations. The utility of this representation to derive concentration results has been noted by David McAllester [22].
Theorem 7
We have for \(\beta >0\)
Proof
Using the previous lemma and the fundamental theorem of calculus we obtain the formulas
and
which we subtract to obtain
\(\square \)
Since bounding \(\sigma _{\beta f}^{2}\left[ f\right] \) is central to our method, it is worth mentioning an interpretation in terms of heat capacity, or specific heat. Recall that \(-E_{\beta f}\left[ f\right] \) is the expected internal energy. The rate of change of this quantity with temperature T is the heat capacity. By conclusion 4 of Lemma 6 we have
which exhibits the proportionality of heat capacity and energy fluctuations.
2.4 Product Spaces and Conditional Operations
We now set \(\Omega =\prod _{k=1}^{n}\Omega _{k}\) and \(d\mu =\prod _{k=1}^{n}d\mu _{k}\), where each \(\mu _{k}\) is the probability measure representing the distribution of some variable \(X_{k}\) in the space \(\Omega _{k}\), so that the \(X_{k}\) are assumed to be independent.
With \(\mathcal {A}_{k}\) we denote the subalgebra of those functions \(f\in \mathcal {A}\), which are independent of the k-th argument. To efficiently deal with operations performed on individual arguments of functions in \(\mathcal {A}\) we need some special notation.
Now let \(k\in \left\{ 1,...,n\right\} \) and \(y\in \Omega _{k}\). If \(\Xi \) is any set and F is any function \(F:\Omega \rightarrow \Xi \), we extend the definition of the substitution operator \(S_{y}^{k}\) to F by \(S_{y}^{k}\left( F\right) =F\circ S_{y}^{k}\). This means
so the k-th argument is simply replaced by y. Since for \(f\in \mathcal {A} \) the function \(S_{y}^{k}f\) is independent of \(x_{k}\) (which had been replaced by y) we see that \(S_{y}^{k}\) is a homomorphic (linear and multiplication-preserving) projection of \(\mathcal {A}\) onto \(\mathcal {A}_{k}\).
For \(k\in \left\{ 1,...,n\right\} \) and \(y,y^{\prime }\in \Omega _{k}\) we define the difference operator \(D_{y,y^{\prime }}^{k}:\mathcal {A}\rightarrow \mathcal {A}_{k}\) by \(D_{y,y^{\prime }}^{k}f=S_{y}^{k}f-S_{y^{\prime }}^{k}f\).
Clearly \(D_{y,y^{\prime }}^{k}\) annihilates \(\mathcal {A}_{k}\). The operator \(r_{k}:\mathcal {A}\rightarrow \mathcal {A}_{k}\), defined by \(r_{k}f=\sup _{y,y^{\prime }\in \Omega _{k}}D_{y,y^{\prime }}^{k}f\) is called the k-th conditional range. We also use the abbreviations \(\inf _{k}f=\inf _{y\in \Omega _{k}}S_{y}^{k}f\) and \(\sup _{k}f=\sup _{y\in \Omega _{k}}S_{y}^{k}f\) for the conditional infimum and supremum.
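For a finite product space the operators \(S_{y}^{k}\), \(D_{y,y^{\prime }}^{k}\) and \(r_{k}\) can be implemented by enumeration. The sketch below (with an arbitrary non-additive test function, for illustration only) also checks the identity \(r_{k}f=\sup _{k}f-\inf _{k}f\).

```python
# finite product space Omega = {0,1,2}^3, so conditional ranges can be
# computed by enumeration (an arbitrary illustrative example)
Omega_k = [0, 1, 2]

def f(x):                        # a non-additive test function
    return max(x) + x[0] * x[1]

def S(k, y, x):                  # substitution operator S_y^k
    x = list(x)
    x[k] = y
    return tuple(x)

def D(k, y, yp, x):              # difference operator D_{y,y'}^k applied to f
    return f(S(k, y, x)) - f(S(k, yp, x))

def r(k, x):                     # conditional range r_k f
    return max(D(k, y, yp, x) for y in Omega_k for yp in Omega_k)

x = (1, 2, 0)
assert S(2, 1, x) == (1, 2, 1)   # the k-th argument is replaced by y
# r_k f coincides with sup_k f - inf_k f
vals = [f(S(0, y, x)) for y in Omega_k]
assert r(0, x) == max(vals) - min(vals)
```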
Given the measures \(\mu _{k}\) and \(k\in \left\{ 1,...,n\right\} \) we define the operator \(E_{k}:\mathcal {A}\rightarrow \mathcal {A}_{k}\) by
The operator \(E_{k}\left[ .\right] =E\left[ .|X_{1},...,X_{k-1},X_{k+1},...,X_{n}\right] \) is the expectation conditional on all variables with indices different from k. \(E_{k}\) is a linear projection onto \(\mathcal {A}_{k}\). The operators \(E_{k}\) also commute with one another, and for \(h\in \mathcal {A}\) and \(g\in \mathcal {A}_{k}\) we have
Replacing the operator E by \(E_{k}\) leads to the definition of conditional thermodynamic quantities, all of which are now members of the algebra \(\mathcal {A}_{k}\):
-
The conditional partition function \(Z_{k,\beta f}=E_{k}\left[ e^{\beta f}\right] \),
-
The conditional thermal expectation \(E_{k,\beta f}\left[ g\right] =Z_{k,\beta f}^{-1}E_{k}\left[ ge^{\beta f}\right] \) for \(g\in \mathcal {A},\)
-
The conditional entropy \(\mathrm {Ent}_{k,f}\left( \beta \right) =\beta E_{k,\beta f}\left[ f\right] -\ln Z_{k,\beta f},\)
-
The conditional thermal variance \(\sigma _{k,\beta f}^{2}\left[ g\right] =E_{k,\beta f}\left[ \left( g-E_{k,\beta f}\left[ g\right] \right) ^{2}\right] \) for \(g\in \mathcal {A}\). As \(\beta \rightarrow 0\) this becomes
-
The conditional variance \(\sigma _{k}^{2}\left[ g\right] =E_{k}\left[ \left( g-E_{k}\left[ g\right] \right) ^{2}\right] \) for \(g\in \mathcal {A}\).
The previously established relations hold also for the corresponding conditional quantities. Of particular importance for our method is the conditional version of Theorem 7
The following lemma, which states that the conditional thermal expectation just behaves like a conditional expectation, will also be used frequently.
Lemma 8
For any \(f,g\in \mathcal {A}\), \(k\in \left\{ 1,...,n\right\} \), \(\beta \in \mathbb {R}\)
Proof
Using \(E\left[ E_{k}\left[ h\right] g\right] =E\left[ hE_{k}\left[ g\right] \right] \)
\(\square \)
2.5 The Subadditivity of Entropy
In the non-interacting case, when the energy function f is a sum, \(f=\sum f_{k},\) it is easily verified that \(\mathrm {Ent}_{k,f}\left( \beta \right) \left( \mathbf {x}\right) =\mathrm {Ent}_{k,f}\left( \beta \right) \) is independent of \(\mathbf {x}\) and that
In statistical physics it is said that entropy is an extensive quantity: the entropy of non-interacting systems is equal to the sum of the individual entropies.
Equality no longer holds in the interacting, nonlinear case, but there is a subadditivity property which is sufficient for the purpose of concentration inequalities:
The total entropy is no greater than the thermal average of the sum of the conditional entropies.
Theorem 9
For \(f\in \mathcal {A}\) and \(\beta >0\)
In 1975 Elliott Lieb [19] gave a proof of this result, which was probably known some time before, at least in the classical setting relevant to our arguments. Together with Theorem 5 and Theorem 7 it completes our basic toolbox to prove concentration inequalities. For the proof we need two auxiliary results.
Lemma 10
Let \(h,g>0\) be bounded measurable functions on \(\Omega \). Then for any expectation E
Proof
Define an expectation functional \(E_{g}\) by \(E_{g}\left[ h\right] =E\left[ gh\right] /E\left[ g\right] \). The function \(\Phi \left( t\right) =t\ln t\) is convex for positive t, since \(\Phi ^{\prime \prime }=1/t>0\). Then
Thus, by Jensen’s inequality,
\(\square \)
Next we prove (5) for general positive functions.
Lemma 11
Let \(\rho \in \mathcal {A}\), \(\rho >0\). Then
Proof
Write \(E^{k}\left[ .\right] =E_{1}E_{2}...E_{k}\left[ .\right] \) with \(E^{0}\) being the identity map on \(\mathcal {A}\). The innocuous looking identity \(E\left[ E^{k}\left[ .\right] \right] =E\left[ .\right] \) is an obvious consequence of the fact that we work with product probabilities. Without independence it would not hold, and the following simple argument would break down. Note that \(E^{n}=E\). We expand
We get from Lemma 10, using \(E\left[ E^{k-1}\left[ .\right] \right] =E\left[ .\right] ,\)
\(\square \)
Finally we specialize to the canonical entropy.
Proof of Theorem 9
Set \(\rho =e^{\beta f}\) in Lemma 11 to get
where we used Lemma 8 in the last identity. \(\square \)
2.6 Summary of Results
The exponential moment method, Corollary 3, and Theorems 5, 7, and 9 provide us with the tools to prove several useful concentration inequalities. Here is a summary:
Theorem 12
For \(f\in \mathcal {A}\) and \(\beta >0\) we have
Concatenating the exponential moment bound (6), the entropy representation of the moment generating function (7), the subadditivity of entropy (8) and the fluctuation representation of the conditional entropy (10), we obtain the following generic concentration inequality.
This is the template for the results given in the next section.
3 First Applications of the Entropy Method
We now develop some first consequences of the method, beginning with the Efron–Stein inequality, a general bound on the variance. Then we continue with the derivation of the bounded difference inequality, a simple and perhaps the most useful concentration inequality, for which we give two illustrating applications. Then we give a Bennett-Bernstein type inequality which we apply to the concentration of vector-valued random variables.
3.1 The Efron–Stein Inequality
Combining the fluctuation representations (9) and (10) with the subadditivity (8) of entropy and dividing by \(\beta ^{2}\) we obtain
Using the continuity properties of \(\beta \mapsto E_{\beta f}\left[ g\right] \) and \(\beta \mapsto \sigma _{\beta f}^{2}\left[ f\right] \), which follow from Lemma 6, we can take the limit as \(\beta \rightarrow 0\) and multiply by 2 to obtain \(\sigma ^{2}\left[ f\right] \le E\left[ \Sigma ^{2}\left( f\right) \right] \),
where we introduced the notation \(\Sigma ^{2}\left( f\right) =\sum _{k}\sigma _{k}^{2}\left[ f\right] \) for the sum of conditional variances.
Equation (11) is the famous Efron–Stein–Steele inequality [26]. It is an easy exercise to provide the details of the above limit process and to extend the inequality to general functions \(f\in L_{2}\left[ \mu \right] \) by approximation with a sequence of truncations.
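The inequality (11) is easy to test by simulation. The sketch below (illustrative only, with \(f=\max \) of independent uniform variables as an arbitrary example) uses the standard resampling identity \(\sigma _{k}^{2}\left[ f\right] =\frac{1}{2}E_{k}\left[ \left( f-S_{X_{k}^{\prime }}^{k}f\right) ^{2}\right] \), where \(X_{k}^{\prime }\) is an independent copy of \(X_{k}\), and estimates both sides of (11) by Monte Carlo.

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 5, 200_000
X = rng.uniform(size=(m, n))
f = X.max(axis=1)                       # f = max of n uniforms, per trial

var_f = f.var()
# E[sigma_k^2[f]] = (1/2) E[(f(X) - f(X with coordinate k resampled))^2]
es_bound = 0.0
for k in range(n):
    Xk = X.copy()
    Xk[:, k] = rng.uniform(size=m)      # resample the k-th coordinate
    es_bound += 0.5 * np.mean((f - Xk.max(axis=1)) ** 2)

assert var_f <= es_bound                # Efron-Stein inequality (11)
```

For this example the exact values are \(\sigma ^{2}\left[ f\right] =5/252\) and \(E\left[ \Sigma ^{2}\left( f\right) \right] =5/168\), so the bound is off by a factor of 3/2 only.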
3.2 The Bounded Difference Inequality
The variance of a bounded real random variable is never greater than a quarter of the square of its range.
Lemma 13
If \(f\in \mathcal {A}\) satisfies \(a\le f\le b\) then \(\sigma ^{2}\left[ f\right] \le \left( b-a\right) ^{2}/4\).
Proof
To see the last inequality use calculus to find the maximal value of the function \(t\mapsto \left( b-t\right) \left( t-a\right) \). \(\square \)
The bounded difference inequality bounds the deviation of a function from its mean in terms of the sum of squared conditional ranges, which is the operator \(R^{2}:\mathcal {A\rightarrow A}\) defined by
Theorem 14
(Bounded difference inequality) For \(f\in \mathcal {A}\) and \(t>0\) \(\Pr \left\{ f-E\left[ f\right] >t\right\} \le \exp \left( \frac{-2t^{2}}{\sup _{\mathbf {x}\in \Omega }R^{2}\left( f\right) \left( \mathbf {x}\right) }\right) \).
Proof
Applied to the conditional thermal variance Lemma 13 gives
so combining the subadditivity of entropy (8) and the fluctuation representation (10) gives
Bounding the thermal expectation \(E_{\gamma f}\) by the supremum over \(\mathbf {x}\in \Omega \) we obtain from Theorem 12 (7)
and the tail bound (6) gives for all \(\beta >0\)
Substitution of the minimizing value \(\beta =4t/\left( \sup _{\mathbf {x}\in \Omega }R^{2}\left( f\right) \left( \mathbf {x}\right) \right) \) completes the proof. \(\square \)
Notice that the conditional range \(r_{k}\left( f\right) \) is a function in \(\mathcal {A}_{k}\) and may depend on all \(x_{i}\) except \(x_{k}\). The sum \(R^{2}\left( f\right) =\sum _{k=1}^{n}r_{k}\left( f\right) ^{2}\) may thus depend on all the \(x_{i}\). It is therefore a very pleasant feature that the supremum over \(\mathbf {x}\) is taken outside the sum. In the sequel this will allow us to derive the Gaussian concentration inequality from Theorem 14. The bound (12) will be re-used in Sect. 5.4 to prove a version of the Hanson-Wright inequality for quadratic forms.
In the literature one often sees the following weaker version of Theorem 14.
Corollary 15
For \(f\in \mathcal {A}\) and \(t>0\)
If f is a sum \(f=\sum _{k}X_{k}\), then \(r_{k}^{2}\) is independent of \(\mathbf {x}\) and the two results are equivalent. In this case we obtain the well known Hoeffding inequality [16].
Corollary 16
(Hoeffding’s inequality) Let \(X_{k}\) be real random variables with \(a_{k}\le X_{k}\le b_{k}\). Then \(\Pr \left\{ \sum _{k}\left( X_{k}-E\left[ X_{k}\right] \right) >t\right\} \le \exp \left( \frac{-2t^{2}}{\sum _{k}\left( b_{k}-a_{k}\right) ^{2}}\right) \).
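Hoeffding’s bound can be checked by simulation. The sketch below assumes the classical constant \(\exp \left( -2t^{2}/\sum _{k}\left( b_{k}-a_{k}\right) ^{2}\right) \); the uniform distribution and parameters are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(4)
n, m = 30, 200_000
X = rng.uniform(size=(m, n))            # a_k = 0, b_k = 1, E[X_k] = 1/2
S = X.sum(axis=1) - n / 2               # centered sum, one value per trial

for t in [2.0, 4.0]:
    empirical = np.mean(S > t)
    hoeffding = np.exp(-2 * t**2 / n)   # sum of squared ranges equals n
    assert empirical <= hoeffding
```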
Returning to the general case of non-additive functions, it is remarkable that for many applications the following “little bounded difference inequality”, which is yet weaker than Corollary 15, seems to be sufficient.
Corollary 17
For \(f\in \mathcal {A}\) and \(t>0\)
where
3.3 Vector-Valued Concentration
Suppose the \(X_{i}\) are independent random variables with values in a normed space \(\mathcal {B}\) such that \(EX_{i}=0\) and \(\left\| X_{i}\right\| \le c_{i}\). Let \(\Omega _{i}=\left\{ y\in \mathcal {B}:\left\| y\right\| \le c_{i}\right\} \) and define \(f:\prod _{i=1}^{n}\Omega _{i}\rightarrow \mathbb {R}\) by \(f\left( \mathbf {x}\right) =\left\| \sum _{i}x_{i}\right\| \).
Then by the triangle inequality, for \(y,y^{\prime }\) with \(\left\| y\right\| ,\left\| y^{\prime }\right\| \le c_{k}\)
so \(R^{2}\left( f\right) \left( \mathbf {x}\right) \le 4\sum _{i}c_{i}^{2}\). It follows from Corollary 15 that
or that for \(\delta >0\) with probability at least \(1-\delta \) in \(\left( X_{1},...,X_{n}\right) \)
If \(\mathcal {B}\) is a Hilbert space we can bound \(E\left\| \sum _{i}X_{i}\right\| \le \sqrt{\sum _{i}E\left[ \left\| X_{i}\right\| ^{2}\right] }\) by Jensen’s inequality and if all the \(X_{i}\) are iid we get with probability at least \(1-\delta \)
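The resulting high-probability bound is easy to test numerically. The sketch below is illustrative: it assumes the constant \(\exp \left( -2t^{2}/\sup R^{2}\right) \) of Theorem 14, so that with probability at least \(1-\delta \) one has \(\left\| \sum _{i}X_{i}\right\| \le \sqrt{\sum _{i}E\left\| X_{i}\right\| ^{2}}+\sqrt{2\sum _{i}c_{i}^{2}\ln \left( 1/\delta \right) }\), and it uses spherically distributed vectors as an arbitrary example.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, m, c, delta = 50, 10, 5_000, 1.0, 0.01

# X_i uniform on the sphere of radius c in R^d: zero mean, ||X_i|| = c
X = rng.standard_normal(size=(m, n, d))
X /= np.linalg.norm(X, axis=2, keepdims=True) / c
norms = np.linalg.norm(X.sum(axis=1), axis=1)   # ||sum_i X_i||, one per trial

# with probability at least 1 - delta:
# ||sum X_i|| <= sqrt(sum_i E||X_i||^2) + sqrt(2 sum_i c_i^2 ln(1/delta))
bound = np.sqrt(n * c**2) + np.sqrt(2 * n * c**2 * np.log(1 / delta))
assert np.mean(norms > bound) <= delta
```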
3.4 Rademacher Complexities and Generalization
Now let \(\mathcal {X}\) be any measurable space and \(\mathcal {F}\) a countable class of functions \(f:\mathcal {X\rightarrow }\left[ 0,1\right] \) and \(\mathbf {X}=\left( X_{1},...,X_{n}\right) \) be a vector of iid random variables with values in \(\mathcal {X}\).
Empirical risk minimization seeks \(f\in \mathcal {F}\) with minimal risk \(E\left[ f\left( X\right) \right] \), but, as the true distribution of X is unknown, it must be content with minimizing the empirical surrogate
One way to justify this method is by giving a bound on the uniform estimation error
The vector space
becomes a normed space with norm \(\left\| g\right\| =\sup _{f\in \mathcal {F}}\left| g\left( f\right) \right| \). For each \(X_{i}\) define \(\hat{X}_{i}\) \(\in \mathcal {B}\) by \(\hat{X}_{i}\left( f\right) =f\left( X_{i}\right) -E\left[ f\left( X_{i}\right) \right] \). Then the \(\hat{X}_{i}\) are zero mean random variables in \(\mathcal {B}\) satisfying \(\left\| \hat{X}_{i}\right\| \le 1\), and (13) of the preceding section gives with probability at least \(1-\delta \)
The expectation term on the right-hand side can be bounded in terms of Rademacher complexity [3]. This is the function \(\mathcal {R}:\mathcal {F}\times \mathcal {X}^{n}\rightarrow \mathbb {R}\) defined as
where \(\mathbf {\epsilon }=\left( \epsilon _{1},...,\epsilon _{n}\right) \) is a vector of independent Rademacher variables, which are uniformly distributed on \(\left\{ -1,1\right\} \). We have, with \(X_{i}^{\prime }\) iid to \(X_{i}\)
for any \(\epsilon \in \left\{ -1,1\right\} ^{n}\), because the expectation is invariant under the interchange of \(X_{i}\) and \(X_{i}^{\prime }\) on an arbitrary subset of indices. Passing to the expectation in \(\epsilon \) and using the triangle inequality gives
Now we use the bounded difference inequality again to bound the deviation of \(\mathcal {R}\left( \mathcal {F},\mathbf {.}\right) \) from its expectation. We have, again using the triangle inequality,
and thus obtain
or, for every \(\delta >0\) with probability at least \(1-\delta \)
By a union bound we conclude that with probability at least \(1-\delta \)
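Rademacher complexities are easily estimated by Monte Carlo. The sketch below is an illustration only: it uses a class of threshold functions on \(\left[ 0,1\right] \) and assumes the common normalization \(\mathcal {R}\left( \mathcal {F},\mathbf {x}\right) =E_{\mathbf {\epsilon }}\sup _{f\in \mathcal {F}}\frac{1}{n}\sum _{i}\epsilon _{i}f\left( x_{i}\right) \).

```python
import numpy as np

rng = np.random.default_rng(5)
n = 100
x = rng.uniform(size=n)                  # a fixed sample of points in [0,1]
thetas = np.linspace(0.0, 1.0, 50)       # thresholds: f_theta = 1{. <= theta}
F = (x[None, :] <= thetas[:, None]).astype(float)   # F[j, i] = f_{theta_j}(x_i)

# Monte Carlo estimate of R(F, x) = E_eps sup_f (1/n) sum_i eps_i f(x_i)
m = 2000
eps = rng.choice([-1.0, 1.0], size=(m, n))
R_hat = np.mean(np.max(F @ eps.T, axis=0)) / n
assert 0.0 < R_hat < 0.5
```

For a finite class like this one the estimate decays at the rate \(O\left( \sqrt{\ln \left| \mathcal {F}\right| /n}\right) \), in line with the generalization bounds above.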
3.5 The Bennett and Bernstein Inequalities
The proof of the bounded difference inequality relied on bounding the thermal variance \(\sigma _{k,\beta f}^{2}\left( f\right) \) uniformly in \(\beta \), using the constraints on the conditional ranges of f. We now consider the case where we use only one constraint on the ranges, say \(f-E_{k}\left[ f\right] \le 1\), but use information on the conditional variances. This leads to a Bennett type inequality as in [23]. Recall the notation \(\Sigma ^{2}\left( f\right) :=\sum \sigma _{k}^{2}\left( f\right) \) for the sum of conditional variances. Again we start with a bound on the thermal variance.
Lemma 18
Assume \(f-Ef\le 1\). Then for \(\beta >0\)
Proof
\(\square \)
Next we bound the entropy Ent\(_{f}\left( \beta \right) \).
Lemma 19
Assume that \(f-E_{k}f\le 1\) for all \(k\in \left\{ 1,...,n\right\} \). Then for \(\beta >0\)
Proof
From Theorem 12 and the previous lemma we get
The conclusion follows from the formula
\(\square \)
We need one more technical Lemma.
Lemma 20
For \(x\ge 0\) we have \(\left( 1+x\right) \ln \left( 1+x\right) -x\ge \frac{3x^{2}}{6+2x}\).
Proof
We have to show that \(f_{1}\left( x\right) :=\left( 6+2x\right) \left( \left( 1+x\right) \ln \left( 1+x\right) -x\right) -3x^{2}\ge 0\).
Since \(f_{1}\left( 0\right) =0\) and \(f_{1}^{\prime }\left( x\right) =4f_{2}\left( x\right) \) with \(f_{2}\left( x\right) :=\left( 2+x\right) \ln \left( 1+x\right) -2x\), it is enough to show that \(f_{2}\left( x\right) \ge 0\). But \(f_{2}\left( 0\right) =0\) and \(f_{2}^{\prime }\left( x\right) =\left( 1+x\right) ^{-1}+\ln \left( 1+x\right) -1\), so \(f_{2}^{\prime }\left( 0\right) =0\), but \(f_{2}^{\prime \prime }\left( x\right) =x\left( 1+x\right) ^{-2}\ge 0\), so \(f_{2}\left( x\right) \ge 0\). \(\square \)
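The inequality of Lemma 20, in the classical form \(\left( 1+x\right) \ln \left( 1+x\right) -x\ge 3x^{2}/\left( 6+2x\right) \) consistent with the functions \(f_{1}\) and \(f_{2}\) of the proof, can be spot-checked numerically:

```python
import math

def h(x):        # h(x) = (1+x) ln(1+x) - x
    return (1 + x) * math.log(1 + x) - x

def lower(x):    # claimed lower bound 3x^2 / (6+2x)
    return 3 * x**2 / (6 + 2 * x)

# both sides vanish to second order at 0; check a spread of points
for x in [0.0, 0.01, 0.5, 1.0, 5.0, 100.0]:
    assert h(x) >= lower(x) - 1e-12
```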
Now we can prove our version of Bennett’s inequality.
Theorem 21
Assume \(f-E_{k}f\le 1,\forall k\). Let \(t>0\) and denote \(V=\sup _{\mathbf {x}\in \Omega }\Sigma ^{2}\left( f\right) \left( \mathbf {x}\right) \). Then
Proof
Fix \(\beta >0\). We define the real function
which arises from deleting the first two terms in the power series expansion of the exponential function and observe that
because \(\left( d/d\gamma \right) \left( \gamma ^{-1}\left( e^{\gamma }-1\right) \right) =\gamma ^{-2}\left( \gamma e^{\gamma }-e^{\gamma }+1\right) \) and \(\lim _{\gamma \rightarrow 0}\gamma ^{-1}\left( e^{\gamma }-1\right) =1\). Theorem 12 and Lemma 19 combined with a uniform bound then give
It now follows from Theorem 12 that \(\Pr \left\{ f-E\left[ f\right] >t\right\} \le \exp \left( \psi \left( \beta \right) V-\beta t\right) \) for any \(\beta >0\). Substitution of \(\beta =\ln \left( 1+tV^{-1}\right) \) gives the first inequality, the second follows from Lemma 20. \(\square \)
Observe that the assumptions of the theorem imply that f is bounded above. The existence of exponential moments \(E\left[ e^{\beta f}\right] \) is needed only for \(\beta \ge 0\), so the assumption \(f\in \mathcal {A}\) can be dropped in this case.
If f is additive the theorem reduces to the familiar Bennett and Bernstein inequalities [16].
Corollary 22
Let \(X_{k}\) be real random variables with \(X_{k}\le E\left[ X_{k}\right] +1\) and let \(V=\sum _{k}\sigma ^{2}\left( X_{k}\right) \). Then
Theorem 21 and its corollary can be applied to functions satisfying \(f-E_{k}\left[ f\right] <b\) by a simple rescaling argument. Then Bernstein’s inequality becomes
Inequalities of this kind exhibit two types of tails, depending on which of the two terms in the denominator \(A+Bt\) of the exponent is dominant. In the sub-Gaussian regime \(A\gg Bt\) the tail decays like \(e^{-t^{2}/A}\). This is how the bounded difference inequality behaves globally, but with a very crude approximation for A, while Bernstein’s inequality uses variance information. But for larger deviations, when \(A\ll Bt\), the tail only decays like \(e^{-t/B}\). This subexponential behavior is absent in the bounded difference inequality and is the price paid for the fine-tuning in Bernstein’s inequality.
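The two regimes can be made concrete numerically. The following sketch (plain Python; the constants A and B are illustrative choices, not quantities from the text) compares the exponent \(-t^{2}/\left( A+Bt\right) \) with its sub-Gaussian and subexponential approximations.

```python
import math

def log_tail_bound(t, A, B):
    # log of a Bernstein-type bound exp(-t^2 / (A + B t))
    return -t * t / (A + B * t)

A, B = 1.0, 0.1  # illustrative constants

# sub-Gaussian regime (A >> B t): exponent is close to -t^2 / A
t = 0.1
r_gauss = log_tail_bound(t, A, B) / (-t * t / A)

# subexponential regime (A << B t): exponent is close to -t / B
t = 200.0
r_exp = log_tail_bound(t, A, B) / (-t / B)

print(r_gauss, r_exp)  # both ratios approach 1 in their respective regimes
```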
3.6 Vector-Valued Concentration Revisited
We look again at the situation of Sect. 3.3. Suppose again that the \(X_{i}\) are independent zero mean random variables with values in a normed space, which we now assume to be a Hilbert space H, but that now we have a uniform bound \(\left\| X_{i}\right\| \le c\). Again we define \(f:\left\{ y\in H:\left\| y\right\| \le c\right\} ^{n}\rightarrow \mathbb {R}\) by \(f\left( \mathbf {x}\right) =\left\| \sum _{i}x_{i}\right\| \) and observe that for \(y,y^{\prime }\in H\), \(D_{y,y^{\prime }}^{k}f\left( \mathbf {x}\right) \le \left\| y-y^{\prime }\right\| \). This implies that \(f-E_{k}\left[ f\right] \le 2c\) and also
Thus \(\Sigma ^{2}\left( f\right) \le \sum _{i}E\left\| X_{i}\right\| ^{2}\) and by Bernstein’s inequality, Theorem 21,
or that for \(\delta >0\) with probability at least \(1-\delta \) in \(\left( X_{1},...,X_{n}\right) \)
where we again used Jensen’s inequality to bound \(E\left\| \sum _{i}X_{i}\right\| \). If all the \(X_{i}\) are iid we get with probability at least \(1-\delta \)
If the variance \(E\left[ \left\| X_{1}\right\| ^{2}\right] \) is small and n is large, this is much better than the bound (14), which we got from the bounded difference inequality.
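The Jensen step above rests on the identity \(E\left\| \sum _{i}X_{i}\right\| ^{2}=\sum _{i}E\left\| X_{i}\right\| ^{2}\) for independent zero-mean vectors. A small Monte Carlo sketch (plain Python; the distribution over signed coordinate vectors and the parameters d, n, and the trial count are illustrative assumptions) checks this identity and the Jensen inequality empirically.

```python
import math, random

random.seed(0)
d, n, trials = 3, 20, 4000

def sample_vec():
    # zero-mean vector uniform over {±e_1, ..., ±e_d}; its norm is always 1
    v = [0.0] * d
    v[random.randrange(d)] = random.choice([-1.0, 1.0])
    return v

norms, sq_norms = [], []
for _ in range(trials):
    s = [0.0] * d
    for _ in range(n):
        x = sample_vec()
        s = [a + b for a, b in zip(s, x)]
    sq = sum(a * a for a in s)
    sq_norms.append(sq)
    norms.append(math.sqrt(sq))

mean_norm = sum(norms) / trials
mean_sq = sum(sq_norms) / trials
# Jensen: E||S|| <= sqrt(E||S||^2), and E||S||^2 = sum_i E||X_i||^2 = n here
print(mean_norm, math.sqrt(mean_sq), n)
```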
4 Inequalities for Lipschitz Functions and Dimension Free Bounds
We now prove some more advanced concentration inequalities. First we will use the bounded difference inequality to prove a famous sub-Gaussian bound for Lipschitz functions of independent standard normal variables. We then derive an exponential Efron–Stein inequality which allows us to prove a similar result for convex Lipschitz functions on \(\left[ 0,1\right] ^{n}\). We also obtain a concentration inequality for the operator norm of a random matrix, with deviations independent of the size of the matrix.
4.1 Gaussian Concentration
The advantage of the bounded difference inequality, Theorem 14, over its simplified Corollary 15 is that the supremum over \(\mathbf {x}\) stands outside the sum over k. This allows us to prove the following powerful Gaussian concentration inequality (the Tsirelson-Ibragimov-Sudakov inequality, Theorem 5.6 in [6]). We assume \(\Omega _{k}=\mathbb {R}\) and \(\mu _{k}\) to be the distribution of a standard normal variable, and we require f to be an L-Lipschitz function, which means that for all \(\mathbf {x,x}^{\prime }\in \mathbb {R}^{n}\)
where \(\left\| .\right\| \) is the Euclidean norm on \(\mathbb {R}^{n}\).
Theorem 23
Let \(f:\mathbb {R}^{n}\rightarrow \mathbb {R}\) be L-Lipschitz and let \(\mathbf {X}=\left( X_{1},...,X_{n}\right) \) be a vector of independent standard normal variables. Then for any \(s>0\)
Note that the function f is not assumed to be bounded on \(\mathbb {R}^{n}\).
Proof
The idea of the proof is to use the central limit theorem to approximate the \(X_{i}\) by appropriately scaled Rademacher sums \(h_{K}\left( \mathbf {\epsilon }_{i}\right) \) and to apply the bounded difference inequality to \(f\left( h_{K}\left( \mathbf {\epsilon }_{1}\right) ,...,h_{K}\left( \mathbf {\epsilon }_{n}\right) \right) \).
By an approximation argument using convolution with Gaussian kernels of decreasing width it suffices to prove the result if f is \(C^{2}\) with \(\left| \left( \partial ^{2}/\partial x_{i}^{2}\right) f\left( \mathbf {x}\right) \right| \le B\) for all \(\mathbf {x}\in \mathbb {R}^{n}\) and \(i\in \left\{ 1,...,n\right\} \), where B is a finite, but arbitrarily large constant. For \(K\in \mathbb {N} \) define a function \(h_{K}:\left\{ -1,1\right\} ^{K}\rightarrow \mathbb {R}\), a vector-valued function \(\mathbf {h}_{K}:\left\{ -1,1\right\} ^{Kn}\rightarrow \mathbb {R}^{n}\) and a function \(G_{K}:\left\{ -1,1\right\} ^{Kn}\rightarrow \mathbb {R}\) by
We will use Theorem 14 on the function \(G_{K}\) applied to independent Rademacher variables \(\mathbf {\epsilon }\).
Fix an arbitrary configuration \(\mathbf {\epsilon \in }\left\{ -1,1\right\} ^{Kn}\) and let \(\mathbf {x}=\left( x_{1},...,x_{n}\right) =\mathbf {h}_{K}\left( \mathbf {\epsilon }\right) \). For each \(i\in \left\{ 1,...,n\right\} \) we introduce the real function \(f_{i}\left( t\right) =S_{t}^{i}f\left( \mathbf {x}\right) \), so that we replace the i-th argument \(x_{i}\) by t, leaving all the other \(x_{j}\) fixed. Since f is \(C^{2}\) we have for any \(t\in \mathbb {R}\)
for some \(s\in \mathbb {R}\), and by the Lipschitz condition and the bound on \(\left| f_{i}^{\prime \prime }\right| \)
Now fix a pair of indices \(\left( i,k\right) \) with \(i\in \left\{ 1,...,n\right\} \) and \(k\in \left\{ 1,...,K\right\} \) and arbitrary values \(y,y^{\prime }\in \left\{ -1,1\right\} \) with \(y\ne y^{\prime }\). We want to bound \(\left( D_{y,y^{\prime }}^{\left( i,k\right) }G_{K}\left( \mathbf {\epsilon }\right) \right) ^{2}\). Now either one of y or \(y^{\prime }\) is equal to \(\epsilon _{ik}\), so either \(S_{y}^{\left( i,k\right) }G_{K}\left( \mathbf {\epsilon }\right) \) or \(S_{y^{\prime }}^{\left( i,k\right) }G_{K}\left( \mathbf {\epsilon }\right) \) is equal to \(G_{K}\left( \mathbf {\epsilon }\right) \). Without loss of generality we assume the second. Furthermore \(S_{y}^{k}h_{K}\left( \mathbf {\epsilon }_{i}\right) \) and \(h_{K}\left( \mathbf {\epsilon }_{i}\right) \) differ by at most \(2/\sqrt{K}\), so
Now \(f_{i}^{\prime }\left( h_{K}\left( \mathbf {\epsilon }_{i}\right) \right) \) is just equal to \(\left( \partial /\partial x_{i}\right) f\left( \mathbf {x}\right) \), so
Since \(\mathbf {\epsilon }\) was arbitrary we have
From Theorem 14 we conclude from \(f\left( \mathbf {h}_{K}\left( \mathbf {\epsilon }\right) \right) =G_{K}\left( \mathbf {\epsilon }\right) \) that
The conclusion now follows from the central limit theorem since \(\mathbf {h}_{K}\left( \mathbf {\epsilon }\right) \rightarrow \mathbf {X}\) weakly as \(K\rightarrow \infty \). \(\square \)
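As a numerical sanity check of the Gaussian concentration phenomenon, the sketch below (plain Python; sample sizes and the deviation s are illustrative choices) uses \(f\left( \mathbf {x}\right) =\max _{i}x_{i}\), which is 1-Lipschitz for the Euclidean norm, and compares the empirical upper tail with a sub-Gaussian bound of the form \(e^{-s^{2}/2L^{2}}\).

```python
import math, random

random.seed(0)
n, trials = 10, 4000

# f(x) = max_i x_i is 1-Lipschitz for the Euclidean norm (L = 1)
samples = [max(random.gauss(0.0, 1.0) for _ in range(n)) for _ in range(trials)]
mean_f = sum(samples) / trials

s = 1.5
emp_tail = sum(1 for v in samples if v - mean_f > s) / trials
bound = math.exp(-s * s / 2.0)  # sub-Gaussian bound with L = 1
print(emp_tail, bound)  # empirical tail should lie below the bound
```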
4.2 Exponential Efron Stein Inequalities
We will now use the entropy method to derive some other “dimension free” bounds of this type. We need the following very useful result.
Lemma 24
(Chebychev’s association inequality) Let g and h be real functions, X a real random variable.
If g and h are either both nondecreasing or both nonincreasing then
If either one of g or h is nondecreasing and the other nonincreasing then
Proof
Let \(X^{\prime }\) be a random variable iid to X. Then
Now if g and h are either both nondecreasing or both nonincreasing then
is always nonnegative, because both factors always have the same sign; in the other case it is always nonpositive. \(\square \)
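When X has finite support, the association inequality can be verified exactly. The sketch below (plain Python; the support and the functions g and h are arbitrary illustrative choices) computes both covariances in closed form: same-monotonicity gives a nonnegative covariance, opposite monotonicity a nonpositive one.

```python
support = [1, 2, 3, 4, 5]          # X uniform on a small finite set
p = 1.0 / len(support)

def E(fn):
    # exact expectation under the uniform distribution on the support
    return sum(p * fn(x) for x in support)

g = lambda x: x * x                # nondecreasing on the support
h_up = lambda x: x                 # nondecreasing
h_down = lambda x: -x              # nonincreasing

same = E(lambda x: g(x) * h_up(x)) - E(g) * E(h_up)
opposite = E(lambda x: g(x) * h_down(x)) - E(g) * E(h_down)
print(same, opposite)  # covariance >= 0 for same monotonicity, <= 0 otherwise
```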
We use this inequality to prove a bound on the thermal variance. First recall that for two iid random variables X and \(X^{\prime }\) we have
Lemma 25
Let \(0\le s\le \beta \). Then
Proof
Let \(\psi \) be any real function. Lemma 6 (2) gives
By Chebychev’s association inequality \(E_{\beta f}\left[ \psi \left( f\right) \right] \) is nonincreasing (nondecreasing) in \(\beta \) if \(\psi \) is nonincreasing (nondecreasing). Now define \(g:\mathbb {R}^{2}\rightarrow \mathbb {R}\) by
so that
Now for fixed x the function \(\left( f\left( x\right) -f\left( x^{\prime }\right) \right) ^{2}1_{f\left( x\right) \ge f\left( x^{\prime }\right) }\) is nonincreasing in \(f\left( x^{\prime }\right) \), so \(g\left( s,t\right) \) is nonincreasing in t. On the other hand, for fixed \(x^{\prime }\), \(\left( f\left( x\right) -f\left( x^{\prime }\right) \right) ^{2}1_{f\left( x\right) \ge f\left( x^{\prime }\right) }\) is nondecreasing in \(f\left( x\right) \), so \(g\left( s,t\right) \) is nondecreasing in s (this involves exchanging the two expectations in the definition of \(g\left( s,t\right) \)). So, since \(\mu _{0f}=\mu \), we get from \(0\le s\le \beta \) that
\(\square \)
Here is another way to write the conclusion: let \(h\in \mathcal {A}\) be defined by \(h\left( x\right) =E_{x^{\prime }\sim \mu }\left[ \left( f\left( x\right) -f\left( x^{\prime }\right) \right) _{+}^{2}\right] \). Then \(\sigma _{sf}^{2}\left( f\right) \le E_{\beta f}\left[ h\right] \).
Define two operators \(D^{2}:\mathcal {A}\rightarrow \mathcal {A}\) and \(V_{+}^{2}:\mathcal {A}\rightarrow \mathcal {A}\) by
Clearly \(V_{+}^{2}f\le D^{2}f\) as \(D^{2}f\) is obtained by bounding the expectations in the definition of \(V_{+}^{2}\) by their suprema.
Lemma 26
For \(\beta >0\) and \(f\in \mathcal {A}\)
Proof
For \(k\in \left\{ 1,...,n\right\} \) write \(h_{k}=E_{y\sim \mu _{k}}\left[ \left( f-S_{y}^{k}f\right) _{+}^{2}\right] \), so that \(V_{+}^{2}\left( f\right) =\sum _{k}h_{k}\). The conditional version of Lemma 25 then reads, for \(0\le s\le \beta \) and \(k\in \left\{ 1,...,n\right\} \),
Theorem 12 gives
where we used the identity \(E_{\beta f}\left[ E_{k,\beta f}\left[ h\right] \right] =E_{\beta f}\left[ h\right] \) for \(h\in \mathcal {A}.\) \(\square \)
The usual arguments involving Theorem 12 and an optimization in \(\beta \) now immediately lead to
Theorem 27
With \(t>0\)
We get a corresponding lower tail bound only for \(D^{2}\), and we have to use an estimate similar to the one used in the proof of Bennett’s inequality.
Lemma 28
If \(f-\inf _{k}f\le 1,\forall k\) then for \(\beta >0\)
with \(\psi \left( t\right) =e^{t}-t-1\) defined as in (16).
Proof
Let \(k\in \left\{ 1,...,n\right\} \). We write \(h_{k}:=f-\inf _{k}f\). Then \(h_{k}\in \left[ 0,1\right] \) and for \(s\le \beta \) we have \(1\le e^{\left( \beta -s\right) h_{k}}\le e^{\beta -s}\), so
We therefore have
where we used the formula
Thus, using Theorem 12 and the identity \(E_{-\beta f}E_{k,-\beta f}=E_{-\beta f},\)
\(\square \)
Lemmas 26 and 28 together with (7) imply the inequalities
and, if \(f-\inf _{k}f\le 1\) for all k, then
where in the last inequality we also used the fact that \(\gamma \mapsto \psi \left( \gamma \right) /\gamma ^{2}\) is nondecreasing. Bounding the thermal expectation with the uniform norm and substitution of \(\beta =\ln \left( 1+t\left\| D^{2}f\right\| _{\infty }^{-1}\right) \) gives the following lower tail bound as in the proof of the Bennett-Bernstein inequalities.
Theorem 29
If \(f-\inf _{k}f\le 1\) for all k, then for \(t>0\) and with \(\Delta :=\sup _{\mathbf {x}\in \Omega }D^{2}f\left( \mathbf {x}\right) \)
4.3 Convex Lipschitz Functions
In Sect. 4.1 we gave a sub-Gaussian bound for Lipschitz functions of independent standard normal variables. Now we prove the same upper tail bound under different hypotheses. Instead of assuming \(\mu _{k}\) to be standard normal we require \(\Omega _{k}=\left[ 0,1\right] \) and allow \(\mu _{k}\) to be completely arbitrary. On the other hand, in addition to being an L-Lipschitz function, we require f to be convex (actually only separately convex in each argument).
Theorem 30
Let \(\Omega _{k}=\mathcal {I}\), an interval of unit diameter, and let \(f\in \mathcal {A}\) be \(C^{1}\), L-Lipschitz and such that \(y\in \mathcal {I}\mapsto S_{y}^{k}f\left( \mathbf {x}\right) \) is convex for all k and all \(\mathbf {x}\). Then
Proof
By an approximation argument we can assume f to be differentiable. Let \(\mathbf {x}\in \left[ 0,1\right] ^{n}\), \(k\in \left\{ 1,...,n\right\} \) and \(y\in \left[ 0,1\right] \) such that \(S_{y}^{k}f\left( \mathbf {x}\right) \le f\left( \mathbf {x}\right) \). Then, using separate convexity,
We therefore have \(f\left( \mathbf {x}\right) -\inf _{y}S_{y}^{k}f\left( \mathbf {x}\right) \le \left| \left( \partial /\partial x_{k}\right) f\left( \mathbf {x}\right) \right| \) and
Theorem 27 then gives the conclusion. \(\square \)
For future reference we record the following fact: if \(\Omega _{k}\) is an interval of unit diameter and A an \(m\times n\)-matrix then \(\mathbf {x}\mapsto \left\| A\mathbf {x}\right\| \) is a convex function with Lipschitz constant \(\left\| A\right\| \) and thus
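This fact is easy to probe numerically. The sketch below (plain Python; the matrix and sample sizes are illustrative) estimates \(\left\| A\right\| \) by power iteration on \(A^{T}A\) and checks the Lipschitz property \(\left| \left\| A\mathbf {x}\right\| -\left\| A\mathbf {x}^{\prime }\right\| \right| \le \left\| A\right\| \left\| \mathbf {x}-\mathbf {x}^{\prime }\right\| \) on random pairs.

```python
import math, random

random.seed(1)
n = 5
A = [[random.uniform(-1, 1) for _ in range(n)] for _ in range(n)]

def matvec(M, v):
    return [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def opnorm(A, iters=200):
    # power iteration on A^T A to estimate the largest singular value of A
    At = [[A[j][i] for j in range(n)] for i in range(n)]
    v = [1.0] * n
    for _ in range(iters):
        v = matvec(At, matvec(A, v))
        nv = norm(v)
        v = [x / nv for x in v]
    return norm(matvec(A, v))

L = opnorm(A)
ok = True
for _ in range(100):
    x = [random.random() for _ in range(n)]
    xp = [random.random() for _ in range(n)]
    lhs = abs(norm(matvec(A, x)) - norm(matvec(A, xp)))
    rhs = L * norm([a - b for a, b in zip(x, xp)])
    ok = ok and lhs <= rhs + 1e-9
print(ok, L)
```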
4.4 The Operator Norm of a Random Matrix
For \(\mathbf {x}\in \left[ -1,1\right] ^{n^{2}}\) let \(M\left( \mathbf {x}\right) \) be the \(n\times n\) matrix whose entries are given by the components of \(\mathbf {x}\). We are interested in the concentration properties of the operator norm of \(M\left( \mathbf {X}\right) \), when \(\mathbf {X}\) is a vector with independent, but possibly not identically distributed components chosen from \(\left[ -1,1\right] \). The function in question is then \(f:\left[ -1,1\right] ^{n^{2}}\rightarrow \mathbb {R}\) defined by
where \(\left\langle .,.\right\rangle \) and \(\left\| .\right\| \) refer to inner product and norm in \(\mathbb {R}^{n}\).
To bound \(D^{2}f\left( \mathbf {x}\right) \) first let \(\mathbf {x}\in \left[ -1,1\right] ^{n^{2}}\) be arbitrary but fixed, and let v and w be unit vectors witnessing the supremum in the definition of \(f\left( \mathbf {x}\right) \).
Now let \(\left( k,l\right) \) be any index to a matrix entry and choose any \(y\in \left[ -1,1\right] \) such that \(S_{y}^{\left( k,l\right) }f\left( \mathbf {x}\right) \le f\left( \mathbf {x}\right) \). Then
Note that \(f-\inf _{k}f\le 2\). Also
The results of the previous section (rescaling for the lower tail to get \(f-\inf _{k}f\le 1\)) then lead to a concentration inequality independent of the size of the random matrix.
Theorem 31
Let \(\mathbf {X}=\left( X_{ij}\right) _{1\le i,j\le n}\) be a vector of \(n^{2}\) independent random variables with values in \(\left[ -1,1\right] \), and \(\mathbf {X}^{\prime }\) iid to \(\mathbf {X}\). Then for \(t>0\)
and
Observe that the argument depends on the fact that the unit vectors v and w could be fixed independent of k and l. This would not have been possible with the bounded difference inequality. Also note that square matrices were chosen for notational convenience only. The same proof would work for rectangular matrices.
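The dimension-free character of the argument hinges on the fact that, once the witnessing unit vectors v and w are fixed, the squared sensitivities \(\left( v_{k}w_{l}\right) ^{2}\) of \(\left\langle v,Mw\right\rangle \) to the \(n^{2}\) entries sum to \(\left\| v\right\| ^{2}\left\| w\right\| ^{2}=1\), whatever the size n. A short check (plain Python; the dimension is an illustrative choice):

```python
import math, random

random.seed(0)
n = 6

def unit(v):
    s = math.sqrt(sum(x * x for x in v))
    return [x / s for x in v]

v = unit([random.gauss(0, 1) for _ in range(n)])
w = unit([random.gauss(0, 1) for _ in range(n)])

# squared sensitivities of <v, M w> to the n^2 matrix entries:
# they sum to ||v||^2 ||w||^2 = 1, independent of n
total = sum((v[k] * w[l]) ** 2 for k in range(n) for l in range(n))
print(total)
```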
5 Beyond Uniform Bounds
All of the above applications of the entropy method to derive upper tail bounds involved an inequality of the form
where \(\xi \) is some nonnegative real function and G is some operator \(G:\mathcal {A}\rightarrow \mathcal {A}\), which is positively homogeneous of order two. For the bounded difference inequality \(\xi \left( \gamma \right) =\gamma ^{2}/8\) and \(G=R^{2}\), for the Bennett inequality \(\xi \left( \gamma \right) =\gamma e^{\gamma }-e^{\gamma }+1\) and \(G=\Sigma ^{2}\), for Theorem 27 we had \(\xi \left( \gamma \right) =\gamma ^{2}/2\) and \(G=V_{+}^{2}\). Theorem 12 is then used to conclude that
An analogous strategy was employed for the various lower tail bounds.
The uniform estimate \(E_{\gamma f}\left[ G\left( f\right) \right] \le \sup _{\mathbf {x}}G\left( f\right) \left( \mathbf {x}\right) \) in (21), while being very simple, is somewhat loose and can sometimes be avoided by exploiting special properties of the thermal expectation and the function in question.
5.1 Self-boundedness
The first possibility we consider is that the function \(G\left( f\right) \) can be bounded in terms of the function f itself, a property referred to as self-boundedness [8]. For example, if simply \(G\left( f\right) \le f\), then \(E_{\gamma f}\left[ G\left( f\right) \right] \le E_{\gamma f}\left[ f\right] =\left( d/d\gamma \right) \ln Z_{\gamma f}\), and if the function \(\xi \) has some reasonable behavior, then the first integral in (21) above can be bounded by partial integration or even more easily. As an example we apply this idea in the setting of Theorems 27 and 29.
Lemma 32
Suppose that for \(f\in \mathcal {A}\) there are nonnegative numbers a, b such that
(i) \(V_{+}^{2}f\le af+b\). Then for \(0\le \beta <2/a\)
(ii) \(D^{2}f\le af+b\). If in addition \(f-\inf _{k}f\le 1\) for all k, then for \(\beta >0\) and \(a\ge 1\)
Proof
(i) We use (18) and get
where the last identity follows from the fact that \(E_{\gamma f}\left[ f\right] =\left( d/d\gamma \right) \ln Z_{\gamma f}\). Thus
and rearranging this inequality for \(\beta \in \left( 0,2/a\right) \) establishes the claim.
(ii) We use (19)
Rearranging gives
where one verifies that for \(\beta >0\) and \(a\ge 1\) we have \(\psi \left( \beta \right) \left( 1+a\beta ^{-1}\psi \left( \beta \right) \right) ^{-1}\le \beta ^{2}/2\). \(\square \)
The bound in part (i) requires an upper bound on \(\beta \). To proceed we need the following optimization lemma, which will be used several times in the sequel and leads to tail bounds with both sub-Gaussian and subexponential regimes, similar to Bernstein’s inequality.
Lemma 33
Let C and b denote two positive real numbers, \(t>0\). Then
Proof
Let \(h\left( t\right) =1+t-\sqrt{1+2t}\). Then use
so that
Substituting
in the left side of (22) we obtain
where we have used (23). \(\square \)
Theorem 34
Suppose for \(f\in \mathcal {A}\) there are nonnegative numbers a, b such that
(i) \(V_{+}^{2}f\le af+b\). Then for \(t>0\) we have
(ii) \(D^{2}f\le af+b\). If in addition, \(a\ge 1\) and \(f-\inf _{k}f\le 1,\forall k\in \left\{ 1,...,n\right\} \), then
Proof
Part (i) follows from Lemma 32 (i) and Lemma 33. Part (ii) is immediate from Lemma 32 (ii). \(\square \)
Boucheron et al. [8] have given a refined version for the lower tail, where the condition \(a\ge 1\) is relaxed to \(a\ge 1/3\). There they also show that Theorems 34 and 27 together suffice to derive a version of the convex distance inequality which differs from Talagrand’s original result only in that it has an inferior constant in the exponent.
5.2 Convex Lipschitz Functions Revisited
In Sect. 4.3 we gave a sub-Gaussian bound for the upper tail of separately convex Lipschitz functions on \(\left[ 0,1\right] ^{n}\). Now we use self-boundedness to complement this with a sub-Gaussian lower bound, using an elegant trick of Boucheron et al. [6] where the lower bound in Theorem 34 is applied to the square of the Lipschitz function f. The essence of the trick is the following simple lemma.
Lemma 35
If \(f\ge 0\) then \(D^{2}\left( f^{2}\right) \le 4D^{2}\left( f\right) f^{2}\).
Proof
Since \(f\ge 0\) we have \(\inf _{k}\left( f^{2}\right) =\left( \inf _{k}f\right) ^{2}\), so, using \(\left( a+b\right) ^{2}\le 2a^{2}+2b^{2},\)
\(\square \)
For the sub-Gaussian lower bound we need the additional assumption that \(f^{2}\) takes values in an interval of length at most one.
Theorem 36
Let \(\Omega _{k}=\left[ 0,1\right] \) and let \(f\in \mathcal {A}\) be L-Lipschitz, nonnegative and such that \(y\in \left[ 0,1\right] \mapsto S_{y}^{k}f\left( \mathbf {x}\right) \) is convex for all k and all \(\mathbf {x}\), and suppose in addition, that \(f^{2}\) takes values in an interval of length at most one. Then for all \(t\in \left[ 0,E\left[ f\right] \right] \)
Proof
The trick is to study the function \(f^{2}\) instead of f. Let \(\mathbf {x}\in \left[ 0,1\right] ^{n}\). Using separate convexity as in the proof of Theorem 30 we have \(D^{2}f\le L^{2}\), so by the previous lemma \(D^{2}\left( f^{2}\right) \le 4L^{2}f^{2}\). For any k we have \(f^{2}\left( \mathbf {x}\right) -\inf _{k}f^{2}\left( \mathbf {x}\right) \le 1\), so by the lower tail bound of Theorem 34 we get a lower tail bound for \(f^{2}\)
Thus
Here we used \(E\left[ f\right] \le \sqrt{E\left[ f^{2}\right] }\) and the assumption that f is nonnegative in the first inequality. \(\square \)
5.3 Decoupling
A second method to avoid the uniform bound on the thermal expectation uses decoupling. By the duality formula of Theorem 4 we have for any \(f,g\in \mathcal {A}\) and \(\beta \in \mathbb {R}\)
Recall the discussion at the beginning of Sect. 5, where we had a general bound of the form Ent\(_{f}\left( \beta \right) \le \xi \left( \beta \right) E_{\beta f}\left[ G\left( f\right) \right] \). Using (24) we can now obtain for any \(\lambda >0\)
and for values of \(\beta \) and \(\lambda \) where \(\lambda >\xi \left( \beta \right) \) we obtain
Hence, if we can control the moment generating function of \(G\left( f\right) \) (or some suitable bound thereof), we obtain concentration inequalities for f, effectively passing from the thermal measure \(\mu _{\beta f}\) to the thermal measure \(\mu _{\lambda G\left( f\right) }\). The second line shows that in this way the supremum of \(G\left( f\right) \) can possibly be replaced by an expectation. The \(\lambda -\xi \left( \beta \right) \) in the denominator makes some constraint on \(\beta \) necessary, so the improvement comes at the price of an extra or enlarged subexponential term in the resulting concentration inequality. We conclude this chapter with three applications of this trick, which has been proposed in [7].
5.4 Quadratic Forms
As a first illustration we give a version of the Hanson-Wright inequality (Theorem 6.2.1 in [29]) for bounded variables. Let A be a symmetric \(n\times n\)-matrix, which is zero on the diagonal, that is \(A_{ii}=0\) for all i, and suppose that \(X_{1},...,X_{n}\) are independent random variables with values in an interval \(\mathcal {I}\) of unit diameter. We study the random variable \(f\left( \mathbf {X}\right) \), where
As operator G we use \(R^{2}\), the sum of squared conditional ranges which appears in the bounded difference inequality. For the function in question we have
and, since \(\mathcal {I}\) has unit diameter
We can therefore conclude from (12) in the proof of the bounded difference inequality (Theorem 14), that Ent\(_{f}\left( \gamma \right) \le \left( \gamma ^{2}/8\right) E_{\gamma f}\left[ R^{2}\left( f\right) \right] \le \left( \gamma ^{2}/2\right) E_{\gamma f}\left[ \left\| A\mathbf {X}\right\| ^{2}\right] \). But instead of bounding the last thermal expectation by a supremum, as we did before, we now look for concentration properties of the function \(\mathbf {x}\mapsto \left\| A\mathbf {x}\right\| ^{2}\).
By (20) and Lemma 35 we have the self-bounding inequality \(D^{2}\left( \left\| A\mathbf {x}\right\| ^{2}\right) \le 4\left\| A\right\| ^{2}\left\| A\mathbf {x}\right\| ^{2}\) and Lemma 32 gives for \(0\le \lambda <1/\left( 2\left\| A\right\| ^{2}\right) \)
Now let \(0<\gamma <1/\left\| A\right\| \) and set \(\lambda :=\gamma /\left( 2\left\| A\right\| \right) <1/\left( 2\left\| A\right\| ^{2}\right) \). Using the above bound on Ent\(_{f}\left( \gamma \right) \) and the decoupling inequality (24) we get
Collect terms in Ent\(_{f}\left( \gamma \right) \), divide by \(\lambda -\gamma ^{2}/2\) (which is positive by the constraint on \(\gamma \) and the choice of \(\lambda \)) and substitute the value of \(\lambda \) to get
From Theorem 12 we conclude that for \(\beta <1/\left\| A\right\| \)
and using Lemma 33 to minimize the last expression in \(\beta \in \left( 0,1/\left\| A\right\| \right) \) gives our version of the Hanson-Wright inequality for bounded variables.
Theorem 37
Let A be a symmetric \(n\times n\)-matrix, zero on the diagonal, and \(\mathbf {X}=\left( X_{1},...,X_{n}\right) \) a vector of independent random variables with values in an interval \(\mathcal {I}\) of unit diameter. Let \(f:\mathcal {I}^{n}\rightarrow \mathbb {R}\) be defined by \(f\left( \mathbf {x}\right) =\sum _{ij}x_{i}A_{ij}x_{j}\). Then for \(t>0\)
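The key estimate \(R^{2}\left( f\right) \le 4\left\| A\mathbf {x}\right\| ^{2}\) used above can be verified numerically: since A vanishes on the diagonal, f is linear in each coordinate, so the conditional range over a unit interval equals \(2\left| \left( A\mathbf {x}\right) _{k}\right| \) exactly. The sketch below (plain Python; the matrix and the point \(\mathbf {x}\) are random illustrative choices) computes the conditional ranges from the interval endpoints.

```python
import random

random.seed(0)
n = 5
# symmetric matrix with zero diagonal
A = [[0.0] * n for _ in range(n)]
for i in range(n):
    for j in range(i + 1, n):
        A[i][j] = A[j][i] = random.uniform(-1, 1)

def f(x):
    return sum(x[i] * A[i][j] * x[j] for i in range(n) for j in range(n))

x = [random.uniform(-0.5, 0.5) for _ in range(n)]  # interval of unit diameter

Ax = [sum(A[k][j] * x[j] for j in range(n)) for k in range(n)]
ranges = []
for k in range(n):
    lo, hi = x[:], x[:]
    lo[k], hi[k] = -0.5, 0.5
    vals = [f(lo), f(hi)]          # f is linear in x_k since A_kk = 0
    ranges.append(max(vals) - min(vals))

R2 = sum(r * r for r in ranges)    # sum of squared conditional ranges
bound = 4.0 * sum(a * a for a in Ax)
print(R2, bound)  # R2 should not exceed 4 * ||A x||^2
```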
5.5 The Supremum of an Empirical Process
We will now apply the decoupling trick to the upwards tail of the supremum of an empirical process, sharpening the bound obtained in Sect. 3.4.
Theorem 38
Let \(X_{1},...,X_{n}\) be independent with values in some space \(\mathcal {X}\) with \(X_{i}\) distributed as \(\mu _{i}\), and let \(\mathcal {F}\) be an at most countable class of functions \(f:\mathcal {X}\rightarrow \left[ -1,1\right] \) with \(E\left[ f\left( X_{i}\right) \right] =0\). Define \(F:\mathcal {X}^{n}\rightarrow \mathbb {R}\) and \(W:\mathcal {X}^{n}\rightarrow \mathbb {R}\) by
Then for \(t>0\)
This inequality improves over Theorem 12.2 in [6], since by the triangle inequality \(E\left[ W\right] \le \Sigma ^{2}+\sigma ^{2}\) and the constants in the denominator of the exponent are better by a factor of two, and optimal for the variance term.
Proof
Let \(0<\gamma \le \beta <2\). Using Lemma 26 and (24) we get
Rearranging gives
Fix some \(\mathbf {x}\in \mathcal {X}^{n}\) and let \(\hat{f}\in \mathcal {F}\) witness the maximum in the definition of \(F\left( \mathbf {x}\right) \). For \(y\in \mathcal {X}\) we have \(\left( F-S_{y}^{k}F\right) _{+}\le \left( \hat{f}\left( x_{k}\right) -\hat{f}\left( y\right) \right) _{+}\) and by the zero mean assumption
So \(V_{+}^{2}\left( F\right) \le W\). It follows from (26) that
Next we establish self-boundedness of W. Let \(\hat{f}\in \mathcal {F}\) (different from the previous \(\hat{f}\), which we don’t need any more) witness the maximum in the definition of \(W\left( \mathbf {x}\right) \). Then
It therefore follows from the self-bounding lemma, Lemma 32, that
Combining this with (27) gives
From (6) in Theorem 12 we conclude that
Using Lemma 33 it follows that
\(\square \)
5.6 Another Version of Bernstein’s Inequality
A potential weakness of Theorem 21 is the occurrence of the supremum in the definition of the variance parameter \(V=\sup _{\mathbf {x}\in \Omega }\Sigma ^{2}\left( f\right) \left( \mathbf {x}\right) \). If the supremum could be replaced by an expectation, the variance parameter would become the Efron–Stein upper bound \(E\left[ \Sigma ^{2}\left( f\right) \right] \) on the variance \(\sigma ^{2}\left( f\right) \), making the inequality considerably stronger. Such a modification is possible at the expense of enlarging the subexponential term in Bernstein’s inequality. Define the interaction functional
The following theorem is given in [21].
Theorem 39
Suppose \(f\in \mathcal {A}\left( \Omega \right) \) satisfies \(f-E_{k}f\le b\) for all k. Then for all \(t>0\)
Here we will use the tools introduced above to prove a slight strengthening of this result, removing the boundedness conditions above.
Let \(f:\Omega =\prod _{i=1}^{n}\Omega _{i}\rightarrow \mathbb {R}\) and consider the three conditions
Then \(\left( A\right) \implies \left( B\right) \implies \left( C\right) \). The last condition (sometimes called “Bernstein condition” in the literature) is sufficient for the following version of Bernstein’s inequality, which extends Theorem 2.10 in [6] from sums to general functions and replaces the one-sided boundedness requirement of Theorem 39 by the Bernstein condition.
Theorem 40
Let \(f:\Omega =\prod _{i=1}^{n}\Omega _{i}\rightarrow \mathbb {R}\) be measurable and suppose that \(\left( C\right) \) holds. Then for \(t>0\)
The first step is to bound the entropy of f under the condition (C), thus replacing Lemma 19 in the proof of Theorem 21.
Lemma 41
Suppose \(\left( C\right) \) holds with \(b=1\). Then for all \(\beta \in \left[ 0,1\right) \)
Proof
First we get from the variational property of the variance that
where we used Jensen’s inequality to get \(E_{k}\left[ \exp \left( \beta \left( f-E_{k}f\right) \right) \right] \ge 1\) for the second inequality. From monotone convergence and \(\left( C\right) \) we then get
Thus from Theorem 12
\(\square \)
At this point we could bound the thermal expectation \(E_{\beta f}\left[ \Sigma ^{2}\left( f\right) \right] \) by a supremum and proceed along the usual path to obtain a version of Theorem 21 under condition (C), which, for sums of independent variables, would reduce to Theorem 2.10 in [6]. Instead we wish to exploit the decoupling idea and look for concentration properties of \(\Sigma ^{2}\left( f\right) \).
The crucial property of the interaction functional J is that \(J^{2}\) serves as a self-bound for \(\Sigma ^{2}\left( f\right) \). The following lemma is also the key to the proof of Theorem 39.
Lemma 42
We have \(D^{2}\left( \Sigma ^{2}\left( f\right) \right) \le J\left( f\right) ^{2}~\Sigma ^{2}\left( f\right) \) for any \(f\in \mathcal {A}\left( \Omega \right) \).
Proof
Fix \(\mathbf {x}\in \Omega \). Below all members of \(\mathcal {A}\) are understood as evaluated on \(\mathbf {x}\). For \(l\in \left\{ 1,...,n\right\} \) let \(z_{l}\in \Omega _{l}\) be a minimizer in z of \(S_{z}^{l}\Sigma ^{2}\left( f\right) \). Then
The sum is restricted to \(k\ne l\), since \(\sigma _{l}^{2}\left( f\right) \in \mathcal {A}_{l}\), so \(S_{z_{l}}^{l}\sigma _{l}^{2}\left( f\right) =\sigma _{l}^{2}\left( f\right) \) and the term with \(k=l\) vanishes. Then, using \(2\sigma _{k}^{2}\left( f\right) =E_{\left( y,y^{\prime }\right) \sim \mu _{k}^{2}}\left( D_{y,y^{\prime }}^{k}f\right) ^{2}\), we get
by an application of Cauchy–Schwarz. Now, using \(\left( a+b\right) ^{2}\le 2a^{2}+2b^{2}\), we can bound the last sum independent of l by
so that
\(\square \)
Now we can use decoupling to put these pieces together.
Proof of Theorem 40
By rescaling it suffices to prove the result for \(b=1\). We can also assume \(J:=J\left( f\right) >0\). Let \(0<\gamma \le \beta <1/\left( 1+J/2\right) \) and set \(\theta =\gamma /\left( J\left( 1-\gamma \right) \right) \). Then \(\gamma ^{2}/\left( 2\left( 1-\gamma \right) ^{2}\right)<\theta <2/J^{2}\). By Lemma 41
where the second inequality follows from the decoupling inequality (24). Subtract \(\gamma ^{2}/\left( 2\left( 1-\gamma \right) ^{2}\right) \)Ent\(_{f}\left( \gamma \right) \) to get
Since \(\gamma ^{2}/\left( 2\left( 1-\gamma \right) ^{2}\right) <\theta \) this simplifies, using the value of \(\theta \), to
On the other hand \(\theta <2/J^{2}\), so the self-boundedness of \(\Sigma ^{2}\left( f\right) \) (Lemma 42) and part (i) of Lemma 32 give
Combining (28) and (29) to get a bound on \(S_{f}\left( \gamma \right) \) gives
and from Theorem 12 and Lemma 33
\(\square \)
To use Theorem 40 one has to bound b and J. For the latter it is often sufficient to use the simple bound
which can be obtained from Lemma 13.
We conclude with an application to U-statistics. Let \(m<n\) be integers, \(\Omega _{i}=\mathcal {X}\) and \(\kappa :\mathcal {X}^{m}\rightarrow \mathbb {R}\) a symmetric kernel. For a subset of indices with cardinality m, \(S=\left\{ j_{1},...,j_{m}\right\} \subseteq \left\{ 1,...,n\right\} \) define \(\kappa _{S}:\mathcal {X}^{n}\rightarrow \mathbb {R}\) by \(\kappa _{S}\left( \mathbf {x}\right) =\kappa \left( x_{j_{1}},...,x_{j_{m}}\right) \). The U-statistic of order m induced by \(\kappa \) is then the function \(U:\mathcal {X}^{n}\rightarrow \mathbb {R}\) given by
U-statistics were introduced by Hoeffding [15]. Their importance stems from the fact that for iid \(\mathbf {X}=\left( X_{1},...,X_{n}\right) \) the random variable \(U\left( \mathbf {X}\right) \) is an unbiased estimator for \(E\left[ \kappa \left( X_{1},...,X_{m}\right) \right] \). Starting with the work of Hoeffding there has been a lot of work on concentration inequalities for U-statistics. To simplify the presentation we will not use the advantage of Theorem 40 over Theorem 39 and assume the kernel to be bounded, \(\kappa :\mathcal {X}^{m}\rightarrow \left[ 0,1\right] \).
Notice that, if \(k\notin S\), then \(\kappa _{S}\in \mathcal {A}_{k}\), so \(\kappa _{S}\left( \mathbf {x}\right) -E_{k}\left[ \kappa _{S}\left( \mathbf {x}\right) \right] =0\) and thus
so we can set the quantity b in Theorem 40 to m/n. To bound J use (30) to get
Substitution in Theorem 40 gives for \(t>0\)
It can be shown (see, e.g., [21], Houdré [17]) that in general \(E\left[ \Sigma ^{2}\left( f\right) \right] \le \sigma ^{2}\left( f\right) +J^{2}\left( f\right) /4\), so that for U-statistics the Efron–Stein inequality is tight in the sense that \(E\left[ \Sigma ^{2}\left( U\right) \right] \le \sigma ^{2}\left( U\right) +m^{4}/n^{2}\). It follows that for deviations \(t>1/n\)
This inequality can be compared to the classical work of Hoeffding [15] and the more recent results of Arcones [2], both of which consider undecoupled, nondegenerate U-statistics of arbitrary order. Hoeffding [15] does not have the correct variance term, while [2] gives the correct variance term but severely overestimates the subexponential coefficient in Bernstein’s inequality, which there is exponential in the degree m of the U-statistic (above it is only of order \(m^{2}\)). This exponential dependence on m results from the use of the decoupling inequalities in [24] and seems to beset most works on U-statistics of higher order (e.g., [1, 13]), which in many other ways improve over our simple inequality above.
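A quick Monte Carlo sanity check (our own, with the illustrative bounded kernel \(\kappa \left( x,y\right) =xy\) on \(\left[ 0,1\right] ^{2}\), for which \(E\left[ \kappa \left( X_{1},X_{2}\right) \right] =1/4\)) shows the typical deviation of \(U\left( \mathbf {X}\right) \) from its mean shrinking as n grows, in line with the concentration bound above:

```python
import random
from itertools import combinations

def u_statistic(kappa, xs, m):
    """Average of the symmetric kernel kappa over all m-subsets of xs."""
    subsets = list(combinations(xs, m))
    return sum(kappa(*s) for s in subsets) / len(subsets)

# Illustrative bounded kernel on [0,1]^2; for uniform X, E[kappa] = (1/2)^2 = 0.25
kappa = lambda x, y: x * y

def mean_abs_deviation(n, reps=200, seed=0):
    """Empirical mean of |U(X) - E[kappa]| over independent samples of size n."""
    rng = random.Random(seed)
    devs = [abs(u_statistic(kappa, [rng.random() for _ in range(n)], 2) - 0.25)
            for _ in range(reps)]
    return sum(devs) / reps

# Typical deviations shrink roughly like 1/sqrt(n)
print(mean_abs_deviation(10), mean_abs_deviation(40))
```

The decay of the observed deviations reflects the variance term \(\sigma ^{2}\left( U\right) \) dominating for moderate t, as in the Bernstein-type bound above.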
6 Appendix I. Table of Notation
General notation | |
---|---|
\(\Omega =\prod _{k=1}^{n}\Omega _{k}\) | underlying (product-) probability space |
\(\mathcal {A}\) | bounded measurable functions on \(\Omega \) |
\(\mu =\otimes _{k=1}^{n}\mu _{k}\) | (product-) probability measure on \(\Omega \) |
\(X_{k}\) | random variable distributed as \(\mu _{k}\) in \(\Omega _{k}\) |
\(f\in \mathcal {A}\) | fixed function under investigation |
\(g\in \mathcal {A}\) | generic function |
\(E\left[ g\right] =\int _{\Omega }gd\mu \) | expectation of g in \(\mu \) |
\(\sigma ^{2}\left[ g\right] =E\left[ \left( g-E\left[ g\right] \right) ^{2}\right] \) | variance of g in \(\mu \) |
Notation for the entropy method | |
---|---|
\(\beta =1/T\) | inverse temperature |
\(E_{\beta f}\left[ g\right] =E\left[ ge^{\beta f}\right] /E\left[ e^{\beta f}\right] \) | thermal expectation of g |
\(Z_{\beta f}=E\left[ e^{\beta f}\right] \) | partition function |
\(d\mu _{\beta f}=Z_{\beta f}^{-1}~e^{\beta f}d\mu \) | thermal measure (canonical ensemble) |
Ent\(_{f}\left( \beta \right) =\beta E_{\beta f}\left[ f\right] -\ln Z_{\beta f}\) | (canonical) entropy |
\(A_{f}\left( \beta \right) =\frac{1}{\beta }\ln Z_{\beta f}\) | free energy |
\(\sigma _{\beta f}^{2}\left( g\right) =E_{\beta f}\left[ \left( g-E_{\beta f}\left[ g\right] \right) ^{2}\right] \) | thermal variance of g |
\(\psi \left( t\right) =e^{t}-t-1\) | |
\(S_{y}^{k}F\left( \mathbf {x}\right) =F\left( x_{1},...,x_{k-1},y,x_{k+1},...,x_{n}\right) \) | substitution operator |
\(E_{k}\left[ g\right] \left( \mathbf {x}\right) =\int _{\Omega _{k}}S_{y}^{k}g~d\mu _{k}\left( y\right) \) | conditional expectation |
\(\mathcal {A}_{k}\subset \mathcal {A}\) | functions independent of k-th variable |
\(Z_{k,\beta f}=E_{k}\left[ e^{\beta f}\right] \) | conditional partition function |
\(E_{k,\beta f}\left[ g\right] =Z_{k,\beta f}^{-1}E_{k}\left[ ge^{\beta f}\right] \) | conditional thermal expectation |
Ent\(_{k,f}\left( \beta \right) =\beta E_{k,\beta f}\left[ f\right] -\ln Z_{k,\beta f}\) | conditional entropy |
\(\sigma _{k,\beta f}^{2}\left[ g\right] =E_{k,\beta f}\left[ \left( g-E_{k,\beta f}\left[ g\right] \right) ^{2}\right] \) | conditional thermal variance |
\(\sigma _{k}^{2}\left[ g\right] =E_{k}\left[ \left( g-E_{k}\left[ g\right] \right) ^{2}\right] \) | conditional variance |
Operators on \(\mathcal {A}\) | |
---|---|
\(D_{y,y^{\prime }}^{k}g=S_{y}^{k}g-S_{y^{\prime }}^{k}g\) | difference operator |
\(r_{k}\left( g\right) =\sup _{y,y^{\prime }\in \Omega _{k}}D_{y,y^{\prime }}^{k}g\) | conditional range operator |
\(R^{2}\left( g\right) =\sum _{k}r_{k}^{2}\left( g\right) \) | sum of conditional square ranges |
\(\Sigma ^{2}\left( g\right) =\sum _{k}\sigma _{k}^{2}\left[ g\right] \) | sum of conditional variances |
\(\left( \inf _{k}g\right) \left( \mathbf {x}\right) =\inf _{y\in \Omega _{k}}S_{y}^{k}g\left( \mathbf {x}\right) \) | conditional infimum operator |
\(V_{+}^{2}g=\sum _{k}E_{y\sim \mu _{k}}\left[ \left( \left( g-S_{y}^{k}g\right) _{+}\right) ^{2}\right] \) | Efron–Stein variance proxy |
\(D^{2}g=\sum _{k}\left( g-\inf _{k}g\right) ^{2}\) | worst case variance proxy |
References
Adamczak, R., et al.: Moment inequalities for U-statistics. Ann. Probab. 34(6), 2288–2314 (2006)
Arcones, M.A.: A Bernstein-type inequality for U-statistics and U-processes. Stat. Probab. Lett. 22(3), 239–247 (1995)
Bartlett, P., Mendelson, S.: Rademacher and Gaussian complexities: risk bounds and structural results. J. Mach. Learn. Res. 3, 463–482 (2002)
Bernstein, S.: Theory of Probability. Moscow (1927)
Boltzmann, L.: Über die Beziehung zwischen dem zweiten Hauptsatze des mechanischen Wärmetheorie und der Wahrscheinlichkeitsrechnung, respective den Sätzen über das Wärmegleichgewicht. Kk Hof-und Staatsdruckerei (1877)
Boucheron, S., Lugosi, G., Massart, P.: Concentration Inequalities. Oxford University Press, Oxford (2013)
Boucheron, S., Lugosi, G., Massart, P., et al.: Concentration inequalities using the entropy method. Ann. Probab. 31(3), 1583–1614 (2003)
Boucheron, S., Lugosi, G., Massart, P., et al.: On concentration of self-bounding functions. Electron. J. Probab. 14, 1884–1899 (2009)
Bousquet, O.: A Bennett concentration inequality and its application to suprema of empirical processes. C.R. Math. 334(6), 495–500 (2002)
Chatterjee, S.: Concentration inequalities with exchangeable pairs. Ph.D. thesis, Stanford University (2005)
Chebyshev, P.L.: Sur les valeurs limites des intégrales. Imprimerie de Gauthier-Villars (1874)
Gibbs, J.W.: Elementary Principles in Statistical Mechanics: Developed with Especial Reference to the Rational Foundations of Thermodynamics. C. Scribner’s Sons, New York (1902)
Giné, E., Latała, R., Zinn, J.: Exponential and moment inequalities for U-statistics. High Dimensional Probability II, pp. 13–38. Springer, Berlin (2000)
Gross, L.: Logarithmic Sobolev inequalities. Am. J. Math. 97(4), 1061–1083 (1975)
Hoeffding, W.: A class of statistics with asymptotically normal distribution. Ann. Math. Stat. 19(3), 293–325 (1948)
Hoeffding, W.: Probability inequalities for sums of bounded random variables. J. Am. Stat. Assoc. 58(301), 13–30 (1963)
Houdré, C.: The iterated jackknife estimate of variance. Statist. Probab. Lett. 35(2), 197–201 (1997)
Ledoux, M.: The Concentration of Measure Phenomenon, vol. 89. American Mathematical Society, Providence (2001)
Lieb, E.H.: Some convexity and subadditivity properties of entropy. Inequalities, pp. 67–79. Springer, Berlin (2002)
Massart, P., et al.: About the constants in Talagrand’s concentration inequalities for empirical processes. Ann. Probab. 28(2), 863–884 (2000)
Maurer, A., et al.: A Bernstein-type inequality for functions of bounded interaction. Bernoulli 25(2), 1451–1471 (2019)
McAllester, D., Ortiz, L.: Concentration inequalities for the missing mass and for histogram rule error. J. Mach. Learn. Res. 4(Oct), 895–911 (2003)
McDiarmid, C.: Concentration. Probabilistic Methods of Algorithmic Discrete Mathematics, pp. 195–248. Springer, Berlin (1998)
de la Peña, V.H.: Decoupling and Khintchine’s inequalities for U-statistics. Ann. Probab., pp. 1877–1892 (1992)
Popper, K.R.: Logik der Forschung (1934). English edition: The Logic of Scientific Discovery (1968)
Steele, J.M.: An Efron-Stein inequality for nonsymmetric statistics. Ann. Probab., pp. 753–758 (1986)
Talagrand, M.: Concentration of measure and isoperimetric inequalities in product spaces. Publications Mathématiques de l’Institut des Hautes Etudes Scientifiques 81(1), 73–205 (1995)
Talagrand, M.: A new look at independence. Ann. Probab., pp. 1–34 (1996)
Vershynin, R.: High-dimensional Probability: An Introduction with Applications in Data Science, vol. 47. Cambridge University Press, Cambridge (2018)
Copyright information
© 2021 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this chapter
Maurer, A. (2021). Entropy and Concentration. In: De Mari, F., De Vito, E. (eds) Harmonic and Applied Analysis. Applied and Numerical Harmonic Analysis. Birkhäuser, Cham. https://doi.org/10.1007/978-3-030-86664-8_2
Print ISBN: 978-3-030-86663-1
Online ISBN: 978-3-030-86664-8