Abstract
We prove extensions of classical concentration inequalities for random variables that have α-subexponential tail decay for any α ∈ (0, 2]. This includes Hanson–Wright-type and convex concentration inequalities in various situations. In particular, we show uniform Hanson–Wright inequalities and convex concentration results for simple random tensors in the spirit of recent work by Klochkov–Zhivotovskiy (Electron J Probab 25(22):30, 2020) and Vershynin (Bernoulli 26(4):3139–3162, 2020).
Keywords
- Concentration of measure phenomenon
- Orlicz norms
- Subexponential random variables
- Hanson–Wright inequality
- Convex concentration
2020 Mathematics Subject Classification:
1 Introduction
The aim of this note is to compile a number of smaller results that extend some classical as well as more recent concentration inequalities for bounded or sub-Gaussian random variables to random variables with heavier (but still exponential type) tails. In detail, we shall consider random variables X that satisfy

$$\displaystyle \mathbb {P}(\lvert X \rvert \ge t) \le 2\exp \big (-(t/C_{1,\alpha })^\alpha \big ) \qquad \qquad (1.1)$$

for any t ≥ 0, some α ∈ (0, 2], and a suitable constant C1,α > 0. Such random variables are sometimes called α-subexponential (for α = 2, they are sub-Gaussian) or sub-Weibull(α) (cf. [23, Definition 2.2]).
There are several equivalent reformulations of (1.1), e. g., in terms of Lp norms:

$$\displaystyle \lVert X \rVert _{L^p} \le C_{2,\alpha } p^{1/\alpha } \qquad \qquad (1.2)$$

for any p ≥ 1. Another characterization is that these random variables have finite Orlicz norms of order α, i. e.,

$$\displaystyle \lVert X \rVert _{\Psi _\alpha } := \inf \{ t > 0 : \mathbb {E}\exp \big ((\lvert X \rvert /t)^\alpha \big ) \le 2 \} \le C_{3,\alpha }. \qquad \qquad (1.3)$$
If α < 1, \(\lVert \cdot \rVert _{\Psi _\alpha }\) is actually a quasi-norm; however, many norm-like properties (such as a triangle-type inequality) can nevertheless be recovered up to α-dependent constants (see, e. g., [12, Appendix A]). In fact, C1,α, C2,α, and C3,α can be chosen such that they only differ by a constant α-dependent factor.
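Although not part of the mathematical development, the Orlicz norm with the defining condition \(\mathbb {E}\exp ((\lvert X \rvert /t)^\alpha ) \le 2\) can be approximated numerically for a given sample, which is handy for sanity checks. The following Python sketch (an illustration, not part of the original text) estimates it by bisection; for symmetrized standard exponential data and α = 1, the true value is 2, since \(\mathbb {E}\exp (\lvert X \rvert /t) = t/(t-1) \le 2\) exactly when t ≥ 2.

```python
import numpy as np

def orlicz_norm(samples, alpha, iters=200):
    """Empirical Orlicz (quasi-)norm: smallest t with E exp((|X|/t)^alpha) <= 2."""
    x = np.abs(np.asarray(samples, dtype=float))

    def psi(t):
        with np.errstate(over="ignore"):        # tiny t overflows to inf, which is fine here
            return np.mean(np.exp((x / t) ** alpha))

    lo, hi = 1e-9, 100.0 * max(x.max(), 1e-9)   # psi(hi) is close to 1 < 2, so hi is feasible
    for _ in range(iters):                      # bisection: psi is decreasing in t
        mid = 0.5 * (lo + hi)
        lo, hi = (lo, mid) if psi(mid) <= 2.0 else (mid, hi)
    return hi

rng = np.random.default_rng(0)
sample = rng.choice([-1.0, 1.0], 50_000) * rng.exponential(size=50_000)
est = orlicz_norm(sample, alpha=1.0)            # true Psi_1 norm of this law is 2
```
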
Note that α-subexponential random variables have log-convex (if α ≤ 1) or log-concave (if α ≥ 1) tails, i. e., \(t \mapsto - \log \mathbb {P}(\lvert {X}\rvert \ge t)\) is convex or concave, respectively. For log-convex or log-concave measures, two-sided Lp norm estimates for polynomial chaos (and as a consequence, concentration bounds) have been established over the last 25 years. In the log-convex case, results of this type have been derived for linear forms in [17] and for forms of any order in [12, 21]. For log-concave measures, starting with linear forms again in [10], important contributions have been made in [3, 24, 25, 27].
In this note, we mainly present four different results for functions of α-subexponential random variables: a Hanson–Wright-type inequality in Sect. 2, a version of the convex concentration inequality in Sect. 3, a uniform Hanson–Wright inequality in Sect. 4, and finally a convex concentration inequality for simple random tensors in Sect. 5. These results are partly based on and generalize recent research, e. g., [20] and [42]. In fact, they partially build upon each other: for instance, in the proofs of Sect. 5, we apply results both from Sects. 2 and 3. A more detailed discussion is provided in each of the sections.
Finally, let us introduce some conventions that we will use in this chapter.
Notations.
If X1, …, Xn is a sequence of random variables, we denote by X = (X1, …, Xn) the corresponding random vector. Moreover, we shall need the following types of norms throughout the paper:
- The norms \(\lVert x \rVert _p := (\sum _{i=1}^n\lvert {x_i}\rvert ^p)^{1/p}\) for \(x \in \mathbb {R}^n\)
- The Lp norms \(\lVert X \rVert _{L^p} := (\mathbb {E}|X|{ }^p)^{1/p}\) for random variables X (cf. (1.2))
- The Orlicz (quasi-)norms \(\lVert X \rVert _{\Psi _\alpha }\) as introduced in (1.3)
- The Hilbert–Schmidt and operator norms \(\lVert A \rVert _{\mathrm {HS}} := (\sum _{i,j} a_{ij}^2)^{1/2}\), \(\lVert A \rVert _{\mathrm {op}} := \sup \{ \lVert Ax \rVert _2 : \lVert x \rVert _2 = 1\}\) for matrices A = (aij)
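For concreteness, the two matrix norms in the last item can be computed directly, e. g., in Python with numpy (a small sketch, not part of the original text); note that \(\lVert A \rVert _{\mathrm {op}} \le \lVert A \rVert _{\mathrm {HS}}\) always holds.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 5))
A = (A + A.T) / 2                      # symmetric matrix, as considered in Sect. 2
hs = np.linalg.norm(A, "fro")          # Hilbert–Schmidt (Frobenius) norm
op = np.linalg.norm(A, 2)              # operator norm = largest singular value
```
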
The constants appearing in this chapter (typically denoted C or c) may vary from line to line. Without subscript, they are assumed to be absolute, and if they depend on α (only), we shall write Cα or cα.
2 A Generalized Hanson–Wright Inequality
Arguably, the most famous concentration result for quadratic forms is the Hanson–Wright inequality, which first appeared in [16]. We may state it as follows: assuming X1, …, Xn are centered, independent random variables satisfying \(\lVert X_i \rVert _{\Psi _2} \le K\) for any i and A = (aij) is a symmetric matrix, we have for any t ≥ 0

$$\displaystyle \mathbb {P}\big (\lvert X^TAX - \mathbb {E}X^TAX \rvert \ge t\big ) \le 2\exp \Big (-c\min \Big (\frac {t^2}{K^4\lVert A \rVert _{\mathrm {HS}}^2}, \frac {t}{K^2\lVert A \rVert _{\mathrm {op}}}\Big )\Big ).$$
For a modern proof, see [33], and for various developments, cf. [2, 4, 18, 43].
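As a purely numerical illustration of the Hanson–Wright phenomenon (a sketch under the assumption of Rademacher entries, which are sub-Gaussian with K of order one), one can check that \(\mathbb {E}X^TAX = \mathrm {trace}(A)\) for centered unit-variance entries and that the tails of the quadratic form decay rapidly on the scale of its standard deviation:

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps = 30, 20_000
A = rng.standard_normal((n, n))
A = (A + A.T) / 2                                        # symmetric matrix
X = rng.choice([-1.0, 1.0], size=(reps, n))              # Rademacher vectors
q = np.einsum("ri,ij,rj->r", X, A, X)                    # quadratic forms X^T A X
dev = np.abs(q - np.trace(A))                            # E X^T A X = trace(A) here
# Var(X^T A X) = 2 * sum_{i != j} a_ij^2 for Rademacher entries
sd = np.sqrt(2 * ((A**2).sum() - (np.diag(A)**2).sum()))
tails = [float(np.mean(dev >= k * sd)) for k in (1, 2, 3)]
```
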
In this note, we provide an extension of the Hanson–Wright inequality to random variables with bounded Orlicz norms of any order α ∈ (0, 2]. This complements the results in [12], where the case of α ∈ (0, 1] was considered, while for α = 2, we get back the actual Hanson–Wright inequality.
Theorem 21
For any α ∈ (0, 2], let X1, …, Xn be independent, centered random variables such that \(\lVert X_i \rVert _{\Psi _\alpha } \le K\) for any i and A = (aij) be a symmetric matrix. Then, for any t ≥ 0,

$$\displaystyle \mathbb {P}\big (\lvert X^TAX - \mathbb {E}X^TAX \rvert \ge t\big ) \le 2\exp \Big (-\frac {1}{C_\alpha }\min \Big (\frac {t^2}{K^4\lVert A \rVert _{\mathrm {HS}}^2}, \Big (\frac {t}{K^2\lVert A \rVert _{\mathrm {op}}}\Big )^{\alpha /2}\Big )\Big ).$$
Theorem 21 generalizes and implies a number of inequalities for quadratic forms in α-subexponential random variables (in particular for α = 1) that are spread throughout the literature. For a detailed discussion, see [12, Remark 1.7]. Note that it is possible to sharpen the tail estimate given by Theorem 21, cf., e. g., [12, Corollary 1.4] for α ∈ (0, 1] or [3, Theorem 3.2] for α ∈ [1, 2] (in fact, the proof of Theorem 21 works by evaluating the family of norms used therein). The main benefit of Theorem 21 is that it uses norms that are easily calculable and in many situations already sufficient for applications.
Before we give the proof of Theorem 21, let us briefly mention that for the standard Hanson–Wright inequality, a number of selected applications can be found in [33]. Some of them were generalized to α-subexponential random variables with α ≤ 1 in [12], and it is no problem to extend these proofs to any order α ∈ (0, 2] using Theorem 21. Here, we just focus on a single example that yields a concentration result for the Euclidean norm of a linear transformation of a vector X having independent components with bounded Orlicz norms around the Hilbert–Schmidt norm of the transformation matrix. This is a variant and extension of [12, Proposition 2.1] and will be applied in Sect. 5.
Proposition 22
Let X1, …, Xn be independent, centered random variables such that \({\mathbb {E}} X_i^2 = 1\) and \(\lVert X_i \rVert _{\Psi _\alpha } \le K\) for some α ∈ (0, 2] and let B ≠ 0 be an m × n matrix. For any t ≥ 0, we have
In particular, for any t ≥ 0, it holds
For the proofs, let us recall some elementary relations that we will use throughout the paper to adjust the constants in the tail bounds we derive.
Adjusting constants.
For any two constants C1 > C2 > 1, we have for all r ≥ 0 and C > 0
whenever the left-hand side is smaller than or equal to 1 (cf., e. g., [35, Eq. (3.1)]). Moreover, for any α ∈ (0, 2), any γ > 0, and all t ≥ 0, we may always estimate
using \(\exp (-s^2) \le \exp (1-s^\alpha )\) for any s > 0 and (2.3). More precisely, we may choose \(C' := C/\log ^{1/\alpha }(2)\). Note that strictly speaking, the range of t∕C ≤ 1 is not covered by (2.3); however, in this case (in particular, choosing C′ as suggested), both sides of (2.4) are at least 1 anyway so that the right-hand side still provides a valid upper bound for any probability.
Let us now turn to the proof of Theorem 21. In what follows, we actually show that for any p ≥ 2,
From here, Theorem 21 follows by standard means (cf. [34, Proof of Theorem 3.6]). Moreover, we may restrict ourselves to α ∈ (1, 2], since the case of α ∈ (0, 1] has been proven in [12].
Proof of Theorem 21
First we shall treat the off-diagonal part of the quadratic form. Let \(w^{(1)}_i, w^{(2)}_i\) be independent (of each other as well as of the Xi) symmetrized Weibull random variables with scale 1 and shape α, i. e., \(w^{(j)}_i\) are symmetric random variables with \(\mathbb {P}(\lvert {w^{(j)}_i}\rvert \ge t) = \exp (-t^\alpha )\). In particular, for α ∈ [1, 2], the \(w^{(j)}_i\) have logarithmically concave tails.
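Such symmetrized Weibull variables are easy to simulate: if E ~ Exp(1), then \(E^{1/\alpha }\) has tail \(\exp (-t^\alpha )\), so multiplying by an independent random sign yields the required distribution. A minimal sketch (illustrative only):

```python
import numpy as np

def sym_weibull(alpha, size, rng):
    """Symmetric W with P(|W| >= t) = exp(-t**alpha): set |W| = E**(1/alpha), E ~ Exp(1)."""
    sign = rng.choice([-1.0, 1.0], size=size)
    return sign * rng.exponential(size=size) ** (1.0 / alpha)

rng = np.random.default_rng(3)
w = sym_weibull(1.5, 100_000, rng)
emp = float(np.mean(np.abs(w) >= 1.0))    # equals exp(-1) in the limit, for any alpha
```
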
Using standard decoupling and symmetrization arguments (cf. [8, Theorem 3.1.1 & Lemma 1.2.6]) as well as [3, Theorem 3.2] in the second inequality, for any p ≥ 2, it holds
where the norms \(\lVert A \rVert _{\mathcal {J},p}^{\mathcal {N}}\) are defined as in [3]. Instead of repeating the general definitions, we will only focus on the case we need in our situation. Indeed, for the symmetric Weibull distribution with parameter α, we have (again, in the notation of [3]) N(t) = tα, and so for α ∈ (1, 2], it follows that \(\hat {N}(t) = \min (t^2, \lvert {t}\rvert ^\alpha )\). Hence, the norms can be written as follows:
Before continuing with the proof, we next introduce a lemma that will help to rewrite the norms in a more tractable form. □
Lemma 23
For any p ≥ 2, define
Then I1(p) = I2(p).
Proof
The inclusion I1(p) ⊇ I2(p) is an easy calculation, and the inclusion I1(p) ⊆ I2(p) follows by defining \(z_i = \lVert (x_{ij})_j \rVert \) and \(y_{ij} = x_{ij} / \lVert (x_{ij})_j \rVert \) (or 0, if the norm is zero). □
Proof of Theorem 21, continued
For brevity, for any matrix A = (aij), let us write \(\lVert A \rVert _m := \max _{i = 1,\ldots ,n} ( \sum _{j = 1}^n a_{ij}^2 )^{1/2}\). Note that clearly, \(\lVert A \rVert _m \le \lVert A \rVert _{\mathrm {op}}\), since the i-th row norm equals \(\lVert A^T e_i \rVert _2 \le \lVert A^T \rVert _{\mathrm {op}} = \lVert A \rVert _{\mathrm {op}}\).
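A quick numerical sanity check of the inequality \(\lVert A \rVert _m \le \lVert A \rVert _{\mathrm {op}}\) (an illustrative sketch, not part of the proof):

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((8, 8))
A = (A + A.T) / 2                                  # symmetric, as in the proof
row_max = np.sqrt((A**2).sum(axis=1)).max()        # ||A||_m: largest Euclidean row norm
op = np.linalg.norm(A, 2)                          # operator norm
```
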
Now, fix some vector \(z \in {\mathbb {R}}^n\) such that \(\sum _{i = 1}^n \min (\lvert z_i \rvert ^\alpha , z_i^2) \le p\). The condition also implies
where in the second step we used α ∈ [1, 2] to estimate \(\lvert z_i \rvert ^\alpha \mathbbm {1}_{\{\lvert z_i \rvert > 1\}} \ge \lvert z_i \rvert \mathbbm {1}_{\{\lvert z_i \rvert > 1\}}\). So, given any z and y satisfying the conditions of I2(p), we can write
So, this yields
As for \(\lVert A \rVert _{\{\{1\},\{2\}\}, p}^{\mathcal {N}}\), we can use the decomposition z = z1 + z2, where \((z_1)_i = z_i \mathbbm {1}_{\{\lvert z_i \rvert > 1\}}\) and z2 = z − z1, and obtain
(in the braces, the conditions \(\lVert \cdot \rVert _\beta \le p^{1/\beta }\) have been replaced by \(\lVert \cdot \rVert _\beta \le 1\)). Clearly, since \(\lVert x_1 \rVert _\alpha \le 1\) implies \(\lVert x_1 \rVert _2 \le 1\) (and the same for y1), all of the norms can be upper bounded by \(\lVert A \rVert _{\mathrm {op}}\), i. e., we have
where the last inequality follows from p ≥ 2 and 1∕2 ≤ 1∕α ≤ 1 ≤ (α + 2)∕(2α) ≤ 2∕α.
Combining the estimates (2.6), (2.7), and (2.8) yields
To treat the diagonal terms, we use Corollary 6.1 in [12], as \(X_i^2\) are independent and satisfy \(\lVert X_i^2 \rVert _{\Psi _{\alpha /2}} \le K^2\), so that it yields
Now it is clear that \(\max _{i = 1,\ldots , n} \lvert a_{ii} \rvert \le \lVert A \rVert _{\mathrm {op}}\) and \(\sum _{i = 1}^n a_{ii}^2 \le \lVert A \rVert _{\mathrm {HS}}^2\). In particular,
The claim (2.5) now follows from Minkowski’s inequality. □
Finally, we prove Proposition 22.
Proof of Proposition 22
It suffices to prove (2.1) for matrices satisfying \(\lVert B \rVert _{\mathrm {HS}} = 1\), as otherwise we set \(\widetilde {B} = B \lVert B \rVert _{\mathrm {HS}}^{-1}\) and use the equality
Now let us apply Theorem 21 to the matrix A := BTB. An easy calculation shows that \(\mathrm {trace}(A) = \mathrm {trace}(B^T B) = \lVert B \rVert _{\mathrm {HS}}^2 = 1\), so that we have for any t ≥ 0
Here, the first step follows from \(\lvert z - 1 \rvert \le \min ( \lvert z^2 -1 \rvert , \lvert z^2 - 1 \rvert ^{1/2})\) for z ≥ 0. In the second step, we have used the estimates \(\lVert A \rVert _{\mathrm {HS}}^2 \le \lVert B \rVert _{\mathrm {op}}^2 \lVert B \rVert _{\mathrm {HS}}^2 = \lVert B \rVert _{\mathrm {op}}^2\) and \(\lVert A \rVert _{\mathrm {op}} \le \lVert B \rVert _{\mathrm {op}}^2\), as well as the fact that K ≥ Cα > 0 since \(\mathbb {E}X_i^2 = 1\) (cf., e. g., [12, Lemma A.2]). The last step follows from (2.4) and (2.3). Setting \(t = K^2s\lVert B \rVert _{\mathrm {op}}\) for s ≥ 0 finishes the proof of (2.1). Finally, (2.2) follows by taking m = n and B = I. □
3 Convex Concentration for Random Variables with Bounded Orlicz Norms
Assume X1, …, Xn are independent random variables each taking values in some bounded interval [a, b]. Then, by convex concentration as established in [19, 29, 38], for every convex 1-Lipschitz function \(f : [a,b]^n \to \mathbb {R}\),
for any t ≥ 0 (see, e. g., [36, Corollary 3]).
While convex concentration for bounded random variables is by now standard, there is less literature for unbounded random variables. In [31], a martingale-type approach is used, leading to a result for functionals with stochastically bounded increments. The special case of suprema of unbounded empirical processes was treated in [1, 28, 40]. Another branch of research, begun in [29] and continued, e. g., in [5, 13,14,15, 36, 37], is based on functional inequalities (such as Poincaré or log-Sobolev inequalities) restricted to convex functions and weak transport-entropy inequalities. In [20, Lemma 1.8], a generalization of (3.1) for sub-Gaussian random variables (α = 2) was proven, which we may extend to any order α ∈ (0, 2].
Proposition 31
Let X1, …, Xn be independent random variables, α ∈ (0, 2] and \(f : \mathbb {R}^n \to \mathbb {R}\) convex and 1-Lipschitz. Then, for any t ≥ 0,
In particular,
Note that the main results of the following two sections can be regarded as applications of Proposition 31. If f is only separately convex (i. e., convex in every coordinate with the other coordinates being fixed), it is still possible to prove a corresponding result for the upper tails. Indeed, it is no problem to modify the proof below accordingly, replacing (3.1) by [7, Theorem 6.10]. Moreover, note that \(\lVert \max _i |X_i| \rVert _{\Psi _\alpha }\) cannot be replaced by \(\max _i \lVert |X_i| \rVert _{\Psi _\alpha }\) (a counterexample for α = 2 is provided in [20]). In general, the Orlicz norm of \(\max _i\lvert X_i \rvert \) will be of order \((\log n)^{1/\alpha }\) (cf. Lemma 56).
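The \((\log n)^{1/\alpha }\) growth of the Orlicz norm of \(\max _i \lvert X_i \rvert \) is already visible at the level of expectations; e. g., for α = 1 and i.i.d. standard exponential variables, \(\mathbb {E}\max _i \lvert X_i \rvert \) is the n-th harmonic number ≈ log n. A small simulation (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(8)
alpha = 1.0                      # exponential tails: P(|X_i| >= t) = exp(-t)
ratios = []
for n in (100, 10_000):
    # Monte Carlo estimate of E max_i |X_i| over 500 repetitions
    m = rng.exponential(size=(500, n)).max(axis=1).mean()
    ratios.append(float(m / np.log(n) ** (1 / alpha)))
```

The ratios stay bounded (close to 1) as n grows, in line with the \((\log n)^{1/\alpha }\) scaling.
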
Proof of Proposition 31
Following the lines of the proof of [20, Lemma 3.5], the key step is a suitable truncation that goes back to [1]. Indeed, write
with \(M := 8 \mathbb {E}\max _i |X_i|\) (in particular, \(M \le C_\alpha \lVert \max _i |X_i| \rVert _{\Psi _\alpha }\), cf. [12, Lemma A.2]), and let Y = (Y1, …, Yn), Z = (Z1, …, Zn). By the Lipschitz property of f,
and hence, it suffices to bound the terms in the last line.
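For orientation, the truncation step can be sketched numerically as follows (the display (3.3) is not reproduced above; we assume the standard choice \(Y_i := X_i \mathbbm {1}_{\{\lvert X_i \rvert \le M\}}\) and Z := X − Y, which matches the properties used below):

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.standard_normal((10_000, 20))     # rows: independent copies of (X_1, ..., X_n)
M = 8 * np.abs(X).max(axis=1).mean()      # M := 8 * E max_i |X_i| (empirical proxy)
Y = np.where(np.abs(X) <= M, X, 0.0)      # truncated part, bounded by M a.s.
Z = X - Y                                 # remainder, supported on {|X_i| > M}
```
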
Applying (3.1) to Y and using (2.4) and (2.3), we obtain
Furthermore, below we will show that
Hence, for any t ≥ 0,
and by [12, Lemma A.2],
Temporarily writing \(K := C_\alpha \lVert \max _i |X_i| \rVert _{\Psi _\alpha }\), where Cα is large enough so that (3.5), (3.7), and (3.8) hold, (3.4) and (3.8) yield
if t ≥ K. Using subadditivity and invoking (3.5) and (3.7), we obtain
where the last step holds for t ≥ K + δ for some δ > 0. This bound extends trivially to any t ≥ 0 (if necessary, by a suitable change of constants). Finally, the constant in front of the exponential may be adjusted to 2 by (2.3), which finishes the proof.
It remains to show (3.6). To this end, recall the Hoffmann-Jørgensen inequality (cf. [30, Theorem 6.8]) in the following form: if W1, …, Wn are independent random variables, Sk := W1 + … + Wk, and t ≥ 0 is such that \(\mathbb {P}(\max _k |S_k| > t) \le 1/8\), then
In our case, we set \(W_i := Z_i^2\), t = 0, and note that by Chebyshev’s inequality,
and consequently, recalling that \(S_k = Z_1^2 + \ldots + Z_k^2\),
Thus, together with [12, Lemma A.2], we obtain
Now it is easy to see that \(\lVert \max _i Z_i^2 \rVert _{\Psi _{\alpha /2}} \le \lVert \max _i |X_i| \rVert _{\Psi _\alpha }^2\), so that altogether we arrive at
Furthermore, by [30, Theorem 6.21], if W1, …, Wn are independent random variables with zero mean and α ∈ (0, 1],
In our case, we consider \(W_i = Z_i^2 - \mathbb {E} Z_i^2\) and α∕2 (instead of α). Together with the previous arguments (in particular, (3.9)) and [12, Lemma A.3], this yields
Combining this with [12, Lemma A.3] and (3.9), we arrive at (3.6). □
4 Uniform Tail Bounds for First- and Second-Order Chaos
In this section, we discuss bounds for the tails of the supremum of certain chaos-type classes of functions. Even if we are particularly interested in quadratic forms, i. e., uniform Hanson–Wright inequalities, let us first consider linear forms.
Let X1, …, Xn be independent random variables, let α ∈ (0, 2], and let \(\{a_{i,t} : i = 1, \ldots , n, t \in \mathcal {T}\}\) be a compact set of real numbers, where \(\mathcal {T}\) is some index set. Consider \(g(X) := \sup _{t \in \mathcal {T}} \sum _{i=1}^n a_{i,t}X_i\). Clearly, g is convex and has Lipschitz constant \(D := \sup _{t \in \mathcal {T}} (\sum _{i=1}^n a_{i,t}^2)^{1/2}\). Therefore, applying Proposition 31, we immediately obtain that for any t ≥ 0,
For bounded random variables, corresponding tail bounds can be found, e. g., in [32, Eq. (14)], and choosing α = 2, we get back this result up to constants.
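For a finite index set \(\mathcal {T}\), the supremum g(X), its convexity, and the Lipschitz constant D are straightforward to realize numerically. The following sketch (illustrative only, with Gaussian data) also checks the Lipschitz bound on a random perturbation:

```python
import numpy as np

rng = np.random.default_rng(6)
n, T = 10, 50
a = rng.standard_normal((T, n))                 # rows a_t = (a_{1,t}, ..., a_{n,t})
X = rng.standard_normal(n)
g = float(np.max(a @ X))                        # g(X) = sup_t sum_i a_{i,t} X_i
D = float(np.max(np.linalg.norm(a, axis=1)))    # Lipschitz constant of g
Xp = X + 0.01 * rng.standard_normal(n)          # perturbed point
gp = float(np.max(a @ Xp))                      # a max of linear maps is convex and D-Lipschitz
```
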
Our main aim is to derive a second-order analogue of (4.1), i. e., a uniform Hanson–Wright inequality. A pioneering result in this direction (for Rademacher variables) can be found in [39]. Later results include [2] (which requires the so-called concentration property), [22], [9], and [11] (certain classes of weakly dependent random variables). In [20], a uniform Hanson–Wright inequality for sub-Gaussian random variables was proven. We may show a similar result for random variables with bounded Orlicz norms of any order α ∈ (0, 2].
Theorem 41
Let X1, …, Xn be independent, centered random variables and \(K := \lVert \max _i \lvert X_i \rvert \rVert _{\Psi _\alpha }\), where α ∈ (0, 2]. Let \(\mathcal {A}\) be a compact set of real symmetric n × n matrices, and let \(f(X) := \sup _{A \in \mathcal {A}} (X^TAX - \mathbb {E}X^TAX)\). Then, for any t ≥ 0,
For α = 2, this gives back [20, Theorem 1.1] (up to constants and a different range of t). Comparing Theorem 41 to Theorem 21, we note that instead of a sub-Gaussian term, we obtain an α-subexponential term (which can be trivially transformed into a sub-Gaussian term for \(t \le \mathbb {E}\sup _{A\in \mathcal {A}}\lVert AX \rVert _2\), but this does not cover the complete α-subexponential regime). Moreover, Theorem 41 only gives a bound for the upper tails. Therefore, if \(\mathcal {A}\) just consists of a single matrix, Theorem 21 is stronger. These differences are due to technical reasons.
To prove Theorem 41, we shall follow the basic steps of [20] and modify those where the truncation comes in. Let us first repeat some tools and results. In the sequel, for a random vector W = (W1, …, Wn), we shall denote
where \(g : \mathbb {R}^{n \times n} \to \mathbb {R}\) is some function. Moreover, if A is any matrix, we denote by Diag(A) its diagonal part (regarded as a matrix with zero entries on its off-diagonal). The following lemma combines [20, Lemmas 3.2 & 3.5].
Lemma 42
-
(1)
Assume the vector W has independent components that satisfy Wi ≤ K a.s. Then, for any t ≥ 1, we have
$$\displaystyle \begin{aligned} f(W) - \mathbb {E}f(W) \le C\big(K(\mathbb {E}\sup_{A \in \mathcal{A}}\lVert AW \rVert _2 &+ \mathbb {E}\sup_{A \in \mathcal{A}} \lVert \mathrm{Diag}(A)W \rVert _2)\sqrt{t} \\ &+ K^2\sup_{A\in \mathcal{A}}\lVert A \rVert _{\mathrm{op}}t\big) \end{aligned} $$with probability at least 1 − e−t.
-
(2)
Assuming the vector W has independent (but not necessarily bounded) components with mean zero, we have
$$\displaystyle \begin{aligned} \mathbb {E}\sup_{A\in\mathcal{A}}\lVert \mathrm{Diag}(A)W \rVert _2 \le C\mathbb {E}\sup_{A\in\mathcal{A}}\lVert AW \rVert _2. \end{aligned}$$
From now on, let X be the random vector from Theorem 41, and recall the truncated random vector Y that we introduced in (3.3) (and the corresponding “remainder” Z). Then, Lemma 42 (1) for f(Y ) with \(g(A) = \mathbb {E}X^TAX\) yields
with probability at least 1 − e−t (actually, (4.3) even holds with α = 2, but in the sequel we will have to use the weaker version given above anyway). Here we recall that \(M \le C_\alpha \lVert \max _i |X_i| \rVert _{\Psi _\alpha }\).
To prove Theorem 41, it remains to replace the terms involving the truncated random vector Y by the original vector X. First, by Proposition 31 and since \(\sup _{A\in \mathcal {A}} \lVert AX \rVert _2\) is \(\sup _{A\in \mathcal {A}}\lVert A \rVert _{\mathrm {op}}\)-Lipschitz, we obtain
Moreover, by (3.8),
Next we estimate the difference between the expectations of f(X) and f(Y ).
Lemma 43
We have
Proof
First note that
The same holds if we reverse the roles of X and Y . As a consequence,
and thus, taking expectations and applying Hölder’s inequality,
We may estimate \((\mathbb {E}\lVert Z \rVert _2^2)^{1/2}\) using (3.9). Moreover, by arguments similar to those in (3.8), we get from (4.4) that
Arguing similarly and using (4.5), the same bound also holds for \((\mathbb {E}\sup _{A\in \mathcal {A}} \lVert AY \rVert _2^2)^{1/2}\). Taking roots and plugging everything into (4.7) completes the proof. □
Finally, we prove the central result of this section.
Proof of Theorem 41
First, it immediately follows from Lemma 43 that
Moreover, by (4.5) and Lemma 42 (2),
Finally, it follows from (4.6), (4.4), and (4.5) that
with probability at least 1 − 4e−t for all t ≥ 1. Using (3.7), it follows that
with probability at least 1 − 6e−t for all t ≥ 1. Combining (4.8), (4.9), and (4.10) and plugging into (4.3) thus yield that with probability at least 1 − 6e−t for all t ≥ 1,
If \(u \ge \max (a,b)\), it follows that
By standard means (a suitable change of constants, using (2.3)), this bound may be extended to any u ≥ 0, and the constant may be adjusted to 2. □
5 Random Tensors
By a simple random tensor, we mean a random tensor of the form

$$\displaystyle X := X_1 \otimes X_2 \otimes \cdots \otimes X_d, \qquad \qquad (5.1)$$

where all Xk are independent random vectors in \(\mathbb {R}^n\) whose coordinates are independent, centered random variables with variance one. Concentration results for random tensors (typically for polynomial-type functions) have been shown in [6, 12, 26], for instance.
Recently, in [42], new and interesting concentration bounds for simple random tensors were shown. In comparison to previous work, these inequalities focus on small values of t, e. g., a regime where sub-Gaussian tail decay holds. Moreover, in contrast to previous papers, [42] provides constants with optimal dependence on d. One of these results is the following convex concentration inequality: assuming that n and d are positive integers, \(f : \mathbb {R}^{n^d} \to \mathbb {R}\) is convex and 1-Lipschitz, and the Xij are bounded a.s., then for any t ∈ [0, 2nd∕2],
where C > 0 only depends on the bound of the coordinates. Using Theorem 21 and Proposition 31, we may extend this result to unbounded random variables as follows:
Theorem 51
Let \(n, d \in \mathbb {N}\) and \(f : \mathbb {R}^{n^d} \to \mathbb {R}\) be convex and 1-Lipschitz. Consider a simple random tensor X := X1 ⊗⋯ ⊗ Xd as in (5.1). Fix α ∈ [1, 2], and assume that \(\lVert X_{i,j} \rVert _{\Psi _\alpha } \le K\). Then, for any \(t \in [0, c_\alpha n^{d/2} (\log n)^{1/\alpha }/K]\),
On the other hand, if α ∈ (0, 1), then, for any \(t \in [0, c_\alpha n^{d/2} (\log n)^{1/\alpha }d^{1/\alpha -1/2}/K]\),
The logarithmic factor stems from the Orlicz norm of \(\max _i \lvert X_i \rvert \) in Proposition 31. For a slightly sharper version that includes the explicit dependence on these norms (and also gives back (5.2) for bounded random variables and α = 2), see (5.12) in the proof of Theorem 51. We believe that Theorem 51 is non-optimal for α < 1, as we would expect a bound of the same type as for α ∈ [1, 2]. However, a key difference in the proofs is that in the case of α ≥ 1, we can make use of moment-generating functions. This is clearly not possible if α < 1, so that less subtle estimates must be invoked instead.
For the proof of Theorem 51, we first adapt some preliminary steps and compile a number of auxiliary lemmas whose proofs are deferred to the appendix. As a start, we need some additional characterizations of α-subexponential random variables via the behavior of the moment-generating functions:
Proposition 52
Let X be a random variable and α ∈ (0, 2]. Then, the properties (1.1), (1.2), and (1.3) are equivalent to
for all 0 ≤ λ ≤ 1∕C4,α. If α ∈ [1, 2] and \(\mathbb {E}X = 0\), then the above properties are moreover equivalent to
The parameters Ci,α, i = 1, …, 5, can be chosen such that they only differ by constant α-dependent factors. In particular, we can take \(C_{i,\alpha } = c_{i,\alpha }\lVert X \rVert _{\Psi _\alpha }\).
To continue, note that \(\lVert X \rVert _2 = \prod _{i=1}^d \lVert X_i \rVert _2\). A key step in the proofs of [42] is a maximal inequality that simultaneously controls the tails of \(\prod _{i=1}^k \lVert X_i \rVert _2\), k = 1, …, d, where the Xi have independent sub-Gaussian components, i. e., α = 2. Generalizing these results to any order α ∈ (0, 2] is not hard. The following preparatory lemma extends [42, Lemma 3.1]. Note that in its proof (given in the appendix again), we apply Proposition 22.
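The identity \(\lVert X \rVert _2 = \prod _{i=1}^d \lVert X_i \rVert _2\) is elementary for tensors of the form (5.1) and easy to verify numerically (an illustrative sketch):

```python
import numpy as np
from functools import reduce

rng = np.random.default_rng(7)
d, n = 3, 4
Xs = [rng.standard_normal(n) for _ in range(d)]
X = reduce(np.multiply.outer, Xs)                 # X_1 ⊗ X_2 ⊗ X_3 as an n^d array
lhs = float(np.linalg.norm(X.ravel()))            # Euclidean norm of the tensor
rhs = float(np.prod([np.linalg.norm(x) for x in Xs]))
```
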
Lemma 53
Let \(X_1, \ldots , X_d \in \mathbb {R}^n\) be independent random vectors with independent, centered coordinates such that \({\mathbb {E}} X_{i,j}^2 = 1\) and \(\lVert X_{i,j} \rVert _{\Psi _\alpha } \le K\) for some α ∈ (0, 2]. Then, for any t ∈ [0, 2nd∕2],
To control all k = 1, …, d simultaneously, we need a generalized version of the maximal inequality [42, Lemma 3.2] that we state next.
Lemma 54
Let \(X_1, \ldots , X_d \in \mathbb {R}^n\) be independent random vectors with independent, centered coordinates such that \({\mathbb {E}} X_{i,j}^2 = 1\) and \(\lVert X_{i,j} \rVert _{\Psi _\alpha } \le K\) for some α ∈ (0, 2]. Then, for any u ∈ [0, 2],
The following martingale-type bound is directly taken from [42]:
Lemma 55 ([42], Lemma 4.1)
Let X1, …, Xd be independent random vectors. For each k = 1, …, d, let fk = fk(Xk, …, Xd) be an integrable real-valued function and \(\mathcal {E}_k\) be an event that is uniquely determined by the vectors Xk, …, Xd. Let \(\mathcal {E}_{d+1}\) be the entire probability space. Suppose that for every k = 1, …, d, we have
for every realization of Xk+1, …, Xd in \(\mathcal {E}_{k+1}\). Then, for \(\mathcal {E} := \mathcal {E}_2 \cap \cdots \cap \mathcal {E}_d\), we have
Finally, we need a bound for the Orlicz norm of maxi|Xi|.
Lemma 56
Let X1, …, Xn be independent, centered random variables such that \(\lVert X_i \rVert _{\Psi _\alpha } \le K\) for any i and some α > 0. Then,
Here, we may choose \(C_\alpha = \max \{2^{1/\alpha -1}, 2^{1-1/\alpha }\}\).
Note that for α ≥ 1, [8, Proposition 4.3.1] provides a similar result. However, we are also interested in the case of α < 1 in the present note. The condition \(\mathbb {E}X_i = 0\) in Lemma 56 can easily be removed only at the expense of a different absolute constant.
We are now ready to prove Theorem 51.
Proof of Theorem 51
We shall adapt the arguments from [42]. First let
and let \(\mathcal {E}_{d+1}\) be the full space. It then follows from Lemma 54 for u = 1 that
where \(\mathcal {E} := \mathcal {E}_2 \cap \cdots \cap \mathcal {E}_d\).
Now fix any realization x2, …, xd of the random vectors X2, …, Xd in \(\mathcal {E}_2\), and apply Proposition 31 to the function f1(x1) given by x1 ↦ f(x1, …, xd). Clearly, f1 is convex, and since
we see that it is \(2n^{(d-1)/2}\)-Lipschitz. Hence, it follows from (3.2) that
for any x2, …, xd in \(\mathcal {E}_2\), where \(\mathbb {E}_{X_1}\) denotes taking the expectation with respect to X1 (which, by independence, is the same as conditionally on X2, …, Xd).
To continue, fix any realization x3, …, xd of the random vectors X3, …, Xd that satisfy \(\mathcal {E}_3\) and apply Proposition 31 to the function f2(x2) given by \(x_2 \mapsto \mathbb {E}_{X_1}f(X_1, x_2, \ldots , x_d)\). Again, f2 is a convex function, and since
f2 is \(2n^{(d-1)/2}\)-Lipschitz. Applying (3.2), we thus obtain
for any x3, …, xd in \(\mathcal {E}_3\). Iterating this procedure, we arrive at
for any realization xk+1, …, xd of Xk+1, …, Xd in \(\mathcal {E}_{k+1}\).
We now combine (5.8) for k = 1, …, d. To this end, we write
and apply Proposition 52. Here we have to distinguish between the cases where α ∈ [1, 2] and α ∈ (0, 1). If α ≥ 1, we use (5.4) to arrive at a bound for the moment-generating function. Writing \(M_k := \lVert \max _j |X_{k,j}| \rVert _{\Psi _\alpha }\), we obtain
for all xk+1, …, xd in \(\mathcal {E}_{k+1}\), where the first line holds if \(\lvert \lambda \rvert \le 1/(c_\alpha n^{(d-1)/2}M_k)\) and the second one if \(\lvert \lambda \rvert \ge 1/(c_\alpha n^{(d-1)/2}M_k)\) and α > 1. For simplicity of presentation, temporarily assume that \(c_\alpha n^{(d-1)/2} = 1\) (alternatively, replace \(M_k\) by \(c_\alpha n^{(d-1)/2}M_k\) in the following arguments) and that \(M_1 \le \ldots \le M_d\). Using Lemma 55, we obtain
for \(\lvert \lambda \rvert \in [1/M_{k+1}, 1/M_k]\), where we formally set \(M_0 := 0\) and \(M_{d+1} := \infty \). In particular, setting \(M := (M_1^2 + \ldots + M_d^2)^{1/2}\), we have
for all \(\lvert \lambda \rvert \le 1/M_d = 1/(\max _k M_k)\). Furthermore, for α > 1, it is not hard to see that
if \(\lvert \lambda \rvert \in [1/M_{k+1}, 1/M_k]\) for some k = 0, 1, …, d − 1 or \(\lvert \lambda \rvert \in [1/M, 1/M_d]\) for k = d. Indeed, by monotonicity (divide by \(\lambda ^2\) and compare the coefficients), it suffices to check this for \(\lambda = 1/M_{k+1}\), or \(\lambda = 1/M\) if k = d. The cases of k = 0 and k = d follow by simple calculations. In the general case, set \(x^2 = (M_1^2 + \ldots + M_{k+1}^2)/M_{k+1}^2\) and \(y^{\alpha /(\alpha -1)} = (M_{k+2}^{\alpha /(\alpha -1)} + \ldots + M_d^{\alpha /(\alpha -1)})/M_{k+1}^{\alpha /(\alpha -1)}\). Clearly, \((x^2 + y^{\alpha /(\alpha -1)})^{(\alpha -1)/\alpha } \le (x^2 + y^2)^{1/2}\) since x ≥ 1 and α∕(α − 1) ≥ 2. Moreover, \(y^2 \le (M_{k+2}^2 + \ldots + M_d^2)/M_{k+1}^2\), which proves the inequality. Altogether, inserting the factor \(c_\alpha n^{(d-1)/2}\) again, we therefore obtain
where the first line holds if \(\lvert \lambda \rvert \le 1/(c_\alpha n^{(d-1)/2}M)\) and the second one if \(\lvert \lambda \rvert \ge 1/(c_\alpha n^{(d-1)/2}M)\) and α > 1.
On the other hand, if α < 1, we use (5.3). Together with Lemma 55 and the subadditivity of |⋅|α for α ∈ (0, 1), this yields
for \(\lambda \in [0, 1/(c_\alpha n^{(d-1)/2}\max _k M_k)]\).
To finish the proof, first consider α ∈ [1, 2]. Then, for any λ > 0, we have
where the last step follows by standard arguments (similar to those in the proof of Proposition 52 given in the appendix), using (5.9) and (5.5). Now, assume that \(t \le c_\alpha n^{d/2}M/(K^2d^{1/2})\). Then, the right-hand side of (5.11) is dominated by the first term (possibly after adjusting constants), so that we arrive at
The same arguments hold if f is replaced by − f. Adjusting constants by (2.3), we obtain that for any \(t \in [0, c_\alpha n^{d/2}M/(K^2d^{1/2})]\),
Now it remains to note that by Lemma 56, we have
If α ∈ (0, 1), similarly to (5.11), using (5.10), (5.5) and Proposition 52,
where \(M_\alpha := (M_1^\alpha + \ldots + M_d^\alpha )^{1/\alpha }\). The rest follows as above. □
References
R. Adamczak, A tail inequality for suprema of unbounded empirical processes with applications to Markov chains. Electron. J. Probab. 13(34), 1000–1034 (2008)
R. Adamczak, A note on the Hanson-Wright inequality for random vectors with dependencies. Electron. Commun. Probab. 20(72), 13 (2015)
R. Adamczak, R. Latała, Tail and moment estimates for chaoses generated by symmetric random variables with logarithmically concave tails. Ann. Inst. Henri Poincaré Probab. Stat. 48(4), 1103–1136 (2012)
R. Adamczak, R. Latała, R. Meller, Hanson-Wright inequality in Banach spaces. Ann. Inst. Henri Poincaré Probab. Stat. 56(4), 2356–2376 (2020)
R. Adamczak, M. Strzelecki, On the convex Poincaré inequality and weak transportation inequalities. Bernoulli 25(1), 341–374 (2019)
R. Adamczak, P. Wolff, Concentration inequalities for non-Lipschitz functions with bounded derivatives of higher order. Probab. Theory Relat. Fields 162(3–4), 531–586 (2015)
S. Boucheron, G. Lugosi, P. Massart, Concentration Inequalities: A Nonasymptotic Theory of Independence (Oxford University Press, Oxford, 2013). With a foreword by Michel Ledoux
V.H. de la Peña, E. Giné, Decoupling. Probability and Its Applications (New York) (Springer, New York, 1999)
L.H. Dicker, M.A. Erdogdu, Flexible results for quadratic forms with applications to variance components estimation. Ann. Stat. 45(1), 386–414 (2017)
E.D. Gluskin, S. Kwapień, Tail and moment estimates for sums of independent random variables with logarithmically concave tails. Studia Math. 114(3), 303–309 (1995)
F. Götze, H. Sambale, A. Sinulis, Concentration inequalities for bounded functionals via log-Sobolev-type inequalities. J. Theor. Probab. 34(3), 1623–1652 (2021)
F. Götze, H. Sambale, A. Sinulis, Concentration inequalities for polynomials in α-sub-exponential random variables. Electron. J. Probab. 26(48), 22 (2021)
N. Gozlan, C. Roberto, P.-M. Samson, From dimension free concentration to the Poincaré inequality. Calc. Var. Partial Differ. Equ. 52(3–4), 899–925 (2015)
N. Gozlan, C. Roberto, P.-M. Samson, Y. Shu, P. Tetali, Characterization of a class of weak transport-entropy inequalities on the line. Ann. Inst. Henri Poincaré Probab. Stat. 54(3), 1667–1693 (2018)
N. Gozlan, C. Roberto, P.-M. Samson, P. Tetali, Kantorovich duality for general transport costs and applications. J. Funct. Anal. 273(11), 3327–3405 (2017)
D.L. Hanson, F.T. Wright, A bound on tail probabilities for quadratic forms in independent random variables. Ann. Math. Stat. 42, 1079–1083 (1971)
P. Hitczenko, S.J. Montgomery-Smith, K. Oleszkiewicz, Moment inequalities for sums of certain independent symmetric random variables. Studia Math. 123(1), 15–42 (1997)
D. Hsu, S.M. Kakade, T. Zhang, A tail inequality for quadratic forms of sub-Gaussian random vectors. Electron. Commun. Probab. 17(52), 6 (2012)
W.B. Johnson, G. Schechtman, Remarks on Talagrand’s deviation inequality for Rademacher functions, in Functional Analysis (Austin, TX, 1987/1989). Lecture Notes in Mathematics, vol. 1470 (Springer, Berlin, 1991), pp. 72–77
Y. Klochkov, N. Zhivotovskiy, Uniform Hanson–Wright type concentration inequalities for unbounded entries via the entropy method. Electron. J. Probab. 25(22), 30 (2020)
K. Kolesko, R. Latała, Moment estimates for chaoses generated by symmetric random variables with logarithmically convex tails. Stat. Probab. Lett. 107, 210–214 (2015)
F. Krahmer, S. Mendelson, H. Rauhut, Suprema of chaos processes and the restricted isometry property. Commun. Pure Appl. Math. 67(11), 1877–1904 (2014)
A.K. Kuchibhotla, A. Chakrabortty, Moving Beyond Sub-Gaussianity in High-Dimensional Statistics: Applications in Covariance Estimation and Linear Regression. arXiv preprint, 2018
R. Latała, Tail and moment estimates for sums of independent random vectors with logarithmically concave tails. Studia Math. 118(3), 301–304 (1996)
R. Latała, Tail and moment estimates for some types of chaos. Studia Math. 135(1), 39–53 (1999)
R. Latała, Estimates of moments and tails of Gaussian chaoses. Ann. Probab. 34(6), 2315–2331 (2006)
R. Latała, R. Łochowski, Moment and tail estimates for multidimensional chaos generated by positive random variables with logarithmically concave tails, in Stochastic Inequalities and Applications. Progress in Probability, vol. 56 (Birkhäuser, Basel, 2003), pp. 77–92
J. Lederer, S. van de Geer, New concentration inequalities for suprema of empirical processes. Bernoulli 20(4), 2020–2038 (2014)
M. Ledoux, On Talagrand’s deviation inequalities for product measures. ESAIM Probab. Stat. 1, 63–87 (1995/1997)
M. Ledoux, M. Talagrand, Probability in Banach Spaces: Isoperimetry and Processes. Ergebnisse der Mathematik und ihrer Grenzgebiete (3) [Results in Mathematics and Related Areas (3)], vol. 23 (Springer, Berlin, 1991)
A. Marchina, Concentration inequalities for separately convex functions. Bernoulli 24(4A), 2906–2933 (2018)
P. Massart, Some applications of concentration inequalities to statistics. Ann. Fac. Sci. Toulouse Math. (6) 9(2), 245–303 (2000)
M. Rudelson, R. Vershynin, Hanson-Wright inequality and sub-Gaussian concentration. Electron. Commun. Probab. 18(82), 9 (2013)
H. Sambale, A. Sinulis, Logarithmic Sobolev inequalities for finite spin systems and applications. Bernoulli 26(3), 1863–1890 (2020)
H. Sambale, A. Sinulis, Modified log-Sobolev inequalities and two-level concentration. ALEA Lat. Am. J. Probab. Math. Stat. 18, 855–885 (2021)
P.-M. Samson, Concentration of measure inequalities for Markov chains and Φ-mixing processes. Ann. Probab. 28(1), 416–461 (2000)
P.-M. Samson, Concentration inequalities for convex functions on product spaces, in Stochastic Inequalities and Applications. Progress in Probability, vol. 56 (Birkhäuser, Basel, 2003), pp. 33–52
M. Talagrand, An isoperimetric theorem on the cube and the Khintchine-Kahane inequalities. Proc. Am. Math. Soc. 104(3), 905–909 (1988)
M. Talagrand, New concentration inequalities in product spaces. Invent. Math. 126(3), 505–563 (1996)
S. van de Geer, J. Lederer, The Bernstein-Orlicz norm and deviation inequalities. Probab. Theory Relat. Fields 157(1–2), 225–250 (2013)
R. Vershynin, High-Dimensional Probability. Cambridge Series in Statistical and Probabilistic Mathematics, vol. 47 (Cambridge University Press, Cambridge, 2018)
R. Vershynin, Concentration inequalities for random tensors. Bernoulli 26(4), 3139–3162 (2020)
V.H. Vu, K. Wang, Random weighted projections, random quadratic forms and random eigenvectors. Random Struct. Algorithm 47(4), 792–821 (2015)
Acknowledgements
This work was supported by the German Research Foundation (DFG) via CRC 1283 “Taming uncertainty and profiting from randomness and low regularity in analysis, stochastics and their applications.” The author would moreover like to thank Arthur Sinulis for carefully reading this paper and many fruitful discussions and suggestions.
Appendix A
Proof of Proposition 5.2
The equivalence of (1.1), (1.2), (1.3), and (5.3) is easily seen by directly adapting the arguments from the proof of [41, Proposition 2.5.2]. To see that these properties imply (5.4), first note that since in particular \(\lVert X \rVert _{\Psi _1} < \infty \), the bound for \(|\lambda | \le 1/C_{5,\alpha }^{\prime }\) directly follows from [41, Proposition 2.7.1 (e)]. To see the bound for large values of |λ|, we infer that by the weighted arithmetic–geometric mean inequality (with weights α − 1 and 1),
for any y, z ≥ 0. Setting \(y := |\lambda|^{\alpha/(\alpha-1)}\) and \(z := |x|^\alpha\), we may conclude that
for any \(\lambda , x \in \mathbb {R}\). Consequently, using (5.3) and assuming \(C_{4,\alpha} = 1\), for any \(|\lambda| \ge 1\),
This yields (5.4) for \(|\lambda | \ge 1/C_{5,\alpha }^{\prime \prime }\). The claim now follows by taking \(C_{5,\alpha } := \max (C_{5,\alpha }^{\prime }, C_{5,\alpha }^{\prime \prime })\).
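The weighted arithmetic–geometric mean step above is nothing but Young's inequality; for the reader's convenience, the standard computation may be written out as follows:

```latex
% Weighted AM--GM with weights \alpha-1 and 1 (total weight \alpha), y, z \ge 0:
\[
  y^{\frac{\alpha-1}{\alpha}}\, z^{\frac{1}{\alpha}}
    \;\le\; \frac{(\alpha-1)\,y + z}{\alpha} .
\]
% Substituting y := |\lambda|^{\alpha/(\alpha-1)} and z := |x|^\alpha gives
% y^{(\alpha-1)/\alpha} = |\lambda| and z^{1/\alpha} = |x|, hence Young's inequality:
\[
  |\lambda x|
    \;\le\; \frac{\alpha-1}{\alpha}\,|\lambda|^{\alpha/(\alpha-1)}
      + \frac{1}{\alpha}\,|x|^{\alpha} .
\]
```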
Finally, starting with (5.4) and assuming \(C_{5,\alpha} = 1\), let us check (1.1). To this end, note that for any λ > 0,
Now choose \(\lambda := t/2\) if \(t \le 2\), \(\lambda := ((\alpha - 1)t/\alpha)^{\alpha-1}\) if \(t \ge \alpha/(\alpha - 1)\), and \(\lambda := 1\) if \(t \in (2, \alpha/(\alpha - 1))\). This yields
Now use (2.3), (2.4), and the fact that \(\exp (-(t-1)) \le \exp (- t^\alpha /C_\alpha ^\alpha )\) for any t ∈ (2, α∕(α − 1)). It follows that
for any t ≥ 0. The same argument for − X completes the proof. □
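The choice of λ in the regime \(t \ge \alpha/(\alpha-1)\) can be checked by the usual Chernoff computation. Assuming (5.4) takes the form \(\mathbb{E}\, e^{\lambda X} \le \exp(|\lambda|^{\alpha/(\alpha-1)})\) for \(|\lambda| \ge 1\) (with \(C_{5,\alpha} = 1\); a sketch under this assumption, not the author's exact display):

```latex
\[
  \mathbb{P}(X \ge t)
    \;\le\; e^{-\lambda t}\, \mathbb{E}\, e^{\lambda X}
    \;\le\; \exp\!\bigl(\lambda^{\alpha/(\alpha-1)} - \lambda t\bigr).
\]
% Minimizing the exponent: \frac{\alpha}{\alpha-1}\,\lambda^{1/(\alpha-1)} = t
% yields \lambda = ((\alpha-1)t/\alpha)^{\alpha-1}, which is \ge 1 for
% t \ge \alpha/(\alpha-1), and plugging back in gives
\[
  \lambda^{\alpha/(\alpha-1)} - \lambda t
    \;=\; -\,\frac{(\alpha-1)^{\alpha-1}}{\alpha^{\alpha}}\; t^{\alpha}.
\]
```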
Proof of Lemma 5.3
By the arithmetic–geometric mean inequality and since \(\mathbb {E}\lVert X_i \rVert _2 \le \sqrt {n}\), for any s ≥ 0,
Moreover, by (2.2) and [12, Corollary A.5],
for any i = 1, …, d. On the other hand, if Y1, …, Yd are independent centered random variables with \(\lVert Y_i \rVert _{\Psi _\alpha } \le M\), we have
Here, the first estimate follows from [10] (α > 1) and [17] (α ≤ 1), while the last step follows from (2.4). As a consequence, (A.1) can be bounded by \(2 \exp (-s^\alpha d^{\alpha /2} / (K^{2\alpha }C_\alpha ))\).
For \(u \in [0, 2]\) and \(s = u \sqrt {n}/(2d)\), we have \( (\sqrt {n} + s)^d \le n^{d/2}(1+u)\). Plugging in, we arrive at
Now set \(u := t/n^{d/2}\). □
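The elementary estimate \((\sqrt{n} + s)^d \le n^{d/2}(1+u)\) for \(s = u\sqrt{n}/(2d)\) follows from \((1 + u/(2d))^d \le e^{u/2} \le 1 + u\) on \(u \in [0, 2]\). A short numerical sanity check (illustrative only, not part of the proof):

```python
import math

def lhs(n: float, d: int, u: float) -> float:
    """Left-hand side (sqrt(n) + s)^d with s = u*sqrt(n)/(2d)."""
    s = u * math.sqrt(n) / (2 * d)
    return (math.sqrt(n) + s) ** d

def rhs(n: float, d: int, u: float) -> float:
    """Right-hand side n^(d/2) * (1 + u)."""
    return n ** (d / 2) * (1 + u)

# Check the inequality on a grid of parameters with u restricted to [0, 2].
for n in (1, 4, 100):
    for d in (1, 2, 8, 16):
        for u in (0.0, 0.5, 1.0, 1.5, 2.0):
            assert lhs(n, d, u) <= rhs(n, d, u) * (1 + 1e-12)
```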
Proof of Lemma 5.4
Let us first recall the partition into “binary sets” that appears in the proof of [42, Lemma 3.2]. Here we assume that \(d = 2^L\) for some \(L \in \mathbb {N}\) (if not, increase d). Then, for any \(\ell \in \{0, 1, \ldots, L\}\), we consider the partition \(\mathcal {I}_\ell \) of {1, …, d} into \(2^\ell\) successive (integer) intervals of length \(d_\ell := d/2^\ell\) that we call “binary intervals.” It is not hard to see that for any k = 1, …, d, we can partition [1, k] into binary intervals of different lengths such that this partition contains at most one interval of each family \(\mathcal {I}_\ell \).
Now it suffices to prove that
(cf. Step 3 of the proof of [42, Lemma 3.2], where the reduction to this case is explained in detail). To this end, for any \(\ell \in \{0, 1, \ldots, L\}\), any \(I \in \mathcal {I}_\ell \), and \(d_\ell := |I| = d/2^\ell\), we apply Lemma 5.3 for \(d_\ell\) and \(t := 2^{-\ell /4} n^{d_\ell /2}u\). This yields
Altogether, we arrive at
We may now assume that \((n^{1/2}u/(K^2d^{1/2}))^\alpha /C_\alpha \ge 1\) (otherwise, the bound in Lemma 5.4 becomes trivial after adjusting \(C_\alpha\)). Using the elementary inequality \(ab \ge (a + b)/2\) for all \(a, b \ge 1\), we arrive at
Using this in (A.2), we obtain the upper bound
By (2.3), we can assume \(c_\alpha = 2\). □
To prove Lemma 5.6, we first present a number of auxiliary lemmas and statements. In particular, recall that if \(\alpha \in (0, \infty)\), then for any \(x, y \in (0, \infty)\),
\[
c_\alpha (x^\alpha + y^\alpha) \le (x + y)^\alpha \le \widetilde{c}_\alpha (x^\alpha + y^\alpha), \tag{A.3}
\]
where \(c_\alpha := 2^{\alpha -1} \wedge 1\) and \(\widetilde {c}_\alpha := 2^{\alpha -1} \vee 1\). Indeed, if \(\alpha \le 1\), using the concavity of the function \(x \mapsto x^\alpha\), it follows by standard arguments that \(2^{\alpha-1}(x^\alpha + y^\alpha) \le (x + y)^\alpha \le x^\alpha + y^\alpha\). Likewise, for \(\alpha \ge 1\), using the convexity of \(x \mapsto x^\alpha\), we obtain \(x^\alpha + y^\alpha \le (x + y)^\alpha \le 2^{\alpha-1}(x^\alpha + y^\alpha)\).
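The two-sided bound (A.3) with constants \(c_\alpha = 2^{\alpha-1} \wedge 1\) and \(\widetilde{c}_\alpha = 2^{\alpha-1} \vee 1\) is easy to verify numerically; the following sketch checks it on a small grid (illustrative only):

```python
import itertools

def a3_holds(alpha: float, x: float, y: float, tol: float = 1e-12) -> bool:
    """Check c_alpha*(x^a + y^a) <= (x+y)^a <= C_alpha*(x^a + y^a)."""
    c = min(2 ** (alpha - 1), 1.0)   # c_alpha = 2^(alpha-1) ∧ 1
    C = max(2 ** (alpha - 1), 1.0)   # tilde c_alpha = 2^(alpha-1) ∨ 1
    s = x ** alpha + y ** alpha
    mid = (x + y) ** alpha
    return c * s <= mid * (1 + tol) and mid <= C * s * (1 + tol)

# Grid check over both the concave (alpha <= 1) and convex (alpha >= 1) regimes.
for alpha in (0.3, 0.5, 1.0, 1.5, 2.0):
    for x, y in itertools.product((0.1, 1.0, 3.0, 10.0), repeat=2):
        assert a3_holds(alpha, x, y)
```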
Lemma A.1
Let \(X_1, \ldots, X_n\) be independent, centered random variables such that \(\lVert X_i \rVert _{\Psi _\alpha } \le 1\) for some \(\alpha > 0\). Then, if \(Y := \max_i |X_i|\) and \(c := (c_\alpha ^{-1}\log n)^{1/\alpha }\), we have
with cα as in (A.3).
Proof
We have
where we have used (A.3) in the next-to-last step. □
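Assuming the standard tail bound \(\mathbb{P}(|X_i| \ge t) \le 2e^{-t^\alpha}\) implied by \(\lVert X_i \rVert_{\Psi_\alpha} \le 1\) (up to absolute constants), the union-bound computation behind Lemma A.1 may be sketched as:

```latex
\[
  \mathbb{P}(Y \ge t + c)
    \;\le\; \sum_{i=1}^{n} \mathbb{P}\bigl(|X_i| \ge t + c\bigr)
    \;\le\; 2n\, e^{-(t+c)^{\alpha}}
    \;\le\; 2n\, e^{-c_\alpha t^{\alpha} - c_\alpha c^{\alpha}}
    \;=\; 2\, e^{-c_\alpha t^{\alpha}},
\]
% using (A.3) in the next-to-last step and c_\alpha c^\alpha = \log n
% (by the choice c = (c_\alpha^{-1}\log n)^{1/\alpha}) in the last.
```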
Lemma A.2
Let Y ≥ 0 be a random variable that satisfies
for some c ≥ 0 and any t ≥ 0. Then,
with \(\widetilde {c}_\alpha \) as in (A.3).
Proof
By (A.3) and monotonicity, we have \(Y^\alpha \le \widetilde {c}_\alpha ((Y-c)_+^\alpha + c^\alpha )\), where \(x_+ := \max (x,0)\). Thus,
where we have set \(t := s\widetilde {c}_\alpha ^{-1/\alpha }\). Obviously, \(I_1 \le \sqrt {2}\) if \(t \ge c(1/\log \sqrt {2})^{1/\alpha }\). As for \(I_2\), we have
if \(t \ge ((\sqrt {2}+1)/(\sqrt {2}-1))^{1/\alpha }\). Therefore, \(I_1 I_2 \le 2\) if \(t \ge \max \{((\sqrt {2}+1)/(\sqrt {2}-1))^{1/\alpha }, c(2/\log 2)^{1/\alpha }\}\), which finishes the proof. □
With these lemmas at hand, the proof of Lemma 5.6 is easily completed.
Proof of Lemma 5.6
The random variables \(\hat {X}_i := X_i/K\) obviously satisfy the assumptions of Lemma A.1. Hence, setting \(Y := \max _i |\hat {X}_i| = K^{-1} \max _i |X_i|\),
Therefore, we may apply Lemma A.2 to \(\hat {Y} := c_\alpha ^{1/\alpha } K^{-1} \max _i |X_i|\). This yields
i. e., the claim of Lemma 5.6, where we have set \(C := (\widetilde {c}_\alpha c_\alpha ^{-1})^{1/\alpha }\). □
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Sambale, H. (2023). Some Notes on Concentration for α-Subexponential Random Variables. In: Adamczak, R., Gozlan, N., Lounici, K., Madiman, M. (eds) High Dimensional Probability IX. Progress in Probability, vol 80. Birkhäuser, Cham. https://doi.org/10.1007/978-3-031-26979-0_7
Publisher Name: Birkhäuser, Cham
Print ISBN: 978-3-031-26978-3
Online ISBN: 978-3-031-26979-0