Abstract
Finite sample properties of random covariance-type matrices have been the subject of much research. In this paper we focus on the “lower tail” of such a matrix, and prove that it is sub-Gaussian under a simple fourth moment assumption on the one-dimensional marginals of the random vectors. A similar result holds for more general sums of random positive semidefinite matrices, and our (relatively simple) proof uses a variant of the so-called PAC-Bayesian method for bounding empirical processes. Using this bound, we obtain a nearly optimal finite-sample result for the ordinary least squares estimator under random design.
1 Introduction
Let \(X_1,\ldots ,X_n\) be i.i.d. random (column) vectors in \(\mathbb {R}^p\) with finite second moments. This paper contributes to the problem of obtaining finite-sample concentration bounds for the random covariance-type operator
with mean \(\varSigma := \mathbb {E}\left[ X_1X_1^T\right] \). This problem has received a great deal of attention recently, and has important applications to the estimation of covariance matrices [15, 22], to the analysis of methods for least squares problems [10] and to compressed sensing and high dimensional, small sample size statistics [2, 18, 21].
The most basic question is how many samples are needed to bring \(\widehat{\varSigma }_{n}\) close to \(\varSigma \). In general one needs at least \(n\ge p\) samples, so that the ranks of the two matrices can match. The problem is then to find conditions under which \(n\ge C(\varepsilon )\,p\) samples are enough to guarantee
where \(C(\varepsilon )\) depends only on \(\varepsilon >0\) and on moment assumptions on the \(X_i\)’s.
A well known bound by Rudelson [17, 20] implies that \(C(\varepsilon )\,p\log p\) samples are necessary and sufficient if the vectors \(\varSigma ^{-1/2}X_i/\sqrt{p}\) have uniformly bounded norms. Removing the \(\log p\) factor is relatively easy for sub-Gaussian vectors \(X_i\), but even the seemingly nice case of log-concave random vectors (which have sub-exponential moments) had to wait for the breakthrough papers by Adamczak et al. [1, 3]. A series of results [9, 12, 15, 22] have proven similar results under finite-moment conditions on the one-dimensional marginals plus a (necessary) high-probability bound on \(\max _{i\le n}\,|X_i|_2\).
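As a quick numerical sanity check (not part of the paper's argument; a sketch assuming isotropic Gaussian data and numpy), the rank obstruction for \(n<p\) and the concentration for \(n\gg p\) can both be observed directly:

```python
import numpy as np

rng = np.random.default_rng(0)
p = 50

def emp_cov(X):
    """Empirical second-moment matrix n^{-1} sum_i X_i X_i^T."""
    return X.T @ X / X.shape[0]

# n < p: rank(Sigma_hat) <= n < p, so the smallest eigenvalue is zero
X_small = rng.standard_normal((p - 1, p))
lam_min_small = np.linalg.eigvalsh(emp_cov(X_small))[0]

# n = 20p: the spectrum concentrates near that of Sigma = I
X_large = rng.standard_normal((20 * p, p))
lam_min_large = np.linalg.eigvalsh(emp_cov(X_large))[0]
```

With \(n=20p\) the smallest eigenvalue is already close to the Bai-Yin prediction \((1-\sqrt{p/n})^2\approx 0.61\).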
1.1 The sub-Gaussian lower tail
In this paper we focus on concentration properties of the lower tail of \(\widehat{\varSigma }_{n}\). As it turns out, information about the lower tail is sufficient for many applications, including the analysis of regression-type problems (see Theorem 1.2 below for an example). Moreover, the asymmetry between upper and lower tails is interesting from a purely mathematical perspective.
Our main result is the following theorem.
Theorem 1.1
(Proven in Sect. 4) Let \(X_1,\ldots ,X_n\) be i.i.d. copies of a random vector \(X\in \mathbb {R}^p\) with finite fourth moments. Define \(\varSigma :=\mathbb {E}\left[ XX^T\right] \) and assume
for some \(\mathsf{h}\in (1,+\infty )\). Set
Then, if the number n of samples satisfies
we have the bound
Notice that the sample size n in Theorem 1.1 depends on \(p/\varepsilon ^2\), which is optimal if \(X_1\) has i.i.d. entries due to the Bai-Yin theorem [5]. Moreover, the dependence of n on \(\ln (1/\delta )/\varepsilon ^2\) shows that the sample size depends on the confidence level \(\delta \) in a sub-Gaussian fashion. More precisely, Theorem 1.1 implies that
has a Gaussian-like right tail
with \(C>0\) universal. We observe that Theorem 1.1 has a more general version involving general sums of i.i.d. positive semidefinite random matrices; see Theorem 4.1 below for details.
The main assumption in Theorem 1.1 is (3). This is a finite-moment assumption, and, from a theoretical perspective, it seems remarkable that one can obtain sub-Gaussian concentration from it. From the perspective of applications, there are reasonably natural settings where (3) is a sensible assumption.
-
1.
Assume first \(X=(X[1],X[2],\dots ,X[p])^T\) has diagonal \(\varSigma \) and satisfies a near unbiasedness assumption: for all \((i_1,i_2,i_3,i_4)\in \{1,\dots ,p\}^4\),
$$\begin{aligned} \,i_4\not \in \{i_1,i_2,i_3\}\Rightarrow \mathbb {E}\left[ X[i_1]\,X[i_2]\,X[i_3]\,X[i_4]\right] =0. \end{aligned}$$This is true if \(X[1],X[2],\ldots ,X[p]\) are mean-zero four-wise independent random variables, or if the vector \((X[1],X[2],\ldots ,X[p])\) is unconditional (i.e. its law is preserved when each coordinate is multiplied by a sign). From this assumption we may obtain (3) with
$$\begin{aligned} \mathsf{h}:= 6\,\max \left\{ \frac{\sqrt{\mathbb {E}\left[ X[i]^4\right] }}{\mathbb {E}\left[ X[i]^2\right] }\,:\, i=1,2,\ldots ,p,\, \mathbb {E}\left[ X^2[i]\right] >0\right\} . \end{aligned}$$ -
2.
Assume now that some X satisfying (3) is replaced by \(AX+\mu \) for some linear map \(A\in \mathbb {R}^{p'\times p}\) and some \(\mu \in \mathbb {R}^{p'}\). The new vector still satisfies \(\mathsf{h}<+\infty \), although \(\mathsf{h}\) may change by a universal constant factor. Note that the matrix A may be singular and/or one may have \(p'>p\), in which case \(AX+\mu \) will have highly correlated components. Thus the assumption is satisfied whenever X “comes from a vector with uncorrelated entries”.
-
3.
The property \(\mathsf{h}<+\infty \) is also preserved when X is multiplied by an independent scalar \(\xi \), as long as \(\mathbb {E}\left[ \xi ^4\right] /\mathbb {E}\left[ \xi ^2\right] ^2\) is bounded by an absolute constant. As noted in [22], this is strictly weaker than what is needed for two-sided concentration as in (2).
-
4.
An assumption not covered by our theorem is that of bounded designs: \(\varSigma ^{-1/2}X_1/\sqrt{p}\) a.s. bounded. This is verified when the coordinates of X are orthonormal functions such as the Fourier basis over [0, 1] (in this case \(\varSigma =I_{p\times p}\)). We note that this bounded design case is optimally covered by Rudelson’s aforementioned bound [17, 20].
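The displayed assumption (3) did not survive extraction; consistent with the condition stated in Theorem 4.1 (with \(A_1=X_1X_1^T\)), it reads \(\mathbb {E}[(v^TX)^4]^{1/2}\le \mathsf{h}\,v^T\varSigma \,v\) for all v. A sketch (our own, not from the paper) checking this exactly for i.i.d. Uniform[-1, 1] entries, using the standard fourth-moment expansion for independent mean-zero coordinates and the value of \(\mathsf{h}\) from item 1 above:

```python
import numpy as np

rng = np.random.default_rng(1)
p = 20
m2, m4 = 1.0 / 3.0, 1.0 / 5.0   # E[x^2], E[x^4] for a Uniform[-1,1] entry
h = 6 * np.sqrt(m4) / m2        # h from item 1 above

def fourth_moment(v):
    """Exact E[(v^T X)^4] for independent mean-zero entries."""
    s2, s4 = np.sum(v**2), np.sum(v**4)
    return m4 * s4 + 3 * m2**2 * (s2**2 - s4)

checks = []
for _ in range(100):
    v = rng.standard_normal(p)
    v /= np.linalg.norm(v)
    lhs = np.sqrt(fourth_moment(v))      # E[(v^T X)^4]^{1/2}
    rhs = h * m2 * np.sum(v**2)          # h * v^T Sigma v, with Sigma = m2 * I
    checks.append(lhs <= rhs)
ok = all(checks)
```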
One further attraction of Theorem 1.1 is its proof method, which is based on a PAC-Bayesian argument. The main feature of this method is that it provides a way to control empirical processes via entropic inequalities, as opposed to usual chaining methods. Further details about this method are given in Sect. 3 below. Although our application of this method is indebted to previous work by Audibert/Catoni [4] and Langford/Shawe-Taylor [13], we believe that this technique has much greater potential than what has been explored so far in the literature.
Remark 1
(Recent developments in lower tails) Many developments on variants of Theorem 1.1 have appeared since the first version of this paper. Almost simultaneously with us, Koltchinskii and Mendelson [11] obtained analogues of Theorem 1.1 under the assumption of \(q>4\) moments on the one-dimensional marginals of \(X_1\). They also obtained results under our assumption (3), albeit with suboptimal dependence on p and \(\varepsilon \). Later, Yaskov [24, 25] obtained bounds under the assumption of uniform bounds for the weak \(L^q\) norms of one-dimensional marginals, where \(q\ge 2\) is arbitrary. For each value of \(q>2\), he obtains the optimal exponent \(\alpha _q>0\) so that \(n=\Theta (p/\varepsilon ^{\alpha _q})\) samples are necessary and sufficient for (4) (with \(\delta =e^{-p}\)). His theorem is thus stronger than Theorem 4.1 except possibly for the dependence of n on \(\delta \). However, the first part of [8, Theorem 3.1] by van de Geer and Muro achieves similar bounds as Yaskov, with the same dependence on \(\delta \) as our own Theorem 1.1. There has also been some related progress in checking lower- and upper-tail properties that are relevant in the \(p\gg n\) setting [9].
1.2 Application to ordinary least squares with random design
Theorem 1.1 will be illustrated with an application to random design linear regression when \(n\gg p\gg 1\). In this setting one is given data in the form of n i.i.d. copies \((X_i,Y_i)_{i=1}^n\) of a random pair \((X,Y)\in \mathbb {R}^p\times \mathbb {R}\), where X is a vector of covariates and Y is a response variable. The goal is to find a vector \(\widehat{\beta }_n\) that depends solely on the data so that the square loss
is as small as possible. This setting of random design should be contrasted with the technically simpler case of fixed design, where the \(X_i\) are non-random. Fixed design results are not informative about out-of-sample prediction, which is important in many routine applications of OLS e.g. in Statistical Learning and in Linear Aggregation.
We show below that the usual ordinary least squares (OLS) estimator
achieves error rates \(\approx \sigma ^2\,(p+\ln (1/\delta ))/n\) in the random design setting, where \(\sigma ^2\) measures the magnitude of “errors”. The formal theorem (modulo some definitions in Sect. 5.1) is as follows.
Theorem 1.2
(Proven in Sect. 5.2) Define \((X,Y),(X_1,Y_1),\ldots ,(X_n,Y_n)\) as above. Let \({\beta }_{\min }\) denote a minimizer of \(\ell \) and let \(\eta :=Y-{\beta }_{\min }^TX\). Define \(\varSigma :=\mathbb {E}\left[ XX^T\right] \), and let \(\varSigma ^{-1/2}\) be the Moore-Penrose pseudoinverse of \(\varSigma ^{1/2}\). Also set \(Z:=\eta \,\varSigma ^{-1/2}X\). Let \(\mathsf{h},\sigma ^2,\mathsf{h}_{*}>0\) and \(q>2\) and assume that, for all \(v\in \mathbb {R}^p\),
Then for any \(\varepsilon \in (0,1/2)\), there exists \(C>0\) depending only on \(\mathsf{h}_*,\varepsilon \) and q such that, when \(\delta \in (C/n^{q/2-1},1)\) and
then
So for \(n\gg p\gg 1\) the excess loss of OLS is bounded by \((1+o\left( 1\right) )\,\sigma ^2p/n\), with high probability. This can be shown to be tight; cf. the end of Sect. 5.1 for details. An important point is that Theorem 1.2 makes minimal assumptions on the data, and works in a completely model-free, non-parametric, heteroskedastic setting. Our moment assumptions are reasonable e.g. when those of Theorem 1.1 are reasonable and \(Z=\eta \,\varSigma ^{-1/2}X\) is not too far from isotropic. For instance, if the “noise” term \(\eta \) is independent from X, this property follows from suitable moment assumptions on the noise and on the one-dimensional marginals of X. Even when there is no independence, one only needs higher moment assumptions on X and \(\eta \) (thanks to Hölder’s inequality).
Theorem 1.2 extends the recent papers by Hsu et al. [10] and Audibert and Catoni [4]. Hsu et al. prove a variant of Theorem 1.2 where they assume an independent noise model with sub-Gaussian properties, as well as bounds on \(\varSigma ^{-1/2}X_i/\sqrt{p}\). Their bound does have the advantage of working up to much smaller values of \(\delta \). Audibert and Catoni obtain bounds for \(\delta \ge 1/n\), albeit with worse constants and only by assuming that \((v^TX_1)^2 \le B\,v^T\varSigma \,v\) almost surely for some \(B>0\). To the best of our knowledge, no excess loss bounds of optimal order were known under finite moment assumptions. We do note, however, that Theorem 1.2 is a simple consequence of our main result, Theorem 1.1, and a Fuk-Nagaev bound by Einmahl and Li [7, Theorem 4].
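The \(\sigma ^2 p/n\) rate is easy to see in simulation. The sketch below (ours, assuming the Gaussian linear model with \(\varSigma =I_p\), so the excess loss is \(|\widehat{\beta }_n-\beta _{\min }|_2^2\)) averages the excess loss of OLS over repeated draws:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, sigma = 2000, 20, 1.0
beta = rng.standard_normal(p)

excess = []
for _ in range(50):
    X = rng.standard_normal((n, p))               # Sigma = I_p
    Y = X @ beta + sigma * rng.standard_normal(n)
    beta_hat = np.linalg.lstsq(X, Y, rcond=None)[0]
    d = beta_hat - beta
    excess.append(d @ d)       # excess loss (bhat - b)^T Sigma (bhat - b)

mean_excess = np.mean(excess)
rate = sigma**2 * p / n        # the predicted sigma^2 p / n rate
```

The average excess loss lands close to \(\sigma ^2p/n=0.01\), matching the \((1+o(1))\,\sigma ^2p/n\) prediction.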
1.3 Organization
The remainder of the paper is organized as follows. Section 2 reviews some preliminaries and defines our notation. Section 3 discusses our PAC-Bayesian proof method, and Sect. 4 contains the proof of the sub-Gaussian lower tail (cf. Theorem 1.1). Some facts about OLS and our proof of Theorem 1.2 are presented in Sect. 5. The final section contains some further remarks and open problems.
2 Notation and preliminaries
The coordinates of a vector \(v\in \mathbb {R}^p\) are denoted by \(v[1],v[2],\ldots ,v[p]\). We denote the space of matrices with p rows, \(p'\) columns and real entries by \(\mathbb {R}^{p\times p'}\). A matrix A is symmetric if it equals its own transpose \(A^T\). Given \(A\in \mathbb {R}^{p\times p}\), we let \(\mathrm{tr}(A)\) denote the trace of A and \(\lambda _{\max }(A)\) denote its largest eigenvalue. Also, \(\mathrm{diag}(A)\) is the diagonal matrix whose diagonal entries match those of A. The \(p\times p\) identity matrix is denoted by \(I_{p\times p}\). We identify \(\mathbb {R}^p\) with the space of column vectors \(\mathbb {R}^{p\times 1}\), so that the standard Euclidean inner product of \(v,w\in \mathbb {R}^p\) is \(v^Tw\). The Euclidean norm is denoted by \(|v|_2:=\sqrt{v^Tv}\).
We say that \(A\in \mathbb {R}^{p\times p}\) is positive semidefinite, and write \(A\succeq 0\), if it is symmetric and \(v^TAv\ge 0\) for all \(v\in \mathbb {R}^p\). In this case one can easily show that
The \(2\rightarrow 2\) operator norm of \(A\in \mathbb {R}^{p\times p'}\) is
For symmetric \(A\in \mathbb {R}^{p\times p}\) this is the largest absolute value of its eigenvalues. Moreover, if A is positive semidefinite \(|A|_{2\rightarrow 2}=\lambda _{\max }(A)\) is the largest eigenvalue, and (when A is invertible)
Finally, we write \(A\succeq B\) if \(A-B\succeq 0\).
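The facts just listed can be checked numerically (a sketch of ours using numpy; the names are our own):

```python
import numpy as np

rng = np.random.default_rng(3)
p = 5
M = rng.standard_normal((p, p))
A = M @ M.T                     # A = M M^T is positive semidefinite
B = 0.5 * A                     # A - B = 0.5 * A >= 0, i.e. A >= B

min_eig_diff = np.linalg.eigvalsh(A - B)[0]   # A >= B iff this is >= 0

op_norm = np.linalg.norm(A, 2)         # 2 -> 2 operator norm
lam_max = np.linalg.eigvalsh(A)[-1]    # largest eigenvalue of psd A
```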
Throughout the paper we use big-oh and little-oh notation informally, mostly as shorthand. For instance, \(a=O\left( b\right) \) means that a is at most of the same order of magnitude as b, whereas \(a=o\left( b\right) \) or \(a\ll b\) means a is much smaller than b.
3 The PAC-Bayesian method
In this section we give an overview of the PAC-Bayesian method as applied to our problem. The actual proof of Theorem 1.1 is presented in Sect. 4 below.
At first sight it may seem odd that we can obtain strong concentration as in Theorem 1.1 from finite moment assumptions. The key point here is that, for any \(v\in \mathbb {R}^p\), the expression
is a sum of random variables which are independent, identically distributed and non negative. Such sums are well known to have sub-Gaussian lower tails under weak assumptions; this follows e.g. from Lemma A.1 below.
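The display from Lemma A.1 did not survive extraction; a standard bound of this type (used as formula (16) in Sect. 4.2 below) is \(\mathbb {E}[e^{-\xi W}]\le \exp (-\xi \,\mathbb {E}[W]+\xi ^2\,\mathbb {E}[W^2]/2)\) for nonnegative square-integrable W. It can be checked exactly for a discrete heavy-tailed W (our own sketch):

```python
import numpy as np

# Discrete heavy-tailed nonnegative W: P(W = k^2) proportional to k^(-4)
k = np.arange(1, 1001)
probs = k**-4.0
probs /= probs.sum()
vals = k**2.0

EW = probs @ vals          # E[W]
EW2 = probs @ vals**2      # E[W^2]

# Check E[exp(-xi*W)] <= exp(-xi*E[W] + xi^2*E[W^2]/2) for several xi
holds = all(
    probs @ np.exp(-xi * vals) <= np.exp(-xi * EW + xi**2 * EW2 / 2)
    for xi in (0.001, 0.01, 0.05)
)
```

Note that W only needs two moments here, not any exponential moment; this is the source of the asymmetry between upper and lower tails.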
This fact may be used to show concentration of \(v^T\widehat{\varSigma }_{n}\,v\) for any fixed \(v\in \mathbb {R}^p\). It is less obvious how to turn this into a uniform bound. The standard techniques for this, such as chaining, involve looking at a discretized subset of \(\mathbb {R}^p\) and moving from this finite set to the whole space. In our case this second step is problematic, because it requires upper bounds on \(v^T\widehat{\varSigma }_{n}\,v\), and we know that our assumptions are not strong enough to obtain this.
What we use instead is the so-called PAC-Bayesian method [6] for controlling empirical processes. At a very high level, this method replaces chaining and union bounds with arguments based on the relative entropy. What this means in our case is that a “smoothed-out” version of the process \(v^T\widehat{\varSigma }_{n}\,v\) (\(v\in \mathbb {R}^p\)), where v is averaged over a Gaussian measure, automatically enjoys very strong concentration properties. This implies that the original process is also well behaved as long as the effect of the smoothing can be shown to be negligible. Many of our ideas come from Audibert and Catoni [4], who in turn credit Langford and Shawe-Taylor [13] for the idea of Gaussian smoothing.
To make our ideas more definite, we present a technical result that encapsulates the main ideas in our PAC-Bayesian approach. This requires some conditions.
Assumption 1
\(\{Z_\theta :\,\theta \in \mathbb {R}^p\}\) is a family of random variables defined on a common probability space \((\varOmega ,\mathcal {F},\mathbb {P})\). We assume that the map
is continuous for each \(\omega \in \varOmega \). Given \(v\in \mathbb {R}^p\) and an invertible, positive semidefinite \(C\in \mathbb {R}^{p\times p}\), we let \(\Gamma _{v,C}\) denote the Gaussian probability measure over \(\mathbb {R}^p\) with mean v and covariance matrix C. We will also assume that for all \(\omega \in \varOmega \) the integrals
are well defined and depend continuously on v. We will use the notation \(\Gamma _{v,C}f_\theta \) to denote the integral of \(f_\theta \) (which may also depend on other parameters) over the variable \(\theta \) with the measure \(\Gamma _{v,C}\).
Proposition 3.1
(PAC-Bayesian Proposition) Assume the above setup, and also that C is invertible and \(\mathbb {E}\left[ e^{Z_\theta }\right] \le 1\) for all \(\theta \in \mathbb {R}^p\). Then for any \(t\ge 0\),
In the next subsection we will apply this to prove Theorem 4.1. Here is a brief overview: we will perform a change of coordinates under which \(\varSigma =I_{p\times p}\). We will then define \(Z_\theta \) as
where \(\xi >0\) will be chosen in terms of t and the “other terms” will ensure that \(\mathbb {E}\left[ e^{Z_\theta }\right] \le 1\). Taking \(C=\gamma \,I_{p\times p}\) will result in
where
is a new term introduced by the“smoothing operator” \(\Gamma _{v,\gamma C}\). The choice \(\gamma =1/p\) will ensure that this term is small, and the “other terms” will also turn out to be manageable. The actual proof will be slightly complicated by the fact that we need to truncate the operator \(\widehat{\varSigma }_{n}\) to ensure that \(S_v\) is highly concentrated.
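The effect of the smoothing operator can be made concrete. For a psd matrix B, averaging the quadratic form over \(\theta \sim \Gamma _{v,\gamma I}\) gives exactly \(v^TBv+\gamma \,\mathrm{tr}(B)\); the Monte Carlo sketch below (ours, with an arbitrary psd stand-in for \(B_1^R\)) confirms this:

```python
import numpy as np

rng = np.random.default_rng(4)
p = 4
gamma = 1.0 / p                      # the choice gamma = 1/p from the text
M = rng.standard_normal((p, p))
B = M @ M.T                          # a psd stand-in for B_1^R
v = rng.standard_normal(p)

# closed form: averaging theta^T B theta over N(v, gamma*I)
# adds exactly the smoothing term gamma * tr(B)
exact = v @ B @ v + gamma * np.trace(B)

theta = v + np.sqrt(gamma) * rng.standard_normal((1_000_000, p))
mc = np.mean(np.sum((theta @ B) * theta, axis=1))
```

With \(\gamma =1/p\) the extra term is \(\mathrm{tr}(B)/p\), which is of the same order as a single eigenvalue; this is why the smoothing is negligible.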
Proof
As a preliminary step, we note that under our assumptions the map:
is measurable, since (by continuity) we may take the supremum over \(v\in \mathbb {Q}^p\), which is a countable set. In particular, the event in the statement of the proposition is indeed a measurable set.
To continue, recall the definition of Kullback-Leibler divergence (or relative entropy) for probability measures over a measurable space \((\Theta ,\mathcal {G})\):
A variational principle [14, eqn. (5.13)] implies that for any measurable function \(h:\Theta \rightarrow \mathbb {R}\):
We apply this when \(\Theta =\mathbb {R}^p\) with \(\mathcal {G}\) equal to the Borel \(\sigma \)-field \(\mathcal {B}(\mathbb {R}^p)\), \(\mu _1=\Gamma _{v,C}\), \(\mu _0=\Gamma _{0,C}\) and \(h=Z_\theta \). In this case it is well-known that the relative entropy of the two measures is \(|C^{-1/2}v|_2^2/2\) [19, Appendix A.5]. This implies:
To finish, we prove that:
But this follows from Markov’s inequality and Fubini’s Theorem:
because \(\mathbb {E}\left[ e^{Z_\theta }\right] \le 1\) for any fixed \(\theta \). \(\square \)
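The closed form \(\mathrm{KL}(\Gamma _{v,C}\Vert \Gamma _{0,C})=|C^{-1/2}v|_2^2/2\) used in the proof can be verified by Monte Carlo (a sketch of ours; `log_density` is our own helper):

```python
import numpy as np

rng = np.random.default_rng(5)
p = 3
M = rng.standard_normal((p, p))
C = M @ M.T + np.eye(p)              # an invertible covariance
Cinv = np.linalg.inv(C)
v = rng.standard_normal(p)

def log_density(theta, mean):
    """Log-density of N(mean, C) at the rows of theta."""
    d = theta - mean
    _, logdet = np.linalg.slogdet(2 * np.pi * C)
    return -0.5 * np.sum((d @ Cinv) * d, axis=1) - 0.5 * logdet

# Monte Carlo estimate of KL(Gamma_{v,C} || Gamma_{0,C})
theta = rng.multivariate_normal(v, C, size=500_000)
kl_mc = np.mean(log_density(theta, v) - log_density(theta, np.zeros(p)))
kl_exact = 0.5 * v @ Cinv @ v        # = |C^{-1/2} v|_2^2 / 2
```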
4 The sub-Gaussian lower tail
The goal of this section is to discuss and prove the following slight generalization of Theorem 1.1.
Theorem 4.1
Assume \(A_1,\dots ,A_n\in \mathbb {R}^{p\times p}\) are i.i.d. random self-adjoint, positive semidefinite matrices whose coordinates have bounded second moments. Define \(\varSigma := \mathbb {E}\left[ A_1\right] \) (this is an entrywise expectation) and
Assume \(\mathsf{h}\in (1,+\infty )\) satisfies \(\mathbb {E}\left[ (v^TA_1\,v)^2\right] ^{1/2}\le \mathsf{h}\,v^T\,\varSigma v\) for all \(v\in \mathbb {R}^p\). Then for any \(\delta \in (0,1)\):
Theorem 1.1 is recovered when we set \(A_i=X_iX_i^T\), check that the moment assumption on \(v^TA_1\,v\) translates into (3), and note that
Theorem 4.1 is proved in several steps over the next subsections.
4.1 Preliminaries: normalization and truncation
We first note that we may assume that \(\varSigma \) is invertible. Indeed, if that is not the case, we can restrict ourselves to the range of \(\varSigma \), which is isometric to \(\mathbb {R}^{p'}\) for some \(p'\le p\), noting that \(A_iv=0\) and \(v^TA_i=0\) almost surely for any v that is orthogonal to the range (this follows from \(\mathbb {E}\left[ v^TA_1v\right] =0\) for v orthogonal to the range, combined with (8) above).
Granted invertibility, we may define:
and note that \(B_1,\dots ,B_n\) are i.i.d. positive semidefinite with \(\mathbb {E}\left[ B_1\right] =I_{p\times p}\). Moreover,
Define
Our goal is to show that the following holds with probability \(\ge 1-2e^{-t}\):
Notice that, by homogeneity, it suffices to consider vectors of the form \(w=\varSigma ^{-1/2}\,v\) with \(|v|_2=1\). Thus our goal may be restated as follows.
We will make yet another change to our goal. Fix some \(R>0\) and define (with hindsight) truncated operators
with the convention that this is simply 0 if \(\mathrm{tr}(B_i)=0\). We collect some estimates for later use.
Lemma 4.1
We have for all \(v\in \mathbb {R}^p\) with unit norm
Proof
The first assertion follows from the fact that the \(B_i\) are positive semidefinite, so \(v^T\,B_iv\ge 0\) and \(v^T\,B^R_iv = \alpha _i\,v^TB_i\,v\) for each i, with \(\alpha _i\in [0,1]\) a scalar. This same reasoning implies \(B_i^R\preceq B_i\), a fact that we will use below.
To prove the second assertion, we let \(e_1,\ldots ,e_p\) denote the canonical basis of \(\mathbb {R}^p\), and apply Minkowski’s inequality:
To prove the third assertion, we fix some \(v\in \mathbb {R}^p\) with \(|v|_2=1\). We use again that \(v^TB_i\,v\ge 0\) to deduce
We may bound the RHS via Cauchy-Schwarz, noting that \(\mathbb {E}\left[ (v^TB_i\,v)^2\right] \le \mathsf{h}^2\) by (13) and \(\Pr {\mathrm{tr}(B_i)>R}\le \mathbb {E}\left[ \mathrm{tr}(B_i)^2\right] /R^2 = (\mathsf{h}\,p/R)^2\). This gives:
\(\square \)
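The displayed definition of the truncated operators did not survive extraction; a reading consistent with the proof of Lemma 4.1 (the scalar \(\alpha _i\in [0,1]\) and the convention for \(\mathrm{tr}(B_i)=0\)) is \(B_i^R:=\min (1,R/\mathrm{tr}(B_i))\,B_i\). A sketch under that assumption, checking the first assertion and \(\mathrm{tr}(B_i^R)\le R\):

```python
import numpy as np

rng = np.random.default_rng(6)
p, R = 6, 4.0

def truncate(B, R):
    """Hypothetical B^R = min(1, R/tr(B)) * B, and 0 when tr(B) = 0."""
    t = float(np.trace(B))
    return np.zeros_like(B) if t == 0 else min(1.0, R / t) * B

M = rng.standard_normal((p, p))
B = M @ M.T                          # psd with tr(B) typically well above R
BR = truncate(B, R)
v = rng.standard_normal(p)
quad_B, quad_BR = v @ B @ v, v @ BR @ v
```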
4.2 Applying the PAC-Bayesian method
We continue to use our definitions of \(B_1,\ldots ,B_n\) and \(B^R_1,\ldots ,B_n^R\), with the goal of proving (15). The parameters t and \(\varepsilon \) are as in (14). We also fix \(\xi >0\). We intend to apply Proposition 3.1 with \(C=I_{p\times p}/p\) and
Let us check that the assumptions of the proposition are satisfied. First note that \(Z_\theta \) is a quadratic form in \(\theta \), and is therefore a.s. continuous as a function of \(\theta \). The same argument combined with the square integrability of the normal distribution shows that \(\Gamma _{v,C}\,Z_\theta \) is continuous in \(v\in \mathbb {R}^p\). The inequality \(\mathbb {E}\left[ e^{Z_\theta }\right] \le 1\) follows from independence, which implies
plus the fact that, for any non-negative, square-integrable random variable W and any \(\xi >0\):
(this is shown in Lemma A.1 in the Appendix). Therefore all assumptions of Proposition 3.1 are satisfied, and we may deduce from that result that, with probability \(\ge 1-e^{-t}\), for all \(v\in \mathbb {R}^p\):
This is the same as saying that, with probability \(\ge 1-e^{-t}\), the following inequality holds for all \(v\in \mathbb {R}^p\) with \(|v|_2=1\):
4.3 Dealing with the terms
The next step in the proof is to control all the terms involving \(\Gamma _{v,C}\) that appear in (17). For \(v\in \mathbb {R}^p\) with \(|v|_2=1\), explicit calculations reveal
We also need estimates for \(\Gamma _{v,C}\mathbb {E}\left[ ( {\theta }^T{B_1^R\theta })^2\right] \). Standard calculations with the normal distribution show that:
That is, instead of averaging \(\theta \) over \(\Gamma _{v,C}\), we may replace \(\theta \) by \(v+\theta \) and then average over \(\Gamma _{0,C}\).
We now consider the RHS of (20). The first two terms inside the brackets in the RHS are non-negative. By Cauchy-Schwarz and the AM/GM inequality, the third term satisfies
We deduce that
and plugging this into (20) gives
We now compute the term in (23) as follows. First of all, since \(C=I_{p\times p}/p\),
where \(\lambda _1,\ldots ,\lambda _p\ge 0\) are the eigenvalues of \(B_1^R\) and \(N_1,\ldots ,N_p\) are independent standard Gaussian random variables. We note that \(\mathbb {E}\left[ N_i^2N_j^2\right] \le \mathbb {E}\left[ N_i^4\right] =3\) for all \(1\le i,j\le p\) and that the eigenvalues of \(B_1^R\) are all real and nonnegative (since \(B_1^R\succeq 0\)), therefore
We combine this with (21), (22), and (23), then apply Lemma 4.1 and recall \(|v|_2=1\) to obtain:
We plug this last estimate into (17) together with (19) and (18). This results in the following inequality, which holds with probability \(\ge 1-e^{-t}\) simultaneously for all \(v\in \mathbb {R}^p\) with \(|v|_2=1\):
This holds for any choice of \(\xi \). Optimizing over this parameter shows that, with probability \(\ge 1-e^{-t}\), we have the following inequality simultaneously for all \(v\in \mathbb {R}^p\) with \(|v|_2=1\).
4.4 The final step: control of the trace
We now take care of the term involving the traces on the RHS. This is precisely the moment when the truncation of \(B_i\) is useful, as it allows for the use of Bernstein’s concentration inequality [23, Sect. 2.6]. This inequality states that if \(Z_1,\dots ,Z_n\) are independent random variables with \(\mathbb {E}\left[ Z_i\right] =0\), \(\sum _{i=1}^n\mathbb {E}\left[ Z_i^2\right] =\sigma ^2\) and \(|Z_i|\le M\) for each \(1\le i\le n \) (\(M>0\) a constant), then
The term involving traces in (25) is a sum of i.i.d. mean-zero random variables that (because of the truncation) lie between \(-R/(pn)\) and \(R/(pn)\). Moreover, the variance of each term is at most \(\mathbb {E}\left[ \mathrm{tr}(B_i^R)^2\right] /p^2 n^2\le \mathsf{h}^2/n^2\) by Lemma 4.1. We deduce:
Combining this with (25) implies that, for any \(t\ge 0\), the following inequality holds with probability \(\ge 1-2e^{-t}\), simultaneously for all \(v\in \mathbb {R}^p\) with \(|v|_2=1\):
This holds for any \(R>0\). Optimization over R gives
so, with the right choice of R,
according to the definition of \(\varepsilon \) in (14). We obtain
Inequality (15) follows. As noted in Sect. 4.1, (15) implies Theorem 4.1 and finishes the proof.
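The displayed Bernstein bound did not survive extraction; its standard two-sided form is \(\Pr (|\sum _i Z_i|\ge t)\le 2\exp \left( -t^2/\bigl (2(\sigma ^2+Mt/3)\bigr )\right) \). A Monte Carlo sketch (ours, for bounded uniform summands) comparing the empirical tail against that bound:

```python
import numpy as np

rng = np.random.default_rng(7)
n, trials, t, M = 100, 20_000, 15.0, 1.0
sigma2 = n / 3.0                # sum of variances; Var(Uniform[-1,1]) = 1/3

Z = rng.uniform(-1.0, 1.0, size=(trials, n))
S = Z.sum(axis=1)
empirical = np.mean(np.abs(S) >= t)
bernstein = 2 * np.exp(-t**2 / (2 * (sigma2 + M * t / 3)))
```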
5 Applications in random-design linear regression
The main goal of this section is to prove Theorem 1.2.
5.1 Preliminaries
We begin by recalling the general facts about this problem. We assume \((X,Y)\in \mathbb {R}^p\times \mathbb {R}\) is a random pair, with \(X\in \mathbb {R}^p\) a vector of covariates and \(Y\in \mathbb {R}\) a response variable. We assume \(\mathbb {E}\left[ |X|_2^2\right] <+\infty \) and \(\mathbb {E}\left[ Y^2\right] < +\infty \). As in the introduction, we define the square loss function:
It is not hard to show that \(\ell \) has at least one minimizer \({\beta }_{\min }\in \mathbb {R}^p\), defined so that \({\beta }_{\min }^TX\) equals the \(L^2\) projection of Y onto the linear space generated by the coordinates of X. In fact, this property uniquely defines the random variable \({\beta }_{\min }^TX\), if not necessarily the vector \({\beta }_{\min }\). It also implies
In fact, \({\beta }_{\min }\) is a minimizer of \(\ell \) if and only if (28) holds. Another calculation shows that
where \(\varSigma :=\mathbb {E}\left[ XX^T\right] \). In particular, \({\beta }_{\min }\) is the unique minimizer of \(\ell \) if and only if \(\varSigma \) is non-singular.
Our main interest is in the OLS estimator, which satisfies
If \(\widehat{\varSigma }_{n}:=n^{-1}\sum _{i=1}^nX_iX_i^T\) is invertible, \(\widehat{\beta }_n\) is uniquely defined by the formula:
If \(\widehat{\varSigma }_{n}\) is not invertible, we may still define \(\widehat{\beta }_n\) by (30) if we let \(\widehat{\varSigma }_{n}^{-1}\) denote the Moore-Penrose pseudoinverse of \(\widehat{\varSigma }_{n}\). This definition will be used implicitly below.
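The definition in (30), with the Moore-Penrose pseudoinverse standing in for \(\widehat{\varSigma }_{n}^{-1}\), can be sketched as follows (our own code; in the noiseless full-rank case OLS recovers \(\beta _{\min }\) exactly):

```python
import numpy as np

rng = np.random.default_rng(8)
n, p = 12, 5
X = rng.standard_normal((n, p))
beta = rng.standard_normal(p)
Y = X @ beta                            # noiseless responses

Sigma_n = X.T @ X / n                   # empirical second-moment matrix
b = X.T @ Y / n                         # n^{-1} * sum_i X_i Y_i
beta_hat = np.linalg.pinv(Sigma_n) @ b  # Moore-Penrose covers singular case
```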
An interesting test case for our result is that of a linear model with Gaussian noise and Gaussian design, where we assume that X is mean-zero Gaussian with covariance matrix \(\varSigma \) and \(\eta \) is mean zero Gaussian with variance \(\sigma ^2\) and independent of X. Using the notation of Theorem 1.2, we see that \(\mathbb {E}\left[ \eta ^2\right] =\sigma ^2\) is the variance of the noise, and \(\mathsf{h}_*\) does not depend on n or p. An explicit calculation (which we omit) implies that, for \(n\gg p\gg 1\),
Theorem 1.2 guarantees that OLS achieves this error rate under much weaker assumptions on the distribution of (X, Y).
5.2 Proof of Theorem 1.2
Proof
We will assume for convenience that \(\varSigma \) is invertible; the general case requires minor modifications. For each i, define
We note for later use that the \(Z_i\) are independent copies of the vector Z in the statement of Theorem 1.2.
The assumptions on X of Theorem 1.2 imply those of Theorem 1.1 with \(\varepsilon \) replaced by \(\varepsilon /10\) and \(\delta \) replaced by \(\delta /2\) (at least when C is a large enough constant). This implies that the event
satisfies \(\Pr {\mathsf{(Lower)}}\ge 1-\delta /2\) whenever the condition on n in Theorem 1.2 is satisfied. When \(\mathsf{Lower}\) holds, \(\widehat{\varSigma }_{n}\) is also invertible, so
Comparing this with the definition of \(\widehat{\beta }_n\) in (30), we see that:
Therefore, when \(\mathsf{Lower}\) holds, the excess loss \(\ell (\widehat{\beta }_n)-\ell ({\beta }_{\min })\) satisfies
since Lower and (9) imply that the \(2\rightarrow 2\) operator norm of \(\varSigma ^{1/2}\widehat{\varSigma }_{n}^{-1}\,\varSigma ^{1/2}\) is at most \(1/(1-\varepsilon /10)\).
What we have discussed so far shows that (34) holds with probability \(\ge 1-\delta /2\). We now show that
noting that this finishes the proof with some room to spare regarding the dependency on \(\varepsilon \). To do this, we use the Fuk-Nagaev-type inequality by Einmahl and Li in [7, Theorem 4]. In the Euclidean setting, that result implies that if \(U_1,\dots ,U_n\) are i.i.d. mean zero random vectors with \(\varLambda _n:=n\lambda _{\max }(\mathbb {E}\left[ U_1U_1^T\right] )\), then for any \(q>2\), \(\alpha ,\phi \in (0,1)\) one can find \(D>0\) such that, for any \(t>0\),
To obtain (35), we apply the previous display with \(U_i=Z_i/n\), \(\phi =\varepsilon /10\), \(\alpha =1\), and
We observe that \(\varLambda _n\le \sigma ^2/n\) and \(\mathbb {E}\left[ |U_1|^q_2/t^q\right] \le (100\mathsf{h}^2_*)^{q/2}/(\varepsilon \,n)^{q/2}\) under our assumptions. We deduce
This clearly implies (35) after suitably adjusting the constants, at least in the desired range \(\delta \ge C/n^{q/2-1}\). \(\square \)
6 Final remarks
-
The PAC-Bayesian method used in this paper seems an efficient alternative to chaining and other typical empirical processes methods. As such, it would be interesting to find other applications of it. One interesting question is if some variant of the method can be used to prove two-sided concentration of \(\widehat{\varSigma }_{n}\).
-
Consider the setting of Theorem 4.1. Let \(\mathbb {R}^p_s\) denote the set of all \(v\in \mathbb {R}^p\) that are s-sparse, i.e. have at most s nonzero coordinates. \(\mathbb {R}^p_s\) is a union of \(\left( {\begin{array}{c}p\\ s\end{array}}\right) \le (ep/s)^s\) s-dimensional spaces, so if
$$\begin{aligned} n\ge 100\mathsf{h}^2\,\frac{s + s\ln (ep/s) + 2\ln (2/\delta )}{\varepsilon ^2} \end{aligned}$$one may apply Theorem 4.1 to these subspaces and deduce that \(v^T\widehat{\varSigma }_{n}\,v\ge (1-\varepsilon )\,v^T\varSigma \,v\) for all \(v\in \mathbb {R}^p_s\), with probability \(\ge 1-\delta \). In a companion paper [16] we show that this result can be used to prove that \(\widehat{\varSigma }_{n}\) satisfies restricted eigenvalue properties when \(p\gg n\). This is relevant to Compressed Sensing and High-Dimensional Statistics.
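The sample size in the display grows like \(s\ln (ep/s)\), far below the ambient dimension; a small arithmetic sketch (ours) of the formula:

```python
import numpy as np

def n_sparse(h, s, p, eps, delta):
    """Sufficient sample size from the display above."""
    return 100 * h**2 * (s + s * np.log(np.e * p / s)
                         + 2 * np.log(2 / delta)) / eps**2

p = 1_000_000
n_needed = n_sparse(h=2.0, s=10, p=p, eps=0.5, delta=1e-3)
```

Here `n_needed` is a few hundred thousand while \(p=10^6\), so the s-sparse lower bound holds in a regime where \(\widehat{\varSigma }_{n}\) itself is badly rank-deficient.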
References
Adamczak, R., Litvak, A.E., Pajor, A., Tomczak-Jaegermann, N.: Quantitative estimates of the convergence of the empirical covariance matrix in log-concave ensembles. J. Am. Math. Soc. 23, 535–561 (2010). doi:10.1090/S0894-0347-09-00650-X
Adamczak, R., Litvak, A.E., Pajor, A., Tomczak-Jaegermann, N.: Restricted isometry property of matrices with independent columns and neighborly polytopes by random sampling. Constr. Approx. 34(1), 61–88 (2011). doi:10.1007/s00365-010-9117-4
Adamczak, R., Litvak, A.E., Pajor, A., Tomczak-Jaegermann, N.: Sharp bounds on the rate of convergence of the empirical covariance matrix. Comptes Rendus Mathematique 349(3–4), 195–200 (2011). doi:10.1016/j.crma.2010.12.014. http://www.sciencedirect.com/science/article/pii/S1631073X10003936
Audibert, J.Y., Catoni, O.: Robust linear least squares regression. Ann. Stat. 39(5), 2766–2794 (2011). doi:10.1214/11-AOS918SUPP
Bai, Z.D., Yin, Y.Q.: Limit of the smallest eigenvalue of a large dimensional sample covariance matrix. Ann. Probab. 21(3), 1275–1294 (1993)
Catoni, O.: PAC-Bayesian supervised classification (The thermodynamics of statistical learning). Institute of Mathematical Statistics (2007)
Einmahl, U., Li, D.: Characterization of LIL behavior in Banach space. Trans. Am. Math. Soc. 360, 6677–6693 (2008)
van de Geer, S., Muro, A.: On higher order isotropy conditions and lower bounds for sparse quadratic forms. Electron. J. Stat. 8(2), 3031–3061 (2014). doi:10.1214/15-EJS983
Guédon, O., Litvak, A.E., Pajor, A., Tomczak-Jaegermann, N.: On the interval of fluctuation of the singular values of random matrices. arXiv:1509.02322
Hsu, D., Kakade, S.M., Zhang, T.: Random design analysis of ridge regression. J. Mach. Learn. Res. Proc. Track 23, 9.1–9.24 (2012)
Koltchinskii, V., Mendelson, S.: Bounding the smallest singular value of a random matrix without concentration. International Mathematics Research Notices (2015). doi:10.1093/imrn/rnv096. http://imrn.oxfordjournals.org/content/early/2015/03/31/imrn.rnv096.abstract
Tikhomirov, K.: Sample covariance matrices of heavy-tailed distributions. arXiv:1606.03557
Langford, J., Shawe-Taylor, J.: PAC-Bayes & margins. In: Becker, S., Thrun, S., Obermayer, K. (eds.) NIPS, pp. 423–430. MIT Press (2002)
Ledoux, M.: The Concentration of Measure Phenomenon. American Mathematical Society (2001)
Mendelson, S., Paouris, G.: On the singular values of random matrices. J. Eur. Math. Soc. 16(4), 823–834 (2014)
Oliveira, R.I.: A simple method for lower bounding sparse quadratic forms. In preparation
Oliveira, R.I.: Sums of random Hermitian matrices and an inequality by Rudelson. Electron. Commun. Probab. 15, 203–212 (2010)
Raskutti, G., Wainwright, M.J., Yu, B.: Restricted eigenvalue properties for correlated Gaussian designs. J. Mach. Learn. Res. 11, 2241–2259 (2010)
Rasmussen, C.E., Williams, C.K.I.: Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning). The MIT Press, Cambridge (2005)
Rudelson, M.: Random vectors in the isotropic position. J. Funct. Anal. 164(1), 60–72 (1999)
Rudelson, M., Zhou, S.: Reconstruction from anisotropic random measurements. IEEE Trans. Inf. Theory 59(6), 3434–3447 (2013). doi:10.1109/TIT.2013.2243201
Srivastava, N., Vershynin, R.: Covariance estimation for distributions with 2+epsilon moments. Ann. Probab. 41(5), 3081–3111 (2013)
Boucheron, S., Lugosi, G., Massart, P.: Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, Oxford (2013)
Yaskov, P.: Lower bounds on the smallest eigenvalue of a sample covariance matrix. Electron. Commun. Probab. 19, no. 83, 1–10 (2014). doi:10.1214/ECP.v19-3807. http://ecp.ejpecp.org/article/view/3807
Yaskov, P.: Sharp lower bounds on the least singular value of a random matrix without the fourth moment condition. Electron. Commun. Probab. 20, no. 44, 1–9 (2015). doi:10.1214/ECP.v20-4089. http://ecp.ejpecp.org/article/view/4089
Acknowledgments
We thank the anonymous referees for their careful reading of the original manuscript, their many corrections and suggestions on the exposition. We thank them in particular for suggesting the use of the Fuk-Nagaev inequality from [7] in the analysis of OLS.
Supported by a Bolsa de Produtividade em Pesquisa from CNPq, Brazil. This article was produced as part of the activities of the FAPESP Center for Neuromathematics (Grant #2013/07699-0, S. Paulo Research Foundation).
A Appendix: a moment generating function bound for non-negative random variables
Lemma A.1
Let W be a nonnegative random variable with finite second moment. Then \(\forall \xi >0\), \(\mathbb {E}\left[ e^{-\xi W}\right] \le e^{-\xi \,\mathbb {E}\left[ W\right] + \frac{\xi ^2}{2}\,\mathbb {E}\left[ W^2\right] }\).
Proof
This follows from the fact that
$$\begin{aligned} e^{-x}\le 1 - x + \frac{x^2}{2}\quad \text {for all } x\ge 0, \end{aligned}$$applied to \(x=\xi \,W\). Taking expectations of the resulting inequality gives
$$\begin{aligned} \mathbb {E}\left[ e^{-\xi W}\right] \le 1 - \xi \,\mathbb {E}\left[ W\right] + \frac{\xi ^2}{2}\,\mathbb {E}\left[ W^2\right] . \end{aligned}$$The result follows once we apply “\(1+y\le e^y\)”, valid for all \(y\in \mathbb {R}\), to \(y:=-\xi \,\mathbb {E}\left[ W\right] + \xi ^2\,\mathbb {E}\left[ W^2\right] /2\). \(\square \)
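Lemma A.1 can be checked numerically on any simple distribution. The sketch below (ours, not from the paper) verifies the bound exactly for a two-point distribution, for which both sides can be computed in closed form:

```python
import math

# Two-point distribution: W = 0 or W = 2, each with probability 1/2,
# so E[W] = 1 and E[W^2] = 2.
EW, EW2 = 1.0, 2.0

def lhs(xi):
    # E[exp(-xi W)], computed exactly for this distribution.
    return 0.5 * (1.0 + math.exp(-2.0 * xi))

def rhs(xi):
    # The bound of Lemma A.1: exp(-xi E[W] + (xi^2 / 2) E[W^2]).
    return math.exp(-xi * EW + 0.5 * xi * xi * EW2)

# The inequality holds for every xi > 0, and is nearly tight as xi -> 0,
# where both sides are 1 - xi E[W] + O(xi^2).
for xi in (0.01, 0.1, 0.5, 1.0, 5.0):
    assert lhs(xi) <= rhs(xi)
```

The near-tightness at small \(\xi \) reflects the fact that both sides share the same first-order expansion in \(\xi \), which is why the lemma loses so little in the lower-tail argument.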
Oliveira, R.I. The lower tail of random quadratic forms with applications to ordinary least squares. Probab. Theory Relat. Fields 166, 1175–1194 (2016). https://doi.org/10.1007/s00440-016-0738-9