1 Introduction

Let \({\mathbb {N}}=\{1,2,\ldots \}\) and let \({(X_n)}_{n\in {\mathbb {N}}}\) be a sequence of random vectors where for each \(n\in {\mathbb {N}}\) the random vector \(X_n\) takes values in \({\mathbb {R}}^n\), is centered, with unit covariance (isotropy):

$$\begin{aligned} {\mathbb {E}}(X_n)=0\quad \text {and}\quad {\mathbb {E}}(X_n\otimes X_n)={\mathrm {I}}_n \end{aligned}$$
(1.1)

where \({\mathrm {I}}_n\) is the \(n\times n\) identity matrix. Let \((m_n)_{n\in {\mathbb {N}}}\) be a sequence in \({\mathbb {N}}\) such that

$$\begin{aligned} 0<\underline{\rho }:=\liminf _{n\rightarrow \infty } \frac{n}{m_n} \le \limsup _{n\rightarrow \infty } \frac{n}{m_n}=:{\overline{\rho }}<\infty . \end{aligned}$$
(1.2)

For every \(n\in {\mathbb {N}}\), let \(X_n^{(1)},\ldots ,X_n^{(m_n)}\) be i.i.d. copies of \(X_n\). Their empirical covariance matrix is the \(n\times n\) symmetric positive semidefinite random matrix

$$\begin{aligned} {\widehat{\Sigma }}_n :=\frac{1}{m_n}\sum _{k=1}^{m_n} X_n^{(k)}\otimes X_n^{(k)}. \end{aligned}$$
(1.3)

If \({\mathbb {X}}_n\) denotes the \(m_n\times n\) rectangular random matrix with i.i.d. rows \(X_n^{(1)},\ldots ,X_n^{(m_n)}\) then

$$\begin{aligned} {\widehat{\Sigma }}_n=\frac{1}{m_n}{\mathbb {X}}_n^\top {\mathbb {X}}_n. \end{aligned}$$

Note that \({\mathbb {E}}{\widehat{\Sigma }}_n={\mathbb {E}}(X_n\otimes X_n)={\mathrm {I}}_n\). For convenience we define the random matrix

$$\begin{aligned} A_n:=m_n{\widehat{\Sigma }}_n ={\mathbb {X}}_n^\top {\mathbb {X}}_n =\sum _{k=1}^{m_n}X_n^{(k)}\otimes X_n^{(k)}. \end{aligned}$$
(1.4)

The eigenvalues of \(A_n\) are squares of the singular values of \({\mathbb {X}}_n\), and in particular

$$\begin{aligned} \lambda _{\max }(A_n) =s_{\max }({\mathbb {X}}_n)^2 =\max _{\Vert x\Vert =1}\Vert {\mathbb {X}}_nx\Vert ^2 =\Vert {\mathbb {X}}_n\Vert ^2. \end{aligned}$$

When \(m_n\ge n\) then the smallest eigenvalue of \(A_n\) satisfies

$$\begin{aligned} \lambda _{\min }(A_n) =s_{\min }({\mathbb {X}}_n)^2 =\min _{\Vert x\Vert =1}\Vert {\mathbb {X}}_n x\Vert ^2 =\Vert {\mathbb {X}}_n^{-1}\Vert ^{-2}, \end{aligned}$$

where by \({\mathbb {X}}_n^{-1}\) we denote the inverse linear operator for \({\mathbb {X}}_n\) defined on the image of \({\mathbb {X}}_n\). Of course, the last formula holds only when \({\mathbb {X}}_n\) is invertible (impossible if \(m_n<n\)). Above and in the sequel we denote by \(\Vert x\Vert =(x_1^2+\cdots +x_n^2)^{1/2}\) the Euclidean norm of \(x\in {\mathbb {R}}^n\).
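
For readers who wish to experiment numerically, the following short sketch (in Python with NumPy; the Gaussian entries, the dimensions and the seed are illustrative assumptions, not part of the setting above) builds \({\mathbb {X}}_n\), \({\widehat{\Sigma }}_n\) and \(A_n\) and checks the identities between the extremal eigenvalues of \(A_n\) and the singular values of \({\mathbb {X}}_n\).

```python
# Minimal numerical sketch of (1.3)-(1.4); the Gaussian entries and the sizes
# n, m are illustrative assumptions only.
import numpy as np

rng = np.random.default_rng(0)
n, m = 200, 600                              # dimension and sample size, m >= n here
X = rng.standard_normal((m, n))              # rows play the role of X_n^(1), ..., X_n^(m)
Sigma_hat = X.T @ X / m                      # empirical covariance matrix (1.3)
A = m * Sigma_hat                            # A_n = X^T X, see (1.4)

eigvals = np.linalg.eigvalsh(A)              # ascending eigenvalues of A_n
svals = np.linalg.svd(X, compute_uv=False)   # descending singular values of X_n

print(np.isclose(eigvals[-1], svals[0] ** 2))   # lambda_max(A_n) = s_max(X_n)^2
print(np.isclose(eigvals[0], svals[-1] ** 2))   # lambda_min(A_n) = s_min(X_n)^2 since m >= n
```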

Let us denote by \({(X_{n,k})}_{1\le k\le n}\) the components of the random vector \(X_n\). If the random variables \({(X_{n,k})}_{n\ge 1,1\le k\le n}\) are i.i.d. standard Gaussians, then the law of the random matrix \({\widehat{\Sigma }}_n\) is known as the real Wishart law, and constitutes a sort of matrix version of the \(\chi ^2(n)\) law. The law of the eigenvalues of \({\widehat{\Sigma }}_n\) is then called the Laguerre Orthogonal Ensemble, a Boltzmann–Gibbs measure with density on \(\{\lambda \in [0,\infty )^n:\lambda _1\ge \cdots \ge \lambda _n\}\) proportional to

$$\begin{aligned} \lambda \mapsto \exp \left( -\frac{m_n}{2}\sum _{k=1}^n\lambda _k +\frac{m_n-n+1}{2}\sum _{k=1}^n\log (\lambda _k) +\sum _{i<j}\log {\left| \lambda _i-\lambda _j\right| }\right) , \end{aligned}$$

see [38, (7.4.2) p. 192]. This exactly solvable Gaussian model allows one to deduce sharp asymptotics for the empirical spectral distribution as well as for the extremal eigenvalues of \({\widehat{\Sigma }}_n\). The famous Marchenko–Pastur theorem [35, 38] states that if

$$\begin{aligned} \frac{n}{m_n}\underset{n\rightarrow \infty }{\longrightarrow }\rho \quad \text {with}\quad \rho \in (0,\infty ) \end{aligned}$$

then almost surely as \(n\rightarrow \infty \), the empirical spectral distribution of \({\widehat{\Sigma }}_n\) tends weakly with respect to continuous and bounded test functions to a non-random distribution, namely

$$\begin{aligned} a.s.\quad \frac{1}{n}\sum _{k=1}^n\delta _{\lambda _k({\widehat{\Sigma }}_n)} \overset{{\mathcal {C}}_b}{\underset{n\rightarrow \infty }{\longrightarrow }} \mu _\rho \end{aligned}$$
(1.5)

where \(\mu _\rho \) is the Marchenko–Pastur distribution on \([a^-,a^+]\) with \(a^\pm =(1\pm \sqrt{\rho })^2\) given by

$$\begin{aligned} \mu _\rho (dx) =\frac{\rho -1}{\rho }{\mathbf {1}}_{\rho >1}\delta _0 +\frac{\sqrt{(a^+-x)(x-a^-)}}{\rho 2\pi x}\,{\mathbf {1}}_{[a^-,a^+]}(x)dx. \end{aligned}$$

It is a mixture of a Dirac mass at 0 and an absolutely continuous law. The atom at 0 disappears when \(\rho \le 1\) and reflects the fact that \({\widehat{\Sigma }}_n\) has rank at most \(m_n<n\) in the regime \(\rho >1\). The asymptotic phenomenon (1.5) holds beyond the Gaussian case. In particular it was shown by Pajor and Pastur [36] that it holds if for every \(n\in {\mathbb {N}}\) the distribution of the isotropic random vector \(X_n\) is log-concave. Recall that a probability measure \(\mu \) on \({\mathbb {R}}^n\) with density \(\varphi \) is log-concave when \(\varphi =e^{-V}\) with V convex, see [18, 20]. Log-concavity allows some kind of geometric dependence but imposes sub-exponential tails. The asymptotic phenomenon (1.5) also holds if \({(X_{n,k})}_{n\ge 1,1\le k\le n}\) are i.i.d. with finite second moment [9, 38]. An extension to various other models can be found in Bai and Zhou [13], Pastur and Shcherbina [38], Adamczak [1], and Adamczak and Chafaï [3].
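
The Marchenko–Pastur limit (1.5) is easy to visualize by simulation. The following sketch (Python with NumPy; the centered exponential entries, the sizes and the seed are illustrative assumptions) compares the empirical spectral distribution of \({\widehat{\Sigma }}_n\) with the Marchenko–Pastur distribution \(\mu _\rho \) at a few points; the two columns of numbers should roughly agree.

```python
# Monte Carlo illustration of the Marchenko-Pastur law (1.5); the entry law and
# the sizes are illustrative assumptions (here rho = 1/4 < 1, so no atom at 0).
import numpy as np

rng = np.random.default_rng(1)
n, m = 1000, 4000
rho = n / m
X = rng.exponential(size=(m, n)) - 1.0       # i.i.d. centered entries with unit variance
lam = np.linalg.eigvalsh(X.T @ X / m)        # spectrum of the empirical covariance matrix

a_minus, a_plus = (1 - np.sqrt(rho)) ** 2, (1 + np.sqrt(rho)) ** 2
grid = np.linspace(a_minus, a_plus, 4001)
dx = grid[1] - grid[0]
density = np.sqrt((a_plus - grid) * (grid - a_minus)) / (2 * np.pi * rho * grid)

for t in [0.6, 1.0, 1.4, 1.8]:               # empirical CDF versus Marchenko-Pastur CDF
    mp_cdf = float(np.sum(density[grid <= t]) * dx)
    print(t, round(float(np.mean(lam <= t)), 3), round(mp_cdf, 3))
```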

The weak convergence (1.5) does not provide much information at the edge on the behavior of the extremal atoms, and what one can actually extract from (1.5) is that

$$\begin{aligned} a.s.\quad \limsup _{n\rightarrow \infty }\lambda _{\min }({\widehat{\Sigma }}_n) \le (1-\sqrt{\rho })^2 \le (1+\sqrt{\rho })^2 \le \liminf _{n\rightarrow \infty }\lambda _{\max }({\widehat{\Sigma }}_n) \end{aligned}$$
(1.6)

where the first inequality is considered only in the case where \(m_n\ge n\). If \({(X_{n,k})}_{n\ge 1,1\le k\le n}\) are i.i.d. with finite fourth moment then it was shown by Bai and Yin [11, 12, 53] using combinatorics that the extremal eigenvalues actually converge and that (1.6) holds with equalities:

$$\begin{aligned} a.s.\quad (1-\sqrt{\rho })^2 = \lim _{n\rightarrow \infty }\lambda _{\min }({\widehat{\Sigma }}_n) \le \lim _{n\rightarrow \infty }\lambda _{\max }({\widehat{\Sigma }}_n) = (1+\sqrt{\rho })^2, \end{aligned}$$
(1.7)

where the first equality is considered only in the case where \(m_n\ge n\). The convergence of the largest eigenvalue in the right hand side of (1.7) does not take place if \({(X_{n,k})}_{n\ge 1,1\le k\le n}\) are i.i.d. with infinite fourth moment, see [9]. It was understood recently that the convergence of the smallest eigenvalue in the left hand side of (1.7) holds actually as soon as \({(X_{n,k})}_{n\ge 1,1\le k\le n}\) are i.i.d. with finite second moment, see Tikhomirov [49].
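
The role of the fourth moment for the largest eigenvalue can be glimpsed numerically. The sketch below (Python with NumPy; a single random draw with illustrative sizes and seed, so it is only indicative) compares \(\lambda _{\max }({\widehat{\Sigma }}_n)\) for standard Gaussian entries with the same quantity for unit-variance Student \(t(3)\) entries, whose fourth moment is infinite; on most draws the heavy-tailed value overshoots the edge \((1+\sqrt{\rho })^2\) markedly while the Gaussian one stays close to it.

```python
# Indicative one-draw comparison: finite fourth moment (Gaussian) versus infinite
# fourth moment (Student t(3) rescaled to unit variance).  Sizes, seed and the
# distributions are illustrative assumptions; a single draw only suggests the
# typical behaviour.
import numpy as np

rng = np.random.default_rng(2)
n = m = 1000                                          # rho = 1, right edge (1 + 1)^2 = 4
edge = (1 + np.sqrt(n / m)) ** 2

X_gauss = rng.standard_normal((m, n))
X_heavy = rng.standard_t(df=3, size=(m, n)) / np.sqrt(3.0)   # Var(t(3)) = 3

for name, X in [("gaussian", X_gauss), ("student t(3)", X_heavy)]:
    lam_max = float(np.linalg.eigvalsh(X.T @ X / m)[-1])
    print(name, round(lam_max, 2), "edge:", edge)
```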

An analytic proof of (1.7) based on the resolvent is also available, and we refer to Bordenave [17] for the i.i.d. case, and to Bai and Silverstein [8], Pillai and Yin [41], and Richard and Guionnet [42] for more sophisticated models still not including the case in which the law of \(X_n\) is log-concave for every \(n\in {\mathbb {N}}\). Note that the analytic approach was also used for various models of random matrices by Haagerup and Thorbjørnsen [28], Schultz [45], and by Capitaine and Donati-Martin [21].

The study of quantitative high dimensional non-asymptotic properties of the smallest and of the largest eigenvalues of empirical covariance matrices has been the subject of an intense line of research in recent years; in connection with log-concavity, see for instance Adamczak et al. [5, 6], Rudelson and Vershynin [44], Koltchinskii and Mendelson [32], and references therein.

Non-asymptotic estimates for (1.7) were obtained by Srivastava and Vershynin [48] using a rank one update strategy which takes advantage of the decomposition (1.4). This approach is an elementary interplay between probability and linear algebra, which is remarkably neither analytic nor combinatorial. The outcome is that with high probability

$$\begin{aligned} \left( 1-c^-\sqrt{\frac{n}{m_n}}\,\right) ^2 \le \lambda _{\min }({\widehat{\Sigma }}_n) \le \lambda _{\max }({\widehat{\Sigma }}_n) \le \left( 1+c^+\sqrt{\frac{n}{m_n}}\,\right) ^2, \end{aligned}$$
(1.8)

where the first inequality is considered only in the case where \(m_n\ge n\). Here \(c^\pm >0\) are constants, and thus one cannot deduce (1.7) from (1.8). The approach of Srivastava and Vershynin is a randomization of the spectral sparsification method developed by Batson et al. [14, 15]; the idea of using rank one updates can also be found in the works of Benaych-Georges and Nadakuditi [16]. This approach requires the control of tails of norms of projections of \(X_n\), a condition which is satisfied by log-concave distributions thanks to the thin-shell phenomenon. This condition is also satisfied if \({(X_{n,k})}_{n\ge 1,1\le k\le n}\) are i.i.d. with a finite moment of order \(2+\varepsilon \) for the lower bound on the smallest eigenvalue in (1.7) and i.i.d. with a finite moment of order \(4+\varepsilon \) for the upper bound on the largest eigenvalue in (1.7). This method was also used recently by Yaskov [51, 52].

Our main results below lie between the original Bai–Yin theorem and the more recent work of Srivastava and Vershynin. Our contribution is to show that the non-asymptotic approach of Srivastava and Vershynin is indeed suitable to prove and extend the sharp Bai–Yin theorem, which is an asymptotic result, under fairly general assumptions on tails of norms of projections of \(X_n\), which allow heavy-tailed i.i.d. coordinates as well as log-concavity! When the coordinates of \(X_n\) are i.i.d. our approach reaches the (almost) optimal moment condition: finite second moment for the smallest eigenvalue and finite fourth moment for the largest.

Our results are based on the following tail conditions on the norm of projections.

Definition 1.1

(Weak Tail Projection property (WTP)) Let \(X_n\), \(n\in {\mathbb {N}}\), be as in (1.1). We say that the Weak Tail Projection \(\mathrm {(WTP)}\) property holds when the following is true:

(a):

The family \({(\langle X_{n},y\rangle ^2)}_{n\in {\mathbb {N}},y\in S^{n-1}}\) is uniformly integrable, in other words

$$\begin{aligned} \lim _{M\rightarrow \infty }\sup _{\begin{array}{c} n\in {\mathbb {N}}\\ y\in S^{n-1} \end{array}} {\mathbb {E}}\left( \langle X_n,y\rangle ^2{\mathbf {1}}_{\{\langle X_n,y\rangle ^2\ge M\}} \right) =0, \end{aligned}$$
(WTP-a)

where \(S^{n-1}:=\{y\in {\mathbb {R}}^n:\Vert y\Vert =1\}\) denotes the unit sphere of \({\mathbb {R}}^n\);

(b):

There exist two functions \(f:{\mathbb {N}}\rightarrow [0,1]\) and \(g:{\mathbb {N}}\rightarrow {\mathbb {R}}_+\) such that \(f(r)\rightarrow 0\) and \(g(r)\rightarrow 0\) as \(r\rightarrow \infty \) and for every \(n\in {\mathbb {N}}\) and any orthogonal projection \({\mathrm {P}}:{\mathbb {R}}^n\rightarrow {\mathbb {R}}^n\) with \({\mathrm {P}}\ne 0\),

$$\begin{aligned} {\mathbb {P}}{\left( \Vert {\mathrm {P}}X_n\Vert ^2-{\mathrm {rank}}{\mathrm {P}}\ge f({\mathrm {rank}}{\mathrm {P}}){\mathrm {rank}}{\mathrm {P}}\right) } \le g({\mathrm {rank}}{\mathrm {P}}). \end{aligned}$$
(WTP-b)

This can be less formally written as \({\mathbb {P}}\left( {\left\| {\mathrm {P}}X_n\right\| }^2-r\ge o(r)\right) \le o(1)\) where \(r:={\mathrm {rank}}{\mathrm {P}}\) and where the “o” are with respect to r and are uniform in n.

Definition 1.2

(Strong Tail Projection property (STP)) Let \(X_n\), \(n\in {\mathbb {N}}\), be as in (1.1). We say the Strong Tail Projection (STP) property holds when there exist \(f:{\mathbb {N}}\rightarrow [0,1]\) and \(g:{\mathbb {N}}\rightarrow {\mathbb {R}}_+\) such that \(f(r)\rightarrow 0\) and \(g(r)\rightarrow 0\) as \(r\rightarrow \infty \), and for every \(n\in {\mathbb {N}}\), for any orthogonal projection \({\mathrm {P}}:{\mathbb {R}}^n\rightarrow {\mathbb {R}}^n\) with \({\mathrm {P}}\ne 0\), for any real \(t\ge f({\mathrm {rank}}{\mathrm {P}}){\mathrm {rank}}{\mathrm {P}}\) we have

$$\begin{aligned} {\mathbb {P}}{\left( \Vert {\mathrm {P}}X_n\Vert ^2-{\mathrm {rank}}{\mathrm {P}}\ge t\right) } \le \frac{g({\mathrm {rank}}{\mathrm {P}}){\mathrm {rank}}{\mathrm {P}}}{t^2}. \end{aligned}$$
(STP)

This can be less formally written as \({\mathbb {P}}({\left\| {\mathrm {P}}X_n\right\| }^2-r\ge t)\le o(r)t^{-2}\) where \(t\ge o(r)\) and \(r:={\mathrm {rank}}{\mathrm {P}}\) and where the “o” are with respect to r and are uniform in n.

Note that \({\mathbb {E}}(\Vert {\mathrm {P}}X_n\Vert ^2)={\mathrm {rank}}{\mathrm {P}}\) since \(X_n\) is isotropic. The properties \(\mathrm {(WTP)}\) and (STP) were inspired by the “strong regularity assumption” used by Srivastava and Vershynin [48]. They are specially designed to obtain a sharp Bai–Yin type asymptotic result in the i.i.d. case with (almost) optimal moment assumptions, as well as in the log-concave case.

It is easy to see that (STP) implies that for any \(\varepsilon >0\), the \(4-\varepsilon \) moments of 1-dimensional marginals of \(X_n\)’s are uniformly bounded; in particular, in this case the vectors \({(X_n)}_{n\in {\mathbb {N}}}\) satisfy condition (WTP-a). Next, condition (WTP-b) is clearly weaker than (STP); thus, (STP) implies \(\mathrm {(WTP)}\), hence the qualifiers “Strong” and “Weak” for these properties.
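
For random vectors with i.i.d. coordinates, the quantity controlled by (WTP-b) and (STP) is easy to probe by simulation. The following sketch (Python with NumPy; the bounded uniform entry law, the rank, the threshold \(0.2\,r\) and the seed are illustrative assumptions) checks that \(\Vert {\mathrm {P}}X_n\Vert ^2\) concentrates around \({\mathrm {rank}}\,{\mathrm {P}}\) for a random orthogonal projection.

```python
# Monte Carlo sketch of the projection conditions: for X with i.i.d. entries,
# ||P X||^2 concentrates around rank(P).  The entry law, the rank r and the
# threshold 0.2*r are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(3)
n, r, trials = 500, 200, 2000
Q, _ = np.linalg.qr(rng.standard_normal((n, r)))     # orthonormal columns
P = Q @ Q.T                                          # orthogonal projection of rank r

X = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(trials, n))   # centered, unit variance
dev = np.sum((X @ P) ** 2, axis=1) - r               # ||P X||^2 - rank(P), one value per sample

print("average of ||PX||^2 - r :", round(float(dev.mean()), 2))       # close to 0
print("P(||PX||^2 - r >= 0.2 r) :", float(np.mean(dev >= 0.2 * r)))   # small, and shrinks as r grows
```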

Proposition 1.3 below, proved in Sect. 2, implies that if \({(X_n)}_{n\in {\mathbb {N}}}\) is as in (1.1) and if \(X_n\) is log-concave for every \(n\in {\mathbb {N}}\) then properties \(\mathrm {(WTP)}\) and (STP) are satisfied. It is a consequence of the thin-shell and sub-exponential tails phenomena for these distributions.

Proposition 1.3

(Log-concave random vectors) Let \(X_n\), \(n\in {\mathbb {N}}\), be as in (1.1), and suppose that the centered isotropic random vector \(X_n\) is log-concave for any \(n\in {\mathbb {N}}\). Then a stronger form of (STP) holds with

$$\begin{aligned} \{{\left\| {\mathrm {P}}X_n\right\| }^2-{\mathrm {rank}}{\mathrm {P}}\ge t\} \quad \text {replaced by}\quad \{|{\left\| {\mathrm {P}}X_n\right\| }^2-{\mathrm {rank}}{\mathrm {P}}|\ge t\}. \end{aligned}$$

The next proposition, proved in Sect. 2, implies that if \({(X_{n,k})}_{n\ge 1,1\le k\le n}\) are i.i.d. then property \(\mathrm {(WTP)}\) holds, and if moreover these i.i.d. random variables have finite fourth moment then property (STP) holds.

Proposition 1.4

(Random vectors with i.i.d. coordinates) Let \(X_n\), \(n\in {\mathbb {N}}\), be as in (1.1), and suppose that the coordinates \({(X_{n,k})}_{n\ge 1,1\le k\le n}\) are i.i.d. distributed as a real random variable \(\xi \) with zero mean and unit variance. Then, denoting \(r:={\mathrm {rank}}{\mathrm {P}}\),

  • a stronger version of \(\mathrm {(WTP)}\) holds, with, in (WTP-b),

    $$\begin{aligned} \{\Vert {\mathrm {P}}X_n\Vert ^2-r\ge f(r)r\} \quad \text {replaced by}\quad \{|\Vert {\mathrm {P}}X_n\Vert ^2-r|\ge f(r)r\}; \end{aligned}$$
  • if moreover \({\mathbb {E}}(\xi ^4)<\infty \) then a stronger version of (STP) holds, with

    $$\begin{aligned} \{\Vert {\mathrm {P}}X_n\Vert ^2-r\ge t\} \quad \text {replaced by}\quad \{|\Vert {\mathrm {P}}X_n\Vert ^2-r|\ge t\}. \end{aligned}$$

The next proposition, proved in Sect. 2, was suggested by an anonymous referee.

Proposition 1.5

(Marchenko–Pastur law holds under \(\mathrm {(WTP)}\)) Let \(X_n\), \(n\in {\mathbb {N}}\), be as in (1.1). If \(\lim _{n\rightarrow \infty }\frac{n}{m_n}=\rho \) with \(\rho \in (0,1)\), in other words \(\underline{\rho }={\overline{\rho }}\in (0,1)\), and if (WTP-a) and (WTP-b) are satisfied, then the Marchenko–Pastur law (1.5) holds.

Our main results are stated in the following couple of theorems and corollaries.

Theorem 1.6

(Smallest eigenvalue) Let \(X_n\), \(m_n\), and \(A_n\), \(n\in {\mathbb {N}}\), be as in (1.1), (1.2), and (1.4) respectively. If \({\overline{\rho }}<1\) (in particular \(m_n> n\) for \(n\gg 1\)), and if (WTP-a) and (WTP-b) are satisfied then

$$\begin{aligned} \liminf _{n\rightarrow \infty }\frac{{\mathbb {E}}(\lambda _{\min }(A_n))}{(\sqrt{m_n}-\sqrt{n})^2}\ge 1. \end{aligned}$$

If additionally \(\lim _{n\rightarrow \infty }\frac{n}{m_n}=\rho \) with \(\rho \in (0,1)\), in other words \(\underline{\rho }={\overline{\rho }}\in (0,1)\), then

$$\begin{aligned} \lambda _{\min }({\widehat{\Sigma }}_n) \overset{{\mathbb {P}}}{\underset{n\rightarrow \infty }{\longrightarrow }} (1-\sqrt{\rho })^2. \end{aligned}$$

Theorem 1.6 is proved in Sect. 3. The second part, which is the Bai–Yin edge convergence (1.7) of the smallest eigenvalue in probability, is obtained by combining the first part of the theorem with the Marchenko–Pastur bound (1.6) for the smallest eigenvalue obtained from the Marchenko–Pastur law (1.5) provided by Proposition 1.5.

Combining Theorem 1.6 with Propositions 1.3 and 1.4, we obtain the following corollary.

Corollary 1.7

(Smallest eigenvalue convergence) Let \(X_n\), \(m_n\), \({\widehat{\Sigma }}_n\), and \(A_n\), \(n\in {\mathbb {N}}\), be as in (1.1), (1.2), (1.3), and (1.4) respectively. If \({\overline{\rho }}<1\) (in particular \(m_n> n\) for \(n\gg 1\)) and if the centered isotropic random vector \(X_n\) is log-concave for every \(n\in {\mathbb {N}}\) or if \({(X_{n,k})}_{n\ge 1,1\le k\le n}\) are i.i.d. then

$$\begin{aligned} \liminf _{n\rightarrow \infty }\frac{{\mathbb {E}}(\lambda _{\min }(A_n))}{(\sqrt{m_n}-\sqrt{n})^2}\ge 1. \end{aligned}$$

If additionally \(\lim _{n\rightarrow \infty }\frac{n}{m_n}=\rho \) with \(\rho \in (0,1)\), in other words \(\underline{\rho }={\overline{\rho }}\in (0,1)\), then

$$\begin{aligned} \lambda _{\min }({\widehat{\Sigma }}_n) \overset{{\mathbb {P}}}{\underset{n\rightarrow \infty }{\longrightarrow }} (1-\sqrt{\rho })^2. \end{aligned}$$

Theorem 1.8

(Largest eigenvalue) Let \(X_n\), \(m_n\), and \(A_n\), \(n\in {\mathbb {N}}\), be as in (1.1), (1.2), and (1.4) respectively. If (STP) holds then

$$\begin{aligned} \limsup _{n\rightarrow \infty }\frac{{\mathbb {E}}(\lambda _{\max }(A_n))}{(\sqrt{m_n}+\sqrt{n})^2}\le 1. \end{aligned}$$

If additionally \(\lim _{n\rightarrow \infty }\frac{n}{m_n}=\rho \) with \(\rho \in (0,\infty )\) in other words \(\underline{\rho }={\overline{\rho }}\in (0,\infty )\), then

$$\begin{aligned} \lambda _{\max }({\widehat{\Sigma }}_n) \overset{{\mathbb {P}}}{\underset{n\rightarrow \infty }{\longrightarrow }} (1+\sqrt{\rho })^2. \end{aligned}$$

Theorem 1.8 is proved in Sect. 4. Again, the second part uses the Marchenko–Pastur bound (1.6) obtained from the Marchenko–Pastur law (1.5) provided by Proposition 1.5.

Combining Theorem 1.8 with Propositions 1.3 and 1.4, we obtain the following statement:

Corollary 1.9

(Largest eigenvalue convergence) Let \(X_n\), \(m_n\), \({\widehat{\Sigma }}_n\) and \(A_n\), \(n\in {\mathbb {N}}\), be as in (1.1), (1.2), (1.3), and (1.4) respectively. If the centered isotropic random vector \(X_n\) is log-concave for every \(n\in {\mathbb {N}}\) or if \({(X_{n,k})}_{n\ge 1,1\le k\le n}\) are i.i.d. with finite 4th moment then

$$\begin{aligned} \limsup _{n\rightarrow \infty }\frac{{\mathbb {E}}(\lambda _{\max }(A_n))}{(\sqrt{m_n}+\sqrt{n})^2}\le 1. \end{aligned}$$

If additionally \(\lim _{n\rightarrow \infty }\frac{n}{m_n}=\rho \) with \(\rho \in (0,\infty )\) in other words \(\underline{\rho }={\overline{\rho }}\in (0,\infty )\), then

$$\begin{aligned} \lambda _{\max }({\widehat{\Sigma }}_n) \overset{{\mathbb {P}}}{\underset{n\rightarrow \infty }{\longrightarrow }} (1+\sqrt{\rho })^2. \end{aligned}$$

1.1 Outline of the argument and novelty

Assume we are given a random matrix

$$\begin{aligned} A=\sum _{k=1}^{m}X^{(k)}\otimes X^{(k)}, \end{aligned}$$

where \(X^{(k)}\) are i.i.d. isotropic random vectors. As we already mentioned above, the key ingredient in estimating the extremal eigenvalues of A is the following rank one update formula known as the Sherman–Morrison formula:

$$\begin{aligned} (M+xx^\top )^{-1}=M^{-1}-\frac{M^{-1}xx^\top M^{-1}}{1+x^\top M^{-1}x}, \end{aligned}$$

which is valid for any non-singular \(n\times n\) matrix M and any vector x with \(1+x^\top M^{-1}x\ne 0\). Using the above identity and assuming that M is symmetric, the restriction to \({\mathbb {R}}\) of the Cauchy–Stieltjes transform of the spectral distribution of \(M+xx^\top \), which is defined as an appropriately scaled trace of \((u-M-xx^\top )^{-1}\), \(u\in {\mathbb {R}}\), can be expressed in a simple way in terms of the Cauchy–Stieltjes transform of M. To be more precise, setting

$$\begin{aligned} A^{(0)}:=0 \quad \text {and}\quad A^{(k)}:=A^{(k-1)}+X^{(k)}\otimes X^{(k)},\quad k=1,\dots ,m, \end{aligned}$$

we get, for any \(k=1,\dots ,m\) and any \(u\in {\mathbb {R}}\),

$$\begin{aligned} {\mathrm {tr}}(u-A^{(k)})^{-1}={\mathrm {tr}}(u-A^{(k-1)})^{-1} +\frac{{X^{(k)}}^\top (u-A^{(k-1)})^{-2}X^{(k)}}{1-{X^{(k)}}^\top (u-A^{(k-1)})^{-1} X^{(k)}}. \end{aligned}$$
(1.9)

A crucial observation, made already in [15] and further developed in [48], is that the Cauchy–Stieltjes transform on the real line can be efficiently used to control the extreme eigenvalues of a positive semi-definite matrix. In our setting, starting with some specially chosen non-random number \(u_0\ne 0\), we inductively define a sequence of random numbers \(u_k\) (\(k\le m\)) in such a way that all \(u_k\)’s stay on the same side of the spectra of the \(A^{(k)}\)’s (\(k\le m\)), while at the same time not departing too far from the spectrum. Then the expectation of the corresponding extreme eigenvalue of \(A=A^{(m)}\) can be estimated by \({\mathbb {E}}u_m\). The increments \(u_k-u_{k-1}\) are defined with the help of (1.9) as certain functions of \(A^{(k-1)}\), \(u_{k-1}\) and \(X^{(k)}\), and their expectations are controlled using the information about the distribution of \(X^{(k)}\) as well as certain induction hypotheses. At this level, the approach used in the present paper is similar to [48].
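
The Sherman–Morrison formula and the trace identity (1.9) can be checked numerically on a small example. The following sketch (Python with NumPy; the sizes, the test point u and the seed are illustrative assumptions) verifies (1.9) for one rank one update of a random positive semi-definite matrix, with u taken to the left of both spectra.

```python
# Numerical check of the trace identity (1.9) for a single rank one update; the
# sizes and the test point u are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(4)
n = 50
x = rng.standard_normal(n)
A_prev = sum(np.outer(v, v) for v in rng.standard_normal((30, n)))   # A^{(k-1)}
A_next = A_prev + np.outer(x, x)                                     # A^{(k)}
u = -5.0                                      # stays strictly to the left of both spectra

R_prev = np.linalg.inv(u * np.eye(n) - A_prev)        # resolvent (u - A^{(k-1)})^{-1}
lhs = np.trace(np.linalg.inv(u * np.eye(n) - A_next))
rhs = np.trace(R_prev) + (x @ R_prev @ R_prev @ x) / (1 - x @ R_prev @ x)

print(np.isclose(lhs, rhs))                   # True: the update formula (1.9) holds
```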

On the other hand, as our result is asymptotically sharp and covers the i.i.d. case with almost optimal moment conditions, the technical part of our argument differs significantly from [48]. In particular, we introduce the “regularity shifts”, which are designed in such a way that the \(u_k\)’s stay “sufficiently far” from the spectrum of the \(A^{(k)}\)’s, which guarantees the validity of certain concentration inequalities, while at the same time not departing “too far”, so that one still gets a satisfactory estimate of the expectation of the spectral edges. The shifts (which we denote by \(\delta ^k_R\) and \(\Delta ^k_R\)) are defined in Sects. 3.3 and 4.3 (see, in particular, Lemmas 3.5 and 4.7).

Let us emphasize once more that the proofs we obtain are simpler and shorter than the original combinatorial approach of Bai–Yin and the analytic approach based on the resolvent (more precisely the Cauchy–Stieltjes transform on the complex plane outside the real axis), while the class of distributions covered in our paper is larger. In particular, Theorem 1.6 of the present paper essentially recovers a recent result [49]. It must be noted, however, that in our paper we replace almost sure convergence with the weaker convergence in probability.

1.2 Discussion and extensions

In this note we restrict our analysis to random vectors with real coordinates because we think that this is simply more adapted to geometric dependence such as log-concavity. It is likely that the method remains valid for random vectors with complex entries. The Bai–Yin theorem is also available for random symmetric matrices (which are sums of rank two updates that are no longer positive semidefinite) but it is unclear to us if one can adapt the method to this situation. One can ask in another direction if the method remains usable for non-white population covariance matrices, and for the so-called information plus noise covariance matrices, two models studied at the edge by Bai and Silverstein [8, 10] among others. One can ask finally if the method allows one to extract at least the scaling of the Tracy–Widom fluctuation at the edge. For the Tracy–Widom fluctuation at the edge of empirical covariance matrices we refer to Johansson [29], Johnstone [30], Borodin and Forrester [19], Soshnikov [47], Péché [39], Feldheim and Sodin [23], Pillai and Yin [41], and references therein. It was shown by Lee and Yin [33] that for centered Wigner matrices the finiteness of the fourth moment is more than enough for the Tracy–Widom fluctuation of the largest eigenvalue. One can ask the same for the largest eigenvalue of the empirical covariance matrix of random vectors with i.i.d. entries, and one can additionally ask if a finite second moment is enough for the Tracy–Widom fluctuation of the smallest eigenvalue.

1.3 Structure

The rest of the article is structured as follows. Section 2 provides the proof of Propositions 1.3 and 1.4. In Sects. 3 and 4 we prove Theorems 1.6 and 1.8, respectively.

1.4 Notations

We set \({\left\| v\right\| }:=\sqrt{v_1^2+\cdots +v_n^2}\) and \({\left\| v\right\| }_\infty :=\max (|v_1|,\ldots ,|v_n|)\) for any vector \(v\in {\mathbb {R}}^n\). We denote \({\left\| f\right\| }_\infty :=\sup _{x\in S}|f(x)|\) for any function \(f:S\rightarrow {\mathbb {R}}\). We often use the notation \(|S|:={\mathrm {card}}(S)\) for a set S. Further, we denote by

$$\begin{aligned} \lambda _{\max }(M):=\lambda _1(M)\ge \cdots \ge \lambda _n(M)=:\lambda _{\min }(M) \end{aligned}$$

the eigenvalues of a symmetric \(n\times n\) matrix \(M\in {\mathcal {M}}_n({\mathbb {R}})\). We denote by \({\mathrm {I}}_n\) the \(n\times n\) identity matrix. For any real number u we sometimes abbreviate \(u{\mathrm {I}}_n\) to u in matrix expressions, as in \(M-u{\mathrm {I}}_n=M-u\).

2 Proof of Propositions 1.3–1.5

Proof of Proposition 1.3

Assume X is a centered isotropic log-concave vector in \({\mathbb {R}}^n\) and \({\mathrm {P}}:{\mathbb {R}}^n\rightarrow {\mathbb {R}}^n\) is a non-zero orthogonal projection. The random vector \({\mathrm {P}}X\) is log-concave with mean zero and covariance matrix \({\mathrm {P}}{\mathrm {P}}^\top ={\mathrm {P}}^2={\mathrm {P}}\). Restricted to the image of \({\mathrm {P}}\), the vector \(Y={\mathrm {P}}X\) is log-concave with covariance matrix \(I_r\) where \(r={\mathrm {rank}}{\mathrm {P}}\). The so-called thin-shell phenomenon [7] states that “most” of the distribution of Y is supported in a thin shell around the Euclidean sphere of radius \(\sqrt{r}\). Quantitative estimates have been obtained notably by Klartag [31], Fleury [25], Guédon and Milman [27], see also Guédon [26]. On the other hand, it is also known that the tail of the norm of Y is sub-exponential, see Paouris [37], and also Adamczak et al. [4]. The following inequality, taken from [27, Theorem 1.1], captures both phenomena: there exist absolute constants \(c,C\in (0,\infty )\) such that for any real \(u\ge 0\),

$$\begin{aligned} {\mathbb {P}}{\left( {\left| {\left\| Y\right\| }-\sqrt{r}\right| }\ge u\sqrt{r}\right) } \le C\exp \left( -c\sqrt{r}\min (u,u^3)\right) . \end{aligned}$$

This is more than enough for our needs. Namely, let \(\beta \in (0,1/20)\), and let \(u=u(r)\in (0,\infty )\) and \(\alpha =\alpha (r)\in (0,1)\) be such that \(\alpha \ge (1+r)^{-\beta }\) and \(u\ge \max ((1+r)^{-\beta },2\alpha /(1-\alpha ^2))\). Note that \(2\alpha /(1-\alpha ^2)\in (0,1)\) when \(\alpha \in (0,\sqrt{2}-1)\), and that \(2\alpha /(1-\alpha ^2)\rightarrow 0\) as \(\alpha \rightarrow 0\). Now, using the inequality \(\exp (-2t)\le t^{-4}\) for \(t>0\), we get, if \(\alpha u\in (0,1]\),

$$\begin{aligned} {\mathbb {P}}({\left\| Y\right\| }^2-r\ge u^2r)&={\mathbb {P}}({\left\| Y\right\| }\ge \sqrt{r+u^2r}) \\&\le {\mathbb {P}}({\left\| Y\right\| }\ge \sqrt{r}+\alpha u\sqrt{r}) \\&\le C\exp \left( -c\sqrt{r}(\alpha u)^3\right) \\&\le \frac{2^4C}{c^4(u^2r)^2\alpha ^{12}u^8} = \frac{o(r)}{(u^2r)^2}, \end{aligned}$$

while if \(\alpha u\in [1,\infty )\) we get

$$\begin{aligned} {\mathbb {P}}({\left\| Y\right\| }^2-r\ge u^2r) \le \cdots \le C\exp \left( -c\sqrt{r}\alpha u\right) \le \frac{2^4C}{c^4(u^2r)^2\alpha ^4} = \frac{o(r)}{(u^2r)^2}. \end{aligned}$$

Similarly, for an arbitrary \(\beta \in (0,1/20)\), let \(u=u(r)\in (0,\infty )\) and \(\alpha =\alpha (r)\in (0,1)\) be such that \(\alpha \ge (1+r)^{-\beta }\) and \(u\ge \max ((1+r)^{-\beta },2\alpha /(1+\alpha ^2))\). If \(u>1\) then necessarily

$$\begin{aligned} {\mathbb {P}}({\left\| Y\right\| }^2-r\le -u^2r)=0. \end{aligned}$$

Otherwise, \(\alpha u\in (0,1]\) and using again the inequality \(\exp (-2t)\le t^{-4}\) for \(t>0\), we get

$$\begin{aligned} {\mathbb {P}}({\left\| Y\right\| }^2-r\le -u^2r)&\le {\mathbb {P}}({\left\| Y\right\| }\le \sqrt{r-u^2r}) \\&\le {\mathbb {P}}({\left\| Y\right\| }\le \sqrt{r}-\alpha u\sqrt{r}) \\&\le C\exp \left( -c\sqrt{r}(\alpha u)^3\right) \\&\le \frac{2^4C}{c^4(u^2r)^2\alpha ^{12}u^{8}} = \frac{o(r)}{(u^2r)^2}. \end{aligned}$$

Thus, we obtain \({\mathbb {P}}\left( \left| {\left\| Y\right\| }^2-r\right| \ge t\right) \le o(r)t^{-2}\) for \(t\ge o(r)\) as expected. \(\square \)
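
A concrete isotropic log-concave example on which the thin-shell behaviour used above can be observed is the uniform distribution on the centered Euclidean ball of radius \(\sqrt{r+2}\) in \({\mathbb {R}}^r\), whose covariance matrix is \(I_r\). The following sketch (Python with NumPy; the dimension, the sample size, the threshold \(0.1\,r\) and the seed are illustrative assumptions) samples from this law and checks the concentration of \(\Vert Y\Vert ^2\) around r.

```python
# Thin-shell sketch for one isotropic log-concave law: the uniform distribution
# on the ball of radius sqrt(r+2) in R^r.  Dimension, sample size, threshold and
# seed are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(5)
r, samples = 400, 20000

g = rng.standard_normal((samples, r))
directions = g / np.linalg.norm(g, axis=1, keepdims=True)     # uniform on the sphere
radii = np.sqrt(r + 2) * rng.uniform(size=samples) ** (1.0 / r)
Y = directions * radii[:, None]                               # uniform on the ball

sq_norm = np.sum(Y ** 2, axis=1)
print("average of ||Y||^2 / r :", round(float(sq_norm.mean() / r), 3))          # close to 1
print("P(| ||Y||^2 - r | >= 0.1 r) :", float(np.mean(np.abs(sq_norm - r) >= 0.1 * r)))
```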

Proof of Proposition 1.4

  • Proof of the first part (uniform integrability (WTP-a)) Recall that we are given a random variable \(\xi \) with zero mean and unit variance and that for every \(n\in {\mathbb {N}}\) the coordinates of \(X_n\) are independent copies of \(\xi \). We want to show that

    $$\begin{aligned} \lim _{M\rightarrow \infty }\sup _{\begin{array}{c} n\in {\mathbb {N}}\\ y\in S^{n-1} \end{array}} {\mathbb {E}}\left( \langle X_n,y\rangle ^2{\mathbf {1}}_{\{\langle X_n,y\rangle ^2\ge M\}}\right) =0. \end{aligned}$$

    For every \(x\in {\mathbb {R}}^n\), we define \(f_n(x):={\langle X_n,x\rangle }\). Clearly, \({\mathbb {E}}(f_n^2(x))={\left\| x\right\| }^2\) since \(X_n\) is isotropic. Let us start with some comments to understand the problem. The random variables \({(f_n^2(y))}_{n\in {\mathbb {N}},y\in S^{n-1}}\) belong to the unit sphere of \(L^1\). If \(\xi \) has finite fourth moment \(B:={\mathbb {E}}(\xi ^4)<\infty \) then by expanding and using isotropy we get \({\mathbb {E}}(f_n^4(y))\le \max (B,3)\) which shows that the family \({(f_n^2(y))}_{n\in {\mathbb {N}},y\in S^{n-1}}\) is bounded in \(L^2\) and thus uniformly integrable. How to proceed when \(\xi \) does not have a finite fourth moment? If y belongs to the canonical basis of \({\mathbb {R}}^n\) then \(f_n(y)\) is distributed as \(\xi \) and has thus the same integrability. On the other hand, if y is far from being sparse, say \(y=(n^{-1/2},\ldots ,n^{-1/2})\), then \(f_n(y)\) is distributed as \(n^{-1/2}(\xi _1+\cdots +\xi _n)\) where the \(\xi _i\)’s are independent copies of \(\xi \), which is close in distribution to the standard Gaussian law \({{\mathcal {N}}}(0,1)\) by the Central Limit Theorem (CLT).

We will use the following uniform quantitative CLT. Even though it probably exists somewhere in the literature, we provide a short proof for convenience. It can probably also be proved using the classical Fourier analytic smoothing inequality [24, equation (3.13) XVI.3 p. 538] which is the basis of many uniform CLT estimates.

Lemma 2.1

(Uniform quantitative CLT) Let \(\xi \) be a random variable with zero mean and unit variance and let \(\Phi \) be the cumulative distribution of the standard real Gaussian law \({\mathcal {N}}(0,1)\) of zero mean and unit variance. For any \(\varepsilon >0\) there exists \(\delta >0\) depending only on \(\varepsilon \) and the law of \(\xi \) with the following property: if \(n\in {\mathbb {N}}\) and \(y\in S^{n-1}\) is such that \({\left\| y\right\| }_\infty \le \delta \), then the cumulative distribution function \(F_n\) of \(\sum _{i=1}^ny_i\xi _i\) satisfies

$$\begin{aligned} {\left\| F_n-\Phi \right\| }_\infty \le \varepsilon . \end{aligned}$$

Proof of Lemma 2.1

To prove the lemma, let us assume the contrary. Then there exists \(\varepsilon >0\) and a sequence \({(y_m)}_{m\ge 1}\) in \(\ell ^2({\mathbb {N}})\) such that \({\left\| y_m\right\| }=1\) and \({\left\| y_m\right\| }_\infty \le 1/m\) for every \(m\in {\mathbb {N}}\), and such that \({\left\| F_m-\Phi \right\| }_\infty >\varepsilon \) where \(F_m\) is the cumulative distribution function of \(S_m:=\sum _{i=1}^\infty y_{m,i}\xi _i\). Let \(\varphi _m\) be the characteristic function of \(S_m\). We have \(\varphi _m(t)=\prod _{i=1}^\infty \varphi (y_{m,i}t)\) where \(\varphi \) is the characteristic function of \(\xi \). Fix any \(t\in {\mathbb {R}}\). By assumption on \(\xi \), we get \(\varphi (t)=1-\frac{t^2}{2}+o(t^2)\). Hence, using the identities \({\left\| y_m\right\| }^2:=\sum _{i=1}^\infty y_{m,i}^2=1\) and \({\left\| y_m\right\| }_\infty :=\max _{i\in {\mathbb {N}}}|y_{m,i}|\le 1/m\) together with the formula (valid for \(m\rightarrow \infty \)):

$$\begin{aligned} \varphi _m(t)= & {} \prod _{i=1}^\infty {\left( 1-\frac{y_{m,i}^2t^2}{2}+o\left( \left( y_{m,i}^2t^2\right) \right) \right) }\\= & {} \exp {\left( -\frac{t^2}{2}\sum _{i=1}^\infty y_{m,i}^2+\sum _{i=1}^\infty o\left( y_{m,i}^2t^2\right) \right) }, \end{aligned}$$

we get \(\lim _{m\rightarrow \infty }\varphi _m(t)=e^{-t^2/2}\). By the Lévy theorem for characteristic functions, it follows that \(S_m\rightarrow {\mathcal {N}}(0,1)\) weakly as \(m\rightarrow \infty \), and hence \(F_m\rightarrow \Phi \) pointwise; since \(\Phi \) is continuous, this pointwise convergence of distribution functions is in fact uniform, contradicting our initial assumption. \(\square \)
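
A quick Monte Carlo experiment illustrates Lemma 2.1 for one markedly non-Gaussian \(\xi \). The sketch below (Python with NumPy and SciPy; the centered exponential law, the flat weight vector, the sample size and the seed are illustrative assumptions) estimates the Kolmogorov distance between the law of \(\sum _iy_i\xi _i\) and \({\mathcal {N}}(0,1)\) when \({\left\| y\right\| }_\infty \) is small.

```python
# Monte Carlo sketch of Lemma 2.1: for a flat unit vector y (small sup norm), the
# law of sum_i y_i xi_i is close to N(0,1) even for a skewed xi.  The entry law,
# the dimension, the sample size and the seed are illustrative assumptions.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(6)
n, samples = 400, 50000
y = np.full(n, 1 / np.sqrt(n))                 # ||y|| = 1, ||y||_inf = n^{-1/2}

xi = rng.exponential(size=(samples, n)) - 1.0  # i.i.d. entries, zero mean, unit variance
S = xi @ y                                     # samples of <X_n, y>

grid = np.linspace(-3.0, 3.0, 601)
F_n = np.mean(S[:, None] <= grid[None, :], axis=0)            # empirical CDF of S
print("sup |F_n - Phi| approx:", round(float(np.max(np.abs(F_n - norm.cdf(grid)))), 3))
```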

Let us continue with the proof of the uniform integrability. Since \(\xi ^2\in L^1\) we get, by dominated convergence,

$$\begin{aligned} h(M):={\mathbb {E}}\left( \xi ^2{\mathbf {1}}_{\{\xi ^2\ge M\}}\right) \underset{M\rightarrow \infty }{\longrightarrow }0. \end{aligned}$$

Let \(\varepsilon >0\), and let \(\delta >0\) be defined from \(\varepsilon \) and the law of \(\xi \) as in the above lemma (we can, of course, assume that \(\delta \rightarrow 0\) with \(\varepsilon \rightarrow 0\)). Let \(M>0\) and \(n\in {\mathbb {N}}\) and \(y\in S^{n-1}\).

Let us write \(y=w+z\) where \(w_i:=y_i{\mathbf {1}}_{{\left| y_i\right| }\le \delta ^2}\) and \(z_i:=y_i{\mathbf {1}}_{{\left| y_i\right| }>\delta ^2}\) for any \(i=1,\ldots ,n\) (here, one can recall the decomposition of the unit sphere into sets of “close to sparse” and “far from sparse” vectors, employed, in particular, in [34, 43]). Then it is easily seen that

  • \({\mathrm {supp}}(z)\cap {\mathrm {supp}}(w)=\varnothing \);

  • \({\left\| w\right\| }\le 1\) and \({\left\| w\right\| }_\infty \le \delta ^2\);

  • \({\left\| z\right\| }\le 1\) and \(|{\mathrm {supp}}(z)|\le 1/\delta ^{4}\),

where \({\mathrm {supp}}(x):=\{i:x_i\ne 0\}\). Now we have \(f_n^2(y)\le 2(f_n^2(w)+f_n^2(z))\) and

$$\begin{aligned} {\mathbb {E}}\left( f^2_n(y){\mathbf {1}}_{\left\{ f^2_n(y)\ge M\right\} }\right)&\le 2{\mathbb {E}}\left( \left( f_n^2(w)+f_n^2(z)\right) \left( {\mathbf {1}}_{\left\{ f_n^2(w)\ge \frac{M}{4}\right\} }+{\mathbf {1}}_{\left\{ f_n^2(z)\ge \frac{M}{4}\right\} }\right) \right) \\&\le 2{\left\| z\right\| }^2{\mathbb {P}}\left( f_n^2(w)\ge {\scriptstyle \frac{1}{4}}M\right) +2{\left\| w\right\| }^2{\mathbb {P}}\left( f_n^2(z)\ge {\scriptstyle \frac{1}{4}}M\right) \\&\quad +2{\mathbb {E}}\left( f_n^2(z){\mathbf {1}}_{\left\{ f_n^2(z)\ge \frac{M}{4}\right\} }\right) +2{\mathbb {E}}\left( f_n^2(w){\mathbf {1}}_{\left\{ f_n^2(w)\ge \frac{M}{4}\right\} }\right) \\&=:(*)+(**)+(***)+(****). \end{aligned}$$

Now, by Markov’s inequality

$$\begin{aligned} (*)+(**) \le \frac{16}{M}. \end{aligned}$$

Second, using that \(f_n^2(x)\le \Vert {\mathrm {P}}_x X_n\Vert ^2\), valid for any \(x\in {\mathbb {R}}^n\) with \(\Vert x\Vert \le 1\) and orthogonal projection \({\mathrm {P}}_x\) onto its support, we obtain

$$\begin{aligned} (***)&\le 2 {\mathbb {E}}\left( \left( \sum _{i:z_i\ne 0}\xi _i^2\right) \sum _{i:z_i\ne 0}{\mathbf {1}}_{\left\{ \xi _i^2\ge \frac{\delta ^8M}{4}\right\} }\right) \\&= 2 {\mathbb {E}}\left( \sum _{i,j\in {\mathrm {supp}}(z)} \xi _i^2{\mathbf {1}}_{\left\{ \xi _j^2\ge \frac{\delta ^8M}{4}\right\} }\right) \\&\le 2 \left( \frac{1}{\delta ^8}\frac{4}{\delta ^8M} +\frac{h\left( \frac{\delta ^8M}{4}\right) }{\delta ^4}\right) =\frac{8}{\delta ^{16}M}+\frac{2h\left( \frac{\delta ^8M}{4}\right) }{\delta ^{4}}, \end{aligned}$$

where in the third line we used Markov’s inequality. Third, we write

$$\begin{aligned} (****)=2{\mathbb {E}}\left( f_n^2(w)\right) -2{\mathbb {E}}\left( f_n^2(w) {\mathbf {1}}_{\left\{ f_n^2(w)<\frac{M}{4}\right\} }\right) . \end{aligned}$$

Now if \({\left\| w\right\| }\le \delta \) then \((****)\le 2\delta ^2\). Suppose in contrast that \({\left\| w\right\| }>\delta \), and denote \(w_*:=w/{\left\| w\right\| }\). Then \({\left\| w_*\right\| }_\infty \le \delta \), and therefore, by the CLT of Lemma  2.1, the distribution of \(f_n(w_*)\) is \(\varepsilon \)-close to \({\mathcal {N}}(0,1)\), and in particular, there exist \(M_*(\varepsilon )>0\) and \(\rho (\varepsilon )>0\) depending only on \(\varepsilon \) such that \(\rho (\varepsilon )\rightarrow 0\) as \(\varepsilon \rightarrow 0\) and

$$\begin{aligned} {\mathbb {E}}\left( f_n^2(w_*){\mathbf {1}}_{\left\{ f_n^2(w_*)<M_*(\varepsilon )\right\} }\right) \ge 1-\rho (\varepsilon ). \end{aligned}$$

Thus, having in mind that \(f_n^2(w_*)=f_n^2(w)/\Vert w\Vert ^2\), we get

$$\begin{aligned} (****)\le 2\Vert w\Vert ^2\rho (\varepsilon )\le 2\rho (\varepsilon ) \end{aligned}$$

provided that \(M\ge 4M_*(\varepsilon )\). So, in all cases, as long as M is sufficiently large,

$$\begin{aligned} (****)\le 2\delta ^2+2\rho (\varepsilon ). \end{aligned}$$

Finally, take any \(M>0\) such that \(M\ge \max (16/\varepsilon ,8/(\delta ^{16}\varepsilon ),4M_*(\varepsilon ))\) and such that \(2h(\delta ^8M/4)/\delta ^{14}\le \varepsilon \); then the desired result follows from

$$\begin{aligned} (*)+(**)+(***)+(****)\le 4\varepsilon +2\delta ^2+2\rho (\varepsilon ). \end{aligned}$$

Remark 2.2

(Alternative proof of (WTP-a)) An anonymous referee has suggested the following alternative proof of (WTP-a), which is shorter, but relies on a non-elementary inequality. For any \(n\ge 1\) and any \(y=(y_1,\ldots ,y_n)\in S^{n-1}\), Petrov’s inequality [40, §2.6.26 p. 83]—a consequence of Fuk and Nagaev’s inequality [40, (2.33) p. 61]—yields:

$$\begin{aligned} {\mathbb {E}}(w(\langle X_n,y\rangle )) \le \sum _{k=1}^n{\mathbb {E}}(w(2y_kX_{n,k}))+2e^2\int _0^\infty \frac{dw(x)}{\left( 1+\frac{1}{2}x^2\right) ^2} \end{aligned}$$

for any even function \(w:{\mathbb {R}}\rightarrow {\mathbb {R}}_+\) which is non-decreasing on \({\mathbb {R}}_+\) with \(w(0)=0\). Now take \(w(x):=x^2{\mathbf {1}}_{\{x^2>M\}}\) where \(M>0\) is fixed. Then, on the one hand, for any \(1\le k\le n\), as \(y_k^2\le 1\), we have, denoting by \(X_{n,k}\) the k-th component of \(X_n\),

$$\begin{aligned} {\mathbb {E}}(w(2y_kX_{n,k})) =4y_k^2{\mathbb {E}}\left( X_{n,k}^2{\mathbf {1}}_{\{(2y_kX_{n,k})^2>M\}}\right) \le 4y_k^2{\mathbb {E}}\left( X_{n,k}^2{\mathbf {1}}_{\{(2X_{n,k})^2>M\}}\right) \end{aligned}$$

while on the other hand,

$$\begin{aligned} \int _0^\infty \frac{dw(x)}{\left( 1+\frac{1}{2}x^2\right) ^2} =\frac{M}{\left( 1+\frac{1}{2}M\right) ^2}+\int _{\sqrt{M}}^\infty \frac{dx^2}{\left( 1+\frac{1}{2}x^2\right) ^2} \le \frac{4}{M}+\frac{4}{M+2} \le \frac{8}{M}, \end{aligned}$$

and we obtain the following inequality:

$$\begin{aligned} {\mathbb {E}}\left( \langle X_n,y\rangle ^2{\mathbf {1}}_{\{\langle X_n,y\rangle ^2>M\}}\right) \le 4\max _{1\le k\le n}{\mathbb {E}}\left( X_{n,k}^2{\mathbf {1}}_{\left\{ X_{n,k}^2 >\frac{M}{4}\right\} }\right) +\frac{8}{M}. \end{aligned}$$

This implies that the family \(\{\langle X_n,y\rangle ^2:n\ge 1,y\in S^{n-1}\}\) is uniformly integrable.

  • Proof of the first part (improved (WTP-b)) As before, we assume that \(\xi \) is a random variable with zero mean and unit variance, and denote by \((\xi _i)\) a sequence of i.i.d. copies of \(\xi \). Let us recall a kind of weak law of large numbers for weighted sums, taken from [49, Lemma 5], which can be seen as a consequence of Lévy’s continuity theorem for characteristic functions: if \({(\eta _i)}_{i\in I}\) is a sequence (finite or infinite) of i.i.d. real random variables with zero mean then for every \(\varepsilon >0\) there exists \(\delta >0\) which may depend on the law of the random variables with the following property: for every deterministic sequence \({(t_i)}_{i\in I}\) in \([0,\infty )\) such that \({\left\| t\right\| }_1:=\sum _{i\in I}t_i=1\) and \({\left\| t\right\| }_\infty :=\max _{i\in I}t_i\le \delta \), we have

    $$\begin{aligned} {\mathbb {P}}\left( \left| \sum _{i\in I}t_i\eta _i\right| >\varepsilon \right) <\varepsilon . \end{aligned}$$

    Setting \(\eta _i:=\xi _i^2-1\), it follows that there exists \(h:(0,1]\rightarrow (0,1]\) such that, given any \(\varepsilon >0\) and a sequence \({(t_i)}_{i\in I}\) in \([0,\infty )\) with \({\left\| t\right\| }_1=1\) and \({\left\| t\right\| }_\infty \le h(\varepsilon )\), we have

    $$\begin{aligned} {\mathbb {P}}\left( \left| \sum _{i\in I}t_i\xi _i^2-1\right| >\varepsilon \right) \le \varepsilon . \end{aligned}$$
    (2.1)

    Without loss of generality, one can take h strictly monotone (i.e. invertible).

We now proceed similarly to [48, Proposition 1.3]. Fix \(n\in {\mathbb {N}}\) and let \({\mathrm {P}}\) be a non-zero orthogonal projection of \({\mathbb {R}}^n\) of rank r. Let \(X={(\xi _k)}_{1\le k\le n}\) be a random vector of \({\mathbb {R}}^n\) with \(\xi _1,\ldots ,\xi _n\) i.i.d. copies of \(\xi \). We have

$$\begin{aligned} {\left\| {\mathrm {P}}X\right\| }^2={\langle X,{\mathrm {P}}X\rangle }. \end{aligned}$$

Let us also denote the matrix of \({\mathrm {P}}\) in the canonical basis as \({\mathrm {P}}\). We have \({\mathrm {P}}^2={\mathrm {P}}={\mathrm {P}}^\top \) and \({\mathrm {tr}}({\mathrm {P}})={\mathrm {rank}}({\mathrm {P}})=r\). Let \({\mathrm {P}}_0\) be the matrix obtained from \({\mathrm {P}}\) by zeroing the diagonal. We have \({\mathrm {P}}-{\mathrm {P}}_0={\mathrm {diag}}({\mathrm {P}})\). A standard decoupling inequality proved in [50] (see also the book [22]) states that for any convex function F,

$$\begin{aligned} {\mathbb {E}}(F({\langle X,{\mathrm {P}}_0X\rangle }))\le {\mathbb {E}}(F(4{\langle X,{\mathrm {P}}_0X'\rangle })), \end{aligned}$$
(2.2)

where \(X'\) is an independent copy of X. In particular the choice \(F(u)=u^2\) gives

$$\begin{aligned} {\mathbb {E}}({\langle X,{\mathrm {P}}_0X\rangle }^2)\le 16\,{\mathbb {E}}({\langle X,{\mathrm {P}}_0X'\rangle }^2). \end{aligned}$$

Now recall that if Z is a centered random vector of \({\mathbb {R}}^n\) with covariance matrix \(\Gamma \) and if B is a \(n\times n\) matrix then \({\mathbb {E}}({\langle Z,BZ\rangle })={\mathrm {tr}}(\Gamma B^\top )\). Seeing X and \(X'\) as column vectors,

$$\begin{aligned} {\mathbb {E}}({\langle X,{\mathrm {P}}_0X'\rangle }^2)&={\mathbb {E}}\left( X'^\top {\mathrm {P}}_0 XX^\top {\mathrm {P}}_0X'\right) \\&={\mathbb {E}}\left( X'^\top {\mathrm {P}}_0 {\mathbb {E}}(XX^\top ) {\mathrm {P}}_0X'\right) \\&={\mathbb {E}}\left( X'^\top {\mathrm {P}}_0^2X'\right) \\&={\mathrm {tr}}\left( {\mathrm {P}}_0^2\right) \\&\le 2{\mathrm {tr}}\left( ({\mathrm {P}}-{\mathrm {P}}_0)^2\right) +2{\mathrm {tr}}({\mathrm {P}}^2)\\&\le 2{\mathrm {tr}}({\mathrm {P}}^2)+2{\mathrm {tr}}({\mathrm {P}}^2)\\&= 4r. \end{aligned}$$

Therefore

$$\begin{aligned} {\mathbb {E}}{\left( {\langle X,{\mathrm {P}}_0 X\rangle }^2\right) }\le 64r \end{aligned}$$

and by Markov’s inequality we get

$$\begin{aligned} {\mathbb {P}}\left( {\left| {\langle X,{\mathrm {P}}_0X\rangle }\right| }>r^{3/4}\right) \le \frac{64}{\sqrt{r}}. \end{aligned}$$
(2.3)

Next note that

$$\begin{aligned} {\langle X,({\mathrm {P}}-{\mathrm {P}}_0)X\rangle } =\sum _{i=1}^n {\mathrm {P}}_{i,i}\xi _i^2 \end{aligned}$$

with \(0\le P_{i,i}\le 1\) and \(\sum _{i=1}^n P_{i,i}=r\). Hence taking \(t_i=P_{i,i}/r\) and \(\varepsilon :=h^{-1}(1/r)\) with \(\varepsilon :=1\) if \(1/r\) is outside the range of h, we get, using (2.1),

$$\begin{aligned} {\mathbb {P}}{\left( {\left| {\langle X,({\mathrm {P}}-{\mathrm {P}}_0)X\rangle }-r\right| }>\varepsilon r\right) }\le \varepsilon . \end{aligned}$$

Finally, by combining with (2.3) we obtain

$$\begin{aligned} {\mathbb {P}}{\left( {\left| {\langle X,{\mathrm {P}}X\rangle }-r\right| }>\left( \varepsilon +r^{-1/4}\right) r\right) }\le \varepsilon +\frac{64}{\sqrt{r}} \end{aligned}$$

and this implies the desired result, namely

$$\begin{aligned} {\mathbb {P}}{\left( {\left| {\langle X,{\mathrm {P}}X\rangle }-r\right| }>o(r)\right) }= o(1). \end{aligned}$$
  • Proof of the second part (improved (STP)) Let \(\xi \) be a random variable with zero mean, unit variance and a finite fourth moment. Further, let \({\mathrm {P}}=({\mathrm {P}}_{ij})\), \({\mathrm {P}}_0\) and r have the same meaning as before, and \(X=(\xi _k)_{1\le k\le n}\), where \(\xi _k\)’s are i.i.d. copies of \(\xi \). For any \(u>0\) we have

    $$\begin{aligned} {\mathbb {P}}{\left( {\left| \Vert {\mathrm {P}}X\Vert ^2-r\right| }>u\right) }&\le {\mathbb {P}}{\left( {\left| {\langle X,{\mathrm {P}}_0 X\rangle }\right| }>u/2\right) } +{\mathbb {P}}{\left( {\left| {\langle X,({\mathrm {P}}-{\mathrm {P}}_0) X\rangle }-r\right| }>u/2\right) }\\&=:(*)+(**). \end{aligned}$$

    Let us estimate the last two quantities \((*)\) and \((**)\) separately. In both cases we will compute selected moments and use Markov’s inequality. Set \(B:={\mathbb {E}}(\xi _k^4)\). Note that \(B\ge 1\) since \({\mathbb {E}}\xi ^2=1\). Since \(\xi _k\)’s are independent, we get, for any unit vector \(y=(y_k)_{1\le k\le n}\),

    $$\begin{aligned} {\mathbb {E}}({\langle X,y\rangle }^4)=\sum _i{\mathbb {E}}\left( \xi _i^4\right) y_i^2y_i^2 +3\sum _{i\ne j}{\mathbb {E}}\left( \xi _i^2\right) {\mathbb {E}}\left( \xi _j^2\right) y_i^2y_j^2 \le \max (B,3).\quad \quad \quad \end{aligned}$$
    (2.4)

    The decoupling inequality (2.2) with \(F(u)=u^4\) gives

    $$\begin{aligned} {\mathbb {E}}({\langle X,{\mathrm {P}}_0X\rangle }^4)\le 256\,{\mathbb {E}}({\langle X,{\mathrm {P}}_0X'\rangle }^4), \end{aligned}$$

    where \(X'\) is an independent copy of X. Next, in view of (2.4), we get

    $$\begin{aligned} {\mathbb {E}}({\langle X,{\mathrm {P}}_0X'\rangle }^4) \le {\mathbb {E}}(\Vert {\mathrm {P}}_0X'\Vert ^4)\max _{{\left\| y\right\| }=1}{\mathbb {E}}({\langle X,y\rangle }^4) \le \max (B,3){\mathbb {E}}(\Vert {\mathrm {P}}_0X'\Vert ^4). \end{aligned}$$

    Since \(X'\) has independent coordinates of zero mean and unit variance, we get

    $$\begin{aligned} {\mathbb {E}}(\Vert {\mathrm {P}}_0X'\Vert ^4)&\le 8{\mathbb {E}}(\Vert {\mathrm {P}}X'\Vert ^4) +8{\mathbb {E}}(\Vert ({\mathrm {P}}-{\mathrm {P}}_0)X'\Vert ^4)\\&= 8{\mathbb {E}}\left( \sum \limits _{i,j}{\mathrm {P}}_{ij}\xi _i\xi _j\right) ^2+ 8{\mathbb {E}}\left( \sum \limits _{i=1}^n\xi _i^2{{\mathrm {P}}_{ii}}^2\right) ^2\\&\le 8\max (B,2)\sum \limits _{i,j}{{\mathrm {P}}_{ij}}^2+8\sum \limits _{i\ne j}{\mathrm {P}}_{ii}{\mathrm {P}}_{jj} +8B\sum _{i,j}{{\mathrm {P}}_{ii}}^2{{\mathrm {P}}_{jj}}^2\\&\le 8\max (B,2)r+8r^2+8Br^2\\&\le 32Br^2. \end{aligned}$$

    Hence, \({\mathbb {E}}({\langle X,{\mathrm {P}}_0X\rangle }^4)\le (256Br)^2\), and applying Markov’s inequality, we get the following bound for \((*)\):

    $$\begin{aligned} (*)\le \frac{(1024Br)^2}{u^4}. \end{aligned}$$

Let us turn to estimating \((**)\). We will use symmetrization, truncation, and concentration. Let \({\widetilde{X}}=({\widetilde{\xi }}_k)_{1\le k\le n}\) be an independent copy of X. Note that

$$\begin{aligned} {\mathbb {E}}(({\langle {\widetilde{X}},({\mathrm {P}}-{\mathrm {P}}_0) {\widetilde{X}}\rangle }-r)^2) ={\mathbb {E}}\left( \left( \sum _{k=1}^n{\mathrm {P}}_{kk}({{\widetilde{\xi }}_k}^2-1)\right) ^2\right) \le Br, \end{aligned}$$

so, using the independence of X and \({\widetilde{X}}\) and applying Markov’s inequality, we get

$$\begin{aligned} (**)&\le 2{\mathbb {P}}{\left( {\left| {\langle X,({\mathrm {P}}-{\mathrm {P}}_0) X\rangle } -{\langle {\widetilde{X}},({\mathrm {P}}-{\mathrm {P}}_0) {\widetilde{X}}\rangle }\right| }>u/2-\sqrt{2Br}\right) }\\&= 2{\mathbb {P}}\left( \left| \sum _{k=1}^n {\mathrm {P}}_{kk}\left( {\xi _k}^2-{{\widetilde{\xi }}_k}^2\right) \right| >u/2-\sqrt{2Br}\right) . \end{aligned}$$

Clearly, the variables \({\xi _k}^2-{{\widetilde{\xi }}_k}^2\) (\(1\le k\le n\)) are symmetrically distributed, with the variance bounded from above by 2B. Let \(h:{\mathbb {R}}_+\rightarrow {\mathbb {R}}_+\) be a function defined as

$$\begin{aligned} h(t):={\mathbb {E}}{\left( \left( {\xi _k}^2-{{\widetilde{\xi }}_k}^2\right) ^2\chi _t\right) }, \end{aligned}$$

where \(\chi _t\) denotes the indicator of the event \(\{|{\xi _k}^2-{{\widetilde{\xi }}_k}^2|\ge t\}\). Clearly, \(h(t)\rightarrow 0\) when t tends to infinity. Note that, by Hoeffding’s inequality, we have

$$\begin{aligned} {\mathbb {P}}\left( \left| \sum _{k=1}^n{\mathrm {P}}_{kk}\left( {\xi _k}^2 -{{\widetilde{\xi }}_k}^2\right) \left( 1-\chi _{r^{1/4}}\right) \right| >u/4\right)&\le 2\exp \left( -\frac{2u^2}{16\sqrt{r}} \left( 4\sum \nolimits _{k}{{\mathrm {P}}_{kk}}^2\right) ^{-1}\right) \\&\le 2\exp \left( -u^2/(32r\sqrt{r})\right) . \end{aligned}$$

On the other hand, since the random variable \(({\xi _k}^2-{{\widetilde{\xi }}_k}^2)\chi _{r^{1/4}}\) has zero mean, we have

$$\begin{aligned} {\mathbb {E}}\left( \left( \sum _{k=1}^n{\mathrm {P}}_{kk}\left( {\xi _k}^2-{{\widetilde{\xi }}_k}^2\right) \chi _{r^{1/4}}\right) ^2\right) \le r h(r^{1/4}). \end{aligned}$$

Applying Markov’s inequality, we get for all \(u>4\sqrt{2Br}\),

$$\begin{aligned} {\mathbb {P}}\left( \left| \sum _{k=1}^n{\mathrm {P}}_{kk}\left( {\xi _k}^2-{{\widetilde{\xi }}_k}^2\right) \chi _{r^{1/4}}\right| >u/4-\sqrt{2Br}\right) \le \frac{r h(r^{1/4})}{(u/4-\sqrt{2Br})^2}. \end{aligned}$$

Combining the estimates, we obtain

$$\begin{aligned} (**)\le&\,\, 2{\mathbb {P}}\left( \left| \sum _{k=1}^n{\mathrm {P}}_{kk}\left( {\xi _k}^2-{{\widetilde{\xi }}_k}^2\right) (1-\chi _{r^{1/4}})\right|>u/4\right) \\&+ 2{\mathbb {P}}\left( \left| \sum _{k=1}^n{\mathrm {P}}_{kk}\left( {\xi _k}^2-{{\widetilde{\xi }}_k}^2\right) \chi _{r^{1/4}} \right|>u/4-\sqrt{2Br}\right) \\ \le&\,\, 4\exp \left( -u^2/(32r\sqrt{r})\right) +\frac{128r h(r^{1/4})}{u^2} \quad \text {for all}\, u>8 \sqrt{2Br}. \end{aligned}$$

Finally, grouping together the bounds for \((*)\) and \((**)\), we get for all \(u>8\sqrt{2Br}\),

$$\begin{aligned} {\mathbb {P}}{\left( {\left| \Vert {\mathrm {P}}X\Vert ^2-r\right| }>u\right) } \le \frac{(1024Br)^2}{u^4} +4\exp \left( -u^2/(32r\sqrt{r})\right) +\frac{128r h(r^{1/4})}{u^2}. \end{aligned}$$

Set \(\alpha \in (0,1/6)\). For any \(u>8\sqrt{2B}r^{1-\alpha }=o(r)\), using the inequality \(e^{-2t}\le 1/t^4\) for \(t>0\), the right hand side of the last equation above is bounded above by

$$\begin{aligned} \frac{r}{u^2}\frac{(1024 B)^2r}{u^2} +\frac{r}{u^2}4\frac{64^4r^{5}}{u^6} +\frac{r}{u^2}128h(r^{1/4}) =\frac{o(r)}{u^2}. \end{aligned}$$

This proves the desired result. We note that the proof can be shortened and simplified under the stronger assumption that \({\mathbb {E}}(|\xi _1|^p)<\infty \) for some \(p>4\), see also [48, Proposition 1.3] for thin-shell estimates in the same spirit. \(\square \)

Proof of Proposition 1.5

This proof was suggested by an anonymous referee. Let \(X_n\), \(n\in {\mathbb {N}}\), be as in (1.1) and assume that \(\lim _{n\rightarrow \infty }\frac{n}{m_n}=\rho \) with \(\rho \in (0,1)\), in other words that \(\underline{\rho }={\overline{\rho }}\in (0,1)\). It is known—see [2, Theorem 2.4]—that a sufficient condition for the Marchenko–Pastur theorem (1.5) to hold is

$$\begin{aligned} \lim _{n\rightarrow \infty }\sup _{A}\frac{1}{n}{\mathbb {E}}{\left| X_n^\top AX_n-{\mathrm {tr}}(A)\right| }=0 \end{aligned}$$

where the supremum runs over all \(n\times n\) complex matrices such that \({\left\| A\right\| }\le 1\) (let us note that a similar condition was exploited in [36]). Recall that \({\left\| A\right\| }=\max _{{\left\| x\right\| }=1}{\left\| Ax\right\| }=s_{\max }(A)\). Note that the mapping \(A\mapsto v^\top Av-{\mathrm {tr}}(A)\) is linear for any \(v\in {\mathbb {R}}^n\). Actually, as \({\left\| \mathfrak {R}A\right\| },{\left\| \mathfrak {I}A\right\| }\le {\left\| A\right\| }\), one can consider only real matrices in the above condition. Moreover, as \(v^\top Av=0={\mathrm {tr}}(A)\) when A is antisymmetric, it is enough to verify the condition for real symmetric matrices. Furthermore, by splitting the spectrum it can be shown that one can consider only real symmetric positive semi-definite matrices. Now, if \(A\ne 0\) is a real symmetric positive semi-definite matrix, then \(A/{\left\| A\right\| }\) can be represented as a convex linear combination of some orthogonal projectors. Indeed, write

$$\begin{aligned} \frac{A}{{\left\| A\right\| }}&=\sum _{k=1}^n\lambda _kv_kv_k^\top \\&=\lambda _n\sum _{k=1}^nv_kv_k^\top +(\lambda _{n-1}-\lambda _n)\sum _{k=1}^{n-1}v_kv_k^\top +\cdots +(\lambda _1-\lambda _{2})v_1v_1^\top \\&=:\lambda _n\Pi ^{(n)}+(\lambda _{n-1}-\lambda _n)\Pi ^{(n-1)} +\cdots +(1-\lambda _{2})\Pi ^{(1)}, \end{aligned}$$

where \({(v_k)}_{1\le k\le n}\) is an orthonormal basis of eigenvectors of \(A/{\left\| A\right\| }\) associated to the eigenvalues \(\lambda _1,\ldots ,\lambda _n\) ordered as \(0\le \lambda _n\le \lambda _{n-1}\le \cdots \le \lambda _1=1\). Now, due to the convexity of the \(L^1\)-norm, we get that a sufficient condition for the Marchenko–Pastur theorem to hold is

$$\begin{aligned} \lim _{n\rightarrow \infty }\sup _\Pi \frac{1}{n}{\mathbb {E}}{\left| X_n^\top \Pi X_n-{\mathrm {tr}}(\Pi )\right| }=0, \end{aligned}$$

where the supremum runs over all orthogonal projectors in \({\mathbb {R}}^n\). In turn, the latter is equivalent to

$$\begin{aligned} \lim _{n\rightarrow \infty }\frac{1}{n}{\mathbb {E}}{\left| X_n^\top \Pi _n X_n-{\mathrm {tr}}(\Pi _n)\right| }=0 \end{aligned}$$
(2.5)

for any sequence \({(\Pi _n)}_{n\ge 1}\) where \(\Pi _n\) is an orthogonal projector in \({\mathbb {R}}^n\) for every \(n\ge 1\). Since \(X_n\) is isotropic for every \(n\ge 1\), it turns out that the latter is in turn equivalent to

$$\begin{aligned} \frac{1}{n}(X_n^\top \Pi _n X_n-{\mathrm {tr}}(\Pi _n)) \overset{{\mathbb {P}}}{\underset{n\rightarrow \infty }{\longrightarrow }}0 \end{aligned}$$
(2.6)

for any sequence \({(\Pi _n)}_{n\ge 1}\) where \(\Pi _n\) is an orthogonal projector in \({\mathbb {R}}^n\) for every \(n\ge 1\). To see that (2.5) and (2.6) are equivalent, consider variables \(Z_n:=\frac{1}{n}(X_n^\top \Pi _nX_n-{\mathrm {tr}}(\Pi _n))+1\ge 0\) which satisfy \({\mathbb {E}}(Z_n)=1\) for every \(n\ge 1\). Note that, assuming condition (2.6), we have \(Z_n\overset{{\mathbb {P}}}{\rightarrow }1\), and then use the fact that if a sequence \({(Z_n)}_{n\ge 1}\) satisfies \(Z_n\overset{{\mathbb {P}}}{\rightarrow }Z\), \({\mathbb {E}}(|Z_n|)\rightarrow {\mathbb {E}}(|Z|)\), and \({\mathbb {E}}(|Z|)<\infty \) then \({\mathbb {E}}(|Z_n-Z|)\rightarrow 0\), see for instance [46, Problem 18, p. 211].
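
The convex decomposition of \(A/{\left\| A\right\| }\) into orthogonal projectors used a few lines above can be checked numerically. The following sketch (Python with NumPy; the matrix size and the seed are illustrative assumptions) rebuilds a random positive semi-definite matrix from the telescoping projectors and verifies that the coefficients are non-negative and sum to one.

```python
# Numerical check of the decomposition A/||A|| = sum_k c_k Pi^(k) into orthogonal
# projectors; the matrix size and the seed are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(7)
n = 8
G = rng.standard_normal((n, n))
A = G @ G.T                                          # symmetric positive semi-definite
M = A / np.linalg.norm(A, 2)                         # normalized so that lambda_max = 1
lam, V = np.linalg.eigh(M)                           # ascending eigenvalues, lam[-1] = 1

# Pi^(j) projects onto the span of the j leading eigenvectors, j = 1, ..., n
projectors = [V[:, n - j:] @ V[:, n - j:].T for j in range(1, n + 1)]
coeffs = np.append(np.diff(lam)[::-1], lam[0])       # lambda_j - lambda_{j+1}, then lambda_min

B = sum(c * P for c, P in zip(coeffs, projectors))
print(np.allclose(B, M), np.isclose(coeffs.sum(), 1.0), bool(np.all(coeffs >= -1e-12)))
```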

Let us show now that \(\mathrm {(WTP)}\) implies (2.6). We may assume without loss of generality that \({\mathrm {rank}}(\Pi _n)\rightarrow \infty \): indeed, since \({\mathbb {E}}(X_n^\top \Pi _nX_n)={\mathrm {rank}}(\Pi _n)={\mathrm {tr}}(\Pi _n)\), the quantity in (2.6) tends to 0 in \(L^1\), hence in probability, along any subsequence on which the ranks stay bounded. By (WTP-b), there exists a sequence \({(\varepsilon _n)}_{n\ge 1}\) such that \(\varepsilon _n\rightarrow 0\) and, denoting \(P_n:=\frac{1}{{\mathrm {rank}}(\Pi _n)}X_n^\top \Pi _nX_n\), we have

$$\begin{aligned} \lim _{n\rightarrow \infty }{\mathbb {P}}(P_n>1+\varepsilon _n)=0. \end{aligned}$$

We only need to show that \(P_n\overset{{\mathbb {P}}}{\rightarrow }1\). For every fixed real number \(M>1\), we have

$$\begin{aligned} 1={\mathbb {E}}(P_n) ={\mathbb {E}}(P_n{\mathbf {1}}_{\{P_n\le 1+\varepsilon _n\}}) +{\mathbb {E}}(P_n{\mathbf {1}}_{\{1+\varepsilon _n<P_n\le M\}}) +{\mathbb {E}}(P_n{\mathbf {1}}_{\{P_n>M\}}). \end{aligned}$$

First note that \({\mathbb {E}}(P_n{\mathbf {1}}_{\{1+\varepsilon _n<P_n\le M\}})\le M{\mathbb {P}}(P_n>1+\varepsilon _n)\underset{n\rightarrow \infty }{\longrightarrow }0\). Now by (WTP-a), the family \({\mathcal {A}}:=\{\langle X_n,y\rangle ^2:n\ge 1,y\in S^{n-1}\}\) is uniformly integrable. Hence, the family

$$\begin{aligned} {\mathcal {B}}:= \bigcup _{n\ge 1}\left\{ \frac{X_n^\top \Pi X_n}{{\mathrm {rank}}(\Pi )}: \Pi \ne 0\text { is an orthogonal projector in }{\mathbb {R}}^n\right\} \end{aligned}$$

is also uniformly integrable, because, denoting by \({(v_k)}_{1\le k\le n}\) the orthonormal basis of eigenvectors of \(\Pi \),

$$\begin{aligned} \frac{X_n^\top \Pi X_n}{{\mathrm {rank}}(\Pi )} =\frac{1}{{\mathrm {rank}}(\Pi )}\sum _{k=1}^{{\mathrm {rank}}(\Pi )}\langle X_n,v_k\rangle ^2 \end{aligned}$$

is a convex linear combination of some elements of \({\mathcal {A}}\).

Next, the uniform integrability of \({\mathcal {B}}\) implies that \(\lim _{M\rightarrow \infty }\sup _{n\ge 1}{\mathbb {E}}(P_n{\mathbf {1}}_{\{P_n>M\}})=0\) and then

$$\begin{aligned} \varliminf _{n\rightarrow \infty }{\mathbb {E}}(P_n{\mathbf {1}}_{\{P_n\le 1+\varepsilon _n\}})\ge 1. \end{aligned}$$

Since \({\mathbb {E}}(P_n{\mathbf {1}}_{\{P_n\le 1+\varepsilon _n\}})\le 1+\varepsilon _n\), the latter is possible only if \({\mathbb {E}}(P_n{\mathbf {1}}_{\{P_n\le 1+\varepsilon _n\}})\rightarrow 1\). Moreover, for every \(\delta \in (0,1)\) we have \({\mathbb {E}}(P_n{\mathbf {1}}_{\{P_n\le 1+\varepsilon _n\}})\le 1+\varepsilon _n-\delta \,{\mathbb {P}}(P_n\le 1-\delta )\), so necessarily \({\mathbb {P}}(P_n\le 1-\delta )\rightarrow 0\). Together with \({\mathbb {P}}(P_n>1+\varepsilon _n)\rightarrow 0\), this yields \(P_n\overset{{\mathbb {P}}}{\rightarrow }1\). \(\square \)

3 Proof of Theorem 1.6

As in [14, 48], for every \(n\times n\) symmetric matrix S and \(t\in {\mathbb {R}}\setminus \{\lambda _1(S),\ldots ,\lambda _n(S)\}\), we set

$$\begin{aligned} \underline{m}_S(t) := {\mathrm {tr}}((S-t{\mathrm {I}}_n)^{-1})=\sum _{k=1}^n\frac{1}{\lambda _k(S)-t}. \end{aligned}$$

The function \(t\mapsto \underline{m}_S(t)\) is positive and strictly increasing on \((-\infty ,\lambda _{\min }(S))\). We note that \(n^{-1}\underline{m}_S\) is the restriction to \({\mathbb {R}}\setminus \{\lambda _1(S),\ldots ,\lambda _n(S)\}\) of the Cauchy–Stieltjes transform of the empirical spectral distribution of S.
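For instance, for the zero matrix \(S=0\) and any \(t<0\) we simply have

$$\begin{aligned} \underline{m}_{0}(t)={\mathrm {tr}}((-t{\mathrm {I}}_n)^{-1})=\frac{n}{-t}, \end{aligned}$$

a value that will reappear below: for the starting point \(u_0=n-\sqrt{mn}\) of the induction in Sect. 3.3 one gets \(\underline{m}_{A^{(0)}}(u_0)=n/(\sqrt{mn}-n)\).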

Let us give an overview of the proof. Assume that \(n\le m\), set \(B:=\sum _{k=1}^m X^{(k)}\otimes X^{(k)}\), and let

$$\begin{aligned} B^{(0)}:=0 \quad \text {and}\quad B^{(k)}:=B^{(k-1)}+X^{(k)}\otimes X^{(k)},\quad k=1,\dots ,m, \end{aligned}$$

where \(X^{(k)}\) (\(k\le m\)) are independent isotropic random vectors in \({\mathbb {R}}^n\). We want to find a good lower bound for \(\lambda _{\min }(B)\). As was mentioned in the introduction, we define inductively a sequence of numbers \(u_0,u_1,\dots ,u_m\) such that \(u_k<\lambda _{\min }(B^{(k)})\) (\(k\le m\)) with probability one, implying \({\mathbb {E}}\lambda _{\min }(B)\ge {\mathbb {E}}u_m\). More precisely, we set \(u_0:=n-\sqrt{mn}<0\), and for every \(k\le m\), define \(u_k:=u_{k-1}+\delta ^k-\delta ^k_R\), where both \(\delta ^k\) and \(\delta ^k_R\) are random non-negative numbers. The quantities \(\delta ^k\) and \(\delta ^k_R\) are designed so that \(\mathbf{1)}\) \(u_{k-1}+\delta ^k\) (and hence \(u_k\)) stays to the left of the spectrum of \(B^{(k)}\) with probability one, and \(\mathbf{2)}\) the sum of expectations \(\sum _{k=1}^m {\mathbb {E}}(\delta ^k-\delta ^k_R)\) is close to \(m-\sqrt{mn}\), implying that \({\mathbb {E}}u_m\) is close to the (scaled) left edge of the Marchenko–Pastur distribution, \((\sqrt{m}-\sqrt{n})^2\). The numbers \(\delta ^k\), in a non-random “rank one update” setting, are defined in Lemma 3.2 below. The definition is quite technical, and even the inequality \(u_{k-1}+\delta ^k<\lambda _{\min }(B^{(k)})\) is not immediately clear. To take care of this condition, we utilize the concept of a feasible lower shift, which was previously applied in [48]. With the help of an auxiliary Lemma 3.1, based on the Sherman–Morrison formula, we verify that our \(\delta ^k\) is a feasible lower shift, meaning that condition \(\mathbf{1}\) is satisfied.
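For orientation, the arithmetic behind this target value is simply

$$\begin{aligned} u_0+\left( m-\sqrt{mn}\right) =\left( n-\sqrt{mn}\right) +\left( m-\sqrt{mn}\right) =(\sqrt{m}-\sqrt{n})^2. \end{aligned}$$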

Up to this point, the argument is quite similar to [48]. Still, let us note that the formula for \(\delta ^k\) which we use in our paper is different from the one in [48] (the reader might compare Lemma 3.2 with [48, Lemma 2.3]). Indeed, unlike in [48], here we are interested in getting a sharp limiting result; hence, our estimates for the expectation (condition \(\mathbf{2}\)) must be more precise than in [48]. However, just “tuning” the definition of \(\delta ^k\) from [48] turns out to be insufficient to reach our goal; that is the reason for introducing regularity shifts \(\delta ^k_R\), which are the main technical novelty of the proof. We set

$$\begin{aligned} \delta _R^k:=&\min \left\{ \ell \in \{0,1,\ldots \}:\, \underline{m}_{B^{(k)}}(u_{k-1}+\delta ^k-\ell )\right. \\&\left. -\,\underline{m}_{B^{(k)}}(u_{k-1}+\delta ^k-\ell -1)\le \frac{1}{\varepsilon n}\right\} , \end{aligned}$$

where \(\varepsilon >0\) is a parameter which can be made arbitrarily small. The impact of the regularity shifts might seem counter-intuitive at first glance, since they move the \(u_k\)’s away from the spectrum, increasing the gap between \(u_k\) and \(\lambda _{\min }(B^{(k)})\). For every \(k<m\), the purpose of \(\delta ^k_R\) is to guarantee at the next step some concentration inequalities for the quadratic form

$$\begin{aligned} (X^{(k+1)})^\top (B^{(k)}-u_k)^{-1}X^{(k+1)} \end{aligned}$$

(this form will appear below as \(q_1\)). The nature of this concentration property is revealed in Lemma 3.3, which in turn enables us to obtain a satisfactory estimate for the expectation of \(\delta ^k\) in Lemma 3.4. In fact, the condition (3.4) in Lemma 3.3 is exactly what is provided by \(\delta ^k_R\)’s. Note that the regularity shifts stay useful only if their cumulative effect is of lower order of magnitude compared to \((\sqrt{m}-\sqrt{n})^2\). In Lemma 3.5, we show that

$$\begin{aligned} \sum _{k=1}^m\delta _R^k\le \frac{\varepsilon n^2}{\sqrt{mn}-n} \end{aligned}$$

deterministically; thus, taking \(\varepsilon \) arbitrarily small, we can estimate the left edge of the spectrum of B with the required precision.

We find it useful to separate deterministic (linear algebraic) arguments from probabilistic estimates. In Sect. 3.1 below, we work with a non-random matrix A and a fixed vector x, thus performing a non-random rank one update to \(A+xx^\top \). The shift \(\delta ^k\) is defined deterministically as a function of the matrix and the vector. In the next Sect. 3.2, we replace the fixed vector x with a random isotropic vector X, but the matrix A stays non-random. Finally, in Sect. 3.3, we make the setting “fully random” (the way it was described above, but with matrices \(A^{(k)}\) in place of \(B^{(k)}\)’s).

3.1 Feasible lower shift

Let A be an \(n\times n\) positive semi-definite non-random matrix with eigenvalues \(\lambda _{\max }:=\lambda _1\ge \cdots \ge \lambda _n=:\lambda _{\min }\ge 0\) and a corresponding orthonormal basis of eigenvectors \((x_i)_{i=1}^n\), and let \(u<\lambda _{\min }\). Further, assume that x is a (non-random) vector in \({\mathbb {R}}^n\). We are interested in those numbers \(\delta \ge 0\) that (deterministically) satisfy

$$\begin{aligned} \lambda _{\min }>u+\delta \quad \text {and}\quad \underline{m}_{A+xx^\top }(u+\delta )\le \underline{m}_A(u). \end{aligned}$$
(3.1)

Following [48], any value of \(\delta \) satisfying (3.1) will be called a feasible lower shift with respect to A, x and u. The following statement is taken from [48]; we provide its proof for the reader’s convenience.

Lemma 3.1

(Feasible lower shift—[48, Lemma 2.2]) Let \(\delta \ge 0\) be such that \(u+\delta <\lambda _{\min }\). Let us define

$$\begin{aligned} q_1(\delta )&:=x^\top (A-u-\delta )^{-1}x =\sum _{i=1}^n\frac{{\langle x,x_i\rangle }^2}{\lambda _i-u-\delta }\\ q_2(\delta )&:=\frac{x^\top (A-u-\delta )^{-2}x}{{\mathrm {tr}}((A-u-\delta )^{-2})} = \left( \sum _{i=1}^n(\lambda _i-u-\delta )^{-2}\right) ^{-1} \sum _{i=1}^n\frac{\langle x,x_i\rangle ^2}{\left( \lambda _i-u-\delta \right) ^2}. \end{aligned}$$

Then a sufficient condition for (3.1) to be satisfied is \( q_2(\delta )\ge \delta (1+q_1(\delta )). \)

Proof

If S is a symmetric matrix, x is a vector, and both S and \(S+xx^\top \) are invertible, then \(1+x^\top S^{-1} x\ne 0\): indeed, \(x^\top S^{-1}x=-1\) would give \((S+xx^\top )S^{-1}x=x(1+x^\top S^{-1}x)=0\), hence \(S^{-1}x=0\) and \(x=0\), contradicting \(x^\top S^{-1}x=-1\). Moreover, the inverse \((S+xx^\top )^{-1}\) of the rank one update \(S+xx^\top \) of S can be expressed as

$$\begin{aligned} (S+xx^\top )^{-1} = S^{-1}-\frac{S^{-1}xx^\top S^{-1}}{1+x^\top S^{-1}x}. \end{aligned}$$
(3.2)

This is known as the Sherman–Morrison formula. This allows us to write

$$\begin{aligned} \underline{m}_{A+xx^\top }(u+\delta )&={\mathrm {tr}}((A-u-\delta +xx^\top )^{-1}) \\&={\mathrm {tr}}((A-u-\delta )^{-1}) -\frac{x^\top (A-u-\delta )^{-2}x}{1+x^\top (A-u-\delta )^{-1}x}\\&=\underline{m}_A(u)+{\mathrm {tr}}((A-u-\delta )^{-1}-(A-u)^{-1})\\&\quad -\frac{x^\top (A-u-\delta )^{-2}x}{1+x^\top (A-u-\delta )^{-1}x}. \end{aligned}$$

Now since \(A-(u+\delta ){\mathrm {I}}_n\) is positive definite, it follows that

$$\begin{aligned} \delta (A-u-\delta )^{-2}-((A-u-\delta )^{-1}-(A-u)^{-1}) \end{aligned}$$

is positive semidefinite (it is diagonal in the eigenbasis of A, with entries \(\delta ^2(\lambda _i-u-\delta )^{-2}(\lambda _i-u)^{-1}\ge 0\)), and therefore

$$\begin{aligned} \underline{m}_{A+xx^\top }(u+\delta )-\underline{m}_A(u) \le \delta {\mathrm {tr}}((A-u-\delta )^{-2}) -\frac{x^\top (A-u-\delta )^{-2}x}{1+x^\top (A-u-\delta )^{-1}x}. \end{aligned}$$

Finally, the right-hand side equals \({\mathrm {tr}}((A-u-\delta )^{-2})\left( \delta -\frac{q_2(\delta )}{1+q_1(\delta )}\right) \), which is \(\le 0\) precisely when \(\delta (1+q_1(\delta ))-q_2(\delta )\le 0\). \(\square \)

Lemma 3.2

(Construction of the feasible shift) Let A, x, u and \(q_1,q_2\) be as above, \(\varepsilon \in (0,1)\) and assume that

$$\begin{aligned} \lambda _{\min }-u\ge 2/\varepsilon ^2. \end{aligned}$$

Then the quantity

$$\begin{aligned} \delta :=\frac{(1-\varepsilon ) {\mathbf {1}}_{\{q_1(1/\varepsilon )\le (1+\varepsilon )\underline{m}_A(u)+\varepsilon \}}}{(1+\varepsilon )(1+\underline{m}_A(u))} \left( \sum _{i=1}^n(\lambda _i-u)^{-2}\right) ^{-1} \sum _{i=1}^n\frac{\langle x,x_i\rangle ^2{\mathbf {1}}_{\{\langle x,x_i\rangle ^2\le 1/\varepsilon \}}}{\left( \lambda _i-u\right) ^2} \end{aligned}$$

satisfies \(\delta \le 1/\varepsilon \) and is a feasible lower shift w.r.t. A, x and u, i.e. \(\lambda _{\min }>u+\delta \) and \(q_2(\delta )\ge \delta (1+q_1(\delta ))\).

Proof

First, note that the condition \(\lambda _{\min }-u\ge 2/\varepsilon ^2\) gives \(\varepsilon (\lambda _i-u)^2\ge 2(\lambda _i-u)/\varepsilon \) for every \(i\le n\), and hence

$$\begin{aligned} \varepsilon (\lambda _i-u)^2-\frac{2(\lambda _i-u)}{\varepsilon }+\frac{1}{\varepsilon ^2} \ge 0, \quad i=1,\dots ,n, \end{aligned}$$

which is in turn equivalent to the relation

$$\begin{aligned} \frac{1}{(1-\varepsilon )\left( \lambda _i-u\right) ^2} \ge \frac{1}{\left( \lambda _i-u-1/\varepsilon \right) ^2},\quad i=1,\dots ,n. \end{aligned}$$
(3.3)

Now, let us return to \(\delta \). The inequality \(\delta \le 1/\varepsilon \) follows directly from its definition: the prefactor is at most one (as \(\underline{m}_A(u)\ge 0\)), while the weighted average of the truncated terms \(\langle x,x_i\rangle ^2{\mathbf {1}}_{\{\langle x,x_i\rangle ^2\le 1/\varepsilon \}}\) does not exceed \(1/\varepsilon \). Next, since \(q_1(\cdot )\) is an increasing function on \([0,\lambda _{\min }-u)\), we have

$$\begin{aligned} \frac{(1+q_1(\delta )){\mathbf {1}}_{\{q_1(1/\varepsilon )\le (1+\varepsilon )\underline{m}_A(u)+\varepsilon \}}}{(1+\varepsilon )(1+\underline{m}_A(u))}\le \frac{(1+q_1(1/\varepsilon )){\mathbf {1}}_{\{q_1(1/\varepsilon )\le (1+\varepsilon )\underline{m}_A(u)+\varepsilon \}}}{(1+\varepsilon )(1+\underline{m}_A(u))}\le 1. \end{aligned}$$

Thus, we obtain

$$\begin{aligned} \delta (1+q_1(\delta ))\le (1-\varepsilon )q_2(0). \end{aligned}$$

Finally, note that by the definition of \(q_2\) and in view of relation (3.3), we have

$$\begin{aligned} q_2(\delta )&=\left( \sum _{i=1}^n(\lambda _i-u-\delta )^{-2}\right) ^{-1} \sum _{i=1}^n\frac{\langle x,x_i\rangle ^2}{\left( \lambda _i-u-\delta \right) ^2}\\&\ge \left( \sum _{i=1}^n(\lambda _i-u-1/\varepsilon )^{-2}\right) ^{-1} \sum _{i=1}^n\frac{\langle x,x_i\rangle ^2}{\left( \lambda _i-u-\delta \right) ^2}\\&\ge \left( (1-\varepsilon )^{-1}\sum _{i=1}^n(\lambda _i-u)^{-2}\right) ^{-1} \sum _{i=1}^n\frac{\langle x,x_i\rangle ^2}{\left( \lambda _i-u-\delta \right) ^2}\\&\ge \left( (1-\varepsilon )^{-1}\sum _{i=1}^n(\lambda _i-u)^{-2}\right) ^{-1} \sum _{i=1}^n\frac{\langle x,x_i\rangle ^2}{\left( \lambda _i-u\right) ^2}\\&=(1-\varepsilon )q_2(0), \end{aligned}$$

and the statement follows. \(\square \)
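To get a feeling for the size of \(\delta \), it may help to look at the very first step of the scheme of Sect. 3.3 below (this computation is purely illustrative and is not used in the sequel): there \(A=0\) and \(u=u_0=n-\sqrt{mn}\), so that \(\lambda _i-u=\sqrt{mn}-n\) for every \(i\le n\), \(\underline{m}_A(u)=n/(\sqrt{mn}-n)\), and the weighted sum in the definition of \(\delta \) reduces to an ordinary average. Hence, on the event where the indicator equals one,

$$\begin{aligned} \delta =\frac{1-\varepsilon }{(1+\varepsilon )\left( 1+\frac{n}{\sqrt{mn}-n}\right) } \cdot \frac{1}{n}\sum _{i=1}^n\langle x,x_i\rangle ^2{\mathbf {1}}_{\{\langle x,x_i\rangle ^2\le 1/\varepsilon \}} =\frac{1-\varepsilon }{1+\varepsilon }\left( 1-\sqrt{\frac{n}{m}}\right) \cdot \frac{1}{n}\sum _{i=1}^n\langle x,x_i\rangle ^2{\mathbf {1}}_{\{\langle x,x_i\rangle ^2\le 1/\varepsilon \}}, \end{aligned}$$

so that, typically, \(\delta \) is of size about \(1-\sqrt{n/m}\), in agreement with Lemma 3.4 below.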

3.2 Randomization and control of expectations

Let, as before, A be an \(n\times n\) non-random positive semidefinite matrix with eigenvalues \(\lambda _1\ge \dots \ge \lambda _n\ge 0\), and let \(u<\lambda _n\). We define a (random) quantity \(\delta \) as in Lemma 3.2, replacing the fixed vector x with a random isotropic vector X.

Lemma 3.3

Let \(\varepsilon \in \left( 0,12^{-3}\right) \), \(n\ge 12/\varepsilon ^4\), and let X be a random isotropic vector in \({\mathbb {R}}^n\) such that

$$\begin{aligned} {\mathbb {P}}{\left( \Vert {\mathrm {P}}X\Vert ^2-{\mathrm {rank}}{\mathrm {P}}\ge \varepsilon {\mathrm {rank}}{\mathrm {P}}/6\right) }\le \varepsilon ^4 \end{aligned}$$

for any non-zero orthogonal projection \({\mathrm {P}}\) of rank at least \(\varepsilon ^{11}n/72\). Further, let the non-random matrix A, the numbers \((\lambda _i)_{i\le n}\) and vectors \((x_i)_{i\le n}\) be as above, and \(u\in {\mathbb {R}}\) be such that \(\lambda _{\min }-u\ge 6\varepsilon ^{-2}+\varepsilon ^{-1}\). Assume additionally that

$$\begin{aligned} \sum _{i=1}^n\frac{1}{(\lambda _i-u)(\lambda _i-u+1)}\le \frac{1}{\varepsilon n}. \end{aligned}$$
(3.4)

Then, with \(q_1\) defined as in Lemma 3.1 (with X replacing the non-random vector x), we have

$$\begin{aligned} {\mathbb {P}}\left\{ q_1(1/\varepsilon )+1\ge (1+\varepsilon )(1+\underline{m}_A(u))\right\} \le 4\varepsilon ^2. \end{aligned}$$

Proof

First, note that the lower bound on \(\lambda _{\min }-u\) implies that \(q_1(1/\varepsilon )\le (1+\varepsilon /6)q_1(0)\) (deterministically). Let us split the index set \(\{1,\dots ,n\}\) into several subsets in the following way: First, let \(I:=\{i\le n:\,\lambda _i-u\le \varepsilon ^4 n/12\}\). Next, we set

$$\begin{aligned} J:=\{i\le n:\,\lambda _i-u\ge 6n/\varepsilon ^3\}, \end{aligned}$$

so that

$$\begin{aligned} \{1,\dots ,n\}\setminus (I\cup J) =\left\{ i\le n:\,\varepsilon ^4 n/12< \lambda _i-u<6n/\varepsilon ^3\right\} . \end{aligned}$$

Note that, by the choice of \(\varepsilon \), we have \(\exp (1/(12\varepsilon ))\ge 72/\varepsilon ^7\). Hence, the interval \((\ln (\varepsilon ^4n/12),\ln (6n/\varepsilon ^3))\) can be partitioned into \(\lfloor \varepsilon ^{-2}\rfloor \) subintervals \({{\mathcal {S}}}_k\) (\(k\le \varepsilon ^{-2}\)) of length at most \(\varepsilon /6\) each. Then we let

$$\begin{aligned} I_k:=\left\{ i\le n:\,\ln (\lambda _i-u)\in {{\mathcal {S}}}_k\right\} , \quad k\le \varepsilon ^{-2}. \end{aligned}$$

Obviously,

$$\begin{aligned} q_1(0) =\sum _{i\in I}\frac{{\langle X,x_i\rangle }^2}{\lambda _i-u} +\sum _{i\in J}\frac{{\langle X,x_i\rangle }^2}{\lambda _i-u} +\sum _{k\le \varepsilon ^{-2}}\sum _{i\in I_k}\frac{{\langle X,x_i\rangle }^2}{\lambda _i-u} =(*)+(**)+(***). \end{aligned}$$

Let us estimate the three quantities separately.

First, in view of the condition (3.4), the lower bound on n and the definition of I, we have

$$\begin{aligned} {\mathbb {E}}(*)=\sum _{i\in I}\frac{1}{\lambda _i-u}\le \sum _{i\in I}\frac{\varepsilon ^4n/12+1}{(\lambda _i-u)(\lambda _i-u+1)} \le \frac{\varepsilon ^3}{6}. \end{aligned}$$

Hence, by Markov’s inequality,

$$\begin{aligned} {\mathbb {P}}\left\{ (*)\ge \varepsilon /6\right\} \le \varepsilon ^2. \end{aligned}$$

Similarly,

$$\begin{aligned} {\mathbb {E}}(**)=\sum _{i\in J}\frac{1}{\lambda _i-u}\le \frac{\varepsilon ^3}{6}, \end{aligned}$$

whence

$$\begin{aligned} {\mathbb {P}}\left\{ (**)\ge \varepsilon /6\right\} \le \varepsilon ^2. \end{aligned}$$

Now, we consider the quantity \((***)\). First, assume that for some \(k\le \varepsilon ^{-2}\) we have \(|I_k|\le \varepsilon ^{11} n/72\). Since for every \(i\in I_k\) we have \(\lambda _i-u\ge \varepsilon ^4 n/12\), we obtain

$$\begin{aligned} {\mathbb {E}}\sum _{i\in I_k}\frac{{\langle X,x_i\rangle }^2}{\lambda _i-u}\le \frac{\varepsilon ^7}{6}, \end{aligned}$$

whence, by Markov’s inequality,

$$\begin{aligned} {\mathbb {P}}\left\{ \sum _{i\in I_k}\frac{{\langle X,x_i\rangle }^2}{\lambda _i-u} \ge \varepsilon ^3/6\right\} \le \varepsilon ^4. \end{aligned}$$

Now, if for some \(k\le \varepsilon ^{-2}\) we have \(|I_k|>\varepsilon ^{11} n/72\), then, by the condition on projections and the definition of \(I_k\), denoting by \(P_k\) the orthogonal projection onto the span of \((x_i)_{i\in I_k}\), we obtain

$$\begin{aligned} {\mathbb {P}}\left\{ \sum _{i\in I_k}\frac{{\langle X,x_i\rangle }^2}{\lambda _i-u}\ge \sum _{i\in I_k}\frac{\exp (\varepsilon /6)(1+\varepsilon /6)}{\lambda _i-u}\right\}\le & {} {\mathbb {P}}\left\{ \Vert {\mathrm {P}}_k X\Vert ^2\ge (1+\varepsilon /6){\mathrm {rank}}{\mathrm {P}}_k\right\} \\\le & {} \varepsilon ^4. \end{aligned}$$

Combining all the above estimates together, we get

$$\begin{aligned} {\mathbb {P}}&\left\{ q_1(1/\varepsilon ) \ge (1+\varepsilon /6) \left( \varepsilon /2+\exp (\varepsilon /6)(1+\varepsilon /6)\underline{m}_A(u)\right) \right\} \\&\le {\mathbb {P}}\left\{ (*)+(**)+(***) \ge \varepsilon /6+\varepsilon /6+\varepsilon /6+ \exp (\varepsilon /6)(1+\varepsilon /6)\underline{m}_A(u)\right\} \\&\le 4\varepsilon ^2. \end{aligned}$$

It remains to note that

$$\begin{aligned} (1+\varepsilon /6)\left( \varepsilon /2 +\exp (\varepsilon /6)(1+\varepsilon /6)\underline{m}_A(u)\right) +1 \le (1+\varepsilon )(1+\underline{m}_A(u)). \end{aligned}$$

\(\square \)
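Let us also record an elementary consequence of condition (3.4), which may clarify its role (it is not used explicitly below): since each index i with \(\lambda _i-u\le T\) contributes at least \(1/(T(T+1))\) to the sum in (3.4), we have, for every \(T>0\),

$$\begin{aligned} \#\{i\le n:\,\lambda _i-u\le T\}\le \frac{T(T+1)}{\varepsilon n}. \end{aligned}$$

In other words, (3.4) prevents the eigenvalues of A from clustering too heavily just to the right of u.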

Lemma 3.4

(Control of \({\mathbb {E}}\delta \)) Let \(\varepsilon \in \left( 0,12^{-3}\right) \), \(n\ge 12/\varepsilon ^4\), let A, \((\lambda _i)_{i\le n}\), \((x_i)_{i\le n}\) and X be the same as in Lemma 3.3 (and satisfy the same conditions), and let \(\delta \) be the random shift defined at the beginning of this subsection. Assume additionally that \({\mathbb {E}}(\langle X,x_i\rangle ^2{\mathbf {1}}_{\{\langle X,x_i\rangle ^2\le 1/\varepsilon \}})\ge r\) (\(i\le n\)) for some \(r\in [0,1]\). Then

$$\begin{aligned} {\mathbb {E}}\delta \ge \frac{(1-\varepsilon )r}{(1+\varepsilon )(1+\underline{m}_A(u))}-4\varepsilon . \end{aligned}$$

Proof

Since

$$\begin{aligned} \delta= & {} \frac{(1-\varepsilon )(1-{\mathbf {1}}_{\{q_1(1/\varepsilon )> (1+\varepsilon )\underline{m}_A(u)+\varepsilon \}})}{(1+\varepsilon )(1+\underline{m}_A(u))}\\&\times \,\left( \sum _{i=1}^n(\lambda _i-u)^{-2}\right) ^{-1} \sum _{i=1}^n\frac{\langle X,x_i\rangle ^2{\mathbf {1}}_{\{\langle X,x_i\rangle ^2\le 1/\varepsilon \}}}{\left( \lambda _i-u\right) ^2}, \end{aligned}$$

and in view of the bound \(\delta \le 1/\varepsilon \), we get

$$\begin{aligned} {\mathbb {E}}\delta \ge&\frac{(1-\varepsilon )}{(1+\varepsilon )(1+\underline{m}_A(u))} \left( \sum _{i=1}^n(\lambda _i-u)^{-2}\right) ^{-1} \sum _{i=1}^n\frac{ {\mathbb {E}}\left( \langle X,x_i\rangle ^2{\mathbf {1}}_{\{\langle X,x_i\rangle ^2\le 1/\varepsilon \}}\right) }{\left( \lambda _i-u\right) ^2}\\&-\,\varepsilon ^{-1}{\mathbb {P}}\left\{ q_1(1/\varepsilon )> (1+\varepsilon )\underline{m}_A(u)+\varepsilon \right\} . \end{aligned}$$

Finally, applying Lemma 3.3 to the last expression, we get the result. \(\square \)

3.3 Proof of Theorem 1.6, completed

Let \((X_n)_{n\in {\mathbb {N}}}\) be as in the statement of the theorem. Without loss of generality, we can assume that both functions \(f,g:{\mathbb {N}}\rightarrow {\mathbb {R}}_+\) in the Weak Tail Projection property (WTP-b) are non-increasing. Additionally let us define

$$\begin{aligned} h(M):=\sup _{\begin{array}{c} n\in {\mathbb {N}}\\ y\in S^{n-1} \end{array}} {\mathbb {E}}\left( \langle X_n,y\rangle ^2{\mathbf {1}}_{\{\langle X_n,y\rangle ^2\ge M\}}\right) , \quad M\ge 0 \end{aligned}$$

Note that (WTP-a) gives \(\lim _{M\rightarrow \infty }h(M)=0\).

Take any \(0<\varepsilon <\min \left( 12^{-3},({\overline{\rho }}^{-1/2}-1)^2/4\right) ,\) and define \(n_\varepsilon \) as the smallest integer greater than \(12/\varepsilon ^4\) such that

  (a) \(g(\varepsilon ^{11}n_\varepsilon /72)\le \varepsilon ^4\) and \(f(\varepsilon ^{11}n_\varepsilon /72)\le \varepsilon /6\), and

  (b) for all \(n\ge n_\varepsilon \) we have \(m_n/n\ge 3{\overline{\rho }}^{-1}/4+1/4\).

Note that condition (b) implies that \((\sqrt{m_n/n}-1)^{-1}\le 2/({\overline{\rho }}^{-1/2}-1)\), whence \((\sqrt{m_n/n}-1)^2\ge \varepsilon \) for all \(n\ge n_\varepsilon \). From now on, we fix an \(n\ge n_\varepsilon \), let \(m:=m_n\) and let \(X^{(1)},\dots ,X^{(m)}\) be i.i.d. copies of \(X_n\). We define

$$\begin{aligned} A^{(0)}:=0 \quad \text {and}\quad A^{(k)}:=A^{(k-1)}+X^{(k)}\otimes X^{(k)},\quad 1\le k\le m, \end{aligned}$$

so that \(A_n=A^{(m)}\). Set

$$\begin{aligned} u_0:=n-\sqrt{mn} \end{aligned}$$

and let \(u_1,\dots ,u_m\) be a collection of random numbers defined inductively as follows:

$$\begin{aligned} u_k:=u_{k-1}+\delta ^k-\delta _R^k, \end{aligned}$$

where \(\delta ^k\) is defined as \(\delta \) in Lemma 3.2 (with \(A^{(k-1)}\), \(u_{k-1}\) and \(X^{(k)}\) replacing A, u and x, respectively) and the regularity shift \(\delta _R^k\) is defined by

$$\begin{aligned} \delta _R^k:=&\min \left\{ \ell \in \{0,1,\ldots \}:\, \underline{m}_{A^{(k)}}(u_{k-1}+\delta ^k-\ell )\right. \\&\left. -\,\underline{m}_{A^{(k)}}(u_{k-1}+\delta ^k-\ell -1)\le \frac{1}{\varepsilon n}\right\} . \end{aligned}$$

Note that Lemmas 3.1 and 3.2, together with the monotonicity of \(t\mapsto \underline{m}_{A^{(k)}}(t)\) (the regularity shift only moves \(u_k\) further to the left of the spectrum), imply that, deterministically for all \(k\ge 0\),

$$\begin{aligned} \underline{m}_{A^{(k)}}(u_k) \le \underline{m}_{A^{(0)}}(u_0) =\frac{n}{\sqrt{mn}-n}. \end{aligned}$$

The ultimate purpose of the shift \(\delta _R^k\) is to guarantee relation (3.4) which was an important condition in proving the concentration Lemma 3.3. Of course, since \(\delta _R^k\) moves \(u_k\) away from the spectrum, one must make sure that the cumulative impact of the shifts \(\delta _R^k\) is small enough and does not destroy the desired asymptotic estimate.

Lemma 3.5

With \(\delta _R^k\) (\(1\le k\le m\)) defined above, the following holds deterministically:

$$\begin{aligned} \sum _{k=1}^m\delta _R^k\le \frac{\varepsilon n^2}{\sqrt{mn}-n}. \end{aligned}$$

Proof

Take any admissible \(k\ge 1\). The definition of \(\delta _R^k\) immediately implies that for all \(0\le \ell <\delta _R^k\) we have

$$\begin{aligned} \underline{m}_{A^{(k)}}(u_{k-1}+\delta ^k-\ell )- \underline{m}_{A^{(k)}}(u_{k-1}+\delta ^k-\ell -1)>\frac{1}{\varepsilon n}. \end{aligned}$$

Hence,

$$\begin{aligned} \underline{m}_{A^{(k)}}(u_k)= & {} \underline{m}_{A^{(k)}}\left( u_{k-1}+\delta ^k-\delta _R^k\right) \\\le & {} \underline{m}_{A^{(k)}}(u_{k-1}+\delta ^k)-\frac{\delta _R^k}{\varepsilon n}\\\le & {} \underline{m}_{A^{(k-1)}}(u_{k-1})-\frac{\delta _R^k}{\varepsilon n}. \end{aligned}$$

Thus, \(\underline{m}_{A^{(k)}}(u_k)\le \underline{m}_{A^{(k-1)}}(u_{k-1})-\frac{\delta _R^k}{\varepsilon n}\) for all \(k\ge 1\), which, together with the relations \(0<\underline{m}_{A^{(m)}}(u_m)\) and \(\underline{m}_{A^{(0)}}(u_0)=\frac{n}{\sqrt{mn}-n}\) implies the result. \(\square \)

Now, fix for a moment any \(k\in \{1,\dots ,m\}\) and let \(\mathcal {F}_{k-1}\) be the \(\sigma \)-algebra generated by the vectors \(X^{(1)},\dots ,X^{(k-1)}\), with the convention \(\mathcal {F}_0=\{\varnothing ,\Omega \}\). We will first estimate the conditional expectation \({\mathbb {E}}(\delta ^k\,|\,\mathcal {F}_{k-1})\).

Note that, by the definition of \(u_{k-1}\), we have (deterministically)

$$\begin{aligned} \underline{m}_{A^{(k-1)}}(u_{k-1})-\underline{m}_{A^{(k-1)}}(u_{k-1}-1) \le \frac{1}{\varepsilon n} \end{aligned}$$

(for \(k>1\) the above relation holds in view of the definition of \(\delta ^{k-1}_R\), and for \(k=1\) because, by the definition of \(n_\varepsilon \), \((\sqrt{mn}-n)^2\ge \varepsilon n^2\)). Together with the lower bound on n (which implies the conditions on orthogonal projections assumed in Lemma 3.4) and the condition

$$\begin{aligned} \underline{m}_{A^{(k-1)}}(u_{k-1}) \le \underline{m}_{A^{(0)}}(u_{0}) =\frac{n}{\sqrt{mn}-n}, \end{aligned}$$

we get from Lemma 3.4 that

$$\begin{aligned} {\mathbb {E}}(\delta ^k\,|\,\mathcal {F}_{k-1})\ge & {} \frac{(1-\varepsilon )(1-h(1/\varepsilon ))}{(1+\varepsilon )(1+\underline{m}_{A^{(0)}}(u_{0}))}-4\varepsilon \\= & {} \frac{(1-\varepsilon )(1-h(1/\varepsilon ))}{1+\varepsilon }\left( 1-\sqrt{\frac{n}{m}}\right) -4\varepsilon . \end{aligned}$$

Hence, by the definition of \(u_k\)’s and Lemma 3.5, we obtain

$$\begin{aligned} {\mathbb {E}}u_m\ge u_0+\frac{(1-\varepsilon )(1-h(1/\varepsilon ))}{1+\varepsilon } \left( m-\sqrt{mn}\right) -4\varepsilon m -\frac{\varepsilon n^2}{\sqrt{mn}-n}. \end{aligned}$$

Since \(u_m<\lambda _{\min }(A_n)\) (deterministically), we get from the above relation

$$\begin{aligned} {\mathbb {E}}\lambda _{\min }(A_n)&\ge (\sqrt{m}-\sqrt{n})^2- (3\varepsilon +h(1/\varepsilon ))\left( m-\sqrt{mn}\right) -4\varepsilon m -\frac{\varepsilon n^2}{\sqrt{mn}-n}\\&\ge (\sqrt{m}-\sqrt{n})^2-7\varepsilon m-h(1/\varepsilon )m-\frac{2\varepsilon n}{\overline{\rho }^{-1/2}-1}. \end{aligned}$$

Since the above estimate holds for arbitrarily small \(\varepsilon \), and since \(h(1/\varepsilon )\rightarrow 0\) as \(\varepsilon \rightarrow 0\), we get the desired result.
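Although it plays no role in the argument, the limiting value can easily be probed numerically. The following minimal sketch (Gaussian samples, toy dimensions and a handful of repetitions chosen purely for illustration; it assumes the numpy library) compares the empirical average of \(\lambda _{\min }(A_n)\) with \((\sqrt{m}-\sqrt{n})^2\).

    import numpy as np

    rng = np.random.default_rng(0)
    n, m, trials = 100, 400, 20          # toy dimensions, for illustration only

    smallest = []
    for _ in range(trials):
        X = rng.standard_normal((m, n))  # rows: i.i.d. isotropic (Gaussian) vectors in R^n
        A = X.T @ X                      # A_n = sum_k X^(k) (X^(k))^T
        smallest.append(np.linalg.eigvalsh(A)[0])   # eigvalsh returns eigenvalues in ascending order

    print("average of lambda_min(A_n):", np.mean(smallest))
    print("(sqrt(m)-sqrt(n))^2:       ", (np.sqrt(m) - np.sqrt(n)) ** 2)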

4 Proof of Theorem 1.8

As in [14, 48], for any \(n\times n\) symmetric matrix S and every \(t\in {\mathbb {R}}\setminus \{\lambda _1(S),\ldots ,\lambda _n(S)\}\), we set

$$\begin{aligned} {\overline{m}}_S(t) := {\mathrm {tr}}((t{\mathrm {I}}_n-S)^{-1})=\sum _{k=1}^n\frac{1}{t-\lambda _k(S)}. \end{aligned}$$

The function \(t\mapsto {\overline{m}}_S(t)=-\underline{m}_S(t)\) is positive and strictly decreasing on \((\lambda _1(S),+\infty )\).
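As in Sect. 3, the zero matrix provides the starting point of the induction: for \(S=0\) and \(t>0\) we have \({\overline{m}}_{0}(t)=n/t\), so that, with the initial barrier \(u_0=n+\sqrt{mn}\) of Sect. 4.3,

$$\begin{aligned} {\overline{m}}_{A^{(0)}}(u_0)=\frac{n}{n+\sqrt{mn}}=\frac{\sqrt{n}}{\sqrt{n}+\sqrt{m}}<1. \end{aligned}$$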

The scheme of the proof of Theorem 1.8 in many respects resembles the proof of Theorem 1.6 from the previous section, so we prefer to omit a detailed discussion here and make just a few remarks. The concepts of feasible shifts and regularity shifts play an analogous role; this time we deal with feasible upper shifts, again following [48]. Since the case of the largest eigenvalue is technically more complicated, our feasible upper shift \(\Delta ^k\) is split into two components \(\Delta ^k_1\) and \(\Delta ^k_2\), each designed to guarantee certain conditions on quadratic forms, and their expectations are estimated separately. As with the smallest eigenvalue, the purpose of the regularity shifts \(\Delta ^k_R\), in cooperation with the Strong Tail Projection property, is to enable us to apply some concentration inequalities, which are used to obtain sharp estimates on the expectations of the shifts \(\Delta ^k\).

4.1 Feasible upper shift

Let A be an \(n\times n\) positive semi-definite non-random matrix with eigenvalues \(\lambda _{\max }:=\lambda _1\ge \cdots \ge \lambda _n=:\lambda _{\min }\ge 0\) and a corresponding orthonormal basis of eigenvectors \((x_i)_{i=1}^n\), and let \(u>\lambda _1\) be such that \({\overline{m}}_A(u)<1\). Further, assume that x is a (non-random) vector in \({\mathbb {R}}^n\). In this section, we consider those numbers \(\Delta \ge 0\) that (deterministically) satisfy

$$\begin{aligned} \lambda _{\max }(A+xx^\top )<u+\Delta \quad \text {and}\quad {\overline{m}}_{A+xx^\top }(u+\Delta )\le {\overline{m}}_A(u). \end{aligned}$$
(4.1)

Following [48], any value of \(\Delta \) satisfying (4.1) will be called a feasible upper shift with respect to A, x and u. The following statement is taken from [48]; we provide its proof for completeness.

Lemma 4.1

(Feasible upper shift—[48, Lemma 3.3]) Let \(u>\lambda _1\) and \({\overline{m}}_A(u)<1\). For any \(\Delta >0\), define

$$\begin{aligned} Q_1(\Delta )&:=x^\top (u+\Delta -A)^{-1}x =\sum _{i=1}^n\frac{\langle x,x_i\rangle ^2}{u+\Delta -\lambda _i} \\ Q_2(\Delta )&:= \frac{x^\top (u+\Delta -A)^{-2}x}{{\overline{m}}_A(u) -{\overline{m}}_A(u+\Delta )} =\left( {\overline{m}}_A(u)-{\overline{m}}_A(u+\Delta )\right) ^{-1} \sum _{i=1}^n\frac{\langle x,x_i\rangle ^2}{\left( u+\Delta -\lambda _i\right) ^2}. \end{aligned}$$

Then a sufficient condition for (4.1) to be satisfied is

$$\begin{aligned} Q_1(\Delta )<1\quad \text {and}\quad Q_2(\Delta )\le 1-Q_1(\Delta ). \end{aligned}$$

Proof

We can assume without loss of generality that \(x\ne 0\), so that \(Q_2(\Delta )>0\). The Sherman–Morrison formula (3.2) gives

$$\begin{aligned} {\overline{m}}_{A+xx^\top }(u+\Delta )&={\mathrm {tr}}((u+\Delta ){\mathrm {I}}_n-A-xx^\top )^{-1}\\&={\mathrm {tr}}((u+\Delta ){\mathrm {I}}_n-A)^{-1} +\frac{\left\| ((u+\Delta ){\mathrm {I}}_n-A)^{-1}x\right\| ^2}{1-x^\top \left( (u+\Delta ){\mathrm {I}}_n-A\right) ^{-1}x}\\&={\overline{m}}_A(u+\Delta ) +\left( 1-\sum _{i=1}^n\frac{\langle x,x_i\rangle ^2}{u+\Delta -\lambda _i}\right) ^{-1} \sum _{i=1}^n\frac{\langle x,x_i\rangle ^2}{\left( u+\Delta -\lambda _i\right) ^2}. \end{aligned}$$

Hence, we have \({\overline{m}}_{A+xx^\top }(u+\Delta )\le {\overline{m}}_{A}(u)\) if and only if

$$\begin{aligned} \left( 1-\sum _{i=1}^n\frac{\langle x,x_i\rangle ^2}{u+\Delta -\lambda _i}\right) ^{-1} \sum _{i=1}^n\frac{\langle x,x_i\rangle ^2}{\left( u+\Delta -\lambda _i\right) ^2}\le {\overline{m}}_A(u)-{\overline{m}}_A(u+\Delta ). \end{aligned}$$

But if \(Q_1(\Delta )<1\), the left-hand side of the latter inequality equals \(\frac{Q_2(\Delta )}{1-Q_1(\Delta )}\left( {\overline{m}}_A(u)-{\overline{m}}_A(u+\Delta )\right) \), so the inequality indeed holds as soon as \(Q_1(\Delta )<1\) and \(Q_2(\Delta )\le 1-Q_1(\Delta )\).

Next, the rank one matrix

$$\begin{aligned} (u+\Delta -A)^{-1/2}x((u+\Delta -A)^{-1/2}x)^\top =(u+\Delta -A)^{-1/2}xx^\top (u+\Delta -A)^{-1/2} \end{aligned}$$

has eigenvalues 0 and \(\Vert (u+\Delta -A)^{-1/2}x\Vert ^2\). But the condition \(Q_1(\Delta )<1\) implies \(\Vert (u+\Delta -A)^{-1/2}x\Vert ^2=x^\top (u+\Delta -A)^{-1}x<1\). Therefore the matrix

$$\begin{aligned} {\mathrm {I}}_n-(u+\Delta -A)^{-1/2}xx^\top (u+\Delta -A)^{-1/2} \end{aligned}$$

is positive definite, implying that

$$\begin{aligned} (u+\Delta -A)^{1/2}{\mathrm {I}}_n(u+\Delta -A)^{1/2}-xx^\top =(u+\Delta ){\mathrm {I}}_n-(A+xx^\top ) \end{aligned}$$

is positive definite (recall that for any two positive definite matrices S and T, the matrix \(S^{1/2}TS^{1/2}\) is positive definite). Hence, \(u+\Delta >\lambda _{\max }(A+xx^\top )\). \(\square \)

Following [48], we will treat the quantities \(Q_1\) and \(Q_2\) separately: for a specially chosen number \(\tau \in (0,1)\), we will take \(\Delta _1\) such that \(Q_1(\Delta _1)\le \tau \), and \(\Delta _2\) such that \(Q_2(\Delta _2)\le 1-\tau \). Then, in view of the monotonicity of \(Q_1\) and \(Q_2\), the sum \(\Delta _1+\Delta _2\) will be a feasible upper shift, as spelled out below. In our case, \(\tau \) will be close to \({\overline{m}}_A(u)\).
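Explicitly, if \(Q_1(\Delta _1)\le \tau <1\) and \(Q_2(\Delta _2)\le 1-\tau \), then, since both \(Q_1\) and \(Q_2\) are non-increasing functions of \(\Delta \),

$$\begin{aligned} Q_1(\Delta _1+\Delta _2)\le Q_1(\Delta _1)\le \tau <1 \quad \text {and}\quad Q_2(\Delta _1+\Delta _2)\le Q_2(\Delta _2)\le 1-\tau , \end{aligned}$$

so the sufficient condition of Lemma 4.1 is satisfied for \(\Delta :=\Delta _1+\Delta _2\).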

Let us introduce “level sets” \(I_j\) as follows:

$$\begin{aligned} I_j:=\left\{ i\le n:\, 4^{j-1}\le u-\lambda _i< 4^j\right\} ,\quad j\in {\mathbb {N}}. \end{aligned}$$
(4.2)

Note that the condition \({\overline{m}}_A(u)<1\) immediately implies that \(|I_j|<4^j\) for all \(j\ge 1\), since \(1>{\overline{m}}_A(u)\ge \sum _{i\in I_j}\frac{1}{u-\lambda _i}>|I_j|\,4^{-j}\). Moreover, if we additionally assume that \(\max _{\ell \in {\mathbb {N}}}\frac{|I_\ell |}{16^\ell }\) is sufficiently small, then the following estimate of the ratio \(\frac{|I_j|}{4^j}\) is valid:

Lemma 4.2

Let the sets \(I_j\) be as above and assume that for some \(\varepsilon \in (0,1]\) we have

$$\begin{aligned} \max _{\ell \in {\mathbb {N}}}\frac{|I_\ell |}{16^\ell }\le \frac{1}{\varepsilon n}. \end{aligned}$$

Then

$$\begin{aligned} \frac{|I_j|}{4^j}\le \sqrt{\frac{|I_j|}{\varepsilon n}},\;\;j\in {\mathbb {N}}. \end{aligned}$$

Proof

Fix any natural j such that \(I_j\ne \varnothing \). Then, obviously,

$$\begin{aligned} \frac{|I_j|}{4^j}\le \frac{4^j}{\varepsilon n}. \end{aligned}$$

Fix for a moment any \(\alpha >0\). If \(|I_j|\ge \alpha 4^j\) then

$$\begin{aligned} \frac{|I_j|}{4^j}\le \frac{\alpha ^{-1} |I_j|}{\varepsilon n}. \end{aligned}$$

Otherwise, \(\frac{|I_j|}{4^j}\le \alpha \). It remains to choose \(\alpha :=\sqrt{\frac{|I_j|}{\varepsilon n}}\). \(\square \)

For every \(j\ge 1\), denote

$$\begin{aligned} h_j:=\sum _{i\in I_j}\langle x,x_i\rangle ^2-|I_j|. \end{aligned}$$
(4.3)

Lemma 4.3

(Definition of \(\Delta _1\)) Let A, x and u be as above, with \(u>\lambda _{\max }\) and \({\overline{m}}_A(u)<1\), and let \(\varepsilon \in (0,1/4]\) be a real parameter. Define a number \(\Delta _1\) as follows:

$$\begin{aligned} \Delta _1:=\Delta _1'+\sum _{j=1}^{\lfloor \log _4(n/\varepsilon ^2)\rfloor }\Delta _{1,j}, \end{aligned}$$

where

$$\begin{aligned} \Delta _1':=\varepsilon ^{-1/2}\Vert x\Vert ^2{\mathbf {1}}_{\{\varepsilon \Vert x\Vert ^2\ge n\}} \end{aligned}$$

and, for each natural \(j\le \log _4(n/\varepsilon ^2)\), we let

$$\begin{aligned} \Delta _{1,j}:=\varepsilon ^{-1} h_j{\mathbf {1}}_{\{h_j> \varepsilon ^2 2^j\sqrt{|I_j|}\}}. \end{aligned}$$

Let \(Q_1\) be as in Lemma 4.1. Then \(\Delta _1\) satisfies

$$\begin{aligned} Q_1(\Delta _1)\le {\overline{m}}_A(u+\Delta _1)+6\sqrt{\varepsilon }+ 8\varepsilon \sqrt{n}\sqrt{\max _{\ell \in {\mathbb {N}}}\frac{|I_\ell |}{16^\ell }}. \end{aligned}$$

Proof

Denote by J the set of all \(j\le \log _4(n/\varepsilon ^2)\) such that \(h_j> \varepsilon ^2 2^j\sqrt{|I_j|}\) and let \(J'\) be its complement inside \(\{1,\dots ,\lfloor \log _4(n/\varepsilon ^2)\rfloor \}\). Then

$$\begin{aligned} Q_1(\Delta _1)=&\,{\overline{m}}_A(u+\Delta _1) +\sum _{j=1}^\infty \sum _{i\in I_j}\frac{\langle x,x_i\rangle ^2-1}{u+\Delta _1-\lambda _i}\\ \le&\,{\overline{m}}_A(u+\Delta _1) + \sum _{j>\log _4(n/\varepsilon ^2)} \sum _{i\in I_j}\frac{\langle x,x_i\rangle ^2}{u+\Delta _1-\lambda _i}\\&+\, 4\sum _{j\in J'}\frac{h_j}{\Delta _1+4^j} +4\sum _{j\in J}\frac{h_j}{\Delta _1+4^j}\\ =:&\,{\overline{m}}_A(u+\Delta _1)+ (*)+4(**)+4(***). \end{aligned}$$

We shall estimate the last three quantities separately. First, note that

$$\begin{aligned} (*)\le \frac{4\Vert x\Vert ^2}{\Delta _1+n/\varepsilon ^2}\le \frac{4\sqrt{\varepsilon }\Vert x\Vert ^2}{\Vert x\Vert ^2{\mathbf {1}}_{\{\varepsilon \Vert x\Vert ^2\ge n\}}+n/\varepsilon ^{3/2}} \le 4\sqrt{\varepsilon }. \end{aligned}$$

Next,

$$\begin{aligned} (**)\le \varepsilon ^2\sum _{j\in J'}\sqrt{\frac{|I_j|}{4^j}} \le \varepsilon ^2\sqrt{\max _{\ell \in {\mathbb {N}}}\frac{|I_\ell |}{16^\ell }}\sum _{j\le \log _4(n/\varepsilon ^2)}2^j \le 2\varepsilon \sqrt{n}\sqrt{\max _{\ell \in {\mathbb {N}}}\frac{|I_\ell |}{16^\ell }}. \end{aligned}$$

Finally, we have

$$\begin{aligned} (***)\le \frac{\sum _{j\in J}h_j}{\sum _{j\in J}\Delta _{1,j}}\le \varepsilon . \end{aligned}$$

Summing up the estimates, we get the result. \(\square \)

Denote

$$\begin{aligned} F_2(\Delta ):=\left( \sum _{i=1}^n\frac{1}{(u+\Delta -\lambda _i)(u-\lambda _i)}\right) ^{-1} \sum _{i=1}^n\frac{\langle x,x_i\rangle ^2}{\left( u+\Delta -\lambda _i\right) ^2},\quad \Delta \ge 0 \end{aligned}$$

(clearly, \(F_2(\Delta )=\Delta Q_2(\Delta )\) for all \(\Delta >0\)).
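Indeed, the identity \(F_2(\Delta )=\Delta Q_2(\Delta )\) is nothing but the elementary computation

$$\begin{aligned} {\overline{m}}_A(u)-{\overline{m}}_A(u+\Delta ) =\sum _{i=1}^n\left( \frac{1}{u-\lambda _i}-\frac{1}{u+\Delta -\lambda _i}\right) =\Delta \sum _{i=1}^n\frac{1}{(u+\Delta -\lambda _i)(u-\lambda _i)}. \end{aligned}$$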

Lemma 4.4

(Definition of \(\Delta _2\)) Let A, u and x be as above, with \(u>\lambda _1\), and let \(\alpha >0\) satisfy \({\overline{m}}_A(u)+\alpha <1\). Define \(\Delta _2\) according to the following procedure: if

$$\begin{aligned} (1+\alpha )F_2(0)\le \alpha (u-\lambda _1)(1-{\overline{m}}_A(u)-\alpha ) \end{aligned}$$

then set

$$\begin{aligned} \Delta _2:=\frac{(1+\alpha )F_2(0)}{1-{\overline{m}}_A(u)-\alpha }, \end{aligned}$$

otherwise, take the smallest non-negative integer j such that

$$\begin{aligned} Q_2\left( 2^j(u-\lambda _1)\right) \le 1-{\overline{m}}_A(u)-\alpha \end{aligned}$$

and let

$$\begin{aligned} \Delta _2:=2^j(u-\lambda _1). \end{aligned}$$

Then \(\Delta _2=0\) whenever \(x=0\), and, for \(x\ne 0\), we have \(\Delta _2\ne 0\) and

$$\begin{aligned} Q_2(\Delta _2)\le 1-{\overline{m}}_A(u)-\alpha . \end{aligned}$$

Proof

The case \(x=0\) is trivial, so in what follows we assume that \(x\ne 0\), which implies \(\Delta _2\ne 0\). First, suppose that

$$\begin{aligned} (1+\alpha )F_2(0)\le \alpha (u-\lambda _1)(1-{\overline{m}}_A(u)-\alpha ). \end{aligned}$$

Then, in view of the definition of \(\Delta _2\), we have \(\Delta _2\le \alpha (u-\lambda _1)\), and

$$\begin{aligned} F_2(\Delta _2)&\le \left( \sum _{i=1}^n\frac{1}{(u+\Delta _2-\lambda _i)(u-\lambda _i)}\right) ^{-1} \sum _{i=1}^n\frac{\langle x,x_i\rangle ^2}{\left( u-\lambda _i\right) ^2}\\&\le (1+\alpha )F_2(0). \end{aligned}$$

Hence,

$$\begin{aligned} Q_2(\Delta _2)=F_2(\Delta _2)/\Delta _2\le 1-{\overline{m}}_A(u)-\alpha . \end{aligned}$$

If

$$\begin{aligned} (1+\alpha )F_2(0)> \alpha (u-\lambda _1)(1-{\overline{m}}_A(u)-\alpha ) \end{aligned}$$

then the statement follows directly from the definition of \(\Delta _2\). \(\square \)

4.2 Randomization and control of expectations

In this subsection, we “randomize” Lemmas 4.3 and 4.4. Let, as before, A be an \(n\times n\) (non-random) positive semidefinite matrix with eigenvalues \(\lambda _1\ge \dots \ge \lambda _n\ge 0\), and let \(u>\lambda _1\) satisfy \({\overline{m}}_A(u)<1\). We define (random) quantities \(\Delta _1\) and \(\Delta _2\) as in Lemmas 4.3 and 4.4, replacing the fixed vector x with a random isotropic vector X. In the following two lemmas, we will estimate the expectations of \(\Delta _1\) and \(\Delta _2\).

Lemma 4.5

(Control of \({\mathbb {E}}\Delta _1\)) Let \(\varepsilon \in (0,1/4]\) and let A, \(\lambda _i\), u be as above and \(I_j\) be defined according to (4.2). Assume additionally that

$$\begin{aligned} \max _{\ell \in {\mathbb {N}}}\frac{|I_\ell |}{16^\ell }\le \frac{1}{\varepsilon n}. \end{aligned}$$

Further, let \(X\in {\mathbb {R}}^n\) be an isotropic random vector and \(K\ge 1\). Assume that for any non-zero orthogonal projection \({\mathrm {P}}:{\mathbb {R}}^n\rightarrow {\mathbb {R}}^n\) we have

$$\begin{aligned} {\mathbb {P}}\left\{ \Vert {\mathrm {P}}(X)\Vert ^2-{\mathrm {rank}}{\mathrm {P}}\ge t\right\} \le \frac{g({\mathrm {rank}}{\mathrm {P}}){\mathrm {rank}}{\mathrm {P}}}{t^2},\;\;t\ge f({\mathrm {rank}}{\mathrm {P}}){\mathrm {rank}}{\mathrm {P}}, \end{aligned}$$

where functions \(f:{\mathbb {N}}\rightarrow [0,1]\) and \(g:{\mathbb {N}}\rightarrow [0,K]\) satisfy

$$\begin{aligned} g(k)\le \varepsilon ^{11/2} \quad \text {and}\quad f(k)\le \varepsilon ^2\;\;\;\;\forall k\ge \varepsilon ^{17}n. \end{aligned}$$

Define a random variable \(\Delta _1\) as in Lemma 4.3 replacing the non-random vector x with X. Then

$$\begin{aligned} {\mathbb {E}}\Delta _1\le 32K\sqrt{\varepsilon }. \end{aligned}$$

Proof

Let us estimate separately the expectations of \(\Delta _1'\) and \(\Delta _{1,j}\) (\(j\le \log _4(n/\varepsilon ^2)\)), where the quantities are defined as in Lemma 4.3. First,

$$\begin{aligned} {\mathbb {E}}\Delta _1'&=\varepsilon ^{-3/2}n{\mathbb {P}}\{\Vert X\Vert ^2\ge n/\varepsilon \} +\int _{\varepsilon ^{-3/2}n}^\infty {\mathbb {P}}\{\Vert X\Vert ^2\ge \sqrt{\varepsilon }\tau \}\,d\tau \\&\le \frac{\varepsilon ^{4}}{(\varepsilon ^{-1}-1)^2} +\int _{\varepsilon ^{-3/2}n-\varepsilon ^{-1/2}n}^\infty \frac{\varepsilon ^{9/2}n}{\tau ^2}\,d\tau \\&\le \frac{\varepsilon ^{4}}{(\varepsilon ^{-1}-1)^2}+\frac{\varepsilon ^5}{\varepsilon ^{-1}-1}\\&\le 4\varepsilon ^6. \end{aligned}$$

Next, fix any natural \(j\le \log _4(n/\varepsilon ^2)\) with \(I_j\ne \varnothing \). We let \(h_j\) be defined as in (4.3), with the random vector X replacing the non-random x. We have

$$\begin{aligned} {\mathbb {E}}\Delta _{1,j} =\varepsilon 2^j \sqrt{|I_j|}{\mathbb {P}}\left\{ h_j> \varepsilon ^2 2^j\sqrt{|I_j|}\right\} +\int _{\varepsilon 2^j\sqrt{|I_j|}}^\infty {\mathbb {P}}\{h_j\ge \varepsilon \tau \}\,d\tau . \end{aligned}$$

Note that \(h_j=\Vert {\mathrm {P}}_j(X)\Vert ^2-{\mathrm {rank}}{\mathrm {P}}_j\) and \(|I_j|={\mathrm {rank}}{\mathrm {P}}_j\), where \({\mathrm {P}}_j\) is the orthogonal projection onto the span of \(\{x_i,\,i\in I_j\}\). Let us consider two cases.

  1. \(|I_j|\ge \varepsilon ^{17}n\). Since \({\overline{m}}_A(u)<1\), we have \(|I_j|\le 4^j\), and \(\varepsilon ^2 |I_j|\le \varepsilon ^2 2^j\sqrt{|I_j|}\). Hence, from the above formula for \({\mathbb {E}}\Delta _{1,j}\) and the conditions on projections of X we obtain

    $$\begin{aligned} {\mathbb {E}}\Delta _{1,j}&\le \frac{\varepsilon ^{5/2}\sqrt{|I_j|}}{2^j} +\frac{\varepsilon ^{5/2}\sqrt{|I_j|}}{2^j}\\&\le 2^{j+1}\varepsilon ^{5/2}\sqrt{\max _{\ell \in {\mathbb {N}}}\frac{|I_\ell |}{16^\ell }}\\&\le 2^{j+1}\varepsilon ^{2} n^{-1/2}. \end{aligned}$$
  2. \(|I_j|< \varepsilon ^{17}n\). Then, in view of Lemma 4.2, \(|I_j|\le \varepsilon ^8 4^j\), implying

    $$\begin{aligned} f({\mathrm {rank}}{\mathrm {P}}_j)\le 1 \le |I_j| =\sqrt{|I_j|}\sqrt{|I_j|} \le \varepsilon ^4 2^j\sqrt{|I_j|} <\varepsilon ^2 2^j\sqrt{|I_j|}. \end{aligned}$$

    Applying again the formula for \({\mathbb {E}}\Delta _{1,j}\) together with the projection conditions and Lemma 4.2, we obtain

    $$\begin{aligned} {\mathbb {E}}\Delta _{1,j}\le \frac{2K\sqrt{|I_j|}}{\varepsilon ^3 2^j}\le \min \left( \frac{2^{j+1}K}{\varepsilon ^{7/2}\sqrt{n}},2\varepsilon K\right) . \end{aligned}$$

    Thus, in any case \({\mathbb {E}}\Delta _{1,j}\le 2^{j+1}\varepsilon ^{2} n^{-1/2}+\min \left( \frac{2^{j+1}K}{\varepsilon ^{7/2}\sqrt{n}},2\varepsilon K\right) \). Summing over all \(j\le \log _4(n/\varepsilon ^2)\), we get

    $$\begin{aligned} \sum _{j\le \log _4(n/\varepsilon ^2)}{\mathbb {E}}\Delta _{1,j}&\le 4 \varepsilon + \sum _{j\le \log _4(\varepsilon ^9 n)}\frac{2^{j+1}K}{\varepsilon ^{7/2}\sqrt{n}} +2K\varepsilon (-11\log _4\varepsilon +1)\\&\le 4\varepsilon +4K\varepsilon +22K\sqrt{\varepsilon }+2K\varepsilon , \end{aligned}$$

    and the statement follows.

\(\square \)

The next lemma does not require any assumptions on projections of X beyond the rank one (that is, one-dimensional) case.

Lemma 4.6

(Control of \({\mathbb {E}}\Delta _2\)) Let A and \(\lambda _i\) (\(i\le n\)) be as above and let X be an isotropic random vector in \({\mathbb {R}}^n\) such that for some \(\kappa ,K>0\) we have \(\max _{{\left\| y\right\| }=1}{\mathbb {E}}|\langle X,y\rangle |^{2+\kappa }\le K\), and let \(u>\lambda _1\) and \(\alpha \in (0,1/2]\) satisfy \({\overline{m}}_A(u)+\alpha <1\). Then for

$$\begin{aligned} w_{4.6}:=K2^{1+\kappa }/(\alpha ^{1+\kappa /2}(1-2^{-\kappa /2})) \end{aligned}$$

we have

$$\begin{aligned} {\mathbb {E}}\Delta _2\le \frac{1+\alpha }{1-{\overline{m}}_A(u)-\alpha } \left( 1+ \frac{w_{4.6}}{(1-{\overline{m}}_A(u)-\alpha )^{\kappa /2} (u-\lambda _1)^{\kappa /2}}\right) , \end{aligned}$$

where \(\Delta _2\) is defined as in Lemma 4.4, with X replacing x.

Proof

An inspection of the definition of \(\Delta _2\) in Lemma 4.4 immediately shows that

$$\begin{aligned} \Delta _2\le \frac{(1+\alpha )F_2(0)}{1-{\overline{m}}_A(u)-\alpha }+\sum _{j=0}^\infty 2^j(u-\lambda _1)\chi _j, \end{aligned}$$

where \(\chi _0\) is the indicator function of the event

$$\begin{aligned} \left\{ (1+\alpha )F_2(0)> \alpha (u-\lambda _1)(1-{\overline{m}}_A(u)-\alpha )\right\} \end{aligned}$$

and for each \(j>0\), \(\chi _j\) is the indicator of the event

$$\begin{aligned} \left\{ Q_2\left( 2^{j-1}(u-\lambda _1)\right) > 1-{\overline{m}}_A(u)-\alpha \right\} . \end{aligned}$$

Note that for each fixed number \(r\ge 0\), \(F_2(r)\) is a random variable representable in the form

$$\begin{aligned} F_2(r)=\sum _{i=1}^n \beta _i\langle X,x_i\rangle ^2, \end{aligned}$$

where the numbers \(\beta _i=\beta _i(r)\) are positive and \(\sum _{i=1}^n\beta _i\le 1\). By the Minkowski inequality, for any \(p\ge 1\) we have

$$\begin{aligned} \left( {\mathbb {E}}F_2(r)^{p}\right) ^{1/p}\le & {} \sum _{i=1}^n \left( {\mathbb {E}}(\beta _i\langle X,x_i\rangle ^2)^{p}\right) ^{1/p}\\= & {} \sum _{i=1}^n \beta _i\left( {\mathbb {E}}|\langle X,x_i\rangle |^{2p}\right) ^{1/p} \le \max _{{\left\| y\right\| }=1}\left( {\mathbb {E}}|\langle X,y\rangle |^{2p}\right) ^{1/p}. \end{aligned}$$

In particular, \({\mathbb {E}}F_2(r)^{1+\kappa /2}\le \max _{{\left\| y\right\| }=1}{\mathbb {E}}|\langle X,y\rangle |^{2+\kappa }\le K\). Hence, from the definition of the indicator functions \(\chi _j\) and applying Markov’s inequality, we get

$$\begin{aligned} {\mathbb {E}}\chi _0 \le K\left( \alpha (1+\alpha )^{-1}(u-\lambda _1) (1-{\overline{m}}_A(u)-\alpha )\right) ^{-1-\kappa /2}, \end{aligned}$$

and for every \(j\ge 1\):

$$\begin{aligned} {\mathbb {E}}\chi _j \le K\left( 2^{j-1}(u-\lambda _1)(1-{\overline{m}}_A(u)-\alpha ) \right) ^{-1-\kappa /2}. \end{aligned}$$

Finally, we obtain

$$\begin{aligned} {\mathbb {E}}\Delta _2\le&\, \frac{1+\alpha }{1-{\overline{m}}_A(u)-\alpha }\\&+K(u-\lambda _1)\left( \alpha (1+\alpha )^{-1}(u-\lambda _1) (1-{\overline{m}}_A(u)-\alpha )\right) ^{-1-\kappa /2}\\&+\sum _{j=1}^\infty 2^jK(u-\lambda _1) \left( 2^{j-1}(u-\lambda _1)(1-{\overline{m}}_A(u)-\alpha )\right) ^{-1-\kappa /2}\\ =&\, \frac{1+\alpha }{1-{\overline{m}}_A(u)-\alpha }+\frac{K}{(1-{\overline{m}}_A(u)-\alpha )^{1+\kappa /2}(u-\lambda _1)^{\kappa /2}}\\&\times \left( \frac{(1+\alpha )^{1+\kappa /2}}{\alpha ^{1+\kappa /2}} +\sum _{j=1}^\infty 2^{1+\kappa /2-\kappa j/2}\right) \\ \le&\, \frac{(1+\alpha )}{1-{\overline{m}}_A(u)-\alpha }\\&\times \left( 1+\frac{K(1+\alpha )^{\kappa /2}}{\alpha ^{1+\kappa /2} (1-{\overline{m}}_A(u)-\alpha )^{\kappa /2}(u-\lambda _1)^{\kappa /2}} \sum _{j=0}^\infty 2^{1+\kappa /2-\kappa j/2}\right) , \end{aligned}$$

and the result follows. \(\square \)

4.3 Proof of Theorem 1.8, completed

Let \(X_n\), \(m_n\) and \(A_n\), \(n\in {\mathbb {N}}\), be as in (1.1), (1.2), and (1.4), respectively. Without loss of generality, we can assume that both functions f and g in (STP) are non-increasing. Denote

$$\begin{aligned} \gamma :=\inf _{n\ge 1}\frac{m_n}{n}>0 \end{aligned}$$

and fix any \(\alpha \in (0,\sqrt{\gamma }/(1+\sqrt{\gamma }))\). Further, let \(\varepsilon :=\alpha ^2/225\) and let \(n_\varepsilon \in {\mathbb {N}}\) be such that

$$\begin{aligned} g(k)\le \varepsilon ^{11/2} \quad \text {and}\quad f(k)\le \varepsilon ^2\;\;\;\forall k\ge \varepsilon ^{17}n_\varepsilon \end{aligned}$$

and, additionally,

$$\begin{aligned} \frac{2(48+192g(1))\alpha ^{-3/2}}{\left( \sqrt{\gamma }/(1+\sqrt{\gamma })-\alpha \right) ^{3/2}\left( \sqrt{\varepsilon n_\varepsilon }/4\right) ^{1/2}}\le \varepsilon \end{aligned}$$

(the last condition will be needed later to simplify the estimate coming from Lemma 4.6). From now on, we fix \(n\ge n_\varepsilon \). For convenience, we let \(m:=m_n\) and \(X^{(1)},\dots ,X^{(m)}\) be i.i.d. copies of \(X_n\). We define

$$\begin{aligned} A^{(0)}:=0 \quad \text {and}\quad A^{(k)}=A^{(k-1)}+X^{(k)}\otimes X^{(k)},\quad 1\le k\le m, \end{aligned}$$

so that \(A_n=A^{(m)}\). Set

$$\begin{aligned} u_0:=n+\sqrt{mn} \end{aligned}$$

and let \(u_1,\dots ,u_m\) be a collection of random numbers defined inductively as follows: For every \(k=1,2,\dots ,m\), assuming that \(u_{k-1}\) has already been defined, we let \(I_j^k\) (\(j\in {\mathbb {N}}\)) be random subsets of \(\{1,2,\dots ,n\}\) analogous to those given in (4.2), with \(A^{(k-1)}\) and \(u_{k-1}\) replacing A and u, respectively, i.e.

$$\begin{aligned} I_j^k:=\left\{ i\le n:\, 4^{j-1}\le u_{k-1}-\lambda _i(A^{(k-1)})< 4^j\right\} ,\quad j\in {\mathbb {N}}. \end{aligned}$$

Then we set

$$\begin{aligned} u_k:=u_{k-1}+\Delta _1^k+\Delta _2^k+\Delta _R^k, \end{aligned}$$

where \(\Delta _1^k\) is taken from Lemma 4.3 (with \(A^{(k-1)}\), \(u_{k-1}\), \(X^{(k)}\) and \(I_j^k\) replacing A, u, x and \(I_j\), respectively); \(\Delta _2^k\) is defined as in Lemma 4.4 (again, with \(A^{(k-1)}\), \(u_{k-1}\) and \(X^{(k)}\) taking the place of A, u and x), and \(\Delta _R^k\) is the regularity shift defined by

$$\begin{aligned} \Delta _R^k:=&\,\min \left\{ \ell \in {\mathbb {N}}\cup \{0\}:\,{\overline{m}}_{A^{(k)}}\left( u_{k-1} +\Delta _1^k+\Delta _2^k+\ell \right) \right. \\&\left. -\,{\overline{m}}_{A^{(k)}}\left( u_{k-1}+\Delta _1^k+\Delta _2^k+\ell +1\right) \le \frac{1}{2\varepsilon n}\right\} \end{aligned}$$

(the factor “2” in front of \(\varepsilon n\) in the above definition does not carry any special meaning and is introduced for a purely technical reason). The quantities \(\Delta _R^k\) play a role analogous to the numbers \(\delta _R^k\) from the section dealing with the smallest eigenvalue.

Before going further, we must make sure that the above definition is consistent, i.e. \({\overline{m}}_{A^{(k)}}(u_k)+\alpha <1\) and \(u_k>\lambda _{\max }(A^{(k)})\) deterministically for all \(k\in \{0,1,\dots ,m\}\) (otherwise, we would not be able to apply Lemmas 4.1, 4.3 and 4.4). We shall verify this by induction. For \(k=0\), the inequalities hold by the definition of \(u_0\) and \(\alpha \). Now, let \(k\ge 1\), and assume that \(u_{k-1}\) has the property that \({\overline{m}}_{A^{(k-1)}}(u_{k-1})+\alpha <1\) and \(u_{k-1}>\lambda _{\max }(A^{(k-1)})\) deterministically (everywhere on the probability space). Let \(Q_1\) and \(Q_2\) be defined as in Lemma 4.1, with \(A^{(k-1)}\), \(u_{k-1}\) and \(X^{(k)}\) replacing A, u and x. Note that if \(k=1\) then we have

$$\begin{aligned} \sum _{j=1}^\infty \frac{|I_j^1|}{16^j}\le \frac{16n}{u_0^2}<\frac{1}{\varepsilon n}; \end{aligned}$$

and if \(k>1\) then the summand \(\Delta ^{k-1}_R\) in the definition of \(u_{k-1}\) makes sure that

$$\begin{aligned} \sum _{j=1}^\infty \frac{|I_j^k|}{16^j}\le \sum _{j=1}^\infty \frac{2|I_j^k|}{4^j(4^j+1)} < 2\left( {\overline{m}}_{A^{(k-1)}}(u_{k-1})- {\overline{m}}_{A^{(k-1)}}(u_{k-1}+1)\right) \le \frac{1}{\varepsilon n}. \end{aligned}$$

Thus, by Lemma 4.3 (applied with this bound on \(\max _{\ell \in {\mathbb {N}}}|I_\ell ^k|/16^\ell \), so that \(8\varepsilon \sqrt{n}\sqrt{\max _{\ell }|I_\ell ^k|/16^\ell }\le 8\sqrt{\varepsilon }\), and using \({\overline{m}}_{A^{(k-1)}}(u_{k-1}+\Delta _1^k)\le {\overline{m}}_{A^{(k-1)}}(u_{k-1})\)),

$$\begin{aligned} Q_1\left( \Delta _1^k\right) \le {\overline{m}}_{A^{(k-1)}}(u_{k-1})+14\sqrt{\varepsilon }. \end{aligned}$$

Further, by Lemma 4.4, we have

$$\begin{aligned} Q_2\left( \Delta _2^k\right) \le 1-{\overline{m}}_{A^{(k-1)}}(u_{k-1})-\alpha . \end{aligned}$$

The definition of \(\varepsilon \) (which gives \(14\sqrt{\varepsilon }=14\alpha /15<\alpha \)) and the monotonicity of \(Q_1,Q_2\) then imply that \(Q_1(u_{k}-u_{k-1})< 1\) and \(Q_1(u_{k}-u_{k-1})+Q_2(u_{k}-u_{k-1})\le 1\). Thus, by Lemma 4.1 we have \({\overline{m}}_{A^{(k)}}(u_{k})\le {\overline{m}}_{A^{(k-1)}}(u_{k-1})<1-\alpha \) and \(u_{k}>\lambda _{\max }(A^{(k)})\) deterministically.

The following lemma gives an estimate of the cumulative impact of the regularity shifts:

Lemma 4.7

With \(\Delta _R^k\) (\(1\le k\le m\)) defined above, the following holds deterministically:

$$\begin{aligned} \sum _{k=1}^m\Delta _R^k\le 2\varepsilon n. \end{aligned}$$

Proof

Take any admissible \(k\ge 1\). The definition of \(\Delta _R^k\) immediately implies that for all \(0\le \ell <\Delta _R^k\) we have

$$\begin{aligned} {\overline{m}}_{A^{(k)}}\left( u_{k-1}+\Delta _1^k+\Delta _2^k+\ell \right) - {\overline{m}}_{A^{(k)}}\left( u_{k-1}+\Delta _1^k+\Delta _2^k+\ell +1\right) >\frac{1}{2\varepsilon n}. \end{aligned}$$

Hence,

$$\begin{aligned} {\overline{m}}_{A^{(k)}}(u_k)&={\overline{m}}_{A^{(k)}}\left( u_{k-1}+\Delta _1^k +\Delta _2^k+\Delta _R^k\right) \\&\le {\overline{m}}_{A^{(k)}}\left( u_{k-1}+\Delta _1^k+\Delta _2^k\right) -\frac{\Delta _R^k}{2\varepsilon n}\\&\le {\overline{m}}_{A^{(k-1)}}(u_{k-1})-\frac{\Delta _R^k}{2\varepsilon n}. \end{aligned}$$

Thus, \({\overline{m}}_{A^{(k)}}(u_k)\le {\overline{m}}_{A^{(k-1)}}(u_{k-1})-\frac{\Delta _R^k}{2\varepsilon n}\) for all \(k\ge 1\), which, together with the relations \(0<{\overline{m}}_{A^{(k)}}(u_k)< 1\) (\(0\le k\le m\)), implies the result. \(\square \)

Now, fix for a moment any \(k\in \{1,\dots ,m\}\) and let \(\mathcal {F}_{k-1}\) be the \(\sigma \)-algebra generated by the vectors \(X^{(1)},\dots ,X^{(k-1)}\). We will first estimate the conditional expectation

$$\begin{aligned} {\mathbb {E}}\left( \Delta _1^k+\Delta _2^k\,|\,\mathcal {F}_{k-1}\right) ={\mathbb {E}}\left( \Delta _1^k\, |\,\mathcal {F}_{k-1}\right) +{\mathbb {E}}\left( \Delta _2^k\,|\,\mathcal {F}_{k-1}\right) . \end{aligned}$$

Applying Lemma 4.5, we obtain

$$\begin{aligned} \left\| {\mathbb {E}}\left( \Delta _1^k\,|\,\mathcal {F}_{k-1}\right) \right\| _\infty \le 32g(1)\sqrt{\varepsilon }. \end{aligned}$$

Next, note that the assumptions on 1-dimensional projections imply that for any unit vector y we have

$$\begin{aligned} {\mathbb {E}}|\langle y,X_n\rangle |^{3}&=\int _0^\infty {\mathbb {P}}\left\{ \langle y,X_n\rangle ^2\ge \tau ^{2/3}\right\} \,d\tau \\&\le \sqrt{8}+\int _{\sqrt{8}}^\infty {\mathbb {P}}\left\{ \langle y,X_n\rangle ^2-1\ge \tau ^{2/3}-1\right\} \,d\tau \\&\le \sqrt{8}+4g(1)\int _{\sqrt{8}}^\infty \tau ^{-4/3}\,d\tau \\&\le \sqrt{8}+12g(1). \end{aligned}$$

Next, the upper bound on \(\frac{|I_j^k|}{16^j}\) implies that \(I_j^k=\varnothing \) whenever \(j\le \log _{16}(\varepsilon n)\), whence

$$\begin{aligned} u_{k-1}-\lambda _1(A^{(k-1)})\ge \frac{\sqrt{\varepsilon n}}{4}. \end{aligned}$$

Then, by Lemma 4.6 (applied with \(\kappa :=1\)), and in view of the inequality \({\overline{m}}_{A^{(k-1)}}(u_{k-1})\le {\overline{m}}_{A^{(0)}}(u_0)\), we get

$$\begin{aligned} \left\| {\mathbb {E}}(\Delta _2^k\,|\,\mathcal {F}_{k-1})\right\| _\infty \le \frac{1+\alpha }{1-{\overline{m}}_{A^{(0)}}(u_0)-\alpha } \left( 1+ \frac{w_{4.6}}{(1-{\overline{m}}_{A^{(0)}}(u_0)-\alpha )^{1/2}(\sqrt{\varepsilon n}/4)^{1/2}}\right) , \end{aligned}$$

where

$$\begin{aligned} w_{4.6} =4\sup _{\Vert y\Vert =1}{\mathbb {E}}|\langle y,X_n\rangle |^{3}/\left( \alpha ^{3/2}\left( 1-2^{-1/2}\right) \right) \le (48+192g(1))\alpha ^{-3/2}. \end{aligned}$$

It is not difficult to check, using \(1-{\overline{m}}_{A^{(0)}}(u_0)=\sqrt{m}/(\sqrt{m}+\sqrt{n})\ge \sqrt{\gamma }/(1+\sqrt{\gamma })\), that for \(\beta :=(1+2\sqrt{\gamma })\alpha / \left( \sqrt{\gamma }-(1+\sqrt{\gamma })\alpha \right) \) we have

$$\begin{aligned} \frac{1+\alpha }{1-{\overline{m}}_{A^{(0)}}(u_0)-\alpha } \le \frac{1+\beta }{1-{\overline{m}}_{A^{(0)}}(u_0)}, \end{aligned}$$

and by the choice of \(n_\varepsilon \), we have

$$\begin{aligned} \frac{w_{4.6}(1+\alpha )}{(1-{\overline{m}}_{A^{(0)}}(u_0)-\alpha )^{3/2}(\sqrt{\varepsilon n}/4)^{1/2}} \le \varepsilon . \end{aligned}$$

Hence, we obtain

$$\begin{aligned} \left\| {\mathbb {E}}\left( \Delta _2^k\,|\,\mathcal {F}_{k-1}\right) \right\| _\infty \le (1+\beta )\left( 1+\sqrt{\frac{n}{m}}\right) +\varepsilon . \end{aligned}$$

Summing up over \(k\in \{1,\ldots ,m\}\) and applying Lemma 4.7, we get

$$\begin{aligned} {\mathbb {E}}\lambda _1(A_n)&\le {\mathbb {E}}u_{m}\\&\le n+\sqrt{mn}+32g(1)\sqrt{\varepsilon }m+(1+\beta )(m+\sqrt{mn})+\varepsilon m+2\varepsilon n\\&=(\sqrt{n}+\sqrt{m})^2+32g(1)\sqrt{\varepsilon }m+\varepsilon (m+2n)+\beta (m+\sqrt{mn}). \end{aligned}$$

Since \(\alpha \) can be chosen arbitrarily small and both \(\beta \) and \(\varepsilon \) tend to zero with \(\alpha \), we get the result.
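As for the smallest eigenvalue, the bound can be checked numerically; the following sketch (again Gaussian samples and toy dimensions, purely as an illustration and with no bearing on the proof; numpy assumed) compares the empirical average of \(\lambda _{\max }(A_n)\) with \((\sqrt{m}+\sqrt{n})^2\).

    import numpy as np

    rng = np.random.default_rng(1)
    n, m, trials = 100, 400, 20          # toy dimensions, for illustration only

    largest = []
    for _ in range(trials):
        X = rng.standard_normal((m, n))  # rows: i.i.d. isotropic (Gaussian) vectors in R^n
        A = X.T @ X                      # A_n = sum_k X^(k) (X^(k))^T
        largest.append(np.linalg.eigvalsh(A)[-1])   # largest eigenvalue of A_n

    print("average of lambda_max(A_n):", np.mean(largest))
    print("(sqrt(m)+sqrt(n))^2:       ", (np.sqrt(m) + np.sqrt(n)) ** 2)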