1 Introduction

1.1 Operator Norm Regularization for Random Matrices

What should be considered the optimal order of the operator norm of a random \(n \times n\) matrix? If we consider a matrix A with independent standard Gaussian entries, then by the classical Bai–Yin law (see, for example, [16])

$$\begin{aligned} \Vert A\Vert /\sqrt{n} \rightarrow 2 \quad {\text { almost surely }}, \end{aligned}$$

as the dimension \(n \rightarrow \infty \). Moreover, the \(2\sqrt{n}\) asymptotic holds for more general classes of matrices. By [21], if the entries of A have zero mean and bounded fourth moment, then

$$\begin{aligned} \Vert A\Vert = (2+o(1)) \sqrt{n} \end{aligned}$$

with high probability. If we want an explicit (non-asymptotic) probability estimate valid for all large enough n, an application of Bernstein’s inequality (see, for example, [19]) gives

$$\begin{aligned} \mathbb {P}\{\Vert A\Vert \le t \sqrt{n}\} \ge 1 - e^{-c_0t^2n} \quad {\text { for }} t \ge C_0 \end{aligned}$$

for matrices with i.i.d. sub-Gaussian entries. Here, \(c_0, C_0 > 0\) are absolute constants. Non-asymptotic extensions to more general distributions are also available, see [2, 8, 14, 18].

Also, note that the order \(\sqrt{n}\) is the best we can generally hope for. Indeed, if the entries of A have constant variance, then the typical magnitude of the Euclidean norm of a row of A is \( \sim \sqrt{n}\), and the operator norm of A cannot be smaller than that. So, it is natural to regard \(O(\sqrt{n})\) as the “ideal order” of the operator norm of an \(n \times n\) i.i.d. random matrix.
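As a quick sanity check of this scaling (an illustrative sketch only; the dimension, seed, and use of NumPy are our choices, not part of the paper), one can compare the operator norm of a Gaussian random matrix with \(2\sqrt{n}\):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
A = rng.standard_normal((n, n))       # i.i.d. standard Gaussian entries

op_norm = np.linalg.norm(A, ord=2)    # operator (spectral) norm = largest singular value
print(op_norm / np.sqrt(n))           # close to 2, in line with the Bai-Yin law
```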

However, if we do not assume that the matrix entries have a finite fourth moment, the ideal order \(O(\sqrt{n})\) is no longer guaranteed: a weak fourth moment is necessary for the convergence in probability of \(\Vert A\Vert /\sqrt{n}\) as n grows to infinity (see [15]). Moreover, for matrices whose entries have only two finite moments, an explicit family of examples constructed in [11] shows that A can have \(\Vert A\Vert \sim n^{\alpha }\) for any \(\alpha \le 1\) with substantial probability.

This motivates the following questions: What are the obstructions in the structure of A that make its operator norm too large? Under what conditions and how can we regularize the matrix, restoring the optimal \(O(\sqrt{n})\) norm with high probability? Clearly, an interesting regularization is one that does not change A too much, for example, one that changes only a small fraction of the entries of A. We call such a regularization local.

The first question was answered in our previous work with Vershynin [13]. We have shown that one can enforce the norm bound \(\Vert A\Vert \sim \sqrt{n}\) by modifying the entries in a small submatrix of A if and only if the i.i.d. entries of A have zero mean and finite variance. The proof strategy was to construct a way to regularize the \(\Vert \cdot \Vert _{\infty \rightarrow 2}\) norm of A and then to apply a form of the Grothendieck–Pietsch theorem (see [10, Proposition 15.11]) to show that some additional small correction regularizes the operator norm \(\Vert A\Vert \). This last step made it impossible to find the submatrix explicitly.

1.2 Main Results

In the current work, we give an (almost optimal) answer to the remaining constructiveness question: when local regularization is possible, how does one restore the optimal order of the norm of A by a small explicit change? The main result of the paper is the following.

Theorem 1

(Constructive regularization) Let A be a random \(n~\times ~n\) matrix with i.i.d. entries \(A_{ij}\) having symmetric distribution such that \(\mathbb {E}A_{ij}^2 = 1\). Then for any \(\varepsilon \in (0, 1/6]\) and \(r \ge 1\), with probability at least \(1 - n^{0.1-r}\) the following holds: if we replace with zeros the (at most) \(\varepsilon n\) rows and \(\varepsilon n\) columns with the largest \(L_2\)-norms (as vectors in \(\mathbb {R}^n\)), then the resulting matrix \(\tilde{A}\) has a well-bounded operator norm

$$\begin{aligned} \Vert \tilde{A}\Vert \le C r\sqrt{c_{\varepsilon } n \cdot \ln \ln n}. \end{aligned}$$
(1.1)

Here, \(c_{\varepsilon } =(\ln \varepsilon ^{-1})/\varepsilon \) and \(C > 0\) is a sufficiently large absolute constant.
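In computational terms, the regularization of Theorem 1 is a very short procedure; the sketch below is ours (the function name and tie-breaking are arbitrary), and it simply zeros out the \(\varepsilon n\) rows and \(\varepsilon n\) columns with the largest \(L_2\)-norms:

```python
import numpy as np

def regularize_by_norms(A, eps):
    """Sketch of the regularization in Theorem 1: replace with zeros the
    eps*n rows and eps*n columns of A with the largest L2-norms."""
    n = A.shape[0]
    k = int(np.floor(eps * n))
    heavy_rows = np.argsort(np.linalg.norm(A, axis=1))[-k:]  # indices of the k largest row norms
    heavy_cols = np.argsort(np.linalg.norm(A, axis=0))[-k:]  # indices of the k largest column norms
    A_tilde = A.copy()
    A_tilde[heavy_rows, :] = 0.0
    A_tilde[:, heavy_cols] = 0.0
    return A_tilde
```

Remark 1 below gives the variant in which one instead zeros out every row and column whose \(L_2\)-norm exceeds \(C\sqrt{c_{\varepsilon } n}\).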

Remark 1

Typically, all the rows and columns of the matrix \(\tilde{A}\) have \(L_2\)-norms bounded by \(O(\sqrt{c_{\varepsilon } n})\). One way to check this is via the non-constructive regularization result proved in [13]. Indeed, with probability \(1 - 7\exp (- \varepsilon n/12)\), replacing some \(\varepsilon n \times \varepsilon n\) submatrix of A with zeros, we get a matrix \(\bar{A}\) such that \(\Vert \bar{A}\Vert \lesssim \sqrt{c_{\varepsilon } n}\) (see [13, Theorem 1]). This implies that all the rows and columns of \(\bar{A}\) have well-bounded \(L_2\)-norms (of order at most \(\sqrt{c_{\varepsilon } n}\)). Since all but \(\varepsilon n\) rows and \(\varepsilon n\) columns of A coincide with those of \(\bar{A}\), there can be at most \(\varepsilon n\) rows and columns of A having larger \(L_2\)-norms. Thus, the regularization described in the statement of Theorem 1 zeros them all out.

Moreover, the proof of Theorem 1 holds without changes if we define \(\tilde{A}\) as the result of zeroing out all rows and columns having \(L_2\)-norm larger than \(C\sqrt{c_{\varepsilon }n}\). As we have just discussed, with high probability this is an even smaller change to the matrix A.

The regularization procedures discussed above (in Theorem 1 and Remark 1) are local, as they change only a small fraction of the matrix entries. However, they may still change more than an \(\varepsilon n \times \varepsilon n\) submatrix, which is what [13, Theorem 1.1] promises. As a corollary of Theorem 1, we also obtain a polynomial-time algorithm that regularizes the norm of A with high probability by zeroing out a small submatrix.

This algorithm treats separately subsets of matrix entries having similar magnitude. We define these subsets via the order statistics of the i.i.d. sample \(A_{ij}\): let \(\hat{A}_1, \ldots , \hat{A}_{n^2}\) be the non-increasing rearrangement of the entries \(A_{ij}\) (in the sense of absolute values, namely \(|\hat{A}_1| \ge \cdots \ge |\hat{A}_{n^2}|\)). Then,

$$\begin{aligned} \mathcal {A}_l := \{\hat{A}_{ \lceil 2^{l-1} n \varepsilon + 1 \rceil }, \ldots , \hat{A}_{ \lceil 2^{l} n \varepsilon \rceil }\} \quad {\text { for any }} l \in \mathbb {Z}_{\ge 0}. \end{aligned}$$
(1.2)
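For illustration, the sets \(\mathcal {A}_l\) can be read off a sample matrix directly from definition (1.2); the sketch below (our code, returning entry positions for each l) does exactly that:

```python
import numpy as np

def level_sets(A, eps, l_max):
    """Sketch of definition (1.2): group the entries of A by rank.
    Group l collects the entries whose rank, in non-increasing order of
    absolute value, lies in (2^{l-1}*eps*n, 2^l*eps*n]."""
    n = A.shape[0]
    order = np.argsort(np.abs(A), axis=None)[::-1]              # flat indices, largest entry first
    groups = []
    for l in range(l_max + 1):
        start = int(np.ceil(2.0 ** (l - 1) * n * eps + 1)) - 1  # 0-based rank of the first entry
        stop = min(int(np.ceil(2.0 ** l * n * eps)), n * n)     # 0-based slice end (exclusive)
        groups.append(np.unravel_index(order[start:stop], A.shape))
    return groups
```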

We are ready to state the submatrix regularization algorithm:

[Algorithm 1: Submatrix regularization (pseudocode figure); its steps are described in Sects. 1.3 and 5]

If the matrix A is taken from the same model as in Theorem 1, the regularization provided by Algorithm 1 finds an \(\varepsilon n \times \varepsilon n\) submatrix \(I \times J\) that one can replace with zeros to get a matrix \(\tilde{A}\) with a well-bounded norm. This is proved in the following corollary.

Corollary 1

(Constructive regularization, submatrix version) Let A be a random \(n~\times ~n\) matrix with i.i.d. entries \(A_{ij}\) having symmetric distribution such that \(\mathbb {E}A_{ij}^2 = 1\). Let \(\varepsilon \in (0, 1/6]\), \(r \ge 1\), \(l_{\max } = \lfloor \log _2 (\ln n/\ln \varepsilon ^{-4})\rfloor \) and \(c_{\varepsilon } =(\ln \varepsilon ^{-1})/\varepsilon \). The subsets \(\mathcal {A}_l\) are defined by (1.2) for \(l = 0, \ldots , l_{\max }\).

Suppose that the matrix \(\tilde{A}\) is constructed from A by Algorithm 1. Then, with probability at least \(1 - n^{0.1-r}\), the matrix \(\tilde{A}\) differs from A on at most an \(\varepsilon n \times \varepsilon n\) submatrix, and

$$\begin{aligned} \Vert \tilde{A}\Vert \le C r^{3/2} \sqrt{c_{\varepsilon } n \cdot \ln \ln n}, \end{aligned}$$

where \(C > 0\) is a sufficiently large absolute constant.

1.3 Proof Strategy

The general idea of the proof of Theorem 1 is to split the entries of A into subsets with similar absolute values and to bound each of these “levels” by properly scaled 0–1 Bernoulli variables. Then, we can use known regularization results that hold for Bernoulli random matrices (see Sect. 2.1 for their short exposition) for each “level” separately, and sum the norms over the “levels”. However, this intuitive approach conceals several difficulties to be resolved along the way.

First, we cannot directly substitute the entries of A by their absolute values: the norm of |A| might have global obstructions to the regularization (consider the example of \(\pm \,1\) symmetric Rademacher entries). To approach the level splitting process more delicately, we strengthen the following ideas (initially proposed by Friedman, Kahn and Szemerédi for regular graphs in [6] and modified by Feige and Ofek for the more general Bernoulli case in [5]). In the standard approximation of the operator norm \(\Vert A\Vert \) by the supremum of a quadratic form over a discrete \(\varepsilon \)-net \(\mathcal {N}\) on the unit sphere

$$\begin{aligned} \sup _{u, v \in \mathcal {N} \subset S^{n-1}} \big |\sum _{i,j=1}^n A_{ij} u_i v_j\big |, \end{aligned}$$

for each fixed pair of unit vectors \((u, v) = ((u_i)_{i=1}^n,(v_j)_{j=1}^n)\), one can split the indices \((i,j) \in [n] \times [n]\) into light and heavy couples. Light couples correspond to the summands \(u_iv_j\) that are bounded well enough for the classical concentration results to work, and heavy couples form a smaller set for which regularization no longer needs the mean zero assumption (see Sect. 2.2 for more details). In the current work, we define similar notions of light and heavy couples that additionally take into account the non-unit absolute values of \(A_{ij}\) (see Sect. 3).

The second difficulty is that a simple (\(L_2\)-norm based) regularization of the matrix A does not imply proper regularization of all the “levels”: even if we know that the rows and columns of the matrix A are well bounded, some of the “levels” might have too large row or column norms if the others are small enough. To address this issue without making the regularization procedure more complex, we employ an additional structural decomposition for Bernoulli random matrices, first shown in the work of Le et al. [9]. Essentially, we split the entries on each “level” into three parts: a part with bounded rows and columns, and two exceptional sparse parts. The sparsity comes in handy for their separate norm estimation.

Finally, we would like to make the number of “levels” as small as possible (as their quantity creates an additional factor in the resulting norm). We are able to keep this number as small as the double logarithm of the matrix size, using truncations of the matrix entries similar to the ones used in the preceding work [13].

The fact that the submatrix regularization Algorithm 1 is more involved than the one presented in Theorem 1 is somewhat natural. Zeroing out a small submatrix must still bring the \(L_2\)-norms of all rows and columns to the order \(O(\sqrt{n})\). Since the majority of rows and columns stays untouched in such regularization, essentially, one needs to find the most “dense” part of the matrix.

The procedure of assigning weights to the matrix entries row-wise, multiplying them to set column weights and then thresholding the columns with low weights is a delicate way to do so. This weight construction was originally used in [12, 13] for matrices with i.i.d. scaled Bernoulli entries. Here, we employ the same construction to regularize the entries at every “level” independently. (Here, the kth “level” contains the entries of A that belong to the \(2^{-k}\)-quantile of the distribution of \(A_{ij}^2\).) Additionally, to make the algorithm distribution oblivious, we estimate the quantiles by order statistics of the matrix entries (since a random matrix naturally contains \(n^2\) samples of the distribution \(\xi \sim A_{ij}\)). The idea of estimating quantiles of a distribution by the order statistics of a set of samples is both natural and well known in the statistics literature (see, e.g., [4, 22]).

1.4 Notations and Structure of the Paper

We use the following standard notations throughout the paper. Positive absolute constants are denoted \(C, C_1, c, c_1\), etc. Their values may be different from line to line. We often write \(a \lesssim b\) to indicate that \(a \le C b\) for some absolute constant C.

The discrete interval \(\{1,2,\ldots ,n\}\) is denoted by [n]. Given a finite set S, by |S|, we denote its cardinality. The standard inner product in \(\mathbb {R}^n\) shall be denoted by \(\langle \cdot ,\cdot \rangle \). Given \(p\in [1,\infty ]\), \(\Vert \cdot \Vert _p\) is the standard \(\ell _p^n\)-norm in \(\mathbb {R}^n\). Given a matrix A, \(\Vert \cdot \Vert \) denotes the operator \(l_2 \rightarrow l_2\) norm of the matrix:

$$\begin{aligned} \Vert A\Vert := \max _{x \in S^{n-1}} \Vert Ax\Vert _2. \end{aligned}$$

We write \({\mathrm{row}}_1(A), \ldots , {\mathrm{row}}_m(A) \in \mathbb {R}^n\) to denote the rows of any \(m \times n\) matrix A and \(\mathrm{col}_1(A), \ldots , \mathrm{col}_n(A) \in \mathbb {R}^m\) to denote its columns. We are going to use sparsity of the matrices in the proof. We denote by \(e_i^{\mathrm{row}}(A)\) the number of nonzero entries in the ith row of the matrix A, and by \(e_i^{\mathrm{col}}(A)\) the number of nonzero entries in the ith column of A.

The rest of the paper is structured as follows. In Sect. 2, we list auxiliary results (from the works of Feige and Ofek [5], and Le et al. [9]) specific to Bernoulli matrices. In Sect. 3, we show how to extend the Bernoulli techniques to a more general class of matrices and prove the central Proposition 2. In Sect. 4, we combine these techniques to conclude the proof of Theorem 1. In Sect. 5, we prove Corollary 1, and the last Sect. 6 contains a discussion of the results and related open questions.

2 Auxiliary Results for Bernoulli Random Matrices

In this section, we review several useful results related to the regularization of the norms of Bernoulli matrices.

2.1 Regularization of the Norms of Bernoulli Random Matrices

Consider an \(n \times n\) Bernoulli matrix B with independent 0–1 entries such that \(\mathbb {P}\{B_{ij} = 1\} = p\). Since the second moment of its entries is \(\mathbb {E}B_{ij}^2 \sim p\), from the facts discussed at the beginning of Sect. 1, one would expect an ideal operator norm \(\Vert B\Vert \sim \sqrt{np}\).

This is exactly what happens with high probability when the success probability p is large enough (\(p \gtrsim \ln n / n\), see, e.g., [5]). If \(p \ll \ln n / n\), the norm can stay larger than optimal [7]. However, it is known that the regularization procedure described in Theorem 1 works in the case of Bernoulli matrices, and moreover, results in the optimal norm order \(\Vert \tilde{B} \Vert \sim \sqrt{np}\). Since all nonzero entries of B have the same size, the \(L_2\)-norm description can be simplified in terms of the number of nonzero entries in each row or column.

Namely, Feige and Ofek proved in [5] that if we zero out all rows and columns that contain more than Cnp nonzero entries, then the resulting matrix satisfies \(\Vert \tilde{B} \Vert \sim \sqrt{np}\). This result was improved by Le, Levina and Vershynin. In [9], the authors demonstrate that it is enough to zero out any parts of the rows and columns with too many nonzeros, or to reweigh them in any way, so that \(e^{\mathrm{row}}_i \le Cnp\) and \(e^{\mathrm{col}}_i \le Cnp\) for all \(i \in [n]\); the resulting matrix then satisfies \(\Vert \tilde{B} \Vert \sim \sqrt{np}\).
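The Feige–Ofek regularization just described is easy to state in code; a minimal sketch (the constant C is a free parameter here, not the specific constant from [5]):

```python
import numpy as np

def regularize_bernoulli(B, p, C=10.0):
    """Zero out all rows and columns of the 0-1 matrix B that contain
    more than C*n*p nonzero entries (Feige-Ofek regularization)."""
    n = B.shape[0]
    B_tilde = B.astype(float).copy()
    heavy_rows = np.count_nonzero(B, axis=1) > C * n * p
    heavy_cols = np.count_nonzero(B, axis=0) > C * n * p
    B_tilde[heavy_rows, :] = 0.0
    B_tilde[:, heavy_cols] = 0.0
    return B_tilde
```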

2.2 Regularization and the Quadratic Form

So, zeroing out all rows and columns that contain more than Cnp nonzero entries regularizes the quadratic form

$$\begin{aligned} \Vert B\Vert = \sup _{u, v \in S^{n-1}} |\sum _{ij} B_{ij} u_i v_j| \end{aligned}$$
(2.1)

to the optimal order \(\sqrt{np}\).

We will need the following Lemma 1, addressing the part of the sum (2.1) over the indices (i, j) such that \(|u_iv_j| \ge \sqrt{p/n}\). It was first proved by Feige and Ofek in [5] and later appears in [3]. For completeness, we give a short sketch of its proof below. Let us also emphasize that even though for the regularization procedure introduced in [5] it is crucial to zero out a product subset of the entries, in the framework of Lemma 1 it is possible to zero out any subset of the entries.

Lemma 1

Let B be an \(n \times n\) Bernoulli matrix with independent 0–1 entries such that \(\mathbb {P}\{B_{ij} = 1\} = p\). Let \(r \ge 1\). Let \(\mathcal {B}\subset [n] \times [n]\) be an index subset, such that if we zero out all \(B_{ij}\) with \((i,j) \notin \mathcal {B}\), then every row and column of B has at most \(C_0rpn\) nonzero entries. Then, with probability \(1 - n^{-r}\), for all \(u, v \in S^{n-1}\)

$$\begin{aligned} \sum _{(i,j) \in \mathcal {B}:\, |u_i v_j| \ge \sqrt{p/n}} B_{ij} |u_iv_j| \le C r\sqrt{n p}, \end{aligned}$$

where C is a large enough absolute constant.

The proof of Lemma 1 is based on a technical Lemma 2 stated below. The proof of Lemma 2 is completely deterministic and can be found in [3, 5].

Lemma 2

[3, Lemma 21] Let B be an \(n \times n\) matrix with 0–1 elements. Let \(p > 0\) be such that every row and column of B contains at most \(C_0np\) ones. For index subsets \(S, T \subset [n]\) define

$$\begin{aligned} e(S,T):= \sum _{i \in S, j \in T} B_{ij} \end{aligned}$$

(i.e., number of nonzero elements in the submatrix spanned by \(S \times T\)). Suppose that for any \(S, T \subset [n]\) one of the following conditions holds:

  (A) \( e(S, T) \le C_1 |S||T| p \), or

  (B) \( e(S, T) \cdot \log \left( \frac{e(S, T)}{ |S||T| p}\right) \le C_2 |T| \log \left( \frac{n}{|T|}\right) ,\)

with some constants \(C_1\) and \(C_2\) independent of S, T and n. Then, for any \(u, v \in S^{n-1}\)

$$\begin{aligned} \sum _{i,j: |u_i v_j| \ge \sqrt{p/n}} B_{ij} |u_iv_j| \le C \sqrt{n p}, \end{aligned}$$

where the constant \(C = \max \{ 16, 3C_0, 32C_1, 32C_2\}\).

Proof

(of Lemma 1) In view of Lemma 2, it is enough to show that, with probability \(1 - n^{-r}\), for every \(S, T \subset [n]\), the quantity

$$\begin{aligned} e_{\mathcal {B}}(S,T)= \sum _{(i,j) \in \mathcal {B}\cap (S\times T)} B_{ij} \end{aligned}$$

satisfies one of the conditions (A) and (B) (in place of e(S, T)). Without loss of generality, let us assume that \(|T| \ge |S|\).

If \(|T| \ge n/e\), we have

$$\begin{aligned} e_{\mathcal {B}}(S, T) \le |S|\cdot C_0rpn \le C_0rpe|T||S|. \end{aligned}$$

Hence, condition (A) holds with \(C_1 = C_0re\).

If both \(|S|, |T| < n/e\), then \(\mathbb {P}\{\mathcal {E}\} \le n^{-r}\) for the event

$$\begin{aligned} \mathcal {E}:= \big \{\exists S,T: |S|, |T| < n/e, \, e_{\mathcal {B}} (S,T) > l_{S,T} p |S| |T|\big \} \end{aligned}$$

and \(l_{S,T}\) being a number such that

$$\begin{aligned} l_{S,T} \ln l_{S,T} := \ln \frac{n}{|T|} \cdot \frac{3(r+6) |T|}{p|S||T|}. \end{aligned}$$

Indeed, the probability estimate for \(\mathcal {E}\) follows from Chernoff’s inequality applied to the sum of independent variables \(e(S,T) \ge e_{\mathcal {B}}(S,T)\), combined with Stirling’s formula to estimate the number of such sets S and T, and the fact that the function \(f(x) = (n/x)^x\) is monotonically increasing on [1, n / e] (see [5, Section 2.2.5] for the computation details).

Thus, with probability \(1 - n^{-r}\), for any S, T such that \(|S|, |T| < n/e\), condition (B) holds:

$$\begin{aligned} e_{\mathcal {B}}(S, T) \cdot \ln \left( \frac{e_{\mathcal {B}}(S, T)}{ |S||T| p}\right)&\le l_{S,T} p |S| |T|\cdot \ln l_{S,T} \\&\le 3(r+6) |T| \ln \frac{n}{|T|}. \end{aligned}$$

This concludes the proof of Lemma 1, with \(C = \max \{ 32C_0re, 100(r+6)\}\).

2.3 Decomposition of Bernoulli Matrices

As we mentioned above, the idea is to apply the approach developed for Bernoulli matrices to the truncations of the entries of A having absolute values on the same “level,” and then to sum over these “levels.”

However, even if we know that the rows and columns of the general matrix A are well bounded, there is no direct way to see that each “level” is well bounded: some of them might be too large if the others are small enough. To address this issue, we are going to use an additional structural decomposition for Bernoulli random matrices. The next proposition is a direct corollary of [9, Theorem 2.6]:

Proposition 1

(Decomposition lemma) Let B be an \(n \times n\) Bernoulli matrix with independent 0–1 entries such that \(\mathbb {P}\{B_{ij} = 1\} = p\). Then, for any \(n \ge 4\), \(p \ge 0\) and \(r \ge 1\), with probability at least \(1 - 3n^{-r}\), all entries of B can be divided into three disjoint classes \([n] \times [n] = \mathcal {B}_1 \sqcup \mathcal {B}_2 \sqcup \mathcal {B}_3\), such that

  1. \(e^{\mathrm{row}}_{i}(\mathcal {B}_1) \le C_1 r^3np\) and \(e^{\mathrm{col}}_{i}(\mathcal {B}_1) \le C_1 r^3np\),

  2. \(e^{\mathrm{row}}_{i}(\mathcal {B}_2) \le C_2r\),

  3. \(e^{\mathrm{col}}_{i}(\mathcal {B}_3) \le C_2r,\)

where \(e^{\mathrm{col}/\mathrm{row}}_{i}(\mathcal {B})\) is the number of nonzero elements in the ith column or row of B belonging to the class \(\mathcal {B}\), and \(C_1\), \(C_2\) are absolute constants.

Remark 2

Following the same method as employed in the proof of [9, Theorem 2.6], one can check that Proposition 1 actually holds with linear (instead of cubic) dependence on r, namely with \(e^{\mathrm{row}}_{i}(\mathcal {B}_1) \le C_1 rnp\) and \(e^{\mathrm{col}}_{i}(\mathcal {B}_1) \le C_1 rnp\).

3 From Bernoulli to General Matrices

The goal of this section is to prove Proposition 2, which provides a way to generalize the regularization results known for Bernoulli random matrices to the general case:

Proposition 2

Suppose A is a random \(n \times n\) matrix with i.i.d. symmetric entries \(A_{ij}\) with \(\mathbb {E}A_{ij}^2 = 1\). Let \(\tilde{A}\) be the resulting matrix after we have zeroed out row and column subsets of A in any way, such that

$$\begin{aligned} \Vert \mathrm{row}_{i}(\tilde{A})\Vert ^2_2 \le c_{\varepsilon } n {\text { and }} \Vert \mathrm{col}_{i}(\tilde{A})\Vert ^2_2 \le c_{\varepsilon } n \end{aligned}$$
(3.1)

for all \(i = 1, \ldots , n\). Let

$$\begin{aligned} \tilde{M} := \tilde{A} \cdot {\mathbb {1}}_{\{|\tilde{A}_{ij}| \in (2^{k_0}, 2^{k_1}]\}}, {\text { and }} k_1 - k_0 =: \kappa . \end{aligned}$$

Then, with probability at least \(1 - 10\kappa n^{-r}\), we have

$$\begin{aligned} \Vert \tilde{M}\Vert \le C r \sqrt{c_\varepsilon n \kappa }. \end{aligned}$$

Here, \(c_\varepsilon \) is an arbitrary positive quantity that may depend on \(\varepsilon \), and \(C > 0\) is an absolute constant.

Let us first collect several auxiliary lemmas that will be used in the proof of Proposition 2.

Lemma 3

Consider an \(n \times n\) random matrix M with independent symmetric entries and \(\mathbb {E}M^2_{ij} \le 1\). Consider two vectors \(u = (u_i)_{i = 1}^n\) and \(v = (v_j)_{j =1}^n\) such that \(u, v \in S^{n-1}\). Denote the event

$$\begin{aligned} \mathcal {M}^{\mathrm{light}}_{ij} = \{|M_{ij}| |u_iv_j| \le 2/\sqrt{n}\} \end{aligned}$$

and let \(Q \subset [n] \times [n]\) be an index subset. Then, for any constant \(C \ge 3\)

$$\begin{aligned} | \sum _{i j} u_i M_{ij} {\mathbb {1}}_{\{(i,j) \in Q\}} {\mathbb {1}}_{\mathcal {M}^{\mathrm{light}}_{ij}} v_j | \le C\sqrt{n} \end{aligned}$$

with probability at least \(1 - 2\exp (-Cn/2)\).

Proof

Let \(R_{ij} := M_{ij}{\mathbb {1}}_{\{(i,j) \in Q\}} {\mathbb {1}}_{\mathcal {M}^{\mathrm{light}}_{ij}}\). Note that \(R_{ij}\) are centered due to the symmetric distribution of \(M_{ij}\), and they are independent as \(M_{ij}\) are. So we can apply Bernstein’s inequality for bounded distributions (see, for example, [20, Theorem 2.8.4]) to bound the sum:

$$\begin{aligned} \mathbb {P}\{ | \sum _{ij} u_i R_{ij} v_j| \ge t\} \le 2 \exp \left( - \frac{t^2/2}{\sigma ^2 + Kt/3}\right) , \end{aligned}$$

where

$$\begin{aligned} K = \max _{i,j} |u_i R_{ij} v_j| \le 2/\sqrt{n} \quad {\text { and }} \quad \sigma ^2 = \sum _{ij} \mathbb {E}(u_i R_{ij} v_j)^2. \end{aligned}$$

Note that \( \mathbb {E}R^2_{ij} \le \mathbb {E}M^2_{ij}\), as \(R^2_{ij} \le M^2_{ij}\) almost surely, and \(\mathbb {E}M^2_{ij} \le 1\). So,

$$\begin{aligned} \sigma ^2 = \sum _{ij} u^2_i \mathbb {E}R^2_{ij} v^2_j \le \sum _{i,j} u_i^2 v_j^2 = 1, \end{aligned}$$

as \(\sum _i u_i^2 = \sum _j v_j^2 = 1\). So, taking \(t = C\sqrt{n}\), we obtain

$$\begin{aligned} \mathbb {P}\{ | \sum _{(i,j)} u_i M_{ij} {\mathbb {1}}_{\{(i,j) \in Q\}} {\mathbb {1}}_{\mathcal {M}^{\mathrm{light}}_{ij}} v_j | \ge C \sqrt{n}\} \le 2 \exp (-Cn/2) \end{aligned}$$

for any \(C \ge 3\). This concludes the statement of the lemma. \(\square \)

The following lemma is a version of [9, Lemma 3.3].

Lemma 4

For any matrix Q and vectors \(u,v \in S^{n-1}\), we have

$$\begin{aligned} \sum _{ij} Q_{ij} u_{i} v_{j} \le \ \max _j \Vert \mathrm{col}_j(Q)\Vert _2 \cdot (\max _{i} e_i^{\mathrm{row}}(Q) )^{1/2}. \end{aligned}$$

Proof

Indeed,

$$\begin{aligned} \sum _{ij} Q_{ij} u_{i} v_{j}\le & {} \sum _j v_j \left( \sum _{i: Q_{ij} \ne 0} Q_{ij} u_i\right) \\\le & {} \sum _j v_j \left( \sum _{i: Q_{ij} \ne 0} Q^2_{ij}\right) ^{1/2} \left( \sum _{i: Q_{ij} \ne 0} u^2_i\right) ^{1/2} \quad \left( *\right) \\\le & {} \max _j \Vert \mathrm{col}_j\left( Q\right) \Vert _2 \sum _j v_j \left( \sum _{i: Q_{ij} \ne 0} u^2_i\right) ^{1/2}\\\le & {} \max _j \Vert \mathrm{col}_j\left( Q\right) \Vert _2 \left( \sum _j v^2_j\right) ^{1/2} \left( \sum _j \sum _{i: Q_{ij} \ne 0} u^2_i\right) ^{1/2} \quad \left( *\right) \\\le & {} \max _j \Vert \mathrm{col}_j\left( Q\right) \Vert _2 \cdot 1\cdot \left( \sum _i u^2_i\sum _{j: Q_{ij} \ne 0} 1\right) ^{1/2} \quad {\left( \mathrm{since} \Vert v\Vert _2 = 1\right) } \\\le & {} \max _j \Vert \mathrm{col}_j\left( Q\right) \Vert _2 \left( \sum _i u^2_ie_i^{\mathrm{row}}\left( Q\right) \right) ^{1/2}\\\le & {} \max _j \Vert \mathrm{col}_j\left( Q\right) \Vert _2 \cdot \left( \max _{i} e_i^{\mathrm{row}}\left( Q\right) \right) ^{1/2}. \quad \left( \mathrm{since} \Vert u\Vert _2 = 1 \right) \end{aligned}$$

Steps (*) hold by the Cauchy–Schwarz inequality. Lemma 4 is proved. \(\square \)

In the proof of Proposition 2, we are going to use a standard splitting “by size” of a nonnegative random variable. Let \(X = 0\) or \(X \in (2^{k_0}, 2^{k_1}]\) almost surely. Then, clearly,

$$\begin{aligned} X \le \sum _{k = k_0+1}^{k_1} 2^k {\mathbb {1}}_{\{X \in (2^{k-1}, 2^k]\}}. \end{aligned}$$

If additionally \(\mathbb {E}X^2 \le 1\) (below, X will be \(|M_{ij}|\)) and we denote

$$\begin{aligned} p_k := \mathbb {P}\{|M_{ij}| \in (2^{k-1}, 2^{k}]\}, \end{aligned}$$
(3.2)

the following estimates hold for \(p_k\). First, the sum

$$\begin{aligned} \sum p_k 2^{2k} \le 4 \sum p_k 2^{2(k-1)} \le 4\mathbb {E}X^2 \le 4. \end{aligned}$$
(3.3)

In particular, each term in (3.3) satisfies \(p_k 2^{2(k-1)} \le 1\), that is, \(\sqrt{p_k} \le 2^{1-k}\) for every k; we will use this in Step 3 below. Now, we are ready to prove Proposition 2.

3.1 Proof of Proposition 2

Step 1. Net approximation

Let \(\mathcal {N}\) be a 1 / 2-net on \(S^{n-1}\) with cardinality \(|\mathcal {N}| \le 5^n\). (The existence of such a net is a standard fact that can be found, e.g., in [20].) We will use a simple net approximation of the norm (see, e.g., [20, Lemma 4.4.1]), namely

$$\begin{aligned} \Vert \tilde{M} \Vert \le 4 \max _{u, v \in \mathcal {N}} \langle \tilde{M}u, v\rangle = 4 \max _{u, v \in \mathcal {N}} |\sum _{ij} \tilde{M}_{ij}u_i v_j|. \end{aligned}$$

We will split the sum into two parts, based on the absolute values of the summands, and bound each part separately.

Let \(M := A \cdot {\mathbb {1}}_{\{|A_{ij}| \in (2^{k_0}, 2^{k_1}]\}}\). For any fixed \(u, v \in \mathcal {N}\) and every \(i, j \in [n]\), we define an event

$$\begin{aligned} \mathcal {M}^{\mathrm{light}}_{ij} = \{|M_{ij}||u_iv_j| \le 2/\sqrt{n}\}. \end{aligned}$$

The random set of indices

$$\begin{aligned} \{(i,j): |M_{ij}||u_iv_j| \le 2/\sqrt{n}\} \end{aligned}$$

is an analogue of the non-random set of “light couples” (ij) used by Feige and Ofek in [5]. The difference appears since in the non-Bernoulli setting we have to take into account additional random scaling by absolute values of the corresponding matrix entries \(|M_{ij}|\).

Then,

$$\begin{aligned} \max _{u, v \in \mathcal {N}}&|\sum _{i,j} \tilde{M}_{ij}u_i v_j| \\&\le \max _{u, v \in \mathcal {N}} |\sum _{i,j} \tilde{M}_{ij} ({\mathbb {1}}_{\mathcal {M}^{\mathrm{light}}_{ij}} + {\mathbb {1}}_{(\mathcal {M}^{\mathrm{light}}_{ij})^c}) u_i v_j|\\&\le \max _{u, v \in \mathcal {N}} |\sum _{i,j} \tilde{M}_{ij}{\mathbb {1}}_{\mathcal {M}^{\mathrm{light}}_{ij}} u_i v_j| + \max _{u, v \in \mathcal {N}} |\sum _{ij} \tilde{M}_{ij}{\mathbb {1}}_{(\mathcal {M}^{\mathrm{light}}_{ij})^c}u_i v_j|. \end{aligned}$$

Step 2. Light members

By Lemma 3, for any fixed \(u, v \in S^{n-1}\) and a fixed subset of indices Q (assuming that \(Q^c\) is a set of rows and columns to delete),

$$\begin{aligned} | \sum _{i,j} u_i M_{ij} {\mathbb {1}}_{\{(i,j) \in Q\}} {\mathbb {1}}_{\mathcal {M}^{\mathrm{light}}_{ij}} v_j | \le 12\sqrt{n} \end{aligned}$$
(3.4)

with probability at least \(1 - 2\exp (-6n)\). Now, taking a union bound over \(5^n\) choices for \(u \in \mathcal {N}\), as many choices for \(v \in \mathcal {N}\), and \(2^{2n}\) choices for the row and column subset \(Q^c\), we obtain that, simultaneously for all \(u, v \in \mathcal {N}\),

$$\begin{aligned} \mathbb {P}\{ | \sum _{i,j} u_i \tilde{M}_{ij} {\mathbb {1}}_{\mathcal {M}^{\mathrm{light}}_{ij}} v_j |\le 12\sqrt{n}\} \ge 1 - 2\exp (-n). \end{aligned}$$
(3.5)

Step 3. Other members

The second sum can be roughly bounded by the sum of absolute values of its members:

$$\begin{aligned} |\sum _{i,j} \tilde{M}_{ij}{\mathbb {1}}_{(\mathcal {M}^{\mathrm{light}}_{ij})^c}u_i v_j|\le & {} \sum _{i,j} |\tilde{M}_{ij}|{\mathbb {1}}_{(\mathcal {M}^{\mathrm{light}}_{ij})^c}|u_i v_j| \\\le & {} \sum _{i,j}\left[ \sum _{k = k_0+1}^{k_1}2^k {\mathbb {1}}_{\{|\tilde{M}_{ij}| \in (2^{k-1}, 2^k]\}}\right] {\mathbb {1}}_{\{|M_{ij}||u_iv_j| \ge 2/\sqrt{n}\}}|u_i v_j| \end{aligned}$$

Note that as long as \({\mathbb {1}}_{\{|\tilde{M}_{ij}| \in (2^{k-1}, 2^k]\}} = 1\), we also have that \(|M_{ij}| \le 2^k\). Indeed, \(|M_{ij}| > 2^k\) implies either \(|\tilde{M}_{ij}| > 2^k\) or \(|\tilde{M}_{ij}| = 0\). In any case, \(|\tilde{M}_{ij}| \notin (2^{k-1}, 2^k]\). So, the last expression is bounded above by

$$\begin{aligned} \sum _{i,j}\sum _{k = k_0+1}^{k_1}2^k {\mathbb {1}}_{\{|\tilde{M}_{ij}| \in (2^{k-1}, 2^k]\}} {\mathbb {1}}_{\{2^k|u_iv_j| \ge 2/\sqrt{n}\}}|u_i v_j| \end{aligned}$$

Since \(\mathbb {E}M_{ij}^2 \le \mathbb {E}A_{ij}^2 = 1\), from (3.3) we have \(2^{1 - k} \ge \sqrt{p_k}\) for any k, where \(p_k\) is the probability of the kth level (defined in (3.2)). As a result, the bound obtained in this step is

$$\begin{aligned} |\sum _{i,j} \tilde{M}_{ij}&{\mathbb {1}}_{(\mathcal {M}^{\mathrm{light}}_{ij})^c} u_i v_j| \nonumber \\&\le \sum _{k = k_0+1}^{k_1} 2^k \sum _{i,j:|u_iv_j| \ge \sqrt{p_k/n}} {\mathbb {1}}_{\{|\tilde{M}_{ij}| \in (2^{k-1}, 2^k]\}} |u_i v_j|. \end{aligned}$$
(3.6)

Step 4. Bernoulli matrices

For each “size” \(k = k_0+1, \ldots , k_1\), let us define an \(n \times n\) matrix \(B^k\) with independent Bernoulli entries

$$\begin{aligned} B^k_{ij}:= {\mathbb {1}}_{\{|M_{ij}| \in (2^{k-1}, 2^k]\}}, \quad \mathbb {E}B^k_{ij} = p_k. \end{aligned}$$

By the Decomposition Lemma (Proposition 1) together with Remark 2, for each k, with probability \(1 - 3n^{-r}\), the entries of \(B^k\) can be assigned to one of three disjoint classes: \(\mathcal {B}_1^k\), where all rows and columns have at most \(C_1r p_k n\) nonzero entries; \(\mathcal {B}_2^k\), where all the rows have at most \(C_2 r\) nonzero entries; and \(\mathcal {B}_3^k\), where all the columns have at most \(C_2 r\) nonzero entries. We are going to further split the sum (3.6) into three sums containing elements of these three classes and estimate each of them separately.

  • \(\mathcal {B}_1:\) The part of (3.6) with the entries from \(\mathcal {B}_1^k\) satisfies the conditions of Lemma 1. Indeed, \({\mathbb {1}}_{\{|\tilde{M}_{ij}| \in (2^{k-1}, 2^k]\}} \le B^k_{ij}\), and within \(\mathcal {B}_1^k\) every row and column of \(B^k\) has at most \(C_1 r p_k n\) nonzero entries. Hence, by Lemma 1, for any \(k = k_0 + 1, \ldots , k_1\), with probability at least \(1 - n^{-r}\),

    $$\begin{aligned} \sum _{i,j:\, |u_iv_j| \ge \sqrt{p_k/n},\ (i,j) \in \mathcal {B}_1^k} {\mathbb {1}}_{\{|\tilde{M}_{ij}| \in (2^{k-1}, 2^k]\}} |u_i v_j| \le C r\sqrt{p_k n} \quad {\text { for all }} u, v \in S^{n-1}. \end{aligned}$$

    So, for any \(u, v \in S^{n-1}\), the \(\mathcal {B}_1\) part of (3.6) is bounded by

    $$\begin{aligned} \sum _{k = k_0+1}^{k_1} 2^k \cdot C r\sqrt{p_k n} = C r \sqrt{n} \sum _{k = k_0+1}^{k_1} 2^k \sqrt{p_k}. \end{aligned}$$

    Then, by the Cauchy–Schwarz inequality and estimate (3.3),

    $$\begin{aligned} \sum _{k = k_0+1}^{k_1} 2^k \sqrt{p_k} \le \sqrt{\sum \nolimits _{k = k_0+1}^{k_1} 2^{2k} p_k} \sqrt{\sum \nolimits _{k = k_0+1}^{k_1} 1} \le 2\sqrt{\kappa }. \end{aligned}$$
  • \(\mathcal {B}_2:\) The part with the entries from \(\mathcal {B}_2^k\) can be estimated by Lemma 4. We have that this part of (3.6) is at most

    $$\begin{aligned} \sum _{i,j} Q_{ij} |u_i v_j|, \quad {\text { where }} \quad Q_{ij} := \sum _{k = k_0+1}^{k_1} 2^k {\mathbb {1}}_{\{(i,j) \in \mathcal {B}_2^k\}} {\mathbb {1}}_{\{|\tilde{M}_{ij}| \in (2^{k-1}, 2^k]\}}. \end{aligned}$$

    Note that for every fixed \(i = 1, \ldots , n\), the number of nonzero entries \(Q_{ij}\) in row i is at most \(C_2r \kappa \). Also, \(|Q_{ij}| \le 2 |\tilde{M}_{ij}| \le 2 |\tilde{A}_{ij}|\) almost surely, so the maximum \(L_2\)-norm of a column of Q is at most \(2\sqrt{c_{\varepsilon } n}\). By Lemma 4, this implies that for any \(u, v \in S^{n-1}\)

    $$\begin{aligned} \sum _{i,j} Q_{ij} |u_i v_j| \le 2\sqrt{C_2 r \kappa c_{\varepsilon } n}. \end{aligned}$$
  • \(\mathcal {B}_3:\) The part with the entries from \(\mathcal {B}_3^k\) can be estimated in the same way as \(\mathcal {B}_2^k\), repeating the argument for \(A^T\).

Step 5. Conclusion

Now, we can combine the estimates obtained for the light members (3.5) and all three parts of the non-light sum, to get that

$$\begin{aligned} \Vert \tilde{M} \Vert \le 4\left( 12 \sqrt{n} + 2C r\sqrt{\kappa n} + 4\sqrt{C_2 r \kappa c_{\varepsilon } n}\right)  \lesssim r \sqrt{c_{\varepsilon } \kappa n } \end{aligned}$$

with probability at least \(1 - 2 e^{-n} - 3\kappa n^{-r} - \kappa n^{-r} \ge 1 - 10 n^{-r}\kappa \) for all n large enough. Proposition 2 is proved.

4 Conclusions: Proof of Theorem 1 and Further Directions

In this section, we conclude the proof of Theorem 1. As we have seen in the previous section, splitting the entries of the matrix A into \(\kappa \) “levels” with similar absolute values produces an extra \(\sqrt{\kappa }\) factor in the norm estimate. Hence, we want to make the number of levels as small as possible. We are going to show that this number can be as small as \(C\ln \ln n\), where n is the size of the matrix. The reason is that we only need to consider the “average” entries of the matrix, those with absolute values between \(O(\sqrt{n / \ln n})\) and \(O(\sqrt{n})\). The “large” entries will all be replaced by zeros by the regularization, and the restriction to the “small” entries gives a matrix with the optimal norm (no regularization is needed). One way to check this is by applying the following result of Bandeira and van Handel:

Theorem 2

[2, Lemma 21] Let X be an \(n \times n\) matrix whose entries \(X_{ij}\) are independent centered random variables. For any \(\varepsilon \in (0, 1/2]\), there exists a constant \(c_{\varepsilon }\) such that for every \(t \ge 0\),

$$\begin{aligned} \mathbb {P}\{\Vert X\Vert \ge (1 + \varepsilon ) 6\sigma + t \} \le n \exp ( - t^2 / c_{\varepsilon } \sigma _*^2), \end{aligned}$$

where \(\sigma \) is the maximum expected row and column norm:

$$\begin{aligned} \sigma ^2 := \max (\sigma ^2_1, \sigma ^2_2), {\text { where }} \quad \sigma ^2_1 = \max _i \sum _j \mathbb {E}(X_{ij}^2), \quad \sigma ^2_2 = \max _j \sum _i \mathbb {E}(X_{ij}^2); \end{aligned}$$

and \( \sigma _*\) is the maximum \(L_\infty \)-norm of an entry:

$$\begin{aligned} \sigma _* := \max _{ij} \Vert X_{ij}\Vert _{\infty }. \end{aligned}$$
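For concreteness, \(\sigma \) and \(\sigma _*\) are computed directly from the entrywise second moments and the entrywise uniform bounds; a small sketch (the input arrays var and bound are our assumed representation of these data):

```python
import numpy as np

def bandeira_van_handel_params(var, bound):
    """Compute sigma and sigma_* of Theorem 2 from
    var[i, j]   = E X_ij^2            (entrywise second moments) and
    bound[i, j] = ||X_ij||_infinity   (almost sure bounds on the entries)."""
    sigma1_sq = var.sum(axis=1).max()   # max expected squared row norm
    sigma2_sq = var.sum(axis=0).max()   # max expected squared column norm
    sigma = np.sqrt(max(sigma1_sq, sigma2_sq))
    sigma_star = bound.max()
    return sigma, sigma_star
```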

Lemma 5

Suppose S is a random \(n \times n\) matrix with i.i.d. mean zero entries \(S_{ij}\), such that \(\mathbb {E}S_{ij}^2 \le 1\) and \(|S_{ij}| < \bar{c}\sqrt{n}/\sqrt{\ln n}\). Let \(r \ge 1\). If \(\bar{c} < c\), then with probability at least \(1 - n^{-r}\)

$$\begin{aligned} \Vert S\Vert \le 13r \sqrt{n}. \end{aligned}$$

Here, c is a small enough absolute constant.

Proof

The proof follows directly from Theorem 2 with \(t = r\sqrt{n}\) and \(\varepsilon = 1/2\); it is enough to take c with \(c^2 = 1/(11c_{1/2})\), where \(c_{1/2}\) is the constant from the statement of Theorem 2. \(\square \)

Now, we are ready to prove Theorem 1.

4.1 Proof of Theorem 1

Let us decompose A into a sum of three \(n \times n\) matrices

$$\begin{aligned} A := S + M + L, \end{aligned}$$
(4.1)

where S contains the entries of A that satisfy \(|A_{ij}| \le 2^{k_0}\), the matrix M contains the entries for which \(2^{k_0} < |A_{ij}| \le 2^{k_1}\), and L contains large entries, satisfying \(|A_{ij}| > 2^{k_1}\). Here,

$$\begin{aligned} k_0 := \left\lfloor \frac{1}{2} \log _2 \frac{c_1 n}{\ln n} \right\rfloor , \quad k_1:= \left\lceil \frac{1}{2}\log _2 (C_2 c_\varepsilon n)\right\rceil , \, {\text { where }} c_\varepsilon = (\ln \varepsilon ^{-1})/\varepsilon \end{aligned}$$

and \(c_1, C_2 > 0\) are absolute constants.
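The decomposition (4.1) is a plain entrywise truncation of A at the two thresholds \(2^{k_0}\) and \(2^{k_1}\); a minimal sketch (the numerical values of c1 and C2 below are placeholders, not the constants used in the proof):

```python
import numpy as np

def split_small_medium_large(A, eps, c1=0.01, C2=100.0):
    """Sketch of the split A = S + M + L in (4.1):
    S keeps |A_ij| <= 2^{k0}, M keeps 2^{k0} < |A_ij| <= 2^{k1},
    and L keeps |A_ij| > 2^{k1}."""
    n = A.shape[0]
    c_eps = np.log(1.0 / eps) / eps
    k0 = int(np.floor(0.5 * np.log2(c1 * n / np.log(n))))
    k1 = int(np.ceil(0.5 * np.log2(C2 * c_eps * n)))
    absA = np.abs(A)
    S = A * (absA <= 2.0 ** k0)
    M = A * ((absA > 2.0 ** k0) & (absA <= 2.0 ** k1))
    L = A * (absA > 2.0 ** k1)
    return S, M, L
```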

Note that S, M and L inherit the essential properties of the matrix A. First, they also have i.i.d. entries (since they are obtained by independent individual truncations of the i.i.d. elements \(A_{ij}\)). Due to the symmetric distribution of the entries of A, the entries of S, M and L have mean zero. Also, their second moments are bounded from above by \(\mathbb {E}A_{ij}^2 = 1\).

Note that all the entries in S satisfy \(|A_{ij}| \le \sqrt{c_1 n/\ln n}\). Thus, as long as we choose the constant \(c_1\) small enough to satisfy the condition in Lemma 5, the norm of S can be estimated as

$$\begin{aligned} \mathbb {P}\{\Vert S\Vert > 13 r\sqrt{n}\} < n^{-r}. \end{aligned}$$
(4.2)

Clearly, replacing some subset of rows and columns by zeros can only decrease the norm of S.

By Remark 1, with probability at least \( 1 - 7 \exp (-\varepsilon n/12)\), all rows and columns of \(\tilde{A}\) have bounded norms: for \(i = 1, \ldots , n\)

$$\begin{aligned} \Vert \mathrm{row}_i(\tilde{A})\Vert _2 \le C \sqrt{c_\varepsilon n} {\text { and }} \Vert \mathrm{col}_i(\tilde{A})\Vert _2 \le C \sqrt{c_\varepsilon n}. \end{aligned}$$

In particular, this implies that all the entries of \(\tilde{A}\) have absolute values bounded by \(C \sqrt{c_\varepsilon n}\). So, by taking the constant \(C_2 \ge C^2\), we ensure that L vanishes after the regularization.

Proposition 2 estimates the norm of M after the regularization (zeroing out rows and columns with large norms):

$$\begin{aligned} \mathbb {P}\{\Vert \tilde{A} \cdot {\mathbb {1}}_{\{|\tilde{A}_{ij}| \in (2^{k_0}, 2^{k_1}]\}}\Vert > C r \sqrt{c_{\varepsilon } n \kappa } \} \le 10n^{-r}\kappa . \end{aligned}$$
(4.3)

By definition,

$$\begin{aligned} \kappa := k_1 - k_0 \le \frac{1}{2} \left[ \log _2 (C_2 c_\varepsilon n) - \log _2 \frac{c_1n }{\ln n} + 2\right] \le \log _2 \ln n \end{aligned}$$

for all large enough n.

Using the triangle inequality to combine the norm estimates (4.2) and (4.3), we get \(\Vert \tilde{A}\Vert \lesssim r \sqrt{c_{\varepsilon } n \cdot \ln \ln n }\) with probability at least

$$\begin{aligned} 1- n^{-r} - 7e^{-\varepsilon n/12} - 10n^{-r}\ln \ln n \ge 1 - n^{-r +0.1} \end{aligned}$$

for all n large enough. This concludes the proof of Theorem 1.

Remark 3

Note that the only place in the argument where it matters what entry subset we replace by zeros is Step 2 of the proof of Proposition 2. To have a manageable union bound estimate, we need to be sure that the number of options for the potential subset to be deleted is of order \(\exp (n)\). (So, \(\exp (\ln 2 \cdot n^2)\) options for a general entry subset would be too much.)

Hence, also recalling Remark 1, we emphasize that the norm estimate (1.1) holds with probability \(1 - n^{0.1- r}\) as long as we make the \(L_2\)-norms of all rows and columns bounded by \(C\sqrt{c_\varepsilon n}\) by zeroing out any product subset of the entries of A.

5 Proof of Corollary 1

The general idea of the proof of Corollary 1 is to show that, after the regularization procedure, all rows and columns of the matrix have well-bounded \(L_2\)-norms, and then to apply Theorem 1.

Originally, the core part of the algorithm (Step 2) was presented for Bernoulli random matrices in [12, 13]. We will use the following version of [13, Lemma 5.1] (based on the ideas developed in [12, Proposition 2.1]):

Lemma 6

Let B be an \(n \times n\) matrix with independent 0–1 valued entries, \(\mathbb {E}B_{ij} \le p\). Then, for any \( L \ge 10\), with probability \(1 - \exp (-n \exp (-L pn))\) the following holds. If we define

$$\begin{aligned} W_{ij} := {\left\{ \begin{array}{ll} 1, &{}{\text {if }} e_i^{\mathrm{row}}(B) \le L np {\text { or }} B_{ij} = 0, \\ L np/e_i^{\mathrm{row}}(B) , &{}{\text {otherwise,}} \end{array}\right. } \end{aligned}$$

and \(V_j := \prod _{i = 1}^n W_{ij}\) , and \(J := \{j: V_j < 0.1\}\), then

$$\begin{aligned} |J| \le n \exp (-L n p) \quad {\text { and }} \, \sum _{j \in J^c} B_{ij} \le 10Lnp, {\text { for any }} i \in [n] . \end{aligned}$$
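The weight construction of Lemma 6 translates into a few lines of code; the sketch below (our implementation; the threshold 0.1 is the one from the lemma) returns the exceptional column set J:

```python
import numpy as np

def exceptional_columns(B, p, L):
    """Sketch of the weighting in Lemma 6 for a 0-1 matrix B.
    W_ij = 1 if row i has at most L*n*p nonzeros or B_ij = 0,
    W_ij = L*n*p / e_i^row(B) otherwise; V_j is the product of the weights
    in column j, and J collects the columns with V_j < 0.1."""
    n = B.shape[0]
    row_counts = np.count_nonzero(B, axis=1)                 # e_i^row(B)
    W = np.ones((n, n))
    heavy = row_counts > L * n * p
    W[heavy, :] = (L * n * p) / row_counts[heavy, None]      # downweight entries of heavy rows ...
    W[B == 0] = 1.0                                          # ... but only at their nonzero positions
    V = W.prod(axis=0)                                       # column weights V_j
    return np.flatnonzero(V < 0.1)                           # the exceptional set J
```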

In order to pass from the Bernoulli case to the general distribution case, we are going to use a version of the “level truncation” idea once again. Note that here we need the probabilities of the levels \(p_l\) to be both not too large (for the joint cardinality estimate) and not too small (for the union bound over probabilities). This motivates the idea to group “similar size” entries \(A_{ij}\) not by the absolute value \(|A_{ij}|\), but by the common \(2^{-l}\)-quantiles of the distribution of \(\xi \sim A_{ij}^2\).

Remark 4

A version of Corollary 1 can be proved where one defines the sets \(\mathcal {A}_l\) to contain all \(A_{ij}\) such that \(A_{ij}^2 \in (q_{l-1}, q_l]\), where \(q_l\) is the \(2^{-l}\)-quantile of the distribution of \(A_{ij}^2\), namely

$$\begin{aligned} q_l := \inf \{t: \mathbb {P}\{A^2_{ij} > t\} \le 2^{-l}\}. \end{aligned}$$
(5.1)

The proof of this version is actually almost identical to the one presented below and gives smaller absolute constants. However, the additional requirement of knowing the quantiles of the distribution of the entries in order to regularize the norm of the matrix seems undesirable. So, we are going to prove the distribution-oblivious version, as presented by Algorithm 1.

The next lemma shows that the order statistics used in the statement of Corollary 1 approximate quantiles of the distribution of \(A_{ij}^2\).

Lemma 7

(Order statistics as approximate quantiles) Let \(\hat{A}_1, \ldots , \hat{A}_{n^2}\) be all the entries of an \(n \times n\) random matrix A in non-increasing order of absolute values, and let \(q_k\) be the \(2^{-k}\)-quantiles of the distribution of \(A_{ij}^2\) defined by (5.1). Then, with probability at least \(1 - 4\exp (-n^2 2^{-k_1 - 2})\), for all \(k = 1, \ldots , k_1\)

$$\begin{aligned} q_{k-2} \le \hat{A}^2_{ \lceil n^2 2^{1-k}\rceil } \le q_k. \end{aligned}$$
(5.2)

Proof

A direct application of Chernoff’s inequality shows that for any k

$$\begin{aligned} \mathbb {P}\{ \nu _1 > \lceil n^2 2^{1-k} \rceil \} \le \exp (-n^2 2^{-k}/4), \end{aligned}$$

where \(\nu _1\) is the number of entries \(A_{ij}\) such that \(A_{ij}^2 > q_k\). Another application of Chernoff’s inequality (its lower tail) shows that

$$\begin{aligned} \mathbb {P}\{ \nu _2 < \lceil n^2 2^{1-k}\rceil \} \le \exp (-n^2 2^{-k}/4), \end{aligned}$$

where \(\nu _2\) is the number of entries \(A_{ij}\) such that \(A_{ij}^2 > q_{k-2}\).

Then, with probability at least \(1 - 2 \exp (-n^2 2^{-k}/4)\), the order statistic \(\hat{A}^2_{ \lceil n^2 2^{1-k}\rceil }\) is at least \(q_{k-2}\) and at most \(q_k\). Taking a union bound over k, Eq. (5.2) holds with probability \(1 - 4 \exp (-n^2 2^{-k_1}/4)\) for all \(k = 1, \ldots , k_1\). \(\square \)

Remark 5

We will use \(k_1 = \lceil \log _2 (8 n/\varepsilon )\rceil \). An easy computation using (5.2) shows that

$$\begin{aligned} q_{k_1 - l - 3} \le \hat{A}^2_{\lceil 2^{4 + l - k_1} n^2 \rceil } \le \hat{A}^2_{\lceil 2^l n \varepsilon \rceil } \le \hat{A}^2_{\lceil 2^{3 + l - k_1} n^2 \rceil } \le q_{k_1- l - 2} \end{aligned}$$

for all \(l = 0, \ldots , l_{\max }\) with probability \(1 - 4 \exp (-n\varepsilon /4)\). Then, for \(\mathcal {A}_l\) as defined in (1.2) and all \(l \le l_{\max }\),

$$\begin{aligned} \mathbb {P}\{A_{ij} \in \mathcal {A}_l\} \le 2^{3 + l - k_1} \le 2^{l}\varepsilon /n. \end{aligned}$$
(5.3)

Now, we are ready to prove Corollary 1.

5.1 Proof of Corollary 1

Let \(k_1 = \lceil \log _2 (8 n/\varepsilon )\rceil \), and let \(q_{k_1}\) be the corresponding \(2^{-k_1}\)-quantile of the distribution of \(A_{ij}^2\) (as defined by (5.1)). It is easy to check by Chernoff’s inequality that the total number of entries of A such that \(A_{ij}^2 \ge q_{k_1}\) is at most \(\varepsilon n/2\) with probability at least \(1 - e^{-\varepsilon n/4}\). So, all these “large” entries will be replaced by zeros at Step 1 of the regularization Algorithm 1.

To prove Corollary 1, it is enough to show that:

  1. Algorithm 1 makes all rows and columns of the truncated matrix \(A \cdot {\mathbb {1}}_{\{A_{ij}^2 < q_{k_1}\}}\) have norms bounded by \(C\sqrt{c_\varepsilon n}\). Then, in view of Remark 3, we can apply Theorem 1 to conclude the desired norm estimate.

  2. The cardinalities of the exceptional index subsets satisfy \(|I|, |J| \le \varepsilon n/2\) with high probability. Then, the regularization procedure is indeed local.

The matrix \(\bar{A} := A \cdot {\mathbb {1}}_{\{A_{ij}^2 < q_{k_1}\}}\) is naturally decomposed into the union of the “levels” with entries coming from the sets \(\mathcal {A}_l\), \(l = 0, \ldots , l_{\max }\), and “the leftover” part that contains the \(A_{ij}\) such that \(A^2_{ij} < \hat{A}^2_{\lceil n \varepsilon 2^{l_{\max }} \rceil }\). So,

$$\begin{aligned} \bar{A}_{ij} = A_{ij} {\mathbb {1}}_{\{A_{ij} \in \cup \mathcal {A}_l\}} + A_{ij} {\mathbb {1}}_{\{|A_{ij}| < \hat{A}_{\lceil n \varepsilon 2^{l_{\max }} \rceil }\}} =: A^{\mathrm{Large}} + A^{\mathrm{Small}}. \end{aligned}$$

All the rows and columns of \(A^{\mathrm{Small}}\) have \(L_2\)-norms at most \(C \sqrt{n r c_{\varepsilon }}\) with probability at least \(1 - n^{-r}\) (without any regularization). This follows from an application of Bernstein’s inequality (e.g., [20, Theorem 2.8.4]) to sums of independent centered entries bounded by \(\sqrt{Cc_\varepsilon n/\ln n}\). Indeed, we just need to check the boundedness condition. Recall that \(l_{\max } = \lfloor \log _2 (\ln n/\ln \varepsilon ^{-4})\rfloor \). By the definition of the quantiles \(q_k\) and Markov’s inequality,

$$\begin{aligned} \mathbb {P}\{A_{ij}^2 \ge q_{k_1 - 2 - l_{\max }} \} \ge 2^{-k_1 + 1} \frac{\ln n}{ \ln \varepsilon ^{-4}} \ge \mathbb {P}\left\{ A_{ij}^2 \ge \frac{32 c_\varepsilon n}{\ln n}\right\} . \end{aligned}$$

Hence, the entries of \(A^{\mathrm{Small}}\) can be estimated from above by

$$\begin{aligned} \hat{A}^2_{\lceil n \varepsilon 2^{l_{\max }} \rceil } \le q_{k_1 - 2 - l_{\max }} \lesssim \frac{32 c_\varepsilon n}{\ln n}. \end{aligned}$$

For the part \(A^{\mathrm{Large}}\), we use Lemma 6 applied to the \(n\times n\) matrices with i.i.d. entries \(B^l_{ij} = {\mathbb {1}}_{\{A_{ij} \in \mathcal {A}_l\}}\) for each \(l = 0, \ldots , l_{\max }\), with \(L = 2c_{\varepsilon }\) and \(p_l = 2^{l}\varepsilon /n\) (which is a valid choice by Remark 5). From the union bound estimate, we can conclude that the statement of Lemma 6 holds for all \(l \le l_{\max }\) with probability at least

$$\begin{aligned} 1 - \sum _{l \le l_{\max }} \exp (-n \exp (-2c_{\varepsilon } n 2^{l}\varepsilon /n) )\ge 1 - \exp (-n^{0.5}). \end{aligned}$$

Recall that \(\bar{J} = \cup _{l} J_l\) is the union of all exceptional column index subsets found for all matrices \(A\cdot {\mathbb {1}}_{\{A_{ij} \in \mathcal {A}_l\}}\) with \(l = 0, \ldots , l_{\max }\). Note that by the definition of quantiles and second moment condition,

$$\begin{aligned} \sum _{s=0}^{\infty } q_s 2^{-s-1} \le \mathbb {E}A_{ij}^2 \le 1. \end{aligned}$$
(5.4)

By Lemma 6, we can estimate for every \(i \in [n]\)

$$\begin{aligned} \Vert \mathrm{row}_i(A^{\mathrm{Large}}_{[n] \times \bar{J}^c})\Vert _2^2&\le \sum _{l \le l_{\max }} q_{k_1 - l - 2} 20 c_\varepsilon n p_l \\&\le \sum _{l \le l_{\max }} q_{k_1 - l - 2} 20 c_\varepsilon n \frac{2^{l - k_1} \varepsilon }{n} 2^{k_1} \le 160 c_{\varepsilon } n, \end{aligned}$$

since \(2^{k_1} \le 16 n/\varepsilon \); we used (5.4) with \(s = k_1 - l - 2\) in the last step.

Then, by the triangle inequality for \(L_2\)-norms applied to the rows of \(A_{[n] \times \bar{J}^c}^{\mathrm{Large}}\) and \(A^{\mathrm{Small}}\), the row boundedness condition is satisfied for \(\bar{A}_{[n] \times \bar{J}^c}\). Next, at Step 3, we add the set \(\hat{J}\) of columns with the largest \(L_2\)-norms. The same argument as in Remark 1 shows that with probability at least \(1- n^{-r}\), there are no columns with norm larger than \(C \sqrt{c_\varepsilon n}\) outside the set \(\hat{J}\). So, the matrix \(\tilde{A}_{[n] \times J^c}\) has all row and column norms well bounded (recall that \(J := \bar{J} \cup \hat{J}\)). Then, by Theorem 1, with probability at least \(1 - C (\ln \ln n)n^{- r}\)

$$\begin{aligned} \Vert \tilde{A}_{[n] \times J^c}\Vert \lesssim r^{3/2} \sqrt{c_{\varepsilon } n \ln \ln n}. \end{aligned}$$
(5.5)

Repeating the same argument for the transpose, we have that

$$\begin{aligned} \Vert \tilde{A}_{I^c \times J}\Vert \le \Vert \tilde{A}_{I^c \times [n]}\Vert \lesssim r^{3/2} \sqrt{c_{\varepsilon } n \ln \ln n}. \end{aligned}$$
(5.6)

Now, we can combine (5.5) and (5.6) by the triangle inequality for the operator norm to conclude the desired norm estimate for \(\tilde{A}\) on the intersection of the good events, namely with probability at least

$$\begin{aligned} 1 - e^{-\varepsilon n/4} - n^{-r} - \exp (-n^{0.5}) - 2C (\ln \ln n)n^{- r} \ge 1 - n^{0.1-r}. \end{aligned}$$

Finally, let us check that the regularization is local. Again by Lemma 6, the total number of exceptional columns

$$\begin{aligned} |\bar{J}| = |\bigcup _l J_l| \le \sum _{l \le l_{\max }} n \exp (-2c_\varepsilon n 2^{l}\varepsilon /n) \le n \varepsilon /4, \end{aligned}$$

since we are summing a geometric progression and \(l \ge 0\). Since the same argument holds for the cardinality of \(\bar{I}\), we can conclude that, with high probability, Algorithm 1 makes changes only within an \(n \varepsilon \times n \varepsilon \) submatrix of A. This concludes the proof of Corollary 1.

6 Discussion

6.1 Regularization by the Individual Corrections of Entries

Do we actually need to look at the rows and columns of A? A simpler and very intuitive idea would be to regularize the norm of A just by zeroing out a few large entries of A. However, this approach does not work when the entries have only two finite moments: for an efficient local regularization, one has to account for the mutual positions of the entries in the matrix, not only for their absolute values.

Only in the case when the \(A_{ij}\) have more than two finite moments does the truncation idea work, and then it is not hard to derive a regularization result of this type from known bounds on random matrices such as [1, 14, 18] (see also the discussion in [13, Section 1.4]).

In the two-moment case, individual correction of the entries can guarantee only a bound with a larger additional factor \(\ln n\) in the norm. It can be derived from known general bounds on random matrices, such as the matrix Bernstein inequality [17]. One would apply the matrix Bernstein inequality to the entries truncated at level \(\sqrt{n}\) to get that \(\Vert \tilde{A} \Vert \le \varepsilon ^{-1/2} \sqrt{n} \cdot \ln n\).

We consider Theorem 1 more advantageous than the individual-corrections approach not only because we are able to bound the norm closer to the optimal order \(\sqrt{n}\), but also because it gives more adequate information about the obstructions to the ideal norm bound: they lie not only in the entries that are too large, but also in the rows and columns that accumulate too many entries (all of which are, potentially, of average size).

6.2 Symmetry Assumption

The assumption that the entries of A have a symmetric distribution does not look natural and can potentially be avoided. We need it in the current argument to keep the mean zero after various truncations by absolute value (in (4.1) and also in (3.4)). The standard symmetrization techniques (see [10, Lemma 6.3]) would not work in this case, since we combine the convex norm function with truncation (zeroing out of a product subset), which is not convex.

6.3 Dependence on n

Another potential improvement concerns the extra \(\sqrt{\ln \ln n}\) factor over the optimal order \(\Vert \tilde{A} \Vert \sim \sqrt{n}\). The reason for its appearance in our proof is that we consider the restrictions of A to the discretization “levels” independently, and independently estimate their norms. The second moment assumption gives us that \(\sum 2^{2k} p_k \sim 1\). However, the best we can hope for the norm of one “level” (after proper regularization) is \(2^k \sqrt{n p_k}\) (since this is the expected \(L_2\)-norm of a restricted row). Thus, we end up summing the square roots of the terms of a converging series, \(\sum (2^{2k} p_k)^{1/2}\), which for some distributions is as large as the square root of the number of summands (\(\ln \ln n\) in our case).
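For instance (a toy example illustrating the loss, with the \(\kappa \) levels chosen to carry equal weight \(2^{2k} p_k = 1/\kappa \)), one gets

$$\begin{aligned} \sum _k 2^{2k} p_k = 1, \qquad {\text { while }} \qquad \sum _k 2^k \sqrt{p_k} = \sum _k \frac{1}{\sqrt{\kappa }} = \sqrt{\kappa }, \end{aligned}$$

so the level-by-level estimates can indeed accumulate an extra factor of order \(\sqrt{\kappa } \sim \sqrt{\ln \ln n}\).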

It would be desirable to remove the extra \(\sqrt{\ln \ln n}\) term and the symmetric distribution assumption, proving something like the following conjecture.

Conjecture 1

Consider an \(n \times n\) random matrix A with i.i.d. mean zero entries such that \(\mathbb {E}A_{ij}^2 = 1\). Let \(\tilde{A}\) be the matrix obtained from A by zeroing out all rows and columns such that

$$\begin{aligned} \Vert \mathrm{row}_i(A)\Vert _m \ge C\mathbb {E}\Vert \mathrm{row}_i(A)\Vert _m, \quad \Vert \mathrm{col}_i(A)\Vert _m \ge C\mathbb {E}\Vert \mathrm{col}_i(A)\Vert _m \end{aligned}$$
(6.1)

for some \(L_m\)-norm to be specified (e.g., \(m = 2\)). Then with probability \(1 - o(1)\), the operator norm satisfies \(\Vert \tilde{A}\Vert \le C' \sqrt{n}\).

Note that this result would be somewhat similar to the estimate proved by Seginer [14]: in expectation, the norm of a matrix with i.i.d. elements is bounded by the largest norm of its rows and columns. However, note that after cutting the “heavy” rows and columns, we lose the independence of the entries in the resulting matrix. And in general, the question of norm regularization is not equivalent to another interesting question, about sufficient conditions on the distribution of the entries that ensure the optimal order of the operator norm.