1 Introduction

1.1 Motivation

Semidefinite programming (SDP) relaxations are an important tool to provide dual bounds for many discrete and continuous non-convex optimization problems [35]. These SDP relaxations have the form

$$\begin{aligned} \begin{array}{rcl} &\min &\langle C, X\rangle \\ &\text {s.t.} &\langle A^i, X\rangle \le b_i \quad \forall i \in \{1, \dots , m\}\\ &&X \in \mathcal {S}^n_+, \end{array} \end{aligned}$$
(1)

where C and the \(A^i\)’s are \(n \times n\) matrices, \(\langle M, N\rangle := \sum _{i,j} M_{ij} N_{ij}\), and \(\mathcal {S}^n_+\) denotes the cone of \(n\times n\) symmetric positive semidefinite (PSD) matrices:

$$\begin{aligned} \mathcal {S}^n_+=\{ X\in \mathbb {R}^{n\times n}\,|\,X=X^T, ~x^{\top } Xx\ge 0, ~\forall x\in \mathbb {R}^n\}. \end{aligned}$$

In practice, it is often computationally challenging to solve large-scale instances of SDPs due to the global PSD constraint \(X \in \mathcal {S}^n_+\). One technique to address this issue is to consider a further relaxation that replaces the PSD cone by a larger cone \(\mathcal {S} \supseteq \mathcal {S}^n_+\). In particular, one can enforce PSD-ness on (some or all) smaller \(k\times k\) principal submatrices of X, i.e., we consider the problem

$$\begin{aligned} \begin{array}{rcl} &\min &\langle C, X\rangle \\ &\text {s.t.} &\langle A^i, X\rangle \le b_i \quad \forall i \in \{1, \dots , m\}\\ &&\text {selected } k \times k \text { principal submatrices of } X \text { are in } \mathcal {S}^k_+. \end{array} \end{aligned}$$
(2)

We call such a relaxation the sparse SDP relaxation.

One reason why these relaxations may be solved more efficiently in practice is that we can enforce PSD constraints by iteratively separating linear constraints. Enforcing PSD-ness on smaller \(k\times k\) principal submatrices leads to linear constraints that are sparser, an important property leveraged by linear programming solvers that greatly improves their efficiency [3, 7, 17, 29, 32]. This is an important motivation for using sparse SDP relaxations [4, 18, 28]. (It is also the motivation for studying approximations of polytopes [20], convex hulls of integer linear programs [19, 21, 32], and integer programming formulations [22] by sparse linear inequalities.) This sparsity is the reason we call the relaxation obtained by enforcing the SDP constraints on smaller \(k\times k\) principal submatrices of X the sparse SDP relaxation.

It has been observed that sparse SDP relaxations not only can be solved much more efficiently in practice, but in some cases they produce bounds that are close to the optimal value of the original SDP. See [4, 18, 28] for successful applications of this technique for solving box quadratic programming instances, and [26, 30] for solving the optimal power flow problem in power systems.

Despite their computational success, theoretical understanding of sparse SDP relaxations remains quite limited. In this paper, we initiate such a theoretical investigation. Ideally we would like to compare the objective function values of (1) and (2), but this appears to be a very challenging problem. Therefore, we consider a simpler data-independent question, where we ignore the data of the SDP and the particular selected principal submatrices, to arrive at the following:

How close to the PSD cone \(\mathcal {S}^n_+\) do we get when we only enforce PSD-ness on \(k \times k\) principal submatrices?

To formalize this question, we begin by defining the k-PSD closure, namely the set of symmetric matrices all of whose \(k \times k\) principal submatrices are PSD.

Definition 1

(k-PSD closure) Given positive integers n and k where \(2 \le k \le n\), the \(k\)-PSD closure \(\mathcal {S}^{n,k}\) is the set of all \(n \times n\) symmetric real matrices where all \(k \times k\) principal submatrices are PSD.
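For concreteness, the following Python sketch (function and variable names are ours, for illustration only) tests membership in the \(k\)-PSD closure by brute force over all \(\binom{n}{k}\) principal submatrices; this is of course only practical for small n.

```python
import itertools
import numpy as np

def in_k_psd_closure(M, k, tol=1e-9):
    """Check whether all k x k principal submatrices of the symmetric matrix M are PSD."""
    n = M.shape[0]
    for J in itertools.combinations(range(n), k):
        M_J = M[np.ix_(J, J)]                      # principal submatrix indexed by J
        if np.linalg.eigvalsh(M_J).min() < -tol:
            return False
    return True

# Example: every 2 x 2 principal submatrix is PSD, but the full 3 x 3 matrix is not.
M = np.array([[ 1.0, -0.9, -0.9],
              [-0.9,  1.0, -0.9],
              [-0.9, -0.9,  1.0]])
print(in_k_psd_closure(M, 2))             # True:  M is in the 2-PSD closure S^{3,2}
print(np.linalg.eigvalsh(M).min() >= 0)   # False: M itself is not PSD (eigenvalue -0.8)
```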

It is clear that the \(k\)-PSD closure is a relaxation of the PSD cone (i.e., \(\mathcal {S}^{n,k} \supseteq \mathcal {S}^n_+\) for all \(2 \le k \le n\)) and is an increasingly better approximation as the parameter k increases, i.e., we enforce that larger chunks of the matrix are PSD (in particular \(\mathcal {S}^{n,n} = \mathcal {S}^n_+\)). The SOCP relaxation formulated in [30] is equivalent to using the \(k\)-PSD closure with \(k=2\) to approximate the PSD cone. Our definition is a generalization of this construction.

It is worth noting that the dual cone of \(\mathcal {S}^{n,k}\) is the set of symmetric matrices with factor width k, defined and studied in [8, 27]. In particular, the set of symmetric matrices with factor width 2 is the set of scaled diagonally dominant matrices [8, 33], i.e., symmetric matrices A such that DAD is diagonally dominant for some positive diagonal matrix D. Note that [2] uses scaled diagonally dominant matrices to construct inner approximations of the SDP cone for use in solving polynomial optimization problems.

1.2 Problem setup

We are interested in understanding how well the \(k\)-PSD closure approximates the PSD cone for the different values of k and n. To measure this approximation we would like to consider the matrix in the \(k\)-PSD closure that is farthest from the PSD cone. We need to make two choices here: the norm to measure this distance and a normalization method (since otherwise there is no upper bound on the distance between matrices in the PSD cone and the \(k\)-PSD closure).

We will use the Frobenius norm \(\Vert \cdot \Vert _F\) for both purposes. That is, the distance between a matrix M and the PSD cone is measured as \(dist _F(M,\mathcal {S}^n_+) = \inf _{N \in \mathcal {S}^n_{+}}\Vert M - N\Vert _F\), and we restrict our attention to matrices in the \(k\)-PSD closure with Frobenius norm equal to 1. Thus we arrive at the (normalized) Frobenius distance between the \(k\)-PSD closure and the PSD cone, namely the largest distance between a unit-norm matrix M in \(\mathcal {S}^{n,k}\) and the cone \(\mathcal {S}^n_+\):

$$\begin{aligned} \overline{\text {dist}}_F(\mathcal {S}^{n,k},\mathcal {S}^n_+)&=\sup _{M\in \mathcal {S}^{n,k},\,\Vert M\Vert _F=1}dist _F(M,\mathcal {S}^n_+)\\&= \sup _{M\in \mathcal {S}^{n,k},\,\Vert M\Vert _F=1} \inf _{N \in \mathcal {S}^n_+} \Vert M - N\Vert _F. \end{aligned}$$

Note that since the origin belongs to \(\mathcal {S}^n_+\) this distance is at most 1.
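For illustration, a minimal numpy sketch of this distance; it relies on the standard fact (proved as Proposition 1 below) that the Frobenius-nearest PSD matrix is obtained by zeroing out the negative eigenvalues in the eigendecomposition. The matrix M is the one from the previous sketch, rescaled to unit Frobenius norm.

```python
import numpy as np

def dist_to_psd(M):
    """Frobenius distance from a symmetric matrix M to the PSD cone S^n_+."""
    eigvals, eigvecs = np.linalg.eigh(M)
    N = eigvecs @ np.diag(np.clip(eigvals, 0.0, None)) @ eigvecs.T   # projection onto S^n_+
    return np.linalg.norm(M - N, 'fro')   # = sqrt of the sum of squared negative eigenvalues

M = np.array([[ 1.0, -0.9, -0.9],
              [-0.9,  1.0, -0.9],
              [-0.9, -0.9,  1.0]])
M = M / np.linalg.norm(M, 'fro')          # normalize as in the definition above
print(dist_to_psd(M))                     # ~0.285; a lower bound on the distance for n=3, k=2
```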

The rest of the paper is organized as follows: Sect. 2 presents all our results and Sect. 3 concludes with some open questions. Then Sect. 4 presents additional notation and background results needed for proving the main results. The remaining sections present the proofs of the main results.

2 Our results

In order to understand how well the \(k\)-PSD closure approximates the PSD cone we present:

  • Matching upper and lower bounds on \(\overline{\text {dist}}_F(\mathcal {S}^{n,k},\mathcal {S}^n_+)\) for different regimes of k.

  • A proof that a polynomial number of \(k \times k\) PSD constraints suffice to provide a good approximation (in Frobenius distance) to the full \(k\)-PSD closure (which has \(\binom{n}{k} \approx \big (\frac{en}{k}\big )^k\) such constraints).

We present these results in more detail in the following subsections.

2.1 Upper bounds

First we show that the distance between the \(k\)-PSD closure and the SDP cone is at most roughly \(\frac{n-k}{n}\). In particular, this bound goes approximately from 1 to 0 as the parameter k goes from 2 to n, as expected.

Theorem 1

For all \(2\le k<n\) we have

$$\begin{aligned} \overline{\text {dist}}_F(\mathcal {S}^{n,k},\mathcal {S}^n_+)\le \frac{n -k}{ n + k - 2}. \end{aligned}$$
(3)

The idea for obtaining this upper bound is the following: given any matrix M in the \(k\)-PSD closure \(\mathcal {S}^{n,k}\), we construct a PSD matrix \(\widetilde{M}\) by taking the average of the (PSD) matrices obtained by zeroing out all entries of M but those in a \(k \times k\) principal submatrix; the distance between M and \(\widetilde{M}\) provides an upper bound on \(\overline{\text {dist}}_F(\mathcal {S}^{n,k},\mathcal {S}^n_+)\). The proof of Theorem 1 is provided in Sect. 5.

It appears that for k close to n this upper bound is not tight. In particular, our next upper bound is of the form \((\frac{n-k}{n})^{3/2}\), showing that the gap between the \(k\)-PSD closure and the PSD cone goes to 0 as \(k \rightarrow n\) at a faster rate than that prescribed by the previous theorem. In particular, for \(k = n - c\) for a constant c, Theorem 1 gives an upper bound of \(O\left( \frac{1}{n}\right) \) whereas the next theorem gives an improved upper bound of \(O\left( \frac{1}{n^{3/2}}\right) \).

Theorem 2

Assume \(n \ge 97\) and \(k \ge \frac{3n}{4}\). Then

$$\begin{aligned} \overline{\text {dist}}_F(\mathcal {S}^{n,k},\mathcal {S}^n_+) \le 96 \, \bigg (\frac{n-k}{n}\bigg )^{3/2}. \end{aligned}$$
(4)

It is easy to verify that for a sufficiently large constant r, if \(k > rn\) then the upper bound given by Theorem 2 dominates the upper bound given by Theorem 1.

The proof of Theorem 2 is more involved than that of Theorem 1. The high-level idea is the following: Using Cauchy’s Interlace Theorem for eigenvalues of Hermitian matrices, we first verify that every matrix in \(\mathcal {S}^{n,k}\) has at most \(n - k\) negative eigenvalues. Since the PSD cone consists of symmetric matrices with non-negative eigenvalues, it is now straightforward to see that the distance from a unit-norm matrix \(M \in \mathcal {S}^{n,k}\) to \(\mathcal {S}^n_+\) is upper bounded by the absolute value of the most negative eigenvalue of M times \(\sqrt{n -k}\). To bound a negative eigenvalue \(-\lambda \) of M (where \(\lambda \ge 0\)), we consider an associated eigenvector \(v \in \mathbb {R}^n\) and randomly sparsify it to obtain a random vector V that has at most k non-zero entries. By construction we ensure that \(V \approx v\), and that V remains almost orthogonal to all other eigenvectors of M. This guarantees that \(V^{\top } M V \approx -\lambda +\) “small error”. On the other hand, since only k entries of V are non-zero, it guarantees that \(V^\top M V\) only depends on a \(k \times k\) submatrix of M, which is PSD by the definition of the \(k\)-PSD closure; thus, we have \(V^\top M V \ge 0\). Combining these observations we get that \(\lambda \le \) “small error”. This eigenvalue bound is used to upper bound the distance from M to the PSD cone. A proof of Theorem 2 is provided in Sect. 6.

We briefly comment on extensions of the results of Theorems 1 and 2 when the distance is measured in other norms. More generally, consider:

$$\begin{aligned} \overline{dist }_{\mathcal {N}_1, \mathcal {N}_2}(\mathcal {S}^{n,k},\mathcal {S}^n_+) := \sup _{M \in \mathcal {S}^{n,k}, \Vert M\Vert _{\mathcal {N}_1} = 1} \inf _{N \in \mathcal {S}^n_+} \Vert M - N\Vert _{\mathcal {N}_2}, \end{aligned}$$

where \(\Vert \cdot \Vert _{\mathcal {N}_1}\) and \(\Vert \cdot \Vert _{\mathcal {N}_2}\) are two given matrix norms. We call \(\mathcal {N}_1\) the normalizing norm and \(\mathcal {N}_2\) the distance-measuring norm.

The proof of Theorem 1 uses two properties of the norms: (i) the normalizing norm and the distance-measuring norm are the same and (ii) this norm does not change if all off-diagonal terms of the matrix are negated. A class of matrix norms for which (ii) holds is that of element-wise norms, i.e., norms of the form \(\Vert A\Vert = \left( \sum _{i,j} |A_{ij}|^p\right) ^{\frac{1}{p}}\). Thus, Theorem 1 may be more generally stated as:

$$\begin{aligned} \overline{dist }_{\mathcal {N}, \mathcal {N}}(\mathcal {S}^{n,k},\mathcal {S}^n_+) \le \frac{n -k}{n + k -2}, \end{aligned}$$

where \(\mathcal {N}\) is an element-wise norm.

The proof of Theorem 2, as discussed above, uses two key arguments: (i) the number of negative eigenvalues can be bounded by \(n -k\) for matrices in \(\mathcal {S}^{n,k}\) and (ii) the most negative eigenvalue of matrices in \(\mathcal {S}^{n,k}\) with Frobenius norm 1 is at most \(\frac{96 (n -k)}{n^{3/2}}\) in absolute value. Therefore, as pointed out by an anonymous reviewer, Theorem 2 easily generalizes to the case where the distance-measuring norm is any other unitarily invariant matrix norm (thus, a norm depending entirely on the eigenvalues). As an example, consider the Schatten p-norm \(\Vert A\Vert _{S_p} := \left( \sum _{i = 1}^{n} \sigma _i(A)^p \right) ^{\frac{1}{p}}\) where \(\sigma _i(A)\) is the \(i^{th}\) singular value of A. Then, we obtain the following result:

$$\begin{aligned} \overline{dist }_{F, S_{p}}(\mathcal {S}^{n,k},\mathcal {S}^n_+) \le \frac{96 (n -k) (n - k)^{1/p}}{n^{3/2}}, \quad \text {for all} \quad k \ge \frac{3}{4}n,\ n \ge 97. \end{aligned}$$

2.2 Lower bounds

We next provide lower bounds on \(\overline{\text {dist}}_F(\mathcal {S}^{n,k},\mathcal {S}^n_+)\) that show that the upper bounds presented in Sect. 2.1 are tight for various regimes of k. The first lower bound, presented in the next theorem, is obtained by a simple construction of an explicit matrix in the \(k\)-PSD closure that is far from being PSD. Its proof is provided in Sect. 7.

Theorem 3

For all \(2\le k<n\), we have

$$\begin{aligned} \overline{\text {dist}}_F(\mathcal {S}^{n,k},\mathcal {S}^n_+)\ge \frac{n - k}{\sqrt{ (k-1)^2\, n + n(n-1)} }. \end{aligned}$$
(5)

Notice that for small values of k the above lower bound is approximately \(\frac{n - k}{n}\), which matches the upper bound from Theorem 1. For very large values of k, i.e., \(k = n - c\) for a constant c, the above lower bound is approximately \(\frac{c}{n^{3/2}}\), which matches the upper bound of Theorem 2.

Now consider the regime where k is a constant fraction of n. While our upper bounds give \(\overline{\text {dist}}_F(\mathcal {S}^{n,k},\mathcal {S}^n_+) = O(1)\), Theorem 3 only shows that this distance is at least \(\Omega (\frac{1}{\sqrt{n}})\), leaving open the possibility that the \(k\)-PSD closure provides a sublinear approximation of the PSD cone in this regime. Unfortunately, our next lower bound shows that this is not the case: the upper bounds are tight (up to a constant) in this regime.

Theorem 4

Fix a constant \(r < \frac{1}{93}\) and let \(k = rn\). Then for all \( k\ge 2\),

$$\begin{aligned} \overline{\text {dist}}_F(\mathcal {S}^{n,k},\mathcal {S}^{n}_+)>\frac{\sqrt{r-93r^2}}{\sqrt{162r+3}}, \end{aligned}$$

which is independent of n.

For this construction we establish a connection with the Restricted Isometry Property (RIP) [12, 13], a very important notion in signal processing and recovery [14, 15]. Roughly speaking, these are matrices that approximately preserve the \(\ell _2\) norm of sparse vectors. The details of this connection and the proof of Theorem 4 are provided in Sect. 8.

2.3 Achieving the strength of \(\mathcal {S}^{n,k}\) by a polynomial number of PSD constraints

In practice one is unlikely to use the full \(k\)-PSD closure, since it involves enforcing PSD-ness of all \(\binom{n}{k} \approx \left( \frac{en}{k}\right) ^k\) principal submatrices. Is it possible to achieve the upper bounds mentioned above while enforcing PSD-ness on fewer principal submatrices?

2.3.1 Deterministic approach

For a subset \(I \subseteq [n]\) of size k and a symmetric matrix M, let \(M^I\) be the matrix where we zero out all the rows and columns of M except the ones belonging to I. As we discuss after the proof of Theorem 1, the bound in (3) is obtained by estimating the distance between a given matrix M in \(\mathcal {S}^{n,k}\) and the PSD matrix \(\widetilde{M}\), which is the average of the \(\binom{n}{k}\) (PSD) matrices \(M^I\) corresponding to every k-element subset I of [n]. Now suppose that we have a collection of \(k \times k\) principal submatrices (and the corresponding collection of k-subsets of [n]) with the property that every off-diagonal entry appears in exactly the same number of principal submatrices in the collection. It is then straightforward to show that the average of the matrices \(M^I\), where I ranges over the k-subsets in this collection, is the same as the average \(\widetilde{M}\) taken over all \(\binom{n}{k}\) principal submatrices. In other words, the upper bound (3) can be achieved by any collection of \(k \times k\) principal submatrices with the above property.

This property is formally captured by the notion of 2-designs: Recall that a collection \(\mathcal {D}\) of k-subsets of [n] (called blocks) is called a 2-design (also called a balanced incomplete block design or BIBD) if every pair of elements in [n] belongs to the same number of blocks, denoted \(\lambda \). It follows that every element of [n] belongs to the same number of blocks, denoted r. Let b be the total number of blocks. From the discussion above, the b principal submatrices corresponding to a 2-design are sufficient to give the bound (3). For background on block designs we refer to [31, Chapters 1 and 2].

It is known from the work of Wilson [34, Corollaries A and B] that a 2-design with \(b=n(n-1)\) exists for all sufficiently large values of n, although to the best of our knowledge no explicit construction is known. (Wilson’s theorem gives a much more general statement on the existence of 2-designs.) Therefore, for almost all n we can achieve the strength of bound (3) while only using \(n(n-1)\) submatrices. Fisher’s inequality states that \(b\ge n\), so we need to enforce PSD-ness of at least n minors if we use a 2-design. A 2-design is called symmetric if \(b=n\). The Bruck–Ryser–Chowla theorem [10, 16] gives necessary conditions on b, k and \(\lambda \) for which a symmetric 2-design exists, and this is certainly a limited set of parameters. Nevertheless, symmetric 2-designs may be of use in practice, as they give us the full strength of (3) while enforcing PSD-ness of only n of the \(k\times k\) submatrices. Some important examples of symmetric 2-designs are finite projective planes (symmetric 2-designs with \(\lambda =1\)), biplanes (\(\lambda =2\)) and Hadamard 2-designs.
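As a small concrete example (a sketch only), the Fano plane, i.e., the projective plane of order 2, is a symmetric 2-design with \(n = b = 7\), \(k = 3\) and \(\lambda = 1\); the following snippet checks the defining property that every pair of elements lies in exactly \(\lambda \) blocks, which by the discussion above means that averaging over these 7 principal submatrices already yields the full average \(\widetilde{M}\) and hence the bound (3).

```python
from itertools import combinations
from collections import Counter

# Fano plane: symmetric 2-design with n = b = 7, k = 3, lambda = 1
blocks = [(1, 2, 3), (1, 4, 5), (1, 6, 7), (2, 4, 6), (2, 5, 7), (3, 4, 7), (3, 5, 6)]

pair_counts = Counter(p for B in blocks for p in combinations(B, 2))
assert len(pair_counts) == 21                        # all C(7,2) pairs are covered...
assert all(c == 1 for c in pair_counts.values())     # ...each in exactly lambda = 1 block
```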

2.3.2 Randomized approach

Another way to achieve the bound in (3) is to randomly select \(k \times k\) principal submatrices. We show that the upper bound given by (3) can also be achieved within a factor of \(1+\varepsilon \) with probability at least \(1-\delta \) by randomly sampling \(O\left( \frac{n^2}{\varepsilon ^2}\ln \frac{n}{\delta }\right) \) of the \(k\times k\) principal submatrices.

Theorem 5

Let \(\mathcal {S}^n\) denote the set of real symmetric matrices. Let \(2 \le k \le n -1\). Consider \(\varepsilon ,\delta >0\) and let

$$\begin{aligned} m :=\frac{12n(n-1)^2}{\varepsilon ^2 (n-k)^2 k}\ln \frac{2n^2}{\delta } \in O\left( \frac{n^2}{\varepsilon ^2}\ln \frac{n}{\delta }\right) . \end{aligned}$$

Let \(\mathcal {I}=(I_1,\ldots ,I_m)\) be a sequence of random k-sets independently uniformly sampled from all subsets of \(\{1,\ldots ,n\}\) of size k, and define \(\mathcal {S}_{\mathcal {I}}\) as the set of symmetric matrices satisfying the PSD constraints for the principal submatrices indexed by the \(I_i\)’s, namely

$$\begin{aligned} \mathcal {S}_{\mathcal {I}} := \{M \in \mathcal {S}^n : M_{I_i} \text { is PSD},~\forall i \in [m]\}. \end{aligned}$$

Then with probability at least \(1 - \delta \) we have

$$\begin{aligned} \overline{\text {dist}}_F(\mathcal {S}_{\mathcal {I}},\mathcal {S}^n_+) \le (1+\varepsilon )\frac{n -k}{ n + k - 2}. \end{aligned}$$

Remark 1

Since the zero matrix is PSD, by definition we always have \(\overline{\text {dist}}_F(\mathcal {S}_{\mathcal {I}},\mathcal {S}^n_+) \le 1\). So in order for the bound given by Theorem 5 to be of interest, we need \((1+\varepsilon )\frac{n -k}{ n + k - 2} \le 1\), which means \(\varepsilon \le \frac{2k-2}{n-k}\). Plugging this into m, we see that we need at least \(\frac{3n(n-1)^2}{k(k-1)^2}\ln \frac{2n^2}{\delta }=\tilde{O}(\frac{n^3}{k^3})\) samples to obtain a nontrivial upper bound on the distance.
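For a sense of scale, a small sketch (illustrative parameter values of our choosing) that evaluates the sample size m of Theorem 5 and compares it with the total number \(\binom{n}{k}\) of principal submatrices:

```python
import math

def num_samples(n, k, eps, delta):
    """Sample size m from Theorem 5."""
    return 12 * n * (n - 1)**2 / (eps**2 * (n - k)**2 * k) * math.log(2 * n**2 / delta)

n, k, eps, delta = 200, 100, 0.5, 0.01
print(math.ceil(num_samples(n, k, eps, delta)))   # ~6000 sampled k x k submatrices suffice
print(math.comb(n, k))                            # versus ~9e58 principal submatrices in total
```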

3 Conclusion and open questions

In this paper, we have been able to provide various upper and lower bounds on \(\overline{\text {dist}}_F(\mathcal {S}^{n,k},\mathcal {S}^n_+)\). In two regimes our bounds on \(\overline{\text {dist}}_F(\mathcal {S}^{n,k},\mathcal {S}^n_+)\) are quite tight. These are: (i) k is small, i.e., \(2 \le k \le \sqrt{n}\) and (ii) k is quite large, i.e., \(k = n - c\) where c is a constant. These are shown in the first two rows of Table 1. When k/n is a constant, we have also established upper and lower bounds on \(\overline{\text {dist}}_F(\mathcal {S}^{n,k},\mathcal {S}^n_+)\) that are independent of n. However, our upper and lower bounds are not quite close when viewed as a function of the ratio k/n. Improving these bounds as a function of this ratio is an important open question.

Table 1 Bounds on \(\overline{\text {dist}}_F(\mathcal {S}^{n,k},\mathcal {S}^n_+)\) for some regimes

We also showed that, instead of selecting all minors, a polynomial number of randomly selected minors realizes the upper bound (3) within a factor of \(1+\varepsilon \) with high probability. An important question in this direction is to deterministically and strategically select principal submatrices on which to impose PSD-ness, so as to obtain the best possible bound for (2). As discussed earlier, such questions are related to exploring 2-designs and perhaps further generalizations of results presented in [25].

4 Notation and preliminaries

The support of a vector is the set of its non-zero coordinates, and we call a vector k-sparse if its support has size at most k. We will use [n] to denote the set \(\{1,\ldots ,n\}\). A k-set of a set A is a subset \(B\subset A\) with \(|B|=k\). We also use \(\binom{[n]}{k}\) to denote the set of all k-sets of [n]. Given any vector \(x\in \mathbb {R}^n\) and a k-set \(J\subset [n]\) we define \(x_J\in \mathbb {R}^k\) as the vector where we remove the coordinates whose indices are not in J. Similarly, for a matrix \(M \in \mathbb {R}^{n \times n}\) and a k-set \(J\subset [n]\), we denote the principal submatrix of M corresponding to the rows and columns in J by \(M_J\).

4.1 Linear algebra

Given any \(n\times n\) matrix \(A=[a_{ij}]\) its trace (the sum of its diagonal entries) is denoted as \({{\,\mathrm{Tr}\,}}(A)\). Recall that \({{\,\mathrm{Tr}\,}}(A)\) is also equal to the sum of all eigenvalues of A, counting multiplicities. Given a symmetric matrix A, we use \(\lambda _1(A) \ge \lambda _2(A) \ge \dots \) to denote its eigenvalues in non-increasing order.

We remind the reader that a real symmetric \(n\times n\) matrix M is said to be PSD if \(x^{\top } Mx\ge 0\) for all \(x\in \mathbb {R}^n\), or, equivalently, if all of its eigenvalues are non-negative. We also write \(A\succeq B\) if \(A-B\) is PSD.

We next present Cauchy’s Interlace Theorem, which will be important for obtaining an upper bound on the number of negative eigenvalues of matrices in \(\mathcal {S}^{n,k}\). A proof can be found in [23].

Theorem 6

(Cauchy’s Interlace Theorem) Consider an \(n \times n\) symmetric matrix A and let \(A_J\) be any of its \(k\times k\) principal submatrices. Then for all \(1\le i\le k\),

$$\begin{aligned} \lambda _{n-k+i}(A)\le \lambda _i (A_J)\le \lambda _i (A). \end{aligned}$$
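A quick numerical sanity check of these inequalities (a sketch with an arbitrary random symmetric matrix and an arbitrary index set of our choosing):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 6, 4
A = rng.standard_normal((n, n)); A = (A + A.T) / 2         # random symmetric matrix
J = [0, 2, 3, 5]                                           # some k-subset of indices
lam  = np.sort(np.linalg.eigvalsh(A))[::-1]                # eigenvalues of A,   non-increasing
lamJ = np.sort(np.linalg.eigvalsh(A[np.ix_(J, J)]))[::-1]  # eigenvalues of A_J, non-increasing
for i in range(k):
    # 1-based statement: lambda_{n-k+i}(A) <= lambda_i(A_J) <= lambda_i(A); here 0-based
    assert lam[n - k + i] <= lamJ[i] <= lam[i]
```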

4.2 Probability

The following concentration inequalities will be used throughout; they can be found in [9].

Theorem 7

(Markov’s inequality) Let X be a non-negative random variable. Then for all \(a \ge 1\),

$$\begin{aligned} \Pr (X\ge a\mathbb {E}(X))\le \frac{1}{a}. \end{aligned}$$

Theorem 8

(Chebyshev’s inequality) Let X be a random variable with finite mean and variance. Then for all \(a>0\),

$$\begin{aligned} \Pr (|X - \mathbb {E}(X)| \ge a) \le \frac{{{\,\mathrm{Var}\,}}(X)}{a^2}. \end{aligned}$$

Theorem 9

(Chernoff bound) Let \(X_1,\ldots , X_n\) be i.i.d. Bernoulli random variables, with \(\Pr (X_i=1)=\mathbb {E}(X_i)=p\) for all i. Let \(X=\sum _{i=1}^n X_i\) and \(\mu =\mathbb {E}(X)=np\). Then for any \(0<\delta <1\),

$$\begin{aligned} \Pr \big (|X-\mu |>\delta \mu \big )\le 2\exp \bigg (-\frac{\mu \delta ^2}{3}\bigg ). \end{aligned}$$

5 Proof of Theorem 1: averaging operator

Consider a matrix M in the \(k\)-PSD closure \(\mathcal {S}^{n,k}\) with \(\Vert M\Vert _F = 1\). To upper bound its distance to the PSD cone we transform M into a “close by” PSD matrix \(\widetilde{M}\).

The idea is clear: since all \(k \times k\) principal submatrices of M are PSD, we define \(\widetilde{M}\) as the average of these minors. More precisely, for a set \(I \subseteq [n]\) of k indices, let \(M^I\) be the matrix where we zero out all the rows and columns of M except those indexed by indices in I; then \(\widetilde{M}\) is the average of all such matrices:

$$\begin{aligned} \widetilde{M} := \frac{1}{\binom{n}{k}} \sum _{I \in \binom{[n]}{k}} M^I. \end{aligned}$$

Notice that indeed since the principal submatrix \(M_I\) is PSD, \(M^I\) is PSD as well: for all vectors \(x \in \mathbb {R}^n\), \(x^\top M^I x = x_I^\top M_I x_I \ge 0\). Since the average of PSD matrices is also PSD, we have that \(\widetilde{M}\) is PSD, as desired.

Moreover, notice that the entries of \(\widetilde{M}\) are just scalings of the entries of M, depending on how many terms of the average are not zeroed out:

  1. 1.

    Diagonal terms: These are scaled by the factor

    $$\begin{aligned} \frac{\binom{n}{k} - \binom{n -1}{k} }{\binom{n}{k}} = \frac{k}{n}, \end{aligned}$$

    that is, \(\widetilde{M}_{ii} = \frac{k}{n}M_{ii}\) for all \(i \in [n]\).

  2. 2.

    Off-diagonal terms: These are scaled by the factor

    $$\begin{aligned} \frac{\binom{n}{k} - \big (2\binom{n - 1}{k} - \binom{n - 2}{k}\big ) }{\binom{n}{k}} = \frac{k(k-1)}{n(n-1)}, \end{aligned}$$

    that is, \(\widetilde{M}_{ij} = \frac{k(k-1)}{n(n-1)}M_{ij}\) for all \(i \ne j\).

To even out these factors, we define the scaling \(\alpha := \frac{2n(n-1)}{k (n + k -2)}\) and consider \(\alpha \widetilde{M}\). Now we have that the difference between M and \(\alpha \widetilde{M}\) is a uniform scaling (up to sign) of M itself: \((M - \alpha \widetilde{M})_{ii} = (1 - \alpha \frac{k}{n})\, M_{ii} = -\frac{n -k}{ n + k - 2}\, M_{ii}\), and \((M - \alpha \widetilde{M})_{ij} = (1-\alpha \frac{k(k-1)}{n(n -1)})\,M_{ij} = \frac{n -k}{ n + k - 2}\, M_{ij}\) for \(i\ne j\). Therefore, we have

$$\begin{aligned} dist _{F}(M, \mathcal {S}^n_+) ~\le ~ \Vert M - \alpha \widetilde{M}\Vert _F ~=~ \frac{n -k}{ n + k - 2}\,\Vert M\Vert _F ~=~ \frac{n -k}{ n + k - 2}. \end{aligned}$$

Since this holds for every unit-norm matrix \(M \in \mathcal {S}^{n,k}\), this upper bound also holds for \(\overline{\text {dist}}_F(\mathcal {S}^{n,k},\mathcal {S}^n_+)\). This concludes the proof of Theorem 1.
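The above entrywise scalings, and the resulting identity \(\Vert M - \alpha \widetilde{M}\Vert _F = \frac{n-k}{n+k-2}\Vert M\Vert _F\), are easy to verify numerically for small n; a minimal sketch (function names are ours):

```python
import itertools
import numpy as np

def full_average(M, k):
    """Average of the matrices M^I over all k-subsets I (rows/columns outside I zeroed out)."""
    n = M.shape[0]
    subsets = list(itertools.combinations(range(n), k))
    Mtilde = np.zeros_like(M)
    for I in subsets:
        MI = np.zeros_like(M)
        MI[np.ix_(I, I)] = M[np.ix_(I, I)]
        Mtilde += MI
    return Mtilde / len(subsets)

rng = np.random.default_rng(1)
n, k = 7, 3
M = rng.standard_normal((n, n)); M = (M + M.T) / 2
alpha = 2 * n * (n - 1) / (k * (n + k - 2))
lhs = np.linalg.norm(M - alpha * full_average(M, k), 'fro')
rhs = (n - k) / (n + k - 2) * np.linalg.norm(M, 'fro')
assert np.isclose(lhs, rhs)
```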

6 Proof of Theorem 2: randomized sparsification

Let \(M \in \mathcal {S}^{n,k}\) be a matrix in the \(k\)-PSD closure with \(\Vert M\Vert _F = 1\). To prove Theorem 2, we show that the Frobenius distance from M to the PSD cone is at most \(O\big ((\frac{n-k}{n})^{3/2}\big )\). We assume that M is not PSD, otherwise we are done, and hence it has a negative eigenvalue. We write M in terms of its eigendecomposition: Let \(-\lambda _1 \le -\lambda _2 \le \ldots \le -\lambda _\ell \) and \(\mu _1,\ldots ,\mu _{n-\ell }\) be the negative and non-negative eigenvalues of M, and let \(v^1,\ldots ,v^{\ell } \in \mathbb {R}^n\) and \(w^1,\ldots ,w^{n-\ell } \in \mathbb {R}^n\) be orthonormal eigenvectors relative to these eigenvalues. Thus

$$\begin{aligned} M=- \sum _{i \le \ell } \lambda _i v^i (v^i)^\top + \sum _{i \le n-\ell } \mu _i w^i (w^i)^\top . \end{aligned}$$
(6)

Notice that since \(\Vert M\Vert _F = 1\) we have

$$\begin{aligned} \sum _{i \le \ell } \lambda _i^2 + \sum _{i \le n-\ell } \mu _i^2 = 1. \end{aligned}$$
(7)

We first relate the distance from M to the PSD cone to its negative eigenvalues.

6.1 Distance to PSD cone and negative eigenvalues

We start with the following general observation.

Proposition 1

Suppose M is a symmetric \(n\times n\) matrix with \(\ell \le n\) negative eigenvalues. Let \(-\lambda _1\le -\lambda _2\le \cdots \le -\lambda _\ell < 0\) and \(\mu _1,\ldots ,\mu _{n-\ell }\ge 0\) be the negative and non-negative eigenvalues of M. Then

$$\begin{aligned} dist _F(M,\mathcal {S}^n_+)=\sqrt{\sum _{i=1}^\ell \lambda _i^2}. \end{aligned}$$

Proof

Let V be the orthonormal matrix that diagonalizes M, i.e.,

$$\begin{aligned} V^{\top } MV=D :=diag (-\lambda _1,\ldots ,-\lambda _\ell ,\mu _1,\ldots ,\mu _{n-\ell }). \end{aligned}$$

It is well-known that the Frobenius norm is invariant under orthonormal transformation. Therefore, for any \(N \in \mathcal {S}^n_+\) we have

$$\begin{aligned} dist _F(M,N)~=~\Vert M-N\Vert _F~=~\Vert V^{\top }(M-N)V\Vert _F ~=~ dist _F(D,\,V^{\top }NV). \end{aligned}$$

Since \(N\in \mathcal {S}^n_+\) iff \(V^{\top }NV\in \mathcal {S}^n_+\), we see that \(dist _F(M,\mathcal {S}^n_+)=dist _F(D,\mathcal {S}^n_+)\). So we only need to show that the latter is \(\sqrt{\sum _{i=1}^\ell \lambda _i^2}\).

Let \(D_+=diag (0,\ldots ,0,\mu _1,\ldots ,\mu _{n-\ell })\) be obtained from D by making all negative eigenvalues zero. Then

$$\begin{aligned} \Vert D-D_+\Vert _F=\sqrt{\sum _{i=1}^n\sum _{j=1}^n (D-D_+)_{ij}^2}=\sqrt{\sum _{i=1}^\ell \lambda _i^2}. \end{aligned}$$

It then suffices to show that \(D_+\) is the PSD matrix closest to D. For that, let N be any PSD matrix. Then \(N_{ii}=e_i^{\top } N e_i\ge 0\) for all i, where \(e_i\) is the standard unit vector in the \(i^{th}\) coordinate. Thus we have

$$\begin{aligned} \Vert D-N\Vert _F&~=~\sqrt{\sum _{i=1}^\ell (N_{ii}+\lambda _{i})^2+\sum _{i=\ell +1}^{n} (\mu _{i-\ell }-N_{ii})^2+\sum _{i=1}^n\sum _{j\ne i}N_{ij}^2} ~\ge ~ \sqrt{\sum _{i=1}^\ell \lambda _i^2}. \end{aligned}$$

This concludes the proof. \(\square \)

In addition, Cauchy’s Interlace Theorem gives an upper bound on the number of negative eigenvalues of matrices in \(\mathcal {S}^{n,k}\).

Proposition 2

Any \(A\in \mathcal {S}^{n,k}\) has at most \(n-k\) negative eigenvalues.

Proof

Let J be any k-subset of [n]. Since \(A\in \mathcal {S}^{n,k}\) we have that \(A_J\) is PSD, so in particular \(\lambda _k({A_J})\ge 0\). Thus, by Theorem 6 the original matrix A also has \(\lambda _k(A)\ge 0\), and so the first k eigenvalues of A are nonnegative. \(\square \)

Using Propositions 1 and 2, given any symmetric matrix \(M\in \mathcal {S}^{n,k}\) we can get an upper bound on \(dist _F(M,\mathcal {S}^n_+)\) using its smallest eigenvalue.

Proposition 3

Consider a matrix \(M\in \mathcal {S}^{n,k}\) with smallest eigenvalue \(-\lambda _1 < 0\). Then

$$\begin{aligned} dist _F(M,\mathcal {S}^n_+)\le \sqrt{n-k} \cdot \lambda _1. \end{aligned}$$

Proof

Letting \(-\lambda _1,\ldots ,-\lambda _\ell \) be the negative eigenvalues of M, we have from Proposition 1 that \(dist _F(M,\mathcal {S}^n_+)=\sqrt{\sum _{i=1}^\ell \lambda _i^2} \le \sqrt{\ell }\cdot \lambda _1\), since \(-\lambda _1\) is the smallest eigenvalue. Since \(\ell \le n-k\), because of Proposition 2, we obtain the result. \(\square \)

6.2 Upper bounding \(\lambda _1\)

Given the previous proposition, fix throughout this section a (non PSD) matrix \(M \in \mathcal {S}^{n,k}\) with smallest eigenvalue \(- \lambda _1 < 0\). Our goal is to upper bound \(\lambda _1\).

The first observation is the following: Consider a symmetric matrix A and a set of coordinates \(I \subseteq [n]\), and notice that for every vector \(x \in \mathbb {R}^n\) supported in I we have \(x^\top A x = x_I^\top A_I x_I\). Thus, the principal submatrix \(A_I\) is PSD iff for all vectors \(x \in \mathbb {R}^n\) supported in I we have \(x^\top A x \ge 0\). Applying this to all principal submatrices gives a characterization of the \(k\)-PSD closure via k-sparse test vectors.

Observation 1

A symmetric real matrix A belongs to \(\mathcal {S}^{n,k}\) iff for all k-sparse vectors \(x \in \mathbb {R}^n\) we have \(x^\top A x \ge 0\).

Using this characterization, and the fact that \(M \in \mathcal {S}^{n,k}\), the idea to upper bound \(\lambda _1\) is to find a vector \(\bar{v}\) with the following properties (informally):

  1. 1.

    \(\bar{v}\) is k-sparse

  2. 2.

    \(\bar{v}\) is similar to the eigenvector \(v^1\) relative to \(\lambda _1\)

  3. 3.

    \(\bar{v}\) is almost orthogonal to the eigenvectors of M relative to its non-negative eigenvalues.

Such a vector gives a bound on \(\lambda _1\) because, using the eigendecomposition (6),

$$\begin{aligned} 0 {\mathop {\le }\limits ^{\text {Obs}(1)}} \bar{v}^\top M \bar{v} = - \sum _{i \le \ell } \lambda _i \, \langle v^i, \bar{v}\rangle ^2 + \sum _{i \le n-\ell } \mu _i \, \langle w^i, \bar{v}\rangle ^2 \lesssim -\lambda _1 + \text {``small error''}, \end{aligned}$$

and hence \(\lambda _1 \lesssim \) “small error”.

We show the existence of such a k-sparse vector \(\bar{v}\) via the probabilistic method, by considering a random sparsification of \(v^1\). More precisely, define the random vector \(V \in \mathbb {R}^n\) as follows: in hindsight, set \(p:= 1 - \frac{2 (n-k)}{n}\), which is always at least \(\frac{1}{2}\) due to our assumption in Theorem 2 that \(k\ge \frac{3n}{4}\), and let V have independent entries satisfying

$$\begin{aligned} V_i={\left\{ \begin{array}{ll} v^1_i &{} \text { if }(v^1_i)^2>2/n, \\ \frac{v^1_i}{p} \text { with probability } p &{} \text { if } (v^1_i)^2\le \frac{2}{n}, \\ 0 \text { with probability } 1-p &{} \text { if } (v^1_i)^2\le \frac{2}{n}. \end{array}\right. } \end{aligned}$$
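In code, this sparsification reads as follows (a sketch with hypothetical names; it only implements the random vector V, not the rest of the argument):

```python
import numpy as np

def sparsify(v, k, rng):
    """Random sparsification of a unit vector v: keep large entries (v_i^2 > 2/n);
    keep small entries with probability p = 1 - 2(n-k)/n, rescaled by 1/p so that E[V] = v."""
    n = len(v)
    p = 1 - 2 * (n - k) / n                 # p >= 1/2 under the assumption k >= 3n/4
    V = v.copy()
    small = v**2 <= 2 / n
    keep = rng.random(n) < p
    V[small & keep] = v[small & keep] / p
    V[small & ~keep] = 0.0
    return V

rng = np.random.default_rng(2)
n, k = 100, 80
v = rng.standard_normal(n); v /= np.linalg.norm(v)
V = sparsify(v, k, rng)
print(np.count_nonzero(V), V @ v)           # sparsity and overlap <V, v>, cf. Lemmas 1 and 2
```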

The choice of p guarantees that V is k-sparse with good probability.

Lemma 1

V is k-sparse with probability at least \(\frac{1}{2}\).

Proof

Let m be the number of entries in \(v^1\) with \((v^1_i)^2\le \frac{2}{n}\). Since \(\Vert v^1\Vert _2=1\) we have \(m\ge \frac{n}{2}\). By the randomized construction, the number of coordinates of value 0 in V is lower bounded by a binomial random variable B with m trials and success probability \(1-p\). Using the definition of p we have the expectation

$$\begin{aligned} \mathbb {E}B = m (1-p) \ge \frac{n}{2} \cdot \frac{2(n-k)}{n} = n-k; \end{aligned}$$

since \(n-k\) is an integer we have \(\lfloor \mathbb {E}B \rfloor \ge n-k\). Moreover, it is known that the median of a binomial distribution is at least the expectation rounded down to the nearest integer [24], hence \(\Pr (B \ge \lfloor \mathbb {E}B \rfloor ) \ge \frac{1}{2}\). Chaining these observations we have

$$\begin{aligned} \Pr (V \text { is } k\text {-sparse}) \ge \Pr (B\ge n-k)\ge \Pr (B \ge \lfloor \mathbb {E}B \rfloor ) \ge \frac{1}{2}. \end{aligned}$$

In other words, our randomized vector V is k-sparse with probability at least \(\frac{1}{2}\). \(\square \)

Next, we show that with good probability V and \(v^1\) are in a “similar direction”.

Lemma 2

With probability \(> 1 - \frac{1}{6}\) we have \(\langle V, v^1\rangle \ge \frac{1}{2}\).

Proof

To simplify the notation we use v to denote \(v^1\). By definition of V, for each coordinate we have \(\mathbb {E}[V_i v_i] = v_i^2\), and hence \(\mathbb {E}\langle V, v\rangle = \Vert v\Vert _2^2 = 1\).

In addition, let I be the set of coordinates i where \(v_i^2 \le \frac{2}{n}\). Then for \(i \notin I\) we have \({{\,\mathrm{Var}\,}}(V_i v_i) = 0\), and for \(i \in I\) we have \({{\,\mathrm{Var}\,}}(V_i v_i) = v_i^2 {{\,\mathrm{Var}\,}}(V_i) \le \frac{2}{n} {{\,\mathrm{Var}\,}}(V_i)\). Moreover, since \(p \ge \frac{1}{2}\) (implied by the assumption \(k \ge \frac{3n}{4}\)) we have by construction \(|V_i| \le \frac{|v_i|}{p} \le 2 |v_i|\), and hence

$$\begin{aligned} {{\,\mathrm{Var}\,}}(V_i) \le \mathbb {E}V_i^2 \le 2 |v_i| \,\mathbb {E}|V_i| = 2 v_i^2. \end{aligned}$$

So using the independence of the coordinates of V we have

$$\begin{aligned} {{\,\mathrm{Var}\,}}\langle V, v\rangle = \sum _{i \in I} {{\,\mathrm{Var}\,}}(V_i v_i) \le \frac{4}{n}\, \sum _i v_i^2 = \frac{4}{n}. \end{aligned}$$

Then by Chebyshev’s inequality we obtain that

$$\begin{aligned} \Pr \left( \langle V, v\rangle \le \frac{1}{2}\right) \le \Pr \left( |\langle V, v\rangle - 1| \ge \frac{1}{2}\right) \le \frac{16}{n}. \end{aligned}$$

Since \(n \ge 97\), this proves the lemma. \(\square \)

Finally, we show that V is almost orthogonal to the eigenvectors of M relative to non-negative eigenvalues.

Lemma 3

With probability \(\ge 1 -\frac{1}{3}\) we have \(\sum _{i \le n - \ell } \mu _i\,\langle V, w^i\rangle ^2 \le \frac{24(n-k)}{n^{3/2}}\).

Proof

Again we use v to denote \(v^1\). Define the matrix \(\overline{M} := \sum _{i \le n - \ell } \mu _i w^i (w^i)^\top \), so we want to upper bound \(V^\top \overline{M} V\). Moreover, let \(\Delta =V-v\); since v and the \(w^i\)’s are orthogonal we have \(\overline{M} v = 0\) and hence

$$\begin{aligned} V^\top \overline{M} V = v^\top \overline{M} v + 2 \Delta ^\top \overline{M} v + \Delta ^\top \overline{M} \Delta = \Delta ^\top \overline{M} \Delta , \end{aligned}$$
(8)

so it suffices to upper bound the right-hand side.

For that, notice that \(\Delta \) has independent entries with the form

$$\begin{aligned} \Delta _i={\left\{ \begin{array}{ll} 0 &{} \text { if }v_i^2>\frac{2}{n}, \\ \frac{v_i(1-p)}{p} \text { with probability } p &{} \text { if } v_i^2\le \frac{2}{n}, \\ -v_i \text { with probability } 1-p &{} \text { if } v_i^2\le \frac{2}{n}. \end{array}\right. } \end{aligned}$$

So \(\mathbb {E}[\Delta _i \Delta _j] = \mathbb {E}\Delta _i \mathbb {E}\Delta _j = 0\) for all \(i\ne j\). In addition \(\mathbb {E}\Delta _i^2 = 0\) for indices where \(v^2_i > \frac{2}{n}\), and for other indices

$$\begin{aligned} \mathbb {E}\Delta _i^2 = \frac{v_i^2 (1-p)^2}{p} + v_i^2 (1-p) = v_i^2 \frac{1-p}{p} \le \frac{2(1-p)}{np}. \end{aligned}$$

Using these we can expand \(\mathbb {E}[\Delta ^T \overline{M}\Delta ]\) as

$$\begin{aligned} \mathbb {E}[\Delta ^T \overline{M}\Delta ] = \mathbb {E}\bigg [\sum _{i,j} \overline{M}_{ij}\Delta _i \Delta _j\bigg ] = \sum _{i,j} \overline{M}_{ij} \,\mathbb {E}[\Delta _i\Delta _j]&= \sum _{i=1}^{n} \overline{M}_{ii}\,\mathbb {E}\Delta _i^2 \nonumber \\&\le \frac{2(1-p)}{np}{{\,\mathrm{Tr}\,}}(\overline{M}) \nonumber \\&= \frac{4 (n-k)}{n^2 p}{{\,\mathrm{Tr}\,}}(\overline{M}), \end{aligned}$$
(9)

where the last equation uses the definition of p.

Since the \(\mu _i\)’s are the eigenvalues of \(\overline{M}\), we can bound the trace as

$$\begin{aligned} {{\,\mathrm{Tr}\,}}(\overline{M}) = \sum _{i \le n-\ell } \mu _i \le \sqrt{n - \ell } \cdot \sqrt{\sum _{i \le n - \ell } \mu _i^2} \le \sqrt{n - \ell } \le \sqrt{n}, \end{aligned}$$

where the first inequality follows from the well-known inequality that \(\Vert u\Vert _1 \le \sqrt{n}\Vert u\Vert _2\) for all \(u \in \mathbb {R}^n\) and the second inequality comes from (7). Further using the assumption that \(p \ge \frac{1}{2}\), we get from (9) that

$$\begin{aligned} \mathbb {E}[\Delta ^T \overline{M}\Delta ] \le \frac{8 (n-k)}{n^{3/2}}. \end{aligned}$$

Finally, since all the eigenvalues \(\mu _i\) of \(\overline{M}\) are non-negative, this matrix is PSD and hence the random variable \(\Delta ^\top \overline{M} \Delta \) is non-negative. Markov’s inequality then gives that

$$\begin{aligned} \Pr \bigg (\Delta ^\top \overline{M} \Delta \ge \frac{24 (n-k)}{n^{3/2}} \bigg ) \le \Pr \bigg (\Delta ^\top \overline{M} \Delta \ge 3\, \mathbb {E}[\Delta ^\top \overline{M} \Delta ] \bigg ) \le \frac{1}{3}. \end{aligned}$$

This concludes the proof of the lemma. \(\square \)

With these properties of V we can finally upper bound the absolute value of the most negative eigenvalue of M.

Lemma 4

\(\lambda _1 \le \frac{96(n-k)}{n^{3/2}}.\)

Proof

We take the union bound over Lemmas 1–3. In other words, the probability that V fails at least one of the properties in the above three lemmas is strictly less than \(\frac{1}{2}+\frac{1}{6}+\frac{1}{3}=1\). Therefore, with strictly positive probability V satisfies all these properties. That is, there is a vector \(\bar{v} \in \mathbb {R}^n\) that is k-sparse, has \(\langle \bar{v}, v^1\rangle \ge \frac{1}{2}\) and \(\sum _{i \le n - \ell } \mu _i\,\langle \bar{v}, w^i\rangle ^2 \le \frac{24(n-k)}{n^{3/2}}\). Then using Observation 1 and the eigendecomposition (6)

$$\begin{aligned} 0 {\mathop {\le }\limits ^{\text {Obs}(1)}} \bar{v}^\top M \bar{v} = - \sum _{i \le \ell } \lambda _i \, \langle \bar{v}, v^i\rangle ^2 + \sum _{i \le n-\ell } \mu _i \, \langle \bar{v}, w^i\rangle ^2 \le -\frac{\lambda _1}{4} + \frac{24(n-k)}{n^{3/2}}. \end{aligned}$$

Reorganizing the terms proves the lemma. \(\square \)

6.3 Concluding the proof of Theorem 2

Plugging the upper bound on \(\lambda _1\) from Lemma 4 into Proposition 3 we obtain that

$$\begin{aligned} dist _F(M, \mathcal {S}^n_+) \le 96\, \bigg (\frac{n-k}{n}\bigg )^{3/2}. \end{aligned}$$

Since this holds for all unit-norm \(M \in \mathcal {S}^{n,k}\), we have that \(\overline{\text {dist}}_F(\mathcal {S}^{n,k},\mathcal {S}^n_+)\) also satisfies the same upper bound. This concludes the proof.

7 Proof of Theorem 3: a specific family of matrices in \(\mathcal {S}^{n,k}\)

To prove the lower bounds on \(\overline{\text {dist}}_F(\mathcal {S}^{n,k},\mathcal {S}^n_+)\) we construct specific families of matrices in \(\mathcal {S}^{n,k}\) with Frobenius norm 1, and then lower bound their distance to the PSD cone.

For the first lower bound in Theorem 3, we consider the construction where all diagonal entries are the same, and all off-diagonal ones are also the same. More precisely, given scalars \(a,b \ge 0\) we define the matrix

$$\begin{aligned} G(a,b, n):= (a + b) I_n -a \mathbf{1} \mathbf{1} ^{\top }, \end{aligned}$$
(10)

where \(I_n\) is the \(n \times n\) identity matrix, and \(\mathbf{1} \) is the column vector with all entries equal to 1. In other words, all diagonal entries of G(a, b, n) are b, and all off-diagonal ones are \(-a\).

The parameter a will control how far this matrix is from PSD: for \(a=0\) it is PSD, and if a is much bigger than b it should be “far” from the PSD cone. We then directly compute its eigenvalues, as well as its Frobenius distance to the PSD cone.

Proposition 4

The eigenvalues of G(a, b, n) are \(b-(n-1)a\) with multiplicity 1, and \(b+a\) with multiplicity \(n-1\).

Proof

Let \(\{v^1,\ldots ,v^n\}\) be an orthonormal basis of \(\mathbb {R}^n\) such that \(\sqrt{n}v^1=\mathbf{1} \). Then we can rewrite G(a, b, n) as

$$\begin{aligned} G(a,b,n)&= (a+b)\sum _{i=1}^n v^i (v^i)^{\top }-nav^1 (v^1)^{\top }\\&= \big (b-(n-1)a\big )v^1 (v^1)^{\top }+(a+b)\sum _{i=2}^n v^i (v^i)^{\top }. \end{aligned}$$

This gives a spectral decomposition of G(a, b, n), so it has the aforementioned set of eigenvalues. \(\square \)

The next two corollaries immediately follow from Proposition 4.

Corollary 1

If \(a,b \ge 0\), then \(G(a, b, n) \in \mathcal {S}^{n,k}\) iff \(b \ge (k -1)a\). In particular, since \(\S ^{n,n}=\mathcal {S}^n_+\), \(G(a,b,n)\in \mathcal {S}^n_+\) iff \(b\ge (n-1)a\).

Proof

Note that every \(k \times k\) principal submatrix of G(a, b, n) is just the matrix G(a, b, k), which belongs to \(\mathcal {S}^k_+\) iff \(b-(k-1)a\ge 0\), since \(a,b\ge 0\). \(\square \)

Corollary 2

If \(a,b \ge 0\), then \(dist _F(G(a,b,n), \mathcal {S}^n_+) = \max \{(n -1) a - b, 0\}\).

Proof

If \(b\ge (n-1)a\), then \(G(a,b,n)\in \mathcal {S}^n_+\) by Corollary 1, so \(dist _F(G(a,b,n), \mathcal {S}^n_+)=0\) by definition.

If \(b<(n-1)a\), then G(a, b, n) has only one negative eigenvalue \(b-(n-1)a\). Thus using Proposition 1 we get \(dist _F(G(a,b,n), \mathcal {S}^n_+)=(n-1)a-b\). \(\square \)

To conclude the proof of Theorem 3, let \(\bar{a} = \frac{1}{\sqrt{ ( k - 1)^2n + n(n -1)} }\) and \(\bar{b} = (k-1) \bar{a}\). From Corollary 1 we know that \(G(\bar{a}, \bar{b}, n)\) belongs to the \(k\)-PSD closure \(\mathcal {S}^{n,k}\), and it is easy to check that it has Frobenius norm 1. Then using Corollary 2 we get

$$\begin{aligned} \overline{\text {dist}}_F(\mathcal {S}^{n,k},\mathcal {S}^n_+) \ge dist _F(G(\bar{a} , \bar{b}, n), \mathcal {S}^n_+) = (n-k) \bar{a} = \frac{n - k}{\sqrt{ ( k - 1)^2n + n(n -1)}}. \end{aligned}$$

This concludes the proof.
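For completeness, a short numerical check of this construction (illustration only):

```python
import numpy as np

def G(a, b, n):
    """The matrix G(a, b, n): diagonal entries b, off-diagonal entries -a."""
    return (a + b) * np.eye(n) - a * np.ones((n, n))

n, k = 10, 4
a_bar = 1.0 / np.sqrt((k - 1)**2 * n + n * (n - 1))
b_bar = (k - 1) * a_bar
M = G(a_bar, b_bar, n)
print(np.isclose(np.linalg.norm(M, 'fro'), 1.0))   # unit Frobenius norm
print(np.linalg.eigvalsh(M)[:2])                   # smallest eigenvalue b-(n-1)a = -(n-k)a, then b+a
print((n - k) * a_bar)                             # = dist_F(M, S^n_+), the lower bound in (5)
```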

8 Proof of Theorem 4: RIP construction when \(k= O(n)\)

Again, to prove the lower bound \(\overline{\text {dist}}_F(\mathcal {S}^{n,k}, \mathcal {S}^n_+) \ge cst\) for a constant cst we will construct (randomly) a unit-norm matrix M in the \(k\)-PSD closure \(\mathcal {S}^{n,k}\) that has distance at least cst from the PSD cone \(\mathcal {S}^n_+\); we will use its negative eigenvalues to assess this distance, via Proposition 1.

Motivation for the connection with the RIP property. Before presenting the actual construction, we give the high-level idea of how the RIP property (Definition 2 below) fits into the picture. For simplicity, assume \(k = n/2\). (The actual proof will not use this value of k.) The idea is to construct a matrix M where about half of its eigenvalues take the negative value \(-\frac{1}{\sqrt{n}}\), with orthonormal eigenvectors \(v^1,v^2,\ldots , v^{n/2}\), and the rest take the positive value \(\frac{1}{\sqrt{n}}\), with orthonormal eigenvectors \(w^1,w^2,\ldots , w^{n/2}\). This normalization makes \(\Vert M\Vert _F = \Theta (1)\), so the reader can just think of M as being unit-norm, as desired. In addition, from Proposition 1 this matrix is far from the PSD cone: \(dist _F(M,\mathcal {S}^n_+) \gtrsim \sqrt{\left( \frac{1}{\sqrt{n}}\right) ^2 \cdot \frac{n}{2}} = cst\). So we only need to guarantee that M belongs to the \(k\)-PSD closure; for that we need to carefully choose its positive eigenspace, namely the eigenvectors \(w^1,w^2,\ldots , w^{n/2}\).

Recall that from Observation 1, M belongs to the \(k\)-PSD closure iff \(x^\top M x \ge 0\) for all k-sparse vectors \(x \in \mathbb {R}^n\). Letting V be the matrix with rows \(v^1,v^2,\ldots ,\) and W the matrix with rows \(w^1,w^2,\ldots \), the quadratic form \(x^\top M x\) is

$$\begin{aligned} x^\top M x = - \frac{1}{\sqrt{n}} \sum _i \langle v^i, x\rangle ^2 + \frac{1}{\sqrt{n}} \sum _i \langle w^i, x\rangle ^2 = -\frac{1}{\sqrt{n}} \Vert Vx\Vert _2^2 + \frac{1}{\sqrt{n}} \Vert Wx\Vert _2^2. \end{aligned}$$

Since the rows of V are orthonormal we have \(\Vert Vx\Vert _2^2 \le \Vert x\Vert _2^2\). Therefore, if we could construct the matrix W so that for all k-sparse vectors \(x \in \mathbb {R}^n\) we had \(\Vert Wx\Vert _2^2 \approx \Vert x\Vert _2^2\), we would be in good shape, since we would have

$$\begin{aligned} x^\top M x \gtrsim - \frac{1}{\sqrt{n}} \Vert x\Vert _2^2 + \frac{1}{\sqrt{n}} \Vert x\Vert _2^2 \gtrsim 0 \qquad \qquad \text {for all }k\text {-sparse vectors }x, \end{aligned}$$
(11)

thus M would be (approximately) in the \(k\)-PSD closure. This approximate preservation of norms of sparse vectors is precisely the notion of the Restricted Isometry Property (RIP) [12, 13].

Definition 2

(RIP) Given \(k<m<n\), an \(m\times n\) matrix A is said to be \((k,\delta )\)-RIP if for all k-sparse vectors \(x \in \mathbb {R}^n\), we have

$$\begin{aligned} (1-\delta )\Vert x\Vert _2^2\le \Vert Ax\Vert _2^2\le (1+\delta )\Vert x\Vert _2^2. \end{aligned}$$

This definition is very important in signal processing and recovery [12,13,14,15], and there has been much effort trying to construct deterministic [5, 11] or randomized [6] matrices satisfying given RIP guarantees.

The following theorem in [6] provides a probabilistic guarantee for a random Bernoulli matrix to have the RIP.

Theorem 10

((4.3) and (5.1) in [6], see also [1]) Let A be an \(m\times n\) matrix where each entry is independently \(\pm 1/\sqrt{m}\) with probability 1/2. Then A is \((k,\delta )\)-RIP with probability at least

$$\begin{aligned} 1-2\left( \frac{12}{\delta }\right) ^k e^{-\left( \delta ^2/16-\delta ^3/48\right) m}. \end{aligned}$$
(12)

Proof of Theorem 4. Having observed the above connection between matrices in \(\mathcal {S}^{n,k}\) and RIP matrices, in the actual proof we adopt a strategy that does not “flow” exactly as described above but is easier to analyze. We will: 1) select a RIP matrix W by choosing parameters m and \(\delta \) and applying Theorem 10; 2) use it to construct a matrix \(M \in \mathcal {S}^{n, k}\); 3) rescale the resulting matrix so that its Frobenius norm is 1; and 4) finally compute its distance from \(\mathcal {S}^n_+\) and show that it is a constant independent of n.

Actual construction of M. Set \(m = 93 k\) and \(\delta = 0.9\). Then we can numerically verify that whenever \(k\ge 2\), the probability (12) is at least \(0.51>\frac{1}{2}\). Then let W be a random \(m \times n\) matrix as in Theorem 10, and define the matrix

$$\begin{aligned} M := - (1-\delta ) I + W^\top W. \end{aligned}$$
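A sketch of this randomized construction (illustration only; W is \((k,\delta )\)-RIP, and hence M lies in \(\mathcal {S}^{n,k}\), only with the probability guaranteed by Theorem 10):

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 400, 4                                            # k = rn with r = 1/100 < 1/93
m, delta = 93 * k, 0.9
W = rng.choice([-1.0, 1.0], size=(m, n)) / np.sqrt(m)    # random +-1/sqrt(m) entries (Theorem 10)
M = -(1 - delta) * np.eye(n) + W.T @ W

M = M / np.linalg.norm(M, 'fro')                         # normalize to unit Frobenius norm
eig = np.linalg.eigvalsh(M)
dist = np.sqrt(np.sum(np.minimum(eig, 0.0)**2))          # dist_F(M, S^n_+) via Proposition 1
print(dist)                                              # a constant-size gap, cf. Theorem 4
```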

First observe that M has a large relative distance to the PSD cone and with good probability belongs to the \(k\)-PSD closure.

Lemma 5

The matrix M satisfies the following:

  1. 1.

    With probability at least 0.51, \(M \in \S ^{n,k}\)

  2. 2.

    \(dist _F(M,\mathcal {S}^n_+) \ge \sqrt{n-m}\,(1-\delta ) \).

Proof

Whenever W is \((k,\delta )\)-RIP, by definition, for all k-sparse x we have \(x^{\top } W^{\top } Wx=\Vert Wx\Vert ^2\ge (1-\delta )x^{\top }x\). Therefore \(x^{\top } Mx\ge 0\) for all k-sparse x, and hence \(M\in \mathcal {S}^{n,k}\) by Observation 1. This gives the first item of the lemma.

For the second item, notice that all vectors in the kernel of W, which has dimension \(n-m\), are eigenvectors of M with eigenvalue \(-(1-\delta )\). So the negative eigenvalues of M include at least \(n-m\) copies of \(-(1-\delta )\), and the result follows from Proposition 1. \(\square \)

Now we need to normalize M, and for that we need to control its Frobenius norm.

Lemma 6

With probability at least \(\frac{1}{2}\), \(\Vert M\Vert _F^2 \le 2n\delta ^2+\frac{2n(n-1)}{m}\).

Proof

Notice that the diagonal entries of \(W^\top W\) equal 1, so

$$\begin{aligned} \Vert M\Vert _F^2 = \sum _{i = 1}^n M_{ii}^2 + \sum _{i , j \in [n], i \ne j} M_{ij}^2 = n \delta ^2 + \sum _{i , j \in [n], i \ne j} (W^{\top }W)_{ij}^2. \end{aligned}$$

We upper bound the last sum. Let the columns of W be \(C^1,\ldots ,C^n\), and denote by \(X_{ij} = \langle C^i, C^j\rangle \) the ij-th entry of \(W^{\top } W\). Notice that when \(i\ne j\), \(X_{ij}\) is the sum of m independent random variables \(C^i_\ell C^j_\ell \) that take values \(\{-\frac{1}{m},\frac{1}{m}\}\) with equal probability, where \(\ell \) ranges from 1 to m. Therefore,

$$\begin{aligned} \mathbb {E}X_{ij}^2 = {{\,\mathrm{Var}\,}}(X_{ij}) = \sum _{\ell =1}^m {{\,\mathrm{Var}\,}}(C^i_\ell C^j_\ell ) = m \,\frac{1}{m^2} = \frac{1}{m}. \end{aligned}$$

This gives that

$$\begin{aligned} \mathbb {E}\, \Vert M\Vert _F^2 = n\delta ^2+\frac{n(n-1)}{m}. \end{aligned}$$

Since \(\Vert M\Vert _F^2\) is non-negative, from Markov’s inequality \(\Vert M\Vert _F^2 \le 2 \mathbb {E}\, \Vert M\Vert _F^2\) with probability at least 1/2. This gives the desired bound, concluding the proof. \(\square \)

Taking a union bound over Lemmas 5 and 6, with strictly positive probability the normalized matrix \(\frac{M}{\Vert M\Vert _F}\) belongs to \(\S ^{n,k}\) and has

$$\begin{aligned} dist _F\left( \frac{M}{\Vert M\Vert _F}, \mathcal {S}^n_+\right) \ge \frac{\sqrt{n-m}\,(1-\delta )}{\sqrt{2n(n-1)/m+2n\delta ^2}} \ge \frac{\sqrt{n-m}\,(1-\delta )}{\sqrt{2n^2/m+2n\delta ^2}}. \end{aligned}$$

The first inequality comes from the fact that \(\mathcal {S}^n_+\) is a pointed convex cone, and for any constant \(a>0\) we have \(dist _F(aM,\mathcal {S}^n_+)=a\, dist _F(M,\mathcal {S}^n_+)\). Thus, there is a matrix with such properties.

Now plugging in \(k=rn, m=93k, \delta =0.9\), the right hand side is at least \(\frac{\sqrt{r-93r^2}}{\sqrt{162r+3}}\). This concludes the proof of Theorem 4.

9 Proof of Theorem 5

The idea of the proof is similar to that of Theorem 1 (in Sect. 5), with the following difference: Given a unit-norm matrix \(M \in \mathcal {S}_{\mathcal {I}}\), we construct a matrix \(\widetilde{M}\) by averaging over the principal submatrices indexed by only the k-sets in \(\mathcal {I}\) instead of considering all k-sets, and upper bound the distance from M to the PSD cone by \(dist _F(M, \alpha \widetilde{M})\). Then we need to provide a uniform upper bound on \(dist _F(M, \alpha \widetilde{M})\) that holds for all M’s simultaneously with good probability (with respect to the samples \(\mathcal {I}\)). This will then give an upper bound on \(\overline{\text {dist}}_F(\mathcal {S}_{\mathcal {I}},\mathcal {S}^n_+)\).

Recall that \(\mathcal {I}= (I_1,\ldots ,I_m)\) is a sequence of independent uniform samples from the k-sets of [n]. As defined in Sect. 5, let \(M^I\) be the matrix where we zero out all the rows and columns of M except those indexed by indices in I. Let \(T_{\mathcal {I}}\) be the (random) partial averaging operator, namely for every matrix \(M \in \mathbb {R}^{n \times n}\)

$$\begin{aligned} T_{\mathcal {I}}(M) := \frac{1}{|\mathcal {I}|} \sum _{I \in \mathcal {I}} M^I. \end{aligned}$$

As we showed in Sect. 5 for the full average \(\widetilde{M} := T_{\binom{[n]}{k}}(M)\), the first observation is that if \(M \in \mathcal {S}_{\mathcal {I}}\), that is, all principal submatrices \(\{M_I\}_{I \in \mathcal {I}}\) are PSD, then the partial average \(T_{\mathcal {I}}(M)\) is also PSD.

Lemma 7

If \(M \in \mathcal {S}_{\mathcal {I}}\), then \(T_{\mathcal {I}}(M)\) is PSD.

Proof

This is straightforward, since each \(M^I\) is PSD and the average of PSD matrices is PSD. \(\square \)
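A minimal sketch of the sampling and of the partial averaging operator \(T_{\mathcal {I}}\) (function and parameter names are ours), whose scaled version is typically about as close to M as the \((1+\varepsilon )\frac{n-k}{n+k-2}\) bound established below suggests:

```python
import numpy as np

def partial_average(M, subsets):
    """T_I(M): average of the matrices M^I over the sampled k-sets I in `subsets`."""
    T = np.zeros_like(M)
    for I in subsets:
        MI = np.zeros_like(M)
        MI[np.ix_(I, I)] = M[np.ix_(I, I)]
        T += MI
    return T / len(subsets)

rng = np.random.default_rng(4)
n, k, m = 30, 10, 2000
samples = [rng.choice(n, size=k, replace=False) for _ in range(m)]
M = rng.standard_normal((n, n)); M = (M + M.T) / 2
alpha = 2 * n * (n - 1) / (k * (n + k - 2))
ratio = np.linalg.norm(M - alpha * partial_average(M, samples), 'fro') / np.linalg.norm(M, 'fro')
print(ratio)    # typically close to (n-k)/(n+k-2) ~ 0.526 for these parameters
```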

Consider a unit-norm matrix M. Now we need to upper bound \(dist _F(M, \alpha \,T_{\mathcal {I}}(M))\), for a scaling \(\alpha \), in a way that is “independent” of M. In order to achieve this goal, notice that \((T_\mathcal {I}(M))_{ij} = f_{ij} M_{ij}\), where \(f_{ij}\) is the fraction of sets in \(\mathcal {I}\) that contain \(\{i,j\}\). Then it is not difficult to see that the Frobenius distance between M and \(T_{\mathcal {I}}(M)\) can be controlled using only these fractions \(\{f_{ij}\}\), since the Frobenius norm of M is fixed to be 1.

The next lemma makes this observation formal. Since the fractions \(\{f_{ij}\}\) are random (they depend on \(\mathcal {I}\)), the lemma focuses on the typical scenarios where they are close to their expectations.

Notice that the probability that a fixed index i belongs to \(I_\ell \) is \(\frac{k}{n}\), so the fraction \(f_{ii}\) is \(\frac{k}{n}\) in expectation. Similarly, the expected value of \(f_{ij}\) is \(\frac{k(k-1)}{n(n-1)}\) when \(i\ne j\). In other words, the expectation of \(T_{\mathcal {I}}(M)\) is \(\widetilde{M}\).

Lemma 8

Let \(\gamma := \frac{k(n-k)}{2n(n-1)}\), and consider a scenario where \(\mathcal {I}\) satisfies the following for some \(\varepsilon \in [0,1)\):

  1. 1.

    For every \(i \in [n]\), the fraction of the sets in \(\mathcal {I}\) containing i is in the interval \(\left[ \frac{k}{n} - \varepsilon \gamma , \frac{k}{n} + \varepsilon \gamma \right] \).

  2. 2.

    For every pair of distinct indices \(i,j \in [n]\), the fraction of the sets in \(\mathcal {I}\) containing both i and j is in the interval \(\left[ \frac{k(k-1)}{n(n-1)} - \varepsilon \gamma , \frac{k(k-1)}{n(n-1)} + \varepsilon \gamma \right] \).

Then there is a scaling \(\alpha > 0\) such that for all matrices \(M \in \mathbb {R}^{n \times n}\) we have

$$\begin{aligned} dist _F(M, \alpha \, T_{\mathcal {I}}(M)) \le (1+\varepsilon ) \frac{n-k}{n+k-2}\,\Vert M\Vert _F. \end{aligned}$$

Proof

As in Sect. 5, let \(\widetilde{M} = T_{\binom{[n]}{k}}(M)\) be the full average matrix. Recall that \(\widetilde{M}_{ii} = \widetilde{f}_{ii} M_{ii}\) for \(\widetilde{f}_{ii} = \frac{k}{n}\), and \(\widetilde{M}_{ij} = \widetilde{f}_{ij} M_{ij}\) for \(\widetilde{f}_{ij} = \frac{k (k-1)}{n (n-1)}\) when \(i \ne j\). Also let \(\alpha := \frac{2n(n-1)}{k (n + k -2)}\). Finally, define \(\Delta := \widetilde{M} - T_{\mathcal {I}}(M)\) as the error between the full and partial averages.

From the triangle inequality we have

$$\begin{aligned} \Vert M - \alpha \, T_{\mathcal {I}}(M)\Vert _F \le \Vert M - \alpha \, \widetilde{M}\Vert _F + \alpha \,\Vert \Delta \Vert _F. \end{aligned}$$

In Sect. 5 we proved the full average bound \(\Vert M - \alpha \, \widetilde{M}\Vert _F \le \frac{n-k}{n+k-2} \Vert M\Vert _F\). Moreover, from our assumptions we have \(f_{ij} \in [\widetilde{f}_{ij} - \varepsilon \gamma , \widetilde{f}_{ij} + \varepsilon \gamma ]\) for all ij, and hence \(|\Delta _{ij}| \le \varepsilon \gamma \, |M_{ij}|\); this implies the norm bound \(\Vert \Delta \Vert _F \le \varepsilon \gamma \Vert M\Vert _F\). Putting these bounds together in the previous displayed inequality gives

$$\begin{aligned} \Vert M - \alpha \, T_{\mathcal {I}}(M)\Vert _F \le \bigg (\frac{n-k}{n+k-2} + \varepsilon \alpha \gamma \bigg )\,\Vert M\Vert _F = (1+\varepsilon ) \frac{n-k}{n+k-2}\,\Vert M\Vert _F. \end{aligned}$$

This concludes the proof. \(\square \)

Finally, we use concentration inequalities to show that the “typical” scenario assumed in the previous lemma holds with good probability.

Lemma 9

With the parameter m given in Theorem 5, with probability at least \(1 - \delta \) the sequence \(\mathcal {I}\) is in a scenario satisfying the assumptions of Lemma 8.

Proof

As stated in Lemma 8, we only need that for all entries ij the fraction \(f_{ij}\) deviates from its expectation by at most \(\varepsilon \gamma \) in absolute value, with failure probability at most \(\delta \). From the union bound, this can be achieved if, for each entry, the probability that the deviation of its fraction \(f_{ij}\) fails to be within \([-\varepsilon \gamma , \varepsilon \gamma ]\) is at most \(\frac{\delta }{n^2}\). Now we consider both diagonal and off-diagonal terms:

  1. 1.

    Diagonal terms \(f_{ii}\): For each k-set sample I, let \(X_I\) be the indicator variable that is 0 if \(i\notin I\), and 1 if \(i\in I\). Notice that they are independent, with expectation \(\frac{k}{n}\). Let \(X= \sum _{I\in \mathcal {I}} X_I\) be the sum of these variables. From the definition of \(f_{ii}\) we have \(X=f_{ii}m\), where m is the total number of samples. From the Chernoff bound, we have that

    $$\begin{aligned} \Pr \bigg (\left| f_{ii}-\frac{k}{n}\right|> \varepsilon \frac{(n-k)k}{2n(n-1)}\bigg )&= \Pr \bigg (\left| X-\frac{mk}{n}\right| > \varepsilon m\frac{(n-k)k}{2n(n-1)}\bigg )\\&\le 2\exp \bigg (-\frac{\varepsilon ^2 (n-k)^2k m}{12n(n-1)^2 }\bigg )\\&\le \frac{\delta }{n^2} \end{aligned}$$

    as long as

    $$\begin{aligned} m\ge \frac{12n(n-1)^2}{\varepsilon ^2 (n-k)^2 k}\ln \frac{2n^2}{\delta }. \end{aligned}$$
  2. 2.

    Off-diagonal terms \(f_{ij}\): Similar to the first case, now for each k-set sample I, let \(X_I\) be the indicator variable that is 1 if \(\{i,j\}\subset I\), and 0 otherwise. Now the expectation of each \(X_I\) becomes \(\frac{k(k-1)}{n(n-1)}\). Again let \(X=\sum _{I\in \mathcal {I}} X_I\); using the same argument as above, \(X=f_{ij}m\). From the Chernoff bound we get

    $$\begin{aligned} \Pr \bigg (\left| f_{ij}-\frac{k(k-1)}{n(n-1)}\right|> \varepsilon \frac{(n-k)k}{2n(n-1)}\bigg )&= \Pr \bigg (\left| X-\frac{mk(k-1)}{n(n-1)}\right| > \varepsilon m\frac{(n-k)k}{2n(n-1)}\bigg )\\&\le 2\exp \bigg (-\frac{\varepsilon ^2 (n-k)^2k m}{12n(n-1)(k-1) }\bigg )\\&\le \frac{\delta }{n^2} \end{aligned}$$

    as long as

    $$\begin{aligned} m\ge \frac{12n(n-1)(k-1)}{\varepsilon ^2 (n-k)^2 k}\ln \frac{2n^2}{\delta }. \end{aligned}$$

Since we chose m large enough to satisfy both of these cases, taking a union bound over all ij’s we get that the probability that any of the \(f_{ij}\)’s deviates from its expectation by more than \(\varepsilon \gamma \) is at most \(\delta \). This concludes the proof. \(\square \)

Combining this with Lemma 8, we conclude the proof of Theorem 5.