Abstract
In this self-contained chapter, we revisit a fundamental problem of multivariate statistics: estimating covariance matrices from finitely many independent samples. Using massive multiple-input multiple-output (MIMO) systems as an example, we illustrate the necessity of leveraging structure and accounting for quantization of samples when estimating covariance matrices in practice. We then provide a selective survey of theoretical advances of the last decade, focusing on the estimation of structured covariance matrices. This review is spiced up by some as yet unpublished insights on how to benefit from combined structural constraints. Finally, we summarize the findings of our recent preprint “Covariance estimation under one-bit quantization” (Dirksen et al., 2021) to show how guaranteed covariance estimation is possible even under coarse quantization of the samples.
3.1 Introduction
The key objective in covariance estimation is simple to state: given \(n \in {\mathbb N}\) i.i.d. samples \({\mathbf {X}}^1,\ldots,{\mathbf {X}}^n \overset {\mathrm {d}}{\sim } \mathbf {X}\) of a random vector \(\mathbf {X} \in {\mathbb R}^p\), compute a reliable estimate of the covariance matrix \(\mathbb E [\mathbf {X}{\mathbf {X}}^\top ] = \boldsymbol {\Sigma } \in {\mathbb R}^{p\times p}\) (without loss of generality, we restrict ourselves here to mean-zero distributions, i.e., \(\mathbb {E} [\mathbf {X}] = \boldsymbol {0}\)). For this purpose, a natural estimator is the sample covariance matrix
\[ \hat {\boldsymbol \Sigma }_n = \frac{1}{n} \sum_{k=1}^n \mathbf X^k (\mathbf X^k)^\top \qquad (3.1) \]
as it converges to Σ, for n →∞, by the law of large numbers. Nevertheless, an asymptotic result is of limited use from a practical perspective. Given \(n \in {\mathbb N}\), it provides no information on the reconstruction error \(\| \hat {\boldsymbol \Sigma }_n - \boldsymbol \Sigma \|\) measured in the operator norm ∥⋅∥. (Although other norms or error metrics might be considered as well, e.g., the Frobenius norm, we mainly restrict ourselves in this chapter to operator norm bounds as the most common representative.)
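To make these objects concrete, here is a minimal NumPy sketch (all variable names and the choice of Σ are ours for illustration) that forms the sample covariance matrix from mean-zero samples and measures the operator norm error:

```python
import numpy as np

def sample_covariance(X):
    """Sample covariance matrix (3.1) for mean-zero samples; rows of X are X^1, ..., X^n."""
    n = X.shape[0]
    return X.T @ X / n

rng = np.random.default_rng(0)
p, n = 5, 20000
Sigma = np.diag(np.arange(1.0, p + 1.0))        # illustrative true covariance
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
Sigma_hat = sample_covariance(X)
err = np.linalg.norm(Sigma_hat - Sigma, ord=2)  # operator norm error
```

For n ≫ p, the operator norm error is small, in line with the law of large numbers.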
In the last two decades, numerous works on the non-asymptotic analysis of covariance estimation showed that reliable approximation of Σ by \(\hat {\boldsymbol \Sigma }_n\) becomes feasible for sub-Gaussian distributions if n ≳ p, where \(a \lesssim b\) denotes a ≤ Cb for some absolute constant C > 0. For instance, if X has a Gaussian distribution, then it is well known [61] that with probability at least 1 − 2e −t
\[ \| \hat {\boldsymbol \Sigma }_n - \boldsymbol \Sigma \| \lesssim \| \boldsymbol \Sigma \| \left( \sqrt{\frac{p+t}{n}} + \frac{p+t}{n} \right). \qquad (3.2) \]
This classical result exhibits various weaknesses. For instance, it requires strong concentration of the distribution of X around its mean. The estimator in (3.1) is sensitive to outliers and not reliable if concentration fails [12, 34]. Furthermore, in applications the ambient dimension can easily exceed the number of accessible samples such that even if concentration may be assumed, the estimate in (3.2) is void.
3.1.1 Outline and Notation
In Sect. 3.2, we briefly discuss massive MIMO as one specific modern application of covariance estimation. The massive MIMO setting originates from wireless communications research and will serve as a motivation for investigating multiple structures and quantized samples in a mathematical framework. Section 3.3 then surveys recent theoretical advances on estimation of structured covariance matrices, and Sect. 3.4 shows the impact of coarse sample quantization on estimation guarantees. Having the theoretical results from Sects. 3.3 and 3.4 in mind, in Sect. 3.5, we finally return to the details of massive MIMO and present our recent approach in engineering literature. We conclude in Sect. 3.6 by discussing the gap between existing theoretical guarantees and practical solutions. Some technical details of Sect. 3.3 are deferred to the Appendix.
We denote [n] = {1, ..., n}. For any absolute constant C > 0, we abbreviate a ≤ Cb (resp., ≥) as \(a \lesssim b\) (resp., ≳). We furthermore write a ≲L b (resp., ≳L) if C only depends on the quantity L. Whenever we use absolute constants c, C > 0, their values may vary from line to line. Scalar-valued functions act component-wise on vectors and matrices. For a set S, the indicator function χ S is 1 on S and 0 on its complement S c. We denote the all-ones matrix by \(\boldsymbol {1} \in {\mathbb R}^{p\times p}\) and the identity by \(\mathbf I \in {\mathbb R}^{p\times p}\). In particular,
for all \(\mathbf x \in {\mathbb R}^p\) and i ∈ [p]. For \(\mathbf Z \in {\mathbb R}^{p\times p}\), we denote the operator norm (the maximum singular value) by \(\| \mathbf Z \| = \sup _{\mathbf u \in \mathbb {S}^{p-1}} \| \mathbf Z\mathbf u \|_2\), the nuclear norm (the sum of singular values) by \(\| \mathbf Z \|_* = \mathrm {tr}(\sqrt {\mathbf Z^\top \mathbf Z})\), the Frobenius norm by \(\| \mathbf Z \|_F^2 = \mathrm {tr} (\mathbf Z^\top \mathbf Z) = \sum _{i,j = 1}^p Z_{i,j}^2\), the max norm by \(\| \mathbf Z \|_\infty = \max_{i,j} |Z_{i,j}|\), and the maximum column norm by \(\| \mathbf Z \|_{1\to 2} = \max_{j \in [p]} \| \mathbf z_j \|_2\), where \(\mathbf z_j\) denotes the j-th column of Z. We use ⊙ for the Hadamard (i.e., entry-wise) product of two matrices. The uniform distribution on a set S is denoted by Unif(S). The multivariate Gaussian distribution with mean \(\boldsymbol \mu \in {\mathbb R}^{p}\) and covariance matrix \(\boldsymbol \Sigma \in {\mathbb R}^{p\times p}\) is denoted by \(\mathcal {N}(\boldsymbol \mu , \boldsymbol \Sigma )\). The sub-Gaussian (ψ 2-) and subexponential (ψ 1-) norms of a random variable X are defined by
\[ \| X \|_{\psi_2} = \inf \{ t > 0 \colon \mathbb E\, e^{X^2/t^2} \le 2 \}, \qquad \| X \|_{\psi_1} = \inf \{ t > 0 \colon \mathbb E\, e^{|X|/t} \le 2 \}. \]
A mean-zero random vector X on \({\mathbb R}^n\) is called K-sub-Gaussian if
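The matrix norms introduced above are all available in standard numerical libraries; the following minimal NumPy sketch (variable names are ours) illustrates them and the familiar ordering \(\| \mathbf Z \| \le \| \mathbf Z \|_F \le \| \mathbf Z \|_*\):

```python
import numpy as np

rng = np.random.default_rng(1)
Z = rng.standard_normal((4, 4))

op_norm  = np.linalg.norm(Z, ord=2)            # operator norm: largest singular value
nuc_norm = np.linalg.norm(Z, ord='nuc')        # nuclear norm: sum of singular values
fro_norm = np.linalg.norm(Z, ord='fro')        # Frobenius norm
max_norm = np.max(np.abs(Z))                   # max norm ||Z||_inf
col_norm = np.max(np.linalg.norm(Z, axis=0))   # maximum column norm ||Z||_{1->2}
s = np.linalg.svd(Z, compute_uv=False)         # singular values, sorted descending
```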
3.2 Motivation: Massive MIMO
Multiple-input multiple-output (MIMO) is a method in wireless communication to enhance the capacity of a radio link by using multiple transmission and multiple receiving antennas. It has become an essential element of wireless communication standards for Wi-Fi and mobile devices [24, 50]. Massive MIMO equips the base station (BS) with a large number of antennas to further increase bandwidth and potential number of users [44, 45].
In a classical massive MIMO communication system, the BS is equipped with a uniform linear array (ULA) of M antennas and communicates with multiple users through a scattering channel, e.g., wave reflection on buildings or objects. See Fig. 3.1 for an exemplary setup. During uplink (UL), the BS receives user pilots and aims at estimating the respective channel covariance matrices, which characterize each transmission channel. By assuming mutual orthogonality of all UL pilots, it suffices to focus on a single user channel. We denote the corresponding UL channel vector at time–frequency resource s by \(\mathbf h(s) \in \mathbb C^M\) (standard block-fading model, e.g., [59]). Furthermore, we assume that the user transmits a single pilot per channel coherence block such that the channel vectors h(s) are i.i.d. complex Gaussian vectors, for s ∈ [N] [27, 28].
The received pilot signal at the BS at resource block s is then given as
for s ∈ [N], where \(x(s) \in {\mathbb C}\) is the known pilot symbol and \(\mathbf z(s) \sim \mathcal C \mathcal N (\boldsymbol 0, N_0 \mathbf I) = \mathcal N (\boldsymbol 0, \tfrac {N_0}{2} \mathbf I) + j \mathcal N (\boldsymbol 0, \tfrac {N_0}{2} \mathbf I)\) models additive white Gaussian noise (AWGN). Without loss of generality, one may assume that the pilot symbols are normalized, i.e., x(s) = 1. The core problem of massive MIMO channel estimation is now to estimate the channel covariance matrix
from N noisy samples y(s), s ∈ [N]. Since the number of samples N is limited due to time constraints of the UL phase, one expects for massive MIMO that N ≈ M. Translating this into our initial theoretical setting, i.e., identifying the ambient dimension p with the number of antennas M, the number of samples n with the number of independent time–frequency resources N, and the sample vectors X k with the channel vectors h(s), we see that the sample covariance matrix will not provide a reliable estimate of Σ h in this case, cf. Eq. (3.2) for n ≈ p. Nevertheless, a closer look into the channel model reveals that Σ h naturally exhibits intrinsic structures such as low-rankness and Toeplitz structure, cf. Sect. 3.5.
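The UL model above can be simulated in a few lines. The following sketch (toy parameters and the concrete Toeplitz choice of Σ h are our own assumptions, not part of the model) generates i.i.d. complex Gaussian channel vectors, adds AWGN with x(s) = 1, and forms a naive covariance estimate from the noisy pilots:

```python
import numpy as np

rng = np.random.default_rng(6)
M, N = 8, 16          # antennas at the BS, pilot resources (N ≈ M)
N0 = 0.1              # noise power

# Toy channel covariance: Toeplitz, as suggested by the ULA geometry (Sect. 3.5).
c = 0.7 ** np.arange(M)
Sigma_h = np.array([[c[abs(i - j)] for j in range(M)] for i in range(M)])

# i.i.d. circularly symmetric complex Gaussian channel vectors h(s) with covariance Sigma_h.
L = np.linalg.cholesky(Sigma_h)
H = (rng.standard_normal((N, M)) + 1j * rng.standard_normal((N, M))) / np.sqrt(2) @ L.T
# Received pilots y(s) = x(s) h(s) + z(s) with x(s) = 1 and AWGN z(s) ~ CN(0, N0*I).
Z = np.sqrt(N0 / 2) * (rng.standard_normal((N, M)) + 1j * rng.standard_normal((N, M)))
Y = H + Z

# Naive estimate of Sigma_h from the noisy pilots (the known noise bias N0*I is subtracted).
Sigma_est = Y.conj().T @ Y / N - N0 * np.eye(M)
```

With N ≈ M as in the text, this naive estimate is unreliable, which is precisely the motivation for the structured approaches surveyed next.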
Structure and Quantization
Let us highlight two crucial points. First, whereas engineers successfully boost the sample covariance matrix by exploiting special features of their problem setting, cf. Sect. 3.5, existing approaches might be simplified by alternatives to the sample covariance matrix that automatically leverage intrinsic structure(s) of the covariance matrix. As Sect. 3.3 will show, the last decade substantially improved our theoretical understanding in this regard. Second, if the above methods are used in real applications, one has to take into account that the sample vectors y(s) have to be quantized to finite alphabets before digital processing. Especially in massive MIMO, the information loss due to quantization can be significant, since fine quantization at a multitude of antennas leads to enormous energy consumption and coarse quantizers are therefore preferred. The results presented in Sect. 3.4 can be seen as a first theoretical step toward understanding the non-asymptotic behavior of covariance estimators under coarse quantization of the samples. Since we concentrate on memoryless quantization schemes (each vector entry is quantized independently of all others), our model should be applicable to massive MIMO in a straightforward way.
3.3 Estimation of Structured Covariance Matrices and Robustness Against Outliers
As we have already seen in Sect. 3.2, there are several structures of interest that Σ might exhibit in applications. We concentrate here on three important instances—sparsity, low-rankness, and Toeplitz structure—that naturally emerge in engineering, biology, and data science, e.g., [42, 53]. Parts of the results we review below are not restricted to Gaussian random vectors but make it possible to treat heavy-tailed distributions that only satisfy assumptions on their lower moments. Techniques for robust covariance estimation include median of means [31, 49], element- and spectrum-wise truncation [12, 47], and M-estimators [47, 48]. The recent work [46] even constructs a “sub-Gaussian” estimator that only requires a finite kurtosis assumption (L 4–L 2-norm equivalence). In this context, an estimator is called sub-Gaussian if it performs on non-Gaussian distributions as well as the sample covariance matrix does on Gaussian distributions; for further discussion, see [46]. Although the proposed construction is computationally intractable, it illustrates the potential of robust estimation. For further information on early and recent approaches to robust covariance estimation, we refer the reader to [29, 34].
3.3.1 Sparse Covariance Matrices
We begin with the assumption that Σ is a sparse matrix, i.e., only a few entries of Σ are relevant and hence non-zero. If X models ordered variables, the non-zero entries of Σ, for instance, might cluster around the diagonal such that Σ is a banded or tapered matrix. A straightforward way to estimate such covariance matrices is to band/taper the sample covariance matrix \(\hat {\boldsymbol \Sigma }_n\) [6, 11, 23]. If the variables are not ordered and the non-zero entries of Σ do not cluster, thresholding of \(\hat {\boldsymbol \Sigma }_n\) is a viable alternative [5, 19]. As remarked in [40], the aforementioned approaches can be treated in a unified way by introducing a mask M ∈ [0, 1]p×p and considering the masked sample covariance matrix \(\mathbf M \odot \hat {\boldsymbol \Sigma }_n\). The masked formulation allows one to decompose the estimation error
\[ \| \mathbf M \odot \hat{\boldsymbol \Sigma}_n - \boldsymbol \Sigma \| \le \| \mathbf M \odot (\hat{\boldsymbol \Sigma}_n - \boldsymbol \Sigma) \| + \| \mathbf M \odot \boldsymbol \Sigma - \boldsymbol \Sigma \| \]
into a variance term that behaves well if M is (close to) sparse and a bias term that is small whenever M encodes the support of Σ. The bias term is deterministic and solely depends on a proper choice of M. For understanding the influence of sparsity on the required sample size, it thus suffices to control the variance term. The corresponding state-of-the-art result can be found in [13] which extends [40] from Gaussian distributions to general distributions of finite fourth moment and strengthens [40] if applied to Gaussian distributions. To facilitate the comparison with (3.2), we present the result only in the Gaussian case.
Theorem 3.1 ([13, Theorem 1.1])
Let M ∈ [0, 1]p×p , for p ≥ 3, be fixed and \(\mathbf X \sim \mathcal {N} (\boldsymbol 0, \boldsymbol \Sigma )\) with \(\boldsymbol \Sigma \in {\mathbb R}^{p\times p}\) . Then,
Theorem 3.1 only bounds the second moment of the variance term, which yields high-probability estimates via Markov’s inequality. However, the same proof techniques apply to higher moments of the variance term as well such that exponential tail bounds can be achieved for Gaussian X, cf. [13, Section 3.3].
Let us compare Theorem 3.1 with (3.2). For general covariance estimation, i.e., M = 1, we have \(\| \boldsymbol M \|{ }_{1\rightarrow 2}^2 = \| \boldsymbol M \| = p\), which implies that up to \(\log \)-factors both results are of the same order \(\mathcal {O}(\sqrt {\tfrac {p}{n}} + \tfrac {p}{n})\). If M encodes sparsity, however, meaning that only up to s ≪ p columns and rows are non-zero and \(\| \boldsymbol M \|{ }_{1\rightarrow 2}^2 = \| \boldsymbol M \| = s\), the estimation error is considerably reduced when applying Theorem 3.1. A similar error reduction occurs if \(\mathbf {M} \odot \hat {\boldsymbol \Sigma }_n\) is a banded estimator of bandwidth B.
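A banded mask as in this comparison can be sketched as follows; the concrete construction of Σ and the bandwidth are illustrative choices of ours:

```python
import numpy as np

def banding_mask(p, B):
    """Mask M with M_ij = 1 iff |i - j| < B (bandwidth-B banding)."""
    idx = np.arange(p)
    return (np.abs(idx[:, None] - idx[None, :]) < B).astype(float)

rng = np.random.default_rng(2)
p, n, B = 30, 200, 3
# True covariance: banded Toeplitz (diagonal 1, first off-diagonals 0.4) for illustration.
Sigma = np.eye(p) + 0.4 * (np.abs(np.subtract.outer(np.arange(p), np.arange(p))) == 1)
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
Sigma_hat = X.T @ X / n
M = banding_mask(p, B)
masked_est = M * Sigma_hat                      # Hadamard product M ⊙ Σ̂_n
err_masked = np.linalg.norm(masked_est - Sigma, ord=2)
err_plain  = np.linalg.norm(Sigma_hat - Sigma, ord=2)
```

Since the mask matches the support of Σ here, the bias term vanishes and the masked estimator typically improves on the plain sample covariance matrix, as predicted by Theorem 3.1.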
Estimation via Thresholding
While the masked framework provides a unified understanding of the intrinsic complexity of sparse covariance estimation, in practice the mask M is unknown. A more realistic approach to the problem is hence given by thresholding procedures as in, e.g., [5]. To allow for non-ordered covariance matrices, i.e., general sparsity and not only limited bandwidth of the matrix, the authors of [5] introduce the set of bounded and (effectively) sparse covariance matrices
for q ∈ [0, 1) and s, M > 0. If q = 0, the matrices in \(\mathcal {U} (q,s,M)\) have at most s non-zero entries per row; if q > 0, the rows are close to s-sparse vectors. To estimate \(\boldsymbol \Sigma \in \mathcal {U} (q,s,M)\), the thresholded estimator \(\mathbb T_\tau (\hat {\boldsymbol \Sigma }_n)\) is considered, where
\[ [\mathbb T_\tau (\mathbf A)]_{i,j} = A_{i,j}\, \chi_{\{ |A_{i,j}| \ge \tau \}} \qquad (3.5) \]
for any τ > 0 and \(\mathbf A \in {\mathbb R}^{p\times p}\).
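A minimal implementation of entry-wise hard thresholding, assuming the common convention that entries of magnitude at most τ are set to zero (conventions in the literature differ slightly, e.g., on whether the diagonal is kept untouched):

```python
import numpy as np

def hard_threshold(A, tau):
    """Entry-wise hard thresholding: keep entries with |A_ij| > tau, zero out the rest.
    One common convention for the operator (3.5)."""
    return A * (np.abs(A) > tau)

A = np.array([[1.0, 0.05, 0.0],
              [0.05, 2.0, 0.3],
              [0.0, 0.3, 1.5]])
T = hard_threshold(A, tau=0.1)   # the weak entries 0.05 are zeroed, 0.3 survives
```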
Theorem 3.2 ([5, Theorem 1])
Let \(\mathbf X \sim \mathcal {N} (\boldsymbol 0, \boldsymbol \Sigma )\) , for \(\boldsymbol \Sigma \in \mathcal {U} (q,s,M)\) , and M′ > 0 be sufficiently large (depending on M). If
for \(n \gtrsim \log (p)\) , then with probability at least \(1 - e^{-cn\tau ^2}\)
Theorem 3.2 does not require knowledge of the support of Σ and respects sparsity defects. However, if we once more consider the case q = 0, we see that the estimate in Theorem 3.2 is suboptimal since the error behaves (up to \(\log \)-factors) like \(\mathcal {O} \big ( \sqrt {\tfrac {s^2}{n}} \big )\) and not like \(\mathcal {O} (\sqrt {\tfrac {s}{n}})\) as one would expect.
3.3.2 Low-Rank Covariance Matrices
When working with high-dimensional random vectors, another commonly considered structural prior is to assume that the distribution concentrates around a low-dimensional manifold. This may manifest itself in Σ being a low-rank matrix. Interestingly, the sample covariance matrix in (3.1) intrinsically leverages low-rankness of Σ. To understand this phenomenon, we consider the effective rank of Σ defined as
\[ \mathbf r(\boldsymbol \Sigma) = \frac{\mathrm{tr}(\boldsymbol \Sigma)}{\| \boldsymbol \Sigma \|}. \]
It is straightforward to verify that 1 ≤r( Σ) ≤rank( Σ). In contrast to the rank of Σ, the quantity r( Σ) is small even if Σ is only close to a low-rank matrix, e.g., consider Σ to be a full rank matrix with exponentially decaying spectrum.
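The effective rank is a one-liner to compute; the exponentially decaying spectrum mentioned above serves as an example (the decay rate 1/2 is our illustrative choice):

```python
import numpy as np

def effective_rank(Sigma):
    """Effective rank r(Sigma) = tr(Sigma) / ||Sigma|| (operator norm)."""
    return np.trace(Sigma) / np.linalg.norm(Sigma, ord=2)

# Full-rank matrix with exponentially decaying spectrum: tiny effective rank.
p = 50
Sigma = np.diag(2.0 ** (-np.arange(p)))
r = effective_rank(Sigma)   # close to 2, although rank(Sigma) = p
```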
Theorem 3.3 ([37, Corollary 2])
Let \(\mathbf X \sim \mathcal N (\boldsymbol 0,\boldsymbol \Sigma )\) , for \(\boldsymbol \Sigma \in {\mathbb R}^{p\times p}\) , and n ≳r( Σ). Then with probability at least 1 − e −t the sample covariance matrix satisfies
The authors of [37] further show that the bound in Theorem 3.3 is tight up to constants. If we compare the result to (3.2), we see that both estimates agree for (effectively) full rank matrices like Σ = I. If Σ is of low rank, however, Theorem 3.3 controls the estimation error even in the case n < p.
Low-Rank Estimators
We could stop at this point since \(\hat { \boldsymbol \Sigma }_n\) apparently meets our requirements. Nevertheless, two questions remain. First, if one assumes Σ to be low rank, one would wish for an estimator that is low rank itself, and, second, Theorem 3.3 fails if X does not exhibit strong concentration around its mean. The first point can be addressed by using the LASSO estimator
\[ \hat{\boldsymbol \Sigma}_n^\lambda = \operatorname*{arg\,min}_{\mathbf A \in {\mathbb R}^{p\times p}} \; \tfrac{1}{2} \| \mathbf A - \hat{\boldsymbol \Sigma}_n \|_F^2 + \lambda \| \mathbf A \|_* \qquad (3.6) \]
where λ > 0 is a tunable parameter. Initially introduced in [43] to estimate covariance matrices from incomplete observations, the result reads in our setting as follows.
Theorem 3.4 ([43, Corollary 1])
Let \(\mathbf X \sim \mathcal N (\boldsymbol 0,\boldsymbol \Sigma )\) , for \(\boldsymbol \Sigma \in {\mathbb R}^{p\times p}\) , and \(n \gtrsim \mathbf r (\boldsymbol \Sigma ) \log (2p + n)^2\) . If
for a sufficiently large absolute constant C > 0, then with probability at least \(1-\tfrac {1}{2p}\) the estimator in (3.6) satisfies
The nuclear norm regularization in (3.6) induces (effective) low-rankness on \(\hat { \boldsymbol \Sigma }_n^\lambda \) [21, 51] and the order of estimation error reflects up to \(\log \)-factors the one in Theorem 3.3. Furthermore, the construction of \(\hat { \boldsymbol \Sigma }_n^\lambda \) can easily be adapted to heavy-tailed distributions by replacing \(\hat { \boldsymbol \Sigma }_n\) with an appropriate robust counterpart, e.g., the spectrum-wise truncated sample covariance matrix [34]. A corresponding version of Theorem 3.4 that is not restricted to (sub)-Gaussian distributions is [34, Theorem 5.2].
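If no positive semi-definiteness constraint is imposed, a nuclear-norm regularized estimator of the form (3.6) admits a closed-form solution: soft-thresholding the eigenvalues of \(\hat{\boldsymbol \Sigma}_n\) (the proximal operator of the nuclear norm). A sketch, assuming the normalization \(\min_{\mathbf A} \tfrac12 \| \mathbf A - \hat{\boldsymbol \Sigma}_n \|_F^2 + \lambda \| \mathbf A \|_*\) and with an illustrative choice of Σ:

```python
import numpy as np

def nuclear_prox(S_hat, lam):
    """Solve min_A 0.5*||A - S_hat||_F^2 + lam*||A||_* for symmetric S_hat:
    soft-threshold the eigenvalues toward zero (prox of the nuclear norm)."""
    w, V = np.linalg.eigh(S_hat)
    w_shrunk = np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)
    return (V * w_shrunk) @ V.T

rng = np.random.default_rng(3)
p, n = 20, 50
U = rng.standard_normal((p, 2))
Sigma = U @ U.T + 0.01 * np.eye(p)   # (approximately) rank-2 covariance
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
S_hat = X.T @ X / n
Sigma_lam = nuclear_prox(S_hat, lam=1.0)   # effectively low-rank estimate
```

A different normalization of (3.6) only rescales λ; the soft-thresholding structure is the same.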
3.3.3 Toeplitz Covariance Matrices and Combined Structures
The third structure we discuss here in detail naturally arises in various engineering problems. If the entries of X resemble measurements on a temporal or spatial grid whose covariances only depend on the distances of measurements (in time or space) but not their location, Σ is a symmetric Toeplitz matrix, i.e.,
and the first column \(\boldsymbol \sigma \in {\mathbb R}^p\) determines Σ via Σi,j = σ |i−j|+1. (For simplicity, we identify Toeplitz matrices with their first column in the following.) Such a structure appears, for instance, in Direction-Of-Arrival (DOA) estimation [38] and medical and radar imaging [9, 56]. For further examples, we refer the reader to [53]. Since Toeplitz structure reduces the degrees of freedom in Σ from \(p^2\) to p, leveraging this structure can lead to a notable reduction in sample complexity.
The authors of [10] propose to average the sample covariance matrix along its diagonals to obtain the Toeplitz estimator \(\hat { \boldsymbol \Sigma }_n^{\mathrm {Toep}}\) defined as
They derive error estimates for Gaussian distributions with banded Toeplitz covariance matrices.
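Diagonal averaging of a symmetric matrix is straightforward to implement. The following sketch is our own minimal version of this idea, not the exact estimator definition from [10]:

```python
import numpy as np

def toeplitz_average(S):
    """Average a symmetric matrix along its diagonals, yielding a symmetric Toeplitz matrix."""
    p = S.shape[0]
    out = np.zeros_like(S)
    for d in range(p):
        avg = np.mean(np.diagonal(S, offset=d))   # d-th upper diagonal (S is symmetric)
        idx = np.arange(p - d)
        out[idx, idx + d] = avg
        out[idx + d, idx] = avg
    return out

S = np.array([[1.0, 0.5],
              [0.5, 1.2]])
T = toeplitz_average(S)   # [[1.1, 0.5], [0.5, 1.1]]
```

Applied to \(\hat{\boldsymbol \Sigma}_n\), each covariance lag is estimated from all entries on the corresponding diagonal, which is the source of the improved sample complexity discussed below.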
The more recent work [33] extends these results to non-Gaussian distributions and general masks as introduced in Sect. 3.3.1. To be more precise, the authors of [33] assume that the distribution of X has the so-called convex concentration property.
Definition 3.1
A random vector \(\mathbf X \in {\mathbb R}^p\) has the convex concentration property with constant K if for any convex 1-Lipschitz function \(\phi \colon {\mathbb R}^p \to {\mathbb R}\), one has \(\mathbb E [\phi (\mathbf X)] < \infty \) and
\[ \mathbb P\big( |\phi(\mathbf X) - \mathbb E[\phi(\mathbf X)]| \ge t \big) \le 2\, e^{-t^2/K^2} \quad \text{for all } t > 0. \]
By setting ϕ(⋅) = 〈⋅, x〉, for \(\mathbf x \in \mathbb R^p\), one easily sees that all distributions that have the convex concentration property are sub-Gaussian. For the sake of consistency, we therefore restrict ourselves here to Gaussian distributions as their most prominent representative. For a symmetric Toeplitz mask M ∈ [0, 1]p×p characterized by its first column m ∈ [0, 1]p, we furthermore define the weighted ℓ 1- and ℓ 2-norms of m as
Theorem 3.5 ([33, Theorem 3])
Let M ∈ [0, 1]p×p be a symmetric Toeplitz mask and \(\mathbf X \sim \mathcal N(\boldsymbol 0,\boldsymbol \Sigma )\) with \(\boldsymbol \Sigma \in {\mathbb R}^{p\times p}\) symmetric and Toeplitz. Then,
Like Theorem 3.1, the result is not restricted to an estimate of the expected error but includes respective high-probability bounds with exponential tail decay. Let us compare Theorem 3.5 to Theorem 3.1. If we ignore \(\log \)-factors and assume that M is a banding or tapering mask with support bandwidth \(B \le \tfrac {p}{2}\), i.e., only the B innermost diagonals of M are non-zero, Theorem 3.5 guarantees an estimation error of order \(\mathcal {O}(\sqrt {\tfrac {B}{pn}} + \tfrac {B}{pn})\), cf. [33, Corollary 2], which improves the estimate \(\mathcal {O}(\sqrt {\tfrac {B}{n}} + \tfrac {B}{n})\) of Theorem 3.1 by a factor of p. This improvement corresponds to the reduction in degrees of freedom when comparing Toeplitz to general matrices. Note, however, that the additional assumption B ≤ αp, for α ∈ (0, 1), is required for such a reduction since estimation of the outermost diagonals of Σ is hardly enhanced by averaging over the Toeplitz structure. This is expressed by Theorem 3.5 since ∥m∥1,∗ and ∥m∥2,∗ are \(\mathcal O(1)\) and not \(\mathcal {O}(\tfrac {1}{p})\) if the tail entries of m are not of vanishing magnitude.
Estimation via Thresholding
Theorem 3.5 differs from the previously discussed results in the sense that it allows to simultaneously leverage two structures of Σ, sparsity and Toeplitz structure. Nevertheless, as in Sect. 3.3.1, the masked framework leaves open the question of how to choose M in practice. By combining the thresholded approach in Theorem 3.2 with the techniques of Theorem 3.5, one can obtain a thresholded Toeplitz estimator which profits from both structural priors. To state a corresponding estimate, let us define the set of bounded Toeplitz covariance matrices with (effectively) sparse first column σ by
We furthermore denote by \(\mathbb B_{\alpha p} (\boldsymbol \Sigma )\) the matrix Σ restricted to bandwidth αp, i.e., \([\mathbb B_{\alpha p} (\boldsymbol \Sigma )]_{i,j} = \Sigma _{i,j}\) if |i − j| + 1 ≤ αp and \([\mathbb B_{\alpha p} (\boldsymbol \Sigma )]_{i,j} = 0\) else.
Theorem 3.6
There exists an absolute constant C > 0 such that the following holds. Let X have the convex concentration property with constant K. Let \(\mathbb E [\mathbf X] = \boldsymbol 0\) and \(\mathbb E [\mathbf X \mathbf X^\top ] = \boldsymbol \Sigma \) , for \(\boldsymbol \Sigma \in \mathcal {U}^{\mathrm {Toep}} (q,s,M)\) . For all α ∈ (0, 1) and c > 1, we have with probability at least 1 − (2αp)−(c−1) that if
then
where \(\mathbb T_\tau \) is the thresholding operator from (3.5).
Two comments are in order here. To gain from the Toeplitz structure, Theorem 3.6 requires Σ to be close to a banded matrix; this was already the case in Theorem 3.5 and has been discussed there. Moreover, by adapting the proof strategy of Theorem 3.2, the result inherits the slightly suboptimal error decay in the sparsity level s, cf. the discussion of Theorem 3.2 for the case q = 0. The proof, which combines ideas from [5] and [33], can be found in the Appendix.
Combining Toeplitz Structure and Low-Rankness
Sparsity is not the only structure that can be imposed on Toeplitz matrices. For instance, in massive MIMO, see Sect. 3.2, low-rankness of Σ may naturally be assumed in addition to Toeplitz structure [28]. The recent works [20, 39] propose several algorithms to estimate low-rank Toeplitz covariance matrices from partial observations by a technique called “sparse ruler.” In particular, the authors show that the number of samples sufficient to approximate Σ scales (up to \(\log \)-factors) polynomially in the (effective) rank of Σ.
Remark 3.1
Before closing this section, let us briefly comment on the three types of structures discussed above and their mutual relation:
-
Sparsity: The concept of sparsity is perhaps the most fundamental way of theoretically describing intrinsic “low-complexity” of points in a vector space. Whereas we only introduced sparsity of vectors in \({\mathbb R}^n\) with respect to the canonical basis, it is straightforward to generalize the definition to arbitrary vector spaces and other bases (or even frames). Note, however, that sparsity strongly depends on the chosen representation of the objects in space, i.e., a point that is sparse in one basis need not be sparse in another.
-
Low-rankness: One can view low-rankness as a special case of sparsity since a matrix is low rank if and only if the vector of its singular values is sparse in the canonical basis. Stated differently, a matrix is low rank if its induced linear mapping only acts on low-dimensional subspaces of the ambient input and output space. This second characterization shows that, in contrast to sparsity, low-rankness is not representation dependent. Furthermore, one can generalize the concept to higher dimensional linear operators as well, e.g., tensor spaces.
-
Toeplitz structure: Just like low-rankness, Toeplitz structure is a special type of sparsity that requires the points in space to have matrix structure. Its low-dimensional structure lies in the fact that only (2n − 1) parameters are necessary to characterize an \({\mathbb R}^{n\times n}\) Toeplitz matrix. In contrast to low-rankness, Toeplitz structure is representation dependent. Nevertheless, Toeplitz matrices naturally appear as covariance matrices of stationary random processes, i.e., if the covariance of two events does not depend on their location in time but only on their distance in time.
For further discussion and literature on the subject, we refer the interested reader to [22].
3.4 Estimation from Quantized Samples
All results stated above assume that the sample vectors X k are real-valued, i.e., one has access to infinite precision representations of the samples. In applications, this assumption is not always fulfilled. Especially in signal processing, samples are collected via sensors and, hence, need to be quantized to finitely many bits before they can be digitally transmitted and processed. Engineers have been examining the influence of coarse quantization on correlation and covariance estimation for decades, e.g., [2, 14, 30, 41, 54]. However, in contrast to classical covariance estimation from unquantized samples, so far only asymptotic estimation guarantees have been derived in the quantized setting. To improve our understanding of the effect of quantization on covariance estimation, we analyzed two memoryless one-bit quantization schemes in our recent work [16]. We call a quantizer memoryless if it quantizes each entry of X k independently of all remaining entries. This is fundamentally different from feedback systems, e.g., Σ Δ-quantization [4, 55], and of particular interest for large-scale applications like massive MIMO where the entries of X k correspond to inputs from different antennas, cf. Sect. 3.2. In the following, we provide a detailed discussion of the models and results in [16].
3.4.1 Sign Quantization
In the first setting, we assume that we receive one-bit quantized samples
\[ \operatorname{sign}(\mathbf X^k) \in \{-1, 1\}^p \qquad (3.9) \]
for k ∈ [n], instead of X k itself. (Recall that we apply scalar functions like sign entry-wise to vectors and matrices.) Since the quantizer sign is scale-invariant, i.e., sign(z) = sign(Dz) for any diagonal matrix \(\mathbf D\in {\mathbb R}^{p\times p}\) with strictly positive entries and \(\mathbf z \in {\mathbb R}^p\), we can only hope to recover the correlation matrix of the distribution, i.e., a normalized version of Σ with entries \(\big [ \tfrac {\Sigma _{i,j}}{\sqrt {\Sigma _{i,i}} \sqrt {\Sigma _{j,j}}} \big ]_{i,j}\). We thus assume that \(\mathbf X \sim \mathcal {N}(\boldsymbol 0,\boldsymbol \Sigma )\), where Σ has ones on its diagonal.
It has been known for decades that the estimator
\[ \tilde{\boldsymbol \Sigma}_n = \sin\Big( \frac{\pi}{2} \cdot \frac{1}{n} \sum_{k=1}^n \operatorname{sign}(\mathbf X^k)\, \operatorname{sign}(\mathbf X^k)^\top \Big) \qquad (3.10) \]
is well suited to approximate Σ from the quantized samples, cf. [30]. Note that the specific form of \(\tilde {\boldsymbol \Sigma }_n\) is motivated by Grothendieck’s identity (see, e.g., [61, Lemma 3.6.6]), also known as “arcsin-law” in the engineering literature [30, 60], which implies that
\[ \mathbb E\big[ \operatorname{sign}(\mathbf X)\, \operatorname{sign}(\mathbf X)^\top \big] = \tfrac{2}{\pi} \arcsin(\boldsymbol \Sigma) \qquad (3.11) \]
if \(\mathbf X \sim \mathcal N (\boldsymbol 0,\boldsymbol \Sigma )\). Applying the strong law of large numbers and the continuity of the sine function to (3.10), one easily obtains with (3.11) that \(\tilde {\boldsymbol \Sigma }_n\) is a consistent estimator of Σ.
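A sketch of the sign estimator, assuming the form suggested by the arcsin-law, i.e., applying sin entry-wise to π∕2 times the empirical second moment of the sign-quantized samples (the equicorrelated Σ and all parameter choices are illustrative):

```python
import numpy as np

def sign_estimator(X):
    """Arcsin-law estimator: sin applied entry-wise to (pi/2) times the sample
    second moment of the sign-quantized samples (one common form of (3.10))."""
    n = X.shape[0]
    S = np.sign(X)
    return np.sin(np.pi / 2.0 * (S.T @ S) / n)

rng = np.random.default_rng(4)
p, n = 4, 50000
rho = 0.6
Sigma = (1 - rho) * np.eye(p) + rho * np.ones((p, p))   # unit diagonal, equicorrelated
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
Sigma_tilde = sign_estimator(X)
err = np.max(np.abs(Sigma_tilde - Sigma))
```

By construction, the diagonal of the estimate equals one exactly, reflecting the normalization assumption on Σ.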
The two key quantities for understanding the non-asymptotic performance of \(\tilde {\boldsymbol \Sigma }_n\) are Γ and
Furthermore, we define
for symmetric \(\mathbf Z \in {\mathbb R}^{p\times p}\).
Theorem 3.7 ([16, Theorem 1])
There exist constants c 1, c 2 > 0 such that the following holds. Let \(\mathbf X \sim \mathcal N(\boldsymbol 0,\boldsymbol \Sigma )\) with Σ i,i = 1, for i ∈ [p], and \(\mathbf X^1,...,\mathbf X^n \overset {\mathrm {d}}{\sim } \mathbf X\) be i.i.d. samples of X . Let M ∈ [0, 1]p×p be a fixed symmetric mask. Then, for all t ≥ 0 with \(n \ge c_1 \log ^2(p) (\log (p) + t)\) , the biased sign estimator \(\tilde {\boldsymbol \Sigma }_n\) fulfills with probability at least \(1 - 2e^{-c_2 t}\)
The right-hand side in Theorem 3.7 (for convenience, we only consider the case M = 1 here) can be trivially estimated to get
which is up to the additional dependence on \(\cos {}(\arcsin (\boldsymbol \Sigma ))\) comparable to the error bound in (3.2) for \(\hat { \boldsymbol \Sigma }_n\). This is remarkable since \(\tilde {\boldsymbol \Sigma }_n\) accesses considerably less information on the samples than \(\hat { \boldsymbol \Sigma }_n\).
Theorem 3.7 even suggests that for strongly correlated distributions of X, i.e., Σ ≈1, the dominant first term on the right-hand side of (3.12) vanishes. In other words, the bound in (3.12) predicts \(\tilde {\boldsymbol \Sigma }_n\) to outperform \(\hat {\boldsymbol \Sigma }_n\) if the entries of X strongly correlate. Numerical experiments from [16] confirm this counter-intuitive fact, cf. Fig. 3.2. A possible explanation is that by construction, \(\tilde {\boldsymbol \Sigma }_n\) implicitly uses the assumption that Σ has ones on its diagonal which is not provided to \(\hat {\boldsymbol \Sigma }_n\).
Furthermore, a corresponding lower bound on the second moment of the estimation error shows that the unconventional term ∥σ(M ⊙A)∥ is factual and not an artifact of the proof, cf. [16, Proposition 14].
3.4.2 Dithered Quantization
The results of Sect. 3.4.1 are restricted to the estimation of correlation matrices of Gaussian distributions. Both limitations stem from the chosen quantization model: first, (3.9) is blind to the rescaling of variances and, second, Grothendieck’s identity only holds for Gaussian distributions. Nevertheless, by introducing a dither in the one-bit quantizer in (3.9), we can fully estimate the covariance matrix of general sub-Gaussian distributions. Dithering means adding artificial random noise (with a suitably chosen distribution) to the samples before quantizing them to improve reconstruction from quantized observations, cf. [25, 26, 52]. In the context of one-bit compressed sensing, the effect of dithering was recently rigorously analyzed in [3, 17, 18, 32, 36], see also the survey [15].
To be precise, we require two bits per entry of each sample vector where each bit is dithered by an independent uniformly distributed dither, i.e., we are given
where the dithering vectors \(\boldsymbol \tau ^1,\bar {\boldsymbol \tau }^1,\ldots ,\boldsymbol \tau ^n,\bar {\boldsymbol \tau }^n\) are independent and uniformly distributed in [−λ, λ]p, with λ > 0 to be specified later. From the quantized observations in (3.13), we construct the estimator
where
Theorem 3.8 ([16, Theorem 3])
Let X be a mean-zero, K-sub-Gaussian vector with covariance matrix \(\mathbb E [ \mathbf X \mathbf X^\top ] = \boldsymbol \Sigma \) . Let \(\mathbf X^1,...,\mathbf X^n \overset {\mathrm {d}}{\sim } \mathbf X\) be i.i.d. samples of X . Let M ∈ [0, 1]p×p be a fixed symmetric mask. If \(\lambda ^2 \gtrsim _K \log (n) \| \boldsymbol \Sigma \|{ }_{\infty }\) , then with probability at least 1 − e −t,
In particular, if \(\lambda ^2 \approx _K \log (n) \|\boldsymbol \Sigma \|{ }_{\infty }\) , then
The error bound (3.16) coincides (up to different logarithmic factors) with the best known estimate for the masked sample covariance matrix in Theorem 3.1, even though the sample covariance matrix requires direct access to the samples X^k. This performance, however, heavily depends on the choice of λ, cf. [16]. Furthermore, it should be mentioned that there are cases in which the dithered estimator performs significantly worse than the sample covariance matrix. For simplicity, consider the case M = 1. If the samples X^k are Gaussian, then [37] shows that
whereas (3.16) yields
via tail integration. Since Tr(Σ) ≤ p∥Σ∥∞, the second estimate is worse in general. Numerical experiments in [16] have shown that this difference is not an artifact of the proof. Simply put, \(\hat {\boldsymbol \Sigma }_n\) and \(\tilde {\boldsymbol \Sigma }_n^{\text{dith}}\) perform similarly if Σ has a constant diagonal, whereas \(\hat {\boldsymbol \Sigma }_n\) performs significantly better whenever Tr(Σ) ≪ p∥Σ∥∞.
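The gap between the two quantities is easy to see numerically. In this small sketch (ours), a spiked diagonal covariance makes Tr(Σ) much smaller than p∥Σ∥∞:

```python
import numpy as np

p = 100
# spiked covariance: one dominant variance, small bulk
Sigma = np.diag([1.0] + [0.01] * (p - 1))

trace_term = np.trace(Sigma)             # enters the Gaussian bound from [37]
worst_case = p * np.max(np.abs(Sigma))   # p * ||Sigma||_inf enters (3.16)
# trace_term is approximately 1.99, while worst_case equals 100.0
```

For a constant diagonal, by contrast, Tr(Σ) = p∥Σ∥∞ exactly, matching the observation that both estimators then perform similarly.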
Theorem 3.8 can be extended to heavier-tailed random vectors. This, however, requires a larger choice of λ and thus more samples to reach the same error. For a sub-exponential random vector X, one would already need \(\lambda ^2 \gtrsim \log (n)^2 \cdot \max _{i \in [p]} \| X_i \|{ }_{\psi _1}^2\). The dependence of λ on n, both in the latter statement and in Theorem 3.8, can be observed in numerical experiments [16] as well.
Let us finally mention that the quantized estimators in (3.10) and (3.14) are not necessarily positive semi-definite as one expects from covariance matrices. In applications, one would thus replace both estimators by their projection onto the cone of positive semi-definite matrices, which is efficiently computed via the singular value decomposition [8, Section 8.1.1]. The obtained estimates also apply to the projected estimators since convex projections are 1-Lipschitz.
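For completeness, here is a short sketch (ours) of that projection for a Hermitian input; in this case, clipping negative eigenvalues in an eigendecomposition coincides with the SVD-based computation mentioned above.

```python
import numpy as np

def project_psd(A):
    """Project a Hermitian matrix onto the PSD cone by clipping
    negative eigenvalues."""
    A = (A + A.conj().T) / 2                 # symmetrize to guard against noise
    w, V = np.linalg.eigh(A)
    return (V * np.maximum(w, 0)) @ V.conj().T

A = np.array([[2.0, 0.0],
              [0.0, -1.0]])
P = project_psd(A)                           # the -1 eigenvalue is clipped to 0
```

Since this projection is 1-Lipschitz with respect to the Frobenius norm, the error bounds stated above carry over to the projected estimators unchanged.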
3.5 The Underlying Structures of Massive MIMO Covariance Estimation
With the theoretical insights on covariance estimation surveyed above in mind, let us return to the massive MIMO setup of Sect. 3.2. To understand the intrinsic structure of Σ_h in (3.4), and the approaches taken in the engineering literature to leverage it, we have to dive deeper into the underlying model and its physical interpretation. We thus follow the notational conventions of the engineering literature in this section. Recall that the number of antennas M can be identified with the ambient dimension p, the number of time–frequency resources N with the number of samples n, and the channel vectors h(s) correspond to the samples X^k.
Under the assumptions stated in the beginning of Sect. 3.2, i.e., the BS is equipped with M antennas in a ULA, the channel vector h(s) can be written explicitly as
for s ∈ [N]. Here, \(\xi = \tfrac {\sin {}(\theta )}{\sin {}(\theta _{\text{max}})}\) denotes the normalized angle of arrival (AoA), with \(\theta _{\text{max}} \in [0,\tfrac {\pi }{2}]\) being the maximum array angular aperture (cf. Fig. 3.1), the vector \(\mathbf a(\xi ) \in \mathbb C^M\) denotes the corresponding array response at the BS, and the channel gain ρ(ξ, s) is a complex Gaussian process with zero mean. By assuming the antenna spacing to be \(d = \tfrac {\lambda }{2}\), where \(\lambda = \tfrac {c_0}{f_0}\) denotes the wavelength with c_0 being the speed of light and f_0 the carrier frequency, we obtain
where j denotes the imaginary unit. With the additional assumption of wide sense stationary uncorrelated scattering (WSSUS), the second-order statistics of the Gaussian process ρ(ξ, s) are time invariant and uncorrelated across AoAs so that
where \(\gamma \colon [-1,1] \to {\mathbb R}_{\ge 0}\) is the non-negative density representing the angular scattering function (ASF), possibly containing Dirac components, and δ denotes the Dirac delta function. In particular, this implies that
Building upon this explicit representation of h(s) and structural assumptions on γ, one can refine the estimate obtained from the sample covariance matrix of y defined in (3.3).
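These relations are easy to prototype. The sketch below (our own illustration, not code from [35]) uses the common d = λ/2 convention [a(ξ)]_m = e^{jπmξ}, m = 0, …, M−1, for the array response (the chapter's indexing may differ by a global phase) and approximates the resulting integral representation of Σ_h by a Riemann sum over a purely continuous ASF; for the uniform ASF γ ≡ 1/2, the result is (numerically) the identity matrix.

```python
import numpy as np

def array_response(xi, M):
    """ULA array response for normalized AoA xi in [-1, 1] with
    antenna spacing d = lambda/2."""
    return np.exp(1j * np.pi * np.arange(M) * xi)

def covariance_from_asf(gamma, M, grid_size=2000):
    """Approximate Sigma_h as the integral of gamma(xi) a(xi) a(xi)^H
    over [-1, 1] by a Riemann sum; gamma is a plain density here (spike
    components of the ASF would be added as separate rank-one terms)."""
    xi_grid = np.linspace(-1.0, 1.0, grid_size, endpoint=False)
    dxi = 2.0 / grid_size
    Sigma = np.zeros((M, M), dtype=complex)
    for xi in xi_grid:
        a = array_response(xi, M)
        Sigma += gamma(xi) * np.outer(a, a.conj()) * dxi
    return Sigma

# uniform ASF: the antenna entries decorrelate and Sigma_h is the identity
Sigma_h = covariance_from_asf(lambda xi: 0.5, M=8)
```

Note that Σ_h inherits its Hermitian Toeplitz structure directly from this representation, since [a(ξ)a(ξ)^H]_{m,m′} depends only on m − m′.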
A Hands-on Approach
In [35], we choose the following approach. First note that by (3.17) the channel covariance matrix belongs to the set
where \(\mathcal A\) denotes the class of typical ASFs in wireless propagation. If one assumes sparse scattering propagation, the set \(\mathcal A\) consists of sparse ASFs. In particular, we assume that γ(ξ) can be decomposed as the sum of a discrete spike component γ d (modeling the power received from line-of-sight (LOS) paths and narrow scatterers) and a continuous component γ c (modeling the power received from wide scatterers). Mathematically, we can write
where γ d consists of r ≪ M Dirac deltas with AoAs ξ 1, …, ξ r and strengths c 1, …, c r > 0 corresponding to r specular propagation elements. Furthermore, by sparsity assumptions on γ, we have that meas(γ c) ≪meas([−1, 1]), where meas(γ c) denotes here the measure of the support of γ c. Combining (3.17) and (3.18), we decompose the channel covariance matrix as
where \(\boldsymbol \Sigma _{\mathbf h}^d\) is rank-r and positive semi-definite and \(\boldsymbol \Sigma _{\mathbf h}^c\) is full rank and positive semi-definite with few dominant singular values. We can approximate Σ h now in three consecutive steps:
-
(i)
Spike Location Estimation for γ d: Applying the MUltiple SIgnal Classification (MUSIC) algorithm [58], we estimate the AoAs ξ k of the spike component γ d from the noisy samples y(1), …, y(N), cf. [35, Theorem 1]. Since this step is fairly standard, we do not discuss the details here but refer the interested reader to [35]. Let us only mention that the number of spikes is estimated by the number of dominant eigenvalues of \(\boldsymbol \Sigma _{\mathbf y} := \mathbb E [ \mathbf y(s) \mathbf y(s)^{\mathsf H} ]\) (where one can naturally assume a corresponding gap in the spectrum since the power received via LOS paths in γ d dominates the power received from wide scatterers in γ c). As a result, we obtain estimated spike locations \(\hat \xi _k\), for \(k \in [\hat r]\), and define an approximation of γ d
$$\displaystyle \begin{aligned} \tilde\gamma_d(\xi) = \sum_{k=1}^{\hat r} \tilde c_k \, \delta(\xi - \hat\xi_k), \end{aligned} $$where the coefficients \(\tilde c_1,\dots , \tilde c_{\hat r} \ge 0\) still need to be estimated.
-
(ii)
Sparse Dictionary-Based Method: We approximate the continuous component γ c over a finite dictionary of densities \(\mathcal G_c := \{ \psi _i \colon [-1,1] \to {\mathbb R}, \; i \in [G] \}\) that are suitably chosen, e.g., Gaussian, Laplacian, or rectangular kernels, cf. Fig. 3.3. We hence define
$$\displaystyle \begin{aligned} \tilde\gamma_c(\xi) = \sum_{i=1}^{G} \tilde b_i \psi_i(\xi), \end{aligned} $$where only the coefficients \(\tilde b_1,\dots , \tilde b_G \ge 0\) need to be estimated.
-
(iii)
Non-Negative Least Square (NNLS) estimator: Collecting the coefficients in a single vector \(\mathbf u = (\tilde b_1,\dots ,\tilde b_G,\tilde c_1,\dots , \tilde c_{\hat r})^\top \in {\mathbb R}_{\ge 0}^{G + \hat r}\) and recalling (3.19), we define our coefficient-dependent estimate of the channel covariance
$$\displaystyle \begin{aligned} \boldsymbol \Sigma_{\mathbf h} (\mathbf u) = \sum_{k=1}^{\hat r} \tilde c_k \, \mathbf a (\hat\xi_k) \mathbf a (\hat\xi_k)^{\mathsf H} + \sum_{i=1}^G \tilde b_i \int_{-1}^1 \psi_i(\xi) \, \mathbf a (\xi) \mathbf a (\xi)^{\mathsf H} \; \mathrm d \xi =: \sum_{i=1}^{G+\hat r} u_i \mathbf S_i, \end{aligned} $$(3.20)where
$$\displaystyle \begin{aligned} \mathbf S_i = \begin{cases} \int_{-1}^1 \psi_i(\xi) \, \mathbf a (\xi) \mathbf a (\xi)^{\mathsf H} \; \mathrm d \xi & \text{if } 1 \le i \le G \\ \mathbf a (\hat\xi_{i-G}) \mathbf a (\hat\xi_{i-G})^{\mathsf H} & \text{if } G < i \le G + \hat r. \end{cases} \end{aligned} $$All that remains is to determine the coefficient vector u. Since Σ y = Σ h + N 0 I, we can do so by fitting (3.20) to the sample covariance matrix \(\hat {\boldsymbol \Sigma }_{\mathbf y}\) of y(1), …, y(N), i.e.,
$$\displaystyle \begin{aligned} \mathbf u^* = \mathrm{arg}\min_{\mathbf u \ge \boldsymbol 0} \Big\| \hat {\boldsymbol \Sigma}_{\mathbf y} - \sum_{i=1}^{G+\hat r} u_i \mathbf S_i - N_0 \mathbf I \Big\|{}_F^2. \end{aligned} $$(3.21)Since Σ h is Hermitian Toeplitz, one can incorporate the structure in (3.21) by replacing \(\hat { \boldsymbol \Sigma }_{\mathbf h} = \hat {\boldsymbol \Sigma }_{\mathbf y} - N_0 \mathbf I\) with its projection \(\tilde { \boldsymbol \Sigma }_{\mathbf h}\) onto the space of Hermitian Toeplitz matrices (which can be done by averaging the diagonals as in (3.7)). Denoting the first column of \(\tilde { \boldsymbol \Sigma }_{\mathbf h}\) by \(\tilde { \boldsymbol \sigma } \in {\mathbb C}^M\) and collecting the first columns of the matrices S i in a matrix \(\tilde { \mathbf S } \in {\mathbb C}^{M \times (G+\hat r)}\), we may instead solve
$$\displaystyle \begin{aligned} \mathbf u^* = \mathrm{arg}\min_{\mathbf u \ge \boldsymbol 0} \Big\| \mathbf W ( \tilde{\mathbf S} \mathbf u - \tilde{ \boldsymbol \sigma }) \Big\|{}_F^2, \end{aligned} $$(3.22)where \(\mathbf W = \mathrm {diag} \big ( (\sqrt {M}, \sqrt {2(M-1)},\sqrt {2(M-2)},...,\sqrt {2})^\top \big )\) is a weight matrix compensating the averaging process.
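To make step (iii) concrete, here is a sketch (ours, based on SciPy's real-valued NNLS solver) of the diagonal averaging and the weighted fit (3.22); the complex system is handled by stacking real and imaginary parts, and the two-atom dictionary at the end is a hypothetical toy example.

```python
import numpy as np
from scipy.optimize import nnls

def toeplitz_first_column(A):
    """Average the diagonals of A: this yields the first column of the
    projection of A onto the Hermitian Toeplitz matrices (cf. (3.7))."""
    M = A.shape[0]
    return np.array([np.mean(np.diagonal(A, -r)) for r in range(M)])

def fit_coefficients(S_cols, sigma_tilde):
    """Solve the weighted non-negative least squares problem (3.22),
    where S_cols holds the first columns of the dictionary matrices S_i."""
    M = S_cols.shape[0]
    # W = diag(sqrt(M), sqrt(2(M-1)), ..., sqrt(2))
    w = np.sqrt(np.concatenate(([float(M)], 2.0 * np.arange(M - 1, 0, -1))))
    A = w[:, None] * S_cols
    b = w * sigma_tilde
    # nnls is real-valued, so stack real and imaginary parts
    u, _ = nnls(np.vstack([A.real, A.imag]),
                np.concatenate([b.real, b.imag]))
    return u

# toy (hypothetical) dictionary with two atoms
S_cols = np.array([[1.0, 1.0],
                   [1.0j, 0.0],
                   [0.0, 1.0],
                   [0.0, 0.0]])
sigma = 0.3 * S_cols[:, 0] + 0.7 * S_cols[:, 1]
u = fit_coefficients(S_cols, sigma)      # recovers the weights 0.3 and 0.7
```

Working with first columns only is what makes the Toeplitz structure pay off: the fit involves M-dimensional vectors instead of M × M matrices.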
A Hands-on Approach: Empirical Evaluation
Let us empirically compare the NNLS estimator to the sample covariance matrix right away. We consider a ULA with M = 128 antennas, where the spacing between two consecutive antenna elements is set to \(d = \frac {\lambda }{2}\). We produce random ASFs in the following general format:
where we set the number of delta, rectangular, and Gaussian functions to be r := 2, n r := 2, and n g := 2, respectively. The spike locations are chosen uniformly at random from [−1, 1], i.e., ξ i ∼Unif([−1, 1]) for i ∈ [2]. The rectangular functions are defined as
where μ_1 ∼ Unif([−1, 0]), μ_2 ∼ Unif([0, 1]), and σ_j ∼ Unif([0.1, 0.3]), for j ∈ [2]. The Gaussian functions \(\mathtt {Gaussian}_{\mu _k,\sigma _k}\) are densities of \(\mathcal N(\mu _k,\sigma _k)\), where μ_k ∼ Unif([−0.7, 0.7]) and σ_k ∼ Unif([0.03, 0.04]), for k ∈ [2]. Moreover, α := 0.5 represents the power fraction of the discrete spikes. The constant \(Z = \int _{-1}^1 \gamma _c(\xi ) d\xi \) normalizes γ_c to unit measure. The signal-to-noise ratio (SNR) is set to 10 dB.
In addition to the sample covariance, we compare our NNLS estimator to sparse iterative covariance-based estimation (SPICE) [57]. This method also exploits the ASF domain to fit a covariance matrix. Note that SPICE can only be applied with Dirac delta dictionaries and that it does not include a step of spike support detection as in our method.
Denoting a generic covariance estimate as \(\bar {\boldsymbol {\Sigma }}\), we consider two metrics to evaluate the estimation quality. The first metric, the normalized Frobenius-norm error, is defined as \(E_{\text{NF}} = \frac {\Vert \boldsymbol {\Sigma }_{\mathbf {h}} - \bar {\boldsymbol {\Sigma }} \Vert _{\mathsf {F}}}{\Vert \boldsymbol {\Sigma }_{\mathbf {h}} \Vert _{\mathsf {F}}}\). The second metric, the power efficiency, evaluates the similarity of dominant subspaces between the estimated and true matrices, which is an important factor in various applications of massive MIMO such as user grouping and group-based beamforming. Specifically, let d ∈ [M] denote a subspace dimension parameter, and let \({\mathbf {U}}_d \in \mathbb {C}^{M\times d}\) and \(\bar {\mathbf {U}}_d\in \mathbb {C}^{M\times d}\) be the d dominant eigenvectors of Σ h and \(\bar {\boldsymbol {\Sigma }}\) corresponding to their largest d eigenvalues, respectively. Then, the power efficiency based on d is defined as \(E_{\text{PE}}(d) = 1 - \frac {\langle \boldsymbol {\Sigma }_{\mathbf {h}},\bar {\mathbf {U}}_d\bar {\mathbf {U}}_d^{\sf {H}}\rangle }{\langle \boldsymbol {\Sigma }_{\mathbf {h}},{\mathbf {U}}_d{\mathbf {U}}_d^{\sf {H}}\rangle }\). Note that E PE(d) ∈ [0, 1] where a value closer to 0 means that more power is captured by the estimated d-dominant subspace.
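The two metrics translate directly into code; the following sketch (ours) implements both definitions verbatim.

```python
import numpy as np

def normalized_frobenius_error(Sigma, Sigma_hat):
    """E_NF = ||Sigma - Sigma_hat||_F / ||Sigma||_F."""
    return np.linalg.norm(Sigma - Sigma_hat) / np.linalg.norm(Sigma)

def power_efficiency(Sigma, Sigma_hat, d):
    """E_PE(d) = 1 - <Sigma, U_hat U_hat^H> / <Sigma, U U^H>, where the
    columns of U (resp. U_hat) are the d dominant eigenvectors of
    Sigma (resp. Sigma_hat)."""
    def dominant(A, d):
        _, V = np.linalg.eigh(A)          # eigenvalues in ascending order
        return V[:, -d:]
    U, U_hat = dominant(Sigma, d), dominant(Sigma_hat, d)
    num = np.real(np.trace(Sigma @ U_hat @ U_hat.conj().T))
    den = np.real(np.trace(Sigma @ U @ U.conj().T))
    return 1.0 - num / den

# sanity check: a perfect estimate captures all dominant power, so E_PE = 0
Sigma = np.diag([3.0, 2.0, 1.0])
E = power_efficiency(Sigma, Sigma, d=2)
```

Note that E_PE is invariant under any change of basis within the dominant subspace, which is why it is the right metric for beamforming applications where only the subspace matters.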
SPICE and the proposed NNLS estimator are applied with a dictionary of G = 2M Dirac deltas for the continuous part \(\mathcal {G}_c\). The resulting Frobenius-norm error and power efficiency are depicted in Fig. 3.4. All results are averaged over 20 random ASFs and 200 random channel realizations per ASF. The proposed NNLS method outperforms the sample covariance matrix and SPICE in both metrics. Finally, a similar outcome can be observed for smaller sample sizes, e.g., N∕M = 0.125, which occur naturally in massive MIMO.
3.6 Conclusion
The present chapter shows that, over the last decade, good progress has been made in understanding the influence of the intrinsic structure of covariance matrices on the non-asymptotic performance of suitably designed estimators. As we have seen, estimators with strong guarantees are available for sparse, low-rank, and Toeplitz covariance matrices. At the same time, the chapter illustrates that practitioners still continue to tweak the basic sample covariance matrix using their specific knowledge of the application at hand, seemingly unaware of the progress in theory. We hope that this chapter helps mathematicians and practitioners alike to gain an overview of recent theoretical developments on structured and quantized covariance estimation, and that it motivates mathematicians to look deeper into the underlying physical models of concrete applications in order to better understand the structures of interest. Furthermore, our recent theoretical progress on quantized covariance estimation suggests that reliable reconstruction of the covariance matrix is possible even under heavy loss of information during sampling. The use of coarse quantization might thus lead to a considerable increase in capacity in massive MIMO systems and related applications.
References
Adamczak, R.: A note on the Hanson-Wright inequality for random vectors with dependencies. Electron. Commun. Probab. 20 (2015)
Bar-Shalom, O., Weiss, A.J.: DOA estimation using one-bit quantized measurements. IEEE Trans. Aerospace Electron. Syst. 38(3), 868–884 (2002)
Baraniuk, R.G., Foucart, S., Needell, D., Plan, Y., Wootters, M.: Exponential decay of reconstruction error from binary measurements of sparse signals. IEEE Trans. Inform. Theory 63(6), 3368–3385 (2017)
Benedetto, J.J., Powell, A.M., Yilmaz, O.: Sigma-delta quantization and finite frames. IEEE Trans. Inform. Theory 52(5), 1990–2005 (2006)
Bickel, P.J., Levina, E.: Covariance regularization by thresholding. Ann. Stat. 36(6), 2577–2604 (2008)
Bickel, P.J., Levina, E.: Regularized estimation of large covariance matrices. Ann. Stat. 36(1), 199–227 (2008)
Boucheron, S., Lugosi, G., Massart, P.: Concentration inequalities: a nonasymptotic theory of independence. Oxford University Press, Oxford (2013)
Boyd, S., Vandenberghe, L.: Convex Optimization. Cambridge University Press, Cambridge (2004)
Brookes, M.J., Vrba, J., Robinson, S.E., Stevenson, C.M., Peters, A.M., Barnes, G.R., Hillebrand, A., Morris, P.G.: Optimising experimental design for MEG beamformer imaging. Neuroimage 39(4), 1788–1802 (2008)
Cai, T.T., Ren, Z., Zhou, H.H.: Optimal rates of convergence for estimating Toeplitz covariance matrices. Probab. Theory Related Fields 156(1-2), 101–143 (2013)
Cai, T.T., Zhang, C.H., Zhou, H.H.: Optimal rates of convergence for covariance matrix estimation. Ann. Stat. 38(4), 2118–2144 (2010)
Catoni, O.: Challenging the empirical mean and empirical variance: a deviation study. In: Annales de l’IHP Probabilités et statistiques, vol. 48, pp. 1148–1185 (2012)
Chen, R.Y., Gittens, A., Tropp, J.A.: The masked sample covariance estimator: an analysis using matrix concentration inequalities. Inform. Inference J. IMA 1(1), 2–20 (2012)
Choi, J., Mo, J., Heath, R.W.: Near maximum-likelihood detector and channel estimator for uplink multiuser massive MIMO systems with one-bit ADCs. IEEE Trans. Commun. 64(5), 2005–2018 (2016)
Dirksen, S.: Quantized compressed sensing: a survey. In: Compressed Sensing and Its Applications: Third International MATHEON Conference 2017, pp. 67–95. Applied and Numerical Harmonic Analysis. Birkhäuser, Cham (2019)
Dirksen, S., Maly, J., Rauhut, H.: Covariance estimation under one-bit quantization. arXiv preprint arXiv:2104.01280 (2021)
Dirksen, S., Mendelson, S.: Robust one-bit compressed sensing with partial circulant matrices. arXiv:1812.06719 (2018)
Dirksen, S., Mendelson, S.: Non-Gaussian hyperplane tessellations and robust one-bit compressed sensing. J. Eur. Math. Soc. arXiv:1805.09409 (2021)
El Karoui, N.: Operator norm consistent estimation of large-dimensional sparse covariance matrices. Ann. Stat. 36(6), 2717–2756 (2008)
Eldar, Y.C., Li, J., Musco, C., Musco, C.: Sample efficient Toeplitz covariance estimation. In: Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 378–397. SIAM (2020)
Fazel, M.: Matrix rank minimization with applications. Ph.D. Thesis, Stanford University (2002)
Foucart, S., Rauhut, H.: A Mathematical Introduction to Compressive Sensing. Applied and Numerical Harmonic Analysis. Birkhäuser, Basel (2013)
Furrer, R., Bengtsson, T.: Estimation of high-dimensional prior and posterior covariance matrices in Kalman filter variants. J. Multivariate Anal. 98(2), 227–255 (2007)
Goldsmith, A., Jafar, S.A., Jindal, N., Vishwanath, S.: Capacity limits of MIMO channels. IEEE J. Selected Areas Commun. 21(5), 684–702 (2003)
Gray, R.M., Neuhoff, D.L.: Quantization. IEEE Trans. Inform. Theory 44(6), 2325–2383 (1998)
Gray, R.M., Stockham, T.G.: Dithered quantizers. IEEE Trans. Inform. Theory 39(3), 805–812 (1993)
Haghighatshoar, S., Caire, G.: Massive MIMO channel subspace estimation from low-dimensional projections. IEEE Trans. Signal Process. 65(2), 303–318 (2016)
Haghighatshoar, S., Caire, G.: Low-complexity massive MIMO subspace estimation and tracking from low-dimensional projections. IEEE Trans. Signal Process. 66(7), 1832–1844 (2018)
Hubert, M., Rousseeuw, P.J., Van Aelst, S.: High-breakdown robust multivariate methods. Statistical Science, pp. 92–119 (2008)
Jacovitti, G., Neri, A.: Estimation of the autocorrelation function of complex Gaussian stationary processes by amplitude clipped signals. IEEE Trans. Inform. Theory 40(1), 239–245 (1994)
Jerrum, M.R., Valiant, L.G., Vazirani, V.V.: Random generation of combinatorial structures from a uniform distribution. Theor. Comput. Sci. 43, 169–188 (1986)
Jung, H.C., Maly, J., Palzer, L., Stollenwerk, A.: Quantized compressed sensing by rectified linear units. IEEE Trans. Inform. Theory 67(6), 4125–4149 (2021)
Kabanava, M., Rauhut, H.: Masked Toeplitz covariance estimation. arXiv:1709.09377 (2017)
Ke, Y., Minsker, S., Ren, Z., Sun, Q., Zhou, W.X.: User-friendly covariance estimation for heavy-tailed distributions. Stat. Sci. 34(3), 454–471 (2019)
Khalilsarai, M.B., Yang, T., Haghighatshoar, S., Caire, G.: Structured channel covariance estimation from limited samples in massive MIMO. In: IEEE International Conference on Communications (ICC), pp. 1–7 (2020)
Knudson, K., Saab, R., Ward, R.: One-bit compressive sensing with norm estimation. IEEE Trans. Inform. Theory 62(5), 2748–2758 (2016)
Koltchinskii, V., Lounici, K.: Concentration inequalities and moment bounds for sample covariance operators. Bernoulli 23(1), 110–133 (2017)
Krim, H., Viberg, M.: Two decades of array signal processing research: the parametric approach. IEEE Signal Process. Mag. 13(4), 67–94 (1996)
Lawrence, H., Li, J., Musco, C., Musco, C.: Low-rank Toeplitz matrix estimation via random ultra-sparse rulers. In: ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4796–4800. IEEE (2020)
Levina, E., Vershynin, R.: Partial estimation of covariance matrices. Probab. Theory Related Fields 153(3-4), 405–419 (2012)
Li, Y., Tao, C., Seco-Granados, G., Mezghani, A., Swindlehurst, A.L., Liu, L.: Channel estimation and performance analysis of one-bit massive MIMO systems. IEEE Trans. Signal Process. 65(15), 4075–4089 (2017)
Liu, L., Hawkins, D.M., Ghosh, S., Young, S.S.: Robust singular value decomposition analysis of microarray data. Proc. Nat. Acad. Sci. 100(23), 13167–13172 (2003)
Lounici, K.: High-dimensional covariance matrix estimation with missing observations. Bernoulli 20(3), 1029–1058 (2014)
Lu, L., Li, G.Y., Swindlehurst, A.L., Ashikhmin, A., Zhang, R.: An overview of massive MIMO: Benefits and challenges. IEEE J. Selected Topics Signal Process. 8(5), 742–758 (2014)
Marzetta, T.L., Ngo, H.Q.: Fundamentals of massive MIMO. Cambridge University Press, Cambridge (2016)
Mendelson, S., Zhivotovskiy, N.: Robust covariance estimation under L4-L2 norm equivalence. Ann. Stat. 48(3), 1648–1664 (2020)
Minsker, S.: Sub-Gaussian estimators of the mean of a random matrix with heavy-tailed entries. Ann. Stat. 46(6A), 2871–2903 (2018)
Minsker, S., Wei, X.: Robust modifications of U-statistics and applications to covariance estimation problems. Bernoulli 26(1), 694–727 (2020)
Nemirovskij, A.S., Yudin, D.B.: Problem complexity and method efficiency in optimization (1983)
Paulraj, A.J., Gore, D.A., Nabar, R.U., Bolcskei, H.: An overview of MIMO communications-a key to gigabit wireless. Proc. IEEE 92(2), 198–218 (2004)
Recht, B., Fazel, M., Parrilo, P.A.: Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Rev. 52(3), 471–501 (2010)
Roberts, L.: Picture coding using pseudo-random noise. IRE Trans. Inform. Theory 8(2), 145–154 (1962)
Romero, D., Ariananda, D.D., Tian, Z., Leus, G.: Compressive covariance sensing: Structure-based compressive sensing beyond sparsity. IEEE Signal Process. Mag. 33(1), 78–93 (2016)
Roth, K., Munir, J., Mezghani, A., Nossek, J.A.: Covariance based signal parameter estimation of coarse quantized signals. In: 2015 IEEE International Conference on Digital Signal Processing (DSP), pp. 19–23. IEEE (2015)
Schreier, R., Temes, G.C., Norsworthy, S.R.: Delta-Sigma Data Converters: Theory, Design, and Simulation. IEEE Press (1996)
Snyder, D.L., O’Sullivan, J.A., Miller, M.I.: The use of maximum likelihood estimation for forming images of diffuse radar targets from delay-doppler data. IEEE Trans. Inform. Theory 35(3), 536–548 (1989)
Stoica, P., Babu, P., Li, J.: SPICE: a sparse covariance-based estimation method for array processing. IEEE Trans. Signal Process. 59(2), 629–638 (2011)
Stoica, P., Moses, R.L.: Spectral analysis of signals (2005)
Tse, D., Viswanath, P.: Fundamentals of Wireless Communication. Cambridge University Press, Cambridge (2005)
Van Vleck, J.H., Middleton, D.: The spectrum of clipped noise. Proc. IEEE 54(1), 2–19 (1966)
Vershynin, R.: High-Dimensional Probability: An Introduction with Applications in Data Science, vol. 47. Cambridge University Press, Cambridge (2018)
Acknowledgements
All authors acknowledge funding by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) through the project CoCoMIMO funded within the priority program SPP 1798 Compressed Sensing in Information Processing (COSIP).
Appendix: Proof of Theorem 3.6
To prove Theorem 3.6, we need two technical lemmas. In the remaining section, σ always refers to the first column of Σ and \(\hat { \boldsymbol \sigma }\) to the first column of \(\hat {\boldsymbol \Sigma }_n^{\mathrm {Toep}} \).
Lemma 3.1
Under the assumptions of Theorem 3.6 , we have for α ∈ (0, 1) and 0 < u < 1 that
where C > 0 is an absolute constant.
Proof
We proceed similarly to [33]. First note that, for all k ∈ [n] and r ∈ [αp], we can write
where the mask M_r is defined by \([M_r]_{i,j} = \tfrac {1}{(p+1)-r}\) if j − i = r − 1 and [M_r]_{i,j} = 0 otherwise, i.e., only the r-th co-diagonal of M_r is non-zero. By using a version of the Hanson–Wright inequality for random vectors with the convex concentration property [1], we get that
which, by integration, leads to
for any q ≥ 1. The random variables \(Z_i^r\) are thus sub-gamma with variance parameter \(\nu = 16K^4 (C\| \mathbf M_r \|{ }_{F}^2 + C^2\| \mathbf M_r \|{ }^2) \le CK^4 \| \mathbf M_r \|{ }_{F}^2\) and scale parameter γ = 4CK 2∥M r∥2 [7, Theorem 2.3]. By independence, we get for all \(0 < \mu < \tfrac {1}{\gamma }\)
(and the same holds for \(-Z_i^r\)) such that \(\sum _{i=1}^{n} Z_i^r\) is sub-gamma with variance parameter νn and scale parameter γ [7, Chapter 2.4]. Consequently,
for any u > 0 [7, Chapter 2.4]. Recalling (3.24) and noting that \(\| \mathbf M_r \|{ }_{F}^2 = \| \mathbf M_r \| = \frac {1}{(p+1)-r}\) yield with the choice \(u = \min \big \{ \tfrac {1}{C^2K^4},\tfrac {1}{CK^2} \big \} ((p+1)-r)n \tilde {u}\) that
A union bound over r ∈ [αp] and the bound r ≤ αp conclude the proof. □
The second lemma follows along the lines of [5, Theorem 1].
Lemma 3.2
Under the assumptions of Theorem 3.6 , assume in addition that Σ has a bandwidth of at most αp, i.e., \(\mathbb B_{\alpha p} (\boldsymbol \Sigma ) = \boldsymbol \Sigma \) and thus supp(σ) ⊂ [αp] and that
for some γ ∈ (0, 1). Then,
Proof
For convenience, let us abbreviate \(\tilde {\boldsymbol \Sigma } := \mathbb B_{\alpha p} (\hat {\boldsymbol \Sigma }_n^{\mathrm {Toep}} )\) and denote its first column by \(\tilde {\boldsymbol \sigma }\). We write
Since \(\boldsymbol \Sigma \in \mathcal {U}^{\mathrm {Toep}} (q,s,M)\),
and Gershgorin’s disc theorem imply
Moreover,
First recall that by assumption supp(σ) ⊂ [αp] and \(\mathrm {supp}(\tilde {\boldsymbol \sigma }) \subset [\alpha p]\). Hence, using the observation that \(\tilde {\sigma }_r = \hat {\sigma }_r\), for r ≤ αp, and
we may estimate with (3.25) and \(\boldsymbol \Sigma \in \mathcal {U}^{\mathrm {Toep}} (q,s,M)\)
Furthermore,
By (3.26), we know that (V ) ≤ τ 1−q s. Furthermore, we get that
where we defined \(N_i(1-\gamma ) := \sum _{j=1}^p \chi _{\{ |\tilde {\Sigma }_{i,j} - \Sigma _{i,j}| > (1-\gamma ) \tau \}}\) and reused the bound on (III) for the second term (replacing τ with γτ in the summation). Since we have by (3.25) that N i(1 − γ) = 0, for i ∈ [p], we get that (IV ) ≲γ sτ 1−q. Hence,
Finally, note that by (3.27) and \(\boldsymbol \Sigma \in \mathcal {U}^{\mathrm {Toep}} (q,s,M)\),
Combining the bounds for (I), (II), and (III) yields the claim. □
Proof of Theorem 3.6
Note that
By Lemma 3.1, we get with probability at least 1 − (2αp)−(c−1) that
where c > 1 and \(1-\gamma = \frac {1}{\sqrt {2}}\). The claim now follows by applying Lemma 3.2 to the first term on the right-hand side of (3.28). □
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this chapter
Maly, J., Yang, T., Dirksen, S., Rauhut, H., Caire, G. (2022). New Challenges in Covariance Estimation: Multiple Structures and Coarse Quantization. In: Kutyniok, G., Rauhut, H., Kunsch, R.J. (eds) Compressed Sensing in Information Processing. Applied and Numerical Harmonic Analysis. Birkhäuser, Cham. https://doi.org/10.1007/978-3-031-09745-4_3
Print ISBN: 978-3-031-09744-7
Online ISBN: 978-3-031-09745-4