3.1 Introduction

The key objective in covariance estimation is simple to state: given \(n \in {\mathbb N}\) i.i.d. samples \({\mathbf {X}}^1,...,{\mathbf {X}}^n \overset {\mathrm {d}}{\sim } \mathbf {X}\) of a random vector \(\mathbf {X} \in {\mathbb R}^p\), compute a reliable estimate of the covariance matrix \(\mathbb E [\mathbf {X}{\mathbf {X}}^\top ] = \boldsymbol {\Sigma } \in {\mathbb R}^{p\times p}\) (without loss of generality, we restrict ourselves here to mean-zero distributions, i.e., \(\mathbb {E} [\mathbf {X}] = \boldsymbol {0}\)). For this purpose, a natural estimator is the sample covariance matrix

$$\displaystyle \begin{aligned} \hat{\boldsymbol{\Sigma}}_n = \frac{1}{n} \sum_{k=1}^n {\mathbf{X}}^k ({\mathbf{X}}^k)^\top \end{aligned} $$
(3.1)

as it converges to Σ, for n → ∞, by the law of large numbers. Nevertheless, an asymptotic result is of limited use from a practical perspective. Given \(n \in {\mathbb N}\), it provides no information on the reconstruction error \(\| \hat {\boldsymbol \Sigma }_n - \boldsymbol \Sigma \|\) measured in the operator norm ∥⋅∥. (Although other norms or error metrics might be considered as well, e.g., the Frobenius norm, we mainly restrict ourselves in this chapter to operator norm bounds as the most common representative.)
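To make the setting concrete, the following minimal sketch (Python/NumPy; the dimensions and the AR(1)-type covariance are purely illustrative) generates Gaussian samples, forms the sample covariance matrix (3.1), and evaluates the operator-norm error appearing in (3.2).

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 50, 500                                   # illustrative dimension and sample size

# ground-truth covariance with entries 0.5^{|i-j|} (an AR(1)-type Toeplitz matrix)
idx = np.arange(p)
Sigma = 0.5 ** np.abs(idx[:, None] - idx[None, :])

# n i.i.d. mean-zero Gaussian samples, stored as rows of X
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)

Sigma_hat = X.T @ X / n                          # sample covariance matrix, Eq. (3.1)

# operator-norm (largest singular value) reconstruction error, cf. Eq. (3.2)
err = np.linalg.norm(Sigma_hat - Sigma, ord=2)
print(f"operator-norm error: {err:.3f}")
```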

In the last two decades, numerous works on the non-asymptotic analysis of covariance estimation showed that reliable approximation of Σ by \(\hat {\boldsymbol \Sigma }_n\) becomes feasible for sub-Gaussian distributions if n ≳ p, where \(a \lesssim b\) denotes a ≤ Cb for some absolute constant C > 0. For instance, if X has a Gaussian distribution, then it is well known [61] that with probability at least \(1 - 2e^{-t}\)

$$\displaystyle \begin{aligned} \| \hat{\boldsymbol\Sigma}_n - \boldsymbol\Sigma \| \lesssim \| \boldsymbol\Sigma \| \left( \sqrt{\frac{p + t}{n}} + \frac{p + t}{n} \right). \end{aligned} $$
(3.2)

This classical result exhibits several weaknesses. For instance, it requires strong concentration of the distribution of X around its mean: the estimator in (3.1) is sensitive to outliers and not reliable if concentration fails [12, 34]. Furthermore, in applications the ambient dimension can easily exceed the number of accessible samples, so that even if concentration may be assumed, the estimate in (3.2) becomes vacuous.

3.1.1 Outline and Notation

In Sect. 3.2, we briefly discuss massive MIMO as one specific modern application of covariance estimation. The massive MIMO setting originates from wireless communications research and will serve as a motivation for investigating multiple structures and quantized samples in a mathematical framework. Section 3.3 then surveys recent theoretical advances on the estimation of structured covariance matrices, and Sect. 3.4 shows the impact of coarse sample quantization on estimation guarantees. Having the theoretical results from Sects. 3.3 and 3.4 in mind, in Sect. 3.5, we finally return to the details of massive MIMO and present our recent approach from the engineering literature. We conclude in Sect. 3.6 by discussing the gap between existing theoretical guarantees and practical solutions. Some technical details of Sect. 3.3 are deferred to the Appendix.

We denote [n] = {1, ..., n}. For any absolute constant C > 0, we abbreviate a ≤ Cb (resp., a ≥ Cb) as \(a \lesssim b\) (resp., \(a \gtrsim b\)). We furthermore write \(a \lesssim_L b\) (resp., \(a \gtrsim_L b\)) if C only depends on the quantity L. Whenever we use absolute constants c, C > 0, their values may vary from line to line. For a set S, the indicator function \(\chi_S\) is 1 on S and 0 on its complement \(S^c\). We denote the all-ones matrix by \(\boldsymbol {1} \in {\mathbb R}^{p\times p}\) and the identity by \(\mathbf I \in {\mathbb R}^{p\times p}\). Scalar-valued functions act component-wise on vectors and matrices; in particular,

$$\displaystyle \begin{aligned}{}[\mathrm{sign}(\mathbf{x})]_i = \begin{cases} 1 & \text{if } x_i\geq 0 \\ -1& \text{if } x_i<0, \end{cases} \end{aligned} $$

for all \(\mathbf x \in {\mathbb R}^p\) and i ∈ [p]. For \(\mathbf Z \in {\mathbb R}^{p\times p}\), we denote the operator norm (the maximum singular value) by \(\| \mathbf Z \| = \sup _{\mathbf u \in \mathbb {S}^{p-1}} \| \mathbf Z\mathbf u \|_2\), the nuclear norm (the sum of singular values) by \(\| \mathbf Z \|_* = \mathrm {tr}(\sqrt {\mathbf Z^\top \mathbf Z})\), the Frobenius norm by \(\| \mathbf Z \|_F^2 = \mathrm {tr} (\mathbf Z^\top \mathbf Z) = \sum _{i,j = 1}^p Z_{i,j}^2\), the max norm by \(\| \mathbf Z \|_\infty = \max_{i,j} |Z_{i,j}|\), and the maximum column norm by \(\| \mathbf Z \|_{1\to 2} = \max_{j \in [p]} \| \mathbf z_j \|_2\), where \(\mathbf z_j\) denotes the j-th column of Z. We use ⊙ for the Hadamard (i.e., entry-wise) product of two matrices. The uniform distribution on a set S is denoted by Unif(S). The multivariate Gaussian distribution with mean \(\boldsymbol \mu \in {\mathbb R}^{p}\) and covariance matrix \(\boldsymbol \Sigma \in {\mathbb R}^{p\times p}\) is denoted by \(\mathcal {N}(\boldsymbol \mu , \boldsymbol \Sigma )\). The sub-Gaussian (\(\psi_2\)-) and subexponential (\(\psi_1\)-) norms of a random variable X are defined by

$$\displaystyle \begin{aligned} \| X \|{}_{\psi_\alpha} = \inf \left\{ t>0 \colon \mathbb E \left[ \exp\left(\tfrac{|X|{}^\alpha}{t^\alpha} \right) \right] \le 2 \right\}, \quad \alpha \in \{1,2\}. \end{aligned} $$

A mean-zero random vector X on \({\mathbb R}^n\) is called K-sub-Gaussian if

$$\displaystyle \begin{aligned}\|\langle \mathbf X,\mathbf x\rangle\|{}_{\psi_2} \leq K \; \mathbb E [\langle \mathbf X,\mathbf x \rangle^2 ]^{1/2} \quad \mbox{ for all } \mathbf x \in {\mathbb R}^n.\end{aligned}$$

3.2 Motivation: Massive MIMO

Multiple-input multiple-output (MIMO) is a method in wireless communication to enhance the capacity of a radio link by using multiple transmission and multiple receiving antennas. It has become an essential element of wireless communication standards for Wi-Fi and mobile devices [24, 50]. Massive MIMO equips the base station (BS) with a large number of antennas to further increase bandwidth and potential number of users [44, 45].

In a classical massive MIMO communication system, the BS is equipped with a uniform linear array (ULA) of M antennas and communicates with multiple users through a scattering channel, e.g., wave reflection on buildings or objects. See Fig. 3.1 for an exemplary setup. During uplink (UL), the BS receives user pilots and aims at estimating the respective channel covariance matrices, which characterize each transmission channel. By assuming mutual orthogonality of all UL pilots, it suffices to focus on a single user channel. We denote the corresponding UL channel vector at time–frequency resource s by \(\mathbf h(s) \in \mathbb C^M\) (standard block-fading model, e.g., [59]). Furthermore, we assume that the user transmits a single pilot per channel coherence block such that the channel vectors h(s) are i.i.d. complex Gaussian vectors, for s ∈ [N] [27, 28].

Fig. 3.1
figure 1

An exemplary multipath propagation channel, where the user signal is received at the BS through two scattering clusters

The received pilot signal at the BS at resource block s is then given as

$$\displaystyle \begin{aligned} \mathbf y(s) = \mathbf h(s) x(s) + \mathbf z(s), \end{aligned} $$
(3.3)

for s ∈ [N], where \(x(s) \in {\mathbb C}\) is the known pilot symbol and \(\mathbf z(s) \sim \mathcal C \mathcal N (\boldsymbol 0, N_0 \mathbf I) = \mathcal N (\boldsymbol 0, \tfrac {N_0}{2} \mathbf I) + j \mathcal N (\boldsymbol 0, \tfrac {N_0}{2} \mathbf I)\) models additive white Gaussian noise (AWGN). Without loss of generality, one may assume that the pilot symbols are normalized, i.e., x(s) = 1. The core problem of massive MIMO channel estimation is now to estimate the channel covariance matrix

$$\displaystyle \begin{aligned} \boldsymbol \Sigma_{\mathbf h} = \mathbb E [ \mathbf h(s) \mathbf h(s)^{\mathsf H} ] \end{aligned} $$
(3.4)

from N noisy samples y(s), s ∈ [N]. Since the number of samples N is limited due to time constraints of the UL phase, one expects for massive MIMO that N ≈ M. Translating this into our initial theoretical setting, i.e., identifying the ambient dimension p with the number of antennas M, the number of samples n with the number of independent time–frequency resources N, and the sample vectors X k with the channel vectors h(s), we see that the sample covariance matrix will not provide a reliable estimate of Σ h in this case, cf. Eq. (3.2) for n ≈ p. Nevertheless, a closer look into the channel model reveals that Σ h naturally exhibits intrinsic structures such as low-rankness and Toeplitz structure, cf. Sect. 3.5.

Structure and Quantization

Let us highlight two crucial points. First, whereas engineers successfully improve upon the sample covariance matrix by exploiting special features of their problem setting, cf. Sect. 3.5, it might simplify existing approaches if alternatives to the sample covariance matrix were used that automatically leverage intrinsic structure(s) of the covariance matrix. As Sect. 3.3 will show, our theoretical understanding in this regard has improved substantially over the last decade. Second, if the above methods are used in real applications, one has to take into account that the sample vectors y(s) have to be quantized to finite alphabets before digital processing. Especially in massive MIMO, the information loss due to quantization can be significant since fine quantization at a multitude of antennas leads to enormous energy consumption. The results presented in Sect. 3.4 can be seen as a first theoretical step toward understanding the non-asymptotic behavior of covariance estimators under coarse quantization of the samples. Since we concentrate on memoryless quantization schemes (each vector entry is quantized independently of all others), our model should be applicable to massive MIMO in a straightforward way.

3.3 Estimation of Structured Covariance Matrices and Robustness Against Outliers

As we have already seen in Sect. 3.2, there are several structures of interest that Σ might exhibit in applications. We concentrate here on three important instances—sparsity, low-rankness, and Toeplitz structure—that naturally emerge in engineering, biology, and data science, e.g., [42, 53]. Parts of the results we review below are not restricted to Gaussian random vectors but also cover heavy-tailed distributions that only satisfy assumptions on their lower moments. Techniques for robust covariance estimation include median of means [31, 49], element- and spectrum-wise truncation [12, 47], and M-estimators [47, 48]. The recent work [46] even constructs a “sub-Gaussian” estimator that only requires a finite kurtosis assumption (\(L_4\)–\(L_2\)-norm equivalence). In this context, an estimator is called sub-Gaussian if it performs on non-Gaussian distributions as well as the sample covariance matrix applied to Gaussian distributions; for further discussion, see [46]. Although the proposed construction is computationally intractable, it illustrates the potential of robust estimation. For further information on early and recent approaches to robust covariance estimation, we refer the reader to [29, 34].

3.3.1 Sparse Covariance Matrices

We begin with the assumption that Σ is a sparse matrix, i.e., only a few entries of Σ are relevant and hence non-zero. If X models ordered variables, the non-zero entries of Σ might, for instance, cluster around the diagonal such that Σ is a banded or tapered matrix. A straightforward way to estimate such covariance matrices is to band/taper the sample covariance matrix \(\hat {\boldsymbol \Sigma }_n\) [6, 11, 23]. If the variables are not ordered and the non-zero entries of Σ do not cluster, thresholding of \(\hat {\boldsymbol \Sigma }_n\) is a viable alternative [5, 19]. As remarked in [40], the aforementioned approaches can be treated in a unified way by introducing a mask M ∈ [0, 1]p×p and considering the masked sample covariance matrix \(\mathbf M \odot \hat {\boldsymbol \Sigma }_n\). The masked formulation allows one to decompose the estimation error

$$\displaystyle \begin{aligned} \| \mathbf M \odot \hat{\boldsymbol \Sigma }_n - \boldsymbol \Sigma \| \le \| \mathbf M \odot \hat{\boldsymbol \Sigma }_n - \mathbf M \odot \boldsymbol \Sigma \| + \| \mathbf M \odot \boldsymbol \Sigma - \boldsymbol \Sigma \| \end{aligned} $$

into a variance term that behaves well if M is (close to) sparse and a bias term that is small whenever M encodes the support of Σ. The bias term is deterministic and solely depends on a proper choice of M. For understanding the influence of sparsity on the required sample size, it thus suffices to control the variance term. The corresponding state-of-the-art result can be found in [13], which extends [40] from Gaussian distributions to general distributions with finite fourth moment and strengthens [40] when applied to Gaussian distributions. To facilitate the comparison with (3.2), we present the result only in the Gaussian case.

Theorem 3.1 ([13, Theorem 1.1])

Let M ∈ [0, 1]p×p , for p ≥ 3, be fixed and \(\mathbf X \sim \mathcal {N} (\boldsymbol 0, \boldsymbol \Sigma )\) with \(\boldsymbol \Sigma \in {\mathbb R}^{p\times p}\) . Then,

$$\displaystyle \begin{aligned} & \left( \mathbb E \| \mathbf M \odot \hat{\boldsymbol \Sigma }_n - \mathbf M \odot \boldsymbol \Sigma \|{}^2 \right)^{\frac{1}{2}} \\ &\lesssim \| \boldsymbol \Sigma \| \left( \sqrt{ \frac{\| \boldsymbol \Sigma \|{}_\infty}{\| \boldsymbol \Sigma \|} \cdot \frac{\| \boldsymbol M \|{}_{1\rightarrow 2}^2 \log(p) }{ n } } + \frac{\| \boldsymbol \Sigma \|{}_\infty}{\| \boldsymbol \Sigma \|} \cdot \frac{\| \boldsymbol M \| \log(p) \log(np) }{ n } \right). \end{aligned} $$

Theorem 3.1 only bounds the second moment of the variance term, which yields high-probability estimates via Markov’s inequality. However, the same proof techniques apply to higher moments of the variance term as well, so that exponential tail bounds can be achieved for Gaussian X, cf. [13, Section 3.3].

Let us compare Theorem 3.1 with (3.2). For general covariance estimation, i.e., M = 1, we have \(\| \boldsymbol M \|{ }_{1\rightarrow 2}^2 = \| \boldsymbol M \| = p\), which implies that up to \(\log \)-factors both results are of the same order \(\mathcal {O}(\sqrt {\tfrac {p}{n}} + \tfrac {p}{n})\). If M encodes sparsity, however, meaning that only up to s ≪ p columns and rows are non-zero and \(\| \boldsymbol M \|{ }_{1\rightarrow 2}^2 = \| \boldsymbol M \| = s\), the estimation error is considerably reduced when applying Theorem 3.1. A similar error reduction occurs if \(\mathbf {M} \odot \hat {\boldsymbol \Sigma }_n\) is a banded estimator of bandwidth B.
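As an illustration of the masked estimator \(\mathbf M \odot \hat{\boldsymbol\Sigma}_n\), the following sketch (hypothetical helper names and an illustrative bandwidth) builds a banding mask of bandwidth B and applies it entry-wise; for such a mask, \(\| \mathbf M \|_{1\to 2}^2\) and \(\| \mathbf M \|\) in Theorem 3.1 are of order B rather than p.

```python
import numpy as np

def banding_mask(p: int, B: int) -> np.ndarray:
    """0/1 mask keeping the B innermost diagonals, i.e., entries with |i - j| < B."""
    idx = np.arange(p)
    return (np.abs(idx[:, None] - idx[None, :]) < B).astype(float)

def masked_sample_cov(X: np.ndarray, M: np.ndarray) -> np.ndarray:
    """Masked sample covariance M ⊙ Σ̂_n for samples stored as rows of X."""
    n = X.shape[0]
    return M * (X.T @ X / n)              # Hadamard product implements the masking

# the two mask quantities entering Theorem 3.1 (illustrative dimensions)
p, B = 100, 5
M = banding_mask(p, B)
col_norm_sq = np.max(np.sum(M**2, axis=0))   # ||M||_{1->2}^2, here 2B - 1
op_norm = np.linalg.norm(M, ord=2)           # ||M||, of order B for a banding mask
```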

Estimation via Thresholding

While the masked framework provides a unified understanding of the intrinsic complexity of sparse covariance estimation, in practice the mask M is unknown. A more realistic approach to the problem is hence given by thresholding procedures, e.g., [5]. To allow for non-ordered covariance matrices, i.e., general sparsity and not only limited bandwidth of the matrix, the authors of [5] introduce the set of bounded and (effectively) sparse covariance matrices

$$\displaystyle \begin{aligned} \mathcal{U} (q,s,M) := \left\{ \boldsymbol \Sigma \colon \Sigma_{i,i} \le M\ \text{and } \sum_{j=1}^p |\Sigma_{i,j}|{}^q \le s, \text{for all } i \in [p] \right\}, \end{aligned} $$

for q ∈ [0, 1) and s, M > 0. If q = 0, the matrices in \(\mathcal {U} (q,s,M)\) have at most s non-zero entries per row; if q > 0, the rows are close to s-sparse vectors. To estimate \(\boldsymbol \Sigma \in \mathcal {U} (q,s,M)\), the thresholded estimator \(\mathbb T_\tau (\hat {\boldsymbol \Sigma }_n)\) is considered, where

$$\displaystyle \begin{aligned}{}[\mathbb T_\tau (\mathbf A)]_{i,j} = \begin{cases} A_{i,j} & \text{if } |A_{i,j}| \ge \tau, \\ 0 & \text{else,} \end{cases} \end{aligned} $$
(3.5)

for any τ > 0 and \(\mathbf A \in {\mathbb R}^{p\times p}\).
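The thresholding operator (3.5) and the resulting estimator are straightforward to implement; a minimal sketch (the choice of τ anticipates Theorem 3.2 below, with M' treated as a user-supplied constant):

```python
import numpy as np

def hard_threshold(A: np.ndarray, tau: float) -> np.ndarray:
    """Entry-wise hard thresholding T_tau from (3.5): keep entries with |A_ij| >= tau."""
    return np.where(np.abs(A) >= tau, A, 0.0)

def thresholded_cov(X: np.ndarray, M_prime: float = 1.0) -> np.ndarray:
    """Thresholded sample covariance with tau = M' sqrt(log(p)/n), cf. Theorem 3.2 below."""
    n, p = X.shape
    Sigma_hat = X.T @ X / n
    tau = M_prime * np.sqrt(np.log(p) / n)
    return hard_threshold(Sigma_hat, tau)
```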

Theorem 3.2 ([5, Theorem 1])

Let \(\mathbf X \sim \mathcal {N} (\boldsymbol 0, \boldsymbol \Sigma )\) , for \(\boldsymbol \Sigma \in \mathcal {U} (q,s,M)\) , and M′ > 0 be sufficiently large (depending on M). If

$$\displaystyle \begin{aligned} \tau = M' \sqrt{\frac{\log(p)}{n}}, \end{aligned} $$

for \(n \gtrsim \log (p)\) , then with probability at least \(1 - e^{-cn\tau ^2}\)

$$\displaystyle \begin{aligned} \| \mathbb T_\tau (\hat{\boldsymbol \Sigma }_n) - \boldsymbol \Sigma \| = \mathcal O \left( s \left( \frac{\log(p)}{n} \right)^{\frac{1-q}{2}} \right). \end{aligned} $$

Theorem 3.2 does not require knowledge of the support of Σ and allows for sparsity defects. However, if we once more consider the case q = 0, we see that the estimate in Theorem 3.2 is suboptimal since the error behaves (up to \(\log \)-factors) like \(\mathcal {O} \big ( \sqrt {\tfrac {s^2}{n}} \big )\) and not like \(\mathcal {O} (\sqrt {\tfrac {s}{n}})\) as one would expect.

3.3.2 Low-Rank Covariance Matrices

When working with high-dimensional random vectors, another commonly considered structural prior is to assume that the distribution concentrates around a low-dimensional manifold. This may manifest itself in Σ being a low-rank matrix. Interestingly, the sample covariance matrix in (3.1) intrinsically leverages low-rankness of Σ. To understand this phenomenon, we consider the effective rank of Σ defined as

$$\displaystyle \begin{aligned} \mathbf r (\boldsymbol \Sigma) = \frac{\| \boldsymbol \Sigma \|{}_*}{\| \boldsymbol \Sigma \|}. \end{aligned} $$

It is straightforward to verify that \(1 \le \mathbf r(\boldsymbol \Sigma) \le \mathrm{rank}(\boldsymbol \Sigma)\). In contrast to the rank of Σ, the quantity r( Σ) is small even if Σ is only close to a low-rank matrix, e.g., consider Σ to be a full rank matrix with exponentially decaying spectrum.
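For intuition, a short sketch (illustrative spectrum) computing the effective rank of a full-rank matrix with exponentially decaying eigenvalues; since covariance matrices are positive semi-definite, the nuclear norm reduces to the trace here.

```python
import numpy as np

def effective_rank(Sigma: np.ndarray) -> float:
    """r(Sigma) = nuclear norm / operator norm; for PSD Sigma the nuclear norm is the trace."""
    return np.trace(Sigma) / np.linalg.norm(Sigma, ord=2)

p = 200
eigvals = 2.0 ** (-np.arange(p))     # exponentially decaying spectrum 1, 1/2, 1/4, ...
Sigma = np.diag(eigvals)             # full rank, but effectively low rank
print(effective_rank(Sigma))         # close to 2, although rank(Sigma) = p
```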

Theorem 3.3 ([37, Corollary 2])

Let \(\mathbf X \sim \mathcal N (\boldsymbol 0,\boldsymbol \Sigma )\) , for \(\boldsymbol \Sigma \in {\mathbb R}^{p\times p}\) , and \(n \gtrsim \mathbf r(\boldsymbol \Sigma)\) . Then with probability at least \(1 - e^{-t}\) the sample covariance matrix satisfies

$$\displaystyle \begin{aligned} \| \hat{ \boldsymbol \Sigma }_n - \boldsymbol \Sigma \| \lesssim \| \boldsymbol \Sigma \| \left( \sqrt{\frac{\mathbf r ( \boldsymbol \Sigma )}{n}} + \frac{\mathbf r ( \boldsymbol \Sigma )}{n} + \sqrt{\frac{t}{n}} + \frac{t}{n} \right). \end{aligned} $$

The authors of [37] further show that the bound in Theorem 3.3 is tight up to constants. If we compare the result to (3.2), we see that both estimates agree for (effectively) full rank matrices like Σ = I. If Σ is of low rank, however, Theorem 3.3 controls the estimation error even in the case n < p.

Low-Rank Estimators

We could stop at this point since \(\hat { \boldsymbol \Sigma }_n\) apparently meets our requirements. Nevertheless, two questions remain. First, if one assumes Σ to be low rank, one would wish for an estimator that is low rank itself, and, second, Theorem 3.3 fails if X does not exhibit strong concentration around its mean. The first point can be addressed by using the LASSO estimator

$$\displaystyle \begin{aligned} \hat{ \boldsymbol \Sigma }_n^\lambda = \mathrm{arg}\min_{\mathbf S \succcurlyeq \boldsymbol 0} \| \mathbf S - \hat{\boldsymbol \Sigma}_n \|{}_F^2 + \lambda \| \mathbf S \|{}_* \;, \end{aligned} $$
(3.6)

where λ > 0 is a tunable parameter. This estimator was initially introduced in [43] to estimate covariance matrices from incomplete observations; in our setting, the corresponding result reads as follows.

Theorem 3.4 ([43, Corollary 1])

Let \(\mathbf X \sim \mathcal N (\boldsymbol 0,\boldsymbol \Sigma )\) , for \(\boldsymbol \Sigma \in {\mathbb R}^{p\times p}\) , and \(n \gtrsim \mathbf r (\boldsymbol \Sigma ) \log (2p + n)^2\) . If

$$\displaystyle \begin{aligned} \lambda = C \sqrt{ \mathrm{tr} (\hat{\boldsymbol \Sigma}_n) \| \hat{\boldsymbol \Sigma}_n \| } \sqrt{ \frac{\log(2p)}{n} }, \end{aligned} $$

for a sufficiently large absolute constant C > 0, then with probability at least \(1-\tfrac {1}{2p}\) the estimator in (3.6) satisfies

$$\displaystyle \begin{aligned} \| \hat{ \boldsymbol \Sigma }_n^\lambda - \boldsymbol \Sigma \| \lesssim \| \boldsymbol \Sigma \| \sqrt{ \frac{\mathbf r (\boldsymbol \Sigma) \log(2p)}{n} }. \end{aligned} $$

The nuclear norm regularization in (3.6) induces (effective) low-rankness of \(\hat { \boldsymbol \Sigma }_n^\lambda \) [21, 51], and the order of the estimation error matches, up to \(\log \)-factors, the one in Theorem 3.3. Furthermore, the construction of \(\hat { \boldsymbol \Sigma }_n^\lambda \) can easily be adapted to heavy-tailed distributions by replacing \(\hat { \boldsymbol \Sigma }_n\) with an appropriate robust counterpart, e.g., the spectrum-wise truncated sample covariance matrix [34]. A corresponding version of Theorem 3.4 that is not restricted to (sub-)Gaussian distributions is [34, Theorem 5.2].
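Since \(\| \mathbf S \|_* = \mathrm{tr}(\mathbf S)\) for positive semi-definite S, the program (3.6) has a closed-form solution: it is the projection of \(\hat{\boldsymbol\Sigma}_n - \tfrac{\lambda}{2}\mathbf I\) onto the positive semi-definite cone, i.e., eigenvalue shrinkage of the sample covariance matrix. A minimal sketch (the data-driven λ follows Theorem 3.4; the constant C is illustrative):

```python
import numpy as np

def lowrank_lasso_cov(X: np.ndarray, C: float = 2.0) -> np.ndarray:
    """Closed-form solution of (3.6) via eigenvalue soft-thresholding of the sample covariance."""
    n, p = X.shape
    Sigma_hat = X.T @ X / n
    # regularization parameter as in Theorem 3.4
    lam = C * np.sqrt(np.trace(Sigma_hat) * np.linalg.norm(Sigma_hat, ord=2)) \
            * np.sqrt(np.log(2 * p) / n)
    # for S ⪰ 0 the nuclear norm equals the trace, so (3.6) amounts to projecting
    # Sigma_hat - (lam/2) I onto the positive semi-definite cone
    w, V = np.linalg.eigh(Sigma_hat)
    w_shrunk = np.maximum(w - lam / 2.0, 0.0)
    return (V * w_shrunk) @ V.T
```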

3.3.3 Toeplitz Covariance Matrices and Combined Structures

The third structure we discuss here in detail naturally arises in various engineering problems. If the entries of X resemble measurements on a temporal or spatial grid whose covariances only depend on the distances of measurements (in time or space) but not their location, Σ is a symmetric Toeplitz matrix, i.e.,

$$\displaystyle \begin{aligned} \boldsymbol \Sigma = \begin{pmatrix} \sigma_1 & \sigma_2 & \cdots & \sigma_{p} \\ \sigma_2 & \ddots & \ddots & \vdots \\ \vdots & \ddots & & \sigma_2 \\ \sigma_{p} & \cdots & \sigma_2 & \sigma_1 \end{pmatrix}, \end{aligned} $$

and the first column \(\boldsymbol \sigma \in {\mathbb R}^p\) determines Σ via \(\Sigma_{i,j} = \sigma_{|i-j|+1}\). (For simplicity, we identify Toeplitz matrices with their first column in the following.) Such a structure appears, for instance, in Direction-Of-Arrival (DOA) estimation [38] and medical and radar image processing [9, 56]. For further examples, we refer the reader to [53]. Since Toeplitz structure reduces the degrees of freedom in Σ from \(p^2\) to p, leveraging this structure can lead to a notable reduction in sample complexity.

The authors of [10] propose to average the sample covariance matrix along its diagonals to obtain the Toeplitz estimator \(\hat { \boldsymbol \Sigma }_n^{\mathrm {Toep}}\) whose first column is defined by

$$\displaystyle \begin{aligned}{}[\hat{\sigma}_n^{\mathrm{Toep}}]_r = \frac{1}{(p+1)-r} \sum_{i-j = r-1} [\hat{\Sigma}_n]_{i,j}, \quad \text{for } r \in [p]. \end{aligned} $$
(3.7)

They derive error estimates for Gaussian distributions with banded Toeplitz covariance matrices.
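A minimal sketch of the diagonal-averaging estimator (3.7) (the helper name is ours):

```python
import numpy as np
from scipy.linalg import toeplitz

def toeplitz_cov(X: np.ndarray) -> np.ndarray:
    """Averages the sample covariance along its diagonals, Eq. (3.7)."""
    n, p = X.shape
    Sigma_hat = X.T @ X / n
    # r-th entry of the first column: mean over the diagonal with i - j = r - 1
    sigma = np.array([np.mean(np.diagonal(Sigma_hat, offset=-r)) for r in range(p)])
    return toeplitz(sigma)     # symmetric Toeplitz matrix with first column sigma
```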

The more recent work [33] extends these results to non-Gaussian distributions and general masks as introduced in Sect. 3.3.1. To be more precise, the authors of [33] assume that the distribution of X has the so-called convex concentration property.

Definition 3.1

A random vector \(\mathbf X \in {\mathbb R}^p\) has the convex concentration property with constant K if for any 1-Lipschitz function \(\phi \colon {\mathbb R}^p \to {\mathbb R}\), one has \(\mathbb E [\phi (\mathbf X)] < \infty \) and

$$\displaystyle \begin{aligned} \mathrm{Pr} \left[ | \phi(\mathbf X) - \mathbb E [\phi(\mathbf X)] | \ge t \right] \le 2 e^{-\tfrac{t^2}{K^2}}, \quad \text{for all } t > 0. \end{aligned} $$

By setting ϕ(⋅) = 〈⋅, x〉, for \(\mathbf x \in \mathbb S^{p-1}\), one easily sees that all distributions that have the convex concentration property are sub-Gaussian. For the sake of consistency, we therefore restrict ourselves here to Gaussian distributions as their most prominent representative. For a symmetric Toeplitz mask M ∈ [0, 1]p×p characterized by its first column m ∈ [0, 1]p, we furthermore define the weighted 1- and 2-norms of m as

$$\displaystyle \begin{aligned} \| \mathbf m \|{}_{1,*} = \sum_{r = 1}^p \frac{m_r}{(p+1)-r} \quad \text{and} \quad \| \mathbf m \|{}_{2,*} = \left( \sum_{r = 1}^p \frac{m_r^2}{(p+1)-r} \right)^{\frac{1}{2}}. \end{aligned} $$

Theorem 3.5 ([33, Theorem 3])

Let M ∈ [0, 1]p×p be a symmetric Toeplitz mask and \(\mathbf X \sim \mathcal N(\boldsymbol 0,\boldsymbol \Sigma )\) with \(\boldsymbol \Sigma \in {\mathbb R}^{p\times p}\) symmetric and Toeplitz. Then,

$$\displaystyle \begin{aligned} \mathbb E \| \mathbf M \odot \hat{\boldsymbol \Sigma}_n^{\mathrm{Toep}} - \mathbf M \odot \boldsymbol \Sigma \| \lesssim \| \boldsymbol \Sigma \| \left( \sqrt{\frac{ \| \mathbf m \|{}_{2,*} \log(p) }{n}} + \frac{ \| \mathbf m \|{}_{1,*} \log(p) }{n} \right). \end{aligned} $$

As with Theorem 3.1, the result is not restricted to a bound on the expected error but comes with corresponding high-probability bounds with exponential tail decay. Let us compare Theorem 3.5 to Theorem 3.1. If we ignore \(\log \)-factors and assume that M is a banding or tapering mask with support bandwidth \(B \le \tfrac {p}{2}\), i.e., only the B innermost diagonals of M are non-zero, Theorem 3.5 guarantees an estimation error of order \(\mathcal {O}(\sqrt {\tfrac {B}{pn}} + \tfrac {B}{pn})\), cf. [33, Corollary 2], which improves the estimate \(\mathcal {O}(\sqrt {\tfrac {B}{n}} + \tfrac {B}{n})\) of Theorem 3.1 by a factor of p. This improvement corresponds to the reduction in degrees of freedom when comparing Toeplitz to general matrices. Note, however, that the additional assumption B ≤ αp, for α ∈ (0, 1), is required for such a reduction since estimation of the outermost diagonals of Σ is hardly enhanced by averaging over the Toeplitz structure. This is reflected in Theorem 3.5 since \(\| \mathbf m \|_{1,*}\) and \(\| \mathbf m \|_{2,*}\) are \(\mathcal O(1)\) and not \(\mathcal {O}(\tfrac {1}{p})\) if the tail entries of m are not of vanishing magnitude.

Estimation via Thresholding

Theorem 3.5 differs from the previously discussed results in the sense that it allows one to simultaneously leverage two structures of Σ: sparsity and Toeplitz structure. Nevertheless, as in Sect. 3.3.1, the masked framework leaves open the question of how to choose M in practice. By combining the thresholding approach of Theorem 3.2 with the techniques of Theorem 3.5, one can obtain a thresholded Toeplitz estimator which profits from both structural priors. To state a corresponding estimate, let us define the set of bounded Toeplitz covariance matrices with (effectively) sparse first column σ by

$$\displaystyle \begin{aligned} \mathcal{U}^{\mathrm{Toep}} (q,s,M) := \left\{ \boldsymbol \Sigma \colon \Sigma_{i,j} = \sigma_{|i-j|+1} \le M, \text{for } \boldsymbol \sigma \in {\mathbb R}^p\ \text{with } \sum_{r=1}^p |\sigma_{r}|{}^q \le s \right\}. \end{aligned} $$

We furthermore denote by \(\mathbb B_{\alpha p} (\boldsymbol \Sigma )\) the matrix Σ restricted to bandwidth αp, i.e., \([\mathbb B_{\alpha p} (\boldsymbol \Sigma )]_{i,j} = \Sigma _{i,j}\) if |i − j| + 1 ≤ αp and \([\mathbb B_{\alpha p} (\boldsymbol \Sigma )]_{i,j} = 0\) else.

Theorem 3.6

There exists an absolute constant C > 0 such that the following holds. Let X have the convex concentration property with constant K. Let \(\mathbb E [\mathbf X] = \boldsymbol 0\) and \(\mathbb E [\mathbf X \mathbf X^\top ] = \boldsymbol \Sigma \) , for \(\boldsymbol \Sigma \in \mathcal {U}^{\mathrm {Toep}} (q,s,M)\) . For all α ∈ (0, 1) and c > 1, we have with probability at least 1 − (2αp)−(c−1) that if

$$\displaystyle \begin{aligned} \tau = \sqrt{\frac{2c}{(1-\alpha)}} \max \{CK^2,\sqrt{C}K\} \sqrt{\frac{\log(p)}{np}}, \end{aligned} $$
(3.8)

then

$$\displaystyle \begin{aligned} \left\| \mathbb T_\tau ( \mathbb B_{\alpha p} (\hat{\boldsymbol \Sigma}_n^{\mathrm{Toep}} )) - \boldsymbol \Sigma \right\| \lesssim s \left( \max \{C^2K^4,CK^2\} \frac{c}{1 - \alpha} \frac{\log(p)}{np} \right)^{\frac{1-q}{2}} + \| \mathbb B_{\alpha p} (\boldsymbol \Sigma) - \boldsymbol \Sigma \|, \end{aligned} $$

where \(\mathbb T_\tau \) is the thresholding operator from (3.5).

Two comments are in order here. To gain from the Toeplitz structure, Theorem 3.6 requires Σ to be close to a banded matrix. This mirrors Theorem 3.5 and has been discussed above. Moreover, by adapting the proof strategy of Theorem 3.2, the result inherits the slightly suboptimal error decay in the sparsity level s, cf. the discussion of Theorem 3.2 for the case q = 0. The proof, which combines ideas from [5] and [33], can be found in the Appendix.
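For concreteness, a self-contained sketch of the estimator analyzed in Theorem 3.6, combining diagonal averaging, banding to bandwidth αp, and entry-wise thresholding; the constant c0 stands in for the K-dependent factors in (3.8).

```python
import numpy as np
from scipy.linalg import toeplitz

def banded_thresholded_toeplitz(X: np.ndarray, alpha: float = 0.5, c0: float = 1.0) -> np.ndarray:
    """Thresholded banded Toeplitz estimator T_tau(B_{alpha p}(Sigma_n^Toep)), cf. Theorem 3.6."""
    n, p = X.shape
    Sigma_hat = X.T @ X / n
    # diagonal averaging as in Eq. (3.7)
    sigma = np.array([np.mean(np.diagonal(Sigma_hat, offset=-r)) for r in range(p)])
    Sigma_toep = toeplitz(sigma)
    # banding operator B_{alpha p}: keep entries with |i - j| + 1 <= alpha * p
    idx = np.arange(p)
    band = (np.abs(idx[:, None] - idx[None, :]) + 1 <= alpha * p).astype(float)
    # simplified threshold of the form (3.8); c0 absorbs the K-dependent constants
    tau = c0 * np.sqrt(np.log(p) / (n * p))
    A = band * Sigma_toep
    return np.where(np.abs(A) >= tau, A, 0.0)    # entry-wise hard thresholding, Eq. (3.5)
```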

Combining Toeplitz Structure and Low-Rankness

Sparsity is not the only structure that can be imposed on Toeplitz matrices. For instance, in massive MIMO, see Sect. 3.2, low-rankness of Σ may naturally be assumed in addition to Toeplitz structure [28]. The recent works [20, 39] propose several algorithms to estimate low-rank Toeplitz covariance matrices from partial observations by a technique called “sparse ruler.” In particular, the authors show that the sufficient number of samples to approximate Σ scales (up to \(\log \)-factors) polynomially in the (effective) rank of Σ.

Remark 3.1

Before closing this section, let us briefly comment on the three types of structures discussed above and their mutual relation:

  • Sparsity: The concept of sparsity is perhaps the most fundamental way of theoretically describing intrinsic “low-complexity” of points in a vector space. Whereas we only introduced sparsity of vectors in \({\mathbb R}^n\) with respect to the canonical basis, it is straightforward to generalize the definition to arbitrary vector spaces and other bases (or even frames). Note, however, that sparsity strongly depends on the chosen representation of objects in space, i.e., a point that is sparse in one basis need not be sparse in another.

  • Low-rankness: One can view low-rankness as a special case of sparsity since a matrix is low rank if and only if the vector of its singular values is sparse in the canonical basis. Stated differently, a matrix is low rank if its induced linear mapping only acts on low-dimensional subspaces of the ambient input and output space. This second characterization shows that, in contrast to sparsity, low-rankness is not representation dependent. Furthermore, one can generalize the concept to higher dimensional linear operators as well, e.g., tensor spaces.

  • Toeplitz structure: Like low-rankness, Toeplitz structure can be viewed as a special type of sparsity that requires the points in space to be matrices. Its low-dimensional structure lies in the fact that only (2n − 1) parameters are necessary to characterize an \({\mathbb R}^{n\times n}\) Toeplitz matrix. In contrast to low-rankness, Toeplitz structure is representation dependent. Nevertheless, Toeplitz matrices naturally appear as covariance matrices of stationary random processes, i.e., if the covariance of two events does not depend on their localization in time but only on their distance in time.

For further discussion and literature on the subject, we refer the interested reader to [22].

3.4 Estimation from Quantized Samples

All results stated above assume that the sample vectors X k are real-valued, i.e., one has access to infinite precision representations of the samples. In applications, this assumption is not always fulfilled. Especially in signal processing, samples are collected via sensors and, hence, need to be quantized to finitely many bits before they can be digitally transmitted and processed. Engineers have been examining the influence of coarse quantization on correlation and covariance estimation for decades, e.g., [2, 14, 30, 41, 54]. However, in contrast to classical covariance estimation from unquantized samples, so far only asymptotic estimation guarantees have been derived in the quantized setting. To improve our understanding of the effect of quantization on covariance estimation, we analyzed two memoryless one-bit quantization schemes in our recent work [16]. We call a quantizer memoryless if it quantizes each entry of X k independently of all remaining entries. This is fundamentally different from feedback systems, e.g., Σ Δ-quantization [4, 55], and of particular interest for large-scale applications like massive MIMO where the entries of X k correspond to inputs from different antennas, cf. Sect. 3.2. In the remainder of this section, we provide a detailed discussion of the models and results in [16].

3.4.1 Sign Quantization

In the first setting, we assume to receive one-bit quantized samples

$$\displaystyle \begin{aligned} \mathrm{sign}(\mathbf X^k) \in \{-1,1\}^p, \end{aligned} $$
(3.9)

for k ∈ [n], instead of X k itself. (Recall that we apply scalar functions like sign entry-wise to vectors and matrices.) Since the quantizer sign is scale-invariant, i.e., sign(z) = sign(Dz) for any diagonal matrix \(\mathbf D\in {\mathbb R}^{p\times p}\) with strictly positive entries and \(\mathbf z \in {\mathbb R}^p\), we can only hope to recover the correlation matrix of the distribution, i.e., a normalized version of Σ with entries \(\big [ \tfrac {\Sigma _{i,j}}{\sqrt {\Sigma _{i,i}} \sqrt {\Sigma _{j,j}}} \big ]_{i,j}\). We thus assume that \(\mathbf X \sim \mathcal {N}(\boldsymbol 0,\boldsymbol \Sigma )\), where Σ has ones on its diagonal.

It has been known for decades that

$$\displaystyle \begin{aligned} \tilde{\boldsymbol \Sigma}_n = \sin \left( \frac{\pi}{2n} \sum_{k=1}^n \mathrm{sign} (\mathbf X^k) \mathrm{sign}(\mathbf X^k)^\top \right) \end{aligned} $$
(3.10)

is well suited to approximate Σ from the quantized samples, cf. [30]. Note that the specific form of \(\tilde {\boldsymbol \Sigma }_n\) is motivated by Grothendieck’s identity (see, e.g., [61, Lemma 3.6.6]), also known as “arcsin-law” in the engineering literature [30, 60], which implies that

$$\displaystyle \begin{aligned} \boldsymbol \Gamma := \mathbb E [\mathrm{sign} (\mathbf X^k) \mathrm{sign} (\mathbf X^k)^\top ] = \frac{2}{\pi} \arcsin (\boldsymbol \Sigma) \end{aligned} $$
(3.11)

if \(\mathbf X \sim \mathcal N (\boldsymbol 0,\boldsymbol \Sigma )\). Applying the strong law of large numbers and the continuity of the sine function to (3.10), one easily obtains with (3.11) that \(\tilde {\boldsymbol \Sigma }_n\) is a consistent estimator of Σ.
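A minimal sketch of the sign estimator (3.10), with sign(0) set to 1 as in our convention:

```python
import numpy as np

def sign_estimator(X: np.ndarray) -> np.ndarray:
    """Arcsine-law based estimator (3.10) built from the one-bit samples sign(X^k)."""
    n = X.shape[0]
    S = np.where(X >= 0, 1.0, -1.0)         # entry-wise sign with sign(0) = 1
    Gamma_hat = S.T @ S / n                 # empirical version of Gamma in (3.11)
    return np.sin(np.pi / 2 * Gamma_hat)    # invert the arcsine law entry-wise
```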

The two key quantities for understanding the non-asymptotic performance of \(\tilde {\boldsymbol \Sigma }_n\) are Γ and

$$\displaystyle \begin{aligned} \mathbf A := \cos{}(\arcsin(\boldsymbol \Sigma)) = \cos{}( \tfrac{\pi}{2} \boldsymbol \Gamma ). \end{aligned} $$

Furthermore, we define

$$\displaystyle \begin{aligned} \sigma(\mathbf Z)^2 := \mathbf Z^2 \odot \boldsymbol \Gamma - ( \mathbf Z \odot \boldsymbol \Gamma )^2 = \frac{2}{\pi} \mathbf Z^2 \odot \arcsin(\boldsymbol \Sigma) - \frac{4}{\pi^2} \big( \mathbf Z \odot \arcsin(\boldsymbol \Sigma) \big)^2, \end{aligned} $$

for symmetric \(\mathbf Z \in {\mathbb R}^{p\times p}\).

Theorem 3.7 ([16, Theorem 1])

There exist constants c 1, c 2 > 0 such that the following holds. Let \(\mathbf X \sim \mathcal N(\boldsymbol 0,\boldsymbol \Sigma )\) with Σ i,i = 1, for i ∈ [p], and \(\mathbf X^1,...,\mathbf X^n \overset {\mathrm {d}}{\sim } \mathbf X\) be i.i.d. samples of X . Let M ∈ [0, 1]p×p be a fixed symmetric mask. Then, for all t ≥ 0 with \(n \ge c_1 \log ^2(p) (\log (p) + t)\) , the biased sign estimator \(\tilde {\boldsymbol \Sigma }_n\) fulfills with probability at least \(1 - 2e^{-c_2 t}\)

$$\displaystyle \begin{aligned} \| \mathbf M \odot \tilde{\boldsymbol \Sigma}_n - \mathbf M \odot \boldsymbol \Sigma \| & \lesssim \left\| \sigma \left(\mathbf M \odot \mathbf A \right)\right\| \sqrt{\frac{\log(p) + t}{n}} \\ & \qquad \qquad + \left( \max\left\{ \| \mathbf M \odot \mathbf A \|, \| \mathbf M \odot \boldsymbol \Sigma \| \right\} \right) \frac{\log(p) + t}{n}. \end{aligned} $$
(3.12)

The right-hand side in Theorem 3.7 (for convenience, we only consider the case M = 1 here) can be trivially estimated to get

$$\displaystyle \begin{aligned} \| \tilde{\boldsymbol \Sigma}_n - \boldsymbol \Sigma \| \lesssim \max \{ \| \cos{}(\arcsin(\boldsymbol \Sigma)) \|,\| \boldsymbol \Sigma \| \} \left( \sqrt{ \frac{\log(p) + t}{n}} + \frac{\log(p) + t}{n} \right), \end{aligned} $$

which is up to the additional dependence on \(\cos {}(\arcsin (\boldsymbol \Sigma ))\) comparable to the error bound in (3.2) for \(\hat { \boldsymbol \Sigma }_n\). This is remarkable since \(\tilde {\boldsymbol \Sigma }_n\) accesses considerably less information on the samples than \(\hat { \boldsymbol \Sigma }_n\).

Theorem 3.7 even suggests that for strongly correlated distributions of X, i.e., Σ ≈1, the dominant first term on the right-hand side of (3.12) vanishes. In other words, the bound in (3.12) predicts \(\tilde {\boldsymbol \Sigma }_n\) to outperform \(\hat {\boldsymbol \Sigma }_n\) if the entries of X strongly correlate. Numerical experiments from [16] confirm this counter-intuitive fact, cf. Fig. 3.2. A possible explanation is that by construction, \(\tilde {\boldsymbol \Sigma }_n\) implicitly uses the assumption that Σ has ones on its diagonal which is not provided to \(\hat {\boldsymbol \Sigma }_n\).

Fig. 3.2
figure 2

The experiment from [16] depicts average estimation error of \(\hat { \boldsymbol \Sigma }_n\) and \(\tilde { \boldsymbol \Sigma }_n\) in operator norm, for p = 20, n varying from 10 to 300 and three different choices of the ground truth Σ with ones on the diagonal and off-diagonal entries equal to c = 0.5, c = 0.9, and c = 0.99

Furthermore, a corresponding lower bound on the second moment of the estimation error shows that the unconventional term ∥σ(M ⊙A)∥ is genuine and not an artifact of the proof, cf. [16, Proposition 14].

3.4.2 Dithered Quantization

The results of Sect. 3.4.1 are restricted to the estimation of correlation matrices of Gaussian distributions. Both limitations stem from the chosen quantization model: first, (3.9) is blind to the rescaling of variances and, second, Grothendieck’s identity only holds for Gaussian distributions. Nevertheless, by introducing a dither in the one-bit quantizer in (3.9), we can fully estimate the covariance matrix of general sub-Gaussian distributions. Dithering means adding artificial random noise (with a suitably chosen distribution) to the samples before quantizing them to improve reconstruction from quantized observations, cf. [25, 26, 52]. In the context of one-bit compressed sensing, the effect of dithering was recently rigorously analyzed in [3, 17, 18, 32, 36], see also the survey [15].

To be precise, we require two bits per entry of each sample vector, where each bit is obtained with an independent uniformly distributed dither, i.e., we are given

$$\displaystyle \begin{aligned} \mathrm{sign} (\mathbf X^k + \boldsymbol \tau^k), \ \mathrm{sign} (\mathbf X^k + \bar{\boldsymbol \tau}^k), \qquad k=1,\ldots,n, \end{aligned} $$
(3.13)

where the dithering vectors \(\boldsymbol \tau ^1,\bar {\boldsymbol \tau }^1,\ldots ,\boldsymbol \tau ^n,\bar {\boldsymbol \tau }^n\) are independent and uniformly distributed in [−λ, λ]p, with λ > 0 to be specified later. From the quantized observations in (3.13), we construct the estimator

$$\displaystyle \begin{aligned} \tilde{\boldsymbol \Sigma}_n^{\text{dith}} = \tfrac{1}{2}\tilde{\boldsymbol \Sigma}^{\prime}_n + \tfrac{1}{2}(\tilde{\boldsymbol \Sigma}^{\prime}_n)^\top, \end{aligned} $$
(3.14)

where

$$\displaystyle \begin{aligned} \tilde{\boldsymbol \Sigma}^{\prime}_n = \frac{\lambda^2}{n}\sum_{k=1}^n \mathrm{sign} (\mathbf X^k + \boldsymbol \tau^k) \mathrm{sign} (\mathbf X^k + \bar{\boldsymbol \tau}^k)^\top. \end{aligned} $$
(3.15)
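A minimal sketch of the dithered estimator (3.14)–(3.15); the dithering level lam must be supplied by the user, e.g., following the scaling \(\lambda^2 \approx \log(n) \| \boldsymbol\Sigma \|_\infty\) from Theorem 3.8 below with a rough a priori bound on \(\| \boldsymbol\Sigma \|_\infty\).

```python
import numpy as np

def dithered_estimator(X: np.ndarray, lam: float, rng=None) -> np.ndarray:
    """Two-bit dithered estimator (3.14)-(3.15) from sign(X^k + tau^k) and sign(X^k + tau_bar^k)."""
    rng = np.random.default_rng() if rng is None else rng
    n, p = X.shape
    tau = rng.uniform(-lam, lam, size=(n, p))       # dithers tau^k ~ Unif([-lam, lam]^p)
    tau_bar = rng.uniform(-lam, lam, size=(n, p))   # independent dithers tau_bar^k
    S1 = np.where(X + tau >= 0, 1.0, -1.0)
    S2 = np.where(X + tau_bar >= 0, 1.0, -1.0)
    Sigma_prime = (lam**2 / n) * S1.T @ S2          # Eq. (3.15)
    return 0.5 * (Sigma_prime + Sigma_prime.T)      # symmetrization, Eq. (3.14)
```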

Theorem 3.8 ([16, Theorem 3])

Let X be a mean-zero, K-sub-Gaussian vector with covariance matrix \(\mathbb E [ \mathbf X \mathbf X^\top ] = \boldsymbol \Sigma \) . Let \(\mathbf X^1,...,\mathbf X^n \overset {\mathrm {d}}{\sim } \mathbf X\) be i.i.d. samples of X . Let M ∈ [0, 1]p×p be a fixed symmetric mask. If \(\lambda ^2 \gtrsim _K \log (n) \| \boldsymbol \Sigma \|{ }_{\infty }\) , then with probability at least \(1 - e^{-t}\),

$$\displaystyle \begin{aligned} \| \mathbf M \odot \tilde{\boldsymbol \Sigma}_n^{\mathit{\text{dith}}} &- \mathbf M \odot \boldsymbol \Sigma \| \\ &\lesssim_K \|\mathbf M\|{}_{1\to 2}(\lambda\|\boldsymbol \Sigma \|{}^{1/2}+\lambda^2)\sqrt{\frac{\log(p)+t}{n}} + \lambda^2\|\mathbf M\| \frac{\log(p)+t}{n}. \end{aligned} $$

In particular, if \(\lambda ^2 \approx _K \log (n) \|\boldsymbol \Sigma \|{ }_{\infty }\) , then

$$\displaystyle \begin{aligned} &\| \mathbf M \odot \tilde{\boldsymbol \Sigma}_n^{\mathit{\text{dith}}} - \mathbf M \odot \boldsymbol \Sigma \| \\ &\lesssim_K \log(n) \| \mathbf M \|{}_{1\to 2} \sqrt{\frac{\|\boldsymbol \Sigma \| \ \|\boldsymbol \Sigma \|{}_{\infty}(\log(p)+t)}{n}} + \log(n)\|\mathbf M\|\|\boldsymbol \Sigma \|{}_{\infty}\frac{\log(p)+t}{n}. \end{aligned} $$
(3.16)

The error bound (3.16) coincides (up to different logarithmic factors) with the best known estimate for the masked sample covariance matrix in Theorem 3.1, even though the sample covariance matrix requires direct access to the samples X k. This performance, however, heavily depends on the choice of λ, cf. [16]. Furthermore, it should be mentioned that there are cases where the performance of the dithered estimator is significantly worse than the performance of the sample covariance matrix. Let us consider for simplicity the case M = 1. If the samples X k are Gaussian, then [37] shows that

$$\displaystyle \begin{aligned} \mathbb E \| \hat{\boldsymbol \Sigma}_n - \boldsymbol \Sigma \| \simeq \sqrt{\frac{\|\boldsymbol \Sigma \| \mathrm{Tr}(\boldsymbol \Sigma)}{n}} + \frac{\mathrm{Tr}(\boldsymbol \Sigma)}{n}, \end{aligned} $$

whereas (3.16) yields

$$\displaystyle \begin{aligned} \mathbb E \| \tilde{\boldsymbol \Sigma}_n^{\text{dith}} - \boldsymbol \Sigma \| \lesssim \log(n) \sqrt{\frac{p\|\boldsymbol \Sigma \| \ \|\boldsymbol \Sigma \|{}_{\infty}\log(p)}{n}} + \log(n)\frac{p\|\boldsymbol \Sigma \|{}_{\infty}\log(p)}{n} \end{aligned} $$

via tail integration. Since \(\mathrm{Tr}(\boldsymbol \Sigma) \le p \| \boldsymbol \Sigma \|_\infty\), the second estimate is worse in general. Numerical experiments in [16] have shown that this difference is not an artifact of the proof. Simply put, \(\hat {\boldsymbol \Sigma }_n\) and \(\tilde {\boldsymbol \Sigma }_n^{\text{dith}}\) perform similarly if Σ has a constant diagonal, whereas \(\hat {\boldsymbol \Sigma }_n\) performs significantly better whenever \(\mathrm{Tr}(\boldsymbol \Sigma) \ll p \| \boldsymbol \Sigma \|_\infty\).

Theorem 3.8 can be extended to heavier-tailed random vectors. This, however, requires a larger choice of λ and thus more samples to reach the same error. For a sub-exponential random vector X, one would already need \(\lambda ^2 \gtrsim \log (n)^2 \cdot \max _{i \in [p]} \| X_i \|{ }_{\psi _1}^2\). The dependence of λ on n, both in the latter statement and in Theorem 3.8, can be observed in numerical experiments [16] as well.

Let us finally mention that the quantized estimators in (3.10) and (3.14) are not necessarily positive semi-definite as one expects from covariance matrices. In applications, one would thus replace both estimators by their projection onto the cone of positive semi-definite matrices, which is efficiently computed via the singular value decomposition [8, Section 8.1.1]. The obtained estimates also apply to the projected estimators since convex projections are 1-Lipschitz.
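The projection onto the positive semi-definite cone amounts to clipping negative eigenvalues of the symmetrized estimate; a short sketch:

```python
import numpy as np

def project_psd(A: np.ndarray) -> np.ndarray:
    """Nearest (in Frobenius norm) positive semi-definite matrix to a symmetric A."""
    A_sym = 0.5 * (A + A.T)                 # symmetrize first
    w, V = np.linalg.eigh(A_sym)
    return (V * np.maximum(w, 0.0)) @ V.T   # clip negative eigenvalues at zero
```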

3.5 The Underlying Structures of Massive MIMO Covariance Estimation

With the theoretical insights on covariance matrix estimation surveyed above in mind, let us return to the massive MIMO setup of Sect. 3.2. To understand the intrinsic structure of Σ h in (3.4) and the corresponding approaches in the engineering literature to leverage it, we have to dive deeper into the underlying model and its physical interpretation. We thus follow the notational conventions of the engineering literature in this section. Recall that the number of antennas M can be identified with the ambient dimension p, the number of time–frequency resources N with the number of samples n, and the channel vectors h(s) correspond to the samples X k.

Under the assumptions stated in the beginning of Sect. 3.2, i.e., the BS is equipped with M antennas in a ULA, the channel vector h(s) can be written explicitly as

$$\displaystyle \begin{aligned} \mathbf h(s) = \int_{-1}^1 \rho(\xi,s) \, \mathbf a(\xi) \; \mathrm d \xi, \end{aligned} $$

for s ∈ [N]. Here, \(\xi = \tfrac {\sin {}(\theta )}{\sin {}(\theta _{\text{max}})}\) are the normalized angles of arrival (AoA) with \(\theta _{\text{max}} \in [0,\tfrac {\pi }{2}]\) being the maximum array angular aperture (cf. Fig. 3.1), the vectors \(\mathbf a(\xi ) \in \mathbb C^M\) denote the respective array response at the BS, and the channel gain ρ(ξ, s) is a complex Gaussian process with zero mean. By assuming the antenna spacing to be \(d = \tfrac {\lambda }{2}\), where \(\lambda = \tfrac {c_0}{f_0}\) denotes the wavelength with c 0 being the speed of light and f 0 the carrier frequency, we obtain

$$\displaystyle \begin{aligned} \mathbf a(\xi) = \begin{pmatrix} 1, e^{j\pi \xi}, \dots, e^{j\pi (M-1) \xi} \end{pmatrix}^\top, \end{aligned} $$

where j denotes the imaginary unit. With the additional assumption of wide sense stationary uncorrelated scattering (WSSUS), the second-order statistics of the Gaussian process ρ(ξ, s) are time invariant and uncorrelated across AoAs so that

$$\displaystyle \begin{aligned} \mathbb E [ \rho(\xi,s) \rho^*(\xi',s) ] = \gamma (\xi) \, \delta (\xi - \xi'), \end{aligned} $$

where \(\gamma \colon [-1,1] \to {\mathbb R}_{\ge 0}\) is the real and non-negative measure that represents the angular scattering function (ASF) and δ is the Dirac delta function. In particular, this implies that

$$\displaystyle \begin{aligned} \boldsymbol \Sigma_{\mathbf h} = \mathbb E [ \mathbf h(s) \mathbf h(s)^{\mathsf H} ] = \int_{-1}^1 \gamma (\xi) \, \mathbf a (\xi) \mathbf a (\xi)^{\mathsf H} \; \mathrm d \xi. \end{aligned} $$
(3.17)

Building upon this explicit representation of h(s) and structural assumptions on γ, one can refine the estimate obtained from the sample covariance matrix of y defined in (3.3).
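To make (3.17) concrete, the following sketch (hypothetical grid size and ASF) discretizes the angular domain and assembles Σ h from the array responses a(ξ) of a half-wavelength ULA.

```python
import numpy as np

def array_response(xi: float, M: int) -> np.ndarray:
    """ULA response a(xi) = (1, e^{j pi xi}, ..., e^{j pi (M-1) xi}) for half-wavelength spacing."""
    return np.exp(1j * np.pi * xi * np.arange(M))

def channel_covariance(gamma, M: int, grid_size: int = 2000) -> np.ndarray:
    """Discretization of (3.17): Sigma_h ≈ sum_g gamma(xi_g) a(xi_g) a(xi_g)^H dxi."""
    xi_grid = np.linspace(-1.0, 1.0, grid_size)
    dxi = xi_grid[1] - xi_grid[0]
    Sigma_h = np.zeros((M, M), dtype=complex)
    for xi in xi_grid:
        a = array_response(xi, M)
        Sigma_h += gamma(xi) * np.outer(a, a.conj()) * dxi
    return Sigma_h

# example: a single Gaussian scattering cluster centered at xi = 0.3 (illustrative ASF)
gamma = lambda xi: np.exp(-0.5 * ((xi - 0.3) / 0.05) ** 2)
Sigma_h = channel_covariance(gamma, M=32)
```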

A Hands-on Approach

In [35], we choose the following approach. First note that by (3.17) the channel covariance matrix belongs to the set

$$\displaystyle \begin{aligned} \mathcal M = \left\{ \int_{-1}^1 \gamma(\xi) \, \mathbf a (\xi) \mathbf a (\xi)^{\mathsf H} \; \mathrm d \xi \colon \gamma \in \mathcal A \right\}, \end{aligned} $$

where \(\mathcal A\) denotes the class of typical ASFs in wireless propagation. If one assumes sparse scattering propagation, the set \(\mathcal A\) consists of sparse ASFs. In particular, we assume that γ(ξ) can be decomposed as the sum of a discrete spike component γ d (modeling the power received from line-of-sight (LOS) paths and narrow scatterers) and a continuous component γ c (modeling the power received from wide scatterers). Mathematically, we can write

$$\displaystyle \begin{aligned} \gamma (\xi) = \gamma_d(\xi) + \gamma_c(\xi) = \sum_{k=1}^r c_k \delta(\xi - \xi_k) + \gamma_c(\xi), \end{aligned} $$
(3.18)

where γ d consists of r ≪ M Dirac deltas with AoAs ξ 1, …, ξ r and strengths c 1, …, c r > 0 corresponding to r specular propagation elements. Furthermore, by sparsity assumptions on γ, we have that meas(γ c) ≪meas([−1, 1]), where meas(γ c) denotes here the measure of the support of γ c. Combining (3.17) and (3.18), we decompose the channel covariance matrix as

$$\displaystyle \begin{aligned} \boldsymbol \Sigma_{\mathbf h} = \boldsymbol \Sigma_{\mathbf h}^d + \boldsymbol \Sigma_{\mathbf h}^c = \sum_{k=1}^r c_k \, \mathbf a (\xi_k) \mathbf a (\xi_k)^{\mathsf H} + \int_{-1}^1 \gamma_c(\xi) \, \mathbf a (\xi) \mathbf a (\xi)^{\mathsf H} \; \mathrm d \xi, \end{aligned} $$
(3.19)

where \(\boldsymbol \Sigma _{\mathbf h}^d\) is rank-r and positive semi-definite and \(\boldsymbol \Sigma _{\mathbf h}^c\) is full rank and positive semi-definite with few dominant singular values. We can approximate Σ h now in three consecutive steps:

  1. (i)

    Spike Location Estimation for γ d: Applying the MUltiple SIgnal Classification (MUSIC) algorithm [58], we estimate the AoAs ξ k of the spike component γ d from the noisy samples y(1), …, y(N), cf. [35, Theorem 1]. Since this step is fairly standard, we do not discuss the details here but refer the interested reader to [35]. Let us only mention that the number of spikes is estimated by the number of dominant eigenvalues of \(\boldsymbol \Sigma _{\mathbf y} := \mathbb E [ \mathbf y(s) \mathbf y(s)^{\mathsf H} ]\) (where one can naturally assume a corresponding gap in the spectrum since the power received via LOS paths in γ d dominates the power received from wide scatterers in γ c). As a result, we obtain estimated spike locations \(\hat \xi _k\), for \(k \in [\hat r]\), and define an approximation of γ d as

    $$\displaystyle \begin{aligned} \tilde\gamma_d(\xi) = \sum_{k=1}^{\hat r} \tilde c_k \, \delta(\xi - \hat\xi_k), \end{aligned} $$

    where the coefficients \(\tilde c_1,\dots , \tilde c_{\hat r} \ge 0\) still need to be estimated.

  2. (ii)

    Sparse Dictionary-Based Method: We approximate the continuous component γ c over a finite dictionary of densities \(\mathcal G_c := \{ \psi _i \colon [-1,1] \to {\mathbb R}, \; i \in [G] \}\) that are suitably chosen, e.g., Gaussian, Laplacian, or rectangular kernels, cf. Fig. 3.3. We hence define

    $$\displaystyle \begin{aligned} \tilde\gamma_c(\xi) = \sum_{i=1}^{G} \tilde b_i \psi_i(\xi), \end{aligned} $$
    Fig. 3.3
    figure 3

    Example of a Gaussian dictionary that might be used to express γ c

    where only the coefficients \(\tilde b_1,\dots , \tilde b_G \ge 0\) need to be estimated.

  3. (iii)

    Non-Negative Least Squares (NNLS) Estimator: Collecting the coefficients in a single vector \(\mathbf u = (\tilde b_1,\dots ,\tilde b_G,\tilde c_1,\dots , \tilde c_{\hat r})^\top \in {\mathbb R}_{\ge 0}^{G + \hat r}\) and recalling (3.19), we define our coefficient-dependent estimate of the channel covariance

    $$\displaystyle \begin{aligned} \boldsymbol \Sigma_{\mathbf h} (\mathbf u) = \sum_{k=1}^{\hat r} \tilde c_k \, \mathbf a (\hat\xi_k) \mathbf a (\hat\xi_k)^{\mathsf H} + \sum_{i=1}^G \tilde b_i \int_{-1}^1 \psi_i(\xi) \, \mathbf a (\xi) \mathbf a (\xi)^{\mathsf H} \; \mathrm d \xi =: \sum_{i=1}^{G+\hat r} u_i \mathbf S_i, \end{aligned} $$
    (3.20)

    where

    $$\displaystyle \begin{aligned} \mathbf S_i = \begin{cases} \int_{-1}^1 \psi_i(\xi) \, \mathbf a (\xi) \mathbf a (\xi)^{\mathsf H} \; \mathrm d \xi & \text{if } 1 \le i \le G \\ \mathbf a (\hat\xi_{i-G}) \mathbf a (\hat\xi_{i-G})^{\mathsf H} & \text{if } G < i \le G + \hat r. \end{cases} \end{aligned} $$

    All that remains is to determine the coefficient vector u. Since Σ y = Σ h + N 0 I, we can do so by fitting (3.20) to the sample covariance matrix \(\hat {\boldsymbol \Sigma }_{\mathbf y}\) of y(1), …, y(N), i.e.,

    $$\displaystyle \begin{aligned} \mathbf u^* = \mathrm{arg}\min_{\mathbf u \ge \boldsymbol 0} \Big\| \hat {\boldsymbol \Sigma}_{\mathbf y} - \sum_{i=1}^{G+\hat r} u_i \mathbf S_i - N_0 \mathbf I \Big\|{}_F^2. \end{aligned} $$
    (3.21)

    Since Σ h is Hermitian Toeplitz, one can incorporate the structure in (3.21) by replacing \(\hat { \boldsymbol \Sigma }_{\mathbf h} = \hat {\boldsymbol \Sigma }_{\mathbf y} - N_0 \mathbf I\) with its projection \(\tilde { \boldsymbol \Sigma }_{\mathbf h}\) onto the space of Hermitian Toeplitz matrices (which can be done by averaging the diagonals as in (3.7)). Denoting the first column of \(\tilde { \boldsymbol \Sigma }_{\mathbf h}\) by \(\tilde { \boldsymbol \sigma } \in {\mathbb C}^M\) and collecting the first columns of the matrices S i in a matrix \(\tilde { \mathbf S } \in {\mathbb C}^{M \times (G+\hat r)}\), we may instead solve

    $$\displaystyle \begin{aligned} \mathbf u^* = \mathrm{arg}\min_{\mathbf u \ge \boldsymbol 0} \Big\| \mathbf W ( \tilde{\mathbf S} \mathbf u - \tilde{ \boldsymbol \sigma }) \Big\|{}_F^2, \end{aligned} $$
    (3.22)

    where \(\mathbf W = \mathrm {diag} \big ( (\sqrt {M}, \sqrt {2(M-1)},\sqrt {2(M-2)},...,\sqrt {2})^\top \big )\) is a weight matrix compensating for the averaging process; a short sketch of the weighted NNLS fit is given after this list.
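A minimal sketch of the weighted NNLS fit (3.22) using scipy.optimize.nnls; since the solver works over the reals, real and imaginary parts are stacked (the inputs S_tilde and sigma_tilde are assumed to be precomputed as described above).

```python
import numpy as np
from scipy.optimize import nnls

def fit_coefficients(S_tilde: np.ndarray, sigma_tilde: np.ndarray) -> np.ndarray:
    """Solve (3.22): min_{u >= 0} || W (S_tilde u - sigma_tilde) ||_2 with complex data."""
    M = S_tilde.shape[0]
    # weights compensating the diagonal averaging: (sqrt(M), sqrt(2(M-1)), ..., sqrt(2))
    w = np.sqrt(np.concatenate(([M], 2 * np.arange(M - 1, 0, -1))))
    A = w[:, None] * S_tilde            # apply W to the dictionary columns
    b = w * sigma_tilde
    # stack real and imaginary parts so that the real-valued NNLS solver applies
    A_real = np.vstack([A.real, A.imag])
    b_real = np.concatenate([b.real, b.imag])
    u, _ = nnls(A_real, b_real)
    return u                            # non-negative coefficients (tilde b_i and tilde c_k)
```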

A Hands-on Approach: Empirical Evaluation

Let us empirically compare the NNLS estimator to the sample covariance matrix right away. We consider a ULA with M = 128 antennas, where the spacing between two consecutive antenna elements is set to \(d = \frac {\lambda }{2}\). We produce random ASFs in the following general format:

$$\displaystyle \begin{aligned} \gamma (\xi) &= \gamma_d (\xi) + \gamma_c (\xi) \\& = \frac{\alpha}{r} \sum^r_{i=1}\delta (\xi - \xi_i) + \frac{1-\alpha}{Z}\left( \sum^{n_r}_{j=1}\mathtt{rect}_{\mu_j,\sigma_j} (\xi) + \sum^{n_g}_{k=1} \mathtt{Gaussian}_{\mu_k,\sigma_k} (\xi) \right), \end{aligned} $$
(3.23)

where we set the number of delta, rectangular, and Gaussian functions to r := 2, \(n_r := 2\), and \(n_g := 2\), respectively. The spike locations are chosen uniformly at random from [−1, 1], i.e., ξ i ∼Unif([−1, 1]) for i ∈ [2]. The rectangular functions are defined as

$$\displaystyle \begin{aligned} \mathtt{rect}_{\mu_j,\sigma_j} (\xi) = \chi_{ \left[ \mu_j - \tfrac{\sigma_j}{2}, \mu_j + \tfrac{\sigma_j}{2} \right]} (\xi), \end{aligned} $$

where μ 1 ∼Unif([−1, 0]), μ 2 ∼Unif([0, 1]), and σ j ∼Unif([0.1, 0.3]), for j ∈ [2]. The Gaussian functions \(\mathtt {Gaussian}_{\mu _k,\sigma _k}\) are densities of \(\mathcal N(\mu _k,\sigma _k)\), where μ k ∼Unif([−0.7, 0.7]) and σ k ∼Unif([0.03, 0.04]), for k ∈ [2]. Moreover, α := 0.5 determines the fraction of power contributed by the discrete spikes. The constant \(Z = \int _{-1}^1 \gamma _c(\xi ) d\xi \) normalizes γ c in measure. The signal-to-noise ratio (SNR) is set to 10 dB.

In addition to the sample covariance, we compare our NNLS estimator to sparse iterative covariance-based estimation (SPICE) [57]. This method also exploits the ASF domain to fit a covariance matrix. Note that SPICE can only be applied with Dirac delta dictionaries and that it does not include a step of spike support detection as in our method.

Denoting a generic covariance estimate as \(\bar {\boldsymbol {\Sigma }}\), we consider two metrics to evaluate the estimation quality. The first metric, the normalized Frobenius-norm error, is defined as \(E_{\text{NF}} = \frac {\Vert \boldsymbol {\Sigma }_{\mathbf {h}} - \bar {\boldsymbol {\Sigma }} \Vert _{\mathsf {F}}}{\Vert \boldsymbol {\Sigma }_{\mathbf {h}} \Vert _{\mathsf {F}}}\). The second metric, the power efficiency, evaluates the similarity of dominant subspaces between the estimated and true matrices, which is an important factor in various applications of massive MIMO such as user grouping and group-based beamforming. Specifically, let d ∈ [M] denote a subspace dimension parameter, and let \({\mathbf {U}}_d \in \mathbb {C}^{M\times d}\) and \(\bar {\mathbf {U}}_d\in \mathbb {C}^{M\times d}\) be the d dominant eigenvectors of Σ h and \(\bar {\boldsymbol {\Sigma }}\) corresponding to their largest d eigenvalues, respectively. Then, the power efficiency based on d is defined as \(E_{\text{PE}}(d) = 1 - \frac {\langle \boldsymbol {\Sigma }_{\mathbf {h}},\bar {\mathbf {U}}_d\bar {\mathbf {U}}_d^{\sf {H}}\rangle }{\langle \boldsymbol {\Sigma }_{\mathbf {h}},{\mathbf {U}}_d{\mathbf {U}}_d^{\sf {H}}\rangle }\). Note that E PE(d) ∈ [0, 1] where a value closer to 0 means that more power is captured by the estimated d-dominant subspace.
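A short sketch of the two evaluation metrics (variable names are illustrative; the inner product is the real trace inner product):

```python
import numpy as np

def normalized_frobenius_error(Sigma_true: np.ndarray, Sigma_est: np.ndarray) -> float:
    """E_NF = ||Sigma_h - Sigma_bar||_F / ||Sigma_h||_F."""
    return np.linalg.norm(Sigma_true - Sigma_est, "fro") / np.linalg.norm(Sigma_true, "fro")

def power_efficiency(Sigma_true: np.ndarray, Sigma_est: np.ndarray, d: int) -> float:
    """E_PE(d) = 1 - <Sigma_h, U_bar_d U_bar_d^H> / <Sigma_h, U_d U_d^H>."""
    def dominant_eigvecs(S: np.ndarray, k: int) -> np.ndarray:
        w, V = np.linalg.eigh(S)        # eigenvalues in ascending order
        return V[:, -k:]                # eigenvectors of the k largest eigenvalues
    U_true = dominant_eigvecs(Sigma_true, d)
    U_est = dominant_eigvecs(Sigma_est, d)
    captured = np.real(np.trace(U_est.conj().T @ Sigma_true @ U_est))
    optimal = np.real(np.trace(U_true.conj().T @ Sigma_true @ U_true))
    return 1.0 - captured / optimal
```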

SPICE and the proposed NNLS estimators are applied with a dictionary of G = 2M Dirac deltas for the continuous part \(\mathcal {G}_c\). The resulting Frobenius-norm error and power efficiency are depicted in Fig. 3.4. All results are averaged over 20 random ASFs and 200 random channel realizations for each ASF. The proposed NNLS method outperforms the sample covariance matrix and SPICE for both metrics. Finally, one can observe a similar outcome for smaller sample sizes as well, e.g., \(\tfrac{N}{M} = 0.125\), which occur naturally in massive MIMO.

Fig. 3.4
figure 4

Frobenius-norm error (left) and power efficiency with \(\tfrac {N}{M} = 0.5\) (right)

3.6 Conclusion

The present chapter shows that in the last decade good progress has been made on understanding the influence of intrinsic structure of covariance matrices on the non-asymptotic performance of suitably designed estimators. As we have seen, such estimators with strong guarantees are available for sparse, low-rank, and Toeplitz covariance matrices. At the same time, the chapter illustrates that practitioners still continue to tweak the basic sample covariance matrix using their specific knowledge of the application at hand—seemingly unaware of the progress in theory. We hope that this essay helps mathematicians and practitioners alike to gain an overview of recent theoretical developments on structural and quantized covariance estimation and that it motivates mathematicians to look deeper into the underlying physical models of concrete applications to better understand the structures of interest. Furthermore, our recent theoretical progress on quantized covariance estimation suggests that reliable reconstruction of the covariance matrix is possible even under heavy loss of information during sampling. The use of coarse quantization might thus lead to a considerable increase in capacity in massive MIMO systems and related applications.