
18.1 Introduction

18.1.1 Uncertainty Quantification in Compressed Sensing

Compressed sensing and related convex relaxation algorithms have had a profound impact on high-dimensional statistical modelling in recent years. They provide efficient recovery of low-dimensional objects that sit within a high-dimensional ill-posed system of linear equations. Prototypical low-dimensional structures are described by sparsity and low rank hypotheses. The statistical analysis of such algorithms has mostly been concerned with recovery rates, or with the closely related question of how many measurements are sufficient to reach a prescribed recovery level—some key references are Refs. [3, 5–7, 16, 24, 25, 33].

A statistical question of fundamental importance that has escaped a clear answer so far is the question of uncertainty quantification: Can we tell from the observations how well the algorithm has worked? In technical terms: can we report confidence sets for the unknown parameter? Or, in the sequential sampling setting, can we give data-driven rules that ensure recovery of the true parameter at a given precision? Answers to this question are of great importance in various applications of compressed sensing. For one-dimensional subproblems, such as projection onto a fixed coordinate of the parameter vector, recent advances have provided some useful confidence intervals (see Refs. [8, 9, 22, 36]), but our understanding of valid inference procedures for the entire parameter remains limited.

Whereas the ‘estimation theory’ for compressed sensing is quite similar for sparsity and low rank constraints, this is not so for the theory of confidence sets. On the one hand, if one is interested in inference on the full parameter, sparsity conditions induce information theoretic barriers, as shown in Ref. [30]: unless one is willing to make additional signal strength assumptions (inspired by the literature on nonparametric confidence sets, such as Refs. [13, 19]), a uniformly valid confidence set for the unknown parameter vector θ cannot have a better performance than \(1/\sqrt {n}\) in quadratic loss. This significantly falls short of the optimal recovery rates \((k \log p)/n\) for the most interesting sparsity levels k. On the other hand, and perhaps surprisingly, we will show in this article that the low-rank constraint is naturally compatible with certain risk estimation approaches to confidence sets. This will be seen to be true for general sub-Gaussian sensing matrices, but also for sensing matrices arising from Pauli observables, as is specifically relevant in quantum state tomography problems (see the next section). In the latter case it will be helpful to enforce the additional ‘quantum state shape constraint’ on the unknown matrix to obtain optimal results. One can conclude that, in contrast to ‘sparse models’, no signal strength assumptions are necessary for the existence of adaptive confidence statements in low rank recovery problems. Our findings are confirmed in a simple simulation study, see Sect. 18.4.

The honest non-asymptotic confidence sets we will derive below can be used for the construction of adaptive sampling procedures: An experimenter wants to know—at least with a prescribed probability of 1 − α—that the matrix recovery algorithm, the ‘estimator’, has produced an output \(\tilde \theta \) that is close to the ‘true state’ θ. The sequential protocol advocated here—which is related to ‘active learning algorithms’ from machine learning, e.g., Ref. [29]—should tell the experimenter whether a new batch of measurements has to be taken to decrease the recovery error, or whether the collected observations are already sufficient. The data-driven stopping time for this protocol should not exceed the minimax optimal stopping time, again with high probability. We shall show that for Pauli and sub-Gaussian sensing ensembles, such algorithms exist under mild assumptions on the true matrix θ. These assumptions are in particular always satisfied under the ‘quantum shape constraint’ that naturally arises in quantum tomography problems.

Our results depend on the choice of the Frobenius norm and the Hilbert space geometry induced by it. For other natural matrix norms, such as for instance the trace-(nuclear) norm, the theory is more difficult. We show as a first step that at least theoretically a trace-norm optimal confidence set can be constructed for the unknown quantum state (Theorem 18.4)—this suggests interesting directions for future research.

Fig. 18.1

Caricature of a quantum mechanical experiment. With every source of quantum systems, one associates a density matrix θ. Observations on systems are performed by measurement devices, which interact with incoming systems and produce real-valued outcomes. Each such device is modelled mathematically by a Hermitian matrix X, referred to as an observable

18.1.2 Application to Quantum State Estimation

This work was partly motivated by a problem arising in present-day physics experiments that aim at estimating quantum states. Conceptually, a quantum mechanical experiment involves two stages (cf. Fig. 18.1): A source (or preparation procedure) that emits quantum mechanical systems with unknown properties, and a measurement device that interacts with incoming quantum systems and produces real-valued measurement outcomes, e.g. by pointing a dial to a value on a scale. Quantum mechanics stipulates that both stages are completely described by certain matrices.

The properties of the source are represented by a positive semi-definite unit trace matrix θ, the quantum state, also referred to as density matrix. In turn, the measurement device is modelled by a Hermitian matrix X, which is referred to as an observable in physics jargon. A key axiom of the quantum mechanical formalism states that if the measurement X is repeatedly performed on systems emitted by the source that is preparing θ, then the real-valued measurement outcomes will fluctuate randomly with expected value

$$\displaystyle \begin{aligned} \langle X, \theta\rangle_F= \mathrm{tr} (X \theta) \end{aligned} $$
(18.1)

(referred to as expectation value in the quantum physics literature). The precise way in which physical properties are represented by these matrices is immaterial to our discussion (cf. any textbook, e.g. Ref. [32]). We merely note that, while in principle any Hermitian X can be measured by some physical apparatus, the required experimental procedures are prohibitively complicated for all but a few highly structured matrices. This motivates the introduction of Pauli designs below, which correspond to fairly tractable ‘spin parity measurements’.

The quantum state estimation or quantum state tomography problem is to estimate an unknown density matrix θ from the measurement of a collection of observables X 1, …, X n. This task is of particular importance to the young field of quantum information science [31]. There, the source might be a carefully engineered component used for technological applications such as quantum key distribution or quantum computing. In this context, quantum state estimation is the process of characterising the components one has built—clearly an important capability for any technology.

A major challenge lies in the fact that relevant instances are described by d × d-matrices for fairly large dimensions d ranging from 100 to 10,000 in presently performed experiments [18]. Such high-dimensional estimation problems can benefit substantially from structural properties of the objects to be recovered. Fortunately, the density matrices occurring in quantum information experiments are typically well-approximated by matrices of low rank r ≪ d. In fact, in the practically most important applications, one usually even aims at preparing a state of rank one—a so-called pure quantum state. While environmental noise will drive the actual state away from the perfect rank-one case, the error will usually be small.

As a result, quantum physicists have early on shown an interest in low-rank matrix recovery methods [12, 15,16,17, 28]. Initial works [15, 16] focused on the minimal number n of observables X 1, …, X n required for reconstructing a rank-r density matrix θ in the noiseless case, i.e. under the idealised assumption that the expectation values tr(θX i) are known exactly. The practically highly relevant problem of quantifying the uncertainty of an estimate \(\hat \theta \) arising from noisy observations on low-rank states was addressed only later [12] and remains less well understood.

More concretely, the basic approach taken in Ref. [12] for uncertainty quantification is similar to the one pursued in the present paper. In a first step, one uses a Matrix Lasso or Dantzig Selector to construct an estimate. Then, a confidence region is obtained by comparing predictions derived from the initial estimate to new samples. However, Ref. [12] suffers from two drawbacks. First, and most importantly, the performance analysis of the scheme relies on a bound on the rank r of the unknown true θ. Such a bound is not available in practice. Second, the dependence of the rate on r is not tight. Both of these drawbacks will be addressed here.

We close this section by pointing to more broadly related work. Uncertainty quantification in quantum state tomography in general has been treated by numerous authors—a highly incomplete list is Refs. [2, 4, 10, 34, 35]. However, the concept of dimension reduction for low-rank states does not feature explicitly in these papers. This contrasts with Ref. [17], where the authors propose model selection techniques based on information criteria to arrive at low-rank estimates. The use of general-purpose methods—like maximum likelihood estimation and the Akaike Information Criterion—in Ref. [17] means that it is applicable to very general experimental designs. In contrast to this, the present paper relies on compressed sensing ideas to arrive at rigorous a priori guarantees on statistical and computational performance. Also, it remains non-obvious how such model selection steps can be transformed into ‘post-model selection’ confidence sets—typically such constructions result in sub-optimal signal strength conditions that ensure model selection consistency (see Ref. [26] and also the discussion after Theorem 2 in Ref. [30]). Our confidence procedures never estimate the unknown rank of the quantum state—not even implicitly. Rather, they estimate the performance of a dimension-reduction technique directly, based on sample splitting.

18.2 Matrix Compressed Sensing

We consider inference on a d × d matrix θ that is symmetric, or, if its entries are possibly complex, assumed to be Hermitian (that is, θ = θ∗, where θ∗ denotes the conjugate transpose of θ). Denote by \(\mathbb M_d(\mathbb K)\) the space of d × d matrices with entries in \(\mathbb K = \mathbb C\) or \(\mathbb K=\mathbb R\). We write ∥⋅∥F for the usual Frobenius norm on \(\mathbb M_d(\mathbb K)\) arising from the inner product 〈A, B〉F = tr(A∗B). Moreover, let \(\mathbb H_d(\mathbb C)\) be the set of all Hermitian d × d matrices, and \(\mathbb H_d(\mathbb R)\) the set of all symmetric d × d matrices with real entries. The norm symbol ∥⋅∥ without subindex denotes the standard Euclidean norm on \(\mathbb R^n\) or \(\mathbb C^n\), arising from the Euclidean inner product 〈⋅, ⋅〉.

We denote the usual operator norm on \(\mathbb M_d(\mathbb C)\) by ∥⋅∥op. For \(M\in \mathbb M_d(\mathbb C)\) let \((\lambda _k^2: k =1, \dots , d)\) be the eigenvalues of M∗M (which are all real-valued and non-negative). The ℓ1-Schatten, trace, or nuclear norm of M is defined as

$$\displaystyle \begin{aligned}\|M\|{}_{S_1} = \sum_{j\leq d} |\lambda_j|.\end{aligned}$$

Note that for any matrix M of rank 1 ≤ r ≤ d the following inequalities are easily shown,

$$\displaystyle \begin{aligned} \|M\|{}_{F} \leq \|M\|{}_{S_1}\leq \sqrt{r}\|M\|{}_{F}. \end{aligned} $$
(18.2)

We will consider parameter subspaces of \(\mathbb H_d(\mathbb C)\) described by low rank constraints on θ, and denote by R(k) the space of all Hermitian d × d matrices that have rank at most k, k ≤ d. In quantum tomography applications, we may assume an additional ‘shape constraint’, namely that θ is a density matrix of a quantum state, and hence contained in state space

$$\displaystyle \begin{aligned}\Theta_{+} = \{\theta \in \mathbb H_d(\mathbb C):\mathrm{tr}(\theta)=1, \theta \succeq 0\},\end{aligned}$$

where θ ≽ 0 means that θ is positive semi-definite. In fact, in most situations, we will only require the bound \(\|\theta \|{ }_{S_1} \le 1\) which trivially holds for any θ in Θ+.

We have at hand measurements arising from inner products 〈X i, θF = tr(X iθ), i = 1, …, n, of θ with d × d (random) matrices X i. This measurement process is further subject to independent additive noise ε. Formally, the measurement model is

$$\displaystyle \begin{aligned} Y_i = \mathrm{tr} (X^i \theta) + \varepsilon_i,\quad i=1, \dots, n, \end{aligned} $$
(18.3)

where the ε i’s and X i’s are independent of each other. We write Y = (Y 1, …, Y n)T, and for probability statements under the law of Y, X, ε given fixed θ we will use the symbol \(\mathbb P_\theta \). Unless mentioned otherwise we will make the basic assumption of Gaussian noise

$$\displaystyle \begin{aligned}\varepsilon = (\varepsilon_1, \dots, \varepsilon_n)^T \sim N(0,\sigma^2I_n),\end{aligned}$$

where σ > 0 is known. See Remark 18.6 for some discussion of the unknown variance case. In the context of quantum mechanics, the inner product tr(X iθ) gives the expected value of the observable X i when measured on a system in state θ (cf. Sect. 18.1.2). A class of physically realistic measurements (correlations among spin-1∕2 particles) is described by X i’s drawn from the Pauli basis. Our main results also hold for measurement processes of this type. Before we describe this in Sect. 18.2.2, let us first discuss our assumptions on the matrices X i.

18.2.1 Sensing Matrices and the RIP

When \(\theta \in \mathbb M_d(\mathbb R)\), we shall restrict to design matrices X i that have real-valued entries, too, and when \(\theta \in \mathbb H_d(\mathbb C)\) we shall consider designs where \(X^i \in \mathbb H_d(\mathbb C)\). This way, in either case, the measurements tr(X iθ) and hence the Y i are all real-valued. More concretely, the sensing matrices X i that we shall consider are described in the following assumption, which encompasses both a prototypical compressed sensing setting—where we can think of the matrices X i as i.i.d. draws from a Gaussian ensemble with \(X_{m,k} \sim N(0, 1)\) i.i.d. entries—as well as the ‘random sampling from a basis of \(\mathbb M_d(\mathbb C)\)’ scenario. The systematic study of the latter has been initiated by quantum physicists [15, 28], as it contains, in particular, the case of Pauli basis measurements [12, 16] frequently employed in quantum tomography problems. Note that in Part (a) the design matrices are not Hermitian, but our results can easily be generalised to symmetrised sub-Gaussian ensembles (as those considered in Ref. [24]).

Condition 18.1

  1. (a)

    \(\theta \in \mathbb H_d(\mathbb R)\), ‘isotropic’ sub-Gaussian design: The random variables \((X^i_{m,k})\), 1 ≤ m, k ≤ d, i = 1, …, n, generating the entries of the random matrix X i are i.i.d. across all indices i, m, k with mean zero and unit variance. Moreover, for every \(\theta \in \mathbb M_d(\mathbb R)\) such that ∥θ∥F ≤ 1, the real random variables Z i = tr(X iθ) are sub-Gaussian: for some fixed constants τ 1, τ 2 > 0 independent of θ,

    $$\displaystyle \begin{aligned}\mathbb E e^{\lambda Z_i} \le \tau_1 e^{\lambda^2 \tau_2^2} ~\forall \lambda \in \mathbb R.\end{aligned}$$
  2. (b)

    \(\theta \in \mathbb H_d(\mathbb C)\), random sampling from a basis (‘Pauli design’): Let \(\{E_1, \dots , E_{d^2}\} \subset \mathbb H_d(\mathbb C)\) be a basis of \(\mathbb M_d(\mathbb C)\) that is orthonormal for the scalar product 〈⋅, ⋅〉F and such that the operator norms satisfy, for all i = 1, …, d 2 ,

    $$\displaystyle \begin{aligned}\|E_i\|{}_{op} \le \frac{K}{\sqrt{d}},\end{aligned}$$

    for some universal ‘coherence’ constant K. [In the Pauli basis case we have K = 1.] Assume the X i, i = 1, …, n, are draws from the finite family \(\mathcal E = \{d E_i: i=1, \dots , d^2\}\) sampled uniformly at random.

The above examples all obey the matrix restricted isometry property, which we describe now. Note first that if \(\mathcal X: \mathbb R^{d\times d} \to \mathbb R^n \) is the linear ‘sampling’ operator

$$\displaystyle \begin{aligned} \mathcal X: \theta \mapsto \mathcal X \theta = (\mathrm{tr}(X^1 \theta) , \dots,\mathrm{tr}(X^n \theta))^T, \end{aligned} $$
(18.4)

so that we can write the model equation (18.3) as \(Y=\mathcal X \theta + \varepsilon \), then in the above examples we have the ‘expected isometry’

$$\displaystyle \begin{aligned}\mathbb E \frac{1}{n}\|\mathcal X \theta\|{}^2 = \|\theta\|{}_F^2.\end{aligned}$$

Indeed, in the isotropic design case we have

$$\displaystyle \begin{aligned} \frac{1}{n}\mathbb E\|\mathcal X \theta\|{}^2= \frac{1}{n}\sum_{i=1}^n \mathbb E \left(\sum_m \sum_k X^{i}_{m,k} \theta_{m,k} \right)^2 = \sum_m \sum_k \mathbb E X^2_{m,k} \theta_{m,k}^2 = \|\theta\|{}_F^2, \end{aligned} $$
(18.5)

and in the ‘basis case’ we have, from Parseval’s identity and since the X i’s are sampled uniformly at random from the basis,

$$\displaystyle \begin{aligned} \frac{1}{n}\mathbb E \|\mathcal X \theta\|{}^2= \frac{d^2}{n} \sum_{i=1}^n \sum_{j=1}^{d^2} \Pr(X^i=dE_j) |\langle E_j, \theta \rangle_F|{}^2 = \|\theta \|{}_F^2. \end{aligned} $$
(18.6)
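Both identities are easy to confirm by simulation. The following sketch (numpy; the dimensions and Monte Carlo sizes are arbitrary illustrative choices) checks the expected isometry (18.5) for a Gaussian isotropic design.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, reps = 8, 200, 300

# a fixed symmetric test matrix theta
A = rng.standard_normal((d, d))
theta = (A + A.T) / 2
fro2 = np.sum(theta ** 2)                       # ||theta||_F^2

# Monte Carlo estimate of (1/n) E ||X theta||^2 for i.i.d. N(0,1) entries
vals = []
for _ in range(reps):
    X = rng.standard_normal((n, d, d))          # n independent design matrices
    meas = np.einsum('imk,mk->i', X, theta)     # tr(X^i theta), i = 1, ..., n
    vals.append(np.mean(meas ** 2))             # (1/n) ||X theta||^2
print(np.mean(vals), fro2)                      # the two numbers should agree closely
```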

The restricted isometry property (RIP) then requires that this ‘expected isometry’ actually holds, up to constants and with probability ≥ 1 − δ, for a given realisation of the sampling operator, and for all d × d matrices θ of rank at most k:

$$\displaystyle \begin{aligned} \sup_{\theta \in R(k)}\left|\frac{\frac{1}{n}\|\mathcal X \theta\|{}^2 - \|\theta\|{}_F^2}{\|\theta\|{}_F^2} \right| \le \tau_n(k), \end{aligned} $$
(18.7)

where τ n(k) are some constants that may depend, among other things, on the rank k and the ‘exceptional probability’ δ. For the above examples of isotropic and Pauli basis design inequality (18.7) can be shown to hold with

$$\displaystyle \begin{aligned} \tau^2_n(k) = c^2 \frac{k d \cdot \overline{\log} d}{n}, \end{aligned} $$
(18.8)

where

$$\displaystyle \begin{aligned}\overline{\log} x := (\log x) ^\eta,\end{aligned}$$

for some η > 0 denotes a ‘polylog function’, and where c = c(δ) is a constant. See Refs. [6, 28] for these results, where it is also shown that c(δ) can be taken to be O(1∕δ 2) as δ → 0 (sufficient for our purposes below).
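The supremum over R(k) in (18.7) cannot be computed exactly, but it can be probed numerically. The sketch below (all parameters are our own illustrative choices) evaluates the relative isometry defect of one fixed Gaussian design realisation on randomly drawn rank-k matrices; the resulting maximum is only a lower bound on the supremum, to be compared against the rate (18.8).

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, k, trials = 16, 400, 2, 200

X = rng.standard_normal((n, d, d))              # one fixed design realisation

def rel_defect(theta):
    """|(1/n)||X theta||^2 - ||theta||_F^2| / ||theta||_F^2, cf. (18.7)."""
    meas = np.einsum('imk,mk->i', X, theta)
    f2 = np.sum(theta ** 2)
    return abs(np.mean(meas ** 2) - f2) / f2

# random positive semi-definite rank-k test matrices; maximising over a finite
# sample only lower-bounds the supremum over R(k) in (18.7)
defects = []
for _ in range(trials):
    U = rng.standard_normal((d, k))
    defects.append(rel_defect(U @ U.T))
print(max(defects), np.sqrt(k * d * np.log(d) / n))   # compare with (18.8)
```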

18.2.2 Quantum Measurements

Here, we introduce a paradigmatic set of quantum measurements that is frequently used in both theoretical and practical treatments of quantum state estimation (e.g. [16, 18]). For a more general account, we refer to standard textbooks [20, 31]. The purpose of this section is to motivate the ‘Pauli design’ case (Condition 18.1(b)) of the main theorem, as well as the approximate Gaussian noise model. Beyond this, the technical details presented here will not be used.

18.2.2.1 Pauli Spin Measurements on Multiple Particles

We start by describing ‘spin measurements’ on a single ‘spin-1∕2 particle’. Such a measurement corresponds to the situation of having d = 2. Without worrying about the physical significance, we accept as fact that on such particles, one may measure one of three properties, referred to as the ‘spin along the x, y, or z-axis’ of \(\mathbb R^3\). Each of these measurements may yield one of two outcomes, denoted by + 1 and − 1 respectively.

The mathematical description of these measurements is derived from the Pauli matrices

$$\displaystyle \begin{aligned} \sigma^1= \left[ \begin{array}{cc} 0 & 1\\ 1 & 0 \end{array} \right],\, \sigma^2=\left[ \begin{array}{cc} 0 & -i\\ i & 0 \end{array} \right],\, \sigma^3 =\left[ \begin{array}{cc} 1 &0\\ 0 & -1 \end{array} \right] \end{aligned} $$
(18.9)

in the following way. Recall that the Pauli matrices have eigenvalues ± 1. For x ∈{1, 2, 3} and j ∈{+1, −1}, we write \(\psi _j^x\) for the normalised eigenvector of σ x with eigenvalue j. The spectral decomposition of each Pauli spin matrix can hence be expressed as

$$\displaystyle \begin{aligned} \sigma^x = \pi^{x}_+ - \pi^{x}_-, \end{aligned} $$
(18.10)

with

$$\displaystyle \begin{aligned} \pi^{x}_{\pm} = \psi^{x}_{\pm} (\psi^{x}_{\pm})^\ast \end{aligned} $$
(18.11)

denoting the projectors onto the eigenspaces. Now, a physical measurement of the ‘spin along direction x’ on a system in state θ will give rise to a {−1, 1}-valued random variable C x with

$$\displaystyle \begin{aligned} \mathbb P(C^x = j) =\mathrm{tr}\left( \pi^{x}_j \theta \right), \end{aligned} $$
(18.12)

where \(\theta \in \mathbb H_2(\mathbb C)\). Using Eq. (18.10), this is equivalent to stating that the expected value of C x is given by

$$\displaystyle \begin{aligned} \mathbb E(C^x) =\mathrm{tr}\left( \sigma^x \theta \right). \end{aligned} $$
(18.13)

Next, we consider the case of joint spin measurements on a collection of N particles. For each, one has to decide on an axis for the spin measurement. Thus, the joint measurement setting is now described by a word x = (x 1, …, x N) ∈{1, 2, 3}N. The axioms of quantum mechanics posit that the joint state θ of the N particles acts on the tensor product space \((\mathbb C^2)^{\otimes N}\), so that \(\theta \in \mathbb H_{2^N}(\mathbb C)\).

Likewise, the measurement outcome is a word j = (j 1, …, j N) ∈{1, −1}N, with j i the value of the spin along axis x i of particle i = 1, …, N. As above, this prescription gives rise to a {1, −1}N-valued random variable C x. Again, the axioms of quantum mechanics imply that the distribution of C x is given by

$$\displaystyle \begin{aligned} \mathbb P(C^x = j) =\mathrm{tr}\left((\pi^{x_1}_{j_1} \otimes \dots \otimes \pi^{x_N}_{j_N}) \theta \right). \end{aligned} $$
(18.14)

Note that the components of the random vector C x are not necessarily independent, as θ will generally not factorise.

It is often convenient to express the information in Eq. (18.14) in a way that involves tensor products of Pauli matrices, rather than their spectral projections. In other words, we seek a generalisation of Eq. (18.13) to N particles. As a first step toward this goal, let

$$\displaystyle \begin{aligned} \begin{array}{rcl} \chi(j) = \left\{ \begin{array}{ll} -1\qquad &\qquad \text{number of }-1\text{ elements in }j\text{ is odd} \\ 1\qquad &\qquad \text{number of }-1\text{ elements in }j\text{ is even} \end{array} \right. \end{array} \end{aligned} $$
(18.15)

be the parity function. Then one easily verifies

$$\displaystyle \begin{aligned} tr ((\sigma^{x_1} \otimes \dots \otimes \sigma^{x_N} )\theta) = \sum_{j\in\{1,-1\}^N} \chi(j) \, tr \left( \theta ( \pi^{x_1}_{j_1} \otimes \dots \otimes \pi^{x_N}_{j_N} ) \right) = \mathbb E\big(\chi(C^x)\big). \end{aligned} $$
(18.16)

In this sense, the tensor product \(\sigma ^{x_1}\otimes \dots \otimes \sigma ^{x_N}\) describes a measurement of the parity of the spins along the respective directions given by x.

In fact, the entire distribution of C x can be expressed in terms of tensor products of Pauli matrices and suitable parity functions. To this end, we extend the definitions above. Write

$$\displaystyle \begin{aligned} \sigma^0=\left[ \begin{array}{cc} 1 &0\\ 0 & 1 \end{array} \right] \end{aligned} $$
(18.17)

for the identity matrix in \(\mathbb M_2(\mathbb C)\). For every subset S of {1, …, N}, define the ‘parity function restricted to S’ via

$$\displaystyle \begin{aligned} \begin{array}{rcl} \chi_S(j) = \Big\{ \begin{array}{ll} -1\qquad &\qquad \text{number of }-1\text{ elements }j_i\text{ for }i\in S\text{ is odd} \\ 1\qquad &\qquad \text{number of }-1\text{ elements }j_i\text{ for }i\in S\text{ is even}. \end{array} \end{array} \end{aligned} $$
(18.18)

Lastly, for S ⊂{1, …, N} and x ∈{1, 2, 3}N, the restriction of x to S is

$$\displaystyle \begin{aligned} \begin{array}{rcl} x^S_i = \left\{ \begin{array}{ll} x_i\qquad &\qquad i \in S \\ 0\qquad &\qquad i \not\in S. \end{array} \right. \end{array} \end{aligned} $$
(18.19)

Then for every such x, S one verifies the identity

$$\displaystyle \begin{aligned} tr ((\sigma^{x^S_1} \otimes \dots \otimes \sigma^{x^S_N})\theta) = \mathbb E\big( \chi_S(C^x) \big). \end{aligned} $$
(18.20)

In other words, the distribution of C x contains enough information to compute the expectation value of all observables \((\sigma ^{x^S_1} \otimes \dots \otimes \sigma ^{x^S_N})\) that can be obtained by replacing the Pauli matrices on an arbitrary subset S of particles by the identity σ 0. The converse is also true: the set of all such expectation values allows one to recover the distribution of C x. The explicit formula reads

$$\displaystyle \begin{aligned} \mathbb P( C^x = j ) = \frac{1}{2^N}\, \sum_{S \subset \{1, \dots, N\}} \chi_S(j)\, \mathbb E\big( \chi_S (C^x) \big) = \frac{1}{2^N}\, \sum_{S \subset \{1, \dots, N\}} \chi_S(j)\, \mathrm{tr} \big(\theta ( \sigma^{x_1^S} \otimes \dots \otimes \sigma^{x_N^S} ) \big) \end{aligned} $$
(18.21)

and can be verified by direct computation.
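For concreteness, Eq. (18.21) can be checked numerically in the smallest nontrivial case N = 2. The sketch below (numpy; the random state θ and the setting x are arbitrary choices) computes the outcome distribution once directly from the projectors via (18.14) and once from Pauli expectation values via (18.21), using that χ S(j) equals the product of the outcomes j i over i ∈ S.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(2)
sig = {0: np.eye(2, dtype=complex),
       1: np.array([[0, 1], [1, 0]], dtype=complex),
       2: np.array([[0, -1j], [1j, 0]], dtype=complex),
       3: np.array([[1, 0], [0, -1]], dtype=complex)}

# eigenprojectors pi[x][j] of sigma^x, outcomes j in {+1, -1}
pi = {}
for x in (1, 2, 3):
    w, v = np.linalg.eigh(sig[x])
    pi[x] = {int(round(w[i])): np.outer(v[:, i], v[:, i].conj()) for i in (0, 1)}

# a random two-qubit density matrix theta
G = rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4))
theta = G @ G.conj().T
theta /= np.trace(theta).real

x = (1, 3)                                   # spin along axis 1 on qubit 1, axis 3 on qubit 2
for j in product((1, -1), repeat=2):
    # direct probability, Eq. (18.14)
    direct = np.trace(np.kron(pi[x[0]][j[0]], pi[x[1]][j[1]]) @ theta).real
    # reconstruction from Pauli expectations, Eq. (18.21)
    recon = 0.0
    for S in ((), (0,), (1,), (0, 1)):
        xS = tuple(x[i] if i in S else 0 for i in range(2))
        chi = 1.0
        for i in S:
            chi *= j[i]                      # chi_S(j) as a product of outcomes
        expval = np.trace(np.kron(sig[xS[0]], sig[xS[1]]) @ theta).real
        recon += chi * expval
    recon /= 4.0                             # the factor 2^{-N} with N = 2
    print(j, round(direct, 8), round(recon, 8))
```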

In this sense, the information obtainable from joint spin measurements on N particles can be encoded in the \(4^N\) real numbers

$$\displaystyle \begin{aligned} 2^{-N/2}\,tr ((\sigma^{y_1} \otimes \dots \otimes \sigma^{y_N})\theta), \qquad y \in \{ 0, 1, 2, 3 \}^N. \end{aligned} $$
(18.22)

Indeed, every such y arises as y = x S for some (generally non-unique) combination of x and S. This representation is particularly convenient from a mathematical point of view, as the collection of matrices

$$\displaystyle \begin{aligned} E^y := 2^{-N/2}\sigma^{y_1}\otimes \dots \otimes \sigma^{y_N}, \qquad y \in \{ 0, 1, 2, 3 \}^N \end{aligned} $$
(18.23)

forms an orthonormal basis with respect to the 〈⋅, ⋅〉F inner product. Thus the terms in Eq. (18.22) are just the coefficients of a basis expansion of the density matrix θ.

From now on, we will use Eq. (18.22) as our model for quantum tomographic measurements. Note that the E y satisfy Condition 18.1(b) with coherence constant K = 1 and \(d = 2^N\).
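The basis (18.23) is simple to generate. The sketch below (numpy; N = 3 is an arbitrary choice) builds all \(4^N\) matrices E y and verifies both the orthonormality claim and the coherence bound of Condition 18.1(b) with K = 1.

```python
import numpy as np
from itertools import product
from functools import reduce

sig = [np.eye(2, dtype=complex),
       np.array([[0, 1], [1, 0]], dtype=complex),
       np.array([[0, -1j], [1j, 0]], dtype=complex),
       np.array([[1, 0], [0, -1]], dtype=complex)]

N = 3
d = 2 ** N

# E^y = 2^{-N/2} sigma^{y_1} (x) ... (x) sigma^{y_N}, Eq. (18.23)
basis = [reduce(np.kron, (sig[yi] for yi in y)) / np.sqrt(d)
         for y in product(range(4), repeat=N)]

# orthonormality in the inner product <A, B>_F = tr(A* B)
gram = np.array([[np.trace(A.conj().T @ B) for B in basis] for A in basis])
print(np.allclose(gram, np.eye(4 ** N)))        # True

# coherence: ||E^y||_op = 1/sqrt(d), i.e. K = 1 in Condition 18.1(b)
print(np.allclose([np.linalg.norm(E, 2) for E in basis], 1 / np.sqrt(d)))  # True
```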

18.2.2.2 Bernoulli Errors and Pauli Observables

In the model (18.3) under Condition 18.1(b) we wish to approximate d ⋅tr(E yθ) for a fixed observable E y (we fix the random values of the X i’s here) and for \(d = 2^N\). If y = x S for some setting x and subset S, then the parity function B y := χ S(C x) has expected value \(2^{N/2} \cdot \mathrm {tr}(E^y \theta )=\sqrt {d} \cdot \mathrm{tr}(E^y \theta )\) (see Eqs. (18.20) and (18.23)), and is itself a Bernoulli variable taking values in {1, −1} with

$$\displaystyle \begin{aligned}p=\mathbb P(B^y=1)=\frac{1+\sqrt{d}\mathrm{tr}(E^y \theta)}{2}.\end{aligned}$$

Note that

$$\displaystyle \begin{aligned}\sqrt{d} |tr(E^y \theta)| \le \sqrt{d} \|E^y\|{}_{op} \|\theta\|{}_{S_1} \le 1,\end{aligned}$$

so indeed p ∈ [0, 1] and the variance satisfies

$$\displaystyle \begin{aligned}\mathrm{Var}B^y = 1 - d \cdot\mathrm{tr}(E^y \theta)^2 \le 1.\end{aligned}$$

This is the error model considered in Ref. [12].

In order to estimate all Y i, i = 1, …, n, for given E i := E y, a total of nT identical preparations of the quantum state θ is performed, divided into batches of T Bernoulli variables \(B_{i,j} := B^y_j, j=1, \dots , T\). The measurements of the sampling model Eq. (18.3) are thus

$$\displaystyle \begin{aligned}Y_i = \frac{\sqrt{d}}{T}\sum_{j=1}^T B_{i,j} = d \cdot\mathrm{tr}(E_i \theta) + \varepsilon_i\end{aligned}$$

where

$$\displaystyle \begin{aligned}\varepsilon_i = \frac{\sqrt{d}}{T}\sum_{j=1}^T (B_{i,j}-\mathbb E B_{i,j})\end{aligned}$$

is the effective error arising from the measurement procedure making use of T preparations to estimate each quantum mechanical expectation value. Now note that

$$\displaystyle \begin{aligned} |\varepsilon_i| \le 2\sqrt{d},~\mathbb E \varepsilon_i^2 \le \frac{d}{T} \mathrm{Var}(B_{i,1}) \le \frac{d}{T}. \end{aligned} $$
(18.24)

We see that, since the ε i’s are themselves sums of independent random variables, an approximate Gaussian error model with variance σ 2 is appropriate. If T ≥ n then \(\sigma ^2 = \mathbb E \varepsilon _1^2\) is no greater than d∕n, and if in addition T ≥ d 2 then all results in Sect. 18.3 below can be proved for this Bernoulli noise model too, see Remarks 18.5 and 18.6 for details.
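A short simulation illustrates this. In the sketch below (numpy; the coefficient value and the sizes d, T are hypothetical illustrations), one measurement Y i is formed from T Bernoulli parities, and the bounds of (18.24) are checked empirically.

```python
import numpy as np

rng = np.random.default_rng(3)
d, T = 2 ** 4, 500                       # dimension and preparations per observable

coeff = 0.3                              # hypothetical value of sqrt(d)*tr(E^y theta), in [-1, 1]
p = (1 + coeff) / 2                      # P(B^y = 1)

# one measurement Y_i as the rescaled average of T Bernoulli parities
B = rng.choice([1.0, -1.0], size=T, p=[p, 1 - p])
Y = np.sqrt(d) * B.mean()                # equals (sqrt(d)/T) * sum_j B_{i,j}
eps = Y - np.sqrt(d) * coeff             # effective error epsilon_i
print(abs(eps) <= 2 * np.sqrt(d))        # first bound in (18.24)

# empirical variance of epsilon_i over many repetitions vs. the bound d/T
sims = np.sqrt(d) * rng.choice([1.0, -1.0], size=(5000, T), p=[p, 1 - p]).mean(axis=1)
print((sims - np.sqrt(d) * coeff).var(), d / T)
```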

18.2.3 Minimax Estimation Under the RIP

Assuming the matrix RIP to hold and Gaussian noise ε, one can show that the minimax risk for recovering a Hermitian rank k matrix is

$$\displaystyle \begin{aligned} \inf_{\hat \theta} \sup_{\theta \in R(k)} \mathbb E _\theta \|\hat \theta- \theta\|{}^2_F \simeq \sigma^2 \frac{d k}{n}, \end{aligned} $$
(18.25)

where ≃ denotes two-sided inequality up to universal constants.

For the upper bound one can use the nuclear norm minimisation procedure or matrix Dantzig selector from Candès and Plan [6], and needs n to be large enough so that the matrix RIP holds with τ n(k) < c 0 where c 0 is a small enough numerical constant. Such an estimator \(\tilde \theta \) then satisfies, for every θ ∈ R(k) and those \(n \in \mathbb N\) for which τ n(k) < c 0,

$$\displaystyle \begin{aligned} \|\tilde \theta- \theta\|{}^2_F \leq D(\delta) \sigma^2 \frac{k d}{n}, \end{aligned} $$
(18.26)

with probability greater than 1 − 2δ, and with the constant D(δ) depending on δ and also on c 0 (suppressed in the notation). Note that the results in Ref. [6] use a different scaling in sample size in their Theorem 2.4, but eq. (II.7) in that reference explains that this is just a question of renormalisation. The same result holds for randomly sampled ‘Pauli bases’, see Ref. [28] (and take note of the slightly different normalisation in the notation there, too), and also for the Bernoulli noise model from Sect. 18.2.2.2, see Ref. [12].
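To give a flavour of the algorithms involved, the sketch below (numpy; the step size, penalty level and problem sizes are our own illustrative choices) implements a proximal-gradient matrix Lasso with singular-value soft thresholding. This is a nuclear-norm penalised variant in the spirit of, but not identical to, the estimators analysed in Refs. [6, 28].

```python
import numpy as np

def matrix_lasso(X, Y, lam, steps=300, eta=0.5):
    """Proximal gradient for min_theta (1/2n)||Y - X(theta)||^2 + lam*||theta||_{S_1}.
    The constant step size eta relies on the near-isometry (18.7); this is a
    sketch, not the exact Dantzig selector of Ref. [6]."""
    n, d = len(Y), X.shape[1]
    theta = np.zeros((d, d))
    for _ in range(steps):
        resid = np.einsum('imk,mk->i', X, theta) - Y
        grad = np.einsum('i,imk->mk', resid, X) / n
        Z = theta - eta * grad
        Z = (Z + Z.T) / 2                            # keep the iterate symmetric
        U, s, Vt = np.linalg.svd(Z)
        theta = (U * np.maximum(s - eta * lam, 0.0)) @ Vt   # soft-threshold singular values
    return theta

# synthetic check: rank-one symmetric truth, Gaussian design, noise level sigma
rng = np.random.default_rng(4)
d, n, sigma = 10, 300, 0.1
u = rng.standard_normal(d)
u /= np.linalg.norm(u)
theta0 = np.outer(u, u)
X = rng.standard_normal((n, d, d))
Y = np.einsum('imk,mk->i', X, theta0) + sigma * rng.standard_normal(n)
theta_hat = matrix_lasso(X, Y, lam=2 * sigma * np.sqrt(d / n))
print(np.sum((theta_hat - theta0) ** 2))             # squared Frobenius error, cf. (18.26)
```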

A key interpretation for quantum tomography applications is that, instead of having to measure all n = d 2 basis coefficients tr(E iθ), i = 1, …, d 2, a number

$$\displaystyle \begin{aligned}n \approx kd \overline{\log} d\end{aligned}$$

of randomly chosen basis measurements is sufficient to reconstruct θ in Frobenius norm loss (up to a small error). In situations where d is large compared to k such a gain can be crucial.

Remark 18.1 (Uniqueness)

It is worth noting that in the absence of errors, so when \(Y_0=\mathcal X \theta _0\) in terms of the sampling operator of Eq. (18.4), the quantum shape constraint ensures that under a suitable RIP condition, only the single matrix θ 0 is compatible with the data. More specifically, let \(Y_0=\mathcal X\theta _0\) for some θ 0 ∈ Θ+ of rank k, and assume that \(\mathcal X\) satisfies RIP with \(\tau _n(4k)<\sqrt {2}-1\). Then

$$\displaystyle \begin{aligned} \left\{ \theta\in \Theta_+: \mathcal X \theta=Y_0 \right\} = \{\theta_0\}. \end{aligned} $$
(18.27)

This is a direct consequence of Theorem 3.2 in Ref. [33], which states that if RIP is satisfied with \(\tau _n(4k)<\sqrt {2}-1\) and \(Y_0=\mathcal X\theta _0\), the unique solution of

$$\displaystyle \begin{aligned} \begin{array}{rcl} \mathrm{argmin} &\displaystyle \|\theta\|{}_{S_1}\\ \mathrm{subject \,\, to}&\displaystyle \mathcal X \theta=Y_0 \end{array} \end{aligned} $$
(18.28)

is given by θ 0. If θ 0 ∈ Θ+, the minimisation problem can be replaced by the following (compare also Ref. [23]):

$$\displaystyle \begin{aligned} \begin{array}{rcl} \mathrm{argmin} &\displaystyle \mathrm{tr}(\theta)\\ \mathrm{subject \,\, to}&\displaystyle \mathcal X \theta=Y_0,\,\, \theta \succeq 0, \end{array} \end{aligned} $$
(18.29)

giving rise to the above remark. This observation further highlights the role played by the quantum shape constraint.

18.3 Uncertainty Quantification for Low-Rank Matrix Recovery

We now turn to the problem of quantifying the uncertainty of estimators \(\tilde \theta \) that satisfy the risk bound (18.26). In fact the confidence sets we construct could be used for any estimator of θ, but the conclusions are most interesting when used for minimax optimal estimators \(\tilde \theta \). For the main flow of ideas we shall assume ε = (ε 1, …, ε n)T ∼ N(0, σ 2I n) but the results hold for the Bernoulli measurement model from Sect. 18.2.2.2 as well—this is summarised in Remark 18.5.

From a statistical point of view, we phrase the problem at hand as the one of constructing a confidence set for θ: a data-driven subset C n of \(\mathbb M_d(\mathbb C)\) that is ‘centred’ at \(\tilde \theta \) and that satisfies

$$\displaystyle \begin{aligned}\mathbb P_\theta (\theta \in C_n) \ge 1- \alpha,~~~0<\alpha<1,\end{aligned}$$

for a chosen ‘coverage’ or significance level 1 − α, and such that the Frobenius norm diameter |C n|F reflects the accuracy of estimation, that is, it satisfies, with high probability,

$$\displaystyle \begin{aligned}|C_n|{}^2_F \approx \|\tilde \theta - \theta\|{}_F^2.\end{aligned}$$

In particular such a confidence set provides, through its diameter |C n|F, a data-driven estimate of how well the algorithm has recovered the true matrix θ in Frobenius-norm loss, and in this sense provides a quantification of the uncertainty in the estimate.

In the situation of an experimentalist this can be used to decide sequentially whether more measurements should be taken (to improve the recovery rate), or whether a satisfactory performance has been reached. Concretely, if for some 𝜖 > 0 a recovery level \(\|\tilde \theta - \theta \|{ }_F \le \epsilon \) is desired for an estimator \(\tilde \theta \), then assuming \(\tilde \theta \) satisfies the minimax optimal risk bound dk∕n from (18.26), we expect to need, ignoring constants,

$$\displaystyle \begin{aligned}\frac{d k}{n}< \epsilon^2 \text{ and hence at least } n > \frac{d k}{\epsilon^2}\end{aligned}$$

measurements. Note that we also need the RIP to hold with τ n(k) from (18.8) less than a small constant c 0, which requires the same number of measurements, increased by a further poly-log factor of d (and independently of σ).

Since the rank k of θ remains unknown after estimation, we cannot of course guarantee that the recovery level 𝜖 has been reached after a given number of measurements. A confidence set C n centred at \(\tilde \theta \) provides such certificates with high probability, by checking whether |C n|F ≤ 𝜖, and by continuing to take further measurements if not. The main goal is then to prove that a sequential procedure based on C n does not require more than approximately

$$\displaystyle \begin{aligned}n>\frac{dk \overline{\log} d}{\epsilon^2}\end{aligned}$$

samples (with high probability). We construct confidence procedures in the following subsections that work with at most as many measurements, for the designs from Condition 18.1.

18.3.1 Adaptive Sequential Sampling

Before we describe our confidence procedures, let us make the following definition, where we recall that R(k) denotes the set of d × d Hermitian matrices of rank at most k ≤ d.

Definition 18.1

Let 𝜖 > 0, δ > 0 be given constants. An algorithm \(\mathcal A\) returning a d × d matrix \(\hat \theta \) after \(\hat n \in \mathbb N\) measurements in model (18.3) is called an (𝜖, δ)-adaptive sampling procedure if, with \(\mathbb P_\theta \)-probability greater than 1 − δ, the following properties hold for every θ ∈ R(k) and every 1 ≤ k ≤ d:

$$\displaystyle \begin{aligned} \|\hat \theta - \theta \|{}_{F} \le \epsilon, \end{aligned} $$
(18.30)

and, for positive constants C(δ), γ, the stopping time \(\hat n\) satisfies

$$\displaystyle \begin{aligned} \hat n \le C(\delta) \frac{k d (\log d)^\gamma}{\epsilon^2} . \end{aligned} $$
(18.31)

Such an algorithm provides recovery at given accuracy level 𝜖 with \(\hat n\) measurements of minimax optimal order of magnitude (up to a poly-log factor), and with probability greater than 1 − δ. The sampling algorithm is adaptive since it does not require the knowledge of k, and since the number of measurements required depends only on k and not on the ‘worst case’ rank d.

The construction of non-asymptotic confidence sets C n for θ at any sample size n in the next subsections will imply that such algorithms exist for low rank matrix recovery problems. The main idea is to check sequentially, for a geometrically increasing number \(2^m\) of samples, m = 1, 2, …, whether the diameter \(|C_{2^m}|{ }_F\) of a confidence set exceeds 𝜖. If this is not the case, the algorithm terminates. Otherwise one takes \(2^{m+1}\) additional measurements and evaluates the diameter \(|C_{2^{m+1}}|{ }_F\). A sketch of this scheme is given below; a precise description of the algorithm is given in the proof of the following theorem, which we detail for the case of ‘Pauli’ designs. The isotropic design case is discussed in Remark 18.9.
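A minimal sketch of the doubling scheme follows. All three callables are placeholders for this illustration, not objects defined in the text: take_sample(m) collects m fresh measurements, estimate(sample) returns \(\tilde \theta \), and conf_diameter(sample, theta_tilde) evaluates |C n|F on an independent second sample as in Sect. 18.3.2.

```python
def adaptive_sampling(take_sample, estimate, conf_diameter, eps, max_stage=20):
    """Doubling scheme: at stage m, spend 2^m measurements on estimation and
    2^m on the confidence set, and stop once |C_n|_F <= eps."""
    total = 0
    for m in range(1, max_stage + 1):
        n = 2 ** m
        est_sample = take_sample(n)          # first half: build the estimator
        theta_tilde = estimate(est_sample)
        cs_sample = take_sample(n)           # second half: uncertainty quantification
        total += 2 * n
        if conf_diameter(cs_sample, theta_tilde) <= eps:
            return theta_tilde, total        # certified recovery at accuracy eps
    raise RuntimeError('measurement budget exhausted before reaching eps')
```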

Theorem 18.1

Consider observations in the model (18.3) under Condition 18.1(b) with θ ∈ Θ+. Then an adaptive sampling algorithm in the sense of Definition 18.1 exists for any 𝜖, δ > 0.

Remark 18.2 (Dependence in σ of Definition 18.1 and Theorem 18.1)

Definition 18.1 and Theorem 18.1 are stated for the case where the standard deviation of the noise σ is assumed to be bounded by an absolute constant. It is straightforward to modify the proofs to obtain a version where the dependence of the constants on the variance is explicit. Indeed, under Condition 1(a), Theorem 18.1 continues to hold if Eq. (18.31) is replaced by

$$\displaystyle \begin{aligned} \hat n \le C(\delta) \frac{\sigma^2k d (\log d)^\gamma}{\epsilon^2}. \end{aligned}$$

For the ‘Pauli design case’—Condition 1(b)—Eq. (18.31) can be modified to

$$\displaystyle \begin{aligned} \hat n \le C(\delta) \Big(\frac{\sigma^2k d (\log d)^\gamma}{\epsilon^2} \lor \frac{d (\log d)^\gamma}{\epsilon^2}\Big). \end{aligned}$$

Remark 18.3 (Necessity of the Quantum Shape Constraint)

Note that the assumption θ ∈ Θ+ in the previous theorem is necessary (in the case of Pauli design): otherwise the example of θ = 0 or θ = E i—where E i is an arbitrary element of the Pauli basis—demonstrates that the number of measurements has to be at least of order d 2: with positive probability, E i is not drawn at a fixed sample size. On this event, both the measurements and \(\hat \theta \) coincide under the laws \(\mathbb P_0\) and \(\mathbb P_{E_i}\), so we cannot have \(\|\hat \theta -0\|{ }_F < \epsilon \) and \(\|\hat \theta - E_i\|{ }_F<\epsilon \) simultaneously for every 𝜖 > 0, disproving existence of an adaptive sampling algorithm. In fact, the crucial condition for Theorem 18.1 to work is that the nuclear norm \(\|\theta \|{ }_{S_1}\) is bounded by an absolute constant (here = 1), which is violated by \(\|E_i\|{ }_{S_1} = \sqrt {d}\).

18.3.2 A Non-asymptotic Confidence Set Based on Unbiased Risk Estimation and Sample-Splitting

We suppose that we have two samples at hand, the first being used to construct an estimator \(\tilde \theta \), such as the one from (18.26). We freeze \(\tilde \theta \) and the first sample in what follows and all probabilistic statements are under the distribution \(\mathbb P_\theta \) of the second sample Y, X of size \(n \in \mathbb N\), conditional on the value of \(\tilde \theta \). We define the following residual sum of squares statistic (recalling that σ 2 is known):

$$\displaystyle \begin{aligned}\hat r_n = \frac{1}{n}\|Y-\mathcal X \tilde \theta\|{}^2 - \sigma^2,\end{aligned}$$

which satisfies \(\mathbb E _\theta \hat r_n = \|\theta -\tilde \theta \|{ }_F^2\) as is easily seen (see the proof of Theorem 18.2 below). Given α > 0, let ξ α,σ be quantile constants such that

$$\displaystyle \begin{aligned} \Pr\left(\sum_{i=1}^n(\varepsilon_i^2-\sigma^2)>\xi_{\alpha, \sigma} \sqrt{n}\right) = \alpha \end{aligned} $$
(18.32)

(these constants converge to the quantiles of a fixed normal distribution as n →∞), let \(z_\alpha =\log (3/\alpha )\) and, for z ≥ 0 a fixed constant to be chosen, define the confidence set

$$\displaystyle \begin{aligned} C_n = \left\{v \in \mathbb H_d(\mathbb C): \|v-\tilde \theta\|{}_F^2 \le 2 \left(\hat r_n + z \frac{d}{n} + \frac{\bar z+\xi_{\alpha/3, \sigma}}{\sqrt{n}}\right) \right\}, \end{aligned} $$
(18.33)

where

$$\displaystyle \begin{aligned}\bar z^2= \bar z^2(\alpha,d,n, \sigma, v) = z_{\alpha/3}\sigma^2 \max(3\|v-\tilde \theta\|{}^2_F, 4zd/n).\end{aligned}$$

Note that in the ‘quantum shape constraint’ case we can always bound \(\|v-\tilde \theta \|{ }_F \le 2\) which gives a confidence set that is easier to compute and of only marginally larger overall diameter. In many important situations, however, the quantity \(\bar z/\sqrt {n}\) is of smaller order than \(1/\sqrt {n}\), and the more complicated expression above is preferable.

It is not difficult to see (using that \(x^2 \lesssim y+x/\sqrt {n}\) implies \(x^2 \lesssim y +1/n\)) that the square Frobenius norm diameter of this confidence set is, with high probability, of order

$$\displaystyle \begin{aligned} |C_n|{}^2_F\lesssim \|\tilde \theta -\theta\|{}_F^2 + \frac{zd + z_{\alpha/3}}{n} + \frac{\xi_{\alpha/3, \sigma}}{\sqrt{n}}. \end{aligned} $$
(18.34)

Whenever \(d \ge \sqrt {n}\)—so as long as at most n ≤ d 2 measurements have been taken—the deviation terms are of smaller order than kd∕n, and hence C n has minimax optimal expected squared diameter whenever the estimator \(\tilde \theta \) is minimax optimal as in (18.26). Improvements for \(d < \sqrt {n}\), corresponding to n > d 2 measurements, will be discussed in the next subsections.
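A sketch of how the radius of C n in (18.33) can be computed from data is given below (numpy). Two simplifications, flagged in the comments, are our own: the quantile constant of (18.32) is obtained by Monte Carlo rather than analytically, and the crude bound ∥v −θ̃∥F ≤ 2 valid under the quantum shape constraint is used in z̄, so the radius does not depend on v.

```python
import numpy as np

def xi_quantile(n, sigma, alpha, sims=5000, seed=5):
    """Monte Carlo approximation of the quantile constant in (18.32)."""
    rng = np.random.default_rng(seed)
    eps = sigma * rng.standard_normal((sims, n))
    stats = (eps ** 2 - sigma ** 2).sum(axis=1) / np.sqrt(n)
    return np.quantile(stats, 1 - alpha)

def cn_radius_sq(Y, X, theta_tilde, sigma, alpha, z):
    """Squared Frobenius radius of C_n in (18.33)."""
    n, d = len(Y), X.shape[1]
    resid = Y - np.einsum('imk,mk->i', X, theta_tilde)
    r_hat = resid @ resid / n - sigma ** 2            # unbiased risk estimate
    z_a3 = np.log(9 / alpha)                          # z_{alpha/3}, with z_alpha = log(3/alpha)
    # z_bar with ||v - theta_tilde||_F^2 bounded by 4 (quantum shape constraint)
    z_bar = np.sqrt(z_a3 * sigma ** 2 * max(12.0, 4 * z * d / n))
    return 2 * (r_hat + z * d / n + (z_bar + xi_quantile(n, sigma, alpha / 3)) / np.sqrt(n))
```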

The following result shows that C n is an honest confidence set for arbitrary d × d matrices (without any rank constraint). Note that the result is non-asymptotic—it holds for every \(n \in \mathbb N\).

Theorem 18.2

Let \(\theta \in \mathbb H_d(\mathbb C)\) be arbitrary and let \(\mathbb P_\theta \) be the distribution of Y, X from model (18.3).

  1. (a)

    Assume Condition 18.1(a) and let C n be given by (18.33) with z = 0. We then have for every \(n \in \mathbb N\) that

    $$\displaystyle \begin{aligned}\mathbb P_\theta(\theta \in C_n) \ge 1-\frac{2\alpha}{3} - 2 e^{-c n}\end{aligned}$$

    where c is a numerical constant. In the case of standard Gaussian design, c = 1∕24 is admissible.

  2. (b)

    Assume Condition 18.1(b), let C n be given by (18.33) with z > 0 and assume also that θ ∈ Θ+ and \(\tilde \theta \in \Theta _+\) (that is, both satisfy the ‘quantum shape constraint’). Then for every \(n \in \mathbb N\),

    $$\displaystyle \begin{aligned}\mathbb P_\theta(\theta \in C_n) \ge 1-\frac{2\alpha}{3} - 2 e^{-C(K) z}\end{aligned}$$

    where, for K the coherence constant of the basis,

    $$\displaystyle \begin{aligned}C(K) = \frac{1}{(16+8/3)K^2}.\end{aligned}$$

In Part (a), if we want to control the coverage probability at level 1 − α, n needs to be large enough so that the third deviation term is controlled at level α∕3. In the Gaussian design case with α = 0.05, n ≥ 100 is sufficient; for smaller sample sizes one can reduce the coverage level. The bound in (b) is entirely non-asymptotic (using the quantum constraint) for suitable choices of z. Also note that the quantile constants z, z α, ξ α all scale at most as \(O(\log (1/\alpha ))\) in the desired coverage level α → 0.

Remark 18.4 (Dependence of the Confidence Set’s Diameter on K (Pauli Design) and σ)

Note that in the case of the Pauli design from Condition 1(b), the confidence set’s diameter depends on K only through the potential dependence of \(\|\theta - \tilde \theta \|{ }_F^2\) on K—the constants involved in the construction of \(C_n\) and in the bound on its diameter do not depend on K. On the other hand, the coverage probability of the confidence set depends on K, see Theorem 18.2(b).

In this paper we assume that σ is a universal constant, and as such it does not appear in Eqs. (18.33) and (18.34). It can however be interesting to investigate the dependence on σ. In the case of isotropic design from Condition 1(a), we could set

$$\displaystyle \begin{aligned} C_n = \left\{v \in \mathbb H_d(\mathbb C): \|v-\tilde \theta\|{}_F^2 \le 2 \left(\hat r_n + z c\sigma^2 \frac{d}{n} + \sigma^2\frac{\bar z+\xi_{\alpha/3, \sigma}}{\sqrt{n}}\right) \right\}, \end{aligned}$$

(where σ 2 could be replaced by twice the plug-in estimator of σ 2, using \(\hat \theta \)) and one would get

$$\displaystyle \begin{aligned} \mathbb E _\theta|C_n|{}^2_F\lesssim \|\tilde \theta -\theta\|{}_F^2 + \sigma^2 \frac{zd + z_{\alpha/3}}{n} + \sigma^2\frac{\xi_{\alpha/3, \sigma}}{\sqrt{n}}, \end{aligned}$$

and Theorem 18.2 also holds by introducing minor changes in the proof. In the case of the Pauli design from Condition 1(b), we could set

$$\displaystyle \begin{aligned} C_n = \left\{v \in \mathbb H_d(\mathbb C): \|v-\tilde \theta\|{}_F^2 \le 2 \left(\hat r_n + z c \frac{d}{n} + \sigma^2\frac{\bar z+\xi_{\alpha/3, \sigma}}{\sqrt{n}}\right) \right\}, \end{aligned}$$

(where σ 2 could be replaced by twice the plug-in estimator of σ 2, using \(\hat \theta \)) and one would get

$$\displaystyle \begin{aligned} \mathbb E _\theta|C_n|{}^2_F\lesssim \|\tilde \theta -\theta\|{}_F^2 + \frac{zd + z_{\alpha/3}}{n} + \sigma^2\frac{\xi_{\alpha/3, \sigma}}{\sqrt{n}}, \end{aligned}$$

and Theorem 18.2 also holds by introducing minor changes in the proof. In this case we do not get a full dependence on σ as in the isotropic design case from Condition 1(a). However, if \(k^2d \lesssim n\), we could also obtain a result similar to the one for the Gaussian design, using part (c) of Lemma 18.1.

Remark 18.5 (Bernoulli Noise)

Theorem 18.2(b) holds as well for the Bernoulli measurement model from Sect. 18.2.2.2 with T ≥ d 2, with slightly different constants in the construction of C n and the coverage probabilities. See Remark 18.10 after the proof of Theorem 18.2(b) below. The modified quantile constants z, z α, ξ α still scale as \(O(\sqrt {1/\alpha })\) in the desired coverage level α → 0, and hence the adaptive sampling Theorem 18.1 holds for such noise too, if the number T of preparations of the quantum state exceeds d 2.

Remark 18.6 (Unknown Variance)

The above confidence set C n can be constructed with \(\tilde r_n= \frac {1}{n}\|Y-\mathcal X \tilde \theta \|{ }^2\) replacing \(\hat r_n\)—so without requiring knowledge of σ—if an a priori bound σ 2 ≤ vd∕n is available, with v a known constant. An example of such a situation was discussed at the end of Sect. 18.2.2.2 above for quantum tomography problems: when T ≥ n, the constant z should be increased by v in the construction of C n, and the coverage proof goes through as well by compensating for the centring at \(\mathbb E \varepsilon _i^2=\sigma ^2\) with the additional deviation constant v.

Remark 18.7 (Anisotropic Design Instead of Condition 1(a))

It is also interesting to consider the case of anisotropic design. This case is not very different, when it comes to confidence sets, from the isotropic one, as long as the variance-covariance matrix of the anisotropic sub-Gaussian design is such that the ratio of its largest to its smallest eigenvalue is bounded. Lemma 18.1(a), which quantifies the effect of the design, would change as follows: There exist constants c −, c +, c > 0 that depend only on the variance-covariance matrix of the anisotropic sub-Gaussian design and are such that

$$\displaystyle \begin{aligned}\Pr \left(c_- \|\vartheta\|{}_F^2\leq \frac{1}{n}\|\mathcal X\vartheta\|{}^2 \leq c_+ \|\vartheta\|{}_F^2\right) \geq 1 - 2 e^{-c n}.\end{aligned}$$

Using this instead of the inequality in Lemma 18.1(a) in the proof of Theorem 18.2, part (a) leads to a similar result as Theorem 18.2, part (a).

18.3.3 Improvements When \(d \le \sqrt {n}\)

The confidence set from Theorem 18.2 is optimal whenever the desired performance of \(\|\theta -\tilde \theta \|{ }_F^2\) is no better than of order \(1/\sqrt {n}\). From a minimax point of view we expect \(\|\theta -\tilde \theta \|{ }_F^2\) to be of order kd∕n for low rank θ ∈ R(k). In the absence of knowledge about k ≥ 1, the confidence set from Theorem 18.2 can hence be guaranteed to be optimal whenever \(d \ge \sqrt {n}\), corresponding to the important regime n ≤ d 2 for sequential sampling algorithms. Refinements for measurement scales n ≥ d 2 are also of interest—we present two optimal approaches in this subsection for the designs from Condition 18.1.

18.3.3.1 Isotropic Design and U-Statistics

Consider first the isotropic i.i.d. design from Condition 18.1(a), and an estimator \(\tilde \theta \) based on an initial sample of size n (all statements that follow are conditional on that sample). Collect another n samples to perform the uncertainty quantification step. Define the U-statistic

$$\displaystyle \begin{aligned} \hat R_n = \frac{2}{n(n-1)} \sum_{i<j} \sum_{m,k} (Y_i X^i_{m,k}-\tilde \theta_{m,k}) (Y_j X^j_{m,k}-\tilde \theta_{m,k}) \end{aligned} $$
(18.35)

whose \(\mathbb E _\theta \)-expectation, conditional on \(\tilde \theta \), equals \(\|\theta -\tilde \theta \|{ }_F^2\) in view of

$$\displaystyle \begin{aligned}\mathbb E Y_i X_{m,k}^i = \mathbb E \sum_{m',k'} X^i_{m',k'}X^i_{m,k} \theta_{m',k'} = \theta_{m,k}.\end{aligned}$$

Define

$$\displaystyle \begin{aligned} C_n = \left\{v \in \mathbb H_d(\mathbb R): \|v - \tilde \theta\|{}^2_F \le \hat R_n + z_{\alpha,n} \right\} \end{aligned} $$
(18.36)

where

$$\displaystyle \begin{aligned}z_{\alpha, n} = \frac{C_1 \|\theta-\tilde \theta\|{}_F}{\sqrt{n}} + \frac{C_2 d}{n}\end{aligned}$$

and \(C_1 \ge \zeta _1 \|\theta \|{ }_F,~C_2 \ge \zeta _2 \|\theta \|{ }_F^2\) with ζ i constants depending on α, σ. Note that if θ ∈ Θ+ then ∥θF ≤ 1 can be used as an upper bound. In practice the constants ζ i can be calibrated by Monte Carlo simulations (see the implementation section below), or chosen based on concentration inequalities for U-statistics (see Ref. [14], Theorem 4.4.8). This confidence set has expected diameter

$$\displaystyle \begin{aligned}\mathbb E _\theta |C_n|{}^2_F \lesssim \|\tilde \theta- \theta\|{}_F^2 +\frac{C_1+C_2d}{n},\end{aligned}$$

and hence is compatible with any minimax recovery rate \(\|\tilde \theta - \theta \|{ }_F^2 \lesssim kd/n\) from (18.26), where k ≥ 1 is now arbitrary. For suitable choices of ζ i we now show that C n also has non-asymptotic coverage.
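The double sum in (18.35) need not be evaluated pairwise. Writing A i = Y iX i −θ̃, the statistic equals \((\|\sum _i A_i\|{ }_F^2 - \sum _i \|A_i\|{ }_F^2)/(n(n-1))\), which the sketch below (numpy; the truth and pilot estimate are hypothetical placeholders with illustrative sizes) exploits.

```python
import numpy as np

def u_stat(Y, X, theta_tilde):
    """R_hat_n of (18.35) in O(n d^2) time, via
    2 * sum_{i<j} <A_i, A_j>_F = ||sum_i A_i||_F^2 - sum_i ||A_i||_F^2,
    with A_i = Y_i X^i - theta_tilde."""
    n = len(Y)
    A = Y[:, None, None] * X - theta_tilde
    S = A.sum(axis=0)
    return (np.sum(S ** 2) - np.einsum('imk,imk->', A, A)) / (n * (n - 1))

# sanity check: the statistic is centred at ||theta - theta_tilde||_F^2
rng = np.random.default_rng(6)
d, n, sigma = 8, 2000, 0.1
theta = np.eye(d) / d                       # hypothetical truth
theta_tilde = theta + 0.05 * np.eye(d)      # hypothetical pilot estimate
X = rng.standard_normal((n, d, d))
Y = np.einsum('imk,mk->i', X, theta) + sigma * rng.standard_normal(n)
print(u_stat(Y, X, theta_tilde), np.sum((theta - theta_tilde) ** 2))
```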

Theorem 18.3

Assume Condition 18.1(a), and let C n be as in (18.36). For every α > 0 we can choose \(\zeta _i(\alpha )=O(\sqrt {1/\alpha }), i=1,2,\) large enough so that for every \(n \in \mathbb N\) we have

$$\displaystyle \begin{aligned}\mathbb P_\theta (\theta \in C_n) \ge 1-\alpha.\end{aligned}$$

Remark 18.8 (Dependence of the Confidence Set’s Diameter on σ)

As noted in Remark 18.4, Theorem 18.3 does not make explicit the dependence on σ, which is assumed to be (bounded by) a universal constant. In order to take the dependence on σ into account, we could replace z α,n in Eq. (18.36) by \( \frac {C_1 \|\theta -\tilde \theta \|{ }_F}{\sqrt {n}} + \sigma ^2\frac {C_2 d}{n}\) (where σ 2 could be replaced by twice the plug-in estimator of σ 2, using \(\hat \theta \)), and we would get

$$\displaystyle \begin{aligned}\mathbb E _\theta |C_n|{}^2_F \lesssim \|\tilde \theta- \theta\|{}_F^2 +\sigma^2 \frac{C_1+C_2d}{n},\end{aligned}$$

and Theorem 18.3 also holds by introducing minor changes in the proof.

18.3.3.2 Re-averaging Basis Elements When \(d \le \sqrt {n}\)

Consider the setting of Condition 18.1(b) where we sample uniformly at random from a (scaled) basis \(\{dE_1, \dots , dE_{d^2}\}\) of \(\mathbb M_d(\mathbb C)\). When \(d \le \sqrt {n}\) we are taking n ≥ d 2 measurements, and there is no need to sample at random from the basis as we can measure each individual coefficient, possibly even multiple times. Repeatedly sampling a basis coefficient tr(E kθ) leads to a reduction of the variance of the measurement by averaging. More precisely, when taking n = md 2 measurements for some (for simplicity integer) m ≥ 1, and if (Y k,l : l = 1, …, m) are the measurements Y i corresponding to the basis element E k, k ∈{1, …, d 2}, we can form averaged measurements

$$\displaystyle \begin{aligned}Z_k = \frac{1}{\sqrt m} \sum_{l=1}^m Y_{k,l} = \sqrt m d\langle E_{k}, \theta \rangle_F + \epsilon_k, ~~\epsilon_k = \frac{1}{\sqrt m} \sum_{l=1}^{m}\varepsilon_l \sim N(0,\sigma^2).\end{aligned}$$

We can then define the new measurement vector \(\tilde Z = (\tilde Z_1, \dots , \tilde Z_{d^2})^T\) (using also m = n∕d 2)

$$\displaystyle \begin{aligned}\tilde Z_k = Z_k - \sqrt{n}\langle E_k, \tilde \theta \rangle_F = \sqrt{n} \langle E_k, \theta - \tilde \theta \rangle_F + \epsilon_k, ~~k =1, \dots, d^2\end{aligned}$$

and the statistic

$$\displaystyle \begin{aligned}\hat R_n = \frac{1}{n}\|\tilde Z\|{}_{\mathbb R^{d^2}}^2 - \frac{\sigma^2d^2}{n}\end{aligned}$$

which estimates \(\|\theta - \tilde \theta \|{ }_F^2\) with precision

$$\displaystyle \begin{aligned} \hat R_n - \|\theta-\tilde \theta\|{}_F^2 & = \frac{2}{\sqrt{n}} \sum_{k=1}^{d^2}\epsilon_k \langle E_k, \theta - \tilde \theta \rangle_F + \frac{1}{n}\sum_{k=1}^{d^2}(\epsilon^2_k-\mathbb E \epsilon^2) \\ &= O_P\left(\frac{\sigma\|\theta-\tilde \theta\|{}_F}{\sqrt{n}} + \frac{\sigma^2 d}{n}\right) \notag. \end{aligned} $$

Hence, for z α the quantiles of a N(0, 1) distribution and ξ α,σ as in (18.32) with d 2 replacing n there, we can define a confidence set

$$\displaystyle \begin{aligned} \bar C_n = \left\{v \in \mathbb H_d(\mathbb C): \|v - \tilde \theta\|{}_F^2 \le \hat R_n + \frac{z_{\alpha/2} \sigma \|\theta -\tilde \theta\|{}_F}{\sqrt{n}} + \frac{\xi_{\alpha/2, \sigma} d}{n} \right\} \end{aligned} $$
(18.37)

which has non-asymptotic coverage

$$\displaystyle \begin{aligned}\mathbb P_\theta (\theta \in \bar C_n) \ge 1- \alpha\end{aligned}$$

for every \(n \in \mathbb N\), by similar (in fact, since Lemma 18.1 is not needed, simpler) arguments as in the proof of Theorem 18.2 below. The expected diameter of \(\bar C_n\) is by construction

$$\displaystyle \begin{aligned} \mathbb E _\theta |\bar C_n|{}^2_F \lesssim \|\theta - \tilde \theta\|{}_F^2 + \frac{\sigma^2d}{n}, \end{aligned} $$
(18.38)

now compatible with any rate of recovery kd∕n, 1 ≤ k ≤ d.
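A sketch of the whole re-averaging construction is given below (numpy). The replacement of the unknown ∥θ −θ̃∥F in (18.37) by the plug-in \(\sqrt {\max (\hat R_n,0)}\) is our own simplification, not made in the text.

```python
import numpy as np

def reaveraged_cs(Y_batches, coeff_tilde, sigma, z_half, xi_half):
    """Re-averaging construction of this subsection. Y_batches is a (d**2, m)
    array of m repeated measurements of each basis coefficient, coeff_tilde
    holds the d**2 numbers <E_k, theta_tilde>_F, and z_half, xi_half are the
    alpha/2 quantile constants of the text. Returns (R_hat, squared radius)."""
    d2, m = Y_batches.shape
    d = int(round(np.sqrt(d2)))
    n = d2 * m
    Z = Y_batches.sum(axis=1) / np.sqrt(m)           # averaged measurements Z_k
    Z_tilde = Z - np.sqrt(n) * coeff_tilde           # centred at the pilot estimate
    R_hat = Z_tilde @ Z_tilde / n - sigma ** 2 * d2 / n
    dist = np.sqrt(max(R_hat, 0.0))                  # plug-in for ||theta - theta_tilde||_F
    radius_sq = R_hat + z_half * sigma * dist / np.sqrt(n) + xi_half * d / n
    return R_hat, radius_sq
```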

18.3.4 A Confidence Set in Trace Norm Under Quantum Shape Constraints

The confidence sets from the previous subsections are all valid in the sense that they contain information about the recovery of θ by \(\tilde \theta \) in the Frobenius norm ∥⋅∥F. It is of interest to obtain results in stronger norms, such as the nuclear norm \(\|\cdot \|{ }_{S_1}\), which is particularly meaningful for quantum tomography problems since it then corresponds to the total variation distance on the set of ‘probability density matrices’. In fact, since

$$\displaystyle \begin{aligned} \frac{1}{2}\|\theta-\tilde\theta\|{}_{S_1} = \sup_{0 \preceq X \preceq I} tr \left(X (\theta-\tilde\theta)\right), \end{aligned} $$
(18.39)

the nuclear norm has a clear interpretation in terms of the maximum probability with which two quantum states can be distinguished by arbitrary measurements.

The absence of the ‘Hilbert space geometry’ induced by the relationship of the Frobenius norm to the inner product 〈⋅, ⋅〉F makes this problem significantly harder, both technically and from an information-theoretic point of view. In particular it appears that the quantum shape constraint θ ∈ Θ+ is crucial to obtain any results whatsoever, and for the theoretical results presented here it will be more convenient to perform an asymptotic analysis where \(\min (n,d) \to \infty \) (with o, O-notation to be understood accordingly).

Instead of Condition 18.1 we shall now consider any design (X 1, …, X n) in model (18.3) that satisfies the matrix RIP (18.7) with

$$\displaystyle \begin{aligned} \tau_n (k) = c \sqrt{kd \frac{\overline{\log} (d)}{n}}. \end{aligned} $$
(18.40)

As discussed above, this covers in particular the designs from Condition 18.1. We shall still use the convention discussed before Condition 18.1 that θ and the matrices X i are such that tr(X iθ) is always real-valued.

In contrast to the results from the previous section we shall now assume a minimal low rank constraint on the parameter space:

Condition 18.2

θ ∈ R +(k) := R(k) ∩ Θ + for some k satisfying

$$\displaystyle \begin{aligned}k\sqrt{\frac{d \overline{\log} d}{n}} = o(1).\end{aligned}$$

This in particular implies that the RIP holds with τ n(k) = o(1). Given this minimal rank constraint θ ∈ R +(k), we now show that it is possible to construct a confidence set C n that adapts to any low rank 1 ≤ k 0 < k. Here we may choose k = d but note that this forces n ≫ d 2 (for Condition 18.2 to hold with k = d).

We assume that there exists an estimator \(\tilde \theta _{\mathrm {Pilot}}\) that satisfies, uniformly in R(k 0) for any k 0 ≤ k and for n large enough,

$$\displaystyle \begin{aligned} \|\tilde \theta_{\mathrm{Pilot}} - \theta\|{}_F^2 \leq D \sigma^2 \frac{k_0 d }{n} := \frac{r^2_n(k_0)}{4} \end{aligned} $$
(18.41)

where D = D(δ) depends on δ, and where the quantity \(r_n(k_0)\) so defined will be used frequently below. Such estimators exist, as has already been discussed before (18.26). We shall in fact require a little more, namely the following oracle inequality: for any k ≤ d and any matrix S of rank k, with high probability and for n large enough,

$$\displaystyle \begin{aligned} \|\tilde \theta_{\mathrm{Pilot}} - \theta\|{}_F \lesssim \|\theta - S\|{}_F + r_n(k), \end{aligned} $$
(18.42)

which in fact implies (18.41). Estimators satisfying such oracle inequalities exist assuming the RIP and Condition 18.2, see, e.g., Theorem 2.8 in Ref. [6]. Starting from \(\tilde \theta _{\mathrm {Pilot}}\) one can construct (see Theorem 18.5 below) an estimator that recovers θ ∈ R(k) in nuclear norm at rate \(k \sqrt {d/n}\), which is again optimal from a minimax point of view, even under the quantum constraint (as discussed, e.g., in Ref. [24]). We now construct an adaptive confidence set for θ centred at a suitable projection of \(\tilde \theta _{\mathrm {Pilot}}\) onto Θ+.

In the proof of Theorem 18.4 below we will construct estimated eigenvalues \((\hat \lambda _j, j=1, \dots , d)\) of θ (see after Lemma 18.3). Given those eigenvalues and \(\tilde \theta _{\mathrm {Pilot}}\), we choose \(\hat k\) to equal the smallest integer ≤ d such that there exists a rank \(\hat k\) matrix \(\tilde \theta '\) for which

$$\displaystyle \begin{aligned}\|\tilde \theta' - \tilde \theta_{\mathrm{Pilot}}\|{}_F \le r_n(\hat k) \text{ and } 1 - \sum_{J \le \hat k} \hat \lambda_J \leq 2 \hat k \sqrt{d/n}\end{aligned}$$

is satisfied. Such \(\hat k\) exists with high probability (since the inequalities are satisfied for the true θ and λ j’s, as our proofs imply). Define next \(\hat \vartheta \) to be the 〈⋅, ⋅〉F-projection of \(\tilde \theta _{\mathrm {Pilot}}\) onto

$$\displaystyle \begin{aligned}R^+(2\hat k) := R(2 \hat k) \cap \Theta_+\end{aligned}$$

and note that, since \(2 \hat k \ge \hat k\),

$$\displaystyle \begin{aligned} \|\tilde \theta_{\mathrm{Pilot}} - \hat \vartheta\|{}_F =\|\tilde \theta_{\mathrm{Pilot}} - R^+(2\hat k)\|{}_F \le \|\tilde \theta_{\mathrm{Pilot}} - \tilde \theta'\|{}_F \le r_n(\hat k). \end{aligned} $$
(18.43)

Finally define, for C a constant chosen below,

$$\displaystyle \begin{aligned} C_n = \left\{v \in \Theta_+ : \|v - \hat \vartheta\|{}_{S_1} \le C \sqrt{\hat k} r_n(\hat k) \right\}. \end{aligned} $$
(18.44)
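Algorithmically, the choice of \(\hat k\) is a scan over ranks. The following sketch (Python/NumPy; the constant D and all function names are ours, the constant in \(r_n\) is taken from (18.41), and the subsequent projection of the pilot estimator onto \(R^+(2\hat k)\) is omitted) illustrates one possible rendering.

```python
import numpy as np

def r_n(k, d, n, D=1.0, sigma=1.0):
    # recovery radius from (18.41): r_n(k)^2 / 4 = D sigma^2 k d / n
    return 2.0 * sigma * np.sqrt(D * k * d / n)

def select_k_hat(theta_pilot, lam_hat, n):
    """Smallest k <= d such that (i) some rank-k matrix is r_n(k)-close
    to the pilot estimator in Frobenius norm (checked via the SVD
    truncation, which is Frobenius-optimal by Eckart-Young) and
    (ii) 1 - sum_{J <= k} hat(lambda)_J <= 2 k sqrt(d/n).
    `lam_hat` holds the estimated eigenvalues in decreasing order."""
    d = theta_pilot.shape[0]
    sv = np.linalg.svd(theta_pilot, compute_uv=False)
    for k in range(1, d + 1):
        trunc_err = np.sqrt((sv[k:] ** 2).sum())
        tail_mass = 1.0 - lam_hat[:k].sum()
        if trunc_err <= r_n(k, d, n) and tail_mass <= 2 * k * np.sqrt(d / n):
            return k
    return d
```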

Theorem 18.4

Assume Condition 18.2 for some 1 ≤ k ≤ d, and let δ > 0 be given. Assume that with probability greater than 1 − 2δ∕3, (a) the RIP (18.7) holds with τ n(k) as in (18.40) and (b) there exists an estimator \(\tilde \theta _{\mathrm {Pilot}}\) for which (18.42) holds. Then we can choose C = C(δ) large enough so that, for C n as in the last display,

$$\displaystyle \begin{aligned}\liminf_{\min (n,d) \to \infty} \inf_{\theta \in R^+(k)}\mathbb P_\theta (\theta \in C_n) \ge 1-\delta.\end{aligned}$$

Moreover, uniformly in R +(k 0), 1 ≤ k 0 ≤ k, and with \(\mathbb P_\theta \)-probability greater than 1 − δ,

$$\displaystyle \begin{aligned}|C_n|{}_{S_1} \lesssim \sqrt {k_0} r_n(k_0).\end{aligned}$$

Theorem 18.4 should mainly serve the purpose of illustrating that the quantum shape constraint allows for the construction of an optimal trace norm confidence set that adapts to the unknown low rank structure. Implementation of C n is not straightforward, so Theorem 18.4 is mostly of theoretical interest. Let us also observe that in full generality a result like Theorem 18.4 cannot be proved without the quantum shape constraint. This follows from a careful study of certain hypothesis testing problems (combined with lower bound techniques for confidence sets as in Refs. [19, 30]). Precise results are the subject of current research and will be reported elsewhere.

18.4 Simulation Experiments

In order to illustrate the methods from this paper, we present some numerical simulations. The setting of the experiments is as follows: A random matrix \(\eta \in \mathbb M_d(\mathbb C)\) of norm \(\|\eta\|{}_F = R^{1/2}\) is generated according to one of two procedures, specified below, and the observations are

$$\displaystyle \begin{aligned}\bar {Y}_i = \mathrm{tr}(X^i \eta) + \varepsilon_i,\end{aligned}$$

where the ε i are i.i.d. Gaussian of mean 0 and variance 1. The observations are reparametrised so that η represents the ‘estimation error’ \(\theta - \hat \theta \), and we investigate how well the statistics

$$\displaystyle \begin{aligned}\hat r_n = \frac{1}{n} \|\bar Y\|{}^2- 1 \text{ and } \hat R_n = \frac{2}{n(n-1)} \sum_{i<j} \sum_{m,k} \bar Y_i X^i_{m,k} \bar Y_j X^j_{m,k}\end{aligned}$$

estimate the ‘accuracy of estimation’ \(\|\eta \|{ }_F^2= \|\theta -\hat \theta \|{ }_F^2\), conditional on the value of \(\hat \theta \). We will choose η in order to illustrate two extreme cases: a first one where the nuclear norm \(\|\eta \|{ }_{S_1}\) is ‘small’, corresponding to a situation where the quantum constraint is fulfilled; and a second one where the nuclear norm is large, corresponding to a situation where the quantum constraint is not fulfilled. More precisely we generate the parameter η in two ways:

  • ‘Random Dirac’ case: Set a single entry of η (with position chosen at random on the diagonal) to \(R^{1/2}\), and all other entries to 0.

  • ‘Random Pauli’ case: Set η equal to a Pauli basis element chosen uniformly at random, multiplied by \(R^{1/2}\).
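A self-contained sketch of the ‘random Dirac’ experiment under Gaussian design might look as follows (NumPy assumed; the Pauli variants only change how η and the \(X^i\) are drawn). The double sum in \(\hat R_n\) is evaluated through the algebraic identity \(\sum_{i \ne j} \langle \bar Y_i X^i, \bar Y_j X^j\rangle_F = \|\sum_i \bar Y_i X^i\|_F^2 - \sum_i \bar Y_i^2 \|X^i\|_F^2\), which avoids the quadratic loop.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, R = 32, 2000, 1.0

# 'random Dirac' error matrix: one random diagonal entry equals R^{1/2}
eta = np.zeros((d, d))
j = rng.integers(d)
eta[j, j] = np.sqrt(R)

# Gaussian design: i.i.d. standard normal entries; observations as in (18.3)
X = rng.standard_normal((n, d, d))
Y = np.einsum('imk,mk->i', X, eta) + rng.standard_normal(n)

# RSS statistic: hat r_n = ||Y||^2 / n - 1 (noise variance sigma^2 = 1)
r_hat = Y @ Y / n - 1.0

# U-statistic hat R_n via the identity in the text above
S = np.einsum('i,imk->mk', Y, X)               # sum_i Y_i X^i
sq = np.einsum('i,imk,imk->', Y ** 2, X, X)    # sum_i Y_i^2 ||X^i||_F^2
R_hat = ((S * S).sum() - sq) / (n * (n - 1))

print(r_hat, R_hat)  # both estimate ||eta||_F^2 = R
```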

The designs that we consider are the Gaussian design and the Pauli design described in Condition 18.1. We perform experiments with d = 32, R ∈ {0.1, 1} and

$$\displaystyle \begin{aligned}n \in \{100, 200,500,1000,2000,5000\}.\end{aligned}$$

Note that \(d^2 = 1024\), so that the first four choices of n correspond to the important regime \(n < d^2\). Our results are plotted as a function of the number n of samples in Figs. 18.2, 18.3, 18.4, and 18.5. The solid red and blue curves are the medians of the normalised estimation errors

$$\displaystyle \begin{aligned}\frac{\sqrt{\hat R_n - R}}{R^{1/2}}, \quad \mathrm{and} \quad \frac{\sqrt{\hat r_n - R}}{R^{1/2}},\end{aligned}$$

after 1000 iterations, and the dotted lines are, respectively, the (two-sided) 90% quantiles. We also report (see Tables 18.1, 18.2, 18.3, and 18.4) how well the confidence sets based on these estimates of the norm perform in terms of coverage probabilities and diameters. The diameters are computed as

$$\displaystyle \begin{aligned}\left({\hat R_n + \frac{C_{\mathrm{UStat}} d}{n} + \frac{C_{\mathrm{UStat}}^{\prime} \hat R_n^{1/2}}{\sqrt{n}} }\right)^{1/2},\end{aligned}$$
for the U-Statistic approach and

$$\displaystyle \begin{aligned}\left({\hat r_n + \frac{C_{\mathrm{RSS}}}{\sqrt{n}} + \frac{C_{\mathrm{RSS}}^{\prime} \hat r_n^{1/2}}{\sqrt{n}}}\right)^{1/2},\end{aligned}$$

for the RSS approach, where we have chosen \(C_{\mathrm{UStat}} = 2.5\), \(C_{\mathrm{RSS}} = 1\) and \(C_{\mathrm{UStat}}^{\prime} = C_{\mathrm{RSS}}^{\prime} = 6\) for all experiments, calibrated to a 95% coverage level.

Table 18.1 Gaussian design, and random Dirac (a single entry, chosen at random, is non-zero on the diagonal) η, with R = 0.1 (left table) and R = 1 (right table)
Table 18.2 Gaussian design, and random Pauli η, with R = 0.1 (left table) and R = 1 (right table)
Table 18.3 Pauli design, and random Dirac (a single entry, chosen at random, is non-zero on the diagonal) η, with R = 0.1 (left table) and R = 1 (right table)
Table 18.4 Pauli design, and random Pauli η, with R = 0.1 (left table) and R = 1 (right table)

Fig. 18.2 Gaussian design, and random Dirac (a single entry, chosen at random, is non-zero on the diagonal) η, with R = 0.1 (left picture) and R = 1 (right picture)

Fig. 18.3 Gaussian design, and random Pauli η, with R = 0.1 (left picture) and R = 1 (right picture)

Fig. 18.4 Pauli design, and random Dirac (a single entry, chosen at random, is non-zero on the diagonal) η, with R = 0.1 (left picture) and R = 1 (right picture)

Fig. 18.5 Pauli design, and random Pauli η, with R = 0.1 (left picture) and R = 1 (right picture)
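In code, the two diameter formulas displayed above are direct plug-ins of \(\hat R_n\) and \(\hat r_n\); a minimal sketch (the clipping at zero is ours, guarding against negative norm estimates at small n):

```python
import numpy as np

def diam_ustat(R_hat, d, n, C=2.5, Cp=6.0):
    # U-statistic diameter with the constants used in the experiments
    r = max(R_hat, 0.0)
    return np.sqrt(max(R_hat + C * d / n + Cp * np.sqrt(r) / np.sqrt(n), 0.0))

def diam_rss(r_hat, n, C=1.0, Cp=6.0):
    # RSS diameter with the constants used in the experiments
    r = max(r_hat, 0.0)
    return np.sqrt(max(r_hat + C / np.sqrt(n) + Cp * np.sqrt(r) / np.sqrt(n), 0.0))
```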

From these numerical results, several observations can be made:

  • In Gaussian random designs, the results are insensitive to the nature of η (see Figs. 18.2 and 18.3 and Tables 18.1 and 18.2). This is not surprising since the Gaussian design is ‘isotropic’.

  • For Pauli designs with the quantum constraint (see Fig. 18.4 and Table 18.3) the RSS method works quite well even for small sample sizes. But the U-Stat method is not very reliable—indeed we see no empirical evidence that Theorem 18.3 should also hold true for Pauli design.

  • For Pauli design, when the quantum shape constraint is not satisfied our methods cease to provide reliable results (see Fig. 18.5 and in particular Table 18.4). Indeed, when the matrix η is itself chosen as a random Pauli (which is the hardest signal to detect under Pauli design), both the RSS and the U-Stat approach perform poorly. The confidence sets are no longer honest, which is in line with the theoretical limitations observed in Theorem 18.2. Figure 18.5 illustrates that the methods do not detect the signal: the norm of η is substantially under-estimated for small sample sizes. These limitations are less pronounced when \(n \ge d^2\); in that case one could alternatively use the re-averaging approach from Sect. 18.3.3.2 (not investigated in the simulations) to obtain honest results without the quantum shape constraint.

18.5 Proofs

18.5.1 Proof of Theorem 18.1

Proof

Before we define the algorithm and prove the result, a few preparatory remarks are required: Our sequential procedure will be implemented in m = 1, 2, …, T potential steps, in each of which \(2 \cdot 2^m = 2^{m+1}\) measurements are taken. The arguments below will show that we can restrict the search to at most

$$\displaystyle \begin{aligned}T = O(\log (d/\epsilon))\end{aligned}$$

steps. We also note that from the discussion after (18.7)—in particular since c = c(δ) from (18.8) is \(O(1/\delta^2)\)—a simple union bound over m ≤ T implies that the RIP holds with probability ≥ 1 − δ′, for some δ′ > 0, simultaneously for every m ≤ T satisfying \(2^m \ge c'kd \overline {\log }d\), and with \(\tau _{2^m}(k)<c_0\), where c′ is a constant that depends on δ′, c 0 only. The maximum over \(T=O(\log (d/\epsilon ))\) terms is absorbed in a slightly enlarged poly-log term. Hence, simultaneously for all such sample sizes \(2^m\), m ≤ T, a nuclear norm regulariser exists that achieves the optimal rate from (18.26) with \(n = 2^m\) and for every k ≤ d, with probability greater than 1 − δ∕3. Projecting this estimator onto Θ+ changes the Frobenius error only by a universal multiplicative constant (arguing as in (18.43) below), and we denote by \(\tilde \theta _{2^m} \in \Theta _+\) the resulting estimator computed from a sample of size \(2^m\).

We now describe the algorithm at the m-th step: Split the \(2^{m+1}\) observations into two halves and use the first subsample to construct \(\tilde \theta _{2^m} \in \Theta _+\) satisfying (18.26) with \(\mathbb P_\theta \)-probability ≥ 1 − δ∕3. Then use the other \(2^m\) observations to construct a confidence set \(C_{2^m}\) for θ centred at \(\tilde \theta _{2^m}\): if \(2^m < d^2\) we take \(C_{2^m}\) from (18.33) and if \(2^m \ge d^2\) we take \(C_{2^m}\) from (18.37)—in both cases of non-asymptotic coverage at least 1 − α, α = δ∕(3T). If \(|C_{2^m}|{ }_F \le \epsilon \) we terminate the procedure (\(m=\hat m\), \(\hat n = 2^{\hat m+1}\), \(\hat \theta = \tilde \theta _{2^{\hat m}}\)); if \(|C_{2^m}|{ }_F>\epsilon \) we repeat the above procedure with \(2 \cdot 2^{m+1} = 2^{m+2}\) new measurements, and so on, until the algorithm terminates, in which case we have used

$$\displaystyle \begin{aligned}\sum_{m \le \hat m} 2^{m+1} \lesssim 2^{\hat m} \approx \hat n\end{aligned}$$

measurements in total.
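Schematically, the protocol just described is the following loop (a sketch only: measure, estimate and conf_diameter stand for the sampling, estimation and confidence-set routines described above, and are assumptions of this illustration rather than part of the theorem; T ≥ 1 is assumed).

```python
def adaptive_sampling(measure, estimate, conf_diameter, eps, T):
    """Sequential protocol from the proof of Theorem 18.1 (sketch).

    measure(n)          -- draws n new observations
    estimate(s)         -- projected regularised estimator from
                           subsample s, satisfying (18.26)
    conf_diameter(t, s) -- Frobenius diameter |C_{2^m}|_F of the
                           confidence set centred at t, built from s
    """
    for m in range(1, T + 1):
        sample = measure(2 ** (m + 1))        # 2 * 2^m new measurements
        half1, half2 = sample[:2 ** m], sample[2 ** m:]
        theta_hat = estimate(half1)
        if conf_diameter(theta_hat, half2) <= eps:
            return theta_hat, 2 ** (m + 1)    # stop: hat n = 2^{hat m + 1}
    return theta_hat, 2 ** (T + 1)            # maximal sample size reached
```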

To analyse this algorithm, recall that the quantile constants z, z α, ξ α appearing in the confidence sets (18.33) and (18.37) for our choice of α = δ∕(3T) grow at most as \(O(\log (1/\alpha ))=O(\log T) = o(\overline {\log }d)\). In particular, in view of (18.26) and (18.34) or (18.38), the algorithm necessarily stops at a ‘maximal sample size’ \(n = 2^{T+1}\) at which the squared Frobenius risk of the maximal model (k = d) is controlled at level 𝜖. Such \(T \in \mathbb N\) is \(O(\log (d/\epsilon ))\) and depends on σ, d, 𝜖, δ, hence can be chosen by the experimenter.

To prove that this algorithm works we show that the event

$$\displaystyle \begin{aligned}\left\{\|\hat \theta - \theta\|{}_F^2 > \epsilon^2\right\} \cup \left\{\hat n > \frac{C(\delta) kd (\log d)^\gamma }{\epsilon^2} \right\} = A_1 \cup A_2\end{aligned}$$

has probability at most 2δ∕3 for large enough C(δ), γ. By the union bound it suffices to bound the probability of each event separately by δ∕3. For the first: Since \(\hat n\) has been selected we know \(|C_{\hat n}|{ }_F\le \epsilon \) and since \(\hat \theta = \tilde \theta _{\hat n}\) the event A 1 can only happen when \(\theta \notin C_{\hat n}\). Therefore

$$\displaystyle \begin{aligned}\mathbb P_\theta(A_1) \le \mathbb P_\theta(\theta \notin C_{\hat n}) \le \sum_{m=1}^T \mathbb P_\theta ( \theta \notin C_{2^m}) \le \delta \frac{T}{3T} = \frac{\delta}{3}.\end{aligned}$$

For A 2, whenever θ ∈ R(k) and for all m ≤ T for which \(2^m \ge c'kd \overline {\log } d\), we have, as discussed above, from (18.34) or (18.38) and (18.26) that

$$\displaystyle \begin{aligned}\mathbb E _\theta |C_{2^m}|{}_F^2 \le D' \frac{kd \log T}{2^m},\end{aligned}$$

where D′ is a constant. In the last inequality the expectation is taken under the distribution of the sample used for the construction of \(C_{2^m}\), and it holds on the event on which \(\tilde \theta _{2^m}\) realises the risk bound (18.26). Then let C(δ), γ be large enough so that \(C(\delta ) kd (\log d)^\gamma / \epsilon ^2 \ge c'kd\overline {\log }d\) and let \(m_0 \in \mathbb N\) be the smallest integer such that

$$\displaystyle \begin{aligned}2^{m_0} > \frac{C(\delta) kd (\log d)^\gamma}{\epsilon^2}.\end{aligned}$$

Then, for C(δ) large enough and since \(T=O(\log (d/\epsilon ))\),

$$\displaystyle \begin{aligned}\mathbb P_\theta \left(\hat n >\frac{C(\delta) kd (\log d)^\gamma}{\epsilon^2}\right) \le \mathbb P_\theta \left(|C_{2^{m_0}}|{}^2_F >\epsilon^2\right) \le \frac{\mathbb E _\theta|C_{2^{m_0}}|{}_F^2}{\epsilon^2} \le \frac{D' \log T}{C(\delta) (\log d)^\gamma }<\delta/3,\end{aligned}$$

by Markov’s inequality, completing the proof. \(\blacksquare \)

Remark 18.9 (Isotropic Sampling)

The proof above works analogously for isotropic designs as defined in Condition 18.1(a). When \(2^m \ge d^2\), we replace the confidence set (18.37) in the above proof by the confidence set from (18.36). Assuming also that \(\|\theta\|{}_F \le M\) for some fixed constant M, we can construct a similar upper bound for T and the above proof applies directly (with T of slightly larger but still small enough order). Instead of assuming an upper bound on \(\|\theta\|{}_F\) one can simply continue using the confidence set (18.33) also when \(2^m \ge d^2\), in which case one has the slightly worse bound

$$\displaystyle \begin{aligned}\hat n \le C(\delta) \max \left( \frac{k d \overline {\log} d}{\epsilon^2}, \frac{1}{\epsilon^4}\right)\end{aligned}$$

for the number of measurements required.

18.5.2 Proof of Theorem 18.2

Proof

By Lemma 18.1 below with \(\vartheta =\tilde \theta - \theta \) the \(\mathbb P_\theta \)-probability of the complement of the event

$$\displaystyle \begin{aligned}\mathcal E= \left\{\left|\frac{1}{n} \|\mathcal X(\tilde \theta - \theta)\|{}^2 - \|\tilde \theta- \theta\|{}_F^2 \right| \le \max\left(\frac{\|\theta-\tilde \theta\|{}_F^2}{2}, \frac{z d}{n} \right)\right\}\end{aligned}$$

is bounded by the deviation terms \(2e^{-cn}\) and \(2e^{-C(K)z}\), respectively (note z = 0 in Case (a)). We restrict to this event in what follows. We can decompose

$$\displaystyle \begin{aligned} \hat r_n = \frac{1}{n} \|\mathcal X(\tilde \theta - \theta)\|{}^2 + \frac{2}{n} \langle \varepsilon, \mathcal X(\theta-\tilde \theta) \rangle + \frac{1}{n} \sum_{i=1}^n (\varepsilon_i^2-\mathbb E \varepsilon_i^2) =A+B+C. \end{aligned} $$

Since \(\mathbb P(Y+Z<0) \le \mathbb P(Y<0) + \mathbb P(Z<0)\) for any random variables Y, Z we can bound the probability

$$\displaystyle \begin{aligned} \mathbb P_\theta (\theta \notin C_n, \mathcal E) = \mathbb P_\theta \left(\left\{\frac{1}{2}\|\theta-\tilde \theta\|{}_F^2 > A+B+C + \frac{z d}{n} + \frac{\bar z +\xi_{\alpha/3, \sigma}}{\sqrt{n}}\right\}, \mathcal E \right) \end{aligned} $$

by the sum of the following probabilities

$$\displaystyle \begin{aligned}I := \mathbb P_\theta\left(\left\{\frac{1}{2}\|\theta-\tilde \theta\|{}_F^2 > \frac{1}{n} \|\mathcal X(\tilde \theta - \theta)\|{}^2 + \frac{z d}{n}\right\}, \mathcal E \right),\end{aligned}$$
$$\displaystyle \begin{aligned}II := \mathbb P_\theta\left(\left\{- \frac{1}{\sqrt{n}} \langle \varepsilon , \mathcal X(\theta-\tilde \theta)\rangle > \bar z\right\}, \mathcal E\right),\end{aligned}$$
$$\displaystyle \begin{aligned}III := \mathbb P_\theta \left(- \frac{1}{\sqrt{n}} \sum_{i=1}^n (\varepsilon_i^2-\mathbb E \varepsilon_i^2) > \xi_{\alpha/3, \sigma}\right).\end{aligned}$$

The first probability I is bounded by

$$\displaystyle \begin{aligned} &\mathbb P_\theta\left(\left\{- \frac{1}{n} \|\mathcal X(\tilde \theta - \theta)\|{}^2 + \|\theta-\tilde \theta\|{}_F^2 > \frac{1}{2}\|\theta-\tilde \theta\|{}_F^2 + \frac{z d}{n}\right\}, \mathcal E \right) \\ &\le \mathbb P_\theta\left(\left\{\left|\frac{1}{n} \|\mathcal X(\tilde \theta - \theta)\|{}^2 - \|\tilde \theta- \theta\|{}_F^2 \right| > \max\left(\frac{\|\theta-\tilde \theta\|{}_F^2}{2}, \frac{z d}{n} \right)\right\}, \mathcal E\right)=0 \end{aligned} $$

About term II: Conditional on \(\mathcal X\) the variable \(\frac {1}{\sqrt {n}} \langle \varepsilon , \mathcal X(\theta -\tilde \theta )\rangle \) is centred Gaussian with variance \((\sigma ^2/n)\|\mathcal X(\theta -\tilde \theta )\|{ }^2\). The standard Gaussian tail bound then gives by definition of \(\bar z\), and conditional on \(\mathcal X\),

$$\displaystyle \begin{aligned} \mathbb P_\varepsilon \left(- \frac{1}{\sqrt{n}} \langle \varepsilon , \mathcal X(\theta-\tilde \theta)\rangle > \bar z\right) &\le \exp\left\{-\frac{\bar z^2}{2(\sigma^2/n)\|\mathcal X(\theta-\tilde \theta)\|{}^2}\right\} \\ & = \exp\left\{-\frac{z_{\alpha/3} \max(3 \|\theta-\tilde \theta\|{}^2_F, 4zd/n)}{2\|\mathcal X(\theta-\tilde \theta)\|{}^2/n}\right\} \le \exp \{-z_{\alpha/3}\}=\alpha/3 \end{aligned} $$

since, on the event \(\mathcal E\),

$$\displaystyle \begin{aligned}\max(3\|\theta-\tilde \theta\|{}^2_F, 4zd/n) \ge (2/n)\|\mathcal X(\theta-\tilde \theta)\|{}^2.\end{aligned}$$

The overall bound for II follows from integrating the last but one inequality over the distribution of X. Term III is bounded by α∕3 by definition of ξ α,σ. \(\blacksquare \)

Remark 18.10 (Modification of the Proof for Bernoulli Errors)

If instead of Gaussian errors we work with the error model from Sect. 18.2.2.2, we require a modified treatment of the terms II, III in the above proof. For the pure noise term III we modify the quantile constants slightly to \(\xi _{\alpha , \sigma } = \sqrt {1/\alpha }\). If the number T of preparations satisfies \(T \ge 4d^2\) then Chebyshev’s inequality and (18.24) give

$$\displaystyle \begin{aligned} & \mathbb P_\theta \left(\left|\frac{1}{\sqrt{n}} \sum_{i=1}^n (\varepsilon_i^2-\mathbb E \varepsilon_i^2)\right| > \xi_{\alpha/3, \sigma}\right) \le \frac{\alpha}{3n} \sum_{i=1}^n\mathbb E \varepsilon_i^4 \le \frac{\alpha}{3} \frac{4d^2}{T} \le \frac{\alpha}{3}. \end{aligned} $$

For the ‘cross term’ we have likewise with \(z_{\alpha }=\sqrt {1/\alpha }\) and \(a_i = (\mathcal X(\theta -\tilde \theta ))_i\) that, on the event \(\mathcal E\),

$$\displaystyle \begin{aligned} &\mathbb P_\varepsilon \left(\left\{- \frac{1}{\sqrt{n}} \langle \varepsilon , \mathcal X(\theta-\tilde \theta)\rangle > \bar z\right\}, \mathcal E\right) \le \frac{1}{n \bar z^2} \mathbb E _\varepsilon \left(\sum_{i=1}^n \varepsilon_i a_i 1_{\mathcal E} \right)^2 \le \frac{d}{T \bar z^2} \frac{\|\mathcal X(\theta-\tilde \theta)\|{}^2}{n} 1_{\mathcal E} \le \alpha/3, \end{aligned} $$

just as at the end of the proof of Theorem 18.2, so that coverage follows from integrating the last inequality w.r.t. the distribution of X. The scaling \(T \approx d^2\) is similar to the one discussed in Theorem 3 in Ref. [12].

Lemma 18.1

  1. (a)

    For isotropic design from Condition 18.1(a) and any fixed matrix \(\vartheta \in \mathbb H_d(\mathbb C)\) we have, for every \(n \in \mathbb N\) ,

    $$\displaystyle \begin{aligned}\Pr \left(\left|\frac{1}{n}\|\mathcal X\vartheta\|{}^2 - \|\vartheta\|{}_F^2\right| > \frac{\|\vartheta\|{}_F^2}{2}\right) \le 2 e^{-c n}.\end{aligned}$$

    In the standard Gaussian design case we can take c = 1∕24.

  2. (b)

    In the ‘Pauli basis’ case from Condition 18.1(b) we have for any fixed matrix \(\vartheta \in \mathbb H_d(\mathbb C)\) satisfying the Schatten-1-norm bound \(\|\vartheta \|{ }_{S_1} \le 2\) and every \(n \in \mathbb N\) ,

    $$\displaystyle \begin{aligned}\Pr \left(\left|\frac{1}{n}\|\mathcal X\vartheta\|{}^2 - \|\vartheta\|{}_F^2\right| > \max\left(\frac{\|\vartheta\|{}_F^2}{2}, z\frac{d}{n} \right) \right) \le 2 \exp \left\{-C(K) z \right\}\end{aligned}$$

where \(C(K) = 1/[(16 + 8/3)K^2]\), and where K is the coherence constant of the basis.

  3. (c)

    In the ‘Pauli basis’ case from Condition 18.1(b) we have for any fixed matrix \(\vartheta \in \mathbb H_d(\mathbb C)\) such that the rank of 𝜗 is smaller than 2k and every \(n \in \mathbb N\) ,

    $$\displaystyle \begin{aligned}\Pr \left(\left|\frac{1}{n}\|\mathcal X\vartheta\|{}^2 - \|\vartheta\|{}_F^2\right| > \max\left(\frac{\|\vartheta\|{}_F^2}{2}, z\frac{d}{n} \right) \right) \le 2 \exp \left\{-\frac{n}{17 K^2 k^2 d} \right\}.\end{aligned}$$

Proof

We first prove the isotropic case. From (18.5) we see

$$\displaystyle \begin{aligned} & \Pr \left(\left|\frac{1}{n}\|\mathcal X\vartheta\|{}^2 - \|\vartheta\|{}_F^2\right| > \|\vartheta\|{}_F^2/2 \right) = \Pr\left(\left| \sum_{i=1}^n (Z_i^2 - \mathbb E Z^2_1)/\|\vartheta\|{}_F^2 \right| > n/2 \right) \end{aligned} $$

where the \(Z_i/\|\vartheta\|{}_F\) are sub-Gaussian random variables. Then the \(Z_i^2/\|\vartheta \|{ }_F^2\) are sub-exponential and we can apply Bernstein’s inequality (Prop. 4.1.8 in Ref. [14]) to the last probability. We give the details for the Gaussian case and derive explicit constants. In this case \(g_i := Z_i/\|\vartheta\|{}_F \sim N(0, 1)\), so the last probability is bounded, using Theorem 4.1.9 in Ref. [14], by

$$\displaystyle \begin{aligned}\Pr\left(\left| \sum_{i=1}^n (g_i^2 - 1) \right| > \frac{n}{2} \right) \le 2 \exp \left\{- \frac{n^2/4}{4n+2n}\right\},\end{aligned}$$

and the result follows.
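As a sanity check, the Gaussian case of part (a) is easily probed by simulation; the following sketch (NumPy assumed, with an arbitrary fixed symmetric 𝜗) compares the empirical frequency of the deviation event with the bound \(2e^{-n/24}\).

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, trials = 8, 100, 2000

A = rng.standard_normal((d, d))
theta = (A + A.T) / 2            # a fixed symmetric test matrix (vartheta)
f2 = (theta * theta).sum()       # ||vartheta||_F^2

hits = 0
for _ in range(trials):
    X = rng.standard_normal((n, d, d))     # standard Gaussian design
    Z = np.einsum('imk,mk->i', X, theta)   # Z_i with sum Z_i^2 = ||X vartheta||^2
    hits += abs(Z @ Z / n - f2) > f2 / 2
print(hits / trials, 2 * np.exp(-n / 24))  # empirical frequency vs. bound
```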

Under Condition 18.1(b), if we write \(D= \max (n\|\vartheta \|{ }_F^2/2, z d)\), we can likewise reduce to bounding the probability in question by

$$\displaystyle \begin{aligned} & \Pr\left(\left| \sum_{i=1}^n (Y_i - \mathbb E Y_1) \right| > D \right) \end{aligned} $$

where the \(Y_i = |\mathrm{tr}(X^i \vartheta)|^2\) are i.i.d. bounded random variables. Using \(\|E_i\|{ }_{op} \le K/\sqrt {d}\) from Condition 18.1(b) and the quantum constraint \(\|\vartheta \|{ }_F \le \|\vartheta \|{ }_{S_1} \le 2\) we can bound

$$\displaystyle \begin{aligned}|Y_i| \le d^2 \max_i \|E_i\|{}^2_{op} \|\vartheta\|{}_{S_1}^2 \le 4K^2d := U\end{aligned}$$

as well as

$$\displaystyle \begin{aligned}\mathbb E Y_i^2 \le U \mathbb E |Y_i| \le 4 K^2 d \|\vartheta\|{}_F^2 := s^2.\end{aligned}$$

Bernstein’s inequality for bounded variables (e.g., Theorem 4.1.7 in Ref. [14]) applies to give the bound

$$\displaystyle \begin{aligned}2 \exp\left\{-\frac{D^2}{2ns^2 + \frac{2}{3}UD}\right\} \le 2 \exp \left\{-C(K) z \right\},\end{aligned}$$

after some basic computations, by distinguishing the two regimes of \(D=n\|\vartheta \|{ }_F^2/2 \ge zd\) and \(D=zd \ge n\|\vartheta \|{ }_F^2/2\).

Finally, for (c), we use the same reasoning as above together with \(\|E_i\|{ }_{op} \le K/\sqrt {d}\) from Condition 18.1(b). Since 𝜗 has rank smaller than 2k, we have \(\|\vartheta \|{ }_F \le \|\vartheta \|{ }_{S_1} \le \sqrt {2k}\|\vartheta \|{ }_F\), and we can bound

$$\displaystyle \begin{aligned}|Y_i| \le d^2 \max_i \|E_i\|{}^2_{op} \|\vartheta\|{}_{S_1}^2 \le 2K^2kd\|\vartheta\|{}_{F}^2 := \tilde U\end{aligned}$$

as well as

$$\displaystyle \begin{aligned}\mathbb E Y_i^2 \le \tilde U \mathbb E |Y_i| \le 2 K^2 d k^2 \|\vartheta\|{}_F^4 := \tilde s^2.\end{aligned}$$

Bernstein’s inequality for bounded variables (e.g., Theorem 4.1.7 in Ref. [14]) applies to give the bound

$$\displaystyle \begin{aligned}2 \exp\left\{-\frac{D^2}{2n\tilde s^2 + \frac{2}{3}\tilde UD}\right\} \le 2 \exp \left\{-\frac{n}{17 K^2 k^2 d} \right\},\end{aligned}$$

after some basic computations.\(\blacksquare \)

18.5.3 Proof of Theorem 18.3

Proof

Since \(\mathbb E _\theta \hat R_n = \|\theta - \tilde \theta \|{ }_F^2\) we have from Chebyshev’s inequality

$$\displaystyle \begin{aligned} \mathbb P_\theta (\theta \notin C_n) &\le \mathbb P_\theta \left( |\hat R_n -\mathbb E \hat R_n| > z_{\alpha, n} \right) \\ & \le \frac{\mathrm{Var}_\theta(\hat R_n)}{z_{\alpha, n}^2}. \end{aligned} $$

Now \(U_n =\hat R_n -\mathbb E _\theta \hat R_n\) is a centred U-statistic and has Hoeffding decomposition U n = 2L n + D n where

$$\displaystyle \begin{aligned}L_n = \frac{1}{n} \sum_{i =1}^{n} \sum_{m,k} (Y_i X^i_{m,k} - \mathbb E _\theta[Y_i X^i_{m,k}])(\Theta_{m,k}-\tilde \Theta_{m,k}) \end{aligned}$$

is the linear part and

$$\displaystyle \begin{aligned}D_n = \frac{2}{n(n-1)} \sum_{i<j} \sum_{m,k} (Y_i X^i_{m,k} - \mathbb E _\theta[Y_i X^i_{m,k}])(Y_j X^j_{m,k} - \mathbb E _\theta[Y_j X^j_{m,k}]) \end{aligned}$$

the degenerate part. We note that L n and D n are orthogonal in \(L^2(\mathbb P_\theta )\).

The linear part can be decomposed into

$$\displaystyle \begin{aligned} L_n = L_n^{(1)} + L_n^{(2)} \end{aligned}$$

where

$$\displaystyle \begin{aligned} L_n^{(1)} =\frac{1}{n} \sum_{i =1}^{n} \sum_{m,k} \left(\sum_{m',k'} X^i_{m',k'} X^i_{m, k} \Theta_{m',k'} - \Theta_{m,k} \right)(\Theta_{m,k}-\tilde \Theta_{m,k}) \end{aligned}$$

and

$$\displaystyle \begin{aligned} L_n^{(2)} = \frac{1}{n} \sum_{i =1}^{n} \varepsilon_i \sum_{m,k} X^i_{m,k}(\Theta_{m,k}-\tilde \Theta_{m,k}). \end{aligned}$$

Now by the i.i.d. assumption we have

$$\displaystyle \begin{aligned}\mathrm{Var}_\theta(L_n^{(2)}) = \sigma^2\frac{\|\tilde \theta - \theta\|{}^2_F}{n}. \end{aligned}$$

Moreover, by transposing the double indices (m, k) and (m′, k′) in an arbitrary way into single indices \(M = 1, \dots, d^2\) and \(K = 1, \dots, d^2\), respectively (with \(d^2 = p\)), basic computations given before eq. (28) in Ref. [30] imply that the variance of \(L_n^{(1)}\) is bounded by

$$\displaystyle \begin{aligned} \mathrm{Var}_\theta(L^{(1)}_n) \leq \frac{c\|\theta - \tilde \theta\|{}^2_F \|\theta\|{}^2_F}{n} \end{aligned}$$

where c is a constant that depends only on \(\mathbb E X_{1,1}^4\) (which is finite since the \(X_{1,1}\) are sub-Gaussian in view of Condition 18.1(a)). Moreover, the degenerate term satisfies

$$\displaystyle \begin{aligned} \mathrm{Var}_\theta(D_n) \leq c \frac{d}{n^2} \|\theta\|{}_F^4 \end{aligned}$$

in view of standard U-statistic computations leading to eq. (6.6) in Ref. [21], with \(d^2 = p\), and using the same transposition of indices as before. This proves coverage by choosing the constants in the definition of \(z_{\alpha, n}\) large enough.\(\blacksquare \)

18.5.4 Proof of Theorem 18.4

We prove the result for symmetric matrices with real entries—the case of Hermitian matrices requires only minor (mostly notational) adaptations.

Given the estimator \(\tilde \theta _{\mathrm {Pilot}}\), we can easily transform it into another estimator \(\tilde \theta \) for which the following is true.

Theorem 18.5

There exists an estimator \(\tilde \theta \) that satisfies, uniformly in θ ∈ R(k), for any k ≤ d and with \(\mathbb P_\theta \)-probability greater than 1 − 2δ∕3,

$$\displaystyle \begin{aligned} \| \tilde \theta - \theta\|{}_{F} \leq r_n(k), \end{aligned} $$

as well as,

$$\displaystyle \begin{aligned}\tilde \theta \in R(k),\end{aligned}$$

and then also

$$\displaystyle \begin{aligned} \| \tilde \theta- \theta\|{}_{S_1} \leq \sqrt{2k}r_n(k). \end{aligned} $$

Proof

Given \(\tilde \theta _{\mathrm {Pilot}}\), let \(\tilde \theta \) be the element of R(d) with smallest rank k′ such that

$$\displaystyle \begin{aligned}\|\tilde \theta_{\mathrm{Pilot}} - \tilde \theta \|{}_F^2 \le \frac{r^2_n(k')}{4}.\end{aligned}$$

Such a \(\tilde \theta \) exists and has rank ≤ k, with probability ≥ 1 − 2δ∕3, since θ ∈ R(k) satisfies the above inequality in view of (18.41). The \(\|\cdot \|{}_F\)-loss of \(\tilde \theta \) is no larger than \(r_n(k)\) by the triangle inequality

$$\displaystyle \begin{aligned}\|\tilde \theta - \theta\|{}_F \le \|\tilde \theta - \tilde \theta_{\mathrm{Pilot}} \|{}_F + \|\tilde \theta_{\mathrm{Pilot}} - \theta\|{}_F,\end{aligned}$$

and this completes the proof of the third claim in view of (18.2). \(\blacksquare \)
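The selection of \(\tilde\theta\) in this proof is constructive: by the Eckart–Young theorem, a rank-k′ matrix within the prescribed Frobenius distance of the pilot estimator exists if and only if the SVD truncation at level k′ is within that distance, so the smallest such k′ can be found by a scan over ranks. A sketch (with \(r_n\) passed as a function; the names are ours):

```python
import numpy as np

def rank_reduce(theta_pilot, r_n):
    """Return the smallest-rank SVD truncation theta_tilde with
    ||theta_pilot - theta_tilde||_F^2 <= r_n(k')^2 / 4, and its rank."""
    U, s, Vt = np.linalg.svd(theta_pilot)
    d = len(s)
    for k in range(1, d + 1):
        # squared Frobenius error of the best rank-k approximation
        if (s[k:] ** 2).sum() <= r_n(k) ** 2 / 4:
            return (U[:, :k] * s[:k]) @ Vt[:k], k
    return theta_pilot, d
```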

The rest of the proof consists of three steps: The first establishes some auxiliary empirical process type results, which are then used in the second step to construct a sufficiently good simultaneous estimate of the eigenvalues of θ. In Step III the coverage of the confidence set is established.

18.5.4.1 Step I

Let θ ∈ R +(k) = R(k) ∩ Θ+ and let \(\tilde \theta \) be the estimator from Theorem 18.5. Then with probability ≥ 1 − 2δ∕3, and if \(\eta = \tilde \theta - \theta \), we have

$$\displaystyle \begin{aligned} \|\eta\|{}_F^2 \le r^2_n(k)~~~~\forall \theta \in R^+(k), \end{aligned} $$
(18.45)

and

$$\displaystyle \begin{aligned}\eta \in R(2k).\end{aligned}$$

For the rest of the proof we restrict to the event, of probability at least 1 − 2δ∕3, described by (a) and (b) in the hypothesis of the theorem.

Write \(Y_i^{\prime } = Y_i -\mathrm {tr}(X^i \tilde \theta )\) for the ‘new observations’

$$\displaystyle \begin{aligned}Y_i^{\prime} = tr (X^i \eta) + \varepsilon_i, ~~i=1, \dots, n.\end{aligned}$$

For any d × d′ matrix V  we set

$$\displaystyle \begin{aligned}\tilde \gamma_\eta(V) = V^T \left(\frac{1}{n} \sum_{i=1}^n X^i Y_i^{\prime} \right) V\end{aligned}$$

which estimates

$$\displaystyle \begin{aligned}\gamma_\eta(V)= V^T\eta V.\end{aligned}$$

Let now U be any unit vector in \(\mathbb R^d\). Then in the above notation (d′ = 1) we can write

$$\displaystyle \begin{aligned} \tilde \gamma_\eta(U) &= \frac{1}{n} \sum_{i=1}^n \sum_{m, m' \le d} U_m U_{m'} X^{i}_{m,m'}Y_i^{\prime} \\ &= \frac{1}{n} \sum_{i=1}^n \sum_{m, m' \le d} U_m U_{m'} X^{i}_{m,m'}(tr (X^i\eta) + \varepsilon_i) \\ &= \frac{1}{n} \sum_{i=1}^n \sum_{m, m' \le d} U_m U_{m'} X^{i}_{m,m'}\left(\sum_{k,k'\le d} X^i_{k,k'} \eta_{k,k'}+ \varepsilon_i \right). \end{aligned} $$

If \(\mathbb U\) denotes the d × d matrix UU T, the last quantity can be written as

$$\displaystyle \begin{aligned}\frac{1}{n} \langle \mathcal X \mathbb U, \mathcal X\eta \rangle + \frac{1}{n} \langle \mathcal X \mathbb U, \varepsilon \rangle.\end{aligned}$$

We can hence bound, for \(\mathcal S = \{U \in \mathbb R^d: \|U\|{ }_2=1\}\)

$$\displaystyle \begin{aligned} & \sup_{\eta \in R(2k), \|\eta\|{}_F \le r_n(k), U \in \mathcal S} |\tilde \gamma_\eta (U)-\gamma_\eta(U)| \\ &\le \sup_{\eta \in R(2k), \|\eta\|{}_F \le r_n(k), U \in \mathcal S} \left|\frac{1}{n} \langle \mathcal X \mathbb U, \mathcal X \eta \rangle - \langle \mathbb U, \eta \rangle\right| + \sup_{U \in \mathcal S} \left|\frac{1}{n} \langle \mathcal X \mathbb U, \varepsilon \rangle\right|. \end{aligned} $$

Lemma 18.2

The right-hand side of the last inequality is, with probability greater than 1 − δ, of order

$$\displaystyle \begin{aligned}v_n := O \left( r_n(k) \tau_n(k) + \sqrt{\frac{d}{n}}\right).\end{aligned}$$

Proof

The first term in the bound corresponds to the first supremum on the right hand side of the last inequality, and follows directly from the matrix RIP (and Lemma 18.4). For the second term we argue conditionally on the values of \(\mathcal X\) and on the event for which the matrix RIP is satisfied. We bound the supremum of the Gaussian process

$$\displaystyle \begin{aligned}\mathbb G_\varepsilon(U) := \frac{1}{\sqrt{n}} \langle \mathcal X \mathbb U, \varepsilon\rangle \sim N(0, \|\mathcal X \mathbb U\|{}^2/n)\end{aligned}$$

indexed by elements U of the unit sphere \(\mathcal S\) of \(\mathbb R^d\), which satisfies the metric entropy bound

$$\displaystyle \begin{aligned}\log N(\delta, \mathcal S, \|\cdot\|) \lesssim d \log (A/\delta)\end{aligned}$$

by a standard covering argument. Moreover \(\mathbb U = UU^T \in R(1)\) and hence for any pair of vectors \(U, \bar U \in \mathcal S\) we have that \(\mathbb U - \bar {\mathbb U} \in R(2)\). From the RIP we deduce for every fixed \(U, \bar U \in \mathcal S\) that

$$\displaystyle \begin{aligned} \frac{1}{n}\|\mathcal X \mathbb U- \mathcal X \bar{\mathbb U}\|{}^2 &= \|\mathbb U - \bar{\mathbb U}\|{}_F^2 \left(1 + \frac{\frac{1}{n}\|\mathcal X (\mathbb U- \bar {\mathbb U})\|{}^2 - \|\mathbb U-\bar {\mathbb U}\|{}_F^2}{\|\mathbb U-\bar {\mathbb U}\|{}_F^2} \right) \\ &\le (1+\tau_n(2)) \|\mathbb U-\bar {\mathbb U}\|{}_F^2 \le C \|U- \bar U\|{}^2 \end{aligned} $$

since τ n(2) = O(1) and since

$$\displaystyle \begin{aligned}\| \mathbb U-\bar {\mathbb U}\|{}_F^2 = \sum_{m,m'} (U_mU_{m'}- \bar U_m \bar U_{m'})^2 = \sum_{m,m'} (U_mU_{m'}- U_m\bar U_{m'} + U_m \bar U_{m'} - \bar U_m \bar U_{m'})^2 \le 2 \|U-\bar U\|{}^2.\end{aligned}$$

Hence any δ-covering of \(\mathcal S\) in ∥⋅∥ induces a δC covering of \(\mathcal S\) in the intrinsic covariance \(d_{\mathbb G_\varepsilon }\) of the (conditional on \(\mathcal X\)) Gaussian process \(\mathbb G_\varepsilon \), i.e.,

$$\displaystyle \begin{aligned}\log N(\delta, \mathcal S, d_{\mathbb G_\varepsilon}) \lesssim d \log (A'/\delta)\end{aligned}$$

with constants independent of X. By Dudley’s metric entropy bound (e.g., Ref. [14]) applied to the conditional Gaussian process we have

$$\displaystyle \begin{aligned}\mathbb E \sup_{U \in \mathcal S} |\mathbb G_\varepsilon(U)| \lesssim \int_0^d \sqrt {\log N(\delta, \mathcal S, d_{\mathbb G_\varepsilon})} d \delta \lesssim \sqrt{d}\end{aligned}$$

and hence we deduce that

$$\displaystyle \begin{aligned} \mathbb E _\varepsilon \sup_{U \in \mathcal S}\frac{1}{n}\left|\langle \mathcal X \mathbb U, \varepsilon\rangle \right| = \mathbb E _\varepsilon \frac{1}{\sqrt{n}}\sup_{U \in \mathcal S} |\mathbb G_\varepsilon(U)| \lesssim \sqrt{\frac{d}{n}} \end{aligned} $$
(18.46)

with constants independent of X, so that the result follows from applying Markov’s inequality. \(\blacksquare \)

18.5.4.2 Step II

Define the estimator

$$\displaystyle \begin{aligned} \hat \theta' = \tilde \theta + \frac{1}{n} \sum_{i=1}^n X^i Y_i^{\prime} = \tilde \theta + \tilde \gamma_\eta (I_d). \end{aligned}$$

Then we can write, using \(U^T \tilde \gamma _\eta (I_d)U = \tilde \gamma _\eta (U)\),

$$\displaystyle \begin{aligned} U^T \hat \theta' U - U^T \theta U & = U^T(\tilde \theta + \tilde \gamma_\eta(I_d))U - U^T(\tilde \theta + \eta)U \\ & = \tilde \gamma_\eta(U) - \gamma_\eta(U), \end{aligned} $$

and from the previous lemma we conclude that, with probability ≥ 1 − δ, uniformly in unit vectors U,

$$\displaystyle \begin{aligned} |U^T \hat \theta' U - U^T \theta U | \le v_n. \end{aligned}$$

Let now \(\hat \theta \) be any symmetric positive definite matrix such that, for all unit vectors U,

$$\displaystyle \begin{aligned} |U^T \hat \theta U - U^T \hat \theta' U | \le v_n. \end{aligned}$$

Such a matrix exists, for instance θ itself (which lies in R +(k)), and by the triangle inequality we also have

$$\displaystyle \begin{aligned} |U^T \hat \theta U - U^T \theta U | \le 2v_n. \end{aligned} $$
(18.47)

Lemma 18.3

Let M be a symmetric positive definite d × d matrix with eigenvalues λ j ordered such that λ 1 ≥ λ 2 ≥ … ≥ λ d. For any j ≤ d consider an arbitrary collection of j orthonormal vectors \(\mathcal V_j = (V^\iota : 1 \le \iota \le j)\) in \(\mathbb R^d\). Then we have

$$\displaystyle \begin{aligned}(a)~~ \lambda_{j+1} \le \sup_{U \in \mathcal S, U \perp span(\mathcal V_j)} U^TMU,\end{aligned}$$

and

$$\displaystyle \begin{aligned}(b)~~~ \sum_{\iota \le j} \lambda_\iota \ge \sum_{\iota \le j} (V^\iota)^T M V^\iota.\end{aligned}$$

Let \(\hat R\) be the rotation that diagonalises \(\hat \theta \), i.e., \(\hat R^T \hat \theta \hat R = diag(\hat \lambda _j: j =1, \dots , d)\), ordered such that \(\hat \lambda _j \ge \hat \lambda _{j+1}\) for all j. Moreover let R be the rotation that does the same for θ and its eigenvalues λ j. We apply the previous lemma with \(M= \hat \theta \) and \(\mathcal V\) equal to the column vectors \(r_\iota, \iota \le l-1\), of R to obtain, for any fixed l ≤ j ≤ d,

$$\displaystyle \begin{aligned} \hat \lambda_l \le \sup_{U \in \mathcal S, U \perp span(r_\iota, \iota \le l-1)} U^T \hat \theta U, \end{aligned} $$
(18.48)

and also that

$$\displaystyle \begin{aligned} \sum_{l \le j} \hat \lambda_l \ge \sum_{l \le j} r_{l}^T \hat \theta r_{l}. \end{aligned} $$
(18.49)

From (18.47) we deduce that

$$\displaystyle \begin{aligned}\hat \lambda_l \le \sup_{U \in \mathcal S, U \perp span(r_\iota, \iota \le l-1)} U^T \theta U + 2 v_n = \lambda_l +2v_n ~~~\forall ~l \le j,\end{aligned}$$

as well as

$$\displaystyle \begin{aligned}\sum_{l \le j} \hat \lambda_l \ge \sum_{l \le j} r_{l}^T \theta r_{l} - 2jv_n = \sum_{l \le j} \lambda_l - 2jv_n,\end{aligned}$$

with probability ≥ 1 − δ. Combining these bounds we obtain

$$\displaystyle \begin{aligned} \left| \sum_{l \le j} \hat \lambda_l - \sum_{l \le j} \lambda_l \right| \le 2jv_n,~~~j \le d. \end{aligned} $$
(18.50)
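Numerically, the quantity controlled by (18.50) can be monitored as follows (a sketch, NumPy assumed, function name ours).

```python
import numpy as np

def partial_sum_gap(theta_hat, theta):
    # max_j |sum_{l<=j} hat(lambda)_l - sum_{l<=j} lambda_l| / j,
    # the quantity that (18.50) bounds by 2 v_n
    lam_hat = np.linalg.eigvalsh(theta_hat)[::-1]   # decreasing order
    lam = np.linalg.eigvalsh(theta)[::-1]
    gaps = np.abs(np.cumsum(lam_hat) - np.cumsum(lam))
    return (gaps / np.arange(1, len(lam) + 1)).max()
```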

18.5.4.3 Step III

We show that the confidence set covers the true parameter on the event of probability ≥ 1 − δ on which Steps I and II are valid, and for the constant C chosen large enough.

Let \(\Pi = \Pi _{R^+(2\hat k)}\) be the projection operator onto \(R^+(2 \hat k)\). We have

$$\displaystyle \begin{aligned}\|\hat \vartheta - \theta\|{}_{S_1} \le \|\hat \vartheta - \Pi \theta\|{}_{S_1} + \|\Pi \theta - \theta\|{}_{S_1}.\end{aligned}$$

We have, using (18.50) and Lemma 18.5 below

$$\displaystyle \begin{aligned} \|\Pi \theta - \theta\|{}_{S_1} &= \sum_{J > 2\hat k} \lambda_J = 1- \sum_{J \le 2 \hat k} \lambda_J \\ & \le 1- \sum_{J \le 2 \hat k} \hat \lambda_J + 4 \hat k v_n \\ & \le 6 v_n \hat k \le (C/2) \sqrt{\hat k} r_n(\hat k) \end{aligned} $$

for C large enough.

Moreover, using the oracle inequality (18.42) with S =  Πθ and (18.43),

$$\displaystyle \begin{aligned} \|\hat \vartheta - \Pi \theta\|{}_{S_1} &\le \sqrt{4 \hat k} \|\hat \vartheta - \Pi \theta\|{}_{F} \\ & \le \sqrt{4 \hat k} ( \|\hat \vartheta - \theta\|{}_{F} + \|\Pi \theta -\theta\|{}_{F})\\ & \le \sqrt{4 \hat k} (\|\hat \vartheta - \tilde \theta_{\mathrm{Pilot}}\|{}_{F} + \|\tilde \theta_{\mathrm{Pilot}} - \theta\|{}_{F} + \|\Pi \theta -\theta\|{}_{F})\\ & \lesssim \sqrt{\hat k} (r_n(\hat k) + \|\Pi \theta -\theta\|{}_{F}). \end{aligned} $$

We finally deal with the approximation error: Note

$$\displaystyle \begin{aligned}\|\Pi\theta - \theta \|{}_{F}^2 = \sum_{l > 2 \hat k} \lambda_l^2 \le \max_{l > 2 \hat k} |\lambda_l| \sum_{l>2 \hat k} |\lambda_l|. \end{aligned}$$

By (18.50) we know that

$$\displaystyle \begin{aligned}\sum_{l > \hat k} \lambda_l = 1- \sum_{l \le \hat k} \lambda_l \le 1- \sum_{l \le \hat k} \hat \lambda_l + 2 v_n \hat k \le 4 v_n \hat k.\end{aligned}$$

Hence, among the λ l’s with indices \(l >\hat k\), fewer than \(\hat k\) can exceed \(4v_n\). Since the eigenvalues are ordered, this implies that the λ l’s with indices \(l>2 \hat k\) are all less than or equal to \(4v_n\); hence, using again (18.50) and the definition of \(\hat k\), the quantity in the last but one display is (since \(\hat k < 2 \hat k\)) bounded by

$$\displaystyle \begin{aligned}4 v_n \left(1-\sum_{l\le\hat k} |\lambda_l|\right) \lesssim v_n \left(1-\sum_{l\le\hat k} |\hat \lambda_l| \right) + \hat k v^2_n \lesssim v_n^2 \hat k \lesssim \sqrt {\hat k} r_n(\hat k).\end{aligned}$$

Overall we get the bound

$$\displaystyle \begin{aligned}\|\hat \vartheta - \Pi \theta\|{}_{S_1} \lesssim \hat k v_n \lesssim (C/2) \sqrt{\hat k} r_n(\hat k)\end{aligned}$$

for C large enough, which completes the proof of coverage of C n by collecting the above bounds. The diameter bound follows from \(\hat k \le k_0\) (in view of the defining inequalities of \(\hat k\) being satisfied, for instance, for \(\tilde \theta ' = \theta \), whenever θ ∈ R +(k 0)).

18.6 Auxiliary Results

18.6.1 Proof of Lemma 18.3

  1. (a)

Consider the subspaces \(E = span((V^\iota)_{\iota \le j})^{\perp}\) and \(F = span((e_\iota)_{\iota \le j+1})\) of \(\mathbb R^d\), where the e ι’s are the eigenvectors of the d × d matrix M corresponding to the eigenvalues λ ι. Since dim(E) + dim(F) = (d − j) + (j + 1) = d + 1, the intersection \(E \cap F\) contains a subspace of dimension at least 1. Take \(U \in E \cap F\) such that ∥U∥ = 1. Since U ∈ F, it can be written as

    $$\displaystyle \begin{aligned}U = \sum_{\iota=1}^{j+1} u_{\iota} e_{\iota}\end{aligned}$$

    for some coefficients u ι. Since the e ι’s are orthogonal eigenvectors of the symmetric matrix M we necessarily have

    $$\displaystyle \begin{aligned}MU = \sum_{\iota=1}^{j+1} \lambda_{\iota} u_{\iota} e_{\iota},\end{aligned}$$

    and thus

    $$\displaystyle \begin{aligned}U^T MU = \sum_{\iota=1}^{j+1} \lambda_{\iota} u_{\iota}^2.\end{aligned}$$

Since the λ ι’s are all non-negative and ordered decreasingly, one has

    $$\displaystyle \begin{aligned}U^T MU = \sum_{\iota=1}^{j+1} \lambda_{\iota} u_{\iota}^2 \geq \lambda_{j+1} \sum_{\iota=1}^{j+1} u_{\iota}^2 = \lambda_{j+1} \|U\|{}^2 = \lambda_{j+1}.\end{aligned}$$

    Taking the supremum in U yields the result.

  2. (b)

    For each ι ≤ j, let us write the decomposition of V ι on the basis of eigenvectors (e l : l ≤ d) of M as

    $$\displaystyle \begin{aligned}V^{\iota} = \sum_{l \leq d} v^{\iota}_l e_{l}.\end{aligned}$$

    Since the (e l) are the eigenvectors of M we have

    $$\displaystyle \begin{aligned}\sum_{\iota \leq j} (V^{\iota})^T M V^{\iota} = \sum_{\iota \leq j} \sum_{l=1}^d \lambda_{l} (v^{\iota}_l)^2,\end{aligned}$$

    where \(\sum _{l=1}^d (v^{\iota }_l)^2 = 1\) and \(\sum _{\iota \leq j} (v^{\iota }_l)^2 \leq 1\), since the V ι are orthonormal. The last expression is maximised in \((v^{\iota }_l)_{\iota \leq j, 1 \le l \leq d}\) and under these constraints, when \(v^{\iota }_{\iota } = 1\) and \(v^{\iota }_{l} = 0\) if ι ≠ l (since the (λ ι) are in decreasing order), and this gives

    $$\displaystyle \begin{aligned}\sum_{\iota \leq j} (V^{\iota})^T M V^{\iota} \leq \sum_{\iota \leq j} \lambda_{\iota}.\end{aligned}$$

18.6.2 Some Further Lemmas

Lemma 18.4

Under the RIP (18.7) we have, for every 1 ≤ k ≤ d, with probability at least 1 − δ,

$$\displaystyle \begin{aligned} \sup_{A, B \in R(k)} \left|\frac{\frac{1}{n}\langle \mathcal XA, \mathcal XB \rangle - \langle A, B \rangle_{F} }{\|A\|{}_F\|B\|{}_F}\right| \le 10 \tau_n(k). \end{aligned} $$
(18.51)

Proof

The matrix RIP can be written as

$$\displaystyle \begin{aligned} \sup_{A\in R(k)} \left| \frac{\langle \mathcal XA, \mathcal XA \rangle}{n \langle A,A\rangle_F} -1 \right| =\frac{|\langle A , (n^{-1} M- \mathbb I) A \rangle_F|}{\langle A,A\rangle_F}\leq \tau_n(k), \end{aligned} $$
(18.52)

for a suitable \(M\in \mathbb H_{d^2}(\mathbb C)\). The above bound then follows from applying the Cauchy-Schwarz inequality to

$$\displaystyle \begin{aligned} \frac{1}{n}\langle \mathcal XA, \mathcal XB \rangle - \langle A, B \rangle_F =\langle A , (n^{-1} M- \mathbb I) B \rangle_F . \end{aligned} $$
(18.53)

\(\blacksquare \)

The following lemma can be proved by basic linear algebra, and is left to the reader.

Lemma 18.5

Let M ≥ 0 have eigenvalues (λ j)j ordered decreasingly. Denote by \(\Pi _{R^+(j-1)}\) the projection onto R +(j − 1) = R(j − 1) ∩ Θ +. Then for any 2 ≤ j ≤ d we have

$$\displaystyle \begin{aligned}\sum_{j'\geq j} \lambda_{j'} = \|M - \Pi_{R^+(j-1)} M\|{}_{S_1}.\end{aligned}$$
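A quick numerical confirmation of the lemma (NumPy assumed): truncating the eigendecomposition at level j − 1 realises the projection, and the \(S_1\)-distance then equals the tail eigenvalue sum.

```python
import numpy as np

rng = np.random.default_rng(2)
d, j = 6, 3

A = rng.standard_normal((d, d))
M = A @ A.T                                   # M >= 0
lam, V = np.linalg.eigh(M)
lam, V = lam[::-1], V[:, ::-1]                # eigenvalues in decreasing order

# keep the top j-1 eigen-components: a rank-(j-1) PSD matrix
P = (V[:, :j - 1] * lam[:j - 1]) @ V[:, :j - 1].T
s1 = np.abs(np.linalg.eigvalsh(M - P)).sum()  # ||M - Pi M||_{S_1}
print(np.isclose(s1, lam[j - 1:].sum()))      # True: equals sum_{j' >= j} lambda_{j'}
```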