Abstract
Compressed sensing (CS) ensures the recovery of sparse vectors from a number of randomized measurements proportional to their sparsity. The initial theory considers discretized domains, and the randomness makes the physical positions of the grid nodes irrelevant. Most imaging devices, however, operate over some continuous physical domain, and it makes sense to consider Dirac masses with arbitrary positions. In this article, we consider such a continuous setup and analyze the performance of the BLASSO algorithm, which is the continuous extension of the celebrated LASSO \(\ell ^1\) regularization method. This approach is appealing from a numerical perspective because it avoids discretizing the domain of interest. Previous works considered translation-invariant measurements, such as randomized Fourier coefficients, where it becomes clear that the discrete theory should be extended by imposing a minimum distance separation constraint (often called “Rayleigh limit”) between the Diracs. These prior works, however, rule out many domains and sensing operators of interest, which are not translation invariant. This includes, for instance, Laplace measurements over the positive reals and Gaussian mixture models over the mean-covariance space. Our theoretical advances crucially rely on the introduction of a canonical metric associated with the measurement operator: the so-called Fisher geodesic distance. In the case of Fourier measurements, one recovers the Euclidean metric, but this metric can cope with arbitrary (possibly non-translation-invariant) domains. Furthermore, it is naturally invariant under joint reparameterization of both the sensing operator and the Dirac locations.
Our second and main contribution shows that if the Fisher distance between spikes is larger than a Rayleigh separation constant, then the BLASSO stably recovers a stream of Diracs, provided that the number of measurements is proportional (up to log factors) to the number of Diracs. We measure the stability using an optimal transport distance constructed on top of the Fisher geodesic distance. Our result is sharp (up to log factors) and does not require any randomness assumption on the amplitudes of the underlying measure. Our proof technique relies on an infinite-dimensional extension of the so-called golfing scheme which operates over the space of measures and is of general interest.
1 Introduction
Sparse regularization, and in particular convex approaches based on \(\ell ^1\) minimization, is one of the workhorses for ill-posed linear inverse problems. It finds numerous applications ranging from signal processing [19] to machine learning [56]. It is thus also the method of choice to solve the compressed sensing (CS) problem [17, 29], which is an inverse problem where the linear operator is random. Randomness of the linear operator makes the recovery possible as soon as the number of observations is of the order (up to log factors) of the number of nonzero elements in the vector to recover. This theory and the associated numerical solvers are fundamentally discrete, which is at odds with most imaging scenarios, where CS needs to be adapted to deal with physical constraints. Purely discrete random operators are idealizations, and studying random operators obtained by random sampling of continuous operators (e.g., Fourier measurements) requires the study of the so-called Rayleigh limit. This is the minimum separation distance between the individual elements forming the object of interest (in the following, Dirac masses) required to ensure that a stable recovery is possible. Furthermore, extending CS to continuous domains and sub-sampled continuous operators is of both practical and theoretical interest. It avoids gridding the parameter space, thus enabling more efficient solvers and a sharper theoretical analysis.
The natural continuous extension of the \(\ell ^1\) approach encodes the positions and amplitudes of the sought-after solution into a Radon measure. The \(\ell ^1\) norm is then replaced by the total variation (total mass) of the measure, and a measure is naturally said to be “sparse” when it is a sum of Diracs at the desired positions and amplitudes. The corresponding infinite-dimensional optimization problem is called BLASSO in [25] following theoretical works on spectral extrapolation [7]. This setting of optimization over measures has also been considered in the inverse problems community [9]. Successful examples of applications of such “off-the-grid” methods include single-molecule fluorescent imaging [8], spike sorting in neuroscience [33], mixture model estimation [37] and training shallow neural networks [5]. Previous theoretical works on “off-the-grid” CS are, however, focused on domains which are either the whole space \({\mathbb {R}}^d\) or the periodic torus \({\mathbb {T}}^d\), and consider translation-invariant measurements (such as random Fourier measurements or sub-sampled convolutions). In this article, we provide a sharp analysis of a general class of operators over arbitrary domains.
1.1 Sparse Spikes Recovery Using the BLASSO
1.1.1 Observation Model
We consider the general problem of estimating a complex-valued unknown Radon measure \(\mu _0 \in {\mathcal {M}}({\mathcal {X}})\) defined over some metric space \({\mathcal {X}}\) from a small number m of randomized linear observations \(y \in {\mathbb {C}}^m\). In this paper, \({\mathcal {X}}\) will either be a connected bounded open subset of \({\mathbb {R}}^d\) or the d-dimensional torus \({\mathbb {T}}^d\), even though some of our results extend beyond this case. We define the inner product between a complex-valued continuous function \(f \in {\mathscr {C}}^{}({\mathcal {X}})\) and a complex-valued measure \(\mu \in {\mathcal {M}}({\mathcal {X}})\) as \(\langle f,\,\mu \rangle _{\mathcal {M}} {\mathop {=}\limits ^{\text {def.}}}\int _{\mathcal {X}}\overline{f(x)} \mathrm {d} \mu (x)\). The (forward) measurement operator \( \varPhi : {\mathcal {M}}({\mathcal {X}}) \rightarrow {\mathbb {C}}^m\) that we consider in this paper is of the form

$$ \varPhi \mu {\mathop {=}\limits ^{\text {def.}}}\frac{1}{\sqrt{m}} \left( \langle \varphi _{\omega _k},\,\mu \rangle _{\mathcal {M}} \right) _{k=1}^m, \qquad (1) $$
where \((\omega _1,\ldots ,\omega _m)\) are parameters independently and identically distributed according to a probability distribution \(\varLambda (\omega )\) over some space \(\varOmega \), and \(\varphi _\omega : {\mathcal {X}}\rightarrow {\mathbb {C}}\) are smooth functions parameterized by \(\omega \). We further assume that \(\varphi _\omega \) is normalized, that is, \( {\mathbb {E}}_{\omega \sim \varLambda }[\left|\varphi _\omega (x)\right|^2] = 1\) for all \(x\in {\mathcal {X}}\). Our observations are of the form

$$ y = \varPhi (\mu _0 + {\tilde{\mu }}_0) + w, \qquad (2) $$
where \(\mu _0=\sum _{i=1}^s a_i \delta _{x_i}\) with \( (x_i,a_i) \in {\mathcal {X}}\times {\mathbb {C}}\) is the s-sparse measure we are interested in, \({\tilde{\mu }}_0\in {\mathcal {M}}({\mathcal {X}})\) accounts for modeling error, and \(w\in {\mathbb {C}}^m\) is measurement noise. In the rest of the paper, we naturally assume that the support of \({\tilde{\mu }}_0\) does not include any of the \(x_i\)’s.
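To make the observation model concrete, here is a minimal numerical sketch (our own illustration, not taken from the paper) of the operator \(\varPhi \) applied to a sparse measure, using 1-D random Fourier features on the torus; the uniform frequency distribution and the \(m^{-1/2}\) normalization are assumptions chosen so that \({\mathbb {E}}_\omega [|\varphi _\omega (x)|^2]=1\) holds.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random Fourier features phi_omega(x) = exp(2i*pi*omega*x) on the 1-D torus;
# |phi_omega(x)| = 1, so the normalization assumption E|phi|^2 = 1 holds.
m, f_c = 200, 20
omegas = rng.integers(-f_c, f_c + 1, size=m)   # omega_k ~ Lambda (uniform, an assumption)

def apply_Phi(positions, amplitudes):
    """Apply Phi to the sparse measure mu = sum_i a_i delta_{x_i}:
    (Phi mu)_k = m^{-1/2} sum_i conj(phi_{omega_k}(x_i)) a_i."""
    F = np.exp(-2j * np.pi * np.outer(omegas, positions))  # conj(phi) at the spikes
    return F @ amplitudes / np.sqrt(m)

y = apply_Phi(np.array([0.1, 0.45, 0.7]), np.array([1.0, -0.5, 2.0]))
```

A one-spike sanity check: \(\left\| \varPhi \delta _x \right\| _2 = 1\) exactly, since the features have unit modulus.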
1.1.2 BLASSO
An increasingly popular method to estimate such a sparse measure corresponds to solving an infinite-dimensional analogue of the Lasso regression problem with regularization parameter \(\lambda >0\),

$$ \min _{\mu \in {\mathcal {M}}({\mathcal {X}})} \frac{1}{2}\left\| \varPhi \mu - y \right\| _2^2 + \lambda |\mu |({\mathcal {X}}). \qquad ({\mathcal {P}}_\lambda (y)) $$
Following [25], we call this method the BLASSO (for Beurling-LASSO). Here, \(|\mu |({\mathcal {X}})\) is the so-called total variation (or total mass) of the measure \(\mu \) and is defined as

$$ |\mu |({\mathcal {X}}) {\mathop {=}\limits ^{\text {def.}}}\sup \left\{ \mathrm {Re}\, \langle f,\,\mu \rangle _{\mathcal {M}} \;;\; f\in {\mathscr {C}}({\mathcal {X}}),\ \sup _{x\in {\mathcal {X}}}|f(x)| \leqslant 1 \right\} . $$
Note that on unbounded \({\mathcal {X}}\), one needs to impose that f vanishes at infinity. If \({\mathcal {X}}=\{x_i\}_i\) is a finite space, then this would correspond to the classical finite-dimensional LASSO problem [56], because \(|\mu |({\mathcal {X}})=\left\| a \right\| _1 {\mathop {=}\limits ^{\text {def.}}}\sum _i |a_i|\) where \(a_i=\mu (\{x_i\})\). Similarly, when \({\mathcal {X}}\) is possibly infinite but \(\mu =\sum _i a_i \delta _{x_i}\), one also has that \(|\mu |({\mathcal {X}})=\left\| a \right\| _1\). The noiseless limit of (\({\mathcal {P}}_\lambda (y)\)) as \(\lambda \rightarrow 0\) is

$$ \min _{\mu \in {\mathcal {M}}({\mathcal {X}})} |\mu |({\mathcal {X}}) \quad \text {subject to} \quad \varPhi \mu = y. \qquad ({\mathcal {P}}_0(y)) $$
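The finite-grid reduction to the LASSO can be illustrated numerically. The sketch below (our own, with an arbitrary grid, frequency range and regularization level) solves the discretized problem by proximal gradient descent (ISTA) with complex soft-thresholding.

```python
import numpy as np

rng = np.random.default_rng(1)

# On a finite grid X = {x_1,...,x_n}, the BLASSO reduces to the LASSO:
# min_a 0.5*||Phi a - y||^2 + lam*||a||_1, with complex amplitudes a.
n, m, f_c = 64, 40, 15
grid = np.arange(n) / n
omegas = rng.integers(-f_c, f_c + 1, size=m)
Phi = np.exp(-2j * np.pi * np.outer(omegas, grid)) / np.sqrt(m)

a0 = np.zeros(n, dtype=complex); a0[[5, 30]] = [1.0, -0.8]   # 2-sparse ground truth
y = Phi @ a0

def ista(Phi, y, lam, n_iter=500):
    L = np.linalg.norm(Phi, 2) ** 2            # Lipschitz constant of the data-fit gradient
    a = np.zeros(Phi.shape[1], dtype=complex)
    for _ in range(n_iter):
        g = a - (Phi.conj().T @ (Phi @ a - y)) / L
        # complex soft-thresholding: shrink the modulus, keep the phase
        a = np.exp(1j * np.angle(g)) * np.maximum(np.abs(g) - lam / L, 0)
    return a

a_hat = ista(Phi, y, lam=1e-3)
```

Since ISTA with step size \(1/L\) monotonically decreases the objective, the iterate is guaranteed to do at least as well as the zero initialization.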
1.2 Previous Works
The initial development of CS [17, 29] considered only discrete problems, which corresponds to imposing that \({\mathcal {X}}\) is a fixed discrete space. The use of “structured” measurements, typically random Fourier frequencies, requires this grid to be uniform [17]. This forbids the recovered Diracs to be closer than the grid resolution, thus implicitly imposing a Rayleigh limit. These initial works have been extended to “continuous” domains, typically making use of continuous Fourier measurements up to frequency \(f_c\). Without random sub-sampling, the main result of Candès and Fernandez-Granda [15] shows that in this noiseless setting with \(y = \varPhi \mu _0\), (\({\mathcal {P}}_0(y)\)) exactly recovers \(\mu _0\) under a so-called Rayleigh criterion: the minimum distance between two spikes \(\min _{i\ne j}\left\| x_i - x_j \right\| \) must be at least \({\mathcal {O}}(1/f_c)\). Note that this limit is consistent with the initial discrete analysis, since in this case \(1/f_c\) is equal to the grid spacing. This result has been extended to provide robustness to noise [4, 14, 30, 35] and to cope with more general measurement operators [6]. The CS setup is then obtained by randomly sub-sampling the Fourier frequencies. The first work in this compressed sensing direction is by Tang et al. [55], who showed that the recovery guarantees of [15] remain valid with high probability when only a small number of (Fourier) measurements are randomly selected, of the order (up to log factors) of the sparsity of the underlying measure. All these previous theoretical works, however, strongly rely on the translation invariance of the linear operator (Fourier measurements or convolutions) and the underlying domain (either the Euclidean space or the periodic torus). Directly applying these results to spatially varying operators (such as, for instance, when imaging with non-stationary point spread functions) generally leads to overly pessimistic minimum separation conditions.
The goal of this paper is thus to study the CS problem on arbitrary domains and with arbitrary operators, which necessitates replacing the Euclidean distance by an intrinsic metric induced by the operator \(\varPhi \): the Fisher geodesic distance.
Note that this “Rayleigh criterion” is critical to the performance of (\({\mathcal {P}}_\lambda (y)\)): it is shown in [30] that two spikes of opposite signs cannot be recovered if their separation is smaller than \(1/f_c\). Although it is not the topic of this paper, let us note that lifting the minimum separation condition requires imposing positivity of the spikes [25, 51], and the price to pay is an explosion of the instabilities as spikes cluster together [27]. Another point to note is that existing theoretical results on continuous CS are only valid under a random signs assumption on the amplitudes of the sought-after Dirac masses. This forbids in particular imposing positive signs on the solution. This random sign hypothesis is a well-known assumption in classical discrete compressed sensing [17, 29] but appears somewhat unrealistic. Our analysis does not impose such a random sign constraint, which requires different proof techniques, in particular extending the so-called golfing scheme method to this continuous setting.
1.2.1 Numerical Solvers and Alternative Approaches
The focus of this paper is on the theoretical analysis of the performance of the BLASSO method, not on the development and analysis of efficient numerical solvers. Although the BLASSO problem is infinite-dimensional, there are efficient numerical solvers that use the fact that the sought-after sparse solution is parameterized by a small number of parameters (positions and amplitudes of the spikes). This opens the door to algorithms which do not scale with some grid size and hence can scale beyond 1-D and 2-D problems. Let us mention in particular: (i) discretization on a grid [31, 54], (ii) semi-definite programming (SDP) relaxation using the Lasserre hierarchy [15, 26], (iii) Frank–Wolfe and its variants [8, 9, 37], (iv) non-convex particle flows [20].
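To illustrate the greedy flavor of Frank–Wolfe-type solvers, here is a hypothetical minimal variant of our own (not the algorithms of the cited works): each step adds the spike location maximizing the correlation \(|\varPhi ^* r|\) with the current residual, then refits all amplitudes by least squares (a "fully corrective" step).

```python
import numpy as np

rng = np.random.default_rng(2)
m, f_c, s = 100, 20, 3
omegas = rng.integers(-f_c, f_c + 1, size=m)

def phi_col(x):
    # column Phi delta_x for Fourier features on the torus (illustrative normalization)
    return np.exp(-2j * np.pi * omegas * x) / np.sqrt(m)

x_true = np.array([0.15, 0.5, 0.8]); a_true = np.array([1.0, 0.7, -1.2])
y = np.stack([phi_col(x) for x in x_true], axis=1) @ a_true

cand = np.linspace(0, 1, 2000, endpoint=False)             # fine candidate grid
Phic = np.stack([phi_col(x) for x in cand], axis=1)

supp, r = [], y.copy()
for _ in range(s):
    eta = np.abs(Phic.conj().T @ r)                        # correlation |Phi^* r|
    supp.append(cand[int(np.argmax(eta))])                 # add the best new spike
    A = np.stack([phi_col(x) for x in supp], axis=1)
    a, *_ = np.linalg.lstsq(A, y, rcond=None)              # fully corrective amplitude refit
    r = y - A @ a
```

Practical variants also locally slide the spike positions by gradient descent between iterations; the candidate grid above is only a crude stand-in for that step.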
1.2.2 Other Approaches
The BLASSO is by no means the only method for estimating sparse measures in an off-the-grid setup. One of the first known methods for recovering a sum of Diracs (from Fourier measurements) is Prony’s method [46], which aims to recover the Dirac positions by finding the zeros of some polynomial whose coefficients are derived from the measurements y. This approach is non-variational and non-convex. Several extensions (with improved robustness to noise) have also been proposed, such as MUSIC and ESPRIT [48, 52]. We refer to [41] for a theoretical analysis of these methods. In practice, when the noise is small and the spikes tend to cluster so that the minimum separation condition does not hold, these methods often surpass the BLASSO in terms of performance. However, these methods are relevant only for spectral (Fourier-type) measurements, and their extension to the multivariate setting is non-trivial; see, for instance, [40, 50] for extensions. As a rule of thumb, \(\ell ^1\) regularization remains a good baseline, which benefits from both efficient and stable numerical solvers and an in-depth theoretical analysis that leverages the convexity of the problem.
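To make this concrete, here is a toy implementation of ours (not from [46]) in the noiseless 1-D case: from the trigonometric moments \(c_k = \sum _i a_i z_i^k\) with \(z_i = e^{2\mathrm {i}\pi x_i}\), the \(z_i\) are exactly the roots of an annihilating polynomial whose coefficients solve a small Hankel linear system.

```python
import numpy as np

# Prony's method sketch: the annihilating polynomial P(X) = prod_i (X - z_i)
# satisfies sum_j p_j c_{k+j} + c_{k+s} = 0 for its lower coefficients p_j.
x_true = np.array([0.1, 0.4, 0.75]); a_true = np.array([1.0, 2.0, -0.5])
s = len(x_true)
z = np.exp(2j * np.pi * x_true)
c = np.array([(a_true * z**k).sum() for k in range(2 * s)])   # moments c_0..c_{2s-1}

# Hankel system: sum_{j=0}^{s-1} p_j c_{k+j} = -c_{k+s},  k = 0..s-1.
H = np.array([[c[k + j] for j in range(s)] for k in range(s)])
p = np.linalg.solve(H, -c[s:2 * s])
roots = np.roots(np.concatenate(([1.0], p[::-1])))            # monic annihilator, highest degree first
x_rec = np.sort(np.mod(np.angle(roots) / (2 * np.pi), 1.0))   # positions from root phases
```

In this noiseless, well-separated toy example the positions are recovered to machine precision; the noise sensitivity of this root-finding step is precisely what MUSIC- and ESPRIT-type extensions address.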
1.3 Contributions
Our main result is stated informally in Theorem 1 and in full detail in Theorem 3. It ensures sharp compressed recovery guarantees for a large class of measurement operators over general domains. A salient feature of this statement is that, contrary to previous works such as [55], it does not require randomness of the signs of the sought-after measure. This is achieved by extending the so-called golfing scheme [16, 39] to the infinite-dimensional setting. At the heart of this result is the definition of an intrinsic distance over the parameter domain, the so-called Fisher geodesic distance. It is defined by the metric tensor associated with the covariance kernel of the measurement operator. This definition is crucial both to define the Rayleigh limit of the problem and to quantify the recovery error using the optimal transport distance induced over the space of measures by the Fisher distance.
We now give a more precise exposition of these results. We define the limit covariance kernel as

$$ K(x,x') {\mathop {=}\limits ^{\text {def.}}}{\mathbb {E}}_{\omega \sim \varLambda }\left[ \varphi _\omega (x) \overline{\varphi _\omega (x')} \right] , $$
which measures how much two Diracs at x and \(x'\) interact with each other in the large-sample limit as \(m\rightarrow \infty \), and assume that K is real-valued (primary examples include the Gaussian kernel, or the so-called Jackson kernel used in [15]). Define the metric tensor \({{\mathfrak {g}}}_x {\mathop {=}\limits ^{\text {def.}}}\nabla _1\nabla _2 K(x,x) \in {\mathbb {R}}^{d\times d}\), where \(\nabla _i\) indicates the gradient with respect to the ith variable, and assume that for all \(x\in {\mathcal {X}}\), it is a positive definite matrix. Finally, define the associated geodesic distance \({\mathfrak {d}}_{{\mathfrak {g}}}(x,x') = \inf _\gamma \int _0^1 \sqrt{\gamma '(t)^\top {{\mathfrak {g}}}_{\gamma (t)}\gamma '(t)}\mathrm {d}t\), where the infimum is taken over all continuous paths \(\gamma :[0,1] \rightarrow {\mathcal {X}}\) such that \(\gamma (0)=x\) and \(\gamma (1)=x'\). (More details about this geodesic distance are given in Sect. 3.1.) Denote by \({\mathcal {B}}_{{\mathfrak {d}}_{{\mathfrak {g}}}}(x;r)\) the ball of radius r centered on x, for the metric \({\mathfrak {d}}_{{\mathfrak {g}}}\). The main result of the paper, here stated in an informal way, is the following.
Theorem 1
(Main result, informal) Let \(R_{\mathcal {X}}{\mathop {=}\limits ^{\text {def.}}}\sup _{x,x'\in {\mathcal {X}}} {\mathfrak {d}}_{{\mathfrak {g}}}(x,x')\). Under some assumptions on the kernel \(K\) (see Assumption 1 in Sect. 4) and features \(\varphi _\omega \) (see Assumption 2 in Sect. 5), there are constants \(r, \varDelta >0\) that only depend on \(K\), and \(C_1, C_2>0\) which depend on K and the regularity of \(\varphi _{\omega _k}\) (up to 2nd order), such that the following holds. Suppose that y is of the form (2) with \(\min _{i\ne j}{\mathfrak {d}}_{{\mathfrak {g}}}(x_i,x_j) \geqslant \varDelta \) and
Then, with probability \(1-\rho \), when \(\left\| w \right\| \leqslant \delta \) and \(\lambda \sim \frac{\delta }{\sqrt{s}}\), any solution \({\hat{\mu }}\) to (\({\mathcal {P}}_\lambda (y)\)) satisfies
where \({\hat{A}}_j {\mathop {=}\limits ^{\text {def.}}}\left|{\hat{\mu }}\right|({\mathcal {B}}_{{\mathfrak {d}}_{{\mathfrak {g}}}}(x_j;r))\), \({\hat{a}}_j {\mathop {=}\limits ^{\text {def.}}}{\hat{\mu }}({\mathcal {B}}_{{\mathfrak {d}}_{{\mathfrak {g}}}}(x_j;r))\), and \({\mathcal {T}}_{{\mathfrak {d}}_{{\mathfrak {g}}}}\) is the partial optimal transport distance associated with \({\mathfrak {d}}_{{\mathfrak {g}}}\) (see Definition 1).
Let us comment on this result. From an inverse problem perspective (i.e., when no sub-sampling is used, or equivalently when letting \(m \rightarrow +\infty \)), Theorem 1 is already informative, since it defines and proves a Rayleigh limit in terms of a new intrinsic distance (the Fisher geodesic distance), which extends results previously available only in the translation-invariant case. From a compressed sensing perspective, the most salient feature of Theorem 1 is that, up to log factors, the bound (3) is linear in the sparsity of the underlying measure. This improves over the best known result of Tang et al. [55], since we do not require the random signs assumption.
The assumptions on the kernel \(K(x,x')\) mainly state that it must decrease sufficiently when x and \(x'\) are far apart, or, in other words, that the coherence between \(\varPhi \delta _x\) and \(\varPhi \delta _{x'}\) must be low. The main novelty of our approach is that we measure this separation in terms of the geodesic metric \({\mathfrak {d}}_{{\mathfrak {g}}}\), which makes it possible to account for non-translation-invariant kernels in an intrinsic and natural manner. The relationship between the decay of the kernel and separation is made explicit in our kernel width definition in Definition 3. As mentioned previously, separation is crucial for stability, and our definition of kernel width can be seen as an extension of the Babel function in compressed sensing [57] to the continuous setting, which links sparsity and separation to the well-conditioning of the corresponding covariance matrix. We refer to the discussion following Theorem 2 for further details. The assumptions on the features \(\varphi _\omega \), which are more technical in nature, relate to their regularity and the boundedness of their various derivatives.
Concerning the recovery bound (4), the first part states that the measure \({\hat{\mu }}\) concentrates around the true positions of the Diracs, while the second part guarantees that the complex amplitudes of \({\hat{\mu }}\) around the Diracs are close to their true values. The discrepancy in the first part is measured in terms of a partial optimal transport distance associated with \({\mathfrak {d}}_{{\mathfrak {g}}}\) (Def. 1 in Sec. 3). Although our error bound is linear with respect to the noise level \(\delta \), we do not expect the \(\sqrt{s}\) factor to be sharp; it is rather an artifact of our proof techniques. For instance, in [14], where sub-sampling is not considered, one obtains bounds of the form \(\sum _{j=1}^s \left|{\hat{a}}_j - a_j\right| \lesssim \delta \). We refer to the discussion after Proposition 1 for further remarks and links to previous works, but simply mention here that the existing proof techniques which lead to sharper bounds cannot be readily extended to the case of randomized measurements.
Finally, the constants \(C_1,C_2\) that appear in (3) can depend (generally polynomially) on the dimension d but not on the sparsity s. As we will see in Sect. 5 and the detailed version of Theorem 1 (Theorem 3), the bound (3) is actually valid when we suppose the features \(\varphi _\omega \) and their derivatives to be uniformly bounded for all x and \(\omega \). When this is not the case, we will be able to relax this assumption, similarly to the notion of stochastic incoherence [16] in compressed sensing. As a result, m can actually appear in \(C_1,C_2\), generally in a logarithmic form (see examples in Sect. 2), which only adds logarithmic terms in s and d in the final number of measurements.
1.3.1 Outline of the Paper
The paper is organized as follows. In Sect. 2, we give example applications of Theorem 1, including non-translation-invariant frameworks such as Laplace measurements used in microscopy [28]. In Sect. 3, we introduce our Riemannian geometry framework and prove intermediate recovery results based on the existence of a so-called non-degenerate dual certificate, which is known in the literature to be the key object in the analysis of the BLASSO model. In Sect. 4, we study in more detail the relationship between the minimal separation condition and the covariance kernel. We prove that, under some conditions on \(K\), in the limit \(m\rightarrow \infty \), one can indeed prove the existence of a non-degenerate dual certificate when minimal separation is imposed with respect to \({\mathfrak {d}}_{{\mathfrak {g}}}\). Finally, in Sect. 5, we state our main result with a finite number of measurements m (Theorem 3, which is a detailed version of Theorem 1). Section 6 is dedicated to its proof using an infinite-dimensional extension of the celebrated golfing scheme [16], with technical computations in the appendix.
1.3.2 Relationship to Our Previous Work
This article is a substantially extended version of the conference publication [44]. The results of Sect. 4 are for the most part already published (under slightly more restrictive assumptions) in this conference paper. The remainder of the paper is, however, entirely novel. We remove the random signs assumption of [44] thanks to a new proof technique based on the golfing scheme. Furthermore, the results in [44] are restricted to the small noise setting and focus on exact support stability, while we study here arbitrary noise levels and establish more general stability bounds in terms of optimal transport distances.
1.3.3 Notations
Given \(n\in {\mathbb {N}}\), we denote by \([n]{\mathop {=}\limits ^{\text {def.}}}\{1,2,\ldots , n\}\) the first n integers. We write \(1_n\) to denote the vector of length n whose entries are all 1’s, and \(0_n\) to denote the vector of length n whose entries are all 0’s. Given two matrices A and B, we write \(A\prec B\) to mean that \(B-A\) is positive definite and \(A\preceq B\) to mean that \(B-A\) is positive semi-definite. Given two positive numbers a, b, we write \(a\lesssim b\) to mean that there exists some universal constant \(C>0\) so that \(a \leqslant C b\). Given \(({\mathcal {X}}, {\mathfrak {d}})\) a metric space, \(x\in {\mathcal {X}}\) and \(r>0\), we define \({\mathcal {B}}_{\mathfrak {d}}(x;r){\mathop {=}\limits ^{\text {def.}}} \left\{ z\in {\mathcal {X}} \;;\; {\mathfrak {d}}(x,z)<r \right\} \) the ball centered on x of radius r, or just \({\mathcal {B}}_{\left\| \cdot \right\| }(r) = \left\{ z\in {\mathcal {X}} \;;\; \left\| z \right\| <r \right\} \) the ball centered on 0 for a norm \(\left\| \cdot \right\| \).
We write \(\left\| \cdot \right\| _p\) to denote the \(\ell _p\) norm, and \(\left\| \cdot \right\| \) without any subscript denotes the spectral norm for matrices or \(\ell _2\) norm for vectors. For any norm \(\left\| \cdot \right\| _X\) on vectors, the corresponding matrix norm is \(\left\| A \right\| _{X \rightarrow Y} = \sup _{\left\| x \right\| _X = 1} \left\| Ax \right\| _Y\) and \(\left\| A \right\| _X = \left\| A \right\| _{X \rightarrow X}\) for short.
Given a vector \(x \in {\mathbb {C}}^{sd}\) decomposed in blocks \(x = [x_1^\top ,\ldots ,x_s^\top ]^\top \) with \(x_i \in {\mathbb {C}}^d\), where s and d will always be defined without ambiguity, we define the block norm \(\left\| x \right\| _{\mathrm {block}}{\mathop {=}\limits ^{\text {def.}}}\max _{1\leqslant i\leqslant s} \left\| x_i \right\| \). Given a vector \(x\in {\mathbb {C}}^{s(d+1)}\) decomposed as \(x = [x_0^\top , X_1^\top , \ldots , X_s^\top ]^\top \) where \(x_0 \in {\mathbb {C}}^s\) and \(X_j\in {\mathbb {C}}^d\), we define \( \left\| x \right\| _{\mathrm {Block}} {\mathop {=}\limits ^{\text {def.}}}\max \left( \left\| x_0 \right\| _\infty , \max _{j=1}^s\left\| X_j \right\| _2 \right) \).
For a complex number a, its sign is denoted by \({{\,\mathrm{sign}\,}}(a) = \frac{a}{\left|a\right|}\). Given a complex-valued measure \(\mu \in {\mathcal {M}}({\mathcal {X}})\) and complex-valued continuous function \(f \in {\mathscr {C}}^{}({\mathcal {X}})\), we recall that \(\langle f,\,\mu \rangle _{\mathcal {M}} {\mathop {=}\limits ^{\text {def.}}}\int _{\mathcal {X}}\overline{f(x)} \mathrm {d} \mu (x)\). For two complex vectors v and w, \(\langle v,\,w\rangle _2 {\mathop {=}\limits ^{\text {def.}}}v^* w\), where \(v^* = {\overline{v}}^\top \) denotes conjugate transpose.
2 Examples
In this section, we illustrate Theorem 1 for some special cases of practical interest in imaging and machine learning. The following statements are obtained by bounding the constants in Theorem 3 in Sect. 5 (the detailed version of Theorem 1). These computations, which can be somewhat verbose, are deferred to Appendices C, D and E.
2.1 Off-the-Grid Compressed Sensing
Off-the-grid compressed sensing, initially introduced in the special case of 1-D Fourier measurements on \({\mathcal {X}}={\mathbb {T}}={\mathbb {R}}/{\mathbb {Z}}\) by [55], corresponds to Fourier measurements of the form (1). This is a “continuous” analogue of the celebrated compressed sensing line of works [17, 29]. We give a multi-dimensional version below.
Let \(f_c \in {\mathbb {N}}\) with \(f_c\geqslant 128\) (for simplicity) and \({\mathcal {X}}= {\mathbb {T}}^d\) the d-dimensional torus. Let \(\varphi _\omega (x) {\mathop {=}\limits ^{\text {def.}}}e^{\mathrm {i} 2\pi \omega ^\top x}\), \(\varOmega {\mathop {=}\limits ^{\text {def.}}} \left\{ \omega \in {\mathbb {Z}}^d \;;\; \left\| \omega \right\| _\infty \leqslant f_c \right\} \), and \(\varLambda (\omega ) = \prod _{j=1}^d g(\omega _j)\) where
The distribution \(\varLambda \) concentrates at lower frequencies, and the corresponding kernel is the Jackson kernel (the Dirichlet kernel raised to the power of 4): \( K(x,x') = \prod _{i=1}^d \kappa (x_i-x_i'), \) where
Note that if \(\varLambda \) is chosen to be the uniform distribution, then the corresponding kernel is the Dirichlet kernel. The choice of \(\varLambda \) here is purely technical: the Jackson kernel leads to an easier analysis due to its faster decay as \(\left\| x-x' \right\| \) increases and has been considered in many previous works such as [14, 55]. In this case, the Fisher metric is, up to a constant C, the Euclidean metric \({\mathfrak {d}}_{{\mathfrak {g}}}(x,x') = C f_c \left\| x-x' \right\| \). Provided that \(\min _{i\ne j}\left\| x_i-x_j \right\| \gtrsim \frac{d^{\frac{1}{2}} s^{\frac{1}{4}}}{f_c}\), stable recovery is guaranteed with
The bound on m directly extends the (univariate) main result of [55] to the multivariate setting, whilst removing the unrealistic assumption that the signs of the underlying amplitudes are i.i.d. uniformly distributed. Note that, compared to the unidimensional case in [55], the minimal separation \(\varDelta \) depends on s in general. However, as we explain in the appendix, when the dimension is such that \(d<4\), this bound can effectively be replaced by one that is exponential in d but does not depend on the sparsity s, and hence yields an extension of the result from [55]. Indeed, during the proof, one must bound a quantity of the form \(\sum _{i=2}^s \left\| x_1-x_i \right\| ^{-4}\), for \(\varDelta \)-separated Diracs. Since in one dimension only 2 Diracs can be situated at distance \(k\varDelta \) from \(x_1\) for each integer \(k>0\), this can be easily bounded by a global bound \(\varDelta ^{-4}\sum _{k=1}^\infty k^{-4}\) that does not depend on s. In the multidimensional case, however, \({\mathcal {O}}(j^{d})\) Diracs spaced \(\delta \) apart can be packed into the ball of radius \(j\delta \) around \(x_1\), and this can be handled by the polynomial decay of the kernel \(K(x,x')\) (which decays as \(\left\| x-x' \right\| ^{-4}\) when \(f_c\) is sufficiently large) only when \(d<4\).
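As a quick numerical sanity check of the remark above that a uniform \(\varLambda \) yields the Dirichlet kernel (a sketch of ours; the Jackson weights g are not reproduced here): the empirical average of \(\varphi _\omega (x)\overline{\varphi _\omega (x')}\) over sampled frequencies converges to the normalized Dirichlet kernel in \(\tau = x - x'\).

```python
import numpy as np

rng = np.random.default_rng(3)

# 1-D torus, frequencies drawn uniformly on {-f_c, ..., f_c}:
# K(x, x') = E[exp(2i*pi*omega*(x - x'))] is the normalized Dirichlet kernel.
f_c, m = 10, 200000
omega = rng.integers(-f_c, f_c + 1, size=m)
tau = 0.07
K_emp = np.mean(np.exp(2j * np.pi * omega * tau))            # Monte-Carlo estimate

n_freq = 2 * f_c + 1
K_dir = np.sin(n_freq * np.pi * tau) / (n_freq * np.sin(np.pi * tau))
```

The Monte-Carlo error is of order \(1/\sqrt{m}\), so the two values agree to a few thousandths here.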
2.2 Continuous Sampling Fourier Transform
For most imaging problems, imposing periodic boundary conditions on a square domain is not satisfying. Considering instead Fourier frequencies over the whole space \({\mathbb {R}}^d\) is more natural and can, for instance, cope with rotation-invariant sampling strategies, such as a Gaussian distribution. Let \({\mathcal {X}}\subset {\mathbb {R}}^d\) be a bounded open subset. The space of frequencies is \(\varOmega = {\mathbb {R}}^d\), \(\varphi _\omega (x) {\mathop {=}\limits ^{\text {def.}}}e^{\mathrm {i} \omega ^\top x}\), and \(\varLambda (\omega ) = {\mathcal {N}}(0,\varSigma ^{-1})\) for some known symmetric positive definite matrix \(\varSigma \). Note that, for simplicity, the frequencies are drawn according to a Gaussian with precision matrix \(\varSigma \) (the inverse of the covariance matrix), such that the kernel \(K\) is the classical Gaussian kernel \(K(x,x') = e^{-\frac{1}{2} \left\| \varSigma ^{-\frac{1}{2}}(x-x') \right\| ^2}\). The Fisher metric is \({\mathfrak {d}}_{{\mathfrak {g}}}(x,x') = \left\| \varSigma ^{-\frac{1}{2}}(x-x') \right\| \). In this case, provided that \(\min _{i\ne j}{\mathfrak {d}}_{{\mathfrak {g}}}(x_i,x_j) \gtrsim \sqrt{\log (s)}\), stable recovery is guaranteed with
where \(L = d + \log ^2\left( \frac{dm}{\rho } \right) \). Note that \(\log (m)\) appears in L in the r.h.s. of the expression above, which only incurs additional logarithmic terms in the bound on m, as mentioned in the introduction.
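Both claims above (the Gaussian limit kernel and the Fisher metric it induces) can be checked numerically; the sketch below is our own, with an arbitrary 2x2 covariance: the characteristic function of \(\omega \sim {\mathcal {N}}(0,\varSigma ^{-1})\) gives \(K\), and a mixed finite difference of \(K\) at \(x=x'\) recovers \({{\mathfrak {g}}}_x = \varSigma ^{-1}\).

```python
import numpy as np

rng = np.random.default_rng(4)

# Arbitrary covariance Sigma (illustrative choice) and its inverse P = Sigma^{-1},
# which is both the frequency covariance and the expected metric tensor.
S = np.array([[1.5, 0.3], [0.3, 0.8]])
P = np.linalg.inv(S)

# (a) Monte-Carlo: E_{w ~ N(0, P)}[exp(i w^T tau)] = exp(-0.5 * tau^T P tau).
m = 200000
W = rng.multivariate_normal(np.zeros(2), P, size=m)
tau = np.array([0.4, -0.2])
K_emp = np.mean(np.exp(1j * W @ tau))
K = lambda x, xp: np.exp(-0.5 * (x - xp) @ P @ (x - xp))
K_lim = K(tau, np.zeros(2))

# (b) Metric tensor g_x = grad_1 grad_2 K(x, x) via central mixed differences;
# for this translation-invariant kernel it equals P at every x.
def metric_tensor(x, h=1e-4):
    d = len(x); g = np.zeros((d, d)); I = np.eye(d)
    for i in range(d):
        for j in range(d):
            g[i, j] = (K(x + h*I[i], x + h*I[j]) - K(x + h*I[i], x - h*I[j])
                       - K(x - h*I[i], x + h*I[j]) + K(x - h*I[i], x - h*I[j])) / (4*h*h)
    return g

g = metric_tensor(np.array([0.3, -0.7]))
```

The finite-difference tensor matches \(\varSigma ^{-1}\) to within the \({\mathcal {O}}(h^2)\) discretization error.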
2.3 Learning of Gaussian Mixtures with Fixed Covariances
An original framework for continuous sparsity is sketched learning of mixture models [37], and in particular Gaussian mixture models (GMM), for which we can exploit the computations of the previous case of Fourier measurements sampled in accordance with a Gaussian distribution. Assume that we have data samples \(z_1,\ldots ,z_n \in {\mathbb {R}}^d\) drawn i.i.d. from a mixture of Gaussians \(\xi {\mathop {=}\limits ^{\text {def.}}}\sum _{i=1}^s a_i {\mathcal {N}}(x_i, \varSigma )\) with known covariance \(\varSigma \). The means \(x_1,\ldots ,x_s \in {\mathcal {X}}\subset {\mathbb {R}}^d\) and weights \(a_1,\ldots ,a_s>0\) are the objects that we want to estimate. We then sample frequencies \(\omega _1,\ldots ,\omega _m \in {\mathbb {R}}^d\) i.i.d. from a Gaussian \(\varLambda = {\mathcal {N}}(0, \varSigma ^{-1}/d)\) and construct the following linear sketch [22] of the data:
where the constant \(C = (1+\frac{2}{d})^{\frac{d}{4}} \leqslant e^\frac{1}{2}\) is here for normalization purposes. Linear sketches are mainly used for computational gain: they are easy to compute in a streaming or distributed context and are much smaller to store in memory than the whole database [22, 37]. It is easy to see that the sketch can be reformulated as (1), by writing
where \(\mu _0 = \sum _i a_i \delta _{x_i}\), and \(\varPhi \) is defined using the feature functions
The “noise” \(w{\mathop {=}\limits ^{\text {def.}}}y - {\mathbb {E}}_z (Ce^{-\mathrm {i} \omega _k^\top z})_{k=1}^m\) is simply the difference between empirical and true expectations, and using simple concentration inequalities that we skip here for simplicity, it is possible to show that with high probability, \(\left\| w \right\| \leqslant {\mathcal {O}}\left( n^{-\frac{1}{2}} \right) \). Applying the previous computations we obtain the following result: provided that \(\min _{i\ne j} \left\| \varSigma ^{-\frac{1}{2}}(x_i-x_j) \right\| _2 \gtrsim \sqrt{d \log (s)}\), stable recovery of \(\mu _0\) is guaranteed when
and the concentration in the recovery bound (4) is given by \(\delta =\left\| w \right\| ={\mathcal {O}}\left( n^{-\frac{1}{2}} \right) \).
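For illustration, the sketching pipeline above can be reproduced numerically. The following Python sketch is our own toy instance (with \(\varSigma = \mathrm {Id}\)); we assume the sketch is the empirical average \(y_k = \frac{C}{n}\sum _j e^{-\mathrm {i} \omega _k^\top z_j}\), any \(1/\sqrt{m}\) normalization being absorbed into \(\varPhi \). It draws samples from a mixture, computes the sketch, and compares it with its expectation, confirming a noise level of order \(n^{-1/2}\).

```python
import numpy as np

rng = np.random.default_rng(0)
d, s, n, m = 2, 3, 50_000, 40

# Ground-truth mixture with known covariance Sigma = Id: means x_i, weights a_i.
means = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
weights = np.array([0.5, 0.3, 0.2])
labels = rng.choice(s, size=n, p=weights)
z = means[labels] + rng.standard_normal((n, d))     # i.i.d. samples from the GMM

# Frequencies omega_k drawn i.i.d. from N(0, Sigma^{-1}/d) = N(0, Id/d) here.
omega = rng.standard_normal((m, d)) / np.sqrt(d)

C = (1 + 2 / d) ** (d / 4)                          # normalization constant
y = C * np.exp(-1j * z @ omega.T).mean(axis=0)      # empirical sketch, size m

# Noiseless sketch: E[C e^{-i w^T z}] = C sum_i a_i e^{-i w^T x_i - |w|^2/2}.
y_true = C * np.exp(-0.5 * (omega**2).sum(axis=1)) * (
    np.exp(-1j * omega @ means.T) @ weights)
noise = np.linalg.norm(y - y_true)
print(noise)                                        # O(n^{-1/2})
```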
2.4 Gaussian Mixtures with Varying Covariances
The case of simultaneously recovering both the means and covariance matrices is an interesting avenue for future research. We simply describe here the associated metric and distance in the univariate case. The geodesic distance between univariate Gaussian distributions is well known [23]: Given \(x=(m,u)\) and \( x'=(n,v)\) with \(m,n\in {\mathbb {R}}\) and \(u,v\in {\mathbb {R}}_+\), let \(\varphi (x) {\mathop {=}\limits ^{\text {def.}}}\frac{1}{\root 4 \of {\pi } \sqrt{u}} e^{-(m-\cdot )^2/(2u^2)}\), then the covariance kernel is
The associated metric at \(x=(m,u)\) is \({{\mathfrak {g}}}_x=\frac{1}{2u^2} \mathrm {Id}_2\), and the Fisher–Rao distance is the Poincaré half-plane distance
Consider now the case of Gaussian mixture \(\xi {\mathop {=}\limits ^{\text {def.}}}\sum _{i=1}^s a_i {\mathcal {N}}(x_i, v_i^2)\), where the unknowns are \(a_i>0\), \(x_i\in {\mathbb {R}}\) and \(v_i>0\), and we are given data \( \left\{ z_i \right\} _{i=1}^n\) drawn i.i.d. from \(\xi \) and we construct the linear sketch (5) as before, where \(\omega _k \in {\mathbb {R}}\) are i.i.d. from \({\mathcal {N}}(0,\sigma ^2)\). This corresponds to the normalized random features
and
where \(u_\sigma ^2 = \frac{1}{2\sigma ^2}+u^2\) and \(v_\sigma ^2 = \frac{1}{2\sigma ^2}+v^2\). The metric at \(x=(m,u)\) is \({{\mathfrak {g}}}_x=\frac{1}{2u_\sigma ^2} \mathrm {Id}_2\). Note that since (8) also corresponds to the kernel between Gaussian distributions with means and standard deviations \(x_\sigma {\mathop {=}\limits ^{\text {def.}}}(m,u_\sigma )\) and \(x_\sigma '{\mathop {=}\limits ^{\text {def.}}}(n,v_\sigma )\), the associated geodesic distance is \({\mathfrak {d}}_0(x_\sigma ,x_\sigma ')\), where \({\mathfrak {d}}_0\) is the Poincaré half-plane distance described in (7). (As mentioned in (16), geodesic distances on random features and parameter space are equivalent.)
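As a concrete illustration, the Poincaré half-plane distance can be evaluated in closed form. A common normalization of the Fisher–Rao distance between \({\mathcal {N}}(m,u^2)\) and \({\mathcal {N}}(n,v^2)\) is \(\sqrt{2}\) times the hyperbolic half-plane distance between \((m/\sqrt{2}, u)\) and \((n/\sqrt{2}, v)\); the constant in (7) may differ depending on the chosen scaling of the metric.

```python
import numpy as np

def halfplane_dist(p, q):
    """Hyperbolic distance on the Poincare upper half-plane."""
    (a, b), (c, e) = p, q
    return np.arccosh(1 + ((a - c) ** 2 + (b - e) ** 2) / (2 * b * e))

def fisher_rao_gaussian(m, u, n, v):
    """Fisher-Rao distance between N(m, u^2) and N(n, v^2): sqrt(2) times the
    half-plane distance between (m/sqrt(2), u) and (n/sqrt(2), v)."""
    r2 = np.sqrt(2)
    return r2 * halfplane_dist((m / r2, u), (n / r2, v))

# Same mean: the distance reduces to sqrt(2) * |log(v/u)|.
print(fisher_rao_gaussian(0.0, 1.0, 0.0, 2.0), np.sqrt(2) * np.log(2))
```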
2.5 Sampling the Laplace Transform
In some fluorescence microscopy applications (see [28] and the references therein), depth measurements are obtained from the Laplace transform of the signal. Contrary to Fourier measurements, this gives rise to a non-translation-invariant kernel \(K\) and was therefore not covered by existing theory. Using the proposed Riemannian geometry framework, we can cover this setting.
Let \({\mathcal {X}}= (0,1)^d \subset {\mathbb {R}}^d_+\). Let \(\varOmega = {\mathbb {R}}_+^d\). Define for \(x\in {\mathcal {X}}\) and \(\omega \in \varOmega \),
where \(\alpha _i \sim d\) are positive and distinct real numbers. The sampling of \(\omega \) here typically corresponds to observations at random discrete time-points.
The Fisher metric is
and provided that \(\min _{i\ne j}{\mathfrak {d}}_{{\mathfrak {g}}}(x_i,x_j) \gtrsim d + \log (d^{3/2}s)\), stable recovery is guaranteed with
where \(C {\mathop {=}\limits ^{\text {def.}}}d^2 \left( d + \log ^2(m) + \log ^2\left( \frac{d}{\rho } \right) \right) \). As in the Gaussian example, \(\log (m)\) appears in C.
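To make the non-translation-invariance concrete, consider a one-dimensional toy model (our own normalization, not necessarily the one used above): take \(\omega \sim \mathrm {Exp}(\alpha )\) and \(\varphi _\omega (x) = \sqrt{(\alpha +2x)/\alpha }\, e^{-\omega x}\), so that \({\mathbb {E}}_\omega \left|\varphi _\omega (x)\right|^2 = 1\) and \(K(x,x') = \sqrt{(\alpha +2x)(\alpha +2x')}/(\alpha +x+x')\), which depends on \((x,x')\) and not only on \(x-x'\).

```python
import numpy as np

rng = np.random.default_rng(1)
alpha = 1.0
omega = rng.exponential(scale=1 / alpha, size=200_000)   # omega ~ Exp(alpha)

def phi(w, x):
    # Laplace feature normalized so that E_w |phi_w(x)|^2 = 1.
    return np.sqrt((alpha + 2 * x) / alpha) * np.exp(-w * x)

def K(x, xp):
    # Closed-form limit covariance kernel; note K(x, x') != k(x - x').
    return np.sqrt((alpha + 2 * x) * (alpha + 2 * xp)) / (alpha + x + xp)

x, xp = 0.3, 0.8
K_hat = np.mean(phi(omega, x) * phi(omega, xp))
print(K_hat, K(x, xp))               # Monte Carlo estimate vs closed form
print(np.mean(phi(omega, x) ** 2))   # ~ 1: normalization check
```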
3 Stability and the Fisher Information Metric
In this section, we introduce the proposed Riemannian geometry framework and give intermediate recovery guarantees which constitute the first building block of our main result. Namely, we introduce so-called dual certificates, which are known to be key objects in the study of the BLASSO, and show how they lead to sparse recovery guarantees in our Riemannian framework.
3.1 Fisher and Optimal Transport Distances
Let us first introduce the proposed Riemannian geometry framework and define objects related to it.
3.1.1 The Covariance Kernel and the Fubini–Study Metric
A natural property to analyze in our problem is the way two Diracs interact with each other, which is linked to the well-known notion of coherence (or, rather, incoherence) between measurements in compressive sensing [36]. This is done through what we refer to as the covariance kernel \({\hat{K}}:{\mathcal {X}}\times {\mathcal {X}}\rightarrow {\mathbb {C}}\), defined as
In the limit case \(m \rightarrow \infty \), the law of large numbers states that \({\hat{K}}\) converges almost surely to the limit covariance kernel:
where we recall that \(\omega \sim \varLambda \). This object naturally governs the geometry of the space, and we use it to define our Riemannian metric, which as we will see is linked to a notion of Fisher information metric. In the rest of the paper, we assume throughout that \(K\) is real-valued, even though \({\hat{K}}\) may be complex-valued.
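This convergence is easy to observe numerically. For instance, for one-dimensional Fourier features \(\varphi _\omega (x) = e^{\mathrm {i}\omega x}\) with \(\omega \sim {\mathcal {N}}(0,1)\), the limit kernel is the Gaussian \(K(x,x') = e^{-(x-x')^2/2}\), and the empirical kernel \({\hat{K}}\) approaches it at the Monte Carlo rate \(m^{-1/2}\) (a toy illustration of ours):

```python
import numpy as np

rng = np.random.default_rng(2)
x, xp = 0.2, 1.1

def K_hat(m):
    w = rng.standard_normal(m)       # frequencies omega_k ~ N(0, 1)
    # hat K(x, x') = (1/m) sum_k conj(phi_{w_k}(x)) phi_{w_k}(x')
    return np.mean(np.exp(1j * w * (xp - x)))

K_limit = np.exp(-0.5 * (x - xp) ** 2)    # limit kernel E e^{i w (x'-x)}
for m in (100, 10_000, 1_000_000):
    print(m, abs(K_hat(m) - K_limit))     # error decays like m^{-1/2}
```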
Given the normalization \({\mathbb {E}}_\omega \left|\varphi _\omega (x)\right|^2 = 1\) for all \(x\in {\mathcal {X}}\), \(\varphi _\omega (x)\) can be interpreted as a complex-valued probability amplitude with respect to \(\omega \) (parameterized by x), a classical notion in quantum mechanics (see [38]). When x varies, a natural metric between probability amplitudes is the so-called Fubini–Study metric, which is the complex equivalent of the well-known Fisher information metric. Writing \(\varphi _\omega (x) = \sqrt{p(\omega ,x)} e^{\mathrm {i}\alpha (\omega ,x)}\) where \(p(\omega ,x) {\mathop {=}\limits ^{\text {def.}}}\left|\varphi _\omega (x)\right|^2 \) and \(\alpha (\omega ,x) {\mathop {=}\limits ^{\text {def.}}}\mathrm {arg}(\varphi _\omega (x))\), the Fubini–Study metric is defined by the following metric tensor in \({\mathbb {C}}^{d\times d}\) [34]:
where we use the notation \({\mathbb {E}}_p[f] = \int f(\omega ) p(\omega , x) \mathrm {d}\varLambda (\omega )\). If \(\varphi _\omega \) is real-valued, then \(\alpha = 0\) and this is indeed the Fisher metric up to a factor of \(\frac{1}{4}\). The following simple lemma shows the link between this metric and the derivatives of the covariance kernel \(K\).
Lemma 1
For any kernel \(K(x,x') {\mathop {=}\limits ^{\text {def.}}}{\mathbb {E}}_\omega \overline{\varphi _\omega (x)}\varphi _\omega (x')\), the Fubini–Study metric defined in (11) satisfies
If furthermore \(K(x,x')\) is assumed real-valued, then \({\mathbb {E}}_p[\nabla _x \alpha ] = 0\), and \({{\mathfrak {g}}}_x = \nabla _1 \nabla _2 K(x,x)\).
Proof
Using \(p = \left|\varphi _\omega \right|^2\) and \(\nabla \varphi _\omega = \left( \frac{\nabla p}{2p} + i \nabla \alpha \right) \varphi _\omega \), a direct computation shows that
Therefore,
Similarly,
which proves the first claim. The second claim is immediate by noticing from (13) that \({\mathbb {E}}_p[\nabla _x \alpha ] = {{\,\mathrm{Im}\,}}\left( \nabla _2 K(x,x) \right) \), which vanishes when \(K(x,x')\) is real (in particular in a neighborhood around \(x=x'\)). \(\square \)
Since in this paper the limit covariance kernel (10) is assumed real-valued, the previous lemma justifies the definition \({{\mathfrak {g}}}_x = \nabla _1\nabla _2 K(x,x)\) that we adopt in the rest of the paper. For two vectors \(u,v \in {\mathbb {C}}^d\), we define the corresponding inner product
As described in the introduction, this induces a geodesic distance on \({\mathcal {X}}\):
and in the case where \(\varphi _\omega (x)\) is real-valued, this coincides with the “Fisher-Rao” geodesic distance [47] which is used extensively in information geometry for estimation and learning problems on parametric families of distributions [3].
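As a sanity check of the identity \({{\mathfrak {g}}}_x = \nabla _1\nabla _2 K(x,x)\) from Lemma 1, one can compare a finite-difference evaluation of \(\nabla _1\nabla _2 K\) with the known Fisher metric of a toy model (our own example): for \(\varphi _\omega (x)=e^{\mathrm {i}\omega x}\) with \(\omega \sim {\mathcal {N}}(0,\sigma ^2)\), the limit kernel is \(e^{-\sigma ^2(x-x')^2/2}\), \({{\mathfrak {g}}}_x = \sigma ^2 = {\mathbb {E}}[\omega ^2]\), and hence \({\mathfrak {d}}_{{\mathfrak {g}}}(x,x') = \sigma \left|x-x'\right|\) is a rescaled Euclidean distance.

```python
import numpy as np

sigma = 1.5
def K(x, xp):
    # Limit kernel of the features phi_w(x) = e^{iwx}, w ~ N(0, sigma^2)
    return np.exp(-0.5 * sigma**2 * (x - xp) ** 2)

def metric(x, h=1e-4):
    # g_x = (d/dx)(d/dx') K(x, x') at x' = x, by central finite differences
    return (K(x + h, x + h) - K(x + h, x - h)
            - K(x - h, x + h) + K(x - h, x - h)) / (4 * h * h)

print(metric(0.7), sigma**2)   # both ~ 2.25, i.e. g = E[w^2]
```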
Remark 1
(As a distance on the feature space) The geodesic distance induced by \({{\mathfrak {g}}}\) is the natural distance between the random features \(\varphi _{\cdot }(x)\). Indeed, as discussed in [11], the manifold \(({\mathcal {X}}, {{\mathfrak {g}}})\) is an embedded sub-manifold of the sphere in the Hilbert space \(L_2(\mathrm {d}\varLambda )\) with embedding \(x\mapsto \varphi _{\cdot }(x)\), and given any \(x,x'\in {\mathcal {X}}\), we have
where \(\varGamma _{x,x'}\) consists of all piecewise smooth paths \(\gamma : [0,1]\rightarrow \left\{ \varphi (x) \;;\; x\in {\mathcal {X}} \right\} \) with \(\gamma (0) = \varphi (x)\) and \(\gamma (1) = \varphi (x')\).
Remark 2
(Fisher metric and invariances) The Fisher–Rao metric \({\mathfrak {d}}_{{\mathfrak {g}}}\) is “canonical” in the sense that it is the only (up to scalar multiples) geodesic distance which satisfies the natural invariances of the BLASSO problem. Indeed, the solutions to (\({\mathcal {P}}_\lambda (y)\)), in the large sample limit \(m \rightarrow +\infty \), are (i) invariant under the multiplication of \(\varphi (x) {\mathop {=}\limits ^{\text {def.}}}(\varphi _\omega (x))_{\omega \in \varOmega }\) by an arbitrary orthogonal transform U (orthogonality on \(L_2(\mathrm {d}\varLambda )\)), i.e., invariant to \(\varphi (x) \mapsto U \varphi (x)\), and (ii) covariant under any change of variable \(\varphi \mapsto \varphi \circ h\) where h is a diffeomorphism between two d-dimensional parameter spaces. The covariance (ii) means that if \(\mu = \sum _i a_i \delta _{x_i}\) is a solution associated with \(\varphi \), then the push-forward measure \((h^{-1})_\sharp \mu {\mathop {=}\limits ^{\text {def.}}}\sum _i a_i \delta _{h^{-1}(x_i)}\) is a solution associated with \(\varphi \circ h\). Note that the invariance (i) is different from the usual invariance under “Markov morphisms” considered in information theory [13, 18]. When considering \({\mathfrak {d}}_{{{\mathfrak {g}}}} = {\mathfrak {d}}_{{{\mathfrak {g}}}_\varphi }\) as a Riemannian distance depending solely on \(\varphi \), the invariance under any diffeomorphism h reads
Assuming for simplicity that \(\varphi \) is injective, this invariance (17) is equivalent to the fact that the formula
defines a proper (i.e., parameterization-independent) Riemannian distance \(d_{\mathcal {M}}\) on the embedded manifold \({\mathcal {M}}{\mathop {=}\limits ^{\text {def.}}}( \varphi (x) )_x \subset L_2(\mathrm {d}\varLambda )\). Among all possible such Riemannian metrics on \({\mathcal {M}}\), the only ones being invariant by orthogonal transforms \(\varphi \mapsto U \varphi \) are scalar multiples of the Hermitian positive tensor \(\partial \varphi (x)^* \partial \varphi (x) \in {\mathbb {C}}^{d \times d}\), which is equal to \({{\mathfrak {g}}}_\varphi \) (here \(\partial \varphi (x)^*\) refers to the adjoint in \(L_2(\mathrm {d}\varLambda )\) for the inner product defined by the measure \(\varLambda (\omega )\)).
Remark 3
(Tangent spaces) Formally, in Riemannian geometry, one would use the notion of tangent space \({\mathcal {T}}_x\), and, for instance, the inner product \(\langle \cdot ,\,\cdot \rangle _x\) would only be defined between vectors belonging to \({\mathcal {T}}_x\). However, in our case, since the considered ambient “manifold” is just \({\mathbb {R}}^d\), in the sense that \({\mathcal {X}}\) is not a low-dimensional sub-manifold of \({\mathbb {R}}^d\) but an open set of \({\mathbb {R}}^d\), each tangent space can be identified with \({\mathbb {R}}^d\), and we extend the definitions to complex vectors for our needs.
3.1.2 Optimal Transport Metric
In order to state quantitative performance bounds, one needs to consider a geometric distance between measures. The canonical way to “lift” a ground distance \({\mathfrak {d}}_{{\mathfrak {g}}}(x,x')\) between parameters to a distance between measures is to use optimal transport distances [49].
Definition 1
(Wasserstein distance) Given \(\mu ,\nu \in {\mathcal {M}}_+({\mathcal {X}})\) with \(\left|\mu \right|({\mathcal {X}}) = \left|\nu \right|({\mathcal {X}})\), the Wasserstein distance between \(\mu \) and \(\nu \) relative to the metric \({\mathfrak {d}}\) on \({\mathcal {X}}\) is defined by
where \(\varGamma (\mu ,\nu ) \subset {\mathcal {M}}_+({\mathcal {X}}^2)\) is the set of all transport plans with marginals \(\mu \) and \(\nu \). Given \(\mu ,\nu \in {\mathcal {M}}_+({\mathcal {X}})\) (not necessarily of equal total mass), the optimal partial transport distance between \(\mu \) and \(\nu \) is defined as
Note that the distance \(W_{\mathfrak {d}}(\mu ,\nu )\) is actually a hybridization (an inf-convolution) between the classical Wasserstein distance between probability distributions and the total variation norm. It is often called “partial optimal transport” in the literature (see, for instance, [12]) and belongs to the larger class of unbalanced optimal transport distances [21, 42].
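In the special balanced case (equal total masses, uniform weights, \({\mathcal {X}}\subset {\mathbb {R}}\) with the Euclidean ground metric), the Wasserstein distance of Definition 1 has a simple closed form, since the monotone (sorted) coupling is optimal for the quadratic cost. The short Python sketch below implements this special case only; it does not implement the partial (unbalanced) variant.

```python
import numpy as np

def wasserstein2_1d(xs, ys):
    """W_2 between two uniform discrete measures (1/k) sum_i delta_{xs_i} and
    (1/k) sum_j delta_{ys_j} on the real line, Euclidean ground metric.
    In 1d the monotone (sorted) coupling is optimal for the quadratic cost."""
    xs, ys = np.sort(np.asarray(xs)), np.sort(np.asarray(ys))
    assert len(xs) == len(ys)
    return np.sqrt(np.mean((xs - ys) ** 2))

# Two spikes each moved by 0.1: W_2 = sqrt(0.5*0.01 + 0.5*0.01) = 0.1
print(wasserstein2_1d([0.0, 1.0], [1.1, 0.1]))   # 0.1
```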
3.2 Non-degenerate Certificates, Uniqueness and Stability for Sparse Measures
We now introduce the notion of a dual certificate and prove recovery guarantees under certain non-degeneracy conditions, which is the first step toward our main result.
3.2.1 Dual Certificates
The minimization problem (\({\mathcal {P}}_\lambda (y)\)) is a convex optimization problem, and a natural way of studying its solutions is via the corresponding Fenchel dual problem. It is well known that, in the limit as \(\lambda \rightarrow 0\), its solutions cluster in a weak-* sense around minimizers of
and that properties of the dual solutions to (\({\mathcal {P}}_0(y)\)) with \(y = \varPhi \mu _0\) can be used to derive stability estimates for (\({\mathcal {P}}_\lambda (y)\)) under noisy measurements. In this section, we recall some of these results (see [9, 30] for further details). The (pre)dual of (\({\mathcal {P}}_\lambda (y)\)) is
where we remark that the adjoint operator \(\varPhi ^*:{\mathbb {C}}^m \rightarrow {\mathscr {C}}^{}({\mathcal {X}})\) is defined by \((\varPhi ^* p)(x) = \frac{1}{\sqrt{m}}\sum _{i=1}^m p_i \varphi _{\omega _i}(x)\). Note that for \(\lambda >0\), this is the projection of \(y/\lambda \) onto the closed convex set \( \left\{ p \;;\; \left\| \varPhi ^* p \right\| _\infty \leqslant 1 \right\} \) and the solution \(p_\lambda \) is hence unique. The dual solution \(p_\lambda \) is related to any primal solution \(\mu _\lambda \) of (\({\mathcal {P}}_\lambda (y)\)) by the condition
Conversely, any pair \(p_\lambda \) and \(\mu _\lambda \) which satisfy this equation (18) are necessarily dual and primal solutions of (\({\mathcal {D}}_\lambda (y)\)) and (\({\mathcal {P}}_\lambda (y)\)), respectively. In the case where \(\lambda = 0\), a dual solution need not be unique, although existence is guaranteed (since in our setting, the dual variable belongs to a finite dimensional space). In this case, \(p_0\) and \(\mu _0\) solve (\({\mathcal {D}}_\lambda (y)\)) with \(\lambda =0\) and (\({\mathcal {P}}_0(y)\)), respectively, if and only if
Following the literature, we call any element \(\eta \in {{\,\mathrm{Im}\,}}(\varPhi ^*) \cap \partial \left|\mu _0\right|({\mathcal {X}})\) a dual certificate for \(\mu _0\). For \(\mu _0 = \sum _{j=1}^s a_j \delta _{x_j}\), the condition \(\eta \in \partial \left|\mu _0\right|({\mathcal {X}})\) imposes that \(\eta (x_j) = {{\,\mathrm{sign}\,}}(a_j)\) and \(\left\| \eta \right\| _\infty \leqslant 1\). Furthermore, it is known that in the noiseless case, \(\mu _0\) is the unique solution to (\({\mathcal {P}}_0(y)\)) if: the operator \(\varPhi _x: {\mathbb {C}}^s \rightarrow {\mathbb {C}}^m\) defined by \(\varPhi _x b = \sum _{j=1}^s b_j \varPhi \delta _{x_j}\) is injective, and there exists \(\eta \in {{\,\mathrm{Im}\,}}(\varPhi ^*) \cap \partial \left|\mu _0\right|({\mathcal {X}})\) such that \(\left|\eta (x)\right|<1\) for all \(x\not \in \{x_j\}\). In order to quantify the latter constraint and provide quantitative stability bounds, we impose even stronger conditions on \(\eta \) and make the following definition.
Definition 2
(Non-degenerate dual certificate) Given \((a_i,x_i)_{i=1}^s\), we say that \(\eta \in {{\,\mathrm{Im}\,}}(\varPhi ^*)\) is an \((\varepsilon _0,\varepsilon _2,r)\)-non-degenerate dual certificate if:
-
(i)
\(\eta (x_i) = {{\,\mathrm{sign}\,}}(a_i)\) for all \(i=1,\ldots ,s\),
-
(ii)
\(\left|\eta (x)\right| \leqslant 1- \varepsilon _0\) for all \(x\in {\mathcal {X}}^{\mathrm {far}}\),
-
(iii)
\(\left|\eta (x)\right| \leqslant 1- \varepsilon _2 {\mathfrak {d}}_{{\mathfrak {g}}}(x,x_i)^2\) for all \(x\in {\mathcal {X}}^{\mathrm {near}}_i\),
where \({\mathcal {X}}^{\mathrm {near}}_i {\mathop {=}\limits ^{\text {def.}}}{\mathcal {B}}_{{\mathfrak {d}}_{{\mathfrak {g}}}}(x_i;r)\) and \({\mathcal {X}}^{\mathrm {far}}{\mathop {=}\limits ^{\text {def.}}}{\mathcal {X}}\setminus \bigcup _i {\mathcal {X}}^{\mathrm {near}}_i\).
In other words, there are neighborhoods of the \(x_j\) such that, outside of these neighborhoods, \(\eta \) is strictly bounded away from 1, and inside, \(\left|\eta \right|\) has quadratic decay. In the next section, we prove stable recovery results from the existence of non-degenerate dual certificates.
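The conditions of Definition 2 are straightforward to check numerically on a grid for a candidate \(\eta \). The sketch below (our own toy instance, with a single positive spike and the Gaussian limit kernel, for which the geodesic distance is Euclidean) does exactly this; it is a sanity check on a finite grid, not a proof of non-degeneracy.

```python
import numpy as np

def check_nondegenerate(eta, spikes, signs, dist, eps0, eps2, r, grid):
    """Numerically verify Definition 2 for a candidate certificate eta
    on a finite grid (a sanity check, not a proof)."""
    ok_interp = all(abs(eta(x) - sg) < 1e-9 for x, sg in zip(spikes, signs))
    ok_near = ok_far = True
    for x in grid:
        d = min(dist(x, xj) for xj in spikes)
        if d <= r:   # inside some X_i^near: quadratic decay
            ok_near &= abs(eta(x)) <= 1 - eps2 * d**2 + 1e-12
        else:        # in X^far: bounded away from 1
            ok_far &= abs(eta(x)) <= 1 - eps0 + 1e-12
    return ok_interp, ok_near, ok_far

# Toy example: one positive spike at 0 with the Gaussian limit kernel.
eta = lambda x: np.exp(-0.5 * x**2)
grid = np.linspace(-5, 5, 2001)
print(check_nondegenerate(eta, [0.0], [1.0], lambda a, b: abs(a - b),
                          eps0=0.3, eps2=0.3, r=1.0, grid=grid))
```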
3.2.2 Stable Recovery Bounds
The following two propositions describe stability guarantees under the non-degeneracy condition. Proposition 1 quantifies how the recovered measure is approximated by a sparse measure supported on \(\{x_j\}_j\), and Proposition 2 describes the error in measure around small neighborhoods of the points \(\{x_j\}_j\).
Proposition 1
(Stability away from the sparse support) Suppose that there exists \(\varepsilon _0,\varepsilon _2>0\), \(\eta {\mathop {=}\limits ^{\text {def.}}}\varPhi ^* p\) for some \(p\in {\mathbb {C}}^m\) such that \(\eta \) is \((\varepsilon _0,\varepsilon _2,r)\)-non-degenerate.
Assuming the measurement model (1), any minimizer \({\hat{\mu }}\) of (\({\mathcal {P}}_\lambda (y)\)), with \(\left\| w \right\| \leqslant \delta \) and \(\lambda \sim \delta /\left\| p \right\| \) is approximately sparse: by defining \({\hat{A}}_j = \left|{\hat{\mu }}\right|({\mathcal {X}}^{\mathrm {near}}_j)\), we have
Proof
To prove this proposition, we first establish the following bound
As we will see, the optimal partial transport bound above is then a consequence of this bound.
For \(i=1,\ldots ,s\), let \({\mathcal {X}}^{\mathrm {near}}_i\subset {\mathcal {X}}\) and \({\mathcal {X}}^{\mathrm {far}}= {\mathcal {X}}\setminus \bigcup _{j=1}^s {\mathcal {X}}^{\mathrm {near}}_j\) be as in Definition 2. Recall the measurement model \(y = \varPhi (\mu _0 + {\tilde{\mu }}_0) +w\), and define \({\bar{\mu }}_0 = \mu _0 + {\tilde{\mu }}_0\) for simplicity. We first adapt the proof of [10, Thm. 2] to derive an upper bound for \(\left|{\hat{\mu }}\right| - \left|{\bar{\mu }}_0\right| - \mathrm {Re}\left( \langle \eta ,\,{\hat{\mu }}-{\bar{\mu }}_0\rangle _{\mathcal {M}}\right) \). By minimality of \({\hat{\mu }}\) and since \(\left\| w \right\| \leqslant \delta \),
Using \(\eta = \varPhi ^* p\), and by adding and subtracting \(\mathrm {Re}\left( \langle \eta ,\,{\hat{\mu }} - {\bar{\mu }}_0\rangle _{\mathcal {M}}\right) = \mathrm {Re}\left( \langle p,\,\varPhi {\hat{\mu }} - y\rangle _2\right) + \mathrm {Re}\left( \langle p,\,w\rangle _2\right) \), we obtain
using \(\lambda \sim \delta / \left\| p \right\| \). We now derive a lower bound for \(\left|{\hat{\mu }}\right| - \left|\mu _0\right| - \mathrm {Re}\left( \langle \eta ,\,{\hat{\mu }}-{\bar{\mu }}_0\rangle _{\mathcal {M}}\right) \). Since \(\eta \) is a dual certificate, we have \(\langle \eta ,\,{\bar{\mu }}_0\rangle _{\mathcal {M}} = \left|\mu _0\right|({\mathcal {X}})\) and \(\left|\langle \eta ,\,\mu \rangle _{\mathcal {M}}\right| \leqslant \left|\mu \right|({\mathcal {X}})\). By further exploiting the non-degeneracy assumptions (ii) and (iii) on \(\eta \), we have
which proves (21). Note also that by combining this with (22), we obtain the following bound that we will use later:
It remains to show that the bound (21) yields an upper bound on the partial optimal transport distance between the recovered measure \(\left|{\hat{\mu }}\right|\) and \(\rho {\mathop {=}\limits ^{\text {def.}}}\sum _i \left|{\hat{\mu }}\right|({\mathcal {X}}^{\mathrm {near}}_i) \delta _{x_i}\), its “projection” onto the positions \(\{x_j\}_j\). To see this, first note that the Kantorovich dual formulation [49] of the Wasserstein distance in Def. 1 is
Given any \(\varphi , \psi \in C_b({\mathcal {X}})\) satisfying \(\varphi (x)+\psi (y) \leqslant {{\mathfrak {d}}_{{\mathfrak {g}}}}(x,y)^2\) for all \(x,y\in {\mathcal {X}}\), we have
So,
Thus, since \(\varepsilon _0\left|{\hat{\mu }}\right|_{{\mathcal {X}}^{\mathrm {far}}}({\mathcal {X}}) \lesssim \left|{\tilde{\mu }}_0\right|({\mathcal {X}})+ \delta \left\| p \right\| \), we have
\(\square \)
We now give stability bounds around the sparse support, under some additional assumptions.
Proposition 2
(Stability around the sparse support) Under the assumptions of Proposition 1, let \({\hat{\mu }}\) be a solution of (\({\mathcal {P}}_\lambda (y)\)), and let \({\hat{a}} = ({\hat{\mu }}({\mathcal {X}}^{\mathrm {near}}_j))_{j=1}^s\). Suppose in addition that for \(j=1,\ldots , s\), there exists \(\eta _j = \varPhi ^* p_j\) which satisfies
-
(i)
\(\eta _j(x_j) = 1\) and \(\eta _j(x_\ell ) = 0\) for all \(\ell \ne j\)
-
(ii)
\(\left|1-\eta _j(x)\right| \leqslant \varepsilon _2 {{\mathfrak {d}}_{{\mathfrak {g}}}}(x,x_j)^2\) for all \(x\in {\mathcal {X}}^{\mathrm {near}}_j\),
-
(iii)
\(\left|\eta _j(x)\right| \leqslant \varepsilon _2 {{\mathfrak {d}}_{{\mathfrak {g}}}}(x,x_\ell )^2\) for all \(x\in {\mathcal {X}}^{\mathrm {near}}_\ell \) and \(\ell \ne j\),
-
(iv)
\(\left|\eta _j(x)\right| \leqslant 1-\varepsilon _0\) for all \(x\in {\mathcal {X}}^{\mathrm {far}}\).
Then,
where p is as in Proposition 1.
Proof
First observe that writing \(\nu = {\hat{\mu }}-\mu _0\), we have
Using (21), we have \( \left|\nu \right|({\mathcal {X}}^{\mathrm {far}}) = \left|{\hat{\mu }}\right|({\mathcal {X}}^{\mathrm {far}}) \lesssim \varepsilon _0^{-1}\left( \delta \left\| p \right\| + \left|{\tilde{\mu }}_0\right|({\mathcal {X}}) \right) \) and
Finally, by (23),
using \(\sqrt{ab}\leqslant (a+b)/2\). Therefore, we obtain
\(\square \)
Additional Certificates Proposition 2 assumes the construction of additional functions \(\eta _j \in {{\,\mathrm{Im}\,}}(\varPhi ^*)\), which are essentially similar to non-degenerate certificates but with all “signs” to interpolate set to 0 except for one. As we will see, they are even simpler to construct than \(\eta \): indeed, the reason one has to resort to the random signs assumption (as in [55]) or to the golfing scheme (as in this paper) is that the Euclidean norm of the vector of signs \(({{\,\mathrm{sign}\,}}(a_i))_{i=1}^s\) appears in the proof, which results in a spurious term \(\sqrt{s}\). When constructing the \(\eta _j\), this problem does not occur, since only one sign is nonzero.
Relation to Previous Works Note that (21) and (24), without the inexact sparsity term \( \left|{\tilde{\mu }}_0\right|({\mathcal {X}})\), were previously presented in [35] in the context of sampling Fourier coefficients and in a more general setting in [4]. However, the statement in [4] is given in terms of orthonormal systems and of the so-called Bernstein Isolation Property, which imposes that \(\left|P'(x)\right| \leqslant C m^2 \left\| P \right\| _\infty \) for all \(P\in {{\,\mathrm{Im}\,}}(\varPhi ^*)\). These conditions can be difficult to check in our setting of random sampling and were imposed only to ensure the existence of non-degenerate dual certificates, and to have explicit control on the constant C. For completeness, we still present the proof of (21) under non-degeneracy assumptions, and we later establish that these non-degeneracy assumptions hold, under appropriate separation conditions imposed via \({\mathfrak {d}}_{{\mathfrak {g}}}\).
In [14], one could also obtain bounds \(\sum _{j=1}^s \left|{\hat{a}}_j - a_j\right| \lesssim \delta \) in the case of Fourier sampling; however, to prove such a statement, one is required to construct a trigonometric function (a dual certificate) which interpolates arbitrary sign patterns. In the case of sub-sampling, such an approach cannot lead to sharp dependency on s, since in the real setting, one is then required to show the existence of \(2^s\) random polynomials corresponding to all possible sign patterns. We therefore settle for the bound (24) in this paper. We remark that being able to construct dual functions which interpolate arbitrary sign patterns leads to Wasserstein-1 error bounds, as opposed to the Wasserstein-2 error bounds presented here.
Finally, we mention the more recent work of [32] which presents stability bounds for the sparse spikes problem where one restricts to positive measures and where the sampling functions form a T-system. Under a positivity constraint (rather than total variation penalization), they derive stability bounds in terms of optimal partial transport distances. We stress that since we consider more general measurement operators than T-systems in this work, we consider transport distances under the Fisher metric as opposed to the Euclidean metric. Moreover, another difference is that our error bounds use the Wasserstein-2 distance, whereas they use the Wasserstein-1 distance—the reason is that since they do not consider random sub-sampling, their proofs in fact follow the work of [14] to construct dual certificates which interpolate arbitrary sign patterns.
4 Non-degenerate Limit Certificates
In this section, we provide the second building block of our main theorem: a generic way to ensure the existence of, and to construct, non-degenerate dual certificates when \(m\rightarrow \infty \) and the sought-after Diracs satisfy a minimal separation condition with respect to the metric \({\mathfrak {d}}_{{\mathfrak {g}}}\).
4.1 Notions of Differential Geometry
We start with additional definitions in differential Riemannian geometry. All these notions can be found in the textbook [1], to which we refer the reader for further details. In many instances, we extend classical definitions to the complex case in a natural way.
4.1.1 Riemannian Gradient and Hessian
Let \(f: {\mathbb {R}}^d \rightarrow {\mathbb {C}}\) be a smooth function. The Riemannian gradient \(grad f(x) \in {\mathbb {C}}^d\) and Riemannian Hessian \(Hess f(x) : {\mathbb {C}}^d \rightarrow {\mathbb {C}}^d\), which is a linear mapping, can be defined as:
where \(\nabla , \partial _i\) are the classical Euclidean gradient and partial derivatives, and the \(\{e_i\}\) are the canonical basis of \({\mathbb {R}}^d\). The \(\varGamma _{ij}(x) = [\varGamma _{ij}^k(x)]_k \in {\mathbb {R}}^d\) are the Christoffel symbols, here equal to:
where \(g_{ij}(x) = [{{\mathfrak {g}}}_x]_{ij}\) and \(g^{ij}(x) = [{{\mathfrak {g}}}_x^{-1}]_{ij}\). Finally, we denote by \(H f(x)\in {\mathbb {C}}^{d\times d}\) the matrix that contains these terms: \(H f(x) {\mathop {=}\limits ^{\text {def.}}}\Big (\langle Hess f(x)[e_i],\,e_j\rangle _x\Big )_{ij}\).
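As an illustration, the Christoffel symbols of the metric \({{\mathfrak {g}}}_x=\frac{1}{2u^2} \mathrm {Id}_2\) from Sect. 2.4 can be computed from the formula above by finite differences; since a constant rescaling of the metric does not affect \(\varGamma _{ij}^k\), one recovers the classical half-plane values \(\varGamma ^1_{12} = -1/u\), \(\varGamma ^2_{11} = 1/u\), \(\varGamma ^2_{22} = -1/u\). The code below is a generic (toy) implementation of the formula.

```python
import numpy as np

def g(x):
    # Metric from Sect. 2.4 at x = (m, u): g_x = Id / (2 u^2)
    return np.eye(2) / (2 * x[1] ** 2)

def christoffel(x, h=1e-5):
    """Gamma^k_{ij} = 1/2 g^{kl} (d_i g_{jl} + d_j g_{il} - d_l g_{ij}),
    with partial derivatives by central finite differences."""
    d = len(x)
    dg = np.zeros((d, d, d))                  # dg[l, i, j] = d_l g_{ij}
    for l in range(d):
        e = np.zeros(d); e[l] = h
        dg[l] = (g(x + e) - g(x - e)) / (2 * h)
    ginv = np.linalg.inv(g(x))
    gamma = np.zeros((d, d, d))               # gamma[k, i, j]
    for k in range(d):
        for i in range(d):
            for j in range(d):
                gamma[k, i, j] = 0.5 * sum(
                    ginv[k, l] * (dg[i, j, l] + dg[j, i, l] - dg[l, i, j])
                    for l in range(d))
    return gamma

x = np.array([0.0, 2.0])                      # u = 2
G = christoffel(x)
print(G[0, 0, 1], G[1, 0, 0], G[1, 1, 1])     # -1/u, 1/u, -1/u
```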
For \(r=0,1,2\), the “covariant derivatives” \(\mathrm {D}_{r}\left[ f\right] (x): ({\mathbb {C}}^d)^r \rightarrow {\mathbb {C}}\) are mappings (or a scalar in the case \(r=0\)) defined as:
We define associated operator norms
where we recall that \(\left\| \cdot \right\| _x\) is defined by (14).
4.1.2 Covariant Derivatives of the Kernel
Recall the definition of the limit covariance kernel (10). Given \(0\leqslant i,j \leqslant 2\), let \(K^{(ij)}(x,x')\) be a “bi”-multilinear map, defined for \(Q\in ({\mathbb {C}}^d)^i\) and \(V\in ({\mathbb {C}}^d)^j\) as
In the case \(i,j \leqslant 1\), note that these admit simplified expressions: \(K^{(00)}(x,x') = K(x,x')\), \([v]K^{(10)}(x,x') = v^\top \nabla _1 K(x,x')\) and \([v]K^{(11)}(x,x')[v'] = v^\top \nabla _1 \nabla _2 K(x,x') \overline{v'}\). Define the operator norm of \(K^{(ij)}(x,x')\) as
where the supremum is over all \(V = [v_1,\ldots , v_i]\) with \(\left\| v_\ell \right\| _{x}\leqslant 1\) for all \(\ell \in [i]\), and all \(Q = [q_1,\ldots , q_j]\) with \(\left\| q_\ell \right\| _{x'}\leqslant 1\) for all \(\ell \in [j]\). We will sometimes overload the notation and write \(\left\| \cdot \right\| _{x}\) when the dependence is only on x, i.e., for \(K^{(ij)}\) where \(j = 0\). Note that, in particular,
All these definitions are naturally extended to the covariance kernel \({\hat{K}}\) by replacing the expectation \({\mathbb {E}}\) in (25) by an empirical expectation over \(\omega _1,\ldots , \omega _m\).
4.2 Non-Degenerate Dual Certificate with \(m \rightarrow \infty \)
Recall the definition of the covariance kernel (9). Following [15], a natural approach toward constructing a dual certificate is by interpolating the sign vector \({{\,\mathrm{sign}\,}}(a_j)\) using the functions \({\hat{K}}(x_j,\cdot )\) and \( {\hat{K}}^{(10)}(x_j,\cdot )\), since we have
Using the gradients of the kernel allows one to additionally impose that \(\nabla \eta (x_i) = 0\), which is a necessary (but not sufficient) condition for the dual certificate to reach its maximum amplitude at \(x_i\). Usual proofs then show that, under minimal separation, applying this strategy indeed yields a non-degenerate dual certificate.
We first consider the case where one has access to arbitrarily many measurements (\(m \rightarrow \infty \)), and to this end, we consider the limit covariance kernel \(K\) defined in (10). Let us introduce some handy notations that will be particularly useful in later proofs (Sect. 6). Our aim is to find coefficients \((\alpha _{1,j})_{j=1}^s \in {\mathbb {C}}^s\) and \((\alpha _{2,j})_{j=1}^s \in ({\mathbb {C}}^{d})^s\) such that
satisfies \( \eta (x_j) = {{\,\mathrm{sign}\,}}(a_j)\) and \( \nabla \eta (x_j) = 0\) for all \(j=1,\ldots ,s\). Note that these \(s(d+1)\) constraints can be written as the linear system
where \(\varUpsilon \in {\mathbb {R}}^{s(d+1) \times s(d+1)}\) is a real symmetric matrix defined as
with the vector \(\gamma (\omega )\in {\mathbb {C}}^{s(d+1)}\) defined as
Assuming that \(\varUpsilon \) is invertible, we can therefore rewrite (28) as \(\eta (x) = (\varUpsilon ^{-1} {\mathbf {u}}_s)^\top {\mathbf {f}}(x)\), where
We also define the block diagonal normalization matrix \(D_{{\mathfrak {g}}}\in {\mathbb {R}}^{s(d+1)\times s(d+1)}\) as
so that \({\tilde{\varUpsilon }} = D_{{\mathfrak {g}}}\varUpsilon D_{{\mathfrak {g}}}\) has constant value 1 along its diagonal.
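In simple cases, the interpolation system (28) can be formed and solved explicitly. The Python sketch below (our own one-dimensional toy instance, with the Gaussian limit kernel and well-separated spikes) assembles \(\varUpsilon \) from the kernel and its derivatives, solves for the coefficients, and checks numerically that the resulting \(\eta \) interpolates the signs with zero derivative and stays below 1 in magnitude.

```python
import numpy as np

K    = lambda x, y: np.exp(-0.5 * (x - y) ** 2)     # limit Gaussian kernel
d1K  = lambda x, y: -(x - y) * K(x, y)              # d/dx K
d2K  = lambda x, y:  (x - y) * K(x, y)              # d/dy K
d12K = lambda x, y: (1 - (x - y) ** 2) * K(x, y)    # d^2/dxdy K

spikes = np.array([-2.0, 0.5, 3.0])                 # separation >> kernel width
signs  = np.array([1.0, -1.0, 1.0])
s = len(spikes)

# Upsilon [alpha; beta] = [signs; 0]: interpolate the signs with zero slope.
X, Y = np.meshgrid(spikes, spikes, indexing="ij")
Ups = np.block([[K(X, Y),   d2K(X, Y)],
                [d1K(X, Y), d12K(X, Y)]])
coef = np.linalg.solve(Ups, np.concatenate([signs, np.zeros(s)]))

def eta(x):
    return sum(coef[j] * K(spikes[j], x) + coef[s + j] * d1K(spikes[j], x)
               for j in range(s))

grid = np.linspace(-6.0, 7.0, 2601)
print(eta(spikes))                        # = signs
print(np.max(np.abs(eta(grid))))          # stays <= 1 in this example
```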
We will prove in Theorem 2 that \(\eta \) of the form (28) is indeed non-degenerate, provided that there is sufficient curvature of \(K(x,\cdot )\) in a small neighborhood around x and \(\min _{k\ne j}{\mathfrak {d}}_{{\mathfrak {g}}}(x_j,x_k)\geqslant \varDelta \), where \(\varDelta \) is the distance at which the kernel and its partial derivatives are sufficiently small (to allow for interpolation with \(K(\cdot ,x_j)\) with minimal interference between the point sources). To do so, we need the following definition.
Definition 3
Given \(r>0\), the local curvature constants \({\bar{\varepsilon }}_0(r)\) and \({\bar{\varepsilon }}_2(r)\) of \(K\) are defined as
Given \(h>0\) and \(s\in {\mathbb {N}}\), the kernel width of \(K\) is defined as
where \({\mathcal {S}}_\varDelta {\mathop {=}\limits ^{\text {def.}}} \left\{ (x_k)_{k=1}^s\in {\mathcal {X}}^s \;;\; {\mathfrak {d}}(x_k,x_\ell )\geqslant \varDelta ,\; \forall k\ne \ell \right\} \) is the set of s-tuples of \(\varDelta \)-separated points. We define \(\inf \emptyset {\mathop {=}\limits ^{\text {def.}}}+\infty \).
Intuitively, these notions are similar to those appearing in the definition of non-degenerate dual certificates (and will ultimately serve in the proof of existence of such certificates): r is a neighborhood size, \({\bar{\varepsilon }}_0\) represents the distance to 1 of the kernel away from \(x=x'\), and \({\bar{\varepsilon }}_2\) is the “curvature” of the kernel when \(x \approx x'\). Finally, \(\varDelta \) is the “minimal separation” under which s Diracs have minimal interference between them, or, in other words, the covariance kernel and its derivatives have low value. We formalize this in the following assumption.
Assumption 1
(Assumptions on the kernel.) Suppose that K is a real-valued kernel. For \(i,j\leqslant 2\) and \(i+j \leqslant 3\), assume that \(B_{ij} {\mathop {=}\limits ^{\text {def.}}}\sup _{x,x'\in {\mathcal {X}}}\left\| K^{(ij)}(x,x') \right\| _{x,x'}< \infty \) and denote \(B_i{\mathop {=}\limits ^{\text {def.}}}B_{0i} + B_{1i}+1\). Assume that K has positive curvature constants \({\bar{\varepsilon }}_0\) and \({\bar{\varepsilon }}_2\) at radius \(0< {r_{\mathrm {near}}}< B_{02}^{-\frac{1}{2}}\). Let \(s \in {\mathbb {N}}\) be such that \(\varDelta {\mathop {=}\limits ^{\text {def.}}}\varDelta (h,s)<\infty \) with \(h \leqslant \frac{1}{64}\min \left( \frac{{\bar{\varepsilon }}_0}{B_0},\frac{{\bar{\varepsilon }}_2}{B_2} \right) \).
Under this assumption, the following theorem, which is the main result of this section, proves that a limit non-degenerate dual certificate can be constructed under minimal separation.
Theorem 2
Under Assumption 1, for all \(\{x_k\}_{k=1}^s\) with \(\min _{k\ne \ell }{\mathfrak {d}}_{{\mathfrak {g}}}(x_k,x_\ell ) \geqslant \varDelta \), there exists a unique function \(\eta \) of the form (28) which is \((\frac{{\bar{\varepsilon }}_0}{2},\frac{{\bar{\varepsilon }}_2}{4}, {r_{\mathrm {near}}})\)-non-degenerate. Moreover,
We delay the (slightly lengthy) proof of this result to the next subsection. Before that, we make a few comments.
4.2.1 Dependency on s
As we have seen in the examples of Sect. 2, for a constant h we generally let the minimal separation \(\varDelta = W(h,s)\) depend on s. Indeed, in dimension d, it is well known that one can pack \(C^d\) \(\varDelta \)-separated points in a ball of radius \(2\varDelta \) for some constant C. (This is related to the kissing number.) Hence, there exist s \(\varDelta \)-separated points such that
Therefore, while the kernel width can be independent of s in low dimensions (the trick is then to upper bound it by a constant as \(s\rightarrow \infty \), assuming the sum on the l.h.s. converges), as d increases, the dependence on s becomes inevitable; otherwise, \(\varDelta \) generally depends exponentially on d.
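This exponential growth with the dimension can be made concrete with a toy computation (a minimal sketch; the grid construction and helper below are ours, used only for illustration):

```python
import math

def grid_packing_count(delta: float, d: int) -> int:
    """Number of nodes of a regular grid with spacing delta inside [0, 1]^d.

    Grid nodes at spacing delta are pairwise delta-separated, so this is a
    simple lower bound on how many delta-separated points fit in a fixed volume.
    """
    per_axis = math.floor(1.0 / delta) + 1  # nodes along one axis
    return per_axis ** d

counts = [grid_packing_count(0.5, d) for d in (1, 2, 4, 8)]
# counts == [3, 9, 81, 6561]: at a fixed separation, the number of separated
# points (hence the sparsity s compatible with that separation) grows
# exponentially with the dimension d.
```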
4.2.2 Babel Function
The attentive reader might recognize the similarity between the definition of the kernel width W(h, s) and the Babel function from compressed sensing [57], if we restrict the definition to \((i,j) = (0,0)\) and recall that \(K(x,x') = {\mathbb {E}}_\omega [\overline{\varphi _\omega (x)}\varphi _\omega (x')]\). The Babel function of an \(m\times N\) matrix A with columns \({\mathbf {a}}_j\) is defined as
and a small value of \(\mu (s)\) ensures that the sub-matrix \(A_S^* A_S\), where \(A_S\) is the matrix A restricted to an index set S with \(\left|S\right|\leqslant s\), is well conditioned and invertible. Furthermore, recovery guarantees for Basis Pursuit and Orthogonal Matching Pursuit can be stated in terms of \(\mu (s)\). In Theorem 2, sufficient separation (as quantified by the kernel width) similarly ensures that \(\varPhi _x^* \varPhi _x\) is well conditioned and thereby provides performance guarantees for the BLASSO.
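For concreteness, here is a small sketch of this quantity, following the standard definition \(\mu (s) = \max _{j}\max _{\left|\varLambda \right|=s,\, j\notin \varLambda }\sum _{k\in \varLambda }\left|\langle {\mathbf {a}}_j,\,{\mathbf {a}}_k\rangle \right|\) (the helper and the test matrix are ours, not from the paper):

```python
import numpy as np

def babel(A: np.ndarray, s: int) -> float:
    """Babel function mu(s): the largest cumulative coherence between one
    column of A and s other columns (columns are normalized first)."""
    A = A / np.linalg.norm(A, axis=0)
    G = np.abs(A.conj().T @ A)        # absolute Gram matrix |<a_j, a_k>|
    np.fill_diagonal(G, 0.0)          # exclude k = j
    G.sort(axis=1)                    # sort each row in increasing order
    return float(G[:, -s:].sum(axis=1).max())

# An orthonormal dictionary has mu(s) = 0 for every s.
assert babel(np.eye(3), 2) == 0.0
# Adding the coherent column (e1 + e2)/sqrt(2) raises mu(2) to sqrt(2),
# attained by summing that column's two inner products of 1/sqrt(2) each.
A = np.array([[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]])
mu2 = babel(A, 2)
```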
4.3 Proof of Theorem 2
Before proving Theorem 2, we illustrate the link between the curvature of the kernel, as represented by \({\bar{\varepsilon }}_2\) in Def. 3, and the quadratic decay condition \(\left|\eta \right| \leqslant 1-\varepsilon {\mathfrak {d}}_{{\mathfrak {g}}}(x_i,\cdot )^2\) that we used in the definition of non-degenerate certificates (Def. 2). The resulting condition (35) is the one that we actually prove in practice. The following lemma is based on a generalized second-order Taylor expansion.
Lemma 2
Let \(x_0 \in {\mathcal {X}}\) and \(a \in {\mathbb {C}}\) with \(\left|a\right| = 1\). Suppose that for some \(\varepsilon >0\), \(B>0\) and \(0<r\leqslant B^{-\frac{1}{2}}\) we have: for all \(x \in {\mathcal {B}}_{{\mathfrak {d}}_{{\mathfrak {g}}}}(x_0;r)\) and \(v \in {\mathbb {C}}^d\), it holds that \(-K^{(02)}(x_0,x)[v,v] \geqslant \varepsilon \left\| v \right\| _{x}^2\) and \(\left\| K^{(02)}(x_0,x) \right\| _x \leqslant B\). Let \(\eta :{\mathcal {X}}\rightarrow {\mathbb {C}}\) be a smooth function.
-
(i)
If \(\eta (x_0) = 0, \nabla \eta (x_0) = 0\) and
$$\begin{aligned} \left\| \mathrm {D}_{2}\left[ \eta \right] (x) \right\| _x \leqslant \delta \qquad \forall x\in {\mathcal {B}}_{{\mathfrak {d}}_{{\mathfrak {g}}}}(x_0;r) \end{aligned}$$(34)then \(\left|\eta (x)\right| \leqslant \delta {\mathfrak {d}}_{{\mathfrak {g}}}(x_0,x)^2\) for all \(x\in {\mathcal {B}}_{{\mathfrak {d}}_{{\mathfrak {g}}}}(x_0;r)\).
-
(ii)
If \(\eta (x_0) = a\), \(\nabla \eta (x_0) = 0\) and
$$\begin{aligned} \left\| {\overline{a}}\mathrm {D}_{2}\left[ \eta \right] (x) - K^{(02)}(x_0,x) \right\| _x \leqslant \delta \qquad \forall x\in {\mathcal {B}}_{{\mathfrak {d}}_{{\mathfrak {g}}}}(x_0;r) \end{aligned}$$(35)for some \(\delta < \frac{\varepsilon }{2}\), then for all \(x\in {\mathcal {B}}_{{\mathfrak {d}}_{{\mathfrak {g}}}}(x_0;r)\) we have \(\left|\eta (x)\right| \leqslant 1- \varepsilon ' {\mathfrak {d}}_{{\mathfrak {g}}}(x_0,x)^2\) with \(\varepsilon ' = \frac{\varepsilon - 2\delta }{2}\).
Proof
We prove (ii); the proof of (i) is similar and simpler. Using (35) and the assumption on \(K^{(02)}\), we can deduce that for all \(v\in {\mathbb {R}}^d\) we have
Given a geodesic \(\gamma :[0,1] \rightarrow {\mathbb {R}}^d\), it is a well-known property that \( \ddot{\gamma } + \sum _{i,j} \varGamma _{ij}(\gamma ) \dot{\gamma }_i \dot{\gamma }_j = 0 \), where we recall that \(\varGamma _{ij} \in {\mathbb {R}}^d\) are the Christoffel symbols. Therefore, we have
So, given any geodesic path with \(\gamma (0) = x_0\) and \(\gamma (1) = x\), since of course we have \({\mathfrak {d}}_{{\mathfrak {g}}}(x_0,\gamma (t))\leqslant {\mathfrak {d}}_{{\mathfrak {g}}}(x_0,x)\leqslant r\), by applying the inequalities above:
where the last line follows because \(\left\| {\dot{\gamma }}(t) \right\| _{\gamma (t)}\) is constant for all \(t\in [0,1]\). Similarly, we can show that \(\mathrm {Re}\left( {\overline{a}} \eta (x)\right) \geqslant 1-\frac{B+\delta }{2} {\mathfrak {d}}_{{\mathfrak {g}}}(x_0,x)^2 \geqslant 0\) since \(r\leqslant B^{-\frac{1}{2}}\), and \( \left|\mathrm {Im}\left( {\overline{a}} \eta (x)\right) \right| \leqslant \frac{\delta }{2}{\mathfrak {d}}_{{\mathfrak {g}}}(x_0,x)^2 \), from which we get \(\left|\eta (x)\right| \leqslant \mathrm {Re}\left( {\overline{a}} \eta (x)\right) + \left|\mathrm {Im}\left( {\overline{a}} \eta (x)\right) \right| \leqslant 1- \frac{\varepsilon - 2\delta }{2} {\mathfrak {d}}_{{\mathfrak {g}}}(x_0,x)^2\). \(\square \)
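As a sanity check, Lemma 2(ii) can be verified numerically in a toy setting (a one-dimensional Euclidean example with a Gaussian kernel; these specific choices are ours and do not come from the paper):

```python
import numpy as np

# Toy check of Lemma 2(ii): 1D Euclidean geometry, Gaussian kernel
# K(x0, x) = exp(-(x - x0)^2 / 2), with x0 = 0 and a = 1.
# Here eta = K(0, .) itself, so the left-hand side of (35) vanishes: delta = 0.
r = 0.5                                    # radius, chosen so that r <= B^{-1/2}
x = np.linspace(-r, r, 2001)
eta = np.exp(-x ** 2 / 2)                  # eta(0) = 1, eta'(0) = 0
K02 = (x ** 2 - 1) * np.exp(-x ** 2 / 2)   # second derivative of K(0, .) at x
eps = float((-K02).min())                  # curvature constant on the ball
B = float(np.abs(K02).max())               # bound on |K^{(02)}| on the ball
# Lemma 2(ii) with delta = 0 predicts the quadratic decay
#     |eta(x)| <= 1 - (eps / 2) * x^2   on the ball.
decay_holds = bool(np.all(np.abs(eta) <= 1 - 0.5 * eps * x ** 2 + 1e-12))
```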
We can now proceed with the proof of Theorem 2.
Proof of Theorem 2
Recall the block diagonal matrix \(D_{{\mathfrak {g}}}\) from (33). The system (29) is equivalent to
where \({\tilde{\varUpsilon }} = D_{{\mathfrak {g}}}\varUpsilon D_{{\mathfrak {g}}}\) and \({\tilde{\alpha }} = D_{{\mathfrak {g}}}^{-1} \alpha \). So, if \({\tilde{\varUpsilon }}\) is invertible, then we can write \(\eta = \left( {\tilde{\varUpsilon }}^{-1} {\mathbf {u}}_s \right) ^\top D_{{\mathfrak {g}}}{\mathbf {f}}= \left( \varUpsilon ^{-1} {\mathbf {u}}_s \right) ^\top {\mathbf {f}}\). Therefore, we will proceed as follows: First, prove that \({\tilde{\varUpsilon }}\) is invertible. Second, bound the coefficients \(\alpha _1\) and \(\alpha _2\). Third, prove that \(\eta \) is non-degenerate.
We first prove that the matrix \({\tilde{\varUpsilon }}\) is invertible. To this end, we decompose it into blocks
where \(\varUpsilon _0\in {\mathbb {C}}^{s\times s}\), \(\varUpsilon _1 \in {\mathbb {C}}^{sd\times s}\) and \(\varUpsilon _2 \in {\mathbb {C}}^{sd \times sd}\) are defined as
To prove the invertibility of \({\tilde{\varUpsilon }}\), it suffices to prove that \(\varUpsilon _2\) and its Schur complement \(\varUpsilon _S {\mathop {=}\limits ^{\text {def.}}}\varUpsilon _0 - \varUpsilon _1^\top \varUpsilon _2^{-1} \varUpsilon _1\) are both invertible. To show that \(\varUpsilon _2\) is invertible, we define \(A_{ij} = {{\mathfrak {g}}}_{x_i}^{-\frac{1}{2}}\nabla _1\nabla _2K(x_i,x_j) {{\mathfrak {g}}}_{x_j}^{-\frac{1}{2}}\), such that \(\varUpsilon _2\) has the form:
and by Lemma 5 in Appendix A.1, Assumption 1 and (27), we have
Since \(\left\| \mathrm {Id}- \varUpsilon _2 \right\| _{\mathrm {block}} <1\), \(\varUpsilon _2\) is invertible, and we have \(\left\| \varUpsilon _2^{-1} \right\| _{\mathrm {block}} \leqslant \frac{1}{1 - \left\| \mathrm {Id}- \varUpsilon _2 \right\| _{\mathrm {block}}} \leqslant \frac{4}{3}\). Next, again with Lemma 5, we can bound
since \(K^{(10)}(x,x)=0\). Hence, we have
Therefore, the Schur complement of \({\tilde{\varUpsilon }}\) is invertible and so is \({\tilde{\varUpsilon }}\). Moreover, \(\left\| \varUpsilon _S^{-1} \right\| _\infty \leqslant \frac{1}{1-h'}\).
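The block-matrix argument above can be illustrated on a generic numerical example (a minimal sketch with a random instance standing in for \({\tilde{\varUpsilon }}\), not the actual matrix):

```python
import numpy as np

rng = np.random.default_rng(0)
# A block matrix M = [[U0, U1^T], [U1, U2]] is invertible as soon as U2 and the
# Schur complement U_S = U0 - U1^T U2^{-1} U1 are invertible, since
#     det(M) = det(U2) * det(U_S).
U0 = np.eye(2) + 0.1 * rng.standard_normal((2, 2))
U1 = 0.1 * rng.standard_normal((4, 2))
U2 = np.eye(4) + 0.1 * rng.standard_normal((4, 4))
M = np.block([[U0, U1.T], [U1, U2]])

US = U0 - U1.T @ np.linalg.solve(U2, U1)   # Schur complement of U2 in M
det_identity = bool(np.isclose(np.linalg.det(M),
                               np.linalg.det(U2) * np.linalg.det(US)))
```

Here, as in the proof, invertibility of the small diagonal block and of its Schur complement together certify invertibility of the full matrix.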
We can now define:
and, as described above, \(\alpha = D_{{\mathfrak {g}}}{\tilde{\alpha }}\). The Schur complement of \({\tilde{\varUpsilon }}\) allows us to express \(\alpha _1\) and \(\alpha _2\) as
and therefore we can bound
Moreover, we have
We can now prove that \(\eta \) is non-degenerate. For any x such that \({\mathfrak {d}}_{{\mathfrak {g}}}(x, x_i)\geqslant {r_{\mathrm {near}}}\) for all \(x_i\)’s, there exists at most one index i such that \({\mathfrak {d}}_{{\mathfrak {g}}}(x,x_i) < \varDelta /2\) and so, for all \(j \ne i\), we have \({\mathfrak {d}}_{{\mathfrak {g}}}(x,x_j)\geqslant \varDelta /2\). Therefore,
Now, let x be such that \({\mathfrak {d}}_{{\mathfrak {g}}}(x_i,x)\leqslant {r_{\mathrm {near}}}\). Similarly, for all \(j \ne i\) we have \({\mathfrak {d}}_{{\mathfrak {g}}}(x,x_j)\geqslant \varDelta /2\). Observe that
So,
We conclude using Lemma 2 and \(\frac{{\bar{\varepsilon }}_2 - 2 {\bar{\varepsilon }}_2/16}{2} \geqslant {\bar{\varepsilon }}_2/4\). \(\square \)
5 Sparse Recovery
In this section, we formulate our main contribution, Theorem 3, which is a detailed version of Theorem 1. In previous sections, we have shown that the existence of a non-degenerate dual certificate implies sparse recovery guarantees, and that in the limit case \(m\rightarrow \infty \), a minimal separation assumption implies the existence of such a certificate. Our main theorem is obtained by bounding the deviations from the limit case when m is finite. We do so by extending the celebrated golfing scheme [39] to the infinite-dimensional case. We begin by stating our assumptions on the feature functions \(\varphi _\omega \).
5.1 Almost Bounded Random Features
In order to bound the variation between \(K\) and \({\hat{K}}\), we would ideally like the features \(\varphi _\omega \) and their derivatives to be uniformly bounded for all \(\omega \). However, this may not be the case: think of \(e^{i\omega ^\top x}\), which does not have a uniformly bounded gradient when the support of the distribution \(\varLambda \) is not bounded. On the other hand, if \(\varLambda (\omega )\) has sufficient decay as \(\left\| \omega \right\| \) increases, one could argue that the selected random features and their derivatives are uniformly bounded with high probability. For \(r\in \{0,1,2\}\), we define the random variables
Note that \(L_r(\omega )<\infty \) for each \(\omega \) since \({\mathcal {X}}\) is a bounded domain and \(\varphi _\omega \) is smooth.
Since \(\left|\varphi _\omega (x) - \varphi _\omega (x')\right| = \left|\int _{0}^1 \frac{d}{dt}\varphi _\omega (\gamma (t))\mathrm {d}t\right| = \left|\int _{0}^1 \mathrm {D}_{1}\left[ \varphi _\omega \right] (\gamma (t))[{\dot{\gamma }}(t)]\mathrm {d}t\right|\) for a smooth path from x to \(x'\), it is easy to see that
We will also require \(\mathrm {D}_{2}\left[ \varphi _\omega \right] (x)\) to be Lipschitz; to this end, we assume that for all \(x,x'\in {\mathcal {X}}\), there exists an isometric isomorphism \(\tau _{x\rightarrow x'}:{\mathbb {C}}^d \rightarrow {\mathbb {C}}^d\) with respect to \({{\mathfrak {g}}}_x\), that is, a map satisfying \(\langle u,\,v\rangle _{x} = \langle \tau _{x\rightarrow x'}u,\,\tau _{x\rightarrow x'}v\rangle _{x'}\), such that for all \(\omega \):
where naturally
and \({r_{\mathrm {near}}}\) comes from Assumption 1. One possible choice of \(\tau _{x\rightarrow x'}\) is the parallel transport along the unique geodesic connecting x and \(x'\). Another possibility is simply \(\tau _{x\rightarrow x'}:v \mapsto {{\mathfrak {g}}}_{x'}^{-\frac{1}{2}} {{\mathfrak {g}}}_x^{\frac{1}{2}}v\). The latter choice implies
which is a more convenient expression that we will use in the examples.
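The isometry property of this second choice is easy to verify numerically (a small sketch with random symmetric positive definite matrices standing in for \({{\mathfrak {g}}}_x\) and \({{\mathfrak {g}}}_{x'}\)):

```python
import numpy as np

rng = np.random.default_rng(0)

def spd_sqrt(G: np.ndarray) -> np.ndarray:
    """Symmetric square root of a symmetric positive definite matrix."""
    w, V = np.linalg.eigh(G)
    return (V * np.sqrt(w)) @ V.T

def random_spd(d: int) -> np.ndarray:
    A = rng.standard_normal((d, d))
    return A @ A.T + d * np.eye(d)

d = 3
gx, gxp = random_spd(d), random_spd(d)   # metric tensors at x and x'
# The map tau_{x -> x'} : v -> g_{x'}^{-1/2} g_x^{1/2} v ...
T = np.linalg.solve(spd_sqrt(gxp), spd_sqrt(gx))
u, v = rng.standard_normal(d), rng.standard_normal(d)
# ... is an isometry: <u, v>_x = u^T g_x v equals <Tu, Tv>_{x'} = (Tu)^T g_{x'} (Tv).
isometry = bool(np.isclose(u @ gx @ v, (T @ u) @ gxp @ (T @ v)))
```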
Finally, we let \(F_r: [0,\infty ) \rightarrow [0,1]\) be decaying tail functions such that
Our sampling complexity will depend on the decay of these tail distributions so that the derivatives of the selected random features are bounded with high probability. A similar idea of stochastic incoherence was exploited in [16] for deriving compressed sensing bounds.
5.2 Main Result
Our main result is valid under the following assumption, which links the tail probabilities of the bounds on the feature functions and the final number of measurements m.
Assumption 2
(Assumption on the features and the sample complexity) For \(\rho >0\), suppose that \(m\in {\mathbb {N}}\) and constants \(\{{{\bar{L}}}_i\}_{i=0}^3 \in {\mathbb {R}}_+^4\) are chosen such that
and
where \(N{\mathop {=}\limits ^{\text {def.}}}\frac{{\mathcal {R}}_{\mathcal {X}}{{\bar{L}}}_1}{{\bar{\varepsilon }}_0} +\frac{{r_{\mathrm {near}}}{{\bar{L}}}_{3}{{\bar{L}}}_0 + {{\bar{L}}}_2}{{\bar{\varepsilon }}_2}\), \(C_1{\mathop {=}\limits ^{\text {def.}}}({{\bar{L}}}_{0}^2+{{\bar{L}}}_1^2) \sum _{r=0,2}\frac{B_r^2}{{\bar{\varepsilon }}_r^2}\), and \(C_2{\mathop {=}\limits ^{\text {def.}}}\frac{B_{22}{{\bar{L}}}_{01}^2}{B_2^2} + \sum _{r=0,2}\left( \frac{{{\bar{L}}}_r^2}{{\bar{\varepsilon }}_r^2} + \frac{{{\bar{L}}}_{01}{{\bar{L}}}_r}{{\bar{\varepsilon }}_r} \right) \) with \({{\bar{L}}}_{ij} = \sqrt{{{\bar{L}}}_i^2+{{\bar{L}}}_j^2}\).
The constants \({{\bar{L}}}_r\) play the role of “stochastic” Lipschitz constants: for \(r=0,1,2\), with high probability on \(\omega _j\), \(\mathrm {D}_{r}\left[ \varphi _\omega \right] (x)\) will be \({{\bar{L}}}_r\)-bounded and \({{\bar{L}}}_{r+1}\)-Lipschitz. The condition (46) ensures that this holds with probability \(1-\rho \), that is, with the desired probability of failure. The entire proof is then carried out conditionally on these bounds holding.
Note also that, generally, \(\{{{\bar{L}}}_r\}\) depend on m through (46). However, all our examples fall into one of two categories (see Sect. 2):
-
(i)
either \(\left\| \mathrm {D}_{r}\left[ \varphi _\omega \right] (x) \right\| _x\) is already uniformly bounded, in which case \({{\bar{L}}}_r\) can be chosen independently of \(\rho \) and m; this is, for instance, the case of discrete Fourier sampling;
-
(ii)
or the \(F_r(t)\) are exponentially decaying, in which case we can show that \({{\bar{L}}}_r = {\mathcal {O}}\left( \log \left( \frac{m}{\rho } \right) ^p \right) \) for some \(p>0\), which only incurs additional logarithmic terms in the bound (47). This occurs in the case of sampling the Laplace transform or sampling the Fourier transform with respect to a Gaussian distribution.
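The second mechanism can be illustrated with a short computation (illustrative constants; we assume for the sketch a tail of the simple form \(F_r(t) = e^{-t/c}\)):

```python
import math

# If the tail function decays exponentially, F_r(t) = exp(-t / c), then taking
#     Lbar = c * log(m / rho)
# makes the union bound m * F_r(Lbar) over the m draws equal to rho: the
# stochastic bounds on the features grow only logarithmically with m.
c, rho = 1.0, 0.01
lbars = {m: c * math.log(m / rho) for m in (10, 1000, 100000)}
union_bounds = [m * math.exp(-lbars[m] / c) for m in lbars]
# Every union bound equals rho = 0.01, while Lbar only grows from about 6.9
# (m = 10) to about 16.1 (m = 100000).
```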
We are now ready to state the detailed version of Theorem 1, which is the main result of this paper.
Theorem 3
Suppose that Assumptions 1 and 2 hold. Let y be as in (2) with \(\min _{i\ne j}{\mathfrak {d}}_{{\mathfrak {g}}}(x_i,x_j)\geqslant \varDelta \) and \(\left\| w \right\| \leqslant \delta \). Then, with probability at least \(1-\rho \), any solution \({\hat{\mu }}\) of (\({\mathcal {P}}_\lambda (y)\)) with \(\lambda \sim \frac{\delta }{\sqrt{s}}\) satisfies
where \({\hat{A}}_i = \left|{\hat{\mu }}\right|\left( {\mathcal {B}}_{{\mathfrak {d}}_{{\mathfrak {g}}}}(x_i;{r_{\mathrm {near}}}) \right) \), \({\hat{a}}_i = {\hat{\mu }}\left( {\mathcal {B}}_{{\mathfrak {d}}_{{\mathfrak {g}}}}(x_i;{r_{\mathrm {near}}}) \right) \) and \(e\lesssim \frac{1}{\min ({\bar{\varepsilon }}_0,{\bar{\varepsilon }}_2)}\left( \left|{\tilde{\mu }}_0\right|({\mathcal {X}}) \right) + \delta \cdot \sqrt{s}\).
The next section is dedicated to the proof of Theorem 3 using an infinite-dimensional golfing scheme. Appendix A is dedicated to the proof of some technical lemmas. Appendix B gathers all the concentration inequalities that we use in the golfing scheme, which are essentially variants of Bernstein's inequality. Finally, Appendices C, D and E are dedicated to the computation of all the constants in Assumptions 1 and 2 for the examples described in Sect. 2, which can be quite verbose.
6 Proof of Theorem 3
The main step toward proving Theorem 3 is to prove the existence of a dual certificate satisfying the properties described in Proposition 1. More precisely, we are going to prove the following theorem.
Theorem 4
Suppose that Assumptions 1 and 2 hold. Let \( \left\{ x_j \right\} _{j=1}^s\) be such that \(\min _{i\ne j}{\mathfrak {d}}_{{\mathfrak {g}}}(x_i,x_j)\geqslant \varDelta \). Then, with probability at least \(1-\rho \), there exists \(p\in {\mathbb {C}}^m\) with \(\left\| p \right\| \lesssim \sqrt{s}\) such that \({\hat{\eta }}= \varPhi ^* p\) is \((\frac{{\bar{\varepsilon }}_0}{8}, \frac{3{\bar{\varepsilon }}_2}{8}, {r_{\mathrm {near}}})\)-non-degenerate.
6.1 Outline of the Proof
The construction of the non-degenerate certificate includes several intermediate steps. As usual in this type of proof, we will first prove these properties on a finite \(\varepsilon \)-net that covers \({\mathcal {X}}\), then extend them to the whole space by regularity. Here we work with several nets \({\mathcal {G}}^{\mathrm {near}}_j\subset {\mathcal {X}}^{\mathrm {near}}_j\) and \({\mathcal {G}}^{\mathrm {far}}\subset {\mathcal {X}}^{\mathrm {far}}\) whose precision will be adjusted later. The principle of the golfing scheme is to work with an “approximate” dual certificate \(\eta ^{\mathrm {app}}\) (which is actually not a dual certificate at all) and then “correct” it to obtain the desired true certificate. In detail, we will go through the following steps:
-
1.
First, show that with probability at least \(1-\rho \), there is an approximate certificate \(\eta ^{\mathrm {app}}\in {{\,\mathrm{Im}\,}}(\varPhi ^*)\) such that for some constant \(c_0\) that will be adjusted later,
$$\begin{aligned} {\left\{ \begin{array}{ll} \sum _{j=1}^s \left|\eta ^{\mathrm {app}}(x_j) - {{\,\mathrm{sign}\,}}(a_j)\right|^2 + \left\| \mathrm {D}_{1}\left[ \eta ^{\mathrm {app}}\right] (x_j) \right\| _{{x_j}}^2\leqslant c_0^2 &{}\text {for all } j=1,\ldots , s \\ \left|\eta ^{\mathrm {app}}(x)\right| \leqslant 1-\frac{{\bar{\varepsilon }}_0}{4} &{} \text {for all }x\in {\mathcal {G}}^{\mathrm {far}}\\ \left\| {\overline{{{\,\mathrm{sign}\,}}(a_j)}\mathrm {D}_{2}\left[ \eta ^{\mathrm {app}}\right] (x) - K^{(02)}(x_j,x)} \right\| _x \leqslant \frac{7{\bar{\varepsilon }}_2}{64} &{}\text {for all }j=1,\ldots , s, x\in {\mathcal {G}}^{\mathrm {near}}_j \end{array}\right. } \end{aligned}$$(48)In other words, we relax the condition \(\eta (x_j) = {{\,\mathrm{sign}\,}}(a_j)\), \(\nabla \eta (x_j) = 0\), and replace it with the first equation above.
-
2.
Second, correct the approximate certificate to obtain a function \({\hat{\eta }}\in {{\,\mathrm{Im}\,}}(\varPhi ^*)\) such that:
$$\begin{aligned} {\left\{ \begin{array}{ll} {\hat{\eta }}(x_j) = {{\,\mathrm{sign}\,}}(a_j) \quad \text {and} \quad \nabla {\hat{\eta }}(x_j) = 0 &{}\text {for all }j=1,\ldots , s \\ \left|{\hat{\eta }}(x)\right| \leqslant 1-\frac{3{\bar{\varepsilon }}_0}{16} &{} \text {for all }x\in {\mathcal {G}}^{\mathrm {far}}\\ \left\| {\overline{{{\,\mathrm{sign}\,}}(a_j)}\mathrm {D}_{2}\left[ {\hat{\eta }}\right] (x) - K^{(02)}(x_j,x)} \right\| _x \leqslant \frac{15{\bar{\varepsilon }}_2}{128} &{}\text {for all }j=1,\ldots , s, x\in {\mathcal {G}}^{\mathrm {near}}_j \end{array}\right. } \end{aligned}$$(49)That is, \({\hat{\eta }}\) satisfies all the properties we want, but only on the finite nets \({\mathcal {G}}^{\mathrm {far}}, {\mathcal {G}}^{\mathrm {near}}_j\).
-
3.
Third, bound the norm of the \(p\in {\mathbb {C}}^m\) corresponding to \({\hat{\eta }}= \varPhi ^* p\).
-
4.
Then, use Assumption 2 on the feature functions and the bound on \(\left\| p \right\| \) to show that the \({\hat{\eta }}\) constructed above actually satisfies:
$$\begin{aligned} {\left\{ \begin{array}{ll} {\hat{\eta }}(x_j) = {{\,\mathrm{sign}\,}}(a_j) \quad \text {and} \quad \nabla {\hat{\eta }}(x_j) = 0 &{}\text {for all }j=1,\ldots , s \\ \left|{\hat{\eta }}(x)\right| \leqslant 1-\frac{{\bar{\varepsilon }}_0}{8} &{} \text {for all }x\in {\mathcal {X}}^{\mathrm {far}}\\ \left\| {\overline{{{\,\mathrm{sign}\,}}(a_j)}\mathrm {D}_{2}\left[ {\hat{\eta }}\right] (x) - K^{(02)}(x_j,x)} \right\| _x \leqslant \frac{{\bar{\varepsilon }}_2}{8} &{}\text {for all }j=1,\ldots , s, x\in {\mathcal {X}}^{\mathrm {near}}_j \end{array}\right. } \end{aligned}$$(50)which, by Lemma 2, will imply that \({\hat{\eta }}\) is non-degenerate with the desired constants and conclude the proof of Theorem 4.
-
5.
In a fifth and final step, prove the existence of the s additional certificates \({\hat{\eta }}_j\) appearing in Proposition 2. Combined with the existence of \({\hat{\eta }}\) and with Propositions 1 and 2, this concludes the proof of Theorem 3.
We dedicate a subsection to each step of the proof. Before that, we start in the next subsection with some technical preliminaries and notations.
6.2 Preliminaries
Let us introduce some notations and show some technical bounds that will be handy. Recall the definitions of the sign vector \({\mathbf {u}}_s\) from (29), \(\gamma \), \(\varUpsilon \) and \({\mathbf {f}}\) from (31), (30) and (32), and \(D_{{\mathfrak {g}}}\) from (33). We have the following additional bounds, whose proof, in Appendix A.2, follows similar arguments to that of Theorem 2.
Lemma 3
Under Assumption 1, \(\varUpsilon \) and \({\mathbf {f}}\) defined as in (30) and (32) satisfy the following.
-
(i)
\(\varUpsilon \) is invertible and satisfies
$$\begin{aligned} \left\| \mathrm {Id}- D_{{\mathfrak {g}}}\varUpsilon D_{{\mathfrak {g}}} \right\| _2\leqslant \frac{1}{2} \quad \text {and} \quad \left\| \mathrm {Id}-D_{{\mathfrak {g}}}\varUpsilon D_{{\mathfrak {g}}} \right\| _{\mathrm {Block}}\leqslant \frac{1}{2}. \end{aligned}$$(51) -
(ii)
For any vector \(q \in {\mathbb {C}}^{s(d+1)}\) and any \(x\in {\mathcal {X}}^{\mathrm {far}}\), we have
$$\begin{aligned} \left\| D_{{\mathfrak {g}}}{\mathbf {f}}(x) \right\| _2 \leqslant B_0 \quad \text {and} \quad \left|q^\top {\mathbf {f}}(x)\right| \leqslant B_0 \left\| D_{{\mathfrak {g}}}^{-1} q \right\| _{\mathrm {Block}} \end{aligned}$$(52) -
(iii)
For any vector \(q \in {\mathbb {C}}^{s(d+1)}\) and any \(x \in {\mathcal {X}}^{\mathrm {near}}\) we have the bound:
$$\begin{aligned}&\left\| \mathrm {D}_{2}\left[ q^\top {\mathbf {f}}(.)\right] (x) \right\| _x \leqslant \left\| D_{{\mathfrak {g}}}^{-1} q \right\| B_2 \nonumber \\&\quad \text {and} \quad \left\| \mathrm {D}_{2}\left[ q^\top {\mathbf {f}}(.)\right] (x) \right\| _x \leqslant \left\| D_{{\mathfrak {g}}}^{-1} q \right\| _{\mathrm {Block}}B_2 \qquad \end{aligned}$$(53)
Now, for \(\omega _1,\ldots ,\omega _m\), denote the empirical versions of \(\varUpsilon \) and \({\mathbf {f}}\) by:
Recall the definition of \(L_j(\omega )\) and \({{\bar{L}}}_j\) in Assumption 2. Let the event \({{{\bar{E}}}}\) be defined by
Since by Assumption 2, Eq. (46), we have \({\mathbb {P}}({{{\bar{E}}}}^c) \leqslant \rho \), a non-degenerate dual certificate can be constructed with probability at least \((1-\rho )^2 \geqslant 1-2\rho \) provided that, conditional on event \({{{\bar{E}}}}\), a non-degenerate dual certificate can be constructed with probability at least \(1-\rho \).
We therefore assume for the rest of this proof that event \({{{\bar{E}}}}\) holds and establish the probability conditional on \({{{\bar{E}}}}\) that a non-degenerate dual certificate exists. To control this probability, we will need to control the deviation of \(\hat{{\mathbf {f}}}\) and \({\hat{\varUpsilon }}\) from their conditional expectations \({\mathbf {f}}_{{{\bar{E}}}}= {\mathbb {E}}_{{{\bar{E}}}}[\hat{{\mathbf {f}}}]\) and \(\varUpsilon _{{{\bar{E}}}}{\mathop {=}\limits ^{\text {def.}}}{\mathbb {E}}_{{{\bar{E}}}}[{\hat{\varUpsilon }}]\), where we denote \({\mathbb {E}}_{{{\bar{E}}}}[\cdot ] {\mathop {=}\limits ^{\text {def.}}}{\mathbb {E}}[\cdot | {{{\bar{E}}}}]\). The following lemma, proved in Appendix A.3, bounds the deviations between these.
Lemma 4
Under Assumptions 1 and 2, we have:
-
(i)
\(\left\| D_{{\mathfrak {g}}}(\varUpsilon - \varUpsilon _{{{\bar{E}}}})D_{{\mathfrak {g}}} \right\| _2\leqslant 4\frac{(s+1)\min ({\bar{\varepsilon }}_0, {\bar{\varepsilon }}_2)}{m}\) and \(\left\| D_{{\mathfrak {g}}}(\varUpsilon - \varUpsilon _{{{\bar{E}}}})D_{{\mathfrak {g}}} \right\| _{\mathrm {Block}}\) \(\leqslant 8\frac{(s+1)\min ({\bar{\varepsilon }}_0, {\bar{\varepsilon }}_2)}{m}\)
-
(ii)
for all \(x \in {\mathcal {X}}^{\mathrm {far}}\), \(\left\| D_{{\mathfrak {g}}}({\mathbf {f}}(x) - {\mathbf {f}}_{{{\bar{E}}}}(x)) \right\| _2 \leqslant \frac{(B_0 + 2 \sqrt{s}) \min ({\bar{\varepsilon }}_0, {\bar{\varepsilon }}_2)}{m}\)
-
(iii)
for all \(x \in {\mathcal {X}}^{\mathrm {near}}\), \(\sup _{\left\| q \right\| _2\leqslant 1}\left\| \mathrm {D}_{2}\left[ ({\mathbf {f}}- {\mathbf {f}}_{{{\bar{E}}}})^\top D_{{\mathfrak {g}}}q\right] (x) \right\| _x \leqslant \frac{(B_2 + 2 \sqrt{s}) \min ({\bar{\varepsilon }}_0, {\bar{\varepsilon }}_2)}{m}\)
6.3 Step 1: Construction of an Approximate Certificate with the Golfing Scheme
The first step is to construct an approximate certificate \(\eta ^{\mathrm {app}}\) using the so-called golfing scheme. The golfing scheme was introduced in [39] and successfully used in compressed sensing, for instance, in [16]. It can be intuitively explained as follows. Recall that the certificate constructed in Theorem 2 in the case \(m\rightarrow \infty \) is of the form \(\eta = (\varUpsilon ^{-1} {\mathbf {u}})^{\top }{\mathbf {f}}\). It is therefore natural to try to show directly that \({\hat{\eta }}{\mathop {=}\limits ^{\text {def.}}}({\hat{\varUpsilon }}^{-1} {\mathbf {u}})^{\top }\hat{{\mathbf {f}}}\) is also non-degenerate by bounding the variation between \(\eta \) and \({\hat{\eta }}\). This is the strategy adopted by Tang et al. [55] and in our previous work [44]. However, as mentioned before, this proof technique requires the random signs assumption; otherwise, a sub-optimal bound on m is obtained. To avoid this, the golfing scheme starts by writing the following Neumann expansion: assuming that \({\hat{\varUpsilon }}\) is invertible, we have
where \(q_{\ell } {\mathop {=}\limits ^{\text {def.}}}\left( \mathrm {Id}- {\hat{\varUpsilon }}\varUpsilon ^{-1} \right) q_{\ell -1} \), \(q_0 {\mathop {=}\limits ^{\text {def.}}}{\mathbf {u}}\). By truncating the sum above after a finite number of terms, one effectively obtains an approximate certificate that must later be corrected. However, there is an additional difficulty in analyzing the sum, which comes from the fact that for each summand, \(\hat{{\mathbf {f}}}\) and \(\varUpsilon ^{-1} q_{\ell -1}\) are random variables which are not mutually independent. The idea of [16, 39] is to decouple the random variables by partitioning the indices \(\{1,\ldots , m\}\) into \(J\) disjoint blocks \({\mathcal {B}}_\ell \) of size \(m_\ell \) with \(\sum _{\ell =1}^J m_\ell = m\), for some \(J\) and \(m_\ell \) that are adjusted below. Denote by \({\hat{\varUpsilon }}_\ell \) and \(\hat{{\mathbf {f}}}_\ell \) the empirical versions of \(\varUpsilon \) and \({\mathbf {f}}\) over the \(m_\ell \) random variables included in \({\mathcal {B}}_\ell \), that is:
Then, instead of (56), we consider
where \(q_{\ell } {\mathop {=}\limits ^{\text {def.}}}\left( \mathrm {Id}- {\hat{\varUpsilon }}_\ell \varUpsilon ^{-1} \right) q_{\ell -1}\), \(q_0 {\mathop {=}\limits ^{\text {def.}}}{\mathbf {u}}\). Note that this can be rewritten as:
Now, the idea is that one can control each term \(q_{\ell -1}^{\top }\hat{{\mathbf {f}}}_\ell \) conditionally on \(q_{\ell -1}\), and for appropriate choices of the block sizes \(m_\ell \), \(\eta ^{\mathrm {app}}\) can be shown to be approximately non-degenerate with high probability. Each additional term in the sum brings the certificate “closer” to its desired properties, hence the name “golfing” scheme.
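The recursion can be sketched numerically (a toy instance where \(\varUpsilon \) is the identity and each \({\hat{\varUpsilon }}_\ell \) is an independent noisy version of it; dimensions and noise level are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)
n, J, noise = 6, 8, 0.05
Upsilon = np.eye(n)                     # limit matrix (identity for simplicity)
Upsilon_inv = np.linalg.inv(Upsilon)
q = np.ones(n)                          # q_0 = u
norms = [np.linalg.norm(q)]
for _ in range(J):
    # Each block uses fresh randomness, so Upsilon_hat_l is independent of
    # q_{l-1}: here, a noisy unbiased version of Upsilon.
    Upsilon_hat = Upsilon + noise * rng.standard_normal((n, n))
    q = (np.eye(n) - Upsilon_hat @ Upsilon_inv) @ q   # q_l update
    norms.append(np.linalg.norm(q))
# Each step contracts q by roughly ||Upsilon - Upsilon_hat_l||, so ||q_l||
# decays geometrically and the truncated sum defining eta_app rapidly
# approaches the limit certificate.
decayed = bool(norms[-1] < 1e-3 * norms[0])
```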
6.3.1 Parameters and Intermediate Assumptions
We set the error \(c_0\) that appears in (48) as
for some universal constant \(C_0\). We define the parameters of our golfing scheme as follows:
We now formulate an intermediate set of assumptions and proceed to show, first, that they imply the desired properties on \(\eta ^{\mathrm {app}}\), and, second, that they hold with high probability. For \(1\leqslant \ell \leqslant J\), we define:
- (\(\hbox {I}_\ell \)):
-
\(\left\| D_{{\mathfrak {g}}}q_\ell \right\| _{\mathrm {Block}} \leqslant c_\ell \left\| D_{{\mathfrak {g}}}q_{\ell -1} \right\| _{\mathrm {Block}}\),
- (\(\hbox {II}_\ell \)):
-
For all \(x\in {\mathcal {G}}^{\mathrm {far}}\), \( \left|(\varUpsilon ^{-1} q_{\ell -1})^\top \hat{{\mathbf {f}}}_\ell (x)\right| \leqslant t_\ell \left\| D_{{\mathfrak {g}}}q_{\ell -1} \right\| _{\mathrm {Block}}\),
- (\(\hbox {III}_\ell \)):
-
If \(\ell =1\): for all \(j=1,\ldots ,s\), \(x\in {\mathcal {G}}^{\mathrm {near}}_j\), \(\left\| \overline{ {{\,\mathrm{sign}\,}}(a_j)} \mathrm {D}_{2}\left[ (\varUpsilon ^{-1} {\mathbf {u}}_s)^\top \hat{{\mathbf {f}}}_1\right] (x) - K^{(02)}(x_j, x) \right\| _x \leqslant b_1\); and if \(\ell \geqslant 2\): for all \(x\in {\mathcal {G}}^{\mathrm {near}}\), \(\left\| { \mathrm {D}_{2}\left[ (\varUpsilon ^{-1} q_{\ell -1})^\top \hat{{\mathbf {f}}}_\ell \right] (x)} \right\| _x \leqslant b_\ell \left\| D_{{\mathfrak {g}}}q_{\ell -1} \right\| _{\mathrm {Block}}\).
Let us now assume that (\(\hbox {I}_\ell \)), (\(\hbox {II}_\ell \)) and (\(\hbox {III}_\ell \)) hold for all \(\ell \) and show that \(\eta ^{\mathrm {app}}\) satisfies the desired properties. We define \(\varPsi : {\mathscr {C}}^{}({\mathcal {X}}) \rightarrow {\mathbb {C}}^{s(d+1)}\) by
In words, \(\varPsi \) evaluates a function and its first derivative at the points \(\{x_j\}_{j=1}^s\). Note that for any vector \(v \in {\mathbb {C}}^{s(d+1)}\), by definition we have \(\varPsi (v^\top \hat{{\mathbf {f}}}_\ell ) = {\hat{\varUpsilon }}_\ell v\). Using this, we have
since by adjusting \(C_0\) we can ensure \(c_0 \leqslant \left( \frac{1}{\sqrt{6}} \right) ^{\frac{1}{\log (3)-1}} \leqslant \left( \frac{1}{\sqrt{2s}} \right) ^{\frac{1}{\log (s)-1}}\), where the last inequality is valid for all s and follows from a simple study of the function. This proves the first part of (48). Next, for all \(x\in {\mathcal {G}}^{\mathrm {far}}\),
since by our choice of \(c_0\) and adjusting \(C_0\), \( B_0 c_0 + \frac{B_0c_0^2}{4(1-c_0)}\leqslant \frac{{\bar{\varepsilon }}_0}{8}\). Similarly, for all \(x\in {\mathcal {G}}^{\mathrm {near}}_j\),
since similarly, \( B_2 c_0 + \frac{B_2 c_0^2 }{4(1-c_0)} \leqslant \frac{{\bar{\varepsilon }}_2}{64}\). Hence, (\(\hbox {I}_\ell \)), (\(\hbox {II}_\ell \)) and (\(\hbox {III}_\ell \)) indeed imply (48). Next, we derive a condition on m under which they hold with probability \(1-\rho \) (conditional on the event \({{{\bar{E}}}}\)).
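The "simple function study" invoked above can be confirmed numerically (a quick sketch checking the inequality for integer \(s \geqslant 3\); at \(s = 3\) the two sides coincide by definition):

```python
import math

# The map s -> (1 / sqrt(2 s)) ** (1 / (log(s) - 1)) is increasing for s >= 3,
# so its value at s = 3, namely (1 / sqrt(6)) ** (1 / (log(3) - 1)), is a
# lower bound for every s >= 3.
def rhs(s: int) -> float:
    return (1.0 / math.sqrt(2 * s)) ** (1.0 / (math.log(s) - 1.0))

lhs = rhs(3)
inequality_holds = all(lhs <= rhs(s) for s in range(3, 10001))
```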
6.3.2 Probability of Successful Construction
Let us now prove that (\(\hbox {I}_\ell \)), (\(\hbox {II}_\ell \)) and (\(\hbox {III}_\ell \)) are indeed valid with the desired probability. Let \(p_1(\ell )\), \(p_2(\ell )\) and \(p_3(\ell )\) be the probabilities, conditional on the event \({{{\bar{E}}}}\), that (\(\hbox {I}_\ell \)), (\(\hbox {II}_\ell \)) and (\(\hbox {III}_\ell \)) fail, respectively. By a union bound, our goal is to derive a bound on m such that \(\sum _{k=1}^3 \sum _{\ell =1}^Jp_k(\ell ) \leqslant \rho \). We do so by applying variants of Bernstein's concentration inequality, all detailed in Appendix B. As mentioned before, a crucial feature of the golfing scheme is that, at each step, \(q_{\ell -1}\) and \(\hat{{\mathbf {f}}}_\ell \) are mutually independent, so that we can reason conditionally on \(q_{\ell -1}\) and treat it as a fixed vector when bounding the probabilities w.r.t. \(\hat{{\mathbf {f}}}_\ell \) and \({\hat{\varUpsilon }}_\ell \).
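For reference, the scalar prototype of these inequalities can be sanity-checked by simulation (a generic Monte Carlo sketch of Bernstein's inequality with illustrative parameters, not one of the specific vector- or operator-valued variants of Appendix B):

```python
import numpy as np

rng = np.random.default_rng(2)
# Scalar Bernstein inequality: for i.i.d. centered X_i with |X_i| <= L and
# variance sigma2,
#   P(|X_1 + ... + X_m| >= t) <= 2 exp(-t^2 / (2 * (m * sigma2 + L * t / 3))).
m, L, t, trials = 200, 1.0, 20.0, 20000
X = rng.uniform(-L, L, size=(trials, m))       # centered, bounded by L
sigma2 = (2 * L) ** 2 / 12                     # variance of Uniform(-L, L)
empirical = float(np.mean(np.abs(X.sum(axis=1)) >= t))
bernstein = float(2 * np.exp(-t ** 2 / (2 * (m * sigma2 + L * t / 3))))
bound_holds = bool(empirical <= bernstein)
```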
We define \({\bar{q}}_\ell {\mathop {=}\limits ^{\text {def.}}}D_{{\mathfrak {g}}}^{-1} \varUpsilon ^{-1} q_\ell \) for short. To bound \(p_1(\ell )\), we first observe the recurrence relation \(D_{{\mathfrak {g}}}q_\ell = D_{{\mathfrak {g}}}(\mathrm {Id}- {\hat{\varUpsilon }}_\ell \varUpsilon ^{-1}) q_{\ell -1} = D_{{\mathfrak {g}}}(\varUpsilon - {\hat{\varUpsilon }}_\ell ) D_{{\mathfrak {g}}}{\bar{q}}_{\ell -1}\). Moreover, by Lemma 3 we have \(\left\| D_{{\mathfrak {g}}}^{-1}\varUpsilon ^{-1}D_{{\mathfrak {g}}}^{-1} \right\| _{\mathrm {Block}} \leqslant \frac{1}{1- \left\| \mathrm {Id}- D_{{\mathfrak {g}}}\varUpsilon D_{{\mathfrak {g}}} \right\| _{\mathrm {Block}}} \leqslant 2\), and therefore \( \left\| D_{{\mathfrak {g}}}q_{\ell -1} \right\| _{\mathrm {Block}} \geqslant \frac{1}{\left\| D_{{\mathfrak {g}}}^{-1}\varUpsilon ^{-1}D_{{\mathfrak {g}}}^{-1} \right\| _{\mathrm {Block}}} \left\| {\bar{q}}_{\ell -1} \right\| _{\mathrm {Block}} \geqslant \frac{1}{2} \left\| {\bar{q}}_{\ell -1} \right\| _{\mathrm {Block}} \). Finally, by Lemma 4 and our assumptions we have in particular that \(\left\| D_{{\mathfrak {g}}}(\varUpsilon _{{{\bar{E}}}}- \varUpsilon )D_{{\mathfrak {g}}} \right\| _{\mathrm {Block}} \leqslant \min _\ell c_\ell /4\). Therefore,
Finally, applying Lemma 14 with some \(\rho _\ell \) that we adjust later, we obtain that
if \(m_\ell \gtrsim \frac{s {{\bar{L}}}_{01}^2}{c_\ell ^2} \log \left( \frac{s}{\rho _\ell } \right) \).
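The bound \(\left\| D_{{\mathfrak {g}}}^{-1}\varUpsilon ^{-1}D_{{\mathfrak {g}}}^{-1} \right\| _{\mathrm {Block}} \leqslant 2\) from Lemma 3, used repeatedly above, is the standard Neumann-series estimate. For completeness, here is the one-line derivation, assuming only that \(\left\| \mathrm {Id}- D_{{\mathfrak {g}}}\varUpsilon D_{{\mathfrak {g}}} \right\| _{\mathrm {Block}} \leqslant 1/2\) (which holds under our assumptions on the kernel widths) and that \((D_{{\mathfrak {g}}}\varUpsilon D_{{\mathfrak {g}}})^{-1} = D_{{\mathfrak {g}}}^{-1}\varUpsilon ^{-1}D_{{\mathfrak {g}}}^{-1}\):
$$\begin{aligned} \left\| D_{{\mathfrak {g}}}^{-1}\varUpsilon ^{-1}D_{{\mathfrak {g}}}^{-1} \right\| _{\mathrm {Block}} = \left\| \sum _{k=0}^\infty (\mathrm {Id}- D_{{\mathfrak {g}}}\varUpsilon D_{{\mathfrak {g}}})^k \right\| _{\mathrm {Block}} \leqslant \sum _{k=0}^\infty \left\| \mathrm {Id}- D_{{\mathfrak {g}}}\varUpsilon D_{{\mathfrak {g}}} \right\| _{\mathrm {Block}}^k = \frac{1}{1- \left\| \mathrm {Id}- D_{{\mathfrak {g}}}\varUpsilon D_{{\mathfrak {g}}} \right\| _{\mathrm {Block}}} \leqslant 2. \end{aligned}$$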
For \(p_2(\ell )\), we have
by Lemma 3 for the case \(\ell \geqslant 2\) and Theorem 2 for the case \(\ell =1\). Hence,
Since by Lemma 4 we have in particular
by Lemma 8 and a union bound we have
provided that \(m_\ell \gtrsim s \left( \frac{{{\bar{L}}}_0^2}{{\tilde{t}}_\ell ^{2}} + \frac{{{\bar{L}}}_{01} {{\bar{L}}}_0}{{\tilde{t}}_\ell } \right) \log \left( \frac{\left|{\mathcal {G}}^{\mathrm {far}}\right|}{\rho _\ell } \right) \).
For \(p_3(\ell )\), fix \(j\) and take any \(x \in {\mathcal {G}}^{\mathrm {near}}_j\). In the case \(\ell \geqslant 2\), by Lemma 3,
and for \(\ell =1\), by Theorem 2,
Therefore, by the same computation as before,
Again using Lemma 4, we bound \(\left\| \mathrm {D}_{2}\left[ (D_{{\mathfrak {g}}}{\bar{q}}_{\ell -1})^\top ({\mathbf {f}}_{{{\bar{E}}}}- {\mathbf {f}})\right] (x) \right\| _x \leqslant \frac{{\tilde{b}}_\ell }{2} \left\| {\bar{q}}_{\ell -1} \right\| _{\mathrm {Block}}\) and
by Lemma 10 and a union bound, provided that \(m_\ell \gtrsim s\left( \frac{{{\bar{L}}}_2^2}{{\tilde{b}}_\ell ^{2}} + \frac{{{\bar{L}}}_2{{\bar{L}}}_{01}}{{\tilde{b}}_\ell } \right) \log \left( \frac{\left|{\mathcal {G}}^{\mathrm {near}}\right|}{\rho _\ell } \right) \).
Choosing \(\rho _1 = \rho _2 = \rho /9\) and \(\rho _\ell = \rho /(9J)\) for \(\ell \geqslant 3\), recalling that obviously \({\bar{\varepsilon }}_r \leqslant B_r\) for \(r=1,2\) and denoting \(N_0 = \left|{\mathcal {G}}^{\mathrm {far}}\right|\) and \(N_2 = \left|{\mathcal {G}}^{\mathrm {near}}\right|\) for short, we have \( \sum _{k=1}^3 \sum _{\ell =1}^Jp_k(\ell ) \leqslant \rho \) provided that
and for \(\ell \geqslant 3\),
Therefore, conditionally on \({{{\bar{E}}}}\), \(\eta ^{\mathrm {app}}\) can be constructed with probability at least \(1-\rho \) if \(m \gtrsim m_1+m_2+ Jm_3\), for which it is sufficient that
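As a quick numerical sanity check (not part of the proof), the choice of \(\rho _\ell \) above indeed keeps the total union-bound budget below \(\rho \): each of the three events fails at step \(\ell \) with probability at most \(\rho _\ell \), and \(3(\rho /9 + \rho /9 + (J-2)\rho /(9J)) = \rho (3J-2)/(3J) \leqslant \rho \). A minimal Python sketch (function name and numeric values are illustrative):

```python
# Sanity check of the union-bound budget: rho_1 = rho_2 = rho/9 and
# rho_l = rho/(9J) for l >= 3, summed over the 3 failure events.
def total_failure_budget(rho: float, J: int) -> float:
    """Upper bound on sum_{k=1}^3 sum_{l=1}^J p_k(l)."""
    rho_l = [rho / 9, rho / 9] + [rho / (9 * J)] * (J - 2)
    return 3 * sum(rho_l)

# the budget stays below rho for any number of golfing steps J >= 3
for J in range(3, 50):
    assert total_failure_budget(rho=0.01, J=J) <= 0.01
```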
6.4 Step 2: Correcting the Approximate Certificate
The second step of our proof is to “correct” the previously constructed approximate certificate \(\eta ^{\mathrm {app}}\) to obtain a certificate \(\eta \in {{\,\mathrm{Im}\,}}(\varPhi ^*)\) satisfying (49). Recalling the definition (58) of \(\varPsi \), let \(e{\mathop {=}\limits ^{\text {def.}}}\varPsi \eta ^{\mathrm {app}}- {\mathbf {u}}_s\) be the error made by \(\eta ^{\mathrm {app}}\) and define
Then,
and we have indeed that \({\hat{\eta }}(x_i) = {{\,\mathrm{sign}\,}}(a_i)\) and \(\nabla {\hat{\eta }}(x_i) = 0\). We will now bound the deviations of \({\hat{\eta }}\) on the grids \({\mathcal {G}}^{\mathrm {far}}\) and \({\mathcal {G}}^{\mathrm {near}}\), using the fact that e has a small norm. Note that there is a subtlety here: e itself is random, and not independent of \(\hat{{\mathbf {f}}}\) or \({\hat{\varUpsilon }}\). So we must use “uniform” concentration bounds.
Using Lemma 3 in combination with Lemma 4 and Lemma 12, we have that with probability at least \(1-\rho \):
and therefore
By Lemmas 3, 4 and 9 and a union bound to bound, respectively, each term in the following triangle inequality, with probability \(1-\rho \) we have
if \(m \gtrsim B_0^{-2} \log \left( \frac{\left|{\mathcal {G}}^{\mathrm {far}}\right|}{\rho } \right) (s {{\bar{L}}}_{01}^2 + \sqrt{s} {{\bar{L}}}_{01} {{\bar{L}}}_0)\). Then, for all \(x\in {\mathcal {G}}^{\mathrm {far}}\), since by adjusting \(C_0\) we can have in particular \(\left\| D_{{\mathfrak {g}}}e \right\| \leqslant \frac{c_0}{\sqrt{s}} \leqslant c_0 \leqslant \frac{1}{128} \min \left( \frac{{\bar{\varepsilon }}_2}{B_2} , \frac{{\bar{\varepsilon }}_0}{ B_0} \right) \), we have
Similarly, by Lemmas 3 and 4, with probability \(1-\rho \) we have for all \(x \in {\mathcal {G}}^{\mathrm {near}}\) and \(q \in {\mathbb {C}}^{s(d+1)}\),
where \(g_\omega (v) {\mathop {=}\limits ^{\text {def.}}}\mathrm {D}_{2}\left[ \varphi _\omega \right] (x)[v,v]\). By Lemma 11 and a union bound, for all \(x\in {\mathcal {G}}^{\mathrm {near}}\),
if \( m \gtrsim \frac{s B_{22} {{\bar{L}}}_{01}^2 + \sqrt{s} {{\bar{L}}}_{01} {{\bar{L}}}_2 B_2}{B_2^2} \left( \log \left( \frac{\left|{\mathcal {G}}^{\mathrm {near}}\right|}{\rho } \right) + d \log \left( \frac{s {{\bar{L}}}_{01} {{\bar{L}}}_{2}}{B_2} \right) \right) \). Using this property with \(q {\mathop {=}\limits ^{\text {def.}}}D_{{\mathfrak {g}}}^{-1} {\hat{\varUpsilon }}^{-1} e\) such that \(\left\| q \right\| \leqslant 4 c_0\), and by adjusting \(C_0\), we obtain: for all \(x\in {\mathcal {G}}^{\mathrm {near}}_j\),
which concludes the second step of our proof. By combining the bounds on m that we obtained with (59), after simplification we still obtain
with \(N'_0 = N_0 = \left|{\mathcal {G}}^{\mathrm {far}}\right|\) but \(N'_2 = \left|{\mathcal {G}}^{\mathrm {near}}\right| + (s{{\bar{L}}}_{01} {{\bar{L}}}_2/B_2)^d\).
6.5 Step 3: Bounding the Norm \(\left\| p \right\| \)
In this section, we upper bound \(\left\| p \right\| \) where \(\varPhi ^* p = {\hat{\eta }}\), for the \({\hat{\eta }}\) that we have constructed in the previous section. We recall that \(\varPhi ^* p = \frac{1}{\sqrt{m}} \sum _{k=1}^m p_k \varphi _{\omega _k}(\cdot )\), and
where \(p^{\mathrm {app}}{\mathop {=}\limits ^{\text {def.}}}(p_\ell )_{\ell =1}^J\in {\mathbb {C}}^m\) and \(p_\ell {\mathop {=}\limits ^{\text {def.}}}\frac{\sqrt{m}}{m_\ell } \left( \gamma (\omega _k)^* \varUpsilon ^{-1} q_{\ell -1} \right) _{k\in {\mathcal {B}}_\ell } \in {\mathbb {C}}^{m_\ell }\). So, \(\left\| p^{\mathrm {app}} \right\| ^2 = \sum _{\ell =1}^J\left\| p_\ell \right\| ^2_2\). To upper bound this, for each \(\ell =1,\ldots , J\),
where we have used \(\left\| D_{{\mathfrak {g}}}^{-1} \varUpsilon ^{-1} D_{{\mathfrak {g}}}^{-1} \right\| \leqslant 2\) by Lemma 3, \(\left\| \cdot \right\| \leqslant \sqrt{2s}\left\| \cdot \right\| _{\mathrm {Block}}\), and the preceding computation for \(\left\| D_{{\mathfrak {g}}}q_\ell \right\| _{\mathrm {Block}}\). For \(\ell =1,2\), \( \frac{m}{m_\ell } = {\mathcal {O}}(1) \) and \(\frac{m}{m_3} = {\mathcal {O}}( \log (s))\). Also, for \(\ell \geqslant 3\),
Therefore,
On the other hand, \(\eta ^{\mathrm {e}}= \varPhi ^* p^\mathrm {e}\) where \(p^\mathrm {e}=\left( \gamma (\omega _k)^* \varUpsilon ^{-1} e \right) _{k=1}^m\). So,
Therefore, \({\hat{\eta }}= \varPhi ^* p \) with \(\left\| p \right\| ^2 \lesssim s\).
6.6 Step 4: Non-degeneracy on the Entire Domain
We conclude by showing that the \({\hat{\eta }}\) constructed in the previous sections is indeed non-degenerate on the entire domain. For this, we simply need to control the Lipschitz constants of \({\hat{\eta }}\) and its Hessian, which are in fact directly related to \(\left\| p \right\| \). Let \(x \in {\mathcal {X}}^{\mathrm {far}}\) be arbitrary, and let \(x'\in {\mathcal {G}}^{\mathrm {far}}\) be the grid point closest to it. Under \({{{\bar{E}}}}\), we have
Hence, we prove the first part of (50) by choosing \({\mathcal {G}}^{\mathrm {far}}\) such that \({\mathfrak {d}}_{{\mathfrak {g}}}(x,x') \leqslant \frac{{\bar{\varepsilon }}_0}{16{{\bar{L}}}_{1} \left\| p \right\| }\), which results in
for an appropriate constant C.
Now, for any \(x \in {\mathcal {X}}^{\mathrm {near}}_j\), and \(x'\in {\mathcal {G}}^{\mathrm {near}}_j\) closest to it, we write
We bound each of these terms. For the first, under \({{{\bar{E}}}}\) we have
For the second term in (63), we have
from what we have proved in the previous section.
Finally, for the third term in (63) we naturally introduce \(K_{{{\bar{E}}}}^{(ij)}\), defined as \(K^{(ij)}\) in (25) but with \({\mathbb {E}}\) replaced by the conditional expectation \({\mathbb {E}}_{{{\bar{E}}}}\). By Lemma 4, the deviation between \(K^{(02)}\) and \(K_{{{\bar{E}}}}^{(02)}\) can be bounded by
where \(u_j\) is the jth canonical vector of \({\mathbb {C}}^{s(d+1)}\). Moreover, by Assumption 2 it is easy to see that
Hence, by the triangle inequality we have
Therefore, (63) becomes
We prove the desired property on \(\mathrm {D}_{2}\left[ {\hat{\eta }}\right] \) by choosing \({\mathfrak {d}}_{{\mathfrak {g}}}(x,x') \leqslant \frac{{\bar{\varepsilon }}_2}{256{{\bar{L}}}_3({{\bar{L}}}_0 + \left\| p \right\| )}\), which yields
for an appropriate constant C. Gathering everything with (62), we finally obtain
with \({\bar{N}}_0 = \frac{s {\mathcal {R}}_{\mathcal {X}}{{\bar{L}}}_1}{{\bar{\varepsilon }}_0}\), \({\bar{N}}_2 = \frac{s ({r_{\mathrm {near}}}{{\bar{L}}}_0{{\bar{L}}}_3 + {{\bar{L}}}_2)}{{\bar{\varepsilon }}_2}\).
6.7 Step 5: Additional Certificates
Non-degeneracy of \({\hat{\eta }}\) directly allows us to apply Proposition 1 to deduce stability away from the sparse support in the reconstructed measure. In order to apply Proposition 2, we need to construct \(s\) additional certificates \(\eta _j\). These are, however, “simpler” to construct, since they interpolate a “sign vector” with only one nonzero coordinate and do not require the golfing scheme.
For each \(j=1,\ldots ,s\), let \(u_j\) be the vector of length \(s(d+1)\) whose jth entry is one and all other entries are zero. Define the functions
and
By Theorem 2, \(\eta _j^+\) and \(\eta _j^-\) are non-degenerate (limit) dual certificates with respect to signs \(1_s\) and \(-1_s + 2 u_j\), respectively, and \(\eta _j\) satisfies, for all \(\ell \ne j\):
Thus, using Lemma 2 to translate the last two conditions into quadratic decay, we conclude that \(\eta _j\) satisfies the conditions of Proposition 2.
To conclude, we will show that
does not deviate too much from \(\eta _j\) and satisfies the conditions of Proposition 2. Note that by construction, \({\hat{\eta }}_j(x_j) = 1\), \({\hat{\eta }}_j(x_\ell ) = 0\) for all \(\ell \ne j\), and \(\nabla {\hat{\eta }}_j(x_\ell ) = 0\) for all \(\ell \). It therefore remains to control the deviation of \({\hat{\eta }}_j\) from \(\eta _j\) on \({\mathcal {X}}^{\mathrm {far}}\) and \(\mathrm {D}_{2}\left[ {\hat{\eta }}_j\right] \) from \(\mathrm {D}_{2}\left[ \eta _j\right] \) on \({\mathcal {X}}^{\mathrm {near}}\).
Proposition 3
Under Assumptions 1 and 2, suppose that \(\min _{i\ne j}{\mathfrak {d}}_{{\mathfrak {g}}}(x_i,x_j)\geqslant \varDelta \). Then, with probability at least \(1-\rho \), for all \(j=1,\ldots , s\), there exists \({\hat{\eta }}_j = \varPhi ^* p_j\) where \(\left\| p_j \right\| \leqslant 4\) which satisfies, for all \(\ell \ne j\):
The proof controls the deviation between \({\hat{\eta }}_j\) and \(\eta _j\) on a fine grid using Bernstein’s concentration inequalities and extends the bound to the entire domain using Lipschitz properties of \({\hat{\eta }}_j\). As we mentioned above, the proof of this result is conceptually simpler than the deviation bounds on \(\eta ^{\mathrm {app}}\) since \(\left\| u_j \right\| = 1\). We therefore defer its proof to Appendix B.5. Using Lemma 2, we have therefore constructed the additional certificates to apply Proposition 2 and conclude the proof of Theorem 3.
7 Conclusion and Outlooks
In this paper, we have presented a unifying geometric view of the problem of recovering sparse measures from random measurements. This theoretical analysis highlights the key role played by the invariant Fisher metric in defining a precise notion of Rayleigh limit for possibly non-translation-invariant measurement kernels. We analyzed several examples, including Laplace measurements in imaging, and left other important examples, such as one-hidden-layer neural networks, partially open. Analyzing the super-resolution regime (going below the Rayleigh limit) requires stringent assumptions, such as positivity of the measures. Beyond the 1-D case, this is still mostly an open question, and we refer to [45] for some partial results.
Notes
Here, we write \({\hat{\eta }}\) to distinguish from the “limit” certificate \(\eta \) that we built in the case \(m\rightarrow \infty \).
References
Absil, P.A., Mahony, R., Sepulchre, R.: Optimization Algorithms on Matrix Manifolds (2014). https://doi.org/10.1515/9781400830244
Akkouchi, M.: On the convolution of exponential distributions. Journal of the Chungcheong Mathematical Society 21(4), 501–510 (2008)
Amari, S.-i., Nagaoka, H.: Methods of information geometry, vol. 191. American Mathematical Society (2007)
Azais, J.M., De Castro, Y., Gamboa, F.: Spike detection from inaccurate samplings. Applied and Computational Harmonic Analysis 38(2), 177–195 (2015)
Bach, F.: Breaking the curse of dimensionality with convex neural networks. Journal of Machine Learning Research 18(19), 1–53 (2017)
Bendory, T., Eldar, Y.C.: Recovery of sparse positive signals on the sphere from low resolution measurements. IEEE Signal Processing Letters 22(12), 2383–2386 (2015)
Beurling, A.: Sur les intégrales de Fourier absolument convergentes et leur application à une transformation fonctionnelle. In: Ninth Scandinavian Mathematical Congress, pp. 345–366 (1938)
Boyd, N., Schiebinger, G., Recht, B.: The alternating descent conditional gradient method for sparse inverse problems. SIAM Journal on Optimization 27(2), 616–639 (2017)
Bredies, K., Pikkarainen, H.K.: Inverse problems in spaces of measures. ESAIM Control, Optimisation and Calculus of Variations 19(1), 190–218 (2013)
Burger, M., Osher, S.: Convergence rates of convex variational regularization. Inverse problems 20(5), 1411 (2004)
Burges, C.J.: Geometry and invariance in kernel based methods. MIT Press (1999)
Caffarelli, L., McCann, R.J.: Free boundaries in optimal transport and Monge–Ampère obstacle problems. Annals of Mathematics 171(2), 673–730 (2010)
Campbell, L.L.: An extended Čencov characterization of the information metric. Proceedings of the American Mathematical Society 98(1), 135–141 (1986)
Candès, E.J., Fernandez-Granda, C.: Super-resolution from noisy data. Journal of Fourier Analysis and Applications 19(6), 1229–1254 (2013). https://doi.org/10.1007/s00041-013-9292-3
Candès, E.J., Fernandez-Granda, C.: Towards a mathematical theory of super-resolution. Communications on Pure and Applied Mathematics 67(6), 906–956 (2014). https://doi.org/10.1002/cpa.21455
Candes, E.J., Plan, Y.: A probabilistic and RIPless theory of compressed sensing. IEEE Transactions on Information Theory 57(11), 7235–7254 (2011)
Candès, E.J., Romberg, J., Tao, T.: Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information. IEEE Transactions on information theory 52(2), 489–509 (2006)
Cencov, N.N.: Statistical decision rules and optimal inference, vol. 53. American Mathematical Society (1982)
Chen, S.S., Donoho, D.L., Saunders, M.A.: Atomic decomposition by basis pursuit. SIAM review 43(1), 129–159 (2001)
Chizat, L., Bach, F.: On the global convergence of gradient descent for over-parameterized models using optimal transport. In: Advances in neural information processing systems, pp. 3036–3046 (2018)
Chizat, L., Peyré, G., Schmitzer, B., Vialard, F.X.: Unbalanced optimal transport: Dynamic and Kantorovich formulations. Journal of Functional Analysis 274(11), 3090–3123 (2018)
Cormode, G., Garofalakis, M., Haas, P.J., Jermaine, C.: Synopses for Massive Data: Samples, Histograms, Wavelets, Sketches. Foundations and Trends in Databases 4, 1–294 (2011). https://doi.org/10.1561/1900000004
Costa, S.I., Santos, S.A., Strapasson, J.E.: Fisher information distance: a geometrical reading. Discrete Applied Mathematics 197, 59–69 (2015)
Dasgupta, S., Gupta, A.: An Elementary Proof of a Theorem of Johnson and Lindenstrauss. Random Structures and Algorithms 22(1), 60–65 (2003). https://doi.org/10.1002/rsa.10073
De Castro, Y., Gamboa, F.: Exact reconstruction using Beurling minimal extrapolation. Journal of Mathematical Analysis and applications 395(1), 336–354 (2012)
De Castro, Y., Gamboa, F., Henrion, D., Lasserre, J.B.: Exact solutions to super resolution on semi-algebraic domains in higher dimensions. IEEE Transactions on Information Theory 63(1), 621–630 (2016)
Denoyelle, Q., Duval, V., Peyré, G.: Support recovery for sparse super-resolution of positive measures. Journal of Fourier Analysis and Applications 23(5), 1153–1194 (2017)
Denoyelle, Q., Duval, V., Peyre, G., Soubies, E.: The Sliding Frank-Wolfe Algorithm and its Application to Super-Resolution Microscopy. Inverse Problems (2019). https://doi.org/10.1088/1361-6420/ab2a29
Donoho, D.L.: Compressed sensing. IEEE Transactions on information theory 52(4), 1289–1306 (2006)
Duval, V., Peyré, G.: Exact support recovery for sparse spikes deconvolution. Foundations of Computational Mathematics 15(5), 1315–1355 (2015)
Duval, V., Peyré, G.: Sparse spikes super-resolution on thin grids I: the LASSO. Inverse Problems 33(5), 055008 (2017). https://doi.org/10.1088/1361-6420/aa5e12
Eftekhari, A., Tanner, J., Thompson, A., Toader, B., Tyagi, H.: Sparse non-negative super-resolution-simplified and stabilised. arXiv preprint arXiv:1804.01490 (2018)
Ekanadham, C., Tranchina, D., Simoncelli, E.P.: A unified framework and method for automatic neural spike identification. Journal of neuroscience methods 222, 47–55 (2014)
Facchi, P., Kulkarni, R., Man’ko, V.I., Marmo, G., Sudarshan, E.C., Ventriglia, F.: Classical and quantum Fisher information in the geometrical formulation of quantum mechanics. Physics Letters, Section A: General, Atomic and Solid State Physics 374(48), 4801–4803 (2010). https://doi.org/10.1016/j.physleta.2010.10.005
Fernandez-Granda, C.: Support detection in super-resolution. Proc. Proceedings of the 10th International Conference on Sampling Theory and Applications pp. 145–148 (2013)
Foucart, S., Rauhut, H.: A mathematical introduction to compressive sensing, vol. 1. Birkhäuser Basel (2013)
Gribonval, R., Blanchard, G., Keriven, N., Traonmilin, Y.: Compressive statistical learning with random feature moments. arXiv preprint arXiv:1706.07180 (2017)
Griffiths, D.: Introduction to Quantum Mechanics. Pearson Education, Inc. (2004). https://doi.org/10.1063/1.2958160
Gross, D.: Recovering low-rank matrices from few coefficients in any basis. IEEE Transactions on Information Theory 57(3), 1548–1566 (2011)
Kunis, S., Peter, T., Römer, T., von der Ohe, U.: A multivariate generalization of Prony’s method. Linear Algebra and its Applications 490, 31–47 (2016)
Liao, W., Fannjiang, A.: MUSIC for single-snapshot spectral estimation: Stability and super-resolution. Applied and Computational Harmonic Analysis 40(1), 33–67 (2016)
Liero, M., Mielke, A., Savaré, G.: Optimal entropy-transport problems and a new Hellinger–Kantorovich distance between positive measures. Inventiones Mathematicae 211(3), 969–1117 (2018)
Minsker, S.: On some extensions of Bernstein's inequality for self-adjoint operators. Statistics & Probability Letters 127, 111–119 (2017)
Poon, C., Keriven, N., Peyré, G.: Support localization and the fisher metric for off-the-grid sparse regularization. In: Proc. AISTATS’19 (2019). arXiv:1810.03340
Poon, C., Peyré, G.: Multi-dimensional sparse super-resolution. SIAM Journal on Mathematical Analysis 51(1), 1–44 (2019). https://doi.org/10.1137/17M1147822
Prony, G.: Essai expérimental et analytique: sur les lois de la dilatabilité de fluides élastique et sur celles de la force expansive de la vapeur de l’alkool, à différentes températures. J. de l’Ecole Polytechnique 1(22), 24–76 (1795)
Rao, C.R.: Information and the accuracy attainable in the estimation of statistical parameters. Bull. Calcutta Math. Soc. 37, 81–91 (1945)
Roy, R., Kailath, T.: ESPRIT-estimation of signal parameters via rotational invariance techniques. IEEE Transactions on acoustics, speech, and signal processing 37(7), 984–995 (1989)
Santambrogio, F.: Optimal transport for applied mathematicians. Birkhäuser, NY 55, 58–63 (2015)
Sauer, T.: Prony’s method in several variables. Numerische Mathematik 136(2), 411–438 (2017)
Schiebinger, G., Robeva, E., Recht, B.: Superresolution without separation. Information and Inference: A Journal of the IMA 7(1), 1–30 (2018)
Schmidt, R.: Multiple emitter location and signal parameter estimation. IEEE transactions on antennas and propagation 34(3), 276–280 (1986)
Soubies, E., Blanc-Féraud, L., Aubert, G.: A continuous exact \(\ell _0\) penalty (CEL0) for least squares regularized problem. SIAM Journal on Imaging Sciences 8(3), 1607–1639 (2015)
Tang, G., Bhaskar, B.N., Recht, B.: Sparse recovery over continuous dictionaries-just discretize. In: 2013 Asilomar Conference on Signals, Systems and Computers, pp. 1043–1047. IEEE (2013)
Tang, G., Bhaskar, B.N., Shah, P., Recht, B.: Compressed sensing off the grid. IEEE transactions on information theory 59(11), 7465–7490 (2013)
Tibshirani, R.: Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society. Series B (Methodological) pp. 267–288 (1996)
Tropp, J.A.: Greed is good: Algorithmic results for sparse approximation. IEEE Transactions on Information theory 50(10), 2231–2242 (2004)
Acknowledgements
The work of Gabriel Peyré was supported by the ERC project NORIA and by the French government under management of Agence Nationale de la Recherche as part of the “Investissements d’avenir” program, reference ANR19-P3IA-0001 (PRAIRIE 3IA Institute).
Additional information
Communicated by Thomas Strohmer.
Appendices
Preliminaries
In this appendix, we provide the proofs of some technical lemmas from the paper and collect some useful tools.
1.1 Linear Algebra Tools
We give the following simple lemma.
Lemma 5
For \(1\leqslant i,j \leqslant s\), take any scalars \(a_{ij} \in {\mathbb {C}}\), vectors \(Q_{ij}, R_{ij} \in {\mathbb {C}}^d\) and square matrices \(A_{ij} \in {\mathbb {C}}^{d\times d}\).
-
(i)
For \(q \in {\mathbb {C}}^{sd}\) and \(M \in {\mathbb {C}}^{sd \times sd}\), we have \(\left\| q \right\| _{\mathrm {block}} \leqslant \left\| q \right\| \leqslant \sqrt{s} \left\| q \right\| _{\mathrm {block}}\), and as a consequence \(\left\| M \right\| \leqslant \sqrt{s} \left\| M \right\| _{\mathrm {block}}\) and \(\left\| M \right\| _{\mathrm {block}} \leqslant \sqrt{s} \left\| M \right\| \). Similarly, for \(q \in {\mathbb {C}}^{s(d+1)}\) and \(M \in {\mathbb {C}}^{s(d+1) \times s(d+1)}\), we have \(\left\| q \right\| _{\mathrm {Block}} \leqslant \left\| q \right\| \leqslant \sqrt{2s} \left\| q \right\| _{\mathrm {Block}}\), and as a consequence \(\left\| M \right\| \leqslant \sqrt{2s} \left\| M \right\| _{\mathrm {Block}}\) and \(\left\| M \right\| _{\mathrm {Block}} \leqslant \sqrt{2s} \left\| M \right\| \).
-
(ii)
Let \(M \in {\mathbb {C}}^{sd\times sd}\) be a matrix formed by blocks:
$$\begin{aligned} M = \left( \begin{matrix} A_{11} &{} \ldots &{} A_{1s} \\ \vdots &{} \ddots &{} \vdots \\ A_{s1} &{} \ldots &{} A_{ss} \end{matrix} \right) \end{aligned}$$
Then, we have
$$\begin{aligned} \left\| M \right\| _{\mathrm {block}} = \sup _{\left\| x \right\| _{\mathrm {block}} = 1}\left\| Mx \right\| _{\mathrm {block}} \leqslant \max _{1\leqslant i\leqslant s} \sum _{j=1}^s \left\| A_{ij} \right\| \end{aligned}$$(68)
Now, let \(M \in {\mathbb {R}}^{sd\times s}\) be a rectangular matrix formed by stacking vectors \(Q_{ij} \in {\mathbb {R}}^{d}\):
$$\begin{aligned} M = \left( \begin{matrix} Q_{11} &{} \ldots &{} Q_{1s} \\ \vdots &{} \ddots &{} \vdots \\ Q_{s1} &{} \ldots &{} Q_{ss} \end{matrix} \right) \end{aligned}$$
Then,
$$\begin{aligned} \left\| M \right\| _{\infty \rightarrow \mathrm {block}} \leqslant \max _{1\leqslant i\leqslant s} \sum _{j=1}^s \left\| Q_{ij} \right\| _2, \quad \left\| M^\top \right\| _{\mathrm {block} \rightarrow \infty } \leqslant \max _{1\leqslant i\leqslant s} \sum _{j=1}^s \left\| Q_{ji} \right\| _2 \end{aligned}$$(69)
-
(iii)
Consider \(M \in {\mathbb {C}}^{s(d+1)\times s(d+1)}\) decomposed as
$$\begin{aligned} M = \left( \begin{matrix} a_{11} &{} \ldots &{} a_{1s} &{} Q^\top _{11} &{} \ldots &{} Q^\top _{1s} \\ \vdots &{} \ddots &{} \vdots &{}\vdots &{} \ddots &{} \vdots \\ a_{s1} &{} \ldots &{} a_{ss} &{} Q^\top _{s1} &{} \ldots &{} Q^\top _{ss} \\ R_{11} &{} \ldots &{} R_{1s} &{} A_{11} &{} \ldots &{} A_{1s} \\ \vdots &{} \ddots &{} \vdots &{}\vdots &{} \ddots &{} \vdots \\ R_{s1} &{} \ldots &{} R_{ss} &{} A_{s1} &{} \ldots &{} A_{ss} \end{matrix}\right) . \end{aligned}$$
Then,
$$\begin{aligned} \left\| M \right\| ^2&\leqslant \max _i \left( \sum _{j=1}^s \left|a_{ij}\right| + \left\| Q_{ij} \right\| \right) \cdot \max _j \left( \sum _{i=1}^s \left|a_{ij}\right| + \left\| Q_{ij} \right\| \right) \\&\quad + \max _i \left( \sum _{j=1}^s \left\| R_{ij} \right\| + \left\| A_{ij} \right\| \right) \cdot \max _j \left( \sum _{i=1}^s\left\| R_{ij} \right\| + \left\| A_{ij} \right\| \right) . \end{aligned}$$
and
$$\begin{aligned} \left\| M \right\| _{\mathrm {Block}}&\leqslant \max _i \{\sum _{j}{ \left|a_{ij}\right| + \left\| Q_{ij} \right\| }, \; \sum _{j}{ \left\| R_{ij} \right\| + \left\| A_{ij} \right\| }\} \end{aligned}$$
Proof
The proof is simple linear algebra.
-
(i)
This is immediate by writing the definitions.
-
(ii)
Let x be a vector with \(\left\| x \right\| _{\mathrm {block}} \leqslant 1\) decomposed into blocks \(x = [x_1,\ldots ,x_s]\) with \(x_i \in {\mathbb {C}}^d\), we have
$$\begin{aligned} \left\| Mx \right\| _{\mathrm {block}}&= \max _{1\leqslant i\leqslant s} \left\| \sum _{j=1}^s A_{ij} x_j \right\| \leqslant \max _{i}\sum _j \left\| A_{ij} \right\| \left\| x_j \right\| \leqslant \max _{i}\sum _j \left\| A_{ij} \right\| \end{aligned}$$
Similarly,
$$\begin{aligned} \left\| M^\top x \right\| _\infty = \max _{1\leqslant i\leqslant s} \left\| \sum _{j=1}^s Q_{ji}^\top x_j \right\| \leqslant \max _{i}\sum _j \left\| Q_{ji} \right\| \left\| x_j \right\| \leqslant \max _{i}\sum _j \left\| Q_{ji} \right\| \end{aligned}$$
Then, taking \(x\in {\mathbb {C}}^s\) such that \(\left\| x \right\| _\infty \leqslant 1\), we have
$$\begin{aligned} \left\| M x \right\| _{\mathrm {block}} = \max _{1\leqslant i\leqslant s} \left\| \sum _{j=1}^s x_j Q_{ij} \right\| \leqslant \max _{i}\sum _j \left\| Q_{ij} \right\| \end{aligned}$$
-
(iii)
Taking \(x=[x_1,\ldots ,x_s,X_1,\ldots ,X_s] \in {\mathbb {C}}^{s(d+1)}\) with \(\left\| x \right\| = 1\), we have
$$\begin{aligned} \left\| Mx \right\| ^2&= \sum _{i=1}^s \left( \sum _{j=1}^s a_{ij}x_j + Q_{ij}^\top X_j \right) ^2 + \left\| \sum _{j=1}^s R_{ij}x_j + A_{ij} X_j \right\| ^2 \\&\leqslant \sum _{i=1}^s \left( \sum _{j=1}^s \left|a_{ij}\right|x_j^2 + \left\| Q_{ij} \right\| \left\| X_j \right\| ^2 \right) \left( \sum _{j=1}^s \left|a_{ij}\right| + \left\| Q_{ij} \right\| \right) \\&\quad +\sum _{i=1}^s \left( \sum _{j=1}^s \left\| R_{ij} \right\| x_j^2 + \left\| A_{ij} \right\| \left\| X_j \right\| ^2 \right) \left( \sum _{j=1}^s \left\| R_{ij} \right\| + \left\| A_{ij} \right\| \right) \\&\leqslant \max _i \left( \sum _{j=1}^s \left|a_{ij}\right| + \left\| Q_{ij} \right\| \right) \cdot \max \left( \max _j \sum _{i=1}^s \left|a_{ij}\right|, \max _j\sum _{i=1}^s\left\| Q_{ij} \right\| \right) \left\| x \right\| ^2 \\&\quad + \max _i \left( \sum _{j=1}^s \left\| R_{ij} \right\| + \left\| A_{ij} \right\| \right) \cdot \max \left( \max _j \sum _{i=1}^s\left\| R_{ij} \right\| , \max _j \sum _{i=1}^s \left\| A_{ij} \right\| \right) \left\| x \right\| ^2. \end{aligned}$$
Now, if \(\left\| x \right\| _{\mathrm {Block}} = 1\), we have
$$\begin{aligned} \left\| Mx \right\| _{\mathrm {Block}}&= \max _i \left( \left|\sum _{j=1}^s a_{ij}x_j + Q_{ij}^\top X_j\right| , \; \left\| \sum _{j=1}^s R_{ij}x_j + A_{ij} X_j \right\| \right) \\&\leqslant \max _i \left( \sum _{j=1}^s \left|a_{ij}\right| +\left\| Q_{ij} \right\| ,\; \sum _{j=1}^s \left\| R_{ij} \right\| + \left\| A_{ij} \right\| \right) \end{aligned}$$
\(\square \)
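As an illustrative numerical check of item (i) and of a Frobenius-weakened version of the bound (68) in item (ii) (a sketch on random instances with arbitrary block sizes, not part of the proof; since \(\left\| A_{ij} \right\| \leqslant \left\| A_{ij} \right\| _F\), the Frobenius bound is weaker than (68) but easier to compute without linear-algebra libraries):

```python
import math
import random

random.seed(0)
s, d = 5, 3  # number of blocks and block dimension (illustrative values)

def euclid(v):
    return math.sqrt(sum(t * t for t in v))

def block_norm(q):
    """max over the s blocks of the Euclidean norm of each d-block."""
    return max(euclid(q[i * d:(i + 1) * d]) for i in range(s))

q = [random.gauss(0, 1) for _ in range(s * d)]
# block matrix M with d x d blocks A[i][j]
A = [[[[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]
      for _ in range(s)] for _ in range(s)]

def apply_M(x):
    """y_i = sum_j A[i][j] x_j, blockwise matrix-vector product."""
    return [sum(A[i][j][a][b] * x[j * d + b]
                for j in range(s) for b in range(d))
            for i in range(s) for a in range(d)]

def frob(B):
    return math.sqrt(sum(t * t for row in B for t in row))

# (i): ||q||_block <= ||q|| <= sqrt(s) ||q||_block
assert block_norm(q) <= euclid(q) <= math.sqrt(s) * block_norm(q)

# (ii), weakened via ||A_ij|| <= ||A_ij||_F:
# ||Mx||_block <= (max_i sum_j ||A_ij||_F) ||x||_block
bound = max(sum(frob(A[i][j]) for j in range(s)) for i in range(s))
for _ in range(100):
    x = [random.gauss(0, 1) for _ in range(s * d)]
    assert block_norm(apply_M(x)) <= bound * block_norm(x) + 1e-9
```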
1.2 Proof of Lemma 3
The proof is similar to that of Theorem 2.
-
(i)
We bound the spectral norm of \(\mathrm {Id}- D_{{\mathfrak {g}}}\varUpsilon D_{{\mathfrak {g}}}\). By Lemma 5,
$$\begin{aligned} \left\| (\mathrm {Id}- D_{{\mathfrak {g}}}\varUpsilon D_{{\mathfrak {g}}}) \right\| ^2&\leqslant \max _i \left( \sum _{\begin{array}{c} j=1\\ j\ne i \end{array}}^s \left|K(x_i,x_j)\right| + \sum _{j=1}^s\left\| K^{(10)}(x_i,x_j) \right\| _{x_i} \right) ^2 \\&\quad + \max _i \left( \sum _{\begin{array}{c} j=1\\ j\ne i \end{array}}^s \left\| K^{(10)}(x_j,x_i) \right\| _{x_j} + \sum _{j=1}^s \left\| K^{(11)}(x_i,x_j) \right\| _{x_i,x_j} \right) ^2 \leqslant 8 h^2 \end{aligned}$$
by assumption on the kernel widths. Hence, \(\varUpsilon \) is invertible. Similarly, by again applying Lemma 5, \(\left\| D_{{\mathfrak {g}}}\varUpsilon D_{{\mathfrak {g}}}- \mathrm {Id} \right\| _{\mathrm {Block}} \leqslant 2h\).
-
(ii)
Let \(x\in {\mathcal {X}}^{\mathrm {far}}\), then we have
$$\begin{aligned} \left\| D_{{\mathfrak {g}}}{\mathbf {f}}(x) \right\|&\leqslant \left( \sum _{i=1}^s \left|K(x_i,x)\right|^2 + \left\| K^{(10)}(x_i,x) \right\| _{x_i}^2 \right) ^\frac{1}{2} \leqslant B_{00} + B_{10} + 2h \leqslant B_0 \end{aligned}$$
where, similar to the proof above, we have used the fact that \(x\) is \(\varDelta /2\)-separated from at least \(s-1\) points \(x_i\). Similarly, for any vector \(q = [q_1,\ldots ,q_s,Q_1,\ldots ,Q_s]\in {\mathbb {C}}^{s(d+1)}\) and any \(x\in {\mathcal {X}}^{\mathrm {far}}\), we have
$$\begin{aligned} \left\| q^\top {\mathbf {f}}(x) \right\|&\leqslant \sum _{i=1}^s \left|q_i\right| \left|K(x_i,x)\right| + \left\| Q_i \right\| _{x_i}\left\| K^{(10)}(x_i,x) \right\| _{x_i} \\&\leqslant \left\| D_{{\mathfrak {g}}}^{-1} q \right\| _{\mathrm {Block}}\left( B_{00} +B_{10} + 2h \right) \leqslant B_0 \left\| D_{{\mathfrak {g}}}^{-1} q \right\| _{\mathrm {Block}}. \end{aligned}$$
-
(iii)
For any \(x \in {\mathcal {X}}^{\mathrm {near}}\), we have the bound:
$$\begin{aligned} \left\| \mathrm {D}_{2}\left[ q^\top {\mathbf {f}}\right] (x) \right\| _x&= \left\| \sum _{i=1}^s q_i K^{(02)}(x_i,x) + [Q_i]K^{(12)}(x_i,x) \right\| _x \\&\leqslant \left\| D_{{\mathfrak {g}}}^{-1} q \right\| \left( \sum _{i=1}^s \left\| K^{(02)}(x_i,x) \right\| _x^2 + \left\| K^{(12)}(x_i,x) \right\| _{x_i,x}^2 \right) ^\frac{1}{2} \leqslant \left\| D_{{\mathfrak {g}}}^{-1} q \right\| B_2 \end{aligned}$$
and
$$\begin{aligned} \left\| \mathrm {D}_{2}\left[ q^\top {\mathbf {f}}\right] (x) \right\| _x&= \left\| \sum _{i=1}^s q_i K^{(02)}(x_i,x) + [Q_i]K^{(12)}(x_i,x) \right\| _x \\&\leqslant \left\| D_{{\mathfrak {g}}}^{-1} q \right\| _{\mathrm {Block}}\left( \sum _{i=1}^s \left\| K^{(02)}(x_i,x) \right\| _x + \left\| K^{(12)}(x_i,x) \right\| _{x_i,x} \right) \\&\leqslant \left\| D_{{\mathfrak {g}}}^{-1} q \right\| _{\mathrm {Block}}B_2 \end{aligned}$$
which concludes the proof.
1.3 Proof of Lemma 4
First note that, for any empirical average \(X = m^{-1} \sum _{k=1}^m f(\omega _k)\), since the \(\omega _k\) are i.i.d., we have \({\mathbb {E}}_{{{\bar{E}}}}[X] = {\mathbb {E}}_{E_\omega }[f(\omega )]\), and therefore \({\mathbf {f}}_{{{\bar{E}}}}= {\mathbb {E}}_{E_\omega } [\gamma (\omega ) \gamma (\omega )^*]\), and so on.
We now prove a general bound that we then implement for each item. Let \(A = A_\omega \) be a random matrix that depends on \(\omega \), such that \(\left\| {\mathbb {E}}[A] \right\| \leqslant B\) and \(\left\| A \right\| \leqslant L(\omega )\), for any matrix norm \(\left\| \cdot \right\| \). We have
by Bayes’ rule, and therefore,
Then, if we let \(E_{\omega ,q}\) be the event that \(L_q(\omega ) \leqslant {{\bar{L}}}_q\), so \(E_\omega = \cap _{q=0}^3 E_{\omega ,q}\), by the union bound we get \({\mathbb {P}}(E_{\omega }^c) \leqslant \sum _q {\mathbb {P}}(E_{\omega ,q}^c) \leqslant \sum _q F_q({{\bar{L}}}_q) \leqslant \frac{\min ({\bar{\varepsilon }}_0,{\bar{\varepsilon }}_2)}{m\max _j({{\bar{L}}}_j^2)} \leqslant \frac{1}{2}\), and in particular \({\mathbb {P}}(E_\omega ) \geqslant \frac{1}{2}\). In the following, \(L(\omega )\) will be a sum of some of the \(L_q(\omega )^2\), so we bound \({\mathbb {E}}[ L_q(\omega )^2 1_{E_\omega ^c}] \leqslant \sum _j {\mathbb {E}}[ L_q(\omega )^2 1_{E_{\omega ,j}^c}]\) and we have
where we have bounded \({\mathbb {P}}\left( (L_q(\omega )^2 \geqslant t) \cap (L_j(\omega ) \geqslant {{\bar{L}}}_j) \right) \) by, respectively, \({\mathbb {P}}(L_j(\omega ) \geqslant {{\bar{L}}}_j) \leqslant F_j({{\bar{L}}}_j)\) in the first term and by \({\mathbb {P}}(L_q(\omega )^2 \geqslant t) \leqslant F_q(\sqrt{t})\) in the second term. Hence, by Assumption 2 we have
We can now obtain the desired results by combining (70) and (71) each time:
-
(i)
we let \(A =D_{{\mathfrak {g}}}\gamma (\omega ) \gamma (\omega )^* D_{{\mathfrak {g}}}\). We have \(\left\| {\mathbb {E}}[A] \right\| \leqslant 2\) by Lemma 3, and \(\left\| A \right\| \leqslant s L_{01}^2(\omega )\). When applied with the norm \(\left\| \cdot \right\| _{\mathrm {Block}}\), we get \(\left\| {\mathbb {E}}[A] \right\| _{\mathrm {Block}} \leqslant 2\), and \(\left\| A \right\| _{\mathrm {Block}} \leqslant 2s L_{01}^2(\omega )\) by Lemma 5 (iii).
-
(ii)
we let \(A = D_{{\mathfrak {g}}}\gamma (\omega ) \varphi _\omega (x)\). We have \(\left\| {\mathbb {E}}[A] \right\| \leqslant B_0\) by Lemma 3, and \(\left\| A \right\| \leqslant \sqrt{s} L_{01}(\omega ) L_0(\omega ) \leqslant \frac{1}{2}\sqrt{s} (L_{01}(\omega )^2+ L_0(\omega )^2)\).
-
(iii)
we let \(A = ({\tilde{\gamma }}(\omega )^\top q) {{\mathfrak {g}}}_x^{-\frac{1}{2}}(H \varphi _\omega )(x) {{\mathfrak {g}}}_x^{-\frac{1}{2}}\). We have \(\left\| {\mathbb {E}}[A] \right\| \leqslant B_2 \left\| q \right\| \) by Lemma 3, and \(\left\| A \right\| \leqslant \sqrt{s} L_{01}(\omega ) L_2(\omega )\).
Concentration Bounds
In this section, we detail the various Bernstein concentration inequalities used in the golfing scheme. More precisely, we present probabilistic bounds on the deviation of \(\hat{{\mathbf {f}}}\) and \({\hat{\varUpsilon }}\) from their deterministic counterparts \({\mathbf {f}}\) and \(\varUpsilon \), conditional on the event \({{{\bar{E}}}}\) (recall their definitions in (30), (54) and (55)). Define the shorthands
Observe that conditional on \({{{\bar{E}}}}\), we have
This whole section operates under the assumptions of Theorem 3, and we will repeatedly use the following consequences of Lemmas 3 and 4:
1.1 Elementary Concentration Inequalities
We first recall some elementary concentration inequalities.
Lemma 6
(Matrix Bernstein for complex matrices) Let \(Y_1,\ldots , Y_M\) be a sequence of independent \(d_1\times d_2\) complex random matrices with \({\mathbb {E}}[Y_\ell ] = 0\), \(\left\| Y_\ell \right\| _{2\rightarrow 2} \leqslant K\) for all \(\ell =1,\ldots , M\) and set
Then,
Lemma 7
(Vector Bernstein for complex vectors [43]) Let \(Y_1,\ldots , Y_M\in {\mathbb {C}}^d\) be a sequence of independent random vectors such that \({\mathbb {E}}[Y_i] = 0\), \(\left\| Y_i \right\| _2\leqslant K\) for \(i=1,\ldots , M\) and set
Then, for all \(t\geqslant (2K + 6\sigma )/M\),
1.2 Deviation Between \({\mathbf {f}}_{{{\bar{E}}}}\) and \(\hat{{\mathbf {f}}}\)
Lemma 8
(Bound against a fixed vector) Let \(q \in {\mathbb {C}}^{s(d+1)}\) and \(x \in {\mathcal {X}}\). For all \(u>0\), we have
As a corollary,
Proof
Assume \(\left\| q \right\| _2 = 1\) without loss of generality. We apply the classical (scalar) Bernstein inequality. By defining \(Y_k{\mathop {=}\limits ^{\text {def.}}}\varphi _{\omega _k}(x) \gamma (\omega _k)^* D_{{\mathfrak {g}}}q - {\mathbb {E}}_E[\varphi _{\omega }(x) \gamma (\omega )^* D_{{\mathfrak {g}}}q]\), we have \((\hat{{\mathbf {f}}}(x)-{\mathbf {f}}_{{{\bar{E}}}}(x))^\top D_{{\mathfrak {g}}}q = \frac{1}{m} \sum _{k=1}^m Y_k\). To apply Bernstein’s inequality, observe that for each \(k=1,\ldots , m\), \({\mathbb {E}}_{{{\bar{E}}}}[Y_k] = 0\), and conditional on event \({{{\bar{E}}}}\), we have \(\left|Y_k\right| \leqslant 2\sqrt{s} {{\bar{L}}}_{01} {{\bar{L}}}_0\) and \({\mathbb {E}}_E \left|Y_k\right|^2 = {\mathbb {E}}_E \left|\varphi _{\omega _k}(x)\right|^2 \left|\gamma (\omega _k)^* D_{{\mathfrak {g}}}q\right|^2 \leqslant {{\bar{L}}}_0^2 \left\| D_{{\mathfrak {g}}}\varUpsilon _{{{\bar{E}}}}D_{{\mathfrak {g}}} \right\| \leqslant 2 {{\bar{L}}}_0^2\) by (73). Therefore,
The last inequality follows because \(\left\| q \right\| _{\mathrm {Block}} \geqslant \left\| q \right\| _2/\sqrt{2s}\). \(\square \)
Lemma 9
(Uniform bound) Fix \(x\in {\mathcal {X}}\). For all \(u>\frac{4\sqrt{s}{{\bar{L}}}_{01} {{\bar{L}}}_0}{m} + \frac{6\sqrt{s}{{\bar{L}}}_{01}}{\sqrt{m}}\) we have
Proof
We apply the vector Bernstein inequality (Lemma 7). By defining \(Y_k{\mathop {=}\limits ^{\text {def.}}}D_{{\mathfrak {g}}}\overline{\gamma (\omega _k)} \varphi _{\omega _k}(x) - {\mathbb {E}}_E[ D_{{\mathfrak {g}}}\overline{\gamma (\omega _k)} \varphi _{\omega _k}(x) ]\), we have \(D_{{\mathfrak {g}}}(\hat{{\mathbf {f}}}(x)-{\mathbf {f}}_{{{\bar{E}}}}(x)) = \frac{1}{m} \sum _{k=1}^m Y_k\). Observe that for each \(k=1,\ldots , m\), \({\mathbb {E}}_{{{\bar{E}}}}[Y_k] = 0\), and conditional on event \({{{\bar{E}}}}\), we have \(\left\| Y_k \right\| \leqslant 2\sqrt{s} {{\bar{L}}}_{01} {{\bar{L}}}_0\) and \({\mathbb {E}}_E \left\| Y_k \right\| ^2 = {\mathbb {E}}_E \left|\varphi _{\omega _k}(x)\right|^2 \left\| D_{{\mathfrak {g}}}\gamma (\omega _k) \right\| ^2 \leqslant s{{\bar{L}}}_{01}^2\). Therefore, for all \(u \geqslant \frac{4\sqrt{s}{{\bar{L}}}_{01} {{\bar{L}}}_0}{m} + \frac{6\sqrt{s}{{\bar{L}}}_{01}}{\sqrt{m}}\),
The last inequality follows because \(\left\| q \right\| _{\mathrm {Block}} \geqslant \left\| q \right\| _2/\sqrt{2s}\). \(\square \)
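To build intuition for the \({\mathcal {O}}(\sigma /\sqrt{m})\) rates appearing in these Bernstein bounds, here is a small Monte Carlo sketch. The distribution below is an arbitrary toy choice (uniform coordinates, not the actual features \(\gamma (\omega )\)): the norm of an empirical average of centered, bounded i.i.d. random vectors shrinks at the expected rate when the number of samples grows.

```python
import numpy as np

# Toy illustration of the vector Bernstein rate: the norm of an empirical
# average of centered, bounded i.i.d. random vectors decays like 1/sqrt(M).
rng = np.random.default_rng(0)
d = 8

def avg_deviation(M, trials=200):
    devs = []
    for _ in range(trials):
        # uniform on [-1,1]^d: centered, bounded (||Y_i|| <= sqrt(d))
        Y = rng.uniform(-1.0, 1.0, size=(M, d))
        devs.append(np.linalg.norm(Y.mean(axis=0)))
    return float(np.mean(devs))

dev_100 = avg_deviation(100)
dev_10000 = avg_deviation(10000)
# increasing M by a factor 100 should shrink the deviation by roughly 10
assert dev_10000 < dev_100 / 5
```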
1.3 Deviation Between \(\mathrm {D}_{2}\left[ {\mathbf {f}}_{{{\bar{E}}}}\right] \) and \(\mathrm {D}_{2}\left[ \hat{{\mathbf {f}}}\right] \)
Lemma 10
(Bound against a fixed vector) Let \(q \in {\mathbb {C}}^{s(d+1)}\) and \(x \in {\mathcal {X}}\). For all \(u>0\) we have
As a corollary,
Proof
Assume \(\left\| q \right\| =1\) without loss of generality. Recalling the definitions of Sect. 4.1, we have
We now apply Lemma 6. Define
which are indeed symmetric matrices. We have \({\mathbb {E}}_E Y_k = 0\) and conditional on event E,
Furthermore, defining \(A = (q^\top D_{{\mathfrak {g}}}\gamma (\omega _k)) {{\mathfrak {g}}}_x^{-\frac{1}{2}} H \left( \varphi _{\omega _k} \right) (x) {{\mathfrak {g}}}_x^{-\frac{1}{2}} \) (which is symmetric), we have
and thus \(\left\| {\mathbb {E}}_{{{\bar{E}}}}[Y_j Y_j^*] \right\| \leqslant 2{{\bar{L}}}_2^2\). Therefore, the matrix Bernstein inequality (Lemma 6) yields
The last inequality follows because \(\left\| q \right\| _{\mathrm {Block}} \geqslant \left\| q \right\| _2/\sqrt{2s}\). \(\square \)
Lemma 11
(Uniform bound) Let \(x \in {\mathcal {X}}\). Let \({\mathcal {B}}_x {\mathop {=}\limits ^{\text {def.}}} \left\{ v \in {\mathbb {C}}^d \;;\; \left\| v \right\| _x \leqslant 1 \right\} \) and given \(v\in {\mathcal {B}}_x\), let \(g_\omega (v) {\mathop {=}\limits ^{\text {def.}}}\mathrm {D}_{2}\left[ \varphi _\omega \right] (x)[v,v] \in {\mathbb {C}}\). Then, for all \(u>\frac{4\sqrt{s} {{\bar{L}}}_{01} {{\bar{L}}}_2}{m} + \frac{6\sqrt{2} {{\bar{L}}}_2^2}{\sqrt{m}}\),
for some constant C.
Proof
We use a covering net strategy: let \({\mathcal {V}}=\{ v_1,\ldots , v_N\}\) be a covering \(\varepsilon \)-net of \({\mathcal {B}}_x \), for some \(\varepsilon >0\) that we will adjust later. Fix \(v \in {\mathcal {V}}\), and define the centered i.i.d. variables \(Y_k {\mathop {=}\limits ^{\text {def.}}}D_{{\mathfrak {g}}}\overline{\gamma (\omega _k)} g_{\omega _k}(v) - {\mathbb {E}}_{{{\bar{E}}}}D_{{\mathfrak {g}}}\overline{\gamma (\omega _k)} g_{\omega _k}(v) \in {\mathbb {C}}^{s(d+1)}\). We have \({\mathbb {E}}_{{{\bar{E}}}}Y_k = 0\), \(\left\| Y_k \right\| \leqslant 2 \sqrt{s} {{\bar{L}}}_{01}{{\bar{L}}}_2\) and \({\mathbb {E}}_{{{\bar{E}}}}\left\| Y_k \right\| ^2 \leqslant {\mathbb {E}}_{{{\bar{E}}}}\left|g_{\omega }(v)\right|^2 \left\| D_{{\mathfrak {g}}}\gamma (\omega ) \right\| ^2 \leqslant s {{\bar{L}}}_{01}^2 B_{22}\). Hence, applying Lemma 7: for all \(u \geqslant \frac{4\sqrt{s} {{\bar{L}}}_{01} {{\bar{L}}}_2}{m} + \frac{6\sqrt{B_{22}s} {{\bar{L}}}_{01} }{\sqrt{m}}\),
Next, we use the fact that for all \(\omega \)
Hence by choosing
and using a union bound on \(\left|{\mathcal {V}}\right|\), we conclude the proof. \(\square \)
1.4 Deviation Between \(\varUpsilon _{{{\bar{E}}}}\) and \({\hat{\varUpsilon }}\)
Lemma 12
(Bound in spectral norm) For all \(u>0\), it holds that,
Proof
To bound this probability, we apply Lemma 6 with
\(Y_k {\mathop {=}\limits ^{\text {def.}}}(D_{{\mathfrak {g}}}\gamma (\omega _k))(D_{{\mathfrak {g}}}\gamma (\omega _k))^* - \varUpsilon _{{{\bar{E}}}}\). We have, conditional on event E:
Also,
So, \(\left\| {\mathbb {E}}[Y_k^* Y_k] \right\| = \left\| {\mathbb {E}}[Y_k Y_k^*] \right\| \leqslant s {{\bar{L}}}_{01}^2 \left\| D_{{\mathfrak {g}}}\varUpsilon _{{{\bar{E}}}}D_{{\mathfrak {g}}} \right\| \leqslant 2s {{\bar{L}}}_{01}^2\) by (73). By choosing \(K=2 s{{\bar{L}}}_{01}^2\) and \(\sigma ^2 = m s {{\bar{L}}}_{01}^2 \left\| D_{{\mathfrak {g}}}\varUpsilon _{{{\bar{E}}}}D_{{\mathfrak {g}}} \right\| \) in Lemma 6, we obtain
\(\square \)
Lemma 13
For \(i=1,\ldots , s\), let \(S_i = \{s+(i-1)d +1, \ldots , s+id\}\), and let \(q\in {\mathbb {C}}^{s(d+1)}\). Then, for all \(u\geqslant \frac{4\sqrt{s}{{\bar{L}}}_{01} {{\bar{L}}}_1}{m} + \frac{6\sqrt{2}{{\bar{L}}}_1}{\sqrt{m}}\),
As a corollary, for all \( u\geqslant \frac{4\sqrt{2}s{{\bar{L}}}_{01} {{\bar{L}}}_1}{m} + \frac{12\sqrt{s}{{\bar{L}}}_1}{\sqrt{m}} \), we have
Proof
Fix \(i\in \{1,\ldots , s\}\). Without loss of generality, assume that \(\left\| q \right\| _2 = 1\). The claim of this lemma follows by applying Lemma 7. Let
and observe that \((D_{{\mathfrak {g}}}({\hat{\varUpsilon }}- \varUpsilon _{{{\bar{E}}}})D_{{\mathfrak {g}}}q)_{S_i} = \frac{1}{m}\sum _k Y_k\). We apply Lemma 7. Observe that conditional on event \({{{\bar{E}}}}\), we have
and
by (73). Therefore, for all \(u\geqslant \frac{4\sqrt{s}{{\bar{L}}}_{01} {{\bar{L}}}_1}{m} + \frac{6\sqrt{2}{{\bar{L}}}_1}{\sqrt{m}}\),
The last inequality follows because \(\left\| q \right\| _{\mathrm {Block}} \geqslant \left\| q \right\| _2/\sqrt{2s}\). \(\square \)
Lemma 14
(Bound in block norm) Let \(q\in {\mathbb {C}}^{s(d+1)}\) be any vector. For all \( u\geqslant \frac{4\sqrt{2}s{{\bar{L}}}_{01} {{\bar{L}}}_1}{m} + \frac{12\sqrt{s}{{\bar{L}}}_1}{\sqrt{m}} \), we have
Proof
Let \(S_0 {\mathop {=}\limits ^{\text {def.}}}\{1,\ldots , s\}\) and \(S_j {\mathop {=}\limits ^{\text {def.}}}\{ s+(j-1)d+1, \ldots , s+jd\}\) for \(j=1,\ldots , s\). By the union bound
To bound the first sum, observe that for \(j=1,\ldots , s\), \(( D_{{\mathfrak {g}}}(\varUpsilon _{{{\bar{E}}}}- {\hat{\varUpsilon }}) D_{{\mathfrak {g}}}q)_{j} = (D_{{\mathfrak {g}}}({\mathbf {f}}_{{{\bar{E}}}}(x_j) - \hat{{\mathbf {f}}}(x_j)))^\top q\), and apply Lemma 8. The second sum can be bounded by applying Lemma 13.
\(\square \)
1.5 Proof of Proposition 3
We fix a particular \(j=1,\ldots ,s\), carry out the proof for \({\hat{\eta }}_j\), and then use a union bound to conclude. As before, it is enough to establish that \({\hat{\eta }}_j\) satisfies the properties of Proposition 3 with the required probability, conditional on event \({{{\bar{E}}}}\). We proceed in the same way as in the main proof of the golfing scheme: first we show that \({\hat{\eta }}_j\) satisfies the desired property on a finite grid, then we bound \(\left\| p_j \right\| \), and finally we use the latter to extend the non-degeneracy to the whole space. As mentioned in the paper, the first step is considerably simpler and more direct than the golfing scheme, since the “sign” vector \(u_j\) is of norm 1.
Deviation Bounds on a Grid Similar to our previous argument, we will bound the deviation between \({\hat{\eta }}_j\) and \(\eta _j\) on a finite grid \({\mathcal {G}}^{\mathrm {far}}\subset {\mathcal {X}}^{\mathrm {far}}\) whose precision we will later adjust, and between \(\mathrm {D}_{2}\left[ {\hat{\eta }}_j\right] \) and \(\mathrm {D}_{2}\left[ \eta _j\right] \) on \({\mathcal {G}}^{\mathrm {near}}\subset {\mathcal {X}}^{\mathrm {near}}\). We will show that
Let \({\hat{q}}_j {\mathop {=}\limits ^{\text {def.}}}D_{{\mathfrak {g}}}^{-1} {\hat{\varUpsilon }}^{-1} u_j\) and \( q_j {\mathop {=}\limits ^{\text {def.}}}D_{{\mathfrak {g}}}^{-1} \varUpsilon ^{-1} u_j\). Note that \(q_j\) is deterministic and \(\left\| q_j \right\| \leqslant 2\) for all j. Recall also that \(\eta _j = q_j^\top D_{{\mathfrak {g}}}{\mathbf {f}}(x)\) and \({\hat{\eta }}_j = {\hat{q}}_j^\top D_{{\mathfrak {g}}}\hat{{\mathbf {f}}}(x)\). For \(x\in {\mathcal {G}}^{\mathrm {far}}\),
where the last line is valid with probability \(1-\rho \) by Lemma 3 and (61). Similarly,
where again, the last line is valid with probability \(1-\rho \) by Lemma 3 and (61). Therefore, we simply have to show that with probability at least \(1-\rho \),
-
(i)
For \(j=1,\ldots ,s\), \(\left|q_j^\top D_{{\mathfrak {g}}}({\mathbf {f}}_{{{\bar{E}}}}(x) - \hat{{\mathbf {f}}}(x))\right| \leqslant {\bar{\varepsilon }}_0/32\) for all \(x\in {\mathcal {G}}^{\mathrm {far}}\).
-
(ii)
For \(j=1,\ldots , s\), \(\left\| \mathrm {D}_{2}\left[ q_j^\top D_{{\mathfrak {g}}}({\mathbf {f}}- \hat{{\mathbf {f}}})\right] (x) \right\| _x \leqslant {\bar{\varepsilon }}_2/64\) for all \(x\in {\mathcal {G}}^{\mathrm {near}}\).
-
(iii)
\( \left\| D_{{\mathfrak {g}}}\hat{{\mathbf {f}}}(x) \right\| \leqslant 2 B_0\) for all \(x\in {\mathcal {G}}^{\mathrm {far}}\).
-
(iv)
\( \sup _{\left\| v \right\| _x\leqslant 1}\left\| \frac{1}{m} \sum _{k=1}^mD_{{\mathfrak {g}}}\overline{\gamma (\omega _k)} \mathrm {D}_{2}\left[ \varphi _{\omega _k}\right] (x)[v,v] \right\| \leqslant 2 B_2\) for all \(x\in {\mathcal {G}}^{\mathrm {near}}\).
-
(v)
\( \left\| D_{{\mathfrak {g}}}( \varUpsilon - {\hat{\varUpsilon }})D_{{\mathfrak {g}}} \right\| \leqslant \min \left( \frac{{\bar{\varepsilon }}_0}{512 B_0}, \frac{{\bar{\varepsilon }}_2}{1024 B_2} \right) \).
By applying Lemma 4 and recalling our choice of m, (i) follows by Lemma 8, (ii) follows by Lemma 10, (iii) follows by Lemma 9, (iv) follows by Lemma 11, and (v) follows by Lemma 12.
Bound on \(p_j\) By the same computations as in Sect. 6.6, we have \({\hat{\eta }}_j(x) = ({\hat{\varUpsilon }}^{-1} u_j)^\top \hat{{\mathbf {f}}}(x) = \varPhi ^* p_j\) with \(p_j = \frac{1}{\sqrt{m}}\left( \gamma (\omega _i)^* {\hat{\varUpsilon }}^{-1} u_j \right) _{i=1}^m\). Therefore,
with probability \(1-\rho \), by (61).
Extension to the Whole Domain We proceed as in Sect. 6.5. By the same computations, we obtain: for any \(x \in {\mathcal {X}}^{\mathrm {far}}\) and \(x' \in {\mathcal {G}}^{\mathrm {far}}\),
and therefore we choose
For the second covariant derivative, as in Sect. 6.5 we get: for all \(x \in {\mathcal {X}}^{\mathrm {near}}_j\) and \(x' \in {\mathcal {G}}^{\mathrm {near}}_j\),
and similarly for \(\ell \ne j\), for all \(x \in {\mathcal {X}}^{\mathrm {near}}_\ell \) and \(x' \in {\mathcal {G}}^{\mathrm {near}}_\ell \),
and therefore we conclude by setting
The final bound on m is already implied by the one obtained previously in (65).
Application: Discrete Fourier Sampling
In this section, we consider the case of sampling Fourier coefficients as described in [15]. Let \(f\in {\mathbb {N}}\) and let \({\mathcal {X}}= {\mathbb {T}}^d\) be the d-dimensional torus. Let \(\varOmega {=} \left\{ \omega \in {\mathbb {Z}}^d \;;\; \left\| \omega \right\| _\infty \leqslant f \right\} \), \(\varphi _\omega (x) {\mathop {=}\limits ^{\text {def.}}}e^{\mathrm {i} 2\pi \omega ^\top x}\), and \(\varLambda (\omega ) {=} \prod _{j=1}^d g(\omega _j)\), where \(g(j) = \frac{1}{f} \sum _{k=\max (j-f,-f)}^{\min (j+f,f)}(1-\left|k/f\right|)(1-\left|(j-k)/f\right|)\).
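As a quick numerical sanity check of this definition (with \(f=128\), the regime used in the sampling bounds below), the weights \(g(j)\) are positive, even, and nonincreasing in \(\left|j\right|\), as expected of \(1/f\) times the discrete autocorrelation of the triangle weights \(1-\left|k/f\right|\):

```python
# Sanity check of the weight function g defining Lambda(omega):
# g(j) = (1/f) * sum_{k=max(j-f,-f)}^{min(j+f,f)} (1-|k/f|)(1-|(j-k)/f|),
# i.e. (1/f) times the autocorrelation of the triangle weights.
f = 128

def g(j, f=f):
    ks = range(max(j - f, -f), min(j + f, f) + 1)
    return sum((1 - abs(k / f)) * (1 - abs((j - k) / f)) for k in ks) / f

vals = [g(j) for j in range(-f, f + 1)]
assert all(v > 0 for v in vals)                               # positivity on Omega
assert all(abs(g(j) - g(-j)) < 1e-12 for j in range(f + 1))   # evenness
assert all(vals[i] <= vals[i + 1] + 1e-9 for i in range(f))   # unimodal, peak at j = 0
```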
The Kernel and Fisher Metric The associated kernel is the multivariate Jackson kernel \( K(x,x') = \prod _{i=1}^d \kappa (x_i-x_i'), \) where
with constant metric tensor
where \(C_f{\mathop {=}\limits ^{\text {def.}}}-\kappa ''(0) = \frac{\pi ^2}{3}f(f+4) \sim f^2\). Note that \(K^{(ij)} = \nabla _1^i \nabla _2^j K\) and \(\left\| K^{(ij)} \right\| _{x,x'}= C_f^{-(i+j)/2}\left\| \nabla _1^i \nabla _2^j K \right\| \). Moreover, since the metric is constant, we have \(\left\| \cdot \right\| _x = C_f^{\frac{1}{2}}\left\| \cdot \right\| \) for all x. The domain diameter is \({\mathcal {R}}_{\mathcal {X}}= C_f^{\frac{1}{2}} d^{\frac{1}{2}}\).
Sampling Bounds Suppose that \(f\geqslant 128\). The rest of this section consists of Lemmas which bound the parameters in Theorem 3: We show in Lemma 15 that by choosing \({r_{\mathrm {near}}}= \frac{1}{8\sqrt{2}}\), for all \({\mathfrak {d}}_{{\mathfrak {g}}}(x,x') \leqslant {r_{\mathrm {near}}}\), we can set \({\bar{\varepsilon }}_2 = (1-6{r_{\mathrm {near}}}^2)/(1-{r_{\mathrm {near}}}^2/(2-{r_{\mathrm {near}}}^2) - {r_{\mathrm {near}}}^2) \geqslant 0.941\). In Lemma 16, we show that for all \({\mathfrak {d}}_{{\mathfrak {g}}}(x,x') \geqslant {r_{\mathrm {near}}}\), \(\left|K(x,x') \right| \leqslant 1-1/(8^3\cdot 2)\), so we can set \({\bar{\varepsilon }}_0 {\mathop {=}\limits ^{\text {def.}}}0.00097\). Moreover, the uniform bounds given in Lemma 18 imply that
So, for \(h = {\mathcal {O}}(d^{-\frac{1}{2}})\), by Lemma 17, we have \( W(h,s) = {\mathcal {O}}(s^{\frac{1}{4}} d^{\frac{1}{2}})\); in the case \(d<4\), this can be replaced by \(W(h,s) = {\mathcal {O}}(2^d)\). Gradient bounds are computed in Sect. C.6.
To summarize, Theorem 3 is applicable with:
-
(i)
\(B_{00} = B_{02} = B_{12} = {\mathcal {O}}(1)\), \(B_{01} = {\mathcal {O}}(d^{\frac{1}{2}})\), \(B_{22} = {\mathcal {O}}(d)\) and \(C_{{\mathfrak {g}}}=0\).
-
(ii)
\({r_{\mathrm {near}}}= 1/(8\sqrt{2})\), \({\bar{\varepsilon }}_0 = 0.00097\), \({\bar{\varepsilon }}_2 = 0.941\).
-
(iii)
\(\varDelta = {\mathcal {O}}(d^{\frac{1}{2}} s_{\max }^{\frac{1}{4}})\).
-
(iv)
\({{\bar{L}}}_i = {\mathcal {O}}(d^{i/2})\).
and
1.1 Preliminaries: Properties of the Univariate Kernel
We first summarize some key properties of the univariate Jackson kernel \(\kappa \) for \(f\geqslant 128\), which were derived in [15].
From [15, Equations (2.20)-(2.24) and (2.29)], for all \(t \in [-1/2,1/2]\) and \(\ell =0,1,2,3\):
By [15, Lemma 2.6],
where \(H_0^\infty {\mathop {=}\limits ^{\text {def.}}}1\), \(H_1^\infty {\mathop {=}\limits ^{\text {def.}}}4\), \(H_2^\infty {\mathop {=}\limits ^{\text {def.}}}18\) and \(H_3^\infty {\mathop {=}\limits ^{\text {def.}}}77\), and \(H_\ell (t) {\mathop {=}\limits ^{\text {def.}}}\alpha ^4(t) \beta _\ell (t)\), with
and \(\beta _0(t) {\mathop {=}\limits ^{\text {def.}}}1\), \(\beta _1(t) {\mathop {=}\limits ^{\text {def.}}}2+2{\bar{\beta }}(t)\), \(\beta _2(t) {\mathop {=}\limits ^{\text {def.}}}4+7{\bar{\beta }}(t) + 6{\bar{\beta }}(t)^2\) and \(\beta _3(t) {\mathop {=}\limits ^{\text {def.}}}8+24{\bar{\beta }}(t) + 30{\bar{\beta }}(t)^2 + 15 {\bar{\beta }}(t)^3\). Let us first remark that \({\bar{\beta }}\) is decreasing on \(I {\mathop {=}\limits ^{\text {def.}}}[\frac{1}{2f}, \frac{\sqrt{2}}{\pi }]\), so \(\left|{\bar{\beta }}(t)\right| \leqslant \left|{\bar{\beta }}(1/(2f))\right| \approx 1.2733\), and \(\alpha (t) \leqslant \alpha (\sqrt{2}/\pi ) = \frac{3}{\pi }\) on I. Therefore, on I, \(H_0(t) \leqslant \frac{3}{\pi }\), \(H_1(t) \leqslant 3.79\), \(H_2(t) \leqslant 18.83\) and \(H_3(t) \leqslant 98.26\), and we can conclude that on \([\frac{1}{2f}, \frac{1}{2})\), we have
where \(H^\infty _0 = 1\), \(H^\infty _1 {\mathop {=}\limits ^{\text {def.}}}4\), \(H^\infty _2 {\mathop {=}\limits ^{\text {def.}}}19\), \(H^\infty _3 {\mathop {=}\limits ^{\text {def.}}}99\). Combining with (79), we have
where \(\kappa ^\infty _0 {\mathop {=}\limits ^{\text {def.}}}1\), \(\kappa ^\infty _2 {\mathop {=}\limits ^{\text {def.}}}C_f\),
Finally, given \(p\in (0,1)\),
Choosing \(p=\frac{1}{2}\) and using \((f+2)^2 =( \frac{3}{\pi ^2}C_f+ 4) \geqslant \frac{3}{\pi ^2} C_f\), we have
In the following sections, we will repeatedly make use of (79), (80) and (81).
1.2 Notation
For notational convenience, write \(t_i{\mathop {=}\limits ^{\text {def.}}}x_i-x_i'\), \(\kappa _i {\mathop {=}\limits ^{\text {def.}}}\kappa (t_i)\), \(\kappa _i' {\mathop {=}\limits ^{\text {def.}}}\kappa '(t_i)\), and so on. Let
With this, we have:
Where convenient, we sometimes write \(K(t) = K(x-x') {\mathop {=}\limits ^{\text {def.}}}K(x,x')\). Given a symmetric matrix M, we write \(\lambda _{\min }(M)\) to denote the smallest eigenvalue of M.
1.3 Bounds When \(\left\| t \right\| \) is Small
Lemma 15
Suppose that \(C_f\left\| x-x' \right\| _2^2 \leqslant c\) with \(c>0\) such that
Then, \(-\langle K^{(02)}(x-x') q,\,q\rangle \geqslant \varepsilon \left\| q \right\| _{x}^2\).
Proof
Let \(q\in {\mathbb {R}}^d\), and note that
We first consider \( \kappa _i'' K_i\): By applying (79), we obtain
and hence,
For the second term in (82), again, by applying (79), we obtain
Therefore, for \(\left\| q \right\| _x =1\), we have
\(\square \)
Lemma 16
Assume that \(\left\| t \right\| _2 \leqslant \frac{1}{8 \sqrt{C_f}}\). Then,
Consequently, for all
and all t such that \(\left\| t \right\| _2 \geqslant c\),
Proof
First note that by (79),
where
and note that \(g(u) \in (0, \tfrac{C_f}{2})\) for \(u\in (0,1/(8\sqrt{C_f}))\). So, writing \(t = (t_i)_{i=1}^d\) and \(g_j {\mathop {=}\limits ^{\text {def.}}}g(t_j)\), we have
where \({\mathcal {J}}_\ell {\mathop {=}\limits ^{\text {def.}}} \left\{ j\in {\mathbb {N}}^\ell \;;\; j\leqslant d, \text { all entries of } j \text { are distinct} \right\} \). Note that for odd integers \(\ell \),
since \(\left( 1- \frac{C_f}{2}\left\| t \right\| _2^2 \right) >0\). Also,
by assumption. So,
Finally, observe that the function
is positive and increasing on the interval \([0, \frac{1}{8\sqrt{2C_f} }]\). So, for t satisfying
we have \( \left|K(t)\right| \leqslant 1- q(c) \leqslant 1- \frac{C_f}{8} c^2 \). Finally, since \(\left|K(t)\right|\) decreases as \(\left\| t \right\| _2\) increases, we trivially have \( \left|K(t)\right| \leqslant 1- q(c ) \) for all t with \(\left\| t \right\| _2\geqslant c\).
\(\square \)
1.4 Bounds When \(\left\| t \right\| \) is Large
Lemma 17
Let \(i,j\in \{0,1,2\}\) with \(i+j\leqslant 3\). Let \({\bar{A}} \geqslant \sqrt{\tfrac{4\pi ^2}{3 }}\) and \(\left\| t \right\| _2 \geqslant {\bar{A}} \sqrt{d} s_{\max }^{1/4}/\sqrt{C_f}\). Then, we have \(\left\| K^{(ij)}(t) \right\| _{x,x'} \leqslant d^{\frac{i+j-4}{2}} ({\bar{A}}^4 s_{\max })^{-1}\).
To see that this implies that for \(h={\mathcal {O}}(d^{-\frac{1}{2}})\), \(W(h,s) = {\mathcal {O}}\left( s^{\frac{1}{4}} d^{\frac{1}{2}} \right) \), first note that if \(\min _{\ell \ne k}\left\| x_\ell - x_k \right\| _2 \geqslant {\bar{A}} \sqrt{d} s_{\max }^{1/4}/\sqrt{C_f}\), then
by choosing \({\bar{A}}\) to be a sufficiently large constant.
For the case of \(d<4\), note that if \(\left\| t \right\| _2 \geqslant {\bar{A}}/\sqrt{C_f}\), we have \(\left\| K^{(ij)}(t) \right\| _{x,x'} \leqslant d^{\frac{3}{2}} {\bar{A}}^{-4}\). If \(\min _{\ell \ne k}\left\| x_\ell - x_k \right\| _2 \geqslant {\bar{A}}/\sqrt{C_f}\), then for each \(n\in {\mathbb {N}}\), there are at most \({\mathcal {O}}(d 3^{d-1} (n+1)^{d-1})\) points for which \(\left\| x - x_1 \right\| \in C_f^{-\frac{1}{2}} [n {\bar{A}}, (n+1) {\bar{A}}]\). To see this, we simply need to upper bound the number of points P spaced \(\delta = {\bar{A}}/\sqrt{C_f}\) apart which can fit into the tube \(B_{(n+1)\delta }(0) \setminus B_{n \delta }(0)\). By comparing volumes, this corresponds to fitting balls of radius \(\delta /2\) into \(B_{(n+1)\delta +\delta /2}(0) \setminus B_{n \delta - \delta /2}(0)\),
So, \(P\leqslant \left( 2(n+1)+1 \right) ^d - \left( 2(n - 1) + 1 \right) ^d \leqslant 4 d \left( 2 n +3 \right) ^{d-1} \leqslant 4 d (n+1)^{d-1} 3^{d-1} \), where the second inequality is obtained by applying the mean value theorem to \(x \mapsto x^d\) over an increment of length 4. Therefore,
by choosing \({\bar{A}}\gtrsim 2^d\) and provided that \(d< 4\).
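The elementary counting step above can be checked numerically; the middle bound comes from the mean value theorem applied to \(x\mapsto x^d\) over an increment of length 4 (the resulting constant is absorbed in the \({\mathcal {O}}(\cdot )\)):

```python
# Numeric check of the packing-count inequality:
#   (2(n+1)+1)^d - (2(n-1)+1)^d <= 4*d*(2n+3)^(d-1)
# for small d and n (the factor 4 is the increment length in the MVT step).
for d in range(1, 6):
    for n in range(1, 20):
        lhs = (2 * (n + 1) + 1) ** d - (2 * (n - 1) + 1) ** d
        rhs = 4 * d * (2 * n + 3) ** (d - 1)
        assert lhs <= rhs, (d, n, lhs, rhs)
```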
Proof of Lemma 17
Write \(t = (t_j)_{j=1}^d\). To bound \(K(t) = \prod _{j=1}^d \kappa (t_j)\), we want to make use of the bounds on \(\kappa ^\infty _j\) from (81). We can do this for each \(t_j\) such that \(\left|t_j\right| \geqslant \sqrt{\tfrac{2\pi ^2}{3 C_f}}\). Note that there exists at least one such \(t_j\) since \(\left\| t \right\| _\infty \geqslant \left\| t \right\| _2/\sqrt{d} \geqslant {\bar{A}} s_{\max }^{1/4}/\sqrt{C_f} \geqslant \sqrt{\tfrac{2\pi ^2}{3 C_f}}\). If \(\{\left|t_j\right|\}_{j=1}^k \subset [0, \sqrt{\tfrac{2\pi ^2}{3 C_f}})\) for \(k\leqslant d-1\), then
which implies that \(\sum _{j=k+1}^d t_j^2 \geqslant \frac{1}{C_f}\left( {\bar{A}}^2 d s_{\max }^{1/2} - \frac{2\pi ^2(d-1)}{3} \right) \geqslant \frac{{\bar{A}}^2 d s_{\max }^{1/2}}{2C_f} \), by our assumptions on \({\bar{A}}\). Therefore, we may assume that we have some \(d\geqslant p\geqslant 1\) such that \(\{b_j\}_{j=1}^p \subseteq \{t_j\}\) with \(\left|b_j\right| \geqslant \sqrt{\tfrac{2\pi ^2}{3 C_f}}\) and \(\left\| b \right\| _2 \geqslant \frac{{\bar{A}} \sqrt{d} \root 4 \of { s_{\max }}}{\sqrt{2C_f} }\). Observe that
So, by applying the fact that \(\left|\kappa \right| \leqslant 1\), \(\kappa ^\infty _0 = 1\) and (81), we have
For \(\left|\kappa _i' K_i\right|\), if \(i\not \in \left\{ j \;;\; \left|t_j\right|> \sqrt{\tfrac{2\pi ^2}{3 C_f}} \right\} \), then
and otherwise, we have \( \left|\kappa _i' K_i\right| \leqslant \left|\kappa '(t_i)\right| \prod _{j\ne i} \left|\kappa (b_j)\right| \leqslant \frac{\kappa ^\infty _1}{\left( 1+ \frac{3}{4\pi ^2} {\bar{A}}^2 d \sqrt{s_{\max }} \right) ^2}. \) In a similar manner, writing \(V{\mathop {=}\limits ^{\text {def.}}}\left( 1+ \frac{3}{4\pi ^2} {\bar{A}}^2 d \sqrt{s_{\max }} \right) ^{-2}\), we can deduce that
Therefore,
Using the Gershgorin circle theorem, we have
and hence,
Note also that \(\left\| K^{(11)} \right\| _{x,x'} = \left\| K^{(02)} \right\| _{x'}\). Finally, since
we have
\(\square \)
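The Gershgorin step in the proof above can be illustrated on a generic symmetric matrix (this is the standard Gershgorin circle theorem, not the specific matrix \(K^{(02)}\)): every eigenvalue lies in a disk centered at a diagonal entry, with radius the corresponding off-diagonal absolute row sum, which in particular lower-bounds \(\lambda _{\min }\).

```python
import numpy as np

# Gershgorin circle theorem on a random symmetric matrix: each eigenvalue
# lies in some disk D(M[i,i], r_i) with r_i = sum_{j != i} |M[i,j]|.
rng = np.random.default_rng(1)
M = rng.standard_normal((6, 6))
M = (M + M.T) / 2                       # symmetrize
eigs = np.linalg.eigvalsh(M)
radii = np.sum(np.abs(M), axis=1) - np.abs(np.diag(M))
for lam in eigs:
    assert any(abs(lam - M[i, i]) <= radii[i] + 1e-10 for i in range(6))
# consequence used in the proof: lambda_min(M) >= min_i (M[i,i] - r_i)
assert eigs.min() >= (np.diag(M) - radii).min() - 1e-10
```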
1.5 Uniform Bounds
Lemma 18
If \({r_{\mathrm {near}}}\sim 1/\sqrt{C_f}\), then \(B_0={\mathcal {O}}(1)\), \(B_{01} = {\mathcal {O}}(\sqrt{d})\), \(B_{02} = B_{12} = {\mathcal {O}}(1)\) and \(B_{22} = {\mathcal {O}}(d)\).
Proof
We have \(\left|K\right| \leqslant 1\), and
so \(B_{01} = {\mathcal {O}}(\sqrt{d})\).
From (82), for all \(\left\| q \right\| =1\),
for \(\left\| t \right\| \lesssim 1/\sqrt{C_f}\). So, since \({r_{\mathrm {near}}}\leqslant 2/\sqrt{C_f}\), \(\left\| K^{(02)}(t) \right\| \leqslant 2{\mathop {=}\limits ^{\text {def.}}}B_{02}\). For the bound on \(B_{12}\):
for \(\left\| t \right\| \leqslant 1/C_f^{1/2}\).
\(\square \)
1.6 Gradient Bounds
The derivatives of the random features are uniformly bounded with
So, we can set \({{\bar{L}}}_i = {\mathcal {O}}(d^{i/2})\) for \(i=0,1,2\). For \({{\bar{L}}}_3\), the condition (44) is simply
so \({{\bar{L}}}_3 = {\mathcal {O}}(d^{3/2})\) by (84).
Application: Continuous Fourier Sampling with the Gaussian Kernel
In this section, we consider the case of continuous Fourier sampling with Gaussian frequencies, which appears, for instance, in sketched Gaussian mixture learning [37]. Let \({\mathcal {X}}\subset {\mathbb {R}}^d\) be any bounded subset of \({\mathbb {R}}^d\). Let \(\varOmega = {\mathbb {R}}^d\), \(\varphi _\omega (x) {\mathop {=}\limits ^{\text {def.}}}e^{\mathrm {i} \omega ^\top x}\), and \(\varLambda ={\mathcal {N}}(0,\varSigma ^{-1})\) for a known covariance matrix \(\varSigma \).
The Kernel and Fisher Metric The associated kernel is the Gaussian kernel
with constant metric tensor
Sampling Bounds The rest of this section consists of Lemmas which bound the parameters in Assumptions 1 and 2. We show that by choosing \({r_{\mathrm {near}}}= \frac{1}{\sqrt{2}}\), we obtain \({\bar{\varepsilon }}_2=\frac{1}{2} e^{-\frac{1}{4}}\) and \({\bar{\varepsilon }}_0 {\mathop {=}\limits ^{\text {def.}}}1-e^{-\frac{1}{4}}\). Moreover, Lemma 21 gives the uniform bounds \( B_{ij} = {\mathcal {O}}(1) \) and, for \(h = {\mathcal {O}}(1)\), \( W(h,s) = {\mathcal {O}}(\sqrt{\log s} + 1) \). Gradient bounds are computed in Sect. D.5.
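As a sanity check of this setup, the kernel \(K(x,x') = {\mathbb {E}}_{\omega \sim \varLambda }[\varphi _\omega (x)\overline{\varphi _\omega (x')}] = e^{-\frac{1}{2}\left\| x-x' \right\| _{\varSigma ^{-1}}^2}\) (the characteristic function of \({\mathcal {N}}(0,\varSigma ^{-1})\)) can be verified by Monte Carlo; the particular \(\varSigma \) and evaluation points below are arbitrary choices for the sketch:

```python
import numpy as np

# Monte Carlo check that Fourier features at Gaussian frequencies
# omega ~ N(0, Sigma^{-1}) reproduce the Gaussian kernel
#   K(x, x') = E[exp(i omega^T (x - x'))] = exp(-1/2 ||x - x'||^2_{Sigma^{-1}}).
rng = np.random.default_rng(0)
d, m = 2, 200_000
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
Sigma_inv = np.linalg.inv(Sigma)
omega = rng.multivariate_normal(np.zeros(d), Sigma_inv, size=m)

x, xp = np.array([0.3, -0.2]), np.array([-0.5, 0.1])
t = x - xp
empirical = np.mean(np.exp(1j * omega @ t)).real  # imaginary part averages to 0
exact = np.exp(-0.5 * t @ Sigma_inv @ t)
assert abs(empirical - exact) < 0.02
```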
1.1 Properties of the Kernel
Notation For simplicity, define \(t = x-x'\) and, by an abuse of notation, \(K_\varSigma (t) = \exp \left( -\frac{1}{2} \left\| t \right\| _{\varSigma ^{-1}}^2 \right) \) and, for \(u\in {\mathbb {R}}\), \(\kappa (u) = \exp \left( -\frac{1}{2} u^2 \right) \). Denote by \( \left\{ e_i \right\} \) the canonical basis of \({\mathbb {R}}^d\), and by \(f_i = \varSigma ^{-1} e_i\) the \(i\)th row of \(\varSigma ^{-1}\).
Gradients of the Kernel We have the following:
Then, we observe that for any \(q\geqslant 1\), the function \(f_q(r) = r^q e^{-\frac{1}{2} {r^2}}\) defined on \({\mathbb {R}}_+\) is increasing on \([0,\sqrt{q}]\) and decreasing afterwards, and its maximum value is \(f_q(\sqrt{q}) = \left( \frac{q}{e} \right) ^{q/2}\). Furthermore, it is easy to see that \(f_q(r) = r^q e^{-r^2/2} \leqslant \left( \frac{2q}{e} \right) ^{\frac{q}{2}} e^{-r^2/4}\), and therefore \(f_q(r) \leqslant \varepsilon \) whenever \(r \geqslant 2\left( \log \left( \frac{1}{\varepsilon } \right) + \frac{q}{2}\log \left( \frac{2q}{e} \right) \right) ^{\frac{1}{2}}\).
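Both properties of \(f_q\) (the maximum value \((q/e)^{q/2}\) at \(r=\sqrt{q}\), and the envelope \((2q/e)^{q/2}e^{-r^2/4}\)) can be confirmed on a grid:

```python
import numpy as np

# Check the two properties of f_q(r) = r^q * exp(-r^2/2) used above:
# (i)  max over r >= 0 is (q/e)^(q/2), attained at r = sqrt(q);
# (ii) f_q(r) <= (2q/e)^(q/2) * exp(-r^2/4) for all r >= 0.
r = np.linspace(0.0, 20.0, 200_001)
for q in [1, 2, 3, 4]:
    fq = r**q * np.exp(-r**2 / 2)
    assert abs(fq.max() - (q / np.e) ** (q / 2)) < 1e-6
    assert np.all(fq <= (2 * q / np.e) ** (q / 2) * np.exp(-r**2 / 4) + 1e-12)
```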
1.2 Bounds When \(\left\| t \right\| \) is Small
Lemma 19
For all \({\mathfrak {d}}_{{\mathfrak {g}}}(x,x') \leqslant {r_{\mathrm {near}}}{\mathop {=}\limits ^{\text {def.}}}\frac{1}{\sqrt{2}}\) and all \(v \in T_x {\mathcal {M}}\), we have \(-K^{(02)}(x,x')[v,v] \geqslant {\bar{\varepsilon }}_2 \left\| v \right\| _x^2\) where \({\bar{\varepsilon }}_2=\frac{1}{2} e^{-\frac{1}{4}}\).
Proof
From the derivations above and the Cauchy–Schwarz inequality, we have \(K^{(02)}(x,x')[v,v] = v^\top \nabla _2^2 K_\varSigma (t) v \leqslant (-1+{\mathfrak {d}}_{{\mathfrak {g}}}(x,x')^2)\kappa ({\mathfrak {d}}_{{\mathfrak {g}}}(x,x')) \left\| v \right\| ^2_x \leqslant ({r_{\mathrm {near}}}^2 -1)\kappa ({r_{\mathrm {near}}}) \left\| v \right\| _x^2\). \(\square \)
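Numerically, in the simplest case (\(\varSigma = \mathrm {Id}\), \(v\) aligned with \(t\), so that \(-K^{(02)}[v,v] = (1-u^2)\kappa (u)\left\| v \right\| _x^2\) with \(u = {\mathfrak {d}}_{{\mathfrak {g}}}(x,x')\)), the curvature indeed stays above \({\bar{\varepsilon }}_2 = \frac{1}{2}e^{-\frac{1}{4}}\) on \([0, {r_{\mathrm {near}}}]\):

```python
import numpy as np

# With kappa(u) = exp(-u^2/2), we have -kappa''(u) = (1 - u^2) * kappa(u);
# check that it stays above eps2_bar = 0.5*exp(-1/4) for all u <= 1/sqrt(2).
r_near = 1 / np.sqrt(2)
u = np.linspace(0.0, r_near, 10_001)
curv = (1 - u**2) * np.exp(-u**2 / 2)   # = -kappa''(u)
eps2_bar = 0.5 * np.exp(-0.25)
assert curv.min() >= eps2_bar - 1e-12   # minimum attained at u = r_near
```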
1.3 Bounds When \(\left\| t \right\| \) is Large
Lemma 20
For all \({\mathfrak {d}}_{{\mathfrak {g}}}(x,x') \geqslant {r_{\mathrm {near}}}\), we have \(\left|K(x,x') \right| \leqslant 1-{\bar{\varepsilon }}_0\), where \({\bar{\varepsilon }}_0 {\mathop {=}\limits ^{\text {def.}}}1-e^{-\frac{1}{4}}\), and for \(h= {\mathcal {O}}(1)\), \(W(h,s) = {\mathcal {O}}(\sqrt{\log s} + 1)\).
Proof
For the first inequality, we have \(\left|K\right| \leqslant \kappa ({r_{\mathrm {near}}}) = e^{-\frac{1}{4}} = 1- {\bar{\varepsilon }}_0\).
Then, from (27), the fact that the metric tensor is constant, and the expressions for the derivatives of the kernel above, it is immediate that
For \(K^{(12)}\), again since the metric tensor \({{\mathfrak {g}}}\) is constant, we observe that
and
Using \(\sum _i (\varSigma ^{\frac{1}{2}}q)_i f_i = \varSigma ^{-\frac{1}{2}} q\), we observe that
Hence, putting everything together,
Therefore, for \(h={\mathcal {O}}(1)\), using the properties of the functions \(f_q\) it is immediate that \(W(h,s) = {\mathcal {O}}(\sqrt{\log s} + 1)\). \(\square \)
1.4 Uniform Bounds
Lemma 21
For \((i,j) \in \{0,1,2\}\), we have \(B_{ij} = {\mathcal {O}}(1)\).
Proof
The bounds for \(i+j\leqslant 3\) are immediate using the identities in the proof of Lemma 20 and the properties of the functions \(f_q\).
By the same reasoning, we have
and we have
Hence,
and \(B_{22} = {\mathcal {O}}(1)\). \(\square \)
1.5 Gradient Bounds
For \(j \in \{0,1,2\}\), we have \(\mathrm {D}_{j}\left[ \varphi _\omega \right] (x)[q_1,\ldots ,q_j] = \left( \prod _i \omega ^\top q_i \right) \varphi _\omega (x)\) and therefore
And then, from (44), using \(\tau _{x\rightarrow x'} = \mathrm {Id}\),
Since \(\omega \sim {\mathcal {N}}(0,\varSigma ^{-1})\), \(\left\| \varSigma ^\frac{1}{2} \omega \right\| ^j\) is distributed as \(W^{\frac{j}{2}}\), where W is a \(\chi ^2\) variable with d degrees of freedom. Then, we use the following Chernoff bound [24]: for \(x\geqslant d\), we have
by using \(x^2e^{-\frac{x^2}{2}} \leqslant \frac{2}{e}\).
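The chi-square tail behavior can be checked by Monte Carlo; we use the standard Chernoff form \({\mathbb {P}}(W \geqslant x) \leqslant (x/d)^{d/2} e^{(d-x)/2}\) for \(x \geqslant d\) (our reading of the bound from [24]; the parameter values below are illustrative):

```python
import numpy as np

# Monte Carlo check of the standard chi-square Chernoff bound:
#   P(W >= x) <= (x/d)^(d/2) * exp((d - x)/2)   for W ~ chi^2_d and x >= d.
rng = np.random.default_rng(0)
d, x, m = 4, 12.0, 200_000
W = rng.chisquare(d, size=m)
empirical = np.mean(W >= x)
bound = (x / d) ** (d / 2) * np.exp((d - x) / 2)
assert empirical <= bound
```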
Hence, we can define the \(F_j\) such that, for all \(t \geqslant d^{j/2}\), \({\mathbb {P}}(L_j(\omega ) \geqslant t) \leqslant F_j(t) = 2^{\frac{d}{2}} \exp \left( -\frac{t^\frac{2}{j}}{4} \right) \), and \(F_j({\bar{L}}_j)\) is smaller than some \(\delta \) if \({\bar{L}}_j \propto \left( d + \log \frac{1}{\delta } \right) ^{\frac{j}{2}}\). Then, we must choose the \({\bar{L}}_j\) such that \(\int _{{\bar{L}}_j}^\infty t F_j(t) \mathrm {d} t\) is bounded by some \(\delta \). Taking \({\bar{L}}_j \geqslant d^{j/2}\) in any case, we have
Hence, this quantity is bounded by \(\delta \) if \( {\bar{L}}_j \propto \left( d + \log \left( \frac{1}{\delta } \right) \right) ^{\frac{j}{2}} \). Then, we have \({\bar{L}}_j^2 F_i({\bar{L}}_i) = {\bar{L}}_j^2 2^{\frac{d}{2}} \exp \left( -\frac{{\bar{L}}_i^{\frac{2}{i}}}{4} \right) \) which is also bounded by \(\delta \) if \( {\bar{L}}_j \propto \left( d + \left( \log \frac{d}{\delta } \right) ^2 \right) ^\frac{j}{2} \). At the end of the day, our assumptions are satisfied for
1.5.1 Gaussian Mixture Model Learning
We apply the mixture model framework with the base distribution:
The random features on the data space are \(\varphi '_\omega (x) = C e^{i\omega ^\top x}\) with Gaussian distribution \(\omega \sim \varLambda = {\mathcal {N}}(0,A)\) for some constant C and matrix A that we will choose later. Then, the features on the parameter space are \(\varphi _\omega (\theta ) = {\mathbb {E}}_{x\sim P_\theta } \varphi '_\omega (x) = Ce^{i\omega ^\top \theta } e^{-\frac{1}{2} \left\| \omega \right\| _{\varSigma }^2}\) (that is, the characteristic function of Gaussians). Then, it is possible to show [37] that the kernel is
Hence, we choose \(A = c \varSigma ^{-1}\), \(C = (1+2c)^{\frac{d}{4}}\), and we come back to the previous case \(K(\theta ,\theta ') = e^{-\frac{1}{2} \left\| \theta -\theta ' \right\| _{{\tilde{\varSigma }}^{-1}}^2}\) with covariance \({\tilde{\varSigma }} = (2+1/c)\varSigma \). Hence, \({\bar{\varepsilon }}_i = {\mathcal {O}}\left( 1\right) \), \(B_{ij} = {\mathcal {O}}\left( 1\right) \), \({\mathfrak {d}}(\theta ,\theta ') = \left\| \theta - \theta ' \right\| _{{\tilde{\varSigma }}^{-1}} = \frac{1}{\sqrt{2+1/c}}\left\| \theta - \theta ' \right\| _{\varSigma ^{-1}}\).
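As a sanity check, the one-dimensional case of this identity can be verified by Monte Carlo: with \(\varphi _\omega (\theta ) = C e^{i\omega \theta } e^{-\frac{1}{2}\sigma ^2\omega ^2}\), \(\omega \sim {\mathcal {N}}(0, c/\sigma ^2)\) and \(C = (1+2c)^{1/4}\), one should recover \(K(\theta ,\theta ') = e^{-(\theta -\theta ')^2/(2{\tilde{\sigma }}^2)}\) with \({\tilde{\sigma }}^2 = (2+1/c)\sigma ^2\). All numerical values below are illustrative choices.

```python
import math, random

# Monte Carlo check (1D) that E_w[phi_w(t) * conj(phi_w(t'))] equals the
# Gaussian kernel exp(-(t - t')^2 / (2 * sigma_tilde^2)) with
# sigma_tilde^2 = (2 + 1/c) * sigma^2, for
# phi_w(t) = C e^{i w t} e^{-sigma^2 w^2 / 2}, w ~ N(0, c / sigma^2),
# C = (1 + 2c)^{1/4}.  All numbers below are illustrative.
rng = random.Random(1)
sigma2, c = 1.5, 0.7
t, tp = 0.3, 1.1
C2 = math.sqrt(1 + 2 * c)          # C^2
n = 200_000
delta = t - tp
# The imaginary part cancels in expectation, so average the real part.
acc = 0.0
for _ in range(n):
    w = rng.gauss(0.0, math.sqrt(c / sigma2))
    acc += C2 * math.cos(w * delta) * math.exp(-sigma2 * w * w)
estimate = acc / n
target = math.exp(-delta ** 2 / (2 * (2 + 1 / c) * sigma2))
assert abs(estimate - target) < 0.01
```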
Admissible Features Unlike the previous case, the features are directly bounded and Lipschitz. We have
Hence, all constants \(L_j\) are in \({\mathcal {O}}\left( C(2+1/c)^{\frac{j}{2}}\right) \); by choosing \(c = \frac{1}{d}\), they are in \({\mathcal {O}}\left( d^{\frac{j}{2}}\right) \).
Application: Sampling the Laplace Transform
Let \(\alpha \in {\mathbb {R}}_+^d\), and let \({\mathcal {X}}= (0,R]^d \subset {\mathbb {R}}^d_+\) for some \(R>0\). Let \(\varOmega = {\mathbb {R}}_+^d\). Define for \(x\in {\mathcal {X}}\) and \(\omega \in \varOmega \),
The Kernel and Fisher Metric The associated kernel is \(K(x,x') = \prod _{i=1}^d \kappa (x_i+\alpha _i, x_i'+\alpha _i)\) where
The associated metric \({{\mathfrak {g}}}_x\in {\mathbb {R}}^{d\times d}\) is the diagonal matrix with diagonal \((h_{x_i+\alpha _i})_{i=1}^d\) where given \(x\in {\mathbb {R}}_+\), \(h_x {\mathop {=}\limits ^{\text {def.}}}\partial _x \partial _{x'}\kappa (x,x) = (2x)^{-2}\). The induced distance in dimension one is
, and hence,
is the Fisher distance between exponential distributions. The domain diameter is \( {\mathcal {R}}_{\mathcal {X}}= \sqrt{\sum _i \log ^2\left( \frac{R+\alpha _i}{\alpha _i} \right) }. \)
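The identification with the Fisher distance can be checked directly: for the exponential scale family \(p_s(x) = s^{-1} e^{-x/s}\), the score is \(\partial _s \log p_s(x) = x/s^2 - 1/s\) and the Fisher information is \({\mathbb {E}}[(\partial _s \log p_s)^2] = 1/s^2\), so the geodesic length of \([v,u]\) under \(g(s) = 1/s^2\) is \(\int _v^u \mathrm {d}s/s = \log (u/v)\). A Monte Carlo sketch (the scale \(s=2\) and the endpoints are illustrative choices):

```python
import math, random

# Fisher information of the exponential scale family p_s(x) = (1/s) exp(-x/s):
# the score is d/ds log p_s(x) = x/s^2 - 1/s, and E[score^2] = 1/s^2.
# Under the metric g(s) = 1/s^2, the geodesic distance between scales u, v
# is |log(u/v)|.  The scale s = 2 below is an illustrative choice.
rng = random.Random(2)
s, n = 2.0, 200_000
acc = 0.0
for _ in range(n):
    x = rng.expovariate(1.0 / s)       # expovariate takes the rate 1/s
    score = x / s ** 2 - 1.0 / s
    acc += score * score
info = acc / n
assert abs(info - 1.0 / s ** 2) < 0.02   # Fisher information = 1/s^2 = 0.25

# Geodesic distance under g(s) = 1/s^2: integral of 1/s from v to u.
u, v = 5.0, 1.0
dist = abs(math.log(u / v))
```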
The Christoffel symbol is \(\varGamma ^{i}_{jk} = -(x_i+\alpha _i)^{-1}\) when \(i=j=k\) and 0 otherwise, so the Riemannian Hessian of f at x is
Sampling Bounds Assuming that \(\alpha _i\sim d\) for each i and that the \(\alpha _i\) are all distinct, Theorem 3 is applicable with:
-
(i)
\(B_{00} = B_{01} = B_{02} = {\mathcal {O}}(1)\), \(B_{12} = {\mathcal {O}}(\sqrt{d})\), \(B_{22} = {\mathcal {O}}(d)\).
-
(ii)
\({r_{\mathrm {near}}}= 0.2\), \({\bar{\varepsilon }}_0 = 0.005\), \({\bar{\varepsilon }}_2 = 0.7960\).
-
(iii)
\(\varDelta = {\mathcal {O}}(d + \log (d^{3/2}s_{\max }))\)
-
(iv)
\( {{\bar{L}}}_j \propto {d^j \left( \sqrt{d} + \left( \log (m) + \log \left( \frac{d}{\rho } \right) \right) \right) ^j}\)
and
where \(C {\mathop {=}\limits ^{\text {def.}}}d^2 \left( d + \log ^2(m) + \log ^2\left( \frac{d}{\rho } \right) \right) \). In the above, the implicit constant depends on R.
1.1 Preliminaries: Properties of the Univariate Kernel
We first provide bounds for \(\kappa \) and its derivatives. In the following, let
We denote \({\mathfrak {d}}_\kappa (u,v) {\mathop {=}\limits ^{\text {def.}}}\left|\log (u/v)\right|\). Recall also the hyperbolic functions
Lemma 22
We have
-
(i)
\(\kappa (u,v)= \mathrm {sech}\left( \frac{{\mathfrak {d}}_\kappa (u,v)}{2} \right) \leqslant 2e^{-\frac{1}{2} {\mathfrak {d}}_\kappa (u,v)}\).
-
(ii)
\( \left|{\kappa ^{(10)}(u,v) }\right| = 2\left|\tanh \left( \frac{{\mathfrak {d}}_\kappa (u,v)}{2} \right) {\kappa (u,v)} \right|, \) and \(\left|\kappa ^{(10)}(u,v)\right| \leqslant 2\left|\kappa (u,v)\right|\).
-
(iii)
\(\left|\kappa ^{(11)}(u,v)\right| \leqslant \left|\kappa (u,v)\right|^3 + 4\left|\kappa (u,v)\right|\)
-
(iv)
\(\left|\kappa ^{(20)}(u,v)\right| \leqslant 5 \left|\kappa (u,v)\right|\) and \(-\kappa ^{(20)}(u,v) \geqslant \kappa (u,v) \left( 1- 4\tanh \left( \frac{{\mathfrak {d}}_\kappa (u,v)}{2} \right) \right) \).
-
(v)
\(\left|\kappa ^{(12)}(u,v)\right| \leqslant 49 \left|\kappa (u,v)\right|\).
Proof
We first state the partial derivatives of \(\kappa \):
We also make use of the following fact: For \(u>v\),
-
(i)
$$\begin{aligned}&\kappa (u,v) = 2\left( \sqrt{\frac{u}{v}} + \sqrt{\frac{v}{u}} \right) ^{-1}\\&\quad = \frac{2}{e^{-\frac{{\mathfrak {d}}_\kappa (u,v)}{2}} + e^{\frac{{\mathfrak {d}}_\kappa (u,v)}{2}}} = \frac{1}{\cosh (\frac{{\mathfrak {d}}_\kappa (u,v)}{2})} \leqslant 2e^{-\frac{1}{2} {\mathfrak {d}}_\kappa (u,v)}, \end{aligned}$$
-
(ii)
We have, assuming that \(u>v\),
$$\begin{aligned} \kappa ^{(10)}(u,v)&= 2u \partial _u \kappa (u,v) = 2\frac{v-u}{u+v}\kappa (u,v)= -2\tanh ({\mathfrak {d}}_\kappa (u,v)/2) \kappa (u,v). \end{aligned}$$ -
(iii)
$$\begin{aligned} \kappa ^{(11)}(u,v)&= 4 u v\partial _{v} \partial _u \kappa (u,v) = 4u v \frac{ 4 u v - \left( u - v \right) ^2 }{2\sqrt{u v} (u+v)^3}\\&= \kappa (u,v) \left( \kappa (u,v)^2- \frac{\left( u - v \right) ^2 }{(u+v)^2} \right) \\&= \kappa (u,v) \left( \kappa (u,v)^2- 4\tanh ^2({\mathfrak {d}}_\kappa (u,v)/2) \right) \end{aligned}$$
so \(\left|\kappa ^{(11)}\right| \leqslant \left|\kappa \right|^3 + 4\left|\kappa \right|\).
-
(iv)
$$\begin{aligned} \kappa ^{(20)}(u,v)&= 4u^2 \partial _u^2 \kappa (u,v) = - \frac{4\left( u v \right) ^{1/2} \left( (u+v)^2 + 4u(v-u) \right) }{2 (u+v)^3}\\&=- \kappa (u,v) \left( 1 + \frac{4u (v-u) }{ (u+v)^2} \right) \end{aligned}$$
so \(\left|\kappa ^{(20)}\right| \leqslant 5 \left|\kappa \right|\). Also,
$$\begin{aligned} -\kappa ^{(20)} \geqslant \kappa (u,v)\left( 1 - 4\tanh ({\mathfrak {d}}_\kappa (u,v)/2) \right) \end{aligned}$$ -
(v)
$$\begin{aligned} \kappa ^{(12)}(u,v)&= 2u(2v)^2\partial _u \partial _{v}^2 \kappa (u,v) \\&=\kappa (u,v)\left( 1+ \frac{2 v (5 u^2 - 18 u v + v^2)}{(u + v)^3} \right) \end{aligned}$$
so \(\left|\kappa ^{(12)}\right| \leqslant 49 \left|\kappa \right|\).
\(\square \)
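Item (i) and the metric coefficient \(h_x = (2x)^{-2}\) from the previous subsection can be sanity-checked numerically; the evaluation points \((u,v) = (3, 1.2)\) and \(x = 2\) below are illustrative choices.

```python
import math

# Univariate Laplace kernel kappa(u, v) = 2 sqrt(uv) / (u + v).
# Checks: kappa = sech(d/2) with d = |log(u/v)| (Lemma 22 (i)),
# the bounds sech(x) <= 2 e^{-x} and sech(x) >= 1 - x^2/2, and the
# metric coefficient h_x = d/dx d/dx' kappa(x, x')|_{x'=x} = (2x)^{-2}.
def kappa(u, v):
    return 2.0 * math.sqrt(u * v) / (u + v)

u, v = 3.0, 1.2
d = abs(math.log(u / v))
assert abs(kappa(u, v) - 1.0 / math.cosh(d / 2)) < 1e-12
assert kappa(u, v) <= 2.0 * math.exp(-d / 2)
assert kappa(u, v) >= 1.0 - (d / 2) ** 2 / 2

# Cross derivative on the diagonal via a central finite difference.
x, h = 2.0, 1e-3
fd = (kappa(x + h, x + h) - kappa(x + h, x - h)
      - kappa(x - h, x + h) + kappa(x - h, x - h)) / (4 * h * h)
assert abs(fd - 1.0 / (2 * x) ** 2) < 1e-4   # h_x = (2x)^{-2} = 1/16
```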
1.2 Kernel Bounds
Theorem 5
(Kernel bounds) The following hold:
-
1.
\(1-\frac{1}{8} {\mathfrak {d}}(x,x')^2\leqslant \left|K(x,x')\right| \leqslant \min \left\{ 2^d e^{-\frac{1}{2} {\mathfrak {d}}(x,x')}, \frac{8}{8+ {\mathfrak {d}}(x,x')^2} \right\} . \)
-
2.
\(\left\| K^{(10)}(x,x') \right\| \leqslant \min \{2\sqrt{d} \left|K\right|, \sqrt{2}\}\).
-
3.
\(\left\| K^{(11)} \right\| \leqslant \min \{9d \left|K\right|,8\}\)
-
4.
\(\left\| K^{(20)} \right\| \leqslant \min \{9d \left|K\right| , 8\} \) and \(\lambda _{\min }(-K^{(20)} ) \geqslant \left( 1-5{\mathfrak {d}}(x,x')^2 \right) K\) when \({\mathfrak {d}}(x,x')\leqslant 1\).
-
5.
\(\left\| K^{(12)} \right\| \leqslant \min \{66 \left|K\right| d^{3/2}, 16\sqrt{d} + 49\}\) and \(\left\| K^{(12)} (x,x') \right\| \leqslant 34\) if \({\mathfrak {d}}(x,x')\leqslant 1\).
In particular, for \({\mathfrak {d}}(x,x') \geqslant 2d \log (2) + 2\log \left( \frac{52 d^{3/2} s_{\max }}{h} \right) \), we have \(\left\| K^{(ij)}(x,x') \right\| \leqslant \frac{h}{s_{\max }}\).
Proof
Let \({\mathrm {d}}_\ell {\mathop {=}\limits ^{\text {def.}}}{\mathfrak {d}}_\kappa (x_\ell +\alpha _\ell , x_\ell '+\alpha _\ell )\) and note that \({\mathfrak {d}}_{{\mathfrak {g}}}(x,x') = \sqrt{\sum _\ell {\mathrm {d}}_\ell ^2}\). Define \(g = \left( 2\tanh (\frac{{\mathrm {d}}_\ell }{2}) \right) _{\ell =1}^d\). We first prove that
-
(i)
\( \left|K(x,x')\right| \leqslant \prod _{\ell =1}^d \mathrm {sech}({\mathrm {d}}_\ell /2) \leqslant \prod _{\ell =1}^d \frac{1}{1+{\mathrm {d}}_\ell ^2/8} \leqslant \frac{1}{1+\frac{1}{8}{\mathfrak {d}}(x,x')^2}. \)
-
(ii)
\(\left\| K^{(10)}(x,x') \right\| \leqslant \left\| g \right\| _2 \left|K\right|\).
-
(iii)
\(\left\| K^{(11)} \right\| \leqslant \left|K\right| \left( \left\| g \right\| ^2_2 + 5 \right) \)
-
(iv)
\(\left\| K^{(20)} \right\| \leqslant \left|K\right| \left( \left\| g \right\| _2^2 + 5 \right) \) and \(\lambda _{\min }\left( -K^{(20)} \right) \geqslant K\left( 1- 5\left\| g \right\| _2^2 \right) . \)
-
(v)
\(\left\| K^{(12)} \right\| \leqslant \left|K\right|\left( \left\| g \right\| _2^3 + 16 \left\| g \right\| _2 + 49 \right) \)
The result then follows because \(\left|\tanh (x)\right| \leqslant \min \{x,1\}\), so \( \left\| g \right\| \leqslant \min \{ {\mathfrak {d}}(x,x'), 2\sqrt{d}\}\). For example, \(\left\| K^{(12)} \right\| \) \(\leqslant \frac{1}{1+\frac{1}{8}{\mathfrak {d}}(x,x')^2} \left( {\mathfrak {d}}(x,x')^3 + 16 {\mathfrak {d}}(x,x') + 24 \right) \leqslant 8 {\mathfrak {d}}(x,x') + \frac{\sqrt{8}}{2} + 24\leqslant 34\) when \({\mathfrak {d}}(x,x')\leqslant 1\).
In the following, we write
and \(\kappa _\ell {\mathop {=}\limits ^{\text {def.}}}\kappa _\ell ^{(00)}\) and \(K_i {\mathop {=}\limits ^{\text {def.}}}\prod _{j\ne i} \kappa _j\). Moreover, we will make use of the inequalities for \(\kappa ^{(ij)}\) derived in Lemma 22.
-
(i)
Note that \(\mathrm {sech}(x) \leqslant 2 e^{-x}\) and \(\mathrm {sech}(x) \leqslant (1+x^2/2)^{-1}\). So,
$$\begin{aligned} \left|K(x,x')\right| \leqslant \prod _{\ell =1}^d \mathrm {sech}\left( \frac{{\mathrm {d}}_\ell }{2} \right) \leqslant \prod _{\ell =1}^d \left( 1+\frac{{\mathrm {d}}_\ell ^2}{8} \right) ^{-1} \leqslant \frac{1}{1+\frac{1}{8}{\mathfrak {d}}(x,x')^2}. \end{aligned}$$Also, since \(\mathrm {sech}(x) \geqslant 1-\frac{x^2}{2}\), we have \(K(x,x') \geqslant \prod _{\ell =1}^d \left( 1-\frac{1}{8}{\mathrm {d}}_\ell ^2 \right) \geqslant 1-\frac{1}{8} {\mathfrak {d}}(x,x')^2\).
-
(ii)
Note that \(\left\| K^{(10)}(x,x') \right\| =\left\| \left( \kappa _\ell ^{(10)} K_{\ell } \right) _{\ell =1}^d \right\| \) , so by Lemma 22 (ii),
$$\begin{aligned} \left\| K^{(10)}(x,x') \right\| \leqslant \left\| g \right\| _2 \left|K\right|. \end{aligned}$$ -
(iii)
For \(i\ne j\)
$$\begin{aligned} \left|\kappa ^{(10)}_i \kappa ^{(01)}_j K_{ij} \right| \leqslant 4 \tanh \left( \frac{{\mathrm {d}}_{i}}{2} \right) \tanh \left( \frac{{\mathrm {d}}_{j}}{2} \right) \left|K\right|, \end{aligned}$$and \(\left| \kappa ^{(11)}_i K_i\right| \leqslant 5\left|K\right|\). So, given \(p\in {\mathbb {R}}^d\) of unit norm,
$$\begin{aligned} \left\| K^{(11)} \right\|&= \sup _{\left\| p \right\| = 1} \sum _{i=1}^d \sum _{j\ne i} \kappa ^{(10)}_i \kappa ^{(01)}_j K_{ij} p_i p_j + \sum _{i=1}^d p_i^2 \kappa ^{(11)}_i K_i\\&\leqslant \sup _{\left\| p \right\| = 1} \left|K\right| \left( \sum _{i=1}^d \sum _{j\ne i} 4\tanh ({\mathrm {d}}_i/2) \tanh ({\mathrm {d}}_j/2) p_i p_j + 5 \sum _{i=1}^d p_i^2 \right) \\&\leqslant \left|K\right| \left( \left\| g \right\| ^2_2 + 5 \right) . \end{aligned}$$ -
(iv)
Note that
$$\begin{aligned} \left\| K^{(20)} \right\| = \sup _{\left\| p \right\| =1} \left|\sum _{i=1}^d \sum _{j\ne i} \kappa ^{(10)}_i \kappa ^{(10)}_j K_{ij} p_i p_j + \sum _{i=1}^d p_i^2 \kappa ^{(20)}_i K_i + \sum _{i=1}^d \kappa ^{(10)}_i K_i p_i^2 \right|. \end{aligned}$$Observe that \( \left|\kappa ^{(20)}_i K_i\right| \leqslant 5 \left|K\right|\) and \(- \kappa ^{(20)}_i K_i \geqslant K\left( 1-4\tanh \left( \frac{{\mathrm {d}}_{i}}{2} \right) \right) \).
$$\begin{aligned}&\left\| K^{(20)} \right\| \leqslant \sup _{\left\| p \right\| \leqslant 1} \left| \sum _{i=1}^d \sum _{j\ne i} \kappa ^{(10)}_i \kappa ^{(10)}_j K_{ij} p_i p_j + \sum _{i=1}^d p_i^2 \kappa ^{(20)}_i K_i \right| + \left\| g \right\| _2 \left|K\right| \\&\leqslant \left|K\right| \sup _{\left\| p \right\| \leqslant 1} \left( \sum _{i=1}^d \sum _{j\ne i} 4 \tanh ({\mathrm {d}}_i/2) \tanh ({\mathrm {d}}_j/2) p_i p_j + 5\sum _{i=1}^d p_i^2 \right) + \left\| g \right\| _2 \left|K\right|\\&\leqslant \left|K\right| \left( 2\left\| g \right\| _2^2 + 5 \right) , \end{aligned}$$and given any p with \(\left\| p \right\| _x =1\),
$$\begin{aligned} \langle -K^{(20)} p,\,p\rangle \geqslant K\left( 1-4 \left\| g \right\| _\infty \right) \end{aligned}$$ -
(v)
Note that \(\left\| K^{(12)} \right\| _{x,x'} = \left\| A \right\| \), where \(A = (A_{ij\ell })_{i,j,\ell =1}^d\) is defined as follows: For \(i,j,\ell \) all distinct,
$$\begin{aligned} A_{ij\ell } = \kappa ^{(10)}_i \kappa ^{(01)}_j \kappa ^{(01)}_{\ell } K_{ij\ell } \leqslant 8 \tanh \left( \frac{{\mathrm {d}}_{i}}{2} \right) \tanh \left( \frac{{\mathrm {d}}_{j}}{2} \right) \tanh \left( \frac{{\mathrm {d}}_{\ell }}{2} \right) K, \end{aligned}$$for all \(i,\ell \) distinct,
$$\begin{aligned} A_{ii\ell }= & {} \kappa ^{(11)}_i \kappa ^{(01)}_{\ell } K_{i\ell } \leqslant 10\tanh \left( \frac{{\mathrm {d}}_{\ell }}{2} \right) K,\\ A_{i\ell i}= & {} \kappa ^{(11)}_i \kappa ^{(01)}_{\ell } K_{i\ell } \leqslant 10 \tanh \left( \frac{{\mathrm {d}}_{\ell }}{2} \right) K, \end{aligned}$$and for \(i\ne j\), \(A_{ijj} = \kappa ^{(10)}_i \kappa ^{(02)}_{j} K_{ij} \leqslant 12\tanh \left( \frac{{\mathrm {d}}_{i}}{2} \right) K\),
$$\begin{aligned} A_{ijj} = \kappa ^{(10)}_i \kappa ^{(02)}_{j} K_{ij} + \kappa ^{(10)}_i \kappa ^{(01)}_{j} K_{ij} \leqslant 10\tanh \left( \frac{{\mathrm {d}}_{i}}{2} \right) K+ 2\tanh \left( \frac{{\mathrm {d}}_{i}}{2} \right) K\end{aligned}$$and \(A_{iii} = ( \kappa ^{(12)}_i + \kappa ^{(02)}_i ) K_{i} \leqslant 54 K\). So, for \(p,q\in {\mathbb {R}}^d\) of unit norm,
$$\begin{aligned}&\sum _i \sum _{j}\sum _\ell A_{ij\ell } p_j p_\ell q_i =\sum _i \left( \sum _{j\ne i}\sum _\ell A_{ij\ell } p_j p_\ell q_i + \sum _\ell A_{ii\ell } p_i p_\ell q_i \right) \\&\quad = \sum _i \sum _{j\ne i} \left( \sum _{\ell \not \in \{i,j\}} A_{ij\ell } p_j p_\ell q_i + A_{ij i} p_j p_i q_i + A_{ijj} p_j^2 q_i \right) \\&\qquad + \sum _i \sum _{\ell \ne i} A_{ii\ell } p_i p_\ell q_i + \sum _i A_{iii} p_i^2 q_i \\&\quad \leqslant \left|K\right|\left( \left\| g \right\| _2^3 + 16 \left\| g \right\| _2 + 49 \right) . \end{aligned}$$
\(\square \)
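The bounds of item 1 of Theorem 5 can also be sanity-checked numerically for the product kernel \(K(x,x') = \prod _\ell \mathrm {sech}({\mathrm {d}}_\ell /2)\); the dimension, offsets \(\alpha _i\) and sampling range below are illustrative choices.

```python
import math, random

# Check item 1 of Theorem 5 for the product kernel
# K(x, x') = prod_l sech(d_l / 2),  d_l = |log((x_l+a)/(x'_l+a))|,
# dist = sqrt(sum_l d_l^2):
#   1 - dist^2/8 <= K <= min(2^d * exp(-dist/2), 8 / (8 + dist^2)).
# Dimension 4, alpha = 1 and the range (0, 10] are illustrative choices.
rng = random.Random(3)
dim, alpha, R = 4, 1.0, 10.0
for _ in range(1000):
    x = [rng.uniform(0.0, R) for _ in range(dim)]
    xp = [rng.uniform(0.0, R) for _ in range(dim)]
    dl = [abs(math.log((a + alpha) / (b + alpha))) for a, b in zip(x, xp)]
    K = math.prod(1.0 / math.cosh(d / 2) for d in dl)
    dist = math.sqrt(sum(d * d for d in dl))
    assert 1 - dist ** 2 / 8 <= K <= min(2 ** dim * math.exp(-dist / 2),
                                         8 / (8 + dist ** 2))
```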
1.3 Gradient Bounds
Theorem 6
(Stochastic gradient bounds) Assume that the \(\alpha _i\)’s are all distinct. Then, \({{\bar{L}}}_0(\omega ) \leqslant {{\bar{L}}}_0 {\mathop {=}\limits ^{\text {def.}}}\left( 1+ \frac{{R}}{\min _i \alpha _i} \right) ^d \) and
and we have that \(\sum _i F_j({{\bar{L}}}_j)\leqslant \delta \) and \({{\bar{L}}}_j^2 \sum _i F_i({{\bar{L}}}_i) + 2 \int _{{{\bar{L}}}_j}^\infty t F_j(t) \mathrm {d}t \leqslant \delta \) provided that
where \(\beta _i = \prod _{j\ne i} \frac{\alpha _j}{\alpha _j - \alpha _i}\). Note that \(\alpha _i \sim d\) implies that \({{\bar{L}}}_0 \sim (1+{R}/d)^d \sim e^{{R}}\).
Proof
Let \( V_x {\mathop {=}\limits ^{\text {def.}}}\left( 1-2 (x_i+\alpha _i) \omega _i \right) _{i=1}^d \in {\mathbb {R}}^d. \) Then,
We have the following bounds:
and
which yields \(\left\| \mathrm {D}_{2}\left[ \varphi _\omega \right] (x) \right\| _x \leqslant {\bar{L}}_0 (2+ {\bar{V}}^2)\).
Note that by the mean value theorem, \(\left|x_i - x_i'\right| \leqslant (R+{\alpha _i}) \left|\log (x_i + \alpha _i) - \log (x_i' + \alpha _i)\right|\) and hence,
Also, \(\left|\varphi _\omega (x) - \varphi _\omega (x')\right| \leqslant \sup _x \left\| \mathrm {D}_{1}\left[ \varphi _\omega \right] (x) \right\| {\mathfrak {d}}_{{\mathfrak {g}}}(x,x') \leqslant {\bar{L}}_0 {\bar{V}} {\mathfrak {d}}_{{\mathfrak {g}}}(x,x')\). Therefore,
Define for \(j=0,1,2,3\)
then, for \(j=0,1,2\), \( L_j(\omega ) {\mathop {=}\limits ^{\text {def.}}}\sup _x \left\| \mathrm {D}_{j}\left[ \varphi _\omega \right] (x) \right\| _x \lesssim G_j(\omega ) \) and
When all \(\alpha _j\) are distinct, we have [2]:
where \(\beta _i = \prod _{j\ne i} \frac{\alpha _j}{\alpha _j - \alpha _i}\), using the fact that \(\left\| \omega \right\| _1\) is a sum of independent exponential random variables.
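The fact invoked from [2] is the survival function of a hypoexponential distribution: if T is a sum of independent exponential variables with distinct rates \(\lambda _i\), then \({\mathbb {P}}(T > t) = \sum _i \beta _i e^{-\lambda _i t}\) with \(\beta _i = \prod _{j\ne i} \lambda _j/(\lambda _j - \lambda _i)\), which matches the \(\beta _i\) above when \(\lambda _i\) is proportional to \(\alpha _i\) (the exact distribution of \(\omega \) is not reproduced in this excerpt). A Monte Carlo sketch, with illustrative rates and threshold:

```python
import math, random

# Survival function of a hypoexponential distribution: for T a sum of
# independent exponentials with distinct rates lam_i,
#   P(T > t) = sum_i beta_i * exp(-lam_i * t),
#   beta_i = prod_{j != i} lam_j / (lam_j - lam_i).
# Rates (1, 2, 3.5) and t = 1.5 are illustrative choices.
rng = random.Random(4)
lam = [1.0, 2.0, 3.5]
t, n = 1.5, 200_000
beta = [math.prod(lj / (lj - li) for lj in lam if lj != li) for li in lam]
closed = sum(b * math.exp(-li * t) for b, li in zip(beta, lam))
hits = sum(
    1 for _ in range(n)
    if sum(rng.expovariate(li) for li in lam) > t
)
assert abs(hits / n - closed) < 0.01
```

Note that the \(\beta _i\) sum to 1, so the formula gives \({\mathbb {P}}(T > 0) = 1\) as it should.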
Hence, for all \(1\leqslant j\leqslant 3\) and \(t \geqslant d^{\frac{j}{2}}\) we have
and \(F_j({{\bar{L}}}_j) \leqslant \delta \) if
Next, we compute
This is bounded from above by \(\delta \) if for all \(i=1,\ldots , d\),
, that is,
It remains to bound \({{\bar{L}}}_j F_\ell ({{\bar{L}}}_\ell )\) with \(\ell ,j\in \{0,1,2,3\}\): Let \({{\bar{L}}}_\ell \geqslant {{\bar{L}}}_0 M^\ell \) for some M to be determined. Then,
if for each \(i=1,\ldots , d\)
Therefore, the conclusion follows for \({{\bar{L}}}_0 = \left( 1+ \frac{{R}}{\min _i \alpha _i} \right) ^d \), and for \(j=1,2,3\),
\(\square \)
Poon, C., Keriven, N. & Peyré, G. The Geometry of Off-the-Grid Compressed Sensing. Found Comput Math 23, 241–327 (2023). https://doi.org/10.1007/s10208-021-09545-5