Abstract
Recent advances in quantized compressed sensing and high-dimensional estimation have shown that signal recovery is even feasible under strong nonlinear distortions in the observation process. An important characteristic of associated guarantees is uniformity, i.e., recovery succeeds for an entire class of structured signals with a fixed measurement ensemble. However, despite significant results in various special cases, a general understanding of uniform recovery from nonlinear observations is still missing. This paper develops a unified approach to this problem under the assumption of i.i.d. sub-Gaussian measurement vectors. Our main result shows that a simple least-squares estimator with any convex constraint can serve as a universal recovery strategy, which is outlier robust and does not require explicit knowledge of the underlying nonlinearity. Based on empirical process theory, a key technical novelty is an approximative increment condition that can be implemented for all common types of nonlinear models. This flexibility allows us to apply our approach to a variety of problems in nonlinear compressed sensing and high-dimensional statistics, leading to several new and improved guarantees. Each of these applications is accompanied by a conceptually simple and systematic proof, which does not rely on any deeper properties of the observation model. On the other hand, known local stability properties can be incorporated into our framework in a plug-and-play manner, thereby implying near-optimal error bounds.
1 Introduction
This paper is concerned with the following fundamental reconstruction task:
Problem 1
Consider a set of signals \(\mathcal {X}\subset \mathbb {R}^p\) and let \(\varvec{a}_1, \dots , \varvec{a}_m \in \mathbb {R}^p\) be a collection of measurement vectors. Moreover, let \(F:\mathbb {R}^p \times \mathcal {X}\rightarrow \mathbb {R}\) be a scalar output function. Under what conditions can the following recovery problem be solved uniformly for all \(\varvec{x}\in \mathcal {X}\): Assume that \(\varvec{x}\) is observed in the form
where \(\nu _1,\dots , \nu _m \in \mathbb {R}\) is scalar noise. Given \(\{(\varvec{a}_i, y_i)\}_{i = 1}^m\), is it possible to recover the underlying signal \(\varvec{x}\) efficiently?
The prototypical instance of Problem 1 is compressed sensing. Dating back to the seminal works of Candès, Romberg, Tao, and Donoho [10, 11, 19], the traditional setup of compressed sensing focuses on the case of noisy linear observations, i.e., we have \(y_i = \langle \varvec{a}_i, \varvec{x}\rangle + \nu _i\) for \(i = 1, \dots , m\). In this regime, Problem 1 is nowadays fairly well understood, underpinned by various real-world applications, efficient algorithmic methods, and a rich theoretical foundation; see [21] for a comprehensive overview. In a nutshell, compressed sensing has proven that signal recovery is still feasible when \(m \ll p\), provided that the signal set \(\mathcal {X}\) carries some low-dimensional structure, e.g., a variant of sparsity, while the measurement ensemble \(\{\varvec{a}_i\}_{i = 1}^m\) follows an appropriate random design. The uniformity over \(\mathcal {X}\) plays an important role in this context, since the measurement device—determined by \(\{\varvec{a}_i\}_{i = 1}^m\) in our case—is typically fixed in applications and should allow for the reconstruction of all (or most) signals in \(\mathcal {X}\). Beyond this practical relevance, the quest for uniform recovery is an interesting mathematical problem in its own right. It is significantly more involved than its non-uniform counterpart, where the \(\{\varvec{a}_i\}_{i = 1}^m\) may vary for each \(\varvec{x}\in \mathcal {X}\).
The present work is devoted to Problem 1 in the much less explored situation of nonlinear output functions. In fact, many conceptions from compressed sensing theory, such as the restricted isometry or nullspace property, are tailored to linear models and do not carry over directly to the nonlinear case. As we will see in the next subsection, the presence of nonlinear measuring components is not merely an academic concern but affects many problems in signal processing and high-dimensional statistics.
1.1 Prior Art
There exist two branches of research that are particularly relevant to this work. The first one is rooted in the field of (memoryless) quantized compressed sensing, which deals with the fact that analog (linear) measurements often need to be quantized before further processing in practice. A common scenario in this respect is 1-bit compressed sensing, where only a single bit of information is retained, e.g., if (1.1) renders observations of the form \(y_i = {{\,\mathrm{sign}\,}}(\langle \varvec{a}_i, \varvec{x}\rangle )\). Given this considerable loss of information, it may come as a surprise that tractable recovery methods are still available and that Problem 1 is relatively well understood in this situation. A solid theoretical basis as well as efficient algorithms have been developed over the last few years, including significant progress on lower bounds [18, 34], robustness [17, 49], advanced quantization schemes [3, 36, 37, 44, 65], and non-Gaussian measurements [1, 16–18]. While this list of references is certainly incomplete, the surveys of Dirksen [14] and Boufounos et al. [6] provide nice overviews of the field of quantized compressed sensing.
The above achievements have also given rise to several important mathematical tools. Among the most notable breakthroughs are quantized embedding results for signal recovery—a highly geometric argument based on uniform, random hyperplane tessellations, e.g., see [17, 34, 45, 49, 50]. Closer to the original ideas of compressed sensing are reconstruction guarantees relying on variants of the restricted isometry property, e.g., see [16, 20, 33, 65]. However, these techniques are strongly tailored to quantized measurements, and it remains unclear how they could be extended to other instances of Problem 1.
The second branch of related literature is much less restrictive with respect to the underlying observation model. It allows (1.1) to take the form \(y_i = f(\langle \varvec{a}_i, \varvec{x}\rangle ) + \nu _i\), where \(f:\mathbb {R}\rightarrow \mathbb {R}\) can be nonlinear and random. A pioneering work on these so-called single-index models is the one of Plan and Vershynin [51] (inspired by ideas of Brillinger [7]), who study the generalized Lasso as reconstruction method:
Here, \(K\subset \mathbb {R}^p\) is a convex constraint set, serving as an appropriate relaxation of the actual signal set \(\mathcal {X}\) and making (\({\mathsf {P}}_{K,\varvec{y}}\)) tractable in many situations of interest. Although this “linearization” strategy might appear very coarse, it was shown to produce satisfactory outcomes for Gaussian measurements, even when \(f\) is highly nonlinear. A key benefit of (\({\mathsf {P}}_{K,\varvec{y}}\)) is that it does not require any (explicit) knowledge of the observation model, which enables various applications to signal processing and statistical estimation, e.g., see [23, Chap. 3 & 4]. The article of Plan and Vershynin [51] is just an example of a whole line of research with many related and follow-up works, e.g., see [22, 28, 29, 47, 52, 56, 60, 62]; remarkably, this approach also extends to phase-retrieval-like problems where \(f\) is an even function [25, 61, 66]. For a more detailed discussion of the literature, we refer to [52, Sec. 6] and [23, Sec. 4.2].
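To make this estimator concrete, the following sketch solves (\({\mathsf {P}}_{K,\varvec{y}}\)) by projected gradient descent for the illustrative choice of \(K\) as a scaled \(\ell ^{1}\)-ball (a standard convex relaxation of sparsity); the solver, step size, and function names are our own choices and not part of the literature discussed here.

```python
import numpy as np

def project_l1_ball(v, radius):
    """Euclidean projection onto {z : ||z||_1 <= radius} via sorting."""
    if np.abs(v).sum() <= radius:
        return v
    u = np.sort(np.abs(v))[::-1]                   # sorted magnitudes
    css = np.cumsum(u)
    idx = np.arange(1, v.size + 1)
    k = np.nonzero(u * idx > css - radius)[0][-1]  # last feasible index
    theta = (css[k] - radius) / (k + 1.0)          # soft-threshold level
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def generalized_lasso(A, y, radius, steps=1000):
    """Projected gradient descent for min_{||z||_1 <= radius} (1/m)||y - A z||_2^2."""
    m, p = A.shape
    lr = m / (2.0 * np.linalg.norm(A, 2) ** 2)     # step size 1/L for the smooth part
    z = np.zeros(p)
    for _ in range(steps):
        grad = (2.0 / m) * A.T @ (A @ z - y)       # gradient of the empirical loss
        z = project_l1_ball(z - lr * grad, radius)
    return z
```

Any other convex constraint set \(K\) with a computable Euclidean projection could be substituted for the \(\ell ^{1}\)-ball here.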
At first sight, the aforementioned works seem to provide a general solution to Problem 1, but they lack a crucial feature, namely uniformity. Indeed, most of these results are based on concentration inequalities over high-dimensional signal sets, exploiting their “local geometry” around a fixed \(\varvec{x}\in \mathcal {X}\). This strategy naturally leads to non-uniform recovery guarantees, and we are not aware of a uniform extension in that regard. Finally, we point out that the two research areas discussed above are not independent but exhibit certain overlaps in terms of their main achievements and proof techniques, for instance, see [28, 62].
1.2 Contributions and Overview
The primary objective of this work is to fill the striking gap between the lines of research discussed in the previous subsection. On the one hand, we show that uniform recovery is not only limited to linear and quantized measurement schemes but applies to a much larger family of observation models. On the other hand, it turns out that under mild assumptions, known non-uniform guarantees for single-index and related models naturally extend to the uniform case. Apart from a unification, our theoretical approach will contribute to each of these research directions through new and improved results.
On the conceptual side, we stick to the methodology suggested by the second branch of the literature from Sect. 1.1, analyzing the performance of the generalized Lasso (\({\mathsf {P}}_{K,\varvec{y}}\)). Our main result, Theorem 2 in Sect. 2.3, establishes a uniform error bound for (\({\mathsf {P}}_{K,\varvec{y}}\)) under very general conditions, including all observation models mentioned above. This finding implies that there exists a “universal” recovery strategy, which often yields satisfactory outcomes. In this context, it is worth pointing out that we have focused on (\({\mathsf {P}}_{K,\varvec{y}}\)) mainly because of its simplicity and popularity, but similar results could be shown for other reconstruction methods as well (cf. Remark 3(3)). Therefore, the present paper is first and foremost a contribution to Problem 1, whereas the estimator of choice plays a subordinate role. (Footnote 1)
The recovery guarantee of Theorem 2 is formulated in a quite abstract setting, which requires some technical preparation. In order to familiarize the reader with the results of this paper, we now state a special case for single-index models with a Lipschitz continuous output function and Gaussian measurements (see Sect. 8.4 for a proof). Note that parts of the notation from Sect. 1.4 are already used here; the complexity parameters \(w_{t}(\cdot )\) and \(w_{}(\cdot )\) correspond to the common notion of (local) mean width, which is formally introduced in Definition 2.
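As a numerical aside, assuming the standard Gaussian mean-width convention \(w(H) = \mathbb {E}[{\mathrm{sup}}_{\varvec{v}\in H} \langle \varvec{g}, \varvec{v}\rangle ]\) with \(\varvec{g}\sim \mathsf {N}(\varvec{0}, \varvec{I}_{p})\) (the formal local version is deferred to Definition 2), this complexity parameter can be estimated by Monte Carlo sampling; the helper below, restricted to polytopes given by their vertices, is purely illustrative.

```python
import numpy as np

def mean_width_polytope(vertices, trials=20_000, seed=0):
    """Monte Carlo estimate of w(H) = E[sup_{v in H} <g, v>], g ~ N(0, I_p).

    For a polytope, the supremum is attained at a vertex, so H is
    represented by a (num_vertices, p) array of vertices.
    """
    rng = np.random.default_rng(seed)
    g = rng.standard_normal((trials, vertices.shape[1]))
    return (g @ vertices.T).max(axis=1).mean()
```

For the cross-polytope \(\mathbb {B}_1^p\) (vertices \(\pm \varvec{e}_j\)), this yields \(w(\mathbb {B}_1^p) = \mathbb {E}\Vert \varvec{g}\Vert _{\infty } \asymp \sqrt{\log p}\), a typical low-complexity example behind \(\ell ^{1}\)-constraints.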
Theorem 1
There exist universal constants \(c, C > 0\) for which the following holds.
Let \(\varvec{a}_1, \dots , \varvec{a}_m \in \mathbb {R}^p\) be independent copies of a standard Gaussian random vector \(\varvec{a}\sim \mathsf {N}(\varvec{0}, \varvec{I}_{p})\). Let \(f:\mathbb {R}\rightarrow \mathbb {R}\) be \(\gamma \)-Lipschitz, and for \(g\sim \mathsf {N}(0, 1)\), we set
Moreover, let \(\mathcal {X}\subset \mathbb {S}^{p-1}\) and let \(K\subset \mathbb {R}^p\) be a convex set such that \(\mu \mathcal {X}\subset K\). For \(u\ge 1\) and a desired reconstruction accuracy \(t \ge 0\), we assume that
where \(K- \mu \mathcal {X}{:=}\{ \varvec{z}- \mu \varvec{x}\ :\varvec{z}\in K, \varvec{x}\in \mathcal {X}\}\). Then with probability at least \(1 - \exp (-c u^2)\) on the random draw of \(\{\varvec{a}_i\}_{i = 1}^m\), the following holds uniformly for all \(\varvec{x}\in \mathcal {X}\): Let \(\varvec{y}= (y_1, \dots , y_m) \in \mathbb {R}^m\) be given by
such that \(\big (\tfrac{1}{m}\sum _{i = 1}^m \nu _i^2 \big )^{1/2} \le \tfrac{t}{20}\). (Footnote 2) Then, every minimizer \(\hat{\varvec{z}}\) of (\({\mathsf {P}}_{K,\varvec{y}}\)) satisfies \(\Vert \hat{\varvec{z}}- \mu \varvec{x}\Vert _{2} \le t\).
Theorem 1 can be seen as a uniform version of Plan and Vershynin's main result on single-index models [51, Thm. 1.9]. Most notably, the sampling-rate condition (1.2) implies that the dependence on the desired accuracy t is the same as in the non-uniform case. For more details on the interplay between m, t, and the complexity parameters, we refer to the discussion of Theorem 2 in Sect. 2.3. To the best of our knowledge, Theorem 1 is a new result in its own right, and in particular, it is not a consequence of the approaches outlined in the previous subsection. A novel argument is required to prove this type of recovery guarantee; see Sect. 1.3 for an overview of the major technical challenge of our work.
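As a quick illustration of the scaling factor in Theorem 1, the following sketch estimates \(\mu \) by Monte Carlo sampling, assuming the standard single-index choice \(\mu = \mathbb {E}[f(g)g]\) for \(g\sim \mathsf {N}(0, 1)\) (cf. [51]); the function name is ours.

```python
import math
import random

def estimate_mu(f, n=200_000, seed=0):
    """Monte Carlo estimate of mu = E[f(g) * g] for g ~ N(0, 1)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        g = rng.gauss(0.0, 1.0)
        total += f(g) * g
    return total / n
```

For the identity \(f(s) = s\), this recovers \(\mu = 1\) (the linear model); for the non-Lipschitz example \(f = {{\,\mathrm{sign}\,}}\) from quantized compressed sensing, one obtains \(\mu = \mathbb {E}[|g|] = \sqrt{2/\pi }\).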
Our main result, Theorem 2, extends Theorem 1 in several important respects, which correspond to our key contributions:
1. Discontinuous output functions. The Lipschitz assumption in Theorem 1 prohibits popular models like quantized observations. It turns out that an extension in this regard is subtle. To this end, we introduce an approximative increment condition on the observation variable, which is satisfied for many discontinuous nonlinearities; see Sect. 2.2 for more details and Sects. 4.1–4.4 for applications to nonlinear compressed sensing.
2. Beyond Gaussian measurements. Compared to the previous point, it is relatively straightforward to allow for i.i.d. sub-Gaussian measurement vectors in Theorem 1. However, this relaxation introduces an additional constraint in terms of the so-called target mismatch, whose size depends on the distribution of the measurements and the output function; see Definition 1 in Sect. 2.1.
3. Beyond single-index models. In Sect. 2, we study the more general situation of Problem 1, where \(y_i\) is not necessarily representable in terms of the dot product \(\langle \varvec{a}_i, \varvec{x}\rangle \). While this requires an additional step of abstraction, it allows for more complicated nonlinear models as well as refined recovery tasks, e.g., support estimation (see Sect. 4.5). For this purpose, we introduce so-called target functions in Sect. 2.1, which transform a signal \(\varvec{x}\in \mathcal {X}\) in such a way that it becomes compatible with the solution vector of (\({\mathsf {P}}_{K,\varvec{y}}\)) under observations of the form (1.1). Note that in the setup of Theorem 1, this transformation simply corresponds to rescaling \(\varvec{x}\) by the scalar factor \(\mu \).
4. Outlier robustness. The \(\ell ^{2}\)-noise constraint \(\big (\tfrac{1}{m}\sum _{i = 1}^m \nu _i^2 \big )^{1/2} \le \tfrac{t}{20}\) in Theorem 1 is standard for linear models and also remains useful in the situation of Lipschitz nonlinearities. However, such a condition is overly restrictive for quantized measurement schemes, where the noise is due to (a few) wrong bits instead of (small) real-valued perturbations. Theorem 2 therefore involves an important relaxation in this respect, permitting a portion of “gross” outliers in the observation vector.
Based on our findings in Theorem 2, we will also address two points of independent interest:
5. Uniform vs. non-uniform recovery. Section 3 isolates characteristics that are specifically attributable to uniformity. While the presence of nonlinear output functions may only affect the oversampling rate (see also Sect. 4), our focus here is on the rather implicit complexity parameter \(w_{t}(K- \mathcal {X})\), cf. (1.2). We show that it can always be bounded by two much more informative expressions, namely a “fully localized” mean width and its global counterpart (see Proposition 1). This important simplification enables us to reuse known (non-uniform) results from the literature. We illustrate for two examples (\(\ell ^{1}\)-constraints and total variation) that there is no considerable gap between the uniform and non-uniform regime.
6. Plug-and-play via embeddings. Section 5 demonstrates how to incorporate known random embedding results into our theoretical framework. In this way, for instance, it is possible to obtain near-optimal error decay rates for (\({\mathsf {P}}_{K,\varvec{y}}\)) with quantized measurements. This achievement is due to a more general guarantee stated in Theorem 3.
The general purpose of our methodology is to enable a more systematic study of Problem 1 than before, especially for nonlinear observation models. To a certain degree, this paper promotes an alternative path to compressed sensing theory that can do without common tools like the restricted isometry property or quantized embedding results. Despite clear overlaps, each of these approaches comes along with its own strengths and (in-)accessible regimes; see Sect. 6 for a more detailed discussion.
1.3 Technical Challenges
This subsection highlights the main technical achievements of this work. Our principal procedure is in fact relatively standard: We decompose the excess loss associated with the estimator (\({\mathsf {P}}_{K,\varvec{y}}\)) into separate empirical processes and aim at suitable lower/upper bounds for each (cf. [17, 23, 25, 41, 51]); see Sect. 2.2 for an overview of this argument.
However, the quest for uniform recovery as prescribed by Problem 1 makes this approach a challenging endeavor. In a nutshell, the key difficulty is that the common multiplier term turns into an empirical product process of the following type:
where the first factor takes the form \(\xi _{i}(\varvec{x}) = \langle \varvec{a}_i, T\varvec{x}\rangle - F(\varvec{a}_i, \varvec{x})\) and \(T:\mathcal {X}\rightarrow K\) is a fixed function (think of a scalar multiplication for now). Since we require a uniform upper bound for (1.3) over both \(\varvec{x}\) and \(\varvec{v}\), the regularity of the stochastic process \(\{\xi _{i}(\varvec{x})\}_{\varvec{x}\in \mathcal {X}}\) plays an important role, in the sense that it should fulfill a sub-Gaussian increment condition (see (1.4)). (Footnote 3) Unfortunately, such a regularity assumption does not even hold for standard models like 1-bit quantization \(F(\varvec{a}_i, \varvec{x}) = {{\,\mathrm{sign}\,}}(\langle \varvec{a}_i, \varvec{x}\rangle )\). Therefore, existing chaining-based results for product processes are not directly applicable to our situation. This may also explain why there is no unified theory for nonlinear observations available so far and why known uniform recovery guarantees are restricted to special cases (see Sect. 1.1).
The present article provides a general remedy for this foundational problem. Our starting point is that a given output function \(F\) can often be approximated by a more regular one, say \(F_t\), where \(t > 0\) controls the approximation accuracy. As such, this strategy can yield a more benign multiplier \(\xi _{t,i}(\varvec{x}) = \langle \varvec{a}_i, T\varvec{x}\rangle -F_t(\varvec{a}_i, \varvec{x})\), but it simply shifts the original problem to the resulting approximation error \(F(\varvec{a}_i, \varvec{x}) - F_t(\varvec{a}_i, \varvec{x})\). Hence, a fundamental insight in our proof (see Step 3 in Sect. 7) is that it suffices to consider the absolute error \(\varepsilon _{t,i}(\varvec{x}) = |F(\varvec{a}_i, \varvec{x}) - F_t(\varvec{a}_i, \varvec{x})|\) instead. Indeed, the process \(\{\varepsilon _{t,i}(\varvec{x})\}_{\varvec{x}\in \mathcal {X}}\) fulfills an increment condition in all relevant model situations that we are aware of; see Fig. 2 in Sect. 4 for an illustration of the above argument based on the sign function. With this at hand, and apart from other technicalities, we can take advantage of a recent concentration inequality due to Mendelson [42, Thm. 1.13].
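For the sign function, a natural choice of such an approximation \(F_t\) (a hypothetical instance, for intuition only) replaces the jump at zero by a linear ramp on \([-t, t]\):

```python
def sign_t(s, t):
    """Ramp surrogate of sign(s): linear on [-t, t], clipped to +/-1 outside.

    This function is (1/t)-Lipschitz, and the absolute approximation error
    |sign(s) - sign_t(s, t)| is supported on the band |s| < t.
    """
    return max(-1.0, min(1.0, s / t))
```

As \(t\) decreases, the surrogate approximates \({{\,\mathrm{sign}\,}}\) ever more closely, but its Lipschitz constant \(1/t\) blows up, which is precisely the trade-off discussed above.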
The intuitive idea described in the previous paragraph is placed into a rigorous framework by means of our central Assumption 2. In this context, it is worth noting that the actual main result, Theorem 2, couples the approximation accuracy t with the desired reconstruction error. This might appear somewhat peculiar at first sight, because the processes \(\{\xi _{t,i}(\varvec{x})\}_{\varvec{x}\in \mathcal {X}}\) and \(\{\varepsilon _{t,i}(\varvec{x})\}_{\varvec{x}\in \mathcal {X}}\) usually become less regular as t decreases. However, we demonstrate that this trade-off can be balanced out well and only affects the achievable oversampling rate.
Finally, our general technique for controlling empirical product processes of the form (1.3) may find applications beyond signal recovery problems. For example, it could also be used to derive a new generation of nonlinear embedding results. In fact, one of the earliest works on binary random embeddings [50] already pointed out the usefulness of approximating nonlinearities, although their argument is strongly tailored to 1-bit quantization. With our systematic approach, their result can not only be reproduced but also further improved (cf. [58, Thm. 3.7]).
1.4 Overview and Notation
The rest of this article is organized as follows. Section 2 presents our findings in full generality, where the proof of Theorem 2 is postponed to Sect. 7. In Sect. 3, we elaborate aspects that are specifically due to uniform recovery and highlight potential gaps to the non-uniform regime. Sections 4 and 5 are then devoted to applications and concrete examples, including a series of new guarantees for nonlinear compressed sensing and high-dimensional estimation; the corresponding proofs are provided in Sects. 8 and 9, respectively. Our concluding discussion can be found in Sect. 6.
Before proceeding, let us fix some standard notations and conventions that are commonly used in this work. The letters c and C are reserved for (positive) constants, whose values may change from line to line. We speak of a universal constant if its value does not depend on any other involved parameter. If an inequality holds up to a universal constant C, we usually write \(A \lesssim B\) instead of \(A \le C \cdot B\). The notation \(A \asymp B\) is a shortcut for \(A \lesssim B \lesssim A\).
For \(p \in \mathbb {N}\), we set \([p] {:=}\{1, \dots , p\}\). The cardinality of an index set \(\mathcal {S}\subset [p]\) is denoted by \(|\mathcal {S}|\) and its set complement in [p] is \(\mathcal {S}^c {:=}[p] \setminus \mathcal {S}\). Vectors and matrices are denoted by lower- and uppercase boldface letters, respectively. The j-th entry of a vector \(\varvec{v}\in \mathbb {R}^p\) is denoted by \((\varvec{v})_j\), or simply by \(v_j\) if there is no danger of confusion. The support of \(\varvec{v}\in \mathbb {R}^p\) is defined by \({{\,\mathrm{supp}\,}}(\varvec{v}) {:=}\{ j \in [p] \ :v_j \ne 0 \}\) and \(\Vert \varvec{v}\Vert _{0} {:=}|{{\,\mathrm{supp}\,}}(\varvec{v})|\) denotes its sparsity. We write \(\varvec{I}_{p} \in \mathbb {R}^{p \times p}\) and \(\varvec{0}\in \mathbb {R}^p\) for the identity matrix and the zero vector in \(\mathbb {R}^p\), respectively. For \(1 \le q \le \infty \), we denote the \(\ell ^{q}\)-norm on \(\mathbb {R}^p\) by \(\Vert \cdot \Vert _{q}\) and the associated unit ball by \(\mathbb {B}_q^p\). The Euclidean unit sphere is given by \(\mathbb {S}^{p-1} {:=}\{ \varvec{v}\in \mathbb {R}^p \ :\Vert \varvec{v}\Vert _{2} = 1 \}\). The spectral matrix norm is denoted by \(\Vert \cdot \Vert _{2\rightarrow 2}\).
Let \(H, H' \subset \mathbb {R}^p\) and \(\varvec{v}\in \mathbb {R}^p\). We write \({{\,\mathrm{{\text {span}}}\,}}(H)\) for the linear hull of \(H\) and the linear cone generated by \(H\) (not necessarily convex) is denoted by \({\text {cone}}(H) {:=}\{s {\tilde{\varvec{v}}} \ :{\tilde{\varvec{v}}} \in H, s \ge 0 \}\). By \(P_{H} :\mathbb {R}^p \rightarrow \mathbb {R}^p\), we denote the Euclidean projection onto the closure of \(H\) if it is well defined, and we use the shortcut \(P_{\varvec{v}} {:=}P_{{{\,\mathrm{{\text {span}}}\,}}(\{\varvec{v}\})}\). We write \({H}^\perp \) for the orthogonal complement of \(H\). The Minkowski difference between \(H\) and \(H'\) is defined by \(H- H' {:=}\{ \varvec{v}_1 -\varvec{v}_2 \ :\varvec{v}_1 \in H, \varvec{v}_2 \in H' \}\), and we use the shortcut \(H- \varvec{v}{:=}H- \{\varvec{v}\}\). For \(\varepsilon > 0\), we denote the covering number of \(H\) at scale \(\varepsilon \) with respect to the Euclidean norm by \(\mathcal {N}(H, \varepsilon )\).
The \(L^q\)-norm of a real-valued random variable a is given by \(\Vert a\Vert _{L^q} {:=}(\mathbb {E}[|a|^q])^{1/q}\). We call a sub-Gaussian if
and \(\Vert \cdot \Vert _{\psi _2}\) is called the sub-Gaussian norm. A random vector \(\varvec{a}\) in \(\mathbb {R}^p\) is called sub-Gaussian if \(\Vert \varvec{a}\Vert _{\psi _2} {:=}{\mathrm{sup}}_{\varvec{v}\in \mathbb {S}^{p-1}} \Vert \langle \varvec{a}, \varvec{v}\rangle \Vert _{\psi _2} < \infty \). We say that \(\varvec{a}\) is centered if \(\mathbb {E}[\varvec{a}] = \varvec{0}\), and it is isotropic if \(\mathbb {E}[\varvec{a}\varvec{a}^\mathsf {T}] = \varvec{I}_{p}\). We write \(\varvec{a}\sim \mathsf {N}(\varvec{0}, \varvec{I}_{p})\) if \(\varvec{a}\) is a standard Gaussian random vector in \(\mathbb {R}^p\). For a more detailed introduction to sub-Gaussian random variables and their properties, see [64, Chap. 2 & 3]. Now, let \((\mathcal {H},d)\) be a pseudo-metric space and consider a real-valued stochastic process \(\{a_h\}_{h \in \mathcal {H}}\) on \(\mathcal {H}\). Then, \(\{a_h\}_{h \in \mathcal {H}}\) has sub-Gaussian increments with respect to d if
We say that a function \(f:\mathbb {R}^p \rightarrow \mathbb {R}^{p'}\) is \(\gamma \)-Lipschitz if it is Lipschitz continuous with respect to the Euclidean metric and a Lipschitz constant \(\gamma \ge 0\). The sign of \(v \in \mathbb {R}\) is denoted by \({{\,\mathrm{sign}\,}}(v)\), with the convention that \({{\,\mathrm{sign}\,}}(0) {:=}1\). If \({{\,\mathrm{sign}\,}}(\cdot )\) is applied to a vector, the operation is understood entrywise. The ceiling and floor function of \(v \in \mathbb {R}\) is denoted by \(\lceil v\rceil \) and \(\lfloor v\rfloor \), respectively.
2 Main Result
As preliminary steps, we introduce the formal model setup in Sect. 2.1, followed by our increment conditions on the observation variable in Sect. 2.2. The main result of Theorem 2 is then stated and discussed in Sect. 2.3, while its proof can be found in Sect. 7.
2.1 Model Setup
The following model assumption fixes the notation for the remainder of this section and specifies the instance of Problem 1 that we will study from now on:
Assumption 1
(a) Components of the observation model:

- Measurement vector: a centered, isotropic, sub-Gaussian random vector \(\varvec{a}\in \mathbb {R}^p\) such that \(\Vert \varvec{a}\Vert _{\psi _2} \le L\) for some \(L> 0\). (Footnote 4)

- Signal set: a set \(\mathcal {X}\) (not necessarily a subset of \(\mathbb {R}^p\)).

- Output function: a scalar function \(F:\mathbb {R}^p \times \mathcal {X}\rightarrow \mathbb {R}\) that may be random (not necessarily independent of \(\varvec{a}\)).

- Observation variable: \(\tilde{y}(\varvec{x}) {:=}F(\varvec{a}, \varvec{x})\) for \(\varvec{x}\in \mathcal {X}\).

(b) Components of the measurement process:

- Measurement ensemble: \(\{(\varvec{a}_i, F_i)\}_{i = 1}^m\) are independent copies of the measurement model \((\varvec{a}, F)\).

- Observations: \(\tilde{y}_i(\varvec{x}) {:=}F_i(\varvec{a}_i, \varvec{x})\) for \(i = 1, \dots , m\) and \(\varvec{x}\in \mathcal {X}\). The observation vector of \(\varvec{x}\) is denoted by \(\tilde{\varvec{y}}(\varvec{x}) {:=}(\tilde{y}_1(\varvec{x}), \dots , \tilde{y}_m(\varvec{x})) \in \mathbb {R}^m\).

(c) Components of the recovery procedure:

- Constraint set: a convex subset \(K\subset \mathbb {R}^p\) (not necessarily bounded).

- Target function: a map \(T:\mathcal {X}\rightarrow K\), such that \(T\mathcal {X}\subset K\) is bounded.
Part (a) of Assumption 1 establishes the statistical “template” of our measurement model, from which i.i.d. observations are drawn according to part (b). Importantly, this defines an entire class of observation variables \(\{\tilde{y}(\varvec{x})\}_{\varvec{x}\in \mathcal {X}}\), which corresponds to a real-valued stochastic process. Assumption 1(c) is arguably the most abstract part, but forms a crucial ingredient of our main result: Theorem 2 states an error bound for \(T\varvec{x}\in K\) instead of the actual signal \(\varvec{x}\in \mathcal {X}\) (which might not even belong to \(\mathbb {R}^p\)). The basic linearization strategy behind (\({\mathsf {P}}_{K,\varvec{y}}\)) typically calls for an appropriate transformation of the signal domain. A good example is the special case of Theorem 1, where \(T\) amounts to a rescaling of \(\varvec{x}\) by the model-dependent factor \(\mu \). Such a simple correction is sufficient for most applications considered in Sect. 4; but there exist more complicated situations, such as the variable selection problem in Sect. 4.5, where \(\mathcal {X}\) represents support sets on [p]. At the present level of abstraction, it is useful to think of \(T\varvec{x}\) as a parameterized version of \(\varvec{x}\in \mathcal {X}\) that is compatible with the estimation procedure of (\({\mathsf {P}}_{K,\varvec{y}}\)) and its constraint set \(K\subset \mathbb {R}^p\). (Footnote 5)
In principle, \(T\) can be arbitrary in Assumption 1, but there often exists a canonical choice that is driven by the size of the following model parameter:
Definition 1
Let Assumption 1 be satisfied. Then, we define the target mismatch of \(\varvec{x}\in \mathcal {X}\) by
The meaning of this definition becomes clearer when taking the perspective of statistical learning for a moment. Assuming that \(y_i = \tilde{y}_i(\varvec{x})\), the Lasso estimator (\({\mathsf {P}}_{K,\varvec{y}}\)) can be viewed as an empirical loss minimization problem with “data” \(\{(\varvec{a}_i, y_i)\}_{i = 1}^m\). A central question is then under what conditions its solution can approximate (in a proper sense) the associated expected loss minimization problem (obtained in the infinite sample limit \(m \rightarrow \infty \)):
It is not hard to see that the only critical point of the objective function in (2.1) is the vector \(\varvec{x}^*{:=}\mathbb {E}[\tilde{y}(\varvec{x})\varvec{a}]\), and if \(\varvec{x}^*\in K\), this is the global optimum. Therefore, and as its name suggests, \(\rho _{}(\varvec{x})\) measures the mismatch between the target vector \(T\varvec{x}\) and the (global) expected loss minimizer \(\varvec{x}^*\).
In the context of Theorem 2, the target mismatch can be seen as an upper bound for the asymptotic error of estimating \(T\varvec{x}\) via (\({\mathsf {P}}_{K,\varvec{y}}\)). Consequently, a general rule of thumb is to select \(T\varvec{x}\) such that \(\rho _{}(\varvec{x})\) vanishes or becomes sufficiently small. However, simply setting \(T\varvec{x}{:=}\mathbb {E}[\tilde{y}(\varvec{x})\varvec{a}]\) (if contained in \(K\)) is not necessarily consistent with our wish for signal recovery, e.g., in the application in Sect. 4.2. Therefore, we have left \(T\) unspecified in Assumption 1, even though a proper choice will be obvious for all considered examples.
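For a Gaussian single-index model \(\tilde{y}(\varvec{x}) = f(\langle \varvec{a}, \varvec{x}\rangle )\) with \(\varvec{x}\in \mathbb {S}^{p-1}\), rotation invariance yields \(\varvec{x}^*= \mathbb {E}[f(\langle \varvec{a}, \varvec{x}\rangle )\varvec{a}] = \mu \varvec{x}\) with \(\mu = \mathbb {E}[f(g)g]\), so the target mismatch vanishes for the choice \(T\varvec{x}= \mu \varvec{x}\). The sketch below checks this numerically (with \(f = \tanh \) as an arbitrary example; all names are ours).

```python
import numpy as np

def expected_loss_minimizer(f, x, n=400_000, seed=1):
    """Monte Carlo estimate of x* = E[f(<a, x>) a] for a ~ N(0, I_p).

    For unit-norm x, this should align with mu * x, where mu = E[f(g) g]
    and g ~ N(0, 1); i.e., the target mismatch vanishes for Tx = mu * x.
    Purely illustrative.
    """
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((n, len(x)))
    y = f(a @ x)                          # observation variable F(a, x)
    return (y[:, None] * a).mean(axis=0)  # empirical E[y * a]
```
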
We close this subsection with a short remark about the noise model considered in this work:
Remark 1
(Noise model) We will adopt the adversarial (or worst-case) noise model, which is the common setting in the field of compressed sensing, cf. [21]. More specifically, when interested in recovery of \(\varvec{x}\in \mathcal {X}\), the actual input \(\varvec{y}= (y_1, \dots , y_m)\) for (\({\mathsf {P}}_{K,\varvec{y}}\)) need not exactly correspond to the observation vector \(\tilde{\varvec{y}}(\varvec{x}) = (\tilde{y}_1(\varvec{x}), \dots , \tilde{y}_m(\varvec{x}))\) introduced in Assumption 1(b). (Footnote 6) Instead, arbitrary perturbations are allowed as long as \(d(\varvec{y}, \tilde{\varvec{y}}(\varvec{x})) \lesssim t\) for an appropriate noise metric \(d(\cdot ,\cdot )\) and a desired accuracy t. In Theorem 1, for instance, we have \(\tilde{y}_i(\varvec{x}) = f(\langle \varvec{a}_i, \varvec{x}\rangle )\) and \(y_i = \tilde{y}_i(\varvec{x}) + \nu _i\), while \(d(\cdot ,\cdot )\) corresponds to the normalized \(\ell ^{2}\)-norm. Thus, our overall goal is to show that, given \(\{(\varvec{a}_i, F_i)\}_{i = 1}^m\), recovery is possible for any \(\varvec{x}\in \mathcal {X}\) and any moderate perturbation of its observation \(\tilde{\varvec{y}}(\varvec{x})\).
In this context, it is worth noting that Assumption 1(b) does not require the variables \(\tilde{y}_i(\varvec{x})\) to be deterministic when conditioned on \(\varvec{a}_i\), since \(F_i\) itself may be random. For example, one could model additive random noise in that way, as common in statistical estimation theory. In principle, such a type of noise is also admissible in our main result, Theorem 2, but it is not entirely compatible with the aforementioned goal of uniform recovery. Indeed, the ensemble \(\{(\varvec{a}_i, F_i)\}_{i = 1}^m\) is only drawn once, so that a single noise realization has to be considered for all \(\varvec{x}\in \mathcal {X}\). We therefore mostly stick to the adversarial perspective; see Remark 4(2) in Sect. 3 for a more detailed comparison of both noise models in the situation of linear measurements. Finally, we point out that randomized output functions are nevertheless very useful in the uniform regime, such as in the design of dithering variables (see Sect. 4.2).
2.2 Analysis Strategy and Increment Conditions
From now on, let Assumption 1 be satisfied. In this subsection, we introduce the key condition for our uniform recovery guarantee, namely that the class of observation variables \(\{\tilde{y}(\varvec{x})\}_{\varvec{x}\in \mathcal {X}}\), or at least an approximation thereof, has sub-Gaussian increments. To better understand the relevance of this condition, it is insightful to first take a closer look at our basic analysis strategy for the generalized Lasso:
Note that we hide the dependency on the measurement vectors \(\{\varvec{a}_i\}_{i = 1}^m\) when referring to (\({\mathsf {P}}_{K,\varvec{y}}\)), as this will always be clear from the context. At the present stage, the vector \(\varvec{y}= (y_1, \dots , y_m) \in \mathbb {R}^m\) in (\({\mathsf {P}}_{K,\varvec{y}}\)) is unspecified, but taking the viewpoint of our main result, Theorem 2, is already useful: If \(\varvec{x}\in \mathcal {X}\) is to be recovered, then \(\varvec{y}\) is a noisy version of the observation vector \(\tilde{\varvec{y}}(\varvec{x}) = (\tilde{y}_1(\varvec{x}), \dots , \tilde{y}_m(\varvec{x}))\) as defined in Assumption 1(b); see also Remark 1.
We now derive a simple, yet important criterion for an error bound for a (fixed) signal \(\varvec{x}\in \mathcal {X}\). To this end, let \(\bar{\mathcal {L}}_{\varvec{x}}(\varvec{v}) {:=}\tfrac{1}{m} \sum _{i = 1}^m (y_i - \langle \varvec{a}_i, \varvec{v}+ T\varvec{x}\rangle )^2\) denote the empirical loss of \(\varvec{v}\in \mathbb {R}^p\) over \(\varvec{x}\). For the sake of notational convenience, we have fixed the anchor point \(T\varvec{x}\) here, so that (\({\mathsf {P}}_{K,\varvec{y}}\)) is equivalent to minimizing \(\bar{\mathcal {L}}_{\varvec{x}}(\cdot )\) over \(K- T\varvec{x}\). The excess loss of \(\varvec{v}\in \mathbb {R}^p\) over \(\varvec{x}\) is then defined by
It measures how much the empirical loss changes when moving from \(T\varvec{x}\) in the direction of \(\varvec{v}\). The following fact is an immediate consequence of the convexity of \(K\) and \(\bar{\mathcal {L}}_{\varvec{x}}\); see Fig. 1 for an illustration.
Fact 1
For \(\varvec{x}\in \mathcal {X}\) and a desired reconstruction accuracy \(t > 0\), we set
$$\begin{aligned} K_{\varvec{x},t} {:=}(K- T\varvec{x}) \cap t \mathbb {S}^{p-1} = \{\varvec{v}\in \mathbb {R}^p \ :\varvec{v}+ T\varvec{x}\in K, \ \Vert \varvec{v}\Vert _{2} = t\}. \end{aligned}$$
If \(\mathcal {E}_{\varvec{x}}(\varvec{v}) > 0\) for all \(\varvec{v}\in K_{\varvec{x},t}\), then every minimizer \(\hat{\varvec{z}}\) of (\({\mathsf {P}}_{K,\varvec{y}}\)) satisfies \(\Vert \hat{\varvec{z}}- T\varvec{x}\Vert _{2} \le t\).
For a fixed accuracy \(t > 0\), Fact 1 implies uniform recovery for all those \(\varvec{x}\in \mathcal {X}\) satisfying \(\inf _{\varvec{v}\in K_{\varvec{x},t}} \mathcal {E}_{\varvec{x}}(\varvec{v}) > 0\). This motivates us to establish a uniform lower bound for the excess loss for all \(\varvec{x}\in \mathcal {X}\) and \(\varvec{v}\in K_{\varvec{x},t}\). In particular, such a task is considerably more difficult than establishing such a lower bound over \(K_{\varvec{x},t}\) for a single fixed \(\varvec{x}\in \mathcal {X}\), which would be in accordance with the non-uniform results discussed in the second part of Sect. 1.1. To get a better sense of this challenge, let us consider the following basic decomposition of the excess loss, obtained by expanding the square and writing \(\nu _i {:=}y_i - \tilde{y}_i(\varvec{x})\):
$$\begin{aligned} \mathcal {E}_{\varvec{x}}(\varvec{v}) = \underbrace{\tfrac{1}{m} \sum _{i = 1}^m \langle \varvec{a}_i, \varvec{v}\rangle ^2}_{{=:}\,\mathcal {Q}_{}(\varvec{v})} + \underbrace{\tfrac{2}{m} \sum _{i = 1}^m \big (\langle \varvec{a}_i, T\varvec{x}\rangle - \tilde{y}_i(\varvec{x})\big )\langle \varvec{a}_i, \varvec{v}\rangle }_{{=:}\,\mathcal {M}_{\varvec{x}}(\varvec{v})} - \underbrace{\tfrac{2}{m} \sum _{i = 1}^m \nu _i \langle \varvec{a}_i, \varvec{v}\rangle }_{{=:}\,\mathcal {N}_{\varvec{x}}(\varvec{v})}. \end{aligned}$$
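The following sketch (our own illustration; the dimensions, the 1-bit output function, and the noise level are arbitrary toy choices) verifies numerically that the excess loss equals the sum of a quadratic term \(\tfrac{1}{m}\sum _i \langle \varvec{a}_i, \varvec{v}\rangle ^2\), a multiplier term \(\tfrac{2}{m}\sum _i \xi _i \langle \varvec{a}_i, \varvec{v}\rangle \) with \(\xi _i = \langle \varvec{a}_i, T\varvec{x}\rangle - \tilde{y}_i(\varvec{x})\), and a noise term \(-\tfrac{2}{m}\sum _i \nu _i \langle \varvec{a}_i, \varvec{v}\rangle \) with \(\nu _i = y_i - \tilde{y}_i(\varvec{x})\):

```python
import numpy as np

rng = np.random.default_rng(0)
m, p = 200, 20

A = rng.standard_normal((m, p))      # rows are the measurement vectors a_i
Tx = rng.standard_normal(p)          # anchor point T(x)
y_tilde = np.sign(A @ Tx)            # toy nonlinear observations (1-bit)
nu = 0.05 * rng.standard_normal(m)   # perturbations nu_i = y_i - y_tilde_i
y = y_tilde + nu
v = rng.standard_normal(p)           # direction of travel from T(x)

def emp_loss(w):
    """Empirical loss (1/m) * sum_i (y_i - <a_i, w + Tx>)^2."""
    return np.mean((y - A @ (w + Tx)) ** 2)

excess = emp_loss(v) - emp_loss(np.zeros(p))

xi = A @ Tx - y_tilde                # multiplier variables xi_i(x)
Q = np.mean((A @ v) ** 2)            # quadratic term
M = 2 * np.mean(xi * (A @ v))        # multiplier term
N = -2 * np.mean(nu * (A @ v))       # noise term

assert np.isclose(excess, Q + M + N)
```

The identity holds deterministically for any choice of the data; randomness only enters through the toy instance.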
The quadratic term \(\mathcal {Q}_{}(\varvec{v})\) and noise term \(\mathcal {N}_{\varvec{x}}(\varvec{v})\) are rather unproblematic and can be controlled by a recent matrix deviation inequality (see Step 1 and Step 2 in Sect. 7). Much more intricate is the multiplier term \(\mathcal {M}_{\varvec{x}}(\varvec{v})\), since the underlying multiplier variable \(\xi (\varvec{x}) {:=}\langle \varvec{a}, T\varvec{x}\rangle - \tilde{y}(\varvec{x})\) depends on \(\varvec{x}\), so that we actually have to deal with an empirical product process. A major difficulty is that known concentration results for such product processes do not apply directly, since the class \(\{\xi (\varvec{x})\}_{\varvec{x}\in \mathcal {X}}\) need not have sub-Gaussian increments with respect to an appropriate (pseudo-)metric. For example, this would happen if we dropped the Lipschitz assumption on \(f\) in Theorem 1. The key idea of our approach is to approximate \(\tilde{y}(\varvec{x})\) by a more “regular” observation variable \(\tilde{y}_t(\varvec{x})\) in such a way that the resulting multiplier variable and the approximation error both have sub-Gaussian increments. This is made precise by the following assumption:
Assumption 2
We define a pseudo-metric on \(\mathcal {X}\) by \(d_T(\varvec{x},\varvec{x}') {:=}\Vert T\varvec{x}- T\varvec{x}'\Vert _{2}\) for \(\varvec{x},\varvec{x}' \in \mathcal {X}\). For \(t \ge 0\), assume that there exists a class of observation variables \(\{\tilde{y}_t(\varvec{x})\}_{\varvec{x}\in \mathcal {X}}\) such that the following properties hold:
-
(a)
Approximation error: Setting \(\varepsilon _t(\varvec{x}) {:=}|\tilde{y}(\varvec{x}) - \tilde{y}_t(\varvec{x})|\), we assume that \(\mathbb {E}[\varepsilon _t(\varvec{x}) \cdot |\langle \varvec{a}, \varvec{z}\rangle |] \le \tfrac{t}{64}\) for all \(\varvec{x}\in \mathcal {X}\) and \(\varvec{z}\in \mathbb {S}^{p-1}\).
-
(b)
Multiplier increments: Set \(\xi _t(\varvec{x}) {:=}\langle \varvec{a}, T\varvec{x}\rangle - \tilde{y}_t(\varvec{x})\) for \(\varvec{x}\in \mathcal {X}\). We assume that there exist \(r\ge 0\) and \(L_{t}\ge 0\) such that
$$\begin{aligned} \Vert \xi _t(\varvec{x}) - \xi _t(\varvec{x}')\Vert _{\psi _2} \le L_{t}\cdot d_T(\varvec{x},\varvec{x}') \quad {\text {and}}\quad \Vert \xi _t(\varvec{x})\Vert _{\psi _2} \le r\qquad \text {for all } \varvec{x}, \varvec{x}' \in \mathcal {X}. \end{aligned}$$ -
(c)
Error increments: We assume that there exist \(\hat{r}\ge 0\) and \(\hat{L}_{t}\ge 0\) such that
$$\begin{aligned} \Vert \varepsilon _t(\varvec{x}) - \varepsilon _t(\varvec{x}')\Vert _{\psi _2} \le \hat{L}_{t}\cdot d_T(\varvec{x},\varvec{x}') \quad {\text {and}}\quad \Vert \varepsilon _t(\varvec{x})\Vert _{\psi _2} \le \hat{r}\qquad \text {for all } \varvec{x}, \varvec{x}' \in \mathcal {X}. \end{aligned}$$
Assumption 2(b) and (c) imply that \(\{\xi _t(\varvec{x})\}_{\varvec{x}\in \mathcal {X}}\) and \(\{\varepsilon _t(\varvec{x})\}_{\varvec{x}\in \mathcal {X}}\) have sub-Gaussian increments with respect to \(L_{t}\cdot d_T\) and \(\hat{L}_{t}\cdot d_T\), respectively. Similarly, \(r\) and \(\hat{r}\) bound the (sub-Gaussian) diameter of the respective classes. A convenient interpretation of Assumption 2 is as follows: Choose an approximation \(\tilde{y}_t(\varvec{x})\) for every \(\varvec{x}\in \mathcal {X}\) such that the error does not become too large in the sense of part (a); at the same time, ensure that the increment conditions of part (b) and (c) are satisfied such that the parameters \(L_{t}\) and \(\hat{L}_{t}\) do not grow too fast as t becomes smaller. A remarkable conclusion from our applications to quantized compressed sensing in Sects. 4.1–4.3 is that this strategy can even succeed for observation variables with discontinuous output functions.
Finally, we emphasize that the approximation error \(\varepsilon _t(\varvec{x})\) includes taking the absolute value, which is crucial to our proof (see Step 3 in Sect. 7). In particular, Assumption 2 does not necessarily imply that the original class \(\{\xi (\varvec{x})\}_{\varvec{x}\in \mathcal {X}}\) has sub-Gaussian increments with respect to \({\tilde{L}}\cdot d_T\) for some \({\tilde{L}} \ge 0\). On the other hand, it is certainly possible that \(\{\xi (\varvec{x})\}_{\varvec{x}\in \mathcal {X}}\) already has sub-Gaussian increments, such as in Theorem 1. In this case, we can simply choose \(\tilde{y}_t(\varvec{x}) {:=}\tilde{y}(\varvec{x})\), so that \(\varepsilon _t(\varvec{x}) = 0\) and Assumption 2(a) and (c) are trivially fulfilled.
2.3 Uniform Recovery Guarantee
In order to formulate the main result of this work, we require the notion of Gaussian mean width. This geometric parameter has proven to be a useful complexity measure in high-dimensional signal recovery, e.g., see [2, 12, 43, 54, 57] for pioneering works in that direction. Our approach is no exception and makes use of the following localized version.
Definition 2
Let \(H\subset \mathbb {R}^p\) and \(\varvec{g}\sim \mathsf {N}(\varvec{0}, \varvec{I}_{p})\). The (Gaussian) mean width of \(H\) is given by
$$\begin{aligned} w_{}(H) {:=}\mathbb {E}\Big [\sup _{\varvec{h}\in H} \langle \varvec{g}, \varvec{h}\rangle \Big ]. \end{aligned}$$
For \(t \ge 0\), we define the local mean width of \(H\) (at scale t) by
$$\begin{aligned} w_{t}(H) {:=}{\left\{ \begin{array}{ll} w_{}\big (\tfrac{1}{t} H\cap \mathbb {S}^{p-1}\big ), &{} t > 0, \\ w_{}\big ({\text {cone}}(H) \cap \mathbb {S}^{p-1}\big ), &{} t = 0. \end{array}\right. } \end{aligned}$$
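For intuition, recall that \(w_{}(H)\) is the expected supremum of the Gaussian process \(\langle \varvec{g}, \varvec{h}\rangle \) over \(H\); for a finite set, it can be estimated by Monte Carlo sampling. The following sketch (our own illustration; the set and the sample sizes are arbitrary) uses the vertices of the cross-polytope \(\{\pm \varvec{e}_1, \dots , \pm \varvec{e}_p\}\), whose mean width is known to be of order \(\sqrt{\log p}\):

```python
import numpy as np

rng = np.random.default_rng(1)
p, n_trials = 1000, 2000

# For H = {±e_1, ..., ±e_p} (cross-polytope vertices),
# sup_{h in H} <g, h> reduces to max_i |g_i|.
G = rng.standard_normal((n_trials, p))
w_hat = np.abs(G).max(axis=1).mean()   # Monte Carlo estimate of w(H)

print(w_hat, np.sqrt(2 * np.log(p)))   # comparable, up to lower-order terms
```

This matches the classical bound \(w_{}(H) \le \sqrt{2 \log |H|}\) for finite sets, which here gives \(\sqrt{2\log (2p)}\).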
As a final preparatory step, we introduce two expressions that capture the size of measurement noise: For \(m_0\in \{0,1,\dots ,m\}\) and \(\varvec{w}= (w_1, \dots , w_m) \in \mathbb {R}^m\), let
$$\begin{aligned} \Vert \varvec{w}\Vert _{[m_0]} {:=}\Big (\sum _{i = 1}^{m_0} (w_i^*)^2\Big )^{1/2} \quad {\text {and}}\quad \sigma _{m_0}(\varvec{w})_{2} {:=}\Big (\sum _{i = m_0+ 1}^{m} (w_i^*)^2\Big )^{1/2}, \end{aligned}$$
where \((w_1^*,\dots ,w_m^*)\) is the non-increasing rearrangement of \((|w_1|, \dots , |w_m|)\). Obviously, we have that \(\Vert \varvec{w}\Vert _{[m_0]}^2 + \sigma _{m_0}(\varvec{w})_{2}^2 = \Vert \varvec{w}\Vert _{2}^2\), and in particular, \(\Vert \varvec{w}\Vert _{[0]} = 0\) and \(\sigma _{0}(\varvec{w})_{2} = \Vert \varvec{w}\Vert _{2}\). Moreover, note that \(\Vert \cdot \Vert _{[m_0]}\) simply corresponds to the \(\ell ^{2}\)-norm of the \(m_0\)-largest entries, while \(\sigma _{m_0}(\cdot )_{2}\) is commonly known as the \(\ell ^{2}\)-error of the best \(m_0\)-term approximation.
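Both quantities are straightforward to compute by sorting magnitudes; the following sketch (our own illustration) also verifies the identity \(\Vert \varvec{w}\Vert _{[m_0]}^2 + \sigma _{m_0}(\varvec{w})_{2}^2 = \Vert \varvec{w}\Vert _{2}^2\) and the boundary cases \(m_0= 0\) and \(m_0= m\):

```python
import numpy as np

def head_tail_norms(w, m0):
    """Return (||w||_[m0], sigma_m0(w)_2) via the non-increasing rearrangement."""
    w_star = np.sort(np.abs(w))[::-1]    # non-increasing rearrangement (w_1*, ..., w_m*)
    head = np.linalg.norm(w_star[:m0])   # l2-norm of the m0 largest entries
    tail = np.linalg.norm(w_star[m0:])   # best m0-term approximation error
    return head, tail

rng = np.random.default_rng(2)
w = rng.standard_normal(50)

for m0 in (0, 5, 50):
    head, tail = head_tail_norms(w, m0)
    assert np.isclose(head**2 + tail**2, np.linalg.norm(w)**2)

assert head_tail_norms(w, 0)[0] == 0.0                           # ||w||_[0] = 0
assert np.isclose(head_tail_norms(w, 0)[1], np.linalg.norm(w))   # sigma_0(w)_2 = ||w||_2
```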
We are now ready to state our main recovery guarantee, which forms the basis of all applications presented in Sect. 4; see Sect. 7 for a complete proof.
Theorem 2
There exist universal constants \(c, C > 0\) for which the following holds.
Let Assumptions 1 and 2 be satisfied for a fixed accuracy \(t \ge 0\), and let \(m_0\in \{0, 1, \dots , m\}\). For \(\varDelta >0\), \(u\ge 1\), and \(u_0 \ge \sqrt{m_0\log (em/m_0)}\), we assume that
Then with probability at least \(1 - \exp (-c u^2) - \exp (-c u_0^2)\) on the random draw of \(\{(\varvec{a}_i, F_i)\}_{i = 1}^m\), the following holds uniformly for every \(\varvec{x}\in \mathcal {X}\) with \(\rho _{}(\varvec{x}) \le \tfrac{t}{32}\): Let \(\varvec{y}\in \mathbb {R}^m\) be any input vector such that
$$\begin{aligned} \tfrac{1}{\sqrt{m}}\Vert \varvec{y}- \tilde{\varvec{y}}(\varvec{x})\Vert _{[m_0]} \le \varDelta \quad {\text {and}}\quad \tfrac{1}{\sqrt{m}}\sigma _{m_0}(\varvec{y}- \tilde{\varvec{y}}(\varvec{x}))_{2} \le \tfrac{t}{20}. \end{aligned}$$
Then, every minimizer \(\hat{\varvec{z}}\) of (\({\mathsf {P}}_{K,\varvec{y}}\)) satisfies \(\Vert \hat{\varvec{z}}- T\varvec{x}\Vert _{2} \le t\).
The constraints of (2.4) imply that the estimator (\({\mathsf {P}}_{K,\varvec{y}}\)) is robust against (adversarial) perturbations and outliers in the observation vector \(\tilde{\varvec{y}}(\varvec{x})\). Here, a central role is played by the fine-tuning parameter \(m_0\), controlling which coefficients of the noise vector \(\varvec{y}- \tilde{\varvec{y}}(\varvec{x})\) are captured by the first and second condition in (2.4), respectively. Enlarging \(m_0\) helps to suppress “gross” outliers (that may appear in the \(m_0\) largest coefficients of \(\varvec{y}- \tilde{\varvec{y}}(\varvec{x})\) in magnitude). Indeed, by adjusting the free parameter \(\varDelta \), the first condition of (2.4) becomes less restrictive than the second one (which captures the remaining coefficients of \(\varvec{y}- \tilde{\varvec{y}}(\varvec{x})\)). On the other hand, choosing \(m_0\) and/or \(\varDelta \) too large may have a negative effect on the sample complexity in the second branch of (2.3). The benefit of permitting outliers will become clearer in the context of quantization where the noise is due to wrong bits in \(\varvec{y}\) instead of real-valued perturbations (see Sects. 4.1–4.3). Note that the most common noise model is obtained for \(m_0= 0\): (2.4) then simply corresponds to the baseline condition \(\tfrac{1}{\sqrt{m}}\Vert \varvec{y}- \tilde{\varvec{y}}(\varvec{x})\Vert _{2} \le \tfrac{t}{20}\), which is consistent with standard noise bounds from the literature. We refer to Remark 4(2) in the next section for a more detailed comparison of adversarial and statistical noise, as well as aspects of optimality.
The second important constraint of Theorem 2 is that it does not apply to those \(\varvec{x}\in \mathcal {X}\) with \(\rho _{}(\varvec{x}) > \tfrac{t}{32}\). Hence, the target mismatch \(\rho _{}(\varvec{x})\) can be seen as an upper bound for the asymptotic error when estimating \(T\varvec{x}\) via (\({\mathsf {P}}_{K,\varvec{y}}\)). In particular, the above error bound does not allow for arbitrarily high precision unless \(\rho _{}(\varvec{x}) = 0\).
Let us now turn to the key assumption (2.3), which relates the number of measurements m to the desired accuracy t. The above formulation views m as a function of t, specifying how fast m grows as t decreases. The inverse relationship is also of interest, since it describes the reconstruction error as a function of m. However, this can only be made explicit in specific situations where the parameters on the right-hand side of (2.3) are known or can be well estimated. Of special importance in that respect are the increment parameters \(L_{t}\) and \(\hat{L}_{t}\) from Assumption 2, whose size has a significant impact on the required (over-)sampling rate. For 1-bit observations as studied in Sect. 4.1, for example, we achieve \(L_{t}, \hat{L}_{t}\lesssim t^{-1}\), so that (2.3) would yield an error decay rate of \(O(m^{-1/4})\)—or conversely, an oversampling rate of \(O(t^{-4})\). Remarkably, we will derive a corollary of Theorem 2 in Sect. 5 that can bypass the restrictions of Assumption 2 at the price of a stronger (local) stability condition on the observation model. This allows us to combine our approach with known embedding results from the literature and thereby to obtain near-optimal decay rates in special cases.
The dependence of (2.3) on the complexity parameters \(w_{t}(K- T\mathcal {X})\) and \(w_{}(T\mathcal {X})\) is quite natural. Similar expressions have already appeared in several of the articles discussed in Sect. 1.1, e.g., see [17, 49, 52]. A detailed discussion, including concrete examples and possible simplifications, can be found in Sect. 3. A distinctive feature of Theorem 2 is that the considered version of local mean width \(w_{t}(K- T\mathcal {X})\) “compares” the parameter vectors in \(K\) only to those in the transformed signal set \(T\mathcal {X}\subset K\). This refinement of the common quantity \(w_{t}(K- K)\) can lead to improved guarantees in certain scenarios, namely when \(T\mathcal {X}\) is much smaller than \(K\) (see Example 1 for an illustration of this aspect).
Remark 2
-
(1)
Inversion of \(T\). In principle, the target function \(T\) neither has to be injective nor does it have to be explicitly known to solve (\({\mathsf {P}}_{K,\varvec{y}}\)). However, in order to obtain a practicable statement from Theorem 2, the actual signal \(\varvec{x}\in \mathcal {X}\) should be extractable from an approximation \(\hat{\varvec{z}}\) of \(T\varvec{x}\). An appropriate implementation of \(T\) can differ considerably from situation to situation. In the case of single-index models (see Theorem 1), for example, this would involve a rescaling of \(\hat{\varvec{z}}\), which requires a (rough) knowledge of the nonlinearity \(f\). In contrast, for variable selection (see Sect. 4.5), it can be sufficient to perform a simple hard-thresholding step on \(\hat{\varvec{z}}\) to obtain a good estimate of the underlying support.
-
(2)
Optimality. The best possible error decay rate that can result from Theorem 2 is \(O(m^{-1/2})\), provided that \(L_{t}, \hat{L}_{t}\lesssim 1\). This rate cannot be improved in general, or in other words, the exponent \(-\tfrac{1}{2}\) is not an artifact of our proof but corresponds to a fundamental statistical barrier (see [52, Sec. 4]). However, it is possible to break through this barrier in specific model setups, e.g., for noiseless 1-bit observations [34, 36]. Such superior rates are usually not achieved by (\({\mathsf {P}}_{K,\varvec{y}}\)), but may require a more sophisticated estimator instead.
Remark 3
(Possible extensions) For the sake of clarity, we did not present the most general version of Theorem 2 that we could have derived. There are several variants and generalizations that we expect to be practicable:
-
(1)
Anisotropic measurements. Based on an observation from [51, Rmk. 1.7], [22, Sec. II.D], the isotropy of the measurement vectors in Assumption 1(a) can be relaxed to any positive definite covariance matrix \(\varSigma \in \mathbb {R}^{p \times p}\). To see this, let us assume that \(\varvec{a}\) is centered and sub-Gaussian with \(\mathbb {E}[\varvec{a}\varvec{a}^\mathsf {T}] = \varSigma \). Then, we may write \(\varvec{a}= \sqrt{\varSigma } {\bar{\varvec{a}}}\) where \({\bar{\varvec{a}}}\) is isotropic, centered, and sub-Gaussian. If \({\bar{\varvec{a}}}_1, \dots , {\bar{\varvec{a}}}_m\) are i.i.d. copies of \({\bar{\varvec{a}}}\), a simple reformulation of (\({\mathsf {P}}_{K,\varvec{y}}\)) yields
$$\begin{aligned} \mathop {{{\,\mathrm{argmin}\,}}}\limits _{\varvec{z}\in K} \ \tfrac{1}{m} \sum _{i = 1}^m (y_i - \langle \varvec{a}_i, \varvec{z}\rangle )^2 = \sqrt{\varSigma }^{-1} \cdot \mathop {{{\,\mathrm{argmin}\,}}}\limits _{{\bar{\varvec{z}}} \in \sqrt{\varSigma }K} \ \tfrac{1}{m} \sum _{i = 1}^m (y_i - \langle {\bar{\varvec{a}}}_i, {\bar{\varvec{z}}}\rangle )^2. \end{aligned}$$In other words, (\({\mathsf {P}}_{K,\varvec{y}}\)) is equivalent to a modified estimator that operates with an isotropic measurement ensemble. Therefore, we may apply Theorem 2 directly to the latter one, when replacing \(K\) and \(T(\cdot )\) by \(\sqrt{\varSigma }K\) and \(\sqrt{\varSigma } T(\cdot )\) in Assumption 1, respectively. Note that these adaptations are of a purely theoretical nature and in practice, no explicit knowledge of \(\varSigma \) is required. In particular, this procedure still implies an error bound for minimizers \(\hat{\varvec{z}}\) of (\({\mathsf {P}}_{K,\varvec{y}}\)):
$$\begin{aligned} \Vert \hat{\varvec{z}}- T\varvec{x}\Vert _{2} \le \Vert \sqrt{\varSigma }^{-1}\Vert _{2\rightarrow 2} \cdot \Vert \sqrt{\varSigma }\hat{\varvec{z}}- \sqrt{\varSigma }T\varvec{x}\Vert _{2} \le \Vert \sqrt{\varSigma }^{-1}\Vert _{2\rightarrow 2} \cdot t. \end{aligned}$$Similarly, one can conveniently control the adapted complexity parameters (see [51, Rmk. 1.7], [22, Sec. II.D]):
$$\begin{aligned} w_{t}(\sqrt{\varSigma }K- \sqrt{\varSigma }T\mathcal {X})&\le \max \big \{1, \Vert \sqrt{\varSigma }^{-1}\Vert _{2\rightarrow 2}\big \} \cdot \Vert \sqrt{\varSigma }\Vert _{2\rightarrow 2} \cdot w_{}(\tfrac{1}{t}(K- T\mathcal {X}) \cap \mathbb {B}_2^p), \\ w_{}(\sqrt{\varSigma }T\mathcal {X})&\lesssim \Vert \sqrt{\varSigma }\Vert _{2\rightarrow 2} \cdot w_{}(T\mathcal {X}). \end{aligned}$$Overall, we can conclude that accurate (uniform) recovery is also possible with anisotropic measurements, as long as the (unknown) covariance matrix is well conditioned.
-
(2)
Even output functions. There exist certain types of nonlinear models that are not directly compatible with the estimator (\({\mathsf {P}}_{K,\varvec{y}}\)). For example, if \(f:\mathbb {R}\rightarrow \mathbb {R}\) is an even function in Theorem 1, it turns out that \(\mu = \mathbb {E}_{g\sim \mathsf {N}(0, 1)}[f(g)g] = 0\). In other words, (\({\mathsf {P}}_{K,\varvec{y}}\)) would simply approximate the zero vector, which is clearly not what one is aiming at. Notably, this scenario includes the important phase-retrieval problem \(f(\cdot ) = |\cdot |\). Fortunately, the linearization idea behind (\({\mathsf {P}}_{K,\varvec{y}}\)) is still applicable in these situations, when combined with the phase-lifting trick [9]:
where \(\langle \cdot , \cdot \rangle _F\) denotes the Hilbert–Schmidt inner product and \(K\supset \{\varvec{x}\varvec{x}^\mathsf {T}\ :\varvec{x}\in \mathcal {X}\}\) is a convex subset of the positive semi-definite cone in \(\mathbb {R}^{p \times p}\). The recovery performance of (\({\mathsf {P}}_{K,\varvec{y}}^{\text {lifted}}\)) for even output functions has recently been analyzed by the first author in [25, Sec. 3.5], inspired by earlier findings of Thrampoulidis and Rawat [61]. A key challenge in this regard is that the lifted measurement vectors/matrices \(\varvec{a}_i\varvec{a}_i^\mathsf {T}- \mathbb {E}[\varvec{a}_i\varvec{a}_i^\mathsf {T}]\) are heavier tailed than \(\varvec{a}_i\), which in turn has required deeper insights from generic chaining theory. We expect that our results are extendable in this direction, but with a technical effort that would go beyond the scope of this work.
-
(3)
Different metrics. An extension that is specific to uniform recovery concerns the choice of pseudo-metric in Assumption 2. In principle, \(d_T\) could be replaced by an arbitrary pseudo-metric d on the signal set \(\mathcal {X}\). The claim and proof of Theorem 2 would still hold true if the (global) complexity of \(\mathcal {X}\) is captured by \(\gamma _2(\mathcal {X}, d)\) instead of \(w_{}(T\mathcal {X})\); see Step 3 in Sect. 7. While this would make our approach slightly more flexible, we note that Talagrand’s \(\gamma _2\)-functional is difficult to control in general. Apart from that, the \(\ell ^{2}\)-norm as error metric in Theorem 2 could be replaced by an arbitrary semi-norm \(\Vert \cdot \Vert \). Specifically, this step would lead to an error bound in terms of \(\Vert \cdot \Vert \) in Fact 1, but it also entails an adaption of the spherical intersection in the local mean width (see Definition 2). Finally, different convex loss functions are feasible for (\({\mathsf {P}}_{K,\varvec{y}}\)) as well [22], although such an extension would require several additional technical assumptions.
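The reformulation in Remark 3(1) can be checked numerically in the unconstrained case \(K= \mathbb {R}^p\) (a minimal sketch with our own toy choices; \(\varSigma \) is taken diagonal so that its square root and inverse are immediate):

```python
import numpy as np

rng = np.random.default_rng(4)
m, p = 300, 8

# A well-conditioned covariance matrix Sigma and its square root.
S = np.diag(rng.uniform(0.5, 2.0, size=p))
S_sqrt = np.sqrt(S)                       # elementwise sqrt works for diagonal S

A_bar = rng.standard_normal((m, p))       # isotropic measurements a_bar_i
A = A_bar @ S_sqrt                        # anisotropic: a_i = sqrt(Sigma) a_bar_i
y = rng.standard_normal(m)                # arbitrary input vector

# Unconstrained least squares, once with A and once with the whitened A_bar:
z_hat, *_ = np.linalg.lstsq(A, y, rcond=None)
z_bar, *_ = np.linalg.lstsq(A_bar, y, rcond=None)

# argmin over K equals sqrt(Sigma)^{-1} times the argmin over sqrt(Sigma) K.
assert np.allclose(z_hat, np.linalg.inv(S_sqrt) @ z_bar)
```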
3 (Non-)Uniformity and Signal Complexity
In this section, we highlight important aspects that are specifically attributable to uniform recovery, compared to existing non-uniform results. This includes a more detailed discussion of the local mean width \(w_{t}(K- T\mathcal {X})\), which appears as the central complexity parameter in our approach.
Except for the inclusion \(T\mathcal {X}\subset K\), Theorem 2 does not impose any restrictions on the signal set \(\mathcal {X}\). If \(T\) is the identity function, the choice of \(\mathcal {X}\) may range from a singleton \(\mathcal {X}= \{\varvec{x}\}\) to the entire constraint set \(\mathcal {X}= K\). The latter scenario corresponds to uniform recovery of all signals in \(K\), while the former means that one is interested in a specific \(\varvec{x}\in K\)—which is nothing other than non-uniform recovery. In particular, due to \(w_{}(\{\varvec{x}\}) = 0\), Theorem 2 is consistent with previously known non-uniform guarantees, e.g., see [23, Thm. 3.6] and [51, Thm 1.9]. Our main result therefore also indicates what additional costs may arise in the case of uniform recovery.
Let us first focus on the role of nonlinear observations, whose impact is reflected by the third branch of (2.3) in Theorem 2. The key quantities in this respect are the increment parameters \(L_{t}\) and \(\hat{L}_{t}\). Their size strongly depends on the type of output function \(F\) considered in Assumption 1. For instance, a jump discontinuity in \(F\) can cause a weaker oversampling rate, in the sense that \(L_{t}, \hat{L}_{t}\lesssim t^{-1}\); we refer to Sects. 4.1–4.4 for concrete examples and Sect. 5 for a possible improvement. In the non-uniform case (\(\mathcal {X}= \{\varvec{x}\}\)), on the other hand, this issue becomes irrelevant because one can simply set \(L_{t}= \hat{L}_{t}= 0\) (since Assumption 2(b) and (c) need only be satisfied for a single point). Thus, we can draw the following informal conclusion:
The transition to uniform recovery with nonlinear output functions may result in a worse oversampling rate (with respect to t).
The second foundational aspect of uniformity concerns the geometric complexity of the constraint set \(K\) and its interplay with the actual signal set \(\mathcal {X}\). To keep our exposition as clear as possible, we restrict ourselves to the situation of linear observations (though most arguments naturally carry over to nonlinear models). Applying Theorem 2 to this special case leads to the following recovery guarantee (see Sect. 10 for a proof). Note that we intentionally allow for two different types of noise here, based on the adversarial (\(\sim \nu _i\)) and statistical model (\(\sim \tau _i\)); see Remark 4(2) for further discussion.
Corollary 1
There exist universal constants \(c, C > 0\) for which the following holds.
Let \(\varvec{a}_1, \dots , \varvec{a}_m \in \mathbb {R}^p\) be independent copies of a centered, isotropic, sub-Gaussian random vector \(\varvec{a}\in \mathbb {R}^p\) with \(\Vert \varvec{a}\Vert _{\psi _2} \le L\). Let \(\tau _1, \dots , \tau _m\) be independent copies of \(\tau \sim \mathsf {N}(0, \sigma ^2)\) for some \(\sigma \ge 0\) (also independent of \(\{\varvec{a}_i\}_{i=1}^m\)). Moreover, let \(\mathcal {X}\subset \mathbb {R}^p\) be a bounded subset and let \(K\subset \mathbb {R}^p\) be a convex set such that \(\mathcal {X}\subset K\). For \(u\ge 1\) and \(t \ge 0\), we assume that
Then with probability at least \(1 - \exp (-c u^2)\) on the random draw of \(\{(\varvec{a}_i, \tau _i)\}_{i = 1}^m\), the following holds uniformly for all \(\varvec{x}\in \mathcal {X}\): Let \(\varvec{y}= (y_1, \dots , y_m) \in \mathbb {R}^m\) be given by
$$\begin{aligned} y_i = \langle \varvec{a}_i, \varvec{x}\rangle + \tau _i + \nu _i, \qquad i = 1, \dots , m, \end{aligned}$$
such that \(\big (\tfrac{1}{m}\sum _{i = 1}^m \nu _i^2 \big )^{1/2} \le \tfrac{t}{20}\). Then, every minimizer \(\hat{\varvec{z}}\) of (\({\mathsf {P}}_{K,\varvec{y}}\)) satisfies \(\Vert \hat{\varvec{z}}- \varvec{x}\Vert _{2} \le t\).
Except for the \(\sigma \)-dependent summand in (3.1), the above statement does not involve any oversampling factors and it turns into an exact recovery guarantee for \(t = 0\) and \(\sigma = 0\). Although Corollary 1 bears resemblance to standard results from compressed sensing, it comes with a distinctive feature: The required number of measurements in (3.1) is determined by the local mean width \(w_{t}(K- \mathcal {X})\). As such, this parameter is quite implicit, so that an informative (upper) bound is required for any specific choice of \(K\) and \(\mathcal {X}\). The remainder of this section is devoted to this quest.
Arguably the most popular low-complexity model is sparsity in conjunction with an \(\ell ^{1}\)-relaxation. In our context, this corresponds to the following scenario:
where \(s \in [p]\) and \(R > 0\); note that the constraint \(\Vert \varvec{x}\Vert _{1} = R\) is a specific tuning condition for the estimator (\({\mathsf {P}}_{K,\varvec{y}}\)) that could be further relaxed (see Remark 4). In this special case, one can derive a convenient bound for the local mean width:
Remarkably, this observation dates back to one of the earliest applications of Gordon’s escape through the mesh theorem [30] to signal recovery [54]; see Sect. 10 for a proof of (3.3). A combination of (3.3) and Corollary 1 ensures that uniform (exact) reconstruction is feasible with \(O(s \cdot \log (2p/s))\) Gaussian measurements. Therefore, our approach is indeed well in line with standard sparse recovery guarantees in compressed sensing, which are typically based on restricted isometry (cf. [21]).
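For a concrete numerical illustration (our own sketch, not the paper's code; the problem sizes are arbitrary and m is chosen comfortably above \(s \cdot \log (2p/s)\)), the estimator (\({\mathsf {P}}_{K,\varvec{y}}\)) with \(K\) an \(\ell ^{1}\)-ball can be solved by projected gradient descent, where each iterate is projected back onto the ball:

```python
import numpy as np

def project_l1(v, R):
    """Euclidean projection onto the l1-ball of radius R (sort-based algorithm)."""
    if np.abs(v).sum() <= R:
        return v.copy()
    u = np.sort(np.abs(v))[::-1]
    css = np.cumsum(u)
    idx = np.arange(1, len(u) + 1)
    rho = np.nonzero(u * idx > css - R)[0][-1]
    theta = (css[rho] - R) / (rho + 1)
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

rng = np.random.default_rng(5)
p, s, m = 200, 5, 160

x = np.zeros(p)                           # s-sparse ground truth
support = rng.choice(p, size=s, replace=False)
x[support] = rng.standard_normal(s)
R = np.abs(x).sum()                       # tuned constraint ||z||_1 <= R

A = rng.standard_normal((m, p))
y = A @ x                                 # noiseless linear observations

# Projected gradient descent on (1/m)||y - Az||_2^2 over the l1-ball.
eta = m / (2 * np.linalg.norm(A, 2) ** 2)   # step size 1/L
z = np.zeros(p)
for _ in range(20_000):
    z = project_l1(z - eta * (2 / m) * A.T @ (A @ z - y), R)

print(np.linalg.norm(z - x))              # near-exact recovery
```

The step size is the inverse of the smoothness constant \(L = 2\Vert \varvec{A}\Vert _{2\rightarrow 2}^2/m\) of the quadratic loss; any step size below \(1/L\) also works, at the cost of slower convergence.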
Beyond this classical example, the situation becomes considerably more challenging and forms an important research subject in its own right. A particular obstacle due to uniformity is that \(w_{t}(K- \mathcal {X})\) heavily depends on the size and shape of the signal set \(\mathcal {X}\). On a technical level, we have to deal with a (possibly uncountable) union of spherical intersections:
Fortunately, this cumbersome expression can be controlled through a much more convenient upper bound, as shown in the following proposition. To the best of our knowledge, this is a new result, which could be of independent interest. Its proof can be found in Sect. 10 and is based on a covering argument.
Proposition 1
Let \(K, \mathcal {X}\subset \mathbb {R}^p\) and \(t > 0\). Then,
Moreover, for \(\varvec{x}\in \mathcal {X}\) and \(K\) convex, we have that \(\tilde{w}_{t}(K- \varvec{x}) \lesssim w_{0}(K- {\tilde{\varvec{x}}}) + 1\) for every \({\tilde{\varvec{x}}} \in K\) with \(\Vert \varvec{x}- {\tilde{\varvec{x}}}\Vert _{2} \le t\).
The upper bound in (3.4) implies a significant simplification: The first term is fully localized, measuring the complexity of \(K\) with respect to each individual point \(\varvec{x}\in \mathcal {X}\), whereas the second term accounts for the global size of \(\mathcal {X}\) (independently of the constraint set \(K\)). In other words, the effect of \(K\) and \(\mathcal {X}\) is now decoupled. A second notable fact is that the local mean width \(\tilde{w}_{t}(K- \varvec{x})\) and its conic counterpart \(w_{0}(K- \varvec{x})\) are well-known complexity parameters from non-uniform recovery results, e.g., see [2, 12, 23, 51, 63]. Hence, Proposition 1 allows us to transfer any corresponding bound from the literature to the uniform regime. The presence of the global mean width \(w_{}(\mathcal {X})\) as an additional expense for uniformity appears natural to us. However, this expression also entails an oversampling factor of \(t^{-1}\), which prevents perfect reconstruction (when applied to Corollary 1). Given the \(\ell ^{1}\)-special case in (3.3), we suspect that this is an artifact of our proof, and it remains an open question whether such a factor could be removed in general. Our overall conclusion is as follows:
Regarding signal complexity, the transition to uniform recovery requires control over the total size of the set \(\mathcal {X}\), measured by \(w_{}(\mathcal {X})\). The constraint set \(K\) only appears in terms of local complexity, precisely as in the non-uniform case.
As the above argumentation is fairly abstract, it is insightful to illustrate our approach by a concrete example. In this context, we will also highlight the importance of carefully “designed” signal sets, in the sense that \(\mathcal {X}\) (and even its convex hull) is much smaller than \(K\).
Example 1
(Total variation in 1D) We consider the situation of s-gradient-sparse signals in one spatial dimension, i.e., for \(s \in [p-1]\) fixed, it is assumed that \(\Vert \nabla \varvec{x}\Vert _{0} \le s\), where
$$\begin{aligned} \nabla :\mathbb {R}^p \rightarrow \mathbb {R}^{p-1}, \quad \varvec{x}\mapsto (x_2 - x_1, \dots , x_p - x_{p-1}) \end{aligned}$$
is a discrete gradient operator. Geometrically, this condition simply means that the vector \(\varvec{x}\) is piecewise constant with at most s jump discontinuities. An \(\ell ^{1}\)-relaxation of gradient sparsity then leads to the common total variation (TV) model [55], which in our case corresponds to the constraint set \(K= \{ \varvec{x}\in \mathbb {R}^p \ :\Vert \nabla \varvec{x}\Vert _{1} \le R \}\) for some \(R > 0\).
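For illustration (our own sketch with arbitrary values), gradient sparsity and the TV constraint are easy to evaluate via first-order differences:

```python
import numpy as np

# A piecewise constant signal with 3 jump discontinuities (p = 100).
x = np.concatenate([np.full(25, 0.0), np.full(30, 1.5),
                    np.full(20, -0.5), np.full(25, 2.0)])

grad_x = np.diff(x)                    # discrete gradient, (grad x)_i = x_{i+1} - x_i
s = np.count_nonzero(grad_x)           # gradient sparsity ||grad x||_0
tv = np.abs(grad_x).sum()              # TV constraint value ||grad x||_1
jumps = np.nonzero(grad_x)[0]          # jump positions supp(grad x)

assert s == 3
assert np.isclose(tv, 1.5 + 2.0 + 2.5)
```

The jump positions in `jumps` are exactly the quantities whose spacing enters the \(\varDelta \)-separation condition below.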
A recent work of the first author [27] has demonstrated that the reconstruction capacity of the TV model does not only depend on the number of jump discontinuities but also strongly on their position. More specifically, we say that an s-gradient-sparse signal \(\varvec{x}\) is \(\varDelta \)-separated for some \(\varDelta \in (0, 1]\) if
where \({{\,\mathrm{supp}\,}}(\nabla \varvec{x}) = \{\nu _1,\dots ,\nu _{s}\}\) with \(0 {=:}\nu _0< \nu _1< \dots< \nu _{s} < \nu _{s+1} {:=}p\). Intuitively, the constant \(\varDelta \) measures how much the jump positions in \(\varvec{x}\) deviate from an equidistant pattern (where one would have \(\varDelta = 1\)). Assuming that \(\varDelta \in (0, 1]\) is fixed (independently of p and s), we say that \(\varvec{x}\) is well separated; see [27, Sec. 2.1] for more details. This motivates us to consider the following signal set:
Similarly to (3.2), the constraint \(\Vert \nabla \varvec{x}\Vert _{1} = R\) results from a tuning condition for (\({\mathsf {P}}_{K,\varvec{y}}\)) (see Remark 4).
A deep result from [27, Thm. 2.10] now yields the following estimate for the conic mean width:
For the global mean width, we can make use of a bound derived by Krahmer et al. [38, Sec. 3.2]:
Note that (3.7) would also hold without the \(\varDelta \)-separation condition, but it exploits that \(\mathcal {X}\subset \mathbb {B}_2^p\). Combining (3.6) and (3.7) with Proposition 1, we finally obtain
Therefore, according to Corollary 1, uniform recovery of well-separated s-gradient-sparse signals becomes feasible with \(O(s \cdot \log ^2(p))\) Gaussian measurements. In particular, we have shown that there is no qualitative gap to the non-uniform guarantee derived in [27].
Let us emphasize that, although the previous bounds look quite simple, their proofs build on fundamental insights into the TV model. For example, relying only on the basic estimate \(w_{t}(K- \mathcal {X}) \lesssim t^{-1} \cdot w_{}(K)\) would not help much, since it is unclear how \(w_{}(K)\) scales compared to \(w_{}(\mathcal {X})\), cf. [38, Sec. 3.3]. Even more notable is the fact that the above argument would break down if the \(\varDelta \)-separation condition is omitted. Indeed, Cai and Xu [8] have shown that uniform recovery of all s-gradient-sparse signals (including pathological examples) via TV minimization is impossible with less than \(\varOmega (\sqrt{s \cdot p})\) Gaussian measurements. Thus, a striking gap emerges between the uniform and non-uniform regime in this scenario. Remarkably, the technical approach of [8] is based on an adapted nullspace property, for which it is not obvious how to incorporate additional signal structure such as \(\varDelta \)-separation. The presented example on the TV model therefore also underscores the merits of a geometric complexity analysis as proposed in our work. \(\square \)
While gradient sparsity seems only slightly more involved than standard sparsity (with respect to an orthonormal basis), the above considerations have revealed many subtleties. To a certain extent, Example 1 is just the “tip of the iceberg,” and similar phenomena apply in the more general context of the analysis- and synthesis-\(\ell ^{1}\)-model. An in-depth discussion would go beyond the scope of the present paper, and we refer the interested reader to the related works [26, 40]. The actual novelty of Proposition 1 is that these previous findings may also extend to uniform recovery—a step that had not been addressed so far.
We close this section with a remark on stable and robust recovery:
Remark 4
(1)
Tuning and stability. The equality constraint for \(\mathcal {X}\) in (3.2) and (3.5) ensures that every \(\varvec{x}\in \mathcal {X}\) lies on the boundary of \(K\). Such a technical step is important for bounds that are based on the conic mean width, such as (3.3) and (3.6); if \(\varvec{x}\) were to lie in the interior of \(K\), then \({\text {cone}}(K- \varvec{x}) = \mathbb {R}^p\) and therefore \(w_{0}(K- \varvec{x}) \asymp \sqrt{p}\). From an algorithmic perspective, this can be seen as a tuning condition for the Lasso estimator (\({\mathsf {P}}_{K,\varvec{y}}\)). The “moreover”-part of Proposition 1 offers a convenient relaxation in this respect: When interested in the local complexity of some point \(\varvec{x}\in \mathcal {X}\subset K\), one may instead consider a nearby point \({\tilde{\varvec{x}}}\) whose conic mean width \(w_{0}(K- {\tilde{\varvec{x}}})\) is smaller (i.e., non-trivial). The price for this simplification is a recovery error in the order of \(\Vert \varvec{x}- {\tilde{\varvec{x}}}\Vert _{2}\). This trade-off is closely related to the idea of stability (or compressibility) in compressed sensing theory [21].
(2)
Robustness and optimality. The statement of Corollary 1 allows for a comparison of the statistical and adversarial noise model (cf. Remark 1). In the absence of worst-case perturbations (i.e., \(\nu _i = 0\)), the condition (3.1) can be translated into an error decay rate of \(O(\sigma \cdot m^{-1/2})\). Thus, (\({\mathsf {P}}_{K,\varvec{y}}\)) becomes a consistent estimator of \(\varvec{x}\) in the ordinary sense of statistics, i.e., \(\Vert \hat{\varvec{z}}- \varvec{x}\Vert _{2} \rightarrow 0\) in probability as \(m \rightarrow \infty \). Remarkably, the asymptotic error scaling with respect to the noise parameter \(\sigma \) and sample size m cannot be further improved in general—it is minimax optimal for linear observations, e.g., see [52, Sec. 4]. Compared to random noise, the adversarial model is more general, since the \(\nu _i\) may encode any type of perturbation, even deterministic ones. Given the \(\ell ^{2}\)-constraint in Corollary 1,
$$\begin{aligned} \big (\textstyle \tfrac{1}{m}\sum _{i = 1}^m \nu _i^2 \big )^{1/2} \le \tfrac{t}{20}, \end{aligned}$$(3.8) it turns out that a recovery error in the order of \(O(t)\) is essentially optimal. Indeed, there exist instances of Corollary 1 where the \(\nu _i\) can be selected such that (3.8) holds with high probability (over \(\{\varvec{a}_i\}_{i = 1}^m\)) and one has that \(\Vert \hat{\varvec{z}}- \varvec{x}\Vert _{2} \asymp t\).Footnote 10 The difference from the statistical regime becomes clearer when considering random noise from an adversarial perspective, i.e., \(\nu _i = \tau _i \sim \mathsf {N}(0, \sigma ^2)\). Then, it is not hard to see that with high probability, the constraint (3.8) is only attainable with \(t \gtrsim \sigma \), which prevents arbitrarily high precision in terms of t. In particular, Corollary 1 would not certify the consistency of (\({\mathsf {P}}_{K,\varvec{y}}\)) anymore, contrary to the argument in the previous paragraph. Let us emphasize that uniformity does not play a special role in these matters, although the statistical noise model appears somewhat unnatural in this regime (see Remark 1). In principle, our conclusions also carry over to the abstract setting of Theorem 2, but the interpretation of the term “noise” becomes more subtle there. The reason is that the estimator (\({\mathsf {P}}_{K,\varvec{y}}\)) basically treats nonlinear distortions as if they were uncorrelated statistical noise (cf. [51]), even if the output function is completely deterministic. Seen from this angle, better decay rates than \(O(\sigma \cdot m^{-1/2})\) are possible, which, however, strongly depends on the specific model and estimator (see Remark 2(2)).
4 Applications and Examples
This section demonstrates how to derive “out-of-the-box” guarantees from Theorem 2 for specific observation models. Sects. 4.1–4.3 are devoted to applications to quantized compressed sensing. Sect. 4.4 then revisits the case of single-index models, including new applications to modulo measurements and coordinate-wise distortions. Finally, Sect. 4.5 is concerned with a conceptually different example on the problem of variable selection. Note that all proofs are deferred to Sect. 8.
4.1 1-Bit Observations
As already indicated in Sect. 1.1, the most basic version of 1-bit compressed sensing asks for the reconstruction of signals \(\varvec{x}\in \mathcal {X}\subset \mathbb {R}^p\) from binary observations \(\varvec{y}\in \{-1,1\}^m\) of the form
Here, the noise variables \(\varvec{\nu }{:=}(\nu _1, \dots , \nu _m)\) may take values in \(\{-2, 0, 2\}\), modeling possible distortions of the linear measurement process before quantization and/or bit flips during quantization. Importantly, the magnitude of \(\varvec{x}\) gets lost in (4.1) due to the scaling invariance of the \({{\,\mathrm{sign}\,}}\)-function. Thus, the best one can hope for is to reconstruct the direction of \(\varvec{x}\).
The model of (4.1) is particularly compatible with Gaussian measurement vectors—a related result on sub-Gaussian measurements can be found in the next subsection. Indeed, choosing the target function \(T\varvec{x}\) proportionally to \(\varvec{x}/ \Vert \varvec{x}\Vert _{2}\), one can show that the target mismatch \(\rho _{}(\varvec{x})\) vanishes for every \(\varvec{x}\in \mathcal {X}\). Hence, uniform recovery is possible up to arbitrarily high precision. The following guarantee makes this claim precise and is an application of Theorem 2 to Gaussian 1-bit observations. Its proof in Sect. 8.1 will demonstrate the usefulness of the approximation condition from Assumption 2, see also Fig. 2 for an illustration of the underlying argument.
Corollary 2
There exist universal constants \(c, c_0, C' > 0\) for which the following holds.
Let \(\varvec{a}_1, \dots , \varvec{a}_m \in \mathbb {R}^p\) be independent copies of a standard Gaussian random vector \(\varvec{a}\sim \mathsf {N}(\varvec{0}, \varvec{I}_{p})\). Let \(\mathcal {X}\subset \mathbb {R}^p\) and define \(T\varvec{x}{:=}\sqrt{\tfrac{2}{\pi }}\tfrac{\varvec{x}}{\Vert \varvec{x}\Vert _{2}}\) for \(\varvec{x}\in \mathcal {X}\). Moreover, let \(K\subset \mathbb {R}^p\) be a convex set such that \(T\mathcal {X}\subset K\). For \(u\ge 1\) and \(t\in (0,1]\), we assume that
Finally, let \(\beta \in [0,1]\) be such that \(\beta \sqrt{\log (e/\beta )}\le c_0 t\). Then with probability at least \(1 - \exp (-c u^2)\) on the random draw of \(\{\varvec{a}_i\}_{i = 1}^m\), the following holds uniformly for all \(\varvec{x}\in \mathcal {X}\): Let \(\varvec{y}\in \{-1,1\}^m\) be given by (4.1) such that \(\tfrac{1}{2m}\Vert \varvec{\nu }\Vert _{1} \le \beta \). Then, every minimizer \(\hat{\varvec{z}}\) of (\({\mathsf {P}}_{K,\varvec{y}}\)) satisfies \(\big \Vert \hat{\varvec{z}}- \sqrt{\tfrac{2}{\pi }}\tfrac{\varvec{x}}{\Vert \varvec{x}\Vert _{2}}\big \Vert _{2} \le t\).
Corollary 2 is in line with some of the early achievements in 1-bit compressed sensing [48, 49]. Remarkably, the condition (4.2) translates into an error decay rate of \(O(m^{-1/4})\) in the uniform case, which improves the original rate of \(O(m^{-1/12})\) established by Plan and Vershynin in [49, Thm. 1.3]. In fact, we are not aware of any result in the literature that implies the statement of Corollary 2. But we stress that the above oversampling rate is still not optimal (see [34]) and can be further improved with a more specialized argument relying on random hyperplane tessellations (see Corollary 8 in Sect. 5). The noise constraint of Corollary 2 simply means that the fraction of wrong input bits \(\tfrac{1}{2m}\Vert \varvec{\nu }\Vert _{1}\) must not exceed \(\beta \), while the latter may be in the order of t (up to a log-factor). This condition again significantly improves [49, Thm. 1.3] and is a particular consequence of the outlier robustness established in Theorem 2; see also [17] for a similar achievement in the situation of dithered observations.
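To make the setting more tangible, the following small numerical sketch runs the constrained least-squares estimator on noiseless Gaussian 1-bit measurements. All parameter choices are hypothetical; the constraint set K is taken as the Euclidean unit ball (admissible here, since \(\Vert T\varvec{x}\Vert _{2} = \sqrt{2/\pi } < 1\)), and projected gradient descent is used as one generic solver for the resulting program:

```python
import numpy as np

rng = np.random.default_rng(0)
m, p = 4000, 20

x = rng.standard_normal(p)
x /= np.linalg.norm(x)                  # unknown signal direction on the sphere
A = rng.standard_normal((m, p))         # standard Gaussian measurement vectors
y = np.sign(A @ x)                      # noiseless 1-bit observations

def proj_unit_ball(z):
    n = np.linalg.norm(z)
    return z if n <= 1.0 else z / n

# Projected gradient descent for min_{||z||_2 <= 1} ||y - A z||_2^2.
z = np.zeros(p)
step = 1.0 / np.linalg.norm(A, 2) ** 2  # step ~ 1/L for the smooth part
for _ in range(300):
    z = proj_unit_ball(z - step * (A.T @ (A @ z - y)))

target = np.sqrt(2.0 / np.pi) * x       # the target vector Tx from Corollary 2
print(np.linalg.norm(z - target))       # small, and shrinks as m grows
```

Note that the estimator only sees the sign pattern y; no knowledge of the nonlinearity enters the optimization.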
4.2 1-Bit Observations with Dithering
According to Ai et al. [1], the conclusion of Corollary 2 cannot be extended to sub-Gaussian measurements in general, regardless of the considered reconstruction method. A practicable remedy is the technique of dithering, which, in its most basic form, corresponds to a random shift of the quantization threshold. Originating from quantized signal processing, e.g., see [13, 31, 32], its benefits have recently also emerged in compressed sensing theory [3, 17, 18, 37, 62, 65]. For more background information, we refer to [14] and the references therein.
Extending the original 1-bit model (4.1) by an additional dithering step leads to observations \(\varvec{y}\in \{-1,1\}^m\) of the following form:
where \(\varvec{\nu }{:=}(\nu _1, \dots , \nu _m)\in \{-2,0,2\}^m\) again models noise. The dithering variables \(\tau _i\) are independent copies of a random variable \(\tau \) that is uniformly distributed on an interval \([-\lambda , \lambda ]\). Note that a major difference between dithering and additive noise is that the parameter \(\lambda > 0\) is known and adjustable in practice (while the \(\tau _i\) could in principle be unknown). The following corollary of Theorem 2 is based on the fact that a careful choice of \(\lambda \) allows us to control the size of the (non-vanishing) target mismatch \(\rho _{}(\varvec{x})\), where \(T\varvec{x}{:=}\lambda ^{-1} \varvec{x}\).
Corollary 3
There exist universal constants \(c, c_0, {\tilde{C}}, C' > 0\) for which the following holds.
Let \(\varvec{a}_1, \dots , \varvec{a}_m \in \mathbb {R}^p\) be independent copies of a centered, isotropic, sub-Gaussian random vector \(\varvec{a}\in \mathbb {R}^p\) with \(\Vert \varvec{a}\Vert _{\psi _2} \le L\). Let \(\tau _1, \dots , \tau _m\) be independent copies of a random variable \(\tau \) that is uniformly distributed on \([-\lambda , \lambda ]\) for a parameter \(\lambda >0\). In addition, suppose that \(\{\varvec{a}_i\}_{i=1}^m\) and \(\{\tau _i\}_{i=1}^m\) are independent. Let \(\mathcal {X}\subset R \mathbb {B}_2^p\) for some \(R > 0\), and let \(K\subset \mathbb {R}^p\) be a convex set such that \(\lambda ^{-1}\mathcal {X}\subset K\). For \(u\ge 1\) and \(t\in (0, 1]\), we assume that
Finally, let \(\beta \in [0,1]\) be such that \(\beta \sqrt{\log (e/\beta )}\le c_0 L^{-2} t\). Then with probability at least \(1 - \exp (-c u^2)\) on the random draw of \(\{(\varvec{a}_i, \tau _i)\}_{i = 1}^m\), the following holds uniformly for all \(\varvec{x}\in \mathcal {X}\): Let \(\varvec{y}\in \{-1,1\}^m\) be given by (4.3) such that \(\tfrac{1}{2m}\Vert \varvec{\nu }\Vert _{1} \le \beta \). Then, every minimizer \(\hat{\varvec{z}}\) of (\({\mathsf {P}}_{K,\varvec{y}}\)) satisfies \(\Vert \hat{\varvec{z}}- \lambda ^{-1}\varvec{x}\Vert _{2} \le t\).
Corollary 3 exhibits many common features of known recovery guarantees based on dithering. In particular, it can be seen as a uniform version of a recent result on the generalized Lasso by Thrampoulidis and Rawat [62, Thm. IV.1]. The most notable improvement over Corollary 2 is that it is now possible to recover the actual signal \(\varvec{x}\in \mathcal {X}\) up to arbitrarily high precision, and not only its direction vector \(\varvec{x}/ \Vert \varvec{x}\Vert _{2}\).
Similarly to Corollary 2, we are not aware of any result in the literature that implies the statement of Corollary 3. Nevertheless, the oversampling rate of \(O(m^{-1/4})\) promoted by (4.4) can be further improved with a proof strategy that is specifically tailored to 1-bit observations with dithering (see Corollary 9 in Sect. 5).
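The choice \(T\varvec{x}= \lambda ^{-1}\varvec{x}\) rests on a simple fact: if \(\tau \) is uniform on \([-\lambda , \lambda ]\) and \(|v| \le \lambda \), then \(\mathbb {E}[{{\,\mathrm{sign}\,}}(v+\tau )] = v/\lambda \), so the dithered bits are unbiased estimates of the rescaled linear measurements. A quick Monte Carlo check (with hypothetical values for \(\lambda \) and v):

```python
import numpy as np

rng = np.random.default_rng(1)
lam, v = 2.0, 0.7                        # dither range and a value with |v| <= lam

tau = rng.uniform(-lam, lam, size=1_000_000)
emp = np.mean(np.sign(v + tau))          # Monte Carlo estimate of E[sign(v + tau)]

print(emp, v / lam)                      # both approximately 0.35
```

This unbiasedness is exactly what lets the estimator see through the binary quantization and recover the magnitude of the signal, not only its direction.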
4.3 Multi-Bit Observations
While 1-bit measurements are an important extreme case of quantized compressed sensing, a considerable part of the literature deals with multi-bit observation models; once again, see [6, 14] for a good introduction to this subject. In this subsection, we illustrate our approach in the prototypical situation of uniform quantization: For a fixed \(\delta > 0\), the goal is to reconstruct a signal \(\varvec{x}\in \mathcal {X}\subset \mathbb {R}^p\) from observations \(\varvec{y}\in \delta \mathbb {Z}^m\) of the form
where \(q_\delta (v) {:=}(2\lceil \tfrac{v}{2\delta }\rceil -1)\delta \) is a uniform quantizer taking values on the grid \(\delta (2\mathbb {Z}-1)\) with resolution \(\delta >0\). The dithering variables \(\tau _i\) are independent copies of a random variable \(\tau \) that is uniformly distributed on \([-\delta , \delta ]\). As before, \(\varvec{\nu }{:=}(\nu _1, \dots , \nu _m)\) can describe any type of adversarial noise, but now takes values in \(\delta \mathbb {Z}^m\). Theorem 2 yields the following uniform recovery guarantee for multi-bit observations with sub-Gaussian measurements and dithering.
Corollary 4
There exist universal constants \(c, c_0, C' > 0\) for which the following holds.
Let \(\varvec{a}_1, \dots , \varvec{a}_m \in \mathbb {R}^p\) be independent copies of a centered, isotropic, sub-Gaussian random vector \(\varvec{a}\in \mathbb {R}^p\) with \(\Vert \varvec{a}\Vert _{\psi _2} \le L\). Let \(\tau _1, \dots , \tau _m\) be independent copies of a random variable \(\tau \) that is uniformly distributed on \([-\delta , \delta ]\) for a fixed parameter \(\delta >0\). In addition, suppose that \(\{\varvec{a}_i\}_{i=1}^m\) and \(\{\tau _i\}_{i=1}^m\) are independent. Let \(\mathcal {X}\subset \mathbb {R}^p\) be a bounded subset, and let \(K\subset \mathbb {R}^p\) be a convex set such that \(\mathcal {X}\subset K\). For \(u\ge 1\) and \(t > 0\), we assume that
Then with probability at least \(1 - \exp (-c u^2)\) on the random draw of \(\{(\varvec{a}_i, \tau _i)\}_{i = 1}^m\), the following holds uniformly for all \(\varvec{x}\in \mathcal {X}\): Let \(\varvec{y}\in \delta \mathbb {Z}^m\) be given by (4.5) such that at least one of the following conditions is fulfilled:
Then, every minimizer \(\hat{\varvec{z}}\) of (\({\mathsf {P}}_{K,\varvec{y}}\)) satisfies \(\Vert \hat{\varvec{z}}- \varvec{x}\Vert _{2} \le 2t\).
Corollary 4 can be considered as a uniform version of a recent result by Thrampoulidis and Rawat [62, Thm. III.1]. For a fixed quantizer resolution \(\delta >0\), the condition (4.6) translates into an error decay rate of \(O(\sqrt{\delta }\cdot m^{-1/4})\). On the other hand, if \(\delta \ll t\), then (4.6) is satisfied as soon as \(m \ge C \cdot L^2 \log L\cdot (w_{t}^2(K- \mathcal {X})+ u^2 )\). In other words, if the quantizer resolution is much higher than the desired reconstruction accuracy, the performance of (\({\mathsf {P}}_{K,\varvec{y}}\)) is the same as if the input vector consisted of linear measurements (cf. Corollary 1). This behavior is perfectly consistent with the fact that the uniformly quantized observations (4.5) become linear as \(\delta \rightarrow 0\). It is also worth pointing out that the constraint (b) in (4.7) implies a certain outlier robustness: If \(t < \delta \) and only a fraction of \(t / \delta \) bits are corrupted, then the normalized \(\ell ^{2}\)-noise error may scale in the order of \(\sqrt{\delta t}\) instead of t.
To the best of our knowledge, Corollary 4 is a new result. But we stress that there exist uniform recovery guarantees for programs other than (\({\mathsf {P}}_{K,\varvec{y}}\)) in the literature, which imply better oversampling rates, e.g., see [36, Thm. 3] or [65] in the case of structured signal sets. Nevertheless, we expect that Corollary 4 could be easily improved in that regard, using the strategy of Sect. 5 in conjunction with known uniform embedding results.
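One reason dithering helps in the multi-bit setting is an unbiasedness property of Schuchman type: for \(\tau \) uniform on \([-\delta , \delta ]\), one has \(\mathbb {E}[q_\delta (v+\tau )] = v\) for every v, since the dither width matches the quantization cell width \(2\delta \). A numerical check with hypothetical values:

```python
import numpy as np

rng = np.random.default_rng(2)
delta = 0.5

def q(v):
    # Uniform quantizer q_delta(v) = (2*ceil(v / (2*delta)) - 1) * delta,
    # mapping onto the grid of odd multiples of delta.
    return (2 * np.ceil(v / (2 * delta)) - 1) * delta

v = 0.37
tau = rng.uniform(-delta, delta, size=1_000_000)
emp = np.mean(q(v + tau))                # dithered quantization is unbiased

print(emp)                               # approximately 0.37
```

Without the dither, the quantizer error would be a deterministic (signal-dependent) bias; the uniform dither turns it into zero-mean noise that the least-squares estimator averages out.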
4.4 Single-Index Models and Beyond
The situation of single-index models was already considered in Theorem 1 in Sect. 1.2, as an appetizer for our more general approach. As pointed out there, this guarantee can be seen as an upgrade of an earlier non-uniform result by Plan and Vershynin [51, Thm. 1.9], thereby demonstrating that uniform recovery is possible beyond quantized measurement schemes. In view of Assumption 2, the Lipschitz continuity of \(f\) in Theorem 1 is certainly not necessary. However, it verifies that a more “regular” observation variable can lead to a near-optimal error decay rate (see Remark 2(2)). Apart from that, we emphasize that (Lipschitz) continuous nonlinearities are not only of academic interest but also appear in practical applications, for instance, as power amplifiers in sensor networks [24].
The remainder of this subsection is devoted to two new applications of our general framework, which extend Theorem 1 in different directions.
Modulo measurements. The first scenario is inspired by a recent work of Bhandari et al. [4] on “unlimited sampling.” The practical motivation there is to prevent a saturation of analog-to-digital converters by applying a modulo operator in the measurement process. Tailored to the setup of the present article, we consider modulo measurements of the form
where the modulo function is given by \({\mathfrak {m}}_\lambda (v){:=}v - \big \lfloor \tfrac{v+\lambda }{2\lambda }\big \rfloor \cdot 2\lambda \) for a fixed parameter \(\lambda >0\), and \(\varvec{\nu }{:=}(\nu _1, \dots , \nu _m) \in \mathbb {R}^m\) models noise. Clearly, the nonlinearity \({\mathfrak {m}}_\lambda \) neither corresponds to a quantization (cf. Sects. 4.1–4.3) nor is it Lipschitz continuous (cf. Theorem 1). The following corollary of Theorem 2 provides a uniform recovery guarantee for the observation model (4.8):
Corollary 5
There exist universal constants \(c, C, C' > 0\) for which the following holds.
Let \(\varvec{a}_1, \dots , \varvec{a}_m \in \mathbb {R}^p\) be independent copies of a standard Gaussian random vector \(\varvec{a}\sim \mathsf {N}(\varvec{0}, \varvec{I}_{p})\). For \(\lambda >0\), we set \(\mu _\lambda {:=}\mathbb {E}[{\mathfrak {m}}_\lambda (g)g]\) with \(g\sim \mathsf {N}(0, 1)\). Let \(\mathcal {X}\subset \mathbb {S}^{p-1}\), and let \(K\subset \mathbb {R}^p\) be a convex set such that \(\mu _\lambda \mathcal {X}\subset K\). For \(u\ge 1\) and \(t\in (0,\lambda ]\), we assume that \(\lambda \ge C\) and
Then, \(\mu _\lambda \in [\tfrac{1}{2},1]\) and with probability at least \(1 - \exp (-c u^2)\) on the random draw of \(\{\varvec{a}_i\}_{i = 1}^m\), the following holds uniformly for all \(\varvec{x}\in \mathcal {X}\): Let \(\varvec{y}\in \mathbb {R}^m\) be given by (4.8) such that \(\big (\tfrac{1}{m}\sum _{i = 1}^m \nu _i^2 \big )^{1/2} \le \tfrac{t}{20}\). Then, every minimizer \(\hat{\varvec{z}}\) of (\({\mathsf {P}}_{K,\varvec{y}}\)) satisfies \(\Vert \hat{\varvec{z}}- \mu _\lambda \varvec{x}\Vert _{2} \le t\).
Similarly to our applications to quantized compressed sensing above, the (infinitely many) discontinuities of the modulo function \({\mathfrak {m}}_\lambda \) imply an error decay rate of \(O(m^{-1/4})\). To the best of our knowledge, Corollary 5 is a new result in its own right. In particular, we are not aware of alternative proof techniques that would allow us to derive a comparable statement. Finally, it is worth noting that the condition \(\lambda \ge C\) in Corollary 5 could be further relaxed to \(\lambda > 0\). This would come at the cost of a lower bound for \(\mu _\lambda \), since we have that \(\mu _\lambda \rightarrow 0\) as \(\lambda \rightarrow 0\) (the nonlinearity \({\mathfrak {m}}_\lambda \) converges uniformly to 0 as \(\lambda \rightarrow 0\)).
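The centered modulo function and the scaling constant \(\mu _\lambda \) can also be inspected numerically. The sketch below (hypothetically taking \(\lambda = 1\) for the folding and \(\lambda = 3\) for the Monte Carlo estimate) confirms that \({\mathfrak {m}}_\lambda \) folds its input into \([-\lambda , \lambda )\) and that \(\mu _\lambda \) is close to 1 for moderately large \(\lambda \):

```python
import numpy as np

rng = np.random.default_rng(3)

def mod_lam(v, lam):
    # Centered modulo m_lam(v) = v - floor((v + lam) / (2 lam)) * 2 lam,
    # folding v into the interval [-lam, lam).
    return v - np.floor((v + lam) / (2 * lam)) * 2 * lam

print(mod_lam(np.array([-2.5, 0.3, 2.5]), 1.0))   # folds to -0.5, 0.3, 0.5

g = rng.standard_normal(1_000_000)
mu = np.mean(mod_lam(g, 3.0) * g)        # Monte Carlo estimate of mu_lambda
print(mu)                                # close to 1, since m_lam(g) = g for |g| <= lam
```

For \(\lambda = 3\), a standard Gaussian rarely leaves \([-\lambda , \lambda ]\), so the folding is almost never active and \(\mu _\lambda \) is only slightly below 1.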
Coordinate-wise distortions. The common single-index model (see Theorem 1) assumes that a nonlinear output function perturbs the linear observations \(\langle \varvec{a}_i, \varvec{x}\rangle \). Instead, one can also imagine distortions that affect the computation of the dot product \(\langle \varvec{a}_i, \varvec{x}\rangle \) directly. More specifically, we are interested in observations of the form
where \(\varvec{a}\circ \varvec{b}\in \mathbb {R}^p\) denotes the coordinate-wise product of vectors \(\varvec{a}, \varvec{b}\in \mathbb {R}^p\) and \(f:\mathbb {R}^p\rightarrow \mathbb {R}\) can be represented as
Here, we assume that the functions \(f_j:\mathbb {R}\rightarrow \mathbb {R}\) are odd and \(\gamma \)-Lipschitz, while satisfying the growth conditions
and
with parameters \(\alpha > 0\) and \(\beta _2 \ge \beta _1 > 0\). One can think of (4.9) as computing a “nonlinear dot product” of \(\varvec{a}_i\) and \(\varvec{x}\). While models of this type are not covered by the original approach of [51], an application of Theorem 2 allows us to deal with them under appropriate assumptions:
Corollary 6
There exist universal constants \(c, C > 0\) for which the following holds.
Let \(\varvec{a}_1, \dots , \varvec{a}_m \in \mathbb {R}^p\) be independent copies of a centered, isotropic random vector \(\varvec{a}= (a_1, \dots , a_p)\) which has independent, symmetric, and sub-Gaussian coordinates such that \(\max _{j \in [p]}\Vert a_j\Vert _{\psi _2} \le L\) for some \(L> 0\). Let \(\mathcal {X}\subset R\mathbb {B}_2^p\) and define \(T:\mathcal {X}\rightarrow \mathbb {R}^p\) by \(T\varvec{x}{:=}\mathbb {E}[f(\varvec{a}\circ \varvec{x})\varvec{a}]\). Let \(K\subset \mathbb {R}^p\) be a convex set such that \(T\mathcal {X}\subset K\). For \(u\ge 1\) and \(t\ge 0\), we assume that
Then with probability at least \(1 - \exp (-c u^2)\) on the random draw of \(\{\varvec{a}_i\}_{i = 1}^m\), the following holds uniformly for all \(\varvec{x}\in \mathcal {X}\): Let \(\varvec{y}\in \mathbb {R}^m\) be given by (4.9) such that \(\big (\tfrac{1}{m}\sum _{i = 1}^m \nu _i^2 \big )^{1/2} \le \tfrac{t}{20}\). Then, every minimizer \(\hat{\varvec{z}}\) of (\({\mathsf {P}}_{K,\varvec{y}}\)) satisfies \(\Vert \hat{\varvec{z}}- T\varvec{x}\Vert _{2} \le t\). Furthermore, for every \(j\in [p]\), we have that
The above result shows that estimation via (\({\mathsf {P}}_{K,\varvec{y}}\)) is even possible in situations where instead of linear measurements \(\langle \varvec{a}_i, \varvec{x}\rangle =\sum _{j=1}^p(\varvec{a}_i)_jx_j\) one observes distorted dot products \(f(\varvec{a}_i\circ \varvec{x})=\sum _{j=1}^pf_j((\varvec{a}_i)_jx_j)\). Note that the generalized Lasso may not precisely recover a scalar multiple of \(\varvec{x}\) here, but rather a vector \(T\varvec{x}\) that lies in a hyperrectangle defined by \(\varvec{x}\) and the parameters \(\beta _1\) and \(\beta _2\). The closer the functions \(f_j\) are to the identity, the closer \(\beta _1\) and \(\beta _2\) are to 1, which implies that \(T\varvec{x}\approx \varvec{x}\) according to (4.12).
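As a toy instance of this model, one may take \(f_j = \tanh \) for all j (odd and 1-Lipschitz; this is a hypothetical choice, not one prescribed by Corollary 6). A Monte Carlo estimate of \(T\varvec{x}= \mathbb {E}[f(\varvec{a}\circ \varvec{x})\varvec{a}]\) then illustrates that the target vector inherits the signs and the support of \(\varvec{x}\):

```python
import numpy as np

rng = np.random.default_rng(4)
p = 5

def f(v):
    # Coordinate-wise distortion f(v) = sum_j f_j(v_j) with f_j = tanh
    # (odd and 1-Lipschitz; a hypothetical choice of the f_j).
    return np.sum(np.tanh(v), axis=-1)

x = np.array([1.0, -0.5, 0.0, 2.0, 0.25])
a = rng.standard_normal((200_000, p))     # independent symmetric coordinates

# Monte Carlo estimate of the target vector Tx = E[f(a o x) a].
Tx = np.mean(f(a * x)[:, None] * a, axis=0)
print(np.round(Tx, 2))                    # signs match those of x; ~0 where x_j = 0
```

By independence and symmetry of the coordinates, the j-th entry reduces to \(\mathbb {E}[f_j(a_j x_j) a_j]\), which has the same sign as \(x_j\) and vanishes for \(x_j = 0\).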
4.5 Variable Selection
This subsection is devoted to an instance of Assumption 1 that is substantially different from the previous ones: For a fixed integer \(s \le p\) and an output function \(f:\mathbb {R}^p \rightarrow \mathbb {R}\), we ask for estimation of an index set \(\mathcal {S}\subset [p]\) with \(|\mathcal {S}| \le s\) from observations \(\varvec{y}\in \mathbb {R}^m\) of the form
Here, \(\varvec{a}_{i,\mathcal {S}} \in \mathbb {R}^p\) is the coordinate projection of \(\varvec{a}_i\) onto \(\mathcal {S}\) and \(\nu _i \in \mathbb {R}\) models additive noise. Most notably, the signals of interest are not parameter vectors in \(\mathbb {R}^p\) anymore but correspond to (small) index sets, specifying those coefficients of a measurement vector that contribute to the observation. From a statistical perspective, this can be seen as a variable selection model, where \(\mathcal {S}\subset [p]\) determines the set of active variables among all feature variables in \(\varvec{a}_i\). In the context of uniform recovery, this leads to the following problem: Given a collection of sample data \(\{\varvec{a}_i\}_{i = 1}^m\), can we retrieve any possible index set \(\mathcal {S}\subset [p]\) with \(|\mathcal {S}| \le s\) from (nonlinear) observations of the form (4.13)? The following corollary of Theorem 2 provides a result in that direction under natural conditions on the output function \(f\).
Corollary 7
There exist universal constants \(c, C > 0\) for which the following holds.
Let \(\varvec{a}_1, \dots , \varvec{a}_m \in \mathbb {R}^p\) be independent copies of a centered, isotropic random vector \(\varvec{a}= (a_1, \dots , a_p)\) which has independent sub-Gaussian coordinates such that \(\max _{j \in [p]}\Vert a_j\Vert _{\psi _2} \le L\) for some \(L> 0\). For \(s \le p\) and an output function \(f:\mathbb {R}^p \rightarrow \mathbb {R}\), let \(\mathcal {X}\subset \{\mathcal {S}\subset [p] \ :|\mathcal {S}|\le s\}\) and define \(T:\mathcal {X}\rightarrow \mathbb {R}^p\) by \(T\mathcal {S}{:=}\mathbb {E}[f(\varvec{a}_{\mathcal {S}})\varvec{a}]\), where \(\varvec{a}_{\mathcal {S}} \in \mathbb {R}^p\) denotes the coordinate projection of \(\varvec{a}\) onto \(\mathcal {S}\). Moreover, we assume that there exist parameters \(\alpha , \beta , \gamma , \kappa > 0\) such that
and
where \(\mathcal {S}\bigtriangleup \mathcal {S}' \subset [p]\) denotes the symmetric difference between \(\mathcal {S}\) and \(\mathcal {S}'\). Let \(K\subset \mathbb {R}^p\) be a convex set such that \(T\mathcal {X}\subset K\). For \(u\ge 1\) and \(t\ge 0\), we assume that
Then with probability at least \(1 - \exp (-c u^2)\) on the random draw of \(\{\varvec{a}_i\}_{i = 1}^m\), the following holds uniformly for all \(\mathcal {S}\in \mathcal {X}\): Let \(\varvec{y}\in \mathbb {R}^m\) be given by (4.13) such that \(\big (\tfrac{1}{m}\sum _{i = 1}^m \nu _i^2 \big )^{1/2} \le \tfrac{t}{20}\). Then, \({{\,\mathrm{supp}\,}}(T\mathcal {S}) = \mathcal {S}\) and every minimizer \(\hat{\varvec{z}}\) of (\({\mathsf {P}}_{K,\varvec{y}}\)) satisfies \(\Vert \hat{\varvec{z}}- T\mathcal {S}\Vert _{2} \le t\).
The assumptions (4.14) and (4.15) can be seen as natural balancing properties of the underlying observation model: (4.14) requires that the coefficients of each target vector \(T\mathcal {S}\in K\) are uniformly bounded below and above on \(\mathcal {S}\) (and in particular \(\alpha \le \Vert T\mathcal {S}\Vert _{2} \le \beta \)). The increment condition (4.15) ensures that the distance between two observation variables can be controlled in terms of the symmetric difference of their associated index sets in \(\mathcal {X}\).
With this in mind, Corollary 7 suggests the following simple procedure for variable selection: First perform a hard-thresholding step on \(\hat{\varvec{z}}\) to extract its largest entries in magnitude; then use the corresponding indices to estimate the set of active variables \(\mathcal {S}= {{\,\mathrm{supp}\,}}(T\mathcal {S})\). Note that this does not require explicit knowledge of the output function \(f\). However, the (guaranteed) success of such a strategy strongly depends on the size of \(\alpha \) and the accuracy t. In the worst case, t would have to be in the order of \(\alpha / \sqrt{s}\) for perfect recovery of \(\mathcal {S}\), which would lead to an undesirable factor of s in (4.16). It is certainly possible to show refined versions of Corollary 7. We suspect that sharper error bounds could be obtained by considering a different error measure than the \(\ell ^{2}\)-norm (see Remark 3(3)). Nevertheless, a detailed elaboration would go beyond the scope of this article, and we confine ourselves to the above proof of concept.
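The thresholding procedure described above can be sketched as follows. All numbers are hypothetical, and the Lasso output \(\hat{\varvec{z}}\) is mocked as the target vector \(T\mathcal {S}\) plus a perturbation of size t, rather than computed from data:

```python
import numpy as np

p, s = 50, 3

S = {4, 17, 31}                          # hypothetical set of active variables
TS = np.zeros(p)
TS[sorted(S)] = [0.8, -1.1, 0.6]         # target vector with supp(TS) = S

# Mock estimator output: TS plus a deterministic perturbation of size t = 0.05.
z_hat = TS + 0.05 * np.sin(np.arange(p))

# Hard-thresholding step: keep the s largest entries in magnitude.
S_hat = set(np.argsort(np.abs(z_hat))[-s:].tolist())
print(S_hat == S)                        # True
```

Recovery of S succeeds here because the perturbation (of size t) is small compared to the smallest active coefficient, mirroring the role of \(\alpha \) in the discussion above.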
5 Uniform Recovery Without Increment Conditions
We have seen in Sects. 4.1–4.3 that the increment conditions of Assumption 2 can lead to inferior error decay rates for quantizing output functions, due to their points of discontinuity. In this section, we present a workaround that can do without any (sub-Gaussian) increment conditions and thereby enables significantly better rates. The basic idea is to cover the signal set \(\mathcal {X}\) by an \(\varepsilon \)-net \(\mathcal {X}_\varepsilon \) for some \(\varepsilon > 0\) and to apply the non-uniform version of Theorem 2 to each \(\varvec{x}\in \mathcal {X}_\varepsilon \) separately. Taking the union bound then yields uniform recovery on \(\mathcal {X}_\varepsilon \). The final, and most crucial, ingredient that allows us to pass over to the entire signal set \(\mathcal {X}\) is the following local stability condition on the observation model. It is based on the (outlier) noise bounds in (2.4) and essentially requires that close points in \(T\mathcal {X}\) imply close observation vectors.
Assumption 3
Let Assumption 1 be satisfied, and let \(t > 0\) and \(\varepsilon > 0\). For \(m_0\in \{0, 1, \dots , \lfloor \tfrac{m}{2}\rfloor \}\), \(\varDelta >0\), and \(\eta \in [0,1]\), we assume that the following holds with probability at least \(1-\eta \):
The above strategy leads to the following general uniform recovery guarantee; see Sect. 9 for a detailed proof.
Theorem 3
There exist universal constants \(c, C, C_0 > 0\) for which the following holds.
Let Assumptions 1 and 3 be satisfied, let \(r{:=}{{\mathrm{sup}}}_{\varvec{x}\in \mathcal {X}}\Vert \langle \varvec{a}, T\varvec{x}\rangle -\tilde{y}(\varvec{x})\Vert _{\psi _2}\), and assume \(\rho _{}(\varvec{x}) \le \tfrac{t}{32}\) for every \(\varvec{x}\in \mathcal {X}\). For \(u\ge 1\) and \(u_0 \ge \sqrt{2m_0\log (em/2m_0)}\), we assume that
and
Then with probability at least \(1 - \exp (-c u^2) - \exp (-c u_0^2)-\eta \) on the random draw of \(\{(\varvec{a}_i, F_i)\}_{i = 1}^m\), the following holds uniformly for every \(\varvec{x}\in \mathcal {X}\): Let \(\varvec{y}\in \mathbb {R}^m\) be any input vector such that
Then, every minimizer \(\hat{\varvec{z}}\) of (\({\mathsf {P}}_{K,\varvec{y}}\)) satisfies \(\Vert \hat{\varvec{z}}- T\varvec{x}\Vert _{2} \le t + \varepsilon \).
The statement of Theorem 3 strongly resembles the one of Theorem 2 with the important difference that the former does not rely on Assumption 2. In particular, the condition (5.2) does not depend on the increment parameters \(L_{t}\) and \(\hat{L}_{t}\) anymore, making it less restrictive than (2.3). It is worth pointing out that the global complexity of \(\mathcal {X}\) is now measured in terms of the covering number \(\mathcal {N}(T\mathcal {X}, \varepsilon )\) in (5.3) instead of the mean width \(w_{}(T\mathcal {X})\). Furthermore, (5.2) establishes a refined local complexity measure, due to \({\mathrm{sup}}_{\varvec{x}\in \mathcal {X}} w_{t}^2(K- T\varvec{x}) \le w_{t}^2(K- T\mathcal {X})\); see also Proposition 1. Nevertheless, the gain of Theorem 3 is obviously linked to the verification of Assumption 3, which is usually a highly non-trivial task.
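The covering mechanism behind Theorem 3 can be mimicked with a generic greedy construction (a sketch only, not the net used in the proof): every point of a finite signal set ends up within \(\varepsilon \) of some selected center, so a union bound over the centers controls the whole set.

```python
import numpy as np

def greedy_net(points, eps):
    # Greedy eps-net: a point becomes a new center only if it is
    # farther than eps from all centers chosen so far.
    centers = []
    for x in points:
        if all(np.linalg.norm(x - c) > eps for c in centers):
            centers.append(x)
    return centers

rng = np.random.default_rng(6)
X = rng.standard_normal((500, 3))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # 500 points on the sphere S^2

net = greedy_net(X, eps=0.5)
radius = max(min(np.linalg.norm(x - c) for c in net) for x in X)
print(len(net), radius <= 0.5)                  # covering radius holds by construction
```

The size of such a net plays the role of the covering number \(\mathcal {N}(T\mathcal {X}, \varepsilon )\) entering condition (5.3), and the local stability condition of Assumption 3 is what transfers the guarantee from the net back to the full signal set.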
We now present two applications of Theorem 3 in the prototypical case of 1-bit observations. In this specific situation, it turns out that Assumption 3 is compatible with uniform bounds for binary embeddings, which allows us to make use of related results from the literature. Our first application is based on the following embedding guarantee by Oymak and Recht [45] for noiseless, Gaussian 1-bit measurements (cf. Sect. 4.1). Hereafter, we agree on the shortcut notation \([H]_\varepsilon {:=}(\tfrac{1}{\varepsilon }H- \tfrac{1}{\varepsilon }H) \cap \mathbb {B}_2^p\) for any subset \(H\subset \mathbb {R}^p\) and \(\varepsilon > 0\); also note that the parameter \(w_{}([H]_\varepsilon )\) is virtually the same as the local mean width \(w_{\varepsilon }(H- H)\) except that the former intersects with \(\mathbb {B}_2^p\) instead of \(\mathbb {S}^{p-1}\).
Theorem 4
([45, Thm. 3.2]) There exist universal constants \(c, {\bar{c}}, C > 0\) for which the following holds.
Let \(H\subset \mathbb {S}^{p-1}\), and let \(\varvec{A}\in \mathbb {R}^{m\times p}\) be a random matrix with independent standard Gaussian row vectors \(\varvec{a}_1, \dots , \varvec{a}_m\in \mathbb {R}^p\). For \(\beta \in (0,1)\) and \(\varepsilon \le {\bar{c}}\beta / \sqrt{\log (e/\beta )}\), we assume that
Then, the following holds with probability at least \(1-\exp (-cm\beta )\):
Combining Theorem 4 with Theorem 3 yields an improved version of Corollary 2 (see Sect. 9 for a proof):
Corollary 8
There exist universal constants \(c, c', c_0, C', C_0 > 0\) for which the following holds.
Let \(\varvec{a}_1, \dots , \varvec{a}_m \in \mathbb {R}^p\) be independent copies of a standard Gaussian random vector \(\varvec{a}\sim \mathsf {N}(\varvec{0}, \varvec{I}_{p})\). Let \(\mathcal {X}\subset \mathbb {R}^p\) and define \(T\varvec{x}{:=}\sqrt{\tfrac{2}{\pi }}\tfrac{\varvec{x}}{\Vert \varvec{x}\Vert _{2}}\) for \(\varvec{x}\in \mathcal {X}\). Moreover, let \(K\subset \mathbb {R}^p\) be a convex set such that \(T\mathcal {X}\subset K\). Fix \(t\in (0,1)\) and \(\varepsilon \le c't / \log (e/t)\). For \(u^2\ge \max \{1, C_0 \cdot \log \mathcal {N}(T\mathcal {X}, \varepsilon )\}\), we assume that
Finally, let \(\beta \in [0,1]\) be such that \(\beta \sqrt{\log (e/\beta )}\le c_0 t\). Then with probability at least \(1 - \exp (-c u^2)\) on the random draw of \(\{\varvec{a}_i\}_{i = 1}^m\), the following holds uniformly for all \(\varvec{x}\in \mathcal {X}\): Let \(\varvec{y}\in \{-1,1\}^m\) be given by (4.1) such that \(\tfrac{1}{2m}\Vert \varvec{\nu }\Vert _{1}\le \tfrac{\beta }{2}\). Then, every minimizer \(\hat{\varvec{z}}\) of (\({\mathsf {P}}_{K,\varvec{y}}\)) satisfies \(\big \Vert \hat{\varvec{z}}- \sqrt{\tfrac{2}{\pi }}\tfrac{\varvec{x}}{\Vert \varvec{x}\Vert _{2}}\big \Vert _{2} \le 2t\).
To better understand the oversampling behavior of this result, let us consider the situation where \(\varepsilon \asymp t / \log (e/t)\). Due to Sudakov minoration, we have that \(\log \mathcal {N}(T\mathcal {X}, \varepsilon ) \lesssim \varepsilon ^{-2} \cdot w_{}^2(T\mathcal {X})\) (e.g., see [64, Thm. 7.4.1]). Hence, in the worst case, Corollary 8 would imply an error decay rate of \(O(m^{-1/4})\) up to log factors. However, for many low-dimensional signal sets, such as subspaces or sparse vectors, it is possible to establish a much stronger bound of the form \(\log \mathcal {N}(T\mathcal {X}, \varepsilon ) \lesssim \log (\varepsilon ^{-1}) \cdot w_{}^2(T\mathcal {X})\), e.g., see [45, Sec. 2]. In this case, Corollary 8 would yield an error decay rate of \(O(m^{-1/2})\) up to log factors. This improves on Corollary 2 and matches the best possible rate that can be expected for (\({\mathsf {P}}_{K,\varvec{y}}\)) in general (see Remark 2(2)).
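For concreteness, the two rates can be traced by a back-of-the-envelope computation, under the assumption that the sample-size condition of Corollary 8 scales like \(m \gtrsim t^{-2}\cdot (w_{t}^2(K- T\varvec{x}) + u^2)\) with \(u^2 \asymp \log \mathcal {N}(T\mathcal {X}, \varepsilon )\) and \(\varepsilon \asymp t/\log (e/t)\):

$$\begin{aligned} \text {worst case:}&\quad \log \mathcal {N}(T\mathcal {X}, \varepsilon ) \lesssim \varepsilon ^{-2} \cdot w_{}^2(T\mathcal {X}) \asymp t^{-2}\log ^2(e/t) \cdot w_{}^2(T\mathcal {X})\\ &\quad \Rightarrow \ m \gtrsim t^{-4}\log ^2(e/t)\cdot w_{}^2(T\mathcal {X}) \ \Rightarrow \ t \lesssim m^{-1/4} \text { up to log factors,}\\ \text {structured sets:}&\quad \log \mathcal {N}(T\mathcal {X}, \varepsilon ) \lesssim \log (\varepsilon ^{-1}) \cdot w_{}^2(T\mathcal {X})\\ &\quad \Rightarrow \ m \gtrsim t^{-2}\log (\varepsilon ^{-1})\cdot w_{}^2(T\mathcal {X}) \ \Rightarrow \ t \lesssim m^{-1/2} \text { up to log factors.} \end{aligned}$$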
Our second application of Theorem 3 concerns the setup of sub-Gaussian 1-bit observations with dithering, as considered in Sect. 4.2. In this case, the stability condition of Assumption 3 can be related to a recent embedding result of Dirksen and Mendelson [17], which is based on hyperplane tessellations for sub-Gaussian vectors.
Theorem 5
([17, Thm. 2.9]) For every \(L> 0\), there exist constants \(c, {\bar{c}}, C, {\tilde{C}}>0\) only depending on \(L\) for which the following holds.
Let \(\mathcal {X}\subset R\mathbb {B}_2^p\) for some \(R > 0\), and let \(\varvec{A}\in \mathbb {R}^{m\times p}\) be a random matrix with independent, isotropic, sub-Gaussian row vectors \(\varvec{a}_1, \dots , \varvec{a}_m\in \mathbb {R}^p\). Moreover, assume that \(\max _{i\in [m]}\Vert \varvec{a}_i\Vert _{\psi _2}\le L\). Let \(\varvec{\tau }{:=}(\tau _1, \dots , \tau _m) \in \mathbb {R}^m\) be a random vector with independent entries that are uniformly distributed on \([-\lambda , \lambda ]\) for a parameter \(\lambda \ge {\tilde{C}}\cdot R\). In addition, suppose that \(\varvec{A}\) and \(\varvec{\tau }\) are independent. For \(\beta \in (0,1]\) and \(\varepsilon \le {\bar{c}}\beta / \sqrt{\log (e/\beta )}\), we assume that
Then, the following holds with probability at least \(1-\exp (-cm\beta )\):
Combining Theorem 5 with Theorem 3 now yields an improved version of Corollary 3 (see Sect. 9 for a proof):
Corollary 9
For every \(L> 0\), there exist universal constants \(c_0, C_0>0\) and constants \(c, c', {\tilde{C}}, C'>0\) only depending on \(L\) for which the following holds.
Let \(\varvec{a}_1, \dots , \varvec{a}_m \in \mathbb {R}^p\) be independent copies of a centered, isotropic, sub-Gaussian random vector \(\varvec{a}\in \mathbb {R}^p\) with \(\Vert \varvec{a}\Vert _{\psi _2} \le L\). Let \(\tau _1, \dots , \tau _m\) be independent copies of a random variable \(\tau \) that is uniformly distributed on \([-\lambda , \lambda ]\) for a parameter \(\lambda >0\). In addition, suppose that \(\{\varvec{a}_i\}_{i=1}^m\) and \(\{\tau _i\}_{i=1}^m\) are independent. Let \(\mathcal {X}\subset R \mathbb {B}_2^p\) for some \(R > 0\), and let \(K\subset \mathbb {R}^p\) be a convex set such that \(\lambda ^{-1}\mathcal {X}\subset K\). Fix \(t\in (0,1]\) and \(\varepsilon \le c't / \log (e/t)\). For \(u^2\ge \max \{1, C_0\cdot \log \mathcal {N}(\lambda ^{-1}\mathcal {X}, \varepsilon )\}\), we assume that
Finally, let \(\beta \in [0,1]\) be such that \(\beta \sqrt{\log (e/\beta )}\le c_0 L^{-2} t\). Then with probability at least \(1 - \exp (-c u^2)\) on the random draw of \(\{(\varvec{a}_i, \tau _i)\}_{i = 1}^m\), the following holds uniformly for all \(\varvec{x}\in \mathcal {X}\): Let \(\varvec{y}\in \{-1,1\}^m\) be given by (4.3) such that \(\tfrac{1}{2m}\Vert \varvec{\nu }\Vert _{1}\le \tfrac{\beta }{2}\). Then, every minimizer \(\hat{\varvec{z}}\) of (\({\mathsf {P}}_{K,\varvec{y}}\)) satisfies \(\Vert \hat{\varvec{z}}- \lambda ^{-1}\varvec{x}\Vert _{2} \le 2t\).
Analogously to noiseless 1-bit observations, Corollary 9 implies a decay rate of \(O(m^{-1/4})\) up to log factors in the worst case, which may be improved to \(O(m^{-1/2})\) for structured signal sets. This is in line with a recent finding of Dirksen and Mendelson [17, Thm. 1.7], who have analyzed a different estimator.
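The role of the dither can be illustrated numerically. The following sketch (our own illustration, not taken from [17]) verifies the elementary identity \(\mathbb {E}[{{\,\mathrm{sign}\,}}(s+\tau )] = s/\lambda \) for \(\tau \) uniform on \([-\lambda , \lambda ]\) and \(|s|\le \lambda \), which is what makes \(T\varvec{x}= \lambda ^{-1}\varvec{x}\) the natural target in this model; the sample size is arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 4.0            # dithering level lambda
s = 1.5              # a fixed value of <a, x>, with |s| <= lam

# E[sign(s + tau)] = P(tau > -s) - P(tau < -s) = s / lam for uniform dither
tau = rng.uniform(-lam, lam, size=1_000_000)
emp = np.mean(np.sign(s + tau))
print(emp, s / lam)  # empirical mean of sign(s + tau) matches s / lam
```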
We close this section by pointing out that the error bounds achieved by Corollaries 8 and 9 still do not match the information-theoretically optimal rate of \(O(m^{-1})\), e.g., see [18, 34]. We suspect that this gap is not an artifact of our proof, but rather reflects a fundamental performance limit of the generalized Lasso (\({\mathsf {P}}_{K,\varvec{y}}\)). A potential remedy is to consider a more specialized reconstruction method, e.g., see [36].
6 Conclusion
This section highlights several key aspects of this work and discusses them in the light of our initial objectives as well as remaining challenges.
(1)
Observation models. Probably the greatest benefit of our approach to Problem 1 is its flexibility. The setup of Assumption 1 allows us to implement and analyze virtually every nonlinear observation model that is conceivable for the program (\({\mathsf {P}}_{K,\varvec{y}}\)). The examples that we have seen in Sect. 4 are only a selection of possible applications. Importantly, each of these results is accompanied by a conceptually simple proof that does not require any deeper insight into the underlying model mechanisms; see also the proof template at the beginning of Sect. 8. Therefore, our methodology may form a general path toward competitive benchmark guarantees. However, the resulting oversampling rates do not always match the best possible rates from the literature, due to the increment condition of Assumption 2. In Sect. 5, we have presented a workaround for this issue, based on an additional local stability assumption. This strategy can indeed lead to near-optimal error bounds, but it relies on the availability of a strong embedding result in each considered case.
(2)
Non-Gaussian measurements. Our main result reveals the impact of non-Gaussian measurements, in particular, when consistent recovery via (\({\mathsf {P}}_{K,\varvec{y}}\)) can be expected and when it cannot. While relaxing the conditions on isotropy and sub-Gaussian tails in Assumption 1 is certainly practicable (see Remark 3(1)), our analysis is inherently limited to independent measurement vectors. This excludes more structured measurement ensembles, such as partial random circulant matrices or lossless expanders [21, Chaps. 12 & 13]. On the other hand, such schemes typically satisfy a variant of the restricted isometry property, which is the basis for some recent advances in quantized compressed sensing, e.g., see [16, 20, 33, 65]. These findings provide evidence that at least parts of our results remain valid for a much larger family of measurement vectors. Thus, any guarantee with structured measurements that meets the generality of Theorem 2 would be significant.
(3)
Complexity parameters. The research on high-dimensional signal estimation has shown that the (local) Gaussian mean width is a natural, yet accurate complexity measure for convex programs such as (\({\mathsf {P}}_{K,\varvec{y}}\)). A distinctive feature of our approach is the role played by the (not necessarily convex) signal set \(\mathcal {X}\). As indicated in Sect. 3, \(\mathcal {X}\) can be viewed as a characteristic of the observation model that allows us to study any situation between non-uniform and (fully) uniform recovery. Note that this refinement is also a crucial ingredient of Proposition 1 and Theorem 3. Nevertheless, it is important to bear in mind that the Gaussian mean width is not trivial to estimate in specific situations, e.g., see (3.3) and Example 1.
(4)
(Non-)convexity and data-driven priors. Remarkably, Fact 1 is the only argument in our proof that relies on the convexity of the constraint set \(K\). In principle, it is possible to drop this assumption on \(K\) by investigating the projected gradient descent method as an algorithmic implementation of (\({\mathsf {P}}_{K,\varvec{y}}\)), e.g., see [46, 47, 56]. However, the feasibility of this approach depends on the existence of an efficient projection onto \(K\). A modern line of research on signal processing with non-convex optimization advocates the use of data-driven priors—a natural consequence of many recent advances on generative models in machine learning research, e.g., see [5, 39] and the references therein. Although the algorithmic strategy behind these methods bears resemblance to (\({\mathsf {P}}_{K,\varvec{y}}\)), we believe that the complexity of learned signal priors leads to a more difficult mathematical problem, whose understanding is still in its infancy. This is particularly underpinned by the fact that even the situation of convex priors is not sufficiently well understood yet (see [26, 27, 40]).
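As an illustration of the algorithmic route mentioned in item (4), the following sketch implements projected gradient descent for (\({\mathsf {P}}_{K,\varvec{y}}\)) in the special case where \(K\) is a Euclidean ball, so that the projection is explicit. The function names and the toy 1-bit setup are our own illustration, not part of the paper's formal results:

```python
import numpy as np

def proj_l2_ball(z, radius=1.0):
    # Euclidean projection onto radius * B_2^p
    n = np.linalg.norm(z)
    return z if n <= radius else (radius / n) * z

def generalized_lasso_pgd(A, y, proj, iters=500):
    """Projected gradient descent for min_{z in K} (1/2m) * ||A z - y||_2^2."""
    m, p = A.shape
    step = m / np.linalg.norm(A, 2) ** 2   # 1/L with L the gradient's Lipschitz constant
    z = np.zeros(p)
    for _ in range(iters):
        grad = A.T @ (A @ z - y) / m
        z = proj(z - step * grad)
    return z

# Toy run: noiseless Gaussian 1-bit observations, K = B_2^p
rng = np.random.default_rng(1)
m, p = 500, 20
A = rng.standard_normal((m, p))
x = rng.standard_normal(p)
x /= np.linalg.norm(x)
y = np.sign(A @ x)
z_hat = generalized_lasso_pgd(A, y, proj_l2_ball)
err = np.linalg.norm(z_hat - np.sqrt(2 / np.pi) * x)
print(err)  # typically small for m >> p
```

For non-convex \(K\), the same iteration can be run with a (possibly non-unique) projection, which is exactly the setting of [46, 47, 56].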
7 Proof of the Main Result (Theorem 2)
Recall that according to Fact 1, it suffices to show that for all \(\varvec{x}\in \mathcal {X}\) and \(\varvec{y}\in \mathbb {R}^m\) satisfying \(\rho _{}(\varvec{x}) \le \tfrac{t}{32}\) and (2.4), we have that
To this end, we make use of the decomposition in (2.2), i.e.,
and continue by showing separate bounds for all three terms, where each bound holds uniformly for all \(\varvec{x}\in \mathcal {X}\) and \(\varvec{v}\in K_{\varvec{x},t}\). Unless stated otherwise, we assume that the hypotheses of Theorem 2 are fulfilled throughout this section and we have that \(t > 0\); the case of \(t = 0\) is in fact much simpler, see Step 4b. Moreover, we will use the notation \(K_{\mathcal {X},t} {:=}\cup _{\varvec{x}\in \mathcal {X}}K_{\varvec{x},t}=(K- T\mathcal {X}) \cap t\mathbb {S}^{p-1}\).
Step 1: Bounding the quadratic term. Let \(\varvec{A}\in \mathbb {R}^{m\times p}\) denote the matrix with row vectors \(\varvec{a}_1, \dots , \varvec{a}_m\). In order to control the random variable
uniformly for all \(\varvec{v}\in K_{\mathcal {X},t}\), we make use of the following recent matrix deviation inequality for sub-Gaussian matrices.
Theorem 6
([35, Cor. 1.2]) There exists a universal constant \(C_{\mathcal {Q}_{}} > 0\) for which the following holds.
Let \(H\subset \mathbb {R}^p\), and let \(\varvec{A}\in \mathbb {R}^{m\times p}\) be a random matrix with independent, isotropic, sub-Gaussian row vectors \(\varvec{a}_1, \dots , \varvec{a}_m\in \mathbb {R}^p\). Moreover, assume that \(\max _{i\in [m]}\Vert \varvec{a}_i\Vert _{\psi _2}\le L\) for some \(L> 0\). Then for every \(u>0\), we have the following with probability at least \(1-3\exp (-u^2)\):
We now apply this result to \(H{:=}K_{\mathcal {X},t} \subset t \mathbb {S}^{p-1}\) within the setting of Theorem 2: According to the first branch of the assumption (2.3), we have that \(m \ge C \cdot L^2 \log L\cdot \big ( w_{t}^2(K- T\mathcal {X}) + u^2 \big )\). Hence, by adjusting the universal constant C (only depending on \(C_{\mathcal {Q}_{}}\)), Theorem 6 implies that the following holds with probability at least \(1-3\exp (-u^2)\):
Step 2: Bounding the noise term. In order to control the noise term \(\mathcal {N}_{\varvec{x}}(\varvec{v})\), we require the following uniform upper bound for subsums of the quadratic term.
Theorem 7
([17, Thm. 2.10]) There exist universal constants \(c, C_{\mathcal {N}_{}}> 0\) for which the following holds.
Let \(H\subset \mathbb {R}^p\), and let \(\varvec{a}_1, \dots , \varvec{a}_m\in \mathbb {R}^p\) be independent, isotropic, sub-Gaussian random vectors such that \(\max _{i\in [m]}\Vert \varvec{a}_i\Vert _{\psi _2}\le L\) for some \(L> 0\). Then for every \(m_0\in \{0,1, \dots , m\}\) and \(u_0 \ge \sqrt{m_0\log (em/m_0)}\), we have the following with probability at least \(1-2\exp (-cu_0^2)\):
For \(\varvec{w}\in \mathbb {R}^m\), we denote by \(\mathcal {I}_{m_0}(\varvec{w})\subset [m]\) any (possibly non-unique) index set that corresponds to the \(m_0\) largest entries of \(\varvec{w}\) in magnitude, i.e., for all \(i\in \mathcal {I}_{m_0}(\varvec{w})\) and \(i' \in \mathcal {I}_{m_0}(\varvec{w})^c\), we have that \(|w_i|\ge |w_{i'}|\); note that for \(m_0=0\), we simply have \(\mathcal {I}_{m_0}(\varvec{w})=\emptyset \). With this notation at hand, observe that
Now, consider the specific choice \(\varvec{w}{:=}\varvec{y}-\tilde{\varvec{y}}(\varvec{x})\). The Cauchy–Schwarz inequality then implies
Let us now estimate the first summand of this bound. According to Theorem 7, there exist universal constants \(c, C_{\mathcal {N}_{}} > 0\) such that for every \(u_0 \ge \sqrt{m_0\log (em/m_0)}\), we have the following with probability at least \(1-2\exp (-c u_0^2)\):
where the last inequality follows from the second branch of (2.3). Hence, taking a union bound with the event of (7.1), it follows that with probability at least \(1-3\exp (-u^2) - 2\exp (-cu_0^2)\), the following holds uniformly for every \(\varvec{x}\in \mathcal {X}\) and every \(\varvec{y}\in \mathbb {R}^m\) satisfying the condition (2.4):
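As a concrete illustration of the two norms appearing in this step (a small numerical sketch; the helper names are ours): \(\Vert \varvec{w}\Vert _{[m_0]}\) is the \(\ell _2\)-norm of the \(m_0\) largest-magnitude entries of \(\varvec{w}\), while \(\sigma _{m_0}(\varvec{w})_{2}\) is the \(\ell _2\)-norm of the remaining tail, and the two combine by a Pythagorean identity:

```python
import numpy as np

def top_k_norm(w, k):
    # ||w||_{[k]}: l2-norm of the k largest-magnitude entries of w
    if k == 0:
        return 0.0
    idx = np.argsort(np.abs(w))[-k:]
    return float(np.linalg.norm(w[idx]))

def tail_error(w, k):
    # sigma_k(w)_2: l2-norm of w after removing its k largest-magnitude entries
    if k == 0:
        return float(np.linalg.norm(w))
    idx = np.argsort(np.abs(w))[:-k]
    return float(np.linalg.norm(w[idx]))

w = np.array([3.0, -1.0, 0.5, -4.0, 2.0])
print(top_k_norm(w, 2))   # 5.0 (the entries -4 and 3)
print(tail_error(w, 2))   # sqrt(1 + 0.25 + 4) ~ 2.29
# Pythagorean split: ||w||_2^2 = ||w||_{[k]}^2 + sigma_k(w)_2^2
```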
Step 3: Bounding the multiplier term. Our goal is to bound the random variable
uniformly from below for all \(\varvec{v}\in K_{\varvec{x},t}\) and \(\varvec{x}\in \mathcal {X}\) that satisfy \(\rho _{}(\varvec{x}) \le \tfrac{t}{32}\). Let us begin by adding and subtracting the expected value:
Since \(\varvec{a}\) is isotropic, we observe that \(\tfrac{1}{2}\mathbb {E}[\mathcal {M}_{\varvec{x}}(\varvec{v})]=\langle T\varvec{x}- \mathbb {E}[\tilde{y}(\varvec{x}) \varvec{a}], \varvec{v}\rangle \). Combining this with Definition 1, we obtain a lower bound in terms of the target mismatch:
We now turn to the centered multiplier term in (7.3). To this end, recall Assumption 2 and let \(\{\tilde{y}_{t,i}(\varvec{x})\}_{\varvec{x}\in \mathcal {X}}\) be independent copies of the class \(\{\tilde{y}_{t}(\varvec{x})\}_{\varvec{x}\in \mathcal {X}}\) for \(i = 1, \dots , m\). Inserting these observation variables and applying the triangle inequality then yield
Note that the second empirical product process in (7.5) is not centered yet due to pulling in the absolute values—a crucial step to ensure that the resulting factors have sub-Gaussian increments. Adding and subtracting the expected value of the last summand in (7.5) and using Assumption 2(a), we obtain the following upper bound:
This estimate in conjunction with (7.4) implies that the following holds uniformly for all \(\varvec{x}\in \mathcal {X}\) with \(\rho _{}(\varvec{x}) \le \tfrac{t}{32}\):
By Assumption 2(b) and (c), all factors of the product processes
have sub-Gaussian increments. A key ingredient for controlling these empirical processes is a powerful concentration inequality due to Mendelson [42]. The following result is adapted from [42, Thm. 1.13].
Theorem 8
There exist universal constants \(c, C_{\mathcal {M}_{}} > 0\) for which the following holds.
Let \(\{g_{a}\}_{a\in \mathcal {A}}\) and \(\{h_{b}\}_{b\in \mathcal {B}}\) be stochastic processes indexed by two sets \(\mathcal {A}\) and \(\mathcal {B}\), both defined on a common probability space \((\varOmega , \mathsf {A}, \mathbb {P})\). We assume that there exist \(r_{\mathcal {A}}, r_{\mathcal {B}} \ge 0\) and pseudo-metrics \(d_{\mathcal {A}}\) on \(\mathcal {A}\) and \(d_{\mathcal {B}}\) on \(\mathcal {B}\) such that
Finally, let \(X_1, \dots , X_m\) be independent copies of a random variable \(X\sim \mathbb {P}\). Then for every \(u\ge 1\), we have the following with probability at least \(1- 2 \exp (-cu^2)\):
where \(\gamma _2(\cdot ,\cdot )\) denotes Talagrand’s \(\gamma _2\)-functional (e.g., see [42, Def. 1.2]).
Let us now apply Theorem 8 to the first product process in (7.7). To this end, we observe that the class \(\{\langle \varvec{a}, \varvec{v}\rangle \}_{\varvec{v}\in K_{\mathcal {X},t}}\) has sub-Gaussian increments with respect to the pseudo-metric induced by \(L\Vert \cdot \Vert _{2}\), while Assumption 2(b) implies that \(\{\xi _t(\varvec{x})\}_{\varvec{x}\in \mathcal {X}}\) has sub-Gaussian increments with respect to \(L_{t}\cdot d_T\). Hence, Theorem 8 implies that the following holds with probability at least \(1- 2 \exp (-cu^2)\):
According to Talagrand’s majorizing measure theorem [59, Thm. 2.4.1], we have that
Moreover, it is not hard to see that \(\gamma _2(\mathcal {X}, d_T) = \gamma _2(T\mathcal {X}, \Vert \cdot \Vert _{2})\). Consequently, the following bound holds with probability at least \(1- 2 \exp (-cu^2)\):
The second product process in (7.7) can be treated similarly: The class \(\{|\langle \varvec{a}, \varvec{v}\rangle |\}_{\varvec{v}\in K_{\mathcal {X},t}}\) has sub-Gaussian increments with respect to \(L\Vert \cdot \Vert _{2}\), while Assumption 2(c) implies that \(\{\varepsilon _t(\varvec{x})\}_{\varvec{x}\in \mathcal {X}}\) has sub-Gaussian increments with respect to \(\hat{L}_{t}\cdot d_T\). An analogous application of Theorem 8 shows that with probability at least \(1- 2 \exp (-cu^2)\), we have that
We are ready to prove our main result.
Step 4a: Proof of Theorem 2 (\(t > 0\)). We now assume that the events corresponding to (7.1), (7.2), (7.8), and (7.9) have occurred jointly; by a union bound, this happens with probability at least \(1- 7\exp (-cu^2) - 2\exp (-cu_0^2)\) for an appropriate constant \(c > 0\); note that the factors 7 and 2 can be removed by slightly adjusting c. Combining these bounds with (7.6), the following holds uniformly for every \(\varvec{x}\in \mathcal {X}\) with \(\rho _{}(\varvec{x}) \le \tfrac{t}{32}\) and every \(\varvec{y}\in \mathbb {R}^m\) satisfying the condition (2.4):
If we could show that this lower bound is strictly positive, the claim of Theorem 2 would follow directly from Fact 1. To conclude this argument, it is enough to have that
where \(C' > 0\) is an appropriate universal constant. Indeed, both (7.10) and (7.11) are consequences of (2.3): The bound (7.10) is equivalent to the third branch of (2.3), while (7.11) follows by multiplying the first and third branches of (2.3) and taking the square root. Note that the previous argument also makes use of the basic fact that \((v + w)^2 \ge v^2 + w^2 \ge \tfrac{1}{2}(v + w)^2\) for all \(v, w \ge 0\), and that \(L^2 \gtrsim L\) due to the isotropy of \(\varvec{a}\).
Step 4b: Proof of Theorem 2 (\(t = 0\)). We may assume that \(r= \hat{r}= L_{t}= \hat{L}_{t}= 0\), \(m_0= 0\), \(\varDelta = L^{-1}\sqrt{\log L}\), and \(u= u_0\), while only considering those \(\varvec{x}\in \mathcal {X}\) and \(\varvec{y}\in \mathbb {R}^m\) with \(\varvec{y}= \tilde{\varvec{y}}(\varvec{x})\) and \(\rho _{}(\varvec{x}) = 0\). Consequently, we have a noiseless linear model, i.e., \(\varvec{y}= \tilde{\varvec{y}}(\varvec{x}) = \langle \varvec{a}, \varvec{x}\rangle \). It follows that \(\mathcal {N}_{\varvec{x}}(\cdot ) = \mathcal {M}_{\varvec{x}}(\cdot ) = 0\). Analogously to (7.1), we can conclude that with probability at least \(1- 3\exp (-u^2)\), the quadratic term satisfies \(\mathcal {Q}_{}(\varvec{v}) \ge \tfrac{1}{2}\) for all \(\varvec{v}\in K_{\mathcal {X}, 0} {:=}{\text {cone}}(K- T\mathcal {X}) \cap \mathbb {S}^{p-1}\). On this event, let \(\hat{\varvec{z}}\in K\) be any minimizer of (\({\mathsf {P}}_{K,\varvec{y}}\)) with input \(\varvec{y}= \tilde{\varvec{y}}(\varvec{x})\) for some \(\varvec{x}\in \mathcal {X}\). If exact recovery failed, i.e., if \(\hat{\varvec{v}} {:=}\hat{\varvec{z}}- T\varvec{x}\ne \varvec{0}\), then
which contradicts the fact that \(\hat{\varvec{z}}\) is a solution to (\({\mathsf {P}}_{K,\varvec{y}}\)). \(\square \)
8 Proofs for Sect. 4
Each result in Sect. 4 is an application of Theorem 2 and follows the same proof template:
Step 1. How the prerequisites of Theorem 2 are met: (a) the model setup fits into Assumption 1; (b) the target mismatch \(\rho _{}(\varvec{x})\) is controlled; (c) Assumption 2 is fulfilled with suitable increment parameters.
Step 2. Proof of the actual statement via Theorem 2.
Step 3. Verification of Step 1b.
Step 4. Verification of Step 1c.
8.1 Proofs for Sect. 4.1
Proof of Corollary 2
We follow the proof template from the beginning of Sect. 8:
Step 1a. The model setup of Corollary 2 fits into Assumption 1 as follows:
We have that \(\varvec{a}\sim \mathsf {N}(\varvec{0}, \varvec{I}_{p})\) and therefore \(\Vert \varvec{a}\Vert _{\psi _2} \lesssim 1\). The signal set \(\mathcal {X}\) is an arbitrary subset of \(\mathbb {R}^p\). The output function \(F:\mathbb {R}^p \times \mathcal {X}\rightarrow \mathbb {R}\) takes the form \(F(\varvec{a}, \varvec{x}) {:=}{{\,\mathrm{sign}\,}}(\langle \varvec{a}, \varvec{x}\rangle )\).
The target function \(T:\mathcal {X}\rightarrow K\) corresponds to the (scaled) normalization \(T\varvec{x}{:=}\sqrt{\tfrac{2}{\pi }}\tfrac{\varvec{x}}{\Vert \varvec{x}\Vert _{2}}\). In particular, we have that \(d_T(\varvec{x},\varvec{x}')=\sqrt{\tfrac{2}{\pi }}\big \Vert \tfrac{\varvec{x}}{\Vert \varvec{x}\Vert _{2}} - \tfrac{\varvec{x}'}{\Vert \varvec{x}'\Vert _{2}}\big \Vert _{2}\).
Step 1b. The target mismatch \(\rho _{}(\varvec{x})\) vanishes for every \(\varvec{x}\in \mathcal {X}\).
Step 1c. There exists an approximation \(\tilde{y}_t(\varvec{x})\) of the observation variable \(\tilde{y}(\varvec{x})\) such that the conditions of Assumption 2 are fulfilled with
$$\begin{aligned} L_{t}\lesssim 1 + t^{-1},\quad \hat{L}_{t}\lesssim t^{-1},\quad r\lesssim 1,\quad \hat{r}\lesssim 1. \end{aligned}$$(8.1)
Step 2. The first and third branches of the condition (2.3) are implied by (4.2) for a sufficiently large universal constant \(C'>0\). Since \(\phi (\beta ) {:=}\beta \sqrt{\log (e/\beta )}\) defines a continuous and non-decreasing function on [0, 1] with \(\phi (1)=1\) and \(\phi (0){:=}0\) (by continuous extension), we may assume without loss of generality that \(\beta \sqrt{\log (e/\beta )} = c_0 t \in (0,1]\). Now, we set
$$\begin{aligned} \varDelta ^2 {:=}\frac{1}{t\sqrt{\log (e/\beta )}}, \quad m_0{:=}\lfloor \beta m\rfloor , \quad u_0 {:=}\sqrt{2 m \beta \log (e/\beta )}, \end{aligned}$$implying that \(u_0\ge \sqrt{m_0\log (em/m_0)}\) and \(u_0^2\ge 2 c_0 t m\). Combining the latter inequality with (4.2) for \(C'\) sufficiently large in particular implies that \(u_0^2 \ge u^2\). Furthermore, observe that \(\varDelta ^2 u_0^2=2c_0 m\) and \(\varDelta ^2 \cdot w_{t}^2(K- T\mathcal {X})\le t^{-1} \cdot w_{t}^2(K- T\mathcal {X})\). Hence, the second branch of the condition (2.3) is satisfied if \(c_0\) is chosen small enough and \(C'\) in (4.2) large enough. Next, we show that every \(\varvec{y}\in \{-1,1\}^m\) given by (4.1) with \(\tfrac{1}{2m}\Vert \varvec{\nu }\Vert _{1} \le \beta \) also satisfies (2.4). Since \(\varvec{\nu }= \varvec{y}- \tilde{\varvec{y}}(\varvec{x})\in \{-2,0,2\}^m\) and \(\tfrac{1}{2}\Vert \varvec{y}- \tilde{\varvec{y}}(\varvec{x})\Vert _{1} \le \beta m\), it follows that \(|\{i\in [m] \ : y_i \ne \tilde{y}_i(\varvec{x})\}|\le \lfloor \beta m\rfloor \). Consequently, \(\sigma _{m_0}(\varvec{y}- \tilde{\varvec{y}}(\varvec{x}))_{2}=0\) and
$$\begin{aligned} \tfrac{1}{\sqrt{m}}\Vert \varvec{y}- \tilde{\varvec{y}}(\varvec{x})\Vert _{[m_0]}= \tfrac{1}{\sqrt{m}}\Vert \varvec{y}- \tilde{\varvec{y}}(\varvec{x})\Vert _{2}\le 2\sqrt{\beta }= 2 \Big (\tfrac{c_0 t}{\sqrt{\log (e/\beta )}}\Big )^{1/2}\le \varDelta t, \end{aligned}$$where the last inequality holds for \(c_0\le \frac{1}{4}\). Theorem 2 now implies the claim of Corollary 2.
Step 3. Let \(\varvec{x}\in \mathcal {X}\) and consider the orthogonal decomposition of the standard Gaussian random vector \(\varvec{a}\) along \(\varvec{x}\):
$$\begin{aligned} \varvec{a}= \langle \varvec{a}, {{\bar{\varvec{x}}}}\rangle {{\bar{\varvec{x}}}} + P_{{\varvec{x}}^\perp }(\varvec{a}), \end{aligned}$$where \({{\bar{\varvec{x}}}} {:=}\tfrac{\varvec{x}}{\Vert \varvec{x}\Vert _{2}}\) and \(P_{{\varvec{x}}^\perp } {:=}P_{{\{\varvec{x}\}}^\perp }\). Since \({{\,\mathrm{sign}\,}}(\langle \varvec{a}, \varvec{x}\rangle )\) is centered and \(\langle \varvec{a}, {{\bar{\varvec{x}}}}\rangle \sim \mathsf {N}(0, 1)\) is independent of \(P_{{\varvec{x}}^\perp }(\varvec{a})\), we have that
$$\begin{aligned} \mathbb {E}[\tilde{y}(\varvec{x})\varvec{a}]&=\mathbb {E}[{{\,\mathrm{sign}\,}}(\langle \varvec{a}, \varvec{x}\rangle ) (\langle \varvec{a}, {{\bar{\varvec{x}}}}\rangle {{\bar{\varvec{x}}}} + P_{{\varvec{x}}^\perp }(\varvec{a}))]\\&=\mathbb {E}[|\langle \varvec{a}, {{\bar{\varvec{x}}}}\rangle |]\cdot {{\bar{\varvec{x}}}} + \mathbb {E}[{{\,\mathrm{sign}\,}}(\langle \varvec{a}, \varvec{x}\rangle )] \cdot \mathbb {E}[P_{{\varvec{x}}^\perp }(\varvec{a})] = \sqrt{\tfrac{2}{\pi }} \cdot {{\bar{\varvec{x}}}} = T\varvec{x}, \end{aligned}$$which implies that \(\rho _{}(\varvec{x})=0\).
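This identity is easily confirmed by Monte Carlo simulation (a NumPy sanity check of our own; dimension and sample size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 5, 500_000
x = rng.standard_normal(p)
x_bar = x / np.linalg.norm(x)

a = rng.standard_normal((n, p))         # n independent draws of a ~ N(0, I_p)
y = np.sign(a @ x_bar)                  # noiseless 1-bit observations
emp = (y[:, None] * a).mean(axis=0)     # Monte Carlo estimate of E[y(x) a]
target = np.sqrt(2 / np.pi) * x_bar     # Tx = sqrt(2/pi) * x / ||x||_2
print(np.linalg.norm(emp - target))     # close to 0
```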
Step 4. Using the shortcut \({{\bar{\varvec{x}}}} {:=}\tfrac{\varvec{x}}{\Vert \varvec{x}\Vert _{2}}\) again, we approximate the observation variable \(\tilde{y}(\varvec{x}) = {{\,\mathrm{sign}\,}}(\langle \varvec{a}, \varvec{x}\rangle )\) by \(\tilde{y}_t(\varvec{x}) {:=}\psi _t(\langle \varvec{a}, {{\bar{\varvec{x}}}}\rangle )\) for \(t\in (0, 1]\) and \(\varvec{x}\in \mathcal {X}\) where
$$\begin{aligned} \psi _t(s) {:=}{\left\{ \begin{array}{ll} \tfrac{128}{t}\cdot s, &{} |s|\le \tfrac{t}{128},\\ {{\,\mathrm{sign}\,}}(s), &{}\text {otherwise,} \end{array}\right. } \quad s \in \mathbb {R}. \end{aligned}$$See also Fig. 2 for an illustration. Since \(\tilde{y}(\varvec{x}) = {{\,\mathrm{sign}\,}}(\langle \varvec{a}, {{\bar{\varvec{x}}}}\rangle )\), the absolute value of the approximation error then takes the form
$$\begin{aligned} \varepsilon _t(\varvec{x}) = |{{\,\mathrm{sign}\,}}(\langle \varvec{a}, {{\bar{\varvec{x}}}}\rangle ) - \psi _t(\langle \varvec{a}, {{\bar{\varvec{x}}}}\rangle )|\cdot \chi _{[0, t^\circ ]}(|\langle \varvec{a}, {{\bar{\varvec{x}}}}\rangle |), \end{aligned}$$where we have used \(t^\circ {:=}\frac{t}{128}\) for the sake of notational convenience. We now show that for this choice of approximation, the conditions of Assumption 2 are indeed fulfilled with (8.1):
On Assumption 2(a). Let \(\varvec{x}\in \mathcal {X}\), \(\varvec{z}\in \mathbb {S}^{p-1}\), and consider the orthogonal decomposition
$$\begin{aligned} \varvec{z}= \langle \varvec{z}, {{\bar{\varvec{x}}}}\rangle {{\bar{\varvec{x}}}} + P_{{\varvec{x}}^\perp }(\varvec{z}). \end{aligned}$$This implies
$$\begin{aligned} \mathbb {E}[\varepsilon _t(\varvec{x}) \cdot |\langle \varvec{a}, \varvec{z}\rangle |]&\le \mathbb {E}[\chi _{[0, t^\circ ]}(|\langle \varvec{a}, {{\bar{\varvec{x}}}}\rangle |) \cdot |\langle \varvec{a}, \varvec{z}\rangle |]\\&\le |\langle \varvec{z}, {{\bar{\varvec{x}}}}\rangle |\cdot \mathbb {E}[\chi _{[0, t^\circ ]}(|\langle \varvec{a}, {{\bar{\varvec{x}}}}\rangle |) \cdot |\langle \varvec{a}, {{\bar{\varvec{x}}}}\rangle |] \\&\quad + \mathbb {E}[\chi _{[0, t^\circ ]}(|\langle \varvec{a}, {{\bar{\varvec{x}}}}\rangle |) \cdot |\langle \varvec{a}, P_{{\varvec{x}}^\perp }(\varvec{z})\rangle |]. \end{aligned}$$Clearly, \(|\langle \varvec{z}, {{\bar{\varvec{x}}}}\rangle |\le 1\) and \(\mathbb {E}[\chi _{[0, t^\circ ]}(|\langle \varvec{a}, {{\bar{\varvec{x}}}}\rangle |) \cdot |\langle \varvec{a}, {{\bar{\varvec{x}}}}\rangle |]\le t^\circ \). Since the random variables \(\langle \varvec{a}, {{\bar{\varvec{x}}}}\rangle \) and \(\langle \varvec{a}, P_{{\varvec{x}}^\perp }(\varvec{z})\rangle \) are independent, and \(\langle \varvec{a}, {{\bar{\varvec{x}}}}\rangle \) is standard Gaussian, we obtain
$$\begin{aligned} \mathbb {E}[\chi _{[0, t^\circ ]}(|\langle \varvec{a}, {{\bar{\varvec{x}}}}\rangle |) \cdot |\langle \varvec{a}, P_{{\varvec{x}}^\perp }(\varvec{z})\rangle |]= & {} \mathbb {P}(|\langle \varvec{a}, {{\bar{\varvec{x}}}}\rangle |\le t^\circ ) \cdot \mathbb {E}[|\langle \varvec{a}, P_{{\varvec{x}}^\perp }(\varvec{z})\rangle |] \\\le & {} t^\circ \cdot \mathbb {E}[|\langle \varvec{a}, P_{{\varvec{x}}^\perp }(\varvec{z})\rangle |]. \end{aligned}$$Moreover, by Jensen’s inequality and the isotropy of \(\varvec{a}\), it follows that
$$\begin{aligned} \mathbb {E}[|\langle \varvec{a}, P_{{\varvec{x}}^\perp }(\varvec{z})\rangle |]\le (\mathbb {E}[|\langle \varvec{a}, P_{{\varvec{x}}^\perp }(\varvec{z})\rangle |^2])^{1/2} = \Vert P_{{\varvec{x}}^\perp }(\varvec{z})\Vert _{2}\le 1. \end{aligned}$$Putting everything together, this shows that Assumption 2(a) is satisfied. On Assumption 2(b). Since \(\psi _t\) is \(\tfrac{128}{t}\)-Lipschitz, the following holds for every \(\varvec{x}, \varvec{x}'\in \mathcal {X}\) (with \({{\bar{\varvec{x}}}}' {:=}\tfrac{\varvec{x}'}{\Vert \varvec{x}'\Vert _{2}}\)):
$$\begin{aligned} \Vert \xi _t(\varvec{x}) - \xi _t(\varvec{x}')\Vert _{\psi _2}&\le \sqrt{\tfrac{2}{\pi }}\Vert \langle \varvec{a}, {{\bar{\varvec{x}}}}-{{\bar{\varvec{x}}}}'\rangle \Vert _{\psi _2} + \Vert \psi _t(\langle \varvec{a}, {{\bar{\varvec{x}}}}\rangle ) - \psi _t(\langle \varvec{a}, {{\bar{\varvec{x}}}}'\rangle )\Vert _{\psi _2}\\&\le \big (\sqrt{\tfrac{2}{\pi }} + \tfrac{128}{t}\big ) \cdot \Vert \langle \varvec{a}, {{\bar{\varvec{x}}}}-{{\bar{\varvec{x}}}}'\rangle \Vert _{\psi _2} \lesssim (1 + t^{-1}) \cdot d_T(\varvec{x},\varvec{x}'). \end{aligned}$$This implies \(L_{t}\lesssim 1 + t^{-1}\). Furthermore, observe that \(|\tilde{y}_t(\varvec{x})| \le 1\), so that
$$\begin{aligned} \Vert \xi _t(\varvec{x})\Vert _{\psi _2}\le \sqrt{\tfrac{2}{\pi }}\Vert \langle \varvec{a}, {{\bar{\varvec{x}}}}\rangle \Vert _{\psi _2} + \Vert \tilde{y}_t(\varvec{x})\Vert _{\psi _2}\lesssim 1 \end{aligned}$$for every \(\varvec{x}\in \mathcal {X}\). This shows \(r\lesssim 1\). On Assumption 2(c). We observe that the function \(s\mapsto \phi _t(s) {:=}|{{\,\mathrm{sign}\,}}(s)-\psi _t(s)|\) is \(\tfrac{128}{t}\)-Lipschitz. Therefore, for every \(\varvec{x}, \varvec{x}'\in \mathcal {X}\), we obtain
$$\begin{aligned} \Vert \varepsilon _t(\varvec{x}) - \varepsilon _t(\varvec{x}')\Vert _{\psi _2}\le \tfrac{128}{t} \cdot \Vert \langle \varvec{a}, {{\bar{\varvec{x}}}}-{{\bar{\varvec{x}}}}'\rangle \Vert _{\psi _2}\lesssim t^{-1} \cdot d_T(\varvec{x},\varvec{x}'), \end{aligned}$$which implies \(\hat{L}_{t}\lesssim t^{-1}\). Finally, since \(|\varepsilon _t(\varvec{x})|\le 1\) for every \(\varvec{x}\in \mathcal {X}\), it follows that \(\hat{r}\lesssim 1\).
\(\square \)
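The factorization used above in verifying Assumption 2(a) relies on the fact that \(\langle \varvec{a}, {{\bar{\varvec{x}}}}\rangle \) and \(\langle \varvec{a}, P_{{\varvec{x}}^\perp }(\varvec{z})\rangle \) are independent; this holds in particular when \(\varvec{a}\) is standard Gaussian, since a Gaussian vector has independent projections onto orthogonal directions. As an informal numerical sanity check (not part of the formal argument), the following Monte Carlo sketch confirms the factorization together with the bounds \(\mathbb {P}(|\langle \varvec{a}, {{\bar{\varvec{x}}}}\rangle |\le t^\circ )\le t^\circ \) and \(\mathbb {E}[|\langle \varvec{a}, P_{{\varvec{x}}^\perp }(\varvec{z})\rangle |]\le \Vert P_{{\varvec{x}}^\perp }(\varvec{z})\Vert _{2}\); the dimension, threshold, seed, and sample size are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, t0 = 8, 1_000_000, 0.3           # dimension, sample size, threshold t° (arbitrary)

x = rng.standard_normal(p)
x_bar = x / np.linalg.norm(x)          # unit vector x̄
z = rng.standard_normal(p)
z /= np.linalg.norm(z)                 # z on the unit sphere
Pz = z - (z @ x_bar) * x_bar           # orthogonal projection P_{x̄⊥}(z)

a = rng.standard_normal((n, p))        # standard Gaussian measurement vectors
u = np.abs(a @ x_bar)                  # samples of |⟨a, x̄⟩|
v = np.abs(a @ Pz)                     # samples of |⟨a, P_{x̄⊥}(z)⟩|
chi = (u <= t0).astype(float)          # indicator χ_{[0,t°]}(|⟨a, x̄⟩|)

lhs = np.mean(chi * v)                 # E[χ · |⟨a, P(z)⟩|]
rhs = np.mean(chi) * np.mean(v)        # P(|⟨a, x̄⟩| ≤ t°) · E[|⟨a, P(z)⟩|]
assert abs(lhs - rhs) < 5e-3             # factorization via independence
assert np.mean(chi) <= t0                # P(|N(0,1)| ≤ t°) ≤ t°·√(2/π) ≤ t°
assert np.mean(v) <= np.linalg.norm(Pz)  # Jensen: E|⟨a, P(z)⟩| ≤ ‖P(z)‖₂
```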
8.2 Proofs for Sect. 4.2
Proof of Corollary 3
We follow the proof template from the beginning of Sect. 8:
- Step 1a. The model setup of Corollary 3 fits into Assumption 1 as follows:
  - The measurement vector \(\varvec{a}\in \mathbb {R}^p\) is centered, isotropic, and sub-Gaussian with \(\Vert \varvec{a}\Vert _{\psi _2} \le L\). The signal set \(\mathcal {X}\) satisfies \(\mathcal {X}\subset R \mathbb {B}_2^p\). The output function \(F:\mathbb {R}^p \times \mathcal {X}\rightarrow \mathbb {R}\) takes the form \(F(\varvec{a}, \varvec{x}) {:=}{{\,\mathrm{sign}\,}}(\langle \varvec{a}, \varvec{x}\rangle +\tau )\), where \(\tau \) is uniformly distributed on \([-\lambda , \lambda ]\) and independent of \(\varvec{a}\). In particular, \(F\) is a random function.
  - The target function \(T:\mathcal {X}\rightarrow K\) corresponds to rescaling by a factor of \(\lambda ^{-1}\), i.e., \(T\varvec{x}{:=}\lambda ^{-1}\varvec{x}\). In particular, we have that \(d_T(\varvec{x},\varvec{x}')= \lambda ^{-1}\Vert \varvec{x}- \varvec{x}'\Vert _{2}\).
- Step 1b. There exists an absolute constant \({\tilde{C}} \ge e\) such that if
$$\begin{aligned} \lambda \ge {\tilde{C}}\cdot R \cdot L\cdot \sqrt{\log (e / t)}, \end{aligned}$$(8.2) the target mismatch satisfies \(\rho _{}(\varvec{x})\le \tfrac{t}{32}\) for every \(\varvec{x}\in \mathcal {X}\).
- Step 1c. There exists an approximation \(\tilde{y}_t(\varvec{x})\) of the observation variable \(\tilde{y}(\varvec{x})\) such that the conditions of Assumption 2 are fulfilled with
$$\begin{aligned} L_{t}\le L\cdot (1 + 64t^{-1}),\quad \hat{L}_{t}\le 64 Lt^{-1},\quad r\lesssim R L\lambda ^{-1} + 1, \quad \hat{r}\lesssim 1. \end{aligned}$$(8.3)
- Step 2. The first and third branches of the condition (2.3) are implied by (4.4) for a sufficiently large universal constant \(C'>0\). Since \(\phi (\beta ) {:=}\beta \sqrt{\log (e/\beta )}\) defines a continuous and non-decreasing function on [0, 1] with \(\phi (1)=1\) and \(\phi (0){:=}0\) (by continuous extension), we may assume without loss of generality that \(\beta \sqrt{\log (e/\beta )}= c_0 L^{-2} t \in (0,1]\). Now, we set
$$\begin{aligned} \varDelta ^2 {:=}\frac{1}{tL^2\sqrt{\log (e/\beta )}}, \quad m_0{:=}\lfloor \beta m\rfloor , \quad u_0 {:=}\sqrt{2 m \beta \log (e/\beta )}, \end{aligned}$$implying that \(u_0\ge \sqrt{m_0\log (em/m_0)}\) and \(u_0^2\ge 2 c_0 m L^{-2} t\). Combined with (4.4) for \(C'\) sufficiently large, the latter inequality implies in particular that \(u_0^2 \ge u^2\). Furthermore, we may assume that \(c_0\le \frac{1}{4}C^{-1}\) and that (4.4) holds with \(C'\ge 2C\), where \(C>0\) denotes the universal constant from (2.3). Then, the second branch of the condition (2.3) is satisfied, since
$$\begin{aligned} CL^4\varDelta ^2u_0^2= \frac{CL^2}{t\sqrt{\log (e/\beta )}} \cdot 2 \beta m \log (e/\beta )= 2c_0C\cdot m\le \tfrac{m}{2} \end{aligned}$$and
$$\begin{aligned} CL^4\varDelta ^2 \cdot w_{t}^2(K- \lambda ^{-1}\mathcal {X})&= \frac{CL^2}{t\sqrt{\log (e/\beta )}} \cdot w_{t}^2(K- \lambda ^{-1}\mathcal {X})\\&\le \tfrac{1}{2} C' \cdot L^2 t^{-1} \cdot w_{t}^2(K- \lambda ^{-1}\mathcal {X})\le \tfrac{m}{2}. \end{aligned}$$Next, we show that every \(\varvec{y}\in \{-1,1\}^m\) given by (4.3) with \(\tfrac{1}{2m}\Vert \varvec{\nu }\Vert _{1} \le \beta \) also satisfies (2.4). Since \(\varvec{\nu }=\varvec{y}- \tilde{\varvec{y}}(\varvec{x})\in \{-2,0,2\}^m\) and \(\tfrac{1}{2}\Vert \varvec{y}- \tilde{\varvec{y}}(\varvec{x})\Vert _{1}\le \beta m\), it follows that \(|\{i\in [m] \ :y_i \ne \tilde{y}_i(\varvec{x})\}|\le \lfloor \beta m\rfloor \). Consequently, \(\sigma _{m_0}(\varvec{y}- \tilde{\varvec{y}}(\varvec{x}))_{2}=0\) and
$$\begin{aligned} \tfrac{1}{\sqrt{m}}\Vert \varvec{y}- \tilde{\varvec{y}}(\varvec{x})\Vert _{[m_0]}= \tfrac{1}{\sqrt{m}}\Vert \varvec{y}- \tilde{\varvec{y}}(\varvec{x})\Vert _{2}\le 2\sqrt{\beta }= 2\Big (\tfrac{c_0 t}{L^2 \sqrt{\log (e/\beta )}} \Big )^{1/2} \le \varDelta t, \end{aligned}$$where the last inequality holds for \(c_0\le \frac{1}{4}\). Theorem 2 now implies the claim of Corollary 3.
- Step 3. Since \(\varvec{a}\) is isotropic, we have that
$$\begin{aligned} \rho _{}(\varvec{x}) = \big \Vert \mathbb {E}[\tilde{y}(\varvec{x})\varvec{a}] - \lambda ^{-1} \varvec{x}\big \Vert _{2} \le \big \Vert \mathbb {E}\big [\big ({{\,\mathrm{sign}\,}}(\langle \varvec{a}, \varvec{x}\rangle + \tau )-\langle \varvec{a}, \lambda ^{-1}\varvec{x}\rangle \big )\varvec{a}\big ]\big \Vert _{2}. \end{aligned}$$Therefore, it suffices to show that the following holds for all \(\varvec{x}\in \mathcal {X}\) and \(\varvec{z}\in \mathbb {S}^{p-1}\):
$$\begin{aligned} \mathbb {E}\big [\big ({{\,\mathrm{sign}\,}}(\langle \varvec{a}, \varvec{x}\rangle + \tau )-\langle \varvec{a}, \lambda ^{-1}\varvec{x}\rangle \big )\langle \varvec{a}, \varvec{z}\rangle \big ]\le \tfrac{t}{32}. \end{aligned}$$The following identity explains why adding a uniformly distributed dither \(\tau \in [-\lambda , \lambda ]\) before quantization is useful:
$$\begin{aligned} \mathbb {E}_{\tau }[{{\,\mathrm{sign}\,}}(s + \tau )] = \lambda ^{-1}s \cdot \chi _{[-\lambda , \lambda ]}(s) + {{\,\mathrm{sign}\,}}(s) \cdot \chi _{\mathbb {R}\setminus [-\lambda , \lambda ]}(s), \quad s \in \mathbb {R}. \end{aligned}$$In other words, for s small enough, integrating over the dithering variable \(\tau \) allows us to “smooth out” the discontinuity of the sign function. As \(\tau \) and \(\varvec{a}\) are independent, we can apply the above identity as follows:
$$\begin{aligned}&\mathbb {E}\big [\big ({{\,\mathrm{sign}\,}}(\langle \varvec{a}, \varvec{x}\rangle + \tau )-\langle \varvec{a}, \lambda ^{-1}\varvec{x}\rangle \big )\langle \varvec{a}, \varvec{z}\rangle \big ]\\&\quad = \mathbb {E}_{\varvec{a}}\mathbb {E}_{\tau }\big [\big ({{\,\mathrm{sign}\,}}(\langle \varvec{a}, \varvec{x}\rangle + \tau )-\langle \varvec{a}, \lambda ^{-1}\varvec{x}\rangle \big )\langle \varvec{a}, \varvec{z}\rangle \big ]\\&\quad = \mathbb {E}\big [\big (\lambda ^{-1}\langle \varvec{a}, \varvec{x}\rangle \cdot \chi _{[-\lambda , \lambda ]}(\langle \varvec{a}, \varvec{x}\rangle ) \\&\qquad + {{\,\mathrm{sign}\,}}(\langle \varvec{a}, \varvec{x}\rangle ) \cdot \chi _{\mathbb {R}\setminus [-\lambda , \lambda ]}(\langle \varvec{a}, \varvec{x}\rangle )-\lambda ^{-1}\langle \varvec{a}, \varvec{x}\rangle \big )\langle \varvec{a}, \varvec{z}\rangle \big ]\\&\quad = \mathbb {E}\big [\big ({{\,\mathrm{sign}\,}}(\langle \varvec{a}, \varvec{x}\rangle ) - \lambda ^{-1}\langle \varvec{a}, \varvec{x}\rangle \big ) \cdot \chi _{\mathbb {R}\setminus [-\lambda , \lambda ]}(\langle \varvec{a}, \varvec{x}\rangle )\cdot \langle \varvec{a}, \varvec{z}\rangle \big ]. \end{aligned}$$Using the Cauchy–Schwarz inequality twice, the isotropy of \(\varvec{a}\), and the triangle inequality, we now obtain
$$\begin{aligned}&\mathbb {E}\big [\big ({{\,\mathrm{sign}\,}}(\langle \varvec{a}, \varvec{x}\rangle ) - \lambda ^{-1}\langle \varvec{a}, \varvec{x}\rangle \big ) \cdot \chi _{\mathbb {R}\setminus [-\lambda , \lambda ]}(\langle \varvec{a}, \varvec{x}\rangle )\cdot \langle \varvec{a}, \varvec{z}\rangle \big ]\\&\quad \le \Big (\mathbb {E}\big [\big ({{\,\mathrm{sign}\,}}(\langle \varvec{a}, \varvec{x}\rangle ) - \lambda ^{-1}\langle \varvec{a}, \varvec{x}\rangle \big )^2 \cdot \chi _{\mathbb {R}\setminus [-\lambda , \lambda ]}(\langle \varvec{a}, \varvec{x}\rangle )\big ] \Big )^{1/2} \cdot \big (\mathbb {E}[\langle \varvec{a}, \varvec{z}\rangle ^2]\big )^{1/2}\\&\quad = \Big (\mathbb {E}\big [\big ({{\,\mathrm{sign}\,}}(\langle \varvec{a}, \varvec{x}\rangle ) - \lambda ^{-1}\langle \varvec{a}, \varvec{x}\rangle \big )^2 \cdot \chi _{\mathbb {R}\setminus [-\lambda , \lambda ]}(\langle \varvec{a}, \varvec{x}\rangle )\big ] \Big )^{1/2}\\&\quad \le \Vert {{\,\mathrm{sign}\,}}(\langle \varvec{a}, \varvec{x}\rangle ) - \lambda ^{-1}\langle \varvec{a}, \varvec{x}\rangle \Vert _{L^4}\cdot \big (\mathbb {P}(|\langle \varvec{a}, \varvec{x}\rangle |>\lambda )\big )^{1/4}\\&\quad \le \big (1+ \lambda ^{-1}\Vert \langle \varvec{a}, \varvec{x}\rangle \Vert _{L^4}\big ) \cdot \big (\mathbb {P}(|\langle \varvec{a}, \varvec{x}\rangle |>\lambda )\big )^{1/4}. \end{aligned}$$Since \(\varvec{a}\) is sub-Gaussian with \(\Vert \varvec{a}\Vert _{\psi _2}\le L\), there exist absolute constants \(C''\ge 1\), \(c''>0\) such that \(\Vert \langle \varvec{a}, \varvec{x}\rangle \Vert _{L^4}\le C''\cdot \Vert \langle \varvec{a}, \varvec{x}\rangle \Vert _{\psi _2}\le C''\cdot L\cdot \Vert \varvec{x}\Vert _{2}\) and \(\mathbb {P}(|\langle \varvec{a}, \varvec{x}\rangle |>\lambda )\le 2\exp \big (-c''\lambda ^2 (L\Vert \varvec{x}\Vert _{2})^{-2}\big )\). Finally, using that every \(\varvec{x}\in \mathcal {X}\) satisfies \(\Vert \varvec{x}\Vert _{2}\le R\), we can conclude that
$$\begin{aligned} \mathbb {E}\big [\big ({{\,\mathrm{sign}\,}}(\langle \varvec{a}, \varvec{x}\rangle + \tau )-\langle \varvec{a}, \lambda ^{-1}\varvec{x}\rangle \big )\langle \varvec{a}, \varvec{z}\rangle \big ]&\le C'' \cdot (1+ R L\lambda ^{-1}) \cdot \exp \big (-c'' \lambda ^2(RL)^{-2}\big ) \\&\le \tfrac{t}{32}, \end{aligned}$$where the last estimate is due to (8.2) for \({\tilde{C}}\ge e\) large enough.
- Step 4. We approximate the observation variable \(\tilde{y}(\varvec{x}) = {{\,\mathrm{sign}\,}}(\langle \varvec{a}, \varvec{x}\rangle + \tau )\) by \(\tilde{y}_t(\varvec{x}) {:=}\psi _t(\langle \varvec{a}, \varvec{x}\rangle +\tau )\) for \(t\in (0, 1]\) and \(\varvec{x}\in \mathcal {X}\), where
$$\begin{aligned} \psi _t(s) = {\left\{ \begin{array}{ll} \tfrac{64}{t}\cdot \lambda ^{-1}s, &{}|s|\le \tfrac{t}{64} \cdot \lambda ,\\ {{\,\mathrm{sign}\,}}(s), &{}\text {otherwise,} \end{array}\right. } \quad s \in \mathbb {R}. \end{aligned}$$The absolute value of the approximation error then takes the form
$$\begin{aligned} \varepsilon _t(\varvec{x}) = |{{\,\mathrm{sign}\,}}(\langle \varvec{a}, \varvec{x}\rangle + \tau ) - \psi _t(\langle \varvec{a}, \varvec{x}\rangle + \tau )|\cdot \chi _{[0, t^\circ \lambda ]}(|\langle \varvec{a}, \varvec{x}\rangle + \tau |), \end{aligned}$$where we have used \(t^\circ {:=}\frac{t}{64}\) for the sake of notational convenience. We now show that for this choice of approximation, the conditions of Assumption 2 are indeed fulfilled with (8.3): On Assumption 2(a). For \(\varvec{x}\in \mathcal {X}\) and \(\varvec{z}\in \mathbb {S}^{p-1}\), we have that
$$\begin{aligned} \mathbb {E}[\varepsilon _t(\varvec{x}) \cdot |\langle \varvec{a}, \varvec{z}\rangle |]&\le \mathbb {E}[\chi _{[0, t^\circ \lambda ]}(|\langle \varvec{a}, \varvec{x}\rangle + \tau |) \cdot |\langle \varvec{a}, \varvec{z}\rangle |] \\&\le \mathbb {E}_{\varvec{a}}\big [|\langle \varvec{a}, \varvec{z}\rangle | \cdot \mathbb {E}_{\tau } [\chi _{[0, t^\circ \lambda ]}(|\langle \varvec{a}, \varvec{x}\rangle + \tau |)]\big ]. \end{aligned}$$Since \(\mathbb {E}_{\tau }[\chi _{[0, t^\circ \lambda ]}(|\langle \varvec{a}, \varvec{x}\rangle + \tau |)]\le t^\circ \), it follows that
$$\begin{aligned} \mathbb {E}[\varepsilon _t(\varvec{x}) \cdot |\langle \varvec{a}, \varvec{z}\rangle |]\le t^\circ \cdot \mathbb {E}[|\langle \varvec{a}, \varvec{z}\rangle |]\le t^\circ \cdot (\mathbb {E}[|\langle \varvec{a}, \varvec{z}\rangle |^2])^{1/2}=t^\circ , \end{aligned}$$implying the condition of Assumption 2(a). On Assumption 2(b). Since \(\psi _t\) is \(\tfrac{64}{t \lambda }\)-Lipschitz, the following holds for every \(\varvec{x}, \varvec{x}'\in \mathcal {X}\):
$$\begin{aligned} \Vert \xi _t(\varvec{x}) - \xi _t(\varvec{x}')\Vert _{\psi _2}&\le \lambda ^{-1}\Vert \langle \varvec{a}, \varvec{x}-\varvec{x}'\rangle \Vert _{\psi _2} + \Vert \psi _t(\langle \varvec{a}, \varvec{x}\rangle + \tau ) - \psi _t(\langle \varvec{a}, \varvec{x}'\rangle + \tau )\Vert _{\psi _2}\\&\le \big (\lambda ^{-1} + \tfrac{64}{t \lambda }\big ) \cdot \Vert \langle \varvec{a}, \varvec{x}-\varvec{x}'\rangle \Vert _{\psi _2} \le \lambda ^{-1} \cdot (1 + \tfrac{64}{t}) \cdot L\cdot \Vert \varvec{x}- \varvec{x}'\Vert _{2} \\&= L\cdot (1 + 64 t^{-1}) \cdot d_T(\varvec{x},\varvec{x}'). \end{aligned}$$This implies \(L_{t}\le L\cdot (1 + 64t^{-1})\). Furthermore, observe that \(|\tilde{y}_t(\varvec{x})| \le 1\), so that
$$\begin{aligned} \Vert \xi _t(\varvec{x})\Vert _{\psi _2}\le \lambda ^{-1}\Vert \langle \varvec{a}, \varvec{x}\rangle \Vert _{\psi _2} + \Vert \tilde{y}_t(\varvec{x})\Vert _{\psi _2}\lesssim R L\lambda ^{-1} + 1 \end{aligned}$$for every \(\varvec{x}\in \mathcal {X}\). This shows \(r\lesssim R L\lambda ^{-1} + 1\). On Assumption 2(c). We observe that the function \(s\mapsto \phi _t(s) {:=}|{{\,\mathrm{sign}\,}}(s)-\psi _t(s)|\) is \(\tfrac{64}{t\lambda }\)-Lipschitz. Therefore, for every \(\varvec{x}, \varvec{x}'\in \mathcal {X}\), we obtain
$$\begin{aligned} \Vert \varepsilon _t(\varvec{x}) - \varepsilon _t(\varvec{x}')\Vert _{\psi _2}\le \tfrac{64}{t\lambda } \cdot \Vert \langle \varvec{a}, \varvec{x}- \varvec{x}'\rangle \Vert _{\psi _2}\le 64 Lt^{-1} \cdot d_T(\varvec{x},\varvec{x}'). \end{aligned}$$Hence, \(\hat{L}_{t}\le 64 Lt^{-1}\). Finally, since \(|\varepsilon _t(\varvec{x})|\le 1\) for every \(\varvec{x}\in \mathcal {X}\), it follows that \(\hat{r}\lesssim 1\).
\(\square \)
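As an informal numerical sanity check (not part of the formal argument), the dithering identity from Step 3, \(\mathbb {E}_{\tau }[{{\,\mathrm{sign}\,}}(s + \tau )] = \lambda ^{-1}s\) for \(|s|\le \lambda \) and \({{\,\mathrm{sign}\,}}(s)\) otherwise, can be confirmed by Monte Carlo simulation over the dither \(\tau \sim \mathcal {U}[-\lambda , \lambda ]\); the value of \(\lambda \), the test points, and the sample size below are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(1)
lam, n = 2.0, 2_000_000              # dither level λ and sample size (arbitrary)
tau = rng.uniform(-lam, lam, n)      # dither τ ~ U[-λ, λ]

for s in (-2.5, -1.0, 0.0, 0.7, 3.0):        # test points on both branches
    emp = np.mean(np.sign(s + tau))          # Monte Carlo estimate of E_τ[sign(s+τ)]
    exact = s / lam if abs(s) <= lam else np.sign(s)
    assert abs(emp - exact) < 5e-3, (s, emp, exact)
```

For \(|s|\le \lambda \), the estimate tracks the linear branch \(\lambda ^{-1}s\), which is precisely the smoothing effect exploited in the proof.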
8.3 Proofs for Sect. 4.3
Proof of Corollary 4
We follow the proof template from the beginning of Sect. 8:
- Step 1a. The model setup of Corollary 4 fits into Assumption 1 as follows:
  - The measurement vector \(\varvec{a}\in \mathbb {R}^p\) is centered, isotropic, and sub-Gaussian with \(\Vert \varvec{a}\Vert _{\psi _2} \le L\). The signal set \(\mathcal {X}\) is a bounded subset of \(\mathbb {R}^p\). The output function \(F:\mathbb {R}^p \times \mathcal {X}\rightarrow \mathbb {R}\) takes the form \(F(\varvec{a}, \varvec{x}) {:=}q_\delta (\langle \varvec{a}, \varvec{x}\rangle + \tau )\), where \(\tau \) is uniformly distributed on \([-\delta , \delta ]\) and independent of \(\varvec{a}\). In particular, \(F\) is a random function. Moreover, the observation vector of \(\varvec{x}\) is given by \(\tilde{\varvec{y}}(\varvec{x}) = q_\delta (\varvec{A}\varvec{x}+\varvec{\tau })\), where \(\varvec{A}\in \mathbb {R}^{m\times p}\) denotes the sub-Gaussian random matrix with row vectors \(\varvec{a}_1, \dots , \varvec{a}_m\in \mathbb {R}^p\) and \(\varvec{\tau }{:=}(\tau _1, \dots , \tau _m)\).
  - The target function \(T:\mathcal {X}\rightarrow K\) is the canonical embedding into \(K\), i.e., \(T\varvec{x}{:=}\varvec{x}\). In particular, we have that \(d_T(\varvec{x},\varvec{x}')= \Vert \varvec{x}- \varvec{x}'\Vert _{2}\).
- Step 1b. The target mismatch \(\rho _{}(\varvec{x})\) vanishes for every \(\varvec{x}\in \mathcal {X}\).
- Step 1c. If \(t\le 128\delta \), there exists an approximation \(\tilde{y}_t(\varvec{x})\) of the observation variable \(\tilde{y}(\varvec{x})\) such that the conditions of Assumption 2 are fulfilled with
$$\begin{aligned} L_{t}\lesssim L\delta t^{-1},\quad \hat{L}_{t}\lesssim L\delta t^{-1},\quad r\lesssim \delta ,\quad \hat{r}\lesssim \delta . \end{aligned}$$(8.4) If \(t> 128\delta \), we choose any maximal \(\tfrac{t}{256}\)-packing \(\mathcal {X}_t\) for \(\mathcal {X}\) (i.e., a maximal subset of \(\mathcal {X}\) such that \(\Vert \varvec{x}-\varvec{x}'\Vert _{2}> \tfrac{t}{256}\) for all \(\varvec{x}, \varvec{x}'\in \mathcal {X}_t\) with \(\varvec{x}\ne \varvec{x}'\)) and show that for the trivial choice \(\tilde{y}_t(\varvec{x}){:=}\tilde{y}(\varvec{x})\), the conditions of Assumption 2 are fulfilled on \(\mathcal {X}_t\) with
$$\begin{aligned} L_{t}\lesssim \delta t^{-1},\quad \hat{L}_{t}=0,\quad r\lesssim \delta ,\quad \hat{r}= 0. \end{aligned}$$(8.5)
- Step 2. We first note that for \(t\ge \delta \), the inequality \(c_0 L^{-2} \sqrt{\delta t} / \sqrt{\max \{1, \log ( \delta e/t)\}}\le \tfrac{t}{40}\) holds if \(c_0\le \tfrac{1}{40}\). Furthermore, the condition \(\tfrac{1}{m}\Vert \varvec{\nu }\Vert _{0} \le t \delta ^{-1}\) is trivially fulfilled. Consequently, it suffices to prove the statement of Corollary 4 for all those input vectors \(\varvec{y}\in \delta \mathbb {Z}^m\) that satisfy the condition (a) in (4.7), and all those \(\varvec{y}\in \delta \mathbb {Z}^m\) that satisfy condition (b) in (4.7) in the case \(t\le \delta \). We now distinguish between the two cases \(t\le 128\delta \) and \(t>128\delta \): The case \(t\le 128\delta \). From (8.4), it follows that the first and third branches of the condition (2.3) are implied by (4.6) for a sufficiently large universal constant \(C'>0\). For the choice \(\varDelta {:=}L^{-1} \sqrt{\log L}\), \(m_0{:=}0\), and \(u{:=}u_0\), we observe that the second branch of (2.3) follows from the first one. Therefore, the claim of Corollary 4 follows from Theorem 2 for all those input vectors \(\varvec{y}\in \delta \mathbb {Z}^m\) that satisfy the condition (a) in (4.7). Next, let us assume that \(t\le \delta \) and consider the choice
$$\begin{aligned} \varDelta ^2 {:=}\frac{\delta }{4C tL^4 \log (\delta e/ t)}, \quad m_0{:=}\lfloor m t \delta ^{-1}\rfloor , \quad u_0 {:=}\sqrt{2m t \delta ^{-1}\log (\delta e / t)}, \end{aligned}$$where \(C>0\) denotes the universal constant from (2.3). This implies \(u_0\ge \sqrt{m_0\log (em/m_0)}\) as well as \(u_0^2\ge u^2\) if the universal constant \(C'\) in (4.6) is large enough. Furthermore, the second branch of (2.3) is satisfied: We have that \(CL^4\varDelta ^2u_0^2= \tfrac{m}{2}\), and assuming that (4.6) holds with \(C'\ge \frac{1}{2}\), it follows that
$$\begin{aligned} CL^4\varDelta ^2 \cdot w_{t}^2(K- T\mathcal {X})\le \tfrac{1}{4}\delta t^{-1} \cdot w_{t}^2(K- \mathcal {X})\le \tfrac{1}{4} L^2 \delta ^2 t^{-2} \cdot w_{t}^2(K- \mathcal {X})\le \tfrac{m}{2}. \end{aligned}$$It remains to show that every \(\varvec{y}\in \delta \mathbb {Z}^m\) given by (4.5) such that
$$\begin{aligned} \tfrac{1}{\sqrt{m}}\Vert \varvec{\nu }\Vert _{2}\le \tfrac{c_0\sqrt{\delta t}}{L^2\sqrt{\log (\delta e/t)}}\quad \text {and} \quad \tfrac{1}{m}\Vert \varvec{\nu }\Vert _{0}\le \tfrac{t}{\delta } \end{aligned}$$also satisfies the condition (2.4). Indeed, since \(\varvec{\nu }=\varvec{y}- \tilde{\varvec{y}}(\varvec{x})\in \delta \mathbb {Z}^m\) and \(\Vert \varvec{\nu }\Vert _{0}\le m t \delta ^{-1}\), it follows that \(|\{i\in [m] \ :y_i \ne \tilde{y}_i(\varvec{x})\}|\le \lfloor m t \delta ^{-1}\rfloor \). Consequently, \(\sigma _{m_0}(\varvec{y}- \tilde{\varvec{y}}(\varvec{x}))_{2}=0\) and
$$\begin{aligned} \tfrac{1}{\sqrt{m}}\Vert \varvec{y}- \tilde{\varvec{y}}(\varvec{x})\Vert _{[m_0]}= \tfrac{1}{\sqrt{m}}\Vert \varvec{y}- \tilde{\varvec{y}}(\varvec{x})\Vert _{2}\le \tfrac{c_0\sqrt{\delta t}}{L^2\sqrt{\log (\delta e/t)}}\le \varDelta t, \end{aligned}$$where the last inequality holds for \(c_0>0\) sufficiently small. Theorem 2 now implies the claim of Corollary 4 for all those input vectors \(\varvec{y}\in \delta \mathbb {Z}^m\) that satisfy the condition (b) in (4.7). The case \(t> 128\delta \). Setting \(m_0{:=}0\), \(\varDelta {:=}L^{-1} \sqrt{\log L}\), and \(u{:=}u_0\), the second branch of (2.3) is implied by the first one. Due to (8.5), Theorem 2 applied to the signal set \(\mathcal {X}_t\) now implies that there exist universal constants \(c_1,C_1>0\) such that if
$$\begin{aligned} m \ge C_1 \cdot L^2 \cdot \Big ( (\log L+ t^{-2} \delta ^2) \cdot \big (w_{t}^2(K- \mathcal {X}_t)+ u^2 \big ) + t^{-4} \delta ^2 \cdot w_{}^2(\mathcal {X}_t) \Big ), \end{aligned}$$(8.6) the following event \({\mathcal {A}}_1\) occurs with probability at least \(1 - \exp (-c_1 u^2)\): For every \(\varvec{x}' \in \mathcal {X}_t\) and \(\varvec{y}\in \delta \mathbb {Z}^m\) such that \(\tfrac{1}{\sqrt{m}}\Vert \varvec{y}-q_{\delta }(\varvec{A}\varvec{x}'+\varvec{\tau })\Vert _{2} \le \tfrac{t}{20}\), every minimizer \(\hat{\varvec{z}}\) of (\({\mathsf {P}}_{K,\varvec{y}}\)) satisfies \(\Vert \hat{\varvec{z}}- \varvec{x}'\Vert _{2} \le t\). Moreover, according to Theorem 6, there exists a universal constant \(C_2>0\) such that if
$$\begin{aligned} m\ge C_2\cdot L^2 \log L\cdot \big (w_{t}^2(\mathcal {X}-\mathcal {X}) + u^2\big ), \end{aligned}$$(8.7) the following event \({\mathcal {A}}_2\) occurs with probability at least \(1-3\exp (-u^2)\):
$$\begin{aligned} {\mathop {{\mathrm{sup}}}\limits _{\begin{array}{c} \varvec{x}\in \mathcal {X},\, \varvec{x}'\in \mathcal {X}_t\\ \Vert \varvec{x}-\varvec{x}'\Vert _{2}\le \tfrac{t}{256} \end{array}}} \big \Vert \tfrac{1}{\sqrt{m}}\varvec{A}(\varvec{x}-\varvec{x}')\big \Vert _{2} \le \tfrac{t}{128}. \end{aligned}$$We claim that Corollary 4 holds on the intersection of the events \({\mathcal {A}}_1\) and \({\mathcal {A}}_2\). To this end, let \(\varvec{x}\in \mathcal {X}\) be arbitrary and assume that \(\varvec{y}\in \delta \mathbb {Z}^m\) satisfies \(\tfrac{1}{\sqrt{m}}\Vert \varvec{y}-q_{\delta }(\varvec{A}\varvec{x}+\varvec{\tau })\Vert _{2} \le \tfrac{t}{40}\). Since \(\mathcal {X}_t\) is a maximal \(\tfrac{t}{256}\)-packing for \(\mathcal {X}\), there exists \(\varvec{x}'\in \mathcal {X}_t\) with \(\Vert \varvec{x}-\varvec{x}'\Vert _{2}\le \tfrac{t}{256}\) (otherwise \(\mathcal {X}_t\) would not be maximal). On the event \({\mathcal {A}}_2\) and by the triangle inequality, we obtain
$$\begin{aligned} \tfrac{1}{\sqrt{m}}\big \Vert \varvec{y}-q_{\delta }(\varvec{A}\varvec{x}'+\varvec{\tau })\big \Vert _{2}&\le \tfrac{1}{\sqrt{m}}\big \Vert \varvec{y}-q_{\delta }(\varvec{A}\varvec{x}+\varvec{\tau })\big \Vert _{2} \\&\quad + \tfrac{1}{\sqrt{m}}\big \Vert q_{\delta }(\varvec{A}\varvec{x}+\varvec{\tau })-q_{\delta }(\varvec{A}\varvec{x}'+\varvec{\tau })\big \Vert _{2} \\&\le \tfrac{t}{40} + \tfrac{1}{\sqrt{m}}\big \Vert (\varvec{A}\varvec{x}+ \varvec{\tau }) -(\varvec{A}\varvec{x}' + \varvec{\tau }) \big \Vert _{2} \\&\quad + \tfrac{1}{\sqrt{m}}\big \Vert q_\delta (\varvec{A}\varvec{x}+ \varvec{\tau }) - (\varvec{A}\varvec{x}+ \varvec{\tau })\big \Vert _{2} \\&\quad + \tfrac{1}{\sqrt{m}}\big \Vert q_\delta (\varvec{A}\varvec{x}' + \varvec{\tau }) - (\varvec{A}\varvec{x}' + \varvec{\tau })\big \Vert _{2} \\&\le \tfrac{t}{40} + \tfrac{t}{128}+ 2\delta \le \tfrac{t}{20}. \end{aligned}$$On the event \({\mathcal {A}}_1\), every minimizer \(\hat{\varvec{z}}\) of (\({\mathsf {P}}_{K,\varvec{y}}\)) satisfies \(\Vert \hat{\varvec{z}}- \varvec{x}'\Vert _{2} \le t\), and therefore
$$\begin{aligned} \Vert \hat{\varvec{z}}- \varvec{x}\Vert _{2} \le \Vert \hat{\varvec{z}}- \varvec{x}'\Vert _{2} + \Vert \varvec{x}' - \varvec{x}\Vert _{2} \le t + \tfrac{t}{256}\le 2t. \end{aligned}$$Finally, for \(C'>0\) sufficiently large, we conclude that the condition (4.6) implies both (8.6) and (8.7). Hence, the events \({\mathcal {A}}_1\) and \({\mathcal {A}}_2\) occur jointly with probability at least \(1 - \exp (-c u^2)\), provided that \(c>0\) is chosen small enough.
- Step 3. Let \(\varvec{x}\in \mathcal {X}\). Since \(\varvec{a}\) is isotropic, we have that
$$\begin{aligned} \rho _{}(\varvec{x}) = \big \Vert \mathbb {E}[\tilde{y}(\varvec{x})\varvec{a}] - \varvec{x}\big \Vert _{2} \le \big \Vert \mathbb {E}\big [\big (q_\delta (\langle \varvec{a}, \varvec{x}\rangle + \tau )-\langle \varvec{a}, \varvec{x}\rangle \big )\varvec{a}\big ]\big \Vert _{2}, \end{aligned}$$so that it suffices to show that
$$\begin{aligned} \mathbb {E}\big [\big (q_\delta (\langle \varvec{a}, \varvec{x}\rangle + \tau )-\langle \varvec{a}, \varvec{x}\rangle \big )\langle \varvec{a}, \varvec{z}\rangle \big ]=0 \qquad \text {for all } \varvec{z}\in \mathbb {S}^{p-1}. \end{aligned}$$It is straightforward to check that
$$\begin{aligned} \mathbb {E}[q_\delta (s + \tau )] = s \qquad \text {for all } s\in \mathbb {R}. \end{aligned}$$(8.8) In other words, integrating over the dithering variable \(\tau \) allows us to “eliminate” the quantizer. For the sake of completeness, we show the identity (8.8) for \(s = k \cdot (2\delta ) + s'\) where \(k\in \mathbb {Z}\) and \(s'\in [0,\delta ]\); the case \(s = k \cdot (2\delta ) + s'\) where \(k\in \mathbb {Z}\) and \(s'\in (\delta , 2\delta )\) can be treated analogously. First, observe that
$$\begin{aligned} q_\delta (s + \tau )=(2\lceil \tfrac{s + \tau }{2\delta }\rceil -1)\delta = (2\lceil k + \tfrac{s' + \tau }{2\delta }\rceil -1)\delta . \end{aligned}$$Since \(\tfrac{s'+\tau }{2\delta } \in [-\tfrac{1}{2}, 1]\), it follows that \(q_\delta (s + \tau )\in \{(2k-1)\delta , (2k+1)\delta \}\) and more precisely,
$$\begin{aligned}&\tau \in [-\delta -s',-s']&\quad \Rightarrow \quad q_\delta (s + \tau ) =(2k-1)\delta ,\\&\tau \in (-s', 2\delta -s']&\quad \Rightarrow \quad q_\delta (s + \tau ) =(2k+1)\delta . \end{aligned}$$Therefore, we obtain
$$\begin{aligned} \mathbb {E}[q_\delta (s + \tau )]&= \mathbb {P}(\tau \in [-\delta -s',-s']) \cdot (2k-1)\delta \\&\qquad + \mathbb {P}(\tau \in (-s', 2\delta -s']) \cdot (2k+1)\delta \\&= \mathbb {P}(\tau \in [-\delta ,-s']) \cdot (2k-1)\delta + \mathbb {P}(\tau \in (-s', \delta ]) \cdot (2k+1)\delta \\&= \big (\tfrac{-s'+\delta }{2\delta }\big ) \cdot (2k-1)\delta + \big (\tfrac{\delta + s'}{2\delta }\big ) \cdot (2k+1)\delta \\&=\tfrac{s'}{2} + \tfrac{1}{2}(2k-1)\delta + \tfrac{s'}{2} + \tfrac{1}{2}(2k+1)\delta = 2k\delta + s' = s. \end{aligned}$$Since \(\tau \) and \(\varvec{a}\) are independent, we can apply (8.8) as follows:
$$\begin{aligned} \mathbb {E}\big [\big (q_\delta (\langle \varvec{a}, \varvec{x}\rangle + \tau )-\langle \varvec{a}, \varvec{x}\rangle \big )\langle \varvec{a}, \varvec{z}\rangle \big ]&=\mathbb {E}_{\varvec{a}}\mathbb {E}_\tau \big [\big (q_\delta (\langle \varvec{a}, \varvec{x}\rangle + \tau )-\langle \varvec{a}, \varvec{x}\rangle \big )\langle \varvec{a}, \varvec{z}\rangle \big ]\\&=\mathbb {E}\big [\big (\langle \varvec{a}, \varvec{x}\rangle -\langle \varvec{a}, \varvec{x}\rangle \big )\langle \varvec{a}, \varvec{z}\rangle \big ] = 0. \end{aligned}$$
- Step 4. Before distinguishing between the cases \(t\le 128\delta \) and \(t>128\delta \) according to Step 1c, let us analyze the action of the quantizer \(q_\delta \) in more detail. To this end, we partition the real axis \(\mathbb {R}\) into half-open intervals \({\mathcal {I}}_k\) of length \(2\delta \) given by \({\mathcal {I}}_k {:=}(e_{k-1}, e_k]\) for \(k\in \mathbb {Z}\) with \(e_k {:=}k\cdot 2\delta \). Then, \(q_\delta \) maps every point in the interval \({\mathcal {I}}_k\) to its center point:
$$\begin{aligned} q_\delta (s)= \frac{e_{k-1}+e_k}{2}, \quad s\in {\mathcal {I}}_k. \end{aligned}$$In particular, \(q_\delta (s)\) is discontinuous exactly at the points \(\{ e_k \ :k\in \mathbb {Z}\}\). The basic idea is now to approximate the quantizer \(q_\delta \) by a Lipschitz continuous function by “cutting out” intervals of a certain radius \(t^\circ > 0\) around each point \(e_k\) and inserting a straight line that connects both quantization values. Assuming that \(t^\circ \le \delta \), the resulting function takes the form
$$\begin{aligned} \psi _{t^{\circ }}(s) {:=}{\left\{ \begin{array}{ll} \tfrac{\delta }{t^\circ }\cdot s + e_k(1-\tfrac{\delta }{t^\circ }) , &{}\text {if } \exists k\in \mathbb {Z}: s\in {\mathcal {E}}_{k,t^\circ },\\ q_\delta (s), &{}\text {otherwise,} \end{array}\right. } \quad s \in \mathbb {R}, \end{aligned}$$where \({\mathcal {E}}_{k,t^\circ } {:=}[e_k-t^\circ , e_k+t^\circ ]\); note that if \(t^\circ =\delta \), then \(\psi _{t^{\circ }}\) is just the identity function. We also define
$$\begin{aligned} \phi _{t^{\circ }}(s){:=}|q_\delta (s)-\psi _{t^{\circ }}(s)|,\quad s\in \mathbb {R}. \end{aligned}$$Let us now show that \(\psi _{t^{\circ }}\) and \(\phi _{t^{\circ }}\) are both \(\tfrac{\delta }{t^\circ }\)-Lipschitz. Since \(q_\delta \) is locally constant on \(\mathbb {R}\setminus \{e_k \ :k\in \mathbb {Z}\}\) and \(s\mapsto \tfrac{\delta }{t^\circ }\cdot s + e_k(1-\tfrac{\delta }{t^\circ })\) is clearly \(\tfrac{\delta }{t^\circ }\)-Lipschitz, it is sufficient to show that \(\psi _{t^{\circ }}\) is continuous, i.e., \(\tfrac{\delta }{t^\circ }\cdot s + e_k(1-\tfrac{\delta }{t^\circ })=q_\delta (s)\) for \(s \in \mathbb {R}\) with \(|s-e_k|=t^\circ \) for some \(k \in \mathbb {Z}\). If \(s=e_k-t^\circ \), then \(\tfrac{\delta }{t^\circ }\cdot s + e_k(1-\tfrac{\delta }{t^\circ })=e_k-\delta \), and we also have that
$$\begin{aligned} q_\delta (s)=\big (2\big \lceil \tfrac{k\cdot (2\delta ) - t^\circ }{2\delta }\big \rceil -1\big )\delta =(2\lceil k - \tfrac{t^\circ }{2\delta }\rceil -1)\delta . \end{aligned}$$Since \(t^\circ \in (0,\delta ]\), it follows that \(\lceil k - \tfrac{t^\circ }{2\delta }\rceil =k\), and therefore, \(q_\delta (s)=(2k-1)\delta = e_k-\delta \). The case \(s=e_k+t^\circ \) works analogously. Thus, \(\psi _{t^{\circ }}\) is indeed \(\tfrac{\delta }{t^\circ }\)-Lipschitz. For \(\phi _{t^{\circ }}\), it is clearly sufficient to show \(\tfrac{\delta }{t^\circ }\)-Lipschitz continuity on \({\mathcal {E}}_{k,t^\circ }\) for every \(k\in \mathbb {Z}\). In the case \(s\in [e_k-t^\circ , e_k]\), we observe that \(q_\delta (s)=e_k-\delta \) and \(\tfrac{\delta }{t^\circ }\cdot s + e_k(1-\tfrac{\delta }{t^\circ })\ge e_k-\delta \), implying that
$$\begin{aligned} \phi _{t^{\circ }}(s) = \tfrac{\delta }{t^\circ }\cdot s + e_k(1-\tfrac{\delta }{t^\circ }) - (e_k-\delta )= \delta - \tfrac{\delta }{t^\circ }(e_k-s). \end{aligned}$$Thus, \(\phi _{t^{\circ }}\) is \(\tfrac{\delta }{t^\circ }\)-Lipschitz continuous on \([e_k-t^\circ , e_k]\). On the other hand, if \(s\in (e_k, e_k+t^\circ ]\), we have that \(q_\delta (s)=e_k+\delta \) and \(e_k+\delta \ge \tfrac{\delta }{t^\circ }\cdot s + e_k(1-\tfrac{\delta }{t^\circ })\), implying that
$$\begin{aligned} \phi _{t^{\circ }}(s) = e_k+\delta - \big (\tfrac{\delta }{t^\circ }\cdot s + e_k(1-\tfrac{\delta }{t^\circ })\big ) = \delta - \tfrac{\delta }{t^\circ }(s-e_k). \end{aligned}$$Thus, \(\phi _{t^{\circ }}\) is \(\tfrac{\delta }{t^\circ }\)-Lipschitz continuous on \((e_k, e_k+t^\circ ]\). Moreover, \(\lim _{s\rightarrow e_k}\phi _{t^{\circ }}(s)=\delta =\phi _{t^{\circ }}(e_k)\), which shows that \(\phi _{t^{\circ }}\) is indeed \(\tfrac{\delta }{t^\circ }\)-Lipschitz continuous on \({\mathcal {E}}_{k,t^\circ }\). Note that the above calculations also allow us to write
$$\begin{aligned} \phi _{t^{\circ }}(s) = {\left\{ \begin{array}{ll} \delta - \tfrac{\delta }{t^\circ }|s-e_k|, &{}\text {if } \exists k\in \mathbb {Z}: s\in [e_k-t^\circ , e_k+t^\circ ],\\ 0, &{}\text {otherwise,} \end{array}\right. } \quad s \in \mathbb {R}. \end{aligned}$$In particular, we have that \(0\le \phi _{t^{\circ }}(s)\le \delta \cdot \chi _{{\mathcal {E}}_{t^{\circ }}}(s)\), where \({\mathcal {E}}_{t^{\circ }}{:=}\cup _{k\in \mathbb {Z}}{\mathcal {E}}_{k,t^\circ }\). We now distinguish between the two cases \(t\le 128\delta \) and \(t > 128\delta \): The case \(t\le 128\delta \). Using the above notation, we choose \(t^\circ {:=}\frac{t}{128}\) and approximate \(\tilde{y}(\varvec{x}) = q_\delta (\langle \varvec{a}, \varvec{x}\rangle + \tau )\) by \(\tilde{y}_t(\varvec{x}) {:=}\psi _{t^\circ }(\langle \varvec{a}, \varvec{x}\rangle + \tau )\). The absolute value of the approximation error is then given by \(\varepsilon _t(\varvec{x}) = \phi _{t^\circ }(\langle \varvec{a}, \varvec{x}\rangle +\tau )\). We now show that for this choice of approximation, the conditions of Assumption 2 are indeed fulfilled with (8.4): On Assumption 2(a). We first observe that the indicator function \(\chi _{{\mathcal {E}}_{t^\circ }}\) is \(2\delta \)-periodic on \(\mathbb {R}\). Since the random variable \(s + \tau \) is uniformly distributed on \([s-\delta , s+\delta ]\) for every \(s \in \mathbb {R}\), this implies
$$\begin{aligned} \mathbb {E}_{\tau }[\chi _{{\mathcal {E}}_{t^\circ }}(s + \tau )] = \mathbb {E}_{\tau }[\chi _{{\mathcal {E}}_{t^\circ }}(\tau )] = \mathbb {E}_{\tau }[\chi _{{\mathcal {E}}_{0, t^\circ }}(\tau )]=\mathbb {E}_{\tau }[\chi _{[-t^\circ ,t^\circ ]}(\tau )] = t^\circ \delta ^{-1}. \end{aligned}$$(8.9)Now, let \(\varvec{x}\in \mathcal {X}\) and \(\varvec{z}\in \mathbb {S}^{p-1}\). Using the independence of \(\varvec{a}\) and \(\tau \) in conjunction with the identity (8.9), we obtain
$$\begin{aligned} \mathbb {E}[\varepsilon _t(\varvec{x}) \cdot |\langle \varvec{a}, \varvec{z}\rangle |]&=\mathbb {E}[\phi _{t^\circ }(\langle \varvec{a}, \varvec{x}\rangle +\tau ) \cdot |\langle \varvec{a}, \varvec{z}\rangle |] \\&\le \delta \cdot \mathbb {E}[\chi _{{\mathcal {E}}_{t^\circ }}(\langle \varvec{a}, \varvec{x}\rangle + \tau ) \cdot |\langle \varvec{a}, \varvec{z}\rangle |]\\&\le \delta \cdot \mathbb {E}_{\varvec{a}} \big [|\langle \varvec{a}, \varvec{z}\rangle | \cdot \mathbb {E}_{\tau } [\chi _{{\mathcal {E}}_{t^\circ }}(\langle \varvec{a}, \varvec{x}\rangle + \tau )]\big ] \\&\le t^\circ \cdot \mathbb {E}[|\langle \varvec{a}, \varvec{z}\rangle |] \le \tfrac{t}{64} \cdot \mathbb {E}[|\langle \varvec{a}, \varvec{z}\rangle |]. \end{aligned}$$From Jensen’s inequality and the isotropy of \(\varvec{a}\), it follows that
$$\begin{aligned} \mathbb {E}[|\langle \varvec{a}, \varvec{z}\rangle |]\le (\mathbb {E}[|\langle \varvec{a}, \varvec{z}\rangle |^2])^{1/2}=\Vert \varvec{z}\Vert _{2}=1, \end{aligned}$$which shows that Assumption 2(a) is satisfied. On Assumption 2(b). Since \(\psi _{t^\circ }\) is \(\tfrac{\delta }{t^\circ }\)-Lipschitz, the following holds for every \(\varvec{x}, \varvec{x}'\in \mathcal {X}\):
$$\begin{aligned} \Vert \xi _t(\varvec{x}) - \xi _t(\varvec{x}')\Vert _{\psi _2}&\le \Vert \langle \varvec{a}, \varvec{x}-\varvec{x}'\rangle \Vert _{\psi _2} + \Vert \psi _{t^\circ }(\langle \varvec{a}, \varvec{x}\rangle + \tau ) - \psi _{t^\circ }(\langle \varvec{a}, \varvec{x}'\rangle + \tau )\Vert _{\psi _2}\\&\le (1 + \tfrac{\delta }{t^\circ }) \cdot \Vert \langle \varvec{a}, \varvec{x}-\varvec{x}'\rangle \Vert _{\psi _2} \le (1 + \tfrac{\delta }{t^\circ }) \cdot L\cdot \Vert \varvec{x}- \varvec{x}'\Vert _{2}\\&= L\cdot (1 + \tfrac{128 \delta }{t}) \cdot d_T(\varvec{x},\varvec{x}'). \end{aligned}$$This implies \(L_{t}\lesssim L\delta t^{-1}\). Furthermore, we clearly have that \({{\mathrm{sup}}}_{s\in \mathbb {R}}|s-\psi _{t^\circ }(s + \tau )|\lesssim \delta \), and therefore
$$\begin{aligned} \Vert \xi _t(\varvec{x})\Vert _{\psi _2}=\Vert \langle \varvec{a}, \varvec{x}\rangle -\psi _{t^\circ }(\langle \varvec{a}, \varvec{x}\rangle + \tau )\Vert _{\psi _2}\lesssim \delta \end{aligned}$$for every \(\varvec{x}\in \mathcal {X}\). This shows \(r\lesssim \delta \). On Assumption 2(c). Since \(\phi _{t^\circ }\) is \(\tfrac{\delta }{t^\circ }\)-Lipschitz, the following holds for every \(\varvec{x}, \varvec{x}'\in \mathcal {X}\):
$$\begin{aligned} \Vert \varepsilon _t(\varvec{x}) - \varepsilon _t(\varvec{x}')\Vert _{\psi _2}&=\Vert \phi _{t^\circ }(\langle \varvec{a}, \varvec{x}\rangle +\tau ) - \phi _{t^\circ }(\langle \varvec{a}, \varvec{x}'\rangle +\tau ) \Vert _{\psi _2}\\&\le \tfrac{\delta }{t^\circ } \Vert \langle \varvec{a}, \varvec{x}- \varvec{x}'\rangle \Vert _{\psi _2} \le L\cdot \tfrac{128\delta }{t} \cdot d_T(\varvec{x},\varvec{x}'), \end{aligned}$$and therefore, \(\hat{L}_{t}\lesssim L\delta t^{-1}\). Finally, since \(0\le \phi _{t^\circ }(s)\le \delta \) for every \(s\in \mathbb {R}\), it follows that \(\hat{r}\lesssim \delta \). The case \(t> 128\delta \). Let \(\mathcal {X}_t\subset \mathcal {X}\) be any maximal \(\tfrac{t}{256}\)-packing for \(\mathcal {X}\). We show that for the choice \(\tilde{y}_t(\varvec{x}) {:=}\tilde{y}(\varvec{x})\), the conditions of Assumption 2 are fulfilled on \(\mathcal {X}_t\) with (8.5). Since \(\varepsilon _t(\varvec{x}) = 0\) for all \(\varvec{x}\in \mathcal {X}_t\), Assumption 2(a) and (c) are trivially fulfilled with \(\hat{L}_{t}=\hat{r}=0\). Furthermore, observing that \(\xi _t(\varvec{x})=\langle \varvec{a}, \varvec{x}\rangle -q_\delta (\langle \varvec{a}, \varvec{x}\rangle + \tau )\) and \({{\mathrm{sup}}}_{s\in \mathbb {R}} |s - q_\delta (s + \tau )|\le 2\delta \), we conclude that \({{\mathrm{sup}}}_{\varvec{x}\in \mathcal {X}_t}\Vert \xi _t(\varvec{x})\Vert _{\psi _2}\lesssim \delta \). Finally, using that \(\Vert \varvec{x}-\varvec{x}'\Vert _{2}> \tfrac{t}{256}\) for all \(\varvec{x}, \varvec{x}' \in \mathcal {X}_t\) with \(\varvec{x}\ne \varvec{x}'\), we can bound the sub-Gaussian norm of the multiplier increments as follows:
$$\begin{aligned} \Vert \xi _t(\varvec{x})-\xi _t(\varvec{x}')\Vert _{\psi _2}\le & {} \Vert \xi _t(\varvec{x})\Vert _{\psi _2} + \Vert \xi _t(\varvec{x}')\Vert _{\psi _2} \lesssim \delta < \tfrac{256\delta }{t}\Vert \varvec{x}-\varvec{x}'\Vert _{2}\\= & {} \tfrac{256\delta }{t}\cdot d_T(\varvec{x},\varvec{x}'). \end{aligned}$$This shows that Assumption 2(b) is satisfied on \(\mathcal {X}_t\) with \(L_{t}\lesssim \delta t^{-1}\) and \(r\lesssim \delta \).
\(\square \)
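As a numerical sanity check of the construction above, the following snippet instantiates a midpoint uniform quantizer with breakpoints \(e_k = 2k\delta \) (a hypothetical reconstruction of \(q_\delta \), consistent with the \(2\delta \)-periodicity used above but not restated verbatim in this excerpt) together with its piecewise linear surrogate, and verifies the closed form of \(\phi _{t^\circ }\), the bound \(0\le \phi _{t^\circ }\le \delta \), and the \(\tfrac{\delta }{t^\circ }\)-Lipschitz property on a fine grid:

```python
import math

# Numerical sanity check for the quantizer surrogate. ASSUMPTIONS (hypothetical
# reconstruction, not stated verbatim in this excerpt): q is the midpoint uniform
# quantizer with resolution 2*delta and breakpoints e_k = 2*k*delta; psi replaces
# each jump of q by a linear ramp on [e_k - t0, e_k + t0].
delta, t0 = 0.5, 0.1   # t0 plays the role of t° (for t <= 128*delta, t° = t/128)

def q(s):
    """Midpoint uniform quantizer: constant e_k + delta on [e_k, e_{k+1})."""
    return 2 * delta * math.floor(s / (2 * delta)) + delta

def psi(s):
    """(delta/t0)-Lipschitz surrogate of q (linear ramp through each jump)."""
    e_k = 2 * delta * round(s / (2 * delta))   # nearest breakpoint
    if abs(s - e_k) <= t0:
        return e_k + (delta / t0) * (s - e_k)
    return q(s)

def phi(s):
    """Claimed closed form of the approximation error |q - psi|."""
    e_k = 2 * delta * round(s / (2 * delta))
    return max(0.0, delta - (delta / t0) * abs(s - e_k))

grid = [i * 1e-3 for i in range(-4000, 4000)]
# phi equals |q - psi|, and 0 <= phi <= delta with support in the union of the
# intervals [e_k - t0, e_k + t0]:
assert all(abs(abs(q(s) - psi(s)) - phi(s)) < 1e-9 for s in grid)
assert all(0.0 <= phi(s) <= delta for s in grid)
# psi is (delta/t0)-Lipschitz (difference quotients never exceed delta/t0):
lip = max(abs(psi(s + 1e-4) - psi(s)) / 1e-4 for s in grid)
assert lip <= delta / t0 + 1e-6
```

The check mirrors the trade-off exploited in the proof: the surrogate incurs an approximation error of at most \(\delta \), supported on a set of relative measure \(t^\circ \delta ^{-1}\) per period, at the price of a Lipschitz constant of order \(\delta /t^\circ \).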
8.4 Proofs for Sect. 4.4
Proof of Theorem 1
We follow the proof template from the beginning of Sect. 8:
- Step 1a. The model setup of Theorem 1 fits into Assumption 1 as follows:
  - We have that \(\varvec{a}\sim \mathsf {N}(\varvec{0}, \varvec{I}_{p})\) and therefore \(\Vert \varvec{a}\Vert _{\psi _2} \lesssim 1\). The signal set \(\mathcal {X}\) is an arbitrary subset of \(\mathbb {S}^{p-1}\). The output function \(F:\mathbb {R}^p \times \mathcal {X}\rightarrow \mathbb {R}\) takes the form \(F(\varvec{a}, \varvec{x}) {:=}f(\langle \varvec{a}, \varvec{x}\rangle )\).
  - The target function \(T:\mathcal {X}\rightarrow K\) corresponds to rescaling by a factor of \(\mu = \mathbb {E}[f(g)g]\) with \(g\sim \mathsf {N}(0, 1)\), i.e., \(T\varvec{x}{:=}\mu \varvec{x}\). In particular, we have that \(d_T(\varvec{x},\varvec{x}')= \mu \Vert \varvec{x}- \varvec{x}'\Vert _{2}\).
- Step 1b. The target mismatch \(\rho _{}(\varvec{x})\) vanishes for every \(\varvec{x}\in \mathcal {X}\).
- Step 1c. Let \(g\sim \mathsf {N}(0, 1)\). For the trivial choice \(\tilde{y}_t(\varvec{x}){:=}\tilde{y}(\varvec{x})\), the conditions of Assumption 2 are fulfilled with
$$\begin{aligned} L_{t}\lesssim 1 + \gamma \mu ^{-1},\quad \hat{L}_{t}= 0,\quad r= \Vert f(g) - \mu g\Vert _{\psi _2},\quad \hat{r}= 0. \end{aligned}$$
- Step 2. Setting \(m_0{:=}0\), \(\varDelta {:=}1\), and \(u_0 {:=}u\), the claim of Theorem 1 follows directly from Theorem 2.
- Step 3. Let \(\varvec{x}\in \mathcal {X}\subset \mathbb {S}^{p-1}\) and consider the orthogonal decomposition of the standard Gaussian random vector \(\varvec{a}\) along \(\varvec{x}\):
$$\begin{aligned} \varvec{a}= \langle \varvec{a}, \varvec{x}\rangle \varvec{x}+ P_{{\varvec{x}}^\perp }(\varvec{a}), \end{aligned}$$where \(P_{{\varvec{x}}^\perp } {:=}P_{{\{\varvec{x}\}}^\perp }\). Since \(P_{{\varvec{x}}^\perp }(\varvec{a})\) is centered and \(\langle \varvec{a}, \varvec{x}\rangle \sim \mathsf {N}(0, 1)\) is independent of \(P_{{\varvec{x}}^\perp }(\varvec{a})\), we have that
$$\begin{aligned} \mathbb {E}[\tilde{y}(\varvec{x})\varvec{a}]&=\mathbb {E}[f(\langle \varvec{a}, \varvec{x}\rangle ) (\langle \varvec{a}, \varvec{x}\rangle \varvec{x}+ P_{{\varvec{x}}^\perp }(\varvec{a}))]\\&=\mu \varvec{x}+ \mathbb {E}[f(\langle \varvec{a}, \varvec{x}\rangle )] \cdot \mathbb {E}[P_{{\varvec{x}}^\perp }(\varvec{a})] =\mu \varvec{x}= T\varvec{x}, \end{aligned}$$which implies that \(\rho _{}(\varvec{x})=0\).
- Step 4. We simply set \(\tilde{y}_t(\varvec{x}){:=}\tilde{y}(\varvec{x})\). Then, \(\varepsilon _t(\varvec{x}) = 0\), implying that Assumption 2(a) and (c) are trivially fulfilled with \(\hat{L}_{t}=\hat{r}=0\). Furthermore, the following holds for every \(\varvec{x}, \varvec{x}'\in \mathcal {X}\):
$$\begin{aligned} \Vert \xi _t(\varvec{x}) - \xi _t(\varvec{x}')\Vert _{\psi _2}&\le \mu \Vert \langle \varvec{a}, \varvec{x}-\varvec{x}'\rangle \Vert _{\psi _2} + \Vert f(\langle \varvec{a}, \varvec{x}\rangle ) - f(\langle \varvec{a}, \varvec{x}'\rangle )\Vert _{\psi _2}\\&\lesssim \mu \Vert \varvec{x}- \varvec{x}'\Vert _{2} + \gamma \Vert \langle \varvec{a}, \varvec{x}-\varvec{x}'\rangle \Vert _{\psi _2} \\&\lesssim (\mu + \gamma ) \cdot \Vert \varvec{x}- \varvec{x}'\Vert _{2} = (1 + \gamma \mu ^{-1}) \cdot d_T(\varvec{x},\varvec{x}'). \end{aligned}$$This implies \(L_{t}\lesssim 1 + \gamma \mu ^{-1}\). Since \(\langle \varvec{a}, \varvec{x}\rangle \sim \mathsf {N}(0, 1)\) for every \(\varvec{x}\in \mathcal {X}\subset \mathbb {S}^{p-1}\), we can also conclude that
$$\begin{aligned} \Vert \xi _t(\varvec{x})\Vert _{\psi _2}=\Vert \mu \langle \varvec{a}, \varvec{x}\rangle - f(\langle \varvec{a}, \varvec{x}\rangle )\Vert _{\psi _2}=\Vert f(g) - \mu g\Vert _{\psi _2}, \quad g\sim \mathsf {N}(0, 1), \end{aligned}$$which shows that \(r{:=}\Vert f(g) - \mu g\Vert _{\psi _2}\) is a valid choice. Hence, Assumption 2(b) is satisfied as well.
\(\square \)
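The identity \(\mathbb {E}[\tilde{y}(\varvec{x})\varvec{a}] = \mu \varvec{x}\) established in Step 3 can also be checked by simulation. The snippet below uses \(f=\tanh \) as an illustrative nonlinearity and a fixed unit vector in \(\mathbb {R}^4\); both are arbitrary choices for the check, not prescribed by the model:

```python
import math
import random

# Monte Carlo check of Step 3: for a ~ N(0, I_p) and ||x||_2 = 1, it holds that
# E[f(<a, x>) a] = mu * x with mu = E[f(g) g], g ~ N(0, 1).
# f = tanh is only an illustrative choice (an assumption for this check).
f = math.tanh

# mu = E[f(g) g] via a deterministic trapezoidal rule on [-8, 8].
h = 1e-3
xs = [-8.0 + i * h for i in range(int(16 / h) + 1)]
pdf = lambda s: math.exp(-s * s / 2) / math.sqrt(2 * math.pi)
vals = [f(s) * s * pdf(s) for s in xs]
mu = h * (sum(vals) - 0.5 * (vals[0] + vals[-1]))

random.seed(1)
x = [1 / 3, 2 / 3, 2 / 3, 0.0]          # a fixed unit vector in R^4
n = 200_000
est = [0.0] * 4
for _ in range(n):
    a = [random.gauss(0.0, 1.0) for _ in range(4)]
    y = f(sum(ai * xi for ai, xi in zip(a, x)))
    for j in range(4):
        est[j] += y * a[j]
est = [e / n for e in est]

# Componentwise agreement with mu * x (tolerance of several standard errors):
assert all(abs(est[j] - mu * x[j]) < 0.02 for j in range(4))
```

Note that the fourth coordinate of \(\mu \varvec{x}\) is zero, so the check also confirms that the projection \(P_{{\varvec{x}}^\perp }(\varvec{a})\) contributes nothing in expectation.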
Proof of Corollary 5
We follow the proof template from the beginning of Sect. 8:
- Step 1a. The model setup of Corollary 5 fits into Assumption 1 as follows:
  - We have that \(\varvec{a}\sim \mathsf {N}(\varvec{0}, \varvec{I}_{p})\) and therefore \(\Vert \varvec{a}\Vert _{\psi _2} \lesssim 1\). The signal set \(\mathcal {X}\) is an arbitrary subset of \(\mathbb {S}^{p-1}\). The output function \(F:\mathbb {R}^p \times \mathcal {X}\rightarrow \mathbb {R}\) takes the form \(F(\varvec{a}, \varvec{x}) {:=}{\mathfrak {m}}_\lambda (\langle \varvec{a}, \varvec{x}\rangle )\).
  - The target function \(T:\mathcal {X}\rightarrow K\) corresponds to rescaling by a factor of \(\mu _\lambda = \mathbb {E}[{\mathfrak {m}}_\lambda (g)g]\) with \(g\sim \mathsf {N}(0, 1)\), i.e., \(T\varvec{x}{:=}\mu _\lambda \varvec{x}\). In particular, we have that \(d_T(\varvec{x},\varvec{x}')= \mu _\lambda \Vert \varvec{x}- \varvec{x}'\Vert _{2}\).
- Step 1b. The target mismatch \(\rho _{}(\varvec{x})\) vanishes for every \(\varvec{x}\in \mathcal {X}\).
- Step 1c. There exists an approximation \(\tilde{y}_t(\varvec{x})\) of the observation variable \(\tilde{y}(\varvec{x})\) such that the conditions of Assumption 2 are fulfilled with
$$\begin{aligned} L_{t}\lesssim \tfrac{\lambda }{t},\quad \hat{L}_{t}\lesssim \tfrac{\lambda }{t},\quad r\lesssim 1,\quad \hat{r}\lesssim 1. \end{aligned}$$(8.10)
- Step 2. We begin by showing that if \(\lambda \ge C\) for a sufficiently large absolute constant \(C>0\), then \(\mu _\lambda \ge \tfrac{1}{2}\). Indeed, for \(g\sim \mathsf {N}(0, 1)\), we have that
$$\begin{aligned} \mu _\lambda = \mathbb {E}[{\mathfrak {m}}_\lambda (g)g]&= \mathbb {E}[\chi _{[-\lambda , \lambda )}(g)g^2] + \mathbb {E}[\chi _{\mathbb {R}\backslash [-\lambda , \lambda )}(g){\mathfrak {m}}_\lambda (g)g]\\&\ge \mathbb {E}[\chi _{[-\lambda , \lambda )}(g)g^2] - \mathbb {E}[\chi _{\mathbb {R}\backslash [-\lambda , \lambda )}(g)g^2]\\&=\mathbb {E}[g^2] - 2\mathbb {E}[\chi _{\mathbb {R}\backslash [-\lambda , \lambda )}(g)g^2]\\&=1-2\mathbb {E}[\chi _{\mathbb {R}\backslash [-\lambda , \lambda )}(g)g^2]. \end{aligned}$$The claim now follows by observing that \(\mathbb {E}[g^2]<\infty \), which implies \(\lim _{\lambda \rightarrow \infty }\mathbb {E}[\chi _{\mathbb {R}\backslash [-\lambda , \lambda )}(g)g^2]=0\). Moreover,
$$\begin{aligned} \mu _\lambda&= \mathbb {E}[\chi _{[-\lambda , \lambda )}(g)g^2] + \mathbb {E}[\chi _{\mathbb {R}\backslash [-\lambda , \lambda )}(g){\mathfrak {m}}_\lambda (g)g]\\&\le \mathbb {E}[\chi _{[-\lambda , \lambda )}(g)g^2] + \mathbb {E}[\chi _{\mathbb {R}\backslash [-\lambda , \lambda )}(g)\lambda |g|] \le \mathbb {E}[g^2] =1. \end{aligned}$$Setting \(m_0{:=}0\), \(\varDelta {:=}1\), and \(u_0 {:=}u\), the claim of Corollary 5 follows directly from Theorem 2.
- Step 3. Let \(\varvec{x}\in \mathcal {X}\subset \mathbb {S}^{p-1}\) and consider the orthogonal decomposition of the standard Gaussian random vector \(\varvec{a}\) along \(\varvec{x}\):
$$\begin{aligned} \varvec{a}= \langle \varvec{a}, \varvec{x}\rangle \varvec{x}+ P_{{\varvec{x}}^\perp }(\varvec{a}), \end{aligned}$$where \(P_{{\varvec{x}}^\perp } {:=}P_{{\{\varvec{x}\}}^\perp }\). Since \(P_{{\varvec{x}}^\perp }(\varvec{a})\) is centered and \(\langle \varvec{a}, \varvec{x}\rangle \sim \mathsf {N}(0, 1)\) is independent of \(P_{{\varvec{x}}^\perp }(\varvec{a})\), we have that
$$\begin{aligned} \mathbb {E}[\tilde{y}(\varvec{x})\varvec{a}]&=\mathbb {E}[{\mathfrak {m}}_\lambda (\langle \varvec{a}, \varvec{x}\rangle ) (\langle \varvec{a}, \varvec{x}\rangle \varvec{x}+ P_{{\varvec{x}}^\perp }(\varvec{a}))]\\&=\mu _\lambda \varvec{x}+ \mathbb {E}[{\mathfrak {m}}_\lambda (\langle \varvec{a}, \varvec{x}\rangle )] \cdot \mathbb {E}[P_{{\varvec{x}}^\perp }(\varvec{a})] =\mu _\lambda \varvec{x}= T\varvec{x}, \end{aligned}$$which implies that \(\rho _{}(\varvec{x})=0\).
- Step 4. Let \(t\in (0, \lambda )\). We define the half-open intervals \({\mathcal {I}}_{k,t}{:=}[k\lambda -(\lambda -t), k\lambda +(\lambda -t))\) for \(k \in \mathbb {Z}\) even and \({\mathcal {I}}_{k,t}{:=}[k\lambda -t, k\lambda +t)\) for \(k \in \mathbb {Z}\) odd. Moreover, we set
$$\begin{aligned} {\mathcal {I}}_{\text {even},t}{:=}\bigcup _{\begin{array}{c} k\in \mathbb {Z}\\ k \text { even} \end{array}}{\mathcal {I}}_{k,t}, \quad {\mathcal {I}}_{\text {odd},t}{:=}\bigcup _{\begin{array}{c} k\in \mathbb {Z}\\ k \text { odd} \end{array}}{\mathcal {I}}_{k,t} \end{aligned}$$and observe that the intervals \({\mathcal {I}}_{k,t}\), \(k\in \mathbb {Z}\), form a partition of \(\mathbb {R}\). Finally, we introduce the functions
$$\begin{aligned} g_t(s){:=}-\big (\tfrac{\lambda -t}{t}\big )\cdot s, \qquad s\in \mathbb {R}, \end{aligned}$$and
$$\begin{aligned} \psi _t(s) {:=}{\left\{ \begin{array}{ll} {\mathfrak {m}}_\lambda (s), &{}s\in {\mathcal {I}}_{k,t},\; k \text { even},\\ g_t(s-k\lambda ), \quad &{}s\in {\mathcal {I}}_{k,t},\; k \text { odd}, \end{array}\right. } \end{aligned}$$and
$$\begin{aligned} \phi _t(s){:=}|{\mathfrak {m}}_\lambda (s)-\psi _t(s)|, \qquad s\in \mathbb {R}. \end{aligned}$$Let us show that both \(\psi _t\) and \(\phi _t\) are \(\tfrac{\lambda }{t}\)-Lipschitz continuous. We begin by verifying that \(\psi _t\) is continuous. Clearly, \(\psi _t\) is continuous on \({\mathcal {I}}_{k,t}\) for every odd integer k. Now let k be an even integer and let \(s\in {\mathcal {I}}_{k,t}=[k\lambda -(\lambda -t), k\lambda +(\lambda -t))\). Then \(\big \lfloor \tfrac{s+\lambda }{2\lambda }\big \rfloor =\tfrac{k}{2}\), and therefore \({\mathfrak {m}}_\lambda (s)=s-k\lambda \) on \({\mathcal {I}}_{k,t}\). This shows that \(\psi _t\) is continuous on every interval \({\mathcal {I}}_{k,t}\). Let us now write \({\mathcal {I}}_{k,t}=[u_k, v_k)\) for \(k\in \mathbb {Z}\). To show that \(\psi _t\) is continuous, it remains to check that \(\lim _{s\uparrow v_k}\psi _t(s)=\psi _t(u_{k+1})\) for every \(k\in \mathbb {Z}\). For k even, we have that
$$\begin{aligned} \lim _{s\uparrow v_k}\psi _t(s) = \lim _{s\uparrow v_k}{\mathfrak {m}}_\lambda (s)=\lim _{s\uparrow v_k} s-k\lambda = v_k-k\lambda = \lambda - t. \end{aligned}$$Furthermore, it holds that \(u_{k+1}=(k+1)\lambda -t\in {\mathcal {I}}_{k+1,t}\), which implies
$$\begin{aligned} \psi _t(u_{k+1})=g_t(u_{k+1}-(k+1)\lambda )=g_t(-t)=\lambda -t. \end{aligned}$$On the other hand, for k odd, we have that
$$\begin{aligned} \lim _{s\uparrow v_k}\psi _t(s)=\lim _{s\uparrow v_k}g_t(s-k\lambda )=\lim _{s\uparrow v_k} -\big (\tfrac{\lambda -t}{t}\big )(s-k\lambda )=-\big (\tfrac{\lambda -t}{t}\big )(v_k-k\lambda )=-(\lambda -t). \end{aligned}$$Furthermore, it holds that \(u_{k+1}=k\lambda +t\in {\mathcal {I}}_{k+1,t}\), which implies
$$\begin{aligned} \psi _t(u_{k+1})={\mathfrak {m}}_\lambda (u_{k+1})=u_{k+1}-(k+1)\lambda =-(\lambda -t). \end{aligned}$$This shows that \(\psi _t\) is continuous. Since \(\psi _t\) is continuous and piecewise linear, it is \(\max \{1, \tfrac{\lambda -t}{t}\}\)-Lipschitz. Therefore, \(\psi _t\) is also \(\tfrac{\lambda }{t}\)-Lipschitz. Next, we show that \(\phi _t\) is \(\tfrac{\lambda }{t}\)-Lipschitz as well. First note that
$$\begin{aligned} \phi _t(s) = |{\mathfrak {m}}_\lambda (s)-\psi _t(s)|\cdot \chi _{{\mathcal {I}}_{\text {odd},t}}(s),\qquad s\in \mathbb {R}. \end{aligned}$$Let \(s\in {\mathcal {I}}_{k,t}=[k\lambda -t, k\lambda + t)\) for k an odd integer. Let us show that
$$\begin{aligned} |{\mathfrak {m}}_\lambda (s)-\psi _t(s)|=\lambda - \tfrac{\lambda }{t}|k\lambda -s|. \end{aligned}$$To this end, we distinguish the cases \(s\in [k\lambda -t, k\lambda )\) and \(s\in [k\lambda , k\lambda +t)\). First, let \(s\in [k\lambda -t, k\lambda )\). It follows that \(\big \lfloor \tfrac{s+\lambda }{2\lambda }\big \rfloor =\tfrac{k-1}{2}\), which implies \({\mathfrak {m}}_\lambda (s)=s-k\lambda +\lambda \). Furthermore, \(g_t(s-k\lambda )=s-k\lambda +\tfrac{\lambda }{t}(k\lambda -s)\). It follows that \({\mathfrak {m}}_\lambda (s)\ge g_t(s-k\lambda )\) and
$$\begin{aligned} |{\mathfrak {m}}_\lambda (s)-\psi _t(s)| = {\mathfrak {m}}_\lambda (s) - g_t(s-k\lambda ) = \lambda - \tfrac{\lambda }{t}|k\lambda -s|. \end{aligned}$$The case \(s\in [k\lambda , k\lambda +t)\) can be treated analogously. Altogether, this shows that for any \(s\in \mathbb {R}\), we have that
$$\begin{aligned} \phi _t(s) = \sum _{\begin{array}{c} k\in \mathbb {Z}\\ k\text { odd} \end{array}}\chi _{{\mathcal {I}}_{k,t}}(s)\cdot (\lambda - \tfrac{\lambda }{t}|k\lambda -s|). \end{aligned}$$In particular, \(\phi _t\) is \(\tfrac{\lambda }{t}\)-Lipschitz. We approximate the observation variable \(\tilde{y}(\varvec{x})={\mathfrak {m}}_\lambda (\langle \varvec{a}, \varvec{x}\rangle )\) by \(\tilde{y}_t(\varvec{x})=\psi _{t^\circ }(\langle \varvec{a}, \varvec{x}\rangle )\) for \(t^{\circ }{:=}\tfrac{t}{56\cdot 64}\). Observe that \(t^\circ < \tfrac{\lambda }{2}\). Let us show that the conditions of Assumption 2 are indeed fulfilled with (8.10): On Assumption 2(a). Let \(\varvec{x}\in \mathcal {X}\) and \(\varvec{z}\in \mathbb {S}^{p-1}\), and consider the orthogonal decomposition
$$\begin{aligned} \varvec{z}= \langle \varvec{z}, \varvec{x}\rangle \varvec{x}+ P_{{\varvec{x}}^\perp }(\varvec{z}). \end{aligned}$$This implies
$$\begin{aligned} \mathbb {E}[\varepsilon _t(\varvec{x}) \cdot |\langle \varvec{a}, \varvec{z}\rangle |]\le |\langle \varvec{z}, \varvec{x}\rangle |\cdot \mathbb {E}[\varepsilon _t(\varvec{x}) \cdot |\langle \varvec{a}, \varvec{x}\rangle |] + \mathbb {E}[\varepsilon _t(\varvec{x}) \cdot |\langle \varvec{a}, P_{{\varvec{x}}^\perp }(\varvec{z})\rangle |]. \end{aligned}$$Clearly, \(|\langle \varvec{z}, \varvec{x}\rangle |\le 1\) and \(\varepsilon _t(\varvec{x})=\phi _{t^\circ }(\langle \varvec{a}, \varvec{x}\rangle )\le \lambda \cdot \chi _{{\mathcal {I}}_{\text {odd}, t^{\circ }}}(\langle \varvec{a}, \varvec{x}\rangle )\), which implies
$$\begin{aligned} \mathbb {E}[\varepsilon _t(\varvec{x}) \cdot |\langle \varvec{a}, \varvec{z}\rangle |]\le & {} \underbrace{\lambda \cdot \mathbb {E}[\chi _{{\mathcal {I}}_{\text {odd}, t^{\circ }}}(\langle \varvec{a}, \varvec{x}\rangle )\cdot |\langle \varvec{a}, \varvec{x}\rangle |]}_{=:S_1} {} \\&+ {} \underbrace{\lambda \cdot \mathbb {E}[\chi _{{\mathcal {I}}_{\text {odd}, t^{\circ }}}(\langle \varvec{a}, \varvec{x}\rangle )\cdot |\langle \varvec{a}, P_{{\varvec{x}}^\perp }(\varvec{z})\rangle |]}_{=:S_2}. \end{aligned}$$Let us estimate \(S_1\) and \(S_2\) separately, starting with \(S_2\). Since \(\langle \varvec{a}, \varvec{x}\rangle \) and \(\langle \varvec{a}, P_{{\varvec{x}}^\perp }(\varvec{z})\rangle \) are independent, we observe that
$$\begin{aligned} \mathbb {E}[\chi _{{\mathcal {I}}_{\text {odd}, t^{\circ }}}(\langle \varvec{a}, \varvec{x}\rangle )\cdot |\langle \varvec{a}, P_{{\varvec{x}}^\perp }(\varvec{z})\rangle |] = \mathbb {E}[\chi _{{\mathcal {I}}_{\text {odd}, t^{\circ }}}(\langle \varvec{a}, \varvec{x}\rangle )]\cdot \mathbb {E}[|\langle \varvec{a}, P_{{\varvec{x}}^\perp }(\varvec{z})\rangle |]. \end{aligned}$$Using Jensen’s inequality, we obtain
$$\begin{aligned} \mathbb {E}[|\langle \varvec{a}, P_{{\varvec{x}}^\perp }(\varvec{z})\rangle |]\le \Big (\mathbb {E}[|\langle \varvec{a}, P_{{\varvec{x}}^\perp }(\varvec{z})\rangle |^2]\Big )^{1/2}=\Vert P_{{\varvec{x}}^\perp }(\varvec{z})\Vert _{2}\le \Vert \varvec{z}\Vert _{2}=1. \end{aligned}$$Therefore, it follows that \(S_2 \le \lambda \cdot \mathbb {P}(\langle \varvec{a}, \varvec{x}\rangle \in {\mathcal {I}}_{\text {odd},t^\circ })\). Since \(\langle \varvec{a}, \varvec{x}\rangle \sim \mathsf {N}(0, 1)\), it holds that
$$\begin{aligned} \tfrac{1}{2}\mathbb {P}(\langle \varvec{a}, \varvec{x}\rangle \in {\mathcal {I}}_{\text {odd},t^\circ }) = \tfrac{1}{\sqrt{2\pi }} \sum _{\begin{array}{c} k\in \mathbb {N}\\ k\text { odd} \end{array}} \int _{k\lambda -t^\circ }^{k\lambda +t^\circ }e^{-x^2/2}\, dx \le \tfrac{2t^\circ }{\sqrt{2\pi }} \sum _{\begin{array}{c} k\in \mathbb {N}\\ k\text { odd} \end{array}} e^{-(k\lambda -t^\circ )^2/2}. \end{aligned}$$For every \(k\in \mathbb {N}\),
$$\begin{aligned} 2\lambda e^{-((k+2)\lambda -t^\circ )^2/2}\le \int _{k\lambda -t^\circ }^{(k+2)\lambda -t^\circ }e^{-x^2/2}\, dx. \end{aligned}$$Therefore,
$$\begin{aligned} \sum _{\begin{array}{c} k\in \mathbb {N}\\ k\text { odd} \end{array}} e^{-(k\lambda -t^\circ )^2/2}&= e^{-(\lambda -t^\circ )^2/2} + \sum _{\begin{array}{c} k\in \mathbb {N}\\ k\text { odd} \end{array}} e^{-((k+2)\lambda -t^\circ )^2/2}\\&\le \tfrac{1}{\lambda -t^\circ }\int _{0}^{\lambda -t^\circ }e^{-x^2/2}\, dx + \tfrac{1}{2\lambda } \sum _{\begin{array}{c} k\in \mathbb {N}\\ k\text { odd} \end{array}}\int _{k\lambda -t^\circ }^{(k+2)\lambda -t^\circ }e^{-x^2/2}\, dx\\&\le \tfrac{1}{\lambda -t^\circ }\int _{0}^{\infty }e^{-x^2/2}\,dx. \end{aligned}$$Using the identity \(\tfrac{1}{\sqrt{2\pi }}\int _{0}^{\infty }e^{-x^2/2}\,dx=\tfrac{1}{2}\), we obtain
$$\begin{aligned} S_2 \le \tfrac{2 t^{\circ }\lambda }{\lambda -t^{\circ }}\le 4t^{\circ }, \end{aligned}$$where the last inequality follows from \(t^\circ <\tfrac{\lambda }{2}\). Let us now estimate \(S_1\). Since \(\langle \varvec{a}, \varvec{x}\rangle \sim \mathsf {N}(0, 1)\), we have that
$$\begin{aligned} \tfrac{1}{2}\mathbb {E}[\chi _{{\mathcal {I}}_{\text {odd}, t^{\circ }}}(\langle \varvec{a}, \varvec{x}\rangle )\cdot |\langle \varvec{a}, \varvec{x}\rangle |] = \tfrac{1}{\sqrt{2\pi }} \sum _{\begin{array}{c} k\in \mathbb {N}\\ k\text { odd} \end{array}} \int _{k\lambda -t^\circ }^{k\lambda +t^\circ }xe^{-x^2/2}\, dx. \end{aligned}$$It holds that
$$\begin{aligned}&\sum _{\begin{array}{c} k\in \mathbb {N}\\ k\text { odd} \end{array}} \int _{k\lambda -t^\circ }^{k\lambda +t^\circ }xe^{-x^2/2}\, dx\\&\quad \le 2t^\circ \sum _{\begin{array}{c} k\in \mathbb {N}\\ k\text { odd} \end{array}} (k\lambda +t^\circ )e^{-(k\lambda -t^\circ )^2/2}\\&\quad = 2t^\circ \Big ((\lambda +t^\circ )\cdot e^{-(\lambda -t^\circ )^2/2} + \sum _{\begin{array}{c} k\in \mathbb {N}\\ k\text { odd} \end{array}}((k+2)\lambda +t^\circ )e^{-((k+2)\lambda -t^\circ )^2/2}\Big )\\&\quad \le 2t^\circ \Big ( \tfrac{1}{\lambda -t^\circ }\int _0^{\lambda -t^\circ }(x+\lambda +t^\circ )\cdot e^{-x^2/2}\, dx \\&\qquad + \tfrac{1}{2\lambda }\sum _{\begin{array}{c} k\in \mathbb {N}\\ k\text { odd} \end{array}}\int _{k\lambda -t^\circ }^{(k+2)\lambda -t^\circ }(x+2\lambda +2t^\circ )\cdot e^{-x^2/2}\, dx\Big ) \\&\quad \le \tfrac{4t^\circ }{\lambda }\int _0^\infty (x+3\lambda )\cdot e^{-x^2/2}\, dx\\&\quad =\tfrac{4t^\circ }{\lambda }\cdot (1+3\lambda \sqrt{\tfrac{\pi }{2}}). \end{aligned}$$Therefore, if \(\lambda \le 2\), then
$$\begin{aligned} \sum _{\begin{array}{c} k\in \mathbb {N}\\ k\text { odd} \end{array}} \int _{k\lambda -t^\circ }^{k\lambda +t^\circ }xe^{-x^2/2}\, dx\le \tfrac{4t^\circ }{\lambda }\cdot (1+6\sqrt{\tfrac{\pi }{2}})\le \tfrac{52t^\circ }{\lambda }. \end{aligned}$$If \(\lambda >2\), then \(\lambda -t^\circ >1\). Since the function \(x\mapsto x\exp (-x^2/2)\) is monotonically decreasing for \(x\in [1,\infty )\), we obtain
$$\begin{aligned}&\sum _{\begin{array}{c} k\in \mathbb {N}\\ k\text { odd} \end{array}} \int _{k\lambda -t^\circ }^{k\lambda +t^\circ }xe^{-x^2/2}\, dx\\&\quad \le 2t^\circ \cdot \sum _{\begin{array}{c} k\in \mathbb {N}\\ k\text { odd} \end{array}} (k\lambda -t^\circ )\cdot e^{-(k\lambda -t^\circ )^2/2}\\&\quad =2t^\circ \cdot \Big ( (\lambda -t^\circ )\cdot \exp (-(\lambda -t^\circ )^2/2)+\sum _{\begin{array}{c} k\in \mathbb {N}\\ k\text { odd} \end{array}} ((k+2)\lambda -t^\circ )\cdot e^{-((k+2)\lambda -t^\circ )^2/2}\Big ). \end{aligned}$$Since \(\lambda -t^\circ >1\), it follows \((\lambda -t^\circ )\cdot \exp (-(\lambda -t^\circ )^2/2)\le \tfrac{1}{\lambda -t^\circ }\). Furthermore,
$$\begin{aligned} \sum _{\begin{array}{c} k\in \mathbb {N}\\ k\text { odd} \end{array}} ((k+2)\lambda -t^\circ )\cdot e^{-((k+2)\lambda -t^\circ )^2/2} \le \tfrac{1}{2\lambda }\sum _{\begin{array}{c} k\in \mathbb {N}\\ k\text { odd} \end{array}} \int _{k\lambda - t^\circ }^{(k+2)\lambda -t^\circ } xe^{-x^2/2}\,dx \le \tfrac{1}{2\lambda }. \end{aligned}$$Therefore, if \(\lambda >2\), then
$$\begin{aligned} \sum _{\begin{array}{c} k\in \mathbb {N}\\ k\text { odd} \end{array}} \int _{k\lambda -t^\circ }^{k\lambda +t^\circ }xe^{-x^2/2}\, dx\le \tfrac{8t^\circ }{\lambda }. \end{aligned}$$This shows \(S_1\le 52 t^\circ \). Putting everything together, we obtain
$$\begin{aligned} \mathbb {E}[\varepsilon _t(\varvec{x}) \cdot |\langle \varvec{a}, \varvec{z}\rangle |] \le S_1 + S_2\le 56t^\circ =\tfrac{t}{64}, \end{aligned}$$which implies that Assumption 2(a) is satisfied. On Assumption 2(b). For \(\varvec{x}, \varvec{x}'\in \mathcal {X}\), we have that
$$\begin{aligned} \Vert \xi _t(\varvec{x}) - \xi _t(\varvec{x}')\Vert _{\psi _2}&\le \mu _\lambda \cdot \Vert \langle \varvec{a}, \varvec{x}-\varvec{x}'\rangle \Vert _{\psi _2}+\Vert \psi _{t^\circ }(\langle \varvec{a}, \varvec{x}\rangle )-\psi _{t^\circ }(\langle \varvec{a}, \varvec{x}'\rangle )\Vert _{\psi _2}\\&\lesssim \mu _\lambda \cdot \Vert \varvec{x}-\varvec{x}'\Vert _{2} + \tfrac{\lambda }{t^\circ } \cdot \Vert \varvec{x}-\varvec{x}'\Vert _{2}, \end{aligned}$$where we have used that \(\psi _{t^\circ }\) is \(\tfrac{\lambda }{t^\circ }\)-Lipschitz. Since \(\mu _\lambda \ge \tfrac{1}{2}\), \(t\le \lambda \), and \(t^\circ =\tfrac{t}{56\cdot 64}\), it follows that
$$\begin{aligned} \Vert \xi _t(\varvec{x}) - \xi _t(\varvec{x}')\Vert _{\psi _2}\lesssim \tfrac{\lambda }{t}\cdot d_T(\varvec{x},\varvec{x}'). \end{aligned}$$Let \(\varvec{x}\in \mathcal {X}\subset \mathbb {S}^{p-1}\). Since \(\mu _\lambda \le 1\) and \(|\psi _{t^\circ }(s)|\le |s|\) for all \(s\in \mathbb {R}\), we observe that
$$\begin{aligned} \Vert \xi _t(\varvec{x})\Vert _{\psi _2}\le \mu _\lambda \cdot \Vert \langle \varvec{a}, \varvec{x}\rangle \Vert _{\psi _2} + \Vert \psi _{t^\circ }(\langle \varvec{a}, \varvec{x}\rangle )\Vert _{\psi _2}\lesssim 1. \end{aligned}$$Therefore, Assumption 2(b) is satisfied with \(L_{t}\lesssim \tfrac{\lambda }{t}\) and \(r\lesssim 1\). On Assumption 2(c). For \(\varvec{x}, \varvec{x}'\in \mathcal {X}\),
$$\begin{aligned} \Vert \varepsilon _t(\varvec{x})-\varepsilon _t(\varvec{x}')\Vert _{\psi _2}=\Vert \phi _{t^\circ }(\langle \varvec{a}, \varvec{x}\rangle ) - \phi _{t^\circ }(\langle \varvec{a}, \varvec{x}'\rangle )\Vert _{\psi _2}\lesssim \tfrac{\lambda }{t^\circ }\cdot \Vert \varvec{x}-\varvec{x}'\Vert _{2}, \end{aligned}$$where we have used that \(\phi _{t^\circ }\) is \(\tfrac{\lambda }{t^\circ }\)-Lipschitz. Since \(\mu _\lambda \ge \tfrac{1}{2}\) and \(t^\circ =\tfrac{t}{56\cdot 64}\), it follows that
$$\begin{aligned} \Vert \varepsilon _t(\varvec{x})-\varepsilon _t(\varvec{x}')\Vert _{\psi _2}\lesssim \tfrac{\lambda }{t}\cdot d_T(\varvec{x},\varvec{x}'). \end{aligned}$$Let \(\varvec{x}\in \mathcal {X}\subset \mathbb {S}^{p-1}\). Since \(|\phi _{t^\circ }(s)|\le |s|\) for all \(s\in \mathbb {R}\), we observe that
$$\begin{aligned} \Vert \varepsilon _t(\varvec{x})\Vert _{\psi _2}=\Vert \phi _{t^\circ }(\langle \varvec{a}, \varvec{x}\rangle )\Vert _{\psi _2}\lesssim 1. \end{aligned}$$Therefore, Assumption 2(c) is satisfied with \(\hat{L}_{t}\lesssim \tfrac{\lambda }{t}\) and \(\hat{r}\lesssim 1\).
\(\square \)
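The bounds \(\tfrac{1}{2}\le \mu _\lambda \le 1\) from Step 2 of the preceding proof are easy to confirm numerically. Taking \({\mathfrak {m}}_\lambda (s) = s - 2\lambda \big \lfloor \tfrac{s+\lambda }{2\lambda }\big \rfloor \), the centered modulo onto \([-\lambda , \lambda )\) implied by the floor identities in Step 4 (an assumed reconstruction, since the definition is not restated in this excerpt):

```python
import math

# Numeric check of Step 2. ASSUMPTION: m_lambda is the centered modulo
# m_lambda(s) = s - 2*lam*floor((s + lam) / (2*lam)), with values in [-lam, lam),
# which matches the floor identities used in Step 4.
def m(s, lam):
    return s - 2 * lam * math.floor((s + lam) / (2 * lam))

def mu(lam, h=1e-3, cut=10.0):
    """mu_lambda = E[m_lambda(g) g], g ~ N(0,1), via the trapezoidal rule."""
    pdf = lambda s: math.exp(-s * s / 2) / math.sqrt(2 * math.pi)
    n = int(2 * cut / h)
    vals = [m(-cut + i * h, lam) * (-cut + i * h) * pdf(-cut + i * h)
            for i in range(n + 1)]
    return h * (sum(vals) - 0.5 * (vals[0] + vals[-1]))

# The upper bound mu_lambda <= 1 holds for every lam ...
for lam in (0.5, 1.0, 2.0, 3.0, 5.0):
    assert mu(lam) <= 1.0 + 1e-3
# ... while the lower bound mu_lambda >= 1/2 kicks in once lam exceeds a
# moderate absolute constant (lam >= 2 already suffices numerically):
for lam in (2.0, 3.0, 5.0):
    assert mu(lam) >= 0.5
```

In particular, the numerics suggest that the "sufficiently large" constant \(C\) in Step 2 is modest: already for \(\lambda = 2\) one observes \(\mu _\lambda \approx 0.57\).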
Proof of Corollary 6
We follow the proof template from the beginning of Sect. 8:
- Step 1a. The model setup of Corollary 6 fits into Assumption 1 as follows:
  - The measurement vector \(\varvec{a}\in \mathbb {R}^p\) is centered and isotropic, and has independent, symmetric, and sub-Gaussian coordinates with \(\max _{j \in [p]}\Vert a_j\Vert _{\psi _2} \le L\); in particular, we have that \(\Vert \varvec{a}\Vert _{\psi _2}\lesssim L\) (e.g., see [64, Lem. 3.4.2]). The signal set \(\mathcal {X}\) is a subset of \(R\mathbb {B}_2^p\). The output function \(F:\mathbb {R}^p \times \mathcal {X}\rightarrow \mathbb {R}\) takes the form \(F(\varvec{a}, \varvec{x}) {:=}f(\varvec{a}\circ \varvec{x})\), where \(f:\mathbb {R}^p\rightarrow \mathbb {R}\) is given by \(f(\varvec{z}){:=}\sum _{j=1}^p f_j(z_j)\), with \(f_j:\mathbb {R}\rightarrow \mathbb {R}\) odd, \(\gamma \)-Lipschitz, and satisfying the conditions (4.10) and (4.11).
  - The target function \(T:\mathcal {X}\rightarrow \mathbb {R}^p\) is defined by \(T\varvec{x}{:=}\mathbb {E}[f(\varvec{a}\circ \varvec{x})\varvec{a}]\).
- Step 1b. The target mismatch \(\rho _{}(\varvec{x})\) vanishes for every \(\varvec{x}\in \mathcal {X}\).
- Step 1c. For the trivial choice \(\tilde{y}_t(\varvec{x}){:=}\tilde{y}(\varvec{x})\), the conditions of Assumption 2 are fulfilled with
$$\begin{aligned} L_{t}\lesssim L\cdot \tfrac{\gamma }{\alpha },\quad \hat{L}_{t}= 0,\quad r\lesssim L(\beta _2+\gamma )R,\quad \hat{r}= 0. \end{aligned}$$
- Step 2. We begin by showing that
$$\begin{aligned} (T\varvec{x})_j = \mathbb {E}[f_j(a_j x_j)a_j], \qquad \text {for all } j\in [p], \end{aligned}$$where \(a_j\) is the j-th coordinate of the measurement vector \(\varvec{a}\in \mathbb {R}^p\). Indeed, we have that
$$\begin{aligned} (T\varvec{x})_j&= (\mathbb {E}[f(\varvec{a}\circ \varvec{x})\varvec{a}])_j = \mathbb {E}[f(\varvec{a}\circ \varvec{x})a_j] \\&= \sum _{j'=1}^p \mathbb {E}[f_{j'}(a_{j'}x_{j'})a_j] = \mathbb {E}[f_j(a_jx_j)a_j], \end{aligned}$$where for the last equality we have used the independence of the random variables \(a_1, \ldots , a_p\) and \(\mathbb {E}[a_j]=0\). Next, let us show that \((T\varvec{x})_j\in [\beta _1x_j, \beta _2x_j]\) for \(x_j\ge 0\) and \((T\varvec{x})_j\in [\beta _2x_j, \beta _1x_j]\) for \(x_j<0\). Let us first assume that \(x_j\ge 0\). Then, clearly \(\mathbb {E}[f_j(a_jx_j)a_j\chi _{[0,\infty )}(a_j)]\le \beta _2 x_j \mathbb {E}[a_j^2\chi _{[0,\infty )}(a_j)]\). Furthermore,
$$\begin{aligned} \mathbb {E}[f_j(a_jx_j)a_j\chi _{(-\infty ,0)}(a_j)]&= \mathbb {E}[f_j(-a_jx_j)(-a_j)\chi _{(-\infty ,0)}(a_j)] \\&\le \beta _2x_j \mathbb {E}[a_j^2\chi _{(-\infty ,0)}(a_j)], \end{aligned}$$where we have used that \(f_j\) is an odd function for the first equality. Therefore,
$$\begin{aligned} \mathbb {E}[f_j(a_jx_j)a_j]&= \mathbb {E}[f_j(a_jx_j)a_j\chi _{[0,\infty )}(a_j)] + \mathbb {E}[f_j(a_jx_j)a_j\chi _{(-\infty ,0)}(a_j)] \\&\le \beta _2x_j \mathbb {E}[a_j^2] = \beta _2x_j, \end{aligned}$$where \(\mathbb {E}[a_j^2]=1\) follows from the assumption that \(\varvec{a}\) is isotropic. Similarly, we observe that
$$\begin{aligned} \mathbb {E}[f_j(a_jx_j)a_j\chi _{[0,\infty )}(a_j)] \ge \beta _1x_j \mathbb {E}[a_j^2\chi _{[0,\infty )}(a_j)] \end{aligned}$$and
$$\begin{aligned} \mathbb {E}[f_j(a_jx_j)a_j\chi _{(-\infty ,0)}(a_j)]&= \mathbb {E}[f_j(-a_jx_j)(-a_j)\chi _{(-\infty ,0)}(a_j)] \\&\ge \beta _1x_j\mathbb {E}[a_j^2\chi _{(-\infty ,0)}(a_j)], \end{aligned}$$which implies
$$\begin{aligned} \mathbb {E}[f_j(a_jx_j)a_j]\ge \beta _1x_j\mathbb {E}[a_j^2]=\beta _1x_j. \end{aligned}$$This shows that \(\beta _1x_j\le (T\varvec{x})_j\le \beta _2x_j\) for \(x_j\ge 0\). Analogously, one can show that \(\beta _2x_j\le (T\varvec{x})_j\le \beta _1x_j\) for \(x_j<0\). Therefore, \(|(T\varvec{x})_j|\le \beta _2|x_j|\), which implies \(\Vert T\varvec{x}\Vert _{2}\le \beta _2\Vert \varvec{x}\Vert _{2}\) for every \(\varvec{x}\in \mathcal {X}\). As a consequence, it holds that \(T\mathcal {X}\subset \beta _2R\mathbb {B}_2^p\). Setting \(m_0{:=}0\), \(\varDelta {:=}1\), and \(u_0 {:=}u\), the claim of Corollary 6 follows directly from Theorem 2.
Step 3. By definition of the target function it follows that \(\rho _{}(\varvec{x})=0\) for every \(\varvec{x}\in \mathcal {X}\).
Step 4. We simply set \(\tilde{y}_t(\varvec{x}){:=}\tilde{y}(\varvec{x})\). Then, \(\varepsilon _t(\varvec{x}) = 0\), implying that Assumption 2(a) and (c) are trivially fulfilled with \(\hat{L}_{t}=\hat{r}=0\). Since \(\xi _t(\varvec{x})=\langle \varvec{a}, T\varvec{x}\rangle -f(\varvec{a}\circ \varvec{x})\), the following holds for every \(\varvec{x}, \varvec{x}'\in \mathcal {X}\):
$$\begin{aligned} \Vert \xi _t(\varvec{x}) - \xi _t(\varvec{x}')\Vert _{\psi _2}&\le \Vert \langle \varvec{a}, T\varvec{x}-T\varvec{x}'\rangle \Vert _{\psi _2} + \Vert f(\varvec{a}\circ \varvec{x})-f(\varvec{a}\circ \varvec{x}')\Vert _{\psi _2}. \end{aligned}$$Clearly, we have that \(\Vert \langle \varvec{a}, T\varvec{x}-T\varvec{x}'\rangle \Vert _{\psi _2}\lesssim L\cdot d_T(\varvec{x},\varvec{x}')\) and
$$\begin{aligned} \Vert f(\varvec{a}\circ \varvec{x})-f(\varvec{a}\circ \varvec{x}')\Vert _{\psi _2} = \Big \Vert \sum _{j=1}^p\big (f_j(a_jx_j)-f_j(a_jx_j')\big )\Big \Vert _{\psi _2}. \end{aligned}$$Let \(j\in [p]\). Using that \(f_j\) is \(\gamma \)-Lipschitz, we obtain
$$\begin{aligned} \Vert f_j(a_jx_j)-f_j(a_jx_j')\Vert _{\psi _2} \lesssim \gamma \cdot \Vert a_jx_j-a_jx_j'\Vert _{\psi _2} \le L\cdot \gamma \cdot |x_j-x_j'|. \end{aligned}$$Furthermore, since \(a_j\) is a symmetric random variable and \(f_j\) is odd, it holds \(\mathbb {E}[f_j(a_jx)]=0\) for every \(x\in \mathbb {R}\). In particular, the random variable \(f_j(a_jx_j)-f_j(a_jx_j')\) is centered. Since the random variables \(a_1,\ldots , a_p\) are independent, Hoeffding’s inequality yields
$$\begin{aligned} \Big \Vert \sum _{j=1}^p \big (f_j(a_jx_j)-f_j(a_jx_j')\big ) \Big \Vert _{\psi _2}^2\lesssim & {} \sum _{j=1}^p \Vert f_j(a_jx_j)-f_j(a_jx_j') \Vert _{\psi _2}^2 \\ {}\lesssim & {} L^2 \cdot \gamma ^2 \cdot \big \Vert \varvec{x}-\varvec{x}'\big \Vert _{2}^2. \end{aligned}$$Therefore,
$$\begin{aligned} \Vert \xi _t(\varvec{x}) - \xi _t(\varvec{x}')\Vert _{\psi _2}&\lesssim L\cdot d_T(\varvec{x},\varvec{x}') + L\cdot \gamma \cdot \Vert \varvec{x}-\varvec{x}'\Vert _{2}. \end{aligned}$$We now show that \(d_T(\varvec{x},\varvec{x}')\ge \alpha \Vert \varvec{x}-\varvec{x}'\Vert _{2}\). For \(j\in [p]\), we have that
$$\begin{aligned} |(T\varvec{x})_j - (T\varvec{x}')_j|&= |\mathbb {E}[(f_j(a_jx_j)-f_j(a_jx_j'))a_j]|\\&= |\mathbb {E}[(f_j(a_jx_j)-f_j(a_jx_j'))a_j\chi _{[0,\infty )}(a_j)] \\&\quad + \mathbb {E}[(f_j(a_jx_j)-f_j(a_jx_j'))a_j\chi _{(-\infty ,0)}(a_j)]|. \end{aligned}$$We may assume that \(x_j\ge x_j'\). If \(a_j\ge 0\), then \(a_jx_j\ge a_jx_j'\), which implies \(f_j(a_jx_j)-f_j(a_jx_j')\ge \alpha (a_jx_j-a_jx_j')\). Therefore, \((f_j(a_jx_j)-f_j(a_jx_j'))a_j\ge \alpha (a_jx_j-a_jx_j')a_j\), which implies
$$\begin{aligned} \mathbb {E}[(f_j(a_jx_j)-f_j(a_jx_j'))a_j\chi _{[0,\infty )}(a_j)]&\ge \mathbb {E}[\alpha (a_jx_j-a_jx_j')a_j\chi _{[0,\infty )}(a_j)]\\&=\alpha (x_j-x_j')\mathbb {E}[a_j^2\chi _{[0,\infty )}(a_j)] \ge 0. \end{aligned}$$If \(a_j<0\), then \(a_jx_j\le a_jx_j'\), which implies \(f_j(a_jx_j')-f_j(a_jx_j)\ge \alpha (a_jx_j'-a_jx_j)\). Therefore, \(f_j(a_jx_j)-f_j(a_jx_j')\le \alpha (a_jx_j-a_jx_j')\), which implies \((f_j(a_jx_j)-f_j(a_jx_j'))a_j\ge \alpha (a_jx_j-a_jx_j')a_j\). This shows
$$\begin{aligned} \mathbb {E}[(f_j(a_jx_j)-f_j(a_jx_j'))a_j\chi _{(-\infty ,0)}(a_j)]&\ge \mathbb {E}[\alpha (a_jx_j-a_jx_j')a_j\chi _{(-\infty ,0)}(a_j)]\\&=\alpha (x_j-x_j')\mathbb {E}[a_j^2\chi _{(-\infty ,0)}(a_j)] \ge 0. \end{aligned}$$It follows that \(|(T\varvec{x})_j - (T\varvec{x}')_j|\ge \alpha (x_j-x_j')=\alpha |x_j-x_j'|\), and therefore \(d_T(\varvec{x},\varvec{x}')\ge \alpha \Vert \varvec{x}-\varvec{x}'\Vert _{2}\). Due to this inequality, we obtain
$$\begin{aligned} \Vert \xi _t(\varvec{x}) - \xi _t(\varvec{x}')\Vert _{\psi _2}&\lesssim L\cdot d_T(\varvec{x},\varvec{x}') + L\cdot \gamma \cdot \Vert \varvec{x}-\varvec{x}'\Vert _{2}\\&\le L\cdot d_T(\varvec{x},\varvec{x}') + L\cdot \tfrac{\gamma }{\alpha }\cdot d_T(\varvec{x},\varvec{x}')\\&\le 2\cdot L\cdot \tfrac{\gamma }{\alpha }\cdot d_T(\varvec{x},\varvec{x}'), \end{aligned}$$where the last inequality follows from \(\alpha \le \gamma \). Now, let \(\varvec{x}\in \mathcal {X}\). It holds that
$$\begin{aligned} \Vert \xi _t(\varvec{x})\Vert _{\psi _2}&=\Vert \langle \varvec{a}, T\varvec{x}\rangle -f(\varvec{a}\circ \varvec{x})\Vert _{\psi _2} \le \Vert \langle \varvec{a}, T\varvec{x}\rangle \Vert _{\psi _2} + \Vert f(\varvec{a}\circ \varvec{x})\Vert _{\psi _2}\\*&\lesssim L\cdot \Vert T\varvec{x}\Vert _{2} + \Vert f(\varvec{a}\circ \varvec{x})\Vert _{\psi _2} \le L\cdot \beta _2\cdot R+ \Vert f(\varvec{a}\circ \varvec{x})\Vert _{\psi _2}. \end{aligned}$$Since \(f_j\) is \(\gamma \)-Lipschitz and odd for every \(j\in [p]\), we obtain
$$\begin{aligned} \Vert f_j(a_jx_j)\Vert _{\psi _2}=\Vert f_j(a_jx_j)-f_j(0)\Vert _{\psi _2}\lesssim \gamma \cdot \Vert a_jx_j\Vert _{\psi _2}\le L\cdot \gamma \cdot |x_j|. \end{aligned}$$Using Hoeffding’s inequality, we conclude that
$$\begin{aligned} \Vert f(\varvec{a}\circ \varvec{x})\Vert _{\psi _2}=\Vert \sum _{j=1}^p f_j(a_jx_j)\Vert _{\psi _2} \lesssim \Big ( \sum _{j=1}^p\Vert f_j(a_jx_j)\Vert _{\psi _2}^2\Big )^{1/2} \le L\cdot \gamma \cdot \Vert \varvec{x}\Vert _{2}. \end{aligned}$$Hence, Assumption 2(b) is satisfied with \(L_{t}\lesssim L\cdot \tfrac{\gamma }{\alpha }\) and \(r\lesssim L(\beta _2+\gamma )R\).
\(\square \)
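Both expectation estimates used in this proof can be checked numerically for a concrete nonlinearity. The sketch below is purely illustrative and not part of the argument: it uses the hypothetical choice \(f(t) = t + 0.3\tanh (t)\), which is odd with derivative in (1, 1.3], so that one may take \(\alpha = \beta _1 = 1\) and \(\gamma = \beta _2 = 1.3\), and it evaluates the expectations for \(a \sim \mathsf {N}(0,1)\) by Gauss–Hermite quadrature.

```python
import numpy as np

# Illustrative sanity check (not from the paper): f(t) = t + 0.3*tanh(t) is
# odd with f'(t) in (1, 1.3], so alpha = beta1 = 1 and gamma = beta2 = 1.3.

def gauss_expectation(g, n=80):
    """E[g(a)] for a ~ N(0,1) via Gauss-Hermite quadrature (probabilists')."""
    nodes, weights = np.polynomial.hermite_e.hermegauss(n)
    return weights @ g(nodes) / np.sqrt(2.0 * np.pi)

f = lambda t: t + 0.3 * np.tanh(t)
alpha, beta1, beta2 = 1.0, 1.0, 1.3

# (Tx)_j = E[f(a*x)*a] lies in [beta1*x, beta2*x] for x >= 0 ...
for x in [0.1, 0.5, 1.0, 2.5]:
    Tx = gauss_expectation(lambda a: f(a * x) * a)
    assert beta1 * x - 1e-9 <= Tx <= beta2 * x + 1e-9

# ... and |E[(f(a*x) - f(a*x'))*a]| >= alpha*|x - x'| for all x, x'.
for x, xp in [(1.0, 0.2), (0.5, -0.5), (-0.3, -1.1)]:
    diff = gauss_expectation(lambda a: (f(a * x) - f(a * xp)) * a)
    assert abs(diff) >= alpha * abs(x - xp) - 1e-9
```

Replacing \(f\) by any other odd function whose derivative stays between \(\alpha \) and \(\gamma \) leaves both assertions intact.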
8.5 Proofs for Sect. 4.5
Proof of Corollary 7
We follow the proof template from the beginning of Sect. 8:
Step 1a. The model setup of Corollary 7 fits into Assumption 1 as follows:
The measurement vector \(\varvec{a}\in \mathbb {R}^p\) is centered, isotropic, and has independent sub-Gaussian coordinates with \(\max _{j \in [p]}\Vert a_j\Vert _{\psi _2} \le L\); in particular, we have that \(\Vert \varvec{a}\Vert _{\psi _2}\lesssim L\) (e.g., see [64, Lem. 3.4.2]). The signal set \(\mathcal {X}\) is an arbitrary subset of \(\{\mathcal {S}\subset [p] \ :|\mathcal {S}|\le s\}\). The output function \(F:\mathbb {R}^p \times \mathcal {X}\rightarrow \mathbb {R}\) takes the form \(F(\varvec{a}, \mathcal {S}) {:=}f(\varvec{a}_{\mathcal {S}})\), where \((\varvec{a}_{\mathcal {S}})_j = a_j\) for \(j\in \mathcal {S}\) and \((\varvec{a}_{\mathcal {S}})_j =0\) for \(j\in \mathcal {S}^c\).
The target function \(T:\mathcal {X}\rightarrow K\) is defined by \(T\mathcal {S}{:=}\mathbb {E}[f(\varvec{a}_{\mathcal {S}})\varvec{a}]\).
Step 1b. The target mismatch \(\rho _{}(\mathcal {S})\) vanishes for every \(\mathcal {S}\in \mathcal {X}\).
Step 1c. For the trivial choice \(\tilde{y}_t(\mathcal {S}){:=}\tilde{y}(\mathcal {S})\), the conditions of Assumption 2 are fulfilled with
$$\begin{aligned} L_{t}\le L+ \gamma \alpha ^{-1},\quad \hat{L}_{t}=0,\quad r\le L\beta + \kappa ,\quad \hat{r}=0. \end{aligned}$$
Step 2. Setting \(m_0{:=}0\), \(\varDelta {:=}L^{-1} \sqrt{\log L}\), and \(u{:=}u_0\), the claim of Corollary 7 follows directly from Theorem 2.
Step 3. This is clear by the definition of the target function.
Step 4. We simply set \(\tilde{y}_t(\mathcal {S}){:=}\tilde{y}(\mathcal {S})\). Then, \(\varepsilon _t(\mathcal {S}) = 0\), implying that Assumption 2(a) and (c) are trivially fulfilled with \(\hat{L}_{t}=\hat{r}=0\). To see that Assumption 2(b) holds as well, we first recall that the coordinates of \(\varvec{a}\) are centered and independent, so that
$$\begin{aligned} (T\mathcal {S})_j=\mathbb {E}[f(\varvec{a}_{\mathcal {S}})a_j]= \mathbb {E}[f(\varvec{a}_{\mathcal {S}})]\cdot \mathbb {E}[a_j]=0 \qquad \text {for all } \mathcal {S}\in \mathcal {X}\text { and } j\in \mathcal {S}^c. \end{aligned}$$Together with the lower bound in (4.14), it follows that \({{\,\mathrm{supp}\,}}(T\mathcal {S}) = \mathcal {S}\) for all \(\mathcal {S}\in \mathcal {X}\). Using the lower bound in (4.14) again, we obtain the following estimate for all \(\mathcal {S}, \mathcal {S}'\in \mathcal {X}\):
$$\begin{aligned} d_T(\mathcal {S},\mathcal {S}')= \Vert T\mathcal {S}-T\mathcal {S}'\Vert _{2}\ge \Big (\sum _{j\in \mathcal {S}\setminus \mathcal {S}'} (T\mathcal {S})_j^2 + \sum _{j\in \mathcal {S}'\setminus \mathcal {S}} (T\mathcal {S}')_j^2 \Big )^{1/2}\ge \tfrac{\alpha }{\sqrt{s}}\sqrt{|\mathcal {S}\bigtriangleup \mathcal {S}'|}. \end{aligned}$$Combining this with the assumption (4.15), it follows that
$$\begin{aligned} \Vert \xi _t(\mathcal {S}) - \xi _t(\mathcal {S}')\Vert _{\psi _2}&\le \Vert \langle \varvec{a}, T\mathcal {S}- T\mathcal {S}'\rangle \Vert _{\psi _2} + \Vert f(\varvec{a}_{\mathcal {S}}) - f(\varvec{a}_{\mathcal {S}'})\Vert _{\psi _2}\\&\le L\cdot d_T(\mathcal {S},\mathcal {S}') + \gamma \alpha ^{-1} \cdot d_T(\mathcal {S},\mathcal {S}') = (L+ \gamma \alpha ^{-1}) \cdot d_T(\mathcal {S},\mathcal {S}'). \end{aligned}$$This implies \(L_{t}\le L+ \gamma \alpha ^{-1}\). Since \({{\,\mathrm{supp}\,}}(T\mathcal {S}) = \mathcal {S}\) and \(|\mathcal {S}|\le s\) for \(\mathcal {S}\in \mathcal {X}\), we also have that \(|{{\,\mathrm{supp}\,}}(T\mathcal {S})|\le s\). The upper bound in (4.14) therefore yields \(\Vert T\mathcal {S}\Vert _{2}\le \beta \) for every \(\mathcal {S}\in \mathcal {X}\). Combining this estimate with (4.15), we obtain the following upper bound for the sub-Gaussian norm of \(\xi _t(\mathcal {S})\):
$$\begin{aligned} \Vert \xi _t(\mathcal {S})\Vert _{\psi _2}=\Vert \langle \varvec{a}, T\mathcal {S}\rangle - f(\varvec{a}_{\mathcal {S}}) \Vert _{\psi _2} \le L\Vert T\mathcal {S}\Vert _{2} + \Vert f(\varvec{a}_{\mathcal {S}})\Vert _{\psi _2} \le L\beta + \kappa . \end{aligned}$$Hence, \(r{:=}L\beta + \kappa \) is a valid choice for Assumption 2(b).
\(\square \)
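The separation estimate \(d_T(\mathcal {S},\mathcal {S}')\ge \tfrac{\alpha }{\sqrt{s}}\sqrt{|\mathcal {S}\bigtriangleup \mathcal {S}'|}\) only uses that \(T\mathcal {S}\) is supported on \(\mathcal {S}\) with entries of magnitude at least \(\alpha /\sqrt{s}\). The following sketch checks this with a hypothetical stand-in for \(T\) (random magnitudes above the threshold, mimicking the lower bound in (4.14)); it is an illustration, not the paper's construction.

```python
import numpy as np

rng = np.random.default_rng(0)
p, s, alpha = 30, 5, 0.8

def embed(S):
    # Hypothetical target vector: supported on S with entries of magnitude
    # at least alpha/sqrt(s), mimicking the lower bound in (4.14).
    v = np.zeros(p)
    idx = np.fromiter(S, dtype=int)
    v[idx] = alpha / np.sqrt(s) * (1.0 + rng.random(len(S)))
    return v

for _ in range(100):
    S = set(rng.choice(p, size=s, replace=False).tolist())
    Sp = set(rng.choice(p, size=s, replace=False).tolist())
    d = np.linalg.norm(embed(S) - embed(Sp))
    # separation in terms of the symmetric difference |S (triangle) S'|
    assert d >= alpha / np.sqrt(s) * np.sqrt(len(S ^ Sp)) - 1e-12
```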
9 Proofs for Sect. 5
Proof of Theorem 3
Let \(\mathcal {X}_\varepsilon \) be a minimal subset of \(\mathcal {X}\) such that \(T\mathcal {X}_\varepsilon \) is a minimal \(\varepsilon \)-net for \(T\mathcal {X}\); in particular, we have that \(|\mathcal {X}_\varepsilon | = |T\mathcal {X}_\varepsilon |=\mathcal {N}(T\mathcal {X}, \varepsilon )\). The assumptions of (5.2) and (5.3) allow us to first apply Theorem 2 to every \(\varvec{x}' \in \mathcal {X}_\varepsilon \) as a singleton signal set (Assumption 2 is trivially fulfilled here), and then to take a union bound over \(\mathcal {X}_\varepsilon \). Consequently, there exist universal constants \(c, C > 0\) such that the following event \({\mathcal {A}}_1\) holds with probability at least \(1 - \exp (-c u^2) - \exp (-c u_0^2)\): For every \(\varvec{x}' \in \mathcal {X}_\varepsilon \) with \(\rho _{}(\varvec{x}') \le \tfrac{t}{32}\) and every \(\varvec{y}\in \mathbb {R}^m\) such that
any minimizer \(\hat{\varvec{z}}\) of (\({\mathsf {P}}_{K,\varvec{y}}\)) satisfies \(\Vert \hat{\varvec{z}}- T\varvec{x}'\Vert _{2} \le t\). Note that we have applied the robustness conditions of (2.4) for \(2m_0\) instead of \(m_0\) here, which is possible due to \(2m_0\le m\).
Let \({\mathcal {A}}_2\) denote the event of (5.1) in Assumption 3. We now show that on the event \({\mathcal {A}}_1\cap {\mathcal {A}}_2\), which occurs with probability at least \(1 - \exp (-c u^2) - \exp (-c u_0^2)-\eta \), the conclusion of Theorem 3 holds: Let \(\varvec{x}\in \mathcal {X}\) be fixed and consider any input vector \(\varvec{y}\in \mathbb {R}^m\) satisfying
By the definition of \(\mathcal {X}_\varepsilon \), there exists \(\varvec{x}'\in \mathcal {X}_\varepsilon \) such that \(\Vert T\varvec{x}-T\varvec{x}'\Vert _{2}\le \varepsilon \). According to the event \({\mathcal {A}}_2\), we conclude that
and
Finally, according to the event \({\mathcal {A}}_1\), every minimizer \(\hat{\varvec{z}}\) of (\({\mathsf {P}}_{K,\varvec{y}}\)) satisfies \(\Vert \hat{\varvec{z}}- T\varvec{x}'\Vert _{2} \le t\) and therefore \(\Vert \hat{\varvec{z}}- T\varvec{x}\Vert _{2}\le \Vert \hat{\varvec{z}}- T\varvec{x}'\Vert _{2} + \Vert T\varvec{x}' - T\varvec{x}\Vert _{2}\le t + \varepsilon \). \(\square \)
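The covering argument at the start of the proof can be illustrated by a greedy net construction. The sketch below is generic (a random point cloud on the sphere rather than a specific \(T\mathcal {X}\)) and only demonstrates that every point of the cloud ends up within \(\varepsilon \) of the net.

```python
import numpy as np

# Hypothetical helper (not from the paper): a greedy epsilon-net of a finite
# point cloud, as used conceptually for T(X) in the proof of Theorem 3.
def greedy_net(points, eps):
    net = []
    for x in points:
        if all(np.linalg.norm(x - v) > eps for v in net):
            net.append(x)
    return np.array(net)

rng = np.random.default_rng(1)
cloud = rng.standard_normal((500, 8))
cloud /= np.linalg.norm(cloud, axis=1, keepdims=True)  # points on the sphere
eps = 0.5
net = greedy_net(cloud, eps)

# every point is within eps of some net point, by construction
dists = np.linalg.norm(cloud[:, None, :] - net[None, :, :], axis=2).min(axis=1)
assert dists.max() <= eps + 1e-12
```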
Proof of Corollary 8
Analogously to Corollary 2, the model setup of Corollary 8 fits into Assumption 1 as follows:
We have that \(\varvec{a}\sim \mathsf {N}(\varvec{0}, \varvec{I}_{p})\) and therefore \(\Vert \varvec{a}\Vert _{\psi _2} \lesssim 1\). The signal set \(\mathcal {X}\) is an arbitrary subset of \(\mathbb {R}^p\). The output function \(F:\mathbb {R}^p \times \mathcal {X}\rightarrow \mathbb {R}\) takes the form \(F(\varvec{a}, \varvec{x}) {:=}{{\,\mathrm{sign}\,}}(\langle \varvec{a}, \varvec{x}\rangle )\). In particular, the resulting observation vector of \(\varvec{x}\) is given by \(\tilde{\varvec{y}}(\varvec{x}) = {{\,\mathrm{sign}\,}}(\varvec{A}\varvec{x})\) where \(\varvec{A}\in \mathbb {R}^{m\times p}\) is a standard Gaussian random matrix with row vectors \(\varvec{a}_1, \dots , \varvec{a}_m\in \mathbb {R}^p\).
The target function \(T:\mathcal {X}\rightarrow K\) corresponds to the (scaled) normalization \(T\varvec{x}{:=}\sqrt{\tfrac{2}{\pi }}\tfrac{\varvec{x}}{\Vert \varvec{x}\Vert _{2}}\).
As already shown in Sect. 8.1 (Step 3), the target mismatch \(\rho _{}(\varvec{x})\) vanishes for every \(\varvec{x}\in \mathcal {X}\). Moreover, we clearly have that \(r={\mathrm{sup}}_{\varvec{x}\in \mathcal {X}}\Vert \langle \varvec{a}, T\varvec{x}\rangle - \tilde{y}(\varvec{x})\Vert _{\psi _2}\lesssim 1\). Since \(\phi (\beta ) {:=}\beta \sqrt{\log (e/\beta )}\) defines a continuous and non-decreasing function on [0, 1] with \(\phi (1)=1\) and \(\phi (0){:=}0\) (by continuous extension), we may assume without loss of generality that
Now, we set
implying that \(u_0 \ge \sqrt{2m_0\log (em/2m_0)}\), \(m_0\in \{0, 1, \dots , \lfloor \tfrac{m}{2}\rfloor \}\), and \(u_0^2\ge 2c_0t m\). The latter inequality again implies that \(u_0^2 \ge u^2\), and since \(u^2\ge C_0 \cdot \log \mathcal {N}(T\mathcal {X}, \varepsilon )\), the condition of (5.3) is fulfilled. Similarly, it is not hard to see that the condition of (5.2) follows from (5.4) for \(C'\) sufficiently large (cf. Sect. 8.1 (Step 2)).
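The normalization via \(\phi (\beta ) = \beta \sqrt{\log (e/\beta )}\) used above relies on \(\phi \) being continuous and non-decreasing on (0, 1] with \(\phi (1)=1\); indeed, \(\phi '(\beta ) = \sqrt{\log (e/\beta )} - \tfrac{1}{2\sqrt{\log (e/\beta )}}\), which is non-negative exactly when \(\log (e/\beta )\ge \tfrac{1}{2}\), i.e., for \(\beta \le \sqrt{e}\). A quick numerical confirmation:

```python
import numpy as np

# Check that phi(beta) = beta * sqrt(log(e/beta)) is non-decreasing on (0, 1]
# with phi(1) = 1, as used to normalize beta in the proof.
beta = np.linspace(1e-6, 1.0, 10000)
phi = beta * np.sqrt(np.log(np.e / beta))
assert np.all(np.diff(phi) >= -1e-12)   # monotone on the grid
assert abs(phi[-1] - 1.0) < 1e-12       # phi(1) = 1
```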
According to Theorem 4 with \(H{:=}\sqrt{\tfrac{\pi }{2}}T\mathcal {X}\subset \mathbb {S}^{p-1}\), there exist universal constants \(c, {\bar{c}}, C > 0\) (possibly slightly different from those in Theorem 4) such that if
for \(\varepsilon \le {\bar{c}}\beta / \sqrt{\log (e/\beta )}\), the following holds with probability at least \(1-\exp (-c m \beta )\):
On this event and by the above choice of \(m_0\), we have that
and
where the last inequality holds for \(c_0>0\) small enough. Consequently, Assumption 3 would hold for \(\eta {:=}\exp (-c m \beta )\) if we can show that (9.2) is satisfied under the hypotheses of Corollary 8. To this end, we first note that the relationship (9.1) implies that there exists a universal constant \(c_0'>0\) such that \(\beta \ge c_0't/\sqrt{\log (e/t)}\). This particularly leads to the following estimates:
Combining the first one with (5.4), it follows that \(m \beta \ge c_0' m t^2 \gtrsim u^2\), while the second one yields
if \(c'\) is chosen sufficiently small. Hence, (9.2) is a consequence of (5.4) and the assumption \(u^2\ge C_0\cdot \log \mathcal {N}(T\mathcal {X}, \varepsilon )\).
Since all assumptions of Theorem 3 are satisfied, the following holds with probability at least \(1 - \exp (-c u^2)\) uniformly for every \(\varvec{x}\in \mathcal {X}\): For any input vector \(\varvec{y}\in \mathbb {R}^m\) such that
every minimizer \(\hat{\varvec{z}}\) of (\({\mathsf {P}}_{K,\varvec{y}}\)) satisfies \(\Vert \hat{\varvec{z}}- T\varvec{x}\Vert _{2} \le t + \varepsilon \le 2t\). The claim of Corollary 8 now follows from the fact that any input vector \(\varvec{y}\in \{-1,1\}^m\) given by (4.1) with \(\tfrac{1}{2m}\Vert \varvec{\nu }\Vert _{1} \le \tfrac{\beta }{2}\) satisfies (9.3). \(\square \)
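As an illustration of Corollary 8, the following simulation (with hypothetical parameters \(m=4000\), \(p=10\); not from the paper) generates one-bit observations \(\varvec{y}= {{\,\mathrm{sign}\,}}(\varvec{A}\varvec{x})\) and solves (\({\mathsf {P}}_{K,\varvec{y}}\)) with \(K=\mathbb {B}_2^p\) by projected gradient descent. The minimizer should approximate \(T\varvec{x}= \sqrt{2/\pi }\,\varvec{x}/\Vert \varvec{x}\Vert _{2}\), direction and norm included.

```python
import numpy as np

# Hedged one-bit compressed sensing simulation (illustrative parameters).
rng = np.random.default_rng(0)
m, p = 4000, 10
x = rng.standard_normal(p)
A = rng.standard_normal((m, p))        # standard Gaussian measurements
y = np.sign(A @ x)                     # one-bit observations

target = np.sqrt(2.0 / np.pi) * x / np.linalg.norm(x)

# projected gradient descent for min ||y - A z||_2 over the unit ball K
z = np.zeros(p)
step = 1.0 / np.linalg.norm(A, 2) ** 2
for _ in range(500):
    z = z + step * (A.T @ (y - A @ z))
    nz = np.linalg.norm(z)
    if nz > 1.0:                       # project back onto K = unit ball
        z /= nz

err = np.linalg.norm(z - target)
assert err < 0.3, err
```

With growing \(m\), the error decays further, in line with the uniform guarantee.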
Proof of Corollary 9
Analogously to Corollary 3, the model setup of Corollary 9 fits into Assumption 1 as follows:
The measurement vector \(\varvec{a}\in \mathbb {R}^p\) is centered, isotropic, and sub-Gaussian with \(\Vert \varvec{a}\Vert _{\psi _2} \le L\). The signal set \(\mathcal {X}\) satisfies \(\mathcal {X}\subset R \mathbb {B}_2^p\). The output function \(F:\mathbb {R}^p \times \mathcal {X}\rightarrow \mathbb {R}\) takes the form \(F(\varvec{a}, \varvec{x})={{\,\mathrm{sign}\,}}(\langle \varvec{a}, \varvec{x}\rangle +\tau )\), where \(\tau \) is uniformly distributed on \([-\lambda , \lambda ]\) and independent of \(\varvec{a}\). In particular, \(F\) is a random function. Moreover, the observation vector of \(\varvec{x}\) is given by \(\tilde{\varvec{y}}(\varvec{x}) = {{\,\mathrm{sign}\,}}(\varvec{A}\varvec{x}+\varvec{\tau })\), where \(\varvec{A}\in \mathbb {R}^{m\times p}\) denotes the sub-Gaussian random matrix with row vectors \(\varvec{a}_1, \dots , \varvec{a}_m\in \mathbb {R}^p\) and \(\varvec{\tau }{:=}(\tau _1, \dots , \tau _m)\).
The target function \(T:\mathcal {X}\rightarrow K\) corresponds to rescaling by a factor of \(\lambda ^{-1}\), i.e., \(T\varvec{x}{:=}\lambda ^{-1}\varvec{x}\).
As already shown in Sect. 8.2 (Step 3), there exists a universal constant \({\tilde{C}}'\ge e\) such that if
the target mismatch satisfies \(\rho _{}(\varvec{x})\le \tfrac{t}{32}\) for every \(\varvec{x}\in \mathcal {X}\). Moreover, we clearly have that
for every \(\varvec{x}\in \mathcal {X}\), and together with (9.4), it follows that \(r={\mathrm{sup}}_{\varvec{x}\in \mathcal {X}}\Vert \langle \varvec{a}, \lambda ^{-1}\varvec{x}\rangle - \tilde{y}(\varvec{x})\Vert _{\psi _2}\lesssim 1\). Since \(\phi (\beta ) {:=}\beta \sqrt{\log (e/\beta )}\) defines a continuous and non-decreasing function on [0, 1] with \(\phi (1)=1\) and \(\phi (0){:=}0\) (by continuous extension), we may assume without loss of generality that
Now, we set
implying that \(u_0 \ge \sqrt{2m_0\log (em/2m_0)}\), \(m_0\in \{0, 1, \dots , \lfloor \tfrac{m}{2}\rfloor \}\), and \(u_0^2\ge 2c_0mL^{-2} t\). Combining the latter inequality with (5.5) for \(C' \gtrsim L^2\) sufficiently large implies that \(u_0^2 \ge u^2\), and since \(u^2\ge C_0 \cdot \log \mathcal {N}(\lambda ^{-1}\mathcal {X}, \varepsilon )\), the condition of (5.3) is fulfilled. Similarly, it is not hard to see that the condition of (5.2) follows from (5.5) for \(C' \gtrsim L^2 \log L\) sufficiently large (cf. Sect. 8.2 (Step 2)).
According to Theorem 5, there exist constants \(c, {\bar{c}}, C, {\tilde{C}} > 0\) only depending on \(L\) (possibly slightly different from those in Theorem 5) such that if \(\lambda \ge {\tilde{C}}\cdot R\) and
for \(\varepsilon \le {\bar{c}}\beta / \sqrt{\log (e/\beta )}\), the following holds with probability at least \(1-\exp (-c m\beta )\):
On this event and by the above choice of \(m_0\), we have that
and
where the last inequality holds for \(c_0>0\) small enough. Consequently, Assumption 3 would hold for \(\eta {:=}\exp (-c m \beta )\) if we can show that (9.6) is satisfied under the hypotheses of Corollary 9. To this end, we first note that the relationship (9.5) implies that there exists a universal constant \(c_0'>0\) such that \(\beta \ge c_0'tL^{-2}/\sqrt{\log (eL^2/t)}\). This particularly leads to the following estimates:
Combining the first one with (5.5), it follows that \(m \beta \ge c_0' m t^2 L^{-2} \gtrsim u^2\) for \(C' \gtrsim L^2\) sufficiently large, while the second one yields
if \(c'\) is chosen sufficiently small (depending on \(L\)). Hence, (9.6) is a consequence of (5.5) and the assumption \(u^2\ge C_0\cdot \log \mathcal {N}(\lambda ^{-1}\mathcal {X}, \varepsilon )\).
Since all assumptions of Theorem 3 are satisfied, the following holds with probability at least \(1 - \exp (-c u^2)\) uniformly for every \(\varvec{x}\in \mathcal {X}\): For any input vector \(\varvec{y}\in \mathbb {R}^m\) such that
every minimizer \(\hat{\varvec{z}}\) of (\({\mathsf {P}}_{K,\varvec{y}}\)) satisfies \(\Vert \hat{\varvec{z}}- \lambda ^{-1}\varvec{x}\Vert _{2} \le t + \varepsilon \le 2t\). The claim of Corollary 9 now follows from the fact that any input vector \(\varvec{y}\in \{-1,1\}^m\) given by (4.3) with \(\tfrac{1}{2m}\Vert \varvec{\nu }\Vert _{1} \le \tfrac{\beta }{2}\) satisfies (9.7). \(\square \)
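Analogously, the dithered model of Corollary 9 can be simulated; again, the parameters (\(m=6000\), \(p=10\), \(\lambda = 4\Vert \varvec{x}\Vert _{2}\)) are hypothetical choices for illustration. With the dither active, constrained least squares recovers the rescaled target \(\lambda ^{-1}\varvec{x}\), including its norm:

```python
import numpy as np

# Hedged dithered one-bit simulation (illustrative parameters, not from the
# paper): y = sign(A x + tau) with tau uniform on [-lam, lam], lam >> ||x||_2.
rng = np.random.default_rng(0)
m, p = 6000, 10
x = rng.standard_normal(p)
lam = 4.0 * np.linalg.norm(x)
A = rng.standard_normal((m, p))
tau = rng.uniform(-lam, lam, size=m)
y = np.sign(A @ x + tau)

# projected gradient descent for min ||y - A z||_2 over K = unit ball,
# which contains the target x/lam (norm 0.25 here)
z = np.zeros(p)
step = 1.0 / np.linalg.norm(A, 2) ** 2
for _ in range(500):
    z = z + step * (A.T @ (y - A @ z))
    nz = np.linalg.norm(z)
    if nz > 1.0:
        z /= nz

err = np.linalg.norm(z - x / lam)
assert err < 0.15, err
```

In contrast to the undithered setting, the norm of \(\varvec{x}\) is recovered here, not just its direction.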
10 Proofs for Sect. 3
Proof of Corollary 1
We follow the proof template from the beginning of Sect. 8:
Step 1a. The model setup of Corollary 1 fits into Assumption 1 as follows:
The measurement vector \(\varvec{a}\in \mathbb {R}^p\) is centered, isotropic, and sub-Gaussian with \(\Vert \varvec{a}\Vert _{\psi _2} \le L\). The signal set \(\mathcal {X}\) is a bounded subset of \(\mathbb {R}^p\). The output function \(F:\mathbb {R}^p \times \mathcal {X}\rightarrow \mathbb {R}\) takes the form \(F(\varvec{a}, \varvec{x}) {:=}\langle \varvec{a}, \varvec{x}\rangle + \tau \), where \(\tau \sim \mathsf {N}(0, \sigma ^2)\).
The target function \(T:\mathcal {X}\rightarrow K\) is the canonical embedding into \(K\), i.e., \(T\varvec{x}{:=}\varvec{x}\). In particular, we have that \(d_T(\varvec{x},\varvec{x}')= \Vert \varvec{x}- \varvec{x}'\Vert _{2}\).
Step 1b. The target mismatch \(\rho _{}(\varvec{x})\) vanishes for every \(\varvec{x}\in \mathcal {X}\).
-
Step 1c. For the trivial choice \(\tilde{y}_t(\varvec{x}){:=}\tilde{y}(\varvec{x})\), the conditions of Assumption 2 are fulfilled with
$$\begin{aligned} L_{t}= 0,\quad \hat{L}_{t}= 0,\quad r\lesssim \sigma , \quad \hat{r}= 0. \end{aligned}$$
Step 2. Setting \(m_0{:=}0\), \(\varDelta {:=}L^{-1} \sqrt{\log L}\), and \(u{:=}u_0\), the claim of Corollary 1 follows directly from Theorem 2.
Step 3. Let \(\varvec{x}\in \mathcal {X}\). By the isotropy of \(\varvec{a}\), we have that \(\mathbb {E}[\tilde{y}(\varvec{x})\varvec{a}] = \mathbb {E}[\langle \varvec{a}, \varvec{x}\rangle \varvec{a}]=\varvec{x}= T\varvec{x}\), and therefore \(\rho _{}(\varvec{x})=0\).
Step 4. We simply set \(\tilde{y}_t(\varvec{x}){:=}\tilde{y}(\varvec{x})\). Then, \(\varepsilon _t(\varvec{x}) = 0\), implying that Assumption 2(a) and (c) are trivially fulfilled with \(\hat{L}_{t}=\hat{r}=0\). Furthermore, since \(\xi _t(\varvec{x})= \langle \varvec{a}, \varvec{x}\rangle -\tilde{y}_t(\varvec{x})= -\tau \), Assumption 2(b) holds with \(L_{t}=0\) and \(r{:=}\Vert \tau \Vert _{\psi _2} \lesssim \sigma \), since \(\tau \sim \mathsf {N}(0, \sigma ^2)\).
\(\square \)
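The isotropy identity \(\mathbb {E}[\langle \varvec{a}, \varvec{x}\rangle \varvec{a}]=\varvec{x}\) from Step 3 is easy to confirm empirically; the following minimal check (illustrative, with \(\varvec{a}\sim \mathsf {N}(\varvec{0},\varvec{I}_p)\)) averages \(\langle \varvec{a}_i,\varvec{x}\rangle \varvec{a}_i\) over many samples.

```python
import numpy as np

# Empirical illustration (not from the paper): E[<a, x> a] = x for a
# centered isotropic measurement vector, here a ~ N(0, I_p).
rng = np.random.default_rng(0)
p, n = 5, 200000
a = rng.standard_normal((n, p))
x = np.ones(p)
est = (a * (a @ x)[:, None]).mean(axis=0)
assert np.linalg.norm(est - x) < 0.05
```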
Proof of Inequality (3.3)
Since \(\Vert \varvec{x}\Vert _{1} = R\) for every \(\varvec{x}\in \mathcal {X}\), we may use [54, Lem. 4.5] to observe that
where \({\text {conv}}(\cdot )\) denotes the convex hull. Moreover, [49, Lem. 2.3] yields \(w_{}(H) \lesssim \sqrt{s \cdot \log (2p/s)}\). Therefore, we obtain
as claimed in (3.3). \(\square \)
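The scaling \(w_{}(H)\lesssim \sqrt{s\log (2p/s)}\) can be probed by Monte Carlo for the closely related set of \(s\)-sparse unit vectors, whose mean width equals the expected \(\ell ^2\)-norm of the \(s\) largest (in magnitude) entries of a standard Gaussian vector. The constant 3 below is a generous illustrative choice, not the one from [49].

```python
import numpy as np

# Monte Carlo estimate (illustrative): mean width of the s-sparse unit
# vectors, i.e. E sup <g, x> = E ||top-s entries of g||_2, versus the
# sqrt(s * log(2p/s)) scaling with a generous constant.
rng = np.random.default_rng(0)
p, s, n = 1000, 10, 2000
g = rng.standard_normal((n, p))
top_s = np.sort(np.abs(g), axis=1)[:, -s:]        # s largest magnitudes
width = np.linalg.norm(top_s, axis=1).mean()
assert width <= 3.0 * np.sqrt(s * np.log(2 * p / s))
assert width >= np.sqrt(s)                         # trivial lower bound
```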
Proof of Proposition 1
We begin with writing out the definition of the local mean width as follows:
where \(\varvec{g}\sim \mathsf {N}(\varvec{0}, \varvec{I}_{p})\). Now, let \(\mathcal {X}_t\) be a minimal t-net for \(\mathcal {X}\); in particular, we have that \(|\mathcal {X}_t| = \mathcal {N}(\mathcal {X}, t)\). Hence, for every \(\varvec{z}\in K\) and \(\varvec{x}\in \mathcal {X}\) with \(\Vert \varvec{z}- \varvec{x}\Vert _{2} \le t\), there exists \(\varvec{v}_{\varvec{x}} \in \mathcal {X}_t\) such that \(\Vert \varvec{v}_{\varvec{x}} - \varvec{x}\Vert _{2} \le t\) and \(\Vert \varvec{v}_{\varvec{x}} - \varvec{z}\Vert _{2} \le 2t\). Therefore, we obtain
Next, we make use of the fact that \(\mathcal {X}_t\) is a finite set. Indeed, applying [15, Lem. 6.1], this allows us to “pull out” the union over \(\mathcal {X}_t\) in the following way:
By Sudakov minoration (e.g., see [64, Thm. 7.4.1]), we have that \(t \cdot \sqrt{\log \mathcal {N}(\mathcal {X}, t)} \lesssim w_{}(\mathcal {X})\). Hence, overall, we have that
The “moreover”-part of Proposition 1 follows directly from a stability bound derived in [23, Prop. 2.6]. \(\square \)
Notes
In particular, we will not discuss any details about possible algorithmic implementations of (\({\mathsf {P}}_{K,\varvec{y}}\)), which is an important subject in its own right.
Here, the \(\nu _i\) correspond to adversarial noise, i.e., as long as the \(\ell ^{2}\)-constraint is satisfied, they may be arbitrary (even worst-case) perturbations. This model is very common in uniform recovery and compressed sensing, cf. [21]. Compared to statistical noise, however, it comes at the price of losing consistency, i.e., the reconstruction error need not become arbitrarily small as \(m \rightarrow \infty \); see Remarks 1 and 4 for further discussion.
Note that we actually have that \(L\ge \sqrt{1/\log (2)} > 1\) if \(\varvec{a}\) is isotropic, see [35].
One should bear in mind that the target function \(T\) is a purely theoretical object that does not affect the solution of (\({\mathsf {P}}_{K,\varvec{y}}\)) and may be (partially) unknown in practice. Nevertheless, it is an essential analysis tool for Problem 1, relating the underlying observation model to the actual recovery method (\({\mathsf {P}}_{K,\varvec{y}}\)), see also Remark 2(1).
The use of the modifier ‘\(\sim \)’ in Assumption 1 is also due to this fact and may help to distinguish between our observation model and the actual input for (\({\mathsf {P}}_{K,\varvec{y}}\)).
The condition (2.3) is stated in such a way that it is convenient to handle in our specific applications (see Sect. 4). However, for a basic understanding, the reader may simply set \(u= u_0\), so that the first and second branch in (2.3) can be merged to \(m \gtrsim L^2(\log L+ L^2\varDelta ^2) \cdot \big (w_{t}^2(K- T\mathcal {X})+ u^2 \big )\).
In the case of exact recovery, i.e., \(t = 0\), we follow the convention \(0 \cdot \infty {:=}0\), so that the condition (2.3) requires that \(r= \hat{r}= L_{t}= \hat{L}_{t}= 0\). Then, (2.3) is particularly fulfilled for \(m \ge C \cdot L^2 \log L\cdot (w_{0}^2(K- T\mathcal {X})+ u^2)\), where \(m_0= 0\), \(\varDelta = L^{-1} \sqrt{\log L}\), and \(u= u_0\).
Note that the expression \(O(m^{-1/2})\) suppresses the dependence on \(w_{t}(K- T\mathcal {X})\). Although the latter can be trivially bounded by \(w_{0}(K- T\mathcal {X})\), which is independent of t, such an estimate might not appropriately capture the (low) complexity of \(K\) in certain cases.
A simple example is as follows: For a unit vector \(\varvec{x}\in \mathbb {S}^{p-1}\) to be reconstructed, set \(\nu _i = c \cdot t \cdot \langle \varvec{a}_i, \varvec{x}\rangle \) for a constant \(c > 0\) small enough. If m is sufficiently large, then (3.8) holds with high probability. At the same time, we have that \(y_i = \langle \varvec{a}_i, \varvec{x}\rangle + \nu _i = \langle \varvec{a}_i, (1 + ct)\varvec{x}\rangle \). Hence, Corollary 1 can be also applied such that it certifies exact recovery of the rescaled vector \((1 + ct)\varvec{x}\) via (\({\mathsf {P}}_{K,\varvec{y}}\)). In other words, we have that \(\Vert \hat{\varvec{z}}- \varvec{x}\Vert _{2} = ct \cdot \Vert \varvec{x}\Vert _{2} \asymp t\).
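This example is easy to reproduce numerically; the sketch below uses hypothetical parameters \(c=0.1\), \(t=0.5\) and plain least squares (the unconstrained instance of (\({\mathsf {P}}_{K,\varvec{y}}\))).

```python
import numpy as np

# Illustration of the note above (hypothetical parameters): with noise
# nu_i = c*t*<a_i, x>, least squares recovers the rescaled vector
# (1 + c*t)*x exactly, so the error against x is exactly c*t*||x||_2.
rng = np.random.default_rng(0)
m, p, c, t = 200, 10, 0.1, 0.5
x = rng.standard_normal(p)
x /= np.linalg.norm(x)                  # unit vector to be reconstructed
A = rng.standard_normal((m, p))
y = A @ x + c * t * (A @ x)             # = A @ ((1 + c*t) * x)
zhat, *_ = np.linalg.lstsq(A, y, rcond=None)
assert np.linalg.norm(zhat - (1 + c * t) * x) < 1e-8   # exact (rescaled) fit
assert abs(np.linalg.norm(zhat - x) - c * t) < 1e-8    # error is c*t
```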
Here, it is important to note that, in contrast to \(K\), the (transformed) signal set \(T\mathcal {X}\) may be highly non-convex.
The existence of an efficient projection and the (non-)convexity of \(K\) are not equivalent. There are many examples of efficient projections onto non-convex sets, while the projection onto convex sets can be NP-hard, e.g., see [53].
To be more precise, we assume that the tuples \((\varvec{a}_i, F_i, \tilde{y}_{t,i})\) are independent copies of \((\varvec{a}, F, \tilde{y}_{t})\) for \(i = 1, \dots , m\). In particular, the respective conditions of Assumption 2(a)–(c) are also satisfied for \(\{\tilde{y}_{t,i}(\varvec{x})\}_{\varvec{x}\in \mathcal {X}}\).
To be slightly more precise, we apply Theorem 8 to the index sets \(\mathcal {A}{:=}\mathcal {X}\) and \(\mathcal {B}{:=}K_{\mathcal {X},t}\), while \(X \sim \mathbb {P}\) corresponds to the random tuple \((\varvec{a}, F, \tilde{y}_{t})\).
References
Ai, A., Lapanowski, A., Plan, Y., Vershynin, R.: One-bit compressed sensing with non-Gaussian measurements. Linear Algebra Appl. 441, 222–239 (2014)
Amelunxen, D., Lotz, M., McCoy, M.B., Tropp, J.A.: Living on the edge: phase transitions in convex programs with random data. Inf. Inference 3(3), 224–294 (2014)
Baraniuk, R.G., Foucart, S., Needell, D., Plan, Y., Wootters, M.: Exponential decay of reconstruction error from binary measurements of sparse signals. IEEE Trans. Inf. Theory 63(6), 3368–3385 (2017)
Bhandari, A., Krahmer, F., Raskar, R.: On unlimited sampling and reconstruction. IEEE Trans. Signal Process. 69, 3827–3839 (2020)
Bora, A., Jalal, A., Price, E., Dimakis, A.G.: Compressed sensing using generative models. In: Precup, D., Teh, Y.W. (eds.) Proceedings of the 34th International Conference on Machine Learning (ICML), vol. 70, pp. 537–546 (2017)
Boufounos, P.T., Jacques, L., Krahmer, F., Saab, R.: Quantization and compressive sensing. In: Boche, H., Calderbank, R., Kutyniok, G., Vybiral, J. (eds.) Compressed Sensing and its Applications: MATHEON Workshop 2013, Applied and Numerical Harmonic Analysis, pp. 193–237. Birkhäuser, Cham (2015)
Brillinger, D.R.: A generalized linear model with “Gaussian” regressor variables. In: Bickel, P.J., Doksum, K., Hodges, J. (eds.) A Festschrift For Erich L. Lehmann, pp. 97–114. Chapman and Hall/CRC (1982)
Cai, J.F., Xu, W.: Guarantees of total variation minimization for signal recovery. Inf. Inference 4(4), 328–353 (2015)
Candès, E.J., Eldar, Y.C., Strohmer, T., Voroninski, V.: Phase retrieval via matrix completion. SIAM J. Imaging Sci. 6(1), 199–225 (2013)
Candès, E.J., Romberg, J.K., Tao, T.: Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information. IEEE Trans. Inf. Theory 52(2), 489–509 (2006)
Candès, E.J., Romberg, J.K., Tao, T.: Stable signal recovery from incomplete and inaccurate measurements. Comm. Pure Appl. Math. 59(8), 1207–1223 (2006)
Chandrasekaran, V., Recht, B., Parrilo, P.A., Willsky, A.S.: The convex geometry of linear inverse problems. Found. Comput. Math. 12(6), 805–849 (2012)
Dabeer, O., Karnik, A.: Signal parameter estimation using 1-bit dithered quantization. IEEE Trans. Inf. Theory 52(12), 5389–5405 (2006)
Dirksen, S.: Quantized compressed sensing: a survey. In: Boche, H., Caire, G., Calderbank, R., Kutyniok, G., Mathar, R., Petersen, P. (eds.) Compressed Sensing and Its Applications: Third International MATHEON Conference 2017, Applied and Numerical Harmonic Analysis, pp. 67–95. Birkhäuser, Cham (2019)
Dirksen, S., Genzel, M., Jacques, L., Stollenwerk, A.: The separation capacity of random neural networks. Preprint arXiv:2108.00207 (2021)
Dirksen, S., Jung, H.C., Rauhut, H.: One-bit compressed sensing with partial Gaussian circulant matrices. Inf. Inference 9(3), 601–626 (2020)
Dirksen, S., Mendelson, S.: Non-Gaussian hyperplane tessellations and robust one-bit compressed sensing. Preprint arXiv:1805.09409 (2018)
Dirksen, S., Mendelson, S.: Robust one-bit compressed sensing with partial circulant matrices. Preprint arXiv:1812.06719 (2018)
Donoho, D.L.: Compressed sensing. IEEE Trans. Inf. Theory 52(4), 1289–1306 (2006)
Foucart, S.: Flavors of compressive sensing. In: Fasshauer, G.E., Schumaker, L.L. (eds.) Approximation Theory XV: San Antonio 2016, Springer Proceedings in Mathematics & Statistics, pp. 61–104. Springer (2017)
Foucart, S., Rauhut, H.: A Mathematical Introduction to Compressive Sensing. Applied and Numerical Harmonic Analysis. Birkhäuser, Basel (2013)
Genzel, M.: High-dimensional estimation of structured signals from non-linear observations with general convex loss functions. IEEE Trans. Inf. Theory 63(3), 1601–1619 (2017)
Genzel, M.: The mismatch principle and \(\ell ^1\)-analysis compressed sensing: a unified approach to estimation under large model uncertainties and structural constraints. PhD thesis, Technische Universität Berlin (2019). Available online: https://doi.org/10.14279/depositonce-8394
Genzel, M., Jung, P.: Recovering structured data from superimposed non-linear measurements. IEEE Trans. Inf. Theory 66(1), 453–477 (2020)
Genzel, M., Kipp, C.: Generic error bounds for the generalized Lasso with sub-exponential data. Preprint arXiv:2004.05361 (2020)
Genzel, M., Kutyniok, G., März, M.: \(\ell ^1\)-analysis minimization and generalized (co-)sparsity: when does recovery succeed?. Appl. Comput. Harmon. Anal. 52, 82–140 (2021)
Genzel, M., März, M., Seidel, R.: Compressed sensing with 1D total variation: breaking sample complexity barriers via non-uniform recovery. Inf. Inference (2021). https://doi.org/10.1093/imaiai/iaab001
Genzel, M., Stollenwerk, A.: Robust 1-bit compressed sensing via hinge loss minimization. Inf. Inference 9(2), 361–422 (2020)
Goldstein, L., Minsker, S., Wei, X.: Structured signal recovery from non-linear and heavy-tailed measurements. IEEE Trans. Inf. Theory 64(8), 5513–5530 (2018)
Gordon, Y.: On Milman’s inequality and random subspaces which escape through a mesh in \({\mathbb{R}}^n\). In: Lindenstrauss, J., Milman, V.D. (eds.) Geometric Aspects of Functional Analysis. Springer, Berlin, Heidelberg (1988)
Gray, R.M., Neuhoff, D.L.: Quantization. IEEE Trans. Inf. Theory 44(6), 2325–2383 (1998)
Gray, R.M., Stockham, T.G.: Dithered quantizers. IEEE Trans. Inf. Theory 39(3), 805–812 (1993)
Jacques, L., Cambareri, V.: Time for dithering: fast and quantized random embeddings via the restricted isometry property. Inf. Inference 6(4), 441–476 (2017)
Jacques, L., Laska, J.N., Boufounos, P.T., Baraniuk, R.G.: Robust 1-bit compressive sensing via binary stable embeddings of sparse vectors. IEEE Trans. Inf. Theory 59(4), 2082–2102 (2013)
Jeong, H., Li, X., Plan, Y., Yılmaz, Ö.: Sub-Gaussian matrices on sets: optimal tail dependence and applications. Preprint arXiv:2001.10631 (2020)
Jung, H.C., Maly, J., Palzer, L., Stollenwerk, A.: Quantized compressed sensing by rectified linear units. Preprint arXiv:1911.07816 (2019)
Knudson, K., Saab, R., Ward, R.: One-bit compressive sensing with norm estimation. IEEE Trans. Inf. Theory 62(5), 2748–2758 (2016)
Krahmer, F., Kruschel, C., Sandbichler, M.: Total variation minimization in compressed sensing. In: Boche, H., Caire, G., Calderbank, R., März, M., Kutyniok, G., Mathar, R. (eds.) Compressed Sensing and its Applications: Second International MATHEON Conference 2015, pp. 333–358. Birkhäuser (2017)
Liu, Z., Scarlett, J.: The generalized Lasso with nonlinear observations and generative priors. Preprint arXiv:2006.12415 (2020)
März, M., Boyer, C., Kahn, J., Weiss, P.: Sampling rates for \(\ell ^1\)-synthesis. Preprint arXiv:2004.07175 (2020)
Mendelson, S.: Learning without concentration. J. ACM 62(3), 21 (2015)
Mendelson, S.: Upper bounds on product and multiplier empirical processes. Stoch. Proc. Appl. 126(12), 3652–3680 (2016)
Mendelson, S., Pajor, A., Tomczak-Jaegermann, N.: Reconstruction and subgaussian operators in asymptotic geometric analysis. Geom. Funct. Anal. 17(4), 1248–1282 (2007)
Moshtaghpour, A., Jacques, L., Cambareri, V., Degraux, K., De Vleeschouwer, C.: Consistent basis pursuit for signal and matrix estimates in quantized compressed sensing. IEEE Signal Process. Lett. 23(1), 25–29 (2016)
Oymak, S., Recht, B.: Near-optimal bounds for binary embeddings of arbitrary sets. Preprint arXiv:1512.04433 (2015)
Oymak, S., Recht, B., Soltanolkotabi, M.: Sharp time-data tradeoffs for linear inverse problems. IEEE Trans. Inf. Theory 64(6), 4129–4158 (2018)
Oymak, S., Soltanolkotabi, M.: Fast and reliable parameter estimation from nonlinear observations. SIAM J. Optim. 27(4), 2276–2300 (2017)
Plan, Y., Vershynin, R.: One-bit compressed sensing by linear programming. Comm. Pure Appl. Math. 66(8), 1275–1297 (2013)
Plan, Y., Vershynin, R.: Robust 1-bit compressed sensing and sparse logistic regression: a convex programming approach. IEEE Trans. Inf. Theory 59(1), 482–494 (2013)
Plan, Y., Vershynin, R.: Dimension reduction by random hyperplane tessellations. Discrete Comput. Geom. 51(2), 438–461 (2014)
Plan, Y., Vershynin, R.: The generalized Lasso with non-linear observations. IEEE Trans. Inf. Theory 62(3), 1528–1537 (2016)
Plan, Y., Vershynin, R., Yudovina, E.: High-dimensional estimation with geometric constraints. Inf. Inference 6(1), 1–40 (2016)
Richard, E., Obozinski, G.R., Vert, J.P.: Tight convex relaxations for sparse matrix factorization. In: Advances in Neural Information Processing Systems 27, pp. 3284–3292 (2014)
Rudelson, M., Vershynin, R.: On sparse reconstruction from Fourier and Gaussian measurements. Comm. Pure Appl. Math. 61(8), 1025–1045 (2008)
Rudin, L.I., Osher, S., Fatemi, E.: Nonlinear total variation based noise removal algorithms. Physica D: Nonlinear Phenomena 60(1–4), 259–268 (1992)
Sattar, Y., Oymak, S.: Quickly finding the best linear model in high dimensions via projected gradient descent. IEEE Trans. Signal Process. 68, 818–829 (2020)
Stojnic, M.: Various thresholds for \(\ell _1\)-optimization in compressed sensing. Preprint arXiv:0907.3666 (2009)
Stollenwerk, A.: One-bit compressed sensing and fast binary embeddings. PhD thesis, RWTH Aachen University (2019). Available online: https://doi.org/10.18154/RWTH-2020-00296
Talagrand, M.: Upper and Lower Bounds for Stochastic Processes. Ergebnisse der Mathematik und ihrer Grenzgebiete, vol. 3. Springer, Berlin, Heidelberg (2014)
Thrampoulidis, C., Abbasi, E., Hassibi, B.: LASSO with non-linear measurements is equivalent to one with linear measurements. In: Cortes, C., Lawrence, N.D., Lee, D.D., Sugiyama, M., Garnett, R. (eds.) Advances in Neural Information Processing Systems 28, pp. 3402–3410. Curran Associates (2015)
Thrampoulidis, C., Rawat, A.S.: Lifting high-dimensional non-linear models with Gaussian regressors. Preprint arXiv:1712.03638 (2017)
Thrampoulidis, C., Rawat, A.S.: The generalized Lasso for sub-gaussian measurements with dithered quantization. IEEE Trans. Inf. Theory 66(4), 2487–2500 (2020)
Tropp, J.A.: Convex recovery of a structured signal from independent random linear measurements. In: G.E. Pfander (ed.) Sampling Theory, a Renaissance, Applied and Numerical Harmonic Analysis, pp. 76–101. Birkhäuser Cham (2015)
Vershynin, R.: High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge Series in Statistical and Probabilistic Mathematics, vol. 47. Cambridge University Press, Cambridge (2018)
Xu, C., Jacques, L.: Quantized compressive sensing with RIP matrices: the benefit of dithering. Inf. Inference 9(3), 543–586 (2020)
Yang, Z., Balasubramanian, K., Wang, Z., Liu, H.: Estimating high-dimensional non-Gaussian multiple index models via Stein’s lemma. In: Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R. (eds.) Advances in Neural Information Processing Systems 30, pp. 6097–6106. Curran Associates (2017)
Acknowledgements
The authors thank Sjoerd Dirksen and Maximilian März for fruitful discussions. They also thank the anonymous referees for their very useful comments and suggestions, which have helped to improve the present article.
Communicated by Thomas Strohmer.
MG acknowledges support by the DFG Priority Programme DFG-SPP 1798 under Grant DI 2120/1-1. AS acknowledges support by the Fonds de la Recherche Scientifique – FNRS under Grant n\(^\circ \) T.0136.20 (Learn2Sense).
Genzel, M., Stollenwerk, A. A Unified Approach to Uniform Signal Recovery From Nonlinear Observations. Found Comput Math 23, 899–972 (2023). https://doi.org/10.1007/s10208-022-09562-y
Keywords
- Uniform recovery
- High-dimensional estimation
- Nonlinear observations
- Quantized compressed sensing
- Empirical processes