Sharp oracle inequalities for low-complexity priors

Luu, Tung Duy; Fadili, Jalal; Chesneau, Christophe

doi:10.1007/s10463-018-0693-6

Sharp oracle inequalities for low-complexity priors

Published: 19 November 2018

Volume 72, pages 353–397, (2020)
Cite this article

Download PDF

Access provided by Autonomous University of Puebla

Annals of the Institute of Statistical Mathematics Aims and scope Submit manuscript

Sharp oracle inequalities for low-complexity priors

Download PDF

Tung Duy Luu¹,
Jalal Fadili¹ &
Christophe Chesneau²

154 Accesses
Explore all metrics

Abstract

In this paper, we consider a high-dimensional statistical estimation problem in which the number of parameters is comparable or larger than the sample size. We present a unified analysis of the performance guarantees of exponential weighted aggregation and penalized estimators with a general class of data losses and priors which encourage objects which conform to some notion of simplicity/complexity. More precisely, we show that these two estimators satisfy sharp oracle inequalities for prediction ensuring their good theoretical performances. We also highlight the differences between them. When the noise is random, we provide oracle inequalities in probability using concentration inequalities. These results are then applied to several instances including the Lasso, the group Lasso, their analysis-type counterparts, the $\ell _\infty $ and the nuclear norm penalties. All our estimators can be efficiently implemented using proximal splitting algorithms.

Penalized Least Square in Sparse Setting with Convex Penalty and Non Gaussian Errors

Article 05 November 2021

Bounds on the Prediction Error of Penalized Least Squares Estimators with Convex Penalty

Oracle inequalities for weighted group lasso in high-dimensional misspecified Cox models

Article Open access 30 November 2020

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

1 Introduction

1.1 Problem statement

Our statistical context is the following. Let $\varvec{y}=(\varvec{y}_1,\varvec{y}_2,\ldots ,\varvec{y}_n)$ be n observations with common marginal distribution, and $\varvec{X}\in \mathbb {R}^{n \times p}$ a deterministic design matrix. The goal to estimate a parameter vector ${\varvec{\theta }}\in \mathbb {R}^p$ of the observations marginal distribution based on the data $\varvec{y}$ and $\varvec{X}$.

Let $F: \mathbb {R}^n \times \mathbb {R}^n \rightarrow \mathbb {R}$ be a loss function supposed to be smooth and convex that assigns to each ${\varvec{\theta }}\in \mathbb {R}^p$ a cost $F(\varvec{X}{\varvec{\theta }},\varvec{y})$. Let ${\varvec{\theta }}_0 \in {{\mathrm{Argmin}}}_{{\varvec{\theta }}\in \mathbb {R}^p}\mathbb {E}_{}\left[ F(\varvec{X}{\varvec{\theta }},\varvec{y})\right] $ be any minimizer of the population risk. We regard ${\varvec{\theta }}_0$ as the true parameter. A usual instance of this statistical setting is the standard linear regression model based on n pairs $(\varvec{y}_i,\varvec{X}_i)$ of response–covariate that are linked linearly $\varvec{y}=\varvec{X}{\varvec{\theta }}_0+\varvec{\xi }$, and $F({\varvec{u}},\varvec{y})=\tfrac{1}{2}\big \Vert \varvec{y}-{\varvec{u}}\big \Vert _{2}^2$.

Our goal is to provide general oracle inequalities in prediction for two estimators of ${\varvec{\theta }}_0$: the penalized estimator and exponential weighted aggregation. In the setting where “p larger than n (possibly much larger), the estimation problem is ill-posed since the rectangular matrix $\varvec{X}$ has a kernel of dimension at least $p-n$. To circumvent this difficulty, we will exploit the prior that ${\varvec{\theta }}_0$ has some low-complexity structure (among which sparsity and low-rank are the most popular). That is, even if the ambient dimension p of ${\varvec{\theta }}_0$ is very large, its intrinsic dimension is much smaller than the sample size n. This makes it possible to build estimates $\varvec{X}\widehat{{\varvec{\theta }}}$ with good provable performance guarantees under appropriate conditions. There has been a flurry of research on the use of low-complexity regularization in ill-posed recovery problems in various areas including statistics and machine learning.

1.2 Penalized estimators

Regularization is now a central theme in many fields including statistics, machine learning and inverse problems. It allows one to impose on the set of candidate solutions some prior structure on the object to be estimated. This regularization ranges from squared Euclidean or Hilbertian norms to non-Hilbertian norms (e.g., $\ell _1$ norm for sparse objects, or nuclear norm for low-rank matrices) that have sparked considerable interest in the recent years. In this paper, we consider the class of estimators obtained by solving the convex optimization problem

$$\begin{aligned} {\widehat{{\varvec{\theta }}}^{\mathrm {PEN}}_n}\in {{\mathrm{Argmin}}}_{{\varvec{\theta }}\in \mathbb {R}^p} \left\{ V_n({\varvec{\theta }}) {\mathop {=}\limits ^{\text { def}}}\tfrac{1}{n}F(\varvec{X}{\varvec{\theta }},\varvec{y}) + \lambda _n J({\varvec{\theta }}) \right\} , \end{aligned}$$

(1)

where the regularizing penalty J is a proper closed convex function that promotes some specific notion of simplicity/low-complexity, and $\lambda _n > 0$ is the regularization parameter.

To avoid trivialities, the set of minimizers is assumed non-empty, which holds for instance if J is also coercive. A prominent member covered by (1) is the Lasso (Chen et al. 1999; Tibshirani 1996; Osborne et al. 2000; Donoho 2006; Candès and Plan 2009; Bickel et al. 2009; Bühlmann and van de Geer 2011; Koltchinskii 2008) and its variants such the analysis/fused Lasso (Rudin et al. 1992; Tibshirani et al. 2005), SLOPE (Bogdan et al. 2014; Su and Candès 2015) or group Lasso (Bakin 1999; Yuan and Lin 2006; Bach 2008; Wei and Huang 2010). Another example is the nuclear norm minimization for low-rank matrix recovery motivated by various applications including robust PCA, phase retrieval, control and computer vision (Recht et al. 2010; Candès and Recht 2009; Fazel et al. 2001; Candès et al. 2013). See Negahban et al. (2012), Bühlmann and van de Geer (2011), van de Geer (2014) and Vaiter et al. (2015b) for generalizations and comprehensive reviews.

1.3 Exponential weighted aggregation (EWA)

An alternative to the penalized estimator (1) is the aggregation by exponential weighting, which consists in substituting averaging for minimization. The aggregators are defined via the probability density function

$$\begin{aligned} \mu _n({\varvec{\theta }}) = \frac{\exp {\left( -V_n({\varvec{\theta }})/\beta \right) }}{\int _\varTheta \exp {\left( -V_n({\varvec{\omega }})/\beta \right) }\mathrm{d}{\varvec{\omega }}}, \end{aligned}$$

(2)

where $\beta >0$ is called temperature parameter. If all ${\varvec{\theta }}$ are candidates to estimate the true vector ${\varvec{\theta }}_0$, then $\varTheta = \mathbb {R}^p$. The aggregate is thus defined by

$$\begin{aligned} {\widehat{{\varvec{\theta }}}_n^{\mathrm {EWA}}}= \int _{\mathbb {R}^p} {\varvec{\theta }}\mu _n({\varvec{\theta }})\mathrm{d}{\varvec{\theta }}. \end{aligned}$$

(3)

Aggregation by exponential weighting has been widely considered in the statistical and machine learning literature, see e.g., Dalalyan and Tsybakov (2007, 2008, 2009, 2012), Nemirovski (2000), Yang (2004), Rigollet and Tsybakov (2007), Lecué (2007), Guedj and Alquier (2013) and Duy Luu et al. (2016) to name a few. The technique used in these papers were initiated by Leung and Barron (2006) (use of Stein’s identity to study an early version of EWA) and Catoni (2003, 2007) (PAC-Bayesian theory). ${\widehat{{\varvec{\theta }}}_n^{\mathrm {EWA}}}$ can also be interpreted as the posterior conditional mean in the Bayesian sense if $F/(n\beta )$ is the negative-loglikelihood associated to the noise $\varvec{\xi }$ with the prior density $\pi ({\varvec{\theta }})\propto \exp {\left( -\lambda _n J({\varvec{\theta }})/\beta \right) }$.

1.4 Oracle inequalities

Oracle inequalities, which are at the heart of our work, quantify the quality of an estimator compared to the best possible one among a family of estimators. These inequalities are well adapted in the scenario where the prior penalty promotes some notion of low-complexity (e.g., sparsity, low rank, etc.). Given two vectors ${\varvec{\theta }}_1$ and ${\varvec{\theta }}_2$, let $R_n{\left( {\varvec{\theta }}_1,{\varvec{\theta }}_2\right) }$ be a nonnegative error measure between their predictions, respectively, $\varvec{X}{\varvec{\theta }}_1$ and $\varvec{X}{\varvec{\theta }}_2$. A popular example is the averaged prediction squared error $\tfrac{1}{n}\big \Vert \varvec{X}{\varvec{\theta }}_1 - \varvec{X}{\varvec{\theta }}_2\big \Vert _{2}^2$, where $\big \Vert \cdot \big \Vert _{2}$ is the $\ell _2$ norm. $R_n$ will serve as a measure of the performance of the estimators ${\widehat{{\varvec{\theta }}}_n^{\mathrm {EWA}}}$ and ${\widehat{{\varvec{\theta }}}^{\mathrm {PEN}}_n}$. More precisely, we aim to prove that ${\widehat{{\varvec{\theta }}}_n^{\mathrm {EWA}}}$ and ${\widehat{{\varvec{\theta }}}^{\mathrm {PEN}}_n}$ mimic as much as possible the best possible model. This idea is materialized in the following type of inequalities (stated here for EWA)

$$\begin{aligned} R_n{\big ({\widehat{{\varvec{\theta }}}_n^{\mathrm {EWA}}},{\varvec{\theta }}_0\big )} \le C\inf _{{\varvec{\theta }}\in \mathbb {R}^p} {\big (R_n{\left( {\varvec{\theta }},{\varvec{\theta }}_0\right) } + \varDelta _{n,p,\lambda _n,\beta }({\varvec{\theta }})\big )}, \end{aligned}$$

(4)

where $C\ge 1$ is the leading constant of the oracle inequality and the remainder term $\varDelta _{n,\lambda _n,\beta }({\varvec{\theta }})$ depends on the performance of the estimator, the complexity of ${\varvec{\theta }}$, the sample size n, the dimension p, and the regularization and temperature parameters $(\lambda _n,\beta )$. An estimator with good oracle properties would correspond to C close to 1 (ideally, $C=1$, in which case the inequality is said “sharp”), and $\varDelta _{n,p,\lambda _n,\beta }({\varvec{\theta }})$ is small and decreases rapidly to 0 as $n \rightarrow +\infty $.

1.5 Contributions

We provide a unified analysis where we capture the essential ingredients behind the low-complexity priors promoted by J, relying on sophisticated arguments from convex analysis and our previous work (Fadili et al. 2013; Vaiter et al. 2015a, 2018, 2015b, 2017). Our main contributions are summarized as follows:

We show that the EWA estimator ${\widehat{{\varvec{\theta }}}_n^{\mathrm {EWA}}}$ in (2) and the penalized estimator ${\widehat{{\varvec{\theta }}}^{\mathrm {PEN}}_n}$ in (1) satisfy (deterministic) sharp oracle inequalities for prediction with optimal remainder term, for general data losses F beyond the usual quadratic one, and J is a proper finite-valued sublinear function (i.e., J is finite-valued convex and positively homogeneous). We also highlight the differences between the two estimators in terms of the corresponding bounds.
When the observations are random, we prove oracle inequalities in probability. The theory is non-asymptotic in nature, as it yields explicit bounds that hold with high probability for finite sample sizes, and reveals the dependence on dimension and other structural parameters of the model.
For the standard linear model with Gaussian or sub-Gaussian noise, and a quadratic loss, we deliver refined versions of these oracle inequalities in probability. We underscore the role of the Gaussian width, a concept that captures important geometric characteristics of sets in $\mathbb {R}^n$.
These results yield naturally a large number of corollaries when specialized to penalties routinely used in the literature, among which the Lasso, the group Lasso, their analysis-type counterparts (fused (group) Lasso), the $\ell _\infty $ and the nuclear norms. Some of these corollaries are known and others novel.

The estimators ${\widehat{{\varvec{\theta }}}_n^{\mathrm {EWA}}}$ and ${\widehat{{\varvec{\theta }}}^{\mathrm {PEN}}_n}$ can be easily implemented thanks to the framework of proximal splitting methods, and more precisely forward-backward type splitting. While the latter is well-known to solve (1) (Vaiter et al. 2015b), its application within a proximal Langevin Monte Carlo algorithm to compute ${\widehat{{\varvec{\theta }}}_n^{\mathrm {EWA}}}$ with provable guarantees has been recently developed by Duy Luu et al. (2016) to sample from log-prox regular densities, see also Durmus et al. (2016) for log-concave densities.

1.6 Relation to previous work

Our oracle inequality for ${\widehat{{\varvec{\theta }}}_n^{\mathrm {EWA}}}$ extends the work of Dalalyan et al. (2018) with an unprecedented level of generality, far beyond the Lasso and the nuclear norm. Our prediction sharp oracle inequality for ${\widehat{{\varvec{\theta }}}^{\mathrm {PEN}}_n}$ specializes to that of Sun and Zhang (2012) in the case of the Lasso [see also the discussion in Dalalyan et al. (2017) and references therein] and that of Koltchinskii et al. (2011) for the case of the nuclear norm. Our work also goes much beyond that in van de Geer (2014) on weakly decomposable priors, where we show in particular that there is no need to impose decomposability on the regularizer, since it is rather an intrinsic property of it.

1.7 Paper organization

Section 2 states our main assumptions on the data loss and the prior penalty. All the concepts and notions are exemplified on some penalties some of which are popular in the literature. In Sect. 3, we prove our main oracle inequalities, and their versions in probability. We then tackle the case of linear regression with quadratic data loss in Sect. 4. Concepts from convex analysis that are essential to this work are gathered in Sect. A. A key intermediate result in the proof of our main results is established in Sect. B with an elegant argument relying on Moreau–Yosida regularization.

1.8 Notations

Vectors and matrices For a d-dimensional Euclidean space $\mathbb {R}^d$, we endow it with its usual inner product $\langle \cdot ,\cdot \rangle $ and associated norm $\left\| \cdot \right\| _{2}$. $\mathbf{Id}_d$ is the identity matrix on $\mathbb {R}^d$. For $p \ge 1$, $\left\| \cdot \right\| _{p}$ will denote the $\ell _p$ norm of a vector with the usual adaptation for $p=+\infty $.

In the following, if T is a vector space, ${{\mathrm{P}}}_T$ denotes the orthogonal projector on T, and

$$\begin{aligned} {\varvec{\theta }}_T = {{\mathrm{P}}}_T {\varvec{\theta }} \quad \text {and} \quad {\varvec{X}}_T= \varvec{X}{{\mathrm{P}}}_T. \end{aligned}$$

For a finite set ${\mathcal C}$, we denote $ \big |{\mathcal C} \big |$ its cardinality. For $I \subset \left\{ 1,\ldots ,p \right\} $, we denote by $I^c$ its complement. ${\varvec{\theta }}_{I}$ is the subvector whose entries are those of ${\varvec{\theta }}$ restricted to the indices in I, and $\varvec{X}_{I}$ the submatrix whose columns are those of $\varvec{X}$ indexed by I. For any matrix $\varvec{X}$, $\varvec{X}^\top $ denotes its transpose and $\varvec{X}^+$ its Moore–Penrose pseudo-inverse. For a linear operator $\varvec{A}$, $\varvec{A}^*$ is its adjoint.

Sets For a non-empty set ${\mathcal C}\in \mathbb {R}^p$, we denote ${\overline{\mathrm {conv}}\left( {\mathcal C}\right) }$ the closure of its convex hull, and $\iota _{\mathcal C}$ its indicator function, i.e., $\iota _{\mathcal C}({\varvec{\theta }})=0$ if ${\varvec{\theta }}\in {\mathcal C}$ and $+\infty $ otherwise. For a non-empty convex set ${\mathcal C}$, its affine hull${{\mathrm{aff}}}({\mathcal C})$ is the smallest affine manifold containing it. It is a translate of its parallel subspace${{\mathrm{par}}}({\mathcal C})$, i.e., ${{\mathrm{par}}}({\mathcal C}) = {{\mathrm{aff}}}({\mathcal C}) - {\varvec{\theta }}= \mathbb {R}({\mathcal C}-{\mathcal C})$; for any ${\varvec{\theta }}\in {\mathcal C}$. The relative interior${{\mathrm{ri}}}({\mathcal C})$ of a convex set ${\mathcal C}$ is the interior of ${\mathcal C}$ for the topology relative to its affine full.

Functions A function $f: \mathbb {R}^p \rightarrow \mathbb {R}\cup \{+\infty \}$ is closed (or lower semicontinuous (lsc)) if so is its epigraph. It is coercive if $\lim _{\left\| {\varvec{\theta }}\right\| _{2} \rightarrow +\infty }f({\varvec{\theta }})=+\infty $, and strongly coercive if $\lim _{\left\| {\varvec{\theta }}\right\| _{2} \rightarrow +\infty }f({\varvec{\theta }})/\left\| x\right\| _{2}=+\infty $. The effective domain of f is ${{\mathrm{dom}}}(f) = \big \{ {\varvec{\theta }}\in \mathbb {R}^p \;:\; f({\varvec{\theta }}) < +\infty \big \} $ and f is proper if ${{\mathrm{dom}}}(f) \ne \emptyset $ as is the case when it is finite-valued. A function is said sublinear if it is convex and positively homogeneous. The Legendre–Fenchel conjugate of f is $f^*(\varvec{z})=\sup _{{\varvec{\theta }}\in \mathbb {R}^p} \langle \varvec{z},{\varvec{\theta }}\rangle - f({\varvec{\theta }})$. For f proper, the functions $(f,f^*)$ obey the Fenchel–Young inequality

$$\begin{aligned} f({\varvec{\theta }}) + f^*(\varvec{z}) \ge \langle \varvec{z},{\varvec{\theta }}\rangle , \quad \forall ({\varvec{\theta }},\varvec{z}) \in \mathbb {R}^p \times \mathbb {R}^p . \end{aligned}$$

(5)

When f is a proper lower semicontinuous and convex function, $(f,f^*)$ is actually the best pair for which this inequality cannot be tightened. For a function g on $\mathbb {R}_+$, the function $g^+: a \in \mathbb {R}_+ \mapsto g^+(a)=\sup _{t \ge 0} a t - g(t)$ is called the monotone conjugate of g. The pair $(g,g^+)$ obviously obeys (5) on $\mathbb {R}_+ \times \mathbb {R}_+$.

For a $C^1$-smooth function f, $\nabla f({\varvec{\theta }})$ is its (Euclidean) gradient. For a bivariate function $g: ({\varvec{\eta }},\varvec{y}) \in \mathbb {R}^n \times \mathbb {R}^n \rightarrow \mathbb {R}$ that is $C^2$ with respect to the first variable ${\varvec{\eta }}$, for any $\varvec{y}$, we will denote $\nabla g({\varvec{\eta }},\varvec{y})$ the gradient of g at ${\varvec{\eta }}$ with respect to the first variable.

The subdifferential$\partial f({\varvec{\theta }})$ of a convex function f at ${\varvec{\theta }}$ is the set

$$\begin{aligned} \partial f({\varvec{\theta }}) = \big \{ {\varvec{\eta }}\in \mathbb {R}^p \;:\; f({\varvec{\theta }}') \ge f({\varvec{\theta }}) + \langle {\varvec{\eta }},{\varvec{\theta }}'-{\varvec{\theta }}\rangle , \quad \forall {\varvec{\theta }}' \in {{\mathrm{dom}}}(f) \big \} ~. \end{aligned}$$

An element of $\partial f({\varvec{\theta }})$ is a subgradient. If the convex function f is differentiable at ${\varvec{\theta }}$, then its only subgradient is its gradient, i.e., $\partial f({\varvec{\theta }}) = \left\{ \nabla f({\varvec{\theta }}) \right\} $.

The Bregman divergence associated to a convex function f at ${\varvec{\theta }}$ with respect to ${\varvec{\eta }}\in \partial f({\varvec{\theta }}) \ne \emptyset $ is

$$\begin{aligned} D_{f}^{{\varvec{\eta }}}{\left( \overline{{\varvec{\theta }}},{\varvec{\theta }}\right) } = f(\overline{{\varvec{\theta }}}) - f({\varvec{\theta }}) - \langle {\varvec{\eta }},\overline{{\varvec{\theta }}}-{\varvec{\theta }}\rangle . \end{aligned}$$

The Bregman divergence is in general nonsymmetric. It is also nonnegative by convexity. When f is differentiable at $\overline{{\varvec{\theta }}}$, we simply write $D_{f}^{}{\left( {\varvec{\theta }},\overline{{\varvec{\theta }}}\right) }$ (which is, in this case, also known as the Taylor distance).

2 Estimation with low-complexity penalties

The estimators ${\widehat{{\varvec{\theta }}}^{\mathrm {PEN}}_n}$ and ${\widehat{{\varvec{\theta }}}_n^{\mathrm {EWA}}}$ in (1) and (3) require two essential ingredients: the data loss term F and the prior penalty J. We here specify the class of such functions covered in our work and provide illustrating examples.

2.1 Data loss

The class of loss functions F that we consider obey the following assumptions:

(H.1):

$F(\cdot ,\varvec{y}): \mathbb {R}^n \rightarrow \mathbb {R}$ is $C^1(\mathbb {R}^n)$ and uniformly convex for all $\varvec{y}$ of modulus $\varphi $, i.e.,

$$\begin{aligned} F({\varvec{v}},\varvec{y}) \ge F({\varvec{u}},\varvec{y}) + \langle \nabla F({\varvec{u}},\varvec{y}),{\varvec{v}}-{\varvec{u}}\rangle + \varphi (\left\| {\varvec{v}}-{\varvec{u}}\right\| _{2}), \end{aligned}$$

where $\varphi : \mathbb {R}_+ \rightarrow \mathbb {R}_+$ is a convex non-decreasing function that vanishes only at 0.

(H.2):

For any $\overline{{\varvec{\theta }}}\in \mathbb {R}^p$ and $\varvec{y}\in \mathbb {R}^n$, $\int _{\mathbb {R}^p}\exp {\left( -F(\varvec{X}{\varvec{\theta }},\varvec{y})/(n\beta )\right) } \big |\langle \nabla F(\varvec{X}{\varvec{\theta }},\varvec{y}),\varvec{X}(\overline{{\varvec{\theta }}}-{\varvec{\theta }})\rangle \big | \mathrm{d}{\varvec{\theta }}< +\infty $.

Recall that by Lemma 2, the monotone conjugate $\varphi ^+$ of $\varphi $ is a proper, closed, convex, strongly coercive and non-decreasing function on $\mathbb {R}_+$ that vanishes at 0. Moreover, $\varphi ^{++}=\varphi $. The function $\varphi ^+$ is finite-valued on $\mathbb {R}_+$ if $\varphi $ is strongly coercive, and it vanishes only at 0 under, e.g., Lemma 2(iii).

The class of data loss functions in (H.1) is fairly general. It is reminiscent of the negative loglikelihood in the regular exponential family. For the moment assumption (H.2) to be satisfied, it is sufficient that

$$\begin{aligned} \int _{\mathbb {R}^p}\exp {\left( -\varphi {\left( \big \Vert \varvec{X}{\varvec{\theta }}\big \Vert _{2}\right) }/(n\beta )\right) }\big \Vert \nabla F(\varvec{X}{\varvec{\theta }}+{\varvec{u}}^\star ,\varvec{y})\big \Vert _{2}\big \Vert \varvec{X}{\varvec{\theta }}+({\varvec{u}}^\star -\varvec{X}\overline{{\varvec{\theta }}})\big \Vert _{2} \mathrm{d}{\varvec{\theta }}< +\infty , \end{aligned}$$

where ${\varvec{u}}^\star $ be a minimizer of $F(\cdot ,\varvec{y})$, which is unique by uniform convexity. We here provide an example.

Example 1

Consider the case where $\varphi (t)=t^q/q$, $q \in ]1,+\infty [$, or equivalently $\varphi ^+(t)=t^{q_*}/q_*$ where $1/q+1/q_*=1$. For $q=q_*=2$, (H.1) amounts to saying that $F(\cdot ,\varvec{y})$ is strongly convex for all $\varvec{y}$. In particular, Bauschke and Combettes (2011, Proposition 10.13) shows that $F({\varvec{u}},\varvec{y})=\big \Vert {\varvec{u}}-\varvec{y}\big \Vert _{2}^q/q$ is uniformly convex for $q \in [2,+\infty [$ with modulus $\varphi (t)=C_q t^q/q$, where $C_q > 0$ is a constant that depends solely on q.

For (H.2) to be verified, it is sufficient that

$$\begin{aligned} \int _{\mathbb {R}^p}\exp {\left( -\big \Vert \varvec{X}{\varvec{\theta }}\big \Vert _{2}^q/(qn\beta )\right) }\big \Vert \nabla F(\varvec{X}{\varvec{\theta }}+{\varvec{u}}^\star ,\varvec{y})\big \Vert _{2}\big \Vert (\varvec{X}{\varvec{\theta }}+{\varvec{u}}^\star )-\varvec{X}\overline{{\varvec{\theta }}}\big \Vert _{2} \mathrm{d}{\varvec{\theta }}< +\infty . \end{aligned}$$

In particular, taking $F({\varvec{u}},\varvec{y})=\big \Vert {\varvec{u}}-\varvec{y}\big \Vert _{2}^q/q$, $q \in [2,+\infty [$, we have $\big \Vert \nabla F({\varvec{u}},\varvec{y})\big \Vert _{2} = \big \Vert {\varvec{u}}-\varvec{y}\big \Vert _{2}^{q-1}$, and thus (H.2) holds since

$$\begin{aligned} \int _{\mathbb {R}^p}\exp {\left( -\big \Vert \varvec{X}{\varvec{\theta }}\big \Vert _{2}^q/(qn\beta )\right) }\big \Vert \varvec{y}- (\varvec{X}{\varvec{\theta }}+{\varvec{u}}^\star )\big \Vert _{2}^{q-1}\big \Vert \varvec{X}\overline{{\varvec{\theta }}}- (\varvec{X}{\varvec{\theta }}+{\varvec{u}}^\star )\big \Vert _{2} \mathrm{d}{\varvec{\theta }}< +\infty . \end{aligned}$$

2.2 Prior penalty

Recall the main definitions and results from convex analysis that are collected in Sect. A. Our main assumption on J is the following.

(H.3):: $J: \mathbb {R}^p \rightarrow \mathbb {R}$ is the gauge of a non-empty convex compact set containing the origin as an interior point.

By Lemma 4, this assumption is equivalent to saying that $J {\mathop {=}\limits ^{\text { def}}}\gamma _{{\mathcal C}}$ is proper, convex, positively homogeneous, finite-valued and coercive. In turn, J is locally Lipschitz continuous on $\mathbb {R}^p$. Observe also that by virtue of Lemma 5 and Lemma 3, the polar gauge $J^\circ {\mathop {=}\limits ^{\text { def}}}\gamma _{{\mathcal C}^\circ }$ enjoys the same properties as J in (H.3).

2.3 Decomposability of the prior penalty

We are now in position to provide an important characterization of the subdifferential mapping of a function J satisfying (H.3). This characterization will play a pivotal role in our proof of the oracle inequality.

We start by defining some essential geometrical objects that were introduced in Vaiter et al. (2015a).

Definition 1

(Model Subspace) Let ${\varvec{\theta }}\in \mathbb {R}^p$. We denote by $e_{{\varvec{\theta }}}$ as

$$\begin{aligned} e_{{\varvec{\theta }}} = {{\mathrm{P}}}_{{{\mathrm{aff}}}(\partial J({\varvec{\theta }}))}(0). \end{aligned}$$

We denote

$$\begin{aligned} S_{{\varvec{\theta }}} = {{\mathrm{par}}}(\partial J({\varvec{\theta }})) \quad \text {and} \quad T_{{\varvec{\theta }}} = S_{{\varvec{\theta }}}^\bot . \end{aligned}$$

$T_{{\varvec{\theta }}}$ is coined the model subspace of ${\varvec{\theta }}$ associated to J.

It can be shown, see Vaiter et al. (2015a, Proposition 5), that ${\varvec{\theta }}\in T_{{\varvec{\theta }}}$, hence the name model subspace. When J is differentiable at ${\varvec{\theta }}$, we have $e_{{\varvec{\theta }}}=\nabla J({\varvec{\theta }})$ and $T_{{\varvec{\theta }}}=\mathbb {R}^p$. When J is the $\ell _1$-norm (Lasso), the vector $e_{{\varvec{\theta }}}$ is nothing but the sign of ${\varvec{\theta }}$. Thus, $e_{{\varvec{\theta }}}$ can be viewed as a generalization of the sign vector. Observe also that $e_{{\varvec{\theta }}}={{\mathrm{P}}}_{T_{{\varvec{\theta }}}}(\partial J({\varvec{\theta }}))$, and thus $e_{{\varvec{\theta }}} \in T_{{\varvec{\theta }}} \cap {{\mathrm{aff}}}(\partial J({\varvec{\theta }}))$. However, in general, $e_{{\varvec{\theta }}} \not \in \partial J({\varvec{\theta }})$.

We now provide a fundamental equivalent description of the subdifferential of J at ${\varvec{\theta }}$ in terms of $e_{{\varvec{\theta }}}$, $T_{{\varvec{\theta }}}$, $S_{{\varvec{\theta }}}$ and the polar gauge $J^\circ $.

Theorem 1

Let J satisfy (H.3). Let ${\varvec{\theta }}\in \mathbb {R}^p$ and $f_{{\varvec{\theta }}} \in {{\mathrm{ri}}}(\partial J({\varvec{\theta }}))$.

(i)
The subdifferential of J at ${\varvec{\theta }}$ reads
$$\begin{aligned} \partial J({\varvec{\theta }})&= {{\mathrm{aff}}}(\partial J({\varvec{\theta }})) \cap {\mathcal C}^\circ \\&=\left\{ {\varvec{\eta }}\in \mathbb {R}^n ~: \right. \\&\left. {\varvec{\eta }}_{T_{{\varvec{\theta }}}} = e_{{\varvec{\theta }}} ~ \text {and} ~ \inf _{\tau \ge 0}\max {\left( J^\circ {\left( \tau e_{{\varvec{\theta }}} + {\varvec{\eta }}_{S_{{\varvec{\theta }}}} + (\tau -1){{\mathrm{P}}}_{S_{{\varvec{\theta }}}}f_{{\varvec{\theta }}}\right) },\tau \right) } \le 1 \right\} . \end{aligned}$$
(ii)
For any ${\varvec{\omega }}\in \mathbb {R}^p$, $\exists {\varvec{\eta }}\in \partial J({\varvec{\theta }})$ such that
$$\begin{aligned} J({\varvec{\omega }}_{S_{{\varvec{\theta }}}}) = \langle {\varvec{\eta }}_{S_{{\varvec{\theta }}}},{\varvec{\omega }}_{S_{{\varvec{\theta }}}}\rangle . \end{aligned}$$

Proof

(i)
This follows by piecing together (Vaiter et al. 2015a, Theorem 1, Propositions 4 and 5(iii)).
(ii)
From Vaiter et al. (2015a, Proposition 5(iv)), we have
$$\begin{aligned} \sigma _{\partial J({\varvec{\theta }}) - f_{{\varvec{\theta }}}}({\varvec{\omega }}) = J({\varvec{\omega }}_{S_{{\varvec{\theta }}}}) - \langle {{\mathrm{P}}}_{S_{{\varvec{\theta }}}}f_{{\varvec{\theta }}},{\varvec{\omega }}_{S_{{\varvec{\theta }}}}\rangle . \end{aligned}$$
Thus there exists a supporting point ${\varvec{v}}\in \partial J({\varvec{\theta }}) - f_{{\varvec{\theta }}} \subset S_{{\varvec{\theta }}}$ with normal vector ${\varvec{\omega }}$ (Bauschke and Combettes 2011, Corollary 7.6(iii)), i.e.,
$$\begin{aligned} \sigma _{\partial J({\varvec{\theta }}) - f_{{\varvec{\theta }}}}({\varvec{\omega }}) = \langle {\varvec{v}},{\varvec{\omega }}_{S_{{\varvec{\theta }}}}\rangle . \end{aligned}$$
Taking ${\varvec{\eta }}={\varvec{v}}+f_{{\varvec{\theta }}}$ concludes the proof. $\square $

Remark 1

The coercivity assumption in (H.3) is not needed for Theorem 1 to hold.

The decomposability of described in Theorem 1(i) depends on the particular choice of the mapping ${\varvec{\theta }}\mapsto f_{{\varvec{\theta }}} \in {{\mathrm{ri}}}(\partial J({\varvec{\theta }}))$. An interesting situation is encountered when $e_{{\varvec{\theta }}} \in {{\mathrm{ri}}}(J({\varvec{\theta }}))$, so that one can choose $f_{{\varvec{\theta }}}=e_{{\varvec{\theta }}}$. Strong gauges, see Vaiter et al. (2015a, Definition 6), are precisely a class of gauges for which this situation occurs, and in this case, Theorem 1(i) has the simpler form

$$\begin{aligned} \partial J({\varvec{\theta }}) = {{\mathrm{aff}}}(\partial J({\varvec{\theta }})) \cap {\mathcal C}^\circ = \big \{ {\varvec{\eta }}\in \mathbb {R}^n \;:\; {\varvec{\eta }}_{T_{{\varvec{\theta }}}} = e_{{\varvec{\theta }}} \quad \text {and} \quad J^\circ ({\varvec{\eta }}_{S_{{\varvec{\theta }}}}) \le 1 \big \} . \end{aligned}$$

(6)

The Lasso, group Lasso and nuclear norms are typical examples of (symmetric) strong gauges. However, analysis sparsity penalties (e.g., the fused Lasso) or the $\ell _\infty $-penalty are not strong gauges, though they obviously satisfy (H.3). See the next section for a detailed discussion.

2.4 Calculus with the prior family

The family of penalties complying with (H.3) forms a robust class enjoying important calculus rules. In particular, it is closed under the sum and composition with an injective linear operator as we now prove.

Lemma 1

The set of functions satisfying (H.3) is closed under addition (actually any positive linear combination) and pre-composition by an injective linear operator. More precisely, the following holds:

(i)
Let J and G be two gauges satisfying (H.3). Then $H {\mathop {=}\limits ^{\text { def}}}J+G$ also obeys (H.3). Moreover,
1. (a)
  $T^H_{{\varvec{\theta }}} = T^J_{{\varvec{\theta }}} \cap T^G_{{\varvec{\theta }}}$ and $e_{{\varvec{\theta }}}^H = {{\mathrm{P}}}_{T^H_{{\varvec{\theta }}}}(e_{{\varvec{\theta }}}^J+e_{{\varvec{\theta }}}^G)$, where $T^J_{{\varvec{\theta }}}$ and $e_{{\varvec{\theta }}}^J$ (resp. $T^G_{{\varvec{\theta }}}$ and $e_{{\varvec{\theta }}}^G$) are the model subspace and vector at ${\varvec{\theta }}$ associated to J (resp. G);
2. (b)
  $H^\circ ({\varvec{\omega }})=\max _{\rho \in [0,1]}{\overline{\mathrm {conv}}\left( \inf {\left( \rho J^\circ ({\varvec{\omega }}),(1-\rho )G^\circ ({\varvec{\omega }})\right) }\right) }$.
(ii)
Let J be a gauge satisfying (H.3), and ${\varvec{D}}: \mathbb {R}^q \rightarrow \mathbb {R}^p$ be surjective. Then $H {\mathop {=}\limits ^{\text { def}}}J \circ {\varvec{D}}^\top $ also fulfills (H.3). Moreover,
1. (a)
  $T^H_{{\varvec{\theta }}} = {{\mathrm{Ker}}}({\varvec{D}}_{S^J_{{\varvec{u}}}}^\top )$ and $e_{{\varvec{\theta }}}^H = {{\mathrm{P}}}_{T^H_{{\varvec{\theta }}}}{\varvec{D}}e_{{\varvec{u}}}^J$, where $T^J_{{\varvec{u}}}$ and $e_{{\varvec{u}}}^J$ are the model subspace and vector at ${\varvec{u}}{\mathop {=}\limits ^{\text { def}}}{\varvec{D}}^\top {\varvec{\theta }}$ associated to J;
2. (b)
  $H^\circ ({\varvec{\omega }})=J^\circ ({\varvec{D}}^+{\varvec{\omega }})$, where ${\varvec{D}}^+={\varvec{D}}^\top {\big ({\varvec{D}}{\varvec{D}}^\top \big )}^{-1}$.

The outcome of Lemma 1 is naturally expected. For instance, assertion (i) states that combining several penalties/priors will promote objects living on the intersection of the respective low-complexity models. Similarly, for (ii), one promotes low-complexity in the image of the analysis operator ${\varvec{D}}^\top $. It then follows that one has not to deploy an ad hoc analysis when linearly pre-composing or combining (or both) several penalties (e.g., $\ell _1$+nuclear norms for recovering sparse and low-rank matrices) since our unified analysis in Sect. 3 will apply to them just as well.

Proof

(i)
Convexity, positive homogeneity, coercivity and finite-valuedness are straightforward.
1. (a)
  This is Vaiter et al. (2015a, Proposition 8(i)–(ii)).
2. (b)
  We have from Lemma 5 and calculus rules on support functions,
  $$\begin{aligned} H^\circ ({\varvec{\omega }})&= \sigma _{J({\varvec{\theta }})+G({\varvec{\theta }}) \le 1}({\varvec{\omega }}) = \sup _{J({\varvec{\theta }})+G({\varvec{\theta }}) \le 1} \langle {\varvec{\omega }},{\varvec{\theta }}\rangle \\&= \max _{\rho \in [0,1]} \sup _{J({\varvec{\theta }}) \le \rho ,G({\varvec{\theta }}) \le 1-\rho } \langle {\varvec{\omega }},{\varvec{\theta }}\rangle \\&= \max _{\rho \in [0,1]} {\overline{\mathrm {conv}}\left( \inf {\left( \sigma _{J({\varvec{\theta }}) \le \rho }({\varvec{\omega }}),\sigma _{G({\varvec{\theta }}) \le 1-\rho }({\varvec{\omega }})\right) }\right) } \\&= \max _{\rho \in [0,1]} {\overline{\mathrm {conv}}\left( \inf {\left( \rho \sigma _{J({\varvec{\theta }}) \le 1}({\varvec{\omega }}),(1-\rho )\sigma _{G({\varvec{\theta }}) \le 1}({\varvec{\omega }})\right) }\right) } \\&= \max _{\rho \in [0,1]} {\overline{\mathrm {conv}}\left( \inf {\left( \rho J^\circ ({\varvec{\omega }}),(1-\rho )G^\circ ({\varvec{\omega }})\right) }\right) } , \end{aligned}$$
  where we used Hiriart-Urruty and Lemaréchal (2001, Theorem V.3.3.3) in third row, positive homogeneity in the fourth, and Lemma 5 in the fifth.
(ii)
Again, convexity, positive homogeneity and finite-valuedness are immediate. Coercivity holds by injectivity of ${\varvec{D}}^\top $.
1. (a)
  This is Vaiter et al. (2015a, Proposition 10(i)–(ii)).
2. (b)
  Denote $J = \gamma _{{\mathcal C}}$. We have
  $$\begin{aligned} H^\circ ({\varvec{\omega }})&= \sup _{{\varvec{D}}^\top {\varvec{\theta }}\in {\mathcal C}} \langle {\varvec{\omega }},{\varvec{\theta }}\rangle \\ {({\varvec{D}}^\top \text { is injective})}&= \sup _{{\varvec{D}}^\top {\varvec{\theta }}\in {\mathcal C}} \langle {\varvec{D}}^+{\varvec{\omega }},{\varvec{D}}^\top {\varvec{\theta }}\rangle \\&= \sup _{{\varvec{u}}\in {\mathcal C}\cap {{\mathrm{Span}}}({\varvec{D}}^\top )} \langle {\varvec{D}}^+{\varvec{\omega }},{\varvec{u}}\rangle \\&= {\overline{\mathrm {conv}}\left( \inf {\left( J^\circ ({\varvec{D}}^+{\varvec{\omega }}),\iota _{{{\mathrm{Ker}}}({\varvec{D}})}({\varvec{D}}^+{\varvec{\omega }})\right) }\right) } \\&= J^\circ ({\varvec{D}}^+{\varvec{\omega }}). \end{aligned}$$
  where in the last equality, we used the fact that ${\varvec{D}}^+{\varvec{\omega }}\in {{\mathrm{Span}}}{\big ({\varvec{D}}^\top \big )}={{\mathrm{Ker}}}({\varvec{D}})^\perp $, and thus $\iota _{{{\mathrm{Ker}}}({\varvec{D}})}({\varvec{D}}^+{\varvec{\omega }})=+\infty $ unless ${\varvec{\omega }}=0$, and $J^\circ $ is continuous and convex by (H.3) and Lemma 5. In the fourth equality, we invoked Hiriart-Urruty and Lemaréchal (2001, Theorem V.3.3.3) and Lemma 5.$\square $

2.5 Examples

2.5.1 Lasso

The Lasso regularization is used to promote the sparsity of the minimizers, see Bühlmann and van de Geer (2011) for a comprehensive review. It corresponds to choosing J as the $\ell _1$-norm

$$\begin{aligned} J({\varvec{\theta }}) = \big \Vert {\varvec{\theta }}\big \Vert _{1} = \sum _{i=1}^p \big |{\varvec{\theta }}_i \big |. \end{aligned}$$

(7)

It is also referred to as $\ell _1$-synthesis in the signal processing community, in contrast to the more general $\ell _1$-analysis sparsity penalty detailed below.

We denote $(\varvec{a}_i)_{1 \le i \le p}$ the canonical basis of $\mathbb {R}^p$ and $\mathrm {supp}({\varvec{\theta }}) {\mathop {=}\limits ^{\text { def}}} \big \{ i \in \left\{ 1,\ldots ,p \right\} \;:\; {\varvec{\theta }}_i \ne 0 \big \} $. Then,

$$\begin{aligned} T_{\varvec{\theta }}= {{\mathrm{Span}}} \left\{ (\varvec{a}_i)_{i \in \mathrm {supp}({\varvec{\theta }})} \right\} , ~~ (e_{{\varvec{\theta }}})_i = {\left\{ \begin{array}{ll} {{\mathrm{sign}}}({\varvec{\theta }}_i) &{} \text {if~} i \in \mathrm {supp}({\varvec{\theta }}) \\ 0 &{} \text {otherwise} \end{array}\right. }, ~ \text {and} ~ J^\circ =\big \Vert \cdot \big \Vert _{\infty } . \end{aligned}$$

(8)

2.5.2 Group Lasso

The group Lasso has been advocated to promote sparsity by groups, i.e., it drives all the coefficients in one group to zero together hence leading to group selection, see Bakin (1999), Yuan and Lin (2006), Bach (2008) and Wei and Huang (2010) to cite a few. The group Lasso penalty with L groups reads

$$\begin{aligned} J({\varvec{\theta }}) = \big \Vert {\varvec{\theta }}\big \Vert _{1,2} {\mathop {=}\limits ^{\text { def}}}\sum _{i=1}^L \big \Vert {\varvec{\theta }}_{b_i}\big \Vert _{2} , \end{aligned}$$

(9)

where $\bigcup _{i=1}^L b_i = \left\{ 1,\ldots ,p \right\} , b_i, b_j \subset \left\{ 1,\ldots ,p \right\} ,$ and $b_i \cap b_j = \emptyset $ whenever $i \ne j$. Define the group support as $\mathrm {supp}_{{\mathcal B}}({\varvec{\theta }}) {\mathop {=}\limits ^{\text { def}}} \big \{ i \in \left\{ 1,\ldots ,L \right\} \;:\; {\varvec{\theta }}_{b_i} \ne 0 \big \} $. Thus, one has

$$\begin{aligned} \begin{aligned}&T_{\varvec{\theta }}= {{\mathrm{Span}}} \left\{ (a_j)_{ \big \{ j \;:\; \exists i \in \mathrm {supp}_{{\mathcal B}}({\varvec{\theta }}), j \in b_i \big \} } \right\} , \\&(e_{{\varvec{\theta }}})_{b_i} = {\left\{ \begin{array}{ll} \tfrac{{\varvec{\theta }}_{b_i}}{\left\| {\varvec{\theta }}_{b_i}\right\| _{2}} &{} \text {if~} i \in \mathrm {supp}_{{\mathcal B}}({\varvec{\theta }}) \\ 0 &{} \text {otherwise} \end{array}\right. }, ~ \text {and} ~ J^\circ ({\varvec{\omega }}) = \max _{i \in \left\{ 1,\ldots ,L \right\} } \left\| {\varvec{\omega }}_{b_i}\right\| _{2} . \end{aligned} \end{aligned}$$

(10)

2.5.3 Analysis (group) Lasso

One can push the structured sparsity idea one step further by promoting group/block sparsity through a linear operator, i.e., analysis-type sparsity. Given a linear operator ${\varvec{D}}: \mathbb {R}^q \rightarrow \mathbb {R}^p$ (seen as a matrix), the analysis group sparsity penalty is

$$\begin{aligned} J({\varvec{\theta }}) = \big \Vert {\varvec{D}}^\top {\varvec{\theta }}\big \Vert _{1,2} . \end{aligned}$$

(11)

This encompasses the 2-D isotropic total variation (Rudin et al. 1992). For when all groups of cardinality one, we have the analysis-$\ell _1$ penalty (a.k.a. general Lasso), which encapsulates several important penalties including that of the 1-D total variation (Rudin et al. 1992), and the fused Lasso (Tibshirani et al. 2005). The overlapping group Lasso (Jacob et al. 2009) is also a special case of (9) by taking ${\varvec{D}}^\top $ to be an operator that extracts the blocks (Peyré et al. 2011; Chen et al. 2010) (in which case ${\varvec{D}}$ has even orthogonal rows).

Let $\varLambda _{{\varvec{\theta }}}=\bigcup _{i \in \mathrm {supp}_{{\mathcal B}}({\varvec{D}}^\top {\varvec{\theta }})} b_i$ and $\varLambda _{{\varvec{\theta }}}^c$ its complement. From Lemma 1(ii) and (10), we get

$$\begin{aligned} \begin{aligned}&T_{\varvec{\theta }}= {{\mathrm{Ker}}}({\varvec{D}}_{\varLambda _{{\varvec{\theta }}}^c}^\top ), \quad e_{{\varvec{\theta }}} = {{\mathrm{P}}}_{T_{{\varvec{\theta }}}}{\varvec{D}}e_{{\varvec{D}}^\top {\varvec{\theta }}}^{\left\| \right\| _{1,2}} \\&\quad \text {where} \quad {\left( e_{{\varvec{D}}^\top {\varvec{\theta }}}^{\left\| \right\| _{1,2}}\right) }_{b_i} = {\left\{ \begin{array}{ll} \tfrac{{\left( {\varvec{D}}^\top {\varvec{\theta }}\right) }_{b_i}}{\left\| {\left( {\varvec{D}}^\top {\varvec{\theta }}\right) }_{b_i}\right\| _{2}} &{} \text {if~} i \in \mathrm {supp}_{{\mathcal B}}({\varvec{D}}^\top {\varvec{\theta }}) \\ 0 &{} \text {otherwise}. \end{array}\right. } \end{aligned} \end{aligned}$$

(12)

If, in addition, ${\varvec{D}}$ is surjective, then by virtue of Lemma 1(ii) we also have

$$\begin{aligned} J^\circ ({\varvec{\omega }}) = \left\| {\varvec{D}}^+{\varvec{\omega }}\right\| _{\infty ,2} {\mathop {=}\limits ^{\text { def}}}\max _{i \in \left\{ 1,\ldots ,L \right\} } \left\| ({\varvec{D}}^+{\varvec{\omega }})_{b_i}\right\| _{2} . \end{aligned}$$

(13)

2.5.4 Anti-sparsity

If the vector to be estimated is expected to be flat (anti-sparse), this can be captured using the $\ell _\infty $ norm (a.k.a. Tchebychev norm) as prior

$$\begin{aligned} J({\varvec{\theta }}) = \big \Vert {\varvec{\theta }}\big \Vert _{\infty } = \max _{i \in \left\{ 1,\ldots ,p \right\} } \big |{\varvec{\theta }}_i \big |. \end{aligned}$$

(14)

The $\ell _\infty $ regularization has found applications in several fields (Jégou et al. 2012; Lyubarskii and Vershynin 2010; Studer et al. 2012). Suppose that ${\varvec{\theta }}\ne 0$, and define the saturation support of ${\varvec{\theta }}$ as $I^{\mathrm {sat}}_{{\varvec{\theta }}} {\mathop {=}\limits ^{\text { def}}} \big \{ i \in \left\{ 1,\ldots ,p \right\} \;:\; \big |{\varvec{\theta }}_i \big |=\left\| {\varvec{\theta }}\right\| _{\infty } \big \} \ne \emptyset $. From Vaiter et al. (2015a, Proposition 14), we have

$$\begin{aligned} \begin{aligned} T_{\varvec{\theta }}&= \big \{ \overline{{\varvec{\theta }}}\in \mathbb {R}^p \;:\; \overline{{\varvec{\theta }}}_{I^{\mathrm {sat}}_{{\varvec{\theta }}}} \in \mathbb {R}{{\mathrm{sign}}}({\varvec{\theta }}_{I^{\mathrm {sat}}_{{\varvec{\theta }}}}) \big \} , \\ (e_{{\varvec{\theta }}})_i&= {\left\{ \begin{array}{ll} {{\mathrm{sign}}}({\varvec{\theta }}_i)/|I^{\mathrm {sat}}_{{\varvec{\theta }}}| &{} \text {if~} i \in I^{\mathrm {sat}}_{{\varvec{\theta }}} \\ 0 &{} \text {otherwise} \end{array}\right. }, \quad \text {and} \quad J^\circ = \left\| \cdot \right\| _{1}. \end{aligned} \end{aligned}$$

(15)

2.5.5 Nuclear norm

The natural extension of low-complexity priors to matrices ${\varvec{\theta }}\in \mathbb {R}^{p_1 \times p_2}$ is to penalize the singular values of the matrix. Let ${{\mathrm{rank}}}({\varvec{\theta }})=r$, and ${\varvec{\theta }}= \varvec{U} {{\mathrm{diag}}}(\uplambda ({\varvec{\theta }}))\varvec{V}^\top $ be a reduced rank-r SVD decomposition, where $\varvec{U} \in \mathbb {R}^{p_1 \times r}$ and $\varvec{V} \in \mathbb {R}^{p_2 \times r}$ have orthonormal columns, and $\uplambda ({\varvec{\theta }}) \in (\mathbb {R}_+ \setminus \left\{ 0 \right\} )^{r}$ is the vector of singular values $(\uplambda _1({\varvec{\theta }}),\ldots ,\uplambda _r({\varvec{\theta }}))$ in non-increasing order. The nuclear norm of ${\varvec{\theta }}$ is

$$\begin{aligned} J({\varvec{\theta }}) = \big \Vert {\varvec{\theta }}\big \Vert _{*} = \big \Vert \uplambda ({\varvec{\theta }})\big \Vert _{1} . \end{aligned}$$

(16)

This penalty is the best convex surrogate to enforce a low-rank prior. It has been widely used for various applications (Recht et al. 2010; Candès and Recht 2009; Candès et al. 2011, 2013; Fazel et al. 2001).

Following, e.g., Vaiter et al. (2017, Example 21), we have

(17)

3 Oracle inequalities for a general loss

Before delving into the details, in the sequel, we will need a bit of notations.

We recall $T_{{\varvec{\theta }}}$ and $e_{{\varvec{\theta }}}$ the model subspace and vector associated to ${\varvec{\theta }}$ (see Definition 1). Denote $S_{{\varvec{\theta }}}=T_{{\varvec{\theta }}}^\perp $. Given two coercive finite-valued gauges $J_1 = \gamma _{{\mathcal C}_1}$ and $J_2 = \gamma _{{\mathcal C}_2}$, and a linear operator $\varvec{A}$, we define the operator bound as

Note that is bounded (this follows from Lemma 4(v)). Furthermore, we have from Lemma 5 that

In the following, whenever it is clear from the context, to lighten notation when $J_i$ is a norm, we write the subscript of the norm instead of $J_i$ (e.g., p for the $\ell _p$ norm, $*$ for the nuclear norm).

Our main result will involve a measure of well-conditionedness of the design matrix $\varvec{X}$ when restricted to some subspace $T$. More precisely, for $c > 0$, we introduce the coefficient

(18)

This generalizes the compatibility factor introduced in van de Geer and Buhlmann (2009) for the Lasso (and used in Dalalyan et al. 2018). The experienced reader may have recognized that this factor is reminescent of the null space property and restricted injectivity that play a central role in the analysis of the performance guarantees of penalized estimators (1); see Fadili et al. (2013) and Vaiter et al. (2015a, b, 2017, 2018). One can see in particular that $\varUpsilon {\left( T,c\right) }$ is larger than the smallest singular value of $\varvec{X}_T$.

The oracle inequalities will be provided in terms of the loss

$$\begin{aligned} R_n{\big ({\varvec{\theta }},{\varvec{\theta }}_0\big )} = \tfrac{1}{n}D_{F}^{}{\left( \varvec{X}{\varvec{\theta }},\varvec{X}{\varvec{\theta }}_0\right) } . \end{aligned}$$

3.1 Oracle inequality for ${\widehat{{\varvec{\theta }}}_n^{\mathrm {EWA}}}$

We are now ready to establish our first main result: an oracle inequality for the EWA estimator (3).

Theorem 2

Consider the EWA estimator ${\widehat{{\varvec{\theta }}}_n^{\mathrm {EWA}}}$ in (3) with the density (2), where F and J satisfy Assumptions (H.1)–(H.2) and (H.3). Then, for any $\tau > 1$ such that $\lambda _n \ge \tau J^\circ {\left( -\varvec{X}^\top \nabla F(\varvec{X}{\varvec{\theta }}_0,\varvec{y})\right) }/n$, the following holds,

(19)

Remark 2

1.
It should be emphasized that Theorem 2 is actually a deterministic statement for a fixed choice of $\lambda _n$. Probabilistic analysis will be required when the result is applied to particular statistical models as we will see later. For this, we will use concentration inequalities in order to provide bounds that hold with high probability over the data.
2.
The oracle inequality is sharp. The remainder in it has two terms. The first one encodes the complexity of the model promoted by J. The second one, $p\beta $, captures the influence of the temperature parameter. In particular, taking $\beta $ sufficiently small of the order $O{\left( (pn)^{-1}\right) }$, this term becomes $O(n^{-1})$.
3.
When $\varphi (t)=\nu t^2/2$, i.e., $F(\cdot ,\varvec{y})$ is $\nu $-strongly convex, then $\varphi ^+(t)=t^2/(2\nu )$, and the reminder term becomes
(20)
If, moreover, $\nabla F$ is also $\kappa $-Lipschitz continuous, then it can be shown that $R_n{\big ({\varvec{\theta }},{\varvec{\theta }}_0\big )}$ is equivalent to a quadratic loss. This means that the oracle inequality in Theorem 2 can be stated in terms of the quadratic prediction error. However, the inequality is not anymore sharp in this case as a constant factor equal to the condition number $\kappa /\nu \ge 1$ naturally multiplies the right-hand side.
4.
If J is such that $e_{{\varvec{\theta }}} \in \partial J({\varvec{\theta }}) \subset {\mathcal C}^\circ $ (typically for a strong gauge by (6)), then $J^\circ (e_{{\varvec{\theta }}}) \le 1$ (in fact an equality if ${\varvec{\theta }}\ne 0$). Thus the term $J^\circ (e_{{\varvec{\theta }}})$ can be omitted in (19).
5.
A close inspection of the proof of Theorem 2 reveals that the term $p\beta $ can be improved to the smaller bound
$$\begin{aligned} p\beta + {\left( V_n\left( {\widehat{{\varvec{\theta }}}_n^{\mathrm {EWA}}}\right) -\mathbb {E}_{\mu _n}\left[ V_n({\varvec{\theta }})\right] \right) } , \end{aligned}$$
where the upper bound is a consequence of Jensen inequality.

Proof

By convexity of J and assumption (H.1), we have for any ${\varvec{\eta }}\in \partial V_n({\varvec{\theta }})$ and any $\overline{{\varvec{\theta }}}\in \mathbb {R}^p$,

$$\begin{aligned} D_{V_n}^{{\varvec{\eta }}}{\left( \overline{{\varvec{\theta }}},{\varvec{\theta }}\right) } \ge \frac{1}{n}\varphi {\big (\big \Vert \varvec{X}\overline{{\varvec{\theta }}}-\varvec{X}{\varvec{\theta }}\big \Vert _{2}\big )} . \end{aligned}$$

Since $\varphi $ is non-decreasing and convex, $\varphi \circ \left\| \cdot \right\| _{2}$ is a convex function. Thus, taking the expectation w.r.t. to $\mu _n$ on both sides and using Jensen inequality, we get

$$\begin{aligned} V_n(\overline{{\varvec{\theta }}})&\ge \mathbb {E}_{\mu _n}\left[ V_n({\varvec{\theta }})\right] + \mathbb {E}_{\mu _n}\left[ \langle {\varvec{\eta }},\overline{{\varvec{\theta }}}-{\varvec{\theta }}\rangle \right] + \frac{1}{n}\mathbb {E}_{\mu _n}\left[ \varphi {\big (\big \Vert \varvec{X}\overline{{\varvec{\theta }}}-\varvec{X}{\varvec{\theta }}\big \Vert _{2}\big )}\right] \\&\ge V_n\left( {\widehat{{\varvec{\theta }}}_n^{\mathrm {EWA}}}\right) + \mathbb {E}_{\mu _n}\left[ \langle {\varvec{\eta }},\overline{{\varvec{\theta }}}-{\varvec{\theta }}\rangle \right] + \frac{1}{n}\varphi {\big (\big \Vert \varvec{X}\overline{{\varvec{\theta }}}-\varvec{X}{\widehat{{\varvec{\theta }}}_n^{\mathrm {EWA}}}\big \Vert _{2}\big )} . \end{aligned}$$

This holds for any ${\varvec{\eta }}\in \partial V_n({\varvec{\theta }})$, and in particular at the minimal selection ${\big (\partial V_n({\varvec{\theta }})\big )}^0$ (see Sect. B for details). It then follows from the pillar result in Proposition 5 that

$$\begin{aligned} \mathbb {E}_{\mu _n}\left[ \langle {\big (\partial V_n({\varvec{\theta }})\big )}^0,\overline{{\varvec{\theta }}}-{\varvec{\theta }}\rangle \right] = -p\beta . \end{aligned}$$

We thus deduce the inequality

$$\begin{aligned} V_n\left( {\widehat{{\varvec{\theta }}}_n^{\mathrm {EWA}}}\right) - V_n(\overline{{\varvec{\theta }}}) \le p\beta - \frac{1}{n}\varphi {\big (\big \Vert \varvec{X}{\widehat{{\varvec{\theta }}}_n^{\mathrm {EWA}}}-\varvec{X}\overline{{\varvec{\theta }}}\big \Vert _{2}\big )} , \quad \forall \overline{{\varvec{\theta }}}\in \mathbb {R}^p . \end{aligned}$$

(21)

By definition of the Bregman divergence, we have

$$\begin{aligned}&R_n{\big ({\widehat{{\varvec{\theta }}}_n^{\mathrm {EWA}}},{\varvec{\theta }}_0\big )} - R_n{\big ({\varvec{\theta }},{\varvec{\theta }}_0\big )} \\&\quad =\frac{1}{n}{\Big (F(\varvec{X}{\widehat{{\varvec{\theta }}}_n^{\mathrm {EWA}}},\varvec{y}) - F(\varvec{X}{\varvec{\theta }},\varvec{y}) + \langle -\varvec{X}^\top \nabla F(\varvec{X}{\varvec{\theta }}_0,\varvec{y})),{\widehat{{\varvec{\theta }}}_n^{\mathrm {EWA}}}-{\varvec{\theta }}\rangle \Big )} \\&\quad ={\left( V_n\left( {\widehat{{\varvec{\theta }}}_n^{\mathrm {EWA}}}\right) - V_n({\varvec{\theta }})\right) } + \frac{1}{n}\langle -\varvec{X}^\top \nabla F(\varvec{X}{\varvec{\theta }}_0,\varvec{y}),{\widehat{{\varvec{\theta }}}_n^{\mathrm {EWA}}}-{\varvec{\theta }}\rangle \\&\qquad - \lambda _n{\big (J\left( {\widehat{{\varvec{\theta }}}_n^{\mathrm {EWA}}}\right) -J({\varvec{\theta }})\big )} . \end{aligned}$$

By virtue of the duality inequality (42), we have

$$\begin{aligned}&R_n{\big ({\widehat{{\varvec{\theta }}}_n^{\mathrm {EWA}}},{\varvec{\theta }}_0\big )} - R_n{\big ({\varvec{\theta }},{\varvec{\theta }}_0\big )} \\&\quad \le {\left( V_n\left( {\widehat{{\varvec{\theta }}}_n^{\mathrm {EWA}}}\right) - V_n({\varvec{\theta }})\right) } + \frac{1}{n}J^\circ {\left( -\varvec{X}^\top \nabla F(\varvec{X}{\varvec{\theta }}_0,\varvec{y})\right) }J\left( {\widehat{{\varvec{\theta }}}_n^{\mathrm {EWA}}}-{\varvec{\theta }}\right) \\&\qquad - \lambda _n{\big (J\left( {\widehat{{\varvec{\theta }}}_n^{\mathrm {EWA}}}\right) -J({\varvec{\theta }})\big )} \\&\quad \le {\left( V_n\left( {\widehat{{\varvec{\theta }}}_n^{\mathrm {EWA}}}\right) - V_n({\varvec{\theta }})\right) } + \frac{\lambda _n}{\tau }{\left( J\left( {\widehat{{\varvec{\theta }}}_n^{\mathrm {EWA}}}-{\varvec{\theta }}\right) - \tau {\big (J\left( {\widehat{{\varvec{\theta }}}_n^{\mathrm {EWA}}}\right) -J({\varvec{\theta }})\big )}\right) } . \end{aligned}$$

Denote ${\varvec{\omega }}={\widehat{{\varvec{\theta }}}_n^{\mathrm {EWA}}}-{\varvec{\theta }}$. By virtue of (H.3), Theorem 1 and (42), we obtain

$$\begin{aligned} J({\varvec{\omega }}) - \tau {\big (J\left( {\widehat{{\varvec{\theta }}}_n^{\mathrm {EWA}}}\right) -J({\varvec{\theta }})\big )}&\le J({\varvec{\omega }}_{T_{{\varvec{\theta }}}}) + J({\varvec{\omega }}_{S_{{\varvec{\theta }}}}) - \tau \langle e_{{\varvec{\theta }}},{\varvec{\omega }}_{T_{{\varvec{\theta }}}}\rangle - \tau J({\varvec{\omega }}_{S_{{\varvec{\theta }}}}) \\&\le J({\varvec{\omega }}_{T_{{\varvec{\theta }}}}) + J({\varvec{\omega }}_{S_{{\varvec{\theta }}}}) + \tau J^\circ (e_{{\varvec{\theta }}})J({\varvec{\omega }}_{T_{{\varvec{\theta }}}}) - \tau J({\varvec{\omega }}_{S_{{\varvec{\theta }}}}) \\&= {\big (\tau J^\circ (e_{{\varvec{\theta }}})+1\big )}J({\varvec{\omega }}_{T_{{\varvec{\theta }}}}) - (\tau -1)J({\varvec{\omega }}_{S_{{\varvec{\theta }}}}) \\&\le {\big (\tau J^\circ (e_{{\varvec{\theta }}})+1\big )}{\left( J({\varvec{\omega }}_{T_{{\varvec{\theta }}}}) - \tfrac{\tau -1}{\tau J^\circ (e_{{\varvec{\theta }}})+1}J({\varvec{\omega }}_{S_{{\varvec{\theta }}}})\right) } . \end{aligned}$$

This inequality together with (21) (applied with $\overline{{\varvec{\theta }}}={\varvec{\theta }}$) and (18) yields

where we applied Fenchel–Young inequality (5) to get the last bound. Taking the infimum over ${\varvec{\theta }}\in \mathbb {R}^p$ yields the desired result. $\square $

In “Appendix”, we provide a novel proof of Proposition 5 based on a Moreau–Yosida regularization argument. In Dalalyan et al. (2018, Corollary 1 and 2), an alternative proof is given using an absolute continuity argument since $\mu _n$ is locally Lipschitz, hence a Sobolev function.

Stratifiable functions Theorem 2 has a nice instanciation when $\mathbb {R}^p$ can be partitioned into a collection of subsets $ \left\{ {\mathcal M}_i \right\} _i$ that form a stratification of $\mathbb {R}^p$. That is, $\mathbb {R}^p$ is a finite disjoint union $\cup _i {\mathcal M}_i$ such that the partitioning sets ${\mathcal M}_i$ (called strata) must fit nicely together and the stratification is endowed with a partial ordering for the closure operation. For example, it is known that a polyhedral function has a polyhedral stratification, and more generally, semialgebraic functions induce stratifications into finite disjoint unions of manifolds, see e.g., Coste (2000). Another example is that of partly smooth convex functions thoroughly studied in Vaiter et al. (2015a, b, 2017, 2018) for various statistical and inverse problems. These functions induce a stratification into strata that are $C^2$-smooth submanifolds of $\mathbb {R}^p$. In turns out that all popular penalty functions discussed in this paper are partly smooth (see Vaiter et al. 2015b, 2017). Let’s denote $\mathscr {M}$ the set of strata associated to J. With this notation at hand, the oracle inequality (19) now reads

(22)

3.2 Oracle inequality for ${\widehat{{\varvec{\theta }}}^{\mathrm {PEN}}_n}$

The next result establishes that ${\widehat{{\varvec{\theta }}}^{\mathrm {PEN}}_n}$ satisfies a sharp prediction oracle inequality that we will compare to (19).

Theorem 3

Consider the penalized estimator ${\widehat{{\varvec{\theta }}}^{\mathrm {PEN}}_n}$ in (1), where F and J satisfy Assumptions (H.1) and (H.3). Then, for any $\tau > 1$ such that $\lambda _n \ge \tau J^\circ {\left( -\varvec{X}^\top \nabla F(\varvec{X}{\varvec{\theta }}_0,\varvec{y})\right) }/n$, the following holds,

(23)

Proof

The proof follows the same lines as that of Theorem 2 except that we use the fact that ${\widehat{{\varvec{\theta }}}^{\mathrm {PEN}}_n}$ is a global minimizer of $V_n$, i.e., $0 \in \partial V_n\left( {\widehat{{\varvec{\theta }}}^{\mathrm {PEN}}_n}\right) $. Indeed, we have for any ${\varvec{\theta }}\in \mathbb {R}^p$

$$\begin{aligned} V_n({\varvec{\theta }})&\ge V_n\left( {\widehat{{\varvec{\theta }}}^{\mathrm {PEN}}_n}\right) + \frac{1}{n}\varphi {\big (\big \Vert \varvec{X}{\varvec{\theta }}-\varvec{X}{\widehat{{\varvec{\theta }}}^{\mathrm {PEN}}_n}\big \Vert _{2}\big )} . \end{aligned}$$

(24)

Continuing exactly as just after (21), replacing ${\widehat{{\varvec{\theta }}}_n^{\mathrm {EWA}}}$ with ${\widehat{{\varvec{\theta }}}^{\mathrm {PEN}}_n}$ and invoking (24) instead of (21), we arrive at the claimed result. $\square $

Remark 3

1.
Observe that the penalized estimator ${\widehat{{\varvec{\theta }}}^{\mathrm {PEN}}_n}$ does not require the moment assumption (H.2) for (23) to hold. The convexity assumption on $\varphi $ in (H.1), which was important to apply Jensen’s inequality in the proof of (19), is not needed either to get (23).
2.
As we remarked for Theorem 2, Theorem 3 is also a deterministic statement for a fixed choice of $\lambda _n$ that holds for any minimizer ${\widehat{{\varvec{\theta }}}^{\mathrm {PEN}}_n}$, which is not unique in general. The condition on $\lambda _n$ is similar to the one in Negahban et al. (2012) where authors established different guarantees for ${\widehat{{\varvec{\theta }}}^{\mathrm {PEN}}_n}$.

3.3 Discussion of ${\widehat{{\varvec{\theta }}}_n^{\mathrm {EWA}}}$ vs ${\widehat{{\varvec{\theta }}}^{\mathrm {PEN}}_n}$

One clearly sees that the difference between the prediction performance of ${\widehat{{\varvec{\theta }}}_n^{\mathrm {EWA}}}$ and ${\widehat{{\varvec{\theta }}}^{\mathrm {PEN}}_n}$ lies in the term $p\beta $ (or rather its lower bound in Remark 2-2). In particular, for $\beta =O{\left( (pn)^{-1}\right) }$, this term is on the order $O(n^{-1})$. This choice can be refined in most situations. In particular, for the case of quadratic loss, one can take , hence leading to remainder terms in (19) and (23) of the same order.

In view of this discussion, one may wonder what are the actual benefits of using ${\widehat{{\varvec{\theta }}}_n^{\mathrm {EWA}}}$ instead of ${\widehat{{\varvec{\theta }}}^{\mathrm {PEN}}_n}$. Generalizing the arguments of Dalalyan et al. (2018), we will show that ${\widehat{{\varvec{\theta }}}_n^{\mathrm {EWA}}}$ enjoys one main advantage compared to ${\widehat{{\varvec{\theta }}}^{\mathrm {PEN}}_n}$. To simplify the discussion, we will focus on the case of linear regression (30) with Gaussian noise $\varvec{\xi }\sim {\mathcal N}(0,\sigma ^2\mathbf{Id}_n)$ and F is the quadratic loss.

The chief advantage of ${\widehat{{\varvec{\theta }}}_n^{\mathrm {EWA}}}$ is its stability as a function of the data and hyperparameters. It has been shown that for a large class of penalties J, including those studied here, the prediction $\varvec{X}{\widehat{{\varvec{\theta }}}^{\mathrm {PEN}}_n}$ is a smooth function of $\varvec{y}$ outside a set of Lebesgue measure zero; see Vaiter et al. (2017, Theorem 2). Those authors also provided in Vaiter et al. (2017, Theorem 3) an expression of the Stein unbiased risk estimator (SURE). For instance, when J is the gauge of a polytope, the SURE is given by

$$\begin{aligned} \big \Vert \varvec{y}-\varvec{X}{\widehat{{\varvec{\theta }}}^{\mathrm {PEN}}_n}\big \Vert _{2}^2 + \sigma ^2\dim {\big (T_{{\widehat{{\varvec{\theta }}}^{\mathrm {PEN}}_n}}\big )} - n\sigma ^2 . \end{aligned}$$

The SURE was advocated as an automatic and objective way to choose $\lambda $. However, one can see that $\dim {\big (T_{{\widehat{{\varvec{\theta }}}^{\mathrm {PEN}}_n}}\big )}$ is a non-smooth function of $\lambda $, which may lead to numerical instabilities in practice. In contrast, the SURE of ${\widehat{{\varvec{\theta }}}_n^{\mathrm {EWA}}}$, whose closed-form is given in Dalalyan et al. (2018, (10)), is such that $\varvec{X}{\widehat{{\varvec{\theta }}}_n^{\mathrm {EWA}}}$ is a continuous function of $(\lambda ,\beta ) \in ]0,+\infty [^2$ and $\varvec{y}\in \mathbb {R}^n$. This better regularity suggests that it would be wiser to use the SURE associated to ${\widehat{{\varvec{\theta }}}_n^{\mathrm {EWA}}}$ for an automatic choice of $\lambda $.

3.4 Oracle inequalities in probability

It remains to check when the event $\mathscr {E}= \left\{ \lambda _n \ge \tau J^\circ {\left( -\varvec{X}^\top \nabla F(\varvec{X}{\varvec{\theta }}_0,\varvec{y})\right) }/n \right\} $ holds with high probability when $\varvec{y}$ is random. We will use concentration inequalities in order to provide bounds that hold with high probability over the data. Toward this goal, we will need the following assumption.

(H.4) $\varvec{y}=(\varvec{y}_1,\varvec{y}_2,\ldots ,\varvec{y}_n)$ are independent random observations, and $F({\varvec{u}},\varvec{y})=\sum _{i=1}^n f_i({\varvec{u}}_i,\varvec{y}_i)$, $f_i: \mathbb {R}\times \mathbb {R}\rightarrow \mathbb {R}$. Moreover,
1. (i)
  $\mathbb {E}_{}\left[ \big |f_i((\varvec{X}{\varvec{\theta }}_0)_i,\varvec{y}_i) \big |\right] < +\infty $, $\forall 1 \le i \le n$ ;
2. (ii)
  $ \big |f_i'((\varvec{X}{\varvec{\theta }}_0)_i,t) \big | \le g(t)$, where $\mathbb {E}_{}\left[ g(\varvec{y}_i)\right] < +\infty $, $\forall 1 \le i \le n$;
3. (iii)
  Bernstein moment condition: $\forall 1 \le i \le n$ and all integers $m \ge 2$, $\mathbb {E}_{}\left[ \big |f_i'((\varvec{X}{\varvec{\theta }}_0)_i,\varvec{y}_i) \big |^m\right] \le m! \kappa ^{m-2} \sigma _i^2/2$ for some constants $\kappa > 0$, $\sigma _i > 0$ independent of n.
Let $\sigma ^2=\max _{1 \le i \le n} \sigma _i^2$.

Observe that under (H.4), and by virtue of Lemma 5(iv) and Hiriart-Urruty and Lemaréchal (2001, Proposition V.3.3.4), we have

$$\begin{aligned} \begin{aligned} J^\circ {\big (-\varvec{X}^\top \nabla F(\varvec{X}{\varvec{\theta }}_0,\varvec{y})\big )}&\quad = \sigma _{{\mathcal C}}{\big (-\varvec{X}^\top \nabla F(\varvec{X}{\varvec{\theta }}_0,\varvec{y})\big )} \\&\quad = \sup _{\varvec{z} \in \varvec{X}({\mathcal C})} -\sum _{i=1}^n f_i'((\varvec{X}{\varvec{\theta }}_0)_i,\varvec{y}_i) \varvec{z}_i . \end{aligned} \end{aligned}$$

(25)

As $\varvec{X}({\mathcal C})$ is compact, it has a dense countable subset. Thus, checking the event $\mathscr {E}$ amounts to establishing a deviation inequality for the supremum of an empirical process above its mean under the weak Bernstein moment condition (H.4)(iii), which essentially requires that the $f_i'((\varvec{X}{\varvec{\theta }}_0)_i,\varvec{y}_i)$ have subexponential tails. We will first tackle the case where ${\mathcal C}$ is the convex hull of a finite set (i.e., ${\mathcal C}$ is a polytope).

3.4.1 Polyhedral penalty

We here suppose that J is a finite-valued gauge of ${\mathcal C}= {\overline{\mathrm {conv}}\left( {\mathcal V}\right) }$, where ${\mathcal V}$ is finite, i.e., ${\mathcal C}$ is a polytope with vertices (Rockafellar 1996, Corollary 19.1.1). Our first oracle inequality in probability is the following.

Proposition 1

Consider the estimators ${\widehat{{\varvec{\theta }}}_n^{\mathrm {EWA}}}$ and ${\widehat{{\varvec{\theta }}}^{\mathrm {PEN}}_n}$, where F and $J {\mathop {=}\limits ^{\text { def}}}\gamma _{{\mathcal C}}$ satisfy Assumptions (H.1), (H.2), (H.3) and (H.4), and ${\mathcal C}$ is a polytope with vertices ${\mathcal V}$. Suppose that ${{\mathrm{rank}}}(\varvec{X})=n$ and let $s(\varvec{X}) = \max _{{\varvec{v}}\in {\mathcal V}}\big \Vert \varvec{X}{\varvec{v}}\big \Vert _{\infty }$. Choose

$$\begin{aligned} \lambda _n \ge \tau \sigma {{s(\varvec{X})}}\sqrt{\frac{2\delta \log (|{\mathcal V}|)}{n}}{\left( 1+\sqrt{2}\kappa /\sigma \sqrt{\frac{\delta \log (|{\mathcal V}|)}{n}}\right) }, \end{aligned}$$

for some $\tau > 1$ and $\delta > 1$. Then (19) and (23) hold with probability at least $1-2|{\mathcal V}|^{1-\delta }$.

Proof

In view of Assumptions (H.1) and (H.4) , one can differentiate under the expectation sign (Leibniz rule) to conclude that $\mathbb {E}_{}\left[ F(\varvec{X}\cdot ,\varvec{y})\right] $ is $C^1$ at ${\varvec{\theta }}_0$ and $\nabla \mathbb {E}_{}\left[ F(\varvec{X}{\varvec{\theta }}_0,\varvec{y})\right] = \varvec{X}^\top \mathbb {E}_{}\left[ \nabla F(\varvec{X}{\varvec{\theta }}_0,\varvec{y})\right] $. As ${\varvec{\theta }}_0$ minimizes the population risk, one has $\nabla \mathbb {E}_{}\left[ F(\varvec{X}{\varvec{\theta }}_0,\varvec{y})\right] = 0$. Using the rank assumption on $\varvec{X}$, we deduce that

$$\begin{aligned} \mathbb {E}_{}\left[ f_i'((\varvec{X}{\varvec{\theta }}_0)_i,\varvec{y}_i)\right] = 0, \quad \forall 1 \le i \le n . \end{aligned}$$

Moreover, (25) specializes to

$$\begin{aligned} J^\circ {\big (-\varvec{X}^\top \nabla F(\varvec{X}{\varvec{\theta }}_0,\varvec{y})\big )} = \sup _{\varvec{z} \in \varvec{X}({\mathcal V})} -\sum _{i=1}^n f_i'((\varvec{X}{\varvec{\theta }}_0)_i,\varvec{y}_i) \varvec{z}_i. \end{aligned}$$

Let $t'=\lambda _n n/\tau $ and $t=t'/s(\varvec{X})$. By the union bound and (25), we have

$$\begin{aligned} \mathbb {P}\left( J^\circ {\big (-\varvec{X}^\top \nabla F(\varvec{X}{\varvec{\theta }}_0,\varvec{y})\big )} \ge t'\right)&\le \mathbb {P}\left( \max _{\varvec{z} \in \varvec{X}({\mathcal V})}~-\sum _{i=1}^n f_i'((\varvec{X}{\varvec{\theta }}_0)_i,\varvec{y}_i) \varvec{z}_i \ge t'\right) \\&\le |{\mathcal V}| \max _{\varvec{z} \in \varvec{X}({\mathcal V})}\mathbb {P}\left( \big |\sum _{i=1}^n f_i'((\varvec{X}{\varvec{\theta }}_0)_i,\varvec{y}_i) \varvec{z}_i \big | \ge t'\right) \\&\le |{\mathcal V}| \mathbb {P}\left( s(\varvec{X}) \big |\sum _{i=1}^n f_i'((\varvec{X}{\varvec{\theta }}_0)_i,\varvec{y}_i) \big | \ge t'\right) \\&= |{\mathcal V}| \mathbb {P}\left( \big |\sum _{i=1}^n f_i'((\varvec{X}{\varvec{\theta }}_0)_i,\varvec{y}_i) \big | \ge t\right) . \end{aligned}$$

Owing to assumption (H.4)(iii), we are in position to apply the Bernstein inequality to get

$$\begin{aligned} \mathbb {P}\left( J^\circ {\big (-\varvec{X}^\top \nabla F(\varvec{X}{\varvec{\theta }}_0,\varvec{y})\big )} \ge t\right)&\le 2|{\mathcal V}| \exp {\left( -\frac{t^2}{2(\kappa t + n \sigma ^2)}\right) } . \end{aligned}$$

Every t such that

$$\begin{aligned} t \ge \sqrt{\delta \log (|{\mathcal V}|)}{\left( \kappa \sqrt{\delta \log (|{\mathcal V}|)} + \sqrt{\kappa ^2\delta \log (|{\mathcal V}|) + 2 n \sigma ^2}\right) }, \end{aligned}$$

satisfies $t^2 \ge 2\delta \log (|{\mathcal V}|)(\kappa t + n \sigma ^2)$. Applying the trivial inequality $\sqrt{a+b} \le \sqrt{a}+\sqrt{b}$ to the bound on t, we conclude. $\square $

Remark 4

In the monograph (Bühlmann and van de Geer 2011, Lemma 14.12), the authors derived an exponential deviation inequality for the supremum of an empirical process with finite ${\mathcal V}$ and possibly unbounded empirical processes under a Bernstein moment condition similar to ours (in fact ours implies theirs). The very last part of our proof can be obtained by applying their result. We detailed it here for the sake of completeness.

Lasso To lighten the notation, let $I_{{\varvec{\theta }}}=\mathrm {supp}({\varvec{\theta }})$. From (8), it is easy to see that

where last bound holds as an equality whenever ${\varvec{\theta }}\ne 0$. Further the $\ell _1$ norm is the gauge of the cross-polytope (i.e., the unit $\ell _1$ ball). Its vertex set ${\mathcal V}$ is the set of unit-norm one-sparse vectors $(\pm \varvec{a}_i)_{1 \le i \le p}$, where we recall $(\varvec{a}_i)_{1 \le i \le p}$ the canonical basis. Thus

$$\begin{aligned} |{\mathcal V}| = 2p \quad \text {and} \quad {{s(\varvec{X}) = \max _{{\varvec{v}}\in {\mathcal V}} \left\| \varvec{X}{\varvec{v}}\right\| _{\infty } = \max _{1 \le i \le p} \left\| \varvec{X}_i\right\| _{\infty }}} . \end{aligned}$$

Inserting this into Proposition 1, we obtain the following corollary.

Corollary 1

Consider the estimators ${\widehat{{\varvec{\theta }}}_n^{\mathrm {EWA}}}$ and ${\widehat{{\varvec{\theta }}}^{\mathrm {PEN}}_n}$, where J is the Lasso penalty and F satisfies Assumptions (H.1), (H.2) and (H.4). Suppose that ${{\mathrm{rank}}}(\varvec{X})=n$ and take

$$\begin{aligned} \lambda _n \ge \tau \sigma {{s(\varvec{X})}}\sqrt{\frac{2\delta \log (2p)}{n}}{\left( 1+\sqrt{2}\kappa /\sigma \sqrt{\frac{\delta \log (2p)}{n}}\right) }, \end{aligned}$$

for some $\tau > 1$ and $\delta > 1$. Then, with probability at least $1-2(2p)^{1-\delta }$, the following holds

$$\begin{aligned} R_n{\big ({\widehat{{\varvec{\theta }}}_n^{\mathrm {EWA}}},{\varvec{\theta }}_0\big )} \le&\inf _{\begin{array}{c} I \subset \left\{ 1,\ldots ,p \right\} \\ {\varvec{\theta }}:~\mathrm {supp}({\varvec{\theta }})=I \end{array}} {\left( R_n{\big ({\varvec{\theta }},{\varvec{\theta }}_0\big )} + \tfrac{1}{n}\varphi ^+{\left( \tfrac{\lambda _n\sqrt{n}{\left( \tau +1\right) }\sqrt{|I|}}{\tau \varUpsilon {\left( {{\mathrm{Span}}} \left\{ \varvec{a}_i \right\} _{i \in I},\tfrac{\tau +1}{\tau -1}\right) }}\right) }\right) } + p\beta , \end{aligned}$$

(26)

and

$$\begin{aligned} R_n{\big ({\widehat{{\varvec{\theta }}}^{\mathrm {PEN}}_n},{\varvec{\theta }}_0\big )} \le&\inf _{\begin{array}{c} I \subset \left\{ 1,\ldots ,p \right\} \\ {\varvec{\theta }}:~\mathrm {supp}({\varvec{\theta }})=I \end{array}} {\left( R_n{\big ({\varvec{\theta }},{\varvec{\theta }}_0\big )} + \tfrac{1}{n}\varphi ^+{\left( \tfrac{\lambda _n\sqrt{n}{\left( \tau +1\right) }\sqrt{|I|}}{\tau \varUpsilon {\left( {{\mathrm{Span}}} \left\{ \varvec{a}_i \right\} _{i \in I},\tfrac{\tau +1}{\tau -1}\right) }}\right) }\right) } . \end{aligned}$$

(27)

For ${\widehat{{\varvec{\theta }}}^{\mathrm {PEN}}_n}$, we recover a similar scaling for $\lambda _n$ and the oracle inequality as in van de Geer (2008), though in the latter the oracle inequality is not sharp unlike ours. Note that the above oracle inequality extends readily to the case of analysis/fused Lasso $\big \Vert \varvec{D}^\top \cdot \big \Vert _{1}$ where $\varvec{D}$ is surjective. We leave the details to the interested reader (see also the analysis group Lasso example in Sect. 4).

Anti-sparsity From Sect. 2.5.4, recall the saturation support $I^{\mathrm {sat}}_{{\varvec{\theta }}}$ of ${\varvec{\theta }}$. From (15), we get

with equality whenever ${\varvec{\theta }}\ne 0$. In addition, the $\ell _\infty $ norm is the gauge of the hypercube whose vertex set is ${\mathcal V}= \left\{ \pm 1 \right\} ^p$. Thus

$$\begin{aligned} |{\mathcal V}| = 2^p . \end{aligned}$$

We have the following oracle inequalities.

Corollary 2

Consider the estimators ${\widehat{{\varvec{\theta }}}_n^{\mathrm {EWA}}}$ and ${\widehat{{\varvec{\theta }}}^{\mathrm {PEN}}_n}$, where J is anti-sparsity penalty (14), and F satisfies Assumptions (H.1), (H.2) and (H.4) . Suppose that ${{\mathrm{rank}}}(\varvec{X})=n$ and let $s(\varvec{X}) = \max _{i,j}|\varvec{X}_{i,j}|$. Choose

$$\begin{aligned} \lambda _n \ge \tau \sigma {{s(\varvec{X})}}\sqrt{2\delta \log (2)}\sqrt{\frac{p}{n}}{\left( 1+2\kappa /\sigma \sqrt{\delta \log (2)}\sqrt{\frac{p}{n}}\right) }, \end{aligned}$$

for some $\tau > 1$ and $\delta > 1$. Then, with probability at least $1-2^{-p(\delta -1)+1}$, the following holds

$$\begin{aligned} \begin{aligned} R_n{\big ({\widehat{{\varvec{\theta }}}_n^{\mathrm {EWA}}},{\varvec{\theta }}_0\big )}&\le \inf _{\begin{array}{c} I \subset \left\{ 1,\ldots ,p \right\} \\ {\varvec{\theta }}:~I^{\mathrm {sat}}_{{\varvec{\theta }}}=I \end{array}} {\left( R_n{\big ({\varvec{\theta }},{\varvec{\theta }}_0\big )} + \tfrac{1}{n}\varphi ^+{\left( \tfrac{\lambda _n\sqrt{n}{\left( \tau +1\right) }}{\tau \varUpsilon {\left( \big \{ \overline{{\varvec{\theta }}} \;:\; \overline{{\varvec{\theta }}}_{I} \in \mathbb {R}{{\mathrm{sign}}}({\varvec{\theta }}_{I}) \big \} ,\tfrac{\tau +1}{\tau -1}\right) }}\right) }\right) } \\&\quad + p\beta , \end{aligned} \end{aligned}$$

(28)

and

$$\begin{aligned} R_n{\big ({\widehat{{\varvec{\theta }}}^{\mathrm {PEN}}_n},{\varvec{\theta }}_0\big )} \le&\inf _{\begin{array}{c} I \subset \left\{ 1,\ldots ,p \right\} \\ {\varvec{\theta }}:~I^{\mathrm {sat}}_{{\varvec{\theta }}}=I \end{array}} {\left( R_n{\big ({\varvec{\theta }},{\varvec{\theta }}_0\big )} + \tfrac{1}{n}\varphi ^+{\left( \tfrac{\lambda _n\sqrt{n}{\left( \tau +1\right) }}{\tau \varUpsilon {\left( \big \{ \overline{{\varvec{\theta }}} \;:\; \overline{{\varvec{\theta }}}_{I} \in \mathbb {R}{{\mathrm{sign}}}({\varvec{\theta }}_{I}) \big \} ,\tfrac{\tau +1}{\tau -1}\right) }}\right) }\right) } . \end{aligned}$$

(29)

We are not aware of any result of this kind in the literature. The bound imposed on $\varvec{X}$ is similar to what is generally assumed in the vector quantization literature (Lyubarskii and Vershynin 2010; Studer et al. 2012).

3.4.2 General penalty

Extending the above reasoning to a general penalty requires a deviation inequality for the supremum of an empirical process in (25) under the Bernstein moment condition (H.4)(iii), but without the need of uniform boundedness. This can be achieved via generic chaining along a tree using entropy with bracketing; see van de Geer and Lederer (2013, Theorem 8). The resulting deviation bound will thus depend on the entropies with bracketing. These quantities capture the complexity of the set $\varvec{X}({\mathcal C})$ but are intricate to compute in general. This subject deserves further investigation that we leave to a future work.

Remark 5

(Group Lasso) Using the union bound, we have

$$\begin{aligned}&\mathbb {P}\left( \max _{i \in \left\{ 1,\ldots ,L \right\} } \big \Vert \varvec{X}_{b_i}^\top \nabla F(\varvec{X}{\varvec{\theta }}_0,\varvec{y})\big \Vert _{2} \ge \lambda _n n/\tau \right) \\&\quad \le L \max _{i} \mathbb {P}\left( \big \Vert \varvec{X}_{b_i}^\top \nabla F(\varvec{X}{\varvec{\theta }}_0,\varvec{y})\big \Vert _{2} \ge \lambda _n n/\tau \right) . \end{aligned}$$

This requires a concentration inequality for quadratic forms of independent random variables satisfying the Bernstein moment assumption above. We are not aware of any such result. But if our moment condition is strengthened to

$$\begin{aligned} \mathbb {E}_{}\left[ \big |f_i'((\varvec{X}{\varvec{\theta }}_0)_i,\varvec{y}_i) \big |^{2m}\right] \le m! \kappa ^{2(m-1)} \sigma _i^2/2, \quad \forall 1 \le i \le n, \forall m \ge 1, \end{aligned}$$

then one can use (Bellec 2014, Theorem 3). Indeed, assuming $\max _{i} \left\| \varvec{X}_i\right\| _{2} \le \sqrt{n}$, which is a natural normalization on the design, we have by independence that

$$\begin{aligned} \mathbb {E}_{}\left[ \big \Vert \varvec{X}_{b_i}^\top \nabla F(\varvec{X}{\varvec{\theta }}_0,\varvec{y})\big \Vert _{2}\right]&\le \mathbb {E}_{}\left[ \big \Vert \varvec{X}_{b_i}^\top \nabla F(\varvec{X}{\varvec{\theta }}_0,\varvec{y})\big \Vert _{2}^2\right] ^{1/2} \\&{{= \sigma \sqrt{{{\mathrm{tr}}}(\varvec{X}_{b_i}^\top \varvec{X}_{b_i})/2}}} {{= \sigma \sqrt{\sum _{j \in b_i} \big \Vert \varvec{X}_j\big \Vert _{2}^2/2}}} {{\le \sigma \sqrt{K n/2}}} . \end{aligned}$$

It then follows that taking

$$\begin{aligned} \lambda _n \ge \tau \frac{\sigma \sqrt{K} + 16\kappa \sqrt{\delta \log (L)}}{\sqrt{n}} , \quad \delta > 1 , \end{aligned}$$

the oracle inequalities (34) and (35) hold for the group Lasso with probability at least $1-L^{1-\delta }$. A similar result can be proved for the analysis group Lasso just as well (see Sect. 4.3.3).

4 Oracle inequalities for low-complexity linear regression

In this section, we consider the classical linear regression problem where the n response–covariate pairs $(\varvec{y}_i,\varvec{X}_i)$ are linked as

$$\begin{aligned} \varvec{y}= \varvec{X}{\varvec{\theta }}_0 + \varvec{\xi }, \end{aligned}$$

(30)

where $\varvec{\xi }$ is a noise vector. The data loss will be set to $F({\varvec{u}},\varvec{y})=\tfrac{1}{2}\big \Vert \varvec{y}-{\varvec{u}}\big \Vert _{2}^2$. This in turn entails that $\varphi =\varphi ^+=\tfrac{1}{2}{\left( \cdot \right) }^2$ on $\mathbb {R}_+$ and $R_n{\big ({\varvec{\theta }},{\varvec{\theta }}_0\big )}=\tfrac{1}{2n}\big \Vert \varvec{X}{\varvec{\theta }}-\varvec{X}{\varvec{\theta }}_0\big \Vert _{2}^2$.

In this section, we assume that the noise $\varvec{\xi }$ is a zero-mean sub-Gaussian vector in $\mathbb {R}^n$ with parameter $\sigma $. That is, its one-dimensional marginals $\langle \varvec{\xi },\varvec{z}\rangle $ are sub-Gaussian random variables $\forall \varvec{z} \in \mathbb {R}^n$, i.e., they satisfy

$$\begin{aligned} \mathbb {P}\left( \big |\langle \varvec{\xi },\varvec{z}\rangle \big | \ge t\right) \le 2 e^{-t^2/(2\left\| \varvec{z}\right\| _{2}^2\sigma ^2)}, \quad \forall \varvec{z} \in \mathbb {R}^n . \end{aligned}$$

(31)

In this case, the bounds of Sect. 3.4 can be improved.

4.1 General penalty

As we will shortly show, the event $\mathscr {E}$ will depend on the Gaussian width, a summary geometric quantity which, informally speaking, measures the size of the bulk of a set in $\mathbb {R}^n$.

Definition 2

The Gaussian width of a subset ${\mathcal S}\subset \mathbb {R}^n$ is defined as

$$\begin{aligned} w({\mathcal S}) {\mathop {=}\limits ^{\text { def}}}\mathbb {E}_{}\left[ \sigma _{{\mathcal S}}(\varvec{g})\right] , \quad \text {where} \quad \varvec{g} \sim {\mathcal N}(0,\mathbf{Id}_n) . \end{aligned}$$

The concept of Gaussian width has appeared in the literature in different contexts. In particular, it has been used to establish sample complexity bounds to ensure exact recovery (noiseless case) and mean-square estimation stability (noisy case) for low-complexity penalized estimators from Gaussian measurements, see e.g., Rudelson and Vershynin (2008), Chandrasekaran et al. (2012), Tropp (2015a), Vershynin (2015) and Vaiter et al. (2015b).

The Gaussian width has deep connections to convex geometry and it enjoys many useful properties. It is well-known that it is positively homogeneous, monotonic w.r.t. inclusion, and invariant under orthogonal transformations. Moreover, $w({\overline{\mathrm {conv}}\left( {\mathcal S}\right) }) = w({\mathcal S})$. From Lemma 3(ii)–(iii), $w({\mathcal S})$ is a nonnegative finite quantity whenever the set ${\mathcal S}$ is bounded and contains the origin.

We are now ready to state our oracle inequality in probability with sub-Gaussian noise.

Proposition 2

Let the data generated by (30) where $\varvec{\xi }$ is a zero-mean sub-Gaussian random vector with parameter $\sigma $. Consider the estimators ${\widehat{{\varvec{\theta }}}_n^{\mathrm {EWA}}}$ and ${\widehat{{\varvec{\theta }}}^{\mathrm {PEN}}_n}$, where F and $J {\mathop {=}\limits ^{\text { def}}}\gamma _{{\mathcal C}}$ satisfy Assumptions (H.1)–(H.2) and (H.3) . Suppose that $\lambda _n \ge \frac{\tau \sigma c_1 \sqrt{2\log (c_2/\delta )}w{\left( \varvec{X}({\mathcal C})\right) }}{n}$, for some $\tau > 1$ and $0< \delta < \min (c_2,1)$, where $c_1$ and $c_2$ are positive absolute constants. Then with probability at least $1-\delta $, (19) and (23) hold with the remainder term given by (20) with $\nu =1$.

The proof requires sophisticated ideas from the theory of generic chaining (Talagrand 2005), but we only apply these results. The constants $c_1$ and $c_2$ can be traced back to the proof of these results as detailed in Talagrand (2005).

Proof

First, from (31), we have the bound

$$\begin{aligned} \mathbb {P}\left( \big |\langle \varvec{\xi },\varvec{z}-\varvec{z}'\rangle \big | \ge t\right) \le 2 e^{-t^2/(2\left\| \varvec{z} - \varvec{z}'\right\| _{2}^2\sigma ^2)}, \quad \forall \varvec{z}, \varvec{z}' \in \mathbb {R}^n , \end{aligned}$$

i.e., the increment condition (Talagrand 2005, (0.4)) is verified. Thus combining (25) with the probability bound in Talagrand (2005, p. 11), the generic chaining theorem (Talagrand 2005, Theorem 1.2.6) and the majorizing measure theorem (Talagrand 2005, Theorem 2.1.1), we have

$$\begin{aligned} \mathbb {P}\left( J^\circ (\varvec{X}^\top \varvec{\xi }) \ge \lambda _n n/\tau \right)&\le \mathbb {P}\left( \sup _{\varvec{z} \in \varvec{X}({\mathcal C})} \langle \varvec{\xi },\varvec{z}\rangle \ge \sigma c_1 \sqrt{2\log (c_2/\delta )}w{\left( \varvec{X}({\mathcal C})\right) }\right) \\&\le c_2\exp {\left( -\frac{\sigma ^22\log (c_2/\delta )}{2\sigma ^2}\right) } = \delta . \end{aligned}$$

$\square $

If the noise is Gaussian, an enhanced version can be proved by invoking Gaussian concentration of Lipschitz functions (Ledoux 2001).

Proposition 3

Let the data generated by (30) with noise $\varvec{\xi }\sim {\mathcal N}(0,\sigma ^2\mathbf{Id}_n)$. Consider the estimators ${\widehat{{\varvec{\theta }}}_n^{\mathrm {EWA}}}$ and ${\widehat{{\varvec{\theta }}}^{\mathrm {PEN}}_n}$, where F and $J {\mathop {=}\limits ^{\text { def}}}\gamma _{{\mathcal C}}$ satisfy Assumptions (H.1)–(H.2) and (H.3) . Suppose that $\lambda _n \ge \frac{(1+\delta )\tau \sigma w{\left( \varvec{X}({\mathcal C})\right) }}{n}$, for some $\tau > 1$ and $\delta > 0$. Then with probability at least , (19) and (23) hold with the remainder term given by (20) with $\nu =1$.

Proof

Thanks to sublinearity (see Lemmas 4(i) and 5), the function $\varvec{\xi }\mapsto J^\circ (\varvec{X}^\top \varvec{\xi })$ is Lipschitz continuous with Lipschitz constant . From (25), we also have

$$\begin{aligned} \mathbb {E}_{}\left[ J^\circ {\big (\varvec{X}^\top \varvec{\xi }\big )}\right] = \sigma w{\left( \varvec{X}({\mathcal C})\right) } . \end{aligned}$$

Observe that $\varvec{X}({\mathcal C})$ is a convex compact set containing the origin. Setting $\epsilon =\lambda _n n/\tau - \sigma w{\left( \varvec{X}({\mathcal C})\right) } \ge \delta \sigma w{\left( \varvec{X}({\mathcal C})\right) }$, it follows from (25) and the Gaussian concentration of Lipschitz functions (Ledoux 2001) that

$\square $

Estimating theoretically the Gaussian width of a set (not to mention its image with a linear operator as for $\varvec{X}({\mathcal C})$) is a non-trivial problem that has been extensively studied in the areas of probability in Banach spaces and stochastic processes. There are classical bounds on the Gaussian width (Sudakov’s and Dudley’s inequalities), but they are difficult to estimate in most cases and neither of these bounds is tight for all sets. When the set is a convex cone (intersected with a sphere), tractable estimates based on polarity arguments were proposed in, e.g., Chandrasekaran et al. (2012).

4.2 Polyhedral penalty

When ${\mathcal C}$ and is polytope, enhanced oracle inequalities can be obtained by invoking a simple union bound argument.

Proposition 4

Let the data generated by (30) where $\varvec{\xi }$ is a zero-mean sub-Gaussian random vector with parameter $\sigma $. Consider the estimators ${\widehat{{\varvec{\theta }}}_n^{\mathrm {EWA}}}$ and ${\widehat{{\varvec{\theta }}}^{\mathrm {PEN}}_n}$, where F and $J {\mathop {=}\limits ^{\text { def}}}\gamma _{{\mathcal C}}$ satisfy Assumptions (H.1)–(H.2) and (H.3) , and moreover ${\mathcal C}$ is a polytope with vertices ${\mathcal V}$. Suppose that $ \lambda _n \ge \frac{\tau \sigma {\big (\max _{{\varvec{v}}\in {\mathcal V}} \left\| \varvec{X}{\varvec{v}}\right\| _{2}\big )}\sqrt{2\delta \log (|{\mathcal V}|)}}{n}$, for some $\tau > 1$ and $\delta > 1$. Then with probability at least $1-2|{\mathcal V}|^{1-\delta }$, (19) and (23) hold with the remainder term given by (20) with $\nu =1$.

In particular, if $\max _{{\varvec{v}}\in {\mathcal V}} \left\| \varvec{X}{\varvec{v}}\right\| _{2} = C\sqrt{n}$, for a positive constant C, then one can take $\lambda _n \ge C\tau \sigma \sqrt{\frac{2\delta \log (|{\mathcal V}|)}{n}}$.

Proof

From (25) we have

$$\begin{aligned} J^\circ {\big (\varvec{X}^\top \varvec{\xi }\big )} = \max _{{\varvec{v}}\in {\mathcal C}}~\langle \varvec{X}{\varvec{v}},\varvec{\xi }\rangle = \max _{{\varvec{v}}\in {\mathcal V}}~\langle \varvec{X}{\varvec{v}},\varvec{\xi }\rangle , \end{aligned}$$

where in the last inequality, we used the fact that a convex function attains its maximum on ${\mathcal C}$ at an extreme point ${\mathcal V}$. Let $\epsilon =\sigma {\big (\max _{{\varvec{v}}\in {\mathcal V}} \left\| \varvec{X}{\varvec{v}}\right\| _{2}\big )}\sqrt{2\delta \log (|{\mathcal V}|)}$. By the union bound, (31) and (25), we have

$$\begin{aligned} \mathbb {P}\left( J^\circ {\big (\varvec{X}^\top \varvec{\xi }\big )} \ge \lambda _n n/\tau \right)&\le \mathbb {P}\left( \max _{{\varvec{v}}\in {\mathcal V}}~\langle \varvec{X}{\varvec{v}},\varvec{\xi }\rangle \ge \epsilon \right) \\&\le |{\mathcal V}| \max _{{\varvec{v}}\in {\mathcal V}}\mathbb {P}\left( \langle \varvec{X}{\varvec{v}},\varvec{\xi }\rangle \ge \epsilon \right) \\&\le |{\mathcal V}| \max _{{\varvec{v}}\in {\mathcal V}}\mathbb {P}\left( \big |\langle \varvec{X}{\varvec{v}},\varvec{\xi }\rangle \big | \ge \epsilon \right) \\&\le 2|{\mathcal V}| \exp {\big (- \epsilon ^2 / {\big (2\sigma ^2 \max _{{\varvec{v}}\in {\mathcal V}}\left\| \varvec{X}{\varvec{v}}\right\| _{2}^2\big )}\big )}\\&\le 2|{\mathcal V}|^{1-\delta } . \end{aligned}$$

$\square $

4.3 Applications

In this section, we exemplify our oracle inequalities for the penalties described in Sect. 2.5.

4.3.1 Lasso

Recall the derivations for the Lasso in Sect. 3.4.1. We obtain the following corollary of Proposition 4.

Corollary 3

Let the data generated by (30) where $\varvec{\xi }$ is a zero-mean sub-Gaussian random vector with parameter $\sigma $. Assume that $\varvec{X}$ is such that $\max _i\left\| \varvec{X}_i\right\| _{2}\le \sqrt{n}$. Consider the estimators ${\widehat{{\varvec{\theta }}}_n^{\mathrm {EWA}}}$ and ${\widehat{{\varvec{\theta }}}^{\mathrm {PEN}}_n}$, where J is the Lasso penalty (7) and F satisfies Assumptions (H.1)–(H.2). Suppose that $\lambda _n \ge \tau \sigma \sqrt{\frac{2\delta \log (2p)}{n}}$, for some $\tau > 1$ and $\delta > 1$. Then, with probability at least $1-2(2p)^{1-\delta }$, the following holds

$$\begin{aligned} \begin{aligned} \tfrac{1}{n}\big \Vert \varvec{X}{\widehat{{\varvec{\theta }}}_n^{\mathrm {EWA}}}-\varvec{X}{\varvec{\theta }}_0\big \Vert _{2}^2&\le \inf _{\begin{array}{c} I \subset \left\{ 1,\ldots ,p \right\} \\ {\varvec{\theta }}:~\mathrm {supp}({\varvec{\theta }})=I \end{array}} {\left( \tfrac{1}{n}\big \Vert \varvec{X}{\varvec{\theta }}-\varvec{X}{\varvec{\theta }}_0\big \Vert _{2}^2 + \tfrac{\lambda _n^2{\left( \tau +1\right) }^2|I|}{\tau ^2\varUpsilon {\left( {{\mathrm{Span}}} \left\{ \varvec{a}_i \right\} _{i \in I},\tfrac{\tau +1}{\tau -1}\right) }^2}\right) } \\&\quad + 2p\beta , \end{aligned} \end{aligned}$$

(32)

and

$$\begin{aligned} \tfrac{1}{n}\big \Vert \varvec{X}{\widehat{{\varvec{\theta }}}^{\mathrm {PEN}}_n}-\varvec{X}{\varvec{\theta }}_0\big \Vert _{2}^2 \le&\inf _{\begin{array}{c} I \subset \left\{ 1,\ldots ,p \right\} \\ {\varvec{\theta }}:~\mathrm {supp}({\varvec{\theta }})=I \end{array}} {\left( \tfrac{1}{n}\big \Vert \varvec{X}{\varvec{\theta }}-\varvec{X}{\varvec{\theta }}_0\big \Vert _{2}^2 + \tfrac{\lambda _n^2{\left( \tau +1\right) }^2|I|}{\tau ^2\varUpsilon {\left( {{\mathrm{Span}}} \left\{ \varvec{a}_i \right\} _{i \in I},\tfrac{\tau +1}{\tau -1}\right) }^2}\right) } . \end{aligned}$$

(33)

The normalization on the design is natural. The remainder term grows as $\tfrac{|I|\log (p)}{n}$. The oracle inequality (33) recovers (Dalalyan et al. 2018, Theorem 1) in the exactly sparse case, and (33) the one in Sun and Zhang (2012, Theorem 4) (see also Koltchinskii et al. 2011, Theorem 11; Dalalyan et al. 2017, Theorem 2). It is worth mentioning, however, that Dalalyan et al. (2018, Theorem 1) handle the inexactly sparse case while we do not. For the choice $\beta =O{\left( \sigma ^2|I|\log (2p)/(pn)\right) }$, the remainder terms in (32) and (33) are of the same order. Observe that this choice of the temperature parameter is optimal in view of the results of Castillo et al. (2015). These authors proved that for the $\ell _1$ penalty, orthonormal design, noise $\varvec{\xi }\sim {\mathcal N}(0,\sigma ^2\mathbf{Id}_n)$, and all choices of the form $\lambda =C\sigma \sqrt{\log (n)/n}$, then the pseudo-posterior $\mu _n$ in (2) with temperature $\beta =\sigma ^2/n$ puts asymptotically no mass on the ball centered at ${\varvec{\theta }}_0$ of radius $\sim \sqrt{\log (n)/n}$.

4.3.2 Group Lasso

Recall the notations in Sect. 2.5.2, and denote $I_{{\varvec{\theta }}}=\mathrm {supp}_{{\mathcal B}}({\varvec{\theta }})$ the set indexing active blocks in ${\varvec{\theta }}$. From (10), we have

where the last bound holds as an equality whenever ${\varvec{\theta }}\ne 0$.

We have the following oracle inequalities as corollaries of Proposition 2 and Proposition 3.

Corollary 4

Let the data generated by (30). Consider the estimators ${\widehat{{\varvec{\theta }}}_n^{\mathrm {EWA}}}$ and ${\widehat{{\varvec{\theta }}}^{\mathrm {PEN}}_n}$, where F satisfies Assumptions (H.1)–(H.2), and J is the group Lasso (9) with L non-overlapping blocks of equal size K. Let .

(i)
$\varvec{\xi }$ is a zero-mean sub-Gaussian random vector with parameter $\sigma $: suppose that $\lambda _n \ge 3\tau \sigma {{s(\varvec{X})}} c_1 \frac{\sqrt{2\log (c_2/\delta )}{\left( \sqrt{K}+\sqrt{2\log (L)}\right) }}{\sqrt{n}}$, for some $\tau > 1$ and $0< \delta < \min (c_2,1)$, where $c_1$ and $c_2$ are the positive absolute constants in Proposition 2. Then, with probability at least $1-\delta $, the following holds
$$\begin{aligned} \begin{aligned} \tfrac{1}{n}\big \Vert \varvec{X}{\widehat{{\varvec{\theta }}}_n^{\mathrm {EWA}}}-\varvec{X}{\varvec{\theta }}_0\big \Vert _{2}^2 \le&\inf _{\begin{array}{c} I \subset \left\{ 1,\ldots ,L \right\} \\ {\varvec{\theta }}:~\mathrm {supp}_{{\mathcal B}}({\varvec{\theta }})=I \end{array}} \Bigg (\tfrac{1}{n}\big \Vert \varvec{X}{\varvec{\theta }}-\varvec{X}{\varvec{\theta }}_0\big \Vert _{2}^2 \\&\quad + \tfrac{\lambda _n^2{\left( \tau +1\right) }^2|I|}{\tau ^2\varUpsilon {\left( {{\mathrm{Span}}} \left\{ a_j \right\} _{j \in b_i, i \in I},\tfrac{\tau +1}{\tau -1}\right) }^2}\Bigg ) + 2p\beta , \end{aligned} \end{aligned}$$
(34)
and
$$\begin{aligned} \begin{aligned} \tfrac{1}{n}\big \Vert \varvec{X}{\widehat{{\varvec{\theta }}}^{\mathrm {PEN}}_n}-\varvec{X}{\varvec{\theta }}_0\big \Vert _{2}^2 \le&\inf _{\begin{array}{c} I \subset \left\{ 1,\ldots ,L \right\} \\ {\varvec{\theta }}:~\mathrm {supp}_{{\mathcal B}}({\varvec{\theta }})=I \end{array}} \Bigg (\tfrac{1}{n}\big \Vert \varvec{X}{\varvec{\theta }}-\varvec{X}{\varvec{\theta }}_0\big \Vert _{2}^2 \\&\quad + \tfrac{\lambda _n^2{\left( \tau +1\right) }^2|I|}{\tau ^2\varUpsilon {\left( {{\mathrm{Span}}} \left\{ a_j \right\} _{j \in b_i, i \in I},\tfrac{\tau +1}{\tau -1}\right) }^2}\Bigg ) . \end{aligned} \end{aligned}$$
(35)
(ii)
$\varvec{\xi }\sim {\mathcal N}(0,\sigma ^2\mathbf{Id}_n)$: suppose that $\lambda _n \ge \tau \sigma {{s(\varvec{X})}}\frac{\sqrt{K}+\sqrt{2\delta \log (L)}}{\sqrt{n}}$, for some $\tau > 1$ and $\delta > 1$. Then, with probability at least $1-L^{1-\delta }$, (34) and (35) hold.

When $s(\varvec{X})=O(1)$,^{Footnote 1} the first remainder term is on the order $\frac{|I|{\left( \sqrt{K}+\sqrt{2\log (L)}\right) }^2}{n}$. This is similar to the scaling that has been provided in the literature for EWA with other group sparsity priors and noises (Rigollet and Tsybakov 2012; Duy Luu et al. 2016). Similar rates were given for ${\widehat{{\varvec{\theta }}}^{\mathrm {PEN}}_n}$ with the group Lasso in Negahban et al. (2012), Lounici et al. (2011) and van de Geer (2014).

Proof

(i)
This is a consequence of Proposition 2, for which we need to bound
$$\begin{aligned} w(\varvec{X}({\mathcal C})) = \mathbb {E}_{}\left[ \max _{i \in \left\{ 1,\ldots ,L \right\} } \big \Vert \varvec{X}_{b_i}^\top \varvec{g}\big \Vert _{2}\right] . \end{aligned}$$
We first have, for any block $b_i$
$$\begin{aligned} \mathbb {E}_{}\left[ \big \Vert \varvec{X}_{b_i}^\top \varvec{g}\big \Vert _{2}\right] \le \mathbb {E}_{}\left[ \big \Vert \varvec{X}_{b_i}^\top \varvec{g}\big \Vert _{2}^2\right] ^{1/2} \le {{s(\varvec{X})}}\sqrt{K n} . \end{aligned}$$
Furthermore, $\big \Vert \varvec{X}_{b_i}^\top \cdot \big \Vert _{2}$ is Lipschitz continuous with Lipschitz constant $s(\varvec{X})\sqrt{n}$. Thus the union bound and Gaussian concentration of Lipschitz functions (Ledoux 2001) yield, for any $t > 0$,
$$\begin{aligned}&\mathbb {P}\left( \max _{i \in \left\{ 1,\ldots ,L \right\} } \big \Vert \varvec{X}_{b_i}^\top \varvec{g}\big \Vert _{2} \ge {{s(\varvec{X})}}\sqrt{K n} + t\right) \\&\quad \le \sum _{i=1}^L \mathbb {P}\left( \big \Vert \varvec{X}_{b_i}^\top \varvec{\xi }\big \Vert _{2} - \mathbb {E}_{}\left[ \big \Vert \varvec{X}_{b_i}^\top \varvec{\xi }\big \Vert _{2}\right] \ge t\right) \le L \exp {\left( -\frac{t^2}{2{{s(\varvec{X})^2}}n}\right) } . \end{aligned}$$
Let $\kappa = {{s(\varvec{X})}}{\left( \sqrt{Kn}+\sqrt{2n\log (L)}\right) }$. $w(\varvec{X}({\mathcal C}))$ can be expressed as
$$\begin{aligned}&w(\varvec{X}({\mathcal C})) = \int _{0}^{\infty } \mathbb {P}\left( \max _{i \in \left\{ 1,\ldots ,L \right\} } \big \Vert \varvec{X}_{b_i}^\top \varvec{g}\big \Vert _{2} \ge u\right) \mathrm{d}u \\&\quad \le \int _{0}^{\kappa } \mathrm{d}u + \int _{\kappa }^\infty e^{-\frac{(u-{{s(\varvec{X})}}\sqrt{Kn})^2-2{{s(\varvec{X})}}^2n\log (L)}{2n}} \mathrm{d}u \\&\quad = \kappa + {{s(\varvec{X})\sqrt{n}}}\int _{\kappa /{{(s(\varvec{X})\sqrt{n})}}}^\infty e^{-\frac{(s-\sqrt{K})^2-2\log (L)}{2}} \mathrm{d}u \\&\quad \le \kappa + {{s(\varvec{X})\sqrt{n}}}\int _{\kappa /{{(s(\varvec{X})\sqrt{n})}}}^\infty e^{-\frac{s-\kappa /{{(s(\varvec{X})\sqrt{n})}}}{2}} \mathrm{d}u = \kappa + 2{{s(\varvec{X})\sqrt{n}}} \le 3\kappa . \end{aligned}$$
(ii)
The proof follows the lines of Proposition 3 where we additionally use the union bound. Indeed,
$$\begin{aligned}&\mathbb {P}\left( \max _{i \in \left\{ 1,\ldots ,L \right\} } \big \Vert \varvec{X}_{b_i}^\top \varvec{\xi }\big \Vert _{2} \ge \lambda _n n/\tau \right) \\&\quad \le \sum _{i=1}^L \mathbb {P}\left( \big \Vert \varvec{X}_{b_i}^\top \varvec{\xi }\big \Vert _{2} - \mathbb {E}_{}\left[ \big \Vert \varvec{X}_{b_i}^\top \varvec{\xi }\big \Vert _{2}\right] \ge \lambda _n n/\tau - \mathbb {E}_{}\left[ \big \Vert \varvec{X}_{b_i}^\top \varvec{\xi }\big \Vert _{2}\right] \right) \\&\quad \le \sum _{i=1}^L \mathbb {P}\left( \big \Vert \varvec{X}_{b_i}^\top \varvec{\xi }\big \Vert _{2} - \mathbb {E}_{}\left[ \big \Vert \varvec{X}_{b_i}^\top \varvec{\xi }\big \Vert _{2}\right] \ge \lambda _n n/\tau - \sigma {{s(\varvec{X})}}\sqrt{Kn}\right) \\&\quad \le \sum _{i=1}^L \mathbb {P}\left( \big \Vert \varvec{X}_{b_i}^\top \varvec{\xi }\big \Vert _{2} - \mathbb {E}_{}\left[ \big \Vert \varvec{X}_{b_i}^\top \varvec{\xi }\big \Vert _{2}\right] \ge \sigma {{s(\varvec{X})}}\sqrt{2\delta n\log (L)}\right) \\&\quad \le L \exp {\left( -\delta \log (L)\right) } = L^{1-\delta }, \end{aligned}$$
where used the Gaussian concentration of Lipschitz functions (Ledoux 2001) in the last inequality. $\square $

We observe in passing that another way to prove the oracle inequalities in the sub-Gaussian is to use Dudley’s inequality on the sphere in $\mathbb {R}^K$ after applying a union bound on the L blocks. In addition, in the Gaussian case, the (similar) bound $\lambda _n \ge 3\delta \tau \sigma {{s(\varvec{X})}}\frac{\sqrt{K}+\sqrt{2\log (L)}}{\sqrt{n}}$ can be obtained by combining Proposition 3 and the estimate $w(\varvec{X}({\mathcal C})) \le 3{{s(\varvec{X})}}(\sqrt{Kn}+\sqrt{2n\log (L)})$ in the proof of (i). The corresponding probability of success would be at least $1-L^{-9(\delta -1)^2}$.

4.3.3 Analysis group Lasso

We now turn to the prior penalty (11). Recall the notations in Sect. 2.5.3, and remind $\varLambda _{{\varvec{\theta }}}=\bigcup _{i \in \mathrm {supp}_{{\mathcal B}}({\varvec{D}}^\top {\varvec{\theta }})} b_i$. We assume that ${\varvec{D}}$ is a frame of $\mathbb {R}^p$, hence surjective, meaning that there exist $c, d > 0$ such that for any ${\varvec{\omega }}\in \mathbb {R}^p$

$$\begin{aligned} d \left\| {\varvec{\omega }}\right\| _{2}^2 \le \big \Vert {\varvec{D}}^\top {\varvec{\omega }}\big \Vert _{2}^2 \le c \left\| {\varvec{\omega }}\right\| _{2}^2. \end{aligned}$$

This together with (12)-(13) and Cauchy–Schwarz inequality entail

Note, however, that from (12), we do not have in general ${{C({\varvec{D}},{\varvec{\theta }})}} {\mathop {=}\limits ^{\text { def}}}\left\| {\varvec{D}}^+{{\mathrm{P}}}_{{{\mathrm{Ker}}}({\varvec{D}}^\top _{\varLambda ^c_{{\varvec{\theta }}}})}{\varvec{D}}e_{{\varvec{D}}^\top {\varvec{\theta }}}^{\left\| \right\| _{1,2}}\right\| _{\infty ,2} \le 1$.

With exactly the same arguments to those for proving Corollary 4, replacing $\varvec{X}$ by $\varvec{X}{\varvec{D}}$, we arrive at the following oracle inequalities.

Corollary 5

Let the data generated by (30). Consider the estimators ${\widehat{{\varvec{\theta }}}_n^{\mathrm {EWA}}}$ and ${\widehat{{\varvec{\theta }}}^{\mathrm {PEN}}_n}$, where F satisfies Assumptions (H.1)–(H.2), and J is the analysis group Lasso (11) with L blocks of equal size K. Assume that ${\varvec{D}}$ is a frame, and le .

(i)
$\varvec{\xi }$ is a zero-mean sub-Gaussian random vector with parameter $\sigma $: suppose that $\lambda _n \ge 3\tau \sigma {{s(\varvec{X}{\varvec{D}})}}c_1 \frac{\sqrt{\log (c_2/\delta )}{\left( \sqrt{K}+\sqrt{2\log (L)}\right) }}{\sqrt{n}}$, for some $\tau > 1$ and $0< \delta < \min (c_2,1)$, where $c_1$ and $c_2$ are the positive absolute constants in Proposition 2. Then, with probability at least $1-\delta $, the following holds
$$\begin{aligned} \begin{aligned} \tfrac{1}{n}\big \Vert \varvec{X}{\widehat{{\varvec{\theta }}}_n^{\mathrm {EWA}}}-\varvec{X}{\varvec{\theta }}_0\big \Vert _{2}^2 \le&\inf _{\begin{array}{c} I \subset \left\{ 1,\ldots ,L \right\} \\ {\varvec{\theta }}:~\mathrm {supp}_{{\mathcal B}}({\varvec{D}}^\top {\varvec{\theta }})=I \end{array}} \Bigg (\tfrac{1}{n}\big \Vert \varvec{X}{\varvec{\theta }}-\varvec{X}{\varvec{\theta }}_0\big \Vert _{2}^2 \\&\quad + \tfrac{c\lambda _n^2{\big (\tau {{C({\varvec{D}},{\varvec{\theta }})}}+1\big )}^2|I|}{\tau ^2\varUpsilon {\Bigg ({{\mathrm{Ker}}}({\varvec{D}}^\top _{\varLambda ^c_{{\varvec{\theta }}}}),\tfrac{\tau {{C({\varvec{D}},{\varvec{\theta }})}}+1}{\tau -1}\Bigg )}^2}\Bigg ) + 2p\beta , \end{aligned} \end{aligned}$$
(36)
and
$$\begin{aligned} \begin{aligned} \tfrac{1}{n}\big \Vert \varvec{X}{\widehat{{\varvec{\theta }}}^{\mathrm {PEN}}_n}-\varvec{X}{\varvec{\theta }}_0\big \Vert _{2}^2 \le&\inf _{\begin{array}{c} I \subset \left\{ 1,\ldots ,L \right\} \\ {\varvec{\theta }}:~\mathrm {supp}_{{\mathcal B}}({\varvec{D}}^\top {\varvec{\theta }})=I \end{array}} \left( \tfrac{1}{n}\big \Vert \varvec{X}{\varvec{\theta }}-\varvec{X}{\varvec{\theta }}_0\big \Vert _{2}^2 \right. \\&\left. \quad + \tfrac{c\lambda _n^2{\big (\tau {{C({\varvec{D}},{\varvec{\theta }})}}+1\big )}^2|I|}{\tau ^2\varUpsilon {\Bigg ({{\mathrm{Ker}}}({\varvec{D}}^\top _{\varLambda ^c_{{\varvec{\theta }}}}),\tfrac{\tau {{C({\varvec{D}},{\varvec{\theta }})}}+1}{\tau -1}\Bigg )}^2}\right) . \end{aligned} \end{aligned}$$
(37)
(ii)
$\varvec{\xi }\sim {\mathcal N}(0,\sigma ^2\mathbf{Id}_n)$: suppose that $\lambda _n \ge \tau \sigma {{s(\varvec{X}{\varvec{D}})}}\frac{\sqrt{K}+\sqrt{2\delta \log (L)}}{\sqrt{n}}$, for some $\tau > 1$ and $\delta > 1$. Then, with probability at least $1-L^{1-\delta }$, (36) and (37) hold.

To the best of our knowledge, this result is new to the literature. The scaling of the remainder term is the same as in Duy Luu et al. (2016) and Rigollet and Tsybakov (2012) with analysis sparsity priors different from ours (the authors in the latter also assume that ${\varvec{D}}$ is invertible).

4.3.4 Anti-sparsity

Recall the derivations for the $\ell _\infty $ norm example in Sect. 3.4.1. We have the following oracle inequalities from Proposition 4.

Corollary 6

Let the data generated by (30) where $\varvec{\xi }$ is a zero-mean sub-Gaussian random vector with parameter $\sigma $. Assume that $\varvec{X}$ is such that $\max _{i,j}|\varvec{X}_{i,j}|\le 1/p$. Consider the estimators ${\widehat{{\varvec{\theta }}}_n^{\mathrm {EWA}}}$ and ${\widehat{{\varvec{\theta }}}^{\mathrm {PEN}}_n}$, where F satisfies Assumptions (H.1)–(H.2), and J is the anti-sparsity penalty (14). Suppose that $\lambda _n \ge \tau \sigma \sqrt{2\delta \log (2)}\sqrt{\frac{p}{n}}$, for some $\tau > 1$ and $\delta > 1$. Then, with probability at least $1-2^{-p(\delta -1)+1}$, the following holds

$$\begin{aligned} \begin{aligned} \tfrac{1}{n}\big \Vert \varvec{X}{\widehat{{\varvec{\theta }}}_n^{\mathrm {EWA}}}-\varvec{X}{\varvec{\theta }}_0\big \Vert _{2}^2&\le \inf _{\begin{array}{c} I \subset \left\{ 1,\ldots ,p \right\} \\ {\varvec{\theta }}:~I^{\mathrm {sat}}_{{\varvec{\theta }}}=I \end{array}} \left( \tfrac{1}{n}\big \Vert \varvec{X}{\varvec{\theta }}-\varvec{X}{\varvec{\theta }}_0\big \Vert _{2}^2 \right. \\&\left. \quad + \tfrac{\lambda _n^2{\left( \tau +1\right) }^2}{\tau ^2\varUpsilon {\left( \big \{ \overline{{\varvec{\theta }}} \;:\; \overline{{\varvec{\theta }}}_{I} \in \mathbb {R}{{\mathrm{sign}}}({\varvec{\theta }}_{I}) \big \} ,\tfrac{\tau +1}{\tau -1}\right) }^2}\right) + 2p\beta , \end{aligned} \end{aligned}$$

(38)

and

$$\begin{aligned} \begin{aligned} \tfrac{1}{n}\big \Vert \varvec{X}{\widehat{{\varvec{\theta }}}^{\mathrm {PEN}}_n}-\varvec{X}{\varvec{\theta }}_0\big \Vert _{2}^2&\le \inf _{\begin{array}{c} I \subset \left\{ 1,\ldots ,p \right\} \\ {\varvec{\theta }}:~I^{\mathrm {sat}}_{{\varvec{\theta }}}=I \end{array}} \left( \tfrac{1}{n}\big \Vert \varvec{X}{\varvec{\theta }}-\varvec{X}{\varvec{\theta }}_0\big \Vert _{2}^2 \right. \\&\left. \quad + \tfrac{\lambda _n^2{\left( \tau +1\right) }^2}{\tau ^2\varUpsilon {\left( \big \{ \overline{{\varvec{\theta }}} \;:\; \overline{{\varvec{\theta }}}_{I} \in \mathbb {R}{{\mathrm{sign}}}({\varvec{\theta }}_{I}) \big \} ,\tfrac{\tau +1}{\tau -1}\right) }^2}\right) . \end{aligned} \end{aligned}$$

(39)

The first remainder term scales as $\tfrac{p}{n}$ which reflects that anti-sparsity regularization requires an overdetermined regime to ensure good stability performance. This is in agreement with Vaiter et al. (2015a, Theorem 7). This phenomenon was also observed by Donoho and Tanner (2010) who studied sample complexity thresholds for noiseless recovery from random projections of the hypercube.

4.3.5 Nuclear norm

We now turn to the nuclear norm case. Recall the notations of Sect. 2.5.5. For matrices ${\varvec{\theta }}\in \mathbb {R}^{p_1 \times p_2}$, a measurement map $\varvec{X}$ takes the form of a linear operator whose ith component is given by the Frobenius scalar product

$$\begin{aligned} \varvec{X}({\varvec{\theta }})_i = {{\mathrm{tr}}}((\varvec{X}^i)^\top {\varvec{\theta }}) = \langle \varvec{X}^i,{\varvec{\theta }}\rangle _{\mathrm {F}}, \end{aligned}$$

where $\varvec{X}^i$ is a matrix in $\mathbb {R}^{p_1 \times p_2}$. We denote $\left\| \cdot \right\| _{\mathrm {F}}$ the associated norm. From (17), it is immediate to see that whenever ${\varvec{\theta }}\ne 0$,

Moreover, from (17), we have

To apply Proposition 2 and Proposition 3, we need to bound $w(\varvec{X}({\mathcal C}))$ (${\mathcal C}$ is the nuclear ball), or equivalently, to bound

which is the expectation of the operator norm of a random series with matrix coefficients. Thus using Tropp (2015b, Theorem 4.1.1(4.1.5)) to get this bound, and inserting it into Proposition 2 and Proposition 3, we get the following oracle inequalities for the nuclear norm. Define

Corollary 7

Let the data generated by (30) with a linear operator $\varvec{X}: \mathbb {R}^{p_1 \times p_2} \rightarrow \mathbb {R}^n$. Consider the estimators ${\widehat{{\varvec{\theta }}}_n^{\mathrm {EWA}}}$ and ${\widehat{{\varvec{\theta }}}^{\mathrm {PEN}}_n}$, where F satisfies Assumptions (H.1)–(H.2), and J is the nuclear norm (16).

(i)
$\varvec{\xi }$ is a zero-mean sub-Gaussian random vector with parameter $\sigma $: suppose that $\lambda _n \ge 2\tau \sigma {{s(\varvec{X})}}c_1 \sqrt{\frac{\log (c_2/\delta )\log (p_1+p_2)}{n}}$, for some $\tau > 1$ and $0< \delta < \min (c_2,1)$, where $c_1$ and $c_2$ are the positive absolute constants in Proposition 2. Then, with probability at least $1-\delta $, the following holds
$$\begin{aligned} \begin{aligned} \tfrac{1}{n}\big \Vert \varvec{X}{\widehat{{\varvec{\theta }}}_n^{\mathrm {EWA}}}-\varvec{X}{\varvec{\theta }}_0\big \Vert _{2}^2&\le \inf _{\begin{array}{c} r \in \left\{ 1,\ldots ,\min (p_1,p_2) \right\} \\ {\varvec{\theta }}:~{{\mathrm{rank}}}({\varvec{\theta }})=r \end{array}} \Bigg (\tfrac{1}{n}\big \Vert \varvec{X}{\varvec{\theta }}-\varvec{X}{\varvec{\theta }}_0\big \Vert _{2}^2 \\&\quad + \tfrac{2\lambda _n^2{\left( \tau +1\right) }^2 r}{\tau ^2\varUpsilon {\left( T_{{\varvec{\theta }}},\tfrac{\tau +1}{\tau -1}\right) }^2}\Bigg ) + 2p_1p_2\beta , \end{aligned} \end{aligned}$$
(40)
and
$$\begin{aligned} \begin{aligned} \tfrac{1}{n}\big \Vert \varvec{X}{\widehat{{\varvec{\theta }}}^{\mathrm {PEN}}_n}-\varvec{X}{\varvec{\theta }}_0\big \Vert _{2}^2&\le \inf _{\begin{array}{c} r \in \left\{ 1,\ldots ,\min (p_1,p_2) \right\} \\ {\varvec{\theta }}:~{{\mathrm{rank}}}({\varvec{\theta }})=r \end{array}} \left( \tfrac{1}{n}\big \Vert \varvec{X}{\varvec{\theta }}-\varvec{X}{\varvec{\theta }}_0\big \Vert _{2}^2 \right. \\&\left. \quad + \tfrac{2\lambda _n^2{\left( \tau +1\right) }^2 r}{\tau ^2\varUpsilon {\left( T_{{\varvec{\theta }}},\tfrac{\tau +1}{\tau -1}\right) }^2}\right) . \end{aligned} \end{aligned}$$
(41)
(ii)
$\varvec{\xi }\sim {\mathcal N}(0,\sigma ^2\mathbf{Id}_n)$: suppose that $\lambda _n \ge (1+\delta )\tau \sigma {{s(\varvec{X})}}\sqrt{\frac{2\log (p_1+p_2)}{n}}$, for some $\tau > 1$ and $\delta > 0$. Then, with probability at least $1-(p_1+p_2)^{-\delta ^2}$, (40) and (41) hold.

The set over which the infimum is taken just reminds us that the nuclear norm is partly smooth (see above) relative to the constant rank manifold (which is a Riemannian submanifold of $\mathbb {R}^{p_1 \times p_2}$) (Daniilidis et al. 2014, Theorem 3.19). In the iid Gaussian noise case, we recover the same rate as in Dalalyan et al. (2018, Theorem 3) for ${\widehat{{\varvec{\theta }}}_n^{\mathrm {EWA}}}$ and in Koltchinskii et al. (2011, Theorem 2) for ${\widehat{{\varvec{\theta }}}^{\mathrm {PEN}}_n}$. If $s(\varvec{X})=O(\sqrt{p_1+p_2})$, then the first remainder term scales as $\frac{r(p_1+p_2)\log (p_1+p_2)}{n}$. For low-rank matrix recovery, the same rate was also independently proved in Mai and Alquier (2015) and Suzuki (2015) for EWA and the posterior conditional mean, respectively, in the temperature regime $\beta =C/n$, though with completely different priors, but without requiring the compatibility factor assumption. The noise is iid Gaussian in Suzuki (2015) and subexponential in Mai and Alquier (2015). The assumptions on the design are also different.

The assumption $s(\varvec{X})=O(\sqrt{p_1+p_2})$ on the design is mild and verified in many situations. Indeed, by Jensen’s inequality we have

If, for example, $(\varvec{X}^i)_i$ are independent copies of a standard random Gaussian matrix, then classical concentration bounds of the largest eigenvalue of a Wishart matrix entail that concentrates around its mean .

4.4 Discussion of minimax optimality

In this section, we discuss the optimality of the estimators ${\widehat{{\varvec{\theta }}}_n^{\mathrm {EWA}}}$ and ${\widehat{{\varvec{\theta }}}^{\mathrm {PEN}}_n}$ (we remind the reader that the design $\varvec{X}$ is fixed). Recall the discussion on stratification at the end of Sect. 3.1. Let ${\mathcal M}_0 \in \mathscr {M}$ be the stratum active at ${\varvec{\theta }}_0 \in {\mathcal M}_0$. In this setting, choosing for some constant $C > 0$, (22) and Proposition 3 ensure that

with high probability. In particular, for a polyhedral gauge penalty, in which case ${\mathcal M}_0=T_{{\varvec{\theta }}_0}$ (see Vaiter et al. 2015a), and under the normalization $\max _{{\varvec{v}}{\mathcal V}} \left\| \varvec{X}{\varvec{v}}\right\| _{2} \le \sqrt{n}$ and with the choice , Proposition 4 entails

with high probability. Thus the risk bounds only depend on ${\mathcal M}_0$. A natural question that arises is whether the above bounds are optimal, i.e., whether an estimator can achieve a significantly better prediction risk than ${\widehat{{\varvec{\theta }}}_n^{\mathrm {EWA}}}$ and ${\widehat{{\varvec{\theta }}}^{\mathrm {PEN}}_n}$ uniformly on ${\mathcal M}_0$. A classical way to answer this question is the minimax point of view. This amounts to finding a lower bound on the minimax probabilities of the form

$$\begin{aligned} \inf _{{\widehat{{\varvec{\theta }}}}} \sup _{{\varvec{\theta }}\in {\mathcal M}_0} \Pr {\left( \tfrac{1}{n}\big \Vert \varvec{X}{\widehat{{\varvec{\theta }}}}-\varvec{X}{\varvec{\theta }}\big \Vert _{2}^2 \ge \psi _n\right) } , \end{aligned}$$

where $\psi _n$ is the rate, which ideally, should be comparable to the risk bounds above. A standard path to derive such a lower bound is to exhibit a subset of ${\mathcal M}_0$ of well-separated points while controlling its diameter, see Tsybakov (2008, Chapter 2) or Massart (2007, Section 4.3). This, however, must be worked out on a case-by-case basis.

Example 2

(Lasso) In this case, ${\mathcal M}_0 = T_{{\varvec{\theta }}_0}$ is the subspace of vectors whose support is contained in that of ${\varvec{\theta }}_0$. Let $I=\mathrm {supp}({\varvec{\theta }}_0)$ and $s = \left\| {\varvec{\theta }}_0\right\| _{0}$. Define the set

$$\begin{aligned} {\mathcal B}_0 = \big \{ {\varvec{\theta }}\in \mathbb {R}^p \;:\; {\varvec{\theta }}_{I} \in \left\{ 0,1 \right\} ^{s} \quad \text {and} \quad {\varvec{\theta }}_{I^c}=0 \big \} . \end{aligned}$$

We have ${\mathcal B}_0 \subset {\mathcal M}_0$ and $\left\| {\varvec{\theta }}-{\varvec{\theta }}'\right\| _{0} \le 2s$ for all $({\varvec{\theta }},{\varvec{\theta }}') \in {\mathcal B}_0$. Define ${\mathcal F}_0 {\mathop {=}\limits ^{\text { def}}} \big \{ r\varvec{X}{\varvec{\theta }} \;:\; {\varvec{\theta }}\in {\mathcal B}_0 \big \} $, for $r > 0$ to be specified later. Due to the Varshamov–Gilbert lemma (Massart 2007, Lemma 4.7), given $a \in ]0,1[$, there exists a subset ${\mathcal B}\subset {\mathcal B}_0$ with cardinality $|{\mathcal B}| \ge 2^{\rho s/2}$ such that for two distinct elements $\varvec{X}{\varvec{\theta }}$ and $\varvec{X}{\varvec{\theta }}'$ in ${\mathcal F}_0$

$$\begin{aligned} \left\| \varvec{X}({\varvec{\theta }}-{\varvec{\theta }}')\right\| _{2}^2&\ge \underline{\kappa }r^2 \left\| {\varvec{\theta }}-{\varvec{\theta }}'\right\| _{2}^2 \ge 2(1-a)\underline{\kappa }r^2 s , \\ \left\| \varvec{X}({\varvec{\theta }}-{\varvec{\theta }}')\right\| _{2}^2&\le \overline{\kappa }r^2 \left\| {\varvec{\theta }}-{\varvec{\theta }}'\right\| _{2}^2 \le 4 \overline{\kappa }r^2 s , \end{aligned}$$

where

$$\begin{aligned} \underline{\kappa }= \inf _{{\varvec{\theta }}\in {\mathcal M}_0} \frac{\left\| \varvec{X}{\varvec{\theta }}\right\| _{2}^2}{\left\| {\varvec{\theta }}\right\| _{2}^2} \le \overline{\kappa }= \sup _{{\varvec{\theta }}\in {\mathcal M}_0} \frac{\left\| \varvec{X}{\varvec{\theta }}\right\| _{2}^2}{\left\| {\varvec{\theta }}\right\| _{2}^2}. \end{aligned}$$

Standard results from random matrix theory, see Tropp (2015a), ensure that $\underline{\kappa }> 0$ for a Gaussian design with high probability as long as $n \ge s + C\sqrt{s}$ for some positive absolute constant C.

Then choosing $r^2=\frac{c\rho \sigma ^2}{4\overline{\kappa }}$, where $c \in ]0,1/8[$ and $\rho =(1+a)\log (1+a)+(1-a)\log (1-a)$, we get the bounds

$$\begin{aligned} \left\| \varvec{X}({\varvec{\theta }}-{\varvec{\theta }}')\right\| _{2}^2&\ge \frac{\sigma ^2c(1-a)\rho \underline{\kappa }}{2\overline{\kappa }} s , \\ \left\| \varvec{X}({\varvec{\theta }}-{\varvec{\theta }}')\right\| _{2}^2&\le 2\sigma ^2c\log (|{\mathcal B}|) . \end{aligned}$$

We are now in position to apply Tsybakov (2008, Theorem 2.5) to conclude that there exists $\eta \in ]0,1[$ (that depends on a) such that

$$\begin{aligned} \inf _{{\widehat{{\varvec{\theta }}}}} \sup _{{\varvec{\theta }}\in {\mathcal M}_0} \Pr {\left( \tfrac{1}{n}\big \Vert \varvec{X}{\widehat{{\varvec{\theta }}}}-\varvec{X}{\varvec{\theta }}\big \Vert _{2}^2 \ge \frac{\sigma ^2c(1-a)\rho \underline{\kappa }}{4\overline{\kappa }}\frac{s}{n}\right) } \ge \eta . \end{aligned}$$

This lower bound together with Corollary 3 shows that ${\widehat{{\varvec{\theta }}}_n^{\mathrm {EWA}}}$ (with $\beta =O{\big (\sigma ^2s\log (2p)/(pn)\big )}$) and ${\widehat{{\varvec{\theta }}}^{\mathrm {PEN}}_n}$ are nearly minimax (up to a logarithmic factor) over ${\mathcal M}_0$.

One can generalize this reasoning to get a minimax lower bound over the larger class of s-sparse vectors, i.e., $\bigcup \big \{ V={{\mathrm{Span}}} \left\{ (\varvec{a}_j)_{1 \le j \le p} \right\} \;:\; \dim (V)=s \big \} $, which is a finite union of subspaces that contains ${\mathcal M}_0$. Let $(a,b) \in ]0,1[^2$ such that $1 \le s \le abp$ and $a(-1+b-\log (b)) \ge \log (2)$ (e.g., take $b=1/(1+e\root a \of {2})$), $c \in ]0,1/8[$. Then combining Tsybakov (2008, Theorem 2.5) and Massart (2007, Lemma 4.6 and Lemma 4.10), we have for $\eta {\mathop {=}\limits ^{\text { def}}}\frac{1}{1+(ab)^{\rho s/2}}{\left( 1-2c-\sqrt{\frac{2c}{-\rho \log (ab)}}\right) } \in ]0,1[$

$$\begin{aligned} \inf _{{\widehat{{\varvec{\theta }}}}} \sup _{{\varvec{\theta }}\in {\mathcal M}_0} \Pr {\left( \tfrac{1}{n}\big \Vert \varvec{X}{\widehat{{\varvec{\theta }}}}-\varvec{X}{\varvec{\theta }}\big \Vert _{2}^2 \ge \frac{\sigma ^2c\rho (1-\alpha )\underline{\kappa }}{2\overline{\kappa }}\frac{s\log (p/s)}{n}\right) } \ge \eta , \end{aligned}$$

where $\rho =-a(-1+b-\log (b))/\log (ab)$, and $\underline{\kappa }$ and $\overline{\kappa }$ are now the restricted isometry constants of $\varvec{X}$ of degree 2s, i.e.,

$$\begin{aligned} \underline{\kappa }= \inf _{\left\| {\varvec{\theta }}\right\| _{0} \le 2s} \frac{\left\| \varvec{X}{\varvec{\theta }}\right\| _{2}^2}{\left\| {\varvec{\theta }}\right\| _{2}^2} \le \overline{\kappa }= \sup _{\left\| {\varvec{\theta }}\right\| _{0} \le 2s} \frac{\left\| \varvec{X}{\varvec{\theta }}\right\| _{2}^2}{\left\| {\varvec{\theta }}\right\| _{2}^2} . \end{aligned}$$

For this lower bound to be meaningful, $\underline{\kappa }$ should be positive. From the compressed sensing literature, many random designs are known to verify this condition for n large enough compared to s, e.g., sub-Gaussian designs with $n \gtrsim s\log (p)$.

One can see that the difference between this lower bound and the one on ${\mathcal M}_0$ lies in the $\log (p/s)$ factor, which basically derives from the control over the union of subspaces. The minimax prediction risk (in expectation) over the $\ell _0$-ball was studied in Rigollet and Tsybakov (2011), Raskutti et al. (2011), Verzelen (2012), Ye and Zhang (2010) and Wang et al. (2014), where similar lower bounds were obtained.

Example 3

(Group Lasso) For the group Lasso with L groups of equal size K, ${\mathcal M}_0$ is the subspace group sparse vectors whose group support is included in that of ${\varvec{\theta }}_0$. Let s be the number of nonzero (active) groups in ${\varvec{\theta }}_0$. Following exactly the same reasoning as for the Lasso, one can show that the risk lower bound in probability scales as $C\sigma ^2sK/n$, which together with Corollary 4 shows that ${\widehat{{\varvec{\theta }}}_n^{\mathrm {EWA}}}$ (with $\beta =O{\big (\sigma ^2s{\big (\sqrt{K}+\sqrt{2\log (L)}\big )}^2/(pn)\big )}$) and ${\widehat{{\varvec{\theta }}}^{\mathrm {PEN}}_n}$ are nearly minimax (up again to a logarithmic factor) over ${\mathcal M}_0$. One can also derive the lower bound $C\sigma ^2s(K+\log (L/s))/n$ over the set of s-block sparse vectors. Such minimax lower bound is comparable to the one in Lounici et al. (2011).

Example 4

(Anti-sparsity) Denote the saturation support of ${\varvec{\theta }}_0$ as $I^{\mathrm {sat}}$ and recall the subspace $T_{{\varvec{\theta }}_0}$ form (15). Thus, ${\mathcal M}_0=T_{{\varvec{\theta }}_0}$ is the subspace of vectors which are collinear to ${{\mathrm{sign}}}({\varvec{\theta }}_0)$ on $I^{\mathrm {sat}}$ and free on its complement. Observe that $\dim ({\mathcal M}_0) = p-s+1$, where $s=|I^{\mathrm {sat}}|$. Define the set

$$\begin{aligned} {\mathcal B}_0 = \big \{ {\varvec{\theta }}\in \mathbb {R}^p \;:\; {\varvec{\theta }}_{I^{\mathrm {sat}}} = {{\mathrm{sign}}}({\varvec{\theta }}_{I^{\mathrm {sat}}}) \quad \text {and} \quad {\varvec{\theta }}_{(I^{\mathrm {sat}})^c} \in \left\{ 0,1 \right\} ^{p-s}) \big \} . \end{aligned}$$

By construction, ${\mathcal B}_0 \subset {\mathcal M}_0$, and $\left\| {\varvec{\theta }}-{\varvec{\theta }}'\right\| _{0} \le 2(p-s)$ for all $({\varvec{\theta }},{\varvec{\theta }}') \in {\mathcal B}_0$. Thus following the same arguments as for the Lasso example [using again Varshamov–Gilbert lemma and Tsybakov (2008, Theorem 2.5)], we conclude that there exists $\eta \in ]0,1[$ (that depends on a) such that

$$\begin{aligned} \inf _{{\widehat{{\varvec{\theta }}}}} \sup _{{\varvec{\theta }}\in {\mathcal M}_0} \Pr {\left( \tfrac{1}{n}\big \Vert \varvec{X}{\widehat{{\varvec{\theta }}}}-\varvec{X}{\varvec{\theta }}\big \Vert _{2}^2 \ge \frac{\sigma ^2c(1-a)\rho \underline{\kappa }}{4\overline{\kappa }}\frac{p-s}{n}\right) } \ge \eta , \end{aligned}$$

where the restricted isometry constants are defined similarly to the Lasso but with respect to the model subspace ${\mathcal M}_0$ of the $\ell _\infty $ norm. Again, for a Gaussian design, $\underline{\kappa }> 0$ with high probability as long as $n \ge (p-s+1) + C\sqrt{p-s+1}$ (Tropp 2015a).

The obtained minimax lower bound is consistent with the sample complexity thresholds derived in Donoho and Tanner (2010) for noiseless recovery from random projections of the hypercube. For a saturation support size s small compared to p, the bound of Corollary 6 (with $\beta =O{\big (\sigma ^2/n^2\big )}$) comes close to the minimax lower bound.

Example 5

(Nuclear norm) Let $r={{\mathrm{rank}}}({\varvec{\theta }}_0)$, where ${\varvec{\theta }}_0 \in \mathbb {R}^{p_1 \times p_2}$, and $p=\max (p_1,p_2)$. For the nuclear norm, ${\mathcal M}_0$ is the manifold of rank-r matrices. Thus arguing as in Koltchinskii et al. (2011, Theorem 5) [who use the Varshamov–Gilbert lemma (Massart 2007), to find the covering set], one can show that the minimax risk lower bound over ${\mathcal M}_0$ is $C\sigma ^2 r/n$. In view of Corollary 7, we deduce that ${\widehat{{\varvec{\theta }}}_n^{\mathrm {EWA}}}$ (with $\beta =O{\big (\sigma ^2r\log (p_1+p_2)/(p_1p_2n)\big )}$) and ${\widehat{{\varvec{\theta }}}^{\mathrm {PEN}}_n}$ are nearly minimax over the constant rank manifolds.

Notes

This is for instance the case if $\varvec{X}$ is drawn from the standard Gaussian ensemble and $K=O(n)$ (the $O(\cdot )$ is in fact even $o(\cdot )$ as the remainder term is supposed to go to 0 as $n \rightarrow +\infty $). In this case, classical concentration bounds of the largest eigenvalue of a Wishart matrix allow to conclude that $s(\varvec{X}) = O(1+\sqrt{K/n}) = O(1)$ with high probability.

References

Bach, F. (2008). Consistency of the group lasso and multiple kernel learning. Journal of Machine Learning Research, 9, 1179–1225.
MathSciNet MATH Google Scholar
Bakin, S. (1999). Adaptive regression and model selection in data mining problems. PhD Thesis, Australian National University, Canberra.
Bauschke, H. H., Combettes, P. L. (2011). Convex analysis and monotone operator theory in Hilbert spaces1st ed. New York: Springer.
MATH Google Scholar
Bellec, P. (2014). Concentration of quadratic forms under a Bernstein moment assumption. Technical report 1, ENSAE, France.
Bickel, P. J., Ritov, Y., Tsybakov, A. (2009). Simultaneous analysis of Lasso and Dantzig selector. The Annals of Statistics, 37(4), 1705–1732.
MathSciNet MATH Google Scholar
Bogdan, M., van den Berg, E., Sabatti, C., Su, W., Candès, E. J. (2014). Slope—Adaptive variable selection via convex optimization. Annals of Applied Statistics, 9(3), 1103–1140.
MathSciNet MATH Google Scholar
Bühlmann, P., van de Geer, S. (2011). Statistics for high-dimensional data: Methods, theory and applications, springer series in statistics. Berlin, Heidelberg: Springer.
MATH Google Scholar
Candès, E., Plan, Y. (2009). Near-ideal model selection by $\ell _1$ minimization. The Annals of Statistics, 37(5A), 2145–2177.
MathSciNet MATH Google Scholar
Candès, E. J., Recht, B. (2009). Exact matrix completion via convex optimization. Foundations of Computational mathematics, 9(6), 717–772.
MathSciNet MATH Google Scholar
Candès, E. J., Li, X., Ma, Y., Wright, J. (2011). Robust principal component analysis? Journal of the ACM, 58(3), 11:1–11:37.
MathSciNet MATH Google Scholar
Candès, E. J., Strohmer, T., Voroninski, V. (2013). Phaselift: Exact and stable signal recovery from magnitude measurements via convex programming. Communications on Pure and Applied Mathematics, 66(8), 1241–1274.
MathSciNet MATH Google Scholar
Castillo, I., Schmidt-Hieber, J., van der Vaart, A. (2015). Bayesian linear regression with sparse priors. The Annals of Statistics, 43(5), 1986–2018.
MathSciNet MATH Google Scholar
Catoni, O. (2003). A PAC-Bayesian approach to adaptive classification. Technical report PMA-840, Laboratoire de Probabilités et de modèles aléatoires, Paris, France.
Catoni, O. (2007). PAC-Bayesian supervised classification (the thermodynamics of statistical learning), lecture notes-monograph series, Vol. 56. Beachwood, OH: Institute of Mathematical Statistics.
MATH Google Scholar
Chandrasekaran, V., Recht, B., Parrilo, P. A., Willsky, A. (2012). The convex geometry of linear inverse problems. Foundations of Computational Mathematics, 12(6), 805–849.
MathSciNet MATH Google Scholar
Chen, S., Donoho, D., Saunders, M. (1999). Atomic decomposition by basis pursuit. SIAM Journal on Scientific Computing, 20(1), 33–61.
MathSciNet MATH Google Scholar
Chen, X., Lin, Q., Kim, S., Carbonell, J. G., Xing, E. P. (2010). An efficient proximal-gradient method for general structured sparse learning. Preprint arXiv:1005.4717.
Coste, M. (2000). An Introduction to Semialgebraic Geometry. Dottorato di ricerca in matematica / Università di Pisa,Dipartimento di Matematica, Istituti Editoriali e PoligraficiInternazionali, Pisa.
Dalalyan, A., Tsybakov, A. (2009). PAC-Bayesian bounds for the expected error of aggregation by exponential weights. Technical report 1, Université Paris 6, CREST and CERTIS, Ecole des Ponts ParisTech, Paris, personal communication.
Dalalyan, A., Tsybakov, A. B. (2008). Aggregation by exponential weighting, sharp PAC-Bayesian bounds and sparsity. Machine Learning, 72(1–2), 39–61. https://doi.org/10.1007/s10994-008-5051-0.
Article Google Scholar
Dalalyan, A. S., Tsybakov, A. B. (2007). Aggregation by exponential weighting and sharp oracle inequalities. In Proceedings of the 20th annual conference on learning theory (pp. 97–111). Springer, Berlin, Heidelberg, COLT’07. http://dl.acm.org/citation.cfm?id=1768841.1768854.
Dalalyan, A. S., Tsybakov, A. B. (2012). Sparse regression learning by aggregation and Langevin Monte-Carlo. Journal of Computer and System Sciences, 78(5), 1423–1443. https://doi.org/10.1016/j.jcss.2011.12.023.
Article MathSciNet MATH Google Scholar
Dalalyan, A. S., Hebiri, M., Lederer, J. (2017). On the prediction performance of the Lasso. Bernoulli, 23(1), 552–581. https://doi.org/10.3150/15-BEJ756.
Article MathSciNet MATH Google Scholar
Dalalyan, A. S., Grappin, E., Paris, Q. (2018). On the exponentially weighted aggregate with the Laplace prior. The Annals of Statistics, 46(5), 2452–2478.
MathSciNet MATH Google Scholar
Daniilidis, A., Drusvyatskiy, D., Lewis, A. S. (2014). Orthogonal invariance and identifiability. SIAM Journal on Matrix Analysis and Applications, 35(2), 580–598. https://doi.org/10.1137/130916710.
Article MathSciNet MATH Google Scholar
Donoho, D. (2006). For most large underdetermined systems of linear equations the minimal $\ell ^1$-norm solution is also the sparsest solution. Communications on Pure and Applied Mathematics, 59(6), 797–829.
MathSciNet MATH Google Scholar
Donoho, D., Tanner, J. (2010). Counting the faces of randomly-projected hypercubes and orthants. Discrete and Computational Geometry, 43(3), 522–541.
MathSciNet MATH Google Scholar
Durmus, A., Moulines, E., Pereyra, M. (2016). Sampling from convex non continuously differentiable functions, when Moreau meets Langevin. https://hal.archives-ouvertes.fr/hal-01267115, preprint hal-01267115.
Duy Luu, T., Fadili, J. M., Chesneau, C. (2016). PAC-Bayesian risk bounds for group-analysis sparse regression by exponential weighting. Technical report, hal-01367742. https://hal.archives-ouvertes.fr/hal-01367742.
Fadili, M. J., Peyré, G., Vaiter, S., Deledalle, C., Salmon, J. (2013). Stable recovery with analysis decomposable priors. In Sampling theory and applications (SAMPTA) (pp. 113–116). Bremen: Springer.
Fazel, M., Hindi, H., Boyd, S. P. (2001). A rank minimization heuristic with application to minimum order system approximation. In Proceedings of the American Control Conference (ACC), IEEE, Arlington, USA (Vol. 6, pp. 4734–4739).
Guedj, B., Alquier, P. (2013). PAC-Bayesian estimation and prediction in sparse additive models. Electronic Journal of Statistics, 7, 264–291. https://doi.org/10.1214/13-EJS771.
Article MathSciNet MATH Google Scholar
Hiriart-Urruty, J. B., Lemaréchal, C. (2001). Convex analysis and minimization algorithms, Grundlehren der mathematischen Wissenschaften, vol I and II. Berlin, Heidelberg: Springer.
Jacob, L., Obozinski, G., Vert, J. P. (2009). Group lasso with overlap and graph lasso. In A. P. Danyluk, L. Bottou, M. L. Littman (Eds.), 26th International conference on machine learning (ICML) (Vol. 382, p. 55). Montreal; ACM Press.
Jégou, H., Furon, T., Fuchs, J. J. (2012). Anti-sparse coding for approximate nearest neighbor search. In IEEE international conference on acoustics, speech and signal processing (ICASSP) (pp. 2029–2032). Kyoto: IEEE.
Koltchinskii, V. (2008). Oracle inequalities in empirical risk minimization and sparse recovery problems. In 38th Summer school on probability theory and statistics saint-flour. Lecture Notes in Mathematics (Vol. 2033). New York: Springer.
Koltchinskii, V., Lounici, K., Tsybakov, A. B. (2011). Nuclear-norm penalization and optimal rates for noisy low-rank matrix completion. The Annals of Statistics, 39(5), 2302–2329. https://doi.org/10.1214/11-AOS894.
Article MathSciNet MATH Google Scholar
Lecué, G. (2007). Simultaneous adaptation to the margin and to complexity in classification. The Annals of Statistics, 35(4), 1698–1721. https://doi.org/10.1214/009053607000000055.
Article MathSciNet MATH Google Scholar
Ledoux, M. (2001). The concentration of measure phenomenon. Mathematical surveys and monographs. Providence, RI: American Mathematical Society.
Leung, G., Barron, A. R. (2006). Information theory and mixing least-squares regressions. IEEE Transactions on Information Theory, 52(8), 3396–3410.
MathSciNet MATH Google Scholar
Lounici, K., Pontil, M., van de Geer, S., Tsybakov, A. B. (2011). Oracle inequalities and optimal inference under group sparsity. The Annals of Statistics, 39(4), 2164–2204. https://doi.org/10.1214/11-AOS896.
Article MathSciNet MATH Google Scholar
Lyubarskii, Y., Vershynin, R. (2010). Uncertainty principles and vector quantization. IEEE Transactions on Information Theory, 56(7), 3491–3501.
MathSciNet MATH Google Scholar
Mai, T. T., Alquier, P. (2015). A bayesian approach for noisy matrix completion: Optimal rate under general sampling distribution. Electronic Journal of Statistics, 9(1), 823–841. https://doi.org/10.1214/15-EJS1020.
Article MathSciNet MATH Google Scholar
Massart, P. (2007). Concentration inequalities and model selection. In Summer school on probability theory and statistics Saint-Flour XXXIII—2003. New York: Springer.
Negahban, S., Ravikumar, P., Wainwright, M. J., Yu, B. (2012). A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. Statistical Science, 27(4), 538–557.
MathSciNet MATH Google Scholar
Nemirovski, A. (2000). Topics in non-parametric statistics. In M. Emery, A. Nemirovski & D. Voiculescu (Eds.), Summer school on probability theory and statistics Saint-Flour XXVIII-1998, Lecture notes in mathematics, Vol. 1738, pp. 87–285. New York: Springer.
Google Scholar
Osborne, M., Presnell, B., Turlach, B. (2000). A new approach to variable selection in least squares problems. IMA Journal of Numerical Analysis, 20(3), 389–403.
MathSciNet MATH Google Scholar
Peyré, G., Fadili, J., Chesneau, C. (2011). Adaptive structured block sparsity via dyadic partitioning. In 19th European signal processing conference (EUSIPCO). Barcelona, Spain: Springer.
Raskutti, G., Wainwright, M. J., Yu, B. (2011). Minimax rates of estimation for high-dimensional linear regression over $\ell _q$ -balls. IEEE Transactions on Information Theory, 57(10), 6976–6994. https://doi.org/10.1109/TIT.2011.2165799.
Article MathSciNet MATH Google Scholar
Recht, B., Fazel, M., Parrilo, P. A. (2010). Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Review, 52(3), 471–501.
MathSciNet MATH Google Scholar
Rigollet, P., Tsybakov, A. (2007). Linear and convex aggregation of density estimators. Mathematical Methods of Statistics, 16(3), 260–280. https://doi.org/10.3103/S1066530707030052.
Article MathSciNet MATH Google Scholar
Rigollet, P., Tsybakov, A. (2011). Exponential screening and optimal rates of sparse estimation. The Annals of Statistics, 39(2), 731–771. https://doi.org/10.1214/10-AOS854.
Article MathSciNet MATH Google Scholar
Rigollet, P., Tsybakov, A. B. (2012). Sparse estimation by exponential weighting. Statistical Science, 27(4), 558–575. https://doi.org/10.1214/12-STS393.
Article MathSciNet MATH Google Scholar
Rockafellar, R. (1996). Convex analysis, Vol. 28. Princeton, NJ: Princeton University Press.
Google Scholar
Rockafellar, R. T., Wets, R. (1998). Variational analysis, Vol. 317. New York: Springer.
MATH Google Scholar
Rudelson, M., Vershynin, R. (2008). On sparse reconstruction from Fourier and Gaussian measurements. Communications on Pure and Applied Mathematics, 61(8), 1025–1045.
MathSciNet MATH Google Scholar
Rudin, L., Osher, S., Fatemi, E. (1992). Nonlinear total variation based noise removal algorithms. Physica D: Nonlinear Phenomena, 60(1–4), 259–268.
MathSciNet MATH Google Scholar
Studer, C., Yin, W., Baraniuk, R. G. (2012). Signal representations with minimum $\ell _\infty $-norm. In 50th Annual Allerton conference on communication, control and computing. Champaign-Urbana, USA: IEEE.
Su, W., Candès, E. J. (2015). Slope is adaptive to unknown sparsity and asymptotically minimax. The Annals of Statistics, 44(3), 1038–1068.
MathSciNet MATH Google Scholar
Sun, T., Zhang, C. H. (2012). Scaled sparse linear regression. Biometrika, 99(4), 879. https://doi.org/10.1093/biomet/ass043.
Article MathSciNet MATH Google Scholar
Suzuki, T. (2015). Convergence rate of Bayesian tensor estimator and its minimax optimality. In 32nd International conference on machine learning (ICML) (Vol. 37, pp. 1273–1282). Lille, France: ACM Press.
Talagrand, M. (2005). The generic chaining. Upper and lower bounds of stochastic processes. Berlin: Springer.
MATH Google Scholar
Tibshirani, R. (1996). Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society Series B Methodological, 58(1), 267–288.
MathSciNet MATH Google Scholar
Tibshirani, R., Saunders, M., Rosset, S., Zhu, J., Knight, K. (2005). Sparsity and smoothness via the fused Lasso. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(1), 91–108.
MathSciNet MATH Google Scholar
Tropp, J. (2015a). Convex recovery of a structured signal from independent random linear measurements. In G. Pfander (Ed.), Sampling theory, a renaissance, applied and numerical harmonic analysis (ANHA). Berlin, Heidelberg: Birkhäuser/Springer.
Google Scholar
Tropp, J. A. (2015b). An introduction to matrix concentration inequalities. Foundations and Trends in Machine Learning, 8(1–2), 1–230. https://doi.org/10.1561/2200000048.
Article MATH Google Scholar
Tsybakov, A. B. (2008). Introduction to nonparametric estimation1st ed. New York: Springer.
MATH Google Scholar
Vaiter, S., Golbabaee, M., Fadili, J., Peyré, G. (2015a). Model selection with low complexity priors. Information and Inference: A Journal of the IMA, 4(3), 230.
Vaiter, S., Peyré, G., Fadili, M. J. (2015b). Low complexity regularization of linear inverse problems. In G. Pfander (Ed.), Sampling theory, a renaissance, applied and numerical harmonic analysis (ANHA). Berlin, Heidelberg: Birkhäuser/Springer.
Google Scholar
Vaiter, S., Deledalle, C., Fadili, M. J., Peyré, G., Dossal, C. (2017). The degrees of freedom of partly smooth regularizers. Annals of the Institute of Statistical Mathematics, 69(4), 791–832.
MathSciNet MATH Google Scholar
Vaiter, S., Peyré, G., Fadili, M. J. (2018). Model consistency of partly smooth regularizers. IEEE Transactions on Information Theory, 64(3), 1725–1737.
MathSciNet MATH Google Scholar
van de Geer, S. (2008). High-dimensional generalized linear models and the Lasso. The Annals of Statistics, 36, 614–645.
MathSciNet MATH Google Scholar
van de Geer, S. (2014). Weakly decomposable regularization penalties and structured sparsity. Scandinavian Journal of Statistics, 41(1), 72–86. https://doi.org/10.1111/sjos.12032.
Article MathSciNet MATH Google Scholar
van de Geer, S., Buhlmann, P. (2009). On the conditions used to prove oracle results for the Lasso. Electronic Journal of Statistics, 3, 1360–1392. https://doi.org/10.1214/09-EJS506.
Article MathSciNet MATH Google Scholar
van de Geer, S., Lederer, J. (2013). The Bernstein–Orlicz norm and deviation inequalities. Probability Theory and Related Fields, 157, 225–250.
MathSciNet MATH Google Scholar
Vershynin, R. (2015). Estimation in high dimension : A geometric perspective. In G. Pfander (Ed.), Sampling theory, a renaissance, applied and numerical harmonic analysis (ANHA). Berlin, Heidelberg: Birkhäuser/Springer.
Google Scholar
Verzelen, N. (2012). Minimax risks for sparse regressions: Ultra-high dimensional phenomenons. Electronic Journal of Statistics, 6, 38–90. https://doi.org/10.1214/12-ejs666.
Article MathSciNet MATH Google Scholar
Wang, Z., Paterlini, S., Gao, F., Yang, Y. (2014). Adaptive minimax regression estimation over sparse lq-hulls. Journal of Machine Learning Research, 15(1), 1675–1711. http://dl.acm.org/citation.cfm?id=2627435.2638589.
Wei, F., Huang, J. (2010). Consistent group selection in high-dimensional linear regression. Bernoulli, 16(4), 1369–1384.
MathSciNet MATH Google Scholar
Yang, Y. (2004). Aggregating regression procedures to improve performance. Bernoulli, 10(1), 25–47.
MathSciNet MATH Google Scholar
Ye, F., Zhang, C. H. (2010). Rate minimaxity of the lasso and dantzig selector for the lq loss in lr balls. Journal of Machine Learning Research, 11, 3519–3540. http://dl.acm.org/citation.cfm?id=1756006.1953043.
Yuan, M., Lin, Y. (2006). Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(1), 49–67.
MathSciNet MATH Google Scholar

Download references

Acknowledgements

This work was supported by Conseil Régional de Basse-Normandie and partly by Institut Universitaire de France.

Author information

Authors and Affiliations

GREYC, Department of computer science, ENSICAEN, CNRS, Normandie Univ, 6, Bd du Maréchal Juin, 14050, Caen, France
Tung Duy Luu & Jalal Fadili
Department of mathematics, LMNO, CNRS, Normandie Univ, Université de Caen Normandie, BP 5186, 14032, Caen, France
Christophe Chesneau

Authors

Tung Duy Luu
View author publications
You can also search for this author in PubMed Google Scholar
Jalal Fadili
View author publications
You can also search for this author in PubMed Google Scholar
Christophe Chesneau
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Jalal Fadili.

Appendices

Prerequisites from convex analysis

We here collect some ingredients from convex analysis that are essential to our exposition.

Monotone conjugate

Lemma 2

Let g be a non-decreasing function on $\mathbb {R}_+$ that vanishes at 0. Then the following holds:

(i)
$g^+$ is a proper closed convex and non-decreasing function on $\mathbb {R}_+$ that vanishes at 0.
(ii)
If g is also closed and convex, then $g^{++}=g$.
(iii)
Let $f: t \in \mathbb {R}\mapsto g(|t|)$ such that f is differentiable on $\mathbb {R}$, where g is finite-valued, strictly convex and strongly coercive. Then $g^+$ is likewise finite-valued, strictly convex, strongly coercive, and $f^*=g^+ \circ |\cdot |$ is differentiable on $\mathbb {R}$. In particular, both g and $g^+$ are strictly increasing on $\mathbb {R}_+$.

Proof

(i)
By Bauschke and Combettes (2011, Proposition 13.11), $g^+$ is a closed convex function. We have $\inf _{t \ge 0} g(t)=-\sup _{t \ge 0} t \cdot 0 - g(t) = -g^+(0)$. Since g is non-decreasing and $g(0)=0$, then $g^+(0)=-\inf _{t \ge 0} g(t)=-g(0)=0$. In addition, by (5), we have $g^+(a) \ge a \cdot 0 - g(0)=0$, $\forall a \in \mathbb {R}_+$. This shows that $g^+$ is nonnegative and ${{\mathrm{dom}}}(g^+) \ne \emptyset $, and in turn, it is also proper.

Let a, b in $\mathbb {R}_+$ such that $a < b$. Then
$$\begin{aligned} g^+(a)-g^+(b)= & {} (\sup _{t \ge 0} t a - g(t)) - (\sup _{t' \ge 0} t' b - g(t')) \\\le & {} \sup _{t \ge 0} (t a - g(t) - t b + g(t)) \\= & {} \sup _{t \ge 0} t(a-b) = 0. \end{aligned}$$
That is, $g^+$ is non-decreasing on $\mathbb {R}_+$.
(ii)
This follows from Rockafellar (1996, Theorem 12.4).
(iii)
By definition of f, f is a finite-valued function on $\mathbb {R}$, strictly convex, differentiable and strongly coercive. It then follows from Hiriart-Urruty and Lemaréchal (2001, Corollary X.4.1.4) that $f^*$ enjoys the same properties. In turn, using the fact that both f and $f^*$ are even, we have $g^+$ is strongly coercive, and strict convexity of f (resp. $f^*$) is equivalent to that of g (resp. $g^+$). Altogether, this shows the first claim. We now prove that g vanishes only at 0 (and similarly for $g^+$). As g is non-decreasing and strictly convex, we have, for any $\rho \in ]0,1[$ and a, b in $\mathbb {R}_+$ such that $a < b$,
$$\begin{aligned} g(a)\le & {} g(\rho a + (1-\rho ) b) < \rho g(a) + (1-\rho ) g(b)\\\le & {} \rho g(b) + (1-\rho ) g(b) = g(b) . \end{aligned}$$

$\square $

Support function The support function of ${\mathcal C}\subset \mathbb {R}^p$ is

$$\begin{aligned} \sigma _{{\mathcal C}}({\varvec{\omega }})=\sup _{{\varvec{\theta }}\in {\mathcal C}} \langle {\varvec{\omega }},{\varvec{\theta }}\rangle . \end{aligned}$$

We recall the following properties whose proofs can be found in, e.g., Rockafellar (1996) and Hiriart-Urruty and Lemaréchal (2001).

Lemma 3

Let ${\mathcal C}$ be a non-empty set.

(i)
$\sigma _{\mathcal C}$ is proper lower semicontinuous (lsc) and sublinear.
(ii)
$\sigma _{\mathcal C}$ is finite-valued if and only if ${\mathcal C}$ is bounded.
(iii)
If $0 \in {\mathcal C}$, then $\sigma _{\mathcal C}$ is nonnegative.
(iv)
If ${\mathcal C}$ is convex and compact with $0 \in {{\mathrm{int}}}({\mathcal C})$, then $\sigma _{\mathcal C}$ is finite-valued and coercive.

Gauges and polars

Definition 3

(Polar set) Let ${\mathcal C}$ be a non-empty convex set. The set ${\mathcal C}^\circ $ given by

$$\begin{aligned} {\mathcal C}^\circ = \big \{ {\varvec{\eta }}\in \mathbb {R}^p \;:\; \langle {\varvec{\eta }},{\varvec{\theta }}\rangle \le 1 \quad \forall {\varvec{\theta }}\in {\mathcal C} \big \} \end{aligned}$$

is called the polar of ${\mathcal C}$.

The set ${\mathcal C}^\circ $ is closed convex and contains the origin. When ${\mathcal C}$ is also closed and contains the origin, then it coincides with its bipolar, i.e., ${\mathcal C}^{\circ \circ }={\mathcal C}$.

Let ${\mathcal C}\subseteq \mathbb {R}^p$ be a non-empty closed convex set containing the origin. The gauge of ${\mathcal C}$ is the function $\gamma _{\mathcal C}$ defined on $\mathbb {R}^p$ by

$$\begin{aligned} \gamma _{\mathcal C}({\varvec{\theta }}) = \inf \big \{ \lambda > 0 \;:\; {\varvec{\theta }}\in \lambda {\mathcal C} \big \} . \end{aligned}$$

As usual, $\gamma _{\mathcal C}({\varvec{\theta }}) = + \infty $ if the infimum is not attained.

Lemma 4 hereafter recaps the main properties of a gauge that we need. In particular, (ii) is a fundamental result of convex analysis that states that there is a one-to-one correspondence between gauge functions and closed convex sets containing the origin. This allows to identify sets from their gauges, and vice versa.

Lemma 4

(i)
$\gamma _{\mathcal C}$ is a nonnegative, lsc and sublinear function.
(ii)
${\mathcal C}$ is the unique closed convex set containing the origin such that
$$\begin{aligned} {\mathcal C}= \big \{ {\varvec{\theta }}\in \mathbb {R}^p \;:\; \gamma _{\mathcal C}({\varvec{\theta }}) \le 1 \big \} . \end{aligned}$$
(iii)
$\gamma _{\mathcal C}$ is finite-valued if, and only if, $0 \in {{\mathrm{int}}}({\mathcal C})$, in which case $\gamma _{\mathcal C}$ is Lipschitz continuous.
(iv)
$\gamma _{\mathcal C}$ is finite-valued and coercive if, and only if, ${\mathcal C}$ is compact and $0 \in {{\mathrm{int}}}({\mathcal C})$.

See Vaiter et al. (2015a) for the proof.

Observe that thanks to sublinearity, local Lipschitz continuity valid for any finite-valued convex function is strengthened to global Lipschitz continuity. Moreover, $\gamma _{\mathcal C}$ is a norm, having ${\mathcal C}$ as its unit ball, if and only if ${\mathcal C}$ is bounded with non-empty interior and symmetric.

We now define the polar gauge.

Definition 4

(Polar Gauge) The polar of a gauge $\gamma _{\mathcal C}$ is the function $\gamma _{\mathcal C}^\circ $ defined by

$$\begin{aligned} \gamma _{\mathcal C}^\circ ({\varvec{\omega }}) = \inf \big \{ \mu \ge 0 \;:\; \langle {\varvec{\theta }},{\varvec{\omega }}\rangle \le \mu \gamma _{\mathcal C}({\varvec{\theta }}), \forall {\varvec{\theta }} \big \} . \end{aligned}$$

An immediate consequence is that gauges polar to each other have the property

$$\begin{aligned} \langle {\varvec{\theta }},{\varvec{u}}\rangle \le \gamma _{\mathcal C}({\varvec{\theta }}) \gamma _{\mathcal C}^\circ ({\varvec{u}}) \quad \forall ({\varvec{\theta }},{\varvec{u}}) \in {{\mathrm{dom}}}(\gamma _{\mathcal C}) \times {{\mathrm{dom}}}(\gamma _{\mathcal C}^\circ ) , \end{aligned}$$

(42)

just as dual norms satisfy a duality inequality. In fact, polar pairs of gauges correspond to the best inequalities of this type.

Lemma 5

Let ${\mathcal C}\subseteq \mathbb {R}^p$ be a closed convex set containing the origin. Then,

(ii)
$\gamma _{\mathcal C}^\circ $ is a gauge function and $\gamma _{\mathcal C}^{\circ \circ }=\gamma _{\mathcal C}$.
(iii)
$\gamma _{\mathcal C}^\circ =\gamma _{{\mathcal C}^\circ }$, or equivalently
$$\begin{aligned} {\mathcal C}^\circ = \big \{ {\varvec{\theta }}\in \mathbb {R}^p \;:\; \gamma _{\mathcal C}^\circ ({\varvec{\theta }}) \le 1 \big \} . \end{aligned}$$
(iv)
The gauge of ${\mathcal C}$ and the support function of ${\mathcal C}$ are mutually polar, i.e.,
$$\begin{aligned} \gamma _{\mathcal C}= \sigma _{{\mathcal C}^\circ } \quad \text {and} \quad \gamma _{{\mathcal C}^\circ } = \sigma _{\mathcal C}~. \end{aligned}$$

See Rockafellar (1996), Hiriart-Urruty and Lemaréchal (2001) and Vaiter et al. (2015a) for the proof.

Expectation of the inner product

We start with some definitions and notations that will be used in the proof. For a non-empty closed convex set ${\mathcal C}\in \mathbb {R}^p$, we denote ${\big ({\mathcal C}\big )}^0$ its minimal selection, i.e., the element of minimal norm in ${\mathcal C}$. This element is of course unique. For a proper lsc and convex function f and $\gamma > 0$, its Moreau envelope (or Moreau–Yosida regularization) is defined by

$$\begin{aligned} {}^{f}\gamma ({\varvec{\theta }})&{\mathop {=}\limits ^{\text { def}}}\min _{\overline{{\varvec{\theta }}}\in \mathbb {R}^p} \frac{1}{2\gamma }\big \Vert \overline{{\varvec{\theta }}}- {\varvec{\theta }}\big \Vert _{2}^2 + f(\overline{{\varvec{\theta }}}). \end{aligned}$$

The Moreau envelope enjoys several important properties that we collect in the following lemma.

Lemma 6

Let f be a finite-valued and convex function. Then

(i)
${\left( {}^{f}\gamma ({\varvec{\theta }})\right) }_{\gamma > 0}$ is a decreasing net, and $\forall {\varvec{\theta }}\in \mathbb {R}^p$, ${}^{f}\gamma ({\varvec{\theta }}) \nearrow f({\varvec{\theta }})$ as $\gamma \searrow 0$.
(ii)
${}^{f}\gamma \in C^1(\mathbb {R}^p)$ with $\gamma ^{-1}$-Lipschitz continuous gradient.
(iii)
$\forall {\varvec{\theta }}\in \mathbb {R}^p$, $\nabla {}^{f}\gamma ({\varvec{\theta }}) \rightarrow {\big (\partial f({\varvec{\theta }})\big )}^0$ and $\big \Vert \nabla {}^{f}\gamma ({\varvec{\theta }})\big \Vert _{2} \nearrow \big \Vert {\big (\partial f({\varvec{\theta }})\big )}^0\big \Vert _{2}$ as $\gamma \searrow 0$.

Proof

(i) Bauschke and Combettes (2011, Proposition 12.32). (ii) Bauschke and Combettes (2011, Proposition 12.29). (iii) Since f is finite-valued and convex, it is subdifferentiable everywhere and its subdifferential is a maximal monotone operator with full domain $\mathbb {R}^p$, and the result follows from Bauschke and Combettes (2011, Corollary 23.46(i)). $\square $

We are now equipped to prove the following important result. It will be proved here using Moreau–Yosida regularization. Yet another alternative proof could be based on mollifiers for approximating subdifferentials. Our result hereafter turns out to be instrumental to study EWA in the low-temperature regime for general penalties.

Proposition 5

Let the density $\mu _n$ in (2), where

(a)
F satisfies Assumptions (H.1)–(H.2);
(b)
J is a finite-valued lower-bounded convex function, and $\exists R > 0$ and $\rho \ge 0$, such that $\forall {\varvec{\theta }}\in \mathbb {R}^p$, $\big \Vert {\big (\partial J({\varvec{\theta }})\big )}^0\big \Vert _{2} \le R \left\| {\varvec{\theta }}\right\| _{2}^\rho $;
(c)
and $V_n$ is coercive.

Then, $\forall \overline{{\varvec{\theta }}}\in \mathbb {R}^p$,

$$\begin{aligned} \mathbb {E}_{\mu _n}\left[ \langle {\big (\partial V_n({\varvec{\theta }})\big )}^0,\overline{{\varvec{\theta }}}-{\varvec{\theta }}\rangle \right] = -p\beta . \end{aligned}$$

This result covers of course the situation where J fulfills (H.3). In this case, since $\partial J({\varvec{\theta }}) \subset {\mathcal C}^\circ $ by Theorem 1(i), we have $\rho =0$ and $R={{\mathrm{diam}}}({\mathcal C}^\circ )$, the diameter of the convex compact set ${\mathcal C}^\circ $ containing the origin. It can be shown that, when $F(\cdot ,\varvec{y})$ is strongly coercive, the coercivity assumption (c) can be equivalently stated as $J_{\infty }({\varvec{\theta }}) > 0$, $\forall {\varvec{\theta }}\in \ker (\varvec{X}) \setminus \left\{ 0 \right\} $, where $J_\infty $ is the recession/asymptotic function of J, see e.g., Rockafellar and Wets (1998).

Proof

Let $V^\gamma _n({\varvec{\theta }}) {\mathop {=}\limits ^{\text { def}}}\tfrac{1}{n}F(\varvec{X}{\varvec{\theta }},\varvec{y})+\lambda _n {}^{J}\gamma ({\varvec{\theta }})$ and define $\mu ^{\gamma }_n({\varvec{\theta }}) {\mathop {=}\limits ^{\text { def}}}\exp {\left( -V^\gamma _n({\varvec{\theta }})/\beta \right) }/Z$, where $0< Z < +\infty $ is the normalizing constant of the density $\mu _n$. Assumption (H.1) and Lemma 6(ii)–(iii) tell us that $V^\gamma _n \in C^1(\mathbb {R}^p)$ and $\nabla V^\gamma _n({\varvec{\theta }}) \rightarrow {\big (\partial V_n({\varvec{\theta }})\big )}^0$ as $\gamma \rightarrow 0$. Thus

$$\begin{aligned} \mathbb {E}_{\mu _n}\left[ \langle {\big (\partial V_n({\varvec{\theta }})\big )}^0,\overline{{\varvec{\theta }}}-{\varvec{\theta }}\rangle \right]&= \int _{\mathbb {R}^p} \lim _{\gamma \rightarrow 0} \langle \mu ^\gamma _n({\varvec{\theta }})\nabla V^\gamma _n({\varvec{\theta }}),\overline{{\varvec{\theta }}}-{\varvec{\theta }}\rangle \mathrm{d}{\varvec{\theta }}. \end{aligned}$$

We now check that $\langle \mu ^\gamma _n({\varvec{\theta }})\nabla V^\gamma _n({\varvec{\theta }}),\overline{{\varvec{\theta }}}-{\varvec{\theta }}\rangle $ is dominated by an integrable function. From the definition of the Moreau envelope, we have

$$\begin{aligned} V^\gamma _n({\varvec{\theta }}) = \min _{\overline{{\varvec{\theta }}}\in \mathbb {R}^p} \tfrac{1}{n}F(\varvec{X}{\varvec{\theta }},\varvec{y}) + \lambda _n{\big (J({\varvec{\theta }}- \overline{{\varvec{\theta }}}) + \frac{1}{2\gamma }\big \Vert \overline{{\varvec{\theta }}}\big \Vert _{2}^2\big )} . \end{aligned}$$

From coercivity of $V_n$, the objective in the $\min $ is also coercive in $({\varvec{\theta }},\overline{{\varvec{\theta }}})$ by Rockafellar and Wets (1998, Exercise 3.29(b)). It then follows from Rockafellar and Wets (1998, Theorem 3.31) that $V^\gamma _n$ is also coercive. In turn, Rockafellar and Wets (1998, Theorem 11.8(c) and 3.26(a)) allow to assert that for some $a \in ]0,+\infty [$, $\exists b \in ]-\infty ,+\infty [$ such that for all $\gamma > 0$ and ${\varvec{\theta }}\in \mathbb {R}^p$

$$\begin{aligned} \mu ^{\gamma }_n({\varvec{\theta }}) \le \exp {\left( -a\left\| {\varvec{\theta }}\right\| _{2}-b\right) }/Z . \end{aligned}$$

(43)

Lemma 6-(iii) and assumption (b) on J entail that for any ${\varvec{\theta }}\in \mathbb {R}^p$,

$$\begin{aligned} \big \Vert \nabla {}^{J}\gamma ({\varvec{\theta }})\big \Vert _{2} \le \big \Vert {\big (\partial J({\varvec{\theta }})\big )}^0\big \Vert _{2} \le R \left\| {\varvec{\theta }}\right\| _{2}^\rho . \end{aligned}$$

Altogether, we have

$$\begin{aligned}&\big |\langle \mu ^\gamma _n({\varvec{\theta }})\nabla V^\gamma _n({\varvec{\theta }}),\overline{{\varvec{\theta }}}-{\varvec{\theta }}\rangle \big |\\&\quad \le \mu ^\gamma _n({\varvec{\theta }}) {\left( \big |\langle \varvec{X}^\top \tfrac{1}{n}\nabla F(\varvec{X}{\varvec{\theta }},\varvec{y}),\overline{{\varvec{\theta }}}-{\varvec{\theta }}\rangle \big |+\lambda _n\big \Vert \nabla {}^{J}\gamma ({\varvec{\theta }})\big \Vert _{2}\big \Vert \overline{{\varvec{\theta }}}-{\varvec{\theta }}\big \Vert _{2}\right) }\\&\quad \le C Z^{-1}\exp {\left( -F(\varvec{X}{\varvec{\theta }},\varvec{y})/(n\beta )\right) } \big |\langle \tfrac{1}{n}\nabla F(\varvec{X}{\varvec{\theta }},\varvec{y}),\varvec{X}(\overline{{\varvec{\theta }}}-{\varvec{\theta }})\rangle \big | \\&\qquad + (Z\exp {b})^{-1} \lambda _n R \exp {\left( -a\left\| {\varvec{\theta }}\right\| _{2}\right) } \big \Vert {\varvec{\theta }}\big \Vert _{2}^\rho \big \Vert \overline{{\varvec{\theta }}}-{\varvec{\theta }}\big \Vert _{2} , \end{aligned}$$

where the constant $C > 0$ reflects the lower boundedness of J. It is easy to see that the function in this upper bound is integrable, where we also use (H.2). Hence, we can apply the dominated convergence theorem to get

$$\begin{aligned} \mathbb {E}_{\mu _n}\left[ \langle {\big (\partial V_n({\varvec{\theta }})\big )}^0,\overline{{\varvec{\theta }}}-{\varvec{\theta }}\rangle \right]&= \lim _{\gamma \rightarrow 0} \int _{\mathbb {R}^p} \langle \mu ^\gamma _n({\varvec{\theta }})\nabla V^\gamma _n({\varvec{\theta }}),\overline{{\varvec{\theta }}}-{\varvec{\theta }}\rangle \mathrm{d}{\varvec{\theta }}. \end{aligned}$$

Now, by simple differential calculus (chain and product rules), we have

$$\begin{aligned} \langle \mu ^\gamma _n({\varvec{\theta }})\nabla V^\gamma _n({\varvec{\theta }}),\overline{{\varvec{\theta }}}-{\varvec{\theta }}\rangle&= -\beta \langle \nabla \mu ^\gamma _n({\varvec{\theta }}),\overline{{\varvec{\theta }}}-{\varvec{\theta }}\rangle \\&= -\beta \sum _{i=1}^p \frac{\partial }{\partial {\varvec{\theta }}_i}{\left( \mu ^\gamma _n({\varvec{\theta }})(\overline{{\varvec{\theta }}}_i-{\varvec{\theta }}_i)\right) } - p\beta \mu ^\gamma _n({\varvec{\theta }}). \end{aligned}$$

Integrating the first term, we get by Fubini theorem and the Newton–Leibniz formula

$$\begin{aligned}&\int _{\mathbb {R}^{p-1}} {\left( \int _{\mathbb {R}} \frac{\partial }{\partial {\varvec{\theta }}_i}{\left( \mu ^\gamma _n({\varvec{\theta }})(\overline{{\varvec{\theta }}}_i-{\varvec{\theta }}_i)\right) } \mathrm{d}{\varvec{\theta }}_i\right) } \mathrm{d}{\varvec{\theta }}_1 \cdots \mathrm{d}{\varvec{\theta }}_{i-1} \mathrm{d}{\varvec{\theta }}_{i+1}\cdots \mathrm{d}{\varvec{\theta }}_p \\&\quad = \int _{\mathbb {R}^{p-1}} {\left[ \mu ^\gamma _n({\varvec{\theta }})(\overline{{\varvec{\theta }}}_i-{\varvec{\theta }}_i)\right] }_{\mathbb {R}} \mathrm{d}{\varvec{\theta }}_1 \cdots \mathrm{d}{\varvec{\theta }}_{i-1} \mathrm{d}{\varvec{\theta }}_{i+1}\cdots \mathrm{d}{\varvec{\theta }}_p = 0 , \end{aligned}$$

where we used coercivity of $V^\gamma _n$ (see (43)) to conclude that $\lim _{|{\varvec{\theta }}_i| \rightarrow +\infty } \mu ^\gamma _n({\varvec{\theta }})(\overline{{\varvec{\theta }}}_i-{\varvec{\theta }}_i) = 0$. For the second term, we have from Lemma 6(i) that $\mu ^\gamma _n \rightarrow \mu _n$ as $\gamma \rightarrow 0$. Thus, arguing again as in (43), we can apply the dominated convergence theorem to conclude that

$$\begin{aligned} \lim _{\gamma \rightarrow 0} \int _{\mathbb {R}^p} \mu ^\gamma _n({\varvec{\theta }}) \mathrm{d}{\varvec{\theta }}= \int _{\mathbb {R}^p} \mu _n({\varvec{\theta }}) \mathrm{d}{\varvec{\theta }}= 1. \end{aligned}$$

This concludes the proof. $\square $

About this article

Cite this article

Luu, T.D., Fadili, J. & Chesneau, C. Sharp oracle inequalities for low-complexity priors. Ann Inst Stat Math 72, 353–397 (2020). https://doi.org/10.1007/s10463-018-0693-6

Download citation

Received: 04 September 2017
Revised: 10 July 2018
Published: 19 November 2018
Issue Date: April 2020
DOI: https://doi.org/10.1007/s10463-018-0693-6

Keywords

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

Sharp oracle inequalities for low-complexity priors

Abstract

Similar content being viewed by others

Penalized Least Square in Sparse Setting with Convex Penalty and Non Gaussian Errors

Bounds on the Prediction Error of Penalized Least Squares Estimators with Convex Penalty

Oracle inequalities for weighted group lasso in high-dimensional misspecified Cox models

1 Introduction

1.1 Problem statement

1.2 Penalized estimators

1.3 Exponential weighted aggregation (EWA)

1.4 Oracle inequalities

1.5 Contributions

1.6 Relation to previous work

1.7 Paper organization

1.8 Notations

2 Estimation with low-complexity penalties

2.1 Data loss

Example 1

2.2 Prior penalty

2.3 Decomposability of the prior penalty

Definition 1

Theorem 1

Proof

Remark 1

2.4 Calculus with the prior family

Lemma 1

Proof

2.5 Examples

2.5.1 Lasso

2.5.2 Group Lasso

2.5.3 Analysis (group) Lasso

2.5.4 Anti-sparsity

2.5.5 Nuclear norm

3 Oracle inequalities for a general loss

3.1 Oracle inequality for \({\widehat{{\varvec{\theta }}}_n^{\mathrm {EWA}}}\)

Theorem 2

Remark 2

Proof

3.2 Oracle inequality for \({\widehat{{\varvec{\theta }}}^{\mathrm {PEN}}_n}\)

Theorem 3

Proof

Remark 3

3.3 Discussion of \({\widehat{{\varvec{\theta }}}_n^{\mathrm {EWA}}}\) vs \({\widehat{{\varvec{\theta }}}^{\mathrm {PEN}}_n}\)

3.4 Oracle inequalities in probability

3.4.1 Polyhedral penalty

Proposition 1

Proof

Remark 4

Corollary 1

Corollary 2

3.4.2 General penalty

Remark 5

4 Oracle inequalities for low-complexity linear regression

4.1 General penalty

Definition 2

Proposition 2

Proof

Proposition 3

Proof

4.2 Polyhedral penalty

Proposition 4

Proof

4.3 Applications

4.3.1 Lasso

Corollary 3

4.3.2 Group Lasso

Corollary 4

Proof

4.3.3 Analysis group Lasso

Corollary 5

4.3.4 Anti-sparsity

Corollary 6

4.3.5 Nuclear norm

Corollary 7

4.4 Discussion of minimax optimality

Example 2

Example 3

Example 4

Example 5

Notes