1 Introduction

We consider the following random design regression model:

$$\begin{aligned} Y_i = b({\varvec{X}}_i) + \varepsilon _i, \quad i=1,\dotsc ,n, \end{aligned}$$

where the variables \({\varvec{X}}_i\in {\mathbb {R}}^p\) are independent but not necessarily identically distributed, the noise variables \(\varepsilon _i\in {\mathbb {R}}\) are i.i.d. centered with finite variance \(\sigma ^2\) and independent from the \({\varvec{X}}_i\)s, and \(b:{\mathbb {R}}^p\rightarrow {\mathbb {R}}\) is a regression function. We seek to recover the function b on a domain \(A\subset {\mathbb {R}}^p\) from the observations \(({\varvec{X}}_i, Y_i)_{i=1,\dotsc ,n}\).

More precisely, we consider the following framework. We assume that the variance of the noise \(\sigma ^2\) is known. The variables \({\varvec{X}}_i\) are independent but not necessarily identically distributed; we denote by \(\mu _i\) the distribution of \({\varvec{X}}_i\), but we do not assume that \(\mu _i\) is known. However, we fix a reference measure \(\nu \) on A and we assume that \(\mu {:}{=} \frac{1}{n} \sum _{i=1}^{n} \mu _i\) admits a bounded density with respect to \(\nu \), so that we have \(\textrm{L}^2(A,\nu )\subset \textrm{L}^2(A,\mu )\). In particular, this assumption implies that \({{\,\textrm{supp}\,}}(\mu )\subset A\). Finally, we consider domains \(A\subset {\mathbb {R}}^p\) of the form \(A_1\times \cdots \times A_p\) where \(A_k\subset {\mathbb {R}}\), and we consider a measure \(\nu \) on A of the form \(\nu _1\otimes \cdots \otimes \nu _p\) with \(\nu _k\) supported on \(A_k\). Our goal is to estimate the regression function b on the domain A and to control the expected error with respect to the norm \(\left| \left| \cdot \right| \right| _\mu \) associated with the distribution of the \({\varvec{X}}_i\)s:

$$\begin{aligned} \forall t\in \textrm{L}^2(A,\mu ),\quad \left| \left| t\right| \right| _\mu ^2 {:}{=} \int _{A} t({\varvec{x}})^2 \,\textrm{d}{\mu ({\varvec{x}})} = \frac{1}{n} \sum _{i=1}^{n} \int _{A} t({\varvec{x}})^2 \,\textrm{d}{\mu _i({\varvec{x}})}. \end{aligned}$$

We can interpret the error with respect to this norm as a prediction risk: if \({\varvec{X}}_1',\dotsc ,{\varvec{X}}'_n\) are independent copies of \({\varvec{X}}_1,\dotsc ,{\varvec{X}}_n\), then we have:

$$\begin{aligned} \forall {\hat{b}}\text { estimator},\quad \left| \left| b - {\hat{b}}\right| \right| _\mu ^2 = \frac{1}{n} \sum _{i=1}^{n} {\mathbb {E}}\left[ \big (b({\varvec{X}}'_i) - {\hat{b}}({\varvec{X}}'_i) \big )^2\bigg |{{\varvec{X}}_1, \dotsc , {\varvec{X}}_n}\right] , \end{aligned}$$

which is the mean quadratic error at a new observation drawn from one of the distributions \(\mu _i\) chosen uniformly at random.

Nonparametric regression problems have a long history, and a large number of methods have been proposed. In this introduction, we focus on two main families of methods: kernel estimators and projection estimators. For reference books on the subject, see Efromovich (1999) regarding the projection method and Györfi et al. (2002) for the kernel method.

The classical estimator of Nadaraya (1964) and Watson (1964) consists of a quotient of estimators \({\widehat{bf}} /{\hat{f}}\), where \({\widehat{bf}}\) and \({\hat{f}}\) are kernel estimators of the functions bf and f (the function f being the common density of the \({\varvec{X}}_i\)s in the i.i.d case). This estimator can also be interpreted as locally fitting a constant by averaging the \(Y_i\)s, the locality being determined by the kernel, see the book of Györfi et al. (2002) or Tsybakov (2009). This method can then be generalized by replacing the local constant by a local polynomial, leading to the so-called local polynomial estimator.

The main drawback of the Nadaraya–Watson estimator is that it relies on an estimator of the density of the \({\varvec{X}}_i\)s. As such, the rate of convergence depends on the regularity of f, and two smoothing parameters have to be chosen. A popular solution is to choose the same bandwidth for both estimators using leave-one-out cross-validation. This method works well in practice and has been proven consistent by Härdle and Marron (1985) (see also Chapter 8 in Györfi et al. (2002)). Recently, Comte and Marie (2021) have proposed to use the Penalized Comparison to Overfitting method (PCO), a bandwidth selection method developed by Lacour et al. (2017) for kernel density estimation, to select separately the bandwidths of the numerator and the denominator of the Nadaraya–Watson estimator. Their estimator matches the performance of the single-bandwidth CV estimator when the noise is high, but the latter is better when the noise is small. Other bandwidth selection methods exist, such as plug-in or bootstrap; see Köhler et al. (2014) for an extensive survey and comparison of the different bandwidth selection methods for the local linear estimator.

Another approach is to use a projection estimator. The idea is to minimize a least squares contrast over finite-dimensional spaces of functions \(\{S_{{\varvec{m}}}:{{\varvec{m}}\in {\mathscr {M}}_n}\}\) called models:

$$\begin{aligned} {\hat{b}}_{{\varvec{m}}} {:}{=} {{\,\mathrm{arg\, min}\,}}_{t\in S_{{\varvec{m}}}} \frac{1}{n} \sum _{i=1}^{n} \big ( Y_i - t({\varvec{X}}_i) \big )^2, \end{aligned}$$

the model collection \({\mathscr {M}}_{n}\) being allowed to depend on the number of observations. This method overcomes the problems of the Nadaraya–Watson estimator: it does not need to estimate the density of the \({\varvec{X}}_i\)s, and only one model selection procedure is required. Moreover, it can provide a sparse representation of the estimator. This approach was developed in a fixed design setting by Birgé and Massart (1998); Barron et al. (1999) and Baraud (2000). In particular, the papers of Baraud (2000, 2002) provide a model selection procedure that optimizes the bias-variance compromise under weak assumptions on the moments of the noise distribution. They obtain an estimator that is adaptive both in the fixed and random design setting when the domain A is compact.

The non-compact case has been studied recently in the simple regression setting (\(p=1\)) by Comte and Genon-Catalot (2020a, 2020b). They use non-compactly supported bases, specifically the Hermite basis (supported on \({\mathbb {R}}\)) and the Laguerre basis (supported on \({\mathbb {R}}_+\)), to construct their estimator. Significant attention has been paid to these bases in the past years since they exhibit nice mathematical properties that are useful for solving inverse problems (Mabon, 2017; Comte and Genon-Catalot, 2018; Sacko, 2020). Non-compactly supported bases also avoid issues concerning the choice of support. When A is compact, the theory assumes it is fixed a priori. In practice, however, the support is generally determined using the data, although this dependency between data and support is not taken into account in the theoretical development. Working with a non-compact domain, for example \({\mathbb {R}}\) or \({\mathbb {R}}_+\), allows us to bypass this issue.

Concerning the regression problem, difficulties arise when we go from the compact case to the non-compact case. When A is compact, it is usual to assume that the density of the \({\varvec{X}}_i\)s is bounded from below by some positive constant \(f_0\). In the non-compact case, this assumption fails. Instead, one must study the minimum eigenvalue of a certain random matrix. This question has been studied in the simple regression case (\(p=1\)) by Cohen et al. (2013), using the matrix concentration inequalities of Tropp (2012). However, their results are obtained under the assumption that the regression function is bounded by a known quantity, and they do not provide a model selection procedure.

We make the following contributions in our paper. We extend the results of Comte and Genon-Catalot (2020a) to the multiple regression case (\(p\ge 2\)) with more general assumptions on the design, and we improve their oracle inequality under the empirical norm (see Theorem 2). Our work generalizes the results of Baraud (2002) to the non-compact case and improves their results in the compact case (see Theorem 3). We do so by combining the fixed design results of Baraud (2000) with a more refined study of the discrepancy between the empirical norm and the \(\mu \)-norm. This discrepancy is expressed in terms of the deviation of the minimum eigenvalue of a random matrix, whose deviation probability we control with the concentration inequalities of Tropp (2012) and Gittens and Tropp (2011). Finally, our estimator is constructed as a projection estimator on a tensorized basis whose coefficients are computed using hypermatrix calculus, and it can be implemented in practice. This feasibility is illustrated in Sect. 5, which also shows that the procedure works well.

Outline of the paper In Sect. 2, we define the projection estimator. In Sect. 3, we study the probability that the empirical norm and the \(\mu \)-norm depart from each other and we derive an upper bound on the \(\mu \)-risk of our estimator. In Sect. 4, we propose a model selection procedure and we prove that it satisfies an oracle inequality both in empirical norm and in \(\mu \)-norm. Finally, in Sect. 5, we study numerically the performance of our estimator. All the proofs are gathered in Sect. 7.

Notations

  • \({\mathbb {E}}_{{\varvec{X}}} {:}{=} {\mathbb {E}}\left[ {\,\cdot }|{{\varvec{X}}_1,\dotsc ,{\varvec{X}}_n}\right] \), \({\mathbb {P}}_{{\varvec{X}}} {:}{=} {\mathbb {P}}\left[ {\,\cdot }|{{\varvec{X}}_1,\dotsc ,{\varvec{X}}_n}\right] \), \({{\,\textrm{Var}\,}}_{{\varvec{X}}} {:}{=} {{\,\textrm{Var}\,}}( \,\cdot \,\vert \,{\varvec{X}}_1,\dotsc ,{\varvec{X}}_n )\), where \({\varvec{X}} = ({\varvec{X}}_1,\dotsc ,{\varvec{X}}_n)\).

  • If \(\pi \) is a measure on A, we write \(\left| \left| \cdot \right| \right| _\pi \) and \({\langle }{ \cdot , \cdot }{\rangle }_\pi \) the norm and the inner product weighted by the measure \(\pi \).

  • We denote by \({\langle }{ \cdot , \cdot }{\rangle }_n\) and \(\left| \left| \cdot \right| \right| _n\) the empirical inner product and the empirical norm, defined as \({\langle }{ t, s }{\rangle }_n {:}{=} \frac{1}{n} \sum _{i=1}^{n} t({\varvec{X}}_i)s({\varvec{X}}_i)\) and \( \left| \left| t\right| \right| _n^2 {:}{=} \frac{1}{n} \sum _{i=1}^{n} t({\varvec{X}}_i)^2\). If \({\textbf{u}} \in {\mathbb {R}}^n\) is a vector, we also write \(\left| \left| {\textbf{u}}\right| \right| _n^2 {:}{=} \frac{1}{n} \sum _{i=1}^{n} u_i^2\).

2 Projection estimator

In our setting, the domain is a Cartesian product \(A=A_1\times \cdots \times A_p\) and \(\nu =\nu _1\otimes \cdots \otimes \nu _p\) where \(\nu _k\) is supported on \(A_k\). For each \(i\in \{1,\dotsc ,p\}\), we consider \((\varphi ^i_{j})_{j\in {\mathbb {N}}}\) an orthonormal basis of \(\textrm{L}^2(A_i, \textrm{d}\nu _i)\) and we form an orthonormal basis of \(\textrm{L}^2(A, \textrm{d}\nu )\) by tensorization:

$$\begin{aligned} \forall {\varvec{j}}\in {\mathbb {N}}^p,\quad \forall {\varvec{x}}\in A,\quad \varphi _{{\varvec{j}}}({\varvec{x}}) {:}{=} (\varphi ^1_{j_1}\otimes \cdots \otimes \varphi ^p_{j_p})({\varvec{x}}) {:}{=} \varphi ^1_{j_1}(x_1) \times \cdots \times \varphi ^p_{j_p}(x_p). \end{aligned}$$

For \({\varvec{m}}\in {\mathbb {N}}_+^p\), we set \(S_{{\varvec{m}}} {:}{=} {{\,\textrm{Span}\,}}(\varphi _{{\varvec{j}}} : {\varvec{j}}\le {\varvec{m}}-\textbf{1})\) and we write \(D_{{\varvec{m}}} {:}{=} m_1\cdots m_p\) its dimension. We estimate b by minimizing a least squares contrast on \(S_{{\varvec{m}}}\):

$$\begin{aligned} {\hat{b}}_{{\varvec{m}}} {:}{=} {{\,\mathrm{arg\, min}\,}}_{t\in S_{{\varvec{m}}}} \frac{1}{n} \sum _{i=1}^{n} \big (Y_i - t({\varvec{X}}_i)\big )^2. \end{aligned}$$

If we expand \({\hat{b}}_{{\varvec{m}}}\) on the basis \((\varphi _{{\varvec{j}}})_{{\varvec{j}}\in {\mathbb {N}}^p}\), this problem can be written as:

$$\begin{aligned} {\hat{b}}_{{\varvec{m}}} = \sum _{{\varvec{j}}\le {\varvec{m}}-\textbf{1}} {\hat{a}}_{{\varvec{j}}}^{({\varvec{m}})}\, \varphi _{{\varvec{j}}}, \qquad \hat{{\textbf{a}}}^{({\varvec{m}})} {:}{=} {{\,\mathrm{arg\, min}\,}}_{{\varvec{a}}\in {\mathbb {R}}^{{\textbf{m}}}} \,\left| \left| {\textbf{Y}} - \widehat{{\varvec{\varPhi }}}_{{\varvec{m}}} \times _p {\textbf{a}}\right| \right| _{{\mathbb {R}}^n}^2, \end{aligned}$$
(1)

where \({\textbf{Y}} {:}{=} (Y_1,\dotsc , Y_n) \in {\mathbb {R}}^n\) and \(\widehat{{\varvec{\varPhi }}}_{{\varvec{m}}} \in {\mathbb {R}}^{n\times {\varvec{m}}}\) is defined as:

$$\begin{aligned} \forall i\in {\lbrace }1,\dotsc , n{\rbrace } \, \forall {\varvec{j}}\le {\varvec{m}}-\textbf{1},\quad \left[ {\widehat{{\varvec{\varPhi }}}_{{\varvec{m}}}}\right] _{i,{\varvec{j}}} {:}{=} \varphi _{{\varvec{j}}}({\varvec{X}}_i). \end{aligned}$$

By Lemma 8 in Appendix, problem (1) has a unique solution if and only if \(\widehat{{\varvec{\varPhi }}}_{{\varvec{m}}}\) is injective, and in that case:

$$\begin{aligned} \hat{{\textbf{a}}}^{({\varvec{m}})}&= (\widehat{{\varvec{\varPhi }}}_{{\varvec{m}}}^*\times _1\widehat{{\varvec{\varPhi }}}_{{\varvec{m}}})^{-1} \times _p \widehat{{\varvec{\varPhi }}}_{{\varvec{m}}}^*\times _1 {\textbf{Y}}\\&= \frac{1}{n} {\widehat{\textbf{G}}}_{{\varvec{m}}}^{-1} \times _p \widehat{{\varvec{\varPhi }}}_{{\varvec{m}}}^*\times _1 {\textbf{Y}}, \end{aligned}$$

where \([\widehat{{\varvec{\varPhi }}}_{{\varvec{m}}}^*]_{{\varvec{j}}, i} = [\widehat{{\varvec{\varPhi }}}_{{\varvec{m}}}]_{i,{\varvec{j}}}\) and where \({\widehat{\textbf{G}}}_{{\varvec{m}}}\) is the Gram hypermatrix of \((\varphi _{{\varvec{j}}})_{ {\varvec{j}}\le {\varvec{m}}-\textbf{1}}\) relatively to the empirical inner product \({\langle }{ \cdot , \cdot }{\rangle }_n\):

$$\begin{aligned} \forall {\varvec{j}}, {\varvec{k}}\le {\varvec{m}}-\textbf{1},\quad \left[ {{\widehat{\textbf{G}}}_{{\varvec{m}}} }\right] _{{\varvec{j}},{\varvec{k}}}{:}{=} {\langle }{ \varphi _{{\varvec{j}}}, \varphi _{{\varvec{k}}} }{\rangle }_n . \end{aligned}$$

Notice that \(\widehat{{\varvec{\varPhi }}}_{{\varvec{m}}}\) is injective if and only if \({\widehat{\textbf{G}}}_{{\varvec{m}}}\) is invertible, that is if and only if \(\left| \left| \cdot \right| \right| _n\) is a norm on \(S_{{\varvec{m}}}\).
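For concreteness, the following minimal sketch (in Python with numpy; the same convention holds for all code snippets in this paper, which are illustrative sketches rather than the exact implementation used in Sect. 5) shows how the coefficients \(\hat{{\textbf{a}}}^{({\varvec{m}})}\) and the empirical Gram matrix can be computed. The multi-indexed hypermatrices are stored as ordinary arrays by flattening the multi-indices \({\varvec{j}}\le {\varvec{m}}-\textbf{1}\), so that the hypermatrix products \(\times _1\) and \(\times _p\) reduce to matrix products; the `bases` interface is a hypothetical convention of the sketch.

```python
import itertools
import numpy as np

def design_matrix(X, bases, m):
    """Flattened design matrix [phi_j(X_i)]_{i,j} of the tensorized basis.

    X     : (n, p) array of covariates.
    bases : list of p callables; bases[k](j, x) evaluates the j-th function of
            the k-th univariate basis at the points x (hypothetical interface).
    m     : tuple (m_1, ..., m_p); all multi-indices j <= m - 1 are used.
    """
    n, p = X.shape
    cols = []
    for j in itertools.product(*(range(mk) for mk in m)):
        col = np.ones(n)
        for k in range(p):
            col *= bases[k](j[k], X[:, k])
        cols.append(col)
    return np.column_stack(cols)        # shape (n, D_m), with D_m = m_1 * ... * m_p

def projection_estimator(X, Y, bases, m):
    """Least squares coefficients a_hat and empirical Gram matrix G_hat."""
    Phi = design_matrix(X, bases, m)
    n = Phi.shape[0]
    G_hat = Phi.T @ Phi / n                           # [G_hat]_{j,k} = <phi_j, phi_k>_n
    a_hat = np.linalg.solve(n * G_hat, Phi.T @ Y)     # a_hat = (1/n) G_hat^{-1} Phi^* Y
    return a_hat, G_hat
```

The solve step assumes that \({\widehat{\textbf{G}}}_{{\varvec{m}}}\) is invertible, in accordance with the equivalence stated above.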

3 Bound on the risk of the estimator

Let us start with the classical bias-variance decomposition of the empirical risk. In our context, this result is given by the next Proposition.

Proposition 1

If \({\widehat{\textbf{G}}}_{{\varvec{m}}}\) is invertible, then we have:

$$\begin{aligned} {\mathbb {E}}_{{\varvec{X}}}\left| \left| b - {\hat{b}}_{{\varvec{m}}}\right| \right| _n^2 = \inf _{t\in S_{{\varvec{m}}}} \left| \left| b - t\right| \right| _n^2 + \sigma ^2 \frac{D_{{\varvec{m}}}}{n}. \end{aligned}$$

As a consequence, if \({{\widehat{\textbf{G}}}}_{{\varvec{m}}}\) is invertible a.s, then we have:

$$\begin{aligned} {\mathbb {E}}\left| \left| b - {\hat{b}}_{{\varvec{m}}}\right| \right| _n^2 \le \inf _{t\in S_{{\varvec{m}}}} \left| \left| b - t\right| \right| _\mu ^2 + \sigma ^2 \frac{D_{{\varvec{m}}}}{n}. \end{aligned}$$

Hereafter, we always assume that \({\hat{\textbf{G}}}_{{\varvec{m}}}\) is invertible a.s.

If we want to obtain a similar result for the \(\mu \)-norm, we need to understand how the empirical norm can deviate from the \(\mu \)-norm. More generally, we need to understand the relations between the different norms we have on the subspace \(S_{{\varvec{m}}}\) (\(\left| \left| \cdot \right| \right| _n\), \(\left| \left| \cdot \right| \right| _\mu \), \(\left| \left| \cdot \right| \right| _\nu \) and \(\left| \left| \cdot \right| \right| _\infty \)). It is well known that all norms are equivalent on finite-dimensional spaces; our question concerns the constants in this equivalence. We introduce the following notation: if \(\left| \left| \cdot \right| \right| _\alpha \) and \(\left| \left| \cdot \right| \right| _{\beta }\) are two norms on a space S, we define:

$$\begin{aligned} K_\beta ^\alpha (S) {:}{=} \sup _{t\in S\setminus \{0\}} \frac{\left| \left| t\right| \right| _\alpha ^2}{\left| \left| t\right| \right| _\beta ^2}, \end{aligned}$$

and when \(S=S_{{\varvec{m}}}\), we use the notation \(K^\alpha _\beta ({\varvec{m}}) {:}{=} K^\alpha _\beta (S_{{\varvec{m}}})\). The next lemma gives the value of \(K_\alpha ^\beta (S)\) when the norms are Euclidean.

Lemma 1

Let \((S, {\langle }{ \cdot , \cdot }{\rangle }_\alpha )\) be a d-dimensional Euclidean vector space equipped with an orthonormal basis \((\phi _1, \dots , \phi _d)\). Let \({\langle }{ \cdot , \cdot }{\rangle }_\beta \) be another inner product on S and let \(\textbf{G}\) be the Gram matrix of the basis \((\phi _1,\dotsc ,\phi _d)\) relatively to \({\langle }{ \cdot , \cdot }{\rangle }_\beta \), that is:

$$\begin{aligned} \textbf{G}{:}{=} \left[ { {\langle }{ \phi _j, \phi _k }{\rangle }_\beta }\right] _{1\le j, k\le d}. \end{aligned}$$

We have:

$$\begin{aligned} K_\alpha ^\beta (S) = \left| \left| \textbf{G}\right| \right| _{\textrm{op}} = \lambda _{\max }(\textbf{G}), \qquad K_\beta ^\alpha (S) = \left| \left| \textbf{G}^{-1}\right| \right| _{\textrm{op}} = \frac{1}{\lambda _{\min }(\textbf{G})}. \end{aligned}$$

The proof of Lemma 1 is identical to the proof of Lemma 3.1 in Baraud (2000), so we leave it out.
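Lemma 1 can be illustrated numerically: identifying a function of S with its coefficient vector in the \(\alpha \)-orthonormal basis, the extreme norm ratios are attained at eigenvectors of \(\textbf{G}\). A small sketch, with a Gram matrix generated at random purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
M = rng.standard_normal((d, d))
G = M @ M.T + 0.1 * np.eye(d)      # Gram matrix of (phi_1,...,phi_d) w.r.t. <.,.>_beta

# In the alpha-orthonormal basis, a function t with coefficient vector a has
# ||t||_alpha^2 = a.a  and  ||t||_beta^2 = a.G.a.
lam, V = np.linalg.eigh(G)
a_top, a_bot = V[:, -1], V[:, 0]   # eigenvectors attaining the extreme ratios

print((a_top @ G @ a_top) / (a_top @ a_top), lam[-1])       # K_alpha^beta(S) = lambda_max(G)
print((a_bot @ a_bot) / (a_bot @ G @ a_bot), 1.0 / lam[0])  # K_beta^alpha(S) = 1/lambda_min(G)

a = rng.standard_normal(d)         # any other coefficient vector gives a ratio in between
assert lam[0] - 1e-12 <= (a @ G @ a) / (a @ a) <= lam[-1] + 1e-12
```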

The next lemma provides a way to compute \(K_\alpha ^\infty (S)\) from an orthonormal basis when \(\left| \left| \cdot \right| \right| _\alpha \) is Euclidean. It is essentially the same as Lemma 1 in Birgé and Massart (1998).

Lemma 2

Let S be a space of bounded functions on A such that \(d{:}{=}\dim (S)\) is finite. Let \({\langle }{ \cdot , \cdot }{\rangle }_\alpha \) be an inner product on S. If \((\psi _1,\dotsc ,\psi _d)\) is an orthonormal basis of S, then we have:

$$\begin{aligned} K_\alpha ^\infty (S) = \left| \left| \sum _{j=1}^{d} \psi _j^2\right| \right| _\infty . \end{aligned}$$

The question we are interested in is how close the norms \(\left| \left| \cdot \right| \right| _n\) and \(\left| \left| \cdot \right| \right| _\mu \) are on \(S_{{\varvec{m}}}\). Following an idea similar to that of Cohen et al. (2013), let us define the event:

$$\begin{aligned} \forall \delta \in (0,1),\quad \varOmega _{{\varvec{m}}}(\delta ) {:}{=} {\lbrace }\forall t\in S_{{\varvec{m}}},\, \left| \left| t\right| \right| _\mu ^2 \le \frac{1}{1-\delta } \left| \left| t\right| \right| _n^2{\rbrace } = {\lbrace } K_n^\mu ({\varvec{m}}) \le \frac{1}{1-\delta } {\rbrace }. \end{aligned}$$
(2)

The key decomposition of the \(\mu \)-risk of \({\hat{b}}_{{\varvec{m}}}\) is given by the following Proposition.

Proposition 2

For all \(\delta \in (0,1)\), we have:

$$\begin{aligned} \begin{aligned} {\mathbb {E}}\left| \left| b - {\hat{b}}_{{\varvec{m}}}\right| \right| _{\mu }^2 \le&\left( 1 + \frac{2}{1-\delta } \left[ \frac{K_\mu ^\infty ({\varvec{m}})}{(1-\delta )n} \wedge 1\right] \right) \inf _{t\in S_{{\varvec{m}}}} \left| \left| b - t\right| \right| _\mu ^2 + \frac{2\sigma ^2D_{{\varvec{m}}}}{(1-\delta )n} \\&+ 2\left| \left| b\right| \right| _\mu ^2 \,{\mathbb {P}}\left[ \varOmega _{{\varvec{m}}}(\delta )^c\right] +{\mathbb {E}}\left[ K_n^\mu ({\varvec{m}}) \left| \left| {\textbf{Y}}\right| \right| _n^2\, \textbf{1}_{\varOmega _{{\varvec{m}}}(\delta )^c}\right] , \end{aligned} \end{aligned}$$

where \(K_n^\mu ({\varvec{m}})\) and \(K_\mu ^\infty ({\varvec{m}})\) are given by Lemmas 1 and 2.

We see that we need an upper bound on the probability of the event \(\varOmega _{{\varvec{m}}}(\delta )^c\). The following proposition is a consequence of the matrix Chernoff bound of Tropp (2012) (Theorem 5 in Appendix).

Proposition 3

For all \(\delta \in (0,1)\), we have:

$$\begin{aligned} {\mathbb {P}}\left[ \varOmega _{{\varvec{m}}}(\delta )^c \right] \le D_{{\varvec{m}}} \exp \left( -h(\delta ) \frac{n}{K_\mu ^\infty ({\varvec{m}})} \right) , \end{aligned}$$

where \(h(\delta ){:}{=} \delta + (1-\delta )\log (1-\delta ) \) and \(K_\mu ^\infty ({\varvec{m}})\) is given by Lemma 2.

Remark 1

The quantity \(K_\mu ^\infty ({\varvec{m}})\) is unknown but we have the following upper bound using Lemmas 1 and 2:

$$\begin{aligned} K_\mu ^\infty ({\varvec{m}}) \le K_\nu ^\infty ({\varvec{m}})\, K_\mu ^\nu ({\varvec{m}}) = \left( \sup _{{\varvec{x}}\in A} \sum _{{\varvec{j}}\le {\varvec{m}}-\textbf{1}} \varphi _{{\varvec{j}}}({\varvec{x}})^2\right) \left| \left| \textbf{G}_{{\varvec{m}}}^{-1}\right| \right| _\textrm{op}. \end{aligned}$$

The quantity \(\left| \left| \textbf{G}_{{\varvec{m}}}^{-1}\right| \right| _\textrm{op}\) is still unknown but can be estimated by plugging in \({{\widehat{\textbf{G}}}}_{{\varvec{m}}}\).
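As an illustration, the two factors of this bound can be computed as in the following sketch, which reuses the `design_matrix` helper of Sect. 2; approximating the supremum defining \(K_\nu ^\infty ({\varvec{m}})\) on a finite grid of evaluation points is a simplification of the sketch, not part of the theory.

```python
import numpy as np

def K_nu_inf(Phi_grid):
    """Approximate K_nu^inf(m) = sup_x sum_j phi_j(x)^2 on a finite grid.

    Phi_grid : (n_grid, D_m) evaluations of the tensorized basis on grid points,
               e.g. design_matrix(grid, bases, m) with a grid covering A.
    """
    return np.max(np.sum(Phi_grid ** 2, axis=1))

def plug_in_bound(Phi_grid, G_hat):
    """Empirical counterpart of K_nu^inf(m) * ||G_m^{-1}||_op from Remark 1,
    with the Gram matrix G_m replaced by its estimator G_hat (assumed invertible)."""
    lam_min = np.linalg.eigvalsh(G_hat)[0]          # smallest eigenvalue of G_hat
    return K_nu_inf(Phi_grid) * (1.0 / lam_min)
```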

Comte and Genon-Catalot (2020a) show in their Proposition 8 that, when one uses the Hermite or the Laguerre basis, the inverse of the Gram matrix is unbounded (it satisfies \(\Vert {\textbf{G}}_m^{-1}\Vert _{\textrm{op}} \gtrsim \sqrt{m}\)), while it is bounded in the compact case:

$$\begin{aligned} \left| \left| \textbf{G}_{{\varvec{m}}}^{-1}\right| \right| _\textrm{op}= \sup _{t\in S_{{\varvec{m}}}\setminus \{0\}} \frac{\left| \left| t\right| \right| _\nu ^2}{\left| \left| t\right| \right| _\mu ^2} \le \frac{1}{f_0}, \end{aligned}$$
(3)

where \(f_0\) is a positive lower bound on the density of the covariates. Hence, the least squares minimization problem will become highly unstable as the dimension of the projection space grows. That is why a form of regularization is needed if we want to control the \(\mu \)-risk of the estimator. For \(\alpha \) a positive constant, let us consider the following model collection:

$$\begin{aligned} {\mathscr {M}}^{(1)}_{n,\alpha } {:}{=} {\lbrace }{ {\varvec{m}}\in {\mathbb {N}}_+^p \,\vert \, K_{\nu }^\infty ({\varvec{m}}) \big ( \left| \left| \textbf{G}_{{\varvec{m}}}^{-1}\right| \right| _{\textrm{op}} \vee 1 \big ) \le \alpha \frac{n}{\log n} }{\rbrace }. \end{aligned}$$
(4)

Gathering Propositions 2 and 3, we obtain the following bound on the \(\mu \)-risk of \({\hat{b}}_{{\varvec{m}}}\) when \({\varvec{m}}\) belongs to \({\mathscr {M}}^{(1)}_{n,\alpha }\).

Theorem 1

Let us assume that \(b\in \textrm{L}^{2r}(\mu )\) for some \(r \in (1, +\infty ]\) and let \(r'\in [1,+\infty )\) be the conjugated index of r, that is: \(\frac{1}{r} + \frac{1}{r'} = 1\). For all \(\alpha \in (0, \frac{1}{2r'+1})\) and for all \({\varvec{m}}\in {\mathscr {M}}^{(1)}_{n,\alpha }\) we have:

$$\begin{aligned} {\mathbb {E}}\left| \left| b - {\hat{b}}_{{\varvec{m}}}\right| \right| _\mu ^2 \le C_n(\alpha , r') \inf _{t\in S_{{\varvec{m}}}} \left| \left| b - t\right| \right| _\mu ^2 + C'(\alpha , r')\, \sigma ^2\frac{D_{{\varvec{m}}}}{n} + \frac{C''\big (b, \sigma ^2, \alpha ,r\big )}{n\log n}, \end{aligned}$$

where the constants \(C_n(\alpha , r')\) and \(C'(\alpha , r')\) are given by:

$$\begin{aligned} C_n(\alpha , r') {:}{=} 1 + \frac{2}{1-\delta (\alpha , r')} \left( \frac{\alpha }{\big (1-\delta (\alpha , r')\big )\log n} \wedge 1 \right) ,\quad C'(\alpha , r') {:}{=} \frac{2}{1-\delta (\alpha , r')}, \end{aligned}$$

where \(\delta (\alpha , r')\in (0,1)\) tends to 1 as \(\alpha \) tends to \(\frac{1}{2r'+1}\), and where \(C''\big (b, \sigma ^2, \alpha ,r\big )\) is defined by (18).

Remark 2

Let us comment on the behavior of \(C_n(\alpha ,r')\) and \(C'(\alpha ,r')\):

  • \(C_n(\alpha , r')\) is bounded relatively to n;

  • \(C_n(\alpha , r')\ge 1\) and \(C'(\alpha , r')\ge 2\);

  • as \(\alpha \rightarrow \frac{1}{2r'+1}\) with n fixed, \(C_n(\alpha , r')\) and \(C'(\alpha , r')\) tend to \(+\infty \);

  • as \(n\rightarrow +\infty \) with \(\alpha \) and \(r'\) fixed, \(C_n(\alpha , r')\) tends to 1.

4 Adaptive estimator

We consider the empirical version of the model collection \({\mathscr {M}}^{(1)}_{n,\alpha }\) defined by (4):

$$\begin{aligned} {{\widehat{{\mathscr {M}}}}}^{\ (1)}_{n,\beta } {:}{=} {\lbrace }{ {\varvec{m}}\in {\mathbb {N}}_+^p \,\vert \, K_\nu ^\infty ({\varvec{m}}) \big ( \left| \left| {{\widehat{\textbf{G}}}}_{{\varvec{m}}}^{-1}\right| \right| _\textrm{op}\vee 1 \big ) \le \beta \frac{n}{\log n} }{\rbrace }, \end{aligned}$$

with \(\beta \) a positive constant. We choose \({\varvec{{\hat{m}}}}_1 \in {{\widehat{{\mathscr {M}}}}}^{\ (1)}_{n,\beta }\) by minimizing the following penalized least squares criterion:

$$\begin{aligned} {\varvec{{\hat{m}}}}_1 {:}{=} {{\,\mathrm{arg\, min}\,}}_{{\varvec{m}}\in {{\widehat{{\mathscr {M}}}}}^{\ (1)}_{n,\beta }} \left( -\left| \left| {\hat{b}}_{{\varvec{m}}}\right| \right| _n^2 + (1+\theta )\sigma ^2\frac{D_{{\varvec{m}}}}{n} \right) , \qquad \theta >0. \end{aligned}$$
(5)
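A possible implementation of this selection rule is sketched below, reusing the helpers of the previous sketches; the enumeration of candidate models and the computation of \(K_\nu ^\infty ({\varvec{m}})\) are left as inputs, and the constraint shown is the one defining \({{\widehat{{\mathscr {M}}}}}^{\ (1)}_{n,\beta }\) (the collection used later in the general case would simply square the operator norm).

```python
import numpy as np

def select_model(X, Y, bases, candidates, K_nu_inf_of, sigma2, theta, beta_const):
    """Penalized least squares selection (5) over the empirical collection.

    candidates  : list of tuples m = (m_1, ..., m_p) to be examined.
    K_nu_inf_of : callable m -> approximation of K_nu^inf(m) (e.g. the grid
                  approximation of the previous sketch); hypothetical helper.
    """
    n = len(Y)
    best_m, best_crit = None, np.inf
    for m in candidates:
        Phi = design_matrix(X, bases, m)             # sketch of Sect. 2
        G_hat = Phi.T @ Phi / n
        lam_min = np.linalg.eigvalsh(G_hat)[0]
        if lam_min <= 0:
            continue                                 # G_hat_m is not invertible
        inv_op_norm = 1.0 / lam_min                  # ||G_hat_m^{-1}||_op
        if K_nu_inf_of(m) * max(inv_op_norm, 1.0) > beta_const * n / np.log(n):
            continue                                 # m is outside the collection
        a_hat = np.linalg.solve(n * G_hat, Phi.T @ Y)
        crit = -a_hat @ G_hat @ a_hat + (1 + theta) * sigma2 * Phi.shape[1] / n
        if crit < best_crit:
            best_m, best_crit = m, crit
    return best_m
```

Here `a_hat @ G_hat @ a_hat` equals \(\left| \left| {\hat{b}}_{{\varvec{m}}}\right| \right| _n^2\), so the quantity minimized is exactly the criterion (5).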

Based on a result of Baraud (2000) for fixed design regression, we prove that \({\hat{b}}_{{\varvec{{\hat{m}}}}_1}\) automatically optimizes the bias-variance compromise in empirical norm over \({\mathscr {M}}^{(1)}_{n,\alpha }\), up to a constant and a remainder term.

Theorem 2

If \(b \in \textrm{L}^{2r}(\mu )\) for some \(r\in (1,+\infty ]\) and if \({\mathbb {E}}\left| \varepsilon _1 \right| ^q\) is finite for some \(q>6\), then there exists a constant \(\alpha _{\beta , r'}>0\) depending on \(\beta \) and \(r'\) (the conjugated index of r) such that for all \(\alpha \in (0, \alpha _{\beta , r'})\), the following upper bound on the risk of the estimator \({\hat{b}}_{{\varvec{{\hat{m}}}}_1}\) with \({\varvec{{\hat{m}}}}_1\) defined by (5) holds:

$$\begin{aligned} {\mathbb {E}}\left| \left| b - {\hat{b}}_{{\varvec{{\hat{m}}}}_1}\right| \right| _n^2 \le C(\theta ) \inf _{{\varvec{m}}\in {\mathscr {M}}_{n,\alpha }^{(1)}} \left( \inf _{t\in S_{{\varvec{m}}}} \left| \left| b - t\right| \right| _\mu ^2 + \sigma ^2\frac{D_{{\varvec{m}} }}{n} \right) + \sigma ^2\frac{\varSigma (\theta , q)}{n} + R_n, \end{aligned}$$

where \(C(\theta ) {:}{=} (2+8\theta ^{-1})(1+\theta )\), and where:

$$\begin{aligned} \varSigma (\theta , q) {:}{=} C''(\theta , q) \frac{{\mathbb {E}}\left| \varepsilon _1 \right| ^q}{\sigma ^q} \sum _{{\varvec{m}} \in {\mathbb {N}}_+^p} D_{{\varvec{m}}}^{-(\frac{q}{2}-2)}, \quad R_n {:}{=} C'(\left| \left| b\right| \right| _{\textrm{L}^{2r}(\mu )}, \sigma ^2) \frac{(\log n)^{(p-1)/r'}}{n^{\kappa (\alpha ,\beta )/r'}}, \end{aligned}$$

with \(\kappa (\alpha ,\beta )\) a positive constant satisfying \(\frac{\kappa (\alpha ,\beta )}{r'} > 1\) and \(\frac{\kappa (\alpha ,\beta )}{r'}\rightarrow 1\) as \(\alpha \rightarrow \alpha _{\beta , r'}\).

Remark 3

The term \(\varSigma (\theta ,q)\) is finite if \(q>6\). Indeed, let \(2\epsilon {:}{=} (\frac{q}{2}-2)-1>0\), we have:

$$\begin{aligned} \sum _{{\varvec{m}} \in {\mathbb {N}}_+^p} D_{{\varvec{m}}}^{-(\frac{q}{2}-2)} = \sum _{d=1}^{+\infty } {{\,\textrm{Card}\,}}{\lbrace }{ {\varvec{m}}\in {\mathbb {N}}_+^p \,\vert \, D_{{\varvec{m}}} =d }{\rbrace } \times d^{-(\frac{q}{2}-2)} \le \sum _{d=1}^{+\infty } \frac{{{\,\mathrm{\textrm{o}}\,}}(d^{\epsilon })}{d^{1+2\epsilon }} < +\infty , \end{aligned}$$

where we use Theorem 7 in Appendix.

Remark 4

The constant \(\alpha _{\beta , r'}\) is increasing with \(\beta \) and goes from 0 to \(\frac{1}{2r'+1}\). It is also decreasing with \(r'\) (so increasing with r) and tends to 0 as \(r'\rightarrow +\infty \) (as \(r\rightarrow 1\)).

To transfer the previous adaptive result from the empirical norm into the \(\mu \)-norm, we use once again concentration inequalities on the matrix \({{\widehat{\textbf{G}}}}_{{\varvec{m}}}\). However, we need to make a distinction between the compact case and the non-compact case. Indeed, when A is compact, we can make the usual assumption that the density \(\frac{\textrm{d}\mu }{\textrm{d}\nu }\) is bounded from below and apply the matrix Chernoff bound of Gittens and Tropp (2011), see Lemma 6. This lemma relies critically on the “bounded from below” assumption so it cannot work in the non-compact case.

To handle the non-compact case, we make use of the matrix Bernstein bound of Tropp (2012) instead (Theorem 6 in Appendix), see Lemma 7. This inequality is different from the matrix Chernoff bounds we have used so far, so we have to consider smaller model collections to make it work. In the following, we consider two cases:

  1. Compact case. We assume that there exists \(f_0>0\) such that for all \(x\in A\), \(\frac{\textrm{d}\mu }{\textrm{d}\nu }(x)>f_0\). In that case, \(\textbf{G}_{{\varvec{m}}}\) is always invertible and we have \(\left| \left| \textbf{G}_{{\varvec{m}}}^{-1}\right| \right| _\textrm{op}\le \frac{1}{f_0}\), see (3).

  2. General case. We consider smaller model collections:

    $$\begin{aligned}{} & {} {\mathscr {M}}_{n,\alpha }^{(2)} {:}{=} {\lbrace }{ {\varvec{m}}\in {\mathbb {N}}_+^p \,\vert \, K_{\nu }^\infty ({\varvec{m}}) \left( \left| \left| \textbf{G}_{{\varvec{m}}}^{-1}\right| \right| _\textrm{op}^2 \vee 1 \right) \le \alpha \frac{n}{\log n} }{\rbrace },\\{} & {} \quad {{\widehat{{\mathscr {M}}}}}_{n,\beta }^{\ (2)} {:}{=} {\lbrace }{ {\varvec{m}}\in {\mathbb {N}}_+^p \,\vert \, K_{\nu }^\infty ({\varvec{m}}) \left( \left| \left| {{\widehat{\textbf{G}}}}_{{\varvec{m}}}^{-1}\right| \right| _\textrm{op}^2 \vee 1 \right) \le \beta \frac{n}{\log n} }{\rbrace }, \end{aligned}$$

    where \(\alpha \) and \(\beta \) are positive constants and we choose \({\varvec{{\hat{m}}}}_2 \in {{\widehat{{\mathscr {M}}}}}_{n,\beta }^{\ (2)}\) as:

    $$\begin{aligned} {\varvec{{\hat{m}}}}_2 {:}{=} {{\,\mathrm{arg\, min}\,}}_{{\varvec{m}}\in {{\widehat{{\mathscr {M}}}}}_{n,\beta }^{\ (2)}} \left( -\left| \left| {\hat{b}}_{{\varvec{m}}}\right| \right| _n^2 + (1+\theta )\sigma ^2 \frac{D_{{\varvec{m}}}}{n} \right) ,\quad \theta > 0. \end{aligned}$$
    (6)

Theorem 3

Let \(r\in (1,+\infty ]\), let \(r'\in [1,+\infty )\) be its conjugated index and let us assume that b belongs to \(\textrm{L}^{2r}(\mu )\) and that \({\mathbb {E}}\left| \varepsilon _1 \right| ^q\) is finite for some \(q>6\).

\(\bullet \) Compact case. Let \(f_0>0\) such that \(\frac{\textrm{d}\mu }{\textrm{d}\nu }(x)\ge f_0\) for all \(x\in A\), there exists \(\beta _{f_0,r'}>0\) such that for all \(\beta \in (0, \beta _{f_0,r'})\), there exists \(\alpha _{\beta , r'}>0\) such that for all \(\alpha \in (0, \alpha _{\beta , r'})\), the following upper bound on the risk of the estimator \({\hat{b}}_{{\varvec{{\hat{m}}}}_1}\) with \({\varvec{{\hat{m}}}}_1\) defined by (5) holds:

$$\begin{aligned} \begin{aligned} {\mathbb {E}}\left| \left| b - {\hat{b}}_{{\varvec{{\hat{m}}}}_1}\right| \right| _\mu ^2&\le C(\theta , \beta , r) \inf _{{\varvec{m}}\in {\mathscr {M}}^{(1)}_{n,\alpha }} \left( \inf _{t\in S_{{\varvec{m}}}} \left| \left| b-t\right| \right| _\mu ^2 + \sigma ^2 \frac{D_{{\varvec{m}}}}{n}\right) \\ {}&+ C'(\beta , r) \sigma ^2 \frac{\varSigma (\theta ,q)}{n} + R_n, \end{aligned} \end{aligned}$$

where the remainder term is given by:

$$\begin{aligned} R_n = C''\big (\left| \left| b\right| \right| _{\textrm{L}^{2r}(\mu )}, \sigma ^2, \beta , r\big ) \left( n^{-\frac{\kappa (\alpha ,\beta )}{r'}} (\log n)^{\frac{p-1}{r'}} + n^{-\lambda (\beta , r, f_0)}\, (\log n)^{\frac{p-1}{r'} - 1} \right) , \end{aligned}$$

with \(\lambda (\beta ,r,f_0)>1\) and \(\frac{\kappa (\alpha , \beta )}{r'} >1\).

\(\bullet \) General case. Let \(B{:}{=} (\left| \left| \frac{\textrm{d}\mu }{\textrm{d}\nu }\right| \right| _\infty + \frac{2}{3})^{-1}\), there exists \(\beta _{B,r'}>0\) such that for all \(\beta \in (0, \beta _{B,r'})\), there exists \({\tilde{\alpha }}_{\beta , r'}>0\) such that for all \(\alpha \in (0, {\tilde{\alpha }}_{\beta , r'})\), the following upper bound on the risk of the estimator \({\hat{b}}_{{\varvec{{\hat{m}}}}_2}\) with \({\varvec{{\hat{m}}}}_2\) defined by (6) holds:

$$\begin{aligned} \begin{aligned} {\mathbb {E}}\left| \left| b - {\hat{b}}_{{\varvec{{\hat{m}}}}_2}\right| \right| _\mu ^2&\le C(\theta , \beta , r) \inf _{{\varvec{m}}\in {\mathscr {M}}_{n,\alpha }^{(2)}} \left( \inf _{t\in S_{{\varvec{m}}}} \left| \left| b-t\right| \right| _\mu ^2 + \sigma ^2 \frac{D_{{\varvec{m}}}}{n}\right) \\ {}&+ C'(\beta , r) \sigma ^2 \frac{\varSigma (\theta ,q)}{n} + R_n, \end{aligned} \end{aligned}$$

where the remainder term is given by:

$$\begin{aligned} R_n = C''\big (\left| \left| b\right| \right| _{\textrm{L}^{2r}(\mu )}, \sigma ^2, \beta , r\big ) \left( n^{-\frac{{\tilde{\kappa }}(\alpha ,\beta )}{r'}} (\log n)^{\frac{p-1}{r'}} + n^{-\lambda (\beta , r, B)}\, (\log n)^{\frac{p-1}{r'} - 1} \right) , \end{aligned}$$

with \(\lambda (\beta ,r,B)>1\) and \(\frac{{\tilde{\kappa }}(\alpha , \beta )}{r'} >1\).

This result shows that there is a range of values for the constant \(\beta \), depending on the integrability of b and on \(f_0\) (compact case) or \(\left| \left| \frac{\textrm{d}\mu }{\textrm{d}\nu }\right| \right| _\infty \) (general case), such that for the \(\mu \)-norm the estimator \({\hat{b}}_{{\varvec{{\hat{m}}}}}\) automatically optimizes the bias-variance trade-off (up to a constant and a remainder term) on \({\mathscr {M}}_{n,\alpha }\) for all \(\alpha \) in a range that depends on \(\beta \).

Remark 5

Theorem 3 improves previous results in the literature:

  1. In the compact case, we improve the result of Baraud (2002). Indeed, in that article the model collections are built by picking an “envelope model”, that is, a linear space \({\mathscr {S}}_n\) of finite dimension \(N_n\) of which all models are subspaces. Their assumptions concern the space \({\mathscr {S}}_n\): they assume that \(K_\nu ^\infty ({\mathscr {S}}_n) \le C^2 N_n\) for some constant \(C>0\) and they require that \(N_n \le C^{-1} \sqrt{ n/(\log n)^3 }\). In comparison, our procedure avoids the a priori choice of an envelope model and uses a looser constraint on the dimension of the models.

  2. In the non-compact case, we extend the results of Comte and Genon-Catalot (2020a) to the case \(p\ge 2\) without losing much on the assumptions: their result requires a moment of order 6 on the noise, whereas our result is obtained with a moment of order q, with \(q>6\). We also generalize their result by considering a non-i.i.d. design and by using a more general moment assumption on the regression function.

Remark 6

(Unknown variance) Throughout this work, we assume that \(\sigma ^2\) is known. To handle the case of an unknown variance, we can use the method proposed by Baraud (2000) in the fixed design setting. Using a residual least squares estimator of \(\sigma ^2\) in the penalized criterion for choosing the model, they prove (Theorem 6.1) that the resulting estimator of the regression function satisfies an oracle inequality. Starting from Baraud's result, and using the same arguments we used in this paper, we think one can obtain an oracle inequality for a projection estimator in the random design framework with unknown variance. We omit such a development for the sake of conciseness.

5 Numerical illustrations

In this section, we compare our estimator with the Nadaraya–Watson estimator on simulated data in the case \(p=1\) and \(p=2\).

Regression function We consider the following regression functions:

  1. \(b_1(x) {:}{=} \exp (-(x-1)^2) + \exp (-(x+1)^2)\),

  2. \(b_2(x) {:}{=} \frac{1}{1+x^2}\),

  3. \(b_3(x) {:}{=} x\cos (x)\),

  4. \(b_4(x) {:}{=} \left| x \right| \),

  5. \(b_5(x_1,x_2) {:}{=} \exp (-\frac{1}{2} [(x_1-1)^2+(x_2-1)^2]) + \exp (-\frac{1}{2} [(x_1+1)^2+(x_2+1)^2])\),

  6. \(b_6(x_1,x_2) {:}{=} 1/({1+x_1^2+x_2^2})\),

  7. \(b_7(x_1,x_2) {:}{=} \cos (x_1)\sin (x_2)\),

  8. \(b_8(x_1,x_2) {:}{=} \left| x_1x_2 \right| \).

The functions \(b_2\) and \(b_6\) are smooth bounded functions with a unique maximum at 0, so they should be an easy case. The functions \(b_1\) and \(b_5\) are smooth and bounded with two maxima. The functions \(b_3\) and \(b_7\) are smooth oscillating functions. Finally, the functions \(b_4\) and \(b_8\) are neither smooth nor bounded, and should be a harder case.

Distribution of \({\varvec{X}}\) For the sake of simplicity, we consider the case where \({\varvec{X}}_1, \dotsc , {\varvec{X}}_n\) are i.i.d. and have a density with respect to Lebesgue measure (i.e. \(\nu = \textrm{Leb}\)). For the case \(p=1\), we consider the following distributions: \(X \sim {\mathscr {N}}(0, 1)\), and \(X\sim \textrm{Laplace}\). Both distributions are symmetric and centered at 0, but the normal distribution is more concentrated around its mean than the Laplace distribution. For the case \(p=2\), we use independent marginals for the distribution of the covariates: \({\varvec{X}} \sim {\mathscr {N}}(0, 1) \otimes {\mathscr {N}}(0, 1)\), and \({\varvec{X}} \sim \textrm{Laplace} \otimes \textrm{Laplace}\).

Noise distribution We consider the normal distribution: \(\varepsilon \sim {\mathscr {N}}(0, \sigma ^2)\). The variance \(\sigma ^2\) is chosen such that the signal-to-noise ratio is the same for each choice of regression function and distribution of \({\varvec{X}}\), where we define the signal-to-noise ratio as:

$$\begin{aligned} \textrm{SNR}{:}{=} \frac{\left| \left| b\right| \right| _\mu ^2}{\sigma ^2}. \end{aligned}$$

We consider the following values: \(\textrm{SNR}=2\) (High noise), and \(\textrm{SNR}= 20\) (Low noise).

Parameters of the projection estimator Since the distributions of \({\varvec{X}}\) are supported on \({\mathbb {R}}\) or \({\mathbb {R}}^2\), we choose the Hermite basis. The Hermite functions are defined as:

$$\begin{aligned} \varphi _j (x) {:}{=} c_j \, H_j(x)\, \textrm{e}^{-\frac{x^2}{2}},\quad H_j(x) {:}{=} (-1)^j \textrm{e}^{x^2} \frac{\textrm{d}^j}{\textrm{d}x^j} \left[ \textrm{e}^{-x^2} \right] , \quad c_j {:}{=} \big (2^j j! \sqrt{\pi }\big )^{-1/2}. \end{aligned}$$

and form a basis of \(\textrm{L}^2({\mathbb {R}})\). We form a basis of \(\textrm{L}^2({\mathbb {R}}^2)\) by tensorizing the Hermite basis as explained in Sect. 2. We choose the parameter \({\varvec{{\hat{m}}}}\) with the model selection procedure (6). This procedure requires two additional parameters: the constant \(\theta \) in the penalty and the constant \(\beta \) in the model collection \({{\widehat{{\mathscr {M}}}}}_{n,\beta }^{\ (2)}\).
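The recurrence below gives a numerically stable way to evaluate these functions without forming the fast-growing polynomials \(H_j\); it is a minimal sketch of such an evaluation, not necessarily the implementation used for the experiments.

```python
import numpy as np

def hermite_functions(x, m):
    """Evaluate the Hermite functions phi_0, ..., phi_{m-1} at the points x.

    Uses the three-term recurrence
        phi_{j+1}(x) = sqrt(2/(j+1)) * x * phi_j(x) - sqrt(j/(j+1)) * phi_{j-1}(x),
    which follows from H_{j+1} = 2x H_j - 2j H_{j-1} and the normalization c_j.
    Returns an array of shape (len(x), m).
    """
    x = np.asarray(x, dtype=float)
    phi = np.empty((x.size, m))
    phi[:, 0] = np.pi ** (-0.25) * np.exp(-x ** 2 / 2)
    if m > 1:
        phi[:, 1] = np.sqrt(2.0) * x * phi[:, 0]
    for j in range(1, m - 1):
        phi[:, j + 1] = (np.sqrt(2.0 / (j + 1)) * x * phi[:, j]
                         - np.sqrt(j / (j + 1)) * phi[:, j - 1])
    return phi
```

The tensorized basis of Sect. 2 is then obtained by multiplying the univariate evaluations coordinatewise.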

We choose \(\beta \) such that the model collection \({{\widehat{{\mathscr {M}}}}}_{n,\beta }^{\ (2)}\) is not too small, especially for small sample sizes. Indeed, we find that the operator norm \(\left| \left| {\widehat{\textbf{G}}}_{{\varvec{m}}}^{-1}\right| \right| _\textrm{op}\) can grow very fast with \({\varvec{m}}\), which can result in model collections with very few models. In our case, we choose \(\beta =10^4\).

The constant \(\kappa {:}{=} 1+\theta \) in front of the penalty is chosen following the “minimum penalty heuristic” (Arlot and Massart 2009). In several preliminary simulations, we compute the selected dimension \(D_{{\varvec{{\hat{m}}}}}\) as a function of \(\kappa \) and we find \(\kappa _{\min }\) such that for \(\kappa < \kappa _{\min }\) the dimension is too high and for \(\kappa >\kappa _{\min }\) it is acceptable. Then, we choose \(\kappa _\star = 2\kappa _{\min }\). In our case, we find \(\kappa _\star =2\) for both \(p=1\) and \(p=2\).
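A crude sketch of this calibration step is given below; the grid of values of \(\kappa \), the corresponding selected dimensions and the threshold that defines a “jump” are all inputs we assume given, whereas in practice we locate the jump by inspection of the preliminary simulations.

```python
import numpy as np

def calibrate_kappa(kappa_grid, dims_selected, jump_threshold):
    """Dimension-jump calibration of the penalty constant (illustrative sketch).

    kappa_grid     : increasing grid of candidate constants kappa = 1 + theta.
    dims_selected  : selected dimension D_{m_hat}(kappa) computed on preliminary
                     simulations for each kappa of the grid.
    jump_threshold : size of the drop in dimension treated as the "jump"
                     (an assumption of the sketch).
    """
    dims = np.asarray(dims_selected, dtype=float)
    drops = -np.diff(dims)                         # decrease of the selected dimension
    idx = int(np.argmax(drops >= jump_threshold))  # first drop larger than the threshold
    kappa_min = kappa_grid[idx + 1]
    return 2.0 * kappa_min                         # kappa_star = 2 * kappa_min
```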

Nadaraya–Watson estimator Let us define the Nadaraya–Watson estimator in the case \(p=1\). For all \(h \in (0, 1)\), let \(K_{h}\) be the pdf of the \({\mathscr {N}}(0, h)\) distribution. The Nadaraya–Watson estimator is defined as:

$$\begin{aligned} \forall x\in {\mathbb {R}},\qquad {\hat{b}}^{\textrm{NW}}_{h}(x) {:}{=} \frac{\sum _{i=1}^{n} Y_i\,K_{h}(x-X_i)}{ \sum _{i=1}^{n} K_{h}(x-X_i) }. \end{aligned}$$

The bandwidth h is selected by leave-one-out cross-validation, that is:

$$\begin{aligned} {{\hat{h}}} {:}{=} {{\,\mathrm{arg\, min}\,}}_{h} \sum _{i=1}^{n} \left( Y_i - {\hat{b}}_{h, -i}^{\textrm{NW}}(X_i)\right) ^2, \end{aligned}$$

where \({\hat{b}}_{ h, -i}^{\textrm{NW}}\) is the Nadaraya–Watson estimator computed from the data set:

$$\begin{aligned} \left\{ { (X_j, Y_j) }:{j \in {\lbrace }1,\dotsc , n{\rbrace } \setminus \{i\}}\right\} . \end{aligned}$$

In the case \(p=2\), the definition of the estimator is the same but with a couple of bandwidths \({\varvec{h}} = (h_1, h_2)\in (0, 1)^2\), and with \(K_{{\varvec{h}}}\) the pdf of the \({\mathscr {N}}_2({\varvec{0}}, {\textbf{H}})\) distribution, where \({\textbf{H}} {:}{=} \textrm{diag}(h_1, h_2)\).
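The following sketch gives a direct implementation of this estimator and of the leave-one-out bandwidth selection in the case \(p=1\); the bandwidth grid is an assumption of the sketch, the text above only specifies the arg min.

```python
import numpy as np

def nw_estimate(x_eval, X, Y, h):
    """Nadaraya-Watson estimator with K_h the N(0, h) density, as in the text."""
    w = np.exp(-(x_eval[:, None] - X[None, :]) ** 2 / (2 * h)) / np.sqrt(2 * np.pi * h)
    return (w @ Y) / np.sum(w, axis=1)           # kernel-weighted average of the Y_i

def loo_cv_bandwidth(X, Y, h_grid):
    """Leave-one-out cross-validation over a grid of bandwidths."""
    best_h, best_score = None, np.inf
    for h in h_grid:
        w = np.exp(-(X[:, None] - X[None, :]) ** 2 / (2 * h)) / np.sqrt(2 * np.pi * h)
        np.fill_diagonal(w, 0.0)                 # remove the i-th point from its own fit
        pred = (w @ Y) / np.sum(w, axis=1)
        score = np.sum((Y - pred) ** 2)
        if score < best_score:
            best_h, best_score = h, score
    return best_h
```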

Computation of the risk We consider samples of size \(n=250\) and \(n=1000\) in the case \(p=1\), and samples of size \(n=500\) and \(n=2000\) in the case \(p=2\). For each choice of regression function, distribution of \({\varvec{X}}\) and \(\textrm{SNR}\), we generate \(N=100\) samples of size n. For each sample, we compute the Hermite projection estimator and the Nadaraya–Watson estimator; then, we compute the relative \(\mu \)-error of the estimators, that is:

$$\begin{aligned} \text {relative error} {:}{=} \frac{\left| \left| {\hat{b}} - b\right| \right| _\mu ^2}{\left| \left| b\right| \right| _\mu ^2} = \frac{\int _{{\mathbb {R}}^p} \left| {\hat{b}}({\varvec{x}}) - b({\varvec{x}}) \right| ^2 f({\varvec{x}})\,\textrm{d}{{\varvec{x}}}}{\int _{{\mathbb {R}}^p} b({\varvec{x}})^2 f({\varvec{x}})\,\textrm{d}{{\varvec{x}}}}, \end{aligned}$$

where f is the density of the distribution \(\mu \). We compute an approximation of these integrals: we consider an interval I such that \({\mathbb {P}}\left[ X\in I \right] = 95\%\) in the case \(p=1\), and the compact square \(I\times I\) such that \({\mathbb {P}}\left[ {\varvec{X}}\in I\times I \right] = 95\%\) in the case \(p=2\). Then, we consider a discretization of I with 200 points. In the case \(p=1\), we use Simpson's rule with this discretization of I to approximate the integrals. In the case \(p=2\), we approximate the integrals by a sum over the grid of \(I\times I\):

$$\begin{aligned} \iint _{{\mathbb {R}}^2} \left| {\hat{b}}({\varvec{x}}) - b({\varvec{x}}) \right| ^2 f({\varvec{x}})\,\textrm{d}{{\varvec{x}}} \approx \sum _{i=1}^{200} \sum _{j=1}^{200} \left| {\hat{b}}(x_{1,i}, x_{2,j}) - b(x_{1,i}, x_{2,j}) \right| ^2 f(x_{1,i}, x_{2,j}) \varDelta ^2, \end{aligned}$$

where \(\varDelta \) is the discretization step.
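A possible implementation of these approximations is sketched below, using scipy's Simpson rule in the case \(p=1\) and the grid sum above in the case \(p=2\); the callables `b_hat`, `b` and `f` are assumed to be vectorized, which is a convention of the sketch.

```python
import numpy as np
from scipy.integrate import simpson

def relative_error_1d(b_hat, b, f, I, n_points=200):
    """Relative mu-error in the case p = 1, approximated by Simpson's rule on I."""
    x = np.linspace(I[0], I[1], n_points)
    num = simpson((b_hat(x) - b(x)) ** 2 * f(x), x=x)
    den = simpson(b(x) ** 2 * f(x), x=x)
    return num / den

def relative_error_2d(b_hat, b, f, I, n_points=200):
    """Relative mu-error in the case p = 2, approximated by a sum over the grid of I x I."""
    x = np.linspace(I[0], I[1], n_points)
    step = x[1] - x[0]
    X1, X2 = np.meshgrid(x, x, indexing="ij")
    num = np.sum((b_hat(X1, X2) - b(X1, X2)) ** 2 * f(X1, X2)) * step ** 2
    den = np.sum(b(X1, X2) ** 2 * f(X1, X2)) * step ** 2
    return num / den
```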

Table 1 Risk comparison, \(p=1\). Table showing the relative \(\mu \)-risks of the Hermite projection estimator and the Nadaraya–Watson estimator. For each distribution of X, regression function, \(\textrm{SNR}\) and n, we display the estimated relative \(\mu \)-risk over \(N=100\) samples with a 95% confidence interval, multiplied by 100. For the projection estimator, we display the mean selected model, and for the Nadaraya–Watson estimator, we display the mean selected bandwidth
Table 2 Risk comparison, \(p=2\). Table showing the relative \(\mu \)-risks of the Hermite projection estimator and the Nadaraya–Watson estimator. For each distribution of \({\varvec{X}}\), regression function, \(\textrm{SNR}\) and n, we display the estimated relative \(\mu \)-risk over \(N=100\) samples with a 95% confidence interval, multiplied by 100. For the projection estimator, we display the mean selected dimension, and for the Nadaraya–Watson estimator, we display the mean selected bandwidths

Results In the case \(p=1\), we show our results in Table 1. First of all, we see that the results are better when X has a normal distribution than when it has a Laplace distribution. This can be explained by the fact that the Laplace distribution is less concentrated around 0 than the normal distribution, so the \(X_i\)s are more scattered and the \(\mu \)-risk covers a larger range. In addition, in the normal setting, we see that the Hermite estimator is better than the Nadaraya–Watson estimator for estimating \(b_1\), \(b_2\) and \(b_3\), and both estimators are equivalent for estimating \(b_4\). In the Laplace setting, the Hermite estimator is still better for \(b_1\) and \(b_2\), but for \(b_3\) it has performance similar to the Nadaraya–Watson estimator. For estimating \(b_4\), the latter is better, although the difference becomes small as n increases.

In the case \(p=2\), we show our results in Table 2. In the normal setting, the Hermite projection estimator is better for estimating \(b_5\), \(b_6\) and \(b_7\). For \(b_8\), its performance is worse than that of the kernel estimator on small samples, but the two are equivalent on large samples. In the Laplace setting, our estimator is better for estimating \(b_5\) and \(b_6\), but it is worse for estimating \(b_7\). Moreover, the Hermite estimator performs very poorly for estimating \(b_8\). We think that the functions \(b_7\) and \(b_8\) are hard to approximate with the Hermite basis, so that the Hermite projection estimator performs poorly. This can be seen by looking at the mean selected dimension, which grows quickly as n grows, showing that the estimator needs a large number of coefficients to reconstruct the regression function. This is especially true for \(b_8\), as it is a non-differentiable and unbounded function.

In addition, we observe that the Hermite estimator is faster to compute than the Nadaraya–Watson estimator with leave-one-out cross-validation. The difference is small when n is small, but, for example, when \(n=2000\) and \(p=2\), the Hermite estimator is about 3 times faster. In conclusion, the Hermite projection estimator is a good alternative to the Nadaraya–Watson estimator.

6 Concluding remark

In this paper, we have considered the nonparametric regression problem with a random design. The covariates are assumed to be independent but not identically distributed, and the variance of the noise is assumed to be known. We estimate the regression function on a non-compact domain of \({\mathbb {R}}^p\) with a projection estimator, using tensorized orthonormal bases. The projection space is chosen by a penalized criterion, as in Birgé and Massart (1998) and Baraud (2000). Our model collection depends on the design and is thus random. Indeed, we consider subspaces \(S_{{\varvec{m}}}\) on which the operator norm of the Gram hypermatrix associated with the least squares minimization problem is constrained. This constraint on the operator norm comes from a refined study of the discrepancy between the norms \(\left| \left| \cdot \right| \right| _n\) and \(\left| \left| \cdot \right| \right| _\mu \) on \(S_{{\varvec{m}}}\). This study relies on the matrix concentration inequalities of Tropp (2012) and Gittens and Tropp (2011), as suggested by the work of Cohen et al. (2013). Doing so, we obtain oracle bounds for the selected estimator, in both norms. Our work extends and improves the results of Baraud (2002) and Comte and Genon-Catalot (2020a), as explained in Remark 5.

Different extensions of our work can be pursued. A natural extension would be to consider the heteroskedastic regression model, in which the observations \(({\varvec{X}}_i,Y_i)\) satisfy:

$$\begin{aligned} Y_i = b({\varvec{X}}_i) + \sigma ({\varvec{X}}_i)\varepsilon _i, \end{aligned}$$

where the \(\varepsilon _i\)s have unit variance. Using the same projection estimator, Comte and Genon-Catalot (2020b) have obtained similar results for this model in the one-dimensional case. The extension to the multivariate case could be done in two ways. The first way would be to generalize the fixed design results of Baraud (2000) to the case of noise variables with different variances, and then to apply the same arguments we used in this paper to deduce the results for the random design setting. The second way would be to follow the approach of Comte and Genon-Catalot (2020b), which is based on Talagrand's inequality, and to see if it can be extended to the multivariate case.

Another extension of our work would be to investigate the use of more general approximation spaces \(S_m\), as does Baraud (2002). We want to know whether the same method we used could handle approximation spaces that are not constructed from an orthonormal basis. A typical example we have in mind is spline approximation. We suspect that our results on the comparison between the norms \(\left| \left| \cdot \right| \right| _n\) and \(\left| \left| \cdot \right| \right| _\mu \) still hold in this context, so that adaptive strategies could be derived from them.

7 Proofs

7.1 Proofs of Sect. 2

Proof of Proposition 1

Let \(\varPi ^{(n)}_{{\varvec{m}}}\) be the projector on \(S_{{\varvec{m}}}\) for the empirical inner product. We have the decomposition:

$$\begin{aligned} {\mathbb {E}}_{{\varvec{X}}}\left| \left| b- {\hat{b}}_{{\varvec{m}}}\right| \right| _n^2&= \left| \left| b - \varPi ^{(n)}_{{\varvec{m}}} b\right| \right| _n^2 + {\mathbb {E}}_{{\varvec{X}}}\left| \left| {\hat{b}}_{{\varvec{m}}} - \varPi ^{(n)}_{{\varvec{m}}}b\right| \right| _n^2 \\&= \inf _{t \in S_{{\varvec{m}}}} \left| \left| b - t\right| \right| _n^2 + {\mathbb {E}}_{{\varvec{X}}} \left| \left| \varPi ^{(n)}_{{\varvec{m}}} {\varvec{\varepsilon }}\right| \right| _n^2 \\&= \inf _{t \in S_{{\varvec{m}}}} \left| \left| b - t\right| \right| _n^2 + \sigma ^2 \frac{{{\,\textrm{Tr}\,}}\big (\varPi ^{(n)}_{{\varvec{m}}}\big )}{n} \\&= \inf _{t \in S_{{\varvec{m}}}} \left| \left| b - t\right| \right| _n^2 + \sigma ^2 \frac{D_{{\varvec{m}}}}{n}. \end{aligned}$$

Taking the expected value in this equality, we obtain:

$$\begin{aligned} {\mathbb {E}}\left| \left| b- {\hat{b}}_{{\varvec{m}}}\right| \right| _n^2 = {\mathbb {E}}\left[ \inf _{t \in S_{{\varvec{m}}}} \left| \left| b - t\right| \right| _n^2\right] + \sigma ^2 \frac{D_{{\varvec{m}}}}{n}&\le \inf _{t \in S_{{\varvec{m}}}} {\mathbb {E}}\left| \left| b - t\right| \right| _n^2 + \sigma ^2 \frac{D_{{\varvec{m}}}}{n} \\&= \inf _{t \in S_{{\varvec{m}}}} \left| \left| b - t\right| \right| _\mu ^2 + \sigma ^2 \frac{D_{{\varvec{m}}}}{n}. \end{aligned}$$

\(\square \)

7.2 Proofs of Sect. 3

Proof of Lemma 2

Let \(x\in A\) and let \(t = \sum _{j=1}^{d} a_j\, \psi _j\in S\). The family of functions \((\psi _1,\ldots , \psi _d)\) is orthonormal with respect to \({\langle }{ \cdot , \cdot }{\rangle }_\alpha \), so by the Cauchy–Schwarz inequality we have:

$$\begin{aligned} t^2(x) = \left( \sum _{j=1}^{d} a_j\, \psi _j(x)\right) ^2 \le \left( \sum _{j=1}^{d} a_j^2\right) \left( \sum _{j=1}^{d} \psi _j^2(x)\right) = \left| \left| t\right| \right| _\alpha ^2 \sum _{j=1}^{d} \psi _j^2(x), \end{aligned}$$

with equality if \((a_1,\dotsc , a_d)\) is proportional to \((\psi _1(x),\dotsc , \psi _d(x))\). Hence, we have:

$$\begin{aligned} \sum _{j=1}^{d} \psi _j^2(x) = \sup _{t\in S\setminus \{0\}} \frac{t^2(x)}{\left| \left| t\right| \right| _\alpha ^2}. \end{aligned}$$

Taking the supremum for \(x\in A\), we obtain:

$$\begin{aligned} \sup _{x\in A}\sum _{j=1}^{d} \psi _j^2(x) = \sup _{x\in A}\sup _{t\in S\setminus \{0\}} \frac{t^2(x)}{\left| \left| t\right| \right| _\alpha ^2} = \sup _{t\in S\setminus \{0\}} \frac{\sup _{x\in A} t^2(x)}{\left| \left| t\right| \right| _\alpha ^2}, \end{aligned}$$

that is:

$$\begin{aligned} \left| \left| \sum _{j=1}^{d} \psi _j^2\right| \right| _\infty = \sup _{t\in S\setminus \{0\}} \frac{\left| \left| t\right| \right| _\infty ^2}{\left| \left| t\right| \right| _\alpha ^2} {=}{:} K_\alpha ^\infty (S). \end{aligned}$$

\(\square \)

To prove Proposition 3 and Theorem 2, we need the following lemma.

Lemma 3

Let \((\psi _1,\dotsc ,\psi _{D_{{\varvec{m}}}})\) be an orthonormal basis of \(S_{{\varvec{m}}}\) relatively to an inner product \({\langle }{ \cdot , \cdot }{\rangle }_\alpha \). Let \(\widehat{{\textbf{H}}}_{{\varvec{m}}}\) be the Gram matrix of this basis relatively to the empirical inner product and let \({\textbf{H}}_{{\varvec{m}}} {:}{=} {\mathbb {E}}[\widehat{{\textbf{H}}}_{{\varvec{m}}}]\), that is:

$$\begin{aligned} \forall j,k\in {\lbrace }1,\dotsc , D_{{\varvec{m}}}{\rbrace }, \quad \left[ {\widehat{{\textbf{H}}}_{{\varvec{m}}}}\right] _{j,k} {:}{=} {\langle }{ \psi _j, \psi _k }{\rangle }_n \text { and } \left[ {{\textbf{H}}_{{\varvec{m}}}}\right] _{j,k} {:}{=} {\langle }{ \psi _j, \psi _k }{\rangle }_\mu . \end{aligned}$$

For all \(\delta \in (0,1),\) we have:

$$\begin{aligned} {\mathbb {P}}\left[ \lambda _{\min }(\widehat{{\textbf{H}}}_{{\varvec{m}}}) \le (1-\delta )\, \lambda _{\min }({\textbf{H}}_{{\varvec{m}}}) \right] \le D_{{\varvec{m}}} \exp \left( -h(\delta ) \frac{n\, \lambda _{\min }({\textbf{H}}_{{\varvec{m}}})}{K_{\alpha }^\infty ({\varvec{m}})} \right) , \end{aligned}$$

with \(h(\delta ){:}{=} \delta + (1-\delta )\log (1-\delta )\) and where \(K_{\alpha }^\infty ({\varvec{m}})\) is given by Lemma 2.

Proof

We use Theorem 5 in Appendix. Indeed, \(\widehat{{\textbf{H}}}_{{\varvec{m}}}\) can be written as a sum \({\textbf{Z}}_1 + \dotsc + {\textbf{Z}}_n\) where

$$\begin{aligned} \forall j,k\in {\lbrace }1,\dotsc , D_{{\varvec{m}}}{\rbrace }, \quad \left[ {\textbf{Z}}_i\right] _{j,k} {:}{=} \frac{1}{n} \psi _j({\varvec{X}}_i)\psi _k({\varvec{X}}_i), \end{aligned}$$

so we have using Lemma 2:

$$\begin{aligned} \lambda _{\max }({\textbf{Z}}_i) = \left| \left| {\textbf{Z}}_i\right| \right| _{\textrm{op}} = \frac{1}{n} \sum _{k=1}^{D_{{\varvec{m}}}} \psi _k({\varvec{X}}_i)^2 \le \frac{1}{n} \left| \left| \sum _{k=1}^{D_{{\varvec{m}}}} \psi _k^2\right| \right| _\infty = \frac{1}{n} K_\alpha ^\infty ({\varvec{m}}). \end{aligned}$$

Therefore, applying inequality (29) of Theorem 5 with \(\mu _{\min } = \lambda _{\min }({\textbf{H}}_{{\varvec{m}}})\) and \(R = \frac{1}{n}K_\alpha ^\infty ({\varvec{m}})\) yields:

$$\begin{aligned} {\mathbb {P}}\left[ \lambda _{\min }(\widehat{{\textbf{H}}}_{{\varvec{m}}}) \le (1-\delta ) \lambda _{\min }({\textbf{H}}_{{\varvec{m}}}) \right] \le D_{{\varvec{m}}} \exp \left( -h(\delta ) \frac{n \lambda _{\min }({\textbf{H}}_{{\varvec{m}}})}{K_\alpha ^\infty ({\varvec{m}})} \right) . \end{aligned}$$

\(\square \)

Proof of Proposition 3

Let \(\psi _1,\dotsc ,\psi _{D_{{\varvec{m}}}}\) be an orthonormal basis of \(S_{{\varvec{m}}}\) relatively to the inner product \({\langle }{ \cdot , \cdot }{\rangle }_\mu \). Let \(\widehat{{\textbf{H}}}_{{\varvec{m}}}\) be their Gram matrix relatively to the empirical inner product. According to Lemma 1, we have \(K_n^\mu ({\varvec{m}}) = \left| \left| \widehat{{\textbf{H}}}_{{\varvec{m}}}^{-1}\right| \right| _\textrm{op}= \lambda _{\min }(\widehat{{\textbf{H}}}_{{\varvec{m}}})^{-1}\) and we have \({\mathbb {E}}[\widehat{{\textbf{H}}}_{{\varvec{m}}}] = {\textbf{I}}_{{\varvec{m}}}\) because \((\psi _1, \dotsc , \psi _{D_{{\varvec{m}}}})\) is orthonormal for the inner product associated with \(\mu \), so the event \(\varOmega _{{\varvec{m}}}(\delta )^c\) can be written as:

$$\begin{aligned} \varOmega _{{\varvec{m}}}(\delta )^c = {\lbrace } \lambda _{\min }(\widehat{{\textbf{H}}}_{{\varvec{m}}}) \le 1-\delta {\rbrace } = {\lbrace } \lambda _{\min }(\widehat{{\textbf{H}}}_{{\varvec{m}}}) \le (1-\delta ) \lambda _{\min }({\mathbb {E}}[\widehat{{\textbf{H}}}_{{\varvec{m}}}]){\rbrace }. \end{aligned}$$

Applying Lemma 3 yields the result.\(\square \)

Proof of Proposition 2

We start with the decomposition:

$$\begin{aligned} {\mathbb {E}}\left| \left| b - {\hat{b}}_{{\varvec{m}}}\right| \right| _\mu ^2 = {\mathbb {E}}\left[ \left| \left| b - {\hat{b}}_{{\varvec{m}}}\right| \right| _\mu ^2 \,\textbf{1}_ {\varOmega _{{\varvec{m}}}(\delta )}\right] + {\mathbb {E}}\left[ \left| \left| b - {\hat{b}}_{{\varvec{m}}}\right| \right| _\mu ^2 \,\textbf{1}_{\varOmega _{{\varvec{m}}}(\delta )^c}\right] . \end{aligned}$$
(7)

We consider these two terms separately. The expectation of the first term is controlled as in Theorem 3 in Cohen et al. (2013). On the event \(\varOmega _{{\varvec{m}}}(\delta )\) we have \((1-\delta )\left| \left| t\right| \right| _{\mu }^2 \le \left| \left| t\right| \right| _n^2\) for all \(t\in S_{{\varvec{m}}}\), so if \(b_{{\varvec{m}}}^{(\mu )}\) is the projection of b on \(S_{{\varvec{m}}}\) for the norm \(\left| \left| \cdot \right| \right| _\mu \), we have:

$$\begin{aligned} \left| \left| b - {\hat{b}}_{{\varvec{m}}}\right| \right| _\mu ^2 \textbf{1}_{\varOmega _{{\varvec{m}}}(\delta )}&\le \left| \left| b - b_{{\varvec{m}}}^{(\mu )}\right| \right| _\mu ^2 + \left| \left| {\hat{b}}_{{\varvec{m}}} - b_{{\varvec{m}}}^{(\mu )}\right| \right| _\mu ^2 \, \textbf{1}_{\varOmega _{{\varvec{m}}}(\delta )} \\&\le \left| \left| b - b_{{\varvec{m}}}^{(\mu )}\right| \right| _\mu ^2 + 2\,\left| \left| {\hat{b}}_{{\varvec{m}}} - b_{{\varvec{m}}}^{(n)}\right| \right| _\mu ^2 \, \textbf{1}_{\varOmega _{{\varvec{m}}}(\delta )} + 2\,\left| \left| b_{{\varvec{m}}}^{(n)} - b_{{\varvec{m}}}^{(\mu )}\right| \right| _\mu ^2 \, \textbf{1}_{\varOmega _{{\varvec{m}}}(\delta )} \\&\le \left| \left| b - b_{{\varvec{m}}}^{(\mu )}\right| \right| _\mu ^2 + \frac{2}{1-\delta }\,\left| \left| {\hat{b}}_{{\varvec{m}}} - b_{{\varvec{m}}}^{(n)}\right| \right| _n^2 + 2\,\left| \left| b_{{\varvec{m}}}^{(n)} - b_{{\varvec{m}}}^{(\mu )}\right| \right| _\mu ^2 \, \textbf{1}_{\varOmega _{{\varvec{m}}}(\delta )} \end{aligned}$$

Taking the expectation, we obtain:

$$\begin{aligned} {\mathbb {E}}\left[ \left| \left| b - {\hat{b}}_{{\varvec{m}}}\right| \right| _\mu ^2 \textbf{1}_{\varOmega _{{\varvec{m}}}(\delta )}\right] \le \left| \left| b - b_{{\varvec{m}}}^{(\mu )}\right| \right| _\mu ^2 + \frac{2}{1-\delta } \,\sigma ^2 \frac{D_{{\varvec{m}}}}{n} + 2\,{\mathbb {E}}\left[ \left| \left| b_{{\varvec{m}}}^{(n)} - b_{{\varvec{m}}}^{(\mu )}\right| \right| _\mu ^2 \, \textbf{1}_{\varOmega _{{\varvec{m}}}(\delta )}\right] . \end{aligned}$$
(8)

We give an upper bound on the last term in two ways. Firstly, we have:

$$\begin{aligned} {\mathbb {E}}\left[ \left| \left| b_{{\varvec{m}}}^{(n)} - b_{{\varvec{m}}}^{(\mu )}\right| \right| _\mu ^2 \, \textbf{1}_{\varOmega _{{\varvec{m}}}(\delta )}\right]&\le {\mathbb {E}}\left[ K_n^\mu ({\varvec{m}}) \left| \left| b_{{\varvec{m}}}^{(n)} - b_{{\varvec{m}}}^{(\mu )}\right| \right| _n^2 \, \textbf{1}_{\varOmega _{{\varvec{m}}}(\delta )}\right] \\&\le \frac{1}{1-\delta } {\mathbb {E}}\left| \left| b_{{\varvec{m}}}^{(n)} - b_{{\varvec{m}}}^{(\mu )}\right| \right| _n^2 \end{aligned}$$

since \(K_n^\mu ({\varvec{m}})\le \frac{1}{1-\delta }\) on the event \(\varOmega _{{\varvec{m}}}(\delta )\), see (2). Let \(\varPi _{{\varvec{m}}}^{(n)}\) be the empirical projector on \(S_{{\varvec{m}}}\), we have:

$$\begin{aligned} \left| \left| b_{{\varvec{m}}}^{(n)} - b_{{\varvec{m}}}^{(\mu )}\right| \right| _n^2 = \left| \left| \varPi _{{\varvec{m}}}^{(n)}\big (b - b_{{\varvec{m}}}^{(\mu )}\big )\right| \right| _n^2 \le \left| \left| b - b_{{\varvec{m}}}^{(\mu )}\right| \right| _n^2. \end{aligned}$$

Thus, we have shown:

$$\begin{aligned} {\mathbb {E}}\left[ \left| \left| b_{{\varvec{m}}}^{(n)} - b_{{\varvec{m}}}^{(\mu )}\right| \right| _\mu ^2 \, \textbf{1}_{\varOmega _{{\varvec{m}}}(\delta )}\right] \le \frac{1}{1-\delta } {\mathbb {E}}\left| \left| b- b_{{\varvec{m}}}^{(\mu )}\right| \right| _n^2 = \frac{1}{1-\delta } \left| \left| b- b_{{\varvec{m}}}^{(\mu )}\right| \right| _\mu ^2. \end{aligned}$$
(9)

Secondly, let \(g {:}{=} b - b_{{\varvec{m}}}^{(\mu )}\) and recall that \(\varPi _{{\varvec{m}}}^{(n)}\) denotes the empirical projector on \(S_{{\varvec{m}}}\); we have:

$$\begin{aligned} {\mathbb {E}}\left[ \left| \left| b_{{\varvec{m}}}^{(n)} - b_{{\varvec{m}}}^{(\mu )}\right| \right| _\mu ^2 \, \textbf{1}_{\varOmega _{{\varvec{m}}}(\delta )}\right] = {\mathbb {E}}\left[ \left| \left| \varPi _{{\varvec{m}}}^{(n)} g\right| \right| _\mu ^2 \textbf{1}_{\varOmega _{{\varvec{m}}}(\delta )}\right] . \end{aligned}$$

Let \((\psi _1,\dotsc , \psi _{D_{{\varvec{m}}}})\) be an orthonormal basis of \(S_{{\varvec{m}}}\) for the inner product \({\langle }{ \cdot , \cdot }{\rangle }_\mu \), we have:

$$\begin{aligned} \varPi _{{\varvec{m}}}^{(n)} g = {{\,\mathrm{arg\, min}\,}}_{t\in S_{{\varvec{m}}}} \,\left| \left| g - t\right| \right| _n^2 = \sum _{j=1}^{D_{{\varvec{m}}}} c_j^\star \, \psi _j, \quad {\textbf{c}}^\star {:}{=} {{\,\mathrm{arg\, min}\,}}_{{\textbf{c}} \in {\mathbb {R}}^{D_{{\varvec{m}}}}} \,\left| \left| {\textbf{g}} - \varPsi _{{\varvec{m}}} {\textbf{c}}\right| \right| _{{\mathbb {R}}^n}^2, \end{aligned}$$

where \(\varPsi _{{\varvec{m}}} \in {\mathbb {R}}^{n\times D_{{\varvec{m}}}}\) is the matrix defined by \([\varPsi _{{\varvec{m}}}]_{i,j} {:}{=} \psi _{j}({\varvec{X}}_i)\), and where \({\textbf{g}}\) is the vector \(\big ( g({\varvec{X}}_1), \dotsc , g({\varvec{X}}_n) \big )\in {\mathbb {R}}^n\). By Lemma 8, \({\textbf{c}}^\star \) is given by:

$$\begin{aligned} {\textbf{c}}^\star = (\varPsi _{{\varvec{m}}}^*\varPsi _{{\varvec{m}}})^{-1} \varPsi _{{\varvec{m}}}^* {\textbf{g}} = \frac{1}{n} \widehat{{\textbf{H}}}_{{\varvec{m}}}^{-1} \varPsi _{{\varvec{m}}}^*{\textbf{g}}, \end{aligned}$$

where \(\widehat{{\textbf{H}}}_{{\varvec{m}}}\) is the Gram matrix of \((\psi _1,\dotsc , \psi _{D_{{\varvec{m}}}})\) relatively to the empirical inner product, as in the proof of Proposition 3. Using Lemma 1, we get:

$$\begin{aligned} \left| \left| \varPi _{{\varvec{m}}}^{(n)} g\right| \right| _\mu ^2 = \left| \left| {\textbf{c}}^\star \right| \right| _{{\mathbb {R}}^{D_{{\varvec{m}}}}}^2 \le \left| \left| \widehat{{\textbf{H}}}_{{\varvec{m}}}^{-1}\right| \right| _{\textrm{op}}^2 \, \left| \left| \frac{1}{n} \varPsi _{{\varvec{m}}}^* {\textbf{g}}\right| \right| _{{\mathbb {R}}^{D_{{\varvec{m}}}}}^2 = K_n^\mu ({\varvec{m}})^2\, \sum _{j=1}^{D_{{\varvec{m}}}} {\langle }{ g, \psi _j }{\rangle }_n^2. \end{aligned}$$

Hence, on the event \(\varOmega _{{\varvec{m}}}(\delta )\) we obtain:

$$\begin{aligned} \left| \left| \varPi _{{\varvec{m}}}^{(n)} g\right| \right| _\mu ^2 \textbf{1}_{\varOmega _{{\varvec{m}}}(\delta )} \le \frac{1}{(1-\delta )^2} \sum _{j=1}^{D_{{\varvec{m}}}} {\langle }{ g, \psi _j }{\rangle }_n^2. \end{aligned}$$

Since \(g=b-b_{{\varvec{m}}}^{(\mu )}\) is orthogonal to \(\psi _1,\dotsc ,\psi _{D_{{\varvec{m}}}}\) relatively to the inner product \({\langle }{ \cdot , \cdot }{\rangle }_\mu \), we have \({\mathbb {E}}[{\langle }{ g, \psi _j }{\rangle }_n] = {\langle }{ g, \psi _j }{\rangle }_\mu = 0\), so we get:

$$\begin{aligned} {\mathbb {E}}\left[ \sum _{k=1}^{D_{{\varvec{m}}}} {\langle }{ g, \psi _k }{\rangle }_n^2\right] = \sum _{k=1}^{D_{{\varvec{m}}}} {{\,\textrm{Var}\,}}\big ( {\langle }{ g, \psi _k }{\rangle }_n \big )&= \frac{1}{n^2} \sum _{i=1}^{n} \sum _{j=1}^{D_{{\varvec{m}}}} {{\,\textrm{Var}\,}}\big ( g({\varvec{X}}_i) \psi _j({\varvec{X}}_i) \big ) \\&\le \frac{1}{n^2} \sum _{i=1}^{n} {\mathbb {E}}\left[ g({\varvec{X}}_i)^2 \sum _{j=1}^{D_{{\varvec{m}}}} \psi _j({\varvec{X}}_i)^2 \right] \\&\le \frac{1}{n^2} \sum _{i=1}^{n} {\mathbb {E}}\left[ g({\varvec{X}}_i)^2 \right] \sup _{x\in A} \sum _{j=1}^{D_{{\varvec{m}}}} \psi _j(x)^2 \\&= \frac{1}{n} \left| \left| g\right| \right| _\mu ^2 K_\mu ^\infty ({\varvec{m}}) = \frac{K_\mu ^\infty ({\varvec{m}})}{n} \left| \left| b - b_{{\varvec{m}}}^{(\mu )}\right| \right| _\mu ^2, \end{aligned}$$

where the last equality comes from Lemma 2. Hence, we have shown:

$$\begin{aligned} {\mathbb {E}}\left[ \left| \left| b_{{\varvec{m}}}^{(n)} - b_{{\varvec{m}}}^{(\mu )}\right| \right| _\mu ^2 \, \textbf{1}_{\varOmega _{{\varvec{m}}}(\delta )}\right] \le \frac{1}{(1-\delta )^2} \frac{K_\mu ^\infty ({\varvec{m}})}{n} \left| \left| b - b_{{\varvec{m}}}^{(\mu )}\right| \right| _ \mu ^2. \end{aligned}$$
(10)

Combining (9) and (10) yields:

$$\begin{aligned} {\mathbb {E}}\left[ \left| \left| b_{{\varvec{m}}}^{(n)} - b_{{\varvec{m}}}^{(\mu )}\right| \right| _\mu ^2 \, \textbf{1}_{\varOmega _{{\varvec{m}}}(\delta )}\right] \le \frac{1}{1-\delta } \left| \left| b - b_{{\varvec{m}}}^{(\mu )}\right| \right| _\mu ^2 \left( 1 \wedge \frac{K_\mu ^\infty ({\varvec{m}})}{(1-\delta )n}\right) . \end{aligned}$$
(11)

For the second term in (7), we have:

$$\begin{aligned} {\mathbb {E}}\left[ \left| \left| b - {\hat{b}}_{{\varvec{m}}}\right| \right| _\mu ^2 \textbf{1}_{\varOmega _{{\varvec{m}}}(\delta )^c}\right] \le 2 \left| \left| b\right| \right| _\mu ^2\, {\mathbb {P}}\left[ \varOmega _{{\varvec{m}}}(\delta )^c\right] + 2\,{\mathbb {E}}\left[ \left| \left| {\hat{b}}_{{\varvec{m}}}\right| \right| _\mu ^2 \textbf{1}_{\varOmega _{{\varvec{m}}}(\delta )^c}\right] . \end{aligned}$$

We have the following upper bound on \(\left| \left| {\hat{b}}_{{\varvec{m}}}\right| \right| _\mu ^2\):

$$\begin{aligned} \left| \left| {\hat{b}}_{{\varvec{m}}}\right| \right| _\mu ^2 \le K_n^\mu ({\varvec{m}})\, \left| \left| {\hat{b}}_{{\varvec{m}}}\right| \right| _n^2 \le K_n^\mu ({\varvec{m}})\, \left| \left| {\textbf{Y}}\right| \right| _n^2, \end{aligned}$$
(12)

where the last inequality comes from the fact that \({\hat{b}}_{{\varvec{m}}}\) is the empirical projection of \({\textbf{Y}}\). Hence, we get:

$$\begin{aligned} {\mathbb {E}}\left[ \left| \left| b - {\hat{b}}_{{\varvec{m}}}\right| \right| _\mu ^2 \textbf{1}_{\varOmega _{{\varvec{m}}}(\delta )^c}\right] \le 2 \left| \left| b\right| \right| _\mu ^2\, {\mathbb {P}}\left[ \varOmega _{{\varvec{m}}}(\delta )^c\right] + 2\,{\mathbb {E}}\left[ K_n^\mu ({\varvec{m}})\, \left| \left| {\textbf{Y}}\right| \right| _n^2 \textbf{1}_{\varOmega _{{\varvec{m}}}(\delta )^c}\right] . \end{aligned}$$
(13)

The inequality of Proposition 2 is obtained using (8), (11) and (13) in (7).\(\square \)
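
As a complement to the proof, the following sketch (illustrative only, with a hypothetical regression function and the cosine basis on [0, 1]) computes the least squares estimator \({\hat{b}}_{{\varvec{m}}}\) by solving the normal equations \({\textbf{c}}^\star = (\varPsi _{{\varvec{m}}}^*\varPsi _{{\varvec{m}}})^{-1}\varPsi _{{\varvec{m}}}^*{\textbf{Y}}\), in analogy with the formula of Lemma 8 used above, and estimates its \(\left| \left| \cdot \right| \right| _\mu \)-risk by Monte Carlo.

```python
import numpy as np

def cosine_basis(x, D):
    cols = [np.ones_like(x)] + [np.sqrt(2.0) * np.cos(np.pi * j * x) for j in range(1, D)]
    return np.column_stack(cols)

rng = np.random.default_rng(1)
n, D, sigma = 400, 8, 0.3
b = lambda x: np.sin(2 * np.pi * x) + x          # hypothetical regression function
X = rng.uniform(0.0, 1.0, size=n)
Y = b(X) + sigma * rng.standard_normal(n)

Psi = cosine_basis(X, D)                         # [Psi]_{i,j} = psi_j(X_i)
# Least squares coefficients c_star = (Psi^T Psi)^{-1} Psi^T Y, solved in a stable way
c_star, *_ = np.linalg.lstsq(Psi, Y, rcond=None)
b_hat = lambda x: cosine_basis(np.atleast_1d(x), D) @ c_star

grid = rng.uniform(0.0, 1.0, size=20000)         # fresh draws from mu to estimate the risk
print("||b - b_hat||_mu^2 approx", np.mean((b(grid) - b_hat(grid)) ** 2))
print("variance term sigma^2 D/n:", sigma ** 2 * D / n)
```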

Theorem 1

Let \({\varvec{m}}\in {\mathscr {M}}^{(1)}_{n,\alpha }\) and let \(\delta \in (0,1)\) (to be chosen later in the proof). By Remark 1 and the definition of \({\mathscr {M}}^{(1)}_{n,\alpha }\), we have:

$$\begin{aligned} K_{\mu }^\infty ({\varvec{m}}) \le K_\nu ^\infty ({\varvec{m}}) \left| \left| \textbf{G}_{{\varvec{m}}}^{-1}\right| \right| _\textrm{op}\le \alpha \frac{n}{\log n}, \end{aligned}$$
(14)

so Proposition 2 yields:

$$\begin{aligned} {\mathbb {E}}\left| \left| b - {\hat{b}}_{{\varvec{m}}}\right| \right| _{\mu }^2 \le C_n(\delta ,\alpha ) \inf _{t\in S_{{\varvec{m}}}} \left| \left| b - t\right| \right| _\mu ^2 + C'(\delta ) \sigma ^2\frac{D_{{\varvec{m}}}}{n} + R_n, \end{aligned}$$

with \(C_n(\delta , \alpha ) {:}{=} \left( 1 + \frac{2}{1-\delta } \left[ \frac{\alpha }{(1-\delta )\log n} \wedge 1\right] \right) \), \(C'(\delta ) {:}{=} \frac{2}{1-\delta }\) and:

$$\begin{aligned} R_n {:}{=} 2\left| \left| b\right| \right| _\mu ^2 \,{\mathbb {P}}\left[ \varOmega _{{\varvec{m}}}(\delta )^c\right] +{\mathbb {E}}\left[ K_n^\mu ({\varvec{m}}) \left| \left| {\textbf{Y}}\right| \right| _n^2\, \textbf{1}_{\varOmega _{{\varvec{m}}}(\delta )^c}\right] . \end{aligned}$$

For the first term in \(R_n\), we apply Proposition 3 with (14):

$$\begin{aligned} {\mathbb {P}}\left[ \varOmega _{{\varvec{m}}}(\delta )^c \right] \le D_{{\varvec{m}}} \, n^{-\frac{h(\delta )}{\alpha }} \le n^{-\frac{h(\delta )}{\alpha } + 1}. \end{aligned}$$
(15)

For the second term in \(R_n\), since \(\left| \left| \cdot \right| \right| _\mu \le \left| \left| \cdot \right| \right| _\infty \) and \({\varvec{m}}\in {\mathscr {M}}^{(1)}_{n,\alpha }\) we have:

$$\begin{aligned} K_n^\mu ({\varvec{m}}) \le K_\nu ^\mu ({\varvec{m}})\, K_n^\nu ({\varvec{m}}) \le K_\nu ^\infty ({\varvec{m}}) \, \left| \left| \textbf{G}_{{\varvec{m}}}^{-1}\right| \right| _\textrm{op}\le \alpha \frac{n}{\log n}, \end{aligned}$$
(16)

and we have using the independence of \(({\varvec{X}}_i)_{1\le i\le n}\) and \((\varepsilon _i)_{1\le i\le n}\):

$$\begin{aligned} {\mathbb {E}}\left[ \left| \left| {\textbf{Y}}\right| \right| _n^2 \textbf{1}_{\varOmega _{{\varvec{m}}}(\delta )^c} \right]&= \frac{1}{n} \sum _{i=1}^{n} {\mathbb {E}}\left[ \big (b({\varvec{X}}_i) + \varepsilon _i\big )^2 \textbf{1}_{\varOmega _{{\varvec{m}}}(\delta )^c}\right] \\&= {\mathbb {E}}\left[ \frac{1}{n} \sum _{i=1}^{n} b({\varvec{X}}_i)^2 \textbf{1}_{\varOmega _{{\varvec{m}}}(\delta )^c} \right] + \sigma ^2\, {\mathbb {P}}\left[ \varOmega _{{\varvec{m}}}(\delta )^c \right] . \end{aligned}$$

We apply Hölder’s inequality with \(r,r'\in (1,+\infty )\) such that \(\frac{1}{r} + \frac{1}{r'} = 1\):

$$\begin{aligned} {\mathbb {E}}\left[ \left| \left| {\textbf{Y}}\right| \right| _n^2 \textbf{1}_{\varOmega _{{\varvec{m}}}(\delta )^c} \right]&\le {\mathbb {E}}\left[ \left( \frac{1}{n} \sum _{i=1}^{n} b({\varvec{X}}_i)^2 \right) ^r \,\right] ^{\frac{1}{r}} {\mathbb {P}}\left[ \varOmega _{{\varvec{m}}}(\delta )^c\right] ^{\frac{1}{r'}} + \sigma ^2\, {\mathbb {P}}\left[ \varOmega _{{\varvec{m}}}(\delta )^c \right] \\&\le {\mathbb {E}}\left[ \frac{1}{n} \sum _{i=1}^{n} b({\varvec{X}}_i)^{2r} \right] ^{\frac{1}{r}} {\mathbb {P}}\left[ \varOmega _{{\varvec{m}}}(\delta )^c\right] ^{\frac{1}{r'}} + \sigma ^2\, {\mathbb {P}}\left[ \varOmega _{{\varvec{m}}}(\delta )^c \right] \\&\le \left| \left| b\right| \right| _{\textrm{L}^{2r}(\mu )}^2 \, n^{-\frac{h(\delta )}{\alpha r'} + \frac{1}{r'}} + \sigma ^2 \, n^{-\frac{h(\delta )}{\alpha }+1}, \end{aligned}$$

and if \(b\in \textrm{L}^\infty (\mu )\), the last inequality also holds for \(r=\infty \) and \(r'=1\) (just take the limit as \(r\rightarrow +\infty \)). Hence, we obtain:

$$\begin{aligned} {\mathbb {E}}\left[ K_n^\mu ({\varvec{m}}) \left| \left| {\textbf{Y}}\right| \right| _n^2 \textbf{1}_{\varOmega _{{\varvec{m}}}(\delta )^c} \right] \le \frac{\alpha }{\log n} \left( \left| \left| b\right| \right| _{\textrm{L}^{2r}(\mu )}^2 \, n^{-\frac{h(\delta )}{\alpha r'} + \frac{1}{r'} + 1} + \sigma ^2\, n^{-\frac{h(\delta )}{\alpha }+2} \right) . \end{aligned}$$
(17)

If we choose \(\delta \) such that \(h(\delta ) \ge (2r'+1)\alpha \), then all the exponents of n in (15) and (17) are at most \(-1\). The function h is increasing from [0, 1] onto itself, so it is invertible on [0, 1]. Since \(\alpha \in (0, \frac{1}{2r'+1})\), we can choose \(\delta = \delta (\alpha , r') {:}{=} h^{-1}((2r'+1)\alpha )\). For this choice, we obtain:

$$\begin{aligned} \begin{aligned} {\mathbb {E}}\left| \left| b - {\hat{b}}_{{\varvec{m}}}\right| \right| _{\mu }^2&\le C_n\big (\delta (\alpha , r'),\alpha \big ) \inf _{t\in S_{{\varvec{m}}}} \left| \left| b - t\right| \right| _\mu ^2 + C'\big (\delta (\alpha , r')\big ) \sigma ^2\frac{D_{{\varvec{m}}}}{n} \\ {}&+ \frac{C''\big (b,\sigma ^2,\alpha ,r\big )}{n\log n}, \end{aligned} \end{aligned}$$

where \(C_n(\delta ,\alpha )\) and \(C'(\delta )\) were defined at the beginning of the proof, and where \(C''\big (b,\sigma ^2,\alpha ,r\big )\) satisfies:

$$\begin{aligned} C''\big (b,\sigma ^2,\alpha ,r\big ) \le 2\left| \left| b\right| \right| _{\textrm{L}^{2}(\mu )}^2 + \alpha \left( \left| \left| b\right| \right| _{\textrm{L}^{2r}(\mu )}^2 + \sigma ^2 \right) . \end{aligned}$$
(18)

\(\square \)
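
In practice, the value \(\delta (\alpha , r') = h^{-1}\big ((2r'+1)\alpha \big )\) used in this proof can be computed numerically, since h is increasing and explicit. The sketch below is an illustration with arbitrary values of \(\alpha \) and \(r'\); it is not part of the argument.

```python
import numpy as np

def h(d):
    # h(delta) = delta + (1 - delta) log(1 - delta), increasing from [0, 1] onto [0, 1]
    return d + (1.0 - d) * np.log1p(-d)

def h_inv(y, tol=1e-12):
    # invert h on [0, 1) by bisection
    lo, hi = 0.0, 1.0 - 1e-15
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if h(mid) < y else (lo, mid)
    return 0.5 * (lo + hi)

r_prime, alpha = 2.0, 0.05            # illustrative values; alpha must lie in (0, 1/(2r'+1))
delta = h_inv((2 * r_prime + 1) * alpha)
print("delta(alpha, r') =", delta, "   check: h(delta) =", h(delta))
```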

7.3 Proof of Theorem 2

The proof of Theorem 2 is based on a result for fixed design regression of Baraud (2000). Let \({{\widehat{{\mathscr {M}}}}}_n\) be a finite collection of models, that may depend on \(({\varvec{X}}_1,\dotsc , {\varvec{X}}_n)\), such that for all \({\varvec{m}}\in {{\widehat{{\mathscr {M}}}}}_n\), \({{\widehat{\textbf{G}}}}_{{\varvec{m}}}\) is invertible. Let \({\varvec{{\hat{m}}}} \in {{\widehat{{\mathscr {M}}}}}_n\) be the minimizer of the following penalized least squares criterion:

$$\begin{aligned} {\varvec{{\hat{m}}}} {:}{=} {{\,\mathrm{arg\, min}\,}}_{{\varvec{m}}\in {{\widehat{{\mathscr {M}}}}}_n} \left( -\left| \left| {\hat{b}}_{{\varvec{m}}}\right| \right| _n^2 + {{\,\textrm{pen}\,}}({\varvec{m}}) \right) ,\quad {{\,\textrm{pen}\,}}({\varvec{m}}) {:}{=} (1+\theta )\sigma ^2\frac{D_{{\varvec{m}}}}{n}, \quad \theta >0. \end{aligned}$$
(19)
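
For illustration, the selection rule (19) can be implemented directly once the models are computable. The sketch below uses a one-dimensional nested family of cosine models, which is only a toy stand-in for the tensor-product collections of the paper.

```python
import numpy as np

def cosine_basis(x, D):
    cols = [np.ones_like(x)] + [np.sqrt(2.0) * np.cos(np.pi * j * x) for j in range(1, D)]
    return np.column_stack(cols)

def select_dimension(X, Y, dims, sigma2, theta=1.0):
    # Minimize the penalized criterion (19): -||b_hat_m||_n^2 + (1 + theta) sigma^2 D_m / n
    n, best = len(Y), None
    for D in dims:
        Psi = cosine_basis(X, D)
        coef, *_ = np.linalg.lstsq(Psi, Y, rcond=None)
        fit = Psi @ coef                                 # empirical projection of Y on S_m
        crit = -np.mean(fit ** 2) + (1 + theta) * sigma2 * D / n
        if best is None or crit < best[0]:
            best = (crit, D)
    return best[1]

rng = np.random.default_rng(2)
n, sigma = 500, 0.3
b = lambda x: np.sin(2 * np.pi * x)                      # hypothetical regression function
X = rng.uniform(size=n)
Y = b(X) + sigma * rng.standard_normal(n)
print("selected dimension:", select_dimension(X, Y, range(1, 21), sigma ** 2))
```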

Theorem 4

(Corollary 3.1 in Baraud (2000)) If \({\mathbb {E}}\left| \varepsilon _1 \right| ^q\) is finite for some \(q>4\), then the following upper bound on the risk of the estimator \({\hat{b}}_{{\varvec{{\hat{m}}}}}\) with \({\varvec{{\hat{m}}}}\) defined by (19) holds:

$$\begin{aligned} {\mathbb {E}}_{{\varvec{X}}} \left| \left| b - {\hat{b}}_{{\varvec{{\hat{m}}}}}\right| \right| _n^2 \le C(\theta ) \inf _{{\varvec{m}}\in {{\widehat{{\mathscr {M}}}}}_n} \left( \inf _{t\in S_{{\varvec{m}}}} \left| \left| b - t\right| \right| _n^2 + \sigma ^2\frac{D_{{\varvec{m}} }}{n} \right) + \sigma ^2\frac{\varSigma _n(\theta , q)}{n}, \end{aligned}$$

with:

$$\begin{aligned} \varSigma _n(\theta , q) {:}{=} C'(\theta , q) \frac{{\mathbb {E}}\left| \varepsilon _1 \right| ^q}{\sigma ^q} \sum _{{\varvec{m}} \in {\widehat{{\mathscr {M}}}}_n} D_{{\varvec{m}}}^{-(\frac{q}{2}-2)}, \end{aligned}$$

where \(C(\theta ) {:}{=} (2+8\theta ^{-1})(1+\theta )\) and \(C'(\theta ,q)\) is a positive constant.

Theorem 2

Let \(\varDelta _{n,\alpha ,\beta } {:}{=} {\lbrace } {\mathscr {M}}^{(1)}_{n,\alpha }\subset {{\widehat{{\mathscr {M}}}}}^{\ (1)}_{n,\beta } {\rbrace }\), we have:

$$\begin{aligned} {\mathbb {E}}\left| \left| b - {\hat{b}}_{{\varvec{{\hat{m}}}}_1}\right| \right| _n^2 = {\mathbb {E}}\left[ {\mathbb {E}}_{{\varvec{X}}}\left| \left| b - {\hat{b}}_{{\varvec{{\hat{m}}}}_1}\right| \right| _n^2 \textbf{1}_{\varDelta _{n,\alpha ,\beta }} \right] + {\mathbb {E}}\left[ \left| \left| b - {\hat{b}}_{{\varvec{{\hat{m}}}}_1}\right| \right| _n^2 \textbf{1}_{\varDelta _{n,\alpha ,\beta }^c} \right] . \end{aligned}$$

For the first term, on \(\varDelta _{n,\alpha ,\beta }\) we have \(\inf _{{\varvec{m}}\in {{\widehat{{\mathscr {M}}}}}^{\ (1)}_{n,\beta }} (\ldots ) \le \inf _{{\varvec{m}}\in {\mathscr {M}}^{(1)}_{n,\alpha }} (\ldots )\) so by applying Theorem 4 we obtain:

$$\begin{aligned} {\mathbb {E}}\left[ {\mathbb {E}}_{{\varvec{X}}}\left| \left| b - {\hat{b}}_{{\varvec{{\hat{m}}}}_1}\right| \right| _n^2 \textbf{1}_{\varDelta _{n,\alpha ,\beta }} \right]&\le {\mathbb {E}}\left[ C(\theta ) \inf _{m\in {\mathscr {M}}^{(1)}_{n,\alpha }} \left( \inf _{t\in S_{{\varvec{m}}}} \left| \left| b -t\right| \right| _n^2 + \sigma ^2\frac{D_{{\varvec{m}}}}{n} \right) + \sigma ^2\frac{\varSigma (\theta ,q)}{n}\right] \\&\le C(\theta ) \inf _{m\in {\mathscr {M}}^{(1)}_{n,\alpha }} \left( \inf _{t\in S_{{\varvec{m}}}} \left| \left| b -t\right| \right| _\mu ^2 + \sigma ^2\frac{D_{{\varvec{m}}}}{n} \right) + \sigma ^2\frac{\varSigma (\theta ,q)}{n}. \end{aligned}$$

For the second term, we have:

$$\begin{aligned} \left| \left| b - {\hat{b}}_{{\varvec{{\hat{m}}}}_1}\right| \right| _n^2 \textbf{1}_{\varDelta _{n,\alpha ,\beta }^c} \le 2\left| \left| b\right| \right| _n^2 \textbf{1}_{\varDelta _{n,\alpha ,\beta }^c} + 2 \left| \left| {\hat{b}}_{{\varvec{{\hat{m}}}}_1}\right| \right| _n^2 \textbf{1}_{\varDelta _{n,\alpha ,\beta }^c}. \end{aligned}$$

Using Hölder’s inequality with \(r,r'\in (1,\infty )\) such that \(\frac{1}{r} + \frac{1}{r'}=1\), we obtain:

$$\begin{aligned} {\mathbb {E}}\left[ \left| \left| b\right| \right| _n^2 \textbf{1}_{\varDelta _{n,\alpha ,\beta }^c} \right] \le {\mathbb {E}}\left[ \left( \frac{1}{n} \sum _{i=1}^{n} b({\varvec{X}}_i)^2 \right) ^r\right] ^{1/r} {\mathbb {P}}\left[ \varDelta _{n,\alpha ,\beta }^c \right] ^{ 1 /r'} \le \left| \left| b\right| \right| _{\textrm{L}^{2r}(\mu )}^2 \, {\mathbb {P}}\left[ \varDelta _{n,\alpha ,\beta }^c \right] ^{1/r'}, \end{aligned}$$

and if \(b\in \textrm{L}^\infty (\mu )\), the inequality also holds for \(r=\infty \) and \(r'=1\). Since \({\hat{b}}_{{\varvec{{\hat{m}}}}_1}\) is the empirical projection of \({\textbf{Y}}\) on \(S_{{\varvec{{\hat{m}}_1}}}\), we have \(\left| \left| {\hat{b}}_{{\varvec{{\hat{m}}}}_1}\right| \right| _n^2 \le \left| \left| {\textbf{Y}}\right| \right| _n^2\). Hence, we get:

$$\begin{aligned} {\mathbb {E}}\left[ \left| \left| {\hat{b}}_{{\varvec{{\hat{m}}}}_1}\right| \right| _n^2 \textbf{1}_{\varDelta _{n,\alpha ,\beta }^c}\right] \le {\mathbb {E}}\left[ \left| \left| {\textbf{Y}}\right| \right| _n^2 \textbf{1}_{\varDelta _{n,\alpha ,\beta }^c}\right]&= {\mathbb {E}}\left[ \frac{1}{n} \sum _{i=1}^{n} b({\varvec{X}}_i)^2 \textbf{1}_{\varDelta _{n,\alpha ,\beta }^c}\right] + \sigma ^2\, {\mathbb {P}}\left[ \varDelta _{n,\alpha ,\beta }^c \right] \\&\le \left| \left| b\right| \right| _{\textrm{L}^{2r}(\mu )}^2 \, {\mathbb {P}}\left[ \varDelta _{n,\alpha ,\beta }^c \right] ^{\frac{1}{r'}} + \sigma ^2 \,{\mathbb {P}}\left[ \varDelta _{n,\alpha ,\beta }^c \right] . \end{aligned}$$
(20)

To conclude, we give an upper bound on \({\mathbb {P}}\left[ \varDelta _{n,\alpha , \beta }^c\right] \):

$$\begin{aligned} {\mathbb {P}}\left[ \varDelta _{n,\alpha , \beta }^c \right]&= {\mathbb {P}}\left[ \exists {\varvec{m}}\in {\mathbb {N}}_+^p,\ {\varvec{m}}\in {\mathscr {M}}^{(1)}_{n,\alpha } \text { and } {\varvec{m}}\notin {{\widehat{{\mathscr {M}}}}}^{\ (1)}_{n,\beta } \right] \\&\le \sum _{{\varvec{m}}\in {\mathscr {M}}^{(1)}_{n,\alpha }} {\mathbb {P}}\left[ {\varvec{m}}\in {\mathscr {M}}^{(1)}_{n,\alpha } \text { and } {\varvec{m}}\notin {{\widehat{{\mathscr {M}}}}}^{\ (1)}_{n,\beta } \right] . \end{aligned}$$

Using the following inclusion of events:

$$\begin{aligned}&{\lbrace } {\varvec{m}}\in {\mathscr {M}}^{(1)}_{n,\alpha } \text { and } {\varvec{m}}\notin {{\widehat{{\mathscr {M}}}}}^{\ (1)}_{n,\beta } {\rbrace } \\&\subset {\lbrace } K_\nu ^\infty ({\varvec{m}}) \left( \left| \left| \textbf{G}_{{\varvec{m}}}^{-1}\right| \right| _{\textrm{op}} \vee 1 \right) \le \alpha \frac{n}{\log n}{\rbrace } \cap {\lbrace }K_\nu ^\infty ({\varvec{m}}) \left( \left| \left| {{\widehat{\textbf{G}}}}_{{\varvec{m}}}^{-1}\right| \right| _{\textrm{op}} \vee 1 \right) \ge \beta \frac{n}{\log n}{\rbrace } \\&\subset {\lbrace } \frac{\left| \left| {{\widehat{\textbf{G}}}}_{{\varvec{m}}}^{-1}\right| \right| _{\textrm{op}}}{\left| \left| \textbf{G}_{{\varvec{m}}}^{-1}\right| \right| _\textrm{op}} \ge \frac{\beta }{\alpha }{\rbrace } = {\lbrace } \lambda _{\min }({{\widehat{\textbf{G}}}}_{{\varvec{m}}}) \le \frac{\alpha }{\beta } \lambda _{\min } (\textbf{G}_{{\varvec{m}}}){\rbrace }, \end{aligned}$$

we get:

$$\begin{aligned} {\mathbb {P}}\left[ \varDelta _{n,\alpha , \beta }^c \right] \le \sum _{{\varvec{m}}\in {\mathscr {M}}^{(1)}_{n,\alpha }} {\mathbb {P}}\left[ \lambda _{\min }({{\widehat{\textbf{G}}}}_{{\varvec{m}}}) \le \frac{\alpha }{\beta } \lambda _{\min } (\textbf{G}_{{\varvec{m}}})\right] . \end{aligned}$$
(21)

Using Lemma 3 with the inequality \(K_\nu ^\infty ({\varvec{m}})\left| \left| \textbf{G}_{{\varvec{m}}}^{-1}\right| \right| _\textrm{op}\le \alpha \frac{n}{\log n}\) for \({\varvec{m}} \in {\mathscr {M}}_{n,\alpha }^{(1)}\), we obtain:

$$\begin{aligned} \forall {\varvec{m}}\in {\mathscr {M}}^{(1)}_{n,\alpha },\ {\mathbb {P}}\left[ \lambda _{\min }({{\widehat{\textbf{G}}}}_{{\varvec{m}}}) \le \frac{\alpha }{\beta } \lambda _{\min } (\textbf{G}_{{\varvec{m}}})\right]&\le D_{{\varvec{m}}} \exp \left( -h(1-\tfrac{\alpha }{\beta }) \frac{n}{K_\nu ^\infty ({\varvec{m}}) \left| \left| \textbf{G}_{{\varvec{m}}}^{-1}\right| \right| _\textrm{op}} \right) \\&\le D_{{\varvec{m}}} \, n^{-{h(1-\frac{\alpha }{\beta })}/{\alpha }}. \end{aligned}$$

Hence, we get:

$$\begin{aligned} {\mathbb {P}}\left[ \varDelta _{n,\alpha , \beta }^c \right] \le \sum _{{\varvec{m}}\in {\mathscr {M}}^{(1)}_{n,\alpha }} D_{{\varvec{m}}} \, n^{-h(1-\frac{\alpha }{\beta }) / \alpha } \le {{\,\textrm{Card}\,}}({\mathscr {M}}^{(1)}_{n,\alpha })\, n^{1-h(1-\frac{\alpha }{\beta }) / \alpha } . \end{aligned}$$

Using Proposition 4 in the appendix, we obtain:

$$\begin{aligned} {\mathbb {P}}\left[ \varDelta _{n,\alpha , \beta }^c \right] \le n^{2-h(1-\frac{\alpha }{\beta }) / \alpha } H_n^{p-1} = n^{-\kappa (\alpha , \beta )} H_n^{p-1}, \end{aligned}$$

with \(H_n {:}{=} \sum _{k=1}^{n} \frac{1}{k}\) and \(\kappa (\alpha ,\beta ) {:}{=} \frac{h(1-\frac{\alpha }{\beta })}{\alpha } - 2\). We know that \(H_n\sim \log n\), so we want a condition on \(\alpha \) such that \(\kappa (\alpha ,\beta )\) is strictly greater than \(r'\). Let \(x {:}{=} \frac{\beta }{\alpha } \ge 1\); we have:

$$\begin{aligned} \kappa (\alpha ,\beta )> r'&\iff h\!\left( 1-\frac{\alpha }{\beta }\right)> (2+r')\alpha \nonumber \\&\iff 1 - \frac{\alpha }{\beta } + \frac{\alpha }{\beta } \log \left( \frac{\alpha }{\beta } \right)> (2+r')\alpha \nonumber \\&\iff 1 - \frac{1 + \log (x)}{x} > \frac{(2+r')\beta }{x} \nonumber \\&\iff \frac{1 + (2+r')\beta + \log (x)}{x} < 1. \end{aligned}$$
(22)

The function:

$$\begin{aligned} f_{\beta , r'} (x) {:}{=}\frac{1 + (2+r')\beta + \log (x)}{x}, \end{aligned}$$

is decreasing on \([1,+\infty )\); moreover, \(f_{\beta , r'}(1)>1\) and \(f_{\beta , r'}(x) \rightarrow 0\) as \(x\rightarrow +\infty \), so there exists a unique \(x_{\beta , r'}\in (1,+\infty )\) such that \(f_{\beta , r'}(x_{\beta , r'})=1\). Thus, we have:

$$\begin{aligned} (22) \iff x\in (x_{\beta , r'}, +\infty ) \iff \alpha \in (0, \alpha _{\beta , r'} ), \end{aligned}$$

where \(\alpha _{\beta , r'} {:}{=} \frac{\beta }{x_{\beta , r'}}\). Hence, if \(\alpha \in (0, \alpha _{\beta , r'} )\) then we have:

$$\begin{aligned} {\mathbb {P}}\left[ \varDelta _{n,\alpha , \beta }^c \right] ^{1/r'} \le n^{-\frac{\kappa (\alpha ,\beta )}{r'}} H_n^{\frac{p-1}{r'}}, \end{aligned}$$

with \(\frac{\kappa (\alpha ,\beta )}{r'}>1\) and \(\frac{\kappa (\alpha , \beta )}{r'} \rightarrow 1\) as \(\alpha \rightarrow \alpha _{\beta , r'}\).\(\square \)

Remark 7

If we use the collections \({\mathscr {M}}_{n,\alpha }^{(2)}\) and \({{\widehat{{\mathscr {M}}}}}_{n,\beta }^{\ (2)}\) instead, we obtain the inequality (21) with \(\alpha \) and \(\beta \) replaced by \(\alpha '{:}{=}\sqrt{\alpha }\) and \(\beta '{:}{=}\sqrt{\beta }\). The rest of the proof is unchanged.

Remark 4

We have \(\alpha _{\beta , r'} {:}{=} \frac{\beta }{x_{\beta , r'}}\) where \(x_{\beta , r'}\) is the unique solution in \((1,+\infty )\) of the equation \(f_{\beta , r'}(x) = 1\) with:

$$\begin{aligned} f_{\beta , r'} (x) {:}{=}\frac{1+(2+r')\beta +\log x}{x}. \end{aligned}$$

Hence, \(x_{\beta , r'}\) satisfies the relation:

$$\begin{aligned} x_{\beta , r'} - \log x_{\beta , r'} = 1 + (2+r')\beta . \end{aligned}$$
(23)

Since the functions \(f_{\beta , r'}\) are decreasing on \((1,+\infty )\) and since \(\forall x\), \(f_{\beta , r'}(x)\) is increasing with \(\beta \) and \(r'\), we see that \(x_{\beta , r'}\) is increasing with \(\beta \) and \(r'\). Thus, the limits of \(x_{\beta , r'}\) when \(\beta \rightarrow 0\) and \(\beta \rightarrow +\infty \) exist. Using the relation (23), we obtain:

$$\begin{aligned} \lim _{\beta \rightarrow 0} x_{\beta , r'} = 1,\qquad \lim _{\beta \rightarrow +\infty } x_{\beta , r'} = +\infty , \qquad \lim _{r'\rightarrow \infty } x_{\beta ,r'} = +\infty , \end{aligned}$$

and we have \(x_{\beta , r'} \sim (2+r')\beta \) when \(\beta \rightarrow +\infty \). Thus, the limits of \(\alpha _{\beta , r'}\) are:

$$\begin{aligned} \lim _{\beta \rightarrow 0} \alpha _{\beta , r'} = 0,\qquad \lim _{\beta \rightarrow +\infty } \alpha _{\beta , r'} = \frac{1}{2+r'}, \qquad \lim _{r'\rightarrow +\infty } \alpha _{\beta , r'} = 0. \end{aligned}$$

Since \(x_{\beta , r'}\) is increasing with \(r'\), we see that \(\alpha _{\beta , r'}\) is decreasing with \(r'\). Finally, using the relation (23) again, we have:

$$\begin{aligned} \alpha _{\beta , r'} = \frac{\beta }{x_{\beta , r'}} = \frac{1}{2+r'}\left( 1 - \frac{1}{x_{\beta , r'}} - \frac{\log x_{\beta , r'}}{x_{\beta ,r'}} \right) . \end{aligned}$$

It is easy to see that the function \(x\mapsto 1 - \frac{1}{x} - \frac{\log x}{x}\) is increasing on \([1,+\infty )\) so \(\alpha _{\beta , r'}\) is also increasing with \(\beta \). \(\square \)
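
The threshold \(\alpha _{\beta , r'}\) of Remark 4 is defined implicitly by relation (23) and is straightforward to compute numerically. The sketch below (with an arbitrary value of \(r'\)) solves (23) by bisection and illustrates the limits stated above, in particular \(\alpha _{\beta , r'} \rightarrow \frac{1}{2+r'}\) as \(\beta \rightarrow +\infty \).

```python
import numpy as np

def x_beta(beta, r_prime, tol=1e-12):
    # Solve x - log(x) = 1 + (2 + r') * beta for x > 1, cf. relation (23)
    target = 1.0 + (2.0 + r_prime) * beta
    lo, hi = 1.0, 2.0
    while hi - np.log(hi) < target:                  # bracket the root (LHS increases on [1, inf))
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if mid - np.log(mid) < target else (lo, mid)
    return 0.5 * (lo + hi)

r_prime = 2.0
for beta in (0.1, 1.0, 10.0, 100.0):
    x = x_beta(beta, r_prime)
    print(f"beta = {beta:7.2f}   x = {x:10.4f}   alpha_(beta,r') = beta/x = {beta / x:.6f}")
# As beta grows, alpha_(beta,r') approaches 1/(2 + r') = 0.25 from below, matching Remark 4.
```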

7.4 Proof of Theorem 3

Before proving Theorem 3, we need some preliminary results.

Lemma 4

For all \(x> 0\) and all \({\varvec{m}}\in {\mathbb {N}}_+^p\) we have:

$$\begin{aligned} {\mathbb {P}}\left[ \left| \left| {{\widehat{\textbf{G}}}}_{{\varvec{m}}} - \textbf{G}_{{\varvec{m}}}\right| \right| _{\textrm{op}} \ge x \right]&\le D_{{\varvec{m}}} \exp \left( \frac{-nx^2/2}{K_{\nu }^{\infty }({\varvec{m}}) \big ( \left| \left| \textbf{G}_{{\varvec{m}}}\right| \right| _\textrm{op}+ \frac{2}{3} x \big ) } \right) \\&\le D_{{\varvec{m}}} \exp \left( \frac{-nx^2/2}{K_{\nu }^{\infty }({\varvec{m}}) \big ( \left| \left| \frac{\textrm{d}\mu }{\textrm{d}\nu }\right| \right| _\infty + \frac{2}{3} x \big ) } \right) . \end{aligned}$$

Proof

The set \(\{{ \varphi _{{\varvec{j}}}}:{{\varvec{j}}\le {\varvec{m}}-\textbf{1}}\}\) has cardinality \(D_{{\varvec{m}}}\) so let \({\lbrace } \phi _1, \dotsc , \phi _{D_{{\varvec{m}}}} {\rbrace }\) be its elements. We define the matrix \(\widehat{{\textbf{H}}}_{{\varvec{m}}}\) as:

$$\begin{aligned} \forall j,k\in {\lbrace }1,\dotsc , D_{{\varvec{m}}}{\rbrace },\quad \left[ \widehat{{\textbf{H}}}_{{\varvec{m}}}\right] _{j,k} {:}{=} {\langle }{ \phi _j, \phi _k }{\rangle }_n, \end{aligned}$$

and we denote by \({\textbf{H}}_{{\varvec{m}}}\) its expectation, whose components are \({\langle }{ \phi _j, \phi _k }{\rangle }_\mu \). In other words, we have reshaped the hypermatrices \({{\widehat{\textbf{G}}}}_{{\varvec{m}}}\) and \(\textbf{G}_{{\varvec{m}}}\) into \(D_{{\varvec{m}}}\times D_{{\varvec{m}}}\) matrices. Moreover, this operation preserves the operator norm:

$$\begin{aligned} \left| \left| \textbf{G}_{{\varvec{m}}}\right| \right| _\textrm{op}= \left| \left| {\textbf{H}}_{{\varvec{m}}}\right| \right| _\textrm{op}. \end{aligned}$$

Indeed, with \(d{:}{=} D_{{\varvec{m}}}\), we have:

$$\begin{aligned} \left| \left| \textbf{G}_{{\varvec{m}}}\right| \right| _\textrm{op}^2 = \sup _{\begin{array}{c} {\textbf{a}}\in {\mathbb {R}}^{{\varvec{m}}} \\ \left| \left| {\textbf{a}}\right| \right| _{{\mathbb {R}}^{{\varvec{m}}}} = 1 \end{array}} \left| \left| \textbf{G}_{{\varvec{m}}} \times _p {\textbf{a}}\right| \right| _{{\mathbb {R}}^{{\varvec{m}}}}^2 = \sup _{\begin{array}{c} {\textbf{a}}\in {\mathbb {R}}^{{\varvec{m}}} \\ \left| \left| {\textbf{a}}\right| \right| _{{\mathbb {R}}^{{\varvec{m}}}} = 1 \end{array}} \sum _{{\varvec{\ell }}\le {\varvec{m}} - \textbf{1}} \left( \sum _{{\varvec{k}} \le {\varvec{m}}-\textbf{1}} {\langle }{ \varphi _{{\varvec{\ell }}}, \varphi _{{\varvec{k}}} }{\rangle } a_{{\varvec{k}}} \right) ^2, \\ \left| \left| {\textbf{H}}_{{\varvec{m}}}\right| \right| _\textrm{op}^2 = \sup _{\begin{array}{c} {\textbf{a}}\in {\mathbb {R}}^{d} \\ \left| \left| {\textbf{a}}\right| \right| _{{\mathbb {R}}^{d}} = 1 \end{array}} \left| \left| {\textbf{H}}_{{\varvec{m}}} {\textbf{a}}\right| \right| _{{\mathbb {R}}^{d}}^2 = \sup _{\begin{array}{c} {\textbf{a}}\in {\mathbb {R}}^{d} \\ \left| \left| {\textbf{a}}\right| \right| _{{\mathbb {R}}^{d}} = 1 \end{array}} \sum _{j=1}^d \left( \sum _{i=1}^d {\langle }{ \phi _j, \phi _i }{\rangle } a_{i} \right) ^2. \end{aligned}$$

Since the sets \(\{{ \varphi _{{\varvec{j}}}}:{{\varvec{j}}\le {\varvec{m}}-\textbf{1}}\}\) and \({\lbrace } \phi _1, \dotsc , \phi _{d} {\rbrace }\) are equal, these two quantities are also equal. Hence, we have:

$$\begin{aligned} \left| \left| {{\widehat{\textbf{G}}}}_{{\varvec{m}}}- \textbf{G}_{{\varvec{m}}}\right| \right| _\textrm{op}= \left| \left| \widehat{{\textbf{H}}}_{{\varvec{m}}}- {\textbf{H}}_{{\varvec{m}}}\right| \right| _\textrm{op}, \end{aligned}$$

so we work on \(\widehat{{\textbf{H}}}_{{\varvec{m}}}\) and \({\textbf{H}}_{{\varvec{m}}}\) from now on. We write:

$$\begin{aligned} \widehat{{\textbf{H}}}_{{\varvec{m}}} - {\textbf{H}}_{{\varvec{m}}} = \sum _{i=1}^{n} {\textbf{Z}}_i,\quad {\textbf{Z}}_i {:}{=} \frac{1}{n} \left( {\textbf{V}}_i {\textbf{V}}_i^\top - {\mathbb {E}}\left[ {\textbf{V}}_i {\textbf{V}}_i^\top \right] \right) ,\quad {\textbf{V}}_i {:}{=} \begin{bmatrix} \phi _1({\varvec{X}}_i) \\ \vdots \\ \phi _{D_{{\varvec{m}}}}({\varvec{X}}_i) \\ \end{bmatrix}, \end{aligned}$$

and we use the Matrix Bernstein bound (Theorem 6 in the appendix).

  1. Bound on \(\left| \left| {\textbf{Z}}_i\right| \right| _\textrm{op}\):

    $$\begin{aligned} \frac{1}{n}\left| \left| {\textbf{V}}_i {\textbf{V}}_i^\top \right| \right| _\textrm{op}= \frac{1}{n} \left| \left| {\textbf{V}}_i\right| \right| ^2 = \frac{1}{n} \sum _{j=1}^{D_{{\varvec{m}}}} \phi _j({\varvec{X}}_i)^2 \le \frac{K_\nu ^\infty ({\varvec{m}})}{n}, \end{aligned}$$

    where the last inequality comes from Lemma 2. The same bound holds for \(\frac{1}{n}\left| \left| {\mathbb {E}}\left[ {\textbf{V}}_i {\textbf{V}}_i^\top \right] \right| \right| _\textrm{op}\), and since both matrices are positive semi-definite we get \(\left| \left| {\textbf{Z}}_i\right| \right| _\textrm{op}\le R\), with \(R{:}{=} \frac{K_\nu ^\infty ({\varvec{m}})}{n}\).

  2. Bound on \(\left| \left| \sum _{i=1}^{n} {\mathbb {E}}\left[ {\textbf{Z}}_i^2\right] \right| \right| _\textrm{op}\):

    $$\begin{aligned} \left| \left| \sum _{i=1}^{n} {\mathbb {E}}\left[ {\textbf{Z}}_i^2\right] \right| \right| _\textrm{op}= \sup _{\left| \left| {\textbf{a}}\right| \right| = 1} \sum _{i=1}^{n} {\mathbb {E}}\left[ \left| \left| {\textbf{Z}}_i\, {\textbf{a}}\right| \right| ^2 \right]&= \sup _{\left| \left| {\textbf{a}}\right| \right| = 1} \sum _{i=1}^{n} \sum _{j=1}^{D_{{\varvec{m}}}} {\mathbb {E}}\left[ ({\textbf{Z}}_i\, {\textbf{a}})_j^2 \right] \\&= \sup _{\left| \left| {\textbf{a}}\right| \right| = 1} \sum _{i=1}^{n} \sum _{j=1}^{D_{{\varvec{m}}}} {{\,\textrm{Var}\,}}\left[ ({\textbf{Z}}_i\, {\textbf{a}})_j \right] , \end{aligned}$$

    since \({\mathbb {E}}{\textbf{Z}}_i = {\textbf{0}}\). We compute the variance:

    $$\begin{aligned} {{\,\textrm{Var}\,}}\left[ ({\textbf{Z}}_i\, {\textbf{a}})_j \right] = {{\,\textrm{Var}\,}}\left[ \frac{1}{n} \phi _j({\varvec{X}}_i)\sum _{k=1}^{D_{{\varvec{m}}}} \phi _k({\varvec{X}}_i)\, a_k \right]&\le \frac{1}{n^2} {\mathbb {E}}\left[ \left( \phi _j({\varvec{X}}_i)\sum _{k=1}^{D_{{\varvec{m}}}} \phi _k({\varvec{X}}_i)\, a_k \right) ^2 \right] \\&= \frac{1}{n^2} {\mathbb {E}}\left[ \phi _j({\varvec{X}}_i)^2 \, t_{{\textbf{a}}}({\varvec{X}}_i)^2 \right] , \end{aligned}$$

    where \(t_{{\textbf{a}}} {:}{=} \sum _{k=1}^{D_{{\varvec{m}}}} a_k \,\phi _k \). Using Lemmas 1 and 2 yields:

    $$\begin{aligned} \sum _{i=1}^{n} \sum _{j=1}^{D_{{\varvec{m}}}} {{\,\textrm{Var}\,}}\left[ ({\textbf{Z}}_i\, {\textbf{a}})_j \right] \le \frac{1}{n^2} \sum _{i=1}^{n} {\mathbb {E}}\left[ \sum _{j=1}^{D_{{\varvec{m}}}} \phi _j({\varvec{X}}_i)^2\, t_{{\textbf{a}}}({\varvec{X}}_i)^2 \right]&\le \frac{1}{n} K_{\nu }^\infty ({\varvec{m}})\, \left| \left| t_{{\textbf{a}}} \right| \right| _\mu ^2 \\&\le \frac{1}{n} K_{\nu }^\infty ({\varvec{m}}) \, K_{\nu }^\mu ({\varvec{m}}) \, \left| \left| t_{{\textbf{a}}} \right| \right| _\nu ^2 \\&= \frac{1}{n} K_{\nu }^\infty ({\varvec{m}}) \,\left| \left| \textbf{G}_{{\varvec{m}}}\right| \right| _\textrm{op}\,\left| \left| {\textbf{a}}\right| \right| ^2. \end{aligned}$$

    Hence, \(\left| \left| \sum _{i=1}^{n} {\mathbb {E}}\left[ {\textbf{Z}}_i^2\right] \right| \right| _\textrm{op}\le \frac{1}{n} K_{\nu }^\infty ({\varvec{m}}) \left| \left| \textbf{G}_{{\varvec{m}}}\right| \right| _\textrm{op}{=}{:} v\).

Applying Theorem 6 yields:

$$\begin{aligned} {\mathbb {P}}\left[ \left| \left| \widehat{{\textbf{H}}}_{{\varvec{m}}} - {\textbf{H}}_{{\varvec{m}}}\right| \right| _\textrm{op}\ge x\right] \le D_{{\varvec{m}}} \exp \left( -\frac{nx^2/2}{K_\nu ^\infty ({\varvec{m}}) \big ( \left| \left| \textbf{G}_{{\varvec{m}}}\right| \right| _\textrm{op}+ \frac{2}{3} x \big )} \right) , \end{aligned}$$

which is the first inequality of Lemma 4. The second inequality follows from the following upper bound on \(\left| \left| \textbf{G}_{{\varvec{m}}}\right| \right| _\textrm{op}\):

$$\begin{aligned} \left| \left| \textbf{G}_{{\varvec{m}}}\right| \right| _\textrm{op}= \sup _{t\in S_{{\varvec{m}}}\setminus \{0\}} \frac{\left| \left| t\right| \right| _\mu ^2}{\left| \left| t\right| \right| _\nu ^2} \le \left| \left| \frac{\textrm{d}\mu }{\textrm{d}\nu }\right| \right| _\infty . \end{aligned}$$

\(\square \)
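
The deviation bound of Lemma 4 can be checked by simulation. In the sketch below, the reference measure \(\nu \) is the uniform distribution on [0, 1], the basis is the cosine basis (so that \(K_\nu ^\infty ({\varvec{m}}) = 2D_{{\varvec{m}}}-1\)), and \(\mu \) has the bounded density \(\frac{1}{2}+x\) with respect to \(\nu \); all of these choices are illustrative and not taken from the paper.

```python
import numpy as np

def cosine_basis(x, D):
    cols = [np.ones_like(x)] + [np.sqrt(2.0) * np.cos(np.pi * j * x) for j in range(1, D)]
    return np.column_stack(cols)

rng = np.random.default_rng(3)
D, n, n_rep, x_dev = 6, 5000, 300, 0.25
K_inf = 2 * D - 1                                    # K_nu^infty(m) for this basis
sup_density = 1.5                                    # ||d(mu)/d(nu)||_inf for f(x) = 1/2 + x

# Gram matrix G_m of the nu-orthonormal basis with respect to mu (numerical integration)
grid = np.linspace(0.0, 1.0, 20001)
dx = grid[1] - grid[0]
Phi = cosine_basis(grid, D)
G = (Phi * (0.5 + grid)[:, None]).T @ Phi * dx

exceed = 0
for _ in range(n_rep):
    U = rng.uniform(size=n)
    X = (-1.0 + np.sqrt(1.0 + 8.0 * U)) / 2.0        # inverse-CDF sampling from mu
    P = cosine_basis(X, D)
    G_hat = P.T @ P / n                              # empirical Gram matrix
    exceed += (np.linalg.norm(G_hat - G, 2) >= x_dev)

bound = D * np.exp(-(n * x_dev ** 2 / 2) / (K_inf * (sup_density + 2 * x_dev / 3)))
print("empirical tail:", exceed / n_rep, "  Lemma 4 bound:", bound)
```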

In order to prove Theorem 3, let us consider the events:

$$\begin{aligned} \varLambda ^{(\iota )}_{n}(\beta ,\gamma ) {:}{=} {\lbrace } {{\widehat{{\mathscr {M}}}}}_{n,\beta }^{(\iota )} \subset {\mathscr {M}}_{n,\gamma }^{(\iota )}{\rbrace },\qquad {\widetilde{\varOmega }}^{(\iota )}_{n}(\delta , \gamma ) {:}{=} \bigcap _{{\varvec{m}}\in {\mathscr {M}}_{n,\gamma }^{(\iota )}} \varOmega _{{\varvec{m}}}(\delta ),\qquad \iota \in {\lbrace }1,2{\rbrace }, \end{aligned}$$
(24)

where \(\varOmega _{{\varvec{m}}}(\delta )\) is defined by (2).

Lemma 5

For \(\iota \in {\lbrace }1,2{\rbrace }\), we have for all \(\delta \in (0,1)\) and all \(\gamma >0\):

$$\begin{aligned} {\mathbb {P}}\left[ {\widetilde{\varOmega }}^{(\iota )}_{n}(\delta , \gamma )^c \right] \le n^{-\frac{h(\delta )}{\gamma }+2} \, H_n^{p-1}, \end{aligned}$$

where \(H_n {:}{=}\sum _{k=1}^{n} \frac{1}{k}\) is the n-th harmonic number.

Proof

We use Proposition 3 with Remark 1:

$$\begin{aligned} {\mathbb {P}}\left[ {\widetilde{\varOmega }}^{(\iota )}_{n}(\delta , \gamma )^c \right] \le \sum _{{\varvec{m}}\in {\mathscr {M}}_{n,\gamma }^{(\iota )}} {\mathbb {P}}\left[ \varOmega _{{\varvec{m}}}(\delta )^c\right]&\le \sum _{{\varvec{m}}\in {\mathscr {M}}_{n,\gamma }^{(\iota )}} D_{{\varvec{m}}} \exp \left( -h(\delta )\frac{n}{K_\mu ^\infty ({\varvec{m}})} \right) \\&\le \sum _{{\varvec{m}}\in {\mathscr {M}}_{n,\gamma }^{(\iota )}} D_{{\varvec{m}}} \exp \left( -h(\delta )\frac{n}{K_\nu ^\infty ({\varvec{m}})\left| \left| \textbf{G}_{{\varvec{m}}}^{-1}\right| \right| _\textrm{op}} \right) \\&\le \sum _{{\varvec{m}}\in {\mathscr {M}}_{n,\gamma }^{(\iota )}} D_{{\varvec{m}}}\, n^{-\frac{h(\delta )}{\gamma }} \le n^{-\frac{h(\delta )}{\gamma }+2} \, H_n^{p-1}, \end{aligned}$$

where the last inequality comes from Proposition 4.\(\square \)

Lemma 6

(Compact case) We have for all \(\gamma>\beta >0\):

$$\begin{aligned} {\mathbb {P}}\left[ \varLambda ^{(1)}_{n}(\beta ,\gamma )^c\right] \le n^{- h(1-\frac{\gamma }{\beta })\frac{f_0}{\beta } +1 } \, H_n^{p-1}, \end{aligned}$$

where \(h(\delta ) = \delta + (1-\delta )\log (1-\delta )\), \(f_0>0\) is such that \(\frac{\textrm{d}\mu }{\textrm{d}\nu }(x)\ge f_0\) for all \(x\in A\) and \(H_n {:}{=} \sum _{k=1}^{n} \frac{1}{k}\).

Proof

We start with a union bound:

$$\begin{aligned} {\mathbb {P}}\left[ \varLambda ^{(1)}_{n}(\beta ,\gamma )^c\right]&= {\mathbb {P}}\left[ \exists {\varvec{m}}\in {\mathbb {N}}_+^p,\ {\varvec{m}}\in {{\widehat{{\mathscr {M}}}}}_{n,\beta }^{\ (1)} \text { and } {\varvec{m}}\notin {\mathscr {M}}_{n,\gamma }^{(1)} \right] \\&\le \sum _{\begin{array}{c} {\varvec{m}}\in {\mathbb {N}}_+^p \\ K_\nu ^\infty ({\varvec{m}}) \le \beta \frac{n}{\log n} \end{array}} {\mathbb {P}}\left[ {\varvec{m}}\in {{\widehat{{\mathscr {M}}}}}_{n,\beta }^{\ (1)} \text { and } {\varvec{m}}\notin {\mathscr {M}}_{n,\gamma }^{(1)} \right] . \end{aligned}$$

We have the following inclusion of events:

$$\begin{aligned}&{\lbrace }{\varvec{m}}\in {{\widehat{{\mathscr {M}}}}}_{n,\beta }^{\ (1)} \text { and } {\varvec{m}}\notin {\mathscr {M}}_{n,\gamma }^{(1)}{\rbrace } \\&\subset {\lbrace } K_\nu ^\infty ({\varvec{m}}) \left( \left| \left| {{\widehat{\textbf{G}}}}_{{\varvec{m}}}^{-1}\right| \right| _{\textrm{op}} \vee 1 \right) \le \beta \frac{n}{\log n}{\rbrace } \cap {\lbrace }K_\nu ^\infty ({\varvec{m}}) \left( \left| \left| \textbf{G}_{{\varvec{m}}}^{-1}\right| \right| _{\textrm{op}} \vee 1 \right) \ge \gamma \frac{n}{\log n}{\rbrace } \\&\subset {\lbrace } \frac{\left| \left| \textbf{G}_{{\varvec{m}}}^{-1}\right| \right| _\textrm{op}}{\left| \left| {{\widehat{\textbf{G}}}}_{{\varvec{m}}}^{-1}\right| \right| _\textrm{op}} \ge \frac{\gamma }{\beta } {\rbrace } \subset {\lbrace }\lambda _{\min }({{\widehat{\textbf{G}}}}_{{\varvec{m}}}) \ge \frac{\gamma }{\beta } \lambda _{\min }(\textbf{G}_{{\varvec{m}}}){\rbrace }, \end{aligned}$$

hence we obtain:

$$\begin{aligned} {\mathbb {P}}\left[ \varLambda ^{(1)}_{n}(\beta ,\gamma )^c\right] \le \sum _{\begin{array}{c} {\varvec{m}}\in {\mathbb {N}}_+^p \\ K_\nu ^\infty ({\varvec{m}}) \le \beta \frac{n}{\log n} \end{array}} {\mathbb {P}}\left[ \lambda _{\min }({{\widehat{\textbf{G}}}}_{{\varvec{m}}}) \ge \frac{\gamma }{\beta } \lambda _{\min }(\textbf{G}_{{\varvec{m}}}) \right] . \end{aligned}$$

We apply inequality (30) of Theorem 5 with \(R = \frac{1}{n} K_\nu ^\infty ({\varvec{m}})\):

$$\begin{aligned} {\mathbb {P}}\left[ \lambda _{\min }({{\widehat{\textbf{G}}}}_{{\varvec{m}}}) \ge \frac{\gamma }{\beta } \lambda _{\min }(\textbf{G}_{{\varvec{m}}}) \right] \le \exp \left( -h\left( 1-\frac{\gamma }{\beta }\right) \frac{n}{K_\nu ^\infty ({\varvec{m}}) \left| \left| \textbf{G}_{{\varvec{m}}}^{-1}\right| \right| _\textrm{op}}\right) . \end{aligned}$$

In the compact case, we have \(\left| \left| \textbf{G}_{{\varvec{m}}}^{-1}\right| \right| _\textrm{op}\le \frac{1}{f_0}\), see (3). Using Proposition 4, we obtain:

$$\begin{aligned} {\mathbb {P}}\left[ \varLambda ^{(1)}_{n}(\beta ,\gamma )^c\right] \le \sum _{\begin{array}{c} {\varvec{m}}\in {\mathbb {N}}_+^p \\ K_\nu ^\infty ({\varvec{m}}) \le \beta \frac{n}{\log n} \end{array}} n^{-h(1-\frac{\gamma }{\beta }) \frac{f_0}{\beta }} \le n^{-h(1-\frac{\gamma }{\beta }) \frac{f_0}{\beta }+1} H_n^{p-1}. \end{aligned}$$

\(\square \)

Lemma 7

(General case) We have for all \(\gamma>\beta >0\):

$$\begin{aligned} {\mathbb {P}}\left[ \varLambda ^{(2)}_{n}(\beta ,\gamma )^c\right] \le n^{-C(\beta ,\gamma ) \frac{B}{2\beta }+2} \, H_n^{p-1}, \end{aligned}$$

where \(C(\beta ,\gamma ){:}{=} \left( 1 - \sqrt{\beta /\gamma } \right) ^2\), \(B{:}{=} \big ( \left| \left| \frac{\textrm{d}\mu }{\textrm{d}\nu }\right| \right| _\infty + \frac{2}{3}\big )^{-1}\) and \(H_n {:}{=} \sum _{k=1}^{n} \frac{1}{k}\).

Proof

We start with a union bound:

$$\begin{aligned}&{\mathbb {P}}\left[ \varLambda ^{(2)}_{n}(\beta ,\gamma )^c\right] = {\mathbb {P}}\left[ \exists {\varvec{m}}\in {\mathbb {N}}_+^p,\, {\varvec{m}}\in {{\widehat{{\mathscr {M}}}}}_{n,\beta }^{\ (2)} \text { and } {\varvec{m}}\notin {\mathscr {M}}_{n,\gamma }^{(2)} \right] \\&\le \sum _{\begin{array}{c} {\varvec{m}}\in {\mathbb {N}}_+^p \\ K_\nu ^\infty ({\varvec{m}}) \le \beta \frac{n}{\log n} \end{array}} {\mathbb {P}}\left[ {\varvec{m}}\in {{\widehat{{\mathscr {M}}}}}_{n,\beta }^{\ (2)} \text { and } {\varvec{m}}\notin {\mathscr {M}}_{n,\gamma }^{(2)} \right] . \end{aligned}$$

We have the following inclusion of events:

$$\begin{aligned}&{ {\varvec{m}}\in {{\widehat{{\mathscr {M}}}}}_{n,\beta }^{\ (2)} \text { and } {\varvec{m}}\notin {\mathscr {M}}_{n,\gamma }^{(2)} } \\&\subset {\lbrace } K_\nu ^\infty ({\varvec{m}}) \left( \left| \left| {{\widehat{\textbf{G}}}}_{{\varvec{m}}}^{-1}\right| \right| _{\textrm{op}}^2 \vee 1 \right) \le \beta \frac{n}{\log n}{\rbrace } \cap {K_\nu ^\infty ({\varvec{m}}) \left( \left| \left| \textbf{G}_{{\varvec{m}}}^{-1}\right| \right| _{\textrm{op}}^2 \vee 1 \right) \ge \gamma \frac{n}{\log n}} \\&\subset {\lbrace } K_\nu ^\infty ({\varvec{m}}) \left| \left| {{\widehat{\textbf{G}}}}_{{\varvec{m}}}^{-1}\right| \right| _{\textrm{op}}^2 \le \beta \frac{n}{\log n}{\rbrace } \cap {K_\nu ^\infty ({\varvec{m}}) \left| \left| {{\widehat{\textbf{G}}}}_{{\varvec{m}}}^{-1} - \textbf{G}_{{\varvec{m}}}^{-1}\right| \right| _{\textrm{op}}^2 \ge \big (\sqrt{\gamma }-\textstyle {\sqrt{\beta }}\big )^2\dfrac{n}{\log n}} \\&\subset {\lbrace } \left| \left| {{\widehat{\textbf{G}}}}_{{\varvec{m}}}^{-1}\right| \right| _{\textrm{op}}^2 \le \frac{\beta }{K_\nu ^\infty ({\varvec{m}}) } \frac{n}{\log n}{\rbrace } \cap {\lbrace }\left| \left| {{\widehat{\textbf{G}}}}_{{\varvec{m}}}^{-1} - \textbf{G}_{{\varvec{m}}}^{-1}\right| \right| _{\textrm{op}} \ge \left( \sqrt{\frac{\gamma }{\beta }}-1\right) \left| \left| {{\widehat{\textbf{G}}}}_{{\varvec{m}}}^{-1}\right| \right| _{\textrm{op}}{\rbrace }. \end{aligned}$$

Let \(\eta {:}{=} \sqrt{\frac{\gamma }{\beta }}-1\) and let \(\epsilon \in (0,1)\). We consider the following decomposition:

$$\begin{aligned} {\lbrace }\left| \left| {{\widehat{\textbf{G}}}}_{{\varvec{m}}}^{-1} - \textbf{G}_{{\varvec{m}}}^{-1}\right| \right| _{\textrm{op}} \ge \eta \left| \left| {{\widehat{\textbf{G}}}}_{{\varvec{m}}}^{-1}\right| \right| _{\textrm{op}}{\rbrace } = E_1 \cup E_2, \end{aligned}$$

with:

$$\begin{aligned} E_1 {:}{=} {\lbrace }\left| \left| {{\widehat{\textbf{G}}}}_{{\varvec{m}}}^{-1} - \textbf{G}_{{\varvec{m}}}^{-1}\right| \right| _{\textrm{op}} \ge \eta \left| \left| {{\widehat{\textbf{G}}}}_{{\varvec{m}}}^{-1}\right| \right| _{\textrm{op}}{\rbrace } \cap {\lbrace } \left| \left| {{\widehat{\textbf{G}}}}_{{\varvec{m}}}^{-1} (\textbf{G}_{{\varvec{m}}} - {{\widehat{\textbf{G}}}}_{{\varvec{m}}})\right| \right| _\textrm{op}< \epsilon {\rbrace }, \\ E_2 {:}{=} {\lbrace }\left| \left| {{\widehat{\textbf{G}}}}_{{\varvec{m}}}^{-1} - \textbf{G}_{{\varvec{m}}}^{-1}\right| \right| _{\textrm{op}} \ge \eta \left| \left| {{\widehat{\textbf{G}}}}_{{\varvec{m}}}^{-1}\right| \right| _{\textrm{op}}{\rbrace } \cap {\lbrace } \left| \left| {{\widehat{\textbf{G}}}}_{{\varvec{m}}}^{-1} (\textbf{G}_{{\varvec{m}}} - {{\widehat{\textbf{G}}}}_{{\varvec{m}}})\right| \right| _\textrm{op}\ge \epsilon {\rbrace }. \end{aligned}$$
  • For \(E_1\), we apply Lemma 9 with \({\textbf{A}} {:}{=} {{\widehat{\textbf{G}}}}_{{\varvec{m}}}\) and \({\textbf{B}} {:}{=} \textbf{G}_{{\varvec{m}}} - {{\widehat{\textbf{G}}}}_{{\varvec{m}}}\):

    $$\begin{aligned} E_1&\subset {\lbrace } \frac{ \left| \left| {{\widehat{\textbf{G}}}}_{{\varvec{m}}}^{-1}\right| \right| _\textrm{op}^2 \left| \left| {{\widehat{\textbf{G}}}}_{{\varvec{m}}} - \textbf{G}_{{\varvec{m}}}\right| \right| _{\textrm{op}}}{1 -\left| \left| {{\widehat{\textbf{G}}}}_{{\varvec{m}}}^{-1} (\textbf{G}_{{\varvec{m}}} - {{\widehat{\textbf{G}}}}_{{\varvec{m}}})\right| \right| _\textrm{op}} \ge \eta \left| \left| {{\widehat{\textbf{G}}}}_{{\varvec{m}}}^{-1}\right| \right| _{\textrm{op}}{\rbrace } \cap {\lbrace } \left| \left| {{\widehat{\textbf{G}}}}_{{\varvec{m}}}^{-1} (\textbf{G}_{{\varvec{m}}} - {{\widehat{\textbf{G}}}}_{{\varvec{m}}})\right| \right| _\textrm{op}< \epsilon {\rbrace } \\&\subset {\lbrace } \left| \left| {{\widehat{\textbf{G}}}}_{{\varvec{m}}}^{-1}\right| \right| _\textrm{op}\left| \left| {{\widehat{\textbf{G}}}}_{{\varvec{m}}} - \textbf{G}_{{\varvec{m}}}\right| \right| _{\textrm{op}} \ge (1-\epsilon )\eta {\rbrace }. \end{aligned}$$
  • For \(E_2\), we have directly:

    $$\begin{aligned} E_2 \subset {\lbrace } \left| \left| {{\widehat{\textbf{G}}}}_{{\varvec{m}}}^{-1} (\textbf{G}_{{\varvec{m}}} - {{\widehat{\textbf{G}}}}_{{\varvec{m}}})\right| \right| _\textrm{op}\ge \epsilon {\rbrace } \subset {\lbrace } \left| \left| {{\widehat{\textbf{G}}}}_{{\varvec{m}}}^{-1}\right| \right| \left| \left| \textbf{G}_{{\varvec{m}}} - {{\widehat{\textbf{G}}}}_{{\varvec{m}}}\right| \right| _\textrm{op}\ge \epsilon {\rbrace }. \end{aligned}$$

Thus, we obtain:

$$\begin{aligned} \forall \epsilon \in (0,1),\qquad E_1\cup E_2 \subset {\lbrace } \left| \left| {{\widehat{\textbf{G}}}}_{{\varvec{m}}}^{-1}\right| \right| _\textrm{op}\left| \left| \textbf{G}_{{\varvec{m}}} - {{\widehat{\textbf{G}}}}_{{\varvec{m}}}\right| \right| _\textrm{op}\ge (1-\epsilon )\eta \wedge \epsilon {\rbrace }. \end{aligned}$$

We now choose \(\epsilon \) maximizing \((1-\epsilon )\eta \wedge \epsilon \). This maximum is achieved when \(\epsilon = (1-\epsilon )\eta \), that is:

$$\begin{aligned} \epsilon = \frac{\eta }{1+\eta } = 1 - \sqrt{\beta /\gamma } {=}{:} c(\beta ,\gamma ) \in (0,1). \end{aligned}$$

Thus, we obtain:

$$\begin{aligned}&{\mathbb {P}}\left[ \varLambda ^{(2)}_{n}(\beta ,\gamma )^c\right] \\&\le \sum _{\begin{array}{c} {\varvec{m}}\in {\mathbb {N}}_+^p \\ K_\nu ^\infty ({\varvec{m}}) \le \beta \frac{n}{\log n} \end{array}} {\mathbb {P}}\left[ { \left| \left| {{\widehat{\textbf{G}}}}_{{\varvec{m}}}^{-1}\right| \right| _{\textrm{op}}^2 \le \frac{\beta }{K_\nu ^\infty ({\varvec{m}}) } \frac{n}{\log n}} \cap {\left| \left| {{\widehat{\textbf{G}}}}_{{\varvec{m}}} - \textbf{G}_{{\varvec{m}}}\right| \right| _{\textrm{op}} \ge \frac{c(\beta ,\gamma )}{\left| \left| {{\widehat{\textbf{G}}}}_{{\varvec{m}}}^{-1}\right| \right| _\textrm{op}}}\right] \\&\le \sum _{\begin{array}{c} {\varvec{m}}\in {\mathbb {N}}_+^p \\ K_\nu ^\infty ({\varvec{m}}) \le \beta \frac{n}{\log n} \end{array}} {\mathbb {P}}\left[ \left| \left| {{\widehat{\textbf{G}}}}_{{\varvec{m}}} - \textbf{G}_{{\varvec{m}}}\right| \right| _\textrm{op}\ge c(\beta ,\gamma ) \sqrt{\frac{K_\nu ^\infty ({\varvec{m}})}{\beta } \frac{\log n}{n}}\right] . \end{aligned}$$

Let \(x{:}{=} c(\beta ,\gamma ) \sqrt{\frac{K_\nu ^\infty ({\varvec{m}})}{\beta } \frac{\log n}{n}}\) and notice that \(x\le 1\) if \(K_\nu ^\infty ({\varvec{m}}) \le \beta \frac{n}{\log n}\). We apply Lemma 4 and Proposition 4:

$$\begin{aligned}&{\mathbb {P}}\left[ \varLambda ^{(2)}_{n}(\beta ,\gamma )^c\right] \\&\le \sum _{\begin{array}{c} {\varvec{m}}\in {\mathbb {N}}_+^p \\ K_\nu ^\infty ({\varvec{m}}) \le \beta \frac{n}{\log n} \end{array}} D_{{\varvec{m}}} \exp \left( -\frac{n}{2} c^2(\beta ,\gamma ) \frac{K_\nu ^\infty ({\varvec{m}})}{\beta } \frac{\log n}{n} \left[ K_\nu ^\infty ({\varvec{m}}) \left( \left| \left| \frac{\textrm{d}\mu }{\textrm{d}\nu }\right| \right| _\infty + \frac{2}{3} x \right) \right] ^{-1} \right) \\&\le \sum _{\begin{array}{c} {\varvec{m}}\in {\mathbb {N}}_+^p \\ K_\nu ^\infty ({\varvec{m}}) \le \beta \frac{n}{\log n} \end{array}} D_{{\varvec{m}}}\, n^{- c^2(\beta ,\gamma ) \frac{B}{2\beta }} \le n^{-c^2(\beta ,\gamma )\frac{B}{2\beta } + 2} \, H_n^{p-1}, \end{aligned}$$

where \(B {:}{=} (\left| \left| \frac{\textrm{d}\mu }{\textrm{d}\nu }\right| \right| _\infty + \frac{2}{3})^{-1}\).\(\square \)

Now we can prove Theorem 3.

Theorem 3

Let \(\delta \in (0,1)\) and \(\gamma > \beta \) be constants to be chosen later. Let us introduce the event \(\varXi ^{(\iota )}_{n}(\beta ,\gamma ,\delta ){:}{=}\varLambda ^{(\iota )}_{n}(\beta ,\gamma ) \cap {\widetilde{\varOmega }}^{(\iota )}_{n}(\delta , \gamma )\) where \(\varLambda ^{(\iota )}_{n}(\beta ,\gamma )\) and \({\widetilde{\varOmega }}^{(\iota )}_{n}(\delta , \gamma )\) are defined by (24). On the event \(\varXi ^{(\iota )}_{n}(\beta ,\gamma ,\delta )\), for all \({\varvec{m}}\in {\mathscr {M}}^{(\iota )}_{n,\alpha }\), for all \(t\in S_{{\varvec{m}}}\) we have:

$$\begin{aligned} \left| \left| b - {\hat{b}}_{{\varvec{{\hat{m}}}}_\iota }\right| \right| _\mu ^2&\le 2 \left| \left| b - t\right| \right| _\mu ^2 + 2 \left| \left| {\hat{b}}_{{\varvec{{\hat{m}}}}_\iota } - t\right| \right| _\mu ^2 \\&\le 2 \left| \left| b - t\right| \right| _\mu ^2 + \frac{2}{1-\delta } \left| \left| {\hat{b}}_{{\varvec{{\hat{m}}}}_\iota } - t\right| \right| _n^2 \\&\le 2 \left| \left| b - t\right| \right| _\mu ^2 + \frac{4}{1-\delta } \left| \left| b - t\right| \right| _n^2 + \frac{4}{1-\delta } \left| \left| b- {\hat{b}}_{{\varvec{{\hat{m}}}}_\iota }\right| \right| _n^2. \end{aligned}$$

Taking the expectation yields for all \(t\in S_{{\varvec{m}}}\):

$$\begin{aligned} {\mathbb {E}}\left[ \left| \left| b - {\hat{b}}_{{\varvec{{\hat{m}}}}_\iota }\right| \right| _\mu ^2 \textbf{1}_{\varXi ^{(\iota )}_{n}(\beta ,\gamma ,\delta )}\right] \le \left( 2+\frac{4}{1-\delta } \right) \left| \left| b - t\right| \right| _\mu ^2 + \frac{4}{1-\delta } {\mathbb {E}}\left| \left| b - {\hat{b}}_{{\varvec{{\hat{m}}}}_\iota }\right| \right| _n^2. \end{aligned}$$
(25)

On the event \(\varXi ^{(\iota )}_{n}(\beta ,\gamma ,\delta )^c\), we use inequalities (12) and (16):

$$\begin{aligned} \left| \left| b - {\hat{b}}_{{\varvec{{\hat{m}}}}_\iota }\right| \right| _\mu ^2 \le 2\, \left| \left| b\right| \right| _\mu ^2 + 2\, \left| \left| {\hat{b}}_{{\varvec{{\hat{m}}}}_\iota }\right| \right| _\mu ^2&\le 2\, \left| \left| b\right| \right| _\mu ^2 + 2\, K_n^\mu ({\varvec{{\hat{m}}}}_\iota ) \left| \left| {\textbf{Y}}\right| \right| _n^2 \\&\le 2\, \left| \left| b\right| \right| _\mu ^2 + 2\, K_\nu ^\infty ({\varvec{{\hat{m}}}}_\iota )\left| \left| {{\widehat{\textbf{G}}}}_{{\varvec{{\hat{m}}}}_\iota }^{-1}\right| \right| _\textrm{op}\left| \left| {\textbf{Y}}\right| \right| _n^2 \\&\le 2\, \left| \left| b\right| \right| _\mu ^2 + 4\beta \frac{n}{\log n} \left| \left| {\textbf{Y}}\right| \right| _n^2. \end{aligned}$$

Using Hölder’s inequality as we did in (20), we obtain:

$$\begin{aligned} \begin{aligned}&{\mathbb {E}}\left[ \left| \left| b - {\hat{b}}_{{\varvec{{\hat{m}}}}_\iota }\right| \right| _\mu ^2 \textbf{1}_{\varXi ^{(\iota )}_n(\beta ,\gamma ,\delta )^c}\right] \le 2\,\left| \left| b\right| \right| _\mu ^2 \,{\mathbb {P}}\left[ \varXi ^{(\iota )}_n(\beta ,\gamma ,\delta )^c\right] \\&+ 8\beta \frac{n}{\log n} \left( \left| \left| b\right| \right| _{\textrm{L}^{2r}(\mu )}^2 \,{\mathbb {P}}\left[ \varXi ^{(\iota )}_n(\beta ,\gamma ,\delta )^c\right] ^{1/r'} + \sigma ^2 \,{\mathbb {P}}\left[ \varXi ^{(\iota )}_n(\beta ,\gamma ,\delta )^c\right] \right) . \end{aligned} \end{aligned}$$
(26)

We see that we need to control \({\mathbb {P}}\left[ \varXi ^{(\iota )}_n(\beta ,\gamma ,\delta )^c\right] \) by a term of order \(n^{-2r'}\).

We have decomposed the risk as the sum of (25) and (26). We give different upper bounds on these two terms depending on whether we are in the compact case or the general case.

\(\bullet \) Compact case. In equation (25), we apply Theorem 2: for all \(\alpha \in (0, \alpha _{\beta , r'})\) we have:

$$\begin{aligned} \begin{aligned} {\mathbb {E}}&\left[ \left| \left| b - {\hat{b}}_{{\varvec{{\hat{m}}}}_1}\right| \right| _\mu ^2 \textbf{1}_{\varXi ^{(1)}_{n}(\beta ,\gamma ,\delta )}\right] \\ {}&\le \left( 2 + \frac{4}{1-\delta } \big (1+C(\theta )\big ) \right) \inf _{{\varvec{m}} \in {\mathscr {M}}_{n,\alpha }} \left( \inf _{t\in S_{{\varvec{m}}}} \left| \left| b - t\right| \right| _\mu ^2 + \sigma ^2\frac{D_{{\varvec{m}}}}{n} \right) \\&+ \frac{4\sigma ^2}{1-\delta } \frac{\varSigma (\theta , q)}{n} + \frac{4}{1-\delta } C'\big (\left| \left| b\right| \right| _{\textrm{L}^{2r}(\mu )}^2, \sigma ^2\big ) \frac{(\log n)^{(p-1)/r'}}{n^{\kappa (\alpha ,\beta )/r'}}, \end{aligned} \end{aligned}$$

with \(\frac{\kappa (\alpha ,\beta )}{r'}>1\). To obtain an upper bound on (26), we apply Lemmas 5 and 6:

$$\begin{aligned} {\mathbb {P}}\left[ \varXi ^{(1)}_n(\beta ,\gamma ,\delta )^c\right]&\le {\mathbb {P}}\left[ {\widetilde{\varOmega }}^{(1)}_{n}(\delta , \gamma )^c \right] + {\mathbb {P}}\left[ \varLambda ^{(1)}_{n}(\beta ,\gamma )^c\right] \\&\le \left( n^{-\frac{h(\delta )}{\gamma }+2} + n^{- h(1-\frac{\gamma }{\beta })\frac{f_0}{\beta } +1 } \right) H_n^{p-1}, \end{aligned}$$

where \(h(\delta ) {:}{=} \delta + (1-\delta )\log (1-\delta )\) and \(H_n{:}{=} \sum _{k=1}^{n} \frac{1}{k}\). In order to obtain a term of order \(n^{-2r'}\), we need:

$$\begin{aligned} {\left\{ \begin{array}{ll} \frac{h(\delta )}{\gamma }-2>2r', \\ h\left( 1-\frac{\gamma }{\beta }\right) \frac{f_0}{\beta } -1>2r', \end{array}\right. }&\iff {\left\{ \begin{array}{ll} h(\delta )> 2(1+r')\gamma , \\ h\left( 1-\frac{\gamma }{\beta }\right)>(2r'+1) \frac{\beta }{f_0} , \end{array}\right. } \\ {}&\iff {\left\{ \begin{array}{ll} \delta> h^{-1}\big (2(1+r')\gamma \big ), \\ \gamma < \frac{1}{2(1+r')}, \\ h\left( 1-\frac{\gamma }{\beta }\right) >(2r'+1) \frac{\beta }{f_0} . \end{array}\right. } \end{aligned}$$

Let us work on the last two conditions. Setting \(x{:}{=} \frac{\gamma }{\beta }>1\), the conditions on \((\beta , \gamma )\) become:

$$\begin{aligned} {\left\{ \begin{array}{ll} x < \frac{1}{2(1+r')\beta }, \\ x\log x - x +1 > (2r'+1)\frac{\beta }{f_0}. \end{array}\right. } \end{aligned}$$

The function \(x\mapsto x\log x - x +1\) is increasing on \((1,+\infty )\) and ranges from 0 to \(+\infty \), so there exists \(x_{f_0,\beta }>1\) such that for all \(x>x_{f_0,\beta }\) we have \(x\log x - x +1 > (2r'+1)\frac{\beta }{f_0}\). Hence, we need to choose x such that:

$$\begin{aligned} x_{f_0,\beta }< x < \frac{1}{(2r'+2)\beta }. \end{aligned}$$
(27)

This is possible only if \(x_{f_0,\beta } < \frac{1}{(2r'+2)\beta }\), that is if:

$$\begin{aligned} (2r'+1) \frac{\beta }{f_0} < \frac{1}{(2r'+2)\beta } \log \left( \frac{1}{(2r'+2)\beta } \right) - \frac{1}{(2r'+2)\beta } +1. \end{aligned}$$

Let us introduce the new variable \(y{:}{=} (2r'+2)\beta \) and set \(R {:}{=} \frac{2r'+1}{2r'+2}\); the last inequality becomes:

$$\begin{aligned} \frac{R}{f_0} y + \frac{1+\log y}{y} < 1. \end{aligned}$$
(28)

The function \(y\mapsto \frac{R}{f_0} y + \frac{1+\log y}{y}\) is increasing on (0, 1), tends to \(-\infty \) at 0 and exceeds 1 at \(y=1\), so there exists \(y_{f_0,r'}\in (0,1)\) such that condition (28) is satisfied on \((0, y_{f_0,r'})\). To sum up, we have shown that there exists \(\beta _{f_0,r'} \in (0, \frac{1}{2r'+2})\) such that for every \(\beta < \beta _{f_0,r'}\), the interval in (27) is nonempty. We choose:

$$\begin{aligned} \gamma {:}{=} \beta x,\qquad x \text { satisfying } (27), \qquad \delta {:}{=} \frac{1 + h^{-1}\big (2(1+r')\gamma \big )}{2}, \end{aligned}$$

and we obtain that:

$$\begin{aligned} {\mathbb {E}}\left[ \left| \left| b - {\hat{b}}_{{\varvec{{\hat{m}}}}_1}\right| \right| _\mu ^2 \textbf{1}_{\varXi ^{(1)}_n(\beta ,\gamma ,\delta )^c}\right] \le C''(\left| \left| b\right| \right| _{\textrm{L}^{2r}(\mu )}, \beta , \sigma ^2) \, n^{-\lambda (\beta , r, f_0)} \, (\log n)^{\frac{p-1}{r'}-1}, \end{aligned}$$

where \(\lambda (\beta , r, f_0)>1\).
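
The constants \((\beta , \gamma , \delta )\) of the compact case are only constrained through conditions (27) and (28), so admissible values can be computed numerically. The sketch below does this for illustrative values of \(f_0\) and \(r'\); the specific choices of \(\beta \) and x inside the admissible ranges are arbitrary.

```python
import numpy as np

def h_inv(y, tol=1e-12):
    # invert h(d) = d + (1 - d) log(1 - d) on [0, 1) by bisection
    lo, hi = 0.0, 1.0 - 1e-15
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if mid + (1.0 - mid) * np.log1p(-mid) < y else (lo, mid)
    return 0.5 * (lo + hi)

def bisect(f, lo, hi, tol=1e-12):
    # root of an increasing function f on [lo, hi] with f(lo) < 0 < f(hi)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if f(mid) < 0.0 else (lo, mid)
    return 0.5 * (lo + hi)

f0, r_prime = 0.5, 2.0                                    # illustrative values
R = (2 * r_prime + 1) / (2 * r_prime + 2)

# Solve (28): (R/f0) y + (1 + log y)/y = 1 on (0, 1); its left-hand side is increasing there.
y_star = bisect(lambda y: (R / f0) * y + (1.0 + np.log(y)) / y - 1.0, 1e-12, 1.0)
beta = 0.5 * y_star / (2 * r_prime + 2)                   # any beta < beta_{f0,r'} = y_star/(2r'+2)

# Condition (27): x_{f0,beta} < x < 1/((2r'+2) beta); take the midpoint of this interval.
x_hi = 1.0 / ((2 * r_prime + 2) * beta)
x_lo = bisect(lambda x: x * np.log(x) - x + 1.0 - (2 * r_prime + 1) * beta / f0, 1.0, x_hi)
x = 0.5 * (x_lo + x_hi)

gamma = beta * x
delta = 0.5 * (1.0 + h_inv(2 * (1 + r_prime) * gamma))
print("beta =", beta, " gamma =", gamma, " delta =", delta)
```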

\(\bullet \) General case. In equation (25), if we follow the proof of Theorem 2 (see Remark 7), we see that if \(\alpha \in (0, \alpha _{\beta ^{1/2}, r'}^2)\) then we have:

$$\begin{aligned} \begin{aligned} {\mathbb {E}}\left| \left| b - {\hat{b}}_{{\varvec{{\hat{m}}}}_2}\right| \right| _n^2&\le C(\theta ) \left| \left| b - t\right| \right| _\mu ^2 + \sigma ^2\frac{D_{{\varvec{m}}}}{n} + \sigma ^2 \frac{\varSigma (\theta , q)}{n} \\ {}&+ C'\big (\left| \left| b\right| \right| _{\textrm{L}^{2r}(\mu )}^2, \sigma ^2\big ) \frac{(\log n)^{(p-1)/r'}}{n^{\kappa (\alpha ^{\frac{1}{2}},\beta ^{\frac{1}{2}})/r'}}, \end{aligned} \end{aligned}$$

with \(\frac{\kappa (\alpha ^{\frac{1}{2}},\beta ^{\frac{1}{2}})}{r'}>1\). Thus, we obtain:

$$\begin{aligned} \begin{aligned} {\mathbb {E}}&\left[ \left| \left| b - {\hat{b}}_{{\varvec{{\hat{m}}}}_2}\right| \right| _\mu ^2 \textbf{1}_{\varXi ^{(2)}_{n}(\beta ,\gamma ,\delta )}\right] \\ {}&\le \left( 2 + \frac{4}{1-\delta } \big (1+C(\theta )\big ) \right) \inf _{{\varvec{m}} \in {\mathscr {M}}_{n,\alpha }^{(2)}} \left( \inf _{t\in S_{{\varvec{m}}}} \left| \left| b - t\right| \right| _\mu ^2 + \sigma ^2\frac{D_{{\varvec{m}}}}{n} \right) \\&+ \frac{4\sigma ^2}{1-\delta } \frac{\varSigma (\theta , q)}{n} + \frac{4}{1-\delta } C'\big (\left| \left| b\right| \right| _{\textrm{L}^{2r}(\mu )}^2, \sigma ^2\big ) \frac{(\log n)^{(p-1)/r'}}{n^{\kappa (\alpha ^{\frac{1}{2}},\beta ^{\frac{1}{2}})/r'}}. \end{aligned} \end{aligned}$$

To obtain an upper bound on (26), we apply Lemmas 5 and 7:

$$\begin{aligned} {\mathbb {P}}\left[ \varXi ^{(2)}_n(\beta ,\gamma ,\delta )^c\right]&\le {\mathbb {P}}\left[ {\widetilde{\varOmega }}^{(2)}_{n}(\delta , \gamma )^c \right] + {\mathbb {P}}\left[ \varLambda ^{(2)}_{n}(\beta ,\gamma )^c\right] \\&\le \left( n^{-\frac{h(\delta )}{\gamma }+2} + n^{-C(\beta ,\gamma ) \frac{B}{2\beta }+2} \right) H_n^{p-1}, \end{aligned}$$

where \(C(\beta ,\gamma ){:}{=} \left( 1 - \sqrt{\beta /\gamma } \right) ^2\), \(B {:}{=} (\left| \left| \frac{\textrm{d}\mu }{\textrm{d}\nu }\right| \right| _\infty + \frac{2}{3})^{-1}\) and \(H_n{:}{=} \sum _{k=1}^{n} \frac{1}{k}\). To obtain a term of order \(n^{-2r'}\), we need:

$$\begin{aligned} {\left\{ \begin{array}{ll} \frac{h(\delta )}{\gamma }- 2>2r', \\ C(\beta ,\gamma ) \frac{B}{2\beta } - 2>2r', \end{array}\right. }&\iff {\left\{ \begin{array}{ll} h(\delta )> 2(1+r')\gamma , \\ C(\beta ,\gamma ) \frac{B}{2}> 2(1+r')\beta , \end{array}\right. } \\ {}&\iff {\left\{ \begin{array}{ll} \delta> h^{-1}\big (2(1+r')\gamma \big ), \\ \gamma < \frac{1}{2(1+r')}, \\ \frac{C(\beta ,\gamma ) B}{4(1+r')} > \beta . \end{array}\right. } \end{aligned}$$

Let \(x {:}{=} \sqrt{\beta /\gamma } \in (0,1)\), the conditions on \((\beta , \gamma )\) can be rewritten as:

$$\begin{aligned} {\left\{ \begin{array}{ll} \frac{\beta }{x^2}< \frac{1}{2(1+r')}, \\ \beta< (1-x)^2 \frac{B}{4(1+r')}, \end{array}\right. } \iff \beta < \frac{1}{2(1+r')} \left( x^2 \wedge (1-x)^2 \frac{B}{2} \right) . \end{aligned}$$

We choose x maximizing this bound. This maximum is achieved when \(x^2 = (1-x)^2 \frac{B}{2}\), that is \(x = \frac{\sqrt{B/2}}{1 + \sqrt{B/2}}\). Finally, we choose:

$$\begin{aligned} x {:}{=} \frac{\sqrt{B/2}}{1 + \sqrt{B/2}}, \qquad \gamma {:}{=} \frac{\beta }{x^2}, \qquad \delta {:}{=} \frac{1+h^{-1}\big (2(1+r')\gamma \big )}{2}, \end{aligned}$$

and we obtain that for all \(\beta \in (0, \beta _{B,r'})\) with:

$$\begin{aligned} \beta _{B,r'} {:}{=} \frac{1}{2(1+r')} \left( \frac{\sqrt{B/2}}{1 + \sqrt{B/2}} \right) ^2, \end{aligned}$$

we have:

$$\begin{aligned} {\mathbb {E}}\left[ \left| \left| b - {\hat{b}}_{{\varvec{{\hat{m}}}}_2}\right| \right| _\mu ^2 \textbf{1}_{\varXi ^{(2)}_n(\beta ,\gamma ,\delta )^c}\right] \le C''(\left| \left| b\right| \right| _{\textrm{L}^{2r}(\mu )}, \beta , \sigma ^2) \, n^{-\lambda (\beta , r, B)} \, (\log n)^{\frac{p-1}{r'}-1}, \end{aligned}$$

where \(\lambda (\beta , r, B)>1\).\(\square \)
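
Similarly, the constants chosen in the general case are explicit functions of \(\left| \left| \frac{\textrm{d}\mu }{\textrm{d}\nu }\right| \right| _\infty \) and \(r'\). The following sketch evaluates them for illustrative values of these quantities; it only reproduces the formulas above and is not part of the proof.

```python
import numpy as np

def h_inv(y, tol=1e-12):
    # invert h(d) = d + (1 - d) log(1 - d) on [0, 1) by bisection
    lo, hi = 0.0, 1.0 - 1e-15
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if mid + (1.0 - mid) * np.log1p(-mid) < y else (lo, mid)
    return 0.5 * (lo + hi)

sup_density, r_prime = 1.5, 2.0                        # illustrative ||d(mu)/d(nu)||_inf and r'
B = 1.0 / (sup_density + 2.0 / 3.0)
x = np.sqrt(B / 2.0) / (1.0 + np.sqrt(B / 2.0))        # maximizer of min(x^2, (1-x)^2 * B/2)
beta_max = x ** 2 / (2.0 * (1.0 + r_prime))            # threshold beta_{B,r'}
beta = 0.5 * beta_max                                  # any value in (0, beta_{B,r'})
gamma = beta / x ** 2
delta = 0.5 * (1.0 + h_inv(2.0 * (1.0 + r_prime) * gamma))
print(f"B = {B:.4f}, beta_max = {beta_max:.5f}, gamma = {gamma:.5f}, delta = {delta:.4f}")
```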