1 Introduction

1.1 Statistical model

The study of differences among groups is the main challenge of two-sample problems, and statistical methods are required to do this in various fields (biology or social research for example). Nonparametric inference procedures are well developed for comparing samples coming from two populations, modeled by two real random variables, \(X_0\) and \(X\). Most of the methods are based on the comparison of the cumulative distribution functions (c.d.f. in the sequel) \(F_0\) and \(F\) of \(X_0\) and \(X,\) respectively. The study of the relative density \(r\) of \(X\) with respect to \(X_0\) is quite recent. Assume that \(f_0\), the density of \(X_0\), is defined on an interval \(A_0\) and does not vanish on it. Denote by \(F_0^{-1}\) the inverse of \(F_0\). The relative density is defined as the density of the variable \(F_0(X)\) and can be expressed as

$$\begin{aligned} r(x)=\frac{f\circ F_0^{-1}(x)}{f_0\circ F_0^{-1}(x)},\quad x\in F_0(A), \end{aligned}$$
(1)

where \(\circ \) is the composition symbol and \(f\) is a density of \(X\), defined on an interval \(A\subset \mathbb {R}\). In the present work, we focus on the optimal adaptive estimation of this function (in the oracle and minimax senses), from two independent samples \((X_i)_{i\in \{1,\ldots ,n\}}\) and \((X_{0,i_0})_{i_0\in \{1,\dots ,n_0\}}\) of variables \(X\) and \(X_0\).
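To fix ideas, here is a minimal Python sketch (not part of the original article, whose experiments use MATLAB; the Gaussian location-shift example, the sample size and the seed are arbitrary choices) illustrating that \(r\) is nothing but the density of the grade-transformed variable \(F_0(X)\):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n = 5000
X = rng.normal(loc=0.5, scale=1.0, size=n)          # comparison sample, X ~ N(0.5, 1)

# Grade (relative) transformation with the, here known, reference c.d.f. F_0 of N(0, 1)
U = norm.cdf(X)                                      # U = F_0(X) lives in (0, 1)

# Closed-form relative density r(u) = f(F_0^{-1}(u)) / f_0(F_0^{-1}(u)) at bin centers
centers = (np.arange(20) + 0.5) / 20
r_true = norm.pdf(norm.ppf(centers), loc=0.5) / norm.pdf(norm.ppf(centers))

# A crude histogram estimate of the density of U should track r_true
hist, _ = np.histogram(U, bins=20, range=(0.0, 1.0), density=True)
print(np.round(r_true[::4], 2))
print(np.round(hist[::4], 2))
```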

1.2 Motivation

The most classical nonparametric methods to tackle the initial issue of comparing \(F\) and \(F_0\) are statistical tests, such as the Kolmogorov–Smirnov (Kolmogorov 1933, 1941; Smirnov 1939, 1944), Wilcoxon (Wilcoxon 1945), or Mann–Whitney (Mann and Whitney 1947) tests, all of which check the null hypothesis of equal c.d.f. We refer to Gibbons and Chakraborti (2011) for a detailed review of these tests. Probability plotting tools such as quantile–quantile plots, whose functional form is \(x\mapsto F_0^{-1}(F(x))\), are also commonly considered. However, the representation of the quantiles of one distribution versus the quantiles of the other may be questionable. For example, Holmgren (1995) showed that it does not enable scale-invariant comparisons of treatment effects and that it is sensitive to outliers. Some authors have thus been interested in an alternative, the probability–probability plot, a graph of the percentiles of one distribution versus the percentiles of the other (see, among others, Li et al. 1996). Its functional form can be written \(x\mapsto F(F_0^{-1}(x))\), which defines the relative c.d.f., a function closely related to the receiver operating characteristic (ROC) curve: the latter is \(x\mapsto 1-F(F_0^{-1}(1-x))\). This curve is well known in fields such as signal detection and diagnostic testing, for example. Both the relative c.d.f. and the ROC curve are based on the following transformation of the data: to compare \(X\) to \(X_0\), consider \(F_0(X)\), a variable known in the literature as the grade transformation or, most commonly, as the relative transformation. Its c.d.f. is the relative c.d.f. defined above. The basic idea is to look at the rank that a comparison value (that is, a value of \(X\)) would have in the reference group (that is, in the values of the sample of \(X_0\)). To recover the ROC curve or the relative c.d.f. from a double sample in a nonparametric way, two types of strategies have mainly been studied: estimators based on the empirical c.d.f. of \(X\) and \(X_0\) (see Hsieh and Turnbull 1996a, b and references therein), and kernel smoothers (see, among others, Lloyd 1998; Lloyd and Yong 1999; Hall and Hyndman 2003 for the ROC curve, and Gastwirth 1968; Hsieh 1995; Handcock and Morris 1999 for the relative c.d.f.). Conditional versions of the previous strategies have also been studied (see the review provided by Pardo et al. 2013). These two functions are based on the c.d.f. \(F\) and \(F_0\) of the two variables to be compared.

Nevertheless, focusing on their densities is likely to provide more precise and visual details. That is why the present work addresses the comparison problem through the estimation of the relative density (1), which is the derivative of the relative c.d.f. and thus a density of the variable \(F_0(X)\). Besides being graphically more informative than the ROC curve (see the introduction of Molanes and Cao 2008b), the relative density has another advantage: an estimate of this function is required to study the asymptotic variance of any ROC curve estimator and thus to build confidence regions based on it (see the references above and also Claeskens et al. 2003). Moreover, some summary measures for the comparison of \(X\) and \(X_0\) are based on the relative density \(r\): the most classical example is the Kullback–Leibler divergence (Kullback and Leibler 1951), which can be recovered by plugging in an estimate of \(r\) (Mielniczuk 1992; Handcock and Morris 1999). But there exist other measures that can be derived from the relative density, such as the Gini separation measurement and some discriminant rules (Gijbels and Mielniczuk 1995), Lorenz curves and the median polarization index (Handcock and Morris 1999). It is also possible to build goodness-of-fit tests from the relative density; see Kim (2000).

However, few investigations are concerned with theoretical results for the estimation of the relative density, and most of the references are sociological ones. A clear account is provided by Handcock and Janssen (2002). Early mathematical references for the relative density are Bell and Doksum (1966) and Silverman (1978), who approached the problem from the maximum likelihood point of view. A kernel estimate was first proposed by Ćwik and Mielniczuk (1993) and modified by Molanes and Cao (2008a), who proved asymptotic expansions for the mean integrated squared error (MISE), under the assumption that \(r\) is twice continuously differentiable. The problem of bandwidth selection is also addressed, but, to the best of our knowledge, few theoretical results have been proved for the estimators built with the selected parameters. The question has also been studied in a semiparametric setting (see Cheng and Chu 2004 and references therein). Although the relative density can also be related to the density ratio, for which numerous studies are available (see Sugiyama et al. 2012 for a review), some authors have noticed that the relative distribution leads to smoother and more stable results (Yamada et al. 2013). Our work is the first to study a nonparametric projection method in this setting and to provide a detailed optimality study of an adaptive estimator.

1.3 Contribution and overview

Our main contribution is theoretical. The novelty of our work is to provide a theoretically justified adaptive estimator with an optimal rate of convergence. A collection of projection estimators on linear models is built in Sect. 2, and the quadratic risk is studied: the upper bound is non-trivial and requires non-straightforward splittings. We obtain a bias-variance decomposition which makes it possible to understand what we can expect at best from adaptive estimation, the subject of Sect. 3: the model selection is performed automatically, in a data-driven way, in the spirit of the Goldenshluger–Lepski method (Goldenshluger and Lepski 2011). The resulting estimator is shown to be optimal in the collection, but also from an asymptotic point of view among all possible estimators for a large class of regular relative densities. To be more precise, an oracle-type inequality first proves that adaptation has no cost (Sect. 3.2): the estimator achieves the same performance as the one which would have been selected if the regularity index of the target function had been known. The choice of the quadratic risk permits the use of the Hilbert structure and thus of the standard model selection tools (mainly concentration inequalities), even if our selection criterion is based on the Goldenshluger–Lepski methodology. Rates of convergence are deduced for functions \(r\) belonging to Besov balls: we obtain the nonparametric rate \((n^{-1}+n_0^{-1})^{2\alpha /(2\alpha +1)}\), where \(\alpha \) is the smoothness index of \(r\). These rates are also shown to be optimal: a lower bound for the minimax risk is established (Sect. 3.3). Such results are new for this estimation problem. In particular, no assumption about a link between the sample sizes \(n\) and \(n_0\) is required, and the regularity assumptions are not restrictive. Section 4 provides a brief discussion of some practical issues via simulations. Finally, the proofs are gathered in Sect. 6, after some concluding remarks. A supplementary material is available with further simulation results (reconstructions and risk computations), as well as further details about technical definitions and proofs.

2 The collection of projection estimators

For the sake of clarity, we assume that the variables \(X\) and \(X_0\) have the same support: \(A=A_0\). Hence, \(F_0(A)=(0;1)\) is the estimation interval. This assumption is natural when comparing the distribution of \(X\) to that of \(X_0\).

2.1 Approximation spaces

We denote by \(L^2((0;1))\) the space of square integrable functions on \((0;1)\), equipped with its usual Hilbert structure: \(\langle .,.\rangle \) is the scalar product and \(\Vert .\Vert \) the associated norm. The relative density \(r\), defined by (1) and estimated on its definition set \((0;1),\) is assumed to belong to \(L^2((0;1))\). Our estimation method is based on the following scheme: we consider a family \((S_m)_{m\in \fancyscript{M}}\) of finite dimensional subspaces of \(L^2((0;1))\) and compute a collection of estimators \((\hat{r}_m)_{m\in \fancyscript{M}}\), where, for each \(m\), \(\hat{r}_m\) belongs to \(S_m\). In a second step, a data-driven procedure chooses the final estimator \(\hat{r}_{\hat{m}}\) among the collection.

Here, simple projection trigonometric spaces are considered: the set \(S_m\) is linearly spanned by \(\varphi _1,\ldots ,\varphi _{2m+1}\), with

$$\begin{aligned} \varphi _1(x)=1,\;\;\varphi _{2j}(x)=\sqrt{2}\cos (2\pi jx),\;\;\varphi _{2j+1}(x)=\sqrt{2}\sin (2\pi jx) \text{, } \quad x\in (0;1). \end{aligned}$$

We set \(D_m=2m+1\), the dimension of \(S_m\), and \(\fancyscript{M}=\{1,2,\ldots ,\lfloor \min (n,n_0)/2\rfloor -1\}\), the collection of indices, whose cardinality depends on the two sample sizes. The largest space in the collection has maximal dimension \(D_{m_{\max }}\), which is subject to constraints appearing later. We focus on the trigonometric basis mainly because it is simple to handle. It is also used by several authors for many other nonparametric estimation problems (see, e.g., Efromovich 1999). Moreover, the presence of a constant function (namely \(\varphi _1\)) in the basis is perfectly well adapted to the relative density estimation context; see Sect. 4.2 below. The method could however probably be extended to other projection spaces, at the price of different “tricks” in the computations.
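The following short Python sketch (an illustration added here, not taken from the paper) encodes this basis and checks two facts used later: orthonormality, and the identity \(\sum _{j=1}^{D_m}\varphi _j^2(x)=D_m\) exploited below in the variance bound (5):

```python
import numpy as np

def phi(j, x):
    """j-th function of the sine-cosine basis on (0, 1), j = 1, ..., D_m."""
    x = np.asarray(x, dtype=float)
    if j == 1:
        return np.ones_like(x)
    k = j // 2                      # frequency attached to phi_{2k} and phi_{2k+1}
    if j % 2 == 0:
        return np.sqrt(2.0) * np.cos(2.0 * np.pi * k * x)
    return np.sqrt(2.0) * np.sin(2.0 * np.pi * k * x)

m = 3
D_m = 2 * m + 1
x = np.linspace(0.0, 1.0, 2048, endpoint=False)     # fine regular grid of (0, 1)

# Orthonormality, checked by an equispaced Riemann sum (exact here for these frequencies)
gram = np.array([[np.mean(phi(i, x) * phi(j, x)) for j in range(1, D_m + 1)]
                 for i in range(1, D_m + 1)])
print(np.allclose(gram, np.eye(D_m)))                                      # True

# The identity sum_{j=1}^{D_m} phi_j(x)^2 = D_m for every x in (0, 1)
print(np.allclose(sum(phi(j, x) ** 2 for j in range(1, D_m + 1)), D_m))    # True
```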

2.2 Estimation on a fixed model

For each index \(m\in \fancyscript{M}\), we define an estimator for the orthogonal projection \(r_m=\sum _{j=1}^{D_m}a_j\varphi _j\) of \(r\) onto the model \(S_m\), where \(a_j=\langle \varphi _j,r\rangle \). First, notice that

$$\begin{aligned} \mathbb {E}\left[ \varphi _j(F_0(X))\right] \!{=}\!\int _A\varphi _j\circ F_0(x)f(x)dx\!=\!\int _{F_0(A)}\varphi _j(u)\frac{f\circ F_{0}^{-1}(u)}{f_0\circ F_0^{-1}(u)}du\!=\!\langle \varphi _j,r\rangle =a_j, \end{aligned}$$
(2)

with the change of variables \(u=F_0(x)\) and keeping in mind that \(F_0(A)=(0;1)\). Thus, the following function is well suited to estimating \(r_m\):

$$\begin{aligned} \hat{r}_m(x)=\sum _{j=1}^{D_m}\hat{a}_j\varphi _j(x),\;\;\text{ with }\;\;\hat{a}_j=\frac{1}{n}\sum _{i=1}^n\varphi _j\left( \hat{F}_{0}(X_i)\right) , \end{aligned}$$
(3)

and where \(\hat{F}_{0}\) is the empirical c.d.f. of the sample \((X_{0,i_0})_{i_0=1,\ldots ,n_0}\), that is

$$\begin{aligned} \hat{F}_{0}:\;x\;\mapsto \;\frac{1}{n_0}\sum _{i_0=1}^{n_0}\mathbf {1}_{X_{0,i_0}\le x}. \end{aligned}$$

Note that in the “toy” case of known c.d.f. \(F_0\), the procedure amounts to estimating a density: \(\hat{r}_m\) is the classical density projection estimator (adapted to the estimation of the density of \(F_0(X)\)).
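As an illustration, the estimator (3) can be coded in a few lines; the sketch below is added here for convenience (it is not the authors' code, and the exponential toy example is an arbitrary choice). The coefficients \(\hat{a}_j\) are empirical means of \(\varphi _j(\hat{F}_{0}(X_i))\), with \(\hat{F}_{0}\) the empirical c.d.f. of the reference sample:

```python
import numpy as np

def phi_matrix(u, D):
    """Rows: observations; columns: phi_1(u), ..., phi_D(u) of the sine-cosine basis."""
    u = np.asarray(u, dtype=float)
    cols = [np.ones_like(u)]
    for j in range(2, D + 1):
        k = j // 2
        cols.append(np.sqrt(2.0) * (np.cos if j % 2 == 0 else np.sin)(2 * np.pi * k * u))
    return np.column_stack(cols)

def ecdf(sample):
    """Empirical c.d.f. of the reference sample, i.e. F0_hat."""
    s = np.sort(np.asarray(sample, dtype=float))
    return lambda x: np.searchsorted(s, x, side="right") / s.size

def r_hat(x, X, X0, m):
    """Projection estimator r_hat_m of the relative density, evaluated at points x."""
    D = 2 * m + 1
    U = ecdf(X0)(np.asarray(X, dtype=float))       # plug-in grade transformation
    a_hat = phi_matrix(U, D).mean(axis=0)          # a_hat_j = (1/n) sum_i phi_j(F0_hat(X_i))
    return phi_matrix(np.asarray(x, dtype=float), D) @ a_hat

# Toy use: X and X0 drawn from the same distribution, so r is constant equal to 1
rng = np.random.default_rng(1)
X0, X = rng.exponential(5.0, size=400), rng.exponential(5.0, size=400)
print(np.round(r_hat(np.linspace(0.05, 0.95, 5), X, X0, m=2), 2))
```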

Remark 1

Comparison with other estimation methods.

  1. The estimator \(\hat{r}_m\) defined in (3) can also be seen as a minimum-contrast estimate: \(\hat{r}_m=\text{ arg }\inf _{t\in S_m}\gamma _n(t,\hat{F}_{0}),\;\;m\in \fancyscript{M}\), with

    $$\begin{aligned} \gamma _n(t,\hat{F}_{0})=\Vert t\Vert ^2-\frac{2}{n}\sum _{i=1}^{n}t\circ \hat{F}_{0}(X_i). \end{aligned}$$
  2. It is worthwhile to draw a parallel between the projection method and the kernel estimators of Ćwik and Mielniczuk (1993) and Molanes and Cao (2008a). Thanks to the properties of the sine–cosine basis,

    $$\begin{aligned} \hat{r}_m(x)=\frac{1}{n}\sum _{i=1}^n\left( 1+2\sum _{j=1}^{(D_m-1)/2}\cos \left( 2\pi j\left( \hat{F}_0(X_i)-x\right) \right) \right) . \end{aligned}$$

    Heuristically, by setting \((D_m-1)/2=\lfloor 1/(2\pi h)\rfloor -1\), \(h>0\), and approximating the sum over \(j\) by an integral, the previous expression shows that \(\hat{r}_m\) can be seen as an approximation of

    $$\begin{aligned}\tilde{r}_h(x)&= \frac{2}{n}\sum _{i=1}^n\int _0^{1/(2\pi h)}\cos \left( 2\pi u\left( \hat{F}_0(X_i)-x\right) \right) du,\\&= \frac{1}{2\pi n}\sum _{i=1}^n\int _{-1/h}^{1/h}\cos \left( u\left( \hat{F}_0(X_i)-x\right) \right) du,\\&= \frac{1}{2\pi n}\sum _{i=1}^n\int _{-1/h}^{1/h}\exp \left( -i u\left( x-\hat{F}_0(X_i)\right) \right) du\\&= \frac{1}{n}\sum _{i=1}^n\frac{1}{h}K\left( \frac{x-\hat{F}_0(X_i)}{h}\right) , \end{aligned}$$

    with \(K\) the sinc (sinus cardinalis) kernel defined by its Fourier transform: \(\fancyscript{F}(K)(x)=1\) if \(x\in (-1;1)\); \(\fancyscript{F}(K)(x)=0\) otherwise. Our strategy thus seems to be close to the kernel estimators of Ćwik and Mielniczuk (1993) and Molanes and Cao (2008a). But contrary to them, the projection method makes it possible to obtain an unbiased estimate when the target function belongs to one of the approximation spaces. In the relative density estimation setting, this occurs if the two variables \(X\) and \(X_0\) have the same distribution and if the constant functions are included in one of the models, which is the case here.
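The algebra behind the closed cosine-sum form above can be checked numerically; the following small sketch (added for illustration, with arbitrary simulated grades) verifies that the coefficient form of \(\hat{r}_m\) and the cosine-sum form coincide:

```python
import numpy as np

rng = np.random.default_rng(2)
U = rng.uniform(size=200)             # stands for the grades F0_hat(X_i)
x = np.linspace(0.0, 1.0, 101)
m = 4                                  # so that D_m = 2 m + 1 = 9

# Coefficient form: sum_j a_hat_j phi_j(x)
r1 = np.full_like(x, 1.0)              # a_hat_1 * phi_1(x) = 1
for k in range(1, m + 1):
    r1 += 2 * (np.mean(np.cos(2 * np.pi * k * U)) * np.cos(2 * np.pi * k * x)
               + np.mean(np.sin(2 * np.pi * k * U)) * np.sin(2 * np.pi * k * x))

# Cosine-sum form: (1/n) sum_i [1 + 2 sum_{k=1}^{m} cos(2 pi k (U_i - x))]
freqs = np.arange(1, m + 1)[:, None, None]
r2 = np.mean(1 + 2 * np.sum(np.cos(2 * np.pi * freqs * (U[None, :, None] - x[None, None, :])),
                            axis=0), axis=0)
print(np.allclose(r1, r2))             # True
```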

2.3 Risk of a projection estimator

The global squared error is the natural criterion associated with the projection estimation procedure. First, consider the toy case of a known c.d.f. \(F_0\). Pythagoras' theorem then leads to the classical bias-variance decomposition:

$$\begin{aligned} \left\| r-\hat{r}_m\right\| ^2=\left\| r-r_m\right\| ^2+\left\| \hat{r}_m-r_m\right\| ^2. \end{aligned}$$
(4)

Moreover, the variance term can easily be bounded, still with known \(F_0\), using the properties of the trigonometric basis:

$$\begin{aligned} \mathbb {E}\left[ \left\| \hat{r}_m-r_m\right\| ^2\right] =\sum _{j=1}^{D_m}\text{ Var }\left( \hat{a}_j\right) \le \frac{1}{n}\sum _{j=1}^{D_m}\mathbb {E}\left[ \varphi _j^2(F_0(X_1))\right] =\frac{D_m}{n}. \end{aligned}$$
(5)

The challenge in the general case comes from the plug-in of the empirical \(\hat{F}_{0}\). It seems natural but involves non-straightforward computations. This is why the proof of the following upper bound for the risk is postponed to Sect. 6.

Proposition 1

Assume that the relative density \(r\) is continuously differentiable on \((0;1)\). Assume also that \(D_m\le \kappa n_0^{1/3}\), for a constant \(\kappa >0\). Then, there exist two positive constants \(c_1\) and \(c_2,\) such that

$$\begin{aligned} \mathbb {E}\left[ \left\| \hat{r}_m-r\right\| ^2\right]&\le 3\left\| r-r_m\right\| ^2+\left( 3\frac{D_m}{n}+c_1\Vert r\Vert ^2\frac{D_m}{n_0}\right) +c_2\left( \frac{1}{n}+\frac{1}{n_0}\right) . \end{aligned}$$
(6)

The constants \(c_1\) and \(c_2\) do not depend on \(n\), \(n_0\) and \(m\). Moreover, \(c_1\) also does not depend on \(r\).

The assumption on the model dimension \(D_m\) comes from the control of the deviations between \(\hat{F}_{0}\) and \(F_0\). Proposition 1 shows that the risk is divided into three terms: a squared-bias term, a variance term [proportional to \(D_m(n^{-1}+n_0^{-1})\)] and a remainder [proportional to \((n^{-1}+n_0^{-1})\)]. The upper bound (6) is nontrivial and the proof requires tricky approximations (see Sect. 6.2).

2.4 Rates of convergence on Besov balls

The result (6) also gives the asymptotic rate for an estimator if we assume that \(r\) has smoothness \(\alpha >0\). Indeed, in this case, it is possible to evaluate the approximation error \(\Vert r-r_m\Vert \). A space of functions with smoothness \(\alpha \) which has good approximation properties is the Besov space \(\fancyscript{B}_{2,\infty }^{\alpha }\), where the index \(2\) refers to the \(L^2\) norm. This space generalizes the Sobolev space and is known to be well adapted to nonparametric estimation (Kerkyacharian and Picard 1993). More precisely, we assume that the relative density \(r\) belongs to a Besov ball \(B_{2,\infty }^{\alpha }((0;1),L)\) of radius \(L\), for the Besov norm \(\Vert .\Vert _{\alpha ,2}\) on the Besov space \(\fancyscript{B}_{2,\infty }^{\alpha }((0;1))\). A precise definition is recalled in Section 1 of the supplementary material (see also DeVore and Lorentz 1993). The following rate is then obtained.

Corollary 1

Assume that the relative density \(r\) belongs to the Besov ball \(B_{2,\infty }^{\alpha }((0;1),L)\), for \(L>0\), and \(\alpha \ge 1\). Choose a model \(m_{n,n_0}\) such that \(D_{m_{n,n_0}}=C(n^{-1}+n_0^{-1})^{-1/(2\alpha +1)}\), for \(C>0\). Then, under the assumptions of Proposition 1, there exists a numerical constant \(C'\) such that

$$\begin{aligned} \mathbb {E}\left[ \left\| \hat{r}_{m_{n,n_0}}-r\right\| ^2\right] \le C'\left( \frac{1}{n}+\frac{1}{n_0}\right) ^{\frac{2\alpha }{2\alpha +1}}. \end{aligned}$$

This inequality is a straightforward consequence of a result of DeVore and Lorentz (1993) and of Lemma 12 of Barron et al. (1999), which imply that the bias term \(\Vert r-r_m\Vert ^2\) is of order \(D_m^{-2\alpha }\). The minimum of the right-hand side of (6) can thus be computed, leading to Corollary 1. It is worth noticing that the rate depends on the two sample sizes \(n\) and \(n_0\); heuristically, it is \((\min (n,n_0))^{-2\alpha /(2\alpha +1)}\). The rate we obtain is new in nonparametric estimation, but it is not surprising. Indeed, it recalls the convergence result for the Kolmogorov–Smirnov two-sample test: it is well known that the test statistic converges at the rate \(\sqrt{nn_0/(n+n_0)}\) (see for example Doob 1949). More recently, similar rates have been obtained in adaptive minimax testing (see, e.g., Butucea and Tribouley 2006).
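For the reader's convenience, the bias-variance trade-off behind Corollary 1 can be made explicit (this elementary computation is implicit in the discussion above): since \(\Vert r-r_m\Vert ^2\) is of order \(D_m^{-2\alpha }\), the right-hand side of (6) is, up to constants and the remainder term, of order

$$\begin{aligned} D_m^{-2\alpha }+D_m\left( \frac{1}{n}+\frac{1}{n_0}\right) , \end{aligned}$$

which is minimal for \(D_m\) of order \((n^{-1}+n_0^{-1})^{-1/(2\alpha +1)}\), and then of order \((n^{-1}+n_0^{-1})^{2\alpha /(2\alpha +1)}\), the rate stated in Corollary 1.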

Remark 2

The regularity condition \(\alpha \ge 1\) ensures that there exists a dimension \(D_{m_{n,n_0}}\) which satisfies \(D_m\le Cn_0^{1/3}\) while being of order \((n^{-1}+n_0^{-1})^{-1/(2\alpha +1)}\). When \(\alpha <1\), this choice remains possible and the convergence rate is preserved under the additional assumption \(n\le n_0/(n_0^{(2-2\alpha )/3}-1)\). Roughly, this condition means that \(n\le n_0^{(2\alpha +1)/3}<n_0\): the sample sizes \(n\) and \(n_0\) must thus be ordered in this way to handle this case.

It follows from Corollary 1 that the optimal dimension depends on the unknown regularity \(\alpha \) of the function to be estimated. The aim is to perform an adaptive selection only based on the data.

3 Adaptive optimal estimation

3.1 Model selection

Consider the collection \((S_m)_{m\in \fancyscript{M}}\) of models defined in Sect. 2.1 and the collection \((\hat{r}_m)_{m\in \fancyscript{M}}\) of estimators defined by (3). The aim is to propose a data-driven choice of \(m\) leading to an estimator with risk near the squared-bias/variance compromise [see (6)]. The selection combines two strategies: the model selection device performed with a penalization of the contrast (see, e.g., Barron et al. 1999) and the recent Goldenshluger–Lepski method (Goldenshluger and Lepski 2011). A similar device has already been used in Comte and Johannes (2012), Bertin et al. (2013) and Chagny (2013). We set, for every index \(m\),

$$\begin{aligned} \begin{array}{l} \displaystyle V(m)=c_0\left( \frac{D_{m}}{n}+\Vert r\Vert ^2\frac{D_{m}}{n_0}\right) ,\\ \displaystyle A(m)=\max \limits _{m'\in \fancyscript{M}}\left( \left\| \hat{r}_{m'}-\hat{r}_{m\wedge m'}\right\| ^2-V(m')\right) _+, \end{array} \end{aligned}$$
(7)

where \(m\wedge m'\) is the minimum between \(m\) and \(m'\), \((x)_+\) the maximum between \(x\) and \(0\) (for a real number \(x\)), and \(c_0\) a tuning parameter. The quantity \(V\) must be understood as a penalty term and \(A\) is an approximation of the squared-bias term (see Lemma 10). The estimator of \(r\) is now given by \(\hat{r}_{\hat{m}}\), with

$$\begin{aligned} \hat{m}=\displaystyle \text{ argmin }_{m\in \fancyscript{M}}\{A(m)+V(m)\}. \end{aligned}$$

By construction, the choice of the index \(m\) and hence the estimator \(\hat{r}_{\hat{m}}\) does not depend on the regularity assumption on the relative density \(r\).
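A compact Python sketch of this selection rule may help; it is an illustration added here, not the authors' implementation. It takes as input precomputed estimators on a regular grid of \((0;1)\), assumes \(\Vert r\Vert ^2\) given (as in the theoretical penalty (7); Sect. 3.2 replaces it with a plug-in estimate), and the index set, tuning constant and grid size in the toy call are arbitrary choices:

```python
import numpy as np

def select_m(r_hat_on_grid, n, n0, c0=1.0, r_norm2=1.0):
    """r_hat_on_grid: dict {m: values of r_hat_m on a regular grid of (0, 1)}."""
    models = sorted(r_hat_on_grid)
    K = len(next(iter(r_hat_on_grid.values())))

    def V(m):                                  # penalty term of (7)
        D = 2 * m + 1
        return c0 * (D / n + r_norm2 * D / n0)

    def A(m):                                  # bias proxy of (7)
        diffs = []
        for mp in models:
            # squared L2 distance approximated by a Riemann sum over the grid
            d2 = np.sum((r_hat_on_grid[mp] - r_hat_on_grid[min(m, mp)]) ** 2) / K
            diffs.append(max(d2 - V(mp), 0.0))
        return max(diffs)

    crit = {m: A(m) + V(m) for m in models}
    return min(crit, key=crit.get)

# Toy call with artificial estimator values on a grid of size K = 50
rng = np.random.default_rng(3)
fake = {m: 1.0 + rng.normal(scale=(2 * m + 1) / 40.0, size=50) for m in range(0, 8)}
print(select_m(fake, n=200, n0=200))
```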

3.2 Optimality in the oracle sense

A non-asymptotic upper bound is derived for the risk of the estimator \(\hat{r}_{\hat{m}}\).

Theorem 2

Assume that the relative density \(r\) is continuously differentiable on \((0;1)\). Assume also that \(D_m\le \kappa n_0^{1/3}/\ln ^{2/3}(n_0)\), for a constant \(\kappa >0\). Then, there exist two positive constants \(c\) and \(C\) such that

$$\begin{aligned} \mathbb {E}\left[ \left\| \hat{r}_{\hat{m}}-r\right\| ^2\right]&\le c\min _{m\in \fancyscript{M}}\left\{ \left( \frac{D_m}{n}+\Vert r\Vert ^2\frac{D_m}{n_0}\right) +\left\| r_m-r\right\| ^2\right\} +C\left( \frac{1}{n}+\frac{1}{n_0}\right) .\nonumber \\ \end{aligned}$$
(8)

The constant \(c\) is purely numerical, while \(C\) depends on \(r\), but neither on \(n\) nor \(n_0\).

Theorem 2 establishes the optimality of the selection rule in the oracle sense. For every index \(m\in \fancyscript{M}\), \(\{(D_m/n+\Vert r\Vert ^2D_m/n_0)+\Vert r_m-r\Vert ^2\}\) has the same order as \(\mathbb {E}\left[ \Vert \hat{r}_m-r\Vert ^2\right] \) (see Proposition 1). Thus, Inequality (8) indicates that, up to a multiplicative constant, the estimator \(\hat{r}_{\hat{m}}\) converges as fast as the best estimator in the collection. The proof of such a result is based on the following scheme: we first come down to the case of a known c.d.f. \(F_0\), by using deviation results for the empirical c.d.f. Then, we use concentration results for empirical processes to prove that \(A(m)\), defined in (7), is a good estimate of the bias term.

The following corollary states the convergence rate of the risk over Besov balls. Since the regularity parameter defining the functional class is not supposed to be known to select the estimator \(\hat{r}_{\hat{m}}\), it is an adaptation result: the estimator adapts to the unknown regularity \(\alpha \) of the function \(r\).

Corollary 2

Assume that the relative density \(r\) belongs to \(B_{2,\infty }^{\alpha }((0;1),L)\), for \(L>0\), and \(\alpha \ge 1\). Under the assumptions of Theorem 2,

$$\begin{aligned} \mathbb {E}\left[ \left\| \hat{r}_{\hat{m}}-r\right\| ^2\right] \le C\left( \frac{1}{n}+\frac{1}{n_0}\right) ^{\frac{2\alpha }{2\alpha +1}}. \end{aligned}$$

It is worth noticing that the rate of convergence computed above (that is, the rate of the best estimator in the collection, see Corollary 1) is automatically achieved by the estimator \(\hat{r}_{\hat{m}}\). Corollary 2 is established under regularity assumptions on the target function \(r\) only. To the best of our knowledge, in previous works, convergence results for selected relative density estimators (among a family of kernel ones) depended on strong assumptions on \(r\) (e.g., \(r\in \fancyscript{C}^6((0;1))\)) and also on the regularity of \(f_0\).

The penalty term \(V\) given in (7) cannot be used in practice, since it depends on the unknown quantity \(\Vert r\Vert ^2\). A solution is to replace it by an estimator and to prove that the estimator of \(r\) built with this random penalty keeps the adaptation property. To that aim, set, for an index \({m^*}\in \fancyscript{M}\),

$$\begin{aligned} \begin{array}{l} \displaystyle \widetilde{V}(m)=c_0\left( \frac{D_{m}}{n}+4\Vert \hat{r}_{{m^*}}\Vert ^2\frac{D_{m}}{n_0}\right) ,\\ \displaystyle \widetilde{A}(m)=\max _{m'\in \fancyscript{M}}\left( \left\| \hat{r}_{m'}-\hat{r}_{m\wedge m'}\right\| ^2-\widetilde{V}(m')\right) _+, \end{array} \end{aligned}$$
(9)

and \(\tilde{m}=\text{ argmin }_{m\in \fancyscript{M}}\{\widetilde{A}(m)+\widetilde{V}(m)\}\). The result for \(\hat{r}_{\tilde{m}}\) is described in the following theorem.

Theorem 3

Assume that the assumptions of Theorem 2 are satisfied and that \(r\) belongs to \(B_{2,\infty }^{\alpha }((0;1),L)\), for \(L>0\), and \(\alpha \ge 1\). Choose \({m^*}\) in the definition of \(\widetilde{V}\) such that \(D_{{m^*}}\ge \ln (n_0)\) and \(D_{{m^*}}=O(n^{1/4}/\ln ^{1/4}(n))\). Then, for \(n_0\) large enough, there exist two positive constants \(c\) and \(C,\) such that

$$\begin{aligned} \mathbb {E}\left[ \left\| \hat{r}_{\tilde{m}}-r\right\| ^2\right]&\le c\min _{m\in \fancyscript{M}}\left\{ \left( \frac{D_m}{n}+\Vert r\Vert ^2\frac{D_m}{n_0}\right) +\left\| r_m-r\right\| ^2\right\} +C\left( \frac{1}{n}+\frac{1}{n_0}\right) . \end{aligned}$$

As for Theorem 2, the result proves that the selection rule leads to the best trade-off between a bias and a variance term. Our estimation procedure is thus optimal in the oracle sense. The convergence rates derived in Corollary 2 remain valid for \(\hat{r}_{\tilde{m}}\). Now, the only remaining parameter to tune is the constant \(c_0\) involved in the definition of \(\widetilde{V}\). A value is obtained in the proof, but it is quite rough and useless in practice. A sharp bound seems difficult to obtain from a theoretical point of view: obtaining minimal penalties is still a difficult problem (see, e.g., Birgé and Massart 2007), and this question could be the subject of a full paper. Therefore, we calibrate the constant through a simulation study over various models.

3.3 Optimality in the minimax sense

Until now, we have drawn conclusions about the performance of the selected estimators \(\hat{r}_{\hat{m}}\) and \(\hat{r}_{\tilde{m}}\) within the collection \((\hat{r}_{m})_{m\in \fancyscript{M}}\) of projection estimators. A natural question follows: is the convergence rate obtained in Corollary 2 optimal among all possible estimation strategies? We prove that the answer is yes, by establishing the following lower bound for the minimax risk of the relative density estimation problem, without requiring any additional assumption.

Theorem 4

Let \(\fancyscript{F}_\alpha \) be the set of relative density functions on \((0;1)\) which belong to the Besov ball \(B_{2,\infty }^{\alpha }((0;1),L)\), for a fixed radius \(L> 1\), and for \(\alpha \ge 1\). Then there exists a constant \(c>0\) which depends on \((\alpha ,L)\) such that

$$\begin{aligned} \inf _{\hat{r}_{n,n_0}}\sup _{r\in \fancyscript{F}_{\alpha }}\mathbb {E}\left[ \left\| \hat{r}_{n,n_0}-r\right\| ^2\right] \ge c \left( \frac{1}{n}+\frac{1}{n_0}\right) ^{2\alpha /(2\alpha +1)}, \end{aligned}$$
(10)

where the infimum is taken over all possible estimators \(\hat{r}_{n,n_0}\) obtained with the two data samples \((X_i)_{i\in \{1,\ldots ,n\}}\) and \((X_{0,i_0})_{i_0\in \{1,\ldots ,n_0\}}\).

The optimal convergence rate is thus \((n^{-1}+n_0^{-1})^{2\alpha /(2\alpha +1)}\). The upper bound of Corollary 2 and the lower bound (10) match, up to constants. This proves that our estimation procedure achieves the minimax rate and is thus also optimal in the minimax sense. The result is not straightforward: the proof requires specific constructions, since it must capture the influence of both sample sizes, \(n\) and \(n_0\). Although it is a lower bound for a particular kind of density, we think it cannot easily be deduced from the minimax rate of density estimation over Besov balls (see for example Kerkyacharian and Picard 1992), since the two samples do not play symmetric roles.

4 Simulation

In this section, we illustrate the performance of the adaptive estimator \(\hat{r}_{\tilde{m}}\) on simulated data. We have carried out an intensive simulation study (with the computing environment MATLAB) which shows that the results are comparable to those of Ćwik and Mielniczuk (1993) and Molanes and Cao (2008a). Here, we thus prefer to discuss two types of questions, in order to evaluate the specific robustness of our method. After describing the way we compute the estimator, we first focus on the quality of the estimation when the variable \(X\) is close (in distribution) to \(X_0\). Second, we investigate the roles of the two sample sizes, \(n\) and \(n_0\). For additional reconstructions, risk computations and details about calibration, the reader may refer to the supplementary material (Sect. 2).

4.1 Implementation

The implementation of the estimator is very simple and follows the steps below.

  • For each \(m\in \fancyscript{M}\), compute \((\hat{r}_{m}(x_k))_{k=1,\ldots ,K}\) defined by (3) for grid points \((x_k)_{k=1,\ldots ,K}\) evenly distributed across \((0;1)\), with \(K=50\).

  • For each \(m\in \fancyscript{M}\), compute \(\widetilde{V}(m)\) and \(\widetilde{A}(m)\), defined by (9).

    • For \(\widetilde{V}(m)\) we choose \(c_0=1\); the estimation results seem quite robust to slight changes of this value. This value has been obtained by a numerical calibration on various examples (see Section 2.2 of the supplementary material for more details). The index \(m^*\) of the estimator \(\hat{r}_{m^*}\) used in \(\widetilde{V}\) is the smallest integer greater than \(\ln (n_0)-1\).

    • For \(\widetilde{A}(m)\), we approximate the \(L^2\) norms by the corresponding Riemann sums computed over the grid points \((x_k)_k\):

      $$\begin{aligned} \left\| \hat{r}_{m'}-\hat{r}_{m\wedge m'}\right\| ^2 \approx \frac{1}{K}\sum _{k=1}^K\left( \hat{r}_{m'}(x_k)-\hat{r}_{m\wedge m'}(x_k)\right) ^2. \end{aligned}$$
  • Select the argmin \(\tilde{m}\) of \(\widetilde{A}(m)+\widetilde{V}(m)\), and choose \(\hat{r}_{\tilde{m}}\).

The risks \(\mathbb {E}[\Vert (\hat{r}_{\tilde{m}})_+-r\Vert ^2]\) are also computed: it is not difficult to see that taking the positive part of the estimator can only make its risk decrease. To compute the expectation, we average the integrated squared error (ISE) computed with \(N=500\) replications of the samples \((X_{0,i_0})_{i_0}\) and \((X_i)_i\). Notice that the grid size (\(K=50\)) and the number of replications (\(N=500\)) are the same as in Ćwik and Mielniczuk (1993).
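The steps above can be summarized in the following end-to-end Python sketch (the original study used MATLAB; the data-generating example of type (a2) below, the sample sizes, the number of replications and the pilot index \(m^*\), chosen here through the condition \(D_{m^*}\ge \ln (n_0)\) of Theorem 3, are illustrative choices and not the authors' exact settings):

```python
import numpy as np

def phi_matrix(u, D):
    """Columns phi_1(u), ..., phi_D(u) of the sine-cosine basis, rows indexed by u."""
    cols = [np.ones_like(u)]
    for j in range(2, D + 1):
        k = j // 2
        cols.append(np.sqrt(2.0) * (np.cos if j % 2 == 0 else np.sin)(2 * np.pi * k * u))
    return np.column_stack(cols)

def estimate_on_grid(X, X0, grid, models):
    """Projection estimators r_hat_m, m in models, evaluated on the grid points."""
    s0 = np.sort(X0)
    U = np.searchsorted(s0, X, side="right") / s0.size       # grades F0_hat(X_i)
    out = {}
    for m in models:
        D = 2 * m + 1
        a_hat = phi_matrix(U, D).mean(axis=0)                 # coefficients of (3)
        out[m] = phi_matrix(grid, D) @ a_hat
    return out

def select(estimates, grid, n, n0, c0=1.0):
    """Data-driven index, with the random penalty (9) and its plug-in norm estimate."""
    models = sorted(estimates)
    K = grid.size
    m_star = min((m for m in models if 2 * m + 1 >= np.log(n0)), default=max(models))
    r_norm2 = np.sum(estimates[m_star] ** 2) / K              # ||r_hat_{m*}||^2 (Riemann sum)
    V = {m: c0 * ((2 * m + 1) / n + 4 * r_norm2 * (2 * m + 1) / n0) for m in models}
    A = {m: max(max(np.sum((estimates[mp] - estimates[min(m, mp)]) ** 2) / K - V[mp], 0.0)
                for mp in models) for m in models}
    return min(models, key=lambda m: A[m] + V[m])

def sample_X(n, c, rng):
    """Example (a2): density c on (0, 1/2) and 2 - c on (1/2, 1)."""
    left = rng.random(n) < c / 2.0                            # P(X < 1/2) = c/2
    return np.where(left, 0.5 * rng.random(n), 0.5 + 0.5 * rng.random(n))

rng = np.random.default_rng(4)
n = n0 = 500
c = 1.3
grid = (np.arange(50) + 0.5) / 50                             # K = 50 points in (0, 1)
r_true = np.where(grid < 0.5, c, 2.0 - c)                     # true relative density for (a2)
models = range(0, int(min(n, n0) ** (1 / 3)))                 # dimensions D_m of order n0^(1/3)

ise = []
for _ in range(100):                                          # N replications (500 in the paper)
    X0, X = rng.random(n0), sample_X(n, c, rng)
    est = estimate_on_grid(X, X0, grid, models)
    r_sel = np.maximum(est[select(est, grid, n, n0)], 0.0)    # positive part only lowers the risk
    ise.append(np.sum((r_sel - r_true) ** 2) / grid.size)
print(round(float(np.mean(ise)), 4))                          # Monte Carlo MISE approximation
```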

4.2 Experiment 1: two samples with close distributions

The trigonometric basis is well suited to recovering relative densities. Indeed, the first function of the basis is \(\varphi _1: x\in (0;1)\mapsto 1\), and thus the first estimated coefficient \(\hat{a}_1\) in (3) always equals 1. But we know that the relative density is constant and equal to 1 over \((0;1)\) when \(X\) and \(X_0\) have the same distribution. Consequently, our procedure yields an exact estimate in this case, provided that the data-driven criterion leads to the choice of the first model in the collection. We hope to select \(D_{\hat{m}}=1\), that is \(\hat{m}=0\). In this section, we check that the estimation procedure actually handles this case easily.

First, we generate two samples \((X_{0,i_0})_{i_0=1,\ldots ,n_0}\) and \((X_i)_{i=1,\ldots ,n}\) coming from random variables \(X_0\) and \(X,\) respectively, with one of the following common probability distributions (Example (1) in the sequel): (a1) the uniform distribution on \((0;1)\), (b1) the beta distribution \(\fancyscript{B}(2,5)\), (c1) the Gaussian distribution with mean \(0\) and variance \(1\), (d1) the exponential distribution with mean \(5\). As explained, the estimator is expected to be constant and equal to 1: the selected index \(m\) must thus be \(0\). This is the case for most of the samples we simulate: for example, only \(1\%\) of the 500 estimators computed with 50 i.i.d. Gaussian pairs \((X,X_0)\) are not identically equal to 1. The medians of the ISE over 500 replicated samples are always equal to 0, whatever the distribution of \(X\) and \(X_0\) chosen among the examples (uniform, beta, Gaussian, or exponential). The MISE values are displayed in Table 1 for different possible sample sizes. We can also check that they are much smaller than the MISE obtained with two different distributions for \(X\) and \(X_0\) (see Table 2 in the supplementary material, Section 2.2).

Table 1 Values of MISE \(\times 10\) averaged over \(500\) samples for the estimator \(\hat{r}_{\tilde{m}}\), in Example (1) [(a1) to (d1)]

Then, we investigate what happens when \(X\) is close to \(X_0\) but slightly different, with samples simulated from the set of Example (2).

  (a2) The variable \(X_0\) is from the uniform distribution on \((0;1)\), and the variable \(X\) has the density \(f(x)=c\mathbf {1}_{(0;0.5)}(x)+(2-c)\mathbf {1}_{(0.5;1)}(x)\), with \(c\in \{1.01,1.05,1.1,1.3,1.5\}\) (the case \(c=1\) is the case of the uniform distribution on \((0;1)\)).

  (b2) The variable \(X_0\) is from the beta distribution \(\fancyscript{B}(2,5)\), and the variable \(X\) from a beta distribution \(\fancyscript{B}(a,5)\) with \(a\in \{2.01,2.05,2.1,2.3,2.5\}\). For this example, the risks are computed over a regular grid of the interval \([F_0(0.01);F_0(0.99)]\).

Figure 1 shows the true relative densities for these two examples.

Fig. 1 Plot of the different investigated relative densities of Examples (2), (a2) and (b2)

The MISEs for Examples (2) (a2) and (b2) are plotted in Fig. 2 with respect to the sample sizes \(n=n_0\). Details are also given in Table 1 of the supplementary material (Sect. 2.2). The larger \(c\) (resp. \(a\)), that is, the further \(X\) is from \(X_0\), the larger the MISE. The results are thus especially good when the two distributions are close.

Fig. 2 Values of MISE (averaged over \(500\) samples) for the estimator \(\hat{r}_{\tilde{m}}\) with respect to the sample sizes \(n=n_0\) in Examples (2) (a2) and (b2)

4.3 Experiment 2: influence of the two sample sizes

We now study the influence of the two sample sizes. Recall that the theoretical results we obtain do not require any link between \(n\) and \(n_0\); in the literature, by contrast, they are often assumed to be proportional. Moreover, we obtain a convergence rate in which \(n\) and \(n_0\) play symmetric roles (see Corollary 2). What happens in practice? To briefly discuss this question, let us consider observations \((X_i)_{i\in \{1,\ldots ,n\}}\) and \((X_{0,i_0})_{i_0\in \{1,\ldots ,n_0\}}\) fitting the following model [Example (3)]. The variable \(X_0\) is from the Weibull distribution with parameters (2,3) (we denote by \(W\) the corresponding c.d.f.) and \(X\) is built such that \(X=W^{-1}(S)\), with \(S\) a mixture of two beta distributions: \(\fancyscript{B}(14,37)\) with probability \(4/5\) and \(\fancyscript{B}(14,20)\) with probability \(1/5\). The example is borrowed from Molanes and Cao (2008a). Let us look at the beams of estimates \(\hat{r}_{\tilde{m}}\): in Fig. 3, ten estimators built from i.i.d. samples of data are plotted together with the true function. This illustrates that increasing \(n_0\) for fixed \(n\) seems to improve the risk more substantially than the other way round (the improvement when \(n_0\) increases appears horizontally in Fig. 3). Such a phenomenon also appears when a more quantitative criterion is considered: the MISE values in Table 2 are not symmetric with respect to \(n\) and \(n_0\), even if, as expected, they all get smaller when the sample sizes \(n\) and \(n_0\) increase. Although this may seem surprising in view of the theory, recall that the relative density of \(X\) with respect to \(X_0\) is not the same as the relative density of \(X_0\) with respect to \(X\). The role of the reference variable is thus, consistently, more important, even if this is not visible in the convergence rate of Corollary 2. The details of the computations in the proofs also show that \(n\) and \(n_0\) do not play similar roles (see Lemma 9). An explanation may be the following: in the method, the sample \((X_i)_{i\in \{1,\ldots ,n\}}\) is used in a nonparametric way, as in classical density estimation, while the other one, \((X_{0,i_0})_{i_0\in \{1,\ldots ,n_0\}},\) is only used through the empirical c.d.f., which is known to converge at a parametric rate, faster than the nonparametric one. Notice finally that the same conclusions are obtained for estimators computed from the sets of observations described in the supplementary material (see Table 2 of the supplementary material). In any case, such results might be used by a practitioner when the choice of the reference sample is not natural: a judicious way to decide which of the samples plays the role of \((X_{0,i_0})\) is to choose the larger one.

Fig. 3 Beams of ten estimators built from i.i.d. samples of various sizes \((n;n_0)\) (thin lines) versus the true function (thick line) in Example (3)

5 Concluding remarks

In this paper, we have proposed a new method for estimating the relative density. Our procedure has two main advantages compared with the kernel methods of Ćwik and Mielniczuk (1993) and Molanes and Cao (2008a). First, we obtain an unbiased estimator that exactly recovers the target when the two variables \(X\) and \(X_0\) have the same distribution. Secondly, it permits obtaining precise theoretical results on the minimax rate of convergence for relative density estimation. For a function with smoothness index \(\alpha \), and sample sizes \(n\) and \(n_0\), the minimax rate is proved to be \((n^{-1}+n_0^{-1})^{2\alpha /(2\alpha +1)}\). Although we do not assume this smoothness index to be known, our estimator achieves the minimax rate as soon as \(\alpha \ge 1\). Finally, an outstanding issue is the theoretical understanding of the asymmetry in the roles of \(n\) and \(n_0\), which is noticeable in the simulations.

Table 2 Values of MISE \(\times 10\) averaged over \(500\) samples for the estimator \(\hat{r}_{\tilde{m}}\), in Example (3)

6 Proofs

Detailed proofs of Proposition 1 and Theorem 2 are gathered in this section. The proofs of Theorems 3 and 4 are only sketched. Complete proofs are available in Section 3 of the supplementary material.

6.1 Preliminary notations and results

6.1.1 Notations

We need additional notations in this section. First, we specify the definition of the procedure. The estimators \(\hat{r}_m\), \(m\in \fancyscript{M}\), defined by (3), are now denoted by \(\hat{r}_m(.,\hat{F}_{0})\), and their coefficients in the Fourier basis by \(\hat{a}_j^{\hat{F}_0}\). When we plug \(F_0\) into (3), we denote the estimator by \(\hat{r}_m(.,F_0)\) and its coefficients by \(\hat{a}_j^{F_0}\). Then, we set \(U_{0,i_0}=F_0(X_{0,i_0})\) (\(i_0=1,\ldots ,n_0\)) and let \(\widehat{U}_{0}\) be the empirical c.d.f. associated with the sample \((U_{0,i_0})_{i_0=1,\ldots ,n_0}\). We also denote by \(\mathbb {E}[.|(X_{0})]\) the conditional expectation given the sample \((X_{0,i_0})_{i_0=1,\ldots ,n_0}\) (the conditional variance is coherently denoted by \(\text{ Var }(.|(X_{0}))\)).

Finally, for any measurable function \(t\) defined on \((0;1)\), we denote by \(\Vert t\Vert _{\infty }\) the quantity \(\sup _{x\in (0;1)}|t(x)|\), and \(id\) is the identity function \(u\mapsto u\) on the interval \((0;1)\).

6.1.2 Useful tools

Key arguments for the proofs are the deviation properties of the empirical c.d.f. \(\hat{F}_{0}\) of the sample \((X_{0,i_0})_{i_0}\).

First, recall that \(U_{0,i_0}\) is a uniform variable on \((0;1)\) and that \(\hat{F}_{0}(F_0^{-1}(u))=\widehat{U}_{0}(u)\), for all \(u\in (0;1)\). Keep in mind that the random variable \(\sup _{x\in A_0}|\hat{F}_{0}(x)-F_0(x)|\) has the same distribution as \(\Vert \widehat{U}_{0}-id\Vert _{\infty }\). The following inequalities are used several times to control the deviations of the empirical c.d.f. \(\widehat{U}_{0}\). The first one was established by Dvoretzky et al. (1956).

Proposition 5

(Dvoretzky–Kiefer–Wolfowitz’s Inequality) There exists a constant \(C>0\), such that, for any integer \(n_0\ge 1\) and any \(\lambda >0\),

$$\begin{aligned} \displaystyle \mathbb {P}\left( \left\| \widehat{U}_{0}-id\right\| _{\infty }\ge \lambda \right) \le C\exp \left( -2n_0\lambda ^2\right) . \end{aligned}$$

By integration, we then deduce a first additional bound:

Proposition 6

For any integer \(p>0\), there exists a constant \(C_p>0\) such that

$$\begin{aligned} \displaystyle \mathbb {E}\left[ \left\| \widehat{U}_{0}-id\right\| _{\infty }^p\right] \le \frac{C_p}{n_0^{p/2}}. \end{aligned}$$
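For instance, a possible argument, consistent with the statement above, integrates the tail bound of Proposition 5:

$$\begin{aligned} \mathbb {E}\left[ \left\| \widehat{U}_{0}-id\right\| _{\infty }^p\right] =\int _0^{\infty }\mathbb {P}\left( \left\| \widehat{U}_{0}-id\right\| _{\infty }\ge t^{1/p}\right) dt\le C\int _0^{\infty }\exp \left( -2n_0t^{2/p}\right) dt=\frac{C\,\Gamma (p/2+1)}{(2n_0)^{p/2}}, \end{aligned}$$

so that one can take \(C_p=C\,\Gamma (p/2+1)/2^{p/2}\).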

More precise bounds are also required:

Corollary 3

For any \(\kappa >0\), for any integer \(p\ge 2\), there exists also a constant \(C\) such that

$$\begin{aligned} \mathbb {E}\left[ \left( \left\| \widehat{U}_{0}-id\right\| _{\infty }^p-\kappa \frac{\ln ^{p/2}(n_0)}{n_0^{p/2}}\right) _+\right] \le Cn_0^{-2^{\frac{2-p}{p}}\kappa ^{2/p}}. \end{aligned}$$
(11)

6.1.3 The Talagrand inequality

The proofs of the main results (Theorems 2 and 3) are based on the use of concentration inequalities. The first one is the classical Bernstein Inequality, and the second one is the following version of the Talagrand Inequality.

Proposition 7

Let \(\xi _1,\ldots ,\xi _n\) be i.i.d. random variables and define \(\nu _n(s)=\frac{1}{n}\sum _{i=1}^ns(\xi _i)-\mathbb {E}[s(\xi _i)]\), for \(s\) belonging to a countable class \(\fancyscript{S}\) of real-valued measurable functions. Then, for \(\delta >0\), there exist three constants \(c_l\), \(l=1,2,3\), such that

$$\begin{aligned} \displaystyle \mathbb {E}\left[ \left( \sup _{s\in \fancyscript{S}}\left( \nu _n\left( s\right) \right) ^2-c(\delta )H^2\right) _{+}\right]&\le c_1\left\{ \frac{v}{n}\exp \left( -c_2\delta \frac{nH^2}{v}\right) \right. \nonumber \\&\displaystyle \left. +\frac{M_1^2}{C^2(\delta )n^2}\exp \left( -c_3C(\delta )\sqrt{\delta }\frac{nH}{M_1}\right) \right\} ,\qquad \quad \end{aligned}$$
(12)

with \(C(\delta )=(\sqrt{1+\delta }-1)\wedge 1\), \(c(\delta )=2(1+2\delta )\) and

$$\begin{aligned} \displaystyle \sup _{s\in \fancyscript{S}} \Vert s\Vert _{\infty } \le M_1\text{, } \mathbb {E}\left[ \sup _{s\in \fancyscript{S}}\left| \nu _n(s)\right| \right] \le H,\quad \text{ and } \quad \sup _{s\in \fancyscript{S}}\text{ Var }\left( s\left( \xi _1\right) \right) \le v. \end{aligned}$$

Inequality (12) is a classical consequence of Talagrand’s Inequality given in Klein and Rio (2005): see for example Lemma \(5\) (page 812) in Lacour (2008). Using density arguments, we can apply it to the unit sphere of a finite dimensional linear space.

6.2 Proof of Proposition 1

A key point is the following decomposition which holds for any index \(m\)

$$\begin{aligned} \left\| \hat{r}_m(.,\hat{F}_{0})-r\right\| ^2\le 3T_1^m+3T_2^m+3\left\| \hat{r}_m(.,F_0)-r\right\| ^2, \end{aligned}$$

with

$$\begin{aligned} \begin{array}{l} T_1^m=\left\| \hat{r}_m(.,\hat{F}_{0})-\hat{r}_m(.,F_0)-\mathbb {E}\left[ \hat{r}_m(.,\hat{F}_{0})-\hat{r}_m(.,F_0)\left| (X_0)\right. \right] \right\| ^2,\\ T_2^m=\left\| \mathbb {E}\left[ \hat{r}_m(.,\hat{F}_{0})-\hat{r}_m(.,F_0)\left| (X_0)\right. \right] \right\| ^2.\\ \end{array} \end{aligned}$$
(13)

We have already proved [see (4) and (5)] that \(\mathbb {E}\left[ \left\| \hat{r}_m(.,F_0)-r\right\| ^2\right] \le D_m/n+\Vert r_m-r\Vert ^2.\) It thus remains to apply the two following lemmas, proved in the next two sections.

Lemma 8

Under the assumptions of Proposition 1,

$$\begin{aligned} \mathbb {E}\left[ T_1^m\right] \le 2\pi ^2\frac{D_m^3}{nn_0}. \end{aligned}$$

Lemma 9

Under the assumptions of Proposition 1,

$$\begin{aligned} \mathbb {E}\left[ T_2^m\right] \le 3\Vert r\Vert ^2\frac{D_m}{n_0}+ 3\frac{\pi ^4}{4}C_4\Vert r\Vert ^2\frac{D_m^4}{n_0^2}+\frac{32\pi ^6C_6}{3}\Vert r\Vert ^2\frac{D_m^7}{n_0^3}+3\frac{\Vert r'\Vert ^2}{n_0}. \end{aligned}$$

The result follows if \(D_m\le \kappa n_0^{1/3}\).

6.2.1 Proof of Lemma 8

The decomposition of the estimator in the orthogonal basis \((\varphi _j)_j\) yields

$$\begin{aligned} T_{1}^{m}&= \sum _{j=1}^{D_m}\left( \hat{a}^{\hat{F}_{0}}_j-\hat{a}_j^{F_0}-\mathbb {E}\left[ \hat{a}^{\hat{F}_{0}}_j-\hat{a}_j^{F_0}\left| (X_0)\right. \right] \right) ^2 \end{aligned}$$

and, therefore, \(\mathbb {E}[T_1^m|(X_0)]=\sum _{j=1}^{D_m}\text{ Var }(\hat{a}^{\hat{F}_{0}}_j-\hat{a}_j^{F_0}|(X_0)).\) Moreover, for any index \(j\),

$$\begin{aligned} \text{ Var }\left( \hat{a}^{\hat{F}_{0}}_j-\hat{a}_j^{F_0}\left| (X_0)\right. \right)&\le \frac{1}{n}\mathbb {E}\left[ \left( \varphi _j\circ \hat{F}_{0}(X_1)-\varphi _j\circ F_0(X_1)\right) ^2\left| (X_0)\right. \right] ,\\&\le \frac{1}{n}\left\| \varphi _j'\right\| _{\infty }^2 \int _A\left( \hat{F}_{0}(x)-F_0(x)\right) ^2f(x)dx, \end{aligned}$$

using the mean-value theorem. Since \(\Vert \varphi _j'\Vert _{\infty }^2\!\le \! 8\pi ^2D_m^2\) in the Fourier basis, this leads to

$$\begin{aligned} \mathbb {E}\left[ T_1^m\right] \le \frac{8\pi ^2}{n}D_m^3\int _A\mathbb {E}\left[ \left( \hat{F}_{0}(x)-F_0(x)\right) ^2\right] f(x)dx. \end{aligned}$$

Notice finally that \(\mathbb {E}[(\hat{F}_{0}(x)-F_0(x))^2]=\text{ Var }(\hat{F}_{0}(x))=(F_0(x)(1-F_0(x)))/n_0\le 1/(4n_0)\). This concludes the proof of Lemma 8. \(\square \)

6.2.2 Proof of Lemma 9

Arguing as in the beginning of the proof of Lemma 8 yields

$$\begin{aligned} T_2^m&= \sum _{j=1}^{D_m}\left( \int _A\left( \varphi _j\circ \hat{F}_{0}(x)-\varphi _j\circ F_0(x)\right) f(x)dx\right) ^2. \end{aligned}$$
(14)

We apply the Taylor formula to the function \(\varphi _j\), with the Lagrange form for the remainder. There exists a random number \(\hat{\alpha }_{j,n_0,x}\) such that the following decomposition holds: \(T_2^m\le 3T_{2,1}^m+3T_{2,2}^m+3T_{2,3}^m\), where

$$\begin{aligned} \begin{array}{l} \displaystyle T_{2,1}^m=\sum _{j=1}^{D_m}\left( \int _A\varphi _j'(F_0(x))\left( \hat{F}_{0}(x)- F_0(x)\right) f(x)dx\right) ^2,\\ \displaystyle T_{2,2}^m=\sum _{j=1}^{D_m}\left( \int _A\varphi _j''(F_0(x))\frac{\left( \hat{F}_{0}(x)- F_0(x)\right) ^2}{2}f(x)dx\right) ^2,\\ \displaystyle T_{2,3}^m=\sum _{j=1}^{D_m}\left( \int _A\varphi _j^{(3)}(\hat{\alpha }_{j,n_0,x})\frac{\left( \hat{F}_{0}(x)- F_0(x)\right) ^3}{6}f(x)dx\right) ^2. \end{array} \end{aligned}$$

We now bound each of these three terms. Let us begin with \(T_{2,1}^m\). The change of variables \(u=F_0(x)\) first gives

$$\begin{aligned} T_{2,1}^m=\sum _{j=1}^{D_m}\left( \int _{(0;1)}\varphi _j'(u)\left( \widehat{U}_{0}(u)- u\right) r(u)du\right) ^2, \end{aligned}$$

and, with the definition of \(\widehat{U}_{0}(u)\), we get

$$\begin{aligned} \displaystyle T_{2,1}^m=\sum _{j=1}^{D_m}\left( \frac{1}{n_0}\sum _{i=1}^{n_0}B_{i,j}-\mathbb {E}[B_{i,j}]\right) ^2,\;\;\; \text{ with } B_{i,j}=\int _{U_{0,i}}^{1}r(u)\varphi _j'(u)du. \end{aligned}$$

An integration by parts for \(B_{i,j}\) leads to another splitting \( T_{2,1}^m\le 2T_{2,1,1}^m+2T_{2,1,2}^m,\) with notations

$$\begin{aligned} \begin{array}{l} \displaystyle T_{2,1,1}^m=\sum _{j=1}^{D_m}\left\{ \frac{1}{n_0}\sum _{i=1}^{n_0}r(U_{0,i})\varphi _j(U_{0,i})-\mathbb {E}\left[ r(U_{0,i})\varphi _j(U_{0,i})\right] \right\} ^2,\\ \displaystyle T_{2,1,2}^m=\sum _{j=1}^{D_m}\left\{ \int _{(0;1)}r'(u)\left( \widehat{U}_{0}(u)-u\right) \varphi _j(u)du\right\} ^2. \end{array} \end{aligned}$$

The expectation of the first term is a variance and is bounded as follows:

$$\begin{aligned} \displaystyle \mathbb {E}\left[ T_{2,1,1}^m\right]&\le \frac{1}{n_0}\sum _{j=1}^{D_m}\mathbb {E}\left[ \left( r(U_{0,1})\varphi _j(U_{0,1})\right) ^2\right] \le \int _0^1r(u)^2du\frac{D_m}{n_0}. \end{aligned}$$

For \(T_{2,1,2}^m\), we use the definitions and properties of the orthogonal projection operator \(\Pi _{S_m}\) on the space \(S_m\):

$$\begin{aligned} \displaystyle T_{2,1,2}^m&= \sum _{j=1}^{D_m}\left( \langle r'(\widehat{U}_{0}-id),\varphi _j\rangle _{(0;1)}\right) ^2=\left\| \Pi _{S_m}(r'(\widehat{U}_{0}-id))\right\| ^2,\\&\le \left\| r'(\widehat{U}_{0}-id)\right\| ^2\le \Vert r'\Vert ^2\Vert \widehat{U}_{0}-id\Vert _{\infty }^2. \end{aligned}$$

Applying Proposition 6 proves that \(\mathbb {E}[T_{2,1,2}^m]\le C_2\Vert r'\Vert ^2/n_0\). Therefore,

$$\begin{aligned} \mathbb {E}\left[ T_{2,1}^m\right] \le \Vert r\Vert ^2\frac{D_m}{n_0} +C_2\Vert r'\Vert ^2\frac{1}{n_0}. \end{aligned}$$
(15)

Consider now \(T_{2,2}^m\). The trigonometric basis satisfies \(\varphi _j''=-(\pi \mu _j)^2\varphi _j\), with \(\mu _j=j\) for even \(j\ge 2\), and \(\mu _{j}=j-1\) for odd \(j\ge 2\). We thus have

$$\begin{aligned} \mathbb {E}\left[ T_{2,2}^m\right]&= (\pi ^4/4)\mathbb {E}\left[ \sum _{j=1}^{D_m}\left\{ \int _{(0;1)} r(u)\left( \widehat{U}_{0}(u)-u\right) ^2\mu _j^2\varphi _j(u)du\right\} ^2\right] ,\\&\le (\pi ^4/4)D_{m}^4\mathbb {E}\left[ \sum _{j=1}^{D_{m}}\left\{ \langle r\left( \widehat{U}_{0}-id\right) ^2,\varphi _j\rangle _{(0;1)}\right\} ^2\right] ,\\&\le (\pi ^4/4)D_{m}^4\mathbb {E}\left[ \left\| r\left( \widehat{U}_{0}-id\right) ^2\right\| ^2\right] \\&\le (\pi ^4/4)D_{m}^4\mathbb {E}\left[ \left\| \widehat{U}_{0}-id\right\| _{\infty }^4\right] \int _{(0;1)}r^2(u)du. \end{aligned}$$

Thanks to Proposition 6, we obtain

$$\begin{aligned} \mathbb {E}\left[ T_{2,2}^m\right] \le C_4(\pi ^4/4)\Vert r\Vert ^2\frac{D_{m}^4}{n_0^2}. \end{aligned}$$
(16)

The last term is then easily controlled, using also Proposition 6:

$$\begin{aligned} \mathbb {E}\left[ T_{2,3}^m\right]&\le \frac{32\pi ^6}{9}D_m^6\sum _{j=1}^{D_m}\Vert r\Vert ^2\mathbb {E}\left[ \left\| \widehat{U}_{0}-id\right\| _{\infty }^6\right] \le \frac{32\pi ^6C_6}{9}\Vert r\Vert ^2\frac{D_m^7}{n_0^3}. \end{aligned}$$
(17)

Lemma 9 is proved by gathering (15), (16) and (17). \(\square \)

6.3 Proof of Theorem 2

In the proof, \(C\) is a constant which may change from line to line and is independent of all \(m\in \fancyscript{M}\), \(n\) and \(n_0\). Let \(m\in \fancyscript{M}\) be fixed. The following decomposition holds:

$$\begin{aligned} \left\| \hat{r}_{\hat{m}}\left( .,\hat{F}_{0}\right) -r\right\| ^2&\le 3\left\| \hat{r}_{\hat{m}}\left( .,\hat{F}_{0}\right) -\hat{r}_{m\wedge \hat{m}}\left( .,\hat{F}_{0}\right) \right\| ^2\\&+3\left\| \hat{r}_{m\wedge \hat{m}}\left( .,\hat{F}_{0}\right) -\hat{r}_{m}\left( .,\hat{F}_{0}\right) \right\| ^2+3\left\| \hat{r}_{m}\left( .,\hat{F}_{0}\right) -r\right\| ^2. \end{aligned}$$

We use successively the definition of \(A(\hat{m})\), \(A(m)\) and \(\hat{m}\) to obtain

$$\begin{aligned} \left\| \hat{r}_{\hat{m}}\left( .,\hat{F}_{0}\right) -r\right\| ^2&\le 6\left( A(m)+V(m)\right) +3\left\| \hat{r}_{m}\left( .,\hat{F}_{0}\right) -r\right\| ^2. \end{aligned}$$

Keeping in mind that we can split \(\Vert \hat{r}_{m}(.,\hat{F}_{0})-r\Vert ^2\le 3T_1^m+3T_2^m+3\Vert \hat{r}_{m}(.,F_0)-r\Vert ^2\) with the notations of Sect. 6.2, we derive from (4) and (5):

$$\begin{aligned} \left\| \hat{r}_{\hat{m}}\left( .,\hat{F}_{0}\right) -r\right\| ^2&\le 6\left( A(m)+V(m)\right) +9T_1^m+9T_2^m+9\frac{D_m}{n}+9\left\| r_m-r\right\| ^2. \end{aligned}$$

We also apply Lemmas 8 and 9. Taking into account that \(D_m\le \kappa n_0^{1/3}\), we thus have

$$\begin{aligned} \mathbb {E}\left[ \left\| \hat{r}_{\hat{m}}\left( .,\hat{F}_{0}\right) -r\right\| ^2\right]&\le 6\mathbb {E}\left[ A(m)\right] +6V(m)+C\frac{D_m}{n}+C\Vert r\Vert ^2\frac{D_m}{n_0}\\&+9\left\| r_m-r\right\| ^2+\frac{C}{n_0}+\frac{C}{n}. \end{aligned}$$

Therefore, the conclusion of Theorem 2 follows from the following lemma.

Lemma 10

Under the assumptions of Theorem 2, there exists a constant \(C>0\) such that, for any \(m\in \fancyscript{M}\),

$$\begin{aligned} \mathbb {E}\left[ A(m)\right] \le C\left( \frac{1}{n}+\frac{1}{n_0}\right) +12\left\| r_m-r\right\| ^2. \end{aligned}$$

\(\square \)

6.3.1 Proof of Lemma 10

To study \(A(m)\), we write, for \(m'\in \fancyscript{M}\),

$$\begin{aligned} \left\| \hat{r}_{m'}\left( .,\hat{F}_{0}\right) -\hat{r}_{m\wedge m'}\left( .,\hat{F}_{0}\right) \right\| ^2&\le 3\left\| \hat{r}_{m'}\left( .,\hat{F}_{0}\right) -r_{m'}\right\| ^2+3\left\| r_{m'}-r_{m\wedge m'}\right\| ^2\\&+3\left\| r_{m\wedge m'}-\hat{r}_{m\wedge m'}\left( .,\hat{F}_{0}\right) \right\| ^2. \end{aligned}$$

Let \(\fancyscript{S}(p_{m'})\) be the set \(\{t\in S_{p_{m'}},\;\Vert t\Vert =1\}\) for \(p_{m'}=m'\) or \(p_{m'}=m\wedge m'\) . We note that

$$\begin{aligned} \left\| r_{p_{m'}}-\hat{r}_{p_{m'}}(.,\hat{F}_{0})\right\| ^2=\sum _{j=1}^{D_{p_{m'}}}\left( \tilde{\nu }_n(\varphi _j)\right) ^2=\sup _{t\in \fancyscript{S}(p_{m'})}\tilde{\nu }_n(t)^2, \end{aligned}$$
(18)

with \(\tilde{\nu }_n(t)=n^{-1}\sum _{i=1}^nt\circ \hat{F}_{0}(X_i)-\mathbb {E}[ t\circ F_0(X_i)].\) Since the empirical process \(\tilde{\nu }_n\) is not centered, we consider the following splitting: \((\tilde{\nu }_n(t))^2\le 2\nu _n^2(t)+2((1/n)\sum _{i=1}^n(t\circ \hat{F}_{0}(X_i)-t\circ F_0(X_i)))^2\), with

$$\begin{aligned} \nu _n(t)=\frac{1}{n}\sum _{i=1}^n\left( t\circ F_0(X_i)-\mathbb {E}\left[ t\circ F_0(X_i)\right] \right) . \end{aligned}$$
(19)

But, we also have

$$\begin{aligned}&\sup _{t\in \fancyscript{S}(p_{m'})}\left( \frac{1}{n}\sum _{i=1}^n\left( t\circ \hat{F}_{0}(X_i)-t\circ F_0(X_i)\right) \right) ^2 \\&\quad = \sum _{j=1}^{D_{p_{m'}}}\left( \hat{a}_j^{\hat{F}_0}-\hat{a}_j^{F_0}\right) ^2\le 2T_1^{p_{m'}}+2T_2^{p_{m'}}, \end{aligned}$$

with the notations of Sect. 6.2. This shows that

$$\begin{aligned} \left\| r_{p_{m'}}-\hat{r}_{p_{m'}}(.,\hat{F}_{0})\right\| ^2\le 2\sup _{t\in \fancyscript{S}(p_{m'})}\left( \nu _n(t)\right) ^2+4T_1^{p_{m'}}+4T_2^{p_{m'}}. \end{aligned}$$
(20)

We thus have

$$\begin{aligned}&\left\| \hat{r}_{m'}\left( .,\hat{F}_{0}\right) -\hat{r}_{m\wedge m'}\left( .,\hat{F}_{0}\right) \right\| ^2\\&\quad \le 6\sup _{t\in \fancyscript{S}(m')}\left( \nu _n(t)\right) ^2+6\sup _{t\in \fancyscript{S}(m\wedge m')}\left( \nu _n(t)\right) ^2+12T_2^{m'}+12T_2^{m\wedge m'}\\&\qquad +\,12T_1^{m'}+12T_1^{m\wedge m'}+3\left\| r_{m'}-r_{m\wedge m'}\right\| ^2. \end{aligned}$$

We now get back to the definition of \(A(m)\). To do so, we subtract \(V(m')\). For convenience, we split it into two terms: \(V(m')=V^{(1)}(m')+V^{(2)}(m'),\) with \( V^{(1)}(m')=c_0D_{m'}/n\) and \(V^{(2)}(m')=c_0\Vert r\Vert ^2D_{m'}/n_0\). Thus,

$$\begin{aligned} \mathbb {E}\left[ A\left( m\right) \right]&\le 6\mathbb {E}\left[ \max _{m'\in \fancyscript{M}}\left( \sup _{t\in \fancyscript{S}(m')}\left( \nu _n(t)\right) ^2-\frac{V^{(1)}(m')}{12}\right) _+\right] \!+\!3\max _{m'\in \fancyscript{M}}\left\| r_{m'}-r_{m\wedge m'}\right\| ^2\\&+6\mathbb {E}\left[ \max _{m'\in \fancyscript{M}}\left( \sup _{t\in \fancyscript{S}(m\wedge m')}\left( \nu _n(t)\right) ^2-\frac{V^{(1)}(m')}{12}\right) _+\right] \\&+12\mathbb {E}\left[ \max _{m'\in \fancyscript{M}}\left( T_2^{m'}-\frac{V^{(2)}(m')}{24}\right) _+\right] \\&+12\mathbb {E}\left[ \max _{m'\in \fancyscript{M}}\left( T_2^{m\wedge m'}-\frac{V^{(2)}(m')}{24}\right) _+\right] \\&+12\mathbb {E}\left[ \max _{m'\in \fancyscript{M}} T_1^{m'}\right] +12\mathbb {E}\left[ \max _{m'\in \fancyscript{M}}T_1^{m\wedge m'}\right] . \end{aligned}$$

For the deterministic term, we notice that

$$\begin{aligned} \max _{m'\in \fancyscript{M}}\left\| r_{m'}-r_{m\wedge m'}\right\| ^2&\le 2\max _{\begin{array}{c} m'\in \fancyscript{M}\\ m\le m' \end{array}} \left\| r_{m'}-r\right\| ^2+2\left\| r-r_{m}\right\| ^2. \end{aligned}$$

If \(m\le m'\), the spaces are nested, \(S_m\subset S_{m'}\); thus the orthogonal projections \(r_m\) and \(r_{m'}\) of \(r\) onto \(S_m\) and \(S_{m'},\) respectively, satisfy \(\Vert r_{m'}-r\Vert ^2\le \Vert r_{m}-r\Vert ^2.\) Hence,

$$\begin{aligned} \max _{m'\in \fancyscript{M}}\left\| r_{m'}-r_{m\wedge m'}\right\| ^2\le 4\left\| r_{m}-r\right\| ^{2}. \end{aligned}$$
(21)

Moreover, for \(p_{m'}=m'\) or \(p_{m'}=m\wedge m'\), \(T_1^{p_{m'}}\le T_1^{m_{\max }}\) (recall that \(m_{\max }\) is the largest index in the collection \(\fancyscript{M}\)). Therefore,

$$\begin{aligned} 12\mathbb {E}\left[ \max _{m'\in \fancyscript{M}} T_1^{m'}\right] +12\mathbb {E}\left[ \max _{m'\in \fancyscript{M}}T_1^{m\wedge m'}\right] \le 24\mathbb {E}\left[ T_1^{m_{\max }}\right] \le C\frac{D_{m_{\max }}^3}{nn_0}\le \frac{C}{n}. \end{aligned}$$

Consequently, we have at this stage

$$\begin{aligned} \mathbb {E}\left[ A\left( m\right) \right]&\le \frac{C}{n}+12\left\| r_m-r\right\| ^2+ 6\mathbb {E}\left[ \max _{m'\in \fancyscript{M}}\left( \sup _{t\in \fancyscript{S}(m')}\left( \nu _n(t)\right) ^2-\frac{V^{(1)}(m')}{12}\right) _+\right] \\&+6\mathbb {E}\left[ \max _{m'\in \fancyscript{M}}\left( \sup _{t\in \fancyscript{S}(m\wedge m')}\left( \nu _n(t)\right) ^2-\frac{V^{(1)}(m')}{12}\right) _+\right] \\&+12\mathbb {E}\left[ \max _{m'\in \fancyscript{M}}\left( T_2^{m'}-\frac{V^{(2)}(m')}{24}\right) _+\right] \\&+12\mathbb {E}\left[ \max _{m'\in \fancyscript{M}}\left( T_2^{m\wedge m'}-\frac{V^{(2)}(m')}{24}\right) _+\right] . \end{aligned}$$

Since \(V^{(l)}(m')\ge V^{(l)}(m'\wedge m),\) it only remains to bound the two following terms:

$$\begin{aligned} \mathbb {E}\left[ \max _{m'\in \fancyscript{M}}\left( \sup _{t\in \fancyscript{S}(p_{m'})}\left( \nu _n(t)\right) ^2-\frac{V^{(1)}(p_{m'})}{12}\right) _+\right] \text{ and } \mathbb {E}\left[ \max _{m'\in \fancyscript{M}}\left( T_2^{p_{m'}}-\frac{V^{(2)}(p_{m'})}{24}\right) _+\right] . \end{aligned}$$

We use the two following lemmas. The first one is proved below and the second one is proved in Section 3.1 of the supplementary material.

Lemma 11

Assume that \(r\) is bounded on \((0;1)\). The deviations of the empirical process \(\nu _n\) defined by (19) can be controlled as follows,

$$\begin{aligned} \forall \delta >0,\;\;\mathbb {E}\left[ \max _{m'\in \fancyscript{M}}\left\{ \sup _{t\in \fancyscript{S}(p_{m'})}\nu _n^2(t)-\bar{V}_\delta (p_{m'})\right\} _+\right] \le \frac{C(\delta )}{n}, \end{aligned}$$

where \(\bar{V}_\delta (p_{m'})=2(1+2\delta ) D_{p_{m'}}/n\) and \(C(\delta )\) is a constant which depends on \(\delta \).

We fix \(\delta >0\) (e.g., \(\delta =1/2\)) and choose \(c_0\) in the definition of \(V\) [see (7)] large enough to have \(V^{(1)}(p_{m'})/12\ge \bar{V}_\delta (p_{m'})\) for every \(m'\); this enables us to apply the inequality of Lemma 11 with \(V^{(1)}(p_{m'})/12\) as a replacement for \(\bar{V}_\delta ({p_{m'}})\).

Lemma 12

Under the assumptions of Theorem 2,

$$\begin{aligned} \mathbb {E}\left[ \max _{m'\in \fancyscript{M}}\left( T_2^{p_{m'}}-V_2(p_{m'})\right) _+\right] \le \frac{C}{n_0}, \end{aligned}$$

with \(V_2(p_{m'})=c_2\Vert r\Vert ^2D_{p_{m'}}/n_0\), \(c_2\) a positive constant large enough, and \(C\) depending on the basis, on \(r\), and on the constants \(C_p\) of Proposition 6.

We choose \(c_0\) in the definition of \(V\) [see (7)] large enough to have \(V^{(2)}(p_{m'})/24\ge {V}_2(p_{m'})\), for every \(m'\). This enables us to apply Lemma 12 with \(V^{(2)}(p_{m'})/24\) as a replacement for \(V_2(p_{m'})\).

The proof of Lemma 10 is completed. \(\square \)

6.3.2 Proof of Lemma 11

We roughly bound

$$\begin{aligned} \mathbb {E}\left[ \max _{m'\in \fancyscript{M}}\left\{ \sup _{t\in \fancyscript{S}(p_{m'})}\nu _n^2(t)-\bar{V}_\delta (p_{m'})\right\} _+\right] \le \sum _{m'\in \fancyscript{M}}\mathbb {E}\left[ \left\{ \sup _{t\in \fancyscript{S}(p_{m'})}\nu _n^2(t)-\bar{V}_\delta (p_{m'})\right\} _+\right] . \end{aligned}$$

We apply the Talagrand Inequality recalled in Proposition 7. To this aim, we compute \(M_1\), \(H^2\) and \(v\). Write for a moment \(\nu _n(t)=(1/n)\sum _{i=1}^n\psi _t(X_i)-\mathbb {E}[\psi _t(X_i)]\), with \(\psi _t(x)=t\circ F_0(x)\).

  • First, for \(t\in \fancyscript{S}(p_{m'})\), \(\sup _{x\in A}|\psi _t(x)|\le \Vert t\Vert _{\infty }\le \sqrt{D_{p_{m'}}}\Vert t\Vert =\sqrt{D_{p_{m'}}}=:M_1.\)

  • Next, we develop \(t\in \fancyscript{S}(p_{m'})\) in the orthogonal basis \((\varphi _j)_{j=1,\ldots ,D_{p_{m'}}}\). This leads to

    $$\begin{aligned} \mathbb {E}\left[ \sup _{t\in \fancyscript{S}(p_{m'})}\nu _n^2(t)\right] \le \sum _{j=1}^{D_{p_{m'}}}\mathbb {E}\left[ \nu _n^2(\varphi _j)\right] =\sum _{j=1}^{D_{p_{m'}}}\mathbb {E}\left[ \left( \hat{a}_j^{F_0}-a_j\right) ^2\right] \le \frac{D_{p_{m'}}}{n}=:H^2,\end{aligned}$$

    thanks to the upper-bound for the variance term [see (5)].

  • Last, for \(t\in \fancyscript{S}(p_{m'})\), \(\text{ Var }(\psi _t(X_1))\le \int _At^2(F_0(x))f(x)dx=\int _{(0;1)}t^2(u)r(u)du\le \Vert r\Vert _{\infty }\Vert t\Vert ^2=\Vert r\Vert _{\infty }=:v\).

Inequality (12) gives, for \(\delta >0\),

$$\begin{aligned} \sum _{m'\in \fancyscript{M}}\mathbb {E}\left[ \left( \sup _{t\in \fancyscript{S}(p_{m'})}\nu _n^2(t) -c(\delta )H^2\right) _{+}\right]&\le c_1\sum _{m'\in \fancyscript{M}}\left\{ \frac{1}{n}\exp \left( -c_2\delta D_{p_{m'}}\right) \right. \\&\left. +\frac{D_{p_{m'}}}{C^2(\delta )n^2}\exp \left( -c_3C(\delta )\sqrt{\delta }\sqrt{n}\right) \right\} , \end{aligned}$$

where \(c_l\), \(l=1,2,3\), are three constants. Now, it is sufficient to use that \(D_{p_{m'}}=2p_{m'}+1\) and that the cardinality of \(\fancyscript{M}\) is bounded by \(n\) to end the proof of Lemma 11.

6.4 Sketch of the proof of Theorem 3

The main idea is to introduce the set

$$\begin{aligned} \Lambda =\left\{ \left| \frac{\Vert \hat{r}_{m^*}(.,\hat{F}_{0})\Vert }{\Vert r\Vert }-1\right| < \frac{1}{2}\right\} \end{aligned}$$

and to split

$$\begin{aligned} \displaystyle \mathbb {E}\left[ \Vert \hat{r}_{\tilde{m}}(.,\hat{F}_{0})-r\Vert ^2\right] =\mathbb {E}\left[ \Vert \hat{r}_{\tilde{m}}(.,\hat{F}_{0})-r\Vert ^2\mathbf {1}_{\Lambda }\right] +\mathbb {E}\left[ \Vert \hat{r}_{\tilde{m}}(.,\hat{F}_{0})-r\Vert ^2\mathbf {1}_{\Lambda ^c}\right] . \end{aligned}$$

Then, the aim is to show that the first term gives the order of the upper bound of Theorem 3 and that the probability of the set \(\Lambda ^c\) is negligible compared to \(1/n+1/n_0\). See the supplementary material (Sect. 3.2).

6.5 Sketch of the proof of Theorem 4

Set \(\phi _{n,n_0}=(\min (n,n_0))^{-2\alpha /(2\alpha +1)}\). Since there exists a constant \(c'>0\) (depending on \(\alpha \)) such that \((n^{-1}+n_0^{-1})^{2\alpha /(2\alpha +1)}\le c' \phi _{n,n_0}\), it is sufficient to prove Inequality (10) with the lower bound \(\phi _{n,n_0}\). We also separate two cases: \(n\le n_0\) and \(n>n_0\). Then the result comes down to the proof of two inequalities. For each of these inequalities, the proof is based on the general reduction scheme which can be found in Section 2.6 of Tsybakov (2009): the main idea is to reduce the class of functions \(\fancyscript{F}_\alpha \) to a finite, well-chosen subset \(\{r_a,r_1,\ldots ,r_M\}\), \(M\ge 2\). All the technical details are provided in the supplementary material (Sect. 3.3).