1 Introduction

Let \((\varvec{Z},Y)\) be an \(\mathbb {R}^d\times \mathbb {R}\)-valued random vector with distribution function \(F_{{\varvec{Z}}, {\mathrm{Y}}}\) and consider the problem of estimating the regression function

$$\begin{aligned} m(\varvec{z})=E\left[ Y|\varvec{Z}=\varvec{z}\right] \end{aligned}$$
(1)

based on the data \(\mathbb {D}_n = \{(\varvec{Z}_1,Y_1), \ldots ,(\varvec{Z}_n,Y_n)\}\), where the \((\varvec{Z}_i, Y_i)\)’s are independent and identically distributed (i.i.d.) random vectors from \(F_{{\varvec{Z}},{\mathrm{Y}}}\). In practice, the data \(\mathbb {D}_n\) may be incomplete. More specifically, we are concerned with the important problem of estimating the regression function \(m(\varvec{z})\) when \(\mathbb {D}_n\) has incomplete covariates. The literature contains a large number of results on parametric and semi-parametric techniques for estimating a regression function in the presence of missing covariates. Robins et al. (1994) developed a class of locally and globally adaptive semi-parametric estimators using inverse probability weighted estimating equations. Lipsitz and Ibrahim (1996) examined the use of the EM algorithm to obtain parameter estimates for a general parametric regression model in the case of incomplete categorical covariates. Ibrahim et al. (1999) extended the results of Lipsitz and Ibrahim (1996) to general linear models with mixed continuous and categorical covariates under non-ignorable missingness, in which the probability of observing the covariate may depend on the (possibly missing) value itself. Chen (2004) developed consistent maximum likelihood estimates for the parameters of the regression function by modeling the covariate distribution semi-parametrically. Liang et al. (2004) proposed estimators of the unknown components of a partially linear model and extended their results to longitudinal data. Zhang and Rockette (2005) constructed semi-parametric maximum likelihood estimates for the parameters of the regression function, without specifying the distribution of either the selection probabilities or the covariates, provided that the always-observable covariates are discrete or can be discretized. Sinha et al. (2014) proposed semi-parametric estimators of the model parameters using an estimated score approach, whereas Guo et al. (2014) considered the estimation of a semi-parametric multi-index model based on a weighted estimating equation approach. Bravo (2015) considered an iterative estimator based on inverse probability weighting and local linear estimation.

The literature on nonparametric regression function estimation with incomplete covariates is not as extensive as that for parametric and semi-parametric models. Results along these lines include the imputation-based estimators of a distribution function proposed and studied by Cheng and Chu (1996). Mojirsheibani (2007) extended the approach of Cheng and Chu (1996) to the estimation of set- and function-indexed parameters, which are then used to study the uniform performance of certain nonparametric estimators of regression and density functions in the presence of incomplete covariates. Efromovich (2012) constructed adaptive orthogonal series estimators and studied the minimax rate of the MISE of these estimators when the regression function belongs to a Sobolev class. Faes et al. (2011) developed variational Bayes algorithms using penalized splines with a mixed model representation, whereas Hu et al. (2014) proposed a two-stage multiple imputation method for nonparametric estimation in quantile regression models.

This article focuses on kernel estimation of a regression function when there may be incomplete covariates in the data \(\mathbb {D}_n\). We recall that when \(\mathbb {D}_n\) is fully observable (i.e., there are no missing variables), the popular kernel regression estimator of \(m(\varvec{z})\) is given by

$$\begin{aligned} m_n(\varvec{z})= & {} \frac{\sum _{i=1}^n Y_i \mathcal {K}\left( \frac{\varvec{z}-\varvec{Z}_i}{h_n}\right) }{\sum _{i=1}^n \mathcal {K}\left( \frac{\varvec{z}-\varvec{Z}_i}{h_n}\right) }\,, \end{aligned}$$
(2)

where \(\mathcal {K}: \mathbb {R}^d\rightarrow \mathbb {R}_+\) is an integrable function, called the kernel, and \(h_n>0\) is the corresponding smoothing parameter (bandwidth). Devroye and Wagner (1980) and Spiegelman and Sacks (1980) established the \(L_1\) convergence \(\int |m_n(\varvec{z})-m(\varvec{z})|\mu (d\varvec{z})\overset{p}{\rightarrow } 0\), as \(n\rightarrow \infty \), where \(\mu \) is the probability measure of the random vector \(\varvec{Z}\). Devroye (1981) established the strong convergence of (2) in \(L_1\), whereas Devroye and Krzyżak (1989) derived the following nonparametric exponential upper bound under the assumption that \(|Y|\le M\), for some \(M<\infty \). This classical result may be summarized as follows: for every \(\epsilon >0\) and n large enough

$$\begin{aligned} P\left\{ \int \left| m_n(\varvec{z})-m(\varvec{z})\right| \mu (d\varvec{z})>\epsilon \right\} \le 4~e^{-c n}\,, \end{aligned}$$
(3)

with \(c={\min }^2\left\{ \epsilon ^2/[128M^2(1+c_1)]~,~ \epsilon /[32L(1+c_1)]\right\} \), where \(c_1\) is a constant that depends on the kernel \(\mathcal {K}\) only (see Lemma 1). Clearly, this bound together with the Borel-Cantelli lemma yields the complete convergence (and thus the almost-sure convergence) of \(m_n\) to m in \(L_1(\mu )\). For more on such results, one may also refer to Kohler et al. (2003), Walk (2002a), and Walk (2002b). The problem can be substantially more complicated when the data contains covariate vectors that are not necessarily fully observable, and this will be the focus of this paper.
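For concreteness, the estimator (2) can be computed in a few lines of base R; the Gaussian product kernel and all function names below are illustrative choices, not part of the formal development.

## Minimal base-R sketch of the kernel regression estimator (2) with a
## Gaussian product kernel (an example of a regular kernel).
nw_estimate <- function(z, Z, Y, h) {
  ## z: evaluation point (length-d vector); Z: n x d matrix; Y: length-n response
  u <- sweep(Z, 2, z, "-") / h      # (Z_i - z)/h_n (the sign is immaterial for a symmetric kernel)
  w <- exp(-0.5 * rowSums(u^2))     # kernel weights K((z - Z_i)/h_n)
  if (sum(w) == 0) return(mean(Y))  # guard against an empty neighbourhood
  sum(w * Y) / sum(w)
}

set.seed(1)
Z <- matrix(rnorm(200), ncol = 2)
Y <- Z[, 1] * Z[, 2] + Z[, 2]^2 + rnorm(100, sd = sqrt(0.5))   # a toy regression model
nw_estimate(c(0, 0), Z, Y, h = 0.5)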

In the rest of this paper we develop kernel-based estimators of the regression function (1) when a subset of the covariate vector is Missing At Random (MAR). In Sect. 2 we develop results similar to those of Devroye and Krzyżak (1989) and derive exponential upper bounds on the general \(L_p\) norms of our estimators for \(1\le p<\infty \). In Sect. 2.3 we apply our results to the problem of statistical classification in the presence of partially observed covariates, and Sect. 2.4 presents some numerical examples; Sect. 3 closes with a brief discussion.

2 Main results

Our goal in this section is to construct kernel estimators of the regression function \(m(\varvec{z})=E[Y|\varvec{Z}=\varvec{z}]\) when some of the components of the covariate vector \(\varvec{Z}\) may be missing. Let \((\varvec{Z},Y)\) be an \(\mathbb {R}^{d+s}\times \mathbb {R}\)-valued random vector where \(\varvec{Z}=(\varvec{X}',\varvec{V}')'\), with \(\varvec{X}\in \mathbb {R}^d\), \(\varvec{V}\in \mathbb {R}^s\), and \(d,s\ge 1\). Here \(\varvec{X}\) is always observable but \(\varvec{V}\) may be missing at random. Define the Bernoulli random variable \(\Delta \) according to \(\Delta =1\) if \(\varvec{V}\) is observable (and \(\Delta =0\), otherwise). Therefore, the data may be represented as \(\mathbb {D}_n= \{(\varvec{Z}_1,Y_1,\Delta _1), \ldots ,(\varvec{Z}_n,Y_n,\Delta _n)\}\). We also define the selection probability (also called the missing probability mechanism) according to

$$\begin{aligned} P(\Delta =1|\varvec{Z}=\varvec{z},Y=y)=P(\Delta =1|\varvec{X}=\varvec{x}, \varvec{V}=\varvec{v},Y=y)\,. \end{aligned}$$
(4)

If the right hand side of (4) is equal to a constant, then the missing probability mechanism is said to be Missing Completely At Random (MCAR). The MCAR assumption is often unrealistic in practice. Therefore, in what follows, we work under the weaker and more commonly used Missing At Random (MAR) assumption. Under the MAR assumption, the probability that \(\varvec{V}\) is missing depends only on the variables that are always available. More specifically, under the MAR assumption, the selection probability in (4) becomes

$$\begin{aligned} P(\Delta =1|\varvec{Z}=\varvec{z},Y=y)=P(\Delta =1|\varvec{X} = \varvec{x},Y=y)=:\eta ^*(\varvec{x},y)\,. \end{aligned}$$
(5)

For more on these and other missingness patterns one may refer to the monograph by Little and Rubin (2002). To motivate our proposed regression estimators, first consider the hypothetical situation where the selection probability \(\eta ^*\) is completely known. In this case, we propose to work with the modified kernel regression estimator given by

$$\begin{aligned} \widehat{m}_{\eta ^*}(\varvec{z})=\frac{\sum _{i=1}^n \frac{\Delta _i Y_i}{\eta ^*\left( \varvec{X}_i,Y_i\right) }\mathcal {K}(\frac{\varvec{z}-\varvec{Z}_i}{h_n})}{\sum _{i=1}^n \frac{\Delta _i }{\eta ^*\left( \varvec{X}_i,Y_i\right) } \mathcal {K}\left( \frac{\varvec{z}-\varvec{Z}_i}{h_n}\right) }\,. \end{aligned}$$
(6)

Observe that the estimator in (6) works by weighting the complete cases by the inverse of the selection probabilities, which is in the spirit of the classical estimator of Horvitz and Thompson (1952). In fact, this approach has been used by many authors in the literature; see, for example, Robins et al. (1994), Hirano and Ridder (2003), and Wang et al. (2010). As for the usefulness of \(\widehat{m}_{\eta ^*}(\varvec{z})\) as an estimator of \(m(\varvec{z})\), observe that (6) can be rewritten as

$$\begin{aligned} \widehat{m}_{\eta ^*}(\varvec{z})=\frac{\sum _{i=1}^n \frac{\Delta _i Y_i}{\eta ^*\left( \varvec{X}_i,Y_i\right) }\mathcal {K}(\frac{\varvec{z}-\varvec{Z}_i}{h_n})/\sum _{i=1}^n\mathcal {K}(\frac{\varvec{z}-\varvec{Z}_i}{h_n})}{\sum _{i=1}^n \frac{\Delta _i }{\eta ^*\left( \varvec{X}_i,Y_i\right) } \mathcal {K}\left( \frac{\varvec{z}-\varvec{Z}_i}{h_n}\right) /\sum _{i=1}^n \mathcal {K}(\frac{\varvec{z}-\varvec{Z}_i}{h_n})}\,, \end{aligned}$$

which is a ratio of kernel estimators for the following ratio of two conditional expectations

$$\begin{aligned} \frac{E\left[ \frac{\Delta Y}{\eta ^*\left( \varvec{X}, Y\right) }|\varvec{Z} \right] }{E\left[ \frac{\Delta }{\eta ^*\left( \varvec{X},Y\right) }|\varvec{Z} \right] }&\overset{a.s.}{=}&\frac{E\left[ \frac{Y}{\eta ^*\left( \varvec{X},Y\right) }E\left[ \Delta |\varvec{Z},Y\right] |\varvec{Z}\right] }{E\left[ \frac{1}{\eta ^*\left( \varvec{X}, Y\right) }E\left[ \Delta |\varvec{Z},Y\right] |\varvec{Z}\right] }= \frac{E\left[ Y|\varvec{Z}\right] }{E\left[ 1|\varvec{Z}\right] }=m(\varvec{Z})\,.\\ \end{aligned}$$

Therefore, since \(E[\Delta |\varvec{Z},Y]=\eta ^*(\varvec{X},Y)\) under the MAR assumption (5), when the missing probability mechanism \(\eta ^*\) is known, (6) can be seen as a kernel regression estimator of the regression function \(E[Y|\varvec{Z}]=m(\varvec{Z})\). To study the performance of the estimator in (6), we examine its convergence properties in \(L_p\) norms. In what follows we shall assume that the selected kernel \(\mathcal {K}\) is regular: a nonnegative kernel \(\mathcal {K}\) is said to be regular if there are positive constants \(b>0\) and \(r>0\) for which \(\mathcal {K}({\varvec{z}})\ge b\, I\{{\varvec{z}}\in S_{0,r}\}\) and \(\int \sup \nolimits _{{\varvec{y}}\in {\varvec{z}}+S_{0,r}} \mathcal {K}({\varvec{y}})d{\varvec{z}} < \infty \), where \(S_{0,r}\) is the ball of radius r centered at the origin. This is also the type of kernel used by Devroye and Krzyżak (1989). In fact, many of the kernels used in practice, including the popular Gaussian kernel, are regular. For more on regular kernels, one may also refer to Györfi et al. (2002). Before going any further, we state a condition that amounts to requiring \(\varvec{V}\) to be observable with nonzero probability:

Condition A1

\(\eta _0:=\inf \nolimits _{\varvec{x}\in \mathbb {R}^d,\,y\in \mathbb {R}} \, P(\Delta =1|\varvec{X} = \varvec{x},Y=y) >0\).

Theorem 1

Let \(\widehat{m}_{\eta ^*}(\varvec{z})\) be the kernel regression estimator defined in (6), where \(\mathcal {K}\) is a regular kernel, and suppose that condition A1 holds. If \(|Y|\le M<\infty \), \(h_n\rightarrow 0\), and \(nh_n^{d+s}\rightarrow \infty \), as \(n\rightarrow \infty \), then for every \(\epsilon >0\), every \(p\in [1,\infty )\), and n large enough

$$\begin{aligned} P\left\{ \displaystyle \int \left| \widehat{m}_{\eta ^*}(\varvec{z}) - m(\varvec{z})\right| ^p \mu (d\varvec{z})>\epsilon \right\} \le 8e^{-an } \end{aligned}$$

where \(a\equiv a(\epsilon )={\min }^2(\epsilon ^2 \eta _0^{2}/[2^{2p+7}M^{2p}(1+c_1)],\epsilon \eta _0/[2^{p+5}M^{p}(1+c_1)])\), and \(c_1\) is the positive constant of Lemma 1 in the Appendix.

The proof of Theorem 1 is rather straightforward and follows from the upper bound in (3) and the fact that, since both \(|\widehat{m}_{\eta ^*}|\) and \(|m|\) are bounded by M, for any \(\varvec{z}\in \mathbb {R}^{d+s}\) and \(p\in [1,\infty )\) one has \(\left| \widehat{m}_{\eta ^*}(\varvec{z})-m(\varvec{z})\right| ^p\le \left( \left| \widehat{m}_{\eta ^*} (\varvec{z})\right| + \left| m(\varvec{z}) \right| \right) ^{p-1}\left| \widehat{m}_{\eta ^*}(\varvec{z})-m(\varvec{z}) \right| \le (2M)^{p-1} \left| \widehat{m}_{\eta ^*} (\varvec{z})-m(\varvec{z})\right| \,.\) In passing, we also note that the bound in Theorem 1, together with the Borel-Cantelli lemma, immediately yields \(E[|\widehat{m}_{\eta ^*} (\varvec{Z})-m(\varvec{Z})|^p|\mathbb {D}_n] \overset{\text {a.s.}}{\rightarrow }0\), as \(n\rightarrow \infty \).
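As an illustration of (6), the following base-R sketch evaluates the inverse-probability-weighted estimator when the selection probability is known; the logistic mechanism in the usage lines is the one later used for Model A in Sect. 2.4, and all names are illustrative.

## Sketch of the weighted estimator (6) with a known selection probability;
## eta_star is the vector of values eta*(X_i, Y_i) at the data points.
ipw_nw <- function(z, Z, Y, Delta, eta_star, h) {
  u <- sweep(Z, 2, z, "-") / h
  K <- exp(-0.5 * rowSums(u^2))   # Gaussian kernel weights (a regular kernel)
  w <- Delta / eta_star           # Horvitz-Thompson weights Delta_i / eta*(X_i, Y_i)
  den <- sum(w * K)
  if (den == 0) return(NA_real_)  # no complete case received positive weight
  sum(w * Y * K) / den
}

set.seed(1)
Z     <- matrix(rnorm(300), ncol = 2)                 # Z = (X, V) with X = Z1, V = Z2
Y     <- Z[, 1] * Z[, 2] + Z[, 2]^2 + rnorm(150, sd = sqrt(0.5))
eta   <- plogis(1 + 0.2 * Z[, 1] - 0.5 * Y)           # P(Delta = 1 | X, Y), assumed known here
Delta <- rbinom(150, 1, eta)
ipw_nw(c(0, 0), Z, Y, Delta, eta, h = 0.5)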

Clearly, the kernel estimator in (6) is useful only if the missing probability mechanism \(\eta ^*(\varvec{X},Y)=E[\Delta |\varvec{X},Y]\) is known (an unrealistic case). If \(\eta ^*\) is unknown, it must be replaced by some sample-based estimator. We consider two different estimators of \(\eta ^*\) and study the performance of the corresponding revised versions of (6). Our first estimator of \(\eta ^*(\varvec{X},Y)\) is itself a kernel estimator whereas the second one is based on the least-squares method.

2.1 The first estimator

Here we consider replacing the unknown selection probability \(\eta ^*(\varvec{X}_i,Y_i)\) in (6) by the following estimator

$$\begin{aligned}&\widehat{\eta }(\varvec{X}_i,Y_i) \nonumber \\&\quad = \left\{ \begin{array}{ll} \frac{\displaystyle \sum \nolimits _{j=1,\ne i}^n \Delta _j I\left\{ Y_i=Y_j\right\} \mathcal {H}\left( (\varvec{X}_i-\varvec{X}_j)/\lambda _n\right) }{\displaystyle \sum \nolimits _{j=1,\ne i}^n I\left\{ Y_i=Y_j\right\} \mathcal {H}\left( (\varvec{X}_i-\varvec{X}_j)/\lambda _n\right) }\,,~&{}~\hbox {if } Y \hbox { is a discrete r.v.}~~~\\ \frac{\displaystyle \sum \nolimits _{j=1,\ne i}^n \Delta _j \mathcal {J} \left( (\varvec{U}_i - \varvec{U}_j)/\lambda _n\right) }{\displaystyle \sum \nolimits _{j=1,\ne i}^n \mathcal {J} \left( (\varvec{U}_i - \varvec{U}_j)/\lambda _n\right) }\,,\text { with } \varvec{U}_i=(\varvec{X}_i',Y_i)'\,,~&{}~\hbox {otherwise,} \end{array} \right. \nonumber \\ \end{aligned}$$
(7)

with the convention that \(0/0=0\), where \(\mathcal {H}:\mathbb {R}^d\rightarrow \mathbb {R}_+\) and \(\mathcal {J}:\mathbb {R}^{d+1}\rightarrow \mathbb {R}_+\) are the kernels used, and the smoothing parameter \(\lambda _n\) satisfies \(\lambda _n\rightarrow 0\) as \(n\rightarrow \infty \). The top estimator in (7) is used when Y is a discrete random variable; otherwise the bottom estimator is used. Our modified kernel-type estimator of \(m(\varvec{z})=E[Y|\varvec{Z}=\varvec{z}]\) is then given by

$$\begin{aligned} \widehat{m}_{\widehat{\eta }}(\varvec{z})=\dfrac{\sum _{i=1}^n \frac{\Delta _i Y_i}{\widehat{\eta }\left( \varvec{X}_i,Y_i\right) } \mathcal {K}(\frac{\varvec{z}-\varvec{Z}_i}{h_n})}{\sum _{i=1}^n \frac{\Delta _i }{\widehat{\eta }\left( \varvec{X}_i,Y_i\right) }\mathcal {K}\left( \frac{\varvec{z} - \varvec{Z}_i}{h_n}\right) }\,. \end{aligned}$$
(8)
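A base-R sketch of the leave-one-out kernel estimator in the top line of (7) (the case of discrete Y) and of the resulting estimator (8) may look as follows; the Gaussian choices for the kernels \(\mathcal {H}\) and \(\mathcal {K}\), and all function names, are illustrative assumptions.

## Leave-one-out kernel estimator (7) of eta*(X_i, Y_i) when Y is discrete.
eta_hat_kernel <- function(X, Y, Delta, lambda) {
  X <- as.matrix(X)
  n <- nrow(X)
  out <- numeric(n)
  for (i in seq_len(n)) {
    same <- (Y == Y[i]); same[i] <- FALSE                      # indices j != i with Y_j = Y_i
    u   <- sweep(X[same, , drop = FALSE], 2, X[i, ], "-") / lambda
    Hw  <- exp(-0.5 * rowSums(u^2))                            # kernel H weights
    den <- sum(Hw)
    out[i] <- if (den > 0) sum(Delta[same] * Hw) / den else 0  # convention 0/0 = 0
  }
  out
}

## Estimator (8): the weighted kernel estimator with eta* replaced by eta_hat.
m_hat <- function(z, Z, Y, Delta, eta_hat, h) {
  u <- sweep(Z, 2, z, "-") / h
  K <- exp(-0.5 * rowSums(u^2))
  w <- ifelse(Delta == 1, 1 / eta_hat, 0)   # condition A1 keeps eta_hat away from 0 in practice
  sum(w * Y * K) / sum(w * K)
}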

To study the regression function estimator (8), first we state a number of assumptions.

Condition A2

The kernel \(\mathcal {H}\) in (7) satisfies \(\int _{\mathbb {R}^d} \mathcal {H}({\varvec{w}})d{\varvec{w}}=1\) and \(\int _{\mathbb {R}^d} |w_i| \mathcal {H}({\varvec{w}})d{\varvec{w}}<\infty \) for \(i=1,\ldots , d\), where \(\varvec{w}=(w_1,\ldots ,w_d)'\). Furthermore, the smoothing parameter \(\lambda _n\) satisfies \(\lambda _n\rightarrow 0\) and \(n \lambda _n^d\rightarrow \infty \), as \(n\rightarrow \infty \).

Condition A3

The random vector \({\varvec{X}}\) has a compactly supported probability density function, \(f({\varvec{x}})=\sum _{y\in \mathcal {Y}} f_y(\varvec{x})P(Y=y)\), which is bounded away from zero on its support, where \(f_y(\varvec{x})\) is the conditional density of \(\varvec{X}\) given \(Y=y\). Additionally, both f and its first-order partial derivatives are uniformly bounded on its support.

Condition A4

The partial derivatives \(\frac{\partial }{\partial x_i} \eta ^*(\varvec{x},y)\), \(i=1,\ldots , d\), exist and are uniformly bounded, in \(\varvec{x}\), on the compact support of f.

Condition A2 is not restrictive since the choice of the kernel \(\mathcal {H}\) is at our discretion. Condition A3 is usually imposed in nonparametric regression in order to avoid unstable estimates (in the tails of the p.d.f. of \(\varvec{X}\)). Condition A4 is technical and has already been used in the literature; see, for example, Cheng and Chu (1996).

Theorem 2

Let \(\widehat{m}_{\widehat{\eta }}(\varvec{z})\) be as in (8), where \(\widehat{\eta }\) is the top estimator in (7). Suppose that \(|Y|\le M<\infty \) and that \(\mathcal {K}\) is a regular kernel. If \(h_n\rightarrow 0\) and \(nh_n^{d+s}\rightarrow \infty \), as \(n\rightarrow \infty \), then under conditions A1, A2, A3, and A4, for every \(\epsilon >0\), every \(p\in [1,\infty )\), and n large enough,

$$\begin{aligned} P\left\{ \int \left| \widehat{m}_{\widehat{\eta }} (\varvec{z})-m(\varvec{z})\right| ^p \mu (d\varvec{z})>\epsilon \right\}\le & {} 8e^{-b n }+2e^{-a_1 n \epsilon ^2 } \\&+\,8ne^{-a_2(n-1) \lambda _n^d \epsilon ^2 }+8n e^{-a_3(n-1) \lambda _n^d }\,, \end{aligned}$$

where \(a_1,~a_2,\) and \(a_3\) are positive constants not depending on n or \(\epsilon \), and \(b\equiv b(\epsilon )>0\) does not depend on n.

Proof

See the Appendix. \(\square \)

Remarks

Theorem 2 has been stated (and proved) for the case where \(\widehat{\eta }\) is the top estimator in (7), which corresponds to the case where Y is a discrete random variable. This is particularly useful when Y is the class variable in a classification problem. However, Theorem 2 continues to hold, with different values of the constants \(a_1, a_2, a_3,\) and b, even if Y is a continuous random variable (in which case the bottom estimator in (7) is used for \(\widehat{\eta }\)). The proof of this result is similar to (and, in fact, easier than) that of Theorem 2 and will not be given here. In this case \(\mathcal {H}\) is replaced by the kernel \(\mathcal {J}\) and d is replaced by \(d+1\) in condition A2, f becomes the p.d.f. of \({\varvec{U}}=({\varvec{X}'},Y)'\) in condition A3, and condition A4 is expressed in terms of the partial derivatives \(\frac{\partial }{\partial u_i} \eta ^*({\varvec{u}})\), \(i=1,\ldots , d+1\), where \({\varvec{u}}=({\varvec{x}}', y)'\). In passing, we also note that, in view of Theorem 2 and the Borel-Cantelli lemma, if \(\log n/(n\lambda _n^d)\rightarrow 0\) as \(n\rightarrow \infty \), then \(E[|\widehat{m}_{\widehat{\eta }} (\varvec{Z})-m(\varvec{Z})|^p|\mathbb {D}_n]\overset{\text {a.s.}}{\rightarrow }0\).

2.2 The second estimator

Another way to estimate the selection probability \(\eta ^*\) (under the MAR assumption) is to posit a logistic regression model; the flexibility and convenience of this model have made it one of the most popular choices in such settings. More generally, suppose it is known in advance that \(\eta ^*\) belongs to a given (known) class of functions. In this case a least-squares estimator is an alternative to the kernel estimator of \(\eta ^*\) and works as follows. Let \(\eta ^*\) belong to a known class \(\mathcal {M}\) of functions \(\eta :\mathbb {R}^d\times \mathbb {R}\rightarrow [\eta _0,1]\), where \(\eta _0=\inf \limits _{\varvec{x},y}\, P(\Delta =1|\varvec{X}=\varvec{x},Y=y)>0\), as in condition A1. The least-squares estimator of the function \(\eta ^*\) is

$$\begin{aligned} \widehat{\eta }_{{\mathrm{LS}}} = \mathop {\mathrm{argmin}}\limits _{\eta \in \mathcal {M}}\,\frac{1}{n}\sum _{j=1}^n \left( \Delta _j-\eta \left( \varvec{X}_j,Y_j\right) \right) ^2\,. \end{aligned}$$
(9)

Replacing \(\eta ^*(\varvec{X}_i,Y_i)\) in (6) with the least-squares estimator (9), we arrive at the revised regression estimator

$$\begin{aligned} \widehat{m}_{\widehat{\eta }_{{\mathrm{LS}}}}(\varvec{z})= & {} \frac{\sum _{i=1}^n\frac{\Delta _i Y_i}{\widehat{\eta }_{{\mathrm{LS}}} (\varvec{X}_i,Y_i)} \mathcal {K}\left( \frac{\varvec{z}-\varvec{Z}_i}{h_n}\right) }{\sum _{i=1}^n \frac{\Delta _i}{\widehat{\eta }_{{\mathrm{LS}}}(\varvec{X}_i,Y_i)} \mathcal {K}\left( \frac{\varvec{z}-\varvec{Z}_i}{h_n}\right) }\,. \end{aligned}$$
(10)
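To illustrate (9) and (10), the following sketch takes \(\mathcal {M}\) to be a parametric logistic class \(\eta (\varvec{x},y)=\mathrm{expit}(a+\varvec{b}'\varvec{x}+cy)\) and minimizes the empirical squared-error criterion with base R's optim(); the choice of class and of optimizer are illustrative assumptions, not prescribed by the theory (note also that the class in the theory maps into \([\eta _0,1]\), whereas plogis only guarantees values in (0,1)).

## Least-squares estimator (9) over an assumed logistic class, fitted by optim().
eta_hat_ls <- function(X, Y, Delta) {
  design <- cbind(1, as.matrix(X), Y)
  crit <- function(theta) mean((Delta - plogis(design %*% theta))^2)  # criterion in (9)
  fit  <- optim(rep(0, ncol(design)), crit, method = "BFGS")
  as.vector(plogis(design %*% fit$par))   # fitted values eta_LS(X_i, Y_i)
}

The fitted values returned above can then be substituted for \(\widehat{\eta }_{{\mathrm{LS}}}(\varvec{X}_i,Y_i)\) in (10), in the same way as the kernel estimates were used in (8).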

To study the performance of \(\widehat{m}_{\widehat{\eta }_{{\mathrm{LS}}}} (\varvec{z})\), we employ results from empirical process theory [see, for example, van der Vaart and Wellner (1996, p. 83) and Pollard (1984, p. 25); also see Györfi et al. (2002, p. 134)]. We say that \(\mathcal {M}\) is totally bounded with respect to the empirical \(L_1\) norm if for every \(\epsilon >0\) there exists a finite subclass \(\mathcal {M}_{\epsilon } = \{\eta _1,\ldots ,\eta _{N_{\epsilon }}\}\) such that for every \(\eta \in \mathcal {M}\) there is an \(\eta ^{\dagger }\in \mathcal {M}_{\epsilon }\) with the property that, for the given points \((\varvec{x}_1,y_1),\ldots ,(\varvec{x}_n,y_n)\), one has \(\frac{1}{n} \sum _{i=1}^n\left| \eta (\varvec{x}_i,y_i) - \eta ^{\dagger }(\varvec{x}_i,y_i)\right| <\epsilon \). The subclass \(\mathcal {M}_{\epsilon }\) is called an \(\epsilon \)-cover of \(\mathcal {M}\). The cardinality of the smallest such cover is called the \(\epsilon \)-covering number of \(\mathcal {M}\) and is denoted by \(\mathcal {N}_1(\epsilon ,\mathcal {M},\mathbb {D}_n)\). The following result summarizes the performance of the \(L_p\) norms of \(\widehat{m}_{\widehat{\eta }_{{\mathrm{LS}}}}(\varvec{z})\).

Theorem 3

Let \(\widehat{m}_{\widehat{\eta }_{{\mathrm{LS}}}}(\varvec{z})\) be as defined via (10) and (9) and suppose that \(\mathcal {M}\) is totally bounded with respect to the empirical \(L_1\) norm. Let \(\mathcal {K}\) in (10) be a regular kernel and suppose that condition A1 holds and that \(|Y|\le M<\infty \). Then, provided that \(h_n\rightarrow 0\) and \(nh_n^{d+s}\rightarrow \infty \), as \(n\rightarrow \infty \), one has for every \(\epsilon >0\), every \(p\in [1,\infty )\), and n large enough

$$\begin{aligned} P\left\{ \int \left| \widehat{m}_{\widehat{\eta }_{{\mathrm{LS}}}} (\varvec{z})-m(\varvec{z})\right| ^p \mu (d\varvec{z})>\epsilon \right\}\le & {} 8e^{-b n }+2e^{-b_1n \epsilon ^2} \\&+ 16~E\left[ \mathcal {N}_1\left( b_2\epsilon , \mathcal {M},\mathbb {D}_n\right) \right] ~e^{-b_3 n\epsilon ^2}\\&+ 16~E\left[ \mathcal {N}_1\left( b_4\epsilon ^2, \mathcal {M}, \mathbb {D}_n\right) \right] ~e^{-b_5 n \epsilon ^4}\,, \end{aligned}$$

with \(b\equiv b(\epsilon )\) as in Theorem 2, and \(b_1,\ldots ,b_5>0\) are constants not depending on n or \(\epsilon \).

We also note that \(E[|\widehat{m}_{\widehat{\eta }_{{\mathrm{LS}}}} (\varvec{Z})-m(\varvec{Z})|^p| \mathbb {D}_n]\overset{\text {a.s.}}{\rightarrow }0\), whenever \(\log (E\big [\mathcal {N}_1\big (b_3\wedge b_5,\mathcal {M},\mathbb {D}_n\big )\big ])/n\rightarrow 0\) as \(n\rightarrow \infty \).

Proof

See the Appendix. \(\square \)

2.3 Applications to classification with missing covariates

In this section we consider an application of our main results to the problem of statistical classification. Let \((\varvec{Z},Y)\) be an \(\mathbb {R}^{d+s}\times \{1,\ldots ,N\}\)-valued random vector, where Y is the class label to be predicted from the explanatory variables \(\varvec{Z}=(Z_1,\ldots ,Z_{d+s})\); here \(N\ge 2\) is an integer. In classification one searches for a function \(\phi :\mathbb {R}^{d+s}\rightarrow \{1,\ldots ,N\}\) such that the misclassification error probability \(L(\phi )=P\{\phi (\varvec{Z})\ne Y\}\) is as small as possible. Let \(\pi _k(\varvec{z}) = P\left\{ Y=k|\varvec{Z} = \varvec{z}\right\} ,~\varvec{z}\in \mathbb {R}^{d+s},~1\le k\le N\,,\) be the conditional class probabilities. Then the best classifier, i.e., the one that minimizes \(L(\phi )\), is the Bayes classifier, given by [see, for example, Devroye and Györfi (1985, pp. 253–254)]

$$\begin{aligned} \phi _B(\varvec{z})= & {} \mathop {\mathrm{argmax}}\limits _{1\le k\le N}~\pi _k(\varvec{z})\,. \end{aligned}$$
(11)

We note that the Bayes classifier satisfies \(\max \nolimits _{1\le k\le N} \pi _k(\varvec{z})=\pi _{\phi _{B}(\varvec{z})}(\varvec{z})\,.\) In practice, the Bayes classifier is unavailable and one has to use a random sample \(\mathbb {D}_n=\{(\varvec{Z}_1,Y_1),\ldots ,(\varvec{Z}_n,Y_n)\}\) from the distribution of \((\varvec{Z},Y)\) to construct estimates of \(\phi _B\). Now, let \(\widehat{\phi }_n(\varvec{Z})\) be any sample-based classifier used to predict Y from the data \(\mathbb {D}_n\) and \(\varvec{Z}\). The misclassification error of this sample-based classifier is given by

$$\begin{aligned} L_n(\widehat{\phi }_n)=P\left\{ \widehat{\phi }_n(\varvec{Z})\ne Y\big |\mathbb {D}_n\right\} \,. \end{aligned}$$

Here we are interested in sample-based classifiers whose error rates converge to that of the optimal classifier. It can be shown [see, for example, Devroye and Györfi (1985, p. 254)] that

$$\begin{aligned} 0\le L_n(\widehat{\phi }_n)-L(\phi _B)\le \displaystyle \sum _{k=1}^N\displaystyle \int \left| \widehat{\pi }_k \left( \varvec{z}\right) -\pi _k\left( \varvec{z}\right) \right| \mu (d\varvec{z}). \end{aligned}$$
(12)
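For completeness, here is a short sketch of the standard argument behind (12), along the lines of Devroye and Györfi (1985). Since \(L(\phi )=1-\int \pi _{\phi (\varvec{z})}(\varvec{z})\,\mu (d\varvec{z})\) for any classifier \(\phi \), one has

$$\begin{aligned} L_n(\widehat{\phi }_n)-L(\phi _B)=\int \left( \pi _{\phi _B(\varvec{z})}(\varvec{z})-\pi _{\widehat{\phi }_n(\varvec{z})}(\varvec{z})\right) \mu (d\varvec{z})\,. \end{aligned}$$

If \(\widehat{\phi }_n(\varvec{z})=\phi _B(\varvec{z})\) the integrand vanishes; otherwise, since \(\widehat{\pi }_{\phi _B(\varvec{z})}(\varvec{z})\le \widehat{\pi }_{\widehat{\phi }_n(\varvec{z})}(\varvec{z})\) by the definition of \(\widehat{\phi }_n\), the integrand is at most \(|\pi _{\phi _B(\varvec{z})}(\varvec{z})-\widehat{\pi }_{\phi _B(\varvec{z})}(\varvec{z})|+|\widehat{\pi }_{\widehat{\phi }_n(\varvec{z})}(\varvec{z})-\pi _{\widehat{\phi }_n(\varvec{z})}(\varvec{z})|\), a sum over two distinct classes that is bounded by \(\sum _{k=1}^N|\widehat{\pi }_k(\varvec{z})-\pi _k(\varvec{z})|\); integrating over \(\mu \) gives (12).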

Therefore, to show \(L_n(\widehat{\phi }_n) - L(\phi _B)\overset{\text {a.s.}}{\longrightarrow }0\,,\) it is sufficient to show that \(E[|\widehat{\pi }_k(\varvec{Z})-\pi _k(\varvec{Z})||\mathbb {D}_n] \overset{\text {a.s.}}{\longrightarrow }0\), as \(n\rightarrow \infty \), for \(k=1,\ldots ,N\).

In what follows, we consider the problem of classification with missing covariates. More specifically, we consider the situation where a subset of the covariate vector \(\varvec{Z}\) may be missing at random. For some relevant results along these lines see, for example, Mojirsheibani (2012). Our notation below is as before, i.e., for all \(i\ge 1\), the Bernoulli random variable \(\Delta _i\) will be 1 if the corresponding \(\varvec{Z}_i\) is fully observable. To construct our classifiers in the presence of missing covariates, first let

$$\begin{aligned} \widehat{\pi }_k(\varvec{z})=\frac{\sum _{i=1}^n\frac{\Delta _i I\{Y_i=k\}}{\breve{\eta }(\varvec{X}_i,Y_i)}\mathcal {K}\left( \frac{\varvec{z} -\varvec{Z}_i}{h_n}\right) }{\sum _{i=1}^n\frac{\Delta _i }{\breve{\eta }(\varvec{X}_i,Y_i)}\mathcal {K}\left( \frac{\varvec{z}-\varvec{Z}_i}{h_n}\right) }\,,~k=1,\ldots ,N\,, \end{aligned}$$
(13)

be the estimated class conditional probabilities, where \(\breve{\eta }\) is any estimator of the selection probability \(\eta ^*\). Our proposed sample-based classifier is the plug-in type estimator of (11) given by

$$\begin{aligned} \widehat{\phi }_n(\varvec{z})= & {} \mathop {\mathrm{argmax}}\limits _{1\le k\le N}~ \widehat{\pi }_k(\varvec{z})\,. \end{aligned}$$
(14)
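A compact base-R sketch of the plug-in rule (14), with \(\breve{\eta }\) supplied as a vector of selection-probability estimates at the training points, is given below; the Gaussian kernel and all names are illustrative choices.

## Plug-in classifier (14): estimate each class probability via (13) and
## return the class with the largest estimate.
classify <- function(z, Z, Y, Delta, eta_breve, h) {
  u <- sweep(Z, 2, z, "-") / h
  K <- exp(-0.5 * rowSums(u^2))
  w <- ifelse(Delta == 1, 1 / eta_breve, 0)
  classes <- sort(unique(Y))
  pi_hat  <- vapply(classes, function(k) sum(w * (Y == k) * K) / sum(w * K), numeric(1))
  classes[which.max(pi_hat)]
}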

The following result summarizes the performance of \(\widehat{\phi }_n\).

Theorem 4

Let \(\widehat{\phi }_n\) be the classifier defined via (14) and (13).

  (i)

    If \(\breve{\eta }\) in (13) is taken to be the kernel estimator \(\widehat{\eta }\) defined in (7) then, under the conditions of Theorem 2, \(\widehat{\phi }_n\) is strongly consistent, i.e., \(L_n(\widehat{\phi }_n)-L(\phi _B)\overset{\text {a.s.}}{\rightarrow }0\), as \(n\rightarrow \infty \).

  (ii)

    If \(\breve{\eta }\) in (13) is taken to be the least-squares estimator \(\widehat{\eta }_{{\mathrm{LS}}}\) defined in (9) then, under the conditions of Theorem 3, \(\widehat{\phi }_n\) is strongly consistent, i.e., \(L_n(\widehat{\phi }_n) - L(\phi _B)\overset{\text {a.s.}}{\rightarrow }0\), as \(n\rightarrow \infty \).

Proof of Theorem 4

  (i)

    Let \(\breve{\eta }\) be the kernel estimator \(\widehat{\eta }\) defined by the top estimator in (7). Then it is straightforward to show that the following counterpart of (12) holds:

    $$\begin{aligned} P\{L_n(\widehat{\phi }_n)-L(\phi _B)>\epsilon \} \le \displaystyle \sum _{k=1}^N P\left\{ \int \left| \widehat{\pi }_k(\varvec{z}) - \pi _k(\varvec{z})\right| \mu (d\varvec{z})>\frac{\epsilon }{N}\right\} \,. \end{aligned}$$

    The proof now follows from an application of Theorem 2 and the Borel-Cantelli lemma.

  (ii)

    The proof of part (ii) is similar and will not be given. \(\square \)

2.4 Some numerical examples

In this section we give some numerical examples to illustrate the performance of our proposed estimators. We also compare our estimators with the one based on complete case analysis (i.e., the estimator that discards the observations with incomplete covariates and uses the fully observed cases only).

Example 1

Here we consider the performance of the estimator \(\widehat{m}_{\widehat{\eta }}\), as given in (8), the estimator \(\widehat{m}_{\widehat{\eta }_{{\mathrm{LS}}}}\), as defined via (10), and the complete case estimator which uses the fully observed data only, i.e., the estimator \(\widehat{m}_{\mathrm{cc}}(\varvec{z}) := \left[ \sum _{i=1}^n \Delta _i Y_i \mathcal {K}\left( (\varvec{z}-\varvec{Z}_i)/h_n\right) \right] \div \left[ \sum _{i=1}^n \Delta _i\mathcal {K}\left( (\varvec{z}-\varvec{Z}_i)/h_n\right) \right] \,. \) To carry out our numerical studies, we generated \(n=150\) observations from each of the following three models:

Model A:

      \(\varvec{Z}\in \mathbb {R}^2,~\hbox {and}~~~~Y= Z_1Z_2+Z_2^2+N(0,0.5)\)

Model B:

      \(\varvec{Z}\in \mathbb {R}^4,~\hbox {and}~~~~Y= -\sin (2Z_1)+Z_2^2+Z_3-\exp (-Z_4)+N(0,0.5)\)

Model C:

      \(\varvec{Z}\in \mathbb {R}^4,~\hbox {and}~~~~Y= Z_1+(2Z_2-1)^2+\frac{\sin (2\pi Z_3)}{2-\sin (2\pi Z_3)}+\sin (2\pi Z_4)+2\cos (2\pi Z_4)+3\sin ^2(2\pi Z_4)+4\cos ^2(2\pi Z_4)+N(0,0.5)\),

where the vector \(\varvec{Z}\) has a Gaussian distribution with mean \(\mathbf{0}\) and covariance matrix \({\varvec{\Sigma }}=(\sigma _{ij})\), where \(\sigma _{ij}= 2^{-|i-j|}\,.\) Here, model A is a toy example whereas models B and C are as in Meier et al. (2009). For model A, we allowed \(Z_2\) to be missing at random (MAR) based on the logistic missing probability mechanism \(P\{\Delta =1|Z_1,Y\}=\exp (1+0.2Z_1-0.5Y)/[1+\exp (1+0.2Z_1-0.5Y)]\,,\) where the coefficients (1, 0.2, \(-0.5\)) in this model were chosen to produce approximately 50 % missing values. For models B and C, we allowed \(Z_3\) and \(Z_4\) to be missing, but not \(Z_1\) and \(Z_2\). The missing probability mechanism for models B and C was taken to be \(P\{\Delta =1|Z_1,Z_2,Y\}=\exp (a+bZ_1+cZ_2+dY)/[1+\exp (a+bZ_1+cZ_2+dY)]\), where \((a,b,c,d)\) is (\(0.1,-0.2,1,0.2\)) for model B and (\(0.8, 0.2, 0.2, -0.1\)) for model C. These choices result in approximately 50 % missing values in the data. For our kernel estimators and their smoothing parameters we used the cross-validation method of Racine and Li (2004), as implemented in the R package “np” (Racine and Hayfield 2008). The parameters of the logistic missing probability mechanism were estimated using nonlinear least squares regression (based on the R package “nls2”). Next, to assess the performance of the three estimators, we computed the empirical \(L_1\) and \(L_2\) errors of each estimator based on the observed data. The entire process above was repeated a total of 300 times (each time using a sample of \(n=150\) observations) and the average \(L_1\) and \(L_2\) errors were computed. The numerical results appear in Table 1; the numbers appearing in brackets are the standard errors computed over the 300 Monte Carlo runs. The last row of Table 1 corresponds to the case where there are no missing data, i.e., \(P\{\Delta =1\}=1\). As Table 1 shows, the estimators \(\widehat{m}_{\widehat{\eta }}\) and \(\widehat{m}_{\widehat{\eta }_{{\mathrm{LS}}}}\) have the ability to outperform the complete case estimator \(\widehat{m}_{\mathrm{cc}}\). We also note that the method based on the least squares estimator of the missing probability mechanism performs better than the one based on kernel regression; this is not surprising, because in this setting the assumed logistic model coincides with the true underlying missing probability mechanism. The standard errors of \(\widehat{m}_{\widehat{\eta }}\) and \(\widehat{m}_{\widehat{\eta }_{{\mathrm{LS}}}}\) are slightly higher than those of the complete case estimator; this is primarily due to the presence of the estimated terms \(\widehat{\eta }\) and \(\widehat{\eta }_{{\mathrm{LS}}}\) that appear in (8) and (10). The boxplots of the 300 empirical \(L_1\) errors appear in the top row of Fig. 1, and the boxplots of the 300 empirical \(L_2\) errors appear in the top row of Fig. 2.
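For reproducibility, a sketch of the data-generating step for Model A is given below; the exact simulation code is not reported in the paper, so the details (for example, reading \(N(0,0.5)\) as a normal with variance 0.5) are our assumptions.

## Model A: bivariate Gaussian Z with Sigma_ij = 2^{-|i-j|}, response
## Y = Z1*Z2 + Z2^2 + N(0, 0.5), and Z2 made MAR via the logistic mechanism.
library(MASS)                                    # for mvrnorm()
set.seed(123)
n     <- 150
Sigma <- outer(1:2, 1:2, function(i, j) 2^(-abs(i - j)))
Z     <- mvrnorm(n, mu = c(0, 0), Sigma = Sigma)
Y     <- Z[, 1] * Z[, 2] + Z[, 2]^2 + rnorm(n, sd = sqrt(0.5))  # N(0, 0.5) read as variance 0.5
eta   <- plogis(1 + 0.2 * Z[, 1] - 0.5 * Y)      # P(Delta = 1 | Z1, Y)
Delta <- rbinom(n, 1, eta)                       # Delta = 0 means Z2 is missing
mean(Delta == 0)                                 # proportion missing (roughly one half)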

Table 1 Empirical errors of models A, B, and C, with logistic missing probability mechanism
Fig. 1

Boxplots of the \(L_1\) errors of various estimators under different models and different missing patterns. In each of these 9 plots, boxplot 1 corresponds to \(\widehat{m}_{\mathrm{cc}}\), 2 corresponds to \(\widehat{m}_{\widehat{\eta }}\), 3 corresponds to \(\widehat{m}_{\widehat{\eta }_{{\mathrm{LS}}}}\), and 4 corresponds to the case with no missing data

Fig. 2

Boxplots of the \(L_2\) errors of various estimators under different models and different missing patterns. In each of these 9 plots, boxplot 1 corresponds to \(\widehat{m}_{\mathrm{cc}}\), 2 corresponds to \(\widehat{m}_{\widehat{\eta }}\), 3 corresponds to \(\widehat{m}_{\widehat{\eta }_{{\mathrm{LS}}}}\), and 4 corresponds to the case with no missing data

Table 2 Empirical errors of models A, B, and C, with cosine missing probability mechanism

Next, we consider some different missing probability mechanisms as follows. For the data corresponding to Model A, once again we allowed \(Z_2\) to be missing at random (MAR), this time based on the missing probability mechanism \(P\{\Delta =1|Z_1,Y\}=\left| \cos \big (e^{0.6 Y}-0.1 \sin (-2Z_1Y+Y^2)\big )\right| \). For models B and C, again we allowed \(Z_3\) and \(Z_4\) to be missing at random (but not \(Z_1\) and \(Z_2\)), and the MAR missing probability mechanism was taken to be \(P\{\Delta =1|Z_1,Z_2,Y\}=0.8\times \big |\cos \big (Z_1+Z_2-Y-2 \sin (Z_1Z_2Y)\big )\big |\). These choices result in approximately 50 % missing data in each of the three models. The results based on 300 Monte Carlo runs (with \(n=150\) as before) appear in Table 2. The last row of Table 2 gives the results when there are no missing data, i.e., \(P\{\Delta =1\}=1\). Table 2 shows that \(\widehat{m}_{\widehat{\eta }}\) continues to perform better than the complete case estimator \(\widehat{m}_{\mathrm{cc}}\), but \(\widehat{m}_{\widehat{\eta }_{{\mathrm{LS}}}}\) is no longer a good estimator. This is not surprising: here we are pretending, as is often done, that the underlying MAR probabilities still follow logistic models (when in fact they are given by the more complicated trigonometric functions defined above), and we are using nonlinear least squares to estimate the parameters of this misspecified logistic model. The boxplots of the 300 empirical \(L_1\) errors corresponding to the left panel of Table 2 appear in the middle row of Fig. 1, whereas those of the empirical \(L_2\) errors of Table 2 appear in the middle row of Fig. 2.
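Written out as R functions, for reference only, these alternative selection probabilities are as follows.

## Non-logistic MAR selection probabilities used in the second set of runs.
eta_modelA  <- function(z1, y)     abs(cos(exp(0.6 * y) - 0.1 * sin(-2 * z1 * y + y^2)))
eta_modelBC <- function(z1, z2, y) 0.8 * abs(cos(z1 + z2 - y - 2 * sin(z1 * z2 * y)))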

Finally, we consider the MCAR setup where \(P\{\Delta =1|\varvec{Z}=\varvec{z},Y=y\}=P\{\Delta =1\}=0.5\). The results, based on 300 Monte Carlo runs (with \(n=150\) for each run), appear in Table 3.

Table 3 Empirical errors of models A, B, and C, with MCAR missing probability mechanism

As this table shows, the complete case estimator \(\widehat{m}_{\mathrm{cc}}\) is the best estimator. This is no fluke; in fact, in view of Theorem 1, under the MCAR assumption the estimator \(\widehat{m}_{\mathrm{cc}}\) is already strongly consistent in the \(L_p\) norm, \(p\ge 1\). We have also produced the boxplots of the empirical \(L_1\) and \(L_2\) errors of the various estimators; these appear in the bottom rows of Figs. 1 and 2.

Example 2

(Pima Indians Diabetes Data) In this example we consider the problem of kernel classification with missing covariates, as described in Sect. 2.3, and apply our results to the above data. This real data set involves 768 patients, 268 of whom have “tested positive” for diabetes (class 1), whereas the remaining 500 patients had negative test results (class 0). A complete description of this data set is available from the UCI Repository of machine learning databases at: http://www.icu.uci.edu/~mlearn/MLRepository.html. All of the covariates are numerical, but there are missing values among them. We focus on one dominant missing pattern, in which the covariates ‘Triceps skin fold thickness’ and ‘2-Hour serum insulin’ are jointly missing for 227 patients. Here, we consider the classification of a patient’s diabetes status (class 0 or class 1) based on the covariates \(Z_1=\) ‘Number of times pregnant’, \(Z_2=\) ‘Plasma glucose concentration’, \(Z_3=\) ‘Diastolic blood pressure’, \(Z_4=\) ‘Triceps skin fold thickness’, and \(Z_5=\) ‘2-Hour serum insulin’. The classifier works by assigning a patient to class 1 if her observed covariate vector \(\varvec{z}\) satisfies \(\widehat{\pi }_1(\varvec{z})>\widehat{\pi }_0(\varvec{z})\), and to class 0 otherwise, where \(\widehat{\pi }_k(\varvec{z}),~k=0,1\), is the estimated class conditional probability given by (13). Once again, we consider three versions of \(\widehat{\pi }_k(\varvec{z})\): (i) the version that takes \(\breve{\eta }\) in (13) to be the kernel estimator \(\widehat{\eta }\) defined in (7), (ii) the version that takes \(\breve{\eta }\) in (13) to be the estimator \(\widehat{\eta }_{{\mathrm{LS}}}\) based on logistic regression, and (iii) the complete case estimator that ignores the incomplete covariates and takes \(\breve{\eta }\equiv 1\) in (13). Table 4 gives the misclassification error rates of the three classifiers based on the resubstitution method, i.e., the error committed by each classifier on the data. As Table 4 shows, both the kernel and the regression based estimators of the missing probability mechanism can produce better results than the naïve complete case estimator.

Table 4 Misclassification error rates of the three classifiers of Example 2

3 Discussion

In this paper we have proposed kernel methods for constructing nonparametric estimators of a regression function with incomplete data. The presence of missing covariates is handled using a Horvitz–Thompson inverse weighting approach, where the weights are themselves estimates of the unknown selection probabilities. When the weights are kernel estimators, our proposed nonparametric estimator of the regression function \(m(\varvec{z})=E[Y|\varvec{Z}=\varvec{z}]\) is \(\widehat{m}_{\widehat{\eta }}(\varvec{z})\), as defined by (8); when they are least-squares estimators, the proposed estimator is \(\widehat{m}_{\widehat{\eta }_{{\mathrm{LS}}}}(\varvec{z})\), defined by (10). Using numerical studies, we have also compared the performance of these two estimators with that of the complete case estimator, which deletes the incomplete observations and uses only the fully observed data, i.e., the estimator

$$\begin{aligned} \widehat{m}_{\mathrm{cc}}(\varvec{z}) := \left[ \sum _{i=1}^n \Delta _i Y_i \mathcal {K}\left( (\varvec{z}-\varvec{Z}_i)/h_n\right) \right] \div \left[ \sum _{i=1}^n \Delta _i\mathcal {K}\left( (\varvec{z}-\varvec{Z}_i)/h_n\right) \right] \,. \end{aligned}$$
(15)

In fact, our numerical results in Sect. 2.4 show that, in general, both \(\widehat{m}_{\widehat{\eta }}(\varvec{z})\) and \(\widehat{m}_{\widehat{\eta }_{{\mathrm{LS}}}}(\varvec{z})\) have the ability to outperform \(\widehat{m}_{\mathrm{cc}}(\varvec{z})\) (except when the missingness is MCAR). This is not surprising, since the complete case estimator in (15) is, in general, an inconsistent estimator of \(m(\varvec{z})\). To appreciate this, observe that (15) is equal to

$$\begin{aligned} \frac{\sum _{i=1}^n \Delta _i Y_i \mathcal {K}\left( (\varvec{z}-\varvec{Z}_i)/h_n\right) }{\sum _{i=1}^n \mathcal {K}\left( (\varvec{z}-\varvec{Z}_i)/h_n\right) }\div \frac{\sum _{i=1}^n \Delta _i\mathcal {K}\left( (\varvec{z}-\varvec{Z}_i)/h_n\right) }{\sum _{i=1}^n \mathcal {K}\left( (\varvec{z}-\varvec{Z}_i)/h_n\right) } \end{aligned}$$

which is the kernel estimator of the ratio \(E[\Delta Y|\varvec{Z}=\varvec{z}]\div E[\Delta |\varvec{Z}=\varvec{z}]\); this ratio is, in general, not equal to \(E[Y|\varvec{Z}=\varvec{z}]=m(\varvec{z})\), not even under the MAR assumption (5). Therefore, the results in Theorems 2 and 3 on the \(L_p\) norms of \(\widehat{m}_{\widehat{\eta }}(\varvec{z})\) and \(\widehat{m}_{\widehat{\eta }_{{\mathrm{LS}}}}(\varvec{z})\) fail to hold for the estimator \(\widehat{m}_{\mathrm{cc}}(\varvec{z})\).
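To make the source of the bias explicit, a short calculation (conditioning first on \((\varvec{Z},Y)\) and using the MAR assumption (5)) gives

$$\begin{aligned} \frac{E[\Delta Y|\varvec{Z}=\varvec{z}]}{E[\Delta |\varvec{Z}=\varvec{z}]} =\frac{E\left[ Y\,E[\Delta |\varvec{Z},Y]\,|\,\varvec{Z}=\varvec{z}\right] }{E\left[ E[\Delta |\varvec{Z},Y]\,|\,\varvec{Z}=\varvec{z}\right] } =\frac{E\left[ Y\,\eta ^*(\varvec{x},Y)\,|\,\varvec{Z}=\varvec{z}\right] }{E\left[ \eta ^*(\varvec{x},Y)\,|\,\varvec{Z}=\varvec{z}\right] }\,, \end{aligned}$$

which is a reweighted version of \(m(\varvec{z})\) in which each value of Y is weighted by \(\eta ^*(\varvec{x},y)\); this reduces to \(m(\varvec{z})\) only when \(\eta ^*\) does not depend on y (for instance, under MCAR), which is consistent with the behaviour of \(\widehat{m}_{\mathrm{cc}}\) observed in Table 3.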

A main issue with any kernel-type estimator is the choice of the bandwidth. In the case of kernel regression estimators, a popular choice of the bandwidth is the one that minimizes the Integrated Squared Error (ISE) of the corresponding kernel regression estimator. Since the ISE depends on the underlying unknown regression and density functions, Härdle and Marron (1985) replaced them with “leave-one-out” estimators, which are then used to define their cross-validation bandwidth selection rule. A more recent approach is based on the cross-validation method of Racine and Li (2004), which is implemented in the R package “np” (see Racine and Hayfield (2008)); in fact, we have utilized this method in our numerical studies. Better results may be achievable for the first estimator [which is defined via (8) and (7)] if the two bandwidths \(h_n\) and \(\lambda _n\) were selected simultaneously to minimize a cross-validation estimate of an error criterion such as the ISE, but we have not pursued that path in this paper.
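As a simple illustration of the kind of bandwidth selection discussed above, the following base-R sketch chooses \(h_n\) for the estimator (8) by leave-one-out cross-validation over a user-supplied grid. It is a minimal stand-in for the more refined data-driven selectors of the np package, and, for simplicity, it keeps the estimated \(\widehat{\eta }\) fixed rather than recomputing it at each split.

## Leave-one-out cross-validation over a grid of candidate bandwidths for (8).
cv_bandwidth <- function(Z, Y, Delta, eta_hat, h_grid) {
  n <- nrow(Z)
  cv <- sapply(h_grid, function(h) {
    pred <- sapply(seq_len(n), function(i) {
      u <- sweep(Z[-i, , drop = FALSE], 2, Z[i, ], "-") / h
      K <- exp(-0.5 * rowSums(u^2))
      w <- ifelse(Delta[-i] == 1, 1 / eta_hat[-i], 0)
      s <- sum(w * K)
      if (s > 0) sum(w * Y[-i] * K) / s else mean(Y[-i])
    })
    mean((Y - pred)^2)                  # leave-one-out squared prediction error
  })
  h_grid[which.min(cv)]
}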

4 Appendix

We first state a number of technical lemmas that will be used in the proofs of our main results.

Lemma 1

Let K be a regular kernel, and let \(\mu \) be any probability measure on the Borel sets of \(\mathbb {R}^{d+s}\). Then, there exists a finite positive constant \(c_1\), only depending on the kernel K, such that for all \(h_n>0\)

$$\begin{aligned} \sup \limits _{\varvec{u}\in \mathbb {R}^{d+s}} \displaystyle \int \frac{K\left( \frac{\varvec{z}-\varvec{u}}{h_n}\right) }{E\left[ K\left( \frac{\varvec{z} - \varvec{Z}}{h_n}\right) \right] }\,\mu (d\varvec{z})\le c_1\,. \end{aligned}$$

This lemma and its proof are given in Devroye and Krzyżak (1989, Lemma 1). Also, see Devroye and Wagner (1980) as well as Spiegelman and Sacks (1980).

Lemma 2

Let K be a regular kernel, and put \(m_n^*(\varvec{z})=\sum _{i=1}^n Y_i K(\frac{\varvec{z}-\varvec{Z}_i}{h_n})/(nE[K(\frac{\varvec{z}-\varvec{Z}}{h_n})])\), where \((\varvec{Z}_i,Y_i)\)’s are i.i.d. \(\mathbb {R}^{d+s}\times [-M,M]\)-valued random vectors. Then, for every \(\epsilon >0\) and n large enough,

$$\begin{aligned} P\left\{ \displaystyle \int \left| m_n^*(\varvec{z})-m(\varvec{z}) \right| \mu (d\varvec{z})>\epsilon \right\} \le \exp \left\{ \frac{-n\epsilon ^2}{64 M^2 c_1^2}\right\} \,. \end{aligned}$$

Here, \(\mu \) is the probability measure of \(\varvec{Z}\), \(m(\varvec{z})=E[Y|\varvec{Z}=\varvec{z}]\), and \(c_1\) is as in Lemma 1.

For a proof of this result see, for example, Györfi et al. (2002, Lemma 23.9).

Lemma 3

Let \(\phi (\varvec{X}_i,Y_i)=\eta ^*(\varvec{X}_i,Y_i)f(\varvec{X}_i)P(Y=Y_i|Y_i)\), where \(\eta ^*(\varvec{X}_i,Y_i)\) is as in (5), and put \(\widehat{\phi }(\varvec{X}_i,Y_i)=\lambda _n^{-d} n^{-1}\sum _{j=1}^n\Delta _j I\{Y_j=Y_i\}\mathcal {H}((\varvec{X}_i-\varvec{X}_j)/\lambda _n)\), where \(\lambda _n\) and the kernel \(\mathcal {H}\) are as in (7). Suppose that conditions A2, A3, and A4 hold.

Then,

$$\begin{aligned} \left| E\left[ \widehat{\phi }(\varvec{X}_i,Y_i)|\varvec{X}_i,Y_i\right] - \phi (\varvec{X}_i,Y_i)\right| \quad \overset{\text {a.s.}}{\le } \quad c_2\, \lambda _n \end{aligned}$$

where \(c_2\) is a positive constant not depending on n.

Proof of Lemma 3

The proof is similar to that of Mojirsheibani (2012, Lemma 3) and goes as follows. First note that

$$\begin{aligned}&E\left[ \widehat{\phi }(\varvec{X}_i,Y_i) \Big |\varvec{X}_i,Y_i\right] -\phi (\varvec{X}_i,Y_i)\\&\quad = E\left[ \lambda _n^{-d} n^{-1}\sum _{j=1}^n \Delta _j I\{Y_j=Y_i\}\mathcal {H}\left( \frac{\varvec{X}_i-\varvec{X}_j}{\lambda _n} \right) \Bigg |\varvec{X}_i,Y_i\right] -\phi (\varvec{X}_i,Y_i)\\&\quad \overset{\text {a.s.}}{=} \lambda _n^{-d} E \left[ I\{Y_1=Y_i\}\mathcal {H}\left( \frac{\varvec{X}_i - \varvec{X}_1}{\lambda _n}\right) E\left[ \Delta _1 |\varvec{X}_1,Y_1, \varvec{X}_i,Y_i\right] \Bigg |\varvec{X}_i,Y_i\right] -\phi (\varvec{X}_i,Y_i)\\&\quad =\lambda _n^{-d} E\left[ I\{Y_1=Y_i\}\mathcal {H} \left( \frac{\varvec{X}_i-\varvec{X}_1}{\lambda _n}\right) \eta ^* \left( \varvec{X}_1, Y_1\right) \Bigg |\varvec{X}_i,Y_i\right] \\&\qquad - \eta ^*(\varvec{X}_i,Y_i)f(\varvec{X}_i)P(Y=Y_i|Y_i)\\&\qquad (\text {since } \Delta _1 \text { is independent of } (\varvec{X}_i,Y_i))\\&\quad =\lambda _n^{-d} E\left[ \eta ^*\left( \varvec{X}_1, Y_1\right) I\{Y_1=Y_i\}\mathcal {H}\left( \frac{\varvec{X}_i - \varvec{X}_1}{\lambda _n}\right) \Bigg |\varvec{X}_i,Y_i\right] \\&\qquad \pm \lambda _n^{-d} E\left[ \eta ^*\left( \varvec{X}_i, Y_i\right) I\{Y_1=Y_i\}\mathcal {H}\left( \frac{\varvec{X}_i - \varvec{X}_1}{\lambda _n}\right) \Bigg |\varvec{X}_i,Y_i\right] \\&\qquad - \eta ^*(\varvec{X}_i,Y_i)f(\varvec{X}_i)P(Y=Y_i|Y_i)\\&\quad = E\left[ \lambda _n^{-d} \left( \eta ^*\left( \varvec{X}_1, Y_1\right) -\eta ^*\left( \varvec{X}_i,Y_i\right) \right) I\{Y_1=Y_i\} \mathcal {H}\left( \frac{\varvec{X}_i-\varvec{X}_1}{\lambda _n}\right) \Bigg |\varvec{X}_i,Y_i\right] \\&\qquad + E\left[ \eta ^*\left( \varvec{X}_i,Y_i\right) \left( \lambda _n^{-d} I\{Y_1=Y_i\}\mathcal {H} \left( \frac{\varvec{X}_i - \varvec{X}_1}{\lambda _n}\right) -f(\varvec{X}_i)P(Y=Y_i|Y_i)\right) \bigg | \varvec{X}_i,Y_i\right] \\&\quad :=\Delta _{n,i}(1)+\Delta _{n,i}(2)~~~(\text {say})\,. \end{aligned}$$

Using a one-term Taylor expansion, we can bound \(\Delta _{n,i}(1)\) as follows

$$\begin{aligned} \left| \Delta _{n,i}(1)\right|\le & {} \lambda _n^{-d} E\left[ \sum _{k=1}^d \left| \frac{\partial \eta ^*(\varvec{X}^{\dagger },Y_i)}{\partial X_k}\right| \left| X_{i,k}-X_{1,k} \right| \mathcal {H} \left( \frac{\varvec{X}_i - \varvec{X}_1}{\lambda _n}\right) \bigg |\varvec{X}_i,Y_i\right] \,, \end{aligned}$$

where \(X_{1,k}\) and \(X_{i,k}\) are the kth components of \(\varvec{X}_1\) and \(\varvec{X}_i\), respectively, and \(\varvec{X}^{\dagger }\) is on the interior of the line segment joining \(\varvec{X}_1\) and \(\varvec{X}_i\). Therefore,

$$\begin{aligned} \left| \Delta _{n,i}(1)\right|\le & {} \alpha _1 E\left[ \sum _{k=1}^d \left| X_{i,k}-X_{1,k}\right| \lambda _n^{-d} \mathcal {H}\left( \frac{\varvec{X}_i - \varvec{X}_1}{\lambda _n}\right) \bigg |\varvec{X}_i,Y_i\right] \\&\left( \text {where }\alpha _1=\max \limits _{1\le k\le d} \, \sup \limits _{\varvec{v}\in \mathbb {R}^d,y\in \mathcal {Y}}\,\left| \frac{\partial \eta ^*(\varvec{x},y)}{\partial x_k}\right| _{\varvec{x}=\varvec{v}}\right) \\= & {} \alpha _1\sum _{k=1}^d \int _{\mathbb {R}^d}\left| X_{i,k}-x_k\right| \lambda _n^{-d} \mathcal {H}\left( \frac{\varvec{X}_i-\varvec{x}}{\lambda _n}\right) f(\varvec{x})d\varvec{x}\\\le & {} \alpha _1 \Vert f\Vert _{\infty } \sum _{k=1}^d \int _{\mathbb {R}^d}\left| u_k\right| \lambda _n \mathcal {H}\left( \varvec{u}\right) d\varvec{u}~~~(\text {by condition}\,\mathbf{A3})\\\le & {} \alpha \lambda _n\,, ~~~(\text {by condition } \mathbf{A2})\,. \end{aligned}$$

To bound the term \(\Delta _{n,i}(2)\), first note that

$$\begin{aligned} \Delta _{n,i}(2)= & {} \eta ^*\left( \varvec{X}_i,Y_i\right) E\left[ \lambda _n^{-d} I\{Y_1=Y_i\}\mathcal {H} \left( \frac{\varvec{X}_i - \varvec{X}_1}{\lambda _n}\right) \right. \\&\left. -f(\varvec{X}_i)P(Y=Y_i|Y_i) \bigg |\varvec{X}_i,Y_i\right] \end{aligned}$$

Now, let \(f_y(\varvec{x})\) be the conditional pdf of \(\varvec{X}\), given \(Y=y\), and observe that

$$\begin{aligned}&E\left[ \lambda _n^{-d} I\{Y_i=Y_1\}\mathcal {H}\left( \frac{\varvec{X}_i - \varvec{X}_1}{\lambda _n}\right) \bigg |\varvec{X}_i,Y_i\right] \\&\quad = \sum _{y\in \mathcal {Y}}P(Y=y)I\left\{ Y_i=y\right\} \int _{\mathbb {R}^d}\lambda _n^{-d} \mathcal {H}\left( \frac{\varvec{X}_i - \varvec{x}}{\lambda _n}\right) f_y(\varvec{x})\, d\varvec{x} \end{aligned}$$

Therefore, by condition A2 and the fact that \(P(Y = Y_i|Y_i) = \sum _{y\in \mathcal {Y}} P(Y=y)I\{Y_i=y\}\), one finds

$$\begin{aligned} \Delta _{n,i}(2)= & {} \eta ^*\left( \varvec{X}_i,Y_i\right) \sum _{y\in \mathcal {Y}}P(Y=y)I\left\{ Y_i=y\right\} \\&\quad \left[ \int _{\mathbb {R}^d} \lambda _n^{-d}\mathcal {H}\left( \frac{\varvec{X}_i-\varvec{x}}{\lambda _n}\right) \left( f_y(\varvec{x})-f_y(\varvec{X}_i)\right) d\varvec{x}\right] \end{aligned}$$

Now, a one-term Taylor expansion and the fact that \(|\eta ^*(\varvec{X}_i,Y_i)|\le 1\) yield

$$\begin{aligned} \left| \Delta _{n,i}(2)\right|\le & {} \sum _{y\in \mathcal {Y}}P(Y=y)\left| \int _{\mathbb {R}^d} \mathcal {H}\left( \varvec{u}\right) \left( f_y(\varvec{X}_i-\lambda _n \varvec{u})-f_y(\varvec{X}_i)\right) d\varvec{u}\right| \\\le & {} \lambda _n \sum _{k=1}^d \int _{\mathbb {R}^d}\left| u_k\right| \mathcal {H}(\varvec{u})d\varvec{u}\sum _{y\in \mathcal {Y}}P(Y=y) \sup \limits _{\varvec{v}\in \mathbb {R}^d} \left| \frac{\partial f_y(\varvec{x})}{\partial x_k}\right| _{\varvec{x}=\varvec{v}}\\\le & {} d\lambda _n~ \max \limits _{1\le k \le d}~ \sup \limits _{\varvec{v}\in \mathbb {R}^d} \left| \frac{\partial f(\varvec{x})}{\partial x_k}\right| _{\varvec{x} = \varvec{v}}\int _{\mathbb {R}^d}\left| u_k\right| \mathcal {H}(\varvec{u})d\varvec{u}\\\le & {} \beta \, \lambda _n~~~(\text {by conditions } \mathbf{A2} \text { and } \mathbf{A3}). \end{aligned}$$

This completes the proof of Lemma 3. \(\square \)

Proof of Theorem 2

To prove Theorem 2, we first define the following kernel-type estimators:

$$\begin{aligned} \tilde{m}_{\eta ^*}(\varvec{z})= & {} \dfrac{\sum _{i=1}^n\frac{\Delta _i Y_i}{\eta ^*(\varvec{X}_i,Y_i)}\mathcal {K}\left( \frac{\varvec{z}-\varvec{Z}_i}{h_n} \right) }{\sum _{i=1}^n\mathcal {K}\left( \frac{\varvec{z}-\varvec{Z}_i}{h_n}\right) } \,,~~~~\overline{m}_{\eta ^*}(\varvec{z})=\dfrac{\sum _{i=1}^n \frac{\Delta _i}{\eta ^*(\varvec{X}_i,Y_i)}\mathcal {K}\left( \frac{\varvec{z} - \varvec{Z}_i}{h_n}\right) }{\sum _{i=1}^n\mathcal {K}\left( \frac{\varvec{z} - \varvec{Z}_i}{h_n}\right) }, \nonumber \\ \end{aligned}$$
(16)
$$\begin{aligned} \tilde{m}_{\widehat{\eta }}(\varvec{z})= & {} \dfrac{\sum _{i=1}^n \frac{\Delta _i Y_i}{\widehat{\eta }(\varvec{X}_i,Y_i)} \mathcal {K}\left( \frac{\varvec{z} - \varvec{Z}_i}{h_n}\right) }{\sum _{i=1}^n \mathcal {K}\left( \frac{\varvec{z}-\varvec{Z}_i}{h_n}\right) }\,,\text { and} ~~~~ \overline{m}_{\widehat{\eta }}(\varvec{z}) = \dfrac{\sum _{i=1}^n \frac{\Delta _i}{\widehat{\eta }(\varvec{X}_i,Y_i)}\mathcal {K}\left( \frac{\varvec{z} - \varvec{Z}_i}{h_n}\right) }{\sum _{i=1}^n\mathcal {K}\left( \frac{\varvec{z}-\varvec{Z}_i}{h_n}\right) }.\nonumber \\ \end{aligned}$$
(17)

Using the expressions in (17), and upon observing that \(|\tilde{m}_{\widehat{\eta }}(\varvec{z})/\overline{m}_{\widehat{\eta }} (\varvec{z})|\le M\), one obtains

$$\begin{aligned} \left| \widehat{m}_{\widehat{\eta }}(\varvec{z})-m(\varvec{z})\right| = \left| \frac{\tilde{m}_{\widehat{\eta }}(\varvec{z})}{\overline{m}_{\widehat{\eta }}(\varvec{z})}-\frac{m(\varvec{z})}{1}\right|\le & {} \left| \frac{\tilde{m}_{\widehat{\eta }} (\varvec{z})}{\overline{m}_{\widehat{\eta }}(\varvec{z})}\right| \left| \overline{m}_{\widehat{\eta }}(\varvec{z})-1\right| +\left| \tilde{m}_{\widehat{\eta }}(\varvec{z})-m(\varvec{z})\right| \\\le & {} M\left| \overline{m}_{\widehat{\eta }}(\varvec{z}) - 1\right| +\left| \tilde{m}_{\widehat{\eta }}(\varvec{z})-m(\varvec{z})\right| \,. \end{aligned}$$

Therefore, with \(\tilde{m}_{\widehat{\eta }}(\varvec{z})\) and \(\overline{m}_{\widehat{\eta }}(\varvec{z})\) as in (17), and \(\tilde{m}_{\eta ^*}(\varvec{z})\) and \(\overline{m}_{\eta *}(\varvec{z})\) as in (16), we have that for every \(\epsilon >0\)

$$\begin{aligned}&P\left\{ \int \left| \widehat{m}_{\widehat{\eta }}(\varvec{z}) - m(\varvec{z})\right| ^p \mu (d\varvec{z})>\epsilon \right\} \\&\quad \le P\left\{ 2^{p-1}M^p\int \left| \overline{m}_{\widehat{\eta }} (\varvec{z}) - 1\right| ^p\mu (d\varvec{z}) > \frac{\epsilon }{2} \right\} \\&\qquad + P\left\{ 2^{p-1}\int \left| \tilde{m}_{\widehat{\eta }}(\varvec{z}) - m(\varvec{z})\right| ^p\mu (d\varvec{z})>\frac{\epsilon }{2}\right\} \\&\quad \le P\left\{ 2^{2p-2}M^p\int \left| \overline{m}_{\widehat{\eta }} (\varvec{z})- \overline{m}_{\eta *}(\varvec{z})\right| ^p\mu (d\varvec{z})>\frac{\epsilon }{4} \right\} \\&\qquad +P\left\{ 2^{2p-2}M^p\int \left| \overline{m}_{\eta *}(\varvec{z}) - 1\right| ^p\mu (d\varvec{z})>\frac{\epsilon }{4}\right\} \\&\qquad +P\left\{ 2^{2p-2}\int \left| \tilde{m}_{\widehat{\eta }} (\varvec{z})- \tilde{m}_{\eta ^*}(\varvec{z})\right| ^p\mu (d\varvec{z}) > \frac{\epsilon }{4}\right\} \\&\qquad +P\left\{ 2^{2p-2}\int \left| \tilde{m}_{\eta ^*} (\varvec{z})-m(\varvec{z})\right| ^p\mu (d\varvec{z})>\frac{\epsilon }{4}\right\} \\&\quad \le P\left\{ 2^{2p-2}M^p\left| \frac{1}{\wedge _{i=1}^n \widehat{\eta }(\varvec{X}_i,Y_i)}+\frac{1}{\eta _0}\right| ^{p-1} \int \left| \overline{m}_{\widehat{\eta }}(\varvec{z})- \overline{m}_{\eta *}(\varvec{z})\right| \mu (d\varvec{z}) > \frac{\epsilon }{4}\right\} \\&\qquad +P\left\{ 2^{2p-2}M^p\left| \frac{1}{\eta _0} + 1\right| ^{p-1}\int \left| \overline{m}_{\eta *}(\varvec{z}) - 1\right| \mu (d\varvec{z})>\frac{\epsilon }{4}\right\} \\&\qquad +P\left\{ 2^{2p-2}\left| \frac{M}{\wedge _{i=1}^n \widehat{\eta }(\varvec{X}_i,Y_i)}+\frac{M}{\eta _0}\right| ^{p-1} \int \left| \tilde{m}_{\widehat{\eta }}(\varvec{z})- \tilde{m}_{\eta ^*}(\varvec{z})\right| \mu (d\varvec{z}) > \frac{\epsilon }{4}\right\} \\&\qquad +P\left\{ 2^{2p-2}\left| \frac{M}{\eta _0}+M \right| ^{p-1}\int \left| \tilde{m}_{\eta ^*}(\varvec{z}) - m(\varvec{z})\right| \mu (d\varvec{z})>\frac{\epsilon }{4}\right\} \\&\quad := P_{n,1}+P_{n,2}+P_{n,3}+P_{n,4}, ~~~\text {(say).} \end{aligned}$$

But, by (3), in view of the result of Devroye and Krzyżak (1989),

$$\begin{aligned} P_{n,2}+P_{n,4} \le 8 e^{-bn}\,, \end{aligned}$$
(18)

where \(b\equiv b(\epsilon )={\min }^2\big (\epsilon ^2 \eta _0^{2p}/[2^{4p+7}M^{2p}(1+\eta _0)^{2p-2}(1+c_1)]\,,\epsilon \eta _0^{p}/[2^{2p+5}M^{p}(1+\eta _0)^{p-1}(1+c_1)]\big ).\) To deal with the term \(P_{n,1}\), first observe that

$$\begin{aligned}&\left| \overline{m}_{\widehat{\eta }}(\varvec{z})- \overline{m}_{\eta *}(\varvec{z})\right| \\&\quad = \left| \frac{\sum _{i=1}^n \frac{\Delta _i}{\widehat{\eta }(\varvec{X}_i,Y_i)}\mathcal {K}\left( \frac{\varvec{z}-\varvec{Z}_i}{h_n}\right) }{\sum _{i=1}^n\mathcal {K}\left( \frac{\varvec{z}-\varvec{Z}_i}{h_n}\right) }-\frac{\sum _{i=1}^n \frac{\Delta _i}{\eta ^*(\varvec{X}_i,Y_i)}\mathcal {K}\left( \frac{\varvec{z}-\varvec{Z}_i}{h_n}\right) }{\sum _{i=1}^n\mathcal {K}\left( \frac{\varvec{z}-\varvec{Z}_i}{h_n}\right) }\right| \\&\quad =\left| \frac{\sum _{i=1}^n \Delta _i\left( \frac{1}{\widehat{\eta }(\varvec{X}_i,Y_i)}-\frac{1}{\eta ^*(\varvec{X}_i,Y_i)}\right) \mathcal {K}\left( \frac{\varvec{z}-\varvec{Z}_i}{h_n}\right) }{\sum _{i=1}^n\mathcal {K}\left( \frac{\varvec{z}-\varvec{Z}_i}{h_n}\right) }\right| \\&\quad \le \left| \frac{\sum _{i=1}^n \Delta _i\left( \frac{1}{\widehat{\eta }(\varvec{X}_i,Y_i)}-\frac{1}{\eta ^*(\varvec{X}_i,Y_i)}\right) \mathcal {K}\left( \frac{\varvec{z}-\varvec{Z}_i}{h_n}\right) }{n E\left[ \mathcal {K}\left( \frac{\varvec{z}-\varvec{Z}}{h_n}\right) \right] }\right| \\&\qquad +\left| \sum _{i=1}^n \Delta _i\left( \frac{1}{\widehat{\eta }(\varvec{X}_i,Y_i)} - \frac{1}{\eta ^*(\varvec{X}_i, Y_i)}\right) \mathcal {K}\left( \frac{\varvec{z} - \varvec{Z}_i}{h_n}\right) \right. \\&\quad \qquad \left. \left( \frac{1}{\sum _{i=1}^n\mathcal {K}\left( \frac{\varvec{z} - \varvec{Z}_i}{h_n}\right) }-\frac{1}{n E\left[ \mathcal {K}\left( \frac{\varvec{z}-\varvec{Z}}{h_n}\right) \right] }\right) \right| \\&\quad := S_{n,1}(\varvec{z})+S_{n,2}(\varvec{z}),~~~\text {(say).} \end{aligned}$$

However, we can bound \(\int S_{n,1}(\varvec{z})\mu (d\varvec{z})\) as follows

$$\begin{aligned} \int S_{n,1}(\varvec{z})\mu (d\varvec{z})\le & {} \int \frac{\sum _{i=1}^n\left| \frac{1}{\eta ^*(\varvec{X}_i,Y_i)}-\frac{1}{\widehat{\eta }(\varvec{X}_i,Y_i)}\right| \Delta _i\mathcal {K}\left( \frac{\varvec{z}-\varvec{Z}_i}{h_n}\right) }{n E\left[ \mathcal {K}\left( \frac{\varvec{z}-\varvec{Z}}{h_n}\right) \right] }\mu (d\varvec{z})\nonumber \\\le & {} \left( \sup \limits _{\varvec{w}}\int \frac{\mathcal {K}\left( \frac{\varvec{z}-\varvec{w}}{h_n}\right) }{E\left[ \mathcal {K}\left( \frac{\varvec{z}-\varvec{Z}}{h_n}\right) \right] } \mu (d\varvec{z})\right) \nonumber \\&\quad \left( \frac{1}{n}\sum _{i=1}^n \left| \frac{1}{\eta ^*(\varvec{X}_i,Y_i)}-\frac{1}{\widehat{\eta }(\varvec{X}_i,Y_i)}\right| \right) \nonumber \\\le & {} c_1 \left( \frac{1}{n}\sum _{i=1}^n \left| \frac{1}{\eta ^*(\varvec{X}_i,Y_i)}-\frac{1}{\widehat{\eta }(\varvec{X}_i,Y_i)}\right| \right) \,, \end{aligned}$$
(19)

where the last line follows from Lemma 1 in the Appendix. To bound \(\int S_{n,2}(\varvec{z})\mu (d\varvec{z})\), we have

$$\begin{aligned}&\int S_{n,2}(\varvec{z})\mu (d\varvec{z}) \nonumber \\&\quad \le \max \limits _{1\le i\le n} \left| \frac{1}{\eta ^*(\varvec{X}_i,Y_i)}-\frac{1}{\widehat{\eta } (\varvec{X}_i,Y_i)}\right| ~\int \left| \frac{\sum _{i=1}^n\mathcal {K}\left( \frac{\varvec{z} - \varvec{Z}_i}{h_n}\right) }{n E\left[ \mathcal {K}\left( \frac{\varvec{z} - \varvec{Z}}{h_n}\right) \right] }-1\right| \mu (d\varvec{z}).\qquad \end{aligned}$$
(20)

Therefore, using (19) and (20) we have

$$\begin{aligned} P_{n,1}\le & {} P\left\{ 2^{2p-2}M^p\left| \frac{1}{\eta _0} + \frac{1}{\wedge _{i=1}^n\widehat{\eta }(\varvec{X}_i,Y_i)}\right| ^{p-1}\int \left( S_{n,1}(\varvec{z})+S_{n,2}(\varvec{z})\right) \mu (d\varvec{z}) > \frac{\epsilon }{4}\right\} \nonumber \\\le & {} P\left\{ \left[ 2^{2p-2}M^p\left| \frac{1}{\eta _0} + \frac{1}{\wedge _{i=1}^n\widehat{\eta }(\varvec{X}_i,Y_i)}\right| ^{p-1} \frac{c_1}{n}\sum _{i=1}^n\left| \frac{\eta ^*(\varvec{X}_i,Y_i) - \widehat{\eta }(\varvec{X}_i,Y_i)}{\eta ^*(\varvec{X}_i,Y_i)\widehat{\eta } (\varvec{X}_i,Y_i)}\right| >\frac{\epsilon }{8}\right] \right. \nonumber \\&\qquad \left. \cap \bigcap _{i=1}^n\left[ \widehat{\eta }(\varvec{X}_i,Y_i)\ge \frac{\eta _0}{2}\right] \right\} \nonumber \\&+P\left\{ \left[ 2^{2p-2}M^p \left| \frac{1}{\eta _0}+\frac{1}{\wedge _{i=1}^n\widehat{\eta }(\varvec{X}_i,Y_i)}\right| ^{p-1} \max \limits _{1\le i\le n} \left| \frac{1}{\eta ^*(\varvec{X}_i,Y_i)} - \frac{1}{\widehat{\eta }(\varvec{X}_i,Y_i)}\right| \right. \right. \nonumber \\&\times \left. \left. \int \left| \frac{\sum _{i=1}^n\mathcal {K}\left( \frac{\varvec{z}-\varvec{Z}_i}{h_n}\right) }{n E\left[ \mathcal {K}\left( \frac{\varvec{z}-\varvec{Z}}{h_n}\right) \right] }-1\right| \mu (d\varvec{z}) >\frac{\epsilon }{8}\right] \cap \bigcap _{i=1}^n\left[ \widehat{\eta }(\varvec{X}_i,Y_i)\ge \frac{\eta _0}{2}\right] \right\} \nonumber \\&+ P\left\{ \bigcup _{i=1}^n \left[ \widehat{\eta }(\varvec{X}_i,Y_i)< \frac{\eta _0}{2}\right] \right\} \nonumber \\\le & {} P\left\{ \frac{1}{n}\sum _{i=1}^n\left| \widehat{\eta }(\varvec{X}_i,Y_i)-\eta ^*(\varvec{X}_i,Y_i)\right| >\frac{\epsilon \eta _0^{p+1}}{2^{2p+2}3^{p-1}M^p c_1}\right\} \end{aligned}$$
(21)
$$\begin{aligned}&+P\left\{ \int \left| \frac{\sum _{i=1}^n\mathcal {K}\left( \frac{\varvec{z}-\varvec{Z}_i}{h_n}\right) }{n E\left[ \mathcal {K}\left( \frac{\varvec{z}-\varvec{Z}}{h_n}\right) \right] }-1\right| \mu (d\varvec{z})>\frac{\epsilon \eta _0^p}{2^{2p+1}3^p M^p}\right\} \end{aligned}$$
(22)
$$\begin{aligned}&+ P\left\{ \bigcup _{i=1}^n \left[ \widehat{\eta }(\varvec{X}_i,Y_i)< \frac{\eta _0}{2}\right] \right\} . \end{aligned}$$
(23)

Here, the last step uses the fact that, on the event \(\bigcap _{i=1}^n[\widehat{\eta }(\varvec{X}_i,Y_i)\ge \eta _0/2]\), one has \(|1/\eta _0+1/\wedge _{i=1}^n\widehat{\eta }(\varvec{X}_i,Y_i)|^{p-1}\le (3/\eta _0)^{p-1}\), \(\eta ^*(\varvec{X}_i,Y_i)\,\widehat{\eta }(\varvec{X}_i,Y_i)\ge \eta _0^2/2\), and \(|1/\eta ^*(\varvec{X}_i,Y_i)-1/\widehat{\eta }(\varvec{X}_i,Y_i)|\le 3/\eta _0\). The bound for the term \(P_{n,3}\) follows in a fashion similar to that of \(P_{n,1}\). More specifically, we first note that

$$\begin{aligned}&\left| \tilde{m}_{\widehat{\eta }}(\varvec{z})- \tilde{m}_{\eta ^*}(\varvec{z})\right| \\&\quad = \left| \frac{\sum _{i=1}^n\Delta _i Y_i\left( \frac{1}{\widehat{\eta }(\varvec{X}_i,Y_i)}-\frac{1}{\eta ^*(\varvec{X}_i,Y_i)}\right) \mathcal {K}\left( \frac{\varvec{z}-\varvec{Z}_i}{h_n}\right) }{\sum _{i=1}^n\mathcal {K}\left( \frac{\varvec{z}-\varvec{Z}_i}{h_n}\right) }\right| \\&\quad \le \left| \frac{\sum _{i=1}^n\Delta _i Y_i\left( \frac{1}{\widehat{\eta }(\varvec{X}_i,Y_i)}-\frac{1}{\eta ^*(\varvec{X}_i,Y_i)}\right) \mathcal {K}\left( \frac{\varvec{z}-\varvec{Z}_i}{h_n}\right) }{nE\left[ \mathcal {K}\left( \frac{\varvec{z}-\varvec{Z}}{h_n}\right) \right] }\right| \\&\qquad +\left| \sum _{i=1}^n\Delta _i Y_i\left( \frac{1}{\widehat{\eta }(\varvec{X}_i,Y_i)}-\frac{1}{\eta ^*(\varvec{X}_i,Y_i)}\right) \mathcal {K}\left( \frac{\varvec{z}-\varvec{Z}_i}{h_n}\right) \right. \\&\qquad \left. \times \left( \frac{1}{\sum _{i=1}^n\mathcal {K}\left( \frac{\varvec{z}-\varvec{Z}_i}{h_n}\right) }-\frac{1}{nE\left[ \mathcal {K}\left( \frac{\varvec{z}-\varvec{Z}}{h_n}\right) \right] }\right) \right| \\&\quad := S'_{n,1}(\varvec{z})+S'_{n,2}(\varvec{z})~~~\text { (say).} \end{aligned}$$

The integral \(\int S'_{n,1}(\varvec{z})\mu (d\varvec{z})\) can be upper-bounded in the same way as the term \(\int S_{n,1}(\varvec{z})\mu (d\varvec{z})\) in (19); in fact, we have

$$\begin{aligned} \int S'_{n,1}(\varvec{z})\mu (d\varvec{z})\le & {} M\cdot c_1 \left( \frac{1}{n}\sum _{i=1}^n \left| \frac{1}{\widehat{\eta } (\varvec{X}_i,Y_i)}-\frac{1}{\eta ^*(\varvec{X}_i,Y_i)}\right| \right) \,. \end{aligned}$$

For the term \(\int S'_{n,2}(\varvec{z})\mu (d\varvec{z})\), we have

$$\begin{aligned}&\int S'_{n,2}(\varvec{z})\mu (d\varvec{z}) \\&\quad \le M\cdot \max \limits _{1\le i\le n} \left| \frac{1}{\widehat{\eta }(\varvec{X}_i,Y_i)}-\frac{1}{\eta ^*(\varvec{X}_i,Y_i)}\right| \int \left| \frac{\sum _{i=1}^n\mathcal {K}\left( \frac{\varvec{z}-\varvec{Z}_i}{h_n}\right) }{n E\left[ \mathcal {K}\left( \frac{\varvec{z}-\varvec{Z}}{h_n}\right) \right] }-1\right| \mu (d\varvec{z})\,. \end{aligned}$$

Therefore, \(P_{n,3}\) can be bounded as follows

$$\begin{aligned} P_{n,3}\le & {} P\left\{ 2^{2p-2}\left| \frac{M}{\eta _0}+\frac{M}{\wedge _{i=1}^n\widehat{\eta }(\varvec{X}_i,Y_i)}\right| ^{p-1}\int \left( S_{n,1}'(\varvec{z})+S_{n,2}'(\varvec{z})\right) \mu (d\varvec{z})>\frac{\epsilon }{4}\right\} \nonumber \\\le & {} P\left\{ \left[ 2^{2p-2}\left| \frac{M}{\eta _0}+\frac{M}{\wedge _{i=1}^n\widehat{\eta }(\varvec{X}_i,Y_i)}\right| ^{p-1}\frac{c_1}{n}\sum _{i=1}^n\left| \frac{\eta ^*(\varvec{X}_i,Y_i)-\widehat{\eta }(\varvec{X}_i,Y_i)}{\eta ^*(\varvec{X}_i,Y_i)\widehat{\eta }(\varvec{X}_i,Y_i)}\right| >\frac{\epsilon }{8}\right] \right. \nonumber \\&\qquad \left. \cap \bigcap _{i=1}^n\left[ \widehat{\eta }(\varvec{X}_i,Y_i)\ge \frac{\eta _0}{2}\right] \right\} \nonumber \\&+P\left\{ \left[ 2^{2p-2} \left| \frac{M}{\eta _0}+\frac{M}{\wedge _{i=1}^n\widehat{\eta }(\varvec{X}_i,Y_i)}\right| ^{p-1} \max \limits _{1\le i\le n} \left| \frac{1}{\eta ^*(\varvec{X}_i,Y_i)}-\frac{1}{\widehat{\eta }(\varvec{X}_i,Y_i)}\right| \right. \right. \nonumber \\&\qquad \times \left. \left. \int \left| \frac{\sum _{i=1}^n\mathcal {K}\left( \frac{\varvec{z}-\varvec{Z}_i}{h_n}\right) }{n E\left[ \mathcal {K}\left( \frac{\varvec{z}-\varvec{Z}}{h_n}\right) \right] }-1\right| \mu (d\varvec{z}) >\frac{\epsilon }{8}\right] \cap \bigcap _{i=1}^n\left[ \widehat{\eta }(\varvec{X}_i,Y_i)\ge \frac{\eta _0}{2}\right] \right\} \nonumber \\&+ P\left\{ \bigcup _{i=1}^n \left[ \widehat{\eta }(\varvec{X}_i,Y_i)< \frac{\eta _0}{2}\right] \right\} \nonumber \\\le & {} P\left\{ \frac{1}{n}\sum _{i=1}^n\left| \widehat{\eta }(\varvec{X}_i,Y_i)-\eta ^*(\varvec{X}_i,Y_i)\right| >\frac{\epsilon \eta _0^{p+1}}{2^{2p+2}3^{p-1}M^p c_1}\right\} \end{aligned}$$
(24)
$$\begin{aligned}&+\,P\left\{ \int \left| \frac{\sum _{i=1}^n\mathcal {K}\left( \frac{\varvec{z}-\varvec{Z}_i}{h_n}\right) }{n E\left[ \mathcal {K}\left( \frac{\varvec{z}-\varvec{Z}}{h_n}\right) \right] }-1\right| \mu (d\varvec{z})>\frac{\epsilon \eta _0^p}{2^{2p+1}3^p M^p}\right\} \end{aligned}$$
(25)
$$\begin{aligned}&+\, P\left\{ \bigcup _{i=1}^n \left[ \widehat{\eta }(\varvec{X}_i,Y_i)< \frac{\eta _0}{2}\right] \right\} . \end{aligned}$$
(26)

Thus, in view of (21), (22), (23), (24), (25), and (26), we find

$$\begin{aligned} P_{n,1}+P_{n,3}\le & {} 2\sum _{i=1}^n P\left\{ \left| \widehat{\eta }(\varvec{X}_i,Y_i)-\eta ^*(\varvec{X}_i,Y_i)\right| >\frac{\epsilon \eta _0^{p+1}}{2^{2p+2}3^{p-1}M^p c_1}\right\} \nonumber \\&+\,2 P\left\{ \int \left| \frac{\sum _{i=1}^n\mathcal {K}\left( \frac{\varvec{z}-\varvec{Z}_i}{h_n}\right) }{n E\left[ \mathcal {K}\left( \frac{\varvec{z}-\varvec{Z}}{h_n}\right) \right] }-1\right| \mu (d\varvec{z})>\frac{\epsilon \eta _0^p}{2^{2p+1}3^p M^p}\right\} \nonumber \\&+\,2\sum _{i=1}^n P\left\{ \widehat{\eta }(\varvec{X}_i,Y_i)< \frac{\eta _0}{2}\right\} \nonumber \\:= & {} T_{n,1}+T_{n,2}+T_{n,3},~~~\text {(say).} \end{aligned}$$
(27)

First note that upon taking \(m(\varvec{z})=1\) in Lemma 2 (in the Appendix) and \(Y_i\overset{\text {a.s.}}{=}1\) for all \(i=1,\ldots , n\), we have

$$\begin{aligned} T_{n,2}\le & {} 2 e^{-a_1n\epsilon ^2}\,, \end{aligned}$$
(28)

for n large enough, where \(a_1=\eta _0^{2p}/(2^{4p+8}\, 3^{2p}c_1^2 M^{2p})\). To deal with the term \(T_{n,1}\), define the quantities

$$\begin{aligned} \Phi \left( \varvec{X}_i,Y_i\right)= & {} \eta ^*(\varvec{X}_i,Y_i)f(\varvec{X}_i)P(Y=Y_i|Y_i) \end{aligned}$$
(29)
$$\begin{aligned} \widehat{\Phi }\left( \varvec{X}_i,Y_i\right)= & {} (n-1)^{-1}\lambda _n^{-d}\sum _{j=1,\ne i}^n\Delta _j I\{Y_i=Y_j\}\mathcal {H}\left( \frac{\varvec{X}_i-\varvec{X}_j}{\lambda _n}\right) \end{aligned}$$
(30)
$$\begin{aligned} \Omega \left( \varvec{X}_i,Y_i\right)= & {} f(\varvec{X}_i)P(Y=Y_i|Y_i) \end{aligned}$$
(31)
$$\begin{aligned} \widehat{\Omega }\left( \varvec{X}_i,Y_i\right)= & {} (n-1)^{-1}\lambda _n^{-d}\sum _{j=1,\ne i}^n I\{Y_i=Y_j\}\mathcal {H}\left( \frac{\varvec{X}_i-\varvec{X}_j}{\lambda _n}\right) , \end{aligned}$$
(32)

where \(\lambda _n\) and \(\mathcal {H}\) are as in (7), and note that

$$\begin{aligned} \frac{\widehat{\Phi }\left( \varvec{X}_i,Y_i\right) }{\widehat{\Omega } \left( \varvec{X}_i,Y_i\right) }=\widehat{\eta }(\varvec{X}_i,Y_i)~~\text { and }~~ \frac{\Phi \left( \varvec{X}_i,Y_i\right) }{\Omega \left( \varvec{X}_i,Y_i\right) }=\eta ^*(\varvec{X}_i,Y_i)\,. \end{aligned}$$
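
Although such computational details play no role in the proof, a minimal sketch of how the ratio \(\widehat{\Phi }/\widehat{\Omega }\) in (30) and (32) may be evaluated could look as follows; the product Gaussian kernel, the bandwidth argument, and the guard against an empty denominator are purely illustrative assumptions, not conditions of the paper.

```python
import numpy as np

def kernel_H(u):
    # Illustrative product Gaussian kernel; the paper only assumes a regular
    # kernel H as in (7), so this particular choice is an assumption.
    return np.exp(-0.5 * np.sum(u ** 2, axis=-1)) / (2 * np.pi) ** (u.shape[-1] / 2)

def eta_hat(X, Y, Delta, lam):
    """Leave-one-out kernel estimate of the selection probability, i.e. the
    ratio Phi_hat/Omega_hat of (30) and (32).  X: (n, d) array of the
    always-observed covariates, Y: (n,) array of discrete responses,
    Delta: (n,) array of 0/1 missingness indicators, lam: bandwidth lambda_n."""
    n, d = X.shape
    out = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i                      # leave observation i out
        w = kernel_H((X[i] - X[keep]) / lam)          # H((X_i - X_j)/lambda_n)
        same_y = (Y[keep] == Y[i]).astype(float)      # indicator I{Y_i = Y_j}
        phi = np.sum(Delta[keep] * same_y * w) / ((n - 1) * lam ** d)   # (30)
        omega = np.sum(same_y * w) / ((n - 1) * lam ** d)               # (32)
        out[i] = phi / omega if omega > 0 else 1.0    # ad hoc guard, not in paper
    return out
```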

Also, note that \(|\widehat{\Phi }\left( \varvec{X}_i,Y_i\right) /\widehat{\Omega } \left( \varvec{X}_i,Y_i\right) |\le 1\). Using the above notation, observe that

$$\begin{aligned} \left| \widehat{\eta }(\varvec{X}_i,Y_i)-\eta ^*(\varvec{X}_i,Y_i)\right|= & {} \left| \frac{\widehat{\Phi }\left( \varvec{X}_i,Y_i\right) }{\widehat{\Omega }\left( \varvec{X}_i,Y_i\right) }-\frac{\Phi \left( \varvec{X}_i,Y_i\right) }{\Omega \left( \varvec{X}_i,Y_i\right) }\right| \\\le & {} \frac{\left| \widehat{\Omega }\left( \varvec{X}_i,Y_i\right) - \Omega \left( \varvec{X}_i,Y_i\right) \right| }{\Omega \left( \varvec{X}_i,Y_i \right) }+\frac{\left| \widehat{\Phi }\left( \varvec{X}_i,Y_i\right) -\Phi \left( \varvec{X}_i,Y_i\right) \right| }{\Omega \left( \varvec{X}_i,Y_i \right) }\,. \end{aligned}$$

Now, let \(0<f_0:= \inf \limits _{\varvec{x}\in \mathbb {R}^d} f(\varvec{x})\); see condition A3. Then we have

$$\begin{aligned} \Omega (\varvec{X}_i,Y_i)=f(\varvec{X}_i)P(Y=Y_i|Y_i)\ge f_0 ~\inf \limits _{\varvec{y}\in \mathcal {Y}} P(Y=y)=:\omega _0>0. \end{aligned}$$
(33)

Therefore, since (33) and the above bound on \(|\widehat{\eta }(\varvec{X}_i,Y_i)-\eta ^*(\varvec{X}_i,Y_i)|\) imply that \(\{|\widehat{\eta }(\varvec{X}_i,Y_i)-\eta ^*(\varvec{X}_i,Y_i)|>s\}\subseteq \{|\widehat{\Omega }(\varvec{X}_i,Y_i)-\Omega (\varvec{X}_i,Y_i)|>s\omega _0/2\}\cup \{|\widehat{\Phi }(\varvec{X}_i,Y_i)-\Phi (\varvec{X}_i,Y_i)|>s\omega _0/2\}\) for every \(s>0\),

$$\begin{aligned} T_{n,1}\le & {} 2\sum _{i=1}^n P\left\{ \left| \widehat{\Omega } \left( \varvec{X}_i,Y_i\right) -\Omega \left( \varvec{X}_i,Y_i\right) \right| >t\right\} \end{aligned}$$
(34)
$$\begin{aligned}&+\, 2\sum _{i=1}^n P\left\{ \left| \widehat{\Phi }\left( \varvec{X}_i,Y_i\right) -\Phi \left( \varvec{X}_i,Y_i\right) \right| >t\right\} , \end{aligned}$$
(35)

where \(t=\epsilon \eta _0^{p+1}\omega _0/[2^{2p+3}\, 3^{p-1} M^p c_1]>0\), with \(\omega _0\) as in (33). Now, for \(i=1,\ldots ,n\) and \(j=1,\ldots ,n,\,j\ne i\), put

$$\begin{aligned} \Gamma _j(\varvec{X}_i,Y_i)= & {} \lambda _n^{-d}\left\{ \Delta _j I\{Y_i=Y_j\}\mathcal {H}\left( \frac{\varvec{X}_i-\varvec{X}_j}{\lambda _n}\right) \right. \\&\left. -E\left[ \Delta _j I\{Y_i=Y_j\}\mathcal {H}\left( \frac{\varvec{X}_i-\varvec{X}_j}{\lambda _n}\right) \Bigg |\varvec{X}_i,Y_i\right] \right\} \, \end{aligned}$$

and observe that

$$\begin{aligned}&P\left\{ \left| \widehat{\Phi }\left( \varvec{X}_i,Y_i\right) -\Phi \left( \varvec{X}_i,Y_i\right) \right| >t\right\} \\&\quad = P\left\{ \left| \widehat{\Phi }\left( \varvec{X}_i,Y_i\right) \pm E\left[ \widehat{\Phi }\left( \varvec{X}_i,Y_i\right) \bigg |\varvec{X}_i,Y_i\right] -\Phi \left( \varvec{X}_i,Y_i\right) \right| >t\right\} \\&\quad \le P\left\{ \left| \widehat{\Phi }\left( \varvec{X}_i,Y_i\right) -E\left[ \widehat{\Phi }\left( \varvec{X}_i,Y_i\right) \bigg |\varvec{X}_i,Y_i\right] \right| +\frac{t}{2}>t\right\} \\&\qquad (\text {for large enough } n, \text {by Lemma 3 in the appendix}) \\&\quad = E\left[ P\left\{ \left| \widehat{\Phi }\left( \varvec{X}_i,Y_i\right) -E\left[ \widehat{\Phi }\left( \varvec{X}_i,Y_i\right) \bigg |\varvec{X}_i,Y_i\right] \right| >\frac{t}{2}\Bigg |\varvec{X}_i,Y_i\right\} \right] \\&\quad = E\left[ P\left\{ (n-1)^{-1}\left| \sum _{j=1\ne i}^n \Gamma _j(\varvec{X}_i,Y_i)\right| >\frac{t}{2}\Bigg |\varvec{X}_i,Y_i\right\} \right] \,. \end{aligned}$$
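
For the reader's convenience, we recall the standard form of Bernstein's inequality that will be applied to the conditional sum below: if \(\xi _1,\ldots ,\xi _m\) are independent zero-mean random variables with \(|\xi _j|\le B\) and \(\text {Var}(\xi _j)\le \sigma ^2\), then, for every \(s>0\),

$$\begin{aligned} P\left\{ \frac{1}{m}\left| \sum _{j=1}^m \xi _j\right| >s\right\} \le 2\exp \left\{ \frac{-m s^2}{2\left[ \sigma ^2+\frac{Bs}{3}\right] }\right\} \,. \end{aligned}$$

Below this is used conditionally on \((\varvec{X}_i,Y_i)\), with \(m=n-1\), \(s=t/2\), \(B=\lambda _n^{-d}\Vert \mathcal {H}\Vert _{\infty }\), and \(\sigma ^2=\lambda _n^{-d}\Vert \mathcal {H}\Vert _{\infty }\Vert f\Vert _{\infty }\).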

But, conditional on \((\varvec{X}_i,Y_i)\), the terms \(\Gamma _j(\varvec{X}_i,Y_i)\,, j=1,\ldots ,n\,,j\ne i\), are independent zero-mean random variables, bounded by \(-\lambda _n^{-d} \Vert \mathcal {H}\Vert _{\infty }\) and \(+\lambda _n^{-d} \Vert \mathcal {H}\Vert _{\infty }\). Furthermore, conditional on \((\varvec{X}_i,Y_i)\), we have \(\text {Var}(\Gamma _j(\varvec{X}_i,Y_i)|\varvec{X}_i,Y_i) = E[\Gamma _j^2(\varvec{X}_i,Y_i)|\varvec{X}_i,Y_i]\le \lambda _n^{-d} \Vert \mathcal {H}\Vert _{\infty } \Vert f \Vert _{\infty }\). Therefore by an application of Bernstein’s Inequality (Bernstein 1946), we find

$$\begin{aligned}&P\left\{ \frac{1}{n-1}\left| \sum _{j=1\ne i}^n \Gamma _j(\varvec{X}_i,Y_i)\right| >\frac{t}{2}\Bigg |\varvec{X}_i,Y_i\right\} \nonumber \\&\quad \le 2\exp \left\{ \frac{-(n-1)\left( \frac{t}{2}\right) ^2}{2\left[ \lambda _n^{-d}\Vert \mathcal {H}\Vert _{\infty }\Vert f\Vert _{\infty }+\frac{1}{3}\lambda _n^{-d}\Vert \mathcal {H}\Vert _{\infty }\frac{t}{2}\right] }\right\} \nonumber \\&\quad \le 2\exp \left\{ \frac{-(n-1)\lambda _n^d \eta _0^{2p+2}\omega _0^2\epsilon ^2}{2^{4p+9}\, 3^{2p-2} c_1^2 M^{2p}\Vert \mathcal {H}\Vert _{\infty }\left[ \Vert f\Vert _{\infty }+\frac{\omega _0}{12}\right] }\right\} , \end{aligned}$$
(36)

for n large enough; here the last line follows from the fact that, in bounding \(P\{|\widehat{\eta }(\varvec{X}_i,Y_i)-\eta ^*(\varvec{X}_i,Y_i)|>\epsilon \eta _0^{p+1}/[2^{2p+2}\, 3^{p-1} c_1 M^p]\}\), we only need to consider \(\epsilon \in (0,(2^{2p+2}\, 3^{p-1} c_1 M^p)/\eta _0^{p+1})\), because \(|\widehat{\eta }(\varvec{X}_i,Y_i)-\eta ^*(\varvec{X}_i,Y_i)|\le 1\). Similarly, since \(\widehat{\Omega }(\varvec{X}_i,Y_i)\) is a special case of \(\widehat{\Phi }(\varvec{X}_i,Y_i)\), with \(\Delta _j=1\) for \(j=1,\ldots ,n\), one finds the following upper bound for the probability in (34):

$$\begin{aligned}&P\left\{ \left| \widehat{\Omega }\left( \varvec{X}_i,Y_i\right) -\Omega \left( \varvec{X}_i,Y_i\right) \right| >t\right\} \nonumber \\&\quad \le 2\exp \left\{ \frac{-(n-1)\lambda _n^d \eta _0^{2p+2}\omega _0^2\epsilon ^2}{2^{4p+9}\, 3^{2p-2} c_1^2 M^{2p}\Vert \mathcal {H}\Vert _{\infty }\left[ \Vert f\Vert _{\infty }+\omega _0/12\right] }\right\} \,, \end{aligned}$$
(37)

for n large enough. Combining (36) and (37), we obtain

$$\begin{aligned} T_{n,1}\le 8n e^{-a_2(n-1)\lambda _n^d\epsilon ^2 }\,, \end{aligned}$$
(38)

where \(a_2= (\eta _0^{2p+2}\omega _0^2)/(2^{4p+9}\,3^{2p-2}\, c_1^2 M^{2p}\Vert \mathcal {H}\Vert _{\infty }[\Vert f\Vert _{\infty }+\omega _0/12])\). Finally, to deal with the term \(T_{n,3}\), first note that \(P\{\widehat{\eta }(\varvec{X}_i,Y_i)<\eta _0/2\} \le P\{|\widehat{\eta }(\varvec{X}_i,Y_i)-\eta ^*(\varvec{X}_i,Y_i)|\ge \eta _0/2\}\). Therefore, by the arguments that lead to the derivation of the bound on \(T_{n,1}\), we have

$$\begin{aligned} T_{n,3}\le 8n e^{-a_3(n-1)\lambda _n^d }\,, \end{aligned}$$
(39)

where \(a_3 = (\eta _0^2 \omega _0^2)/(2^7\Vert \mathcal {H}\Vert _{\infty }[\Vert f\Vert _{\infty }+\eta _0/48])\), which does not depend on \(\epsilon \) or n. Consequently, in view of (27), (28), (38), and (39), one finds

$$\begin{aligned} P_{n,1} +P_{n,3}\le & {} 2 \exp \left\{ \frac{-n \eta _0^{2p}\epsilon ^2}{2^{4p+8}\, 3^{2p}c_1^2 M^{2p}}\right\} \\&+\,8n\exp \left\{ \frac{-(n-1) \eta _0^{2p+2}\omega _0^2\lambda _n^d\epsilon ^2}{2^{4p+9}\,3^{2p-2}\, c_1^2 M^{2p}\Vert \mathcal {H}\Vert _{\infty }[\Vert f\Vert _{\infty }+\omega _0/12]}\right\} \\&+\,8n\exp \left\{ \frac{-(n-1)\eta _0^2 \omega _0^2\lambda _n^d}{2^7\Vert \mathcal {H}\Vert _{\infty }[\Vert f\Vert _{\infty }+\eta _0/48]}\right\} \,. \end{aligned}$$

This completes the proof of Theorem 2. \(\square \)

Proof of Theorem 3

Start by defining the quantities

$$\begin{aligned} \tilde{m}_{\widehat{\eta }_{{\mathrm{LS}}}}(\varvec{z})= & {} \dfrac{\sum _{i=1}^n\frac{\Delta _i Y_i}{\widehat{\eta }_{{\mathrm{LS}}}(\varvec{X}_i,Y_i)} \mathcal {K}\left( \frac{\varvec{z} - \varvec{Z}_i}{h_n}\right) }{\sum _{i=1}^n\mathcal {K}\left( \frac{\varvec{z} - \varvec{Z}_i}{h_n}\right) }\,,\text { and} \nonumber \\ \overline{m}_{\widehat{\eta }_{{\mathrm{LS}}}}(\varvec{z})= & {} \dfrac{\sum _{i=1}^n\frac{\Delta _i}{\widehat{\eta }_{{\mathrm{LS}}} (\varvec{X}_i,Y_i)}\mathcal {K}\left( \frac{\varvec{z}-\varvec{Z}_i}{h_n}\right) }{\sum _{i=1}^n\mathcal {K}\left( \frac{\varvec{z}-\varvec{Z}_i}{h_n}\right) }\,. \end{aligned}$$
(40)
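
As a side remark, a minimal sketch of how the resulting plug-in estimate \(\widehat{m}_{\widehat{\eta }_{{\mathrm{LS}}}}(\varvec{z})=\tilde{m}_{\widehat{\eta }_{{\mathrm{LS}}}}(\varvec{z})/\overline{m}_{\widehat{\eta }_{{\mathrm{LS}}}}(\varvec{z})\) may be evaluated from (40) is given below; since both quantities in (40) share the denominator \(\sum _{i=1}^n\mathcal {K}((\varvec{z}-\varvec{Z}_i)/h_n)\), this factor cancels in the ratio and only the complete cases enter the computation. The Gaussian kernel and the array layout are illustrative assumptions only.

```python
import numpy as np

def kernel_K(u):
    # Illustrative product Gaussian kernel; the paper only assumes a regular kernel K.
    return np.exp(-0.5 * np.sum(u ** 2, axis=-1))

def m_hat(z, Z, Y, Delta, eta_vals, h):
    """Plug-in estimate m_tilde/m_bar built from (40).  Z: (n, d) array of
    covariates (only complete cases, Delta_i = 1, are actually used), Y and
    Delta: (n,) arrays, eta_vals: estimated selection probabilities at the
    sample points, h: bandwidth h_n.  Guards for empty sums are omitted."""
    obs = Delta == 1
    # The common factor sum_i K((z - Z_i)/h_n) in (40) cancels in the ratio,
    # so incomplete covariate vectors never need to be evaluated.
    w = kernel_K((z - Z[obs]) / h) / eta_vals[obs]    # Delta_i K(.) / eta_hat_i
    return np.sum(w * Y[obs]) / np.sum(w)
```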

Since \(|\tilde{m}_{\widehat{\eta }_{{\mathrm{LS}}}} (\varvec{z})/\overline{m}_{\widehat{\eta }_{{\mathrm{LS}}}}(\varvec{z})|\le M\), one obtains

$$\begin{aligned} \left| \widehat{m}_{\widehat{\eta }_{{\mathrm{LS}}}}(\varvec{z})-m(\varvec{z})\right| =\left| \frac{\tilde{m}_{\widehat{\eta }_{{\mathrm{LS}}}}(\varvec{z})}{\overline{m}_{\widehat{\eta }_{{\mathrm{LS}}}}(\varvec{z})}-\frac{m(\varvec{z})}{1}\right|\le & {} M\left| \overline{m}_{\widehat{\eta }_{{\mathrm{LS}}}}(\varvec{z})-1\right| +\left| \tilde{m}_{\widehat{\eta }_{{\mathrm{LS}}}}(\varvec{z})-m(\varvec{z})\right| \,. \end{aligned}$$

Therefore, with \(\tilde{m}_{\widehat{\eta }_{{\mathrm{LS}}}}(\varvec{z})\) and \(\overline{m}_{\widehat{\eta }_{{\mathrm{LS}}}}(\varvec{z})\) as in (40), and \(\tilde{m}_{\eta ^*}(\varvec{z})\) and \(\overline{m}_{\eta *}(\varvec{z})\) as in (16), we can use simple algebra to show that for every \(\epsilon >0\)

$$\begin{aligned}&P\left\{ \int \left| \widehat{m}_{\widehat{\eta }_{{\mathrm{LS}}}}(\varvec{z})-m(\varvec{z})\right| ^p \mu (d\varvec{z})>\epsilon \right\} \\&\quad \le P\left\{ 2^{2p-2}M^p\left| \frac{1}{\eta _0}+\frac{1}{\eta _0}\right| ^{p-1}\int \left| \overline{m}_{\widehat{\eta }_{{\mathrm{LS}}}}(\varvec{z})- \overline{m}_{\eta *}(\varvec{z})\right| \mu (d\varvec{z})>\frac{\epsilon }{4}\right\} \\&\qquad +P\left\{ 2^{2p-2}M^p\left| \frac{1}{\eta _0}+1\right| ^{p-1}\int \left| \overline{m}_{\eta *}(\varvec{z})-1\right| \mu (d\varvec{z})>\frac{\epsilon }{4}\right\} \\&\qquad +P\left\{ 2^{2p-2}\left| \frac{M}{\eta _0}+\frac{M}{\eta _0}\right| ^{p-1}\int \left| \tilde{m}_{\widehat{\eta }_{{\mathrm{LS}}}}(\varvec{z})- \tilde{m}_{\eta ^*}(\varvec{z})\right| \mu (d\varvec{z})>\frac{\epsilon }{4}\right\} \\&\qquad +P\left\{ 2^{2p-2}\left| \frac{M}{\eta _0}+M\right| ^{p-1}\int \left| \tilde{m}_{\eta ^*}(\varvec{z})-m(\varvec{z})\right| \mu (d\varvec{z})>\frac{\epsilon }{4}\right\} \\&\quad := Q_{n,1}+Q_{n,2}+Q_{n,3}+Q_{n,4}, ~~~\text {(say).} \end{aligned}$$

But, as an immediate consequence of the main result of Devroye and Krzyżak (1989) (see (3)), we have

$$\begin{aligned} Q_{n,2}+Q_{n,4}\le & {} 8 e^{-b n}\,, \end{aligned}$$
(41)

where b is as in Theorem 2. To deal with the term \(Q_{n,1}\), first note that

$$\begin{aligned}&\left| \overline{m}_{\widehat{\eta }_{{\mathrm{LS}}}}(\varvec{z})- \overline{m}_{\eta *}(\varvec{z})\right| \\&\quad \le \left| \frac{\sum _{i=1}^n \Delta _i\left( \frac{1}{\widehat{\eta }_{{\mathrm{LS}}}(\varvec{X}_i,Y_i)}-\frac{1}{\eta ^*(\varvec{X}_i,Y_i)}\right) \mathcal {K}\left( \frac{\varvec{z}-\varvec{Z}_i}{h_n}\right) }{n E\left[ \mathcal {K}\left( \frac{\varvec{z}-\varvec{Z}}{h_n}\right) \right] }\right| \\&\qquad + \left| \sum _{i=1}^n \Delta _i\left( \frac{1}{\widehat{\eta }_{{\mathrm{LS}}} (\varvec{X}_i,Y_i)}-\frac{1}{\eta ^*(\varvec{X}_i,Y_i)}\right) \mathcal {K}\left( \frac{\varvec{z}-\varvec{Z}_i}{h_n}\right) \right. \\&\quad \qquad \left. \left( \frac{1}{\sum _{i=1}^n \mathcal {K}\left( \frac{\varvec{z}-\varvec{Z}_i}{h_n}\right) }-\frac{1}{n E\left[ \mathcal {K}\left( \frac{\varvec{z}-\varvec{Z}}{h_n}\right) \right] }\right) \right| \\&\quad := \pi _{n,1}(\varvec{z})+\pi _{n,2}(\varvec{z}),~~~\text {(say)}\,. \end{aligned}$$

Now, using the arguments that lead to (19), we find

$$\begin{aligned} \int \pi _{n,1}(\varvec{z})\mu (d\varvec{z}) \le c_1\left( \frac{1}{n}\sum _{i=1}^n \left| \frac{1}{\widehat{\eta }_{{\mathrm{LS}}} (\varvec{X}_i,Y_i)}-\frac{1}{\eta ^*(\varvec{X}_i,Y_i)}\right| \right) \,. \end{aligned}$$
(42)

Furthermore, the term \(\int \pi _{n,2}(\varvec{z})\mu (d\varvec{z})\) can be bounded as in (20), i.e., we have

$$\begin{aligned}&\int \pi _{n,2}(\varvec{z})\mu (d\varvec{z}) \nonumber \\&\quad \le \max \limits _{1\le i\le n}\left| \frac{1}{\widehat{\eta }_{{\mathrm{LS}}} (\varvec{X}_i,Y_i)}-\frac{1}{\eta ^*(\varvec{X}_i,Y_i)}\right| \int \left| \frac{\sum _{i=1}^n \mathcal {K}\left( \frac{\varvec{z}-\varvec{Z}_i}{h_n}\right) }{n E\left[ \mathcal {K}\left( \frac{\varvec{z}-\varvec{Z}}{h_n}\right) \right] }-1\right| \mu (d\varvec{z})\,.\qquad \end{aligned}$$
(43)

Now, put \(\epsilon _1=\epsilon \eta _0^{p-1}/(2^{3p-1}M^p)\) and observe that combining (42) and (43) one finds

$$\begin{aligned} Q_{n,1}\le & {} P\left\{ \int \left( \pi _{n,1}(\varvec{z})+\pi _{n,2}(\varvec{z})\right) \mu (d\varvec{z})>\epsilon _1\right\} \nonumber \\\le & {} P\left\{ \frac{c_1}{n}\sum _{i=1}^n \left| \frac{\widehat{\eta }_{{\mathrm{LS}}}(\varvec{X}_i,Y_i)-\eta ^*(\varvec{X}_i,Y_i)}{\eta ^*(\varvec{X}_i,Y_i)\widehat{\eta }_{{\mathrm{LS}}}(\varvec{X}_i,Y_i)}\right| >\frac{\epsilon _1}{2}\right\} \nonumber \\&+ P\left\{ \max \limits _{1\le i\le n}\left| \frac{1}{\eta ^*(\varvec{X}_i,Y_i)}-\frac{1}{\widehat{\eta }_{{\mathrm{LS}}}(\varvec{X}_i,Y_i)}\right| \right. \nonumber \\&\left. \int \left| \frac{\sum _{i=1}^n \mathcal {K}\left( \frac{\varvec{z}-\varvec{Z}_i}{h_n}\right) }{n E\left[ \mathcal {K}\left( \frac{\varvec{z}-\varvec{Z}}{h_n}\right) \right] }-1\right| \mu (d\varvec{z})>\frac{\epsilon _1}{2}\right\} \nonumber \\\le & {} P\left\{ \frac{1}{n}\sum _{i=1}^n\left| \widehat{\eta }_{{\mathrm{LS}}}(\varvec{X}_i,Y_i)-\eta ^*(\varvec{X}_i,Y_i)\right| >\frac{\epsilon _1\eta _0^2}{2c_1}\right\} \end{aligned}$$
(44)
$$\begin{aligned}&+P\left\{ \int \left| \frac{\sum _{i=1}^n\mathcal {K}\left( \frac{\varvec{z}-\varvec{Z}_i}{h_n}\right) }{n E\left[ \mathcal {K}\left( \frac{\varvec{z}-\varvec{Z}}{h_n}\right) \right] }-1\right| \mu (d\varvec{z})>\frac{\epsilon _1\eta _0}{4}\right\} \,. \end{aligned}$$
(45)

Finally, to bound \(Q_{n,3}\), first note that

$$\begin{aligned}&\left| \tilde{m}_{\widehat{\eta }_{{\mathrm{LS}}}}(\varvec{z})- \tilde{m}_{\eta ^*}(\varvec{z})\right| \\&\quad \le \left| \frac{\sum _{i=1}^n \Delta _i Y_i\left( \frac{1}{\widehat{\eta }_{{\mathrm{LS}}}(\varvec{X}_i,Y_i)}-\frac{1}{\eta ^*(\varvec{X}_i,Y_i)}\right) \mathcal {K}\left( \frac{\varvec{z}-\varvec{Z}_i}{h_n}\right) }{n E\left[ \mathcal {K}\left( \frac{\varvec{z}-\varvec{Z}}{h_n}\right) \right] }\right| \\&\qquad + \left| \sum _{i=1}^n \Delta _iY_i\left( \frac{1}{\widehat{\eta }_{{\mathrm{LS}}}(\varvec{X}_i,Y_i)}-\frac{1}{\eta ^*(\varvec{X}_i,Y_i)}\right) \mathcal {K}\left( \frac{\varvec{z}-\varvec{Z}_i}{h_n}\right) \right. \\&\quad \qquad \left. \left( \frac{1}{\sum _{i=1}^n \mathcal {K}\left( \frac{\varvec{z}-\varvec{Z}_i}{h_n}\right) }-\frac{1}{n E\left[ \mathcal {K}\left( \frac{\varvec{z}-\varvec{Z}}{h_n}\right) \right] }\right) \right| \\&\quad := \pi _{n,1}'(\varvec{z})+\pi _{n,2}'(\varvec{z}),~~~\text {(say)}\,. \end{aligned}$$

But, once again, the arguments that lead to (19) and (20) yield

$$\begin{aligned} \int \pi _{n,1}'(\varvec{z})\mu (d\varvec{z}) \le M c_1\left( \frac{1}{n}\sum _{i=1}^n\left| \frac{1}{\widehat{\eta }_{{\mathrm{LS}}}(\varvec{X}_i,Y_i)}-\frac{1}{\eta ^*(\varvec{X}_i,Y_i)}\right| \right) \,. \end{aligned}$$

and

$$\begin{aligned}&\int \pi _{n,2}'(\varvec{z})\mu (d\varvec{z}) \\&\quad \le M \max \limits _{1\le i\le n}\left| \frac{1}{\widehat{\eta }_{{\mathrm{LS}}}(\varvec{X}_i,Y_i)}-\frac{1}{\eta ^*(\varvec{X}_i,Y_i)}\right| \int \left| \frac{\sum _{i=1}^n \mathcal {K}\left( \frac{\varvec{z}-\varvec{Z}_i}{h_n}\right) }{n E\left[ \mathcal {K}\left( \frac{\varvec{z}-\varvec{Z}}{h_n}\right) \right] }-1\right| \mu (d\varvec{z})\,. \end{aligned}$$

Next, with \(\epsilon _1=\epsilon \eta _0^{p-1}/(2^{3p-1}M^p)\) as before, it is straightforward to see that

$$\begin{aligned} Q_{n,3}\le & {} P\left\{ \int \left( \pi _{n,1}'(\varvec{z})+\pi _{n,2}'(\varvec{z})\right) \mu (d\varvec{z})>M\epsilon _1\right\} \nonumber \\\le & {} P\left\{ \frac{1}{n}\sum _{i=1}^n\left| \widehat{\eta }_{{\mathrm{LS}}}(\varvec{X}_i,Y_i)-\eta ^*(\varvec{X}_i,Y_i)\right| >\frac{\epsilon _1\eta _0^2}{2c_1}\right\} \end{aligned}$$
(46)
$$\begin{aligned}&+P\left\{ \int \left| \frac{\sum _{i=1}^n \mathcal {K}\left( \frac{\varvec{z}-\varvec{Z}_i}{h_n}\right) }{n E\left[ \mathcal {K}\left( \frac{\varvec{z}-\varvec{Z}}{h_n}\right) \right] }-1\right| \mu (d\varvec{z})>\frac{\epsilon _1\eta _0}{4}\right\} \,. \end{aligned}$$
(47)

Thus, from (44), (45), (46), and (47), it follows that

$$\begin{aligned} Q_{n,1}+Q_{n,3}\le & {} 2P\left\{ \frac{1}{n}\sum _{i=1}^n\left| \widehat{\eta }_{{\mathrm{LS}}}(\varvec{X}_i,Y_i)-\eta ^*(\varvec{X}_i,Y_i)\right| >\frac{\epsilon _1\eta _0^2}{2c_1}\right\} \nonumber \\&+2P\left\{ \int \left| \frac{\sum _{i=1}^n \mathcal {K}\left( \frac{\varvec{z}-\varvec{Z}_i}{h_n}\right) }{n E\left[ \mathcal {K}\left( \frac{\varvec{z}-\varvec{Z}}{h_n}\right) \right] }-1\right| \mu (d\varvec{z})>\frac{\epsilon _1\eta _0}{4}\right\} \nonumber \\:= & {} U_{n,1}+U_{n,2}\,,~~~\text {(say)}. \end{aligned}$$
(48)

Now, upon taking \(Y_i\overset{\text {a.s.}}{=}1\), \(i=1,\ldots ,n\), and \(m(\varvec{z})=1\) in Lemma 2, one obtains

$$\begin{aligned} U_{n,2} \le 2e^{-b_1 n\epsilon ^2 }\,, \end{aligned}$$
(49)

where \(b_1 = \eta _0^{2p}/(2^{6p+8} M^{2p} c_1^2)\). To deal with the term \(U_{n,1}\), observe that

$$\begin{aligned} U_{n,1}\le & {} 2P\left\{ \left| \frac{1}{n}\sum _{i=1}^n \left| \widehat{\eta }_{{\mathrm{LS}}}(\varvec{X}_i,Y_i) -\eta ^*(\varvec{X}_i,Y_i)\right| \right. \right. \nonumber \\&\left. \left. -E\left[ \left| \widehat{\eta }_{{\mathrm{LS}}} (\varvec{X},Y)-\eta ^*(\varvec{X},Y)\right| \big |\mathbb {D}_n\right] \right| >\frac{\epsilon _1\eta _0^2}{4c_1}\right\} \end{aligned}$$
(50)
$$\begin{aligned}&+2P\left\{ E\left[ \left| \widehat{\eta }_{{\mathrm{LS}}}(\varvec{X},Y)-\eta ^*(\varvec{X},Y)\right| \big |\mathbb {D}_n\right] >\frac{\epsilon _1\eta _0^2}{4c_1}\right\} \,. \end{aligned}$$
(51)
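
The term (50) is a uniform-deviation probability over a class of functions of \((\varvec{x},y)\), and it will be handled through a covering-number bound of the Pollard/Györfi et al. type, which, in the form relevant here (for a class \(\mathcal {F}\) of functions taking values in [0, B]), states that for every \(\epsilon >0\)

$$\begin{aligned} P\left\{ \sup \limits _{g\in \mathcal {F}}\left| \frac{1}{n}\sum _{i=1}^n g(\varvec{X}_i,Y_i)-E\left[ g(\varvec{X},Y)\right] \right| >\epsilon \right\} \le 8\, E\left[ \mathcal {N}_1\left( \frac{\epsilon }{8},\mathcal {F},\mathbb {D}_n\right) \right] e^{-n\epsilon ^2/(128 B^2)}\,. \end{aligned}$$

In the present application the functions involved are bounded by one, so that \(B=1\); the covering-number comparison carried out below then reduces everything to the covering numbers of \(\mathcal {M}\) itself.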

To handle the term (50), first put \(\epsilon _2=\epsilon _1\eta _0^2/(2^5 c_1)>0\) and let the class of functions \(\mathcal {G}\) be defined as \(\mathcal {G}=\{g_{\eta }~|~g_{\eta }(\varvec{x},y)=|\eta (\varvec{x},y)-\eta ^*(\varvec{x},y)|\,,~\eta \in \mathcal {M}\}\). Now, observe that for any \(g_{\eta },g_{\eta ^{\dagger }}\in \mathcal {G}\) one has

$$\begin{aligned}&\frac{1}{n}\sum _{i=1}^n \left| g_{\eta } (\varvec{X}_i,Y_i) - g_{\eta ^{\dagger }} (\varvec{X}_i,Y_i)\right| \\&\quad = \frac{1}{n}\sum _{i=1}^n \left| \left| \eta (\varvec{X}_i,Y_i)-\eta ^*(\varvec{X}_i,Y_i)\right| -\left| \eta ^{\dagger } (\varvec{X}_i,Y_i)-\eta ^*(\varvec{X}_i,Y_i)\right| \right| \\&\quad \le \frac{1}{n}\sum _{i=1}^n \left| \eta (\varvec{X}_i,Y_i)-\eta ^{\dagger }(\varvec{X}_i,Y_i)\right| \,. \end{aligned}$$

Therefore, if \(\mathcal {M}_{\epsilon _2}=\{\eta _1,\ldots ,\eta _{N_{\epsilon _2}}\}\) is a minimal \(\epsilon _2\)-cover of \(\mathcal {M}\) with respect to the empirical \(L_1\) norm, then \(\mathcal {G}_{\epsilon _2}=\{g_{\eta _1},\ldots ,g_{\eta _{N_{\epsilon _2}}}\}\) will be an \(\epsilon _2\)-cover of \(\mathcal {G}\). Furthermore, the \(\epsilon _2\)-covering numbers for \(\mathcal {G}\) and \(\mathcal {M}\) satisfy \(\mathcal {N}_1(\epsilon _2,\mathcal {G},\mathbb {D}_n)\le \mathcal {N}_1(\epsilon _2,\mathcal {M},\mathbb {D}_n)\). Now, using standard results from empirical process theory [see Pollard (1984, p. 25), or Theorem 9.1 of Györfi et al. (2002, p. 136)], one finds

$$\begin{aligned} (50)\le & {} 2P\left\{ \sup \limits _{\eta \in \mathcal {M}}\left| \frac{1}{n}\sum _{i=1}^n \left| \eta (\varvec{X}_i,Y_i)-\eta ^*(\varvec{X}_i,Y_i)\right| -E\left| \eta (\varvec{X},Y)- \eta ^*(\varvec{X},Y)\right| \right| >\frac{\epsilon _1\eta _0^2}{4c_1}\right\} \nonumber \\\le & {} 16~E[\mathcal {N}_1\left( b_2\epsilon ,\mathcal {M},\mathbb {D}_n\right) ]~e^{-b_3 n\epsilon ^2}\,, \end{aligned}$$
(52)

where \(b_2 =\eta _0^{p+1}/(2^{3p+4} M^p c_1)\) and \(b_3 = \eta _0^{2p+2}/(2^{6p+9} M^{2p} c_1^2 )\). Next, to deal with the term (51), note that by the Cauchy-Schwarz inequality we have

$$\begin{aligned} (51)\le & {} 2P\left\{ \sqrt{E\left[ \left| \widehat{\eta }_{{\mathrm{LS}}}(\varvec{X},Y)-\eta ^*(\varvec{X},Y)\right| ^2\big |\mathbb {D}_n\right] }>\frac{\epsilon _1\eta _0^2}{4c_1}\right\} \nonumber \\\le & {} 2P\left\{ E\left[ \left| \widehat{\eta }_{{\mathrm{LS}}}(\varvec{X},Y)-\eta ^*(\varvec{X},Y)\right| ^2\big |\mathbb {D}_n\right] >\frac{\epsilon _1^2\eta _0^4}{16c_1^2}\right\} \,. \end{aligned}$$
(53)

Since

$$\begin{aligned}&E\left[ \left| \widehat{\eta }_{{\mathrm{LS}}}(\varvec{X},Y) - \eta ^*(\varvec{X},Y)\right| ^2\Big |\mathbb {D}_n\right] \nonumber \\&\quad = E\left[ \left| \widehat{\eta }_{{\mathrm{LS}}}(\varvec{X},Y) - \Delta \right| ^2\Big |\mathbb {D}_n\right] - \inf \limits _{\eta \in \mathcal {M}} E\left[ \left| \eta (\varvec{X},Y)-\Delta \right| ^2\right] , \end{aligned}$$
(54)

one can write

$$\begin{aligned} (54)= & {} \sup \limits _{\eta \in \mathcal {M}}\left\{ E\left[ \left| \widehat{\eta }_{{\mathrm{LS}}}(\varvec{X},Y)-\Delta \right| ^2\Big |\mathbb {D}_n\right] -\frac{1}{n}\sum _{i=1}^n\left| \widehat{\eta }_{{\mathrm{LS}}}(\varvec{X}_i,Y_i)-\Delta _i\right| ^2\right. \nonumber \\&+\left. \frac{1}{n}\sum _{i=1}^n\left| \widehat{\eta }_{{\mathrm{LS}}}(\varvec{X}_i,Y_i)-\Delta _i\right| ^2-\frac{1}{n}\sum _{i=1}^n\left| \eta (\varvec{X}_i,Y_i)-\Delta _i\right| ^2\right. \nonumber \\&+\left. \frac{1}{n}\sum _{i=1}^n\left| \eta (\varvec{X}_i,Y_i)-\Delta _i\right| ^2-E\left[ \left| \eta (\varvec{X},Y)-\Delta \right| ^2\right] \right\} \nonumber \\\le & {} \left| \frac{1}{n}\sum _{i=1}^n\left| \widehat{\eta }_{{\mathrm{LS}}}(\varvec{X}_i,Y_i)-\Delta _i\right| ^2-E\left[ \left| \widehat{\eta }_{{\mathrm{LS}}}(\varvec{X},Y)-\Delta \right| ^2\Big |\mathbb {D}_n\right] \right| \nonumber \\&+\sup \limits _{\eta \in \mathcal {M}}\left| \frac{1}{n}\sum _{i=1}^n\left| \eta (\varvec{X}_i,Y_i)-\Delta _i\right| ^2-E\left[ \left| \eta (\varvec{X},Y)-\Delta \right| ^2\right] \right| \nonumber \\&\left( \text {because }\frac{1}{n}\sum _{i=1}^n\left| \widehat{\eta }_{{\mathrm{LS}}}(\varvec{X}_i,Y_i)-\Delta _i\right| ^2-\frac{1}{n}\sum _{i=1}^n\left| \eta (\varvec{X}_i,Y_i) -\Delta _i\right| ^2\le 0\,, \right. \nonumber \\&\qquad \qquad \left. \forall \eta \in \mathcal {M}\right) \nonumber \\\le & {} 2\sup \limits _{\eta \in \mathcal {M}}\left| \frac{1}{n}\sum _{i=1}^n\left| \eta (\varvec{X}_i,Y_i)-\Delta _i\right| ^2-E\left[ \left| \eta (\varvec{X},Y)-\Delta \right| ^2\right] \right| \,. \end{aligned}$$
(55)

Therefore, in view of (55),

$$\begin{aligned} (53) \le 2P\left\{ \sup \limits _{\eta \in \mathcal {M}}\left| \frac{1}{n}\sum _{i=1}^n\left| \eta (\varvec{X}_i,Y_i)-\Delta _i\right| ^2-E\left[ \left| \eta (\varvec{X},Y)-\Delta \right| ^2\right] \right| >\frac{\epsilon _1^2\eta _0^4}{2^5c_1^2}\right\} .\nonumber \\ \end{aligned}$$
(56)

Now, let \(\Psi = \{\psi _{\eta }~|~\psi _{\eta }(\varvec{x},y,\delta ) = |\eta (\varvec{x},y)-\delta |^2\,,~~\eta \in \mathcal {M}\}\) and put \(\epsilon _3=\epsilon _1^2\eta _0^4/(2^8 c_1^2)\). Then for any \(\psi _{\eta },\psi _{\eta '}\in \Psi \) (where \(\eta , \eta '\in \mathcal {M}\)), one has

$$\begin{aligned} \frac{1}{n}\sum _{i=1}^n\left| \psi _{\eta }(\varvec{X}_i,Y_i)-\psi _{\eta '}(\varvec{X}_i,Y_i)\right|= & {} \frac{1}{n}\sum _{i=1}^n\left| \left| \eta (\varvec{X}_i,Y_i)-\Delta _i\right| ^2-\left| \eta '(\varvec{X}_i,Y_i)-\Delta _i\right| ^2\right| \\\le & {} \frac{1}{n}\sum _{i=1}^n\left| \left( \eta (\varvec{X}_i,Y_i)-\Delta _i\right) -\left( \eta '(\varvec{X}_i,Y_i)-\Delta _i\right) \right| \\&\times \left| \left( \eta (\varvec{X}_i,Y_i)-\Delta _i\right) +\left( \eta '(\varvec{X}_i,Y_i)-\Delta _i\right) \right| \\\le & {} \frac{2}{n}\sum _{i=1}^n\left| \eta (\varvec{X}_i,Y_i)-\eta '(\varvec{X}_i,Y_i)\right| \,. \end{aligned}$$

In other words, if \(\mathcal {M}_{\epsilon _3/2}=\{\eta _1,\ldots ,\eta _{N_{\epsilon _3/2}}\}\) is a minimal \(L_1\) empirical \(\epsilon _3/2\)-cover of \(\mathcal {M}\), then \(\Psi _{\epsilon _3}=\{\psi _{\eta _1},\ldots ,\psi _{\eta _{N_{\epsilon _3/2}}}\}\) is an \(\epsilon _3\)-cover of \(\Psi \). Additionally, one has \(\mathcal {N}_1(\epsilon _3,\Psi ,\mathbb {D}_n)\le \mathcal {N}_1(\epsilon _3/2,\mathcal {M},\mathbb {D}_n)\). Thus, one obtains

$$\begin{aligned} (56)= & {} 2P\left\{ \sup \limits _{\eta \in \mathcal {M}}\left| \frac{1}{n}\sum _{i=1}^n\left| \eta (\varvec{X}_i,Y_i)-\Delta _i\right| ^2-E\left[ \left| \eta (\varvec{X},Y)-\Delta \right| ^2\right] \right| >\frac{\epsilon _1^2\eta _0^4}{2^5c_1^2}\right\} \nonumber \\\le & {} 16 ~E\left[ \mathcal {N}_1\left( b_4\epsilon ^2,\mathcal {M},(\varvec{X}_i,Y_i)_{i=1}^n\right) \right] ~e^{-b_5n \epsilon ^4} \end{aligned}$$
(57)

where \(b_4 = \eta _0^{2p+2}/(2^{6p+6} M^{2p} c_1^2 )\) and \(b_5 = \eta _0^{4p+4}/(2^{12p+6} M^{4p} c_1^4)\). Therefore, in view of (48), (49), (52), and (57), we have

$$\begin{aligned} Q_{n,1}+Q_{n,3}\le & {} 2\exp \left\{ \frac{-n\eta _0^{2p}\epsilon ^2}{2^{6p+8} M^{2p}c_1^2}\right\} \\&+ 16~E\left[ \mathcal {N}_1\left( \frac{\eta _0^{p+1}\epsilon }{2^{3p+4} M^p c_1},\mathcal {M},\mathbb {D}_n\right) \right] ~\exp \left\{ \frac{-n \eta _0^{2p+2}\epsilon ^2}{2^{6p+9} M^{2p} c_1^2 }\right\} \\&+ 16~E\left[ \mathcal {N}_1\left( \frac{ \eta _0^{2p+2}\epsilon ^2}{2^{6p+6} M^{2p} c_1^2 },\mathcal {M},\mathbb {D}_n\right) \right] ~\exp \left\{ \frac{-n \eta _0^{4p+4}\epsilon ^4}{2^{12p+6} M^{4p} c_1^4}\right\} \,. \end{aligned}$$

This completes the proof of Theorem 3. \(\square \)