1 Introduction

Given \(X_1,\ldots ,X_n\) independent and identically distributed real-valued random variables from an absolutely continuous distribution with continuous density function f, it is well known that the unknown density function f may be estimated by using the Parzen–Rosenblatt estimator (Rosenblatt 1956; Parzen 1962) defined, for \(x\in \mathbb {R}\), by

$$\begin{aligned} f_{h}(x) := \frac{1}{n} \sum _{i=1}^{n} K_{h}( x-X_i ), \end{aligned}$$

where \(K_{h}(\cdot )=K(\cdot /h)/h,\) for \(h>0\), with K a kernel on \(\mathbb {R}\), that is, a bounded and symmetric probability density function, and the bandwidth \(h=h_n\) is a sequence of strictly positive real numbers converging to zero as n tends to infinity, an assumption we maintain throughout this paper (see Devroye and Györfi 1985; Silverman 1986; Bosq and Lecoutre 1987; Wand and Jones 1995; Simonoff 1996, and Tsybakov 2009, for general reviews on density estimation).
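As a concrete illustration, the estimator \(f_h\) can be sketched in a few lines. The simulations in this paper were carried out in R, but we use Python here purely for illustration; the Gaussian kernel is one admissible choice of K, not imposed by the definition:

```python
import math

def parzen_rosenblatt(data, h):
    """Return x -> f_h(x) = (1/(n h)) * sum_i K((x - X_i)/h),
    with K the standard Gaussian density (one admissible kernel)."""
    n = len(data)
    K = lambda u: math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)
    return lambda x: sum(K((x - xi) / h) for xi in data) / (n * h)
```

For a single observation at 0 and \(h=1\), the estimate at 0 is simply \(\phi(0)\approx 0.399\); symmetry of the data is inherited by the estimate because K is symmetric.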

Other than the estimation of the underlying probability density function, the kernel density estimator can also be used for testing the null hypothesis

$$\begin{aligned} H_0 : \, f \in \mathscr {F}_0, \end{aligned}$$
(1)

where \(\mathscr {F}_0\) is a parametric family of density functions, against a general alternative hypothesis. This idea was first explored in Bickel and Rosenblatt (1973), who considered, among others, two test statistics based on the \(L^2\) distance between the nonparametric estimator \(f_{h}\) and two parametric estimators of f under the null hypothesis. Focusing our attention on the case where \(\mathscr {F}_0\) is a location-scale family, that is,

$$\begin{aligned} \mathscr {F}_0 = \big \{ g(\cdot ;\theta _1,\theta _2) : \theta _1 \in \mathbb {R} ,\theta _2>0 \big \}, \end{aligned}$$
(2)

with \(g(x;\theta _1,\theta _2)=f_0 ( (x-\theta _1)/\theta _2 )/\theta _2\), and \(f_0\) is a known probability density function on \(\mathbb {R}\), the Bickel–Rosenblatt test statistics we are interested in are given by

$$\begin{aligned} I_n(h)=I_n(X_1,\ldots ,X_n;h):= nh \int \{ f_{h}(x) - K_h*g(x; \hat{\theta }_1, \hat{\theta }_2 ) \}^2 \mathrm{d}x, \end{aligned}$$
(3)

and

$$\begin{aligned} J_n(h)=J_n(X_1,\ldots ,X_n;h) := nh \int \{ f_{h}(x) - g(x; \hat{\theta }_1, \hat{\theta }_2 ) \}^2 \mathrm{d}x, \end{aligned}$$
(4)

where the integrals are over \(\mathbb {R}\) with respect to the Lebesgue measure, \(\hat{\theta }_k\), for \(k=1,2\), are consistent estimators of \(\theta _k\) under \(H_0\), and \(*\) denotes the convolution operator. The theoretical properties of goodness-of-fit tests based on \(I_n(h)\) and \(J_n(h)\) were first studied by Bickel and Rosenblatt (1973) in the univariate case, by using strong approximation techniques for empirical processes, and by Rosenblatt (1975) in the multivariate case, by using a Poissonization of sample size technique. However, a full description of their asymptotic behaviour was only later provided in Fan (1994), based on the fact, first noticed in Hall (1984), that central limit theorems for the integrated squared error of kernel density estimators can be derived from a central limit theorem for degenerate U-statistics with variable kernels (see Ghosh and Huang 1991; Fan 1998; Gouriéroux and Tenreiro 2001; Cao and Lugosi 2005, for other works on goodness-of-fit tests based on the kernel density estimator).
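To make definition (4) concrete, here is a minimal numerical sketch of \(J_n(h)\) for a Gaussian null, using the sample mean and maximum likelihood standard deviation as \(\hat\theta_1,\hat\theta_2\), the Gaussian kernel, and the trapezoidal rule for the integral. All of these choices are illustrative assumptions for the sketch, not requirements of the definition:

```python
import math

def gauss(x, mu=0.0, s=1.0):
    """Density of N(mu, s^2) at x."""
    return math.exp(-0.5 * ((x - mu) / s) ** 2) / (s * math.sqrt(2.0 * math.pi))

def J_n(data, h, lo=-10.0, hi=10.0, m=2000):
    """J_n(h) = n h * integral of {f_h(x) - g(x; t1, t2)}^2 dx
    (trapezoidal rule), with K = phi, t1 = sample mean, t2 = ML sd."""
    n = len(data)
    t1 = sum(data) / n
    t2 = math.sqrt(sum((x - t1) ** 2 for x in data) / n)
    dx = (hi - lo) / m
    acc = 0.0
    for k in range(m + 1):
        x = lo + k * dx
        fh = sum(gauss((x - xi) / h) for xi in data) / (n * h)  # kernel estimate
        w = 0.5 if k in (0, m) else 1.0                         # trapezoid weights
        acc += w * (fh - gauss(x, t1, t2)) ** 2 * dx
    return n * h * acc
```

The statistic \(I_n(h)\) differs only in replacing \(g(\cdot;\hat\theta_1,\hat\theta_2)\) by its convolution with \(K_h\), which for a Gaussian null and Gaussian kernel is again a Gaussian density; closed-form versions of both statistics are given in Sect. 4.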

Taking into account that the class \(\mathscr {F}_0\) is closed with respect to affine transformations, some authors argue that any reasonable statistic \(T_n=T_n(X_1,\ldots ,X_n)\) for testing \(H_0\) should be location-scale invariant, that is, it should satisfy the equality

$$\begin{aligned} T_n(a + bX_1,\ldots ,a+bX_n)=T_n(X_1,\ldots ,X_n), \end{aligned}$$

for each \(a\in \mathbb {R}\) and \(b>0\) (see Henze 2002, p. 469, Ebner and Henze 2020, p. 847). As we can easily see, this invariance property does not hold for the functionals \(I_n(h)\) and \(J_n(h)\) whenever we take for h a deterministic bandwidth, even when \(\hat{\theta }_1\) is location-scale equivariant and \(\hat{\theta }_2\) is scale equivariant, that is,

$$\begin{aligned} \hat{\theta }_1(a + bX_1,\ldots ,a+bX_n)=a+b\,\hat{\theta }_1(X_1,\ldots ,X_n) \end{aligned}$$

and

$$\begin{aligned} \hat{\theta }_2(a + bX_1,\ldots ,a+bX_n)=b\,\hat{\theta }_2(X_1,\ldots ,X_n), \end{aligned}$$

for each \(a\in \mathbb {R}\) and \(b>0\). However, if we further assume that \(h=\hat{h}(X_1,\ldots ,X_n)\) depends on the observations and is scale equivariant, then \(I_n(\hat{h})\) and \(J_n(\hat{h})\) are location-scale invariant test statistics. This invariance property follows easily from the representations

$$\begin{aligned} I_n(\hat{h})= n (\hat{h}/\hat{\theta }_2) \int \{ \tilde{f}_{\hat{h}/\hat{\theta }_2}(y) - K_{\hat{h}/\hat{\theta }_2}*f_0(y) \}^2 \mathrm{d}y, \end{aligned}$$
(5)

and

$$\begin{aligned} J_n(\hat{h})= n (\hat{h}/\hat{\theta }_2) \int \{ \tilde{f}_{\hat{h}/\hat{\theta }_2}(y) - f_0(y) \}^2 \mathrm{d}y, \end{aligned}$$
(6)

where

$$\begin{aligned} \tilde{f}_{h}(y) = \frac{1}{n} \sum _{i=1}^{n} K_{h}( y-Y_{n,i} ), \end{aligned}$$

is the kernel estimator with kernel K and smoothing parameter h, based on the so-called scaled residuals \(Y_{n,j} = (X_j -\hat{\theta }_1)/\hat{\theta }_2\), \(j=1,\ldots ,n\). When \(\hat{h}\) takes the form \(\hat{h}=\hat{\theta }_2 h\), with h a deterministic bandwidth, the statistic \(I_n(\hat{h})\) is considered in Bowman (1992) (see also Fan 1994, pp. 332–336), and the theoretical properties of the goodness-of-fit test based on \(I_n(\hat{h})\) are described in Tenreiro (2007) in the case where \(\theta _1\) and \(\theta _2\) are, respectively, the mean and the standard deviation of \(g(\cdot ;\theta _1,\theta _2)\), and \(\hat{\theta }_1=\bar{X}_n\) and \(\hat{\theta }_2=S_n\), where \(\bar{X}_n=n^{-1} \sum _{i=1}^n X_i\) is the sample mean and \(S_n^2=n^{-1} \sum _{i=1}^n (X_i - \bar{X}_n)^2\) is the sample variance. Moreover, Bowman (1992, p. 3) suggests taking for the deterministic bandwidth h the asymptotically optimal bandwidth, in the sense of the mean integrated squared error, for estimating the null density \(f_0\). In this case, we have

$$\begin{aligned} h=h_1=h_1(f_0;K,n)=c_K \, R(f^{\prime \prime }_0)^{-1/5} n^{-1/5}, \end{aligned}$$
(7)

with

$$\begin{aligned} c_K = R(K)^{1/5} \mu _2(K)^{-2/5} \end{aligned}$$
(8)

(see Bosq and Lecoutre 1987, pp. 78–83 and Wand and Jones 1995, pp. 19–23), where \(R(\varphi )=\int \varphi (x)^2 \mathrm{d}x\) and \(\mu _2(\varphi )=\int x^2\varphi (x)\mathrm{d}x\), for an arbitrary real-valued measurable function \(\varphi \). This leads us to consider for \(\hat{h}\) the null hypothesis-based bandwidth selector

$$\begin{aligned} \hat{h}_{H_0}=\hat{\theta }_2 h_1(f_0;K,n). \end{aligned}$$
(9)

In the case of testing a hypothesis of normality, that is, \(f_0=\phi \), where \(\phi (x) = (2\pi )^{-1/2}\exp (-x^2/2)\), \(x\in \mathbb {R}\), is the standard Gaussian density, and taking \(K=\phi \) and \(\hat{\theta }_2=S_n\), this leads to the data-dependent bandwidth

$$\begin{aligned} \hat{h}_{H_0} = (4 / 3)^{1/5} S_n n^{-1/5}. \end{aligned}$$
(10)

This approach, also considered in Bowman and Foster (1993, p. 535) for testing a multivariate hypothesis of normality, was first suggested, with apparently good results, by Henze and Zirkler (1990, p. 3600; see also Ebner and Henze 2020), and the corresponding theoretical properties of the test statistic \(I_n(\hat{h})\) were first established in Gürtler (2000).
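The constants in (7)–(10) are easy to verify numerically. The sketch below (Python, purely illustrative) computes \(R(\phi)\) and \(R(\phi'')\) by the trapezoidal rule and checks that \(h_1(\phi;\phi,n)\) from (7)–(8) reduces to the factor \((4/3)^{1/5}n^{-1/5}\) appearing in (10):

```python
import math

phi = lambda x: math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
phi_dd = lambda x: (x * x - 1.0) * phi(x)  # second derivative of phi

def R(f, lo=-10.0, hi=10.0, m=4000):
    """R(f) = integral of f(x)^2 dx, by the trapezoidal rule."""
    dx = (hi - lo) / m
    return sum((0.5 if k in (0, m) else 1.0) * f(lo + k * dx) ** 2 * dx
               for k in range(m + 1))

c_K = R(phi) ** 0.2                 # mu_2(phi) = 1, so the second factor in (8) is 1
h1 = lambda n: c_K * R(phi_dd) ** (-0.2) * n ** (-0.2)
```

Since \(R(\phi)=1/(2\sqrt{\pi})\) and \(R(\phi'')=3/(8\sqrt{\pi})\), the ratio is \(4/3\) and \(h_1(n)=(4/3)^{1/5}n^{-1/5}\), which multiplied by \(S_n\) gives (10).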

From an estimation perspective, the choice of the bandwidth is crucial to the performance of the kernel density estimator, this being one of the most studied topics in kernel density estimation, and several data-based approaches have been proposed for selecting h (see Wand and Jones 1995, pp. 58–89, and also Tenreiro 2017, p. 3440, where more recent bandwidth selection methods are mentioned). Although estimation and testing are different statistical problems, if we want to test \(H_0\) through a test statistic based on the kernel density estimator, it may seem reasonable to select the smoothing parameter in such a way that \(f_h\) is an effective estimator of the underlying density f irrespective of which of the null or the alternative hypothesis is true, a property that is not fulfilled by the automatic bandwidth selector (9). Although some scepticism has been expressed about this approach by Bowman (1992, p. 3), mainly due to the extra source of variation introduced into the null distribution of the test statistic by the considered bandwidth selector, in this paper we intend to address this issue in depth by considering the situation where the data-dependent smoothing parameter \(\hat{h}\) satisfies the relative consistency condition

$$\begin{aligned} \frac{\hat{h}}{h_0} - 1 = o_p(1), \end{aligned}$$
(11)

where \(h_0=h_0(f;K,n)\) is the exact optimal bandwidth in the sense that it minimizes the kernel density estimator mean integrated square error, that is,

$$\begin{aligned} h_0=\mathrm {arg}\!\min _{h>0} \mathrm {MISE}(f;K,n,h), \end{aligned}$$
(12)

where

$$\begin{aligned} \mathrm {MISE}(f;K,n,h)= {E}\bigg ( \int \{ f_{h}(x) - f(x) \}^2 \mathrm{d}x \bigg ). \end{aligned}$$

For a square integrable density f, the existence of this exact optimal bandwidth for all \(n\in \mathbb {N}\) can be established whenever the kernel K is continuous at zero with \(R(K) < 2K(0)\) (see Chacón et al. 2007). Classical data-based bandwidth selectors such as the least squares cross-validation bandwidth or the two-stage direct plug-in bandwidth selector based on \(h_1=h_1(f;K,n)\), which are both described in Wand and Jones (1995, pp. 63–65, 71–72), are scale equivariant and satisfy (11).
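For \(f=\phi\) and \(K=\phi\) the MISE is available in closed form, \(\mathrm{MISE}(h)= R(K)/(nh) + (1-1/n)R(\phi_{\sqrt{1+h^2}}) - 2\phi_{\sqrt{2+h^2}}(0) + R(\phi)\), with \(R(\phi_s)=1/(2s\sqrt{\pi})\) (see Wand and Jones 1995), so \(h_0\) from (12) can be found by direct search and compared with \(h_1\). The grid search below is an illustrative sketch, not production code:

```python
import math

def mise_normal(h, n):
    """Exact MISE of the Gaussian-kernel estimator when f = phi:
    MISE = R(K)/(n h) + (1 - 1/n) R(phi_s) - 2 phi_t(0) + R(phi),
    with s = sqrt(1 + h^2), t = sqrt(2 + h^2)."""
    s = math.sqrt(1.0 + h * h)
    t = math.sqrt(2.0 + h * h)
    return (1.0 / (2.0 * math.sqrt(math.pi) * n * h)
            + (1.0 - 1.0 / n) / (2.0 * math.sqrt(math.pi) * s)
            - 2.0 / (math.sqrt(2.0 * math.pi) * t)
            + 1.0 / (2.0 * math.sqrt(math.pi)))

def h0_normal(n, grid=None):
    """h_0 from (12) by a simple grid search over (0, 2]."""
    grid = grid or [0.001 * k for k in range(1, 2001)]
    return min(grid, key=lambda h: mise_normal(h, n))
```

For \(n=1000\) this gives \(h_0\approx 0.27\), while \(h_1=(4/3)^{1/5}\,1000^{-1/5}\approx 0.266\), illustrating the asymptotic equivalence of \(h_0\) and \(h_1\) noted above (Hall and Marron 1991).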

The remainder of this work is organised as follows. In Sect. 2 we describe the asymptotic behaviour of the Bickel–Rosenblatt test statistics \(I_n(\hat{h})\) and \(J_n(\hat{h})\) with \(\hat{h}=\hat{h}_n(X_1,\ldots ,X_n)\) a general data-dependent smoothing parameter. In a univariate context these results extend those obtained by Fan (1994), Gürtler (2000) and Tenreiro (2007). The limiting null distribution and the consistency of the considered Bickel–Rosenblatt tests for location-scale families are stated in Sect. 3. In Sect. 4 we conduct a simulation study to compare the finite sample power performance of the Bickel–Rosenblatt tests based on the null hypothesis-based bandwidth selector \(\hat{h}_{H_0}\) with other scale equivariant bandwidth selectors \(\hat{h}\) satisfying condition (11). We consider the cases of the normal, logistic and Gumbel null location-scale models. Although \(\hat{h}_{H_0}\) does not satisfy this relative consistency condition unless \(f\in \mathscr {F}_0\), we conclude that the tests based on it, especially those based on \(I_n\), are in general more powerful than, or at least as powerful as, those based on the considered bandwidth selectors that satisfy such condition. Some other data-driven bandwidths inspired in the methods considered in Cao and Van Keilegom (2006), Martínez-Camblor et al. (2008) and Martínez-Camblor and Uña-Álvarez (2013) in the context of smooth tests for the k-sample problem are adapted to our context and included in the simulation study. These last bandwidth selectors, which can be computed by resampling, take the general form \(\hat{\lambda } \hat{h}\), where \(\hat{h}\) is a scale equivariant bandwidth selector (e.g. \(\hat{h}=\hat{h}_{H_0}\)) and \(\hat{\lambda }\) is a data-driven tuning parameter selector taking values in a finite set of tuning parameters \(\Lambda \) (e.g. \(\Lambda = \{ 0.5,0.75,1,1.5,2 \}\)). Nevertheless, none of these bandwidth selectors have shown to be preferable to \(\hat{h}_{H_0}\). 
Section 5 includes a brief summary and some conclusions. For convenience of exposition, the proofs are deferred to “Appendix A” and some of the simulation results are relegated to the online supplementary material.

2 Test statistics asymptotic behaviour

In this section we are interested in the asymptotic behaviour of the Bickel–Rosenblatt test statistics \(I_n(\hat{h})\) and \(J_n(\hat{h})\) given by (3) and (4), respectively, where \(\hat{h}=\hat{h}_n(X_1,\ldots ,X_n)\) is a general data-dependent smoothing parameter. In a univariate framework the results presented here extend those obtained by Fan (1994), Gürtler (2000) and Tenreiro (2007).

2.1 Asymptotic behaviour of \(I_n(\hat{h})\)

In order to describe the asymptotic behaviour of the integrated square error \(I_n(\hat{h})\) we consider the following assumptions on the underlying probability density function f, the parametric family \(\mathscr {F}_0\) given by (2), the estimators \(\hat{\theta }_1\) and \(\hat{\theta }_2\), the kernel K and the data-dependent bandwidth \(\hat{h}\). We denote by \(\mathscr {F}\) an appropriate set of probability density functions on \(\mathbb {R}\) that contains \(\mathscr {F}_0\) and to which the underlying probability density function f belongs, and by \(L^r\), for \(r\in [1,\infty ]\), the normed vector space of measurable functions \(\varphi : \mathbb {R} \rightarrow \mathbb {R}\) for which \(||\varphi ||_r<\infty \), where \(||\varphi ||_r:=\big (\int |\varphi (x)|^r \mathrm{d}x\big )^{1/r}\), for \(r\in [1,\infty [\), and \(||\varphi ||_\infty = \inf \{ c \ge 0: |\varphi (x)| \le c \text{ for } \text{ almost } \text{ every } x \}\).

Assumption (D) \(f\in L^\infty \), for all \(f\in \mathscr {F}\).

Assumption (F) \(f_0\) is two times continuously differentiable with \(f_0\in L^\infty \), \(f_0^\prime , y \mapsto y f_0^{\prime }(y) \in L^2 \cap L^r\) and \(f_0^{\prime \prime }, y \mapsto y^2 f_0^{\prime \prime }(y) \in L^r\), for some \(r\in \, ]2,\infty ]\).

Assumption (P) For all \(f\in \mathscr {F}\) there exist \(\theta _1(f) \in \mathbb {R}\) and \(\theta _2(f)>0\) such that \(\hat{\theta }_k {\mathop {\longrightarrow }\limits ^{p}} \theta _k(f)\), for \(k=1,2\). Moreover, if \(f=g(\cdot ;\theta _1,\theta _2)\), for some \(\theta _1\in \mathbb {R}\) and \(\theta _2>0\) (i.e. \(f \in \mathscr {F}_0\)), we assume that

$$\begin{aligned} \sqrt{n} \big ( \hat{\theta }_k - \theta _k \big ) = \frac{1}{\sqrt{n}} \sum _{i=1}^n \psi _k(X_i;\theta _1,\theta _2) + o_p(1), \end{aligned}$$

where \(\psi _k(\cdot ;\theta _1,\theta _2)\) is a real function depending on \(\theta _1\) and \(\theta _2\), with \({E}_f(\psi _k(X;\theta _1,\theta _2))=0\) and \({E}_f(\psi _k(X;\theta _1,\theta _2)^2)<\infty \), for \(k=1,2\).

Assumption (K) The kernel K belongs to \(\mathscr {K}^\omega \), for some \(\omega \in \{2,3,\ldots \}\), where \(\mathscr {K}^\omega \) is the set of real-valued functions K on \(\mathbb {R}\) with continuous derivatives up to order \(\omega \) such that \(\lim _{|u| \rightarrow \infty } uK(u) = 0,\) for which there exists \(\eta \in \,]0,1[\), such that the real-valued functions \(K^{\ell ,\eta }\) defined, for \(u\in \mathbb {R}\), by \(K^{\ell ,\eta }(u) = |u|^{\ell } \sup _{|h-1|\le \eta } |K^{(\ell )}(u/h)|,\) are bounded and integrable on \(\mathbb {R}\) for \(\ell =0,1,\ldots ,\omega \).

The standard Gaussian kernel \(K=\phi \) belongs to \(\mathscr {K}^\omega \) for all \(\omega \), and every kernel with compact support with continuous derivatives up to order \(\omega \) belongs to \(\mathscr {K}^\omega \).

Assumption (B) For all \(f\in \mathscr {F}\), there exists a deterministic sequence \((h_n(f))=(h(f))\) of strictly positive real numbers satisfying \(h(f) \rightarrow 0\) and \(nh(f) \rightarrow \infty \), as \(n\rightarrow \infty \), such that

$$\begin{aligned} \xi _n := \frac{\hat{h}}{h(f)} - 1 = o_p(1). \end{aligned}$$

As mentioned before, under some conditions on f and K, assumption (B) is fulfilled by the least squares cross-validation bandwidth and by the two-stage direct plug-in bandwidth selector with \(h(f)=h_0\), where \(h_0\) is given by (12). Of course, in these cases assumption (B) is also fulfilled with \(h(f)=h_1\), where \(h_1\) is given by (7), as \(h_0\) and \(h_1\) are asymptotically equivalent (see Hall and Marron 1991, p. 159). From a density estimation point of view, the distinction between bandwidth selectors is usually based on the rate of convergence to zero of the relative error \(\xi _n\). For example, we have \(\xi _n = O_p\big ( n^{-1/10} \big )\) for the least squares cross-validation bandwidth (see Scott and Terrell 1987; Hall and Marron 1987), and \(\xi _n = O_p\big ( n^{-5/14} \big )\) for the two-stage direct plug-in bandwidth selector (see Tenreiro 2003). A better order of convergence is achieved by the smoothed cross-validation method of Hall et al. (1992) and by the plug-in method of Hall et al. (1991), for which we have \(\xi _n = O_p\big ( n^{-1/2} \big )\). Note that these rates of convergence are not directly comparable, since the conditions imposed on f in each case are not necessarily the same. A different situation occurs when \(\hat{h}\) is the well-known normal scale bandwidth selector defined by \(\hat{h} = (8 \sqrt{\pi }/3)^{1/5} c_K n^{-1/5} \hat{\sigma },\) where \(c_K\) is given by (8) and \(\hat{\sigma }\) is a consistent estimator of the standard deviation \(\sigma _f\) of f (see Wand and Jones 1995, p. 60). Although this bandwidth selector satisfies assumption (B) with \(h(f)=(8 \sqrt{\pi }/3)^{1/5} c_K n^{-1/5} \sigma _f\), and we have \(\xi _n = O_p\big ( n^{-1/2} \big )\) whenever the scale estimator is such that \(\hat{\sigma } - \sigma _f = O_p\big (n^{-1/2}\big )\), the normal scale bandwidth selector does not fulfil the relative consistency condition (11).

In the next result, whose proof is given in Sect. A.1, we describe the asymptotic behaviour of the Bickel–Rosenblatt test statistic \(I_n(\hat{h})\) given by (3), where \(\hat{h}=\hat{h}_n(X_1,\ldots ,X_n)\) is a general data-dependent smoothing parameter. Recall that \(R(\varphi )=\int \varphi (x)^2 \mathrm{d}x\) for an arbitrary real-valued measurable function \(\varphi \).

Theorem 1

Under assumptions (D), (F), (P), (K) and (B), let us assume that

$$\begin{aligned} h(f)^{-1/2} \xi _n^2 + nh(f)^{1/2} \xi _n^\omega = o_p(1). \end{aligned}$$
(13)
  1. (a)

    If the null hypothesis is true, then

    $$\begin{aligned} \nu _f^{-1} h(f)^{-1/2} \big ( I_n(\hat{h}) - R(K) \big ) {\mathop {\longrightarrow }\limits ^{d}} N(0,1), \end{aligned}$$

    where

    $$\begin{aligned} \nu _f^2 = 2 R(K*K) R(f). \end{aligned}$$
  2. (b)

    If the alternative hypothesis is true, then

    $$\begin{aligned} ( nh(f) )^{-1} \big ( I_n(\hat{h}) - R(K) \big ) {\mathop {\longrightarrow }\limits ^{p}} R\big (f-g(\cdot ; \theta _1(f),\theta _2(f) )\big ). \end{aligned}$$

Remark 1

If \(h(f)=c n^{-1/5}(1+o(1))\) and \(\xi _n = O_p\big ( n^{-\alpha } \big )\), for some \(c>0\) and \(0<\alpha \le 1/2\), condition (13) is satisfied whenever \(\alpha >\max (1/20,9/(10\omega ))\). Therefore, it holds for the least squares cross-validation bandwidth selector whenever \(\omega \ge 10\), and for the two-stage direct plug-in bandwidth selector if \(\omega \ge 3\).

2.2 Asymptotic behaviour of \(J_n(\hat{h})\)

In order to describe the asymptotic behaviour of the integrated square error \(J_n(\hat{h})\) some additional assumptions are needed.

Assumption (D\(^{\prime }\)) For all \(f\in \mathscr {F}\), f is two times continuously differentiable on \(\mathbb {R}\) with \(f'' \in L^\infty \cap L^2\).

Assumption (F\(^{\prime }\)) \(f_0\) is such that \(f_0^{\prime \prime }\in L^\infty \cap L^s\), with \(1/r+1/s=1\) and \(r \in \, ]2,\infty ]\) is given in assumption (F).

Assumption (K\(^{\prime }\)) The functions \(u \mapsto u^2 K^{\ell ,\eta }(u)\), for \(\ell =0,1,\ldots ,\omega \), where \(K^{\ell ,\eta }\) is defined in assumption (K), are bounded and integrable on \(\mathbb {R}\).

Assumption (B\(^{\prime }\)) For all \(f\in \mathscr {F}\), the deterministic sequence h(f) is such that \(nh(f)^5 \rightarrow \lambda _f\), as \(n\rightarrow \infty \), for some \(\lambda _f \in \, ]0,\infty [\).

Note that if \(h(f)=h_0\), where \(h_0\) is given by (12), then \(\lambda _f=c_K^{5} R(f^{\prime \prime })^{-1}\), where \(c_K\) is given in (8).

In the next result, whose proof is given in Sect. A.2, we describe the asymptotic behaviour of the Bickel–Rosenblatt test statistic \(J_n(\hat{h})\) given by (4), where \(\hat{h}=\hat{h}_n(X_1,\ldots ,X_n)\) is a general data-dependent smoothing parameter.

Theorem 2

Under assumptions (D), (D\(^{\prime }\)), (F), (F\(^{\prime }\)), (P), (K), (K\(^{\prime }\)), (B), (B\(^{\prime }\)), let us assume that

$$\begin{aligned} h(f)^{-1/2} \xi _n + nh(f)^{1/2} \xi _n^\omega = o_p(1). \end{aligned}$$
(14)
  1. (a)

    If the null hypothesis is true, then

    $$\begin{aligned} \upsilon _f^{-1} h(f)^{-1/2} \big ( J_n(\hat{h}) - R(K) - c_n(f;K) \big ) {\mathop {\longrightarrow }\limits ^{d}} N(0,1) \end{aligned}$$

    where

    $$\begin{aligned} c_n(f;K) = nh(f) R(K_{h(f)}*f - f), \end{aligned}$$

    and

    $$\begin{aligned} \upsilon _f^2 = 2R(K*K) R(f) + \lambda _f \mu _2(K)^2 \mathrm {Var}_f(\varphi _f(X)) \end{aligned}$$

    with

    $$\begin{aligned} \varphi _f(u) = f''(u) - \sum _{k} \psi _k(u;\theta _1(f),\theta _2(f))\int \bar{f}^{\prime \prime }(x) \frac{\partial g}{\partial \theta _k} (x;\theta _1(f),\theta _2(f))\mathrm{d}x, \end{aligned}$$

    where \(\bar{f}(x)=f(-x)\), for \(x\in \mathbb {R}\).

  2. (b)

    If the alternative hypothesis is true, then

    $$\begin{aligned} ( nh(f) )^{-1} \big ( J_n(\hat{h}) - R(K) - c_n(f;K) \big ) {\mathop {\longrightarrow }\limits ^{p}} R\big (f-g(\cdot ; \theta _1(f),\theta _2(f) )\big ). \end{aligned}$$

Remark 2

Under the conditions of Remark 1, condition (14) holds if \(\alpha >\max (1/10,9/(10\omega ))\). Therefore, it is not fulfilled by the least squares cross-validation bandwidth selector, and it holds for the two-stage direct plug-in bandwidth selector whenever \(\omega \ge 3\).

3 Bickel–Rosenblatt tests for location-scale families

Under the assumptions of Theorems 1 and 2, if \(\hat{\theta }_1\) and \(\hat{\theta }_2\) are location-scale and scale equivariant estimators of \(\theta _1\) and \(\theta _2\), respectively, and the deterministic sequence h(f) is scale equivariant (that is, \(h(g)=b h(f)\), where \(g(\cdot )=f((\cdot -a)/b)/b\), for all \(a\in \mathbb {R}\) and \(b>0\)), a property that is satisfied by the exact optimal bandwidth (12), we can easily conclude that \(\nu ^{-1}_f h(f)^{-1/2} = \nu ^{-1}_{f_0} h(f_0)^{-1/2}\), \(\upsilon _{f}^{-1} h(f)^{-1/2} = \upsilon ^{-1}_{f_0} h(f_0)^{-1/2}\) and \(c_n(f;K)=c_n(f_0;K)\). Therefore, from Theorems 1 and 2 we deduce that the tests based on the critical regions

$$\begin{aligned} C_n(I_n(\hat{h}),\alpha ) = \big \{ \nu ^{-1}_{f_0} h(f_0)^{-1/2} \big ( I_n(\hat{h}) - R(K) \big ) > \Phi ^{-1}(1-\alpha ) \big \} \end{aligned}$$

and

$$\begin{aligned} C_n(J_n(\hat{h}),\alpha ) = \big \{ \upsilon _{f_0}^{-1} h(f_0)^{-1/2} \big ( J_n(\hat{h}) - R(K) - c_n(f_0;K) \big ) > \Phi ^{-1}(1-\alpha ) \big \}, \end{aligned}$$

where \(\alpha \in \,]0,1[\) and \(\Phi ^{-1}(1-\alpha )\) is the quantile of order \(1-\alpha \) of the standard normal distribution, are asymptotically of level \(\alpha \) and consistent for testing \(f\in \mathscr {F}_0\) against \(f\in \mathscr {F}{\setminus } \mathscr {F}_0\), that is, \(P_f\big ( C_n(T_n,\alpha ) \big ) \rightarrow \alpha \), for all \(f\in \mathscr {F}_0\), and \(P_f\big ( C_n(T_n,\alpha ) \big ) \rightarrow 1\), for all \(f\in \mathscr {F}{\setminus }\mathscr {F}_0\), where \(T_n=T_n(X_1,\ldots ,X_n)\) stands for either \(I_n(X_1,\ldots ,X_n;\hat{h}(X_1,\ldots ,X_n))\) or \(J_n(X_1,\ldots ,X_n;\hat{h}(X_1,\ldots ,X_n))\).
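For the normality test the constants entering \(C_n(I_n(\hat h),\alpha)\) are fully explicit, since \(K*K=\phi_{\sqrt 2}\) and \(R(\phi_s)=1/(2s\sqrt{\pi})\). The sketch below (Python, illustrative; the function name and interface are ours) standardizes an observed value of \(I_n(\hat h)\) and applies this critical region with \(h(f_0)=(4/3)^{1/5}n^{-1/5}\):

```python
import math
from statistics import NormalDist

def asymptotic_test_In(In_value, n, alpha=0.05):
    """Asymptotic level-alpha test C_n(I_n(h_hat), alpha) for the normality
    case K = f_0 = phi: h(f_0) = (4/3)^{1/5} n^{-1/5}, R(K) = 1/(2 sqrt(pi))
    and nu_{f_0}^2 = 2 R(phi_{sqrt 2}) R(phi) = 1/(2 sqrt(2) pi).
    Returns the standardized statistic and the rejection decision."""
    h = (4.0 / 3.0) ** 0.2 * n ** -0.2
    RK = 1.0 / (2.0 * math.sqrt(math.pi))
    nu = math.sqrt(1.0 / (2.0 * math.sqrt(2.0) * math.pi))
    z = (In_value - RK) / (nu * math.sqrt(h))
    return z, z > NormalDist().inv_cdf(1.0 - alpha)
```

As discussed next, this normal approximation can be quite inaccurate in finite samples, so it mainly serves to make the constants concrete.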

As in the case where \(\hat{h}\) is deterministic (see Fan 1995, p. 372), some simulation results reveal that the asymptotic normal distribution provides a poor approximation to the finite sample distributions of \(I_n(\hat{h})\) and \(J_n(\hat{h})\) under the null hypothesis, which implies large differences between the true level and the nominal level of the tests based on the previous critical regions. This fact is illustrated in Table 1, where type I error estimates based on 20,000 simulations under the null hypothesis are shown for the normality tests based on the previous critical regions with \(K=\phi \) and \(\hat{h}=\hat{h}_{H_0}\) given by (10).

Table 1 Type I error estimates for the normality tests based on the critical regions \(C_n(I_n(\hat{h}),\alpha )\) and \(C_n(J_n(\hat{h}),\alpha )\), with \(K=\phi \) and \(\hat{h}=\hat{h}_{H_0}\), for nominal significance levels \(\alpha =0.1,0.05,0.01\) and sample sizes \(n=10^k\), \(k=2,3,4\)

In order to circumvent this problem, the standard strategy (see Fan 1995, pp. 372–373) is to consider instead the test defined by the critical region

$$\begin{aligned} \mathscr {C}(T_n,\alpha ) = \big \{ T_n > q(T_n^*,\alpha ) \big \}, \end{aligned}$$

where \(T_n=T_n(X_1,\ldots ,X_n)\) stands for either \(I_n(X_1,\ldots ,X_n;\hat{h}(X_1,\ldots ,X_n))\) or \(J_n(X_1,\ldots ,X_n;\) \(\hat{h}(X_1,\ldots ,X_n))\), and \(q(T_n^*,\alpha )=q(T_n^*,\alpha ;X_1,\ldots ,X_n)\) denotes the quantile of order \(1-\alpha \) of the distribution of the random variable \(T_n^*\) defined as follows:

  1. (1)

    Use the original sample \(X_1,\ldots ,X_n\) to compute \(\hat{\theta }_1\) and \(\hat{\theta }_2\);

  2. (2)

    Draw a random sample \(U_1,\ldots ,U_n\) from the distribution \(f_0\) and define the bootstrap sample by \(X_{n,i}^* = \hat{\theta }_1 + \hat{\theta }_2 U_i\), for \(i=1,\ldots ,n\);

  3. (3)

    Use the bootstrap sample to compute \(T_n(X_{n,1}^*,\ldots ,X_{n,n}^*)\) and call it \(T_n^*\).

Of course, if the test statistic \(T_n\) is location-scale invariant, which occurs if we further assume that \(\hat{h}\) is scale equivariant, the quantile \(q(T_n^*,\alpha )\), which does not depend on the observed sample, is the quantile of order \(1-\alpha \) of \(T_n\) under \(H_0\), which we denote by \(q(T_n,\alpha )\). This quantile is assumed to be a known quantity, as it can be well approximated by repeating steps (2) and (3) a large number of times. As stated in the next result, whose proof is presented in Sect. A.3, in this important case the test based on the critical region \(\mathscr {C}(T_n,\alpha )\) has a level of significance at most equal to \(\alpha \) for each sample size n and is consistent for testing \(f\in \mathscr {F}_0\) against \(f\in \mathscr {F}{\setminus }\mathscr {F}_0\).
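Since the statistic is location-scale invariant, \(T_n(X^*_{n,1},\ldots,X^*_{n,n})=T_n(U_1,\ldots,U_n)\) in step (3), so the null quantile can be approximated by simulating standard samples from \(f_0\) directly. A minimal sketch (Python, with \(f_0=\phi\) as in the normality example; the statistic is passed in as a function, and the helper name is ours):

```python
import math
import random

def null_quantile(statistic, n, alpha, reps=2000, seed=0):
    """Approximate q(T_n, alpha), the order-(1 - alpha) null quantile of a
    location-scale invariant statistic, by Monte Carlo: draw reps samples
    of size n from f_0 (here standard normal) and take the empirical
    quantile of the simulated statistic values (steps (2)-(3) above)."""
    rng = random.Random(seed)
    vals = sorted(statistic([rng.gauss(0.0, 1.0) for _ in range(n)])
                  for _ in range(reps))
    k = min(reps - 1, max(0, math.ceil((1.0 - alpha) * reps) - 1))
    return vals[k]

# the test then rejects H_0 when T_n(X_1, ..., X_n) > null_quantile(...)
```

In the simulations of Sect. 4 this is exactly how the quantiles in (15) are obtained, with 100,000 replications.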

Theorem 3

Under the assumptions of Theorems 1 or 2, let us assume that \(\hat{\theta }_1\) and \(\hat{\theta }_2\) are location-scale and scale equivariant estimators of \(\theta _1\) and \(\theta _2\), respectively. If the bandwidth selector \(\hat{h}\) is scale equivariant, then the test statistic \(T_n\), where \(T_n\) stands for either \(I_n(\hat{h})\) or \(J_n(\hat{h})\), is location-scale invariant, and the test based on the critical region

$$\begin{aligned} \mathscr {C}(T_n,\alpha ) = \big \{ T_n > q(T_n,\alpha ) \big \}, \end{aligned}$$
(15)

where \(\alpha \in \,]0,1[\), is such that

$$\begin{aligned} P_f\big ( \mathscr {C}(T_n,\alpha ) \big ) \le \alpha , \;\text{ for } \text{ all } \, f\in \mathscr {F}_0, \end{aligned}$$

and

$$\begin{aligned} P_f\big ( \mathscr {C}(T_n,\alpha ) \big ) \rightarrow 1, \;\text{ for } \text{ all } \, f\in \mathscr {F}{\setminus }\mathscr {F}_0. \end{aligned}$$

4 Finite sample results

In this section we conduct a simulation study to compare the finite sample power performance of goodness-of-fit tests based on critical regions (15) for several choices of the scale equivariant bandwidth selector \(\hat{h}\). More precisely, we intend to compare the null hypothesis-based bandwidth selector \(\hat{h}_{H_0}\) proposed by Bowman (1992) and given by (9) with other scale equivariant bandwidth selectors \(\hat{h}\) satisfying the relative consistency condition (11), for which it is expected, at least from an asymptotic point of view, that the kernel estimator \(f_{\hat{h}}\) is an effective estimator of the underlying density f irrespective of which of the null or the alternative hypothesis is true. To this end, besides \(\hat{h}_{H_0}\), three other automatic and scale equivariant bandwidth selectors are considered in our study. They are the least squares cross-validation bandwidth selector \(\hat{h}_{\mathrm {CV}}\), the two-stage direct plug-in bandwidth selector \(\hat{h}_{\mathrm {PI}}\) (see Wand and Jones 1995, pp. 63–65, 71–72), and a modified version \(\hat{h}_{\mathrm {CT}}\) of the bandwidth selector proposed in Chacón and Tenreiro (2013), where the cross-validation function is replaced by the weighted cross-validation function with \(\gamma =0.5\) (for the definition of the weighted cross-validation function, see Tenreiro 2017, p. 3440). Under some conditions on f, \(\hat{h}_{\mathrm {CT}}\) fulfils assumption (B) with \(h(f)=h_0\) and \(\xi _n = O_p\big ( n^{-5/14} \big )\) (see Chacón and Tenreiro 2013, Theorem 3.1, p. 2207). The power results observed in our simulation study for the bandwidths \(\hat{h}_{\mathrm {CV}}\), \(\hat{h}_{\mathrm {PI}}\) and \(\hat{h}_{\mathrm {CT}}\) reveal that this latter bandwidth presents a good overall performance for a wide range of alternative density features, which is relevant for real data situations where there is usually little prior information on the alternative density shape. For this reason, and because no essential feature is lost, hereafter we confine ourselves to the results obtained with the bandwidths \(\hat{h}_{H_0}\) and \(\hat{h}_{\mathrm {CT}}\).

From representations (5) and (6), and taking for K the standard normal density, which we always assume from now on, the test statistics \(I_n(\hat{h})\) and \(J_n(\hat{h})\) can be evaluated from the equalities

$$\begin{aligned} I_n(\hat{h}) = \frac{\tilde{h}}{n} \sum _{i,j=1}^n Q(Y_{n,i},Y_{n,j};\tilde{h}) \end{aligned}$$

and

$$\begin{aligned} J_n(\hat{h}) = \frac{\tilde{h}}{n} \sum _{i,j=1}^n R(Y_{n,i},Y_{n,j};\tilde{h}), \end{aligned}$$

where for \(u,v\in \mathbb {R}\) and \(h>0\),

$$\begin{aligned} Q(u,v;h)= \phi _{\sqrt{2}h}(u-v) - \phi _{\sqrt{2}h}*f_0(u) - \phi _{\sqrt{2}h}*f_0(v) + \phi _{\sqrt{2}h}*\bar{f}_0*f_0(0) \end{aligned}$$

and

$$\begin{aligned} R(u,v;h)=\phi _{\sqrt{2}h}(u-v) - \phi _{h}*f_0(u) - \phi _{h}*f_0(v) + \bar{f}_0*f_0(0), \end{aligned}$$

with \(\tilde{h}=\hat{h}/\hat{\theta }_2\), \(Y_{n,j} = (X_j -\hat{\theta }_1)/\hat{\theta }_2\), \(j=1,\ldots ,n\), and \(\bar{f}_0(u)=f_0(-u)\), for \(u\in \mathbb {R}\). Taking into account the convolution properties of the Gaussian densities (see Wand and Jones 1995, pp. 177–180), the calculation of \(I_n(\hat{h})\) and \(J_n(\hat{h})\) is especially simple for the normality test in which case no numerical integration is required. In this case, we have

$$\begin{aligned} I_n(\hat{h}) = n\tilde{h} \sum _{k,l=1}^{n+1} w_k \phi _{(\beta _k^2+\beta _l^2)^{1/2}}(\alpha _k-\alpha _l) w_l \end{aligned}$$

and

$$\begin{aligned} J_n(\hat{h}) = n\tilde{h} \sum _{k,l=1}^{n+1} w_k \phi _{(\gamma _k^2+\gamma _l^2)^{1/2}}(\alpha _k-\alpha _l) w_l, \end{aligned}$$

where \(w=(\frac{1}{n},\ldots ,\frac{1}{n},-1),\) \(\alpha =(Y_{n,1},\ldots ,Y_{n,n},0),\) \(\beta =\big (\tilde{h},\ldots ,\tilde{h},(\tilde{h}^2 + 1)^{\frac{1}{2}}\big )\) and \(\gamma =(\tilde{h},\ldots ,\tilde{h},1).\)
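Putting the pieces together, the two statistics for the normality test (with \(K=\phi\), \(\hat h=\hat h_{H_0}\), \(\hat\theta_1=\bar X_n\), \(\hat\theta_2=S_n\)) can be sketched directly from the double sums above. Note that only the scaled residuals and \(\tilde h\) enter, which makes the location-scale invariance visible in the code (Python sketch, purely illustrative):

```python
import math

def phi_s(x, s=1.0):
    """Gaussian density with mean 0 and standard deviation s."""
    return math.exp(-0.5 * (x / s) ** 2) / (s * math.sqrt(2.0 * math.pi))

def bickel_rosenblatt_normal(data):
    """I_n(h_hat) and J_n(h_hat) for the normality test with K = phi and
    h_hat = h_{H_0} = (4/3)^{1/5} S_n n^{-1/5}, via the closed-form
    double sums above (no numerical integration needed)."""
    n = len(data)
    m = sum(data) / n
    s = math.sqrt(sum((x - m) ** 2 for x in data) / n)
    ht = (4.0 / 3.0) ** 0.2 * n ** -0.2        # h~ = h_{H_0} / S_n
    y = [(x - m) / s for x in data]            # scaled residuals
    w = [1.0 / n] * n + [-1.0]
    alpha = y + [0.0]
    beta = [ht] * n + [math.sqrt(ht * ht + 1.0)]
    gamma = [ht] * n + [1.0]
    I = J = 0.0
    for k in range(n + 1):
        for l in range(n + 1):
            d = alpha[k] - alpha[l]
            I += w[k] * phi_s(d, math.sqrt(beta[k] ** 2 + beta[l] ** 2)) * w[l]
            J += w[k] * phi_s(d, math.sqrt(gamma[k] ** 2 + gamma[l] ** 2)) * w[l]
    return n * ht * I, n * ht * J
```

Both values are nonnegative, being integrated squared differences, and applying the same affine map to all observations leaves them unchanged, in line with Theorem 3.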

We start the study of the finite sample performance of the tests based on critical regions (15) for nominal levels \(\alpha =0.1,0.05,0.01\) by considering the test of normality, in which case the null model is given by (2) with \(f_0=\phi \), and we take \(\hat{\theta }_1=\bar{X}_n\) and \(\hat{\theta }_2=S_n\), the maximum likelihood estimators of \(\theta _1\) and \(\theta _2\) under \(H_0\). As the test statistics \(I_n(\hat{h})\) and \(J_n(\hat{h})\) for \(\hat{h}=\hat{h}_{H_0}\) and \(\hat{h}=\hat{h}_{\mathrm {CT}}\) are invariant under the null hypothesis (see Theorem 3), the quantiles of order \(1-\alpha \) in critical regions (15) are estimated by performing 100,000 simulations under the null hypothesis. We consider alternative distributions from a well-known set of normal mixture densities introduced in Marron and Wand (1992), which is often used in the context of kernel density estimation. This set is very rich, containing densities with a wide variety of features, such as kurtosis, skewness and multimodality. The densities of the considered alternatives, jointly with the density of the normal distribution with the same mean and variance, are shown in Fig. 1. The densities are identified as in Marron and Wand (1992), and the values of the parameters of this set of normal mixture densities are given in Table 1 of the same article. For the nominal level \(\alpha =0.05\) and sample sizes \(n=20,50,80\) we report in Table 2 the power estimates based on 10,000 samples from the considered set of alternative densities. All the simulations in this work were carried out using programs written in the R language (R Development Core Team 2019).

Fig. 1

Probability density functions of alternatives from the Marron and Wand (1992) set of normal mixture densities (solid) and the probability density function of the Gaussian distribution with the same mean and variance as the considered alternative (dashed)

Table 2 Empirical power results, at level \(\alpha =0.05\), for the goodness-of-fit tests for the normal distribution based on \(I_n(\hat{h})\) and \(J_n(\hat{h})\) with \(\hat{h}=\hat{h}_{H_0}\) and \(\hat{h}=\hat{h}_{\mathrm {CT}}\), for some alternatives from the Marron and Wand (1992) set of normal mixture densities

Taking into account some simulation experiments, not presented here to save space, that estimate the mean integrated squared error of the kernel density estimator for each of the bandwidths \(\hat{h}_{H_0}\) and \(\hat{h}_{\mathrm {CT}}\), we can conclude that the kernel estimator based on \(\hat{h}_{H_0}\) performs better than that based on \(\hat{h}_{\mathrm {CT}}\) for the normal mixture densities 2, 6, 8, 9 and 12 (for the considered sample sizes). This may explain the results shown in Table 2, where the tests based on \(\hat{h}_{H_0}\) generally perform better than those based on \(\hat{h}_{\mathrm {CT}}\) for alternatives 2, 6, 8 and 9, and the two perform similarly for alternative 12. The opposite situation occurs for the remaining four normal mixtures, where the kernel density estimator based on \(\hat{h}_{\mathrm {CT}}\) performs much better than that based on \(\hat{h}_{H_0}\). However, only for normal mixtures 4 and 15 do the tests based on \(\hat{h}_{\mathrm {CT}}\) perform clearly better than those based on \(\hat{h}_{H_0}\). For densities 3 and 7 the tests perform similarly. As the considered alternative densities are far in shape from the null hypothesis density family, we can conclude that even a poorly performing bandwidth selector, from a density estimation point of view, is good enough to detect such alternatives. In this situation, estimation and testing demand different answers regarding bandwidth selection. The results presented in Table 2 for the skewed unimodal density 2 also deserve an additional comment. This is an interesting case because density 2 is not far from the normal density in shape, and we may expect that \(\hat{h}_{H_0}\), being based on the null density family, may reach good power results for alternative densities that are not far in shape from the null density model. The simulation results observed for density 2 support this idea.
The results presented in Table 2 also show different performances for the tests based on the test statistics \(I_n(\hat{h})\) and \(J_n(\hat{h})\), no matter which bandwidth is used. The statistic \(J_n(\hat{h})\) seems to be more effective in detecting multimodal alternatives, whereas \(I_n(\hat{h})\) generally performs better in detecting unimodal alternatives.

Based on the previous conclusions, we have good reasons to believe that \(\hat{h}_{H_0}\) may reach a good power performance for wide sets of alternative distributions. In order to examine this question in detail, in addition to the goodness-of-fit test for the normal distribution we also consider two other null location-scale models. They are the logistic model, where \(f_0(x)=(\exp (-x/2) + \exp (x/2))^{-2}\), for \(x\in \mathbb {R}\), and the Gumbel extreme value model, where \(f_0(x)=\exp (-x-\exp (-x))\), for \(x\in \mathbb {R}\). For this latter family of distributions we take for \(\hat{\theta }_1\) and \(\hat{\theta }_2\) the maximum likelihood estimators of \(\theta _1\) and \(\theta _2\), which satisfy

$$\begin{aligned} \hat{\theta }_2 = \bar{X}_n - \frac{\sum _{j=1}^n X_j \exp (-X_j/\hat{\theta }_2)}{\sum _{j=1}^n \exp (-X_j/\hat{\theta }_2)} \text{ and } \hat{\theta }_1 = -\hat{\theta }_2 \log \bigg ( n^{-1} \sum _{j=1}^n \exp (-X_j/\hat{\theta }_2) \bigg ). \end{aligned}$$

In the case of the goodness-of-fit test for the logistic distribution we use the moment estimators \(\hat{\theta }_1=\bar{X}_n\) and \(\hat{\theta }_2=\sqrt{3}\,S_n / \pi \), which are simpler to evaluate and nearly as efficient as the maximum likelihood estimators (see Johnson et al. 1995, pp. 127–130). As in the goodness-of-fit test for the normal distribution, the assumptions of Theorem 3 are satisfied, and the tests based on critical regions (15) are implemented as explained before.

For comparison purposes, besides the bandwidth selectors \(\hat{h}_{H_0}\) and \(\hat{h}_{\mathrm {CT}}\), we consider in this study other bandwidth selectors based on the common principle that the bandwidth should be tuned to improve the power performance of the test. To implement this idea, we consider the set of scale equivariant bandwidths based on \(\hat{h}\), where \(\hat{h}\) stands for \(\hat{h}_{H_0}\) or \(\hat{h}_{\mathrm {CT}}\), given by

$$\begin{aligned} \hat{h}_\lambda (X_1,\ldots ,X_n)=\lambda \hat{h}(X_1,\ldots ,X_n),\;\lambda \in \Lambda , \end{aligned}$$

where \(\Lambda \) is a finite set of strictly positive real numbers that act as tuning parameters. Besides the value \(\lambda =1\), associated with the reference bandwidth \(\hat{h}\), this set is meant to include tuning parameters smaller and larger than one. If we denote by \(T_{n,\lambda }(X_1,\ldots ,X_n)\) one of the statistics \(I_n(X_1,\ldots ,X_n; \hat{h}_\lambda (X_1,\ldots ,X_n))\) or \(J_n(X_1,\ldots ,X_n;\hat{h}_\lambda (X_1,\ldots ,X_n))\), from the scale equivariance of \(\hat{h}\) we know that \(T_{n,\lambda }\) is location-scale invariant, and therefore the null distribution of \(T_{n,\lambda }\) does not depend on \(f \in \mathscr {F}_0\), where \(\mathscr {F}_0\) is given by (2). Consequently, the tests with critical regions

$$\begin{aligned} \mathscr {C}(T_{n,\lambda },\alpha )=\{ T_{n,\lambda } > q(T_{n,\lambda },\alpha ) \},\;\lambda \in \Lambda , \end{aligned}$$
(16)

where \(q(T_{n,\lambda },\alpha )\) denotes the quantile of order \(1-\alpha \) of \(T_{n,\lambda }\) under \(H_0\), have levels of significance at most equal to \(\alpha \). As before, we assume that these quantiles are known quantities, as they can be well approximated by simulating under the null hypothesis a large number of times (100,000 replications under the null hypothesis are used). The power properties of each of the previous test procedures depend on \(\lambda \), which is why its choice is usually crucial to obtaining a powerful test procedure. In order to make such a choice, we need to define a suitable location-scale invariant measurable function \(\hat{\lambda }=\hat{\lambda }(X_1,\ldots ,X_n)\) taking values in \(\Lambda \), called a tuning parameter selector, on the basis of which we can consider a test procedure based on the critical region

$$\begin{aligned} \mathscr {C}(T_{n,\hat{\lambda }},\alpha ) = \{ T_{n,\hat{\lambda }} > q(T_{n,\hat{\lambda }},\alpha ) \}, \end{aligned}$$
(17)

where \(q(T_{n,\hat{\lambda }},\alpha )\) denotes the quantile of order \(1-\alpha \) of \(T_{n,\hat{\lambda }}\) under \(H_0\). This test has a level of significance at most equal to \(\alpha \) for each sample size n.

In order to define effective methods for selecting the tuning parameter \(\lambda \in \Lambda \), we adapt to our situation three methods considered in Cao and Van Keilegom (2006), Martínez-Camblor et al. (2008) and Martínez-Camblor and Uña-Álvarez (2013) in the context of smooth tests for the k-sample problem. Given the level \(\alpha \) of the test and a sample \(X_1,\ldots ,X_n\) from f, the first tuning parameter selector we consider, which we denote by \(\hat{\lambda }_{1}=\hat{\lambda }_{1}(X_1,\ldots ,X_n;\alpha )\), was originally used in Cao and Van Keilegom (2006, p. 69) and is defined as the value in \(\Lambda \) that maximises the smooth bootstrap power, that is,

$$\begin{aligned} \hat{\lambda }_1 = \mathrm {arg}\!\max _{\lambda \in \Lambda } \frac{1}{B_1} \sum _{k=1}^{B_1} I\big ( T_{n,\lambda }(X^*_{k,1},\ldots ,X^*_{k,n}) > q(T_{n,\lambda },\alpha ) \big ), \end{aligned}$$
(18)

with

$$\begin{aligned} X^*_{k,j}=X_{U_{(k-1)n+j}} + \hat{h}_{\mathrm {CT}}(X_1,\ldots ,X_n) N_{(k-1)n+j}, \end{aligned}$$

for \(k=1,\ldots ,B_1\) and \(j=1,\ldots ,n\), where \(N_l\), for \(l=1,\ldots ,nB_1\), are independent standard normal random variables, and \(U_l\), for \(l=1,\ldots ,nB_1\), are independent random variables with the discrete uniform distribution on \(\{ 1,\ldots ,n \}\); that is, for each \(k=1,\ldots ,B_1\), \(X^*_{k,1},\ldots ,X^*_{k,n}\) is generated by resampling from the Parzen–Rosenblatt estimator with Gaussian kernel and smoothing parameter \(\hat{h}_{\mathrm {CT}}(X_1,\ldots ,X_n)\). As expressed by the notation \(\hat{\lambda }_{1}(X_1,\ldots ,X_n;\alpha )\), note that \(\hat{\lambda }_1\) depends on the considered level \(\alpha \).
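The smooth bootstrap resampling and the maximisation in (18) can be sketched as follows in Python; `stat` and `quantiles` are hypothetical placeholders for \(T_{n,\lambda }\) and its precomputed null quantiles \(q(T_{n,\lambda },\alpha )\), both assumed available.

```python
import random

def smooth_bootstrap(x, h, B, seed=0):
    # B samples of size n from the Parzen-Rosenblatt estimate with Gaussian
    # kernel and bandwidth h: X* = X_U + h * N, with U uniform on the sample
    # indices and N standard normal.
    rng = random.Random(seed)
    n = len(x)
    return [[x[rng.randrange(n)] + h * rng.gauss(0.0, 1.0)
             for _ in range(n)] for _ in range(B)]

def select_lambda_1(x, stat, h, quantiles, Lambda, B1=200, seed=0):
    # lambda_hat_1: the lambda in Lambda maximising the smooth-bootstrap
    # rejection frequency; stat(sample, lam) plays the role of T_{n,lam}
    # and quantiles[lam] its null (1 - alpha)-quantile (assumed precomputed).
    boot = smooth_bootstrap(x, h, B1, seed)
    def power(lam):
        return sum(stat(xs, lam) > quantiles[lam] for xs in boot) / B1
    return max(Lambda, key=power)
```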

The second method for selecting \(\lambda \) we consider is based on the observation that given the values \(T_{n,\lambda }(X_1,\ldots ,X_n)\) of the test statistics for the observed sample \(X_1,\ldots ,X_n\), more evidence against the null hypothesis is obtained for smaller p-values. Therefore, to construct a powerful test it makes sense to minimise the bootstrap p-value along \(\lambda \in \Lambda \), an idea that was used in Martínez-Camblor et al. (2008, pp. 4014–4015); see also Martínez-Camblor and Uña-Álvarez (2009). Hence, we denote by \(\hat{\lambda }_{2} = \hat{\lambda }_{2}(X_1,\ldots ,X_n)\) the tuning parameter selector given by

$$\begin{aligned} \hat{\lambda }_{2} = \mathrm {arg}\!\min _{\lambda \in \Lambda } \frac{1}{B_0} \sum _{j=1}^{B_0} I\big ( T_{n,\lambda }(X_{0,(j-1)n+1},\ldots ,X_{0,jn}) \ge T_{n,\lambda }(X_1,\ldots ,X_n) \big ), \end{aligned}$$
(19)

where \(X_{0,l}\), for \(l=1,\ldots ,nB_0\), are independent copies of a random variable \(X_0\) with density \(f_0\).
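A minimal sketch of (19): given \(B_0\) samples simulated from \(f_0\) (here passed in precomputed), pick the \(\lambda \) minimising the bootstrap p-value. As before, `stat` is a hypothetical stand-in for \(T_{n,\lambda }\).

```python
def select_lambda_2(x, stat, null_samples, Lambda):
    # lambda_hat_2: the lambda in Lambda minimising the bootstrap p-value,
    # i.e. the fraction of null samples whose statistic is at least as large
    # as the observed one. null_samples holds B0 samples of size n drawn
    # from f0 (assumed generated elsewhere).
    def pvalue(lam):
        t_obs = stat(x, lam)
        return sum(stat(xs, lam) >= t_obs
                   for xs in null_samples) / len(null_samples)
    return min(Lambda, key=pvalue)
```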

The last method for selecting \(\lambda \) we consider was introduced in Martínez-Camblor and Uña-Álvarez (2013, p. 273) and is based on the idea that \(\lambda \) should be chosen to maximise the discrimination capability, between the null and the alternative hypotheses, of the diagnostic variable \(T_{n,\lambda }\), expressed by the area under the ROC curve associated with it. As this area is given by \(P(T_{n,\lambda }^0 < T_{n,\lambda }^1)\), where \(T_{n,\lambda }^0\) and \(T_{n,\lambda }^1\) are independent random variables with the null and the alternative distributions of \(T_{n,\lambda }\), respectively (see Krzanowski and Hand 2009, pp. 26–28), we consider the tuning parameter selector \(\hat{\lambda }_{3}=\hat{\lambda }_{3}(X_1,\ldots ,X_n)\) defined by

$$\begin{aligned} \hat{\lambda }_{3} = \mathrm {arg}\!\max _{\lambda \in \Lambda } \frac{1}{B_0 B_1} \sum _{j=1}^{B_0} \sum _{k=1}^{B_1} I\big ( T_{n,\lambda }(X_{0,(j-1)n+1},\ldots ,X_{0,jn}) < T_{n,\lambda }(X^*_{k,1},\ldots ,X^*_{k,n}) \big ), \end{aligned}$$
(20)

where \(X_{0,l}\), for \(l=1,\ldots ,nB_0\), and \(X^*_{k,j}\), for \(k=1,\ldots ,B_1\) and \(j=1,\ldots ,n\), are defined as before.
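The empirical AUC criterion (20) compares null statistics against smooth-bootstrap statistics pairwise. A sketch, with both sample collections assumed precomputed and `stat` again a hypothetical stand-in for \(T_{n,\lambda }\):

```python
def select_lambda_3(stat, null_samples, boot_samples, Lambda):
    # lambda_hat_3: the lambda maximising the empirical area under the ROC
    # curve, estimated by the fraction of (null, bootstrap) pairs with
    # T^0 < T^1; null_samples ~ f0 and boot_samples come from the smooth
    # bootstrap of the observed data (both assumed generated elsewhere).
    def auc(lam):
        t0 = [stat(xs, lam) for xs in null_samples]
        t1 = [stat(xs, lam) for xs in boot_samples]
        return sum(a < b for a in t0 for b in t1) / (len(t0) * len(t1))
    return max(Lambda, key=auc)
```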

Taking into account that, conditionally on the sequences \(N=(N_l)\), \(U=(U_l)\) and \(X_0=(X_{0,l})\), the previous tuning parameter selectors are location-scale invariant, we conclude that the tests based on critical region (17), where \(\hat{\lambda }\) stands for either \(\hat{\lambda }_1\), \(\hat{\lambda }_2\) or \(\hat{\lambda }_3\), have levels of significance at most equal to \(\alpha \) for each sample size n (conditionally on N, U and \(X_0\)). In the practical implementation of these tests we always take \(\Lambda =\{0.5,0.75,1,1.5,2\}\). For the normality goodness-of-fit tests we take \(B_0=B_1=200\), and the quantiles \(q(T_{n,\hat{\lambda }},\alpha )\) are estimated by performing 100,000 simulations under the null hypothesis. For the goodness-of-fit tests for the logistic and Gumbel distributions we take \(B_0=B_1=100\), and the quantiles are estimated by performing 50,000 simulations under the null hypothesis, because the evaluation of the corresponding test statistics is more time-consuming than in the normal case.

For the alternatives 4 and 8 from the Marron and Wand (1992) set of normal mixture densities, Tables 3 and 4 present power estimates, at level \(\alpha =0.05\), for the normality goodness-of-fit tests based on critical regions (16) with \(\hat{h}_\lambda = \lambda \hat{h}\), \(\lambda = 0.25,0.5,0.75,1,1.25,1.5,1.75,2,3,5\), and (17) with \(\Lambda =\{ 0.5,0.75,1,1.5,2 \}\), where \(\hat{h}=\hat{h}_{H_0},\hat{h}_\mathrm {CT}\). As mentioned before, for all sample sizes we see that the empirical power depends on \(\lambda \). However, these two alternatives reveal different situations. For alternative 8 the best power results are in general observed for values of \(\lambda \) close to, or even equal to, 1, and therefore the tests based on \(\hat{\lambda }_j \hat{h}\), for \(j=1,2,3\), are not expected to be more powerful than those based on the bandwidth selector \(\hat{h}\). The figures in both tables support this idea. A similar situation occurs for alternative 4 and bandwidth \(\hat{h}_\mathrm {CT}\). However, when the bandwidth \(\hat{h}_{H_0}\) is used for alternative 4, an alternative for which the kernel estimator based on \(\hat{h}_{H_0}\) performs poorly from a density estimation point of view, we see that it is highly advisable to use a tuning parameter smaller than 1, which may explain the good results obtained by the tuning parameter selectors \(\hat{\lambda }_2\) and \(\hat{\lambda }_3\) for the test based on \(I_n\) and by \(\hat{\lambda }_1\) for the test based on \(J_n\).

Table 3 Power estimates, at level \(\alpha =0.05\), for the normality goodness-of-fit tests based on \(I_n(\lambda \hat{h}_{H_0})\) and \(J_n(\lambda \hat{h}_{H_0})\), with \(\lambda = 0.25,0.5,0.75,1,1.25,1.5,1.75,2,3,5\), and \(I_n(\hat{\lambda }_j \hat{h}_{H_0})\) and \(J_n(\hat{\lambda }_j \hat{h}_{H_0})\), \(j=1,2,3\), with \(\Lambda =\{ 0.5,0.75,1,1.5,2 \}\), for alternatives 4 and 8 from the Marron and Wand (1992) set of normal mixture densities
Table 4 Power estimates, at level \(\alpha =0.05\), for the normality goodness-of-fit tests based on \(I_n(\lambda \hat{h}_{\mathrm {CT}})\) and \(J_n(\lambda \hat{h}_{\mathrm {CT}})\), with \(\lambda = 0.25,0.5,0.75,1,1.25,1.5,1.75,2,3,5\), and \(I_n(\hat{\lambda }_j \hat{h}_{\mathrm {CT}})\) and \(J_n(\hat{\lambda }_j \hat{h}_{\mathrm {CT}})\), \(j=1,2,3\), with \(\Lambda =\{ 0.5,0.75,1,1.5,2 \}\), for alternatives 4 and 8 from the Marron and Wand (1992) set of normal mixture densities

For \(\alpha =0.01,0.05,0.1\) and sample sizes \(n=20,50,80\), we present in Tables 5–7 (see the supplementary online material) estimates of the actual levels of significance of the goodness-of-fit tests for the normal, logistic and Gumbel distributions, respectively, based on \(I_n(\hat{h})\) and \(J_n(\hat{h})\) for the different bandwidth selectors \(\hat{h}\) based on \(\hat{h}_{H_0}\) and \(\hat{h}_\mathrm {CT}\). They are based on 20,000 simulations under the null hypothesis. These results indicate that all the tests have an effective level of significance very close to \(\alpha \). With a few exceptions, the estimated levels lie inside the approximate 95% confidence interval for the preassigned nominal levels.

Although a larger set of alternative distributions, usually considered in power studies for testing the normal, logistic and Gumbel distributions, was considered in our study (see Epps and Pulley 1983; Meintanis 2004; Epps 2005; Romão et al. 2010), we limit ourselves to presenting in Tables 8–10 (normal distribution), Tables 11–13 (logistic distribution) and Tables 14–16 (Gumbel distribution) the empirical power results for some of these alternatives (see the supplementary online material). The first seven alternatives are from the following location-scale families: uniform, exponential, Laplace, Cauchy, normal, logistic and Gumbel. The remaining six alternatives are from the following families of distributions: Student, lognormal, Tukey, gamma, Weibull and beta. For the exact definition of the distributions included in these tables, see Epps (2005). We limit ourselves to presenting here the results obtained for the nominal level \(\alpha =0.05\) and sample sizes \(n=20,50,80\); however, similar conclusions can be drawn for the nominal levels \(\alpha =0.1, 0.01\) also considered in our study. For comparison purposes, we include in the previous tables power estimates for the classical Anderson–Darling (1954) goodness-of-fit test, which is based on a weighted quadratic distance between the empirical distribution function and a parametric estimator of the distribution function of f under the null hypothesis (see Stephens 1986, and the references therein, for goodness-of-fit tests based on the empirical distribution function). In order to implement this test, the quantiles of order \(1-\alpha \) of the Anderson–Darling test statistic \(A^2\) are estimated by performing 100,000 simulations under the null hypothesis. In the case of the goodness-of-fit test for the normal distribution we also include in our simulation study the highly recommended Shapiro–Wilk (1965) test SW, implemented by the R function shapiro.test.
For all the tests included in the study, the power estimates are based on 10,000 samples from the considered alternatives.

Although none of the considered tests presents uniformly better results for the considered set of alternative distributions, the main conclusion that can be drawn from this study is that the tests based on \(\hat{h}_{H_0}\) present in fact a good overall performance for a wide set of alternative distributions. Regarding the two tests based on \(\hat{h}_{H_0}\), our preference goes to the test based on the test statistic \(I_n(\hat{h}_{H_0})\). This test is in general more powerful than, or at least as powerful as, the tests based on \(\hat{h}_{\mathrm {CT}}\), and also proves to be quite competitive against the Anderson–Darling test, although slightly less powerful than the Shapiro–Wilk test for normality. However, no matter the considered null hypothesis model, for some of the considered alternatives, such as the light-tailed uniform and beta alternatives, the test based on \(J_n(\hat{h}_{H_0})\) proves to be more powerful than that based on \(I_n(\hat{h}_{H_0})\). Finally, note that the new bandwidth selectors \(\hat{\lambda }_j \hat{h}_{H_0}\) and \(\hat{\lambda }_j \hat{h}_\mathrm {CT}\), for \(j=1,2,3\), which are much more time-consuming to compute than \(\hat{h}_{H_0}\) or \(\hat{h}_\mathrm {CT}\), do not in general reveal any special advantage over these simpler-to-compute bandwidths, the exception being the Tukey(5) alternative distribution for the normal and logistic models. As some simulation experiments reveal (not presented here), the extra source of variation they introduce into the null hypothesis distribution of the associated test statistics, especially those based on \(J_n\), may explain the observed results.

5 Conclusions

The choice of the bandwidth is crucial to the performance of the Parzen–Rosenblatt estimator, and several automatic bandwidth selectors considered in the literature satisfy the relative consistency condition (11). This is not the case for the null hypothesis-based bandwidth selector \(\hat{h}_{H_0}\), which satisfies this condition only when the null hypothesis is true. However, if we want to use the Bickel–Rosenblatt test statistics to test the hypothesis that the underlying density function f is a member of a location-scale family of probability density functions, the finite sample results presented in this paper support the idea that the tests based on \(\hat{h}_{H_0}\) present a good overall performance for a wide set of alternative distributions. These tests are in general more powerful than, or at least as powerful as, those based on data-dependent smoothing parameters \(\hat{h}\) that satisfy the relative consistency condition irrespective of whether the null or the alternative hypothesis is true, as well as those inspired by existing data-driven bandwidths for smooth tests for the k-sample problem, which can be computed by resampling.