1 Introduction

Given \(X_1,\ldots ,X_n\) independent and identically distributed real-valued random variables from an absolutely continuous distribution with continuous density function f, it is well known that the unknown density function f may be estimated by using the Parzen–Rosenblatt estimator (Rosenblatt 1956; Parzen 1962) defined, for \(x\in \mathbb {R}\), by

$$\begin{aligned} f_{h}(x) := \frac{1}{n} \sum _{i=1}^{n} K_{h}( x-X_i ), \end{aligned}$$

where \(K_{h}(\cdot )=K(\cdot /h)/h,\) for \(h>0\), with K a kernel on \(\mathbb {R}\), that is, a bounded and symmetric probability density function, and the bandwidth \(h=h_n\) is a sequence of strictly positive real numbers converging to zero as n tends to infinity, an assumption we maintain throughout this paper (see Devroye and Györfi 1985; Silverman 1986; Bosq and Lecoutre 1987; Wand and Jones 1995; Simonoff 1996, and Tsybakov 2009, for general reviews on density estimation).
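As a concrete illustration, the estimator \(f_h\) can be sketched in a few lines. The simulations in this paper were carried out in R, but we use Python here purely for illustration; the Gaussian kernel is one admissible choice of K, not imposed by the definition:

```python
import math

def parzen_rosenblatt(data, h):
    """Return x -> f_h(x) = (1/(n h)) * sum_i K((x - X_i)/h),
    with K the standard Gaussian density (one admissible kernel)."""
    n = len(data)
    K = lambda u: math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)
    return lambda x: sum(K((x - xi) / h) for xi in data) / (n * h)
```

For a single observation at 0 and \(h=1\), the estimate at 0 is simply \(\phi(0)\approx 0.399\); symmetry of the data is inherited by the estimate because K is symmetric.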

Other than the estimation of the underlying probability density function, the kernel density estimator can also be used for testing the null hypothesis

$$\begin{aligned} H_0 : \, f \in \mathscr {F}_0, \end{aligned}$$
(1)

where \(\mathscr {F}_0\) is a parametric family of density functions, against a general alternative hypothesis. This idea was first explored in Bickel and Rosenblatt (1973), who considered, among others, two test statistics based on the \(L^2\) distance between the nonparametric estimator \(f_{h}\) and two parametric estimators of f under the null hypothesis. Focusing our attention on the case where \(\mathscr {F}_0\) is a location-scale family, that is,

$$\begin{aligned} \mathscr {F}_0 = \big \{ g(\cdot ;\theta _1,\theta _2) : \theta _1 \in \mathbb {R} ,\theta _2>0 \big \}, \end{aligned}$$
(2)

with \(g(x;\theta _1,\theta _2)=f_0 ( (x-\theta _1)/\theta _2 )/\theta _2\), and \(f_0\) is a known probability density function on \(\mathbb {R}\), the Bickel–Rosenblatt test statistics we are interested in are given by

$$\begin{aligned} I_n(h)=I_n(X_1,\ldots ,X_n;h):= nh \int \{ f_{h}(x) - K_h*g(x; \hat{\theta }_1, \hat{\theta }_2 ) \}^2 \mathrm{d}x, \end{aligned}$$
(3)

and

$$\begin{aligned} J_n(h)=J_n(X_1,\ldots ,X_n;h) := nh \int \{ f_{h}(x) - g(x; \hat{\theta }_1, \hat{\theta }_2 ) \}^2 \mathrm{d}x, \end{aligned}$$
(4)

where the integrals are over \(\mathbb {R}\) with respect to the Lebesgue measure, \(\hat{\theta }_k\), for \(k=1,2\), are consistent estimators of \(\theta _k\) under \(H_0\), and \(*\) denotes the convolution operator. The theoretical properties of goodness-of-fit tests based on \(I_n(h)\) and \(J_n(h)\) were first studied by Bickel and Rosenblatt (1973) in the univariate case, by using strong approximation techniques for empirical processes, and by Rosenblatt (1975) in the multivariate case, by using a Poissonization of sample size technique. However, a full description of their asymptotic behaviour was only later provided in Fan (1994), based on the fact, first noticed in Hall (1984), that central limit theorems for the integrated squared error of kernel density estimators can be derived from a central limit theorem for degenerate U-statistics with variable kernels (see Ghosh and Huang 1991; Fan 1998; Gouriéroux and Tenreiro 2001; Cao and Lugosi 2005, for other works on goodness-of-fit tests based on the kernel density estimator).
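To make definition (4) concrete, here is a minimal numerical sketch of \(J_n(h)\) for a Gaussian null, using the sample mean and maximum likelihood standard deviation as \(\hat\theta_1,\hat\theta_2\), the Gaussian kernel, and the trapezoidal rule for the integral. All of these choices are illustrative assumptions for the sketch, not requirements of the definition:

```python
import math

def gauss(x, mu=0.0, s=1.0):
    """Density of N(mu, s^2) at x."""
    return math.exp(-0.5 * ((x - mu) / s) ** 2) / (s * math.sqrt(2.0 * math.pi))

def J_n(data, h, lo=-10.0, hi=10.0, m=2000):
    """J_n(h) = n h * integral of {f_h(x) - g(x; t1, t2)}^2 dx
    (trapezoidal rule), with K = phi, t1 = sample mean, t2 = ML sd."""
    n = len(data)
    t1 = sum(data) / n
    t2 = math.sqrt(sum((x - t1) ** 2 for x in data) / n)
    dx = (hi - lo) / m
    acc = 0.0
    for k in range(m + 1):
        x = lo + k * dx
        fh = sum(gauss((x - xi) / h) for xi in data) / (n * h)  # kernel estimate
        w = 0.5 if k in (0, m) else 1.0                         # trapezoid weights
        acc += w * (fh - gauss(x, t1, t2)) ** 2 * dx
    return n * h * acc
```

The statistic \(I_n(h)\) differs only in replacing \(g(\cdot;\hat\theta_1,\hat\theta_2)\) by its convolution with \(K_h\), which for a Gaussian null and Gaussian kernel is again a Gaussian density; closed-form versions of both statistics are given in Sect. 4.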

Taking into account that the class \(\mathscr {F}_0\) is closed with respect to affine transformations, some authors argue that any reasonable statistic \(T_n=T_n(X_1,\ldots ,X_n)\) for testing \(H_0\) should be location-scale invariant, that is, it should satisfy the equality

$$\begin{aligned} T_n(a + bX_1,\ldots ,a+bX_n)=T_n(X_1,\ldots ,X_n), \end{aligned}$$

for each \(a\in \mathbb {R}\) and \(b>0\) (see Henze 2002, p. 469, Ebner and Henze 2020, p. 847). As we can easily see, this invariance property does not hold for the functionals \(I_n(h)\) and \(J_n(h)\) whenever we take for h a deterministic bandwidth, even when \(\hat{\theta }_1\) is location-scale equivariant and \(\hat{\theta }_2\) is scale equivariant, that is,

$$\begin{aligned} \hat{\theta }_1(a + bX_1,\ldots ,a+bX_n)=a+b\,\hat{\theta }_1(X_1,\ldots ,X_n) \end{aligned}$$

and

$$\begin{aligned} \hat{\theta }_2(a + bX_1,\ldots ,a+bX_n)=b\,\hat{\theta }_2(X_1,\ldots ,X_n), \end{aligned}$$

for each \(a\in \mathbb {R}\) and \(b>0\). However, if we further assume that \(h=\hat{h}(X_1,\ldots ,X_n)\) depends on the observations and is scale equivariant, then \(I_n(\hat{h})\) and \(J_n(\hat{h})\) are location-scale invariant test statistics. This invariance property follows easily from the representations

$$\begin{aligned} I_n(\hat{h})= n (\hat{h}/\hat{\theta }_2) \int \{ \tilde{f}_{\hat{h}/\hat{\theta }_2}(y) - K_{\hat{h}/\hat{\theta }_2}*f_0(y) \}^2 \mathrm{d}y, \end{aligned}$$
(5)

and

$$\begin{aligned} J_n(\hat{h})= n (\hat{h}/\hat{\theta }_2) \int \{ \tilde{f}_{\hat{h}/\hat{\theta }_2}(y) - f_0(y) \}^2 \mathrm{d}y, \end{aligned}$$
(6)

where

$$\begin{aligned} \tilde{f}_{h}(y) = \frac{1}{n} \sum _{i=1}^{n} K_{h}( y-Y_{n,i} ), \end{aligned}$$

is the kernel estimator with kernel K and smoothing parameter h, based on the so-called scaled residuals \(Y_{n,j} = (X_j -\hat{\theta }_1)/\hat{\theta }_2\), \(j=1,\ldots ,n\). When \(\hat{h}\) takes the form \(\hat{h}=\hat{\theta }_2 h\), with h a deterministic bandwidth, the statistic \(I_n(\hat{h})\) is considered in Bowman (1992) (see also Fan 1994, pp. 332–336), and the theoretical properties of the goodness-of-fit test based on \(I_n(\hat{h})\) are described in Tenreiro (2007) in the case where \(\theta _1\) and \(\theta _2\) are, respectively, the mean and the standard deviation of \(g(\cdot ;\theta _1,\theta _2)\), and \(\hat{\theta }_1=\bar{X}_n\) and \(\hat{\theta }_2=S_n\), where \(\bar{X}_n=n^{-1} \sum _{i=1}^n X_i\) is the sample mean and \(S_n^2=n^{-1} \sum _{i=1}^n (X_i - \bar{X}_n)^2\) is the sample variance. Moreover, Bowman (1992, p. 3) suggests taking for the deterministic bandwidth h the asymptotically optimal bandwidth, in the sense of the mean integrated squared error, for estimating the null density \(f_0\). In this case, we have

$$\begin{aligned} h=h_1=h_1(f_0;K,n)=c_K \, R(f^{\prime \prime }_0)^{-1/5} n^{-1/5}, \end{aligned}$$
(7)

with

$$\begin{aligned} c_K = R(K)^{1/5} \mu _2(K)^{-2/5} \end{aligned}$$
(8)

(see Bosq and Lecoutre 1987, pp. 78–83 and Wand and Jones 1995, pp. 19–23), where \(R(\varphi )=\int \varphi (x)^2 \mathrm{d}x\) and \(\mu _2(\varphi )=\int x^2\varphi (x)\mathrm{d}x\), for an arbitrary real-valued measurable function \(\varphi \). This leads us to consider for \(\hat{h}\) the null hypothesis-based bandwidth selector

$$\begin{aligned} \hat{h}_{H_0}=\hat{\theta }_2 h_1(f_0;K,n). \end{aligned}$$
(9)

In the case of testing a hypothesis of normality, that is, \(f_0=\phi \), where \(\phi (x) = (2\pi )^{-1/2}\exp (-x^2/2)\), \(x\in \mathbb {R}\), is the standard Gaussian density, and taking \(K=\phi \) and \(\hat{\theta }_2=S_n\), this leads to the data-dependent bandwidth

$$\begin{aligned} \hat{h}_{H_0} = (4 / 3)^{1/5} S_n n^{-1/5}. \end{aligned}$$
(10)

This approach, also considered in Bowman and Foster (1993, p. 535) for testing a multivariate hypothesis of normality, was first suggested, with apparently good results, by Henze and Zirkler (1990, p. 3600; see also Ebner and Henze 2020), and the corresponding theoretical properties of the test statistic \(I_n(\hat{h})\) were first established in Gürtler (2000).
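The constants in (7)–(10) are easy to verify numerically. The sketch below (Python, purely illustrative) computes \(R(\phi)\) and \(R(\phi'')\) by the trapezoidal rule and checks that \(h_1(\phi;\phi,n)\) from (7)–(8) reduces to the factor \((4/3)^{1/5}n^{-1/5}\) appearing in (10):

```python
import math

phi = lambda x: math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
phi_dd = lambda x: (x * x - 1.0) * phi(x)  # second derivative of phi

def R(f, lo=-10.0, hi=10.0, m=4000):
    """R(f) = integral of f(x)^2 dx, by the trapezoidal rule."""
    dx = (hi - lo) / m
    return sum((0.5 if k in (0, m) else 1.0) * f(lo + k * dx) ** 2 * dx
               for k in range(m + 1))

c_K = R(phi) ** 0.2                 # mu_2(phi) = 1, so the second factor in (8) is 1
h1 = lambda n: c_K * R(phi_dd) ** (-0.2) * n ** (-0.2)
```

Since \(R(\phi)=1/(2\sqrt{\pi})\) and \(R(\phi'')=3/(8\sqrt{\pi})\), the ratio is \(4/3\) and \(h_1(n)=(4/3)^{1/5}n^{-1/5}\), which multiplied by \(S_n\) gives (10).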

From an estimation perspective, the choice of the bandwidth is crucial to the performance of the kernel density estimator, this being one of the most studied topics in kernel density estimation, and several data-based approaches have been proposed for selecting h (see Wand and Jones 1995, pp. 58–89, and also Tenreiro 2017, p. 3440, where more recent bandwidth selection methods are mentioned). Although estimation and testing are different statistical problems, if we want to test \(H_0\) through a test statistic based on the kernel density estimator, it may seem reasonable to select the smoothing parameter in such a way that \(f_h\) is an effective estimator of the underlying density f irrespective of which of the null or the alternative hypothesis is true, a property that is not fulfilled by the automatic bandwidth selector (9). Although some scepticism has been expressed about this approach by Bowman (1992, p. 3), mainly due to the extra source of variation introduced into the null distribution of the test statistic by the considered bandwidth selector, in this paper we intend to address this issue in depth by considering the situation where the data-dependent smoothing parameter \(\hat{h}\) satisfies the relative consistency condition

$$\begin{aligned} \frac{\hat{h}}{h_0} - 1 = o_p(1), \end{aligned}$$
(11)

where \(h_0=h_0(f;K,n)\) is the exact optimal bandwidth in the sense that it minimizes the kernel density estimator mean integrated square error, that is,

$$\begin{aligned} h_0=\mathrm {arg}\!\min _{h>0} \mathrm {MISE}(f;K,n,h), \end{aligned}$$
(12)

where

$$\begin{aligned} \mathrm {MISE}(f;K,n,h)= {E}\bigg ( \int \{ f_{h}(x) - f(x) \}^2 \mathrm{d}x \bigg ). \end{aligned}$$

For a square integrable density f, the existence of this exact optimal bandwidth for all \(n\in \mathbb {N}\) can be established whenever the kernel K is continuous at zero with \(R(K) < 2K(0)\) (see Chacón et al. 2007). Classical data-based bandwidth selectors such as the least squares cross-validation bandwidth or the two-stage direct plug-in bandwidth selector based on \(h_1=h_1(f;K,n)\), which are both described in Wand and Jones (1995, pp. 63–65, 71–72), are scale equivariant and satisfy (11).
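For \(f=\phi\) and \(K=\phi\) the MISE is available in closed form, \(\mathrm{MISE}(h)= R(K)/(nh) + (1-1/n)R(\phi_{\sqrt{1+h^2}}) - 2\phi_{\sqrt{2+h^2}}(0) + R(\phi)\), with \(R(\phi_s)=1/(2s\sqrt{\pi})\) (see Wand and Jones 1995), so \(h_0\) from (12) can be found by direct search and compared with \(h_1\). The grid search below is an illustrative sketch, not production code:

```python
import math

def mise_normal(h, n):
    """Exact MISE of the Gaussian-kernel estimator when f = phi:
    MISE = R(K)/(n h) + (1 - 1/n) R(phi_s) - 2 phi_t(0) + R(phi),
    with s = sqrt(1 + h^2), t = sqrt(2 + h^2)."""
    s = math.sqrt(1.0 + h * h)
    t = math.sqrt(2.0 + h * h)
    return (1.0 / (2.0 * math.sqrt(math.pi) * n * h)
            + (1.0 - 1.0 / n) / (2.0 * math.sqrt(math.pi) * s)
            - 2.0 / (math.sqrt(2.0 * math.pi) * t)
            + 1.0 / (2.0 * math.sqrt(math.pi)))

def h0_normal(n, grid=None):
    """h_0 from (12) by a simple grid search over (0, 2]."""
    grid = grid or [0.001 * k for k in range(1, 2001)]
    return min(grid, key=lambda h: mise_normal(h, n))
```

For \(n=1000\) this gives \(h_0\approx 0.27\), while \(h_1=(4/3)^{1/5}\,1000^{-1/5}\approx 0.266\), illustrating the asymptotic equivalence of \(h_0\) and \(h_1\) noted above (Hall and Marron 1991).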

The remainder of this work is organised as follows. In Sect. 2 we describe the asymptotic behaviour of the Bickel–Rosenblatt test statistics \(I_n(\hat{h})\) and \(J_n(\hat{h})\) with \(\hat{h}=\hat{h}_n(X_1,\ldots ,X_n)\) a general data-dependent smoothing parameter. In a univariate context these results extend those obtained by Fan (1994), Gürtler (2000) and Tenreiro (2007). The limiting null distribution and the consistency of the considered Bickel–Rosenblatt tests for location-scale families are stated in Sect. 3. In Sect. 4 we conduct a simulation study to compare the finite sample power performance of the Bickel–Rosenblatt tests based on the null hypothesis-based bandwidth selector \(\hat{h}_{H_0}\) with other scale equivariant bandwidth selectors \(\hat{h}\) satisfying condition (11). We consider the cases of the normal, logistic and Gumbel null location-scale models. Although \(\hat{h}_{H_0}\) does not satisfy this relative consistency condition unless \(f\in \mathscr {F}_0\), we conclude that the tests based on it, especially those based on \(I_n\), are in general more powerful than, or at least as powerful as, those based on the considered bandwidth selectors that satisfy such condition. Some other data-driven bandwidths inspired in the methods considered in Cao and Van Keilegom (2006), Martínez-Camblor et al. (2008) and Martínez-Camblor and Uña-Álvarez (2013) in the context of smooth tests for the k-sample problem are adapted to our context and included in the simulation study. These last bandwidth selectors, which can be computed by resampling, take the general form \(\hat{\lambda } \hat{h}\), where \(\hat{h}\) is a scale equivariant bandwidth selector (e.g. \(\hat{h}=\hat{h}_{H_0}\)) and \(\hat{\lambda }\) is a data-driven tuning parameter selector taking values in a finite set of tuning parameters \(\Lambda \) (e.g. \(\Lambda = \{ 0.5,0.75,1,1.5,2 \}\)). Nevertheless, none of these bandwidth selectors have shown to be preferable to \(\hat{h}_{H_0}\). 
Section 5 includes a brief summary and some conclusions. For convenience of exposition, the proofs are deferred to “Appendix A” and some of the simulation results are relegated to the online supplementary material.

2 Test statistics asymptotic behaviour

In this section we are interested in the asymptotic behaviour of the Bickel–Rosenblatt test statistics \(I_n(\hat{h})\) and \(J_n(\hat{h})\) given by (3) and (4), respectively, where \(\hat{h}=\hat{h}_n(X_1,\ldots ,X_n)\) is a general data-dependent smoothing parameter. In a univariate framework the results presented here extend those obtained by Fan (1994), Gürtler (2000) and Tenreiro (2007).

2.1 Asymptotic behaviour of \(I_n(\hat{h})\)

In order to describe the asymptotic behaviour of the integrated square error \(I_n(\hat{h})\) we consider the following assumptions on the underlying probability density function f, the parametric family \(\mathscr {F}_0\) given by (2), the estimators \(\hat{\theta }_1\) and \(\hat{\theta }_2\), the kernel K and the data-dependent bandwidth \(\hat{h}\). We denote by \(\mathscr {F}\) an appropriate set of probability density functions on \(\mathbb {R}\) that contains \(\mathscr {F}_0\) and to which the underlying probability density function f belongs, and by \(L^r\), for \(r\in [1,\infty ]\), the normed vector space of measurable functions \(\varphi : \mathbb {R} \rightarrow \mathbb {R}\) for which \(||\varphi ||_r<\infty \), where \(||\varphi ||_r:=\big (\int |\varphi (x)|^r \mathrm{d}x\big )^{1/r}\), for \(r\in [1,\infty [\), and \(||\varphi ||_\infty = \inf \{ c \ge 0: |\varphi (x)| \le c \text{ for } \text{ almost } \text{ every } x \}\).

Assumption (D) \(f\in L^\infty \), for all \(f\in \mathscr {F}\).

Assumption (F) \(f_0\) is two times continuously differentiable with \(f_0\in L^\infty \), \(f_0^\prime , y \mapsto y f_0^{\prime }(y) \in L^2 \cap L^r\) and \(f_0^{\prime \prime }, y \mapsto y^2 f_0^{\prime \prime }(y) \in L^r\), for some \(r\in \, ]2,\infty ]\).

Assumption (P) For all \(f\in \mathscr {F}\) there exist \(\theta _1(f) \in \mathbb {R}\) and \(\theta _2(f)>0\) such that \(\hat{\theta }_k {\mathop {\longrightarrow }\limits ^{p}} \theta _k(f)\), for \(k=1,2\). Moreover, if \(f=g(\cdot ;\theta _1,\theta _2)\), for some \(\theta _1\in \mathbb {R}\) and \(\theta _2>0\) (i.e. \(f \in \mathscr {F}_0\)), we assume that

$$\begin{aligned} \sqrt{n} \big ( \hat{\theta }_k - \theta _k \big ) = \frac{1}{\sqrt{n}} \sum _{i=1}^n \psi _k(X_i;\theta _1,\theta _2) + o_p(1), \end{aligned}$$

where \(\psi _k(\cdot ;\theta _1,\theta _2)\) is a real function depending on \(\theta _1\) and \(\theta _2\), with \({E}_f(\psi _k(X;\theta _1,\theta _2))=0\) and \({E}_f(\psi _k(X;\theta _1,\theta _2)^2)<\infty \), for \(k=1,2\).

Assumption (K) The kernel K belongs to \(\mathscr {K}^\omega \), for some \(\omega \in \{2,3,\ldots \}\), where \(\mathscr {K}^\omega \) is the set of real-valued functions K on \(\mathbb {R}\) with continuous derivatives up to order \(\omega \) such that \(\lim _{|u| \rightarrow \infty } uK(u) = 0,\) for which there exists \(\eta \in \,]0,1[\), such that the real-valued functions \(K^{\ell ,\eta }\) defined, for \(u\in \mathbb {R}\), by \(K^{\ell ,\eta }(u) = |u|^{\ell } \sup _{|h-1|\le \eta } |K^{(\ell )}(u/h)|,\) are bounded and integrable on \(\mathbb {R}\) for \(\ell =0,1,\ldots ,\omega \).

The standard Gaussian kernel \(K=\phi \) belongs to \(\mathscr {K}^\omega \) for all \(\omega \), and every kernel with compact support with continuous derivatives up to order \(\omega \) belongs to \(\mathscr {K}^\omega \).

Assumption (B) For all \(f\in \mathscr {F}\), there exists a deterministic sequence \((h_n(f))=(h(f))\) of strictly positive real numbers satisfying \(h(f) \rightarrow 0\) and \(nh(f) \rightarrow \infty \), as \(n\rightarrow \infty \), such that

$$\begin{aligned} \xi _n := \frac{\hat{h}}{h(f)} - 1 = o_p(1). \end{aligned}$$

As mentioned before, under some conditions on f and K, assumption (B) is fulfilled by the least squares cross-validation bandwidth and by the two-stage direct plug-in bandwidth selector with \(h(f)=h_0\), where \(h_0\) is given by (12). Of course, in these cases assumption (B) is also fulfilled with \(h(f)=h_1\), where \(h_1\) is given by (7), as \(h_0\) and \(h_1\) are asymptotically equivalent (see Hall and Marron 1991, p. 159). From a density estimation point of view, the distinction between bandwidth selectors is usually based on the rate of convergence to zero of the relative error \(\xi _n\). For example, we have \(\xi _n = O_p\big ( n^{-1/10} \big )\) for the least squares cross-validation bandwidth (see Scott and Terrell 1987; Hall and Marron 1987), and \(\xi _n = O_p\big ( n^{-5/14} \big )\) for the two-stage direct plug-in bandwidth selector (see Tenreiro 2003). A better order of convergence is achieved by the smoothed cross-validation method of Hall et al. (1992) and by the plug-in method of Hall et al. (1991), for which we have \(\xi _n = O_p\big ( n^{-1/2} \big )\). Note that these rates of convergence are not directly comparable, since the conditions imposed on f in each case are not necessarily the same. A different situation occurs when \(\hat{h}\) is the well-known normal scale bandwidth selector defined by \(\hat{h} = (8 \sqrt{\pi }/3)^{1/5} c_K n^{-1/5} \hat{\sigma },\) where \(c_K\) is given by (8) and \(\hat{\sigma }\) is a consistent estimator of the standard deviation \(\sigma _f\) of f (see Wand and Jones 1995, p. 60). Although this bandwidth selector satisfies assumption (B) with \(h(f)=(8 \sqrt{\pi }/3)^{1/5} c_K n^{-1/5} \sigma _f\), and we have \(\xi _n = O_p\big ( n^{-1/2} \big )\) whenever the scale estimator is such that \(\hat{\sigma } - \sigma _f = O_p\big (n^{-1/2}\big )\), the normal scale bandwidth selector does not fulfil the relative consistency condition (11).

In the next result, whose proof is given in Sect. A.1, we describe the asymptotic behaviour of the Bickel–Rosenblatt test statistic \(I_n(\hat{h})\) given by (3), where \(\hat{h}=\hat{h}_n(X_1,\ldots ,X_n)\) is a general data-dependent smoothing parameter. Recall that \(R(\varphi )=\int \varphi (x)^2 \mathrm{d}x\) for an arbitrary real-valued measurable function \(\varphi \).

Theorem 1

Under assumptions (D), (F), (P), (K) and (B), let us assume that

$$\begin{aligned} h(f)^{-1/2} \xi _n^2 + nh(f)^{1/2} \xi _n^\omega = o_p(1). \end{aligned}$$
(13)
  1. (a)

    If the null hypothesis is true, then

    $$\begin{aligned} \nu _f^{-1} h(f)^{-1/2} \big ( I_n(\hat{h}) - R(K) \big ) {\mathop {\longrightarrow }\limits ^{d}} N(0,1), \end{aligned}$$

    where

    $$\begin{aligned} \nu _f^2 = 2 R(K*K) R(f). \end{aligned}$$
  2. (b)

    If the alternative hypothesis is true, then

    $$\begin{aligned} ( nh(f) )^{-1} \big ( I_n(\hat{h}) - R(K) \big ) {\mathop {\longrightarrow }\limits ^{p}} R\big (f-g(\cdot ; \theta _1(f),\theta _2(f) )\big ). \end{aligned}$$

Remark 1

If \(h(f)=c n^{-1/5}(1+o(1))\) and \(\xi _n = O_p\big ( n^{-\alpha } \big )\), for some \(c>0\) and \(0<\alpha \le 1/2\), condition (13) is satisfied whenever \(\alpha >\max (1/20,9/(10\omega ))\). Therefore, it holds for the least squares cross-validation bandwidth selector whenever \(\omega \ge 10\), and for the two-stage direct plug-in bandwidth selector if \(\omega \ge 3\).

2.2 Asymptotic behaviour of \(J_n(\hat{h})\)

In order to describe the asymptotic behaviour of the integrated square error \(J_n(\hat{h})\) some additional assumptions are needed.

Assumption (D\(^{\prime }\)) For all \(f\in \mathscr {F}\), f is two times continuously differentiable on \(\mathbb {R}\) with \(f'' \in L^\infty \cap L^2\).

Assumption (F\(^{\prime }\)) \(f_0\) is such that \(f_0^{\prime \prime }\in L^\infty \cap L^s\), with \(1/r+1/s=1\) and \(r \in \, ]2,\infty ]\) is given in assumption (F).

Assumption (K\(^{\prime }\)) The functions \(u \mapsto u^2 K^{\ell ,\eta }(u)\), for \(\ell =0,1,\ldots ,\omega \), where \(K^{\ell ,\eta }\) is defined in assumption (K), are bounded and integrable on \(\mathbb {R}\).

Assumption (B\(^{\prime }\)) For all \(f\in \mathscr {F}\), the deterministic sequence h(f) is such that \(nh(f)^5 \rightarrow \lambda _f\), as \(n\rightarrow \infty \), for some \(\lambda _f \in \, ]0,\infty [\).

Note that if \(h(f)=h_0\), where \(h_0\) is given by (12), then \(\lambda _f=c_K^{5} R(f^{\prime \prime })^{-1}\), where \(c_K\) is given in (8).

In the next result, whose proof is given in Sect. A.2, we describe the asymptotic behaviour of the Bickel–Rosenblatt test statistic \(J_n(\hat{h})\) given by (4), where \(\hat{h}=\hat{h}_n(X_1,\ldots ,X_n)\) is a general data-dependent smoothing parameter.

Theorem 2

Under assumptions (D), (D\(^{\prime }\)), (F), (F\(^{\prime }\)), (P), (K), (K\(^{\prime }\)), (B), (B\(^{\prime }\)), let us assume that

$$\begin{aligned} h(f)^{-1/2} \xi _n + nh(f)^{1/2} \xi _n^\omega = o_p(1). \end{aligned}$$
(14)
  1. (a)

    If the null hypothesis is true, then

    $$\begin{aligned} \upsilon _f^{-1} h(f)^{-1/2} \big ( J_n(\hat{h}) - R(K) - c_n(f;K) \big ) {\mathop {\longrightarrow }\limits ^{d}} N(0,1) \end{aligned}$$

    where

    $$\begin{aligned} c_n(f;K) = nh(f) R(K_{h(f)}*f - f), \end{aligned}$$

    and

    $$\begin{aligned} \upsilon _f^2 = 2R(K*K) R(f) + \lambda _f \mu _2(K)^2 \mathrm {Var}_f(\varphi _f(X)) \end{aligned}$$

    with

    $$\begin{aligned} \varphi _f(u) = f''(u) - \sum _{k} \psi _k(u;\theta _1(f),\theta _2(f))\int \bar{f}^{\prime \prime }(x) \frac{\partial g}{\partial \theta _k} (x;\theta _1(f),\theta _2(f))\mathrm{d}x, \end{aligned}$$

    where \(\bar{f}(x)=f(-x)\), for \(x\in \mathbb {R}\).

  2. (b)

    If the alternative hypothesis is true, then

    $$\begin{aligned} ( nh(f) )^{-1} \big ( J_n(\hat{h}) - R(K) - c_n(f;K) \big ) {\mathop {\longrightarrow }\limits ^{p}} R\big (f-g(\cdot ; \theta _1(f),\theta _2(f) )\big ). \end{aligned}$$

Remark 2

Under the conditions of Remark 1, condition (14) holds if \(\alpha >\max (1/10,9/(10\omega ))\). Therefore, it is not fulfilled by the least squares cross-validation bandwidth selector, and it holds for the two-stage direct plug-in bandwidth selector whenever \(\omega \ge 3\).

3 Bickel–Rosenblatt tests for location-scale families

Under the assumptions of Theorems 1 and 2, if \(\hat{\theta }_1\) and \(\hat{\theta }_2\) are location-scale and scale equivariant estimators of \(\theta _1\) and \(\theta _2\), respectively, and the deterministic sequence h(f) is scale equivariant (that is, \(h(g)=b h(f)\), where \(g(\cdot )=f((\cdot -a)/b)/b\), for all \(a\in \mathbb {R}\) and \(b>0\)), a property that is satisfied by the exact optimal bandwidth (12), we can easily conclude that \(\nu ^{-1}_f h(f)^{-1/2} = \nu ^{-1}_{f_0} h(f_0)^{-1/2}\), \(\upsilon _{f}^{-1} h(f)^{-1/2} = \upsilon ^{-1}_{f_0} h(f_0)^{-1/2}\) and \(c_n(f;K)=c_n(f_0;K)\). Therefore, from Theorems 1 and 2 we deduce that the tests based on the critical regions

$$\begin{aligned} C_n(I_n(\hat{h}),\alpha ) = \big \{ \nu ^{-1}_{f_0} h(f_0)^{-1/2} \big ( I_n(\hat{h}) - R(K) \big ) > \Phi ^{-1}(1-\alpha ) \big \} \end{aligned}$$

and

$$\begin{aligned} C_n(J_n(\hat{h}),\alpha ) = \big \{ \upsilon _{f_0}^{-1} h(f_0)^{-1/2} \big ( J_n(\hat{h}) - R(K) - c_n(f_0;K) \big ) > \Phi ^{-1}(1-\alpha ) \big \}, \end{aligned}$$

where \(\alpha \in \,]0,1[\) and \(\Phi ^{-1}(1-\alpha )\) is the quantile of order \(1-\alpha \) of the standard normal distribution, are asymptotically of level \(\alpha \) and consistent for testing \(f\in \mathscr {F}_0\) against \(f\in \mathscr {F}{\setminus } \mathscr {F}_0\), that is, \(P_f\big ( C_n(T_n,\alpha ) \big ) \rightarrow \alpha \), for all \(f\in \mathscr {F}_0\), and \(P_f\big ( C_n(T_n,\alpha ) \big ) \rightarrow 1\), for all \(f\in \mathscr {F}{\setminus }\mathscr {F}_0\), where \(T_n=T_n(X_1,\ldots ,X_n)\) stands for either \(I_n(X_1,\ldots ,X_n;\hat{h}(X_1,\ldots ,X_n))\) or \(J_n(X_1,\ldots ,X_n;\hat{h}(X_1,\ldots ,X_n))\).
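For the normality test the constants entering \(C_n(I_n(\hat h),\alpha)\) are fully explicit, since \(K*K=\phi_{\sqrt 2}\) and \(R(\phi_s)=1/(2s\sqrt{\pi})\). The sketch below (Python, illustrative; the function name and interface are ours) standardizes an observed value of \(I_n(\hat h)\) and applies this critical region with \(h(f_0)=(4/3)^{1/5}n^{-1/5}\):

```python
import math
from statistics import NormalDist

def asymptotic_test_In(In_value, n, alpha=0.05):
    """Asymptotic level-alpha test C_n(I_n(h_hat), alpha) for the normality
    case K = f_0 = phi: h(f_0) = (4/3)^{1/5} n^{-1/5}, R(K) = 1/(2 sqrt(pi))
    and nu_{f_0}^2 = 2 R(phi_{sqrt 2}) R(phi) = 1/(2 sqrt(2) pi).
    Returns the standardized statistic and the rejection decision."""
    h = (4.0 / 3.0) ** 0.2 * n ** -0.2
    RK = 1.0 / (2.0 * math.sqrt(math.pi))
    nu = math.sqrt(1.0 / (2.0 * math.sqrt(2.0) * math.pi))
    z = (In_value - RK) / (nu * math.sqrt(h))
    return z, z > NormalDist().inv_cdf(1.0 - alpha)
```

As discussed next, this normal approximation can be quite inaccurate in finite samples, so it mainly serves to make the constants concrete.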

As in the case where \(\hat{h}\) is deterministic (see Fan 1995, p. 372), some simulation results reveal that the asymptotic normal distribution provides a poor approximation to the finite sample distributions of \(I_n(\hat{h})\) and \(J_n(\hat{h})\) under the null hypothesis, which implies large differences between the true level and the nominal level of the tests based on the previous critical regions. This fact is illustrated in Table 1, where type I error estimates based on 20,000 simulations under the null hypothesis are shown for the normality tests based on the previous critical regions with \(K=\phi \) and \(\hat{h}=\hat{h}_{H_0}\) given by (10).

Table 1 Type I error estimates for the normality tests based on the critical regions \(C_n(I_n(\hat{h}),\alpha )\) and \(C_n(J_n(\hat{h}),\alpha )\), with \(K=\phi \) and \(\hat{h}=\hat{h}_{H_0}\), for nominal significance levels \(\alpha =0.1,0.05,0.01\) and sample sizes \(n=10^k\), \(k=2,3,4\)

In order to circumvent this problem, the standard strategy (see Fan 1995, pp. 372–373) is to consider instead the test defined by the critical region

$$\begin{aligned} \mathscr {C}(T_n,\alpha ) = \big \{ T_n > q(T_n^*,\alpha ) \big \}, \end{aligned}$$

where \(T_n=T_n(X_1,\ldots ,X_n)\) stands for either \(I_n(X_1,\ldots ,X_n;\hat{h}(X_1,\ldots ,X_n))\) or \(J_n(X_1,\ldots ,X_n;\) \(\hat{h}(X_1,\ldots ,X_n))\), and \(q(T_n^*,\alpha )=q(T_n^*,\alpha ;X_1,\ldots ,X_n)\) denotes the quantile of order \(1-\alpha \) of the distribution of the random variable \(T_n^*\) defined as follows:

  1. (1)

    Use the original sample \(X_1,\ldots ,X_n\) to compute \(\hat{\theta }_1\) and \(\hat{\theta }_2\);

  2. (2)

    Draw a random sample \(U_1,\ldots ,U_n\) from the distribution \(f_0\) and define the bootstrap sample by \(X_{n,i}^* = \hat{\theta }_1 + \hat{\theta }_2 U_i\), for \(i=1,\ldots ,n\);

  3. (3)

    Use the bootstrap sample to compute \(T_n(X_{n,1}^*,\ldots ,X_{n,n}^*)\) and call it \(T_n^*\).

Of course, if the test statistic \(T_n\) is location-scale invariant, which occurs if we further assume that \(\hat{h}\) is scale equivariant, the quantile \(q(T_n^*,\alpha )\), which does not depend on the observed sample, is the quantile of order \(1-\alpha \) of \(T_n\) under \(H_0\), which we denote by \(q(T_n,\alpha )\). This quantile is assumed to be a known quantity, as it can be well approximated by repeating steps (2) and (3) a large number of times. As stated in the next result, whose proof is presented in Sect. A.3, in this important case the test based on the critical region \(\mathscr {C}(T_n,\alpha )\) has a level of significance at most equal to \(\alpha \) for each sample size n and is consistent for testing \(f\in \mathscr {F}_0\) against \(f\in \mathscr {F}{\setminus }\mathscr {F}_0\).
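Since the statistic is location-scale invariant, \(T_n(X^*_{n,1},\ldots,X^*_{n,n})=T_n(U_1,\ldots,U_n)\) in step (3), so the null quantile can be approximated by simulating standard samples from \(f_0\) directly. A minimal sketch (Python, with \(f_0=\phi\) as in the normality example; the statistic is passed in as a function, and the helper name is ours):

```python
import math
import random

def null_quantile(statistic, n, alpha, reps=2000, seed=0):
    """Approximate q(T_n, alpha), the order-(1 - alpha) null quantile of a
    location-scale invariant statistic, by Monte Carlo: draw reps samples
    of size n from f_0 (here standard normal) and take the empirical
    quantile of the simulated statistic values (steps (2)-(3) above)."""
    rng = random.Random(seed)
    vals = sorted(statistic([rng.gauss(0.0, 1.0) for _ in range(n)])
                  for _ in range(reps))
    k = min(reps - 1, max(0, math.ceil((1.0 - alpha) * reps) - 1))
    return vals[k]

# the test then rejects H_0 when T_n(X_1, ..., X_n) > null_quantile(...)
```

In the simulations of Sect. 4 this is exactly how the quantiles in (15) are obtained, with 100,000 replications.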

Theorem 3

Under the assumptions of Theorems 1 or 2, let us assume that \(\hat{\theta }_1\) and \(\hat{\theta }_2\) are location-scale and scale equivariant estimators of \(\theta _1\) and \(\theta _2\), respectively. If the bandwidth selector \(\hat{h}\) is scale equivariant, then the test statistic \(T_n\), where \(T_n\) stands for either \(I_n(\hat{h})\) or \(J_n(\hat{h})\), is location-scale invariant, and the test based on the critical region

$$\begin{aligned} \mathscr {C}(T_n,\alpha ) = \big \{ T_n > q(T_n,\alpha ) \big \}, \end{aligned}$$
(15)

where \(\alpha \in \,]0,1[\), is such that

$$\begin{aligned} P_f\big ( \mathscr {C}(T_n,\alpha ) \big ) \le \alpha , \;\text{ for } \text{ all } \, f\in \mathscr {F}_0, \end{aligned}$$

and

$$\begin{aligned} P_f\big ( \mathscr {C}(T_n,\alpha ) \big ) \rightarrow 1, \;\text{ for } \text{ all } \, f\in \mathscr {F}{\setminus }\mathscr {F}_0. \end{aligned}$$

4 Finite sample results

In this section we conduct a simulation study to compare the finite sample power performance of goodness-of-fit tests based on critical regions (15) for several choices of the scale equivariant bandwidth selector \(\hat{h}\). More precisely, we intend to compare the null hypothesis-based bandwidth selector \(\hat{h}_{H_0}\) proposed by Bowman (1992) and given by (9) with other scale equivariant bandwidth selectors \(\hat{h}\) satisfying the relative consistency condition (11), for which it is expected, at least from an asymptotic point of view, that the kernel estimator \(f_{\hat{h}}\) is an effective estimator of the underlying density f irrespective of which of the null or the alternative hypothesis is true. To this end, besides \(\hat{h}_{H_0}\), three other automatic and scale equivariant bandwidth selectors are considered in our study. They are the least squares cross-validation bandwidth selector \(\hat{h}_{\mathrm {CV}}\), the two-stage direct plug-in bandwidth selector \(\hat{h}_{\mathrm {PI}}\) (see Wand and Jones 1995, pp. 63–65, 71–72), and a modified version \(\hat{h}_{\mathrm {CT}}\) of the bandwidth selector proposed in Chacón and Tenreiro (2013), where the cross-validation function is replaced by the weighted cross-validation function with \(\gamma =0.5\) (for the definition of the weighted cross-validation function, see Tenreiro 2017, p. 3440). Under some conditions on f, \(\hat{h}_{\mathrm {CT}}\) fulfils assumption (B) with \(h(f)=h_0\) and \(\xi _n = O_p\big ( n^{-5/14} \big )\) (see Chacón and Tenreiro 2013, Theorem 3.1, p. 2207). The power results observed in our simulation study for the bandwidths \(\hat{h}_{\mathrm {CV}}\), \(\hat{h}_{\mathrm {PI}}\) and \(\hat{h}_{\mathrm {CT}}\) reveal that this latter bandwidth presents a good overall performance for a wide range of alternative density features, which is relevant for real data situations where there is usually little prior information on the alternative density shape. For this reason, and because no essential feature is lost, hereafter we confine ourselves to the results obtained with the bandwidths \(\hat{h}_{H_0}\) and \(\hat{h}_{\mathrm {CT}}\).

From representations (5) and (6), and taking for K the standard normal density, which we always assume from now on, the test statistics \(I_n(\hat{h})\) and \(J_n(\hat{h})\) can be evaluated from the equalities

$$\begin{aligned} I_n(\hat{h}) = \frac{\tilde{h}}{n} \sum _{i,j=1}^n Q(Y_{n,i},Y_{n,j};\tilde{h}) \end{aligned}$$

and

$$\begin{aligned} J_n(\hat{h}) = \frac{\tilde{h}}{n} \sum _{i,j=1}^n R(Y_{n,i},Y_{n,j};\tilde{h}), \end{aligned}$$

where for \(u,v\in \mathbb {R}\) and \(h>0\),

$$\begin{aligned} Q(u,v;h)= \phi _{\sqrt{2}h}(u-v) - \phi _{\sqrt{2}h}*f_0(u) - \phi _{\sqrt{2}h}*f_0(v) + \phi _{\sqrt{2}h}*\bar{f}_0*f_0(0) \end{aligned}$$

and

$$\begin{aligned} R(u,v;h)=\phi _{\sqrt{2}h}(u-v) - \phi _{h}*f_0(u) - \phi _{h}*f_0(v) + \bar{f}_0*f_0(0), \end{aligned}$$

with \(\tilde{h}=\hat{h}/\hat{\theta }_2\), \(Y_{n,j} = (X_j -\hat{\theta }_1)/\hat{\theta }_2\), \(j=1,\ldots ,n\), and \(\bar{f}_0(u)=f_0(-u)\), for \(u\in \mathbb {R}\). Taking into account the convolution properties of the Gaussian densities (see Wand and Jones 1995, pp. 177–180), the calculation of \(I_n(\hat{h})\) and \(J_n(\hat{h})\) is especially simple for the normality test in which case no numerical integration is required. In this case, we have

$$\begin{aligned} I_n(\hat{h}) = n\tilde{h} \sum _{k,l=1}^{n+1} w_k \phi _{(\beta _k^2+\beta _l^2)^{1/2}}(\alpha _k-\alpha _l) w_l \end{aligned}$$

and

$$\begin{aligned} J_n(\hat{h}) = n\tilde{h} \sum _{k,l=1}^{n+1} w_k \phi _{(\gamma _k^2+\gamma _l^2)^{1/2}}(\alpha _k-\alpha _l) w_l, \end{aligned}$$

where \(w=(\frac{1}{n},\ldots ,\frac{1}{n},-1),\) \(\alpha =(Y_{n,1},\ldots ,Y_{n,n},0),\) \(\beta =\big (\tilde{h},\ldots ,\tilde{h},(\tilde{h}^2 + 1)^{\frac{1}{2}}\big )\) and \(\gamma =(\tilde{h},\ldots ,\tilde{h},1).\)
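Putting the pieces together, the two statistics for the normality test (with \(K=\phi\), \(\hat h=\hat h_{H_0}\), \(\hat\theta_1=\bar X_n\), \(\hat\theta_2=S_n\)) can be sketched directly from the double sums above. Note that only the scaled residuals and \(\tilde h\) enter, which makes the location-scale invariance visible in the code (Python sketch, purely illustrative):

```python
import math

def phi_s(x, s=1.0):
    """Gaussian density with mean 0 and standard deviation s."""
    return math.exp(-0.5 * (x / s) ** 2) / (s * math.sqrt(2.0 * math.pi))

def bickel_rosenblatt_normal(data):
    """I_n(h_hat) and J_n(h_hat) for the normality test with K = phi and
    h_hat = h_{H_0} = (4/3)^{1/5} S_n n^{-1/5}, via the closed-form
    double sums above (no numerical integration needed)."""
    n = len(data)
    m = sum(data) / n
    s = math.sqrt(sum((x - m) ** 2 for x in data) / n)
    ht = (4.0 / 3.0) ** 0.2 * n ** -0.2        # h~ = h_{H_0} / S_n
    y = [(x - m) / s for x in data]            # scaled residuals
    w = [1.0 / n] * n + [-1.0]
    alpha = y + [0.0]
    beta = [ht] * n + [math.sqrt(ht * ht + 1.0)]
    gamma = [ht] * n + [1.0]
    I = J = 0.0
    for k in range(n + 1):
        for l in range(n + 1):
            d = alpha[k] - alpha[l]
            I += w[k] * phi_s(d, math.sqrt(beta[k] ** 2 + beta[l] ** 2)) * w[l]
            J += w[k] * phi_s(d, math.sqrt(gamma[k] ** 2 + gamma[l] ** 2)) * w[l]
    return n * ht * I, n * ht * J
```

Both values are nonnegative, being integrated squared differences, and applying the same affine map to all observations leaves them unchanged, in line with Theorem 3.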

We start the study of the finite sample performance of the tests based on critical regions (15) for nominal levels \(\alpha =0.1,0.05,0.01\) by considering the test of normality, in which case the null model is given by (2) with \(f_0=\phi \), and we take \(\hat{\theta }_1=\bar{X}_n\) and \(\hat{\theta }_2=S_n\), the maximum likelihood estimators of \(\theta _1\) and \(\theta _2\) under \(H_0\). As the test statistics \(I_n(\hat{h})\) and \(J_n(\hat{h})\) for \(\hat{h}=\hat{h}_{H_0}\) and \(\hat{h}=\hat{h}_{\mathrm {CT}}\) are invariant under the null hypothesis (see Theorem 3), the quantiles of order \(1-\alpha \) in critical regions (15) are estimated by performing 100,000 simulations under the null hypothesis. We consider alternative distributions from a well-known set of normal mixture densities introduced in Marron and Wand (1992), which is often used in the context of kernel density estimation. This set is very rich, containing densities with a wide variety of features, such as kurtosis, skewness and multimodality. The densities of the considered alternatives, jointly with the density of the normal distribution with the same mean and variance, are shown in Fig. 1. The densities are identified as in Marron and Wand (1992), and the values of the parameters of this set of normal mixture densities are given in Table 1 of the same article. For the nominal level \(\alpha =0.05\) and sample sizes \(n=20,50,80\) we report in Table 2 the power estimates based on 10,000 samples from the considered set of alternative densities. All the simulations in this work were carried out using programs written in the R language (R Development Core Team 2019).

Fig. 1

Probability density functions of alternatives from the Marron and Wand (1992) set of normal mixture densities (solid) and the probability density function of the Gaussian distribution with the same mean and variance as the considered alternative (dashed)

Table 2 Empirical power results, at level \(\alpha =0.05\), for the goodness-of-fit tests for the normal distribution based on \(I_n(\hat{h})\) and \(J_n(\hat{h})\) with \(\hat{h}=\hat{h}_{H_0}\) and \(\hat{h}=\hat{h}_{\mathrm {CT}}\), for some alternatives from the Marron and Wand (1992) set of normal mixture densities

Taking into account some simulation experiments, not presented here to save space, that estimate the mean integrated squared error of the kernel density estimator for each of the bandwidths \(\hat{h}_{H_0}\) and \(\hat{h}_{\mathrm {CT}}\), we can conclude that the kernel estimator based on \(\hat{h}_{H_0}\) performs better than that based on \(\hat{h}_{\mathrm {CT}}\) for the normal mixture densities 2, 6, 8, 9 and 12 (for the considered sample sizes). This may explain the results shown in Table 2, where the tests based on \(\hat{h}_{H_0}\) generally perform better than those based on \(\hat{h}_{\mathrm {CT}}\) for alternatives 2, 6, 8 and 9, and the two perform similarly for alternative 12. The opposite situation occurs for the remaining four normal mixtures, where the kernel density estimator based on \(\hat{h}_{\mathrm {CT}}\) performs much better than that based on \(\hat{h}_{H_0}\). However, only for normal mixtures 4 and 15 do the tests based on \(\hat{h}_{\mathrm {CT}}\) perform clearly better than those based on \(\hat{h}_{H_0}\). For densities 3 and 7 the tests perform similarly. As the considered alternative densities are far in shape from the null hypothesis density family, we can conclude that even a poorly performing bandwidth selector, from a density estimation point of view, is good enough to detect such alternatives. In this situation, estimation and testing demand different answers regarding bandwidth selection. The results presented in Table 2 for the skewed unimodal density 2 also deserve an additional comment. This is an interesting case because density 2 is not far from the normal density in shape, and we may expect that \(\hat{h}_{H_0}\), being based on the null density family, may reach good power results for alternative densities that are not far in shape from the null density model. The simulation results observed for density 2 support this idea.
The results presented in Table 2 also show different performances for the tests based on the test statistics \(I_n(\hat{h})\) and \(J_n(\hat{h})\), no matter which bandwidth is used. The statistic \(J_n(\hat{h})\) seems to be more effective in detecting multimodal alternatives, whereas \(I_n(\hat{h})\) generally performs better in detecting unimodal alternatives.

Based on the previous conclusions, we have good reasons to believe that \(\hat{h}_{H_0}\) may reach a good power performance for wide sets of alternative distributions. In order to examine this question in detail, in addition to the goodness-of-fit test for the normal distribution we also consider two other null location-scale models. They are the logistic model, where \(f_0(x)=(\exp (-x/2) + \exp (x/2))^{-2}\), for \(x\in \mathbb {R}\), and the Gumbel extreme value model, where \(f_0(x)=\exp (-x-\exp (-x))\), for \(x\in \mathbb {R}\). For this latter family of distributions we take for \(\hat{\theta }_1\) and \(\hat{\theta }_2\) the maximum likelihood estimators of \(\theta _1\) and \(\theta _2\), which satisfy

$$\begin{aligned} \hat{\theta }_2 = \bar{X}_n - \frac{\sum _{j=1}^n X_j \exp (-X_j/\hat{\theta }_2)}{\sum _{j=1}^n \exp (-X_j/\hat{\theta }_2)} \text{ and } \hat{\theta }_1 = -\hat{\theta }_2 \log \bigg ( n^{-1} \sum _{j=1}^n \exp (-X_j/\hat{\theta }_2) \bigg ). \end{aligned}$$

In the case of the goodness-of-fit test for the logistic distribution we use the moment estimators \(\hat{\theta }_1=\bar{X}_n\) and \(\hat{\theta }_2=\sqrt{3}\,S_n / \pi \), which are simpler to evaluate and nearly as efficient as the maximum likelihood estimators (see Johnson et al. 1995, pp. 127–130). As in the goodness-of-fit test for the normal distribution, the assumptions of Theorem 3 are satisfied, and the tests based on critical regions (15) are implemented as explained before.

For comparison purposes, besides the bandwidth selectors \(\hat{h}_{H_0}\) and \(\hat{h}_{\mathrm {CT}}\), we consider in this study other bandwidth selectors based on the common principle that the bandwidth should be tuned to improve the power performance of the test. To implement this idea, we consider the set of scale equivariant bandwidths based on \(\hat{h}\), where \(\hat{h}\) stands for \(\hat{h}_{H_0}\) or \(\hat{h}_{\mathrm {CT}}\), given by

$$\begin{aligned} \hat{h}_\lambda (X_1,\ldots ,X_n)=\lambda \hat{h}(X_1,\ldots ,X_n),\;\lambda \in \Lambda , \end{aligned}$$

where \(\Lambda \) is a finite set of strictly positive real numbers that act as tuning parameters. Besides the value \(\lambda =1\), associated with the reference bandwidth \(\hat{h}\), this set is meant to include tuning parameters smaller and larger than one. If we denote by \(T_{n,\lambda }(X_1,\ldots ,X_n)\) one of the statistics \(I_n(X_1,\ldots ,X_n; \hat{h}_\lambda (X_1,\ldots ,X_n))\) or \(J_n(X_1,\ldots ,X_n;\hat{h}_\lambda (X_1,\ldots ,X_n))\), from the scale equivariance of \(\hat{h}\) we know that \(T_{n,\lambda }\) is location-scale invariant, and therefore the null distribution of \(T_{n,\lambda }\) does not depend on \(f \in \mathscr {F}_0\), where \(\mathscr {F}_0\) is given by (2). Consequently, the tests with critical regions

$$\begin{aligned} \mathscr {C}(T_{n,\lambda },\alpha )=\{ T_{n,\lambda } > q(T_{n,\lambda },\alpha ) \},\;\lambda \in \Lambda , \end{aligned}$$
(16)

where \(q(T_{n,\lambda },\alpha )\) denotes the quantile of order \(1-\alpha \) of \(T_{n,\lambda }\) under \(H_0\), have levels of significance at most equal to \(\alpha \). As before, we assume that these quantiles are known quantities, as they can be well approximated by simulating under the null hypothesis a large number of times (100,000 replications under the null hypothesis are used). The power properties of each of the previous test procedures depend on \(\lambda \), which is why its choice is usually crucial to obtaining a powerful test procedure. In order to make such a choice, we need to define a suitable location-scale invariant measurable function \(\hat{\lambda }=\hat{\lambda }(X_1,\ldots ,X_n)\) taking values in \(\Lambda \), called a tuning parameter selector, on the basis of which we can consider a test procedure based on the critical region

$$\begin{aligned} \mathscr {C}(T_{n,\hat{\lambda }},\alpha ) = \{ T_{n,\hat{\lambda }} > q(T_{n,\hat{\lambda }},\alpha ) \}, \end{aligned}$$
(17)

where \(q(T_{n,\hat{\lambda }},\alpha )\) denotes the quantile of order \(1-\alpha \) of \(T_{n,\hat{\lambda }}\) under \(H_0\). This test has a level of significance at most equal to \(\alpha \) for each sample size n.

In order to define effective methods for selecting the tuning parameter \(\lambda \in \Lambda \), we adapt to our situation three methods considered in Cao and Van Keilegom (2006), Martínez-Camblor et al. (2008) and Martínez-Camblor and Uña-Álvarez (2013) in the context of smooth tests for the k-sample problem. Given the level \(\alpha \) of the test and a sample \(X_1,\ldots ,X_n\) from f, the first tuning parameter selector we consider, which we denote by \(\hat{\lambda }_{1}=\hat{\lambda }_{1}(X_1,\ldots ,X_n;\alpha )\), was originally used in Cao and Van Keilegom (2006, p. 69) and is defined as the value in \(\Lambda \) that maximises the smooth bootstrap power, that is,

$$\begin{aligned} \hat{\lambda }_1 = \mathrm {arg}\!\max _{\lambda \in \Lambda } \frac{1}{B_1} \sum _{k=1}^{B_1} I\big ( T_{n,\lambda }(X^*_{k,1},\ldots ,X^*_{k,n}) > q(T_{n,\lambda },\alpha ) \big ), \end{aligned}$$
(18)

with

$$\begin{aligned} X^*_{k,j}=X_{U_{(k-1)n+j}} + \hat{h}_{\mathrm {CT}}(X_1,\ldots ,X_n) N_{(k-1)n+j}, \end{aligned}$$

for \(k=1,\ldots ,B_1\) and \(j=1,\ldots ,n\), where \(N_l\), for \(l=1,\ldots ,nB_1\), are independent standard normal random variables, and \(U_l\), for \(l=1,\ldots ,nB_1\), are independent random variables with the discrete uniform distribution on \(\{ 1,\ldots ,n \}\); that is, for each \(k=1,\ldots ,B_1\), \(X^*_{k,1},\ldots ,X^*_{k,n}\) is generated by resampling from the Parzen–Rosenblatt estimator with Gaussian kernel and smoothing parameter \(\hat{h}_{\mathrm {CT}}(X_1,\ldots ,X_n)\). As expressed by the notation \(\hat{\lambda }_{1}(X_1,\ldots ,X_n;\alpha )\), note that \(\hat{\lambda }_1\) depends on the considered level \(\alpha \).
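The smooth bootstrap resampling and the maximisation in (18) can be sketched as follows in Python; `stat` and `quantiles` are hypothetical placeholders for \(T_{n,\lambda }\) and its precomputed null quantiles \(q(T_{n,\lambda },\alpha )\), both assumed available.

```python
import random

def smooth_bootstrap(x, h, B, seed=0):
    # B samples of size n from the Parzen-Rosenblatt estimate with Gaussian
    # kernel and bandwidth h: X* = X_U + h * N, with U uniform on the sample
    # indices and N standard normal.
    rng = random.Random(seed)
    n = len(x)
    return [[x[rng.randrange(n)] + h * rng.gauss(0.0, 1.0)
             for _ in range(n)] for _ in range(B)]

def select_lambda_1(x, stat, h, quantiles, Lambda, B1=200, seed=0):
    # lambda_hat_1: the lambda in Lambda maximising the smooth-bootstrap
    # rejection frequency; stat(sample, lam) plays the role of T_{n,lam}
    # and quantiles[lam] its null (1 - alpha)-quantile (assumed precomputed).
    boot = smooth_bootstrap(x, h, B1, seed)
    def power(lam):
        return sum(stat(xs, lam) > quantiles[lam] for xs in boot) / B1
    return max(Lambda, key=power)
```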

The second method for selecting \(\lambda \) we consider is based on the observation that given the values \(T_{n,\lambda }(X_1,\ldots ,X_n)\) of the test statistics for the observed sample \(X_1,\ldots ,X_n\), more evidence against the null hypothesis is obtained for smaller p-values. Therefore, to construct a powerful test it makes sense to minimise the bootstrap p-value along \(\lambda \in \Lambda \), an idea that was used in Martínez-Camblor et al. (2008, pp. 4014–4015); see also Martínez-Camblor and Uña-Álvarez (2009). Hence, we denote by \(\hat{\lambda }_{2} = \hat{\lambda }_{2}(X_1,\ldots ,X_n)\) the tuning parameter selector given by

$$\begin{aligned} \hat{\lambda }_{2} = \mathrm {arg}\!\min _{\lambda \in \Lambda } \frac{1}{B_0} \sum _{j=1}^{B_0} I\big ( T_{n,\lambda }(X_{0,(j-1)n+1},\ldots ,X_{0,jn}) \ge T_{n,\lambda }(X_1,\ldots ,X_n) \big ), \end{aligned}$$
(19)

where \(X_{0,l}\), for \(l=1,\ldots ,nB_0\), are independent copies of a random variable \(X_0\) with density \(f_0\).
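A minimal sketch of (19): given \(B_0\) samples simulated from \(f_0\) (here passed in precomputed), pick the \(\lambda \) minimising the bootstrap p-value. As before, `stat` is a hypothetical stand-in for \(T_{n,\lambda }\).

```python
def select_lambda_2(x, stat, null_samples, Lambda):
    # lambda_hat_2: the lambda in Lambda minimising the bootstrap p-value,
    # i.e. the fraction of null samples whose statistic is at least as large
    # as the observed one. null_samples holds B0 samples of size n drawn
    # from f0 (assumed generated elsewhere).
    def pvalue(lam):
        t_obs = stat(x, lam)
        return sum(stat(xs, lam) >= t_obs
                   for xs in null_samples) / len(null_samples)
    return min(Lambda, key=pvalue)
```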

The last method for selecting \(\lambda \) we consider was introduced in Martínez-Camblor and Uña-Álvarez (2013, p. 273) and is based on the idea that \(\lambda \) should be chosen to maximise the discrimination capability, between the null and the alternative hypotheses, of the diagnostic variable \(T_{n,\lambda }\), expressed by the area under the ROC curve associated with it. As this area is given by \(P(T_{n,\lambda }^0 < T_{n,\lambda }^1)\), where \(T_{n,\lambda }^0\) and \(T_{n,\lambda }^1\) are independent random variables with the null and the alternative distributions of \(T_{n,\lambda }\), respectively (see Krzanowski and Hand 2009, pp. 26–28), we consider the tuning parameter selector \(\hat{\lambda }_{3}=\hat{\lambda }_{3}(X_1,\ldots ,X_n)\) defined by

$$\begin{aligned} \hat{\lambda }_{3} = \mathrm {arg}\!\max _{\lambda \in \Lambda } \frac{1}{B_0 B_1} \sum _{j=1}^{B_0} \sum _{k=1}^{B_1} I\big ( T_{n,\lambda }(X_{0,(j-1)n+1},\ldots ,X_{0,jn}) < T_{n,\lambda }(X^*_{k,1},\ldots ,X^*_{k,n}) \big ), \end{aligned}$$
(20)

where \(X_{0,l}\), for \(l=1,\ldots ,nB_0\), and \(X^*_{k,j}\), for \(k=1,\ldots ,B_1\) and \(j=1,\ldots ,n\), are defined as before.
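The empirical AUC criterion (20) compares null statistics against smooth-bootstrap statistics pairwise. A sketch, with both sample collections assumed precomputed and `stat` again a hypothetical stand-in for \(T_{n,\lambda }\):

```python
def select_lambda_3(stat, null_samples, boot_samples, Lambda):
    # lambda_hat_3: the lambda maximising the empirical area under the ROC
    # curve, estimated by the fraction of (null, bootstrap) pairs with
    # T^0 < T^1; null_samples ~ f0 and boot_samples come from the smooth
    # bootstrap of the observed data (both assumed generated elsewhere).
    def auc(lam):
        t0 = [stat(xs, lam) for xs in null_samples]
        t1 = [stat(xs, lam) for xs in boot_samples]
        return sum(a < b for a in t0 for b in t1) / (len(t0) * len(t1))
    return max(Lambda, key=auc)
```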

Taking into account that, conditionally on the sequences \(N=(N_l)\), \(U=(U_l)\) and \(X_0=(X_{0,l})\), the previous tuning parameter selectors are location-scale invariant, we conclude that the tests based on critical region (17), where \(\hat{\lambda }\) stands for either \(\hat{\lambda }_1\), \(\hat{\lambda }_2\) or \(\hat{\lambda }_3\), have levels of significance at most equal to \(\alpha \) for each sample size n (conditionally on N, U and \(X_0\)). In the practical implementation of these tests we always take \(\Lambda =\{0.5,0.75,1,1.5,2\}\). For the normality goodness-of-fit tests we take \(B_0=B_1=200\), and the quantiles \(q(T_{n,\hat{\lambda }},\alpha )\) are estimated by performing 100,000 simulations under the null hypothesis. For the goodness-of-fit tests for the logistic and Gumbel distributions we take \(B_0=B_1=100\), and the quantiles are estimated by performing 50,000 simulations under the null hypothesis, because the evaluation of the corresponding test statistics is more time-consuming than in the normal case.

For the alternatives 4 and 8 from the Marron and Wand (1992) set of normal mixture densities, Tables 3 and 4 present power estimates, at level \(\alpha =0.05\), for the normality goodness-of-fit tests based on critical regions (16) with \(\hat{h}_\lambda = \lambda \hat{h}\), \(\lambda = 0.25,0.5,0.75,1,1.25,1.5,1.75,2,3,5\), and (17) with \(\Lambda =\{ 0.5,0.75,1,1.5,2 \}\), where \(\hat{h}=\hat{h}_{H_0},\hat{h}_\mathrm {CT}\). As mentioned before, for all sample sizes we see that the empirical power depends on \(\lambda \). However, these two alternatives reveal different situations. For alternative 8 the best power results are in general observed for values of \(\lambda \) close to, or even equal to, 1, and therefore the tests based on \(\hat{\lambda }_j \hat{h}\), for \(j=1,2,3\), are not expected to be more powerful than those based on the bandwidth selector \(\hat{h}\). The figures in both tables support this idea. A similar situation occurs for alternative 4 and bandwidth \(\hat{h}_\mathrm {CT}\). However, when the bandwidth \(\hat{h}_{H_0}\) is used for alternative 4, an alternative for which the kernel estimator based on \(\hat{h}_{H_0}\) performs poorly from a density estimation point of view, we see that it is highly advisable to use a tuning parameter smaller than 1, which may explain the good results obtained by the tuning parameter selectors \(\hat{\lambda }_2\) and \(\hat{\lambda }_3\) for the test based on \(I_n\) and by \(\hat{\lambda }_1\) for the test based on \(J_n\).

Table 3 Power estimates, at level \(\alpha =0.05\), for the normality goodness-of-fit tests based on \(I_n(\lambda \hat{h}_{H_0})\) and \(J_n(\lambda \hat{h}_{H_0})\), with \(\lambda = 0.25,0.5,0.75,1,1.25,1.5,1.75,2,3,5\), and \(I_n(\hat{\lambda }_j \hat{h}_{H_0})\) and \(J_n(\hat{\lambda }_j \hat{h}_{H_0})\), \(j=1,2,3\), with \(\Lambda =\{ 0.5,0.75,1,1.5,2 \}\), for alternatives 4 and 8 from the Marron and Wand (1992) set of normal mixture densities
Table 4 Power estimates, at level \(\alpha =0.05\), for the normality goodness-of-fit tests based on \(I_n(\lambda \hat{h}_{\mathrm {CT}})\) and \(J_n(\lambda \hat{h}_{\mathrm {CT}})\), with \(\lambda = 0.25,0.5,0.75,1,1.25,1.5,1.75,2,3,5\), and \(I_n(\hat{\lambda }_j \hat{h}_{\mathrm {CT}})\) and \(J_n(\hat{\lambda }_j \hat{h}_{\mathrm {CT}})\), \(j=1,2,3\), with \(\Lambda =\{ 0.5,0.75,1,1.5,2 \}\), for alternatives 4 and 8 from the Marron and Wand (1992) set of normal mixture densities

For \(\alpha =0.01,0.05,0.1\) and sample sizes \(n=20,50,80\), we present in Tables 5–7 (see the supplementary online material) estimates of the actual levels of significance of the goodness-of-fit tests for the normal, logistic and Gumbel distributions, respectively, based on \(I_n(\hat{h})\) and \(J_n(\hat{h})\) for the different bandwidth selectors \(\hat{h}\) based on \(\hat{h}_{H_0}\) and \(\hat{h}_\mathrm {CT}\). They are based on 20,000 simulations under the null hypothesis. These results indicate that all the tests have an effective level of significance very close to \(\alpha \). With a few exceptions, the estimated levels lie inside the approximate 95% confidence interval for the preassigned nominal levels.

Although a larger set of alternative distributions, usually considered in power studies for testing the normal, logistic and Gumbel distributions, was considered in our study (see Epps and Pulley 1983; Meintanis 2004; Epps 2005; Romão et al. 2010), we limit ourselves to presenting in Tables 8–10 (normal distribution), Tables 11–13 (logistic distribution) and Tables 14–16 (Gumbel distribution) the empirical power results for some of these alternatives (see the supplementary online material). The first seven alternatives are from the following location-scale families: uniform, exponential, Laplace, Cauchy, normal, logistic and Gumbel. The remaining six alternatives are from the following families of distributions: Student, lognormal, Tukey, gamma, Weibull and beta. For the exact definition of the distributions included in these tables, see Epps (2005). We limit ourselves to presenting here the results obtained for the nominal level \(\alpha =0.05\) and sample sizes \(n=20,50,80\); however, similar conclusions can be drawn for the nominal levels \(\alpha =0.1, 0.01\) also considered in our study. For comparison purposes, we include in the previous tables power estimates for the classical Anderson–Darling (1954) goodness-of-fit test, which is based on a weighted quadratic distance between the empirical distribution function and a parametric estimator of the distribution function of f under the null hypothesis (see Stephens 1986, and the references therein, for goodness-of-fit tests based on the empirical distribution function). In order to implement this test, the quantiles of order \(1-\alpha \) of the Anderson–Darling test statistic \(A^2\) are estimated by performing 100,000 simulations under the null hypothesis. In the case of the goodness-of-fit test for the normal distribution we also include in our simulation study the highly recommended Shapiro–Wilk (1965) test SW, implemented by the R function shapiro.test.
For all the tests included in the study, the power estimates are based on 10,000 samples from the considered alternatives.

Although none of the considered tests presents uniformly better results for the considered set of alternative distributions, the main conclusion that can be drawn from this study is that the tests based on \(\hat{h}_{H_0}\) present in fact a good overall performance for a wide set of alternative distributions. Regarding the two tests based on \(\hat{h}_{H_0}\), our preference goes to the test based on the test statistic \(I_n(\hat{h}_{H_0})\). This test is in general more powerful than, or at least as powerful as, the tests based on \(\hat{h}_{\mathrm {CT}}\), and also proves to be quite competitive against the Anderson–Darling test, although slightly less powerful than the Shapiro–Wilk test for normality. However, no matter the considered null hypothesis model, for some of the considered alternatives, such as the light-tailed uniform and beta alternatives, the test based on \(J_n(\hat{h}_{H_0})\) proves to be more powerful than that based on \(I_n(\hat{h}_{H_0})\). Finally, note that the new bandwidth selectors \(\hat{\lambda }_j \hat{h}_{H_0}\) and \(\hat{\lambda }_j \hat{h}_\mathrm {CT}\), for \(j=1,2,3\), which are much more time-consuming to compute than \(\hat{h}_{H_0}\) or \(\hat{h}_\mathrm {CT}\), do not in general reveal any special advantage over these simpler-to-compute bandwidths, the exception being the Tukey(5) alternative distribution for the normal and logistic models. As some simulation experiments reveal (not presented here), the extra source of variation they introduce into the null hypothesis distribution of the associated test statistics, especially those based on \(J_n\), may explain the observed results.

5 Conclusions

The choice of the bandwidth is crucial to the performance of the Parzen–Rosenblatt estimator, and several automatic bandwidth selectors considered in the literature satisfy the relative consistency condition (11). This is not the case for the null hypothesis-based bandwidth selector \(\hat{h}_{H_0}\), which satisfies this condition only when the null hypothesis is true. However, if we want to use the Bickel–Rosenblatt test statistics to test the hypothesis that the underlying density function f is a member of a location-scale family of probability density functions, the finite sample results presented in this paper support the idea that the tests based on \(\hat{h}_{H_0}\) present a good overall performance for a wide set of alternative distributions. These tests are in general more powerful than, or at least as powerful as, those based on data-dependent smoothing parameters \(\hat{h}\) that satisfy the relative consistency condition irrespective of whether the null or the alternative hypothesis is true, as well as those inspired by existing data-driven bandwidths for smooth tests for the k-sample problem, which can be computed by resampling.