1 Introduction

Suppose some i.i.d. data \(X_1, X_2, \ldots , X_n\) from a common distribution with density \(f(\cdot )\) are observed, and one aims to estimate this density using the standard Parzen–Rosenblatt kernel estimator

$$\begin{aligned} \widehat{f}_h(x) = \frac{1}{nh} \sum _{i=1}^{n}K\left( \frac{x-X_i}{h}\right) , \end{aligned}$$
(1)

where \(K\) is a kernel and \(h\) the so-called bandwidth parameter.
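As a minimal illustration of the estimator (1), the following Python sketch evaluates \(\widehat{f}_h\) on a grid. The function names `epanechnikov` and `kde`, the simulated normal sample, and the bandwidth \(h=0.5\) are assumptions made only for this example.

```python
import numpy as np

def epanechnikov(u):
    """Epanechnikov kernel K(u) = 0.75 (1 - u^2) on [-1, 1], zero elsewhere."""
    u = np.asarray(u, dtype=float)
    return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u**2), 0.0)

def kde(x, data, h, kernel=epanechnikov):
    """Evaluate the kernel estimator (1) at the points x for a bandwidth h > 0."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    data = np.asarray(data, dtype=float)
    u = (x[:, None] - data[None, :]) / h          # shape (len(x), n)
    return kernel(u).sum(axis=1) / (data.size * h)

# Illustrative data only: 200 draws from a standard normal distribution.
rng = np.random.default_rng(0)
sample = rng.normal(size=200)
grid = np.linspace(-5.0, 5.0, 2001)
f_hat = kde(grid, sample, h=0.5)

# Riemann-sum check that the estimate integrates to roughly one.
total_mass = f_hat.sum() * (grid[1] - grid[0])
```

Since \(K\) is itself a density, the estimate is nonnegative and integrates to one for every \(h>0\); only its smoothness changes with the bandwidth.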

The problem is then to find a reliable data-driven estimator for the optimal bandwidth. To assess the performance of \(\widehat{f}_h\), generally accepted measures are the integrated squared error

$$\begin{aligned} {\mathrm{ISE}}(h)&= {\mathrm{ISE}}\{\widehat{f}_h(x)\}=\int \{\widehat{f}_h(x)-f(x)\}^2\, {\mathrm{d}}x \end{aligned}$$
(2)

or alternatively, the mean integrated squared error, i.e.

$$\begin{aligned} {\mathrm{MISE}}(h)={\mathrm{MISE}}\left[ \hat{f}_h(x)\right] =\int {\mathrm{MSE}} \left[ \hat{f}_h(x)\right] \,{\mathrm{d}}x. \end{aligned}$$
(3)

Let us denote the minimizers of these two criteria by \(\widehat{h}_{0}\) and \(h_0\) respectively. The main difference is that ISE\((h)\) is a stochastic process indexed by \(h>0\), while MISE\((h)\) is a deterministic function of \(h\). Based on these criteria, we distinguish two classes of methods: the cross-validation methods trying to estimate \(\widehat{h}_{0}\) (and, therefore, looking at the ISE), and the plug-in methods which try to minimize the MISE to find \(h_0\). It is obvious that these criteria coincide asymptotically but not for finite samples.

Part of the community working on nonparametric statistics has accepted that there may not be a perfect procedure to select the optimal bandwidth. Nevertheless, one should be able to say which is a reasonable bandwidth selector, at least for a particular problem. The so-called SiZer method tries to indicate a range of reasonable bandwidths and is, therefore, quite attractive for data snooping; see Chaudhuri and Marron (1999) for an introduction, Godtliebsen et al. (2002) for an extension to the bivariate case, and Hannig and Marron (2006) for an interesting modification using extreme value theory. However, SiZer does not return any specific data-driven bandwidth. The development of bandwidth selectors has therefore continued, and we believe that a review and comparison of existing selectors would be quite helpful to get an idea of their objectives and performance.

We counted more than 30 bandwidth selectors, several of them being modifications made for particular estimation problems. We therefore decided to restrict our study as follows. Firstly, we consider only independent observations. Secondly, we look at \(L_2\)-based, not \(L_1\)-based, methods; see also our discussion below. Boundary problems are not discussed, because it is hard to say how they can be combined with the problem of bandwidth selection. In our simulation comparison, we concentrate on rather small and moderate sample sizes, and on quite smooth densities. The considered degree of smoothness covers a broad range of problems in any research area but excludes sharp peaks and highly oscillating functions. Notice that the latter problems should not be tackled with kernels anyway. Density problems with extreme tails are not included. It is well known that those problems should be solved by data transformation; see e.g. Wand et al. (1991) or Yang and Marron (1999) for parametric, and Ruppert and Cline (1994) for nonparametric transformations. After an appropriate data transformation, the remaining estimation problem falls into the smoothness class considered here (though possibly with boundary problems). Note that the limitation to global bandwidths is not that restrictive, and is even quite common in density estimation. Moreover, when the covariates \(X\) have been transformed such that a similar smoothness can be assumed over the whole (transformed) support, using a global bandwidth is quite a reasonable choice. Finally, we have limited our study to already published methods.

The idea of cross-validation methods goes back to Rudemo (1982) and Bowman (1984), but we should also mention in this context the so-called pseudo-likelihood CV methods invented by Habbema et al. (1974) and by Duin (1976). Due to the lack of stability of cross-validation, see e.g. Wand and Jones (1995), different modifications have been proposed, such as the stabilized bandwidth selector of Chiu (1991a, b, 1992), the smoothed CV proposed by Hall et al. (1992), the modified CV (MCV) of Stute (1992) or the one of Feluch and Koronacki (1992), and most recently the one-sided CV of Martinez-Miranda et al. (2009) and the indirect CV of Savchuk et al. (2010). The biased CV (BCV) of Scott and Terrell (1987) minimizes the asymptotic MISE, as plug-in methods do, but uses a jack-knife procedure (and is therefore called CV) to avoid the use of prior information. Methods that combine different selectors or density estimators were proposed by Ahmad and Ran (2004), who called theirs the kernel contrast method, and by Mammen et al. (2011), who proposed the do-validation method.

Compared to CV, the so-called plug-in methods minimize a different objective function, namely the MISE instead of the ISE; they are less volatile but not entirely data adaptive as they require some pilot information. In contrast, CV allows one to choose the bandwidth without making assumptions about the smoothness class (or the like) to which the unknown density belongs. Plug-in methods have a faster convergence rate compared to CV. Unfortunately, they can depend heavily on the choice of pilots; but if we have excellent pilot estimators, then the performance of plug-in methods is pretty good. Among these selectors, Silverman's (1986) rule-of-thumb method is probably the most popular one. Various refinements were introduced, for example by Park and Marron (1990), Sheather and Jones (1991), or Hall et al. (1991). The bootstrap methods of Taylor (1989) and all its modifications (cf. Cao 1993, or Chacon et al. 2008) are counted among the plug-in methods as they aim to minimize the MISE.

There are already several papers comparing different automatic data-driven bandwidth selection methods. However, most of them are more than 10 years old. In the 1970s and the early 1980s, survey papers about density estimation were published by Wegman (1972), Tarter and Kronmal (1976), Fryer (1977), Wertz and Schneider (1979), Bean and Tsokos (1980), etc. An introduction to and comparison of various methods of smoothing parameter selection was given by Marron (1988a) and by Park and Marron (1990). Then, extensive simulation studies were published by Park and Turlach (1992), Cao et al. (1994) and Chiu (1996). A brief survey was provided by Jones et al. (1996a) with a comprehensive simulation study published in Jones et al. (1996b). Somewhat later, Loader (1999) also published a comparison paper, partly as a reply to Jones et al. (1996b). To our knowledge, only Chacon et al. (2008) published a comparison study in more recent years. However, they concentrated on bootstrap methods and only compared them with classical CV and the plug-in version of Sheather and Jones (1991). While all these focused on the \(L_2\) norm, for the \(L_1\) view we refer to Devroye and Gyorfi (1985) for density estimation, to Devroye and Lugosi (1996) for an optimal bandwidth choice, and to Devroye (1997) for a comprehensive comparison study.

The general criticism against the two classes of selection methods can be summarized as follows: CV leads to undersmoothing and is known to hardly stabilize for large data sets (it often just chooses the smallest possible value among all bandwidths), whereas plug-in depends on prior information and typically works badly for small samples.

To make some statements about asymptotic theory, we use the following assumptions on the kernel and on the density.

  • (A1) The kernel \(K\) is a compactly supported density function on \(\mathbb R \), symmetric around zero, and has a Hölder-continuous derivative \(K^{\prime }\).

  • (A2) It holds \(\mu _2(K) < \infty \), where \(\mu _l(K) = \int u^lK(u)\mathrm{d}u\).

  • (A3) The density \(f\) is bounded and twice differentiable; \(f^{\prime }\) and \(f^{\prime \prime }\) are bounded and integrable, and \(f^{\prime \prime }\) is uniformly continuous.

For some methods, we will have to modify these conditions.

In our simulation study, we restrict ourselves to selection methods not using higher-order kernels. Recall that the main motivation for the application of higher-order kernels is their theoretical advantage of faster asymptotic convergence rates. However, their substantial drawback is a loss in practical interpretability as they involve negative weights and might, therefore, give negative density estimates; see also Marron (1994).

In the context of asymptotic properties of bandwidth selectors, there is a trade-off between the classical plug-in method and standard cross-validation. The plug-in always has a smaller (asymptotic) variance compared to cross-validation (see Hall and Marron 1987a) but often a larger bias in practice. To our knowledge, no other bandwidth selector has so far outperformed the asymptotic properties of the sophisticated plug-in methods. Although Hall and Johnstone (1992) stated that such methods must theoretically exist, they could not give any practical example.

2 A brief review of previous reviews

The study of Park and Turlach (1992) comprised least squares cross-validation, the biased cross-validation (BCV) of Scott and Terrell (1987), the plug-in method (SJPI) of Sheather and Jones (1991), the plug-in method (PM) of Park and Marron (1990), the smooth cross-validation (SCV) of Hall et al. (1992), and a modified version (bandwidth factorized SCV) by Jones et al. (1991). The algorithms were discussed in the appendix but there was no discussion of motivation, derivation or statistical properties. They considered the estimation performance for mixtures of uni-, bi- and tri-modal normal densities along three criteria: the mean integrated squared error, the mean integrated absolute error, and mode detection. To our knowledge, that study has been published only as a working paper.

Jones et al. (1996a, b) gave in their first paper a brief survey including rule-of-thumb (ROT), CV, BCV, SJPI, and finally the smooth bootstrap method. They mentioned other selectors like that of Chiu (1992) or Hall et al. (1992) without giving further details. Their findings mainly coincided with those of Cao et al. (1994), who considered fewer selectors and only a few, quite smooth densities, but also some qualitative measures like the so-called IP- or the double kernel method. Jones et al. (1996b) took samples of sizes \(n=100\) and \(n=1{,}000\) to estimate 15 different mixtures of normal densities. Along the quality measures they chose, their own SJPI bandwidth selector performed best.

Chiu (1996) extended the set of bandwidth selectors considered in Park and Turlach (1992) by his various stabilized methods, but he used just one specific bias-related criterion to show the superiority of his methods. His article is actually neither a review nor a general (simulation) or comparison study.

Loader (1999) replied to the then often claimed superiority of plug-in methods on several fronts. He compared them with CV methods for density estimation and regression, looking into the sources of differences. He argued that plug-in methods were heavily dependent on arbitrary specifications of pilot bandwidths and failed when this specification was wrong. He considered the likelihood based CV together with its approximation by an Akaike-style criterion, the classical CV, SJPI, BCV, and the fixed point iterations (GKK) approach of Gasser, Kneip and Köhler (1991). A detailed simulation study comparing them all was not performed. Instead, he compared some selectors along real data, and the methods CV, BCV, SJPI and GKK by some particular simulations. Half of the paper was dedicated to regression.

Sheather (2004) gave a practical description of kernel density estimation, reviewing some estimation and bandwidth selection methods which he considered to be the most popular at that time, together with software advice, a new modification (data sharpening), and a real data application. Simulations or comparison studies were not provided.

Devroye (1997) presented what is doubtless the largest and most extensive comparison study with discussion. The main differences to all the other studies (including ours) were the following. First, he looked at the \(L_1\) measures when studying the asymptotic properties. Second, he considered different kernel density estimators. Third, for the estimation of about half of the densities from which he drew the samples in his simulations one faces serious boundary problems or problems with jumps. Consequently, he considered partly ‘quite’ and partly ‘slightly’ different bandwidth selectors. For these reasons, all other studies are hardly comparable with that one.

Nonetheless, our choice of considered densities has partly been guided by his sample even if we show only the results for a tiny subsample. Furthermore, as it has turned out that for large samples most of the selectors behave pretty well with diminishing differences, we have concentrated in our simulation study on small (\(n\ge 25\)) to at most moderate (\(n\le 200\)) sample sizes. We also tried \(n=500\) and \(n=1{,}000\); but the only new findings were that the indirect CV methods reveal their (asymptotic) superiority, whereas the leave-one-out cross-validation can easily recommend bandwidths close to zero (depending on the real underlying density). This is also the moment when the modified cross-validation (MCV) of Stute (1992) starts to become attractive. See also our comment below on data rounding.

3 Cross-validation methods in density estimation

Recall the performance measure ‘integrated squared error’ (ISE)

$$\begin{aligned} {\mathrm{ISE}}(h) = \int \widehat{f}_h^2(x)\, {\mathrm{d}}x -2 \, E\{\widehat{f}_h(X)\} +\int f^2(x)\, {\mathrm{d}}x. \end{aligned}$$

Evidently, the first term can be calculated from the data, the second can be expressed as the expected value of \(\widehat{f}_h(X)\), and the third term can be ignored in the context of bandwidth selection since it does not depend on the bandwidth. Note that estimating \(E\{\widehat{f}_h(X)\}\) by \( \frac{1}{n}\sum _{i=1}^n \widehat{f}_{h}(X_i)\) is inadequate due to the implicit dependency (\(\widehat{f}_{h}\) depends on \(X_i\)). So the different modifications of CV basically vary in the estimation of this problematic second part.

3.1 Ordinary least squares cross-validation

This is the classical straightforward approach by just dropping \(X_i\) when estimating \(f(X_i)\), called jack-knife estimator and denoted by \(\widehat{f}_{h,-i}(X_i)\). It yields the least-squares CV criterion

$$\begin{aligned} \mathrel {\mathop {\mathrm{min}}\limits _{h}^{}} \ {\mathrm{CV}}(h) = \int \widehat{f}_h^2(x)\, {\mathrm{d}}x - 2 \, \frac{1}{n}\sum _{i=1}^n \widehat{f}_{h,-i}(X_i). \end{aligned}$$

Stone (1984) showed that under the assumptions (A1)–(A3), the minimizing argument \(\widehat{h}_{{\mathrm{CV}}}\) fulfills

$$\begin{aligned} {\mathrm{ISE}}(\widehat{h}_{{\mathrm{CV}}}) / {\mathrm{ISE}}(\hat{h}_0) \xrightarrow {a.s.} 1. \end{aligned}$$

However, Hall and Marron (1987a) stated that this happened at a very slow rate; specifically

$$\begin{aligned}&n^{3/10}(\widehat{h}_{{\mathrm{CV}}} - \widehat{h}_0) \longrightarrow N(0,\sigma ^2c^{-2}) \qquad {\mathrm{and }} \nonumber \\&n\left( {\mathrm{ISE}}(\widehat{h}_{{\mathrm{CV}}}) - {\mathrm{ISE}}(\hat{h}_0) \right) \longrightarrow \frac{1}{2} \sigma ^2 c^{-1} \chi ^2_1 \end{aligned}$$
(4)

under assumptions (A1)–(A3), and with terms \(\sigma \) and \(c\) depending only on \(f\) and \(K\). Many practitioners use this classical CV method nonetheless because of its intuitive definition and simple implementation.
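The least-squares CV criterion can be sketched in a few lines of Python. The sketch below is an illustration under stated assumptions and not the implementation used in our study: it uses a Gaussian kernel (which is not compactly supported, so (A1) does not hold literally) because then \(\int \widehat{f}_h^2\) has a closed form via the \(N(0,2)\) density, and it minimizes over a user-supplied grid. The names `lscv_score` and `lscv_bandwidth` are hypothetical.

```python
import numpy as np

def lscv_score(h, data):
    """Least-squares CV criterion CV(h) for a Gaussian kernel (assumption)."""
    data = np.asarray(data, dtype=float)
    n = data.size
    d = (data[:, None] - data[None, :]) / h
    # int f_hat^2 = (n^2 h)^{-1} sum_{i,j} (K*K)(d_ij); K*K is the N(0,2) density.
    conv = np.exp(-d**2 / 4.0) / np.sqrt(4.0 * np.pi)
    int_f2 = conv.sum() / (n**2 * h)
    # Mean of the leave-one-out estimates f_hat_{h,-i}(X_i): drop the diagonal.
    k = np.exp(-d**2 / 2.0) / np.sqrt(2.0 * np.pi)
    np.fill_diagonal(k, 0.0)
    loo_mean = k.sum() / (n * (n - 1) * h)
    return int_f2 - 2.0 * loo_mean

def lscv_bandwidth(data, grid):
    """Return the grid point minimizing CV(h) together with all scores."""
    scores = np.array([lscv_score(h, data) for h in grid])
    return grid[np.argmin(scores)], scores

# Illustrative data only.
rng = np.random.default_rng(1)
sample = rng.normal(size=200)
h_grid = np.linspace(0.05, 1.5, 59)
h_cv, scores = lscv_bandwidth(sample, h_grid)
```

A grid search is used here because CV\((h)\) may have several local minima; the lower end of the grid then acts as an implicit guard against degenerate, very small bandwidths.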

Recall the criticism saying that this classical CV lacks stability (even) when the sample size increases. Silverman (1986) and others gave explanations based on considerations of what happens if the distances \(|x_i-x_j|\) become very small for many observations \(j\ne i\). Chiu (1991a) studied the problem occurring with data rounding such that one obtains many ties (\(x_j=x_i\) for \(i\ne j\)). Based on these considerations, the following stabilized and modified CV versions emerged.

3.2 Stabilized bandwidth selection

Based on characteristic functions, Chiu (1991a, b, 1992) gave an expression for \(\widehat{h}_{{\mathrm{CV}}}\) which revealed the source of variation. Note that the CV criterion is approximately equal to the expression

$$\begin{aligned} \frac{1}{\pi } \int _0^{\infty } |\tilde{\phi }(\lambda )|^2\left\{ w^2(h \lambda ) - 2w(h \lambda )\right\} {\mathrm{d}} \lambda + 2 K(0)/(nh), \end{aligned}$$

with \( \tilde{\phi }(\lambda ) = \frac{1}{n}\sum _{j=1}^n {\mathrm{e}}^{i \lambda X_j} \) and \( w(\lambda ) = \int {\mathrm{e}}^{i \lambda u}K(u)\mathrm{d}u \). The noise in the CV estimate is mainly contributed by \(|\tilde{\phi }(\lambda )|^2\) at high frequencies, which does not contain much information about \(f\). To mitigate this problem, he looked at the difference of the CV criterion and the MISE. As one alternative, he defined \(\Lambda \) as the first \(\lambda \) fulfilling \(|\tilde{\phi }(\lambda )|^2 \le 3/n\) and replaced \(|\tilde{\phi }(\lambda )|^2\) by \(1/n\) for \(\lambda > \Lambda \). This gave his criterion

$$\begin{aligned} \mathrel {\mathop {\mathrm{min}}\limits _{h}^{}} \ S_n(h)&= \int _0^{\Lambda }|\tilde{\phi }(\lambda )|^2\left\{ w^2(h \lambda ) - 2w(h \lambda )\right\} {\mathrm{d}}\lambda \\&\quad +\frac{1}{n}\int _{\Lambda }^{\infty }\left\{ w^2(h \lambda ) - 2w(h \lambda )\right\} {\mathrm{d}}\lambda + 2\pi K(0)/(nh), \\&= \frac{\pi }{nh}||K||^2_2 + \int _0^{\Lambda }\left\{ |\tilde{\phi }(\lambda )|^2 - \frac{1}{n}\right\} \{w^2(h \lambda ) -2w(h \lambda )\}{\mathrm{d}}\lambda , \end{aligned}$$

For the minimizer \(\widehat{h}_{\mathrm{ST}}\) it can be shown that \(\widehat{h}_{\mathrm{ST}} \xrightarrow {a.s.} \widehat{h}_0\), and that it converges to \(h_0\) even at the optimal \(\sqrt{n}\)-rate. Note that in the literature this procedure is often counted among the plug-in methods as it minimizes the MISE rather than the ISE. In our implementation, when calculating \(\Lambda \) we encountered square roots of negative terms in our simulations. To avoid complex numbers, we took the absolute value of the radicand.
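The second form of \(S_n(h)\) can be evaluated numerically as sketched below. This is only an illustration under assumptions that are not part of Chiu's papers or of our implementation: a Gaussian kernel, for which \(w(\lambda )={\mathrm{e}}^{-\lambda ^2/2}\) and \(||K||_2^2=1/(2\sqrt{\pi })\); a finite frequency grid on \([0,30]\); and \(\Lambda \) taken as the first grid point with \(|\tilde{\phi }(\lambda )|^2\le 3/n\). The function names are hypothetical.

```python
import numpy as np

def ecf_below_cutoff(data, lam_max=30.0, n_lam=3001):
    """Return the frequency grid on [0, Lambda] and |phi_tilde|^2 on it."""
    data = np.asarray(data, dtype=float)
    n = data.size
    lam = np.linspace(0.0, lam_max, n_lam)
    phase = lam[:, None] * data[None, :]
    # Squared modulus of the empirical characteristic function.
    phi2 = np.cos(phase).mean(axis=1) ** 2 + np.sin(phase).mean(axis=1) ** 2
    below = np.nonzero(phi2 <= 3.0 / n)[0]
    cut = below[0] if below.size else n_lam - 1   # fall back to the grid end
    return lam[: cut + 1], phi2[: cut + 1]

def stabilized_score(h, lam, phi2, n):
    """Second form of S_n(h), with w(l) = exp(-l^2/2) (Gaussian kernel)."""
    w = np.exp(-0.5 * (h * lam) ** 2)
    integrand = (phi2 - 1.0 / n) * (w**2 - 2.0 * w)
    # Trapezoidal rule on [0, Lambda].
    integral = np.sum((integrand[1:] + integrand[:-1]) * np.diff(lam)) / 2.0
    norm_k2 = 1.0 / (2.0 * np.sqrt(np.pi))        # ||K||_2^2 for the Gaussian
    return np.pi * norm_k2 / (n * h) + integral

# Illustrative data only.
rng = np.random.default_rng(2)
sample = rng.normal(size=200)
lam, phi2 = ecf_below_cutoff(sample)
Lambda = lam[-1]
h_grid = np.linspace(0.05, 2.0, 79)
st_scores = np.array([stabilized_score(h, lam, phi2, sample.size) for h in h_grid])
h_st = h_grid[np.argmin(st_scores)]
```

Truncating the integral at \(\Lambda \) removes the high-frequency part of \(|\tilde{\phi }(\lambda )|^2\), which is the part identified above as the main source of noise in the CV criterion.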

3.3 Modified cross-validation

Stute (1992) proposed a so-called modified CV (MCV). He approximated the problematic term with the aid of the Hájek projection. In fact, he showed that under some regularity assumptions given below, \(2E[ f_h(x) ]\) is the projection of

$$\begin{aligned} S + \frac{1}{h} E\left[ K \left( \frac{X_1 - X_2}{h} \right) \right]&= S + \frac{1}{h} \int \int K \left( \frac{x-y}{h}\right) \,f(x)\,f(y)\,{\mathrm{d}}x\,{\mathrm{d}}y\\&= S+ \int f^2(y){\mathrm{d}}y + \frac{1}{2}h^2\int t^2 K(t){\mathrm{d}}t \int f(y)f^{\prime \prime }(y){\mathrm{d}}y\\&\quad + O(h^3)\\ {\mathrm{for }}&S := \frac{1}{n(n-1)h}\sum _{i \ne j} K\left( \frac{X_i - X_j}{h}\right) \end{aligned}$$

yielding the criterion

$$\begin{aligned} \mathrel {\mathop {\mathrm{min}}\limits _{h}^{}} \ MCV(h) = \int \widehat{f}_h^2(x){\mathrm{d}}x - S - \frac{\mu _2(K)}{2n(n-1)h} \sum _{i \ne j} K^{\prime \prime } \left( \frac{X_i - X_j}{h}\right) . \end{aligned}$$

It can be shown that under assumptions (A1),

  • (A2\(^\prime \)) K is three times differentiable, with \( \quad \int t^4 |K(t)|\,{\mathrm{d}}t < \infty \), \( \quad \int t^4 |K^{\prime \prime }(t)|\,{\mathrm{d}}t < \infty \), \( \quad \int t^4 [K^{\prime }(t)]^2 \,{\mathrm{d}}t < \infty \), and \( \quad \int t^2 [K^{\prime \prime \prime }(t)]^2\,{\mathrm{d}}t < \infty \),

  • (A3\(^\prime \)) \(f\) four times continuously differentiable, the derivatives being bounded and integrable,

one gets the consistency result

$$\begin{aligned} \frac{{\mathrm{ISE}}(\widehat{h}_0)}{{\mathrm{ISE}}(\widehat{h}_{{\mathrm{MCV}}})} \overset{P}{\longrightarrow } 1, \quad {\mathrm{and}} \quad \frac{\widehat{h}_0}{\widehat{h}_{{\mathrm{MCV}}}} \overset{P}{\longrightarrow } 1 \quad {\mathrm{as}} \quad n \rightarrow \infty . \end{aligned}$$
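The MCV criterion stated above can be computed directly from the pairwise differences \(X_i-X_j\). The sketch below is an illustration only and assumes a Gaussian kernel, so that \(\mu _2(K)=1\), \(K^{\prime \prime }(u)=(u^2-1)K(u)\), and \(\int \widehat{f}_h^2\) is available in closed form; this kernel is smooth enough for (A2\(^\prime \)) but is not compactly supported as required by (A1). The name `mcv_score` is hypothetical.

```python
import numpy as np

def mcv_score(h, data):
    """MCV(h) = int f_hat^2 - S - mu_2(K) / (2 n (n-1) h) * sum_{i != j} K''(d_ij)."""
    data = np.asarray(data, dtype=float)
    n = data.size
    d = (data[:, None] - data[None, :]) / h
    off = ~np.eye(n, dtype=bool)                    # mask selecting i != j
    # int f_hat^2 via the N(0,2) density, i.e. the convolution K*K.
    int_f2 = (np.exp(-d**2 / 4.0) / np.sqrt(4.0 * np.pi)).sum() / (n**2 * h)
    phi = np.exp(-d**2 / 2.0) / np.sqrt(2.0 * np.pi)
    s = phi[off].sum() / (n * (n - 1) * h)          # the statistic S
    k2 = (d**2 - 1.0) * phi                         # K'' for the Gaussian kernel
    mu2 = 1.0                                       # mu_2(K) for the Gaussian kernel
    correction = mu2 * k2[off].sum() / (2.0 * n * (n - 1) * h)
    return int_f2 - s - correction

# Illustrative data only.
rng = np.random.default_rng(3)
sample = rng.normal(size=150)
h_grid = np.linspace(0.05, 3.0, 60)
mcv_scores = np.array([mcv_score(h, sample) for h in h_grid])
h_mcv = h_grid[np.argmin(mcv_scores)]
```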

3.4 One-sided cross-validation

Marron (1986) made the point that “the harder the estimation problem the better CV works”. Based on this idea, Martinez-Miranda et al. (2009) proposed to first apply CV to a harder estimation problem, and afterwards to calculate the corresponding bandwidth for the underlying ‘real’ estimation problem. To make the estimation problem harder, they used a worse estimator, still (1) but with a local linear version of a one-sided kernel,

$$\begin{aligned} L (u) = \frac{\mu _2(K) - u\left( 2\int _{-\infty }^0 tK(t)\, {\mathrm{d}}t\right) }{\mu _2(K) - \left( 2\int _{-\infty }^0 tK(t)\, {\mathrm{d}}t\right) ^2}2K(u)\mathbf{1}_{\{u < 0\}}. \end{aligned}$$

This modification goes back to Hart and Yi (1998) who did this for regression. One defines the one-sided versions of ISE and MISE with their minimizers \(\widehat{b}_0\) and \(b_0\), and the criterion becomes

$$\begin{aligned} \mathrel {\mathop {\mathrm{min}}\limits _{b}^{}} \ {\mathrm{OSCV}}(b) = \int \widehat{f}^2_{\mathrm{left},b}(x) \, {\mathrm{d}}x - \frac{2}{n}\sum _{i=1}^n \widehat{f}_{\mathrm{left},b}(X_i), \end{aligned}$$

where \(\widehat{f}_{\mathrm{left},b}\) is the one-sided (to the left) kernel density estimator. The corresponding bandwidth for the ‘real’ estimation problem is then given by

$$\begin{aligned} \widehat{h}_{{\mathrm{OSCV}}} := C \cdot \widehat{b}_{{\mathrm{OSCV}}} \quad {\mathrm{with }} \quad C = h_0 / b_0. \end{aligned}$$

Note that \(C\) is deterministic, depending only on kernel \(K\) because of

$$\begin{aligned} h_0 = \left( \frac{||K||_2^2}{(\mu _2(K))^2||f^{\prime \prime }||_2^2n}\right) ^{1/5} \quad , \quad b_0 = \left( \frac{||L||_2^2}{(\mu _2(L))^2||f^{\prime \prime }||_2^2n}\right) ^{1/5}. \end{aligned}$$

This gives, for example, \(C \approx 0.537\) for the Epanechnikov kernel. The theoretical justification for the stability of one-sided CV is based on the result of Hall and Marron (1987a), recall Eq. (4). That result allows one to calculate the variance reduction of OSCV compared to CV by \( \{ C \bar{\sigma }c / (\bar{c} \sigma ) \}^2 \) where \(\bar{c}\), \(\bar{\sigma }\) are just as \(c,\sigma \) but with kernel \(L\) instead of \(K\). Note that \(L\) can also be constructed as a one-sided kernel to the right.
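Because \(||f^{\prime \prime }||_2^2\) and \(n\) cancel in \(h_0/b_0\), the constant reduces to \(C = \{ ||K||_2^2\, \mu _2(L)^2 / (||L||_2^2\, \mu _2(K)^2) \}^{1/5}\), which can be checked numerically. The following sketch does this for the Epanechnikov kernel with a simple trapezoidal rule; the helper `trapezoid` and the grid sizes are choices made for this example only.

```python
import numpy as np

def trapezoid(y, x):
    """Plain trapezoidal rule (kept local to avoid NumPy version differences)."""
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2.0)

def epanechnikov(u):
    return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u**2), 0.0)

u_full = np.linspace(-1.0, 1.0, 200001)
u_left = np.linspace(-1.0, 0.0, 100001)

k_full = epanechnikov(u_full)
mu2_k = trapezoid(u_full**2 * k_full, u_full)      # mu_2(K) = 1/5
norm_k = trapezoid(k_full**2, u_full)              # ||K||_2^2 = 3/5

k_left = epanechnikov(u_left)
m1 = 2.0 * trapezoid(u_left * k_left, u_left)      # 2 * int_{-inf}^0 t K(t) dt
# One-sided local linear kernel L(u) on u < 0, as defined in the text.
l_left = (mu2_k - u_left * m1) / (mu2_k - m1**2) * 2.0 * k_left

mass_l = trapezoid(l_left, u_left)                 # should be close to one
mu2_l = trapezoid(u_left**2 * l_left, u_left)
norm_l = trapezoid(l_left**2, u_left)

# C = h_0 / b_0; the factors ||f''||_2^2 and n cancel.
C = ((norm_k / mu2_k**2) / (norm_l / mu2_l**2)) ** 0.2
```

Under these choices the computation gives \(C\approx 0.537\), in agreement with the value stated above.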

3.5 Indirect cross-validation

Based on the same idea, Savchuk et al. (2010) proposed to use

$$\begin{aligned} L(u) = (1+\alpha )\phi (u) - \alpha \varsigma ^{-1} \phi ( u \varsigma ^{-1}), \end{aligned}$$

where \(\phi \) is the Gaussian kernel, and \(\alpha >0\), \(\varsigma >0\) have to be chosen appropriately. They demonstrated the excellent theoretical properties of such an ‘indirect method’, and discussed the robustness of the indirect methods to data rounding (see above or Silverman 1986). For the two additional prior parameters \((\alpha ,\varsigma )\) they made several proposals derived from their asymptotic theory. Specifically, they first recommended for \(100\le n\le 500{,}000\) to take the values

$$\begin{aligned} \alpha&= 10^{3.39-1.093\log _{10}(n)+0.025\log _{10}^3(n)-0.00004\log _{10}^6(n)}, \\ \varsigma&= 10^{-0.58+0.386\log _{10}(n)-0.012\log _{10}^2(n)}. \end{aligned}$$

But based on asymptotic and practical considerations, the following rule is proposed:

$$\begin{aligned} (\alpha ,\varsigma ) = (2.42 ; {\mathrm{max}}(5.06, 0.149n^{3/8})), \end{aligned}$$

where the \(\mathrm{max}\) function always chooses \(5.06\) unless \(n>12{,}094\). For our implementation with Epanechnikov kernels, their method worked well only for pretty large samples whatever proposal for choosing \((\alpha ,\varsigma )\) we tried.
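The two parameter rules quoted above and the selection kernel \(L\) translate directly into code. In the sketch below the function names are hypothetical and \(\varsigma \) is written as `sigma`; the numerical constants are exactly those given in the two displays.

```python
import numpy as np

def icv_params_model(n):
    """First proposal for (alpha, sigma), stated for 100 <= n <= 500,000."""
    lg = np.log10(n)
    alpha = 10.0 ** (3.39 - 1.093 * lg + 0.025 * lg**3 - 0.00004 * lg**6)
    sigma = 10.0 ** (-0.58 + 0.386 * lg - 0.012 * lg**2)
    return alpha, sigma

def icv_params_rule(n):
    """Second proposal: (alpha, sigma) = (2.42, max(5.06, 0.149 n^{3/8}))."""
    return 2.42, max(5.06, 0.149 * n ** (3.0 / 8.0))

def icv_kernel(u, alpha, sigma):
    """L(u) = (1 + alpha) phi(u) - (alpha / sigma) phi(u / sigma)."""
    u = np.asarray(u, dtype=float)
    phi = lambda z: np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)
    return (1.0 + alpha) * phi(u) - alpha / sigma * phi(u / sigma)

# L integrates to one for any admissible (alpha, sigma); quick numerical check.
grid = np.linspace(-80.0, 80.0, 160001)
alpha, sigma = icv_params_rule(1000)
mass = icv_kernel(grid, alpha, sigma).sum() * (grid[1] - grid[0])
```

Note that \(L\) takes negative values in its tails for \(\alpha >0\) and \(\varsigma >1\); it serves only as a selection kernel in the CV step, after which the bandwidth is rescaled for the kernel actually used in estimation.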

3.6 Further cross-validation methods

Feluch and Koronacki (1992) proposed to cut out not only \(X_i\) when estimating \(f (X_i)\) but also to drop the \(m<n\) nearest neighbors, with \(m\rightarrow \infty \) such that \(m/n\rightarrow 0\). The idea is similar to the CV selection for time series data, cf. Hardle and Vieu (1992). Like Stute (1992), they called this version modified CV. Unfortunately, it turned out that the quality of this method crucially depended on the choice of \(m\). As we could not find any recommendation for its choice, this method cannot be classified as automatic or data driven, and will not be considered further.

Scott and Terrell (1987) introduced the B(iased)CV. As they worried about unreliable small-sample results, i.e. the high variability of CV, they directly focused on minimizing the asymptotic MISE. The unknown term \(||f^{\prime \prime }(x)||_2^2\) was estimated via jack-knife methods. Already in their own paper they admitted a poor performance for small samples and mixtures of densities, see also Chiu (1996). Jones et al. (1996b) underlined in their simulation study its deficient performance (‘quite good’ to ‘very poor’) even when referring to situations where it still seemed to be a reasonable selector.

The S(moothed)CV was developed by Hall et al. (1992). The general idea was a kind of presmoothing of the data before applying the CV criterion. This presmoothing results in smaller sample variability but enlarges the bias. Therefore, the resulting bandwidth often oversmooths and cuts off some important features of the underlying density. With this method, it is possible to achieve the optimal \(\sqrt{n}\) rate of convergence, but only when using a kernel of order \(\ge \! 6\). So it seems to be appropriate for huge samples only. Jones et al. (1996b) showed that without such a higher-order kernel, there exists an \(n^{-1/10}\) convergent version of SCV that is identical to Taylor’s bootstrap method (Taylor 1989), and is closely related to the bootstrap method of Cao (1993). These methods do not belong to the cross-validation methods, and hence, will be discussed later. Additionally, with a special choice of pilot bandwidth (necessary in all these methods), the SCV results in an \(n^{-5/14}\) convergent version that is similar to the so-called diagonal-in selector of Park and Marron (1990). In conclusion, we have not implemented the SCV because of its similarity to other methods and because we did not want to use higher-order kernels for samples with \(n<500\).

The P(artitioned)CV was suggested by Marron (1988b). He modified the CV criterion by splitting the sample of size \(n\) into \(m\) subsamples. The PCV bandwidth is calculated by minimizing the average of the CV scores over all subsamples. In a final step, the resulting bandwidth needs to be rescaled. The number of subsamples affects the trade-off between variance and bias. Hence, the choice of a pilot \(m\) is the crucial problem in this case, and as Park and Marron (1990) noticed: “this method ... is not quite fully objective”. It further requires a large sample size to get subsamples of reasonable sizes.

The pseudo-likelihood (also called the Kullback–Leibler) cross-validation, invented by Habbema et al. (1974) and by Duin (1976), aims to find the bandwidth maximizing a pseudo-likelihood criterion in which the observation \(X_i\) is left out. Since many authors have criticized this method as being inappropriate for density estimation, we also skipped it in our simulation study.

Wegkamp (1999) suggested a method closely related to the CV technique, providing a quasi-universal bandwidth selector for bounded densities. This was based on an optimality concept of Devroye and Lugosi (1996) but translated to the \(L_2\)-norm context. Among other problems in practice, the procedure requires sample splitting, which can be quite problematic for small and moderate sample sizes, see above. His paper stayed on a rather technical level without providing any algorithm or any guidance on how to do, for example, the sample splitting in practice.

4 Plug-in methods in density estimation

Under (A1)–(A3) the MISE can be written as

$$\begin{aligned} {\mathrm{MISE}} \left[ \hat{f}_h(x)\right] = \frac{h^4}{4} \mu _2^2 (K) ||f^{\prime \prime }(x)||_2^2 + \frac{1}{nh} ||K||_2^2 + o\left( \frac{1}{nh}\right) + o(h^4), \end{aligned}$$

for \(n\rightarrow \infty \), \(h\rightarrow 0\), such that the asymptotically optimal bandwidth is then equal to

$$\begin{aligned} ||K||_2^{2/5} \left( ||f^{\prime \prime }||_2^2 \left[ \mu _2(K)\right] ^2 n \right) ^{-1/5}, \end{aligned}$$
(5)

where \(||f^{\prime \prime }||_2^2\) is unknown and has to be estimated. A very popular method is the rule-of-thumb choice introduced by Silverman (1986). He used the normal density as a prior for approximating \(||f^{\prime \prime }||_2^2\). For the necessary estimation of the standard deviation of \(X\), he proposed a robust version making use of the interquartile range. If the true underlying density is unimodal, fairly symmetric and does not have fat tails, Silverman’s rule-of-thumb bandwidth \((h_S)\) works fairly well.
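For a Gaussian kernel and a normal reference density with standard deviation \(\sigma \), one has \(||K||_2^2 = 1/(2\sqrt{\pi })\), \(\mu _2(K)=1\) and \(||f^{\prime \prime }||_2^2 = 3/(8\sqrt{\pi }\sigma ^5)\), so that (5) becomes \((4/3)^{1/5}\sigma n^{-1/5}\approx 1.06\,\sigma n^{-1/5}\). The sketch below implements the robust version \(h_S=1.06 \min \{1.34^{-1}\mathrm{IR}, s_n \} n^{-1/5}\) quoted in Sect. 4.3; the function name and the use of `np.percentile` for the quartiles are choices made for this example.

```python
import numpy as np

def silverman_bandwidth(data):
    """Rule of thumb h_S = 1.06 * min(IR / 1.34, s_n) * n^(-1/5), Gaussian kernel."""
    data = np.asarray(data, dtype=float)
    n = data.size
    s_n = data.std(ddof=1)                         # sample standard deviation
    q75, q25 = np.percentile(data, [75.0, 25.0])
    iqr = q75 - q25                                # interquartile range IR
    return 1.06 * min(iqr / 1.34, s_n) * n ** (-0.2)

# The constant 1.06 is (4/3)^(1/5) rounded to two decimals.
normal_reference_constant = (4.0 / 3.0) ** 0.2

# Illustrative data only.
rng = np.random.default_rng(4)
sample = rng.normal(size=100)
h_s = silverman_bandwidth(sample)
```

The rule is scale equivariant: multiplying the data by a constant multiplies \(h_S\) by the same constant.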

4.1 Park and Marron’s plug-in

Natural refinements consist of using nonparametric estimates for \(||f^{\prime \prime }||_2^2\). Let us consider

$$\begin{aligned} \widehat{f}_g^{\prime \prime }(x) = \frac{1}{ng^3} \sum _{i=1}^{n} K^{\prime \prime }\left( \frac{x-X_i}{g}\right) , \end{aligned}$$

where \(g\) is a prior bandwidth.

Hall and Marron (1987b) proposed several estimators for \(||f^{\prime \prime }||_2^2\), all containing double sums over the sample. They pointed out that the diagonal elements give a non-stochastic term which does not depend on the sample but increases the bias. They, therefore, proposed the bias-corrected estimator

$$\begin{aligned} \widehat{||f^{\prime \prime }||_2^2} = ||\widehat{f}_g^{\prime \prime }||_2^2 - \frac{1}{ng^5}||K^{\prime \prime }||_2^2, \end{aligned}$$
(6)

which is used in (5) to obtain

$$\begin{aligned} \hat{h} = \left( \frac{||K||_2^2}{\widehat{||f^{\prime \prime }||_2^2} \mu _2^2(K) n} \right) ^{1/5}. \end{aligned}$$
(7)
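The estimator (6) and the resulting bandwidth (7) can be sketched as follows for a given pilot bandwidth \(g\). The sketch assumes a Gaussian kernel, for which \(K^{\prime \prime }*K^{\prime \prime }\) is the fourth derivative of the \(N(0,2)\) density, \(||K^{\prime \prime }||_2^2=3/(8\sqrt{\pi })\), \(||K||_2^2=1/(2\sqrt{\pi })\) and \(\mu _2(K)=1\). The function names and the pilot value \(g=0.4\) are hypothetical; how to choose \(g\) is discussed next.

```python
import numpy as np

SQRT_PI = np.sqrt(np.pi)

def f2_norm_estimate(data, g):
    """Bias-corrected estimate (6) of ||f''||_2^2 for a Gaussian kernel and pilot g."""
    data = np.asarray(data, dtype=float)
    n = data.size
    d = (data[:, None] - data[None, :]) / g
    # (K'' * K'')(u): fourth derivative of the N(0,2) density.
    conv = (d**4 / 16.0 - 0.75 * d**2 + 0.75) * np.exp(-d**2 / 4.0) / np.sqrt(4.0 * np.pi)
    full = conv.sum() / (n**2 * g**5)               # ||f_hat_g''||_2^2, diagonal included
    diagonal = 3.0 / (8.0 * SQRT_PI) / (n * g**5)   # ||K''||_2^2 / (n g^5)
    return full - diagonal

def plug_in_bandwidth(data, g):
    """Bandwidth (7) for a Gaussian kernel, given the pilot bandwidth g."""
    n = np.asarray(data).size
    est = f2_norm_estimate(data, g)
    if est <= 0.0:
        # The diagonal-out estimator can be negative in small samples.
        raise ValueError("estimate of ||f''||_2^2 is not positive; try another g")
    norm_k2 = 1.0 / (2.0 * SQRT_PI)                 # ||K||_2^2; mu_2(K) = 1
    return (norm_k2 / (est * n)) ** 0.2

# Illustrative data and pilot bandwidth only.
rng = np.random.default_rng(5)
sample = rng.normal(size=1000)
est = f2_norm_estimate(sample, g=0.4)
h_plug = plug_in_bandwidth(sample, g=0.4)
```

For a standard normal density the target is \(||f^{\prime \prime }||_2^2=3/(8\sqrt{\pi })\approx 0.212\); the estimate lies below this value for any fixed \(g>0\) because of the smoothing bias.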

The question which arises is how to obtain a proper prior bandwidth \(g\). In Park and Marron (1990), \(g\) was the minimizer for the asymptotic mean squared error of \(\widehat{||f^{\prime \prime }||_2^2}\). With (7), one gets a prior bandwidth \(g\) in terms of \(\hat{h}\) (using the notation in the original paper)

$$\begin{aligned} g = C_3(K) C_4(f) \hat{h}^{10/13}, \end{aligned}$$
(8)

where \(C_3(K)\) contains the fourth derivative and convolutions of \(K\), and \(C_4(f)\) contains the second and third derivatives of \(f\). The optimal \((g,h_{PM})\) can be obtained by numerical solution of the Eqs. (7) and (8). The relative rate of convergence to \(h_0\) is of order \(O_p (n^{-4/13})\), which is suboptimal compared to the optimal \(\sqrt{n}\)-rate, cf. Hall and Marron (1991).

4.2 Sheather and Jones’ plug-in

An often cited method is the so-called Sheather and Jones (1991) bandwidth, see also Jones and Sheather (1991). They used the same idea as Park and Marron (1990) but replaced the ‘diagonal-out’ estimator of \(||f^{\prime \prime }||_2^2\) by their ‘diagonal-in’ version to avoid the problem that the estimator \(\widehat{||f^{\prime \prime }||_2^2}\) (see (6)) may give negative results. They stated that the non-stochastic term in (6) was subtracted because of its positive effect on the bias in estimating \(||f^{\prime \prime }||_2^2\). The idea was to choose the prior bandwidth \(g\) such that the negative bias due to the smoothing compensates the impact of the diagonal-in terms. As a result they estimated \(||f^{\prime \prime }||_2^2\) by \(||\widehat{f}_g^{\prime \prime }||_2^2\) which is always positive, and obtained

$$\begin{aligned} g = C(K,L)\left( \frac{||f^{\prime \prime }||_2^2}{||f^{\prime \prime \prime }||_2^2} \right) ^{1/7} h^{5/7}, \end{aligned}$$

where \(C(K,L)\) depends on \(L\), the kernel used to estimate \(||f^{\prime \prime }||_2^2\). As usual, \(K\) indicates the kernel of the original estimation. Then, \(||f^{\prime \prime }||_2^2\) and \(||f^{\prime \prime \prime }||_2^2\) were estimated using \(||\widehat{f}_a^{\prime \prime }||_2^2\) and \(||\widehat{f}_b^{\prime \prime \prime }||_2^2\), where \(a\) and \(b\) were set equal to the rule-of-thumb bandwidths of Silverman. Sheather and Jones (1991) showed that their optimal bandwidth had a convergence rate of \(n^{-5/14}\), which is slightly better than that of Park and Marron (1990). Using real data, Jones et al. (1996b) found that \(\widehat{h}_{\mathrm{SJ}}\) was pretty close to Park and Marron’s bandwidth in practice. Hence, since \(\widehat{h}_{\mathrm{SJ}}\) does not beat that bandwidth in practical performance, has only a slightly better convergence rate, and is computationally much more expensive, we suppressed \(\widehat{h}_{\mathrm{SJ}}\) in favor of the (simplified) Park and Marron (1990) bandwidth.

4.3 Implemented refined plug-in

For small samples and small (optimal) bandwidths, the above estimator \(\widehat{||f^{\prime \prime }||_2^2}\) can easily fail in practice. Also, finding a numerical solution for \((g,h_{\mathrm{PM}})\) may become quite difficult in practice; the final result depends on stopping rules and there might exist multiple local maxima for the finite-sample two-dimensional problem. To avoid these inconveniences, and to offer a quick and easy solution, we propose to first take Silverman’s rule-of-thumb bandwidth for Gaussian kernels, i.e. \(h_S=1.06 \min \{1.34^{-1}\mathrm{IR}, s_n \} n^{-1/5}\) with interquartile range (IR) of \(X\), and \(s_n\) the sample standard deviation. Then adjust \(h_S\) for Quartic kernels along the idea of canonical kernels and equivalence bandwidths, see Hardle et al. (2004). The Quartic kernel is pretty similar to the Epanechnikov kernel but allows for the estimation of second derivatives. Then, adjusting for the slower optimal rate for second derivative estimation gives as a prior for (6)

$$\begin{aligned} g = h_S \frac{2.0362}{0.7764} n^{1/5-1/9}. \end{aligned}$$

This bandwidth leads to very reasonable estimates of the second derivative of \(f\), and hence of \({||f^{\prime \prime }||_2^2}\). A further advantage is that this prior \(g\) is rather easily obtained. For the rest we follow Park and Marron (1990) and call the resulting bandwidth \(\widehat{h}_{PM}\) because this is a simplified version of their ideas.
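The steps of this refined plug-in can be sketched in a few lines. The following is a minimal illustration, not the implementation used for the simulations: the function names, the evaluation grid, the Riemann-sum integration and the `nan` fallback for a non-positive roughness estimate are assumptions made here. The Quartic kernel constants \(||K||_2^2=5/7\), \(\mu _2(K)=1/7\) and \(||K^{\prime \prime }||_2^2=22.5\) follow from \(K(u)=\frac{15}{16}(1-u^2)^2\) on \([-1,1]\).

```python
import numpy as np

# Quartic (biweight) kernel constants on [-1, 1]
R_K = 5.0 / 7.0        # ||K||_2^2
MU2_K = 1.0 / 7.0      # second moment mu_2(K)
R_K2 = 22.5            # ||K''||_2^2, with K''(u) = (15/4)(3u^2 - 1)


def silverman_gaussian(x):
    """Silverman's rule of thumb h_S for a Gaussian kernel."""
    n = x.size
    q75, q25 = np.percentile(x, [75, 25])
    spread = min((q75 - q25) / 1.34, np.std(x, ddof=1))
    return 1.06 * spread * n ** (-1 / 5)


def prior_bandwidth(x):
    """Prior g: h_S rescaled to the Quartic kernel and to the n^(-1/9) rate."""
    n = x.size
    return silverman_gaussian(x) * (2.0362 / 0.7764) * n ** (1 / 5 - 1 / 9)


def roughness_f2(x, g, n_grid=2001):
    """Estimate ||f''||_2^2 as in (6): ||f_g''||_2^2 minus the diagonal term."""
    n = x.size
    grid = np.linspace(x.min() - g, x.max() + g, n_grid)
    dx = grid[1] - grid[0]
    u = (grid[:, None] - x[None, :]) / g
    k2 = np.where(np.abs(u) <= 1, (15 / 4) * (3 * u ** 2 - 1), 0.0)
    f2 = k2.sum(axis=1) / (n * g ** 3)           # second derivative of f_g
    return np.sum(f2 ** 2) * dx - R_K2 / (n * g ** 5)


def refined_plugin(x):
    """Bandwidth from (7); returns nan if the roughness estimate is not positive."""
    x = np.asarray(x, dtype=float)
    g = prior_bandwidth(x)
    rough = roughness_f2(x, g)
    if rough <= 0:
        return float("nan")
    return (R_K / (rough * MU2_K ** 2 * x.size)) ** (1 / 5)
```

The returned bandwidth refers to the Quartic kernel; for another kernel it would have to be rescaled via canonical kernels in the same way as \(h_S\) above.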

4.4 Bootstrap methods

The principle of the bootstrap-based selection methods is to select the bandwidth along bootstrap estimates of the ISE or the MISE. For a general description of this resampling idea in nonparametric problems, see Hall (1990). Imagine that we have a Parzen-Rosenblatt estimate \(\widehat{f}_g\) for a given pilot bandwidth \(g\). From \(\widehat{f}_g\) we can now draw some bootstrap samples \((X_1^*, X_2^*, \dots , X_n^*)\). Defining the bootstrap kernel density

$$\begin{aligned} \widehat{f}_h^*(x) = \frac{1}{nh} \sum _{i=1}^{n}K\left( \frac{x-X_i^*}{h}\right) , \end{aligned}$$

the (mean) integrated squared error to be minimized could be approximated by

$$\begin{aligned} \nonumber {\mathrm{ISE}}^*(h)&:= \int \left( \widehat{f}_h^*(x)- \widehat{f}_g(x)\right) ^2 \,{\mathrm{d}}x, \\ {\mathrm{MISE}}^*(h)&:= E_* \left[ \int \left( \widehat{f}_h^*(x)- \widehat{f}_g(x)\right) ^2 \,{\mathrm{d}}x\right] . \end{aligned}$$
(9)
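To make the resampling idea concrete, the following sketch approximates \(\mathrm{MISE}^*(h)\) in (9) by brute force: it draws smoothed bootstrap samples from \(\widehat{f}_g\) and averages \(\mathrm{ISE}^*(h)\). It assumes a Gaussian kernel for both \(\widehat{f}_g\) and \(\widehat{f}_h^*\); the function names, the number of bootstrap replications and the integration grid are illustrative choices, not taken from the cited papers.

```python
import numpy as np


def gauss_kde(grid, data, bw):
    """Parzen-Rosenblatt estimate (1) with a Gaussian kernel, evaluated on grid."""
    u = (grid[:, None] - data[None, :]) / bw
    return np.exp(-0.5 * u ** 2).sum(axis=1) / (data.size * bw * np.sqrt(2 * np.pi))


def smoothed_bootstrap_sample(x, g, rng):
    """Draw n points from f_g: resample the data, then add kernel noise scaled by g."""
    idx = rng.integers(0, x.size, size=x.size)
    return x[idx] + g * rng.standard_normal(x.size)


def mise_star_mc(x, h, g, n_boot=200, n_grid=512, seed=0):
    """Monte Carlo approximation of MISE*(h) in (9) by averaging ISE*(h)."""
    rng = np.random.default_rng(seed)
    pad = 4 * max(h, g)
    grid = np.linspace(x.min() - pad, x.max() + pad, n_grid)
    dx = grid[1] - grid[0]
    f_g = gauss_kde(grid, x, g)
    ise = np.empty(n_boot)
    for b in range(n_boot):
        xs = smoothed_bootstrap_sample(x, g, rng)
        ise[b] = np.sum((gauss_kde(grid, xs, h) - f_g) ** 2) * dx
    return ise.mean()
```

Such a simulation is mainly useful as a numerical check, since the bootstrap expectation can be evaluated without resampling.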

It can be shown that the expectation \(E_*\) and \(\mathrm{MISE}^*\) depend only on the original sample, and not on the bootstrap samples. Consequently, there is no need to do resampling to obtain the \(\mathrm{MISE}^*\). Using Fubini’s theorem and decomposing the \(\mathrm{MISE}^*=V^*+SB^*\) into the integrated variance

$$\begin{aligned} V^*(h) = \frac{1}{nh}\cdot ||K||_2^2 - \frac{1}{n} \cdot \int \left( \int K(u) \cdot \widehat{f}_g(x-hu)\, {\mathrm{d}}u \right) ^2 \, {\mathrm{d}}x \end{aligned}$$
(10)

and the integrated squared bias

$$\begin{aligned} SB^*(h) = \int \left( \int K(u) \cdot (\widehat{f}_g(x-hu) - \widehat{f}_g(x))\,{\mathrm{d}}u \right) ^2 \, {\mathrm{d}}x \end{aligned}$$
(11)

one obtains

$$\begin{aligned} V^*(h) = \frac{1}{nh} ||K||_2^2 - \frac{1}{n^3} \sum _{i=1}^{n} \sum _{j=1}^{n} \left[ (K_h \star K_g)\star (K_h \star K_g)\right] (X_i - X_j), \end{aligned}$$
(12)

where \(\star \) denotes convolution, and

$$\begin{aligned} SB^*(h) = \frac{1}{n^2} \sum _{i=1}^{n} \sum _{j=1}^{n} \left[ (K_h \star K_g - K_g) \star (K_h \star K_g - K_g)\right] (X_i - X_j). \end{aligned}$$
(13)

In practice, it is hard to get explicit formulae for these integrals when the kernel has a bounded support. However, using the Gaussian kernel in (12) and (13), we can directly calculate the optimal bandwidth as the minimizer of

$$\begin{aligned} {\mathrm{MISE}}^*(h)&= \frac{1}{2nh\sqrt{\pi }} + \dfrac{1}{\sqrt{2\pi }} \left[ \frac{\sum \nolimits _{i,j} \left( {\mathrm{exp}}\left( -\frac{1}{2}\left( \frac{X_i-X_j}{g \sqrt{2}}\right) ^2\right) \right) }{\sqrt{2 g^2}\cdot n^2}\right. \\&\quad \left. -\frac{2\cdot \sum \nolimits _{i,j} \left( {\mathrm{exp}}\left( -\frac{1}{2}\left( \frac{X_i-X_j}{\sqrt{h^2 +2g^2}}\right) ^2\right) \right) }{\sqrt{h^2+2g^2}\cdot n^2}\right. \nonumber \\&\quad \left. + \frac{(n-1) \sum \nolimits _{i,j} \left( {\mathrm{exp}}\left( -\frac{1}{2} \left( \frac{X_i-X_j}{\sqrt{2(h^2+g^2)}}\right) ^2\right) \right) }{\sqrt{2(h^2+g^2)}\cdot n^3} \right] . \nonumber \end{aligned}$$
(14)

The equivalent bandwidth for any other kernel can be obtained as described in Marron and Nolan (1988) or Hardle et al. (2004).
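For Gaussian kernels the convolutions in the variance and squared bias are again Gaussian densities, so \(\mathrm{MISE}^*(h)\) can be evaluated from pairwise differences alone. The sketch below builds \(V^*\) and \(SB^*\) separately and adds them; the variance uses the usual decomposition \(\mathrm{Var}=E[K_h^2]-(E K_h)^2\), so the squared-mean term enters with a negative sign. The function names and the grid search are assumptions for illustration, and the pilot bandwidth \(g\) is simply passed in.

```python
import numpy as np


def _phi_sum(d2, s2):
    """Sum over all pairs (i, j) of the N(0, s2) density evaluated at X_i - X_j."""
    return np.exp(-0.5 * d2 / s2).sum() / np.sqrt(2 * np.pi * s2)


def mise_star_gauss(h, x, g):
    """MISE*(h) = V*(h) + SB*(h) for Gaussian kernels, from the convolution forms."""
    n = x.size
    d2 = (x[:, None] - x[None, :]) ** 2           # squared pairwise differences
    a = _phi_sum(d2, 2 * g ** 2)                  # K_g * K_g
    b = _phi_sum(d2, h ** 2 + 2 * g ** 2)         # (K_h * K_g) * K_g
    c = _phi_sum(d2, 2 * (h ** 2 + g ** 2))       # (K_h * K_g) * (K_h * K_g)
    variance = 1 / (2 * n * h * np.sqrt(np.pi)) - c / n ** 3
    sq_bias = (c - 2 * b + a) / n ** 2
    return variance + sq_bias


def select_bandwidth(x, g, grid):
    """Grid minimiser of MISE*(h); returns the bandwidth and all criterion values."""
    values = np.array([mise_star_gauss(h, x, g) for h in grid])
    return grid[np.argmin(values)], values
```

A call such as `select_bandwidth(x, g, np.geomspace(0.02, 3.0, 60))` returns a Gaussian-kernel bandwidth, which can then be converted to another kernel as described above.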

The bootstrap approach in kernel density estimation was first presented by Taylor (1989). However, many modified versions were published later on, see for example Faraway and Jhun (1990), Hall (1990) or Cao (1993). The crucial differences between these versions are how they choose the pilot bandwidth \(g\), and how they generate the bootstrap samples.

Taylor (1989) suggested taking \(g=h\) and used a Gaussian kernel. Several authors pointed out that this procedure had no finite minimum and hence would choose a local minimum or the upper limit of the bandwidth grid as its optimum. Marron (1992) showed that this led to an inappropriate choice and a serious positive bias. Differing from this approach, Faraway and Jhun (1990) proposed a least-squares cross-validation estimate to find \(g\). Hall (1990) recommended using the empirical distribution to draw bootstrap samples of size \(m < n\), proposed \(m \simeq n^{1/2}\), \(h=g(m/n)^{1/5}\), and minimized \(\mathrm{MISE}^*\) with respect to \(g\). Cao et al. (1994) demonstrated that the bootstrap version of Hall was quite unstable and showed a bad performance, especially for mixtures of normal distributions. They also found that the methods of Faraway and Jhun (1990) as well as the one of Hall (1990) were outperformed by the method of Cao (1993) which we introduce below.

A bias corrected bootstrap estimate was developed by Grund and Polzehl (1997). They obtained a root-\(n\) convergent bandwidth estimate which attained very good results for larger sample sizes, but less so for moderate and small samples. Moreover, to derive their asymptotic theory they had to use extraordinarily strong assumptions compared to the other methods. In their simulation study, Grund and Polzehl showed that the performance heavily depended on the choice of \(g\). They stated that using their oversmoothing bandwidth (that guaranteed root-\(n\) convergence) seemed to be far from optimal for smaller sample sizes. In contrast, using \(g=h\) would achieve better performance in practical applications. However, setting \(g=h\) results in a convergence rate of order \(n^{-1/10}\). Summing up, they remarked that faster rates of convergence do not result in better practical performance.

In the smoothed bootstrap version of Cao (1993), the pilot bandwidth \(g\) is estimated by asymptotic expressions of the minimizer of the dominant part of the mean squared error. For further details see Cao (1993). He noticed that the terms with \(i=j\) in (13) would inflate the bias artificially. He, therefore, proposed to use a modified bootstrap integrated squared bias, namely

$$\begin{aligned} {\mathrm{MB}}^*(h) = \frac{1}{n^2} \sum _{i \not = j} \left[ (K_h \star K_g - K_g) \star (K_h \star K_g - K_g)\right] (X_i - X_j). \end{aligned}$$

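Concerning the convergence rate, he showed for the bandwidth \(h_0^*\) minimizing the original \(\mathrm{MISE}^*\), and for the bandwidth \(h^*_{0_M}\) based on the modified \(\mathrm{MB}^*\), that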

$$\begin{aligned} \frac{{\mathrm{MISE}}(h^*_0) - {\mathrm{MISE}}(h_0)}{{\mathrm{MISE}}(h_0)}&= O_P(n^{-5/7})\\ \frac{{\mathrm{MISE}}(h^*_{0_M}) - {\mathrm{MISE}}(h_0)}{{\mathrm{MISE}}(h_0)}&= O_P(n^{-8/13}). \end{aligned}$$

Note that the convergence rate for the original bootstrap version was slightly faster.

Recently, Chacon et al. (2008) published a bootstrap version quite similar to Cao (1993). They showed that the asymptotic expressions of his bandwidth estimates might be inadequate and defined an expression \(g(h)\) for fixed \(h\). They proposed estimators for \(g\), and allowed for different kernels \(L\) and \(K\) for the bandwidths \(g\) and \(h\). They stated that their approach was a good compromise between classical cross-validation and plug-in. However, its performance depended seriously on the reference density. Exploring the asymptotics, they achieved root-\(n\) convergence only under the use of higher-order kernels.

In sum, in our simulation study, we concentrate on just one representative of the class of bootstrap estimates, going back to Cao (1993). He proved that the pilot bandwidth \(g\) as the minimizer of (9) coincides with the minimizer of the dominant part of the mean squared error. Specifically, it is given by

$$\begin{aligned} g = \left( \frac{||K||_2^2}{\widehat{||f^{\prime \prime \prime }||_2^2} \mu _2^2(K) n} \right) ^{1/7}. \end{aligned}$$

This formula is used for the pilot bandwidth \(g\) when calculating (14). In our simulations, we additionally ran the bootstrap for the Epanechnikov kernel calculating formulae (10) and (11) numerically. As this was much slower and gave uniformly worse results, we will neglect that approach for the rest of the paper.
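As a worked instance of the pilot formula, the sketch below evaluates \(g\) for a Gaussian kernel when \(||f^{\prime \prime \prime }||_2^2\) is replaced by its value under a normal reference density \(\mathcal N (\mu ,\sigma ^2)\), namely \(15/(16\sqrt{\pi }\sigma ^7)\). Using the normal reference here is an assumption for illustration, as is the function name; in an application \(\sigma \) would be replaced by a scale estimate.

```python
import math


def pilot_bandwidth_normal_reference(n, sigma):
    """Pilot g for a Gaussian kernel with ||f'''||_2^2 taken from N(mu, sigma^2)."""
    r_k = 1 / (2 * math.sqrt(math.pi))                     # ||K||_2^2
    mu2_k = 1.0                                            # mu_2(K)
    rough3 = 15 / (16 * math.sqrt(math.pi) * sigma ** 7)   # ||f'''||_2^2
    return (r_k / (rough3 * mu2_k ** 2 * n)) ** (1 / 7)
```

Under these assumptions the expression simplifies to \(g=\sigma \,(8/(15n))^{1/7}\), so the pilot bandwidth shrinks only at the slow rate \(n^{-1/7}\).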

4.5 Further plug-in methods

Many other plug-in methods have been developed. Some of them exhibited better asymptotic properties and others a better performance in some particular small sample simulations. However, most of them have not become (widely) accepted (n)or known.

Hall et al. (1991) introduced a plug-in method giving back a bandwidth \(\widehat{h_\mathrm{HSJM}}\) which achieved the optimal \(\sqrt{n}\)-rate of convergence. A problem with their bandwidth \(\widehat{h_{\mathrm{HSJM}}}\) was that they used higher-order kernels to ensure the \(\sqrt{n}\) convergence (actually a kernel of order 6 or higher). It is well known (see Marron and Wand 1992) that despite their theoretical advantages, higher-order kernels have a surprisingly bad performance in practice, at least for moderate sample sizes. Furthermore, in the simulation study of Park and Turlach (1992) \(\widehat{h_{\mathrm{HSJM}}}\) was generally bad for bi- and trimodal densities.

Jones et al. (1991) developed a plug-in method based on the SCV idea, see above. They used the prior bandwidth \(g = C(f) n^p h^m\), where the normal density was used as a reference distribution to calculate the unknown constant \(C(f)\). The advantage of their estimator was the \(\sqrt{n}\) convergence rate if \(m = -2\) and \(p = -\frac{23}{45}\) even for kernels of order \(2\). In their simulation studies, Turlach (1994) and Chiu (1996) found a small variance compared to CV, but an unacceptably large bias.

Also Kim et al. (1994) showed the existence of a \(\sqrt{n}\) convergent method without the use of higher-order kernels. The main idea of obtaining asymptotically best bandwidth selectors was based on an exact MISE expansion. But the results of their paper were primarily provided for theoretical completeness; the practical performance in simulation studies was rather disappointing, which was already explicitly mentioned in their own paper and also confirmed later in a simulation study performed by Jones et al. (1996b).

For the sake of completeness, we also refer to the ‘Double Kernel method’ of Devroye (1989) and Jones (1998). This method has the advantage of being quite universal. Under some particular assumptions, it coincides with Taylor’s bootstrap selector, respectively the BCV method, see above. As already mentioned, these two methods had several disadvantages, and also the Double Kernel method required the use of higher-order kernels. In Jones (1998), the performance of the Double Kernel method was assessed by comparing asymptotic convergence rates, but it did not exhibit the expected improvement in the estimation of \(h_0\) (MISE optimal bandwidth) compared for example to SCV.

5 Mixing methods in density estimation

Recall that all authors have criticized that the cross-validation criterion tends to undersmooth and suffers from high sample variability. At the same time, the plug-in estimates deliver a much more stable estimate but typically oversmooth. These findings suggest combining different bandwidths or density estimators.

5.1 Mixing the bandwidths: Do-validation

Recently, Mammen et al. (2011) took the idea of indirect cross-validation of which OSCV is a particular case, and extended it to the idea of mixing bandwidth selectors. For these mixtures they calculated the asymptotic properties and derived numerically optimal weights for particular cases. They considered

$$\begin{aligned} \widehat{h} = \sum _{j=1}^J w_j \left( \frac{R(K)}{\mu _2^2(K)}\frac{\mu _2^2(L_j)}{R(L_j)} \right) ^{1/5} \widehat{h}_j \end{aligned}$$
(15)

for some weights \( w_j\) (not necessarily positive) with \(\sum _{j=1}^J w_j=1\), where the \(\widehat{h}_{j}\) were bandwidth estimates based on selector methods with selection kernel \(L_j\), see above. After multiplying with the factor \(\left( R(K) \mu _2^2(L_j)\right) ^{1/5} \left( \mu _2^2(K) R(L_j)\right) ^{-1/5}\) one gets a selector for a density estimator with kernel \(K\). They further looked at

$$\begin{aligned} \widehat{h}^* = \sum _{j=2}^{J} w_j \left( \frac{R(K)}{\mu _2^2(K)}\frac{\mu _2^2(L_j)}{R(L_j)} \right) ^{1/5} \widehat{h}_j + w_1 \widehat{h}_{\mathrm{MISE}}, \end{aligned}$$
(16)

with an asymptotically MISE-optimal bandwidth \(\widehat{h}_{\mathrm{MISE}}\). For all these selectors they showed that

$$\begin{aligned} n^{3/10} (\widehat{h} - \hat{h}_{0})&\rightarrow N(0,\sigma _1^2)&\quad {\mathrm{in\, distribution}}, \\ {\mathrm{and }}\quad n^{3/10} (\widehat{h} - h_{0})&\rightarrow N(0,\sigma _2^2)&\quad {\mathrm{in\, distribution}}. \end{aligned}$$

Explicit expressions for \(\sigma _1\) and \(\sigma _2\) were given in that paper for all kinds of (mixtures of) bandwidth selectors. For \(J=2\) and \(L_2\) being the left-sided version of \(K\), they found that the asymptotically optimal weights were \(w_2=1-w_1=-0.21\) in (16), and \(w_2=1-w_1=0.5\) in (15) with \(L_1\) being the right-sided version of \(K\). They recommended mixing left-sided CV with right-sided CV, calling it Do-validation. Finally, they compared the asymptotics and finite sample behavior of their proposals with three standard methods.
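The weighted combination in (15) is simple to compute once the kernel constants are known. The following sketch takes \(R(\cdot )\) and \(\mu _2(\cdot )\) of the estimation kernel \(K\) and of each selection kernel \(L_j\) as inputs; the function names and argument layout are hypothetical, and the constants of the one-sided kernels are not supplied here.

```python
def rescale_to_kernel(h_l, r_k, mu2_k, r_l, mu2_l):
    """Convert a bandwidth selected with kernel L into one for kernel K, as in (15)."""
    factor = ((r_k / mu2_k ** 2) * (mu2_l ** 2 / r_l)) ** (1 / 5)
    return factor * h_l


def mixed_bandwidth(h_hats, weights, consts_l, r_k, mu2_k, tol=1e-12):
    """Weighted sum (15); weights may be negative but must sum to one.

    consts_l holds one pair (R(L_j), mu_2(L_j)) per selector.
    """
    if abs(sum(weights) - 1.0) > tol:
        raise ValueError("weights must sum to one")
    return sum(
        w * rescale_to_kernel(h, r_k, mu2_k, r_l, mu2_l)
        for w, h, (r_l, mu2_l) in zip(weights, h_hats, consts_l)
    )
```

As a plausibility check, rescaling from a Gaussian selection kernel to the Epanechnikov kernel (\(R=3/5\), \(\mu _2=1/5\)) gives a factor of about 2.21, which matches the ratio of the canonical bandwidths \(1.7188/0.7764\).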

5.2 Mixing the estimators: the contrast method

Ahmad and Ran (2004) proposed a kernel contrast method for choosing bandwidths either minimizing the ISE or alternatively the MISE. Their idea is as follows. Choose \(J\) different kernels \(K_j\) with arbitrary contrast coefficients \(a_j\) and positive weights \(b_j\) such that \(\sum _j a_j = 0\), \(\sum _j b_j =1\). Then construct the contrast \(\sum _j a_j \widehat{f}_h(x;K_j)\) with \(\widehat{f}_h\) as in (1) but with different kernels \(K_j\). Find \(\hat{h}\) that minimizes the ISE (or the MISE, respectively) of the contrast

$$\begin{aligned} \int \left( \sum _j a_j \widehat{f}_h (x;K_j) \right) ^2 {\mathrm{d}}x. \end{aligned}$$

Take as a final density estimator

$$\begin{aligned} \widetilde{f} (x)&= \sum _j b_j \widehat{f}_{\lambda \hat{h}} (x;K_j),\quad {\mathrm{where }}\, \lambda ^5 = \frac{\left( \sum _j a_j\sigma _j^2\right) ^2 R(\varphi ) }{\left( \sum _j b_j\sigma _j^2\right) ^2 R(\psi ) }\\ \varphi (u)&= \sum _j a_j K_j(u),\quad \psi (u) = \sum _j b_j K_j(u),\quad {\mathrm{and }}\quad \sigma _j^2 = \mu _2 (K_j). \end{aligned}$$

The evident problem is that one has to choose \(J\) and needs two series of coefficients which can have a serious impact on the performance of the method, especially for small and moderate sample sizes. We doubt that practitioners will opt for a method that even increases the number of prior parameters to be chosen (and this even by an arbitrary amount) instead of reducing it. Note that different choices lead to different outcomes. As we are not aware of any reasonable data driven method to choose them, we will not consider this bandwidth selector in the simulation study.

5.3 Further mixing methods

We are aware of the existence of other approaches which combine various density estimators by using a mixture of their smoothing parameters. In the literature, we found some papers that addressed the problem of linear and/or convex aggregation, e.g. Rigollet and Tsybakov (2007), Samarov and Tsybakov (2007) as well as Yang (2000). However, as the main focus of our review paper is not on the aggregation of different density estimators, we will not investigate this further in detail, but only study some mixtures of bandwidths which, admittedly, arise from intuition. More specifically, having in mind that CV undersmooths and PI oversmooths, and that bandwidths are scaling parameters which should be combined on a logarithmic (i.e. multiplicative) scale, we will consider \(( \widehat{h}_{{\mathrm{CV}}}^\alpha \widehat{h}_{\mathrm{PM}}^\beta )^{1/(\alpha +\beta )}\) with \(\alpha = 1,\ \beta =2\) (mix1), \(\alpha =2,\ \beta = 1 \) (mix2), and \(\alpha = \beta =1\) (mix3).
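The three mixtures are weighted geometric means of the two bandwidths, which a short sketch makes explicit. The function and dictionary names are illustrative only.

```python
def geometric_mix(h_cv, h_pm, alpha, beta):
    """Weighted geometric mean (h_cv^alpha * h_pm^beta)^(1 / (alpha + beta))."""
    return (h_cv ** alpha * h_pm ** beta) ** (1.0 / (alpha + beta))


# Exponents (alpha, beta) for the three mixtures considered in the text
MIXTURES = {"mix1": (1, 2), "mix2": (2, 1), "mix3": (1, 1)}


def all_mixtures(h_cv, h_pm):
    """Return the mix1, mix2 and mix3 bandwidths for one pair of selectors."""
    return {name: geometric_mix(h_cv, h_pm, a, b) for name, (a, b) in MIXTURES.items()}
```

For example, with \(\widehat{h}_{\mathrm{CV}}=0.1\) and \(\widehat{h}_{\mathrm{PM}}=0.4\), mix3 gives \(\sqrt{0.04}=0.2\), while mix2 lies closer to the CV bandwidth and mix1 closer to the plug-in bandwidth.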

6 Finite sample performance

The small sample performance of the different cross-validation, plug-in and bootstrap methods is compared. For obvious reasons, we limit the study to data adaptive methods without boundary correction. Although we tried many different designs we summarize here the results for six densities where the estimation results expose pretty well the main findings, in particular:

  1. Laplace distribution \(f(x)=4\exp (-|8(x-0.5)|)\)

  2. Simple Gamma distribution: \(\mathrm{Gamma}(a,b)\) with \(b=1.5\), \(a=b^2\) applied on \(5x\) with \(x\in \mathbf R \), i.e. \(f(x) = 5 \frac{b^a}{\Gamma (a)}(5x)^{a-1}{\mathrm{e}}^{-5xb}\)

  3. Mixture of three gamma, \(\mathrm{Gamma}(a_j,b_j)\), \(a_j = b_j^2\), \(b_1 = 1.5\), \(b_2 = 3\) and \(b_3 = 6\) applied on \(8x\) giving two bumps and one plateau

  4. Simple normal distribution, \(\mathcal N (0.5,0.2^2)\) with only one mode

  5. Mixture of \(\mathcal N (0.35,0.1^2)\) and \(\mathcal N (0.65,0.1^2)\) giving two modes

  6. A triple mode mixture of \(\mathcal N (0.25,0.075^2)\), \(\mathcal N (0.5,0.075^2)\), and \(\mathcal N (0.75,0.075^2)\)

As can be seen in Fig. 1, all densities have their main mass in \([0,1]\) with exponentially decreasing tails, so that we can neglect possible boundary effects. We also simulated estimators with boundary corrections and obtained results very close to those found in the study presented here.
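Drawing samples from most of these designs takes only a few lines. The sketch below covers designs 1, 2, 4, 5 and 6; the function name is hypothetical, and the equal mixture weights used for designs 5 and 6 are an assumption, since the weights are not stated in the list. Design 3 is left out because its weights and exact parametrization are not fully specified.

```python
import numpy as np


def sample_design(design, n, rng):
    """Draw n observations from one of the simulation designs (sketch)."""
    if design == 1:       # Laplace, f(x) = 4 exp(-8|x - 0.5|)
        return rng.laplace(loc=0.5, scale=1 / 8, size=n)
    if design == 2:       # Gamma(a, b) with rate b = 1.5, a = b^2, applied on 5x
        b = 1.5
        return rng.gamma(shape=b ** 2, scale=1 / b, size=n) / 5
    if design == 4:       # simple normal N(0.5, 0.2^2)
        return rng.normal(0.5, 0.2, size=n)
    if design in (5, 6):  # normal mixtures; equal weights are an assumption
        means = {5: [0.35, 0.65], 6: [0.25, 0.5, 0.75]}[design]
        sd = {5: 0.1, 6: 0.075}[design]
        centres = rng.choice(means, size=n)
        return rng.normal(centres, sd)
    raise ValueError("design not covered by this sketch")
```

A call such as `sample_design(6, 100, np.random.default_rng(0))` returns one simulated sample of size 100 from the triple mode mixture.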

Fig. 1 The data generating densities: designs 1–6 from the upper left to the lower right

We have compared the performance by several measures based on the integrated squared error (ISE) of the resulting density estimate (not the bandwidth estimate), and on the distance to the numerically ISE minimizing bandwidth, say \(h_{\mathrm{opt}}\approx \widehat{h_0}\) (of each simulation run, as it is sample-dependent). The considered performance measures were

  • \(c_1\): \(\mathrm{mean}(\hat{h} - h_{\mathrm{opt}})\), bias of the estimated bandwidth

  • \(c_2\): \(\mathrm{mean}\left[ \mathrm{ISE}(\hat{h})\right] \), the average (or expected) ISE

  • \(c_3\): \(\mathrm{std}\left[ \mathrm{ISE}(\hat{h})\right] \), the volatility of the ISE

  • \(c_4\): \(\mathrm{mean}\left( \left[ \mathrm{ISE}(\hat{h})-\mathrm{ISE}( h_{\mathrm{opt}})\right] ^2\right) \), squared \(L_2\) distance of the ISEs

  • \(c_5\): \(\mathrm{mean}\left[ \mid \mathrm{ISE}(\hat{h})-\mathrm{ISE}( h_{\mathrm{opt}})\mid \right] \), \(L_1\)-distance of the ISEs

We also considered other indicators of quality but will concentrate now only on these as we believe that they are the most meaningful ones. Instead of looking at \((c_2,c_3)\), one can certainly look at the ISE distribution as a whole, for example, via box-plots.
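Given the selected and the ISE-optimal bandwidths together with the corresponding ISE values of all simulation runs, the five measures can be computed as follows. This is a sketch with hypothetical names; using the sample standard deviation (`ddof=1`) for \(c_3\) is an assumption.

```python
import numpy as np


def performance_measures(h_hat, h_opt, ise_hat, ise_opt):
    """Compute c1..c5 from per-run arrays of bandwidths and ISE values."""
    h_hat, h_opt = np.asarray(h_hat, float), np.asarray(h_opt, float)
    ise_hat, ise_opt = np.asarray(ise_hat, float), np.asarray(ise_opt, float)
    gap = ise_hat - ise_opt                 # ISE(h_hat) - ISE(h_opt), per run
    return {
        "c1": np.mean(h_hat - h_opt),       # bias of the estimated bandwidth
        "c2": np.mean(ise_hat),             # average ISE
        "c3": np.std(ise_hat, ddof=1),      # volatility of the ISE
        "c4": np.mean(gap ** 2),            # squared L2 distance of the ISEs
        "c5": np.mean(np.abs(gap)),         # L1 distance of the ISEs
    }
```

Since \(h_{\mathrm{opt}}\) minimizes the ISE in each run, the gaps are non-negative up to grid resolution, so \(c_5\) is essentially the mean excess ISE.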

We studied almost all selection methods, excluding the non-automatic ones and those having proved to perform uniformly worse than their competitors. In the presentation of the results, we concentrate on the methods which delivered the best results. Hence, some methods were dropped, e.g. the MCV sometimes provides multiple minima with a global minimum being far outside the range of reasonable bandwidths. Neither do we show results for the indirect cross-validation, since for small and moderate samples it simply works worse than OSCV. Among the bootstrap methods, we concentrate on the presentation of the version (14) of the Smoothed Bootstrap which has achieved the best results among all bootstrap methods. For our mixed version (CV with refined plug-in), we first concentrate on mix3 when comparing it to the other selection methods, and later sketch the results comparing different mixtures.

While it is clear that one-sided CV and Do-validation give almost identical results for symmetric distributions, it is also clear that the latter will be more robust when asymmetry is present but unknown. We, therefore, skipped all results for Do-validation and refer to the paper of Mammen et al. (2011) instead. Notice that they only considered additive mixtures.

Summarizing, we present the following methods: CV (cross validation), OSCV-l (one-sided CV to the left), OSCV-r (one-sided CV to the right), STAB (stabilized), refPI (refined plug-in), SBG (smoothed bootstrap with Gaussian kernel; the results refer to the equivalent bandwidth for the Epanechnikov kernel), Mix 1/2 (our mix3), and as a benchmark the ISE (infeasible ISE minimizing \(h_{\mathrm{opt}}\)). For all methods, the bandwidth search is performed on the same bandwidth grid of 25 bandwidths on a logarithmic scale from \(n^{-1}\) to \(1\). We give only results referring to sample sizes \(n=25\), \(50\), \(100\), and \(n=200\).
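The bandwidth grid and the infeasible benchmark \(h_{\mathrm{opt}}\) can be sketched as follows. The code assumes an Epanechnikov kernel and approximates the ISE by a Riemann sum on \([-0.5,1.5]\); the function names, the integration range and the number of evaluation points are illustrative choices, not taken from the study.

```python
import numpy as np


def epanechnikov_kde(grid, data, h):
    """Kernel estimate (1) with the Epanechnikov kernel, evaluated on grid."""
    u = (grid[:, None] - data[None, :]) / h
    k = np.where(np.abs(u) <= 1, 0.75 * (1 - u ** 2), 0.0)
    return k.sum(axis=1) / (data.size * h)


def ise_optimal_bandwidth(data, true_density, lo=-0.5, hi=1.5, n_eval=1001):
    """Minimise the ISE over 25 log-spaced bandwidths between 1/n and 1."""
    n = data.size
    h_grid = np.geomspace(1 / n, 1.0, 25)
    x = np.linspace(lo, hi, n_eval)
    dx = x[1] - x[0]
    f = true_density(x)
    ise = np.array(
        [np.sum((epanechnikov_kde(x, data, h) - f) ** 2) * dx for h in h_grid]
    )
    best = np.argmin(ise)
    return h_grid[best], ise[best], h_grid
```

Because the true density enters the criterion, this selector is only available in simulations and serves purely as a benchmark.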

6.1 Simulation results

To summarize and compare the different bandwidth selectors, we first consider the selected bandwidths and the corresponding biases for each method separately. Afterward, we compare the methods by various performance measures. All results are based on 250 simulation runs.

6.2 Comparison of the bias for the different bandwidths

In Fig. 2, we illustrate the Bias \((c_1)\) for the different methods for different sample sizes and distributions.

Fig. 2 Comparison of the bias for different sample sizes and different densities

Let us consider the cross-validation method (CV). Many authors have mentioned the lack of stability of the CV criterion and its tendency to undersmooth. In Fig. 2, we see that CV has the smallest bias for all sample sizes and densities (except for the simple normal for which the mix3 is competitive). This is simply due to the fact that CV always chooses a smaller bandwidth than the other selectors. When the ISE optimal bandwidth is indeed very small, CV does, therefore, very well. However, CV clearly undersmooths in the case of the simple normal distribution, as it does for all rather smooth densities.

The one-sided versions (OSCV) are quite stable. Regarding the bias, they are neither the best nor the worst over all sample sizes and models. As already stated by the original authors, the OSCV tends to overestimate the bandwidth a little bit. While for \(n=25\), the OSCV is outperformed by almost all other methods, the bias problem disappears rapidly for increasing \(n\). For \(n=100\) and \(200\) we see that their biases become much smaller than for the other methods except CV (and STAB in the simple normal case). Moreover, they never fail dramatically when \(n>25\). This feature is an intuitive benefit of this method when in practice the underlying density is completely unknown. For the densities studied, the differences between the left- (OSCV-l) and the right-sided (OSCV-r) versions are negligible except for the gamma distributions because of their asymmetry.

The stabilized procedure of Chiu (STAB) is excellent for the simple normal case but it falls short when estimating other densities, confirming that “when the true density is not smooth enough, the stabilized procedure is more biased toward oversmoothing than CV”, see Chiu (1991a, b). This fact can be seen well in Fig. 2 where STAB has increasing difficulties, compared to CV, with an increasing number of bumps. Even though this method demonstrates here a reasonable performance, the results should be interpreted with care, since in the derivation of \(\Lambda \) one has to deal with complex numbers. We resolved this problem in favor of this method in the context of our simulations, so that all presented results are clearly biased in favor of STAB.

The refined plug-in (refPI) and the smoothed bootstrap SBG show a similar behavior, though the bias of the SBG is mostly smaller than for refPI. Both are worse than STAB for small samples but outperform it for increasing \(n\). Not surprisingly, in general, the bias for the MISE minimizing methods is larger than for all others. This partly results from the fact that we constructed our prior bandwidth on second and third derivatives that result from a simple normal distribution.

The mixture of CV and plug-in is a compromise giving biases lying between the ISE and the MISE minimizing methods. It is interesting to see that this yields such a stable performance. We should mention that there were only minor differences between the three versions of mixtures (not shown here). Clearly, the larger the share of the respective method, the bigger its impact on the final estimate.

6.3 Comparison of the ISE values

Next, we compare the ISE values of the density estimates based on the different bandwidth selectors. The results are given in terms of boxplots plus the mean (linked squares) displaying this way the distribution of the ISEs over 250 simulation runs. In Fig. 3, we consider the mixture of three normal distributions (model 6) and compare different sample sizes, whereas in Fig. 4 the sample size is fixed to \(n=100\) while the true underlying distribution varies.

Fig. 3 Box-plots and means (filled square) of the ISE values for the mixture of three normal densities with different sample sizes

Fig. 4 Box-plots and means (filled square) of the ISE values for different distributions with sample size \(n=100\)

Certainly, for all methods the ISE values decrease with the sample size and increase with the complexity of the estimation problem. As expected, the classical CV shows a high variation for all cases (upper extreme values are not shown for the sake of presentation). The one-sided CV and the STAB versions do much better, while the least variation is achieved by the MISE minimizing methods (STAB, refPI and SBG). However, the drawback of these three methods becomes obvious when looking at the size of their ISE values; these are clearly smaller for the CV-based methods when \(n\ge 25\). Moreover, for increasing sample size the ISE values decrease very slowly for the MISE-based methods, whereas for the CV methods they come close to the smallest achievable ISE. Note that regarding the ISE minimization, the one-sided CV methods show the best performance, except for the triple mode normal mixture. They do not vary as much as the classical CV selector, and their mean value is almost always smaller than for the other methods, see Fig. 4.

The stabilized procedure of Chiu (STAB) delivers, as the name suggests, a very stable estimate for the bandwidth. But in the end, it is hardly more stable than, for example, the one-sided CV-based selectors. It is much worse regarding the mean and median. Otherwise, we see confirmed what we already discussed in the context of biases above. The mixture of CV and plug-in lowers the negative impacts of both versions and does surprisingly well; the mixture delivers a more stable estimate, and produces good density estimates.

6.4 Comparison of the L1- and L2-distance of the ISE

To get an even better idea of the distance between the ISE values achieved by the selectors and the ISE optimal (i.e. achievable) values, let us have a closer look at \(c_5\) and \(c_4\), i.e. the \(L_1\) and \(L_2\) distances. In our opinion, these measures are probably the most interesting ones for practitioners. Figures 5 and 6 show these L1- and L2-distances for the different sample sizes and models.

Fig. 5 L1-distances to ISE\((h_{\mathrm{opt}})\) for different sample sizes of the six underlying densities

Fig. 6 L2-distances to ISE\((h_{\mathrm{opt}})\) for different sample sizes of the six underlying densities

The pictures reveal that for CV bandwidths, the \(c_4\) are really big, even if the underlying density is not wiggly at all. This is obviously due to the high variability of the selected bandwidths. This effect applies especially to small sample sizes; but notice that for large samples like \(n=500\) the classical CV still does not work well (not shown). Regarding the \(L_1\) measure (\(c_5\)), we see that CV often delivers pretty small values for samples of size \(n>50\).

While both OSCV methods have problems with particularly small sample sizes, they otherwise easily compete with all the other selectors. One may say that for the normal densities the OSCV methods are neither the best nor the worst methods, but always within the grasp of the best method. This corroborates our statement from above that an OSCV selector should be used if we do not know anything about the underlying density. Another conspicuous finding in Fig. 5 is the difference between the two one-sided versions for the gamma distribution(s). Because of missing boundary correction on the left, the OSCV-l behaves very badly, especially for small sample sizes. We actually get similar results for \(n=25\) when looking at the L2-distances, see below.

The three MISE minimizing methods do very well for the simple normal and simple gamma distribution, but otherwise we observe a worse performance which can be traced back to the prior selection problem already described above. Even for bigger sample sizes, all three methods deliver a relatively large L1-distance for the mixture models. Furthermore, they do not benefit as much from an increasing \(n\) as other methods do. Within this MISE minimizing group, the STAB shows a smaller L1-distance for more complex densities. Actually, for the mixture of the three Gamma distributions, we can see that the L1-distances to the optimal ISE are always very small, except for the refPI and SBG with \(n=25\).

The mixture of CV and refined plug-in reflects the negative attributes of the CV, but nevertheless it is often in the range of the best methods for larger samples. A further advantage of the mixed version is that it is much more stable than the CV or refPI when varying the sample size. For more details, see our next subsection.

We obtain not exactly the same but similar results when looking at the L2-distance to the optimal ISE, plotted in Fig. 6. CV obtains very large values for small sample sizes. The one-sided versions show an important improvement. The three MISE minimizing methods perform excellently for the simple normal (not surprisingly) and the simple gamma. Among them, the STAB shows the smallest L2 distance to ISE\((h_{\mathrm{opt}})\). For sample sizes \(n>50\) the one-sided CV versions outperform the others in most cases. Large differences between the left and the right one-sided version can be observed where we have asymmetric densities.

A comparison of the L1- and the L2-distance for a sample size fixed at \(n=100\) but varying the distributions is shown in Fig. 7. As can be seen in both pictures, the performance of all methods (except CV) for the simple normal distribution and the mixture of the three gamma distributions is pretty good. Also for the mixture of two normals, most of the methods deliver good results; only the values for CV, refPI and the SBG become much larger. For more complex densities, the pictures show that the MISE minimizing methods deliver worse results because of their large biases. STAB shows a pretty good performance with respect to the \(L_1\) measure, which is not surprising when recalling its construction. The most stable versions are the OSCV and the Mix, except for the triple mode normal mixture. For smaller sample sizes (not shown), the pictures are quite similar, but the tendencies are strengthened and only the Mix version delivers stable results for all distributions.

Fig. 7

L1- and L2-distances to ISE\((h_{\mathrm{opt}})\) for different underlying densities when \(n=100\)

6.5 Comparison of the mixed methods

Finally, we take a closer look at the quite promising results obtained by mixing CV with refPI. We have done this mixing using the different proportions described above. In Tables 1 and 2 we have tabulated the different performance measures, looking at the bias of the chosen bandwidth, the average ISE, and the L1- and L2-distances to the optimal ISE for all six densities.
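The idea of mixing two selectors can be sketched as follows, assuming that a mixed bandwidth is a convex combination of a CV bandwidth and a plug-in bandwidth. The weights below are illustrative placeholders and not the proportions used for mix1 to mix3. The normal reference rule is only a simple stand-in for refPI, and the Gaussian kernel, the bandwidth grid, and all function names are likewise assumptions.

```python
import numpy as np

def lscv_bandwidth(data, h_grid):
    """Least-squares CV with a Gaussian kernel, minimised over a bandwidth grid."""
    n = len(data)
    diffs = data[:, None] - data[None, :]
    off_diag = ~np.eye(n, dtype=bool)
    scores = []
    for h in h_grid:
        u = diffs / h
        # Integral of the squared estimate: uses K*K, i.e. the N(0, 2) density.
        int_f2 = np.exp(-0.25 * u**2).sum() / (n**2 * h * np.sqrt(4.0 * np.pi))
        # Average leave-one-out estimate evaluated at the observations.
        loo = np.exp(-0.5 * u[off_diag] ** 2).sum() / (
            n * (n - 1) * h * np.sqrt(2.0 * np.pi)
        )
        scores.append(int_f2 - 2.0 * loo)
    return float(h_grid[int(np.argmin(scores))])

def normal_reference_bandwidth(data):
    """Normal reference rule; a simple stand-in for the refined plug-in (refPI)."""
    return float(1.06 * np.std(data, ddof=1) * len(data) ** (-0.2))

def mix_bandwidth(h_cv, h_pi, weight_cv):
    """Convex combination of a CV and a plug-in bandwidth (assumed form of the mix)."""
    if not 0.0 <= weight_cv <= 1.0:
        raise ValueError("weight_cv must lie in [0, 1]")
    return weight_cv * h_cv + (1.0 - weight_cv) * h_pi

# Illustrative data and weights (assumptions, not the designs of the study).
rng = np.random.default_rng(1)
data = rng.standard_normal(100)
h_cv = lscv_bandwidth(data, np.linspace(0.05, 1.5, 60))
h_pi = normal_reference_bandwidth(data)
for w in (0.25, 0.5, 0.75):
    print(f"weight on CV = {w:.2f}: mixed h = {mix_bandwidth(h_cv, h_pi, w):.3f}")
```

Under this assumption the mixed bandwidth always lies between the two input bandwidths, so it inherits part of the low bias of CV while damping its variability.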

Table 1 Laplace, simple gamma, and mixture of three gamma distributions
Table 2 Simple normal distribution, mixture of two and three normal distributions

For all smooth densities, we observe that the values of the different measures are quite close to each other. The main differences occur for small sample sizes and wiggly densities. It is hard to say which mixture is the best, as sometimes mix2 is the best and sometimes mix1, whereas mix3 always lies in between. The reason seems obvious: either refPI is more appropriate than CV or vice versa. This, however, is a conclusion drawn from the means, while at the same time the mixtures reduce the variance considerably. The potential gain of the mixed methods is, therefore, seen best when looking at the \(L_2\) distances. Recall also our results from the previous subsections and compare mix3 with CV and refPI with respect to that measure, \(c_4\), which deserves special emphasis here.

We first see that, across the different sample sizes, the considered performance measures converge toward zero with increasing sample size, as expected. Notice, however, that depending on the smoothness of the underlying density they do so at seemingly different rates. A similar observation can be made when comparing the development over a decrease of smoothness: compare, for example, designs 4–6 (from a simple to a triple mode normal mixture). Since we could not identify a clear winner between refPI and CV, it may not be surprising that the best compromise indeed seems to be mix3. In total, the main conclusion is that the considered bandwidth mixtures produce very stable results and are attractive competitors to the other bandwidth selection methods.

7 Conclusions

A first finding is that it definitely makes a difference which bandwidth selector is chosen, not only in numerical terms but also for the quality of density estimation. We can identify clear differences in quality, and we can say in which situation what kind of selector is preferable.

As is well known, the CV leads to a small bias but a large variance. It works well for rather wiggly densities and a moderate sample size. However, it performs well neither for rather small nor for rather large samples. The quality is unfortunately dominated by its variability. Another fully automatic alternative is the one-sided version. In contrast to the classical CV, the OSCV methods exhibit much less variation without increasing too much in bias. For very small samples these methods have their numerical problems, which are caused by their construction. They may not be uniformly the best, but they quite often are, and they are never the worst. Depending on the skewness, either the left- or the right-sided CV performs better. This disadvantage is no longer present for the alternative Do-validation or the indirect SHS bandwidth selector. Unfortunately, for the SHS selector to work reasonably, a sample size of \(n>100\) is strongly recommended. Further, it also depends on two prior parameters for which some recommendations exist for \(n>100\). Note also that all the following statements are conditional on our prior choices, and possibly just on our selection of densities and sample sizes. We are aware that for large samples and quite wiggly densities our findings and conclusions might change.

The refPI and the SBG show a similarly stable behavior due to the fact that they minimize the MISE and depend on prior information. It is generally accepted that the need for prior knowledge is the main disadvantage of these methods and that, as explained above, they typically require a smooth underlying density. We have to admit that larger samples would allow for more complex plug-in methods, but these often require more prior knowledge.

The STAB method is quite stable, as suggested by its name. Although the full name refers to cross-validation, it actually minimizes the MISE like refPI and SBG do. Consequently, it performs particularly well for the estimation of rather smooth densities but not otherwise. It certainly pays for the stability with some increase in bias. It is, therefore, hard to say to what extent it is an improvement over CV, but it seems to be an improvement compared to refPI when looking at the ISE of the density estimator.

While the mix methods (combining CV and plug-in) do very well, one cannot really identify a ’best mix’ in advance. A further evident computational disadvantage is that we first have to apply two other methods (CV and refPI) to achieve good results. Therefore, we have studied here only the combination of the simplest plug-in with the simplest CV method. It would hardly be surprising if better results were obtained when mixing more sophisticated methods, see for example Mammen et al. (2011).

Our conclusion is, therefore, that among all existing (automatic) methods for kernel density estimation, concentrating on small or moderate samples and relatively smooth densities, the best strategies seem to be either a mixing or an indirect method. Among the two competing indirect methods (OSCV and SHS), the two OSCV versions seem to outperform SHS. However, if sample sizes increase a lot and skewness becomes an important issue, then SHS is doubtless an interesting alternative for the reasons discussed. Depending on the boundary, one would apply left- or right-sided OSCV, respectively. For moderate sample sizes, however, the mixture of CV and refPI seems to be an attractive alternative until \(n\) becomes so large that CV fails completely.