1 Introduction

Density estimation is widely used in statistical feature models in computer vision and pattern recognition. Given a set of training data for the features (e.g., intensity, shape, texture), the underlying probability density can be described by a simple or a more complicated distribution function. Uniform distributions [1], Gaussian distributions [2] and nonparametric functionals [3] have all been considered in the past. By contrast, the kernel density estimator (KDE), also called the Parzen window estimate, is an efficient nonparametric approach to modeling nonlinear distributions of training data without assuming that the form of the underlying density is known. In this technique the density is estimated by a sum of kernel functions, one per training point, so the number of kernels equals the size of the training set. When the training set is very large, the KDE suffers from high computational cost and becomes intractable for subsequent use. In addition, the feature space is complex and noisy, and often not all the training data obey the same parametric model. This leads to a need for robust estimators that can handle data in the presence of severe contaminations, i.e., outliers. In this work, we focus on the problem of how to employ a small percentage of the available data sample to provide a robust and highly accurate density estimator.

KDE is frequently used for various computer vision problems, such as mean shift [4], background subtraction [5], object tracking [6], image segmentation [3, 7] and classification [8]. Even though there have been several attempts to improve its computational efficiency [9, 10], its very high memory requirements and computational complexity inhibit the use of kernel density estimation in real applications. In [11], the support vector approach was used to obtain an estimate from the training data in the form of a mixture of densities. This approach has no additional free parameters; however, for large sample sizes it requires \( O(n^{3} ) \) optimization routines. The reduced set density estimator (RSDE) was proposed by Girolami and He [12] to solve this problem by providing a KDE that employs only a small subset of the available training data. It is optimal in the integrated squared error (ISE) between the unknown true density and the RSDE. In contrast to the support vector approach, the RSDE only requires \( O(n^{2} ) \) optimization routines to provide a similar level of performance. In order to increase the sparsity of the weight coefficients further, Chen et al. [13] constructed a sparse kernel density estimate using an orthogonal forward regression technique with the classical Parzen window estimate as the desired response. In addition, sparse kernel density estimation has attracted attention through the integration of an explicit sparsity constraint on the weight coefficients as a regularization term [14, 15]. These methods create a trade-off between sparsity and the quality of the density estimate: they produce sparsity in the samples at the cost of a slight reduction in the quality of the estimates.

Instead of creating a new probability density estimator, we generalize the RSDE to provide more satisfactory performance. In this paper, our work focuses on the RSDE based on the KDE with a Gaussian kernel. In the RSDE, there exist many nonzero, coherent weighting coefficients that cluster in regions of space with greater probability mass, especially for low dimensional data. In order to break the relationship between coherent coefficients, our idea is to introduce randomness into the plug-in estimate of the weighting coefficients. By means of sequential minimal optimization (SMO), these coherent weighting coefficients can be replaced approximately by one or several larger incoherent weighting coefficients. In contrast to the RSDE, the proposed model improves both the sparsity and the accuracy of the density estimate. Moreover, an analysis in feature space shows that this technique is robust to outliers.

This paper is organized as follows. In Sect. 2 the RSDE is reviewed briefly, and in Sect. 3 the proposed robust sparse kernel density estimation by random fluctuations for coherent coefficients is presented. Experimental results are provided in Sect. 4, and the conclusions in Sect. 5.

2 Reduced set density estimator

Given \( n \) data samples \( x_{1} ,x_{2} , \ldots ,x_{n} \in R^{\ell } \), each with a weight \( \omega_{i} \ge 0,\sum\nolimits_{i = 1}^{n} {\omega_{i} } = 1 \), the underlying density can be estimated by a KDE with weight coefficients,

$$ \hat{f}(x;\omega ) = \sum\limits_{i = 1}^{n} {\omega_{i} k_{\sigma } (x,x_{i} )} , $$
(1)

where \( k_{\sigma } (x,x_{i} ) \) is a kernel function (satisfying non-negativity and normalization conditions), and \( \sigma \) is a parameter which controls the kernel width. The most commonly used kernel function is a Gaussian kernel

$$ k_{\sigma } (x,x_{i} ) = (2\pi \sigma^{2} )^{{ - \tfrac{\ell }{2}}} { \exp }\left\{ { - \frac{{\left\| {x - x_{i} } \right\|^{2} }}{{2\sigma^{2} }}} \right\} $$
(2)

Girolami and He [12] estimated KDE by minimizing ISE between the true density \( f(x) \) and the estimated density \( \hat{f}(x;\omega ) \), which was defined as

$$ {\text{ISE}}(\omega ) = \int {\left| {f(x) - \hat{f}(x;\omega )} \right|^{2} {\text{d}}x} = \int {\hat{f}^{2} (x;\omega ){\text{d}}x} - 2\int {\hat{f}(x;\omega )f(x){\text{d}}x} + \int {f^{2} (x){\text{d}}x} $$
(3)

Notice that the first term is

$$ \int {\hat{f}^{2} (x;\omega ){\text{d}}x} = \int {\left( {\sum\limits_{i = 1}^{n} {\omega_{i} k_{\sigma } (x,x_{i} )} } \right)^{2} {\text{d}}x} = \sum\limits_{i = 1}^{n} {\sum\limits_{j = 1}^{n} {\omega_{i} \omega_{j} \int {k_{\sigma } (x,x_{i} )k_{\sigma } (x,x_{j} ){\text{d}}x} } } = \omega^{\text{T}} Q\omega $$

Here, \( \omega = [\omega_{1} ,\omega_{2} , \ldots ,\omega_{n} ]^{\text{T}} \) and \( Q \) is an \( n \times n \) matrix whose elements are \( Q_{ij} = \int {k_{\sigma } (x,x_{i} )k_{\sigma } (x,x_{j} ){\text{d}}x} = k_{\sqrt 2 \sigma } (x_{i} ,x_{j} ) \), by the convolution property of the Gaussian kernel. Since we do not know the true \( f(x) \), we need to estimate the second term, which is denoted by \( M(\omega ) \). An unbiased estimate of it for a KDE can be written as

$$ M(\omega ) = \int {\hat{f}(x;\omega )f(x){\text{d}}x} \approx \tfrac{1}{n}\sum\limits_{i = 1}^{n} {\sum\limits_{j = 1}^{n} {\omega_{i} k_{\sigma } (x_{i} ,x_{j} )} } = \sum\limits_{i = 1}^{n} {\omega_{i} \left( {\tfrac{1}{n}\sum\limits_{j = 1}^{n} {k_{\sigma } (x_{i} ,x_{j} )} } \right)} = \sum\limits_{i = 1}^{n} {\omega_{i} d_{i} } = \omega^{\text{T}} D $$

Here, the KDE at the point \( x_{i} \) is denoted by \( d_{i} = \tfrac{1}{n}\sum\nolimits_{j = 1}^{n} {k_{\sigma } (x_{i} ,x_{j} )} \), and \( D = [d_{1} ,d_{2} , \ldots ,d_{n} ]^{\text{T}} \). Notice that \( D = K1_{n} \), where \( K \) is the Gram matrix whose elements are defined as \( K_{ij} = k_{\sigma } (x_{i} ,x_{j} ) \), and \( 1_{n} \) is the \( n \times 1 \) vector whose elements are all \( \tfrac{1}{n} \). The last term in (3) can be dropped because it does not depend on \( \omega \); halving the remaining expression does not change the minimizer, so the minimum-ISE weights are obtained from

$$ \begin{gathered} \hat{\omega } = { \arg }\mathop { \hbox{min} }\limits_{\omega } \left\{ {{\text{ISE}}(\omega ) = \tfrac{1}{2}\omega^{\text{T}} Q\omega - \omega^{\text{T}} D} \right\} \hfill \\ {\text{s}} . {\text{t}} .\quad \omega^{\text{T}} 1 = 1\quad {\text{and}}\quad \omega_{i} \ge 0\;\forall i \hfill \\ \end{gathered} $$
(4)

Observe that the matrix \( Q \) is positive semi-definite. Thus, the objective function is convex with respect to \( \omega \), and (4) can be solved using SMO [12, 16].
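As a concrete illustration of (4), and not the authors' MATLAB implementation, the following Python sketch builds \( Q \) and \( D \) for the Gaussian kernel and solves the QP with a generic SLSQP solver standing in for SMO; the function names, the bandwidth value and the toy data are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.spatial.distance import cdist


def gaussian_kernel(X, Y, sigma):
    """Gaussian kernel matrix k_sigma(x_i, y_j), cf. Eq. (2), for row-vector samples."""
    d = X.shape[1]
    sq = cdist(X, Y, 'sqeuclidean')
    return (2.0 * np.pi * sigma ** 2) ** (-d / 2.0) * np.exp(-sq / (2.0 * sigma ** 2))


def rsde_weights(X, sigma):
    """Minimize 0.5*w'Qw - w'D over the simplex, cf. Eq. (4).
    A generic SLSQP solver stands in for SMO, purely for illustration."""
    n = X.shape[0]
    Q = gaussian_kernel(X, X, np.sqrt(2.0) * sigma)  # Q_ij = k_{sqrt(2)*sigma}(x_i, x_j)
    D = gaussian_kernel(X, X, sigma).mean(axis=1)    # d_i = (1/n) sum_j k_sigma(x_i, x_j)
    obj = lambda w: 0.5 * w @ Q @ w - w @ D
    grad = lambda w: Q @ w - D
    cons = ({'type': 'eq', 'fun': lambda w: w.sum() - 1.0},)
    res = minimize(obj, np.full(n, 1.0 / n), jac=grad,
                   bounds=[(0.0, None)] * n, constraints=cons, method='SLSQP')
    return res.x, Q, D


if __name__ == '__main__':
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 1))                    # toy 1-D sample
    w, Q, D = rsde_weights(X, sigma=0.3)             # bandwidth chosen arbitrarily here
    print('nonzero weights:', int(np.sum(w > 1e-6)))
```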

3 Robust sparse kernel density estimation

The RSDE was shown in [12] to provide a sparse representation in the weighting coefficients. The authors observed that the weights obtained from minimizing the estimated ISE were sparse. This is because the linear term \( \omega^{\text{T}} D \) in (4) is a convex combination of positive numbers; such a combination is maximized by assigning unit weight to the largest element and zero weight to the rest. SMO is a simple algorithm that can quickly solve the quadratic programming (QP) optimization problem (4) by breaking the large QP problem into a series of the smallest possible QP subproblems. The optimal solution must satisfy the following rules

  • (R1) If \( \omega_{i} = 0,\omega_{j} > 0 \) then \( I_{i} \ge I_{j} \).

  • (R2) If \( \omega_{i} ,\omega_{j} > 0 \) then \( I_{i} = I_{j} \).

Here, \( I_{i} = \sum\nolimits_{j = 1}^{n} {Q_{ij} \omega_{j} } - d_{i} \). Clearly, the more points satisfy rule (R1), the sparser the solution. In rule (R2), \( \omega_{i} \) and \( \omega_{j} \) are positive and updated only when \( I_{i} \ne I_{j} \). Therefore, we wish to reduce the number of points that satisfy rule (R2) in order to increase the sparsity of the weight coefficients further. Due to the convex constraint, the coefficients obtained from SMO are naturally sparse. If we suppose that the numerical precision is sufficient (\( I_{i} \ne I_{j} ,i \ne j \)), then SMO eventually yields only one nonzero weighting coefficient in \( \omega \). This implies that the true density \( f(x) \) is estimated by a single kernel function with nonzero weight coefficient \( \omega_{k} \), \( \hat{f}(x;\omega ) = \omega_{k} k_{\sigma } (x,x_{k} ) \). Obviously, this is a time-consuming way to enforce sparsity, and the resulting density estimator is inaccurate. In practice, even at increased numerical precision, the RSDE leaves many relatively close points with small nonzero weighting coefficients in the optimal solution. As can be seen easily in Fig. 1a, these nonzero weighting coefficients are clustered in regions of space with greater probability mass. If the points in each cluster could be replaced approximately by one or several points with larger weighting coefficients, the sparsity of the weight coefficients could be improved further.
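To make rules (R1) and (R2) concrete, the short check below computes \( I_{i} = \sum\nolimits_{j} {Q_{ij} \omega_{j} } - d_{i} \) and verifies both conditions up to a numerical tolerance. It is a sketch that assumes arrays Q, D and a candidate solution w as produced, for example, by the previous snippet.

```python
import numpy as np


def check_kkt_rules(Q, D, w, tol=1e-6):
    """Verify rules (R1) and (R2) for a candidate solution w of problem (4)."""
    I = Q @ w - D                     # I_i = sum_j Q_ij * w_j - d_i
    active = w > tol                  # indices with w_i > 0
    if not active.any():
        return True, True
    I_act = I[active]
    # (R2): all positive-weight points share the same I_i (up to tolerance)
    r2 = bool((I_act.max() - I_act.min()) < tol)
    # (R1): zero-weight points have I_i >= I_j for every positive-weight j
    r1 = (~active).sum() == 0 or bool((I[~active] >= I_act.max() - tol).all())
    return r1, r2
```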

Fig. 1

a The true density and the RSDE; each of the 87 nonzero weighting coefficients is placed at the appropriate sample data point and its value is denoted by the length of the vertical line. The corresponding \( L_{1} \) and \( L_{2} \) errors between them are 0.0147 and \( 4.54 \times 10^{ - 4} \). b The true density and the RSKDE; each of the 20 nonzero weighting coefficients is placed at the appropriate sample data point. The corresponding \( L_{1} \) and \( L_{2} \) errors between them are 0.014 and \( 3.71 \times 10^{ - 4} \)

For the Gaussian kernel, note that there exists a feature mapping \( \phi_{\sigma } :{\mathbb{R}}^{\ell } \to {\mathbb{R}}^{L} (\ell < L) \) which maps a feature vector into a high dimensional feature space, \( x \to \phi_{\sigma } (x) \), such that \( K_{ij} = k_{\sigma } (x_{i} ,x_{j} ) = \left\langle {\phi_{\sigma } (x_{i} ),\phi_{\sigma } (x_{j} )} \right\rangle \) [17]. The KDE with Gaussian kernel can then be represented as the inner product between a mapped test point and the centroid of the mapped training points in kernel feature space [18]: \( d_{i} = \frac{1}{n}\sum\nolimits_{k = 1}^{n} {k_{\sigma } (x_{i} ,x_{k} )} = \left\langle {\phi_{\sigma } (x_{i} ),\frac{1}{n}\sum\nolimits_{k = 1}^{n} {\phi_{\sigma } (x_{k} )} } \right\rangle \). Here, \( \frac{1}{n}\sum\nolimits_{k = 1}^{n} {\phi_{\sigma } (x_{k} )} \) is a fixed nonzero vector that does not depend on \( x_{i} \). Similarly, there exists \( \phi_{\sqrt 2 \sigma } \) such that \( Q_{ij} = k_{\sqrt 2 \sigma } (x_{i} ,x_{j} ) = \left\langle {\phi_{\sqrt 2 \sigma } (x_{i} ),\phi_{\sqrt 2 \sigma } (x_{j} )} \right\rangle \) and \( H_{i} = \sum\nolimits_{k = 1}^{n} {Q_{ik} \omega_{k} } = \left\langle {\phi_{\sqrt 2 \sigma } (x_{i} ),\sum\nolimits_{k = 1}^{n} {\omega_{k} \phi_{\sqrt 2 \sigma } (x_{k} )} } \right\rangle \). By analysis in feature space, we have

$$ I_{i} - I_{j} = \left\langle {\phi_{\sqrt 2 \sigma } (x_{i} ) - \phi_{\sqrt 2 \sigma } (x_{j} ),\sum\limits_{k = 1}^{n} {\omega_{k} \phi_{\sqrt 2 \sigma } (x_{k} )} } \right\rangle - \left\langle {\phi_{\sigma } (x_{i} ) - \phi_{\sigma } (x_{j} ),\frac{1}{n}\sum\limits_{k = 1}^{n} {\phi_{\sigma } (x_{k} )} } \right\rangle $$

If \( d_{i} = d_{j} \), then we have \( \left\langle {\phi_{\sigma } (x_{i} ) - \phi_{\sigma } (x_{j} ),\frac{1}{n}\sum\nolimits_{k = 1}^{n} {\phi_{\sigma } (x_{k} )} } \right\rangle = 0 \), and \( \phi_{\sigma } (x_{i} ) = \phi_{\sigma } (x_{j} ) \). Thus, \( \phi_{\sqrt 2 \sigma } (x_{i} ) = \phi_{\sqrt 2 \sigma } (x_{j} ) \), and \( I_{i} = I_{j} \).

In fact, \( d_{i} ,d_{j} \) are unequal but may be very close. If \( d_{i} \) is very close to \( d_{j} \), then their feature points \( \phi_{\sigma } (x_{i} ) \) and \( \phi_{\sigma } (x_{j} ) \) are also very close, which implies that the condition \( I_{i} = I_{j} \) in rule (R2) is easily met. In other words, if \( d_{i} ,d_{j} \) are close enough and \( \omega_{i} \) is nonzero, then \( \omega_{j} \) is more likely to be nonzero as well. Hence, there are two cases for \( d_{i} \) and \( d_{j} \) in rule (R2): (a) \( |d_{i} - d_{j} | < \delta \) (\( \delta \) small enough), and (b) \( |d_{i} - d_{j} | \ge \delta \). If \( \omega_{i} \) and \( \omega_{j} \) satisfy case (a) in rule (R2), then \( \omega_{i} ,\omega_{j} > 0 \) are called coherent coefficients.

3.1 Random perturbation of coherent coefficients

In this section, we aim to break the relationship between coherent coefficients, which are clustered in regions of space with greater probability mass. A natural approach is to introduce randomness into \( D \) so as to produce incoherence among most of the \( d_{i} \in D \). Based on the existing structure of \( D \), small random values are added to the subset of entries that lie very close together, while the rest are kept unchanged. The randomness makes the elements in each cluster move apart from each other. Assume that there are \( n_{0} \) coherent coefficients \( \omega_{1} , \ldots ,\omega_{{n_{0} }} > 0 \) whose corresponding \( d_{1} , \ldots ,d_{{n_{0} }} \) are close enough. After adding random values to these \( d_{i} \), the relationship \( I_{1} = \cdots = I_{{n_{0} }} \) in rule (R2) no longer holds. Consequently, the number of points falling into case (a) of rule (R2) is reduced, and some of them are reclassified under rule (R1), so the weight coefficients in the optimal solution become sparser. Assume that all the elements of \( D \) are collected into a set \( \Upomega \). We then introduce the following definitions to partition the set \( \Upomega \).

Definition 1

Coherent relation \( \approx \) is defined as: \( d_{i} \approx d_{j} \Leftrightarrow \left\lfloor {d_{i} } \right\rfloor_{m} = \left\lfloor {d_{j} } \right\rfloor_{m} ,\;d_{i} ,d_{j} \in \Upomega \).

There are many ways to describe the coherent relation. Here, since \( d_{i} \) is a positive decimal number, the truncated m-digit approximation to it is the number \( \left\lfloor {d_{i} } \right\rfloor_{m} \) obtained by simply discarding all digits beyond the mth. Hence, \( \left| {d_{i} - \left\lfloor {d_{i} } \right\rfloor_{m} } \right| < 10^{ - m} \). Clearly, the coherent relation \( d_{i} \approx d_{j} \) is an equivalence relation that identifies those numbers of \( \Upomega \) that lie very close together. Moreover, this relation gives rise to a partition of \( \Upomega \) into equivalence classes.
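As an illustration of Definitions 1 and 2 (the helper names are our own, not the authors'), the following sketch groups the entries of \( D \) into equivalence classes by their truncated m-digit approximations:

```python
import numpy as np
from collections import defaultdict


def truncate_m(d, m):
    """Truncated m-digit approximation: discard every digit beyond the m-th."""
    return np.floor(d * 10 ** m) / 10 ** m


def coherent_partition(D, m=4):
    """Partition the entries of D into equivalence classes of the coherent relation
    d_i ~ d_j  <=>  trunc_m(d_i) == trunc_m(d_j)  (Definition 1)."""
    classes = defaultdict(list)
    for i, d in enumerate(D):
        classes[int(np.floor(d * 10 ** m))].append(i)  # key = the m truncated digits
    # order classes by decreasing cardinality, as in Definition 2
    return sorted(classes.values(), key=len, reverse=True)
```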

Definition 2

Coherent set is an equivalence class defined as: \( \Upomega (\tilde{d}_{i} ) = \left\{ {d_{j} |(d_{j} \in \Upomega ) \wedge (\tilde{d}_{i} \approx d_{j} )} \right\} \). Here, \( \tilde{d}_{i} \) is called a generator. The partition induced by the coherent relation is given by: \( \Uppi (\Upomega , \approx ) = \left\{ {\Upomega_{1} , \ldots ,\Upomega_{c} } \right\},c \le \left| \Upomega \right| \), where \( \left| \Upomega \right| \) is the cardinality of \( \Upomega \), and \( \left| {\Upomega_{1} } \right| \ge \left| {\Upomega_{2} } \right| \ge \cdots \ge \left| {\Upomega_{c} } \right| \). Let \( \tilde{d}_{i} \) be the corresponding generators of \( \Upomega_{i} \), \( i = 1,2, \ldots ,c \), and they form a set \( \Upomega_{B} = \left\{ {\tilde{d}_{1} ,\tilde{d}_{2} , \ldots ,\tilde{d}_{c} } \right\} \). Subsequently, we define \( \Upomega_{N} = \Upomega - \Upomega_{B} \), and obtain the corresponding partition \( \Uppi (\Upomega_{N} , \approx ) = \left\{ {\Upomega_{1}^{N} ,\Upomega_{2}^{N} , \ldots ,\Upomega_{t}^{N} } \right\} \), where \( \left| {\Upomega_{1}^{N} } \right| \ge \left| {\Upomega_{2}^{N} } \right| \ge \cdots \ge \left| {\Upomega_{t}^{N} } \right| \). Obviously, \( t < c \) and \( \left| {\Upomega_{i}^{N} } \right| = \left| {\Upomega_{i} } \right| - 1,\forall i = 1,2, \ldots ,t \). Here, we select an appropriate m or magnify the values of \( \Upomega \) for partition such that \( \left| {\Upomega_{N} } \right| < \left| {\Upomega_{B} } \right| < n \).

Definition 3

\( \phi_{\sigma } (x_{i} ) \) and \( \phi_{\sigma } (x_{j} ) \) are called coherent feature points, such that \( \phi_{\sigma } (x_{i} ) \approx \phi_{\sigma } (x_{j} ) \Leftrightarrow d_{i} \approx d_{j} \). Since there is a one-to-one correspondence between \( d_{i} \) and \( \phi_{\sigma } (x_{i} ) \), we define the corresponding feature sets \( \Uptheta = \left\{ {\phi_{\sigma } (x_{1} ),\phi_{\sigma } (x_{2} ), \ldots ,\phi_{\sigma } (x_{n} )} \right\} \), \( \Uptheta_{B} \), and \( \Uptheta_{N} = \left\{ {\Uptheta_{1} ,\Uptheta_{2} , \ldots ,\Uptheta_{t} } \right\} \) of \( \Upomega \), \( \Upomega_{B} \), and \( \Upomega_{N} \), respectively.

Given a coherent relation \( \approx \), the set \( \Upomega \) is divided into \( \Upomega_{B} \) and \( \Upomega_{N} \) (Fig. 2). Our method introduces randomness into \( \Upomega_{N} \) and keeps \( \Upomega_{B} \) unchanged. For any \( d \in \Upomega_{i}^{N} ,i = 1,2, \ldots ,t \), we obtain \( \bar{\Upomega }_{i}^{N} \) by setting \( d^{*} = d + \lambda_{i} r \), where \( r \) is a random value drawn from a uniform distribution on the interval [0, 1], and \( \lambda_{i} \) is a scaling parameter for \( \Upomega_{i}^{N} \) with \( \lambda_{1} \ge \lambda_{2} \ge \cdots \ge \lambda_{t} \). In our experiments, a simple rule is used to define these scaling parameters, \( \lambda_{i} = \left( {d_{i} - \left\lfloor {d_{i} } \right\rfloor_{m} } \right)\left| {\Upomega_{i}^{N} } \right| \). Then, \( \Upomega_{N}^{*} = \left\{ {\bar{\Upomega }_{1}^{N} ,\bar{\Upomega }_{2}^{N} , \ldots ,\bar{\Upomega }_{t}^{N} } \right\} \) is obtained, and \( \Upomega_{B} ,\Upomega_{N} ,\Upomega_{N}^{*} \) can be written in vector form as \( D_{B} ,D_{N} ,D_{N}^{*} \). After rearrangement, the proposed ISE approximation model is minimized as

$$ \begin{gathered}\left[ {\begin{array}{*{20}c} {\hat{\omega }_{B} } & {\hat{\omega }_{N} } \\ \end{array} } \right] = \mathop {\text{argmin}}\limits_{{\omega_{B} ,\omega_{N} }} \left\{ {{\text{ISE}}^{*} (\omega_{B} ,\omega_{N} ) = \frac{1}{2}\left[ {\begin{array}{*{20}c} {\omega_{B} } & {\omega_{N} } \\ \end{array} } \right]\left[ {\begin{array}{*{20}c} {Q_{BB} } & {Q_{BN} } \\ {Q_{NB} } & {Q_{NN} } \\ \end{array} } \right]\left[ {\begin{array}{*{20}c} {\omega_{B} } \\ {\omega_{N} } \\ \end{array} } \right] - \left[ {\begin{array}{*{20}c} {\omega_{B} } & {\omega_{N} } \\ \end{array} } \right]\left[ {\begin{array}{*{20}c} {D_{B} } \\ {D_{N}^{*} } \\ \end{array} } \right]} \right\}, \end{gathered}$$
(5)

which is called the robust sparse kernel density estimation (RSKDE). In contrast to the RSDE, we use \( D_{N}^{*} = D_{N} + R \) instead of \( D_{N} \). Here, \( R = \left[ {R_{1} ,R_{2} , \ldots ,R_{{b_{t} }} } \right]^{\text{T}} = \left[ {\lambda_{1} r_{1} , \ldots ,\lambda_{1} r_{{b_{1} }} ,\lambda_{2} r_{{b_{1} + 1}} , \ldots ,\lambda_{2} r_{{b_{2} }} , \ldots ,\lambda_{t} r_{{b_{t} }} } \right]^{\text{T}} , \) where \( b_{j} = \sum\nolimits_{i = 1}^{j} {\left| {\bar{\Upomega }_{i}^{N} } \right|} ,j = 1,2, \ldots ,t \), and \( r_{i} \) is a random value drawn from a uniform distribution on the interval \( [0,1],\,i = 1,2, \ldots ,b_{t} \). Observe that the objective function in (5) remains convex with respect to \( \omega \), so SMO can also be used to solve this problem. The set partitioning is easy to implement in a computer program. Our implementation approach is summarized in Algorithm 1.

Fig. 2

A diagram of set partition

Algorithm 1 Robust sparse kernel density estimation
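As a sketch of one plausible reading of Algorithm 1 (not the authors' code), the snippet below reuses the helpers gaussian_kernel, coherent_partition and truncate_m defined in the earlier sketches. The scaling rule \( \lambda_{i} = \left( {d_{i} - \left\lfloor {d_{i} } \right\rfloor_{m} } \right)\left| {\Upomega_{i}^{N} } \right| \) follows the text; the choice of solver and the remaining details are assumptions.

```python
import numpy as np
from scipy.optimize import minimize

# gaussian_kernel, coherent_partition and truncate_m are the helpers
# defined in the earlier sketches.


def rskde_weights(X, sigma, m=4, rng=None):
    """Sketch of Algorithm 1: perturb the coherent entries of D, then solve (5)."""
    rng = np.random.default_rng() if rng is None else rng
    n = X.shape[0]
    Q = gaussian_kernel(X, X, np.sqrt(2.0) * sigma)
    D = gaussian_kernel(X, X, sigma).mean(axis=1)

    # Partition the entries of D into coherent sets (Definitions 1 and 2).
    D_star = D.copy()
    for cls in coherent_partition(D, m):
        if len(cls) < 2:
            continue                          # singletons stay in Omega_B, unchanged
        others = cls[1:]                      # keep one generator per class in Omega_B
        for i in others:                      # Omega_i^N: perturb the remaining entries
            lam = (D[i] - truncate_m(D[i], m)) * len(others)
            D_star[i] = D[i] + lam * rng.uniform(0.0, 1.0)

    # Solve the perturbed QP (5); a generic solver again replaces SMO.
    obj = lambda w: 0.5 * w @ Q @ w - w @ D_star
    grad = lambda w: Q @ w - D_star
    cons = ({'type': 'eq', 'fun': lambda w: w.sum() - 1.0},)
    res = minimize(obj, np.full(n, 1.0 / n), jac=grad,
                   bounds=[(0.0, None)] * n, constraints=cons, method='SLSQP')
    return res.x
```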

In particular, the proposed model can be viewed as a sparse KDE with a randomly weighted \( L_{1} \) penalty, and can also be written as

$$ \hat{\omega } = { \arg }\mathop { \hbox{min} }\limits_{\omega } {\text{ISE}}(\omega ) - \beta \left\| {P\omega } \right\|_{1} \quad {\text{s}} . {\text{t}} .\;\omega^{\text{T}} 1 = 1\;{\text{and}}\;\omega_{i} \ge 0\;\forall i . $$
(6)

Here \( P = \left[ {\begin{array}{*{20}c} 0 & R \\ \end{array} } \right]^{\text{T}} \), where \( 0 \) is an \( (n - b_{t} ) \times 1 \) column vector of zeros, and \( \beta > 0 \) is the regularization parameter.
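To see why (6) is equivalent to (5), note that for \( \omega_{i} \ge 0 \) and \( P_{i} \ge 0 \) (interpreting \( P\omega \) as the elementwise product) the penalty is linear in \( \omega \), so it simply shifts the linear term:

$$ - \beta \left\| {P\omega } \right\|_{1} = - \beta \sum\limits_{i = 1}^{n} {P_{i} \omega_{i} } = - \beta \omega^{\text{T}} P,\quad {\text{ISE}}(\omega ) - \beta \left\| {P\omega } \right\|_{1} = \tfrac{1}{2}\omega^{\text{T}} Q\omega - \omega^{\text{T}} \left( {D + \beta P} \right) = \tfrac{1}{2}\omega^{\text{T}} Q\omega - \left[ {\begin{array}{*{20}c} {\omega_{B} } & {\omega_{N} } \\ \end{array} } \right]\left[ {\begin{array}{*{20}c} {D_{B} } \\ {D_{N} + \beta R} \\ \end{array} } \right], $$

which coincides with (5) when \( \beta = 1 \), or equivalently when \( \beta \) is absorbed into the scaling parameters \( \lambda_{i} \).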

3.2 Analysis of RSKDE

In the RSDE, an unbiased estimate of \( M(\omega ) \) in Sect. 2 is obtained as an \( \omega_{i} \)-weighted sum of the KDE values \( d_{i} \) at each point \( x_{i} \). Each \( d_{i} \) can be expressed as the inner product between a mapped test point and the mean of the mapped training points in kernel feature space, \( d_{i} = \frac{1}{n}\sum\nolimits_{j = 1}^{n} {k_{\sigma } (x_{i} ,x_{j} )} = \left\langle {\phi_{\sigma } (x_{i} ),\frac{1}{n}\sum\nolimits_{j = 1}^{n} {\phi_{\sigma } (x_{j} )} } \right\rangle \). As is well known, the mean estimator \( \hat{\theta } = \frac{1}{n}\sum\nolimits_{j = 1}^{n} {\phi_{\sigma } (x_{j} )} \) can be drastically influenced by outliers. The following proposition shows that the proposed algorithm improves the performance of the density estimate.

Proposition 1

In contrast to the RSDE, a small increment of \( d_{i} \in \Upomega_{N} \) can make the proposed model more robust against outliers, and improve the quality of the density estimates.

Proof The RSDE uses mean estimation for the KDE, which is not robust against outliers in the data. In our case, the larger the value of \( \left| {\Upomega_{k} } \right| \), \( k = 1, \ldots ,t \), the more coherent feature points there are in \( \Uptheta_{k} \), which implies that a point \( \phi_{\sigma } (x_{j} ) \in \Uptheta_{k} \) is unlikely to be an outlier. To reduce the influence of possible outliers among the training data, we would like to assign small weights to outliers. Instead of giving a concrete implementation algorithm, a feasible robust estimate of the sample mean is described below solely for the purpose of the proof.

$$ \hat{\theta } = \sum\limits_{{\phi_{\sigma } (x_{j} ) \in \Uptheta_{k} }} {\alpha_{0} \phi_{\sigma } (x_{j} )} + \sum\limits_{{\phi_{\sigma } (x_{j} ) \in \Uptheta_{N} - \Uptheta_{k} }} {\frac{1}{n}\phi_{\sigma } (x_{j} )} + \sum\limits_{{\phi_{\sigma } (x_{j} ) \in \Uptheta_{B} }} {\alpha_{1} \phi_{\sigma } (x_{j} )} $$

where \( \alpha_{0} \ge \tfrac{1}{n} \ge \alpha_{1} > 0 \), and \( \alpha_{0} \left| {\Uptheta_{k} } \right| + \tfrac{1}{n}\left| {\Uptheta_{N} - \Uptheta_{k} } \right| + \alpha_{1} \left| {\Uptheta_{B} } \right| = 1 \). It can be seen that the influence from \( \phi_{\sigma } (x_{i} ) \in \Uptheta_{B} \) is decreased. If \( \phi_{\sigma } (x_{i} ) \in \Uptheta_{k} \), then we have \( d_{i}^{*} = \left\langle {\phi_{\sigma } (x_{i} ),\hat{\theta }} \right\rangle = \left\langle {\phi_{\sigma } (x_{i} ),\sum\nolimits_{{\phi_{\sigma } (x_{j} ) \in \Uptheta_{k} }} {\alpha_{0} \phi_{\sigma } (x_{j} )} + \sum\nolimits_{{\phi_{\sigma } (x_{j} ) \in \Uptheta_{N} - \Uptheta_{k} }} {\tfrac{1}{n}\phi_{\sigma } (x_{j} )} + \sum\nolimits_{{\phi_{\sigma } (x_{j} ) \in \Uptheta_{B} }} {\alpha_{1} \phi_{\sigma } (x_{j} )} } \right\rangle > \left\langle {\phi_{\sigma } (x_{i} ),\tfrac{1}{n}\sum\nolimits_{j = 1}^{n} {\phi_{\sigma } (x_{j} )} } \right\rangle \).

Hence, \( d_{i}^{*} > d_{i} \), and there exists a small enough \( \lambda_{i} \) such that \( d_{i}^{*} = d_{i} + \lambda_{i} r_{i} > d_{i} \). Conversely, if we suppose that \( d_{i}^{*} \) is slightly larger than \( d_{i} = \left\langle {\phi_{\sigma } (x_{i} ),\frac{1}{n}\sum\nolimits_{j = 1}^{n} {\phi_{\sigma } (x_{j} )} } \right\rangle \), then writing \( d_{i}^{*} = \left\langle {\phi_{\sigma } (x_{i} ),\sum\nolimits_{j = 1}^{n} {\alpha_{j} \phi_{\sigma } (x_{j} )} } \right\rangle \) requires \( \alpha_{i} > \frac{1}{n} \). Therefore, a small increment of \( d_{i} \in \Upomega_{N} \) makes the proposed model more robust against outliers.

In addition, suppose that \( \omega_{1} = {\text{argmin}}_{\omega } {\text{ISE}}(\omega ) \) and \( \omega_{2} = {\text{argmin}}_{\omega } {\text{ISE}}^{*} (\omega ) \); then the density estimate based on \( \omega_{2} \) is more accurate than that based on \( \omega_{1} \), since \( {\text{ISE}}(\omega_{1} ) \ge {\text{ISE}}^{*} (\omega_{1} ) \ge {\text{ISE}}^{*} (\omega_{2} ).\, \square\)

The vector \( \left[ {\begin{array}{*{20}c} {D_{B} } & {D_{N}^{*} } \\ \end{array} } \right]^{\text{T}} \) defined in (5) can be interpreted as a more robust estimate of \( M(\omega ) \). It can be written in the form \( M^{*} (\omega ) = \sum\nolimits_{{d_{i} \in \Upomega_{B} }} {\omega_{i} d_{i} } + \sum\nolimits_{{d_{i} \in \Upomega_{N} }} {\omega_{i} (d_{i} + R_{i} )} \). In this case, the estimation error is bounded as follows: \( \left| {M(\omega ) - M^{*} (\omega )} \right| = \left| {\sum\nolimits_{{d_{i} \in \Upomega_{N} }} {\omega_{i} R_{i} } } \right| < \lambda_{1}. \)
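The bound follows directly from the simplex constraint and the construction of \( R \): since \( \omega_{i} \ge 0 \), \( \sum\nolimits_{i} {\omega_{i} } = 1 \), and \( R_{i} \le \lambda_{1} r_{i} \) with \( r_{i} < 1 \) (almost surely),

$$ \left| {M(\omega ) - M^{*} (\omega )} \right| = \sum\limits_{{d_{i} \in \Upomega_{N} }} {\omega_{i} R_{i} } \le \left( {\mathop {\max }\limits_{i} R_{i} } \right)\sum\limits_{{d_{i} \in \Upomega_{N} }} {\omega_{i} } \le \lambda_{1} \mathop {\max }\limits_{i} r_{i} < \lambda_{1} . $$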

As mentioned previously, the RSDE involves Gaussian kernels of bandwidth \( \sqrt 2 \sigma \) and \( \sigma \), which occur in \( Q \) and \( D \), respectively. The normalizing constants for these kernels are \( (4\pi \sigma^{2} )^{{ - {\ell \mathord{\left/ {\vphantom {\ell 2}} \right. \kern-0pt} 2}}} \) and \( (2\pi \sigma^{2} )^{{ - {\ell \mathord{\left/ {\vphantom {\ell 2}} \right. \kern-0pt} 2}}} \), respectively, so the ratio between them is \( 2^{{ - {\ell \mathord{\left/ {\vphantom {\ell 2}} \right. \kern-0pt} 2}}} \). If the dimension \( \ell \) is large enough, the linear term \( D \) dominates the quadratic term \( Q \). This implies that, for high dimensional data, it is hard to find coherent coefficients; in other words, the RSDE already yields a sparse solution on most high dimensional data, and there is no significant difference between the RSKDE and the RSDE. This agrees with our intuition that the representation of signals is easier in lower dimensions. For high dimensional data, many of the dimensions are often irrelevant, and these irrelevant dimensions can hide clusters in noisy data. It is common for all of the training data to be nearly equidistant from each other in very high dimensions [19]. Therefore, for a similar quality of estimates, the sparsity gained for lower dimensional data is much greater than that achieved for higher dimensional data.

4 Experimental results

We implement the proposed RSKDE in MATLAB based on the KDE Toolbox (written by Ihler and Mandel [20]) and evaluate its performance in density estimation. Its performance is then further validated on novelty detection and binary classification.

4.1 Density estimation

We experiment with one-dimensional data drawn from a heavily skewed distribution defined as \( p_{1} (x) = \tfrac{1}{8}\sum\nolimits_{i = 0}^{7} {g(x,\mu_{i} ,\sigma_{i} )} \), where \( \sigma_{i} = ({2 \mathord{\left/ {\vphantom {2 3}} \right. \kern-0pt} 3})^{i} \) and \( \mu_{i} = 3(\sigma_{i} - 1) \) [21]. Here, \( g(x,\mu ,\sigma ) \) is a univariate Gaussian density with mean \( \mu \) and standard deviation \( \sigma \). A set of \( n \) data samples is randomly drawn from the distribution to construct the KDE. The width of the kernel is found by the Rule of Thumb [22], and a separate test data set of 10,000 samples is used to calculate the \( L_{1} \) error and \( L_{2} \) error of the resulting estimate, as defined in [13]. The parameter m is set to 4. For \( n = 500 \), a typical result is shown in Fig. 1b. As we can see, and in contrast to Fig. 1a, the nonzero weighting coefficients are not concentrated in regions of space with greater probability mass; instead, one or several points with larger weighting coefficients represent the high probability mass. Therefore, the RSKDE achieves a much sparser estimator than the RSDE, and the resulting estimate is much closer to the true density. To demonstrate the effectiveness and robustness, we compare our model with several recent methods: the RSDE, the KD-tree based density reduction method of Ihler et al. [20], and sparse kernel density estimation (SKDE) with an \( L_{0} \) penalty [14]. The experiment is repeated 200 times for different sample sizes. The remaining data (as a percentage of the sample size) are shown in Fig. 3a, and the average \( L_{2} \) error (mean ± SD) between the true density and the respective density estimators against sample size is shown in Fig. 3b. From the results, it is clear that the proposed method provides a significant improvement in both sparsity and accuracy under the same experimental conditions.
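As an illustration of this setup (not the authors' exact protocol), the sketch below draws samples from the heavily skewed mixture, selects a Rule-of-Thumb bandwidth, and evaluates a grid-based \( L_{2} \)-type error for a plain Parzen estimate; the grid range and the error definition are assumptions, and the exact definitions in [13] may differ.

```python
import numpy as np


def sample_skewed(n, rng):
    """Draw n samples from p1(x) = (1/8) sum_i N(mu_i, s_i^2), s_i = (2/3)^i, mu_i = 3(s_i - 1)."""
    i = rng.integers(0, 8, size=n)
    s = (2.0 / 3.0) ** i
    return rng.normal(loc=3.0 * (s - 1.0), scale=s)


def p1(x):
    """True density of the heavily skewed mixture."""
    s = (2.0 / 3.0) ** np.arange(8)
    mu = 3.0 * (s - 1.0)
    comp = np.exp(-(x[:, None] - mu) ** 2 / (2 * s ** 2)) / (np.sqrt(2 * np.pi) * s)
    return comp.mean(axis=1)                     # equal mixture weights 1/8


def parzen(x_eval, X, sigma):
    """Plain Parzen window estimate with a 1-D Gaussian kernel."""
    k = np.exp(-(x_eval[:, None] - X[None, :]) ** 2 / (2 * sigma ** 2))
    return (k / (np.sqrt(2 * np.pi) * sigma)).mean(axis=1)


rng = np.random.default_rng(1)
X = sample_skewed(500, rng)
sigma = 1.06 * X.std() * len(X) ** (-1 / 5)      # Silverman's rule of thumb for 1-D data
grid = np.linspace(-6.0, 3.0, 2000)
err = p1(grid) - parzen(grid, X, sigma)
l2 = np.trapz(err ** 2, grid)                    # grid-based L2 error (one possible definition)
print(f'sigma = {sigma:.3f}, L2 error = {l2:.2e}')
```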

Fig. 3

a Plot of the remaining data (percentage of sample size) for the four related methods. b \( L_{2} \) error between the true density and the respective density estimators against sample size over 200 runs

To test robustness, we add uncorrelated outliers drawn from a random distribution over [−4, 4]. For n = 650 (500 data samples generated from the previous probability density function plus 150 outliers), a typical result is shown in Fig. 4. The \( L_{2} \) error of the RSKDE is only slightly better than that of the RSDE, but the RSKDE has a remarkable advantage in sparsity: only 20 nonzero weighting coefficients are needed for the RSKDE, whereas 114 nonzero weighting coefficients are required for the RSDE.

Fig. 4

a The true density and the RSDE; each of the 114 nonzero weighting coefficients is placed at the appropriate sample data point. The corresponding \( L_{1} \) and \( L_{2} \) errors between them are 0.0296 and 0.00197. b The true density and the RSKDE; each of the 20 nonzero weighting coefficients is placed at the appropriate sample data point. The corresponding \( L_{1} \) and \( L_{2} \) errors between them are 0.0291 and 0.00192

To compare the results of the proposed algorithm further, the experiment is repeated 200 times with 500 fixed data samples and different numbers of outliers. The average \( L_{1} \) error (mean ± SD), \( L_{2} \) error (mean ± SD) and the number of nonzero weighting coefficients against sample size (data sample size + outlier size) are shown in Table 1. After adding outliers to the original data set, it is clear from the results that the RSKDE is consistently better than the RSDE in both sparsity and accuracy of the estimates. Moreover, the number of nonzero weighting coefficients provided by the proposed model remains fairly consistent as the number of outliers increases.

Table 1 \( L_{1} \) error and \( L_{2} \) error between the true density and the respective density estimators against sample size over 200 runs

4.2 Novelty detection

Novelty detection is the identification of new or unknown data that a machine learning system was not aware of during training. It is a one-class classification problem: the known data form one class, and a novelty detection method tries to identify outliers that differ from the distribution of the ordinary data. The RSKDE for novelty detection is tested on two real-world data sets, Banana and Phoneme, both available at http://sci2s.ugr.es/keel. The Banana dataset contains a total of 5,300 samples over two classes; the novelty detectors are trained on the first 400 samples of the first class, and the remaining samples are used for testing. The Phoneme dataset has two classes, 5,404 samples and five features; the aim is to distinguish between nasal (class 0) and oral (class 1) sounds. The novelty detectors are trained on the first 730 samples of class 0, and the remaining samples are used for testing.

The density estimator \( \hat{f}(x;\omega ,\sigma ) \) obtained from the training set gives us a quantitative measure of the degree of novelty of each test sample: samples for which the estimate satisfies \( \hat{f}(x;\omega ,\sigma ) < \rho \) for some threshold \( \rho \) are rejected [23]. Thus, any sample whose likelihood \( \hat{f}(x;\omega ,\sigma ) \) falls below the threshold is considered novel, and every test sample is classified into one of two classes: those similar to the training data and those that are novel. Therefore, we adopt the standard definitions [24] used in binary classification to compare the results of the RSKDE with existing algorithms. TP and TN stand for the number of true positives and true negatives, respectively, while FP and FN denote the numbers of cases misclassified as positive and as negative, respectively. In two-class problems, the accuracy rate on the positives, called sensitivity, is defined as TP/(TP + FN), whereas the accuracy rate on the negative class, also known as specificity, is TN/(TN + FP). Classification accuracy is (TP + TN)/N, where N = TP + TN + FP + FN is the total number of cases. Table 2 compares the RSKDE for novelty detection with existing algorithms. Here, \( N_{1} \) is the number of training data and \( N_{2} \) is the number of test data. Likelihood cross-validation is employed to select the kernel width for a fair comparison, and in the k-nearest neighbor algorithm, k is set to 3. The weighting coefficient \( \omega \) of the RSKDE is obtained by optimizing (5) over the training samples. We can see that on both datasets the RSKDE outperforms the KDE and the RSDE.
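A minimal sketch of this decision rule and of the accompanying metrics is given below, assuming a fitted weight vector w, reduced-set centers Xr and kernel width sigma (hypothetical names, not tied to the authors' implementation):

```python
import numpy as np


def rskde_density(x_eval, Xr, w, sigma):
    """Evaluate f_hat(x; w, sigma) = sum_i w_i k_sigma(x, x_i) at the rows of x_eval."""
    d = Xr.shape[1]
    sq = ((x_eval[:, None, :] - Xr[None, :, :]) ** 2).sum(-1)
    k = (2.0 * np.pi * sigma ** 2) ** (-d / 2.0) * np.exp(-sq / (2.0 * sigma ** 2))
    return k @ w


def novelty_metrics(scores_pos, scores_neg, rho):
    """Decision rule: a sample is 'ordinary' if f_hat >= rho and 'novel' otherwise."""
    TP = np.sum(scores_pos >= rho)   # ordinary test samples correctly accepted
    FN = np.sum(scores_pos < rho)
    TN = np.sum(scores_neg < rho)    # novel test samples correctly rejected
    FP = np.sum(scores_neg >= rho)
    sensitivity = TP / (TP + FN)
    specificity = TN / (TN + FP)
    accuracy = (TP + TN) / (TP + TN + FP + FN)
    return sensitivity, specificity, accuracy
```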

Table 2 Performance of the kernel density estimation (KDE), reduced set density estimation (RSDE) [25], Gaussian mixture model (GMM) [26], k-nearest neighbor algorithm (k-NN), one-class support vector machines SVM [27] and the proposed RSKDE

4.3 Binary classification

This section further evaluates the RSKDE's performance on the two-class classification problem. The experiments are carried out on the datasets used for novelty detection. The number of training samples is 1,000, and the remaining samples are used for testing; each dataset is randomly permuted and partitioned into training and test sets ten times. We first estimate the two class-conditional density functions \( \hat{f}(x;\omega ,\sigma |C_{0} ) \) and \( \hat{f}(x;\omega ,\sigma |C_{1} ) \) for classes \( C_{0} \) and \( C_{1} \) from the training data, and then apply the Bayes rule to the test data set and calculate the corresponding accuracy (ACC).

$$ \left. {\begin{array}{*{20}c} {{\text{if}}\;\hat{f}(x;\omega ,\sigma |C_{0} ) \,\ge\, \hat{f}(x;\omega ,\sigma |C_{1} ),\;x \in C_{0} } \\ {\quad \quad \quad \quad \quad {\text{else}},\quad \quad \quad \quad \quad \quad \quad x \in C_{1} } \\ \end{array} } \right\} $$
(7)

During training, the kernel width \( \sigma \) is tuned by likelihood cross-validation, and the weighting coefficient \( \omega \) is obtained by optimizing (5) over the training samples. Table 3 compares the performance of the six related methods. As can be seen, the mean test accuracy of the RSKDE is 0.898, only slightly higher than the 0.894 of the KDE, but the RSKDE has a remarkable advantage in test complexity: only 187 samples, on average, are needed in the reduced set for the RSKDE classifier, whereas all 1,000 training samples are required for the KDE classifier. On average, the RSKDE classifier reduces the test computational cost by roughly 80 %. For high dimensional data, the results show no significant difference between the RSKDE and the RSDE.

Table 3 Performance of the six related methods

5 Conclusion

In this paper, a novel robust sparse kernel density estimator based on the RSDE is presented. Instead of enforcing a sparse representation through a regularization term, the proposed model introduces randomness into the plug-in estimate of the RSDE and yields a sparser representation in the weighting coefficients. By means of SMO, the randomness allows clusters of small nonzero weighting coefficients to be merged into one or several points with larger weighting coefficients. The proposed model shows good performance in both sparsity and accuracy of the estimates for low dimensional data, and the numerical experiments show promising results.