1 Introduction

Density estimation is widely used in statistical feature models in computer vision and pattern recognition. Given a set of training data for the features (e.g., intensity, shape, texture), the underlying probability density can be described by a simple or a more complicated distribution function. Uniform distributions [1], Gaussian distributions [2] and nonparametric functionals [3] have all been considered in the past. By contrast, the kernel density estimator (KDE), also called the Parzen window estimate, is an efficient nonparametric approach to modeling nonlinear distributions of training data without assuming that the form of the underlying density is known. In this technique the density is estimated by a sum of kernel functions, one per training point, so the number of kernels equals the size of the training set. When the training set is very large, the KDE suffers from high computational cost and becomes intractable for subsequent use. In addition, the feature space is complex and noisy, and often not all the training data obey the same parametric model. This leads to a need for robust estimators that can handle data in the presence of severe contaminations, i.e., outliers. In this work, we focus on the problem of how to employ a small percentage of the available data sample to provide a robust and highly accurate density estimator.

KDE is frequently used for various computer vision problems, such as mean shift [4], background subtraction [5], object tracking [6], image segmentation [3, 7] and classification [8]. Even though there have been several attempts to improve its computational efficiency [9, 10], its very high memory requirements and computational complexity inhibit the use of kernel density estimation in real applications. In [11], the support vector approach was used to obtain an estimate from the training data in the form of a mixture of densities. This approach has no additional free parameters; however, for large sample sizes it requires \( O(n^{3} ) \) optimization routines. The reduced set density estimator (RSDE) was proposed by Girolami and He [12] to solve this problem by providing a KDE that employs only a small subset of the available training data. It is optimal in the integrated squared error (ISE) between the unknown true density and the RSDE. In contrast to the support vector approach, the RSDE only requires \( O(n^{2} ) \) optimization routines to provide a similar level of performance. In order to increase the sparsity of the weight coefficients further, Chen et al. [13] constructed a sparse kernel density estimate using an orthogonal forward regression technique with the classical Parzen window estimate as the desired response. In addition, sparse kernel density estimation has attracted attention through the integration of an explicit sparsity constraint on the weight coefficients as a regularization term [14, 15]. These methods create a trade-off between sparsity and the quality of the density estimate: they produce sparsity in the samples at the cost of a slight reduction in the quality of the estimates.

Instead of creating a new probability density estimator, we generalize the RSDE to provide more satisfactory performance. In this paper, our work focuses on the RSDE based on the KDE with a Gaussian kernel. In the RSDE, there exist many nonzero, coherent weighting coefficients that cluster in regions of space with greater probability mass, especially for low dimensional data. In order to break the relationship between coherent coefficients, our idea is to introduce randomness into the plug-in estimate of the weighting coefficients. By means of sequential minimal optimization (SMO), these coherent weighting coefficients can be replaced approximately by one or several larger incoherent weighting coefficients. In contrast to the RSDE, the proposed model improves both the sparsity and the accuracy of the density estimate. Moreover, an analysis in feature space shows that this technique is robust to outliers.

This paper is organized as follows. In Sect. 2 the RSDE is reviewed briefly, and in Sect. 3 the proposed robust sparse kernel density estimation by random fluctuations for coherent coefficients is presented. Experimental results are provided in Sect. 4, and the conclusions in Sect. 5.

2 Reduced set density estimator

Given \( n \) data samples \( x_{1} ,x_{2} , \ldots ,x_{n} \in R^{\ell } \), each with a weight \( \omega_{i} \ge 0,\sum\nolimits_{i = 1}^{n} {\omega_{i} } = 1 \), the underlying density can be estimated by a KDE with weight coefficients,

$$ \hat{f}(x;\omega ) = \sum\limits_{i = 1}^{n} {\omega_{i} k_{\sigma } (x,x_{i} )} , $$
(1)

where \( k_{\sigma } (x,x_{i} ) \) is a kernel function (satisfying non-negativity and normalization conditions), and \( \sigma \) is a parameter which controls the kernel width. The most commonly used kernel function is a Gaussian kernel

$$ k_{\sigma } (x,x_{i} ) = (2\pi \sigma^{2} )^{{ - \tfrac{\ell }{2}}} { \exp }\left\{ { - \frac{{\left\| {x - x_{i} } \right\|^{2} }}{{2\sigma^{2} }}} \right\} $$
(2)

Girolami and He [12] estimated KDE by minimizing ISE between the true density \( f(x) \) and the estimated density \( \hat{f}(x;\omega ) \), which was defined as

$$ {\text{ISE}}(\omega ) = \int {\left| {f(x) - \hat{f}(x;\omega )} \right|^{2} {\text{d}}x} = \int {\hat{f}^{2} (x;\omega ){\text{d}}x} - 2\int {\hat{f}(x;\omega )f(x){\text{d}}x} + \int {f^{2} (x){\text{d}}x} $$
(3)

Notice that the first term is

$$ \int {\hat{f}^{2} (x;\omega ){\text{d}}x} = \int {\left( {\sum\limits_{i = 1}^{n} {\omega_{i} k_{\sigma } (x,x_{i} )} } \right)^{2} {\text{d}}x} = \sum\limits_{i = 1}^{n} {\sum\limits_{j = 1}^{n} {\omega_{i} \omega_{j} \int {k_{\sigma } (x,x_{i} )k_{\sigma } (x,x_{j} ){\text{d}}x} } } = \omega^{\text{T}} Q\omega $$

Here, \( \omega = [\omega_{1} ,\omega_{2} , \ldots ,\omega_{n} ]^{\text{T}} \) and \( Q \) is an \( n \times n \) matrix whose elements are \( Q_{ij} = \int {k_{\sigma } (x,x_{i} )k_{\sigma } (x,x_{j} ){\text{d}}x} = k_{\sqrt 2 \sigma } (x_{i} ,x_{j} ) \), by the convolution property of the Gaussian kernel. Since we do not know the true \( f(x) \), we need to estimate the second term, which is denoted by \( M(\omega ) \). An unbiased estimate of it for a KDE can be written as

$$ M(\omega ) = \int {\hat{f}(x;\omega )f(x){\text{d}}x} \approx \tfrac{1}{n}\sum\limits_{i = 1}^{n} {\sum\limits_{j = 1}^{n} {\omega_{i} k_{\sigma } (x_{i} ,x_{j} )} } = \sum\limits_{i = 1}^{n} {\omega_{i} \left( {\tfrac{1}{n}\sum\limits_{j = 1}^{n} {k_{\sigma } (x_{i} ,x_{j} )} } \right)} = \sum\limits_{i = 1}^{n} {\omega_{i} d_{i} } = \omega^{\text{T}} D $$

Here, the KDE at the point \( x_{i} \) is denoted by \( d_{i} = \tfrac{1}{n}\sum\nolimits_{j = 1}^{n} {k_{\sigma } (x_{i} ,x_{j} )} \), and \( D = [d_{1} ,d_{2} , \ldots ,d_{n} ]^{\text{T}} \). Notice that \( D = K1_{n} \), where \( K \) is the Gram matrix whose elements are defined as \( K_{ij} = k_{\sigma } (x_{i} ,x_{j} ) \), and \( 1_{n} \) is the \( n \times 1 \) vector whose elements are all \( \tfrac{1}{n} \). The last term in (3) can be dropped because it does not depend on \( \omega \); halving the remaining expression does not change the minimizer, so the minimum-ISE weights are obtained from

$$ \begin{gathered} \hat{\omega } = { \arg }\mathop { \hbox{min} }\limits_{\omega } \left\{ {{\text{ISE}}(\omega ) = \tfrac{1}{2}\omega^{\text{T}} Q\omega - \omega^{\text{T}} D} \right\} \hfill \\ {\text{s}} . {\text{t}} .\quad \omega^{\text{T}} 1 = 1\quad {\text{and}}\quad \omega_{i} \ge 0\;\forall i \hfill \\ \end{gathered} $$
(4)

Observe that the matrix \( Q \) is positive semi-definite. Thus, the objective function is convex with respect to \( \omega \), and (4) can be solved using SMO [12, 16].
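As a concrete illustration of (4), and not the authors' MATLAB implementation, the following Python sketch builds \( Q \) and \( D \) for the Gaussian kernel and solves the QP with a generic SLSQP solver standing in for SMO; the function names, the bandwidth value and the toy data are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.spatial.distance import cdist


def gaussian_kernel(X, Y, sigma):
    """Gaussian kernel matrix k_sigma(x_i, y_j), cf. Eq. (2), for row-vector samples."""
    d = X.shape[1]
    sq = cdist(X, Y, 'sqeuclidean')
    return (2.0 * np.pi * sigma ** 2) ** (-d / 2.0) * np.exp(-sq / (2.0 * sigma ** 2))


def rsde_weights(X, sigma):
    """Minimize 0.5*w'Qw - w'D over the simplex, cf. Eq. (4).
    A generic SLSQP solver stands in for SMO, purely for illustration."""
    n = X.shape[0]
    Q = gaussian_kernel(X, X, np.sqrt(2.0) * sigma)  # Q_ij = k_{sqrt(2)*sigma}(x_i, x_j)
    D = gaussian_kernel(X, X, sigma).mean(axis=1)    # d_i = (1/n) sum_j k_sigma(x_i, x_j)
    obj = lambda w: 0.5 * w @ Q @ w - w @ D
    grad = lambda w: Q @ w - D
    cons = ({'type': 'eq', 'fun': lambda w: w.sum() - 1.0},)
    res = minimize(obj, np.full(n, 1.0 / n), jac=grad,
                   bounds=[(0.0, None)] * n, constraints=cons, method='SLSQP')
    return res.x, Q, D


if __name__ == '__main__':
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 1))                    # toy 1-D sample
    w, Q, D = rsde_weights(X, sigma=0.3)             # bandwidth chosen arbitrarily here
    print('nonzero weights:', int(np.sum(w > 1e-6)))
```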

3 Robust sparse kernel density estimation

The RSDE was shown in [12] to provide a sparse representation in the weighting coefficients. The authors observed that the weights obtained from minimizing the estimated ISE were sparse. This is because the linear term \( \omega^{\text{T}} D \) in (4) is a convex combination of positive numbers; such a combination is maximized by assigning unit weight to the largest element and zero weight to the rest. SMO is a simple algorithm that can quickly solve the quadratic programming (QP) optimization problem (4) by breaking the large QP problem into a series of the smallest possible QP subproblems. The optimal solution must satisfy the following rules

  • (R1) If \( \omega_{i} = 0,\omega_{j} > 0 \) then \( I_{i} \ge I_{j} \).

  • (R2) If \( \omega_{i} ,\omega_{j} > 0 \) then \( I_{i} = I_{j} \).

Here, \( I_{i} = \sum\nolimits_{j = 1}^{n} {Q_{ij} \omega_{j} } - d_{i} \). Clearly, the more points satisfy rule (R1), the sparser the solution. In rule (R2), \( \omega_{i} \) and \( \omega_{j} \) are positive and updated only when \( I_{i} \ne I_{j} \). Therefore, we wish to reduce the number of points that satisfy rule (R2) in order to increase the sparsity of the weight coefficients further. Due to the convex constraint, the coefficients obtained from SMO are naturally sparse. If we suppose that the numerical precision is sufficient (\( I_{i} \ne I_{j} ,i \ne j \)), then SMO eventually yields only one nonzero weighting coefficient in \( \omega \). This implies that the true density \( f(x) \) is estimated by a single kernel function with nonzero weight coefficient \( \omega_{k} \), \( \hat{f}(x;\omega ) = \omega_{k} k_{\sigma } (x,x_{k} ) \). Obviously, this is a time-consuming way to enforce sparsity, and the resulting density estimator is inaccurate. In practice, even at increased numerical precision, the RSDE leaves many relatively close points with small nonzero weighting coefficients in the optimal solution. As can be seen easily in Fig. 1a, these nonzero weighting coefficients are clustered in regions of space with greater probability mass. If the points in each cluster could be replaced approximately by one or several points with larger weighting coefficients, the sparsity of the weight coefficients could be improved further.
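To make rules (R1) and (R2) concrete, the short check below computes \( I_{i} = \sum\nolimits_{j} {Q_{ij} \omega_{j} } - d_{i} \) and verifies both conditions up to a numerical tolerance. It is a sketch that assumes arrays Q, D and a candidate solution w as produced, for example, by the previous snippet.

```python
import numpy as np


def check_kkt_rules(Q, D, w, tol=1e-6):
    """Verify rules (R1) and (R2) for a candidate solution w of problem (4)."""
    I = Q @ w - D                     # I_i = sum_j Q_ij * w_j - d_i
    active = w > tol                  # indices with w_i > 0
    if not active.any():
        return True, True
    I_act = I[active]
    # (R2): all positive-weight points share the same I_i (up to tolerance)
    r2 = bool((I_act.max() - I_act.min()) < tol)
    # (R1): zero-weight points have I_i >= I_j for every positive-weight j
    r1 = (~active).sum() == 0 or bool((I[~active] >= I_act.max() - tol).all())
    return r1, r2
```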

Fig. 1

a The true density and the RSDE; each of the 87 nonzero weighting coefficients is placed at the appropriate sample data point and its value is denoted by the length of the vertical line. The corresponding \( L_{1} \) and \( L_{2} \) errors between them are 0.0147 and \( 4.54 \times 10^{ - 4} \). b The true density and the RSKDE; each of the 20 nonzero weighting coefficients is placed at the appropriate sample data point. The corresponding \( L_{1} \) and \( L_{2} \) errors between them are 0.014 and \( 3.71 \times 10^{ - 4} \)

For the Gaussian kernel, note that there exists a feature mapping \( \phi_{\sigma } :{\mathbb{R}}^{\ell } \to {\mathbb{R}}^{L} (\ell < L) \) which maps a feature vector into a high dimensional feature space, \( x \to \phi_{\sigma } (x) \), such that \( K_{ij} = k_{\sigma } (x_{i} ,x_{j} ) = \left\langle {\phi_{\sigma } (x_{i} ),\phi_{\sigma } (x_{j} )} \right\rangle \) [17]. The KDE with Gaussian kernel can then be represented as the inner product between a mapped test point and the centroid of the mapped training points in kernel feature space [18]: \( d_{i} = \frac{1}{n}\sum\nolimits_{k = 1}^{n} {k_{\sigma } (x_{i} ,x_{k} )} = \left\langle {\phi_{\sigma } (x_{i} ),\frac{1}{n}\sum\nolimits_{k = 1}^{n} {\phi_{\sigma } (x_{k} )} } \right\rangle \). Here, \( \frac{1}{n}\sum\nolimits_{k = 1}^{n} {\phi_{\sigma } (x_{k} )} \) is a fixed nonzero vector that does not depend on \( x_{i} \). Similarly, there exists \( \phi_{\sqrt 2 \sigma } \) such that \( Q_{ij} = k_{\sqrt 2 \sigma } (x_{i} ,x_{j} ) = \left\langle {\phi_{\sqrt 2 \sigma } (x_{i} ),\phi_{\sqrt 2 \sigma } (x_{j} )} \right\rangle \) and \( H_{i} = \sum\nolimits_{k = 1}^{n} {Q_{ik} \omega_{k} } = \left\langle {\phi_{\sqrt 2 \sigma } (x_{i} ),\sum\nolimits_{k = 1}^{n} {\omega_{k} \phi_{\sqrt 2 \sigma } (x_{k} )} } \right\rangle \). By analysis in feature space, we have

$$ I_{i} - I_{j} = \left\langle {\phi_{\sqrt 2 \sigma } (x_{i} ) - \phi_{\sqrt 2 \sigma } (x_{j} ),\sum\limits_{k = 1}^{n} {\omega_{k} \phi_{\sqrt 2 \sigma } (x_{k} )} } \right\rangle - \left\langle {\phi_{\sigma } (x_{i} ) - \phi_{\sigma } (x_{j} ),\frac{1}{n}\sum\limits_{k = 1}^{n} {\phi_{\sigma } (x_{k} )} } \right\rangle $$

If \( d_{i} = d_{j} \), then we have \( \left\langle {\phi_{\sigma } (x_{i} ) - \phi_{\sigma } (x_{j} ),\frac{1}{n}\sum\nolimits_{k = 1}^{n} {\phi_{\sigma } (x_{k} )} } \right\rangle = 0 \), and \( \phi_{\sigma } (x_{i} ) = \phi_{\sigma } (x_{j} ) \). Thus, \( \phi_{\sqrt 2 \sigma } (x_{i} ) = \phi_{\sqrt 2 \sigma } (x_{j} ) \), and \( I_{i} = I_{j} \).

In fact, \( d_{i} ,d_{j} \) are unequal but may be very close. If \( d_{i} \) is very close to \( d_{j} \), then their feature points \( \phi_{\sigma } (x_{i} ) \) and \( \phi_{\sigma } (x_{j} ) \) are also very close, which implies that the condition \( I_{i} = I_{j} \) in rule (R2) is easily met. In other words, if \( d_{i} ,d_{j} \) are close enough and \( \omega_{i} \) is nonzero, then \( \omega_{j} \) is more likely to be nonzero as well. Hence, there are two cases for \( d_{i} \) and \( d_{j} \) in rule (R2): (a) \( |d_{i} - d_{j} | < \delta \) (\( \delta \) small enough), and (b) \( |d_{i} - d_{j} | \ge \delta \). If \( \omega_{i} \) and \( \omega_{j} \) satisfy case (a) in rule (R2), then \( \omega_{i} ,\omega_{j} > 0 \) are called coherent coefficients.

3.1 Random perturbation of coherent coefficients

In this section, we aim to break the relationship between coherent coefficients, which are clustered in regions of space with greater probability mass. A natural approach is to introduce randomness into \( D \) so as to produce incoherence among most of the \( d_{i} \in D \). Based on the existing structure of \( D \), small random values are added to the subset of entries that lie very close together, while the rest are kept unchanged. The randomness makes the elements in each cluster move apart from each other. Assume that there are \( n_{0} \) coherent coefficients \( \omega_{1} , \ldots ,\omega_{{n_{0} }} > 0 \) whose corresponding \( d_{1} , \ldots ,d_{{n_{0} }} \) are close enough. After adding random values to these \( d_{i} \), the relationship \( I_{1} = \cdots = I_{{n_{0} }} \) in rule (R2) no longer holds. Consequently, the number of points falling into case (a) of rule (R2) is reduced, and some of them are reclassified under rule (R1), so the weight coefficients in the optimal solution become sparser. Assume that all the elements of \( D \) are collected into a set \( \Upomega \). We then introduce the following definitions to partition the set \( \Upomega \).

Definition 1

Coherent relation \( \approx \) is defined as: \( d_{i} \approx d_{j} \Leftrightarrow \left\lfloor {d_{i} } \right\rfloor_{m} = \left\lfloor {d_{j} } \right\rfloor_{m} ,\;d_{i} ,d_{j} \in \Upomega \).

There are many ways to describe the coherent relation. Here, since \( d_{i} \) is a positive decimal number, the truncated m-digit approximation to it is the number \( \left\lfloor {d_{i} } \right\rfloor_{m} \) obtained by simply discarding all digits beyond the mth. Hence, \( \left| {d_{i} - \left\lfloor {d_{i} } \right\rfloor_{m} } \right| < 10^{ - m} \). Clearly, the coherent relation \( d_{i} \approx d_{j} \) is an equivalence relation that identifies those numbers of \( \Upomega \) that lie very close together. Moreover, this relation gives rise to a partition of \( \Upomega \) into equivalence classes.
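As an illustration of Definitions 1 and 2 (the helper names are our own, not the authors'), the following sketch groups the entries of \( D \) into equivalence classes by their truncated m-digit approximations:

```python
import numpy as np
from collections import defaultdict


def truncate_m(d, m):
    """Truncated m-digit approximation: discard every digit beyond the m-th."""
    return np.floor(d * 10 ** m) / 10 ** m


def coherent_partition(D, m=4):
    """Partition the entries of D into equivalence classes of the coherent relation
    d_i ~ d_j  <=>  trunc_m(d_i) == trunc_m(d_j)  (Definition 1)."""
    classes = defaultdict(list)
    for i, d in enumerate(D):
        classes[int(np.floor(d * 10 ** m))].append(i)  # key = the m truncated digits
    # order classes by decreasing cardinality, as in Definition 2
    return sorted(classes.values(), key=len, reverse=True)
```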

Definition 2

Coherent set is an equivalence class defined as: \( \Upomega (\tilde{d}_{i} ) = \left\{ {d_{j} |(d_{j} \in \Upomega ) \wedge (\tilde{d}_{i} \approx d_{j} )} \right\} \). Here, \( \tilde{d}_{i} \) is called a generator. The partition induced by the coherent relation is given by: \( \Uppi (\Upomega , \approx ) = \left\{ {\Upomega_{1} , \ldots ,\Upomega_{c} } \right\},c \le \left| \Upomega \right| \), where \( \left| \Upomega \right| \) is the cardinality of \( \Upomega \), and \( \left| {\Upomega_{1} } \right| \ge \left| {\Upomega_{2} } \right| \ge \cdots \ge \left| {\Upomega_{c} } \right| \). Let \( \tilde{d}_{i} \) be the corresponding generators of \( \Upomega_{i} \), \( i = 1,2, \ldots ,c \), and they form a set \( \Upomega_{B} = \left\{ {\tilde{d}_{1} ,\tilde{d}_{2} , \ldots ,\tilde{d}_{c} } \right\} \). Subsequently, we define \( \Upomega_{N} = \Upomega - \Upomega_{B} \), and obtain the corresponding partition \( \Uppi (\Upomega_{N} , \approx ) = \left\{ {\Upomega_{1}^{N} ,\Upomega_{2}^{N} , \ldots ,\Upomega_{t}^{N} } \right\} \), where \( \left| {\Upomega_{1}^{N} } \right| \ge \left| {\Upomega_{2}^{N} } \right| \ge \cdots \ge \left| {\Upomega_{t}^{N} } \right| \). Obviously, \( t < c \) and \( \left| {\Upomega_{i}^{N} } \right| = \left| {\Upomega_{i} } \right| - 1,\forall i = 1,2, \ldots ,t \). Here, we select an appropriate m or magnify the values of \( \Upomega \) for partition such that \( \left| {\Upomega_{N} } \right| < \left| {\Upomega_{B} } \right| < n \).

Definition 3

\( \phi_{\sigma } (x_{i} ) \) and \( \phi_{\sigma } (x_{j} ) \) are called coherent feature points, such that \( \phi_{\sigma } (x_{i} ) \approx \phi_{\sigma } (x_{j} ) \Leftrightarrow d_{i} \approx d_{j} \). Since there is a one-to-one correspondence between \( d_{i} \) and \( \phi_{\sigma } (x_{i} ) \), we define the corresponding feature sets \( \Uptheta = \left\{ {\phi_{\sigma } (x_{1} ),\phi_{\sigma } (x_{2} ), \ldots ,\phi_{\sigma } (x_{n} )} \right\} \), \( \Uptheta_{B} \), and \( \Uptheta_{N} = \left\{ {\Uptheta_{1} ,\Uptheta_{2} , \ldots ,\Uptheta_{t} } \right\} \) of \( \Upomega \), \( \Upomega_{B} \), and \( \Upomega_{N} \), respectively.

Given a coherent relation \( \approx \), the set \( \Upomega \) is divided into \( \Upomega_{B} \) and \( \Upomega_{N} \) (Fig. 2). Our method introduces randomness into \( \Upomega_{N} \) and keeps \( \Upomega_{B} \) unchanged. For any \( d \in \Upomega_{i}^{N} ,i = 1,2, \ldots ,t \), we obtain \( \bar{\Upomega }_{i}^{N} \) by setting \( d^{*} = d + \lambda_{i} r \), where \( r \) is a random value drawn from a uniform distribution on the interval [0, 1], and \( \lambda_{i} \) is a scaling parameter for \( \Upomega_{i}^{N} \) with \( \lambda_{1} \ge \lambda_{2} \ge \cdots \ge \lambda_{t} \). In our experiments, a simple rule is used to define these scaling parameters, \( \lambda_{i} = \left( {d_{i} - \left\lfloor {d_{i} } \right\rfloor_{m} } \right)\left| {\Upomega_{i}^{N} } \right| \). Then, \( \Upomega_{N}^{*} = \left\{ {\bar{\Upomega }_{1}^{N} ,\bar{\Upomega }_{2}^{N} , \ldots ,\bar{\Upomega }_{t}^{N} } \right\} \) is obtained, and \( \Upomega_{B} ,\Upomega_{N} ,\Upomega_{N}^{*} \) can be written in vector form as \( D_{B} ,D_{N} ,D_{N}^{*} \). After rearrangement, the proposed ISE approximation model is minimized as

$$ \begin{gathered}\left[ {\begin{array}{*{20}c} {\hat{\omega }_{B} } & {\hat{\omega }_{N} } \\ \end{array} } \right] = \mathop {\text{argmin}}\limits_{{\omega_{B} ,\omega_{N} }} \left\{ {{\text{ISE}}^{*} (\omega_{B} ,\omega_{N} ) = \frac{1}{2}\left[ {\begin{array}{*{20}c} {\omega_{B} } & {\omega_{N} } \\ \end{array} } \right]\left[ {\begin{array}{*{20}c} {Q_{BB} } & {Q_{BN} } \\ {Q_{NB} } & {Q_{NN} } \\ \end{array} } \right]\left[ {\begin{array}{*{20}c} {\omega_{B} } \\ {\omega_{N} } \\ \end{array} } \right] - \left[ {\begin{array}{*{20}c} {\omega_{B} } & {\omega_{N} } \\ \end{array} } \right]\left[ {\begin{array}{*{20}c} {D_{B} } \\ {D_{N}^{*} } \\ \end{array} } \right]} \right\}, \end{gathered}$$
(5)

which is called the robust sparse kernel density estimation (RSKDE). In contrast to the RSDE, we use \( D_{N}^{*} = D_{N} + R \) instead of \( D_{N} \). Here, \( R = \left[ {R_{1} ,R_{2} , \ldots ,R_{{b_{t} }} } \right]^{\text{T}} = \left[ {\lambda_{1} r_{1} , \ldots ,\lambda_{1} r_{{b_{1} }} ,\lambda_{2} r_{{b_{1} + 1}} , \ldots ,\lambda_{2} r_{{b_{2} }} , \ldots ,\lambda_{t} r_{{b_{t} }} } \right]^{\text{T}} , \) where \( b_{j} = \sum\nolimits_{i = 1}^{j} {\left| {\bar{\Upomega }_{i}^{N} } \right|} ,j = 1,2, \ldots ,t \), and \( r_{i} \) is a random value drawn from a uniform distribution on the interval \( [0,1],\,i = 1,2, \ldots ,b_{t} \). Observe that the objective function in (5) remains convex with respect to \( \omega \), so SMO can also be used to solve this problem. The set partitioning is easy to implement in a computer program. Our implementation approach is summarized in Algorithm 1.

Fig. 2

A diagram of set partition

Algorithm 1 Robust sparse kernel density estimation
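As a sketch of one plausible reading of Algorithm 1 (not the authors' code), the snippet below reuses the helpers gaussian_kernel, coherent_partition and truncate_m defined in the earlier sketches. The scaling rule \( \lambda_{i} = \left( {d_{i} - \left\lfloor {d_{i} } \right\rfloor_{m} } \right)\left| {\Upomega_{i}^{N} } \right| \) follows the text; the choice of solver and the remaining details are assumptions.

```python
import numpy as np
from scipy.optimize import minimize

# gaussian_kernel, coherent_partition and truncate_m are the helpers
# defined in the earlier sketches.


def rskde_weights(X, sigma, m=4, rng=None):
    """Sketch of Algorithm 1: perturb the coherent entries of D, then solve (5)."""
    rng = np.random.default_rng() if rng is None else rng
    n = X.shape[0]
    Q = gaussian_kernel(X, X, np.sqrt(2.0) * sigma)
    D = gaussian_kernel(X, X, sigma).mean(axis=1)

    # Partition the entries of D into coherent sets (Definitions 1 and 2).
    D_star = D.copy()
    for cls in coherent_partition(D, m):
        if len(cls) < 2:
            continue                          # singletons stay in Omega_B, unchanged
        others = cls[1:]                      # keep one generator per class in Omega_B
        for i in others:                      # Omega_i^N: perturb the remaining entries
            lam = (D[i] - truncate_m(D[i], m)) * len(others)
            D_star[i] = D[i] + lam * rng.uniform(0.0, 1.0)

    # Solve the perturbed QP (5); a generic solver again replaces SMO.
    obj = lambda w: 0.5 * w @ Q @ w - w @ D_star
    grad = lambda w: Q @ w - D_star
    cons = ({'type': 'eq', 'fun': lambda w: w.sum() - 1.0},)
    res = minimize(obj, np.full(n, 1.0 / n), jac=grad,
                   bounds=[(0.0, None)] * n, constraints=cons, method='SLSQP')
    return res.x
```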

In particular, the proposed model can be viewed as a sparse KDE with a randomly weighted \( L_{1} \) penalty, and can also be written as

$$ \hat{\omega } = { \arg }\mathop { \hbox{min} }\limits_{\omega } {\text{ISE}}(\omega ) - \beta \left\| {P\omega } \right\|_{1} \quad {\text{s}} . {\text{t}} .\;\omega^{\text{T}} 1 = 1\;{\text{and}}\;\omega_{i} \ge 0\;\forall i . $$
(6)

Here \( P = \left[ {\begin{array}{*{20}c} 0 & R \\ \end{array} } \right]^{\text{T}} \), where \( 0 \) is an \( (n - b_{t} ) \times 1 \) column vector of zeros, and \( \beta > 0 \) is the regularization parameter.
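To see why (6) is equivalent to (5), note that for \( \omega_{i} \ge 0 \) and \( P_{i} \ge 0 \) (interpreting \( P\omega \) as the elementwise product) the penalty is linear in \( \omega \), so it simply shifts the linear term:

$$ - \beta \left\| {P\omega } \right\|_{1} = - \beta \sum\limits_{i = 1}^{n} {P_{i} \omega_{i} } = - \beta \omega^{\text{T}} P,\quad {\text{ISE}}(\omega ) - \beta \left\| {P\omega } \right\|_{1} = \tfrac{1}{2}\omega^{\text{T}} Q\omega - \omega^{\text{T}} \left( {D + \beta P} \right) = \tfrac{1}{2}\omega^{\text{T}} Q\omega - \left[ {\begin{array}{*{20}c} {\omega_{B} } & {\omega_{N} } \\ \end{array} } \right]\left[ {\begin{array}{*{20}c} {D_{B} } \\ {D_{N} + \beta R} \\ \end{array} } \right], $$

which coincides with (5) when \( \beta = 1 \), or equivalently when \( \beta \) is absorbed into the scaling parameters \( \lambda_{i} \).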

3.2 Analysis of RSKDE

In the RSDE, an unbiased estimate of \( M(\omega ) \) in Sect. 2 is obtained as an \( \omega_{i} \)-weighted sum of the KDE values \( d_{i} \) at each point \( x_{i} \). Each \( d_{i} \) can be expressed as the inner product between a mapped test point and the mean of the mapped training points in kernel feature space, \( d_{i} = \frac{1}{n}\sum\nolimits_{j = 1}^{n} {k_{\sigma } (x_{i} ,x_{j} )} = \left\langle {\phi_{\sigma } (x_{i} ),\frac{1}{n}\sum\nolimits_{j = 1}^{n} {\phi_{\sigma } (x_{j} )} } \right\rangle \). As is well known, the mean estimator \( \hat{\theta } = \frac{1}{n}\sum\nolimits_{j = 1}^{n} {\phi_{\sigma } (x_{j} )} \) can be drastically influenced by outliers. The following proposition shows that the proposed algorithm improves the performance of the density estimate.

Proposition 1

In contrast to the RSDE, a small increment of \( d_{i} \in \Upomega_{N} \) can make the proposed model more robust against outliers, and improve the quality of the density estimates.

Proof The RSDE uses mean estimation for the KDE, which is not robust against outliers in the data. In our case, the larger the value of \( \left| {\Upomega_{k} } \right| \), \( k = 1, \ldots ,t \), the more coherent feature points there are in \( \Uptheta_{k} \), which implies that a point \( \phi_{\sigma } (x_{j} ) \in \Uptheta_{k} \) is unlikely to be an outlier. To reduce the influence of possible outliers among the training data, we would like to assign small weights to outliers. Instead of giving a concrete implementation algorithm, a feasible robust estimate of the sample mean is described below solely for the purpose of the proof.

$$ \hat{\theta } = \sum\limits_{{\phi_{\sigma } (x_{j} ) \in \Uptheta_{k} }} {\alpha_{0} \phi_{\sigma } (x_{j} )} + \sum\limits_{{\phi_{\sigma } (x_{j} ) \in \Uptheta_{N} - \Uptheta_{k} }} {\frac{1}{n}\phi_{\sigma } (x_{j} )} + \sum\limits_{{\phi_{\sigma } (x_{j} ) \in \Uptheta_{B} }} {\alpha_{1} \phi_{\sigma } (x_{j} )} $$

where \( \alpha_{0} \ge \tfrac{1}{n} \ge \alpha_{1} > 0 \), and \( \alpha_{0} \left| {\Uptheta_{k} } \right| + \tfrac{1}{n}\left| {\Uptheta_{N} - \Uptheta_{k} } \right| + \alpha_{1} \left| {\Uptheta_{B} } \right| = 1 \). It can be seen that the influence from \( \phi_{\sigma } (x_{i} ) \in \Uptheta_{B} \) is decreased. If \( \phi_{\sigma } (x_{i} ) \in \Uptheta_{k} \), then we have \( d_{i}^{*} = \left\langle {\phi_{\sigma } (x_{i} ),\hat{\theta }} \right\rangle = \left\langle {\phi_{\sigma } (x_{i} ),\sum\nolimits_{{\phi_{\sigma } (x_{j} ) \in \Uptheta_{k} }} {\alpha_{0} \phi_{\sigma } (x_{j} )} + \sum\nolimits_{{\phi_{\sigma } (x_{j} ) \in \Uptheta_{N} - \Uptheta_{k} }} {\tfrac{1}{n}\phi_{\sigma } (x_{j} )} + \sum\nolimits_{{\phi_{\sigma } (x_{j} ) \in \Uptheta_{B} }} {\alpha_{1} \phi_{\sigma } (x_{j} )} } \right\rangle > \left\langle {\phi_{\sigma } (x_{i} ),\tfrac{1}{n}\sum\nolimits_{j = 1}^{n} {\phi_{\sigma } (x_{j} )} } \right\rangle \).

Hence, \( d_{i}^{*} > d_{i} \), and there exists a small enough \( \lambda_{i} \) such that \( d_{i}^{*} = d_{i} + \lambda_{i} r_{i} > d_{i} \). Conversely, if we suppose that \( d_{i}^{*} \) is slightly larger than \( d_{i} = \left\langle {\phi_{\sigma } (x_{i} ),\frac{1}{n}\sum\nolimits_{j = 1}^{n} {\phi_{\sigma } (x_{j} )} } \right\rangle \), then writing \( d_{i}^{*} = \left\langle {\phi_{\sigma } (x_{i} ),\sum\nolimits_{j = 1}^{n} {\alpha_{j} \phi_{\sigma } (x_{j} )} } \right\rangle \) requires \( \alpha_{i} > \frac{1}{n} \). Therefore, a small increment of \( d_{i} \in \Upomega_{N} \) makes the proposed model more robust against outliers.

In addition, suppose that \( \omega_{1} = {\text{argmin}}_{\omega } {\text{ISE}}(\omega ) \) and \( \omega_{2} = {\text{argmin}}_{\omega } {\text{ISE}}^{*} (\omega ) \); then the density estimate based on \( \omega_{2} \) is more accurate than that based on \( \omega_{1} \), since \( {\text{ISE}}(\omega_{1} ) \ge {\text{ISE}}^{*} (\omega_{1} ) \ge {\text{ISE}}^{*} (\omega_{2} ).\, \square\)

The vector \( \left[ {\begin{array}{*{20}c} {D_{B} } & {D_{N}^{*} } \\ \end{array} } \right]^{\text{T}} \) defined in (5) can be interpreted as a more robust estimate of \( M(\omega ) \). It can be written in the form \( M^{*} (\omega ) = \sum\nolimits_{{d_{i} \in \Upomega_{B} }} {\omega_{i} d_{i} } + \sum\nolimits_{{d_{i} \in \Upomega_{N} }} {\omega_{i} (d_{i} + R_{i} )} \). In this case, the estimation error is bounded as follows: \( \left| {M(\omega ) - M^{*} (\omega )} \right| = \left| {\sum\nolimits_{{d_{i} \in \Upomega_{N} }} {\omega_{i} R_{i} } } \right| < \lambda_{1}. \)
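The bound follows directly from the simplex constraint and the construction of \( R \): since \( \omega_{i} \ge 0 \), \( \sum\nolimits_{i} {\omega_{i} } = 1 \), and \( R_{i} \le \lambda_{1} r_{i} \) with \( r_{i} < 1 \) (almost surely),

$$ \left| {M(\omega ) - M^{*} (\omega )} \right| = \sum\limits_{{d_{i} \in \Upomega_{N} }} {\omega_{i} R_{i} } \le \left( {\mathop {\max }\limits_{i} R_{i} } \right)\sum\limits_{{d_{i} \in \Upomega_{N} }} {\omega_{i} } \le \lambda_{1} \mathop {\max }\limits_{i} r_{i} < \lambda_{1} . $$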

As mentioned previously, the RSDE involves Gaussian kernels of bandwidth \( \sqrt 2 \sigma \) and \( \sigma \), which occur in \( Q \) and \( D \), respectively. The normalizing constants for these kernels are \( (4\pi \sigma^{2} )^{{ - {\ell \mathord{\left/ {\vphantom {\ell 2}} \right. \kern-0pt} 2}}} \) and \( (2\pi \sigma^{2} )^{{ - {\ell \mathord{\left/ {\vphantom {\ell 2}} \right. \kern-0pt} 2}}} \), respectively, so the ratio between them is \( 2^{{ - {\ell \mathord{\left/ {\vphantom {\ell 2}} \right. \kern-0pt} 2}}} \). If the dimension \( \ell \) is large enough, the linear term \( D \) dominates the quadratic term \( Q \). This implies that, for high dimensional data, it is hard to find coherent coefficients; in other words, the RSDE already yields a sparse solution on most high dimensional data, and there is no significant difference between the RSKDE and the RSDE. This agrees with our intuition that the representation of signals is easier in lower dimensions. For high dimensional data, many of the dimensions are often irrelevant, and these irrelevant dimensions can hide clusters in noisy data. It is common for all of the training data to be nearly equidistant from each other in very high dimensions [19]. Therefore, for a similar quality of estimates, the sparsity gained for lower dimensional data is much greater than that achieved for higher dimensional data.

4 Experimental results

We implement the proposed RSKDE in MATLAB based on the KDE Toolbox (written by Ihler and Mandel [20]) and evaluate its performance in density estimation. Its performance is then further validated on novelty detection and binary classification.

4.1 Density estimation

We experiment with one-dimensional data drawn from a heavily skewed distribution defined as \( p_{1} (x) = \tfrac{1}{8}\sum\nolimits_{i = 0}^{7} {g(x,\mu_{i} ,\sigma_{i} )} \), where \( \sigma_{i} = ({2 \mathord{\left/ {\vphantom {2 3}} \right. \kern-0pt} 3})^{i} \) and \( \mu_{i} = 3(\sigma_{i} - 1) \) [21]. Here, \( g(x,\mu ,\sigma ) \) is a univariate Gaussian density with mean \( \mu \) and standard deviation \( \sigma \). A set of \( n \) data samples is randomly drawn from the distribution to construct the KDE. The width of the kernel is found by the Rule of Thumb [22], and a separate test data set of 10,000 samples is used to calculate the \( L_{1} \) error and \( L_{2} \) error of the resulting estimate, as defined in [13]. The parameter m is set to 4. For \( n = 500 \), a typical result is shown in Fig. 1b. As we can see, and in contrast to Fig. 1a, the nonzero weighting coefficients are not concentrated in regions of space with greater probability mass; instead, one or several points with larger weighting coefficients represent the high probability mass. Therefore, the RSKDE achieves a much sparser estimator than the RSDE, and the resulting estimate is much closer to the true density. To demonstrate the effectiveness and robustness, we compare our model with several recent methods: the RSDE, the KD-tree based density reduction method of Ihler et al. [20], and sparse kernel density estimation (SKDE) with an \( L_{0} \) penalty [14]. The experiment is repeated 200 times for different sample sizes. The remaining data (as a percentage of the sample size) are shown in Fig. 3a, and the average \( L_{2} \) error (mean ± SD) between the true density and the respective density estimators against sample size is shown in Fig. 3b. From the results, it is clear that the proposed method provides a significant improvement in both sparsity and accuracy under the same experimental conditions.
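As an illustration of this setup (not the authors' exact protocol), the sketch below draws samples from the heavily skewed mixture, selects a Rule-of-Thumb bandwidth, and evaluates a grid-based \( L_{2} \)-type error for a plain Parzen estimate; the grid range and the error definition are assumptions, and the exact definitions in [13] may differ.

```python
import numpy as np


def sample_skewed(n, rng):
    """Draw n samples from p1(x) = (1/8) sum_i N(mu_i, s_i^2), s_i = (2/3)^i, mu_i = 3(s_i - 1)."""
    i = rng.integers(0, 8, size=n)
    s = (2.0 / 3.0) ** i
    return rng.normal(loc=3.0 * (s - 1.0), scale=s)


def p1(x):
    """True density of the heavily skewed mixture."""
    s = (2.0 / 3.0) ** np.arange(8)
    mu = 3.0 * (s - 1.0)
    comp = np.exp(-(x[:, None] - mu) ** 2 / (2 * s ** 2)) / (np.sqrt(2 * np.pi) * s)
    return comp.mean(axis=1)                     # equal mixture weights 1/8


def parzen(x_eval, X, sigma):
    """Plain Parzen window estimate with a 1-D Gaussian kernel."""
    k = np.exp(-(x_eval[:, None] - X[None, :]) ** 2 / (2 * sigma ** 2))
    return (k / (np.sqrt(2 * np.pi) * sigma)).mean(axis=1)


rng = np.random.default_rng(1)
X = sample_skewed(500, rng)
sigma = 1.06 * X.std() * len(X) ** (-1 / 5)      # Silverman's rule of thumb for 1-D data
grid = np.linspace(-6.0, 3.0, 2000)
err = p1(grid) - parzen(grid, X, sigma)
l2 = np.trapz(err ** 2, grid)                    # grid-based L2 error (one possible definition)
print(f'sigma = {sigma:.3f}, L2 error = {l2:.2e}')
```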

Fig. 3

a Plot of the remaining data (percentage of sample size) for the four related methods. b \( L_{2} \) error between the true density and the respective density estimators against sample size over 200 runs

To test robustness, we add uncorrelated outliers drawn from a random distribution over [−4, 4]. For n = 650 (500 data samples generated from the previous probability density function plus 150 outliers), a typical result is shown in Fig. 4. The \( L_{2} \) error of the RSKDE is only slightly better than that of the RSDE, but the RSKDE has a remarkable advantage in sparsity: only 20 nonzero weighting coefficients are needed for the RSKDE, whereas 114 nonzero weighting coefficients are required for the RSDE.

Fig. 4

a The true density and the RSDE; each of the 114 nonzero weighting coefficients is placed at the appropriate sample data point. The corresponding \( L_{1} \) and \( L_{2} \) errors between them are 0.0296 and 0.00197. b The true density and the RSKDE; each of the 20 nonzero weighting coefficients is placed at the appropriate sample data point. The corresponding \( L_{1} \) and \( L_{2} \) errors between them are 0.0291 and 0.00192

To compare the results of the proposed algorithm further, the experiment is repeated 200 times with 500 fixed data samples and different numbers of outliers. The average \( L_{1} \) error (mean ± SD), \( L_{2} \) error (mean ± SD) and the number of nonzero weighting coefficients against sample size (data sample size + outlier size) are shown in Table 1. After adding outliers to the original data set, it is clear from the results that the RSKDE is consistently better than the RSDE in both sparsity and accuracy of the estimates. Moreover, the number of nonzero weighting coefficients provided by the proposed model remains fairly consistent as the number of outliers increases.

Table 1 \( L_{1} \) error and \( L_{2} \) error between the true density and the respective density estimators against sample size over 200 runs

4.2 Novelty detection

Novelty detection is the identification of new or unknown data that a machine learning system was not aware of during training. It is a one-class classification problem: the known data form one class, and a novelty detection method tries to identify outliers that differ from the distribution of the ordinary data. The RSKDE for novelty detection is tested on two real-world data sets, Banana and Phoneme, both available at http://sci2s.ugr.es/keel. The Banana dataset contains a total of 5,300 samples over two classes; the novelty detectors are trained on the first 400 samples of the first class, and the remaining samples are used for testing. The Phoneme dataset has two classes, 5,404 samples and five features; the aim is to distinguish between nasal (class 0) and oral (class 1) sounds. The novelty detectors are trained on the first 730 samples of class 0, and the remaining samples are used for testing.

The density estimator \( \hat{f}(x;\omega ,\sigma ) \) obtained from the training set gives us a quantitative measure of the degree of novelty of each test sample: samples for which the estimate satisfies \( \hat{f}(x;\omega ,\sigma ) < \rho \) for some threshold \( \rho \) are rejected [23]. Thus, any sample whose likelihood \( \hat{f}(x;\omega ,\sigma ) \) falls below the threshold is considered novel, and every test sample is classified into one of two classes: those similar to the training data and those that are novel. Therefore, we adopt the standard definitions [24] used in binary classification to compare the results of the RSKDE with existing algorithms. TP and TN stand for the number of true positives and true negatives, respectively, while FP and FN denote the numbers of cases misclassified as positive and as negative, respectively. In two-class problems, the accuracy rate on the positives, called sensitivity, is defined as TP/(TP + FN), whereas the accuracy rate on the negative class, also known as specificity, is TN/(TN + FP). Classification accuracy is (TP + TN)/N, where N = TP + TN + FP + FN is the total number of cases. Table 2 compares the RSKDE for novelty detection with existing algorithms. Here, \( N_{1} \) is the number of training data and \( N_{2} \) is the number of test data. Likelihood cross-validation is employed to select the kernel width for a fair comparison, and in the k-nearest neighbor algorithm, k is set to 3. The weighting coefficient \( \omega \) of the RSKDE is obtained by optimizing (5) over the training samples. We can see that on both datasets the RSKDE outperforms the KDE and the RSDE.
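A minimal sketch of this decision rule and of the accompanying metrics is given below, assuming a fitted weight vector w, reduced-set centers Xr and kernel width sigma (hypothetical names, not tied to the authors' implementation):

```python
import numpy as np


def rskde_density(x_eval, Xr, w, sigma):
    """Evaluate f_hat(x; w, sigma) = sum_i w_i k_sigma(x, x_i) at the rows of x_eval."""
    d = Xr.shape[1]
    sq = ((x_eval[:, None, :] - Xr[None, :, :]) ** 2).sum(-1)
    k = (2.0 * np.pi * sigma ** 2) ** (-d / 2.0) * np.exp(-sq / (2.0 * sigma ** 2))
    return k @ w


def novelty_metrics(scores_pos, scores_neg, rho):
    """Decision rule: a sample is 'ordinary' if f_hat >= rho and 'novel' otherwise."""
    TP = np.sum(scores_pos >= rho)   # ordinary test samples correctly accepted
    FN = np.sum(scores_pos < rho)
    TN = np.sum(scores_neg < rho)    # novel test samples correctly rejected
    FP = np.sum(scores_neg >= rho)
    sensitivity = TP / (TP + FN)
    specificity = TN / (TN + FP)
    accuracy = (TP + TN) / (TP + TN + FP + FN)
    return sensitivity, specificity, accuracy
```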

Table 2 Performance of the kernel density estimation (KDE), reduced set density estimation (RSDE) [25], Gaussian mixture model (GMM) [26], k-nearest neighbor algorithm (k-NN), one-class support vector machines SVM [27] and the proposed RSKDE

4.3 Binary classification

This section further evaluates the RSKDE's performance on the two-class classification problem. The experiments are carried out on the datasets used for novelty detection. The number of training samples is 1,000, and the remaining samples are used for testing; each dataset is randomly permuted and partitioned into training and test sets ten times. We first estimate the two class-conditional density functions \( \hat{f}(x;\omega ,\sigma |C_{0} ) \) and \( \hat{f}(x;\omega ,\sigma |C_{1} ) \) for classes \( C_{0} \) and \( C_{1} \) from the training data, and then apply the Bayes rule to the test data set and calculate the corresponding accuracy (ACC).

$$ \left. {\begin{array}{*{20}c} {{\text{if}}\;\hat{f}(x;\omega ,\sigma |C_{0} ) \,\ge\, \hat{f}(x;\omega ,\sigma |C_{1} ),\;x \in C_{0} } \\ {\quad \quad \quad \quad \quad {\text{else}},\quad \quad \quad \quad \quad \quad \quad x \in C_{1} } \\ \end{array} } \right\} $$
(7)

During training, the kernel width \( \sigma \) is tuned by likelihood cross-validation, and the weighting coefficient \( \omega \) is obtained by optimizing (5) over the training samples. Table 3 compares the performance of the six related methods. As can be seen, the mean test accuracy of the RSKDE is 0.898, only slightly higher than the 0.894 of the KDE, but the RSKDE has a remarkable advantage in test complexity: only 187 samples, on average, are needed in the reduced set for the RSKDE classifier, whereas all 1,000 training samples are required for the KDE classifier. On average, the RSKDE classifier reduces the test computational cost by roughly 80 %. For high dimensional data, the results show no significant difference between the RSKDE and the RSDE.

Table 3 Performance of the six related methods

5 Conclusion

In this paper, a novel robust sparse kernel density estimator based on the RSDE is presented. Instead of enforcing a sparse representation through a regularization term, the proposed model introduces randomness into the plug-in estimate of the RSDE and yields a sparser representation in the weighting coefficients. By means of SMO, the randomness allows clusters of small nonzero weighting coefficients to be merged into one or several points with larger weighting coefficients. The proposed model shows good performance in both sparsity and accuracy of the estimates for low dimensional data, and the numerical experiments show promising results.