
1 Introduction

Fuzzy clustering has emerged as an important tool for discovering the structure of data. Kernel methods have been applied to fuzzy clustering, and the kernelized version is referred to as kernel-based fuzzy clustering. Kernel-based classification in the feature space not only preserves the inherent structure of groups in the input space, but also simplifies the associated structure of the data [1]. Since Girolami first developed the kernel k-means clustering algorithm for unsupervised classification [2], several studies have demonstrated the superiority of kernel clustering algorithms over other approaches to clustering [3–5]. A key issue in kernel-based clustering is the choice of the kernel function used to define the nonlinear mapping. Clearly, the choice of kernel is data specific; however, for general data partitioning, a kernel with universal approximation qualities, such as the RBF kernel, is most appropriate. In [6], we proposed kernel-based hybrid c-means clustering (KPFCM) as an improvement over possibilistic fuzzy c-means clustering [8] using the Gaussian kernel function. In most papers, the Gaussian kernel is used as the kernel function. Different kernels induce different distance measures, resulting in new clustering algorithms. Very few papers have studied other kernel functions, such as the hyper-tangent kernel [7, 9]. In this paper, we investigate the effect of different kernel functions on the clustering results. To our knowledge, this is the first such comparison of kernel clustering algorithms using different kernel functions for general-purpose clustering. The paper is organized as follows. A background of the kernel-based approach is given in Sect. 2. Kernel-based hybrid c-means clustering with different kernel functions incorporated is described in Sect. 3. The experimental results and comparative analysis are given in Sect. 4, followed by the main conclusions presented in Sect. 5.

2 Kernel-Based Approach

A kernel function is a generalization of the distance metric that measures the distance between two data points as they are mapped into a high-dimensional space in which they are more clearly separable. By employing a mapping function \( \Upphi (x) \), which defines a nonlinear transformation \( x \to \Upphi (x) \), a data structure that is nonlinearly inseparable in the original data space can possibly become linearly separable in the higher-dimensional feature space. Given an unlabeled data set \( X = \left\{ {x_{1} , \ldots ,x_{N} } \right\} \) in the p-dimensional space \( R^{p} , \) let \( \Upphi \) be a nonlinear mapping function from this input space to a high-dimensional feature space H.

$$ \Upphi{\text{:}}\;R^{p} \to H,\;x \to \Upphi (x) $$

It is possible to compute Euclidean distances in the feature space without knowing \( \Upphi \) explicitly. This is done using the kernel trick, in which the distance between vectors in the feature space is expressed purely as a function of the input vectors.

$$ \begin{aligned} \left\| {\Upphi (x_{k} ) - \Upphi (v_{i} )} \right\|^{2} & = (\Upphi (x_{k} ) - \Upphi (v_{i} ))\,\cdot\,(\Upphi (x_{k} ) - \Upphi (v_{i} )) \\ & = \Upphi (x_{k} )\,\cdot\,\Upphi (x_{k} ) - 2\Upphi (x_{k} )\,\cdot\,\Upphi (v_{i} ) + \Upphi (v_{i} )\,\cdot\,\Upphi (v_{i} ) \\ & = K(x_{k} ,x_{k} ) - 2K(x_{k} ,v_{i} ) + K(v_{i} ,v_{i} ) \\ \end{aligned} $$
(1)
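As an illustration of the kernel trick, the feature-space distance of Eq. 1 can be computed directly from kernel evaluations. The sketch below is ours; the function names and the choice of a Gaussian kernel with width sigma are illustrative assumptions rather than part of the original derivation.

```python
import numpy as np

def kernel_distance_sq(x, v, kernel):
    # Squared feature-space distance of Eq. 1, using only kernel evaluations
    return kernel(x, x) - 2.0 * kernel(x, v) + kernel(v, v)

# Example with a Gaussian (RBF) kernel; sigma = 1.0 is an arbitrary choice here
gaussian = lambda x, y, sigma=1.0: np.exp(-np.sum((x - y) ** 2) / sigma ** 2)
d2 = kernel_distance_sq(np.array([1.0, 2.0]), np.array([0.5, 1.0]), gaussian)
```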

Some examples of robust kernel functions are given in Table 1.

Table 1 List of kernel functions
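Since the body of Table 1 is not reproduced here, the sketch below gives hedged forms of the three kernels compared in Sect. 3; the expressions are inferred from Eqs. 3, 7, and 9 rather than copied from the table.

```python
import numpy as np

# Gaussian kernel: K(x, x) = 1, so Eq. 1 gives the distance 2(1 - K(x, v)) of Eq. 3
def gaussian_kernel(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / sigma ** 2)

# Hyper-tangent kernel (assumed form): 1 - K(x, v) reproduces the tanh term of Eq. 7
def hypertangent_kernel(x, y, sigma=1.0):
    return 1.0 - np.tanh(np.sum((x - y) ** 2) / sigma ** 2)

# Log kernel (conditionally positive definite, assumed form): K(x, x) = 0, and
# Eq. 1 yields the log(1 + beta ||x - v||^2) distance appearing in Eq. 9
def log_kernel(x, y, beta=1.0):
    return -np.log(1.0 + beta * np.sum((x - y) ** 2))
```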

3 Kernel-Based Hybrid c-Means Clustering

We proposed a kernel-based hybrid c-means clustering (KPFCM) in [6], which uses the Gaussian kernel in the induced distance metric. The KPFCM model minimizes the following objective function:

$$ J_{\text{KPFCM}} (U,V,T) = \sum\limits_{k = 1}^{N} {\sum\limits_{i = 1}^{c} {\left( {a\,u_{ik}^{m} + b\,t_{ik}^{\eta } } \right)\left\| {\Upphi (x_{k} ) - \Upphi (v_{i} )} \right\|^{2} } } + \sum\limits_{i = 1}^{c} {\gamma_{i} } \sum\limits_{k = 1}^{N} {\left( {1 - t_{ik} } \right)}^{\eta } $$
(2)

where \( \left\| {\Upphi (x_{k} ) - \Upphi (v_{i} )} \right\|^{2} \) is the squared distance between \( \Upphi (x_{k} ) \) and \( \Upphi (v_{i} ) \). Also \( 0 \le u_{ik} \le 1,\;0 \le t_{ik} \le 1,\;a > 0,\;b > 0,\;m > 1 \) and \( \eta > 1 \). The constant a defines the relative importance of the fuzzy membership, whereas b relates to the typicality value in the objective function. If we confine ourselves to the Gaussian kernel function, which is used almost exclusively in the literature, then \( K(x,x) = 1 \) and \( \left\| {\Upphi (x_{k} ) - \Upphi (v_{i} )} \right\|^{2} = 2\left( {1 - K(x_{k} ,v_{i} )} \right) \). Thus, Eq. 2 can be rewritten as

$$ J_{\text{KPFCM}} (U,V,T) = 2\sum\limits_{k = 1}^{N} {\sum\limits_{i = 1}^{c} {\left( {a\,u_{ik}^{m} + b\,t_{ik}^{\eta } } \right)\,\left( {1 - K(x_{k} ,v_{i} )} \right)} } + \sum\limits_{i = 1}^{c} {\gamma_{i} } \sum\limits_{k = 1}^{N} {(1 - t_{ik} )}^{\eta } $$
(3)

The update of \( u_{ik} ,v_{i} \) and \( t_{ik} \) is as follows:

$$ u_{ik} = \left( {1/\left( {1 - K(x_{k} ,v_{i} )} \right)} \right)^{1/(m - 1)} /\sum\limits_{j = 1}^{c} {\left( {1/\left( {1 - K(x_{k} ,v_{j} )} \right)} \right)^{1/(m - 1)} } $$
(4)
$$ v_{i} = \sum\limits_{k = 1}^{N} {\left( {a\,u_{ik}^{m} + b\,t_{ik}^{\eta } } \right)K\left( {x_{k} ,v_{i} } \right)x_{k} } /\sum\limits_{k = 1}^{N} {\left( {a\,u_{ik}^{m} + b\,t_{ik}^{\eta } } \right)K\left( {x_{k} ,v_{i} } \right)} $$
(5)
$$ t_{ik} = 1/\left( {1 + \left[ {2b\left( {1 - K\left( {x_{k} ,v_{i} } \right)} \right)/\gamma_{i} } \right]^{{\frac{1}{\eta - 1}}} } \right) $$
(6)
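For concreteness, one alternating update of KPFCM with the Gaussian kernel (Eqs. 4–6) might be sketched as below; the function name, the parameter defaults, and the small numerical guard are our own assumptions, and in practice the pass would be iterated until the centroids stop changing.

```python
import numpy as np

def kpfcm_gaussian_step(X, V, gamma, a=1.0, b=1.0, m=2.0, eta=2.0, sigma=1.0):
    """One update pass of KPFCM with the Gaussian kernel (Eqs. 4-6).

    X: (N, p) data, V: (c, p) current centroids, gamma: (c,) penalty weights.
    """
    # Gaussian kernel matrix K[k, i] = K(x_k, v_i)
    d2 = ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)          # (N, c)
    K = np.exp(-d2 / sigma ** 2)

    # Eq. 4: fuzzy memberships u_ik (each row sums to 1 over the clusters)
    inv = (1.0 / (1.0 - K + 1e-12)) ** (1.0 / (m - 1.0))
    U = inv / inv.sum(axis=1, keepdims=True)

    # Eq. 6: typicality values t_ik
    T = 1.0 / (1.0 + (2.0 * b * (1.0 - K) / gamma[None, :]) ** (1.0 / (eta - 1.0)))

    # Eq. 5: centroids weighted by (a u^m + b t^eta) K(x_k, v_i)
    W = (a * U ** m + b * T ** eta) * K                               # (N, c)
    V_new = (W.T @ X) / W.sum(axis=0)[:, None]
    return U, T, V_new
```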

We now incorporate different kernel functions and compare them with the Gaussian kernel. The resulting algorithms are named KPFCM (Gaussian kernel), KPFCM-H (hyper-tangent kernel), and KPFCM-L (log kernel). Using the hyper-tangent kernel function, the objective function for KPFCM-H clustering is as follows:

$$ J_{\text{KPFCM - H}} = 2\sum\limits_{k = 1}^{N} {\sum\limits_{i = 1}^{c} {\left( {a\,u_{ik}^{m} + b\,t_{ik}^{\eta } } \right)\tanh \left( {\frac{{\left\| {x_{k} - v_{i} } \right\|^{2} }}{{\sigma^{2} }}} \right)} } + \sum\limits_{i = 1}^{c} {\gamma_{i} } \sum\limits_{k = 1}^{N} {\left( {1 - t_{ik} } \right)^{\eta } } $$
(7)

The cluster center \( v_{i} \) for hyper-tangent kernel function is as follows:

$$ v_{i} = \frac{{\sum\nolimits_{k = 1}^{N} {\left( {a\,u_{ik}^{m} + b\,t_{ik}^{\eta } } \right)K\left( {x_{k} ,v_{i} } \right)\left( {1 + \tanh \left( {\frac{{\left\| {x_{k} - v_{i} } \right\|^{2} }}{{\sigma^{2} }}} \right)} \right)x_{k} } }}{{\sum\nolimits_{k = 1}^{N} {\left( {a\,u_{ik}^{m} + b\,t_{ik}^{\eta } } \right)K\left( {x_{k} ,v_{i} } \right)\left( {1 + \tanh \left( {\frac{{\left\| {x_{k} - v_{i} } \right\|^{2} }}{{\sigma^{2} }}} \right)} \right)} }} $$
(8)
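A minimal sketch of the centroid update of Eq. 8 is given below; it assumes the hyper-tangent kernel form K(x, v) = 1 − tanh(‖x − v‖²/σ²) noted earlier, and the function name and parameter defaults are illustrative.

```python
import numpy as np

def kpfcm_h_center_update(X, V, U, T, a=1.0, b=1.0, m=2.0, eta=2.0, sigma=1.0):
    # Eq. 8: centroids weighted by (a u^m + b t^eta) K(x_k, v_i) (1 + tanh(||x_k - v_i||^2 / sigma^2))
    d2 = ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)   # (N, c) squared distances
    th = np.tanh(d2 / sigma ** 2)
    K = 1.0 - th                                               # assumed hyper-tangent kernel
    W = (a * U ** m + b * T ** eta) * K * (1.0 + th)
    return (W.T @ X) / W.sum(axis=0)[:, None]
```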

The expressions for \( u_{ik} \) and \( t_{ik} \) remain the same as in Eqs. 4 and 6, respectively. Using a log kernel function, the objective function for KPFCM-L clustering is as follows:

$$ J_{{{\text{KPFCM}} - {\text{L}}}} = 2\sum\limits_{k = 1}^{N} {\sum\limits_{i = 1}^{c} {\left( {a\,u_{ik}^{m} + b\,t_{ik}^{\eta } } \right)\log \left( {1 + \beta \left\| {x_{k} - v_{i} } \right\|^{2} } \right)} } + \sum\limits_{i = 1}^{c} {\gamma_{i} } \sum\limits_{k = 1}^{N} {\left( {1 - t_{ik} } \right)^{\eta } } $$
(9)

The cluster center \( v_{i} \) for log kernel function is as follows:

$$ v_{i} = \frac{{\sum\nolimits_{k = 1}^{N} {\left( {a\,u_{ik}^{m} + b\,t_{ik}^{\eta } } \right)\left( {1/\left( {1 + \beta \left\| {x_{k} - v_{i} } \right\|^{2} } \right)} \right)x_{k} } }}{{\sum\nolimits_{k = 1}^{N} {\left( {a\,u_{ik}^{m} + b\,t_{ik}^{\eta } } \right)\left( {1/\left( {1 + \beta \left\| {x_{k} - v_{i} } \right\|^{2} } \right)} \right)} }} $$
(10)

The update of \( u_{ik} \) and \( t_{ik} \) for log kernel function is as follows:

$$ u_{ik} = \left( {\log \left( {1 + \beta \left\| {x_{k} - v_{i} } \right\|^{2} } \right)} \right)^{ - 1/(m - 1)} /\sum\limits_{j = 1}^{c} {\left( {\log \left( {1 + \beta \left\| {x_{k} - v_{j} } \right\|^{2} } \right)} \right)^{ - 1/(m - 1)} } $$
(11)
$$ t_{ik} = 1/\left( {1 + \left[ {2b\log \left( {1 + \beta \left\| {x_{k} - v_{i} } \right\|^{2} } \right)/\gamma_{i} } \right]^{1/(\eta - 1)} } \right) $$
(12)
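Analogously, one update pass of KPFCM-L (Eqs. 10–12) could look like the sketch below; the typicality update is written in the same bracketed form as Eq. 6, and the function name, parameter defaults, and numerical guard are illustrative assumptions.

```python
import numpy as np

def kpfcm_l_step(X, V, gamma, a=1.0, b=1.0, m=2.0, eta=2.0, beta=1.0):
    """One update pass of KPFCM-L with the log kernel (Eqs. 10-12)."""
    d2 = ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)          # (N, c)
    logd = np.log(1.0 + beta * d2)                                    # log-kernel distance term

    # Eq. 11: fuzzy memberships from the log-kernel distance
    inv = (logd + 1e-12) ** (-1.0 / (m - 1.0))
    U = inv / inv.sum(axis=1, keepdims=True)

    # Eq. 12: typicality values
    T = 1.0 / (1.0 + (2.0 * b * logd / gamma[None, :]) ** (1.0 / (eta - 1.0)))

    # Eq. 10: centroids weighted by (a u^m + b t^eta) / (1 + beta ||x_k - v_i||^2)
    W = (a * U ** m + b * T ** eta) / (1.0 + beta * d2)
    V_new = (W.T @ X) / W.sum(axis=0)[:, None]
    return U, T, V_new
```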

4 Experimental Study

A series of experiments was run on a variety of data sets using KPFCM clustering. The objective of this comprehensive suite of experiments is to provide a thorough comparison of the performance of KPFCM clustering with different kernel functions in the objective function. Several two-dimensional synthetic data sets were used, with a wide variety in the shape of clusters and the number of data points. The real-life data set used in the experiments is the well-known Iris data set.

4.1 Identical Data with Noise

The first simulation experiment involves the data set \( X_{12} \) [8]. The ideal (true) centroids for the \( X_{12} \) data set are \( V_{\text{ideal}} = \left[ {\begin{array}{*{20}c} { - 3.34} & 0 \\ {3.34} & 0 \\ \end{array} } \right]. \) Let \( V_{\text{KPFCM}}^{12} ,V_{\text{KPFCM - H}}^{12} ,V_{\text{KPFCM - L}}^{12} \) be the terminal centroids identified by the respective algorithms; they are tabulated in Table 2. To show the effectiveness of the proposed algorithms, we also compute the error \( E_{*} = \left\| {V_{\text{ideal}} - V_{*}^{12} } \right\|^{2} \), where * stands for KPFCM, KPFCM-H, or KPFCM-L. Figure 1 displays the clustering results of the three clustering algorithms with different kernel functions. The centroids produced by the clustering techniques using different kernel functions are almost identical.
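The error measure is simply the squared norm of the difference between the ideal and terminal centroid matrices; a small illustrative computation, with placeholder centroids rather than the values of Table 2 and the norm read as a squared Frobenius norm, is:

```python
import numpy as np

V_ideal = np.array([[-3.34, 0.0], [3.34, 0.0]])
V_star = np.array([[-3.30, 0.02], [3.31, -0.01]])   # placeholder terminal centroids
E_star = np.sum((V_ideal - V_star) ** 2)              # E_* = ||V_ideal - V_*||^2
```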

Table 2 Terminal centroids produced by KPFCM, KPFCM-H, and KPFCM-L on \( X_{12} \)

Fig. 1 Clustering results for data \( X_{12} \)

4.2 Dunn Data Set

In general, all clustering techniques perform well on pattern sets whose clusters have similar volumes and contain similar numbers of patterns. By changing the volume of the clusters in a pattern set, we can observe the effectiveness of the different clustering algorithms. Figure 2 displays the clustering results of the three clustering algorithms. We see that all of them correctly partition the two clusters with almost the same accuracy.

Fig. 2 Clustering results for the Dunn data set

4.3 Gaussian Random Data

In this example, a Gaussian random number generator was used to create a data set consisting of two clusters. Noise points were then added to the lower cluster to obtain a difference in volume compared to the upper cluster. Figure 3 shows the clustering results of the three clustering algorithms. There are only two misclassifications in the case of KPFCM-H. The results produced by KPFCM and KPFCM-L are identical.

Fig. 3 Clustering results for the Gaussian random data set

4.4 Iris Data Set

This is a four-dimensional data set containing 50 samples each of three species of Iris flowers. One class is linearly separable from the other two; the latter are not linearly separable from each other. As indicated in Table 3, the typical result of comparing KPFCM partitions to the physically correct labels of Iris is 11 errors. KPFCM-H also gives 11 errors. KPFCM-L misclassifies 8 data points, giving an accuracy of 94.6 %.

Table 3 Number of misclassified data and accuracies using the KPFCM, KPFCM-H, and KPFCM-L methods for the Iris data set

5 Conclusions

This paper has presented a detailed analysis of different kernel functions in KPFCM clustering. In the literature, most kernel-based clustering algorithms use the Gaussian kernel function. We have studied two non-Gaussian kernel functions, namely the hyper-tangent and log kernel functions. We incorporated these kernel functions into KPFCM clustering and applied the resulting algorithms to a wide variety of synthetic data sets as well as a real-life data set. From the experiments, we have seen that the two non-Gaussian kernel functions work as well as the Gaussian kernel function. To summarize, we conclude that we have another class of robust kernel functions that work well in typical clustering problems with nonlinear classification boundaries.