1 Introduction

Clustering is the process of separating data into different groups on the basis of some similarity measure; data belonging to the same cluster should be similar to each other, while data belonging to different clusters should not. Two main clustering approaches exist: hard clustering and fuzzy clustering. In hard clustering, the data are divided into distinct clusters, and each data point belongs to exactly one cluster (Likas et al. 2003; Pelleg and Moore 2000; Kanungo et al. 2002). However, when clusters overlap, objects should belong to different clusters with different degrees; in this case fuzzy clustering is more appropriate. Data points can then belong to multiple clusters, and the membership grades assigned to each data point indicate the degree to which it belongs to each cluster. One of the most widely used fuzzy clustering algorithms is the fuzzy c-means (FCM) algorithm (Bezdek 1981; Hoppner and Klawonn 1999; Gustafson and Kessel 1979; Babuska et al. 2002; Dunn 1973; Gath and Geva 1989). This algorithm works well for spherical-like clusters. When dealing with non-spherical and heavily overlapped data, kernel methods are better suited. They map data points from the input space to a high-dimensional feature space, so that a data structure that is not linearly separable in the input space becomes linearly separable in the feature space. There are two major variations of the FCM: kernel FCM (KFCM) and multiple-kernel FCM (MKFCM). KFCM (Schölkopf and Smola 2002; Xie et al. 2003) transforms the data into a high-dimensional feature space and then performs clustering. It uses a single kernel, which does not fit clusters with different densities. Instead of using a single fixed kernel, multiple kernels may be used. MKFCM (Baili and Frigui 2012, 2011) tries to find an optimal linear weighted combination of kernels with initial fixed (not necessarily the best) parameters. The disadvantage of that approach is twofold: the number of kernels M must be given, and “good” parameters must be provided for each kernel.

Our motivation in this paper is to extend the KFCM and the MKFCM algorithms. This is done by achieving the following:

  • Characterizing every cluster by a different Gaussian kernel (extension of the KFCM).

  • Trying to find the optimal parameters of the kernel of each cluster, without the need to give “good” parameters for each kernel or an initial “good” number of kernels (extension of the MKFCM).

The originality and novelty of our work lie in a new, extended kernel-based fuzzy clustering algorithm that achieves better clustering than previous techniques. The benefits of this work extend to the numerous applications that require clustering algorithms.

We have used the following nomenclature: FCM: fuzzy c-means; KFCM: kernel fuzzy c-means; MKFCM: multiple-kernel fuzzy c-means; OKFCM: our new algorithm.

The rest of the paper is organized as follows. In Sect. 2 we review the c-means, the FCM, and the MKFCM algorithms. In Sect. 3 we present our new algorithm. To validate our approach, different validation measures are presented in Sect. 4. Section 5 describes the datasets used for the comparisons and the results obtained. We conclude in Sect. 6.

2 Related works

2.1 C-means clustering

Given the data \(X_1 ,X_2 ,\ldots ,X_N \), where each datum is an n-dimensional real vector, c-means clustering (Likas et al. 2003; Pelleg and Moore 2000; Kanungo et al. 2002) aims to partition the N data into C clusters \(G_1 ,G_2 ,\ldots ,G_C\) in order to minimize the following function:

$$\begin{aligned} J\left( V \right) =\sum \limits _{i=1}^C {\sum \limits _{j\in G_i } {\left\| {X_j -v_i} \right\| ^{2}} } \end{aligned}$$
(1)

Here \(v_i\) is the center of the cluster \(G_i \), which consists of \(N_i \) data. It is given by:

$$\begin{aligned} v_i =\frac{\sum \nolimits _{j=1}^{N_i } {X_j } }{N_i },\quad 1\le i\le C,\quad \sum \limits _{i=1}^C {N_i } =N \end{aligned}$$
(2)

Each cluster \(G_i \) will be characterized by the prototype \(v_i\) (real center).
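
For concreteness, a minimal NumPy sketch of the alternating c-means updates in Eqs. (1) and (2) could look as follows; the function name, the random initialization, and the guard for empty clusters are our own illustrative choices, not part of the original formulation.

```python
import numpy as np

def c_means(X, C, max_iter=100, seed=0):
    """Hard c-means: alternate nearest-center assignment (Eq. 1) and center update (Eq. 2)."""
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    # Initialize the centers with C randomly chosen data points
    centers = X[rng.choice(N, size=C, replace=False)]
    for _ in range(max_iter):
        # Assign each datum to its nearest center (minimizes Eq. 1 for fixed centers)
        dist2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist2.argmin(axis=1)
        # Recompute each center as the mean of its members (Eq. 2); keep empty clusters unchanged
        new_centers = np.array([
            X[labels == i].mean(axis=0) if np.any(labels == i) else centers[i]
            for i in range(C)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```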

2.2 FCM clustering

Let U be the fuzzy membership matrix (Bezdek 1981; Hoppner and Klawonn 1999) where \(\mu _{ik} \in \left[ {0,1} \right] ,1\le i\le c,1\le k\le N\) is the degree of membership of data \(X_k \) to the cluster with prototype \(v_i \).

$$\begin{aligned} U_{c\times N} =\left[ {{\begin{array}{llll} {\mu _{11} }&{} {\mu _{12} }&{} \ldots &{} {\mu _{1N} } \\ {\mu _{21} }&{} \ldots &{} \ldots &{} {\mu _{2N} } \\ \vdots &{} \vdots &{} \vdots &{} \vdots \\ {\mu _{c1} }&{} \ldots &{} \ldots &{} {\mu _{cN} } \\ \end{array} }} \right] \end{aligned}$$
(3)

Dunn (1973) defined a fuzzy objective function:

$$\begin{aligned} J_D \left( {U,V} \right) =\sum \limits _{i=1}^c {\sum \limits _{j=1}^N {\mu _\textit{ij}^2 \left\| {X_j -v_i } \right\| ^{2}} } \end{aligned}$$
(4)

Bezdek (1981) then extended it to:

$$\begin{aligned} J_m \left( {U,V;X} \right) =\sum \limits _{i=1}^c {\sum \limits _{j=1}^N {\mu _\textit{ij}^m \left\| {X_j -v_i } \right\| ^{2}, \quad 1\le m<\infty } } \end{aligned}$$
(5)

To find the minimizing pair (U, V), the following Lagrangian objective function is used:

$$\begin{aligned} \min \left\{ {J_m \left( {U,V,\lambda ;X} \right) =\sum \limits _{i=1}^c {\sum \limits _{j=1}^N {\mu _\textit{ij}^m \left\| {X_j -v_i } \right\| ^{2}} } +\sum \limits _{k=1}^N {\lambda _k \left( \sum \limits _{i=1}^c {\mu _{ik} } -1\right) } } \right\} \end{aligned}$$
(6)

By setting \(\frac{\partial J_m }{\partial v_i }=0\), \(\frac{\partial J_m }{\partial \mu _\textit{ij} }=0\), and \(\frac{\partial J_m }{\partial \lambda _k} =0\), we get:

$$\begin{aligned} v_i =\frac{\sum \nolimits _{j=1}^N {(\mu _\textit{ij} )^{m}X_j } }{\sum \nolimits _{j=1}^N {(\mu _\textit{ij} )^{m}} },\quad 1\le i\le c \end{aligned}$$
(7)

And

$$\begin{aligned} \mu _\textit{ij} =\left( {\sum \limits _{k=1}^c {\left( {\left\| {X_j -v_i } \right\| ^{2}}\Big /{\left\| {X_j -v_k } \right\| ^{2}}\right) ^{\frac{1}{m-1}}} } \right) ^{-1},\quad 1\le i\le c,\; 1\le j\le N \end{aligned}$$
(8)

Each cluster i will be characterized by the prototype \(v_i \) (real center) and the fuzzy membership vector \(\mu _\textit{ij} \).
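
As an illustration of the alternating updates in Eqs. (7) and (8), one possible NumPy implementation of a single FCM iteration is sketched below; the variable names and the small constant added to avoid division by zero are our own assumptions.

```python
import numpy as np

def fcm_step(X, U, m=2.0, eps=1e-12):
    """One FCM iteration: update the centers (Eq. 7), then the memberships (Eq. 8).

    X: (N, n) data matrix; U: (c, N) membership matrix whose columns sum to 1.
    """
    Um = U ** m                                    # mu_ij^m, shape (c, N)
    V = (Um @ X) / Um.sum(axis=1, keepdims=True)   # Eq. (7): fuzzy centers, shape (c, n)
    # Squared distances of every datum to every center, shape (c, N)
    d2 = ((X[None, :, :] - V[:, None, :]) ** 2).sum(axis=2) + eps
    # Eq. (8): mu_ij = 1 / sum_k (d_ij^2 / d_kj^2)^(1/(m-1))
    ratio = (d2[:, None, :] / d2[None, :, :]) ** (1.0 / (m - 1.0))
    U_new = 1.0 / ratio.sum(axis=1)
    return U_new, V
```

Iterating fcm_step until the change in U falls below a small tolerance reproduces the usual FCM loop.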

2.3 Kernel FCM

KFCM uses the following objective function (Xie et al. 2003):

$$\begin{aligned} J_m \left( {U,V;X} \right) =\sum \limits _{i=1}^c {\sum \limits _{j=1}^N {\mu _\textit{ij}^m \left\| {\phi (X_j )-\phi (v_i )} \right\| ^{2},\quad 1\le m<\infty } } \end{aligned}$$
(9)

where \(\phi (X)\) is a transformation from the nonlinear data space to a higher dimensional space called feature space (Eschrich et al. 2003).

In that space, every dot product is computed by means of a nonlinear kernel function K (the kernel trick) (Eschrich et al. 2003).

$$\begin{aligned} \phi ^{T}(X_1 )\cdot \phi (X_2 )=K(X_1 ,X_2 ) \end{aligned}$$
(10)

In the feature space, the kernel function measures the similarity of vector \(X_1 \) and the vector \(X_2 \). The Gaussian kernel that we have used in this paper is given by:

$$\begin{aligned} K(X_1 ,X_2 )=\exp \left( -\sum \limits _{i=1}^n {\frac{1}{n}} \frac{(X_{1i} -X_{2i} )^{2}}{2\sigma _i ^{2}}\right) \end{aligned}$$
(11)

where \(n\) is the dimension of each of the \(N\) data vectors and \(\sigma _i \) is the kernel bandwidth along the \(i\)th dimension.

It should be noted also that the Euclidean distance in the feature space is given by:

$$\begin{aligned} \left\| {\phi (X_j )-\phi (v_i )} \right\| ^{2}=k(X_j ,X_j )+k(v_i ,v_i )-2k(X_j ,v_i ) \end{aligned}$$
(12)

And the update formulas for the kernel FCM are given by:

$$\begin{aligned} v_i= & {} \frac{\sum \nolimits _{j=1}^N {\mu _\textit{ij}^m k(X_j ,v_i )X_j } }{\sum \nolimits _{j=1}^N {\mu _\textit{ij}^m k(X_j ,v_i )} } \end{aligned}$$
(13)
$$\begin{aligned} \mu _\textit{ij}= & {} \left( {\sum \limits _{k=1}^c \left( \frac{1-k(X_j ,v_i )}{1-k(X_j ,v_k )}\right) ^{\frac{1}{m-1}}} \right) ^{-1}. \end{aligned}$$
(14)
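
A minimal sketch of the KFCM updates (Eqs. 13 and 14) with a single Gaussian kernel could look as follows; using one global bandwidth sigma for all clusters, and updating the memberships with the freshly computed centers, are our simplifying assumptions for illustration.

```python
import numpy as np

def gaussian_kernel(X, V, sigma):
    """K[j, i] = exp(-||X_j - v_i||^2 / (2 sigma^2))."""
    d2 = ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def kfcm_step(X, U, V, sigma, m=2.0, eps=1e-12):
    """One KFCM iteration: Eq. (13) for the centers, Eq. (14) for the memberships."""
    K = gaussian_kernel(X, V, sigma)                      # (N, c), k(X_j, v_i)
    W = (U.T ** m) * K                                    # mu_ij^m * k(X_j, v_i), (N, c)
    V_new = (W.T @ X) / (W.sum(axis=0)[:, None] + eps)    # Eq. (13)
    D = 1.0 - gaussian_kernel(X, V_new, sigma) + eps      # 1 - k(X_j, v_i), (N, c)
    ratio = (D[:, :, None] / D[:, None, :]) ** (1.0 / (m - 1.0))
    U_new = (1.0 / ratio.sum(axis=2)).T                   # Eq. (14), back to shape (c, N)
    return U_new, V_new
```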

2.4 Multiple kernel-based clustering

In multiple kernel-based clustering (Baili and Frigui 2011), each cluster is characterized by a linear combination of M kernels with known parameters. The similarity of the data X to the center v is given by:

$$\begin{aligned} k(X,v)=\sum \limits _{i=1}^M {w_i K_i } (X,v) \end{aligned}$$
(15)

Every kernel is multiplied by a weight \(w_i \in \left[ {0,1} \right] \), with \(\sum \nolimits _{i=1}^M {w_i } =1\), reflecting the importance of that kernel. The disadvantage of that approach is twofold:

  • Giving the number of kernels M.

  • And giving “good” parameters of each kernel.

It uses the following objective function:

$$\begin{aligned} J_m \left( {U,V;X} \right) =\sum \limits _{i=1}^N {\sum \limits _{c=1}^C {\mu _{ic}^m (\phi (X_i )-v_c )^{T}(\phi (X_i )-v_c )} } =\sum \limits _{i=1}^N {\sum \limits _{c=1}^C {\mu _{ic}^m D_{ic}^2 } } \end{aligned}$$
(16)

where:

$$\begin{aligned} D_{ic}^2= & {} \sum \limits _{k=1}^M {w_k^2 } K_k (X_i ,X_i )-2\sum \limits _{j=1}^N {\sum \limits _{k=1}^M {\mu _{jc} } } w_k^2 K_k (X_i ,X_j)\nonumber \\&+\sum \limits _{j=1}^N {\sum \limits _{l=1}^N {\sum \limits _{k=1}^M {\mu _{jc} } } } \mu _{lc} w_k^2 K_k (X_j ,X_l ) \end{aligned}$$
(17)

And

$$\begin{aligned} \mu _{ic} =\frac{1}{\sum \nolimits _{k=1}^C \left( \frac{D_\textit{ic}^2 }{D_\textit{ik}^2 } \right) ^{\frac{1}{m-1}}} \end{aligned}$$
(18)

The optimization problem can be solved now using:

$$\begin{aligned} \frac{\textit{dJ}_m }{dw}=0. \end{aligned}$$
(19)
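
To make Eq. (15) concrete, one possible way to assemble the combined kernel matrix from M precomputed base kernel matrices is sketched below; this covers only the kernel-combination step, not the full MKFCM weight optimization of Eqs. (16)-(19), and the function name is our own.

```python
import numpy as np

def combined_kernel(kernels, weights):
    """Eq. (15): weighted linear combination of M base kernel matrices.

    kernels: list of M arrays of shape (N, N); weights: length-M array summing to 1.
    """
    weights = np.asarray(weights, dtype=float)
    assert np.isclose(weights.sum(), 1.0), "kernel weights must sum to 1"
    return sum(w * K for w, K in zip(weights, kernels))
```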

3 Fuzzy clustering using multiple Gaussian kernels with optimized-parameters

3.1 Problem formulation

In MKFCM, a weighted linear combination of kernels with given variances is used to approximate the true kernel; every cluster is characterized by the set of weights that best approximates its true kernel.

$$\begin{aligned} \sum \limits _{i=1}^M {w_i k_i } (X,v)\approx \frac{1}{\sqrt{2\pi \sigma ^{2}}}\exp \left( -\frac{(X-v )^{2}}{2\sigma ^{2}}\right) \end{aligned}$$
(20)

It should be noted that 4 kernels are chosen and the weight of each kernel will be the output of the MKFCM algorithm.

Our algorithm (OKFCM) tries to characterize every cluster by a different optimal Gaussian kernel.

3.2 Optimal Gaussian kernel and the EM approach

In our algorithm, clustering in the new high-dimensional (feature) space is based on minimizing the following objective function:

$$\begin{aligned} J_m \left( {U,V;X} \right) =\sum \limits _{i=1}^N {\sum \limits _{c=1}^C {\mu _{ic}^m (\phi (X_i )-v_c )^{T}(\phi (X_i )-v_c )} } =\sum \limits _{i=1}^N {\sum \limits _{c=1}^C {\mu _{ic}^m D_{ic}^2 } } \end{aligned}$$

The distance \(D_{ic}^2\) of the datum \(X_i \) to the center \(v_c \) in that space is given by:

$$\begin{aligned} D_{ic}^2 =(\phi (X_i )-v_c )^{T}(\phi (X_i )-v_c ) \end{aligned}$$

where \(v_c =\frac{\sum \nolimits _{i=1}^N {\mu _{ic}^m \phi (X_i )} }{\sum \nolimits _{i=1}^N {\mu _{ic}^m } }\) is the center in the feature space. This can be written as:

$$\begin{aligned} D_{ic}^2 =K(X_i ,X_i )-2\sum \limits _{j=1}^N {\mu _{jc} } K (X_i, X_j)+\sum \limits _{j=1}^N {\sum \limits _{l=1}^N {\mu _{jc} } } \mu _{lc} K (X_j,X_l) \end{aligned}$$
(21)

We have used the Gaussian kernel which is given by:

$$\begin{aligned} K(x,v_k )=\frac{1}{\sqrt{2\pi \sigma _k ^{2}}}\exp \left( -\frac{(x-v_k)^{2}}{2\sigma _k ^{2}}\right) \end{aligned}$$

The above equation shows that similarity in the feature space depends on the variance of the data in the original space. Our approach to determining the optimal kernel is very similar to the Expectation-Maximization (EM) method, as illustrated by the flowchart in Fig. 1.

Fig. 1 Flow chart of our algorithm compared to the EM method

It should be noted that our algorithm, like other clustering algorithms that resemble the EM method, is not guaranteed to reach the global minimum (see Fig. 1); it may converge to a local minimum or a saddle point (Zhou et al. 2013).

3.3 Methodology

Our method is described by the following algorithm (a code sketch of the full loop is given after the steps):

  • Step 1: Given the N n-dimensional data \(X_1 ,X_2 ,\ldots ,X_N \) and a small positive tolerance \(\varepsilon \).

  • Step 2: Initialize the membership matrix, which indicates, in the feature space, how close each datum is to each center.

    Initialize \(U_{c\times N} =\left[ {{\begin{array}{llll} {\mu _{11} }&{} {\mu _{12} }&{} \ldots &{} {\mu _{1N} } \\ {\mu _{21} }&{} \ldots &{} \ldots &{} {\mu _{2N} } \\ \vdots &{} \vdots &{} \vdots &{} \vdots \\ {\mu _{c1} }&{} \ldots &{} \ldots &{} {\mu _{cN} } \\ \end{array} }} \right] \) such that \(\sum \limits _{i=1}^c {\mu _{ik} } =1\;k=1\ldots N\)

  • Step 3: The mean and the variance of each cluster are given by the following formulas:

    $$\begin{aligned} v_i= & {} \frac{\sum \nolimits _{j=1}^N {(\mu _\textit{ij} )^{m}x_j } }{\sum \nolimits _{j=1}^N {(\mu _\textit{ij} )^{m}} } \quad 1\le i\le c \end{aligned}$$
    (22)
    $$\begin{aligned} \delta _i ^{2}= & {} \frac{\sum \nolimits _{j=1}^N {(\mu _\textit{ij} )^{m}(x_j -v_i )^{2}} }{\sum \nolimits _{j=1}^N {(\mu _\textit{ij} )^{m}} } \quad 1\le i\le c \end{aligned}$$
    (23)

    Step 3 emphasizes that every cluster c is characterized by the mean \(v_c \) and the variance \(\delta _c ^{2}\).

  • Step 4: The variance associated with each data point is given by the following formula:

    $$\begin{aligned} \sigma _i^2 =\sum \limits _{j=1}^c {(\mu _{ji} )^{m}} \delta _j ^{2}\quad {i=1,\ldots ,N} \end{aligned}$$
    (24)

    Now every data point is characterized by an n-dimensional variance vector.

    $$\begin{aligned} \{X_i \}\leftrightarrow \sigma _i^2 =\left[ {{\begin{array}{c} {\sigma _{i1}^2 } \\ \vdots \\ {\sigma _{in}^2 } \\ \end{array} }} \right] \quad i=1,...,N \end{aligned}$$

    It should be noted that data points from the same cluster c will have approximately the same variance, namely \(\delta _c ^{2}\).

  • Step 5: Calculates the similarity between all the data points using the kernel matrix K:

    In order for the matrix K to be a similarity matrix, it should be a symmetric matrix \(K(X_i ,X_j )=K(X_j ,X_i )\hbox { for all }i,j=1,\ldots ,N\).

    There are 2 cases:

    Case 1: \(X_i \hbox { and }X_j \) belong to the same cluster and therefore have approximately the same variances:

    $$\begin{aligned}&\sigma _i^2 \approx \sigma _j^2 \Rightarrow K(X_i ,X_j )\approx K(X_j ,X_i ) \\&\qquad \Rightarrow \frac{1}{2}(K(X_i ,X_j )+K(X_j ,X_i ))\approx K(X_j ,X_i ) \\ \end{aligned}$$

    Case 2: \(X_i \hbox { and }X_j \) belong to different clusters and therefore have different variances:

    $$\begin{aligned}&\displaystyle \sigma _i^2 \ne \sigma _j^2 \Rightarrow K(X_i ,X_j )\ne K(X_j ,X_i )\Rightarrow \hbox {do averaging} \\&\displaystyle K(X_i ,X_j )=K(X_j ,X_i )=\frac{1}{2}(K(X_i ,X_j )+K(X_j ,X_i )) \\ \end{aligned}$$
  • Step 6: Calculates, in the feature space, the distance of each data point to each center using the similarity matrix found in Step 5, and updates the membership matrix according to those distances.

    $$\begin{aligned} \begin{array}{l} \textit{for all i and c} \\ D_{ic}^2 =K(X_i ,X_i )-2\sum \limits _{j=1}^N {\mu _{jc} } K(X_i ,X_j)+\sum \limits _{j=1}^N {\sum \limits _{l=1}^N {\mu _{jc} } } \mu _{lc} K(X_j ,X_l ) \\ \mu _{ic} =\frac{1}{\sum \nolimits _{k=1}^c \left( \frac{D_{ic}^2 }{D_{ik}^2 }\right) ^{\frac{1}{m-1}}} \end{array} \end{aligned}$$
  • Step 7: Stopping criterion. If the membership matrix does not change significantly (all data points remain in their assigned clusters), then stop; otherwise go to Step 3 for another iteration.

    $$\begin{aligned} ||U_\textit{new} -U_\textit{old} || < \varepsilon \end{aligned}$$
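
The steps above can be condensed into the following NumPy sketch. It is an illustrative reading of the algorithm rather than a reference implementation: the random initialization, the small constants guarding divisions, and the normalization of the membership weights inside the feature-space distance (chosen so that the expansion matches the center definition \(v_c\) of Sect. 3.2) are our own assumptions.

```python
import numpy as np

def okfcm(X, c, m=2.0, eps=1e-4, max_iter=100, seed=0):
    """Sketch of the OKFCM loop (Steps 1-7). X: (N, n) data; c: number of clusters."""
    rng = np.random.default_rng(seed)
    N, n = X.shape
    tiny = 1e-12
    # Step 2: random membership matrix U (c, N) whose columns sum to 1
    U = rng.random((c, N))
    U /= U.sum(axis=0, keepdims=True)

    for _ in range(max_iter):
        Um = U ** m                                                     # (c, N)
        # Step 3: cluster means (Eq. 22) and per-dimension variances (Eq. 23)
        V = (Um @ X) / (Um.sum(axis=1, keepdims=True) + tiny)           # (c, n)
        diff2 = (X[None, :, :] - V[:, None, :]) ** 2                    # (c, N, n)
        delta2 = (Um[:, :, None] * diff2).sum(axis=1) \
                 / (Um.sum(axis=1)[:, None] + tiny)                     # (c, n)
        # Step 4: per-datum variance vector, sigma_i^2 = sum_j mu_ji^m * delta_j^2 (Eq. 24)
        sigma2 = Um.T @ delta2 + tiny                                   # (N, n)
        # Step 5: Gaussian kernel using the variance of the first argument, then symmetrized
        pair2 = (X[:, None, :] - X[None, :, :]) ** 2                    # (N, N, n)
        K = np.exp(-(pair2 / (2.0 * n * sigma2[:, None, :])).sum(axis=2))
        K = 0.5 * (K + K.T)
        # Step 6: feature-space distances (Eq. 21) with normalized weights, then memberships
        W = Um / (Um.sum(axis=1, keepdims=True) + tiny)                 # (c, N)
        D2 = (np.diag(K)[None, :]                                       # K(X_i, X_i)
              - 2.0 * (W @ K)                                           # -2 sum_j w_jc K(X_j, X_i)
              + np.einsum('cj,cl,jl->c', W, W, K)[:, None])             # + sum_jl w_jc w_lc K(X_j, X_l)
        D2 = np.maximum(D2, tiny)
        ratio = (D2[:, None, :] / D2[None, :, :]) ** (1.0 / (m - 1.0))
        U_new = 1.0 / ratio.sum(axis=1)                                 # (c, N)
        # Step 7: stop when the membership matrix no longer changes
        if np.linalg.norm(U_new - U) < eps:
            U = U_new
            break
        U = U_new
    return U, V, delta2
```

The kernel matrix is built with the per-datum variance of the first argument and then symmetrized by averaging, as prescribed in Step 5; note that the pairwise array is O(N²n) in memory, which is acceptable for datasets of the sizes used in our experiments.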

4 Validation measures

In order to show the effectiveness of our clustering algorithm, we have used different validity measures (Bensaid et al. 1996; Xie and Beni 1991; Eschrich et al. 2003; Balasko et al.). The Partition Coefficient (PC) measures the amount of “overlapping” between clusters. The Classification Entropy (CE) measures the fuzziness of the cluster partitions. The Partition Index (SC) denotes the ratio of the sum of compactness and separation of the clusters. And the Separation Index (S) uses a minimum-distance separation for partition validity.

The validity measures used have the following formulas:

  • Partition coefficient (PC): \(PC(c)=\frac{1}{N}\sum \limits _{i=1}^c {\sum \limits _{j=1}^N {\mu _\textit{ij} ^{2}} } \)

  • Classification entropy (CE): \(CE(c)=\frac{-1}{N}\sum \limits _{i=1}^c {\sum \limits _{j=1}^N {\mu _\textit{ij} } } \log (\mu _\textit{ij} )\)

  • Partition index (SC): \(SC(c)=\sum \limits _{i=1}^c {\frac{\sum \nolimits _{j=1}^N {\mu _\textit{ij} ^{2}(x_j -v_i )^{2}} }{N_i \sum \nolimits _{k=1}^c {(v_k -v_i )^{2}} }} \)

  • Separation index (S): \(S(c)=\frac{\sum \nolimits _{i=1}^c {\sum \nolimits _{j=1}^N {\mu _\textit{ij} ^{2}(x_j -v_i )^{2}} } }{N(\min _{i,k} (v_k -v_i )^{2})}\)

It should be noted that the above measures can be applied to fuzzy clustering algorithms due to the presence of the membership values \(\mu _\textit{ij} \).
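
As a hedged illustration, the four measures above can be computed directly from the data, the membership matrix, and the cluster centers, as in the following NumPy sketch; treating \(N_i \) as the fuzzy cardinality \(\sum \nolimits _{j=1}^N {\mu _\textit{ij} } \) and the small constant guarding divisions are our own assumptions.

```python
import numpy as np

def validity_measures(X, U, V):
    """PC, CE, SC, and S for data X (N, n), memberships U (c, N), centers V (c, n)."""
    c, N = U.shape
    eps = 1e-12
    pc = (U ** 2).sum() / N                                    # Partition Coefficient
    ce = -(U * np.log(U + eps)).sum() / N                      # Classification Entropy
    d2 = ((X[None, :, :] - V[:, None, :]) ** 2).sum(axis=2)    # (c, N): ||x_j - v_i||^2
    v2 = ((V[None, :, :] - V[:, None, :]) ** 2).sum(axis=2)    # (c, c): ||v_k - v_i||^2
    Ni = U.sum(axis=1)                                         # fuzzy cardinality of each cluster
    sc = (((U ** 2) * d2).sum(axis=1) / (Ni * v2.sum(axis=1) + eps)).sum()   # Partition Index
    min_sep = v2[~np.eye(c, dtype=bool)].min()                 # smallest inter-center distance
    s = ((U ** 2) * d2).sum() / (N * min_sep + eps)            # Separation Index
    return pc, ce, sc, s
```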

In all the experiments for the MKFCM, we have followed Baili and Frigui (2011) in choosing the 4 kernels. This is done by calculating:

$$\begin{aligned} D=\left( \sum \limits _{k=1}^N {(\max (X_k )-\min (X_k ))^{2}}\right) ^{0.5} \end{aligned}$$

Then \(\sigma _1 =0.1D\), \(\sigma _2 =0.2D\), \(\sigma _3 =0.3D\), and \(\sigma _4 =0.4D\).

It should be noted that the weight of each kernel will be the output of the MKFCM algorithm.
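
A small sketch of this recipe is given below; reading the max and min per feature (so that D is the length of the diagonal of the data's bounding box) is our interpretation of the formula above.

```python
import numpy as np

def mkfcm_sigmas(X):
    """Bandwidths of the four MKFCM base kernels, following the recipe above."""
    ranges = X.max(axis=0) - X.min(axis=0)      # per-feature range
    D = np.sqrt((ranges ** 2).sum())            # diagonal of the bounding box
    return [0.1 * D, 0.2 * D, 0.3 * D, 0.4 * D]
```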

For the KFCM, we have repeated the algorithm 4 times, each time initializing it with one of the 4 variances, and we report the results of the run that gave the best performance.

Figures 2 and 3 show the two artificial datasets (the 4cluster dataset and the Spiral dataset) that we have used in our simulations.

Fig. 2 The 4cluster dataset

Fig. 3 The 2 Spiral dataset

It should be noted that higher values of PC and lower values of SC, S, and CE imply better clustering performance.

Figures 4, 5, 6, and 7 show the PC, S, CE, and SC validation results vs. the number of clusters for the 4cluster dataset. We have applied the FCM, KFCM, MKFCM, and OKFCM with different numbers of clusters (2, 3, ..., 10) and plotted the obtained validation results.

Fig. 4 The PC validation results for different number of clusters for the 4cluster dataset

Fig. 5 The S validation results for different number of clusters for the 4cluster dataset

Fig. 6 The CE validation results for different number of clusters for the 4cluster dataset

Fig. 7 The SC validation results for different number of clusters for the 4cluster dataset

We have repeated the same procedure for the Spiral dataset. The validation results vs. the number of clusters are plotted in Figs. 8, 9, 10, and 11.

Fig. 8 The PC validation results for different number of clusters for the Spiral dataset

Fig. 9 The S validation results for different number of clusters for the Spiral dataset

Fig. 10 The CE validation results for different number of clusters for the Spiral dataset

Fig. 11 The SC validation results for different number of clusters for the Spiral dataset

5 UCI-datasets simulation results

We have also compared the performance of the 4 algorithms on real datasets. These datasets were chosen from the UCI repository [19]. They are summarized as follows:

Iris database The dataset contains 3 classes of 50 instances each. The features of each instance are the sepal length, the sepal width, the petal length, and the petal width.

Wine database The data contains 3 classes. The number of features is 13. The dataset contains 178 instances: 59 belonging to class 1, 71 belonging to class 2, and 48 belonging to class 3.

Sonar database These data contain 111 instances of patterns obtained by bouncing sonar signals off a metal cylinder at various angles and under various conditions, and 97 instances of patterns obtained from rocks under similar conditions. Each input pattern is a set of 60 features.

Diabetes database The database consists of 500 instances of patients who did not have the disease and 268 instances of patients who did. The number of features is 8.

Breast database The database contains 458 instances of patients with benign tumors and 241 instances of patients with malignant tumors. The number of features is 9.

Table 1 shows the validation results for the FCM, the KFCM, the MKFCM, and the OKFCM. The number of clusters C is the number of classes of each database.

Table 1 UCI database validity results

Another typical validation measure is the Normalized Mutual Information (NMI) [20]. Given the ground-truth class assignments and the clustering algorithm's assignments of the same samples, the NMI measures the agreement between the two assignments. Larger NMI values imply better agreement. Table 2 shows the NMI values for different databases and for different percentages of training data. We have compared the performances of the c-means, the FCM, the KFCM, the MKFCM, and the OKFCM on the chosen datasets.
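
One readily available implementation of the NMI is scikit-learn's normalized_mutual_info_score; the small helper below, whose name and argmax hardening of the fuzzy memberships are our own choices, illustrates how the NMI can be computed from a membership matrix and the ground-truth labels.

```python
import numpy as np
from sklearn.metrics import normalized_mutual_info_score

def nmi_from_memberships(labels_true, U):
    """Harden the fuzzy membership matrix U (c, N) by argmax, then compute the NMI."""
    labels_pred = np.argmax(U, axis=0)
    return normalized_mutual_info_score(labels_true, labels_pred)
```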

5.1 Discussion

Table 1 shows that, compared to the other algorithms, the OKFCM algorithm gave the highest values (better performance) for the PC and the lowest values (better performance) for the SC, S, and CE validity measures.

Table 2 shows the NMI validation results for different databases and different percentages of training data. We have used 90%, 80%, 70%, 60%, 50%, and 40% of the data for training. The OKFCM gave the highest values (better performance) for the NMI validation results.

Table 2 NMI validity results for different databases, and for different training percentages of data

6 Conclusions

In this paper, fuzzy clustering using Multiple Gaussian Kernels with Optimized Parameters (OKFCM) is introduced. The c-means is best suited for non-overlapping clusters. The FCM works well for spherical-like clusters. The KFCM uses a single kernel, which does not work well with multiple-density clusters. The MKFCM tries to find an optimal linear weighted combination of kernels with initial fixed (not necessarily the best) parameters. Our algorithm is an extension of the KFCM and the MKFCM. It uses multiple kernels, and there is no need to give “good” initial parameters for each kernel or an initial “good” number of kernels. It is robust and flexible (Vasant 2013). Every cluster is characterized by a different optimal Gaussian kernel. Its effectiveness and efficiency were shown by comparing our algorithm to the c-means, FCM, KFCM, and MKFCM algorithms; the better the clustering output, the more stable and reliable the algorithm is. We have used the following validity measures: the Partition Coefficient (PC), the Classification Entropy (CE), the Partition Index (SC), the Separation Index (S), and the Normalized Mutual Information (NMI). Two artificial datasets (the 4cluster dataset and the Spiral dataset) and 5 databases obtained from the UCI repository (Iris, Wine, Diabetes, Breast, and Sonar) were used in our simulations.

The OKFCM algorithm gave the highest values (better performance) for the PC and the NMI and the lowest values (better performance) for SC, S, and CE validity measures.

Future work could include the application of this new algorithm on different real world problems like image segmentation and image compression problems.

Finally, I thank the reviewers for comments that greatly improved the manuscript.