1 Introduction

Founded on Vapnik’s statistical learning theory, support vector machines (SVMs) (Vapnik 1995) have played an important role in many areas, including pattern recognition, regression, image processing, and bioinformatics. However, unclassifiable regions arise when they are extended to multi-class problems. Several methods have been proposed to address this problem. One is to construct several two-class classifiers and then assemble them into a multi-class classifier, as in one-against-one, one-against-all, and DAGSVM (Platt et al. 2002; Hsu and Lin 2002). A second approach is to construct a multi-class classifier directly, such as the k-SVM proposed by Weston and Watkins (1998); Zhang et al. (2007) also proposed a fuzzy compensation multi-classification SVM based on Weston and Watkins’ idea. Recently, Zhu et al. (2003) proposed a multi-class classification algorithm that uses minimum enclosing spheres to classify a new example and showed that the resulting classifier performs comparably to standard SVMs. Building on Zhu et al. (2003), Wang et al. (2007) and Lee and Lee (2007) each proposed a new classification rule on the basis of Bayesian optimal decision theory.

On the other hand, in many practical engineering applications the available training data are often contaminated by noise. Furthermore, some points in the training set are located far away from the main body of the data, or even on the wrong side in feature space. The standard SVM is very sensitive to such outliers and noise. Several techniques have been proposed in the data classification literature to tackle this problem for SVMs. One approach is to use an averaging algorithm to relax the effect of outliers (Song et al. 2002; Hu and Song 2004), but the additional parameter in this kind of method is difficult to tune. A second approach, proposed in Lin and Wang (2002, 2003) and Inoue and Abe (2001) and named the fuzzy SVM (FSVM), applies fuzzy memberships to the training data to relax the effect of outliers; however, the selection of the membership function remains an open problem for FSVM. A third interesting method, support vector data description (SVDD) (Tax and Duin 1999, 2004), has been shown to be effective at detecting outliers among normal data points, but SVDD is specific to the one-class classification problem.

The most common fuzzy clustering method, the fuzzy c-means (FCM) method (Dunn 1974; Bezdek 1980), is based on minimizing the distance-based objective function in (1). With fuzzy clustering it is not necessary to place an object definitively within one cluster, since the membership of this object can be distributed among different clusters. This distribution of membership among clusters can be interpreted as a measure of similarity between a particular object and the respective clusters. However, the traditional Euclidean distance cannot capture the correlation among samples.

Clustering-based SVM (CB-SVM) (Yu et al. 2003) is a recently developed method that uses clustering information to reduce the number of training samples for SVMs. However, for nonlinear CB-SVM the clustering is performed in the input space while the distance is measured in the kernel space, which introduces unexpected errors into the sample reduction.

In this paper we propose a fuzzy support vector classifier that integrates an FCM method based on Mahalanobis distance into fuzzy support vector data description. We first present an FCM method based on Mahalanobis distance, and then give a fuzzy support vector data description that uses the fuzzy cluster membership degrees computed by the FCM method to partition the training data. On this basis, we propose a multi-class classification algorithm built on the fuzzy support vector classifier and give its classification rule.

The rest of this paper is organized as follows. In Section 2 we give a brief description of the fuzzy c-means algorithm and the Mahalanobis distance, and present an FCM algorithm based on Mahalanobis distance. In Section 3 we describe fuzzy support vector data description and present the proposed fuzzy support vector classifier algorithm. In Section 4 we illustrate the proposed algorithm by comparing its performance with other methods. Finally, conclusions are given in Section 5.

2 FCM algorithm and Mahalanobis distances

The fuzzy c-means (FCM) algorithm has been shown to handle noise and outliers satisfactorily. In this section, we first introduce the fuzzy c-means algorithm and review the Mahalanobis distance measure. We then present an improved FCM algorithm based on Mahalanobis distance.

2.1 FCM algorithm

Clustering methods seek to organize a set of items into clusters such that items within a given cluster have a high degree of similarity, whereas items belonging to different clusters have a high degree of dissimilarity.

However, conventional hard clustering methods restrict each point of the data set to exactly one cluster. Fuzzy clustering generates a fuzzy partition based on the idea of partial membership, expressed by the degree of membership of each pattern in a given cluster. Dunn (1974) presented one of the first fuzzy clustering methods, based on an adequacy criterion defined by the Euclidean distance, and Bezdek (1980) further generalized this method. Gustafson and Kessel (1979) introduced the first adaptive fuzzy clustering algorithm, based on a quadratic distance defined by a fuzzy covariance matrix. These fuzzy c-means (FCM) algorithms (Dunn 1974; Bezdek 1980; Gustafson and Kessel 1979) are now very popular and have been applied successfully in many areas such as taxonomy, image processing (Kawano et al. 2009), information retrieval, and data mining.

The FCM algorithm starts with an initial guess for the cluster centers, which are intended to mark the mean location of each cluster. This initial guess is most likely incorrect. In addition, FCM assigns every data point a membership grade for each cluster. By iteratively updating the cluster centers and the membership grades, FCM moves the cluster centers to the “right” locations within the data set. The iteration is based on minimizing an objective function that represents the distance from any given data point to a cluster center, weighted by that data point’s membership grade. Namely,

$$ \label{eq1} \min J\left( {U,V} \right)=\sum\limits_{i=1}^c {\sum\limits_{k=1}^l {u_{ik}^m d\left( {x_k ,v_i } \right)} } $$
(1)
$$ \label{eq2} \mbox{s.t.}\;\sum\limits_{i=1}^c {u_{ik} } =1,\;0\le u_{ik} \le 1 $$
(2)

where c is the number of clusters (a user-specified value in this paper), l is the number of data points, \(u_{ik} \in \left[ {0,1} \right]\) denotes the degree to which the sample \(x_k\) belongs to the ith cluster, m is the fuzziness parameter controlling the speed and quality of clustering, \(d\left( {x_k ,v_i } \right)=\left\| {x_k -v_i } \right\|^2\) denotes the distance between the point \(x_k\) and the cluster center \(v_i\), and V is the set of cluster centers or prototypes (\(v_i \in R^p\)).

2.2 The Mahalanobis distance measure

The FCM algorithm requires a well-suited distance measure between samples and cluster centers. The Euclidean distance is the traditional choice. Given two data points \(x_p \in R^n\) and \(x_q \in R^n\), their Euclidean distance can be calculated as follows:

$$ \label{eq3} d_E \left( {x_p ,x_q } \right)=\left\| {x_p -x_q } \right\|^2 $$
(3)

Considering that samples are locally correlated, a local distance measure that incorporates this correlation may be a better choice. The Mahalanobis distance takes the covariance among the variables into account when calculating distances and therefore appears more suitable here (Kawano et al. 2009).

Let X be an l × n input matrix containing l random observations \(x_i \in R^n\), i = 1,...,l. The Mahalanobis distance \(d_M\) from a sample \(x_i\) to the population X is defined as follows:

$$ \label{eq4} d_M \left( {x_i ,X} \right)=\left( {x_i -\mu } \right)^T\sum ^{-1}\left( {x_i -\mu } \right) $$
(4)

where μ is the mean vector of all samples and ∑ is the covariance matrix, calculated as:

$$ \label{eq5} \sum =\frac{1}{l}\sum\limits_{j=1}^l {\left( {x_j -\mu } \right)\left( {x_j -\mu } \right)^T} $$
(5)

More generally, the Mahalanobis distance is defined as a dissimilarity measure between two random vectors of the same distribution with covariance matrix ∑. If the covariance matrix is the identity matrix, the Mahalanobis distance reduces to the Euclidean distance.
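
As an illustration, the following is a minimal NumPy sketch of the distance in (4)–(5); the function name, the small ridge added for numerical stability, and the variable names are ours, not part of the original formulation.

```python
import numpy as np

def mahalanobis_distance(x, X, reg=1e-6):
    """Squared Mahalanobis distance from a sample x to the population X, as in (4)-(5).

    x   : (n,) sample vector
    X   : (l, n) matrix of l observations
    reg : small ridge added to the covariance diagonal for numerical stability
          (an implementation choice, not part of the original formulation)
    """
    mu = X.mean(axis=0)                        # mean vector of all samples
    diff = X - mu
    cov = diff.T @ diff / X.shape[0]           # covariance matrix, Eq. (5)
    cov += reg * np.eye(cov.shape[0])          # guard against a singular covariance
    d = x - mu
    return float(d @ np.linalg.solve(cov, d))  # (x - mu)^T Sigma^{-1} (x - mu), Eq. (4)
```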

2.3 Modified FCM algorithm based on Mahalanobis distance

This section presents a modified FCM algorithm based on Mahalanobis distance. To make use of the FCM algorithm for the fuzzy support vector classifier, we extend the original FCM algorithm by replacing the Euclidean distance with the Mahalanobis distance.

The modified FCM algorithm based on Mahalanobis distance minimizes the following objective function

$$ \label{eq6} \min J\left( {U,V,\sum } \right)=\sum\limits_{i=1}^c {\sum\limits_{k=1}^l {u_{ik}^m \left( {x_k -\mu _i } \right)^T \sum\nolimits^{-1}\left( {x_k -\mu _i } \right)} } $$
(6)
$$ \label{eq7} \mbox{s.t.}\;\sum\limits_{i=1}^c {u_{ik} } =1,\;0\le u_{ik} \le 1 $$
(7)

Using the Lagrange multiplier method, the optimization problem is formulated as follows:

$$ \label{eq8} L=\sum\limits_{i=1}^c {\sum\limits_{k=1}^l {u_{ik}^m \left( {x_k -\mu _i } \right)^T \sum\nolimits^{-1}\left( {x_k -\mu _i } \right)} } +\sum\limits_{k=1}^l {\alpha _k \left( {1-\sum\limits_{i=1}^c {u_{ik} } } \right)} $$
(8)

The necessary optimality conditions are found by differentiating (8) with respect to \(u_{ik}\), \(\mu_i\), and ∑ and setting the derivatives to zero. We obtain

$$ \label{eq9} \frac{\partial L}{\partial u_{ik} }=0\;\;\;\;\;\Rightarrow \mbox{ }u_{ik} =\left[ {\sum\nolimits_{j=1}^c {\left( {\frac{\left( {x_k -\mu _i } \right)^T\sum ^{-1}\left( {x_k -\mu _i } \right)}{\left( {x_k -\mu _j } \right)^T\sum ^{-1}\left( {x_k -\mu _j } \right)}} \right)^{\frac{1}{m-1}}} } \right]^{-1} $$
(9)
$$ \label{eq10} \frac{\partial L}{\partial \mu _i }=0\;\;\;\;\;\Rightarrow \mbox{ }\mu _i =\frac{\sum\nolimits_{k=1}^l {u_{ik}^m x_k } }{\sum\nolimits_{k=1}^l {u_{ik}^m } } $$
(10)
$$ \label{eq11} \frac{\partial L}{\partial \sum }=0\;\;\;\;\;\Rightarrow \mbox{ }\sum =\frac{\sum\nolimits_{k=1}^l {u_{ik}^m \left( {x_k -\mu _i } \right)\left( {x_k -\mu _i } \right)^T} }{\sum\nolimits_{k=1}^l {u_{ik}^m } } $$
(11)

The modified FCM algorithm is implemented by iteratively updating (9) and (10) until the stopping criterion is satisfied.
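
To make the iteration concrete, the following is a minimal Python/NumPy sketch of the update loop (9)–(10), assuming a shared covariance matrix fixed to the sample covariance of the data (the per-cluster update (11) is omitted for brevity); all function and variable names are ours.

```python
import numpy as np

def modified_fcm(X, c, m=2.0, max_iter=100, eps=1e-5, seed=0):
    """Sketch of the Mahalanobis-distance FCM updates (9)-(10).

    X : (l, n) data matrix, c : number of clusters, m : fuzziness parameter.
    Returns the membership matrix U of shape (c, l) and the centers mu of shape (c, n).
    """
    rng = np.random.default_rng(seed)
    l, n = X.shape
    U = rng.random((c, l))
    U /= U.sum(axis=0)                            # enforce sum_i u_ik = 1, Eq. (7)

    cov = np.cov(X, rowvar=False) + 1e-6 * np.eye(n)
    cov_inv = np.linalg.inv(cov)                  # shared Sigma^{-1}; update (11) omitted

    prev_J = np.inf
    for _ in range(max_iter):
        W = U ** m
        mu = (W @ X) / W.sum(axis=1, keepdims=True)             # centers, Eq. (10)

        # squared Mahalanobis distance from every point to every center
        d = np.empty((c, l))
        for i in range(c):
            diff = X - mu[i]
            d[i] = np.einsum('kj,jn,kn->k', diff, cov_inv, diff)
        d = np.maximum(d, 1e-12)

        U = d ** (-1.0 / (m - 1.0))
        U /= U.sum(axis=0)                                       # memberships, Eq. (9)

        J = np.sum((U ** m) * d)                                 # objective, Eq. (6)
        if abs(prev_J - J) < eps:
            break
        prev_J = J
    return U, mu
```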

3 Fuzzy classifier based on modified FCM algorithm

Support vector data description (SVDD) (Tax and Duin 1999, 2004) has been shown to be effective at detecting outliers among normal data points. However, SVDD addresses the one-class classification problem, and its training process is very sensitive to outliers and noisy data. In order to decrease the effect of outliers and noise, researchers assign each data point in the training set a membership and sum the deviations weighted by these memberships. If a data point is detected as an outlier, it is assigned a low membership, so its contribution to the total error term is decreased. Unlike the equal treatment of all points in the standard SVM, this approach fuzzifies the penalty term to reduce the sensitivity to less important data points.

In this section, we introduce support vector data description and present the fuzzy support vector classifier based on the modified FCM algorithm.

3.1 Support vector data description

Over the last several years, the SVM has been generalized to solve one-class problems. Tax and Duin (1999, 2004) presented the support vector data description (SVDD) method for classifier design and outlier detection. The main idea in Tax and Duin (1999, 2004) is to map the data points by means of a nonlinear transformation to a high-dimensional feature space and to find the smallest sphere that contains most of the mapped data points. This sphere, when mapped back to the data space, can separate into several components, each enclosing a separate cluster of points. In practice, the soft-margin idea and kernel techniques from SVMs are introduced, allowing some data points to lie outside the sphere. The advantage of SVDD lies in the fact that the algorithm is intuitive and comprehensible.

More specifically, given a set of training data \(\left\{{x_i,\,\, i=1,2,\cdots,l} \right\}\subset \Re^n\), the minimum enclosing sphere S, which is characterized by its center c and radius R, can be found by solving the following constrained quadratic optimization problem:

$$ \label{eq12} \mathop {\min }\limits_{c,R} \mbox{ }R^2 $$
(12)
$$ \label{eq13} \mbox{s.t.}\;\left\| {x_i -c} \right\|^2\le R^2 $$
(13)

However, in real-world applications, the training data of a class are rarely distributed spherically, even if the outermost samples are excluded. To obtain a more flexible description of a class, one can first transform the training samples into a high-dimensional feature space using a nonlinear mapping \(\mathit{\Phi} \) and then compute the minimum enclosing sphere S in the feature space. Furthermore, to allow for the possibility of some samples falling outside the sphere, one can relax the constraints (13), which require the distance from each \(x_i\) to the center c to be at most R, with a set of soft constraints

$$ \label{eq14} \begin{array}{lll} \left\| {\mathit{\Phi} (x_i )-c} \right\|^2\le R^2+\xi _i \\ \xi _i \ge 0,\mbox{ for }i=1,2,\cdots ,l \\ \end{array} $$
(14)

where c is the center of the minimum enclosing sphere S and \(\xi _i \) are slack variables allowing soft boundaries. To penalize large distances to the center of the sphere, the quadratic optimization problem in (12) is accordingly reformulated as follows.

$$ \label{eq15} \mathop {\min }\limits_{c,R} \mbox{ }R^2+C\sum\limits_i {\xi _i } $$
(15)

under the constraints (14), where C > 0 is a penalty constant that controls the trade-off between the size of the sphere and the number of samples that possibly fall outside of the sphere.

To solve this problem, we introduce the Lagrangian

$$ \label{eq16} L=R^2-\sum\limits_i {\left( {R^2+\xi _i -\left\| {\mathit{\Phi} (x_i )-c} \right\|^2} \right)\alpha _i -\sum\limits_i {\beta _i \xi _i +C\sum\limits_i {\xi _i } } } $$
(16)

Using the Lagrange multiplier method, the above constrained quadratic optimization problem can be formulated as the Wolfe dual form.

To solve the optimization problem, we differentiate (16) with respect to R, c, and \(\xi _i \) and set the derivatives to zero, namely \(\frac{\partial L}{\partial R}=0\), \(\frac{\partial L}{\partial c}=0\), and \(\frac{\partial L}{\partial \xi _i }=0\). This yields \(\sum\limits_i {\alpha _i } =1\), \(c=\sum\limits_i {\alpha _i \mathit{\Phi} (x_i )}\), and \(C = \alpha _i + \beta _i \).
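
For readability, the substitution of these conditions into (16) can be written out as follows (a standard derivation step, expanded here):

$$ \begin{array}{rl} L &= R^2\left( {1-\sum\limits_i {\alpha _i } } \right)+\sum\limits_i {\left( {C-\alpha _i -\beta _i } \right)\xi _i } +\sum\limits_i {\alpha _i \left\| {\mathit{\Phi} (x_i )-c} \right\|^2} \\ &= \sum\limits_i {\alpha _i K\left( {x_i ,x_i } \right)} -\sum\limits_{i,j} {\alpha _i \alpha _j K\left( {x_i ,x_j } \right)} \\ \end{array} $$

where the last equality uses \(c=\sum\nolimits_j {\alpha _j \mathit{\Phi} (x_j )} \) and \(\sum\nolimits_i {\alpha _i } =1\).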

We thus obtain the dual formulation

$$ \label{eq17} \begin{array}{l} \max \mbox{ }W=\mbox{ }\sum\limits_i {\alpha _i K\left( {x_i ,x_i } \right)} -\sum\limits_{i,j} {\alpha _i \alpha _j K\left( {x_i ,x_j } \right)} \\ \mbox{s.t.} \mbox{ }0\le \alpha _i \le C,\mbox{ }\sum\limits_i {\alpha _i } =1,\mbox{ }j=1,2,\cdots ,l \\ \end{array} $$
(17)

where the kernel K(x<sub>i</sub>,x<sub>j</sub>) = \(\mathit{\Phi} \)(x<sub>i</sub>) · \(\mathit{\Phi} \)(x<sub>j</sub>) satisfies Mercer’s condition. Points \(x_i\) with \(\alpha _i > 0\), called support vectors (SVs), are the only points needed in the description; those with \(0 < \alpha _i < C\) lie on the boundary of the sphere.

To test a point x, the distance from its image to the center of the sphere has to be calculated. A test point x is accepted when this distance is smaller than or equal to the radius:

$$ \label{eq18} \left\| {\mathit{\Phi} \left( x \right)-c} \right\|^2=K\left( {x,x} \right)-2\sum\limits_i {\alpha _i K\left( {x_i ,x} \right)+\sum\limits_{i,j} {\alpha _i \alpha _j K} } \left( {x_i ,x_j } \right)\le R^2 $$
(18)

where \(R^2\) is the squared distance from the center c of the sphere to any support vector \(x_i\) on the boundary.
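
As a concrete illustration, the following is a minimal sketch of training an SVDD by solving the dual (17) with a generic QP solver and testing a point with (18). It assumes a Gaussian kernel and the cvxopt package; function and variable names are ours, and the sketch assumes at least one support vector with \(0 < \alpha _i < C\) exists.

```python
import numpy as np
from cvxopt import matrix, solvers

def gaussian_kernel(X, Y, sigma2):
    """Gaussian kernel matrix: K(x, y) = exp(-||x - y||^2 / sigma2)."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma2)

def train_svdd(X, C=1.0, sigma2=1.0):
    """Solve the SVDD dual (17) as the QP: min 0.5 a^T (2K) a - diag(K)^T a,
       subject to 0 <= a_i <= C and sum_i a_i = 1."""
    l = X.shape[0]
    K = gaussian_kernel(X, X, sigma2)
    P, q = matrix(2.0 * K), matrix(-np.diag(K))
    G = matrix(np.vstack([-np.eye(l), np.eye(l)]))
    h = matrix(np.hstack([np.zeros(l), C * np.ones(l)]))
    A, b = matrix(np.ones((1, l))), matrix(1.0)
    solvers.options['show_progress'] = False
    alpha = np.array(solvers.qp(P, q, G, h, A, b)['x']).ravel()

    # R^2 from any unbounded support vector (0 < alpha_i < C), cf. (18)
    i0 = np.flatnonzero((alpha > 1e-6) & (alpha < C - 1e-6))[0]
    R2 = K[i0, i0] - 2 * alpha @ K[:, i0] + alpha @ K @ alpha
    return alpha, R2

def svdd_accepts(x, X, alpha, R2, sigma2=1.0):
    """Test rule (18): accept x if its squared distance to the center is at most R^2."""
    K = gaussian_kernel(X, X, sigma2)
    kx = gaussian_kernel(x[None, :], X, sigma2).ravel()
    dist2 = 1.0 - 2 * alpha @ kx + alpha @ K @ alpha   # K(x, x) = 1 for the Gaussian kernel
    return dist2 <= R2
```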

3.2 Fuzzy classifier based on support vector data description

Let \(\left\{ {\left( {x_i ,m_i } \right),\mbox{ }i=1,2,\cdots ,l} \right\}\) with \(x_i \in \Re ^n\) be a given training data set, where \(m_i\) is the weight of the training sample \(x_i\). Once the weights \(m_i\) have been computed using the improved FCM method of Section 2, the fuzzy support vector classifier problem can be described as follows:

$$ \label{eq19} \mathop {\min }\limits_{c,R} \mbox{ }R^2+C\sum\limits_i {m_i \xi _i } $$
(19)
$$ \label{eq20} \mbox{s.t.}\;{\begin{array}{c} {\left\| {\mathit{\Phi} \left( {x_i } \right)-c} \right\|^2\le R^2+\xi _i } \hfill \\ {\xi _i \ge 0,\mbox{ for }i=1,2,\cdots ,l} \hfill \\ \end{array} } $$
(20)

where \(m_i\) are the weights generated by the improved FCM method of Section 2.

We can find the solution to this optimization problem in dual variables by finding the saddle point of the Lagrangian. The corresponding Lagrangian is

$$ \label{eq21} L=R^2-\sum\limits_i {\left( {R^2+\xi _i -\left\| {\mathit{\Phi} \left( {x_i } \right)-c} \right\|^2} \right)\alpha _i -\sum\limits_i {\beta _i \xi _i +C\sum\limits_i {m_i \xi _i } } } $$
(21)

To solve the optimization problem, we differentiate (21) with respect to R, c, and \(\xi _i \) and set the derivatives to zero. We obtain \({\partial L} /{\partial R=2R ( {1-\sum\limits_i {\alpha _i } } )=0}\), \({\partial L} / {\partial c=2\sum\limits_i {\alpha _i \left( {\mathit{\Phi} \left( {x_i } \right)-c} \right)} =0}\), and \({\partial L} / {\partial \xi _i =Cm_i -\alpha _i -\beta _i =0}\), respectively, which lead to \(\sum\limits_i {\alpha _i } =1\), \(c=\sum\limits_i {\alpha _i \mathit{\Phi} \left( {x_i } \right)} \), and \(Cm_i = \alpha _i + \beta _i \).

Substituting these back into (21), we obtain the dual formulation

$$ \label{eq22} {\begin{array}{c} {\max \mbox{ }W=\mbox{ }\sum\limits_i {\alpha _i K\left( {x_i ,x_i } \right)} -\sum\limits_{i,j} {\alpha _i \alpha _j K\left( {x_i ,x_j } \right)} } \hfill \\ {\mbox{s.t. }0\le \alpha _i \le Cm_i ,\mbox{ }\sum\limits_i {\alpha _i } =1,\mbox{ }j=1,2,\cdots ,l} \hfill \\ \end{array} } $$
(22)

The inner products in (22) are replaced by a kernel function K(x<sub>i</sub>, x<sub>j</sub>) = \(\mathit{\Phi} \)(x<sub>i</sub>) · \(\mathit{\Phi} \)(x<sub>j</sub>). Several kernel functions have been proposed, including the linear, polynomial, radial basis function, and sigmoid kernels. In our experiments we use the Gaussian kernel function.

Solving the dual QP problem, we obtain the Lagrange multipliers \(\alpha _i \), which give the center c of the minimum enclosing sphere S as a linear combination of the mapped training points \(\mathit{\Phi} (x_i)\),

$$ \label{eq23} c=\sum\limits_i {\alpha _i \mathit{\Phi} \left( {x_i } \right)} $$
(23)

According to the Karush–Kuhn–Tucker (KKT) optimality conditions, we have

$$ \label{eq24} {\begin{array}{c} {\alpha _i =0\mbox{ }\Rightarrow \mbox{ }\left\| {\mathit{\Phi} \left( {x_i } \right)-c} \right\|^2<R^2\mbox{ and }\xi _i =0} \hfill \\ {0<\alpha _i <Cm_i \mbox{ }\Rightarrow \mbox{ }\left\| {\mathit{\Phi} \left( {x_i } \right)-c} \right\|^2=R^2\mbox{ and }\xi _i =0} \hfill \\ {\alpha _i =Cm_i \mbox{ }\Rightarrow \mbox{ }\left\| {\mathit{\Phi} \left( {x_i } \right)-c} \right\|^2\ge R^2\mbox{ and }\xi _i \ge 0} \hfill \\ \end{array} } $$
(24)

Therefore, only the \(\alpha _i \) corresponding to training examples \(x_i\) that lie on or outside the sphere are nonzero. The remaining \(\alpha _i \) are zero, and the corresponding training samples are irrelevant to the final solution.

It is clear that the only difference between standard SVDD and fuzzy SVDD lies in the upper bounds of the Lagrange multipliers \(\alpha _i \) in the dual problem. In standard SVDD, the \(\alpha _i \) are bounded above by a constant C, whereas in the fuzzy support vector classifier they are bounded by the sample-dependent values \(Cm_i\).
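
Correspondingly, in a QP sketch like the one in Section 3.1 only the upper bound of the box constraint changes, from the constant C to the per-sample value \(Cm_i\). Assuming the gaussian_kernel helper and the cvxopt imports from that earlier sketch, a hypothetical modification looks like this:

```python
def train_fuzzy_svdd(X, weights, C=1.0, sigma2=1.0):
    """Fuzzy SVDD dual (22): identical to (17) except that 0 <= alpha_i <= C * m_i,
       where m_i (weights) are the FCM memberships of the samples of this class.
       Reuses gaussian_kernel, matrix and solvers from the sketch in Section 3.1."""
    l = X.shape[0]
    w = np.asarray(weights, dtype=float)
    K = gaussian_kernel(X, X, sigma2)
    P, q = matrix(2.0 * K), matrix(-np.diag(K))
    G = matrix(np.vstack([-np.eye(l), np.eye(l)]))
    h = matrix(np.hstack([np.zeros(l), C * w]))        # per-sample upper bound C * m_i
    A, b = matrix(np.ones((1, l))), matrix(1.0)
    solvers.options['show_progress'] = False
    alpha = np.array(solvers.qp(P, q, G, h, A, b)['x']).ravel()
    i0 = np.flatnonzero((alpha > 1e-6) & (alpha < C * w - 1e-6))[0]
    R2 = K[i0, i0] - 2 * alpha @ K[:, i0] + alpha @ K @ alpha
    return alpha, R2
```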

3.3 The algorithm description

In this section, we present a method for the fuzzy classifier problem. Given is a set of training data \(\left( {x_1 ,y_1 } \right),\left( {x_2 ,y_2 } \right),\cdots ,\left( {x_l ,y_l } \right)\), where l denotes the number of training samples, \(x_i \in \Re ^n\) denotes an input pattern, and \(y_i \in \left\{ {1,2,\cdots ,c} \right\}\) denotes its output class, with c the number of classes. The central idea of the proposed method is to use the membership matrix generated by the modified FCM algorithm to build a new training set, and then to classify a data point with the proposed fuzzy support vector classifier via the given decision rule.

The detailed procedure of the proposed method is as follows:

  1. Step 1

    Use the modified FCM algorithm to compute the membership \(u_{ik}\) of each training sample \(x_k\) to each class i. Given a threshold \(\sigma _t \) for the membership matrix U = {\(u_{ik}\)}, set \(u_{ik} = 0\) if \(u_{ik} \le \sigma _t \).

    1. Step 1.1

      Select the number of clusters c, the maximum number of iterations \(N_{\max }\), the fuzziness parameter m (here m = 2), and the convergence tolerance ε > 0.

    2. Step 1.2

      Initialize the membership matrix \(U^{(0)}=\left\{ {u_{ik}^{\left( 0 \right)} } \right\}\) so that it satisfies the constraints \(\sum\limits_{i=1}^c {u_{ik} } = 1, 0\le u_{ik} \le 1\).

    3. Step 1.3

      For t = 1,...,\(N_{\max }\),

      calculate the membership matrix \(U^{\left( t \right)}=\left\{ {u_{ik}^{\left( t \right)} } \right\}\) according to (9);

      calculate the cluster centers \(\mu _i ^{\left( t \right)}\left( {i=1,...,c} \right)\) according to (10);

      calculate the objective function \(J^{\left( t \right)}\left( {U,V,\sum } \right)\) according to (6);

      when \(\left| {J^{\left( t \right)}\left( {U,V,\sum } \right)-J^{\left( {t-1} \right)}\left( {U,V,\sum } \right)} \right|<\varepsilon \) or \(t = N_{\max }\), stop the iteration and return the membership matrix \(U^{\left( t \right)}\) and the cluster centers \(\mu _i^{\left( t \right)} \).

    4. Step 1.4

      Modify the membership matrix U = {\(u_{ik}\)}: set \(u_{ik} = 0\) if \(u_{ik} \le \sigma _t \).

    After partitioning, the weights of outliers and noisy points will be very small or even close to zero, so their effect on the decision boundary is significantly reduced. To reduce the training time of the fuzzy support vector classifier, we set a threshold \(\sigma _t \) on the weights \(u_{ik}\). If \(u_{ik} > \sigma _t \), the training sample \(x_k\) is considered to affect the ith class to some degree; otherwise, the training algorithm ignores the effect of \(x_k\) on the ith class.

  2. Step 2

    Generate the new training set by partitioning the given training data into c subsets \(D_p\) (p = 1,...,c) according to their memberships \(u_{ik}\).

  3. Step 3

    For each class data set \(D_p\), train on its data points using the proposed fuzzy support vector classifier method. Let the solution of the dual problem (22) be \(\alpha _i^\ast \), \(i=i_1 ,\cdots ,i_{l_p } \), and let \(J_p \subset \left\{ {1,\cdots ,l_p } \right\}\) be the index set of the nonzero \(\alpha _i^\ast \). The trained kernel support function for each class data set \(D_p\) is then given by

    $$ \label{eq25} f_p \left( x \right)=K\left( {x,x} \right)-2\sum\limits_{i\in J_p } {\alpha _i^\ast K\left( {x_i ,x} \right)+\sum\limits_{i,j\in J_p } {\alpha _i^\ast \alpha _j^\ast K\left( {x_i ,x_j } \right)} } $$
    (25)
  4. Step 4

    Construct the following classification rule for the new sample x:

    $$ \label{eq26} F\left( x \right)=\arg \mbox{ }\mathop {\mbox{max}}\limits_p \mbox{ }\frac{l_p }{l}\left( {\left( {R_p^\ast } \right)^2-f_p \left( x \right)} \right) $$
    (26)

    where \(p \in \left\{ {1,2,\cdots ,c} \right\}\), \(\left( {R_p^\ast } \right)^2=f_p \left( {x_i } \right)\) for any support vector \(x_i\) of the pth class, l denotes the number of training samples, and \(l_p\) denotes the number of samples of the pth class in the training set.

In Step 4, we propose a classification rule based on SVDD that follows the Bayes-optimal form indicated by Lee and Lee (2007), and the experiments in Section 4 show that it leads to better generalization accuracy than the classification rules in Zhu et al. (2003) and Wang et al. (2007).
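
A minimal sketch of the decision rule (26), assuming one fuzzy SVDD per class has been trained with a sketch such as train_fuzzy_svdd above (names and data layout are ours):

```python
def classify(x, class_models, l_total, sigma2=1.0):
    """Decision rule (26): choose the class p maximizing (l_p / l) * ((R_p*)^2 - f_p(x)).

    class_models : dict mapping class label p -> (X_p, alpha_p, R2_p), where X_p holds the
                   l_p training points assigned to class p by the modified FCM partition.
    l_total      : total number of training samples l.
    """
    best_p, best_score = None, -np.inf
    for p, (Xp, alpha_p, R2_p) in class_models.items():
        Kp = gaussian_kernel(Xp, Xp, sigma2)
        kx = gaussian_kernel(x[None, :], Xp, sigma2).ravel()
        f_p = 1.0 - 2 * alpha_p @ kx + alpha_p @ Kp @ alpha_p   # Eq. (25), K(x, x) = 1
        score = (Xp.shape[0] / l_total) * (R2_p - f_p)          # Eq. (26)
        if score > best_score:
            best_p, best_score = p, score
    return best_p
```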

4 Experimental results

The effectiveness of the proposed fuzzy support vector classifier algorithm is tested on benchmark data sets from the UCI machine learning repository (http://www.ics.uci.edu/~mlearn/MLRepository.html), including glass, ionosphere, sonar, pima-diabetes, iris, and vehicle. Their characteristics are listed in Table 1, where #pts denotes the number of data points, #att denotes the number of attributes, and #class denotes the number of classes.

Table 1 Data sets from UCI

On each data set, we compared our method to the standard SVM classifier and to the sphere-based classifier using the decision rule of Zhu et al. (2003) and Wang et al. (2007). We used the Gaussian kernel for all algorithms and ten-fold cross-validation to estimate the generalization accuracy of the classifiers. In the cross-validation process, we ensured that the training and testing sets were the same for all algorithms.

Accordingly, the threshold \(\sigma _t \) in Step 1 of our proposed algorithm is set to 0.3. The value of \(\sigma _t \) influences the runtime of the algorithm: for small values of \(\sigma _t \), more training samples may be included in each class subset.

In the experiments we standardized the data to zero mean. While using the Gaussian kernel, we determined the tuning parameters σ2 and C with standard ten-fold cross validation. The initial search for optimal parameters σ2 and C was done on a 10 × 10 uniform coarse grid in the (σ2, C) space, namely \(\sigma^2 \in \left\{ {2^{-3}, 2^{-1}, \ldots, 2^{13}, 2^{15}} \right\}\) and \(C \in \left\{ {2^{-15}, 2^{-13}, \ldots, 2^{1}, 2^{3}} \right\}\).
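
A simple sketch of this coarse grid search is given below; cross_val_accuracy is a hypothetical placeholder for ten-fold cross-validation with the proposed classifier, and X_train, y_train stand for the standardized training data.

```python
import numpy as np

# Coarse grid from the text: sigma^2 in {2^-3, 2^-1, ..., 2^15}, C in {2^-15, 2^-13, ..., 2^3}
sigma2_grid = [2.0 ** e for e in range(-3, 16, 2)]
C_grid = [2.0 ** e for e in range(-15, 4, 2)]

best = (None, None, -np.inf)
for sigma2 in sigma2_grid:
    for C in C_grid:
        # cross_val_accuracy is a placeholder, not an existing library function
        acc = cross_val_accuracy(X_train, y_train, sigma2=sigma2, C=C, folds=10)
        if acc > best[2]:
            best = (sigma2, C, acc)
sigma2_opt, C_opt, _ = best
```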

The experimental results are summarized in Table 2. As Table 2 shows, except on the sonar and iris data sets our proposed algorithm improves classification accuracy significantly, outperforming the standard SVM classifier on four of the six data sets tested. Moreover, the proposed method also improves on the sphere-based classifiers of Zhu et al. (2003) and Wang et al. (2007).

Table 2 Experimental results

Additionally, to test the sensitivity of the algorithms to outliers, we randomly added 10% “outlier” data points to iris and vehicle, labeled iris* and vehicle* in Table 2. The corresponding results are shown in the last two columns of Table 2. They indicate that the proposed method successfully reduces the effect of outliers and accordingly yields good generalization performance, especially when outliers are present in the data.

On the other hand, the proposed method has a lower time cost than the SVM method, although it is slightly higher than that of the sphere-based classifier.

The parameter \(\sigma _t \) has some effect on the prediction accuracy. In our experiments we find that when \(\sigma _t \) is larger (for example \(\sigma _t > 0.4\)), the accuracy of the proposed method becomes inferior to that of the other methods, whereas its time cost is smaller because fewer training data points are retained. We therefore set \(\sigma _t = 0.3\) in our experiments.

5 Conclusions

In this paper, a fuzzy support vector classifier has been developed using fuzzy c-means clustering based on Mahalanobis distance. The proposed algorithm can be used to deal with the outlier sensitivity problem in traditional multi-class classification. The modified fuzzy c-means clustering algorithm based on Mahalanobis distance takes the correlation among samples into account and generates different weight values for the main body of training data points and for outliers according to their relative importance in the training data. Experimental results show that the proposed method can reduce the effect of outliers and achieves a higher classification accuracy than other existing methods.